Thursday, March 08, 2007

estimating the cub offense

there has been considerable quality discussion regarding the cub offense around baseball and on this blog. in light of the acquisition of alfonso soriano, many cub fans are primed for an offensive revolution -- this club, after all, is likely to approach 200 home runs.

this page has taken a more moderate tack from the beginning. understanding just how bad the 2006 ballclub was, and how easily incited to emotional release upon the signing of soriano a cub fan can be -- with no exception for yours truly -- in combination with the general euphoria of springtime... let us simply say, dear reader, that moderation is a vigilant watchword here, and expectations of blistering offense are therefore tempered.

quantitative analysis backing this view is now forthcoming, as a specific result of the rewarding dialogue that takes place in the comments of this page.

frequent commenter maddog of another cubs blog recently posted a series of projections that were the average of a handful of statistical estimates for cub starters in 2007. these estimates are what they are -- the outputs of a number of quantitative methodologies equally weighted without further examination -- and are useful and probably fairly accurate representations of the likely performance of the starting eight.

these can be transformed through the lineup analyzer at baseball musings to observe that, given these players playing at these levels, the output of the lineup would optimize at a level of about 5.22 runs/game. as can be seen, this output would have corresponded to that of the third-most-productive offense in the national league last year -- giving hope to fans of an offense that can drag even a mediocre pitching staff to 90 wins.

unfortunately, it isn't quite that simple. originally posted here, the analyzer uses a regression relating on-base percentage and slugging percentage to runs scored (much as this page has on occasion) and runs a monte carlo engine to compile a mean scoring expectation for every possible lineup, then ranks the lineups from best to worst. it's a brilliant little application of technology to the sport.

but it doesn't present an accurate picture of how a team will score over the course of a season. this is, of course, because it is analyzing only the scoring potential of the given lineup and not the club. teams in reality sustain injuries, play backups and callups, get years of overperformance and underperformance out of individuals -- the myriad of perturbations to the ideal that constitute reality.

so what magnitude of difference can be observed between the analysis and the reality?

if one feeds the 2006 obp and slg figures of the starting eight -- that is, the eight players who took more innings at each position than any other -- into the analyzer, one gets an optimal output of 4.87 runs/game. however, as we can see, the club actually scored 4.42 -- a shortfall of 0.45 runs/game. this is a very significant difference -- a 9% difference that, translated over a 162-game season, means some 73 runs.
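the arithmetic behind that shortfall can be checked in a few lines. what follows is only a sketch in python: the variable names are invented here for illustration, and the percentage is reckoned against the theoretical figure, which is an assumption about how the 9% was arrived at.

```python
# 2006 cubs: analyzer optimum vs. actual scoring, both in runs/game (from the text)
optimal_rpg = 4.87
actual_rpg = 4.42
games = 162

# per-game shortfall of reality against the theoretical lineup
shortfall = optimal_rpg - actual_rpg

# shortfall as a share of the theoretical figure (assumed base for the 9%)
pct_of_optimal = shortfall / optimal_rpg * 100

# the same shortfall carried over a full season
season_runs = shortfall * games

print(f"shortfall: {shortfall:.2f} runs/game")
print(f"share of optimal: {pct_of_optimal:.0f}%")
print(f"over {games} games: {season_runs:.0f} runs")
```

measured against the actual 4.42 instead, the same gap comes to roughly 10%, so the choice of base shifts the percentage by about a point without changing the run total.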

one can do the same for the 2005 cubs, finding an optimal theoretical output of 4.88 runs/game versus an actual of 4.34. for 2004 -- the offense that this writer finds most similar in recent seasons to this one -- one finds a 5.20 runs/game optimal output; that club actually scored 4.87 a game. for 2003, a theoretical 4.78 against an actual 4.47.

in other words -- if one believes the obp/slg projections derived above are substantially accurate -- the analyzer will (at least in the case of the cubs of recent vintage) return a runs-per-game figure that overestimates the actual output of the club by an average of 0.41 runs/game (with a range of 0.31 to 0.54).
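for those who wish to check the averaging, here is a minimal sketch in python using the four seasons quoted above. the dictionary layout and names are assumptions made for illustration; only the runs/game figures come from this page.

```python
# (analyzer optimum, actual) in runs/game for each cub season cited in the text
seasons = {
    2006: (4.87, 4.42),
    2005: (4.88, 4.34),
    2004: (5.20, 4.87),
    2003: (4.78, 4.47),
}

# overestimate per season, rounded to the two decimals the source figures carry
gaps = {year: round(optimal - actual, 2) for year, (optimal, actual) in seasons.items()}

mean_gap = sum(gaps.values()) / len(gaps)
smallest_gap = min(gaps.values())
largest_gap = max(gaps.values())

for year, gap in sorted(gaps.items()):
    print(f"{year}: analyzer over by {gap:.2f} runs/game")
print(f"mean {mean_gap:.2f}, range {smallest_gap:.2f}-{largest_gap:.2f}")
```

four seasons make for a very small sample, so the mean is best read as a rough correction rather than a fitted constant.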

therefore -- when we sink the aforementioned average projections into the analyzer -- believing those estimates to be close to accurate reflections of what those players will really do, at least in aggregate if not in every simultaneous particular -- and get a mean optimal estimate of 5.22 -- what we're really looking at is a probable mean output of 4.81 runs/game, with a range from 4.91 to 4.68.
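the adjustment amounts to subtracting the historical overestimate from the analyzer's figure. a short python sketch, with names invented here for illustration and the gap values taken from the paragraphs above:

```python
# analyzer output for the projected 2007 starting eight, in runs/game
projected_optimal = 5.22

# historical analyzer overestimates for the cubs, 2003-2006 (runs/game)
mean_gap = 0.41
smallest_gap = 0.31
largest_gap = 0.54

# the smallest historical gap gives the high end of the range, the largest the low end
likely = round(projected_optimal - mean_gap, 2)
high = round(projected_optimal - smallest_gap, 2)
low = round(projected_optimal - largest_gap, 2)

print(f"likely output: {likely:.2f} runs/game (range {high:.2f} to {low:.2f})")
```

this treats the gap as a flat subtraction that is independent of how good the lineup is, which is a simplifying assumption; a proportional correction would give slightly different numbers.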

and one should further note that this overestimates the likely output. how? there exists also the probability that one or more of the starting eight that we imagine today will not take the majority of the innings at his assigned position -- that, by the office of injury or some other unforeseen intermediation, some inferior player will overtake him. to be sure, there also exists a possibility that a better player would supplant one of our supposed starters -- but the balance of the overall risk is almost certainly skewed to the negative. one may argue that this probability is negligible, but it is certainly not zero and could be prominent. the most notable real occurrence of this came in 2004 with the injury to nomar garciaparra, who was subsequently replaced by ramon martinez. had nomar factored into the 2004 theoretical, it would've predicted an even higher figure -- and the falloff to reality would then have been that much larger.

so how does that situate the 2007 cubs for putting up numbers on national league pitching? a total of 4.81 runs/game would have placed this club, in last year's league, 7th of 16 and 2nd of 6 in the division. this conforms well with the parallel drawn to 2004, a team which placed 7th in nl offense.

but it might be a bit better than all that would seem -- for taking the pecota projections of the other clubs in the central through the analyzer shows the cubs with a marginal offensive lead over their rivals for the central crown. it may not be a world-beating lineup, dear reader, but with some good pitching it might possibly be good enough.
