The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at www.retrosheet.org.


Thursday, July 19, 2007

How to Read a Bivariate Baseball Score Plot.

Basic Ideas
The bivariate baseball score plots present summary information for Major League Baseball teams’ game scores.

Each game is represented as one mark in the joint score distribution grid and one mark in each marginal. Splits based on a variety of game parameters (starting pitcher, day/night, home/away, etc.) are available; different values of the split parameter are differentiated by color and shape. Games are collected into little groups (here, groups of size 3) so as to maintain a reasonable aspect ratio for the overall plot.

Marginal Distributions
The marginal score distributions are shown on the top for the selected team and along the left side for that team’s opponents; also shown via small tick marks on the runs scale are the overall mean runs per game (rpg), along with mean rpg for games meeting, and those not meeting, the split criterion.

Arbitrarily, games meeting the split criterion are placed at the bottom in each stack. Reference lines are drawn to improve one’s ability to quickly count games in a column or row. If there are games with scores in excess of the arbitrary maximum (here, 15), a plus sign is added to denote the presence of such games.

The marginals are oriented along the top and left sides so as to facilitate comparison between the marginals. A simple twist of the head allows visual comparison of the two marginals without needing to reverse the positive direction mentally, as would be necessary if the marginals were shown protruding away from the center and located in the traditional bottom and left side positions.
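For readers who want to see the bookkeeping behind a marginal, here is a minimal sketch in Python. It is not the R code used for the plots; the function name, the (runs, meets_split) input layout, and the choice to include over-the-cap games in the means are all assumptions made for illustration, and the scores are made up rather than taken from Retrosheet.

```python
from collections import Counter

MAX_RUNS = 15  # the arbitrary display cap mentioned in the text


def marginal_summary(games, max_runs=MAX_RUNS):
    """Tally one side's runs for a marginal (hypothetical helper).

    games: list of (runs, meets_split) pairs, one pair per game.
    """
    counts = Counter()   # games per (runs, meets_split) stack
    overflow = False     # True if any score exceeds the cap (the plus sign)
    for runs, meets_split in games:
        if runs > max_runs:
            overflow = True
            continue
        counts[(runs, meets_split)] += 1

    def mean(values):
        values = list(values)
        return sum(values) / len(values) if values else None

    return {
        "counts": counts,
        "overflow": overflow,
        # Assumption: the means use every game, including those over the cap.
        "mean_all": mean(r for r, _ in games),
        "mean_split": mean(r for r, s in games if s),
        "mean_rest": mean(r for r, s in games if not s),
    }


# Made-up scores for illustration only.
example = [(0, True), (3, False), (5, False), (16, False), (2, True)]
summary = marginal_summary(example)
```

The three means correspond to the three tick marks on the runs scale: overall, games meeting the split criterion, and games not meeting it.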

Joint Distribution
The joint distribution is shown as collected marks in small squares. Victories for the selected team will be below the diagonal, losses above. One-run games will be just above and below the diagonal. Again, games are grouped for ease in counting; the squares are shaded in relation to the number of games they contain. Thus, more ink means more data. The result is a layered presentation of the data; the overall distribution is visible from afar, while atomic-level datum details are available upon closer inspection. “Reward the viewer for mental and visual investment in the graphic.”
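The counting behind the joint grid can be sketched in a few lines of Python. Again, this is an illustration and not the authors' R program; the function name, the (team_runs, opp_runs) input layout, and the scores are assumptions.

```python
from collections import Counter


def joint_summary(games):
    """Summarise a joint score grid (hypothetical helper).

    games: list of (team_runs, opp_runs) pairs, one pair per game.
    """
    cells = Counter(games)  # number of games in each (team_runs, opp_runs) square
    return {
        "cells": cells,
        "wins": sum(1 for t, o in games if t > o),
        "losses": sum(1 for t, o in games if t < o),
        "ties": sum(1 for t, o in games if t == o),  # squares on the diagonal
        "one_run_wins": sum(1 for t, o in games if t - o == 1),
        "one_run_losses": sum(1 for t, o in games if o - t == 1),
        # The fullest square would get the darkest shading.
        "max_cell": max(cells.values()) if cells else 0,
    }


# Made-up scores for illustration only.
result = joint_summary([(0, 1), (0, 1), (3, 2), (7, 1), (2, 2)])
```

One-run games are exactly those where the two scores differ by one, which is why they sit in the squares adjacent to the diagonal.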

Example: Astros, Roger Clemens in 2005
Open a png file.
Download a pdf version (31KB).

This example shows 163 games for the 2005 Houston Astros, with games started by Roger Clemens highlighted. The Astros finished the year 16 games over .500 but were 2 games under .500 in games that Clemens started.

The Astros’ marginal score distribution (at the top of the plot) shows typical numbers from a good team: an overall average of over 4 rpg. Clemens, however, appears to have received less run support, as the Astros’ average offensive output in games he started is less than 3.5 rpg. Closer inspection reveals that Clemens was the unfortunate recipient of 9 of Houston’s 17 shutouts in 2005. While the Astros did score at least 7 runs in 5 of Clemens’ starts, the overall offensive support for Clemens actually was, well, offensive.

The Astros’ opponents’ marginal distribution (on the left) shows how teams fare against teams that beat them: their average rpg is just over 3.5 rpg compared with nearly 4.5 rpg for the Astros. Where the Astros were held to 1 run 27 times, their opponents were held to 1 or fewer on 42 occasions. Note that Clemens started 2 games that were shutouts and started 11 games where the opponents were held to fewer than 2 runs. He also started a game where the opponents scored 9 runs.

The joint distribution reveals details of Clemens’ abysmal run support. The bottom-left corner of the distribution shows five Clemens starts in which the Astros lost 1-0, a pitcher’s nightmare. So, of the 11 games that Clemens started in which the opponents were held to one run or fewer, 5 failed to produce a single Houston run. In fact, Clemens was the only Astros pitcher to start a game in which the team lost 1-0.

The joint distribution reveals a rather ordinary overall record of 25-21 in one-run games, a measure often heralded as a mark of good teams.
The keen eye will note a single game on the diagonal, a 2-2 tie. Prior to 2007, games that were tied when suspended were kept on the books for purposes of individual statistics but were replayed at the next available opportunity.

Disclaimer and Software Information
Data for the plots were obtained from retrosheet.org. Programming was done using the R environment for statistical computing and graphics.

An interactive website for examining the score distributions of any team in the Retrosheet database from 1876 to 2006 is available at http://data.vanderbilt.edu/rapache/bbplot/ .

Tuesday, July 17, 2007

Who we are...

rafe donahue
Day job: Biostatistician
Contribution to this project: Statistical philosophy, adult supervision, and guy willing to wear the tie at the presentation
Favorite MLB team: Brewers
Favorite NFL team: Packers

beetama74
Day job: Biostatistician
Contribution to this project: Statistical reasoning, R programming, and guy who created the original version of the plot
Favorite MLB team: Pirates
Favorite NFL team: Steelers

Jeffrey
Day job: Computer programmer
Contribution to this project: R/Apache implementation and a non-baseball guy's perspective
Favorite MLB team: unknown
Favorite NFL team: Titans (presumably)

Cole
Day job: Computer programmer
Contribution to this project: R/Apache implementation and a baseball guy's perspective
Favorite MLB team: Pirates
Favorite NFL team: Titans

Saturday, July 14, 2007

Bivariate Baseball Score Plot

It all started at the All-Star break last season (2006). Pirates fans everywhere noticed that the Pittsburgh Pirates had lost many, many one-run games (games decided by one run).

They were 27-54 (.333) at the end of the first half, and they were 8-23 (.258) in one-run games. Both the winning percentage (.258) and the win-loss difference (-15) in one-run games were the worst in the Majors. Then I thought, "How can I show the Pirates' record to accentuate their terrible performance in those one-run games?"

After some discussions with my colleagues, I came to the conclusion that the best way was to show everything: not summaries, but every single datum, that is, every game. After some more discussions with my colleagues, I created what would become the Bivariate Baseball Score Plots. (And the Team was formed.)

So at the heart of this project, there is a dedicated and irate Pirates fan in Nashville. Well, actually our team of four happens to have two Pirates fans. The other two are a Brewers fan and a guy who doesn't care much about baseball.

Let's go Bucs!