The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at www.retrosheet.org.

Friday, September 14, 2007

Problems Fixed!

3 problems reported here have been fixed.

What's next?

I think that we want to give the users the choice for the maximum points to draw...

Tuesday, August 28, 2007

BBSP for the entire MLB 2006

To respond to the comment made to the last post, "is it possible to create the graphs for the entire league for a year? ... ...", I have created a BBSP for the entire MLB 2006 season.

Here's the plot (in pdf format).

Note:
There were 2429 games. (Each game is counted twice: once for "us" and once for "opponent".)
Home games (2429 of them) are highlighted.
Home teams were 1327 - 1102 (0.546).
The marginal plots on the top and on the left are exactly the same shape, as they should be.
The blue marginal plot on the top is the score distribution of the Home team, and the blue marginal on the left is of the Away team.
Home teams scored 20 twice (6/20 White Sox vs Cardinals, 9/18 Rockies vs Giants).
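
Why the two marginals must match: if every game is entered once from each side, every score shows up once as "us" and once as "opponent". Here is a minimal Python sketch of that idea. The game list is made up for illustration and the variable names are hypothetical; this is not the code behind the plot.

```python
from collections import Counter

# Hypothetical (home_runs, away_runs) results, made up for illustration.
games = [(4, 3), (2, 5), (20, 6), (1, 0)]

# Enter each game twice: once from the home side, once from the away side.
both_views = [(h, a) for h, a in games] + [(a, h) for h, a in games]

# Marginal distributions: how often each run total appears on each axis.
us_marginal = Counter(us for us, _ in both_views)
opp_marginal = Counter(opp for _, opp in both_views)
same_shape = us_marginal == opp_marginal  # identical by construction

# Home record for the made-up games.
home_wins = sum(h > a for h, a in games)
home_losses = sum(h < a for h, a in games)

# The winning percentage quoted in the note above, from 1327 wins and 1102 losses.
home_pct_2006 = round(1327 / (1327 + 1102), 3)
```

With the figures from the note, home_pct_2006 comes out to 0.546.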

Here's the plot (in pdf format) without the details (w/o inside dots).

The MLB games are fairly competitive; one-run-games are very common. If you count the dots you will find that the most frequent scores were:
4-3 (123 times),
5-4 (105 times),
3-2 (101 times),
6-5 (97 times),
2-1 (78 times),
5-3 (77 times),
4-2 (76 times),
6-3 (76 times).
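
If you would rather not count dots, a tally like the one above could be produced along these lines. This is only a sketch in Python: the function name and the sample games are made up for illustration, and the real counts came from the Retrosheet data, not from this snippet.

```python
from collections import Counter

def score_line_counts(games):
    """Count final scores as (winner_runs, loser_runs), ignoring who was home."""
    return Counter((max(a, b), min(a, b)) for a, b in games)

# Hypothetical results, made up for illustration (not the 2006 data).
games = [(4, 3), (3, 4), (5, 4), (2, 1), (4, 3), (6, 3)]
counts = score_line_counts(games)

# Most frequent score lines first.
for (winner, loser), n in counts.most_common(3):
    print(f"{winner}-{loser} ({n} times)")

# Share of games decided by exactly one run.
one_run_games = sum(n for (w, l), n in counts.items() if w - l == 1)
one_run_share = one_run_games / len(games)
```

Folding 4-3 and 3-4 into the same score line is a choice made here so that the tally matches the "winner first" style of the list above.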

Saturday, August 25, 2007

Bob Gibson in 1968

The example mentioned in a comment to the last post.
http://data.vanderbilt.edu/bbplot/png/SLN-1968-19-1.png

Gibson was great in '68 by Baseball Analysts.

Friday, August 03, 2007

JSM handout available

Last Sunday, the 29th of July, I presented the BBSP at the Joint Statistical Meetings (JSM) in Salt Lake City as a 15-minute talk in a session run by the Statistics in Sports Section of the American Statistical Association. A copy of the handout that I, um, handed out at the talk is available (in pdf) for your viewing pleasure.

The handout contains a description of how to read the plots and the Roger Clemens / Houston Astros example. As a high-resolution pdf, it is suitable to be blown up from its normal size (8.5 x 11 inches), and might make a pretty print if printed at 17 x 22. I'm not sure if it is 'suitable for framing', but it might make a unique and inexpensive Christmas (or your favorite gift-giving holiday!) present for your favorite baseball fan.

Monday, July 30, 2007

Pittsburgh Pirates 2006

Click to open >> Pirates 2006 one-run games.

This is the plot that motivated the BBSP.

One-run games (games decided by one run) appear right above the diagonal if the Pirates lost these games, and right below the diagonal if the Pirates were on the winning side. You could count the blue x's and black dots, but you can get the general idea just by looking at the plot.

The many blue x's right above the diagonal indicate a terrible record in one-run games in the first half. The black dots right below the diagonal indicate the subsequent improvement in such games in the second half.

In any case, it's tough being a Pirate fan. 2-13 since All-Star Break. Ouch.

Friday, July 27, 2007

Known Issues

Here are the known issues that our team is working on right now:

Altoona Mountain City (1884) doesn't plot. I think that the issue is they didn't play any game on Sunday.

Many teams in 50's or before: Pitcher with Decision split acts funny. So does Opponent Pitcher with Decision split. Example: Pittsburgh Pirates 1950 and Pitcher with Decision.

I found out what the problem is. Retrosheet doesn't have data for "losing pitcher" for the old days (pre 1950?). So when the games are split based on "Pitcher with Decision", you'll notice that nobody has any losses. That's because nobody is in the "losing pitcher" column. At least the number of wins seems to be counted correctly.

"Opponent Pitcher with Decision" is recorded only when the (opponent) pitcher is the winning pitcher, so again with this split, only the wins for the opponent pitcher (thus losses for the Pirates) appear.

We need a rational way to address this issue.


Ties don't appear in the record. Well, it's because I assumed that there is no tie game in MLB when I wrote the original code. I should have taken a history lesson.

Downloading a pdf file seems to be a problem with Firefox. I thought we had fixed this... Fixed!
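
For the missing-pitcher and tie issues above, one possible approach (not necessarily the fix we will adopt) is to keep every game in the tally: games with no recorded pitcher go into a bucket of their own, and tied scores get their own column. A minimal Python sketch follows; the function name, the label and the sample games are all hypothetical.

```python
from collections import defaultdict

UNKNOWN = "(not recorded)"  # hypothetical label for a missing pitcher

def split_record(games):
    """Tally [wins, losses, ties] per pitcher of decision.

    Each game is (us_runs, opp_runs, pitcher), where pitcher is None
    when the source data has no pitcher of decision for that game.
    """
    table = defaultdict(lambda: [0, 0, 0])
    for us, opp, pitcher in games:
        key = pitcher if pitcher is not None else UNKNOWN
        if us > opp:
            table[key][0] += 1
        elif us < opp:
            table[key][1] += 1
        else:
            table[key][2] += 1  # tie: kept instead of silently dropped
    return dict(table)

# Hypothetical games, made up for illustration.
games = [(5, 2, "Smith"), (1, 4, None), (3, 3, None), (0, 2, None)]
records = split_record(games)
```

This way the losses do not vanish from the split; they just show up without a name attached.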

If you find anything else, please let us know via comments.

Tuesday, July 24, 2007

Up and Running

The Bivariate Baseball Score Plot project is now open to the public.

Its official birthday is 7/24/07.

Go to http://data.vanderbilt.edu/rapache/bbplot/ and have fun!
If you have any comments, please leave them on this blog.
Bug reports are also welcome. We're still working on it.

Thursday, July 19, 2007

How to Read a Bivariate Baseball Score Plot.

Basic Ideas
The bivariate baseball score plots present summary information for Major League Baseball teams’ game scores.

Each game is represented as one mark in the joint score distribution grid and one mark in each marginal. Splits based on a variety of game parameters (starting pitcher, day/night, home/away, etc.) are available; different values of the split parameter are differentiated by color and shape. Games are shown collected into little groups (in this case, groups are size 3) so as to maintain a rational aspect ratio for the overall plot.

Marginal Distributions
The marginal score distributions are shown on the top for the selected team and along the left side for that team’s opponents; also shown via small tick marks on the runs scale are the overall mean runs per game (rpg), along with mean rpg for games meeting, and those not meeting, the split criterion.

Arbitrarily, games meeting the split criterion are placed at the bottom in each stack. Reference lines are drawn to improve one’s ability to quickly count games in a column or row. If there are games with scores in excess of the arbitrary maximum (here, 15), a plus sign is added to denote the presence of such games.

The marginals are oriented along the top and left sides so as to facilitate comparison between the marginals. A simple twist of the head allows visual comparison of the two marginals without needing to reverse the positive direction mentally, as would be necessary if the marginals were shown protruding away from the center and located in the traditional bottom and left side positions.
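
The numbers behind one marginal can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the R code used for the plots: the function name and sample data are hypothetical, and games above the cap are simply counted separately here, since how they are drawn beyond the plus sign is not spelled out above.

```python
from statistics import mean

MAX_RUNS = 15  # the arbitrary maximum mentioned above

def marginal_summary(runs, in_split, max_runs=MAX_RUNS):
    """Summarize one side's runs per game for a marginal plot.

    runs: runs scored in each game; in_split: parallel list of booleans,
    True when the game meets the split criterion.
    """
    met = [r for r, s in zip(runs, in_split) if s]
    unmet = [r for r, s in zip(runs, in_split) if not s]
    columns = [0] * (max_runs + 1)  # stack heights for 0..max_runs runs
    over = 0                        # games beyond the cap (plus sign)
    for r in runs:
        if r <= max_runs:
            columns[r] += 1
        else:
            over += 1
    return {
        "mean_all": mean(runs),
        "mean_split": mean(met) if met else None,
        "mean_other": mean(unmet) if unmet else None,
        "columns": columns,
        "over": over,
    }

# Hypothetical data, made up for illustration.
summary = marginal_summary([0, 3, 5, 17], [True, False, True, False])
```

The three means correspond to the three tick marks on the runs scale: overall, games meeting the split, and games not meeting it.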

Joint Distribution
The joint distribution is shown as collected marks in small squares. Victories for the selected team will be below the diagonal, losses above. One-run games will be just above and below the diagonal. Again, games are grouped for ease in counting; the squares are shaded in relation to the number of games they contain. Thus, more ink means more data. This yields a layered presentation of the data; the overall distribution is visible from afar, while atomic-level datum details are available upon closer inspection. “Reward the viewer for mental and visual investment in the graphic.”
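
In data terms, the joint grid is just a count of games per (team score, opponent score) square. A minimal Python sketch of that bookkeeping, with a hypothetical function name and made-up games (the plots themselves are drawn in R):

```python
from collections import Counter

def joint_summary(games):
    """Count games per square and summarize their position relative to the diagonal.

    games: list of (us_runs, opp_runs) pairs for the selected team.
    """
    cells = Counter(games)  # (us, opp) -> number of games in that square
    wins = sum(n for (us, opp), n in cells.items() if us > opp)    # below the diagonal
    losses = sum(n for (us, opp), n in cells.items() if us < opp)  # above the diagonal
    ties = sum(n for (us, opp), n in cells.items() if us == opp)   # on the diagonal
    one_run_wins = sum(n for (us, opp), n in cells.items() if us - opp == 1)
    one_run_losses = sum(n for (us, opp), n in cells.items() if opp - us == 1)
    return cells, (wins, losses, ties), (one_run_wins, one_run_losses)

# Hypothetical games, made up for illustration.
games = [(1, 0), (0, 1), (0, 1), (4, 3), (2, 2), (7, 1)]
cells, overall, one_run = joint_summary(games)
```

The count in each square is what drives the shading: a square with more games gets more ink.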

Example: Astros, Roger Clemens in 2005
Open a png file.
Download a pdf version (31KB).

This example shows 163 games for the 2005 Houston Astros, with games started by Roger Clemens highlighted. The Astros finished the year 16 games over .500 but were 2 games under .500 in games which Clemens started.

The Astros' marginal score distribution (at the top of the plot) shows typical numbers from a good team: an overall average of over 4 rpg. Clemens, however, appears to have received less run support, as the Astros’ average offensive output in games he started is less than 3.5 rpg. Closer inspection reveals that Clemens was the unfortunate recipient of 9 of Houston’s 17 shutouts in 2005. While the Astros did score at least 7 runs for 5 of Clemens’ starts, the overall offensive support for Clemens actually was, well, offensive.

The Astros’ opponents’ marginal distribution (on the left) shows how teams fare against teams that beat them: their average rpg is just over 3.5 rpg compared with nearly 4.5 rpg for the Astros. Where the Astros were held to 1 run 27 times, their opponents were held to 1 or fewer on 42 occasions. Note that Clemens started 2 games that were shutouts and started 11 games where the opponents were held to fewer than 2 runs. He also started a game where the opponents scored 9 runs.

The joint distribution reveals details of Clemens’ abysmal run support. The bottom-left corner of the distribution shows five games which Clemens started in which the Astros lost 1-0, a pitcher’s nightmare. So, of the 11 games that Clemens started in which the opponents were held to one run or fewer, 5 failed to produce a single Houston run. In fact, Clemens was the only Astros pitcher to start a game in which the team lost 1-0.

The joint distribution reveals a rather ordinary overall record of 25-21 in one-run games, a measure often heralded as a mark of good teams.
The keen eye will note a single game on the diagonal, a 2-2 tie. Prior to 2007, such games that were tied but suspended were kept on the books for purposes of individual statistics, but were replayed at the next available opportunity.

Disclaimer and Software Information
Data for the plots were obtained from retrosheet.org. Programming was done using the R environment for statistical computing and graphics.

An interactive website is available for examining score distributions of any team in the retrosheet database from 1876-2006 at http://data.vanderbilt.edu/rapache/bbplot/ .

Tuesday, July 17, 2007

Who we are...

rafe donahue
Day job: Biostatistician
Contribution to this project: Statistical philosophy, adult supervision, and guy willing to wear the tie at the presentation
Favorite MLB team: Brewers
Favorite NFL team: Packers

beetama74
Day job: Biostatistician
Contribution to this project: Statistical reasoning, R programming, and guy who created the original version of the plot
Favorite MLB team: Pirates
Favorite NFL team: Steelers

Jeffrey
Day job: Computer programmer
Contribution to this project: R/Apache implementation and a non-baseball guy's perspective
Favorite MLB team: unknown
Favorite NFL team: Titans (presumably)

Cole
Day job: Computer programmer
Contribution to this project: R/Apache implementation and a baseball guy's perspective
Favorite MLB team: Pirates
Favorite NFL team: Titans

Saturday, July 14, 2007

Bivariate Baseball Score Plot

It all started at the All-Star break of last season (2006). Pirate fans everywhere noticed that the Pittsburgh Pirates had lost many, many one-run games (games decided by one run).

They were 27-54 (.333) at the end of the first half, and they were 8-23 (.258) in one-run games. Obviously, the winning percentage (.258) and win-loss difference (-15) were the worst in the Majors. Then I thought, "How can I show the Pirates' record to accentuate their terrible performance in those one-run games?"

After some discussions with my colleagues, I came to the conclusion that the best way was to show everything. Not summaries, but every single datum = game. After some more discussions with the colleagues, I created what would become the Bivariate Baseball Score Plots. (And the Team was formed.)

So at the heart of this project, there is a dedicated and irate Pirate fan in Nashville. Well, actually our team of four happens to have 2 Pirate fans. The other two are a Brewer fan and a guy who doesn't care much about baseball.

Let's go Bucs!