Exploring Save Prediction for MLB Pitchers
I woke up on a cold Chicago January morning to the exciting news that the White Sox had recently signed a new player, Liam Hendriks. For an up-and-coming team, adding an All-Star to anchor the bullpen was an exciting prospect and a sign that the team was investing in making a competitive leap for 2021. Most of my experience watching Hendriks play had come just months earlier, when his Oakland A’s defeated the White Sox in the opening round of the playoffs. He was a truly dominant force as the A’s closer in that short series, and I had heard his name mentioned throughout the previous couple of years, so I assumed he had been in the role for a while.
When I opened his Baseball-Reference page, I was surprised to learn that not only had he been a closer for just two years, but he was in fact the same below-average starting pitcher who had played for the Minnesota Twins nearly ten years earlier! I knew that this transition from starter to closer was not unheard of, but I was taken aback by the stark contrast between his struggles early in his career and his later success. I was struck by two related questions. First, using the statistics on this page, could I predict how well he would do for the White Sox in the coming season? And second, is there a way to predict saves for closers generally, even for pitchers whose careers change trajectory as much as Hendriks’ had?
Gathering Data
I decided to stick with the website that I, like so many other fans, turn to as a first stop for statistics: Baseball-Reference. I was specifically interested in gathering information about pitchers who end up in the closer role. For clarity, a closer is a pitcher whom teams use in late-game, high-leverage situations to help “close out” a game. If he enters in such a situation, holds the lead, and his team wins, he is credited with a save. Baseball-Reference has a nice page with the top ten pitchers in saves for each year, going back to 1871, so I used this to decide which pitchers to put into a dataset. I used BeautifulSoup to scrape career statistics, broken down by season, for each pitcher who appeared on a top-ten list for a season between 1990 and the present. The 1990 cutoff keeps us in the modern era of bullpen usage, as the way relievers were deployed began changing drastically in the mid-to-late 1970s.
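As a rough sketch, that scraping step can look something like the following. The URL pattern and the link selector here are assumptions for illustration, not Baseball-Reference’s actual page structure:

```python
import requests
from bs4 import BeautifulSoup

def top_save_pitcher_links(year):
    # NOTE: this URL pattern is an assumption for illustration.
    url = f"https://www.baseball-reference.com/leaders/SV_{year}.shtml"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # Pitcher pages live under /players/; collect the linked names.
    return {a.get_text(strip=True): a["href"]
            for a in soup.select("a[href^='/players/']")}

# One dictionary of pitcher links per season from 1990 onward.
links = {year: top_save_pitcher_links(year) for year in range(1990, 2021)}
```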
Cleaning and Engineering Data
Once I had the HTML scraped into a Jupyter Notebook, I parsed it and loaded it into a Pandas data frame. My target was next year’s saves, so I had to do a bit of time-series engineering, shifting rows so that each season’s data carried the following season’s saves. For example, a pitcher’s 2015 row gained an entry for his 2016 saves. That way, I could use the 2015 data with a supervised target for training, validation, and testing. I did the same thing looking backwards as well, to see whether track-record statistics might impact future performance. So those 2015 statistics also got columns for 2014 saves, the sum of the previous two years, and a running three-year total.
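In Pandas, this is a few lines of groupby work. A minimal sketch, assuming one row per pitcher-season and placeholder column names like player_id, year, and SV:

```python
import pandas as pd

# df: one row per pitcher-season, with (assumed) columns
# 'player_id', 'year', and 'SV' for saves.
df = df.sort_values(["player_id", "year"])
saves = df.groupby("player_id")["SV"]

df["SV_next"] = saves.shift(-1)    # target: next season's saves
df["SV_prev"] = saves.shift(1)     # last season's saves
df["SV_prev2_sum"] = saves.shift(1) + saves.shift(2)  # two-year track record
# Running three-year total (current season plus the two before).
df["SV_3yr_total"] = saves.transform(lambda s: s.rolling(3, min_periods=1).sum())
```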
This decision also meant a few observations needed to be removed from the training data set. Pitchers in the final year of their careers had no meaningful prediction to make: after the final season’s saves count was used as the previous year’s target, the final season itself was removed. Similarly, the 2020 season was anomalous due to COVID-19 and would not serve as a true target for the 2019 seasons, so the 2019 rows were set aside for future consideration.
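In the sketch above, that cleanup amounts to dropping rows with a missing target and holding out the 2019 seasons:

```python
# Final seasons have no next year to predict (SV_next is NaN), and
# 2019 rows would target the shortened 2020 season.
train_df = df[df["SV_next"].notna() & (df["year"] < 2019)]
held_out_2019 = df[df["year"] == 2019]  # set aside for future consideration
```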
On top of the saves statistics, a large number of other statistics were available as potential features in a model. These included, but were not limited to: games played, hits, runs, strikeouts, ERA, save opportunities, holds, age, and more complex metrics such as WHIP, FIP, ERA+, SO/BB, and SO/9.
MVP and Baseline
To get an idea of how one might start predicting saves, I began with a simple regression model as a minimum viable product. The initial model was a simple linear regression that took last year’s saves as the only feature to predict the coming season’s. This model was only able to account for about 30% of the variability in saves for the coming year. That meant that while one year’s save total could account for some of how well a pitcher might do in the following season, it still left the large majority of the variability in saves unexplained. The predictions from this model had a mean absolute error of about 14 saves.
In the figure above, accurate predictions fall on the dotted line, while points farther from the line represent over- or under-predictions. Note the hard floor on this model’s outputs: it never predicts below a certain minimum value, a good example of how it fails to capture some of the variability in saves.
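For concreteness, here is a minimal sketch of that baseline. I am assuming scikit-learn for the tooling and the engineered frame from earlier; variable names are placeholders rather than the project’s actual code:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X = train_df[["SV"]]       # single feature: this season's saves
y = train_df["SV_next"]    # target: next season's saves
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=42)

baseline = LinearRegression().fit(X_tr, y_tr)
preds = baseline.predict(X_val)
print(f"R^2: {r2_score(y_val, preds):.2f}")             # article: ~30% explained
print(f"MAE: {mean_absolute_error(y_val, preds):.1f}")  # article: ~14 saves
```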
I took this model to be a baseline; any future model that accounted for more of the variability would be a success story. Would it be possible to accurately predict saves? And if so, which statistics would account for that ability?
Improved Models
For the purposes of this project, I wanted to stick to linear regression to meet the requirements of the assignment for the Metis Data Science Intensive Bootcamp. There were a few directions to consider for improving on the baseline. The largest opportunity came from what has already been discussed: adding more features. By analyzing pair plots to avoid collinearity between features, using domain knowledge to confirm whether those features might be related, and considering a heatmap of correlation coefficients of features against the target variable, I created a new model using the features that survived this screening.
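A sketch of that screening step, assuming seaborn and a handful of illustrative column names standing in for the real feature list:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Candidate columns are illustrative placeholders.
cols = ["SV", "SVO", "GF", "HLD", "FIP", "SO/BB", "SV_next"]

# Heatmap of correlations, including each feature vs. the target SV_next.
sns.heatmap(train_df[cols].corr(), annot=True, cmap="coolwarm", center=0)
plt.title("Correlations among candidate features and next-year saves")
plt.show()

# Pair plots to eyeball collinear feature pairs.
sns.pairplot(train_df[cols])
plt.show()
```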
This updated regression model was then trained and run through five-fold cross-validation, which saw an increase in the percentage of variability in target saves it could explain, jumping to 35% on average. Moreover, there was a drop in the mean absolute error of predictions, falling to about 10 saves per prediction.
In the same style of visual as before, the new model creates predictions that fit a little more closely to the actual values, as the points move closer to the dotted line of perfect predictions. Moreover, the hard floor on outputs is gone, which helps the model account for additional variability in future saves.
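The five-fold cross-validation from the previous step is a single call in scikit-learn. A sketch, with an assumed feature list standing in for the real one:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

features = ["SV", "SVO", "GF", "HLD", "FIP", "SO/BB"]  # assumed feature set
X, y = train_df[features], train_df["SV_next"]

r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
mae = -cross_val_score(LinearRegression(), X, y, cv=5,
                       scoring="neg_mean_absolute_error")
print(f"mean R^2: {r2.mean():.2f}   mean MAE: {mae.mean():.1f}")
```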
Fine Tuning
One final attempt at improvement came from considering feature interactions and identifying which features contribute most to the resulting model. In both cases, I relied on cross-validated LASSO models to penalize large coefficients and aid in feature selection. I created all of the interaction terms of degree two and ran them through a LASSO model, but none of them retained a nonzero coefficient afterward, so I did not keep the interaction terms in my model. I did, however, find a couple of features from my own engineering that added value: mistakes per appearance (balks, hit by pitches, and wild pitches summed and divided by appearances) and batters faced per game.
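A minimal sketch of the interaction-term screening, assuming scikit-learn’s pipeline tools (the feature list is again the placeholder from above):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

pipe = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    StandardScaler(),   # LASSO penalties need features on comparable scales
    LassoCV(cv=5),      # cross-validated choice of the penalty strength
)
pipe.fit(X, y)

lasso = pipe.named_steps["lassocv"]
names = pipe.named_steps["polynomialfeatures"].get_feature_names_out(features)
# Features whose coefficients survived the penalty; per the result above,
# none of the interaction terms did.
print(names[lasso.coef_ != 0])
```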
Using this LASSO model with the selected features, I fit the final model and tested it against a holdout data set, producing a final plot of actual saves against predictions.
While this plot does appear fairly similar to the previous regression model’s predictions, this final model, on test data, was able to account for about 41% of the variability in target saves and had an average error of about 10 saves per prediction.
Sources of Error
In this process, while there was improvement in the models’ ability to explain the variability in saves, I would not call any of them a completely trustworthy method for predicting saves. A number of players came with large errors, and considering them helps us understand why the models struggle.
Several points speak to players such as John Smoltz and Duane Ward. Smoltz saw a career trajectory like Hendriks’, except he did not go from starter to reliever to closer; he jumped directly from starter to closer. Pitchers like Smoltz had no intermediary seasons that might predict their success in saves, and the model is unable to capture this jump. On the flip side, Duane Ward was incredibly successful but got hurt and never really played again. After his last successful season, the model expected more success and could not account for his drop-off. The downside of the high volatility of closers shows up in the many pitchers who are predicted to earn saves but end up earning literally zero.
Takeaways
Despite these sources of error, the model does have the ability to account for some of the variability in future saves, and more than simply using last year’s saves does. To better understand which features contribute most, let’s consider the coefficients of the features from the LASSO model. Since the statistics were standardized for regularization, these values are for interpreting relative importance and do not represent the raw units of the related features.
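Continuing the earlier LASSO sketch, ranking the standardized coefficients by magnitude looks something like this:

```python
import pandas as pd

# 'lasso' and 'names' come from the LASSO sketch above.
coef = pd.Series(lasso.coef_, index=names)
# Nonzero coefficients, largest magnitude first: a rough importance ranking.
print(coef[coef != 0].sort_values(key=abs, ascending=False))
```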
The features that contribute most, in my view, fall into two categories: opportunity and dominance. Opportunity comes down to a mixture of luck, managerial choices, performance, and other factors, and shows up here in statistics such as games finished, holds (another high-leverage opportunity, just not to finish games), saves, save opportunities, and games. Dominance speaks to how much a pitcher controls a game rather than relying on his defense to support him. It is visible in statistics like FIP (a metric that describes pitcher success independent of fielders), strikeouts per walk, and the combination of allowing fewer hits while tolerating increased walks.
Future Work
I would direct future efforts toward uncovering more information about the two categories that most impacted the final model. In particular, I would turn to additional advanced metrics to better define a dominant pitcher based on velocity, spin rate, swing-and-miss percentage, and other new-school statistics. I also think that bringing in more team-related statistics, such as managerial tendencies and metrics about a team’s balance between offensive and defensive expectations, might help project the closeness of games and therefore the number of save opportunities. All in all, it is clear that saves are highly volatile and difficult to project, so relying on these proxy values can help us better understand potential future success, barring injury.
For Fun: Predicting 2020
As a final takeaway, here is what the final model predicted for 2020 based on 2019 statistics, scaled down to a 60-game season.
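The scaling itself is simple proportional arithmetic. A sketch, where final_model and X_2019 are placeholders for the fitted LASSO model and the 2019 feature rows set aside earlier:

```python
# final_model and X_2019 are placeholder names for the fitted model
# and the held-out 2019 feature rows.
preds_162 = final_model.predict(X_2019)  # predictions on a 162-game scale
preds_60 = preds_162 * (60 / 162)        # scaled to the 60-game 2020 season
```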
On the left side, the pitchers are sorted by actual 2020 saves, while on the right they are sorted by the model’s projected saves. Overall, the model does a decent job of predicting saves, except where it misses badly. These large errors do seem to fall generally into the categories discussed previously: role or team changes and injuries. Without accurate 162-game data from 2020, and without projecting two years out from 2019 data, there is no data set well suited to predicting 2021, but I look forward to using 2021 data to see what I might expect for 2022!
For more detail, code, or to connect, please visit my GitHub repository, website, or LinkedIn.