Untangling the NFL Pt. 4: Wins Above Replacement (QBs)

After a couple weeks off, I’ve returned to my long-term project of quantifying value in the NFL. The reason for the break was discouragement at the end of Part 3. I had built complex metrics (TOM for offense, TDM for defense) that seemed sound on paper and, better yet, applied some basic machine learning techniques, and yet they failed to yield meaningful results.

In hindsight, TOM and TDM each had a solid conceptual core. It made sense to break down offense and defense into logical ‘phases,’ and the strong phase-level correlations with DVOA suggested the idea had some basis in the data. Where I went wrong was architecture. By building from season-level aggregates instead of play-level data, I had already limited my ability to create anything close to a strong Wins Above Replacement score. Anchoring the model to team-level outcomes also forced crude post-hoc adjustments that disproportionately inflated or understated contributions at certain positions. Worse yet, I think I lost the thread behind the whole point of this statistic; my scores were largely just flat composites with no relation to wins or to what replacement level actually means.

Sometimes to go forward, you have to embarrass yourself, look back, say “that was cringe,” and then do everything over again. To prepare for my next attempt, I looked back at my failed attempts with TDM and TOM, reread what I wrote in Part 1, and revisited the existing work on quantifying value in football. In today’s post, I’m taking another swing at quantifying individual value in pro football, and this time I’m not going to waste any time: this piece follows the attempts of my nflWAR elders to create an empirically grounded quarterback WAR list.

My Approach to WAR

In any sport, WAR is intended as a universal measure of how much an individual contributes to team success above a replacement-level player. What’s considered a replacement-level player will obviously differ from person to person, but in baseball, the sport where sabermetrics have the most history, replacement-level means a typical league-minimum free agent. When it comes to football, I initially wasn’t very sure about what was considered replacement-level. Was a ‘premium backup’ like Mac Jones closer to replacement-level, or does replacement-level mean someone off the practice squad? 

One of the big limitations of WAR in any sport is that while it isolates value created, it only does so assuming that all else is held equal. In other words, even if a player’s individual contributions on a roster are properly quantified above the hypothetical contributions of a replacement-level player, that could just as easily be a function of the environment around them being curated in such a way. WAR can still semi-accurately order the best players in the league, but its focus is on distinguishing tiers of contributions from each player to their respective roster more than it is to differentiate players within the same tier. Put simply, the order of Patrick Mahomes and Josh Allen isn’t relevant, as long as their general tier is distinguished from players like Baker Mayfield or Justin Fields. 

On that note, this model is not built to resolve debates between friends on who is better between Justin Herbert or Sam Darnold – or whether 2021 Aaron Rodgers had a better season than 2024 Josh Allen. When building it, I was most interested in its statistical relationship with wins, and how assessing ‘value’ can help shape roster decisions. For example, how many wins is the typical Top 5 first-round draft pick quarterback worth vs. a free agent quarterback in his 30s? Are elite hall-of-famers really that much better than reliable starters? These are the questions that any good WAR model should try to answer. 

Data Loading and Preparation

My dataset spans four full NFL seasons of play-by-play data (2021, 2022, 2023, and 2024), including postseason plays. After removing special teams, spikes, kneeldowns, and penalty-nullified snaps, I was left with 141,856 offensive plays across 1,139 games. To build a quarterback-specific WAR model, I first needed a baseline for translating Expected Points Added (EPA), one of the most widely respected advanced metrics for measuring efficiency, into wins.

I recomputed this value directly from my own dataset by first calculating the average EPA margin between winning and losing teams on a per-game basis. After that, I multiplied that margin across a full 17-game season. Through this process, I learned that a team needed roughly 218.83 net EPA above replacement over a season to produce one additional win.
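
The calculation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's actual code: the per-game EPA totals are made-up numbers, and `epa_per_win` is a hypothetical helper name.

```python
# Toy per-game EPA totals (made-up values, not taken from the real dataset).
games = [
    {"winner_epa": 8.0, "loser_epa": -4.0},
    {"winner_epa": 10.0, "loser_epa": -4.0},
    {"winner_epa": 6.5, "loser_epa": -6.5},
]

GAMES_PER_SEASON = 17  # regular-season games per team


def epa_per_win(games, games_per_season=GAMES_PER_SEASON):
    """Average winner-minus-loser EPA margin per game, scaled to a full season."""
    margins = [g["winner_epa"] - g["loser_epa"] for g in games]
    avg_margin = sum(margins) / len(margins)
    return avg_margin * games_per_season


team_epa_per_win = epa_per_win(games)
print(f"Team EPA per win (toy data): {team_epa_per_win:.2f}")
```

Run against the real play-by-play data, this is the step that produced the 218.83 figure in the post.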

The next step was to scale this number to isolate quarterback impact. With the same sample used to calculate the EPA-per-win baseline, I found that quarterbacks accounted for 67.4 percent of all offensive plays and 67.53 percent of all offensive EPA. Multiplying these weights by the team EPA-per-win baseline gave me an estimate of the typical EPA required for one quarterback win above replacement across a season: about 99.5 EPA.
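
As a quick arithmetic check, here is that scaling step written out. It assumes both shares are applied multiplicatively to the team baseline, which is how I read the description; the variable names are hypothetical.

```python
# Figures quoted in the post.
TEAM_EPA_PER_WIN = 218.83
QB_PLAY_SHARE = 0.674   # share of offensive plays involving the quarterback
QB_EPA_SHARE = 0.6753   # share of offensive EPA attributed to the quarterback

# Assumption: both weights scale the team baseline multiplicatively.
qb_epa_per_win = TEAM_EPA_PER_WIN * QB_PLAY_SHARE * QB_EPA_SHARE
print(f"QB EPA per win: {qb_epa_per_win:.1f}")
```

With the rounded shares above, the product lands within a fraction of a point of the post's figure of about 99.5; the small gap presumably comes from rounding in the published percentages.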

Regression Modeling & Empirical Validation

However, to calculate my own version of WAR for quarterbacks, I didn’t want to rely strictly on raw EPA, as it can be unstable and noisy. Using CPOE (Completion Percentage Over Expected) and Success Rate (roughly speaking, the percent of plays with an EPA above 0) as predictors, I created a statistic called “Modeled EPA,” which predicts EPA/play from these two variables. For each season from 2021 to 2024, I fit a simple linear regression of EPA/play on both of them.

Season  Intercept  CPOE_Z Coefficient  Success_Z Coefficient  R² (presumed)  N
2021    -1.187     0.0011              2.638                  0.734          49
2022    -1.004     0.0022              2.246                  0.785          51
2023    -1.111     0.0030              2.457                  0.737          52
2024    -1.279     -0.0037             2.856                  0.777          50

Across all four seasons, Success Rate is the dominant statistical driver of EPA: a one-standard-deviation improvement in Success Rate predicts roughly 0.11–0.15 EPA/play, which is the difference between an average starter and a fringe MVP candidate. CPOE, by comparison, contributed minimally, with coefficients hovering near zero in every season.

With that said, the results were telling. The negative CPOE coefficient in 2024 seems to suggest that accuracy adjusted for difficulty of throw may not actually be that important from a pure value standpoint, while the large Success Rate coefficient emphasizes the importance of consistently finding first downs. In either case, I retained CPOE in the model for completeness and because it contributes to a strong regression fit, even if Success Rate did most of the heavy lifting. I ended up with a final statistic called “Enhanced EPA,” which is the average of raw EPA and Modeled EPA at a 50/50 split. I chose this split as an arbitrary starting point.
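
To make the Modeled EPA and Enhanced EPA steps concrete, here is a minimal pure-Python sketch. The quarterback rows are invented toy values, `fit_ols` is a hypothetical helper that solves ordinary least squares through the normal equations, and the choice of raw (not standardized) inputs is my assumption; the real model may have been fit with a stats library and differently scaled inputs.

```python
def fit_ols(features, targets):
    """Least-squares fit with an intercept via the normal equations.

    Returns [intercept, coef_1, coef_2, ...].
    """
    X = [[1.0, *row] for row in features]
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Toy season: (CPOE, success rate, raw EPA/play) per quarterback. Made-up values.
qbs = [
    (2.0, 0.50, 0.20),
    (-1.0, 0.44, 0.02),
    (0.5, 0.47, 0.10),
    (3.0, 0.48, 0.12),
    (-2.0, 0.42, -0.05),
    (1.0, 0.46, 0.08),
]

raw_epa = [epa for _, _, epa in qbs]
intercept, b_cpoe, b_success = fit_ols([(c, s) for c, s, _ in qbs], raw_epa)

# Modeled EPA is the regression's prediction for each quarterback.
modeled_epa = [intercept + b_cpoe * c + b_success * s for c, s, _ in qbs]

# Enhanced EPA is the 50/50 average of raw and modeled EPA/play.
enhanced_epa = [0.5 * raw + 0.5 * mod for raw, mod in zip(raw_epa, modeled_epa)]
```

The blend pulls each quarterback's raw EPA/play halfway toward what his CPOE and Success Rate would predict, which is the stabilizing effect the 50/50 split is meant to provide.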

The WAR Formula

I’m about to get into the fun part now: evaluating who the best quarterbacks were. Before that, however, I had to put replacement-level value for NFL quarterbacks into quantitative terms. As a starting heuristic, I chose to define “replacement-level” as the 25th percentile or below among quarterbacks with at least 100 combined rushing and passing plays in a season. Although I understand that replacement-level in other sports means something more akin to the minimum-salary free agent, for now I wanted to take a conservative approach, so I picked a number that felt tied to the reality of quarterbacks who get significant playing time in a season.
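
A replacement baseline along these lines could be computed as follows. This is a sketch under assumptions: the season rows are invented, `replacement_epa_per_play` is a hypothetical name, and I take "25th percentile" to mean the quartile cut point itself (averaging everyone at or below it would be another reasonable reading).

```python
from statistics import quantiles

MIN_PLAYS = 100  # combined rushing and passing plays in a season

# Toy season of raw EPA/play values (made-up numbers).
season = [
    {"qb": "QB A", "plays": 650, "epa_per_play": 0.10},
    {"qb": "QB B", "plays": 580, "epa_per_play": 0.05},
    {"qb": "QB C", "plays": 420, "epa_per_play": 0.00},
    {"qb": "QB D", "plays": 250, "epa_per_play": -0.05},
    {"qb": "QB E", "plays": 120, "epa_per_play": -0.10},
    {"qb": "QB F", "plays": 40, "epa_per_play": -0.30},  # below the play cutoff
]


def replacement_epa_per_play(rows, min_plays=MIN_PLAYS):
    """25th-percentile EPA/play among quarterbacks who meet the play minimum."""
    qualified = [r["epa_per_play"] for r in rows if r["plays"] >= min_plays]
    # quantiles(..., n=4) returns the three quartile cut points; take the first.
    return quantiles(qualified, n=4, method="inclusive")[0]


replacement_level = replacement_epa_per_play(season)
```

Note that the 40-play quarterback never enters the calculation, which is what keeps the baseline tied to passers who actually saw meaningful time.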

Once I established the replacement baseline, I computed each quarterback’s EPA Above Replacement. This is simply the difference between their enhanced EPA/play and the replacement EPA/play for that season, scaled by how many total plays they ran. Here is the general formula: 

EPA Above Replacement = (Enhanced EPA/play – Replacement EPA/play) × Total Plays

This gives us a season-level measure of how much value a quarterback generated beyond what a generic backup would have produced in the same volume. The final step is to convert that EPA surplus into wins by dividing by the quarterback-specific EPA-per-win constant in our model:

WAR = EPA Above Replacement / EPA per win
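
Following the prose description, the two steps (EPA surplus first, then the conversion to wins) look like this in code. The quarterback's inputs are invented for illustration, the function names are hypothetical, and 99.5 is the quarterback EPA-per-win constant quoted earlier in the post.

```python
QB_EPA_PER_WIN = 99.5  # quarterback-specific constant from the post


def epa_above_replacement(enhanced_epa_per_play, replacement_epa_per_play, total_plays):
    """Per-play edge over replacement, scaled by volume."""
    return (enhanced_epa_per_play - replacement_epa_per_play) * total_plays


def wins_above_replacement(epa_surplus, epa_per_win=QB_EPA_PER_WIN):
    """Convert an EPA surplus into wins."""
    return epa_surplus / epa_per_win


# Hypothetical quarterback: 0.15 enhanced EPA/play, -0.05 replacement level, 600 plays.
surplus = epa_above_replacement(0.15, -0.05, 600)
war = wins_above_replacement(surplus)
print(f"EPA above replacement: {surplus:.1f}, WAR: {war:.2f}")
```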

NOTE: My more math-inclined readers might notice something strange here: I apply Enhanced EPA to the quarterbacks being evaluated, but use regular EPA/play for the replacement baseline and regular EPA per win for the conversion. This is largely a design choice. When it comes to determining replacement-level play, I thought it was more important that the numbers reflect the real performances of low-end quarterbacks when they were on the field. The enhanced model stabilizes noisy efficiency for starters and high-volume quarterbacks (the ones we’re most interested in); applied to backups, it would inflate their actual performance.

Obviously, raw WAR alone doesn’t tell the whole story. A quarterback who plays 1,000 snaps will accumulate more WAR than one who plays 400, even if both performed at the same above-replacement efficiency. To account for this, I also normalized performance to a standard workload using WAR per 700 plays, with 700 plays being my rough approximation of one full season of quarterback action. This makes efficiency comparisons across players and across years meaningful.
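
The normalization is a simple rate conversion. The sketch below uses the cumulative Total Plays and Total WAR figures from the table later in this post; `war_per_700` is a hypothetical helper name.

```python
STANDARD_WORKLOAD = 700  # plays treated as one full quarterback season


def war_per_700(total_war, total_plays, workload=STANDARD_WORKLOAD):
    """Scale WAR to a standard 700-play workload."""
    return total_war / total_plays * workload


# Cumulative 2021-2024 figures (Total WAR, Total Plays) from the table in this post.
cumulative = {
    "Patrick Mahomes": (13.11, 3282),
    "Josh Allen": (12.66, 3081),
    "Lamar Jackson": (8.67, 2373),
}

rates = {qb: war_per_700(war, plays) for qb, (war, plays) in cumulative.items()}
for qb, rate in rates.items():
    print(f"{qb}: {rate:.2f} WAR/700")
```

These reproduce the WAR/700 values quoted for Mahomes, Allen, and Jackson in the note under the cumulative table.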

After calculating raw WAR and WAR per 700 for all four seasons, I then introduced a third stabilizing measurement: Weighted WAR, a balanced combination of volume and efficiency expressed in seasonal Z-scores. Weighted WAR captures a QB’s overall impact by blending Total WAR and WAR/700, using the ratio 70 percent volume, 30 percent efficiency. 
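
One way to express that blend is sketched below. The details are assumptions on my part: the inputs are toy values, the helper names are hypothetical, Z-scores use the sample standard deviation within a single season, and the sketch does not cover how seasonal values are rolled up into a multi-year total.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values within one season (sample standard deviation assumed)."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


def weighted_war(total_wars, wars_per_700, volume_weight=0.7):
    """Blend volume (Total WAR) and efficiency (WAR/700) Z-scores, 70/30 by default."""
    z_volume = z_scores(total_wars)
    z_efficiency = z_scores(wars_per_700)
    return [
        volume_weight * zv + (1 - volume_weight) * ze
        for zv, ze in zip(z_volume, z_efficiency)
    ]


# Toy season with three quarterbacks (made-up values).
season_total_war = [1.0, 2.0, 3.0]
season_war_per_700 = [3.0, 2.0, 1.0]
blended = weighted_war(season_total_war, season_war_per_700)
```

In this toy case the high-volume, low-efficiency quarterback still comes out ahead of the low-volume, high-efficiency one, which is the intended effect of weighting volume at 70 percent.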

Why did I choose this split? Ultimately, I wanted to reward high volume over elite efficiency among passers who might have missed time to injuries or just not have the same snap count. Finally, for the multi-year analysis, each quarterback’s contributions from 2021–2024 were aggregated into cumulative totals:

  • Total Plays – all passing and rushing plays involving the quarterback
  • WAR – the raw number of wins this player delivered above our replacement threshold for that year
  • Weighted WAR – a conservative estimate of how many wins a player added to their team’s results after combining pure WAR with WAR/700 (in a 70/30 split)

In the next section, I’m going to share some cumulative insights across our entire period of assessment by examining our Top 32 quarterbacks across this whole period.

Initial Insights

Cumulative 2021-2024 Table (Sorted by WWAR)

Rank  Player            Total Plays  Total WAR  Weighted WAR
1     Patrick Mahomes   3282         13.11      3.14
2     Josh Allen        3081         12.66      3.06
3     Lamar Jackson     2373         8.67       1.99
4     Jalen Hurts       2793         9.00       1.96
5     Joe Burrow        2707         8.84       1.94
6     Jared Goff        2529         8.36       1.83
7     Dak Prescott      2282         7.50       1.62
8     Brock Purdy       1406         6.23       1.59
9     Matthew Stafford  2356         7.42       1.56
10    Justin Herbert    2707         7.68       1.55

NOTE: Lamar Jackson’s weighted WAR, even after factoring in his elite efficiency, places him closer to the Hurts and Burrow group of quarterbacks. This is because of his 11 missed games over this period. By WAR/700, his number (2.56) is right beneath those of Mahomes (2.80) and Josh Allen (2.88).

Without going into the weeds of the entire data set (see the appendix for more information), here are a few general rules of thumb to keep in mind.

  1. The best quarterbacks in each year are typically worth at least three wins above replacement level and two weighted wins above replacement. 
  2. “Reliable starters” and above average franchise players are typically contributing anywhere from two to three wins above replacement level in typical seasons, and at least half a weighted win above replacement.
  3. Anything below one win above replacement level or 0.5 weighted wins above replacement puts a quarterback in journeyman, late-career, or replacement-level territory.

To validate that these WAR values capture real quarterback contributions despite seeming conservative, I tested the metric against two external benchmarks: team wins and Approximate Value.

Validation Test 1: WAR vs Actual Wins

Across all 32 teams in 2024, total team quarterback WAR correlates at r = 0.69 with total wins (regular season + postseason). That corresponds to an R-squared of 0.48, meaning QB WAR alone explains nearly half of all win variance across the league. The fitted slope is also intuitive, as each point of team QB WAR across a season corresponds to roughly +2.64 wins. This implies, for example, that the difference between an average quarterback room and an elite one is often the difference between missing the playoffs and competing for a bye.
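
For readers who want to run this kind of check themselves, here is a small sketch of the correlation and slope calculation. The team rows are invented toy values, not the 2024 data, and the helper names are hypothetical; only the r = 0.69 figure comes from the post.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs (wins per point of QB WAR)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Toy team-level data (made-up): team QB WAR and total wins.
team_qb_war = [0.5, 1.5, 2.0, 3.0, 4.0]
team_wins = [5, 9, 8, 12, 14]

r = pearson_r(team_qb_war, team_wins)
slope = ols_slope(team_qb_war, team_wins)

# The post's reported correlation and the share of variance it implies.
reported_r = 0.69
reported_r_squared = reported_r ** 2
print(f"Reported r = {reported_r}, r squared = {reported_r_squared:.2f}")
```

Squaring the reported correlation is where the "nearly half of all win variance" figure comes from.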

Within the 2024 data, five teams of note stood out to me, though they aren’t labeled above. Beginning with the Eagles and Lions, both significantly over-performed relative to expectations, which showcases a strong roster of contributing players (and coaching) beyond quarterback play. Conversely, the Commanders, led by arguably the greatest rookie season ever in Jayden Daniels, largely benefited from his presence. Strangely enough though, the Rams were treated as the biggest over-performers relative to quarterback play, with Stafford having strong and weak years alike during this time period. 

Validation Test 2: WAR vs. Approximate Value 

The next benchmark compares my version of WAR to the legendary Approximate Value: Doug Drinen’s brainchild and the very reason I’m here right now. Starting with the correlation between WAR and AV, I found that our model captured 94 percent of the variance in AV, which is about as close as two separate value systems can get. The difference is that I used play-by-play data to translate value into wins directly.

With that said, I couldn’t help wondering which of these two metrics had the greater correlation with actual wins. Looking at the quarterbacks for whom both AV and WAR were available, WAR had a slightly stronger correlation with actual wins (0.69) than AV did (0.60).

Limitations & Conclusion

Like everything else in football, quarterback value is enormously context-dependent. No single metric can fully capture the interplay between play design, protection, route structure, defensive disguises, and the quarterback’s decisions, let alone how a roster is configured around him. However, I believe that even for what I’m trying to capture with WAR, there are some reasonable criticisms and limitations to my approach.

First off, while some of my starting heuristics were data-driven, I took a conservative approach to what I considered replacement level (the 25th percentile of quarterbacks with at least 100 plays in a season, applied across all four seasons). If I were to adhere to a more traditional WAR framework like those in baseball or basketball, I might instead use a lower threshold, or outright rename the stat to something like Wins Above Practice Squad or Wins Above the typical free agent at the position.

It’s also worth noting that this version of WAR was created in a vacuum. Without measuring the specific impacts of players at other positions, like wide receiver or offensive linemen or even defense, it’s difficult to take a quarterback’s actual isolated contributions for granted outside of the very limited EPA framework. Because I’m not a professional data scientist (and am a grad school student), I’ve used EPA as a proxy for team context and individual production, but it’s not necessarily the most disentangled statistic in the world. 

Lastly, this version of WAR was more focused on measuring value and describing how well a player performed when on the field. To actually forecast WAR for the future, I would need to implement aging curves, take into account injury history, and apply more advanced machine learning methods to predict how someone would perform in the future. 

Still – for a follow-up attempt to my previous failures, I think this was a good start. Over the next month, here’s my planned schedule for my WAR series:

  • November 21: Offensive Skill Position Players (HBs, WRs, TEs)
  • November 28: Offensive Linemen
  • December 5: Defensive Players (EDs, DIs, LBs, DBs)
  • December 12: Can WAR work for special teams?
  • December 19: Using WAR to make free agency and draft decisions

See you next week. 

Project GitHub

Published by EdwinBudding

Anokh Palakurthi is a writer from Boston who is currently pursuing his masters degree in business analytics at Brandeis University. In addition to writing weekly columns about Super Smash Bros. Melee tournaments, he also loves writing about the NFL, NBA, movies, and music.
