I’ve always been a Green Bay Packers fan, which may be a bit surprising given that I’m currently in Chicago as a part of the fall cohort of the Metis Data Science Bootcamp. But I am a Cheesehead born and raised. My family may not agree on everything, but we can agree when it comes to our NFL fandom.
Why look at contract data? My initial interest in this question comes from looking at the quarterback position. Since the 2010 season, the Super Bowl-winning quarterbacks have been Aaron Rodgers, Eli Manning, Joe Flacco, Russell Wilson, Tom Brady (three times), Peyton Manning, and Nick Foles. Of these, Joe Flacco and Russell Wilson were on their rookie contracts, while Nick Foles was the backup to a starter on a rookie contract.
I found this situation interesting. If you also consider Super Bowl runners-up, three more quarterbacks on rookie contracts appear. Furthermore, the 2018 NFL MVP, Patrick Mahomes, is still on his rookie contract. These rookie deals are notably cheaper than the contracts the same players sign later in their careers.
This motivates my analysis. Does having additional cap space to spend on other position groups help a team win? More generally, what can linear regression tell us about spending patterns in relation to regular season wins?
To get started, I got well acquainted with BeautifulSoup, a Python library for parsing HTML and XML. With it, I scraped the relevant contract data from Spotrac.com. This data is limited less by availability than by relevance: the collective bargaining agreement between the NFL and the NFL Players Association fundamentally determines which contract financials actually matter.
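The scraper itself is just the usual requests-plus-BeautifulSoup loop. The URL and table structure below are placeholders rather than Spotrac's actual markup, so take this as a sketch of the pattern, not my exact code:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; the real page and its table layout may differ.
url = "https://www.spotrac.com/nfl/green-bay-packers/cap/"

resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

# Pull the text out of every row of the first table on the page.
table = soup.find("table")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
    for tr in table.find_all("tr")
]

for row in rows[:5]:
    print(row)
```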
Once I had the contract data, I went to Wikipedia and acquired the rest of my data. Then came the hard part! This data is not at all suited to the techniques I have at hand, and as an NFL fan, I can tell you that knowing how a team paid its punter is not the most direct measure of its success!
Regardless, there are still important modeling considerations to entertain. For this project at Metis I was restricted to forms of linear regression (with regularization). Fighting overfitting became my main difficulty, as the dataset was quite small (~200 rows). I decided to use LASSO regression, a linear regression with an L1 penalty on the coefficients. What does that mean? The algorithm doesn't just try to match the training data; it also punishes itself for the sum of the absolute values of its coefficients. Adding that constraint reduces its ability to overfit.
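In scikit-learn terms, the fit looks something like the sketch below. The data here is synthetic stand-in noise; my real features came from the scraped contract data:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: ~200 team-seasons of spending features.
X = rng.random((200, 10))
y = rng.normal(8, 3, size=200)  # wins in a 16-game season, roughly

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scaling matters for LASSO: the L1 penalty hits every coefficient
# equally, so features need to be on comparable scales. A larger
# alpha means a stronger penalty and more coefficients driven to zero.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"Test RMSE: {rmse:.2f} wins")
```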
In the end, this model is not one to take to Vegas! On my test set it had a root mean square error of ~2.8; that is, its predictions are generally around 3 games off over a 16-game season.
Whew! That was quite the first week at the Metis data science bootcamp, but in a great way. This past week focused on getting us up and running and doing some exploratory data analysis with a group project.
All in all, we ended up spending Tuesday, Wednesday, and Thursday afternoon working to produce … three or four different charts for a six minute presentation. Let’s get into it!
Our end goal was to understand traffic data from the New York City Metropolitan Transportation Authority (MTA). This publicly available data records all turnstile activity in the entire MTA system. Our task was to advise our (fictional) clients on where to place outreach staff to advertise an upcoming gala.
The Data
In order to say anything about the data, let’s first examine the format a bit. There are eleven different quantities in each file of MTA data. Let’s take a look at them.
C/A, UNIT, SCP: These serve to uniquely identify which turnstile the data is from. In fact, UNIT and SCP alone give a reliable identifier, so we augmented the MTA data with a new column we called ID.
STATION: This is what you’d expect: an identifier for the specific station the turnstile is in.
LINENAME: The different MTA lines accessible from this turnstile. We did not use this in our analysis.
DIVISION: I have no clue what this means. It’s in the documentation, but it seems to be a historical remnant from before multiple systems were integrated into the MTA?
DATE, TIME: The date and time of the turnstile reading. By default these occur every four hours. Most of the time this will be midnight/4AM/8AM/… or 1AM/5AM/9AM/… but really anything can happen here.
DESC: Whether this was a regular reading or one recovering a failed reading, in which case it reads “RECOVR AUD”. These occur rarely enough that we ignored them in our analysis.
ENTRIES, EXITS: The cumulative entries or exits for that turnstile. To get the entries or exits in a time period, you take the difference between adjacent readings.
I know that seems like a lot, but in the end, the ID (UNIT + SCP), the date, the time, the station, and the entries in each time period were all we needed for some decent analysis and results.
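Here is a minimal pandas sketch of that setup. The filename is a placeholder for whichever weekly file you download from the MTA, so treat the details as illustrative:

```python
import pandas as pd

# One weekly turnstile file from the MTA; the filename is a placeholder.
df = pd.read_csv("turnstile_180901.txt")
df.columns = df.columns.str.strip()  # the last column header carries stray whitespace

# UNIT + SCP alone reliably identifies a turnstile.
df["ID"] = df["UNIT"] + "-" + df["SCP"]

# ENTRIES is cumulative, so entries in a period come from differencing
# adjacent readings of the same turnstile, in time order.
df["DATETIME"] = pd.to_datetime(df["DATE"] + " " + df["TIME"])
df = df.sort_values(["ID", "DATETIME"])
df["PERIOD_ENTRIES"] = df.groupby("ID")["ENTRIES"].diff()
```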
Data Cleaning Sucks
Over this past summer I worked on some data-science-y research with undergrads at the University of Illinois. This past week made me really glad that on that project I could assign the data cleaning to the undergrads.
How bad could the MTA data be? The answer is that it was surprisingly not that bad. We found three different sorts of issues when digging into the data.
Negative entries or exits: Sometimes the computed entries in a time period went negative, which is clearly wrong. This affected a relatively small portion of turnstiles, so we removed those ~90 turnstiles entirely; over 4900 reliable turnstiles remained.
Huge numbers of entries: One reading gave over 30 million people going through a single turnstile in 4 hours. This is clearly a very busy turnstile, with over 2,083 people moving through per second.
RECOVR AUD: Sometimes we would get both a regular reading and a recovered audit for the same turnstile and time. This was even rarer than negative entries or exits, so we did not worry about it in our analysis.
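Continuing with the df from the loading sketch above, the cleaning boils down to a few filters. The 10,000-entries cap is a made-up threshold for illustration; the real cutoff is a judgment call:

```python
# Keep only regular readings; the rare duplicate "RECOVR AUD" rows get dropped.
df = df[df["DESC"] == "REGULAR"]

# Remove, entirely, any turnstile that ever reports negative entries in a period.
bad_ids = df.loc[df["PERIOD_ENTRIES"] < 0, "ID"].unique()
df = df[~df["ID"].isin(bad_ids)]

# Filter absurd readings: 30 million entries in four hours is a counter
# glitch, not foot traffic. (This also drops each turnstile's first row,
# whose diff is NaN.) The cap here is an illustrative threshold.
df = df[df["PERIOD_ENTRIES"] < 10_000]
```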
How does one measure busyness?
One interesting takeaway from watching the presentations by the three other groups in the cohort was that each group had a slightly different proposal. Some groups focused on the stations with the most total entries, others on entries across different timeframes. My group decided to find the stations that had the busiest peak hours on average. That is, we first summed each station’s entries across days while keeping the time of day separate, then found the busiest time of day for each station and averaged it across weekdays. Why only weekdays? It turns out that on average the MTA is around twice as busy on weekdays as on weekends. The pretty plot I made for our presentation shows the resulting ranking.
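For the curious, here is roughly what that aggregation looks like in pandas, continuing from the cleaned DataFrame in the sketches above (column names follow those sketches, so this is illustrative rather than our exact notebook code):

```python
# Restrict to weekdays (Monday=0 ... Friday=4), since weekdays are
# roughly twice as busy as weekends.
weekdays = df[df["DATETIME"].dt.dayofweek < 5]

# Average entries for each station and time-of-day across weekdays...
by_station_time = weekdays.groupby(["STATION", "TIME"])["PERIOD_ENTRIES"].mean()

# ...then score each station by its single busiest time of day.
peak = by_station_time.groupby(level="STATION").max()
print(peak.sort_values(ascending=False).head(10))
```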
Conclusion
This week was busy but fun. In another post I may say more about myself and my background, but suffice it to say that working with my group and being around my cohort is a good bit more fun than being a PhD student suffering through finishing my thesis. Peace.