Travis Nell An Illinois Model Theorist in Data Science's Court:
    About     Archive     Feed     Tutoring

Turnstyles

Intro

Whew! That was quite the first week in the Metis data science bootcamp, but in a great way. This past week focused on getting us up and running and able to do some exploratory data analysis with a group project.

All in all, we ended up spending Tuesday, Wednesday, and Thursday afternoon working to produce … three or four different charts for a six minute presentation. Let’s get into it!

Our end goal was to understand traffic data from the New York City Metropolitan Transportation Authority (MTA. This publically available data records all turnstile activity in the entire MTA system. Our endgame with this data was to advise our (fictional) clients on where to place outreach staff to advertise an upcoming gala.

The Data

In order to say anything about the data, let’s first examine the format a bit. There are eleven different quantities in each file of MTA data. Let’s take a look at them.

  • C/A, UNIT, SCP: These serve to uniquely determine which turnstile this data is from. In fact, UNIT and SCP alone give a reliable identifier. So we augmented the MTA data with a new column of data we called ID
  • STATION: This is what you’d expect, it’s an identifier for the specific station the turnstile is in
  • LINENAME: The different MTA lines accessible from this turnstile. We did not use this in our analysis
  • DIVISION: I have no clue what this means. It’s in the documentation, but it seems to be a historical remnant from before multiple systems were integrated in the MTA?
  • DATE, TIME: The date and time of the turnstile reading. By default these occur every four hours. Most of the time this will be midnight/4AM/8AM/… or 1AM/5AM/9AM/… but really anything can happen here.
  • DESC: Whether this was a regular reading or one recovering a failed reading, in which case it reads “RECOV AUD”. These occur rarely enough that in our analysis we ignored them.
  • ENTRIES, EXITS: The cumulative entries or exits for that turnstile. In order to get entries or exits in a time period, the difference in adjacent readings is neceessary

I know that seems a lot, but in the end, the ID (unit + scp), the date, the time, the station, and the entries in that time period were all that was needed for some decent analysis and results.

Data Cleaning Sucks

Over this past summer I worked some data sciencey research with some undergrads at the University of Illinois. This past week made me really glad that in that project I could assign the undergrads to do the data cleaning.

How bad could the MTA data be? The answer is that it was surprisingly not that bad. We found three different sorts of issues when digging into the data.

  • Negative entries or exits: Sometimes the entries in a time period went negative. There’s clearly something wrong with this. This was a relatively small portion of turnstiles. We ended up just removing these ~90 turnstiles completely as there were over 4900 reliable turnstiles.
  • Huge numbers of entries: One reading gave over 30 million people going through a single turnstile in 4 hours. This is clearly a very busy turnstile, with over 2083 people moving through this turnstile per second.
  • RECOVR AUD: Sometimes we would get both a regular reading and an audit for a turnstile. This was even rarer than getting a negative for the entries or exits, so we did not worry about this in our analysis.

How does one measure busyness?

One interesting take away from watching the three other presentations by the other groups in the cohort was that each group had a slightly different proposal. Some groups focused on the stations with the most entries total, others on entries across different timeframes. My group decided that we wanted to find those stations that on average had the busiest peak hours. That is, we first added up each station across different days, while keeping time of day separate. Then we found the busiest time of day for each station, and averaged that across the weekdays. Why the weekdays? It turns out that on average the MTA is around twice as busy on weekdays compared to weekends. Let’s now look at a pretty plot! This one is one that I made to place in our presentation.

Image

Conclusion

This week was busy but fun. In another post I may say more about myself and my background, but suffice it to say that working with my group and being around my cohort is a good bit more fun than being a PhD student suffering through finishing my thesis. Peace.