TSA Prediction Market: Part 1 - Web Scraping
Overview
Introduction
This is a five-part series on building a trading bot for next week’s TSA traffic market on Kalshi. We will be covering the following:
- Part 1: Web scraping to get historical data from the TSA site (Current page)
What is Kalshi?
Kalshi is a prediction market platform where users trade contracts based on the likelihood of real-world events. It offers markets in politics, finance, and current events, allowing users to buy and sell contracts to express their views and potentially profit from accurate predictions. We will talk more about how Kalshi works in a later section.
Part 1: Scraping TSA Data
The data
The first thing we need to build any model is data, so let’s create a process to quickly retrieve the historical TSA volume data. The data is updated daily on the TSA’s passenger volumes page (https://www.tsa.gov/travel/passenger-volumes). This data will be used as the historical source of “truth” for actual TSA traffic. Our model will also leverage newly released data each day to update its predictions.
As you would expect from a government agency, the data is embedded in the page’s HTML with no easy option to download it. So, we will have to build a web scraper. Luckily, this is easy to do in Python.
The web scraper
We could retrieve the HTML ourselves, parse out the table, and convert it to a dataframe, but pandas has already done that work for us. Using the pandas library, we can get this data into a dataframe in a few lines of code.
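Here is a minimal sketch of that step using pandas’ read_html(). The browser-like User-Agent header is an assumption, since some sites refuse requests from the default Python client.

```python
import io

import pandas as pd
import requests

# Current-year TSA passenger volume page
URL = "https://www.tsa.gov/travel/passenger-volumes"

# Assumption: a browser-like User-Agent, since some sites reject the default Python client
response = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
response.raise_for_status()

# read_html() returns a list of every table on the page; the volume table is the first
df = pd.read_html(io.StringIO(response.text))[0]
print(df.head(10))
```

The first ten rows of the resulting dataframe look like this: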
|   | Date      | 2024    | 2023    |
|---|-----------|---------|---------|
| 0 | 4/8/2024  | 2484246 | 2523060 |
| 1 | 4/7/2024  | 2663438 | 2387558 |
| 2 | 4/6/2024  | 2314575 | 2093075 |
| 3 | 4/5/2024  | 2681880 | 2475896 |
| 4 | 4/4/2024  | 2619443 | 2532778 |
| 5 | 4/3/2024  | 2258815 | 2209680 |
| 6 | 4/2/2024  | 2258219 | 2080546 |
| 7 | 4/1/2024  | 2648336 | 2404966 |
| 8 | 3/31/2024 | 2586606 | 2545494 |
| 9 | 3/30/2024 | 2320492 | 2253892 |
That gets us a pandas dataframe with the current year of data! But, looking again at the TSA page, we can see that they have data going back to 2019 on different pages. Let’s get that data too. Each historical page is structured in a similar format but with the addition of the year appended at the end of the url. Like this…
https://www.tsa.gov/travel/passenger-volumes/2022
Let’s create a function to retrieve any given year of data. The current year of data has a different URL structure, which we will need to account for. The function create_request_url() below takes a single year as an argument. If the year to process is the current year, it uses the modified URL format; otherwise, it uses the standard URL format with the year appended to the end. We will iterate through the years we care about, calling this function each time.
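A minimal sketch of create_request_url(), assuming the base URL shown above and the current calendar year as the cutoff:

```python
from datetime import datetime

BASE_URL = "https://www.tsa.gov/travel/passenger-volumes"

def create_request_url(year: int) -> str:
    """Build the TSA passenger-volume URL for a given year.

    The current year lives at the base URL; historical years have the
    year appended as a path segment (e.g. .../passenger-volumes/2022).
    """
    if year == datetime.now().year:
        return BASE_URL
    return f"{BASE_URL}/{year}"
```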
Now we have a function that will generate the appropriate request URL for any given year. Next, we will create a function that uses this URL to actually retrieve the data. Below we define a function fetch_year_of_tsa_data(), which we will call for each year we iterate through. It will:
- Call create_request_url() to build that year’s URL
- Get the data for that year
- If processing the current year, modify the data format to match the other years
- Return the resulting dataframe
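A minimal sketch of this function, assuming the historical pages use a two-column Date/Numbers layout and that the current-year page carries the extra comparison columns shown in the table above:

```python
import io
from datetime import datetime

import pandas as pd
import requests

def fetch_year_of_tsa_data(year: int) -> pd.DataFrame:
    """Fetch one year of TSA checkpoint volumes as a dataframe."""
    url = create_request_url(year)
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()

    # The volume table is the first table on the page
    df = pd.read_html(io.StringIO(response.text))[0]
    df.columns = [str(col) for col in df.columns]  # normalize column labels to strings

    current_year = datetime.now().year
    if year == current_year:
        # The current-year page shows prior years as comparison columns
        # (e.g. "2024" and "2023" above); keep only the current year and
        # rename it to match the assumed Date/Numbers layout of older pages.
        df = df[["Date", str(current_year)]].rename(columns={str(current_year): "Numbers"})

    return df
```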
Finally, we need a wrapper function that iterates through the years, calling fetch_year_of_tsa_data() and consolidating the results.
Here we define a function, fetch_all_tsa_data(). First, we create an empty list that will hold the dataframe from each year. Then, we loop through the years. We know that TSA data starts at 2019, and we don’t expect that to change. For each year, we pull the data and append that dataframe to our list. Finally, once the last year has been processed, we union the dataframes together into a single dataframe.
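A minimal sketch of that wrapper, assuming the 2019 start year and the fetch_year_of_tsa_data() helper above:

```python
from datetime import datetime

import pandas as pd

def fetch_all_tsa_data(start_year: int = 2019) -> pd.DataFrame:
    """Fetch every available year of TSA data and combine it into one dataframe."""
    yearly_frames = []
    for year in range(start_year, datetime.now().year + 1):
        yearly_frames.append(fetch_year_of_tsa_data(year))

    # Union the per-year dataframes into a single dataframe
    return pd.concat(yearly_frames, ignore_index=True)
```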
From here, we can save the dataframe as a CSV for later use.
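For example (the file name here is arbitrary):

```python
tsa_df = fetch_all_tsa_data()
tsa_df.to_csv("tsa_daily_volumes.csv", index=False)
```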
Conclusion
In the first part of this series, we discussed the overall goal of creating a TSA traffic prediction model to (hopefully) make money on binary contract markets. We successfully built a web scraper to collect new data from the TSA website every day.
In upcoming posts, we will enhance the model with supplementary data, discuss evaluation methods, and put the model to the test by setting up a live trading bot.
Ways to improve this code
If you want to follow along and add some extra functionality, here are a few areas to expand on:
Additional logging functionality
In any live system, logging can make debugging more efficient. You could expand on the logging in the code examples and save the log files for later analysis.
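As a hypothetical starting point, Python’s standard logging module can write to a file:

```python
import logging

logging.basicConfig(
    filename="tsa_scraper.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger(__name__)

# Example: record each year as it is fetched
logger.info("Fetching TSA data for %s", 2022)
```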
Error handling
The above code does not implement any error handling. This leaves it susceptible to breaking from small changes in inputs or external formatting changes. Error handling would make the web scraper more robust.
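For example, a sketch that wraps a single year’s fetch so one bad page doesn’t halt the whole run (the helper name here is hypothetical):

```python
import requests

def fetch_year_safely(year: int):
    """Return the year's dataframe, or None if the request or parsing fails."""
    try:
        return fetch_year_of_tsa_data(year)
    except (requests.RequestException, ValueError) as exc:
        # read_html() raises ValueError when it finds no tables on the page
        print(f"Failed to fetch TSA data for {year}: {exc}")
        return None
```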
Data quality checks
We did not implement any data quality checks as part of this process. If the TSA changes the format of their site, we could start reading in incorrect data and we wouldn’t know unless it broke something downstream.
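A sketch of one such check, assuming the Date/Numbers column names used in the sketches above and a rough plausible range for daily passenger counts:

```python
import pandas as pd

def validate_tsa_data(df: pd.DataFrame) -> None:
    """Raise if the scraped data doesn't look like daily TSA volumes."""
    expected = {"Date", "Numbers"}  # assumed column names from the sketches above
    if not expected.issubset(df.columns):
        raise ValueError(f"Unexpected columns: {list(df.columns)}")

    counts = pd.to_numeric(df["Numbers"], errors="coerce")
    # Rough plausibility bounds for daily U.S. checkpoint volumes (assumption)
    if counts.isna().any() or not counts.between(0, 5_000_000).all():
        raise ValueError("Passenger counts are missing or outside a plausible range")
```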