Chapter 2 Proposal

2.1 Research topic

Our research topic is U.S. Population Migration patterns from 1990 until 2020. The goal is to explore the inflow and outflow of people between states and counties within the United States of America. With this data, there is an opportunity to create visualizations to detect patterns not only based on geographic location, but also across different age groups and income levels.

Data exploration and the analysis will be cemented on the Migration Data Users guide published by the Internal Revenue Service (IRS).

2.2 Data availability

The data is made available, collected, and maintained by the Internal Revenue Service (IRS), which is a bureau of the Department of the Treasury and one of the world’s most efficient tax administrators. In the fiscal year 2020, the IRS collected almost $3.5 trillion in revenue and processed more than 240 million tax returns. The IRS maintains a lot of tax statistics data with respect to a variety of fields like business tax, individual tax, charities, etc.

The migration data is derived from Form 1040 which is filed by residents of the United States of America. The tax return data of an individual are matched based on their Tax Identification Number (TIN). After the matching procedure is completed, these are classified into one of 4 categories :

  • Non-migrant returns,
  • Migrant return - different state,
  • Migrant return - same state - different county,
  • Migrant return - foreign

2.2.1 Data Format and Frequency

The data is publicly available as zipped Excel files (.xls) on the IRS website, composing multiple datasets for each fiscal year. To import the data, we need to download these files to our local computers and then upload to RStudio, which is the software we will use to conduct the analysis.

The IRS leverages the change in address on the Tax return from one year to another year to generate the migration data. Despite the fact that the IRS provides help services for individuals and businesses looking at their tax returns, there is no contact information for data-related issues. As we encounter questions about the data, our approach will be to explore the online resources available on their website.

2.2.2 Scope

The primary data that we will be analyzing is the summary of the migration flows within the states for 5 years from 2015 to 2020. This data contains the following fields:

  • The number of returns: which indicates the number of households that migrated.
  • The number of individuals: which indicates the number of individuals that migrated.
  • Adjusted Gross income: which indicates the gross income after deducting certain adjustments like dividends, interests, etc…
  • Age: which indicates the age of individuals filing their return
  • Geographical data: Geographical data like State name, State code are also included.
  • Migration status: Migration statuses like Inflow, Outflow, Same state migration is included in the data.

The data for each state is split into 7 categories based on the adjusted gross income of the household.

In addition to this, in case we discover any particular trend or anomaly for a state, we plan to use more granular data per state. This data contains the following fields:

  • Destination information
  • The number of returns: which indicates the number of households that migrated.
  • The number of individuals: which indicates the number of individuals that migrated.
  • Adjusted Gross income: which indicates the gross income after deducting certain adjustments like dividends, interests, etc..

2.2.3 Data Quality

It is important to note that the data does not represent the entire US population because there can be several individuals who might not be filing tax returns despite having an income.

Tax returns that are filled without a state code or with an incorrect state code are excluded. Similarly, if someone filed their tax return after September of a given year, it will not be included to maintain uniformity. Returns received in 2019 indicate the income earned in 2018, and returns received in 2020 indicate the income that was earned in 2019, and so on.

The migration data cannot be compared to other data that relates to sources of income (SOI) because the migration data only compares the tax returns for 2 consecutive years.