Reading Data

Before we can start analyzing data, we first need to bring the data into Python so that we can work with it.

Loading Libraries

We first load a few libraries. Libraries are essentially bundles of useful tools that help us do specific tasks. In this case, we’re bringing in NumPy and Pandas, which are useful for numerical computing and data analysis. Don’t worry too much about the specifics of libraries for now. Just remember that we need to include the first few lines of code below if we want to use many of the tools described in this workbook.

If you’d like to read more about the libraries that we’re loading here, see the official documentation for NumPy and for Pandas. Throughout this book, we’ll keep adding libraries (also called packages or modules in Python) as we need them.

# First, we import packages
# Note that everything after a "#" in Python will be ignored, so you can use it to write comments
import numpy as np # NumPy (Numerical Python) for scientific computing
import pandas as pd # Pandas, for data analysis tools 

Reading in the Data Set

We’ll start by reading in a data set from a CSV (comma-separated values) file. For our examples, we’ll use the Workplace Area Characteristics (WAC) data from California.

We use the read_csv function from pandas to read in the csv file.

data_file = 'ca_wac_S000_JT00_2015.csv'
df = pd.read_csv(data_file)

# We can also use the following line by itself instead
# df = pd.read_csv('ca_wac_S000_JT00_2015.csv')

Let’s break the code down. In the first line, we assign 'ca_wac_S000_JT00_2015.csv' to the variable data_file. Any text inside quotation marks, such as 'ca_wac_S000_JT00_2015.csv', is a string, which makes data_file a string variable. By itself, this line doesn’t do anything fancy: we are only storing the text 'ca_wac_S000_JT00_2015.csv' in a variable, not yet telling Python to look for a file with that name.

To look at what we’ve stored inside data_file, try using the print function with data_file as the argument. What do you think the output will be?

print(data_file)
ca_wac_S000_JT00_2015.csv

In the second line, we’re using the read_csv() function from the pandas package. Notice that we had to include pd. in front of the function. This tells Python to use the read_csv function that is inside the pandas package. The function read_csv outputs a Data Frame, which is then assigned to the variable df. This means that our data is now in a data frame called df.

Note: We only used the file name for data_file. This is because we included the CSV file in the same folder as this notebook. If it were somewhere else, we’d have to include the file path (e.g. "/Documents/Python/ca_wac_S000_JT00_2015.csv"). If you don’t know much about how file paths work, don’t worry: you just need to make sure that the file is in the same folder as the notebook.
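If the file does live in another folder, one way to build the path is with Python's built-in pathlib module. This is only a sketch: the folder Documents/Python is a hypothetical location used for illustration, so replace it with wherever your copy of the file actually is.

```python
from pathlib import Path

# Hypothetical folder, used only for illustration; replace it with your own
data_folder = Path("Documents") / "Python"
data_path = data_folder / "ca_wac_S000_JT00_2015.csv"

print(data_path.name)      # the file name on its own
print(data_path.exists())  # True only if the file is really at that location

# If the file is there, read_csv accepts the path object directly:
# df = pd.read_csv(data_path)
```

Checking exists() before reading is a quick way to tell a wrong path apart from a problem with the file itself.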

Lastly, we’ve included a line of code that can load the csv file into df in one line. This does the exact same thing as the first two lines of code, with the exception of not assigning 'ca_wac_S000_JT00_2015.csv' to data_file. Notice that all we did was replace data_file with the string that we assigned to data_file. It is commented out, but feel free to try running it by itself (commenting out the first two lines) to check that it does the same thing.
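If you want to see read_csv at work without needing the WAC file, the sketch below reads a tiny CSV held in memory. The column names and values are made up for illustration and are not taken from the real data.

```python
import io

import pandas as pd

# A tiny made-up CSV, not the real WAC data
csv_text = "block,jobs\nA,10\nB,25\n"

# io.StringIO lets read_csv treat a string as if it were a file
small_df = pd.read_csv(io.StringIO(csv_text))

print(type(small_df))          # read_csv gives back a Data Frame
print(small_df.shape)          # (number of rows, number of columns)
print(list(small_df.columns))  # column names come from the first line
```

The first line of the text is treated as the header, and each following line becomes one row.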

Data from the Internet

We can also bring in data from a URL if the data is made available in this way. The LODES data was downloaded by following links as described in the documentation. Notably, the documentation says:

For users who want to automate the download of LODES data files, the root location of the directory structure is lehd.ces.census.gov/data/lodes/LODES7/. This location can be accessed directly via a web browser.

This means we can construct the URL of the file we downloaded by starting from that root location and following the directory structure down to the file. The documentation also describes the naming convention for the files. An example of bringing in data using the URL is shown below.

data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'
df_from_url = pd.read_csv(data_url, compression='gzip')

Note that we had to add another argument to the function: compression='gzip'. This is because the CSV file provided through the URL is compressed with gzip (hence the .gz at the end of the file name). Luckily, the pd.read_csv function can decompress it while reading it in.
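To try the compression argument without downloading anything, here is a small sketch that writes a made-up CSV to a temporary gzip file and reads it back. The file name example.csv.gz and its contents are invented for illustration. It also shows that leaving the argument out works here, because read_csv defaults to compression='infer' and looks at the file extension.

```python
import gzip
import os
import tempfile

import pandas as pd

# A tiny made-up CSV, written to a temporary gzip-compressed file
csv_text = "block,jobs\nA,10\nB,25\n"
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "example.csv.gz")
with gzip.open(gz_path, "wt") as f:
    f.write(csv_text)

# Explicit compression argument, as in the URL example
df_explicit = pd.read_csv(gz_path, compression="gzip")

# No compression argument: the default 'infer' picks gzip from the ".gz" extension
df_inferred = pd.read_csv(gz_path)

print(df_explicit.equals(df_inferred))
```

Spelling out compression='gzip' is still a reasonable habit, since it makes the code's intent clear to whoever reads it next.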

A Note on Data Types

We’ve mentioned that data_file is a string variable and that df is a pandas Data Frame. These are different variable types, and it’s important to keep this in mind because the type of variable dictates what you can do with it.

type(data_file)
str

Since data_file is a string, the type function returns str (which stands for string). Let’s look at df. What do you think the output will be?

type(df)
pandas.core.frame.DataFrame

It tells us that df is a pandas Data Frame. As we’ll see later on, a pandas Data Frame, like other Python objects, has specific attributes and functions (called methods) that you can use with it.
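As a rough illustration of how the type decides what you can do, the sketch below compares a string with a small made-up Data Frame. The names file_name and small_df and the values inside them are for illustration only.

```python
import pandas as pd

file_name = 'ca_wac_S000_JT00_2015.csv'
# A small made-up Data Frame, not the real WAC data
small_df = pd.DataFrame({"block": ["A", "B"], "jobs": [10, 25]})

# Strings come with text-related methods
print(file_name.upper())
print(file_name.endswith('.csv'))

# Data Frames come with table-related attributes
print(small_df.shape)
print(list(small_df.columns))

# A string has no "shape", so this check comes back False
print(hasattr(file_name, "shape"))
```

Trying to use something like file_name.shape directly would raise an AttributeError, which is Python's way of saying that this type doesn't have that attribute.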

Checkpoint: Read in Other Data

You can access LODES data from other states by using the link below:

LODES Data

and navigating to the state you want. Check out the LODES documentation for more information.

If you download any data, it will go to your local computer. However, since this notebook is running from a server in the cloud, you’ll have to upload it to this server to access it. In JupyterLab, there should be an Upload Files icon on the left side above the file structure to do this. If you are running this in Colab, you will need to make sure the data is in the same place as the Colab notebook. This will likely be easier if you copy the notebook into your own Colab space.

If you have problems with that, don’t worry. As long as you are using Binder, several other LODES data files are already included in this environment. They are named:

  • Illinois: il_wac_S000_JT00_2015.csv

  • Indiana: in_wac_S000_JT00_2015.csv

  • Maryland: md_wac_S000_JT00_2015.csv

See if you can load one similarly to how you loaded the California data above.

Make sure you assign it to a variable other than df so that you don’t overwrite the data we loaded earlier (for example, if you choose Illinois, you might use df_il).

Try using the URL structure to bring in another data source as well.
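If you get stuck on the URL, one possible approach is sketched below. The function wac_url is a hypothetical helper, not part of pandas or LODES, and it assumes that other states follow the same naming pattern as the California URL used earlier. Check the LODES documentation to confirm the pattern for the state and year you want.

```python
def wac_url(state, year=2015):
    """Hypothetical helper: build a WAC file URL from a state abbreviation.

    Assumes the same pattern as the California example in this section.
    """
    state = state.lower()
    return (
        "https://lehd.ces.census.gov/data/lodes/LODES7/"
        f"{state}/wac/{state}_wac_S000_JT00_{year}.csv.gz"
    )

il_url = wac_url("il")
print(il_url)

# To actually download and read the file (needs an internet connection):
# import pandas as pd
# df_il_from_url = pd.read_csv(il_url, compression='gzip')
```

Printing the URL first and opening it in a browser is an easy way to confirm the file exists before trying to read it.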