import numpy as np
import pandas as pd

# California Workplace Area Characteristics (WAC) data from LODES
data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'
df = pd.read_csv(data_url, compression='gzip')

Creating Subsets of the Data

Many times, we want to work with just a subset of the data. That is, we might want to only take some of the rows or only take some of the columns.

Let’s look at an example by trying to find how many census blocks had more than 50 jobs. We can do this by including a conditional statement inside brackets.

df_over50 = df[df['C000'] > 50]
df_over50.shape
(47191, 53)

To see how exactly this works, let’s see what happens when we run only what’s inside the brackets.

df['C000'] > 50 # This will be a pandas Series with True and False values
0         False
1         False
2         False
3         False
4         False
          ...  
243457    False
243458    False
243459    False
243460    False
243461    False
Name: C000, Length: 243462, dtype: bool

The above code evaluates, for each row of the data frame, whether the value in C000 is above 50, producing a boolean Series whose length equals the number of rows: True where the condition holds, False where it doesn't. Putting this conditional statement inside the brackets tells pandas to keep only the rows where the Series is True.
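To see the mechanics on something small, here is a sketch with a made-up four-row frame (the `toy` data and its values are invented for illustration; `C000` plays the same role as in the jobs data):

```python
import pandas as pd

# A tiny, made-up frame standing in for the jobs data: C000 is total jobs.
toy = pd.DataFrame({'C000': [10, 75, 200, 30]})

mask = toy['C000'] > 50         # boolean Series, one True/False per row
subset = toy[mask]              # keeps only the rows where the mask is True

print(mask.tolist())            # [False, True, True, False]
print(subset['C000'].tolist())  # [75, 200]
```

The mask and the frame are aligned by position, so rows 1 and 2 (values 75 and 200) survive while the rest are dropped.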

We can also use more complicated subsets. For example, let's say we wanted to look at the census blocks with strictly between 50 and 100 jobs (that is, more than 50 and fewer than 100).

df_50to100 = df[(df['C000'] > 50) & (df['C000'] < 100)]
df_50to100.shape
(17877, 53)

Notice that we replaced our conditional statement from before with (df['C000'] > 50) & (df['C000'] < 100). The & operator says that we want each row that meets both conditions, (df['C000'] > 50) and (df['C000'] < 100). The parentheses around each condition are required: & binds more tightly than the comparison operators, so leaving them out raises an error.
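As an aside, pandas also offers `Series.between` for range conditions; in pandas 1.3 or newer, `inclusive='neither'` excludes both endpoints, matching the strict inequalities above. A minimal sketch on an invented toy frame (the `toy` values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'C000': [10, 75, 200, 30, 50, 100]})

# Two chained comparisons combined with &, as in the text
a = toy[(toy['C000'] > 50) & (toy['C000'] < 100)]

# Equivalent range test with Series.between; inclusive='neither'
# excludes both endpoints (requires pandas >= 1.3)
b = toy[toy['C000'].between(50, 100, inclusive='neither')]

print(a['C000'].tolist())  # [75]
print(b['C000'].tolist())  # [75]
```

Both forms keep only the value 75, since 50 and 100 themselves fail the strict inequalities.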

What about the rest? Well, we can find the rows with 50 or fewer jobs OR 100 or more jobs, combining the two conditions with | (OR).

df_not50to100 = df[(df['C000'] <= 50) | (df['C000'] >= 100)]
df_not50to100.shape
(225585, 53)
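Another way to get the complement of a condition is the `~` operator, which flips every True/False in a boolean mask. A small sketch on an invented toy frame (values made up for illustration) shows that negating the "between" mask gives the same rows as the explicit OR:

```python
import pandas as pd

toy = pd.DataFrame({'C000': [10, 75, 200, 30]})

inside = (toy['C000'] > 50) & (toy['C000'] < 100)

# The complement of "strictly between", written two equivalent ways
outside_or = toy[(toy['C000'] <= 50) | (toy['C000'] >= 100)]
outside_neg = toy[~inside]   # ~ negates each element of the mask

print(outside_or['C000'].tolist())   # [10, 200, 30]
print(outside_neg['C000'].tolist())  # [10, 200, 30]
```

Negating with `~` saves you from rewriting (and possibly getting wrong) the flipped inequalities by hand.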

Checkpoint: Explore The Data

Recall that we have a few datasets available in the environment for you to practice with:

  • Illinois: il_wac_S000_JT00_2015.csv

  • Indiana: in_wac_S000_JT00_2015.csv

  • Maryland: md_wac_S000_JT00_2015.csv

Load in another dataset and explore it. Do the numbers of rows and columns make sense? Try subsetting the dataset. How do the results compare to those from California? Do they make sense?