Creating Subsets of the Data
Contents
import numpy as np
import pandas as pd
data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'
df = pd.read_csv(data_url,compression='gzip')
Creating Subsets of the Data#
Many times, we want to work with just a subset of the data. That is, we might want to only take some of the rows or only take some of the columns.
Let’s look at an example by trying to find how many census blocks had more than 50 jobs. We can do this by including a conditional statement inside brackets.
df_over50 = df[df['C000'] > 50]
df_over50.shape
(47191, 53)
To see how exactly this works, let’s see what happens when we run only what’s inside the brackets.
df['C000'] > 50 # This will be a pandas Series with True and False values
0 False
1 False
2 False
3 False
4 False
...
243457 False
243458 False
243459 False
243460 False
243461 False
Name: C000, Length: 243462, dtype: bool
The above code evaluates whether the value in C000
is above 50 for each row in the data frame, creating a Series with length equal to the number of rows with “True” or “False” depending on whether that row had a value above 50 in C000
. By putting this conditional statement into the brackets, we are telling Python which rows we want to include in our subsetted data frame using True
and False
.
We can also use more complicated subsets. For example, let’s say we wanted to look at the census blocks with between 50 and 100 jobs.
df_50to100 = df[(df['C000'] > 50) & (df['C000'] < 100)]
df_50to100.shape
(17877, 53)
Notice that we replaced our conditional statement from before with (df['C000'] > 50) & (df['C000'] < 100)
. This says that we want each row that meets both conditions, (df['C000'] > 50)
and (df['C000'] < 100)
.
What about the rest? Well, we can find the rows with less than or equal to 50 jobs OR greater than or equal to 100 jobs.
df_not50to100 = df[(df['C000'] <= 50) | (df['C000'] >= 100)]
df_not50to100.shape
(225585, 53)
Checkpoint: Explore The Data#
Recall that we have a few datasets available in the environment for you to practice with:
Illinois:
il_wac_S000_JT00_2015.csv
Indiana:
in_wac_S000_JT00_2015.csv
Maryland:
md_wac_S000_JT00_2015.csv
Load in another dataset and explore it. Do the number of rows and columns make sense? Try subsetting the data set. How does it compare to the results from California? Does it make sense?