import numpy as np
import pandas as pd
data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'
df = pd.read_csv(data_url,compression='gzip')
Checking for Missing Values
Now that we have our data set, let’s do a quick check for missing values. This is typically one of the first things you’ll want to do when exploring a new data set.
Below, we’ve shown two different ways of writing the same thing. Using isnull() gives us a data frame of the same size with True and False values depending on whether each value was missing or not. Then, sum() sums each column. Since Python treats True as 1 and False as 0, the sum of each column gives us the total number of missing values for each variable.
df_null = df.isnull()
df_null.sum()
w_geocode 0
C000 0
CA01 0
CA02 0
CA03 0
CE01 0
CE02 0
CE03 0
CNS01 0
CNS02 0
CNS03 0
CNS04 0
CNS05 0
CNS06 0
CNS07 0
CNS08 0
CNS09 0
CNS10 0
CNS11 0
CNS12 0
CNS13 0
CNS14 0
CNS15 0
CNS16 0
CNS17 0
CNS18 0
CNS19 0
CNS20 0
CR01 0
CR02 0
CR03 0
CR04 0
CR05 0
CR07 0
CT01 0
CT02 0
CD01 0
CD02 0
CD03 0
CD04 0
CS01 0
CS02 0
CFA01 0
CFA02 0
CFA03 0
CFA04 0
CFA05 0
CFS01 0
CFS02 0
CFS03 0
CFS04 0
CFS05 0
createdate 0
dtype: int64
We did this in two separate lines, but that’s actually not necessary. In fact, we can do it all in one go:
df.isnull().sum()
w_geocode 0
C000 0
CA01 0
CA02 0
CA03 0
CE01 0
CE02 0
CE03 0
CNS01 0
CNS02 0
CNS03 0
CNS04 0
CNS05 0
CNS06 0
CNS07 0
CNS08 0
CNS09 0
CNS10 0
CNS11 0
CNS12 0
CNS13 0
CNS14 0
CNS15 0
CNS16 0
CNS17 0
CNS18 0
CNS19 0
CNS20 0
CR01 0
CR02 0
CR03 0
CR04 0
CR05 0
CR07 0
CT01 0
CT02 0
CD01 0
CD02 0
CD03 0
CD04 0
CS01 0
CS02 0
CFA01 0
CFA02 0
CFA03 0
CFA04 0
CFA05 0
CFS01 0
CFS02 0
CFS03 0
CFS04 0
CFS05 0
createdate 0
dtype: int64
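Every count here is 0, so this particular file happens to have no missing values. In a data set that did have them, you might want to list just the affected columns. Here’s a minimal sketch of one way to do that, filtering the counts themselves:

null_counts = df.isnull().sum()
null_counts[null_counts > 0]  # Keep only columns with at least one missing value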
We can also drop any duplicated rows.
df_no_dups = df.drop_duplicates()
df_no_dups.shape # Check how many rows there are after dropping duplicates
(243462, 53)
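Comparing this shape to df.shape tells you how many rows were dropped. If you just want a count of duplicate rows without creating a new data frame, one option is the duplicated() method, which returns a Series of True and False values that we can sum, just like the missing-value check above:

df.duplicated().sum()  # Number of rows that are exact copies of an earlier row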
Checking for Inconsistencies
If you check the data documentation, you’ll see that C000 is the total number of jobs. Therefore, it would make sense for the other groups of columns to add up to the values in C000. For example, you’d expect CA01, CA02, and CA03 to add up to C000 for each row. Let’s check to see if this is true.
We’ll first take the sum of CA01, CA02, and CA03 in each row and put that in a new column called CA_sum. Then, we’ll compare our new CA_sum column to the existing C000 column to see if they match. We’ll first show all the code, then explain each section.
# Create a list with the columns we want to add up
vars_to_check = ['CA01','CA02','CA03']
# Using apply to sum the columns for each row
df['CA_sum'] = df[vars_to_check].apply(sum,axis=1)
# Check how many rows don't match
sum(df.CA_sum != df.C000)
0
We first created a list called vars_to_check, which contains the columns that we want to add up. Then, we took those columns from df and used the apply() method, which applies the same function to each row (or to each column, if we passed axis=0 instead of axis=1). In this case, we want to find the sum of each row, so the first argument is sum. We want to create a new column that contains this sum, so we assign the result to a new column in df, CA_sum. Notice that this is the first place we see 'CA_sum', because this is where we create it.
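As an aside, apply() works here, but pandas data frames also have a built-in sum() method that can do the same row-wise sum directly. A minimal equivalent sketch:

df['CA_sum'] = df[vars_to_check].sum(axis=1)  # Same result as the apply() version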
Lastly, we want to check in how many rows C000 and CA_sum differ. We do this by using

df.CA_sum != df.C000

which outputs a Series of True and False values: True if the value in CA_sum is not equal to the value in C000 for that row, and False otherwise. We can then pass that Series to Python’s built-in sum function to add up how many times the values didn’t match. If there are no mismatches, the sum should be 0.
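If the sum had come back nonzero, a natural next step would be to look at the rows where the parts don’t add up. Here’s a minimal sketch that reuses the same comparison as a filter (the three columns selected are just for readability):

mismatches = df[df.CA_sum != df.C000]  # Rows where the parts don't add up to the total
mismatches[['w_geocode','C000','CA_sum']].head()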
Checking for Outliers
Suppose we want to check if there are any outliers in the total number of jobs by census block. We can sort the values in C000 in order to figure this out. Let’s say we want to find the top ten census blocks by total number of jobs.
df.sort_values("C000",ascending=False).head(10)
|  | w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ... | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate | CA_sum |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 52952 | 60372074001016 | 72275 | 5546 | 47140 | 19589 | 2129 | 6625 | 63521 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 72275 |
| 64327 | 60373104003007 | 50650 | 6975 | 31905 | 11770 | 33722 | 5798 | 11130 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 50650 |
| 59358 | 60372655101002 | 24628 | 3643 | 15326 | 5659 | 2238 | 4323 | 18067 | 45 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 24628 |
| 186347 | 60750177002001 | 23553 | 1858 | 15018 | 6677 | 1237 | 1750 | 20566 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 23553 |
| 130619 | 60599800001001 | 23071 | 9319 | 9436 | 4316 | 7413 | 10538 | 5120 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 23071 |
| 48981 | 60371437001000 | 22683 | 4736 | 12397 | 5550 | 14933 | 3456 | 4294 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 22683 |
| 97829 | 60377030012000 | 21971 | 1963 | 13008 | 7000 | 14741 | 925 | 6305 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 21971 |
| 64362 | 60373105011027 | 20663 | 3268 | 13510 | 3885 | 7717 | 4444 | 8502 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 20663 |
| 212152 | 60855081012002 | 20104 | 3603 | 14572 | 1929 | 246 | 1643 | 18215 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 20104 |
| 52965 | 60372074001037 | 19469 | 1995 | 12795 | 4679 | 1603 | 1839 | 16027 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 19469 |

10 rows × 54 columns
Let’s break this down piece by piece. First, we use the sort_values() method to sort the data frame by C000. We use ascending=False so that the highest values are at the top (the default is to sort in ascending order). This would give us

df.sort_values("C000",ascending=False)

However, we don’t want to look at everything. Here, we use head() to give us only the top ten values after sorting. This gives us the final code, df.sort_values("C000",ascending=False).head(10).
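As an aside, pandas also has an nlargest() method that combines the sort and head() steps into one call. A minimal equivalent sketch:

df.nlargest(10,"C000")  # Ten rows with the largest values of C000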
Checkpoint: Descriptive Statistics on Your Data
Using the tools described above, look at the data you loaded in earlier. Make sure you know the answers to each of the following questions:
Are there any missing values?
Are there any inconsistencies in the data?
Are there missing values that may not have been coded as missing?
Are there any interesting outliers?
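As a starting point for the third question, keep in mind that missing values are sometimes coded as placeholder numbers rather than left blank. One minimal sketch is to scan the summary statistics for implausible values:

df.describe()  # Look for suspicious minimums or maximums, e.g. -999 placeholders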