In [None]:
import numpy as np 
import pandas as pd 
data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'
df = pd.read_csv(data_url,compression='gzip')

# Checking for Missing Values
Now that we have our data set, let's do a quick check for missing values. This is typically one of the first things you'll want to do when exploring a new data set. 

Below, we've shown two different ways of writing the same thing. Using `isnull()` gives us a data frame of the same size with `True` and `False` values depending on whether it was a missing value or not. Then, `sum()` sums each column. Since Python treats `True` as `1` and `False` as `0`, the sum of each column gives us the total number of missing values for each variable.

In [None]:
df_null = df.isnull()
df_null.sum()

We did this in two separate lines, but that's actually not necessary. In fact, we can do it all in one go:

In [None]:
df.isnull().sum()
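To see what each step produces, here is a minimal sketch on a small made-up DataFrame. The name `toy` and its values are invented for illustration and are not part of the LODES data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the LODES file), with two missing values
toy = pd.DataFrame({
    "C000": [10, np.nan, 7],
    "CA01": [3, 2, np.nan],
    "CA02": [4, 1, 2],
})

# Same shape as toy: True where a value is missing, False otherwise
null_mask = toy.isnull()

# Summing each column counts the True values (True counts as 1)
missing_per_column = null_mask.sum()
print(missing_per_column)
```

Here `C000` and `CA01` each have one missing value and `CA02` has none, so the per-column counts are 1, 1, and 0.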

We can also drop any duplicated rows.

In [None]:
df_no_dups = df.drop_duplicates()
df_no_dups.shape # Check how many rows there are after dropping duplicates
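Before dropping duplicates, it can help to count how many there are. The sketch below uses `duplicated()`, which flags each row that exactly repeats an earlier row. The toy DataFrame and the `block_id` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the second
toy = pd.DataFrame({
    "block_id": [1, 2, 2, 3],
    "C000": [5, 8, 8, 1],
})

# True for every row that is an exact copy of an earlier row
n_dups = toy.duplicated().sum()

# Keep only the first occurrence of each row
toy_no_dups = toy.drop_duplicates()

print(n_dups)             # number of duplicated rows
print(toy_no_dups.shape)  # shape after dropping them
```

If the count is zero, `drop_duplicates()` returns a DataFrame with the same number of rows as the original.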

## Checking for Inconsistencies

If you check the data documentation, you'll see that `C000` is the total number of jobs. Therefore, it would make sense for the other groups of columns to add up to the values in `C000`. For example, you'd expect `CA01`, `CA02`, and `CA03` to add up to `C000` for each row. Let's check to see if this is true.

We'll first take the sum of `CA01`, `CA02`, and `CA03` in each row and put that in a new column called `CA_sum`. Then, we'll compare our new `CA_sum` column to the existing `C000` column to see if they match. We'll first show all the code, then explain each section.

In [None]:
# Create a list with the columns we want to add up
vars_to_check = ['CA01','CA02','CA03']

# Using apply to sum the columns for each row
df['CA_sum'] = df[vars_to_check].apply(sum,1)

# Check how many rows don't match
sum(df.CA_sum != df.C000)

We first created a list called `vars_to_check`, which contains the columns that we want to add up. Then, we took those columns from `df` and used the `apply()` method, which applies the same function to each row (or to each column, if we pass `0` as the second argument instead of `1`). In this case, we want to find the sum of each row, so the first argument is `sum`. We want to create a new column that contains this sum, so we assign the result to a new column in `df`, `CA_sum`. Notice that this is the first place we see `'CA_sum'`, because this is where we are creating it.

Lastly, we want to count the rows in which `C000` and `CA_sum` differ. We do this by using

    df.CA_sum != df.C000

which outputs a Series of `True` and `False` values: `True` if the value in `CA_sum` is not equal to the value in `C000` for that row, and `False` otherwise. We can then use Python's built-in `sum` function to count how many times they didn't match. If there are no inconsistencies, the sum should be 0.
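As an alternative sketch, the same check can be written with pandas' own vectorized methods: `sum(axis=1)` for the row totals and the Series `.sum()` method for the count. On a large DataFrame this is usually faster than `apply`. The toy data below is made up for illustration, and note that `sum(axis=1)` skips missing values by default.

```python
import pandas as pd

# Hypothetical toy data: the last row's parts do not add up to C000
toy = pd.DataFrame({
    "C000": [10, 6, 9],
    "CA01": [3, 2, 4],
    "CA02": [4, 1, 3],
    "CA03": [3, 3, 1],
})

vars_to_check = ["CA01", "CA02", "CA03"]

# Row-wise sum of the selected columns (axis=1 means "across columns")
toy["CA_sum"] = toy[vars_to_check].sum(axis=1)

# Boolean Series: True where the totals disagree
mismatch = toy["CA_sum"] != toy["C000"]

n_mismatch = mismatch.sum()    # how many rows disagree
mismatched_rows = toy[mismatch]  # the rows themselves, for inspection
print(n_mismatch)
print(mismatched_rows)
```

Filtering with the boolean Series, as in `toy[mismatch]`, shows which rows are inconsistent rather than only how many.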

## Checking for Outliers

Suppose we want to check if there are any outliers in total number of jobs by census block. We can sort the values in `C000` in order to figure this out. Let's say we want to find the top ten census blocks by total number of jobs.

In [None]:
df.sort_values("C000",ascending=False).head(10)

Let's break this down piece by piece. First, we use the `sort_values()` method to sort the DataFrame by `C000`. We use `ascending=False` so that the highest values are at the top (the default is to sort in ascending order). This would give us

    df.sort_values("C000",ascending=False)

However, we don't want to look at everything. Here, we use `head(10)` to keep only the first ten rows after sorting. This gives us the final code, `df.sort_values("C000",ascending=False).head(10)`.
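For reference, pandas also provides `nlargest()`, which returns the rows with the largest values in a column in one call. The sketch below compares the two approaches on made-up data (`toy` is hypothetical). When several rows tie on the sorted column, the two approaches may order the tied rows differently.

```python
import pandas as pd

# Hypothetical toy data with one unusually large value
toy = pd.DataFrame({
    "block_id": [1, 2, 3, 4],
    "C000": [5, 50, 9, 1],
})

# Sort descending, then keep the first two rows
top_sorted = toy.sort_values("C000", ascending=False).head(2)

# Same idea in a single call
top_nlargest = toy.nlargest(2, "C000")

print(top_sorted)
print(top_nlargest)
```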

## <span style="color:red">Checkpoint: Descriptive Statistics on Your Data</span>

Using the tools described above, look at the data you loaded in earlier. Make sure you know the answers to each of the following questions:
- Are there any missing values?
- Are there any inconsistencies in the data? 
- Are there missing values that may not have been coded as missing?
- Are there any interesting outliers?