In [2]:
import numpy as np
import pandas as pd

# 2015 California Workplace Area Characteristics (WAC) file from LODES
data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'
df = pd.read_csv(data_url, compression='gzip')

# Exploring the Data Frame

Now that we've loaded in the data set as a Data Frame, let's check the number of rows and columns. We can do this by looking at the `shape` attribute of a data frame.

In [None]:
df.shape

It looks like there are 243,462 rows and 53 columns. 

Let's also find out the names of all the variables in this data set. 

In [None]:
df.columns

To get more information about the contents of the Data Frame, we can use the `.info()` method. This will give us the number of non-null values and the type of data (these have all been read in as integers) for each column.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243462 entries, 0 to 243461
Data columns (total 53 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   w_geocode   243462 non-null  int64
 1   C000        243462 non-null  int64
 2   CA01        243462 non-null  int64
 3   CA02        243462 non-null  int64
 4   CA03        243462 non-null  int64
 5   CE01        243462 non-null  int64
 6   CE02        243462 non-null  int64
 7   CE03        243462 non-null  int64
 8   CNS01       243462 non-null  int64
 9   CNS02       243462 non-null  int64
 10  CNS03       243462 non-null  int64
 11  CNS04       243462 non-null  int64
 12  CNS05       243462 non-null  int64
 13  CNS06       243462 non-null  int64
 14  CNS07       243462 non-null  int64
 15  CNS08       243462 non-null  int64
 16  CNS09       243462 non-null  int64
 17  CNS10       243462 non-null  int64
 18  CNS11       243462 non-null  int64
 19  CNS12       243462 non-null  int64
 20  CNS1
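
One consequence of every column being read as an integer is that an identifier such as `w_geocode` loses any leading zero. The sketch below uses a made-up two-row CSV (the values are invented, not taken from the LODES file) to show how the `dtype` argument of `read_csv` can keep such a column as text. Whether you need this depends on how you plan to use the codes.

```python
import io
import pandas as pd

# Made-up two-row CSV that mimics the layout of the file; the values are invented.
csv_text = "w_geocode,C000\n060010000000001,14\n060010000000002,3\n"

# Default behaviour: the code is parsed as an integer, so the leading zero is dropped.
as_int = pd.read_csv(io.StringIO(csv_text))

# Asking for str keeps the code exactly as written in the file.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"w_geocode": str})

print(as_int["w_geocode"].tolist())
print(as_str["w_geocode"].tolist())
```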

We can use the `head` and `tail` methods in order to look at the first or last few rows of the data frame. 

In [None]:
df.head() # Default is to show first 5 rows.

In [None]:
df.head(10) # We can specify how many rows we want to see.

In [None]:
df.tail(10) # Same as head, except the last 10 instead of first 10

> **Attributes vs Methods:** Note that we used `head()`, with parentheses, while we used just `shape` or `columns`, without parentheses. This is because `shape` and `columns` are **attributes** and `head` is a **method**. To put it another way, `shape` and `columns` are characteristics that each Data Frame object has, and we're just displaying values that already exist. On the other hand, `head` is a method: a function that you call on a certain type of object (in this case, a Data Frame object).
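
To make the difference concrete, here is a minimal sketch on a small toy Data Frame. The values are invented for illustration and are not taken from the LODES file.

```python
import pandas as pd

# Toy data invented for illustration; not taken from the LODES file.
toy = pd.DataFrame({"C000": [14, 3, 1], "CA01": [2, 0, 0]})

shape_value = toy.shape            # attribute: no parentheses, just a stored value
column_names = list(toy.columns)   # attribute: the column labels
first_two = toy.head(2)            # method: called with parentheses, can take arguments

print(shape_value)
print(column_names)
print(first_two)
```

Leaving the parentheses off a method (writing `toy.head`) does not raise an error; it simply refers to the method itself without running it.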

## Accessing the Data Frame
What if we want to only look at certain cells, or certain columns? We can use a variety of commands to do just that.

### Accessing Columns

To access individual columns, we can use square brackets or we can simply use dot notation.

In [None]:
# Look at just total number of jobs (C000)
df["C000"] 

In [None]:
# This does the same thing
df.C000 
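
Dot notation is convenient, but it only works when the column name is a valid Python name and does not clash with something the Data Frame already defines. The sketch below uses a toy Data Frame with two hypothetical column names, `count` and `total jobs`, chosen only to show where square brackets are the safer choice.

```python
import pandas as pd

# Toy data invented for illustration; "count" and "total jobs" are hypothetical column names.
toy = pd.DataFrame({"count": [1, 2, 3], "total jobs": [10, 20, 30]})

by_bracket = toy["count"]    # square brackets always return the column
by_dot = toy.count           # dot notation finds the DataFrame.count method instead
spaced = toy["total jobs"]   # a name containing a space only works with brackets

print(type(by_bracket))
print(callable(by_dot))
print(spaced.sum())
```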

Remember, in Python, we are working with objects that have certain types. When we run the above code, we are accessing a specific column of a Data Frame, and that itself is a different type of object called a **Series**. 

In [5]:
type(df.C000)

pandas.core.series.Series

This can be useful for working with individual columns, because we can then use Series methods to do things like find the mean or standard deviation.

In [4]:
# Mean number of jobs in census blocks that had jobs
df.C000.mean()

65.9188990479007

In [6]:
# Standard deviation
df.C000.std()

369.9156753256301
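
If you want several summary statistics at once, one option is the Series method `describe`. The sketch below runs it on a short toy Series; the numbers are invented for illustration and are not the LODES values.

```python
import pandas as pd

# Toy Series invented for illustration; it stands in for a column such as df.C000.
jobs = pd.Series([14, 3, 1, 9, 3], name="C000")

summary = jobs.describe()    # count, mean, std, min, quartiles, and max in one call
median_jobs = jobs.median()  # individual statistics are also available as methods
max_jobs = jobs.max()

print(summary)
print(median_jobs, max_jobs)
```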

### Accessing Rows 

What if we want to get certain rows? We can use `loc` with square brackets. We use a colon to indicate that we want a range of row labels with a start and an end; unlike ordinary Python slices, a `loc` slice includes both endpoints. We can also leave one side of the colon empty to indicate that we want all the remaining rows on that side.

In [None]:
# Show rows 10 - 20. Remember, the first row is row 0
df.loc[10:20] 

In [None]:
df.loc[:10] # Everything from the first row through the row labeled 10

In [None]:
df.loc[:] # This gives all rows
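
Because `loc` slices by label, the end label is included in the result. The sketch below checks this on a toy Data Frame with 30 rows, whose values are invented for illustration.

```python
import pandas as pd

# Toy data invented for illustration: 30 rows labeled 0 through 29.
toy = pd.DataFrame({"C000": range(100, 130)})

middle = toy.loc[10:20]   # labels 10 through 20, both included
start = toy.loc[:10]      # labels 0 through 10
rest = toy.loc[25:]       # label 25 through the last row

print(len(middle), len(start), len(rest))
```

Counting the rows is a quick way to confirm which endpoints a slice actually includes.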

In addition, we can use `loc` to access certain columns as well as certain indices in the Data Frame.

In [None]:
# Look at rows 10 - 20 for total number of jobs (C000)
df.loc[10:20,"C000"] 

To get a range of columns, we can use the same colon notation. 

In [11]:
# Look at rows 10 - 20 for total number of jobs (C000) and jobs by age group
df.loc[10:20,'C000':'CA03']

    C000  CA01  CA02  CA03
10    14     2     9     3
11     3     0     0     3
12     1     0     0     1
13     9     0     9     0
14     3     1     1     1
15     1     0     0     1
16     2     0     2     0
17     1     0     1     0
18     1     0     1     0
19     7     0     4     3


An alternative to `loc` is `iloc`. It selects rows by their integer position in the Data Frame rather than by their row labels. Most of the time, row labels are numbered sequentially from 0, so `loc` and `iloc` pick similar rows, though a `loc` slice includes its end label while an `iloc` slice stops just before its end position. However, sometimes, especially after creating subsets of the data, you might end up with row labels that no longer run sequentially in steps of one. In those cases, `iloc` might be more useful.
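
As a minimal sketch of that situation, the example below filters a toy Data Frame (values invented for illustration) so that the remaining row labels have gaps, then compares the two accessors. The filter `toy["C000"] > 2` is just one assumed way of creating a subset.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({"C000": [14, 3, 1, 9, 3, 1]})

# Keep rows with more than 2 jobs; the surviving labels are 0, 1, 3, 4.
subset = toy[toy["C000"] > 2]

by_position = subset.iloc[0:2]   # first two rows by position (end excluded)
by_label = subset.loc[0:3]       # labels 0 through 3 (end included)

print(list(subset.index))
print(list(by_position.index))
print(list(by_label.index))
```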

Another use case for `iloc` is in using negative numbers.

In [3]:
df.iloc[-5:]['C000']

243457     3
243458    18
243459    11
243460     8
243461    10
Name: C000, dtype: int64

In this case, we used `-5:` to indicate that we want the last 5 rows of the data frame. Note that we can't do the same with `.loc`: it looks rows up by *label*, so it would treat `-5` as a label rather than counting back from the end, while `.iloc` retrieves rows by *position*.
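
The sketch below illustrates the difference on a toy Data Frame whose values are invented for illustration. It also shows that, for the last few rows, `tail` gives the same result as a negative `iloc` slice.

```python
import pandas as pd

# Toy data invented for illustration: six rows labeled 0 through 5.
toy = pd.DataFrame({"C000": [3, 18, 11, 8, 10, 7]})

last_two = toy.iloc[-2:]      # positions counted back from the end
same_as_tail = toy.tail(2)    # tail() returns the same rows

try:
    toy.loc[-2]               # looks for a row labeled -2, which does not exist
    label_found = True
except KeyError:
    label_found = False

print(last_two)
print(last_two.equals(same_as_tail))
print(label_found)
```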

## <span style="color:red">Checkpoint: Explore Other Data</span>

Copy the code you used in the previous section to bring in another dataset. Try to access different values within the dataset. Can you isolate a certain variable? Look at the documentation and identify a variable to pull out of the dataset. 