import numpy as np 
import pandas as pd 
data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'
df = pd.read_csv(data_url,compression='gzip')

Exploring the Data Frame#

Now that we’ve loaded in the data set as a Data Frame, let’s check the number of rows and columns. We can do this by looking at the shape attribute of a data frame.

df.shape

(243462, 53)

It looks like there are 243,462 rows and 53 columns.

Let’s also find out the names of all the variables in this data set.

df.columns

Index(['w_geocode', 'C000', 'CA01', 'CA02', 'CA03', 'CE01', 'CE02', 'CE03',
       'CNS01', 'CNS02', 'CNS03', 'CNS04', 'CNS05', 'CNS06', 'CNS07', 'CNS08',
       'CNS09', 'CNS10', 'CNS11', 'CNS12', 'CNS13', 'CNS14', 'CNS15', 'CNS16',
       'CNS17', 'CNS18', 'CNS19', 'CNS20', 'CR01', 'CR02', 'CR03', 'CR04',
       'CR05', 'CR07', 'CT01', 'CT02', 'CD01', 'CD02', 'CD03', 'CD04', 'CS01',
       'CS02', 'CFA01', 'CFA02', 'CFA03', 'CFA04', 'CFA05', 'CFS01', 'CFS02',
       'CFS03', 'CFS04', 'CFS05', 'createdate'],
      dtype='object')

To get more information about the contents of the Data Frame, we can use the .info() method. This will give us the number of non-null values and the type of data (these have all been read in as integers) for each column.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243462 entries, 0 to 243461
Data columns (total 53 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 w_geocode   243462 non-null  int64
 C000        243462 non-null  int64
 CA01        243462 non-null  int64
 CA02        243462 non-null  int64
 CA03        243462 non-null  int64
 CE01        243462 non-null  int64
 CE02        243462 non-null  int64
 CE03        243462 non-null  int64
 CNS01       243462 non-null  int64
 CNS02       243462 non-null  int64
CNS03       243462 non-null  int64
CNS04       243462 non-null  int64
CNS05       243462 non-null  int64
CNS06       243462 non-null  int64
CNS07       243462 non-null  int64
CNS08       243462 non-null  int64
CNS09       243462 non-null  int64
CNS10       243462 non-null  int64
CNS11       243462 non-null  int64
CNS12       243462 non-null  int64
CNS13       243462 non-null  int64
CNS14       243462 non-null  int64
CNS15       243462 non-null  int64
CNS16       243462 non-null  int64
CNS17       243462 non-null  int64
CNS18       243462 non-null  int64
CNS19       243462 non-null  int64
CNS20       243462 non-null  int64
CR01        243462 non-null  int64
CR02        243462 non-null  int64
CR03        243462 non-null  int64
CR04        243462 non-null  int64
CR05        243462 non-null  int64
CR07        243462 non-null  int64
CT01        243462 non-null  int64
CT02        243462 non-null  int64
CD01        243462 non-null  int64
CD02        243462 non-null  int64
CD03        243462 non-null  int64
CD04        243462 non-null  int64
CS01        243462 non-null  int64
CS02        243462 non-null  int64
CFA01       243462 non-null  int64
CFA02       243462 non-null  int64
CFA03       243462 non-null  int64
CFA04       243462 non-null  int64
CFA05       243462 non-null  int64
CFS01       243462 non-null  int64
CFS02       243462 non-null  int64
CFS03       243462 non-null  int64
CFS04       243462 non-null  int64
CFS05       243462 non-null  int64
createdate  243462 non-null  int64
dtypes: int64(53)
memory usage: 98.4 MB

We can use the head and tail methods in order to look at the first or last few rows of the data frame.

df.head() # Default is to show first 5 rows.

	w_geocode	C000	CA01	CA02	CA03	CE01	CE02	CE03	...	createdate
0	60014001001007	30	2	16	12	4	2	24	...	20190826
1	60014001001008	4	0	1	3	0	0	4	...	20190826
2	60014001001011	3	2	1	0	0	3	0	...	20190826
3	60014001001017	11	3	3	5	2	2	7	...	20190826
4	60014001001024	10	3	3	4	7	1	2	...	20190826

5 rows × 53 columns

df.head(10) # We can specify how many rows we want to see.

	w_geocode	C000	CA01	CA02	CA03	CE01	CE02	CE03	...	createdate
0	60014001001007	30	2	16	12	4	2	24	...	20190826
1	60014001001008	4	0	1	3	0	0	4	...	20190826
2	60014001001011	3	2	1	0	0	3	0	...	20190826
3	60014001001017	11	3	3	5	2	2	7	...	20190826
4	60014001001024	10	3	3	4	7	1	2	...	20190826
5	60014001001026	3	0	2	1	0	2	1	...	20190826
6	60014001001027	13	3	3	7	4	5	4	...	20190826
7	60014001001032	13	2	4	7	3	2	8	...	20190826
8	60014001001033	2	0	0	2	2	0	0	...	20190826
9	60014001001034	1	0	0	1	1	0	0	...	20190826

10 rows × 53 columns

df.tail(10) # Same as head, except the last 10 instead of first 10

	w_geocode	C000	CA01	CA02	CA03	CE01	CE02	CE03	...	createdate
243452	61150411004001	1	0	0	1	1	0	0	...	20190826
243453	61150411004011	1	0	1	0	1	0	0	...	20190826
243454	61150411004020	1	1	0	0	1	0	0	...	20190826
243455	61150411004025	3	0	0	3	3	0	0	...	20190826
243456	61150411004027	1	0	1	0	1	0	0	...	20190826
243457	61150411004035	3	0	2	1	1	1	1	...	20190826
243458	61150411004037	18	3	7	8	7	9	2	...	20190826
243459	61150411004047	11	2	5	4	9	0	2	...	20190826
243460	61150411004051	8	6	1	1	0	7	1	...	20190826
243461	61150411004053	10	1	5	4	3	3	4	...	20190826

10 rows × 53 columns

Attributes vs Variables: Note that we used head(), with parentheses, while we used just shape or column, without parentheses. This is because shape and column are attributes and head is a method. To put it another way, shape and columns are characteristics that each Data Frame object has, and we’re just displaying the values that exist already. On the other hand, head is a method, or a function that you perform specifically on a certain type of object (in this case, a Data Frame object).

Accessing the Data Frame#

What if we want to only look at certain cells, or certain columns? We can use a variety of commands to do just that.

Accessing Columns#

To access individual columns, we can use square brackets or we can simply use dot notation.

# Look at just total number of jobs (C000)
df["C000"] 

0         30
1          4
2          3
3         11
4         10
          ..
243457     3
243458    18
243459    11
243460     8
243461    10
Name: C000, Length: 243462, dtype: int64

# This does the same thing
df.C000 

0         30
1          4
2          3
3         11
4         10
          ..
243457     3
243458    18
243459    11
243460     8
243461    10
Name: C000, Length: 243462, dtype: int64

Remember, in Python, we are working with objects that have certain types. When we run the above code, we are accessing a specific column of a Data Frame, and that itself is a different type of object called a Series.

type(df.C000)

pandas.core.series.Series

This can be useful for working with individual columns, because we can then use Series methods to do things like find the mean or standard deviation.

# Mean number of jobs in census blocks that had jobs
df.C000.mean()

65.9188990479007

# Standard deviation
df.C000.std()

369.9156753256301

Accessing Rows#

What if we want to get certain rows? We can also use loc with square brackets. We use a colon to indicate that we want a series of indices with a start and end. We can also leave one side of the colon empty to indicate that we want the rest of the values on that end.

# Show rows 10 - 20. Remember, the first row is row 0
df.loc[10:20] 

	w_geocode	C000	CA01	CA02	CA03	CE01	CE02	CE03	...	createdate
10	60014001001036	14	2	9	3	3	5	6	...	20190826
11	60014001001038	3	0	0	3	0	0	3	...	20190826
12	60014001001041	1	0	0	1	1	0	0	...	20190826
13	60014001001042	9	0	9	0	0	0	9	...	20190826
14	60014001001043	3	1	1	1	0	2	1	...	20190826
15	60014001001044	1	0	0	1	0	0	1	...	20190826
16	60014001001047	2	0	2	0	0	0	2	...	20190826
17	60014001001049	1	0	1	0	1	0	0	...	20190826
18	60014001001053	1	0	1	0	0	0	1	...	20190826
19	60014001001054	7	0	4	3	1	2	4	...	20190826
20	60014001001056	1	0	1	0	0	0	1	...	20190826

11 rows × 53 columns

df.loc[:10]

	w_geocode	C000	CA01	CA02	CA03	CE01	CE02	CE03	...	createdate
0	60014001001007	30	2	16	12	4	2	24	...	20190826
1	60014001001008	4	0	1	3	0	0	4	...	20190826
2	60014001001011	3	2	1	0	0	3	0	...	20190826
3	60014001001017	11	3	3	5	2	2	7	...	20190826
4	60014001001024	10	3	3	4	7	1	2	...	20190826
5	60014001001026	3	0	2	1	0	2	1	...	20190826
6	60014001001027	13	3	3	7	4	5	4	...	20190826
7	60014001001032	13	2	4	7	3	2	8	...	20190826
8	60014001001033	2	0	0	2	2	0	0	...	20190826
9	60014001001034	1	0	0	1	1	0	0	...	20190826
10	60014001001036	14	2	9	3	3	5	6	...	20190826

11 rows × 53 columns

df.loc[:] # This gives all rows

	w_geocode	C000	CA01	CA02	CA03	CE01	CE02	CE03	CNS01	CNS02	...	CFA02	CFA03	CFA04	CFA05	CFS01	CFS02	CFS03	CFS04	CFS05	createdate
0	60014001001007	30	2	16	12	4	2	24	0	0	...	0	0	0	0	0	0	0	0	0	20190826
1	60014001001008	4	0	1	3	0	0	4	0	0	...	0	0	0	0	0	0	0	0	0	20190826
2	60014001001011	3	2	1	0	0	3	0	0	0	...	0	0	0	0	0	0	0	0	0	20190826
3	60014001001017	11	3	3	5	2	2	7	0	0	...	0	0	0	0	0	0	0	0	0	20190826
4	60014001001024	10	3	3	4	7	1	2	0	0	...	0	0	0	0	0	0	0	0	0	20190826
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
243457	61150411004035	3	0	2	1	1	1	1	0	0	...	0	0	0	0	0	0	0	0	0	20190826
243458	61150411004037	18	3	7	8	7	9	2	0	0	...	0	0	0	0	0	0	0	0	0	20190826
243459	61150411004047	11	2	5	4	9	0	2	0	0	...	0	0	0	0	0	0	0	0	0	20190826
243460	61150411004051	8	6	1	1	0	7	1	0	0	...	0	0	0	0	0	0	0	0	0	20190826
243461	61150411004053	10	1	5	4	3	3	4	0	0	...	0	0	0	0	0	0	0	0	0	20190826

243462 rows × 53 columns

In addition, we can use loc to access certain columns as well as certain indices in the Data Frame.

# Look at rows 10 - 20 for total number of jobs (C000)
df.loc[10:20,"C000"] 

  14
   3
   1
   9
   3
   1
   2
   1
   1
   7
   1
Name: C000, dtype: int64

To get a range of columns, we can use the same colon notation.

# Look at rows 10 - 20 for total number of jobs (C000) and jobs by age group
df.loc[10:20,'C000':'CA03']

	C000	CA01	CA02	CA03
10	14	2	9	3
11	3	0	0	3
12	1	0	0	1
13	9	0	9	0
14	3	1	1	1
15	1	0	0	1
16	2	0	2	0
17	1	0	1	0
18	1	0	1	0
19	7	0	4	3
20	1	0	1	0

An alternative to loc is iloc. This takes rows from specific positions in the Data Frame rather than the row labels. Most of the time, row labels are going to be numbered sequentially, so loc and iloc should act similarly. However, sometimes, especially when creating subsets of the data, you might end up with row labels that aren’t ordered sequentially and go up by one. In those cases, iloc might be more useful.

Another use case for iloc is in using negative numbers.

df.iloc[-5:]['C000']

243457     3
243458    18
243459    11
243460     8
243461    10
Name: C000, dtype: int64

In this case, we were able to use the “-5:” to indicate that we want the last 5 rows of the data frame. Note that we can’t do the same with .loc. This is because .loc retrieves the rows from a particular label in the Data Frame, while .iloc retrieves them from particular positions.

Checkpoint: Explore Other Data#

Copy the code you used in the previous section to bring in another dataset. Try to access different values within the dataset. Can you isolate a certain variable? Look at the documentation and identify a variable to pull out of the dataset.

Introduction to Python and SQL for Data Analysis

Exploring the Data Frame

Contents

Exploring the Data Frame#

Accessing the Data Frame#

Accessing Columns#

Accessing Rows#

Checkpoint: Explore Other Data#

	C000	CA01	CA02	CA03
10	14	2	9	3
11	3	0	0	3
12	1	0	0	1
13	9	0	9	0
14	3	1	1	1
15	1	0	0	1
16	2	0	2	0
17	1	0	1	0
18	1	0	1	0
19	7	0	4	3
20	1	0	1	0

	C000	CA01	CA02	CA03
10	14	2	9	3
11	3	0	0	3
12	1	0	0	1
13	9	0	9	0
14	3	1	1	1
15	1	0	0	1
16	2	0	2	0
17	1	0	1	0
18	1	0	1	0
19	7	0	4	3
20	1	0	1	0

	C000	CA01	CA02	CA03
10	14	2	9	3
11	3	0	0	3
12	1	0	0	1
13	9	0	9	0
14	3	1	1	1
15	1	0	0	1
16	2	0	2	0
17	1	0	1	0
18	1	0	1	0
19	7	0	4	3
20	1	0	1	0