Data Types in Python
import numpy as np
import pandas as pd
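# 2015 LODES Workplace Area Characteristics (WAC) file for California
data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'
# Read the gzip-compressed CSV directly from the URL into a Data Frame
df = pd.read_csv(data_url, compression='gzip')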
Data Types in Python#
Python can store data in many different ways. So far we have used Data Frames, which arrange data in rows and columns, the rectangular layout we are used to seeing. Other types of objects, however, can be a better fit for different kinds of data. In this section, we look at several Python object types that help us manage data more effectively.
Series#
Individual columns of a Data Frame can be accessed as a pandas Series object. We have already used this: a Series is what we get when we select a single column from a Data Frame.
type(df.C000)
pandas.core.series.Series
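As a small illustration, the sketch below builds a toy Data Frame (the values are made up and only the column names mirror the real data) and shows that attribute-style and bracket-style access return the same Series.

```python
import pandas as pd

# Toy Data Frame; the values are made up for illustration
toy = pd.DataFrame({"C000": [30, 4, 3], "CA01": [2, 0, 2]})

col_attr = toy.C000          # attribute-style access
col_brackets = toy["C000"]   # bracket-style access

print(type(col_attr))                 # a pandas Series
print(col_attr.equals(col_brackets))  # True: both forms return the same column
```

Bracket-style access also works for column names that contain spaces or that clash with Data Frame method names, which attribute-style access cannot handle.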
Pandas Series have properties that make them convenient for creating measures and working with data. For example, you can do arithmetic with a Series, and the same operation is applied to each value in it.
Let’s look at an example by finding the proportion of workers who are in the first age category (29 or younger) within each census block.
prop_young = df.CA01/df.C000
prop_young
0 0.066667
1 0.000000
2 0.666667
3 0.272727
4 0.300000
...
243457 0.000000
243458 0.166667
243459 0.181818
243460 0.750000
243461 0.100000
Length: 243462, dtype: float64
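The first value in the prop_young Series is calculated by dividing the first value of df.CA01 by the first value of df.C000. The same is done for the second value, the third, and so on. Note that pandas matches the values of the two Series by index label rather than by position, so the two Series should share the same index (as two columns of the same Data Frame do). A label that is present in only one of them produces NaN in the result.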
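The element-wise behaviour can be checked on a few made-up values. The sketch below is only an illustration: the names young, total, and shorter are hypothetical. It also shows what pandas does when the two Series do not line up: values are matched by index label, and a label missing from one side yields NaN rather than an error.

```python
import pandas as pd

# Made-up counts for four hypothetical census blocks
young = pd.Series([2, 0, 2, 3])
total = pd.Series([30, 4, 3, 11])

# Element-wise division: each value of young is divided by the matching value of total
prop = young / total
print(prop)

# A Series with only two values has index labels 0 and 1,
# so labels 2 and 3 have nothing to be matched with
shorter = pd.Series([10, 20])
result = young / shorter
print(result.isna().tolist())  # [False, False, True, True]
```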
Now that we have calculated the proportion, we can then easily add that back into the original dataset.
df['CA01_proportion'] = prop_young
df.head()
w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ... | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate | CA01_proportion | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 60014001001007 | 30 | 2 | 16 | 12 | 4 | 2 | 24 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 0.066667 |
1 | 60014001001008 | 4 | 0 | 1 | 3 | 0 | 0 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 0.000000 |
2 | 60014001001011 | 3 | 2 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 0.666667 |
3 | 60014001001017 | 11 | 3 | 3 | 5 | 2 | 2 | 7 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 0.272727 |
4 | 60014001001024 | 10 | 3 | 3 | 4 | 7 | 1 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 0.300000 |
5 rows × 54 columns
Lists#
Another useful type of object in Python is the List. Lists are similar to Series in that they hold an ordered sequence of values, but they behave differently in some ways.
Let’s take a look at a simple list first. We can create a list using brackets.
mylist = [1, 2, 3, 4]
mylist
[1, 2, 3, 4]
Lists are useful with DataFrames in particular because they can be used to select multiple columns easily.
# Look at rows 10 - 20 for total number of jobs (C000) and jobs by age group
df.loc[10:20,['C000','CA01','CA02','CA03'] ]
C000 | CA01 | CA02 | CA03 | |
---|---|---|---|---|
10 | 14 | 2 | 9 | 3 |
11 | 3 | 0 | 0 | 3 |
12 | 1 | 0 | 0 | 1 |
13 | 9 | 0 | 9 | 0 |
14 | 3 | 1 | 1 | 1 |
15 | 1 | 0 | 0 | 1 |
16 | 2 | 0 | 2 | 0 |
17 | 1 | 0 | 1 | 0 |
18 | 1 | 0 | 1 | 0 |
19 | 7 | 0 | 4 | 3 |
20 | 1 | 0 | 1 | 0 |
type(['C000','CA01','CA02','CA03'])
list
Here, we wanted to select 4 variables to look at. Notice that we replaced "C000" with ['C000','CA01','CA02','CA03']. The square brackets create a list with 4 elements: 'C000', 'CA01', 'CA02', and 'CA03'.
We also could have done this using colon notation, but the list is helpful if the columns that we want to explore are not next to each other. For example, we can include the columns for total jobs as well as jobs by sex using the list notation.
df.loc[10:20,['C000','CS01','CS02'] ]
C000 | CS01 | CS02 | |
---|---|---|---|
10 | 14 | 3 | 11 |
11 | 3 | 2 | 1 |
12 | 1 | 0 | 1 |
13 | 9 | 8 | 1 |
14 | 3 | 3 | 0 |
15 | 1 | 1 | 0 |
16 | 2 | 2 | 0 |
17 | 1 | 0 | 1 |
18 | 1 | 0 | 1 |
19 | 7 | 4 | 3 |
20 | 1 | 1 | 0 |
vars_to_show = ['CA01','CA02','CA03'] # A list of strings containing names of variables (jobs by age group)
df.iloc[-5:][vars_to_show]
CA01 | CA02 | CA03 | |
---|---|---|---|
243457 | 0 | 2 | 1 |
243458 | 3 | 7 | 8 |
243459 | 2 | 5 | 4 |
243460 | 6 | 1 | 1 |
243461 | 1 | 5 | 4 |
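The difference between colon notation and list notation in the selections above can be seen on a toy Data Frame. This is a minimal sketch with made-up values: the column names are borrowed from the real data, while toy and wanted are hypothetical names.

```python
import pandas as pd

# Toy Data Frame with made-up values
toy = pd.DataFrame({
    "C000": [14, 3, 1],
    "CA01": [2, 0, 0],
    "CA02": [9, 0, 0],
    "CS01": [3, 2, 0],
})

# Colon (slice) notation: every column from C000 through CA02, in stored order
by_slice = toy.loc[:, "C000":"CA02"]
print(list(by_slice.columns))  # ['C000', 'CA01', 'CA02']

# List notation: any columns, in whatever order the list gives them
wanted = ["CS01", "C000"]
by_list = toy.loc[:, wanted]
print(list(by_list.columns))   # ['CS01', 'C000']
```

Storing the list in a variable first, as with vars_to_show, makes it easy to reuse the same selection in several places.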
Note that arithmetic works a little differently with Lists. Take a look at the following code, see if you can tell what is happening, and consider how it differs from a Series object.
[1, 2, 3] + [4, 5, 6]
[1, 2, 3, 4, 5, 6]
This is one of the reasons why it is important to make sure you are using the right type of object. Even simple arithmetic operations will work differently depending on what type of object is being used.
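To make the contrast concrete, the sketch below applies the same operators to plain Lists and to Series built from them. The variable names are arbitrary, and the zip-based comprehension is just one common way to add Lists element by element.

```python
import pandas as pd

a = [1, 2, 3]
b = [4, 5, 6]

# With Lists, + joins the two lists and * repeats the list
print(a + b)  # [1, 2, 3, 4, 5, 6]
print(a * 2)  # [1, 2, 3, 1, 2, 3]

# With Series, + adds the values element by element
s = pd.Series(a) + pd.Series(b)
print(s.tolist())  # [5, 7, 9]

# Element-wise addition with plain Lists needs an explicit loop or comprehension
summed = [x + y for x, y in zip(a, b)]
print(summed)  # [5, 7, 9]
```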
Dictionaries#
A dictionary is like a list, except that its values (which can be any Python object) are not accessed by position. Instead, you access the elements of a dictionary using a key. Think of a dictionary like a bag of objects, from which we can find the object we want by using the appropriate label. In our case, we’ll create a dictionary that has the year as the key and the Data Frame for that year as the value. This gives us an easy way of both storing and accessing the Data Frames that we want.
To create a dictionary, we can use curly braces, with a colon separating each key from its value and commas separating the key-value pairs. For example, we can create a dictionary called example_dict with three keys (2009, "2010", 2011) and some values. We can access the values we assigned to the keys using square brackets.
# Creating a dictionary called example_dict
example_dict = {2009:5, "2010":2, 2011:None}
Our example_dict dictionary is storing three values: 5, 2, and None. The keys associated with these three values are 2009, "2010", and 2011. Notice that "2010" is in quotes, indicating that it is a string, as opposed to 2009 or 2011, which are integers. This is important, because we need to make sure to use the correct type to access the dictionary values.
# What do you think this will output?
example_dict[2009]
5
# Keys can be of different types, so we need to use the matching type to look up a value
example_dict["2010"]
2
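The sketch below shows what happens when the key type does not match, using the same example_dict. The fallback string "not found" and the extra key 2012 are made up for illustration.

```python
example_dict = {2009: 5, "2010": 2, 2011: None}

# The integer 2010 and the string "2010" are different keys
print(2010 in example_dict)    # False
print("2010" in example_dict)  # True

# Square brackets raise a KeyError when the key is missing
try:
    example_dict[2010]
except KeyError as err:
    print("KeyError:", err)    # KeyError: 2010

# .get() returns a fallback value instead of raising an error
print(example_dict.get(2010, "not found"))  # not found

# Assigning to a new key adds another key-value pair
example_dict[2012] = 7
print(list(example_dict.keys()))  # [2009, '2010', 2011, 2012]
```

Checking membership with in, or using .get(), is a convenient way to avoid an error when you are not sure which keys a dictionary contains.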