{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "7cea6043-e3e1-43d1-b820-918db0b6d2fe", "metadata": { "tags": [ "hide_input" ] }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'\n", "df = pd.read_csv(data_url, compression='gzip')" ] }, { "cell_type": "markdown", "id": "0bd1468e-76a1-4c42-8f18-9a7e93747547", "metadata": { "tags": [] }, "source": [ "# Data Types in Python\n", "\n", "We can store data in many different ways within Python. So far we have used Data Frames, which arrange data in rows and columns, the rectangular format we are used to seeing. However, other types of objects can be a better fit for different kinds of data. In this section, we will look at several types of objects in Python that help us manage data." ] }, { "cell_type": "markdown", "id": "0bc0f5e5-6d93-4c11-8ad7-22493fd44405", "metadata": {}, "source": [ "## Series\n", "\n", "Individual columns of a Data Frame can be accessed as a pandas Series object. We have already used this: it is what we get when we select a single column from a Data Frame." ] }, { "cell_type": "code", "execution_count": 2, "id": "10207ccb-96c7-43f0-97e6-7c483d23b2c0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(df.C000)" ] }, { "cell_type": "markdown", "id": "f2b6678d-4ccb-4286-919e-85d6446b18c9", "metadata": {}, "source": [ "Pandas Series have properties that make them convenient for creating measures and working with the data. 
For example, you can do arithmetic with Series, and the operation is applied to each value within the Series.\n", "\n", "Let's look at an example by finding the proportion of workers who are in the first age category (29 or younger) within each census block." ] }, { "cell_type": "code", "execution_count": 2, "id": "f3b89d8b-f9dc-47b0-b5b4-203be434a868", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.066667\n", "1 0.000000\n", "2 0.666667\n", "3 0.272727\n", "4 0.300000\n", " ... \n", "243457 0.000000\n", "243458 0.166667\n", "243459 0.181818\n", "243460 0.750000\n", "243461 0.100000\n", "Length: 243462, dtype: float64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prop_young = df.CA01/df.C000\n", "prop_young" ] }, { "cell_type": "markdown", "id": "21c52445-a65d-41e6-a3c8-95b2e36ffb11", "metadata": {}, "source": [ "The first value in the `prop_young` Series is calculated by dividing the first value of `df.CA01` by the first value of `df.C000`. The same is done for the second value, the third, and so on. Note that pandas matches values by index label rather than by position. Because both Series come from the same Data Frame, they share an index and line up row by row; a label that appears in only one of the two Series would produce `NaN` in the result.\n", "\n", "Now that we have calculated the proportion, we can add it back to the original dataset as a new column." ] }, { "cell_type": "code", "execution_count": 5, "id": "ee8e3f50-449b-4403-9938-7eecbadd9d97", "metadata": {}, "outputs": [], "source": [ "df['CA01_proportion'] = prop_young" ] }, { "cell_type": "code", "execution_count": 6, "id": "edee8299-c4e0-4fbe-b6d0-3a460b6b1d2e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ... | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate | CA01_proportion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60014001001007 | 30 | 2 | 16 | 12 | 4 | 2 | 24 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 0.066667 |
| 1 | 60014001001008 | 4 | 0 | 1 | 3 | 0 | 0 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 0.000000 |
| 2 | 60014001001011 | 3 | 2 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 0.666667 |
| 3 | 60014001001017 | 11 | 3 | 3 | 5 | 2 | 2 | 7 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 0.272727 |
| 4 | 60014001001024 | 10 | 3 | 3 | 4 | 7 | 1 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 0.300000 |
5 rows × 54 columns
|   | C000 | CS01 | CS02 |
|---|---|---|---|
| 10 | 14 | 3 | 11 |
| 11 | 3 | 2 | 1 |
| 12 | 1 | 0 | 1 |
| 13 | 9 | 8 | 1 |
| 14 | 3 | 3 | 0 |
| 15 | 1 | 1 | 0 |
| 16 | 2 | 2 | 0 |
| 17 | 1 | 0 | 1 |
| 18 | 1 | 0 | 1 |
| 19 | 7 | 4 | 3 |
| 20 | 1 | 1 | 0 |