{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "7cea6043-e3e1-43d1-b820-918db0b6d2fe",
   "metadata": {
    "tags": [
     "hide_input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np \n",
    "import pandas as pd \n",
    "data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'\n",
    "df = pd.read_csv(data_url,compression='gzip')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0bd1468e-76a1-4c42-8f18-9a7e93747547",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Data Types in Python\n",
    "\n",
    "We can store data in many different ways within Python. We've gone over one way to do this using Data Frames, which have nice rows and columns in a rectangular format like we might be used to seeing data. However, there are other types of objects that might be better for different types of data. In this section, we will look at different types of objects within Python that will help us better manage the data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0bc0f5e5-6d93-4c11-8ad7-22493fd44405",
   "metadata": {},
   "source": [
    "## Series\n",
    "\n",
    "Individual columns of a Data Frame can be accessed as a pandas Series object. We've used this already before -- this is what we get from getting just one column from a Data Frame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "10207ccb-96c7-43f0-97e6-7c483d23b2c0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.series.Series"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(df.C000)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2b6678d-4ccb-4286-919e-85d6446b18c9",
   "metadata": {},
   "source": [
    "Pandas Series have some specific useful properties that make them useful for creating measures and working with the data. For example, you can do arithmetic with Series and have it do the same operation for each value within the Series.\n",
    "\n",
    "Let's look at an example by finding the proportion of workers who are in the first age category (29 or younger) within each census block."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f3b89d8b-f9dc-47b0-b5b4-203be434a868",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0         0.066667\n",
       "1         0.000000\n",
       "2         0.666667\n",
       "3         0.272727\n",
       "4         0.300000\n",
       "            ...   \n",
       "243457    0.000000\n",
       "243458    0.166667\n",
       "243459    0.181818\n",
       "243460    0.750000\n",
       "243461    0.100000\n",
       "Length: 243462, dtype: float64"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "prop_young = df.CA01/df.C000\n",
    "prop_young"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21c52445-a65d-41e6-a3c8-95b2e36ffb11",
   "metadata": {},
   "source": [
    "The first value in the `prop_young` Series is calculated by dividing the first value of `df.CA01` by the first value of `df.C000`. The same is done for the second value and for the third, and so on. Note that the two Series that you are dividing do need to be the same length for this to work.\n",
    "\n",
    "Now that we have calculated the proportion, we can then easily add that back into the original dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ee8e3f50-449b-4403-9938-7eecbadd9d97",
   "metadata": {},
   "outputs": [],
   "source": [
    "df['CA01_proportion'] = prop_young"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "edee8299-c4e0-4fbe-b6d0-3a460b6b1d2e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>w_geocode</th>\n",
       "      <th>C000</th>\n",
       "      <th>CA01</th>\n",
       "      <th>CA02</th>\n",
       "      <th>CA03</th>\n",
       "      <th>CE01</th>\n",
       "      <th>CE02</th>\n",
       "      <th>CE03</th>\n",
       "      <th>CNS01</th>\n",
       "      <th>CNS02</th>\n",
       "      <th>...</th>\n",
       "      <th>CFA03</th>\n",
       "      <th>CFA04</th>\n",
       "      <th>CFA05</th>\n",
       "      <th>CFS01</th>\n",
       "      <th>CFS02</th>\n",
       "      <th>CFS03</th>\n",
       "      <th>CFS04</th>\n",
       "      <th>CFS05</th>\n",
       "      <th>createdate</th>\n",
       "      <th>CA01_proportion</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>60014001001007</td>\n",
       "      <td>30</td>\n",
       "      <td>2</td>\n",
       "      <td>16</td>\n",
       "      <td>12</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>24</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>20190826</td>\n",
       "      <td>0.066667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>60014001001008</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>20190826</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>60014001001011</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>20190826</td>\n",
       "      <td>0.666667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>60014001001017</td>\n",
       "      <td>11</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>20190826</td>\n",
       "      <td>0.272727</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>60014001001024</td>\n",
       "      <td>10</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>20190826</td>\n",
       "      <td>0.300000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 54 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        w_geocode  C000  CA01  CA02  CA03  CE01  CE02  CE03  CNS01  CNS02  \\\n",
       "0  60014001001007    30     2    16    12     4     2    24      0      0   \n",
       "1  60014001001008     4     0     1     3     0     0     4      0      0   \n",
       "2  60014001001011     3     2     1     0     0     3     0      0      0   \n",
       "3  60014001001017    11     3     3     5     2     2     7      0      0   \n",
       "4  60014001001024    10     3     3     4     7     1     2      0      0   \n",
       "\n",
       "   ...  CFA03  CFA04  CFA05  CFS01  CFS02  CFS03  CFS04  CFS05  createdate  \\\n",
       "0  ...      0      0      0      0      0      0      0      0    20190826   \n",
       "1  ...      0      0      0      0      0      0      0      0    20190826   \n",
       "2  ...      0      0      0      0      0      0      0      0    20190826   \n",
       "3  ...      0      0      0      0      0      0      0      0    20190826   \n",
       "4  ...      0      0      0      0      0      0      0      0    20190826   \n",
       "\n",
       "   CA01_proportion  \n",
       "0         0.066667  \n",
       "1         0.000000  \n",
       "2         0.666667  \n",
       "3         0.272727  \n",
       "4         0.300000  \n",
       "\n",
       "[5 rows x 54 columns]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9dc66f5d-5e49-422a-bd8a-d9ef802f2593",
   "metadata": {},
   "source": [
    "## Lists\n",
    "\n",
    "Another useful type of object in Python is the **List**. Lists are similar to Series in that they contain a set of values, but they have slightly different properties. \n",
    "\n",
    "Let's take a look at a simple list first. We can create a list using brackets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "e8535b08-c3a0-4efc-8478-eaa6d34ea0b1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 2, 3, 4]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mylist = [1, 2, 3, 4]\n",
    "mylist"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c0afab6-9b58-4b15-b441-ad7cc6a74025",
   "metadata": {},
   "source": [
    "Lists are useful with DataFrames in particular because they can be used to select multiple columns easily."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "682ea7c7-5c46-4ab1-9939-6913a69d915c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Look at rows 10 - 20 for total number of jobs (C000) and jobs by age group\n",
    "df.loc[10:20,['C000','CA01','CA02','CA03'] ] "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b541bace-b9bb-499a-ad57-3adb54a22682",
   "metadata": {},
   "outputs": [],
   "source": [
    "type(['C000','CA01','CA02','CA03'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73a80d69-79f2-40af-b386-0c655a9ce61b",
   "metadata": {},
   "source": [
    "Here, we wanted to select 4 variables to look at. Notice that we replaced `\"C000\"` with `['C000','CA01','CA02','CA03']`. The square brackets create a list with 4 elements, `'C000'`,`'CA01'`,`'CA02'`, and `'CA03'`.\n",
    "\n",
    "We also could have done this using colon notation, but the list is helpful is the columns that we want to explore are not in order. For example, we can include the columns for total jobs as well as jobs by sex using the list notation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "e90395aa-58e2-4793-8077-271cc0a0994e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>C000</th>\n",
       "      <th>CS01</th>\n",
       "      <th>CS02</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>14</td>\n",
       "      <td>3</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>9</td>\n",
       "      <td>8</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>7</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    C000  CS01  CS02\n",
       "10    14     3    11\n",
       "11     3     2     1\n",
       "12     1     0     1\n",
       "13     9     8     1\n",
       "14     3     3     0\n",
       "15     1     1     0\n",
       "16     2     2     0\n",
       "17     1     0     1\n",
       "18     1     0     1\n",
       "19     7     4     3\n",
       "20     1     1     0"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.loc[10:20,['C000','CS01','CS02'] ] "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5ddeb86d-66e0-4c85-8e7e-dd77f08881ab",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "vars_to_show = ['CA01','CA02','CA03'] # A list of strings containing names of variables (jobs by age group)\n",
    "df.iloc[-5:][vars_to_show]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34fdb91e-3ea2-4d36-90c3-352c085e40a6",
   "metadata": {},
   "source": [
    "Note that Lists are a little bit different in how arithmetic works with them. Take a look at the following code and see if you can see what is happening, and how it is different from a Series object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "521dfb46-3aa0-4cf0-88ba-848549506f36",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 2, 3, 4, 5, 6]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[1, 2, 3] + [4, 5, 6]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bf6df51-f95c-487d-8af4-3cb1751110ee",
   "metadata": {},
   "source": [
    "This is one of the reasons why it is important to make sure you are using the right type of object. Even simple arithmetic operations will work differently depending on what type of object is being used."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fdc58dfb-1250-43bf-b9da-151417b8558a",
   "metadata": {},
   "source": [
    "## Dictionaries\n",
    "\n",
    "A **dictionary** is like a list, except it doesn't have an order in which **values** (which can be any Python object) are stored, and you access the elements of a dictionary using a **key**. Think of a dictionary like a bag of objects, from which we can find the object we want by using the appropriate label. In our case, we'll create a dictionary that has the year as the key and the Data Frame for that year as the value. This give us an easy way of both storing and accessing the Data Frames that we want to get. \n",
    "\n",
    "To create a dictionary, we can use curly braces, with colons separating key-value pairs. For example, we can create a dictionary called `example_dict` with three keys (`2009`, `\"2010\"`, `2011`) with some values. We can access the values we assigned to the keys using square brackets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6c9d202-e353-4014-8c5a-b2b7d9e2e979",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Creating a dictionary called example_dict\n",
    "example_dict = {2009:5, \"2010\":2, 2011:None}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce4f993d-750d-404f-bb2d-ddeb2d4712ee",
   "metadata": {},
   "source": [
    "Our `example_dict` dictionary is storing three values: `5`, `2`, and `None`. The keys associated with these three values are `2009`, `\"2010\"`, and `2011`. Notice that `\"2010\"` is in quotes, indicating that it is a string, as opposed to `2009` or `2011`, which are integers. This is important, because we need to make sure to use the correct type to access the dictionary values. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "411ca44d-aebc-4f41-98a3-c9a5d4015175",
   "metadata": {},
   "outputs": [],
   "source": [
    "# What do you think this will output?\n",
    "example_dict[2009]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb418eb2-70d0-4080-abb4-48af6033f937",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Since keys can be any type, we need to make sure to use the appropriate type\n",
    "example_dict[\"2010\"]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}