import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import sqlite3

conn = sqlite3.connect("../../lodes.db")

Writing Data Frames to an SQL Database#

We can use the connection we created to the database to also write tables to it. We do this using the to_sql DataFrame method in pandas. Let’s try this out by putting a new state’s data into our lodes.db database.

data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/md/wac/md_wac_S000_JT00_2015.csv.gz'
df_md = pd.read_csv(data_url, compression='gzip')
df_md.head()
w_geocode C000 CA01 CA02 CA03 CE01 CE02 CE03 CNS01 CNS02 ... CFA02 CFA03 CFA04 CFA05 CFS01 CFS02 CFS03 CFS04 CFS05 createdate
0 240010001001023 8 3 4 1 4 4 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
1 240010001001025 1 0 1 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
2 240010001001054 10 2 3 5 7 3 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
3 240010001001113 2 0 2 0 0 1 1 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
4 240010001002061 8 4 4 0 7 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826

5 rows × 53 columns

First, we will lowercase all column names to make them easier to work with in SQL. Remember, unquoted identifiers in SQL are case-insensitive, and many tools fold them to lowercase, so if we bring this data into the database with uppercase column names, we may need to use double quotes to refer to them, which can become a headache to manage down the road.

df_md.columns = df_md.columns.str.lower()
df_md.columns
Index(['w_geocode', 'c000', 'ca01', 'ca02', 'ca03', 'ce01', 'ce02', 'ce03',
       'cns01', 'cns02', 'cns03', 'cns04', 'cns05', 'cns06', 'cns07', 'cns08',
       'cns09', 'cns10', 'cns11', 'cns12', 'cns13', 'cns14', 'cns15', 'cns16',
       'cns17', 'cns18', 'cns19', 'cns20', 'cr01', 'cr02', 'cr03', 'cr04',
       'cr05', 'cr07', 'ct01', 'ct02', 'cd01', 'cd02', 'cd03', 'cd04', 'cs01',
       'cs02', 'cfa01', 'cfa02', 'cfa03', 'cfa04', 'cfa05', 'cfs01', 'cfs02',
       'cfs03', 'cfs04', 'cfs05', 'createdate'],
      dtype='object')
df_md.to_sql('md_wac_2015', conn, if_exists='replace')
33450

Here, to_sql() takes the DataFrame and writes it into the database through the connection conn, naming the table with the value given in the first argument (in this case, md_wac_2015). The last argument, if_exists='replace', makes it so that the table is replaced if it already exists. This isn’t always the best option to use, because it will overwrite any updates you might have made, but it works for our purposes: since we are bringing in a fresh copy of the table, there is nothing about the old one worth keeping. The 33450 printed above is the number of rows that were written. Note also that, by default, to_sql writes the DataFrame index as an extra column, which is why the table we read back below has 54 columns instead of 53.
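To see how the to_sql arguments behave without touching the LODES data, here is a minimal sketch using a throwaway in-memory SQLite database and a toy DataFrame (both are assumptions for illustration, not the real data). It shows index=False, which skips writing the index column, and if_exists='append', which adds rows instead of replacing the table.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the LODES data (hypothetical values)
toy = pd.DataFrame({'w_geocode': ['240010001001023', '240010001001025'],
                    'c000': [8, 1]})

demo_conn = sqlite3.connect(':memory:')  # throwaway in-memory database

# index=False skips writing the DataFrame index as an extra column;
# in recent pandas versions, to_sql returns the number of rows written
toy.to_sql('toy_wac', demo_conn, if_exists='replace', index=False)

# if_exists='fail' (the default) raises an error when the table exists;
# if_exists='append' adds the rows to the existing table instead
toy.to_sql('toy_wac', demo_conn, if_exists='append', index=False)

print(pd.read_sql('SELECT COUNT(*) AS n FROM toy_wac', demo_conn)['n'][0])  # 4
```

Because we wrote the two-row DataFrame twice (once with replace, once with append), the table ends up with four rows.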

Let’s bring this dataset back to make sure we got what we wanted.

df = pd.read_sql("SELECT * FROM md_wac_2015", conn)
df.head()
index w_geocode c000 ca01 ca02 ca03 ce01 ce02 ce03 cns01 ... cfa02 cfa03 cfa04 cfa05 cfs01 cfs02 cfs03 cfs04 cfs05 createdate
0 0 240010001001023 8 3 4 1 4 4 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
1 1 240010001001025 1 0 1 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
2 2 240010001001054 10 2 3 5 7 3 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
3 3 240010001001113 2 0 2 0 0 1 1 0 ... 0 0 0 0 0 0 0 0 0 20190826
4 4 240010001002061 8 4 4 0 7 1 0 0 ... 0 0 0 0 0 0 0 0 0 20190826

5 rows × 54 columns

Checkpoint: Put More Datasets into the Database#

Try doing the same process of putting data into a database with a different dataset. Make sure you lowercase all columns, and check to make sure it has been put into the database properly by bringing it back into Python.
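The steps above can be wrapped into a small helper so the same process is easy to repeat for each new dataset. This is a minimal sketch: the function name load_into_db and the toy DataFrame are hypothetical, and it uses an in-memory database so it runs on its own; in practice you would pass your real conn and a DataFrame read from another state’s LODES file.

```python
import sqlite3
import pandas as pd

def load_into_db(df, table_name, conn):
    """Lowercase column names, write df to the database, and read it back.

    Hypothetical helper; replaces the table if it already exists.
    """
    df = df.copy()
    df.columns = df.columns.str.lower()
    df.to_sql(table_name, conn, if_exists='replace', index=False)
    # Read the table back so the caller can verify the round trip
    return pd.read_sql(f'SELECT * FROM {table_name}', conn)

# Toy stand-in for another state's WAC file (hypothetical values)
toy = pd.DataFrame({'W_GEOCODE': ['110010001001000'], 'C000': [12]})
conn = sqlite3.connect(':memory:')
check = load_into_db(toy, 'dc_wac_2015', conn)
print(check.columns.tolist())  # ['w_geocode', 'c000']
```

Checking that the read-back DataFrame has the lowercased columns and the expected rows confirms the data made it into the database properly.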