Writing a Basic Query#

In order to analyze the data in a database, we need to query it, or request specific information about the data. Before we do that, run the following cell to establish a connection to the database:

%LOAD ../../lodes.db

This is something that is unique to Jupyter Notebooks, and is only there because of how we’ve set up the notebook to interact with the database using SQLite3. There are other ways to interact with the database, such as using the command line or other software such as DBeaver. We won’t go over the various ways to interact with databases here, since the SQL code is generally the same regardless of how you do it.

Now, let’s start with some basics. We’ll start by retrieving all columns from the California Workplace Area Characteristic (ca_wac_2015) table. Try running the following query:

SELECT * FROM ca_wac_2015 LIMIT 10;
w_geocode c000 ca01 ca02 ca03 ce01 ce02 ce03 cns01 cns02 cns03 cns04 cns05 cns06 cns07 cns08 cns09 cns10 cns11 cns12 cns13 cns14 cns15 cns16 cns17 cns18 cns19 cns20 cr01 cr02 cr03 cr04 cr05 cr07 ct01 ct02 cd01 cd02 cd03 cd04 cs01 cs02 cfa01 cfa02 cfa03 cfa04 cfa05 cfs01 cfs02 cfs03 cfs04 cfs05 createdate over100
060014001001007 30 2 16 12 4 2 24 0 0 0 0 0 0 0 0 0 1 0 25 0 0 0 0 0 0 4 0 25 1 0 3 1 0 27 3 3 6 10 9 7 23 0 0 0 0 0 0 0 0 0 0 20170919 0
060014001001008 4 0 1 3 0 0 4 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 2 0 0 2 0 0 4 0 0 2 1 1 0 4 0 0 0 0 0 0 0 0 0 0 20170919 0
060014001001011 3 2 1 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 3 0 3 0 0 0 0 1 0 3 0 0 0 0 0 0 0 0 0 0 20170919 0
060014001001017 11 3 3 5 2 2 7 0 0 0 0 0 0 0 0 0 0 0 11 0 0 0 0 0 0 0 0 10 0 0 1 0 0 11 0 0 0 3 5 10 1 0 0 0 0 0 0 0 0 0 0 20170919 0
060014001001024 10 3 3 4 7 1 2 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 2 0 0 6 0 4 0 0 5 0 1 10 0 0 2 3 2 2 8 0 0 0 0 0 0 0 0 0 0 20170919 0
060014001001026 3 0 2 1 0 2 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1 0 1 0 0 2 0 0 2 1 0 0 2 1 1 2 0 0 0 0 0 0 0 0 0 0 20170919 0
060014001001027 13 3 3 7 4 5 4 0 0 0 0 0 0 0 0 0 0 0 3 0 0 1 1 0 5 3 0 9 0 0 4 0 0 10 3 1 4 2 3 6 7 0 0 0 0 0 0 0 0 0 0 20170919 0
060014001001032 13 2 4 7 3 2 8 0 0 0 0 0 0 6 0 0 0 0 6 0 0 0 0 0 0 1 0 6 0 0 7 0 0 12 1 0 4 3 4 7 6 0 0 0 0 0 0 0 0 0 0 20170919 0
060014001001033 2 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 0 0 0 0 0 2 0 0 0 0 2 0 2 0 0 0 0 0 0 0 0 0 0 20170919 0
060014001001034 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 20170919 0

You should see 10 rows of the ca_wac_2015 dataset. Let’s go over the basics of this SQL command.

  • SELECT: We start out with the SELECT statement. The SELECT statement specifies which variables (columns) you want.

    • Here, we used SELECT *. The “*” just says that we want all the variables.

    • If we wanted a few columns, we would use the column names separated by commas instead of “*” (for example, w_geocode, createdate).

  • FROM: Now, let’s look at the next part of the query, FROM ca_wac_2015. This part of the query specifies the table, ca_wac_2015, from which we want to retrieve the data. Most of your queries will begin in this fashion, describing which columns you want and from which table.

  • LIMIT: We typically include a LIMIT statement at the end of our query so that we don’t get overloaded with rows being output. Here, LIMIT 10 means that we just want the first ten rows. Many times, the LIMIT that you want will be higher than 10 – you might generally prefer to use 1000 or so. Having a LIMIT for all queries is highly recommended even if you know only a few rows will be shown, since it acts as a safety precaution against (for example) displaying millions of rows of data.

In this case, we’ve put everything in one line, but that’s not necessary. We could have split the code up into multiple lines, like so:

SELECT * 
FROM ca_wac_2015
LIMIT 10;

This gives the same output as our original query. Generally, once queries start getting longer, breaking up the code into multiple lines can be very helpful in organizing your code and making it easier to read.

Along those lines, note that we used a semi-colon at the end of the query to mark the end of the query. That isn’t absolutely necessary here, but it does help mark the end of a query and is required in other applications of SQL, so it’s good practice to use it.

Note about capitalization: If you notice, we’ve been using all caps for SQL commands and all lowercase for data table and schema names. This is simply a convention, as SQL is not case sensitive. For example, we could have run select * from ca_wac_2015 limit 10; and it would have given us the exact same output as the first query.

This does mean you need to be careful when using column names. If your column name has capital letters in it, you need use double quotes (e.g. "C000") to preserve the capitalization. For this reason, you might find that using all lowercase letters in column names is preferable, which is what we’ve done here.

Now, consider the following query. What do you think it will do?

SELECT w_geocode, createdate 
FROM ca_wac_2015
LIMIT 100;

We’ve changed the original query by using w_geocode, createdate instead of *, so we’ll only get the values from two columns, w_geocode and createdate. In addition, we’ve changed the value after LIMIT to be 100 instead of 10, so we’ll get the first 100 rows instead of the first 10 rows.

SELECT w_geocode, createdate 
FROM ca_wac_2015
LIMIT 100;
w_geocode createdate
060014001001007 20170919
060014001001008 20170919
060014001001011 20170919
060014001001017 20170919
060014001001024 20170919
060014001001026 20170919
060014001001027 20170919
060014001001032 20170919
060014001001033 20170919
060014001001034 20170919
060014001001036 20170919
060014001001038 20170919
060014001001041 20170919
060014001001042 20170919
060014001001043 20170919
060014001001044 20170919
060014001001047 20170919
060014001001049 20170919
060014001001053 20170919
060014001001054 20170919
060014001001056 20170919
060014001001057 20170919
060014001001058 20170919
060014001001060 20170919
060014001001062 20170919
060014001001065 20170919
060014001001068 20170919
060014001001074 20170919
060014002001001 20170919
060014002001002 20170919
060014002001004 20170919
060014002001005 20170919
060014002001009 20170919
060014002001010 20170919
060014002001011 20170919
060014002001012 20170919
060014002001016 20170919
060014002001018 20170919
060014002001019 20170919
060014002001020 20170919
060014002001021 20170919
060014002001022 20170919
060014002002000 20170919
060014002002002 20170919
060014002002003 20170919
060014002002004 20170919
060014002002005 20170919
060014002002006 20170919
060014002002008 20170919
060014002002013 20170919
060014002002014 20170919
060014002002015 20170919
060014003001006 20170919
060014003001007 20170919
060014003001008 20170919
060014003001009 20170919
060014003001010 20170919
060014003001011 20170919
060014003001012 20170919
060014003001013 20170919
060014003001014 20170919
060014003001015 20170919
060014003001016 20170919
060014003001017 20170919
060014003002002 20170919
060014003002003 20170919
060014003002004 20170919
060014003002005 20170919
060014003002006 20170919
060014003002007 20170919
060014003002008 20170919
060014003002009 20170919
060014003002010 20170919
060014003002013 20170919
060014003002015 20170919
060014003002017 20170919
060014003002018 20170919
060014003002019 20170919
060014003002020 20170919
060014003002022 20170919
060014003002023 20170919
060014003003001 20170919
060014003003002 20170919
060014003003005 20170919
060014003003006 20170919
060014003003017 20170919
060014003003018 20170919
060014003003019 20170919
060014003003021 20170919
060014003003022 20170919
060014003003024 20170919
060014003003026 20170919
060014003003027 20170919
060014003003030 20170919
060014003004002 20170919
060014003004003 20170919
060014003004004 20170919
060014003004005 20170919
060014003004006 20170919
060014003004007 20170919

Checkpoint: Running Basic Queries#

Consider the following queries. What do you think they will do? Try figuring out what the output will look like, then run the code to see if you’re correct.

  • SELECT * FROM ca_wac_2015 LIMIT 25;

  • SELECT c000,ca01,ca02,ca03 FROM ca_wac_2015 LIMIT 1000;

  • SELECT * FROM ca_od_2015 LIMIT 100;

  • SELECT * FROM ca_rac_2015 LIMIT 40;

Think about the following scenarios. What is the query you would use to answer these questions? Try them out.

  • You want to see the first 100 rows of the origin and destination geocodes for each census block in California.

  • You want to see the top 1000 rows of census blocks containing workplaces and the number of jobs for workers of each race.