Writing a Basic Query
Contents
Writing a Basic Query#
In order to analyze the data in a database, we need to query it, or request specific information about the data. Before we do that, run the following cell to establish a connection to the database:
%LOAD ../../lodes.db
This is something that is unique to Jupyter Notebooks, and is only there because of how we’ve set up the notebook to interact with the database using SQLite3. There are other ways to interact with the database, such as using the command line or other software such as DBeaver. We won’t go over the various ways to interact with databases here, since the SQL code is generally the same regardless of how you do it.
Now, let’s start with some basics. We’ll start by retrieving all columns from the California Workplace Area Characteristic (ca_wac_2015
) table. Try running the following query:
SELECT * FROM ca_wac_2015 LIMIT 10;
w_geocode | c000 | ca01 | ca02 | ca03 | ce01 | ce02 | ce03 | cns01 | cns02 | cns03 | cns04 | cns05 | cns06 | cns07 | cns08 | cns09 | cns10 | cns11 | cns12 | cns13 | cns14 | cns15 | cns16 | cns17 | cns18 | cns19 | cns20 | cr01 | cr02 | cr03 | cr04 | cr05 | cr07 | ct01 | ct02 | cd01 | cd02 | cd03 | cd04 | cs01 | cs02 | cfa01 | cfa02 | cfa03 | cfa04 | cfa05 | cfs01 | cfs02 | cfs03 | cfs04 | cfs05 | createdate | over100 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
060014001001007 | 30 | 2 | 16 | 12 | 4 | 2 | 24 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 25 | 1 | 0 | 3 | 1 | 0 | 27 | 3 | 3 | 6 | 10 | 9 | 7 | 23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 | 0 |
060014001001008 | 4 | 0 | 1 | 3 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 4 | 0 | 0 | 2 | 1 | 1 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 | 0 |
060014001001011 | 3 | 2 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 | 0 |
060014001001017 | 11 | 3 | 3 | 5 | 2 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 0 | 0 | 1 | 0 | 0 | 11 | 0 | 0 | 0 | 3 | 5 | 10 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 | 0 |
060014001001024 | 10 | 3 | 3 | 4 | 7 | 1 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 6 | 0 | 4 | 0 | 0 | 5 | 0 | 1 | 10 | 0 | 0 | 2 | 3 | 2 | 2 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 | 0 |
060014001001026 | 3 | 0 | 2 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 2 | 1 | 0 | 0 | 2 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 | 0 |
060014001001027 | 13 | 3 | 3 | 7 | 4 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 1 | 0 | 5 | 3 | 0 | 9 | 0 | 0 | 4 | 0 | 0 | 10 | 3 | 1 | 4 | 2 | 3 | 6 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 | 0 |
060014001001032 | 13 | 2 | 4 | 7 | 3 | 2 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 6 | 0 | 0 | 7 | 0 | 0 | 12 | 1 | 0 | 4 | 3 | 4 | 7 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 | 0 |
060014001001033 | 2 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 | 0 |
060014001001034 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 | 0 |
You should see 10 rows of the ca_wac_2015
dataset. Let’s go over the basics of this SQL command.
SELECT: We start out with the
SELECT
statement. TheSELECT
statement specifies which variables (columns) you want.Here, we used
SELECT *
. The “*
” just says that we want all the variables.If we wanted a few columns, we would use the column names separated by commas instead of “
*
” (for example,w_geocode, createdate
).
FROM: Now, let’s look at the next part of the query,
FROM ca_wac_2015
. This part of the query specifies the table,ca_wac_2015
, from which we want to retrieve the data. Most of your queries will begin in this fashion, describing which columns you want and from which table.LIMIT: We typically include a
LIMIT
statement at the end of our query so that we don’t get overloaded with rows being output. Here,LIMIT 10
means that we just want the first ten rows. Many times, theLIMIT
that you want will be higher than 10 – you might generally prefer to use 1000 or so. Having aLIMIT
for all queries is highly recommended even if you know only a few rows will be shown, since it acts as a safety precaution against (for example) displaying millions of rows of data.
In this case, we’ve put everything in one line, but that’s not necessary. We could have split the code up into multiple lines, like so:
SELECT *
FROM ca_wac_2015
LIMIT 10;
This gives the same output as our original query. Generally, once queries start getting longer, breaking up the code into multiple lines can be very helpful in organizing your code and making it easier to read.
Along those lines, note that we used a semi-colon at the end of the query to mark the end of the query. That isn’t absolutely necessary here, but it does help mark the end of a query and is required in other applications of SQL, so it’s good practice to use it.
Note about capitalization: If you notice, we’ve been using all caps for SQL commands and all lowercase for data table and schema names. This is simply a convention, as SQL is not case sensitive. For example, we could have run
select * from ca_wac_2015 limit 10;
and it would have given us the exact same output as the first query.
This does mean you need to be careful when using column names. If your column name has capital letters in it, you need use double quotes (e.g.
"C000"
) to preserve the capitalization. For this reason, you might find that using all lowercase letters in column names is preferable, which is what we’ve done here.
Now, consider the following query. What do you think it will do?
SELECT w_geocode, createdate
FROM ca_wac_2015
LIMIT 100;
We’ve changed the original query by using w_geocode, createdate
instead of *
, so we’ll only get the values from two columns, w_geocode
and createdate
. In addition, we’ve changed the value after LIMIT
to be 100 instead of 10, so we’ll get the first 100 rows instead of the first 10 rows.
SELECT w_geocode, createdate
FROM ca_wac_2015
LIMIT 100;
w_geocode | createdate |
---|---|
060014001001007 | 20170919 |
060014001001008 | 20170919 |
060014001001011 | 20170919 |
060014001001017 | 20170919 |
060014001001024 | 20170919 |
060014001001026 | 20170919 |
060014001001027 | 20170919 |
060014001001032 | 20170919 |
060014001001033 | 20170919 |
060014001001034 | 20170919 |
060014001001036 | 20170919 |
060014001001038 | 20170919 |
060014001001041 | 20170919 |
060014001001042 | 20170919 |
060014001001043 | 20170919 |
060014001001044 | 20170919 |
060014001001047 | 20170919 |
060014001001049 | 20170919 |
060014001001053 | 20170919 |
060014001001054 | 20170919 |
060014001001056 | 20170919 |
060014001001057 | 20170919 |
060014001001058 | 20170919 |
060014001001060 | 20170919 |
060014001001062 | 20170919 |
060014001001065 | 20170919 |
060014001001068 | 20170919 |
060014001001074 | 20170919 |
060014002001001 | 20170919 |
060014002001002 | 20170919 |
060014002001004 | 20170919 |
060014002001005 | 20170919 |
060014002001009 | 20170919 |
060014002001010 | 20170919 |
060014002001011 | 20170919 |
060014002001012 | 20170919 |
060014002001016 | 20170919 |
060014002001018 | 20170919 |
060014002001019 | 20170919 |
060014002001020 | 20170919 |
060014002001021 | 20170919 |
060014002001022 | 20170919 |
060014002002000 | 20170919 |
060014002002002 | 20170919 |
060014002002003 | 20170919 |
060014002002004 | 20170919 |
060014002002005 | 20170919 |
060014002002006 | 20170919 |
060014002002008 | 20170919 |
060014002002013 | 20170919 |
060014002002014 | 20170919 |
060014002002015 | 20170919 |
060014003001006 | 20170919 |
060014003001007 | 20170919 |
060014003001008 | 20170919 |
060014003001009 | 20170919 |
060014003001010 | 20170919 |
060014003001011 | 20170919 |
060014003001012 | 20170919 |
060014003001013 | 20170919 |
060014003001014 | 20170919 |
060014003001015 | 20170919 |
060014003001016 | 20170919 |
060014003001017 | 20170919 |
060014003002002 | 20170919 |
060014003002003 | 20170919 |
060014003002004 | 20170919 |
060014003002005 | 20170919 |
060014003002006 | 20170919 |
060014003002007 | 20170919 |
060014003002008 | 20170919 |
060014003002009 | 20170919 |
060014003002010 | 20170919 |
060014003002013 | 20170919 |
060014003002015 | 20170919 |
060014003002017 | 20170919 |
060014003002018 | 20170919 |
060014003002019 | 20170919 |
060014003002020 | 20170919 |
060014003002022 | 20170919 |
060014003002023 | 20170919 |
060014003003001 | 20170919 |
060014003003002 | 20170919 |
060014003003005 | 20170919 |
060014003003006 | 20170919 |
060014003003017 | 20170919 |
060014003003018 | 20170919 |
060014003003019 | 20170919 |
060014003003021 | 20170919 |
060014003003022 | 20170919 |
060014003003024 | 20170919 |
060014003003026 | 20170919 |
060014003003027 | 20170919 |
060014003003030 | 20170919 |
060014003004002 | 20170919 |
060014003004003 | 20170919 |
060014003004004 | 20170919 |
060014003004005 | 20170919 |
060014003004006 | 20170919 |
060014003004007 | 20170919 |
Checkpoint: Running Basic Queries#
Consider the following queries. What do you think they will do? Try figuring out what the output will look like, then run the code to see if you’re correct.
SELECT * FROM ca_wac_2015 LIMIT 25;
SELECT c000,ca01,ca02,ca03 FROM ca_wac_2015 LIMIT 1000;
SELECT * FROM ca_od_2015 LIMIT 100;
SELECT * FROM ca_rac_2015 LIMIT 40;
Think about the following scenarios. What is the query you would use to answer these questions? Try them out.
You want to see the first 100 rows of the origin and destination geocodes for each census block in California.
You want to see the top 1000 rows of census blocks containing workplaces and the number of jobs for workers of each race.