%LOAD ../../lodes.db

Checking Number of Rows and Duplicates#

Let’s say we want to find out how many rows there are. You can do this by using a COUNT.

SELECT COUNT(*) 
FROM ca_wac_2015;
COUNT(*)
243462

Here, we used COUNT(*), which does a count of all rows, regardless of NULL values. We can instead do a count of all non-NULL values of a certain variable by including that variable instead of *.

SELECT COUNT(w_geocode) 
FROM ca_wac_2015;
COUNT(w_geocode)
243462

We can also look at both of these at the same time. This can be helpful for checking for NULL values of multiple columns at a time, since we’ll be able to see which ones have less than the total number of rows.

SELECT COUNT(*), COUNT(w_geocode), COUNT(C000) 
FROM ca_wac_2015;
COUNT(*) COUNT(w_geocode) COUNT(C000)
243462 243462 243462

But wait; what if there are duplicates in the data? We can check for them by using DISTINCT.

SELECT DISTINCT w_geocode 
FROM ca_wac_2015
LIMIT 100;
w_geocode
060014001001007
060014001001008
060014001001011
060014001001017
060014001001024
060014001001026
060014001001027
060014001001032
060014001001033
060014001001034
060014001001036
060014001001038
060014001001041
060014001001042
060014001001043
060014001001044
060014001001047
060014001001049
060014001001053
060014001001054
060014001001056
060014001001057
060014001001058
060014001001060
060014001001062
060014001001065
060014001001068
060014001001074
060014002001001
060014002001002
060014002001004
060014002001005
060014002001009
060014002001010
060014002001011
060014002001012
060014002001016
060014002001018
060014002001019
060014002001020
060014002001021
060014002001022
060014002002000
060014002002002
060014002002003
060014002002004
060014002002005
060014002002006
060014002002008
060014002002013
060014002002014
060014002002015
060014003001006
060014003001007
060014003001008
060014003001009
060014003001010
060014003001011
060014003001012
060014003001013
060014003001014
060014003001015
060014003001016
060014003001017
060014003002002
060014003002003
060014003002004
060014003002005
060014003002006
060014003002007
060014003002008
060014003002009
060014003002010
060014003002013
060014003002015
060014003002017
060014003002018
060014003002019
060014003002020
060014003002022
060014003002023
060014003003001
060014003003002
060014003003005
060014003003006
060014003003017
060014003003018
060014003003019
060014003003021
060014003003022
060014003003024
060014003003026
060014003003027
060014003003030
060014003004002
060014003004003
060014003004004
060014003004005
060014003004006
060014003004007

This shows us all of the rows with distinct w_geocode values; that is, all of the distinct census block ids. Let’s count how many there are. To count them, all we have to do is put COUNT() around the DISTINCT part.

SELECT COUNT(DISTINCT w_geocode) 
FROM ca_wac_2015;
COUNT(DISTINCT w_geocode)
243462

Building Up a Query Notice that we wanted to count the number of distinct rows, but we first started from querying the rows with distinct w_geocode first before adding in the COUNT. Though this is a simple example, this process of building up a query as we go is important, especially when we get to more complicated tasks. When writing a query, try to think about the basic parts first, and feel free to run intermediate steps (making sure to include LIMIT) as you go.

Checkpoint: Counting Rows, Using Conditional Statements and Creating Variables#

We’ve included the 2015 OD, RAC, WAC, and geography crosswalk data for both California and Illinois for you in tables. The Illinois data follow the same format as the California, except with il replacing ca (e.g. il_wac_2015 for the Illinois 2015 WAC). Try using the methods described in this section to further explore the tables. Answer the questions below, making sure to write out the queries used to answer the questions.

  • How many counties are there?

  • How many total census blocks are there?

  • How many Metropolitan/Micropolitan areas are there?