Introduction to Python and Pandas

Python is a general-purpose programming language that has become increasingly popular for data analysis. As of 2017, Python ranked at or near the top among programming languages used for data analysis (see the KDnuggets post or a look at multiple surveys).

In addition, unlike dedicated tools for statistical analysis, such as Excel or Stata, Python is designed to be general-purpose. This means we aren’t limited to a fixed set of statistical analyses, which gives us much more flexibility in what we can do.

As with the other chapters in this book, we will start with a motivating question and then walk through the steps needed to answer it. Along the way, we will introduce various Python commands and build skills as we work toward an answer.

NOTE: When you open a notebook, run every cell containing code in order, starting from the beginning. Because the code we’re writing builds on everything written before it, skipping cells may cause later ones to fail.

Longitudinal Employer-Household Dynamics (LEHD) Data

Throughout this book, we will be using LEHD data. These are public-use data sets containing information about employers and employees. Information about the LEHD Data can be found at https://lehd.ces.census.gov/ and the data documentation can be found here.

Motivating Question

The main dataset we’ll be working with in this section is the Workplace Area Characteristics data, which aggregates job totals by workplace census block. We want to explore this dataset and get a better idea of the distribution of jobs. That is, we want to answer the following questions:

How can we characterize the number of jobs in the state?

What can we say about the distribution of the number of jobs by census block?

What are the distributions of jobs by different categories, such as age group or industry?
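To make these questions concrete, here is a minimal sketch of how each one might map onto a Pandas operation. It uses a tiny made-up table rather than the real data, and the column names (`w_geocode`, `C000`, `CA01` to `CA03`) are assumptions based on the LEHD naming convention; check the data documentation before relying on them.

```python
import pandas as pd

# Toy stand-in for a Workplace Area Characteristics table.
# The values are invented for illustration only, and the column names
# are assumed, not taken from this chapter.
wac = pd.DataFrame(
    {
        "w_geocode": ["block_a", "block_b", "block_c", "block_d"],
        "C000": [10, 2, 25, 3],  # total jobs in the block
        "CA01": [4, 1, 5, 0],    # jobs in the youngest age group
        "CA02": [5, 1, 15, 2],   # jobs in the middle age group
        "CA03": [1, 0, 5, 1],    # jobs in the oldest age group
    }
)

# Question 1: total number of jobs across all blocks
total_jobs = wac["C000"].sum()

# Question 2: summary of the distribution of jobs per census block
block_summary = wac["C000"].describe()

# Question 3: job totals broken out by age group
jobs_by_age = wac[["CA01", "CA02", "CA03"]].sum()

print(total_jobs)
print(block_summary)
print(jobs_by_age)
```

With the real data, the same three calls would run on a DataFrame loaded from file instead of one typed in by hand.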