Brian Karfunkel — SF Data Traction

Analysis of crime data in SF

View the Project on GitHub bkfunk/sf_data_traction

« Brian Karfunkel home

SF Data Traction

SF provides a lot of interesting data online, including data on all incidents reported to SFPD (including incidents reported by citizens and by officers themselves).* The actual data file, which contains incidents since 2003, can be found here. Note that this is a broader universe than, say, the FBI's Uniform Crime Reporting database, since it includes reported incidents that were deemed by officers to not, in fact, be criminal activity (as well as warnings, etc.). In particular, the data includes latitude and longitude fields for all incidents.

I wanted to see what I could learn from this data, in particular how crime reporting relates to a number of other characteristics. Here, I match incidents for 2012 to Census-Tract-level data from the US Census's 2012 American Community Survey. This allows me to turn incidents into incident rates, since I have population by census tract. It also allows me to look at, say, the relationship between poverty levels and crime rates, or the relationship between the percentage of people who don't have a car and the car theft rate. In order to visualize this disparate data, I create a combination scatter plot and map using d3.js. The final visualization can be viewed here: http://bkfunk.github.io/sf_data_traction/crimestats.html (source).

Cleaning the Data

The data cleaning scripts are located in the /scripts/ directory.

Visualizing the Data

I create some plot templates (located in src/js/charts/) to ease plotting. crimestats.html uses these templates to create a scatter plot, a brush-able distribution plot for each axis on the scatter, and a map showing a choropleth of the data.

I got a shapefile for the SF Census Tracts from the US Census site. I then converted the shapefile to TopoJSON using the bin/create_topojson bash script. I load that TopoJSON file, along with the incidents data, into d3, and merge those two in the javascript so that both geo data and the incident data are bound to each tract element.

Using the visualization, you can select variables to be plotted on:

All three variables can be set indipendently, meaning you can visualize three separate variables in three ways, or one variable on the scatter plot and another on both the scatter plot and choropleth. This allows you to describe some complex relationships in a number of ways.

Mousing over a point in the scatter plot will highlight the corresponding Census tract on the map, and vice versa. Using the brushes on the axis density plots, you can select only a portion of each of the scatter's variables' distributions. For example, you could zoom in on the y-values in the lower part of the distribution. On the choropleth at the right, any tracts that are no longer visible on the scatter plot will switch from a green color scale to a gray color scale. This allows you to, say, zoom in on tracts with high rates of assault and view the distribution of percentage of households with young residents within those high-assault areas and outside those areas.

The d3 code is intentionally pretty agnostic about the precise nature of the data. In order to reuse the visualization with different data, you would just need to get other census-tract-level data with a "tract_id" and another TopoJSON file with the tracts for a different city or region. With a little modification, you could use any type of data associated with any type of map boundary.

*NB: Per the dataset owner's post in the discussion on the data site, "The only data not included in this database is the homicide data."