CHANEY LAW FIRM BLOG

Visualizing Arkansas traffic fatalities, Part 3

This is the latest post in a series analyzing Arkansas traffic fatalities. Please take a look at part 1 (a map of 2015 traffic deaths) and part 2 (a heat map of all fatalities, both nationwide and in Arkansas, from 2000-2015) if you haven't already.

Visualizations

Today's visualization piggybacks off part 2, in that we further explore the relationship between the day of the week and traffic fatalities, both nationwide and in Arkansas. The first set of visualizations maps the raw number of traffic fatalities in the US by day of the week. You can click to zoom the image. Each row represents a single year between 2000 and 2015, and each column represents a day of the week. From left to right (or top to bottom on small devices), you have drunk driving fatalities, non-drunk driving fatalities, and total fatalities.

There are a couple of things that stand out here. As we saw in the previous post, drunk driving fatalities are far higher on weekends than during the week. One thing we couldn't easily see in the previous post is that non-drunk-driving fatalities are spread fairly evenly throughout the week. Finally, moving from top to bottom in the charts, it looks like traffic fatalities may have declined somewhat over the past 15 years.

As I will with the remaining posts, I repeated the same analysis on Arkansas-specific wreck information. Again, the same trends appear to hold, although the bands aren't as smoothly colored (that tells us the data is a little noisier due to fewer data points). Note that this scale is different from the one used for the nationwide set.

Code

We'll be using the same FARS data we used in the previous two posts. Let's set up our libraries, import the data into R, and get moving. For a more detailed explanation of what we're doing here, please refer to part 2.

library(foreign) # read.dbf, for the FARS DBF files
library(ggplot2) # 2.1.0.9000 (development build)
library(plyr)    # ddply
library(zoo)     # date helpers

data.dir <- "/path/to/your/dir/"

# Read select columns from datasets
accidents_2015 <- read.dbf(paste(data.dir, "Data/FARS2015NationalDBF/accident.dbf", sep=""))[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2014 <- read.dbf(paste(data.dir, "Data/FARS2014NationalDBF/accident.dbf", sep=""))[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2013 <- read.dbf(paste(data.dir, "Data/FARS2013NationalDBF/accident.dbf", sep=""))[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2012 <- read.dbf(paste(data.dir, "Data/FARS2012/accident.dbf", sep=""))[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2011 <- read.dbf(paste(data.dir, "Data/FARS2011/accident.dbf", sep=""))[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2010 <- read.dbf(paste(data.dir, "Data/FARS2010/accident.dbf", sep=""))[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2009 <- read.dbf(paste(data.dir, "Data/FARS2009/accident.dbf", sep=""))[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2008 <- read.dbf(paste(data.dir, "Data/FARS2008/accident.dbf", sep=""))[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2007 <- read.dbf(paste(data.dir, "Data/FARS2007/accident.dbf", sep=""))[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2006 <- read.dbf(paste(data.dir, "Data/FARS2006/accident.dbf", sep=""))[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2005 <- read.dbf(paste(data.dir, "Data/FARS2005/accident.dbf", sep=""))[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2004 <- read.dbf(paste(data.dir, "Data/FARS2004/accident.dbf", sep=""))[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2003 <- read.dbf(paste(data.dir, "Data/FARS2003/accident.dbf", sep=""))[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2002 <- read.dbf(paste(data.dir, "Data/FARS2002/accident.dbf", sep=""))[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2001 <- read.dbf(paste(data.dir, "Data/FARS2001/accident.dbf", sep=""))[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2000 <- read.dbf(paste(data.dir, "Data/FARSDBF00/ACCIDENT.dbf", sep=""))[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]

# Merge all data
accidents <- rbind(accidents_2015, accidents_2014, accidents_2013, accidents_2012, accidents_2011, accidents_2010, accidents_2009, accidents_2008, accidents_2007, accidents_2006, accidents_2005, accidents_2004, accidents_2003, accidents_2002, accidents_2001, accidents_2000)
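
As an aside, the sixteen read.dbf calls above can be collapsed into a loop if you'd like less repetition. Here's a minimal sketch, assuming the same directory layout and column selection as above:

# Map each year to its subdirectory and file name (the layout changed over the years)
fars_files <- c("FARSDBF00/ACCIDENT.dbf",
    paste("FARS", 2001:2012, "/accident.dbf", sep=""),
    paste("FARS", 2013:2015, "NationalDBF/accident.dbf", sep=""))
cols <- c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")

# Read the selected columns from each year and combine them in one pass
accidents <- do.call(rbind, lapply(fars_files, function(f)
    read.dbf(paste(data.dir, "Data/", f, sep=""))[, cols]))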

# Subset Arkansas wrecks (FIPS state code 5 = Arkansas)
# Comment out the following line for the nationwide charts
accidents <- subset(accidents, STATE == 5)

# Add date column
accidents$date <- as.Date(paste(accidents$YEAR, accidents$MONTH, accidents$DAY, sep='-'), "%Y-%m-%d")

# Divide and aggregate data into drunk/not drunk
accidents_drunk <- accidents$DRUNK_DR > 0
accidents_not_drunk <- accidents$DRUNK_DR == 0
summary <- aggregate(FATALS ~ date, accidents, sum)
summary_not_drunk <- aggregate(FATALS ~ date, accidents, sum, subset=accidents_not_drunk)
summary_drunk <- aggregate(FATALS ~ date, accidents, sum, subset=accidents_drunk)

Now, since this is a weekly analysis, we'll add in some information about each date itself.

# Date calculations
summary_not_drunk <- transform(summary_not_drunk,
    week = as.numeric(format(date, "%U")),
    day = as.numeric(format(date, "%d")),
    wday = as.numeric(format(date, "%w"))+1,
    month = as.POSIXlt(date)$mon + 1,
    year = as.POSIXlt(date)$year + 1900)

summary_drunk <- transform(summary_drunk,
    week = as.numeric(format(date, "%U")),
    day = as.numeric(format(date, "%d")),
    wday = as.numeric(format(date, "%w"))+1,
    month = as.POSIXlt(date)$mon + 1,
    year = as.POSIXlt(date)$year + 1900)

summary <- transform(summary,
    week = as.numeric(format(date, "%U")),
    day = as.numeric(format(date, "%d")),
    wday = as.numeric(format(date, "%w"))+1,
    month = as.POSIXlt(date)$mon + 1,
    year = as.POSIXlt(date)$year + 1900)

This gives us the weekday and year (along with some other information we're not using here), which will form the x- and y-axes for our visualizations.
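
If you're curious about those format codes, here's a quick sanity check (January 1, 2015 fell on a Thursday):

format(as.Date("2015-01-01"), "%U") # "00" -- week of the year, with Sunday as the first day of the week
format(as.Date("2015-01-01"), "%w") # "4"  -- weekday as a number, with Sunday = 0, hence the +1 above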

Next, we'll aggregate the data by the year and day-of-week fields we created in the previous step. We'll also compute the overall min and max so that we can use a consistent color scale across the three visualizations.

# Aggregation
data_not_drunk <- ddply(summary_not_drunk, .(wday, year), summarize, sum = sum(FATALS))
data_drunk <- ddply(summary_drunk, .(wday, year), summarize, sum = sum(FATALS))
data <- ddply(summary, .(wday, year), summarize, sum = sum(FATALS))
max <- max(c(max(data$sum), max(data_not_drunk$sum), max(data_drunk$sum)))
min <- min(c(min(data$sum), min(data_not_drunk$sum), min(data_drunk$sum)))

The next step is purely a user-experience step: changing the day of the week from a number to a more familiar text abbreviation.

# Apply factors for days of week
data_not_drunk$weekday<-factor(data_not_drunk$wday,levels=1:7,labels=c("S","M","T","W","Th","F","Sa"),ordered=TRUE)
data_drunk$weekday<-factor(data_drunk$wday,levels=1:7,labels=c("S","M","T","W","Th","F","Sa"),ordered=TRUE)
data$weekday<-factor(data$wday,levels=1:7,labels=c("S","M","T","W","Th","F","Sa"),ordered=TRUE)


That's it for data wrangling. Now, we just need to plot the data. Again, we'll define a theme so that the charts look pretty and incorporate the same color scale.

# Define theme
heat_map_theme <- theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.spacing.x = unit(0, "points"),
    panel.spacing.y = unit(1, "points"),
    strip.background = element_rect(fill="gray90", color=NA),
    strip.text = element_text(color="gray5"),
    axis.ticks = element_blank(),
    axis.text.x = element_text(color="gray5", size=9),
    axis.text.y = element_text(color="gray5", size=9),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.text = element_text(color="gray5"),
    legend.title = element_text(color="gray5"),
    plot.title = element_text(color="gray5", hjust=0.5),
    plot.subtitle = element_text(color="gray5", hjust=0.5),
    plot.caption = element_text(color="gray5", hjust=1, size=6),
    panel.background = element_rect(fill="transparent", color=NA),
    legend.background = element_rect(fill="transparent", color=NA),
    plot.background = element_rect(fill="transparent", color=NA),
    legend.key = element_rect(fill=alpha("white", 0.33), color=NA)
)

Next, we'll define a directory where the output images will be saved.

imagedir <- "~/PATH/TO/YOUR/SAVE/DIRECTORY/Images/"

Now, we'll simply plot and save the output. (If you're plotting the Arkansas subset, adjust the titles and file names to match.)

# Plot drunk and save
ggplot(data_drunk, aes(weekday, year)) +
    geom_tile(aes(fill=sum), na.rm = FALSE) +
    scale_fill_gradient(name="Fatalities", low="yellow", high="red", na.value = alpha("white", 0.25), limits=c(min,max)) +
    scale_y_reverse(expand=c(0,0)) +
    labs(title = "2000-2015 Traffic Fatalities, Nationwide", x="", y="", subtitle="by Day of Week (drunk driving only)", caption = "(based on data from NHTSA FARS: ftp://ftp.nhtsa.dot.gov/fars)") +
    heat_map_theme

filename <- paste(c(imagedir, "2000-2015_fatalities_calendar DOW (nationwide, drunk).png"), collapse="")
ggsave(filename, bg = "transparent")

# Plot not drunk and save
ggplot(data_not_drunk, aes(weekday, year)) +
    geom_tile(aes(fill=sum), na.rm = FALSE) +
    scale_fill_gradient(name="Fatalities", low="yellow", high="red", na.value = alpha("white", 0.25), limits=c(min,max)) +
    scale_y_reverse(expand=c(0,0)) +
    labs(title = "2000-2015 Traffic Fatalities, Nationwide", x="", y="", subtitle="by Day of Week (excluding drunk driving)", caption = "(based on data from NHTSA FARS: ftp://ftp.nhtsa.dot.gov/fars)") +
    heat_map_theme

filename <- paste(c(imagedir, "2000-2015_fatalities_calendar DOW (nationwide, not drunk).png"), collapse="")
ggsave(filename, bg = "transparent")

# Plot all and save
ggplot(data, aes(weekday, year)) +
    geom_tile(aes(fill=sum), na.rm = FALSE) +
    scale_fill_gradient(name="Fatalities", low="yellow", high="red", na.value = alpha("white", 0.25), limits=c(min,max)) +
    scale_y_reverse(expand=c(0,0)) +
    labs(title = "2000-2015 Traffic Fatalities, Nationwide", x="", y="", subtitle="by Day of Week", caption = "(based on data from NHTSA FARS: ftp://ftp.nhtsa.dot.gov/fars)") +
    heat_map_theme

filename <- paste(c(imagedir, "2000-2015_fatalities_calendar DOW (nationwide, all).png"), collapse="")
ggsave(filename, bg = "transparent")
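
Since the three plots differ only in the dataset, the subtitle, and the file name, you could also wrap the plotting in a small helper function. Here's a sketch using the objects defined above:

plot_dow_heatmap <- function(df, subtitle, filename) {
    p <- ggplot(df, aes(weekday, year)) +
        geom_tile(aes(fill=sum), na.rm = FALSE) +
        scale_fill_gradient(name="Fatalities", low="yellow", high="red",
            na.value = alpha("white", 0.25), limits=c(min,max)) +
        scale_y_reverse(expand=c(0,0)) +
        labs(title = "2000-2015 Traffic Fatalities, Nationwide", x="", y="",
            subtitle=subtitle, caption = "(based on data from NHTSA FARS: ftp://ftp.nhtsa.dot.gov/fars)") +
        heat_map_theme
    ggsave(paste(imagedir, filename, sep=""), plot = p, bg = "transparent")
}

plot_dow_heatmap(data_drunk, "by Day of Week (drunk driving only)",
    "2000-2015_fatalities_calendar DOW (nationwide, drunk).png")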

Conclusion

Using a couple of different views of the data, we've seen that drunk driving fatalities seem to occur more often on the weekend. In the next post, we'll look at another set of heat maps that break down when driving fatalities occur even further: by time of day.

Nathan's new blog home

Since starting a new job at UAMS earlier this year, I've been guest-blogging here about software projects related to the firm's work. Some of my projects aren't related, however, and I haven't had a forum for those posts. So, I've started a new blog at nathanchaney.com and seeded it with non-law posts from this site that relate to big data, information visualization, software, technology, and food. I'll continue to post here about things related to the law practice, but the new site will be a broader forum for my big data and information visualization projects. Please visit and have a look around!

Visualizing Arkansas traffic fatalities, Part 2

A couple of weeks ago, I posted a map of the traffic fatalities in Arkansas in 2015. The data came from the NHTSA, and the graphic I posted was just scratching the surface. I've sliced the data a couple of different ways and created three more sets of visualizations about that data. For the next three posts, I'll show the visualizations, my interpretation, and then the code so that non-programmers will get the goods on the front end.

The Visualizations

The first set of visualizations maps the raw number of traffic fatalities in the US. You can click to zoom the image. Each band represents a single year between 2000 and 2015. Each row within the band is a day of the week. From left to right (or top to bottom on small devices), you have drunk driving fatalities, non-drunk driving fatalities, and total fatalities. We'll repeat this comparison a number of times, and the color coding in each set of graphs uses the same scale (so we compare apples to apples across the three visualizations).  

For me, there are two things that stand out in this set of visualizations. First, drunk driving fatalities are heavily weighted towards weekends. Second, New Year's Day (the left- and top-most block) is an especially dangerous time to be on the road.

As I will with the remaining posts, I repeated the same analysis on Arkansas-specific wreck information (this requires a single line of R code, given below). The data is a little noisier, but the same results appear to hold. Note that this scale is different from the nationwide set; it is skewed a little towards lighter colors by one extremely bad Arkansas traffic Saturday in the fall of 2004.

The Code

Now to the code. First, we'll load the libraries we need and read in all the data from the NHTSA (available at ftp.nhtsa.dot.gov/fars/). Because the files gained data elements over time, we'll select just the columns we need for the visualization so we can combine them all.

# Load libraries (foreign provides read.dbf; ggplot2, plyr, and zoo are used below)
library(foreign)
library(ggplot2)
library(plyr)
library(zoo)

# Read select data elements
accidents_2015 <- read.dbf("Data/FARS2015NationalDBF/accident.dbf")[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2014 <- read.dbf("Data/FARS2014NationalDBF/accident.dbf")[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2013 <- read.dbf("Data/FARS2013NationalDBF/accident.dbf")[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2012 <- read.dbf("Data/FARS2012/accident.dbf")[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2011 <- read.dbf("Data/FARS2011/accident.dbf")[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2010 <- read.dbf("Data/FARS2010/accident.dbf")[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2009 <- read.dbf("Data/FARS2009/accident.dbf")[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2008 <- read.dbf("Data/FARS2008/accident.dbf")[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2007 <- read.dbf("Data/FARS2007/accident.dbf")[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2006 <- read.dbf("Data/FARS2006/accident.dbf")[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2005 <- read.dbf("Data/FARS2005/accident.dbf")[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2004 <- read.dbf("Data/FARS2004/accident.dbf")[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2003 <- read.dbf("Data/FARS2003/accident.dbf")[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2002 <- read.dbf("Data/FARS2002/accident.dbf")[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2001 <- read.dbf("Data/FARS2001/accident.dbf")[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]
accidents_2000 <- read.dbf("Data/FARSDBF00/ACCIDENT.dbf")[,c("STATE", "COUNTY", "DAY", "MONTH", "YEAR", "FATALS", "DRUNK_DR")]

# Combine annual data
accidents <- rbind(accidents_2015, accidents_2014, accidents_2013, accidents_2012, accidents_2011, accidents_2010, accidents_2009, accidents_2008, accidents_2007, accidents_2006, accidents_2005, accidents_2004, accidents_2003, accidents_2002, accidents_2001, accidents_2000)

Next, if we're looking at just Arkansas data, we need this line of code:

# Subset Arkansas wrecks (FIPS state code 5 = Arkansas)
accidents <- subset(accidents, STATE == 5)

We have to add an R-specific date column to be able to use date functions later on, as follows:

# Add date column
accidents$date <- as.Date(paste(accidents$YEAR, accidents$MONTH, accidents$DAY, sep='-'), "%Y-%m-%d")

Next, we need to write the code that will let us split the data by whether any driver involved in the wreck was drunk. We do this by creating logical R vectors, which allow us to aggregate the raw wreck data by day and then subset it by whether or not it is alcohol-related, as follows:

# Subset and aggregate wrecks by drunk/not drunk
accidents_drunk <- accidents$DRUNK_DR > 0
accidents_not_drunk <- accidents$DRUNK_DR == 0
summary <- aggregate(FATALS ~ date, accidents, sum)
summary_not_drunk <- aggregate(FATALS ~ date, accidents, sum, subset=accidents_not_drunk)
summary_drunk <- aggregate(FATALS ~ date, accidents, sum, subset=accidents_drunk)

Now, we need to create empty entries for those dates that don't have any fatal wrecks (not necessary for nationwide plots, but necessary for Arkansas-specific ones). We'll add this to the data we created earlier.

# get vector of all days in relevant data range
start_year <- format(min(summary_not_drunk$date), "%Y") # Calculate start_year from data
firstday <- as.Date(paste(start_year, "-01-01", sep=""))
lastday <- as.Date("2015-12-31")
alldays <- data.frame(date = seq(firstday, lastday, by="day"))

# initialize the fatality column so the empty days match the aggregated data
alldays[,"FATALS"] <- NA

# append the empty days to each aggregated dataset
data <- rbind(summary, alldays)
data_drunk <- rbind(summary_drunk, alldays)
data_not_drunk <- rbind(summary_not_drunk, alldays)

# Add calendar fields; %U uses a 53-week year when week 1 has < 4 days
data <- transform(data,
    week = as.numeric(format(date, "%U")),
    day = as.numeric(format(date, "%d")),
    wday = as.numeric(format(date, "%w"))+1,
    month = as.POSIXlt(date)$mon + 1,
    year = as.POSIXlt(date)$year + 1900)

data_drunk <- transform(data_drunk,
    week = as.numeric(format(date, "%U")),
    day = as.numeric(format(date, "%d")),
    wday = as.numeric(format(date, "%w"))+1,
    month = as.POSIXlt(date)$mon + 1,
    year = as.POSIXlt(date)$year + 1900)

data_not_drunk <- transform(data_not_drunk,
    week = as.numeric(format(date, "%U")),
    day = as.numeric(format(date, "%d")),
    wday = as.numeric(format(date, "%w"))+1,
    month = as.POSIXlt(date)$mon + 1,
    year = as.POSIXlt(date)$year + 1900)

Next, we'll add month and weekday labels and normalize the week within each month, which is how we plot the bands. This code is adapted from the very helpful r-bloggers post I mentioned in the last visualization post. As an aside, if you're interested in data visualization, you should subscribe to r-bloggers, as those guys are always posting fascinating stuff.

data$yearmonth<-as.yearmon(data$date)
data$yearmonthf<-factor(data$yearmonth)
data <- ddply(data,.(yearmonthf),transform,monthweek=1+week-min(week)) # and now for each monthblock we normalize the week to start at 1
data <- subset(data, select = -c(yearmonth, yearmonthf)) # Then, drop extra columns
data$monthf<-factor(data$month,levels=as.character(1:12),labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"),ordered=TRUE)
data$weekdayf<-factor(data$wday,levels=rev(1:7),labels=rev(c("S","M","T","W","Th","F","Sa")),ordered=TRUE)

data_drunk$yearmonth<-as.yearmon(data_drunk$date)
data_drunk$yearmonthf<-factor(data_drunk$yearmonth)
data_drunk <- ddply(data_drunk,.(yearmonthf),transform,monthweek=1+week-min(week)) # and now for each monthblock we normalize the week to start at 1
data_drunk <- subset(data_drunk, select = -c(yearmonth, yearmonthf)) # Then, drop extra columns
data_drunk$monthf<-factor(data_drunk$month,levels=as.character(1:12),labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"),ordered=TRUE)
data_drunk$weekdayf<-factor(data_drunk$wday,levels=rev(1:7),labels=rev(c("S","M","T","W","Th","F","Sa")),ordered=TRUE)

data_not_drunk$yearmonth<-as.yearmon(data_not_drunk$date)
data_not_drunk$yearmonthf<-factor(data_not_drunk$yearmonth)
data_not_drunk <- ddply(data_not_drunk,.(yearmonthf),transform,monthweek=1+week-min(week)) # and now for each monthblock we normalize the week to start at 1
data_not_drunk <- subset(data_not_drunk, select = -c(yearmonth, yearmonthf)) # Then, drop extra columns
data_not_drunk$monthf<-factor(data_not_drunk$month,levels=as.character(1:12),labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"),ordered=TRUE)
data_not_drunk$weekdayf<-factor(data_not_drunk$wday,levels=rev(1:7),labels=rev(c("S","M","T","W","Th","F","Sa")),ordered=TRUE)
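
The three blocks above are identical except for the data frame they operate on. If you'd rather not repeat yourself, the same steps can live in a small helper function (a sketch using the exact transformations above):

add_calendar_factors <- function(df) {
    df$yearmonthf <- factor(as.yearmon(df$date))
    # normalize the week to start at 1 within each month block
    df <- ddply(df, .(yearmonthf), transform, monthweek = 1 + week - min(week))
    df <- subset(df, select = -c(yearmonthf))
    df$monthf <- factor(df$month, levels=as.character(1:12),
        labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"), ordered=TRUE)
    df$weekdayf <- factor(df$wday, levels=rev(1:7),
        labels=rev(c("S","M","T","W","Th","F","Sa")), ordered=TRUE)
    df
}

data <- add_calendar_factors(data)
data_drunk <- add_calendar_factors(data_drunk)
data_not_drunk <- add_calendar_factors(data_not_drunk)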

Next, we'll look at the maximum and minimum number of fatalities across the three sets of data we'll be plotting. This will allow us to keep the same scale across all three plots.

max <- max(c(max(data["FATALS"], na.rm=TRUE), max(data_not_drunk["FATALS"], na.rm=TRUE), max(data_drunk["FATALS"], na.rm=TRUE)))
min <- min(c(min(data["FATALS"], na.rm=TRUE), min(data_not_drunk["FATALS"], na.rm=TRUE), min(data_drunk["FATALS"], na.rm=TRUE)))

We're finally done with our data processing. The next step is to set up a theme for the visualization. To do this, I installed the development version of ggplot2 from GitHub to use some of the newer features, like subtitles and moving the facets (year labels) around.
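
If you want to follow along, the development build installs via devtools. A sketch (at the time this was written the package lived at hadley/ggplot2; it has since moved to tidyverse/ggplot2, and current CRAN releases include these features):

# install.packages("devtools") # if you don't already have it
devtools::install_github("hadley/ggplot2")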

# Define theme
heat_map_theme <- theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.spacing.x = unit(0, "points"),
    panel.spacing.y = unit(1, "points"),
    strip.placement = "outside",
    strip.switch.pad.grid = unit(2,"points"),
    strip.background = element_rect(fill="gray90", color=NA),
    strip.text = element_text(color="gray5"),
    axis.ticks = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_text(color="gray5", size=5),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.text = element_text(color="gray5"),
    legend.title = element_text(color="gray5"),
    plot.title = element_text(color="gray5", hjust=0.5),
    plot.subtitle = element_text(color="gray5", hjust=0.5),
    plot.caption = element_text(color="gray5", hjust=1, size=6),
    panel.background = element_rect(fill="transparent", color=NA),
    legend.background = element_rect(fill="transparent", color=NA),
    plot.background = element_rect(fill="transparent", color=NA),
    legend.key = element_rect(fill=alpha("white", 0.33), color=NA)
)

Finally, we'll perform the plot on the three datasets and save the output. The important things here are:

  • the switch="y" argument in the facet_grid function puts the years on the left instead of the right
  • the limits argument in the scale_fill_gradient function keeps the scale the same across all three plots
  • expand=c(0,0) in scale_x_continuous() eliminates the padding around the plot area
# Plot and save all
ggplot(data, aes(week, weekdayf)) +
    geom_tile(aes(fill=FATALS), na.rm = FALSE) +
    facet_grid(year ~ ., drop = FALSE, switch="y") +
    scale_fill_gradient(name="Fatalities", low="yellow", high="red", na.value = alpha("white", 0.25), limits=c(min,max)) +
    scale_x_continuous(expand=c(0,0)) +
    labs(title = "2000-2015 Traffic Fatalities, Arkansas", x="", y="", subtitle="by Year", caption = "(based on data from NHTSA FARS: ftp://ftp.nhtsa.dot.gov/fars)") +
    heat_map_theme

filename <- paste(c("Images/2000-2015_fatalities_calendar (AR, all).png"), collapse="")
ggsave(filename, bg = "transparent")

# Plot and save drunk
ggplot(data_drunk, aes(week, weekdayf)) +
    geom_tile(aes(fill=FATALS), na.rm = FALSE) +
    facet_grid(year ~ ., drop = FALSE, switch="y") +
    scale_fill_gradient(name="Fatalities", low="yellow", high="red", na.value = alpha("white", 0.25), limits=c(min,max)) +
    scale_x_continuous(expand=c(0,0)) +
    labs(title = "2000-2015 Traffic Fatalities, Arkansas", x="", y="", subtitle="by Year (drunk driving only)", caption = "(based on data from NHTSA FARS: ftp://ftp.nhtsa.dot.gov/fars)") +
    heat_map_theme

filename <- paste(c("Images/2000-2015_fatalities_calendar (AR, drunk).png"), collapse="")
ggsave(filename, bg = "transparent")

# Plot and save not drunk
ggplot(data_not_drunk, aes(week, weekdayf)) +
    geom_tile(aes(fill=FATALS), na.rm = FALSE) +
    facet_grid(year ~ ., drop = FALSE, switch="y") +
    scale_fill_gradient(name="Fatalities", low="yellow", high="red", na.value = alpha("white", 0.25), limits=c(min,max)) +
    scale_x_continuous(expand=c(0,0)) +
    labs(title = "2000-2015 Traffic Fatalities, Arkansas", x="", y="", subtitle="by Year (excludes drunk driving)", caption = "(based on data from NHTSA FARS: ftp://ftp.nhtsa.dot.gov/fars)") +
    heat_map_theme

filename <- paste(c("Images/2000-2015_fatalities_calendar (AR, not drunk).png"), collapse="")
ggsave(filename, bg = "transparent")

Conclusion

I hope you found these visualizations useful. I'd love your feedback on what insight you gathered from the visualizations, or any critiques on how to make them more useful.

I'm new to R, so if you're seeing these and have suggestions for how to do things more efficiently, I'd love to see your comments.

Which Arkansas counties are the most litigious?

I've posted before about the upcoming launch of Docket Dog, a case watching service for Arkansas state court cases. To me, one of the interesting things coming out of Docket Dog is the ability to look at different metrics for case filings in Arkansas.

I have been learning the Python programming language, which has some good visualization tools available. My latest exploration has been using Python to create state maps. I found this awesome tutorial by a guy with an equally awesome name (Nathan, of course!). I also wanted to view metrics per capita, so I downloaded the latest US Census data.

I wanted to figure out which Arkansas counties were the most sue-happy. So, I looked at the total number of cases filed in each county, divided by the population, and plotted the result. Each color band is a multiple of the average for that year.
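
The underlying arithmetic is simple. Here's a minimal sketch of the per-capita calculation and banding in R (the actual maps were drawn in Python, and the column names here are hypothetical):

# counties: one row per county, with hypothetical 'filings' and 'population' columns
counties$per_capita <- counties$filings / counties$population
avg <- mean(counties$per_capita)

# band each county as a multiple of the statewide average for the year
counties$band <- cut(counties$per_capita / avg,
    breaks = c(0, 0.5, 1, 1.5, 2, Inf),
    labels = c("<0.5x avg", "0.5-1x", "1-1.5x", "1.5-2x", ">2x"))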

2015 cases per capita. Click to enlarge.

2014 cases per capita. Click to enlarge.

2013 cases per capita. Click to enlarge.

2012 cases per capita. Click to enlarge.

2011 cases per capita. Click to enlarge.

2010 cases per capita. Click to enlarge.

What do you make of this? Clark County, my old stomping ground, is average from 2010 to 2012, but has been well above average the last couple of years.

Why do you think certain counties are more litigious than others?

How long will my case take?

Nathan here. I'm back for a guest post with some new tricks I've learned at my new job from some of the researchers at UAMS. I'm having a blast getting an inside look at cutting-edge biomedical research. This post looks at some data visualization about the time it takes to resolve civil tort cases in Arkansas.

Background:

One of the researchers has a master's degree in computer science, and I picked his brain a little bit about what software packages he likes to use. He prefers Python to Perl (which I like) because Python's research libraries are easier to use.

I took his recommendations to heart, and I've been tinkering with the Anaconda Python distribution using data I've gathered for another project I'm working on releasing very soon: Docket Dog, an Arkansas state court notification system. I used the data mining application Orange to perform some data visualization on the types of civil cases my dad and brother handle.

Arkansas Tort Case Length Analysis:

I took a look at over 98000 tort cases available electronically from the Administrative Office of the Courts for which I could calculate an end date. This is what the time frames look like:

Pendency of Arkansas tort cases in years. The scale is 20 years wide. Click to enlarge.

As you can see, civil court cases can take several years to resolve. We'll see what the averages look like here in a few minutes with another chart.
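
As an aside, the pendency measure here is just the gap between a case's filing and disposition dates. In R, that's a one-liner (hypothetical column names; the actual analysis was done in Python):

# pendency in years, given Date columns for filing and disposition
cases$years_pending <- as.numeric(cases$end_date - cases$filed_date) / 365.25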

In the meantime, there are several interesting patterns that appear in this chart. For instance, on the first line for product liability cases, there are several vertical bands around 9, 12, and 14–16 years. I haven't looked into this, but I suspect each band probably represents a settlement of a specific type of case, like Firestone exploding tire cases, Pinto exploding car cases, or something similar.

The declaratory judgment (dec action) line is notably shorter overall than the others. Again, I haven't researched this further, but I would expect it's because dec actions don't involve juries and are usually about a specific question of law. For instance, lots of dec actions involve whether there is insurance coverage for a particular event or not (the hilarious Luther Sutter v. Dennis Milligan dec action notwithstanding).

Now, on to the next chart. This is called a box chart:

Comparison of median Arkansas tort case values over the last 20 years. Click to enlarge.

This chart is broken up into quartiles. The light blue box represents the middle 50% of all cases. So, 50% of motor vehicle collision (MVC) cases are decided within 2 years, with the median value being 1.6 years. (The median is the middle value: if there were 101 cases, for instance, the median would be the 51st value.) The average MVC case length is shorter, at just over 1 year.

The dark blue lines represent maximum values, excluding outliers. The dots out to the right of the graph represent those outliers, which extend out to 20 years.
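
In R terms, the box edges and whiskers come straight from the quartiles. A toy example with made-up pendency values:

years <- c(0.4, 0.9, 1.6, 2.0, 3.1, 4.2, 19.5) # hypothetical case lengths in years
quantile(years) # 0%/25%/50%/75%/100% cut points; the 50% value is the median
mean(years)     # the average, which outliers like the 19.5 can pull around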

What's the bottom line? For three-quarters of tort cases, you can expect resolution to take anywhere from 6 months to 3 years. Another quarter of cases take up to 4 years or so. And there are always outliers that can take many, many years to reach ultimate resolution.

What questions do you have about this analysis?