Decoding injustice: Unraveling the complexities of sentencing data
Too often, we overlook basic data exploration
The United States is leading the world in a race no one wants to win: it’s the most prison-happy nation on the planet. Beyond the immediate costs of locking people up, the ripple effects are enormous. The impacts hit incarcerated individuals’ families, their neighborhoods, and even future generations. Wrapping our heads around our criminal sentencing habits is a matter of national health. There’s already a huge body of work that uses statistical methods to dig into sentencing inequalities, especially around race. But that work is typically deductive. Inductive methods such as exploratory data analysis could add a valuable dimension to this existing body of knowledge.
Here’s the catch. Sentencing data is chock-full-o’-variables, making traditional approaches like plots or cross-tabulations unworkable because of the massive number of possible combinations.
Consider the United States Sentencing Commission (USSC), which offers public access to federal sentencing data. Each data record represents one defendant in one case, covering variables that span administrative information, demographics, sentencing outcomes, and other aspects. There is one data file per year. Each file houses tens of thousands of records, roughly 64,000 to 90,000, and each year’s file carries a slightly different set of variables, sometimes numbering in the tens of thousands. Across all years, I found roughly 2,400 variables in common.
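For the curious, here is a minimal sketch of how one might tally the variables shared across years. It assumes the yearly files have been downloaded locally as SAS files; the file names and the span of years are hypothetical placeholders, not the USSC’s actual naming scheme.

```python
# Minimal sketch: count the variables common to every yearly USSC file.
# The file names and span of years below are hypothetical placeholders;
# adjust them to however you have stored the downloads.
from functools import reduce
import pandas as pd

years = range(2006, 2021)  # hypothetical span of yearly files

columns_by_year = []
for year in years:
    df = pd.read_sas(f"ussc_{year}.sas7bdat")  # hypothetical file name
    columns_by_year.append(set(df.columns))

common_vars = reduce(set.intersection, columns_by_year)
print(len(common_vars))  # roughly 2,400 for the files described above
```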
While exploratory data analysis is an important preliminary step before statistical modeling, machine learning, and other quantitative approaches, traditional data exploration methods struggle with high-dimensional data. With less complex data sets, a typical strategy is to examine pairwise interactions of variables. With 2,400 variables, there are 2.9 million possible pairings, and it’s impossible to discern which of them are worthwhile to study. The scenario worsens if we want to consider relationships among three variables, which for our 2,400 variables creates an overwhelming 2.3 billion potential combinations.
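These counts are easy to verify with a couple of lines of Python’s standard library; the 2,400 figure is the number of common variables mentioned above.

```python
# Back-of-the-envelope check of the combinatorics described above.
from math import comb

n_vars = 2400  # variables common to all years of USSC data

print(f"{comb(n_vars, 2):,} pairs")    # 2,878,800  (~2.9 million)
print(f"{comb(n_vars, 3):,} triples")  # 2,301,120,800  (~2.3 billion)
```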
It’s not just the sheer number of variables that creates complexity, but also the number of possible values each variable can take. In the USSC data, for example, four important variables are the defendant’s criminal history, the offense type, the defendant’s race, and the judicial district in which the case is heard. The criminal history variable has 6 possible values, the offense type has 30, the defendant’s race has 8, and there are 94 judicial districts. The cross-tabulation of just these four variables produces over 135,000 cells for comparison.
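The cell count is simply the product of the level counts. Here is the arithmetic, with descriptive labels standing in for the actual USSC variable names.

```python
# How many cells does the four-way cross-tabulation contain? The level counts
# come from the text; the dictionary keys are descriptive labels, not the
# actual USSC field names.
from math import prod

levels = {
    "criminal_history": 6,
    "offense_type": 30,
    "defendant_race": 8,
    "judicial_district": 94,
}

print(f"{prod(levels.values()):,} cells")  # 135,360
```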
Amid such complexity, researchers often resort to well-established tools like principal component analysis (PCA), multiple correspondence analysis (MCA), and factor analysis of mixed data (FAMD). These so-called dimensionality reduction techniques aim to simplify the description of the data by exploiting associations between variables. Despite their wide use in the natural and social sciences, such methods are rarely employed in criminal justice research and, to my knowledge, never in sentencing studies. Even these methods have their limitations, though, and can struggle with complex data sets. Dimension reduction remains an active research area within statistics, applied mathematics, and data science. In my view, exploratory data analysis is a vastly underutilized and underappreciated tool in law, criminology, and criminal justice. Before we leap into causal analyses and predictive models, we should better understand the structure of our complex data.
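For readers who want to experiment, here is a minimal sketch of what a first pass might look like, using ordinary PCA from scikit-learn on one-hot-encoded categorical variables. It is an illustration under stated assumptions, not a recommended pipeline: for mostly categorical sentencing data, MCA or FAMD is generally better suited, and the tiny DataFrame and its column names below are hypothetical stand-ins for a cleaned subset of USSC records.

```python
# Illustrative sketch only: PCA on one-hot-encoded categorical variables.
# The DataFrame and column names are hypothetical stand-ins for USSC records.
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "criminal_history": ["I", "III", "VI", "I"],
    "offense_type": ["drug", "fraud", "firearms", "drug"],
    "defendant_race": ["Black", "White", "Hispanic", "White"],
})

X = pd.get_dummies(df)         # expand categories into indicator columns
pca = PCA(n_components=2)      # keep two summary dimensions
coords = pca.fit_transform(X)  # low-dimensional coordinates per record

print(pca.explained_variance_ratio_)  # share of variance each component captures
print(coords)
```

On real data you would, of course, feed in the full set of records rather than four toy rows, and the low-dimensional coordinates would become the starting point for plots and further exploration.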
Your neighbor,
Chad