Data information

Florida Department of Corrections Data (DOC) (2004-2016) Used 90.66%
Circuits Used 15%
Outliers Removed 40%
Crime Type Used 5%
Race Category Used66.7%
Number of Black Americans in Cleaned Dataset 38.5%

DAATE MVP Data:

For the MVP, the DAATE team used data from the Florida Department of Corrections (DOC) covering 2004-2016. The data contained 1.3 million rows and 290 columns. Because no data dictionary was available, the team relied heavily on its subject matter experts.



DAATE EDA:

Race Categories Used

Our MVP explores sentencing disparity between Black and White Americans. For that reason, DAATE currently uses only the rows whose race value is Black or White.
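The race filter can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the field name `race` and the label values are assumptions, since no data dictionary was available for the DOC export.

```python
# Hypothetical field name and labels; the real DOC export may differ.
RACE_FIELD = "race"
KEPT_RACES = {"black", "white"}


def keep_black_and_white(rows):
    """Return only the rows whose race value is Black or White."""
    kept = []
    for row in rows:
        # Normalize case and stray whitespace before comparing.
        value = str(row.get(RACE_FIELD, "")).strip().lower()
        if value in KEPT_RACES:
            kept.append(row)
    return kept


sample = [
    {"id": 1, "race": "Black"},
    {"id": 2, "race": "WHITE "},
    {"id": 3, "race": "Hispanic"},
    {"id": 4, "race": None},
]
filtered = keep_black_and_white(sample)
```

Normalizing the label before comparing is a defensive choice for a dataset without a data dictionary, where the same category may be spelled inconsistently.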


Removing invalid/incorrect data

During the EDA process, the team identified invalid and incorrect data and removed it from the dataset. This included:

  • Life Sentences: Life sentences created significant outliers in the sentencing outcome variable and showed signs of incorrect data. They represented only 0.26% of the raw data, so these rows were removed.
  • 0 Total Points: There were 3 cases with 0 total points. We assumed these to be incorrect data and removed the 3 corresponding rows.
  • 44+ Points & No Prison/Jail Time: Cases with 44+ points usually receive a state prison sentence as a baseline. Manual checks of cases with 44+ points but no prison or jail time indicated incorrect data, so these rows were removed.
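The three removal rules above could be expressed as a single predicate. This is a sketch under stated assumptions: the field names (`sentence_type`, `total_points`, `prison_months`, `jail_months`) and the `"life"` marker are hypothetical, and "44+" is read here as 44 or more.

```python
LIFE = "life"  # hypothetical marker for a life sentence


def is_invalid(row):
    """Apply the three removal rules to one record (field names assumed)."""
    if row["sentence_type"] == LIFE:
        return True  # life sentences were dropped
    if row["total_points"] == 0:
        return True  # 0 total points was treated as incorrect data
    no_time = row["prison_months"] == 0 and row["jail_months"] == 0
    if row["total_points"] >= 44 and no_time:
        return True  # 44+ points with no prison or jail time
    return False


def drop_invalid(rows):
    """Keep only the rows that pass all three rules."""
    return [row for row in rows if not is_invalid(row)]


sample = [
    {"id": 1, "sentence_type": "term", "total_points": 30.0,
     "prison_months": 0, "jail_months": 6},
    {"id": 2, "sentence_type": "life", "total_points": 120.0,
     "prison_months": 0, "jail_months": 0},
    {"id": 3, "sentence_type": "term", "total_points": 0.0,
     "prison_months": 0, "jail_months": 3},
    {"id": 4, "sentence_type": "term", "total_points": 60.0,
     "prison_months": 0, "jail_months": 0},
    {"id": 5, "sentence_type": "term", "total_points": 60.0,
     "prison_months": 24, "jail_months": 0},
]
valid = drop_invalid(sample)
```

In this toy sample, records 2, 3 and 4 each trip one of the rules, leaving records 1 and 5.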

Removing outliers

During the EDA process the team identified some outliers. This includes:

  • Total Points Outliers: Extreme values (thousands of points, max = 19,439) were identified and were not feasible given the other data. We used the z-score method to identify and remove these outliers (new max = 199.6).
  • Sentence Time Outliers: Some rows held potentially incorrect data (e.g. very few points but an extremely long sentence time). To handle them, the team bucketed the total points and removed sentence time outliers (via the z-score method) within each bucket.
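Both outlier steps can be sketched with the standard library. Several details here are assumptions rather than the team's settings: the z-score cutoff of 3, the bucket width of 10 points, the use of the population standard deviation, and the field names `total_points` and `sentence_months`.

```python
from statistics import mean, pstdev

Z_THRESHOLD = 3.0  # assumed cutoff; the source does not state the one used


def zscore_keep_mask(values, threshold=Z_THRESHOLD):
    """Return True for each value whose |z-score| is within the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # all values identical, so nothing can be an outlier
        return [True] * len(values)
    return [abs(v - mu) / sigma <= threshold for v in values]


def remove_outliers(rows, field, threshold=Z_THRESHOLD):
    """Drop rows whose `field` value is a z-score outlier among `rows`."""
    if not rows:
        return []
    mask = zscore_keep_mask([row[field] for row in rows], threshold)
    return [row for row, keep in zip(rows, mask) if keep]


def remove_outliers_by_bucket(rows, bucket_field, value_field,
                              bucket_width=10, threshold=Z_THRESHOLD):
    """Group rows into fixed-width buckets of `bucket_field`, then remove
    z-score outliers of `value_field` separately inside each bucket."""
    buckets = {}
    for row in rows:
        key = int(row[bucket_field] // bucket_width)
        buckets.setdefault(key, []).append(row)
    kept = []
    for key in sorted(buckets):
        kept.extend(remove_outliers(buckets[key], value_field, threshold))
    return kept


# Toy data: 20 low-point cases with 12-month sentences, one low-point case
# with a 600-month sentence, and 20 high-point cases with long sentences.
rows = [{"total_points": 5.0, "sentence_months": 12.0} for _ in range(20)]
rows.append({"total_points": 6.0, "sentence_months": 600.0})
rows += [{"total_points": 55.0, "sentence_months": 590.0 + i} for i in range(20)]

cleaned = remove_outliers_by_bucket(rows, "total_points", "sentence_months")
```

In the toy data, the 600-month sentence is removed from the low-points bucket, while sentences of similar length in the high-points bucket are kept. That is the motivation for bucketing: a sentence length that is typical for high total points can still be implausible for very low total points, and a single z-score over all rows would not catch it.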