This repository contains Feather and Parquet files derived from the most recent versions of the legacy and modern Supreme Court Database (SCDB) datasets. As discussed on the SCDB website, the SCDB is released annually in a variety of formats that differ from one another along several axes (time period, unit of analysis, database record granularity, and file format). This repository contains a minimally altered version of each of these datasets.
I've made an active effort to ensure that, apart from the datasets in the
`data/preprocessed` directory, the Feather and Parquet
files in this repository are faithful reproductions of those found in the
official releases.
They should differ from the official releases only in that:
- Human-readable strings are used instead of numeric codes for variable values. These strings match the ones found in the SPSS release.
- In string-valued and categorical columns, `np.nan` values are replaced by the description `'MISSING_VALUE'`.
- Variable data types are converted to accurate and more-or-less optimal (in terms of storage space) data types. This includes using the experimental `pd.StringDtype` from pandas.

As a result of this and, mostly, the general advantages of these file formats, the largest Feather and Parquet files we create here are 6.5 MB and 3.4 MB, respectively, roughly 1.7% and 6.5% the size of the largest `.sav` file from which we imported.
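To give a feel for the last two conversions, here is a minimal sketch on a toy frame. It is not the code used in this repository (that lives in the `data_pipeline` package); the helper name `label_missing`, the toy values, and the use of `pd.to_numeric` for downcasting are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

MISSING = "MISSING_VALUE"


def label_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which missing values in string-valued and
    categorical columns are replaced by the MISSING label.

    Hypothetical helper, written only to illustrate the idea.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if isinstance(col.dtype, pd.CategoricalDtype):
            # A categorical can only hold declared categories,
            # so register the label before filling with it.
            if MISSING not in col.cat.categories:
                col = col.cat.add_categories([MISSING])
            out[name] = col.fillna(MISSING)
        elif col.dtype == object or isinstance(col.dtype, pd.StringDtype):
            # Fill first, then store as the dedicated pandas string dtype.
            out[name] = col.fillna(MISSING).astype("string")
    return out


# Toy data; the values are made up for the example.
raw = pd.DataFrame(
    {
        "chief": pd.Categorical(["Warren", None, "Burger"]),
        "caseName": ["A v. B", np.nan, "C v. D"],
        "term": [1953, 1969, 1970],
    }
)

clean = label_missing(raw)
# Shrink integer columns to the smallest signed type that fits the values.
clean["term"] = pd.to_numeric(clean["term"], downcast="integer")

print(clean)
print(clean.dtypes)
```

Numeric columns are left alone by `label_missing`, so their missing values (if any) stay as they are.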
- `data/raw` contains the officially-released SPSS files from which I've derived datasets.
- `data/feather` contains all of the generated feathers.
- `data/parquet` contains, yep you guessed it, the parquet files.
- `data/preprocessed` contains a more refined version of the case-centric, citation-level dataset. This is a combination of the legacy and modern datasets that also includes some mild error correction and imputation work.

If you're curious for more details, all changes are documented in the repository's `dvc.yaml` file, the `data_pipeline` package and, with more prose, on my blog beginning with this post.

If you're interested in getting involved, contributions are welcomed, as are feature requests and issues!
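If you just want to get at the data, something along these lines should work from the repository root. This is a rough sketch: the helper names `list_datasets` and `load_dataset` are hypothetical, and I'm assuming the files carry `.feather` and `.parquet` extensions. Reading either format through pandas also requires `pyarrow` to be installed.

```python
from pathlib import Path

import pandas as pd

# Map file extensions (assumed) to the pandas reader for that format.
READERS = {".feather": pd.read_feather, ".parquet": pd.read_parquet}


def list_datasets(data_dir="data"):
    """List the generated files under data/feather and data/parquet."""
    root = Path(data_dir)
    return {
        fmt: sorted((root / fmt).glob(f"*.{fmt}"))
        for fmt in ("feather", "parquet")
    }


def load_dataset(path):
    """Read one Feather or Parquet file into a DataFrame."""
    path = Path(path)
    try:
        reader = READERS[path.suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {path.suffix!r}") from None
    return reader(path)


# Show what is available; nothing is read from disk beyond the listing.
for fmt, paths in list_datasets().items():
    print(fmt, [p.name for p in paths])
```

Pass any of the listed paths to `load_dataset` to get a `DataFrame` back.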
I'm not affiliated with the Supreme Court Database, and this project is not officially endorsed by members of the Supreme Court Database.