
PASS: PushShift Archive Scripts, in Swift

A set of command-line tools for extracting data from PushShift's Reddit archive files, which are typically large and difficult to work with. These scripts use streaming decompression and multi-threading so that an archive can be processed incrementally rather than decompressed in full first.
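
To illustrate the streaming idea: a decompressor typically hands back bytes in arbitrary chunks, so NDJSON records have to be reassembled across chunk boundaries before they can be parsed. The Swift sketch below shows one way to do that. It is illustrative only and not taken from the PASS source: it omits ZSTD decompression entirely, and `LineBuffer` is a hypothetical name.

```swift
import Foundation

/// Accumulates arbitrary byte chunks and hands back complete lines.
/// Hypothetical helper for illustration; not part of the PASS tools.
struct LineBuffer {
    private var pending = Data()

    /// Appends a chunk and returns every line it completes (without the "\n").
    mutating func append(_ chunk: Data) -> [String] {
        pending.append(chunk)
        var lines: [String] = []
        // 0x0A is "\n"; splitting on bytes keeps multi-byte UTF-8 characters intact.
        while let newline = pending.firstIndex(of: 0x0A) {
            let lineData = pending[pending.startIndex..<newline]
            lines.append(String(decoding: lineData, as: UTF8.self))
            pending.removeSubrange(pending.startIndex...newline)
        }
        return lines
    }

    /// Returns any trailing text that was not terminated by a newline.
    mutating func finish() -> String? {
        let rest: String? = pending.isEmpty ? nil : String(decoding: pending, as: UTF8.self)
        pending.removeAll()
        return rest
    }
}

// Two chunks that split the second record in the middle.
var buffer = LineBuffer()
let first = buffer.append(Data("{\"id\":1}\n{\"id\"".utf8))
let second = buffer.append(Data(":2}\n".utf8))
print(first, second)
```

Each call returns only the records that are complete so far, so memory use is bounded by the chunk size plus the longest single line rather than by the size of the archive.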

  • filter - Takes a large, compressed PushShift archive and produces an uncompressed file containing only the subreddits of interest
  • split - Takes a directory of uncompressed comments/submissions and converts them into a set of month-year files

So, to convert PushShift archives into a set of month-year files for the CTA toolkit:

./filter -i <Directory Containing Reddit Archives> -o <Temporary Directory>
./split -i <Temporary Directory> -o <Directory Where I'd Like My Output>

If you are working with a single set of files (_submissions and _comments) from the top 40k torrent, the above commands should produce output that works with the CTA toolkit. If you are working with archives containing more than one subreddit (or all subreddits), you will likely want to select only a subset of subreddits to include:

./filter -i <Directory Containing Reddit Archives> -o <Temporary Directory> -s <A comma-delimited list of subreddits>
./split -i <Temporary Directory> -o <Directory Where I'd Like My Output>

Filter

The expected input is a newline-delimited JSON (NDJSON) file or set of files compressed using ZSTD containing either Reddit Comment or Submission data. The output is a corresponding file or set of files containing only data from a comma-delimited list of subreddits, in uncompressed NDJSON format.

USAGE: single-file --input-directory-path <input-directory-path> --output-directory-path <output-directory-path> [--verbose] [--subreddits <subreddits>]

OPTIONS:
  -i, --input-directory-path <input-directory-path>
                          The directory to read files from.
  -o, --output-directory-path <output-directory-path>
                          The directory to write output files to.
  -v, --verbose           Whether to output extra debug text.
  -s, --subreddits <subreddits>
                          A comma-delimited list of subreddits to search for.
  -h, --help              Show help information.
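
The filtering step described above can be sketched in a few lines of Swift. This is a minimal illustration under stated assumptions, not the tool's actual implementation: it assumes each record carries a string `subreddit` field (as PushShift comment and submission records do), matches names case-insensitively, and treats an empty list as "keep everything". The names `SubredditField`, `parseSubreddits`, and `filterLines` are hypothetical.

```swift
import Foundation

/// Decodes only the field the filter needs; other keys in the record are ignored.
struct SubredditField: Decodable {
    let subreddit: String
}

/// Parses a comma-delimited list (as passed to -s) into a lowercase set.
/// Trimming whitespace and ignoring case are assumptions of this sketch.
func parseSubreddits(_ list: String) -> Set<String> {
    let names = list.split(separator: ",").map {
        $0.trimmingCharacters(in: .whitespaces).lowercased()
    }
    return Set(names.filter { !$0.isEmpty })
}

/// Keeps the NDJSON lines whose "subreddit" value is in `wanted`.
/// An empty set keeps every decodable line; lines that fail to decode are dropped.
func filterLines(_ lines: [String], keeping wanted: Set<String>) -> [String] {
    let decoder = JSONDecoder()
    return lines.filter { line in
        guard let record = try? decoder.decode(SubredditField.self, from: Data(line.utf8)) else {
            return false
        }
        return wanted.isEmpty || wanted.contains(record.subreddit.lowercased())
    }
}

let input = [
    #"{"subreddit":"swift","body":"hello"}"#,
    #"{"subreddit":"AskReddit","body":"hi"}"#,
]
let kept = filterLines(input, keeping: parseSubreddits("Swift, programming"))
print(kept.count)  // prints 1
```

The original line is passed through unchanged rather than re-encoded, so the output stays byte-for-byte identical to the matching input records.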

Split

The expected input is an uncompressed newline-delimited JSON (NDJSON) file or set of files containing either Reddit Comment or Submission data. The output is a corresponding set of files containing the submissions or comments for each month-year period, in uncompressed NDJSON format.

USAGE: single-file --input-directory-path <input-directory-path> --output-directory-path <output-directory-path> [--verbose]

OPTIONS:
  -i, --input-directory-path <input-directory-path>
                          The directory to read files from.
  -o, --output-directory-path <output-directory-path>
                          The directory to write output files to.
  -v, --verbose           Whether to output extra debug text.
  -h, --help              Show help information.
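
The month-year grouping described above can be sketched as follows. This is an illustration under assumptions rather than the tool's actual code: it assumes each record has a `created_utc` field holding Unix epoch seconds (stored as either a number or a numeric string), buckets by UTC calendar month, and uses a `YYYY-MM` key. The key format and the names `Timestamped`, `monthYearKey`, and `groupByMonth` are hypothetical.

```swift
import Foundation

/// Decodes "created_utc", accepting either a JSON number or a numeric string.
struct Timestamped: Decodable {
    let createdUTC: TimeInterval

    enum CodingKeys: String, CodingKey {
        case createdUTC = "created_utc"
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        if let number = try? container.decode(Double.self, forKey: .createdUTC) {
            createdUTC = number
        } else if let text = try? container.decode(String.self, forKey: .createdUTC),
                  let number = Double(text) {
            createdUTC = number
        } else {
            throw DecodingError.dataCorruptedError(
                forKey: .createdUTC, in: container,
                debugDescription: "created_utc is neither a number nor a numeric string")
        }
    }
}

/// Maps Unix epoch seconds to a "YYYY-MM" bucket name, using UTC.
func monthYearKey(forEpochSeconds seconds: TimeInterval) -> String {
    var calendar = Calendar(identifier: .gregorian)
    calendar.timeZone = TimeZone(secondsFromGMT: 0)!
    let parts = calendar.dateComponents([.year, .month], from: Date(timeIntervalSince1970: seconds))
    return String(format: "%04d-%02d", parts.year!, parts.month!)
}

/// Groups NDJSON lines by month-year; lines without a usable timestamp are skipped.
func groupByMonth(_ lines: [String]) -> [String: [String]] {
    let decoder = JSONDecoder()
    var buckets: [String: [String]] = [:]
    for line in lines {
        guard let record = try? decoder.decode(Timestamped.self, from: Data(line.utf8)) else {
            continue
        }
        buckets[monthYearKey(forEpochSeconds: record.createdUTC), default: []].append(line)
    }
    return buckets
}

let sample = [
    #"{"created_utc":1609459199,"body":"last second of 2020"}"#,
    #"{"created_utc":"1609459200","body":"first second of 2021"}"#,
]
let buckets = groupByMonth(sample)
print(buckets.keys.sorted())  // prints ["2020-12", "2021-01"]
```

A real tool working on large inputs would append each line to the matching output file as it is read instead of holding every bucket in memory; the dictionary here only keeps the example short.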

Resources

These scripts are largely inspired by existing work:

Author

Jim Wallace

License

Released under the MIT license.