PASS: PushShift Archive Scripts, in Swift
A set of command-line tools to extract data from PushShift's Reddit archive files, which are typically large and difficult to work with. These scripts use streaming-decompression and multi-threading to work with the archives in a more reasonable way.
- filter - Takes a large, compressed PushShift archive and produces an uncompressed file containing only the subreddits of interest
- split - Takes a directory of uncompressed comments/submissions and converts them into a set of month-year files
So, to convert PushShift archives into a set of month-year files for the CTA toolkit:
./filter -i <Directory Containing Reddit Archives> -o <Temporary Directory>
./split -i <Temporary Directory> -o <Directory Where I'd Like My Output>
If you are working with a single set of files (_submissions and _comments) from the top 40k torrent, the above commands should produce output that works with the CTA toolkit. If you are working with archives containing more than one subreddit (or all subreddits), you will likely want to select only a subset of subreddits to include:
./filter -i <Directory Containing Reddit Archives> -o <Temporary Directory> -s <A comma-delimited list of subreddits>
./split -i <Temporary Directory> -o <Directory Where I'd Like My Output>
Filter
The expected input is a newline-delimited JSON (NDJSON) file, or a set of such files, compressed with ZSTD and containing either Reddit Comment or Submission data. The output is a corresponding file or set of files containing only data from a comma-delimited list of subreddits, in uncompressed NDJSON format.
USAGE: single-file --input-directory-path <input-directory-path> --output-directory-path <output-directory-path> [--verbose] [--subreddits <subreddits>]
OPTIONS:
-i, --input-directory-path <input-directory-path>
The directory to read files from.
-o, --output-directory-path <output-directory-path>
The directory to write output files to.
-v, --verbose Whether to output extra debug text.
-s, --subreddits <subreddits>
A comma-delimited list of subreddits to search for.
-h, --help Show help information.
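The filtering step can be sketched in a few lines of Swift. This is only an illustration of the idea, not the tool's actual implementation: it works on text that is already decompressed and held in memory, it skips the streaming ZSTD decoding and multi-threading, and the names `SubredditOnly`, `parseSubredditList` and `filterLines` are hypothetical. Matching subreddit names case-insensitively is also an assumption.

```swift
import Foundation

// Hypothetical record type: only the field needed for filtering is decoded.
struct SubredditOnly: Decodable {
    let subreddit: String
}

// Turns a comma-delimited list such as "swift, python" into a lowercase set.
func parseSubredditList(_ list: String) -> Set<String> {
    Set(list.split(separator: ",")
        .map { $0.trimmingCharacters(in: .whitespaces).lowercased() }
        .filter { !$0.isEmpty })
}

// Keeps only the NDJSON lines whose "subreddit" field is in `wanted`.
// Assumption: names are compared case-insensitively, and lines that
// cannot be parsed are dropped.
func filterLines(_ ndjson: String, keeping wanted: Set<String>) -> [String] {
    let decoder = JSONDecoder()
    return ndjson
        .split(separator: "\n")
        .map(String.init)
        .filter { line in
            guard let record = try? decoder.decode(SubredditOnly.self,
                                                   from: Data(line.utf8)) else {
                return false
            }
            return wanted.contains(record.subreddit.lowercased())
        }
}
```

Because each line is an independent JSON object, the same check could be applied one line at a time as data comes out of a decompressor, which is what makes a streaming approach possible for archives too large to hold in memory.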
Split
The expected input is an uncompressed newline-delimited JSON (NDJSON) file, or a set of such files, containing either Reddit Comment or Submission data. The output is a corresponding set of files containing the submissions or comments for each month-year time period, in uncompressed NDJSON format.
USAGE: single-file --input-directory-path <input-directory-path> --output-directory-path <output-directory-path> [--verbose] [--subreddits <subreddits>]
OPTIONS:
-i, --input-directory-path <input-directory-path>
The directory to read files from.
-o, --output-directory-path <output-directory-path>
The directory to write output files to.
-v, --verbose Whether to output extra debug text.
-h, --help Show help information.
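The grouping that the split step performs can be sketched as follows. This is a minimal in-memory illustration rather than the tool's actual code. It assumes each record carries a `created_utc` field holding epoch seconds, stored either as a number or as a numeric string, and it uses a "YYYY-MM" key for each bucket. The key format and the names `Timestamped`, `monthKey` and `groupByMonth` are hypothetical, and the real output file naming may differ.

```swift
import Foundation

// Hypothetical record type: reads "created_utc" as epoch seconds,
// accepting either a JSON number or a numeric string (an assumption).
struct Timestamped: Decodable {
    let seconds: Double

    enum CodingKeys: String, CodingKey {
        case createdUTC = "created_utc"
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        if let number = try? container.decode(Double.self, forKey: .createdUTC) {
            seconds = number
        } else {
            let text = try container.decode(String.self, forKey: .createdUTC)
            guard let parsed = Double(text) else {
                throw DecodingError.dataCorruptedError(
                    forKey: .createdUTC, in: container,
                    debugDescription: "created_utc is not numeric")
            }
            seconds = parsed
        }
    }
}

// Returns a "YYYY-MM" key in UTC for one NDJSON line, or nil if the
// line has no usable timestamp.
func monthKey(forLine line: String) -> String? {
    guard let record = try? JSONDecoder().decode(Timestamped.self,
                                                 from: Data(line.utf8)) else {
        return nil
    }
    var calendar = Calendar(identifier: .gregorian)
    calendar.timeZone = TimeZone(secondsFromGMT: 0)!
    let parts = calendar.dateComponents(
        [.year, .month], from: Date(timeIntervalSince1970: record.seconds))
    guard let year = parts.year, let month = parts.month else { return nil }
    let monthText = month < 10 ? "0\(month)" : "\(month)"
    return "\(year)-\(monthText)"
}

// Groups NDJSON lines into buckets keyed by month; unparseable lines are skipped.
func groupByMonth(_ ndjson: String) -> [String: [String]] {
    var buckets: [String: [String]] = [:]
    for line in ndjson.split(separator: "\n") {
        let text = String(line)
        guard let key = monthKey(forLine: text) else { continue }
        buckets[key, default: []].append(text)
    }
    return buckets
}
```

In a file-based version, each bucket would correspond to one output file, and lines would be appended as they are read instead of being collected in memory.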
Resources
These scripts are largely inspired by existing work:
- Academic Torrents for Reddit Data
- Academic Torrents for top 40k subreddits
- Watchful1's Pushshift Dump Scripts
- Arctic Shift
Author
Jim Wallace
License
Released under the MIT license.