@@ -5,13 +5,21 @@ A set of command-line tools to extract data from PushShift's Reddit archive file
* [filter](filter) - Takes a large, compressed PushShift archive and produces an uncompressed file with only subreddits of interest
* [split](split) - Takes a directory of uncompressed comments/submissions and converts them into a set of month-year files
So, to convert PushShift archives into a set of month-year files for the [CTA toolkit](https://github.com/rpgauthier/ComputationalThematicAnalysisToolkit):
```
./split -i <Temporary Directory> -o <Directory Where I'd Like My Output>
```
If you are working with one set of files (_submissions and _comments) from the top 40k torrent, the above commands should produce output that works with the CTA toolkit. If you are working with archives containing more than one subreddit (or all subreddits), you will likely want to select only a subset of subreddits to include:
```
./filter -i <Directory Containing Reddit Archives> -o <Temporary Directory> -s <A comma-delimited list of subreddits>
./split -i <Temporary Directory> -o <Directory Where I'd Like My Output>
```
## Filter
The expected input is a newline-delimited JSON (NDJSON) file or set of files compressed using ZSTD containing either Reddit comment or submission data. The output is a corresponding file or set of files containing only data from a comma-delimited list of subreddits, in uncompressed NDJSON format.
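
To make the filtering step concrete, here is a minimal Python sketch of the idea. It is not the `filter` tool's actual implementation. It assumes the lines have already been decompressed (reading ZSTD in Python would need a third-party library such as `zstandard`), and the helper names `parse_subreddit_list` and `filter_lines` are hypothetical. Matching subreddit names case-insensitively is also an assumption, not documented behavior.

```python
import json


def parse_subreddit_list(arg):
    """Split a comma-delimited string into a set of lowercase subreddit names."""
    return {name.strip().lower() for name in arg.split(",") if name.strip()}


def filter_lines(lines, subreddits):
    """Yield only the NDJSON lines whose "subreddit" field is in `subreddits`.

    Assumption: names are compared case-insensitively.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if str(record.get("subreddit", "")).lower() in subreddits:
            yield line


# Illustrative records only; real archive entries carry many more fields.
sample = [
    '{"subreddit": "AskHistorians", "body": "first"}',
    '{"subreddit": "pics", "body": "second"}',
    '{"subreddit": "science", "body": "third"}',
]

wanted = parse_subreddit_list("askhistorians, science")
kept = list(filter_lines(sample, wanted))
for line in kept:
    print(line)
```

Each kept line is passed through unchanged, so the result is still valid NDJSON that a later step can read line by line.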