Commit 56e1fea9 authored by Jim Wallace

Update README.md

parent 13568e14
@@ -5,13 +5,21 @@ A set of command-line tools to extract data from PushShift's Reddit archive file
* [filter](filter) - Takes a large, compressed pushshift archive and produces an uncompressed file with only subreddits of interest
* [split](split) - Takes a directory of uncompressed comments/submissions and converts them into a set of month-year files
So, to convert PushShift archives into a set of month-year files for the CTA toolkit:
So, to convert PushShift archives into a set of month-year files for the [CTA toolkit](https://github.com/rpgauthier/ComputationalThematicAnalysisToolkit):
```
./filter -i <Directory Containing Reddit Archives> -o <Temporary Directory>
./split -i <Temporary Directory> -o <Directory Where I'd Like My Output>
```
If you are working with one set of files (_submissions and _comments) from the top 40k torrent, the above commands should produce output that works with the CTA toolkit. If you are working with archives containing more than one subreddit (or all subreddits), you will likely want to select only a subset of subreddits to include:
```
./filter -i <Directory Containing Reddit Archives> -o <Temporary Directory> -s <A comma-delimited list of subreddits>
./split -i <Temporary Directory> -o <Directory Where I'd Like My Output>
```
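As a rough illustration of what grouping records into month-year files involves, the sketch below buckets NDJSON lines by the month of their `created_utc` timestamp. It is not the `split` tool's actual implementation: the function names `month_key` and `group_by_month`, the `YYYY-MM` key format, and the assumption that every record carries a numeric (or numeric-string) `created_utc` field are all illustrative choices.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def month_key(record):
    """Return a 'YYYY-MM' key (hypothetical format) from a record's created_utc."""
    # created_utc may appear as an int, a float, or a numeric string.
    timestamp = int(float(record["created_utc"]))
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m")


def group_by_month(lines):
    """Group NDJSON lines into a dict keyed by month, preserving input order."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        groups[month_key(json.loads(line))].append(line)
    return dict(groups)
```

Each resulting bucket could then be written to its own file; how the real tool names and lays out those files is not shown here.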
## Filter
The expected input is a newline-delimited JSON (NDJSON) file or set of files compressed using ZSTD containing either Reddit Comment or Submission data. The output is a corresponding file or set of files containing only data from a comma-delimited list of subreddits, in uncompressed NDJSON format.
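The filtering step described above can be sketched in a few lines of Python. This is a minimal sketch, not the `filter` tool's own code: it assumes the input has already been decompressed into an iterable of text lines, that each record stores its subreddit name in a `subreddit` field, and that matching should be case-insensitive. The function name `filter_subreddits` is hypothetical.

```python
import json


def filter_subreddits(lines, subreddits):
    """Yield the NDJSON lines whose 'subreddit' is in a comma-delimited list."""
    # Normalise the comma-delimited list into a lowercase set (assumed behaviour).
    wanted = {name.strip().lower() for name in subreddits.split(",") if name.strip()}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the whole run
        if not isinstance(record, dict):
            continue  # only JSON objects can carry a subreddit field
        if str(record.get("subreddit", "")).lower() in wanted:
            yield line
```

Because the function is a generator, records are processed one line at a time, which matters for archives too large to hold in memory.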
......