First draft of Reddit archive reading

Jason Zhao requested to merge data-pipe-v2 into main

Purpose of this MR:

  • To set up a basic pipeline that reads in test Reddit archive JSON files.

What is my solution?

  • I added 2 utility functions in the Test folder that read in the RedditComment or RedditSubmission files from the Test/Resource folder.
  • The main swiftNLP package has 2 new functions that decode JSON strings into the respective RedditComment or RedditSubmission data classes.
  • The goal is to separate the file-reading logic from the main logic of our package, i.e. users can pull data from whatever source they want and only hand the package JSON strings.
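
A minimal sketch of the separation described above, not the code in this MR. The trimmed-down RedditComment fields, the function names decodeRedditComment and readRedditComments, and the one-JSON-object-per-line file layout are all assumptions made for illustration.

```swift
import Foundation

// Hypothetical, trimmed-down model. The field names are assumed from
// common Reddit archive dumps and may differ from the package's type.
struct RedditComment: Codable, Equatable {
    let id: String
    let author: String?
    let body: String?
    let subreddit: String?
}

// Package side (hypothetical name): decodes one JSON string into a comment.
// It does no file or network access, so the caller chooses the data source.
func decodeRedditComment(from json: String) throws -> RedditComment {
    let data = Data(json.utf8)
    return try JSONDecoder().decode(RedditComment.self, from: data)
}

// Test side (hypothetical name): reads a local file and hands each line to
// the decoder. Assumes the file holds one JSON object per line.
func readRedditComments(at url: URL) throws -> [RedditComment] {
    let contents = try String(contentsOf: url, encoding: .utf8)
    return try contents
        .split(whereSeparator: \.isNewline)
        .map { try decodeRedditComment(from: String($0)) }
}
```

With this split, swapping local test files for a cloud source would only change where the JSON strings come from; the decoding function stays the same.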

What should the final version look like?

  • It is not a good idea to keep a lot of test files in the repo, so my suggestion is that we store the files in cloud storage.
  • During the initial test stage, I believe that just pulling data from the existing OneDrive resources is fine. However, accessing OneDrive through its API will require setting up an Azure app to get API keys and secret keys.
  • Some alternatives are AWS S3, Google Cloud, and Microsoft Azure Blob Storage, all of which are low-cost options for storing the files. However, I believe OneDrive should be enough for now.
