Reddit extractor
The Reddit extractor, found in File > Extract > Other > Reddit, extracts data using the Reddit API (no API key is required). The extractor takes a thing as input, along with a set of rules that determine how it crawls the results returned by the API. Currently the extractor supports submissions, accounts, subreddits and link URLs as starting points for the extraction. The crawling options are as follows:
- Crawl the subreddit a link was posted at
- Crawl the comments posted as replies to a link
- Crawl the account used to post the link
- Crawl the link the comment was posted to
- Crawl the account used to post the comment
- Crawl the rest of the comment tree not returned at first ('load more comments' on the site)
- Crawl the links posted by the account
- Crawl the comments posted by the account
- Crawl the links posted in the subreddit
Note that certain combinations of crawl options, such as crawling both link accounts and account links, lead to effectively infinite recursion, which should be terminated manually.
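The recursion described above can be illustrated with a small sketch. This is not Wandora's implementation: the function `crawl`, the option names `link_account` and `account_links`, the `max_things` cap and the toy data are all hypothetical, chosen only to show why following links to accounts and accounts back to links keeps expanding. A set of already seen things stops exact revisits, but on the real site each account leads to new links, so a cap (or manual termination) is still needed.

```python
from collections import deque

def crawl(start, neighbours, enabled, max_things=1000):
    """Breadth-first crawl that visits each thing at most once.

    neighbours(thing) yields (option, target) pairs; only pairs whose
    option is in `enabled` are followed. All names are hypothetical.
    """
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_things:
        thing = queue.popleft()
        order.append(thing)
        for option, target in neighbours(thing):
            if option in enabled and target not in seen:
                seen.add(target)
                queue.append(target)
    return order

# Toy data standing in for API results (assumption, not real Reddit data).
TOY = {
    "link:1": [("link_account", "account:alice"),
               ("link_subreddit", "subreddit:python")],
    "account:alice": [("account_links", "link:1"),
                      ("account_links", "link:2")],
    "link:2": [("link_account", "account:alice")],
}

def toy_neighbours(thing):
    return TOY.get(thing, [])

result = crawl("link:1", toy_neighbours, {"link_account", "account_links"})
capped = crawl("link:1", toy_neighbours, {"link_account", "account_links"},
               max_things=2)
```

In the toy data the crawl ends after three things because the graph is tiny; `capped` shows the cap cutting the same crawl short after two.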
Submission search
The submission search presents a search dialog for finding a suitable submission to use as the starting point for the extraction. For any meaningful output, the appropriate crawling options should also be selected.
Example
We use the query shown above to extract details on the post 'What's the biggest programming mistake you've ever made?'
In addition to the link itself we want to extract details on the link submitter, the 'r/programming' subreddit, the comments made to the link and the accounts used to post the comments. The submission has over 1000 comments so we leave the comment tree crawling off to shorten the extraction time.
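To give a sense of what is skipped when comment tree crawling is left off, here is a minimal sketch that walks a comment listing in the shape the Reddit API returns it: `t1` children are loaded comments, while `more` children are stubs for replies that were not returned at first. The function `count_comments` and the sample data are hypothetical and only illustrate the structure; they are not taken from Wandora.

```python
def count_comments(listing):
    """Return (loaded, deferred) comment counts for a comment Listing."""
    loaded = 0
    deferred = 0
    for child in listing["data"]["children"]:
        if child["kind"] == "more":
            # Stub for comments that need a further request.
            deferred += child["data"]["count"]
        elif child["kind"] == "t1":
            loaded += 1
            replies = child["data"].get("replies")
            if replies:  # an empty string means the comment has no replies
                sub_loaded, sub_deferred = count_comments(replies)
                loaded += sub_loaded
                deferred += sub_deferred
    return loaded, deferred

# Hand-made sample in the API's shape (assumption, not real data).
sample = {
    "kind": "Listing",
    "data": {"children": [
        {"kind": "t1", "data": {"replies": {
            "kind": "Listing",
            "data": {"children": [
                {"kind": "t1", "data": {"replies": ""}},
                {"kind": "more", "data": {"count": 3}},
            ]},
        }}},
        {"kind": "more", "data": {"count": 5}},
    ]},
}

loaded, deferred = count_comments(sample)
```

Here two comments are loaded and eight are deferred behind `more` stubs; fetching the deferred ones is the extra work the comment tree crawling option adds.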
The results are then shown in Wandora.
The D3 graph visualization shows the extracted comment tree as the large cluster on the right and the accounts related to the link and comments as the large cluster on the left.
Subreddit search
The subreddit search presents a similar search dialog to find a suitable subreddit for a starting point for the extraction. The crawling options are uniform across all extraction methods.
Example
In the dialog above the 'r/Python' subreddit is found. We have chosen to extract the links posted in the subreddit, the comments on those submissions, and the accounts used to post the links and comments.
The extraction yields a set of links in the subreddit as well as the additional data the extractor was told to find.
The graph visualization shows the extracted links on the right with their respective destinations scattered around them. The links are connected to the large comment cluster at the bottom and to the account cluster above it.
Link search
The link search differs from the other extraction methods in that it takes a URL as input and uses it to search for submissions referencing that URL across Reddit. The crawling options are the same as for the other methods. If the Topic Map contains a set of topics with relevant subject locators, the locators may be used as the URLs to search.
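As a rough illustration of what a URL lookup looks like, the Reddit API's `/api/info` endpoint accepts a `url` parameter and returns submissions pointing at that URL. Whether Wandora uses this exact endpoint is an assumption; the helper `link_search_url` below is hypothetical and only builds the request URL without sending it.

```python
from urllib.parse import urlencode

def link_search_url(target_url):
    """Build a Reddit API lookup URL for submissions linking to target_url.

    The choice of endpoint is an assumption about how such a search could
    be made; no request is sent here.
    """
    return "https://www.reddit.com/api/info.json?" + urlencode({"url": target_url})

lookup = link_search_url("https://www.python.org/")
```

A subject locator taken from a topic would be passed in the same way as `target_url`.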
Example
We use the destinations extracted above through 'Add Context SLs' to see whether they have been submitted multiple times to r/Python or to other subreddits.
After the extraction we see the rest of the extractions as unselected items in the main panel and the subreddits appear in the topic tree in the side pane.
Account search
The account search simply takes a valid username as input and runs the extraction with that account as the starting point.
We may use the account search method to find links and comments posted by Bill Gates (thisisbillgates).