Reddit extractor


The Reddit extractor, found in File > Extract > Other > Reddit, extracts data using the Reddit API. No API key is required. The extractor takes a Reddit thing as input, together with a set of rules that determine how the extractor crawls the results returned by the API. Currently the extractor supports submissions, accounts, subreddits and link URLs as starting points for the extraction. The crawling options are as follows:

  • Crawl the subreddit a link was posted at
  • Crawl the comments posted as replies to a link
  • Crawl the account used to post the link
  • Crawl the link the comment was posted to
  • Crawl the account used to post the comment
  • Crawl the rest of the comment tree not returned at first ('load more comments' on the site)
  • Crawl the links posted by the account
  • Crawl the comments posted by the account
  • Crawl the links posted in the subreddit

Note that certain combinations of crawl options, such as crawling both link accounts and account links, lead to effectively infinite recursion, which must be terminated manually.
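One way to reason about which combinations are risky is to treat each crawl option as an edge between kinds of Reddit things and check whether the enabled options form a loop. The sketch below is only an illustration of that idea, not how the extractor is implemented. The option names are hypothetical labels for the list above, and the 'load more comments' option is left out because a comment tree is finite.

```python
# Hypothetical labels for the crawl options listed above. Each option
# follows an edge from one kind of Reddit thing to another.
CRAWL_OPTIONS = {
    "link_subreddit": ("link", "subreddit"),
    "link_comments": ("link", "comment"),
    "link_account": ("link", "account"),
    "comment_link": ("comment", "link"),
    "comment_account": ("comment", "account"),
    "account_links": ("account", "link"),
    "account_comments": ("account", "comment"),
    "subreddit_links": ("subreddit", "link"),
}


def has_cycle(enabled):
    """Return True if the enabled options can lead back to a kind already crawled."""
    graph = {}
    for name in enabled:
        src, dst = CRAWL_OPTIONS[name]
        graph.setdefault(src, set()).add(dst)

    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # reached a kind that is still being expanded
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(graph))


# The combination named in the note above loops between links and accounts.
risky = has_cycle(["link_account", "account_links"])

# Link -> subreddit, comments, submitter and comment accounts has no loop.
safe = has_cycle(
    ["link_subreddit", "link_comments", "link_account", "comment_account"]
)
```

A loop between kinds does not by itself prove that a crawl never ends, but it marks the combinations that can keep discovering new things and may need to be stopped by hand.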


Submission search

The submission search dialog

The submission search presents a search dialog for finding a suitable submission to use as the starting point for the extraction. For any meaningful output, the appropriate crawling options should also be selected.


We use the query shown above to extract details on the post 'What's the biggest programming mistake you've ever made?'

Reddit 2 2.png

In addition to the link itself, we want to extract details on the link submitter, the 'r/programming' subreddit, the comments made on the link and the accounts used to post the comments. The submission has over 1000 comments, so we leave the comment tree crawling off to shorten the extraction time.

Reddit 3.png

The results are then shown in Wandora.

Reddit 4.png

The D3 graph visualization shows the extracted comment tree as the large cluster on the right and the accounts related to the link and comments as the large cluster on the left.

Subreddit search

The subreddit search dialog

The subreddit search presents a similar search dialog for finding a suitable subreddit to use as the starting point for the extraction. The crawling options are the same across all extraction methods.


In the dialog above, the 'r/Python' subreddit is found. We have chosen to extract the links found in the subreddit, the comments on those submissions, and the accounts used to post the links and comments.

Reddit 6.png

The extraction yields a set of links in the subreddit as well as the additional data the extractor was told to find.

Reddit 7.png

The graph visualization shows the extracted links on the right, with their respective destinations scattered around them. The links are connected to the large comment cluster at the bottom and to the account cluster above it.

Link search

The link search differs from the other extraction methods in that it takes a URL as input and uses it to search for submissions referencing that URL across Reddit. The crawling options remain the same with the link search method. If the Topic Map contains a set of topics with relevant subject locators, the locators may be used as the URLs to search.
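For readers who want to see what such a URL lookup can look like at the API level, here is a minimal sketch. It assumes the lookup goes through Reddit's public info endpoint with a url parameter. Whether the extractor uses exactly this endpoint is an assumption, and the function names and the sample response are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlencode


def link_search_url(target_url):
    """Build a lookup URL for submissions that point at target_url."""
    return "https://www.reddit.com/api/info.json?" + urlencode({"url": target_url})


def submissions_per_subreddit(listing):
    """Count link submissions (kind 't3') per subreddit in a Listing response."""
    children = listing.get("data", {}).get("children", [])
    return Counter(
        child["data"]["subreddit"]
        for child in children
        if child.get("kind") == "t3"
    )


# Made-up response fragment, shaped like a Reddit Listing, for illustration only.
sample = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"subreddit": "Python", "title": "First post"}},
            {"kind": "t3", "data": {"subreddit": "Python", "title": "Repost"}},
            {"kind": "t3", "data": {"subreddit": "programming", "title": "Crosspost"}},
        ]
    },
}

counts = submissions_per_subreddit(sample)
```

Counting submissions per subreddit in this way corresponds to checking whether the same destination has been posted several times, either within one subreddit or across several.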


Reddit 8.png

Reddit 9 2.png

We use the destinations extracted above through 'Add Context SLs' to see if they have been submitted multiple times to r/Python or to other subreddits.

Reddit 10.png

After the extraction, the rest of the extracted items appear unselected in the main panel, and the subreddits appear in the topic tree in the side pane.

Account search

The account search simply takes a valid username as input and runs the extraction with that account as the starting point.

Reddit 11 2.png

Reddit 12.png

We may use the account search method to find links and comments posted by Bill Gates (thisisbillgates).
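As a rough illustration of what an account-based starting point maps to, the sketch below builds the public JSON listing URLs for an account's profile, submitted links and comments. The helper name is hypothetical, and whether the extractor requests exactly these URLs is an assumption.

```python
from urllib.parse import quote

BASE = "https://www.reddit.com"


def account_urls(username):
    """Return the JSON URLs for an account's profile, links and comments."""
    name = quote(username, safe="")  # escape anything unsafe in a URL path
    return {
        "about": f"{BASE}/user/{name}/about.json",
        "links": f"{BASE}/user/{name}/submitted.json",
        "comments": f"{BASE}/user/{name}/comments.json",
    }


urls = account_urls("thisisbillgates")
```

The "links" and "comments" entries correspond to the two account crawl options: links posted by the account and comments posted by the account.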
