HTML Link Extractor

Posted: Wed Jan 30, 2013 6:17 am
by ShaunandSteve
Using the HTML Link Extractor with the extract option "exactly given urls" produces topic-map and instance-of objects for each link crawled from the given URL.

Using the same extractor with the extract option "given urls and directly linked documents", or any other option, produces nothing. It will crawl all the documents, but nothing is created.

What am I doing wrong?


I would like to choose any of the extract options and have every crawled URL become a topic-map/instance-of object.

Re: HTML Link Extractor

Posted: Thu Jan 31, 2013 5:07 pm
by akivela
Hi Steve

You are doing nothing wrong. It seems that Wandora has a bug in this feature. We'll ship a fix for it next week (before 2013-02-08).

Kind Regards,
Aki / Wandora Team

Re: HTML Link Extractor

Posted: Mon Feb 04, 2013 9:52 pm
by aki
Hello Steve

Today I updated Wandora's distribution packages (http://www.wandora.org/wiki/Download). This was a silent release, and we will not make any additional announcement about it. The update should fix the issue in Wandora's File > Extract > HTML structures > HTML link extractor...

The bug you spotted was caused by a flaw in how handlers are registered for MIME types. The WebCrawler was unable to register two handlers for one MIME type, and the HTML link extractor needs two handlers for text/html: the first crawls HTML documents and the second extracts links from them. The bug was that the second handler was overridden by the first. WebCrawler now accepts multiple handlers per MIME type, so the HTML link extractor should work properly with the option "Extract given urls and directly linked documents".

I also added a new option to the extractor, "Extract given url and urls below". This option forces the WebCrawler to crawl only URLs that start with the initial URL. For example, if your initial URL is http://www.wandora.org/wiki, the WebCrawler proceeds to http://www.wandora.org/wiki/Download but NOT to http://www.wandora.org/www. (Additional note: the URL http://www.wandora.org/www still ends up in Wandora's topic map after extraction, but only because the wiki contains it as a link.)
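For readers following along, here is a rough Java sketch of the kind of change described above. It is not Wandora's actual code: the class names (HandlerRegistrySketch, SingleHandlerRegistry, MultiHandlerRegistry), the ContentHandler interface and the isBelow helper are all made up for illustration. The thread does not show how the original code lost the second handler, so the single-slot variant simply assumes the first registration wins.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HandlerRegistrySketch {

    /** Hypothetical handler type; not Wandora's actual interface. */
    interface ContentHandler {
        void handle(String url, String content);
    }

    /** One slot per MIME type: a second registration for the same type is lost. */
    static final class SingleHandlerRegistry {
        private final Map<String, ContentHandler> handlers = new HashMap<>();

        void register(String mimeType, ContentHandler handler) {
            // Assumption: the first registration wins, mirroring the reported symptom.
            handlers.putIfAbsent(mimeType, handler);
        }

        int dispatch(String mimeType, String url, String content) {
            ContentHandler handler = handlers.get(mimeType);
            if (handler == null) {
                return 0;
            }
            handler.handle(url, content);
            return 1;
        }
    }

    /** A list per MIME type: every registered handler runs, in registration order. */
    static final class MultiHandlerRegistry {
        private final Map<String, List<ContentHandler>> handlers = new HashMap<>();

        void register(String mimeType, ContentHandler handler) {
            handlers.computeIfAbsent(mimeType, key -> new ArrayList<>()).add(handler);
        }

        int dispatch(String mimeType, String url, String content) {
            List<ContentHandler> registered =
                    handlers.getOrDefault(mimeType, Collections.<ContentHandler>emptyList());
            for (ContentHandler handler : registered) {
                handler.handle(url, content);
            }
            return registered.size();
        }
    }

    /** Plain prefix test, as in "crawl only URLs that start with the initial URL". */
    static boolean isBelow(String initialUrl, String candidateUrl) {
        return candidateUrl.startsWith(initialUrl);
    }

    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        MultiHandlerRegistry registry = new MultiHandlerRegistry();
        registry.register("text/html", (url, content) -> calls.add("crawl " + url));
        registry.register("text/html", (url, content) -> calls.add("extract links " + url));

        int invoked = registry.dispatch("text/html", "http://www.wandora.org/wiki", "<html></html>");
        System.out.println(invoked + " handlers ran: " + calls);

        System.out.println(isBelow("http://www.wandora.org/wiki", "http://www.wandora.org/wiki/Download"));
        System.out.println(isBelow("http://www.wandora.org/wiki", "http://www.wandora.org/www"));
    }
}
```

Note that a bare prefix test like isBelow would also accept a sibling path that merely shares the prefix (for example a path beginning with /wikipedia). Whether Wandora's URLMask handles that case differently is not stated in this thread.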

If you run into problems with the HTML link extractor, please don't hesitate to drop us a line.

Kind Regards
Aki / Wandora Team

Re: HTML Link Extractor

Posted: Fri Feb 08, 2013 1:06 am
by ShaunandSteve
Thank you for the quick reply and fix!

I'm trying to learn the mechanics of Wandora. If you don't mind, could you confirm that the only code change you made was to the AbstractExtractor.java class, specifically the setupCrawler() method?

Re: HTML Link Extractor

Posted: Fri Feb 08, 2013 5:13 pm
by akivela
Hi Steve

Actually, I changed many more source files. Most of the changes are in

org.wandora.piccolo.utils.crawler.AbstractCrawler
org.wandora.piccolo.utils.crawler.WebCrawler
org.wandora.piccolo.utils.crawler.URLMask

but also in AbstractExtractor, as you pointed out.

Kind Regards,
Aki / Grip

Edit: Feel free to ask, if you have questions related to Wandora's architecture.