HTML Link Extractor


HTML Link Extractor

Postby ShaunandSteve » Wed Jan 30, 2013 6:17 am

Using the HTML Link Extractor with the extract option "Exactly given urls" produces a topic map entry (a topic with an instance-of association) for each link crawled in the given URL.

Using the same extractor with the extract option "Given urls and directly linked documents", or any other option, produces nothing: it crawls all the documents, but no topics are created.

What am I doing wrong?


I would like to choose any of the extract options and have every crawled URL become a topic-map/instance-of object.
ShaunandSteve
 
Posts: 2
Joined: Wed Jan 30, 2013 6:00 am

Re: HTML Link Extractor

Postby akivela » Thu Jan 31, 2013 5:07 pm

Hi Steve

You are doing nothing wrong. It seems that Wandora has a bug in this feature. We'll ship a fix for this bug within the next week (before 2013-02-08).

Kind Regards,
Aki / Wandora Team
akivela
Site Admin
 
Posts: 260
Joined: Tue Sep 18, 2007 10:20 am
Location: Helsinki, Finland

Re: HTML Link Extractor

Postby aki » Mon Feb 04, 2013 9:52 pm

Hello Steve

Today I updated Wandora's distribution packages at http://www.wandora.org/wiki/Download . This was a silent release; we won't make any additional announcement about it. The update should fix the issue in Wandora's File > Extract > HTML structures > HTML link extractor...

The bug you spotted was caused by inadequate behaviour when registering MIME-type-specific handlers: the WebCrawler was unable to register two handlers for one MIME type. The HTML link extractor needs two handlers for the MIME type text/html, the first to crawl HTML documents and the second to extract links out of them. The bug was that the second handler was overridden by the first. Now the WebCrawler accepts multiple handlers per MIME type, and the HTML link extractor should work properly with the option "Extract given urls and directly linked documents".

I also added a new option to the extractor, "Extract given url and urls below". This option forces the WebCrawler to crawl only URLs that start with the initial URL. For example, if your initial URL is http://www.wandora.org/wiki then the WebCrawler proceeds to http://www.wandora.org/wiki/Download but NOT to http://www.wandora.org/www. (Additional note: the URL http://www.wandora.org/www is actually included in Wandora's topic map after extraction, but only because the wiki contains that URL as a link.)
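In sketch form, the idea is roughly this. Note this is a simplified, hypothetical illustration, not Wandora's actual WebCrawler source; the class and method names below are made up for the example. It shows a registry that keeps a list of handlers per MIME type (so a second text/html handler no longer replaces the first) and a prefix check like the one behind the new "Extract given url and urls below" option:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the fix; names are illustrative, not Wandora's API.
public class CrawlerSketch {

    // A content handler for a crawled document.
    interface CrawlHandler {
        void handle(String url, String content);
    }

    // Registry that keeps a LIST of handlers per MIME type, so
    // registering a second text/html handler no longer overwrites
    // the first (which was the bug described above).
    static class HandlerRegistry {
        private final Map<String, List<CrawlHandler>> handlers = new HashMap<>();

        void register(String mimeType, CrawlHandler handler) {
            handlers.computeIfAbsent(mimeType, k -> new ArrayList<>()).add(handler);
        }

        List<CrawlHandler> handlersFor(String mimeType) {
            return handlers.getOrDefault(mimeType, List.of());
        }
    }

    // Mask for the "Extract given url and urls below" option:
    // only URLs that start with the initial URL are crawled.
    static boolean withinInitialUrl(String initialUrl, String candidate) {
        return candidate.startsWith(initialUrl);
    }

    public static void main(String[] args) {
        HandlerRegistry registry = new HandlerRegistry();
        // One handler crawls HTML documents, another extracts links from them.
        registry.register("text/html", (url, content) -> System.out.println("crawl " + url));
        registry.register("text/html", (url, content) -> System.out.println("extract links from " + url));
        System.out.println(registry.handlersFor("text/html").size()); // 2: both handlers retained

        System.out.println(withinInitialUrl("http://www.wandora.org/wiki",
                                            "http://www.wandora.org/wiki/Download")); // true
        System.out.println(withinInitialUrl("http://www.wandora.org/wiki",
                                            "http://www.wandora.org/www"));           // false
    }
}
```

The essential change is storing handlers in a list-valued map instead of a plain map, so both the crawling handler and the link-extracting handler survive registration.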

If you face problems with the HTML link extractor, please don't hesitate to drop a line.

Kind Regards
Aki / Wandora Team
aki
 
Posts: 12
Joined: Mon Jun 28, 2010 2:40 pm

Re: HTML Link Extractor

Postby ShaunandSteve » Fri Feb 08, 2013 1:06 am

Thank you for the quick reply and fix!

I'm trying to learn the mechanics of Wandora. If you don't mind, could you confirm that the only code change you made was to the AbstractExtractor.java class, specifically the setupCrawler() method?
ShaunandSteve

Re: HTML Link Extractor

Postby akivela » Fri Feb 08, 2013 5:13 pm

Hi Steve

Actually I changed many more source files. Most changes are in

org.wandora.piccolo.utils.crawler.AbstractCrawler
org.wandora.piccolo.utils.crawler.WebCrawler
org.wandora.piccolo.utils.crawler.URLMask

but also in AbstractExtractor, as you pointed out.

Kind Regards,
Aki / Grip

Edit: Feel free to ask, if you have questions related to Wandora's architecture.
akivela
Site Admin

