Hello Steve
Today I updated Wandora's distribution packages at
http://www.wandora.org/wiki/Download . This was a silent release, and we will not make any further announcement about it. The update should fix the issue in Wandora's
File > Extract > HTML structures > HTML link extractor...
The bug you spotted was caused by a flaw in how handlers are registered for mime types: the WebCrawler was unable to register two handlers for one mime type. The HTML link extractor needs two handlers for the mime type
text/html. The first is used to crawl HTML documents and the second is used to extract links from them. The bug was that the second handler was overridden by the first. The WebCrawler now accepts multiple handlers per mime type, and the HTML link extractor should work properly with the option
Extract given urls and directly linked documents. I also added a new option to the extractor:
Extract given url and urls below. This new option restricts the WebCrawler to URLs that start with the initial URL. For example, if your initial URL is
http://www.wandora.org/wiki then the WebCrawler proceeds to the URL
http://www.wandora.org/wiki/Download but NOT to the URL
http://www.wandora.org/www. (Additional note: the URL
http://www.wandora.org/www is still included in Wandora's topic map after extraction, but only because the wiki contains it as a link.)
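In case it helps to see the two changes side by side, here is a rough sketch in Python. It is not Wandora's actual code (Wandora's WebCrawler is a separate implementation), and the names HandlerRegistry, register, dispatch and is_below are made up for illustration. It assumes the fix amounts to keeping a list of handlers per mime type, and that "urls below" means a plain string-prefix test on the initial URL.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical names throughout; this is not Wandora's WebCrawler API.
Handler = Callable[[str, str], None]  # called as handler(url, content)


class HandlerRegistry:
    """Keeps every handler registered for a mime type, in registration order."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def register(self, mime_type: str, handler: Handler) -> None:
        # Append rather than assign, so registering a second handler for
        # the same mime type does not cause either handler to be lost.
        self._handlers[mime_type].append(handler)

    def dispatch(self, mime_type: str, url: str, content: str) -> int:
        """Run all handlers for mime_type and return how many were run."""
        handlers = self._handlers.get(mime_type, [])
        for handler in handlers:
            handler(url, content)
        return len(handlers)


def is_below(initial_url: str, candidate_url: str) -> bool:
    """Assumed rule: a URL is 'below' if it starts with the initial URL."""
    return candidate_url.startswith(initial_url)


if __name__ == "__main__":
    registry = HandlerRegistry()
    log: List[str] = []
    # Two handlers for text/html: one crawls, one extracts links.
    registry.register("text/html", lambda url, content: log.append("crawl " + url))
    registry.register("text/html", lambda url, content: log.append("links " + url))
    registry.dispatch("text/html", "http://www.wandora.org/wiki", "<html></html>")
    print(log)

    start = "http://www.wandora.org/wiki"
    print(is_below(start, "http://www.wandora.org/wiki/Download"))  # True
    print(is_below(start, "http://www.wandora.org/www"))            # False
```

Note that a plain prefix test would also accept a URL such as http://www.wandora.org/wikipedia, because it compares strings and not path segments.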
If you run into problems with the HTML link extractor, please don't hesitate to drop us a line.
Kind Regards
Aki / Wandora Team