public class WebCrawler extends AbstractCrawler implements java.lang.Runnable, CrawlerAccess, Crawler
A WebCrawler is given one or more ContentHandler objects, a start page (or several start pages), and a URLMask. Files are retrieved with URLs and the content is then passed to one of the given ContentHandlers according to the content-type of the URL. If no suitable ContentHandler is found, the file is simply discarded.
ContentHandlers may add other pages to the crawl queue according to the links they find in the content they are processing. The WebCrawler ensures that the same page is not crawled more than once.
WebCrawler implements Runnable and can be used with multiple threads. Simply initialize the WebCrawler object, then create as many Threads as you want with the single WebCrawler instance and start each with Thread.start(). Use Thread.isAlive() to tell whether a thread is still running.
Using more than one thread can be more efficient because threads can download and parse pages simultaneously.

See Also: ContentHandler, URLMask, CrawlerAccess
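A minimal usage sketch based on the description above. The seed URL, crawl depth, and thread count are illustrative; ContentHandler and URLMask setup is omitted because the signatures of the inherited addHandler and setMask methods are not shown on this page.

```java
// Usage sketch only; assumes the Wandora crawler classes are on the classpath.
public class CrawlExample {
    public static void main(String[] args) throws InterruptedException {
        WebCrawler crawler = new WebCrawler();
        // ContentHandler and URLMask setup omitted (see the inherited
        // addHandler and setMask methods below).
        crawler.add("http://www.example.com/", 3); // illustrative seed page and depth

        // One WebCrawler instance shared by several threads.
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(crawler);
            threads[i].start();
        }

        // Thread.isAlive() tells whether a crawl thread is still running.
        for (Thread t : threads) {
            while (t.isAlive()) {
                Thread.sleep(500);
            }
        }
        System.out.println("Pages processed: " + crawler.pagesProcessed());
    }
}
```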
| Modifier and Type | Field and Description |
|---|---|
| private boolean | checkDonePages |
| private int | counter |
| private java.util.HashSet | donePages |
| static int | HTTP_UNAUTHORIZED_INTERRUPTION |
| protected int | numThreads |
| private java.util.HashMap | pagesDone |
| private int | processing |
| private java.util.LinkedList<Tuples.T2<java.lang.Object,java.lang.Integer>> | queue |
| private java.util.HashMap | timeTaken |
Fields inherited from class AbstractCrawler: forceExit, handleCount
| Constructor and Description |
|---|
| WebCrawler() Creates a new WebCrawler. |
| Modifier and Type | Method and Description |
|---|---|
| void | add(java.lang.Object crawlObject, int depth) Adds a URL to the crawler's queue. |
| void | addObject(java.lang.Object data) Gives any object constructed from the crawled page to the callback object. |
| void | clearQueue() Clears the crawl queue. |
| void | crawl() Starts crawling the pages added to the queue with addPageToQueue. |
| void | crawl(java.lang.Object crawlObject) Starts crawling by first adding the given page to the queue. |
| void | crawl(java.lang.Object crawlObject, int depth) Starts crawling by first adding the given page to the queue. |
| java.util.HashSet | getDonePages() Gets the HashSet that contains pages that have already been crawled. |
| java.util.HashMap | getPagesDone() |
| java.util.HashMap | getTimeTaken() |
| void | loadSettings(org.w3c.dom.Element rootElement) |
| static void | main(java.lang.String[] args) |
| int | pagesProcessed() |
| void | run() The Runnable implementation. |
| void | setDonePages(java.util.HashSet hm) Sets the HashSet that contains pages that have already been crawled. |
Methods inherited from class AbstractCrawler: addHandler, addInterruptHandler, createObject, forceExit, getCallBack, getCrawlCounter, getHandledDocumentCount, getHandler, getInterruptHandler, getMask, getProperty, isVerbose, loadSettings, loadSettings, modifyCrawlCounter, setCallBack, setCrawlCounter, setMask, setProperty, setVerbose
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface CrawlerAccess: forceExit, setProperty
public static final int HTTP_UNAUTHORIZED_INTERRUPTION
private java.util.HashSet donePages
private java.util.LinkedList<Tuples.T2<java.lang.Object,java.lang.Integer>> queue
private int counter
private int processing
private boolean checkDonePages
private java.util.HashMap pagesDone
private java.util.HashMap timeTaken
protected int numThreads
public void clearQueue()
Clears the crawl queue.

public void crawl(java.lang.Object crawlObject)
Starts crawling by first adding the given page to the queue.

public void crawl(java.lang.Object crawlObject, int depth)
Starts crawling by first adding the given page to the queue.
public void crawl()
Starts crawling the pages added to the queue with addPageToQueue.

public void run()
The Runnable implementation.
Specified by: run in interface java.lang.Runnable
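For a single-threaded crawl, the crawl overloads presumably run in the calling thread (an assumption; the page does not say so explicitly). A minimal sketch with an illustrative seed URL and depth:

```java
WebCrawler crawler = new WebCrawler();
// ContentHandler and URLMask setup omitted (see addHandler/setMask above).
crawler.crawl("http://www.example.com/", 2); // add the seed page, then crawl to depth 2
```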
public void add(java.lang.Object crawlObject, int depth)
Adds a URL to the crawler's queue.
Specified by: add in interface CrawlerAccess
public void addObject(java.lang.Object data)
Gives any object constructed from the crawled page to the callback object. It is left to the CrawlerAccess implementation to decide what to do with it.
Specified by: addObject in interface CrawlerAccess
public void setDonePages(java.util.HashSet hm)
Sets the HashSet that contains pages that have already been crawled. With setDonePages and getDonePages you can set up multiple Crawlers that don't crawl pages that some other Crawler has already processed.

public java.util.HashSet getDonePages()
Gets the HashSet that contains pages that have already been crawled.
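Based on that description, a sketch of sharing the done-pages set between two crawler instances. Whether a single HashSet instance is meant to be shared live, rather than copied between runs, is an assumption here.

```java
WebCrawler first = new WebCrawler();
WebCrawler second = new WebCrawler();

// Assumed: handing the first crawler's done-pages set to the second keeps
// the second from re-crawling pages the first has already processed.
second.setDonePages(first.getDonePages());
```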
public int pagesProcessed()
public void loadSettings(org.w3c.dom.Element rootElement) throws java.lang.Exception
Overrides: loadSettings in class AbstractCrawler
Throws: java.lang.Exception
public java.util.HashMap getPagesDone()
public java.util.HashMap getTimeTaken()
public static void main(java.lang.String[] args) throws java.lang.Exception
Throws: java.lang.Exception
Copyright 2004-2015 Wandora Team