public class WebCrawler extends AbstractCrawler implements java.lang.Runnable, CrawlerAccess, Crawler
A WebCrawler is given one or more ContentHandler objects, a start page (or several start pages), and a URLMask. Files are retrieved with URLs and the content is then passed to one of the given ContentHandlers according to the content-type of the URL. If no suitable ContentHandler is found, the file is simply discarded.
ContentHandlers may add other pages to the crawl queue according to the links they find in the content they are processing. The WebCrawler ensures that the same page is not crawled more than once.
WebCrawler implements Runnable and can be used with multiple threads. Simply initialize the WebCrawler object, then create as many Threads as you want with the single WebCrawler instance and start each with Thread.start(). Use Thread.isAlive() to tell whether a thread is still running.
Using more than one thread can be more efficient because threads can download and parse pages simultaneously.

See Also: ContentHandler, URLMask, CrawlerAccess
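A minimal usage sketch based on the description above. The seed URL, crawl depth, and thread count are illustrative; ContentHandler and URLMask setup is omitted because the signatures of the inherited addHandler and setMask methods are not shown on this page.

```java
// Usage sketch only; assumes the Wandora crawler classes are on the classpath.
public class CrawlExample {
    public static void main(String[] args) throws InterruptedException {
        WebCrawler crawler = new WebCrawler();
        // ContentHandler and URLMask setup omitted (see the inherited
        // addHandler and setMask methods below).
        crawler.add("http://www.example.com/", 3); // illustrative seed page and depth

        // One WebCrawler instance shared by several threads.
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(crawler);
            threads[i].start();
        }

        // Thread.isAlive() tells whether a crawl thread is still running.
        for (Thread t : threads) {
            while (t.isAlive()) {
                Thread.sleep(500);
            }
        }
        System.out.println("Pages processed: " + crawler.pagesProcessed());
    }
}
```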
| Modifier and Type | Field and Description |
|---|---|
| private boolean | checkDonePages |
| private int | counter |
| private java.util.HashSet | donePages |
| static int | HTTP_UNAUTHORIZED_INTERRUPTION |
| protected int | numThreads |
| private java.util.HashMap | pagesDone |
| private int | processing |
| private java.util.LinkedList<Tuples.T2<java.lang.Object,java.lang.Integer>> | queue |
| private java.util.HashMap | timeTaken |
Fields inherited from class AbstractCrawler: forceExit, handleCount
| Constructor and Description |
|---|
| WebCrawler() Creates a new WebCrawler. |
| Modifier and Type | Method and Description |
|---|---|
| void | add(java.lang.Object crawlObject, int depth) Adds a URL to the crawler's queue. |
| void | addObject(java.lang.Object data) Gives any object constructed from the crawled page to the callback object. |
| void | clearQueue() Clears the crawl queue. |
| void | crawl() Starts crawling the pages added to the queue with addPageToQueue. |
| void | crawl(java.lang.Object crawlObject) Starts crawling by first adding the given page to the queue. |
| void | crawl(java.lang.Object crawlObject, int depth) Starts crawling by first adding the given page to the queue. |
| java.util.HashSet | getDonePages() Gets the HashSet that contains pages that have already been crawled. |
| java.util.HashMap | getPagesDone() |
| java.util.HashMap | getTimeTaken() |
| void | loadSettings(org.w3c.dom.Element rootElement) |
| static void | main(java.lang.String[] args) |
| int | pagesProcessed() |
| void | run() The Runnable implementation. |
| void | setDonePages(java.util.HashSet hm) Sets the HashSet that contains pages that have already been crawled. |
Methods inherited from class AbstractCrawler: addHandler, addInterruptHandler, createObject, forceExit, getCallBack, getCrawlCounter, getHandledDocumentCount, getHandler, getInterruptHandler, getMask, getProperty, isVerbose, loadSettings, loadSettings, modifyCrawlCounter, setCallBack, setCrawlCounter, setMask, setProperty, setVerbose
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface CrawlerAccess: forceExit, setProperty
public static final int HTTP_UNAUTHORIZED_INTERRUPTION
private java.util.HashSet donePages
private java.util.LinkedList<Tuples.T2<java.lang.Object,java.lang.Integer>> queue
private int counter
private int processing
private boolean checkDonePages
private java.util.HashMap pagesDone
private java.util.HashMap timeTaken
protected int numThreads
public void clearQueue()
Clears the crawl queue.

public void crawl(java.lang.Object crawlObject)
Starts crawling by first adding the given page to the queue.

public void crawl(java.lang.Object crawlObject, int depth)
Starts crawling by first adding the given page to the queue.
public void crawl()
Starts crawling the pages added to the queue with addPageToQueue.

public void run()
The Runnable implementation.
Specified by: run in interface java.lang.Runnable
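For a single-threaded crawl, the crawl overloads presumably run in the calling thread (an assumption; the page does not say so explicitly). A minimal sketch with an illustrative seed URL and depth:

```java
WebCrawler crawler = new WebCrawler();
// ContentHandler and URLMask setup omitted (see addHandler/setMask above).
crawler.crawl("http://www.example.com/", 2); // add the seed page, then crawl to depth 2
```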
public void add(java.lang.Object crawlObject, int depth)
Adds a URL to the crawler's queue.
Specified by: add in interface CrawlerAccess
public void addObject(java.lang.Object data)
Gives any object constructed from the crawled page to the callback object. It is left to the CrawlerAccess implementation to decide what to do with it.
Specified by: addObject in interface CrawlerAccess
public void setDonePages(java.util.HashSet hm)
Sets the HashSet that contains pages that have already been crawled. With setDonePages and getDonePages you can set up multiple Crawlers that don't crawl pages that some other Crawler has already processed.

public java.util.HashSet getDonePages()
Gets the HashSet that contains pages that have already been crawled.
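Based on that description, a sketch of sharing the done-pages set between two crawler instances. Whether a single HashSet instance is meant to be shared live, rather than copied between runs, is an assumption here.

```java
WebCrawler first = new WebCrawler();
WebCrawler second = new WebCrawler();

// Assumed: handing the first crawler's done-pages set to the second keeps
// the second from re-crawling pages the first has already processed.
second.setDonePages(first.getDonePages());
```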
public int pagesProcessed()
public void loadSettings(org.w3c.dom.Element rootElement) throws java.lang.Exception
Overrides: loadSettings in class AbstractCrawler
Throws: java.lang.Exception
public java.util.HashMap getPagesDone()
public java.util.HashMap getTimeTaken()
public static void main(java.lang.String[] args) throws java.lang.Exception
Throws: java.lang.Exception
Copyright 2004-2015 Wandora Team