public class FindSubjectLocator extends AbstractWandoraTool implements WandoraTool, Handler, InterruptHandler
FindSubjectLocator crawls URL resources and tries to match each found URL against the search pattern. If a URL matches the search pattern, the topic is given that URL as its subject locator. FindSubjectLocator can be used, for example, to fix missing subject locators.
This tool has NOT been tested.
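The matching step described above can be illustrated with a minimal sketch. It is not the tool's implementation: the class and method names are hypothetical, and it assumes that each pattern is a regular expression that must match the whole URL, which this page does not state. The map from topic keys to pattern strings only mirrors the declared type of the topicPatterns field.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration; none of these names come from the Wandora API.
class SubjectLocatorMatchSketch {

    /**
     * Returns, for each topic key, the first crawled URL that matches the
     * topic's pattern. Topics without a matching URL are left out.
     */
    static Map<String, String> assignLocators(Map<String, String> topicPatterns,
                                              List<String> crawledUrls) {
        Map<String, String> locators = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : topicPatterns.entrySet()) {
            // Assumption: the pattern is a regex that must match the entire URL.
            Pattern pattern = Pattern.compile(entry.getValue());
            for (String url : crawledUrls) {
                if (pattern.matcher(url).matches()) {
                    locators.put(entry.getKey(), url);
                    break; // keep only the first matching URL
                }
            }
        }
        return locators;
    }

    public static void main(String[] args) {
        Map<String, String> patterns = new LinkedHashMap<>();
        patterns.put("topic-a", ".*/images/a\\.png");
        patterns.put("topic-b", ".*/images/b\\.png");
        List<String> urls = Arrays.asList(
                "http://example.org/index.html",
                "http://example.org/images/a.png");
        // Only topic-a finds a matching URL in this sample data.
        System.out.println(assignLocators(patterns, urls));
    }
}
```

In the real tool the matched URL would be set as the topic's subject locator; the sketch only returns the pairing so that the matching logic can be checked in isolation.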
Modifier and Type | Field and Description |
---|---|
private Wandora | admin |
protected int | browseCounter |
java.lang.String[] | contentTypes |
private int | crawlCounter |
private WebCrawler | crawler |
protected int | extractionCounter |
protected int | foundCounter |
private java.lang.String | startUrl |
private java.lang.String | subjectLocatorString |
private java.util.HashMap<java.lang.String,java.lang.String> | topicPatterns |
private java.lang.String | urlPattern |
Constructor and Description |
---|
FindSubjectLocator() Creates a new instance of FindSubjectLocator. |
FindSubjectLocator(Context preferredContext) |
Modifier and Type | Method and Description |
---|---|
void | execute(Wandora admin, Context context) Runs the tool. |
java.lang.String[] | getContentTypes() Returns an array of String containing the content-types this ContentHandler can process. |
java.lang.String | getDescription() AdminToolManager views tool descriptions while the user browses available tools and builds user-customizable GUI elements such as the Tools menu. |
int[] | getInterruptsHandled() |
java.lang.String | getName() The tool's name represents the tool in the UI unless the tool has explicitly been given another GUI name. |
void | handle(CrawlerAccess crawler, java.io.InputStream in, int depth, java.net.URL url) Processes the given page. |
void | handleInterrupt(CrawlerAccess crawler, int interrupt, java.net.URL url) |
Topic | isMyURL(java.net.URL url) |
void | setupCrawler(java.lang.String startUrl) |
java.lang.String | solveStartURL() solveStartURL returns the URL where crawling is started. |
java.lang.String | solveURLPattern(Topic topic) solveURLPattern returns the pattern that is compared to each URL. |
addUndoMarker, addUndoMarker, allowMultipleInvocations, clearAllThreads, clearThreads, clearThreads, clearToolLock, clearToolLock, clearToolLocks, configure, execute, execute, forceStop, forceStop, getContext, getCurrentLogger, getDefaultLogger, getHistory, getIcon, getLastLogger, getState, getThreads, getThreads, getToolMenuItem, getToolMenuItem, getTopicName, getType, hlog, initialize, interruptAllThreads, interruptThreads, interruptThreads, isConfigurable, isRunning, isRunning, lockLog, log, log, log, log, requiresRefresh, run, runInOwnThread, setContext, setDefaultLogger, setLogTitle, setProgress, setProgressMax, setState, setToolLogger, singleLog, singleLog, singleLog, solveContextTopicMap, solveNameForTopicMap, writeOptions
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
configure, execute, execute, getContext, getIcon, getToolMenuItem, getType, hlog, initialize, isConfigurable, isRunning, log, log, log, log, requiresRefresh, setContext, setToolLogger, writeOptions
forceStop, getHistory, getState, lockLog, setLogTitle, setProgress, setProgressMax, setState
private Wandora admin
private WebCrawler crawler
private int crawlCounter
protected int extractionCounter
protected int foundCounter
protected int browseCounter
private java.lang.String urlPattern
private java.lang.String startUrl
private java.lang.String subjectLocatorString
private java.util.HashMap<java.lang.String,java.lang.String> topicPatterns
public final java.lang.String[] contentTypes
public FindSubjectLocator()
public FindSubjectLocator(Context preferredContext)
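public java.lang.String getName()
Description copied from class: AbstractWandoraTool
The tool's name represents the tool in the UI unless the tool has explicitly been given another GUI name.
Specified by: getName in interface WandoraTool
Overrides: getName in class AbstractWandoraTool
public java.lang.String getDescription()
Description copied from class: AbstractWandoraTool
AdminToolManager views tool descriptions while the user browses available tools and builds user-customizable GUI elements such as the Tools menu.
Specified by: getDescription in interface WandoraTool
Overrides: getDescription in class AbstractWandoraTool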
public java.lang.String solveStartURL()
solveStartURL returns the URL where crawling is started.
public java.lang.String solveURLPattern(Topic topic)
solveURLPattern returns the pattern that is compared to each URL.
public void execute(Wandora admin, Context context)
Description copied from interface: WandoraTool
Runs the tool.
Specified by: execute in interface WandoraTool
public void setupCrawler(java.lang.String startUrl)
public Topic isMyURL(java.net.URL url)
public void handle(CrawlerAccess crawler, java.io.InputStream in, int depth, java.net.URL url)
Description copied from interface: Handler
Processes the given page. The InputStream contains the data of an object that is of the content-type this content handler accepts. May use the given CrawlerAccess object to add further pages to the queue of the WebCrawler that asked to process the page.
Specified by: handle in interface Handler
Parameters:
crawler - The callback object for the handler. Any objects built from the content of the page can be sent to this.
in - The InputStream of the page.
depth - The remaining depth. When reporting another page to the queue, the depth of that page should be set to this depth-1.
url - The URL of the page.
public java.lang.String[] getContentTypes()
Description copied from interface: Handler
Returns an array of String containing the content-types this ContentHandler can process.
Specified by: getContentTypes in interface Handler
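The depth rule stated for handle (a page reported to the queue gets the current depth minus one) can be sketched with a small stand-alone queue. This is only an illustration under stated assumptions: the class, the in-memory link map, and the choice that a page with remaining depth 0 is still processed but reports no further pages are all hypothetical, and no real CrawlerAccess or WebCrawler calls are used.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the "depth - 1" rule; not the Wandora crawler.
class DepthQueueSketch {

    /** A queued page together with its remaining crawl depth. */
    static final class Page {
        final String url;
        final int depth;
        Page(String url, int depth) { this.url = url; this.depth = depth; }
    }

    /**
     * Visits pages breadth-first starting from startUrl. The links map stands
     * in for the links a handler would find in each page's content.
     */
    static List<String> crawl(String startUrl, int startDepth,
                              Map<String, List<String>> links) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<Page> queue = new ArrayDeque<>();
        queue.add(new Page(startUrl, startDepth));
        seen.add(startUrl);
        while (!queue.isEmpty()) {
            Page page = queue.poll();
            visited.add(page.url); // "handle" the page
            if (page.depth <= 0) {
                continue; // assumption: no depth left, so report nothing further
            }
            List<String> found = links.getOrDefault(page.url, Collections.<String>emptyList());
            for (String link : found) {
                if (seen.add(link)) {
                    // A newly reported page gets the current depth minus one.
                    queue.add(new Page(link, page.depth - 1));
                }
            }
        }
        return visited;
    }
}
```

With links a -> b and b -> c, calling crawl("a", 1, links) visits a and b only, because b is queued with remaining depth 0 and therefore reports no further pages.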
public void handleInterrupt(CrawlerAccess crawler, int interrupt, java.net.URL url)
Specified by: handleInterrupt in interface InterruptHandler
public int[] getInterruptsHandled()
Specified by: getInterruptsHandled in interface InterruptHandler
Copyright 2004-2015 Wandora Team