java.lang.Object
  |
  +--zyh.robot.URLManager
The basic service for maintaining a list of visited URLs and telling the Robot how to deal with a specific URL. To avoid revisiting the same URL over and over, a list of visited URLs must be maintained in memory or in a database. zyh.robot.URLManager uses a hash table to hold all visited URLs and their related information. It decides whether to visit a URL according to the interested URL identifiers, the maximal recursive depth, and the Robots Exclusion Protocol. If you adapt it for a large URL database, you cannot load all URL information from the URL index database into memory; instead, look up each specific keyword or URL one by one through the JDBC driver.
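The hash-table bookkeeping described above can be sketched as follows. This is a minimal illustration, not the zyh.robot implementation: the class `VisitedUrlTable` and its methods are hypothetical names, URLs are held as plain strings, and the only "related information" stored is the depth at which a URL was first seen.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical stand-in for the visited-URL table; not part of zyh.robot.
class VisitedUrlTable {
    // URL string -> depth at which the URL was first seen (assumed "related information").
    private final Map<String, Integer> visited = new HashMap<>();
    // URLs that are known but not yet processed, in discovery order.
    private final Queue<String> waiting = new ArrayDeque<>();

    // Adds the URL only if it is not already known.
    // Returns true if the URL was new and has been queued.
    boolean add(String url, int depth) {
        if (visited.containsKey(url)) {
            return false; // already known: do not queue it a second time
        }
        visited.put(url, depth);
        waiting.add(url);
        return true;
    }

    // Returns the recorded depth, or -1 if the URL is unknown.
    int depthOf(String url) {
        Integer depth = visited.get(url);
        return depth == null ? -1 : depth;
    }

    // Returns the next unprocessed URL, or null when the queue is empty.
    String next() {
        return waiting.poll();
    }
}
```

A crawler loop would call `add` for every extracted link and `next` to fetch work; because a URL is recorded before it is queued, it can be queued at most once.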
Field Summary | |
static int |
INTERESTED_URL
Interested URL type. The URL should be visited, and the Robot should extract all referenced URL links from it. |
static int |
REFERENCED_URL
Referenced URL type. The URL is on another friendly site. It should be visited, but its content should not be filtered. |
static int |
UNKNOWN_LINK
|
static int |
UNTOUCHED_URL
Untouched URL type. The depth of this URL is beyond the maximal recursive depth. It should be visited, but its content should not be filtered. |
Constructor Summary | |
URLManager(int maxDepth,
java.lang.String[] interestedIdentifiers,
java.lang.String mailbox,
java.net.InetAddress proxyInetAddress,
int proxyPort,
java.lang.String jdbcurl,
java.lang.String urlsTableName)
Creates a URLManager object |
Method Summary | |
void |
addURL(java.net.URL baseURL)
Adds a destination URL. Checks whether the URL is in the list of known URLs and, if it is not, adds it to the list. |
void |
addURL(java.net.URL url,
int linkType,
zyh.robot.URLInfo parentURLInfo)
Adds a destination URL. Checks whether the URL is in the list of known URLs and, if it is not, adds it to the list. |
java.lang.String |
getCookie(java.lang.String host)
Gets a cookie for the specific host |
java.io.PrintWriter |
getLogWriter()
Gets the log writer. |
protected java.lang.String |
getMailbox()
Returns the e-mail address of the robot master. |
zyh.robot.URLInfo |
getNextUnprocessedURLInfo()
Gets an unprocessed URLInfo object from the waiting queue. |
void |
println(java.lang.String message)
Prints a message to the current ContentFilter log writer. |
void |
report(java.io.PrintWriter out)
Outputs an html format report which contains the link information. |
void |
setCookie(java.lang.String host,
java.lang.String cookie)
Sets a cookie for the specific host |
void |
setKeywords(int urlID,
int[] wordIDs)
Sets the keyword list of a destination URL. |
void |
setLogWriter(java.io.PrintWriter out)
|
Methods inherited from class java.lang.Object |
clone,
equals,
finalize,
getClass,
hashCode,
notify,
notifyAll,
toString,
wait,
wait,
wait |
Field Detail |
public static final int INTERESTED_URL
public static final int REFERENCED_URL
public static final int UNTOUCHED_URL
public static final int UNKNOWN_LINK
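One way to read the three documented URL types is as the result of a small classification step. The sketch below is an assumption-laden illustration, not the library's logic: `LinkKind` and `LinkClassifier` are hypothetical names (the real fields are `int` constants whose values are not shown here), identifiers are matched as plain substrings, the depth check is applied first, and every remaining URL is treated as a referenced one. `UNKNOWN_LINK` is left out because it is undocumented.

```java
// Hypothetical enum mirroring the documented type names; not the real int constants.
enum LinkKind { INTERESTED, REFERENCED, UNTOUCHED }

// Hypothetical helper; the precedence of the checks is an assumption.
final class LinkClassifier {
    private LinkClassifier() {}

    static LinkKind classify(String url, int depth, int maxDepth, String[] interestedIdentifiers) {
        // Beyond the maximal recursive depth: visited, but links are not followed further.
        if (depth > maxDepth) {
            return LinkKind.UNTOUCHED;
        }
        // Assumed matching rule: the URL contains one of the interested identifiers.
        for (String identifier : interestedIdentifiers) {
            if (url.contains(identifier)) {
                return LinkKind.INTERESTED;
            }
        }
        // Everything else is treated here as a URL on another site.
        return LinkKind.REFERENCED;
    }

    // Per the field descriptions, only interested URLs have their links extracted.
    static boolean shouldExtractLinks(LinkKind kind) {
        return kind == LinkKind.INTERESTED;
    }
}
```

Under these assumptions all three kinds are fetched, and only `INTERESTED` pages feed new links back into the URL list.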
Constructor Detail |
public URLManager(int maxDepth, java.lang.String[] interestedIdentifiers, java.lang.String mailbox, java.net.InetAddress proxyInetAddress, int proxyPort, java.lang.String jdbcurl, java.lang.String urlsTableName) throws java.sql.SQLException
maxDepth
- the maximal recursive depth
interestedIdentifiers
- all interested identifiers
mailbox
- your mailbox, so that server maintainers can contact you in case of problems
proxyInetAddress
- the InetAddress of the HTTP proxy
proxyPort
- the port of the HTTP proxy
jdbcurl
- the JDBC URL of the index database
urlsTableName
- the name of the table which contains all URLs
Method Detail |
public void setLogWriter(java.io.PrintWriter out)
public java.io.PrintWriter getLogWriter()
public void println(java.lang.String message)
message
- a log or tracing message
protected java.lang.String getMailbox()
public void addURL(java.net.URL baseURL) throws java.sql.SQLException
baseURL
- the start URL
public void addURL(java.net.URL url, int linkType, zyh.robot.URLInfo parentURLInfo) throws java.sql.SQLException
url
- the URL which will be processed
linkType
- the type of the URL link in the processed document
parentURLInfo
- the URLInfo object of the document which contains this URL
public void setKeywords(int urlID, int[] wordIDs) throws java.sql.SQLException
urlID
- the urlID which will be processed
wordIDs
- the array of word IDs of the keywords for the URL that was allotted this urlID
public java.lang.String getCookie(java.lang.String host)
public void setCookie(java.lang.String host, java.lang.String cookie)
public zyh.robot.URLInfo getNextUnprocessedURLInfo()
public void report(java.io.PrintWriter out) throws java.io.IOException
out
- a PrintWriter output object.