zyh.robot
Class URLManager

java.lang.Object
  |
  +--zyh.robot.URLManager

public class URLManager
extends java.lang.Object

The basic service for maintaining a list of visited URLs and telling the Robot how to deal with a specific URL. To avoid revisiting the same URL over and over, a list of visited URLs should be maintained in memory or in a database. The zyh.robot.URLManager uses a hash table to hold all visited URLs and their related information. It decides whether to visit a URL according to the specific interested URL identifiers, the maximal recursive depth, and the Robots Exclusion Protocol. If you wish to adapt it to a large URL database, you cannot load all URL information into memory from the URL index database; instead, you should look up each keyword or URL through the JDBC driver one at a time.
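
As a rough usage sketch: construct a URLManager, seed it with a start URL, then drain the waiting queue. All configuration values below (proxy host, JDBC URL, table name, identifiers) are hypothetical placeholders, and the loop assumes getNextUnprocessedURLInfo() returns null once the queue is empty, which this documentation does not confirm.

    import java.net.InetAddress;
    import java.net.URL;
    import zyh.robot.URLInfo;
    import zyh.robot.URLManager;

    public class CrawlExample {
        public static void main(String[] args) throws Exception {
            // All configuration values here are illustrative placeholders.
            URLManager manager = new URLManager(
                3,                                          // maxDepth
                new String[] { "example.org" },             // interestedIdentifiers
                "robot-master@example.org",                 // mailbox
                InetAddress.getByName("proxy.example.org"), // proxyInetAddress
                8080,                                       // proxyPort
                "jdbc:mysql://localhost/urlindex",          // jdbcurl
                "urls");                                    // urlsTableName

            manager.addURL(new URL("http://example.org/")); // seed the waiting queue

            // Assumption: returns null once the waiting queue is drained.
            URLInfo info;
            while ((info = manager.getNextUnprocessedURLInfo()) != null) {
                // fetch the page, filter its content, and add extracted links here
            }
        }
    }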


Field Summary
static int INTERESTED_URL
          Interested URL type: the URL should be visited, and the Robot should extract all referenced URL links from it.
static int REFERENCED_URL
          Referenced URL type: this URL is on another friendly site; it should be visited, but its content should not be filtered.
static int UNKNOWN_LINK
          Unknown link type.
static int UNTOUCHED_URL
          Untouched URL type: the depth of this URL is beyond the maximal recursive depth; it should be visited, but its content should not be filtered.
 
Constructor Summary
URLManager(int maxDepth, java.lang.String[] interestedIdentifiers, java.lang.String mailbox, java.net.InetAddress proxyInetAddress, int proxyPort, java.lang.String jdbcurl, java.lang.String urlsTableName)
          Creates a URLManager object.
 
Method Summary
 void addURL(java.net.URL baseURL)
          Adds a destination URL: checks whether the URL is already in the list of known URLs and, if it is not, adds it to the list.
 void addURL(java.net.URL url, int linkType, zyh.robot.URLInfo parentURLInfo)
          Adds a destination URL: checks whether the URL is already in the list of known URLs and, if it is not, adds it to the list.
 java.lang.String getCookie(java.lang.String host)
          Gets the cookie for the specified host.
 java.io.PrintWriter getLogWriter()
          Gets the log writer.
protected  java.lang.String getMailbox()
          Returns the email address of the Robot master.
 zyh.robot.URLInfo getNextUnprocessedURLInfo()
          Gets an unprocessed URLInfo object from the waiting queue.
 void println(java.lang.String message)
          Prints a message to the current log writer.
 void report(java.io.PrintWriter out)
          Outputs an HTML-format report containing the link information.
 void setCookie(java.lang.String host, java.lang.String cookie)
          Sets the cookie for the specified host.
 void setKeywords(int urlID, int[] wordIDs)
          Sets the keyword list of a destination URL.
 void setLogWriter(java.io.PrintWriter out)
          Sets the log writer.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INTERESTED_URL

public static final int INTERESTED_URL
Interested URL type: the URL should be visited, and the Robot should extract all referenced URL links from it.

REFERENCED_URL

public static final int REFERENCED_URL
Referenced URL type: this URL is on another friendly site; it should be visited, but its content should not be filtered.

UNTOUCHED_URL

public static final int UNTOUCHED_URL
Untouched URL type: the depth of this URL is beyond the maximal recursive depth; it should be visited, but its content should not be filtered.

UNKNOWN_LINK

public static final int UNKNOWN_LINK
Unknown link type.
Constructor Detail

URLManager

public URLManager(int maxDepth,
                  java.lang.String[] interestedIdentifiers,
                  java.lang.String mailbox,
                  java.net.InetAddress proxyInetAddress,
                  int proxyPort,
                  java.lang.String jdbcurl,
                  java.lang.String urlsTableName)
           throws java.sql.SQLException
Creates a URLManager object.
Parameters:
maxDepth - the maximal recursive depth
interestedIdentifiers - all interested URL identifiers
mailbox - please provide your mailbox so that server maintainers can contact you in case of problems
proxyInetAddress - the InetAddress of the HTTP proxy
proxyPort - the port of the HTTP proxy
jdbcurl - the JDBC URL of the index database
urlsTableName - the name of the table which contains all URLs
Method Detail

setLogWriter

public void setLogWriter(java.io.PrintWriter out)
Sets the log writer.

getLogWriter

public java.io.PrintWriter getLogWriter()
Gets the log writer.

println

public void println(java.lang.String message)
Prints a message to the current log writer.
Parameters:
message - a log or tracing message

getMailbox

protected java.lang.String getMailbox()
Returns the email address of the Robot master.

addURL

public void addURL(java.net.URL baseURL)
            throws java.sql.SQLException
Adds a destination URL: checks whether the URL is already in the list of known URLs and, if it is not, adds it to the list.
Parameters:
baseURL - the start URL

addURL

public void addURL(java.net.URL url,
                   int linkType,
                   zyh.robot.URLInfo parentURLInfo)
            throws java.sql.SQLException
Adds a destination URL: checks whether the URL is already in the list of known URLs and, if it is not, adds it to the list.
Parameters:
url - the URL to be processed
linkType - the link type of the URL in the processed document
parentURLInfo - the URLInfo object of the document that contains this URL
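
As a sketch, a content filter that has just parsed a page might register each extracted link like this; manager, parentInfo, and link are hypothetical variables standing for the URLManager, the parsed page's URLInfo, and an extracted java.net.URL.

    // Hypothetical: 'parentInfo' is the URLInfo of the page just parsed,
    // 'link' is a java.net.URL extracted from that page.
    try {
        manager.addURL(link, URLManager.INTERESTED_URL, parentInfo);
    } catch (java.sql.SQLException e) {
        manager.println("could not record link: " + e.getMessage());
    }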

setKeywords

public void setKeywords(int urlID,
                        int[] wordIDs)
                 throws java.sql.SQLException
Sets the keyword list of a destination URL.
Parameters:
urlID - the ID of the URL to be processed
wordIDs - the array of word IDs containing the keywords for the URL allotted this specific urlID
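
A minimal sketch, assuming an indexing step has already assigned integer IDs to the URL and its keywords; the ID values below are made up.

    int urlID = 42;                    // hypothetical ID of the processed URL
    int[] wordIDs = { 7, 19, 23 };     // hypothetical IDs of its keywords
    try {
        manager.setKeywords(urlID, wordIDs);
    } catch (java.sql.SQLException e) {
        manager.println("could not store keywords: " + e.getMessage());
    }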

getCookie

public java.lang.String getCookie(java.lang.String host)
Gets the cookie for the specified host.

setCookie

public void setCookie(java.lang.String host,
                      java.lang.String cookie)
Sets the cookie for the specified host.
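
A short round-trip sketch; the host and cookie string are illustrative.

    // Remember a session cookie for a host, then reuse it later.
    manager.setCookie("www.example.org", "SESSIONID=abc123");
    String cookie = manager.getCookie("www.example.org");
    // 'cookie' can now be sent in the Cookie header of the next request.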

getNextUnprocessedURLInfo

public zyh.robot.URLInfo getNextUnprocessedURLInfo()
Gets an unprocessed URLInfo object from the waiting queue. If you wish to ensure that no more than one robot visits the same host at the same time when you are running many robots, you should change this method to public URLInfo getNextUnprocessedURLInfo(Robot robot, URL latestVisitedURL) and maintain a map of every robot's latest visited host. You can also rotate queries between different servers in a round-robin fashion.
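
A minimal sketch of that suggested modification, assuming a Robot class, an internal waitingQueue collection of URLInfo objects, and a URLInfo.getURL() accessor, none of which are confirmed by this documentation.

    // Sketch only: Robot, waitingQueue, and URLInfo.getURL() are assumptions.
    private final java.util.Map<Robot, String> lastHostByRobot =
        new java.util.HashMap<Robot, String>();

    public synchronized URLInfo getNextUnprocessedURLInfo(Robot robot,
            java.net.URL latestVisitedURL) {
        if (latestVisitedURL != null) {
            lastHostByRobot.put(robot, latestVisitedURL.getHost());
        }
        // Hand out the first waiting URL whose host no robot
        // is currently visiting.
        for (java.util.Iterator<URLInfo> it = waitingQueue.iterator(); it.hasNext();) {
            URLInfo info = it.next();
            if (!lastHostByRobot.containsValue(info.getURL().getHost())) {
                it.remove();
                return info;
            }
        }
        return null; // every waiting host is busy right now
    }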

report

public void report(java.io.PrintWriter out)
            throws java.io.IOException
Outputs an HTML-format report containing the link information.
Parameters:
out - a PrintWriter output object.
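
For example, to write the report to a file; the file name is arbitrary, and report() may throw java.io.IOException.

    java.io.PrintWriter out =
        new java.io.PrintWriter(new java.io.FileWriter("crawl-report.html"));
    try {
        manager.report(out);   // writes the HTML link report
    } finally {
        out.close();
    }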