Part I: Write a Polite Smart Robot
The pseudocode for the robot is not very complicated:
Prompt the user to enter the starting URL.
Obtain the user's starting URL.
Add the starting URL's information to the table of known URL information.
Add the starting URL's information to the list of unprocessed URL information.
While the list of unprocessed URL information is not empty,
{
    Sleep the currently executing thread for one minute, so as not to burden a server with a large number of connection requests in a short space of time.
    Get an entry from the list of unprocessed URL information.
    Remove that entry from the list of unprocessed URL information.
    Mark the URL as having been retrieved.
    Create and open a communications link between the robot and the URL.
    If there is no registered content filter for the content type of the referenced resource,
        Construct a content object that holds the file name of the URL as its title.
    Else,
    {
        Retrieve information through the communications link.
        Parse the raw data with the corresponding content filter.
        Construct a content object that holds the filtered information, such as the title, keywords, summary, and text.
        Put all extracted URLs into a list.
        Scan every URL in the extracted URL list,
        {
            If the URL is present in the table of known URL information,
            {
                Get the URL information from the table of known URL information.
                If the URL has already been retrieved, continue the scan.
            }
            Else,
            {
                Construct a URL information record for the URL.
                Add the URL information to the table of known URL information.
            }
            If the current recursion depth is beyond the maximal recursion depth, continue the scan.
            If there is a list of interesting URL identifiers and the URL doesn't contain any of them, continue the scan.
            If the URL's protocol is HTTP and the Robots Exclusion Protocol denies a robot access to the URL (see the robots.txt sketch below), continue the scan.
            Add the URL information to the list of unprocessed URL information.
        }
    }
    Index the filtered information in the above content object with a content processor.
}
Display an information retrieval report.
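The Robots Exclusion Protocol check mentioned in the pseudocode can be implemented by fetching /robots.txt from the target host and testing the URL's path against the Disallow rules that apply to your robot. Here is a minimal sketch under that assumption; the class name SimpleRobotsCheck and its simplified rule handling are illustrative, not part of the sample code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// A minimal, assumed implementation of the Robots Exclusion check.
public class SimpleRobotsCheck {
    /** Returns true if robots.txt on the URL's host disallows the given path
        for the given robot name (or for all robots via "User-agent: *"). */
    public static boolean isDisallowed(URL target, String robotName) {
        List<String> disallowed = new ArrayList<String>();
        boolean appliesToUs = false;
        try {
            URL robotsTxt = new URL(target.getProtocol(), target.getHost(),
                    target.getPort(), "/robots.txt");
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(robotsTxt.openStream()));
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.regionMatches(true, 0, "User-agent:", 0, 11)) {
                    String agent = line.substring(11).trim();
                    appliesToUs = agent.equals("*")
                            || agent.equalsIgnoreCase(robotName);
                } else if (appliesToUs
                        && line.regionMatches(true, 0, "Disallow:", 0, 9)) {
                    String path = line.substring(9).trim();
                    if (path.length() > 0) disallowed.add(path);
                }
            }
            reader.close();
        } catch (Exception e) {
            return false; // No robots.txt or unreadable: assume access is allowed.
        }
        for (int i = 0; i < disallowed.size(); i++) {
            if (target.getFile().startsWith(disallowed.get(i))) return true;
        }
        return false;
    }
}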
Here's the detailed code to open a communications link between the robot and a URL:
URL url = new URL(spec); // Create a java.net.URL object from a specific URL in String form.
URLConnection urlConnection = url.openConnection(); // Create a java.net.URLConnection object that represents a communications link between the robot and the URL.
urlConnection.setAllowUserInteraction(false); // Set the allowUserInteraction field of this URLConnection to disallow user interaction.
urlConnection.connect(); // Open a communications link to the resource referenced by this URL.
String contentType = urlConnection.getContentType(); // Get the content type of the resource that the URL references.
InputStream in = urlConnection.getInputStream(); // Get an input stream that reads from this open connection.
... // Extract URLs and process content from the input stream according to its content type.
HttpURLConnection is the URLConnection subclass a WWW robot uses most, although URLConnection's subclasses also include JarURLConnection, FtpURLConnection, FileURLConnection, and so on. To make your robot polite, you should tell the web servers your robot name and your mailbox. To make your robot smart, you should consider implementing four functions (a sketch follows this list):
1. The robot can utilize a proxy to access remote sites even if it is behind a firewall.
2. The robot can accept and use a cookie if the web site allots it one.
3. The robot can utilize the Referer request header to bypass reference checking, since many sites use a form of security that checks the Referer header.
4. The robot can detect and avoid infinite redirection loops while automatically following HTTP redirections without user interaction.
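Under HttpURLConnection, the politeness headers and the four functions above can be set up roughly as in the following sketch. The robot name Java Robot/1.01, the mailbox robotmaster@host.com, and the cookie value sessionid=126 reappear in the socket example later; the proxy host and port are placeholder assumptions.

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Properties;

// A minimal sketch of "polite" request headers and proxy support.
public class PoliteConnection {
    public static HttpURLConnection open(String spec, String referer) throws Exception {
        // 1. Route requests through an HTTP proxy (hypothetical host/port)
        //    so a robot behind a firewall can still reach remote sites.
        Properties props = System.getProperties();
        props.put("http.proxyHost", "proxy.mycompany.com");
        props.put("http.proxyPort", "8080");

        HttpURLConnection connection =
                (HttpURLConnection) new URL(spec).openConnection();

        // Politeness: identify the robot and give the server maintainer a contact address.
        connection.setRequestProperty("User-Agent", "Java Robot/1.01");
        connection.setRequestProperty("From", "robotmaster@host.com");

        // 2. Send back a cookie if the site allotted one earlier (assumed value).
        connection.setRequestProperty("Cookie", "sessionid=126");

        // 3. Supply the referring page so Referer-checking sites accept the request.
        if (referer != null) {
            connection.setRequestProperty("Referer", referer);
        }

        // 4. Handle redirections manually so infinite loops can be detected.
        connection.setInstanceFollowRedirects(false);

        connection.setAllowUserInteraction(false);
        connection.connect();
        return connection;
    }
}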
You can successfully program a lightweight, customized HttpURLConnection class or function if you know the four basic stages of direct HTTP GET communication between a client and a web server. First, a stream socket is created and connected to the specified port number on the server; an HTTP server almost always uses port number 80. Next, an HTTP GET request message with a collection of additional information, such as the browser's name and acceptable character sets, is sent from the client to the server. Third, after receiving and interpreting the request message, the server responds with an HTTP response message containing a status line indicating the result of the request. The response message may also contain other information such as the content type and the content body of the destination URL. Finally, the server closes the connection immediately after sending the response if the client does not intend to maintain a persistent connection. Let's look at a typical GET request message for the URL http://pii300/index.html:
GET /index.html HTTP/1.0
Referer: http://pii300/java.html
User-Agent: Mozilla/4.71 [en] (Win95; I)
Pragma: no-cache
Host: pii300
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: en
Accept-Charset: iso-8859-1,*,utf-8
/* a blank line */
A typical response from the server:
HTTP/1.0 200 OK
Content-Type: text/html
Content-Length: 759
Expires: Fri, 01 Dec 2000 16:00:00 GMT
Server: JRun Web Server/1.0
/* a blank line */
<HTML>/* HTML text of the Web page */</HTML>
To retrieve the URL http://pii300/index.html without a browser:
URL url = new URL("http://pii300/index.html"); // Create a java.net.URL object from a specific URL in String form.
if (url.getPort() == -1) // The port is not set in the URL string.
    url = new URL(url.getProtocol(), url.getHost(), 80, url.getFile()); // Set the default port to 80.
Socket socket = new Socket(url.getHost(), url.getPort()); // Create a stream socket and connect it to the specified port number on the web host.
socket.setSoTimeout(120000); // Throw a java.io.InterruptedIOException if the timeout expires.
java.io.PrintStream out = new java.io.PrintStream(socket.getOutputStream()); // Get a print stream from the output stream for this socket.
out.print("GET " + url.getFile() + " HTTP/1.0" + "\r\n"); // Construct an HTTP GET request line with the relative URL and the HTTP/1.0 version.
out.print("Referer: " + "http://pii300/java.html\r\n"); // Tell the server the referring URL from which the requested URL was obtained, so an obsolete or mistyped link can be traced for maintenance.
out.print("User-Agent: Java Robot/1.01" + "\r\n"); // Tell the server the robot name so that the server can distinguish it from browsers.
out.print("From: robotmaster@host.com" + "\r\n"); // Provide your mailbox so that the server maintainer can contact you in case of problems.
out.print("Pragma: no-cache\r\n"); // Ignore the caches.
out.print("Host: " + url.getHost() + ":" + url.getPort() + "\r\n"); // Provide the Internet host and port number of the resource being requested.
out.print("Accept: */*\r\n"); // Accept all media types and their subtypes.
out.print("Cookie: " + "sessionid=126\r\n"); // Submit a cookie if the web site has allotted one.
out.print("\r\n"); // A blank line indicates the end of the header fields.
out.flush(); // Flush the stream.
InputStream in = socket.getInputStream(); // Get an input stream for this socket.
... // Process the HTTP response message from the server, which includes the HTML text of "http://pii300/index.html".
socket.close(); // Close this socket.
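The elided processing step above typically reads the status line and headers, then decides whether to parse the body or follow a redirection. The helper below is an assumed sketch, not part of the original sample; it shows one way to detect the 3xx redirections mentioned earlier, so the caller can count hops and stop before entering an infinite redirection loop.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

// A minimal, assumed sketch of processing the HTTP response from the socket.
public class ResponseSketch {
    static final int MAX_REDIRECTS = 5; // Give up after this many hops (assumed limit).

    /** Reads the status line and headers; returns the Location header for a
        3xx response (so the caller can follow it), or null for other responses. */
    public static String readHead(InputStream in) throws Exception {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String statusLine = reader.readLine();               // e.g. "HTTP/1.0 200 OK"
        if (statusLine == null) return null;
        String[] parts = statusLine.split(" ");
        int status = Integer.parseInt(parts[1]);

        String location = null;
        String line;
        while ((line = reader.readLine()) != null && line.length() > 0) {
            if (line.regionMatches(true, 0, "Location:", 0, 9)) {
                location = line.substring(9).trim();
            }
            // Content-Type, Content-Length, etc. can be captured the same way.
        }
        // The remaining lines on this reader form the content body; hand them to a content filter.
        return (status >= 300 && status < 400) ? location : null;
    }
}

If readHead returns a non-null Location, repeat the GET against that URL, keeping a count against MAX_REDIRECTS so the robot never loops forever.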
Part II: Construct a Flexible Content Filter
How do you parse documents and extract URLs from the URLConnection input stream? Maybe the idea of using String.indexOf() to hunt down "http://" flashed into your mind. No; you need a powerful, flexible syntax parser for HTML. It should extract not only URLs, but also the title, keywords, summary, and text. It should handle not only known HTML syntax, but also future HTML syntax, and even erroneous HTML. The zyh.robot.TextHTML package is such an efficient parser for plain text and HTML documents, with syntactic error recovery. Thanks to the programmers of JLex and CUP, all of its parsing code is generated with no sweat by JLex and CUP from the corresponding specifications in Yylex and TextHTML.cup. JLex takes a specification file and creates a Java source file for the corresponding lexical analyzer. For instance, here is a simple JLex specification file:
/** A lexical analyzer that breaks an input stream into symbols for the URL extractor. */

/* User code section: mainly a package declaration and imports of external classes. */
package SimpleExample;
import java_cup.runtime.Symbol;

/* JLex directives section: used for macro definitions. */
%%
%{  /* Declarations of variables and functions internal to the generated lexical analyzer class. */

    /** Return the next parsed symbol, or an error symbol if there is a lexical error. */
    public Symbol nextSymbol() throws java.io.IOException
    {
        try
        {
            return parse();
        }
        catch(java.lang.Error e)
        {   // Lexical error
            return new Symbol(sym.error);
        }
    }
%}
%type Symbol    /* Change the return type of the tokenizing function from Yytoken to Symbol. */
%function parse /* Change the name of the tokenizing function from yylex to parse. */
%public         /* Cause the lexical analyzer class generated by JLex to be a public class. */
%unicode        /* Generate a lexical analyzer that supports an alphabet of character codes between 0 and 2^16-1 inclusive. */
%ignorecase     /* Generate a case-insensitive lexical analyzer. */
%eofval{
    return (new Symbol(sym.EOF)); // Return an EOF symbol after the end of the input stream is reached.
%eofval}

/* Macro definition for white space */
WHITE_SPACE_CHAR=[\ \t\b\012\r\n]
/* Macro definitions for phrases and strings */
PHRASE = [^\"\<\>\=\ \t\b\012\r\n] [^\"\<\>\=\ \t\b\012\r\n]*
QQLINK = \"([^\"\<])*\"
STRING = [^\"\<\>\=\ \t\b\012\r\n] +\"[^\"\<\>\=\ \t\b\012\r\n]*

/* Regular expression rules section: each rule consists of an optional state list, a regular expression, and an action. */
%%
<YYINITIAL> {WHITE_SPACE_CHAR} {}
<YYINITIAL> "="      { return new Symbol(sym.EQUAL); }
<YYINITIAL> ">"      { return new Symbol(sym.GREATER_THAN); }
<YYINITIAL> "<"      { return new Symbol(sym.LESS_THAN); }
<YYINITIAL> "HREF"   { return new Symbol(sym.HREF, yytext()); }
<YYINITIAL> "SRC"    { return new Symbol(sym.SRC, yytext()); }
<YYINITIAL> {PHRASE} { return new Symbol(sym.STRING, yytext()); }
<YYINITIAL> {QQLINK} { return new Symbol(sym.STRING, yytext().substring(1, yytext().length() - 1)); }
<YYINITIAL> {STRING} { return new Symbol(sym.STRING, yytext()); }
CUP is the corresponding LALR parser generator. Let's also look at its specification file for the URL extractor:
/* Package and import specifications */
package SimpleExample;
import java_cup.runtime.*;
import java.util.*;
import java.io.*;

action code {: /* Place methods and variables within the generated action class. */
    protected ArrayList URLs = new ArrayList(); // A list to contain the extracted URLs.
:};

parser code {: /* Place methods and variables within the generated parser class. */
    SimpleExample.Yylex htmlScanner;

    private parser(InputStream is)
    {
        this();
        htmlScanner = new SimpleExample.Yylex(is);
    }

    /** Use this static function to extract URLs from HTML documents. */
    public static final ArrayList getURLs(InputStream inputStream) throws Exception
    {
        SimpleExample.parser parse_obj = new SimpleExample.parser(inputStream);
        parse_obj.parse();
        return parse_obj.action_obj.URLs;
    }
:};

scan with {: /* Return the next symbol from the JLex-generated lexical analyzer. */
    return htmlScanner.nextSymbol();
:};

/* Declare terminal symbols returned by the JLex-generated lexical analyzer. */
terminal EQUAL, GREATER_THAN, LESS_THAN;
terminal String HREF, SRC, STRING;

/* Declare non-terminal symbols */
non terminal Integer html;
non terminal String html_tag_clause;

/* The grammar for HTML documents, with syntactic error recovery */
html ::= STRING | html STRING | error
       | LESS_THAN html_tag_clause GREATER_THAN
       | html LESS_THAN html_tag_clause GREATER_THAN
       ;
html_tag_clause ::= STRING:s1
                    {: RESULT = s1; :}
                  | html_tag_clause:s1 STRING EQUAL STRING
                    {: RESULT = s1; :}
                  | html_tag_clause:s1 SRC EQUAL STRING:s2
                    {: if( s1.equalsIgnoreCase("IMG")
                           || s1.equalsIgnoreCase("FRAME")
                           || s1.equalsIgnoreCase("IFRAME")
                           || s1.equalsIgnoreCase("SCRIPT") )
                           URLs.add(s2);
                       RESULT = s1;
                    :}
                  | html_tag_clause:s1 HREF EQUAL STRING:s2
                    {: if( s1.equalsIgnoreCase("BASE")
                           || s1.equalsIgnoreCase("LINK")
                           || s1.equalsIgnoreCase("A") )
                           URLs.add(s2);
                       RESULT = s1;
                    :}
                  ;
Now a small parser for extracting URLs from HTML documents is at hand. You can get a list of all extracted URLs by calling SimpleExample.parser.getURLs(InputStream inputStream). Remember to utilize those two free toolkits, JLex and CUP, whenever you'd like to write an extensible parser in a short time.
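Tying Parts I and II together, the robot can feed the URLConnection input stream straight into the generated parser. The snippet below is only a usage sketch; it assumes the JLex and CUP sources above have already been generated and compiled into the SimpleExample package.

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;

// A usage sketch: extract URLs from a downloaded page with the generated parser.
public class ExtractDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://pii300/index.html");
        URLConnection urlConnection = url.openConnection();
        urlConnection.setAllowUserInteraction(false);
        urlConnection.connect();

        String contentType = urlConnection.getContentType();
        InputStream in = urlConnection.getInputStream();
        if (contentType != null && contentType.startsWith("text/html")) {
            ArrayList urls = SimpleExample.parser.getURLs(in); // JLex/CUP-generated parser
            for (int i = 0; i < urls.size(); i++) {
                System.out.println(urls.get(i)); // Each extracted (possibly relative) URL
            }
        }
        in.close();
    }
}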
Part III: Do Content Cognitive Filtering
The content processor essentially determines a document's relevance by processing the plain text in content objects. First, it should regard any character sequence found between space characters or punctuation symbols as a word. To remove numerical values, all the digit characters in the plain text can also serve as delimiters. Parsing plain text into words is simple:
String text = "This is a plain text.";
int last = 0, current = 0;
while( last < text.length() )
{
    for( current = last; current < text.length(); current++ )
        if( !Character.isLetter(text.charAt(current)) ) break;
    if( current > last )
    {
        String word = text.substring(last, current);
        ... // Process the split word.
    }
    last = current + 1; // Skip the delimiter.
}
Second, it should remove prefixes and suffixes from words according to a stemming list or a stemming algorithm. Stemming is the process of reducing words that have the same conceptual meaning, such as say, says, and said, to a single word. The Porter stemming algorithm is introduced in "IR Vocabulary" (http://www.cs.jhu.edu/~weiss/glossary.html). Third, it should remove all of the noise words in its stopword list, such as "the", "a", and "this". Fourth, it should recognize synonyms through its synonym list. The stopword and synonym lists can be maintained in plain text files or databases and loaded into memory as two hashtable objects for quick lookups. Lastly, the content processor begins the core weighting process, which decides whether a word is more important than the other keywords in the same URL, and whether a keyword in this URL is more important than the same keyword in other URLs. The popular weighting scheme evaluates word frequency and word location. Words that appear frequently are weighted more highly, while words that appear infrequently are given lower weights. Words in the document title are weighted most highly, and the document's keywords are given higher weights than words in headings. A good search engine should reduce a URL's weight when it abuses the keyword meta tag. URLs with the highest weights for the matching words should then appear highest in subsequent searches.
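A minimal sketch of the stopword and weighting steps is shown below. The stopword set, the location multiplier, and the class name WordWeigher are assumptions chosen for illustration; the article does not fix concrete values.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// An assumed sketch of stopword removal and frequency/location weighting.
public class WordWeigher {
    // In practice the stopword and synonym lists are loaded from files or a database.
    private static final Set<String> STOPWORDS =
            new HashSet<String>(Arrays.asList("the", "a", "this", "is", "of"));

    /** Accumulates a weight per word: frequency times a location multiplier. */
    public static Map<String, Integer> weigh(String[] words, boolean inTitle) {
        int locationWeight = inTitle ? 10 : 1;       // Title words count most (assumed factor).
        Map<String, Integer> weights = new HashMap<String, Integer>();
        for (String raw : words) {
            String word = raw.toLowerCase();
            if (STOPWORDS.contains(word)) continue;  // Drop noise words.
            // A stemming step (e.g. the Porter algorithm) would normalize the word here.
            Integer old = weights.get(word);
            weights.put(word, (old == null ? 0 : old.intValue()) + locationWeight);
        }
        return weights;
    }

    public static void main(String[] args) {
        String[] body = {"This", "robot", "indexes", "the", "web", "robot"};
        System.out.println(weigh(body, false)); // e.g. {robot=2, indexes=1, web=1}
    }
}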
Part IV: Searching in Index Database
Your searchable index database should consist of a keyword list (for each keyword, a list of the URLs that contain that keyword) and a URL list (for each URL, a list of the keywords found in that URL). If you are ambitious enough to index millions of URLs, you have to take good care of your huge, gigabyte-level index database; otherwise the user will lose patience waiting for the query results. You should consider a distributed storage strategy to overcome this slow-query bottleneck. For instance, try to classify the data into thousands of smaller tables according to the hash codes of the URLs or keywords. The search engine can then load a list of the keyword table names in advance, look through only those smaller tables that contain the requested words, rank the matching URLs in order of relevance to those words, and submit the list of search results as HTML.
All database operations in this search engine sample go through the JDBC 2.0 interface, so the sample is independent of the kind of index database. The SQL statements for creating its index database are simple:
create table urls(urlid int, url char(100), urlstatus int, urlmessage char(20), urltime timestamp, keywords object, urlsummary object)
create table keywords(wordid int, keyword char(20), urls object)
The index database will be more efficient in lookup speed if you create indexes on its url, urlid, keyword, and wordid columns. The keyword list for each URL is held as a serializable object in the urls.keywords column, and the URL list for each keyword is held as a serializable object in the keywords.urls column. The sample utilizes scrollable, updatable result sets to gain better performance for frequent inserts and updates. Since waiting for server responses and network bandwidth is the main bottleneck in indexing, the sample also utilizes thread-pool techniques to manage and run several robots in parallel.
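Below is a minimal sketch of how such a scrollable, updatable result set might be used to update a keyword's URL list in place. The JDBC calls are standard JDBC 2.0, but the connection URL, the hash-based word ID, and the serialized-object column handling are assumptions for illustration, and a real driver must support updatable result sets over these columns.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;

// An assumed sketch of updating the keywords table through a scrollable, updatable result set.
public class IndexUpdateSketch {
    public static void addUrlForKeyword(String keyword, int urlId) throws Exception {
        // The JDBC URL is a placeholder.
        Connection connection = DriverManager.getConnection("jdbc:odbc:indexdb");
        Statement statement = connection.createStatement(
                ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_UPDATABLE);
        ResultSet rs = statement.executeQuery(
                "select wordid, keyword, urls from keywords where keyword = '" + keyword + "'");
        if (rs.next()) {
            // The URL list is stored as a serializable object in the urls column.
            ArrayList urls = (ArrayList) rs.getObject("urls");
            if (urls == null) urls = new ArrayList();
            urls.add(Integer.valueOf(urlId));
            rs.updateObject("urls", urls);   // Update the row in place...
            rs.updateRow();                  // ...and push the change to the database.
        } else {
            rs.moveToInsertRow();            // No row yet for this keyword: insert one.
            rs.updateInt("wordid", keyword.hashCode());
            rs.updateString("keyword", keyword);
            ArrayList urls = new ArrayList();
            urls.add(Integer.valueOf(urlId));
            rs.updateObject("urls", urls);
            rs.insertRow();
        }
        rs.close();
        statement.close();
        connection.close();
    }
}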
There is no setup requirement for the end user, since the search service is provided by a servlet. The servlet parses the user input into a series of words and supports Boolean queries with operators such as AND and OR. If you wish to use NOT and the wildcard character "*" to construct queries, you should consider using JLex and CUP to build a query parser. Cascading searches can also be implemented if you cache the latest queries and hide a query session ID in the query form.
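As a final illustration, the servlet's AND/OR handling can be reduced to set operations over the per-keyword URL lists from the index. The sketch below is an assumption about how such evaluation might look, not code from the sample; lookupUrls stands in for the actual index query.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// An assumed sketch of evaluating a simple "word AND word OR word" query
// against per-keyword URL-id sets taken from the index database.
public class BooleanQuerySketch {
    /** Placeholder for the real index lookup: returns the URL ids containing the word. */
    static Set<Integer> lookupUrls(String word) {
        return new HashSet<Integer>(); // In the sample this would read keywords.urls.
    }

    /** Evaluates tokens left to right, e.g. ["java", "AND", "robot", "OR", "spider"]. */
    public static Set<Integer> evaluate(List<String> tokens) {
        Set<Integer> result = lookupUrls(tokens.get(0));
        for (int i = 1; i + 1 < tokens.size(); i += 2) {
            Set<Integer> next = lookupUrls(tokens.get(i + 1));
            if (tokens.get(i).equalsIgnoreCase("AND")) {
                result.retainAll(next);   // Intersection for AND.
            } else {
                result.addAll(next);      // Union for OR.
            }
        }
        return result;
    }
}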
References:
1. W3C Recommendation, HTML 4.01 Specification
2. The Internet Society, RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1
3. Scott Weiss, IR Vocabulary: Glossary for Information Retrieval
4. Martijn Koster, Guidelines for Robot Writers
5. Elliot Berk, JLex: A Lexical Analyzer Generator for Java
6. Scott E. Hudson, CUP: A Parser Generator for Java