Do you need a link-checker facility to see whether all the links on your web pages refer to actual pages? Are you planning to add search functionality to your CD-ROM publications or web site? Then let's look at how common search engines work.
A robot is a program that retrieves a particular starting URL and then automatically walks across all the URLs referenced from the retrieved document. Some robots use recursion to make URL connections, fetch data, extract URLs, and retrieve information from the Web or from file systems. The gathered information is filtered and added to an index database. When you submit a query through an HTML form, a CGI script (or servlet or server program) running on the web server processes your request. After the search engine looks through the index database for your keywords, you receive a list of URLs in order of relevance to the keywords you entered. This article includes such an extendable search engine demo program written in Java 2.
Part I: Write a Polite, Smart Robot
The pseudocode for the robot is not very complicated:
Prompt the user to enter the starting URL.
Obtain the user's starting URL.
Add the starting URL's information to the table of known URL information.
Add the starting URL's information to the list of unprocessed URL information.
While the list of unprocessed URL information is not empty {
    Sleep the currently executing thread for one minute, so as not to burden
        a server with a large number of connection requests in a short space of time.
    Get an entry from the list of unprocessed URL information.
    Remove that entry from the list of unprocessed URL information.
    Mark the URL as having been retrieved.
    Create and open a communications link between the robot and the URL.
    If there is no registered content filter for the content type of the
        referenced resource {
        Construct a content object that holds the file name of the URL as its title.
    }
    Else {
        Retrieve information through the communications link.
        Parse the raw data with the corresponding content filter.
        Construct a content object that holds the filtered information,
            such as title, keywords, summary, and text.
        Put all extracted URLs into a list.
        Scan every URL in the extracted URL list {
            If the URL is present in the table of known URL information {
                Get the URL's information from the table of known URL information.
                If the URL has already been retrieved, continue the scan.
            }
            Else {
                Construct URL information for the URL.
                Add the URL information to the table of known URL information.
            }
            If the current recursive depth is beyond the maximal recursive depth {
                Continue the scan.
            }
            If there is a list of interesting URL identifiers and the URL
                doesn't contain any of them {
                Continue the scan.
            }
            If the URL's protocol is HTTP and the Robots Exclusion Protocol
                denies the robot access to the URL {
                Continue the scan.
            }
            Add the URL information to the list of unprocessed URL information.
        }
    }
    Index the filtered information in the above content object with a content processor.
}
Display an information retrieval report.
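Before diving into the connection details, here is a minimal Java sketch of that main loop. It is only an illustration of the pseudocode above, not the demo program itself; the UrlInfo class and the fetchAndIndex() helper are hypothetical stand-ins for the demo's own classes.
import java.net.URL;
import java.util.*;

// A bare-bones version of the robot's main loop: a FIFO work list,
// a table of known URLs, and a recursion-depth limit.
public class RobotLoopSketch {
    static class UrlInfo {
        URL url; int depth; boolean retrieved;
        UrlInfo(URL url, int depth) { this.url = url; this.depth = depth; }
    }

    static final int MAX_DEPTH = 5;   // Maximal recursive depth (example value).

    public static void crawl(URL start) throws Exception {
        Map<String, UrlInfo> known = new HashMap<String, UrlInfo>();
        LinkedList<UrlInfo> unprocessed = new LinkedList<UrlInfo>();
        UrlInfo startInfo = new UrlInfo(start, 0);
        known.put(start.toString(), startInfo);
        unprocessed.add(startInfo);

        while (!unprocessed.isEmpty()) {
            Thread.sleep(60 * 1000);                  // Be polite: pause between requests.
            UrlInfo current = unprocessed.removeFirst();
            current.retrieved = true;
            List<URL> extracted = fetchAndIndex(current);  // Connect, filter, index, extract URLs.
            for (URL u : extracted) {
                UrlInfo info = known.get(u.toString());
                if (info != null && info.retrieved) continue;   // Already fetched.
                if (info == null) {
                    info = new UrlInfo(u, current.depth + 1);
                    known.put(u.toString(), info);
                }
                if (info.depth > MAX_DEPTH) continue;            // Beyond the depth limit.
                // A real robot would also avoid re-queuing URLs that are already waiting.
                unprocessed.add(info);
            }
        }
        System.out.println("Processed " + known.size() + " known URLs."); // Simplified report.
    }

    // Placeholder for opening the connection, running content filters, and indexing.
    static List<URL> fetchAndIndex(UrlInfo info) { return new ArrayList<URL>(); }
}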
Here's the detailed code to open a communications link between the robot and a URL:
// Create a java.net.URL object according to a specific URL in String form.
URL url = new URL(spec);
// Create a java.net.URLConnection object that represents a
// communications link between the robot and a URL.
URLConnection urlConnection = url.openConnection();
// Set the allowUserInteraction field of this URLConnection
// to disallow user interactions.
urlConnection.setAllowUserInteraction(false);
// Open a communications link to the resource referenced by this URL.
urlConnection.connect();
// Get the content type of the resource that the URL references.
String contentType = urlConnection.getContentType();
// Get an input stream that reads from this open connection.
InputStream in = urlConnection.getInputStream();
// ... Extract URLs and process content from the input stream
// according to its content type.
HttpURLConnection is the most prevalent URLConnection subclass for a WWW robot, although URLConnection's subclasses also include JarURLConnection, FtpURLConnection, FileURLConnection, and so on.
To make your robot polite, you should tell web servers your robot's name and your mailbox. To make your robot smart, you should consider implementing four capabilities:
- The robot can use a proxy to access remote sites even when it is behind a firewall.
- The robot can accept and send back a cookie if the web site allots it one.
- The robot can use the Referer request header to pass reference checking, since many web sites use a form of security that checks the Referer header.
- The robot can detect and avoid infinite redirection loops while it automatically follows HTTP redirection without user interaction.
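As a rough sketch of those four points in terms of java.net (the proxy address, robot name, and mailbox below are placeholder values, and redirect-loop detection is only hinted at in a comment):
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

// Identify the robot, work through a proxy, and prepare for manual
// redirect handling. All host names and header values are examples.
public class PoliteConnectionSketch {
    public static URLConnection open(URL url, String referer, String cookie)
            throws java.io.IOException {
        // 1. Use a proxy to reach remote sites from behind a firewall.
        System.setProperty("http.proxyHost", "proxy.mycompany.com");
        System.setProperty("http.proxyPort", "8080");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Tell the server the robot name and a contact mailbox.
        conn.setRequestProperty("User-Agent", "Java Robot/1.01");
        conn.setRequestProperty("From", "robotmaster@host.com");
        // 2. Send back a cookie the site allotted on an earlier response.
        if (cookie != null) conn.setRequestProperty("Cookie", cookie);
        // 3. Supply a Referer header to pass Referer-based checks.
        if (referer != null) conn.setRequestProperty("Referer", referer);
        // 4. Follow redirects manually, so the robot can keep a set of visited
        //    Location values and stop when it sees one twice (loop detection).
        conn.setInstanceFollowRedirects(false);
        conn.connect();
        return conn;
    }
}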
You can successfully program a lightweight, customized HttpURLConnection class or function if you know the four basic stages of direct HTTP GET communication between a client and a web server. First, a stream socket is created and connected to the specified port number on the server; an HTTP server almost always uses port 80. Next, an HTTP GET request message, with a collection of additional information such as the browser's name and acceptable character sets, is sent from the client to the server. Third, after receiving and interpreting the request message, the server responds with an HTTP response message containing a status line that indicates the result of the request. The response message may also contain other information, such as the content type and the content body of the destination URL. Finally, the server closes the connection immediately after sending the response if the client does not intend to maintain a persistent connection.
Let's see a typical GET request message for the URL "http://pii300/index.html":
GET /index.html HTTP/1.0
Referer: http://pii300/java.html
User-Agent: Mozilla/4.71 [en] (Win95; I)
Pragma: no-cache
Host: pii300
Accept: image/gif, image/x-xbitmap, image/jpeg,
image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: en
Accept-Charset: iso-8859-1,*,utf-8
/* a blank line */
A typical response from the server:
HTTP/1.0 200 OK
Content-Type: text/html
Content-Length: 759
Expires: Fri, 01 Dec 2000 16:00:00 GMT
Server: JRun Web Server/1.0
/* a blank line */
<HTML>/* HTML text of the Web page */</HTML>
To retrieve the URL "http://pii300/index.html" without a browser:
// Create a java.net.URL object according to a specific URL in String form.
URL url = new URL("http://pii300/index.html");
// If the port is not set in the URL string, set the default port to 80.
if(url.getPort()==-1)
    url = new URL(url.getProtocol(), url.getHost(), 80, url.getFile());
// Create a stream socket and connect it to the specified port number on the web host.
Socket socket = new Socket(url.getHost(), url.getPort());
// Throw a java.io.InterruptedIOException if the timeout expires.
socket.setSoTimeout(120000);
// Get a print stream from the output stream for this socket.
java.io.PrintStream out = new java.io.PrintStream(socket.getOutputStream());
// Construct an HTTP GET request line with the relative URL and the HTTP/1.0 version.
out.print("GET " + url.getFile() + " HTTP/1.0" + "\r\n");
// Tell the server the referring URL from which the requested URL was obtained,
// so that an obsolete or mistyped link can be traced for maintenance.
out.print("Referer: " + "http://pii300/java.html\r\n");
// Tell the server the robot name so that the server can distinguish it from browsers.
out.print("User-Agent: Java Robot/1.01" + "\r\n");
// Provide your mailbox so that the server maintainer can contact you in case of problems.
out.print("From: robotmaster@host.com" + "\r\n");
out.print("Pragma: no-cache\r\n"); // Ignore the caches.
// Provide the Internet host and port number of the relative URL being requested.
out.print("Host: " + url.getHost() + ":" + url.getPort() + "\r\n");
// Accept all media types and their subtypes.
out.print("Accept: */*\r\n");
// Submit a cookie if the web site has allotted one.
out.print("Cookie: " + "sessionid=126\r\n");
// A blank line indicates the end of the header fields.
out.print("\r\n");
// Flush the stream.
out.flush();
// Get an input stream for this socket and process the HTTP response message
// from the server, which includes the HTML text of "http://pii300/index.html".
InputStream in = socket.getInputStream();
// ... read and process the response from the input stream ...
// Close this socket.
socket.close();
Part II: Construct a Flexible Content Filter
How do you parse documents and extract URLs from the URLConnection input stream? Maybe the idea of using String.indexOf() to hunt down "http://" flashed into your mind. No, you need a powerful, flexible syntax parser for HTML. It should extract not only URLs but also the title, keywords, summary, and text. It should handle not only known HTML syntax but also future HTML syntax, and even erroneous HTML syntax. The zyh.robot.TextHTML package is such an efficient parser for plain text and HTML documents, with syntactic error recovery. Thanks to the programmers of JLex and CUP, all the parsing code is generated with no sweat by JLex and CUP from the corresponding specifications in Yylex and TextHTML.cup. JLex takes a specification file and creates a Java source file for the corresponding lexical analyzer. For instance, here is a simple JLex specification file:
/** A lexical analyzer that breaks an input stream into symbols for the URL extractor. */
/* User code section is mainly used for a package declaration and the importation
of some external classes. */
package SimpleExample;
import java_cup.runtime.Symbol;
/* JLex directives section is used for macros definitions.*/
%%
%{ /* Declaration of variables and functions internal to the generated
lexical analyzer class. */
/** Return the next parsed symbol or an error symbol if there is a lexical
error */
public Symbol nextSymbol()
throws java.io.IOException
{
try
{
return parse();
}
catch(java.lang.Error e)
{//Lexical Error
return new Symbol(sym.error);
}
}
%}
%type Symbol /* Change the return type of the tokenizing function from
Yytoken to Symbol. */
%function parse /* Change the name of the tokenizing function from yylex
to parse. */
%public /* Cause the lexical analyzer class generated by JLex to
be a public class. */
%unicode /* Generate a lexical analyzer that supports an alphabet of character
codes between 0 and 2^16-1 inclusive. */
%ignorecase /* Generate a case-insensitive lexical analyzer. */
%eofval{
return (new Symbol(sym.EOF));//Return an EOF symbol after the end of input
stream is reached.
%eofval}
/* Macro definitions for white space */
WHITE_SPACE_CHAR=[\ \t\b\012\r\n]
/* Macro definitions for phrase or string */
PHRASE = [^\"\<\>\=\ \t\b\012\r\n] [^\"\<\>\=\ \t\b\012\r\n]*
QQLINK = \"([^\"\<])*\"
STRING = [^\"\<\>\=\ \t\b\012\r\n] +\"[^\"\<\>\=\ \t\b\012\r\n]*
/* Regular expression rules section contains the rules of lexical analysis,
each of which consists of three parts: an optional state list, a regular
expression, and an action. */
%%
<YYINITIAL> {WHITE_SPACE_CHAR} {}
<YYINITIAL> "=" { return new Symbol(sym.EQUAL);}
<YYINITIAL> ">" { return new Symbol(sym.GREATER_THAN);}
<YYINITIAL> "<" { return new Symbol(sym.LESS_THAN);}
<YYINITIAL> "HREF" { return new Symbol(sym.HREF,yytext()); }
<YYINITIAL> "SRC" { return new Symbol(sym.SRC,yytext()); }
<YYINITIAL> {PHRASE} { return new Symbol(sym.STRING,yytext()); }
<YYINITIAL> {QQLINK} { return new Symbol(sym.STRING,yytext().substring(1,
yytext().length() - 1)); }
<YYINITIAL> {STRING} { return new Symbol(sym.STRING,yytext()); }
CUP is, correspondingly, an LALR parser generator. Let's also look at its specification file for the URL extractor:
/* Package and Import Specifications */
package SimpleExample;
import java_cup.runtime.*;
import java.util.*;
import java.io.*;
action code {: /* Place methods and variables within the generated action class. */
protected ArrayList URLs=new ArrayList(); //Define a list to contain the
extracted URLs.
:};
parser code {: /* Place methods and variables within the generated parser class. */
SimpleExample.Yylex htmlScanner;
private parser(InputStream is)
{
this();
htmlScanner = new SimpleExample.Yylex(is);
}
/** Use this static function to extract URLs from HTML documents. */
public static final ArrayList getURLs (InputStream inputStream) throws
Exception
{
SimpleExample.parser parse_obj = new SimpleExample.parser(inputStream);
parse_obj.parse();
return parse_obj.action_obj.URLs;
}
:};
scan with {: /* Return the next symbol from the JLex generated lexical
analyzer. */
return htmlScanner.nextSymbol();
:};
/* Declare terminal symbols returned by the JLex generated lexical analyzer.
*/
terminal EQUAL, GREATER_THAN, LESS_THAN;
terminal String HREF, SRC, STRING;
/* Declare non-terminal symbols */
non terminal Integer html;
non terminal String html_tag_clause;
/* The grammar for HTML documents with syntactic error recovery */
html ::= STRING | html STRING | error
| LESS_THAN html_tag_clause GREATER_THAN
| html LESS_THAN html_tag_clause GREATER_THAN
;
html_tag_clause ::= STRING:s1
{: RESULT = s1; :}
| html_tag_clause:s1 STRING EQUAL STRING
{: RESULT = s1; :}
| html_tag_clause:s1 SRC EQUAL STRING:s2
{: if( s1.equalsIgnoreCase("IMG")
|| s1.equalsIgnoreCase("FRAME")
|| s1.equalsIgnoreCase("IFRAME")
|| s1.equalsIgnoreCase("SCRIPT")
)
URLs.add(s2);
RESULT = s1;
:}
| html_tag_clause:s1 HREF EQUAL STRING:s2
{: if(s1.equalsIgnoreCase("BASE")
|| s1.equalsIgnoreCase("LINK")
|| s1.equalsIgnoreCase("A")
)
URLs.add(s2);
RESULT = s1;
:}
;
Now a small parser for extracting URLs from HTML documents is at hand. You can get a list of all extracted URLs by calling SimpleExample.parser.getURLs(InputStream inputStream). Remember to use these two free toolkits, JLex and CUP, if you would like to write an extendable parser in a short time.
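For instance, a minimal sketch of hooking the generated parser up to the robot's URLConnection input stream might look like this (it assumes the JLex- and CUP-generated SimpleExample classes are on the classpath; the page URL is just the article's earlier example):
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;

// Fetch one page, extract every link with the generated parser,
// and resolve relative links against the page URL.
public class ExtractDemo {
    public static void main(String[] args) throws Exception {
        URL page = new URL("http://pii300/index.html");
        URLConnection conn = page.openConnection();
        InputStream in = conn.getInputStream();
        ArrayList urls = SimpleExample.parser.getURLs(in);
        in.close();
        for (int i = 0; i < urls.size(); i++) {
            // Extracted links may be relative, so build absolute URLs.
            URL absolute = new URL(page, (String) urls.get(i));
            System.out.println(absolute);
        }
    }
}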
Part III: Do Content Cognitive Filtering
The content processor essentially determines a document's relevance by processing the plain text in content objects. First, it should regard any character sequence found between space characters or punctuation symbols as a word. To remove numerical values, the digit characters in the plain text can also serve as delimiters. It is simple to parse plain text into words:
String text = "This is a plain text.";
int last = 0, current = 0;
while( last < text.length() )
{
    for(current = last; current < text.length(); current++)
        if( !Character.isLetter( text.charAt(current))) break;
    if( current > last)
    {
        String word = text.substring( last, current);
        ...//Process the split word.
    }
    last = current + 1;//Skip the delimiter.
}
Second, it should remove prefixes and suffixes from words according to a stemming list or a stemming algorithm. Stemming is the process of grouping words that have the same conceptual meaning, such as say, says, and said, into one word. The Porter stemming algorithm is introduced in "IR Vocabulary" (http://www.cs.jhu.edu/~weiss/glossary.html). Third, it should completely remove all noise words found in its stopword list, such as "the", "a", and "this". Fourth, it should recognize synonyms through its synonym list. The stopword and synonym lists can be maintained in plain text files or databases, and loaded into memory as two hashtable objects for quick lookup. Lastly, the content processor begins the core weighting process, which decides whether a word is more important than other keywords in this URL, and whether a keyword in this URL is more important than the same keyword in other URLs. The popular weighting scheme evaluates word frequency and location. Words that appear frequently are weighted more highly, while words that appear infrequently are given lower weights. Words in the document title are weighted most highly, and the document's keywords are given higher weights than words in headings. A good search engine should reduce a URL's weight when it abuses the keyword meta tag. URLs with the highest weight of matching words should appear highest in a subsequent search.
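As a toy illustration of this weighting idea (the stopwords and the location multipliers below are arbitrary example values, not the demo program's scheme), a content processor might accumulate per-word weights like this:
import java.util.*;

// Accumulate a weight for each word: stopwords are dropped, and every
// occurrence adds a location-dependent amount to the word's total.
public class WeightingSketch {
    static final Set STOPWORDS = new HashSet(Arrays.asList(
            new String[] { "the", "a", "this" }));

    public static Map weigh(String[] titleWords, String[] keywordWords,
                            String[] bodyWords) {
        Map weights = new HashMap();          // word -> Double weight
        addAll(weights, titleWords, 5.0);     // Title words count most.
        addAll(weights, keywordWords, 3.0);   // Meta keywords count less.
        addAll(weights, bodyWords, 1.0);      // Plain body text counts least.
        return weights;
    }

    static void addAll(Map weights, String[] words, double factor) {
        for (int i = 0; i < words.length; i++) {
            String w = words[i].toLowerCase(); // A real processor would also stem here.
            if (STOPWORDS.contains(w)) continue;
            Double old = (Double) weights.get(w);
            weights.put(w, new Double((old == null ? 0.0 : old.doubleValue()) + factor));
        }
    }
}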
Part IV: Searching the Index Database
Your searchable index database should consist of a keyword list (for each keyword, a list of the URLs that contain it) and a URL list (for each URL, a list of the keywords found in that URL). If you are ambitious enough to index millions of URLs, you must take good care of your huge, gigabyte-level index database; otherwise the user will lose patience waiting for query results. You should consider a distributed storage strategy to overcome this long-running query bottleneck. For instance, try to classify the data into thousands of smaller tables according to the hash codes of the URLs or keywords. The search engine can then load in advance a list of the table names for keywords, look through only those smaller tables that contain the requested words, rank the matching URLs in order of relevance to those words, and return the list of search results as HTML.
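A minimal sketch of that hashing idea (the table-name prefix and the table count are illustrative):
// Route a keyword to one of, say, 1,000 smaller keyword tables, so that a
// query only has to scan the tables that can contain the requested words.
public class TableRouter {
    static final int TABLE_COUNT = 1000;

    public static String tableFor(String keyword) {
        // Mask the sign bit so the bucket index is always non-negative.
        int bucket = (keyword.toLowerCase().hashCode() & 0x7fffffff) % TABLE_COUNT;
        return "keywords" + bucket;   // e.g. keywords0 ... keywords999
    }
}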
All database operations in this search engine sample go through the JDBC 2.0 interface, so the sample is independent of the kind of index database used. The SQL statements for creating its index database are simple:
- create table urls(urlid int, url char(100), urlstatus int, urlmessage char(20), urltime timestamp, keywords object, urlsummary object)
- create table keywords(wordid int, keyword char(20), urls object)
The index database will be more efficient in lookup speed if you create indexes on its url, urlid, keyword, and wordid columns. The keyword list of every URL is held as a serializable object in the urls.keywords column, and the URL list of every keyword is held as a serializable object in the keywords.urls column. The sample utilizes scrollable, updatable result sets to gain better performance on frequent inserts and updates. Since waiting for server responses and network bandwidth is the main bottleneck in indexing, the sample also utilizes thread-pool techniques to manage and run several robots in parallel.
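Here is a sketch of the scrollable, updatable result set technique under JDBC 2.0. The column layout follows the urls/keywords tables above, but the serialization of the URL list into bytes, and the way the wordid is chosen, are simplifications for illustration:
import java.sql.*;

// Update a keyword's URL list in place, or insert a new row if the
// keyword has never been seen, using a scrollable, updatable result set.
public class IndexUpdaterSketch {
    public static void storeUrlList(Connection con, String keyword,
                                    byte[] serializedUrlList) throws SQLException {
        Statement stmt = con.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,
                                             ResultSet.CONCUR_UPDATABLE);
        ResultSet rs = stmt.executeQuery(
                "SELECT wordid, keyword, urls FROM keywords WHERE keyword = '"
                + keyword + "'");
        if (rs.next()) {
            rs.updateBytes("urls", serializedUrlList);   // Overwrite the existing list.
            rs.updateRow();
        } else {
            rs.moveToInsertRow();                        // No row yet: insert one.
            rs.updateInt("wordid", keyword.hashCode());
            rs.updateString("keyword", keyword);
            rs.updateBytes("urls", serializedUrlList);
            rs.insertRow();
        }
        rs.close();
        stmt.close();
    }
}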
There is no setup requirement for the end user, since the search service is provided by a servlet. The servlet parses the user input into a series of words, and supports Boolean queries with operators such as AND and OR. If you wish to use NOT and the wildcard character "*" to construct queries, you should consider using JLex and CUP to build a query parser. Cascading search can also be implemented if you cache the latest queries and hide a query session ID in the query form.
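A minimal sketch of such a search servlet follows; the request parameter name, the hypothetical Index.search() helper, and the bare-bones HTML output are all illustrative:
import java.io.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;

// Parse the query string into words, look them up in the index database,
// and write a ranked result list back to the user as HTML.
public class SearchServlet extends HttpServlet {
    public void doGet(HttpServletRequest req, HttpServletResponse res)
            throws ServletException, IOException {
        String query = req.getParameter("q");
        res.setContentType("text/html");
        PrintWriter out = res.getWriter();
        out.println("<HTML><BODY>");
        if (query != null) {
            // Split the query into words; AND/OR handling would go here.
            StringTokenizer st = new StringTokenizer(query);
            List words = new ArrayList();
            while (st.hasMoreTokens()) words.add(st.nextToken().toLowerCase());
            // Index lookup and ranking are delegated to a hypothetical helper
            // that returns URLs ordered by relevance.
            List results = Index.search(words);
            for (int i = 0; i < results.size(); i++) {
                String url = (String) results.get(i);
                out.println("<A HREF=\"" + url + "\">" + url + "</A><BR>");
            }
        }
        out.println("</BODY></HTML>");
    }
}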
References
1. W3C Recommendation, HTML 4.01 Specification
2. The Internet Society, RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1
3. Scott Weiss, IR Vocabulary: Glossary for Information Retrieval
4. Martijn Koster, Guidelines for Robot Writers
5. Elliot Berk, JLex: A Lexical Analyzer Generator for Java
6. Scott E. Hudson, CUP: A Parser Generator for Java
The source files can be found here.