Do you need a link-checker facility to see whether all the links on your web pages refer to actual pages? Are you planning to add search functionality to your CD-ROM publications or web site? Then let's look at how common search engines work.
A robot is a program that retrieves a particular starting URL and then automatically walks across all the URLs referenced from the retrieved document. Some robots use recursion to make URL connections, fetch data, extract URLs, and retrieve information from the Web or from file systems. The gathered information is filtered and added to an index database. When you submit a query through an HTML form, a CGI script (or servlet or server program) running on the web server processes your request. After the search engine looks through the index database for your keywords, you receive a list of URLs in order of relevance to the keywords you entered. This article includes such an extendable search engine demo program written in Java 2.
Part I: Write a Polite, Smart Robot
The pseudocode for the robot is not very complicated:
Prompt the user to enter the starting URL.
Obtain the user's starting URL.
Add the starting URL's information to the table of known URL information.
Add the starting URL's information to the list of unprocessed URL information.
While the list of unprocessed URL information is not empty {
    Sleep the currently executing thread for one minute, so as not to burden
        a server with a large number of connection requests in a short space of time.
    Get an entry from the list of unprocessed URL information.
    Remove that entry from the list of unprocessed URL information.
    Mark the URL as having been retrieved.
    Create and open a communications link between the robot and the URL.
    If there is no registered content filter for the content type of the
        referenced resource {
        Construct a content object that holds the file name of the URL as its title.
    }
    Else {
        Retrieve information through the communications link.
        Parse the raw data with the corresponding content filter.
        Construct a content object that holds the filtered information,
            such as title, keywords, summary, and text.
        Put all extracted URLs into a list.
        Scan every URL in the extracted URL list {
            If the URL is present in the table of known URL information {
                Get the URL's information from the table of known URL information.
                If the URL has already been retrieved, continue the scan.
            }
            Else {
                Construct URL information for the URL.
                Add the URL information to the table of known URL information.
            }
            If the current recursive depth is beyond the maximal recursive depth {
                Continue the scan.
            }
            If there is a list of interesting URL identifiers and the URL
                doesn't contain any of them {
                Continue the scan.
            }
            If the URL's protocol is HTTP and the Robots Exclusion Protocol
                denies the robot access to the URL {
                Continue the scan.
            }
            Add the URL information to the list of unprocessed URL information.
        }
    }
    Index the filtered information in the above content object with a content processor.
}
Display an information retrieval report.
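Before diving into the connection details, here is a minimal Java sketch of that main loop. It is only an illustration of the pseudocode above, not the demo program itself; the UrlInfo class and the fetchAndIndex() helper are hypothetical stand-ins for the demo's own classes.
import java.net.URL;
import java.util.*;

// A bare-bones version of the robot's main loop: a FIFO work list,
// a table of known URLs, and a recursion-depth limit.
public class RobotLoopSketch {
    static class UrlInfo {
        URL url; int depth; boolean retrieved;
        UrlInfo(URL url, int depth) { this.url = url; this.depth = depth; }
    }

    static final int MAX_DEPTH = 5;   // Maximal recursive depth (example value).

    public static void crawl(URL start) throws Exception {
        Map<String, UrlInfo> known = new HashMap<String, UrlInfo>();
        LinkedList<UrlInfo> unprocessed = new LinkedList<UrlInfo>();
        UrlInfo startInfo = new UrlInfo(start, 0);
        known.put(start.toString(), startInfo);
        unprocessed.add(startInfo);

        while (!unprocessed.isEmpty()) {
            Thread.sleep(60 * 1000);                  // Be polite: pause between requests.
            UrlInfo current = unprocessed.removeFirst();
            current.retrieved = true;
            List<URL> extracted = fetchAndIndex(current);  // Connect, filter, index, extract URLs.
            for (URL u : extracted) {
                UrlInfo info = known.get(u.toString());
                if (info != null && info.retrieved) continue;   // Already fetched.
                if (info == null) {
                    info = new UrlInfo(u, current.depth + 1);
                    known.put(u.toString(), info);
                }
                if (info.depth > MAX_DEPTH) continue;            // Beyond the depth limit.
                // A real robot would also avoid re-queuing URLs that are already waiting.
                unprocessed.add(info);
            }
        }
        System.out.println("Processed " + known.size() + " known URLs."); // Simplified report.
    }

    // Placeholder for opening the connection, running content filters, and indexing.
    static List<URL> fetchAndIndex(UrlInfo info) { return new ArrayList<URL>(); }
}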
Here's the detailed code to open a communications link between the robot and a URL:
// Create a java.net.URL object according to a specific URL in String form.
URL url = new URL(spec);
// Create a java.net.URLConnection object that represents a
// communications link between the robot and a URL.
URLConnection urlConnection = url.openConnection();
// Set the allowUserInteraction field of this URLConnection
// to disallow user interactions.
urlConnection.setAllowUserInteraction(false);
// Open a communications link to the resource referenced by this URL.
urlConnection.connect();
// Get the content type of the resource that the URL references.
String contentType = urlConnection.getContentType();
// Get an input stream that reads from this open connection.
InputStream in = urlConnection.getInputStream();
// ... Extract URLs and process content from the input stream
// according to its content type.
HttpURLConnection is the most prevalent URLConnection subclass for a WWW robot, although URLConnection's subclasses also include JarURLConnection, FtpURLConnection, FileURLConnection, and so on.
To make your robot polite, you should tell web servers your robot's name and your mailbox. To make your robot smart, you should consider implementing four capabilities:
- The robot can use a proxy to access remote sites even when it is behind a firewall.
- The robot can accept and send back a cookie if the web site allots it one.
- The robot can use the Referer request header to pass reference checking, since many web sites use a form of security that checks the Referer header.
- The robot can detect and avoid infinite redirection loops while it automatically follows HTTP redirection without user interaction.
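As a rough sketch of those four points in terms of java.net (the proxy address, robot name, and mailbox below are placeholder values, and redirect-loop detection is only hinted at in a comment):
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

// Identify the robot, work through a proxy, and prepare for manual
// redirect handling. All host names and header values are examples.
public class PoliteConnectionSketch {
    public static URLConnection open(URL url, String referer, String cookie)
            throws java.io.IOException {
        // 1. Use a proxy to reach remote sites from behind a firewall.
        System.setProperty("http.proxyHost", "proxy.mycompany.com");
        System.setProperty("http.proxyPort", "8080");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Tell the server the robot name and a contact mailbox.
        conn.setRequestProperty("User-Agent", "Java Robot/1.01");
        conn.setRequestProperty("From", "robotmaster@host.com");
        // 2. Send back a cookie the site allotted on an earlier response.
        if (cookie != null) conn.setRequestProperty("Cookie", cookie);
        // 3. Supply a Referer header to pass Referer-based checks.
        if (referer != null) conn.setRequestProperty("Referer", referer);
        // 4. Follow redirects manually, so the robot can keep a set of visited
        //    Location values and stop when it sees one twice (loop detection).
        conn.setInstanceFollowRedirects(false);
        conn.connect();
        return conn;
    }
}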
You can successfully program a lightweight, customized HttpURLConnection class or function if you know the four basic stages of direct HTTP GET communication between a client and a web server. First, a stream socket is created and connected to the specified port number on the server; an HTTP server almost always uses port 80. Next, an HTTP GET request message, with a collection of additional information such as the browser's name and acceptable character sets, is sent from the client to the server. Third, after receiving and interpreting the request message, the server responds with an HTTP response message containing a status line that indicates the result of the request. The response message may also contain other information, such as the content type and the content body of the destination URL. Finally, the server closes the connection immediately after sending the response if the client does not intend to maintain a persistent connection.
Let's see a typical GET request message for the URL "http://pii300/index.html":
GET /index.html HTTP/1.0
Referer: http://pii300/java.html
User-Agent: Mozilla/4.71 [en] (Win95; I)
Pragma: no-cache
Host: pii300
Accept: image/gif, image/x-xbitmap, image/jpeg,
image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: en
Accept-Charset: iso-8859-1,*,utf-8
/* a blank line */
A typical response from the server:
HTTP/1.0 200 OK
Content-Type: text/html
Content-Length: 759
Expires: Fri, 01 Dec 2000 16:00:00 GMT
Server: JRun Web Server/1.0
/* a blank line */
<HTML>/* HTML text of the Web page */</HTML>
To retrieve the URL "http://pii300/index.html" without a browser:
// Create a java.net.URL object according to a specific URL in String form.
URL url = new URL("http://pii300/index.html");
// If the port is not set in the URL string, set the default port to 80.
if(url.getPort()==-1)
    url = new URL(url.getProtocol(), url.getHost(), 80, url.getFile());
// Create a stream socket and connect it to the specified port number on the web host.
Socket socket = new Socket(url.getHost(), url.getPort());
// Throw a java.io.InterruptedIOException if the timeout expires.
socket.setSoTimeout(120000);
// Get a print stream from the output stream for this socket.
java.io.PrintStream out = new java.io.PrintStream(socket.getOutputStream());
// Construct an HTTP GET request line with the relative URL and the HTTP/1.0 version.
out.print("GET " + url.getFile() + " HTTP/1.0" + "\r\n");
// Tell the server the referring URL from which the requested URL was obtained,
// so that an obsolete or mistyped link can be traced for maintenance.
out.print("Referer: " + "http://pii300/java.html\r\n");
// Tell the server the robot name so that the server can distinguish it from browsers.
out.print("User-Agent: Java Robot/1.01" + "\r\n");
// Provide your mailbox so that the server maintainer can contact you in case of problems.
out.print("From: robotmaster@host.com" + "\r\n");
out.print("Pragma: no-cache\r\n"); // Ignore the caches.
// Provide the Internet host and port number of the relative URL being requested.
out.print("Host: " + url.getHost() + ":" + url.getPort() + "\r\n");
// Accept all media types and their subtypes.
out.print("Accept: */*\r\n");
// Submit a cookie if the web site has allotted one.
out.print("Cookie: " + "sessionid=126\r\n");
// A blank line indicates the end of the header fields.
out.print("\r\n");
// Flush the stream.
out.flush();
// Get an input stream for this socket and process the HTTP response message
// from the server, which includes the HTML text of "http://pii300/index.html".
InputStream in = socket.getInputStream();
// ... read and process the response from the input stream ...
// Close this socket.
socket.close();
Part II: Construct a Flexible Content Filter
How do you parse documents and extract URLs from the URLConnection input stream? Maybe the idea of using String.indexOf() to hunt down "http://" flashed into your mind. No, you need a powerful, flexible syntax parser for HTML. It should extract not only URLs but also the title, keywords, summary, and text. It should handle not only known HTML syntax but also future HTML syntax, and even erroneous HTML syntax. The zyh.robot.TextHTML package is such an efficient parser for plain text and HTML documents, with syntactic error recovery. Thanks to the programmers of JLex and CUP, all the parsing code is generated with no sweat by JLex and CUP from the corresponding specifications in Yylex and TextHTML.cup. JLex takes a specification file and creates a Java source file for the corresponding lexical analyzer. For instance, here is a simple JLex specification file:
/** A lexical analyzer that breaks an input stream into symbols for the URL extractor. */
/* User code section is mainly used for a package declaration and the importation
of some external classes. */
package SimpleExample;
import java_cup.runtime.Symbol;
/* JLex directives section is used for macros definitions.*/
%%
%{ /* Declaration of variables and functions internal to the generated
lexical analyzer class. */
/** Return the next parsed symbol or an error symbol if there is a lexical
error */
public Symbol nextSymbol()
throws java.io.IOException
{
try
{
return parse();
}
catch(java.lang.Error e)
{//Lexical Error
return new Symbol(sym.error);
}
}
%}
%type Symbol /* Change the return type of the tokenizing function from
Yytoken to Symbol. */
%function parse /* Change the name of the tokenizing function from yylex
to parse. */
%public /* Cause the lexical analyzer class generated by JLex to
be a public class. */
%unicode /* Generate a lexical analyzer that supports an alphabet of character
codes between 0 and 2^16-1 inclusive. */
%ignorecase /* Generate a case-insensitive lexical analyzer. */
%eofval{
return (new Symbol(sym.EOF));//Return an EOF symbol after the end of input
stream is reached.
%eofval}
/* Macro definitions for white space */
WHITE_SPACE_CHAR=[\ \t\b\012\r\n]
/* Macro definitions for phrase or string */
PHRASE = [^\"\<\>\=\ \t\b\012\r\n] [^\"\<\>\=\ \t\b\012\r\n]*
QQLINK = \"([^\"\<])*\"
STRING = [^\"\<\>\=\ \t\b\012\r\n] +\"[^\"\<\>\=\ \t\b\012\r\n]*
/* Regular expression rules section contains the rules of lexical analysis,
each of which consists of three parts: an optional state list, a regular
expression, and an action. */
%%
<YYINITIAL> {WHITE_SPACE_CHAR} {}
<YYINITIAL> "=" { return new Symbol(sym.EQUAL);}
<YYINITIAL> ">" { return new Symbol(sym.GREATER_THAN);}
<YYINITIAL> "<" { return new Symbol(sym.LESS_THAN);}
<YYINITIAL> "HREF" { return new Symbol(sym.HREF,yytext()); }
<YYINITIAL> "SRC" { return new Symbol(sym.SRC,yytext()); }
<YYINITIAL> {PHRASE} { return new Symbol(sym.STRING,yytext()); }
<YYINITIAL> {QQLINK} { return new Symbol(sym.STRING,yytext().substring(1,
yytext().length() - 1)); }
<YYINITIAL> {STRING} { return new Symbol(sym.STRING,yytext()); }
CUP is, correspondingly, an LALR parser generator. Let's also look at its specification file for the URL extractor:
/* Package and Import Specifications */
package SimpleExample;
import java_cup.runtime.*;
import java.util.*;
import java.io.*;
action code {: /* Place methods and variables within the generated action class. */
protected ArrayList URLs=new ArrayList(); //Define a list to contain the
extracted URLs.
:};
parser code {: /* Place methods and variables within the generated parser class. */
SimpleExample.Yylex htmlScanner;
private parser(InputStream is)
{
this();
htmlScanner = new SimpleExample.Yylex(is);
}
/** Use this static function to extract URLs from HTML documents. */
public static final ArrayList getURLs (InputStream inputStream) throws
Exception
{
SimpleExample.parser parse_obj = new SimpleExample.parser(inputStream);
parse_obj.parse();
return parse_obj.action_obj.URLs;
}
:};
scan with {: /* Return the next symbol from the JLex generated lexical
analyzer. */
return htmlScanner.nextSymbol();
:};
/* Declare terminal symbols returned by the JLex generated lexical analyzer.
*/
terminal EQUAL, GREATER_THAN, LESS_THAN;
terminal String HREF, SRC, STRING;
/* Declare non-terminal symbols */
non terminal Integer html;
non terminal String html_tag_clause;
/* The grammar for HTML documents with syntactic error recovery */
html ::= STRING | html STRING | error
| LESS_THAN html_tag_clause GREATER_THAN
| html LESS_THAN html_tag_clause GREATER_THAN
;
html_tag_clause ::= STRING:s1
{: RESULT = s1; :}
| html_tag_clause:s1 STRING EQUAL STRING
{: RESULT = s1; :}
| html_tag_clause:s1 SRC EQUAL STRING:s2
{: if( s1.equalsIgnoreCase("IMG")
|| s1.equalsIgnoreCase("FRAME")
|| s1.equalsIgnoreCase("IFRAME")
|| s1.equalsIgnoreCase("SCRIPT")
)
URLs.add(s2);
RESULT = s1;
:}
| html_tag_clause:s1 HREF EQUAL STRING:s2
{: if(s1.equalsIgnoreCase("BASE")
|| s1.equalsIgnoreCase("LINK")
|| s1.equalsIgnoreCase("A")
)
URLs.add(s2);
RESULT = s1;
:}
;
Now a small parser for extracting URLs from HTML documents is at hand. You can get a list of all extracted URLs by calling SimpleExample.parser.getURLs(InputStream inputStream). Remember to use these two free toolkits, JLex and CUP, if you would like to write an extendable parser in a short time.
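For instance, a minimal sketch of hooking the generated parser up to the robot's URLConnection input stream might look like this (it assumes the JLex- and CUP-generated SimpleExample classes are on the classpath; the page URL is just the article's earlier example):
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;

// Fetch one page, extract every link with the generated parser,
// and resolve relative links against the page URL.
public class ExtractDemo {
    public static void main(String[] args) throws Exception {
        URL page = new URL("http://pii300/index.html");
        URLConnection conn = page.openConnection();
        InputStream in = conn.getInputStream();
        ArrayList urls = SimpleExample.parser.getURLs(in);
        in.close();
        for (int i = 0; i < urls.size(); i++) {
            // Extracted links may be relative, so build absolute URLs.
            URL absolute = new URL(page, (String) urls.get(i));
            System.out.println(absolute);
        }
    }
}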
Part III: Do Content Cognitive Filtering
The content processor essentially determines a document's relevance by processing the plain text in content objects. First, it should regard any character sequence found between space characters or punctuation symbols as a word. To remove numerical values, the digit characters in the plain text can also serve as delimiters. It is simple to parse plain text into words:
String text = "This is a plain text.";
int last = 0, current = 0;
while( last < text.length() )
{
    for(current = last; current < text.length(); current++)
        if( !Character.isLetter( text.charAt(current))) break;
    if( current > last)
    {
        String word = text.substring( last, current);
        ...//Process the split word.
    }
    last = current + 1;//Skip the delimiter.
}
Second, it should remove prefixes and suffixes from words according to a stemming list or a stemming algorithm. Stemming is the process of grouping words that have the same conceptual meaning, such as say, says, and said, into one word. The Porter stemming algorithm is introduced in "IR Vocabulary" (http://www.cs.jhu.edu/~weiss/glossary.html). Third, it should completely remove all noise words found in its stopword list, such as "the", "a", and "this". Fourth, it should recognize synonyms through its synonym list. The stopword and synonym lists can be maintained in plain text files or databases, and loaded into memory as two hashtable objects for quick lookup. Lastly, the content processor begins the core weighting process, which decides whether a word is more important than other keywords in this URL, and whether a keyword in this URL is more important than the same keyword in other URLs. The popular weighting scheme evaluates word frequency and location. Words that appear frequently are weighted more highly, while words that appear infrequently are given lower weights. Words in the document title are weighted most highly, and the document's keywords are given higher weights than words in headings. A good search engine should reduce a URL's weight when it abuses the keyword meta tag. URLs with the highest weight of matching words should appear highest in a subsequent search.
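As a toy illustration of this weighting idea (the stopwords and the location multipliers below are arbitrary example values, not the demo program's scheme), a content processor might accumulate per-word weights like this:
import java.util.*;

// Accumulate a weight for each word: stopwords are dropped, and every
// occurrence adds a location-dependent amount to the word's total.
public class WeightingSketch {
    static final Set STOPWORDS = new HashSet(Arrays.asList(
            new String[] { "the", "a", "this" }));

    public static Map weigh(String[] titleWords, String[] keywordWords,
                            String[] bodyWords) {
        Map weights = new HashMap();          // word -> Double weight
        addAll(weights, titleWords, 5.0);     // Title words count most.
        addAll(weights, keywordWords, 3.0);   // Meta keywords count less.
        addAll(weights, bodyWords, 1.0);      // Plain body text counts least.
        return weights;
    }

    static void addAll(Map weights, String[] words, double factor) {
        for (int i = 0; i < words.length; i++) {
            String w = words[i].toLowerCase(); // A real processor would also stem here.
            if (STOPWORDS.contains(w)) continue;
            Double old = (Double) weights.get(w);
            weights.put(w, new Double((old == null ? 0.0 : old.doubleValue()) + factor));
        }
    }
}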
Part IV: Searching the Index Database
Your searchable index database should consist of a keyword list (for each keyword, a list of the URLs that contain it) and a URL list (for each URL, a list of the keywords found in that URL). If you are ambitious enough to index millions of URLs, you must take good care of your huge, gigabyte-level index database; otherwise the user will lose patience waiting for query results. You should consider a distributed storage strategy to overcome this long-running query bottleneck. For instance, try to classify the data into thousands of smaller tables according to the hash codes of the URLs or keywords. The search engine can then load in advance a list of the table names for keywords, look through only those smaller tables that contain the requested words, rank the matching URLs in order of relevance to those words, and return the list of search results as HTML.
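A minimal sketch of that hashing idea (the table-name prefix and the table count are illustrative):
// Route a keyword to one of, say, 1,000 smaller keyword tables, so that a
// query only has to scan the tables that can contain the requested words.
public class TableRouter {
    static final int TABLE_COUNT = 1000;

    public static String tableFor(String keyword) {
        // Mask the sign bit so the bucket index is always non-negative.
        int bucket = (keyword.toLowerCase().hashCode() & 0x7fffffff) % TABLE_COUNT;
        return "keywords" + bucket;   // e.g. keywords0 ... keywords999
    }
}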
All database operations in this search engine sample go through the JDBC 2.0 interface, so the sample is independent of the kind of index database used. The SQL statements for creating its index database are simple:
- create table urls(urlid int, url char(100), urlstatus int, urlmessage char(20), urltime timestamp, keywords object, urlsummary object)
- create table keywords(wordid int, keyword char(20), urls object)
The index database will be more efficient in lookup speed if you create indexes on its url, urlid, keyword, and wordid columns. The keyword list of every URL is held as a serializable object in the urls.keywords column, and the URL list of every keyword is held as a serializable object in the keywords.urls column. The sample utilizes scrollable, updatable result sets to gain better performance on frequent inserts and updates. Since waiting for server responses and network bandwidth is the main bottleneck in indexing, the sample also utilizes thread-pool techniques to manage and run several robots in parallel.
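Here is a sketch of the scrollable, updatable result set technique under JDBC 2.0. The column layout follows the urls/keywords tables above, but the serialization of the URL list into bytes, and the way the wordid is chosen, are simplifications for illustration:
import java.sql.*;

// Update a keyword's URL list in place, or insert a new row if the
// keyword has never been seen, using a scrollable, updatable result set.
public class IndexUpdaterSketch {
    public static void storeUrlList(Connection con, String keyword,
                                    byte[] serializedUrlList) throws SQLException {
        Statement stmt = con.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,
                                             ResultSet.CONCUR_UPDATABLE);
        ResultSet rs = stmt.executeQuery(
                "SELECT wordid, keyword, urls FROM keywords WHERE keyword = '"
                + keyword + "'");
        if (rs.next()) {
            rs.updateBytes("urls", serializedUrlList);   // Overwrite the existing list.
            rs.updateRow();
        } else {
            rs.moveToInsertRow();                        // No row yet: insert one.
            rs.updateInt("wordid", keyword.hashCode());
            rs.updateString("keyword", keyword);
            rs.updateBytes("urls", serializedUrlList);
            rs.insertRow();
        }
        rs.close();
        stmt.close();
    }
}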
There is no setup requirement for the end user, since the search service is provided by a servlet. The servlet parses the user input into a series of words, and supports Boolean queries with operators such as AND and OR. If you wish to use NOT and the wildcard character "*" to construct queries, you should consider using JLex and CUP to build a query parser. Cascading search can also be implemented if you cache the latest queries and hide a query session ID in the query form.
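A minimal sketch of such a search servlet follows; the request parameter name, the hypothetical Index.search() helper, and the bare-bones HTML output are all illustrative:
import java.io.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;

// Parse the query string into words, look them up in the index database,
// and write a ranked result list back to the user as HTML.
public class SearchServlet extends HttpServlet {
    public void doGet(HttpServletRequest req, HttpServletResponse res)
            throws ServletException, IOException {
        String query = req.getParameter("q");
        res.setContentType("text/html");
        PrintWriter out = res.getWriter();
        out.println("<HTML><BODY>");
        if (query != null) {
            // Split the query into words; AND/OR handling would go here.
            StringTokenizer st = new StringTokenizer(query);
            List words = new ArrayList();
            while (st.hasMoreTokens()) words.add(st.nextToken().toLowerCase());
            // Index lookup and ranking are delegated to a hypothetical helper
            // that returns URLs ordered by relevance.
            List results = Index.search(words);
            for (int i = 0; i < results.size(); i++) {
                String url = (String) results.get(i);
                out.println("<A HREF=\"" + url + "\">" + url + "</A><BR>");
            }
        }
        out.println("</BODY></HTML>");
    }
}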
References
1. W3C Recommendation, HTML 4.01 Specification
2. The Internet Society, RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1
3. Scott Weiss, IR Vocabulary: Glossary for Information Retrieval
4. Martijn Koster, Guidelines for Robot Writers
5. Elliot Berk, JLex: A Lexical Analyzer Generator for Java
6. Scott E. Hudson, CUP: A Parser Generator for Java
The source files can be found here.