zyh.robot.TextHTML
Class TextHTMLFilter

java.lang.Object
  |
  +--zyh.robot.TextHTML.TextHTMLFilter

public class TextHTMLFilter
extends java.lang.Object
implements ContentFilter

The TextHTMLFilter class filters the content of text or HTML documents. All Java source files in zyh.robot.TextHTML except TextHTMLFilter.java are generated by JLex and CUP from the corresponding specification files, Yylex and TextHTML.cup. JLex is a lexical analyzer generator for Java, and CUP is the corresponding LALR parser generator. For full-text indexing, your HTML parser does not need to comply fully with the HTML 4.01 specification; it only has to handle a few important embedded tags. Your HTML filter should recognize the following seven kinds of HTML elements or attributes:
1. URL of embedded image, frame or script: For example, <IMG src= "http://www.somecompany.com/People/Ian/vacation/family.png" alt="A photo of my family at the lake.">, <FRAME src="contents_of_frame1.html">, <IFRAME src="foo.html" width="400" height="500" scrolling="auto" frameborder="1">, and <SCRIPT type="text/vbscript" src= "http://someplace.com/progs/vbcalc"> </SCRIPT>.
2. Background image (deprecated): For example, <BODY background ="bk1.gif" bgcolor="white" text="black" link="red" alink="fuchsia" vlink="maroon">... document body...</BODY>.
3. URL for linked resource or base URL: For example, <A href= "http://www.w3.org/" charset="ISO-8859-1">W3C Web site</A>, <A href= "./one.html#anchor-one"> anchor one</A>, <a href= "mailto:someone@somecompany.com">some comments</A>, <BASE href= "http://www.aviary.com/products/intro.html">, and <LINK media="print" title="The manual in postscript" type="application/postscript" rel="alternate" href= "http://someplace.com/manual/postscript.ps">.
4. Document title: For example, <TITLE>A study of population dynamics</TITLE>.
5. Author, keywords and description in generic meta information: For example, <META name="author" content="John Doe">, <META name="keywords" lang="en-us" content="vacation, Greece, sunshine">, and <META name="description" content="Idyllic European vacations">.
6. Expiry time in generic meta information, which indicates when to fetch a fresh copy of the associated document: For example, <META http-equiv="Expires" content="Tue, 20 Aug 1996 14:25:27 GMT">.
7. Headings from H1 to H6: For example, <H1>Forest elephants</H1>.
You can also extract URLs from the APPLET, OBJECT, BLOCKQUOTE, INS, and DEL elements by simply adding the corresponding symbols and grammar rules to the Yylex and TextHTML.cup definition files. Words inside the remaining HTML tags, including JavaScript code, can be passed unchanged into the text part of the content object, because the content processor cleans all content in content objects in the next phase. It is sometimes impossible to extract every URL from a web page, because some URLs may be loaded dynamically by JavaScript.
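
To make the list above concrete, here is a minimal sketch of pulling URL-bearing attributes (src, href, background) and the document title out of an HTML string. It uses java.util.regex rather than the JLex/CUP-generated scanner and parser described above, so it is only an illustration of which values the filter is interested in, not the actual implementation. The class name UrlAttributeSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration; not part of zyh.robot.TextHTML. */
public class UrlAttributeSketch {

    // Matches src, href, or background with a double-quoted value.
    // Whitespace around '=' is allowed, as in the examples (src= "...").
    // Limitation: the pattern does not check that the attribute is inside a tag.
    private static final Pattern URL_ATTR = Pattern.compile(
            "\\b(?:src|href|background)\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Matches the text between <TITLE> and </TITLE>, in any letter case.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the attribute values in document order. */
    public static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_ATTR.matcher(html);
        while (m.find()) {
            urls.add(m.group(1).trim());
        }
        return urls;
    }

    /** Returns the trimmed title text, or null if there is no TITLE element. */
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<TITLE>A study of population dynamics</TITLE>"
                + "<BODY background =\"bk1.gif\">"
                + "<A href= \"http://www.w3.org/\">W3C Web site</A>"
                + "<IMG src= \"family.png\" alt=\"A photo\"></BODY>";
        System.out.println(extractTitle(html));
        System.out.println(extractUrls(html));
    }
}
```

A grammar-based parser, such as the one generated by CUP, can tell attributes inside tags apart from look-alike text in the document body, which a regular expression like this one cannot do reliably.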


Constructor Summary
TextHTMLFilter()
           
 
Method Summary
 java.lang.String[] getAcceptedContenttype()
          Returns a String array containing all supported content types.
 Content getContent(java.io.InputStream in)
          Attempts to read a Content object from the given input stream.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TextHTMLFilter

public TextHTMLFilter()
Method Detail

getAcceptedContenttype

public java.lang.String[] getAcceptedContenttype()
Returns a String array containing all supported content types.
Specified by:
getAcceptedContenttype in interface ContentFilter

getContent

public Content getContent(java.io.InputStream in)
                   throws java.io.IOException
Attempts to read a Content object from the given input stream.
Specified by:
getContent in interface ContentFilter
Parameters:
in - the input stream to read from
Returns:
the Content read from the input stream
Throws:
IOException - if an error occurs while reading the input stream
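
The two methods above suggest a simple calling pattern: check whether a filter accepts a content type, then pass it the stream. The sketch below illustrates that pattern under stated assumptions. Content and ContentFilter here are stand-in types that mirror only the signatures shown on this page, PlainTextFilter is a hypothetical filter, and the content type strings "text/plain" and "text/html" are guesses, since this page does not say which values getAcceptedContenttype() returns.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical usage sketch; the nested types are stand-ins, not the real zyh.robot classes. */
public class FilterUsageSketch {

    /** Stand-in for Content; the real class's members are not documented here. */
    public static class Content {
        public final String text;
        public Content(String text) { this.text = text; }
    }

    /** Stand-in mirroring the two ContentFilter methods listed on this page. */
    public interface ContentFilter {
        String[] getAcceptedContenttype();
        Content getContent(InputStream in) throws IOException;
    }

    /** Hypothetical filter that reads the whole stream as UTF-8 text. */
    public static class PlainTextFilter implements ContentFilter {
        @Override
        public String[] getAcceptedContenttype() {
            // Assumed values; the real filter may report different strings.
            return new String[] {"text/plain", "text/html"};
        }

        @Override
        public Content getContent(InputStream in) throws IOException {
            return new Content(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }

    /** True if the filter lists the given content type as supported. */
    public static boolean accepts(ContentFilter filter, String contentType) {
        return Arrays.asList(filter.getAcceptedContenttype()).contains(contentType);
    }

    /** Runs the filter on the body, or returns null for unsupported types. */
    public static Content filter(ContentFilter filter, String contentType, byte[] body)
            throws IOException {
        if (!accepts(filter, contentType)) {
            return null;
        }
        try (InputStream in = new ByteArrayInputStream(body)) {
            return filter.getContent(in);
        }
    }
}
```

With the real classes, the stand-in types would be replaced by Content, ContentFilter, and new TextHTMLFilter(); the IOException declared by getContent still has to be caught or propagated by the caller.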