zyh.robot.TextHTML
Class TextHTMLFilter
java.lang.Object
|
+--zyh.robot.TextHTML.TextHTMLFilter
- public class TextHTMLFilter
- extends java.lang.Object
- implements ContentFilter
The TextHTMLFilter class filters the content
of text or HTML documents.
All the Java source code files in zyh.robot.TextHTML,
except for TextHTMLFilter.java, are generated by JLex and
CUP from the corresponding specification files, Yylex
and TextHTML.cup.
JLex is a lexical analyzer generator for Java,
and CUP is an LALR parser generator for Java.
For full-text indexing purposes, your HTML parser need only focus
on a few important embedded tags; it need not be fully compliant with the
HTML 4.01 specification. Your HTML filter should recognize
the following seven kinds of HTML elements or attributes:
1. URL of embedded image, frame or script: For example, <IMG src= "http://www.somecompany.com/People/Ian/vacation/family.png" alt="A photo of my family at the lake.">, <FRAME src="contents_of_frame1.html">, <IFRAME src="foo.html" width="400" height="500" scrolling="auto" frameborder="1">, and <SCRIPT type="text/vbscript" src= "http://someplace.com/progs/vbcalc"> </SCRIPT>.
2. Background image (deprecated): For example, <BODY background ="bk1.gif" bgcolor="white" text="black" link="red" alink="fuchsia" vlink="maroon">... document body...</BODY>.
3. URL for linked resource or base URL: For example, <A href= "http://www.w3.org/" charset="ISO-8859-1">W3C Web site</A>, <A href= "./one.html#anchor-one"> anchor one</A>, <a href= "mailto:someone@somecompany.com">some comments</A>, <BASE href= "http://www.aviary.com/products/intro.html">, and <LINK media="print" title="The manual in postscript" type="application/postscript" rel="alternate" href= "http://someplace.com/manual/postscript.ps">.
4. Document title: For example, <TITLE>A study of population dynamics</TITLE>.
5. Author, keywords and description in generic meta information: For example, <META name="author" content="John Doe">, <META name="keywords" lang="en-us" content="vacation, Greece, sunshine">, and <META name="description" content="Idyllic European vacations">.
6. Expiry date in generic meta information, indicating when to fetch a fresh copy of the associated document: For example, <META http-equiv="Expires" content="Tue, 20 Aug 1996 14:25:27 GMT">.
7. Headings from H1 to H6: For example, <H1>Forest elephants</H1>.
You can also extract URLs from the HTML elements APPLET,
OBJECT, BLOCKQUOTE, INS, and DEL by simply adding the
corresponding symbols and grammar rules to the definition files
Yylex and TextHTML.cup. Words, including JavaScript code,
within the remaining HTML tags can be copied as-is into
the text part of content objects, since the content processor will
clean all content in content objects in the next phase.
Sometimes it is impossible to extract all URLs from a web page,
because some URLs may be loaded dynamically by JavaScript.
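As a rough illustration of the extraction described above, the following sketch pulls quoted src, href, and background values and the document title out of an HTML string. It is not the JLex/CUP grammar this package uses; it swaps in plain regular expressions for brevity, and the class name TagUrlSketch and its two methods are hypothetical. It ignores unquoted attribute values, and it can also match attribute-like text outside tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a regex stand-in, not the generated lexer/parser. */
final class TagUrlSketch {
    // Matches src, href, or background followed by a quoted value,
    // tolerating spaces around '=' as in: src= "a.png" or background ="bk1.gif".
    private static final Pattern URL_ATTR = Pattern.compile(
            "\\b(src|href|background)\\s*=\\s*([\"'])(.*?)\\2",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches the text between <TITLE> and </TITLE>.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the quoted URL values in document order. */
    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_ATTR.matcher(html);
        while (m.find()) {
            urls.add(m.group(3).trim());
        }
        return urls;
    }

    /** Returns the trimmed title text, or null if there is no TITLE element. */
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }
}
```

A grammar-based parser, as generated by JLex and CUP, can tell tags apart from surrounding text and scripts, which this regex shortcut cannot.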
Method Summary
java.lang.String[] getAcceptedContenttype()
    Returns a String array that contains all supported content types.
Content getContent(java.io.InputStream in)
    Attempts to read a Content object from the given input stream.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
TextHTMLFilter
public TextHTMLFilter()
getAcceptedContenttype
public java.lang.String[] getAcceptedContenttype()
- Returns a String array that contains all supported content types.
- Specified by:
- getAcceptedContenttype in interface ContentFilter
getContent
public Content getContent(java.io.InputStream in)
throws java.io.IOException
- Attempts to read a Content object from the given input stream.
- Specified by:
- getContent in interface ContentFilter
- Parameters:
- in - an input stream
- Returns:
- a Content from the input stream
- Throws:
- IOException - if an input stream access error occurs
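The two methods are naturally used together: check the accepted content types first, then hand the stream to getContent. The sketch below shows that pattern under stated assumptions. The Content and ContentFilter interfaces are minimal stand-ins declared only so the example compiles on its own (the real types live in zyh.robot and may have more members), and the helper class FilterUsageSketch, its method names, and the case-insensitive comparison are all hypothetical choices, not part of this package.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Stand-ins mirroring the signatures documented above (assumption:
// the real interfaces in zyh.robot may declare more than this).
interface Content {}

interface ContentFilter {
    String[] getAcceptedContenttype();
    Content getContent(InputStream in) throws IOException;
}

final class FilterUsageSketch {
    /** Returns true if the filter lists the given content type (case-insensitive). */
    static boolean accepts(ContentFilter filter, String contentType) {
        for (String accepted : filter.getAcceptedContenttype()) {
            if (accepted.equalsIgnoreCase(contentType)) {
                return true;
            }
        }
        return false;
    }

    /** Filters the page only if its content type is accepted; otherwise returns null. */
    static Content filterIfAccepted(ContentFilter filter, String contentType, byte[] page)
            throws IOException {
        if (!accepts(filter, contentType)) {
            return null;
        }
        // try-with-resources closes the stream even if getContent throws.
        try (InputStream in = new ByteArrayInputStream(page)) {
            return filter.getContent(in);
        }
    }
}
```

With the real classes, the filter argument would be a new TextHTMLFilter() and the stream would typically come from the fetched document rather than a byte array.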