Package org.htmlparser.nodes
Class AbstractNode
- java.lang.Object
-
- org.htmlparser.nodes.AbstractNode
-
- All Implemented Interfaces:
java.io.Serializable, java.lang.Cloneable, Node
- Direct Known Subclasses:
RemarkNode, TagNode, TextNode
public abstract class AbstractNode extends java.lang.Object implements Node, java.io.Serializable
The abstract base class for all types of nodes (tags, text, remarks). This class provides basic functionality to hold the Page, the starting and ending position in the page, the parent and the list of children.
- See Also:
- Serialized Form
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
protected NodeList children
- The children of this node.
protected Page mPage
- The page this node came from.
protected int nodeBegin
- The beginning position of the tag in the line
protected int nodeEnd
- The ending position of the tag in the line
protected Node parent
- The parent of this node.
-
Constructor Summary
Constructors (Constructor, Description):
AbstractNode(Page page, int start, int end)
- Create an abstract node with the page positions given.
-
Method Summary
Methods (Modifier and Type, Method, Description):
abstract void accept(NodeVisitor visitor)
- Visit this node.
java.lang.Object clone()
- Clone this object.
void collectInto(NodeList list, NodeFilter filter)
- Collect this node and its child nodes (if applicable) into the collectionList parameter, provided the node satisfies the filtering criteria.
void doSemanticAction()
- Perform the meaning of this tag.
NodeList getChildren()
- Get the children of this node.
int getEndPosition()
- Gets the ending position of the node.
Node getFirstChild()
- Get the first child of this node.
Node getLastChild()
- Get the last child of this node.
Node getNextSibling()
- Get the next sibling to this node.
Page getPage()
- Get the page this node came from.
Node getParent()
- Get the parent of this node.
Node getPreviousSibling()
- Get the previous sibling to this node.
int getStartPosition()
- Gets the starting position of the node.
java.lang.String getText()
- Returns the text of the node.
void setChildren(NodeList children)
- Set the children of this node.
void setEndPosition(int position)
- Sets the ending position of the node.
void setPage(Page page)
- Set the page this node came from.
void setParent(Node node)
- Sets the parent of this node.
void setStartPosition(int position)
- Sets the starting position of the node.
void setText(java.lang.String text)
- Sets the string contents of the node.
java.lang.String toHtml()
- Return the HTML for this node.
abstract java.lang.String toHtml(boolean verbatim)
- Return the HTML for this node.
abstract java.lang.String toPlainTextString()
- Returns a string representation of the node.
abstract java.lang.String toString()
- Return a string representation of the node.
-
-
-
Field Detail
-
mPage
protected Page mPage
The page this node came from.
-
nodeBegin
protected int nodeBegin
The beginning position of the tag in the line
-
nodeEnd
protected int nodeEnd
The ending position of the tag in the line
-
parent
protected Node parent
The parent of this node.
-
children
protected NodeList children
The children of this node.
-
-
Constructor Detail
-
AbstractNode
public AbstractNode(Page page, int start, int end)
Create an abstract node with the page positions given. Remember the page and start & end cursor positions.
- Parameters:
page
- The page this tag was read from.
start
- The starting offset of this node within the page.
end
- The ending offset of this node within the page.
-
-
Method Detail
-
clone
public java.lang.Object clone() throws java.lang.CloneNotSupportedException
Clone this object. Exposes java.lang.Object clone as a public method.
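As a rough illustration of what exposing java.lang.Object clone as a public method means, the sketch below redeclares clone() as public on a stand-in class. SketchNode, CloneSketch and copyOf are hypothetical names invented for this example and are not part of the library; the copy shown is the shallow, field-by-field copy that Object.clone() performs.

```java
// Hypothetical stand-in for a node type; this is not the library's AbstractNode.
class SketchNode implements Cloneable {
    int nodeBegin;
    int nodeEnd;

    SketchNode(int start, int end) {
        nodeBegin = start;
        nodeEnd = end;
    }

    // Object.clone() is protected; redeclaring it public lets any caller copy the node.
    @Override
    public Object clone() throws CloneNotSupportedException {
        return super.clone(); // shallow, field-by-field copy
    }
}

class CloneSketch {
    // Hypothetical helper that hides the checked exception from callers.
    static SketchNode copyOf(SketchNode node) {
        try {
            return (SketchNode) node.clone();
        } catch (CloneNotSupportedException e) {
            throw new IllegalStateException(e); // not expected: SketchNode is Cloneable
        }
    }

    public static void main(String[] args) {
        SketchNode original = new SketchNode(0, 10);
        SketchNode copy = copyOf(original);
        System.out.println(copy != original);                 // a distinct object
        System.out.println(copy.nodeEnd == original.nodeEnd); // with the same field values
    }
}
```

Because the copy is shallow, any object fields (such as a parent or a list of children) would be shared between the original and the copy in a class built this way.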
-
toPlainTextString
public abstract java.lang.String toPlainTextString()
Returns a string representation of the node. It allows a simple string transformation of a web page, regardless of node type.
Typical application code (for extracting only the text from a web page) would then be simplified to:
for (NodeIterator e = parser.elements(); e.hasMoreNodes(); ) {
    Node node = e.nextNode();
    // or do whatever processing you wish with the plain text string
    System.out.println(node.toPlainTextString());
}
- Specified by:
toPlainTextString in interface Node
- Returns:
- The 'browser' content of this node.
-
toHtml
public java.lang.String toHtml()
Return the HTML for this node. This should be the sequence of characters that were encountered by the parser that caused this node to be created. Where this breaks down is where broken nodes (tags and remarks) have been encountered and fixed. Applications reproducing html can use this method on nodes which are to be used or transferred as they were received or created.
-
toHtml
public abstract java.lang.String toHtml(boolean verbatim)
Return the HTML for this node. This should be the exact sequence of characters that were encountered by the parser that caused this node to be created. Where this breaks down is where broken nodes (tags and remarks) have been encountered and fixed. Applications reproducing html can use this method on nodes which are to be used or transferred as they were received or created.
-
toString
public abstract java.lang.String toString()
Return a string representation of the node. Subclasses must define this method, and this is typically to be used in the manner
System.out.println(node)
-
collectInto
public void collectInto(NodeList list, NodeFilter filter)
Collect this node and its child nodes (if applicable) into the collectionList parameter, provided the node satisfies the filtering criteria.
This mechanism allows powerful filtering code to be written very easily, without having to collect embedded tags separately. For example, when we try to get all the links on a page, it is not possible to get them at the top level, as many tags (like form tags) can contain links embedded in them. We could get the links out by checking if the current node is a CompositeTag, and going through its children. This method provides a convenient way to do this.
Using collectInto(), programs get a lot shorter. Now, the code to extract all links from a page would look like:
NodeList collectionList = new NodeList();
NodeFilter filter = new TagNameFilter("A");
for (NodeIterator e = parser.elements(); e.hasMoreNodes(); )
    e.nextNode().collectInto(collectionList, filter);
Thus, collectionList will hold all the link nodes, irrespective of how deeply the links are embedded.
Another way to accomplish the same objective is:
NodeList collectionList = new NodeList();
NodeFilter filter = new TagClassFilter(LinkTag.class);
for (NodeIterator e = parser.elements(); e.hasMoreNodes(); )
    e.nextNode().collectInto(collectionList, filter);
This is slightly less specific because the LinkTag class may be registered for more than one node name, e.g. <LINK> tags too.
- Specified by:
collectInto in interface Node
- Parameters:
list
- The node list to collect acceptable nodes into.
filter
- The filter to determine which nodes are retained.
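The idea behind collecting embedded nodes can be sketched without the library. The example below assumes a simple depth-first walk: a node adds itself to the list if the filter accepts it, then asks each child to do the same. TreeNode, SketchFilter and CollectSketch are hypothetical stand-ins invented for this illustration; they are not the library's Node, NodeFilter or NodeList classes, and the library's own traversal may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a node filter.
interface SketchFilter {
    boolean accept(TreeNode node);
}

// Hypothetical stand-in for a node that may contain children.
class TreeNode {
    final String name;
    final List<TreeNode> children = new ArrayList<>();

    TreeNode(String name, TreeNode... kids) {
        this.name = name;
        for (TreeNode kid : kids) {
            children.add(kid);
        }
    }

    // Add this node if it passes the filter, then recurse into the children.
    void collectInto(List<TreeNode> list, SketchFilter filter) {
        if (filter.accept(this)) {
            list.add(this);
        }
        for (TreeNode child : children) {
            child.collectInto(list, filter);
        }
    }
}

class CollectSketch {
    // A tiny "page": a form with a nested link, a top-level link, and a paragraph.
    static List<TreeNode> samplePage() {
        List<TreeNode> page = new ArrayList<>();
        page.add(new TreeNode("FORM", new TreeNode("DIV", new TreeNode("A"))));
        page.add(new TreeNode("A"));
        page.add(new TreeNode("P"));
        return page;
    }

    // Collect every node named "A", however deeply it is nested.
    static List<TreeNode> collectLinks(List<TreeNode> topLevel) {
        List<TreeNode> found = new ArrayList<>();
        SketchFilter byName = node -> node.name.equals("A");
        for (TreeNode node : topLevel) {
            node.collectInto(found, byName);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(collectLinks(samplePage()).size()); // both links are found
    }
}
```

The caller only iterates over the top-level nodes; the recursion inside collectInto is what reaches the link nested inside the form.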
-
getPage
public Page getPage()
Get the page this node came from.
- Specified by:
getPage in interface Node
- Returns:
- The page that supplied this node.
- See Also:
Node.setPage(org.htmlparser.lexer.Page)
-
setPage
public void setPage(Page page)
Set the page this node came from.
- Specified by:
setPage in interface Node
- Parameters:
page
- The page that supplied this node.
- See Also:
Node.getPage()
-
getStartPosition
public int getStartPosition()
Gets the starting position of the node.
- Specified by:
getStartPosition in interface Node
- Returns:
- The start position.
- See Also:
Node.setStartPosition(int)
-
setStartPosition
public void setStartPosition(int position)
Sets the starting position of the node.
- Specified by:
setStartPosition in interface Node
- Parameters:
position
- The new start position.
- See Also:
Node.getStartPosition()
-
getEndPosition
public int getEndPosition()
Gets the ending position of the node.
- Specified by:
getEndPosition in interface Node
- Returns:
- The end position.
- See Also:
Node.setEndPosition(int)
-
setEndPosition
public void setEndPosition(int position)
Sets the ending position of the node.
- Specified by:
setEndPosition in interface Node
- Parameters:
position
- The new end position.
- See Also:
Node.getEndPosition()
-
accept
public abstract void accept(NodeVisitor visitor)
Visit this node.
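Since accept is abstract, each concrete node type decides which visitor callback to invoke. The sketch below shows that double-dispatch pattern in isolation. All names in it (SketchVisitor, VisitableNode, TextLike, RemarkLike, VisitorSketch, visitText, visitRemark) are hypothetical and chosen for this example; they are not the library's NodeVisitor API.

```java
// Hypothetical visitor with one callback per node kind.
interface SketchVisitor {
    void visitText(TextLike text);
    void visitRemark(RemarkLike remark);
}

// Hypothetical abstract base: subclasses must say how they are visited.
abstract class VisitableNode {
    abstract void accept(SketchVisitor visitor);
}

class TextLike extends VisitableNode {
    final String text;

    TextLike(String text) {
        this.text = text;
    }

    @Override
    void accept(SketchVisitor visitor) {
        visitor.visitText(this); // dispatch to the text callback
    }
}

class RemarkLike extends VisitableNode {
    final String text;

    RemarkLike(String text) {
        this.text = text;
    }

    @Override
    void accept(SketchVisitor visitor) {
        visitor.visitRemark(this); // dispatch to the remark callback
    }
}

class VisitorSketch {
    // Describe a node without the caller needing instanceof checks.
    static String describe(VisitableNode node) {
        StringBuilder out = new StringBuilder();
        node.accept(new SketchVisitor() {
            @Override
            public void visitText(TextLike text) {
                out.append("text:").append(text.text);
            }

            @Override
            public void visitRemark(RemarkLike remark) {
                out.append("remark:").append(remark.text);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(new TextLike("hello")));
        System.out.println(describe(new RemarkLike("a comment")));
    }
}
```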
-
getParent
public Node getParent()
Get the parent of this node. This will always return null when parsing without scanners, i.e. if semantic parsing was not performed. The object returned from this method can be safely cast to a CompositeTag.
- Specified by:
getParent in interface Node
- Returns:
- The parent of this node if it has been set, null otherwise.
- See Also:
Node.setParent(org.htmlparser.Node)
-
setParent
public void setParent(Node node)
Sets the parent of this node.
- Specified by:
setParent in interface Node
- Parameters:
node
- The node that contains this node. Must be a CompositeTag.
- See Also:
Node.getParent()
-
getChildren
public NodeList getChildren()
Get the children of this node.
- Specified by:
getChildren in interface Node
- Returns:
- The list of children contained by this node if it has been set, null otherwise.
- See Also:
Node.setChildren(org.htmlparser.util.NodeList)
-
setChildren
public void setChildren(NodeList children)
Set the children of this node.
- Specified by:
setChildren in interface Node
- Parameters:
children
- The new list of children this node contains.
- See Also:
Node.getChildren()
-
getFirstChild
public Node getFirstChild()
Get the first child of this node.
- Specified by:
getFirstChild in interface Node
- Returns:
- The first child in the list of children contained by this node if one exists, null otherwise.
-
getLastChild
public Node getLastChild()
Get the last child of this node.
- Specified by:
getLastChild in interface Node
- Returns:
- The last child in the list of children contained by this node if one exists, null otherwise.
-
getPreviousSibling
public Node getPreviousSibling()
Get the previous sibling to this node.
- Specified by:
getPreviousSibling in interface Node
- Returns:
- The previous sibling to this node if one exists, null otherwise.
-
getNextSibling
public Node getNextSibling()
Get the next sibling to this node.
- Specified by:
getNextSibling in interface Node
- Returns:
- The next sibling to this node if one exists, null otherwise.
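One plausible way to derive siblings from the parent and children fields is to locate the node in its parent's child list and step one position left or right. The sketch below assumes that approach; SibNode and SiblingSketch are hypothetical stand-ins written for this illustration, not the library's classes, and the library's own lookup may be implemented differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a node with a parent and an ordered child list.
class SibNode {
    final String label;
    SibNode parent;
    final List<SibNode> children = new ArrayList<>();

    SibNode(String label) {
        this.label = label;
    }

    void addChild(SibNode child) {
        child.parent = this;
        children.add(child);
    }

    // Position of this node in its parent's child list, or -1 if not found.
    private int indexInParent() {
        for (int i = 0; i < parent.children.size(); i++) {
            if (parent.children.get(i) == this) { // identity, not equals()
                return i;
            }
        }
        return -1;
    }

    SibNode getNextSibling() {
        if (parent == null) {
            return null; // no parent means no siblings
        }
        int i = indexInParent();
        boolean hasNext = i >= 0 && i + 1 < parent.children.size();
        return hasNext ? parent.children.get(i + 1) : null;
    }

    SibNode getPreviousSibling() {
        if (parent == null) {
            return null;
        }
        int i = indexInParent();
        return i > 0 ? parent.children.get(i - 1) : null;
    }
}

class SiblingSketch {
    public static void main(String[] args) {
        SibNode root = new SibNode("root");
        SibNode first = new SibNode("first");
        SibNode second = new SibNode("second");
        root.addChild(first);
        root.addChild(second);
        System.out.println(first.getNextSibling().label);      // the node after "first"
        System.out.println(second.getNextSibling() == null);   // last child has no next sibling
    }
}
```

Note how a node without a parent returns null from both methods, which mirrors the documented behaviour of returning null when no sibling exists.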
-
getText
public java.lang.String getText()
Returns the text of the node.
- Specified by:
getText in interface Node
- Returns:
- The text of this node. The default is null.
- See Also:
Node.setText(java.lang.String)
-
setText
public void setText(java.lang.String text)
Sets the string contents of the node.
- Specified by:
setText in interface Node
- Parameters:
text
- The new text for the node.
- See Also:
Node.getText()
-
doSemanticAction
public void doSemanticAction() throws ParserException
Perform the meaning of this tag. The default action is to do nothing.
- Specified by:
doSemanticAction in interface Node
- Throws:
ParserException
- Not used. Provides for subclasses that may want to indicate an exceptional condition.
-
-