public class Parser extends Object implements DTDConstants
Unfortunately there are many badly implemented HTML parsers out there, and as a result there are many badly formatted HTML files. This parser attempts to parse most HTML files. This means that the implementation sometimes deviates from the SGML specification in favor of HTML.
The parser treats \r and \r\n as \n. Newlines after start tags and before end tags are ignored, just as specified in the SGML/HTML specification.
The HTML spec does not specify very well how spaces are to be coalesced. Specifically, the following scenarios are not discussed (the spaces in these examples stand for ordinary spaces; a non-breaking space is used only to force them to be displayed):
'<b>blah <i> <strike> foo' which can be treated as: '<b>blah <i><strike>foo'
as well as: '<p><a href="xx"> <em>Using</em></a></p>' which appears to be treated as: '<p><a href="xx"><em>Using</em></a></p>'
If strict is false, when a tag that breaks flow (TagElement.breaksFlow) or trailing whitespace is encountered, all whitespace will be ignored until a non-whitespace character is encountered. This appears to give behavior closer to the popular browsers.
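
The whitespace behavior described above can be observed by subclassing the parser, setting the protected strict field, and recording each text chunk it reports. The following is a minimal sketch, not a definitive implementation: the class name WhitespaceDemo and its helper textOf are hypothetical, and it assumes that constructing a ParserDelegator loads the default "html32" DTD so that DTD.getDTD("html32") returns a populated DTD. No particular output is claimed, since the exact chunking depends on the parser version.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class, not part of the JDK.
final class WhitespaceDemo extends Parser {
    final List<String> chunks = new ArrayList<>();

    WhitespaceDemo(DTD dtd, boolean strictMode) {
        super(dtd);
        this.strict = strictMode; // protected field inherited from Parser
    }

    @Override
    protected void handleText(char[] text) {
        chunks.add(new String(text)); // record each text chunk as reported
    }

    // Parses the given markup and returns the text chunks seen by handleText.
    static List<String> textOf(String html, boolean strictMode) {
        try {
            new ParserDelegator(); // assumption: loads the default "html32" DTD as a side effect
            WhitespaceDemo parser = new WhitespaceDemo(DTD.getDTD("html32"), strictMode);
            parser.parse(new StringReader(html));
            return parser.chunks;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<b>blah <i> <strike> foo";
        System.out.println("strict=false: " + textOf(html, false));
        System.out.println("strict=true:  " + textOf(html, true));
    }
}
```

Comparing the two printed lists shows where whitespace was kept or dropped in each mode.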
See Also: DTD, TagElement, SimpleAttributeSet
Modifier and Type | Field and Description |
---|---|
protected DTD | dtd: The dtd. |
protected boolean | strict: This flag determines whether or not the Parser will be strict in enforcing SGML compatibility. |
Fields inherited from interface DTDConstants: ANY, CDATA, CONREF, CURRENT, DEFAULT, EMPTY, ENDTAG, ENTITIES, ENTITY, FIXED, GENERAL, ID, IDREF, IDREFS, IMPLIED, MD, MODEL, MS, NAME, NAMES, NMTOKEN, NMTOKENS, NOTATION, NUMBER, NUMBERS, NUTOKEN, NUTOKENS, PARAMETER, PI, PUBLIC, RCDATA, REQUIRED, SDATA, STARTTAG, SYSTEM
Constructor and Description |
---|
Parser(DTD dtd): Creates a parser with the specified dtd. |
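
As an illustration of the constructor, the sketch below builds a parser from a DTD and lists the start tags it reports. The class TagLister and its helper startTagsOf are hypothetical names. It assumes that constructing a ParserDelegator loads the default "html32" DTD, so that DTD.getDTD("html32") returns a populated DTD rather than an empty one; other ways of obtaining a DTD would work equally well.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.parser.TagElement;

// Hypothetical demo class, not part of the JDK.
final class TagLister extends Parser {
    final List<String> startTags = new ArrayList<>();

    TagLister(DTD dtd) {
        super(dtd); // Parser(DTD dtd)
    }

    @Override
    protected void handleStartTag(TagElement tag) {
        startTags.add(tag.getElement().getName());
    }

    // Parses the markup and returns the names of all start tags reported.
    static List<String> startTagsOf(String html) {
        try {
            new ParserDelegator(); // assumption: loads the default "html32" DTD as a side effect
            TagLister parser = new TagLister(DTD.getDTD("html32"));
            parser.parse(new StringReader(html));
            return parser.startTags;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(startTagsOf("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```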
Modifier and Type | Method and Description |
---|---|
protected void | endTag(boolean omitted): Handle an end tag. |
protected void | error(String err): Invokes the error handler with the 1st, 2nd and 3rd error message arguments set to "?". |
protected void | error(String err, String arg1): Invokes the error handler with the 2nd and 3rd error message arguments set to "?". |
protected void | error(String err, String arg1, String arg2): Invokes the error handler with the 3rd error message argument set to "?". |
protected void | error(String err, String arg1, String arg2, String arg3): Invokes the error handler. |
protected void | flushAttributes(): Removes the current attributes. |
protected SimpleAttributeSet | getAttributes(): Returns attributes for the current tag. |
protected int | getCurrentLine() |
protected int | getCurrentPos(): Returns the current position. |
protected void | handleComment(char[] text): Called when an HTML comment is encountered. |
protected void | handleEmptyTag(TagElement tag): Called when an empty tag is encountered. |
protected void | handleEndTag(TagElement tag): Called when an end tag is encountered. |
protected void | handleEOFInComment(): Called when the content terminates without closing the HTML comment. |
protected void | handleError(int ln, String msg): An error has occurred. |
protected void | handleStartTag(TagElement tag): Called when a start tag is encountered. |
protected void | handleText(char[] text): Called when PCDATA is encountered. |
protected void | handleTitle(char[] text): Called when an HTML title tag is encountered. |
protected TagElement | makeTag(Element elem): Makes a TagElement. |
protected TagElement | makeTag(Element elem, boolean fictional): Makes a TagElement. |
protected void | markFirstTime(Element elem): Marks the first time a tag has been seen in a document. |
void | parse(Reader in): Parse an HTML stream, given a DTD. |
String | parseDTDMarkup(): Parses the Document Type Declaration markup declaration. |
protected boolean | parseMarkupDeclarations(StringBuffer strBuff): Parse markup declarations. |
protected void | startTag(TagElement tag): Handle a start tag. |
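
The error methods in the table all funnel into handleError, so a subclass can collect diagnostics by overriding that one callback. The sketch below is illustrative only: ErrorCollector and errorsOf are hypothetical names, the DTD lookup assumes that constructing a ParserDelegator loads the default "html32" DTD, and the exact wording of each message is whatever the parser passes, which is not specified here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class, not part of the JDK.
final class ErrorCollector extends Parser {
    final List<String> errors = new ArrayList<>();

    ErrorCollector(DTD dtd) {
        super(dtd);
    }

    @Override
    protected void handleError(int ln, String msg) {
        // ln is the line number containing the error, msg is the error message
        errors.add("line " + ln + ": " + msg);
    }

    // Parses the markup and returns every error reported through handleError.
    static List<String> errorsOf(String html) {
        try {
            new ParserDelegator(); // assumption: loads the default "html32" DTD as a side effect
            ErrorCollector parser = new ErrorCollector(DTD.getDTD("html32"));
            parser.parse(new StringReader(html));
            return parser.errors;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // <foo> is not an HTML element, so the parser is expected to complain about it.
        for (String error : errorsOf("<html><body><foo>text</foo></body></html>")) {
            System.out.println(error);
        }
    }
}
```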
protected DTD dtd

protected boolean strict

public Parser(DTD dtd)
Creates a parser with the specified dtd.
Parameters:
dtd - the dtd.

protected int getCurrentLine()

protected TagElement makeTag(Element elem, boolean fictional)
Parameters:
elem - the element storing the tag definition
fictional - the value of the flag "fictional" to be set for the tag
Returns:
the created TagElement

protected TagElement makeTag(Element elem)
Parameters:
elem - the element storing the tag definition
Returns:
the created TagElement
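
The "fictional" flag set through makeTag can be read back with TagElement.fictional(), which lets a subclass tell tags written in the source apart from tags the parser supplied itself. The following sketch assumes hypothetical names (ImpliedTagDemo, parse) and assumes that constructing a ParserDelegator loads the default "html32" DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.parser.TagElement;

// Hypothetical demo class, not part of the JDK.
final class ImpliedTagDemo extends Parser {
    final List<String> implied = new ArrayList<>();  // tags supplied by the parser
    final List<String> explicit = new ArrayList<>(); // tags present in the source

    ImpliedTagDemo(DTD dtd) {
        super(dtd);
    }

    @Override
    protected void handleStartTag(TagElement tag) {
        String name = tag.getElement().getName();
        if (tag.fictional()) {
            implied.add(name);
        } else {
            explicit.add(name);
        }
    }

    // Parses the markup and returns the populated parser for inspection.
    static ImpliedTagDemo parse(String html) {
        try {
            new ParserDelegator(); // assumption: loads the default "html32" DTD as a side effect
            ImpliedTagDemo parser = new ImpliedTagDemo(DTD.getDTD("html32"));
            parser.parse(new StringReader(html));
            return parser;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ImpliedTagDemo result = parse("<p>Hello");
        System.out.println("implied:  " + result.implied);
        System.out.println("explicit: " + result.explicit);
    }
}
```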

protected SimpleAttributeSet getAttributes()
Returns:
SimpleAttributeSet containing the attributes

protected void flushAttributes()
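
One way to use getAttributes and flushAttributes together is to read the attributes inside a tag callback and then clear them, so that they do not carry over to the next tag. The sketch below collects href values from anchors. LinkCollector and hrefsOf are hypothetical names; the DTD lookup assumes that constructing a ParserDelegator loads the default "html32" DTD, and the choice not to flush for fictional tags is a design assumption made here so that attributes already parsed for a real tag are not lost when the parser inserts implied tags first.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.parser.TagElement;

// Hypothetical demo class, not part of the JDK.
final class LinkCollector extends Parser {
    final List<String> hrefs = new ArrayList<>();

    LinkCollector(DTD dtd) {
        super(dtd);
    }

    @Override
    protected void handleStartTag(TagElement tag) {
        if (tag.fictional()) {
            return; // implied tag: leave pending attributes for the real tag
        }
        SimpleAttributeSet attrs = getAttributes();
        Object href = attrs.getAttribute(HTML.Attribute.HREF);
        if (tag.getHTMLTag() == HTML.Tag.A && href != null) {
            hrefs.add(href.toString());
        }
        flushAttributes(); // clear so the next tag starts with an empty set
    }

    @Override
    protected void handleEmptyTag(TagElement tag) {
        flushAttributes(); // attributes of empty tags such as <img> are not needed here
    }

    // Parses the markup and returns the href values of all anchors.
    static List<String> hrefsOf(String html) {
        try {
            new ParserDelegator(); // assumption: loads the default "html32" DTD as a side effect
            LinkCollector parser = new LinkCollector(DTD.getDTD("html32"));
            parser.parse(new StringReader(html));
            return parser.hrefs;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hrefsOf("<p><a href=\"xx\"><em>Using</em></a></p>"));
    }
}
```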

protected void handleText(char[] text)
Parameters:
text - the section text

protected void handleTitle(char[] text)
Parameters:
text - the title text

protected void handleComment(char[] text)
Parameters:
text - the comment being handled

protected void handleEOFInComment()
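
The title and comment callbacks can be overridden to pull metadata out of a document. The sketch below is a minimal illustration under stated assumptions: TitleAndComments and parse are hypothetical names, and the DTD lookup assumes that constructing a ParserDelegator loads the default "html32" DTD. The comment text is stored as delivered, so it may include surrounding whitespace.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class, not part of the JDK.
final class TitleAndComments extends Parser {
    String title;                                   // last title text seen, or null
    final List<String> comments = new ArrayList<>();
    boolean unterminatedComment;                    // set if input ends inside a comment

    TitleAndComments(DTD dtd) {
        super(dtd);
    }

    @Override
    protected void handleTitle(char[] text) {
        title = new String(text);
    }

    @Override
    protected void handleComment(char[] text) {
        comments.add(new String(text));
    }

    @Override
    protected void handleEOFInComment() {
        unterminatedComment = true;
    }

    // Parses the markup and returns the populated parser for inspection.
    static TitleAndComments parse(String html) {
        try {
            new ParserDelegator(); // assumption: loads the default "html32" DTD as a side effect
            TitleAndComments parser = new TitleAndComments(DTD.getDTD("html32"));
            parser.parse(new StringReader(html));
            return parser;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TitleAndComments result = parse(
            "<html><head><title>Demo</title></head>"
            + "<body><!-- note --><p>Hi</p></body></html>");
        System.out.println("title:    " + result.title);
        System.out.println("comments: " + result.comments);
    }
}
```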

protected void handleEmptyTag(TagElement tag) throws ChangedCharSetException
Parameters:
tag - the tag being handled
Throws:
ChangedCharSetException - if the document charset was changed

protected void handleStartTag(TagElement tag)
Parameters:
tag - the tag being handled

protected void handleEndTag(TagElement tag)
Parameters:
tag - the tag being handled

protected void handleError(int ln, String msg)
Parameters:
ln - the number of the line containing the error
msg - the error message

protected void error(String err, String arg1, String arg2, String arg3)
Parameters:
err - the error type
arg1 - the 1st error message argument
arg2 - the 2nd error message argument
arg3 - the 3rd error message argument

protected void error(String err, String arg1, String arg2)
Parameters:
err - the error type
arg1 - the 1st error message argument
arg2 - the 2nd error message argument

protected void error(String err, String arg1)
Parameters:
err - the error type
arg1 - the 1st error message argument

protected void error(String err)
Parameters:
err - the error type

protected void startTag(TagElement tag) throws ChangedCharSetException
Parameters:
tag - the tag
Throws:
ChangedCharSetException - if the document charset was changed

protected void endTag(boolean omitted)
Parameters:
omitted - true if the tag is not actually present in the document, but is assumed by the parser

protected void markFirstTime(Element elem)
Parameters:
elem - the element represented by the tag

public String parseDTDMarkup() throws IOException
Throws:
IOException - if an I/O error occurs

protected boolean parseMarkupDeclarations(StringBuffer strBuff) throws IOException
Parameters:
strBuff - the markup declaration
Returns:
true if this is a valid markup declaration; otherwise false
Throws:
IOException - if an I/O error occurs

public void parse(Reader in) throws IOException
Parameters:
in - the reader to read the source from
Throws:
IOException - if an I/O error occurs

protected int getCurrentPos()
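
To round off the callbacks listed above, the sketch below overrides handleEmptyTag to record elements such as br and hr that have no end tag, and prints the value of getCurrentPos() at the time of each callback. EmptyTagDemo and emptyTagsOf are hypothetical names, the DTD lookup assumes that constructing a ParserDelegator loads the default "html32" DTD, and no specific position values are claimed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.parser.TagElement;

// Hypothetical demo class, not part of the JDK.
final class EmptyTagDemo extends Parser {
    final List<String> emptyTags = new ArrayList<>();

    EmptyTagDemo(DTD dtd) {
        super(dtd);
    }

    @Override
    protected void handleEmptyTag(TagElement tag) throws ChangedCharSetException {
        emptyTags.add(tag.getElement().getName());
        // getCurrentPos() reports the parser's current position in the input
        System.out.println(tag.getElement().getName() + " at position " + getCurrentPos());
    }

    // Parses the markup and returns the names of all empty tags reported.
    static List<String> emptyTagsOf(String html) {
        try {
            new ParserDelegator(); // assumption: loads the default "html32" DTD as a side effect
            EmptyTagDemo parser = new EmptyTagDemo(DTD.getDTD("html32"));
            parser.parse(new StringReader(html));
            return parser.emptyTags;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(emptyTagsOf("<p>one<br>two</p><hr>"));
    }
}
```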
Copyright © 1993, 2016, Oracle and/or its affiliates. All rights reserved.
DRAFT 9-internal+0-2016-01-26-133437.ivan.openjdk9onspinwait