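Can anyone recommend a parser that can handle HTML? After parsing a static HTML page, it should provide methods such as getTitle and getSummary. htmlparser 2.0 won't do because it offers too few methods, and the HTML parser in Lucene 2.0 has a bug: it cannot handle certain special characters in some HTML pages.
It keeps throwing this error: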
DEBUG org.apache.lucene.demo.html.HTMLParserTo
org.apache.lucene.demo.html.ParseException: Encountered ">" at line 80, column 19.
Was expecting one of:
    <Quote2Text> ...
    <CloseQuote2> ...
    at org.apache.lucene.demo.html.HTMLParser.generateParseException(HTMLParser.java:691)
    at org.apache.lucene.demo.html.HTMLParser.jj_consume_token(HTMLParser.java:569)
    at org.apache.lucene.demo.html.HTMLParser.ArgValue(HTMLParser.java:329)
    at org.apache.lucene.demo.html.HTMLParser.Tag(HTMLParser.java:261)
    at org.apache.lucene.demo.html.HTMLParser.HTMLDocument(HTMLParser.java:189)
    at org.apache.lucene.demo.html.ParserThread.run(ParserThread.java:38)
Many thanks! It really must have a getSummary() method!
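To make the request concrete, here is a rough sketch of the kind of interface I am after. It is only an illustration: the class name SimpleHtmlPage and the maxChars parameter are made up for this post and are not part of htmlparser or Lucene. It uses nothing but java.util.regex, decodes only a handful of entities, and will get confused by a ">" inside a properly quoted attribute value, so a real parser would still be preferable. It does at least return a result instead of throwing when a tag has an unclosed quote.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not an API of htmlparser or Lucene. */
final class SimpleHtmlPage {
    // Contents of the first <title> element, case-insensitive, may span lines.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Whole <script> and <style> elements, whose contents are not page text.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)[^>]*>.*?</\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Any remaining tag; deliberately naive about quoting inside attributes.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    private final String html;

    public SimpleHtmlPage(String html) {
        this.html = html;
    }

    /** Returns the text of the first title element, or "" if there is none. */
    public String getTitle() {
        Matcher m = TITLE.matcher(html);
        return m.find() ? clean(m.group(1)) : "";
    }

    /** Returns at most maxChars characters of the visible page text. */
    public String getSummary(int maxChars) {
        String body = TITLE.matcher(html).replaceAll(" ");
        body = SCRIPT_OR_STYLE.matcher(body).replaceAll(" ");
        String text = clean(TAG.matcher(body).replaceAll(" "));
        return text.length() <= maxChars ? text : text.substring(0, maxChars);
    }

    // Decodes a few common entities (&amp; last, so "&amp;lt;" stays "&lt;")
    // and collapses runs of whitespace.
    private static String clean(String s) {
        return s.replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&")
                .replaceAll("\\s+", " ")
                .trim();
    }
}
```

Usage would be something like new SimpleHtmlPage(html).getTitle() and new SimpleHtmlPage(html).getSummary(200). What I am really looking for is an existing library that offers these two calls and copes with messy real-world pages.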