htmlcleaner download: htmlcleaner2_1.jar; source download: htmlcleaner2_1-all.zip
Write an HTML file to test with: html-clean-demo.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN" dir="ltr">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=GBK" />
    <meta http-equiv="Content-Language" content="zh-CN" />
    <title>html clean demo</title>
</head>
<body>
<div class="d_1">
    <ul>
        <li>bar</li>
        <li>foo</li>
        <li>gzz</li>
    </ul>
</div>
<div>
    <ul>
        <li><a name="my_href" href="1.html">text-1</a></li>
        <li><a name="my_href" href="2.html">text-2</a></li>
        <li><a name="my_href" href="3.html">text-3</a></li>
        <li><a name="my_href" href="4.html">text-4</a></li>
    </ul>
</div>
</body>
</html>
Mock requirement: extract the title, the links with name="my_href", and the content of every li under the div with class="d_1". The code below does this with htmlcleaner: HtmlCleanerDemo.java
package com.chenlb;

import java.io.File;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * htmlcleaner usage example.
 *
 * @author chenlb 2008-11-26 02:12:02 PM
 */
public class HtmlCleanerDemo {

    public static void main(String[] args) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();

        // Parse the file as GBK, matching the charset declared in its meta tag.
        TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");

        // Look up by tag name (recursive search): the title.
        Object[] ns = node.getElementsByName("title", true);
        if (ns.length > 0) {
            System.out.println("title=" + ((TagNode) ns[0]).getText());
        }

        System.out.println("ul/li:");
        // Look up by XPath: every li under the div with class="d_1".
        ns = node.evaluateXPath("//div[@class='d_1']//li");
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\ttext=" + n.getText());
        }

        System.out.println("a:");
        // Look up by attribute value: name="my_href", recursive, case-sensitive.
        ns = node.getElementsByAttValue("name", "my_href", true, true);
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\thref=" + n.getAttributeByName("href") + ", text=" + n.getText());
        }
    }
}
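The argument to cleaner.clean() can be a file, a URL, or a string of HTML content. In my opinion, the methods you will use most often are evaluateXPath, getElementsByAttValue, and getElementsByName. One more note: htmlcleaner copes well with HTML that is not well-formed.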
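To sanity-check what the three lookups in HtmlCleanerDemo should select, here is a minimal sketch that runs equivalent XPath expressions with the JDK's built-in DOM and XPath APIs. It does not use htmlcleaner: the class name `XPathCrossCheck`, the `texts` helper, and the inlined `SAMPLE` string are assumptions made for illustration. The sample is a trimmed copy of html-clean-demo.html with the DOCTYPE and xmlns removed, because the JDK parser only accepts well-formed XML, unlike htmlcleaner.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathCrossCheck {

    // Hypothetical inlined sample: the demo page without DOCTYPE and xmlns,
    // so no DTD is fetched and no namespace handling is needed.
    static final String SAMPLE =
        "<html><head><title>html clean demo</title></head><body>"
        + "<div class=\"d_1\"><ul><li>bar</li><li>foo</li><li>gzz</li></ul></div>"
        + "<div><ul>"
        + "<li><a name=\"my_href\" href=\"1.html\">text-1</a></li>"
        + "<li><a name=\"my_href\" href=\"2.html\">text-2</a></li>"
        + "<li><a name=\"my_href\" href=\"3.html\">text-3</a></li>"
        + "<li><a name=\"my_href\" href=\"4.html\">text-4</a></li>"
        + "</ul></div></body></html>";

    // Evaluates an XPath expression against well-formed XML and returns the
    // text content of every matched node (for an attribute node, its value).
    static List<String> texts(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            // Parsing or XPath failure: surface it as an unchecked exception.
            throw new IllegalStateException("XPath evaluation failed: " + expr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(texts(SAMPLE, "//title"));                    // [html clean demo]
        System.out.println(texts(SAMPLE, "//div[@class='d_1']//li"));    // [bar, foo, gzz]
        System.out.println(texts(SAMPLE, "//a[@name='my_href']/@href")); // [1.html, 2.html, 3.html, 4.html]
    }
}
```

The third expression stands in for getElementsByAttValue("name", "my_href", true, true): it filters on the same attribute but goes through XPath rather than a dedicated method. If the htmlcleaner program prints different items for the same page, the difference is worth investigating, since htmlcleaner's XPath support may not cover everything the JDK implementation does.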