URL-to-PDF or HTML-to-PDF tools: converting a URL to PDF with iText

References:

https://kb.itextpdf.com/itext/can-i-generate-a-pdf-from-a-url-instead-of-from-a-
http://www.micmiu.com/opensource/expdoc/itext-pdf-demo/
https://blog.51cto.com/u_16237557/7263784

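iText: iText is a well-known Java class library for generating PDF files quickly. It supports text, tables and graphics, and integrates easily with Servlets.
It is a Java component for generating PDF reports.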

Installing iText is straightforward: download the iText.jar file from http://www.lowagie.com/iText/download.html, then add the path of iText.jar to the system CLASSPATH. The iText class library can then be used in your program.

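If you need to compile the iText package yourself, you need these third-party jars: bcprov, bcmail and bctsp.
If you use Chinese, you need the CJK font extension package: iTextAsian.jar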

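By default, iText's font settings do not support Chinese fonts. You need to download the Far East font package iTextAsian.jar; otherwise Chinese text cannot be written to the PDF document. With the package on the classpath, the following code makes Chinese usable in a document:
BaseFont bfChinese = BaseFont.createFont("STSong-Light", "UniGB-UCS2-H", BaseFont.NOT_EMBEDDED);
com.lowagie.text.Font fontChinese = new com.lowagie.text.Font(bfChinese, 12, com.lowagie.text.Font.NORMAL);
Paragraph paragraph = new Paragraph("你好", fontChinese);
iTextAsian.jar can be downloaded from http://prdownloads.sourceforge.net/itext/iTextAsian.jar
If you use special symbols, another extension package is needed: itext-hyph-xml.jar.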

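All of the lib packages mentioned above are included in the iText release distribution.

Generating a PDF document with iText takes 5 steps:

Create a Document() instance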

Document document = new Document();

Document has three constructors:

public Document();
public Document(Rectangle pageSize);
public Document(Rectangle pageSize,
        float marginLeft,
        float marginRight,
        float marginTop,
        float marginBottom);
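/*
The constructor parameter pageSize is the size of the document page. With the first constructor the page size is A4, the same as Document(PageSize.A4). With the third constructor, the parameters marginLeft, marginRight, marginTop and marginBottom are the left, right, top and bottom page margins.
The pageSize parameter can set the page size, the page background color, landscape or portrait orientation, and other properties. iText defines paper types such as A0-A10, AL, LETTER, HALFLETTER, 11X17, LEDGER, NOTE, B0-B5, ARCH A-ARCH E, FLSA and FLSE. A custom paper size can also be defined with Rectangle pageSize = new Rectangle(144, 720);. The Rectangle method rotate() sets the page to landscape.
*/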
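
To make the units concrete: page sizes are given in PDF points, with 72 points to the inch. The sketch below uses a small stand-in class, `PageBox`, which is hypothetical and is not iText's `Rectangle`. It only illustrates how the custom size 144 x 720 translates to inches and what swapping width and height for landscape means.

```java
/** Hypothetical stand-in for a page size; NOT iText's Rectangle class. */
public class PageBox {
    /** PDF user space uses 72 points per inch. */
    public static final double POINTS_PER_INCH = 72.0;

    public final double width;   // in points
    public final double height;  // in points

    public PageBox(double width, double height) {
        this.width = width;
        this.height = height;
    }

    /** Returns a new box with width and height swapped (portrait <-> landscape). */
    public PageBox rotate() {
        return new PageBox(height, width);
    }

    /** Converts a length in points to inches. */
    public static double toInches(double points) {
        return points / POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        PageBox custom = new PageBox(144, 720);
        // prints "2.0 x 10.0 in"
        System.out.println(toInches(custom.width) + " x " + toInches(custom.height) + " in");

        PageBox landscape = custom.rotate();
        // prints "720.0 x 144.0 pt"
        System.out.println(landscape.width + " x " + landscape.height + " pt");
    }
}
```

So the custom 144 x 720 page is a narrow 2 x 10 inch strip, and rotating it yields a 10 x 2 inch landscape page.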

Create a writer (Writer) associated with the document object. The writer is what writes the document to disk.
```
PdfWriter.getInstance(document, new FileOutputStream("Helloworld.PDF"));
```
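Once the document object has been created, one or more writer (Writer) objects must be associated with it. A writer saves the document in the required format: PdfWriter saves it as a PDF file, and HtmlWriter saves it as an HTML file.
Open the document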
```
document.open();
```
After the document is opened, you can set its title, author, keywords, binding method and so on.
Add content to the document
```
document.add(new Paragraph("Hello World"));
```
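Content is added to the document as objects, such as Phrase, Paragraph and Table.

Text handling: iText handles text with chunks (Chunk), phrases (Phrase) and paragraphs (Paragraph).

Chunk is the smallest unit of text.
Close the document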
```
document.close();
```

These 5 steps produce a file named Helloworld.PDF whose content is "Hello World".

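HTML to PDF

Convert the HTML directly into a single PDF file
Convert the HTML content into PDF Element objects. For an existing PDF document, the converted Elements can be appended to the document to produce the PDF file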

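/* Direct conversion to PDF */
// Imports assumed here (iText 5 with XML Worker):
// com.itextpdf.text.{Document, Font, BaseColor}, com.itextpdf.text.pdf.{BaseFont, PdfWriter},
// com.itextpdf.tool.xml.XMLWorkerHelper, java.io.{FileInputStream, FileOutputStream, InputStream, InputStreamReader}
String htmlFile = "path of the HTML file   .../xx.html";
String pdfFile = "path of the output PDF   .../xxx.pdf";
InputStream htmlFileStream = new FileInputStream(htmlFile);

/* Chinese font definitions */
// Create the base font bfCN from STSongStd-Light (a light Song typeface) with the UniGB-UCS2-H encoding (Unicode, GB character set), not embedded.
BaseFont bfCN = BaseFont.createFont("STSongStd-Light", "UniGB-UCS2-H", false);
// Chinese font chFont: size 12, normal style, blue.
Font chFont = new Font(bfCN, 12, Font.NORMAL, BaseColor.BLUE);
// Section font secFont: size 12, normal style, light blue (RGB 0, 204, 255).
Font secFont = new Font(bfCN, 12, Font.NORMAL, new BaseColor(0, 204, 255));

/* Create the Document instance */
Document document = new Document();
/* Create the writer and associate it with the document */
PdfWriter pdfwriter = PdfWriter.getInstance(document, new FileOutputStream(pdfFile));
pdfwriter.setViewerPreferences(PdfWriter.HideToolbar);
/* Open the document */
document.open();
// Add content to the document
// Read the HTML file as UTF-8
InputStreamReader isr = new InputStreamReader(htmlFileStream, "UTF-8");
// Convert with the default settings
XMLWorkerHelper.getInstance().parseXHtml(pdfwriter, document, isr);
// Close the document
document.close();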

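URL to PDF

If the content at the URL contains Chinese characters, XML Worker must be able to convert Chinese characters (see http://www.micmiu.com/opensource/expdoc/itext-xml-worker-cn/).

For the Java HTML parser, jsoup is used here (website: http://jsoup.org/). In a Maven project, simply add the jsoup dependency to the pom file: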

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.1</version>
    <type>jar</type>
    <scope>compile</scope>
</dependency>

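The extraction function returns a four-element array:

- `info[0]`: the title of the blog post, taken from the text of the HTML element `<h2 class="title">`.
- `info[1]`: the category of the blog post. It finds the `<a>` element with the attribute `rel=category tag`, reads its `href` attribute, and strips a specific URL prefix.
- `info[2]`: the date of the blog post. It takes the text of the HTML element with class `post-info-date`, splits it on the string "日期" (the date label on the page), keeps the part after it, and trims leading and trailing whitespace.
- `info[3]`: the content of the blog post. It takes the HTML of the `<div>` element with class `entry`, formatted by a function named `formatContentTag`.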
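
The category and date fields described above come down to plain string operations, which can be tried without jsoup. The following is a minimal sketch under stated assumptions: the class name `BlogFieldSketch`, both method names, and the sample `href` and date text are hypothetical and not taken from a real page. The date label "日期" is written as a Unicode escape so the source stays ASCII-only. Unlike the one-line `split(...)[1]`, this variant returns an empty string when the label is missing instead of throwing.

```java
public class BlogFieldSketch {
    // Unicode escape for the two-character date label used on the page.
    private static final String DATE_LABEL = "\u65e5\u671f";

    /** Strips the site prefix from a category link, leaving the relative path. */
    public static String categoryFromHref(String href) {
        return href.replace("http://www.micmiu.com/", "");
    }

    /** Keeps the trimmed text after the date label, or "" if the label is absent. */
    public static String dateFromText(String text) {
        String[] parts = text.split(DATE_LABEL);
        return parts.length > 1 ? parts[1].trim() : "";
    }

    public static void main(String[] args) {
        // Hypothetical sample values for illustration only.
        // prints "opensource/expdoc/"
        System.out.println(categoryFromHref("http://www.micmiu.com/opensource/expdoc/"));
        // prints "2012-06-01"
        System.out.println(dateFromText(DATE_LABEL + " 2012-06-01"));
    }
}
```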

/**
 * Extracts the basic information of a blog post from its URL.
 * Returns: [title, category, date, content].
 *
 * @param blogURL
 * @return
 * @throws Exception
 */
public static String[] extractBlogInfo(String blogURL) throws Exception {
    String[] info = new String[4];
    org.jsoup.nodes.Document doc = Jsoup.connect(blogURL).get();
    org.jsoup.nodes.Element e_title = doc.select("h2.title").first();
    info[0] = e_title.text();

    org.jsoup.nodes.Element e_category = doc.select("a[rel=category tag]").first();
    info[1] = e_category.attr("href").replace("http://www.micmiu.com/", "");

    org.jsoup.nodes.Element e_date = doc.select("span.post-info-date").first();
    String dateStr = e_date.text().split("日期")[1].trim();
    info[2] = dateStr;

    org.jsoup.nodes.Element entry = doc.select("div.entry").first();
    info[3] = formatContentTag(entry);
    return info;
}

/**
 * Formats the img tags.
 *
 * @param entry
 * @return
 */
private static String formatContentTag(org.jsoup.nodes.Element entry) {
    try {
        entry.select("div").remove();
        // Replace <a href="*.jpg"><img src="*.jpg"/></a> with <img src="*.jpg"/>
        for (org.jsoup.nodes.Element imgEle : entry.select("a[href~=(?i)\\.(png|jpe?g)]")) {
            imgEle.replaceWith(imgEle.select("img").first());
        }
        return entry.html();
    } catch (Exception e) {
        return "";
    }
}

/**
 * Converts a String into an InputStream.
 *
 * @param content
 * @return
 */
public static InputStream parse2Stream(String content) {
    try {
        ByteArrayInputStream stream = new ByteArrayInputStream(content.getBytes("utf-8"));
        return stream;
    } catch (Exception e) {
        return null;
    }
}
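
The selector `a[href~=(?i)\.(png|jpe?g)]` relies on a regular expression, and it helps to see which links it would pick up. The sketch below applies the same pattern with `java.util.regex`, assuming the selector tests for a match anywhere in the attribute value; the class name `ImageLinkPatternSketch`, the method name and the sample links are hypothetical.

```java
import java.util.regex.Pattern;

public class ImageLinkPatternSketch {
    // Same regular expression as in the selector a[href~=(?i)\.(png|jpe?g)]
    private static final Pattern IMAGE_HREF = Pattern.compile("(?i)\\.(png|jpe?g)");

    /** Returns true if the pattern is found anywhere in the href. */
    public static boolean looksLikeImageLink(String href) {
        return IMAGE_HREF.matcher(href).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample links for illustration only.
        System.out.println(looksLikeImageLink("http://example.com/a/photo.JPG"));   // prints "true"
        System.out.println(looksLikeImageLink("http://example.com/a/pic.jpeg"));    // prints "true"
        System.out.println(looksLikeImageLink("http://example.com/a/page.html"));   // prints "false"
        System.out.println(looksLikeImageLink("http://example.com/x.png/info"));    // prints "true"
    }
}
```

Because the pattern is not anchored to the end of the value, a link such as the last sample also matches. If that matters, appending `$` to the expression would restrict it to links that end with the image extension.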

/*
Convert the HTML file to PDF
*/
String blogURL = ",,,,";
String pdfFile = "path of the output PDF";

/* Chinese font definitions */
// Create the base font bfCN from STSongStd-Light (a light Song typeface) with the UniGB-UCS2-H encoding (Unicode, GB character set), not embedded.
BaseFont bfCN = BaseFont.createFont("STSongStd-Light", "UniGB-UCS2-H", false);
// Chinese font chFont: size 12, normal style, blue.
Font chFont = new Font(bfCN, 12, Font.NORMAL, BaseColor.BLUE);
// Section font secFont: size 12, normal style, light blue (RGB 0, 204, 255).
Font secFont = new Font(bfCN, 12, Font.NORMAL, new BaseColor(0, 204, 255));
// Text font textFont: size 12, normal style, black.
Font textFont = new Font(bfCN, 12, Font.NORMAL, BaseColor.BLACK);

// Create a new PDF document object.
Document document = new Document();
// Write the PDF document to the given file output stream.
PdfWriter pdfwriter = PdfWriter.getInstance(document, new FileOutputStream(pdfFile));
// Set the viewer preferences of the PDF file: hide the toolbar.
pdfwriter.setViewerPreferences(PdfWriter.HideToolbar);
document.open();
// Custom function defined above: extracts the blog information.
String[] blogInfo = extractBlogInfo(blogURL);
// Parse the HTML content into the PDF document.
XMLWorkerHelper.getInstance().parseXHtml(pdfwriter, document, parse2Stream(blogInfo[3]));
document.close();