Integrating Tika with Spring Boot to extract text from Word, PDF, and XLS files


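Introduction

Apache Tika is an open-source content analysis toolkit for extracting text and metadata from a wide range of document formats. It supports many document types, including but not limited to plain text, HTML, PDF, Microsoft Office documents, and image files. Its main features are content detection, text extraction, and metadata extraction.

Official website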

https://tika.apache.org/

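Apache Tika features

Content detection: identifies a file's MIME type.
Text extraction: extracts the plain-text content of a document.
Metadata extraction: extracts a document's metadata (such as title, author, and creation date).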
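The three features listed above can be sketched with Tika's `Tika` facade class. This is a minimal illustration, assuming the `tika-core` and `tika-parsers-standard-package` dependencies are on the classpath; the class name `TikaFeaturesDemo`, its helper methods, and the temporary sample file are hypothetical and exist only for this example.

```java
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of the original article.
public class TikaFeaturesDemo {

    private static final Tika TIKA = new Tika();

    // 1. Content detection: returns the detected MIME type of the file.
    public static String detectType(Path file) throws IOException {
        return TIKA.detect(file);
    }

    // 2. Text extraction: returns the plain-text content of the file.
    public static String extractText(Path file) throws IOException, TikaException {
        return TIKA.parseToString(file);
    }

    // 3. Metadata extraction: parsing fills the Metadata object as a side effect.
    public static Metadata extractMetadata(Path file) throws IOException, TikaException {
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(file)) {
            TIKA.parseToString(in, metadata);
        }
        return metadata;
    }

    public static void main(String[] args) throws IOException, TikaException {
        // Assumed sample input: a small temporary text file created for the demo.
        Path sample = Files.createTempFile("tika-demo", ".txt");
        Files.write(sample, "Hello from Tika".getBytes(StandardCharsets.UTF_8));
        try {
            System.out.println("MIME type: " + detectType(sample));
            System.out.println("Text: " + extractText(sample).trim());
            Metadata metadata = extractMetadata(sample);
            for (String name : metadata.names()) {
                System.out.println(name + " = " + metadata.get(name));
            }
        } finally {
            Files.deleteIfExists(sample);
        }
    }
}
```

Note that the facade's `parseToString` caps the returned string at a default maximum length, so for large documents an explicit content handler may be the better choice.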

Spring Boot integration example

Add the pom dependencies

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.9.1</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.9.1</version>
</dependency>

Create a utility class

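import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MyFileUtils {

    // Parses the file at filePath and returns its extracted text content.
    public static String doParse(String filePath) throws TikaException, SAXException, IOException {
        try (InputStream inputStream = new FileInputStream(filePath)) {
            // -1 disables the handler's write limit, so long documents are not cut off
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            ParseContext parseContext = new ParseContext();
            // AutoDetectParser detects the file type and delegates to the matching parser
            AutoDetectParser detectParser = new AutoDetectParser();
            detectParser.parse(inputStream, handler, metadata, parseContext);
            return handler.toString();
        }
    }
}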

Test

import org.apache.tika.exception.TikaException;
import org.xml.sax.SAXException;

import java.io.IOException;

public class MyFileUtilsTest {

    public static void main(String[] args) {
        String filePath = "D:/tmp/測試附件.xls";
        String content = null;
        try {
            content = MyFileUtils.doParse(filePath);
        } catch (TikaException | SAXException | IOException e) {
            // Parsing failed; print the cause and fall through with content == null
            e.printStackTrace();
        }
        System.out.println(content);
    }
}

Output

The program prints the text extracted from the file to the console.
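The test above calls the utility from a plain `main` method. In a Spring Boot application, the same parsing logic might be exposed through a file-upload endpoint. The following is only a sketch under stated assumptions: it presumes `spring-boot-starter-web` is on the classpath, and the controller name `FileParseController`, the `/parse` path, and the `file` request parameter are all hypothetical choices, not something taken from the original example.

```java
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;
import org.xml.sax.SAXException;

import java.io.IOException;
import java.io.InputStream;

// Hypothetical controller: accepts an uploaded file and returns its extracted text.
@RestController
public class FileParseController {

    // Assumed endpoint path and parameter name; adjust to your own API design.
    @PostMapping("/parse")
    public ResponseEntity<String> parse(@RequestParam("file") MultipartFile file) {
        try (InputStream in = file.getInputStream()) {
            // Same parsing steps as MyFileUtils.doParse, but reading from the upload stream
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
            return ResponseEntity.ok(handler.toString());
        } catch (IOException | SAXException | TikaException e) {
            // Report parsing failures to the client instead of printing a stack trace
            return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
                    .body("Failed to parse file: " + e.getMessage());
        }
    }
}
```

Reading directly from the upload's input stream avoids writing the file to disk first. Because the write limit is disabled with `-1`, a production service would probably want to cap upload size elsewhere, for example through Spring's multipart configuration.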
