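I'm currently working through Yupi's (魚皮) "Intelligent Collaborative Cloud Image Library" (智能協同云圖庫) project, which involves reverse image search plus image scraping. I have scraped images before, but always with someone else's ready-made code and without really understanding why it was written that way. This time I am trying to understand every step. My fundamentals are very weak (I'm the type who jumped straight into a project without learning the basics), so keeping up with the course is a bit of a struggle. Fortunately GPT is very helpful, and I've just about managed to follow along. Here is a summary of the approach.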
Baidu's reverse image search accepts an image URL as input. I chose the image at this URL:
https://i2.hdslb.com/bfs/archive/ad698e40cc6dd3d03ae5d0ab7bfa50faf368bd9b.jpg
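The search results page then appears:
Next, open Safari's Web Inspector (in other browsers, this is the developer tools).
Filter by XHR only, so that only API requests are shown.
Remember to enable log preservation, because the upload request flashes by and disappears. On other sites the request may have a different name, such as pcsearch.
Enter the image URL and run the search again. The following appears:
Then look at the contents of the Headers tab.
Expanding the request data gives: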
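sdkParams is usually a signature parameter generated by Baidu's official SDK; it may contain a timestamp, a signature, a key hash, and so on. It can be ignored here.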
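To make the shape of the upload request concrete, here is a minimal sketch that builds the form body using only the JDK. The field names and values are the ones used in this post; the class and method names (`UploadFormSketch`, `encodeForm`, `uploadFields`) and the `example.com` image URL are made up for illustration, and sdkParams is deliberately left out.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class UploadFormSketch {

    // Encodes key/value pairs as an application/x-www-form-urlencoded body.
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // The four form fields used in this post, in insertion order.
    static Map<String, String> uploadFields(String imageUrl) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("image", imageUrl);
        fields.put("tn", "pc");
        fields.put("from", "pc");
        fields.put("image_source", "PC_UPLOAD_URL");
        return fields;
    }

    public static void main(String[] args) {
        // Placeholder image URL, for illustration only.
        String body = encodeForm(uploadFields("https://example.com/a.jpg"));
        // The uptime query parameter is the current time in milliseconds.
        System.out.println("POST https://graph.baidu.com/upload?uptime=" + System.currentTimeMillis());
        System.out.println(body);
    }
}
```

The Hutool-based implementation below sends the same fields with `HttpRequest.post(...).form(...)`, which takes care of this encoding.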
    package com.bxt.picturebackend.imageSearch.sub;

    import cn.hutool.core.util.URLUtil;
    import cn.hutool.http.HttpRequest;
    import cn.hutool.http.HttpResponse;
    import cn.hutool.json.JSONUtil;
    import com.bxt.picturebackend.exception.BusinessException;
    import com.bxt.picturebackend.exception.ErrorCode;
    import lombok.extern.slf4j.Slf4j;

    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    @Slf4j
    public class GetImagePageUrlApi {

        /** Submits the image URL to Baidu and returns the URL of the search results page. */
        public static String getImagePageUrl(String imageUrl) {
            // Form fields taken from the upload request seen in the network inspector
            Map<String, Object> formData = new HashMap<>();
            formData.put("image", imageUrl);
            formData.put("tn", "pc");
            formData.put("from", "pc");
            formData.put("image_source", "PC_UPLOAD_URL");
            long upTime = System.currentTimeMillis();
            String postUrl = "https://graph.baidu.com/upload?uptime=" + upTime;
            String acsToken = "jmM4zyI8OUixvSuWh0sCy4xWbsttVMZb9qcRTmn6SuNWg0vCO7N0s6Lffec+IY5yuqHujHmCctF9BVCGYGH0H5SH/H3VPFUl4O4CP1jp8GoAzuslb8kkQQ4a21Tebge8yhviopaiK66K6hNKGPlWt78xyyJxTteFdXYLvoO6raqhz2yNv50vk4/41peIwba4lc0hzoxdHxo3OBerHP2rfHwLWdpjcI9xeu2nJlGPgKB42rYYVW50+AJ3tQEBEROlg/UNLNxY+6200B/s6Ryz+n7xUptHFHi4d8Vp8q7mJ26yms+44i8tyiFluaZAr66/+wW/KMzOhqhXCNgckoGPX1SSYwueWZtllIchRdsvCZQ8tFJymKDjCf3yI/Lw1oig9OKZCAEtiLTeKE9/CY+Crp8DHa8Tpvlk2/i825E3LuTF8EQfzjcGpVnR00Lb4/8A";
            try {
                HttpResponse httpResponse = HttpRequest.post(postUrl)
                        .form(formData)
                        .timeout(10000)
                        .header("Acs-Token", acsToken)
                        .execute();
                if (httpResponse.getStatus() != 200) {
                    log.error("Failed to get the image search page URL, status code: {}", httpResponse.getStatus());
                    throw new RuntimeException("Failed to get the image search page URL, please try again later");
                }
                String body = httpResponse.body();
                System.out.println("body = " + body);
                Map<String, Object> responseMap = JSONUtil.toBean(body, Map.class);
                System.out.println("responseMap = " + responseMap);
                if (responseMap == null) {
                    log.error("Failed to get the image search page URL, response body: {}", body);
                    throw new RuntimeException("Failed to get the image search page URL, please try again later");
                }
                Map<String, Object> data = (Map<String, Object>) responseMap.get("data");
                System.out.println("data = " + data);
                // Check before use; checking only after decoding would be too late
                if (data == null || data.get("url") == null) {
                    throw new BusinessException(ErrorCode.OPERATION_ERROR, "No valid result returned");
                }
                String rawUrl = (String) data.get("url");
                // Decode the URL
                return URLUtil.decode(rawUrl, StandardCharsets.UTF_8);
            } catch (Exception e) {
                log.error("Failed to get the image search page URL, error: {}", e.getMessage());
                throw new RuntimeException("Failed to get the image search page URL, please try again later");
            }
        }
    }
Test it with a unit test class:
    package com.bxt.picturebackend.imageSearch.sub;

    import org.junit.jupiter.api.Test;

    import static org.junit.jupiter.api.Assertions.*;

    class GetImagePageUrlApiTest {

        @Test
        void testGetImagePageUrl() {
            String testImageUrl = "https://i2.hdslb.com/bfs/archive/ad698e40cc6dd3d03ae5d0ab7bfa50faf368bd9b.jpg";
            String response = GetImagePageUrlApi.getImagePageUrl(testImageUrl);
            System.out.println(response);
            assertNotNull(response);
        }
    }
The output is:
body = {"status":0,"msg":"Success","data":{"url":"https://graph.baidu.com/s?card_key=\u0026entrance=GENERAL\u0026extUiData%5BisLogoShow%5D=1\u0026f=all\u0026isLogoShow=1\u0026session_id=13377293787626920489\u0026sign=1260533cc766d268eaf8401755063018\u0026tpl_from=pc","sign":"1260533cc766d268eaf8401755063018"}}
responseMap = {status=0, msg=Success, data={"url":"https://graph.baidu.com/s?card_key=&entrance=GENERAL&extUiData%5BisLogoShow%5D=1&f=all&isLogoShow=1&session_id=13377293787626920489&sign=1260533cc766d268eaf8401755063018&tpl_from=pc","sign":"1260533cc766d268eaf8401755063018"}}
data = {"url":"https://graph.baidu.com/s?card_key=&entrance=GENERAL&extUiData%5BisLogoShow%5D=1&f=all&isLogoShow=1&session_id=13377293787626920489&sign=1260533cc766d268eaf8401755063018&tpl_from=pc","sign":"1260533cc766d268eaf8401755063018"}
https://graph.baidu.com/s?card_key=&entrance=GENERAL&extUiData[isLogoShow]=1&f=all&isLogoShow=1&session_id=13377293787626920489&sign=1260533cc766d268eaf8401755063018&tpl_from=pc
Process finished with exit code 0
The URL obtained here is the address of the search results page.
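Note that the final printed URL contains `extUiData[isLogoShow]`, while the raw response contains `extUiData%5BisLogoShow%5D`. That difference comes from the URL-decoding step. The sketch below shows the same effect with the JDK's `URLDecoder`, used here as a stand-in for Hutool's `URLUtil.decode`; I am assuming the two behave the same for this input (one known difference is that `URLDecoder` also turns `+` into a space). The class name `PercentDecodeSketch` is made up for illustration.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

final class PercentDecodeSketch {

    // Decodes percent-encoded sequences such as %5B and %5D using UTF-8.
    static String decode(String raw) {
        return URLDecoder.decode(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A fragment of the query string returned by the upload request.
        System.out.println(decode("extUiData%5BisLogoShow%5D=1&f=all"));
        // prints: extUiData[isLogoShow]=1&f=all
    }
}
```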
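Next, analyze this page.
Filtering by Document only gives the page's HTML.
Since the images we need appear under "Similar Images" (相似圖片), look around that text in the source.
firstUrl looks useful.
Copy the string that follows it: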
https:\/\/graph.baidu.com\/ajax\/pcsimi?carousel=503&entrance=GENERAL&extUiData%5BisLogoShow%5D=1&inspire=general_pc&limit=30&next=2&render_type=card&session_id=13377293787626920489&sign=1260533cc766d268eaf8401755063018&tk=2e59f&tpl_from=pc
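This needs a small change. Each backslash (\) here is part of \/, which is how this JSON string writes the forward slash (/). It belongs to the JSON encoding, not to the URL itself.
Removing every backslash gives the following URL: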
https://graph.baidu.com/ajax/pcsimi?carousel=503&entrance=GENERAL&extUiData%5BisLogoShow%5D=1&inspire=general_pc&limit=30&next=2&render_type=card&session_id=13377293787626920489&sign=1260533cc766d268eaf8401755063018&tk=2e59f&tpl_from=pc
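The manual step above (find firstUrl, copy its value, drop the backslashes) can be sketched in a few lines of plain Java. The regular expression is the one used in the complete code later in this post; the class name `FirstUrlSketch` and the shortened page fragment in `main` are made up for illustration and are not the real page source.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstUrlSketch {

    // Matches "firstUrl":"<value>" and captures the value lazily up to the next quote.
    private static final Pattern FIRST_URL =
            Pattern.compile("\"firstUrl\"\\s*:\\s*\"(.*?)\"");

    // Returns the firstUrl value with the JSON-escaped slashes restored, if present.
    static Optional<String> extractFirstUrl(String pageSource) {
        Matcher matcher = FIRST_URL.matcher(pageSource);
        if (!matcher.find()) {
            return Optional.empty();
        }
        return Optional.of(matcher.group(1).replace("\\/", "/"));
    }

    public static void main(String[] args) {
        // Hypothetical, shortened fragment of the page source.
        String fragment = "{\"title\":\"x\",\"firstUrl\":"
                + "\"https:\\/\\/graph.baidu.com\\/ajax\\/pcsimi?limit=30&next=2\"}";
        System.out.println(extractFirstUrl(fragment).orElse("not found"));
        // prints: https://graph.baidu.com/ajax/pcsimi?limit=30&next=2
    }
}
```

A regular expression is enough here because only one field is needed; parsing the whole embedded JSON would be sturdier if the page layout changes.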
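Opening the cleaned-up URL in the browser gives the following page:
The string after thumbUrl is what we need.
However, pasting it directly into the address bar fails.
The cause is the escaping. The string is copied from JSON, where & is written as the Unicode escape \u0026. That escape is not valid inside a URL; the query parameters have to be joined with a literal &. This applies to every occurrence, including the final \u0026h=500.
After correcting the format, for example: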
http://mms1.baidu.com/it/u=771534300,3396233686&fm=253&app=138&f=JPEG?w=800&h=500
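the image displays normally.
Extending the earlier code gives the complete version below. Calling getUrlList returns the URLs of the similar images.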
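As a sketch of that fix, the helper below decodes any four-digit JSON Unicode escape rather than only the ampersand, in case other escapes show up. The class and method names are made up for illustration, and the escaped input in `main` is reconstructed from the corrected URL above rather than copied from the page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeEscapeSketch {

    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces each escape with the character it stands for.
    static String decodeUnicodeEscapes(String text) {
        Matcher matcher = ESCAPE.matcher(text);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            char decoded = (char) Integer.parseInt(matcher.group(1), 16);
            // quoteReplacement keeps characters such as $ from being treated specially.
            matcher.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Reconstructed escaped form of the thumbnail URL, for illustration.
        String escaped = "http://mms1.baidu.com/it/u=771534300,3396233686"
                + "\\u0026fm=253\\u0026app=138\\u0026f=JPEG?w=800\\u0026h=500";
        System.out.println(decodeUnicodeEscapes(escaped));
    }
}
```

If only the ampersand ever appears, a plain string replacement of that one escape with `&` does the same job.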
    package com.bxt.picturebackend.imageSearch.sub;

    import cn.hutool.core.util.URLUtil;
    import cn.hutool.http.HttpRequest;
    import cn.hutool.http.HttpResponse;
    import cn.hutool.json.JSONUtil;
    import com.bxt.picturebackend.exception.BusinessException;
    import com.bxt.picturebackend.exception.ErrorCode;
    import lombok.extern.slf4j.Slf4j;

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    @Slf4j
    public class GetImagePageUrlApi {

        // Acs-Token header value, shared by all three requests
        private static final String ACS_TOKEN = "jmM4zyI8OUixvSuWh0sCy4xWbsttVMZb9qcRTmn6SuNWg0vCO7N0s6Lffec+IY5yuqHujHmCctF9BVCGYGH0H5SH/H3VPFUl4O4CP1jp8GoAzuslb8kkQQ4a21Tebge8yhviopaiK66K6hNKGPlWt78xyyJxTteFdXYLvoO6raqhz2yNv50vk4/41peIwba4lc0hzoxdHxo3OBerHP2rfHwLWdpjcI9xeu2nJlGPgKB42rYYVW50+AJ3tQEBEROlg/UNLNxY+6200B/s6Ryz+n7xUptHFHi4d8Vp8q7mJ26yms+44i8tyiFluaZAr66/+wW/KMzOhqhXCNgckoGPX1SSYwueWZtllIchRdsvCZQ8tFJymKDjCf3yI/Lw1oig9OKZCAEtiLTeKE9/CY+Crp8DHa8Tpvlk2/i825E3LuTF8EQfzjcGpVnR00Lb4/8A";

        /** Returns the thumbnail URLs of images similar to the one at imageUrl. */
        public static List<String> getUrlList(String imageUrl) {
            // Step 1: get the search results page URL
            String imagePageUrl = getImagePageUrl(imageUrl);
            if (imagePageUrl == null || imagePageUrl.isEmpty()) {
                throw new BusinessException(ErrorCode.OPERATION_ERROR, "No valid result returned");
            }
            // Step 2: fetch the results page and extract firstUrl from its HTML
            HttpResponse httpResponse = HttpRequest.get(imagePageUrl)
                    .timeout(10000)
                    .header("Acs-Token", ACS_TOKEN)
                    .execute();
            if (httpResponse.getStatus() != 200) {
                log.error("Failed to get the image search page, status code: {}", httpResponse.getStatus());
                throw new RuntimeException("Failed to get the image search page, please try again later");
            }
            Pattern pattern = Pattern.compile("\"firstUrl\"\\s*:\\s*\"(.*?)\"");
            Matcher matcher = pattern.matcher(httpResponse.body());
            String firstUrl;
            if (matcher.find()) {
                // Extract the value and replace the JSON-escaped "\/" with "/"
                firstUrl = matcher.group(1).replace("\\/", "/");
                System.out.println("firstUrl = " + firstUrl);
            } else {
                throw new BusinessException(ErrorCode.OPERATION_ERROR, "No valid result returned");
            }
            // Step 3: fetch the similar-image list and collect every thumbUrl
            HttpResponse urlListPage = HttpRequest.get(firstUrl)
                    .timeout(10000)
                    .header("Acs-Token", ACS_TOKEN)
                    .execute();
            pattern = Pattern.compile("\"thumbUrl\"\\s*:\\s*\"(.*?)\"");
            matcher = pattern.matcher(urlListPage.body());
            List<String> urlList = new ArrayList<>();
            while (matcher.find()) {
                // Replace the JSON-escaped ampersand (backslash + u0026) with "&"
                String thumbUrl = matcher.group(1).replace("\\u0026", "&");
                urlList.add(thumbUrl);
            }
            return urlList;
        }

        /** Submits the image URL to Baidu and returns the URL of the search results page. */
        public static String getImagePageUrl(String imageUrl) {
            Map<String, Object> formData = new HashMap<>();
            formData.put("image", imageUrl);
            formData.put("tn", "pc");
            formData.put("from", "pc");
            formData.put("image_source", "PC_UPLOAD_URL");
            long upTime = System.currentTimeMillis();
            String postUrl = "https://graph.baidu.com/upload?uptime=" + upTime;
            try {
                HttpResponse httpResponse = HttpRequest.post(postUrl)
                        .form(formData)
                        .timeout(10000)
                        .header("Acs-Token", ACS_TOKEN)
                        .execute();
                if (httpResponse.getStatus() != 200) {
                    log.error("Failed to get the image search page URL, status code: {}", httpResponse.getStatus());
                    throw new RuntimeException("Failed to get the image search page URL, please try again later");
                }
                String body = httpResponse.body();
                System.out.println("body = " + body);
                Map<String, Object> responseMap = JSONUtil.toBean(body, Map.class);
                System.out.println("responseMap = " + responseMap);
                if (responseMap == null) {
                    log.error("Failed to get the image search page URL, response body: {}", body);
                    throw new RuntimeException("Failed to get the image search page URL, please try again later");
                }
                Map<String, Object> data = (Map<String, Object>) responseMap.get("data");
                System.out.println("data = " + data);
                // Check before use; checking only after decoding would be too late
                if (data == null || data.get("url") == null) {
                    throw new BusinessException(ErrorCode.OPERATION_ERROR, "No valid result returned");
                }
                String rawUrl = (String) data.get("url");
                // Decode the URL
                return URLUtil.decode(rawUrl, StandardCharsets.UTF_8);
            } catch (Exception e) {
                log.error("Failed to get the image search page URL, error: {}", e.getMessage());
                throw new RuntimeException("Failed to get the image search page URL, please try again later");
            }
        }
    }
Printing the final list gives:
[http://mms1.baidu.com/it/u=771534300,3396233686&fm=253&app=138&f=JPEG?w=800&h=500, http://mms0.baidu.com/it/u=4161103281,1829674203&fm=253&app=138&f=JPEG?w=749&h=580, http://mms2.baidu.com/it/u=2706284301,789398194&fm=253&app=120&f=JPEG?w=800&h=500, http://mms1.baidu.com/it/u=1667096992,1485299432&fm=253&app=138&f=JPEG?w=800&h=500, http://mms0.baidu.com/it/u=2502213264,439196765&fm=253&app=120&f=JPEG?w=800&h=500, http://mms0.baidu.com/it/u=4000521229,3982402882&fm=253&app=120&f=JPEG?w=655&h=446, http://mms2.baidu.com/it/u=640527677,1986438968&fm=253&app=138&f=JPEG?w=455&h=256, http://mms2.baidu.com/it/u=156995109,2192672339&fm=253&app=120&f=JPEG?w=801&h=500, http://mms0.baidu.com/it/u=48011703,2549638517&fm=253&app=138&f=JPEG?w=800&h=500, http://mms2.baidu.com/it/u=1316957924,1711619045&fm=253&app=120&f=JPEG?w=800&h=500, http://mms0.baidu.com/it/u=2192255561,2552189568&fm=253&app=138&f=JPEG?w=634&h=356, http://mms0.baidu.com/it/u=2868092005,3149855400&fm=253&app=138&f=JPEG?w=500&h=500, http://mms0.baidu.com/it/u=2173262737,1364469520&fm=253&app=138&f=JPEG?w=500&h=500, http://mms0.baidu.com/it/u=896380067,3285805132&fm=253&app=138&f=JPEG?w=1053&h=800, http://mms0.baidu.com/it/u=184083361,1291046512&fm=253&app=138&f=JPEG?w=500&h=500, http://mms0.baidu.com/it/u=2147020713,3191068967&fm=253&app=138&f=JPEG?w=867&h=500, http://mms0.baidu.com/it/u=864737700,3400231159&fm=253&app=120&f=JPEG?w=800&h=500, http://mms1.baidu.com/it/u=153299186,2018689789&fm=253&app=120&f=JPEG?w=480&h=270, http://mms0.baidu.com/it/u=2253215478,3249860676&fm=253&app=120&f=JPEG?w=800&h=500, http://mms2.baidu.com/it/u=3522373714,3342355003&fm=253&app=120&f=JPEG?w=800&h=500]
They are all Kunkun (坤坤).