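A friend recently asked me whether I could scrape data from 同花順 (10jqka). The stock market has been doing well lately, and he wanted to pull the data down and analyze it himself. As everyone who follows China's A-share market knows, once a concept gets hot, the stocks tied to it all surge.
If you can get price-movement information in time, you can jump in just as a concept starts to take off and make a small profit. But there are far too many stocks for him to watch by hand, so he messaged me on WeChat: could I scrape the stocks under each 同花順 board and store them in a database? He could then build strategies from that data.
As the saying goes: where there is a pain point, there is programming! It's only 同花順, so let's do it!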
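Background research
So I opened the 同花順 board page, http://q.10jqka.com.cn/gn/, and found 268 concepts:
Inspecting the HTML of the concept board page shows that the URLs of all 268 concepts are right there in the HTML:
Opening one of them, "阿里巴巴概念" (the Alibaba concept), shows that its page is paginated:
The paginated data is fetched in real time from an API, and the requests carry Cookie values and other identifiers. 同花順's anti-scraping measures have always been fairly strong, so replaying the API calls directly is likely to be difficult. Driving a real browser with selenium is a better fit.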
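Design
With the technical direction settled, here is the plan:
Fetch the HTML source of the board page at http://q.10jqka.com.cn/gn/, parse it with Jsoup, and put each concept's URL into a List
Iterate over the List, fetch the HTML source of each concept page by its URL, and parse the stock information
Recursively click "下一頁" (next page) to collect the stock data on every page, up to the last page
Store the stock information in the database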
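Environment setup
First, the environment the project needs: IDE: idea; language: java; dependencies: jdk1.8, maven, chrome, ChromeDriver
Since the approach simulates browser actions, the machine needs both the chrome browser and the chromedriver driver installed. I won't cover installing chrome here; just search for it on Baidu and download the browser.
The key part is installing ChromeDriver: the driver has to match the version of your installed chrome.
To check the chrome version, enter chrome://version in the chrome address bar.
Then download the driver for that version, matching it as closely as possible. For example, mine is 79.0.3945.117 (Official Build) (64-bit), and I downloaded 79.0.3945.36.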
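To make "matching as closely as possible" concrete, here is a small sketch that compares two version strings while ignoring the last component. That rule is my assumption, based on what the two versions above have in common (79.0.3945); the class and method names are made up for illustration and are not part of the project.

```java
public class DriverVersionCheck {

    /** Drops the last dotted component, e.g. "79.0.3945.117" becomes "79.0.3945". */
    public static String withoutLastPart(String version) {
        String v = version.trim();
        int lastDot = v.lastIndexOf('.');
        return lastDot < 0 ? v : v.substring(0, lastDot);
    }

    /** Assumed rule: chrome and the driver match when everything but the last component is equal. */
    public static boolean matches(String chromeVersion, String driverVersion) {
        return withoutLastPart(chromeVersion).equals(withoutLastPart(driverVersion));
    }

    public static void main(String[] args) {
        System.out.println(matches("79.0.3945.117", "79.0.3945.36")); // true
        System.out.println(matches("80.0.3987.16", "79.0.3945.36"));  // false
    }
}
```

If the check fails, it is probably worth downloading a different driver before debugging anything else.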
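Download page for each ChromeDriver version:
The next step is optional. The project starts without it; you only need to change one setting in the code. To configure it:
Put the downloaded ChromeDriver file in the /usr/local/bin/ directory:
cp chromedriver /usr/local/bin/
Check that the installation worked:
chromedriver --version
If you skip this step, just remember to change the ChromeDriver path set in the code to wherever your own ChromeDriver lives. Mine, for example, is: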
System.setProperty(
"webdriver.chrome.driver",
"/Users/admin/Documents/selenium/chrome/79.0.3945.36/chromedriver"
);
Remember to change the ChromeDriver path in the code. Remember to change the ChromeDriver path in the code. Remember to change the ChromeDriver path in the code.
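One way to avoid editing the source for every machine is to resolve the path at startup. The sketch below is only a suggestion, not what the project does: it prefers a -Dwebdriver.chrome.driver system property, then an environment variable, then a fallback. The environment variable name CHROMEDRIVER_PATH and the class name are hypothetical.

```java
public class DriverPath {

    public static final String PROPERTY = "webdriver.chrome.driver";
    // Hypothetical environment variable name, chosen only for this sketch
    public static final String ENV_VAR = "CHROMEDRIVER_PATH";

    /** Returns the first non-blank candidate: system property, then environment variable, then fallback. */
    public static String resolve(String fromProperty, String fromEnv, String fallback) {
        if (fromProperty != null && !fromProperty.trim().isEmpty()) {
            return fromProperty;
        }
        if (fromEnv != null && !fromEnv.trim().isEmpty()) {
            return fromEnv;
        }
        return fallback;
    }

    public static void main(String[] args) {
        String path = resolve(System.getProperty(PROPERTY), System.getenv(ENV_VAR),
                "/usr/local/bin/chromedriver");
        // Selenium reads this property when the ChromeDriver instance is created
        System.setProperty(PROPERTY, path);
        System.out.println("Using ChromeDriver at " + path);
    }
}
```

With something like this in place, the hard-coded System.setProperty calls in the services could be replaced by a single lookup.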
Validating the approach
First, implement the first three steps of the design:
package com.ths.controller;
import com.ths.service.ThsGnCrawlService;
import com.ths.service.ThsGnDetailCrawlService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;
import java.util.HashMap;
import java.util.List;
@Controller
public class CrawlController {

    @Autowired
    private ThsGnCrawlService thsGnCrawlService;

    @Autowired
    private ThsGnDetailCrawlService thsGnDetailCrawlService;

    @RequestMapping("/test")
    @ResponseBody
    public void test() {
        // Scrape the URLs of all concept boards
        List<HashMap<String, String>> list = thsGnCrawlService.ThsGnCrawlListUrl();
        // Put them into the blocking queue
        thsGnDetailCrawlService.putAllArrayBlockingQueue(list);
        // Scrape each URL with worker threads
        thsGnDetailCrawlService.ConsumeCrawlerGnDetailData(1);
    }
}
First, look at thsGnCrawlService.ThsGnCrawlListUrl(): how does it scrape the URLs of all concept boards?
package com.ths.service.impl;
import com.ths.parse.service.ThsParseHtmlService;
import com.ths.service.ThsGnCrawlService;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import java.util.HashMap;
import java.util.List;
import java.util.concurrent.TimeUnit;
@Service
public class ThsGnCrawlServiceImpl implements ThsGnCrawlService {

    private final static Logger LOGGER = LoggerFactory.getLogger(ThsGnCrawlServiceImpl.class);

    /** URL of the page listing all 同花順 concept boards */
    private final static String GN_URL = "http://q.10jqka.com.cn/gn/";

    @Autowired
    private ThsParseHtmlService thsParseHtmlService;

    @Override
    public List<HashMap<String, String>> ThsGnCrawlListUrl() {
        System.setProperty("webdriver.chrome.driver", "/Users/admin/Documents/selenium/chrome/79.0.3945.36/chromedriver");
        ChromeOptions options = new ChromeOptions();
        // Options controlling whether the browser UI is shown
        // Headless mode (no UI)
        // options.addArguments("headless");
        // Disable the sandbox; this flag cost me a whole day
        // options.addArguments("no-sandbox");
        WebDriver webDriver = new ChromeDriver(options);
        try {
            // Tune to your network speed; a slow connection may need a longer wait
            webDriver.manage().timeouts().implicitlyWait(5, TimeUnit.SECONDS);
            webDriver.get(GN_URL);
            Thread.sleep(1000L);
            String gnWindow = webDriver.getWindowHandle();
            // Get the HTML of the 同花順 concept page
            String thsGnHtml = webDriver.getPageSource();
            LOGGER.info("HTML fetched from 同花順 url [{}]:\n{}", GN_URL, thsGnHtml);
            return thsParseHtmlService.parseGnHtmlReturnGnUrlList(thsGnHtml);
        } catch (Exception e) {
            LOGGER.error("Exception while fetching the HTML of the 同花順 concept page:", e);
        } finally {
            webDriver.close();
            webDriver.quit();
        }
        return null;
    }
}
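This uses the ChromeDriver mentioned above, so change the path to match your own setup (that's the fourth time I've said it!). As the code shows, String thsGnHtml = webDriver.getPageSource(); fetches the page HTML, and parsing that HTML yields the URL of every concept board.
I parse the HTML with Jsoup, which is easy to pick up and has a simple API. The code that parses the HTML and extracts each board's URL is: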
package com.ths.parse.service.impl;
import com.ths.parse.service.ThsParseHtmlService;
import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Service;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
@Service
public class ThsParseHtmlServiceImpl implements ThsParseHtmlService {

    /**
     * Parses the HTML of the 同花順 concept board page: http://q.10jqka.com.cn/gn/
     * Returns the URL of every concept board.
     */
    public List<HashMap<String, String>> parseGnHtmlReturnGnUrlList(String html) {
        if (StringUtil.isBlank(html)) {
            return null;
        }
        List<HashMap<String, String>> list = new ArrayList<>();
        Document document = Jsoup.parse(html);
        Elements cateItemsFromClass = document.getElementsByClass("cate_items");
        for (Element element : cateItemsFromClass) {
            Elements as = element.getElementsByTag("a");
            for (Element a : as) {
                String gnUrl = a.attr("href");
                String name = a.text();
                HashMap<String, String> map = new HashMap<>();
                map.put("url", gnUrl);
                map.put("gnName", name);
                list.add(map);
            }
        }
        return list;
    }
}
As you can see, any data present in the HTML can be extracted once you locate its tag.
Then put the URLs into the blocking queue:
/** Blocking queue */
private ArrayBlockingQueue<HashMap<String, String>> arrayBlockingQueue = new ArrayBlockingQueue<>(1000);

@Override
public void putAllArrayBlockingQueue(List<HashMap<String, String>> list) {
    if (!CollectionUtils.isEmpty(list)) {
        arrayBlockingQueue.addAll(list);
    }
}
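One thing to keep in mind with addAll on an ArrayBlockingQueue: it does not block when the queue is full, it throws IllegalStateException. With 268 concepts and a capacity of 1000 that should not happen here, but the sketch below shows the behavior with a deliberately tiny queue, plus a timed offer as one possible alternative. The class name, method names, and placeholder URLs are made up for the demo.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueCapacityDemo {

    /** Returns true if addAll succeeded, false if the queue rejected the batch as too large. */
    public static boolean addAllFits(int capacity, List<String> urls) {
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        try {
            queue.addAll(urls);
            return true;
        } catch (IllegalStateException queueFull) {
            return false;
        }
    }

    /** Offers each URL, waiting up to timeoutMillis per item; returns how many were accepted. */
    public static int offerAll(ArrayBlockingQueue<String> queue, List<String> urls, long timeoutMillis) {
        int accepted = 0;
        try {
            for (String url : urls) {
                if (queue.offer(url, timeoutMillis, TimeUnit.MILLISECONDS)) {
                    accepted++;
                }
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag and stop offering
            Thread.currentThread().interrupt();
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("url-1", "url-2", "url-3");
        System.out.println(addAllFits(1000, urls)); // true
        System.out.println(addAllFits(2, urls));    // false
        System.out.println(offerAll(new ArrayBlockingQueue<>(2), urls, 10)); // 2
    }
}
```

If the list of boards ever grows past the queue capacity, put (which blocks) or a timed offer is probably the safer call on the producer side.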
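Then start several threads that take URLs from the blocking queue and scrape each concept board's stock data. If a page is paginated, keep clicking next page and collecting data until the last page. The code is as follows: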
package com.ths.service.impl;
import com.ths.dao.StockThsGnInfoDao;
import com.ths.domain.StockThsGnInfo;
import com.ths.service.ThsGnDetailCrawlService;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.springframework.util.CollectionUtils;
import org.springframework.util.StringUtils;
import javax.annotation.PostConstruct;
import java.math.BigDecimal;
import java.text.SimpleDateFormat;
import java.util.*;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;
@Service
public class ThsGnDetailCrawlServiceImpl implements ThsGnDetailCrawlService {

    private final static Logger LOGGER = LoggerFactory.getLogger(ThsGnDetailCrawlServiceImpl.class);

    /** Blocking queue */
    private ArrayBlockingQueue<HashMap<String, String>> arrayBlockingQueue = new ArrayBlockingQueue<>(1000);

    @Autowired
    private StockThsGnInfoDao stockThsGnInfoDao;

    @Override
    public void putAllArrayBlockingQueue(List<HashMap<String, String>> list) {
        if (!CollectionUtils.isEmpty(list)) {
            arrayBlockingQueue.addAll(list);
        }
    }

    @Override
    public void ConsumeCrawlerGnDetailData(int threadNumber) {
        for (int i = 0; i < threadNumber; ++i) {
            LOGGER.info("Starting consumer thread [{}]", i);
            new Thread(new crawlerGnDataThread()).start();
        }
        LOGGER.info("Started [{}] consumer threads in total", threadNumber);
    }

    class crawlerGnDataThread implements Runnable {
        @Override
        public void run() {
            try {
                while (true) {
                    Map<String, String> map = arrayBlockingQueue.take();
                    String url = map.get("url");
                    String gnName = map.get("gnName");
                    String crawlerDateStr = new SimpleDateFormat("yyyy-MM-dd HH:00:00").format(new Date());
                    // Location of chromedriver
                    System.setProperty("webdriver.chrome.driver", "/Users/admin/Documents/selenium/chrome/79.0.3945.36/chromedriver");
                    ChromeOptions options = new ChromeOptions();
                    // Headless mode (no UI)
                    // options.addArguments("headless");
                    // Disable the sandbox; this flag cost me a whole day
                    // options.addArguments("no-sandbox");
                    WebDriver webDriver = new ChromeDriver(options);
                    try {
                        webDriver.manage().timeouts().implicitlyWait(5, TimeUnit.SECONDS);
                        webDriver.get(url);
                        Thread.sleep(1000L);
                        String oneGnHtml = webDriver.getPageSource();
                        LOGGER.info("Current concept: [{}], html: [{}]", gnName, oneGnHtml);
                        LOGGER.info(oneGnHtml);
                        // TODO parse and store the data
                        parseHtmlAndInsertData(oneGnHtml, gnName, crawlerDateStr);
                        clicktoOneGnNextPage(webDriver, oneGnHtml, gnName, crawlerDateStr);
                    } catch (Exception e) {
                        LOGGER.error("Exception while scraping with ChromeDriver, url [{}]:", url, e);
                    } finally {
                        webDriver.close();
                        webDriver.quit();
                    }
                }
            } catch (Exception e) {
                LOGGER.error("Exception in the blocking-queue consumer loop:", e);
            }
        }
    }
    public void parseHtmlAndInsertData(String html, String gnName, String crawlerDateStr) {
        Document document = Jsoup.parse(html);
        // Element boardElement = document.getElementsByClass("board-hq").get(0);
        // String gnCode = boardElement.getElementsByTag("h3").get(0).getElementsByTag("span").get(0).text();
        Element table = document.getElementsByClass("m-pager-table").get(0);
        Element tBody = table.getElementsByTag("tbody").get(0);
        Elements trs = tBody.getElementsByTag("tr");
        for (Element tr : trs) {
            try {
                Elements tds = tr.getElementsByTag("td");
                String stockCode = tds.get(1).text();
                String stockName = tds.get(2).text();
                BigDecimal stockPrice = parseValueToBigDecimal(tds.get(3).text());
                BigDecimal stockChange = parseValueToBigDecimal(tds.get(4).text());
                BigDecimal stockChangePrice = parseValueToBigDecimal(tds.get(5).text());
                BigDecimal stockChangeSpeed = parseValueToBigDecimal(tds.get(6).text());
                BigDecimal stockHandoverScale = parseValueToBigDecimal(tds.get(7).text());
                BigDecimal stockLiangBi = parseValueToBigDecimal(tds.get(8).text());
                BigDecimal stockAmplitude = parseValueToBigDecimal(tds.get(9).text());
                BigDecimal stockDealAmount = parseValueToBigDecimal(tds.get(10).text());
                BigDecimal stockFlowStockNumber = parseValueToBigDecimal(tds.get(11).text());
                BigDecimal stockFlowMakertValue = parseValueToBigDecimal(tds.get(12).text());
                BigDecimal stockMarketTtm = parseValueToBigDecimal(tds.get(13).text());
                // Store the data
                StockThsGnInfo stockThsGnInfo = new StockThsGnInfo();
                stockThsGnInfo.setGnName(gnName);
                stockThsGnInfo.setGnCode(null);
                stockThsGnInfo.setStockCode(stockCode);
                stockThsGnInfo.setStockName(stockName);
                stockThsGnInfo.setStockPrice(stockPrice);
                stockThsGnInfo.setStockChange(stockChange);
                stockThsGnInfo.setStockChangePrice(stockChangePrice);
                stockThsGnInfo.setStockChangeSpeed(stockChangeSpeed);
                stockThsGnInfo.setStockHandoverScale(stockHandoverScale);
                stockThsGnInfo.setStockLiangBi(stockLiangBi);
                stockThsGnInfo.setStockAmplitude(stockAmplitude);
                stockThsGnInfo.setStockDealAmount(stockDealAmount);
                stockThsGnInfo.setStockFlowStockNumber(stockFlowStockNumber);
                stockThsGnInfo.setStockFlowMakertValue(stockFlowMakertValue);
                stockThsGnInfo.setStockMarketTtm(stockMarketTtm);
                stockThsGnInfo.setCrawlerTime(crawlerDateStr);
                stockThsGnInfo.setCrawlerVersion("同花順概念板塊#" + crawlerDateStr);
                stockThsGnInfo.setCreateTime(new Date());
                stockThsGnInfo.setUpdateTime(new Date());
                stockThsGnInfoDao.insert(stockThsGnInfo);
            } catch (Exception e) {
                LOGGER.error("Exception while inserting 同花順 concept board data:", e);
            }
        }
    }

    public BigDecimal parseValueToBigDecimal(String value) {
        if (StringUtils.isEmpty(value)) {
            return BigDecimal.ZERO;
        } else if ("--".equals(value)) {
            return BigDecimal.ZERO;
        } else if (value.endsWith("億")) {
            // Strip the trailing unit; the number stays in units of 億 (multiplying by ONE is a no-op)
            return new BigDecimal(value.substring(0, value.length() - 1)).multiply(BigDecimal.ONE);
        }
        return new BigDecimal(value);
    }
    public boolean clicktoOneGnNextPage(WebDriver webDriver, String oneGnHtml, String key, String crawlerDateStr) throws InterruptedException {
        // Check whether there is a next page
        String pageNumber = includeNextPage(oneGnHtml);
        if (!StringUtils.isEmpty(pageNumber)) {
            WebElement nextPageElement = webDriver.findElement(By.linkText("下一頁"));
            webDriver.manage().timeouts().implicitlyWait(5, TimeUnit.SECONDS);
            nextPageElement.click();
            Thread.sleep(700);
            String nextPageHtml = webDriver.getPageSource();
            LOGGER.info("Next page:");
            LOGGER.info(nextPageHtml);
            // TODO parse and store the data
            parseHtmlAndInsertData(nextPageHtml, key, crawlerDateStr);
            clicktoOneGnNextPage(webDriver, nextPageHtml, key, crawlerDateStr);
        }
        return true;
    }

    public String includeNextPage(String html) {
        Document document = Jsoup.parse(html);
        List<Element> list = document.getElementsByTag("a");
        for (Element element : list) {
            String a = element.text();
            if ("下一頁".equals(a)) {
                String pageNumber = element.attr("page");
                return pageNumber;
            }
        }
        return null;
    }
}
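The recursion in clicktoOneGnNextPage is bounded by the number of pages, so it works, but the same idea can also be written as a plain loop with a page cap as a safety net. The sketch below shows only that control flow. Pager is a hypothetical interface I made up to stand in for the browser (it is not a Selenium type), and fakePager is an in-memory stand-in so the loop can run without chrome.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PaginationLoop {

    /** Hypothetical abstraction over the browser; not part of Selenium. */
    public interface Pager {
        String currentPage();
        boolean hasNextPage();
        void goToNextPage();
    }

    /** Collects the current page and every following page, stopping at maxPages as a guard. */
    public static List<String> collectAllPages(Pager pager, int maxPages) {
        List<String> pages = new ArrayList<>();
        pages.add(pager.currentPage());
        while (pager.hasNextPage() && pages.size() < maxPages) {
            pager.goToNextPage();
            pages.add(pager.currentPage());
        }
        return pages;
    }

    /** In-memory stand-in used only to exercise the loop. */
    public static Pager fakePager(final List<String> pages) {
        return new Pager() {
            private int index = 0;
            @Override public String currentPage() { return pages.get(index); }
            @Override public boolean hasNextPage() { return index < pages.size() - 1; }
            @Override public void goToNextPage() { index++; }
        };
    }

    public static void main(String[] args) {
        Pager pager = fakePager(Arrays.asList("p1", "p2", "p3"));
        System.out.println(collectAllPages(pager, 100)); // [p1, p2, p3]
    }
}
```

In the real crawler, hasNextPage would roughly correspond to includeNextPage returning a non-empty value, and goToNextPage to the click followed by the short sleep.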
Finally, the data from each concept board page is parsed and stored in the database.
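A note on units: parseValueToBigDecimal above strips the trailing unit and keeps the number as displayed, so a value shown in hundreds of millions is stored in that unit. If you would rather store absolute amounts, something like the sketch below could work. It is an alternative, not what the project does. I am assuming the page may show either a "hundred million" unit (U+5104 in Traditional form, U+4EBF in Simplified) or a "ten thousand" unit (U+842C / U+4E07); the characters are written as Unicode escapes to keep the source ASCII-only, and the class name is made up.

```java
import java.math.BigDecimal;

public class UnitAmountParser {

    private static final BigDecimal TEN_THOUSAND = new BigDecimal("10000");
    private static final BigDecimal HUNDRED_MILLION = new BigDecimal("100000000");

    /** Parses a displayed amount into an absolute number; blank values and "--" become zero. */
    public static BigDecimal parseAmount(String raw) {
        if (raw == null) {
            return BigDecimal.ZERO;
        }
        String value = raw.trim();
        if (value.isEmpty() || "--".equals(value)) {
            return BigDecimal.ZERO;
        }
        char unit = value.charAt(value.length() - 1);
        String number = value.substring(0, value.length() - 1);
        // U+5104 / U+4EBF: "hundred million" in Traditional / Simplified form
        if (unit == '\u5104' || unit == '\u4ebf') {
            return new BigDecimal(number).multiply(HUNDRED_MILLION);
        }
        // U+842C / U+4E07: "ten thousand" in Traditional / Simplified form (assumed to appear)
        if (unit == '\u842c' || unit == '\u4e07') {
            return new BigDecimal(number).multiply(TEN_THOUSAND);
        }
        return new BigDecimal(value);
    }

    public static void main(String[] args) {
        System.out.println(parseAmount("1.5\u4ebf").stripTrailingZeros().toPlainString()); // 150000000
        System.out.println(parseAmount("--"));    // 0
        System.out.println(parseAmount("12.34")); // 12.34
    }
}
```

Whichever convention you pick, it helps to record the unit next to the column so later analysis does not mix the two.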
Data display