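Contents
- 1. Scraping the Hundred Family Surnames
  - 1. Scraping code
  - 2. Scraping result
- 2. Scraping given names
  - 1. Filtering boys' names
  - 2. Filtering girls' names
  - 3. Data processing (removing duplicates)
  - 4. Assembling the data
  - 5. Writing the data to a file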
1. Scraping the Hundred Family Surnames
The target sites below are used for experimental purposes only.
① Surnames
Site: https://hanyu.baidu.com/shici/detail?from=aladdin&pid=0b2f26d4c0ddb3ee693fdb1137ee1b0d&showPinyin=1
② Boys' names
Site: https://baijiahao.baidu.com/s?id=1744863812577130101&wfr=spider&for=pc
③ Girls' names
Site: https://baijiahao.baidu.com/s?id=1743833274577209720&wfr=spider&for=pc
1. Scraping code
1. Crawler function (uses a conversion stream on top of the connection's input stream)
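// Imports needed at the top of the source file
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

/**
 * Fetches a page from the network and concatenates its content into one string.
 *
 * @param net the URL to fetch
 * @return the fetched content
 */
public static String webCrawler(String net) throws IOException {
    // Accumulates the fetched data
    StringBuilder sb = new StringBuilder();
    // Create a URL object
    URL url = new URL(net);
    // Open the network connection
    URLConnection conn = url.openConnection();
    // Read through a conversion stream (bytes to characters).
    // UTF-8 is an assumption here; without it the platform default charset is used.
    try (InputStreamReader isr = new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8)) {
        int ch;
        while ((ch = isr.read()) != -1) {
            sb.append((char) ch);
        }
    } // try-with-resources releases the stream
    // Return everything that was read
    return sb.toString();
}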
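The conversion stream is the step where bytes become characters, so the charset it is given decides whether the Chinese text survives. The sketch below is not part of the original code: `readAll` is a hypothetical helper, and an in-memory byte array stands in for the network stream so that it can run offline. It decodes the same UTF-8 bytes twice, once with the matching charset and once with a mismatched one.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StreamDecodingSketch {

    // Hypothetical helper: decodes the whole stream with the given charset.
    // IOException is wrapped so callers do not need a throws clause.
    public static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the bytes a UTF-8 page would send
        byte[] utf8Bytes = "趙錢孫李".getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);

        System.out.println(right);          // the four original characters
        System.out.println(wrong.length()); // one char per byte, so the text is garbled
    }
}
```

Reading into a `char[]` buffer rather than one character at a time is only a performance choice; the decoding behaviour is the same.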
2. Data-filtering function (regular expressions)
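// Imports needed at the top of the source file
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Extracts data from a string using a regular expression.
 *
 * @param str   the full string to search
 * @param rule  the regular expression
 * @param index the capture group to collect from each match
 * @return the captured strings, in match order
 */
private static ArrayList<String> getData(String str, String rule, int index) {
    // Holds the extracted data
    ArrayList<String> list = new ArrayList<>();
    // Compile the pattern
    Pattern compile = Pattern.compile(rule);
    // Match the pattern against the string
    Matcher matcher = compile.matcher(str);
    while (matcher.find()) {
        String group = matcher.group(index);
        list.add(group);
    }
    return list;
}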
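The `index` argument selects which capture group of each match is collected. As a small illustration (the sample string is made up, and `collect` is a stand-in that mirrors `getData` so the example runs on its own), the sketch below prints groups 0, 1 and 2 for the surname pattern used later in `main`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupIndexSketch {

    // Stand-in for getData: collects one capture group from every match.
    public static List<String> collect(String text, String rule, int index) {
        List<String> out = new ArrayList<>();
        Matcher matcher = Pattern.compile(rule).matcher(text);
        while (matcher.find()) {
            out.add(matcher.group(index));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample: two four-character groups, each followed by
        // full-width punctuation
        String sample = "趙錢孫李,周吳鄭王。";
        String rule = "(.{4})(,|。)";

        System.out.println(collect(sample, rule, 0)); // whole match, punctuation included
        System.out.println(collect(sample, rule, 1)); // [趙錢孫李, 周吳鄭王]
        System.out.println(collect(sample, rule, 2)); // only the punctuation marks
    }
}
```

Group 1 is the one the article passes, because it keeps the four characters and drops the trailing punctuation.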
3. Main function (main)
public class Test1 {
    public static void main(String[] args) throws IOException {
        // URLs of the pages to scrape
        String familyNameNet = "https://hanyu.baidu.com/shici/detail?from=aladdin&pid=0b2f26d4c0ddb3ee693fdb1137ee1b0d&showPinyin=1";
        String boyName = "https://baijiahao.baidu.com/s?id=1744863812577130101&wfr=spider&for=pc";
        String girlName = "https://baijiahao.baidu.com/s?id=1743833274577209720&wfr=spider&for=pc";

        // Fetch each page and concatenate all of its content into one string
        String family = webCrawler(familyNameNet);
        String boy = webCrawler(boyName);
        String girl = webCrawler(girlName);

        // Filter the data with a regular expression:
        // four characters followed by a full-width comma or full stop
        ArrayList<String> familyNameTemp = getData(family, "(.{4})(,|。)", 1);
        System.out.println(familyNameTemp);
    }

    // webCrawler and getData from the previous steps are static methods of this class
}
2. Scraping result
The matches are stored in a collection (ArrayList).
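Each element captured by `(.{4})` is a group of four characters rather than a single surname, while the later `getInfos` call takes a `familyList` whose construction is not shown in this article. The following is one possible sketch of that missing step, under the assumption that every character is a one-character surname. Compound surnames such as 司馬 would be split incorrectly by it, and `splitSurnames` is a hypothetical helper name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SurnameSplitSketch {

    // Hypothetical helper: turns four-character groups into single-character
    // surnames. Assumes one character per surname (not true for compound surnames).
    public static ArrayList<String> splitSurnames(List<String> groups) {
        ArrayList<String> familyList = new ArrayList<>();
        for (String group : groups) {
            for (int i = 0; i < group.length(); i++) {
                familyList.add(String.valueOf(group.charAt(i)));
            }
        }
        return familyList;
    }

    public static void main(String[] args) {
        // Made-up stand-in for familyNameTemp
        List<String> familyNameTemp = Arrays.asList("趙錢孫李", "周吳鄭王");
        ArrayList<String> familyList = splitSurnames(familyNameTemp);
        System.out.println(familyList); // [趙, 錢, 孫, 李, 周, 吳, 鄭, 王]
    }
}
```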
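2. Scraping given names
1. Filtering boys' names
Use a regular expression that matches Chinese characters: two characters in the range \u4E00-\u9FA5, followed by "、" or "。".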
ArrayList<String> boyNameTemp = getData(boy, "([\\u4E00-\\u9FA5]{2})(、|。)", 1);
System.out.println(boyNameTemp);
2. Filtering girls' names
ArrayList<String> girlNameTemp = getData(girl, "([\\u4E00-\\u9FA5]{2})(、|。)", 1);
System.out.println(girlNameTemp);
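The pattern used for both name lists captures the two Chinese characters that sit immediately before a "、" or "。". A short sketch on made-up input shows what that means in practice: ordinary two-character words in front of the same punctuation are captured too, so the scraped lists may need a manual check. `twoCharNames` is a hypothetical helper that wraps the same pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HanNameRegexSketch {

    // Same pattern as in the article: two characters in \u4E00-\u9FA5
    // followed by an ideographic comma or full stop.
    private static final Pattern NAME =
            Pattern.compile("([\\u4E00-\\u9FA5]{2})(、|。)");

    // Hypothetical helper: returns group 1 of every match.
    public static List<String> twoCharNames(String text) {
        List<String> names = new ArrayList<>();
        Matcher matcher = NAME.matcher(text);
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // A clean, made-up list of names
        System.out.println(twoCharNames("浩然、子軒、宇航。")); // [浩然, 子軒, 宇航]
        // Ordinary prose before the delimiter is captured as well
        System.out.println(twoCharNames("好聽的名字、浩然。")); // [名字, 浩然]
    }
}
```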
3. Data processing (removing duplicates)
// Process the boys' names
// Remove duplicate elements
ArrayList<String> boyList = new ArrayList<>();
for (String str : boyNameTemp) {
    if (!boyList.contains(str)) {
        boyList.add(str);
    }
}
System.out.println(boyList);

// Process the girls' names
// Remove duplicate elements
ArrayList<String> girlList = new ArrayList<>();
for (String str : girlNameTemp) {
    if (!girlList.contains(str)) {
        girlList.add(str);
    }
}
System.out.println(girlList);
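The `contains` check above scans the whole list for every element. As an alternative sketch (a swapped-in technique, not what the article uses), a `LinkedHashSet` removes duplicates while keeping first-seen order; `dedupe` is a hypothetical helper name and the sample names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class DedupeSketch {

    // Hypothetical helper: drops duplicates and keeps the order in which
    // each name first appeared.
    public static ArrayList<String> dedupe(List<String> names) {
        return new ArrayList<>(new LinkedHashSet<>(names));
    }

    public static void main(String[] args) {
        List<String> boyNameTemp = Arrays.asList("浩然", "子軒", "浩然", "宇航", "子軒");
        System.out.println(dedupe(boyNameTemp)); // [浩然, 子軒, 宇航]
    }
}
```

The result is still an `ArrayList<String>`, so it could be passed on in place of `boyList` or `girlList`.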
4. Assembling the data
Assemble each collection element in the format "name-gender-age", for example 張三-男-23.
// Imports needed at the top of the source file
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.Random;

/**
 * Builds the boys' and girls' records, e.g. 張三-男-23.
 *
 * @param familyList list of surnames
 * @param boyList    list of boys' given names
 * @param girlList   list of girls' given names
 * @param boyCnt     number of boys to generate
 * @param girlCnt    number of girls to generate
 * @return the records, each in the form name-gender-age
 */
public static ArrayList<String> getInfos(ArrayList<String> familyList, ArrayList<String> boyList, ArrayList<String> girlList, int boyCnt, int girlCnt) {
    // Generate unique full names
    // Boys
    HashSet<String> boyhs = new HashSet<>();
    while (boyhs.size() < boyCnt) {
        // Pick a random surname and given name by shuffling and taking the first element
        Collections.shuffle(familyList);
        Collections.shuffle(boyList);
        boyhs.add(familyList.get(0) + boyList.get(0));
    }
    // Girls
    HashSet<String> girlhs = new HashSet<>();
    while (girlhs.size() < girlCnt) {
        // Pick a random surname and given name by shuffling and taking the first element
        Collections.shuffle(familyList);
        Collections.shuffle(girlList);
        girlhs.add(familyList.get(0) + girlList.get(0));
    }
    // Final format: 張三-男-21
    ArrayList<String> list = new ArrayList<>();
    Random random = new Random();
    // Add the boys: ages from 18 to 27
    for (String boyName : boyhs) {
        int age = random.nextInt(10) + 18;
        list.add(boyName + "-男-" + age);
    }
    // Add the girls: ages from 18 to 25
    for (String girlName : girlhs) {
        int age = random.nextInt(8) + 18;
        list.add(girlName + "-女-" + age);
    }
    return list;
}
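Two properties of the loops in `getInfos` are worth keeping in mind: they reorder the caller's lists on every iteration, and they never finish if the requested count exceeds the number of distinct surname and given-name combinations. The sketch below shows one alternative under those observations; it is not the article's method. It enumerates all distinct combinations, shuffles them once, and takes the first `count`. `pickUniqueNames` is a hypothetical helper, and the approach assumes the lists are small enough for every combination to fit in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class UniqueNameSketch {

    // Hypothetical helper: returns `count` distinct full names in random order,
    // without modifying the input lists.
    public static List<String> pickUniqueNames(List<String> surnames,
                                               List<String> givenNames,
                                               int count,
                                               Random random) {
        // Enumerate every distinct surname + given-name combination
        Set<String> all = new LinkedHashSet<>();
        for (String surname : surnames) {
            for (String given : givenNames) {
                all.add(surname + given);
            }
        }
        // Fail fast instead of looping forever on an impossible request
        if (count < 0 || count > all.size()) {
            throw new IllegalArgumentException(
                    "requested " + count + " names, only " + all.size() + " available");
        }
        List<String> shuffled = new ArrayList<>(all);
        Collections.shuffle(shuffled, random);
        return new ArrayList<>(shuffled.subList(0, count));
    }

    public static void main(String[] args) {
        // Made-up sample data
        List<String> surnames = Arrays.asList("趙", "錢");
        List<String> givenNames = Arrays.asList("浩然", "子軒");
        List<String> names = pickUniqueNames(surnames, givenNames, 3, new Random());
        System.out.println(names.size()); // 3
        System.out.println(names);        // three distinct names, order varies per run
    }
}
```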
Add the following to the main function:
ArrayList<String> infos = getInfos(familyList, boyList, girlList, 10, 10);
// Shuffle the order of the records
Collections.shuffle(infos);
System.out.println(infos);
5. Writing the data to a file
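// Write the data out, one record per line
// (needs java.io.BufferedWriter and java.io.FileWriter)
try (BufferedWriter bw = new BufferedWriter(new FileWriter("G:\\JavaReview\\day33\\names.txt"))) {
    for (String info : infos) {
        bw.write(info);
        bw.newLine();
    }
} // try-with-resources closes the writer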
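`FileWriter` without a charset argument encodes with the platform default, which can differ between machines. The write step could also be sketched with `java.nio.file.Files` and an explicit UTF-8 charset, as below. This is an illustrative variant, not the article's code: `writeLines` and `roundTrip` are hypothetical helpers, the records are made up, and a temporary file replaces the `G:\` path so the example runs anywhere.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class WriteNamesSketch {

    // Hypothetical helper: writes one record per line, encoded as UTF-8.
    public static void writeLines(Path target, List<String> infos) {
        try (BufferedWriter bw = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String info : infos) {
                bw.write(info);
                bw.newLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: writes to a temporary file, reads it back, then deletes it.
    public static List<String> roundTrip(List<String> infos) {
        try {
            Path tmp = Files.createTempFile("names", ".txt");
            try {
                writeLines(tmp, infos);
                return Files.readAllLines(tmp, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up records in the name-gender-age format
        List<String> infos = Arrays.asList("趙浩然-男-23", "錢子軒-男-19");
        System.out.println(roundTrip(infos)); // [趙浩然-男-23, 錢子軒-男-19]
    }
}
```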