Fetching Elasticsearch content with Python

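Using the content we added to Elasticsearch in the previous post as an example, this post walks through fetching that content and extracting the useful information from it.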

First, let's look at what is stored in Elasticsearch:

{
  "took": 88,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "megacorp", "_type": "employee", "_id": "2", "_score": 1,
        "_source": {"first_name": "Jane", "last_name": "Smith", "age": 32, "about": "I like to collect rock albums", "interests": ["music"]}
      },
      {
        "_index": "megacorp", "_type": "employee", "_id": "1", "_score": 1,
        "_source": {"first_name": "John", "last_name": "Smith", "age": 25, "about": "I love to go rock climbing", "interests": ["sports", "music"]}
      },
      {
        "_index": "megacorp", "_type": "employee", "_id": "3", "_score": 1,
        "_source": {"first_name": "Douglas", "last_name": "Fir", "age": 35, "about": "I like to build cabinets", "interests": ["forestry"]}
      }
    ]
  }
}

1. In Python we first need the urllib package to make the request, and the json module to parse the response, which is returned as JSON.

import urllib.request as request
import json

2. Next, we build a request for the search URL and open it with urlopen:

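if __name__ == '__main__':
    # Build the search request for the megacorp index, employee type
    req = request.Request("http://localhost:9200/megacorp/employee/_search")
    # Send it; urlopen returns a file-like response object
    resp = request.urlopen(req)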
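The call to urlopen raises an exception if Elasticsearch is not running or returns an error status. A minimal sketch of guarding against that is shown here; `fetch_bytes`, its `opener` parameter, and the two stand-in openers are hypothetical names introduced for illustration so the example runs without a live cluster.

```python
import io
import urllib.error
import urllib.request as request


def fetch_bytes(url, opener=request.urlopen, timeout=5):
    """Return the raw response body, or None if the request fails."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.URLError as exc:
        # HTTPError is a subclass of URLError, so HTTP error codes land here too
        print("request failed:", exc.reason)
        return None


# Stand-in openers (assumptions, not part of urllib) that mimic urlopen.
def fake_ok(url, timeout):
    return io.BytesIO(b'{"took": 1}')


def fake_down(url, timeout):
    raise urllib.error.URLError("connection refused")


search_url = "http://localhost:9200/megacorp/employee/_search"
print(fetch_bytes(search_url, opener=fake_ok))
print(fetch_bytes(search_url, opener=fake_down))
```

Against a real cluster, leaving `opener` at its default would send the request with `urlopen` itself.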

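3. We then read resp line by line and parse the result as JSON (note that each line arrives as bytes and is decoded to a string before parsing):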

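    # (continues inside the __main__ block)
    jsonstr = ""
    for line in resp:
        # each line is bytes; decode it and append to the JSON string
        jsonstr += line.decode()
    data = json.loads(jsonstr)
    print(data)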
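Since the response object is file-like, the body can also be read in one call rather than line by line. The following is a small sketch of that alternative; `read_json` is a hypothetical helper, and an `io.BytesIO` with a made-up body stands in for the object returned by urlopen.

```python
import io
import json


def read_json(resp, encoding="utf-8"):
    """Read a file-like HTTP response in one go and parse it as JSON."""
    raw = resp.read()  # bytes for the whole body
    return json.loads(raw.decode(encoding))


# Stand-in for the urlopen result (assumed sample body, not real server output).
fake_resp = io.BytesIO(b'{"hits": {"total": 3, "hits": []}}')
data = read_json(fake_resp)
print(data["hits"]["total"])
```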

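4. The result, however, contains metadata fields as well as the content. We only want the content, so we drill down through each level of keys to reach it: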

    # (continues inside the __main__ block)
    employees = data['hits']['hits']
    for e in employees:
        # the document itself lives under the _source key of each hit
        _source = e['_source']
        full_name = _source['first_name'] + "." + _source['last_name']
        age = _source["age"]
        about = _source["about"]
        interests = _source["interests"]
        print(full_name, 'is', age, ",")
        print(full_name, "info is", about)
        print(full_name, 'likes', interests)

The output is:

Jane.Smith is 32 ,
Jane.Smith info is I like to collect rock albums
Jane.Smith likes ['music']
John.Smith is 25 ,
John.Smith info is I love to go rock climbing
John.Smith likes ['sports', 'music']
Douglas.Fir is 35 ,
Douglas.Fir info is I like to build cabinets
Douglas.Fir likes ['forestry']
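The extraction step can also be wrapped in a function that returns the values instead of printing them, which makes it easier to reuse. This is only a sketch: `describe_employees` is a hypothetical name, and the use of `.get` with defaults is an assumption about how missing fields should be handled. The sample dict is cut down from the search response shown earlier.

```python
def describe_employees(data):
    """Return one (full_name, age, about, interests) tuple per search hit."""
    rows = []
    for hit in data.get("hits", {}).get("hits", []):
        src = hit.get("_source", {})
        full_name = src.get("first_name", "") + "." + src.get("last_name", "")
        rows.append((full_name, src.get("age"), src.get("about"),
                     src.get("interests", [])))
    return rows


# Reduced copy of the response above, containing a single hit.
sample = {"hits": {"hits": [{"_source": {
    "first_name": "Jane", "last_name": "Smith", "age": 32,
    "about": "I like to collect rock albums", "interests": ["music"]}}]}}

for name, age, about, interests in describe_employees(sample):
    print(name, "is", age)
```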

For content that needs to be aggregated, we can retrieve it as follows:

1. Set the request URL

url="http://localhost:9200/megacorp/employee/_search"

2. Write the aggregation query

data = '''
{"aggs": {"all_interests": {
    "terms": {"field": "interests"},
    "aggs": {"avg_age": {"avg": {"field": "age"}}}
}}}'''
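A hand-written JSON string is easy to get wrong when braces are nested this deeply. One alternative, sketched here, is to build the same query as a Python dict and let `json.dumps` serialize it; the variable names `query` and `body` are illustrative only.

```python
import json

# Same aggregation as the string above: group by interests, average the age.
query = {
    "aggs": {
        "all_interests": {
            "terms": {"field": "interests"},
            "aggs": {"avg_age": {"avg": {"field": "age"}}},
        }
    }
}

# urllib expects the request body as bytes
body = json.dumps(query).encode("utf-8")
print(body.decode("utf-8"))
```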

3. Specify the request headers

headers={"Content-Type":"application/json"}

4. As before, send the request, read the response, and parse it as JSON

# The query travels in the request body, so it must be encoded to bytes
req = request.Request(url=url, data=data.encode(), headers=headers, method="GET")
resp = request.urlopen(req)
jsonstr = ""
for line in resp:
    jsonstr += line.decode()
rsdata = json.loads(jsonstr)
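Elasticsearch's search endpoint also accepts POST, which can be a safer choice when something between the client and the cluster does not forward a body on GET. The sketch below builds such a request and inspects it without sending it; switching to POST is an alternative to the method used above, not what the original code does.

```python
import json
import urllib.request as request

url = "http://localhost:9200/megacorp/employee/_search"
query = {"aggs": {"all_interests": {
    "terms": {"field": "interests"},
    "aggs": {"avg_age": {"avg": {"field": "age"}}}}}}
body = json.dumps(query).encode("utf-8")

# Build the request only; nothing is sent until urlopen(req) is called.
req = request.Request(url=url, data=body,
                      headers={"Content-Type": "application/json"},
                      method="POST")
print(req.get_method(), req.full_url)
```

Note that urllib stores header names in capitalized form, so the header is looked up as `Content-type` on the request object.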

5. The useful aggregation results are again held in an array (the buckets), so we iterate over it to print them:

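agg = rsdata['aggregations']
buckets = agg['all_interests']['buckets']
for b in buckets:
    key = b['key']
    doc_count = b['doc_count']
    avg_age = b['avg_age']['value']
    # Labels are pinyin: aihao = interest, gongyou = in total, ren = people,
    # tamenpingjuageshi = their average age is
    print('aihao', key, 'gongyou', doc_count, 'ren,tamenpingjuageshi', avg_age)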

The final output is:

aihao music gongyou 2 ren,tamenpingjuageshi 28.5
aihao forestry gongyou 1 ren,tamenpingjuageshi 35.0
aihao sports gongyou 1 ren,tamenpingjuageshi 25.0
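If the aggregation results are needed for further processing rather than printing, the buckets can be collected into a dict. This is a sketch under assumptions: `interest_stats` is a hypothetical helper, and the sample response below is reconstructed from the printed output above, keeping only the fields the code reads.

```python
def interest_stats(rsdata, agg_name="all_interests"):
    """Map each interest to (number of people, average age)."""
    buckets = rsdata.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    return {b["key"]: (b["doc_count"], b["avg_age"]["value"]) for b in buckets}


# Reduced aggregation response, rebuilt from the output shown above.
sample = {"aggregations": {"all_interests": {"buckets": [
    {"key": "music", "doc_count": 2, "avg_age": {"value": 28.5}},
    {"key": "forestry", "doc_count": 1, "avg_age": {"value": 35.0}},
    {"key": "sports", "doc_count": 1, "avg_age": {"value": 25.0}},
]}}}

stats = interest_stats(sample)
for interest, (count, avg_age) in stats.items():
    print(interest, count, avg_age)
```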

Reposted from: https://www.cnblogs.com/qianshuixianyu/p/9287556.html
