Doc2Vec in Practice

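Contents:
- Preface:
- Step 1: First we need to obtain the data. The code is as follows:
- Step 2: With the data in hand, train on it to generate the model. The code is as follows:
- Step 3: With the trained model, we can feed in a question and get back similar questions. The code is as follows:
- Step 4: Display the similar questions that were returned, and we are done.
- Summary: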

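Step 1: First we need to obtain the data. The code is as follows:

    import jieba
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    def get_datasest():
        # One question per line; strip(' ') removes leading and trailing spaces.
        fin = open("questions.txt", encoding='utf8').read().strip(' ')
        # Load a custom dictionary so jieba can segment or regroup phrases it cannot handle by default.
        jieba.load_userdict("userdict.txt")
        # Load a custom stopword list used to remove stopwords from the sentences.
        stopwords = set(open('stopwords.txt', encoding='utf8').read().strip('\n').split('\n'))
        # Segment the whole file and drop every token found in the stopword list.
        text = ' '.join([x for x in jieba.lcut(fin) if x not in stopwords])
        print(type(text), len(text))
        x_train = []
        # Split the segmented text back into one entry per line.
        word_list = text.split('\n')
        print(word_list[0])
        for i, sub_list in enumerate(word_list):
            # TaggedDocument expects a list of tokens; a plain string would be read character by character.
            document = TaggedDocument(sub_list.split(), tags=[i])
            # document is a namedtuple, e.g.
            # TaggedDocument(words=['楊千嬅', '現在', '教育', '變成', '一種', '生意'], tags=[42732])
            x_train.append(document)
        return x_train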
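To make the shape of the training data concrete, here is a minimal, library-free sketch of the same preparation step. It assumes the text has already been segmented into space-separated tokens, so jieba is not needed. `TaggedDoc`, `build_documents`, the sample lines, and the stopword set are all hypothetical stand-ins for illustration, not gensim's API.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument (a words/tags pair).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def build_documents(lines, stopwords):
    """Turn pre-segmented lines into tagged token lists, one per line."""
    documents = []
    for i, line in enumerate(lines):
        # Keep each document as a list of tokens, not one long string.
        words = [w for w in line.split() if w not in stopwords]
        # The tag is simply the line index, as in the article's code.
        documents.append(TaggedDoc(words=words, tags=[i]))
    return documents


# Made-up, already segmented sample questions and stopwords.
sample_lines = ["how do I apply for a loan", "what is the interest rate"]
sample_stopwords = {"do", "I", "a", "is", "the", "for"}

docs = build_documents(sample_lines, sample_stopwords)
for doc in docs:
    print(doc.tags, doc.words)
```

Using the line index as the tag is what later lets a result such as `(89, 0.73)` be traced back to line 89 of the input.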

Step 2: With the data in hand, train on it to generate the model. The code is as follows:

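    def train(x_train, size=200, epoch_num=1):
        # Doc2Vec parameters:
        # min_count: ignore all words whose total frequency is lower than this value.
        # window: window size (maximum distance between the current and the predicted word in a sentence).
        # vector_size: dimensionality of the feature vectors (named `size` in older gensim releases).
        # sample: threshold for randomly downsampling high-frequency words; default 1e-3, useful range (0, 1e-5).
        # negative: if > 0, negative sampling is used and this sets how many noise words are drawn (usually 5-20); default 5.
        # workers: number of worker threads used for training.
        # epochs: number of passes over the corpus (named `iter` in older gensim releases).
        # Passing x_train to the constructor builds the vocabulary and already trains for `epochs` passes.
        model_dm = Doc2Vec(x_train, min_count=1, window=5, vector_size=size, sample=1e-3,
                           negative=5, workers=4, hs=1, epochs=6)
        # total_examples: number of documents in the corpus.
        # epochs: number of passes over the corpus for this additional training run.
        # Note: the epoch_num argument is currently unused; the epoch count is fixed at 70.
        model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=70)
        model_dm.save('model_test')
        return model_dm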
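The `sample` threshold is easier to reason about with numbers. The sketch below uses the keep-probability formula from the original word2vec C implementation, which I believe gensim follows as well; treat that as an assumption and check the source of your gensim version. `keep_probability` and the example frequencies are hypothetical and only meant to show the trend.

```python
import math


def keep_probability(word_freq, sample=1e-3):
    """Probability of keeping one occurrence of a word during training.

    word_freq is the word's share of all tokens in the corpus (0 < word_freq <= 1).
    Formula assumed from the word2vec C code: (sqrt(f / t) + 1) * (t / f).
    """
    if sample <= 0:
        return 1.0  # subsampling disabled
    p = (math.sqrt(word_freq / sample) + 1) * (sample / word_freq)
    # Words at or below the threshold are always kept.
    return min(p, 1.0)


# Made-up relative frequencies, from very common to rare.
for freq in (0.05, 0.01, 0.001, 0.0001):
    print(f"freq={freq:<7} keep={keep_probability(freq):.3f}")
```

The more frequent a word is relative to the threshold, the more of its occurrences are skipped, which is why a smaller `sample` value thins out common words more aggressively.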

Step 3: With the trained model, we can feed in a question and get back similar questions. The code is as follows:

    def test():
        model_dm = Doc2Vec.load("model_test")
        test_ = '申請貸款需要什么條件？'
        # Load the stopword list.
        stopwords = set(open('stopwords.txt', encoding='utf8').read().strip('\n').split('\n'))
        # Segment the query and drop every token found in the stopword list.
        test_text = ' '.join([x for x in jieba.lcut(test_) if x not in stopwords])
        print(test_text)
        # Infer a vector for the input sentence; infer_vector expects a list of tokens, not a string.
        inferred_vector_dm = model_dm.infer_vector(doc_words=test_text.split())
        # Return the most similar sentences (docvecs is named dv in gensim 4.x).
        sims = model_dm.docvecs.most_similar([inferred_vector_dm], topn=10)
        return sims

Step 4: Display the similar questions that were returned, and we are done.

    if __name__ == '__main__':
        x_train = get_datasest()
        # model_dm = train(x_train)  # uncomment to train and save the model first
        sims = test()
        # sims: [(89, 0.730167031288147), (6919, 0.6993225812911987), (6856, 0.6860911250114441),
        #        (40892, 0.6508388519287109), (40977, 0.6465731859207153), (30707, 0.6388640403747559),
        #        (40160, 0.6366203427314758), (11672, 0.6353889107704163), (16752, 0.6346361637115479),
        #        (40937, 0.6337493062019348)]
        # Each element of sims is a tuple with two items: the index of the matched sentence
        # (the tag assigned earlier) and the corresponding similarity score.
        for count, sim in sims:
            sentence = str(x_train[count])
            print(sentence, sim, len(sentence))
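Because each tag is just the line index assigned during data preparation, the results can also be mapped back to the raw question text instead of the tokenized document. A minimal sketch, assuming `questions` holds the original lines in the same order that was used for training; `format_matches` and the sample data are made up for illustration.

```python
def format_matches(sims, questions):
    """Pair each (tag, similarity) result with the original question text."""
    rows = []
    for tag, score in sims:
        # Tags are line indices, so they index straight into the question list.
        rows.append((questions[tag], round(score, 4)))
    return rows


# Made-up data standing in for the lines of questions.txt and a most_similar() result.
questions = [
    "How do I apply for a loan?",
    "What is the interest rate?",
    "How do I repay early?",
]
sims = [(0, 0.7301), (2, 0.6993)]

for text, score in format_matches(sims, questions):
    print(f"{score:.4f}  {text}")
```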

Of course, along the way you can also retrieve the vector of each sentence, as follows:

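    import numpy as np

    def getVecs(model, corpus, size):
        # Look up each document's trained vector by its tag and stack them into one (n_docs, size) array.
        vecs = [np.array(model.docvecs[z.tags[0]].reshape(1, size)) for z in corpus]
        return np.concatenate(vecs)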

Summary:

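Although the program uses a custom stopword list and a custom dictionary, the overall results are still unsatisfactory. Early on, before any parameter tuning, I even ran into the awkward situation where the same input produced different results across runs. That problem was later resolved through parameter tuning, but others remain: for example, when the input is question A and the model also contains question A, A is sometimes not the top-ranked result among the similar questions returned. After some reading I found that other people who tried this experiment also got poor results. (I am not yet sure why these problems occur. According to the theory behind Doc2Vec the results should not be this bad, but practice proved otherwise.) So my tentative conclusion is that doc2vec is hit-or-miss: its results depend heavily on chance and are unstable. I have since found another method that meets my needs. It also uses sentence vectors and also ranks similar sentences by cosine similarity, its theory is simpler than doc2vec's, and it works better than doc2vec. Once I have written it up, I will cover it in the next article.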
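For reference, ranking sentences by cosine similarity, which both `most_similar` and the alternative mentioned above rely on, takes only a few lines. This is a generic sketch with made-up three-dimensional vectors, not the method to be covered in the next article; `cosine_similarity` and `rank_by_similarity` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, vectors, topn=10):
    """Return (index, similarity) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up sentence vectors; real ones would have hundreds of dimensions.
sentence_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
ranked = rank_by_similarity([1.0, 0.0, 0.0], sentence_vectors, topn=2)
print(ranked)
```

With an exact cosine ranking like this, a query vector identical to a stored vector always comes out on top, so a stored question that fails to rank first for itself points at the inferred vector differing from the trained one rather than at the ranking step.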

If you repost this article, please credit the source: http://www.pswp.cn/news/456558.shtml