Iterating over each line of a txt file in Python: python - Count (and write) each line's... in a text file

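First time posting on Stack; I have always found earlier questions sufficient to solve my problems! The main issue I am running into is the logic... even a pseudocode answer would be great.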

I am using Python to read data from each line of a text file, in the following format:

This is a tweet captured from the twitter api #hashtag http://url.com/site

Using nltk, I can tokenize line by line and then iterate with reader.sents() and so on:

reader = TaggedCorpusReader(filecorpus, r'.*\.txt', sent_tokenizer=LineTokenizer())

reader.sents()[:10]

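But I would like to count the frequency of certain "hot words" (stored in an array or similar) on each line, and then write the counts back to a text file. If I use reader.words(), I can count the frequency of the hot words across the whole text, but I am looking for the count per line (or per "sentence", in this case).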

Ideally, something like:

hotwords = (['tweet'], ['twitter'])

for each line
    tokenize into words.
    for each word in line
        if word is equal to hotword[1], hotword1 count ++
        if word is equal to hotword[2], hotword2 count ++
    at end of line, for each hotword[index]
        filewrite count,
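The pseudocode above could be turned into Python roughly as follows. This is only a minimal sketch and does not use NLTK: it assumes lower-casing plus whitespace tokenization via str.split(), and the function name count_hotwords_per_line and the sample lines are hypothetical, made up for illustration.

```python
from collections import Counter

# Hot words to look for (taken from the example in the question).
HOTWORDS = ("tweet", "twitter")


def count_hotwords_per_line(lines, hotwords=HOTWORDS):
    """Return one list of counts per line, ordered like `hotwords`."""
    results = []
    for line in lines:
        # Assumption: lower-case and split on whitespace;
        # swap in a real tokenizer if that is too crude.
        counts = Counter(word for word in line.lower().split() if word in hotwords)
        # A Counter returns 0 for hot words that never appeared on this line.
        results.append([counts[hw] for hw in hotwords])
    return results


sample = [
    "This is a tweet captured from the twitter api #hashtag http://url.com/site",
    "no hot words here",
    "tweet tweet twitter",
]
per_line = count_hotwords_per_line(sample)
print(per_line)
```

Keeping the counts in the same order as the hot word tuple means each line's result can later be written out as one row, with one column per hot word.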

Also, don't worry about the URL getting broken up (using WordPunctTokenizer will strip out the punctuation, and that is not a problem).
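To see what happens to the URL, here is a small sketch of that kind of tokenization using only the standard library. As far as I know, NLTK's WordPunctTokenizer is built on the regular expression \w+|[^\w\s]+, so punctuation ends up in separate tokens rather than being attached to words; treat the pattern as an assumption and check it against your NLTK version. The helper name word_punct_tokenize is hypothetical.

```python
import re

# Assumption: this pattern mirrors the one NLTK's WordPunctTokenizer is built on.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    return WORD_PUNCT.findall(text)


tokens = word_punct_tokenize("a tweet from the twitter api #hashtag http://url.com/site")
print(tokens)
# ['a', 'tweet', 'from', 'the', 'twitter', 'api', '#', 'hashtag',
#  'http', '://', 'url', '.', 'com', '/', 'site']
```

The hot words still come out as clean tokens, so splitting the URL into pieces does not affect the counts unless a hot word also happens to appear inside a URL.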

Any useful pointers (including pseudocode or links to other similar code) would be great.

---- EDIT ----

Ended up with something like this:

import nltk
from nltk.corpus.reader import TaggedCorpusReader
from nltk.tokenize import LineTokenizer
#from nltk.tokenize import WordPunctTokenizer
from collections import defaultdict

# Create reader and generate corpus from all csv files in dir.
filecorpus = 'Twitter/FINAL_RESULTS/tweetcorpus'
filereader = TaggedCorpusReader(filecorpus, r'.*\.csv', sent_tokenizer=LineTokenizer())
print("Reader accessible.")
print(filereader.fileids())

# Define hotwords
hotwords = ('cool','foo','bar')

# One dict of hotword counts per line
tweetdict = []
for line in filereader.sents():
    wordcounts = defaultdict(int)
    for word in line:
        if word in hotwords:
            wordcounts[word] += 1
    tweetdict.append(wordcounts)

The output is:

print(tweetdict)
[defaultdict(<type 'int'>, {}),
 defaultdict(<type 'int'>, {'foo': 2, 'bar': 1, 'cool': 2}),
 defaultdict(<type 'int'>, {'cool': 1})]
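The snippet above stops at collecting the counts; the "write them back to a text file" step is not shown. One possible way to do that with the csv module is sketched below, assuming Python 3, one output row per input line and one column per hot word. The function name write_counts, the output file name and the sample data are hypothetical; the sample is simply shaped like the tweetdict output shown above.

```python
import csv
import os
import tempfile
from collections import defaultdict

hotwords = ("cool", "foo", "bar")


def write_counts(path, per_line_counts, hotwords):
    """Write one CSV row per input line, one column per hot word."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(hotwords)  # header row
        for counts in per_line_counts:
            # .get() reads a count without inserting missing keys into a defaultdict.
            writer.writerow([counts.get(word, 0) for word in hotwords])


# Sample data shaped like the tweetdict output shown above.
tweetdict = [
    defaultdict(int),
    defaultdict(int, {"foo": 2, "bar": 1, "cool": 2}),
    defaultdict(int, {"cool": 1}),
]

# Write to a temporary directory so the sketch does not touch real data.
out_path = os.path.join(tempfile.mkdtemp(), "hotword_counts.csv")
write_counts(out_path, tweetdict, hotwords)

# Read the file back to check what was written.
with open(out_path, newline="") as handle:
    rows = list(csv.reader(handle))
print(rows)
```

Lines with no hot words still get a row of zeros, which keeps the output aligned line for line with the input file.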
