The Ultimate Weapon for Large-File Processing: yield Explained


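1. Pain Points of Large-File Processing

Memory limits: reading a whole file at once needs memory proportional to the file size
Huge data volumes
Traditional load-everything approaches are inefficient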
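
To make the memory point concrete, here is a minimal sketch contrasting an eager read with lazy iteration. Both functions return the same count; the difference is that readlines() builds a list of every line first, while iterating the file object holds one line at a time. The helper make_sample_file and the function names are hypothetical, introduced only for this illustration.

```python
import os
import tempfile

def count_lines_eager(file_path):
    # readlines() loads every line into a list: memory grows with file size
    with open(file_path, 'r', encoding='utf-8') as file:
        return len(file.readlines())

def count_lines_lazy(file_path):
    # Iterating the file object keeps only the current line in memory
    with open(file_path, 'r', encoding='utf-8') as file:
        return sum(1 for _ in file)

def make_sample_file(line_count):
    # Hypothetical helper: writes a throwaway file for the comparison
    fd, path = tempfile.mkstemp(suffix='.log', text=True)
    with os.fdopen(fd, 'w', encoding='utf-8') as file:
        for i in range(line_count):
            file.write(f"line {i}\n")
    return path
```

On a small file the two are indistinguishable; the gap only shows up once the file is too large to hold comfortably in memory.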

2. The yield Solution

def read_large_file(file_path):
    with open(file_path, 'r') as file:
        # Read one line at a time instead of the whole file
        for line in file:
            yield line.strip()

# Usage example
def process_log_file(file_path):
    # Memory-friendly log processing
    for line in read_large_file(file_path):
        # Handle each line as it arrives
        if 'ERROR' in line:
            print(f"Found error log: {line}")

# Log analysis example
def analyze_error_logs(file_path):
    error_count = 0
    for line in read_large_file(file_path):
        if 'ERROR' in line:
            error_count += 1
    return error_count
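
One consequence of reading through a generator is that the consumer decides how much of the input gets read. The sketch below is an illustration, not part of the original code: read_lines and first_error are hypothetical names, and an in-memory io.StringIO stands in for a real file. It stops at the first matching line and leaves the rest unread.

```python
import io

def read_lines(stream):
    # Same idea as read_large_file, but accepts any iterable of lines
    for line in stream:
        yield line.strip()

def first_error(stream):
    # next() pulls lines only until the first match, then stops reading
    matches = (line for line in read_lines(stream) if 'ERROR' in line)
    return next(matches, None)

# Invented sample data standing in for a log file
sample = io.StringIO("INFO start\nERROR disk full\nINFO never reached\n")
print(first_error(sample))      # ERROR disk full
print(repr(sample.readline()))  # the third line was never consumed
```

With a multi-gigabyte log whose first error sits near the top, this pattern returns almost immediately.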

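3. In Practice: Analyzing Massive Logs

Processing very large log files

def parse_massive_log(file_path):
    # Memory-efficient log parsing
    with open(file_path, 'r') as file:
        for line in file:
            # Parse each line as it is read
            try:
                # Assumed log format: time | level | message
                # maxsplit=2 keeps any '|' inside the message intact
                timestamp, level, message = line.split('|', 2)
                # Only handle entries at the ERROR level
                if level.strip() == 'ERROR':
                    yield {
                        'time': timestamp.strip(),
                        'message': message.strip()
                    }
            except ValueError:
                # Skip lines that do not match the format
                continue

# Usage example
def log_error_summary(file_path):
    error_summary = {}
    for error in parse_massive_log(file_path):
        # Count errors per hour; assumes timestamps like '2024-01-01 13:45:00'
        hour = error['time'].split()[1].split(':')[0]
        error_summary[hour] = error_summary.get(hour, 0) + 1
    return error_summary

# Call it
errors = log_error_summary('huge_server.log')
print(errors)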
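
Because the parser only needs something it can iterate over, the same logic can be tried on a handful of in-memory lines before pointing it at a real file. The following is a sketch under stated assumptions: the function names are hypothetical, the sample lines are invented, and the timestamp layout 'YYYY-MM-DD HH:MM:SS' is assumed rather than taken from any real log.

```python
from collections import Counter

def parse_error_lines(lines):
    # Accepts any iterable of lines: an open file, a list, another generator
    for line in lines:
        parts = line.split('|', 2)
        if len(parts) != 3:
            continue  # skip lines that do not match "time | level | message"
        timestamp, level, message = (part.strip() for part in parts)
        if level == 'ERROR':
            yield {'time': timestamp, 'message': message}

def errors_per_hour(lines):
    counts = Counter()
    for error in parse_error_lines(lines):
        # Assumed timestamp layout: 'YYYY-MM-DD HH:MM:SS'
        _, _, time_part = error['time'].partition(' ')
        counts[time_part[:2]] += 1
    return dict(counts)

# Invented sample data
sample_log = [
    "2024-01-01 10:15:00 | ERROR | disk full",
    "2024-01-01 10:45:12 | INFO | heartbeat",
    "2024-01-01 11:02:33 | ERROR | timeout | retrying",
    "malformed line",
]
print(errors_per_hour(sample_log))  # {'10': 1, '11': 1}
```

Passing an open file object instead of the list gives the same result while still reading one line at a time.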

4. Processing Large CSV Files

import csv

def process_large_csv(file_path):
    # newline='' is what the csv module expects for files it reads
    with open(file_path, 'r', newline='') as file:
        reader = csv.DictReader(file)
        for row in reader:
            # Process each row as it is read
            yield process_row(row)

def process_row(row):
    # Data cleaning and transformation
    return {
        'name': row['name'].upper(),
        'score': float(row['score']) * 1.1
    }

def analyze_student_data(file_path):
    total_scores = 0
    student_count = 0
    for processed_row in process_large_csv(file_path):
        total_scores += processed_row['score']
        student_count += 1
    return total_scores / student_count if student_count > 0 else 0
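
Real CSV files often contain rows with missing or non-numeric values, and a single bad row would stop the float() conversion above with an exception. One possible way to handle that, sketched here with hypothetical function names and invented sample data, is to skip rows that cannot be converted and keep streaming.

```python
import csv
import io

def iter_valid_scores(csv_file):
    # Yields one score at a time; rows that cannot be converted are skipped
    reader = csv.DictReader(csv_file)
    for row in reader:
        try:
            yield float(row['score'])
        except (KeyError, TypeError, ValueError):
            # Missing column, short row (value is None), or non-numeric text
            continue

def average_score(csv_file):
    total = 0.0
    count = 0
    for score in iter_valid_scores(csv_file):
        total += score
        count += 1
    return total / count if count else 0.0

# Invented sample data; io.StringIO stands in for an open file
sample_csv = io.StringIO("name,score\nalice,80\nbob,n/a\ncarol,90\n")
print(average_score(sample_csv))  # 85.0
```

Whether to skip, log, or fail on bad rows depends on the data; silently skipping is only one option.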

5. Deduplicating Large Files

import os
import hashlib

def deduplicate_file(input_file, output_file):
    # Memory-efficient file deduplication
    seen = set()
    with open(input_file, 'r') as infile, \
         open(output_file, 'w') as outfile:
        for line in infile:
            # Handle one line at a time
            clean_line = line.strip()
            if clean_line not in seen:
                seen.add(clean_line)
                outfile.write(clean_line + '\n')
            # Keep the dedup set from growing too large.
            # Trade-off: a line repeated after a reset is written again,
            # so deduplication is only approximate.
            if len(seen) > 10000:
                seen.clear()

# Deduplicating files by fingerprint
def find_duplicate_files(directory):
    def file_hash(filepath):
        # Compute a fingerprint of the file contents
        hasher = hashlib.md5()
        with open(filepath, 'rb') as f:
            # Read in chunks to avoid loading the whole file at once
            for chunk in iter(lambda: f.read(4096), b''):
                hasher.update(chunk)
        return hasher.hexdigest()

    # The generator yields each file whose contents were already seen
    seen_hashes = set()
    for root, _, files in os.walk(directory):
        for filename in files:
            filepath = os.path.join(root, filename)
            file_fingerprint = file_hash(filepath)
            if file_fingerprint in seen_hashes:
                yield filepath
            else:
                seen_hashes.add(file_fingerprint)
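
If clearing the set is not acceptable because every duplicate must be caught, one alternative is to remember a fixed-size digest of each line instead of the line itself. This is a sketch of a swapped-in technique, not the approach used above; unique_lines is a hypothetical name. Memory still grows with the number of distinct lines, but each entry costs a 32-byte digest rather than a full line, which helps when lines are long.

```python
import hashlib

def unique_lines(lines):
    # Remember a SHA-256 digest per distinct line rather than the line text
    seen_digests = set()
    for line in lines:
        clean_line = line.strip()
        digest = hashlib.sha256(clean_line.encode('utf-8')).digest()
        if digest not in seen_digests:
            seen_digests.add(digest)
            yield clean_line

# Invented sample data
sample = ["a\n", "b\n", "a\n", "c\n", "b\n"]
print(list(unique_lines(sample)))  # ['a', 'b', 'c']
```

For inputs with more distinct lines than memory can hold even as digests, an external sort followed by a single pass is the usual fallback.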

6. Advanced Use: Stream Processing

def process_streaming_data(data_source):
    # Simulates processing a real-time data stream
    for data_point in data_source:
        # Transform and filter each item as it arrives
        processed_data = transform(data_point)
        if is_valid(processed_data):
            yield processed_data

def transform(data):
    # Data cleaning and transformation
    return data.lower().strip()

def is_valid(data):
    # Data validity check
    return len(data) > 0
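
A stream does not have to end for this pattern to work, since the consumer can take as many items as it needs and stop. The sketch below uses a hypothetical endless source, sensor_readings, and itertools.islice to take the first three valid items; the names and the data are invented for illustration.

```python
import itertools

def sensor_readings():
    # Hypothetical endless source: every third item is blank
    for n in itertools.count():
        yield f"  Reading-{n}  " if n % 3 else "   "

def clean(stream):
    # Same transform-and-filter step as above, written as one generator
    for item in stream:
        text = item.lower().strip()
        if text:
            yield text

# islice stops pulling from the endless stream after three valid items
first_three = list(itertools.islice(clean(sensor_readings()), 3))
print(first_three)  # ['reading-1', 'reading-2', 'reading-4']
```

Calling list() directly on clean(sensor_readings()) would never return, which is why the bound is applied by the consumer.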

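7. Advantages of yield

High memory efficiency: only the current item is held in memory
Real-time processing: each item is handled as soon as it is produced
Lazy evaluation: values are computed only when requested
Pausable and resumable: execution stops at each yield and continues on the next request
Well suited to large data sets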
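
The lazy and pausable behavior listed above can be observed directly. In this small sketch (countdown and the trace list are hypothetical, added only to record what runs when), calling the generator function executes none of its body, and each next() runs it just far enough to reach the following yield.

```python
def countdown(start, trace):
    trace.append('started')
    while start > 0:
        yield start                      # execution pauses here
        trace.append(f'resumed after {start}')
        start -= 1

trace = []
gen = countdown(2, trace)
print(trace)       # [] : creating the generator runs none of the body
print(next(gen))   # 2
print(trace)       # ['started']
print(next(gen))   # 1
print(trace)       # ['started', 'resumed after 2']
```

A third next() would finish the loop and raise StopIteration, which is how a for loop knows the generator is exhausted.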

8. Best Practices

Process data in chunks
Release memory promptly
Use the iterator pattern
Avoid loading all the data at once
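
Chunked processing also applies to binary data or files without line breaks, where line-by-line iteration does not help. A minimal sketch follows, assuming a hypothetical helper read_in_chunks and an arbitrary 64 KiB default chunk size; an in-memory io.BytesIO stands in for a file opened in 'rb' mode.

```python
import io

def read_in_chunks(stream, chunk_size=64 * 1024):
    # Yield fixed-size blocks until the stream is exhausted
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Invented sample data: 10 bytes read 4 at a time
sample = io.BytesIO(b'x' * 10)
print([len(chunk) for chunk in read_in_chunks(sample, chunk_size=4)])  # [4, 4, 2]
```

Each chunk becomes eligible for garbage collection as soon as the loop moves on, so peak memory stays close to one chunk.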

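💡 Study tips:

Understand how generators work
Master the iterator concept
Practice processing large files
Pay attention to memory optimization

Friendly reminder: yield works like a memory-saving gadget for data processing! 🌟 Every line of code is fighting for performance!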
