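pandas offers several ways to iterate over rows, such as iterrows(), itertuples(), and apply, which is fairly convenient.
polars has very powerful column operations, which are documented in detail on its official site. Because Arrow, the layer underneath polars, uses a columnar storage format, row-wise operations are inefficient, and the official documentation advises against processing data row by row. Even so, some scenarios still call for row iteration.
So how do you iterate over rows in polars? This post tries an approach that does not use apply.
Scenario: polars reads a CSV file of historical stock prices that contains basic market data. How can the loaded data be iterated row by row quickly? This comes up often when backtesting market-data-driven strategies.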
I. Initial approach:
1. Overall plan
(1) csv => DataFrame
(2) DataFrame => into_struct, which yields a StructChunked
(3) StructChunked => iterate over the rows and build the bars.
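To make the cost of step (3) concrete, here is a minimal plain-Rust sketch of what row iteration over column storage involves. It is only an analogy and does not use the polars API: `Columns`, `Row`, and `sample_columns` are hypothetical names invented for this illustration, and only three of the nine columns are modelled.

```rust
// Plain-Rust analogy of a column store; this is NOT the polars API.
struct Columns {
    code: Vec<String>,
    close: Vec<f32>,
    volume: Vec<f32>,
}

// A row view that borrows its string field from the columns.
#[derive(Debug, PartialEq)]
struct Row<'a> {
    code: &'a str,
    close: f32,
    volume: f32,
}

impl Columns {
    // Row count, taken from the first column (all columns are assumed equally long).
    fn len(&self) -> usize {
        self.code.len()
    }

    // Building one row requires one lookup in every column.
    fn row(&self, i: usize) -> Option<Row<'_>> {
        Some(Row {
            code: self.code.get(i)?.as_str(),
            close: *self.close.get(i)?,
            volume: *self.volume.get(i)?,
        })
    }

    // Iterate over all rows by index.
    fn rows(&self) -> impl Iterator<Item = Row<'_>> + '_ {
        (0..self.len()).filter_map(move |i| self.row(i))
    }
}

// Hypothetical sample data with two rows.
fn sample_columns() -> Columns {
    Columns {
        code: vec!["000001".to_string(), "000002".to_string()],
        close: vec![10.0, 20.5],
        volume: vec![100.0, 300.0],
    }
}

fn main() {
    let cols = sample_columns();
    for row in cols.rows() {
        println!("{:?}", row);
    }
}
```

Each row touches every column once, so memory access jumps between buffers instead of scanning one buffer sequentially. That is the general reason row-wise work on a columnar layout tends to be slower than column-wise work.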
2. The Bar type
There are two options for the design of the Bar type:
(1) A Bar that owns its values
#[allow(dead_code)]
struct Bar {
    code: String,
    date: String,
    open: f32,
    high: f32,
    close: f32,
    low: f32,
    volume: f32,
    amount: f32,
    is_fq: bool,
}
(2) A Bar with borrowed (reference) fields
#[allow(dead_code)]
struct Bar2<'a> {
    code: &'a str,
    date: &'a str,
    open: f32,
    high: f32,
    close: f32,
    low: f32,
    volume: f32,
    amount: f32,
    is_fq: bool,
}
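The practical difference between the two designs is lifetime, not just speed. The sketch below uses shortened, hypothetical stand-ins (`OwnedBar`, `BorrowedBar`, `first_bar`, and `load_first` are names made up for this example, with two fields instead of nine) to show that a borrowed bar is tied to the buffer it points into, while an owned bar can outlive it at the cost of a copy.

```rust
// Hypothetical, shortened stand-in for the owned Bar.
#[derive(Debug, Clone, PartialEq)]
struct OwnedBar {
    code: String,
    close: f32,
}

// Hypothetical, shortened stand-in for the borrowed Bar2.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BorrowedBar<'a> {
    code: &'a str,
    close: f32,
}

impl<'a> BorrowedBar<'a> {
    // Copy the borrowed string into a new heap allocation.
    fn to_owned_bar(&self) -> OwnedBar {
        OwnedBar {
            code: self.code.to_string(),
            close: self.close,
        }
    }
}

// Build a borrowed bar from the first source row, then convert it to an owned one.
fn first_bar(source: &[(String, f32)]) -> Option<OwnedBar> {
    let (code, close) = source.first()?;
    let borrowed = BorrowedBar {
        code: code.as_str(),
        close: *close,
    };
    Some(borrowed.to_owned_bar())
}

// `source` is dropped when this function returns, so only an owned bar
// can be handed back; returning a BorrowedBar here would not compile.
fn load_first() -> Option<OwnedBar> {
    let source = vec![("000001".to_string(), 10.0_f32)];
    first_bar(&source)
}

fn main() {
    println!("{:?}", load_first());
}
```

In a backtest this means a Vec of borrowed bars is usable only while the underlying data stays alive and unmodified, whereas owned bars can be stored or sent elsewhere freely.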
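II. Cargo.toml
Note that polars is strict about features: every feature the code relies on must be enabled explicitly, otherwise the code will not compile. The polars documentation often does not spell this out, so it is an easy trap to fall into.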
[package]
name = "my_duckdb"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
polars = { version = "*", features = ["lazy","dtype-struct"] }
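Note that "dtype-struct" must be included in features. Also, version = "*" resolves to the newest release; since polars changes its API between releases, pinning the version the code was written against is safer.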
III. main.rs
Based on the design above, the complete code is as follows:
use polars::prelude::*;
use std::time::Instant;

// Bar that owns its strings.
#[allow(dead_code)]
struct Bar {
    code: String,
    date: String,
    open: f32,
    high: f32,
    close: f32,
    low: f32,
    volume: f32,
    amount: f32,
    is_fq: bool,
}

// Bar that borrows its strings from the underlying data.
#[allow(dead_code)]
struct Bar2<'a> {
    code: &'a str,
    date: &'a str,
    open: f32,
    high: f32,
    close: f32,
    low: f32,
    volume: f32,
    amount: f32,
    is_fq: bool,
}
fn main() {
    let time0 = Instant::now();
    // test2.csv: about 640,000 rows
    let csv = "test2.csv";
    let df = polars_lazy_read_csv(csv);
    println!("read raw csv cost time : {:?} seconds", time0.elapsed().as_secs_f32());

    let time1 = Instant::now();
    let rows = df.into_struct("bars");
    println!("dataframe => structs cost time : {:?} seconds", time1.elapsed().as_secs_f32());

    let time2 = Instant::now();
    let bars = get_vec_bars(&rows);
    println!("dataframe => bars cost time : {:?} seconds", time2.elapsed().as_secs_f32());

    let time3 = Instant::now();
    let bar2s = get_vec_bar2s(&rows);
    println!("dataframe => bar2s cost time : {:?} seconds", time3.elapsed().as_secs_f32());

    println!("bars length :{:?}", bars.len());
    println!("bar2s length:{:?}", bar2s.len());
}

// Convert one row into an owned Bar (copies the two strings).
fn get_bar(row: &[AnyValue]) -> Bar {
    // Columns 0 and 1 are strings; fall back to "" if the value is not Utf8.
    let mut new_code = "";
    if let &AnyValue::Utf8(value) = row.get(0).unwrap() {
        new_code = value;
    }
    let mut new_date = "";
    if let &AnyValue::Utf8(v) = row.get(1).unwrap() {
        new_date = v;
    }
    // Columns 2 to 7 are numeric and are extracted as f32.
    let open = row[2].extract::<f32>().unwrap();
    let high = row[3].extract::<f32>().unwrap();
    let close = row[4].extract::<f32>().unwrap();
    let low = row[5].extract::<f32>().unwrap();
    let volume = row[6].extract::<f32>().unwrap();
    let amount = row[7].extract::<f32>().unwrap();
    // Column 8 is a boolean flag; default to false if it is not a Boolean.
    let mut is_fq = false;
    if let &AnyValue::Boolean(b) = row.get(8).unwrap() {
        is_fq = b;
    }
    Bar {
        code: String::from(new_code),
        date: String::from(new_date),
        open,
        high,
        close,
        low,
        volume,
        amount,
        is_fq,
    }
}

// Convert one row into a Bar2 that borrows the two strings from the row.
fn get_bar2<'a>(row: &'a [AnyValue]) -> Bar2<'a> {
    let mut new_code = "";
    if let &AnyValue::Utf8(value) = row.get(0).unwrap() {
        new_code = value;
    }
    let mut new_date = "";
    if let &AnyValue::Utf8(v) = row.get(1).unwrap() {
        new_date = v;
    }
    let open = row[2].extract::<f32>().unwrap();
    let high = row[3].extract::<f32>().unwrap();
    let close = row[4].extract::<f32>().unwrap();
    let low = row[5].extract::<f32>().unwrap();
    let volume = row[6].extract::<f32>().unwrap();
    let amount = row[7].extract::<f32>().unwrap();
    let mut is_fq = false;
    if let &AnyValue::Boolean(b) = row.get(8).unwrap() {
        is_fq = b;
    }
    Bar2 {
        code: new_code,
        date: new_date,
        open,
        high,
        close,
        low,
        volume,
        amount,
        is_fq,
    }
}
// Iterate over the struct rows and collect owned bars.
fn get_vec_bars(data: &StructChunked) -> Vec<Bar> {
    let mut bars = Vec::new();
    for row in data {
        let bar = get_bar(row);
        bars.push(bar);
    }
    bars
}

// Iterate over the struct rows and collect borrowed bars.
fn get_vec_bar2s(data: &StructChunked) -> Vec<Bar2> {
    let mut bars = Vec::new();
    for row in data {
        let bar = get_bar2(row);
        bars.push(bar);
    }
    bars
}
// Read the CSV lazily, collect it into a DataFrame, and print its shape and the elapsed time.
fn polars_lazy_read_csv(filepath: &str) -> DataFrame {
    let polars_lazy_csv_time = Instant::now();
    let p = LazyCsvReader::new(filepath)
        .has_header(true)
        .finish()
        .unwrap();
    let df = p.collect().expect("error to dataframe!");
    println!("polars lazy 讀出csv的行和列數:{:?}", df.shape());
    println!("polars lazy 讀csv 花時: {:?} 秒!", polars_lazy_csv_time.elapsed().as_secs_f32());
    df
}
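IV. Output and comparison
The test uses a CSV file with about 640,000 rows and 9 columns; every row is traversed and converted into a Vec<Bar>.
1. The output is as follows: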
polars lazy 讀出csv的行和列數:(640710, 9)
polars lazy 讀csv 花時: 0.058484446 秒!
read raw csv cost time : 0.058487203 seconds
dataframe => structs cost time : 2.8842e-5 seconds
dataframe => bars cost time : 0.131985 seconds
dataframe => bar2s cost time : 0.10357016 seconds
bars length :640710
bar2s length:640710
Overall, the DataFrame-to-struct step is cheap (about 29 microseconds); most of the time is spent converting the StructChunked into bars.
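2. Owned Bar vs. borrowed Bar
In this run the borrowed Bar2 is faster: about 0.104 s versus 0.132 s, roughly 20% less time. The saving comes from skipping the heap allocations that the owned Bar needs for its two String fields on every row.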
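The allocation effect can be reproduced without polars. The following is a rough std-only sketch, not the benchmark from this post: `make_rows`, `to_owned_bars`, and `to_borrowed_bars` are hypothetical helpers, the input is synthetic, and the structs keep only three fields.

```rust
use std::time::Instant;

// Hypothetical shortened bar that owns its strings.
#[allow(dead_code)]
struct OwnedBar {
    code: String,
    date: String,
    close: f32,
}

// Hypothetical shortened bar that borrows its strings.
#[allow(dead_code)]
struct BorrowedBar<'a> {
    code: &'a str,
    date: &'a str,
    close: f32,
}

// Synthetic (code, date, close) rows standing in for the CSV data.
fn make_rows(n: usize) -> Vec<(String, String, f32)> {
    (0..n)
        .map(|i| (format!("{:06}", i % 5000), "2023-01-01".to_string(), i as f32))
        .collect()
}

// Clones both strings for every row: two heap allocations per bar.
fn to_owned_bars(rows: &[(String, String, f32)]) -> Vec<OwnedBar> {
    rows.iter()
        .map(|(code, date, close)| OwnedBar {
            code: code.clone(),
            date: date.clone(),
            close: *close,
        })
        .collect()
}

// Only stores references into `rows`: no per-row string allocation.
fn to_borrowed_bars(rows: &[(String, String, f32)]) -> Vec<BorrowedBar<'_>> {
    rows.iter()
        .map(|(code, date, close)| BorrowedBar {
            code: code.as_str(),
            date: date.as_str(),
            close: *close,
        })
        .collect()
}

fn main() {
    let rows = make_rows(640_710);

    let t_owned = Instant::now();
    let owned = to_owned_bars(&rows);
    println!("owned bars   : {:?}", t_owned.elapsed());

    let t_borrowed = Instant::now();
    let borrowed = to_borrowed_bars(&rows);
    println!("borrowed bars: {:?}", t_borrowed.elapsed());

    println!("lengths: {} {}", owned.len(), borrowed.len());
}
```

Timings depend on the machine and build profile, so run it with `cargo run --release` and repeat it several times before drawing conclusions. The gap here will not match the polars numbers exactly, because the AnyValue extraction cost is absent.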
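V. Other notes
So far I have not found a pandas-style way to iterate over rows in polars; I will keep following this.
In addition, converting the DataFrame into bars is not very efficient, and I hope to find a faster alternative.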