Paper: [2401.10134] Spatial-Temporal Large Language Model for Traffic Prediction
Code: GitHub - ChenxiLiu-HNU/ST-LLM: Official implementation of the paper "Spatial-Temporal Large Language Model for Traffic Prediction"
The English here is all hand-typed, my own summarizing and paraphrasing of the original paper. Spelling and grammar mistakes are hard to avoid entirely; if you spot any, feel free to point them out in the comments. This post reads as personal notes, so take it with a grain of salt.
Table of Contents
1. Thoughts
2. Section-by-Section Reading
2.1. Abstract
2.2. Introduction
2.3. Related Work
2.3.1. Large Language Models for Time Series Analysis
2.3.2. Traffic Prediction
2.4. Problem Definition
2.5. Methodology
2.5.1.?Overview
2.5.2. Spatial-Temporal Embedding and Fusion
2.5.3. Partially Frozen Attention (PFA) LLM
2.6. Experiments
2.6.1. Datasets
2.6.2. Baselines
2.6.3. Implementations
2.6.4. Evaluation Metrics
2.6.5. Main Results
2.6.6. Performance of ST-LLM and Ablation Studies
2.6.7. Parameter Analysis
2.6.8. Inference Time Analysis
2.6.9. Few-Shot Prediction
2.6.10. Zero-Shot Prediction
2.7. Conclusion
3. Reference
1. Thoughts
(1) Even though the paper I have to submit in a few days has not been started yet, here I am munching on crackers and writing reading notes. Sigh. Everyone moves too fast these days.
(2) Compared with math-heavy papers, an LLM paper goes well with a cup of milk tea: relaxed and pleasant throughout. This one boils down to three separate convolutions → fuse them together → LLM (with some modules partially unfrozen) → done.
2. Section-by-Section Reading
2.1. Abstract
        ① They proposed the Spatial-Temporal Large Language Model (ST-LLM) to predict traffic (nothing especially noteworthy to record: the abstract introduces the method and says earlier approaches were not accurate enough; see the framework figure below for the specifics)
2.2. Introduction
        ① Traditional CNNs and RNNs cannot capture complex, long-range spatial and temporal dependencies. GNNs are prone to overfitting, so researchers mainly rely on attention mechanisms.
        ② Existing traffic prediction methods mainly focus on temporal features rather than spatial ones
        ③ For better long-term prediction, they proposed partially frozen attention (PFA)
2.3. Related Work
2.3.1.?Large Language Models for Time Series Analysis
        ① The authors list TEMPO-GPT, TIME-LLM, OFA, TEST, and LLM-TIME, all of which utilize temporal features only. Conversely, GATGPT introduces spatial features but ignores temporal dependencies.
imputation  n. attribution (of blame or cause); in this context, filling in missing values
2.3.2.?Traffic Prediction
        ① Filtering is a common and classic method for processing traffic data
        ② Irregular city road networks make CNNs hard to apply for extracting spatial features
2.4. Problem Definition
        ① Input traffic data: $\mathcal{X} \in \mathbb{R}^{T \times N \times C}$, where $T$ denotes the number of timesteps, $N$ denotes the number of spatial stations, and $C$ denotes the number of features
        ② Task: given only the historical traffic data $\mathcal{X}_P = [X_{t-P+1}, \ldots, X_t] \in \mathbb{R}^{P \times N \times C}$ of $P$ timesteps, learn a function $f(\cdot)$ with parameters $\theta$ to predict the future $S$ timesteps:

$$[X_{t-P+1}, \ldots, X_t] \xrightarrow{f_{\theta}} [X_{t+1}, \ldots, X_{t+S}] \in \mathbb{R}^{S \times N \times C}$$
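To make the tensor shapes concrete, here is a minimal NumPy sketch; the values $P = S = 12$, $N = 266$, and $C = 2$ are illustrative assumptions borrowed from the experimental setup below, not part of the definition itself.

```python
import numpy as np

P, S, N, C = 12, 12, 266, 2        # hypothetical sizes for illustration

X_hist = np.random.rand(P, N, C)   # [X_{t-P+1}, ..., X_t], the model input
Y_true = np.random.rand(S, N, C)   # [X_{t+1}, ..., X_{t+S}], what f_theta must predict

print(X_hist.shape, "->", Y_true.shape)   # (12, 266, 2) -> (12, 266, 2)
```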
2.5. Methodology
2.5.1.?Overview
        ① Overall framework of ST-LLM (the architecture figure in the paper): the spatial-temporal embedding layer extracts the token embedding $E_P$, spatial embedding $E_S$, and temporal embedding $E_T$ of the historical $P$ timesteps. These three are then fused into $E_F$. The PFA LLM freezes its first $F$ layers and partially unfreezes its last $U$ layers, yielding the output $H^{F+U}$. Lastly, a regression convolution converts it into the prediction $\hat{\mathcal{Y}} \in \mathbb{R}^{S \times N \times C}$.
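Before the detailed equations, a shape-only walkthrough may help. This is my own PyTorch sketch with every stage stubbed by a linear layer; the real model uses pointwise/fusion convolutions and a pretrained LLM, and the batch size $B$ and hidden width $D$ are assumed values.

```python
import torch
import torch.nn as nn

B, P, S, N, C, D = 8, 12, 12, 266, 2, 64
x = torch.randn(B, P, N, C)                                   # historical window

token = nn.Linear(P * C, D)                                   # stand-in for the pointwise conv (E_P)
e_p = token(x.permute(0, 2, 1, 3).reshape(B, N, P * C))       # (B, N, D)
e_s = nn.Parameter(torch.randn(N, D)).expand(B, N, D)         # spatial embedding E_S
e_t = torch.randn(B, N, D)                                    # temporal embedding E_T (stubbed)

fuse = nn.Linear(3 * D, 3 * D)                                # stand-in for the fusion conv
e_f = fuse(torch.cat([e_p, e_s, e_t], dim=-1))                # fused tokens (B, N, 3D)

# ... the PFA LLM runs over the N station tokens here ...
rconv = nn.Linear(3 * D, S * C)                               # stand-in for the regression conv
y_hat = rconv(e_f).reshape(B, N, S, C).permute(0, 2, 1, 3)    # (B, S, N, C)
print(y_hat.shape)                                            # torch.Size([8, 12, 266, 2])
```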
2.5.2.?Spatial-Temporal Embedding and Fusion
        ① They get the token embedding $E_P \in \mathbb{R}^{N \times D}$ by a pointwise convolution over the historical input:

$$E_P = \mathrm{PConv}(\mathcal{X}_P)$$

        ② They apply linear layers to encode the input timestamps into a day embedding $E_D$ and a week embedding $E_W$:

$$E_D = W_D\, t_d, \qquad E_W = W_W\, t_w$$

where $W_D$ and $W_W$ are learnable parameters, and the output is the temporal embedding $E_T = E_D + E_W \in \mathbb{R}^{N \times D}$
        ③ They extract spatial correlations with a learnable spatial embedding $E_S \in \mathbb{R}^{N \times D}$
        ④ Fusion convolution (a code sketch follows this list):

$$E_F = \mathrm{FConv}(E_P \,\|\, E_S \,\|\, E_T)$$

where $\|$ denotes concatenation along the feature dimension, so $E_F \in \mathbb{R}^{N \times 3D}$
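Here is a minimal PyTorch sketch of this embedding-and-fusion stage as I read it. The 1×1 convolutions for PConv/FConv, the 48 half-hour slots per day, the 7 weekdays, and $D = 64$ are my assumptions layered on the description above; the official repository may differ.

```python
import torch
import torch.nn as nn

class STEmbeddingFusion(nn.Module):
    """Sketch of E_P (token), E_T = E_D + E_W (temporal), E_S (spatial), and FConv."""
    def __init__(self, P=12, C=2, N=266, D=64):
        super().__init__()
        self.token = nn.Conv2d(P * C, D, kernel_size=1)       # PConv: pointwise conv -> E_P
        self.day   = nn.Embedding(48, D)                      # E_D: 48 half-hour slots per day
        self.week  = nn.Embedding(7, D)                       # E_W: 7 days of the week
        self.space = nn.Parameter(torch.randn(N, D))          # E_S: learnable station embedding
        self.fuse  = nn.Conv2d(3 * D, 3 * D, kernel_size=1)   # FConv over concatenated features

    def forward(self, x, tod, dow):
        # x: (B, P, N, C); tod, dow: (B,) integer indices of the current timestep
        B, P, N, C = x.shape
        e_p = self.token(x.permute(0, 1, 3, 2).reshape(B, P * C, N, 1))          # (B, D, N, 1)
        e_t = (self.day(tod) + self.week(dow))[:, :, None, None].expand(-1, -1, N, 1)
        e_s = self.space.t()[None, :, :, None].expand(B, -1, -1, -1)             # (B, D, N, 1)
        return self.fuse(torch.cat([e_p, e_s, e_t], dim=1))                      # (B, 3D, N, 1)

# Usage: fused tokens for a batch of 8 windows
emb = STEmbeddingFusion()
x = torch.randn(8, 12, 266, 2)
out = emb(x, torch.zeros(8, dtype=torch.long), torch.zeros(8, dtype=torch.long))
print(out.shape)   # torch.Size([8, 192, 266, 1])
```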
2.5.3.?Partially Frozen Attention (PFA) LLM
        ① They freeze the first $F$ layers (including the multi-head attention and feed-forward sublayers), which contain important pretrained knowledge. With $H^0 = E_F + PE$, for $i = 1, \ldots, F$:

$$\tilde{H}^i = H^{i-1} + \overline{\mathrm{MHA}}\big(\mathrm{LN}_1(H^{i-1})\big), \qquad H^i = \tilde{H}^i + \overline{\mathrm{FFN}}\big(\mathrm{LN}_2(\tilde{H}^i)\big)$$

where the overline marks frozen components, $PE$ denotes the learnable positional encoding, $\tilde{H}^i$ represents the intermediate representation of the $i$-th layer after applying the frozen multi-head attention (MHA) and the first unfrozen layer normalization (LN), $H^i$ symbolizes the final representation after applying the unfrozen LN and the frozen feed-forward network (FFN), and:

$$\mathrm{MHA}(H) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_j = \mathrm{softmax}\Big(\tfrac{Q_j K_j^{\top}}{\sqrt{d}}\Big) V_j$$

        ② Unfreezing the multi-head attention in the last $U$ layers while keeping their FFNs frozen (see the freezing sketch after this list), for $i = F+1, \ldots, F+U$:

$$\tilde{H}^i = H^{i-1} + \mathrm{MHA}\big(\mathrm{LN}_1(H^{i-1})\big), \qquad H^i = \tilde{H}^i + \overline{\mathrm{FFN}}\big(\mathrm{LN}_2(\tilde{H}^i)\big)$$
        ③ The final regression convolution (RConv) maps the LLM output to the prediction:

$$\hat{\mathcal{Y}} = \mathrm{RConv}\big(H^{F+U}\big) \in \mathbb{R}^{S \times N \times C}$$

        ④ Loss function (mean absolute error):

$$\mathcal{L}(\theta) = \big\|\hat{\mathcal{Y}} - \mathcal{Y}\big\|_1$$

where $\mathcal{Y}$ is the ground truth
        ⑤ Algorithm: see the training procedure in the paper; a minimal freezing sketch follows
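To make the partial-freezing recipe concrete, below is a sketch on Hugging Face's GPT2Model (h, ln_1, ln_2, attn, wpe, and ln_f are the actual GPT-2 submodule names in transformers); the split U = 2 and the L1 objective at the end are my assumptions for illustration, not the paper's tuned setting.

```python
import torch
from transformers import GPT2Model

U = 2                                        # last U blocks get unfrozen attention (a choice for illustration)
gpt2 = GPT2Model.from_pretrained("gpt2")

for p in gpt2.parameters():                  # start with every parameter frozen
    p.requires_grad = False

for block in gpt2.h:                         # layer norms stay trainable in every block
    for p in list(block.ln_1.parameters()) + list(block.ln_2.parameters()):
        p.requires_grad = True

for block in gpt2.h[-U:]:                    # PFA: unfreeze multi-head attention in the last U blocks
    for p in block.attn.parameters():
        p.requires_grad = True

for p in gpt2.wpe.parameters():              # learnable positional encoding
    p.requires_grad = True
for p in gpt2.ln_f.parameters():             # final layer norm
    p.requires_grad = True

trainable = sum(p.numel() for p in gpt2.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")

loss_fn = torch.nn.L1Loss()                  # MAE-style objective, matching the loss above
```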
2.6. Experiments
2.6.1. Datasets
        ① Statistics of the datasets:
        ② NYCTaxi: 266 virtual stations and 4,368 timesteps (each timestep is half an hour)
        ③ CHBike: 250 sites and 4,368 timesteps (also half-hour timesteps)
2.6.2. Baselines
        ① GNN-based baselines: DCRNN, STGCN, GWN, AGCRN, STGNCDE, DGCRN
        ② Attention-based baselines: ASTGCN, GMAN, ASTGNN
        ③ LLM-based baselines: OFA, GATGPT, GCNGPT, LLAMA2
2.6.3. Implementations
        ① Data split: 6:2:2 for training, validation, and test
        ② Historical and future timesteps: $P = S = 12$
        ③
        ④ Optimization: Ranger21 with learning rate 0.001 for the LLM-based models; Adam with learning rate 0.001 for the GCN-based and attention-based models
        ⑤ LLMs: GPT2 and LLAMA2 7B
        ⑥ Layers: 6 for GPT2 and 8 for LLAMA2
        ⑦ Epochs: 100
        ⑧ Batch size: 64
2.6.4. Evaluation Metrics
        ① Metrics: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Root Mean Squared Error (RMSE), and Weighted Absolute Percentage Error (WAPE); a code sketch follows
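The four metrics in plain NumPy. The zero-masking in MAPE is a common convention in traffic prediction and an assumption here, since these notes do not record the paper's exact masking.

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y_hat - y))

def rmse(y, y_hat):
    return np.sqrt(np.mean((y_hat - y) ** 2))

def mape(y, y_hat, eps=1e-8):
    mask = np.abs(y) > eps                   # skip zero-demand entries (assumed convention)
    return np.mean(np.abs((y_hat[mask] - y[mask]) / y[mask])) * 100

def wape(y, y_hat):
    return np.sum(np.abs(y_hat - y)) / np.sum(np.abs(y)) * 100
```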
2.6.5. Main Results
        ① Performance comparison table:
2.6.6.?Performance of ST-LLM and Ablation Studies
        ① Module ablation:
        ② Ablation on the freezing strategy:
2.6.7.?Parameter Analysis
        ① Ablation on the hyperparameter $U$ (the number of unfrozen layers):
2.6.8.?Inference Time Analysis
        ① Inference time table:
2.6.9. Few-Shot Prediction
        ① Few-shot prediction with only 10% of the training samples:
2.6.10.?Zero-Shot Prediction
        ① Performance:
2.7. Conclusion
? ? ? ? ~
3. Reference
@inproceedings{liu2024spatial,
? title={Spatial-Temporal Large Language Model for Traffic Prediction},
? author={Liu, Chenxi and Yang, Sun and Xu, Qianxiong and Li, Zhishuai and Long, Cheng and Li, Ziyue and Zhao, Rui},
? booktitle={MDM},
? year={2024}
}