01 Introduction
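NextPolish is a tool for fixing base errors (SNVs and indels) in genome assemblies built from low-accuracy long reads such as ONT or CLR. It supports three input modes:
- Short reads only
- Long reads only
- Short and long reads together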
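NextPolish contains two core modules and uses a stepwise strategy to correct erroneous bases in the reference genome.
To correct or assemble raw third-generation sequencing (TGS) long reads, which have an error rate of roughly 10–15%, use NextDenovo instead.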
02 Installation
1. Download
Run the following command:
wget https://github.com/Nextomics/NextPolish/releases/latest/download/NextPolish.tgz
Note: if you encounter an error such as "GLIBC_2.14 not found" or "liblzma.so.0: cannot open shared object file", try the compatible version provided on the download page.
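To anticipate the GLIBC error before downloading, it can help to see which glibc the system provides. The sketch below is not part of NextPolish. It assumes a glibc-based Linux system where `getconf GNU_LIBC_VERSION` is available, plus a `sort` that supports `-V`; the helper name `version_ge` is made up for illustration.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed when dotted version A is at least B.
# Version sort is used so that 2.9 is correctly treated as older than 2.14.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# On glibc systems getconf prints a line such as "glibc 2.31";
# on other systems it fails and the variable stays empty.
glibc_version=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}' || true)

if [ -z "${glibc_version}" ]; then
    echo "Could not determine the glibc version on this system."
elif version_ge "${glibc_version}" 2.14; then
    echo "glibc ${glibc_version} is at least 2.14, the version named in the error."
else
    echo "glibc ${glibc_version} is older than 2.14; consider the compatible build."
fi
```

This check says nothing about liblzma; a missing `liblzma.so.0` only shows up when the binaries are actually run.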
2. Dependencies
- Python 2 or 3
- paralleltask
Install paralleltask with:
pip install paralleltask
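A quick way to confirm both prerequisites before building is sketched below. It assumes the package installed by `pip install paralleltask` is importable under the module name `paralleltask`; the function name `check_nextpolish_deps` is hypothetical and not part of NextPolish.

```shell
#!/usr/bin/env bash
# Report whether the two prerequisites listed above are available.
check_nextpolish_deps() {
    local py status=0
    # Any interpreter on PATH will do, since Python 2 and 3 are both accepted.
    py=$(command -v python3 || command -v python || command -v python2) || {
        echo "python: missing"
        return 1
    }
    echo "python: found at ${py}"
    # Assumption: the pip package exposes a module named "paralleltask".
    if "${py}" -c "import paralleltask" >/dev/null 2>&1; then
        echo "paralleltask: found"
    else
        echo "paralleltask: missing (install with: pip install paralleltask)"
        status=1
    fi
    return "${status}"
}

check_nextpolish_deps || echo "Install the missing prerequisites before running make."
```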
3. Build and install
tar -vxzf NextPolish.tgz
cd NextPolish
make
4. Uninstall
cd NextPolish
make clean
5. Test the installation
nextPolish test_data/run.cfg
03 Usage
1: Prepare the short-read file list (SGS FOFN, a file of file names)
ls reads1_R1.fq reads1_R2.fq reads2_R1.fq reads2_R2.fq > sgs.fofn
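The `ls` command writes the names exactly as typed and stops being useful if a file is missing or the working directory later changes. The sketch below is one way to guard against that. Writing absolute paths is a design choice made here, not a documented NextPolish requirement, and `make_fofn` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# make_fofn OUTPUT FILE...
# Write one absolute path per line to OUTPUT; fail if any FILE is missing or empty.
make_fofn() {
    local out=$1 f
    shift
    : > "${out}"                      # start with an empty list
    for f in "$@"; do
        if [ ! -s "${f}" ]; then
            echo "missing or empty read file: ${f}" >&2
            return 1
        fi
        # Build the absolute path from the file's directory and base name.
        echo "$(cd "$(dirname "${f}")" && pwd)/$(basename "${f}")" >> "${out}"
    done
}

# Example call, using the read file names from the command above:
# make_fofn sgs.fofn reads1_R1.fq reads1_R2.fq reads2_R1.fq reads2_R2.fq
```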
2: Create the configuration file run.cfg
genome=input.genome.fa
echo -e "task = best\ngenome = $genome\nsgs_fofn = sgs.fofn" > run.cfg
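Because `echo -e` behaves differently from shell to shell, a here-document is a more portable way to write the same three settings. A minimal sketch, assuming the same placeholder genome file name as above:

```shell
#!/usr/bin/env bash
genome=input.genome.fa   # placeholder: path to the assembly to polish

# Write the same three settings as the echo command, one per line.
cat > run.cfg <<EOF
task = best
genome = ${genome}
sgs_fofn = sgs.fofn
EOF
```

The delimiter is left unquoted on purpose so that `${genome}` is expanded when the file is written.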
3: Run the program
nextPolish run.cfg
Output
- Polished sequence: /path_to_work_directory/genome.nextpolish.fasta
- Statistics file: /path_to_work_directory/genome.nextpolish.fasta.stat
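After a run, a short check can confirm that both files were written and give a first look at the result. The helper below is a hypothetical sketch, not a NextPolish command. It only assumes the two file names listed above and that the sequence file is ordinary FASTA, where each record begins with a ">" line.

```shell
#!/usr/bin/env bash
# check_outputs WORKDIR
# Succeed only if both result files exist and are non-empty,
# then report how many sequences the polished FASTA contains.
check_outputs() {
    local workdir=$1 fasta stat
    fasta="${workdir}/genome.nextpolish.fasta"
    stat="${fasta}.stat"
    if [ ! -s "${fasta}" ] || [ ! -s "${stat}" ]; then
        echo "missing or empty output in ${workdir}" >&2
        return 1
    fi
    # Count FASTA header lines.
    echo "sequences: $(grep -c '^>' "${fasta}")"
}

# Example call: check_outputs /path_to_work_directory
```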
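Tip
You can also run your own alignment pipeline (for example, BWA plus samtools) and call only the NextPolish polishing module. On a local system this is usually faster, and the polished genome is of the same quality as with the default pipeline.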
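Important
Polish the raw genome with long reads (such as HiFi or ONT) first, then refine it with short reads. This order keeps short reads from being mapped incorrectly in regions with a high error rate. Draft assemblies produced without a consensus step, such as those from miniasm, are especially prone to such mismapping.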
04 Examples
4.1 Polishing a genome with short reads
Prepare the sgs_fofn file
ls reads1_R1.fq reads1_R2.fq reads2_R1.fq reads2_R2.fq > sgs.fofn
Create the run.cfg configuration file
[General]
job_type = local
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 6
multithread_jobs = 5
genome = ./raw.genome.fasta # genome file
genome_size = auto
workdir = ./01_rundir
polish_options = -p {multithread_jobs}

[sgs_option]
sgs_fofn = ./sgs.fofn
sgs_options = -max_depth 100 -bwa
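Before launching a long run, it can be worth confirming that the paths named in run.cfg actually exist. The helper below is a hypothetical convenience, not part of NextPolish. It assumes the simple `key = value` layout shown above with optional trailing `#` comments, and it checks relative paths against the current directory, which is also an assumption.

```shell
#!/usr/bin/env bash
# cfg_get FILE KEY
# Print the value of the first "KEY = value" line, without any trailing "# comment".
cfg_get() {
    awk -v key="$2" -F'=' '
        { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }   # trimmed key name
        k == key {
            v = substr($0, index($0, "=") + 1)       # everything after the first "="
            sub(/[ \t]*#.*$/, "", v)                 # drop a trailing comment
            gsub(/^[ \t]+|[ \t]+$/, "", v)           # trim surrounding whitespace
            print v
            exit
        }' "$1"
}

# Example: warn when the genome or the read list named in run.cfg is absent.
if [ -f run.cfg ]; then
    for key in genome sgs_fofn; do
        path=$(cfg_get run.cfg "${key}")
        [ -e "${path}" ] || echo "warning: ${key} points to a missing file: ${path}"
    done
fi
```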
Run
nextPolish run.cfg
Output after polishing
- Sequence: /path_to_work_directory/genome.nextpolish.fasta
- Statistics: /path_to_work_directory/genome.nextpolish.fasta.stat
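Tip
A user-defined alignment pipeline runs faster than the default pipeline on a local system, and the polished genome is as accurate as with the default pipeline.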
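Example of a user-defined alignment pipeline (short reads)
# Set input and parameters
round=2
threads=20
read1=reads_R1.fastq.gz
read2=reads_R2.fastq.gz
input=input.genome.fa
for ((i=1; i<=${round}; i++)); do
    # Step 1: index the genome, align the reads, and polish with task 1
    bwa index ${input}
    bwa mem -t ${threads} ${input} ${read1} ${read2} | \
        samtools view --threads 3 -F 0x4 -b - | \
        samtools fixmate -m --threads 3 - - | \
        samtools sort -m 2g --threads 5 - | \
        samtools markdup --threads 5 -r - sgs.sort.bam
    samtools index -@ ${threads} sgs.sort.bam
    samtools faidx ${input}
    python NextPolish/lib/nextpolish1.py -g ${input} -t 1 -p ${threads} -s sgs.sort.bam > genome.polishtemp.fa
    input=genome.polishtemp.fa
    # Step 2: realign to the step 1 result and polish with task 2
    bwa index ${input}
    bwa mem -t ${threads} ${input} ${read1} ${read2} | \
        samtools view --threads 3 -F 0x4 -b - | \
        samtools fixmate -m --threads 3 - - | \
        samtools sort -m 2g --threads 5 - | \
        samtools markdup --threads 5 -r - sgs.sort.bam
    samtools index -@ ${threads} sgs.sort.bam
    samtools faidx ${input}
    # Writing to a different file avoids truncating the input being read
    python NextPolish/lib/nextpolish1.py -g ${input} -t 2 -p ${threads} -s sgs.sort.bam > genome.nextpolish.fa
    input=genome.nextpolish.fa
done
# Final polished genome: genome.nextpolish.fa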
4.2 Polishing a genome with long reads
Prepare the lgs_fofn file
ls reads1.fq reads2.fa.gz > lgs.fofn
Create the run.cfg configuration file
[General]
job_type = local
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 6
multithread_jobs = 5
genome = ./raw.genome.fasta
genome_size = auto
workdir = ./01_rundir
polish_options = -p {multithread_jobs}

[lgs_option]
lgs_fofn = ./lgs.fofn
lgs_options = -min_read_len 1k -max_depth 100
lgs_minimap2_options = -x map-ont
Run
nextPolish run.cfg
Output after polishing
- Sequence: /path_to_work_directory/genome.nextpolish.fasta
- Statistics: /path_to_work_directory/genome.nextpolish.fasta.stat
Example of a user-defined alignment pipeline (long reads)
# Set input and parameters
round=2
threads=20
read=read.fasta.gz
read_type=ont # one of clr (PacBio), hifi (high-accuracy PacBio), or ont (Nanopore)
declare -A mapping_option=(["clr"]="map-pb" ["hifi"]="asm20" ["ont"]="map-ont")
input=input.genome.fa
for ((i=1; i<=${round}; i++)); do
    # Align the long reads and sort the alignments
    minimap2 -ax ${mapping_option[$read_type]} -t ${threads} ${input} ${read} | \
        samtools sort - -m 2g --threads ${threads} -o lgs.sort.bam
    samtools index lgs.sort.bam
    ls `pwd`/lgs.sort.bam > lgs.sort.bam.fofn
    # Polish the genome
    python NextPolish/lib/nextpolish2.py -g ${input} -l lgs.sort.bam.fofn -r ${read_type} -p ${threads} -sp -o genome.nextpolish.fa
    # Between rounds, feed the polished genome back in as the next input
    if ((i!=${round})); then
        mv genome.nextpolish.fa genome.nextpolishtmp.fa
        input=genome.nextpolishtmp.fa
    fi
done
# Final polished genome: genome.nextpolish.fa
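The `declare -A` associative array needs Bash 4 or later, so the script fails on systems that still ship Bash 3.2, such as stock macOS. A `case` statement gives the same lookup without that requirement and also rejects an unknown read type. This is only a sketch; the function name `preset_for` is made up for illustration.

```shell
#!/usr/bin/env bash
# preset_for READ_TYPE
# Print the minimap2 preset that the pipeline above associates with READ_TYPE.
preset_for() {
    case "$1" in
        clr)  echo "map-pb" ;;
        hifi) echo "asm20" ;;
        ont)  echo "map-ont" ;;
        *)
            echo "unknown read type: $1 (expected clr, hifi, or ont)" >&2
            return 1
            ;;
    esac
}

# Example use in place of ${mapping_option[$read_type]}:
# minimap2 -ax "$(preset_for "${read_type}")" -t ${threads} ${input} ${read}
```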
4.3 Polishing with short reads and long reads together
- Prepare sgs.fofn and lgs.fofn
- Create run.cfg as follows:
[General]
job_type = local
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 6
multithread_jobs = 5
genome = ./raw.genome.fasta
genome_size = auto
workdir = ./01_rundir
polish_options = -p {multithread_jobs}

[sgs_option]
sgs_fofn = ./sgs.fofn
sgs_options = -max_depth 100 -bwa

[lgs_option]
lgs_fofn = ./lgs.fofn
lgs_options = -min_read_len 1k -max_depth 100
lgs_minimap2_options = -x map-ont
4.4 Polishing with short reads and HiFi long reads together
- Prepare sgs.fofn and hifi.fofn
- Create run.cfg as follows:
[General]
job_type = local
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 6
multithread_jobs = 5
genome = ./raw.genome.fasta
genome_size = auto
workdir = ./01_rundir
polish_options = -p {multithread_jobs}

[sgs_option]
sgs_fofn = ./sgs.fofn
sgs_options = -max_depth 100 -bwa

[hifi_option]
hifi_fofn = ./hifi.fofn
hifi_options = -min_read_len 1k -max_depth 100
hifi_minimap2_options = -x asm20