A Simple Introduction to Read Simulators
Read simulators are widely used in the research community to create synthetic and mock datasets for analysis. In this article, I will introduce some recently proposed and commonly used read simulators.
DNA Sequencing and Reads
If you have come across my previous article on DNA Sequence Data Analysis, you may have read about DNA sequencing. Sequencing is the process that determines the precise order of nucleotides in a given DNA molecule, that is, the order of the four bases adenine, guanine, cytosine and thymine in a strand of DNA. DNA sequencing is used to determine the sequence of individual genes, full chromosomes or entire genomes of an organism.
Special machines known as sequencing machines are used to extract short random DNA sequences from the genome we wish to determine (the target genome). Current DNA sequencing technologies cannot read a whole genome at once. Instead, they read small pieces of between 100 and 30,000 bases, depending on the technology used. These short pieces are called reads.
Read Simulators
Sequencing machines may not always be available when we need them, and we may not be able to get hold of real-world samples to sequence. This is where read simulators come in handy for research purposes. Read simulators mimic sequencing machines to simulate reads. They come with pre-defined statistical models that mimic the error rates of particular sequencing machines. Furthermore, we can provide our own error models as well (different rates of insertions, deletions and substitutions).
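To make the idea of an error model concrete, here is a toy sketch in Python. It is not how any of the simulators named in this article work internally; it only assumes a uniform, position-independent rate for each error type, and the function names `simulate_read` and `simulate_reads` are made up for illustration.

```python
import random

BASES = "ACGT"


def simulate_read(reference, read_length, sub_rate, ins_rate, del_rate, rng):
    """Cut one fragment from a random position and add random errors to it."""
    start = rng.randrange(len(reference) - read_length + 1)
    fragment = reference[start:start + read_length]
    read = []
    for base in fragment:
        r = rng.random()
        if r < del_rate:
            # Deletion: the base is dropped from the read.
            continue
        if r < del_rate + ins_rate:
            # Insertion: a random extra base is emitted before the true base.
            read.append(rng.choice(BASES))
            read.append(base)
        elif r < del_rate + ins_rate + sub_rate:
            # Substitution: the base is replaced by a different one.
            read.append(rng.choice([b for b in BASES if b != base]))
        else:
            read.append(base)
    return "".join(read)


def simulate_reads(reference, n_reads, read_length,
                   sub_rate=0.01, ins_rate=0.005, del_rate=0.005, seed=0):
    """Simulate n_reads reads; the default rates are arbitrary illustrative values."""
    rng = random.Random(seed)
    return [
        simulate_read(reference, read_length, sub_rate, ins_rate, del_rate, rng)
        for _ in range(n_reads)
    ]


# A made-up 1,000 bp reference, only for demonstration.
reference = "".join(random.Random(1).choice(BASES) for _ in range(1000))
reads = simulate_reads(reference, n_reads=5, read_length=50)
```

Real simulators go much further than this, for example by modelling quality scores and position-dependent error rates learned from actual sequencing runs.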
Estimating Sequencing Coverage
Sequencing coverage is defined as the average number of reads that cover each base of the reference genome. Estimating the sequencing coverage is very important when you are simulating datasets. The coverage equation is defined as follows.
C = LN / G
- C is the sequencing coverage
- G is the length of the genome
- L is the read length
- N is the number of reads
For example, if you have a genome of length 5 Mbp and you simulate 1,000,000 HiSeq 2000 reads (read length is 100 bp), then you will get a sequencing coverage of 20x as follows.
C = LN / G = 100 * 1,000,000 / 5,000,000 = 20x
Here, each position of the reference genome is covered by 20 reads on average.
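The coverage equation is easy to wrap in a small helper. The sketch below simply restates C = LN / G in Python; the function name `sequencing_coverage` is my own choice and the example values are the ones used above.

```python
def sequencing_coverage(read_length, n_reads, genome_length):
    """Average coverage C = L * N / G, with all lengths in base pairs."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return read_length * n_reads / genome_length


# 1,000,000 reads of 100 bp on a 5 Mbp genome.
print(sequencing_coverage(100, 1_000_000, 5_000_000))  # 20.0
```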
Estimating Abundance
The abundance of a species in a dataset is defined as the fraction of reads that belong to that species. For example, if a dataset has 10,000,000 reads and 1,000,000 of them belong to E. coli, then the abundance of E. coli is 0.1.
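Note that coverage and abundance are not the same: coverage depends on the length of the genome, whereas abundance depends only on the fraction of reads that belong to a species.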
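A short sketch can make the difference visible. In the example below, two hypothetical species contribute the same number of reads, so they have the same abundance, but their genome lengths differ, so their coverages differ. The species names, genome lengths and read counts are invented for illustration.

```python
def abundances(read_counts):
    """Fraction of all reads that belong to each species."""
    total = sum(read_counts.values())
    return {species: count / total for species, count in read_counts.items()}


def coverage(read_length, n_reads, genome_length):
    """Average coverage C = L * N / G."""
    return read_length * n_reads / genome_length


# Hypothetical dataset: 100 bp reads, 500,000 reads per species.
read_counts = {"species_a": 500_000, "species_b": 500_000}
genome_lengths = {"species_a": 5_000_000, "species_b": 2_000_000}

abundance_by_species = abundances(read_counts)
coverage_by_species = {
    species: coverage(100, read_counts[species], genome_lengths[species])
    for species in read_counts
}
# Both abundances are 0.5, but the coverages are 10x and 25x.
```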
Short Read Simulators
With the popularity of next-generation sequencing (NGS) technologies, many NGS read simulators have been developed. Many of the popular short read simulators are designed to mimic reads from the Illumina, 454 and SOLiD platforms. Listed below are some popular short read simulators. Links to their publications are provided as well.
- MetaSim
- wgsim
- SimNGS
- ArtificialFastqGenerator
- InSilicoSeq
Long Read Simulators
With the advancements in sequencing technologies, scientists have shown an increasing interest in third-generation sequencing (TGS) technologies. Many of the popular long read simulators are designed to mimic reads from the two main TGS technologies: (1) Pacific Biosciences (PacBio) and (2) Oxford Nanopore (ONT). Listed below are some of the popular and recently introduced PacBio and ONT simulators. Links to their publications are provided as well.
PacBio Simulators
- PBSIM
- LongISLND
- SimLoRD
- NPBSS
- PaSS
ONT Simulators
- NanoSim
- Nanopore SimulatION
- DeepSimulator
- DeepSimulator1.5
InSilicoSeq
I have been using InSilicoSeq a lot in my work and I find it very intuitive and easy to use. I will walk you through some sample commands to simulate reads. You can easily install InSilicoSeq using conda or pip.
conda install -c bioconda insilicoseq
OR
pip install InSilicoSeq
Simulate reads by providing the number of reads
Assume that you have a single reference genome and you want to simulate 1 million Illumina MiSeq reads. Given below is a sample command you can run using InSilicoSeq.
iss generate --model miseq --genomes ref.fasta --n_reads 1M --cpus 8 --output reads
Simulate reads by providing the coverage
Assume that you have two reference genome files, ref1.fasta and ref2.fasta. You want to simulate 30x coverage from ref1 and 10x coverage from ref2. You will need to create a tab-separated file named coverages.tsv and add the coverage details as follows.
ref1_id	30
ref2_id	10
Here, ref1_id and ref2_id refer to the sequence identifiers in the files ref1.fasta and ref2.fasta. If you download the reference genomes from NCBI, the identifiers will consist of letters and numbers and may look something like NC_007712.1 or CP001844.2. These identifiers are NCBI accession numbers provided for each reference genome.
Now you can simulate the reads using the following command.
iss generate --model miseq --genomes ref1.fasta ref2.fasta --coverage_file coverages.tsv --cpus 8 --output reads
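With many genomes, writing the coverage file by hand gets tedious. Below is a minimal sketch that builds it from the FASTA files themselves. It assumes that the identifier of a sequence is the first whitespace-delimited token of its header line, and that every sequence in a file should get the same coverage. The helpers `fasta_ids` and `write_coverage_file` are hypothetical names of my own and are not part of InSilicoSeq.

```python
def fasta_ids(path):
    """Return the identifier of every record in a FASTA file."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    # Assumption: the ID is the first token of the header.
                    ids.append(fields[0])
    return ids


def write_coverage_file(genome_coverages, out_path):
    """Write one 'identifier<TAB>coverage' line per FASTA record.

    genome_coverages maps a FASTA file path to the target coverage
    for all records in that file.
    """
    with open(out_path, "w") as out:
        for fasta_path, target_coverage in genome_coverages.items():
            for record_id in fasta_ids(fasta_path):
                out.write(f"{record_id}\t{target_coverage}\n")
```

For the example above, you would call `write_coverage_file({"ref1.fasta": 30, "ref2.fasta": 10}, "coverages.tsv")`. It is worth checking the resulting file against the identifiers your simulator actually reports before running a large simulation.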
Simulate reads by providing the abundance
Assume that you have two reference genome files, ref1.fasta and ref2.fasta. You want to simulate 0.4 abundance from ref1 and 0.6 abundance from ref2. Note that the sum of all the abundance values should be 1.0. Similar to coverage, you will need to create a tab-separated file named abundance.tsv and add the abundance details as follows.
ref1_id	0.4
ref2_id	0.6
Now you can simulate the reads using the following command.
iss generate --model miseq --genomes ref1.fasta ref2.fasta --abundance_file abundance.tsv --cpus 8 --output reads
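Since the abundance values have to sum to 1.0, it helps to normalise and check them before writing the file. The sketch below shows one way to do this. The function names `normalise_abundances` and `write_abundance_file` are hypothetical, and the tolerance used for the sum check is an arbitrary choice of mine.

```python
import math


def normalise_abundances(weights):
    """Scale arbitrary non-negative weights so that they sum to 1.0."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {record_id: w / total for record_id, w in weights.items()}


def write_abundance_file(abundances, out_path):
    """Write one 'identifier<TAB>abundance' line per genome."""
    if not math.isclose(sum(abundances.values()), 1.0, abs_tol=1e-9):
        raise ValueError("abundance values should sum to 1.0")
    with open(out_path, "w") as out:
        for record_id, abundance in abundances.items():
            out.write(f"{record_id}\t{abundance}\n")
```

For example, `normalise_abundances({"ref1_id": 2, "ref2_id": 3})` gives abundances of 0.4 and 0.6, which can then be passed to `write_abundance_file` together with the output path abundance.tsv.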
You can read more details from the InSilicoSeq documentation.
PBSIM
PBSIM is a PacBio read simulator which provides both sampling-based and model-based simulations. I will walk you through some sample commands to simulate reads using PBSIM.
Model-based simulation
For model-based simulation, you can run the following command.
pbsim --data-type CLR --depth 100 --length-min 10000 --length-max 20000 --prefix test --model_qc data/model_qc_clr ref.fasta
The model can be found in the PBSIM folder at PBSIM-PacBio-Simulator/data/model_qc_clr. The data type CLR refers to Continuous Long Read, which simulates long reads with high error rates. The other data type, CCS, refers to Circular Consensus Read, which simulates short reads with low error rates.
Sampling-based simulation
For sampling-based simulation, you can run the following command.
pbsim --data-type CLR --depth 100 --sample-fastq sample/sample.fastq sample/sample.fasta
The sample FASTQ file can be found in the PBSIM folder at PBSIM-PacBio-Simulator/sample/sample.fastq. You can use your own FASTQ file as well.
You can read more details from the PBSIM documentation.
SimLoRD
SimLoRD is a TGS read simulator based on the Pacific Biosciences SMRT error model. I have frequently used SimLoRD to simulate PacBio datasets for my work. I will walk you through some sample commands to simulate reads using SimLoRD.
Simulate fixed-length reads by providing the coverage
Assume that you have a reference genome and you want to simulate fixed-length reads with 60x coverage. Given below is a sample command you can run using SimLoRD.
simlord --read-reference ref.fasta --coverage 60 --fixed-readlength 5000 output_prefix
Simulate fixed-length reads by providing the number of reads
Assume that you have a reference genome and you want to simulate 2000 fixed-length reads. Given below is a sample command you can run using SimLoRD.
simlord --read-reference ref.fasta --num-reads 2000 --fixed-readlength 5000 output_prefix
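The two SimLoRD modes are linked by the coverage equation from earlier: with a fixed read length, a target coverage corresponds to N = CG / L reads. The sketch below shows the conversion, assuming a hypothetical 5 Mbp genome; the function name `reads_for_coverage` is my own and is not part of SimLoRD.

```python
import math


def reads_for_coverage(coverage, genome_length, read_length):
    """Number of fixed-length reads needed to reach a target coverage."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    # Round up so the target coverage is reached rather than undershot.
    return math.ceil(coverage * genome_length / read_length)


# 60x coverage of a hypothetical 5 Mbp genome with 5,000 bp reads.
n_reads = reads_for_coverage(60, 5_000_000, 5000)
print(n_reads)  # 60000
```

Going the other way, 2,000 reads of 5,000 bp on the same hypothetical genome would give only 2x coverage, so it is worth checking which of the two quantities you actually want to fix.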
You can also set a minimum length for the reads using the --min-readlength parameter during the simulation. You can read more from the SimLoRD documentation.
Final Thoughts
Read simulators have given us the opportunity to simulate reads ranging from zero errors to very high error rates. Also, they have allowed us to create synthetic and mock datasets mimicking different sequencing machines and different species compositions.
Hope you found this article useful and informative as a starting point towards using read simulators. Feel free to use these tools for your projects and research work as they are freely available.
Cheers, and stay safe!
Translated from: https://medium.com/computational-biology/a-simple-introduction-to-read-simulators-bbeff4f0c0c6