A Guide to Biopython
When you hear the word Biopython, what is the first thing that comes to mind? A Python library for handling biological data? You are correct! Biopython provides a set of tools for performing bioinformatics computations on biological data such as DNA and protein sequences. I have been using Biopython ever since I started studying bioinformatics, and it has never let me down. It is an amazing library that provides a wide range of functions, from reading large files of biological data to aligning sequences. In this article, I will introduce you to some basic functions of Biopython that can make implementations much easier, often with just a single call.
Getting started
The latest version available as I write this article is biopython-1.77, released in May 2020.
You can install Biopython using pip
pip install biopython
or using conda.
conda install -c conda-forge biopython
You can test whether Biopython is properly installed by executing the following line in the Python interpreter.
import Bio
If you get an error such as ImportError: No module named Bio, then you haven't installed Biopython properly in your working environment. If no error message appears, we are good to go.
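As a slightly more informative check, the sketch below prints the installed version. It assumes a standard installation, where the package exposes its version string as Bio.__version__.

```python
import Bio

# The version string of the installed Biopython package, e.g. "1.77"
installed_version = Bio.__version__
print("Biopython version:", installed_version)
```

If the printed version differs from the one mentioned above, some examples may need small adjustments.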
In this article, I will walk you through some examples where Seq, SeqRecord and SeqIO come in handy. We will go through the functions that perform the following tasks.
- Creating a sequence
- Get the reverse complement of a sequence
- Count the number of occurrences of a nucleotide
- Find the starting index of a subsequence
- Reading a sequence file
- Writing sequences to a file
- Convert a FASTQ file to a FASTA file
- Separate sequences by ids from a list of ids
1. Creating a sequence
To create your own sequence, you can use the Biopython Seq object. Here is an example.
>>> from Bio.Seq import Seq
>>> my_sequence = Seq("ATGACGTTGCATG")
>>> print("The sequence is", my_sequence)
The sequence is ATGACGTTGCATG
>>> print("The length of the sequence is", len(my_sequence))
The length of the sequence is 13
2. Get the reverse complement of a sequence
You can easily get the reverse complement of a sequence with a single call to reverse_complement().
>>> print("The reverse complement of the sequence is", my_sequence.reverse_complement())
The reverse complement of the sequence is CATGCAACGTCAT
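To make clear what that single call computes, here is a minimal pure-Python sketch of the same operation. The COMPLEMENT table and the reverse_complement helper are written for illustration only (they are not Biopython's implementation) and handle only the four unambiguous DNA bases.

```python
# Map each DNA base to its Watson-Crick partner (unambiguous bases only)
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna: str) -> str:
    """Walk the sequence backwards and complement every base."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


result = reverse_complement("ATGACGTTGCATG")
print(result)  # CATGCAACGTCAT
```

In practice the Seq method is preferable, since it also handles ambiguous nucleotide codes.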
3. Count the number of occurrences of a nucleotide
You can get the number of occurrences of a particular nucleotide using the count() method.
>>> print("The number of As in the sequence", my_sequence.count("A"))
The number of As in the sequence 3
4. Find the starting index of a subsequence
You can find the starting index of a subsequence using the find() method. Indices are zero-based.
>>> print("Found TTG in the sequence at index", my_sequence.find("TTG"))
Found TTG in the sequence at index 6
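Like Python's str.find(), the Seq method returns only the first match and gives -1 when the subsequence is absent. The sketch below shows one way to collect every match; find_all is a hypothetical helper written for this example, not a Biopython function.

```python
from Bio.Seq import Seq


def find_all(sequence, sub):
    """Return the start index of every (possibly overlapping) match of sub."""
    positions = []
    start = sequence.find(sub)
    while start != -1:
        positions.append(start)
        # Resume the search one position after the previous match
        start = sequence.find(sub, start + 1)
    return positions


my_sequence = Seq("ATGACGTTGCATG")
atg_positions = find_all(my_sequence, "ATG")
missing = my_sequence.find("CCC")
print(atg_positions)  # [0, 10]
print(missing)        # -1
```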
5. Reading a sequence file
Biopython's SeqIO (Sequence Input/Output) interface can be used to read sequence files. The parse() function takes a file (a file handle or a filename) together with a format name and returns a SeqRecord iterator. The following is an example of how to read a FASTA file.
from Bio import SeqIO

for record in SeqIO.parse("example.fasta", "fasta"):
    print(record.id)
record.id will return the identifier of the sequence. record.seq will return the sequence itself. record.description will return the sequence description.
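The sketch below shows these three attributes side by side. To stay self-contained it parses FASTA text from an in-memory StringIO handle instead of a file; the two records and their header text are made up for the example.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up FASTA records held in memory instead of a file on disk
fasta_text = """>seq1 first example sequence
ATGACGTTGCATG
>seq2 second example sequence
GGTGCA
"""

records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
for record in records:
    # id is the first word of the header; description is the whole header line
    print(record.id, record.description, str(record.seq), len(record), sep=" | ")
```

Wrapping the iterator in list() loads every record into memory, which is fine for small inputs; for large files, iterating directly over SeqIO.parse() avoids that.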
6. Writing sequences to a file
Biopython's SeqIO (Sequence Input/Output) interface can also be used to write sequences to files. The following is an example where a list of sequences is written to a FASTA file.
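from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Bio.Alphabet was removed in Biopython 1.78, so Seq objects are created without an alphabet
sequences = ["AAACGTGG", "TGAACCG", "GGTGCA", "CCAATGCG"]

# Use the position of each sequence in the list as its id; an empty description keeps the header short
records = (SeqRecord(Seq(seq), id=str(index), description="") for index, seq in enumerate(sequences))

with open("example.fasta", "w") as output_handle:
    SeqIO.write(records, output_handle, "fasta")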
This code will result in a FASTA file with sequence ids starting from 0. If you want to give a custom id and a description you can create the records as follows.
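from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

sequences = ["AAACGTGG", "TGAACCG", "GGTGCA", "CCAATGCG"]
new_sequences = []
i = 1
for sequence in sequences:
    # The id and description values below are illustrative placeholders
    record = SeqRecord(Seq(sequence), id="Seq_" + str(i), description="A custom description")
    new_sequences.append(record)
    i += 1

with open("example.fasta", "w") as output_handle:
    SeqIO.write(new_sequences, output_handle, "fasta")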
The SeqIO.write() function returns the number of sequences written.
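A quick way to see that return value, and the FASTA text that gets produced, is to write to an in-memory handle. This is only a sketch: the ids and the description are invented for the example.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two made-up records with custom ids and a shared description
records = [
    SeqRecord(Seq("AAACGTGG"), id="Seq_1", description="example"),
    SeqRecord(Seq("TGAACCG"), id="Seq_2", description="example"),
]

buffer = StringIO()  # stands in for a file opened in write mode
written = SeqIO.write(records, buffer, "fasta")
fasta_output = buffer.getvalue()

print("Records written:", written)
print(fasta_output)
```

Each header line consists of the record id followed by its description.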
7. Convert a FASTQ file to a FASTA file
We need to convert DNA data file formats in certain applications. For example, we can do file format conversions from FASTQ to FASTA as follows.
from Bio import SeqIO

with open("path/to/fastq/file.fastq", "r") as input_handle, open("path/to/fasta/file.fasta", "w") as output_handle:
    sequences = SeqIO.parse(input_handle, "fastq")
    count = SeqIO.write(sequences, output_handle, "fasta")

print("Converted %i records" % count)
If you want to convert a GenBank file to FASTA format, you can proceed in the same way, using "genbank" as the input format.
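from Bio import SeqIO

# Both paths are placeholders; replace them with the locations of your own files
with open("path/to/genbank/file.gb", "r") as input_handle, open("path/to/fasta/file.fasta", "w") as output_handle:
    sequences = SeqIO.parse(input_handle, "genbank")
    count = SeqIO.write(sequences, output_handle, "fasta")

print("Converted %i records" % count)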
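The parse-then-write pattern can also be collapsed into one call with SeqIO.convert(), which takes the input, the input format, the output and the output format, and returns the number of records converted. The sketch below runs on in-memory handles; the two FASTQ reads are made up for the example.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up FASTQ reads (four lines each: header, sequence, "+", qualities)
fastq_text = """@read1
ACGT
+
IIII
@read2
GGCA
+
IIII
"""

input_handle = StringIO(fastq_text)
output_handle = StringIO()

# Parse and write in one step; filenames can be passed instead of handles
count = SeqIO.convert(input_handle, "fastq", output_handle, "fasta")
fasta_lines = output_handle.getvalue().splitlines()

print("Converted %i records" % count)
print(fasta_lines)
```

The quality scores are dropped in the process, because FASTA has no place to store them.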
8. Separate sequences by ids from a list of ids
Assume that you have a list of sequence identifiers in a file named list.lst, and you want to extract the corresponding sequences from a sequence file (a FASTQ file in this example). You can run the following to write those sequences to a new file.
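from Bio import SeqIO

path = ""  # directory prefix for the files; undefined in the original, adjust as needed
with open(path + "list.lst") as id_file:
    ids = set(line.strip() for line in id_file)  # one id per line

with open(path + "list.fq", mode="a") as my_output:
    for seq in SeqIO.parse(path + "list_sequences.fq", "fastq"):
        if seq.id in ids:
            my_output.write(seq.format("fastq"))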
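The same filtering can be expressed as a generator passed straight to SeqIO.write(). The sketch below is a self-contained variant that runs on in-memory data; the read names and the wanted_ids set are invented for the example.

```python
from io import StringIO

from Bio import SeqIO

# Three made-up FASTQ reads standing in for list_sequences.fq
fastq_text = """@read1
ACGT
+
IIII
@read2
GGCA
+
IIII
@read3
TTAG
+
IIII
"""

# Stands in for the ids read from list.lst
wanted_ids = {"read1", "read3"}

# Lazily keep only the records whose id is in the wanted set
selected = (
    record
    for record in SeqIO.parse(StringIO(fastq_text), "fastq")
    if record.id in wanted_ids
)

output_handle = StringIO()
kept = SeqIO.write(selected, output_handle, "fastq")
kept_text = output_handle.getvalue()

print("Kept %i records" % kept)
```

Because the generator is consumed lazily, only one record is held in memory at a time, which matters for large sequencing files.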
Final Thoughts
I hope this gave you an idea of how to use the Seq, SeqRecord and SeqIO functions of Biopython, and that they will be useful for your research work.
Thank you for reading. I would love to hear your thoughts. Stay tuned for the next part of this article with more usages and Biopython functions.
Cheers, and stay safe!
Translated from: https://medium.com/computational-biology/newbies-guide-to-biopython-part-1-9ec82c3dfe8f