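How can I use Python to automatically extract all gene sequences from the Botrytis cinerea genome FASTA and GFF3 files and save them in standard FASTA format?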

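The code below extracts gene sequences from `Botrytis_cinerea_genome.fa` (gene positions are looked up in `Botrytis_cinerea.gff3`) and writes them out in FASTA format. Each step is commented to explain what it does and why:

```python
# Read the genome file
def read_genome(file_path):
    # Dictionary mapping each sequence ID (chromosome or contig) to its sequence.
    # GFF3 coordinates are relative to a single sequence, so the sequences
    # must be kept apart rather than joined into one long string.
    genome = {}
    seq_id = None
    # Open the genome file in read-only mode
    with open(file_path, 'r') as file:
        # Read the file line by line
        for line in file:
            line = line.strip()
            if line.startswith('>'):
                # A '>' line is a FASTA header; the ID is the first word after '>'
                seq_id = line[1:].split()[0]
                genome[seq_id] = []
            elif line and seq_id is not None:
                # Collect sequence lines in a list (cheaper than repeated string +=)
                genome[seq_id].append(line)
    # Join the pieces of each sequence and return the dictionary
    return {name: ''.join(parts) for name, parts in genome.items()}


# Read the GFF3 file and collect gene positions
def read_gff3(file_path):
    # List of (seqid, start, end, strand) tuples, one per gene
    gene_positions = []
    # Open the GFF3 file in read-only mode
    with open(file_path, 'r') as file:
        for line in file:
            # '##FASTA' starts an optional embedded sequence section; stop there
            if line.startswith('##FASTA'):
                break
            # Skip comment and directive lines, which start with '#'
            if not line.startswith('#'):
                # Split the line into tab-separated columns
                fields = line.rstrip('\n').split('\t')
                # Feature lines have 9 columns; column 3 is the feature type.
                # Only 'gene' features are kept, so mRNA, exon and CDS rows are ignored.
                if len(fields) >= 9 and fields[2] == 'gene':
                    seqid = fields[0]
                    # GFF3 coordinates are 1-based and inclusive, while Python slices
                    # are 0-based and end-exclusive: subtract 1 from start, keep end.
                    start = int(fields[3]) - 1
                    end = int(fields[4])
                    # Column 7 is the strand ('+' or '-')
                    strand = fields[6]
                    gene_positions.append((seqid, start, end, strand))
    # Return the list of gene positions
    return gene_positions


# Extract the gene sequences and write them in FASTA format
def extract_genes(genome, gene_positions, output_file):
    # Translation table used to complement bases of minus-strand genes
    complement = str.maketrans('ACGTNacgtn', 'TGCANtgcan')
    # Open the output file in write mode
    with open(output_file, 'w') as out_file:
        # Walk through the gene list, keeping the index for the record name
        for i, (seqid, start, end, strand) in enumerate(gene_positions):
            # Skip genes whose sequence ID does not appear in the FASTA file
            if seqid not in genome:
                continue
            # Slice the gene out of the sequence it belongs to
            gene_sequence = genome[seqid][start:end]
            # Genes on the minus strand are reverse-complemented
            if strand == '-':
                gene_sequence = gene_sequence.translate(complement)[::-1]
            # Write the FASTA header line; here simply 'gene_' plus the index
            out_file.write(f'>gene_{i + 1}\n')
            # Write the gene sequence followed by a newline
            out_file.write(gene_sequence + '\n')


# Main program
# Path to the genome file
genome_file = r'C:\Users\Administrator\Desktop\Botrytis_cinerea_genome.fa'
# Path to the GFF3 file
gff3_file = r'C:\Users\Administrator\Desktop\Botrytis_cinerea.gff3'
# Path to the output file
output_file = r'C:\Users\Administrator\Desktop\extracted_genes.fasta'
# Read the genome file into a dictionary of sequences
genome = read_genome(genome_file)
# Read the GFF3 file into a list of gene positions
gene_positions = read_gff3(gff3_file)
# Extract each gene and write it to the output file
extract_genes(genome, gene_positions, output_file)
```

### Summary of the code

This code does three things:

1. It reads the genome file and stores each sequence (chromosome or contig) in a dictionary under its sequence ID.
2. It reads the GFF3 file and collects the sequence ID, start, end and strand of every `gene` feature in a list.
3. It slices each gene out of the sequence it belongs to, reverse-complements genes on the minus strand, and writes the results to the output file in FASTA format.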
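As an optional refinement of the extraction step summarized above, the sketch below names each record after the `ID` attribute in column 9 of the GFF3 line instead of a running index, and wraps the sequence at 60 characters per line, a common FASTA convention. The helper names (`slice_feature`, `gene_id`, `fasta_record`) and the toy sequence are made up for illustration and are not part of the original script. It also assumes the sequence contains only A/C/G/T/N bases.

```python
# Complement table for DNA, covering N and lower-case (soft-masked) bases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def slice_feature(seq, start, end, strand="+"):
    """Cut one feature out of a single chromosome sequence.

    start and end are GFF3 coordinates: 1-based, both inclusive.
    """
    sub = seq[start - 1:end]
    return reverse_complement(sub) if strand == "-" else sub


def gene_id(attributes, fallback):
    """Return the ID=... value from GFF3 column 9, or fallback if absent."""
    for item in attributes.split(";"):
        key, _, value = item.strip().partition("=")
        if key == "ID" and value:
            return value
    return fallback


def fasta_record(name, seq, width=60):
    """Format one FASTA record, wrapping the sequence at width columns."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


# Toy data, invented for illustration only.
chrom = "ATGCCGTA"
record = fasta_record(gene_id("ID=toy1;Name=x", "gene_1"),
                      slice_feature(chrom, 2, 5, "-"))
print(record, end="")
```

Running it prints a record named `toy1` whose sequence is `GGCA`, the reverse complement of bases 2 to 5 (`TGCC`) of the toy chromosome.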

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.
