Species evolution

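Gene family expansion and contraction analysis and species tree construction (Part 1)

First, select the protein datasets for the species of interest:

Arabidopsis_thaliana.fa; Citrus_grandis.fa; Dimocarpus_longan.fa; Durio_zibethinus.fa; Prunus_persica.fa; Vitis_vinifera.fa; Citrus_clementina.fa; Citrus_sinensis.fa; Diospyros_oleifera.fa; Malus_domestica.fa; Oryza_sativa.fa; Pyrus_communis.fa

Next, preprocess the data: remove redundancy by keeping only the longest transcript of each gene, so that alternative splicing isoforms are dropped (script: removeRedundantProteins.py):

    python3 removeRedundantProteins.py -i input.fa -o output.fa

Put the processed files into a single folder named "Dataset".

OrthoFinder was introduced in an earlier post, so I will not repeat the details here. Installation is painless: it can be installed directly with conda.

    nohup orthofinder -f Dataset -M msa -S diamond -T iqtree -t 24 -a 24 2> orthofinder.log &

OrthoFinder options:

    -t         number of parallel sequence search threads (default = 16)
    -a         number of parallel analysis threads (default = 1)
    -M         gene tree inference method; options: dendroblast, msa (default = dendroblast)
    -S         sequence search program (default = blast); options: blast, mmseqs, blast_gz, diamond. diamond is recommended because it is much faster. Recent OrthoFinder releases list diamond as the default, so check orthofinder -h for your version.
    -A         multiple sequence alignment program, only used with -M msa (default = mafft); options: muscle, mafft
    -T         tree inference method, only used with -M msa (default = fasttree); options: iqtree, raxml-ng, fasttree, raxml
    -s <file>  use a user-specified rooted species tree
    -I         MCL inflation parameter (default = 1.5)
    -x Info    output results in orthoXML format
    -p <dir>   write the temporary pickle files to <dir>
    -1         only perform a one-way sequence search (the flag is the digit one)
    -n         name to append to the results directory
    -h         print the help text

If you only need to find orthologs, the -f option alone is enough. This run also infers trees with the default settings (dendroblast according to the option list above; fasttree is only the default when -M msa is given).

    nohup orthofinder -f Dataset &

Adding -M msa -T iqtree makes OrthoFinder build the species tree by maximum likelihood with the chosen settings and write it out rooted. The resulting tree is referred to here as the STAG tree. Note: OrthoFinder's documentation describes STAG as the default species tree method, and for -M msa runs it describes a species tree inferred from a concatenated alignment of single-copy orthogroups, so check the run log to see which was used.

    nohup orthofinder -f Dataset -M msa -S diamond -T iqtree -t 24 -a 24 2> orthofinder.log &

There are many ways to build a phylogeny. Common choices are to use all protein sequences of each species to build a STAG species tree, or to build the species tree from single-copy orthologs only. OrthoFinder can output the sequences of the single-copy orthologs it finds, so other tree-building software and algorithms can be applied to them afterwards. The main tree-building methods are distance matrix methods, maximum parsimony, maximum likelihood and Bayesian inference. Maximum likelihood and Bayesian inference are the mainstream choices today. Bayesian inference is the most computationally expensive and the slowest, and its trees are often regarded as the most "realistic", but maximum likelihood is still the method used most in papers, and it also takes a long time.

OrthoFinder writes its results to a date-named folder under the OrthoFinder folder, for example ~/OrthoFinder/Results_May08. From there, Orthogroups.GeneCount.tsv can serve as the input for CAFE to analyse gene family expansion and contraction. SpeciesTree_rooted.txt is used as the inferred species tree, and r8s converts it into an ultrametric tree, that is, a time tree:

    python cafetutorial_prep_r8s.py -i SpeciesTree_rooted.txt -o r8s_ctl_file.txt -s 6650255 -p "Oryza_sativa,Arabidopsis_thaliana" -c "152"

Parameters:

    -i path_tree_file: path to the .txt file containing the tree in NEWICK format
    -s n_sites: number of sites in the alignment that was used to infer the species tree
    -p list_of_spp_tuples: list of tuples, each tuple being two species IDs whose MRCA's age we are constraining, e.g. [("ENSG00","ENSPTR"),("ENSFCA","ENSECA")]
    -c list_of_spp_cal_points: list of floats, one for each tuple in list_of_spp_tuples, e.g. [6.4,80]

In this run, -s is the number of sites in the alignment used to infer the species tree, -p is a pair of species in the tree whose divergence time is known, and -c is that divergence time, which can be looked up on the timetree website: 152 mya here.

Install CAFE:

    conda install cafe

Then run the CAFE tutorial filter script and edit the run script:

    cafetutorial_clade_and_size_filter.py
    vim cafetutorial_run.sh

In the run script, tree is the ultrametric tree extracted with r8s.

Summarise the CAFE report:

    python cafetutorial_report_analysis.py -i reports/report_run.cafe -o reports/summary_run

summary_run_node.txt: number of expanded and contracted gene families at each node.
summary_run_fams.txt: the specific gene families that changed.

Plot the result:

    python3 /home/Tools/CAFE_fig/CAFE_fig.py resultfile.cafe -pb 0.05 -pf 0.05 --dump test/ -g svg --count_all_expansions

This outputs SVG files, which can be imported into AI for editing and polishing.

CAFE_fig fails with the error (module "ete3" has no attribute "TreeStyle"). The fix is made by editing the script:

    vim /home/Tools/CAFE_fig/CAFE_fig.py

The program is still running; the result figures will be posted later.

Links: OrthoFinder; timetree; http://www.chenlianfu.com/?tag=genomecomparison; https://www.jianshu.com/p/146093c91e2b; r8s

References

[OrthoFinder] Emms, D.M., Kelly, S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol 16, 157 (2015). https://doi.org/10.1186/s13059-015-0721-2

[OrthoFinder] Emms, D.M., Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol 20, 238 (2019). https://doi.org/10.1186/s13059-019-1832-y

[CAFE v.4.2.1] Han, M.V., Thomas, G.W.C., Lugo-Martinez, J., Hahn, M.W. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3. Molecular Biology and Evolution 30(8) (2013).

[iqtree v.1.6.12] Nguyen, L.-T., Schmidt, H.A., von Haeseler, A., Minh, B.Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum likelihood phylogenies. Mol Biol Evol 32, 268-274 (2015). https://doi.org/10.1093/molbev/msu300

[ModelFinder] Kalyaanamoorthy, S., Minh, B.Q., Wong, T.K.F., von Haeseler, A., Jermiin, L.S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nature Methods 14, 587-589 (2017). https://doi.org/10.1038/nmeth.4285

[r8s v.1.81] Sanderson, M.J. r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics 19(2), 301-302 (2003).

[STAG tree] Emms, D.M., Kelly, S. STAG: Species Tree Inference from All Genes. bioRxiv (2018). https://doi.org/10.1101/267914

Glossary

Orthologous low-copy nuclear genes (LCN): during evolution, new genes usually arise from pre-existing genes, and the function of a new gene evolves from the function of its predecessor. The raw material for new genes comes from duplication of genomic regions, which may contain one or more genes. Genes that were copied as part of a speciation event and typically keep the same function are called orthologous genes (orthologs). New gene functions can also arise from duplications within the genome of a single species; duplication within one genome gives rise to paralogous genes (paralogs).

Maximum likelihood method: uses a probabilistic model of sequence evolution and searches for the phylogenetic tree under which the observed data have the highest probability.

Choice of outgroup: most phylogenetic reconstruction methods produce unrooted trees, and the topology alone cannot tell you which branch the root lies on. To establish which taxon branched off before the others, the root must be determined. The simplest way to root an unrooted tree is to add an outgroup to the dataset. An outgroup is an operational taxonomic unit for which external evidence shows that it diverged before all the taxa under study. When studying evolutionary history, one generally picks as outgroup a sequence that diverged earlier than the target sequences.

Bootstrap support: the bootstrap is a non-parametric statistical method based on random sampling with replacement. In phylogenetics, the columns (nucleotide or amino acid sites) of the alignment are sampled with replacement until a pseudo-alignment of the same length as the original is obtained, so some sites appear several times and others not at all. The jackknife is similar, except that each replicate removes a portion of the sites without replacement, like cutting a piece away with a knife. Repeating the resampling turns one alignment into many replicate alignments. A tree is built from each replicate with the chosen method (distance matrix, maximum parsimony or maximum likelihood), and the replicate trees are then compared: the support of a clade is the proportion of replicate trees that contain it, and a majority-rule consensus keeps the clades found in more than half of the replicates.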
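The bootstrap and jackknife resampling described in the glossary can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not what IQ-TREE or OrthoFinder do internally: the alignment is a plain dict of equal-length strings, each replicate tree is represented as a set of clades (frozensets of species names), and the function names are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """One bootstrap replicate: draw columns WITH replacement, same length."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def jackknife_alignment(alignment, rng, keep_fraction=0.5):
    """One jackknife replicate: keep a subset of columns, no column repeated."""
    n_sites = len(next(iter(alignment.values())))
    n_keep = max(1, int(n_sites * keep_fraction))
    cols = sorted(rng.sample(range(n_sites), n_keep))
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def clade_support(replicate_clades, clade):
    """Proportion of replicate trees whose clade set contains `clade`."""
    clade = frozenset(clade)
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return hits / len(replicate_clades)


if __name__ == "__main__":
    # Toy alignment (made-up sequences, for illustration only).
    toy = {"sp1": "ACGTAC", "sp2": "ACGAAC", "sp3": "TCGAAT"}
    rng = random.Random(42)
    print(bootstrap_alignment(toy, rng))
    print(jackknife_alignment(toy, rng))
```

In a real analysis each replicate alignment would be passed to the tree-building program, and the clades of the resulting trees would feed clade_support.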
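On the CAFE input step: Orthogroups.GeneCount.tsv has an Orthogroup column, one count column per species and a trailing Total column, whereas the CAFE tutorial input starts with a Desc column and a Family ID column followed by the species counts. A minimal sketch of that reshaping is shown here, assuming exactly this column layout; the function name and the "(null)" description placeholder are illustrative choices, not part of either tool.

```python
import csv
import io


def genecount_to_cafe(src, dst):
    """Reshape an OrthoFinder gene count table into CAFE-style columns.

    src and dst are open text file objects. Returns the number of families written.
    """
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    header = next(reader)
    # Assumed layout: Orthogroup, <species...>, Total
    species = header[1:-1] if header[-1] == "Total" else header[1:]
    writer.writerow(["Desc", "Family ID"] + species)
    n_written = 0
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Placeholder description, orthogroup ID, then per-species counts only.
        writer.writerow(["(null)", row[0]] + row[1:1 + len(species)])
        n_written += 1
    return n_written


if __name__ == "__main__":
    # Tiny made-up table standing in for Orthogroups.GeneCount.tsv.
    demo = io.StringIO(
        "Orthogroup\tsp1\tsp2\tTotal\nOG0000000\t3\t5\t8\nOG0000001\t0\t2\t2\n"
    )
    out = io.StringIO()
    genecount_to_cafe(demo, out)
    print(out.getvalue(), end="")
```

With real data, src and dst would be files opened with open(..., newline=""), and the species column names should match the tip labels of the ultrametric tree given to CAFE.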
