Signal concentration in multiple sequence alignment: empirical and simulation studies

Rusin, L.; Lyubetsky, V.; Zverkov, O.; Gorbunov, K.; V’yugin, V.

Authors: Rusin L., Lyubetsky V., Zverkov O., Gorbunov K., V’yugin V.
Proceedings: Proceedings of the international conference Annual Meeting for the Society of Molecular Biology and Evolution, SMBE’2006 «Genomes, Evolution and Bioinformatics»
Type: conference abstract
Year of publication: 2006
Place of publication: Society of Molecular Biology and Evolution, Tempe, USA
Abstract: Materials and methods --- The phylogenetic software used at successive steps of the procedure was described elsewhere (Lyubetsky et al., 2005); it includes originally developed programs that compute the objective scoring function and perform constrained generation of random trees. Simulations were conducted with the evolver program from the PAML package (Yang, 1997). We generated 1000 datasets of 40 sequences, each 300 amino acids long, using maximum-likelihood model parameters and branch lengths estimated from the real COG data that were used in previous studies to test the algorithm’s performance empirically (Lyubetsky et al., 2005). The algorithm is described in detail in Lyubetsky et al. (2005). In essence, it uses a scoring function to rank the columns of the alignment by how consistent each column’s content is with a list of reliable clades, and gradually removes the least consistent columns until the signal is refined enough to improve the resolution of the tree. The list of reliable clades is essentially the list of splits in the 70% majority-rule consensus topology built after bootstrapping the intact alignment (i.e., before column removal). At each column-removal step, the g1 statistic (Hillis and Huelsenbeck, 1992) is estimated on the current alignment using the original algorithm for generating random trees strictly compatible with the list of reliable clades (Lyubetsky et al., 2005), and it is used to determine the step at which the procedure halts. The resulting alignment is considered optimal for tree reconstruction (definitive phylogenetic analysis).
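Two ingredients of the procedure above, the list of reliable clades and the g1 statistic, can be illustrated with a minimal Python sketch. It is not the authors' software. It assumes that each bootstrap tree is already reduced to a set of splits (each split a frozenset of taxon names), that "70% majority-rule" means a split occurs in at least 70% of replicates, and that g1 is the skewness of the lengths of the random trees in the sense of Hillis and Huelsenbeck (1992), computed here with population moments. The function names reliable_splits and g1_skewness are hypothetical.

```python
from collections import Counter


def reliable_splits(bootstrap_trees, threshold=0.70):
    """Return the splits found in at least `threshold` of the bootstrap trees.

    `bootstrap_trees` is a list of sets; each set holds the splits of one
    bootstrap tree, and each split is a frozenset of taxon names.
    The >= comparison at the threshold is an assumption of this sketch.
    """
    n_trees = len(bootstrap_trees)
    counts = Counter(split for tree in bootstrap_trees for split in tree)
    return {split for split, c in counts.items() if c / n_trees >= threshold}


def g1_skewness(tree_lengths):
    """Skewness g1 = m3 / m2**1.5 of a sample of tree lengths.

    m2 and m3 are the second and third central moments computed with
    divisor n (population moments); this choice is an assumption.
    """
    n = len(tree_lengths)
    mean = sum(tree_lengths) / n
    m2 = sum((x - mean) ** 2 for x in tree_lengths) / n
    m3 = sum((x - mean) ** 3 for x in tree_lengths) / n
    if m2 == 0:
        raise ValueError("all tree lengths are identical; g1 is undefined")
    return m3 / m2 ** 1.5
```

A strongly negative g1 indicates a left-skewed distribution of tree lengths, which Hillis and Huelsenbeck (1992) interpret as evidence of phylogenetic signal. In the procedure described here, the random trees would be the constrained topologies compatible with the reliable clades.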
Simulation results --- The simulation studies were aimed at demonstrating two statements: (1) removing noisy columns makes it possible to reconstruct a tree closer to the known tree than the tree reconstructed from the data without removal; (2) the g1 statistic estimated with the original algorithm of constrained random tree generation can be used to identify the refinement step at which the procedure should be stopped. The lists of reliable clades differed considerably among the generated datasets, probably because of the unequal fraction of hypervariable sites retained in each of the 500 bootstrap replicates. Datasets that produced well-resolved consensus trees after bootstrapping were assumed to contain few hypervariable sites and enough informative sites to produce a robust tree. Therefore, as test datasets for signal refinement we used those datasets (512 out of 1000) whose consensus trees were sufficiently unresolved to generate 100,000 constrained random topologies from them. Batch refinement of the in silico generated datasets was continued for 10 steps. At each step, the current alignment was analyzed to produce a phylogenetic tree and a g1 score. The likelihoods of trees from successive steps were computed against the intact data and compared using standard tests of phylogenies (approximately unbiased test, Kishino-Hasegawa test, Shimodaira-Hasegawa test). In 100% of cases, removing noisy columns made it possible to reconstruct a tree closer to the known tree used to simulate the data than the tree obtained without refinement. In 91% of cases, the tree with the highest likelihood (the “best” tree) was reconstructed at the step where the alignment produced the minimal (optimal) g1 score, and in 53% of cases the difference in likelihood between this “best” tree and the tree inferred from the intact data was statistically significant.
In 9% of cases the g1 score continued to decrease beyond the step at which the “best” tree was found. This might suggest that, although the signal related to poorly resolved branches of the 70% consensus can be refined further, the columns needed to reconstruct the shallow parts of the whole tree correctly (those containing recent evolutionary events and therefore described by more variable regions) have already been removed. Further studies will be conducted to develop measures of clade-specific noise removal. In the meantime, the described procedure can be used to refine alignments and improve phylogenetic inference, with the advice to compare the likelihoods of trees before and after column removal.
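The stopping rule and the recommended likelihood check can be sketched as follows. This is an illustrative sketch only, assuming that one g1 score and one log-likelihood (computed against the intact data) are available for each refinement step, with step 0 being the intact alignment. The function names are hypothetical and are not part of the authors' software.

```python
def halting_step(g1_by_step):
    """Index of the refinement step with the minimal g1 score."""
    return min(range(len(g1_by_step)), key=lambda i: g1_by_step[i])


def best_likelihood_step(lnl_by_step):
    """Index of the step whose tree has the highest log-likelihood."""
    return max(range(len(lnl_by_step)), key=lambda i: lnl_by_step[i])


def g1_agrees_with_likelihood(g1_by_step, lnl_by_step):
    """True if the minimal-g1 step is also the highest-likelihood step."""
    return halting_step(g1_by_step) == best_likelihood_step(lnl_by_step)
```

When the two steps disagree, as in the 9% of simulated cases reported above, comparing the likelihoods of the trees before and after column removal shows whether refinement has gone past the step that gives the "best" tree.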
Added to the system by: Leonid Yuryevich Rusin
