PT - JOURNAL ARTICLE
AU - Robert L. Charlebois
AU - Siemon H. S. Ng
AU - Lucy Gisonni-Lex
AU - Laurent Mallet
TI - Cataloguing the Taxonomic Origins of Sequences from a Heterogeneous Sample Using Phylogenomics: Applications in Adventitious Agent Detection
AID - 10.5731/pdajpst.2014.01023
DP - 2014 Nov 01
TA - PDA Journal of Pharmaceutical Science and Technology
PG - 602--618
VI - 68
IP - 6
4099 - http://journal.pda.org/content/68/6/602.short
4100 - http://journal.pda.org/content/68/6/602.full
SO - PDA J Pharm Sci Technol 2014 Nov 01; 68
AB - We have designed and implemented a software system, named PhyloID™, that can be used to detect putative adventitious agents in biological samples characterized by next-generation sequencing. PhyloID is run in two steps, each being a self-contained automated process amenable to GMP validation. The first module, MiLY, is responsible for assembling individual sequence reads into contigs, and annotating all sequences with a unique sequence identifier, the number of reads in each contig, and the length of the sequence. The trimmed, assembled and annotated data are then processed by PhyloID's second module, NGmapper. NGmapper takes the FASTA-formatted output from MiLY and identifies the taxonomic origins of the contigs and singletons therein. It compares each sequence's BLASTN hit profile against the patterns of evolutionary relationships described within phylogenomic distance matrices for all of the various taxonomic groups, in order to find the best fit. NGmapper then produces lists of taxonomic assignments in both summarized and detailed form, and tree files for viewing results graphically. We optimized PhyloID's parameters and measured its performance using simulated metagenomic data and subsets of the reference phylogenies.
PhyloID's precision and recall in identifying simulated sequences were measured by information retrieval analysis, focusing on read length, read number, sequence accuracy, background complexity, taxonomy and reference data coverage. We found PhyloID to be highly accurate and quantitative in its taxonomic mapping of sequences, with excellent precision, sensitivity and robustness. The degree of taxonomic representation in publicly available databases remains an issue, as expected for any sequence classifier, but community sequencing efforts are poised to overcome this problem. To illustrate real-world usage of the application, we also describe some simple spike-recovery experiments as well as a multi-site comparative characterization of a viral suspension. These data help to illustrate, corroborate, and extend the results obtained using simulated data.
LAY ABSTRACT: In order to address gaps in the detection of contaminating viruses and microorganisms in vaccines and other biologicals, manufacturers are exploring the use of new technologies that promise greater sensitivity and breadth of coverage. One challenge in implementing such new methods is the complexity of analysis of the “big data” generated by these new instruments: hundreds of millions of sequence reads (segments of genetic material from viruses and cells) need to be compared against a vast and growing number of entries in genetic databases in order to arrive at a confident identification. These large-scale analyses must furthermore be carried out within the strict regulatory environment that governs the industry. We have developed an automated software pipeline named PhyloID™ that is capable of identifying viruses and microorganisms from large-scale sequence data. Using simulated data as well as real samples, we show that PhyloID is both sensitive and accurate in identifying any type of potential contaminant.
Such a powerful new assay will be an important addition to the adventitious agent testing package, providing further assurance about product safety.
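
The abstract reports precision and recall for simulated sequences but does not say how they were computed. The sketch below is an illustration only, not the authors' method and not part of PhyloID: it shows one common way to score taxonomic assignments per taxon, assuming each simulated sequence has a known true taxon and the classifier either assigns one taxon or leaves the sequence unassigned. The function name, the sequence identifiers, and the example taxa are all hypothetical.

```python
from collections import Counter


def precision_recall(true_taxa, predicted_taxa):
    """Per-taxon precision and recall (hypothetical helper, not PhyloID code).

    true_taxa:      dict mapping sequence id -> true taxon
    predicted_taxa: dict mapping sequence id -> assigned taxon;
                    a missing id or None means "unassigned"
    Returns a dict mapping taxon -> (precision, recall); a value is None
    when its denominator is zero.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for seq_id, truth in true_taxa.items():
        pred = predicted_taxa.get(seq_id)
        if pred == truth:
            tp[truth] += 1          # correctly assigned
        else:
            fn[truth] += 1          # true taxon was missed
            if pred is not None:
                fp[pred] += 1       # wrong taxon received a spurious hit

    scores = {}
    for taxon in set(tp) | set(fp) | set(fn):
        assigned = tp[taxon] + fp[taxon]   # sequences labelled with this taxon
        actual = tp[taxon] + fn[taxon]     # sequences truly from this taxon
        precision = tp[taxon] / assigned if assigned else None
        recall = tp[taxon] / actual if actual else None
        scores[taxon] = (precision, recall)
    return scores


# Made-up example: four simulated reads, one misassigned, one unassigned.
truth = {
    "read1": "Adenoviridae",
    "read2": "Adenoviridae",
    "read3": "Retroviridae",
    "read4": "Retroviridae",
}
predicted = {
    "read1": "Adenoviridae",
    "read2": "Retroviridae",   # misassigned
    "read3": "Retroviridae",
    # read4 is left unassigned
}
scores = precision_recall(truth, predicted)
for taxon in sorted(scores):
    print(taxon, scores[taxon])
```

In this toy case the misassigned read lowers recall for its true taxon and precision for the taxon it was wrongly given, while the unassigned read lowers recall only.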