Eight-cluster structure of chloroplast genomes differs from similar one observed for bacteriaFeb 08 2018Previously, a seven-cluster pattern claimed to be universal in bacterial genomes has been reported. Keeping in mind the most popular theory of chloroplast origin, we checked whether a similar pattern is observed in chloroplast genomes. Surprisingly, ... More
Molecular Regulation of Histamine SynthesisFeb 07 2018Histamine is a critical mediator of IgE/mast cell-mediated anaphylaxis, a neurotransmitter and a regulator of gastric acid secretion. Histamine is a monoamine synthesized from the amino acid histidine through a reaction catalyzed by the enzyme histidine decarboxylase ... More
ASB1 differential methylation in ischaemic cardiomyopathy. Relationship with left ventricular performance in end stage heart failure patientsApr 04 2017Aims: Ischaemic cardiomyopathy (ICM) leads to impaired contraction and ventricular dysfunction causing high rates of morbidity and mortality. Epigenomics allows the identification of epigenetic signatures in human diseases. We analyse the differential ... More
Large scale modeling of antimicrobial resistance with interpretable classifiersDec 03 2016Antimicrobial resistance is an important public health concern that has implications in the practice of medicine worldwide. Accurately predicting resistance phenotypes from genome sequences shows great promise in promoting better use of antimicrobial ... More
Multi-stage Clustering of Breast Cancer for Precision MedicineDec 02 2016Cancer has become one of the most widespread diseases in the world. Specifically, breast cancer is diagnosed more often than any other type of cancer. However, breast cancer patients and their individual tumors are often unique. Identifying the underlying ... More
A Noise-Filtering Approach for Cancer Drug Sensitivity PredictionDec 02 2016Dec 05 2016Accurately predicting drug responses to cancer is an important problem hindering oncologists' efforts to find the most effective drugs to treat cancer, which is a core goal in precision medicine. The scientific community has focused on improving this ... More
Dynamical System Modeling to Simulate Donor T Cell Response to Whole Exome Sequencing-Derived Recipient Peptides: Understanding Randomness in Clinical Outcomes Following Stem Cell TransplantationNov 28 2016Alloreactivity following stem cell transplantation (SCT) is difficult to predict in patients undergoing transplantation from HLA matched donors. In this study we performed whole exome sequencing of SCT donor-recipient pairs (DRP). This allowed determination ... More
Specificity-determining DNA triplet code for positioning of human pre-initiation complexNov 23 2016The notion that transcription factors bind DNA only through specific, consensus binding sites has been recently questioned. In a pioneering study by Pugh and Venters no specific consensus motif for the positioning of the human pre-initiation complex (PIC) ... More
Fast low-level pattern matching algorithmNov 18 2016This paper focuses on pattern matching in the DNA sequence. It was inspired by a previously reported method that proposes encoding both pattern and sequence using prime numbers. Although fast, the method is limited to rather small pattern lengths, due ... More
Duplication Distance to the Root for Binary SequencesNov 17 2016We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form $\seq x = \seq a \seq b \seq c \to \seq y = \seq a \seq b \seq b \seq ... More
Genomic Region Detection via Spatial Convex ClusteringNov 15 2016Several modern genomic technologies, such as DNA-Methylation arrays, measure spatially registered probes that number in the hundreds of thousands across multiple chromosomes. The measured probes are by themselves less interesting scientifically; instead ... More
Polygenic score analyses of schizophrenia and bipolar disorder with cardiometabolic traitsNov 10 2016Cardiovascular diseases (CVD) represent a major health issue in patients with schizophrenia and bipolar disorder (BD). While many studies have shown increased CVD risks in schizophrenia and BD, the underlying mechanisms remain unclear. Psychiatric medications ... More
Mouse T cell repertoires as statistical ensembles: overall characterization and age dependenceNov 09 2016The ability of the adaptive immune system to respond to arbitrary pathogens stems from the broad diversity of immune cell surface receptors (TCRs). This diversity originates in a stochastic DNA editing process (VDJ recombination) that acts each time a ... More
Reverse vaccinology in Plasmodium falciparum 3D7Nov 05 2016A timely immunization can be effective against certain diseases and can save thousands of lives. However, for some diseases it has been difficult, so far, to develop an efficient vaccine. Malaria, a tropical disease caused by a parasite of the genus Plasmodium, ... More
Efficient causal inference with hidden confounders from genome-transcriptome variation dataNov 03 2016Nov 06 2016Natural genetic variation between individuals in a population leads to variations in gene expression that are informative for the inference of gene regulatory networks. Particularly, genome-wide genotype and transcriptome data from the same samples allow ... More
Computational genomic algorithms for miRNA-based diagnosis of lung cancer: the potential of machine learningOct 28 2016The advent of large scale, high-throughput genomic screening has introduced a wide range of tests for diagnostic purposes. Prominent among them are tests using miRNA expression levels. Genomics and proteomics now provide expression levels of hundreds ... More
Aligning coding sequences with frameshift extension penaltiesOct 27 2016Frameshift translation is an important phenomenon that contributes to the appearance of novel Coding DNA Sequences (CDS) and functions in gene evolution, by allowing alternative amino acid translations of genes' coding regions. Frameshift translations ... More
Stratification of patient trajectories using covariate latent variable modelsOct 27 2016Standard models assign disease progression to discrete categories or stages based on well-characterized clinical markers. However, such a system is potentially at odds with our understanding of the underlying biology, which in highly complex systems may ... More
Functional architecture and global properties of the Corynebacterium glutamicum regulatory network: novel insights from a dataset with a high genomic coverageOct 26 2016Corynebacterium glutamicum is a Gram-positive, anaerobic, rod-shaped soil bacterium able to grow on a diversity of carbon sources like sugars and organic acids. It is a biotechnologically relevant organism because of its highly efficient ability to biosynthesize ... More
A single step protein assay that is both detergent and reducer compatible: The cydex blue assayOct 24 2016Determination of protein concentration is often an absolute prerequisite in preparing samples for biochemical and proteomic analyses. However, current protein assay methods are not compatible with both reducers and detergents, which are nonetheless present ... More
Full Reconstruction of Non-Stationary Strand-Symmetric Models on Rooted PhylogeniesOct 17 2016Nov 13 2016Understanding the evolutionary relationship among species is of fundamental importance to the biological sciences. The location of the root in any phylogenetic tree is critical as it gives an order to evolutionary events. None of the popular models of ... More
Pan-genome Analysis of the Genus SerratiaOct 13 2016Pan-genome analysis is a standard procedure to decipher genome heterogeneity and diversification of bacterial species. Species evolution is traced by defining and comparing the core (conserved), accessory (dispensable) and unique (strain-specific) gene ... More
A Unified Model for Differential Expression Analysis of RNA-seq Data via L1-Penalized Linear RegressionOct 11 2016The RNA-sequencing (RNA-seq) is becoming increasingly popular for quantifying gene expression levels. Since the RNA-seq measurements are relative in nature, between-sample normalization of counts is an essential step in differential expression (DE) analysis. ... More
An Improved Filtering Algorithm for Big Read DatasetsOct 11 2016For single-cell or metagenomic sequencing projects, it is necessary to sequence with a very high mean coverage in order to make sure that all parts of the sample DNA get covered by the reads produced. This leads to huge datasets with lots of redundant ... More
Effective Classification of MicroRNA Precursors Using Combinatorial Feature Mining and AdaBoost AlgorithmsOct 06 2016MicroRNAs (miRNAs) are non-coding RNAs with approximately 22 nucleotides (nt) that are derived from precursor molecules. These precursor molecules or pre-miRNAs often fold into stem-loop hairpin structures. However, a large number of sequences with pre-miRNA-like ... More
Prediction of Prokaryotic and Eukaryotic Promoters Using Convolutional Deep Learning Neural NetworksOct 01 2016Accurate computational identification of promoters remains a challenge as these key DNA regulatory regions have variable structures composed of functional motifs that provide gene specific initiation of transcription. In this paper we utilize Convolutional ... More
RNA as a Nanoscale Data Transmission Medium: Error AnalysisSep 23 2016RNA can be used as a high-density medium for data storage and transmission; however, an important RNA process -- replication -- is noisy. This paper presents an error analysis for RNA as a data transmission medium, analyzing how deletion errors increase ... More
A spectral algorithm for fast de novo layout of uncorrected long nanopore readsSep 23 2016Motivation: New long read sequencers promise to transform sequencing and genome assembly by producing reads tens of kilobases long. However their high error rate significantly complicates assembly and requires expensive correction steps to layout the ... More
Network-regularized Sparse Logistic Regression Models for Clinical Risk Prediction and Biomarker DiscoverySep 21 2016Molecular profiling data (e.g., gene expression) has been used for clinical risk prediction and biomarker discovery. However, it is necessary to integrate other prior knowledge like biological pathways or gene interaction networks to improve the predictive ... More
Relation between Gene Content and Taxonomy in ChloroplastsSep 20 2016The aim of this study is to investigate the relation that can be found between the phylogeny of a large set of complete chloroplast genomes, and the evolution of gene content inside these sequences. Core and pan genomes have been computed on de ... More
Searching for Gene Sets with Mutually Exclusive MutationsSep 18 2016Cancer cells evolve through random somatic mutations. "Beneficial" mutations which disrupt key pathways (e.g. cell cycle regulation) are subject to natural selection. Multiple mutations may lead to the same "beneficial" effect, in which case there is ... More
A Quadratically Regularized Functional Canonical Correlation Analysis for Identifying the Global Structure of Pleiotropy with NGS DataSep 16 2016Investigating the pleiotropic effects of genetic variants can increase statistical power, provide important information to achieve deep understanding of the complex genetic structures of disease, and offer powerful tools for designing effective treatments ... More
Stochastic predator-prey dynamics of transposons in the human genomeSep 14 2016Oct 07 2016Transposable elements, or transposons, are DNA sequences that can jump from site to site in the genome during the life cycle of a cell, usually encoding the very enzymes which perform their excision. However, some transposons are parasitic, relying on ... More
BAUM: A DNA Assembler by Adaptive Unique Mapping and Local Overlap-Layout-ConsensusSep 10 2016Genome assembly from the high-throughput sequencing (HTS) reads is a fundamental yet challenging computational problem. An intrinsic challenge is the uncertainty caused by the widespread repetitive elements. Here we get around the uncertainty using the ... More
Learning Directed-Acyclic-Graphs from Large-Scale Genomics DataSep 09 2016In this paper we consider the problem of learning the genetic-interaction-map, i.e., the topology of a directed acyclic graph (DAG) of genetic interactions from noisy double knockout (DK) data. Based on a set of well established biological interaction ... More
The more you test, the more you find: Smallest P-values become increasingly enriched with real findings as more tests are conductedSep 07 2016Increasing accessibility of data to researchers makes it possible to conduct massive amounts of statistical testing. Rather than follow a carefully crafted set of scientific hypotheses with statistical analysis, researchers can now test many possible ... More
Assessment of P-value variability in the current replicability crisisSep 06 2016Sep 10 2016Increased availability of data and accessibility of computational tools in recent years have created unprecedented opportunities for scientific research driven by statistical analysis. Inherent limitations of statistics impose constraints on reliability ... More
Extracting replicable associations across multiple studies: algorithms for controlling the false discovery rateSep 05 2016Sep 07 2016Extracting associations that recur across multiple studies while controlling the false discovery rate is a fundamental challenge. Here, we consider an extension of Efron's single-study two-groups model to allow joint analysis of multiple studies. We assume ... More
Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptionsSep 04 2016RNA-Seq is a widely-used method for studying the behavior of genes under different biological conditions. An essential step in an RNA-Seq study is normalization, in which raw data are adjusted to account for factors that prevent direct comparison of expression ... More
Binary Particle Swarm Optimization versus Hybrid Genetic Algorithm for Inferring Well Supported Phylogenetic TreesAug 31 2016The number of completely sequenced chloroplast genomes increases rapidly every day, leading to the possibility to build large-scale phylogenetic trees of plant species. Considering a subset of close plant species defined according to their chloroplasts, ... More
Chaos in DNA EvolutionAug 20 2016In this paper, we explain why the chaotic model (CM) of Bahi and Michel (2008) accurately simulates gene mutations over time. First, we demonstrate that the CM model is a truly chaotic one, as defined by Devaney. Then, we show that mutations occurring ... More
Associative memory by collective regulation of non-coding RNAAug 19 2016The majority of mammalian genomic transcripts do not directly code for proteins and it is currently believed that most of these are not under evolutionary constraint. However, given the abundance of non-coding RNA (ncRNA) and its strong affinity for inter-RNA ... More
BNP-Seq: Bayesian Nonparametric Differential Expression Analysis of Sequencing Count DataAug 13 2016We perform differential expression analysis of high-throughput sequencing count data under a Bayesian nonparametric framework, removing sophisticated ad-hoc pre-processing steps commonly required in existing algorithms. We propose to use the gamma (beta) ... More
Inferring unknown biological function by integration of GO annotations and gene expression dataAug 12 2016Characterizing genes with semantic information is an important process regarding the description of gene products. Although the complete genomes of many organisms have already been sequenced, the biological functions of all of their genes are still unknown. ... More
Core-genome scaffold comparison reveals the prevalence that inversion events are associated with pairs of inverted repeatsAug 08 2016Motivation: Genome rearrangement plays an important role in evolutionary biology and has profound impacts on phenotype in organisms ranging from microbes to humans. The mechanisms for genome rearrangement events remain unclear. Lots of comparisons have ... More
Sam2bam: High-Performance Framework for NGS Data Preprocessing ToolsAug 05 2016This paper introduces a high-throughput software tool framework called sam2bam that enables users to significantly speed up pre-processing for next-generation sequencing data. sam2bam is especially efficient on single-node multi-core large-memory ... More
Meraculous2: fast accurate short-read assembly of large polymorphic genomesAug 02 2016We present Meraculous2, an update to the Meraculous short-read assembler that includes (1) handling of allelic variation using "bubble" structures within the de Bruijn graph, (2) improved gap closing, and (3) an improved scaffolding algorithm that produces ... More
Identification of repeats in DNA sequences using nucleotide distribution uniformityJul 31 2016Repetitive elements are important in genomic structures, functions and regulations, yet effective methods for precisely identifying repetitive elements in DNA sequences are not fully accessible, and the relationship between repetitive elements and periodicities ... More
Cytomegalovirus Antigenic Mimicry of Human Alloreactive Peptides: A Potential Trigger for Graft versus Host DiseaseJul 28 2016The association between human cytomegalovirus (hCMV) reactivation and the development of graft-versus-host-disease (GVHD) has been observed in stem cell transplantation (SCT). Seventy seven SCT donor-recipient pairs (DRP) (HLA matched unrelated donor ... More
Genomic data analysis in tree spacesJul 25 2016Recently, an elegant approach in phylogenetics was introduced by Billera-Holmes-Vogtmann that allows a systematic comparison of different evolutionary histories using the metric geometry of tree spaces. In many problem settings one encounters heavily ... More
On the Ribosomal Density that Maximizes Protein Translation RateJul 14 2016During mRNA translation, several ribosomes attach to the same mRNA molecule simultaneously translating it into a protein. This pipelining increases the protein production rate. A natural and important question is what ribosomal density maximizes the protein ... More
A draft genome assembly of southern bluefin tuna Thunnus maccoyiiJul 13 2016Tuna are large pelagic fish whose populations are close to panmixia. In addition, they are threatened species, so it is important for the maintenance and monitoring of genetic diversity that genetic information at a genome level be obtained. Here we report ... More
Elastic Model for Dinucleosome Structure and EnergyJul 11 2016The equilibrium structure of a Dinucleosome is studied using an elastic model that takes into account the force and torque balance conditions. Using the proper boundary conditions, it is found that the conformational energy of the problem does not depend ... More
A Weighted Exact Test for Mutually Exclusive Mutations in CancerJul 08 2016The somatic mutations in the pathways that drive cancer development tend to be mutually exclusive across tumors, providing a signal for distinguishing driver mutations from a larger number of random passenger mutations. This mutual exclusivity signal ... More
DeepChrome: Deep-learning for predicting gene expression from histone modificationsJul 07 2016Motivation: Histone modifications are among the most important factors that control gene regulation. Computational methods that predict gene expression from histone modification signals are highly desirable for understanding their combinatorial effects ... More
Analysis of Chromosome 20 - A StudyJun 30 2016Since the arrival of next-generation sequencing technologies the amount of genetic sequencing data has increased dramatically. This has fueled an increase in human genetics research. At the same time, with the recent advent of technologies in processing ... More
Enhancing power of rare variant association test by Zoom-Focus Algorithm (ZFA) to locate optimal testing regionJun 29 2016Motivation: Exome or targeted sequencing data pose an analytical challenge for testing single nucleotide polymorphisms (SNPs) with extremely small minor allele frequency (MAF). Various rare variant tests were proposed to increase power by aggregating SNPs ... More
Using Sequence Ensembles for Seeding Alignments of MinION Sequencing DataJun 28 2016Oxford Nanopore MinION sequencer is currently the smallest sequencing device available. While being able to produce very long reads (reads of up to 100 kbp were reported), it is prone to high sequencing error rates of up to 30%. Since most of these errors ... More
Reanalyzing variable directionality of gene expression in transgenerational epigenetic inheritanceJun 28 2016A previous report claimed no evidence of transgenerational epigenetic inheritance in a mouse model of in utero environmental exposure, based on the observation that gene expression changes observed in the germ cells of G1 and G2 male fetuses were not in ... More
Characterization of Methicillin-resistant Staphylococcus aureus Isolates from Fitness Centers in Memphis Metropolitan Area, USAJun 27 2016Indoor skin-contact surfaces of public fitness centers may serve as reservoirs of potential human transmission of methicillin-resistant Staphylococcus aureus (MRSA). We found a high prevalence of multi-drug resistant (MDR)-MRSA of CC59 lineage harboring ... More
Optimal Down Regulation of mRNA TranslationJun 26 2016Jun 28 2016Down regulation of mRNA translation is an important problem in various bio-medical domains ranging from developing effective medicines for tumors and for viral diseases to developing attenuated virus strains that can be used for vaccination. Here, we ... More
H(O)TA: estimation of DNA methylation and hydroxylation levels and efficiencies from time course dataJun 26 2016Methylation and hydroxylation of cytosines to form 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC) belong to the most important epigenetic modifications and their vital role in the regulation of gene expression has been widely recognized. Recent ... More
Effects of initial telomere length distribution on senescence onset and heterogeneityJun 22 2016Nov 07 2016Replicative senescence, induced by telomere shortening, exhibits considerable asynchrony and heterogeneity, the origins of which remain unclear. Here, we formally study how telomere shortening mechanisms impact on senescence kinetics and define two regimes ... More
Genomic disintegration in woolly mammoths on Wrangel islandJun 20 2016Woolly mammoths (Mammuthus primigenius) populated Siberia, Beringia, and North America during the Pleistocene and early Holocene. Recent breakthroughs in ancient DNA sequencing have allowed for complete genome sequencing for two specimens of woolly mammoths ... More
Principle, analysis, application and challenges of next-generation sequencing: a reviewJun 15 2016Next Generation Sequencing (NGS), a recently evolved technology, has contributed greatly to the research and development sector of our society. This novel approach is new and has critical advantages over the traditional Capillary Electrophoresis (CE) based ... More
An observation of circular RNAs in bacterial RNA-seq dataJun 14 2016Circular RNAs (circRNAs) are a class of RNA with an important role in micro RNA (miRNA) regulation recently discovered in humans and various other eukaryotes as well as in archaea. Here, we have analyzed RNA-seq data obtained from Enterococcus faecalis ... More
Clustering and Classification of Genetic Data Through U-StatisticsJun 10 2016Genetic data are frequently categorical and have complex dependence structures that are not always well understood. For this reason, clustering and classification based on genetic data, while highly relevant, are challenging statistical problems. Here ... More
Cell lineage tracing using nuclease barcodingJun 02 2016Lineage tracing, the determination and mapping of progeny arising from single cells, is an important approach enabling the elucidation of mechanisms underlying diverse biological processes ranging from development to disease. We developed a dynamic sequence-based ... More
Dynamic read mapping and online consensus calling for better variant detectionMay 29 2016Variant detection from high-throughput sequencing data is an essential step in identification of alleles involved in complex diseases and cancer. To deal with these massive data, elaborate sequence analysis pipelines are employed. A core component of ... More
Controlling the joint local false discovery rate is more powerful than meta-analysis methods in joint analysis of summary statistics from multiple genome-wide association studiesMay 28 2016In genome-wide association studies (GWASs) of common diseases/traits, we often analyze multiple GWASs with the same phenotype together to discover associated genetic variants with higher power. Since it is difficult to access data with detailed individual ... More
Dynamics of transcription-translation networksMay 26 2016A theory for qualitative models of gene regulatory networks has been developed over several decades, generally considering transcription factors to regulate directly the expression of other transcription factors, without any intermediate variables. Here ... More
A resource-frugal probabilistic dictionary and applications in (meta)genomicsMay 26 2016Genomic and metagenomic fields, generating huge sets of short genomic sequences, brought their own share of high performance problems. To extract relevant pieces of information from the huge data sets generated by current sequencing techniques, one must ... More
Transcriptional Similarity in Couples Reveals the Impact of Shared Environment and Lifestyle on Gene Regulation through Modified CytosinesMay 24 2016Gene expression is a complex and quantitative trait that is influenced by both genetic and non-genetic regulators including environmental factors. Evaluating the contribution of environment to gene expression regulation and identifying which genes are ... More
A model for the clustered distribution of SNPs in the human genomeMay 21 2016Motivated by a non-random but clustered distribution of SNPs, we introduce a phenomenological model to account for the clustering properties of SNPs in the human genome. The phenomenological model is based on a preferential mutation to the closer proximity ... More
Abasy Atlas: A comprehensive inventory of systems, global network properties and systems-level elements across bacteriaMay 16 2016The availability of databases electronically encoding curated regulatory networks and of high-throughput technologies and methods to discover regulatory interactions provides an invaluable source of data to understand the principles underpinning the organization ... More
An end-to-end assembly of the Aedes aegypti genomeMay 16 2016Jun 13 2016We present an end-to-end genome assembly of a female Aedes aegypti mosquito, which spreads viral diseases such as yellow fever, dengue, chikungunya, and Zika to humans. The assembly is based on an earlier genome published in 2007 and improved in 2013. ... More
Identifying and removing the cell-cycle effect from single-cell RNA-Sequencing dataMay 15 2016Single-cell RNA-Sequencing (scRNA-Seq) is a revolutionary technique for discovering and describing cell types in heterogeneous tissues, yet its measurement of expression often suffers from large systematic bias. A major source of this bias is the cell ... More
Variance component score test for time-course gene set analysis of longitudinal RNA-seq dataMay 08 2016Jul 01 2016As gene expression measurement technology is shifting from microarrays to sequencing, the statistical tools available for their analysis must be adapted since RNA-seq data are measured as counts. Recently, it has been proposed to tackle the count nature ... More
Partial DNA Assembly: A Rate-Distortion PerspectiveMay 06 2016Earlier formulations of the DNA assembly problem were all in the context of perfect assembly; i.e., given a set of reads from a long genome sequence, is it possible to perfectly reconstruct the original sequence? In practice, however, it is very often ... More
Parallel Pairwise Correlation Computation On Intel Xeon Phi ClustersMay 05 2016Sep 27 2016Co-expression network is a critical technique for the identification of inter-gene interactions, which usually relies on all-pairs correlation (or similar measure) computation between gene expression profiles across multiple samples. Pearson's correlation ... More
Info-Clustering: A Mathematical Theory for Data ClusteringMay 04 2016Oct 05 2016We formulate an info-clustering paradigm based on a multivariate mutual information measure that naturally extends Shannon's mutual information between two random variables to the multivariate case involving more than two random variables. With proper ... More
Evolution of transcription factor families along the human lineageMay 04 2016Transcription factors (TFs) exert their regulatory action by binding to DNA with specific sequence preferences. However, different TFs can partially share their binding sequences. This "redundancy" of binding defines a way of organizing TFs in "motif ... More
Factor Models for Cancer SignaturesApr 29 2016Jun 29 2016We present a novel method for extracting cancer signatures by applying statistical risk models from quantitative finance to cancer genome data. Using 1389 whole genome sequenced samples from 14 cancers, we identify an ... More
Bayesian Genome- and Epigenome-wide Association Studies with Gene Level DependenceApr 29 2016High-throughput genetic and epigenetic data are often screened for associations with an observed phenotype. For example, one may wish to test hundreds of thousands of genetic variants, or DNA methylation sites, for an association with disease status. ... More
NRSSPrioritize: Associating Protein Complex and Disease Similarity Information to Prioritize Disease Candidate GenesApr 25 2016The identification of disease-associated genes has recently gathered much attention for uncovering complex disease mechanisms that could lead to new insights into the treatment of diseases. For exploring disease-susceptible genes, not only experimental ... More
HybridRanker: Integrating network structure and disease knowledge to prioritize cancer candidate genesApr 25 2016One of the notable fields in studying the genetics of cancer is disease gene identification, which affects disease treatment and drug discovery. Much research has been done in this field. Genome-wide association studies (GWAS) are one such approach that focus ... More
A frame-based representation of genomic sequences for removing errors and rare variant detection in NGS dataApr 16 2016We propose a frame-based representation of k-mers for detecting sequencing errors and rare variants in next generation sequencing data obtained from populations of closely related genomes. Frames are sets of non-orthogonal basis functions, traditionally ... More
Chloroplast Genome Yields Unusual Seven-Cluster StructureApr 15 2016We studied the structuredness in a chloroplast genome of Siberian larch. The clusters in 63-dimensional space were identified with elastic map technique, where the objects to be clustered are the different fragments of the genome. A seven-cluster structure ... More
Variational inference for rare variant detection in deep, heterogeneous next-generation sequencing dataApr 14 2016Apr 22 2016The detection of rare variants is important for understanding the genetic heterogeneity in mixed samples. Recently, next-generation sequencing (NGS) technologies have enabled the identification of single nucleotide variants (SNVs) in mixed samples with ... More
FSG: Fast String Graph Construction for De Novo Assembly of Reads DataApr 12 2016The string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm. In this paper, we explore a novel approach to compute the string graph, ... More
Accurate, Fast and Lightweight Clustering of de novo Transcriptomes using Fragment Equivalence ClassesApr 12 2016Motivation: De novo transcriptome assembly of non-model organisms is the first major step for many RNA-seq analysis tasks. Current methods for de novo assembly often report a large number of contiguous sequences (contigs), which may be fractured and incomplete ... More
Efficient Index Maintenance Under Dynamic Genome ModificationApr 11 2016Efficient text indexing data structures have enabled large-scale genomic sequence analysis and are used to help solve problems ranging from assembly to read mapping. However, these data structures typically assume that the underlying reference text is ... More
metaSPAdes: a new versatile de novo metagenomics assemblerApr 11 2016Aug 01 2016While metagenomics has emerged as a technology of choice for analyzing bacterial populations, assembly of metagenomic data remains difficult thus stifling biological discoveries. metaSPAdes is a new assembler that addresses the challenge of metagenome ... More
Low-density locality-sensitive hashing boosts metagenomic binningApr 10 2016Metagenomic binning is an essential task in analyzing metagenomic sequence datasets. To analyze structure or function of microbial communities from environmental samples, metagenomic sequence fragments are assigned to their taxonomic origins. Although ... More
Multi-State Perfect Phylogeny Mixture Deconvolution and Applications to Cancer SequencingApr 09 2016The reconstruction of phylogenetic trees from mixed populations has become important in the study of cancer evolution, as sequencing is often performed on bulk tumor tissue containing mixed populations of cells. Recent work has shown how to reconstruct ... More