Disease gene prioritization using network topological analysis from a sequence based human functional linkage network. Apr 15 2019. Sequencing a large number of candidate disease genes in order to identify the relationship between them is an expensive and time-consuming task. To handle these challenges, different computational approaches have been developed. Based ...
A mean first passage time genome rearrangement distance. Apr 12 2019. This paper introduces a new way to define a genome rearrangement distance, using the concept of mean first passage time from control theory. Crucially, this distance estimate provides a genuine metric on genome space. We develop the theory and introduce ...
pdbmine: A Node.js API for the RCSB Protein Data Bank (PDB). Apr 03 2019. Summary: The advent of Web-based tools that assist in the analysis and visualization of macromolecules requires application programming interfaces (APIs) designed for modern web frameworks. To this end, we have developed a Node.js module pdbmine that allows ...
Learning Clinical Outcomes from Heterogeneous Genomic Data Sources. Apr 02 2019. Translating the vast data generated by genomic platforms into reliable predictions of clinical outcomes remains a critical challenge in realizing the promise of genomic medicine, largely due to the small number of independent samples. In this paper, we show ...
BPPart and BPMax: RNA-RNA Interaction Partition Function and Structure Prediction for the Base Pair Counting Model. Apr 02 2019. RNA-RNA interaction (RRI) is ubiquitous and has complex roles in cellular functions. In human health studies, miRNA-target and lncRNAs are among an elite class of RRIs that have been extensively studied. Bacterial ncRNA-target and RNA interference ...
Data structures to represent sets of k-long DNA sequences. Mar 29 2019. The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a dataset. While ...
Why understanding multiplex social network structuring processes will help us better understand the evolution of human behavior. Mar 26 2019. Anthropologists have long appreciated that single-layer networks are insufficient descriptions of human interactions; individuals are embedded in complex networks with dependencies. One debate explicitly about this surrounds food sharing. Some argue ...
HIV-1 virus cycle replication: a review of RNA polymerase II transcription, alternative splicing and protein synthesis. Mar 12 2019. HIV virus replication is a time-related process that includes several stages. Focusing on the core steps, RNA polymerase II transcribes in an early stage pre-mRNA containing regulator proteins (i.e., nef, tat, rev, vif, vpr, vpu), which are completely spliced ...
conLSH: Context based Locality Sensitive Hashing for Mapping of noisy SMRT Reads. Mar 11 2019. Single Molecule Real-Time (SMRT) sequencing is a recent advancement of Next Gen technology developed by Pacific Biosciences (PacBio). It comes with an explosion of long and noisy reads demanding cutting edge research to get the most out of it. To deal with the high ...
A biologically constrained encoding solution for long-term storage of images onto synthetic DNA. Mar 07 2019. Living in the age of the digital media explosion, the amount of data that is being stored increases dramatically. However, even if existing storage systems suggest efficiency in capacity, they are lacking in durability. Hard disks, flash, tape or even ...
On genetic correlation estimation with summary statistics from genome-wide association studies. Mar 04 2019. Genome-wide association studies (GWAS) have been widely used to examine the association between single nucleotide polymorphisms (SNPs) and complex traits, where both the sample size n and the number of SNPs p can be very large. Recently, cross-trait polygenic ...
CAMIRADA: Cancer microRNA association discovery algorithm, a case study on breast cancer. Feb 27 2019. In recent studies, non-protein-coding RNAs known as microRNAs have been identified as biomarkers for early diagnosis and treatment of cancer, which can decrease cancer mortality. A microRNA may target hundreds or thousands of genes and a gene ...
Fast Approximation of Frequent $k$-mers and Applications to Metagenomics. Feb 26 2019. Estimating the abundances of all $k$-mers in a set of biological sequences is a fundamental and challenging problem with many applications in biological analysis. While several methods have been designed for the exact or approximate solution of this problem, ...
Diversity and its decomposition into variety, balance and disparity. Feb 25 2019, Feb 26 2019. Diversity is a central concept in many fields. Despite its importance, there is no unified methodological framework to measure diversity and its three components of variety, balance and disparity. Current approaches take into account disparity of the ...
A Nonparametric Multi-view Model for Estimating Cell Type-Specific Gene Regulatory Networks. Feb 21 2019. We present a Bayesian hierarchical multi-view mixture model termed Symphony that simultaneously learns clusters of cells representing cell types and their underlying gene regulatory networks by integrating data from two views: single-cell gene expression ...
Using sequencing coverage statistics to identify sex chromosomes in minke whales. Feb 18 2019. The ever-increasing number of genome sequencing and resequencing projects is a central source of insights into the ecology and evolution of non-model organisms. An important aspect of genomics is the elucidation of sex determination systems and identifying ...
BOAssembler: a Bayesian Optimization Framework to Improve RNA-Seq Assembly Performance. Feb 14 2019. High throughput sequencing of RNA (RNA-Seq) can provide us with millions of short fragments of RNA transcripts from a sample. How to better recover the original RNA transcripts from those fragments (RNA-Seq assembly) is still a difficult task. For example, ...
OPENMENDEL: A Cooperative Programming Project for Statistical Genetics. Feb 14 2019. Statistical methods for genomewide association studies (GWAS) continue to improve. However, the increasing volume and variety of genetic and genomic data make computational speed and ease of data manipulation mandatory in future software. In our view, ...
PLIT: An alignment-free computational tool for identification of long non-coding RNAs in plant transcriptomic datasets. Feb 12 2019. Long non-coding RNAs (lncRNAs) are a class of non-coding RNAs which play a significant role in several biological processes. RNA-seq based transcriptome sequencing has been extensively used for identification of lncRNAs. However, accurate identification ...
Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm. Feb 12 2019. A large proportion of the basepairs in the long reads that third-generation sequencing technologies produce possess sequencing errors. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize ...
Achieving GWAS with Homomorphic Encryption. Feb 12 2019, Mar 11 2019. One way of investigating how genes affect human traits would be with a genome-wide association study (GWAS). Genetic markers, known as single-nucleotide polymorphisms (SNPs), are used in GWAS. This raises privacy and security concerns as these genetic markers ...
Scalable optimal Bayesian classification of single-cell trajectories under regulatory model uncertainty. Feb 08 2019. Single-cell gene expression measurements offer opportunities in deriving mechanistic understanding of complex diseases, including cancer. However, due to the complex regulatory machinery of the cell, gene regulatory network (GRN) model inference based ...
Some Enumeration Problems in the Duplication-Loss Model of Genome Rearrangement. Feb 01 2019. Tandem-duplication-random-loss (TDRL) is an important genome rearrangement operation studied in evolutionary biology. This paper investigates some of the formal properties of TDRL operations on the symmetric group (the space of permutations over an $ ...
Adaptive Monte Carlo Multiple Testing via Multi-Armed Bandits. Feb 01 2019. Monte Carlo (MC) permutation testing is considered the gold standard for statistical hypothesis testing, especially when standard parametric assumptions are not clear or likely to fail. However, in modern data science settings where a large number of ...
Predicting Toxicity from Gene Expression with Neural Networks. Jan 31 2019. We train a neural network to predict chemical toxicity based on gene expression data. The input to the network is a full expression profile collected either in vitro from cultured cells or in vivo from live animals. The output is a set of fine grained ...
GeNet: Deep Representations for Metagenomics. Jan 30 2019. We introduce GeNet, a method for shotgun metagenomic classification from raw DNA sequences that exploits the known hierarchical structure between labels for training. We provide a comparison with state-of-the-art methods Kraken and Centrifuge on datasets ...
Causal Mediation Analysis Leveraging Multiple Types of Summary Statistics Data. Jan 24 2019. Summary statistics of genome-wide association studies (GWAS) teach causal relationships between millions of genetic markers and tens and thousands of phenotypes. However, underlying biological mechanisms are yet to be elucidated. We can achieve necessary ...
Proteomic and metagenomic insights into prehistoric Spanish Levantine Rock Art. Jan 24 2019. The Iberian Mediterranean Basin is home to one of the largest groups of prehistoric rock art sites in Europe. Despite the cultural relevance of prehistoric Spanish Levantine rock art, pigment composition remains partially unknown, and the nature of the ...
Identifying centromeric satellites with dna-brnn. Jan 22 2019. Summary: Human alpha satellite and satellite 2/3 contribute to several percent of the human genome. However, identifying these sequences with traditional algorithms is computationally intensive. Here we develop dna-brnn, a recurrent neural network to ...
Dual Graph-Laplacian PCA: A Closed-Form Solution for Bi-clustering to Find "Checkerboard" Structures on Gene Expression Data. Jan 21 2019. In the context of cancer, internal "checkerboard" structures are normally found in the matrices of gene expression data, which correspond to genes that are significantly up- or down-regulated in patients with specific types of tumors. In this paper, we ...
Spatial clustering and common regulatory elements correlate with coordinated gene expression. Jan 18 2019. Many cellular responses to surrounding cues require temporally concerted transcriptional regulation of multiple genes. In prokaryotic cells, a single-input-module motif with one transcription factor regulating multiple target genes can generate coordinated ...
A Hybrid HMM Approach for the Dynamics of DNA Methylation. Jan 18 2019. The understanding of mechanisms that control epigenetic changes is an important research area in modern functional biology. Epigenetic modifications such as DNA methylation are in general very stable over many cell divisions. DNA methylation can however ...
The Mahalanobis kernel for heritability estimation in genome-wide association studies: fixed-effects and random-effects methods. Jan 09 2019. Linear mixed models (LMMs) are widely used for heritability estimation in genome-wide association studies (GWAS). In standard approaches to heritability estimation with LMMs, a genetic relationship matrix (GRM) must be specified. In GWAS, the GRM is frequently ...
De novo inference of diversity genes and analysis of non-canonical V(DD)J recombination in immunoglobulins. Jan 08 2019. The V(D)J recombination forms the immunoglobulin genes by joining the variable (V), diversity (D), and joining (J) germline genes. Since variations in germline genes have been linked to various diseases, personalized immunogenomics aims at finding alleles ...
Figure 1 Theory Meets Figure 2 Experiments in the Study of Gene Expression. Dec 30 2018. It is tempting to believe that we now own the genome. The ability to read and re-write it at will has ushered in a stunning period in the history of science. Nonetheless, there is an Achilles heel exposed by all of the genomic data that has accrued: we ...
ATHENA: Automated Tuning of Genomic Error Correction Algorithms using Language Models. Dec 30 2018. The performance of most error-correction algorithms that operate on genomic sequencer reads is dependent on the proper choice of their configuration parameters, such as the value of k in k-mer based techniques. In this work, we target the problem of finding ...
Parallel Clustering of Single Cell Transcriptomic Data with Split-Merge Sampling on Dirichlet Process Mixtures. Dec 25 2018. Motivation: With the development of droplet based systems, massive single cell transcriptome data has become available, which enables analysis of cellular and molecular processes at single cell resolution and is instrumental to understanding many biological ...
Pan-Cancer Epigenetic Biomarker Selection from Blood Samples Using SAS. Dec 21 2018. A key focus in current cancer research is the discovery of cancer biomarkers that allow earlier detection with high accuracy and lower costs for both patients and hospitals. Blood samples have long been used as a health status indicator, but DNA methylation ...
Bayesian Manifold-Constrained-Prior Model for an Experiment to Locate Xce. Dec 20 2018. We propose an analysis for a novel experiment intended to locate the genetic locus Xce (X-chromosome controlling element), which biases the stochastic process of X-inactivation in the mouse. X-inactivation bias is a phenomenon where cells in the embryo ...
GenHap: A Novel Computational Method Based on Genetic Algorithms for Haplotype Assembly. Dec 18 2018. The computational problem of inferring the full haplotype of a cell starting from read sequencing data is known as haplotype assembly, and consists in assigning all heterozygous Single Nucleotide Polymorphisms (SNPs) to exactly one of the two chromosomes. ...
Alpha7 nicotinic acetylcholine receptor signaling modulates ovine fetal brain astrocytes transcriptome in response to endotoxin. Dec 17 2018, Apr 09 2019. Neuroinflammation in utero may result in lifelong neurological disabilities. Astrocytes play a pivotal role, but the mechanisms are poorly understood. No early postnatal treatment strategies exist to enhance the neuroprotective potential of astrocytes. We ...
Topological Data Analysis of Single-cell Hi-C Contact Maps. Dec 04 2018. In this article, we show how the recent statistical techniques developed in Topological Data Analysis for the Mapper algorithm can be extended and leveraged to formally define and statistically quantify the presence of topological structures coming from ...
Integrating omics and MRI data with kernel-based tests and CNNs to identify rare genetic markers for Alzheimer's disease. Dec 02 2018, Mar 05 2019. For precision medicine and personalized treatment, we need to identify predictive markers of disease. We focus on Alzheimer's disease (AD), where magnetic resonance imaging scans provide information about the disease status. By combining imaging with ...
Interlacing Personal and Reference Genomes for Machine Learning Disease-Variant Detection. Nov 26 2018. DNA sequencing to identify genetic variants is becoming increasingly valuable in clinical settings. Assessment of variants in such sequencing data is commonly implemented through Bayesian heuristic algorithms. Machine learning has shown great promise ...
A Framework for Implementing Machine Learning on Omics Data. Nov 26 2018. The potential benefits of applying machine learning methods to -omics data are becoming increasingly apparent, especially in clinical settings. However, the unique characteristics of these data are not always well suited to machine learning techniques. ...
Private Shotgun DNA Sequencing. Nov 23 2018. Current techniques in sequencing a genome allow a service provider (e.g. a sequencing company) to have full access to the genome information, and thus the privacy of individuals regarding their lifetime secret is violated. In this paper, we introduce ...
Inference of the three-dimensional chromatin structure and its temporal behavior. Nov 22 2018. Understanding the three-dimensional (3D) structure of the genome is essential for elucidating vital biological processes and their links to human disease. To determine how the genome folds within the nucleus, chromosome conformation capture methods such ...
DeepZip: Lossless Data Compression using Recurrent Neural Networks. Nov 20 2018. Sequential data is being generated at an unprecedented pace in various forms, including text and genomic data. This creates the need for efficient compression mechanisms to enable better storage, transmission and processing of such data. To solve this ...
A Multi-Trait Approach Identified Genetic Variants Including a Rare Mutation in RGS3 with Impact on Abnormalities of Cardiac Structure/Function. Nov 19 2018. Heart failure is a major cause of premature death. Given the heterogeneity of the heart failure syndrome, identifying genetic determinants of cardiac function and structure may provide greater insights into heart failure. Despite progress in understanding ...
Prediction of Signal Sequences in Abiotic Stress Inducible Genes from Main Crops by Association Rule Mining. Nov 18 2018. It is important to study genes related to the growing environment of main crops. In particular, promoter region recognition, the problem of predicting whether DNA sequences contain promoter regions, is a prerequisite to finding abiotic stress-inducible ...
Linking de novo assembly results with long DNA reads by dnaasm-link application. Nov 13 2018. Currently, third-generation sequencing techniques, which yield much longer DNA reads than next-generation sequencing technologies, are becoming more and more popular. There are many possibilities to combine data from next-generation ...
Prediction of Alzheimer's disease-associated genes by integration of GWAS summary data and expression data. Nov 12 2018. Alzheimer's disease is the most common cause of dementia. It is the fifth-leading cause of death among elderly people. With high genetic heritability (79%), finding disease causal genes is a crucial step in finding treatments for AD. Following the International ...
An annotated list of bivalent chromatin regions in human ES cells: a new tool for cancer epigenetic research. Nov 09 2018. CpG islands (CGI) marked by bivalent chromatin in stem cells are believed to be more prone to aberrant DNA methylation in tumor cells. The robustness and genome-wide extent of this instructive program in different cancer types remain to be determined. ...
Imprinting control regions (ICRs) are marked by mono-allelic bivalent chromatin when transcriptionally inactive. Nov 09 2018. Parental allele-specific expression of imprinted genes is mediated by imprinting control regions (ICRs) that are constitutively marked by DNA methylation imprints on the maternal or paternal allele. Mono-allelic DNA methylation is strictly required for ...
The long non-coding RNA HOTAIR is transcriptionally activated by HOXA9 and is an independent prognostic marker in patients with malignant glioma. Nov 09 2018. The lncRNA HOTAIR has been implicated in several human cancers. Here, we evaluated the molecular alterations and upstream regulatory mechanisms of HOTAIR in glioma, the most common primary brain tumor, and its clinical relevance. HOTAIR gene expression, ...
Searching by index for similar sequences: the SEQR algorithm. Nov 02 2018. This paper describes a method to efficiently retrieve protein database sequences similar to a query sequence, while allowing for significant numbers of mutations. We call this method SEQR for SEQuence Retrieval. This approach increases the speed of sequence ...
TF-MoDISco v0.4.4.2-alpha: Technical Note. Oct 31 2018. TF-MoDISco (Transcription Factor Motif Discovery from Importance Scores) is an algorithm for identifying motifs from basepair-level importance scores computed on genomic sequence data. This paper describes the methods behind TF-MoDISco version ...
Whole genome single nucleotide polymorphism genotyping of Staphylococcus aureus. Oct 30 2018. Next-generation sequencing technology enables routine detection of bacterial pathogens for clinical diagnostics and genetic research. Whole genome sequencing has been of importance in the epidemiologic analysis of bacterial pathogens. However, few whole ...
A Comparison of Microbial Genome Web Portals. Oct 30 2018. Microbial genome web portals have a broad range of capabilities that address a number of information-finding and analysis needs for scientists. This article compares the capabilities of the major microbial genome web portals to aid researchers in determining ...
Quantum Structures in Human Decision-making: Towards Quantum Expected Utility. Oct 29 2018. Ellsberg thought experiments and empirical confirmation of Ellsberg preferences pose serious challenges to subjective expected utility theory (SEUT). We have recently elaborated a quantum-theoretic framework for human decisions under uncertainty ...
Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data. Oct 22 2018. Precision medicine aims for personalized prognosis and therapeutics by utilizing recent genome-scale high-throughput profiling techniques, including next-generation sequencing (NGS). However, translating NGS data faces several challenges. First, NGS count ...
Fast approximate inference for variable selection in Dirichlet process mixtures, with an application to pan-cancer proteomics. Oct 12 2018. The Dirichlet Process (DP) mixture model has become a popular choice for model-based clustering, largely because it allows the number of clusters to be inferred. The sequential updating and greedy search (SUGS) algorithm (Wang and Dunson, 2011) was proposed ...
Towards the Latent Transcriptome. Oct 08 2018, Dec 10 2018. In this work we propose a method to compute continuous embeddings for kmers from raw RNA-seq data, without the need for alignment to a reference genome. The approach uses an RNN to transform kmers of the RNA-seq reads into a 2 dimensional representation ...
CTCF Degradation Causes Increased Usage of Upstream Exons in Mouse Embryonic Stem Cells. Oct 07 2018. Transcriptional repressor CTCF is an important regulator of chromatin 3D structure, facilitating the formation of topologically associating domains (TADs). However, its direct effects on gene regulation are less well understood. Here, we utilize previously ...
A statistical normalization method and differential expression analysis for RNA-seq data between different species. Oct 04 2018. Background: High-throughput techniques bring novel tools but also statistical challenges to genomic research. Identifying genes with differential expression between different species is an effective way to discover evolutionarily conserved transcriptional ...
PromID: human promoter prediction by deep learning. Oct 02 2018. Computational identification of promoters is notoriously difficult as human genes often have unique promoter sequences that provide regulation of transcription and interaction with the transcription initiation complex. While there are many attempts to develop ...
Mapping the spectrum of 3D communities in human chromosome conformation capture data. Oct 02 2018. Several experiments show that the three dimensional (3D) organization of chromosomes affects genetic processes such as transcription and gene regulation. To better understand this connection, researchers developed the Hi-C method that is able to detect ...
Cancer classification and pathway discovery using non-negative matrix factorization. Sep 27 2018, Oct 08 2018. Extracting genetic information from a full range of sequencing data is important for understanding diseases. We propose a novel method to effectively explore the landscape of genetic mutations and aggregate them to predict cancer type. We used multinomial ...
Extreme Scale De Novo Metagenome Assembly. Sep 19 2018. Metagenome assembly is the process of transforming a set of short, overlapping, and potentially erroneous DNA segments from environmental samples into the accurate representation of the underlying microbiomes' genomes. State-of-the-art tools require ...
Network analyses of 4D genome datasets automate detection of community-scale gene structure and plasticity. Sep 18 2018, Oct 01 2018. Chromosome conformation capture and Hi-C technologies provide gene-gene proximity datasets of stationary cells, revealing chromosome territories, topologically associating domains, and chromosome topology. Imaging of tagged DNA sequences in live cells ...
Π-cyc: A Reference-free SNP Discovery Application using Parallel Graph Search. Sep 18 2018. Motivation: Working with a large number of genomes simultaneously is of great interest in genetic population and comparative genomics research. Bubble discovery in a multi-genome coloured de Bruijn graph for de novo genome assembly is a problem that can ...
Integrated systems approach identifies pathways from the genome to triglycerides through a metabolomic causal network. Sep 13 2018. Introduction: To leverage functionality and clinical relevance into understanding systems biology, one needs to understand the pathway of the genetic effects on risk factors/disease through intermediate molecular levels, such as metabolomics. Systems ...
Effect of Blast Exposure on Gene-Gene Interactions. Sep 13 2018, Nov 09 2018. Repeated exposure to low-level blast may initiate a range of adverse health problems such as traumatic brain injury (TBI). Although many studies successfully identified genes associated with TBI, the cellular mechanisms underpinning TBI are not fully ...
Virus genome sequence classification using features based on nucleotides, words and compression. Sep 11 2018. The ICTV develops, refines and maintains a universal virus taxonomy; Order is the highest taxon in the branching hierarchy of recognised viral taxa. Historically, ICTV (sub)committees have classified viruses on the basis of morphological characteristics ...
A reproducible effect size is more useful than an irreproducible hypothesis test to analyze high throughput sequencing datasets. Sep 07 2018. Motivation: P values derived from the null hypothesis significance testing framework are strongly affected by sample size, and are known to be irreproducible in underpowered studies, yet no suitable replacement has been proposed. Results: Here we present ...
Whole genome resequencing reveals diagnostic markers for investigating global migration and hybridization between minke whale species. Sep 06 2018. Background: In the marine environment, where there are few absolute physical barriers, contemporary contact between previously isolated species can occur across great distances, and in some cases, may be inter-oceanic. [..] in the minke whale species ...
Data Lakes, Clouds and Commons: A Review of Platforms for Analyzing and Sharing Genomic Data. Sep 05 2018, Dec 24 2018. Data commons collate data with cloud computing infrastructure and commonly used software services, tools and applications to create biomedical resources for the large-scale management, analysis, harmonization, and sharing of biomedical data. Over the ...
Gene Shaving using influence function of a kernel method. Sep 05 2018. Gene shaving, identifying significant subsets of genes, is an essential and challenging issue for biomedical research because of the huge number of genes and the complex nature of biological networks. Since positive definite kernel based methods on genomic ...
SSCU: an R/Bioconductor package for analyzing selective profile in synonymous codon usage. Aug 22 2018. Background: Synonymous codon choice is mainly affected by mutation and selection. For the majority of genes within a genome, mutational pressure is the major driving force, but selective strength can be strong and dominant for a specific set of genes or ...
Quantitative and functional post-translational modification proteomics reveals that TREPH1 plays a role in plant thigmomorphogenesis. Aug 13 2018. Plants can sense both intracellular and extracellular mechanical forces and can respond through morphological changes. The signaling components responsible for mechanotransduction of the touch response are largely unknown. Here, we performed a high-throughput ...
Genome-Wide Association Studies: Information Theoretic Limits of Reliable Learning. Aug 10 2018. In the problems of Genome-Wide Association Study (GWAS), the objective is to associate subsequences of individuals' genomes to the observable characteristics called phenotypes. The genome containing the biological information of an individual can be represented ...
Fast computation of the principal components of genotype matrices in Julia. Aug 09 2018. Finding the largest few principal components of a matrix of genetic data is a common task in genome-wide association studies (GWASs), both for dimensionality reduction and for identifying unwanted factors of variation. We describe a simple random matrix ...
Deep Neural Network for Analysis of DNA Methylation Data. Aug 02 2018. Many studies have demonstrated that DNA methylation, which occurs in the context of a CpG, has a strong correlation with diseases, including cancer. There is a strong interest in analyzing the DNA methylation data to find how to distinguish different ...
Mass-spectrometry of single mammalian cells quantifies proteome heterogeneity during cell differentiation. Aug 01 2018. Cellular heterogeneity is important to biological processes, including cancer and development. However, proteome heterogeneity is largely unexplored because of the limitations of existing methods for quantifying protein levels in single cells. To alleviate ...
Insights into Complex Brain Functions Related to Schizophrenia Disorder through Causal Network Analysis. Jul 31 2018. Gene expression represents a fundamental interface between genes and environment in the development and ongoing plasticity of the human organism. Individual differences in gene expression are likely to underpin much of human diversity, including psychiatric ...
Explaining Parochialism: A Causal Account for Political Polarization in Changing Economic Environments. Jul 28 2018. Political and social polarization are a significant cause of conflict and poor governance in many societies, thus understanding their causes is of considerable importance. Here we demonstrate that shifts in socialization strategy similar to political ...
EBIC: an open source software for high-dimensional and big data biclustering analyses. Jul 26 2018. Motivation: In this paper we present the latest release of EBIC, a next-generation biclustering algorithm for mining genetic data. The major contribution of this paper is adding support for big data, making it possible to efficiently run large genomic ...
Convolutional Neural Networks In Classifying Cancer Through DNA Methylation. Jul 24 2018. DNA Methylation has been the most extensively studied epigenetic mark. Usually a change in the genotype, DNA sequence, leads to a change in the phenotype, observable characteristics of the individual. But DNA methylation, which happens in the context ...
Detecting T-cell receptors involved in immune responses from single repertoire snapshots. Jul 23 2018. Hypervariable T-cell receptors (TCR) play a key role in adaptive immunity, recognising a vast diversity of pathogen-derived antigens. High throughput sequencing of TCR repertoires (RepSeq) produces huge datasets of T-cell receptor sequences from blood ...
Reconstructing Latent Orderings by Spectral Clustering. Jul 18 2018. Spectral clustering uses a graph Laplacian spectral embedding to enhance the cluster structure of some data sets. When the embedding is one dimensional, it can be used to sort the items (spectral ordering). A number of empirical results also suggest ...
OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs. Jul 12 2018, Nov 13 2018. Motivation: High-throughput sequencing of large immune repertoires has enabled the development of methods to predict the probability of generation by V(D)J recombination of T- and B-cell receptors of any specific nucleotide sequence. These generation ...
The exon junction complex undergoes a compositional switch that alters mRNP structure and nonsense-mediated mRNA decay activity. Jul 02 2018. The exon junction complex (EJC) deposited upstream of mRNA exon junctions shapes structure, composition and fate of spliced mRNA ribonucleoprotein particles (mRNPs). To achieve this, the EJC core nucleates assembly of a dynamic shell of peripheral proteins ...
Genesis of the alpha beta T-cell receptor. Jun 28 2018, Dec 11 2018. The T-cell receptor (TCR) repertoire relies on the diversity of receptors composed of two chains, called $\alpha$ and $\beta$, to recognize pathogens. Using results of high throughput sequencing and computational chain-pairing experiments of human TCR repertoires, ...
Deep SNP: An End-to-end Deep Neural Network with Attention-based Localization for Break-point Detection in SNP Array Genomic data. Jun 22 2018. Diagnosis and risk stratification of cancer and many other diseases require the detection of genomic breakpoints as a prerequisite of calling copy number alterations (CNA). This, however, is still challenging and requires time-consuming manual curation. ...