Distributed Hypothesis Testing over a Noisy Channel: Error-exponents Trade-off
Aug 21 2019
A distributed hypothesis testing problem with two parties, one referred to as the observer and the other as the detector, is considered. The observer observes a discrete memoryless source and communicates its observations to the detector over a discrete

Cumulants of multiinformation density in the case of a multivariate normal distribution
Aug 19 2019
We consider a generalization of information density to a partitioning into $N \geq 2$ subvectors. We calculate its cumulant-generating function and its cumulants, showing that these quantities are only a function of all the regression coefficients associated

Karl Pearson and the Logic of Science: Renouncing Causal Understanding (the Bride) and Inverted Spinozism
Aug 17 2019
Karl Pearson is the leading figure of 20th-century statistics. He and his co-workers crafted the core of the theory, methods and language of frequentist or classical statistics -- the prevalent inductive logic of contemporary science. However, before working

Maize Yield and Nitrate Loss Prediction with Machine Learning Algorithms
Aug 14 2019, Aug 20 2019
Pre-season prediction of crop production outcomes such as grain yields and N losses can provide insights to stakeholders when making decisions. Simulation models can assist in scenario planning, but their use is limited because of data requirements and

A Proof of First Digit Law from Laplace Transform
Aug 13 2019
The first digit law, also known as Benford's law or the significant digit law, is an empirical phenomenon that the leading digit of numbers from real world sources favors small ones in a form $\log(1+{1}/{d})$, where $d=1, 2, ..., 9$. Such a law keeps

Stochastic differential theory of cricket
Aug 12 2019
A new formalism for analyzing the progression of a cricket game using a stochastic differential equation (SDE) is introduced. This theory enables a quantitative way of representing every team using three key variables which have physical meaning associated

MSNM-S: An Applied Network Monitoring Tool for Anomaly Detection in Complex Networks and Systems
Jul 31 2019
Technology evolves quickly. Low-cost and ready-to-connect devices are designed to provide new services and applications that improve people's daily lives. Smart grids or smart healthcare systems are some examples of such applications, all of them in the

A brief note on the t-test
Jul 19 2019
In this article we discuss the role that the null hypothesis should play in the construction of a test statistic used to make a decision concerning the truth of that hypothesis. Motivated by the common recommendation that, to construct the test statistic

Continuously Updated Data Analysis Systems
Jul 19 2019
When doing data science, it's important to know what you're building. This paper describes an idealized final product of a data science project, called a Continuously Updated Data-Analysis System (CUDAS). The CUDAS concept synthesizes ideas from a range

Leveraging Auxiliary Information on Marginal Distributions in Nonignorable Models for Item and Unit Nonresponse
Jul 13 2019
When handling nonresponse, government agencies and survey organizations typically are forced to make strong, and potentially unrealistic, assumptions about the reasons why values are missing. We present a framework that enables users to reduce reliance

Model based Level Shift Detection in Autocorrelated Data Streams using a moving window
Jul 11 2019
Standard Control Chart techniques to detect level shift in data streams assume independence between observations. As data today is collected with high frequency, this assumption is seldom valid. To overcome this, we propose to adapt the off-line test

Truth, Proof, and Reproducibility: There's no counter-attack for the codeless
Jul 11 2019
Current concerns about reproducibility in many research communities can be traced back to a high value placed on empirical reproducibility of the physical details of scientific experiments and observations. For example, the detailed descriptions by 17th

Topological Information Data Analysis
Jul 06 2019
This paper presents methods that quantify the structure of statistical interactions within a given data set, and that were first used in \cite{Tapia2018}. It establishes new results on the k-multivariate mutual-informations ($I_k$) inspired by the topological

Investigating some attributes of periodicity in DNA sequences via semi-Markov modelling
Jul 06 2019
DNA segments and sequences have been studied thoroughly during the past decades. One of the main problems in computational biology is the identification of exon-intron structures inside genes using mathematical techniques. Previous studies have used different

On Finite Exchangeability and Conditional Independence
Jul 05 2019
We study the independence structure of finitely exchangeable distributions over random vectors and random networks. In particular, we provide necessary and sufficient conditions for an exchangeable vector so that its elements are completely independent

On the Convergence Rate of the Quasi- to Stationary Distribution for the Shiryaev-Roberts Diffusion
Jul 05 2019, Jul 15 2019
For the classical Shiryaev--Roberts martingale diffusion considered on the interval $[0,A]$, where $A>0$ is a given absorbing boundary, it is shown that the rate of convergence of the diffusion's quasi-stationary cumulative distribution function (cdf),

On the asymptotic behavior of the length of the longest increasing subsequences of random walks
Jun 30 2019
We numerically estimate the leading asymptotic behavior of the length $L_{n}$ of the longest increasing subsequence of random walks with step increments following Student's $t$-distribution with parameter in the range $\frac{1}{2} \leq \nu \leq 5$. We

Multivariate Big Data Analysis for Intrusion Detection: 5 steps from the haystack to the needle
Jun 27 2019
The research literature on cybersecurity incident detection & response is very rich in automatic detection methodologies, in particular those based on the anomaly detection paradigm. However, very little attention has been devoted to the diagnosis ability

Detecting and classifying moments in basketball matches using sensor tracked data
Jun 27 2019
Data analytics in sports is crucial to evaluate the performance of single players and the whole team. The literature proposes a number of tools for both offence and defence scenarios. Data coming from tracking location of players, in this respect, may

Hybrid Resource Scheduling for Aggregation in Massive Machine-type Communication Networks
Jun 25 2019
Data aggregation is a promising approach to enable massive machine-type communication (mMTC). Here, we first characterize the aggregation phase where a massive number of machine-type devices transmit to their respective aggregator. By using non-orthogonal

A Role for Symmetry in the Bayesian Solution of Differential Equations
Jun 24 2019, Jun 26 2019
The interpretation of numerical methods, such as finite difference methods for differential equations, as point estimators suggests that formal uncertainty quantification can also be performed in this context. Competing statistical paradigms can be considered

On the probability of a causal inference is robust for internal validity
Jun 20 2019
The internal validity of an observational study is often subject to debate. In this study, we define the counterfactuals as the unobserved sample and intend to quantify its relationship with the null hypothesis statistical testing (NHST). We propose the

Frequentist Inference without Repeated Sampling
Jun 19 2019
Frequentist inference typically is described in terms of hypothetical repeated sampling but there are advantages to an interpretation that uses a single random sample. Contemporary examples are given that indicate probabilities for random phenomena are

Incorporating Open Data into Introductory Courses in Statistics
Jun 10 2019
The 2016 Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report emphasized six recommendations to teach introductory courses in statistics. Among them: use of real data with context and purpose. Many educators have created

Quantifying impacts of the drought 2018 on European ecosystems in comparison to 2003
Jun 07 2019, Jul 15 2019
In recent decades, an increasing persistence of atmospheric circulation patterns has been observed. In the course of the associated long-lasting anticyclonic summer circulations, heat waves and drought spells often coincide, leading to so-called hotter

The behaviour of information flow near criticality
Jun 03 2019
Recent experiments have indicated that many biological systems self-organise near their critical point, which hints at a common design principle. While it has been suggested that information transmission is optimized near the critical point, it remains

Semi-Supervised Learning, Causality and the Conditional Cluster Assumption
May 28 2019
While the success of semi-supervised learning (SSL) is still not fully understood, Schölkopf et al. (2012) have established a link to the principle of independent causal mechanisms. They conclude that SSL should be impossible when predicting a target

Estimating Average Treatment Effects Utilizing Fractional Imputation when Confounders are Subject to Missingness
May 27 2019
The problem of missingness in observational data is ubiquitous. When the confounders are missing at random, multiple imputation is commonly used; however, the method requires congeniality conditions for valid inferences, which may not be satisfied when

A score function for Bayesian cluster analysis
May 24 2019
We propose a score function for Bayesian clustering. The function is parameter free and captures the interplay between the within cluster variance and the between cluster entropy of a clustering. It can be used to choose the number of clusters in well-established

An illustration of the risk of borrowing information via a shared likelihood
May 23 2019
A concrete, stylized example illustrates that inferences may be degraded, rather than improved, by incorporating supplementary data via a joint likelihood. In the example, the likelihood is assumed to be correctly specified, as is the prior over the parameter

Atlantic Causal Inference Conference (ACIC) Data Analysis Challenge 2017
May 23 2019
This brief note documents the data generating processes used in the 2017 Data Analysis Challenge associated with the Atlantic Causal Inference Conference (ACIC). The focus of the challenge was estimation and inference for conditional average treatment

Leveraging Uncertainty in Deep Learning for Selective Classification
May 23 2019
The wide and rapid adoption of deep learning by practitioners brought unintended consequences in many situations, such as the infamous case of Google Photos' racist image recognition algorithm, and thus necessitated the utilization of the quantified uncertainty

Many perspectives on Deborah Mayo's "Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars"
May 21 2019, May 29 2019
The new book by philosopher Deborah Mayo is relevant to data science for topical reasons, as she takes various controversial positions regarding hypothesis testing and statistical practice, and also as an entry point to thinking about the philosophy of

Discussion of "Nonparametric generalized fiducial inference for survival functions under censoring"
May 21 2019
The following discussion is inspired by the paper "Nonparametric generalized fiducial inference for survival functions under censoring" by Cui and Hannig. The discussion consists of comments on the results, but also indicates its importance more generally

Statistical methods research done as science rather than mathematics
May 20 2019
This paper is about how we study statistical methods. As an example, it uses the random regressions model, in which the intercept and slope of cluster-specific regression lines are modeled as a bivariate random effect. Maximizing this model's restricted

A response to O. Arandjelovic's critique of "The reproducibility of research and the misinterpretation of p-values"
May 20 2019
The main criticism of my piece in ref (2) seems to be that my calculations rely on testing a point null hypothesis, i.e. the hypothesis that the true effect size is zero. He objects to my contention that the true effect size can be zero, "just give the

Non-Parametric Estimation of Spot Covariance Matrix with High-Frequency Data
May 20 2019
Estimating spot covariance is an important issue to study, especially with the increasing availability of high-frequency financial data. We study the estimation of spot covariance using a kernel method for high-frequency data. In particular, we consider

Which principal components are most sensitive to distributional changes?
May 15 2019
PCA is often used in anomaly detection and statistical process control tasks. For bivariate data, we prove that the minor projection (the least varying projection) of the PCA-rotated data is the most sensitive to distributional changes, where sensitivity

A First Course in Data Science
May 08 2019
Data science is a discipline that provides principles, methodology and guidelines for the analysis of data for tools, values, or insights. Driven by a huge workforce demand, many academic institutions have started to offer degrees in data science, with

Effect of E-cigarette Use and Social Network on Smoking Behavior Change: An agent-based model of E-cigarette and Cigarette Interaction
May 03 2019
Despite a general reduction in smoking in many areas of the developed world, it remains one of the biggest public health threats. As an alternative to tobacco, the use of electronic cigarettes (ECig) has increased dramatically over the last decade.

Fault Diagnosis using Clustering. What Statistical Test to use for Hypothesis Testing?
Apr 30 2019
Predictive maintenance and condition-based monitoring systems have seen significant prominence in recent years to minimize the impact of machine downtime on production and its costs. Predictive maintenance involves using concepts of data mining, statistics,

First digit law from Laplace transform
Apr 30 2019
The occurrence of digits 1 through 9 as the leftmost nonzero digit of numbers from real-world sources is distributed unevenly according to an empirical law, known as Benford's law or the first digit law. It remains obscure why a variety of data sets generated

Methods of Estimation for the Three-Parameter Reflected Weibull Distribution
Apr 28 2019
In this paper, we propose methods for the estimation of parameters for the three-parameter Reflected Weibull distribution. The Moment estimator, Maximum likelihood estimator and Location

Geological modeling using a recursive convolutional neural networks approach
Apr 27 2019
Resource models are constrained by the extent of geological units that often depend on the lithology, alteration and mineralization. A three dimensional model of these geological units must be built from scarce information coming from drillholes and limited ...

Evaluating the Success of a Data Analysis
Apr 26 2019
A fundamental problem in the practice and teaching of data science is how to evaluate the quality of a given data analysis, which is different than the evaluation of the science or question underlying the data analysis. Previously, we defined a set of ...

Rushing or Dragging? An Analysis of the "Universality" of Correlated Fluctuations in Hi-Hat Timing and Dynamics
Apr 26 2019
A previous analysis of fluctuations in a virtuoso (Jeff Porcaro) drum performance [Räsänen et al., PLoS ONE 10(6): e0127902 (2015)] demonstrated that the rhythmic signal comprised both long range correlations and short range anti-correlations, with ...

Governance on Social Media Data: Different Focuses between Government and Internet Company
Apr 25 2019
How governments and Internet companies regulate user data on social media attracts public attention. This study tried to answer two questions: What kind of countries send more requests for Facebook user data? What kind of countries get more requests replies ...

Exponential Random Graph models for Little Networks
Apr 23 2019
Statistical models for social networks have enabled researchers to study complex social phenomena that give rise to observed patterns of relationships among social actors and to gain a rich understanding of the interdependent nature of social ties and ...

ssMousetrack: Analysing computerized tracking data via Bayesian state-space models in R
Apr 23 2019
Recent technological advances have provided new settings to enhance individual-based data collection, and computerized-tracking data have become common in much behavioral and social research. By adopting instantaneous tracking devices such as computer-mouse, ...

Amazon Forest Fires Between 2001 and 2006 and Birth Weight in Porto Velho
Apr 23 2019, Apr 25 2019
Birth weight data (22,012 live-births) from a public hospital in Porto Velho (Amazon) was used in multiple statistical models to assess the effects of forest-fire smoke on human reproductive outcome. Mean birth weights for girls (3,139 g) and boys (3,393 ...

Some ordering properties of highest and lowest order statistics with exponentiated Gumble type-II distributed components
Apr 18 2019
In this paper, we have studied the stochastic comparisons of the highest and lowest order statistics of exponentiated Gumble type-II distribution with three parameters. We have compared both the statistics by using three different stochastic orderings. ...

High-dimensional variable selection via low-dimensional adaptive learning
Apr 17 2019
A stochastic search method, the so-called Adaptive Subspace (AdaSub) method, is proposed for variable selection in high-dimensional linear regression models. The method aims at finding the best model with respect to a certain model selection criterion ...

Introducing Bayesian Analysis with m&m's®: an active-learning exercise for undergraduates
Apr 16 2019
We present an active-learning strategy for undergraduates that applies Bayesian analysis to candy-covered chocolate m&m's®. The exercise is best suited for small class sizes and tutorial settings, after students have been introduced ...

Statistical witchhunts: Science, justice & the p-value crisis
Apr 11 2019
We provide accessible insight into the current 'replication crisis' in 'statistical science', by revisiting the old metaphor of 'court trial as hypothesis test'. Inter alia, we define and diagnose harmful statistical witch-hunting both in justice and ...

The Contribution Plot: Decomposition and Graphical Display of the RV Coefficient, with Application to Genetic and Brain Imaging Biomarkers of Alzheimer's Disease
Apr 08 2019
Alzheimer's disease (AD) is a chronic neurodegenerative disease that causes memory loss and decline in cognitive abilities. AD is the sixth leading cause of death in the United States, affecting an estimated 5 million Americans. To assess the association ...

Analytic Evaluation of the Fractional Moments for the Quasi-Stationary Distribution of the Shiryaev Martingale on an Interval
Apr 05 2019
We consider the quasi-stationary distribution of the classical Shiryaev diffusion restricted to the interval $[0,A]$ with absorption at a fixed $A>0$. We derive analytically a closed-form formula for the distribution's fractional moment of an arbitrary ...

Statistical testing in a Linear Probability Space
Apr 02 2019
Imagine that you could calculate posttest probabilities, i.e. apply Bayes' theorem, with simple addition. This is possible if we stop thinking of probabilities as ranging from 0 to 1.0. There is a naturally occurring linear probability space when data are ...

Data-driven discovery of coordinates and governing equations
Mar 29 2019
The discovery of governing equations from scientific data has the potential to transform data-rich fields that lack well-characterized quantitative descriptions. Advances in sparse regression are currently enabling the tractable identification of both ...

GraSPy: Graph Statistics in Python
Mar 29 2019, Jun 18 2019
We introduce GraSPy, a Python library devoted to statistical inference, machine learning, and visualization of random graphs and graph populations. This package provides flexible and easy-to-use algorithms for analyzing and understanding graphs with a ...

An innovating Statistical Learning Tool based on Partial Differential Equations, intending livestock Data Assimilation
Mar 29 2019
The realistic modeling intended to quantify precisely some biological mechanisms is a task requiring a lot of a priori knowledge and generally leading to heavy mathematical models. On the other hand, the structure of the classical Machine Learning algorithms, ...

Deterministic bootstrapping for a class of bootstrap methods
Mar 26 2019, Apr 09 2019
An algorithm is described that enables efficient deterministic approximate computation of the bootstrap distribution for any linear bootstrap method $T_n^*$, alleviating the need for repeated resampling from observations (resp. input-derived data). In ...

Revising the Wilks Scoring System for pro RAW Powerlifting
Mar 22 2019
Purpose: In powerlifting the total result is highly dependent on the athlete's bodyweight. Powerlifting is divided into equipped and RAW types. Pro RAW powerlifting competitions use the Wilks scoring system to compare and rank powerlifting results across ...

Three issues impeding communication of statistical methodology for incomplete data
Mar 21 2019
We identify three issues permeating the literature on statistical methodology for incomplete data written for non-specialist statisticians and other investigators. The first is a mathematical defect in the notation Yobs, Ymis used to partition the data ...

A response-matrix-centred approach to presenting cross-section measurements
Mar 15 2019, Mar 21 2019
The current canonical approach to publishing cross-section data is to unfold the reconstructed distributions. Detector effects like efficiency and smearing are undone mathematically, yielding distributions in true event properties. This is an ill-posed ...

Effects of Stochastic Parametrization on Extreme Value Statistics
Mar 13 2019, Jul 17 2019
Extreme geophysical events are of crucial relevance to our daily life: they threaten human lives and cause property damage. To assess the risk and reduce losses, we need to model and probabilistically predict these events. Parametrizations are computational ...

Tutorial: Deriving The Efficient Influence Curve for Large Models
Mar 05 2019
This paper aims to provide a tutorial for upper level undergraduate and graduate students in statistics and biostatistics on deriving influence functions for non-parametric and semi-parametric models. The author will build on previously known efficiency ...

Comparison of plotting system outputs in beginner analysts
Mar 03 2019
The R programming language is built on an ecosystem of packages, some of which allow analysts to accomplish the same tasks. For example, there are at least two clear workflows for creating data visualizations in R: using the base graphics package (referred ...

Bounds on Bayes Factors for Binomial A/B Testing
Feb 28 2019
Bayes factors, in many cases, have been proven to bridge the classic p-value based significance testing and Bayesian analysis of posterior odds. This paper discusses this phenomenon within the binomial A/B testing setup (applicable for example to conversion ...

A note on Fibonacci Sequences of Random Variables
Feb 26 2019
The focus of this paper is the random sequences in the form $\{X_{0}, X_{1}, X_{n}=X_{n-2}+X_{n-1}, n=2,3,\ldots\}$, referred to as Fibonacci Random Sequence (FRS). The initial random variables $X_{0}$ and $X_{1}$ are assumed to be absolutely continuous ...

Modeling the Health Expenditure in Japan, 2011. A Healthy Life Years Lost Methodology
Feb 25 2019
The Healthy Life Years Lost Methodology (HLYL) is introduced to model and estimate the Health Expenditure in Japan in 2011. The HLYL theory and estimation methods are presented in our books in the Springer Series on Demographic Methods and Population ...

A Combinatorial Approach to Causal Inference
Feb 18 2019
The objective of causal inference is to learn the network of causal relationships holding between a system of variables from the correlations that these variables exhibit; a sub-problem of which is to certify whether or not a given causal hypothesis is ...

Synthesis of High-Resolution Load Profiles with Minimal Data
Feb 15 2019
For the estimation of a new energy supply system it is important to have a high-resolution energy load profile. Such a profile is in general either not present or very costly to obtain. We will therefore present a method which synthesizes load profiles ...

Applications of band-limited extrapolation to forecasting of weather and financial time series
Feb 15 2019
This paper describes the practical application of causal extrapolation of sequences for the purpose of forecasting. The methods and proofs have been applied to simulations to measure the range over which data can be accurately extrapolated. Real world data ...

Optimal BIBD-extended designs
Feb 12 2019
Balanced incomplete block designs (BIBDs) are a class of designs with v treatments and b blocks of size k that are optimal with regard to a wide range of optimality criteria, but it is not clear which designs to choose for combinations of v, b and k ...

Learning spatially-correlated temporal dictionaries for calcium imaging
Feb 08 2019
Calcium imaging has become a fundamental neural imaging technique, aiming to recover the individual activity of hundreds of neurons in a cortical region. Current methods (mostly matrix factorization) are aimed at detecting neurons in the field-of-view ...

Characterization of Sine-Skewed von Mises Distribution
Feb 07 2019
The von Mises distribution is one of the most important distributions in statistics to deal with circular data. In this paper we will consider some basic properties and characterizations of the sine-skewed von Mises distribution.

CMS Sematrix: A Tool to Aid the Development of Clinical Quality Measures (CQMs)
Feb 05 2019
As part of the effort to improve quality and to reduce national healthcare costs, the Centers for Medicare and Medicaid Services (CMS) are responsible for creating and maintaining an array of clinical quality measures (CQMs) for assessing healthcare structure, ...

Uncertainty Quantification in Molecular Signals using Polynomial Chaos Expansion
Jan 30 2019
Molecular signals are abundant in engineering and biological contexts, and undergo stochastic propagation in fluid dynamic channels. The received signal is sensitive to a variety of input and channel parameter variations. Currently we do not understand ...

Shannon's entropy and its Generalizations towards Statistics, Reliability and Information Science during 1948-2018
Jan 28 2019
Starting from the pioneering works of Shannon and Wiener in 1948, a plethora of works have been reported on entropy in different directions. Entropy-related review work in the direction of statistics, reliability and information science, to the best of ...

Variability in the interpretation of Dutch probability phrases - a risk for miscommunication
Jan 28 2019
Verbal probability phrases are often used to express estimated risk. In this study, focus was on the numerical interpretation of 29 Dutch probability and frequency phrases, including several complementary phrases to test (a)symmetry in their interpretation. ...

Organic Fiducial Inference
Jan 23 2019
A substantial generalization is put forward of the theory of subjective fiducial inference as it was outlined in earlier papers. In particular, this theory is extended to deal with cases where the data are discrete or categorical rather than continuous, ...

Fitting A Mixture Distribution to Data: Tutorial
Jan 20 2019
This paper is a step-by-step tutorial for fitting a mixture distribution to data. It merely assumes the reader has the background of calculus and linear algebra. Other required background is briefly reviewed before explaining the main algorithm. In explaining ...

Custodes: Auditable Hypothesis Testing
Jan 19 2019
We present Custodes: a new approach to solving the complex issue of preventing "p-hacking" in scientific studies. The novel protocol provides a concrete and publicly auditable method for controlling false-discoveries and eliminates any potential for data ...

Systemic Risk: Conditional Distortion Risk Measures
Jan 15 2019, Jan 28 2019
In this paper, we introduce the rich classes of conditional distortion (CoD) risk measures and distortion risk contribution ($\Delta$CoD) measures as measures of systemic risk and analyze their properties and representations. The classes include the well-known ...