Research
My research sits at the intersection of probability and statistics, mostly focusing on hypothesis testing, statistical learning theory, and online learning. I've recently been working on sequential testing with e-values, the generalisation and expressiveness of over-parameterised models, and mean/parameter estimation. Here are some highlights of my recent work.
Sequential testing and e-values
Classical statistical hypothesis testing is based on p-values, which do not allow for changing plans mid-experiment. For example, you cannot peek at the results and stop early because the evidence looks strong, or decide on the significance level after having evaluated the p-value. Testing via e-values is a modern and more flexible approach, which is attracting growing interest. It is based on the concept of an e-variable: a non-negative random variable whose expectation under the null is at most one. Its realisation is an e-value. For a concise introduction see link; for a deeper (and often updated) treatment see the monograph by Ramdas and Wang (link). My recent work focuses on characterising the set of e-variables for constraint-defined hypotheses (e.g., testing for the mean), with particular attention to minimal complete classes (what I call optimal e-classes) of e-variables, and on the sequential analogue for e-processes.
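In symbols, for a null hypothesis given by a set of distributions (this is the standard formulation from the e-value literature, not anything specific to my papers):

```latex
% e-variable E for a null hypothesis \mathcal{H}_0 (a set of distributions):
\[ E \ge 0, \qquad \mathbb{E}_P[E] \le 1 \quad \text{for every } P \in \mathcal{H}_0. \]
% e-process: a non-negative adapted process (M_t)_{t \ge 0} with M_0 = 1 and
% \mathbb{E}_P[M_\tau] \le 1 for every P \in \mathcal{H}_0 and stopping time \tau.
% Rejecting as soon as M_t \ge 1/\alpha yields a level-\alpha sequential test:
\[ P\bigl( \exists\, t : M_t \ge 1/\alpha \bigr) \le \alpha \quad \text{for every } P \in \mathcal{H}_0. \]
```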
-
On the optimality of coin-betting for mean estimation
Abstract
We consider the problem of testing the mean of a bounded real random variable. We introduce a notion of optimal classes for e-variables and e-processes, and establish the optimality of the coin-betting formulation among e-variable-based algorithmic frameworks for testing and estimating the (conditional) mean. As a consequence, we provide a direct and explicit characterisation of all valid e-variables and e-processes for this testing problem. In the language of classical statistical decision theory, we fully describe the set of all admissible e-variables and e-processes, and identify the corresponding minimal complete class.
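As a concrete illustration of the coin-betting idea (a minimal sketch under my own simplifying choices, not the construction studied in the paper): for observations in [0, 1] and the null that the mean equals m, the wealth accumulated by betting on the deviations X_t − m is an e-process, and Ville's inequality turns it into a level-α sequential test.

```python
import numpy as np

def coin_betting_test(xs, m, alpha=0.05, frac=0.5):
    """Sequential level-alpha test of H0: E[X_t] = m, for X_t in [0, 1].

    The wealth W_t = prod_{s <= t} (1 + lam_s * (x_s - m)) has conditional
    expectation W_{t-1} under H0, so it is an e-process; by Ville's
    inequality, under H0 it ever reaches 1/alpha with probability <= alpha.
    """
    wealth = 1.0
    for t, x in enumerate(xs, start=1):
        # Any predictable bet lam in [-1/(1-m), 1/m] keeps the wealth
        # non-negative. This toy choice bets a fixed fraction on "mean > m";
        # in practice the bet is chosen adaptively (e.g. by online learning).
        lam = frac / m
        wealth *= 1.0 + lam * (x - m)
        if wealth >= 1.0 / alpha:
            return t  # sufficient evidence: reject H0 at time t
    return None  # H0 was never rejected

# Example: data with true mean 0.7 should soon reject H0: mean = 0.5.
rng = np.random.default_rng(0)
print(coin_betting_test(rng.beta(7, 3, size=2000), m=0.5))
```

-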
Optimal e-value testing for properly constrained hypotheses
Abstract
Hypothesis testing via e-variables can be framed as a sequential betting game, where a player picks an e-variable each round. A good player’s strategy results in an effective statistical test that rejects the null hypothesis as soon as sufficient evidence arises. Building on recent advances, we address the question of restricting the pool of e-variables to simplify strategy design without compromising effectiveness. We extend the results of Clerico (2024) by characterising optimal sets of e-variables for a broad class of non-parametric hypothesis tests, defined by finitely many regular constraints. As an application, we discuss this notion of optimality in algorithmic mean estimation, including for heavy-tailed random variables.
Algorithmic statistics
Several recent works have investigated ways to address classical statistical challenges by leveraging tools and techniques from online learning. A common pattern is to break the analysis into a stochastic and a worst-case part: the former is handled via standard martingale arguments, whilst the latter is controlled by the regret of an online learning game. I usually call this approach algorithmic statistics (this might not be a widely used term; I picked it up from Gergely Neu, as I think it's a nice name). Some examples of algorithmic statistics I am becoming rather familiar with are the online-to-PAC framework for generalisation bounds (link), sequential hypothesis testing via e-values (link), and obtaining confidence sequences via regret analysis (link, link). My work on the subject has mostly focused on introducing a general framework for confidence sets for the unknown parameter of generalised linear models, and on obtaining generalisation and concentration guarantees under data inter-dependencies.
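Schematically, and glossing over the precise setup (this is my shorthand for the pattern, not a formal statement from any of the linked papers): the generalisation gap of the learned posterior is bounded by the average regret of an online player in an associated game, plus an average of martingale increments that concentrates by standard arguments.

```latex
% Generalisation gap of the learned posterior Q_n, split into an online-learning
% (worst-case) term and a martingale (stochastic) term:
\[
\mathrm{gen}(Q_n) \;\le\;
\underbrace{\frac{\mathrm{Regret}_n}{n}}_{\text{worst-case: online learning}}
\;+\;
\underbrace{\frac{1}{n}\sum_{t=1}^{n} Z_t}_{\text{stochastic: martingale (e.g.\ Azuma--Hoeffding)}}
\]
```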
-
Confidence Sequences for Generalized Linear Models via Regret Analysis
Abstract
We develop a methodology for constructing confidence sets for parameters of statistical models via a reduction to sequential prediction. Our key observation is that for any generalized linear model (GLM), one can construct an associated game of sequential probability assignment such that achieving low regret in the game implies a high-probability upper bound on the excess likelihood of the true parameter of the GLM. This allows us to develop a scheme that we call online-to-confidence-set conversions, which effectively reduces the problem of proving the desired statistical claim to an algorithmic question. We study two varieties of this conversion scheme: 1) analytical conversions that only require proving the existence of algorithms with low regret and provide confidence sets centered at the maximum-likelihood estimator, and 2) algorithmic conversions that actively leverage the output of the online algorithm to construct confidence sets (and may be centered at other, adaptively constructed point estimators). The resulting methodology recovers all state-of-the-art confidence set constructions within a single framework, and also provides several new types of confidence sets that were previously unknown in the literature.
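In rough strokes (my sketch of the analytical conversion, with illustrative notation): under the true parameter, the likelihood ratio between the online algorithm's probability assignments and the model is a non-negative martingale, so Ville's inequality plus the regret bound yields a confidence sequence.

```latex
% Notation: \ell_t(\theta) = -\log p_\theta(y_t \mid x_t), \ell_t(q_t) = -\log q_t(y_t)
% for the online algorithm's prediction q_t; \hat\theta_n is the maximum-likelihood
% estimator. With probability at least 1 - \delta, simultaneously for all n,
% Ville's inequality for the likelihood-ratio martingale, then the regret bound:
\[
\sum_{t=1}^{n} \ell_t(\theta^\ast)
\;\le\; \sum_{t=1}^{n} \ell_t(q_t) + \log\tfrac{1}{\delta}
\;\le\; \sum_{t=1}^{n} \ell_t(\hat\theta_n) + \mathrm{Regret}_n + \log\tfrac{1}{\delta}.
\]
% Hence the confidence sequence
\[
C_n = \Bigl\{ \theta : \sum_{t=1}^{n} \ell_t(\theta)
      \le \sum_{t=1}^{n} \ell_t(\hat\theta_n) + \mathrm{Regret}_n + \log\tfrac{1}{\delta} \Bigr\}
\]
% contains \theta^\ast simultaneously for all n, with probability at least 1 - \delta.
```

-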
Online-to-PAC generalization bounds under graph-mixing dependencies
Abstract
Traditional generalization results in statistical learning require a training data set made of independently drawn examples. Most of the recent efforts to relax this independence assumption have considered either purely temporal (mixing) dependencies, or graph dependencies, where non-adjacent vertices correspond to independent random variables. Both approaches have their own limitations, the former requiring a temporally ordered structure, and the latter lacking a way to quantify the strength of inter-dependencies. In this work, we bridge these two lines of work by proposing a framework where dependencies decay with graph distance. We derive generalization bounds leveraging the online-to-PAC framework, by establishing a concentration result and introducing an online learning framework that incorporates the graph structure. The resulting high-probability generalization guarantees depend on both the mixing rate and the graph’s chromatic number.
-
Uniform mean estimation for monotonic processes
Abstract
We consider the problem of deriving uniform confidence bands for the mean of a monotonic stochastic process, such as the cumulative distribution function (CDF) of a random variable, based on a sequence of i.i.d. observations. Our approach leverages the coin-betting framework, and inherits several favourable characteristics of coin-betting methods. In particular, for each point in the domain of the mean function, we obtain anytime-valid confidence intervals that are numerically tight and adapt to the variance of the observations. To derive uniform confidence bands, we employ a continuous union bound that crucially leverages monotonicity. In the case of CDF estimation, we also exploit the fact that the empirical CDF is piecewise constant to obtain simple confidence bands that can be easily computed. In simulations, we find that our confidence bands for the CDF achieve state-of-the-art performance.
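To give a feel for the per-point intervals (an illustrative sketch of inverting coin-betting tests over a grid of candidate means, with my own toy bet sizes; this is not the paper's uniform band construction):

```python
import numpy as np

def betting_ci(xs, alpha=0.05, frac=0.5, grid=None):
    """Anytime-valid confidence interval for the mean of X in [0, 1],
    obtained by inverting coin-betting tests over candidate means m.

    For each m, the average of two wealth processes (betting on "mean > m"
    and on "mean < m") is an e-process under H0: E[X] = m; we keep the m's
    whose wealth never reached 1/alpha.
    """
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    kept = []
    for m in grid:
        w_up, w_dn, rejected = 1.0, 1.0, False
        for x in xs:
            w_up *= 1.0 + (frac / m) * (x - m)          # bet on mean > m
            w_dn *= 1.0 - (frac / (1.0 - m)) * (x - m)  # bet on mean < m
            if 0.5 * (w_up + w_dn) >= 1.0 / alpha:
                rejected = True
                break
        if not rejected:
            kept.append(m)
    return (min(kept), max(kept)) if kept else None

rng = np.random.default_rng(1)
print(betting_ci(rng.beta(3, 7, size=500)))  # interval around the true mean 0.3
```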
Generalisation bounds
In machine learning, generalisation refers to a model's ability to perform well on unseen data, after training on a limited dataset. A neural network’s parameters are tuned by optimising a learning objective that ensures good accuracy on the training data. Upper bounds on the gap between performance on this training dataset and expected performance on new data are known as generalisation bounds. Beyond theoretical interest, these results have practical value, as they help assess the reliability of a model. My research has focused on information-theoretic (link) and PAC-Bayesian (link) approaches to generalisation bounds, both of which build on a notion of stability: a model generalises well if it is not overly sensitive to the particular dataset used for training. These frameworks capture this idea by comparing distributions over model parameters before and after training.
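One canonical example (a standard McAllester/Maurer-type PAC-Bayes bound for losses in [0, 1]; textbook material, not a result of mine): for any prior π fixed before seeing the data, with probability at least 1 − δ over the draw of the n training examples, simultaneously for every posterior ρ,

```latex
% L is the population risk, \hat{L}_n the empirical risk on the training sample:
\[
\mathbb{E}_{h \sim \rho}\bigl[ L(h) \bigr]
\;\le\; \mathbb{E}_{h \sim \rho}\bigl[ \hat{L}_n(h) \bigr]
\;+\; \sqrt{ \frac{ \mathrm{KL}(\rho \,\Vert\, \pi) + \log\frac{2\sqrt{n}}{\delta} }{ 2n } }.
\]
```

The KL term makes the stability intuition explicit: the bound degrades the further training moves the posterior ρ away from the prior π.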
-
How good is PAC-Bayes at explaining generalisation?
Abstract
We discuss necessary conditions for a PAC-Bayes bound to provide a meaningful generalisation guarantee. Our analysis reveals that the optimal generalisation guarantee depends solely on the distribution of the risk induced by the prior distribution. In particular, a target generalisation level is achievable only if the prior places sufficient mass on high-performing predictors. We relate these requirements to the prevalent practice of using data-dependent priors in deep learning PAC-Bayes applications, and discuss the implications for the claim that PAC-Bayes explains generalisation.
-
Generalisation under gradient descent via deterministic PAC-Bayes
Abstract
We establish disintegrated PAC-Bayesian generalisation bounds for models trained with gradient descent methods or continuous gradient flows. Contrary to standard practice in the PAC-Bayesian setting, our result applies to optimisation algorithms that are deterministic, without requiring any de-randomisation step. Our bounds are fully computable, depending on the density of the initial distribution and the Hessian of the training objective over the trajectory. We show that our framework can be applied to a variety of iterative optimisation algorithms, including stochastic gradient descent (SGD), momentum-based schemes, and damped Hamiltonian dynamics.
-
Chained Generalisation Bounds
Abstract
This work discusses how to derive upper bounds for the expected generalisation error of supervised learning algorithms by means of the chaining technique. By developing a general theoretical framework, we establish a duality between generalisation bounds based on the regularity of the loss function, and their chained counterparts, which can be obtained by lifting the regularity assumption from the loss onto its gradient. This allows us to re-derive the chaining mutual information bound from the literature, and to obtain novel chained information-theoretic generalisation bounds, based on the Wasserstein distance and other probability metrics. We show on some toy examples that the chained generalisation bound can be significantly tighter than its standard counterpart, particularly when the distribution of the hypotheses selected by the algorithm is very concentrated.
-
Conditionally Gaussian PAC-Bayes
Abstract
Recent studies have empirically investigated different methods to train stochastic neural networks on a classification task by optimising a PAC-Bayesian bound via stochastic gradient descent. Most of these procedures need to replace the misclassification error with a surrogate loss, leading to a mismatch between the optimisation objective and the actual generalisation bound. The present paper proposes a novel training algorithm that optimises the PAC-Bayesian bound, without relying on any surrogate loss. Empirical results show that this approach outperforms currently available PAC-Bayesian training methods.