Home

The Workshop on Functional Inference and Machine Intelligence (FIMI) is an international workshop on machine learning and statistics, with a particular focus on theory, methods, and practice. It consists of invited talks and poster sessions. The topics include (but are not limited to):

  • Machine Learning Methods
  • Deep Learning
  • Kernel Methods
  • Probabilistic Methods
Previous Workshops: 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025
FIMI2026

Registration

Registration period: January 19 - February 15, 2026
Registration is closed.

Invited Speakers

Program

March 2nd (Mon)

10:30--10:40 Opening Remarks
10:40--11:30

Makoto Yamada (Okinawa Institute of Science and Technology)

Title: Brain-Inspired Models Meet World Models: Temporal Prediction for Self-Supervised Learning

Predictive coding and the temporal prediction hypothesis suggest that the brain acquires representations by predicting future sensory inputs from past experience. Inspired by this idea, we reinterpret non-contrastive self-supervised learning as a form of temporal prediction and propose PhiNet, a brain-inspired architecture with two predictors corresponding to the hippocampal CA3 and CA1. We show that PhiNet yields more stable representations and faster adaptation in online and continual learning settings. We further introduce PhiNet v2, a Transformer-based extension that learns from image sequences without strong data augmentation and is closely related to the joint embedding predictive architecture (JEPA). In this talk, we will highlight the connection between PhiNet and world models and discuss promising directions for future research.
11:35--12:25

Sho Sonoda (RIKEN AIP)

Title: Why Agentic Theorem Prover Works: A Statistical Provability Theory of Mathematical Reasoning Models

We develop a theoretical analysis of LLM-guided formal theorem proving in interactive proof assistants (e.g., Lean) by modeling tactic proposal as a stochastic policy in a finite-horizon deterministic MDP. To capture modern representation learning, we treat the state and action spaces as general compact metric spaces and assume Lipschitz policies. To explain the gap between worst-case hardness and empirical success, we introduce problem distributions generated by a reference policy \(q\). Under a top-\(k\) search protocol and Tsybakov-type margin conditions, we derive lower bounds on finite-horizon success probability that decompose into search and learning terms, with learning controlled by sequential Rademacher/covering complexity. Our main result shows that when cut elimination expands a DAG of depth \(D\) into a cut-free tree of size \(\Omega(\Lambda^D)\) while the cut-aware hierarchical process has size \(O(\lambda^D)\) with \(\lambda\ll\Lambda\), a flat (cut-free) learner provably requires exponentially more data than a cut-aware hierarchical learner. This provides a principled justification for subgoal decomposition in recent agentic theorem provers.
12:30--14:00 Lunch
14:00--14:50

Chulhee Yun (KAIST)

Title: How Does Neural Network Training Work: Edge of Stability, River Valley Landscape, and More

Traditional analyses of gradient descent (GD) state that GD monotonically decreases the loss as long as the "sharpness" of the objective function—defined as the maximum eigenvalue of the objective's Hessian—is below a threshold 2/η, where η is the step size. Recent works have identified a striking discrepancy between traditional GD analyses and modern neural network training, referred to as the "Edge of Stability" phenomenon, in which the sharpness at GD iterates increases over time and hovers around the threshold 2/η, while the loss continues to decrease rather than diverging. This discovery calls for an in-depth investigation into the underlying cause of the phenomenon as well as the actual inner mechanisms of neural network training. In this talk, I will briefly overview the Edge of Stability phenomenon and recent theoretical explanations of its underlying mechanism. We will then explore where learning actually occurs in the parameter space, discussing a recent paper that challenges the idea that neural network training happens in a low-dimensional dominant subspace. Based on these observations, I propose the hypothesis that the training loss landscape resembles a "river valley." I will also present an analysis of the Schedule-Free AdamW optimizer (Defazio et al., 2024) through this river-valley lens, including insights into why schedule-free methods can be advantageous for scalable pretraining of language models.
14:55--15:45

Atsushi Nitanda (A*STAR)

Title: TBA

TBA
25 min break
16:10--17:00

Han Bao (The Institute of Statistical Mathematics)

Title: Post-hoc calibration through the lens of optimal transport and scoring rules

Probability calibration is a fundamental framework for quantifying the uncertainty of a well-trained classifier. In particular, calibration for multi-class classifiers remains challenging. In the first half, I introduce a multi-class extension of isotonic regression, a classical and standard algorithm for binary calibration. The idea is to learn the gradient of a convex potential via an optimal transport map. In the second half, scoring rules are revisited for classification to highlight that some scoring rules are improper despite their popularity. Nevertheless, a post-hoc transform can be constructed in closed form, and the underlying probability can be correctly recovered.

March 3rd (Tue)

9:40--10:30

Dino Sejdinovic (University of Adelaide)

Title: Causally Aligned Active Learning

We study how to estimate causal effects with fewer outcome measurements, which is essential when each measurement is costly. A central challenge is that the quantities we ultimately care about, such as potential outcomes and treatment effects, are not directly observable. As a result, standard active learning heuristics that target uncertainty in model parameters or factual predictions can be misaligned with the causal estimation objective. We present a Bayesian experimental design perspective on causally aligned data acquisition: at each step, we choose which data item to measure next by estimating its expected value for reducing posterior uncertainty about a target causal quantity. We first apply this principle to Conditional Average Treatment Effect (CATE) estimation, showing how acquisition rules can be built to focus directly on uncertainty in unobserved causal outcomes. We then broaden the scope beyond CATE to a general class of causal quantities, which can be expressed as integrals of regression functions, yielding a unified framework that supports acquisition strategies based on information gain. Together, these results clarify how aligning the acquisition objective with the desired causal estimand leads to more interpretable trade-offs, and how the best strategy depends on both the causal target and the structure of the data. Joint work with Erdun Gao and Jake Fawkes.
10:35--11:25

Taiji Suzuki (The University of Tokyo)

Title: TBA

TBA
11:30--13:00 Lunch
13:00--13:50

Tatsuya Harada (The University of Tokyo)

Title: A Unified Framework for Domain Adaptation

Domain adaptation transfers knowledge from a label-rich source domain to a label-scarce target domain, yet prior theories often emphasize marginal alignment and source error while largely ignoring the intractable joint error. This talk argues that joint error is crucial, especially under large domain gaps. For unsupervised DA, we propose a new objective derived from an upper bound of the joint error, together with a label-induced hypothesis space to tighten the bound, and a cross-margin discrepancy to stabilize adversarial learning. We further extend the joint-error perspective to semi-supervised, open-set, multi-source, and source-free settings by integrating PU learning theory and an attention-based feature generation module, achieving both tighter risk bounds and practical applicability. Extensive experiments across diverse datasets demonstrate consistent improvements over strong baselines, particularly under large shifts and limited labeled target data.
13:55--14:45

Pascal Mettes (University of Amsterdam)

Title: Hyperbolic Deep Learning

From linear layers and convolutions to self-attention, deep learning is implicitly Euclidean. But should it be? In this talk, I will dive into hyperbolic geometry for deep learning, with a focus on computer vision. I will discuss what hyperbolic geometry is and its strong potential as a new foundation for deep learning, from learning hierarchical representations to making neural networks more robust. I will place an emphasis on vision-language models, where hyperbolic embeddings are key for vision-language alignment. Lastly, I will show how to get started in this new and exciting field.
15 min break
15:00--15:50

Qing Qu (University of Michigan)

Title: Harnessing Low-Dimensionality for Generalizable and Scientific Generative AI

The empirical success of modern generative AI, from diffusion models to Large Language Models (LLMs), often outpaces our classical understanding of how machine learning models generalize from finite, out-of-distribution (OOD) data. This talk introduces a unified mathematical framework identifying intrinsic low-dimensional structures as the primary driver of generalization and a critical lever for advancing scientific AI. First, we deconstruct the generalization mechanism of diffusion models, revealing a training transition from memorization to generalization that effectively breaks the curse of dimensionality. Using a mixture of low-rank Gaussian models, we demonstrate that sample complexity scales linearly with the intrinsic dimension rather than exponentially with the ambient dimension, establishing a formal equivalence with the canonical subspace clustering problem. Moreover, by exploring nonlinearity in two-layer denoising autoencoders, we uncover how weight structures differ between memorization and generalization. This distinction provides a unified understanding of how models learn representations and generate new data. Second, we characterize the OOD generalization of in-context learning (ICL) in transformers. For linear regression tasks in which vectors lie in low-dimensional subspaces, we show that OOD capabilities emerge from interpolating across training task subspaces. We derive precise conditions under which linear attention models interpolate across distribution shifts, highlighting task diversity as a prerequisite for ICL efficacy. Finally, we translate these theoretical insights into practical guidelines for controlled generation, ensuring model safety, privacy, and solving high-dimensional inverse problems in science and engineering.
15 min break
16:00--18:00 Poster Session

March 4th (Wed)

9:40--10:30

Song Liu (University of Bristol)

Title: Flow Models Beyond Generative Tasks

Flow-based methods have achieved significant success in various generative modeling tasks, capturing nuanced details within complex data distributions. There is an expanding body of literature now exploiting this unique capability to resolve fine-grained structural details beyond generation tasks. In this talk, we explore several non-generative modeling tasks inspired by flow models.
10:35--11:25

Arthur Gretton (University College London)

Title: MMD flows, chi2 flows, and neural net training

We construct a Wasserstein gradient flow on the Maximum Mean Discrepancy (MMD): a family of integral probability metrics (measures of distance between probabilities), indexed by a choice of kernel. The MMD flow is used to construct a particle flow from any source distribution to any target distribution. We relate this flow to the problem of training neural networks with large numbers of neurons, where individual neurons can be considered as particles. Next, we show that convergence of the MMD flow works best when the MMD (or rather, its kernel) is chosen according to the current locations of the source and target particles. We can do this in two ways: first, by interpolating to a chi2 flow; second, by learning a family of linear kernels on neural net features from the forward noising process of a diffusion model.
11:30--13:00 Lunch
13:00--13:50

Chieh-Hsin Lai (Sony AI)

Title: The Principles of Diffusion Models: From Origins to Flow-Map Models

This talk presents the core principles behind diffusion models and their modern generalization to flow-map models for fast generation. We view diffusion as learning a time-dependent velocity field that transports a simple noise distribution to the data distribution through a continuum of intermediate states, so sampling amounts to solving an ODE from noise to data. Because ODE-based sampling can be slow, we introduce diffusion-motivated flow-map models that learn direct mappings between arbitrary time points, enabling faster and more flexible sampling. We highlight two early flow-map formulations, Consistency Models (OpenAI) and Consistency Trajectory Models (Sony AI), and then discuss MeanFlow (CMU) as a recent follow-up. Finally, we explain why training these models can be unstable and expensive, and present our Consistency Mid-Training (CMT) as a general recipe to improve stability and efficiency.
13:55--14:45

Takaharu Yaguchi (Kobe University)

Title: Machine Learning Methods for Learning Physical Systems and Related Problems

In recent years, machine learning methods for modelling physical systems have attracted much attention. Examples of such methods include Hamiltonian neural networks and symplectic neural networks. However, there has been little theoretical analysis of these methods. In this talk, I will introduce these methods, existing theoretical analysis results, and also unsolved problems.
15 min break
15:00--17:00 Poster Session

March 5th (Thu)

9:40--10:30

Masaaki Imaizumi (The University of Tokyo)

Title: Analysis of the dynamics of neural networks with high-dimensional structure: Fast-slow dynamics for unlearning and precise analysis of deep models

This talk introduces several results analyzing the learning dynamics of neural networks by leveraging high-dimensional structures. While neural network learning exhibits complex dynamics, employing the high-dimensional limit to reduce it to the dynamics of parameter element distributions enables effective analysis. The first result analyzes the mechanism of feature unlearning, in which neural networks forget learned features, by utilizing multiple timescales within neural networks, specifically the fast-slow dynamics. The second result performs a precise analysis of the learning dynamics of multi-layer finite-width neural networks, attempting to relax the limitations of conventional theories that primarily analyze two-layer neural networks. Both analyses are based on recent dynamical analysis techniques using high-dimensional statistics.
10:35--11:25

Bharath Sriperumbudur (The Pennsylvania State University)

Title: (De)regularized Wasserstein Gradient Flows via Reproducing Kernels

Wasserstein gradient flows have become a popular tool in machine learning with applications in sampling, variational inference, generative modeling, and reinforcement learning, among others. The Wasserstein gradient flow (WGF) involves minimizing a probability functional over the Wasserstein space (by taking into account the intrinsic geometry of the Wasserstein space). In this work, we introduce approximate/regularized Wasserstein gradient flows in two different settings: (a) approximate the probability functional and (b) approximate the Wasserstein geometry. In (a), we consider the probability functional to be chi2-divergence, whose WGF is difficult to implement. To this end, we propose a (de)-regularization of the Maximum Mean Discrepancy (DrMMD) as an approximation of chi2-divergence and develop an approximate WGF, which is easy to implement and has applications in generative modeling. On the other hand, in the setting of (b), we use Kullback-Leibler divergence as the probability functional and develop an approximation to the Wasserstein geometry, which allows for a more efficient implementation than the exact WGF, with applications in sampling. In both settings, we present a variety of theoretical results that relate the approximate flow to the exact flow and demonstrate the superiority of the approximate flows via numerical simulations.
11:30--13:00 Lunch
13:00--13:50

Motonobu Kanagawa (EURECOM)

Title: Gaussian Processes and Reproducing Kernels: Connections and Equivalences

This talk discusses the relations between two approaches using positive definite kernels: probabilistic methods using Gaussian processes, and non-probabilistic methods using reproducing kernel Hilbert spaces (RKHS). They are widely studied and used in machine learning, statistics, and numerical analysis. Connections and equivalences between them are reviewed for fundamental topics such as regression, interpolation, numerical integration, distributional discrepancies, and statistical dependence, as well as for sample path properties of Gaussian processes. A unifying perspective for these equivalences is established, based on the equivalence between the Gaussian Hilbert space and the RKHS. This serves as a foundation for many other methods based on Gaussian processes and reproducing kernels, which are being developed in parallel by the two research communities.
13:55--14:45

Krikamol Muandet (CISPA)

Title: Integral Imprecise Probability Metrics

Quantifying differences between probability distributions is fundamental to statistics and machine learning, primarily for comparing statistical uncertainty. In contrast, epistemic uncertainty (EU), due to incomplete knowledge, requires richer representations than those offered by classical probability. Imprecise probability (IP) theory offers such models, capturing ambiguity and partial belief. This has driven growing interest in imprecise probabilistic machine learning (IPML), where inference and decision-making rely on broader uncertainty models, highlighting the need for metrics beyond classical probability. This work introduces the Integral Imprecise Probability Metric (IIPM) framework, a Choquet integral-based generalisation of the classical Integral Probability Metric (IPM) to the setting of capacities, a broad class of IP models encompassing many existing ones, including lower probabilities, probability intervals, belief functions, and more. Theoretically, we establish conditions under which IIPM serves as a valid metric and metrises a form of weak convergence of capacities. Practically, IIPM not only enables comparison across different IP models but also supports the quantification of epistemic uncertainty within a single IP model. In particular, by comparing an IP model with its conjugate, IIPM gives rise to a new class of EU measures called Maximum Mean Imprecision (MMI), which satisfy key axiomatic properties proposed in the Uncertainty Quantification literature. We validate MMI through selective classification experiments, demonstrating strong empirical performance against established EU measures, and outperforming them when classical methods struggle to scale to a large number of classes. Our work advances both theory and practice in IPML, offering a principled framework for comparing and quantifying epistemic uncertainty under imprecision.
25 min break
15:10--16:00

Guillaume Braun

Title: How Anisotropy Shapes Learning Dynamics and the Role of Optimization Geometry

Anisotropic data — such as Gaussian inputs with spiked or power-law covariance — induces gradient-based learning dynamics that differ qualitatively from those in the isotropic setting. We illustrate this through two complementary toy-model analyses.

First, in nonlinear phase retrieval with power-law covariance, the population gradient flow is governed by an infinite hierarchy of coupled ODEs rather than a finite-dimensional system. This leads to explicit scaling laws and multi-timescale learning behavior governed by the data eigenstructure.

Second, in a spiked covariance model where the dominant variance direction is uninformative, gradient descent amplifies this direction during early training, delaying alignment with the signal. Spectral gradient descent suppresses this effect through scale-invariant updates, resulting in faster and more stable convergence.

These results show that anisotropy fundamentally alters both learning dynamics and the performance of gradient-based methods.
16:05--16:55

Kenji Fukumizu (The Institute of Statistical Mathematics)

Title: Flow Matching from Viewpoint of Proximal Operators

We reformulate Optimal Transport Conditional Flow Matching (OT-CFM), a class of dynamical generative models, showing that it admits an exact proximal formulation via an extended Brenier potential, without assuming that the target distribution has a density. In particular, the mapping to recover the target point is exactly given by a proximal operator, which yields an explicit proximal expression of the vector field. We also discuss the convergence of minibatch OT-CFM to the population formulation as the batch size increases. Finally, using second epi-derivatives of convex potentials, we prove that, for manifold-supported targets, OT-CFM is terminally normally hyperbolic: after time rescaling, the dynamics contracts exponentially in directions normal to the data manifold while remaining neutral along tangential directions, implying that the data manifold is a terminal normally hyperbolic attractor.
17:00--17:30 Discussion and Closing

Poster Schedule

We will prepare A0-size (841mm x 1189mm) poster boards; both portrait and landscape orientations are acceptable.

  • 3rd (Tue)

    1. Haruka Eshima (Okinawa Institute of Science and Technology) - Effects of l2-regularization and hyperparameters on loss landscape
    2. Wei Huang (RIKEN AIP) - How Does Label Noise Gradient Descent Improve Generalization in the Low SNR Regime?
    3. Shuhei Kashiwamura (The University of Tokyo) - Transition from positional learning to semantic learning induced by quantization
    4. Akihiro Maeda (Japan Advanced Institute of Science and Technology) - Structural Learning in Probability Tensors via Vanishing 2×2 Minors and Walsh Transform
    5. Ryunosuke Miyazaki (Hitotsubashi University) - Doubly Robust Variable Significance Testing for Conditional Frechet Means
    6. Kaoru Otsuka (Okinawa Institute of Science and Technology) - Delayed Momentum Aggregation: Communication-efficient Byzantine-robust Federated Learning with Partial Participation
    7. Mana Sakai (The University of Tokyo) - Generalization Bounds for Transformers via Effective Rank
    8. Yuki Takezawa (Kyoto University / OIST) - Exploiting Similarity for Computation and Communication-Efficient Decentralized Optimization
    9. Tomoya Wakayama (RIKEN AIP) - Generalization Error of Mean-Field Shallow Neural Networks Trained by Wasserstein Gradient Flow
    10. Yakun Wang (University of Bristol) - Zero-flow encoders
    11. Junichiro Yoshida (Graduate School of Mathematical Sciences, University of Tokyo) - Estimation error and hypothesis testing for non-identifiable models, with applications to machine learning
    12. Donghao Zhu (The Institute of Statistical Mathematics) - A Learning-Based Hybrid Decision Framework for Matching Systems with User Departure Detection
    13. Yushi Hirose (Institute of Science Tokyo) - Mixture Proportion Estimation and Weakly-supervised Kernel Test for Conditional Independence
    14. Yichen Zang (University of Bristol) - Unbiased and Robust Test-time Guidance for Prior Adaptation
  • 4th (Wed)

    1. Dominic Jay Broadbent (University of Bristol) - Bilateral Distribution Compression: Reducing Both Data Size and Dimensionality
    2. Noboru Isobe (RIKEN AIP) - TBA
    3. Yoh-ichi Mototake (Hitotsubashi University) - Uncertainties in Physics-informed Inverse Problems
    4. Atsuyoshi Muta (The University of Tokyo) - Singularity-Induced Local Complexity of Learning Models via Singular Learning Theory
    5. Hirofumi Shiba (Institute of Statistical Mathematics) - Diffusive Scaling Limits & Early Diagnostics for PDMP Samplers
    6. Kazuma Sawaya (The University of Tokyo) - Provable False Discovery Rate Control for Deep Feature Selection
    7. Eiki Shimizu (The Institute of Statistical Mathematics) - TBA
    8. Joe Suzuki (Osaka University) - Bayesian ICA for Causal Discovery
    9. Shu Tamano (The University of Tokyo) - A Principled Approach to Generative Adversarial Network-Based Causal Inference
    10. Masaya Taniguchi (RIKEN) - TBA
    11. Shoji Toyota (Kyushu University) - TBA
    12. Yanfeng Yang (The Graduate University for Advanced Studies, The Institute of Statistical Mathematics) - Conditionally Whitened Generative Models for Probabilistic Time Series Forecasting
    13. Shota Yano (The University of Tokyo) - Quasi-likelihood analysis for adaptive estimation of a degenerate diffusion process under relaxed balance conditions
    14. Shunyu Zhao (The Institute of Statistical Mathematics) - TBA

Organizers

    General Organising Committee
  • Han Bao, The Institute of Statistical Mathematics
  • Masaaki Imaizumi, The University of Tokyo
  • Taiji Suzuki, The University of Tokyo
  • Tatsuya Harada, The University of Tokyo
  • Kenji Fukumizu, The Institute of Statistical Mathematics

Sponsors

Location

Address: TKP Shinjuku West Exit Conference Center
8th floor, 1 Chome-10-1 Nishi-shinjuku, Shinjuku, Tokyo, JAPAN 160-0023.
For details, see the official information.
(Note: there is another venue named TKP *Shinjuku* Conference Center. Please be careful not to confuse the two.)