AutoXiv

Read what ships this week.

260421.0038
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
Alshammari · Wen · Zainal +5
MathNet is a large-scale, multilingual dataset of 30,676 Olympiad-level math problems from 47 countries spanning two decades, designed to benchmark both mathematical reasoning in generative models and mathematical retrieval in embedding systems. The benchmark reveals that even state-of-the-art models struggle with these problems, with top models achieving only 78.4% accuracy, and that retrieval quality significantly impacts retrieval-augmented generation performance.
Formal Sciences
260421.0039
Sessa: Selective State Space Attention
Horbatko
Sessa is a new sequence model that places attention inside a recurrent feedback path, enabling power-law memory decay rather than the exponential forgetting of pure recurrence or the 1/length dilution of pure attention. This architecture achieves superior long-context performance while remaining competitive on short sequences.
Formal Sciences
260421.0040
Bounded Ratio Reinforcement Learning
Ao · Chen · Lee +5
This paper introduces Bounded Ratio Reinforcement Learning (BRRL), a theoretical framework that bridges the gap between trust region methods and PPO's clipped objective, leading to a new algorithm called Bounded Policy Optimization (BPO) that provides monotonic improvement guarantees while matching or exceeding PPO's performance. The framework also extends to Group-relative BPO (GBPO) for large language model fine-tuning.
Formal Sciences
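For context, the clipped surrogate that BPO generalizes is standard PPO machinery. A minimal NumPy sketch of that baseline objective (the specific BRRL bound is not given in the summary, so only the PPO side is shown):

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate: the policy ratio r = pi_new / pi_old is
    clipped to [1 - eps, 1 + eps] so a single update cannot move the
    policy too far from the one that collected the data."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Pessimistic bound: take the smaller of the two, then average.
    return np.minimum(unclipped, clipped).mean()

ratio = np.array([0.5, 1.0, 1.5])
advantage = np.array([1.0, -1.0, 2.0])
objective = ppo_clipped_objective(ratio, advantage)
```

BRRL's contribution, per the summary, is a principled bound on this ratio that recovers trust-region-style monotonic improvement guarantees.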
260421.0041
When Can LLMs Learn to Reason with Weak Supervision?
Rahman · Shen · Mordvina +3
This paper investigates when reinforcement learning with verifiable rewards (RLVR) enables large language models to generalize under weak supervision (scarce data, noisy rewards, or self-supervised signals). The key finding is that models generalize when they exhibit prolonged pre-saturation training dynamics, which is predicted by reasoning faithfulness—the degree to which intermediate reasoning steps logically support final answers.
Formal Sciences
260421.0042
Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale
Koepke · Zverev · Ginosar +1
This paper challenges the Platonic Representation Hypothesis by showing that apparent alignment between vision and language models is an artifact of small-scale evaluation. When tested at scale with millions of samples and realistic many-to-many settings, cross-modal alignment degrades substantially, suggesting different modalities learn different representations of reality.
Formal Sciences
260421.0043
Revisiting Active Sequential Prediction-Powered Mean Estimation
Sfyraki · Wang
This paper analyzes active sequential prediction-powered mean estimation, where labels are selectively queried and ML predictions fill in the gaps. The authors find that contrary to intuition, using a nearly constant query probability (ignoring uncertainty) often produces tighter confidence intervals than adaptive uncertainty-based querying.
Formal Sciences
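For readers unfamiliar with the setup, the prediction-powered mean estimator with a constant query probability can be sketched on synthetic data (variable names and numbers are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def ppi_mean(preds, labels, queried):
    """Prediction-powered mean: average the model's predictions over all
    points, then correct the bias using the randomly queried labels."""
    rectifier = (labels[queried] - preds[queried]).mean()
    return preds.mean() + rectifier

n = 10_000
labels = rng.normal(loc=2.0, scale=1.0, size=n)            # true values
preds = labels + rng.normal(loc=0.3, scale=0.5, size=n)    # biased predictions
queried = rng.random(n) < 0.1   # constant (non-adaptive) query probability
estimate = ppi_mean(preds, labels, queried)
```

The paper's finding is about this querying rule: making the query probability depend on per-point uncertainty often does not tighten the resulting confidence intervals over the constant-probability baseline above.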
260421.0044
Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering
Gupta · Kumar
LPSR is an inference-time error correction method that monitors internal model activations to detect reasoning mistakes, then rolls back generation and steers the model using cached corrections—no training required. It improves an 8B model's math accuracy from 28.8% to 44.0%, outperforming prompted self-correction and even larger 70B models.
Formal Sciences
260421.0045
Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion
Leitch
This paper benchmarks cloud and local large language models on two System Dynamics tasks: extracting causal loop diagrams and providing interactive coaching. The best local models match mid-tier cloud performance on diagram extraction (77%) but struggle with long-context error-fixing tasks, with backend implementation choices mattering more than quantization levels.
Formal Sciences
260421.0046
GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
Dadgarnia · Tabesh · Nikdan +3
GSQ is a new scalar quantization method for large language models that uses Gumbel-Softmax relaxation to jointly optimize grid assignments and scales, achieving accuracy comparable to complex vector quantization methods while remaining compatible with existing inference kernels. It successfully quantizes models to 2-3 bits per parameter and scales to trillion-parameter mixture-of-experts models.
Formal Sciences
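The core Gumbel-Softmax relaxation can be sketched as a soft assignment of weights to grid points (a toy sketch; GSQ's joint optimization of scales and its inference-time details are not in the summary):

```python
import numpy as np

def gumbel_softmax_assign(weights, grid, tau=0.5, rng=None):
    """Soft, differentiable assignment of each scalar weight to a point on
    a quantization grid: Gumbel noise plus a temperature-scaled softmax
    over negative distances relaxes the discrete argmin."""
    rng = rng or np.random.default_rng(0)
    logits = -np.abs(weights[:, None] - grid[None, :])
    gumbel = -np.log(-np.log(rng.random(logits.shape)))
    scores = (logits + gumbel) / tau
    soft = np.exp(scores - scores.max(axis=1, keepdims=True))
    soft /= soft.sum(axis=1, keepdims=True)
    # Soft quantized value = expectation over grid points; at inference
    # one would take the hard argmax instead.
    return soft @ grid

grid = np.array([-1.0, 0.0, 1.0])        # a 3-level scalar grid
weights = np.array([-0.9, 0.1, 0.8])
quantized = gumbel_softmax_assign(weights, grid)
```

Because assignments stay differentiable during training, grid placement and scales can be optimized jointly by gradient descent, while the final hard assignments remain plain scalar quantization compatible with existing kernels.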
260421.0047
A Note on TurboQuant and the Earlier DRIVE/EDEN Line of Work
Ben-Basat · Ben-Itzhak · Mendelson +3
This note demonstrates that TurboQuant, a recent quantization method, is a suboptimal special case of the earlier EDEN/DRIVE quantization schemes. EDEN consistently outperforms TurboQuant across all tested scenarios, often by more than one bit of precision.
Formal Sciences
260421.0048
Physics-Informed Neural Networks for Biological $2\mathrm{D}{+}t$ Reaction-Diffusion Systems
Lavery · Cochrane · Olesen +3
This paper extends biologically-informed neural networks (BINNs) from 1D to 2D spatial domains for learning reaction-diffusion equations from data, combining neural network training with symbolic regression to discover closed-form equations. The method is demonstrated on real lung cancer cell microscopy data, successfully recovering interpretable 2D+time reaction-diffusion models.
Formal Sciences
260421.0049
FUSE: Ensembling Verifiers with Zero Labeled Data
Lee · Ma · Zhao +4
FUSE is a method that combines multiple imperfect AI verifiers to better judge model outputs without needing any labeled training data. By using spectral algorithms to account for dependencies among the verifiers, it matches or beats semi-supervised methods across diverse benchmarks.
Formal Sciences
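A classical label-free ingredient behind such ensembles is spectral reliability estimation in the Dawid-Skene tradition. A sketch under a conditional-independence assumption (FUSE's actual treatment of dependent verifiers is more involved and not specified in the summary):

```python
import numpy as np

def spectral_verifier_weights(votes):
    """Label-free reliability estimation: if verifiers err independently
    given the truth, their agreement matrix is approximately rank-one off
    the diagonal, and the leading eigenvector is proportional to each
    verifier's accuracy advantage over chance."""
    v = 2.0 * votes - 1.0                  # map {0,1} votes to {-1,+1}
    agree = (v @ v.T) / votes.shape[1]     # verifier-by-verifier agreement
    np.fill_diagonal(agree, 0.0)           # drop trivial self-agreement
    eigvals, eigvecs = np.linalg.eigh(agree)
    return np.abs(eigvecs[:, -1])          # weights from the top eigenvector

rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=2000)
accuracy = np.array([0.9, 0.7, 0.55])      # three verifiers, varying quality
flips = rng.random((3, 2000)) > accuracy[:, None]
votes = np.where(flips, 1 - truth, truth).astype(float)
weights = spectral_verifier_weights(votes)
```

With no access to `truth`, the recovered weights still rank the strong verifiers above the near-chance one, which is the information an ensemble needs.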
260421.0050
Wasserstein Distributionally Robust Risk-Sensitive Estimation via Conditional Value-at-Risk
Taha · Bitar
This paper develops a distributionally robust estimation method that minimizes worst-case conditional value-at-risk (CVaR) of estimation error when the true distribution is uncertain but lies within a Wasserstein ball. The method can be computed via tractable semidefinite programming and outperforms existing approaches on electricity price forecasting.
Formal Sciences
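CVaR itself has a convenient variational form (Rockafellar-Uryasev) that underlies tractable formulations like the paper's semidefinite program. A quick empirical sketch:

```python
import numpy as np

def cvar(losses, alpha=0.1):
    """Empirical CVaR via the Rockafellar-Uryasev variational form:
    CVaR_alpha = min_t  t + E[(loss - t)_+] / alpha,
    i.e. the average of the worst alpha-fraction of losses."""
    t = np.quantile(losses, 1.0 - alpha)   # the value-at-risk attains the min
    return t + np.mean(np.maximum(losses - t, 0.0)) / alpha

losses = np.arange(10.0)                   # losses 0, 1, ..., 9
worst_tail_mean = cvar(losses, alpha=0.1)  # mean of the worst 10%
```

Taking the worst case of this quantity over a Wasserstein ball of distributions is what makes the estimator robust to distributional uncertainty.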
260421.0051
Duality for the Adversarial Total Variation
Bungert · Schmitt
This paper establishes a mathematical duality framework for adversarial total variation, showing that adversarial training of binary classifiers can be understood through nonlocal calculus of variations. The work provides rigorous characterizations of subdifferentials using dual representations and integration by parts formulas in both metric and Euclidean spaces.
Formal Sciences
260421.0052
IDOBE: Infectious Disease Outbreak forecasting Benchmark Ecosystem
Adiga · Chou · Chiranth +7
IDOBE is a standardized benchmark dataset containing over 10,000 infectious disease outbreak segments from a century of surveillance data across 13 diseases, designed to evaluate and compare epidemic forecasting methods. The authors test 11 baseline models and find MLP-based methods perform most robustly, with statistical methods excelling in pre-peak phases.
Formal Sciences
260421.0053
Learning the Riccati solution operator for time-varying LQR via Deep Operator Networks
Chen · Biccari · Wang
This paper uses Deep Operator Networks (DeepONets) to learn a surrogate for the Riccati differential equation solution operator in finite-horizon LQR problems, enabling fast approximate optimal control without repeated numerical integration. The approach includes theoretical guarantees on stability and performance, and demonstrates significant computational speedups while maintaining high accuracy.
Formal Sciences
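The DeepONet architecture — a branch net encoding the input function, a trunk net encoding the query coordinate, combined by an inner product — can be sketched minimally (illustrative one-layer networks with random weights, not the paper's trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def deeponet(u_samples, t, Wb, Wt):
    """Minimal DeepONet: the branch net encodes the input function from
    its samples at fixed sensor points, the trunk net encodes the query
    coordinate (here, time t), and their inner product is the prediction."""
    branch = np.tanh(u_samples @ Wb)           # latent basis coefficients
    trunk = np.tanh(np.array([t, 1.0]) @ Wt)   # latent basis functions at t
    return float(branch @ trunk)

Wb = rng.normal(size=(16, 8))    # 16 sensor points -> 8 latent dims
Wt = rng.normal(size=(2, 8))
u = np.sin(np.linspace(0.0, 1.0, 16))   # a sampled input function
prediction = deeponet(u, 0.5, Wb, Wt)
```

For the LQR setting, the branch input would encode the time-varying problem data and the trunk the query time, so one forward pass replaces a numerical integration of the Riccati equation.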
260421.0054
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
Liang · Zhou · Lu +3
This paper addresses a critical problem in reinforcement learning for large language models: when base models are already very accurate on training benchmarks, standard RL methods fail because there aren't enough errors to learn from, causing models to collapse into repetitive solutions. The authors propose CUTS, a novel sampling strategy that maintains solution diversity even when models are highly accurate, improving generalization on challenging out-of-domain math problems by up to 15.1%.
Formal Sciences
260421.0055
Barrier-enforced multi-objective optimization for direct point and sharp interval forecasting
Amnuaypongsa · Suparanonrat · Wanitchollakit +1
This paper presents a neural network framework that simultaneously generates point forecasts and prediction intervals for multi-step time series forecasting, using multi-objective optimization to automatically balance forecast accuracy and interval sharpness while guaranteeing non-crossing intervals and target coverage. The method eliminates manual hyperparameter tuning and demonstrates superior performance on solar irradiance forecasting compared to existing approaches.
Formal Sciences
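Two generic ingredients of such interval forecasters — a quantile (pinball) loss for the bounds and a parameterization that cannot produce crossing — can be sketched as follows (the paper's specific barrier-based multi-objective formulation is not detailed in the summary):

```python
import numpy as np

def pinball_loss(y, q_pred, tau):
    """Pinball (quantile) loss: minimized in expectation when q_pred is
    the tau-quantile of y."""
    diff = y - q_pred
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

def noncrossing_interval(lower, raw_width):
    """Parameterize the interval as (lower, lower + nonnegative width) so
    the upper bound can never cross below the lower bound."""
    return lower, lower + np.maximum(raw_width, 0.0)

lo, hi = noncrossing_interval(np.array([0.5, 1.5, 2.5]),
                              np.array([1.0, -0.2, 1.0]))
```

Balancing the point-forecast loss against interval sharpness and coverage is the multi-objective part that the paper automates.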
260421.0056
Faster by Design: Interactive Aerodynamics via Neural Surrogates Trained on Expert-Validated CFD
Thumiger · Bartezzaghi · Rigotti +5
This paper introduces GIST, a neural network surrogate that predicts race-car aerodynamics 10,000× faster than traditional CFD simulations while maintaining accuracy suitable for early-stage design. The work includes a new high-fidelity dataset of LMP2 race-car aerodynamics validated by industry experts at Dallara, enabling interactive design exploration in motorsport.
Formal Sciences
260421.0057
Safe Control using Learned Safety Filters and Adaptive Conformal Inference
Huriot · Tabbara · Sibai
This paper introduces Adaptive Conformal Filtering (ACoFi), which combines learned safety filters with adaptive conformal inference to provide soft safety guarantees for control systems. The method dynamically adjusts switching criteria between nominal and safe policies based on prediction uncertainty, achieving better safety performance than fixed-threshold approaches.
Formal Sciences
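The adaptive conformal inference update that ACoFi builds on (Gibbs and Candès) is a one-line online correction of the miscoverage level. A sketch with a toy coverage sequence:

```python
def aci_update(alpha_t, miscovered, target=0.1, gamma=0.05):
    """One step of adaptive conformal inference: after a miscoverage,
    shrink alpha (making future prediction sets more conservative);
    after a covered step, grow it back toward the target rate."""
    return alpha_t + gamma * (target - float(miscovered))

alpha = 0.1
for miss in [0, 0, 1, 0, 1]:     # a toy coverage/miss sequence
    alpha = aci_update(alpha, miss)
```

In ACoFi, per the summary, the adapted level drives the switching criterion between the nominal policy and the learned safety filter rather than a fixed threshold.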