Iterative generative policies, such as diffusion models and flow matching, offer superior expressivity for continuous control but complicate Maximum Entropy Reinforcement Learning because their action log-densities are not directly accessible. To address this, we propose Field Least-Energy Actor-Critic (FLAC), a likelihood-free framework that regulates policy stochasticity by penalizing the kinetic energy of the velocity field.
Our key insight is to formulate policy optimization as a Generalized Schrödinger Bridge (GSB) problem relative to a high-entropy reference process (e.g., uniform). Under this view, the maximum-entropy principle emerges naturally as staying close to a high-entropy reference while optimizing return, without requiring explicit action densities. In this framework, kinetic energy serves as a physically grounded proxy for divergence from the reference: minimizing path-space energy bounds the deviation of the induced terminal action distribution.
Building on this view, we derive an energy-regularized policy iteration scheme and a practical off-policy algorithm that automatically tunes the kinetic energy penalty via a Lagrangian dual mechanism. Empirically, FLAC matches or outperforms strong baselines on high-dimensional benchmarks while avoiding explicit density estimation.
The Schrödinger Bridge Problem (SBP) finds the most likely stochastic process connecting two distributions. Given initial distribution $\mu_0$ and terminal distribution $\mu_1$, it solves:
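$$\min_{\mathbb{P}} \;\; \mathcal{D}_{\mathrm{KL}}(\mathbb{P} \,\|\, \mathbb{P}^{\mathrm{ref}}) \quad \text{s.t.} \quad \mathbb{P}_0 = \mu_0, \;\; \mathbb{P}_1 = \mu_1,$$

where the minimum is taken over path measures $\mathbb{P}$, the subscripts denote time-marginals, and $\mathbb{P}^{\mathrm{ref}}$ is a reference process such as Brownian motion.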
The Generalized Schrödinger Bridge (GSB) relaxes the hard terminal constraint to a soft potential. This is crucial: we no longer need samples from the target distribution $\mu_1$, only a potential function $\mathcal{G}$ that scores terminal states.
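In one common formulation, the hard constraint is replaced by a terminal reward term:

$$\min_{\mathbb{P}} \;\; \mathcal{D}_{\mathrm{KL}}(\mathbb{P} \,\|\, \mathbb{P}^{\mathrm{ref}}) \;-\; \mathbb{E}_{X_1 \sim \mathbb{P}_1}\big[\mathcal{G}(X_1)\big].$$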
The optimal solution has a closed form—the terminal distribution is an exponential tilting of the reference:
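$$\mu_1^*(x) \;\propto\; \mu_1^{\mathrm{ref}}(x)\, \exp\!\big(\mathcal{G}(x)\big),$$

where $\mu_1^{\mathrm{ref}}$ is the terminal marginal of the reference process.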
Let's start with what we want. Maximum Entropy RL seeks policies that maximize return while maintaining high entropy:
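$$J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t} r(s_t, a_t) \;+\; \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right],$$

where $\alpha > 0$ is the temperature. The per-state optimum of this objective is the familiar Boltzmann policy $\pi^*(a \mid s) \propto \exp\!\big(Q^*(s, a)/\alpha\big)$.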
This is elegant in theory. But here's the problem: for iterative generative policies like diffusion models, we can't compute $\log \pi(a|s)$—actions emerge from a multi-step stochastic process. The entropy term becomes intractable.
Now, recall the GSB optimal solution from Section 1:
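$$\mu_1^*(a) \;\propto\; \mu_1^{\mathrm{ref}}(a)\, \exp\!\big(\mathcal{G}(a)\big).$$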
Notice something? These have the same form! If we choose:
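$$\mathcal{G}(a) \;=\; \frac{Q(s, a)}{\alpha}, \qquad \mu_1^{\mathrm{ref}}(a) \;\propto\; 1 \;\; \text{(a uniform reference over actions)}.$$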
Then GSB directly gives us the Boltzmann policy—without ever computing $\log \pi(a|s)$.
Here's the key insight: in RL, we don't have samples from the optimal policy $\pi^*$—we only have a scoring function (Q-value). This matches GSB exactly: no target samples needed, just a potential $\mathcal{G}$. The path divergence term implicitly regularizes entropy through the generation process itself.
This gives us three benefits: no explicit action densities, no samples from a target distribution, and entropy regularization that arises implicitly through the generation process itself.
So far so good. But the GSB objective involves a path constraint—staying close to the reference process. This is defined on the space of entire trajectories, which sounds abstract and hard to optimize. Can we turn it into something concrete?
Consider a controlled process with drift $u_\theta$ and a reference process with zero drift (pure noise):
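$$dX_\tau \;=\; u_\theta(s, \tau, X_\tau)\, d\tau \;+\; \sigma\, dW_\tau, \qquad dX^{\mathrm{ref}}_\tau \;=\; \sigma\, dW_\tau, \qquad \tau \in [0, 1],$$

both started from the same initial noise distribution, with the action taken as the terminal state $a = X_1$.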
The two processes share the same noise but differ in drift. How far apart are they? This depends on whether we work with SDEs or ODEs:
For stochastic processes, Girsanov's theorem gives an exact identity: the KL divergence on path space equals the expected kinetic energy of the drift:
$$\mathcal{D}_{\mathrm{KL}}(\mathbb{P}^\theta \| \mathbb{P}^{\mathrm{ref}}) = \frac{1}{2\sigma^2} \mathbb{E}_{\mathbb{P}^\theta} \left[ \int_0^1 \|u_\theta(s, \tau, X_\tau)\|^2 d\tau \right]$$
This is not an approximation—it is an equality. Minimizing kinetic energy is exactly equivalent to minimizing path-space KL divergence.
For deterministic flows, the KL interpretation no longer applies. Instead, kinetic energy upper-bounds the squared $W_2$ distance between the induced terminal distribution and the reference:
$$W_2^2(p_1^\theta, \mu_1^{\mathrm{ref}}) \leq \mathbb{E}\left[\int_0^1 \|u_\theta\|^2 d\tau\right]$$
This acts as a geometric proximity constraint: penalizing kinetic energy prevents aggressive, large-scale transport that would concentrate probability mass. While this does not directly bound entropy, it empirically discourages rapid mode collapse and promotes broad action coverage.
In both cases, kinetic energy provides a tractable, computable measure of how far the policy strays from the high-entropy reference: an exact path-space KL divergence in the SDE case, and an upper bound on the squared $W_2$ distance to the reference in the ODE case.
Think of it this way: kinetic energy $\|u_\theta\|^2$ measures the "effort" to transport noise to actions, while the terminal potential $-Q(s,a)$ acts like a potential energy landscape guiding where actions should land. The policy seeks paths that minimize total action—balancing efficient transport (low kinetic energy) against reaching high-reward regions (low potential energy). This echoes the principle of least action in physics: among all possible paths, nature chooses the one that minimizes the integrated difference between kinetic and potential energy.
Now we have all the pieces. Combining GSB formulation + RL potential + kinetic energy regularization, we arrive at the FLAC objective:
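$$\max_\theta \;\; \mathbb{E}_{s,\; X \sim \mathbb{P}^\theta(\cdot \mid s)} \left[\, Q(s, X_1) \;-\; \frac{\alpha}{2} \int_0^1 \|u_\theta(s, \tau, X_\tau)\|^2 \, d\tau \,\right]$$

(written schematically here: $Q$ is the critic and $\alpha$ is the energy weight tuned by the dual mechanism described below; see the paper for the precise off-policy form).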
That's it. We can sample trajectories, compute kinetic energy along the path, and optimize with standard gradient descent. No likelihood computation required.
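To make this concrete, here is a minimal PyTorch-style sketch of the actor update under the objective above. The drift network `policy_drift(states, tau, x)`, the critic `critic(states, actions)`, and the attribute `critic.action_dim` are hypothetical names, and the rollout uses a plain fixed-step Euler-Maruyama scheme; treat it as an illustration rather than the reference implementation.

```python
import torch

def actor_loss(policy_drift, critic, states, alpha, n_steps=10, sigma=0.1):
    """Kinetic-energy-regularized actor loss (illustrative sketch).

    Rolls the controlled SDE forward with fixed-step Euler-Maruyama updates,
    accumulates (1/2) * ||u_theta||^2 along the path, and trades that energy
    off against the critic value of the terminal action.
    """
    batch = states.shape[0]
    act_dim = critic.action_dim                      # assumed attribute on the critic
    dt = 1.0 / n_steps

    x = torch.randn(batch, act_dim, device=states.device)   # X_0: pure noise
    kinetic = torch.zeros(batch, device=states.device)

    for k in range(n_steps):
        tau = torch.full((batch, 1), k * dt, device=states.device)
        u = policy_drift(states, tau, x)             # drift u_theta(s, tau, X_tau)
        kinetic = kinetic + 0.5 * (u ** 2).sum(dim=-1) * dt
        x = x + u * dt + sigma * (dt ** 0.5) * torch.randn_like(x)

    q = critic(states, x).squeeze(-1)                # Q(s, a) with a = X_1
    return (alpha * kinetic - q).mean()              # minimize energy, maximize Q
```

Because the noise enters through reparameterized Gaussian samples, gradients flow from the terminal Q-value and the accumulated energy back into the drift network.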
We derive energy-regularized Bellman operators with $\gamma$-contraction guarantees, extending classical policy iteration to generative policies.
A Lagrangian dual mechanism learns $\alpha$ to maintain a target energy level $E_{\mathrm{tgt}}$, adapting exploration automatically (a sketch of this update appears below).
Differentiable ODE solvers enable end-to-end gradient flow, compatible with standard off-policy replay buffers.
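As an illustration of the Lagrangian dual mechanism, here is a minimal sketch of the $\alpha$ update, in the spirit of SAC's automatic temperature tuning. The names `log_alpha`, `kinetic` (the per-sample path energy from the actor rollout above), and `target_energy` ($E_{\mathrm{tgt}}$) are assumptions for illustration, as is the sign convention.

```python
import torch

# Parametrize alpha through log_alpha so the penalty weight stays positive.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(kinetic, target_energy):
    """One dual step: raise alpha when the measured kinetic energy exceeds
    the target budget E_tgt, lower it otherwise (illustrative sketch)."""
    energy_gap = target_energy - kinetic.detach().mean()
    alpha_loss = (log_alpha * energy_gap).sum()   # descent raises alpha when gap < 0
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()
```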
Currently, FLAC applies a uniform kinetic energy penalty across all action dimensions. However, in practice different dimensions often require different levels of exploration—for instance, a humanoid's hip joint may need precise control while the arm joints benefit from more exploratory behavior. Learning dimension-specific energy budgets, rather than a single scalar $\alpha$, is a promising direction for improving performance.
We are deeply grateful to Xiao Ma, Yunfei Li, and Yu Luo for their continuous support and guidance throughout this project. This work would not have been possible without their invaluable contributions.
If you find this work useful, please consider citing:
@article{lv2026flac,
  title   = {FLAC: Maximum Entropy RL via Kinetic Energy Regularized Bridge Matching},
  author  = {Lv, Lei and Li, Yunfei and Luo, Yu and Sun, Fuchun and Ma, Xiao},
  journal = {arXiv preprint arXiv:2602.12829},
  year    = {2026}
}