FLAC: Maximum Entropy RL via Kinetic Energy Regularized Bridge Matching

Lei Lv1,2,3*, Yunfei Li2, Yu Luo3, Fuchun Sun3†, Xiao Ma2†
1Shanghai Research Institute for Intelligent Autonomous Systems    2ByteDance Seed    3Tsinghua University
*Work done during internship at ByteDance Seed    †Corresponding authors

Abstract

Iterative generative policies, such as diffusion models and flow matching, offer superior expressivity for continuous control but complicate Maximum Entropy Reinforcement Learning because their action log-densities are not directly accessible. To address this, we propose Field Least-Energy Actor-Critic (FLAC), a likelihood-free framework that regulates policy stochasticity by penalizing the kinetic energy of the velocity field.

Our key insight is to formulate policy optimization as a Generalized Schrödinger Bridge (GSB) problem relative to a high-entropy reference process (e.g., uniform). Under this view, the maximum-entropy principle emerges naturally as staying close to a high-entropy reference while optimizing return, without requiring explicit action densities. In this framework, kinetic energy serves as a physically grounded proxy for divergence from the reference: minimizing path-space energy bounds the deviation of the induced terminal action distribution.

Building on this view, we derive an energy-regularized policy iteration scheme and a practical off-policy algorithm that automatically tunes the kinetic energy via a Lagrangian dual mechanism. Empirically, FLAC achieves superior or comparable performance on high-dimensional benchmarks relative to strong baselines, while avoiding explicit density estimation.

Method

1. What is Generalized Schrödinger Bridge?

The Schrödinger Bridge Problem (SBP) finds the most likely stochastic process connecting two distributions. Given initial distribution $\mu_0$ and terminal distribution $\mu_1$, it solves:

$$\min_{\mathbb{P}} \quad \mathcal{D}(\mathbb{P} \| \mathbb{P}^{\mathrm{ref}}) \quad \text{s.t.} \quad \mathbb{P}_{t=0} = \mu_0, \quad \mathbb{P}_{t=1} = \mu_1$$
Find the path measure closest to reference while matching both boundary distributions.

The Generalized Schrödinger Bridge (GSB) relaxes the hard terminal constraint to a soft potential. This is crucial: we no longer need samples from the target distribution $\mu_1$, only a potential function $\mathcal{G}$ that scores terminal states.

$$\min_{\mathbb{P}} \quad \mathcal{J}_{\mathrm{GSB}}(\mathbb{P}) = \underbrace{\mathcal{D}(\mathbb{P} \| \mathbb{P}^{\mathrm{ref}})}_{\text{Stay close to reference}} + \underbrace{\mathbb{E}_{X_1 \sim \mathbb{P}} \left[ \mathcal{G}(X_1) \right]}_{\text{Terminal potential}} \quad \text{s.t.} \quad \mathbb{P}_{t=0} = \mu_0$$
Only the initial distribution is constrained; the terminal distribution is guided by potential $\mathcal{G}$.
Note: The divergence $\mathcal{D}$ depends on the process type. For SDE-based (stochastic) policies it is the path-space KL divergence; for ODE-based (deterministic) flows, where the KL interpretation no longer applies, a Wasserstein-type proximity measure is used instead (see Section 3). Our framework handles both.

The optimal solution has a closed form—the terminal distribution is an exponential tilting of the reference:

$$p^*(X_1) \propto \mu_1^{\mathrm{ref}}(X_1) \cdot \exp\left(-\mathcal{G}(X_1)\right)$$
where $\mu_1^{\mathrm{ref}}$ is the terminal distribution of the reference process.
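To see where the tilting comes from (a heuristic sketch for the SDE/KL case), note that the data-processing inequality gives $\mathcal{D}_{\mathrm{KL}}(\mathbb{P} \| \mathbb{P}^{\mathrm{ref}}) \geq \mathcal{D}_{\mathrm{KL}}(p_1 \| \mu_1^{\mathrm{ref}})$ for the terminal marginal $p_1$ of $\mathbb{P}$, so a marginal-level relaxation of the GSB objective is

$$\min_{p_1} \; \mathcal{D}_{\mathrm{KL}}(p_1 \,\|\, \mu_1^{\mathrm{ref}}) + \mathbb{E}_{X_1 \sim p_1}\big[\mathcal{G}(X_1)\big],$$

a convex problem whose first-order condition $\log\frac{p_1(x)}{\mu_1^{\mathrm{ref}}(x)} + 1 + \mathcal{G}(x) + \lambda = 0$ (with $\lambda$ the normalization multiplier) yields exactly $p^*(X_1) \propto \mu_1^{\mathrm{ref}}(X_1)\, e^{-\mathcal{G}(X_1)}$. This is only a sketch: the full argument also has to show the relaxation is tight and respect the initial-marginal constraint.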

2. Why Model Maximum Entropy RL as GSB?

Let's start with what we want. Maximum Entropy RL seeks policies that maximize return while maintaining high entropy:

$$\max_\pi \quad \mathbb{E}_\pi[R] + \alpha \mathcal{H}(\pi(\cdot|s))$$
The optimal solution is the Boltzmann policy: $\pi^*(a|s) \propto \exp(Q(s,a)/\alpha)$
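For completeness, the standard per-state derivation: treat $\pi(\cdot|s)$ as a density, add a Lagrange multiplier $\lambda$ for normalization, and set the first variation to zero:

$$\frac{\partial}{\partial \pi(a|s)} \Big[ \pi(a|s)\, Q(s,a) - \alpha\, \pi(a|s)\log\pi(a|s) + \lambda\, \pi(a|s) \Big] = Q(s,a) - \alpha\big(\log\pi(a|s) + 1\big) + \lambda = 0,$$

which gives $\pi^*(a|s) = \exp\!\big(Q(s,a)/\alpha\big) \big/ \int_{\mathcal{A}} \exp\!\big(Q(s,a')/\alpha\big)\, da'$.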

This is elegant in theory. But here's the problem: for iterative generative policies like diffusion models, we can't compute $\log \pi(a|s)$—actions emerge from a multi-step stochastic process. The entropy term becomes intractable.

Now, recall the GSB optimal solution from Section 1:

$$p^*(X_1) \propto \mu_1^{\mathrm{ref}}(X_1) \cdot \exp\left(-\mathcal{G}(X_1)\right)$$

Notice something? These have the same form! If we choose the terminal potential to be the scaled negative Q-value, $\mathcal{G}(X_1) = -Q(s, X_1)/\alpha$, and take a reference process whose terminal distribution is uniform (maximum-entropy) over the action space, then GSB directly gives us the Boltzmann policy, without ever computing $\log \pi(a|s)$.
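Writing the substitution out explicitly (with the uniform, high-entropy reference assumed above, so $\mu_1^{\mathrm{ref}}$ is constant on the action space):

$$\mathcal{G}(X_1) = -\frac{Q(s, X_1)}{\alpha}, \quad \mu_1^{\mathrm{ref}} = \mathrm{Unif}(\mathcal{A}) \;\;\Longrightarrow\;\; p^*(X_1) \;\propto\; \mu_1^{\mathrm{ref}}(X_1)\,\exp\!\left(\frac{Q(s, X_1)}{\alpha}\right) \;\propto\; \exp\!\left(\frac{Q(s, X_1)}{\alpha}\right),$$

which is precisely the Boltzmann policy from the MaxEnt objective.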

Why GSB Fits RL Perfectly

Here's the key insight: in RL, we don't have samples from the optimal policy $\pi^*$—we only have a scoring function (Q-value). This matches GSB exactly: no target samples needed, just a potential $\mathcal{G}$. The path divergence term implicitly regularizes entropy through the generation process itself.

This gives us three benefits: (i) the policy stays likelihood-free, since no explicit action densities are ever evaluated; (ii) no samples from the optimal policy are needed, because the Q-function plays the role of the terminal potential $\mathcal{G}$; (iii) entropy is regularized implicitly, through the path-divergence term of the generation process itself.

3. From Path Constraint to Kinetic Energy

So far so good. But the GSB objective involves a path constraint—staying close to the reference process. This is defined on the space of entire trajectories, which sounds abstract and hard to optimize. Can we turn it into something concrete?

Consider a controlled process with drift $u_\theta$ and a reference process with zero drift (pure noise):

$$\text{Policy: } dX_\tau = u_\theta(s, \tau, X_\tau) d\tau + \sigma dW_\tau \qquad \text{Reference: } dX_\tau = \sigma dW_\tau$$

The two processes share the same noise but differ in drift. How far apart are they? This depends on whether we work with SDEs or ODEs:

SDE Case ($\sigma > 0$): Girsanov's Theorem

For stochastic processes, Girsanov's theorem gives an exact identity: the KL divergence on path space equals the expected kinetic energy of the drift:

$$\mathcal{D}_{\mathrm{KL}}(\mathbb{P}^\theta \| \mathbb{P}^{\mathrm{ref}}) = \frac{1}{2\sigma^2} \mathbb{E}_{\mathbb{P}^\theta} \left[ \int_0^1 \|u_\theta(s, \tau, X_\tau)\|^2 d\tau \right]$$

This is not an approximation—it is an equality. Minimizing kinetic energy is exactly equivalent to minimizing path-space KL divergence.

ODE Case ($\sigma \to 0$): Wasserstein Bound

For deterministic flows, the KL interpretation no longer applies. Instead, kinetic energy upper-bounds the squared $W_2$ distance between the induced terminal distribution and the reference:

$$W_2^2(p_1^\theta, \mu_1^{\mathrm{ref}}) \leq \mathbb{E}\left[\int_0^1 \|u_\theta\|^2 d\tau\right]$$

This acts as a geometric proximity constraint: penalizing kinetic energy prevents aggressive, large-scale transport that would concentrate probability mass. While this does not directly bound entropy, it empirically discourages rapid mode collapse and promotes broad action coverage.
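A short sketch of where this bound comes from, assuming the zero-drift reference above (so the reference flow is the identity map and $\mu_1^{\mathrm{ref}} = \mu_0$): the flow itself provides a coupling between $X_0 \sim \mu_1^{\mathrm{ref}}$ and $X_1 \sim p_1^\theta$, hence

$$W_2^2\big(p_1^\theta, \mu_1^{\mathrm{ref}}\big) \;\le\; \mathbb{E}\,\big\|X_1 - X_0\big\|^2 \;=\; \mathbb{E}\,\Big\|\!\int_0^1 u_\theta(s,\tau,X_\tau)\, d\tau\Big\|^2 \;\le\; \mathbb{E}\!\int_0^1 \big\|u_\theta(s,\tau,X_\tau)\big\|^2\, d\tau,$$

where the last step is Jensen's inequality over $\tau \in [0,1]$.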

In both cases, kinetic energy provides a tractable, computable measure of how far the policy strays from the high-entropy reference:

$$\mathcal{E}_{\mathrm{kinetic}} = \mathbb{E}_{\mathbb{P}^\theta} \left[ \int_0^1 \|u_\theta(s, \tau, X_\tau)\|^2 d\tau \right]$$
SDE: equals path-space KL divergence. ODE: upper-bounds $W_2$ distance to reference.
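To make "tractable and computable" concrete, here is a minimal NumPy sketch (our illustration, not the paper's implementation) that estimates $\mathcal{E}_{\mathrm{kinetic}}$ by Euler-Maruyama simulation of the policy SDE; the toy drift and the value of $\sigma$ are arbitrary stand-ins for a learned $u_\theta$:

```python
import numpy as np

def kinetic_energy_mc(u, sigma=0.5, n_paths=4096, n_steps=100, dim=2, seed=0):
    """Monte Carlo estimate of E[ int_0^1 ||u(tau, X_tau)||^2 dtau ]
    under dX = u(tau, X) dtau + sigma dW, with X_0 ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x = rng.standard_normal((n_paths, dim))           # sample the prior X_0
    energy = np.zeros(n_paths)
    for k in range(n_steps):
        v = u(k * dt, x)                              # drift at (tau, X_tau)
        energy += np.sum(v ** 2, axis=-1) * dt        # accumulate ||u||^2 dtau
        x = x + v * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
    return energy.mean()

# Toy drift pulling mass toward the point (1, 1); a stand-in for a learned u_theta.
toy_u = lambda tau, x: 2.0 * (np.array([1.0, 1.0]) - x)
print(kinetic_energy_mc(toy_u))
```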

An Intuitive View

Think of it this way: kinetic energy $\|u_\theta\|^2$ measures the "effort" to transport noise to actions, while the terminal potential $-Q(s,a)$ acts like a potential energy landscape guiding where actions should land. The policy seeks paths that minimize total action—balancing efficient transport (low kinetic energy) against reaching high-reward regions (low potential energy). This echoes the principle of least action in physics: among all possible paths, nature chooses the one that minimizes the integrated difference between kinetic and potential energy.

Figure: Kinetic Energy Regularization Encourages Exploration. (Top) Without regularization: high velocity collapses to a single mode. (Bottom) FLAC: penalizing kinetic energy preserves the multimodal distribution.

4. The FLAC Objective

Now we have all the pieces. Combining GSB formulation + RL potential + kinetic energy regularization, we arrive at the FLAC objective:

$$\min_{\theta} J_{\text{FLAC}}(\theta) = \mathbb{E}_{\mathbb{P}^\theta} \left[ \underbrace{\alpha \int_0^1 \frac{1}{2} \left\| u_\theta(s, \tau, X_\tau) \right\|^2 d\tau}_{\text{Kinetic energy (entropy proxy)}} - \underbrace{Q(s, X_1)}_{\text{Return}} \right]$$
Minimize kinetic energy + Maximize return. Fully tractable—no density evaluation needed.

That's it. We can sample trajectories, compute kinetic energy along the path, and optimize with standard gradient descent. No likelihood computation required.
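As a concrete illustration (ours, not the authors' code), the sketch below rolls the policy ODE forward with a differentiable Euler solver, accumulates $\frac{\alpha}{2}\|u_\theta\|^2\,\Delta\tau$ along the path, and subtracts the critic value at the terminal action. `STATE_DIM`, `ACTION_DIM`, and the tiny MLPs are placeholders chosen only to show the call signature.

```python
import torch
import torch.nn as nn

def flac_actor_loss(u_theta, q_fn, state, alpha=0.1, n_steps=10, action_dim=2):
    """J_FLAC = E[ alpha * int_0^1 0.5*||u_theta||^2 dtau - Q(s, X_1) ].
    u_theta(state, tau, x) -> velocity; q_fn(state, action) -> Q-value.
    Deterministic (ODE) Euler rollout; gradients flow through every step."""
    batch = state.shape[0]
    x = torch.randn(batch, action_dim, device=state.device)   # X_0 ~ N(0, I) prior
    dt = 1.0 / n_steps
    kinetic = torch.zeros(batch, device=state.device)
    for k in range(n_steps):
        tau = torch.full((batch, 1), k * dt, device=state.device)
        v = u_theta(state, tau, x)                             # u_theta(s, tau, X_tau)
        kinetic = kinetic + 0.5 * (v ** 2).sum(-1) * dt        # accumulate 0.5*||u||^2 dtau
        x = x + v * dt                                         # differentiable Euler step
    return (alpha * kinetic - q_fn(state, x).squeeze(-1)).mean()

# Tiny stand-in networks, only to show the call signature.
STATE_DIM, ACTION_DIM = 8, 2
u_net = nn.Sequential(nn.Linear(STATE_DIM + 1 + ACTION_DIM, 64), nn.Tanh(),
                      nn.Linear(64, ACTION_DIM))
q_net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.Tanh(),
                      nn.Linear(64, 1))
u_theta = lambda s, tau, x: u_net(torch.cat([s, tau, x], dim=-1))
q_fn = lambda s, a: q_net(torch.cat([s, a], dim=-1))
loss = flac_actor_loss(u_theta, q_fn, torch.randn(32, STATE_DIM))
loss.backward()   # pathwise gradient w.r.t. the velocity-field parameters
```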

Energy-Regularized Policy Iteration

We derive energy-regularized Bellman operators with $\gamma$-contraction guarantees, extending classical policy iteration to generative policies.
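The paper's exact operators are not reproduced here; as an illustration of the general shape (our guess, by analogy with SAC's soft backup, with the log-likelihood bonus replaced by the kinetic-energy penalty), one would expect something of the form

$$(\mathcal{T}^\pi Q)(s, a) \;=\; r(s, a) + \gamma\, \mathbb{E}_{s'}\, \mathbb{E}_{a' = X_1 \sim \pi(\cdot|s')} \!\left[ Q(s', a') \;-\; \alpha \int_0^1 \tfrac{1}{2}\big\|u_\theta(s', \tau, X_\tau)\big\|^2\, d\tau \right],$$

where the inner expectation is over the generation path that produces $a'$. Because the energy term does not depend on $Q$, it cancels in $\mathcal{T}^\pi Q_1 - \mathcal{T}^\pi Q_2$, so the usual $\gamma$-contraction argument in the sup-norm goes through.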

Automatic Energy Tuning

A Lagrangian dual mechanism learns $\alpha$ to maintain target energy level $E_{\mathrm{tgt}}$, adapting exploration automatically.
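A sketch of how such a dual mechanism could be implemented (mirroring SAC's temperature adjustment; the learning rate, the $\log\alpha$ parameterization, and the constraint direction $\mathbb{E}[\mathcal{E}_{\mathrm{kinetic}}] \le E_{\mathrm{tgt}}$ are our assumptions): if the measured energy exceeds the target, $\alpha$ grows and penalizes it harder; otherwise it shrinks.

```python
import torch

# Learn log(alpha) so alpha stays positive; E_tgt is a hypothetical target
# energy level, treated here as a tunable hyperparameter.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)
E_tgt = 1.0

def update_alpha(kinetic_batch):
    """Dual step for the assumed constraint E[kinetic energy] <= E_tgt:
    alpha grows when measured energy exceeds the target, shrinks otherwise."""
    alpha_loss = -(log_alpha.exp() * (kinetic_batch.detach().mean() - E_tgt))
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()
```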

Pathwise Gradient Estimation

Differentiable ODE solvers enable end-to-end gradient flow, compatible with standard off-policy replay buffers.
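To illustrate the off-policy side, here is a schematic critic update against a replay batch using the energy-regularized backup sketched above; `u_theta` and the Q-networks follow the same signatures as the actor-loss sketch, and the backup form is again our illustration under stated assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_action_and_energy(u_theta, state, n_steps=10, action_dim=2):
    """Roll the policy ODE forward (no gradients needed for the critic target)."""
    x = torch.randn(state.shape[0], action_dim, device=state.device)
    dt, kinetic = 1.0 / n_steps, torch.zeros(state.shape[0], device=state.device)
    for k in range(n_steps):
        tau = torch.full((state.shape[0], 1), k * dt, device=state.device)
        v = u_theta(state, tau, x)
        kinetic = kinetic + 0.5 * (v ** 2).sum(-1) * dt
        x = x + v * dt
    return x, kinetic

def critic_loss(q_fn, q_target_fn, u_theta, batch, alpha, gamma=0.99):
    """One-step energy-regularized TD target from a replay batch
    (s, a, r, s', done): y = r + gamma * (Q_bar(s', a') - alpha * E(s'))."""
    s, a, r, s_next, done = batch
    a_next, kinetic_next = sample_action_and_energy(u_theta, s_next)
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * (q_target_fn(s_next, a_next).squeeze(-1)
                                        - alpha * kinetic_next)
    return F.mse_loss(q_fn(s, a).squeeze(-1), y)
```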

Experimental Results

Figure: Benchmark Tasks. Visualization of the benchmark tasks used in our experiments, including DMControl and HumanoidBench environments.
Figure: DMControl Benchmark Results. Performance comparisons on DMControl tasks. FLAC achieves competitive or superior performance compared to state-of-the-art baselines. All algorithms are evaluated with 5 random seeds.
Figure: Humanoid Benchmark Results. Performance comparisons on challenging humanoid locomotion tasks. FLAC demonstrates strong performance across different humanoid control scenarios.

Limitations & Future Directions

Currently, FLAC applies a uniform kinetic energy penalty across all action dimensions. However, in practice different dimensions often require different levels of exploration—for instance, a humanoid's hip joint may need precise control while the arm joints benefit from more exploratory behavior. Learning dimension-specific energy budgets, rather than a single scalar $\alpha$, is a promising direction for improving performance.

Acknowledgments

We are deeply grateful to Xiao Ma, Yunfei Li, and Yu Luo for their continuous support and guidance throughout this project. This work would not have been possible without their invaluable contributions.

Citation

If you find this work useful, please consider citing:

@article{lv2026flac,
  title={FLAC: Maximum Entropy RL via Kinetic Energy Regularized Bridge Matching},
  author={Lv, Lei and Li, Yunfei and Luo, Yu and Sun, Fuchun and Ma, Xiao},
  journal={arXiv preprint arXiv:2602.12829},
  year={2026}
}