A New Default Open-Loop Step-Size for Frank-Wolfe?
TL;DR: In our recent paper Adaptive Open-Loop Step-Sizes for Accelerated Convergence Rates of the Frank-Wolfe Algorithm [WPP2025A] with Elias Wirth and Javier Peña, we explore a new “log-adaptive” open-loop step-size for the Frank-Wolfe algorithm, $\eta_t = \frac{2 + \log(t+1)}{t+2 + \log(t+1)}$, whose convergence rate adapts both to favorable properties of the objective function and to the geometry of the feasible region. In particular, it matches and often surpasses traditional fixed-parameter open-loop step-sizes across various settings, without requiring prior knowledge of problem parameters, both in theory and in computations.
Introduction
The Frank-Wolfe (FW) algorithm is an important method for constrained (first-order) convex optimization, especially when projections are costly. A critical and often debated component of FW is the choice of step-size $\eta_t$. Basically, there are two types of step-size strategies: so-called open-loop strategies that do not require any feedback from the function and are fixed ahead of time, not adapting at runtime, and so-called closed-loop strategies that “interact” with the objective function. The former are typically much cheaper, while the latter often possess superior convergence performance.
Since the inception of the Frank-Wolfe algorithm, the “standard” open-loop step-size has been $\eta_t = \frac{2}{t+2}.$ More generally, step-sizes of the form $\eta_t = \frac{\ell}{t+\ell}$ for some integer $\ell \ge 2$ have been analyzed. While these provide an $\mathcal{O}(1/t)$ convergence rate in general, recent work [WKP2023, WPP2024] has shown that under certain “growth conditions,” these fixed-$\ell$ step-sizes can achieve much faster rates—up to $\mathcal{O}(t^{-\ell})$.
The catch is that there is no single “best” $\ell$; as expected, there is a tradeoff: larger $\ell$ gives a better asymptotic convergence rate, but we pay for this with an initial burn-in phase of suboptimal convergence whose length depends on $\ell$. Thus, a larger $\ell$ might be great for one problem structure (like strong $(M,1)$-growth, where up to linear rates are possible), but not for others. This leads to a practical dilemma: which $\ell$ should you pick if you don’t know the specific growth properties of your problem beforehand?
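Before moving on, here is a minimal sketch of the Frank-Wolfe iteration with a generic open-loop step-size schedule (our illustration, not code from the paper or from FrankWolfe.jl); `grad` and `lmo` are placeholders for the gradient oracle and the linear minimization oracle over the feasible region:

```julia
# Minimal Frank-Wolfe sketch with an open-loop step-size schedule (illustrative only).
# `grad(x)` returns ∇f(x); `lmo(c)` returns a vertex v of the feasible region minimizing ⟨c, v⟩.
function frank_wolfe_open_loop(grad, lmo, x0; T = 1000, eta = t -> 2 / (t + 2))
    x = copy(x0)
    for t in 0:T-1
        v = lmo(grad(x))             # Frank-Wolfe vertex
        x = x .+ eta(t) .* (v .- x)  # convex-combination step; no projection needed
    end
    return x
end
```

Switching between step-size rules is then a one-line change, e.g., `eta = t -> l / (t + l)` for a fixed `l`, or the log-adaptive rule discussed below.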
Beyond Fixed-$\ell$: Towards an Adaptive Open-Loop
To address this issue, in [P2024] a new open-loop step-size strategy of the form \(\eta_t = \frac{2 + \log(t+1)}{t+2 + \log(t+1)}\) was proposed, albeit without a proof of convergence or an analysis of its expected properties; it was a remark complementing a new adaptive closed-loop strategy. In our recent paper, Adaptive Open-Loop Step-Sizes for Accelerated Convergence Rates of the Frank-Wolfe Algorithm [WPP2025A], we remedy this and, in fact, provide a convergence analysis for a much broader class of open-loop step-sizes for the Frank-Wolfe algorithm. Instead of a fixed $\ell$, we propose a more general open-loop step-size scheme:
\[\eta_t = \frac{g(t)}{t+g(t)},\]where $g(t)$ is a non-decreasing function of the iteration count $t$. The idea is to let $g(t)$ “adapt” (or rather, grow) with $t$, potentially capturing the benefits of larger $\ell$ values as the optimization progresses.
To make this work, we require two natural properties from $g(t)$:
- $g(t)$ has to be non-decreasing and $g(t) \ge 2$ (the latter is a technical condition to ensure compatibility with the standard open-loop strategy and in particular $\mathcal O(t^{-1})$ convergence in general.)
- The sequence $\eta_t = \frac{g(t)}{t+g(t)}$ itself is non-increasing (or equivalently, $\frac{t}{g(t)} \le \frac{t+1}{g(t+1)}$).
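As a quick sanity check (ours, not from the paper), both conditions are easy to verify numerically for a candidate $g$, for instance a constant $g$ and the log-adaptive choice introduced next:

```julia
# General open-loop scheme eta_t = g(t) / (t + g(t)).
eta(t, g) = g(t) / (t + g(t))

g_fixed(t) = 4.0              # fixed ℓ = 4, i.e. eta_t = 4 / (t + 4)
g_log(t)   = 2 + log(t + 1)   # log-adaptive choice discussed below

# Check the two requirements over the first 10^4 iterations (illustrative, not a proof).
for g in (g_fixed, g_log), t in 0:10_000
    @assert g(t) >= 2 && g(t + 1) >= g(t)   # g non-decreasing and at least 2
    @assert eta(t + 1, g) <= eta(t, g)      # eta_t non-increasing
end
```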
A New Open-Loop Step-Size
While the analysis holds more broadly, we are in particular interested in the log-adaptive open-loop step-size, where we set:
\[g(t) = 2 + \log(t+1),\]leading to the aforementioned step-size strategy:
\[\eta_t = \frac{2 + \log(t+1)}{t+2 + \log(t+1)}.\] This choice satisfies both of our requirements from above. Moreover, with this choice we obtain optimal rates, without line search or knowledge of problem parameters, in a wide range of growth settings:
| Growth Setting | Fixed \(\eta_t = \frac{\ell}{t+\ell}\) | Log-Adaptive \(\eta_t = \frac{2+\log(t+1)}{t+2+\log(t+1)}\) |
| --- | --- | --- |
| Strong \((M, 1)\) | \(\mathcal{O}(t^{-\ell})\) | \(\tilde{\mathcal{O}}(t^{-k})\) for any \(k \in \mathbb{N}\) |
| Strong \((M, r)\) | \(\mathcal{O}(t^{-\ell+\epsilon}+t^{-\frac{1}{1-r}})\) | \(\tilde{\mathcal{O}}(t^{-\frac{1}{1-r}})\) |
| Weak \((M, r)\) | \(\mathcal{O}(t^{-\ell+\epsilon}+t^{-\frac{1}{1-r}}+t^{-2})\) | \(\tilde{\mathcal{O}}(t^{-\frac{1}{1-r}}+t^{-2})\) |
(Rates in the table are for the suboptimality gap \(f(x) - f(x^\ast)\); \(\tilde{\mathcal{O}}(\cdot)\) hides polylogarithmic factors.)
Essentially, the log-adaptive step-size performs at least as well (up to polylogarithmic factors) as any fixed-$\ell$ choice, and in the strong growth settings, it can significantly outperform them or match the best possible rate achievable by tuning $\ell$, doing line search, or picking an adaptive closed-loop strategy. For instance, in the Strong $(M,1)$ growth setting, its asymptotic convergence rate approaches a linear rate in the limit.
Note. If the log-adaptive step-size does not encounter favorable properties of the function or of the feasible region, its performance simply falls back to that of the standard rule \(\eta_t = \frac{2}{t+2}\).
The Key Lemma
The key insight behind these results lies in extending the analytical framework previously used for fixed-$\ell$ step-sizes from [WPP2024] (see also [WKP2023]) and providing a new, stronger cumulative product bound. For those familiar with [WPP2024], we obtain the following strengthened bound:
For \(S \in \mathbb{N}_{\ge 1}\), \(\epsilon \in ]0, g(S)[\), and \(t \in \mathbb{N}_{\ge S}\): \[ \prod_{i=S}^t \left(1-\left(1-\frac{\epsilon}{g(i)}\right) \eta_i\right) \le \left(\frac{\eta_t}{\eta_{S-1}} \right)^{g(S) - \epsilon}.\]
This refinement, when $g(t)$ grows like $\log(t+1)$, allows us to derive the improved and more robust convergence rates.
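As an illustration (ours, not part of the paper), the bound is easy to spot-check numerically for the log-adaptive choice $g(t) = 2 + \log(t+1)$:

```julia
# Numerically spot-check the cumulative product bound for g(t) = 2 + log(t + 1).
g(t) = 2 + log(t + 1)
eta(t) = g(t) / (t + g(t))

S, eps, t = 5, 0.5, 200   # requires S >= 1, 0 < eps < g(S), and t >= S
lhs = prod(1 - (1 - eps / g(i)) * eta(i) for i in S:t)
rhs = (eta(t) / eta(S - 1))^(g(S) - eps)
@assert lhs <= rhs        # the inequality from the lemma
```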
Numerical Experiments
In the following, we provide an excerpt from our numerical experiments to demonstrate the performance of the log-adaptive strategy. Here we will only depict convergence in the Frank-Wolfe gap
\[\max_{v \in P} \langle \nabla f(x), x - v \rangle,\] where \(P\) denotes the feasible region. The Frank-Wolfe gap is a (standard) observable dual gap measure that can be used as a stopping criterion; see the paper for other measures such as the primal (suboptimality) gap or the primal-dual gap.
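In code, this gap comes essentially for free: it is attained at the very vertex the LMO returns in each iteration (again a sketch with a placeholder `lmo`):

```julia
using LinearAlgebra

# The Frank-Wolfe gap max_{v ∈ P} ⟨∇f(x), x - v⟩ is attained at the LMO vertex,
# so it can be tracked as a byproduct of the vertex computation.
function fw_gap(grad_x, x, lmo)
    v = lmo(grad_x)              # v minimizes ⟨grad_x, v⟩ over P
    return dot(grad_x, x - v)    # = max_{v ∈ P} ⟨grad_x, x - v⟩
end
```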
Note. While we primarily considered two types of experiments in the paper, the log-adaptive step-size is now also included in the FrankWolfe.jl Julia package and has been tested on a wide variety of problems.
Constrained Regression
We tested a constrained regression problem on the Boston housing dataset, confining the regression coefficients to $\ell_p$-balls with $p=2$ and $p=5$. We considered the cases where the unconstrained optimum lies inside the feasible $\ell_p$-ball (here we do not expect accelerated convergence) and where it lies outside (here we expect accelerated convergence). The objective is of the form:
\[\min_{x \in \mathbb{R}^n,\, \|x\|_p \leq \beta} \frac{1}{2}\|A x - y\|_2^2.\] The results are depicted below.
Figure 1: Frank-Wolfe gap for constrained regression, unconstrained optimum inside the feasible region (left: $\ell_2$-ball, right: $\ell_5$-ball).
Figure 2: Frank-Wolfe gap for constrained regression, unconstrained optimum outside the feasible region (left: $\ell_2$-ball, right: $\ell_5$-ball).
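For reference, the linear minimization oracle over the $\ell_p$-ball used in these runs has a well-known closed form (a sketch of the standard formula, not code from the paper); for $p=2$ it reduces to $-\beta \, \nabla f(x)/\|\nabla f(x)\|_2$:

```julia
using LinearAlgebra

# LMO over the ℓ_p ball of radius β for 1 < p < ∞: the minimizer of ⟨c, v⟩ subject to
# ‖v‖_p ≤ β, given in closed form via the conjugate exponent q (1/p + 1/q = 1).
function lp_ball_lmo(c::AbstractVector, β::Real, p::Real)
    q = p / (p - 1)
    nq = norm(c, q)
    nq == 0 && return zero(c)   # zero gradient: any feasible point is optimal
    return -β .* sign.(c) .* (abs.(c) ./ nq) .^ (q - 1)
end
```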
Collaborative Filtering
Our second test is collaborative filtering, with the objective
\[\min_{X \in \mathbb{R}^{m\times n},\, \|X\|_{\text{nuc}}\leq \beta} \frac{1}{|\mathcal I|} \sum_{(i,j)\in\mathcal I} H(A_{i,j} - X_{i,j}),\] using a Huber loss $H$ over the MovieLens 100k dataset, with nuclear norm ball radii $\beta=1000$ and $\beta=3000$. We expect higher convergence rates when the radius is reasonably small.
Figure 3: Frank-Wolfe gap for collaborative filtering (left: radius $\beta=1000$, right: radius $\beta=3000$).
As can be seen from the plots, across these tests, the log-adaptive step-size ($g(t)=2+\log(t+1)$) consistently performed on par with or better than the fixed step-sizes ($g(t)=2$ and $g(t)=4$). In scenarios aligning with (strong) growth (like the regression problem where the unconstrained optimum lies outside the feasible region, or collaborative filtering with a smaller radius), it often showed notably faster convergence for the primal-dual gap, the primal (suboptimality) gap, and the Frank-Wolfe gap, in line with the predictions from theory.
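Likewise, the LMO over the nuclear-norm ball reduces to computing a leading singular vector pair; a minimal sketch (ours, using a full SVD for simplicity; at the scale of these experiments one would use an iterative method for just the top pair):

```julia
using LinearAlgebra

# LMO over the nuclear-norm ball of radius β: the minimizer of ⟨C, X⟩ subject to
# ‖X‖_nuc ≤ β is -β * u₁ * v₁ᵀ, where (u₁, v₁) is the leading singular vector pair of C.
function nuclear_ball_lmo(C::AbstractMatrix, β::Real)
    F = svd(C)
    u1, v1 = F.U[:, 1], F.V[:, 1]
    return -β .* (u1 * v1')
end
```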
Wrap-up
The log-adaptive open-loop step-size $\eta_t = \frac{2 + \log(t+1)}{t+2 + \log(t+1)}$ offers a compelling alternative to traditional fixed-$\ell$ step-sizes for the Frank-Wolfe algorithm. It is simple (in fact, trivial) to implement, requires no prior problem knowledge for parameter tuning, and demonstrates robust, often superior, performance both theoretically and empirically. To foster adoption and community feedback, we have also incorporated these adaptive open-loop step-sizes into the FrankWolfe.jl package.
References
[WKP2023] Wirth, E., Kerdreux, T., & Pokutta, S. (2023). Acceleration of Frank-Wolfe algorithms with open loop step-sizes. Proceedings of AISTATS. arXiv:2205.12838, Poster
[P2024] Pokutta, S. (2024). The Frank-Wolfe algorithm: a short introduction. Jahresbericht der Deutschen Mathematiker-Vereinigung, 126, 3–35. arXiv:2311.05313, Published version
[WPP2024] Wirth, E., Peña, J., & Pokutta, S. (2024). Accelerated Affine-Invariant Convergence Rates of the Frank-Wolfe Algorithm with Open-Loop Step-Sizes. To appear in Mathematical Programming A. arXiv:2310.04096, Published version
[WPP2025A] Wirth, E., Peña, J., & Pokutta, S. (2025). Adaptive Open-Loop Step-Sizes for Accelerated Convergence Rates of the Frank-Wolfe Algorithm. Preprint. arXiv:2505.09886
[WPP2025B] Wirth, E., Peña, J., & Pokutta, S. (2025). Fast Convergence of Frank-Wolfe Algorithms on Polytopes. To appear in Mathematics of Operations Research. arXiv:2406.18789