TL;DR: This is a short summary of our paper Lower Bounds for Frank-Wolfe on Strongly Convex Sets by Jannis Halbey, Daniel Deza, Max Zimmer, Christophe Roux, Bartolomeo Stellato, and Sebastian Pokutta. We prove a matching $\Omega(1/\sqrt{\varepsilon})$ lower bound for Frank-Wolfe on strongly convex sets, showing that Garber and Hazan’s 2015 upper bound is tight. The construction of the lower bound deviates from the standard route quite a bit: instead of searching for worst-case initializations, we build them backward from the optimum.

A question that has been open for a decade

Frank-Wolfe has been central to our group’s work for over 10 years, from algorithmic variants to large-scale applications to the recent MOS-SIAM monograph [BCCHHKMP25]. So when a fundamental (and seemingly simple) question about its convergence rate stays open, it is highly unsatisfactory. I remember quite well when I first discussed this question with Gábor Braun in my office at Georgia Tech: Consider Frank-Wolfe with line search or short steps. On general convex sets, Frank-Wolfe converges at a rate of $\mathcal{O}(1/\varepsilon)$, and this is known to be tight. But when the constraint set is strongly convex (think of a ball or an ellipsoid rather than a polytope), Garber and Hazan [GH15] showed that Frank-Wolfe achieves a faster rate of convergence of $\mathcal{O}(1/\sqrt{\varepsilon})$. The obvious question was: is this the best one can do, or is there still room for improvement? In particular, this has to be contrasted with the two cases where the constrained optimum lies in the strict (relative) interior or where the unconstrained optimum lies outside of the feasible set; in both cases Frank-Wolfe achieves linear rates of convergence for strongly convex objectives over strongly convex feasible regions.

For ten years, nobody had a matching lower bound. Actually, for quite some time the folklore belief was that the rate might be better than $\mathcal{O}(1/\sqrt{\varepsilon})$. This was supported by numerics, where from random starting points Frank-Wolfe often seemed to converge (much) faster than $1/\sqrt{\varepsilon}$. At the same time, general-purpose worst-case search methods like performance estimation problems (PEPs) [LG24] could not reach long enough horizons to see the true rate: at the horizons they could access, the scaling often looked closer to $\mathcal{O}(1/\varepsilon^{1/1.44})$, i.e., slower than $1/\sqrt{\varepsilon}$.

It turns out, however, that there is no room for improvement: the upper bound is tight. Frank-Wolfe with exact line search or short steps can require $\Omega(1/\sqrt{\varepsilon})$ iterations on strongly convex sets, and this holds even from dimension $n = 2$ onwards. While the result itself is interesting, the underlying construction might be even more so.

NB. Shortly after our work, a second paper by Grimmer and Liu [GL26], using a technique reminiscent of Jaggi [J13], proved a very nice result complementing ours: they showed that any LMO-based method requires $\Omega(1/\sqrt{\varepsilon})$ LMO-oracle calls in the worst case for strongly convex objectives over strongly convex sets. In contrast to our result, theirs only applies to the high-dimensional regime (their threshold is essentially $n/2$) and their strongly convex set is non-smooth. So broader coverage of methods, but a narrower range of regimes.

The setup: a quadratic over the ball

The model problem is as simple as it gets: minimize a quadratic over the Euclidean unit ball; and not just any quadratic, but the most basic one imaginable:

\[\min_{x \in B_1(0)} \norm{x - p}^2,\]

where the target point $p$ lies on the boundary of the ball, i.e., $\norm{p} = 1$. Both the objective and the constraint set are strongly convex and smooth. The problem has the closed-form solution $x^* = p$, and yet, as we will see, the Frank-Wolfe algorithm struggles on this instance depending on where you start, i.e., on the choice of $x_0 \in B_1(0)$.

The key structural observation is that the iterates stay in a two-dimensional invariant subspace spanned by $p$ and the initial point $x_0$. So the dynamics can be fully described by two quantities: the residual $r_t = \norm{x_t - p}$ and an angle parameter $\theta_t$, basically allowing us to switch to polar coordinates. Moreover, this reduction to two dimensions makes the problem analytically tractable while still capturing the essential difficulty.
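Since the iterates stay in this two-dimensional subspace, the whole algorithm fits in a few lines of plain Python. The following is a minimal sketch of Frank-Wolfe with exact line search on the model problem (the function name and the small numerical guard are ours, not from the paper):

```python
import math

def frank_wolfe_ball(x0, p, T):
    """Frank-Wolfe with exact line search for min ||x - p||^2 over the unit ball."""
    x, residuals = list(x0), []
    for _ in range(T):
        u = (p[0] - x[0], p[1] - x[1])   # u = p - x, i.e., -(1/2) * grad f(x)
        r = math.hypot(u[0], u[1])       # residual r_t = ||x_t - p||
        residuals.append(r)
        if r < 1e-14:                    # numerical guard: already at the optimum
            break
        v = (u[0] / r, u[1] / r)         # LMO over the ball: boundary point -grad/||grad||
        d = (v[0] - x[0], v[1] - x[1])   # Frank-Wolfe direction
        # exact line search for the quadratic, clipped to the feasible range [0, 1]
        gamma = min(1.0, max(0.0, (u[0] * d[0] + u[1] * d[1]) / (d[0] ** 2 + d[1] ** 2)))
        x = [x[0] + gamma * d[0], x[1] + gamma * d[1]]
    residuals.append(math.hypot(p[0] - x[0], p[1] - x[1]))
    return x, residuals
```

Running, e.g., `frank_wolfe_ball((0.5, 0.0), (0.0, 1.0), 300)` shows the behavior from the widget below: the residuals decrease monotonically (a consequence of exact line search), but progress slows down drastically as the iterate approaches the boundary.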

To give you a feel for the setup and the dynamics, check out the widget below. It shows what the Frank-Wolfe iterates actually look like on the ball. Click inside the ball to set a starting point and see the actual trajectory the algorithm would take. At each step, the LMO picks a point on the boundary (the “vertex”), and the iterate moves along the line toward it. You can see how the iterates alternate around the optimizer and how the approach slows down dramatically near the boundary. You can also check out the worst-case trajectory, although at this point it is not yet clear why it is special.

[Interactive widget: Frank-Wolfe iterates on the unit ball (40 steps). Click inside the ball to set $x_0$; shown are the iterates $x_t$, the LMO vertices $v_t$, the FW directions, and the optimizer $p = (0, 1)$.]

Figure 1. Frank-Wolfe iterates on the unit ball with $p = (0,1)$ on the boundary. Click to place a starting point and watch the trajectory. The iterates alternate sides of the optimizer and progress slows down dramatically as they approach the boundary.

As mentioned earlier, what makes the problem hard for Frank-Wolfe is the boundary. When the optimizer sits in the interior, the algorithm converges linearly. When the unconstrained(!) optimizer sits outside the feasible set, the gradient is bounded away from zero over the feasible set and we again get linear convergence. But when the optimizer is on the boundary, i.e., when $\norm{p} = 1$, the gradient vanishes at the optimum and the curvature of the constraint set and that of the objective interact in a non-trivial way. This can also be seen in the widget below: the position of the optimizer relative to the ball completely determines the convergence behavior.

[Interactive widget: Frank-Wolfe convergence for different optimizer positions. Adjust $\norm{p}$ across interior/boundary/exterior (shown: $\norm{p} = 1.00$, $T = 500$, 4 trajectories) and run.]

Figure 2. Frank-Wolfe convergence on the unit ball for different optimizer positions. Interior ($\norm{p} < 1$) and exterior ($\norm{p} > 1$) give linear convergence. On the boundary ($\norm{p} = 1$), convergence slows to $\mathcal{O}(1/\sqrt{\varepsilon})$ and the trajectories develop a characteristic plateau-and-jump pattern.

Plateaus and jumps

Another peculiar aspect of the log-log convergence plots in Figure 2 is that for $\norm{p} = 1$ the trajectories exhibit a strange plateau-and-jump dynamic: long, essentially flat plateaus where Frank-Wolfe barely makes progress, interrupted by sudden jumps where the error decreases dramatically in a single step. To the expert, this might look like restart trajectories, but the situation here is different. Even more surprising: how fast Frank-Wolfe converges has almost nothing to do with how close the starting point is to the optimum, but rather with where exactly you start. The heatmap below shows the number of iterations to reach a target accuracy from every point in the unit ball (with $p = (0,1)$ at the top of the boundary). The heatmap exhibits some strange “highways” (the darker areas), and once an iterate hits such a highway, convergence is almost immediate; in particular, starting points that are further away but on such a highway lead to much faster convergence to the optimum than points close to the optimum but in the lighter areas.

[Interactive widget: convergence landscape, iterations to reach $\varepsilon = 10^{-2}$. Hover over the ball to see the iteration count from each starting point; red dot: optimizer $p = (0,1)$; gray: did not converge within the iteration budget.]

Figure 3. Convergence landscape of Frank-Wolfe on the unit ball with optimizer $p = (0,1)$ (red dot). Colors show the number of iterations to reach the target accuracy. The pattern is non-monotone: starting points close to the optimum do not necessarily converge faster. Note that the “highways” are hard to see at higher accuracies due to the reduced rendering resolution, but they are equally visible for smaller $\varepsilon$ once the resolution is bumped up, which is beyond what the widget can handle.

This phenomenon is not noise but highly structured: it is a consequence of the (deterministic) dynamics that drive the optimization process. During a plateau, the Frank-Wolfe direction is nearly orthogonal to the gradient, so each step makes tiny progress. Then, at some point, the iterate reaches a configuration where the direction aligns favorably and the error drops sharply. Understanding when these jumps happen and whether they can be delayed is the key to the lower bound construction.

To make this precise, we introduce the contraction factor $s_t = r_{t+1}/r_t$. When $s_t$ is close to 1, the algorithm barely moves. When $s_t$ is small, we get a jump. The dynamics of $(r_t, s_t)$ satisfy a closed-form recurrence (whose derivation was somewhat painful):

\[s_{t+1}^2 = \frac{1 - (1+r_t)^2 s_t^2}{2 - 2s_t - (2+r_t) r_t s_t^2}.\]

The phase space

Once equipped with the recurrence, we can analyze it. In fact, what can be seen both numerically and analytically is that the recurrence partitions the $(r, s)$ plane into two regimes, with a critical curve $s = g(r)$ separating the two:

  • The stable regime (below the curve): the contraction factor increases at each step, i.e., convergence is slowing down. The algorithm eventually plateaus.
  • The unstable regime (above the curve): the contraction factor decreases sharply and a jump is about to happen.

To get slow convergence for our lower bound, our goal is to keep the trajectory in the stable regime for as long as possible, basically converging to the critical curve from below without crossing it. This can also be nicely seen in the widget below: click anywhere in the phase space (below the black line) to set an initial $(r, s)$ and see how the trajectory evolves. Points in the stable regime (red colors) lead to long plateaus. Points in the unstable regime (blue colors) jump quickly to smaller errors, back into the stable regime.

[Interactive widget: phase space explorer (60 steps). Click to set $(r, s)$ and trace the trajectory; legend: stable regime ($s$ increases), unstable regime ($s$ decreases), boundary $s = 1/(1+r)$, monotonicity curve $g(r)$.]

Figure 4. The phase space $(r, s)$ of the Frank-Wolfe dynamics. The red region is the stable regime where the contraction worsens at each step (plateaus). The blue region is the unstable regime where jumps happen. The worst-case trajectory (red) converges to the threshold $g(r)$ from below but stays in the stable regime without crossing into the blue region, inducing slow convergence.

The key insight: roll the dynamic backwards

So how do you find the initialization that produces the longest possible plateau? The obvious approach, searching forward from many starting points, does not work well. In fact, the dynamics are extremely sensitive to initial conditions. Two starting points that are nearly identical can produce wildly different trajectories; one might plateau for 50 iterations while the other jumps after 10. Moreover, numerical inaccuracies compound very quickly. Just for reference, the longest trajectories we constructed numerically required several thousand bits of precision; thanks to Julia’s multiprecision capabilities this was easy to do.

So how do you get good initial points $x_0$ that induce slow convergence? The key insight is to reverse the question and run the dynamics backwards. Start at a point very close to the optimum that is compatible with the dynamics from above, apply some second-order corrections for errors, and then roll the dynamics backwards. More precisely, our backward construction is as follows: we initialize at a point $(r_T, s_T)$ very close to the optimum with $r_T = \varepsilon \ll 1$ and $s_T \approx 1 - \frac{4}{3}\varepsilon$, which we know lies on the stable trajectory (after some second-order corrections). Then we invert the dynamics:

\[\begin{aligned} s_{t-1} &= (1+r_t)s_t^2 - r_t + \sqrt{(1-s_t^2)(1-(1+r_t)^2 s_t^2)}, \\ r_{t-1} &= r_t / s_{t-1}. \end{aligned}\]

Each backward step recovers the previous iterate. There is a subtlety here: the root induces two branches; however, one can show that always taking the negative branch suffices. Now we unroll the dynamics until $r_t$ is large enough, basically until some point $t_0$ where $r_{t_0}$ has just left the feasible region; then we take one step forward, back into the feasible region, to obtain $r_{t_0-1}$ and the associated $s_{t_0-1}$, which can then be converted back into a valid starting point $x_0$.
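In code, the backward step and the forward recurrence are a few lines each (a sketch; the function names are ours). A useful sanity check, which follows algebraically from the two formulas, is that a backward step followed by a forward step returns to the same $(r_t, s_t)$:

```python
import math

def forward_step(r, s):
    """One forward step of the (r, s) recurrence; r_{t+1} = s_t * r_t."""
    s2 = (1 - (1 + r) ** 2 * s ** 2) / (2 - 2 * s - (2 + r) * r * s ** 2)
    return r * s, math.sqrt(s2)

def backward_step(r, s):
    """Invert one step: recover (r_{t-1}, s_{t-1}) from (r_t, s_t)."""
    s_prev = (1 + r) * s ** 2 - r + math.sqrt((1 - s ** 2) * (1 - (1 + r) ** 2 * s ** 2))
    return r / s_prev, s_prev
```

Note that unrolling backward steps makes $r$ grow monotonically, since each recovered contraction factor satisfies $s_{t-1} < 1$.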

This backwards-forward construction works quite well: with 1000-bit arithmetic precision (small errors destroy the stable phase), we can construct worst-case trajectories over thousands of iterations, far beyond what general-purpose PEP methods can reach. Check out the following widget to explore the backward-forward process. The backward pass starts at $r = 0.002$ (near the optimum) and rolls back until $r > 1$. Then the forward pass replays Frank-Wolfe from that start. Note that the browser uses standard 64-bit floating point, while the paper used 1000-bit arithmetic precision to keep trajectories stable over thousands of steps. In the browser, the forward and backward traces will start to diverge after ~50-100 steps, which illustrates how sensitive the stable manifold is to numerical precision.
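The backward-then-forward loop can be sketched with Python’s standard-library `decimal` module as a stand-in for the multiprecision arithmetic used for the paper (the precision, the starting pair, and the stopping threshold below are illustrative, not the paper’s exact parameters):

```python
from decimal import Decimal, getcontext

getcontext().prec = 500  # digits of precision; small errors destroy the stable phase

ONE, TWO = Decimal(1), Decimal(2)

def backward_step(r, s):
    """Recover (r_{t-1}, s_{t-1}) from (r_t, s_t) by inverting the dynamics."""
    root = ((ONE - s * s) * (ONE - (ONE + r) ** 2 * s * s)).sqrt()
    s_prev = (ONE + r) * s * s - r + root
    return r / s_prev, s_prev

def forward_step(r, s):
    """One forward step of the closed-form (r, s) recurrence."""
    s2 = (ONE - (ONE + r) ** 2 * s * s) / (TWO - TWO * s - (TWO + r) * r * s * s)
    return r * s, s2.sqrt()

# Backward pass: start near the optimum on (approximately) the stable manifold
# and roll the dynamics back until the residual is macroscopic.
r0 = Decimal("0.002")
s0 = ONE - Decimal(4) / Decimal(3) * r0
path = [(r0, s0)]
while path[-1][0] < Decimal("0.1") and len(path) < 2000:
    path.append(backward_step(*path[-1]))

# Forward pass: replay the dynamics from the reconstructed start; at this
# precision the replay lands back on the point the backward pass started from.
r, s = path[-1]
for _ in range(len(path) - 1):
    r, s = forward_step(r, s)
```

Dropping the precision to float64 makes the forward replay drift off the stable manifold after a few dozen steps, which is exactly the sensitivity discussed above.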

[Interactive widget: backward-then-forward construction (2000 steps). Drag the slider to step through: the backward pass builds the trajectory right-to-left, the forward pass replays and verifies it left-to-right.]

Figure 5. The backward-then-forward construction. The backward pass (red) starts near the optimum and reconstructs the worst-case initialization by inverting the dynamics. The forward pass (blue) verifies the trajectory by running Frank-Wolfe from the reconstructed start forward; the dashed line is the $1/t^2$ target. The red and blue lines overlap perfectly until they break apart due to loss of precision.

The theorem

This is all fine, but so far everything we did is numerical; we are, however, after an actual analytical proof. To obtain such a proof, we follow exactly the same logic as the numerical backward construction: going backwards. The key technical step is showing that if the contraction factor satisfies

\[s_t = 1 - \frac{4}{3} r_t + c_t r_t^2\]

with $c_t \in [1, \frac{5}{2}]$, then after a backward step $c_{t-1}$ stays in $[1, \frac{5}{2}]$. This means the entire trajectory remains on the stable manifold. By induction from the endpoint, the trajectory satisfies $s_t \geq 1 - \frac{4}{3} r_t$ for all $t$, which gives the bound $r_t \geq r_0 / (1 + \frac{8}{3} r_0 t)$ and therefore:
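For completeness, here is the elementary chain from the manifold estimate to the rate (a sketch; it uses $\frac{1}{1-x} \leq 1 + 2x$ for $0 \leq x \leq \frac{1}{2}$, i.e., it assumes $r_t \leq \frac{3}{8}$):

\[\begin{aligned} r_{t+1} = s_t r_t \geq r_t \left(1 - \tfrac{4}{3} r_t\right) \quad\Longrightarrow\quad \frac{1}{r_{t+1}} \leq \frac{1}{r_t} \cdot \frac{1}{1 - \frac{4}{3} r_t} \leq \frac{1}{r_t} + \frac{8}{3} \quad\Longrightarrow\quad \frac{1}{r_t} \leq \frac{1}{r_0} + \frac{8}{3}\, t, \end{aligned}\]

which rearranges to $r_t \geq r_0 / (1 + \frac{8}{3} r_0 t)$; since the primal gap is $r_t^2$, this is the claimed $\Omega(1/t^2)$.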

Theorem. There exists a smooth and strongly convex set $\mathcal{X}$ and a smooth and strongly convex function $f$ such that for all $T \in \mathbb{N}$ there exists a starting point $x_0 \in \mathcal{X}$ with \(f(x_t) - f(x^*) \geq \Omega\left(\frac{1}{t^2}\right)\) for all $t = 1, \ldots, T$. Equivalently, $T = \Omega(1/\sqrt{\varepsilon})$.

Two features of this lower bound are worth highlighting:

It is dimension-free. Because the iterates stay in a two-dimensional subspace, the lower bound holds in any dimension $d \geq 2$. This is in contrast to the classical lower bounds by Jaggi (2013) [J13] (see also Lan (2013) [L13]) on the simplex, which require $t < d$, and to the complementary result by Grimmer and Liu [GL26] for strongly convex sets.

Smoothness of the set does not help. The unit ball is as smooth as a constraint set can be. The lower bound shows that even with this additional regularity, the $\mathcal{O}(1/\sqrt{\varepsilon})$ rate cannot be improved; the strongly convex set in Grimmer and Liu [GL26] is non-smooth.

What this tells us about Frank-Wolfe

This result settles the convergence landscape for Frank-Wolfe on strongly convex sets. Here is where things stand now:

| Setting | Upper bound | Lower bound | Gap? |
| --- | --- | --- | --- |
| $f$ SC, $\mathcal{X}$ SC, optimizer in interior | $\mathcal{O}(\log \frac{1}{\varepsilon})$ | $\Omega(\log \frac{1}{\varepsilon})$ | Tight |
| $f$ SC, $\mathcal{X}$ SC, optimizer on boundary | $\mathcal{O}(\frac{1}{\sqrt{\varepsilon}})$ | $\Omega(\frac{1}{\sqrt{\varepsilon}})$ | Tight (this paper) |
| $f$ convex, $\mathcal{X}$ convex | $\mathcal{O}(\frac{1}{\varepsilon})$ | $\Omega(\frac{1}{\varepsilon})$ | Tight |

In a nutshell, the position of the optimizer is a genuine structural barrier, not an artifact of the analysis. When the optimizer lies on the boundary, the rate drops from exponential to polynomial.

Several questions remain open. For example:

  • Can FW variants break through? Away-step, pairwise, and fully-corrective variants are designed for polytopes. Can analogous variants for strongly convex sets beat $\mathcal{O}(1/\sqrt{\varepsilon})$? We know that at least in the high-dimensional regime this is not possible due to [GL26]. At the same time, acceleration can often be obtained once the number of iterations exceeds the dimension [DCP20], [CDLP21], so it is not ruled out that beyond $n/2$ or $n$ iterations, a faster algorithm can be found.
  • Smoothed complexity. Our numerical experiments show that small perturbations to the initialization kick the trajectory off the stable manifold, leading to much faster convergence. Can we obtain better rates in a smoothed complexity sense or can we establish an $\Omega(1/\sqrt{\varepsilon})$ smoothed-complexity lower bound?
  • Other algorithms. The backward construction methodology — start from the end, invert the dynamics — seems applicable beyond Frank-Wolfe. It should work whenever the algorithm’s dynamics are locally invertible and exhibit sensitivity to initial conditions.

Sneak Peek: beyond strongly convex sets

Our analysis can also be extended to uniformly convex sets and sharp functions. This result was obtained, building on our original argument, with our Agentic Researcher (see Section 4.4 in [ZPRP26]), an agentic AI co-creation framework. If you are interested, check out the recent blog post on the topic.

References

[BCCHHKMP25] Braun, G., Carderera, A., Combettes, C.W., Hassani, H., Karbasi, A., Mokhtari, A., & Pokutta, S. (2025). Conditional Gradient Methods: From Core Principles to AI Applications. MOS-SIAM Series on Optimization. doi

[CC68] Canon, M.D. & Cullum, C.D. (1968). A Tight Upper Bound on the Rate of Convergence of Frank-Wolfe Algorithm. SIAM Journal on Control, 6(4), 509–516.

[CDLP21] Carderera, A., Diakonikolas, J., Lin, C.Y., & Pokutta, S. (2021). Parameter-Free Locally Accelerated Conditional Gradients. ICML 2021. arxiv

[DCP20] Diakonikolas, J., Carderera, A., & Pokutta, S. (2020). Locally Accelerated Conditional Gradients. AISTATS 2020. arxiv

[GH15] Garber, D. & Hazan, E. (2015). Faster Rates for the Frank-Wolfe Method over Strongly-Convex Sets. ICML 2015. paper

[GL26] Grimmer, B. & Liu, N. (2026). Lower Bounds for Linear Minimization Oracle Methods Optimizing over Strongly Convex Sets. arXiv preprint. arxiv

[HDZRSP26] Halbey, J., Deza, D., Zimmer, M., Roux, C., Stellato, B., & Pokutta, S. (2026). Lower Bounds for Frank-Wolfe on Strongly Convex Sets. arXiv preprint. arxiv

[J13] Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. ICML 2013. paper

[L13] Lan, G. (2013). The Complexity of Large-Scale Convex Programming under a Linear Optimization Oracle. arXiv preprint. arxiv

[LG24] Luner, A. & Grimmer, B. (2024). Performance Estimation for Smooth and Strongly Convex Sets. arXiv preprint. arxiv

[NY83] Nemirovski, A.S. & Yudin, D.B. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley.

[ZPRP26] Zimmer, M., Pelleriti, N., Roux, C., & Pokutta, S. (2026). The Agentic Researcher: A Practical Guide to AI-Assisted Research in Mathematics and Machine Learning. arXiv preprint. arxiv