TL;DR: This is an informal summary of our recent paper Restarting Frank-Wolfe with Alexandre D’Aspremont and Thomas Kerdreux, where we show how to achieve improved convergence rates under sharpness through restarting Frank-Wolfe algorithms.

Note: This summary is shorter than usual as I wrote a whole post about sharpness (aka Hölder Error Bounds) and conditional gradient methods that is strongly correlated with this paper some time back; for the sake non-duplication the interested reader is referred to Cheat Sheet: Hölder Error Bounds (HEB) for Conditional Gradients, which also explains the more technical aspects of our work in a significantly broader context.

What is the paper about and why you might care

We often want to solve constrained smooth convex optimization problems of the form

where $P$ is some compact convex set and $f$ is a smooth function. If the considered function $f$ is strongly convex, then we can expect a linear rate of convergence of $O(\log 1/\varepsilon)$, i.e., it takes about $k \sim \log 1/\varepsilon$ iterations until $f(x_k) - f(x^\esx) \leq \varepsilon$ by using Away-Step Frank-Wolfe or Pairwise Conditional Gradients (see Cheat Sheet: Linear convergence for Conditional Gradients for more). However, in absence of strongly convexity we often have to fall back to the smooth and (non-strongly) convex case with a much lower rate of $O(1/\varepsilon)$ (without acceleration). In many cases this rate is considerably worse than what is empirically observed. To remedy this by providing a more fine-grained convergence analysis, in a recent paper [RA] analyzed convergence under sharpness (also known as the Hölder Error Bound (HEB) condition) which characterizes the behavior of $f$ around the optimal solutions:

Definition (Hölder Error Bound (HEB) condition). A convex function $f$ is satisfies the Hölder Error Bound (HEB) condition on $P$ with parameters $0 < c < \infty$ and $\theta \in [0,1]$ if for all $x \in P$ it holds: $c (f(x) - f^\esx)^\theta \geq \min_{y \in \Omega^\esx} \norm{x-y}.$

It was shown that using the notion of sharpness, one can derive much better rates, covering the whole range between the sublinear rate of $O(1/\varepsilon)$ and the linear rate $O(\log 1/\varepsilon)$. Moreover, these rates can be realized with adaptive restart schemes, requiring no knowledge about the sharpness parameters.

To establish the link to strong convexity, note that strong convexity which is a global property implies sharpness (with appropriate parameterization) which has to hold only locally around the optimal solutions; the converse is not true. In fact, using sharpness one can show linear convergence for certain function classes of functions that are not strongly convex.

Our results

An open question was whether adaptive restarts can be also utilized to achieve a similar adaptive behavior for Conditional Gradient type methods that access the feasible region $P$ only through a linear programming oracle and this is precisely what we study in our recent work [KDP]. There we show that one can modify the Away-Step Frank Wolfe algorithm (and similarly Pairwise Conditional Gradients) by endowing them with scheduled restarts to automatically adapt to the function’s sharpness. For functions with optimal solutions contained in the strict relative interior of $P$ it even suffices to modify the (vanilla) Frank-Wolfe algorithm. By doing so we obtain, depending on the function’s sharpness parameters, convergence rates of the form $O(1/\varepsilon^p)$ or $O(\log 1/\varepsilon)$. In particular, similar to [RA], we can achieve linear convergence for functions that are sufficiently sharp but not strongly convex.

For illustration, the next graph shows the behavior of the Frank-Wolfe Algorithm under sharpness on the probability simplex of dimension $30$ and function $\norm{x}_2^{1/\theta}$. For $\theta = 1/2$, we observe linear convergence as expected, while for the other values of $\theta$ we observe various degrees of sublinear convergence of the form $O(1/\varepsilon^p)$ with $p \geq 1$.

References

[RA] Roulet, V., & d’Aspremont, A. (2017). Sharpness, restart and acceleration. In Advances in Neural Information Processing Systems (pp. 1119-1129). pdf

[KDP] Kerdreux, T., d’Aspremont, A., & Pokutta, S. (2018). Restarting Frank-Wolfe. to appear in Proceedings of AISTATS. pdf