TL;DR: This is an informal summary of our recent paper Projection-Free Optimization on Uniformly Convex Sets by Thomas Kerdreux, Alexandre d’Aspremont, and Sebastian Pokutta. We present convergence analyses of the Frank-Wolfe algorithm in settings where the constraint sets are uniformly convex. Our results generalize different analyses of [P], [DR], [D], and [GH] when the constraint sets are strongly convex. For instance, the $\ell_p$ balls are uniformly convex for all $p > 1$, but strongly convex for $p\in]1,2]$ only. We show in these settings that uniform convexity of the feasible region systematically induces accelerated convergence rates of the Frank-Wolfe algorithm (with short steps or exact line-search). This shows that the Frank-Wolfe algorithm is not just adaptive to the sharpness of the objective [KDP] but also to the feasible region.

Written by Thomas Kerdreux.

We consider the following constrained optimization problem

\[\underset{x\in\mathcal{C}}{\text{argmin}} f(x),\]

where $f$ is an $L$-smooth convex function and $\mathcal{C}$ is a compact convex set in a Hilbert space. Frank-Wolfe algorithms form a classical family of first-order iterative methods for solving such problems. In the worst case, each iteration requires solving a linear minimization problem over the original feasible domain, a subset of it, or a reasonable modification of it, i.e. a change that does not implicitly amount to a proximal operation.

The understanding of the convergence rates of Frank-Wolfe algorithms in a variety of settings is an active field of research. For smooth constrained problems, the vanilla Frank-Wolfe algorithm (FW) enjoys a tight sublinear convergence rate of $\mathcal{O}(1/T)$ (see e.g., [J] for an in-depth discussion). There are known accelerated convergence regimes, as a function of the feasible region, only when $\mathcal{C}$ is a polytope or a strongly convex set.

Frank-Wolfe Algorithm
Input: A starting point $x_0 \in \mathcal{C}$ and an upper bound $L$ on the Lipschitz constant of $\nabla f$.
Output: Sequence of iterates $x_t$
For $t=0, 1, \ldots, T $ do:
$\qquad v_t \leftarrow \underset{v\in\mathcal{C}}{\text{argmax }} \langle -\nabla f(x_t); v - x_t\rangle$ $\qquad \triangleright $ FW vertex
$\qquad \gamma_t \leftarrow \underset{\gamma\in[0,1]}{\text{argmin }} \gamma \langle \nabla f(x_t); v_t - x_t \rangle + \frac{\gamma^2}{2} L \norm{v_t - x_t}^2$ $\qquad \triangleright$ Short step
$\qquad x_{t+1} \leftarrow (1 - \gamma_t)x_{t} + \gamma_t v_{t}$ $\qquad \triangleright$ Convex update
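To make the template concrete, here is a minimal Python sketch of the loop above. It assumes access to a linear minimization oracle `lmo` returning $\text{argmin}_{v\in\mathcal{C}}\langle c; v\rangle$; the names are illustrative and not taken from any particular library.

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, L, T=1000, tol=1e-12):
    """Vanilla Frank-Wolfe with the short-step rule.

    grad: callable returning the gradient of f at x.
    lmo:  callable returning argmin_{v in C} <c, v> for a direction c.
    x0:   feasible starting point.
    L:    upper bound on the Lipschitz constant of grad f.
    """
    x = x0.copy()
    iterates = [x.copy()]
    for _ in range(T):
        g = grad(x)
        v = lmo(g)                                  # FW vertex (maximizes <-grad f(x); v>)
        d = v - x
        gap = -g.dot(d)                             # Frank-Wolfe gap g_t >= 0
        if gap <= tol:
            break
        gamma = min(1.0, gap / (L * d.dot(d)))      # short step
        x = x + gamma * d                           # convex update
        iterates.append(x.copy())
    return x, iterates
```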

When $\mathcal{C}$ is a strongly convex set and \(\text{inf}_{x\in\mathcal{C}}\norm{\nabla f(x)}_\esx > 0\), FW enjoys linear convergence rates [P, DR]. Recently, [GH] showed that the Frank-Wolfe algorithm converges in \(\mathcal{O}(1/T^2)\) when the objective function is strongly convex, without restriction on the position of the optimum $x^\esx$. Importantly, the conditioning of this sub-linear rate does not depend on the distance of the constrained optimum $x^\esx$ from the boundary or from the unconstrained optimum, as is the case in [P, DR], or in [GM] when the optimum $x^\esx$ lies in the interior of $\mathcal{C}$. Note that all these analyses require the short step as step-size rule (or even stronger conditions such as exact line-search).

Finally, when $\mathcal{C}$ is a polytope and $f$ is strongly convex, corrective variants of Frank-Wolfe were recently shown to enjoy linear convergence rates, see [LJ]. No accelerated convergence rates are known for constraint sets that are neither polytopes nor strongly convex sets. We show here that uniformly convex sets, which non-trivially subsume strongly convex sets, systematically lead to accelerated convergence rates in the respective settings of [P, DR], [GH], and [D].

Uniformly Convex Sets

A closed set $\mathcal{C}\subset\mathbb{R}^d$ is $(\alpha, q)$-uniformly convex with respect to a norm $\norm{\cdot}$, if for any $x,y\in\mathcal{C}$, any $\eta\in[0,1]$, and any $z\in\mathbb{R}^d$ with $\norm{z} = 1$, we have

\[\tag{1} \eta x + (1-\eta)y + \eta (1 - \eta ) \alpha \norm{x-y}^q z \in \mathcal{C}.\]

At a high level, this property is a global quantification of the set's curvature that subsumes strong convexity. There exist other equivalent definitions, see e.g. [GI] in the strongly convex case. In finite-dimensional spaces, the $\ell_p$ balls form classical and important examples of uniformly convex sets. For $0 < p < 1$, the $\ell_p$ balls are non-convex. For $p\in ]1,2]$, the $\ell_p$ balls are strongly convex (i.e. $(\alpha,2)$-uniformly convex), while for $p>2$ they are $(\alpha,p)$-uniformly convex but not strongly convex. The $p$-Schatten norms with $p>1$ or various group norms are also typical examples.
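As a concrete instance of (1), a standard computation worth keeping in mind: the Euclidean ball $B(0,r)$ of radius $r$ is $(\frac{1}{2r}, 2)$-uniformly convex, i.e. strongly convex. Indeed, for $x, y \in B(0,r)$ and $\eta\in[0,1]$, the parallelogram identity gives

\[\norm{\eta x + (1-\eta)y}^2 = \eta\norm{x}^2 + (1-\eta)\norm{y}^2 - \eta(1-\eta)\norm{x-y}^2 \leq r^2 - \eta(1-\eta)\norm{x-y}^2,\]

so that $\norm{\eta x + (1-\eta)y} \leq r - \frac{\eta(1-\eta)}{2r}\norm{x-y}^2$, and adding any vector of norm at most $\frac{\eta(1-\eta)}{2r}\norm{x-y}^2$ keeps the point in $B(0,r)$.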

Figure 1. Example of $\ell_q$ balls.

Besides the quantification in (1), uniform convexity is a very classical notion in the study of normed spaces. Indeed, it allows one to refine the convex characterization of these spaces' unit balls, leading to a plethora of interesting properties. For instance, the various uniform convexity types of Banach spaces have consequences, notably in learning theory [DDS], online learning [ST], or concentration inequalities [JN]. Here, we show that the Frank-Wolfe algorithm accelerates as a function of the uniform convexity of $\mathcal{C}$.

Convergence analysis for Frank-Wolfe with Uniformly Convex Sets

Proof Sketch

At iteration $t$, we have $x_{t+1} = x_t + \gamma_t (v_t - x_t)$ with $\gamma_t\in[0,1]$ chosen to minimize the quadratic upper bound on $f$ implied by $L$-smoothness. Classically, we then obtain, for any $\gamma\in[0,1]$,

\[\tag{2} f(x_{t+1}) - f(x^\esx) \leq f(x_t) - f(x^\esx) - \gamma \langle - \nabla f(x_t); v_t - x_t\rangle + \frac{\gamma^2}{2} L \norm{x_t - v_t}^2.\]

The Frank-Wolfe gap $g_t = \langle - \nabla f(x_t); v_t - x_t\rangle\geq 0$ contributes to the primal decrease, counter-balanced by the quadratic term on the right. Informally then, uniform convexity of the set ensures that the distance of the iterate to the Frank-Wolfe vertex $\norm{v_t - x_t}$ shrinks to zero at a specific rate that depends on $g_t$. The schema in Figure 2 below illustrates that this is generally not the case when $\mathcal{C}$ is a polytope, and how various types of curvature influence the shrinking of $\norm{v_t - x_t}$ to zero.
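For concreteness, minimizing the right-hand side of (2) over $\gamma\in[0,1]$ gives the short step and a standard per-iteration progress bound (a quick computation from (2)):

\[\gamma_t = \min\Big(1, \frac{g_t}{L\norm{v_t - x_t}^2}\Big), \qquad f(x_t) - f(x_{t+1}) \geq \min\Big(\frac{g_t}{2}, \frac{g_t^2}{2L\norm{v_t - x_t}^2}\Big).\]

The second term in the minimum is where $\norm{v_t - x_t}$ enters: the smaller it is relative to $g_t$, the larger the guaranteed progress.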

Figure 2. The uniform convexity assumption.

Scaling Inequalities

The uniform convexity parameters quantify different trade-offs between the convergence to zero of $g_t$ and of $\norm{v_t - x_t}$. In particular, $(\alpha, q)$-uniform convexity of $\mathcal{C}$ implies scaling inequalities of the form \(\langle -\nabla f(x_t); v_t - x_t\rangle \geq \alpha/2 \norm{\nabla f(x_t)}_\esx \norm{v_t - x_t}^q,\) where $\norm{\cdot}_\esx$ stands for the dual norm of $\norm{\cdot}$. Plugging this scaling inequality into (2) is then the basis for the various convergence results.
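Here is a short derivation of this scaling inequality, taking $\eta = 1/2$ in (1): for any unit vector $z$, the point $\frac{x_t + v_t}{2} + \frac{\alpha}{4}\norm{x_t - v_t}^q z$ belongs to $\mathcal{C}$, and since $v_t$ maximizes $\langle -\nabla f(x_t); \cdot\rangle$ over $\mathcal{C}$,

\[\langle -\nabla f(x_t); v_t\rangle \geq \Big\langle -\nabla f(x_t); \frac{x_t + v_t}{2}\Big\rangle + \frac{\alpha}{4}\norm{x_t - v_t}^q \langle -\nabla f(x_t); z\rangle.\]

Taking the supremum over unit vectors $z$ turns the last inner product into $\norm{\nabla f(x_t)}_\esx$, and rearranging yields $\langle -\nabla f(x_t); v_t - x_t\rangle \geq \frac{\alpha}{2}\norm{\nabla f(x_t)}_\esx\norm{v_t - x_t}^q$.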

Convergence results with global uniform convexity

Assuming that $\mathcal{C}$ is $(\alpha,q)$-uniformly convex and $f$ is a strongly convex and $L$-smooth function, we obtain convergence rates of $\mathcal{O}(1/T^{1/(1-1/q)})$ that interpolate between the general sub-linear rate of $\mathcal{O}(1/T)$ and the $\mathcal{O}(1/T^2)$ of [GH]. Note that we further generalize these results by relaxing the strong convexity of $f$ to $(\mu, \theta)$-Hölderian error bounds, in which case the rates become $\mathcal{O}(1/T^{1/(1-2\theta/q)})$. For more details, see Theorem 2.10 of our paper or [KDP] for general error bounds in the context of the Frank-Wolfe algorithm.

Similarly, assuming \(\text{inf}_{x\in\mathcal{C}}\norm{\nabla f(x)}_\esx > 0\), when $\mathcal{C}$ is $(\alpha,q)$-uniformly convex with $q>2$, we obtain convergence rates of $\mathcal{O}(1/T^{1/(1-2/q)})$ that interpolate between the general sub-linear rate of $\mathcal{O}(1/T)$ and the linear convergence rates of [P, DR].
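To give a flavor of where these exponents come from, here is a sketch that ignores constants and the regime where the short step is clipped at $1$. Writing $h_t = f(x_t) - f(x^\esx)$ and combining the per-iteration progress from (2) with the scaling inequality, we get

\[h_t - h_{t+1} \gtrsim \frac{g_t^2}{L\norm{v_t - x_t}^2} \gtrsim g_t^{2 - 2/q}\,\norm{\nabla f(x_t)}_\esx^{2/q} \geq h_t^{2-2/q}\,\norm{\nabla f(x_t)}_\esx^{2/q},\]

using $g_t \geq h_t$ by convexity. When $\norm{\nabla f(x_t)}_\esx$ is bounded away from zero, this yields the recursion $h_{t+1} \leq h_t - c\, h_t^{2-2/q}$ and hence $h_t = \mathcal{O}(1/t^{1/(1-2/q)})$; when $f$ is strongly convex, $\norm{\nabla f(x_t)}_\esx \gtrsim \sqrt{h_t}$ and the recursion becomes $h_{t+1} \leq h_t - c\, h_t^{2-1/q}$, giving $h_t = \mathcal{O}(1/t^{1/(1-1/q)})$.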

These two convergence regimes depend on the global uniform convexity parameters of the set $\mathcal{C}$. However, some sets, for example the $\ell_3$ ball, seem to exhibit varying degrees of curvature depending on the position of $x^\esx$ on the boundary $\partial\mathcal{C}$.

A simple numerical experiment

We now numerically observe the convergence of Frank-Wolfe when $f$ is a quadratic and the feasible regions are $\ell_p$ balls with varying $p$. We provide two different plots in which we vary the position of the optimal solution $x^\esx$. In both cases we use short steps.

In the right figure, $x^\esx$ is chosen near the intersection of the $\ell_p$ balls and the half-line generated by $\sum_{i=1}^{d}e_i$, where $(e_i)$ is the canonical basis. Informally, this corresponds to the curvy areas of the $\ell_p$ balls. In the left figure, $x^\esx$ is chosen near the intersection of the $\ell_p$ balls and the half-line generated by one of the $e_i$, which corresponds to a flat area for large values of $p$.
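A minimal sketch of such an experiment, reusing the `frank_wolfe` loop above: the linear minimization oracle over an $\ell_p$ ball has a closed form obtained from the dual norm, while the objective, dimension, and placement of $x^\esx$ below are illustrative choices rather than the exact setup behind the figures.

```python
import numpy as np

def lp_ball_lmo(c, p, radius=1.0):
    """argmin_{||v||_p <= radius} <c, v>, for p > 1 (closed form via the dual exponent q)."""
    q = p / (p - 1.0)
    weights = np.abs(c) ** (q - 1.0)
    return -radius * np.sign(c) * weights / np.linalg.norm(c, ord=q) ** (q - 1.0)

d, p = 50, 5.0
# "Curvy" area: x_star along sum_i e_i; swap in np.eye(d)[0] for the "flat" area.
direction = np.ones(d)
x_star = 1.5 * direction / np.linalg.norm(direction, ord=p)   # outside the unit l_p ball

grad = lambda x: x - x_star                                   # f(x) = 0.5 * ||x - x_star||^2
lmo = lambda c: lp_ball_lmo(c, p)
x0 = lmo(grad(np.zeros(d)))                                   # a feasible starting vertex

x_T, iterates = frank_wolfe(grad, lmo, x0, L=1.0, T=5000)
f_vals = [0.5 * np.linalg.norm(x - x_star) ** 2 for x in iterates]
# Plotting f_vals - min(f_vals) on a log scale gives primal-gap curves of the kind shown below.
```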

We observe that when the optimum is near a curvy area, the convergence rates are asymptotically linear for $\ell_p$ balls that are not strongly convex.

Figure 3. Left: optimum $x^\esx$ in a flat area. Right: optimum $x^\esx$ in a curvy area.

This suggests that the local behavior of $\mathcal{C}$ around $x^\esx$ might explain the convergence rates even better. Providing such an analysis would be in line with [D], which proves linear convergence rates assuming only local strong convexity of $\mathcal{C}$ around $x^\esx$. We extend this result to a localized notion of uniform convexity.

Frank-Wolfe Analysis with Localized Uniform Convexity

When $\mathcal{C}$ is not globally uniformly convex, the scaling inequality no longer necessarily holds. We instead assume the following localized version around $x^\esx$:

\[\tag{3} \langle -\nabla f(x^\esx); x^\esx - x\rangle \geq \alpha/2 \norm{\nabla f(x^\esx)}_\esx \norm{x^\esx - x}^q.\]

A localized definition of uniform convexity in the form of (1) applied at $x^\esx \in \partial \mathcal{C}$ indeed implies the local scaling inequality (3). However, the local scaling inequality (3) holds in more general situations, e.g. for the strong convexity analog of the local moduli of rotundity in [GI]. This condition was already identified by Dunn [D], yet without a convergence analysis when $q>2$. In another blog post, we will delve into more details on the generality of (3) and related assumptions in optimization.

Assuming \(\text{inf}_{x\in\mathcal{C}} \norm{\nabla f(x)}_\esx > 0\) and a local scaling inequality at $x^\esx$ with parameters $(\alpha, q)$, we obtain convergence rates of $\mathcal{O}(1/T^{1/(1-2/(q(q-1)))})$ with $q>2$ that interpolate between the general sub-linear rate of $\mathcal{O}(1/T)$ and the linear convergence rates of [D].

Note that the sublinear rate obtained via the local scaling inequality is strictly worse than the one obtained via the global scaling inequality with the same $(\alpha, q)$ parameters. The sublinear rate $\mathcal{O}(1/T^{1/(1-2/(q(q-1)))})$ nonetheless remains better than $\mathcal{O}(1/T)$ whenever $q>2$.

Conclusion

Our results show that in all regimes (for smooth constrained problems) where the strong convexity of the constraint set is known to accelerate Frank-Wolfe algorithms (see [P, DR], [D] or [GH]), the uniform convexity of the set leads to accelerated convergence rates with respect to $\mathcal{O}(1/T)$ as well.

We also show that the local uniform convexity of $\mathcal{C}$ around $x^\esx$ already induces accelerated convergence rates. In particular, this acceleration is achieved with the vanilla Frank-Wolfe algorithm, which does not require any knowledge of these underlying structural assumptions and their respective parameters. As such, our results further shed light on the adaptive properties of Frank-Wolfe-type algorithms. For instance, see also [KDP] for adaptivity with respect to error bound conditions on the objective function, or [J, LJ] for affine-invariant analyses of Frank-Wolfe algorithms when the constraint sets are polytopes.

References

[D] Dunn, Joseph C. Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization 17.2 (1979): 187-211. pdf

[DR] Demyanov, V. F.; Rubinov, A. M. Approximate methods in optimization problems. Modern Analytic and Computational Methods in Science and Mathematics, 1970.

[DDS] Donahue, M. J.; Darken, C.; Gurvits, L.; Sontag, E. (1997). Rates of convex approximation in non-Hilbert spaces. Constructive Approximation, 13(2), 187-220.

[GH] Garber, D.; Hazan, E. Faster rates for the Frank-Wolfe method over strongly-convex sets. In 32nd International Conference on Machine Learning, ICML 2015. pdf

[GI] Goncharov, V. V.; Ivanov, G. E. Strong and weak convexity of closed sets in a Hilbert space. In Operations Research, Engineering, and Cyber Security, pages 259–297. Springer, 2017.

[GM] Guélat, J.; Marcotte, P. (1986). Some comments on Wolfe’s ‘away step’. Mathematical Programming, 35(1), 110-119.

[HL] Huang, R.; Lattimore, T.; György, A.; Szepesvári, C. Following the leader and fast rates in linear prediction: Curved constraint sets and other regularities. In Advances in Neural Information Processing Systems, pages 4970–4978, 2016. pdf

[J] Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, pp. 427-435. pdf

[JN] Juditsky, A.; Nemirovski, A.S. Large deviations of vector-valued martingales in 2-smooth normed spaces. pdf

[KDP] Kerdreux, T.; d'Aspremont, A.; Pokutta, S. Restarting Frank-Wolfe. In The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 1275-1283. pdf

[LJ] Lacoste-Julien, S.; Jaggi, M. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, 2015, pp. 496-504. pdf

[P] Polyak, B. T. Existence theorems and convergence of minimizing sequences for extremal problems with constraints. In Doklady Akademii Nauk, volume 166, pages 287–290. Russian Academy of Sciences, 1966.

[ST] Sridharan, K.; Tewari, A. (2010, June). Convex Games in Banach Spaces. In COLT (pp. 1-13). pdf