# Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning

*TL;DR: Cheat Sheet for non-smooth convex optimization: subgradient descent, mirror descent, and online learning. Long and technical.*

*Posts in this series (so far).*

- Cheat Sheet: Smooth Convex Optimization
- Cheat Sheet: Frank-Wolfe and Conditional Gradients
- Cheat Sheet: Linear convergence for Conditional Gradients
- Cheat Sheet: Hölder Error Bounds (HEB) for Conditional Gradients
- Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning

*My apologies for incomplete references; this should merely serve as an overview.*

This time we will consider non-smooth convex optimization. Our starting point is a very basic argument that is used to prove convergence of *Subgradient Descent (SG)*. From there we will consider the projected variants in the constrained setting and naturally arrive at *Mirror Descent (MD)* of [NY]; we follow the proximal point of view as presented in [BT]. We will also see that online learning algorithms such as *Online Gradient Descent (OGD)* of [Z] or *Online Mirror Descent (OMD)* and the special case of the *Multiplicative Weights Update (MWU)* algorithm arise as natural consequences.

This time we will consider a convex function $f: \RR^n \rightarrow \RR$ and we want to solve

\[ \min_{x \in K} f(x), \]

where $K$ is some convex feasible region; e.g., $K = \RR^n$ is the unconstrained case. However, compared to previous posts, we will now consider the *non-smooth* case. As before we assume that we only have *first-order access* to the function, via a so-called *first-order oracle*, which in the non-smooth case returns subgradients:

**First-Order oracle for $f$**

*Input:* $x \in \mathbb R^n$

*Output:* $\partial f(x)$ and $f(x)$

In the above $\partial f(x)$ denotes a subgradient of the (convex!) function $f$ at point $x$. Recall that a *subgradient at $x \in \operatorname{dom}(f)$* is any vector $\partial f(x)$ such that $f(z) \geq f(x) + \langle \partial f(x), z-x \rangle$ holds for all $z \in \operatorname{dom}(f)$. This is basically the same inequality that we obtain from convexity for smooth functions, except that in the general non-smooth case there might be more than one vector satisfying this condition. In contrast, for convex and smooth (i.e., differentiable) functions there exists only one subgradient at $x$, namely the gradient, i.e., $\partial f(x) = \nabla f(x)$ in this case. In the following we will use the notation $[n] \doteq \setb{1,\dots, n}$.

## A basic argument

We will first consider gradient descent-like algorithms of the form

\[ \tag{dirStep} x_{t+1} \leftarrow x_t - \eta_t d_t, \]

where we choose $d_t \doteq \partial f(x_t)$ and we show how we can establish convergence of the above scheme to an (approximately) optimal solution $x_T$ to $\min_{x \in K} f(x)$ in the case $K = \RR^n$; we will choose the step length $\eta_t$ later. For completeness, the full algorithm looks like this:

**Subgradient Descent Algorithm.**

*Input:* Convex function $f$ with first-order oracle access and some initial point $x_0 \in \RR^n$

*Output:* Sequence of points $x_0, \dots, x_T$

For $t = 0, \dots, T-1$ do:

$\quad x_{t+1} \leftarrow x_t - \eta_t \partial f(x_t)$
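As a concrete illustration, here is a minimal Python sketch of the scheme (my own toy example, not from the post), run on the non-smooth function $f(x) = \norm{x}_1$ with subgradient $\operatorname{sign}(x)$ and a fixed step length of the form used in the analysis:

```python
import numpy as np

def subgradient_descent(subgrad, x0, eta, T):
    """Run x_{t+1} = x_t - eta * subgrad(x_t) and return all iterates."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(T):
        xs.append(xs[-1] - eta * subgrad(xs[-1]))
    return xs

# Example: f(x) = ||x||_1 with subgradient sign(x); the minimizer is x* = 0.
n, T = 5, 2500
x0 = np.ones(n)
G = np.sqrt(n)                                 # bound on ||sign(x)||_2
eta = np.linalg.norm(x0) / (G * np.sqrt(T))    # fixed step, horizon T known
xs = subgradient_descent(np.sign, x0, eta, T)
x_bar = np.mean(xs[:-1], axis=0)               # the average iterate carries the guarantee
gap = np.abs(x_bar).sum()                      # primal gap f(x_bar) - f(x*)
print(gap <= G * np.linalg.norm(x0) / np.sqrt(T))  # True: within the O(1/sqrt(T)) bound
```

Note that the guarantee is for the average iterate $\bar x$, not the last one, in line with the discussion below.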

In this section we will assume that $\norm{\cdot}$ is the $\ell_2$-norm; note however that later we will allow for other norms. Let $x^\esx$ be an optimal solution to $\min_{x \in K} f(x)$ and consider the following expansion using (dirStep) and $d_t \doteq \partial f(x_t)$:

\[ \tag{expand} \norm{x_{t+1} - x^\esx}^2 = \norm{x_t - x^\esx}^2 - 2 \langle \eta_t \partial f(x_t), x_t - x^\esx \rangle + \norm{\eta_t \partial f(x_t)}^2. \]

This can be rearranged to

\[ \tag{basic} \langle \eta_t \partial f(x_t), x_t - x^\esx \rangle = \frac{1}{2} \norm{x_t - x^\esx}^2 - \frac{1}{2} \norm{x_{t+1} - x^\esx}^2 + \frac{1}{2} \norm{\eta_t \partial f(x_t)}^2, \]
as we aim to later estimate $f(x_t) - f(x^\esx) \leq \langle \partial f(x_t), x_t - x^\esx\rangle$, which holds as $\partial f(x_t)$ is a subgradient. However, in view of setting out to provide a unified perspective on various settings, including online learning, we will do this substitution only at the very end. Adding up those equations until iteration $T-1$ we obtain:

\[ \sum_{t=0}^{T-1} \langle \eta_t \partial f(x_t), x_t - x^\esx \rangle = \frac{1}{2} \norm{x_0 - x^\esx}^2 - \frac{1}{2} \norm{x_T - x^\esx}^2 + \sum_{t=0}^{T-1} \frac{1}{2} \norm{\eta_t \partial f(x_t)}^2. \]

Let us further assume that $\norm{\partial f(x_t)} \leq G$ for all $t = 0, \dots, T-1$ for some $G \in \RR$, and to simplify the exposition let us choose a constant step length $\eta_t \doteq \eta > 0$ for all $t$, for some $\eta$ to be chosen later. We obtain:

\[ \sum_{t=0}^{T-1} \langle \partial f(x_t), x_t - x^\esx \rangle \leq \frac{\norm{x_0 - x^\esx}^2}{2\eta} + \frac{\eta T G^2}{2}, \]
where the right-hand side is minimized for

\[ \eta \doteq \frac{\norm{x_0 - x^\esx}}{G} \sqrt{\frac{1}{T}}, \]

leading to

\[ \tag{regretBound} \sum_{t=0}^{T-1} \langle \partial f(x_t), x_t - x^\esx \rangle \leq G \norm{x_0 - x^\esx} \sqrt{T}. \]
We will later see that (regretBound) can be used as a starting point to develop online learning algorithms; for now, however, we will derive our convergence guarantee from it. To this end we divide both sides by $T$, use convexity, and the subgradient property to conclude:

\[ \tag{convergenceSG} f(\bar x) - f(x^\esx) \leq \frac{1}{T} \sum_{t=0}^{T-1} \langle \partial f(x_t), x_t - x^\esx \rangle \leq \frac{G \norm{x_0 - x^\esx}}{\sqrt{T}}, \]

where $\bar x \doteq \frac{1}{T} \sum_{t=0}^{T-1} x_t$ is the average of all iterates. As such we obtain an $O(1/\sqrt{T})$ convergence rate for our algorithm. It is useful to observe that what the algorithm does is minimize the average of the dual gaps $\langle \partial f(x_t), x_t - x^\esx\rangle$ at the points $x_t$, and since the average of the dual gaps upper bounds the primal gap of the average point, convergence follows.

This basic analysis is the standard analysis for *subgradient descent* and will serve as a starting point for what follows.

Before we continue the following remarks are in order:

- An important observation is that in the argument above we never used that $x^\esx$ is an optimal solution; in fact the arguments hold for *any* point $u$. In particular, for some choices of $u$ the left-hand side of (regretBound) *can be negative* (in which case (convergenceSG) becomes vacuous). We will see the implications of this very soon below in the online learning section. Ultimately, subgradient descent (and also mirror descent, as we will see later) is a *dual method* in the sense that it directly minimizes the duality gap or, equivalently, maximizes the dual. That is where the strong guarantees with respect to *all points* $u$ come from.
- Another important insight is that the argument from above does not provide a *descent algorithm*, i.e., it is *not guaranteed* that we make progress in terms of primal function value decrease in each iteration. However, what we show is that picking $\eta$ as above ensures that the average point $\bar x$ converges to an optimal solution: we make progress on average.
- In the current form as stated above, the choice of $\eta$ requires prior knowledge of the total number of iterations $T$, and the guarantee *only* applies to the average point obtained from averaging over all $T$ iterations. However, this can be remedied in various ways. The poor man's approach is to simply run the algorithm with a small $T$ and, whenever $T$ is reached, to double $T$ and restart the algorithm. This is usually referred to as the *doubling trick*; it at most doubles the number of performed iterations, but now we do not need prior knowledge of $T$ and we obtain guarantees at iterations of the form $t = 2^\ell$ for $\ell = 1,2, \dots$. The smarter way is to use a variable step size, as we will show later. This requires however that $\norm{x_t - x^\esx} \leq D$ holds for all iterates for some constant $D$, which might be hard to ensure in general but which can be safely assumed in the compact constrained case by choosing $D$ to be the diameter; the guarantees will depend on that parameter.
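The doubling trick from the last remark can be sketched in a few lines of Python (a hypothetical toy example of mine, assuming a fixed-horizon routine `run_for` that tunes its step length to the given horizon):

```python
import numpy as np

def doubling_trick(run_for, T_total):
    """Restart a fixed-horizon method with horizons 1, 2, 4, ... until a
    total of T_total iterates have been produced."""
    xs, T = [], 1
    while len(xs) < T_total:
        xs.extend(run_for(T))   # fresh run, step length tuned to horizon T
        T *= 2                  # double the horizon and restart
    return xs[:T_total]

def run_for(T):
    # Toy fixed-horizon subgradient descent on f(x) = |x| with x_0 = 1 and
    # G = 1, so eta = |x_0 - x*| / (G * sqrt(T)) = 1 / sqrt(T).
    x, eta, out = 1.0, 1.0 / np.sqrt(T), []
    for _ in range(T):
        out.append(x)
        x -= eta * np.sign(x)
    return out

iterates = doubling_trick(run_for, 100)
print(len(iterates))  # 100
```

The total work is at most doubled since the horizons $1 + 2 + \dots + 2^\ell < 2^{\ell+1}$ form a geometric series.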

### An “optimal” update

Similar to the descent approach using smoothness, as done in several previous posts, e.g., Cheat Sheet: Smooth Convex Optimization, we might try to pick $\eta_t$ in each step to maximize progress. Our starting point is

\[ \norm{x_{t+1} - x^\esx}^2 = \norm{x_t - x^\esx}^2 - 2 \eta_t \langle \partial f(x_t), x_t - x^\esx \rangle + \eta_t^2 \norm{\partial f(x_t)}^2 \]

from above, i.e., (expand), and we want to choose $\eta_t$ to maximize progress in terms of $\norm{x_{t+1} - x^\esx}$ vs. $\norm{x_t - x^\esx}$, i.e., the decrease in distance to the optimal solution. Observe that the right-hand side is convex in $\eta_t$ and optimizing over $\eta_t$ leads to

\[ \eta_t^\esx \doteq \frac{\langle \partial f(x_t), x_t - x^\esx \rangle}{\norm{\partial f(x_t)}^2}. \]
This choice of $\eta_t^\esx$ looks very similar to the choice that we have seen before for, e.g., gradient descent in the smooth case (see, e.g., Cheat Sheet: Smooth Convex Optimization), with some important differences however: we cannot compute the above step length as we do not know $x^\esx$; we ignore this for now.

Plugging the step length back into (expand), we obtain:

\[ \norm{x_{t+1} - x^\esx}^2 = \norm{x_t - x^\esx}^2 - \frac{\langle \partial f(x_t), x_t - x^\esx \rangle^2}{\norm{\partial f(x_t)}^2}. \]
This shows that the squared distance to the optimal solution decreases by $\frac{\langle \partial f(x_t), x_t - x^\esx\rangle^2}{\norm{\partial f(x_t)}^2}$, i.e., the better aligned the subgradient is with the idealized direction $x_t - x^\esx$, which points towards an optimal solution $x^\esx$, the faster the progress. In particular, if the alignment is perfect, then *one step* suffices. Note, however, that this is *only* hypothetical as the computation of the optimal step length requires knowledge of an optimal solution; it simply demonstrates that a “non-deterministic” version would only require one step. This is in contrast to, e.g., gradient step progress in function value exploiting smoothness (see Cheat Sheet: Smooth Convex Optimization). In that case, only using first-order information and smoothness we *naturally* obtain, e.g., a $O(1/t)$-rate for the smooth case, even for the non-deterministic idealized algorithm, where we guess as direction $x_t - x^\esx$ pointing towards the optimum. This is a subtle but important difference.

Finally, to add slightly more to the confusion (for now), compare the rearranged (expand), which captures progress *in the distance*,

\[ \norm{x_t - x^\esx}^2 - \norm{x_{t+1} - x^\esx}^2 = 2 \eta_t \langle \partial f(x_t), x_t - x^\esx \rangle - \eta_t^2 \norm{\partial f(x_t)}^2, \]

to the smoothness-induced progress *in function value* (or primal gap) for the idealized direction $d \doteq x_t - x^\esx$ (see, e.g., Cheat Sheet: Smooth Convex Optimization):

\[ f(x_t) - f(x_{t+1}) \geq \eta_t \langle \nabla f(x_t), x_t - x^\esx \rangle - \frac{\eta_t^2 L}{2} \norm{x_t - x^\esx}^2. \]
These two progress-inducing (in-)equalities are very similar. In particular, in the smooth case, for $\eta_t$ tiny, the progress is identical up to the linear factor $2$ and lower order terms; this is for a good reason as I will discuss sometime in the future when we look at the continuous time versions.

### Online Learning

In the following we will discuss the connection of the above to online learning. In *online learning* we typically consider the following setup; I simplified the setup slightly for exposition and the exact requirements will become clear from the actual algorithm that we will use.

We consider two players: the *adversary* and the *player*. We then play a game over $T$ rounds of the following form:

**Game.** For $t = 0, \dots, T-1$ do:

(1) Player chooses an action $x_t$

(2) Adversary picks a (convex) function $f_t$, reveals $\partial f_t(x_t)$ and $f_t(x_t)$

(3) Player updates/learns via $\partial f_t(x_t)$ and incurs cost $f_t(x_t)$

The goal of the game is to minimize the so-called *regret*, which is defined as:

\[ \tag{regret} \sum_{t=0}^{T-1} f_t(x_t) - \min_{u} \sum_{t=0}^{T-1} f_t(u), \]

which measures how well our *dynamic strategy* $x_0, \dots, x_{T-1}$ compares to the *single best decision in hindsight*, i.e., a *static strategy* given perfect information.

Although surprising at first, it turns out that one can show that there exists an algorithm generating a strategy $x_0, \dots, x_{T-1}$ so that (regret) grows sublinearly, in fact typically of the order $O(\sqrt{T})$, i.e., something of the following form holds:

\[ \sum_{t=0}^{T-1} f_t(x_t) - \min_{u} \sum_{t=0}^{T-1} f_t(u) \leq O(\sqrt{T}). \]

What does this mean? If we divide both sides by $T$, we obtain the so-called *average regret* and the bound becomes:

\[ \frac{1}{T} \sum_{t=0}^{T-1} f_t(x_t) - \min_{u} \frac{1}{T} \sum_{t=0}^{T-1} f_t(u) \leq O\left(\frac{1}{\sqrt{T}}\right), \]

showing that the average mistake that we make per round, in the long run, tends to $0$ at a rate of $O\left(\frac{1}{\sqrt{T}}\right)$.

Now it is time to wonder what this has to do with what we have seen so far. It turns out that already our basic analysis from above provides a bound for the most basic unconstrained case for a given time horizon $T$. To this end recall the inequality (regretBound) that we established above:

\[ \sum_{t=0}^{T-1} \langle \partial f(x_t), x_t - x^\esx \rangle \leq G \norm{x_0 - x^\esx} \sqrt{T}. \]

A careful look at the argument that we used to establish the inequality (regretBound) reveals that it actually does not depend on $f$ being the same in each iteration and also that we can replace $x^\esx$ by any other feasible solution $u$ (as discussed before), so that we also proved:

\[ \sum_{t=0}^{T-1} \langle \partial f_t(x_t), x_t - u \rangle \leq G \norm{x_0 - u} \sqrt{T}, \]

with $G$ now being a bound on the subgradients across the rounds, i.e., $\norm{\partial f_t(x_t)} \leq G$. Now using the fact that $\partial f_t(x_t)$ is a subgradient, we obtain:

\[ \sum_{t=0}^{T-1} \left( f_t(x_t) - f_t(u) \right) \leq G \norm{x_0 - u} \sqrt{T}, \]

and in particular this holds for the minimum:

\[ \sum_{t=0}^{T-1} f_t(x_t) - \min_{u} \sum_{t=0}^{T-1} f_t(u) \leq G \sqrt{T} \max_{u} \norm{x_0 - u}, \]
which establishes sublinear regret for the actions played by the player according to:

\[
x_{t+1} \leftarrow x_t - \eta_t \partial f_t(x_t),
\]
with the step length $\eta \doteq \frac{\norm{x_0 - x^\esx}}{G} \sqrt{\frac{1}{T}}$, which in this context is also often referred to as *learning rate*. This setting requires knowledge of $T$ ahead of time. As discussed earlier this can be overcome, either with the doubling-trick or via the variable step length approach that we discuss further below; the cost in terms of regret is a $\sqrt{2}$-factor for the latter.
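To make the online setting concrete, here is a small simulation (my own toy example, not from the post): the adversary plays $f_t(x) = |x - a_t|$ for a random sequence $a_t$, and the player runs the update above with a fixed learning rate; the regret against the best fixed action in hindsight (the median of the $a_t$) stays within the bound.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10000
a = rng.uniform(-1.0, 1.0, size=T)   # adversary's sequence; f_t(x) = |x - a_t|
G, R = 1.0, 2.0                      # |subgradient| <= 1, |x_0 - u| <= 2 for u in [-1, 1]
eta = R / (G * np.sqrt(T))           # fixed learning rate, T known ahead of time

x, total_loss = 0.0, 0.0
for t in range(T):
    total_loss += abs(x - a[t])      # player incurs cost f_t(x_t)
    x -= eta * np.sign(x - a[t])     # online (sub)gradient step

best_fixed = np.abs(np.median(a) - a).sum()  # best static action in hindsight
regret = total_loss - best_fixed
print(regret <= G * R * np.sqrt(T))  # True: sublinear O(sqrt(T)) regret
```

Note that the player never sees $f_t$ before committing to $x_t$; only the subgradient at the played point is used.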

So what is our algorithm doing when deployed in an online setting? For this it is helpful to consider the update in iteration $t$ through (basic), rearranged for convenience:

\[ \norm{x_{t+1} - u}^2 = \norm{x_t - u}^2 - 2 \eta_t \langle \partial f_t(x_t), x_t - u \rangle + \eta_t^2 \norm{\partial f_t(x_t)}^2, \]

so for $x_{t+1}$ to move closer to a given $u$, it is necessary that

\[ 2 \eta_t \langle \partial f_t(x_t), x_t - u \rangle > \eta_t^2 \norm{\partial f_t(x_t)}^2, \]

or equivalently, that

\[ \langle \partial f_t(x_t), x_t - u \rangle > \frac{\eta_t}{2} \norm{\partial f_t(x_t)}^2, \]

i.e., the *potential gain*, measured by the dual gap $\langle \partial f_t(x_t), x_t - u\rangle$, must be larger than $\frac{\eta_t}{2} \norm{\partial f_t(x_t)}^2$, where $\norm{\partial f_t(x_t)}^2$ is the maximally possible gain (the scalar product is maximized at $\partial f_t(x_t)$, e.g., by Cauchy-Schwarz). As such we require an $\frac{\eta_t}{2}$ fraction of the total possible gain to move closer to $u$.

We will later see that other online learning variants naturally arise the same way by “short-cutting” the convergence proof as we have done here. In particular, we will see that the famous *Multiplicative Weight Update* algorithm is basically obtained from short-cutting the Mirror Descent convergence proof for the probability simplex with the relative entropy as Bregman divergence; more on this later.

## The constrained setting: projected subgradient descent

We will now move to the constrained setting, where we require that the iterates $x_t$ are contained in some convex set $K$, i.e., $x_t \in K$. As in the above, our starting point is the *poor man's identity* arising from expanding the norm. To this end we write:

\[ \norm{x_{t+1} - x^\esx}^2 = \norm{x_t - x^\esx}^2 - 2 \langle x_t - x_{t+1}, x_t - x^\esx \rangle + \norm{x_t - x_{t+1}}^2, \]

or in a more convenient form (by rearranging) as:

\[ \tag{normExpand} \langle x_t - x_{t+1}, x_t - x^\esx \rangle = \frac{1}{2} \norm{x_t - x^\esx}^2 - \frac{1}{2} \norm{x_{t+1} - x^\esx}^2 + \frac{1}{2} \norm{x_t - x_{t+1}}^2. \]

In the basic analysis of subgradient descent, we then used the specific form of the update $x_{t+1} \leftarrow x_t - \eta_t \partial f(x_t)$ and then summed and telescoped out. Now, things are different. A hypothetical update $x_{t+1} \leftarrow x_t - \eta_t \partial f(x_t)$ might lead outside of $K$, i.e., $x_{t+1} \not\in K$ might happen. Observe though that (normExpand) still telescopes as before by simply adding up over the iterations; however, we have no idea how $\langle x_t - x_{t+1}, x_t - x^\esx \rangle$ relates to our function $f$ of interest, and clearly this has to depend on the actual step we take, i.e., on the properties of $x_{t+1}$. A natural, but slightly too optimistic, thing to hope for is to find a step $x_{t+1}$ such that

\[ \tag{optimistic} \langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t} - x^\esx \rangle \]

holds, as this actually held even with equality in the unconstrained case. However, suppose we can show the following:

\[ \tag{lookAhead} \langle \eta_t \partial f(x_{t}), x_{t+1} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t+1} - x^\esx \rangle. \]
Note the subtle difference in the indices in the $x_{t+1} - x^\esx$ term. It is much easier to show (lookAhead), as we will do further below, because the point $x_{t+1}$ that we choose as a function of $\partial f(x_t)$ and $x_t$ is under our control; in comparison, $x_t$ has already been chosen at time $t$. However, this is not yet good enough to telescope out the sums due to the mismatch in indices. The following observation remedies the situation by undoing the index shift and quantifying the change:

**Observation.** If $\langle \eta_t \partial f(x_{t}), x_{t+1} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t+1} - x^\esx \rangle$, then
\[
\tag{lookAheadIneq}
\begin{align}
\langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle & \leq \langle x_t - x_{t+1}, x_{t} - x^\esx \rangle \newline
\nonumber & - \frac{1}{2}\norm{x_t - x_{t+1}}^2 \newline
\nonumber & +\frac{1}{2}\norm{\eta_t \partial f(x_t)}^2
\end{align}
\]

Before proving the observation, observe that in the unconstrained case, where we choose $x_{t+1} = x_t - \eta_t \partial f(x_t)$, the inequality in the observation reduces to (optimistic), holding even with equality, and when plugging this back into our poor man's identity this exactly becomes the basic argument from the beginning of the post. This is good news as it indicates that the observation reduces to what we know already in the unconstrained case. As such we might want to think of the observation as relating the step $x_{t+1} - x_t$ that we take with $\partial f(x_t)$, assuming that we can choose $x_{t+1}$ to satisfy (lookAhead).

*Proof (of observation).*
Our starting point is the inequality (lookAhead), whose validity we establish a little later:

\[ \langle \eta_t \partial f(x_{t}), x_{t+1} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t+1} - x^\esx \rangle. \]

We will simply brute-force rewrite the inequality into the desired form and collect the error terms in the process. The above inequality is equivalent to:

\[ \langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t+1} - x^\esx \rangle + \langle \eta_t \partial f(x_{t}), x_t - x_{t+1} \rangle. \]

Rewriting we obtain:

\[
\begin{align}
\langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle & \leq \langle x_t - x_{t+1}, x_{t} - x^\esx \rangle - \norm{x_t - x_{t+1}}^2 + \langle \eta_t \partial f(x_{t}), x_t - x_{t+1} \rangle \newline
\nonumber & \leq \langle x_t - x_{t+1}, x_{t} - x^\esx \rangle - \frac{1}{2} \norm{x_t - x_{t+1}}^2 + \frac{1}{2} \norm{\eta_t \partial f(x_t)}^2,
\end{align}
\]

where the last inequality uses the binomial formula, i.e., $(a-b)^2 = a^2 - 2ab + b^2 \geq 0$ and hence $2ab \leq a^2 + b^2$.

$\qed$

With the observation we can immediately conclude our convergence proof, and the argument becomes identical to the basic case from above. Recall that our starting point is (normExpand):

\[ \langle x_t - x_{t+1}, x_t - x^\esx \rangle = \frac{1}{2} \norm{x_t - x^\esx}^2 - \frac{1}{2} \norm{x_{t+1} - x^\esx}^2 + \frac{1}{2} \norm{x_t - x_{t+1}}^2. \]

Now we can estimate the term on the left-hand side using our observation. This leads to:

\[ \langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle \leq \frac{1}{2} \norm{x_t - x^\esx}^2 - \frac{1}{2} \norm{x_{t+1} - x^\esx}^2 + \frac{1}{2} \norm{x_t - x_{t+1}}^2 - \frac{1}{2} \norm{x_t - x_{t+1}}^2 + \frac{1}{2} \norm{\eta_t \partial f(x_t)}^2, \]

and after cancelling the $\frac{1}{2}\norm{x_t - x_{t+1}}^2$ terms, we obtain:

\[ \langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle \leq \frac{1}{2} \norm{x_t - x^\esx}^2 - \frac{1}{2} \norm{x_{t+1} - x^\esx}^2 + \frac{1}{2} \norm{\eta_t \partial f(x_t)}^2, \]

which is exactly (basic) as above, and we can conclude the argument in the same way: summing up, telescoping, and then optimizing $\eta_t$. In particular, the convergence rate (convergenceSG) and the regret bound (regretBound) stay the same, with no deterioration due to constraints or projections:

\[ f(\bar x) - f(x^\esx) \leq \frac{G \norm{x_0 - x^\esx}}{\sqrt{T}} \]

and

\[ \sum_{t=0}^{T-1} \langle \partial f(x_t), x_t - x^\esx \rangle \leq G \norm{x_0 - x^\esx} \sqrt{T}. \]
So the key is really establishing (lookAhead) as it immediately implies all we need to establish convergence in the constrained case. This is what we will do now, which will also, finally, specify our choice of $x_{t+1}$.

### Using optimization to prove what you want

Have you ever wondered why people add these weird 2-norm terms to their optimization problems to “regularize” them, i.e., why they solve problems of the form $\min_{x} f(x) + \lambda \norm{x - z}^2$? Then this section might provide some insight. We will see that it is actually not about the “problem” that is solved, but about what an optimal solution might guarantee; bear with me.

So what we want to establish is inequality (lookAhead), i.e.,

\[ \langle \eta_t \partial f(x_{t}), x_{t+1} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t+1} - x^\esx \rangle, \]

or slightly more generally stated, as our proof will work for *all* $u \in K$ (and in particular the choice $u = x^\esx$):

\[ \langle \eta_t \partial f(x_{t}), x_{t+1} - u \rangle \leq \langle x_t - x_{t+1}, x_{t+1} - u \rangle \qquad \text{for all } u \in K. \]

Rearranging the above we obtain:

\[ \tag{optCon} \langle - \eta_t \partial f(x_{t}) + x_t - x_{t+1}, u - x_{t+1} \rangle \leq 0 \qquad \text{for all } u \in K. \]

What we will do now is to interpret the above as an *optimality condition* of some smooth optimization problem of the form $\max_{x \in K} g(x)$, where $g(x)$ is some smooth and concave function. Recall from the previous posts, e.g., Cheat Sheet: Frank-Wolfe and Conditional Gradients, that the first-order optimality condition states that for all $u \in K$ it holds:

\[ \langle \nabla g(x), u - x \rangle \leq 0, \]

provided that $x \in K$ is an optimal solution, as otherwise we would be able to make progress via, e.g., a gradient step or a Frank-Wolfe step. By simply reverse engineering (aka remembering how we differentiate), we guess

\[ g(x) \doteq - \langle \eta_t \partial f(x_{t}), x \rangle - \frac{1}{2} \norm{x - x_t}^2, \]

so that its optimality condition produces (optCon). We now simply choose

\[ x_{t+1} \doteq \arg\max_{x \in K} \; - \langle \eta_t \partial f(x_{t}), x \rangle - \frac{1}{2} \norm{x - x_t}^2 = \arg\min_{x \in K} \; \langle \eta_t \partial f(x_{t}), x \rangle + \frac{1}{2} \norm{x - x_t}^2, \]

and (just to be sure) we inspect the optimality condition, which states:

\[ \langle \eta_t \partial f(x_{t}), x_{t+1} - u \rangle \leq \langle x_t - x_{t+1}, x_{t+1} - u \rangle \qquad \text{for all } u \in K, \]

which is exactly (lookAhead). This step then ensures convergence with (maybe surprisingly) a rate identical to the unconstrained case. The resulting algorithm is often referred to as *projected subgradient descent*, and the problem whose optimal solution defines $x_{t+1}$ is the projection problem. We provide the *projected subgradient descent* algorithm below:
which is exactly (lookAhead). This step then ensures convergence with (maybe surprisingly) a rate identical to the unconstrained case. The resulting algorithm is often referred to as *projected subgradient descent* and the problem whose optimal solution defines $x_{t+1}$ is the projection problem. We provide the *projected subgradient descent* algorithm below:

**Projected Subgradient Descent Algorithm.**

*Input:* Convex function $f$ with first-order oracle access and some initial point $x_0 \in K$

*Output:* Sequence of points $x_0, \dots, x_T$

For $t = 0, \dots, T-1$ do:

$\quad x_{t+1} \leftarrow \arg\min_{x \in K} \langle \eta_t \partial f(x_{t}), x \rangle + \frac{1}{2}\norm{x-x_t}^2$
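For sets where the projection has a closed form, each step is cheap, since the $\arg\min$ above is exactly the Euclidean projection of $x_t - \eta_t \partial f(x_t)$ onto $K$. A small sketch (my own example, not from the post) with $K$ the unit $\ell_2$-ball:

```python
import numpy as np

def proj_ball(x, radius=1.0):
    # Euclidean projection onto K = {x : ||x||_2 <= radius}.
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

def projected_subgradient(subgrad, proj, x0, eta, T):
    """x_{t+1} = argmin_{x in K} <eta * df(x_t), x> + 0.5 * ||x - x_t||^2,
    which equals proj_K(x_t - eta * df(x_t))."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(T):
        xs.append(proj(xs[-1] - eta * subgrad(xs[-1])))
    return xs

# Example: minimize f(x) = ||x - c||_1 over the unit ball, with c outside the ball.
c = np.array([2.0, 0.0])                    # optimum is x* = (1, 0) with f(x*) = 1
D, G, T = 2.0, np.sqrt(2.0), 10000          # ball diameter, subgradient bound, horizon
eta = D / (G * np.sqrt(T))
xs = projected_subgradient(lambda x: np.sign(x - c), proj_ball, np.zeros(2), eta, T)
x_bar = np.mean(xs[:-1], axis=0)
gap = np.abs(x_bar - c).sum() - 1.0         # primal gap f(x_bar) - f(x*)
print(0.0 <= gap <= D * G / np.sqrt(T))     # True: no deterioration from projecting
```

As the analysis promises, the rate matches the unconstrained case.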

### Variable step length

We will now briefly explain how to replace the constant step length from before that requires a priori knowledge of $T$ by a variable step length, so that the convergence guarantee holds for any iterate $x_t$. To this end let $D \geq 0$ be a constant so that $\max_{x,y \in K} \norm{x-y} \leq D$. We now choose $\eta_t \doteq \tau \sqrt{\frac{1}{t+1}}$, where we will specify the constant $\tau \geq 0$ soon.

**Observation.** For $\eta_t$ as above it holds:
\[\sum_{t = 0}^{T-1} \eta_t \leq \tau\left(2 \sqrt{T} - 1\right).\]

*Proof.* There are various ways of showing the above. We follow the argument in [Z]. We have:

\[ \sum_{t = 0}^{T-1} \eta_t = \tau \sum_{t = 1}^{T} \frac{1}{\sqrt{t}} \leq \tau \left( 1 + \int_{1}^{T} \frac{1}{\sqrt{x}} \, dx \right) = \tau \left( 2 \sqrt{T} - 1 \right). \qquad \qed \]
Now we restart from inequality (basic) from earlier:

\[ \langle \eta_t \partial f(x_t), x_t - x^\esx \rangle \leq \frac{1}{2} \norm{x_t - x^\esx}^2 - \frac{1}{2} \norm{x_{t+1} - x^\esx}^2 + \frac{1}{2} \norm{\eta_t \partial f(x_t)}^2, \]

however, before we sum up and telescope we first divide by $2\eta_t$, i.e.,

\[
\begin{align}
\sum_{t=0}^{T-1} \langle \partial f(x_t), x_t - x^\esx \rangle & \leq \sum_{t=0}^{T-1} \frac{1}{2\eta_t} \left( \norm{x_t - x^\esx}^2 - \norm{x_{t+1} - x^\esx}^2 \right) + \sum_{t=0}^{T-1} \frac{\eta_t}{2} \norm{\partial f(x_t)}^2 \newline
\nonumber & \leq \frac{\norm{x_0 - x^\esx}^2}{2\eta_0} + \sum_{t=1}^{T-1} \left( \frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}} \right) \norm{x_t - x^\esx}^2 + \frac{G^2}{2} \sum_{t=0}^{T-1} \eta_t \newline
\nonumber & \leq \frac{D^2}{2\eta_{T-1}} + \frac{G^2}{2} \sum_{t=0}^{T-1} \eta_t \leq \left( \frac{D^2}{2\tau} + G^2 \tau \right) \sqrt{T} = \sqrt{2}\, D G \sqrt{T},
\end{align}
\]

where we applied the observation from above in the last but one inequality, plugged in the definition of $\eta_t$, and used the choice $\tau \doteq \frac{D}{G\sqrt{2}}$ in the last equation, which minimizes the term in the brackets in the last inequality. In summary, we have shown:

\[ \tag{regretBoundAnytime} \sum_{t=0}^{T-1} \langle \partial f(x_t), x_t - x^\esx \rangle \leq \sqrt{2}\, D G \sqrt{T}. \]
From (regretBoundAnytime) we can now derive convergence rates as usual: dividing by $T$, averaging the iterates, and using convexity.

It is useful to compare (regretBoundAnytime) to the case with fixed step length, which is given in (regretBound): using a variable step length costs us a factor of $\sqrt{2}$, however the above bound in (regretBoundAnytime) now holds for *all* $t$ and a priori knowledge of $T$ is not required. Such regret bounds are sometimes referred to as *anytime regret bounds*.
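As a quick sanity check (my own snippet), one can verify the observation $\sum_{t=0}^{T-1} \eta_t \leq \tau(2\sqrt{T}-1)$ numerically for a few horizons:

```python
import numpy as np

tau = 0.7
checks = []
for T in [1, 10, 100, 10**4]:
    lhs = tau * sum(1.0 / np.sqrt(t + 1) for t in range(T))  # sum of the eta_t
    rhs = tau * (2.0 * np.sqrt(T) - 1.0)
    checks.append(lhs <= rhs + 1e-9)
print(all(checks))  # True
```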

### Online (Sub-)Gradient Descent

Starting from (regretBoundAnytime), we can also follow the same path as in the online learning section from above. This recovers the Online (Sub-)Gradient Descent algorithm of [Z]: Consider the online learning setting from before and choose

\[ x_{t+1} \leftarrow \arg\min_{x \in K} \; \langle \eta_t \partial f_t(x_{t}), x \rangle + \frac{1}{2} \norm{x - x_t}^2 \qquad \text{with } \eta_t \doteq \frac{D}{G\sqrt{2}} \sqrt{\frac{1}{t+1}}. \]

Then, we obtain the regret bound

\[ \sum_{t=0}^{T-1} f_t(x_t) - \min_{u \in K} \sum_{t=0}^{T-1} f_t(u) \leq \sqrt{2}\, D G \sqrt{T} \]

in the anytime setting and

\[ \sum_{t=0}^{T-1} f_t(x_t) - \min_{u \in K} \sum_{t=0}^{T-1} f_t(u) \leq D G \sqrt{T} \]

when $T$ is known ahead of time, where $D$ and $G$ are bounds on the diameter of the feasible domain and the norm of the subgradients, respectively, as before.

## Mirror Descent

We will now derive Nemirovski's Mirror Descent algorithm (see, e.g., [NY]), following somewhat the proximal perspective as outlined in [BT]. Simplifying, and running the risk of attracting the wrath of the optimization titans, *Mirror Descent* arises from subgradient descent by replacing the $\ell_2$-norm with a “generalized distance” that satisfies the inequalities that we needed in the basic argument from above.

Why would you want to do that? Adjusting the distance function will allow us to fine-tune the iterates and the resulting dimension-dependent term for the geometry under consideration.

In the following, as we move away from the $\ell_2$-norm, which is self-dual, we will need the *dual norm*, defined as $\norm{y}_\esx \doteq \max_{\norm{x} \leq 1} \langle y, x \rangle$. Note that for the $\ell_p$-norm the $\ell_q$-norm is dual if $\frac{1}{p} + \frac{1}{q} = 1$; for the $\ell_1$-norm the dual norm is the $\ell_\infty$-norm. We will also need the *(generalized) Cauchy-Schwarz inequality* or *Hölder inequality*: $\langle y , x \rangle \leq \norm{y}_\esx \norm{x}$. A very useful consequence of this inequality is:

\[ \tag{genBinomial} \langle y, x \rangle \leq \frac{1}{2} \norm{y}_\esx^2 + \frac{1}{2} \norm{x}^2, \]

which follows from

\[ 0 \leq \frac{1}{2} \left( \norm{y}_\esx - \norm{x} \right)^2 = \frac{1}{2} \norm{y}_\esx^2 - \norm{y}_\esx \norm{x} + \frac{1}{2} \norm{x}^2 \]

together with $\langle y, x \rangle \leq \norm{y}_\esx \norm{x}$.
### “Generalized Distance” aka Bregman divergence

We will first introduce the generalization of norms that we will be working with. To this end, let us first collect the desired properties that we needed in the proof of the basic argument; I will already suggestively use the final notation to not create notational overload. Let our desired function be called $V_x(y)$ and let us further assume in a first step the choice $V_x(y) = \frac{1}{2} \norm{x-y}^2$; note the factor $\frac{1}{2}$ is only used to make the proofs cleaner.

In the very first step we used the expansion of the $\ell_2$-norm. As $x^\esx$ plays no special role, we write everything with respect to an arbitrary feasible $u$:

\[ \langle x_t - x_{t+1}, x_{t+1} - u \rangle = \frac{1}{2} \norm{x_t - u}^2 - \frac{1}{2} \norm{x_{t+1} - u}^2 - \frac{1}{2} \norm{x_t - x_{t+1}}^2. \]

Rescaling and substituting $V_x(y) = \frac{1}{2} \norm{x-y}^2$, and observing that $\nabla V_x(y) = y - x$ (with the gradient taken with respect to $y$), we obtain:

\[ \tag{req-1} \langle - \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle = V_{x_t}(u) - V_{x_{t+1}}(u) - V_{x_{t+1}}(x_{t}), \]

where the last choice $V_{x_{t+1}}(x_{t})$ used a non-deterministic guess, as by symmetry of the $\ell_2$-norm also $V_{x_{t}}(x_{t+1})$ would have been feasible.

We will need another inequality if we aim to mimic the same proof in the constrained case. Recall that we needed (lookAheadIneq)

\[\langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t} - x^\esx \rangle - \frac{1}{2}\norm{x_t - x_{t+1}}^2+\frac{1}{2}\norm{\eta_t \partial f(x_t)}^2,\]

to relate the step $x_{t+1} - x_t$ that we take with $\partial f(x_t)$, assuming that $x_{t+1}$ was chosen appropriately. In the proof the term $\frac{1}{2}\norm{x_t - x_{t+1}}^2$ simply arose from the mechanics of the (standard) scalar product, which is inherently linked to the $\ell_2$-norm. Slightly jumping ahead (the later proof will make this requirement natural), we additionally require

\[ \tag{req-2} V_x(y) \geq \frac{1}{2} \norm{x - y}^2. \]
Moreover, the term $\frac{1}{2}\norm{\eta_t \partial f(x_t)}^2$ in (lookAheadIneq) is actually using the *dual norm*, which we did not have to pay attention to as the $\ell_2$-norm is self-dual. We will redo the full argument in the next sections with the correct distinctions for completeness. First, however we will complete the definition of the $V_x(y)$.

There is a natural class of functions that satisfy (req-1) and (req-2), so called *Bregman divergences*, which are defined through *Distance Generating Functions (DGFs)*. Let us choose some norm $\norm{\cdot}$, which is not necessarily the $\ell_2$-norm.

**Definition. (DGF and Bregman Divergence)** Let $K \subseteq \RR^n$ be a closed convex set. Then $\phi: K \rightarrow \RR$ is called a *Distance Generating Function (DGF)* if $\phi$ is $1$-strongly convex with respect to $\norm{\cdot}$, i.e., for all $x \in K \setminus \partial K, y \in K$ we have $\phi(y) \geq \phi(x) + \langle \nabla \phi(x), y-x \rangle + \frac{1}{2}\norm{x-y}^2$. The *Bregman divergence (induced by $\phi$)* is defined as
\[V_x(y) \doteq \phi(y) - \langle \nabla \phi(x), y - x \rangle - \phi(x),\]
$x \in K \setminus \partial K, y \in K$.

Observe that the strong convexity requirement of the DGF is with respect to the chosen norm. This is important as it allows us to āfine-tuneā our geometry. Before we establish some basic properties of Bregman divergences, here are two common examples:

**Examples. (Bregman Divergences)**

(a) Let $\norm{x} \doteq \norm{x}_2$ be the $\ell_2$-norm and $\phi(x) \doteq \frac{1}{2} \norm{x}^2$. Clearly, $\phi(x)$ is $1$-strongly convex with respect to $\norm{\cdot}$ (for any $K$). The resulting Bregman divergence is $V_x(y) = \frac{1}{2}\norm{x-y}^2$, which is the choice used for (projected) subgradient descent above.

(b) Let $\norm{x} \doteq \norm{x}_1$ be the $\ell_1$-norm and $\phi(x) \doteq \sum_{i \in [n]} x_i \log x_i$ be the (negative) entropy. Then $\phi(x)$ is $1$-strongly convex with respect to $\norm{\cdot}_1$ for all $x \in \Delta_n \doteq \{x \in \RR^n_{\geq 0} : \sum_{i \in [n]} x_i = 1\}$, which is the *probability simplex*. The resulting Bregman divergence is $V_x(y) = \sum_{i \in [n]} y_i \log \frac{y_i}{x_i}$, which is the *Kullback-Leibler divergence* or *relative entropy*.
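A quick numerical sanity check (my own snippet, not from the post) that the Bregman divergence of the negative entropy over the simplex is indeed the relative entropy, and that it dominates $\frac{1}{2}\norm{x-y}_1^2$ as strong convexity demands (Pinsker's inequality):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    # V_x(y) = phi(y) - <grad phi(x), y - x> - phi(x)
    return phi(y) - grad_phi(x) @ (y - x) - phi(x)

neg_entropy = lambda x: np.sum(x * np.log(x))
grad_neg_entropy = lambda x: np.log(x) + 1.0

rng = np.random.default_rng(1)
x = rng.random(4); x /= x.sum()   # two points on the probability simplex
y = rng.random(4); y /= y.sum()

V = bregman(neg_entropy, grad_neg_entropy, x, y)
kl = np.sum(y * np.log(y / x))    # relative entropy KL(y || x)
print(np.isclose(V, kl), V >= 0.5 * np.abs(x - y).sum() ** 2)  # True True
```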

We will now establish some basic properties for $V_x(y)$ and show that $V_x(y)$ satisfies the required properties:

**Lemma. (Properties of the Bregman Divergence)** Let $V_x(y)$ be a Bregman divergence defined via some DGF $\phi$. Then the following holds:

(a) Point-separating: $V_x(x) = 0$ (and also $V_x(y) = 0 \Leftrightarrow x = y$ via (b))

(b) Compatible with norm: $V_x(y) \geq \frac{1}{2} \norm{x-y}^2 \geq 0$

(c) $\Delta$-Inequality: $\langle - \nabla V_x(y), y - u \rangle = V_x(u) - V_y(u) - V_x(y)$

*Proof.* Property (a) follows directly from the definition and (b) follows from $\phi$ in the definition of $V_x(y)$ being $1$-strongly convex. Property (c) follows from straightforward expansion and computation:

\[
\begin{align}
\langle - \nabla V_x(y), y - u \rangle & = \langle \nabla \phi(x) - \nabla \phi(y), y - u \rangle \newline
\nonumber & = \left( \phi(u) - \langle \nabla \phi(x), u - x \rangle - \phi(x) \right) \newline
\nonumber & \quad - \left( \phi(u) - \langle \nabla \phi(y), u - y \rangle - \phi(y) \right) \newline
\nonumber & \quad - \left( \phi(y) - \langle \nabla \phi(x), y - x \rangle - \phi(x) \right) \newline
\nonumber & = V_x(u) - V_y(u) - V_x(y). \qquad \qed
\end{align}
\]
### Back to basics

In a first step we will redo our basic argument from the beginning of the post with a Bregman divergence instead of the expansion of the $\ell_2$-norm. To this end let $K \subseteq \RR^n$ (possibly $K = \RR^n$) be a closed convex set. We consider a generic algorithm that produces iterates $x_1, \dots, x_t, \dots$; we will define the choice of the iterates later. Our starting point is the $\Delta$-inequality of the Bregman divergence with the choices $y \leftarrow x_{t+1}$, $x \leftarrow x_t$, and $u \in K$ arbitrary:

\[ \tag{basicBreg} \langle - \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle = V_{x_t}(u) - V_{x_{t+1}}(u) - V_{x_t}(x_{t+1}). \]

We could now try the same strategy, summing up and telescoping out:

\[ \sum_{t=0}^{T-1} \langle - \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle = V_{x_0}(u) - V_{x_T}(u) - \sum_{t=0}^{T-1} V_{x_t}(x_{t+1}). \]
But how to continue? First observe that, in contrast to the telescoping of the $\ell_2$-norm expansion, we have a *negative* term on the right-hand side (this is technical and could have been done the same way for the $\ell_2$-norm), and the left-hand side, as of now, has no relation to the function $f$; clearly, we have not even defined our step yet. So let us try the obvious first-order guess, i.e., replacing the $\ell_2$-norm with the Bregman divergence.

To this end, we define

\[ \tag{IteratesMD} x_{t+1} \doteq \arg\min_{x \in K} \; \langle \eta_t \partial f(x_t), x \rangle + V_{x_t}(x). \]

Mimicking the approach we took for projected gradient descent, let us inspect the optimality condition of the system. For all $u \in K$ it holds:

\[ \langle \eta_t \partial f(x_t) + \nabla V_{x_t}(x_{t+1}), u - x_{t+1} \rangle \geq 0, \]

or equivalently we obtain:

\[ \langle \eta_t \partial f(x_t), x_{t+1} - u \rangle \leq \langle - \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle; \]

as before, we now have to fix the index mismatch on the left:

\[ \langle \eta_t \partial f(x_t), x_{t} - u \rangle = \langle \eta_t \partial f(x_t), x_{t+1} - u \rangle + \langle \eta_t \partial f(x_t), x_{t} - x_{t+1} \rangle. \]

This we can then plug back into (basicBreg) to obtain the key inequality for Mirror Descent:

\[
\tag{basicMD}
\begin{align}
\langle \eta_t \partial f(x_t), x_{t} - u \rangle & \leq V_{x_t}(u) - V_{x_{t+1}}(u) - V_{x_t}(x_{t+1}) + \langle \eta_t \partial f(x_t), x_{t} - x_{t+1} \rangle \newline
\nonumber & \leq V_{x_t}(u) - V_{x_{t+1}}(u) - \frac{1}{2} \norm{x_t - x_{t+1}}^2 + \langle \eta_t \partial f(x_t), x_{t} - x_{t+1} \rangle \newline
\nonumber & \leq V_{x_t}(u) - V_{x_{t+1}}(u) + \frac{\eta_t^2}{2} \norm{\partial f(x_t)}_\esx^2,
\end{align}
\]
where the last inequality follows via (genBinomial) and the middle estimate via (req-2). We can now simply sum up and telescope out to obtain the generic regret bound for Mirror Descent:

\[ \tag{regretBoundMD} \sum_{t=0}^{T-1} \langle \eta_t \partial f(x_t), x_t - u \rangle \leq V_{x_0}(u) + \sum_{t=0}^{T-1} \frac{\eta_t^2}{2} \norm{\partial f(x_t)}_\esx^2, \]

and further we can again use convexity, averaging of the iterates, and pick $\eta_t = \eta \doteq \sqrt{\frac{2M}{G^2T}}$ (by optimizing out) to arrive at the convergence rate of Mirror Descent:

\[ \tag{convergenceMD} f(\bar x) - f(x^\esx) \leq \frac{1}{T} \sum_{t=0}^{T-1} \langle \partial f(x_t), x_t - x^\esx \rangle \leq G \sqrt{\frac{2M}{T}}, \]

where $\norm{\partial f(x_t)}_\esx \leq G$ and $V_{x_0}(u) \leq M$ for all $u \in K$.

For completeness, the Mirror Descent algorithm is specified below:

**Mirror Descent Algorithm.**

*Input:* Convex function $f$ with first-order oracle access and some initial point $x_0 \in K$

*Output:* Sequence of points $x_0, \dots, x_T$

For $t = 0, \dots, T-1$ do:

$\quad x_{t+1} \leftarrow \arg\min_{x \in K} \langle \eta_t \partial f(x_t), x \rangle + V_{x_t}(x)$

### Online Mirror Descent and Multiplicative Weights

Alternatively, starting from (regretBoundMD) we can yet again observe that one could use a different function $f_t$ in each iteration $t$, which leads us to *Online Mirror Descent*, as we will briefly discuss in this section. From (regretBoundMD) we have, with $\eta_t = \eta$ chosen below:

\[ \sum_{t=0}^{T-1} \langle \eta \partial f_t(x_t), x_t - u \rangle \leq V_{x_0}(u) + \sum_{t=0}^{T-1} \frac{\eta^2}{2} \norm{\partial f_t(x_t)}_\esx^2. \]

Rearranging, with $\norm{\partial f_t(x_t)}_\esx \leq G$ and $V_{x_0}(u) \leq M$ for all $u \in K$, gives

\[ \sum_{t=0}^{T-1} \langle \partial f_t(x_t), x_t - u \rangle \leq \frac{M}{\eta} + \frac{\eta T G^2}{2}. \]

With the (optimal) choice $\eta = \sqrt{\frac{2M}{G^2T}}$ and using the subgradient property, we obtain the online learning regret bound for Mirror Descent:

\[ \sum_{t=0}^{T-1} f_t(x_t) - \min_{u \in K} \sum_{t=0}^{T-1} f_t(u) \leq G \sqrt{2MT}, \]

and, paying another factor $\sqrt{2}$, we can make this bound *anytime*.

We will now consider the important special case of $K = \Delta_n$ being the probability simplex with $V_x(y) = \sum_{i \in [n]} y_i \log \frac{y_i}{x_i}$ being the *relative entropy*, which will lead to (an alternative proof of) the *Multiplicative Weight Update (MWU)* algorithm; this argument is folklore and has been widely known by experts, see e.g., [BT, AO]. In particular, it generalizes immediately to the matrix case, in contrast to other proofs of the MWU algorithm. We refer the interested reader to [AHK] for an overview of the many applications of the MWU algorithm or, equivalently, Mirror Descent over the probability simplex with relative entropy as Bregman divergence.

Via information-theoretic inequalities or just by-hand calculations (see [BT]), it can be easily seen that the negative entropy $\phi(x) = \sum_{i \in [n]} x_i \log x_i$ is $1$-strongly convex with respect to $\norm{\cdot}_1$, whose dual norm is $\norm{\cdot}_\infty$. Moreover, $V_{x_0}(x) \leq \log n$ for all $x \in \Delta_n$, with $x_0 = (1/n, \dots, 1/n)$ being the uniform distribution.

Recall that the iterates are defined via (IteratesMD), which in our case becomes

\[ x_{t+1} \doteq \arg\min_{x \in \Delta_n} \; \langle \eta_t \partial f_t(x_t), x \rangle + \sum_{i \in [n]} x_i \log \frac{x_i}{(x_t)_i}, \]

and making this explicit amounts to updates of the form (to be read coordinate-wise):

\[ x_{t+1} \leftarrow \frac{x_t \cdot e^{-\eta_t \partial f_t(x_t)}}{K_t}, \]
where $K_t$ is chosen such that $\norm{x_{t+1}}_1 = 1$, i.e., $K_t = \norm{x_t \cdot e^{-\eta_t \partial f_t(x_t)}}_1$, which is precisely the Multiplicative Weight Update algorithm.
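The update fits in a few lines of NumPy; the following sketch (my own, using linear losses $f_t(x) = \langle c_t, x \rangle$ as a hypothetical example) also checks the regret empirically against the $O(\sqrt{T \log n})$ guarantee discussed in the text:

```python
import numpy as np

def mwu(grads, x0, eta):
    """Mirror Descent on the simplex with relative entropy: multiplicative
    weights x_{t+1} ∝ x_t * exp(-eta * g_t), renormalized by K_t."""
    x, xs = np.asarray(x0, dtype=float), []
    for g in grads:
        xs.append(x)
        w = x * np.exp(-eta * g)
        x = w / w.sum()          # division by K_t keeps x on the simplex
    return xs

rng = np.random.default_rng(0)
n, T = 50, 5000
C = rng.uniform(0.0, 1.0, size=(T, n))   # linear losses f_t(x) = <c_t, x>, so G <= 1
eta = np.sqrt(2.0 * np.log(n) / T)       # eta = sqrt(2M / (G^2 T)) with M = log n
xs = mwu(C, np.full(n, 1.0 / n), eta)
regret = sum(C[t] @ xs[t] for t in range(T)) - C.sum(axis=0).min()
print(regret <= np.sqrt(2.0 * T * np.log(n)))  # True: within G * sqrt(2 T log n)
```

For linear losses the best static strategy in hindsight is a vertex of the simplex, which is why the comparator is the minimum column sum of `C`.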

With the bounds $M = \log n$ and $\norm{\partial f_t(x_t)}_\infty \leq G$ for all $t = 0, \dots, T-1$, the regret bound in this case becomes:

\[ \sum_{t=0}^{T-1} f_t(x_t) - \min_{u \in \Delta_n} \sum_{t=0}^{T-1} f_t(u) \leq G \sqrt{2 T \log n} \]

for the variant with known $T$, and we can pay another factor $\sqrt{2}$ to make this bound an anytime guarantee.

### Mirror Descent vs. Gradient Descent

One of the key questions is of course whether the improvement in convergence rate through fine-tuning against the geometry materializes in actual computations or whether it is just an improvement on paper. Following [AO], some comments are in order: if $V_x(y) \doteq \frac{1}{2}\norm{x-y}^2$, then Mirror Descent and (Sub-)Gradient Descent produce identical iterates. If on the other hand we, e.g., pick $K = \Delta_n$, the probability simplex in $\RR^n$, then we can pick $V_x(y)$ to be the relative entropy, and the iterates from Gradient Descent with the $\ell_2$-norm and Mirror Descent with relative entropy will be very different; and so will be the convergence behavior. Mirror Descent provides a guarantee of

\[ f(\bar x) - f(x^\esx) \leq G \sqrt{\frac{2 \log n}{T}}, \]

where $\bar x = \frac{1}{T} \sum_{t=0}^{T-1} x_t$ and $\norm{\partial f(x_t)}_\infty \leq G$ for all $t = 0, \dots, T-1$ in this case.

Below we compare Mirror Descent and Gradient Descent over $K = \Delta_n$ with $n = 10000$ (left) and across different values of $n$ (right) for some randomly generated functions; both plots are log-log plots. As can be seen, Mirror Descent can scale much better than Gradient Descent by choosing a Bregman divergence that is adapted to the geometry. For $K = \Delta_n$, the dependence on $n$ for Mirror Descent (with relative entropy and $\ell_1$-norm) is only logarithmic, whereas for Gradient Descent it is linear in the dimension $n$. This logarithmic dependence makes Mirror Descent well suited for large-scale applications in this case.
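A small experiment in this spirit can be set up as follows (the linear objective, dimensions, and step sizes are made-up illustrations; the projection used for Gradient Descent is the standard sort-based Euclidean simplex projection):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

rng = np.random.default_rng(0)
n, T = 1000, 500
c = rng.uniform(size=n)      # f(x) = <c, x>, so the subgradient is c
opt = c.min()                # minimum over the simplex sits at a vertex

x_gd = np.full(n, 1.0 / n)
x_md = np.full(n, 1.0 / n)
avg_gd = np.zeros(n)
avg_md = np.zeros(n)
eta_gd = np.sqrt(2.0 / T) / np.linalg.norm(c)    # tuned to l2 geometry
eta_md = np.sqrt(2 * np.log(n) / T) / c.max()    # tuned to entropy/l1 geometry
for _ in range(T):
    avg_gd += x_gd / T
    avg_md += x_md / T
    x_gd = project_simplex(x_gd - eta_gd * c)    # projected (sub)gradient step
    x_md = x_md * np.exp(-eta_md * c)            # entropic mirror step
    x_md /= x_md.sum()

gap_gd = c @ avg_gd - opt
gap_md = c @ avg_md - opt
```

The Mirror Descent gap is covered by the $G \sqrt{2 \log n / T}$ guarantee with $G = \norm{c}_\infty$, while the Gradient Descent guarantee degrades with $\norm{c}_2$, which grows with $n$.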

## Extensions

Finally I will talk about some natural extensions. While the full arguments are beyond the scope of this post, the interested reader might consult [Z2] for proofs. Also, there are various natural extensions in the online learning case, e.g., where we compare to slowly changing strategies; see [Z] for details.

### Stochastic versions

It is relatively easy to see that the above bounds transfer to the stochastic setting, where we only have access to an unbiased estimator of the (sub)gradient. We then obtain essentially the same convergence rates and regret bounds *in expectation*. With the usual Markov-type concentration arguments, we can also obtain high-probability bounds, say with probability $1-\delta$, paying a $\log \frac{1}{\delta}$ factor in the convergence bound and regret.
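As a toy illustration (the loss, noise model, and constants are assumptions for this sketch): feeding an unbiased, zero-mean-noise gradient estimator into the entropic mirror step still drives the *expected* gap down at the same rate, with $G$ now bounding the noisy gradients in $\ell_\infty$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 5, 2000
c = np.linspace(0.0, 1.0, n)   # true linear loss f(x) = <c, x>, subgradient c

def noisy_grad():
    """Unbiased stochastic gradient: E[g] = c (zero-mean uniform noise)."""
    return c + rng.uniform(-0.5, 0.5, size=n)

x = np.full(n, 1.0 / n)                   # start at the uniform distribution
avg = np.zeros(n)
eta = np.sqrt(2 * np.log(n) / T) / 1.5    # G = 1.5 bounds the noisy gradient
for _ in range(T):
    avg += x / T
    x = x * np.exp(-eta * noisy_grad())   # mirror step with the estimator
    x /= x.sum()

gap = c @ avg - c.min()    # expected-loss gap of the averaged iterate
```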

### Smooth case

When $f$ is smooth we can modify (basicMD) to obtain the improved $O(1/t)$ rate. Recall that $f$ is $L$-smooth with respect to $\norm{\cdot}$ if:

$$f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \norm{y - x}^2$$

for all $x,y \in \mathbb R^n$. Choosing $x \leftarrow x_t$ and $y \leftarrow x_{t+1}$ we obtain:

$$f(x_{t+1}) \leq f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{L}{2} \norm{x_{t+1} - x_t}^2.$$

We now modify (basicMD) with $u = x^\esx$ as follows:

$$\eta_t \langle \nabla f(x_t), x_t - x^\esx \rangle \leq V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) - V_{x_t}(x_{t+1}) + \eta_t \langle \nabla f(x_t), x_t - x_{t+1} \rangle.$$

Chaining in the inequality we obtained from smoothness:

$$\eta_t \langle \nabla f(x_t), x_t - x^\esx \rangle \leq V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) + \eta_t \left( f(x_t) - f(x_{t+1}) \right) + \frac{\eta_t L - 1}{2} \norm{x_{t+1} - x_t}^2,$$

where the last inequality used the compatibility of the Bregman divergence with the norm, i.e., $V_x(y) \geq \frac{1}{2} \norm{x-y}^2$. Picking $\eta_t = \frac{1}{L}$ (so that the term in $\norm{x_{t+1} - x_t}^2$ vanishes) and using convexity, i.e., $\langle \nabla f(x_t), x_t - x^\esx \rangle \geq f(x_t) - f(x^\esx)$, results in

$$\frac{1}{L} \left( f(x_{t+1}) - f(x^\esx) \right) \leq V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx),$$

which we telescope out to

$$\frac{1}{L} \sum_{t=0}^{T-1} \left( f(x_{t+1}) - f(x^\esx) \right) \leq V_{x_0}(x^\esx) - V_{x_T}(x^\esx) \leq V_{x_0}(x^\esx),$$

and by convexity the average $\bar x = \frac{1}{T} \sum_{t = 1}^{T} x_t$ satisfies:

$$f(\bar x) - f(x^\esx) \leq \frac{L \, V_{x_0}(x^\esx)}{T},$$

which is the expected rate for the smooth case. Note that this improvement does not translate to the online case.
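As a quick numerical sanity check of this rate (an illustrative toy setup I am assuming, not from the post): take $f(x) = \frac{1}{2}\norm{x}_2^2$ over $\Delta_n$, which is $1$-smooth also with respect to $\norm{\cdot}_1$ since $\norm{\nabla f(x) - \nabla f(y)}_\infty = \norm{x - y}_\infty \leq \norm{x - y}_1$, run entropic Mirror Descent with $\eta_t = 1/L$, and verify that the averaged iterate approaches the minimizer, i.e., the uniform distribution:

```python
import numpy as np

n, T = 10, 500
L = 1.0                                    # smoothness constant of f(x) = 0.5*||x||^2
x = np.array([0.91] + [0.01] * (n - 1))    # start far from the minimizer
avg = np.zeros(n)
for _ in range(T):
    x = x * np.exp(-(1.0 / L) * x)         # entropic MD step: grad f(x) = x, eta = 1/L
    x /= x.sum()
    avg += x / T                           # average over x_1, ..., x_T

f_star = 0.5 / n                           # minimizer is the uniform distribution
gap = 0.5 * avg @ avg - f_star             # f(avg) - f(x*) should be O(1/T)
```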

### Strongly convex case

Finally we will show that if $f$ is $\mu$-strongly convex *with respect to $V_x(y)$* (not necessarily smooth though), then we can also obtain improved rates. This improvement translates to the online learning case as well, i.e., we get the corresponding improvement in regret. Recall that a function is $\mu$-strongly convex with respect to $V_x(y)$ if:

$$f(y) \geq f(x) + \langle \partial f(x), y - x \rangle + \mu V_x(y)$$

holds for all $x,y \in \mathbb R^n$. Choosing $x \leftarrow x_t$ and $y \leftarrow x^\esx$, we obtain:

$$\langle \partial f(x_t), x_t - x^\esx \rangle \geq f(x_t) - f(x^\esx) + \mu V_{x_t}(x^\esx).$$

Again we start with (basicMD), which we will modify:

$$\eta_t \langle \partial f(x_t), x_t - x^\esx \rangle \leq V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) + \frac{\eta_t^2}{2} \norm{\partial f(x_t)}_*^2.$$

We now plug in the bound from strong convexity to obtain:

$$\eta_t \left( f(x_t) - f(x^\esx) + \mu V_{x_t}(x^\esx) \right) \leq V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) + \frac{\eta_t^2}{2} \norm{\partial f(x_t)}_*^2,$$

which can be simplified to

$$f(x_t) - f(x^\esx) \leq \frac{(1 - \eta_t \mu) V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx)}{\eta_t} + \frac{\eta_t}{2} \norm{\partial f(x_t)}_*^2.$$

Choosing $\eta_t = \frac{1}{\mu t}$ now we obtain:

$$f(x_t) - f(x^\esx) \leq \mu \left( (t-1) V_{x_t}(x^\esx) - t \, V_{x_{t+1}}(x^\esx) \right) + \frac{\norm{\partial f(x_t)}_*^2}{2 \mu t},$$

which we can finally sum up (starting at $t=1$) and telescope out to arrive at:

$$\sum_{t=1}^{T} \left( f(x_t) - f(x^\esx) \right) \leq \frac{G^2}{2\mu} \sum_{t=1}^{T} \frac{1}{t} \leq \frac{G^2 \left( 1 + \log T \right)}{2 \mu},$$

where $\norm{\partial f(x_t)}_* \leq G$ for all $t = 1, \dots, T$,
and with the usual averaging $\bar x = \frac{1}{T} \sum_{t=1}^{T} x_t$ and using convexity we obtain:

$$f(\bar x) - f(x^\esx) \leq \frac{G^2 \left( 1 + \log T \right)}{2 \mu T}$$

for the convergence rate and

$$\sum_{t=1}^{T} f_t(x_t) - \min_{u \in K} \sum_{t=1}^{T} f_t(u) \leq \frac{G^2 \left( 1 + \log T \right)}{2 \mu}$$

for the regret; note that this bound is already anytime. In order to obtain the regret bound, simply replace $x^\esx$ by an arbitrary $u$ and $f(x_t)$ by $f_t(x_t)$. Note however that this time the argument works directly with the primal differences $f_t(x_t) - f_t(u)$, rather than the dual gaps, i.e., after plugging in the strong convexity inequality we start from:

$$\eta_t \left( f_t(x_t) - f_t(u) + \mu V_{x_t}(u) \right) \leq V_{x_t}(u) - V_{x_{t+1}}(u) + \frac{\eta_t^2}{2} \norm{\partial f_t(x_t)}_*^2$$

and continue the same way.
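For completeness, the $\log T$ factor in the strongly convex bounds comes from the harmonic sum generated by the step sizes $\eta_t = \frac{1}{\mu t}$; with $G$ bounding the dual norms of the subgradients as before:

```latex
% The gradient-error terms accumulate as a harmonic sum,
% which is bounded via the integral comparison 1/t <= int_{t-1}^{t} dz/z (t >= 2):
\sum_{t=1}^{T} \frac{\eta_t}{2} \norm{\partial f(x_t)}_*^2
\;\leq\; \frac{G^2}{2\mu} \sum_{t=1}^{T} \frac{1}{t}
\;\leq\; \frac{G^2}{2\mu} \left( 1 + \int_1^T \frac{dz}{z} \right)
\;=\; \frac{G^2 \left( 1 + \log T \right)}{2\mu}.
```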

### References

[NY] Nemirovsky, A. S., & Yudin, D. B. (1983). Problem complexity and method efficiency in optimization.

[BT] Beck, A., & Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3), 167-175. pdf

[Z] Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03) (pp. 928-936). pdf

[AO] Allen-Zhu, Z., & Orecchia, L. (2014). Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537. pdf

[AHK] Arora, S., Hazan, E., & Kale, S. (2012). The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1), 121-164. pdf

[Z2] Zhang, X. Bregman Divergence and Mirror Descent. pdf

#### Changelog

03/02/2019: Fixed several typos and added clarifications as pointed out by Matthieu Bloch.

03/04/2019: Fixed several typos and a norm/divergence mismatch in the strongly convex case as pointed out by Cyrille Combettes.