This post is about a particular argument involving the Euclidean norm. The basic idea is always the same: we expand the Euclidean norm akin to the binomial formula and then do some form of averaging. We really focus on this specific argument alone here: it is not guaranteed that the estimations are optimal (although they often are), and sometimes the argument can be generalized, e.g., replacing the Euclidean norm with suitable Bregman divergences.

Our first example is (sub-)gradient descent without any fluff; see Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning for a more detailed discussion and its peculiarities. Consider the basic iterative scheme:

\[ \tag{subGD} x_{t+1} \leftarrow x_t - \eta \partial f(x_t), \]

where $f$ is a not necessarily smooth but convex function and $\partial f$ denotes its (sub-)gradients. We show how to establish convergence of the above scheme to an (approximately) optimal solution of $\min_{x \in \RR^n} f(x)$, with $x^\esx$ denoting an optimal solution. To this end, we first expand the Euclidean norm as follows; basically the binomial formula:

\[\begin{align*} \norm{x_{t+1} - x^\esx}^2 & = \norm{x_t - \eta \partial f(x_t) - x^\esx}^2 \\ & = \norm{x_t - x^\esx}^2 - 2 \eta \langle \partial f(x_t), x_t - x^\esx\rangle + \eta^2 \norm{\partial f(x_t)}^2. \end{align*}\]This can be rearranged to

\[\begin{align*} \tag{subGD-iteration} 2 \eta \langle \partial f(x_t), x_t - x^\esx\rangle & = \norm{x_t - x^\esx}^2 - \norm{x_{t+1} - x^\esx}^2 + \eta^2 \norm{\partial f(x_t)}^2. \end{align*}\]Whenever we have an expression of the form (subGD-iteration), we can typically complete the convergence argument in three steps. We first 1) add up those expressions for $t = 0, \dots, T-1$ and 2) telescope to obtain:

\[\begin{align*} \sum_{t = 0}^{T-1} 2\eta \langle \partial f(x_t), x_t - x^\esx\rangle & = \norm{x_0 - x^\esx}^2 - \norm{x_{T} - x^\esx}^2 + \sum_{t = 0}^{T-1} \eta^2 \norm{\partial f(x_t)}^2 \\ & \leq \norm{x_0 - x^\esx}^2 + \sum_{t = 0}^{T-1} \eta^2 \norm{\partial f(x_t)}^2. \end{align*}\]For simplicity, let us further assume that $\norm{\partial f(x_t)} \leq G$ for all $t = 0, \dots, T-1$ for some $G \in \RR$. Then the above simplifies to:

\[\begin{align*} 2\eta \sum_{t = 0}^{T-1} \langle \partial f(x_t), x_t - x^\esx\rangle & \leq \norm{x_0 - x^\esx}^2 + \eta^2 T G^2 \\ \Leftrightarrow \sum_{t = 0}^{T-1} \langle \partial f(x_t), x_t - x^\esx\rangle & \leq \frac{\norm{x_0 - x^\esx}^2}{2\eta} + \frac{\eta}{2} T G^2. \end{align*}\]At this point we *could* minimize the right-hand side by setting

\[\eta \doteq \frac{\norm{x_0 - x^\esx}}{G \sqrt{T}},\]leading to

\[\begin{align*} \sum_{t = 0}^{T-1} \langle \partial f(x_t), x_t - x^\esx\rangle & \leq G \norm{x_0 - x^\esx} \sqrt{T}, \end{align*}\]however, for this we would need to know $G$ and $\norm{x_0 - x^\esx}$ in advance, which is often not practical. Instead, we can simply set $\eta \doteq \sqrt{\frac{1}{T}}$, which is good enough, and obtain:

\[\begin{align*} \sum_{t = 0}^{T-1} \langle \partial f(x_t), x_t - x^\esx\rangle & \leq \frac{G^2 + \norm{x_0 - x^\esx}^2}{2} \sqrt{T}, \end{align*}\]i.e., not knowing the parameters leads to the arithmetic mean rather than the geometric mean of the coefficients.

Then finally, 3) we average by dividing both sides by $T$. Together with convexity and the subgradient property it holds that $f(x_t) - f(x^\esx) \leq \langle \partial f(x_t), x_t - x^\esx\rangle$ and we can conclude:

\[\begin{align*} \tag{convergenceSG} f(\bar x) - f(x^\esx) & \leq \frac{1}{T} \sum_{t = 0}^{T-1} f(x_t) - f(x^\esx) \\ & \leq \frac{1}{T} \sum_{t = 0}^{T-1} \langle \partial f(x_t), x_t - x^\esx\rangle \\ & \leq \frac{G^2 + \norm{x_0 - x^\esx}^2}{2} \frac{1}{\sqrt{T}}, \end{align*}\]where $\bar x \doteq \frac{1}{T} \sum_{t=0}^{T-1} x_t$ is the average of all iterates and the first inequality directly follows from convexity. As such we have effectively shown a $O(1/\sqrt{T})$ convergence rate for our subgradient descent algorithm (subGD).
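The three steps above can be sanity-checked numerically. Below is a minimal sketch (in Python, purely illustrative; the instance $f(x) = \norm{x}_1$ and all names are my choices, not from the post) of (subGD) with the fixed step size $\eta = 1/\sqrt{T}$ and the averaged iterate $\bar x$:

```python
import numpy as np

def subgradient_descent(x0, subgrad, T):
    """Run (subGD) with fixed step size eta = 1/sqrt(T);
    return the average iterate bar_x of x_0, ..., x_{T-1}."""
    eta = 1.0 / np.sqrt(T)
    x = np.asarray(x0, dtype=float).copy()
    avg = np.zeros_like(x)
    for _ in range(T):
        avg += x / T          # accumulate the running average of x_0, ..., x_{T-1}
        x = x - eta * subgrad(x)
    return avg

# Illustrative instance: f(x) = ||x||_1, minimized at x* = 0, with
# subgradient sign(x), so G = sqrt(n) works as the subgradient norm bound.
n, T = 5, 10_000
x0 = np.ones(n)
bar_x = subgradient_descent(x0, np.sign, T)

primal_gap = np.abs(bar_x).sum()        # f(bar_x) - f(x*)
bound = (n + n) / (2 * np.sqrt(T))      # (G^2 + ||x_0 - x*||^2) / (2 sqrt(T))
```

Here `primal_gap` stays below `bound`, matching (convergenceSG).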

It is useful to observe that the algorithm actually minimizes the average of the dual gaps at points $x_t$ given by $\langle \partial f(x_t), x_t - x^\esx\rangle$ and since the average of the dual gaps upper bounds the primal gap of the average point (via convexity) primal convergence follows. Moreover, this type of argument also allows us to obtain guarantees for online learning algorithms by simply observing that we could have used a different function $f$ for each $t$ in (subGD-iteration); for details see Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning.

Next we come to von Neumann’s Alternating Projections algorithm, which he first formulated in his lecture notes in 1933 and which was later (re-)printed in [vN] in 1949. Given two compact convex sets $P$ and $Q$ with associated projection operators $\Pi_P$ and $\Pi_Q$, our goal is to find a point $x \in P \cap Q$; originally von Neumann formulated the argument for linear subspaces, but it is pretty much immediate that his argument holds much more generally. His algorithm is quite straightforward, alternatingly projecting onto the respective sets:

**von Neumann’s Alternating Projections [vN]**

*Input:* Point $y_{0} \in \RR^n$, $\Pi_P$ projector onto $P \subseteq \RR^n$ and $\Pi_Q$ projector onto $Q \subseteq \RR^n$.

*Output:* Iterates $x_1, y_1, \dotsc \in \RR^n$

For $t = 0, 1, \dots$ do:

$\quad x_{t+1} \leftarrow \Pi_P(y_{t})$

$\quad y_{t+1} \leftarrow \Pi_Q(x_{t+1})$

Now suppose $P \cap Q \neq \emptyset$ and let $u \in P \cap Q$ be arbitrary. We will show that the algorithm converges to a point in the intersection. The argument is quite similar to the above. We consider a given iterate, add $0$, and then use the binomial formula (and repeat):

\[\begin{align*} \norm{y_t - u}^2 & = \norm{y_t - x_{t+1} + x_{t+1} - u}^2 = \norm{y_t - x_{t+1}}^2 + \norm{x_{t+1} - u}^2 - 2 \underbrace{\langle x_{t+1} - y_t, x_{t+1} - u \rangle}_{\leq 0} \\ & \geq \norm{y_t - x_{t+1}}^2 + \norm{x_{t+1} - u}^2 = \norm{y_t - x_{t+1}}^2 + \norm{x_{t+1} - y_{t+1} + y_{t+1} - u}^2 \\ & = \norm{y_t - x_{t+1}}^2 + \norm{x_{t+1} - y_{t+1}}^2 + \norm{y_{t+1} - u}^2 - 2 \underbrace{\langle y_{t+1} - x_{t+1} , y_{t+1} - u\rangle}_{\leq 0} \\ & \geq \norm{y_t - x_{t+1}}^2 + \norm{x_{t+1} - y_{t+1}}^2 + \norm{y_{t+1} - u}^2, \end{align*}\]where $\langle x_{t+1} - y_t, x_{t+1} - u \rangle \leq 0$ and $\langle y_{t+1} - x_{t+1} , y_{t+1} - u\rangle \leq 0$ are simply the first-order optimality conditions of the respective projection operator, i.e., if $x_{t+1} = \Pi_P(y_t)$ then $x_{t+1} \in \arg\min_{x \in P} \norm{x - y_t}^2$ and hence $\langle x_{t+1} - y_t, x_{t+1} - u \rangle \leq 0$ for all $u \in P$. Also observe that after a single step we obtain the inequality $\norm{y_t - u}^2 \geq \norm{y_t - x_{t+1}}^2 + \norm{x_{t+1} - u}^2$, i.e., as long as $y_t \neq x_{t+1}$, which we can safely assume as we would be done otherwise, we have $\norm{y_t - u}^2 > \norm{x_{t+1} - u}^2$, so that $x_{t+1}$ moved strictly closer to $u$ than $y_t$; a similar argument applies implicitly to the other set.

The derivation above can be rearranged to

\[\tag{vN-iteration} \norm{y_t - u}^2 - \norm{y_{t+1} - u}^2 \geq \norm{y_t - x_{t+1}}^2 + \norm{x_{t+1} - y_{t+1}}^2,\]which is similar to the iteration (subGD-iteration). As before, we can checkmate this in $3$ moves:

1) Sum up:

\[\sum_{t = 0, \dots, T-1} \left(\norm{y_t - u}^2 - \norm{y_{t+1} - u}^2\right) \geq \sum_{t = 0, \dots, T-1} \left( \norm{y_t - x_{t+1}}^2 + \norm{x_{t+1} - y_{t+1}}^2 \right).\]2) Telescope:

\[\norm{y_0 - u}^2 \geq \sum_{t = 0, \dots, T-1} \left( \norm{y_t - x_{t+1}}^2 + \norm{x_{t+1} - y_{t+1}}^2\right).\]3) Divide by $T$:

\[\frac{\norm{y_0 - u}^2}{T} \geq \frac{1}{T} \sum_{t = 0, \dots, T-1} \left( \norm{y_t - x_{t+1}}^2 + \norm{x_{t+1} - y_{t+1}}^2 \right) \geq \norm{x_{T} - y_{T}}^2,\]where the last inequality holds because the distances are non-increasing, so that we can replace the average by the minimum, which is attained by the last iterate. This shows that $\norm{x_{T} - y_{T}}^2$ goes to $0$ at a rate of $O(1/T)$, and with some minor extra reasoning we can even show that $x_T \rightarrow z$ and $y_T \rightarrow z$ with $z \in P \cap Q$.
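The three moves can again be checked on a toy instance; a minimal sketch (in Python, illustrative only; the two sets, a unit ball and a halfspace, are my choices, not from the post):

```python
import numpy as np

def alternating_projections(y0, proj_P, proj_Q, T):
    """von Neumann's scheme: x_{t+1} = Pi_P(y_t), y_{t+1} = Pi_Q(x_{t+1})."""
    y = np.asarray(y0, dtype=float)
    for _ in range(T):
        x = proj_P(y)
        y = proj_Q(x)
    return x, y

# Illustrative sets: P = unit ball, Q = halfspace {x : x[0] >= 0.5};
# they intersect, e.g. u = (0.5, 0) lies in both.
proj_P = lambda y: y / max(1.0, np.linalg.norm(y))
proj_Q = lambda y: np.array([max(y[0], 0.5), y[1]])

y0, u, T = np.array([3.0, 2.0]), np.array([0.5, 0.0]), 1000
xT, yT = alternating_projections(y0, proj_P, proj_Q, T)

# Step 3) of the argument: ||x_T - y_T||^2 <= ||y_0 - u||^2 / T.
gap_sq = np.linalg.norm(xT - yT) ** 2
```

For these two sets the iterates in fact agree after very few steps, so the $O(1/T)$ bound holds with plenty of slack.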

The Frank-Wolfe algorithm is a first-order method that allows us to optimize an $L$-smooth and convex function $f$ over a compact convex feasible region $P$ for which we have a linear minimization oracle (LMO) available, i.e., we assume that we can optimize linear functions over $P$; for more details see the two cheat sheets Cheat Sheet: Smooth Convex Optimization and Cheat Sheet: Frank-Wolfe and Conditional Gradients. The standard Frank-Wolfe algorithm is presented below:

**Frank-Wolfe Algorithm [FW] (see also [CG])**

*Input:* Smooth convex function $f$ with first-order oracle access, feasible region $P$ with linear optimization oracle access, initial point (usually a vertex) $x_0 \in P$.

*Output:* Sequence of points $x_0, \dots, x_T$

For $t = 0, \dots, T-1$ do:

$\quad v_t \leftarrow \arg\min_{x \in P} \langle \nabla f(x_{t}), x \rangle$

$\quad x_{t+1} \leftarrow (1-\gamma_t) x_t + \gamma_t v_t$

A crucial characteristic of (the family of) Frank-Wolfe algorithms is that they admit a natural dual gap, the so-called *Frank-Wolfe gap* at any point $x \in P$, defined as $\max_{v \in P} \langle \nabla f(x), x - v \rangle$, which most algorithms also naturally compute as part of their iterations. It is straightforward to see that the Frank-Wolfe gap upper bounds the primal gap by convexity:

\[f(x) - f(x^\esx) \leq \langle \nabla f(x), x - x^\esx \rangle \leq \max_{v \in P} \langle \nabla f(x), x - v \rangle,\]and, in contrast to the primal gap, the Frank-Wolfe gap can be “observed”, i.e., we can use it as a stopping criterion. In the case where $f$ is non-convex but smooth—our object of interest here—the Frank-Wolfe gap is still a measure of first-order criticality, albeit it no longer bounds the primal gap: a vanishing Frank-Wolfe gap is necessary for global optimality but not sufficient. We want to show that the Frank-Wolfe gap converges to $0$ in the non-convex but smooth case. For this we start from the smoothness inequality—the only thing that we have in this case—and write:

\[f(x_t) - f(x_{t+1}) \geq \gamma \langle \nabla f(x_t), x_t - v_t \rangle - \gamma^2 \frac{L}{2} \norm{x_t - v_t}^2,\]where $v_t = \arg \max_{v \in P} \langle \nabla f(x_t), x_t - v \rangle$ is the *Frank-Wolfe vertex* at $x_t$. Let $D$ be the diameter of $P$, so that we can bound $\norm{x_t - v_t} \leq D$ for simplicity, and obtain after rearranging:

\[\gamma \langle \nabla f(x_t), x_t - v_t \rangle \leq f(x_t) - f(x_{t+1}) + \gamma^2 \frac{L}{2} D^2,\]and we can continue as in (subGD-iteration): adding up, telescoping, rearranging, and simplifying leads to:

\[\sum_{t = 0, \dots, T-1} \langle \nabla f(x_t), x_t - v_t \rangle \leq \frac{f(x_0) - f(x_{T})}{\gamma} + \gamma T \frac{L}{2} D^2 \leq \frac{f(x_0) - f(x^\esx)}{\gamma} + \gamma T \frac{L}{2} D^2.\]Dividing by $T$ and setting $\gamma \doteq \frac{1}{\sqrt{T}}$, we obtain:

\[\min_{t = 0, \dots, T-1} \langle \nabla f(x_t), x_t - v_t \rangle \leq \frac{1}{T} \sum_{t = 0, \dots, T-1} \langle \nabla f(x_t), x_t - v_t \rangle \leq \frac{f(x_0) - f(x^\esx) + \frac{LD^2}{2}}{\sqrt{T}},\]i.e., the average of the Frank-Wolfe gaps converges to $0$ at a rate of $O(1/\sqrt{T})$ and so does the minimum.
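This bound can be verified on a small non-convex instance; a minimal sketch (in Python, illustrative only; the indefinite quadratic over a box and all constants are my choices, not from the post):

```python
import numpy as np

def fw_min_gap(x0, grad, lmo, T):
    """Vanilla Frank-Wolfe with fixed step gamma = 1/sqrt(T);
    returns the smallest observed Frank-Wolfe gap."""
    gamma = 1.0 / np.sqrt(T)
    x = np.asarray(x0, dtype=float)
    gaps = []
    for _ in range(T):
        g = grad(x)
        v = lmo(g)                        # Frank-Wolfe vertex
        gaps.append(float(g @ (x - v)))   # Frank-Wolfe gap at x_t
        x = (1 - gamma) * x + gamma * v
    return min(gaps)

# Illustrative non-convex instance: f(x) = (x1^2 - x2^2)/2 over the box
# P = [-1,1]^2, so L = 1, D = 2*sqrt(2), and min f = -1/2 on P.
grad = lambda x: np.array([x[0], -x[1]])
lmo = lambda c: -np.sign(c)   # argmin over the box; sign(0) -> 0 is still feasible
f = lambda x: 0.5 * (x[0] ** 2 - x[1] ** 2)

T = 10_000
x0 = np.array([1.0, 0.3])
min_gap = fw_min_gap(x0, grad, lmo, T)

L, D, f_star = 1.0, 2 * np.sqrt(2), -0.5
bound = (f(x0) - f_star + L * D ** 2 / 2) / np.sqrt(T)
```

The minimum observed gap stays below `bound`, matching the displayed rate.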

Finally, we consider the problem of certifying non-membership of a point $x_0 \not \in P$, where $P$ is a polytope for which we have a linear minimization oracle (LMO); the same argument also works for any compact convex $P$ with an LMO, however for simplicity we consider the polytopal case here.

The certificate will be a hyperplane that separates $x_0$ from $P$. We apply the Frank-Wolfe algorithm to minimize the function \(f(x) = \frac{1}{2} \norm{x - x_0}^2\) over $P$, which essentially computes the projection of $x_0$ onto $P$; we rescale by $1/2$ only for convenience.

Our starting point is the following expansion of the norm, similar to what we did before for von Neumann’s alternating projections. Let $v \in P$ be arbitrary and let $x_t$ be the iterates of the Frank-Wolfe algorithm:

\[\begin{align*} & \|x_0 -v \|^2 = \|x_0 - x_t \|^2 + \|x_t -v \|^2 - 2 \langle x_t - x_0, x_t -v \rangle \\ \Leftrightarrow\ & 2 \langle x_t - x_0, x_t -v \rangle = \|x_0 - x_t \|^2 + \|x_t -v \|^2 - \|x_0 -v \|^2 \\ \Leftrightarrow\ & \langle x_t - x_0, x_t -v \rangle = \frac{1}{2} \|x_0 - x_t \|^2 + \frac{1}{2} \|x_t -v \|^2 - \frac{1}{2} \|x_0 -v \|^2, \end{align*} \tag{altDualGap}\]and observe that the left-hand side is the Frank-Wolfe gap expression at iterate $x_t$ (except for the maximization over $v \in P$) as $\nabla f(x_t) = x_t - x_0$. Let $x^* = \arg\min_{x \in P} f(x)$, i.e., the projection of $x_0$ onto $P$ under the Euclidean norm.

We will now derive a characterization for $x_0 \not \in P$, which also provides the certifying hyperplane.

**Necessary Condition.** Suppose \(\|x_t - v \| < \|x_0 - v \|\) for all vertices $v \in P$ in some iteration $t$. With this (altDualGap) reduces to

\[\tag{altTest} \langle x_t - x_0, x_t -v \rangle < \frac{1}{2} \|x_0 - x_t \|^2\]for all vertices $v$ of $P$. Now if we maximize over $v$ on the left-hand side in (altTest) to compute the Frank-Wolfe gap, we obtain \(\max_{v \in P} \langle x_t - x_0, x_t -v \rangle < \frac{1}{2} \|x_0 - x_t \|^2\), i.e., for all $v \in P$ (not just the vertices) it holds that \(\langle x_t - x_0, x_t -v \rangle < \frac{1}{2} \|x_0 - x_t \|^2\). Plugging this back into (altDualGap), we obtain \(\|x_t - v \| < \|x_0 - v \|\) for all $v \in P$ (not just the vertices): this will be important for the equivalence in our characterization below.

Let $v_t$ be the Frank-Wolfe vertex in iteration $t$. We then obtain:

\[\begin{align*} \frac{1}{2} \|x_t - x_0 \|^2 - \frac{1}{2} \|x^* - x_0 \|^2 & = f(x_t) - f(x^*) \\ & \leq \max_{v \in P} \langle \nabla f(x_t), x_t - v \rangle \\ & = \langle \nabla f(x_t), x_t - v_t \rangle \\ & = \langle x_t - x_0, x_t -v_t \rangle < \frac{1}{2} \|x_0 - x_t \|^2. \end{align*}\]Subtracting \(\frac{1}{2} \|x_0 - x_t \|^2\) on both sides and re-arranging yields:

\[0 < \frac{1}{2} \|x^* - x_0 \|^2,\]which proves that $x_0 \not \in P$. Moreover, (altTest) also immediately provides a separating hyperplane: Observe that the inequality

\[\langle x_t - x_0, x_t -v \rangle < \frac{1}{2} \|x_0 - x_t \|^2,\]is actually a linear inequality in $v$ and it holds for all $v \in P$ as stated above. However, at the same time, for the choice $v \leftarrow x_0$ the inequality is violated.

**Sufficient Condition.** Now suppose that in each iteration $t$ there exists a vertex $\bar v_t \in P$ (not to be confused with the Frank-Wolfe vertex), so that
\(\|x_t - \bar v_t \| \geq \|x_0 - \bar v_t \|\). In this case (altDualGap) ensures:

\[\langle x_t - x_0, x_t - \bar v_t \rangle \geq \frac{1}{2} \|x_0 - x_t \|^2.\]Thus, in particular, the Frank-Wolfe gap satisfies in each iteration $t$ that

\[\max_{v \in P} \langle \nabla f(x_t), x_t - v \rangle \geq \langle x_t - x_0, x_t - \bar v_t \rangle \geq \frac{1}{2} \|x_0 - x_t \|^2,\]i.e., the Frank-Wolfe gap upper bounds (half) the squared distance between the current iterate $x_t$ and the point $x_0$ in each iteration. Now the Frank-Wolfe gap converges to $0$ as the algorithm progresses, with iterates $x_t \in P$, so that with the usual arguments (compactness, limits, etc.) it follows that $x_0 \in P$. We are basically done here, but for the sake of argument, observe that by convexity we also have

\[\max_{v \in P} \langle \nabla f(x_t), x_t - v \rangle \geq f(x_t) - f(x^*) = \frac{1}{2} \norm{x_t - x_0}^2 - \frac{1}{2} \norm{x^* - x_0}^2 \geq 0,\]and hence $\norm{x^* - x_0}$ has to be $0$ as well, so that $x_0 = x^*$.

**Characterization.** The following are equivalent:

- (Non-Membership) $x_0 \not \in P$.
- (Distance) there exists an iteration $t$, so that \(\|x_t - v \| < \|x_0 - v \|\) for all vertices $v \in P$.
- (FW Gap) there exists an iteration $t$, so that \(\max_{v \in P} \langle x_t - x_0, x_t -v \rangle < \frac{1}{2} \|x_0 - x_t \|^2\).
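The characterization translates directly into a small membership test; a minimal sketch (in Python, illustrative only; the probability simplex and the step size $\gamma_t = 2/(t+2)$ are my choices, not from the post):

```python
import numpy as np

def certify_nonmembership(x0, vertices, T=1000):
    """Run vanilla FW on f(x) = 0.5*||x - x0||^2 over conv(vertices);
    return True as soon as the (FW Gap) test fires, False otherwise."""
    x = vertices[0].astype(float).copy()
    for t in range(T):
        grad = x - x0
        scores = vertices @ grad
        v = vertices[np.argmin(scores)]      # LMO over the vertex set
        fw_gap = grad @ x - scores.min()     # max_v <x - x0, x - v>
        if fw_gap < 0.5 * np.linalg.norm(x0 - x) ** 2:
            return True                      # certificate: x0 not in P
        x = x + 2.0 / (t + 2) * (v - x)
    return False

# Illustrative instance: P = probability simplex in R^3.
simplex_vertices = np.eye(3)
is_outside = certify_nonmembership(np.array([2.0, 2.0, 2.0]), simplex_vertices)
is_inside_case = certify_nonmembership(np.array([1 / 3, 1 / 3, 1 / 3]), simplex_vertices)
```

For the point $(2,2,2)$ outside the simplex the test fires immediately, while for the centroid it never fires, in line with the equivalence above.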

[vN] Von Neumann, J. (1949). On rings of operators. Reduction theory. Annals of Mathematics, 401-485. pdf

[CG] Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 6(5), 787-823. pdf

[FW] Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1‐2), 95-110. pdf

Finally(!!), after several years in the making, we finished our monograph on Conditional Gradients and Frank-Wolfe methods, together with Gábor Braun, Alejandro Carderera, Cyrille Combettes, Hamed Hassani, Amin Karbasi, and Aryan Mokhtari.

The story of this monograph is a quite winding one. The work on the monograph started out when I was still at Georgia Tech. Together with Alejandro, Cyrille, and Gábor we thought of providing a more formal treatment of a blog post series that I had written on CG methods (see, e.g., here). A little later, right when I returned to Germany, Amin contacted me because he, Hamed, and Aryan were also working on a survey on Frank-Wolfe methods and Conditional Gradients. So we joined forces and this monograph came to be.

The target audience is quite broad, basically ranging from anyone who wants to get into conditional gradients to seasoned optimizers who need a quick overview of key results. We had a couple of tough decisions to make regarding what to include, what to leave out, and to which extent to cover certain topics. No easy decisions, because they not only had to be reasonable by themselves but also had to integrate well with other content decisions. As a consequence, some of my favorite results and proofs did not make it into the final version, and I am sure it is the same for my co-authors; I intend to include some of my personal favorites in upcoming blog posts though.

If you have comments, suggestions, or feedback please let us know!

This post is about `Boscia.jl` and the associated preprint Convex integer optimization with Frank-Wolfe methods by Deborah Hendrych, Hannah Troppens, Mathieu Besançon, and Sebastian Pokutta.

Combining conditional gradient approaches (aka Frank-Wolfe methods) with branch-and-bound is not completely new and has already been explored in [BDRT]. However, due to the overhead of having to solve a (relatively complex) linear minimization problem (the LMO call) in *each iteration* of the Frank-Wolfe subproblem solver, which often means several thousand LMO calls *per node* processed in the branch-and-bound tree, this approach might not scale as well as one would like, as an excessive number of linear minimization problems has to be solved.

In our work, we considered a similar approach in the sense that we also use a Frank-Wolfe (FW) variant as the solver for the nodes. A crucial difference, however, is that we do not relax the integrality requirements in the LMO but directly solve the linear optimization subproblem arising in the Frank-Wolfe algorithm over the *mixed-integer hull of the feasible region* (together with the bounds arising in the tree). So we seemingly make the iterations of the node solver even more expensive. Ignoring the cost of the LMO for a second, this approach can have significant advantages, as the underlying non-linear node relaxation (in fact solving a convex problem over the *mixed-integer hull*) in the branch-and-bound tree can be much tighter, leading to fewer fractional variables and significantly reduced branching. Put differently, the fractionality now arises *only* from the non-linearity of the objective function. Figure 1 below may provide some intuition as to why this is beneficial.

**Figure 1.** Solving stronger subproblems, directly optimizing over the mixed-integer hull can be very powerful. (left) baseline and fractional optimal solution, (middle-left) branching on fractional variable leads only to minor improvement, (middle-right) direct optimization over the mixed-integer hull and fractional solution over the mixed-integer hull, (right) branching once results in optimal solution.

As such, our approach is a *Mixed-Integer Conditional Gradient* Algorithm that combines a specific Frank-Wolfe algorithm, the Blended Pairwise Conditional Gradient (BPCG) approach of [TTP] (see also the earlier BPCG post), with a branch-and-bound scheme, and solves *MIPs as LMOs*. Our approach combines several very powerful improvements from conditional gradients with state-of-the-art MIP techniques to obtain a fast algorithm.

*Leveraging MIP improvements.* As our LMOs in the FW subsolver are standard MIPs, we can exploit the whole toolbox of MIP solver improvements. In particular, we can use solution pools to collect and reuse previously identified feasible solutions. Note that as our feasible region is never modified (in contrast to, e.g., outer approximation relying on epigraph formulations), all discovered primal solutions are globally feasible. In future releases we will also support more advanced reoptimization features of modern MIP solvers, such as carrying over certain presolve information, propagation, and cutting planes.

*Incomplete resolution of nodes.* We do not have to solve nodes to near-optimality; rather, we can use an adaptive gap strategy to only partially resolve subproblems, significantly reducing the number of iterations in the FW subsolver.

*Warmstarting.* We can warmstart the FW subsolver from the run before branching, further cutting down required iterations. For this we can efficiently (as in: for free) write the parent solution as a convex combination of two distinct solutions valid for the left and the right branch, respectively.

*Lazification and blending.* We generalize both the lazification [BPZ] and blending [BPTW] approaches to apply to the whole tree, allowing us to aggressively reuse previously found primal feasible solutions. In particular, no primal feasible solution has to be computed or discovered twice.

*Hybrid branching strategy.* Finally, we developed a hybrid branching strategy that can further reduce the number of required nodes; however, this is highly dependent on the instance.

Combining and exploiting all these things results in only having to make something like 5-8 LMO calls per node on average, and the longer the run and the deeper the tree, the smaller this number gets. Asymptotically, the average number of LMO calls approaches something close to 1, as no MIP-feasible solution is computed twice, as mentioned above. For a more complex instance, the overall impact of these tricks is shown in Figure 2 below in terms of the average number of LMO calls per node.

**Figure 2.** Minimizing a quadratic over a cardinality constrained variant of the Birkhoff polytope. Key statistics as a function of the node depth. Both the size of the active (vertex) set and the discarded (vertex) set remain small throughout the run of the algorithm and the average number of LMO calls required to solve a subproblem drops significantly as a function of depth as previously discovered solutions cut out unnecessary LMO calls.

For more details see the preprint [HTBP].

- Released under MIT license. Do whatever you want with it and consider contributing to the code base.
- Uses our earlier `FrankWolfe.jl` Julia package as well as the `Bonobo.jl` branch-and-bound Julia package.
- Uses the Blended Pairwise Conditional Gradient (BPCG) algorithm [TTP] together with (a modified variant of) the adaptive step-size strategy of [PNAJ] from the `FrankWolfe.jl` package.
- Supports a wide variety of MIP solvers through the MathOptInterface (MOI). Currently, we use `SCIP.jl` with SCIP 8 in our examples.
- Reads .mps and .lp files out of the box via MOI, allowing to easily replace linear objectives by convex objectives.
- The interface is identical to that of `FrankWolfe.jl`: specify the objective and its gradient and provide an LMO for the feasible region.

Most certainly there will be still bugs, calibration issues, and numerical issues in the solver. Any feedback, bug reports, issues, PRs are highly welcome on the package’s github repository.

```
using Boscia
using FrankWolfe
using Random
using SCIP
using LinearAlgebra
import MathOptInterface
const MOI = MathOptInterface

n = 6
const diffw = 0.5 * ones(n)

##############################
# defining the LMO and using
# SCIP as solver for the LMO
##############################
o = SCIP.Optimizer()
MOI.set(o, MOI.Silent(), true)
x = MOI.add_variables(o, n)
for xi in x
    MOI.add_constraint(o, xi, MOI.GreaterThan(0.0))
    MOI.add_constraint(o, xi, MOI.LessThan(1.0))
    MOI.add_constraint(o, xi, MOI.ZeroOne())
end
lmo = FrankWolfe.MathOptLMO(o) # MOI-based LMO

##############################
# defining objective and
# gradient
##############################
function f(x)
    return sum(0.5 * (x .- diffw) .^ 2)
end

function grad!(storage, x)
    @. storage = x - diffw
end

##############################
# calling the solver
##############################
x, _, result = Boscia.solve(f, grad!, lmo, verbose = true)
```

The output - which is quite wide - then roughly looks like this:

```
Boscia Algorithm.
Parameter settings.
Tree traversal strategy: Move best bound
Branching strategy: Most infeasible
Absolute dual gap tolerance: 1.000000e-06
Relative dual gap tolerance: 1.000000e-02
Frank-Wolfe subproblem tolerance: 1.000000e-05
Total number of varibales: 6
Number of integer variables: 0
Number of binary variables: 6
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Iteration Open Bound Incumbent Gap (abs) Gap (rel) Time (s) Nodes/sec FW (ms) LMO (ms) LMO (calls c) FW (Its) #ActiveSet Discarded
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
* 1 2 -1.202020e-06 7.500000e-01 7.500012e-01 Inf 3.870000e-01 7.751938e+00 237 2 9 13 1 0
100 27 6.249998e-01 7.500000e-01 1.250002e-01 2.000004e-01 5.590000e-01 2.271914e+02 0 0 641 0 1 0
127 0 7.500000e-01 7.500000e-01 0.000000e+00 0.000000e+00 5.770000e-01 2.201040e+02 0 0 695 0 1 0
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Postprocessing
Blended Pairwise Conditional Gradient Algorithm.
MEMORY_MODE: FrankWolfe.InplaceEmphasis() STEPSIZE: Adaptive EPSILON: 1.0e-7 MAXITERATION: 10000 TYPE: Float64
GRADIENTTYPE: Nothing LAZY: true lazy_tolerance: 2.0
[ Info: In memory_mode memory iterates are written back into x0!
----------------------------------------------------------------------------------------------------------------
Type Iteration Primal Dual Dual Gap Time It/sec #ActiveSet
----------------------------------------------------------------------------------------------------------------
Last 0 7.500000e-01 7.500000e-01 0.000000e+00 1.086583e-03 0.000000e+00 1
----------------------------------------------------------------------------------------------------------------
PP 0 7.500000e-01 7.500000e-01 0.000000e+00 1.927792e-03 0.000000e+00 1
----------------------------------------------------------------------------------------------------------------
Solution Statistics.
Solution Status: Optimal (tree empty)
Primal Objective: 0.75
Dual Bound: 0.75
Dual Gap (relative): 0.0
Search Statistics.
Total number of nodes processed: 127
Total number of lmo calls: 699
Total time (s): 0.58
LMO calls / sec: 1205.1724137931035
Nodes / sec: 218.96551724137933
LMO calls / node: 5.503937007874016
```

[TTP] Tsuji, K., Tanaka, K. I., & Pokutta, S. (2021). Sparser kernel herding with pairwise conditional gradients without swap steps. to appear in Proceedings of ICML. arXiv preprint arXiv:2110.12650. pdf

[PNAJ] Pedregosa, F., Negiar, G., Askari, A., & Jaggi, M. (2020, June). Linearly convergent Frank-Wolfe with backtracking line-search. In International Conference on Artificial Intelligence and Statistics (pp. 1-10). PMLR. pdf

[BDRT] Buchheim, C., De Santis, M., Rinaldi, F., & Trieu, L. (2018). A Frank–Wolfe based branch-and-bound algorithm for mean-risk optimization. Journal of Global Optimization, 70(3), 625-644. pdf

[HTBP] Hendrych, D., Troppens, H., Besançon, M., & Pokutta, S. (2022). Convex integer optimization with Frank-Wolfe methods. arXiv preprint arXiv:2208.11010. pdf

[BPTW] Braun, G., Pokutta, S., Tu, D., & Wright, S. (2019, May). Blended conditional gradients. In International Conference on Machine Learning (pp. 735-743). PMLR. pdf

[BPZ] Braun, G., Pokutta, S., & Zink, D. (2017, July). Lazifying conditional gradient algorithms. In International conference on machine learning (pp. 566-575). PMLR. pdf

*Written by Elias Wirth.*

Frank-Wolfe algorithms (FW) [F], see Algorithm 1, are popular first-order methods to solve convex constrained optimization problems of the form

\[\min_{x\in \mathcal{C}} f(x),\]where $\mathcal{C}\subseteq\mathbb{R}^d$ is a compact convex set and $f\colon \mathcal{C} \to \mathbb{R}$ is a convex and smooth function. FW and its variants rely on a linear minimization oracle instead of potentially expensive projection-like oracles. Many works have identified accelerated convergence rates under various structural assumptions on the optimization problem and for specific FW variants when using line search or short-step, requiring feedback from the objective function, see, e.g., [GH13, GH15, GM, J, LJ] for an incomplete list of references.

**Algorithm 1.** Frank-Wolfe algorithm (FW) [F]

*Input:* Starting point $x_0\in\mathcal{C}$, step-size rule $\eta_t \in [0, 1]$.

$\text{ }$ 1: $\text{ }$ **for** $t=0$ **to** $T$ **do**

$\text{ }$ 2: $\quad$ $p_t\in \mathrm{argmin}_{p\in \mathcal{C}} \langle \nabla f(x_t), p - x_t\rangle$

$\text{ }$ 3: $\quad$ $x_{t+1} \gets (1-\eta_t) x_t + \eta_t p_t$

$\text{ }$ 4: $\text{ }$ **end for**

However, little is known about accelerated convergence regimes utilizing open loop step-size rules, a.k.a. FW with pre-determined step-sizes, which are algorithmically extremely simple and stable. In our paper, most of the open loop step-size rules are of the form

\[\eta_t = \frac{4}{t+4}.\]One of the main motivations for studying FW with open loop step-size rules is an unexplained phenomenon in kernel herding: in the right plot of Figure 3 of [BLO], the authors observe that FW with open loop step-size rules converges at the optimal rate [BLO] of $\mathcal{O}(1/t^2)$, whereas FW with line search or short-step converges only at a rate of $\Omega(1/t)$. Despite substantial research interest in the connection between FW and kernel herding, so far this behaviour remained unexplained.
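To make the open loop rule concrete, here is a minimal sketch (in Python, purely illustrative and not from the paper) of Algorithm 1 with $\eta_t = 4/(t+4)$ on an assumed toy instance: minimizing $f(x) = \frac{1}{2}\|x - b\|^2$ over the unit ball, with $b$ outside, so the optimum lies on the boundary:

```python
import numpy as np

def fw_open_loop(x0, grad, lmo, T):
    """Frank-Wolfe (Algorithm 1) with the open loop rule eta_t = 4/(t+4):
    no line search and no feedback from the objective f."""
    x = np.asarray(x0, dtype=float)
    for t in range(T):
        p = lmo(grad(x))                 # p_t in argmin_{p in C} <grad f(x_t), p>
        x = x + 4.0 / (t + 4) * (p - x)  # open loop, pre-determined step size
    return x

# Illustrative instance: f(x) = 0.5*||x - b||^2 over the unit ball with b
# outside, so x* = b/||b|| lies on the boundary of the feasible region.
b = np.array([2.0, 1.0])
grad = lambda x: x - b
lmo = lambda c: -c / np.linalg.norm(c)   # argmin_{||p|| <= 1} <c, p>
f = lambda x: 0.5 * np.linalg.norm(x - b) ** 2
x_star = b / np.linalg.norm(b)

xT = fw_open_loop(np.array([0.0, 1.0]), grad, lmo, 5000)
```

Despite using no feedback from $f$ at all, the primal gap $f(x_T) - f(x^*)$ becomes small; no particular rate is claimed by this sketch.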

However, kernel herding is not the only problem setting for which FW with open loop step-size rules can converge faster than FW with line search or short-step. In [B], the author proves that when the feasible region is a polytope, the objective function is strongly convex, the optimum lies in the interior of an at least one-dimensional face of the feasible region $\mathcal{C}$, and some other mild assumptions are satisfied, FW with open loop step-size rules asymptotically converges at a rate of $\mathcal{O}(1/t^2)$. Combined with the convergence rate lower bound of $\Omega(1/t^{1+\epsilon})$ for any $\epsilon > 0$ for FW with line search or short-step [W], this characterizes a setting for which FW with open loop step-size rules converges asymptotically faster than FW with line search or short-step.

The main goal of the paper is to address the current gaps in our understanding of FW with open loop step-size rules.

**1. Accelerated rates depending on the location of the unconstrained optimum.**
For FW with open loop step-size rules, the primal gap does not decay monotonically, unlike for FW with line search or short-step.
We thus derive a different proof template that captures several of our
acceleration results: For FW with open loop step-size rules, we derive convergence rates of
up to $\mathcal{O}(1/t^2)$

- when the feasible region is uniformly convex and the optimum lies in the interior of the feasible region,
- when the objective satisfies a Hölderian error bound (think, relaxation of strong convexity) and the optimum lies in the exterior of the feasible region,
- and when the feasible region is uniformly convex and the objective satisfies a Hölderian error bound.

**2. FW with open loop step-size rules can be faster than with line search or short-step.**
We derive a non-asymptotic version of the accelerated convergence result in [B]. More
specifically, we show that when the feasible region is a polytope, the optimum lies in the interior of an at least one-dimensional
face of $\mathcal{C}$, the objective is strongly convex, and some additional mild assumptions are satisfied,
FW with open loop step-size rules converges at a non-asymptotic rate of $O(1/t^2)$. Combined with the convergence rate
lower bound for FW with line search or short-step [W], we thus characterize problem instances
for which FW with open loop step-size rules converges non-asymptotically faster than FW with line search or short-step.

**Figure 1.** Convergence over the probability simplex. Depending on the position of the constrained optimum, line search can be slower than open loop (left) or faster than open loop (right).
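To illustrate the two step-size regimes discussed above, here is a minimal numpy sketch of vanilla FW over the probability simplex with the open loop rule $\eta_t = 4/(t+4)$ versus the short-step rule. This is not the paper's experimental setup; the quadratic objective, dimension, and iteration budget are arbitrary choices for illustration.

```python
import numpy as np

def fw(grad, lmo, x0, T, step="open_loop", L=2.0):
    """Vanilla Frank-Wolfe with either the open loop rule
    eta_t = 4/(t+4) or the short-step rule for an L-smooth objective."""
    x = x0.copy()
    for t in range(T):
        g = grad(x)
        v = lmo(g)                       # linear minimization oracle
        d = v - x                        # FW direction
        if step == "open_loop":
            eta = 4.0 / (t + 4)
        else:                            # short-step: clip optimal smoothness step to [0, 1]
            dd = d @ d
            eta = min(1.0, (-g @ d) / (L * dd)) if dd > 0 else 0.0
        x = x + eta * d
    return x

n = 50
b = np.full(n, 1.0 / n)                  # unconstrained optimum: interior (uniform) point
grad = lambda x: 2 * (x - b)             # f(x) = ||x - b||^2, smooth with L = 2
lmo = lambda g: np.eye(n)[np.argmin(g)]  # LMO over the simplex returns a vertex
x0 = np.eye(n)[0]

x_ol = fw(grad, lmo, x0, T=1000, step="open_loop")
x_ss = fw(grad, lmo, x0, T=1000, step="short_step")
print(np.sum((x_ol - b) ** 2), np.sum((x_ss - b) ** 2))
```

Both variants stay feasible because each update is a convex combination of the current iterate and a vertex.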

**3. Algorithmic variants.**
When the feasible region is a polytope, we also study FW variants that were traditionally used to overcome the
convergence rate lower bound [W] for FW with line search or short-step. Specifically, we present open loop versions of
the Away-Step Frank-Wolfe algorithm (AFW) [LJ] and the Decomposition-Invariant Frank-Wolfe algorithm (DIFW) [GH13]. For
both algorithms, we derive convergence rates of order $O(1/t^2)$.

**4. Addressing an unexplained phenomenon in kernel herding.**
We answer the open problem from [BLO], that is, we explain why FW with open loop step-size rules converges at a
rate of $\mathcal{O}(1/t^2)$ in the infinite-dimensional kernel herding setting of the right plot of Figure 3 in [BLO].

**Figure 2.** Kernel herding. Open loop step-sizes can outperform line search (and short-step): (left) uniform case, (right) non-uniform case.

**5. Improved convergence rate after finite burn-in.**
For many of our results, so as to not contradict the convergence rate lower bound of [J], the derived accelerated
convergence rates only hold after an initial number of iterations, that is, the accelerated rates require a
burn-in phase. This phenomenon is also referred to as accelerated local convergence [CDLP, DCP]. We study this behaviour
both in theory and with numerical experiments for FW with open loop step-size rules.

[B] Bach, F. (2021). On the effectiveness of Richardson extrapolation in data science. SIAM Journal on Mathematics of Data Science, 3(4):1251–1277.

[BLO] Bach, F., Lacoste-Julien, S., and Obozinski, G. (2012). On the equivalence between herding and conditional gradient algorithms. In ICML 2012 International Conference on Machine Learning.

[CDLP] Carderera, A., Diakonikolas, J., Lin, C. Y., and Pokutta, S. (2021a). Parameter-free locally accelerated conditional gradients. arXiv preprint arXiv:2102.06806.

[DCP] Diakonikolas, J., Carderera, A., and Pokutta, S. (2020). Locally accelerated conditional gradients. In International Conference on Artificial Intelligence and Statistics, pages 1737–1747. PMLR.

[F] Frank, M., Wolfe, P., et al. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95–110

[GH13] Garber, D. and Hazan, E. (2013). A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666.

[GH15] Garber, D. and Hazan, E. (2015). Faster rates for the Frank-Wolfe method over strongly convex sets. In International Conference on Machine Learning, pages 541–549. PMLR.

[GM] Guélat, J. and Marcotte, P. (1986). Some comments on Wolfe’s ‘away step’. Mathematical Programming, 35(1):110–119.

[J] Jaggi, M. (2013). Revisiting Frank-Wolfe: projection-free sparse convex optimization. In International Conference on Machine Learning, pages 427–435. PMLR.

[LJ] Lacoste-Julien, S. and Jaggi, M. (2015). On the global linear convergence of frank-wolfe optimization variants. Advances in Neural Information Processing Systems, 28:496–504.

[W] Wolfe, P. (1970). Convergence theory in nonlinear programming. Integer and nonlinear programming, pages 1–36.

*Written by Sebastian Pokutta.*

One of the most appealing conditional gradient algorithms is the *Pairwise Conditional Gradient (PCG)* algorithm introduced in [LJ] as a modification of the Away-step Frank-Wolfe (AFW) algorithm. The Pairwise Conditional Gradient algorithm essentially performs a normal Frank-Wolfe step and an away-step simultaneously. This gives much faster convergence in practice; however, in the analysis so-called *swap steps* appear, where weight is shifted from an away-vertex to a new Frank-Wolfe vertex. For these steps, we cannot bound the primal progress and, additionally, there are potentially many of them, so that the theoretical convergence bounds are much worse than what we observe in practice. In fact, its guarantees are worse than the guarantees for the Away-step Frank-Wolfe algorithm, which in turn is almost always outperformed by the Pairwise Conditional Gradient algorithm in practice. Various modifications of PCG have been suggested, e.g., in [RZ] and [MGP], to deal with this issue, however often requiring subprocedures whose costs cannot be easily bounded.

It should also be mentioned that the PCG algorithm is particularly nice in the case of polytopes, where it almost becomes a combinatorial algorithm: the resulting directions always arise from line segments between two vertices, and thus there are only finitely many such directions.

By borrowing machinery from [BPTW], we show that with a minor modification of the PCG algorithm we can avoid swap steps altogether. In a nutshell, the idea is to limit the pairwise directions to those formed by FW vertices and away-vertices from the current active set, and only if those steps are not good enough do we perform a normal FW step. This way swap steps cannot appear anymore; however, we require a key technical lemma showing that the reduced pairwise steps are still good enough. We call this algorithm the *Blended Pairwise Conditional Gradient (BPCG)* algorithm; see below.

The resulting algorithm has a theoretical convergence rate that is basically the same as the one for the AFW algorithm (up to small constant factors). In fact, it inherits all convergence proofs that hold for the AFW algorithm. Moreover, it exhibits the same convergence speed as (and is often even faster than) the original PCG algorithm in practice. The algorithm also works in the infinite-dimensional setting, which is not true for the original PCG algorithm due to the dimension dependence arising from the number of swap steps in the convergence rate. The iterates produced by BPCG are very sparse, where sparsity is measured in the number of elements in the convex combinations of the iterates, making it very suitable, e.g., for kernel herding.

**Blended Pairwise Conditional Gradient Algorithm (BPCG); slightly simplified**

*Input:* Smooth convex function $f$ with first-order oracle access, feasible region (polytope) $P$ with linear optimization oracle access, initial vertex $x_0 \in P$.

*Output:* Sequence of points $x_0, \dots, x_{T}$

\(S_0 = \{x_0\}\)

**For** $t = 0, \dots, T-1$ **do**:

$\quad a_t \leftarrow \arg\max_{v \in S_t} \langle \nabla f(x_{t}), v \rangle$ $\qquad$ {Away-vertex over $S_t$}

$\quad s_t \leftarrow \arg\min_{v \in S_t} \langle \nabla f(x_{t}), v \rangle$ $\qquad$ {Local FW-vertex over $S_t$}

$\quad w_t \leftarrow \arg\min_{v \in P} \langle \nabla f(x_{t}), v \rangle$ $\qquad$ {(Global) FW-vertex over $P$}

$\quad$ **If** \(\langle \nabla f(x_{t}), a_t - s_t \rangle \geq \langle \nabla f(x_{t}), x_t - w_t \rangle\): $\qquad$ {Local gap as large as global gap}

$\quad\quad$ $d_t = a_t - s_t$ $\qquad$ {Pick (local) pairwise direction}

$\quad\quad$ $x_{t+1} \leftarrow x_t - \gamma_t d_t$ via line search such that the residual weight of $a_t$ remains nonnegative

$\quad\quad$ **If** $a_t$ removed **then** {Drop step} \(S_{t+1} \leftarrow S_t \setminus \{a_t\}\) **else** {Descent step} \(S_{t+1} \leftarrow S_t\)

$\quad$ **Else** $\qquad$ {Normal FW Step}

$\quad\quad$ $d_t = x_t - w_t$ $\qquad$ {Pick FW direction}

$\quad\quad$ $x_{t+1} \leftarrow x_t - \gamma_t d_t$ via line search with $\gamma_t \in [0,1]$

$\quad\quad$ Update \(S_{t+1} \leftarrow S_t \cup \{w_t\}\)

The key to the convergence proofs is the following lemma, which shows that these local pairwise steps combined with global FW steps are good enough to ensure sufficient progress per iteration:

**Key Lemma.** In each iteration $t$ it holds:
\[
2 \langle \nabla f(x_{t}), d_t \rangle \geq \langle \nabla f(x_{t}), a_t - w_t \rangle.
\]

For those in the know this is all you need to prove convergence: the term on the right-hand side is the strong Wolfe gap, and as such the progress from smoothness with $d_t$ can be lower bounded by the progress from smoothness with the strong Wolfe gap. See, e.g., Cheat Sheet: Smooth Convex Optimization and Cheat Sheet: Linear convergence for Conditional Gradients (towards the end, when analyzing AFW) to understand how one continues from here. Similarly, with this inequality we can apply, e.g., the reasoning in Cheat Sheet: Hölder Error Bounds (HEB) for Conditional Gradients to extend the results to, e.g., sharp functions.

We also performed several computational experiments to evaluate the performance of BPCG; in fact, the algorithm has also been implemented in the *FrankWolfe.jl* Julia package (see this post or [BCP]) and is now the recommended default active-set based conditional gradient algorithm. With a simple trick, namely adding a factor in front of the local gap in the gap test of the algorithm, one can further improve sparsity; see the original paper for more details.

Below in Figure 1, we provide a simple convergence test for the approximate Carathéodory problem. We can see that in iterations BPCG is basically identical to PCG (as expected); in wallclock time, however, it is faster, as the local updates are often cheaper.

**Figure 1.** Convergence on approximate Carathéodory instance over polytope of dimension $n=200$.

While the convergence plot above is quite typical and in terms of speed per iteration or wallclock time BPCG is usually at least as good as PCG (and sometimes faster), the real advantage is often in terms of sparsity as the preference for local steps promotes sparsity. This can be seen in the two plots below.

**Figure 2.** Sparse regression problem over $l_5$-norm ball. Here we plot primal value and dual gap vs. size of the active set. BPCG consistently delivers smaller primal and dual values for the same number of atoms in the active set.

**Figure 3.** Movielens matrix completion problem. Same logic as above with similar results.

Finally we also considered various kernel herding problems. Here are two examples; the graphs are a little packed.

**Figure 4.** Kernel herding for Matérn kernel (left) and Gaussian kernel (right). In both cases BPCG delivers results on par with the Sequential Bayesian Quadrature method (SBQ), however at a fraction of the cost.

[LJ] Lacoste-Julien, S., & Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems (pp. 496-504). pdf

[RZ] Rinaldi, F., & Zeffiro, D. (2020). A unifying framework for the analysis of projection-free first-order methods under a sufficient slope condition. arXiv preprint arXiv:2008.09781. pdf

[MGP] Mortagy, H., Gupta, S., & Pokutta, S. (2020). Walking in the shadow: A new perspective on descent directions for constrained minimization. Advances in Neural Information Processing Systems, 33, 12873-12883. pdf

[BPTW] Braun, G., Pokutta, S., Tu, D., & Wright, S. (2019, May). Blended conditional gradients. In International Conference on Machine Learning (pp. 735-743). PMLR. pdf

[BCP] Besançon, M., Carderera, A., & Pokutta, S. (2022). FrankWolfe. jl: A High-Performance and Flexible Toolbox for Frank–Wolfe Algorithms and Conditional Gradients. INFORMS Journal on Computing. pdf

*Posts in this series (so far).*

*My apologies for incomplete references—this should merely serve as an overview.*

This will be a series on quantum computing. Our perspective here will be a more mathematical or computer science one. I am not a physicist so I will not be able to provide sophisticated physical interpretations. Nonetheless, I will try to provide physics context here and there to highlight the difficulties when going from the rather abstract mathematical formalism of quantum mechanics and quantum computing to the real (physical) world, which leads to many challenging—sometimes philosophical—problems. Feel free to comment if you have suggestions for improvements.

In this first installment we will really just look at the basics of quantum computing and I end this post with a famous motivating example showing the power of quantum mechanics. Most of what we are going to see today is linear algebra with Dirac notation; consider this a warm-up to get used to the notation as well as a refresh on linear algebra basics in the context of quantum mechanics, which is the basis for quantum computing. For a more extensive introduction, check out [dW19], [M07], and [P21] which I heavily relied upon and from where some of the examples are taken. I also extensively used wikipedia, which has quite accessible articles on most of the basic stuff that we will see today.

We will be working in Hilbert spaces over complex numbers. A very useful notation in quantum mechanics is the *Dirac notation* (also called: *bra-ket notation*), which is used to write quantum states, which in turn are nothing else but special vectors in that Hilbert space. Slightly abusing notation, following Dirac’s original intent, basically an element $\phi \in \mathcal H$ on the primal side is a *ket*, written as $\ket{\phi}$ and corresponding to a column vector, and an element $\psi \in \mathcal H$ on the dual side is a *bra*, written as $\bra{\psi}$, corresponding to a row vector. This notation has many advantages as it ensures that we automatically distinguish between primal and dual, and the inner product follows naturally; we will see all this in a second.

Usually we will have an orthonormal basis (say, $\ket{0}, \dots, \ket{N-1}$ of $N$ vectors) that generates our Hilbert space $\mathcal H = \langle \ket{0}, \dots, \ket{N-1} \rangle$ and each element $\ket{\phi} \in \mathcal H$ (abusing notation here), is given by:

\[\ket{\phi} = \sum_{i = 0}^{N-1} \alpha_i \ket{i} \qquad \text{ with } \qquad \alpha_i \in \CC,\]equivalently, due to the standard isomorphism in the finite dimensional case, we can write

\[\newcommand\vec[1]{\begin{pmatrix}#1\end{pmatrix}} \ket{\phi} = \vec{\alpha_0 \\ \vdots \\ \alpha_{N-1}},\]and naturally associated with each ket $\ket{\phi}$ is a bra $\bra{\phi}$, which is defined as the conjugate transpose of $\ket{\phi}$:

\[\newcommand\vec[1]{\begin{pmatrix}#1\end{pmatrix}} \bra{\phi} = \vec{\alpha_0^\esx, \dots, \alpha_{N-1}^\esx},\]where the $\esx$ denotes the conjugate operation here, mapping a complex number $\alpha = x + iy$ to $\alpha^\esx = x + i (-y) = x - i y$. Note, that the bra-ket notation is simply a different notation for vectors and in particular, it holds:

\[\ket{a \phi + b \gamma} = a \ket{\phi} + b \ket{\gamma} \qquad \text{ and } \qquad \bra{a \phi + b \gamma} = a^\esx \bra{\phi} + b^\esx \bra{\gamma}.\]However, the bra notation has the built-in conjugate for its coefficients, which ensures that basically all properties, e.g., of the inner product simply follow from applying the “Euclidean”-style inner product.

With the above we naturally obtain our scalar product as our basis is orthonormal, i.e., \(\braket{i \mid j} = \delta_{ij}\). To this end, let

\[\newcommand\vec[1]{\begin{pmatrix}#1\end{pmatrix}} \ket{\phi} = \vec{\alpha_0 \\ \vdots \\ \alpha_{N-1}} \qquad \text{ and } \qquad \bra{\psi} = \vec{\beta_0^\esx, \dots, \beta_{N-1}^\esx},\]then we have that

\[\newcommand\vec[1]{\begin{pmatrix}#1\end{pmatrix}} \braket{\psi \mid \phi} = \vec{\beta_0^\esx, \dots, \beta_{N-1}^\esx} \cdot \vec{\alpha_0 \\ \vdots \\ \alpha_{N-1}} = \sum_{i = 0}^{N-1} \beta_i^\esx \alpha_i,\]where we have exploited the built-in conjugate in the bra.
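A small numpy sketch of this inner product; `np.vdot` conjugates its first argument, which matches the built-in conjugate of the bra. The concrete vectors are arbitrary example coefficients.

```python
import numpy as np

# kets as 1-D complex coefficient arrays in the computational basis
phi = np.array([1 + 1j, 2, 0])     # |phi> = sum_i alpha_i |i>
psi = np.array([1j, 1, 1 - 1j])    # |psi> = sum_i beta_i |i>

def braket(psi, phi):
    """<psi|phi> = sum_i beta_i^* alpha_i; np.vdot conjugates its first argument."""
    return np.vdot(psi, phi)

# Hermitian form: <psi|phi> = <phi|psi>^*
assert np.isclose(braket(psi, phi), np.conj(braket(phi, psi)))
# anti-linearity in the left-hand side
a, b = 2 - 1j, 0.5j
assert np.isclose(braket(a * psi, phi), np.conj(a) * braket(psi, phi))
print(braket(psi, phi))
```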

**Properties of $\braket{\psi \mid \phi}$.**

a. $\braket{\psi \mid \phi}$ is a Hermitian form, i.e., $\braket{\psi \mid \phi} = \braket{\phi \mid \psi}^\esx$

b. linear in right-hand side: $\braket{\psi \mid a \phi + b \gamma} = a \braket{\psi \mid \phi} + b \braket{\psi \mid \gamma}$

c. anti-linear in left-hand side: $\braket{a \psi + b \delta \mid \phi} = a^\esx \braket{\psi \mid \phi} + b^\esx \braket{\delta \mid \phi}$

d. $\braket{\psi \mid \phi} \in \CC$

e. $\braket{\phi \mid \phi} \in \RR$ and $\braket{\phi \mid \phi} > 0$ iff $\ket{\phi} \neq 0$

We also obtain the *squared norm* $\norm{\ket{\phi}}^2 \doteq \braket{\phi \mid \phi}$ and the induced norm $\norm{\ket{\phi}}$.

In the following let $A^\dagger$ denote the *adjoint* of the matrix $A$, which is nothing else but the conjugate transpose of $A$, i.e., $A^\dagger = (A^T)^\esx$. In particular, if $A$ corresponds to multiplication with $z \in \CC$, then $A^\dagger$ corresponds to the multiplication with $z^\esx$. It is useful here to extend the dagger notion also to vectors to render bras and kets dual to each other, i.e., $(\ket{\phi})^\dagger = \bra{\phi}$, which is in line with our definition of the bra as conjugate transpose of the ket. With the general rule that the adjoint of the product is equal to the reverse-order product of the adjoints, most of the below follow naturally; see also [M07] for a broader exposition. For some of those rules, we assume that we will be working with finite dimensional vector spaces.

**Useful rules.**

a. $\ket{A \phi} = A \ket{\phi}$ and $\bra{A\phi} = \bra{\phi} A^\dagger$.

b. $(A \ket{\phi})^\dagger = (\ket{A \phi})^\dagger = \bra{\phi} A^\dagger$.

c. $A (\alpha \ket{\phi} + \beta \ket{\psi}) = \alpha A \ket{\phi} + \beta A \ket{\psi}$ and $(\alpha \bra{\phi} + \beta \bra{\psi}) A = \alpha \bra{\phi} A + \beta \bra{\psi} A = \alpha \bra{A^\dagger \phi} + \beta \bra{A^\dagger \psi}$.

d. If $U$ is a unitary matrix, i.e., $U^\dagger U = U U^\dagger = I$ then $\braket{\psi \mid \phi} = \braket{U \psi \mid U\phi}$.

With this we can now define what *pure (quantum) states* are. These are nothing else but linear combinations of elements from our orthonormal basis ${\ket{i}}_{i = 0, \dots, N-1}$, i.e.,

\[\ket{\phi} = \sum_{i = 0}^{N-1} \alpha_i \ket{i},\]and additionally we require that

\[\norm{\ket{\phi}}^2 = \braket{\phi \mid \phi} =\sum_{i = 0}^{N-1} \alpha_i^* \alpha_i = \sum_{i = 0}^{N-1} \abs{\alpha_i}^2 = 1,\]and hence also $\norm{\ket{\phi}} = 1$.

An important operation that we can apply to a state is a *measurement* with the aim to extract information from the state.

We first consider so-called projective measurements. To this end, let us briefly recall the definition and properties of an orthogonal projection matrix:

**Definition and Properties: Orthogonal projection matrices.** A square matrix $P : \mathcal H \rightarrow \mathcal H$ is an *orthogonal projection matrix* if:
\[P^2 = P = P^\dagger\]
*Properties.*

a. $\braket{\psi \mid P \phi} = \braket{P \psi \mid \phi}$

b. Eigenvalues of $P$ are $0$ and $1$ only

c. $\norm{\ket{P \phi}}^2 = \braket{P \phi \mid P \phi} = \braket{\phi \mid P^\dagger P \mid \phi} = \braket{\phi \mid P \mid \phi} = \tr(P \ketbra{\phi}{\phi})$. The matrix $\rho = \ketbra{\phi}{\phi}$ here is called *density matrix* and we will revisit it later.

See also wikipedia for more useful properties. With this we can define the measurement operation:

**Definition: Measurement.** A *measurement with $m$ outcomes* is a set of orthogonal projection matrices $P_1, \dots, P_m$ that decompose the identity matrix $I = \sum_{i = 1}^m P_i$.

Note, that the above definition implies that $P_i P_j = 0$ for $i \neq j$: Simply multiply $I = \sum_{i = 1}^m P_i$ with some $P_j$ from the right, then reorder to $0 = P_1P_j + \dots + P_j(P_j - I) + \dots + P_mP_j$. Since the images of two distinct $P_i$ only intersect in $0$, it follows that $P_iP_j = 0$ for all $i \neq j$; full proof left to the interested reader or see, e.g., Theorem 2.13 here.

We can now write $\ket{\phi} = I \ket{\phi} = \sum_{i = 1}^m P_i \ket{\phi}$. As $\norm{\ket{\phi}}^2 = 1$ and since the projections are orthogonal, we have that $1 = \sum_{i = 1}^m \norm{P_i \ket{\phi}}^2$, as $P_iP_j = 0$ for $i \neq j$ and $P_i^2 = P_i$, i.e., we obtain a probability distribution. The process of *measuring* now samples an $i$ according to this probability distribution, i.e., with probability $\norm{\ket{P_i\phi}}^2$ and maps $\ket{\phi} \mapsto \ket{P_i \phi} / \norm{\ket{P_i\phi}}$, which is again a (valid) state. After measuring, the state $\ket{\phi}$ ends up in an eigenstate of the measurement and thus the state changes, except for when $\ket{\phi}$ is already in an eigenstate of the measurement in which case it does not change.

Note that measurements are invariant w.r.t. the global phase, i.e., $\ket{\phi}$ and $e^{ir} \ket{\phi}$ produce the same measurement outcomes and statistics and the obtained states after measurement are also identical up to $e^{ir}$-rotation. In fact the global rotation $e^{ir}$ only affects the phase of the complex coefficients but not their absolute value. This is not to be confused with the relative phase differences in superpositions which are important.

Later, we will be mostly concerned with the case where the $P_i$ are given as rank-1 projectors into the actual (computational) basis $\ket{0}, \dots, \ket{N-1}$, i.e., $P_i = \ketbra{i}{i}$. Let $\ket{\phi} = \sum_{i = 0}^{N-1} \alpha_i \ket{i}$. We then have:

\[P_j \ket{\phi} = \ketbra{j}{j} \ket{\phi} = \sum_{i = 0}^{N-1} \alpha_i \ketbra{j}{j} \ket{i} = \alpha_j \ket{j},\]we purposefully (only this time) did not clean up bra and ket double separators for the sake of exposition. Thus we obtain that we measure $P_j = \ketbra{j}{j}$ with probability $\norm{\ket{P_j\phi}}^2 = \norm{\alpha_j \ket{j}}^2 = \Abs{\alpha_j}^2$. Alternatively, just for the sake of getting used to the bra-ket notation:

\[\begin{align*} \norm{\ket{P_j\phi}}^2 & = \norm{\ket{j}\bra{j} \ket{\phi}}^2 = \braket{\phi \mid \ketbra{j}{j} \mid \phi} = \braket{\phi \mid j} \braket{j \mid \phi} \\ & = \braket{j \mid \phi}^\esx \braket{j \mid \phi} = \Abs{\braket{j \mid \phi}}^2 \\ & = \Abs{\sum_{i = 0}^{N-1} \braket{j \mid \alpha_i i}}^2 = \Abs{\sum_{i = 0}^{N-1} \alpha_i \braket{j \mid i}}^2 = \Abs{\alpha_j}^2. \end{align*}\]The resulting state after measuring $j$ via $P_j$ is

\[\ket{P_j\phi} / \norm{\ket{P_j\phi}} = \frac{\alpha_j}{\Abs{\alpha_j}} \ket{j},\]i.e., when measuring in the computational basis our superposition collapses to a classical state.
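A minimal numpy sketch of measuring in the computational basis: sample outcome $j$ with probability $\Abs{\alpha_j}^2$ and collapse to $(\alpha_j/\Abs{\alpha_j})\ket{j}$. The two-dimensional example state is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def measure(phi):
    """Measure in the computational basis {|i><i|}: sample outcome j
    with probability |alpha_j|^2 (Born rule) and collapse the state."""
    probs = np.abs(phi) ** 2
    j = rng.choice(len(phi), p=probs)
    post = np.zeros_like(phi)
    post[j] = phi[j] / np.abs(phi[j])    # (alpha_j / |alpha_j|) |j>
    return j, post

phi = np.array([1, 1j]) / np.sqrt(2)     # equal superposition of |0> and |1>
outcomes = [measure(phi)[0] for _ in range(10_000)]
print(np.mean(outcomes))                 # empirical frequency of outcome 1, about 1/2
```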

**The Physics spin: Measurements, collapse of superpositions, and Schrödinger’s cat.** While we quite nonchalantly applied our measurements, e.g., by simply multiplying with the projection matrix and renormalizing, the physical reality seems to be much more complicated. In fact, up to today it is unclear when *exactly* the measurement happens that forces the quantum superposition to collapse to a classical state. The famous thought experiment of Schrödinger made this problem very apparent. Simplifying, the box with the cat is built so that the life of the cat in the box is linked one-to-one to a quantum superposition, i.e., it is a mechanism to upscale the effect from the atomic domain to the macroscopic one. Now when does the measurement take place that decides the fate of the cat? When you open the box? What if you can hear the cat being alive in the box? I.e., when exactly does the superposition cease to be a superposition and collapse to a classical state? There are tons of interpretations of quantum mechanics that give different answers to the questions posed by Schrödinger’s cat. The most prevalent one, which also seems to be the most unsatisfying one as it is basically stating the obvious, is the so-called *Copenhagen interpretation*: “A system stops being a superposition of states and becomes either one or the other when an observation takes place.” Now, what is an “observation”? For further reading check out wikipedia, but beware, this easily becomes a rabbit hole.

Note that while we have seen only rank-1 projectors in this section, it is very well possible to also have higher rank projectors. For example consider the state:

\[\ket{\phi} = \frac{1}{\sqrt{3}} \ket{1} + \sqrt{\frac{2}{3}} \ket{N},\]and the projectors (assuming $N$ is even)

\[P_1 = \sum_{i = 1}^{N/2} \ketbra{i}{i} \qquad \text{and} \qquad P_2 = \sum_{i = N/2 + 1}^{N} \ketbra{i}{i}.\]Clearly, $I = P_1 + P_2$. We measure with the first projector $P_1$ with probability

\[\norm{P_1 \ket{\phi}}^2 = \tr(P_1 \ketbra{\phi}{\phi}) = 1/3\]and we end up in state $P_1 \ket{\phi} / \norm{P_1 \ket{\phi}} = \ket{1}$. Similarly, we measure the second projector $P_2$ with probability $\norm{P_2 \ket{\phi}}^2 = \tr(P_2 \ketbra{\phi}{\phi}) = 2/3$ ending up in state $\ket{N}$.
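A quick numpy check of this two-outcome measurement, using 0-indexed basis vectors (so $\ket{1}$ and $\ket{N}$ from the text become `ket(0)` and `ket(N - 1)`), the normalized coefficients $1/\sqrt{3}$ and $\sqrt{2/3}$, and an arbitrary even dimension $N = 4$:

```python
import numpy as np

N = 4
ket = lambda i: np.eye(N)[i]
# normalized example state: (1/sqrt(3))|1> + sqrt(2/3)|N>, 0-indexed here
phi = ket(0) / np.sqrt(3) + np.sqrt(2 / 3) * ket(N - 1)

P1 = sum(np.outer(ket(i), ket(i)) for i in range(N // 2))  # first half of the basis
P2 = np.eye(N) - P1                                        # second half
rho = np.outer(phi, phi.conj())                            # density matrix |phi><phi|

p1 = np.trace(P1 @ rho).real                               # probability of outcome 1
p2 = np.trace(P2 @ rho).real                               # probability of outcome 2
post1 = P1 @ phi / np.linalg.norm(P1 @ phi)                # post-measurement state for P1
print(p1, p2)                                              # 1/3 and 2/3
```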

**Remark (Probability of state transition).** Finally, we consider a curiosity that we are going to revisit later. Let $\ket{\phi} = \sum_{i = 0}^{N-1} \alpha_i \ket{i}$ and $\ket{\psi} = \sum_{i = 0}^{N-1} \beta_i \ket{i}$ be two states expressed in our computational basis and let us define the rank-1 projector $P = \ketbra{\psi}{\psi}$ and let $Q = I - P$ be its complementary projector. Now let us consider the probability of measuring $\phi$ with $P$. By the above this is:
\[
\norm{P\ket{\phi}}^2 = \braket{\phi \mid P \mid \phi} = \braket{\phi \mid \psi} \braket{\psi \mid \phi} = \Abs{\braket{\psi \mid \phi}}^2,
\]
and the last expression can be written in the computational basis by linearity
\[
\Abs{\braket{\psi \mid \phi}}^2 = \Abs{\sum_{i = 0}^{N-1}\sum_{j = 0}^{N-1} \beta_i^\esx \alpha_j \braket{i \mid j}}^2 = \Abs{\sum_{i = 0}^{N-1} \beta_i^\esx \alpha_i}^2,
\]
using that $\braket{i \mid j} = \delta_{ij}$. Moreover, if we end up measuring with $P$ we obtain the post-measurement state:
\[
\ket{P \phi} / \norm{\ket{P \phi}} = \ket{\psi}\braket{\psi \mid \phi} / \norm{\ket{P \phi}} = \ket{\psi}.
\]
So what did this exercise show us? In some meaningful way, the probability of $\phi$ transitioning to $\psi$ is equal to $\Abs{\braket{\psi \mid \phi}}^2 = \Abs{\sum_{i = 0}^{N-1} \beta_i^\esx \alpha_i}^2$. I am simplifying a little here because there is some arbitrariness in applying the measurement given by $P$ and $Q$ and not any other. We are going to discuss this a little later, but keep this formula in mind. It will prove quite helpful.
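The transition-probability formula is easy to verify numerically for random states (dimension and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_state(n):
    """A random normalized complex state vector."""
    v = rng.normal(size=n) + 1j * rng.normal(size=n)
    return v / np.linalg.norm(v)

phi, psi = random_state(4), random_state(4)
P = np.outer(psi, psi.conj())               # rank-1 projector |psi><psi|
p_trans = np.linalg.norm(P @ phi) ** 2      # ||P|phi>||^2
# matches |<psi|phi>|^2
assert np.isclose(p_trans, np.abs(np.vdot(psi, phi)) ** 2)

post = P @ phi / np.linalg.norm(P @ phi)    # post-measurement state
# post equals |psi> up to a global phase
assert np.isclose(np.abs(np.vdot(psi, post)), 1.0)
print(p_trans)
```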

Closely connected to projective measurements are *observables*.

**Definition: Observable.** A projective measurement with $m$ distinct outcomes $\lambda_1, \dots, \lambda_m \in \RR$ given by a set of orthogonal projection matrices $P_1, \dots, P_m$ that decompose the identity matrix $I = \sum_{i = 1}^m P_i$ form the *observable* $M = \sum_{i = 1}^m \lambda_i P_i$.

Observe that $M$ is Hermitian, i.e., $M = M^\dagger$ as $\lambda_i \in \RR$ and $P_i = P_i^\dagger$ are Hermitian themselves for $i = 1, \dots, m$ (recall: if $M$ is Hermitian, all its eigenvalues are real and eigenvectors of distinct eigenvalues are orthogonal). Moreover, any Hermitian matrix $M$ corresponds to an observable, simply by taking its spectral decomposition $M = \sum_{i = 1}^m \lambda_i P_i$ with $\lambda_i \in \RR$ as $M$ is Hermitian. Thus there is a correspondence between observables and Hermitian matrices.

Observables allow us to very easily compute the expected value of a measurement. As before, the probability of measuring outcome $i$ is simply $\norm{P_i \ket{\phi}}^2$; thus we obtain the expected value of the measurement as:

\[\tag{EObservable} \sum_{i = 1}^m \lambda_i \norm{P_i \ket{\phi}}^2 = \sum_{i = 1}^m \lambda_i \tr(P_i \ketbra{\phi}{\phi}) = \tr(M \ketbra{\phi}{\phi}).\]The measurements above are so-called projective measurements as they use projection matrices. However, if we are not interested in the resulting state after measuring, there is another form of measurement, so-called *Positive-Operator-Valued Measure (POVM) measurements*. I will keep it brief for now until we need POVMs; for more details see wikipedia. Here we are given $m$ positive semidefinite matrices $E_1, \dots, E_m$ (effectively relaxing the 0/1 eigenvalue requirement of the projection matrices), so that $I = \sum_{i = 1}^{m} E_i$. Similar to what we have done before, given a state $\ket{\phi}$ the probability of measuring outcome $j$ is $\tr(E_j \ketbra{\phi}{\phi})$; however, and this is important, it might not hold that this probability is given by $\norm{E_j \ket{\phi}}^2$. In the derivation from earlier we basically used

\[
\norm{E_j \ket{\phi}}^2 = \braket{\phi \mid E_j^\dagger E_j \mid \phi} = \braket{\phi \mid E_j^2 \mid \phi} = \braket{\phi \mid E_j \mid \phi} = \tr(E_j \ketbra{\phi}{\phi}),
\]

in particular the third equality can easily fail if $E_j$ is not a projector, i.e., $E_j^2 = E_j$ might not hold.

There are a couple of things that are different compared to projective measurements (also sometimes abbreviated PVM for *projection-valued measure*) and we will look at them in more detail below. Most importantly, the elements $E_1, \dots, E_m$ of the POVM do not have to be orthogonal anymore and as such, in particular, we can have $m \geq N$ elements where $N$ is the dimension of the Hilbert space under consideration. This can be helpful in some applications and was not possible for PVMs due to the orthogonality condition. In fact, projective measurements are a special case of POVMs with the additional conditions $E_i^2 = E_i$ and $E_i E_j = 0$ for $i \neq j$. On the other hand, it is not obvious how to characterize the post-measurement state. We might think of POVMs as being to PVMs what mixed states are to pure states.

Why do we care? The reason is that when the two states we want to distinguish are orthogonal, we can simply use a PVM; however, if they are not orthogonal then there is neither a PVM nor a POVM that can separate the two with certainty; it is simply impossible. In fact, this impossibility is used in several quantum applications. However, there are POVMs that never make a mistake but sometimes return that they cannot distinguish the states, i.e., return “I don’t know”. As an example consider the two states:

\[\ket{0} \qquad \text{and} \qquad \ket{+} \doteq \frac{1}{\sqrt{2}}(\ket{0} + \ket{1})\]and we consider the three psd matrices (with $\ket{-} \doteq \frac{1}{\sqrt{2}}(\ket{0} - \ket{1})$):

\[E_0 \doteq \frac{1}{2}\ketbra{-}{-} \qquad \text{and} \qquad E_1 \doteq \frac{1}{2} \ketbra{1}{1} \qquad \text{and} \qquad E_2 \doteq I - E_0 - E_1,\]which are psd with eigenvalues \(\{0, 1/2\}\) for $E_0$ and $E_1$ and \(\{\approx 0.146, \approx 0.854\}\) for $E_2$ and by definition sum up to $I$. We obtain the following measurement outcomes. If the state is $\ket{0}$ and we measure with the POVM we have the outcomes

\[0 \text{ w.p. } \tr(E_0 \ketbra{0}{0}) = 1/4 \qquad 1 \text{ w.p. } \tr(E_1 \ketbra{0}{0}) = 0 \qquad 2 \text{ w.p. } \tr(E_2 \ketbra{0}{0}) = 3/4.\]If on the other hand the state is $\ket{+}$ and we measure with the POVM we have the outcomes

\[0 \text{ w.p. } \tr(E_0 \ketbra{+}{+}) = 0 \qquad 1 \text{ w.p. } \tr(E_1 \ketbra{+}{+}) = 1/4 \qquad 2 \text{ w.p. } \tr(E_2 \ketbra{+}{+}) = 3/4.\]While there is no PVM in original space that can achieve the same thing, by slightly extending the dimension of the space we can find a PVM that generates the same outcome distribution. This is known as Naimark’s dilation theorem (also Neumark’s Theorem; see also here for a formulation directly applicable to POVMs). This theorem is crucial as it allows to physically realize POVMs by means of PVMs. Moreover, there is also an interested twist in terms of the post-measurement state that we brushed aside so far: when measuring with a POVM the post-measurement state is actually not defined by the POVM but rather by the PVM that physically realizes it. There is an infinite number of such realizations of the POVM by means of PVMs simply via applying unitaries. Thus if we need the post-measurement state we need to realize the POVM by means of a PVM and compute its post-measurement state. Moreover, note that due to non-orthogonality when applying a POVM, the measurement is not repeatable in the sense that measuring twice can change the result the second time.

We will now discuss pure states and mixed states. You might want to read this twice as there is something non-trivial going on here. We will later revisit pure vs. mixed states also for more complex setups but it is instructional to start with the simple case first.

Let us first consider the pure state

\[\ket{\phi} = \frac{1}{\sqrt{2}} (\ket{0} + \ket{1}).\]As stated above this is a pure state as it is a vector of norm $1$ in the Hilbert space generated by $\ket{0}$ and $\ket{1}$. Now let us further define the observable

\[M = \ketbra{0}{0} - \ketbra{1}{1}.\]If we now measure with $M$, we obtain that the expected value of the measurement is

\[\tr(M \ketbra{\phi}{\phi})\]and after measuring via M, we find the system in state $\ket{0}$ with probability $1/2$ and in state $\ket{1}$ with probability $1/2$.

We can also define a so-called *ensemble*, which is a statistical mixture of states, via a so-called *density matrix* $\rho$:

\[\rho = \frac{1}{2} \ketbra{0}{0} + \frac{1}{2} \ketbra{1}{1}.\]It is easy to see that the density matrix is positive semidefinite, Hermitian, and has trace $1$. Density matrices are a generalization of the usual (pure) state description and can also capture mixed states and ensembles (as we do here); see wikipedia for more. In a nutshell, mathematically a mixed state is a convex combination of pure states. This *ensemble* describes our degree of knowledge, stating that with probability $1/2$ we have that $\rho$ is the state $\ket{0}$ and with probability $1/2$ we have that $\rho$ is the state $\ket{1}$.

It is very important not to confuse a superposition, which captures fundamental quantum uncertainty, with ensembles, which capture *our* degree of knowledge about the system. So in some sense we have two types of uncertainties: fundamental quantum uncertainty and statistical uncertainty. I found the following two statements helpful to differentiate the two:

Statistical mixtures represent the degree of knowledge whilst the uncertainty within quantum mechanics is fundamental. [wikipedia]

and

A mixed state is a mixture of probabilities of physical states, not a coherent superposition of physical states.

Note we can also measure an ensemble w.r.t. an observable $M$ via its density matrix $\rho$:

\[\tag{EEnsemble} \tr(M\rho),\]which is nothing else but the probability weighted average of the outcomes for the individual states comprising the ensemble.

**The Physics spin: Ensemble interpretation.** A way to think about ensembles is that if we have infinitely many copies of the system, then the ensemble captures the distribution of states. Closely related to this is the *Ensemble Interpretation (EI)*, which considers a quantum state not as an exhaustive representation of an individual physical system but only as a description of an ensemble of similarly prepared systems. This is in contrast to the *Copenhagen Interpretation (CI)*. From wikipedia; see [B14] for more background:

*CI:* A pure state \(\ket{y}\) provides a “complete” description of an individual system, in the sense that a dynamical variable represented by the operator \(Q\) has a definite value (\(q\), say) if and only if \(Q \ket{y} = q \ket{y}\).

*EI:* A pure state describes the statistical properties of an ensemble of identically prepared systems, of which the statistical operator is idempotent.

Now you might be tempted to think that this is more a metaphysical problem than a mathematical one. Let me convince you with the next example that this is not the case and that, in fact, quantum uncertainty behaves very differently from ordinary statistical uncertainty and probability theory.

**Example: Superposition vs. mixture of states.** Consider the following two states:
\[\phi_1 = \frac{1}{\sqrt{2}} (\ket{0} + \ket{1}) \qquad \text{and} \qquad \phi_2 = \frac{1}{\sqrt{2}} (\ket{0} - \ket{1}),\]
and let us define the observable
\[M = \ketbra{0}{0} - \ketbra{1}{1}.\]
With what we have seen so far, when measuring with $M$, for state $\ket{\phi_1}$ we end up in state:
\[
\ket{0} \text{ w.p. } \norm{\ketbra{0}{0} \phi_1}^2 = 1/2 \qquad\qquad \ket{1} \text{ w.p. } \norm{\ketbra{1}{1} \phi_1}^2 = 1/2,
\]
and for state $\ket{\phi_2}$ we end up in state:
\[
\ket{0} \text{ w.p. } \norm{\ketbra{0}{0} \phi_2}^2 = 1/2 \qquad\qquad \ket{1} \text{ w.p. } \norm{\ketbra{1}{1} \phi_2}^2 = 1/2,
\]
where we used “w.p.” as a short-hand for “with probability”. Although $\phi_1 \neq \phi_2$, under the observable $M$ we end up in the states $\ket{0}$ and $\ket{1}$ uniformly at random, i.e., with the same distribution for $\ket{\phi_1}$ and $\ket{\phi_2}$.

Now let us first consider a uniform mixture of these two states via the density matrix:
\[\rho = \frac{1}{2} \ketbra{\phi_1}{\phi_1} + \frac{1}{2} \ketbra{\phi_2}{\phi_2}.\]
So if we measure with $M$, with what probability do we obtain state $\ket{0}$? With probability $1/2$, the system is in state $\ket{\phi_1}$ and we have just computed that in this case we measure $\ket{0}$ with probability $1/2$, i.e., by the product rule that is a probability of $1/4$. Moreover, with probability $1/2$ the system is in state $\ket{\phi_2}$ and we have just computed that in this case we measure $\ket{0}$ with probability $1/2$ as well. Thus again $1/4$ probability, so that the total probability of measuring $\ket{0}$ is $1/4 + 1/4 = 1/2$; a basic probability calculation. Moreover, we can also compute the expected value of the observable via the rules from above. Via (EEnsemble) we have
\[
\tr(M\rho) = \frac{1}{2} \tr(M \ketbra{\phi_1}{\phi_1}) + \frac{1}{2} \tr(M \ketbra{\phi_2}{\phi_2}),
\]
and via (EObservable) we obtain
\[
\tr(M\rho) = \frac{1}{2} (\norm{\ketbra{0}{0} \ket{\phi_1}}^2 - \norm{\ketbra{1}{1} \ket{\phi_1}}^2) + \frac{1}{2} (\norm{\ketbra{0}{0} \ket{\phi_2}}^2 - \norm{\ketbra{1}{1} \ket{\phi_2}}^2) = 0.
\]

Now let us consider the “uniform” superposition of $\phi_1$ and $\phi_2$. Recall that both $\phi_1$ and $\phi_2$ are in state $\ket{0}$ and $\ket{1}$ with probability $1/2$ after measurement with $M$. We consider the superposition $\phi$ defined as:
\[\phi = \frac{1}{\sqrt{2}} (\phi_1 + \phi_2) = \ket{0}.\]
Now we have
\[\ket{0} \text{ w.p. } \norm{\ketbra{0}{0} \phi}^2 = 1,\]
and the expected value under $M$ is:
\[\tr(M\ketbra{\phi}{\phi}) = 1.\]

So what happened here and how is this possible? The key is that in a superposition the amplitudes can interact as is the case here. Slightly metaphysical: this interaction allows for something like “negative probabilities”, so that both $\phi_1$ and $\phi_2$ are maximally random but their superposition is not.

For those of you that like to implement things, a quick computation with `qutip` in `python` of the above roughly looks as follows; see also this colab notebook:

```python
from qutip import *
import math

N = 2
b0 = basis(N, 0)  # |0>
b1 = basis(N, 1)  # |1>
phi1 = 1 / math.sqrt(2) * (b0 + b1)
phi2 = 1 / math.sqrt(2) * (b0 - b1)
M = b0.proj() - b1.proj()  # the observable
print("Probability: ", (b0.proj() * phi1).norm()**2)  # prob of |0> when measuring |\phi_1> via M: 1/2
rho = 1/2 * phi1.proj() + 1/2 * phi2.proj()  # density matrix
print("Expected Value Mixture: ", (M * rho).tr())  # expected value of M for mixed state: 0.0
phi = phi1 + phi2
phi = phi / phi.norm()
print("Probability: ", (b0.proj() * phi).norm()**2)  # prob of |0> when measuring |\phi> via M: 1.0
print("Expected Value State: ", (M * phi.proj()).tr())  # expected value of M for |\phi>: 1.0
```

So how do we know whether a state is a pure state or a mixed state? One of the easiest ways is looking at its density matrix $\rho$: the state given by the density matrix $\rho$ is pure if and only if $\tr(\rho^2) = 1$. This also gives rise to the notion of *linear entropy* of a state given by its density matrix $\rho$, defined as:

\[S_L(\rho) \doteq 1 - \tr(\rho^2),\]

so that $\rho$ is pure if and only if $S_L(\rho) = 0$. Similarly, we can define the *von Neumann entropy* of a state given by its density matrix $\rho$ as:

\[S(\rho) \doteq - \tr(\rho \ln \rho),\]

where $\ln$ is the natural matrix logarithm (see wikipedia for more information). In case $\rho$ is expressed in terms of its eigenvectors, i.e., $\rho = \sum_{i = 0}^{N-1} \eta_i \ketbra{i}{i}$, the von Neumann entropy simply becomes the Shannon entropy of the eigenvalues, i.e.,

\[S(\rho) = - \sum_{i = 0}^{N-1} \eta_i \ln \eta_i.\]Similarly, we have $S(\rho) = 0$ if and only if $\rho$ is a pure state. In fact we can think of both the linear entropy as well as the von Neumann entropy as a measure of mixedness of the state. The latter notion we will also revisit in the context of the entropy of entanglements. For a *maximally mixed state* the linear entropy is $1 - 1/N$ and the von Neumann entropy is $\ln N$. The linear entropy is usually much easier to compute as it does not require a spectral decomposition and for measuring purity of a state it is often sufficient.

A note for those who have guessed it already: the linear entropy is to the von Neumann entropy what the total variation distance is to the Kullback-Leibler divergence or the mean-variance approximation to the entropy function; simply a Taylor/Mercator series approximation.
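Both entropies are straightforward to compute numerically; a minimal sketch with `numpy` (the example states are illustrative):

```python
import numpy as np

def linear_entropy(rho):
    # S_L(rho) = 1 - tr(rho^2)
    return 1.0 - np.trace(rho @ rho).real

def von_neumann_entropy(rho):
    # S(rho) = -tr(rho ln rho) = -sum_i eta_i ln eta_i over the eigenvalues
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]  # convention: 0 ln 0 = 0
    return float(-np.sum(evals * np.log(evals)))

pure = np.array([[1.0, 0.0], [0.0, 0.0]])  # |0><0|
mixed = np.eye(2) / 2                      # maximally mixed state for N = 2
print(linear_entropy(pure), von_neumann_entropy(pure))    # both ~0: pure state
print(linear_entropy(mixed), von_neumann_entropy(mixed))  # 1 - 1/N = 0.5 and ln 2
```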

Finally, we close this section with a question: Why is the outcome of the measurement of $\ket{\phi_1}$ under $M$ *not* itself a mixed state of the form
\[\tag{measureMixed}\tilde \rho = \frac{1}{2} \ketbra{0}{0} + \frac{1}{2} \ketbra{1}{1}?\]

The Bloch sphere is mostly a reparametrization of a $2$-level quantum system, e.g., generated by the basis $\ket{0}$ and $\ket{1}$, that allows for easy visualization. Note that every state in that system corresponds to two complex numbers, each defined by its respective real and imaginary part, hence $4$ reals. What we can do now is reparametrize by fixing the global phase of the state (as the global phase is meaningless with regard to the measurement distribution), effectively eliminating one dimension and allowing for a representation on a three-dimensional sphere: the Bloch sphere. I will keep things super compact here; the interested reader is referred to the wikipedia article for further reading.

The easiest way to convert the coordinates is by starting from the density matrix $\rho$. Then we obtain the Bloch sphere coordinates as follows:

\[\rho = \begin{pmatrix} \rho_{11} & \rho_{12} \\ \rho_{21} & \rho_{22} \end{pmatrix} \mapsto 2 \begin{pmatrix} \re(\rho_{21}) \\ \im(\rho_{21}) \\ \rho_{11} - \frac{1}{2} \end{pmatrix}.\]Note that since $\rho$ is Hermitian, we have that $\rho_{11} \in \RR$.

**Figure 1.** Bloch sphere. (left) layout of Bloch sphere (middle) orthogonal vectors are antiparallel on the Bloch sphere (right) pure states (example in green) have length $1$ and are on the surface, mixed states (example in orange) have length strictly less than $1$ and are interior points.
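For the implementation-minded, the coordinate map above fits in one helper; a minimal `numpy` sketch (the example states are my own choices) illustrating that pure states lie on the surface and mixed states strictly inside:

```python
import numpy as np

def bloch_vector(rho):
    # (x, y, z) = (2 Re(rho_21), 2 Im(rho_21), 2 (rho_11 - 1/2))
    return np.array([2 * rho[1, 0].real,
                     2 * rho[1, 0].imag,
                     2 * (rho[0, 0].real - 0.5)])

plus = np.array([1.0, 1.0]) / np.sqrt(2)
rho_pure = np.outer(plus, plus.conj())  # pure state |+><+|
rho_mixed = np.eye(2) / 2               # maximally mixed state

print(bloch_vector(rho_pure), np.linalg.norm(bloch_vector(rho_pure)))    # length 1: surface
print(bloch_vector(rho_mixed), np.linalg.norm(bloch_vector(rho_mixed)))  # length 0: center
```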

So far we have only considered a single particle or unipartite system. As the saying goes: “You need two points of reference to measure distance or speed” and by the same token, once we go from unipartite systems to bipartite (or more generally multipartite) systems, things get significantly more interesting, e.g., by allowing for entanglement, which is key to quantum’s expressive power. Multipartite systems are simply obtained by taking the tensor product of multiple unipartite systems. More specifically, suppose we have multiple unipartite systems and their associated Hilbert spaces $\mathcal H_1, \dots, \mathcal H_\ell$, then the space of the composite system $\mathcal H$ is given by their tensor product:

\[\mathcal H \doteq \bigotimes_{i = 1}^{\ell} \mathcal H_i,\]and an element in $\mathcal H$ can be written as $\ket{q_1} \otimes \dots \otimes \ket{q_\ell}$; similarly we can consider the tensor of density matrices $\rho_1 \otimes \dots \otimes \rho_\ell$ to capture mixed states in composite systems. For a quick refresher, the tensor product is basically like the outer product (i.e., we form tuples), however with the additional structural properties ensuring homogeneity w.r.t. addition and scalar multiplication; see wikipedia for a recap. This homogeneity basically determines also how linear maps act on the space. We recall the most important rules below; for simplicity we formulate them for the tensor product of two spaces $\mathcal H_1 \otimes \mathcal H_2$ but they hold more generally with the obvious generalizations:

**Useful rules for tensor products.**

a. (Linearity w.r.t. “+”): $(\ket{\phi} + \ket{\psi}) \otimes \ket{\kappa} = \ket{\phi} \otimes \ket{\kappa} + \ket{\psi} \otimes \ket{\kappa}$.

b. (Linearity w.r.t. “·”): for $s \in \CC$, we have $\ket{s \phi} \otimes \ket{\kappa} = s (\ket{\phi} \otimes \ket{\kappa}) = \ket{\phi} \otimes \ket{s \kappa}$.

c. (Tensor of linear maps): $(A \otimes B) (\ket{\phi} \otimes \ket{\kappa}) = A\ket{\phi} \otimes B \ket{\kappa}$.

d. (Linear maps as concatenation:) $A \otimes B = (A \otimes I) \circ (I \otimes B) = (I \otimes B) \circ (A \otimes I)$.

An important operator will be the *partial trace*, which basically applies the trace operator to only some subset of the tensor components. Skipping the formalism (see wikipedia for details), the *partial trace w.r.t. $\mathcal H_1$* (in short: \(\ptr{\mathcal H_1}\)) is the unique linear operator such that for any two matrices $A: \mathcal H_1 \rightarrow \mathcal H_1$ and $B: \mathcal H_2 \rightarrow \mathcal H_2$ it holds

\[\ptr{\mathcal H_1} (A \otimes B) = \tr(A) \cdot B.\]This gives rise to the partial trace of any matrix $M$ on $\mathcal H_1 \otimes \mathcal H_2$. Computationally, the partial trace can be implemented by taking partial sums of coefficients along diagonals and it does not require an explicit (potentially non-existent) decomposition $M = A \otimes B$; see wikipedia for an explanation.

Now consider a density matrix $\rho$ on $\mathcal H_1 \otimes \mathcal H_2$. The *partial trace of $\rho$ w.r.t. $\mathcal H_2$* denoted by $\rho_1$ is given by $\rho_1 \doteq \ptr{\mathcal H_2} (\rho)$ and $\rho_1$ is called the *reduced density matrix* of $\rho$ on system $\mathcal H_1$. This process is also referred to as “tracing out” (or averaging out) $\mathcal H_2$. Tracing out basically captures the situation where we have a composite system but are unaware of it, e.g., we only know about $\mathcal H_1$ but not $\mathcal H_2$. If now $M$ is a measurement on $\mathcal H_1$, then we essentially measure on the composite system with $M \otimes I$ and it holds with the above that

\[\tr((M \otimes I)\, \rho) = \tr(M \rho_1).\]

In this sense $\rho_1$ is the “right state” as it generates the same measurement statistics on $\mathcal H_1$ as $\rho$ does on $\mathcal H_1 \otimes \mathcal H_2$, provided we measure only the $\mathcal H_1$ part, i.e., we measure with matrices of the form $M \otimes I$.
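Computationally, tracing out and the resulting measurement statistics can be checked in a few lines; a minimal `numpy` sketch (the two-qubit state and the observable are illustrative choices):

```python
import numpy as np

def ptrace_out_second(rho):
    # trace out the second qubit: reshape to indices [i1, i2, j1, j2]
    # and sum over the diagonal i2 = j2
    return rho.reshape(2, 2, 2, 2).trace(axis1=1, axis2=3)

phi = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)  # (|0>|0> + |1>|1>)/sqrt(2)
rho = np.outer(phi, phi)
rho1 = ptrace_out_second(rho)  # reduced density matrix on H_1

M = np.diag([1.0, -1.0])       # an observable acting on H_1 only
lhs = np.trace(np.kron(M, np.eye(2)) @ rho).real  # measure M tensor I on the composite
rhs = np.trace(M @ rho1).real                     # measure M on the reduced state
print(np.allclose(lhs, rhs))   # True: rho1 reproduces the statistics on H_1
```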

For the sake of brevity, in the following we will often write $\ket{0}\ket{0}$ as a shorthand for \(\ket{0}_1 \otimes \ket{0}_2\), when the spaces etc are clear from the context; the same applies to multipartite systems.

Finally, we come to entanglement, this obscure term that makes quantum mechanics and quantum computing so special. In the following we will (mostly) consider bipartite systems $\mathcal H_1 \otimes \mathcal H_2$, each generated by the basis $\ket{0}$ and $\ket{1}$, to simplify the exposition, but everything holds also for arbitrary multipartite systems. Let us consider the following state (which is also referred to as a *Bell state*)

\[\ket{\phi} \doteq \frac{1}{\sqrt{2}} (\ket{0}\ket{0} + \ket{1}\ket{1}).\]

Let us start with a few simple observations: the density matrix of $\ket{\phi}$ is given by:

\[\rho = \ketbra{\phi}{\phi} = \begin{pmatrix} 1/2 & 0 & 0 & 1/2 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 1/2 & 0 & 0 & 1/2 \end{pmatrix},\]and moreover, we have

\[S_L(\rho) = 1 - \tr(\rho^2) = 0,\]i.e., $\rho$ is a pure state in the bipartite system. Now consider the measurement consisting of the two projective matrices $\ketbra{0}{0} \otimes I$ and $\ketbra{1}{1} \otimes I$. In the first case we end up with the post-measurement state

\[(\ketbra{0}{0} \otimes I) \ket{\phi} / \norm{(\ketbra{0}{0} \otimes I) \ket{\phi}} = \ket{0} \otimes \ket{0},\]and in the second case we end up with

\[(\ketbra{1}{1} \otimes I) \ket{\phi} / \norm{(\ketbra{1}{1} \otimes I) \ket{\phi}} = \ket{1} \otimes \ket{1},\]i.e., when measuring the first component of the bipartite system this might also collapse the second component and here via the entanglement the two components are forced to be the same. On the other hand if we would consider an alternative state (which is not entangled as we will see soon)

\[\ket{\mu} = \left(\frac{1}{\sqrt{2}} (\ket{0} + \ket{1})\right) \otimes \left(\frac{1}{\sqrt{2}} (\ket{0} + \ket{1})\right),\]and apply the same measurement we would obtain the post-measurement states

\[(\ketbra{0}{0} \otimes I) \ket{\mu} / \norm{(\ketbra{0}{0} \otimes I) \ket{\mu}} = \ket{0} \otimes 1/\sqrt{2} (\ket{0} + \ket{1}),\]and

\[(\ketbra{1}{1} \otimes I) \ket{\mu} / \norm{(\ketbra{1}{1} \otimes I) \ket{\mu}} = \ket{1} \otimes 1/\sqrt{2} (\ket{0} + \ket{1}),\]i.e., in this case the second component is “undisturbed” by the measurement on the first component.

Now let us ask a seemingly innocent question: Can we write $\ket{\phi} = \ket{\psi} \otimes \ket{\kappa}$ with $\ket{\psi} \in \mathcal H_1$ and $\ket{\kappa} \in \mathcal H_2$? To this end, let us express

\[\ket{\psi} = \alpha_0 \ket{0} + \alpha_1 \ket{1} \qquad \text{and} \qquad \ket{\kappa} = \beta_0 \ket{0} + \beta_1 \ket{1}\]and do some basic linear algebra transformations

\[\begin{align*} \ket{\psi} \otimes \ket{\kappa} & = (\alpha_0 \ket{0} + \alpha_1 \ket{1}) \otimes (\beta_0 \ket{0} + \beta_1 \ket{1}) \\ & = \alpha_0 \beta_0 \ket{0} \otimes \ket{0} + \alpha_1 \beta_0 \ket{1} \otimes \ket{0} + \alpha_0 \beta_1 \ket{0} \otimes \ket{1} + \alpha_1 \beta_1 \ket{1} \otimes \ket{1}. \end{align*}\]Thus the coefficients have to be in product form, in order to express $\ket{\phi} = \ket{\psi} \otimes \ket{\kappa}$. This however is not the case for $\ket{\phi}$.
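The product-form condition on the coefficients can also be phrased in linear-algebra terms: arranging the coefficients of a two-qubit state into a $2 \times 2$ matrix $C$ with $C_{ij}$ the coefficient of $\ket{i} \otimes \ket{j}$, the state is of product form if and only if $C$ is an outer product $\alpha \beta^\intercal$, i.e., has rank $1$. A minimal `numpy` sketch:

```python
import numpy as np

# coefficient matrices C[i, j] = coefficient of |i>|j>
C_bell = np.array([[1.0, 0.0], [0.0, 1.0]]) / np.sqrt(2)  # (|0>|0> + |1>|1>)/sqrt(2)
C_prod = np.outer([1.0, 1.0], [1.0, 1.0]) / 2             # |+> tensor |+>

print(np.linalg.matrix_rank(C_bell))  # 2: not of product form
print(np.linalg.matrix_rank(C_prod))  # 1: product form
```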

**Definition: Separable and entangled state.** A state $\ket{\phi}$ is called *separable* if it can be written as $\ket{\phi} = \ket{\psi} \otimes \ket{\kappa}$ with $\ket{\psi} \in \mathcal H_1$ and $\ket{\kappa} \in \mathcal H_2$. A state that is not separable is called *entangled*. The same definition extends to density matrices covering the mixed state case.

Note that *a priori* this has nothing to do with pure vs. mixed states and in fact all four combinations are possible: entangled-pure, unentangled-pure, entangled-mixed, and unentangled-mixed.

In particular, the above suggests that while $\ket{\phi}$ is a pure state in the composite system there are no pure states in $\mathcal H_1$ and $\mathcal H_2$ that capture the individual components. This becomes evident when we trace out $\mathcal H_2$ and obtain $\rho_1$ with the reduced density matrix

\[\rho_1 = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix} = \frac{1}{2} \ketbra{0}{0} + \frac{1}{2} \ketbra{1}{1},\]i.e., a mixed state. Note that $\rho_1$ has maximum linear and von Neumann entropy.

This is a good time to revisit our question (measureMixed) from above. We asked: Why is the outcome of the measurement […] *not* itself a mixed state of the form
\(\frac{1}{2} \ketbra{0}{0} + \frac{1}{2} \ketbra{1}{1}?\)

The reason for this is a little subtle: in the case of (measureMixed), *after* the measurement it is decided which state we are in, and hence we have not a probability distribution but an actual state (which arises from some probability distribution). On the other hand, when tracing out $\mathcal H_2$ above we are left with (statistical) uncertainty about the part of the state in $\mathcal H_1$ and we *must* explicitly account for this uncertainty, which is precisely what the reduced density matrix after tracing out does. This is closely related to the totalitarian principle in quantum mechanics, which states “Everything not forbidden is compulsory.” Wikipedia explains this quite aptly:

The statement is in reference to a surprising feature of particle interactions: that any interaction that is not forbidden by a small number of simple conservation laws is not only allowed, but must be included in the sum over all “paths” that contribute to the outcome of the interaction. Hence if it is not forbidden, there is some probability amplitude for it to happen.

In some sense the totalitarian principle is the analog of the maximum entropy principle. In general, tracing out and/or measuring turns quantum mechanical uncertainty and quantum correlations (e.g., arising via entanglement) into statistical uncertainty.

In fact the above is no coincidence. A pure state $\ket{\phi}$ in the bipartite system $\mathcal H_1 \otimes \mathcal H_2$ is entangled if and only if the reduced density matrix $\rho_1$ is a mixed state if and only if the von Neumann entropy $S(\rho_1)$ of the reduced density matrix $\rho_1$ is non-zero. In fact $S(\rho_1) = S(\rho_2)$, so that it does not matter which one of the two reduced density matrices we are using. This entropy is also referred to as the *entropy of the entanglement* and if the entropy of the entanglement is maximal, we say the states are *maximally entangled*. In this case the reduced density matrix is also a diagonal matrix and by the fact that we compute the entropy of the reduced density matrices, it also implies that $\rho_1$ and $\rho_2$ are maximally mixed in this case.
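This criterion is easy to check numerically; a small `numpy` sketch combining the partial trace and the von Neumann entropy discussed above (the two example states are illustrative):

```python
import numpy as np

def entanglement_entropy(psi):
    # von Neumann entropy of the reduced density matrix of a two-qubit pure state
    rho = np.outer(psi, psi.conj())
    rho1 = rho.reshape(2, 2, 2, 2).trace(axis1=1, axis2=3)  # trace out H_2
    evals = np.linalg.eigvalsh(rho1)
    evals = evals[evals > 1e-12]  # convention: 0 ln 0 = 0
    return float(-np.sum(evals * np.log(evals)))

bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)  # (|0>|0> + |1>|1>)/sqrt(2)
prod = np.kron([1.0, 1.0], [1.0, 1.0]) / 2          # |+> tensor |+>

print(entanglement_entropy(bell))  # ~ln 2: maximally entangled
print(entanglement_entropy(prod))  # ~0: separable
```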

It is tempting to generalize this to the mixed state case, however this is not easily possible. In fact, already deciding whether a mixed state in a bipartite system is entangled or not is NP-hard by a reduction from KNAPSACK as shown in a relatively recent result [G03]. In fact, for a mixed state in a bipartite system the entanglement entropy is no longer a measure of entanglement. As always check out wikipedia for some background reading.

We will finish this first post with a fascinating result that demonstrates that there *is* something special happening when using entanglement: Bell’s theorem. As it is an umbrella for several related insights and results and subject to various interpretations, I will completely skip the physical side of things; see [P21] for a more in-depth treatment, or as usual wikipedia is a great starting point. In a nutshell, Bell’s theorem demonstrates that quantum mechanics/computing can violate classical probability theory. The argument below is a later example from [NC02], which is more accessible than Bell’s original argument [B64].

Our setup is as follows. We have three parties: Alice, Bob, and Cliff. Alice and Bob are spatially very far away from each other. Alice and Bob each have two binary measurements: Alice has $A_0$ that measures some property $a_0$ and $A_1$ that measures some property $a_1$; similarly for Bob, $B_0$ measures $b_0$ and $B_1$ measures $b_1$. The measurements output $\pm 1$, with $1$ if the particle that is measured carries the property and $-1$ if the property is absent; slightly abusing notation, let $a_0, a_1, b_0, b_1$ denote also the outcomes of the respective measurements, which is ok as they are in one-to-one correspondence with the actual properties.

Now Cliff prepares a pair of particles and sends particle $1$ to Alice and particle $2$ to Bob. Upon receiving their particles, Alice and Bob each pick one of their two measurements at random, e.g., by flipping a coin, and measure their particle. By doing so we obtain $4$ measurement combinations and we consider the following linear combination (note the minus sign for the last summand):

\[a_0 b_0 + a_1 b_0 + a_0 b_1 - a_1 b_1 = a_0 (b_0 + b_1) + a_1 (b_0 - b_1).\]Now since the outcomes of the measurements are $\pm 1$, either $b_0 = b_1$ and then the second term on the right-hand side vanishes, or $b_0 = -b_1$ and then the first term on the right-hand side vanishes; in either case the remaining term in brackets is equal to $\pm 2$, so that the right-hand side becomes $\pm 2$ and we obtain the valid inequality:

\[a_0 b_0 + a_1 b_0 + a_0 b_1 - a_1 b_1 \leq 2.\]Observe that the left-hand side cannot be measured with a *single measurement* as Alice and Bob have to pick one measurement each in a given trial. However, if we perform a large number of experiments (each time Cliff preparing a new state) then we also have

\[\mathbb E [a_0 b_0 + a_1 b_0 + a_0 b_1 - a_1 b_1] \leq 2,\]where $\mathbb E$ denotes the expectation and by linearity of expectation it follows:

\[\tag{CHSH} \mathbb E [a_0 b_0 ] + \mathbb E[a_1 b_0] + \mathbb E[a_0 b_1] - \mathbb E[a_1 b_1] \leq 2.\]This inequality is a so-called Bell inequality (one of many) and specifically the CHSH inequality; we will discuss these and the geometric properties etc in the next post.
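As a quick sanity check, the classical bound can be verified by brute force, enumerating all deterministic $\pm 1$ assignments of the four properties; a tiny `python` sketch:

```python
import itertools

# all 16 deterministic assignments of (a0, a1, b0, b1) in {-1, +1}^4
best = max(
    a0 * b0 + a1 * b0 + a0 * b1 - a1 * b1
    for a0, a1, b0, b1 in itertools.product([-1, 1], repeat=4)
)
print(best)  # 2: no classical assignment exceeds the CHSH bound
```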

Note that the argument above relies on two key assumptions (a) *Realism*: the properties of the particles exist irrespectively of whether they are observed/measured or not, this is referred to as realism, and (b) *Locality*: Alice’s choice of a measurement cannot influence Bob’s result and vice versa, which is often referred to as locality, i.e., if far enough away they do not interact/interfere with each other.

And now we will show that quantum mechanics can break this. We let Cliff prepare a bipartite quantum state of the form:

\[\ket{\phi} \doteq \frac{1}{\sqrt{2}} (\ket{0}\ket{1} - \ket{1}\ket{0})\]and then send one of the qubits to Alice and the other to Bob. Note that this is a pure state. Next we define Alice’s observables:

\[A_0 \doteq \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \qquad \text{and} \qquad A_1 \doteq \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\]and Bob’s observables:

\[B_0 \doteq \frac{1}{\sqrt{2}} (-A_1 - A_0) \qquad \text{and} \qquad B_1 \doteq \frac{1}{\sqrt{2}} (A_1 - A_0).\]It is easy to see that $A_0, A_1, B_0$, and $B_1$ have eigenvalues $\pm 1$ and as such are the measurement outcomes. Let Alice and Bob pick their measurements uniformly at random. We then obtain the measurement outcomes

\[\begin{align*} \tr(A_0 \otimes B_0 \ketbra{\phi}{\phi}) = \frac{1}{\sqrt{2}} \qquad & \tr(A_0 \otimes B_1 \ketbra{\phi}{\phi}) = \frac{1}{\sqrt{2}} \\ \tr(A_1 \otimes B_0 \ketbra{\phi}{\phi}) = \frac{1}{\sqrt{2}} \qquad & \tr(A_1 \otimes B_1 \ketbra{\phi}{\phi}) = - \frac{1}{\sqrt{2}}, \end{align*}\]and in particular:

\[\tr(A_0 \otimes B_0 \ketbra{\phi}{\phi}) + \tr(A_0 \otimes B_1 \ketbra{\phi}{\phi}) + \tr(A_1 \otimes B_0 \ketbra{\phi}{\phi}) - \tr(A_1 \otimes B_1 \ketbra{\phi}{\phi}) = 2 \sqrt{2},\]which violates (CHSH). One might wonder, where the specific observables come from and this will also be subject to the next post. For now however, observe that as the trace is linear we can combine the above observables into one:

\[\begin{align*} A_0 \otimes B_0 + A_0 \otimes B_1 + A_1 \otimes B_0 - A_1 \otimes B_1 & = A_0 \otimes (B_0 + B_1) + A_1 \otimes (B_0 - B_1) \\ & = \sqrt{2} \begin{pmatrix} -1 & 0 & 0 & -1 \\ 0 & 1 & -1 & 0 \\ 0 & -1 & 1 & 0 \\ -1 & 0 & 0 & -1 \end{pmatrix}. \end{align*}\]Note that so far we have not yet talked about any operations that we can perform on a state in order to perform computations. This will also be the subject of another post soon.
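For completeness, the violation can also be verified numerically with the state and observables from above; a small `numpy` sketch:

```python
import numpy as np

Z = np.diag([1.0, -1.0])                # A_0
X = np.array([[0.0, 1.0], [1.0, 0.0]])  # A_1
B0 = (-X - Z) / np.sqrt(2)
B1 = (X - Z) / np.sqrt(2)

phi = np.array([0.0, 1.0, -1.0, 0.0]) / np.sqrt(2)   # (|0>|1> - |1>|0>)/sqrt(2)
rho = np.outer(phi, phi)
E = lambda A, B: np.trace(np.kron(A, B) @ rho).real  # expectation tr((A tensor B) rho)

chsh = E(Z, B0) + E(Z, B1) + E(X, B0) - E(X, B1)
print(chsh)  # 2*sqrt(2) ~ 2.828 > 2: violates (CHSH)
```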

I would like to thank Omid Nohadani for the helpful discussions and clarifications of the physics perspective of things.

[M07] Mermin, N. D. (2007). Quantum computer science: an introduction. Cambridge University Press.

[dW19] De Wolf, R. (2019). Quantum computing: Lecture notes. arXiv preprint arXiv:1907.09415. pdf

[B14] Ballentine, L. E. (2014). Quantum mechanics: a modern development. World Scientific Publishing Company.

[P21] Preskill, J. (2021). Physics 219/Computer Science 219: Quantum Computation. web

[G03] Gurvits, L. (2003, June). Classical deterministic complexity of Edmonds’ problem and quantum entanglement. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing (pp. 10-19). pdf

[B64] Bell, J. S. (1964). On the einstein podolsky rosen paradox. Physics Physique Fizika, 1(3), 195. pdf

[NC02] Nielsen, M. A., & Chuang, I. (2002). Quantum computation and quantum information. pdf

05/09/2022: Fixed several typos as pointed out by Zev Woodstock and Berkant Turan.

06/13/2022: Fixed several typos as pointed out by Felipe Serrano.

*Written by Elias Wirth.*

The accuracy of classification algorithms relies on the quality of the available features. Here we focus on feature
transformations for a linear kernel *Support Vector Machine* (SVM)
[SV], an algorithm that relies on the linear separability of the different classes to achieve high classification accuracy.
Our approach is based on the
idea that a given set of data points
$X = \lbrace x_1, \ldots, x_m\rbrace\subseteq \mathbb{R}^n$ can be succinctly described by the *vanishing ideal* over
$X$, i.e., the set of polynomials vanishing over $X$:

\[\mathcal{I}_X := \lbrace p \in \mathcal{P} \mid p(x) = 0 \ \text{for all} \ x \in X \rbrace,\]where $\mathcal{P}$ denotes the polynomial ring in $n$ variables.

The set $\mathcal{I}_X$ contains infinitely many polynomials, but, by Hilbert’s basis theorem [CLO], there exists a
finite number of polynomials $g_1, \ldots, g_k \in \mathcal{I}_X$, $k\in \mathbb{N}$,
referred to as *generators*,
such that for any $f\in \mathcal{I}_X$, there exist polynomials $h_1, \ldots, h_k \in \mathcal{P}$ such that

\[f = \sum_{i=1}^{k} h_i g_i.\]
Thus, the set of generators is a finite representation of the ideal $\mathcal{I}_X$, and, as we explain below, can be used to create a linearly separable representation of the data set.

We now explain how sets of generators can be employed to create a linearly separable representation of the data: Consider a set of data points $X = \lbrace x_1, \ldots, x_m\rbrace \subseteq \mathbb{R}^n$ with associated label vector $Y \in \lbrace -1, 1 \rbrace ^m$. The goal is to train a linear classifier that assigns the correct label to each data point. Let $X^{-1}\subseteq X$ and $X^{1}\subseteq X$ denote the subsets of feature vectors corresponding to data points with labels $-1$ and $1$, respectively. With access to an algorithm that can construct a set of generators for a data set $X\subseteq \mathbb{R}^n$, we construct a set of generators $\mathcal{G}^{-1} = \lbrace g_1, \ldots, g_k \rbrace$ of the vanishing ideal corresponding to $X^{-1}$, such that for all $g\in \mathcal{G}^{-1}$ it holds that

\[g(x) = \begin{cases} = 0, & x \in X^{-1}\\ \neq 0, & x \in X^{1}. \end{cases}\]Similarly, we construct a set of generators $\mathcal{G}^{1} = \lbrace h_1, \ldots, h_l \rbrace$ of the vanishing ideal corresponding to $X^{1}$, such that for all $h\in \mathcal{G}^{1}$ it holds that

\[h(x) = \begin{cases} \neq 0, & x \in X^{-1}\\ = 0, & x \in X^{1}. \end{cases}\]Let $\mathcal{G} := \mathcal{G}^{-1} \cup \mathcal{G}^{1} = \lbrace g_1, \ldots, g_k, h_1, \ldots, h_l\rbrace$ and consider the associated feature transformation:

\[x \mapsto \tilde{x} = \left(|g_1(x)|, \ldots, |g_k(x)|, |h_1(x)|, \ldots, |h_l(x)|\right)^\intercal\in \mathbb{R}^{k+l}.\]Under mild assumptions [L], it then holds that for $x\in X^{-1}$,

\[\tilde{x}_i = \begin{cases} = 0, & i \in \lbrace 1, \ldots, k\rbrace\\ > 0, & i \in \lbrace k + 1, \ldots, k + l\rbrace, \end{cases}\]and for $x\in X^{1}$,

\[\tilde{x}_i = \begin{cases} >0, & i \in \lbrace 1, \ldots, k\rbrace\\ =0, & i \in \lbrace k + 1, \ldots, k + l\rbrace. \end{cases}\]The transformed data is now linearly separable. Indeed, let

\[w := (1, \ldots, 1, -1, \ldots, -1)^\intercal \in \mathbb{R}^{k + l},\]where the first $k$ entries are $1$ and the last $l$ entries are $-1$. Then,

\[w^\intercal \tilde{x} = \begin{cases} < 0, & x\in X^{-1}\\ > 0, & x\in X^{1}, \end{cases}\]and we can perfectly classify all $x \in X$. In practice, we instead use a linear kernel Support Vector Machine (SVM) [SV] as the classifier.
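To make the mechanics concrete, here is a hypothetical toy example in `python` (the generators are chosen by hand for illustration and are *not* computed by CGAVI): class $-1$ lies on the unit circle and class $1$ on the circle of radius $2$, so that $g(x) = x_1^2 + x_2^2 - 1$ vanishes on $X^{-1}$ and $h(x) = x_1^2 + x_2^2 - 4$ vanishes on $X^{1}$.

```python
import numpy as np

theta = np.linspace(0, 2 * np.pi, 50, endpoint=False)
X_neg = np.c_[np.cos(theta), np.sin(theta)]      # X^{-1}: unit circle
X_pos = 2 * np.c_[np.cos(theta), np.sin(theta)]  # X^{1}: circle of radius 2

g = lambda x: x[:, 0]**2 + x[:, 1]**2 - 1  # vanishes exactly on X^{-1}
h = lambda x: x[:, 0]**2 + x[:, 1]**2 - 4  # vanishes exactly on X^{1}

# feature transformation x -> (|g(x)|, |h(x)|)
transform = lambda x: np.c_[np.abs(g(x)), np.abs(h(x))]

w = np.array([1.0, -1.0])  # +1 for the generator of G^{-1}, -1 for that of G^{1}
print(np.all(transform(X_neg) @ w < 0))  # True: class -1 on the negative side
print(np.all(transform(X_pos) @ w > 0))  # True: class  1 on the positive side
```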

**Noisy data:** The *vanishing ideal* is highly susceptible to noise in the data. Thus, in practice, instead of constructing
generators of the vanishing ideal, we construct generators of the *approximately vanishing ideal*, that is, the set of
polynomials $g\in \mathcal{P}$ such that $g(x)\approx 0$ for all $x\in X$. For details on the switch to the
approximately vanishing ideal, we refer the interested reader to the full paper.

Our main contribution is the introduction of a new algorithm for the construction of a finite set of generators
corresponding to the approximately vanishing ideal
of a data set $X\subseteq\mathbb{R}^n$, the *Conditional Gradients Approximately Vanishing Ideal algorithm* (CGAVI).
The novelty of our approach lies in the way CGAVI constructs generators of the approximately vanishing ideal.
The algorithm constructs generators by solving (constrained) convex optimization problems (CCOPs). In CGAVI, these CCOPs
are solved using the
*Pairwise Frank-Wolfe algorithm* (PFW) [LJ], whereas related methods
such as the *Approximate Vanishing Ideal algorithm* (AVI) [H] and *Vanishing Component Analysis* (VCA) [L]
employ *Singular Value Decompositions* (SVDs) to construct generators.
As we demonstrate in our paper, our approach admits the following attractive properties when the CCOP is the LASSO and is solved with PFW:

**Generalization bounds:** Under mild assumptions, the generators constructed with CGAVI provably vanish on out-of-sample data, and the combined approach of constructing generators with CGAVI to transform features for a linear kernel SVM inherits the margin bound of the SVM. To the best of our knowledge, these results cannot be extended to AVI or VCA.

**Sparse generators:** PFW is known to construct sparse iterates [LJ], which leads to the construction of sparse generators with CGAVI.

**Blueprint:** Even though we propose to solve the CCOP with PFW, it is possible to replace PFW with any solver of (constrained) convex optimization problems. Thus, our approach gives rise to a family of procedures for the construction of generators of the approximately vanishing ideal.

**Empirical results:** In practical experiments, we observe that CGAVI tends to construct fewer and sparser generators than AVI or VCA. For the combined approach of constructing generators to transform features for a linear kernel SVM, generators constructed with CGAVI lead to test set classification errors and evaluation times comparable to or better than generators constructed with related methods such as AVI or VCA.

From a high-level perspective, we reformulate the construction of generators as a (constrained) convex optimization problem, thus motivating the replacement of the SVD-based approach prevalent in most generator construction algorithms. Our approach enjoys theoretically appealing properties: we derive two generalization bounds that do not hold for SVD-based approaches, and since the solver of the CCOP can be chosen freely, CGAVI is highly modular. Practically, CGAVI can compete with and sometimes outperform SVD-based approaches and produces fewer and sparser generators than AVI or VCA.

[CLO] Cox, D., Little, J., and O’Shea, D. (2013). Ideals, varieties, and algorithms: an introduction to computational algebraic geometry and commutative algebra. Springer Science & Business Media.

[F] Frank, M. and Wolfe, P. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95–110.

[H] Heldt, D., Kreuzer, M., Pokutta, S., and Poulisse, H. (2009). Approximate computation of zero-dimensional polynomial ideals. Journal of Symbolic Computation, 44(11):1566–1591.

[LJ] Lacoste-Julien, S. and Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In Advances in neural information processing systems, pages 496–504.

[L] Livni, R., Lehavi, D., Schein, S., Nachliely, H., Shalev-Shwartz, S., and Globerson, A. (2013). Vanishing component analysis. In International Conference on Machine Learning, pages 597–605.

[SV] Suykens, J. A. and Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural processing letters, 9(3):293–300.

*Written by Francisco Criado and David Martínez-Rubio.*

Consider a linear resource allocation problem with $n$ users:

\[\tag{packing} \begin{align} \max &\ \ u(x_1, \dots, x_n) \\ \nonumber s.t. \ \ & Ax\leq b \\ \nonumber &\ \ x\geq 0 &\ \ A \in \mathcal{M}_{m\times n}(\mathbb{R}_{\geq 0}) \end{align}\]Here the constraints are linear and non-negative (that is, $A, b \geq 0$), which arises naturally, for example, if the resource has to be delivered via a network. We could set the utility $u(x_1,\dots, x_n)= \sum_{i\in [n]} x_i$ to maximize the total amount of the resource delivered. However, for some applications this could be *unfair*: one user could get a proportionally small increase in their allocated resources at the cost of some other user being completely ignored. The question is then: which *fairness measure* should we use to quantify the fairness of an allocation?

Under some natural axiomatic assumptions (see [BF11], [LKCS10]), the most fair such allocation is the one maximizing the product of the allocations or, equivalently, the utility $u(x_1,\dots, x_n)= \sum_{i\in[n]} \log x_i$. The solution maximizing this utility attains *proportional fairness*. This fairness criterion was first introduced by Nash [N50] and is also consistent with the logarithmic utility commonly found in portfolio optimization problems.

This motivates our study of the problem

\[\tag{primal} \begin{align} \label{eq:primal_problem} \max &\ \ f(x)=\sum_{i\in [n]} \log x_i \\ \nonumber s.t. &\ \ Ax\leq \textbf{1} \\ \nonumber & \ \ x\geq \textbf{0} \end{align}\]This problem is equivalent to maximizing the product of coordinates over the feasible region \(\mathcal{P} = \{x \in \mathbb{R}^n: Ax\leq \textbf{1}, x\geq \textbf{0}\}\). Note that we assume $b = \textbf{1}$ without loss of generality, since we can divide each row of $A$ by $b_j$ to obtain such a formulation. We also assume wlog that the maximum entry of each column is $1$, i.e., $\max_{j\in[m]} A_{ji} = 1$ for all $i \in [n]$, which can be obtained by rescaling the variables $x_i$ and the corresponding columns of $A$. The latter only adds a constant to the objective.
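As a sanity check, a small instance of (primal) can be solved with an off-the-shelf solver; the matrix below is hypothetical, and this generic approach does not enjoy the width-independent guarantees discussed next.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical small instance: 2 resources, 3 users.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.5, 1.0]])
n = A.shape[1]

# Maximize sum_i log x_i subject to A x <= 1, x >= 0 (as a minimization).
res = minimize(
    fun=lambda x: -np.sum(np.log(x)),
    x0=np.full(n, 0.1),                    # strictly feasible starting point
    jac=lambda x: -1.0 / x,
    bounds=[(1e-9, None)] * n,
    constraints=[{"type": "ineq", "fun": lambda x: 1.0 - A @ x}],
    method="SLSQP",
)
x_opt = res.x
print(np.round(x_opt, 3))
```

Unlike a throughput-maximizing solution, the proportionally fair optimum never drives any user's allocation to zero, since the logarithm penalizes that infinitely.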

A relevant quantity is the *width* of the matrix, which is the ratio between the maximum element of $A$ and the minimum nonzero element of $A$ and can be exponential in the input. [BNOT14] studied linearly constrained fairness problems with strongly concave utilities by applying an accelerated method to the dual problem and recovering a primal solution. Smoothness and Lipschitz constants of the former objectives do not scale poly-logarithmically with the width, and thus direct application of classical first-order methods leads to non-polynomial algorithms.

There is a more general fairness objective called $\alpha$-fair packing, for which packing proportional fairness corresponds to $\alpha=1$, packing linear programming results when $\alpha=0$ and the min-max fair allocation arises when $\alpha\to\infty$. [MSZ16] and [DFO20] studied this general setting and obtained algorithms with polylog dependence on the width (so the algorithms are polynomial). [MSZ16] focused on a stateless algorithm and [DFO20] obtained better convergence rates by forgoing the stateless property. For packing proportional fairness, the latter work obtained rates of $\widetilde{O}(n^2/\varepsilon^2)$. In contrast, our solution does not depend on the width and we obtain rates of $\widetilde{O}(n/\varepsilon)$ with an algorithm that is not stateless either. All these algorithms can work under a distributed model of computation that is natural in some applications [AK08]. In this model, there are $n$ agents and agent $j\in[n]$ has access to the $j$-th column of $A$ and to the slack $(Ax)_i -1$ of the constraints $i$ in which $j$ participates. The total work of an iteration is the number of non-zero entries of $A$, and is distributed across agents.

Our solution for the dual problem does not depend on the width either, and it converges at a rate of $\widetilde{O}(n^2/\varepsilon)$. We interpret the dual objective as the log-volume of a simplex covering the feasible region, and we use this interpretation to present an application to the approximation of the minimum-volume simplex $\Delta^{(k+1)}$ covering a polytope $\mathcal{P}$, where $\Delta^{(k+1)}$ is obtained from a previous bounding simplex $\Delta^{(k)}$ containing $\mathcal{P}$ and exactly one facet is allowed to move. This results in some improvements to the classical method of simplices algorithm of [YL82] for linear programming.

We designed a distributed accelerated algorithm for $1$-fair packing by using an accelerated technique that uses truncated gradients of a regularized objective, similarly to [AO19] for packing LP. However, in contrast, our algorithm and its guarantees are deterministic. Also, our algorithm makes use of a different regularization and an analysis that yields accelerated additive error guarantees as opposed to multiplicative ones. The regularization already appeared in [DFO20] with a different algorithm and analysis.

We reparametrize our objective function $f$, with optimum $f^\ast$ in Problem (primal), so that it becomes linear at the expense of making the constraints more complex. The optimization problem becomes

\[\max_{x\in \mathbb{R}^{n}}\left\{\hat{f}(x) = \langle \mathbb{1}_{n}, x\rangle: A\exp(x) \leq \mathbb{1}_{m}\right\}.\]Then, we regularize the negative of the reparametrized objective by adding a fast-growing barrier, which we minimize over a box $\mathcal{B} = [-\omega, 0]^n$, for some value of $\omega$ chosen so that the optimizer must lie in the box. This redundant constraint is introduced to later guarantee a bound on the regret of the mirror descent method that runs within the algorithm. The final problem is:

\[\min_{x\in\mathcal{B}}\{f_r(x)= -\langle \mathbb{1}_{n}, x \rangle + \frac{\beta}{1+\beta}\sum_{i=1}^{m} (A\exp(x))_i^{\frac{1+\beta}{\beta}} \},\]for a parameter $\beta$ that is roughly $\varepsilon/(n\log(n/\varepsilon))$. This choice of $\beta$ makes the regularizer add a penalty of roughly $\varepsilon$ if $(A\exp(x))_i > 1+\varepsilon/n$ for some $i\in[m]$, while points that satisfy the constraints and are not too close to the boundary incur a negligible penalty. This allows us to show that it is enough to minimize $f_r$ as defined above in order to solve the original problem. The regularized function also satisfies $\nabla f_r(x) \in [-1, \infty)^{n}$, so whenever a gradient coordinate is large, it is positive, which allows for taking a gradient step that decreases the function significantly. In particular, for a point $y^{(k)}$ obtained by a gradient step from $x^{(k)}$ with the right learning rate we can show

\[f_r(x^{(k)}) -f_r(y^{(k)}) \geq \frac{1}{2}\langle \nabla f_r(x^{(k)}), x^{(k)}-y^{(k)}\rangle \geq 0.\]This is a smoothness-like property that we exploit, in combination with a mirror descent run on truncated losses, in a linear coupling [AO17] argument to obtain an accelerated deterministic algorithm. In short, the local smoothness of the function $f_r$ is large, so instead of feeding the gradient to mirror descent and obtaining a regret of the order of $\lVert\nabla f_r(x)\rVert^2$, we run a mirror descent algorithm with losses $\ell_k$ equal to $\nabla f_r(x^{(k)})$ but clipped so that each coordinate lies in $[-1, 1]$. Then, we couple this mirror descent with the gradient descent above and show that the progress of the latter compensates for the regret of the mirror descent step and for the part of the regret we ignored when truncating the losses, i.e., for $\langle \nabla f_r(x^{(k)})-\ell_k, z^{(k)}-x_r^\ast \rangle$, where $z^{(k)}$ is the mirror point and $x_r^\ast$ is the minimizer of $f_r$.
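The gradient lower bound on $f_r$ can be checked numerically. A sketch (with a much larger $\beta$ than the theory prescribes, purely for a numerically stable check): differentiating $f_r$ gives $\partial f_r/\partial x_j = -1 + \sum_i (A\exp(x))_i^{1/\beta} A_{ij} e^{x_j}$, whose second term is non-negative since $A \geq 0$.

```python
import numpy as np

def f_r_and_grad(x, A, beta):
    """f_r(x) = -<1, x> + beta/(1+beta) * sum_i (A exp(x))_i^{(1+beta)/beta}."""
    s = A @ np.exp(x)  # (A exp(x))_i >= 0
    f = -x.sum() + beta / (1.0 + beta) * np.sum(s ** ((1.0 + beta) / beta))
    # grad_j(x) = -1 + sum_i s_i^{1/beta} * A_{ij} * exp(x_j) >= -1 since A >= 0
    grad = -1.0 + (s ** (1.0 / beta)) @ A * np.exp(x)
    return f, grad

rng = np.random.default_rng(0)
A = rng.random((4, 3))
A /= A.max(axis=0)        # normalize so each column has maximum entry 1
x = -rng.random(3)        # a point in the box [-omega, 0]^n
beta = 0.5                # far larger than eps/(n log(n/eps)); for a stable demo only
f_val, grad = f_r_and_grad(x, A, beta)
print(grad.min() >= -1.0)  # True: every gradient coordinate is at least -1
```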

After a careful choice of learning rates $\eta_k$, coupling parameter $\tau$, box width $\omega$, parameter $L$ and number of iterations $T$ (that are computed given the known quantities $\varepsilon$ and $A \in \mathbb{R}^{m\times n}_{\geq 0}$), the final algorithm has a simple form as a linear coupling algorithm that runs in $T = \widetilde{O}(n/\varepsilon)$ iterations.

**Accelerated descent method for 1-fair packing**

**Input:** Normalized matrix $A \in \mathbb{R}^{m\times n}_{\geq 0}$ and accuracy $\varepsilon$.

- $x^{(0)} \gets y^{(0)} \gets z^{(0)} \gets -\omega \textbf{1}_n$
- **for** $k = 1$ **to** $T$
  - $x^{(k)} \gets \tau z^{(k-1)} + (1-\tau) y^{(k-1)}$
  - $z^{(k)} \gets \operatorname{argmin}_{z\in \mathcal{B}}\left( \frac{1}{2\omega}\lVert z-z^{(k-1)}\rVert_2^2 + \langle \eta_k\ell_k, z\rangle \right)$ (**Mirror descent step**)
  - $y^{(k)} \gets x^{(k)} + \frac{1}{\eta_k L}(z^{(k)}-z^{(k-1)})$ (**Gradient descent step**)
- **end for**
- **return** $\widehat{x} \stackrel{\mathrm{\scriptscriptstyle def}}{=} \exp(y^{(T)})/(1+\varepsilon/n)$
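A schematic NumPy translation of this loop is below; the choices of $\omega$, $\tau$, $\eta_k$, $L$, and $\beta$ here are rough placeholders, not the tuned values from the paper. Because the Bregman term is Euclidean and $\mathcal{B}$ is a box, the mirror step has a closed form: a coordinatewise clip of an unconstrained gradient step.

```python
import numpy as np

def one_fair_packing(A, eps, T=200, omega=5.0, tau=0.1, L=10.0, beta=0.5):
    """Sketch of the accelerated loop; all parameter choices here are placeholders."""
    m, n = A.shape
    # Gradient of the regularized objective f_r.
    grad = lambda x: -1.0 + ((A @ np.exp(x)) ** (1.0 / beta)) @ A * np.exp(x)
    x = y = z = -omega * np.ones(n)
    for k in range(1, T + 1):
        eta = k / L                                          # placeholder learning rate
        x = tau * z + (1.0 - tau) * y
        ell = np.clip(grad(x), -1.0, 1.0)                    # truncated losses
        z_new = np.clip(z - omega * eta * ell, -omega, 0.0)  # mirror step on the box B
        y = x + (z_new - z) / (eta * L)                      # coupled gradient step
        z = z_new
    return np.exp(y) / (1.0 + eps / n)

A = np.array([[1.0, 0.5],
              [0.25, 1.0]])
x_hat = one_fair_packing(A, eps=0.1)
print(np.round(x_hat, 3))
```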

Now let us look at the Lagrangian dual of (primal):

\[\tag{dual} \begin{align} \label{eq:dual_problem} \max &\ \ g(\lambda)=-\sum_{i\in [n]} \log (A^T\lambda)_i \\ \nonumber s.t. &\ \ \lambda\in \Delta^m \end{align}\]Here $\Delta^m$ is the standard probability simplex, that is, $\sum_{i\in [m]} \lambda_i=1$, $\lambda\geq \textbf{0}$. Recall that the feasible region of (primal) was the positive polyhedron $\mathcal{P}$. In the dual, we study the dual feasible region \(\mathcal{D}^+ = \{A^T \lambda + \mu : \lambda \in \Delta^m, \mu\in \mathbb{R}_{\geq 0}^n \}\). It turns out that $\mathcal{D}^+$ is exactly the set of vectors $h \in \mathbb{R}^n_{\geq 0}$ such that $\langle h, x\rangle \leq 1$ for all $x\in \mathcal{P}$.

In other words, $\mathcal{D}^+$ is the set of positive constraints covering $\mathcal{P}$, if we represent the halfspace \(\{x\in\mathbb{R}^n_{\geq 0} : \langle h,x \rangle \leq 1 \}\) by the vector $h$. $\mathcal{D}^+$ also contains the related polytope \(\mathcal{D}=\{ A^T \lambda : \lambda \in \Delta^m \}\). Problem (dual) actually optimizes over $\mathcal{D}$, but it can be shown that expanding the feasible region to $\mathcal{D}^+$ does not change the optimum. A crucial observation for later is that if $\lambda^{opt}$ is the optimum of (dual), then $A^T \lambda^{opt}$ is the halfspace covering $\mathcal{P}$ that minimizes the volume of the simplex it encloses with the positive orthant.

Now, consider the following map $c : \mathcal{D}^+ \rightarrow \mathbb{R}^n_{\geq 0}$:

\[c(h) = \left( \frac{1}{nh_1}, \dots, \frac{1}{nh_n}\right).\]We call this map the *centroid map*, as it maps the hyperplane $H=\{x\in\mathbb{R}^n_{\geq 0} : \langle h, x \rangle = 1 \}$ to the centroid (barycenter) of the simplex formed by its intersection with the positive orthant.
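The name can be verified directly: the simplex cut out by $\langle h, x\rangle = 1$ in the positive orthant has vertices $e_i/h_i$, and $c(h)$ is exactly their average. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
h = rng.random(n) + 0.1          # a positive normal vector

c = 1.0 / (n * h)                # the centroid map c(h)

# Vertices of the simplex {x >= 0 : <h, x> = 1} are e_i / h_i; c(h) is their barycenter.
vertices = np.diag(1.0 / h)
barycenter = vertices.mean(axis=0)
print(np.allclose(c, barycenter), abs(h @ c - 1.0) < 1e-12)  # True True
```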

The primal and dual problems are related by this centroid map: if $x^{opt}$ is the optimum of (primal) and $\lambda^{opt}$ is the optimum of (dual) (both problems have a unique solution because of strong convexity), then $x^{opt} = c(A^T \lambda^{opt})$.

**Figure 1.** The centroid map.

This means the primal optimum is the unique point in the intersection $\mathcal{P} \cap c(\mathcal{D}^+)$. Since $c(\mathcal{D}^+)$ is convex, this is a linear feasibility problem over a convex set. We use the Plotkin–Shmoys–Tardos (PST) framework for this problem, in a version inspired by [AHK12] that is better suited for this purpose.

The PST algorithm requires an *oracle* which, for a given “query” halfspace $h$, returns a point $x\in c(\mathcal{D}^+)$ such that $\langle h, x\rangle \leq 1$. The closer the points returned by the oracle are to the optimum $x^{opt}$, the faster our algorithm runs. The oracle we suggest depends on a feasible solution to (dual), and its performance improves as that solution gets closer to the optimum of (dual).

In particular, our oracle depends on a solution $s$ and the points it returns are in a region we call the *lens* of $s$, $\mathcal{L}_{\delta}$:

**Figure 2.** The primal polytope $P$, and the lens of a feasible solution.

As the figure illustrates, the lens of $s$ becomes smaller as $s$ improves as a solution. For this reason, we use a restart scheme: First we compute some approximate solution, then we use that approximate solution as the input for the oracle in the next restart. With this approach, we attain the following result:

**Theorem 9.**
Let $\varepsilon \in (0,n(n-1)]$ be an accuracy parameter. There is an algorithm that finds a linear combination of the rows of $A$, $\lambda\in\Delta^m$ such that $g(\lambda)$ is an $(\varepsilon/n)$-approximate solution of (dual) after $\widetilde{O}( n^2/\varepsilon)$ iterations.

The Yamnitsky-Levin algorithm [YL82] is an algorithm for the linear feasibility problem. It is very similar to the ellipsoid method:

**Input:** A matrix $A\in\mathbb{R}^{m\times n}_{\geq 0}$, and a vector $b\in\mathbb{R}^m$

**Output:** Either a point $x\in\mathcal{P}$, where $\mathcal{P}=\{x\in \mathbb{R}^n_{\geq 0} : Ax\leq b \}$, or the guarantee that $\mathcal{P}$ has volume $\leq \varepsilon$.

- Start with a simplex $\Delta$ covering $\mathcal{P}$.
- **while** the centroid $c$ of $\Delta$ is not in $\mathcal{P}$
  - find a hyperplane separating $c$ from $\mathcal{P}$ (a row of $Ax\leq b$).
  - combine the separating hyperplane with $\Delta$ to find a new simplex $\Delta$ with smaller volume.
- **end while**
- **return** $c$

The interested reader can find the details in [YL82]. Observe that by choosing a suitable change of basis, we can map any $(d-1)$ facets of $\Delta$ onto the boundary of the positive orthant. The Yamnitsky-Levin algorithm then tries to find a hyperplane for the last facet minimizing the simplex volume while still covering $\mathcal{P}$.

Recall that this is exactly what (dual) does, except that we only consider the positive constraints. In a way, (dual) solves the Yamnitsky-Levin problem with two changes: it considers more than one constraint at the same time, but it can only change one facet of the simplex at a time.

It is possible to replace the Yamnitsky-Levin simplex pivoting step with the algorithm in Theorem 9. However, it is not clear yet how this affects its performance.

*We would like to thank Prof. Elias Koutsoupias for sparking our interest in this problem.*

[AK08] Baruch Awerbuch and Rohit Khandekar. Stateless distributed gradient descent for positive linear programs. Proceedings of the fortieth annual ACM symposium on Theory of computing - STOC 08, page 691, 2008.

[AO17] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. In Christos H. Papadimitriou, editor, 8th Innovations in Theoretical Computer Science Conference, ITCS 2017, January 9-11, 2017, Berkeley, CA, USA, volume 67 of LIPIcs, pages 3:1–3:22. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017.

[AO19] Zeyuan Allen-Zhu and Lorenzo Orecchia. Nearly linear-time packing and covering LP solvers achieving width-independence and 1/epsilon-convergence. Math. Program., 175(1-2):307–353, 2019. doi: 10.1007/s10107-018-1244-x. URL https://doi.org/10.1007/s10107-018-1244-x.

[AHK12] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory Comput., 8(1):121–164, 2012. doi: 10.4086/toc.2012. v008a006. URL https:// doi.org/ 10.4086/ toc.2012.v008a006.

[BF11] Dimitris Bertsimas, Vivek F. Farias, and Nikolaos Trichakis. The price of fairness. Oper. Res., 59 (1):17–31, 2011. doi: 10.1287/opre.1100.0865. URL https://doi.org/10.1287/opre.1100.0865.

[LKCS10] Tian Lan, David T. H. Kao, Mung Chiang, and Ashutosh Sabharwal. An axiomatic theory of fairness in network resource allocation. In *INFOCOM 2010, 29th IEEE International Conference on Computer Communications, 15-19 March 2010, San Diego, CA, USA*, pages 1343–1351. IEEE, 2010. doi: 10.1109/INFCOM.2010.5461911. URL https://doi.org/10.1109/INFCOM.2010.5461911.

[BNOT14] Amir Beck, Angelia Nedic, Asuman E. Ozdaglar, and Marc Teboulle. An $O(1/k)$ gradient method for network resource allocation problems. *IEEE Trans. Control. Netw. Syst., 1(1):64–73, 2014*. doi: 10.1109/TCNS.2014.2309751. URL https://doi.org/10.1109/TCNS.2014.2309751.

[MSZ16] Jelena Marašević, Clifford Stein, and Gil Zussman. A fast distributed stateless algorithm for alpha-fair packing problems. In *Ioannis Chatzigiannakis, Michael Mitzenmacher, Yuval Rabani, and Davide Sangiorgi, editors, 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, volume 55 of LIPIcs, pages 54:1–54:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2016*. doi: 10.4230/LIPIcs.ICALP.2016.54. URL https://doi.org/10.4230/LIPIcs.ICALP.2016.54.

[DFO20] Jelena Diakonikolas, Maryam Fazel, and Lorenzo Orecchia. Fair packing and covering on a relative scale. *SIAM J. Optim., 30(4):3284–3314, 2020*. doi: 10.1137/19M1288516. URL https://doi.org/10.1137/19M1288516.

[N50] John F. Nash. The bargaining problem. *Econometrica, 18(2):155–162, 1950. ISSN 00129682, 14680262*. URL http://www.jstor.org/stable/1907266.

[YL82] Boris Yamnitsky and Leonid A. Levin. An old linear programming algorithm runs in polynomial time. In 23rd Annual Symposium on Foundations of Computer Science, Chicago, Illinois, USA, 3-5 November 1982, pages 327–328. IEEE Computer Society, 1982.

*Written by Alejandro Carderera.*

Consider a problem of the sort:
\(\tag{minProblem}
\begin{align}
\label{eq:minimizationProblem}
\min\limits_{x \in \mathcal{X}} f(x),
\end{align}\)
where $\mathcal{X}$ is a compact convex set and $f(x)$ is a *generalized self-concordant* (GSC) function. This class of functions, which can informally be defined as those whose third derivative is bounded by their second derivative, has played an important role in the development of polynomial time algorithms for optimization, and also happens to appear in many machine learning problems. For example, the objective function encountered in logistic regression, or in marginal inference with concave maximization [KLS], belongs to this family of functions.

As in previous posts, our focus is on Frank-Wolfe or Conditional Gradient algorithms, and we assume that solving an LP over $\mathcal{X}$ is easy, but projecting onto $\mathcal{X}$ is hard. Additionally, we assume access to first-order and zeroth-order information about the function. Existing algorithms for this class of functions require access to second-order information, or local smoothness estimates, to achieve a $\mathcal{O}\left( 1/t \right)$ rate of convergence in primal gap [DSSS]. With the *Monotonous Frank-Wolfe* (M-FW) algorithm we require neither of these, achieving a $\mathcal{O}\left( 1/t \right)$ rate both in primal gap *and* Frank-Wolfe gap with a simple six-line algorithm that only requires access to a domain oracle, that is, an oracle that checks whether $x \in \mathrm{dom} f$. This extra oracle has to be used if one is to avoid assuming access to second-order information, and it is also implicitly used by the existing algorithms that compute local estimates of the smoothness. The proofs of convergence for both the primal gap and the Frank-Wolfe gap are simple and easy to follow, which adds to the appeal of the algorithm.

Additionally, we also show improved rates of convergence with a backtracking line search [PNAM] (which locally estimates the smoothness) when the optimum is contained in the interior of $\mathcal{X} \cap \mathrm{dom} f$, when $\mathcal{X}$ is uniformly convex, or when $\mathcal{X}$ is a polytope. The contributions are summarized in the table below.

Many convergence proofs in optimization use the smoothness inequality to bound the progress that an algorithm makes per iteration when moving from $x_{t}$ to $x_{t} + \gamma_t (v_{t} - x_{t})$. For smooth functions this inequality holds globally for all $x_{t}$ and $x_{t} + \gamma_t (v_{t} - x_{t})$. For GSC functions we also have a *smoothness-like* inequality with which we can bound progress. The problem is that this inequality only holds locally around $x_t$, and if one wants to test whether the *smoothness-like* inequality is valid between $x_t$ and $x_{t} + \gamma_t (v_{t} - x_{t})$, one needs knowledge of $\nabla^2 f(x_t)$ and of several parameters of the function. Several of the algorithms presented in [DSSS] utilize this approach in order to compute a step size $\gamma_t$ such that the *smoothness-like* inequality holds between $x_t$ and $x_{t} + \gamma_t (v_{t} - x_{t})$. Alternatively, one can use the backtracking line search of [PNAM] to find a $\gamma_t$ and a smoothness estimate such that a local smoothness inequality holds between $x_t$ and $x_{t} + \gamma_t (v_{t} - x_{t})$.

We take a different approach to prove a convergence bound, which we review after describing our algorithm. The Monotonous Frank-Wolfe (M-FW) algorithm below is a rather simple, but powerful modification of the standard Frank-Wolfe algorithm, with the only difference being that before taking a step, we verify whether $x_t +\gamma_t \left( v_t - x_t\right) \in \mathrm{dom} f$, and if so, we check whether moving to the next iterate provides primal progress. Note that the open-loop step size rule $2/(2+t)$ does not guarantee monotonous primal progress for the vanilla Frank-Wolfe algorithm in general. If either of these two checks fails, we simply do not move: the algorithm sets $x_{t+1} = x_t$. Note that in this case we do not need to recompute the gradient or make an LP call at iteration $t+1$, as we can simply reuse $v_t$.

**Monotonous Frank-Wolfe (M-FW) algorithm**

*Input:* Initial point $x_0 \in \mathcal{X}$

*Output:* Point $x_{T+1} \in \mathcal{X}$

For $t = 1, \dots, T$ do:

\(\quad v_t \leftarrow \mathrm{argmin}_{v \in \mathcal{X}}\left\langle \nabla f(x_t),v \right\rangle\)

\(\quad \gamma_t \leftarrow 2/(2+t)\)

\(\quad x_{t+1} \leftarrow x_t +\gamma_t \left( v_t - x_t\right)\)

\(\quad \text{if } x_{t+1} \notin \mathrm{dom} f \text{ or } f(x_{t+1}) > f(x_t) \text{ then}\)

\(\quad \quad x_{t + 1} \leftarrow x_t\)

End For
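In code, the algorithm might look as follows; the portfolio-style objective $f(x) = -\sum_t \log(r_t^\top x)$ over the probability simplex is a hypothetical stand-in for a GSC instance (the return matrix below is random, not from the paper's experiments).

```python
import numpy as np

def mfw(f, grad, lp_oracle, in_domain, x0, T=500):
    """Monotonous Frank-Wolfe sketch: reject steps that leave dom f or increase f."""
    x = x0
    for t in range(1, T + 1):
        v = lp_oracle(grad(x))
        gamma = 2.0 / (2.0 + t)
        x_new = x + gamma * (v - x)
        if in_domain(x_new) and f(x_new) <= f(x):  # the two monotonicity checks
            x = x_new                              # otherwise: do not move
    return x

# Hypothetical portfolio-style GSC objective f(x) = -sum_t log(r_t . x) on the simplex.
rng = np.random.default_rng(2)
R = rng.random((30, 5)) + 0.5                  # hypothetical positive return matrix
f = lambda x: -np.sum(np.log(R @ x))
grad = lambda x: -R.T @ (1.0 / (R @ x))
lp_oracle = lambda c: np.eye(5)[np.argmin(c)]  # LP over the simplex returns a vertex
in_domain = lambda x: np.all(R @ x > 0)

x0 = np.full(5, 0.2)
x_final = mfw(f, grad, lp_oracle, in_domain, x0)
print(f(x_final) <= f(x0))  # True: accepted steps never increase f
```

The sketch recomputes the gradient and LP call even after a rejected step; the actual algorithm would cache $v_t$ in that case, as noted above.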

The simple structure of the algorithm above allows us to prove a $\mathcal{O}(1/t)$ convergence bound in primal gap and Frank-Wolfe gap. To do this we use an inequality that holds if $d(x_t +\gamma_t \left( v_t - x_t\right), x_t) \leq 1/2$, where $d(x,y)$ is a distance function that depends on the structure of the GSC function. Namely, the inequality that we use is: \(\begin{align} f(x_t +\gamma_t \left( v_t - x_t\right)) - f(x^*) \leq (f(x_t)-f(x^*))(1-\gamma_t) + \gamma_t^2 L_{f,x_0}D^2 \omega(1/2), \end{align}\) where $D$ is the diameter of $\mathcal{X}$, and $L_{f,x_0}$ and $\omega(1/2)$ are constants that depend on the function and the starting point (which we do not need to know). If we had knowledge of $\nabla^2 f(x_t)$ we could compute the value of $\gamma_t$ that ensures $d(x_t +\gamma_t \left( v_t - x_t\right), x_t) \leq 1/2$; however, we purposefully do not want to use second-order information!

We briefly (and informally) describe how we prove this convergence bound for the primal gap: As the iterates make monotonous progress, and the step size $\gamma_t = 2/(2+t)$ in our scheme decreases continuously, there is an iteration $T$, which depends on the function, after which the *smoothness-like* inequality holds for all $t \geq T$ between $x_t$ and $x_t +\gamma_t \left( v_t - x_t\right)\in \mathrm{dom} f$, i.e., we guarantee that $d(x_t +\gamma_t \left( v_t - x_t\right), x_t) \leq 1/2$ for $t \geq T$ (without the need to know any parameters). However, note that in order to take a non-zero step size we also need to ensure that $f(x_{t+1}) < f(x_t)$. We complete the convergence proof using induction, that is, assuming that $f(x_t) - f(x^\esx) \leq C(T+1)/(t+1)$ where $C$ is a constant, and the following subtlety – the *smoothness-like* inequality will only guarantee progress (i.e., $f(x_{t+1}) < f(x_t)$) at iteration $t$ if $\gamma_t$ is smaller than the primal gap at iteration $t$ divided by $L_{f,x_0}D^2 \omega(1/2)$. We can see this by going back to the inequality above: we will be able to guarantee that $f(x_{t+1}) < f(x_t)$ if
\(\begin{align}
\gamma_t(f(x_t) - f(x^*)) - \gamma_t^2L_{f,x_0}D^2 \omega(1/2) > 0.
\end{align}\)
If this is true, we can guarantee that we set $x_{t+1} = x_t +\gamma_t \left( v_t - x_t\right)$ and we can bound the progress using the *smoothness-like* inequality. Using the aforementioned fact and our induction hypothesis that \(f(x_t) - f(x^\esx) \leq C(T+1)/(t+1)\), we prove the claim $f(x_{t+1}) - f(x^\esx) \leq C(T+1)/(t+2)$. Assume however that this is not the case and that the following inequality holds:
\(\begin{align}
\gamma_t(f(x_t) - f(x^*)) - \gamma_t^2L_{f,x_0}D^2 \omega(1/2) \leq 0.
\end{align}\)

As each iteration of the algorithm makes at most one first-order oracle call, one zeroth-order oracle call, one LP call, and one domain oracle call, we can bound the number of oracle calls needed to achieve an $\epsilon$ tolerance in primal gap (or Frank-Wolfe gap) directly using the iteration-complexity bound of $\mathcal{O}(1/\epsilon)$.

Note that we can also implement the M-FW algorithm using a halving strategy for the step size, instead of the $2/(2+t)$ step size. This strategy helps deal with the case in which a large number of consecutive step sizes $\gamma_t$ are rejected, either because $x_t + \gamma_t(v_t - x_t) \notin \mathrm{dom} f$ or because $f(x_t) < f(x_t + \gamma_t(v_t - x_t))$. The strategy consists of halving the step size whenever we encounter either of these two cases. This results in a step size that is at most a factor of 2 smaller than the one that would have been accepted with the original strategy. However, the number of zeroth-order or domain oracle calls needed to find a step size that satisfies the desired properties is logarithmic compared to the number needed for the $2/(2+t)$ variant. The convergence properties established throughout the paper for M-FW also hold for the variant with the halving strategy, with the only difference being that we lose a small constant factor in the convergence rate.
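The halving strategy can be sketched as a small helper (a hedged sketch, not the paper's exact procedure): halve $\gamma$ until the candidate point is both in the domain and no worse in function value.

```python
def halving_step(f, in_domain, x, v, gamma):
    """Halve gamma until x + gamma*(v - x) lies in dom f and does not increase f."""
    while gamma > 1e-12:
        x_new = x + gamma * (v - x)
        if in_domain(x_new) and f(x_new) <= f(x):
            return x_new, gamma
        gamma /= 2.0
    return x, 0.0  # no acceptable step found: stay put

# 1-d toy: dom f = (0.1, inf), so the full step to 0 is rejected once and halved.
x_new, gamma = halving_step(lambda x: x * x, lambda x: x > 0.1, 1.0, 0.0, 1.0)
print(x_new, gamma)  # 0.5 0.5
```

Only logarithmically many zeroth-order/domain oracle calls are spent per accepted step, and the accepted $\gamma$ is within a factor of 2 of the largest acceptable one, matching the discussion above.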

We compare the performance of the M-FW algorithm with that of other projection-free algorithms that apply to the GSC setting. That is, we compare to the B-FW and GSC-FW algorithms of [DSSS], the non-monotonous standard FW algorithm, for which there are no formal convergence guarantees for this class of problems, and the B-AFW algorithm. Note that B-AFW is simply the AFW algorithm with the backtracking strategy of [PNAM], for which we also provide convergence guarantees for GSC functions in some special cases in the paper.

**Figure 1.** Portfolio Optimization.

**Figure 2.** Signal recovery with KL divergence.

**Figure 3.** Logistic regression over $\ell_1$ unit ball.

**Figure 4.** Logistic regression over the Birkhoff polytope.

[KLS] Krishnan, R. G., Lacoste-Julien, S., and Sontag, D. Barrier Frank-Wolfe for Marginal Inference. In *Proceedings of the 28th Conference in Neural Information Processing Systems*. PMLR, 2015. pdf

[DSSS] Dvurechensky, P., Safin, K., Shtern, S., and Staudigl, M. Generalized self-concordant analysis of
Frank-Wolfe algorithms. *arXiv preprint arXiv:2010.01009*, 2020b. pdf

[PNAM] Pedregosa, F., & Negiar, G. & Askari, A. & Jaggi, M. (2020). Linearly Convergent Frank-Wolfe with Backtracking Line-Search. In *Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics*. pdf

So you are tired of classifying cats and dogs? You are done with Kaggle competitions? How about trying something new? This year at NeurIPS there will be a new competition, "Machine Learning for Combinatorial Optimization (ML4CO)", which is about improving Integer Programming solvers by means of Machine Learning.

In contrast to many other learning tasks the associated learning problems have a couple of characteristics that make them especially hard:

- Sampling and data acquisition is usually quite expensive and noisy
- There are strong interactions between decisions
- There are lots of long-range dependencies

So if you want to try something different, this might be a good chance. The (semi-)official announcement roughly reads as follows:

The Machine Learning for Combinatorial Optimization (ML4CO) NeurIPS 2021 competition aims at improving state-of-the-art combinatorial optimization solvers by replacing / integrating key heuristic components with machine learning models. The competition’s main scientific question is the following: is machine learning a viable option for improving traditional combinatorial optimization solvers on specific problem distributions, when historical data is available?

The webpage of the competition with all necessary information is

https://www.ecole.ai/2021/ml4co-competition

and the preregistration form is

https://forms.gle/pv6aaXxZ9iGYVCtj9

The ML4CO organizers

There will be three tasks that one can compete in this year:

The first one is really about *primal solutions*: producing new feasible solutions with good objective function values fast in order to minimize the so-called *primal integral*.

The second task is about closing the *dual gap*: learning to select branching variables. This can have many positive effects in the solution process and the aggregate measure that is considered here is the *dual integral*.

Finally, the third task is more of a traditional configuration learning task. Integer Programming solvers’ performance heavily depends on the chosen parameters. The task here is to learn a good set of parameters for a given problem instance and the considered metric is the *primal-dual integral*.

This competition is gonna be lit 🔥 and the whole team is super excited to see what y’all come up with!
