<h1 id="second-order-conditional-gradient-sliding">Second-order Conditional Gradient Sliding</h1>
<p><em>2020-06-20</em></p>
<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/2002.08907">Second-order Conditional Gradient Sliding</a> by <a href="https://alejandro-carderera.github.io/">Alejandro Carderera</a> and <a href="http://www.pokutta.com/">Sebastian Pokutta</a>, where we present a second-order analog of the Conditional Gradient Sliding algorithm [LZ] for smooth and strongly-convex minimization problems over polytopes. The algorithm combines Inexact Projected Variable-Metric (PVM) steps with independent Away-step Conditional Gradient (ACG) steps to achieve global linear convergence and local quadratic convergence in primal gap. The resulting algorithm outperforms other projection-free algorithms in applications where first-order information is costly to compute.</em>
<!--more--></p>
<p><em>Written by Alejandro Carderera.</em></p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>Consider a problem of the form:
\[\tag{minProblem}
\begin{align}
\label{eq:minimizationProblem}
\min\limits_{x \in \mathcal{X}} f(x),
\end{align}\]
where $\mathcal{X}$ is a polytope and $f(x)$ is a twice differentiable function that is strongly convex and smooth. We assume that solving an LP over $\mathcal{X}$ is easy, but projecting using the Euclidean norm (or any other norm) onto $\mathcal{X}$ is expensive. Moreover, we also assume that evaluating $f(x)$ is expensive, and so is computing the gradient and the Hessian of $f(x)$. An example of such an objective function can be found when solving an MLE problem to estimate the parameters of a Gaussian distribution modeled as a sparse undirected graph [BEA] (also known as the Graphical Lasso problem). Another example is the objective function used in logistic regression problems when the number of samples is high.</p>
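<p>To make the cost concrete, here is a small numpy sketch of an $\ell_2$-regularized logistic-regression objective (a hypothetical stand-in, not the paper's exact experimental setup): every evaluation of $f$ or its gradient touches all $n$ samples, so both cost $O(nd)$, and Hessian information is more expensive still.</p>

```python
import numpy as np

def logistic_loss(w, X, y, mu=1e-3):
    """mu-strongly convex, smooth logistic loss; each call scans all n samples."""
    z = y * (X @ w)                       # one margin per sample: O(n d)
    return np.mean(np.log1p(np.exp(-z))) + 0.5 * mu * w @ w

def logistic_grad(w, X, y, mu=1e-3):
    z = y * (X @ w)
    s = -y / (1.0 + np.exp(z))            # derivative of the loss w.r.t. each margin
    return X.T @ s / len(y) + mu * w      # again O(n d)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))       # n = 1000 samples, d = 20 features
y = np.sign(rng.standard_normal(1000))
w = np.zeros(20)
# finite-difference check of the gradient along a random direction
d = rng.standard_normal(20)
eps = 1e-6
fd = (logistic_loss(w + eps * d, X, y) - logistic_loss(w - eps * d, X, y)) / (2 * eps)
assert abs(fd - logistic_grad(w, X, y) @ d) < 1e-5
```

<p>The larger the sample count $n$, the stronger the case for working with the cheaper quadratic model introduced next.</p>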
<h2 id="projected-variable-metric-algorithms">Projected Variable-Metric algorithms</h2>
<p>Working with such unwieldy functions is often too expensive, and so a popular approach to tackling (minProblem) is to construct an approximation to the original function whose gradients are easier to compute. A linear approximation of $f(x)$ at $x_k$ using only first-order information will not contain any curvature information, giving us little to work with. Consider, on the other hand, a quadratic approximation of $f(x)$ at $x_k$, denoted by $\hat{f}_k(x)$, that is:</p>
\[\tag{quadApprox}
\begin{align}
\label{eq:quadApprox}
\hat{f_k}(x) = f(x_k) + \left\langle \nabla f(x_k), x - x_k \right\rangle + \frac{1}{2} \norm{x - x_k}_{H_k}^2,
\end{align}\]
<p>where $H_k$ is a positive definite matrix that approximates the Hessian $\nabla^2 f(x_k)$. Algorithms that minimize the quadratic approximation $\hat{f}_k(x)$ over $\mathcal{X}$ at each iteration and set</p>
\[x_{k+1} = x_k + \gamma_k (\operatorname{argmin}_{x\in \mathcal{X}} \hat{f_k}(x) - x_k)\]
<p>for some \(\gamma_k \in [0,1]\) are dubbed <em>Projected Variable-Metric</em> (PVM) algorithms. These algorithms are useful when the progress per unit time obtained by moving towards the minimizer of $\hat{f}_k(x)$ over $\mathcal{X}$ at each time step is greater than the progress per unit time obtained by taking a step of any other first-order algorithm that makes use of the original function (whose gradients are very expensive to compute). We define the scaled projection of $x$ onto $\mathcal{X}$ when we measure the distance in the $H$-norm as \(\Pi_{\mathcal{X}}^{H} (y) \stackrel{\mathrm{\scriptscriptstyle def}}{=} \text{argmin}_{x\in\mathcal{X}} \norm{x - y}_{H}\). This allows us to interpret the steps taken by PVM algorithms as:</p>
\[\tag{stepPVM}
\begin{align}
\label{eq:stepPVM}
\operatorname{argmin}_{x\in \mathcal{X}} \hat{f_k}(x) = \Pi_{\mathcal{X}}^{H_k} \left( x_k - H_k^{-1} \nabla f(x_k) \right).
\end{align}\]
<p>These algorithms owe their name to this interpretation: at each iteration, as $H_k$ varies, we change the metric (the norm) with which we perform the scaled-projections, and we deform the negative of the gradient using this metric. The next image gives a schematic overview of a step of the PVM algorithm. The polytope $\mathcal{X}$ is depicted with solid black lines, the contour lines of the original objective function $f(x)$ are depicted with solid blue lines, and the contour lines of the quadratic approximation \(\hat{f}_k(x)\) are depicted with dashed red lines. Note that \(x_k - H_k^{-1}\nabla f(x_k)\) is the unconstrained minimizer of the quadratic approximation \(\hat{f}_k(x)\). The iterate used in the PVM algorithm to define the directions along which we move, i.e., \(\text{argmin}_{x\in \mathcal{X}} \hat{f}_k(x)\), is simply the scaled projection of that unconstrained minimizer onto \(\mathcal{X}\) using the norm \(\norm{\cdot}_{H_k}\) defined by $H_k$.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/socgs/SchematicAlgorithm.png" alt="Minimization of $\hat{f_k}(x)$ over $\mathcal{X}$." style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> Minimization of $\hat{f_k}(x)$ over $\mathcal{X}$.</p>
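<p>The identity in (stepPVM) can be checked numerically: $\hat{f}_k(x)$ and $\tfrac{1}{2}\norm{x - (x_k - H_k^{-1}\nabla f(x_k))}_{H_k}^2$ differ only by an additive constant, so their minimizers over any feasible set coincide. A small numpy sketch with random data (illustrative only, over a finite candidate set standing in for the polytope):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
x_k = rng.standard_normal(n)
g = rng.standard_normal(n)                 # stands in for the gradient at x_k
A = rng.standard_normal((n, n))
H = A @ A.T + np.eye(n)                    # a positive definite H_k

def model(x):
    """Quadratic model f_hat_k (constant term f(x_k) dropped)."""
    return g @ (x - x_k) + 0.5 * (x - x_k) @ H @ (x - x_k)

newton_pt = x_k - np.linalg.solve(H, g)    # unconstrained minimizer of the model

def h_dist2(x):
    """Squared H-norm distance to the Newton point."""
    return (x - newton_pt) @ H @ (x - newton_pt)

# Over ANY candidate set, the minimizer of the model and the scaled
# projection of the Newton point agree, since h_dist2 = 2*model + const.
cands = rng.standard_normal((200, n))
i_model = min(range(len(cands)), key=lambda i: model(cands[i]))
i_proj = min(range(len(cands)), key=lambda i: h_dist2(cands[i]))
assert i_model == i_proj
```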
<p>Note that if we set $H_k = \nabla^2 f(x_k)$ the PVM algorithm is equivalent to the <em>Projected Newton</em> algorithm, and if we set $H_k = I^n$, where $I^n$ is the identity matrix, the algorithm reduces to the <em>Projected Gradient Descent</em> algorithm. Intuitively, when $H_k$ is a good approximation to the Hessian $\nabla^2 f(x_k)$ we can expect to make good progress when moving along these directions. In terms of convergence, the PVM algorithm has a <em>global</em> linear convergence rate in primal gap when using an exact line search [KSJ], although with a dependence on the condition number that is worse than that of Projected Gradient Descent or the <em>Away-step Conditional Gradient</em> (ACG) algorithm. Moreover, the algorithm has a <em>local</em> quadratic convergence rate with a unit step size when close to the optimum $x^\esx$, provided the matrix $H_k$ becomes a better and better approximation to $\nabla^2 f(x_k)$ as we approach $x^\esx$ (which we also assume in our theoretical results).</p>
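<p>As a sanity check of the identity-metric special case, here is a minimal numpy sketch of PVM steps over a box, where the scaled projection is just coordinatewise clipping; the step size is folded into the metric, i.e., $H_k = \frac{1}{\gamma} I$ (illustrative assumptions, not the paper's setting):</p>

```python
import numpy as np

def pvm_step_identity(x, grad, lo, hi, step=0.1):
    """One PVM step with H_k = (1/step) * I on the box [lo, hi]^d:
    this is exactly a projected gradient descent step."""
    y = x - step * grad(x)        # move toward the unconstrained minimizer of the model
    return np.clip(y, lo, hi)     # scaled projection = Euclidean projection = clipping

# minimize f(x) = 0.5 * ||x - c||^2, whose unconstrained optimum c
# lies partially outside the box [-1, 1]^2
c = np.array([2.0, -0.5])
grad = lambda x: x - c
x = np.zeros(2)
for _ in range(200):
    x = pvm_step_identity(x, grad, lo=-1.0, hi=1.0)
# the constrained minimizer simply clips c into the box
assert np.allclose(x, [1.0, -0.5], atol=1e-6)
```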
<h2 id="second-order-conditional-gradient-sliding-algorithm">Second-order Conditional Gradient Sliding algorithm</h2>
<p>Two questions arise:</p>
<ol>
<li>Can we achieve a global linear convergence rate on par with that of the Away-step Conditional Gradient algorithm?</li>
<li>Solving the problem shown in (stepPVM) to optimality is often too expensive. Can we solve the problem to some $\varepsilon_k$-optimality and keep the local quadratic convergence?</li>
</ol>
<p>The <em>Second-order Conditional Gradient Sliding</em> (SOCGS) algorithm is designed with these considerations in mind, providing global linear convergence in primal gap and local quadratic convergence in primal gap and in distance to $x^\esx$. The algorithm couples independent ACG steps (with line search) with Inexact PVM steps (with a unit step size). At the end of each iteration, we choose the step that provides the greater primal progress. The independent ACG steps ensure global linear convergence in primal gap, and the Inexact PVM steps provide local quadratic convergence. Moreover, the line search in the ACG step can be substituted with a step-size strategy that requires knowledge of the $L$-smoothness parameter of $f(x)$ [PNAM].</p>
<p>We compute the PVM step inexactly using the (same) ACG algorithm with an exact line search, thereby making the SOCGS algorithm <em>projection-free</em>. As the function being minimized in the Inexact PVM steps is quadratic, there is a closed-form expression for the optimal step size. The scaled projection problem is solved to $\varepsilon_k$-optimality, using the Frank-Wolfe gap as a stopping criterion, as in the Conditional Gradient Sliding (CGS) algorithm [LZ]. The CGS algorithm uses the vanilla Conditional Gradient algorithm to find approximate solutions to the Euclidean projection problems that arise in <em>Nesterov’s Accelerated Gradient Descent</em> steps; in the SOCGS algorithm, we use the ACG algorithm to find approximate solutions to the scaled-projection problems that arise in PVM steps.</p>
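<p>A sketch of this inner loop under simplifying assumptions: vanilla Conditional Gradient (instead of the ACG variant used in SOCGS) minimizing the quadratic model over the probability simplex, stopping on the Frank-Wolfe gap and using the closed-form exact line search available for quadratics. The function name and the choice of feasible region are illustrative only.</p>

```python
import numpy as np

def fw_scaled_projection(g, H, x_k, eps, max_iter=100_000):
    """Approximately minimize the quadratic model
        q(x) = <g, x - x_k> + 0.5 * <x - x_k, H (x - x_k)>
    over the probability simplex with vanilla Frank-Wolfe,
    stopping once the Frank-Wolfe gap is at most eps."""
    x = x_k.copy()
    for _ in range(max_iter):
        grad = g + H @ (x - x_k)
        v = np.zeros_like(x)
        v[np.argmin(grad)] = 1.0            # LMO over the simplex: a single vertex
        d = v - x
        gap = -grad @ d                     # Frank-Wolfe gap, upper bound on the primal gap
        if gap <= eps:
            break
        # exact line search for a quadratic along d, clipped to [0, 1]
        x = x + min(1.0, gap / (d @ H @ d)) * d
    return x

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))
H = A @ A.T / n + np.eye(n)                 # positive definite, moderately conditioned
g = rng.standard_normal(n)
x_k = np.full(n, 1.0 / n)                   # start at the barycenter of the simplex
x = fw_scaled_projection(g, H, x_k, eps=1e-2)
assert abs(x.sum() - 1.0) < 1e-9 and x.min() >= 0.0
```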
<h3 id="accuracy-parameter-varepsilon_k">Accuracy Parameter $\varepsilon_k$.</h3>
<p>The accuracy parameter $\varepsilon_k$ in the SOCGS algorithm depends on a lower bound on the primal gap of (minProblem), denoted by $lb\left( x_k \right)$, which satisfies $lb\left( x_k \right) \leq f\left(x_k \right) - f\left(x^\esx \right)$.</p>
<p>In several machine learning applications, the value of $f(x^\esx)$ is known a priori; this is the case for the approximate Carathéodory problem (see the post <a href="/blog/research/2019/11/30/approxCara-abstract.html">Approximate Carathéodory via Frank-Wolfe</a>, where $f(x^\esx) = 0$). In other applications, estimating $f(x^\esx)$ is easier than estimating the strong convexity parameter (see [BTA] for an in-depth discussion). In these cases tight lower bounds on the primal gap are available.</p>
<p>If there is no easy way to estimate the value of $f(x^\esx)$, we can compute a lower bound on the primal gap at $x_k$ (bounded away from zero) using any CG variant that monotonically decreases the primal gap. It suffices to run an arbitrary number of steps $n \geq 1$ of the aforementioned variant to minimize $f(x)$ starting from $x_k$, resulting in $x_k^n$. Simply noting that $f(x_k^n) \geq f(x^\esx)$ allows us to conclude that $f(x_k) - f(x^\esx) \geq f(x_k) - f(x_k^n)$, and therefore a valid lower bound is $lb\left( x_k \right) = f(x_k) - f(x^n_k)$. The higher the number of CG steps performed from $x_k$, the tighter the resulting lower bound will be.</p>
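<p>A minimal sketch of this lower-bound construction, using vanilla Conditional Gradient with exact line search (which decreases $f$ monotonically) on a toy quadratic over the simplex; the function and routine names are hypothetical:</p>

```python
import numpy as np

def primal_gap_lower_bound(f, grad_f, x_k, n_steps):
    """lb(x_k) = f(x_k) - f(x_k^n): run n_steps monotone Frank-Wolfe
    steps over the simplex from x_k; since f(x_k^n) >= f(x*), the
    difference is a valid lower bound on the primal gap."""
    x = x_k.copy()
    for _ in range(n_steps):
        g = grad_f(x)
        v = np.zeros_like(x)
        v[np.argmin(g)] = 1.0               # LMO over the simplex
        d = v - x
        if -g @ d <= 0.0:                   # Frank-Wolfe gap zero: x is optimal
            break
        # exact line search, valid because f below has identity Hessian
        x = x + min(1.0, -(g @ d) / (d @ d)) * d
    return f(x_k) - f(x)

# toy objective: f(x) = 0.5 * ||x - c||^2 with c outside the simplex
c = np.array([0.8, 0.4, 0.2])
f = lambda x: 0.5 * np.sum((x - c) ** 2)
grad_f = lambda x: x - c
x_k = np.array([1.0, 0.0, 0.0])             # start at a vertex
lb5 = primal_gap_lower_bound(f, grad_f, x_k, 5)
lb20 = primal_gap_lower_bound(f, grad_f, x_k, 20)
# more inner steps give a tighter (larger) lower bound
assert 0.0 <= lb5 <= lb20 <= f(x_k)
```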
<h3 id="complexity-analysis">Complexity Analysis</h3>
<p>For the complexity analysis, we assume that we have at our disposal the tightest possible bound on the primal gap, namely $lb\left( x_k \right) = f(x_k) - f(x^\esx)$. A looser lower bound increases the number of linear minimization calls but does not increase the number of first-order or approximate Hessian oracle calls. As in the classical analysis of Projected Newton algorithms, after a finite number of iterations that is independent of the target accuracy $\varepsilon$ (iterations that, in our case, converge linearly in primal gap), the algorithm enters a regime of quadratic convergence in primal gap. Once in this phase, the algorithm requires $\mathcal{O}\left( \log(1/\varepsilon) \log(\log 1/\varepsilon)\right)$ calls to a linear minimization oracle, and $\mathcal{O}\left( \log(\log 1/\varepsilon)\right)$ calls to a first-order and approximate Hessian oracle, to reach an $\varepsilon$-optimal solution.</p>
<p>If we were to solve problem (minProblem) using the Away-step Conditional Gradient algorithm we would need $\mathcal{O}\left( \log(1/\varepsilon)\right)$ calls to a linear minimization and first-order oracle. Using the SOCGS algorithm makes sense if the linear minimization calls are not the computational bottleneck of the algorithm and the approximate Hessian oracle is about as expensive as the first-order oracle.</p>
<h3 id="computational-experiments">Computational Experiments</h3>
<p>We compare the performance of the SOCGS algorithm with that of other first-order projection-free algorithms in settings where computing first-order information is expensive (and computing Hessian information is just as expensive). We also compare the performance of our algorithm with the recent <em>Newton Conditional Gradient</em> (NCG) algorithm [LCT], which minimizes a self-concordant function over a convex set by performing Inexact Newton steps (thereby requiring an exact Hessian oracle) using a Conditional Gradient algorithm to compute the scaled projections. After a finite number of iterations (independent of the target accuracy $\varepsilon$), the convergence rate of the NCG algorithm is linear in primal gap. Once inside this phase, an $\varepsilon$-optimal solution is reached after $\mathcal{O}\left(\log 1/\varepsilon\right)$ exact Hessian and first-order oracle calls and $\mathcal{O}( 1/\varepsilon^{\nu})$ linear minimization oracle calls, where $\nu$ is a constant greater than one.</p>
<p>In the first experiment the Hessian information is inexact (but satisfies an asymptotic accuracy assumption), so we only compare against other first-order projection-free algorithms. In the second and third experiments the Hessian oracle is exact. For reference, the algorithms in the legend correspond to the vanilla Conditional Gradient (CG), the Away-step Conditional Gradient (ACG) [GM], the Lazy Away-step Conditional Gradient (ACG (L)) [BPZ], the Pairwise-step Conditional Gradient (PCG) [LJ], the Conditional Gradient Sliding (CGS) [LZ], the Stochastic Variance-Reduced Conditional Gradient (SVRCG) [HL], the Decomposition Invariant Conditional Gradient (DICG) [GM2], and the Newton Conditional Gradient (NCG) [LCT] algorithms. We also present an LBFGS version of SOCGS (SOCGS LBFGS); note, however, that this variant, while performing well, does not formally satisfy our assumptions.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/socgs/Birkhoff_Experiments.png" alt="fig2" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 2.</strong> Sparse coding over the Birkhoff polytope.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/socgs/GLassoPSD.png" alt="fig3" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 3.</strong> Inverse covariance estimation over the spectrahedron.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/socgs/LogReg.png" alt="fig4" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 4.</strong> Structured logistic regression over the $\ell_1$ unit ball.</p>
<h3 id="references">References</h3>
<p>[LZ] Lan, G., & Zhou, Y. (2016). Conditional gradient sliding for convex optimization. In <em>SIAM Journal on Optimization</em> 26(2) (pp. 1379–1409). SIAM. <a href="http://www.optimization-online.org/DB_FILE/2014/10/4605.pdf">pdf</a></p>
<p>[BEA] Banerjee, O., & El Ghaoui, L. & d’Aspremont, A. (2008). Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. In <em>Journal of Machine Learning Research</em> 9 (2008) (pp. 485–516). JMLR. <a href="http://www.jmlr.org/papers/volume9/banerjee08a/banerjee08a.pdf">pdf</a></p>
<p>[KSJ] Karimireddy, S.P., & Stich, S.U. & Jaggi, M. (2018). Global linear convergence of Newton’s method without strong-convexity or Lipschitz gradients. <em>arXiv preprint:1806.00413</em>. <a href="https://arxiv.org/pdf/1806.00413.pdf">pdf</a></p>
<p>[PNAM] Pedregosa, F., & Negiar, G. & Askari, A. & Jaggi, M. (2020). Linearly Convergent Frank-Wolfe with Backtracking Line-Search. In <em>Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics</em>. <a href="http://proceedings.mlr.press/v108/pedregosa20a/pedregosa20a-supp.pdf">pdf</a></p>
<p>[BTA] Barré, M., & Taylor, A. & d’Aspremont, A. (2020). Complexity Guarantees for Polyak Steps with Momentum. <em>arXiv preprint:2002.00915</em>. <a href="https://arxiv.org/pdf/2002.00915.pdf">pdf</a></p>
<p>[LCT] Liu, D., & Cevher, V. & Tran-Dinh, Q. (2020). A Newton Frank-Wolfe Method for Constrained Self-Concordant Minimization. <em>arXiv preprint:2002.07003</em>. <a href="https://arxiv.org/pdf/2002.07003.pdf">pdf</a></p>
<p>[GM] Guélat, J., & Marcotte, P. (1986). Some comments on Wolfe’s ‘away step’. In <em>Mathematical Programming</em> 35(1) (pp. 110–119). Springer. <a href="http://www.iro.umontreal.ca/~marcotte/ARTIPS/1986_MP.pdf">pdf</a></p>
<p>[BPZ] Braun, G., & Pokutta, S. & Zink, D. (2017). Lazifying Conditional Gradient Algorithms. In <em>Proceedings of the 34th International Conference on Machine Learning</em>. <a href="http://proceedings.mlr.press/v70/braun17a/braun17a.pdf">pdf</a></p>
<p>[LJ] Lacoste-Julien, S. & Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In <em>Advances in Neural Information Processing Systems</em> 2015 (pp. 496-504). <a href="https://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf">pdf</a></p>
<p>[HL] Hazan, E. & Luo, H. (2016). Variance-reduced and projection-free stochastic optimization. In <em>Proceedings of the 33rd International Conference on Machine Learning</em>. <a href="https://arxiv.org/pdf/1602.02101.pdf">pdf</a></p>
<p>[GM2] Garber, D. & Meshi, O. (2016). Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. In <em>Advances in Neural Information Processing Systems</em> 2016 (pp. 1001–1009). <a href="https://arxiv.org/pdf/1605.06492.pdf">pdf</a></p>
<h1 id="on-the-unreasonable-effectiveness-of-the-greedy-algorithm">On the unreasonable effectiveness of the greedy algorithm</h1>
<p><em>2020-06-03</em></p>
<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/2002.04063">On the Unreasonable Effectiveness of the Greedy Algorithm: Greedy Adapts to Sharpness</a> with <a href="https://www2.isye.gatech.edu/~msingh94/">Mohit Singh</a> and <a href="https://sites.google.com/view/atorrico">Alfredo Torrico</a>, where we adapt the sharpness concept from convex optimization to explain the effectiveness of the greedy algorithm for submodular function maximization.</em>
<!--more--></p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>An important problem is the maximization of a non-negative monotone submodular set function $f: 2^V \rightarrow \RR_+$ subject to a cardinality constraint, i.e.,</p>
\[\tag{maxSub}
\max_{S \subseteq V, |S| \leq k} f(S).\]
<p>This problem naturally occurs in many contexts, such as, e.g., feature selection, sensor placement, and non-parametric learning. It is well known that in submodular function maximization with a single cardinality constraint we can compute a $(1-1/\mathrm{e})$-approximate solution by means of the greedy algorithm [NW], [NWF], while computing an exact solution is NP-hard. The greedy algorithm is extremely simple, selecting in each of its $k$ iterations the element with the largest <em>marginal gain</em> $\Delta_{S}(e) \doteq f(S \cup \setb{e}) - f(S)$:</p>
<p class="mathcol"><strong>Greedy Algorithm</strong> <br />
<em>Input:</em> Non-negative, monotone, submodular function $f$ and budget $k$ <br />
<em>Output:</em> Set $S_g \subseteq V$ <br />
$S_g \leftarrow \emptyset$ <br />
For $i = 1, \dots, k$ do: <br />
$\quad S_g \leftarrow S_g \cup \setb{\arg\max_{e \in V \setminus S_g} \Delta_{S_g}(e)}$<br /></p>
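<p>As an illustration, here is a short Python version of the pseudocode above for a max-$k$-coverage instance, where $f(S)$ is the number of elements covered (a non-negative, monotone, submodular function); the instance is made up for this post:</p>

```python
def greedy_max_coverage(sets, k):
    """Greedy for max k-coverage: f(S) = |union of the chosen sets|,
    a non-negative, monotone, submodular function."""
    chosen, covered = [], set()
    for _ in range(k):
        # pick the element with the largest marginal gain Delta_S(e)
        e = max((i for i in range(len(sets)) if i not in chosen),
                key=lambda i: len(sets[i] - covered))
        chosen.append(e)
        covered |= sets[e]
    return chosen, covered

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 5}]
chosen, covered = greedy_max_coverage(sets, k=2)
# greedy first takes {4, 5, 6, 7} (gain 4), then {1, 2, 3} (gain 3)
assert chosen == [2, 0] and covered == {1, 2, 3, 4, 5, 6, 7}
```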
<p>Due to its simplicity and good real-world performance, the greedy algorithm is often the method of choice in large-scale tasks where more involved methods, such as, e.g., integer programming, are computationally prohibitive. As mentioned before, the solution $S_g \subseteq V$ returned by the greedy algorithm satisfies [NW]</p>
\[f(S_g) \geq (1-1/\mathrm{e}) \ f(S^\esx),\]
<p>where $S^\esx \subseteq V$ is the optimal solution to problem (maxSub).</p>
<p>In practice, however, we often observe that the greedy algorithm performs much better than this conservative approximation guarantee suggests, and several concepts, such as <em>curvature</em> [CC] or <em>stability</em> [CRV], have been proposed as a means to explain the excess performance of the greedy algorithm beyond this worst-case bound. The reason one might be interested in this, beyond understanding greedy’s performance as a function of additional properties of $f$ (which is interesting in its own right), is that the problem instance of interest might be amenable to pre-processing in order to improve conditioning with respect to these additional structural properties, and hence performance.</p>
<h2 id="our-results">Our results</h2>
<p>We focus on giving an alternative explanation for those instances in which the optimal solution clearly stands out over the rest of the feasible solutions. For this, we consider the concept of sharpness initially introduced in continuous optimization (see [BDL] and references therein) and adapt it to submodular optimization. In convex optimization, roughly speaking, sharpness measures the behavior of the objective function around the set of optimal solutions, and it translates into faster convergence rates. The way one should think about sharpness and similar parameters is as <em>data-dependent</em> quantities that are usually either hard to compute or inaccessible. As these quantities are (usually) unobservable and non-trivial to estimate, yet impact the convergence rate, we would like our algorithms to be <em>adaptive</em> to these parameters without requiring them as <em>input</em>, i.e., the algorithm automatically behaves better when the data is better conditioned.</p>
<p>We show that the greedy algorithm for submodular maximization also provides better approximation guarantees as (our submodular analog to) the sharpness of the objective function increases. While surprising at first, this is actually quite natural once we understand the greedy algorithm as a discrete analog of ascent algorithms in continuous optimization that is allowed to perform only a fixed number of steps: if the algorithm converges faster, then after a fixed number of steps ($k$ in the discrete case, to be precise) its achieved approximation guarantee will be better. The key challenge is then to identify a notion of sharpness that is meaningful in the context of submodular function maximization. We also show that the greedy algorithm automatically adapts to the submodular function’s sharpness.</p>
<p>The most basic notion of <em>sharpness for submodular functions</em> that we define is the following notion of <em>monotone sharpness</em>: There exists an optimal solution $S^\esx \subseteq V$ such that for all $S \subseteq V$ it holds:</p>
\[\tag{monSharp}
\sum_{e \in S^\esx \setminus S} \Delta_S(e) \geq \left( \frac{|S^\esx \setminus S|}{kc} \right)^{1/\theta} f(S^\esx),\]
<p>which then leads to a guarantee of the form:</p>
\[f(S_g) \geq \left(1- \left(1-\frac{\theta}{c}\right)^{c/\theta}\right) f(S^\esx),\]
<p>which interpolates between the worst-case approximation factor $(1-1/\mathrm{e})$ (recovered as $\theta/c \to 0$) and the best-case approximation factor $1$ (attained at $\theta = c$).</p>
<p>We also define tighter notions of sharpness that explain more of greedy’s performance; however, their definitions are slightly more involved and beyond the scope of this summary. In the following figure we depict the performance of the greedy algorithm on three different tasks, as well as how much of its performance is explained by various data-dependent measures. It can be seen that our most advanced notion of sharpness, called <em>dynamic submodular sharpness</em>, explains a significant portion of greedy’s performance.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/sharpSubmodular/1_image_clustering.png" alt="img1" style="float:center; margin-right: 1%; width:27%" />
<img src="http://www.pokutta.com/blog/assets/sharpSubmodular/2_fac_loc.png" alt="img2" style="float:center; margin-right: 1%; width:27%" />
<img src="http://www.pokutta.com/blog/assets/sharpSubmodular/3_parkison_tele.png" alt="img3" style="float:center; margin-right: 1%; width:39%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> Image Clustering (left), Facility Location (middle), Parkinson’s Telemonitoring (right). For each example we computed both the sharpness parameters and optimal solutions to compare predicted vs. actual performance. In all three examples sharpness explains a significant portion of the greedy algorithm’s excess performance.</p>
<p>One final but important question that one might ask is: how many functions actually do satisfy sharpness? In convex optimization, by the <em>Łojasiewicz Factorization Lemma</em> (see [BDL] and references contained therein), basically almost all functions exhibit non-trivial sharpness; the same is true for the submodular case here, albeit in somewhat weaker form.</p>
<h3 id="references">References</h3>
<p>[NWF] Nemhauser, G. L., Wolsey, L. A., & Fisher, M. L. (1978). An analysis of approximations for maximizing submodular set functions—I. Mathematical programming, 14(1), 265-294. <a href="https://link.springer.com/content/pdf/10.1007/BF01588971.pdf">pdf</a></p>
<p>[NW] Nemhauser, G. L., & Wolsey, L. A. (1978). Best algorithms for approximating the maximum of a submodular set function. Mathematics of operations research, 3(3), 177-188. <a href="https://www.jstor.org/stable/pdf/3689488.pdf">pdf</a></p>
<p>[CC] Conforti, M., & Cornuéjols, G. (1984). Submodular set functions, matroids and the greedy algorithm: tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete applied mathematics, 7(3), 251-274. <a href="https://www.sciencedirect.com/science/article/pii/0166218X84900039">pdf</a></p>
<p>[CRV] Chatziafratis, V., Roughgarden, T., & Vondrák, J. (2017). Stability and recovery for independence systems. arXiv preprint arXiv:1705.00127. <a href="https://arxiv.org/abs/1705.00127">pdf</a></p>
<p>[BDL] Bolte, J., Daniilidis, A., & Lewis, A. (2007). The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4), 1205–1223. <a href="https://epubs.siam.org/doi/pdf/10.1137/050644641">pdf</a></p>
<h1 id="an-update-on-scip">An update on SCIP</h1>
<p><em>2020-05-15</em></p>
<p><em>TL;DR: A quick update on what is on the horizon for SCIP.</em>
<!--more--></p>
<p>SCIP has been a cornerstone of ZIB’s mathematical optimization department for many years. It is probably (one of) the fastest and most comprehensive academic solvers for MIPs and several related optimization paradigms. Certainly it is the fastest MIP and MINLP solver that is fully transparent and accessible in source code. This impressive effort is due to a great team of researchers and developers, both at ZIB and throughout the world, that has been pushing SCIP to the cutting-edge.</p>
<p>Over the last 5 years, two people have strongly shaped the progress of SCIP at ZIB: Thorsten Koch on the organizational side and Ambros Gleixner as head of technical research & development. In the fall of 2019 I moved to ZIB. With this move I also took over the lead of the overall SCIP project, among several other new responsibilities, and I would like to take the opportunity to thank Thorsten Koch for his great leadership of the SCIP project over the last years. I am quite excited to have the opportunity to shape the future of SCIP together with the rest of the SCIP team, and in view of this I would like to share some updates. In a nutshell, these changes can be summarized as follows:</p>
<ol>
<li>Making SCIP more open</li>
<li>Making SCIP more accessible</li>
<li>Making SCIP more inclusive</li>
</ol>
<p>While not everything can be achieved in one step, this overview might give you an idea of what is on the horizon.</p>
<p>We also have some very exciting new research directions and results, however I am going to talk about some of that work elsewhere in a more research-focused post.</p>
<h2 id="scip-7-release">SCIP 7 release</h2>
<p>Before turning to upcoming plans, I want to briefly mention the recent release of <a href="http://www.optimization-online.org/DB_HTML/2020/03/7705.html">SCIP 7</a>, which ships many new features. To name just two, there is a new parallel preprocessing library <em>PaPILO</em>, and we now have <a href="http://www.optimization-online.org/DB_HTML/2020/04/7722.html">tree-size prediction</a> built in:</p>
<blockquote>
<p>On average, the best method estimates B&B tree sizes within a factor of 3 on the set of unseen test instances even during the early stage of the search, and improves in accuracy as the search progresses. It also achieves a factor 2 over the entire search on each out of six additional sets of homogeneous instances we have tested.</p>
</blockquote>
<p><em>#firstSeenInSCIP</em></p>
<p>Both for MIP and MINLP, SCIP 7 is on average 1.36x faster than SCIP 6 on hard instances, i.e., on instances that take at least 100 seconds to solve. You can check out the latest release on the <a href="http://scip.zib.de">SCIP homepage</a>.</p>
<h2 id="interfaces">Interfaces</h2>
<p>SCIP already supports a wide variety of interfaces. In the future we will integrate SCIP further with those interfaces, and in particular we will improve integration with Python through <a href="https://github.com/SCIP-Interfaces/PySCIPOpt">PySCIPOpt</a> and with Julia through <a href="https://github.com/SCIP-Interfaces/SCIP.jl">SCIP.jl</a>. These two will become true first-class interfaces. Moreover, we will maintain several <a href="https://github.com/SCIP-Interfaces">other interfaces</a> depending on demand.</p>
<h2 id="distribution">Distribution</h2>
<p>We intend to extend the distribution mechanisms for SCIP. One very high priority is distribution through the conda package manager, so that the SCIP Optimization Suite and PySCIPOpt can be installed with a simple <code class="highlighter-rouge">conda install pyscipopt</code>. We are also exploring making SCIP available in <a href="https://colab.research.google.com/">Google Colab</a>; the conda integration might make this a trivial exercise.</p>
<h2 id="tutorials">Tutorials</h2>
<p>Many of you have experienced that SCIP is a very complex piece of software, and getting started can be a nontrivial endeavor simply because of its high flexibility as a framework, which is fully exposed through its API. At the same time, SCIP can be used out-of-the-box as a powerful black-box solver. However, many of you have suffered from the current lack of good entry-level documentation. To alleviate this in the short term, we are in the process of writing a tutorial specifically targeting the “black box user + SCIP” via PySCIPOpt. In the mid term we will try to offer more resources to people who use SCIP mainly as a black-box solver; see the Website section below.</p>
<h2 id="new-platforms">New platforms</h2>
<p>We intend to support several new platforms for the SCIP Optimization Suite. As you might have already seen from <a href="/blog/random/2019/09/29/scipberry.html">a post sometime back</a> one such platform is ARM. This includes the RaspberryPi but also many cell phone and mobile architectures that then can potentially run SCIP. Moreover, we also plan a dockerized version of SCIP for deployment in cloud computing environments. In fact if you want to give a preliminary build a spin: <code class="highlighter-rouge">docker pull scipoptsuite/scipoptsuite:7.0.0</code>; SCIP Optimization Suite 7.0.0 and PySCIPOpt 3.0.0 on slim buster with Python 3.7—feedback appreciated.</p>
<p>A little further down the road, we will likely also support RISC-V once stable development systems are available, and we are currently evaluating Microsoft’s <a href="https://docs.microsoft.com/en-us/windows/wsl/wsl2-install">WSL</a>, in particular together with <a href="https://ubuntu.com/wsl">Ubuntu on WSL</a>, as an alternative deployment mode for Windows.</p>
<h2 id="decentralized-development">Decentralized Development</h2>
<p>SCIP has had a strong decentralized development component and this trend is likely to increase further in the future, with many more non-ZIB developers contributing to SCIP. In Germany alone, we have four development centers: FAU Erlangen-Nürnberg, TU Darmstadt, RWTH Aachen, and the Zuse Institute Berlin. On top of that, we have a large number of international contributors.</p>
<p>This decentralized development setup, with many stakeholders and core developers outside ZIB, will also be reflected more strongly in SCIP’s governance; more on this soon.</p>
<h2 id="website">Website</h2>
<p>SCIP will move to <a href="http://www.scipopt.org">http://www.scipopt.org</a> as its new home and one-stop shop; it should be online in a few days. Moreover, over the next few months we will also separate the website into two parts: one for SCIP users and one for SCIP developers.</p>
<h2 id="licensing">Licensing</h2>
<p>There are some license changes on the horizon as well. The short version is that we intend for SCIP to be free for non-commercial use in general, and we are currently discussing how to deal with commercial use. One model might be to have a community edition under some permissive open-source license and a professional edition for commercial use. Obviously this is quite a complicated matter and it will take some time to iron out all the details and settle on a final setup.</p>
<p>In the meantime, if you want to use SCIP, send an email to <a href="mailto:licenses@zib.de">licenses@zib.de</a> and we will work something out in the spirit of the above.</p>
<h2 id="hiring">Hiring</h2>
<p>We are looking to grow the SCIP developer team. If you want to contribute to the future development of SCIP and want to get involved please get in touch.</p>Sebastian PokuttaTL;DR: A quick update on what is on the horizon for SCIP.Psychedelic Style Transfer2020-04-09T01:00:00+02:002020-04-09T01:00:00+02:00http://www.pokutta.com/blog/research/2020/04/09/ai-art<p><em>TL;DR: We point out how to make psychedelic animations from discarded instabilities in neural style transfer. This post builds upon a remark we made in our recent paper <a href="https://arxiv.org/abs/2003.06659">Interactive Neural Style Transfer with Artists</a>. In this paper, we questioned several simple evaluation aspects of neural style transfer methods. Also, it is our second series of interactive painting experiments where style transfer outputs constantly influence a painter, see the other series <a href="https://arxiv.org/abs/1910.04386">here</a>. See also our medium <a href="https://medium.com/@human.aimachine.art/psychedelic-style-transfer-5744b700fc3e">post</a>.</em>
<!--more--></p>
<p><em>Written by Thomas Kerdreux and Louis Thiry.</em> <br /></p>
<div class="paddingContainer">
<div class="iframe-container center">
<iframe width="100%" height="100%" src="https://www.youtube-nocookie.com/embed/1jg6CqMEbcQ" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</div>
</div>
<p>The first frame of the video, a watercolor by my grandfather, is progressively stirred into a plethora of curvy and colorful patches. It then metamorphoses into a purplish phantasmal coral reef that is itself slowly submerged by an angry puce ocean. The water then calms down as the coral reef disappears and ends up perfectly still. How is this psychedelic animation related to style transfer methods?</p>
<p>Neural style transfer methods are rendering techniques – mostly for images – that seek to stylize a content image with the style of another, see the figure below. More precisely, the algorithms are designed to extract a style representation of one image and a representation of the semantic content of another, and then cleverly construct a new picture from these.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Heard Island in Antarctica</th>
<th style="text-align: center">Maxime Maufra’s painting</th>
<th style="text-align: center">Style Transfer Output using STROTSS [KS]</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/blog/assets/style_transfer/example_ST_content.jpg" style="zoom:415%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/example_ST_style.jpg" style="zoom:415%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/example_ST_output.jpg" style="zoom:485%;" /></td>
</tr>
</tbody>
</table>
<p>While designing new evaluation techniques for style transfer methods in [KT], we made an uncomplicated but crucial observation. <strong>Style transfer applied to the same image as style and content should reasonably output the image itself</strong>. However, we observed that many style transfer algorithms do not satisfy this property; no one ever bothered to hard-code this fundamental requirement. Here we show how, by leveraging this instability, we can produce animations like the one above.</p>
<table>
<thead>
<tr>
<th style="text-align: center">MST first iteration</th>
<th style="text-align: center">MST second iteration</th>
<th style="text-align: center">MST third iteration</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/blog/assets/style_transfer/MST_0.jpg" style="zoom:400%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/MST_1.jpg" style="zoom:400%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/MST_2.jpg" style="zoom:400%;" /></td>
</tr>
</tbody>
</table>
<p>Formally, a style transfer method is simply a function \(f\) that takes a style image \(s\) and a content image \(c\) and outputs a new image \(f(s,c)\). Our observation is that for some style transfer methods \(f\) and an initial image \(x_0\), the equality $f(x_0, x_0) = x_0$ is not satisfied. The output image adds a slightly perceptible flicker, blur, or blemish to the initial image \(x_0\). These instability patterns differ from one method to another but are experimentally the same when starting from different images \(x_0\).</p>
<p>Yet these effects are hardly perceptible. Hence, to better understand the phenomenon, we need to amplify them. We simply repeat the process: start from an initial image \(x_0\) and iterate the style transfer operation</p>
\[\begin{align*}
x_{t+1} = f(x_t, x_t)
\end{align*}\]
<p>In the figure above, after a few iterations, the effects become perceptible and particularly stylish. For instance, when taking the MST style transfer method [MST] (with this <a href="https://github.com/irasin/Pytorch_MST">code</a>), the iterates become tessellated versions of the initial image. The instabilities amplify all the lines of the pictures; on portraits, they reveal every wrinkle. When taking another algorithm like WCT [WCT] (with <a href="https://github.com/irasin/Pytorch_WCT">this code</a>), the effects are different. The goblin is slowly dematerialized by the devilish style transfer instabilities, see the figure below.</p>
<table>
<thead>
<tr>
<th style="text-align: center">WCT first iteration</th>
<th style="text-align: center">WCT second iteration</th>
<th style="text-align: center">WCT fourth iteration</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/blog/assets/style_transfer/WCT_diablotin_0.jpg" style="zoom:200%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/WCT_diablotin_1.jpg" style="zoom:200%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/WCT_diablotin_4.jpg" style="zoom:200%;" /></td>
</tr>
</tbody>
</table>
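<p>The amplification loop is trivial to script. Below is a minimal sketch in which a hypothetical <em>style_transfer(style, content)</em> function stands in for a real feed-forward model such as MST or WCT; a dummy deterministic perturbation plays the role of the network’s instability.</p>

```python
import numpy as np

def style_transfer(style, content):
    """Stand-in for a feed-forward style-transfer network f(s, c).

    A real model (MST, WCT, ...) would go here; this dummy adds a small
    fixed perturbation so that f(x, x) != x, mimicking the instability
    discussed above."""
    rng = np.random.default_rng(0)  # fixed seed: the same "flicker" each call
    return content + 0.01 * rng.standard_normal(content.shape)

def psychedelic_frames(x0, n_steps):
    """Amplify the instability by feeding each output back as both
    style and content: x_{t+1} = f(x_t, x_t)."""
    frames = [x0]
    for _ in range(n_steps):
        frames.append(style_transfer(frames[-1], frames[-1]))
    return frames

frames = psychedelic_frames(np.zeros((8, 8, 3)), 20)  # placeholder "image"
drift = [float(np.abs(f - frames[0]).max()) for f in frames]
# the deviation from the initial image grows with every iteration
```

<p>Swapping an actual model in for the dummy function and saving each frame yields exactly the kind of animation shown in the video above.</p>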
<p>So far, we have simply shown the outputs of the first few iterations of this repeated process. Actually, the animation above basically collects all the images of the sequence \((x_t)\). For many different pictures and methods, we observe this asymptotic type of divergence, which we name the <em>psychedelic regime</em>. Indeed, once the algorithm loses track of the initial image, it starts raving, feeding on its own increasingly delusional outputs without ever going back to our reality! The raving differs from one method to another but experimentally seems not to depend on the initial image.</p>
<p>This playfully shows what a machine can do when it forgets about human inputs or the non-numerical reality. In fact, metaphorically, this also happens in many practical uses of algorithms. For instance, collaborative-filtering recommender systems use new data that come from humans interacting with the algorithm. We can no longer assess the choices humans would have made without ever being influenced by algorithms. We have lost this initial input!</p>
<p>[R] and [GJ] studied instabilities of style transfer methods in the case of real-time style transfer for videos. The style transfer output may differ significantly from one frame to another even though the initial consecutive frames are perceptibly the same, which results in an unpleasant flickering effect in style-transferred videos. Similarly to the adversarial-examples literature, the main focus there is to study the instabilities in order to detect, correct, and remove them. Here we outlined instabilities stemming from another type of inconsistency and took advantage of them.</p>
<p>Also, note that MST and WCT are feed-forward approaches to style transfer, i.e., the function $f$ is a neural network [JA,GL,LW]. In contrast, the first approach to neural style transfer was optimization-based [G]. In particular, when considering the same image as style and content, that image is the global optimum of the loss, so the method satisfies $f(x,x)=x$ if properly initialized. Actually, even when choosing a random initialization, we observed that the image iterate converges to the style/content image, <em>i.e.</em>, the global minimum of a non-convex loss. Note, though, that some optimization-based methods like STROTSS may still not satisfy this stability property because of some randomization and a re-parametrization of the image by its Laplacian pyramid.</p>
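<p>This fixed-point property is easy to visualize in a toy setting. The sketch below is purely illustrative: a plain convex quadratic stands in for the real (non-convex) perceptual loss, and vanilla gradient descent plays the role of the optimization-based transfer.</p>

```python
import numpy as np

def optimization_style_transfer(style, content, x_init, steps=200, lr=0.25):
    """Toy optimization-based transfer: gradient descent on a loss whose
    global minimum is the style/content image itself. A plain quadratic
    stands in for the real perceptual loss here."""
    x = x_init.copy()
    for _ in range(steps):
        grad = (x - style) + (x - content)  # stand-in for style + content terms
        x -= lr * grad
    return x

x0 = np.linspace(0.0, 1.0, 12).reshape(3, 4)
# initialized at the image itself, the iterate never moves: f(x0, x0) = x0
same = optimization_style_transfer(x0, x0, x_init=x0)
# even from a random start, the iterate converges back to x0
rand = optimization_style_transfer(x0, x0,
                                   x_init=np.random.default_rng(1).random((3, 4)))
```

<p>With the real perceptual loss the landscape is non-convex, so this convergence back to the style/content image is an empirical observation rather than a guarantee, as noted above for STROTSS.</p>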
<p>Finally, if you are interested in making your own psychedelic videos, the take-home message is that most certainly any feed-forward neural style transfer approach will give a different <em>psychedelic regime</em>. Below we show one using the WCT method, as well as the first iterations when using the optimization-based STROTSS method (our <a href="https://github.com/human-aimachine-art/pytorch-STROTSS-improved">code</a>).</p>
<div class="paddingContainer">
<div class="iframe-container center">
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/lyyAFlmNjIg" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</div>
</div>
<table>
<thead>
<tr>
<th style="text-align: center">STROTSS first iteration</th>
<th style="text-align: center">STROTSS, several iterations later…</th>
<th style="text-align: center">STROTSS, several iterations later…</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/blog/assets/style_transfer/STROTSS_0.jpg" style="zoom:300%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/STROTSS_5.jpg" style="zoom:300%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/STROTSS_15.jpg" style="zoom:300%;" /></td>
</tr>
</tbody>
</table>
<h3 id="references">References</h3>
<p>[CKT] Cabannes, V., Kerdreux, T., Thiry, L., Campana, T., & Ferrandes, C. (2019). Dialog on a Canvas with a Machine. Third Workshop of Creativity and Design at NeurIPS 2019. <a href="https://arxiv.org/abs/1910.04386">pdf</a></p>
<p>[JA] Johnson, J., Alahi, A., & Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 694–711. Springer. <a href="https://arxiv.org/abs/1603.08155">pdf</a></p>
<p>[G] Gatys, L. A., Ecker, A. S., & Bethge, M. (2015). A neural algorithm of artistic style. <a href="https://arxiv.org/abs/1508.06576">pdf</a></p>
<p>[GL] Ghiasi, G., Lee, H., Kudlur, M., Dumoulin, V., & Shlens, J. (2017). Exploring the structure of a real-time, arbitrary neural artistic stylization network. <a href="https://arxiv.org/abs/1705.06830">pdf</a></p>
<p>[GJ] Gupta, A., Johnson, J., Alahi, A., & Fei-Fei, L. (2017). Characterizing and improving stability in neural style transfer. In Proceedings of the IEEE International Conference on Computer Vision, 4067–4076. <a href="https://arxiv.org/abs/1705.02092">pdf</a></p>
<p>[KT] Kerdreux, T., Thiry, L., Kerdreux, E. (2020). Interactive Neural Style Transfer with Artists. <a href="https://arxiv.org/abs/2003.06659">pdf</a></p>
<p>[KS] Kolkin, N., Salavon, J., Shakhnarovich G. (2019). Style Transfer by Relaxed Optimal Transport and Self-Similarity. <a href="https://arxiv.org/abs/1904.12785">pdf</a></p>
<p>[LW] Li, C., & Wand, M. (2016). Precomputed real-time texture synthesis with markovian generative adversarial networks. In European Conference on Computer Vision, 702–716. Springer. <a href="https://arxiv.org/abs/1604.04382">pdf</a></p>
<p>[MST] Zhang, Y., Fang, C., Wang, Y., Wang, Z., Lin, Z., Fu, Y., & Yang, J. (2019). Multimodal Style Transfer via Graph Cuts. In Proceedings of the IEEE International Conference on Computer Vision, 5943–5951. <a href="https://arxiv.org/abs/1904.04443">pdf</a></p>
<p>[WCT] Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., & Yang, M. H. (2017). Universal style transfer via feature transforms. In Advances in neural information processing systems (pp. 386–396). <a href="https://arxiv.org/abs/1705.08086">pdf</a></p>
<p>[R] Risser, E., Wilmot, P., & Barnes, C. (2017). Stable and controllable neural texture synthesis and style transfer using histogram losses. <a href="https://arxiv.org/abs/1701.08893">pdf</a></p>Thomas Kerdreux, Louis ThiryTL;DR: We point out how to make psychedelic animations from discarded instabilities in neural style transfer. This post builds upon a remark we made in our recent paper Interactive Neural Style Transfer with Artists. In this paper, we questioned several simple evaluation aspects of neural style transfer methods. Also, it is our second series of interactive painting experiments where style transfer outputs constantly influence a painter, see the other series here. See also our medium post.Boosting Frank-Wolfe by Chasing Gradients2020-03-16T00:00:00+01:002020-03-16T00:00:00+01:00http://www.pokutta.com/blog/research/2020/03/16/boostFW<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/pdf/2003.06369.pdf">Boosting Frank-Wolfe by Chasing Gradients</a> by <a href="https://cyrillewcombettes.github.io">Cyrille Combettes</a> and <a href="http://www.pokutta.com">Sebastian Pokutta</a>, where we propose to speed up the Frank-Wolfe algorithm by better aligning the descent direction with that of the negative gradient. This is achieved by chasing the negative gradient direction in matching-pursuit style, while still remaining projection-free. Although the idea is reasonably natural, it produces very significant results.</em></p>
<!--more-->
<p><em>Written by Cyrille Combettes.</em></p>
<h2 id="motivation">Motivation</h2>
<p>The Frank-Wolfe algorithm (FW) [FW, CG] is a simple projection-free algorithm addressing problems of the form</p>
\[\begin{align*}
\min_{x\in\mathcal{C}}f(x)
\end{align*}\]
<p>where $f$ is a smooth convex function and $\mathcal{C}$ is a compact convex set. At each iteration, FW performs a linear minimization $v_t\leftarrow\arg\min_{v\in\mathcal{C}}\langle\nabla f(x_t),v\rangle$ and updates $x_{t+1}\leftarrow x_t+\gamma_t(v_t-x_t)$. That is, it searches for a vertex $v_t$ minimizing the linear approximation of $f$ at $x_t$ over $\mathcal{C}$, i.e., $\arg\min_{v\in\mathcal{C}}f(x_t)+\langle\nabla f(x_t),v-x_t\rangle$, and moves in that direction. Thus, by imposing a step-size $\gamma_t\in\left[0,1\right]$, it ensures that $x_{t+1}=(1-\gamma_t)x_t+\gamma_tv_t\in\mathcal{C}$ is feasible by convex combination, and hence there is no need to use projections back onto $\mathcal{C}$. This property is very useful when projections onto $\mathcal{C}$ are much more expensive than linear minimizations over $\mathcal{C}$.</p>
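<p>As a concrete illustration (not code from any of the cited papers), here is vanilla FW over the probability simplex, where the linear minimization oracle is closed-form and the standard agnostic step-size rule $\gamma_t = 2/(t+2)$ is used:</p>

```python
import numpy as np

def fw_simplex(grad, x0, n_iters):
    """Vanilla Frank-Wolfe over the probability simplex.

    The linear minimization oracle over the simplex is closed-form:
    argmin_v <g, v> is the vertex e_i with i = argmin_i g_i."""
    x = x0.copy()
    for t in range(n_iters):
        g = grad(x)
        v = np.zeros_like(x)
        v[np.argmin(g)] = 1.0      # LMO: best vertex
        gamma = 2.0 / (t + 2.0)    # standard step-size rule
        x = x + gamma * (v - x)    # convex combination stays feasible
    return x

# toy problem: b already lies in the simplex, so the minimizer of
# f(x) = 0.5 * ||x - b||^2 over the simplex is b itself
b = np.array([0.2, 0.5, 0.3])
x = fw_simplex(lambda x: x - b, np.array([1.0, 0.0, 0.0]), 10000)
```

<p>Note that no projection is ever computed: every iterate is a convex combination of vertices by construction.</p>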
<p>The main drawback of FW, however, lies in its convergence rate, which can be excessively slow when the descent directions $v_t-x_t$ are inadequate. This motivated the Away-Step Frank-Wolfe algorithm (AFW) [W, LJ]. Figure 1 illustrates the zig-zagging phenomenon that can arise in FW and how it is resolved by AFW.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/boostfw/fig1.png" alt="img1" style="float:center; margin-right: 1%; width:45%" />
<img src="http://www.pokutta.com/blog/assets/boostfw/fig2.png" alt="img2" style="float:center; margin-right: 1%; width:45%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> Trajectory of the iterates of FW and AFW to minimize $f(x)=\norm{x}_2^2/2$ over the convex hull of $\setb{(-1,0),(0,1),(1,0)}$, starting from $x_0=(0,1)$. The solution is $x^\esx=(0,0)$. FW tries to reach $x^\esx$ by moving <em>towards</em> vertices, which is not efficient here as the directions $v_t-x_t$ become more and more orthogonal to $x^\esx-x_t$. AFW solves this issue by adding the option to move <em>away</em> from vertices; here, $x_4$ is obtained by moving away from $x_0$ and enables $x_5=x^\esx$.</p>
<p>However, the descent directions of AFW might still not be as favorable as those of gradient descent. Furthermore, in order to decide whether to move towards vertices or away from vertices and, when appropriate, to perform an away step, AFW needs to maintain a decomposition of the iterates as convex combinations of vertices of $\mathcal{C}$. This can become very costly in both memory usage and computation time [GM]. Thus, we propose to directly estimate the gradient descent direction $-\nabla f(x_t)$ by sequentially picking up vertices in matching-pursuit style [MZ]. By doing so, we can descend in directions better aligned with those of the negative gradients while still remaining projection-free.</p>
<h2 id="boosting-via-gradient-pursuit">Boosting via gradient pursuit</h2>
<p>At each iteration, we perform a sequence of rounds <em>chasing</em> the direction $-\nabla f(x_t)$. We initialize our direction estimate as $d_0\leftarrow0$. At round $k$, the residual is $r_k\leftarrow-\nabla f(x_t)-d_k$ and we aim at maximally reducing it by subtracting its maximum component among the vertex directions. We update $d_{k+1}\leftarrow d_k+\lambda_ku_k$ where $u_k\leftarrow v_k-x_t$, $v_k\leftarrow\arg\max_{v\in\mathcal{C}}\langle r_k,v\rangle$, and $\lambda_k\leftarrow\frac{\langle r_k,u_k\rangle}{\norm{u_k}^2}$: $\lambda_ku_k$ is the projection of $r_k$ onto its maximum component $u_k$. Note that this “projection” is actually closed-form and very cheap. The new residual is $r_{k+1}\leftarrow-\nabla f(x_t)-d_{k+1}=r_k-\lambda_ku_k$. We stop the procedure whenever the improvement in alignment between rounds $k$ and $k+1$ is not <em>sufficient</em>, i.e., whenever</p>
\[\begin{align*}
\frac{\langle-\nabla f(x_t),d_{k+1}\rangle}{\norm{\nabla f(x_t)}\norm{d_{k+1}}}-\frac{\langle-\nabla f(x_t),d_k\rangle}{\norm{\nabla f(x_t)}\norm{d_k}}<\delta
\end{align*}\]
<p>for some $\delta\in\left]0,1\right[$. In our experiments, we typically set $\delta=10^{-3}$.</p>
<p>We stress that $d_k$ can be well aligned with $-\nabla f(x_t)$ even when $\norm{-\nabla f(x_t)-d_k}$ is arbitrarily large: we aim at estimating <em>the direction of</em> $-\nabla f(x_t)$ and not the vector $-\nabla f(x_t)$ itself (which would require many more rounds). Once the procedure is completed, we use $g_t\leftarrow d_k/\sum_{\ell=0}^{k-1}\lambda_\ell$ as descent direction. This normalization ensures that the entire segment $[x_t,x_t+g_t]$ is in the feasible region $\mathcal{C}$. Hence, by updating $x_{t+1}\leftarrow x_t+\gamma_tg_t$ with a step-size $\gamma_t\in\left[0,1\right]$, we remain projection-free while descending in the direction $g_t$ better aligned with $-\nabla f(x_t)$. This is the design of our <em>Boosted Frank-Wolfe</em> algorithm (BoostFW). Figure 2 illustrates the procedure.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/boostfw/fig4.png" alt="img4" style="float:center; margin-right: 1%; width:22%" />
<img src="http://www.pokutta.com/blog/assets/boostfw/fig5.png" alt="img5" style="float:center; margin-right: 1%; width:22%" />
<img src="http://www.pokutta.com/blog/assets/boostfw/fig6.png" alt="img6" style="float:center; margin-right: 1%; width:22%" />
<img src="http://www.pokutta.com/blog/assets/boostfw/fig7.png" alt="img7" style="float:center; margin-right: 1%; width:22%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 2.</strong> Illustration of the gradient pursuit procedure. It builds a descent direction $g_t$ better aligned with the gradient descent direction $-\nabla f(x_t)$. We have $g_t=d_2/(\lambda_0+\lambda_1)$ where $d_2=\lambda_0u_0+\lambda_1u_1$, $u_0=v_0-x_t$, and $u_1=v_1-x_t$. Furthermore, note that $[x_t,x_t+d_2]\not\subset\mathcal{C}$ but $[x_t,x_t+g_t]\subset\mathcal{C}$. Hence, moving along the segment $[x_t,x_t+g_t]$ ensures feasibility of the new iterate $x_{t+1}$.</p>
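<p>The pursuit procedure itself is only a few lines. The sketch below is an illustrative reimplementation over the probability simplex (with its closed-form oracle), not the authors’ code; it assumes the current iterate lies in the interior so the vertex directions $v_k - x_t$ are nonzero.</p>

```python
import numpy as np

def align(a, b):
    """Cosine of the angle between a and b (0 if b is the zero vector)."""
    nb = np.linalg.norm(b)
    return 0.0 if nb == 0 else float(a @ b) / (np.linalg.norm(a) * nb)

def lmo_simplex(c):
    """Oracle over the probability simplex: argmax_v <c, v> is a vertex e_i."""
    v = np.zeros_like(c)
    v[np.argmax(c)] = 1.0
    return v

def boost_direction(neg_grad, x, lmo, delta=1e-3, max_rounds=50):
    """Chase -grad f(x) with vertex directions in matching-pursuit style.

    Returns the normalized direction g_t = d_K / sum_k lambda_k, which
    keeps the segment [x, x + g_t] inside the feasible region."""
    d, lam_sum = np.zeros_like(x), 0.0
    for _ in range(max_rounds):
        r = neg_grad - d                   # residual r_k
        u = lmo(r) - x                     # u_k = v_k - x_t, v_k = argmax <r_k, v>
        lam = float(r @ u) / float(u @ u)  # closed-form "projection" of r_k on u_k
        d_new = d + lam * u
        if align(neg_grad, d_new) - align(neg_grad, d) < delta:
            break                          # alignment improvement too small: stop
        d, lam_sum = d_new, lam_sum + lam
    return d / lam_sum if lam_sum > 0 else d

x = np.array([0.4, 0.3, 0.3])
neg_grad = np.array([1.0, -2.0, 1.0])
g = boost_direction(neg_grad, x, lmo_simplex)
# g is better aligned with neg_grad than the single FW direction would be
```

<p>On this toy input the boosted direction reaches a cosine alignment above $0.9$ with $-\nabla f(x_t)$, whereas the plain FW direction only reaches $0.5$.</p>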
<h2 id="computational-results">Computational results</h2>
<p>Observe that BoostFW likely performs multiple linear minimizations per iteration, while FW performs only $1$ and AFW performs $\sim2$ ($1$ for the FW vertex and $\sim1$ for the away vertex). Thus, one might wonder if the progress obtained by the gradient pursuit procedure is washed away by the higher number of linear minimizations. We conducted a series of computational experiments demonstrating that this is not the case and that the advantage is quite substantial. We compared BoostFW to AFW, DICG [GM], and BCG [BPTW] on various tasks: sparse signal recovery, sparsity-constrained logistic regression, traffic assignment, collaborative filtering, and video-colocalization. Figures 3-6 show that BoostFW outperforms the other algorithms both per iteration and in CPU time, although it calls the linear minimization oracle more often: BoostFW makes better use of its oracle calls.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/boostfw/lasso-xkcd.png" alt="lasso" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 3.</strong> Sparse signal recovery.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/boostfw/gisette-all-ls-noafwl-xkcd.png" alt="gisette" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 4.</strong> Sparse logistic regression on the Gisette dataset.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/boostfw/traffic-xkcd.png" alt="traffic" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 5.</strong> Traffic assignment. DICG is not applicable here.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/boostfw/collabo-xkcd.png" alt="collabo" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 6.</strong> Collaborative filtering on the MovieLens 100k dataset. DICG is not applicable here.</p>
<p>Lastly, we present a preliminary extension of our boosting procedure to DICG. DICG is known to perform particularly well on the video-colocalization experiment of [JTF]. The comparison is made in duality gap, in line with [GM]. Figure 7 shows promising results for BoostDICG.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/boostfw/video-gaps-xkcd.png" alt="video" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 7.</strong> Video-colocalization on the YouTube-Objects dataset.</p>
<h3 id="references">References</h3>
<p>[FW] Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1‐2), 95-110. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109">pdf</a></p>
<p>[CG] Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 6(5), 787-823. <a href="http://www.mathnet.ru/php/archive.phtml?wshow=paper&jrnid=zvmmf&paperid=7415&option_lang=eng">pdf</a></p>
<p>[W] Wolfe, P. (1970). Convergence theory in nonlinear programming. Integer and nonlinear programming, 1-36.</p>
<p>[LJ] Lacoste-Julien, S., & Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems (pp. 496-504). <a href="http://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf">pdf</a></p>
<p>[GM] Garber, D., & Meshi, O. (2016). Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. In Advances in Neural Information Processing Systems (pp. 1001-1009). <a href="http://papers.nips.cc/paper/6115-linear-memory-and-decomposition-invariant-linearly-convergent-conditional-gradient-algorithm-for-structured-polytopes">pdf</a></p>
<p>[MZ] Mallat, S. G., & Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on signal processing, 41(12), 3397-3415. <a href="https://pdfs.semanticscholar.org/0b6e/98a6a8cf8283fd76fe1100b23f11f4cfa711.pdf">pdf</a></p>
<p>[BPTW] Braun, G., Pokutta, S., Tu, D., & Wright, S. (2019). Blended conditional gradients: the unconditioning of conditional gradients. Proceedings of ICML. <a href="https://arxiv.org/abs/1805.07311">pdf</a></p>
<p>[JTF] Joulin, A., Tang, K., & Fei-Fei, L. (2014, September). Efficient image and video co-localization with the Frank-Wolfe algorithm. In European Conference on Computer Vision (pp. 253-268). Springer, Cham. <a href="https://link.springer.com/chapter/10.1007/978-3-319-10599-4_17">pdf</a></p>Cyrille CombettesTL;DR: This is an informal summary of our recent paper Boosting Frank-Wolfe by Chasing Gradients by Cyrille Combettes and Sebastian Pokutta, where we propose to speed up the Frank-Wolfe algorithm by better aligning the descent direction with that of the negative gradient. This is achieved by chasing the negative gradient direction in matching-pursuit style, while still remaining projection-free. Although the idea is reasonably natural, it produces very significant results.Non-Convex Boosting via Integer Programming2020-02-13T06:00:00+01:002020-02-13T06:00:00+01:00http://www.pokutta.com/blog/research/2020/02/13/ipboost-abstract<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/2002.04679">IPBoost – Non-Convex Boosting via Integer Programming</a> with <a href="https://www2.mathematik.tu-darmstadt.de/~pfetsch/">Marc Pfetsch</a>, where we present a non-convex boosting procedure that relies on integer programming. Rather than solving a convex proxy problem, we solve the actual classification problem with discrete decisions. The resulting procedure achieves performance on par with or better than AdaBoost; however, it is robust to label noise that can defeat convex potential boosting procedures.</em>
<!--more--></p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>Boosting (see <a href="https://en.wikipedia.org/wiki/Boosting_(machine_learning)">Wikipedia</a>) is an important (and by now standard) technique in classification to combine several “low accuracy” learners, so-called <em>base learners</em>, into a “high accuracy” learner, a so-called
<em>boosted learner</em>. Pioneered by the AdaBoost approach of [FS], in recent decades there has been extensive
work on boosting procedures and analyses of their limitations. In a nutshell, boosting procedures are (typically) iterative schemes that roughly work as follows: for $t = 1, \dots, T$ do the following:</p>
<ol>
<li>Train a learner $\mu_t$ from a given class of base learners on
the data distribution $\mathcal D_t$.</li>
<li>Evaluate performance of $\mu_t$ by computing its loss.</li>
<li>Push weight of the data distribution $\mathcal D_t$ towards misclassified examples leading to $\mathcal D_{t+1}$.</li>
</ol>
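<p>For concreteness, this template can be sketched as a classical AdaBoost-style procedure [FS] with decision stumps as base learners; the code below is an illustrative textbook instantiation, not code from the paper.</p>

```python
import numpy as np

def best_stump(X, y, w):
    """Step 1: the weak learner. Returns the single-feature threshold
    stump with the lowest weighted error on the distribution w."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] >= thr, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, T):
    """The three-step template: train on D_t, evaluate the weighted
    loss, then push weight toward misclassified examples."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # D_1: uniform distribution
    ensemble = []
    for _ in range(T):
        err, j, thr, sign = best_stump(X, y, w)  # step 1
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)    # step 2: learner weight
        pred = sign * np.where(X[:, j] >= thr, 1, -1)
        w *= np.exp(-alpha * y * pred)           # step 3: reweight D_t -> D_{t+1}
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    """Final combination: a soft vote over the base learners."""
    score = sum(a * s * np.where(X[:, j] >= t, 1, -1)
                for a, j, t, s in ensemble)
    return np.sign(score)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
ensemble = adaboost(X, y, T=5)
```
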
<p>Finally, the learners are combined by some form of voting (e.g., soft or hard voting, averaging, thresholding). A close inspection of most (but not all) boosting procedures reveals that they solve an underlying convex optimization problem over a convex loss function by means of coordinate gradient descent. Boosting schemes of this type are often referred to as <em>convex potential boosters</em>. These procedures can achieve exceptional performance on many data sets if the data is correctly labeled. In fact, in theory, provided the class of base learners is rich enough, a perfect strong learner can be constructed that has accuracy $1$ (see e.g., [AHK]); however, such a learner might not necessarily generalize well. Boosted learners can generate quite complicated decision boundaries, much more complicated than those of the base learners. Here is an example from <a href="https://paulvanderlaken.com/2020/01/20/animated-machine-learning-classifiers/">Paul van der Laken’s blog / Extreme gradient boosting gif by Ryan Holbrook</a>. Here data is generated online according to some process with optimal decision boundary represented by the dotted line, and <a href="https://xgboost.ai/">XGBoost</a> was used to learn a classifier:</p>
<p class="minimg"><img src="https://paulvanderlaken.files.wordpress.com/2020/01/xgboost.gif?w=518&zoom=2" alt="XGBoost example" /></p>
<p><br /></p>
<h3 id="label-noise">Label noise</h3>
<p>In reality we usually face unclean data and so-called label noise, where some percentage of the classification labels might be corrupted. We would also like to construct strong learners for such data. However, if we revisit the general boosting template from above, then we might suspect that we run into trouble as soon as a certain fraction of training examples is mislabeled: these examples cannot be correctly classified, and the procedure shifts more and more weight towards these bad examples. This eventually leads to a strong learner that perfectly predicts the (flawed) training data but no longer generalizes well. This intuition has been formalized by [LS], who construct a “hard” training data distribution where a small percentage of labels is randomly flipped. This label noise then leads to a significant reduction in performance of these boosted learners; see the tables below. The more technical reason for this problem is actually the convexity of the loss function that is minimized by the boosting procedure. Clearly, one can use all types of “tricks” such as <em>early stopping</em>, but at the end of the day this does not solve the fundamental problem.</p>
<h2 id="our-results">Our results</h2>
<p>To combat the problem of convex potential boosters being susceptible to label noise, rather than optimizing some convex proxy loss, why not consider the classification problem with the actual <em>misclassification loss function</em>:</p>
\[\tag{classify}
\begin{align}
\label{eq:trueLoss}
\ell(\theta,D) \doteq \sum_{i \in I} \mathbb I[h_\theta(x_i) \neq
y_i],
\end{align}\]
<p>where $h_\theta$ is a learner parameterized by $\theta$ and $D = \setb{(x_i,y_i) \mid i \in I}$ is the training data? This loss function counts the number of misclassifications, which somehow seems to be a natural quantity to minimize. Unfortunately, this loss function is non-convex, however, the resulting optimization problem can be rather naturally phrased as an <em>Integer Program (IP)</em>. In fact, our basic boosting model is captured by the following integer programming problem:</p>
\[\tag{basicBoost}
\begin{align*}
\min\; & \sum_{i=1}^N z_i \\
& \sum_{j=1}^L \eta_{ij}\, \lambda_j + (1 + \rho) z_i \geq
\rho\quad\forall\, i \in [N],\\
& \sum_{j=1}^L \lambda_j = 1,\; \lambda \geq 0,\\
& z \in \{0,1\}^N,
\end{align*}\]
<p>where the matrix $\eta_{ij}$ encodes the predictions of learner $j$ on example $i$. The boosting part comes naturally into play here as the number of base learners is potentially huge (sometimes even infinite) and we have to generate these learners with some procedure. We do this by means of <em>column generation</em>, where we add base learners via a pricing problem that essentially generates an acceptable base learner for the modified data distribution encoded in the dual variables of the relaxed problem (basicBoost). This is somewhat similar to the (LP-based) LPBoost approach of [DKS]; however, we consider an integer program here, in which column generation is significantly more involved. The dual problem is of the following form:</p>
\[\begin{align*}
\tag{dualProblem}
\max\; & \rho \sum_{i=1}^N w_i + v - \sum_{i=1}^N u_i\\
& \sum_{i=1}^N \eta_{ij}\, w_i + v \leq 0 \quad\forall\, j \in \mathcal{L},\\
& \;(1 + \rho) w_i - u_i \leq 1\quad\forall\, i \in [N],\\
& w \geq 0,\; u \geq 0,\; v \text{ free},
\end{align*}\]
<p>and after some cleanup the pricing constraint that needs to be satisfied is:</p>
\[\tag{pricing}
\begin{equation}\label{eq:PricingProb}
\sum_{i=1}^N \eta_{ij}\, w_i^\esx + v^\esx > 0
\end{equation},\]
<p>We then ask whether there exists a base learner $h_j \in \Omega$ such
that (pricing) holds. Here, the $w_i^\esx$ can be seen
as weights over the points $x_i$ with $i \in [N]$, and we have to
classify the points according to these weights. This pricing problem is solved within a branch-and-cut-and-price framework, which complicates things significantly compared to column generation in the LP case.</p>
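<p>To make the model concrete, here is a small numerical sketch of (basicBoost) solved as a MILP with SciPy; the matrix <code class="highlighter-rouge">eta</code> and the margin <code class="highlighter-rouge">rho</code> are made-up toy values for illustration, not data from the paper:</p>

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Toy data: eta[i, j] = +1 if learner j classifies example i correctly, -1 otherwise.
eta = np.array([[ 1,  1, -1],
                [ 1, -1,  1],
                [-1,  1,  1],
                [-1, -1, -1]], dtype=float)  # last example is misclassified by everyone
N, L = eta.shape
rho = 0.1  # margin parameter

# Decision vector x = (lambda_1, ..., lambda_L, z_1, ..., z_N); minimize sum_i z_i.
c = np.concatenate([np.zeros(L), np.ones(N)])

# Margin constraints: eta_i . lambda + (1 + rho) z_i >= rho for all i.
margin = LinearConstraint(np.hstack([eta, (1 + rho) * np.eye(N)]), lb=rho, ub=np.inf)
# Convex combination: sum_j lambda_j = 1.
simplex = LinearConstraint(np.concatenate([np.ones(L), np.zeros(N)])[None, :], lb=1, ub=1)

integrality = np.concatenate([np.zeros(L), np.ones(N)])  # lambda continuous, z binary
bounds = Bounds(lb=np.zeros(L + N), ub=np.ones(L + N))

res = milp(c, constraints=[margin, simplex], integrality=integrality, bounds=bounds)
print(res.status, res.fun)  # optimal value = minimum number of misclassified examples
```

<p>On this toy instance the optimum is $1$: the uniform mixture of the three learners classifies the first three examples with margin $1/3 \geq \rho$, while the last example cannot be classified correctly by any convex combination, forcing its $z_i$ to $1$.</p>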
<h3 id="computations">Computations</h3>
<p>Solving an IP is computationally much more expensive than traditional boosting approaches or LPBoost. However, what we gain is robustness and stability. For the hard distribution of [LS], we significantly outperform Adaboost and gain moderately compared to LPBoost, which is already more robust towards label noise as it re-solves the optimization in each round, albeit for a proxy loss function. The reported accuracy is test accuracy over multiple runs for various parameters, and $L$ denotes the (average) number of learners generated in order to construct the strong learner.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/ipboost/hardDistPerf.png" alt="Results on hard distribution" /></p>
<p>Note that the hard distribution is for a binary classification problem, so that $50\%$ accuracy is random guessing. Therefore, the improvement from $53.27\%$ for Adaboost, which is basically random guessing, to $69.03\%$ is quite significant.</p>
<h3 id="references">References</h3>
<p>[FS] Freund, Y., & Schapire, R. E. (1995, March). A decision-theoretic generalization of on-line learning and an application to boosting. In <em>European conference on computational learning theory</em> (pp. 23-37). Springer, Berlin, Heidelberg. <a href="https://pdfs.semanticscholar.org/5fb5/f7b545a5320f2a50b30af599a9d9a92a8216.pdf">pdf</a></p>
<p>[AHK] Arora, S., Hazan, E., & Kale, S. (2012). The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1), 121-164. <a href="http://www.theoryofcomputing.org/articles/v008a006/v008a006.pdf">pdf</a></p>
<p>[LS] Long, P. M., & Servedio, R. A. (2010). Random classification noise defeats all convex potential boosters. Machine learning, 78(3), 287-304. <a href="http://www.machinelearning.org/archive/icml2008/papers/258.pdf">pdf</a></p>
<p>[DKS] Demiriz, A., Bennett, K. P., & Shawe-Taylor, J. (2002). Linear programming boosting via column generation. Machine Learning, 46(1-3), 225-254. <a href="https://link.springer.com/content/pdf/10.1023/A:1012470815092.pdf">pdf</a></p>Sebastian PokuttaTL;DR: This is an informal summary of our recent paper IPBoost – Non-Convex Boosting via Integer Programming with Marc Pfetsch, where we present a non-convex boosting procedure that relies on integer programing. Rather than solving a convex proxy problem, we solve the actual classification problem with discrete decisions. The resulting procedure achieves performance at par or better than Adaboost however it is robust to label noise that can defeat convex potential boosting procedures.Approximate Carathéodory via Frank-Wolfe2019-11-30T00:00:00+01:002019-11-30T00:00:00+01:00http://www.pokutta.com/blog/research/2019/11/30/approxCara-abstract<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/pdf/1911.04415.pdf">Revisiting the Approximate Carathéodory Problem via the Frank-Wolfe Algorithm</a> with <a href="https://www.linkedin.com/in/cyrille-combettes/">Cyrille W Combettes</a>. We show that the Frank-Wolfe algorithm constitutes an intuitive and efficient method to obtain a solution to the approximate Carathéodory problem and that it also provides improved cardinality bounds in particular scenarios.</em>
<!--more--></p>
<p><em>Written by Cyrille W Combettes.</em></p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>Let $\mathcal{V}\subset\mathbb{R}^n$ be a compact set and denote by $\mathcal{C} \doteq \operatorname{conv}(\mathcal{V})$ its convex hull. Slightly abusing notation, we will refer to any point in $\mathcal{V}$ as a <em>vertex</em>. Let the <em>cardinality</em> of a point $x\in\operatorname{conv}(\mathcal{V})$ be the minimum number of vertices necessary to form $x$ as a convex combination. Then Carathéodory’s theorem [C] states that every point $x\in\mathcal{C}$ has cardinality at most $n+1$, and this bound is tight. However, if we can afford an $\epsilon$-approximation with respect to some norm, can we improve this bound?</p>
<p>The approximate Carathéodory theorem states that if $p \geq 2$, then for every $x^\esx \in \mathcal{C}$ there exists $x \in \mathcal{C}$ of cardinality $\mathcal{O}(pD_p^2/\epsilon^2)$ satisfying $\norm{x-x^\esx}_p \leq \epsilon$, where $D_p$ is the diameter of $\mathcal{V}$ in $\ell_p$-norm. This result is independent of the dimension $n$ and is therefore particularly significant in high dimensional spaces. Furthermore, [MLVW] showed that this bound is tight.</p>
<p>Let $p\geq2$. A natural way to think about the approximate Carathéodory problem is to minimize $f(x)=\norm{x-x^\esx}_p$ by sequentially picking up vertices, starting from an arbitrary vertex. By doing so, we hope to converge fast enough to $x^\esx$ so as to keep the number of iterations low, hence, to pick up as few vertices as possible. This is precisely the Frank-Wolfe algorithm [FW], a.k.a. conditional gradient algorithm [CG]. At each iteration, it selects a vertex via the following linear minimization problem:</p>
\[\begin{align*}
v_t\leftarrow\arg\min_{v\in\mathcal{V}}\langle\nabla f(x_t),v\rangle
\end{align*}\]
<p>and then moves towards that vertex, i.e., in the direction $v_t-x_t$:</p>
\[\begin{align*}
x_{t+1}\leftarrow x_t+\gamma_t(v_t-x_t)
\end{align*},\]
<p>where $\gamma_t \in [0,1]$. Note that this amounts to selecting the direction formed from the current iterate $x_t$ to a vertex $v_t$ that is most aligned with the gradient descent direction $-\nabla f(x_t)$, up to a normalization factor as measured by the inner product. Thus, FW “approximates” gradient descent with sparse directions ensuring that at most $1$ new vertex is added to the convex decomposition of the iterate $x_t$. Therefore, if $T$ is the number of iterations necessary to achieve $\norm{x_T-x^\esx}_p \leq \epsilon$, then $x_T$ is an $\epsilon$-approximate solution with cardinality $T+1$.</p>
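<p>The procedure above can be sketched in a few lines of Python for the Euclidean case $p=2$; the function name, the step-size rule $\gamma_t = 2/(t+2)$, and the toy simplex instance are our own choices for illustration, not taken from the paper:</p>

```python
import numpy as np

def fw_caratheodory(V, x_star, eps, max_iter=10_000):
    """Frank-Wolfe sketch for approximate Caratheodory with p = 2:
    build a sparse convex combination of the rows of V close to x_star."""
    x = V[0].copy()                   # start at an arbitrary vertex
    weights = {0: 1.0}                # vertex index -> convex weight
    for t in range(max_iter):
        if np.linalg.norm(x - x_star) <= eps:
            break
        grad = 2.0 * (x - x_star)     # gradient of ||x - x_star||_2^2
        i = int(np.argmin(V @ grad))  # linear minimization oracle over the vertices
        gamma = 2.0 / (t + 2.0)       # standard FW step size
        x = (1 - gamma) * x + gamma * V[i]
        weights = {j: (1 - gamma) * w for j, w in weights.items()}
        weights[i] = weights.get(i, 0.0) + gamma
    return x, weights

# toy instance: approximate the barycenter of the standard simplex in R^10
V = np.eye(10)
x_star = np.full(10, 0.1)
x, weights = fw_caratheodory(V, x_star, eps=0.1)
print(np.linalg.norm(x - x_star), sum(weights.values()))
```

<p>Each iteration adds at most one new vertex to the decomposition, so the cardinality of the output is bounded by the number of iterations plus one, exactly as in the argument above.</p>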
<p>We can estimate $T$ using convergence results for FW. These often require convexity and smoothness of the objective function, and sometimes also strong convexity. In the case of $f(x)=\norm{x-x^\esx}_p^2$ (note that we squared the norm to obtain the following properties), we can verify that $f$ is convex and smooth, but it is strongly convex only for $p \in \left]1,2\right]$ (with respect to the $\ell_p$-norm). However, we can replace the strong convexity requirement with a weaker one satisfied by $f$, namely the Polyak-Łojasiewicz (PL) inequality [P], [L]:</p>
\[\begin{align*}
f(x)-\min_{\mathbb{R}^n}f
\leq\frac{1}{2\mu}\|\nabla f(x)\|_*^2.
\end{align*}\]
<p>Now, by using some existing convergence results using the PL condition for FW available in [LP], [GM], [J], [GH], we can directly deduce cardinality bounds in different scenarios. In particular, the approximate Carathéodory bound $\mathcal{O}(pD_p^2/\epsilon^2)$ is achieved and FW constitutes a very intuitive method to obtain a solution to the approximate Carathéodory problem.</p>
<table>
<thead>
<tr>
<th style="text-align: left">Assumptions</th>
<th style="text-align: right">FW Rate</th>
<th style="text-align: right">Cardinality Bound</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">—</td>
<td style="text-align: right">$\frac{4(p-1)D_p^2}{t+2}$</td>
<td style="text-align: right">$\frac{4(p-1)D_p^2}{\epsilon^2}=\mathcal{O}\left(\frac{pD_p^2}{\epsilon^2}\right)$</td>
</tr>
<tr>
<td style="text-align: left">$\mathcal{C}$ is $S_p$-strongly convex</td>
<td style="text-align: right">$\frac{\max\{9(p-1)D_p^2,1152(p-1)^2/S_p^2\}}{(t+2)^2}$</td>
<td style="text-align: right">$\mathcal{O}\left(\frac{\sqrt{p}D_p+p/S_p}{\epsilon}\right)$</td>
</tr>
<tr>
<td style="text-align: left">$x^\esx \in \operatorname{relint}_p(\mathcal{C})$ with radius $r_p$</td>
<td style="text-align: right">$\left(1-\frac{1}{p-1}\frac{r_p^2}{D_p^2}\right)^t\epsilon_0$</td>
<td style="text-align: right">$\mathcal{O}\left(\frac{pD_p^2}{r_p^2}\ln\left(\frac{1}{\epsilon}\right)\right)$</td>
</tr>
</tbody>
</table>
<p>Let $H_n$ be a Hadamard matrix of dimension $n$ and $\mathcal{C} \doteq \operatorname{conv}(H_n/n^{1/p})$ be the convex hull of its normalized columns with respect to the $\ell_p$-norm. Suppose we want to approximate the convex decomposition of $x^\esx \doteq (H_n/n^{1/p})\mathbf{1}/n=e_1/n^{1/p}$; this is the lower bound instance from [MLVW]. Below we plot the performance of FW and two variants, Away-Step Frank-Wolfe (AFW) and Fully-Corrective Frank-Wolfe (FCFW), on the approximate Carathéodory problem with $p=7$, as well as (a minor correction to) the corresponding lower bound stated by [MLVW].</p>
<p class="minimg"><img src="http://www.pokutta.com/blog/assets/approxCara/lower7-card.png" alt="p7 norm" /></p>
<p>We see that AFW performs better than FW and that FCFW almost matches the lower bound. It remains an open question however to derive a precise convergence rate for FCFW in this setting rather than simply inheriting the rate of AFW via [LJ], which seems to be too loose here.</p>
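<p>For reference, the lower-bound instance above can be generated in a few lines; this sketch uses <code class="highlighter-rouge">scipy.linalg.hadamard</code>, which produces Sylvester-type Hadamard matrices for $n$ a power of two, and the concrete values of $n$ and $p$ are just for illustration:</p>

```python
import numpy as np
from scipy.linalg import hadamard

p, n = 7, 64                       # n must be a power of 2 for this construction
H = hadamard(n)                    # entries in {-1, +1}
V = H / n ** (1 / p)               # normalized columns are the vertices
x_star = V @ np.ones(n) / n        # barycenter of the columns

# Only the first row of H has a nonzero row sum (= n), hence x_star = e_1 / n**(1/p).
assert np.allclose(x_star, np.eye(n)[0] / n ** (1 / p))
```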
<h3 id="references">References</h3>
<p>[C] Carathéodory, C. (1907). Über den Variabilitätsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen. Mathematische Annalen, 64(1), 95-115. <a href="https://link.springer.com/content/pdf/10.1007/BF01449883.pdf">pdf</a></p>
<p>[MLVW] Mirrokni, V., Leme, R. P., Vladu, A., & Wong, S. C. W. (2017, August). Tight bounds for approximate Carathéodory and beyond. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 2440-2448). <a href="https://arxiv.org/pdf/1512.08602.pdf">pdf</a></p>
<p>[FW] Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1‐2), 95-110. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109">pdf</a></p>
<p>[CG] Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 6(5), 787-823. <a href="http://www.mathnet.ru/php/archive.phtml?wshow=paper&jrnid=zvmmf&paperid=7415&option_lang=eng">pdf</a></p>
<p>[P] Polyak, B. T. (1963). Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4), 864-878. <a href="https://www.researchgate.net/profile/Boris_Polyak2/publication/243648552_Gradient_methods_for_the_minimisation_of_functionals/links/5a608e09aca272328103d55e/Gradient-methods-for-the-minimisation-of-functionals.pdf">pdf</a></p>
<p>[L] Lojasiewicz, S. (1963). Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, 117, 87-89.</p>
<p>[LP] Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. USSR Computational mathematics and mathematical physics, 6(5), 1-50.</p>
<p>[GM] Guélat, J., & Marcotte, P. (1986). Some comments on Wolfe’s ‘away step’. Mathematical Programming, 35(1), 110-119. <a href="https://link.springer.com/content/pdf/10.1007/BF01589445.pdf">pdf</a></p>
<p>[J] Jaggi, M. (2013, June). Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In ICML (1) (pp. 427-435). <a href="http://proceedings.mlr.press/v28/jaggi13-supp.pdf">pdf</a></p>
<p>[GH] Garber, D., & Hazan, E. (2014). Faster rates for the Frank-Wolfe method over strongly-convex sets. arXiv preprint arXiv:1406.1305. <a href="http://proceedings.mlr.press/v37/garbera15-supp.pdf">pdf</a></p>
<p>[LJ] Lacoste-Julien, S., & Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems (pp. 496-504). <a href="http://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf">pdf</a></p>Cyrille W CombettesSCIP x Raspberry Pi: SCIP on Edge2019-09-29T08:00:00+02:002019-09-29T08:00:00+02:00http://www.pokutta.com/blog/random/2019/09/29/scipberry<p><em>TL;DR: Running SCIP on a Raspberry Pi 4 with relatively moderate performance losses (compared to a standard machine) of a factor of 3-5 brings Integer Programming into the realm of Edge Computing.</em>
<!--more--></p>
<p>Edge Computing is concerned with the deployment of compute and algorithms close to their actual location. Thinking about it for a few minutes one easily comes up with a lot of good reasons why one might want to do this. From <a href="https://en.wikipedia.org/wiki/Edge_computing">Wikipedia</a>:</p>
<blockquote>
<p><strong>Edge computing</strong> is a <a href="https://en.wikipedia.org/wiki/Distributed_computing">distributed computing</a> paradigm which brings <a href="https://en.wikipedia.org/wiki/Computation">computation</a> and <a href="https://en.wikipedia.org/wiki/Data_storage">data storage</a> closer to the location where it is needed, to improve response times and save bandwidth.</p>
</blockquote>
<p>For example, there is great hardware out there for bringing deep learning applications to the edge; examples in this category are the <a href="https://developer.nvidia.com/embedded/jetson-tx2-developer-kit">NVidia Jetson TX kits</a>.</p>
<p>For completely different but not unrelated reasons, I have recently been thinking a lot about the interplay of hardware and software, and in particular about the potential of, e.g., FPGAs to realize customized functions in hardware to better support algorithms (and their implementations) that we care about: for a better energy footprint and closer-to-the-edge deployment on one end of the spectrum, and for highest-performance operations on the other. To make things more tangible, think of a specialized FPGA for Integer Programming. Why? While we have great solutions for deploying, e.g., deep learning applications on the edge, there is <em>nothing</em> there for deploying integer programming codes, i.e., discrete decision making, on the edge. As such, I was curious to get <a href="https://scip.zib.de/">SCIP</a> up and running on a <a href="https://www.raspberrypi.org/products/raspberry-pi-4-model-b/">Raspberry Pi 4 B (4GB RAM)</a> board (RPi 4), which can be bought for $55, e.g., on Amazon. Tentative working title: <em>SCIPberry</em>; after all, every Raspberry Pi project needs to have “berry” in its name.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/scipberry/backgroundSCIP.png" alt="scipberry" /></p>
<p>(Polyhedral raspberry logo by <a href="https://www.reddit.com/user/SiRo126/">SiRo126</a>)</p>
<h2 id="scip-on-edge">SCIP on Edge</h2>
<p>So first let us have a look at all the required pieces.</p>
<h3 id="the-hardware">The hardware</h3>
<p>A <a href="https://www.raspberrypi.org/products/raspberry-pi-4-model-b/">Raspberry Pi 4 with 4GB of RAM</a>. As Integer Programs can get large we need some memory and hence the 4GB version. The RPi 4 is really a tiny device that can easily fit into the palm of your hand. See this image from <a href="https://www.raspberrypi.org/products/raspberry-pi-4-model-b/">raspberrypi.org</a>:</p>
<p class="minimg"><img src="http://www.pokutta.com/blog/assets/scipberry/raspberrypi4.png" alt="raspberrypi4" /></p>
<p>At a price point of about $55 you might think that this is not much more than a toy, but the RPi 4 is surprisingly powerful; check out the <a href="https://www.raspberrypi.org/magpi/raspberry-pi-4-specs-benchmarks/">benchmarks</a>. Moreover, you can actually run <a href="https://github.com/JuliaBerry">Julia</a>, <a href="https://www.wolfram.com/raspberry-pi/">Mathematica</a>, and <a href="https://www.raspberrypi.org/documentation/usage/python/">python</a> on it. Not bad for $55!</p>
<p><strong>Interlude:</strong> As a feasibility study I have been running my day-to-day work on a RPi 4 for a few weeks now and it is refreshingly sufficient. Effectively almost all services that I use in my day-to-day operations are now cloud based, so that you can run them in a web browser and with Raspbian’s Chromium (the open source base of Chrome) you get quite far. Moreover, there is also <a href="https://linuxhint.com/install_firefox_raspberry_pi/">Firefox available</a> for Raspbian. Interestingly the user agent detection of Google Calendar forces Chromium to display Google Calendar in some weird mobile version, which made me look for Firefox in the first place.</p>
<h4 id="purely-optional-and-your-own-risk-spiked-berrys">Purely optional and at your own risk: Spiked berries</h4>
<p>The Raspberry Pi is an extremely versatile and flexible device. In fact, you can easily overclock it by simply changing its startup configuration. How far you can go with this depends on how lucky you have been in the silicon lottery, and I strongly recommend first reading <a href="https://www.tomshardware.com/reviews/raspberry-pi-4-b-overclocking,6188.html">this article about overclocking an RPi 4</a> and <a href="https://www.tomshardware.com/reviews/raspberry-pi-4-overclock-2-ghz,6254.html">this one for how to push it to 2 GHz</a>; your mileage may vary. It looks like I have been lucky in the silicon lottery, as I pushed my RPi 4 to 2GHz without any issues and with perfectly stable behavior over multiple days, under full load of all four cores with varying workloads. Very important: you <em>will</em> need an active cooling case, as otherwise you will run into thermal throttling, negating the effect of the overclocking. For example, the Miuzei case with active cooling (<a href="https://www.amazon.de/gp/product/B07TYW63M8/ref=ppx_yo_dt_b_asin_title_o00_s00?ie=UTF8&psc=1&pldnSite=1">link to Amazon Germany</a>; the same can be found on, e.g., Amazon US) for about 18 Euro (or around $20) works great. You can barely hear the fan, and it keeps the RPi 4 at 69C under full load of all four cores, rather far away from the thermal throttling point of 80C. Moreover, the power adapter provides enough juice to support the overvoltage we have to supply. To see what overclocking buys you in terms of performance, you can check out <a href="https://www.tomshardware.com/reviews/raspberry-pi-4-overclock-2-ghz,6254.html">the benchmarks at the end of the overclocking article</a>. The short version: it buys you somewhere between 3% and 33% in their tests, and in our tests below we pretty much get the full 33%.</p>
<p>In <code class="highlighter-rouge">/boot/config.txt</code> (edit it with, e.g., <code class="highlighter-rouge">sudo nano /boot/config.txt</code>), I use the following for overclocking in the <code class="highlighter-rouge">[pi4]</code> section.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">over_voltage</span><span class="o">=</span>6
<span class="nv">arm_freq</span><span class="o">=</span>2000
<span class="nv">gpu_freq</span><span class="o">=</span>600
</code></pre></div></div>
<p>The <code class="highlighter-rouge">over_voltage=6</code> setting is composed of three increments of <code class="highlighter-rouge">2</code>: one <code class="highlighter-rouge">2</code> to go up to 1.750 GHz for the CPU, another <code class="highlighter-rouge">2</code> to go up to 2 GHz, and the final <code class="highlighter-rouge">2</code> for the GPU overclocking.</p>
<p><strong>Note:</strong> This is at your <strong>own risk</strong> and I strongly suggest reading <a href="https://www.tomshardware.com/reviews/raspberry-pi-4-b-overclocking,6188.html">this article about overclocking RPi4</a> and <a href="https://www.tomshardware.com/reviews/raspberry-pi-4-overclock-2-ghz,6254.html">this one for how to push it to 2 GHz</a> to understand how to troubleshoot and fix things if something goes wrong.</p>
<h3 id="the-software">The software</h3>
<p>On the software side, I went with the <a href="https://www.raspberrypi.org/downloads/raspbian/">Raspbian Buster</a> image from <a href="https://www.raspberrypi.org/">Raspberrypi.org</a>. Then I did the usual package updates, installed <code class="highlighter-rouge">cmake</code> with <code class="highlighter-rouge">sudo apt-get install cmake</code>, and compiled the <a href="https://scip.zib.de">scip optimization suite</a>. Compilation worked right out of the box thanks to <code class="highlighter-rouge">cmake</code> and appropriate build files. Compile time on a stock RPi 4 is about 40 mins for the whole optimization suite:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>real 41m38.569s
user 38m39.830s
sys 2m41.798s
</code></pre></div></div>
<h4 id="a-few-notes-on-compilation-etc">A few notes on compilation etc</h4>
<p>There seems to be a minor issue with the <a href="https://www.raspberrypi.org/forums/viewtopic.php?t=245846">linux kernel reporting the wrong arm architecture</a>, which might lead to suboptimal compiler flags as determined by <code class="highlighter-rouge">cmake</code> based on the reported architecture; this is due to the upstream linux kernel implementation. At the same time, I did some preliminary tests, and the compiler flags can have a substantial performance impact, in particular the configuration of the floating point unit (easily a factor of 2). We intend to make Raspberry Pi binaries of SCIP available relatively soon once we have tuned the compilation for the ARM architecture. Meanwhile, you can simply compile SCIP from the sources with its out-of-the-box configuration, which is stable but not yet optimized for the ARM architecture. <strong>If some ARM compilation expert, in particular w.r.t. floating-point arithmetic, reads this, please drop me a line!</strong></p>
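<p>To see what your own kernel reports (the forum thread above is about exactly this value), a quick check from Python; <code class="highlighter-rouge">uname -m</code> on the command line gives the same information:</p>

```python
import platform

# Machine type as reported by the kernel, e.g. 'armv7l' or 'aarch64' on an RPi 4,
# 'x86_64' on a typical desktop; build systems base flag detection on values like this.
machine = platform.machine()
print(machine)
```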
<h2 id="performance-and-benchmarks">Performance and benchmarks</h2>
<p>I did two comparisons in terms of performance. The first one is between a MacBook Pro and the RPi 4 on some standard MIPLIB instances, and the second one is a full MIPLIB 2017 benchmark run.</p>
<h3 id="unscientific-comparison">Unscientific comparison</h3>
<p>The comparison below is between a MacBook Pro (Core i7 3.5 GHz with 16GB RAM) vs. stock Raspberry Pi 4 vs. spiked Raspberry Pi 4 (running at 2 GHz for the CPU and 600 MHz for the GPU) for a few select instances from the <a href="https://miplib.zib.de/">MIPLIB 2017</a> running SCIP. As discussed above, the Raspberry Pi 4 version of SCIP is not yet optimized for the ARM architecture (the stock and spiked versions do use <code class="highlighter-rouge">-mcpu</code> and <code class="highlighter-rouge">-mtune</code> flags for the Cortex A72 architecture though), so there are likely more speed improvements to be gained. The table reports time in seconds as well as how many times the RPi 4 is slower than the MBP. While this is not a scientific or complete benchmark, it gives a pretty good idea. Effectively, we are talking about a 3-5 times multiple, which basically means that we are not changing categories: seconds remain seconds and minutes remain minutes, so that in actual applications there is not <em>that much</em> of a difference.</p>
<p>Also note that the numbers below are <em>single core</em> performance; however, the RPi 4 is a quad-core design, so there might be additional speedups to be gained.</p>
<table>
<thead>
<tr>
<th style="text-align: right">Instance</th>
<th style="text-align: right">sec (MBP)</th>
<th style="text-align: right">sec (stock)</th>
<th style="text-align: right">x (stock)</th>
<th style="text-align: right">sec (spiked)</th>
<th style="text-align: right">x (spiked)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">air05</td>
<td style="text-align: right">34.64</td>
<td style="text-align: right">184.32</td>
<td style="text-align: right">5.32</td>
<td style="text-align: right">143.21</td>
<td style="text-align: right">4.13</td>
</tr>
<tr>
<td style="text-align: right">beasleyC3</td>
<td style="text-align: right">20.68</td>
<td style="text-align: right">80.49</td>
<td style="text-align: right">3.89</td>
<td style="text-align: right">62.99</td>
<td style="text-align: right">3.05</td>
</tr>
<tr>
<td style="text-align: right">cbs-cta</td>
<td style="text-align: right">8.05</td>
<td style="text-align: right">40.55</td>
<td style="text-align: right">5.04</td>
<td style="text-align: right">34.27</td>
<td style="text-align: right">4.26</td>
</tr>
<tr>
<td style="text-align: right">pk1</td>
<td style="text-align: right">178.80</td>
<td style="text-align: right">645.73</td>
<td style="text-align: right">3.61</td>
<td style="text-align: right">500.54</td>
<td style="text-align: right">2.80</td>
</tr>
<tr>
<td style="text-align: right">pg</td>
<td style="text-align: right">17.47</td>
<td style="text-align: right">72.31</td>
<td style="text-align: right">4.14</td>
<td style="text-align: right">58.06</td>
<td style="text-align: right">3.32</td>
</tr>
<tr>
<td style="text-align: right">neos-1122047</td>
<td style="text-align: right">3.81</td>
<td style="text-align: right">12.78</td>
<td style="text-align: right">3.35</td>
<td style="text-align: right">10.83</td>
<td style="text-align: right">2.84</td>
</tr>
<tr>
<td style="text-align: right">timtab1</td>
<td style="text-align: right">59.83</td>
<td style="text-align: right">223.47</td>
<td style="text-align: right">3.74</td>
<td style="text-align: right">179.33</td>
<td style="text-align: right">2.99</td>
</tr>
<tr>
<td style="text-align: right">dano3_5</td>
<td style="text-align: right">206.12</td>
<td style="text-align: right">1275.86</td>
<td style="text-align: right">6.19</td>
<td style="text-align: right">1036.74</td>
<td style="text-align: right">5.03</td>
</tr>
<tr>
<td style="text-align: right">hypothyroid-k1</td>
<td style="text-align: right">19.80</td>
<td style="text-align: right">104.96</td>
<td style="text-align: right">5.30</td>
<td style="text-align: right">93.15</td>
<td style="text-align: right">4.70</td>
</tr>
<tr>
<td style="text-align: right">swath3</td>
<td style="text-align: right">279.40</td>
<td style="text-align: right">1185.95</td>
<td style="text-align: right">4.24</td>
<td style="text-align: right">1000.58</td>
<td style="text-align: right">3.58</td>
</tr>
<tr>
<td style="text-align: right">unitcal_7</td>
<td style="text-align: right">381.70</td>
<td style="text-align: right">1853.92</td>
<td style="text-align: right">4.86</td>
<td style="text-align: right">1489.52</td>
<td style="text-align: right">3.90</td>
</tr>
<tr>
<td style="text-align: right">CMS750_4</td>
<td style="text-align: right">922.35</td>
<td style="text-align: right">4346.98</td>
<td style="text-align: right">4.71</td>
<td style="text-align: right">3379.95</td>
<td style="text-align: right">3.66</td>
</tr>
<tr>
<td style="text-align: right">istanbul-no-cutoff</td>
<td style="text-align: right">98.07</td>
<td style="text-align: right">501.13</td>
<td style="text-align: right">5.11</td>
<td style="text-align: right">408.17</td>
<td style="text-align: right">4.16</td>
</tr>
</tbody>
</table>
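<p>As a rough aggregate of the table above (slowdown factors copied from the printed columns), the geometric means of the stock and spiked slowdowns come out around 4.5x and 3.7x, consistent with the 3-5 times claim:</p>

```python
from math import prod

# Slowdown factors of the RPi 4 vs. the MacBook Pro, copied from the table above.
stock  = [5.32, 3.89, 5.04, 3.61, 4.14, 3.35, 3.74, 6.19, 5.30, 4.24, 4.86, 4.71, 5.11]
spiked = [4.13, 3.05, 4.26, 2.80, 3.32, 2.84, 2.99, 5.03, 4.70, 3.58, 3.90, 3.66, 4.16]

def geomean(xs):
    return prod(xs) ** (1 / len(xs))

print(f"stock: {geomean(stock):.2f}x, spiked: {geomean(spiked):.2f}x")
```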
<h3 id="miplib-2017-benchmark-run">MIPLIB 2017 benchmark run</h3>
<p>The second test is a <a href="https://miplib.zib.de">MIPLIB 2017</a> benchmark run (single core). Given that it took a couple of days to complete, I only did the run for the spiked version.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>----------------------------+----------------+----------------+------+---------+-------+--------+
Name | Dual Bound | Primal Bound | Gap% | Nodes | Time | Status |
----------------------------+----------------+----------------+------+---------+-------+--------+
30n20b8 302 302 0.0 122 1225 ok
50v-10 3246.37511 3313.17999 2.1 79593 3601 stopped
academictimetablesmall 7.10542736e-15 1e+20 -- 250 3601 stopped
air05 26374 26374 0.0 418 144 ok
app1-1 -3 -3 0.0 3 27 ok
app1-2 -41 -41 0.0 24 2473 ok
assign1-5-8 199.527686 212 6.3 1096572 3601 stopped
atlanta-ip 83.0314859 91.0099752 9.6 533 3601 stopped
b1c1s1 21152.5997 25486.73 20.5 2475 3600 stopped
bab2 -358777.078 1e+20 -- 1 3607 stopped
bab6 -288604.802 1e+20 -- 1 3605 stopped
beasleyC3 754 754 0.0 6 64 ok
binkar10_1 6742.20002 6742.20002 0.0 3092 192 ok
blp-ar98 6168.05653 6457.70502 4.7 7103 3601 stopped
blp-ic98 4440.63685 4720.2617 6.3 15310 3602 stopped
bnatt400 1 1 0.0 5709 1218 ok
bnatt500 1e+20 1e+20 -- 20612 2854 ok
bppc4-08 52 54 3.8 114635 3602 stopped
brazil3 24 1e+20 -- 152 3600 stopped
buildingenergy 33246.2151 42652.3398 28.3 1 3606 stopped
cbs-cta 0 0 0.0 1 35 ok
chromaticindex1024-7 3 4 33.3 1 3730 stopped
chromaticindex512-7 3 4 33.3 44 3602 stopped
cmflsp50-24-8-8 54860290.3 1e+20 -- 4184 3601 stopped
CMS750_4 252 252 0.0 14526 3290 ok
co-100 1936126.27 10724685.1 453.9 1 3605 stopped
cod105 -18.2857143 -12 52.4 65 3601 stopped
comp07-2idx 6 823 Large 5 3601 stopped
comp21-2idx 41.0023016 295 619.5 129 3600 stopped
cost266-UUE 23987103.2 25148940.6 4.8 67004 3601 stopped
cryptanalysiskb128n5obj14 0 1e+20 -- 1 3602 stopped
cryptanalysiskb128n5obj16 0 1e+20 -- 1 3602 stopped
csched007 326.692847 353 8.1 89932 3601 stopped
csched008 171.342524 173 1.0 91573 3600 stopped
cvs16r128-89 -120.632745 -93 29.7 727 3601 stopped
dano3_3 576.344633 576.344633 0.0 19 587 ok
dano3_5 576.924916 576.924916 0.0 60 1011 ok
decomp2 -160 -160 0.0 1 13 ok
drayage-100-23 103333.874 103333.874 0.0 39 61 ok
drayage-25-23 101208.076 101282.647 0.1 93492 3600 stopped
dws008-01 20712.5472 59272.7628 186.2 7329 3600 stopped
eil33-2 934.007916 934.007916 0.0 717 337 ok
eilA101-2 805.920228 1240.02356 53.9 303 3604 stopped
enlight_hard 37 37 0.0 1 1 ok
ex10 100 100 0.0 1 2338 ok
ex9 81 81 0.0 1 158 ok
exp-1-500-5-5 65887 65887 0.0 1 9 ok
fast0507 174 174 0.0 528 571 ok
fastxgemm-n2r6s0t2 29 230 693.1 144518 3601 stopped
fhnw-binpack4-4 0 1e+20 -- 2772713 3600 stopped
fhnw-binpack4-48 0 1e+20 -- 461981 3600 stopped
fiball 138 140 1.4 10323 3601 stopped
gen-ip002 -4796.88522 -4783.73339 0.3 4003765 3603 stopped
gen-ip054 6803.0169 6840.96564 0.6 4873762 3606 stopped
germanrr 46795848.8 48328053.1 3.3 167 3601 stopped
gfd-schedulen180f7d50m30k18 1 1e+20 -- 1 3607 stopped
glass-sc 19.2345968 23 19.6 40518 3601 stopped
glass4 900005372 1.600014e+09 77.8 595600 3601 stopped
gmu-35-40 -2406903.88 -2406207.39 0.0 476087 3604 stopped
gmu-35-50 -2608070.29 -2607011 0.0 264255 3606 stopped
graph20-20-1rand -24.9879502 -9 177.6 605 3600 stopped
graphdraw-domain 17998.5455 19686 9.4 1549875 3601 stopped
h80x6320d 6382.09905 6382.09905 0.0 4 596 ok
highschool1-aigio 0 1e+20 -- 1 3607 stopped
hypothyroid-k1 -2851 -2851 0.0 1 94 ok
ic97_potential 3887 3948 1.6 1053664 3601 stopped
icir97_tension 6362 6386 0.4 370916 3601 stopped
irish-electricity 2934051.39 1e+20 -- 1 3602 stopped
irp 12159.4928 12159.4928 0.0 7 74 ok
istanbul-no-cutoff 204.081749 204.081749 0.0 229 396 ok
k1mushroom -1e+20 1e+20 -- 0 3618 stopped
lectsched-5-obj 15 47 213.3 4287 3601 stopped
leo1 400019518 410793643 2.7 17966 3601 stopped
leo2 393957837 418226155 6.2 6883 3601 stopped
lotsize 1466743.39 1514999 3.3 623 3601 stopped
mad 4.23272528e-15 0.028 -- 1758648 3602 stopped
map10 -522.229315 -495 5.5 1221 3604 stopped
map16715-04 -202.53321 -83 144.0 265 3605 stopped
markshare_4_0 1 1 0.0 2552263 864 ok
markshare2 0 17 -- 1942425 3604 stopped
mas74 11405.2315 11801.1857 3.5 2857955 3602 stopped
mas76 40005.0541 40005.0541 0.0 269818 444 ok
mc11 11689 11689 0.0 2247 372 ok
mcsched 211913 211913 0.0 11189 755 ok
mik-250-20-75-4 -52301 -52301 0.0 28546 189 ok
milo-v12-6-r2-40-1 277158.347 326481.143 17.8 11372 3601 stopped
momentum1 96365.4136 128477.452 33.3 4759 3600 stopped
mushroom-best 0.0174384818 0.0553337612 217.3 6318 3600 stopped
mzzv11 -21718 -21718 0.0 2233 1477 ok
mzzv42z -20540 -20540 0.0 261 621 ok
n2seq36q 52200 52200 0.0 2740 3431 ok
n3div36 125388.755 131000 4.5 12248 3603 stopped
n5-3 8105 8105 0.0 607 98 ok
neos-1122047 161 161 0.0 1 23 ok
neos-1171448 -309 -308 0.3 61 3601 stopped
neos-1171737 -195 -191 2.1 234 3600 stopped
neos-1354092 36 1e+20 -- 3 3601 stopped
neos-1445765 -17783 -17783 0.0 75 214 ok
neos-1456979 157.283966 204 29.7 3148 3601 stopped
neos-1582420 91 91 0.0 252 94 ok
neos-2075418-temuka 0 1e+20 -- 1 3614 stopped
neos-2657525-crna 0 1.810748 -- 715693 3602 stopped
neos-2746589-doon 1993.43007 1e+20 -- 881 3602 stopped
neos-2978193-inde -2.4017989 -2.38806169 0.6 38669 3600 stopped
neos-2987310-joes -607702988 -607702988 0.0 1 72 ok
neos-3004026-krka 0 0 0.0 2020 245 ok
neos-3024952-loue 26756 26756 0.0 56581 3459 ok
neos-3046615-murg 538.135067 1610 199.2 1551986 3606 stopped
neos-3083819-nubu 6307996 6307996 0.0 1138 47 ok
neos-3216931-puriri 59191.1268 1e+20 -- 202 3601 stopped
neos-3381206-awhea 453 453 0.0 1 4 ok
neos-3402294-bobin 1.11022302e-16 0.06725 -- 2765 3606 stopped
neos-3402454-bohle -1e+20 1e+20 -- 0 43 stopped
neos-3555904-turama -40.95 -33.2 23.3 3 3604 stopped
neos-3627168-kasai 988203.134 988585.62 0.0 462893 3600 stopped
neos-3656078-kumeu -18413.2 1e+20 -- 1 3601 stopped
neos-3754480-nidda -352051.784 13747.5367 -- 1952395 3602 stopped
neos-3988577-wolgan 119 1e+20 -- 13 3601 stopped
neos-4300652-rahue 0.128756061 5.2121 3948.0 103 3606 stopped
neos-4338804-snowy 1447 1477 2.1 554130 3601 stopped
neos-4387871-tavua 28.8473171 34.79894 20.6 1310 3600 stopped
neos-4413714-turia 45.370167 45.370167 0.0 2 2082 ok
neos-4532248-waihi 0.370420217 1e+20 -- 1 3616 stopped
neos-4647030-tutaki 27265.1927 27271.257 0.0 231 3609 stopped
neos-4722843-widden 25009.6634 25009.6634 0.0 2623 3368 ok
neos-4738912-atrato 283627957 283627957 0.0 46900 2055 ok
neos-4763324-toguru 1142.84659 2240.0651 96.0 2 3604 stopped
neos-4954672-berkel 2308705.23 2633312 14.1 94247 3601 stopped
neos-5049753-cuanza 550.216667 1e+20 -- 1 3607 stopped
neos-5052403-cygnet 179.500371 290 61.6 1 3609 stopped
neos-5093327-huahum 5192.21511 6506 25.3 2403 3602 stopped
neos-5104907-jarama 642.256923 1e+20 -- 1 3609 stopped
neos-5107597-kakapo 1864.20375 3690 97.9 171822 3600 stopped
neos-5114902-kasavu -1e+20 1e+20 -- 0 26 stopped
neos-5188808-nattai 0 0.110287132 -- 6443 3601 stopped
neos-5195221-niemur 0.000977767 0.003863653 295.2 14256 3602 stopped
neos-631710 0 215 -- 1 3606 stopped
neos-662469 184368.162 184544.5 0.1 4717 3600 stopped
neos-787933 30 30 0.0 1 7 ok
neos-827175 112.00152 112.00152 0.0 1 107 ok
neos-848589 2302.61937 2528.6184 9.8 3 3612 stopped
neos-860300 3201 3201 0.0 2 67 ok
neos-873061 105.645552 121.460195 15.0 1 3604 stopped
neos-911970 54.76 54.76 0.0 455585 2952 ok
neos-933966 318 4398 1283.0 8 3601 stopped
neos-950242 1.22222222 4 227.3 194 3600 stopped
neos-957323 -237.756681 -237.756681 0.0 1 295 ok
neos-960392 -238 0 -- 13 3601 stopped
neos17 0.150002577 0.150002577 0.0 19658 115 ok
neos5 15 15 0.0 1812079 1944 ok
neos8 -3719 -3719 0.0 1 9 ok
neos859080 1e+20 1e+20 -- 661 2 ok
net12 214 214 0.0 1334 3292 ok
netdiversion 237.111111 242 2.1 12 3604 stopped
nexp-150-20-8-5 230.447378 235 2.0 16 3601 stopped
ns1116954 0 1e+20 -- 1 3602 stopped
ns1208400 2 2 0.0 990 723 ok
ns1644855 -1524.33333 -1419.66667 7.4 1 3605 stopped
ns1760995 -1e+20 1e+20 -- 0 3662 stopped
ns1830653 20622 20622 0.0 6176 434 ok
ns1952667 0 0 0.0 1382 1164 ok
nu25-pr12 53905 53905 0.0 83 22 ok
nursesched-medium-hint03 75.8661003 8081 Large 1 3602 stopped
nursesched-sprint02 58 58 0.0 6 155 ok
nw04 16862 16862 0.0 7 105 ok
opm2-z10-s4 -45884.456 -29300 56.6 3 3602 stopped
p200x1188c 15078 15078 0.0 2 12 ok
peg-solitaire-a3 1 1e+20 -- 184 3600 stopped
pg -8674.34261 -8674.34261 0.0 559 59 ok
pg5_34 -14350.2009 -14338.0615 0.1 171265 3600 stopped
physiciansched3-3 2609077.71 1e+20 -- 6 3604 stopped
physiciansched6-2 49324 49324 0.0 142 727 ok
piperout-08 125055 125055 0.0 110 2050 ok
piperout-27 8124 8124 0.0 2 612 ok
pk1 11 11 0.0 406826 535 ok
proteindesign121hz512p9 0 1e+20 -- 0 920 stopped
proteindesign122trx11p8 -1e+20 1e+20 -- 0 1076 abort
qap10 340 340 0.0 2 361 ok
radiationm18-12-05 17565 17576 0.1 58171 3601 stopped
radiationm40-10-02 155321.712 256218 65.0 1127 3604 stopped
rail01 -92.0873 1e+20 -- 1 3603 stopped
rail02 -6350.94197 1e+20 -- 1 3605 stopped
rail507 174 174 0.0 682 796 ok
ran14x18-disj-8 3650.00897 3734.99999 2.3 264526 3601 stopped
rd-rplusc-21 100 171887.288 Large 24296 3603 stopped
reblock115 -36934261.8 -36800603.2 0.4 104186 3601 stopped
rmatr100-p10 423 423 0.0 777 689 ok
rmatr200-p5 3291.08079 4706 43.0 1 3602 stopped
rocI-4-11 -6020203 -6020203 0.0 11774 387 ok
rocII-5-11 -11.811922 -5.65497492 108.9 4983 3601 stopped
rococoB10-011000 16178.1679 20170 24.7 10307 3600 stopped
rococoC10-001000 11460 11460 0.0 49162 2307 ok
roi2alpha3n4 -69.497701 -63.2084921 9.9 5167 3603 stopped
roi5alpha10n8 -72.6816859 -42.3653204 71.6 314 3609 stopped
roll3000 12890 12890 0.0 3147 181 ok
s100 -1e+20 1e+20 -- 0 101 stopped
s250r10 -0.172620061 -0.1717256 0.5 2 3609 stopped
satellites2-40 -29 49 -- 1 3653 stopped
satellites2-60-fs -29 28 -- 1 3628 stopped
savsched1 -801160.3 31846.3 -- 1 3614 stopped
sct2 -231.063567 -230.989162 0.0 90160 3600 stopped
seymour 417.932149 423 1.2 25264 3601 stopped
seymour1 410.763701 410.763701 0.0 1203 309 ok
sing326 7740242.84 7815051.11 1.0 106 3602 stopped
sing44 8110365.4 8336047.42 2.8 36 3602 stopped
snp-02-004-104 586784451 586903510 0.0 104 3607 stopped
sorrell3 -20.6893407 -12 72.4 1 3603 stopped
sp150x300d 69 69 0.0 121 2 ok
sp97ar 657199509 691953855 5.3 3591 3601 stopped
sp98ar 528030041 530117051 0.4 6957 3601 stopped
splice1k1 -1645.76473 -121 1260.1 1 3618 stopped
square41 8.87035973 51 474.9 3 3627 stopped
square47 -1e+20 1e+20 -- 0 81 stopped
supportcase10 0 18 -- 1 3603 stopped
supportcase12 -82461.0638 0 -- 1 34 stopped
supportcase18 47.1866667 50 6.0 9632 3600 stopped
supportcase19 -1e+20 1e+20 -- 0 29 stopped
supportcase22 0 1e+20 -- 2 3606 stopped
supportcase26 1521.82348 1755.84518 15.4 843082 3602 stopped
supportcase33 -359.701149 -340 5.8 6947 3602 stopped
supportcase40 23849.6271 24422.9041 2.4 7365 3600 stopped
supportcase42 7.75103344 8.02972774 3.6 35452 3602 stopped
supportcase6 45997.8089 51937.682 12.9 127 3605 stopped
supportcase7 -1132.22317 -1132.22317 0.0 159 739 ok
swath1 379.071296 379.071296 0.0 362 65 ok
swath3 397.761344 397.761344 0.0 39302 966 ok
tbfp-network 23.3340112 28.1166667 20.5 33 3646 stopped
thor50dday 32001.5993 58369 82.4 1 3604 stopped
timtab1 764772 764772 0.0 43393 182 ok
tr12-30 130529.549 130596 0.1 568241 3601 stopped
traininstance2 0 79160 -- 2970 3601 stopped
traininstance6 2072 29130 1305.9 17661 3600 stopped
trento1 5183779.47 5534271 6.8 2305 3601 stopped
triptim1 22.8680875 22.8681 0.0 2 3601 stopped
uccase12 11507.3721 11507.4051 0.0 1866 3603 stopped
uccase9 10881.1167 11691.6074 7.4 65 3602 stopped
uct-subprob 300.483452 314 4.5 30374 3600 stopped
unitcal_7 19635558.2 19635558.2 0.0 374 1448 ok
var-smallemery-m6j6 -152.559657 -149.375 2.1 45835 3602 stopped
wachplan -9 -8 12.5 67432 3600 stopped
----------------------------+----------------+----------------+------+---------+-------+--------+
solved/stopped/failed: 79/161/0
@03 MIPLIB script version
@02 timelimit: 3600
@01 SCIP(6.0.2)spx(4.0.2)
</code></pre></div></div>
<h2 id="nerd-corner-additional-rpi-4-benchmarks">Nerd corner: additional RPi 4 benchmarks</h2>
<p>For the curious and for comparison to the stock model, I also ran the <a href="https://github.com/aikoncwd/rpi-benchmark">RPi Benchmark script</a> and a <a href="https://people.sc.fsu.edu/~jburkardt/c_src/linpack_bench/linpack_bench.html">Linpack benchmark</a> on the spiked RPi4.</p>
<h3 id="rpi-benchmark">RPi Benchmark</h3>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Raspberry Pi Benchmark Test
Author: AikonCWD
Version: 3.0
temp=46.0'C
arm_freq=2000
gpu_freq=600
gpu_freq_min=500
sd_clock=50.000 MHz
Running InternetSpeed test...
Ping: 12.172 ms
Download: 31.11 Mbit/s
Upload: 3.76 Mbit/s
Running CPU test...
total time: 6.5163s
min: 2.53ms
avg: 2.61ms
max: 12.92ms
temp=58.0'C
Running THREADS test...
total time: 11.2985s
min: 4.02ms
avg: 4.52ms
max: 42.92ms
temp=62.0'C
Running MEMORY test...
Operations performed: 3145728 (1930915.11 ops/sec)
3072.00 MB transferred (1885.66 MB/sec)
total time: 1.6291s
min: 0.00ms
avg: 0.00ms
max: 11.62ms
temp=63.0'C
Running HDPARM test...
Timing buffered disk reads: 130 MB in 3.01 seconds = 43.19 MB/sec
temp=51.0'C
Running DD WRITE test...
536870912 bytes (537 MB, 512 MiB) copied, 13.9169 s, 38.6 MB/s
temp=48.0'C
Running DD READ test...
536870912 bytes (537 MB, 512 MiB) copied, 11.9338 s, 45.0 MB/s
temp=48.0'C
AikonCWD's rpi-benchmark completed!
</code></pre></div></div>
<h3 id="linpack-benchmark">Linpack benchmark</h3>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>28 September 2019 01:57:50 AM
LINPACK_BENCH
C version
The LINPACK benchmark.
Language: C
Datatype: Double precision real
Matrix order N = 1000
Leading matrix dimension LDA = 1001
Norm. Resid Resid MACHEP X[1] X[N]
6.491510 0.000000 2.220446e-16 1.000000 1.000000
Factor Solve Total MFLOPS Unit Cray-Ratio
2.530031 0.002854 2.532885 263.994088 0.007576 45.230089
LINPACK_BENCH
Normal end of execution.
28 September 2019 01:57:53 AM
</code></pre></div></div>Sebastian PokuttaTL;DR: Running SCIP on a Raspberry Pi 4 with relatively moderate performance losses (compared to a standard machine) of a factor of 3-5 brings Integer Programming into the realm of Edge Computing.Universal Portfolios: how to (not) get rich2019-08-29T01:00:00+02:002019-08-29T01:00:00+02:00http://www.pokutta.com/blog/research/2019/08/29/universalPortfolios<p><em>TL;DR: How to (not) get rich? Running Universal Portfolios online with Online Convex Optimization techniques.</em>
<!--more--></p>
<p><strong>The following does not constitute any investment advice.</strong></p>
<p>In 1956, Kelly Jr., then at Bell Labs, wrote a groundbreaking paper [K] that provided a strong link between Shannon’s newly proposed information theory [S] and gambling (Latane [L] came to similar conclusions around the same time, though published later). Without going too much into detail here—the casual reader is referred to Poundstone’s popular science account [P] and the more technically inclined reader to [K] and [L]; also make sure to check out [C]—what Kelly showed in the context of sequential betting (his example was horse races) is that the growth of the bettor’s bankroll is upper bounded by $2^r$, where $r$ is the information rate of the <em>private wire</em> of the bettor, i.e., information that only the bettor is privy to, the <em>information advantage</em>. Moreover, Kelly proposed an optimal strategy, nowadays called the <em>Kelly Strategy</em> or <em>Kelly Betting</em>, that achieves this growth rate; moreover, for any substantially different strategy, the ratio of that strategy’s return to the return of the Kelly strategy goes to $0$ in the limit.</p>
<h3 id="the-kelly-criterion">The Kelly Criterion</h3>
<p>Given sequential betting opportunities, e.g., in a casino, the <em>key question</em> is how much of the current bankroll $V$ to invest. What Kelly showed is that the <em>optimal fraction</em> $f^\esx$ to wager is given by:</p>
\[\tag{KellyFraction}
f^\esx \doteq \frac{bp - q}{b},\]
<p>where $b$ is the net odds (you get $b$ dollars back for each invested dollar on top of the dollar returned to you), $p$ is the probability of winning and $q=1-p$ is the probability of losing. From this formula we can immediately see that we wager a fraction of $0$ if and only if $bp = 1 - p$ or equivalently $p = \frac{1}{b+1}$, i.e., our probability of winning is net flat against the odds, i.e., we have no informational advantage.</p>
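<p>The formula is straightforward to evaluate; the following small sketch (the function name is mine, not from any library) confirms that the wager vanishes exactly when $p = \frac{1}{b+1}$:</p>

```python
def kelly_fraction(b: float, p: float) -> float:
    """Kelly fraction f* = (b*p - q) / b for net odds b and win probability p."""
    q = 1.0 - p
    return (b * p - q) / b

# Even-money bet (b = 1) with a 60% win probability: wager 20% of the bankroll.
print(round(kelly_fraction(1.0, 0.6), 10))          # 0.2

# No informational advantage: p = 1/(b+1) gives f* = 0 for any odds b.
print(abs(kelly_fraction(2.0, 1.0 / 3.0)) < 1e-12)  # True
```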
<p>However, rather than dwelling on the interpretation as well as the optimality of the above, it is instructive to understand where this formula originates from. In fact, the Kelly fraction naturally arises as the fraction that maximizes the geometric growth of our bankroll under sequential betting, or equivalently <em>maximizes the expectation of the logarithm of our (terminal) wealth</em>. The rationale behind this objective is as follows. Suppose we have an initial bankroll of $X_0$ and at each time step $t$ we bet a given fraction $f$ of our bankroll $X_t$. Let us make the simplifying assumption that we have even-wager bets, i.e., if we win we get our stake $f X_t$ back plus another $f X_t$ in winnings, and if we lose we lose $f X_t$. As such, our bankroll $X_t$ evolves as</p>
\[X_n = X_0 (1+f)^S (1-f)^F,\]
<p>where $S$ is the number of successes and $F$ the number of failures, so that in particular $S + F = n$. From this we obtain an <em>average growth (per bet)</em> of</p>
\[\left( \frac{X_n}{X_0} \right)^{1/n} = (1+f)^{S/n} (1-f)^{F/n},\]
<p>where $S/n \rightarrow p$ the <em>probability of success</em> and $F/n \rightarrow 1-p$ the <em>probability of failure</em>. The term on the left is the <em>growth rate per bet on average</em> and we want to find an $f$ that maximizes this quantity via the right-hand side. In log-world this is equivalent to</p>
\[\frac{1}{n} (\log X_n - \log X_0) = \frac{S}{n} \log (1 + f) + \frac{F}{n} \log (1-f),\]
<p>or in the limit for $n$ large we obtain:</p>
\[\tag{maxExpLog}
\mathbb E[g(f)] = p \log (1 + f) + (1-p) \log (1-f),\]
<p>where $g$ is the <em>expected growth rate per bet when betting fraction $f$</em>, which is now independent of $n$. Note that it can also easily be seen that betting a fixed fraction $f$ (i.e., one independent of time) is sufficient here, provided that the individual bets are independent and we care about (maximizing) the expected growth rate.</p>
<p>Now we can simply maximize the right-hand side by computing a critical point:</p>
\[0 = p \frac{1}{1+f} - (1-p) \frac{1}{1-f} \Rightarrow f^\esx = 2p -1,\]
<p>which is the Kelly fraction for this simple case. Again, if the probability of winning is exactly $p = 0.5$, i.e., we do not have any information advantage, then the optimal fraction is $f^\esx = 0$. Observe that as soon as $p \neq 0.5$ we have some form of informational advantage and the Kelly fraction $f^\esx \neq 0$: if it is positive it is beneficial to bet on the outcome (going long on the bet) and if it is negative it is beneficial to bet on the inverse of the outcome (going short). The latter is not always possible in traditional betting, but it is in investing, where we can short.</p>
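<p>The critical-point computation can be sanity-checked by maximizing (maxExpLog) over a fine grid of fractions (a throwaway sketch of mine, not from the post itself):</p>

```python
import math

def growth(f: float, p: float) -> float:
    """Expected log growth per even-wager bet at fraction f, win probability p."""
    return p * math.log(1 + f) + (1 - p) * math.log(1 - f)

p = 0.6
grid = [i / 10000 for i in range(9999)]        # fractions in [0, 0.9998]
f_best = max(grid, key=lambda f: growth(f, p))
print(f_best)                                  # 0.2, i.e., 2p - 1
```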
<p>Following the same logic as above, the general formula can be derived by maximizing the <em>expected logarithmic terminal wealth</em> $\mathbb E \log W_T(f)$, where $W_T(f)$ is the <em>(terminal) wealth</em> at time $T$ provided we bet a fixed fraction $f$. We can also see from the derivation that the limit outcome is very sensitive to overestimations of $f^\esx$: if our estimation of $p$ has some error (which it usually has), leading to an estimate $\hat f > f^\esx$, the growth rate degrades, and once $\hat f$ roughly exceeds twice the Kelly fraction the growth rate turns negative and ruin in the limit is guaranteed. Therefore, in practice people have devised various strategies to combat overbetting, such as <em>half Kelly</em>, where only $\frac{1}{2}\hat f$ is invested. This significantly cuts down the risk of overbetting while still providing roughly $3/4$ of the expected growth rate; see <a href="https://blogs.cfainstitute.org/investor/2018/06/14/the-kelly-criterion-you-dont-know-the-half-of-it/">here</a>, <a href="https://www.pinnacle.com/en/betting-articles/Betting-Strategy/fractional-kelly-criterion/GBD27Z9NLJVGFLGG">here</a>, or <a href="https://www.bettingexpert.com/en-au/learn/successful-betting/the-kelly-criterion#gref">here</a> for some of many online discussions. Also, it seems that in practical betting the Kelly criterion tends to work quite well (see, e.g., [T, T3] or [P]), though this is beyond the scope of this post.</p>
<p><strong>Example: biased even wager bet.</strong> To understand a bit better what the expected growth rates etc are, consider the even wager setup from above. However, this time let us suppose that our <em>informational advantage</em> is $\varepsilon$, i.e., $p = \frac{1}{2}+\varepsilon$. Then the optimal growth rate $r^\esx$ obtained via $f^\esx$ is given by</p>
\[r^\esx \approx 2 \varepsilon^2\]
<p>and the number of bets required to double our bankroll is roughly $0.35 \varepsilon^{-2}$. For actual biases this looks roughly as follows, where we list the number of bets needed to reach 2x, 10x, and $5\%$ growth:</p>
<table>
<thead>
<tr>
<th style="text-align: right">Advantage $\varepsilon$</th>
<th style="text-align: right">$r^\esx$</th>
<th style="text-align: right">2x ($0.35 \varepsilon^{-2}$)</th>
<th style="text-align: right">10x ($1.15\varepsilon^{-2}$)</th>
<th style="text-align: right">$5\%$</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">$40\%$</td>
<td style="text-align: right">$36.80642\%$</td>
<td style="text-align: right">$1.88$</td>
<td style="text-align: right">$6.26$</td>
<td style="text-align: right">$0.14$</td>
</tr>
<tr>
<td style="text-align: right">$10\%$</td>
<td style="text-align: right">$2.01355\%$</td>
<td style="text-align: right">$34.42$</td>
<td style="text-align: right">$114.35$</td>
<td style="text-align: right">$2.48$</td>
</tr>
<tr>
<td style="text-align: right">$2.5\%$</td>
<td style="text-align: right">$0.12505\%$</td>
<td style="text-align: right">$554.29$</td>
<td style="text-align: right">$1,841.30$</td>
<td style="text-align: right">$39.98$</td>
</tr>
<tr>
<td style="text-align: right">$0.63\%$</td>
<td style="text-align: right">$0.00781\%$</td>
<td style="text-align: right">$8,872.05$</td>
<td style="text-align: right">$29,472.32$</td>
<td style="text-align: right">$639.98$</td>
</tr>
<tr>
<td style="text-align: right">$0.08\%$</td>
<td style="text-align: right">$0.00012\%$</td>
<td style="text-align: right">$567,825.94$</td>
<td style="text-align: right">$1,886,276.94$</td>
<td style="text-align: right">$40,959.98$</td>
</tr>
</tbody>
</table>
<p>Put differently, we need to have a significant informational advantage to really turn this into any reasonable return in <em>reasonable time</em>, or we need to be in an environment where we can make a large number of bets. Keep this in mind, we will be revisiting this later. To put things into perspective, the edges in “professional gambling” settings typically hover around $1\%$, so you need to put in some <em>real work</em>, but it is not unrealistic to achieve a decent growth rate.</p>
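<p>The table entries can be reproduced from the exact growth rate rather than the approximations in the column headers; a quick sketch (my own naming, and note that the $5\%$ column in the table matches a target log-growth of $0.05$, i.e., a factor of $e^{0.05}$):</p>

```python
import math

def bets_needed(eps: float, log_target: float) -> float:
    """Number of even-wager Kelly bets to accumulate `log_target` of log-growth,
    given an informational advantage eps (i.e., p = 1/2 + eps)."""
    p = 0.5 + eps
    f = 2 * p - 1                                        # Kelly fraction
    r = p * math.log(1 + f) + (1 - p) * math.log(1 - f)  # exact growth rate
    return log_target / r

print(round(bets_needed(0.10, math.log(2)), 2))    # 34.42 bets to double
print(round(bets_needed(0.10, math.log(10)), 2))   # 114.35 bets to 10x
print(round(bets_needed(0.025, 0.05), 2))          # 39.98 bets for 5% log-growth
```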
<h2 id="log-optimal-portfolios-and-constant-rebalancing">Log-optimal Portfolios and Constant Rebalancing</h2>
<p>Going from betting to investing is quite natural actually. After all, we can think of <em>investing</em> also as a form of sequential betting, though now:</p>
<ol>
<li>the bets do not necessarily have a fixed time of outcome (we can decide when we exit a position etc);</li>
<li>the odds / probability of success is not easily available (not saying it is easy in betting…);</li>
<li>the payoff of each “bet” can differ significantly.</li>
</ol>
<p>Let us ignore those (significant) issues for the moment and focus on the transfer of the methodology first. In fact, it turns out that the setup that Kelly considered naturally generalizes to investing in, e.g., equities.</p>
<p>Suppose we have $n$ assets, with random return vector $x \in \RR^n$, where the $x_i$ are of the form $x_i = \frac{p_i(\text{new})}{p_i(\text{old})}$, i.e., <em>relative price changes</em>. In the spirit of Kelly’s approach, we allocate fractions</p>
\[f \in \Delta(n) \doteq \setb{f \in \RR^n \mid \sum_i f_i = 1, f \geq 0},\]
<p>across these assets, so that the <em>logarithmic growth rate</em> (similar to above) is given as $\log f^\intercal x$. Now, expressed in a sequential investing/betting fashion, we would have the relative price change realizations $x_t$ at time $t$ and allocations $f_t$, so that the <em>logarithmic portfolio growth</em> over time is given by</p>
\[\sum_{t = 1}^T \log f_t^T x_t,\]
<p>which is identical to the <em>logarithmic terminal wealth</em> $\log W_T(f_1,\dots, f_T)$ as the return relatives are additive in log-space. Equivalently the <em>average logarithmic growth rate</em> is given by:</p>
\[\tag{avgLogGrowth}
\frac{1}{T} \sum_{t = 1}^T \log f_t^T x_t,\]
<p>which is the (somewhat natural) generalization of (maxExpLog), though now we spread capital across multiple assets.</p>
<p>A <em>Constant Rebalancing Portfolio (CRP)</em> is one where the allocations $f_t = f$ are constant over time, with the implicit assumption(!!) that the return distribution is somewhat stationary, so that from a sequential decision perspective we maximize the <em>expected logarithmic growth rate</em> by picking the expectation maximizer in each step, assuming(!!) i.i.d. returns. In short, the CRP is in some sense the natural generalization of the Kelly bet from above. The term <em>constant rebalancing</em> arises from the fact that in each time step we <em>rebalance</em> the portfolio to represent the <em>constant</em> allocation $f$ across assets; note that this rebalancing usually incurs transaction costs.</p>
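<p>To make the rebalancing mechanics concrete, here is a minimal toy sketch (my own naming) of how a CRP compounds wealth over a sequence of relative price changes:</p>

```python
def crp_wealth(f, price_relatives):
    """Terminal wealth of a constant rebalancing portfolio with weights f,
    starting from wealth 1; each entry of price_relatives is one period's x_t."""
    wealth = 1.0
    for x in price_relatives:
        wealth *= sum(fi * xi for fi, xi in zip(f, x))  # rebalance back to f
    return wealth

# Two assets: a volatile one alternating +20% / -16.7% and cash (relative change 1).
periods = [(1.2, 1.0), (1.0 / 1.2, 1.0)] * 50
print(crp_wealth((1.0, 0.0), periods))  # buy-and-hold: back to ~1 (zero growth)
print(crp_wealth((0.5, 0.5), periods))  # 50/50 CRP: ~1.51, despite zero-growth assets
```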
<p>A natural and important question to ask is <em>why</em> one would want to consider CRPs at all and why one would prefer CRPs over other strategies. One advantage is that, apart from being the “Kelly bet for investing”, they turn volatility into return. Moreover, Cover [C2] showed that the best CRP provides returns at least as good as (1) buying and holding any particular stock, (2) the average of the returns of all stocks, and (3) the geometric mean of all stocks. We will briefly discuss below how CRPs can naturally turn volatility into return. We describe the argument here for the Kelly betting case (which can be considered a two-asset CRP after appropriate rewriting) for the sake of exposition, but the general case follows similarly.</p>
<p><strong>Observation: turning volatility into return.</strong> One of the arguments for CRPs is that they can even generate a return when the log geometric mean of the random variable we are investing in is $0$. This observation is due to Shannon and is simply the <a href="https://en.wikipedia.org/wiki/Inequality_of_arithmetic_and_geometric_means">AM/GM inequality</a> (or concavity of the logarithm) in action. First let us understand what this means: suppose a random variable $X$ has geometric mean $1$ (or log geometric mean $0$), e.g., a fair coin whose payout is such that our new bankroll is $1+r$ times the old one when $X = 1$ and $1/(1+r)$ times the old one when $X = 0$. Then a buy-and-hold strategy would not yield any return in expectation. Note, however, that the <em>arithmetic mean</em> of the return is roughly $r^2/2$ in this case (via Taylor approximation). The Kelly strategy allows us to profit from this positive arithmetic mean; in fact, setting $f=1/2$ (which equals $f^\esx$ if the geometric mean is exactly $1$) guarantees a positive rate. Using $\log (1+x) \approx x - \frac{x^2}{2}+\frac{x^3}{3}$ we obtain for small $r$:</p>
\[\begin{align*}
\frac{1}{2}\cdot\log(1+f\cdot r)+\frac{1}{2}\cdot\log(1-f+\frac{f}{1+r}) & \approx\frac{r^{2}-r^{3}}{8}.
\end{align*}\]
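<p>The quality of this approximation is easy to check numerically; the following throwaway sketch (my own naming) compares the exact expected log growth at $f=1/2$ with $(r^2-r^3)/8$:</p>

```python
import math

def half_kelly_growth(r: float) -> float:
    """Exact expected log growth of betting f = 1/2 on the fair-geometric-mean
    coin: bankroll factor 1 + r/2 on heads, 1/2 + (1/2)/(1+r) on tails."""
    return 0.5 * math.log(1 + r / 2) + 0.5 * math.log(0.5 + 0.5 / (1 + r))

for r in (0.01, 0.05, 0.1):
    # The two columns agree to leading order; the error is O(r^4).
    print(r, half_kelly_growth(r), (r**2 - r**3) / 8)
```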
<p>More generally, suppose that we have a discrete one-dimensional random variable $1+X$ (this is our bet) with outcomes $1+x_i \in \RR_+$ with probability $p_i$, i.e., the $x_i$ correspond to the returns, and assume that the log geometric mean is at least $0$:</p>
\[\sum_{i}p_{i}\log\left(1+x_{i}\right)\geq 0\]
<p>Betting a fraction $f$ leads to the expected growth function:</p>
\[r(f) \doteq \sum_{i}p_{i}\log\left(1+ f x_{i}\right)\]
<p>Observe that $r(f)$ is strictly concave in $f$ in the interval $f \in [0,1]$ with $r(0)=0$ and $r(1) \geq 0$ (here we use that the log geometric mean is at least $0$). Thus by concavity (or <a href="https://en.wikipedia.org/wiki/Jensen%27s_inequality">Jensen’s inequality</a>; all the same) it follows:</p>
\[r(f) = r(1 \cdot f + 0 \cdot (1-f)) > f \cdot r(1) + (1-f) \cdot r(0) \geq 0.\]
<p>As such, betting a fraction of $f=1/2$ is always safe as long as the geometric mean is at least $1$. However, for completeness, note that betting a fixed fraction of $1/2$ can be quite suboptimal: say the random variable actually has a geometric mean $r > 1$ (in particular, no ruin events); then investing a $1/2$-fraction will lead to suboptimal growth. In fact, in some cases the optimal Kelly fraction $f^\esx$ can actually be larger than $1$, in which case we would leverage the bets.</p>
<p>In the case of $n$ assets, the generalization of the above simple strategy is to allocate a $1/n$-fraction of capital to each of the $n$ assets. For further comparisons of this simple $1/n$-CRP to other strategies and further properties, see [DGU]. Finally, before continuing with universal portfolios, let me add that <em>dutch booking</em> (i.e., locking in a guaranteed profit through arbitrage and crossing bets) is not necessarily growth optimal as can be seen with a similar argument: we can gain extra return from allowing volatility in the portfolio returns. Put differently, if the payout function is convex, we benefit from randomness.</p>
<h3 id="universal-portfolios">Universal Portfolios</h3>
<p>Once the notion of Constant Rebalancing Portfolios is defined, given a set of assets, a natural question is whether one can estimate or compute the optimal allocation vector $f$ given either distributional assumptions about the returns or actual data. The natural second-order question is then, if so, whether we can do it in an online fashion, <em>while</em> we are investing. In its most basic form: Given relative price change vectors $x_1, \dots, x_T$, we would like to solve:</p>
\[\tag{staticLogOpt}
\max_{f \in \RR^n} \frac{1}{T} \sum_{t = 1}^T \log f^T x_t.\]
<p>While this optimization problem can easily be solved with convex optimization methods, provided the price change vectors $x_1, \dots, x_T$, this is not very helpful as the past is usually not a great predictor for the future.</p>
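<p>For illustration, (staticLogOpt) over the simplex can be solved in a few lines with exponentiated gradient ascent (a toy sketch under naming of my own choosing; any convex optimization method would do):</p>

```python
import math

def best_crp(xs, steps=2000, eta=0.2):
    """Maximize (1/T) * sum_t log(f . x_t) over the simplex via exponentiated
    gradient ascent; xs is a list of relative-price-change tuples."""
    n, T = len(xs[0]), len(xs)
    f = [1.0 / n] * n
    for _ in range(steps):
        # Average gradient of the log-wealth objective: (1/T) sum_t x_{t,i} / (f . x_t).
        g = [sum(x[i] / sum(fj * xj for fj, xj in zip(f, x)) for x in xs) / T
             for i in range(n)]
        w = [fi * math.exp(eta * gi) for fi, gi in zip(f, g)]
        s = sum(w)
        f = [wi / s for wi in w]
    return f

# Asset 1 alternates x1.5 and x0.6, asset 2 is cash; solving the first-order
# condition by hand gives the log-optimal CRP f = (1/4, 3/4) for this sequence.
xs = [(1.5, 1.0), (0.6, 1.0)] * 10
print([round(fi, 3) for fi in best_crp(xs)])   # [0.25, 0.75]
```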
<p>Motivated by the strong connection to information theory, Cover [C2] realized that one can define <em>Universal Portfolios</em> that are growth optimal for unknown returns, following a line of reasoning analogous to the one of Kolmogorov, Lempel, and Ziv for <em>universal coding</em>, where an (asymptotically) optimal (source) code can be constructed without knowing the source’s statistics. That is, the algorithm constructs a portfolio over time so that, as $T \rightarrow \infty$, for any sequence of relative price changes $x_1,\dots, x_T$,</p>
\[\tag{universalPortfolio}
\frac{1}{T} \sum_{t = 1}^T \log f_t^T x_t \rightarrow \max_{f \in \RR^n} \frac{1}{T} \sum_{t = 1}^T \log f^T x_t,\]
<p>where the $f_t$ are dynamic allocations. However, grossly simplifying, the proposed algorithm spreads the capital across an exponential number of sequences arising from binary strings of length $T$ (assuming the most basic case based on a binomial model), making it impractical both computationally and in the real world, where one would probably be bankrupted by transaction costs. Nonetheless, Cover’s work [C1], [C2] inspired a lot of related work addressing various aspects (see, e.g., [CO], [BK], [T], [TK], [MTZZ]); in particular, [KV] provided a theoretically polynomial-time implementable variant of Cover’s universal portfolios.</p>
<h3 id="an-application-of-online-mirror-descent-et-al">An application of Online Mirror Descent et al</h3>
<p>It did not take long, given the suggestive form of (universalPortfolio), until the link to online convex optimization and regret minimization was observed (see, e.g., [HSSW] for one of the earlier references). In fact, the online search for a universal portfolio can be easily cast as a regret minimization problem: find a strategy of <em>dynamic allocations</em> $f_t$ so that</p>
\[\tag{univPortRegret}
\max_{f \in \RR^n} \frac{1}{T} \sum_{t = 1}^T \log f^T x_t - \frac{1}{T} \sum_{t = 1}^T \log f_t^T x_t\leq R(T)/T,\]
<p>where $R(T)$ is the <em>regret</em> achieved after $T$ time steps (or bets, depending on the point of view). Given that our function $\log f_t^T x_t$ is concave in $f_t$ and we want to maximize it (or, equivalently, minimize $-\log f_t^T x_t$), we can apply the <em>online convex optimization framework</em> to obtain no-regret algorithms (see <a href="/blog/research/2019/02/27/cheatsheet-nonsmooth.html">Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning</a> for some background information on the actual algorithms); for the expert in online convex optimization: we will require some assumptions on the feasible region of $f$ later, to avoid complications that arise in the unbounded case.</p>
<h4 id="online-gradient-descent">Online Gradient Descent</h4>
<p>In its most basic form we can simply run Zinkevich’s [Z] <em>Online (sub-)Gradient Descent (OGD)</em> for a given time horizon $T$ and feasible region $P$. Here, we update according to</p>
\[f_{t+1} \leftarrow \arg\min_{f \in P} \eta_t \nabla_f(\log f_t^T x_t)^Tf + \frac{1}{2}\norm{f-f_t}^2\]
<p>with the step-size choice $\eta_t = \eta = \sqrt{\frac{2M}{G_2^2T}}$, where $\norm{\nabla_f(\log f_t^T x_t)}_2 \leq G_2$ is an upper bound on the $2$-norm of the gradients and $M$ is an upper bound on the $2$-norm diameter of the feasible region $P$. This provides a guarantee of the form:</p>
\[\tag{univPortOGDGen}
\max_{f \in P} \frac{1}{T} \sum_{t = 1}^T \log f^T x_t - \frac{1}{T} \sum_{t = 1}^T \log f_t^T x_t\leq \frac{G_2M}{\sqrt{T}},\]
<p>Note that $P$ can be used to model or capture various allocation constraints, e.g., if $P = \Delta(n)$, the probability simplex in dimension $n$, then we essentially assume that we cannot short and cannot take leverage; this would be the traditional unleveraged long-only investor setting. However, many other choices are possible and we can also easily include limits on assets and segment exposures in $P$; the projection operation might get slightly more involved but that is about it. It is important to observe though that while the bound in (univPortOGDGen) looks independent of the number of assets $n$, in fact $G_2$ will typically scale as $\sqrt{n}$. For $P = \Delta(n)$, which we will assume here for simplicity and also to be in line with the original universal portfolio model, the bound (univPortOGDGen) becomes:</p>
\[\tag{univPortOGD}
\max_{f \in P} \frac{1}{T} \sum_{t = 1}^T \log f^T x_t - \frac{1}{T} \sum_{t = 1}^T \log f_t^T x_t\leq \sqrt{\frac{2G_2^2}{T}}.\]
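As a concrete illustration, here is a minimal Python/NumPy sketch of this scheme (function names, the projection routine, and the toy data are mine, not from the post): the $\arg\min$ in the update amounts to a gradient step on the loss $-\log f_t^T x_t$, followed by a Euclidean projection onto the feasible region, here $P = \Delta(n)$.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    (the standard sort-based algorithm)."""
    n = v.size
    u = np.sort(v)[::-1]                 # sort in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, n + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def ogd_universal_portfolio(X, eta):
    """Run OGD on the losses -log(f^T x_t); X has one row of gross
    returns x_t per time step. Returns the sequence of allocations."""
    T, n = X.shape
    f = np.full(n, 1.0 / n)              # start at the uniform portfolio
    allocations = []
    for x in X:
        allocations.append(f.copy())
        grad = x / float(f @ x)          # gradient of log(f^T x) at f
        f = project_simplex(f + eta * grad)  # ascent step + projection
    return np.array(allocations)
```

Every iterate stays a valid allocation (nonnegative, summing to one); only the step size $\eta$ has to be tuned as in (univPortOGD).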
<h4 id="online-mirror-descent">Online Mirror Descent</h4>
<p>A more general approach with more freedom to customize the regret bound is <em>Online Mirror Descent</em> (OMD), which arises as a natural generalization of <em>Mirror Descent</em> [NY] (by simply cutting the original MD proof short; see <a href="/blog/research/2019/02/27/cheatsheet-nonsmooth.html">Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning</a>). Rather than stating the most general version, we consider the case $P = \Delta(n)$, where OMD with the appropriate distance-generating function and Bregman divergence becomes the <em>Multiplicative Weight Update</em> method (see [AHK] for a survey on MWU), and we obtain the regret bound:</p>
\[\tag{univPortOMD}
\max_{f \in P} \frac{1}{T} \sum_{t = 1}^T \log f^T x_t - \frac{1}{T} \sum_{t = 1}^T \log f_t^T x_t \leq \sqrt{\frac{2G_\infty^2\log n}{T}},\]
<p>where \(\norm{\nabla_f(\log f_t^T x_t)}_\infty \leq G_\infty\); note that this time around we have a bound on the max norm of the gradients. The update looks very similar to OGD:</p>
\[f_{t+1} \leftarrow \arg\min_{f \in P} \eta_t \nabla_f(-\log f_t^T x_t)^Tf + V_{f_t}(f),\]
<p>where $V_x(y)$ is the corresponding Bregman divergence. This update can be implemented via the multiplicative weight update method; see <a href="/blog/research/2019/02/27/cheatsheet-nonsmooth.html">Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning</a> for details.</p>
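For $P = \Delta(n)$ with the negative entropy as distance-generating function, one OMD step is exactly a multiplicative weight update. A minimal sketch (the function name is mine; $x$ holds the gross returns of the current step):

```python
import math

def mwu_update(f, x, eta):
    """One Multiplicative Weight Update step for the entropic OMD
    update above: f_{t+1,i} is proportional to f_{t,i} * exp(eta * g_i),
    where g = x / (f^T x) is the gradient of log(f^T x) at f."""
    fx = sum(fi * xi for fi, xi in zip(f, x))
    w = [fi * math.exp(eta * xi / fx) for fi, xi in zip(f, x)]
    z = sum(w)
    return [wi / z for wi in w]          # renormalize onto the simplex
```

Note that no explicit projection is needed: the exponential update keeps the weights positive and the normalization keeps them on $\Delta(n)$.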
<p><strong>Remark: (Feasible region $\Delta(n)$).</strong> We can stick to $P = \Delta(n)$ as the feasible region; this is not as restrictive as it seems, since one can simply add the negative of an asset to allow for shorting and duplicate assets to allow for leverage. This “modification” has only minor impact on the regret bound (logarithmic dependence on $n$ for OMD). Alternatively, one can simply define a customized Bregman divergence incorporating those constraints, with basically the same result.</p>
<h4 id="online-gradient-descent-for-strongly-convex-functions">Online Gradient Descent for strongly convex functions</h4>
<p>It turns out that given that $\log f_t^T x_t$ is strongly concave in $f$ with respect to the $2$-norm one can significantly improve the OGD regret bound as shown by [HAK]. This type of improved regret bound in the strongly convex case is pretty standard by now (again see <a href="/blog/research/2019/02/27/cheatsheet-nonsmooth.html">Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning</a> for details), so that we just state the bound:</p>
\[\tag{univPortOGDSC}
\max_{f \in P} \frac{1}{T} \sum_{t = 1}^T \log f^T x_t - \frac{1}{T} \sum_{t = 1}^T \log f_t^T x_t \leq \frac{G_2^2}{2\mu} \frac{(1 + \log T)}{T},\]
<p>where $\mu > 0$ is such that $\nabla_f^2 \log f_t^T x_t \preceq -\mu I$, i.e., $\mu$ is a lower bound on the strong concavity constant. Note that we have, up to log factors, a quadratic improvement in the convergence rate.</p>
<h2 id="performance-on-real-world-data">Performance on real-world data</h2>
<p>So, from a theoretical perspective, the performance guarantees of Online Mirror Descent, and even more so of the variant for strongly convex losses via Online Gradient Descent, sound very appealing for running Universal Portfolios online. Given the regret bounds, one would expect that all information theory, computer science, and machine learning people “in the know” would be rich. However, as so often:</p>
<blockquote>
<p>“In academia there is no difference between academia and the real world - in the real world there is”
— <a href="https://en.wikipedia.org/wiki/Nassim_Nicholas_Taleb">Nassim Nicholas Taleb</a></p>
</blockquote>
<p>As we will see in the following, for actual parameterizations that are compatible with the market conditions we encounter in today’s market regimes, the provided asymptotic guarantees are often, but not always, too weak. Note that we are ignoring transaction costs here, which impact performance negatively (see [BK] for a discussion of Universal Portfolios with transaction costs), as accounting for them would complicate the exposition significantly while adding little extra value in terms of understanding.</p>
<p>I would like to stress that I am not saying that OMD, Universal Portfolios, etc. cannot be successfully used in investing. A few disclaimers:</p>
<ol>
<li>There are <em>some</em> researchers, in particular information theorists, who use related methodologies for investing quite successfully, but in a more elaborate way than just vanilla portfolio optimization.</li>
<li>The aforementioned algorithms often work much better in practice than the regret bounds suggest; however, we run Universal Portfolios precisely because of the worst-case guarantee. Otherwise, we end up in metaphysical discussions of the pros and cons of investment strategies without any guarantees.</li>
</ol>
<p><strong>Understanding the Regret Bounds.</strong> It is important to understand exactly what the regret bounds (univPortOGDSC) and (univPortOMD) really mean. Namely, after a certain number of iterations or steps, the achieved error is the <em>average additive return error</em>, i.e., if the value in the graph for (univPortOMD) indicates an error of $\varepsilon$, this means that <em>on average in each step</em> we roughly make an additive error of $\varepsilon$ in the return. This can be quite problematic if the actual returns are smaller than $\varepsilon$. As such, we will also report the <em>average relative return error</em>, the ratio of the additive error and the return in that time step, as this is what really matters from a practical perspective: you want to be within a (hopefully small) <em>multiplicative</em> factor of the actual return.</p>
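To make the distinction concrete, here is a tiny sketch (the helper function is mine): an additive error of $\varepsilon = 0.01$ is benign against $10\%$ per-step returns, but amounts to a $100\%$ relative error when per-step returns are around $1\%$.

```python
def average_relative_error(additive_error, returns):
    """Translate a per-step additive return error into the average
    relative return error: the ratio of the additive error and the
    (absolute) return in each time step, averaged over the steps."""
    ratios = [additive_error / abs(r) for r in returns]
    return sum(ratios) / len(ratios)
```

For example, `average_relative_error(0.01, [0.10] * 5)` is $0.1$ (a $10\%$ relative error), while the same additive error against $1\%$ returns, `average_relative_error(0.01, [0.01] * 5)`, is $1.0$, i.e., the guarantee is no better than a $100\%$ relative error.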
<p>Before looking at actual data, I will first make a few remarks using simulated market data (calibrated against “typical” market conditions) to demonstrate the underlying mechanics. Then we will look at actual data in two cases: a more traditional US equities scenario and a more speculative, high(er)-frequency cryptocurrency example.</p>
<h3 id="constants-and-norms-matter">Constants and norms matter</h3>
<p>At first sight it seems that the regret bound (univPortOGDSC) for the strongly convex case via OGD should be vastly superior to the regret bound (univPortOMD) via OMD. So let us start with a simple example of $n = 2$ assets for $T = 100000$ time steps; this is quite long in the real world, more on this later. Here and in the following we often depict, on the left, the average additive (or relative) return errors and, on the right, the respective norms of the gradients over time as well as the lower-bound estimate of the strong concavity constant.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/twoAssetcompOver.png" alt="Regret Bound 2 assets" /></p>
<p>This confirms our initial suspicion that (univPortOGDSC) is superior. Or is it? Let us consider another example but this time we take a more realistic number of $n = 50$ assets:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/50AssetcompOver.png" alt="Regret Bound 50 assets" /></p>
<p>This might be slightly surprising at first, but what is happening is that (univPortOMD) scales <em>much</em> better due to the logarithmic dependence on the number of assets and the max norm of the gradients, whereas (univPortOGDSC) depends on the $2$-norm of the gradients, which scales roughly with $\sqrt{n}$ when the assets are reasonably independent. So basically OMD is slower in principle but starts from an exponentially lower initial error bound, so you need really long time horizons for (univPortOGDSC) to dominate (univPortOMD) for reasonably sized portfolios. In actual setups, which algorithm and bound performs better really depends on the data and parameters.</p>
<p>Moreover, as mentioned above, what we really care about are the relative errors in the returns. For the same examples as above, on the left we have the $2$-asset case and on the right the $50$-asset case.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/2-50-relComp.png" alt="Relative Regret Bounds 2 and 50 assets" /></p>
<p>The key takeaway here is that the relative error can easily be an order of magnitude worse than the additive error. In fact, in both cases here, where we simulated really moderate markets, even after a large number of time steps the relative error is around $10\%$, which is still quite considerable.</p>
<h3 id="actual-time-matters-more-than-anything">Actual time matters (more than anything)</h3>
<p>The other important thing is that actual <em>time</em>, not the number of <em>time steps</em>, matters a lot. This might be counterintuitive as well, but bear with me. As can be seen from the two regret formulas of interest, (univPortOGDSC) and (univPortOMD), it is the number of time steps that is key to bringing down the average additive return error. As such, while we cannot accelerate time in the real world, we might be tempted to trade more often. Let us see what happens. We consider a market with $n = 50$ assets and a volatility roughly that of the US equities market; the graph shows the log (portfolio) value of each asset (as a single-asset portfolio).</p>
<p class="minimg"><img src="http://www.pokutta.com/blog/assets/univPort/50Assets10years52weeksmarket.png" alt="Market 50 assets 10 years 52 weeks" /></p>
<p>Now in terms of trading and achievable average additive return errors in the next graphic, on the left we assume that we trade once per week, i.e., we have $T = 520$ time steps, whereas on the right we assume that we trade $100$ times per week, i.e., $T = 52000$.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/50Assets10years52weekscompRescaleRegret.png" alt="Rescaling additive regret" /></p>
<p>First of all, we see that the additive error for the once-per-week setup is actually quite high, whereas on the right, trading $100$ times per week really helps to bring down the average additive error, and we also see that (univPortOGDSC) starts to provide better guarantees than (univPortOMD). This is also consistent with many papers reporting tests over long(er) time horizons in order to ensure that the regret bounds are tight enough.</p>
<p>Now however, the story changes quite a bit when we consider the <em>relative</em> errors as shown in the next graphic.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/50Assets10years52weekscompRescaleRegretrel.png" alt="Rescaling relative regret" /></p>
<p>So for (univPortOMD) basically nothing has changed. The reason is the following: the increase in the number of trades basically turns $\sqrt{T}$ in the regret bound into $\sqrt{100 T} = \sqrt{100} \sqrt{T}$. However, the volatility of the underlying stock process scales <em>the same way</em>, i.e., the returns get roughly rescaled by $1/\sqrt{100}$ as well, so that (apart from better tracking and removing discretization errors) we obtain basically the <em>same relative errors</em>. This is not the case for (univPortOGDSC), as it contracts quadratically faster (up to log factors); however, the strong dependency on the number of assets might still outweigh this benefit.</p>
<p>In short: rescaling time by increasing the number of trades per time step is of <em>limited efficiency</em>. It is actual <em>time</em> that matters when we consider relative return errors, which is the quantity that actually matters.</p>
<p><strong>Interlude: Understanding compounding or how to create a trust fund baby.</strong> Just to understand how important time is when it comes to investing and compounding: at retail investor return rates, say somewhere below $8\%$ per year (which is already quite optimistic), within $20$ years of investing you can merely $4.7$-fold your initial investment; even $40$ years only give you a factor of about $22$, which is also not game changing: neither is going to make you rich when you started out poor (and “getting rich” is closer to a factor $100$ increase in wealth). However, if you are willing to invest not for yourself but, say, for your grandchild, you have a good $60$ years of compounding and the situation is very different: you roughly $101.26$-fold your initial investment, say turning USD $10,000$ into roughly USD $1.00$ million. Go one generation further down the family tree and it becomes USD $4.72$ million. So while it is unlikely that you can catapult yourself into the realm of riches (provided you did not start out rich), you do have the chance to do something for your family down the line. One might actually argue, from an economic perspective, that the reason retail returns are not higher than they are is to maintain a certain equilibrium (while beyond the scope of this post: otherwise everybody would be rich, which means nobody is rich). The lack of category-changing retail investment returns might also provide an explanation for why people have been so crazy about cryptocurrencies recently: they provide(d) a return that is potentially significant enough to gain $2$ generations of investing within a couple of years (or even months); at correspondingly high risk levels though.</p>
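The compounding arithmetic above is easy to check; a quick sketch (the function name is mine) reproducing the factors quoted in the text at $8\%$ per year:

```python
def growth_factor(rate, years):
    """Terminal wealth multiple from compounding at `rate` per year."""
    return (1 + rate) ** years

# the figures from the text, at 8% per year:
print(growth_factor(0.08, 20))   # ~4.66   (20 years of investing)
print(growth_factor(0.08, 60))   # ~101.26 (60 years: one grandchild)
print(growth_factor(0.08, 80))   # ~471.95 (80 years: USD 10,000 -> ~USD 4.72M)
```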
<h3 id="benchmarks-on-actual-data">Benchmarks on actual data</h3>
<p>With the above in mind, let us benchmark (univPortOGDSC) and (univPortOMD), and hence the algorithms, in terms of achievable guarantees on actual market data.</p>
<h4 id="sp500-stocks">SP500 stocks</h4>
<p>In the first benchmark we consider the SP500 stocks from 01/01/2008 to 01/01/2018 in daily trading. After correcting for dropouts etc over the horizon we are left with $n = 426$ stocks and $T = 2519$ time steps (which corresponds to $10$ years of approx $252$ trading days/year). Apart from very basic tests, no further cleanup was done; that is left as an exercise to the reader. In the graphic below, on the left we have the actual market prices in USD of the assets and on the right the log (portfolio) value of the assets.</p>
<p><img src="http://www.pokutta.com/blog/assets/univPort/sp500marketStockOver.png" alt="sp500 market" /></p>
<p>In terms of achievable average additive return errors, they are huge for either of the bounds:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/sp500compOver.png" alt="sp500 additive regret" /></p>
<p>We have roughly a “guarantee” of $100\%$ <em>additive</em> error in the daily return for the (better) OMD bound. To put this in perspective: say the best stock had a <em>daily return</em> of $3\%$ on average (that is <em>huge</em>); then we can “guarantee” an error band of $[-97\%,103\%]$ for the average daily performance of the universal portfolio <em>at the end of the time horizon</em>. Not quite helpful, I guess. The situation gets much more pronounced in terms of the average relative error:</p>
<p class="minimg"><img src="http://www.pokutta.com/blog/assets/univPort/sp500comprel.png" alt="sp500 relative regret" /></p>
<h4 id="bitcoin">Bitcoin</h4>
<p>Maybe we were on the wrong end of the regime. Now let us consider fewer assets and a much higher trading frequency. We consider $n = 2$ assets: Bitcoin and a risk-free asset (e.g., cash or government bonds; unsure about the latter recently) that make up our portfolio. The risk-free asset returns a (moderate) $2\%$ over the full horizon. These days you could consider yourself lucky to get any interest at all: Germany <a href="https://www.cnbc.com/2019/08/20/what-is-a-zero-coupon-bond-germany-set-to-auction-a-zero-percent-30-year-bond.html">just tried to sell a zero-coupon bond</a> at a crazy price (i.e., negative yield), which basically means parking your hard-earned cash for $30$ years in a “safe haven” and paying for it; demand at the auction was <a href="https://www.bloomberg.com/news/articles/2019-08-21/germany-sees-anemic-demand-for-30-year-bond-sale-at-zero-coupon">anemic</a>. The original Bitcoin market data was on a sub-second granularity level and for illustration has been downsampled to $T = 566159$ time steps. As before, in the graphics below, on the left we have the actual market price in USD and on the right the log (portfolio) value of the two assets.</p>
<p><img src="http://www.pokutta.com/blog/assets/univPort/bitcoinMarketOver.png" alt="bitcoin market" /></p>
<p>Now, for our two regret bounds, we see that the variant utilizing strong concavity performs much better (as we have only two assets, so the constants are small). In fact, $10^{-4}$ as the average additive return error seems quite good.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/bitcoincompOver.png" alt="bitcoin additive regret" /></p>
<p>However, unfortunately, the reality here is different: the average price changes between time steps are also quite small, so that the relative errors are <em>huge</em>:</p>
<p class="minimg"><img src="http://www.pokutta.com/blog/assets/univPort/bitcoincomprel.png" alt="bitcoin relative regret" /></p>
<p>Even towards the end of the time horizon our (guaranteed) average relative error bounds can be as large as $10^3$, making them impractical. Sure, here we have the <em>special situation</em> of only two assets, one of which is risk-free, so that the allocation cannot be <em>that bad</em>, but that is beside the point.</p>
<h2 id="a-simple-asymptotically-optimal-strategy">A simple asymptotically optimal strategy</h2>
<p>Although CRPs ensure competitiveness with respect to a wide variety of benchmarks (see above), in most papers the reference is the <em>single best asset in hindsight</em>. This intuitively makes sense: you want to perform essentially (at least) as well as the best possible asset. Note that in principle CRPs can be profitable even when none of the assets is (as discussed before), but for that you basically need a situation where all assets have a log geometric mean return of at most $0$. Once you are beyond this, i.e., there is an asset with strictly positive log geometric mean return, the log portfolio value is usually optimized by a single-asset portfolio.</p>
<h3 id="split-and-forget">Split-and-Forget</h3>
<p>So, in view of this discussion, let us change our reference measure to compete not with the best CRP but with the <em>best single asset in hindsight</em>. In this case there is a simple regret-optimal strategy with <em>constant</em> regret in terms of terminal logarithmic wealth (as used in (univPortOGDSC) and (univPortOMD)). The strategy, let us call it <em>Split-and-Forget</em>, works as follows: say we have $n$ assets; then simply allocate a $1/n$-fraction of capital to each of the $n$ assets and then forget about them (i.e., no rebalancing etc., and hence also no additional transaction costs). Now let $V_t^i$ denote the value of asset $i$ at time $t$ and let $M_t$ denote the value of our portfolio at time $t$. Then the regret with respect to the best asset in hindsight is:</p>
\[\max_{i \in [n]} \log V^i_T - \log M_T \leq \max_{i \in [n]} \log V^i_T - \log \frac{1}{n} V^i_T = \log n,\]
<p>and as such, in the language from before, our <em>average regret</em> (i.e., <em>average additive return error</em>) satisfies</p>
\[\tag{SF}
\max_{i \in [n]} \frac{1}{T }\log V^i_T - \frac{1}{T} \log M_T \leq \max_{i \in [n]}\frac{1}{T } \log V^i_T - \frac{1}{T} \log \frac{1}{n} V^i_T = \frac{\log n}{T},\]
<p>i.e., the bound has the same logarithmic dependency on $n$ as OMD but an even faster convergence rate (no log factor) than OGD-SC. On top of that, you have no transaction costs (except for the initial allocation). In fact, if the optimal CRP is attained by a single-asset portfolio, this basic strategy performs much better in terms of regret and achieved average returns. Note that [DGU]’s $1/n$-strategy rebalances to a uniform $1/n$ allocation at each rebalancing step and as such is different from Split-and-Forget, which does not alter the allocation after the initial split.</p>
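The bound (SF) is easy to verify numerically; a small sketch (the function name and the simulated terminal values are mine), checking that the terminal log-wealth gap to the best asset never exceeds $\log n$:

```python
import math, random

def split_and_forget_regret(terminal_values):
    """Regret of Split-and-Forget versus the best single asset in
    hindsight, in terminal log wealth. terminal_values holds the
    terminal value V_T^i of one unit invested in each asset i."""
    n = len(terminal_values)
    # 1/n in each asset at the start, no rebalancing afterwards:
    portfolio = sum(v / n for v in terminal_values)
    return max(math.log(v) for v in terminal_values) - math.log(portfolio)

# the bound from the text: regret <= log(n), regardless of the market
random.seed(0)
values = [math.exp(random.gauss(0.05, 0.2)) for _ in range(50)]
assert split_and_forget_regret(values) <= math.log(50)
```

The bound is tight exactly in the worst case discussed below, where all but one asset are (almost) wiped out.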
<p><strong>Interlude: Understanding diversification or why old families tend to be rich.</strong> In order to understand a bit better why the above really works and is actually not that bad a strategy, observe that when doing the initial split you pay for it <em>once</em>: in the worst case, the $\log n$ additive regret, which occurs when all but one asset are wiped out and you are left with only the $1/n$-fraction in the one remaining asset. However, after the split you are guaranteed that your portfolio grows at the rate of the best single asset. More mathematically speaking, the split leads to a shift by $\log n$ in the log portfolio value, but after that its growth approaches the optimal growth (compared to the single best asset). Sounds weird, but it becomes clearer when you think of asset $i$ compounding at some rate $r_i$. Now the terminal log portfolio wealth would be</p>
\[\log \sum_{i \in [n]} \frac{1}{n} e^{r_i T} = \log \sum_{i \in [n]} e^{r_i T - \log n},\]
<p>which is the <a href="https://en.wikipedia.org/wiki/LogSumExp">LogSumExp (LSE) function</a> (also sometimes called <em>softmax</em>), which for larger $T$ converges to the maximum: Let $j$ be so that $r_j$ is (uniquely) maximal, then</p>
\[\log \sum_{i \in [n]} e^{r_i T - \log n} = r_j T - \log n + \log \left( 1 + \sum_{i \in [n], i \neq j} e^{(r_i - r_j) T} \right) \rightarrow r_j T - \log n,\]
<p>so that the average return of the portfolio approaches $r_j - \frac{\log n}{T}$ as the exponentially small ratios vanish, which in turn approaches $r_j$ for larger $T$.</p>
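This convergence is easy to check numerically; a small sketch (the rates are made up) computing the portfolio's average log return via the log-sum-exp trick:

```python
import math

def avg_log_portfolio_return(rates, T):
    """Average log return of Split-and-Forget when asset i compounds
    at rate r_i: (1/T) * log( (1/n) * sum_i e^{r_i T} ), evaluated
    stably by factoring out the largest exponent (log-sum-exp)."""
    m = max(r * T for r in rates)
    lse = m + math.log(sum(math.exp(r * T - m) for r in rates))
    return (lse - math.log(len(rates))) / T

rates = [0.02, 0.05, 0.08]   # best asset grows at 8% per year
# the average return climbs towards r_j - log(n)/T and hence towards r_j
```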
<p>So what does this have to do with old families tending to be rich? Suppose at some point your family decided to split their wealth and invest equally into various business ventures, say $10$. You only need to get one of these ventures right, and then, over time, you wash out your additive offset of $\log 10$; say the best one generates $4\%$ return per year, then it roughly takes $60$ years to “pay” for the cost of the split. In other words, the cost of the split translates into a shift in time, and if you have enough time you simply wash out that cost. You might be well aware of the supercharged version of this: investing in startups. There you spread the capital even wider and try to harvest your black swan with explosive growth that pays for the losers in the portfolio and returns a hefty profit. This is simply convexity: the average of the individual payoffs can be much higher than the payoff of the average of the individuals.</p>
<h3 id="split-and-forget-compared-to-omd-and-ogd-sc">Split-and-Forget compared to OMD and OGD-SC</h3>
<p>In the following we compare the realized average return errors (both additive and relative) of (SF) on the same examples that we have seen above. In each figure on the left we have the average additive error and on the right the average relative error.</p>
<p>The first one is the $50$ assets $10$ years example:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/50Assets10years52weekscompSOver.png" alt="SF 50 Assets 10 years" /></p>
<p>And here the longer time horizon version used above to demonstrate the dependency on the constants:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/50AssetscompSOver.png" alt="SF 50 Assets long term" /></p>
<p>The next one is the SP500 example from above:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/sp500compSOver.png" alt="SF SP 500" /></p>
<p>And the bitcoin example:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/bitcoincompSOver.png" alt="SF bitcoin" /></p>
<p>As can be seen, the regret of (SF) is significantly smaller than the regret of (OGD-SC) and (OMD), albeit with respect to a slightly different benchmark: the best single asset vs. the best CRP. We could now have a very long discussion about whether (SF) is a good strategy or not. What the (SF) strategy does, however, is shed some light on what one can really expect from universal portfolios.</p>
<p>In actual backtesting the algorithms (as usual) perform much better, although I want to stress that this is without guarantee and hence puts us back into the realm of justification-by-backtesting; you should always <em>validate</em> by backtesting though. Thus, just for completeness, a quick backtesting experiment. Here we ran the actual algorithms (as opposed to just computing the additive and relative bounds) and benchmark the (SF) strategy against the, for this case better suited, (OGD-SC) strategy. The green curve “Opt” is the in-hindsight-optimal CRP <em>allowing for leverage of a factor of up to $2$</em>, i.e., it is outside of the benchmarking class of CRPs and added just for comparison. You can clearly see the offset of (SF) at the start and that this offset translates into a time shift from which point onwards it dominates (OGD-SC) significantly in growth rate, essentially following the market: negative offset but higher rate.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/backtestingBitcoingOGDSC.png" alt="SF bitcoin backtesting" /></p>
<h3 id="references">References</h3>
<p>[K] Kelly Jr, J. L. (2011). A new interpretation of information rate. In <em>The Kelly Capital Growth Investment Criterion: Theory and Practice</em> (pp. 25-34). (original article from 1956) <a href="http://www.herrold.com/brokerage/kelly.pdf">pdf</a></p>
<p>[L] Latane, H. A. (1959). Criteria for choice among risky ventures. <em>Journal of Political Economy</em>, <em>67</em>(2), 144-155. <a href="http://finance.martinsewell.com/money-management/Latane1959.pdf">pdf</a></p>
<p>[S] Shannon, C. E. (1948). A mathematical theory of communication. <em>Bell system technical journal</em>, <em>27</em>(3), 379-423. <a href="https://pure.mpg.de/rest/items/item_2383162/component/file_2456978/content">pdf</a></p>
<p>[C] Cover, T. M. “Shannon and investment.” <em>IEEE Information Theory Society Newsletter, Summer</em> (1998). <a href="xxx">pdf</a></p>
<p>[T] Thorp, E. O. (1966). <em>Beat the Dealer: a winning strategy for the game of twenty one</em> (Vol. 310). Vintage.</p>
<p>[TK] Thorp, E. O., & Kassouf, S. T. (1967). <em>Beat the market: a scientific stock market system</em>. Random House.</p>
<p>[B] Breiman, L. (1961). Optimal gambling systems for favorable games. <a href="https://apps.dtic.mil/dtic/tr/fulltext/u2/402290.pdf">pdf</a></p>
<p>[P] Poundstone, W. (2010). <em>Fortune’s formula: The untold story of the scientific betting system that beat the casinos and Wall Street</em>. Hill and Wang.</p>
<p>[MTZZ] MacLean, L. C., Thorp, E. O., Zhao, Y., & Ziemba, W. T. (2011). How Does the Fortune’s Formula Kelly Capital Growth Model Perform?. <em>The Journal of Portfolio Management</em>, <em>37</em>(4), 96-111. <a href="http://hari.seshadri.com/docs/kelly-betting/kelly1.pdf">pdf</a></p>
<p>[T2] Thorp, E. O. (2011). Understanding the Kelly criterion. In <em>The Kelly Capital Growth Investment Criterion: Theory and Practice</em> (pp. 509-523).</p>
<p>[T3] Thorp, E. O. (2011). The Kelly criterion in blackjack sports betting, and the stock market. In <em>The Kelly Capital Growth Investment Criterion: Theory and Practice</em> (pp. 789-832). <a href="https://www.bjrnet.com/thorp/Thorp_KellyCriterion.pdf">pdf</a></p>
<p>[C1] Cover, T. (1984). An algorithm for maximizing expected log investment return. <em>IEEE Transactions on Information Theory</em>, <em>30</em>(2), 369-373. <a href="https://ieeexplore.ieee.org/abstract/document/1056869">pdf</a></p>
<p>[C2] Cover, T. M. (2011). Universal portfolios. In <em>The Kelly Capital Growth Investment Criterion: Theory and Practice</em> (pp. 181-209). (original reference: Cover, T. M. (1991). Universal Portfolios. <em>Mathematical Finance</em>, <em>1</em>(1), 1-29.) <a href="https://stuff.mit.edu/afs/athena.mit.edu/course/6/6.962/www/www_fall_2001/shaas/universal_portfolios.pdf">pdf</a></p>
<p>[CO] Cover, T. M., & Ordentlich, E. (1996). Universal portfolios with side information. <em>IEEE Transactions on Information Theory</em>, <em>42</em>(2), 348-363. <a href="https://pdfs.semanticscholar.org/e3f5/037e8ad65deead506235d0c25d07f7d3f0d6.pdf">pdf</a></p>
<p>[BK] Blum, A., & Kalai, A. (1999). Universal portfolios with and without transaction costs. <em>Machine Learning</em>, <em>35</em>(3), 193-205. <a href="https://link.springer.com/content/pdf/10.1023/A:1007530728748.pdf">pdf</a></p>
<p>[KV] Kalai, A., & Vempala, S. (2002). Efficient algorithms for universal portfolios. <em>Journal of Machine Learning Research</em>, <em>3</em>(Nov), 423-440. <a href="http://www.jmlr.org/papers/volume3/kalai02a/kalai02a.pdf">pdf</a></p>
<p>[HSSW] Helmbold, D. P., Schapire, R. E., Singer, Y., & Warmuth, M. K. (1998). On‐Line Portfolio Selection Using Multiplicative Updates. <em>Mathematical Finance</em>, <em>8</em>(4), 325-347. <a href="http://web.cs.iastate.edu/~honavar/portfolio-selection.pdf">pdf</a></p>
<p>[Z] Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03) (pp. 928-936). <a href="http://www.aaai.org/Papers/ICML/2003/ICML03-120.pdf">pdf</a></p>
<p>[NY] Nemirovsky, A. S., & Yudin, D. B. (1983). Problem complexity and method efficiency in optimization.</p>
<p>[AHK] Arora, S., Hazan, E., & Kale, S. (2012). The multiplicative weights update method: a meta-algorithm and applications. <em>Theory of Computing</em>, <em>8</em>(1), 121-164. <a href="http://www.theoryofcomputing.org/articles/v008a006/v008a006.pdf">pdf</a></p>
<p>[HAK] Hazan, E., Agarwal, A., & Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. <em>Machine Learning</em>, <em>69</em>(2-3), 169-192. <a href="https://link.springer.com/content/pdf/10.1007/s10994-007-5016-8.pdf">pdf</a></p>
<p>[DGU] DeMiguel, V., Garlappi, L., & Uppal, R. (2007). Optimal versus naive diversification: How inefficient is the 1/N portfolio strategy?. The review of Financial studies, 22(5), 1915-1953. <a href="https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1031.3574&rep=rep1&type=pdf">pdf</a></p>
<p><br /></p>
<h4 id="changelog">Changelog</h4>
<p>09/07/2019: Fixed several typos and added reference [DGU] as pointed out by Steve Wright.</p>
<p><em>Sebastian Pokutta. TL;DR: How to (not) get rich? Running Universal Portfolios online with Online Convex Optimization techniques.</em></p>
<h1 id="toolchain-tuesday-no-6">Toolchain Tuesday No. 6</h1>
<p><em>2019-08-19, http://www.pokutta.com/blog/random/2019/08/19/toolchain-6</em></p>
<p><em>TL;DR: Part of a series of posts about tools, services, and packages that I use in day-to-day operations to boost efficiency and free up time for the things that really matter. This time around will be about privacy tools. Use at your own risk - happy to answer questions. For the full, continuously expanding list so far see <a href="/blog/pages/toolchain.html">here</a>.</em>
<!--more--></p>
<p><em>Disclaimer: I am not a security or privacy expert. Do your own due diligence and consider the following as pointers only.</em></p>
<p>With a lot of high profile data leaks (see e.g., <a href="https://www.nytimes.com/2019/07/30/business/capital-one-breach.html">here</a>, <a href="https://about.flipboard.com/support-information-incident-May-2019/">here</a>, <a href="https://www.upguard.com/breaches/rsync-oklahoma-securities-commission">here</a>, <a href="https://www.upguard.com/breaches/facebook-user-data-leak">here</a>, <a href="https://www.forbes.com/sites/zakdoffman/2019/07/16/whatsapptelegram-issue-has-put-a-billion-users-at-risk-check-your-settings-now/#5866d865ab88">here</a>, and <a href="https://research.checkpoint.com/hacking-fortnite/">here</a>), considerations to systematically weaken encryption via backdoors (see e.g., <a href="https://www.engadget.com/2019/07/31/how-ag-barr-is-going-to-get-encryption-backdoors/">here</a>, <a href="https://www.financemagnates.com/cryptocurrency/news/is-facebook-building-the-tools-to-end-encryption-forever/">here</a>, <a href="https://www.forbes.com/sites/kalevleetaru/2019/05/28/facebook-is-already-working-towards-germanys-end-to-end-encryption-backdoor-vision/#37d78a154e4a">here</a>, and <a href="https://www.forbes.com/sites/kalevleetaru/2019/07/26/the-encryption-debate-is-over-dead-at-the-hands-of-facebook/#b8654cd53626">here</a>) to e.g., spy on your WhatsApp messages while still giving the impression of strong end-to-end encryption, <a href="https://eugdpr.org/">GDPR</a> coming into (full) effect (see e.g., <a href="https://www.darkreading.com/endpoint/privacy/companies-anonymized-data-may-violate-gdpr-privacy-regs/d/d-id/1335361">here</a>, <a href="https://www.computerweekly.com/news/252467726/GDPR-taken-more-seriously-after-first-fines">here</a>), and in general all types of issues with user tracking, selling data, etc., I thought it might be a good time to talk about privacy tools. There are tons of great tools out there, however I will only be able to touch upon a few, notably those that I have first-hand experience with. 
If there is a tool that you think should be here, drop me a line.</p>
<p>I will not dive into the question <em>why privacy tools are useful/necessary</em>; this has been done in many places elsewhere. However, I believe it is fair to say that it is quite hard to get a true grasp of what data is really collected how, where, when, and for what purpose (even ignoring higher-order cross-referencing etc). GDPR is not going to change that, with or without you clicking hundreds of “I accept/consent” buttons a day. In fact, GDPR will likely amplify the current imbalance in favor of large tech companies, as individuals or smaller companies might outsource their data storage etc to big tech, rather than trying to navigate the complexities of GDPR themselves. All the more reason to think about your data. Just to get a glimpse, if you have not done so yet: download your facebook or google data (how to <a href="https://www.dataislife.net/download-entire-facebook-account-data-2019/">download your facebook data</a> or <a href="https://www.cnbc.com/2018/03/29/how-to-download-a-copy-of-everything-google-knows-about-you.html">download your google data</a>). Then set aside half an afternoon to browse through the data trove; the experience might be quite sobering.</p>
<p>I would like to stress that the tools below <em>do not</em> guarantee privacy or safety etc. In particular, even with those tools in place your working assumption should be that <em>there is always a risk of exploits</em> and side channels etc. For example, suppose that through a backdoor, e.g., a keylogger is running on your phone or computer, then even the best tools cannot protect you; in the words of <a href="https://www.forbes.com/sites/kalevleetaru/2019/07/26/the-encryption-debate-is-over-dead-at-the-hands-of-facebook/#75e449ce5362">one of the articles</a> from above:</p>
<blockquote>
<p>The ability of encryption to shield a user’s communications rests upon the assumption that the sender and recipient’s devices are themselves secure, with the encrypted channel the only weak point.</p>
</blockquote>
<h2 id="software">Software</h2>
<h3 id="signal">Signal</h3>
<p>Secure open-source messenger.</p>
<p><em>Learning curve: ⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://signal.org/">https://signal.org/</a></em> <br /></p>
<p>The messaging app <code class="highlighter-rouge">Signal</code>, which is available for iOS and Android (as well as on Mac/PC, though this requires a phone with <code class="highlighter-rouge">Signal</code>), is considered one of the most secure messaging apps (see, e.g., <a href="https://www.boxcryptor.com/en/blog/post/encryption-comparison-secure-messaging-apps/">here</a>). There are several other messaging apps out there that also use the <code class="highlighter-rouge">signal protocol</code>, however <code class="highlighter-rouge">Signal</code> has the important advantage that it is <a href="https://github.com/signalapp">open source</a>. Moreover, it has been scrutinized and reviewed by security experts; while some messengers like <code class="highlighter-rouge">WhatsApp</code> basically use the same protocol, with a closed-source app you have no way to verify whether it is trustworthy or whether it contains backdoors.</p>
<p>In terms of learning curve, it basically works like your favorite messaging app, albeit lacking some of the bells and whistles. <code class="highlighter-rouge">Signal</code> also supports voice and video calls.</p>
<h3 id="threema">Threema</h3>
<p>Secure messenger.</p>
<p><em>Learning curve: ⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://threema.ch/en">https://threema.ch/en</a></em> <br /></p>
<p>Another choice is <code class="highlighter-rouge">Threema</code>. It is also considered a very good messenger with strong encryption; however, the code is not open source. This precludes public code review by security experts, which traditionally has led to more hardened code with fewer exploits. Apparently there has been some closed-door code review though.</p>
<p><code class="highlighter-rouge">Threema</code> looks like your favorite messenger with a set of features comparable to <code class="highlighter-rouge">Signal</code>.</p>
<p><em>For more on secure messengers etc, the website <a href="https://www.securemessagingapps.com">securemessagingapps</a> provides a general overview of the different messengers and their security/privacy features.</em></p>
<h3 id="brave">Brave</h3>
<p>Privacy-aware web browser based on chromium.</p>
<p><em>Learning curve: ⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://www.brave.com/">https://www.brave.com/</a></em> <br /></p>
<p>Based on the same <code class="highlighter-rouge">Chromium</code> backend as <code class="highlighter-rouge">Google Chrome</code>, <code class="highlighter-rouge">Brave</code> is a very fast web browser with extensive privacy tools, blocking various trackers, cookies, and fingerprinting. Moreover, it supports most <code class="highlighter-rouge">Chrome</code> extensions, which is quite useful, so you can switch from <code class="highlighter-rouge">Chrome</code> to <code class="highlighter-rouge">Brave</code> with little to no work. <code class="highlighter-rouge">Brave</code> also supports an experimental model called <code class="highlighter-rouge">Brave Rewards</code> to support content creators not through ads but through a micro-payment-like system:</p>
<blockquote>
<p>Activate Brave Rewards (available on desktop only) and give a little back to the sites you frequent most. Help fund the content you love – even when you block ads.</p>
<p>Browsing the web with Brave is free: with Brave Rewards activated, you can support the content creators you love at the amount that works for you.</p>
</blockquote>
<p>Finally, <code class="highlighter-rouge">Brave</code> is very fast, in fact much faster than <code class="highlighter-rouge">Chrome</code>, probably due to blocking tons of scripts, trackers, etc. <code class="highlighter-rouge">Brave</code> is also available on mobile (iOS and Android). Also, for the geeks, <code class="highlighter-rouge">Brave</code> supports <a href="https://ipfs.io/">IPFS</a> and <a href="https://www.torproject.org/">Tor</a> directly out of the box.</p>
<h3 id="firefox">Firefox</h3>
<p>Browser with strong privacy features.</p>
<p><em>Learning curve: ⭐️⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://www.mozilla.org/en-US/firefox/">https://www.mozilla.org/en-US/firefox/</a></em> <br /></p>
<p>Another great browser is <code class="highlighter-rouge">Firefox</code>, which also comes with extensive privacy tools and was arguably one of the first browsers to take privacy seriously. <code class="highlighter-rouge">Firefox</code> is a great browser, however I opted for <code class="highlighter-rouge">Brave</code> for compatibility reasons (see, e.g., <a href="https://www.theverge.com/2019/3/4/18249623/brave-browser-choice-chrome-vivaldi-replacement-chromium">this article on Verge</a> for some discussion).</p>
<h3 id="gnupg">GnuPG</h3>
<p>State-of-the-art open-source encryption suite.</p>
<p><em>Learning curve: ⭐️⭐️⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://gnupg.org/">https://gnupg.org/</a></em> <br /></p>
<p>Last, but not least: encryption. This does not directly relate to privacy in the sense above but is of no lesser importance. If you need state-of-the-art encryption both for emails and also files, then <code class="highlighter-rouge">GnuPG</code> is the answer. It takes some time to get used to, but both the tools and the underlying protocols are rock solid, and it is open source.</p>
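<p>To give a feel for the command line, here is a minimal sketch of passphrase-based (symmetric) file encryption with <code class="highlighter-rouge">gpg</code>. This assumes a GnuPG 2.x install; the filenames and the passphrase are placeholders, and for real use you would let <code class="highlighter-rouge">gpg</code> prompt for the passphrase interactively instead of passing it on the command line:</p>

```shell
# Create a throwaway plaintext file.
echo "secret notes" > notes.txt

# Encrypt symmetrically with AES256; --batch and --pinentry-mode loopback
# allow a non-interactive passphrase (GnuPG 2.1+).
gpg --batch --yes --pinentry-mode loopback \
    --passphrase "use-a-real-passphrase-here" \
    --symmetric --cipher-algo AES256 \
    --output notes.txt.gpg notes.txt

# Decrypt again to verify the round trip.
gpg --batch --yes --pinentry-mode loopback \
    --passphrase "use-a-real-passphrase-here" \
    --output notes-restored.txt --decrypt notes.txt.gpg

diff notes.txt notes-restored.txt && echo "round-trip OK"
```

<p>For email, most people do not call <code class="highlighter-rouge">gpg</code> by hand but use public-key encryption through a mail-client integration; the command-line workflow above is mainly useful for encrypting individual files or backups.</p>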
<h2 id="services">Services</h2>
<h3 id="duckduckgo">DuckDuckGo</h3>
<p>Search engine that respects your privacy.</p>
<p><em>Learning curve: ⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://duckduckgo.com/">https://duckduckgo.com/</a></em> <br /></p>
<p>Complementing a privacy-enhanced web browser, it makes sense to use a search engine that respects your privacy. A great choice is <code class="highlighter-rouge">DuckDuckGo</code>. While not perfect, for your normal day-to-day use it gets the job more than done, and if you feel that you are missing out on something, you can always head over to <code class="highlighter-rouge">google</code>.</p>