Jekyll2019-09-30T10:57:14+02:00http://www.pokutta.com/blog/One trivial observation at a timeEverything Mathematics, Optimization, Machine Learning, and Artificial IntelligenceSCIP x Raspberry Pi: SCIP on Edge2019-09-29T08:00:00+02:002019-09-29T08:00:00+02:00http://www.pokutta.com/blog/random/2019/09/29/scipberry<p><em>TL;DR: Running SCIP on a Raspberry Pi 4 with relatively moderate performance losses (compared to a standard machine) of a factor of 3-5 brings Integer Programming into the realm of Edge Computing.</em>
<!--more--></p>
<p>Edge Computing is concerned with the deployment of compute and algorithms close to where they are actually needed. Thinking about it for a few minutes, one easily comes up with a lot of good reasons why one might want to do this. From <a href="https://en.wikipedia.org/wiki/Edge_computing">Wikipedia</a>:</p>
<blockquote>
<p><strong>Edge computing</strong> is a <a href="https://en.wikipedia.org/wiki/Distributed_computing">distributed computing</a> paradigm which brings <a href="https://en.wikipedia.org/wiki/Computation">computation</a> and <a href="https://en.wikipedia.org/wiki/Data_storage">data storage</a> closer to the location where it is needed, to improve response times and save bandwidth.</p>
</blockquote>
<p>For example, there is great hardware out there to bring deep learning applications to the edge, with the <a href="https://developer.nvidia.com/embedded/jetson-tx2-developer-kit">NVidia Jetson TX kits</a> being one example in this category.</p>
<p>For completely different but not unrelated reasons, I have recently been thinking a lot about the interplay of hardware and software and, in particular, the potential of, e.g., FPGAs to realize customized functions in hardware to better support the algorithms (and their implementations) that we care about: both for a better energy footprint and deployment closer to the edge on the one end of the spectrum and for highest-performance operations on the other end. To make things more tangible, think of, e.g., a specialized FPGA for Integer Programming. Why? While we have great solutions to deploy, e.g., deep learning applications on the edge, there is <em>nothing</em> out there for deploying integer programming codes, i.e., discrete decision making, on the edge. As such I was curious to get <a href="https://scip.zib.de/">SCIP</a> up and running on a <a href="https://www.raspberrypi.org/products/raspberry-pi-4-model-b/">Raspberry Pi 4 B (4GB RAM)</a> board (RPi 4), which can be bought for $55, e.g., on Amazon. Tentative working title: <em>SCIPberry</em> (every Raspberry Pi project needs to have “berry” in its name).</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/scipberry/backgroundSCIP.png" alt="scipberry" /></p>
<p>(Polyhedral raspberry logo by <a href="https://www.reddit.com/user/SiRo126/">SiRo126</a>)</p>
<h2 id="scip-on-edge">SCIP on Edge</h2>
<p>So first let us have a look at all the required pieces.</p>
<h3 id="the-hardware">The hardware</h3>
<p>A <a href="https://www.raspberrypi.org/products/raspberry-pi-4-model-b/">Raspberry Pi 4 with 4GB of RAM</a>. As Integer Programs can get large we need some memory and hence the 4GB version. The RPi 4 is really a tiny device that can easily fit into the palm of your hand. See this image from <a href="https://www.raspberrypi.org/products/raspberry-pi-4-model-b/">raspberrypi.org</a>:</p>
<p class="minimg"><img src="http://www.pokutta.com/blog/assets/scipberry/raspberrypi4.png" alt="raspberrypi4" /></p>
<p>At a price point of about $55 you might think that this is not much more than a toy, but the RPi 4 is surprisingly powerful; check out the <a href="https://www.raspberrypi.org/magpi/raspberry-pi-4-specs-benchmarks/">benchmarks</a>. Moreover, you can actually run <a href="https://github.com/JuliaBerry">Julia</a>, <a href="https://www.wolfram.com/raspberry-pi/">Mathematica</a>, and <a href="https://www.raspberrypi.org/documentation/usage/python/">Python</a> on it. Not bad for $55!</p>
<p><strong>Interlude:</strong> As a feasibility study I have been running my day-to-day work on an RPi 4 for a few weeks now and it is refreshingly sufficient. Effectively almost all services that I use in my day-to-day operations are now cloud based, so you can run them in a web browser, and with Raspbian’s Chromium (the open source base of Chrome) you get quite far. Moreover, there is also <a href="https://linuxhint.com/install_firefox_raspberry_pi/">Firefox available</a> for Raspbian. Interestingly, the user agent detection of Google Calendar forces Chromium to display Google Calendar in some weird mobile version, which made me look for Firefox in the first place.</p>
<h4 id="purely-optional-and-your-own-risk-spiked-berrys">Purely optional and at your own risk: Spiked berries</h4>
<p>The Raspberry Pi is an extremely versatile and flexible device. In fact, you can even overclock it easily by simply changing its startup configuration. How far you can go with this depends on how lucky you have been in the silicon lottery, and I strongly recommend first reading <a href="https://www.tomshardware.com/reviews/raspberry-pi-4-b-overclocking,6188.html">this article about overclocking an RPi 4</a> and <a href="https://www.tomshardware.com/reviews/raspberry-pi-4-overclock-2-ghz,6254.html">this one for how to push it to 2 GHz</a>; your mileage may vary. It looks like I have been lucky in the silicon lottery, as I pushed my RPi 4 to 2 GHz without any issues and with perfectly stable behavior over multiple days and under full load of all four cores with varying workloads. Very important: you <em>will</em> need an active cooling case, as otherwise you will run into thermal throttling, negating the effect of the overclocking. For example, the Miuzei case with active cooling (<a href="https://www.amazon.de/gp/product/B07TYW63M8/ref=ppx_yo_dt_b_asin_title_o00_s00?ie=UTF8&psc=1&pldnSite=1">link to Amazon Germany</a>; the same can be found on, e.g., Amazon US) for about 18 Euro (or around $20) works great. You can barely hear the fan, and it keeps the RPi 4 at 69°C under full load of all four cores, so rather far away from the thermal throttling point of 80°C. Moreover, the power adapter provides enough juice to support the over-voltage we have to provide. To see what overclocking buys you in terms of performance, you can check out <a href="https://www.tomshardware.com/reviews/raspberry-pi-4-overclock-2-ghz,6254.html">the benchmarks at the end of the overclocking article</a>. The short version is: it buys you somewhere between 3% and 33% in their tests, and in our tests later we pretty much get the full 33%.</p>
<p>In <code class="highlighter-rouge">/boot/config.txt</code> (which you can edit, e.g., with <code class="highlighter-rouge">sudo nano /boot/config.txt</code>), I use the following settings for overclocking in the <code class="highlighter-rouge">[pi4]</code> section.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">over_voltage</span><span class="o">=</span>6
<span class="nv">arm_freq</span><span class="o">=</span>2000
<span class="nv">gpu_freq</span><span class="o">=</span>600
</code></pre></div></div>
<p>The <code class="highlighter-rouge">over_voltage=6</code> setting for extra voltage is composed of three increments of <code class="highlighter-rouge">2</code>: the first to go up to 1.75 GHz for the CPU, another to go up to 2 GHz, and the final one for the GPU overclocking.</p>
<p><strong>Note:</strong> This is at your <strong>own risk</strong> and I strongly suggest reading <a href="https://www.tomshardware.com/reviews/raspberry-pi-4-b-overclocking,6188.html">this article about overclocking the RPi 4</a> and <a href="https://www.tomshardware.com/reviews/raspberry-pi-4-overclock-2-ghz,6254.html">this one for how to push it to 2 GHz</a> to understand how to troubleshoot and fix things if something goes wrong.</p>
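<p>To check whether an overclocked board actually stays below the throttling point, one option is to poll the firmware tool <code class="highlighter-rouge">vcgencmd</code>. The following is only a minimal sketch: the helper names (<code class="highlighter-rouge">parse_temp</code>, <code class="highlighter-rouge">parse_throttled</code>, <code class="highlighter-rouge">report</code>) are made up for illustration, and the bit meanings are my reading of the Raspberry Pi documentation for <code class="highlighter-rouge">vcgencmd get_throttled</code>, so please double check them for your firmware version.</p>

```python
import re
import subprocess

THROTTLE_POINT_C = 80.0  # thermal throttling point quoted in the text above

# Assumed bit meanings of `vcgencmd get_throttled` (check your firmware docs).
FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def parse_temp(output):
    """Parse a line such as "temp=58.0'C" into degrees Celsius."""
    match = re.search(r"temp=([0-9.]+)'C", output)
    if match is None:
        raise ValueError(f"unexpected output: {output!r}")
    return float(match.group(1))


def parse_throttled(output):
    """Decode a line such as "throttled=0x50005" into a list of messages."""
    value = int(output.strip().split("=")[1], 16)
    return [msg for bit, msg in sorted(FLAGS.items()) if value & (1 << bit)]


def report():
    """Query the firmware (only works on a Raspberry Pi) and print a summary."""
    def run(*args):
        result = subprocess.run(["vcgencmd", *args], capture_output=True,
                                text=True, check=True)
        return result.stdout

    temp = parse_temp(run("measure_temp"))
    print(f"temperature: {temp:.1f} C "
          f"({THROTTLE_POINT_C - temp:.1f} C below the throttling point)")
    for message in parse_throttled(run("get_throttled")):
        print("warning:", message)
```

<p>Calling <code class="highlighter-rouge">report()</code> periodically while the board is under full load gives a rough idea of whether the cooling and the power supply keep up with the overclocked settings.</p>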
<h3 id="the-software">The software</h3>
<p>On the software side, I went with the <a href="https://www.raspberrypi.org/downloads/raspbian/">Raspbian Buster</a> image from <a href="https://www.raspberrypi.org/">Raspberrypi.org</a>. Then I did the usual package updates, installed <code class="highlighter-rouge">cmake</code> with <code class="highlighter-rouge">sudo apt-get install cmake</code>, and compiled the <a href="https://scip.zib.de">SCIP Optimization Suite</a>. Compilation worked right out of the box thanks to <code class="highlighter-rouge">cmake</code> and appropriate build files. Compile time on a stock RPi 4 is about 40 minutes for the whole optimization suite:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>real 41m38.569s
user 38m39.830s
sys 2m41.798s
</code></pre></div></div>
<h4 id="a-few-notes-on-compilation-etc">A few notes on compilation etc</h4>
<p>There seems to be a minor issue with the <a href="https://www.raspberrypi.org/forums/viewtopic.php?t=245846">Linux kernel reporting the wrong ARM architecture</a>, which might lead to suboptimal compiler flags as determined by <code class="highlighter-rouge">cmake</code> based on the reported architecture; this is due to the upstream Linux kernel implementation. At the same time I did some preliminary tests, and the compiler flags can have a substantial performance impact, in particular the configuration of the floating point unit (easily a factor of 2). We intend to make Raspberry Pi binaries of SCIP available relatively soon, once we have tuned the compilation for the ARM architecture. Meanwhile, you can simply compile SCIP from the sources with its out-of-the-box configuration, which is stable but not yet optimized for the ARM architecture. <strong>If some ARM compilation expert, in particular w.r.t. floating point arithmetic, reads this, please drop me a line!</strong></p>
<h2 id="performance-and-benchmarks">Performance and benchmarks</h2>
<p>I did two comparisons in terms of performance. The first one is between a MacBook Pro and the RPi 4 on some standard MIPLIB instances and the second one is a full MIPLIB 2017 benchmark run.</p>
<h3 id="unscientific-comparison">Unscientific comparison</h3>
<p>The comparison below is between a MacBook Pro (Core i7 3.5 GHz with 16GB RAM), a stock Raspberry Pi 4, and a spiked Raspberry Pi 4 (running at 2 GHz for the CPU and 600 MHz for the GPU) on a few select instances from <a href="https://miplib.zib.de/">MIPLIB 2017</a>, all running SCIP. As discussed above, the Raspberry Pi 4 version of SCIP is not yet optimized for the ARM architecture (the stock and spiked versions do use <code class="highlighter-rouge">-mcpu</code> and <code class="highlighter-rouge">-mtune</code> flags for the Cortex A72 architecture though), so there are likely more speed improvements to be gained. The table reports the time in seconds as well as how many times slower the RPi 4 is than the MBP. While this is not a scientific or complete benchmark, we get a pretty good idea. Effectively, we are talking about a slowdown factor of 3-5, which basically means that we are not changing categories: seconds remain seconds and minutes remain minutes, so that in actual applications there is not <em>that much</em> of a difference.</p>
<p>Also note that the numbers below are <em>single core</em> performance; however, the RPi 4 is a quad-core design, so there might be additional speedups to be gained.</p>
<table>
<thead>
<tr>
<th style="text-align: right">Instance</th>
<th style="text-align: right">sec (MBP)</th>
<th style="text-align: right">sec (stock)</th>
<th style="text-align: right">x (stock)</th>
<th style="text-align: right">sec (spiked)</th>
<th style="text-align: right">x (spiked)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">air05</td>
<td style="text-align: right">34.64</td>
<td style="text-align: right">184.32</td>
<td style="text-align: right">5.32</td>
<td style="text-align: right">143.21</td>
<td style="text-align: right">4.13</td>
</tr>
<tr>
<td style="text-align: right">beasleyC3</td>
<td style="text-align: right">20.68</td>
<td style="text-align: right">80.49</td>
<td style="text-align: right">3.89</td>
<td style="text-align: right">62.99</td>
<td style="text-align: right">3.05</td>
</tr>
<tr>
<td style="text-align: right">cbs-cta</td>
<td style="text-align: right">8.05</td>
<td style="text-align: right">40.55</td>
<td style="text-align: right">5.04</td>
<td style="text-align: right">34.27</td>
<td style="text-align: right">4.26</td>
</tr>
<tr>
<td style="text-align: right">pk1</td>
<td style="text-align: right">178.80</td>
<td style="text-align: right">645.73</td>
<td style="text-align: right">3.61</td>
<td style="text-align: right">500.54</td>
<td style="text-align: right">2.80</td>
</tr>
<tr>
<td style="text-align: right">pg</td>
<td style="text-align: right">17.47</td>
<td style="text-align: right">72.31</td>
<td style="text-align: right">4.14</td>
<td style="text-align: right">58.06</td>
<td style="text-align: right">3.32</td>
</tr>
<tr>
<td style="text-align: right">neos-1122047</td>
<td style="text-align: right">3.81</td>
<td style="text-align: right">12.78</td>
<td style="text-align: right">3.35</td>
<td style="text-align: right">10.83</td>
<td style="text-align: right">2.84</td>
</tr>
<tr>
<td style="text-align: right">timtab1</td>
<td style="text-align: right">59.83</td>
<td style="text-align: right">223.47</td>
<td style="text-align: right">3.74</td>
<td style="text-align: right">179.33</td>
<td style="text-align: right">2.99</td>
</tr>
<tr>
<td style="text-align: right">dano3_5</td>
<td style="text-align: right">206.12</td>
<td style="text-align: right">1275.86</td>
<td style="text-align: right">6.19</td>
<td style="text-align: right">1036.74</td>
<td style="text-align: right">5.03</td>
</tr>
<tr>
<td style="text-align: right">hypothyroid-k1</td>
<td style="text-align: right">19.80</td>
<td style="text-align: right">104.96</td>
<td style="text-align: right">5.30</td>
<td style="text-align: right">93.15</td>
<td style="text-align: right">4.70</td>
</tr>
<tr>
<td style="text-align: right">swath3</td>
<td style="text-align: right">279.40</td>
<td style="text-align: right">1185.95</td>
<td style="text-align: right">4.24</td>
<td style="text-align: right">1000.58</td>
<td style="text-align: right">3.58</td>
</tr>
<tr>
<td style="text-align: right">unitcal_7</td>
<td style="text-align: right">381.70</td>
<td style="text-align: right">1853.92</td>
<td style="text-align: right">4.86</td>
<td style="text-align: right">1489.52</td>
<td style="text-align: right">3.90</td>
</tr>
<tr>
<td style="text-align: right">CMS750_4</td>
<td style="text-align: right">922.35</td>
<td style="text-align: right">4346.98</td>
<td style="text-align: right">4.71</td>
<td style="text-align: right">3379.95</td>
<td style="text-align: right">3.66</td>
</tr>
<tr>
<td style="text-align: right">istanbul-no-cutoff</td>
<td style="text-align: right">98.07</td>
<td style="text-align: right">501.13</td>
<td style="text-align: right">5.11</td>
<td style="text-align: right">408.17</td>
<td style="text-align: right">4.16</td>
</tr>
</tbody>
</table>
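<p>To aggregate the table into a single number per configuration, one can, for example, take the geometric mean of the per-instance factors. The sketch below does this for the timings copied from the table above; using the geometric mean as the aggregate is my choice for illustration here and not part of the original measurements.</p>

```python
from math import exp, log

# (instance, MBP seconds, stock RPi 4 seconds, spiked RPi 4 seconds),
# copied from the table above.
TIMES = [
    ("air05", 34.64, 184.32, 143.21),
    ("beasleyC3", 20.68, 80.49, 62.99),
    ("cbs-cta", 8.05, 40.55, 34.27),
    ("pk1", 178.80, 645.73, 500.54),
    ("pg", 17.47, 72.31, 58.06),
    ("neos-1122047", 3.81, 12.78, 10.83),
    ("timtab1", 59.83, 223.47, 179.33),
    ("dano3_5", 206.12, 1275.86, 1036.74),
    ("hypothyroid-k1", 19.80, 104.96, 93.15),
    ("swath3", 279.40, 1185.95, 1000.58),
    ("unitcal_7", 381.70, 1853.92, 1489.52),
    ("CMS750_4", 922.35, 4346.98, 3379.95),
    ("istanbul-no-cutoff", 98.07, 501.13, 408.17),
]


def geometric_mean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    values = list(values)
    return exp(sum(log(v) for v in values) / len(values))


# How many times slower each RPi 4 configuration is than the MBP.
stock_slowdown = geometric_mean(stock / mbp for _, mbp, stock, _ in TIMES)
spiked_slowdown = geometric_mean(spiked / mbp for _, mbp, _, spiked in TIMES)
# Speedup of the spiked over the stock RPi 4.
overclock_gain = geometric_mean(stock / spiked for _, _, stock, spiked in TIMES)

print(f"stock RPi 4 vs. MBP:   {stock_slowdown:.2f}x slower")
print(f"spiked RPi 4 vs. MBP:  {spiked_slowdown:.2f}x slower")
print(f"spiked vs. stock:      {overclock_gain:.2f}x faster")
```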
<h3 id="miplib-2017-benchmark-run">MIPLIB 2017 benchmark run</h3>
<p>The second test is a single-core <a href="https://miplib.zib.de">MIPLIB 2017</a> benchmark run. Given that it took a couple of days to complete, I only did the run for the spiked version.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>----------------------------+----------------+----------------+------+---------+-------+--------+
Name | Dual Bound | Primal Bound | Gap% | Nodes | Time | Status |
----------------------------+----------------+----------------+------+---------+-------+--------+
30n20b8 302 302 0.0 122 1225 ok
50v-10 3246.37511 3313.17999 2.1 79593 3601 stopped
academictimetablesmall 7.10542736e-15 1e+20 -- 250 3601 stopped
air05 26374 26374 0.0 418 144 ok
app1-1 -3 -3 0.0 3 27 ok
app1-2 -41 -41 0.0 24 2473 ok
assign1-5-8 199.527686 212 6.3 1096572 3601 stopped
atlanta-ip 83.0314859 91.0099752 9.6 533 3601 stopped
b1c1s1 21152.5997 25486.73 20.5 2475 3600 stopped
bab2 -358777.078 1e+20 -- 1 3607 stopped
bab6 -288604.802 1e+20 -- 1 3605 stopped
beasleyC3 754 754 0.0 6 64 ok
binkar10_1 6742.20002 6742.20002 0.0 3092 192 ok
blp-ar98 6168.05653 6457.70502 4.7 7103 3601 stopped
blp-ic98 4440.63685 4720.2617 6.3 15310 3602 stopped
bnatt400 1 1 0.0 5709 1218 ok
bnatt500 1e+20 1e+20 -- 20612 2854 ok
bppc4-08 52 54 3.8 114635 3602 stopped
brazil3 24 1e+20 -- 152 3600 stopped
buildingenergy 33246.2151 42652.3398 28.3 1 3606 stopped
cbs-cta 0 0 0.0 1 35 ok
chromaticindex1024-7 3 4 33.3 1 3730 stopped
chromaticindex512-7 3 4 33.3 44 3602 stopped
cmflsp50-24-8-8 54860290.3 1e+20 -- 4184 3601 stopped
CMS750_4 252 252 0.0 14526 3290 ok
co-100 1936126.27 10724685.1 453.9 1 3605 stopped
cod105 -18.2857143 -12 52.4 65 3601 stopped
comp07-2idx 6 823 Large 5 3601 stopped
comp21-2idx 41.0023016 295 619.5 129 3600 stopped
cost266-UUE 23987103.2 25148940.6 4.8 67004 3601 stopped
cryptanalysiskb128n5obj14 0 1e+20 -- 1 3602 stopped
cryptanalysiskb128n5obj16 0 1e+20 -- 1 3602 stopped
csched007 326.692847 353 8.1 89932 3601 stopped
csched008 171.342524 173 1.0 91573 3600 stopped
cvs16r128-89 -120.632745 -93 29.7 727 3601 stopped
dano3_3 576.344633 576.344633 0.0 19 587 ok
dano3_5 576.924916 576.924916 0.0 60 1011 ok
decomp2 -160 -160 0.0 1 13 ok
drayage-100-23 103333.874 103333.874 0.0 39 61 ok
drayage-25-23 101208.076 101282.647 0.1 93492 3600 stopped
dws008-01 20712.5472 59272.7628 186.2 7329 3600 stopped
eil33-2 934.007916 934.007916 0.0 717 337 ok
eilA101-2 805.920228 1240.02356 53.9 303 3604 stopped
enlight_hard 37 37 0.0 1 1 ok
ex10 100 100 0.0 1 2338 ok
ex9 81 81 0.0 1 158 ok
exp-1-500-5-5 65887 65887 0.0 1 9 ok
fast0507 174 174 0.0 528 571 ok
fastxgemm-n2r6s0t2 29 230 693.1 144518 3601 stopped
fhnw-binpack4-4 0 1e+20 -- 2772713 3600 stopped
fhnw-binpack4-48 0 1e+20 -- 461981 3600 stopped
fiball 138 140 1.4 10323 3601 stopped
gen-ip002 -4796.88522 -4783.73339 0.3 4003765 3603 stopped
gen-ip054 6803.0169 6840.96564 0.6 4873762 3606 stopped
germanrr 46795848.8 48328053.1 3.3 167 3601 stopped
gfd-schedulen180f7d50m30k18 1 1e+20 -- 1 3607 stopped
glass-sc 19.2345968 23 19.6 40518 3601 stopped
glass4 900005372 1.600014e+09 77.8 595600 3601 stopped
gmu-35-40 -2406903.88 -2406207.39 0.0 476087 3604 stopped
gmu-35-50 -2608070.29 -2607011 0.0 264255 3606 stopped
graph20-20-1rand -24.9879502 -9 177.6 605 3600 stopped
graphdraw-domain 17998.5455 19686 9.4 1549875 3601 stopped
h80x6320d 6382.09905 6382.09905 0.0 4 596 ok
highschool1-aigio 0 1e+20 -- 1 3607 stopped
hypothyroid-k1 -2851 -2851 0.0 1 94 ok
ic97_potential 3887 3948 1.6 1053664 3601 stopped
icir97_tension 6362 6386 0.4 370916 3601 stopped
irish-electricity 2934051.39 1e+20 -- 1 3602 stopped
irp 12159.4928 12159.4928 0.0 7 74 ok
istanbul-no-cutoff 204.081749 204.081749 0.0 229 396 ok
k1mushroom -1e+20 1e+20 -- 0 3618 stopped
lectsched-5-obj 15 47 213.3 4287 3601 stopped
leo1 400019518 410793643 2.7 17966 3601 stopped
leo2 393957837 418226155 6.2 6883 3601 stopped
lotsize 1466743.39 1514999 3.3 623 3601 stopped
mad 4.23272528e-15 0.028 -- 1758648 3602 stopped
map10 -522.229315 -495 5.5 1221 3604 stopped
map16715-04 -202.53321 -83 144.0 265 3605 stopped
markshare_4_0 1 1 0.0 2552263 864 ok
markshare2 0 17 -- 1942425 3604 stopped
mas74 11405.2315 11801.1857 3.5 2857955 3602 stopped
mas76 40005.0541 40005.0541 0.0 269818 444 ok
mc11 11689 11689 0.0 2247 372 ok
mcsched 211913 211913 0.0 11189 755 ok
mik-250-20-75-4 -52301 -52301 0.0 28546 189 ok
milo-v12-6-r2-40-1 277158.347 326481.143 17.8 11372 3601 stopped
momentum1 96365.4136 128477.452 33.3 4759 3600 stopped
mushroom-best 0.0174384818 0.0553337612 217.3 6318 3600 stopped
mzzv11 -21718 -21718 0.0 2233 1477 ok
mzzv42z -20540 -20540 0.0 261 621 ok
n2seq36q 52200 52200 0.0 2740 3431 ok
n3div36 125388.755 131000 4.5 12248 3603 stopped
n5-3 8105 8105 0.0 607 98 ok
neos-1122047 161 161 0.0 1 23 ok
neos-1171448 -309 -308 0.3 61 3601 stopped
neos-1171737 -195 -191 2.1 234 3600 stopped
neos-1354092 36 1e+20 -- 3 3601 stopped
neos-1445765 -17783 -17783 0.0 75 214 ok
neos-1456979 157.283966 204 29.7 3148 3601 stopped
neos-1582420 91 91 0.0 252 94 ok
neos-2075418-temuka 0 1e+20 -- 1 3614 stopped
neos-2657525-crna 0 1.810748 -- 715693 3602 stopped
neos-2746589-doon 1993.43007 1e+20 -- 881 3602 stopped
neos-2978193-inde -2.4017989 -2.38806169 0.6 38669 3600 stopped
neos-2987310-joes -607702988 -607702988 0.0 1 72 ok
neos-3004026-krka 0 0 0.0 2020 245 ok
neos-3024952-loue 26756 26756 0.0 56581 3459 ok
neos-3046615-murg 538.135067 1610 199.2 1551986 3606 stopped
neos-3083819-nubu 6307996 6307996 0.0 1138 47 ok
neos-3216931-puriri 59191.1268 1e+20 -- 202 3601 stopped
neos-3381206-awhea 453 453 0.0 1 4 ok
neos-3402294-bobin 1.11022302e-16 0.06725 -- 2765 3606 stopped
neos-3402454-bohle -1e+20 1e+20 -- 0 43 stopped
neos-3555904-turama -40.95 -33.2 23.3 3 3604 stopped
neos-3627168-kasai 988203.134 988585.62 0.0 462893 3600 stopped
neos-3656078-kumeu -18413.2 1e+20 -- 1 3601 stopped
neos-3754480-nidda -352051.784 13747.5367 -- 1952395 3602 stopped
neos-3988577-wolgan 119 1e+20 -- 13 3601 stopped
neos-4300652-rahue 0.128756061 5.2121 3948.0 103 3606 stopped
neos-4338804-snowy 1447 1477 2.1 554130 3601 stopped
neos-4387871-tavua 28.8473171 34.79894 20.6 1310 3600 stopped
neos-4413714-turia 45.370167 45.370167 0.0 2 2082 ok
neos-4532248-waihi 0.370420217 1e+20 -- 1 3616 stopped
neos-4647030-tutaki 27265.1927 27271.257 0.0 231 3609 stopped
neos-4722843-widden 25009.6634 25009.6634 0.0 2623 3368 ok
neos-4738912-atrato 283627957 283627957 0.0 46900 2055 ok
neos-4763324-toguru 1142.84659 2240.0651 96.0 2 3604 stopped
neos-4954672-berkel 2308705.23 2633312 14.1 94247 3601 stopped
neos-5049753-cuanza 550.216667 1e+20 -- 1 3607 stopped
neos-5052403-cygnet 179.500371 290 61.6 1 3609 stopped
neos-5093327-huahum 5192.21511 6506 25.3 2403 3602 stopped
neos-5104907-jarama 642.256923 1e+20 -- 1 3609 stopped
neos-5107597-kakapo 1864.20375 3690 97.9 171822 3600 stopped
neos-5114902-kasavu -1e+20 1e+20 -- 0 26 stopped
neos-5188808-nattai 0 0.110287132 -- 6443 3601 stopped
neos-5195221-niemur 0.000977767 0.003863653 295.2 14256 3602 stopped
neos-631710 0 215 -- 1 3606 stopped
neos-662469 184368.162 184544.5 0.1 4717 3600 stopped
neos-787933 30 30 0.0 1 7 ok
neos-827175 112.00152 112.00152 0.0 1 107 ok
neos-848589 2302.61937 2528.6184 9.8 3 3612 stopped
neos-860300 3201 3201 0.0 2 67 ok
neos-873061 105.645552 121.460195 15.0 1 3604 stopped
neos-911970 54.76 54.76 0.0 455585 2952 ok
neos-933966 318 4398 1283.0 8 3601 stopped
neos-950242 1.22222222 4 227.3 194 3600 stopped
neos-957323 -237.756681 -237.756681 0.0 1 295 ok
neos-960392 -238 0 -- 13 3601 stopped
neos17 0.150002577 0.150002577 0.0 19658 115 ok
neos5 15 15 0.0 1812079 1944 ok
neos8 -3719 -3719 0.0 1 9 ok
neos859080 1e+20 1e+20 -- 661 2 ok
net12 214 214 0.0 1334 3292 ok
netdiversion 237.111111 242 2.1 12 3604 stopped
nexp-150-20-8-5 230.447378 235 2.0 16 3601 stopped
ns1116954 0 1e+20 -- 1 3602 stopped
ns1208400 2 2 0.0 990 723 ok
ns1644855 -1524.33333 -1419.66667 7.4 1 3605 stopped
ns1760995 -1e+20 1e+20 -- 0 3662 stopped
ns1830653 20622 20622 0.0 6176 434 ok
ns1952667 0 0 0.0 1382 1164 ok
nu25-pr12 53905 53905 0.0 83 22 ok
nursesched-medium-hint03 75.8661003 8081 Large 1 3602 stopped
nursesched-sprint02 58 58 0.0 6 155 ok
nw04 16862 16862 0.0 7 105 ok
opm2-z10-s4 -45884.456 -29300 56.6 3 3602 stopped
p200x1188c 15078 15078 0.0 2 12 ok
peg-solitaire-a3 1 1e+20 -- 184 3600 stopped
pg -8674.34261 -8674.34261 0.0 559 59 ok
pg5_34 -14350.2009 -14338.0615 0.1 171265 3600 stopped
physiciansched3-3 2609077.71 1e+20 -- 6 3604 stopped
physiciansched6-2 49324 49324 0.0 142 727 ok
piperout-08 125055 125055 0.0 110 2050 ok
piperout-27 8124 8124 0.0 2 612 ok
pk1 11 11 0.0 406826 535 ok
proteindesign121hz512p9 0 1e+20 -- 0 920 stopped
proteindesign122trx11p8 -1e+20 1e+20 -- 0 1076 abort
qap10 340 340 0.0 2 361 ok
radiationm18-12-05 17565 17576 0.1 58171 3601 stopped
radiationm40-10-02 155321.712 256218 65.0 1127 3604 stopped
rail01 -92.0873 1e+20 -- 1 3603 stopped
rail02 -6350.94197 1e+20 -- 1 3605 stopped
rail507 174 174 0.0 682 796 ok
ran14x18-disj-8 3650.00897 3734.99999 2.3 264526 3601 stopped
rd-rplusc-21 100 171887.288 Large 24296 3603 stopped
reblock115 -36934261.8 -36800603.2 0.4 104186 3601 stopped
rmatr100-p10 423 423 0.0 777 689 ok
rmatr200-p5 3291.08079 4706 43.0 1 3602 stopped
rocI-4-11 -6020203 -6020203 0.0 11774 387 ok
rocII-5-11 -11.811922 -5.65497492 108.9 4983 3601 stopped
rococoB10-011000 16178.1679 20170 24.7 10307 3600 stopped
rococoC10-001000 11460 11460 0.0 49162 2307 ok
roi2alpha3n4 -69.497701 -63.2084921 9.9 5167 3603 stopped
roi5alpha10n8 -72.6816859 -42.3653204 71.6 314 3609 stopped
roll3000 12890 12890 0.0 3147 181 ok
s100 -1e+20 1e+20 -- 0 101 stopped
s250r10 -0.172620061 -0.1717256 0.5 2 3609 stopped
satellites2-40 -29 49 -- 1 3653 stopped
satellites2-60-fs -29 28 -- 1 3628 stopped
savsched1 -801160.3 31846.3 -- 1 3614 stopped
sct2 -231.063567 -230.989162 0.0 90160 3600 stopped
seymour 417.932149 423 1.2 25264 3601 stopped
seymour1 410.763701 410.763701 0.0 1203 309 ok
sing326 7740242.84 7815051.11 1.0 106 3602 stopped
sing44 8110365.4 8336047.42 2.8 36 3602 stopped
snp-02-004-104 586784451 586903510 0.0 104 3607 stopped
sorrell3 -20.6893407 -12 72.4 1 3603 stopped
sp150x300d 69 69 0.0 121 2 ok
sp97ar 657199509 691953855 5.3 3591 3601 stopped
sp98ar 528030041 530117051 0.4 6957 3601 stopped
splice1k1 -1645.76473 -121 1260.1 1 3618 stopped
square41 8.87035973 51 474.9 3 3627 stopped
square47 -1e+20 1e+20 -- 0 81 stopped
supportcase10 0 18 -- 1 3603 stopped
supportcase12 -82461.0638 0 -- 1 34 stopped
supportcase18 47.1866667 50 6.0 9632 3600 stopped
supportcase19 -1e+20 1e+20 -- 0 29 stopped
supportcase22 0 1e+20 -- 2 3606 stopped
supportcase26 1521.82348 1755.84518 15.4 843082 3602 stopped
supportcase33 -359.701149 -340 5.8 6947 3602 stopped
supportcase40 23849.6271 24422.9041 2.4 7365 3600 stopped
supportcase42 7.75103344 8.02972774 3.6 35452 3602 stopped
supportcase6 45997.8089 51937.682 12.9 127 3605 stopped
supportcase7 -1132.22317 -1132.22317 0.0 159 739 ok
swath1 379.071296 379.071296 0.0 362 65 ok
swath3 397.761344 397.761344 0.0 39302 966 ok
tbfp-network 23.3340112 28.1166667 20.5 33 3646 stopped
thor50dday 32001.5993 58369 82.4 1 3604 stopped
timtab1 764772 764772 0.0 43393 182 ok
tr12-30 130529.549 130596 0.1 568241 3601 stopped
traininstance2 0 79160 -- 2970 3601 stopped
traininstance6 2072 29130 1305.9 17661 3600 stopped
trento1 5183779.47 5534271 6.8 2305 3601 stopped
triptim1 22.8680875 22.8681 0.0 2 3601 stopped
uccase12 11507.3721 11507.4051 0.0 1866 3603 stopped
uccase9 10881.1167 11691.6074 7.4 65 3602 stopped
uct-subprob 300.483452 314 4.5 30374 3600 stopped
unitcal_7 19635558.2 19635558.2 0.0 374 1448 ok
var-smallemery-m6j6 -152.559657 -149.375 2.1 45835 3602 stopped
wachplan -9 -8 12.5 67432 3600 stopped
----------------------------+----------------+----------------+------+---------+-------+--------+
solved/stopped/failed: 79/161/0
@03 MIPLIB script version
@02 timelimit: 3600
@01 SCIP(6.0.2)spx(4.0.2)
</code></pre></div></div>
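<p>If you want to post-process such a result table, e.g., to tally the status column or to collect the running times of the solved instances, a few lines of Python suffice. The following is a sketch under the assumption that each result row has exactly the seven whitespace-separated fields shown above and that header, separator, and summary lines have been removed beforehand; <code class="highlighter-rouge">parse_row</code> and <code class="highlighter-rouge">summarize</code> are hypothetical helper names.</p>

```python
from collections import Counter


def parse_row(line):
    """Split one result row into its fields; Gap% may be '--' or 'Large'."""
    name, dual, primal, gap, nodes, time, status = line.split()
    return {
        "name": name,
        "dual": float(dual),      # 1e+20 / -1e+20 encode "no bound found"
        "primal": float(primal),
        "gap": gap,               # kept as text because it is not always numeric
        "nodes": int(nodes),
        "time": int(time),        # seconds
        "status": status,
    }


def summarize(lines):
    """Return the status tally and the running times of the solved instances."""
    rows = [parse_row(line) for line in lines if line.strip()]
    statuses = Counter(row["status"] for row in rows)
    solved_times = [row["time"] for row in rows if row["status"] == "ok"]
    return statuses, solved_times


# A few rows copied from the run above.
SAMPLE = """\
air05 26374 26374 0.0 418 144 ok
50v-10 3246.37511 3313.17999 2.1 79593 3601 stopped
comp07-2idx 6 823 Large 5 3601 stopped
proteindesign122trx11p8 -1e+20 1e+20 -- 0 1076 abort
""".splitlines()

statuses, solved_times = summarize(SAMPLE)
print(dict(statuses), solved_times)
```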
<h2 id="nerd-corner-additional-rpi-4-benchmarks">Nerd corner: additional RPi 4 benchmarks</h2>
<p>For the curious and for comparison to the stock model, I also ran the <a href="https://github.com/aikoncwd/rpi-benchmark">RPi Benchmark script</a> and a <a href="https://people.sc.fsu.edu/~jburkardt/c_src/linpack_bench/linpack_bench.html">Linpack benchmark</a> on the spiked RPi4.</p>
<h3 id="rpi-benchmark">RPi Benchmark</h3>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Raspberry Pi Benchmark Test
Author: AikonCWD
Version: 3.0
temp=46.0'C
arm_freq=2000
gpu_freq=600
gpu_freq_min=500
sd_clock=50.000 MHz
Running InternetSpeed test...
Ping: 12.172 ms
Download: 31.11 Mbit/s
Upload: 3.76 Mbit/s
Running CPU test...
total time: 6.5163s
min: 2.53ms
avg: 2.61ms
max: 12.92ms
temp=58.0'C
Running THREADS test...
total time: 11.2985s
min: 4.02ms
avg: 4.52ms
max: 42.92ms
temp=62.0'C
Running MEMORY test...
Operations performed: 3145728 (1930915.11 ops/sec)
3072.00 MB transferred (1885.66 MB/sec)
total time: 1.6291s
min: 0.00ms
avg: 0.00ms
max: 11.62ms
temp=63.0'C
Running HDPARM test...
Timing buffered disk reads: 130 MB in 3.01 seconds = 43.19 MB/sec
temp=51.0'C
Running DD WRITE test...
536870912 bytes (537 MB, 512 MiB) copied, 13.9169 s, 38.6 MB/s
temp=48.0'C
Running DD READ test...
536870912 bytes (537 MB, 512 MiB) copied, 11.9338 s, 45.0 MB/s
temp=48.0'C
AikonCWD's rpi-benchmark completed!
</code></pre></div></div>
<h3 id="linpack-benchmark">Linpack benchmark</h3>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>28 September 2019 01:57:50 AM
LINPACK_BENCH
C version
The LINPACK benchmark.
Language: C
Datatype: Double precision real
Matrix order N = 1000
Leading matrix dimension LDA = 1001
Norm. Resid Resid MACHEP X[1] X[N]
6.491510 0.000000 2.220446e-16 1.000000 1.000000
Factor Solve Total MFLOPS Unit Cray-Ratio
2.530031 0.002854 2.532885 263.994088 0.007576 45.230089
LINPACK_BENCH
Normal end of execution.
28 September 2019 01:57:53 AM
</code></pre></div></div>Sebastian PokuttaUniversal Portfolios: how to (not) get rich2019-08-29T01:00:00+02:002019-08-29T01:00:00+02:00http://www.pokutta.com/blog/research/2019/08/29/universalPortfolios<p><em>TL;DR: How to (not) get rich? Running Universal Portfolios online with Online Convex Optimization techniques.</em>
<!--more--></p>
<p><strong>The following does not constitute any investment advice.</strong></p>
<p>In 1956 Kelly Jr., then at Bell Labs, wrote a groundbreaking paper [K] that provided a strong link between Shannon’s (then) newly proposed information theory [S] and gambling (Latane [L] came to similar conclusions around the same time but published them later). Without going into too much detail here (the casual reader is referred to Poundstone’s popular science account [P] and the more technically inclined reader to [K] and [L]; also make sure to check out [C]), what Kelly showed is that, in the context of sequential betting (his example was horse races), the growth of the bettor’s bankroll is upper bounded by $2^r$, where $r$ is the information rate of the <em>private wire</em> of the bettor, i.e., information that only the bettor is privy to, the <em>information advantage</em>. Moreover, Kelly also proposed an optimal strategy, nowadays called the <em>Kelly Strategy</em> or <em>Kelly Betting</em>, that achieves this growth rate; for any substantially different strategy, the limit of that strategy’s return divided by the return of the Kelly strategy goes to $0$.</p>
<h3 id="the-kelly-criterion">The Kelly Criterion</h3>
<p>Given sequential betting opportunities, e.g., in a casino, the <em>key question</em> is how much money of the current bankroll $V$ to invest. What Kelly showed is that the <em>optimal fraction</em> $f^\esx$ to wager is given by:</p>
<script type="math/tex; mode=display">\tag{KellyFraction}
f^\esx \doteq \frac{bp - q}{b},</script>
<p>where $b$ is the net odds (you get $b$ dollars back for each invested dollar on top of the dollar returned to you), $p$ is the probability of winning and $q=1-p$ is the probability of losing. From this formula we can immediately see that we wager a fraction of $0$ if and only if $bp = 1 - p$ or equivalently $p = \frac{1}{b+1}$, i.e., our probability of winning is net flat against the odds, i.e., we have no informational advantage.</p>
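<p>As a quick sanity check of (KellyFraction), here is a minimal sketch in Python; the function name <code class="highlighter-rouge">kelly_fraction</code> is just chosen for illustration.</p>

```python
def kelly_fraction(b, p):
    """Kelly fraction (b*p - q) / b for net odds b and win probability p."""
    if b <= 0:
        raise ValueError("net odds b must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    q = 1.0 - p  # probability of losing
    return (b * p - q) / b


# Even-money bet (b = 1) with a 60% chance of winning.
print(kelly_fraction(1.0, 0.6))
# No edge: p = 1 / (b + 1), so the fraction is (numerically) zero.
print(kelly_fraction(2.0, 1.0 / 3.0))
```

<p>A negative return value corresponds to the case where the odds are against us, i.e., $p$ is smaller than $\frac{1}{b+1}$.</p>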
<p>However, rather than dwelling on the interpretation and the optimality of the above, it is instructive to understand where this formula originates from. In fact, the Kelly fraction naturally arises as the fraction that maximizes the geometric growth of our bankroll under sequential betting or, equivalently, <em>maximizes the expectation of the logarithm of our (terminal) wealth</em>. The rationale behind this objective is as follows. Suppose we have an initial bankroll of $X_0$ and at each time step $t$ we bet a given fraction $f$ of our current bankroll $X_t$. Let us make the simplifying assumption that we have even-money bets, i.e., if we win we get our wager $f X_t$ back plus another $f X_t$ as winnings, and if we lose we lose $f X_t$. As such, after $n$ bets our bankroll is</p>
<script type="math/tex; mode=display">X_n = X_0 (1+f)^S (1-f)^F,</script>
<p>where $S$ is the number of successes and $F$ the number of failures, i.e., in particular $S + F = n$. From this we obtain an <em>average growth (per bet)</em> of</p>
<script type="math/tex; mode=display">\left( \frac{X_n}{X_0} \right)^{1/n} = (1+f)^{S/n} (1-f)^{F/n},</script>
<p>where $S/n \rightarrow p$ the <em>probability of success</em> and $F/n \rightarrow 1-p$ the <em>probability of failure</em>. The term on the left is the <em>growth rate per bet on average</em> and we want to find an $f$ that maximizes this quantity via the right-hand side. In log-world this is equivalent to</p>
<script type="math/tex; mode=display">\frac{1}{n} (\log X_n - \log X_0) = \frac{S}{n} \log (1 + f) + \frac{F}{n} \log (1-f),</script>
<p>or in the limit for $n$ large we obtain:</p>
<script type="math/tex; mode=display">\tag{maxExpLog}
\mathbb E[g(f)] = p \log (1 + f) + (1-p) \log (1-f),</script>
<p>where $\mathbb E[g(f)]$ is the <em>expected growth rate per bet when betting fraction $f$</em>, which is independent of $n$ now. Note that it can also easily be seen that betting a fixed fraction $f$ (i.e., independent of time) is sufficient here provided that the individual bets are independent and we care for the (maximization of the) expected growth rate.</p>
<p>Now we can simply maximize the right-hand side by computing a critical point:</p>
<script type="math/tex; mode=display">0 = p \frac{1}{1+f} - (1-p) \frac{1}{1-f} \Rightarrow f^\esx = 2p -1,</script>
<p>which is the Kelly fraction for this simple case. Again, if the probability of winning is exactly $p = 0.5$, i.e., we do not have any information advantage, then the optimal fraction $f^\esx = 0$. Observe, that as soon as $p \neq 0.5$ we have some form of informational advantage and the Kelly fraction $f^\esx \neq 0$: if it is positive it is beneficial to bet on the outcome (being the bet long) and if it is negative it is beneficial to bet on the inverse of the outcome (being the bet short). The latter is not always possible in traditional betting, but it is in investing, where we can short.</p>
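<p>The critical point computation can be sanity-checked numerically. The sketch below maximizes the right-hand side of (maxExpLog) by brute force over a grid of fractions; the grid resolution and the function names are arbitrary choices made for illustration only.</p>

```python
import math


def expected_log_growth(f: float, p: float) -> float:
    """p*log(1+f) + (1-p)*log(1-f) for an even wager bet, as in (maxExpLog)."""
    return p * math.log(1.0 + f) + (1.0 - p) * math.log(1.0 - f)


def argmax_on_grid(p: float, steps: int = 2000) -> float:
    """Brute-force maximizer over f in (-1, 1); a sanity check, not an efficient method."""
    grid = [-1.0 + 2.0 * k / steps for k in range(1, steps)]
    return max(grid, key=lambda f: expected_log_growth(f, p))


# Compare the grid maximizer with the closed form 2p - 1.
for p in (0.5, 0.6, 0.45):
    print(p, argmax_on_grid(p), 2.0 * p - 1.0)
```

<p>Up to the grid resolution, the maximizer agrees with $2p - 1$, including the negative value for $p = 0.45$, which corresponds to betting on the inverse outcome.</p>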
<p>Following the same logic as above, the general formula can be derived by maximizing the <em>expected logarithmic terminal wealth</em> $\mathbb E \log W_T(f)$, where $W_T(f)$ is the <em>(terminal) wealth</em> at time $T$ provided we bet a fixed fraction $f$. We can also see from the derivation that the limit outcome is very sensitive to overestimations of $f^\esx$: the expected growth rate decreases beyond $f^\esx$ and eventually turns negative. In particular, if our estimation of $p$ has some error (which it usually has) leading to an estimation $\hat f$ that is sufficiently larger than $f^\esx$ (roughly $\hat f > 2 f^\esx$ for small edges), then ruin in the limit is guaranteed. Therefore in practice people devised various strategies to combat overbetting, such as, e.g., <em>half Kelly</em>, where only $\frac{1}{2}\hat f$ is invested. This significantly cuts down the risk of overbetting while still providing about $3/4$ of the expected growth rate; see <a href="https://blogs.cfainstitute.org/investor/2018/06/14/the-kelly-criterion-you-dont-know-the-half-of-it/">here</a>, <a href="https://www.pinnacle.com/en/betting-articles/Betting-Strategy/fractional-kelly-criterion/GBD27Z9NLJVGFLGG">here</a>, or <a href="https://www.bettingexpert.com/en-au/learn/successful-betting/the-kelly-criterion#gref">here</a> for some of many online discussions. Also, it seems that in practical betting the Kelly criterion tends to work quite well (see, e.g., [T, T3] or [P]); however, this is beyond the scope of this post.</p>
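<p>To get a feeling for under- and overbetting, the following sketch evaluates the expected growth rate of the even wager bet at multiples of the Kelly fraction. The win probability $p = 0.55$ is a hypothetical example value and the helper <code>growth</code> is just the right-hand side of (maxExpLog).</p>

```python
import math


def growth(f: float, p: float) -> float:
    """Expected log growth per bet for an even wager bet with win probability p."""
    return p * math.log(1.0 + f) + (1.0 - p) * math.log(1.0 - f)


p = 0.55                 # hypothetical win probability
f_star = 2.0 * p - 1.0   # Kelly fraction for the even wager bet

# Growth rate at multiples of the Kelly fraction, relative to the optimum.
for multiple in (0.5, 1.0, 1.5, 2.0, 2.5):
    ratio = growth(multiple * f_star, p) / growth(f_star, p)
    print(f"{multiple:.1f} x Kelly: {ratio:+.3f} of the optimal growth rate")
```

<p>For this choice of $p$, betting half the Kelly fraction retains roughly three quarters of the optimal growth rate, whereas the growth rate becomes negative at around twice the Kelly fraction.</p>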
<p><strong>Example: biased even wager bet.</strong> To understand a bit better what the expected growth rates etc are, consider the even wager setup from above. However, this time let us suppose that our <em>informational advantage</em> is $\varepsilon$, i.e., $p = \frac{1}{2}+\varepsilon$. Then the optimal growth rate $r^\esx$ obtained via $f^\esx$ is given by</p>
<script type="math/tex; mode=display">r^\esx \approx 2 \varepsilon^2</script>
<p>and the number of bets required to double our bankroll is roughly $0.35 \varepsilon^{-2}$. For actual biases this roughly looks as follows, where we list the number of bets needed to grow the bankroll by a factor of $2$ (2x), by a factor of $10$ (10x), and by $5\%$:</p>
<table>
<thead>
<tr>
<th style="text-align: right">Advantage $\varepsilon$</th>
<th style="text-align: right">$r^\esx$</th>
<th style="text-align: right">2x ($0.35 \varepsilon^{-2}$)</th>
<th style="text-align: right">10x ($1.15\varepsilon^{-2}$)</th>
<th style="text-align: right">$5\%$</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">$40\%$</td>
<td style="text-align: right">$36.80642\%$</td>
<td style="text-align: right">$1.88$</td>
<td style="text-align: right">$6.26$</td>
<td style="text-align: right">$0.14$</td>
</tr>
<tr>
<td style="text-align: right">$10\%$</td>
<td style="text-align: right">$2.01355\%$</td>
<td style="text-align: right">$34.42$</td>
<td style="text-align: right">$114.35$</td>
<td style="text-align: right">$2.48$</td>
</tr>
<tr>
<td style="text-align: right">$2.5\%$</td>
<td style="text-align: right">$0.12505\%$</td>
<td style="text-align: right">$554.29$</td>
<td style="text-align: right">$1,841.30$</td>
<td style="text-align: right">$39.98$</td>
</tr>
<tr>
<td style="text-align: right">$0.63\%$</td>
<td style="text-align: right">$0.00781\%$</td>
<td style="text-align: right">$8,872.05$</td>
<td style="text-align: right">$29,472.32$</td>
<td style="text-align: right">$639.98$</td>
</tr>
<tr>
<td style="text-align: right">$0.08\%$</td>
<td style="text-align: right">$0.00012\%$</td>
<td style="text-align: right">$567,825.94$</td>
<td style="text-align: right">$1,886,276.94$</td>
<td style="text-align: right">$40,959.98$</td>
</tr>
</tbody>
</table>
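<p>The numbers in the table can be recomputed with a few lines of code. The sketch below uses natural logarithms and assumes, based on the listed values, that the $5\%$ column is computed as $0.05 / r^\esx$ and that the advantages $0.63\%$ and $0.08\%$ are rounded versions of $0.625\%$ and $0.078125\%$; both are assumptions inferred from the table rather than stated facts.</p>

```python
import math


def optimal_growth(eps: float) -> float:
    """Growth rate of the Kelly bet for an even wager bet with p = 1/2 + eps."""
    p = 0.5 + eps
    f = 2.0 * p - 1.0  # Kelly fraction, equal to 2 * eps
    return p * math.log(1.0 + f) + (1.0 - p) * math.log(1.0 - f)


for eps in (0.4, 0.1, 0.025, 0.00625, 0.00078125):
    r = optimal_growth(eps)
    bets_2x = math.log(2.0) / r    # bets needed to double the bankroll
    bets_10x = math.log(10.0) / r  # bets needed for a tenfold increase
    bets_5pct = 0.05 / r           # first-order estimate for a 5% gain (assumed convention)
    print(f"eps={eps:.5%}  r*={r:.5%}  2x={bets_2x:,.2f}  10x={bets_10x:,.2f}  5%={bets_5pct:,.2f}")
```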
<p>Put differently, we need to have a significant informational advantage to really turn this into any reasonable return in <em>reasonable time</em> or we need to be in an environment where we can make a large number of bets. Keep this in mind, we will be revisiting this later. To put things into perspective, the edges in “professional gambling” settings typically hover around $1\%$, so you need to put in some <em>real work</em> but it is not unrealistic to achieve a decent growth rate.</p>
<h2 id="log-optimal-portfolios-and-constant-rebalancing">Log-optimal Portfolios and Constant Rebalancing</h2>
<p>Going from betting to investing is quite natural actually. After all, we can think of <em>investing</em> also as a form of sequential betting, however now:</p>
<ol>
<li>the bets do not necessarily have a fixed time of outcome (we can decide when we exit a position etc);</li>
<li>the odds / probability of success is not easily available (not saying it is easy in betting…);</li>
<li>the payoff of each “bet” can differ significantly.</li>
</ol>
<p>Let us ignore those (significant) issues for the moment and focus on the transfer of the methodology first. In fact it turns out that the setup that Kelly considered naturally generalizes to, e.g., investing in equities etc.</p>
<p>Suppose we have $n$ assets, with random return vector $x \in \RR^n$, where the $x_i$ are of the form $x_i = \frac{p_i(\text{new})}{p_i(\text{old})}$, i.e., <em>relative price changes</em>. In the spirit of Kelly’s approach, we allocate fractions</p>
<script type="math/tex; mode=display">f \in \Delta(n) \doteq \setb{f \in \RR^n \mid \sum_i f_i = 1, f \geq 0},</script>
<p>across these assets, so that the <em>logarithmic growth rate</em> (similar to above) is given as $\log f^\intercal x$. Now expressed in a sequential investing/betting fashion, we would have the relative price change realizations $x_t$ in time $t$ and allocations $f_t$, so that the <em>logarithmic portfolio growth</em> over time is given by</p>
<script type="math/tex; mode=display">\sum_{t = 1}^T \log f_t^T x_t,</script>
<p>which is identical to the <em>logarithmic terminal wealth</em> $\log W_T(f_1,\dots, f_T)$ as the return relatives are additive in log-space. Equivalently the <em>average logarithmic growth rate</em> is given by:</p>
<script type="math/tex; mode=display">\tag{avgLogGrowth}
\frac{1}{T} \sum_{t = 1}^T \log f_t^T x_t,</script>
<p>which is the (somewhat natural) generalization of (maxExpLog); however, now we spread capital across multiple assets.</p>
<p>A <em>Constant Rebalancing Portfolio (CRP)</em> is one where the $f_t = f$ are constant over time with the implicit assumption(!!) that the return distribution is somewhat stationary and hence from a sequential decision perspective we maximize the <em>expected logarithmic growth rate</em> by picking the expectation maximizer in each step assuming(!!) i.i.d. returns. In short, the CRP is in some sense the natural generalization of the Kelly bet from above. The term <em>constant rebalancing</em> arises from the fact that in each time step, we <em>rebalance</em> the portfolio to represent the <em>constant</em> allocation $f$ across assets; note that this rebalancing usually incurs transaction costs.</p>
<p>A natural and important question to ask is <em>why</em> one would want to consider CRPs at all and why one would prefer CRPs over other strategies. One advantage is that, apart from being the “Kelly bet for investing”, they turn volatility into return. Moreover, Cover [C2] showed that the best CRP provides returns at least as good as (1) buying and holding any particular stock, (2) the average of the returns of all stocks, and (3) the geometric mean of all stocks. We will briefly discuss below how CRPs can naturally turn volatility into return. We describe the argument here for the Kelly betting case (which can be considered a two-asset CRP after appropriate rewriting) for the sake of exposition, but the general case follows similarly.</p>
<p><strong>Observation: turning volatility into return.</strong> One of the arguments for CRPs is that they can even generate a return when the log geometric mean of the random variable we are investing in is $0$. This observation is due to Shannon and is simply the <a href="https://en.wikipedia.org/wiki/Inequality_of_arithmetic_and_geometric_means">AM/GM inequality</a> (or concavity of the logarithm) in action. First let us understand what the above means. Consider a random variable $X$ with geometric mean $1$ (or log geometric mean $0$), e.g., a fair coin whose payout is such that we have a new bankroll of amount $1+r$ when $X = 1$ and $1/(1+r)$ when $X = 0$. This means that a buy-and-hold strategy would not yield any growth in terms of expected log return. Note that the <em>arithmetic mean</em> of the return is roughly $r^2/2$ in this case (via Taylor approximation). The Kelly strategy allows us to profit from this positive arithmetic mean; in fact setting $f=1/2$ (which is equal to $f^\esx$ if the geometric mean is exactly $1$) guarantees a positive rate. Using $\log (1+x) \approx x - \frac{x^2}{2}+\frac{x^3}{3}$ we obtain for small $r$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\frac{1}{2}\cdot\log(1+f\cdot r)+\frac{1}{2}\cdot\log(1-f+\frac{f}{1+r}) & \approx\frac{r^{2}-r^{3}}{8}.
\end{align*} %]]></script>
<p>More generally, suppose that we have a discrete one dimensional random variable $1+X$ (this is our bet) with outcomes $1+x_i \in \RR_+$ with probability $p_i$, i.e., the $x_i$ corresponding to the returns, so that the log geometric mean is at least $0$:</p>
<script type="math/tex; mode=display">\sum_{i}p_{i}\log\left(1+x_{i}\right)\geq 0</script>
<p>Betting a fraction $f$ leads to the expected growth function:</p>
<script type="math/tex; mode=display">r(f) \doteq \sum_{i}p_{i}\log\left(1+ f x_{i}\right)</script>
<p>Observe that $r(f)$ is strictly concave in $f$ in the interval $f \in [0,1]$ with $r(0)=0$ and $r(1) \geq 0$ (here we use that the log geometric mean is at least $0$). Thus by concavity (or <a href="https://en.wikipedia.org/wiki/Jensen%27s_inequality">Jensen’s inequality</a>; all the same) it follows:</p>
<script type="math/tex; mode=display">r(f) = r(1 \cdot f + 0 \cdot (1-f)) > f \cdot r(1) + (1-f) \cdot r(0) \geq 0.</script>
<p>As such betting any fraction $f \in (0,1)$, and in particular $f=1/2$, yields a positive growth rate as long as the geometric mean is at least $1$. However, for completeness, note that betting a fixed fraction of $1/2$ can be quite suboptimal: say the random variable actually had a geometric mean strictly larger than $1$ (in particular no ruin events); then investing a $1/2$-fraction will lead to suboptimal growth. In fact, in some cases the optimal Kelly fraction $f^\esx$ can actually be larger than $1$. In this case we would leverage the bets.</p>
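<p>The fair coin example from above can be checked numerically. The following sketch compares buy-and-hold ($f = 1$) with rebalancing to $f = 1/2$ and with the approximation $\frac{r^2 - r^3}{8}$; the value $r = 0.1$ is an arbitrary example.</p>

```python
import math


def growth_rate(f: float, r: float) -> float:
    """Expected log growth per bet when holding fraction f in a fair coin
    that multiplies the stake by 1+r or by 1/(1+r)."""
    up = 1.0 + f * r                # outcome X = 1
    down = 1.0 - f + f / (1.0 + r)  # outcome X = 0
    return 0.5 * math.log(up) + 0.5 * math.log(down)


r = 0.1
print(growth_rate(1.0, r))    # buy and hold: log geometric mean, essentially 0
print(growth_rate(0.5, r))    # rebalancing to one half: strictly positive
print((r**2 - r**3) / 8.0)    # small-r approximation of the previous line
```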
<p>In the case of $n$ assets, the generalization of the above simple strategy is to allocate a $1/n$-fraction of capital to each of the $n$ assets. For further comparisons of this simple $1/n$-CRP to other strategies and further properties, see [DGU]. Finally, before continuing with universal portfolios, let me add that <em>dutch booking</em> (i.e., locking in a guaranteed profit through arbitrage and crossing bets) is not necessarily growth optimal as can be seen with a similar argument: we can gain extra return from allowing volatility in the portfolio returns. Put differently, if the payout function is convex, we benefit from randomness.</p>
<h3 id="universal-portfolios">Universal Portfolios</h3>
<p>Once the notion of Constant Rebalancing Portfolios is defined, given a set of assets, a natural question is whether one can estimate or compute the optimal allocation vector $f$ given either distributional assumptions about the returns or actual data. The natural second order question is then, if so, whether we can do it in an online style fashion, <em>while</em> we are investing. In its most basic form: Given relative price change vectors $x_1, \dots, x_T$, we would like to solve:</p>
<script type="math/tex; mode=display">\tag{staticLogOpt}
\max_{f \in \RR^n} \frac{1}{T} \sum_{t = 1}^T \log f^T x_t.</script>
<p>While this optimization problem can easily be solved with convex optimization methods, provided the price change vectors $x_1, \dots, x_T$, this is not very helpful as the past is usually not a great predictor for the future.</p>
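<p>For intuition, here is a minimal sketch of (staticLogOpt) for $n = 2$ assets, where the allocation is restricted to the simplex $\Delta(2)$ and the maximization is done by a plain grid search. The data (cash plus an asset that alternately doubles and halves) is hypothetical, and for larger $n$ one would use a proper convex optimization method instead.</p>

```python
import math


def avg_log_growth(f, price_relatives):
    """Objective of (staticLogOpt) for two assets with allocation (1 - f, f)."""
    total = sum(math.log((1.0 - f) * x[0] + f * x[1]) for x in price_relatives)
    return total / len(price_relatives)


def best_crp_two_assets(price_relatives, steps=1000):
    """Grid search over the simplex in dimension 2; fine for a sketch only."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda f: avg_log_growth(f, price_relatives))


# Hypothetical data: cash (always 1.0) and an asset that alternately doubles and halves.
data = [(1.0, 2.0), (1.0, 0.5)] * 50
f_best = best_crp_two_assets(data)
print(f_best, avg_log_growth(f_best, data), avg_log_growth(1.0, data))
```

<p>On this data buying and holding the volatile asset has average log growth $0$, whereas the best constant rebalancing allocation splits capital evenly and achieves a strictly positive average log growth.</p>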
<p>Motivated by the strong connection to information theory, Cover [C2] realized that one can define <em>Universal Portfolios</em>, that are growth optimal for unknown returns following an analogous line of reasoning as Kolmogorov, Lempel, and Ziv did for <em>universal coding</em> where an (asymptotically) optimal (source) code can be constructed without knowing the source’s statistics, i.e., the algorithm constructs a portfolio over time, so that when $T \rightarrow \infty$, for any sequence of relative price changes $x_1,\dots, x_T$, then</p>
<script type="math/tex; mode=display">\tag{universalPortfolio}
\frac{1}{T} \sum_{t = 1}^T \log f_t^T x_t \rightarrow \max_{f \in \RR^n} \frac{1}{T} \sum_{t = 1}^T \log f^T x_t,</script>
<p>where the $f_t$ are dynamic allocations. However, grossly simplifying, the proposed algorithm spreads the capital across an exponential number of sequences arising from binary strings of length $T$ (assuming the most basic case based on a binomial model), making it impractical for computational implementation as well as from an actual real-world perspective as one would probably be bankrupted by transaction costs. Nonetheless, Cover’s work [C1], [C2] inspired a lot of related work addressing various aspects (see, e.g., [CO], [BK], [T], [TK], [MTZZ]) and in particular [KV] provided a theoretically polynomial time implementable variant of Cover’s universal portfolios.</p>
<h3 id="an-application-of-online-mirror-descent-et-al">An application of Online Mirror Descent et al</h3>
<p>It did not take long, given the suggestive form of (universalPortfolio) until the link to online convex optimization and regret minimization was observed (see e.g., [HSSW] for one of the earlier references). In fact, the online search for a universal portfolio can be easily cast as a regret minimization problem: Find a strategy of <em>dynamic allocations</em> $f_t$, so that</p>
<script type="math/tex; mode=display">\tag{univPortRegret}
\max_{f \in \RR^n} \frac{1}{T} \sum_{t = 1}^T \log f^T x_t - \frac{1}{T} \sum_{t = 1}^T \log f_t^T x_t\leq R(T)/T,</script>
<p>where $R(T)$ is the <em>regret</em> achieved after $T$ time steps (or bets depending on the point of view). Given that our function $\log f_t^T x_t$ is concave in $f_t$ and we want to maximize (or equivalently minimize $-\log f_t^T x_t$), we can apply the <em>online convex optimization framework</em> to obtain no-regret algorithms (see <a href="/blog/research/2019/02/27/cheatsheet-nonsmooth.html">Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning</a> for some background information on the actual algorithms); for the expert in online convex optimization, we require some assumptions on the feasible region of $f$ later, to avoid complications that arise in the unbounded case.</p>
<h4 id="online-gradient-descent">Online Gradient Descent</h4>
<p>In its most basic form we can simply run Zinkevich’s [Z] <em>Online (sub-)Gradient Descent (OGD)</em> for a given time horizon $T$ and feasible region $P$. Here, we update according to</p>
<script type="math/tex; mode=display">f_{t+1} \leftarrow \arg\min_{f \in P} \eta_t \nabla_f(-\log f_t^T x_t)^Tf + \frac{1}{2}\norm{f-f_t}^2</script>
<p>with the step-size choice $\eta_t = \eta = \frac{M}{G_2\sqrt{T}}$, where $\norm{\nabla_f(\log f_t^T x_t)}_2 \leq G_2$ is an upper bound on the $2$-norm of the gradients and $M$ is an upper bound on the $2$-norm diameter of the feasible region $P$. This provides a guarantee of the form:</p>
<script type="math/tex; mode=display">\tag{univPortOGDGen}
\max_{f \in P} \frac{1}{T} \sum_{t = 1}^T \log f^T x_t - \frac{1}{T} \sum_{t = 1}^T \log f_t^T x_t\leq \frac{G_2M}{\sqrt{T}},</script>
<p>Note that $P$ can be used to model or capture various allocation constraints, e.g., if $P = \Delta(n)$ is the probability simplex in dimension $n$, then we essentially assume that we cannot short and we cannot take leverage; this would be the traditional unleveraged long-only investor setting. However, many other choices are possible and we can also easily include limits on assets and segment exposures into $P$; the projection operation might get slightly more involved but that is about it. It is important to observe though that while the bound in (univPortOGDGen) looks like it is independent of the number of assets $n$, in fact $G_2$ will typically scale as $\sqrt{n}$. For $P = \Delta(n)$, which we will assume here for simplicity and also to be in line with the original universal portfolio model, the bound (univPortOGDGen) becomes:</p>
<script type="math/tex; mode=display">\tag{univPortOGD}
\max_{f \in P} \frac{1}{T} \sum_{t = 1}^T \log f^T x_t - \frac{1}{T} \sum_{t = 1}^T \log f_t^T x_t\leq \sqrt{\frac{2G_2^2}{T}}.</script>
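<p>A minimal sketch of online gradient descent for $P = \Delta(n)$ could look as follows. It takes gradient steps on the loss $-\log f_t^T x_t$ and projects back onto the simplex with the standard sort-based Euclidean projection. The simulated price relatives and the step size $1/\sqrt{T}$ (which ignores the problem-dependent constant) are illustrative assumptions, not a calibrated setup.</p>

```python
import math
import random


def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        if u_k - candidate > 0:
            theta = candidate  # the last index passing the test fixes the threshold
    return [max(x - theta, 0.0) for x in v]


def ogd_portfolio(price_relatives, eta):
    """Run OGD on the losses -log(f^T x_t); returns final allocation and log wealth."""
    n = len(price_relatives[0])
    f = [1.0 / n] * n
    log_wealth = 0.0
    for x in price_relatives:
        growth = sum(fi * xi for fi, xi in zip(f, x))
        log_wealth += math.log(growth)
        # The gradient of -log(f^T x) is -x / (f^T x), so descent adds eta * x / (f^T x).
        f = project_to_simplex([fi + eta * xi / growth for fi, xi in zip(f, x)])
    return f, log_wealth


# Hypothetical market: cash and one risky asset with two equally likely outcomes.
random.seed(0)
T = 2000
data = [(1.0, random.choice((1.03, 0.98))) for _ in range(T)]
f_final, log_wealth = ogd_portfolio(data, eta=1.0 / math.sqrt(T))
print(f_final, log_wealth / T)
```

<p>Other feasible regions $P$ only require replacing <code>project_to_simplex</code> by the corresponding projection.</p>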
<h4 id="online-mirror-descent">Online Mirror Descent</h4>
<p>A more general approach with more freedom to customize the regret bound is <em>Online Mirror Descent</em> (OMD), which arises as a natural generalization of <em>Mirror Descent</em> [NY] (by simply cutting short the original MD proof; see <a href="/blog/research/2019/02/27/cheatsheet-nonsmooth.html">Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning</a>). Rather than stating the most general version, for the case of $P = \Delta(n)$ where OMD with respective Bregman divergence and distance generating function is the <em>Multiplicative Weight Update</em> method (see [AHK] for a survey on MWU), we obtain the regret bound:</p>
<script type="math/tex; mode=display">\tag{univPortOMD}
\max_{f \in P} \frac{1}{T} \sum_{t = 1}^T \log f^T x_t - \frac{1}{T} \sum_{t = 1}^T \log f_t^T x_t \leq \sqrt{\frac{2G_\infty^2\log n}{T}},</script>
<p>where <script type="math/tex">\norm{\nabla_f(\log f_t^T x_t)}_\infty \leq G_\infty</script>; note that this time around we have a bound on the max norm of the gradients. The update looks very similar to OGD:</p>
<script type="math/tex; mode=display">f_{t+1} \leftarrow \arg\min_{f \in P} \eta_t \nabla_f(-\log f_t^T x_t)^Tf + V_{f_t}(f),</script>
<p>where $V_x(y)$ is the corresponding Bregman divergence. This update can be implemented via the multiplicative weight update method; see <a href="/blog/research/2019/02/27/cheatsheet-nonsmooth.html">Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning</a> for details.</p>
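<p>For $P = \Delta(n)$ with the entropy as distance generating function, the mirror descent step has a closed form: each weight is multiplied by the exponential of the scaled negative loss gradient and the result is renormalized. A minimal sketch of one such step for the loss $-\log f_t^T x_t$ is given below; the function name <code>mwu_step</code> and the example numbers are hypothetical.</p>

```python
import math


def mwu_step(f, x, eta):
    """One multiplicative weight update for the loss -log(f^T x) on the simplex."""
    growth = sum(fi * xi for fi, xi in zip(f, x))
    # The negative loss gradient is x / (f^T x); exponentiate and renormalize.
    weights = [fi * math.exp(eta * xi / growth) for fi, xi in zip(f, x)]
    total = sum(weights)
    return [w / total for w in weights]


f = [0.5, 0.5]
x = [1.0, 1.1]  # hypothetical relative price changes
f_new = mwu_step(f, x, eta=0.1)
print(f_new)    # weight shifts toward the asset that performed better
```

<p>Note that no explicit projection is needed here, as the normalization keeps the iterate on the simplex.</p>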
<p><strong>Remark: (Feasible region $\Delta(n)$).</strong> Sticking to $P = \Delta(n)$ as feasible region is not as restrictive as it seems, as one can simply add the negative of an asset to allow for shorting and duplicate assets to allow for leverage. This “modification” has only minor impact on the regret bound (logarithmic dependence on $n$ for OMD). Alternatively one can simply define a customized Bregman divergence incorporating those constraints with basically the same result.</p>
<h4 id="online-gradient-descent-for-strongly-convex-functions">Online Gradient Descent for strongly convex functions</h4>
<p>It turns out that given that $\log f_t^T x_t$ is strongly concave in $f$ with respect to the $2$-norm one can significantly improve the OGD regret bound as shown by [HAK]. This type of improved regret bound in the strongly convex case is pretty standard by now (again see <a href="/blog/research/2019/02/27/cheatsheet-nonsmooth.html">Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning</a> for details), so that we just state the bound:</p>
<script type="math/tex; mode=display">\tag{univPortOGDSC}
\max_{f \in P} \frac{1}{T} \sum_{t = 1}^T \log f^T x_t - \frac{1}{T} \sum_{t = 1}^T \log f_t^T x_t \leq \frac{G_2^2}{2\mu} \frac{(1 + \log T)}{T},</script>
<p>where $\mu >0$ is such that $\mu I \preceq -\nabla_f^2 \log f_t^T x_t$, i.e., it is a lower bound on the strong concavity constant. Note that we have, up to log factors, a quadratic improvement in the convergence rate.</p>
<h2 id="performance-on-real-world-data">Performance on real-world data</h2>
<p>So from a theoretical perspective the performance guarantee of Online Mirror Descent and even more so the performance guarantee of the variant for strongly convex losses via Online Gradient Descent sound very appealing in terms of running Universal Portfolios online. Given the regret bounds one would expect that all information theory, computer science, and machine learning people “in the know” would be rich. However, as so often:</p>
<blockquote>
<p>“In academia there is no difference between academia and the real world - in the real world there is”
— <a href="https://en.wikipedia.org/wiki/Nassim_Nicholas_Taleb">Nassim Nicholas Taleb</a></p>
</blockquote>
<p>As we will see in the following, for actual parameterizations that are compatible with market conditions that we encounter in today’s market regimes, the provided asymptotic guarantees are often too weak; but not always. Note, that we are ignoring transaction costs here, which impact performance negatively (see [BK] for a discussion of Universal Portfolios with transaction costs), as it would complicate the exposition significantly while adding little extra value in terms of understanding.</p>
<p>I would like to stress that I am not saying that OMD, Universal Portfolios, etc. cannot be successfully used in investing. A few disclaimers:</p>
<ol>
<li>There are <em>some</em> researchers, in particular information theorists that do use related methodologies for investing quite successfully. But in a more elaborate way than just vanilla portfolio optimization.</li>
<li>The aforementioned algorithms do often work much better in practice than the regret bounds suggest; however, we run Universal Portfolios precisely because of the worst-case guarantee. Otherwise, we get into metaphysical discussions of the pros and cons of investment strategies without any guarantees.</li>
</ol>
<p><strong>Understanding the Regret Bounds.</strong> It is important to understand exactly what the regret bounds (univPortOGDSC) and (univPortOMD) really mean. Namely, after a certain number of iterations or steps the achieved error is the <em>average additive return error</em>, i.e., if the value in the graph for (univPortOMD) indicates an error of $\varepsilon$, this means that <em>on average in each step</em> we roughly make an additive error in the return of $\varepsilon$. This can be quite problematic if the actual returns are smaller than $\varepsilon$. As such we will also report the <em>average relative return error</em> as the ratio of the additive error and the return in that time step, as this is really what matters from a practical perspective: you want to be within a (hopefully small) <em>multiplicative</em> factor of the actual return.</p>
<p>Before looking at actual data I will first make a few remarks where I used simulated market data (calibrated against “typical” market conditions) to demonstrate the underlying mechanics. Then we will look at actual data in two cases: a more traditional US equities scenario and a more speculative, high(er)-frequency cryptocurrency example.</p>
<h3 id="constants-and-norms-matter">Constants and norms matter</h3>
<p>At first sight it seems that the regret bound (univPortOGDSC) for the strongly convex case via OGD should be vastly superior to the regret bound (univPortOMD) via OMD. So let us start with a simple example of $n = 2$ assets for $T = 100000$ time steps; this is quite long in the real-world, more on this later. Here and in the following we often depict on the left average additive (or relative) return errors and on the right the respective norms of the gradients over time as well as the lower bound estimation of the strong concavity constant.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/twoAssetcompOver.png" alt="Regret Bound 2 assets" /></p>
<p>This confirms our initial suspicion that (univPortOGDSC) is superior. Or is it? Let us consider another example but this time we take a more realistic number of $n = 50$ assets:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/50AssetcompOver.png" alt="Regret Bound 50 assets" /></p>
<p>This might be slightly surprising at first but what is happening is that (univPortOMD) scales <em>much</em> better due to the logarithmic dependence on the number of assets and the max norm of the gradients, whereas (univPortOGDSC) depends on the $2$-norm of the gradients and this scales roughly with $\sqrt{n}$ when the assets are reasonably independent. So basically OMD is slower in principle but starts from an exponentially lower initial error bound, so you need really long time horizons for (univPortOGDSC) to dominate (univPortOMD) for reasonably sized portfolios. In actual setups it really depends on the data and parameters in terms of which algorithm and bound performs better.</p>
<p>Moreover, as mentioned above what we really care for are the relative errors in the returns. For the same examples from above on the left we have the $2$-asset case and on the right the $50$-asset case.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/2-50-relComp.png" alt="Relative Regret Bounds 2 and 50 assets" /></p>
<p>The key takeaway here is that the relative error can easily be an order of magnitude worse than the additive error. In fact in both cases here, where we simulated really moderate markets, even after a large number of time steps the relative error is around $10\%$, which is still quite considerable.</p>
<h3 id="actual-time-matters-more-than-anything">Actual time matters (more than anything)</h3>
<p>The other important thing is that actual <em>time</em> not <em>time steps</em> matters a lot. This might be counterintuitive as well but bear with me. As can be seen from the two regret formulas of interest, (univPortOGDSC) and (univPortOMD), it is the number of time steps that is key to bringing down the average additive return error. As such, while we cannot accelerate time etc in the real-world, we might be tempted to trade more often. Let us see what happens. We consider a market that has $n = 50$ assets with a volatility of roughly that of the US equities market; the graph shows log (portfolio) value of the asset (as a single-asset portfolio).</p>
<p class="minimg"><img src="http://www.pokutta.com/blog/assets/univPort/50Assets10years52weeksmarket.png" alt="Market 50 assets 10 years 52 weeks" /></p>
<p>Now in terms of trading and achievable average additive return errors in the next graphic, on the left we assume that we trade once per week, i.e., we have $T = 520$ time steps, whereas on the right we assume that we trade $100$ times per week, i.e., $T = 52000$.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/50Assets10years52weekscompRescaleRegret.png" alt="Rescaling additive regret" /></p>
<p>First of all, we see that the additive error for the once-per-week setup is actually quite high, whereas on the right, trading $100$ times per week really helps to bring down the average additive error and we also see that (univPortOGDSC) starts to provide better guarantees than (univPortOMD). This is also consistent with many papers reporting tests over long(er) time horizons in order to ensure that the regret bounds are tight enough.</p>
<p>Now however, the story changes quite a bit when we consider the <em>relative</em> errors as shown in the next graphic.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/50Assets10years52weekscompRescaleRegretrel.png" alt="Rescaling relative regret" /></p>
<p>So for (univPortOMD) basically nothing has changed. The reason is the following: what the increase in number of trades does is to basically turn $\sqrt{T}$ in the regret bound into $\sqrt{100 T} = \sqrt{100} \sqrt{T}$. However, the volatility of the underlying stock process scales <em>the same way</em>, i.e., the returns get roughly rescaled by $1/\sqrt{100}$ as well, so that (apart from a better tracking and removing discretization errors) we obtain basically the <em>same relative errors</em>. This is not the case for (univPortOGDSC) as it contracts quadratically faster (up to log factors), however the strong dependency on the number of assets might still outweigh this benefit.</p>
<p>In short: rescaling time by increasing the number of trades per time step is of <em>limited efficiency</em>. It is actual <em>time</em> that matters when we consider relative return errors, which is the quantity that actually matters.</p>
<p><strong>Interlude: Understanding compounding or how to create a trust fund baby.</strong> Just to understand how important time is when it comes to investing and compounding: at retail investor return rates, say somewhere below $8\%$ per year (which is already quite optimistic), within $20$ years of investing you can merely $4.7$-fold your initial investment; even $40$ years only give you a factor of about $22$, which is also not game changing. Neither is going to make you rich when you started out poor (“getting rich” is closer to a factor $100$ increase in wealth). However, if you are willing not to invest for yourself but, say, for your grandchild, you have a good $60$ years of compounding and the situation is very different: you end up with roughly $101.26$ times your initial investment, say turning USD $10,000$ into roughly USD $1.00$ million. Go one generation further down the family tree and it becomes USD $4.72$ million. So while it is unlikely that you can catapult yourself into the realm of riches (provided you did not start out rich), you do have the chance to do something for your family down the line. One might actually argue from an economic perspective that the reason retail returns are not higher than they are is to maintain a certain equilibrium (while beyond the scope of this post: otherwise everybody would be rich, which means nobody is rich). The lack of category-changing retail investment returns might also provide an explanation why people have been crazy about cryptocurrencies recently: they provide(d) a return that is potentially significant enough to gain $2$ generations of investing within a couple of years (or even months), at correspondingly high risk levels though.</p>
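<p>The compounding factors quoted above follow directly from $(1 + 0.08)^{\text{years}}$. A short sketch, with the $8\%$ rate and the USD $10,000$ starting capital taken from the text and the function name chosen for illustration:</p>

```python
def compound_factor(rate: float, years: int) -> float:
    """Growth factor of an investment compounding annually at the given rate."""
    return (1.0 + rate) ** years


initial = 10_000  # starting capital in USD
for years in (20, 40, 60, 80):
    factor = compound_factor(0.08, years)
    print(f"{years} years: factor {factor:,.2f}, USD {initial * factor:,.0f}")
```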
<h3 id="benchmarks-on-actual-data">Benchmarks on actual data</h3>
<p>With the above in mind, let us benchmark (univPortOGDSC) and (univPortOMD), and hence the algorithms, in terms of achievable guarantees on actual market data.</p>
<h4 id="sp500-stocks">SP500 stocks</h4>
<p>In the first benchmark we consider the SP500 stocks from 01/01/2008 to 01/01/2018 in daily trading. After correcting for dropouts etc over the horizon we are left with $n = 426$ stocks and $T = 2519$ time steps (which corresponds to $10$ years of approx $252$ trading days/year). Apart from very basic tests, no further cleanup was done; that is left as an exercise to the reader. In the graphic below, on the left we have the actual market prices in USD of the assets and on the right the log (portfolio) value of the assets.</p>
<p><img src="http://www.pokutta.com/blog/assets/univPort/sp500marketStockOver.png" alt="sp500 market" /></p>
<p>In terms of achievable average additive return errors, they are huge for either of the bounds:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/sp500compOver.png" alt="sp500 additive regret" /></p>
<p>We have roughly a “guarantee” of $100\%$ <em>additive</em> error in the daily return for the (better) OMD bound: to put this in perspective, say the best stock had a <em>daily return</em> of $3\%$ on average (that is <em>huge</em>), then we can “guarantee” an error band of $[-97\%,103\%]$ for the average daily performance of the universal portfolio <em>at the end of the time horizon</em>. Not quite helpful I guess. The situation gets much more pronounced in terms of average relative error:</p>
<p class="minimg"><img src="http://www.pokutta.com/blog/assets/univPort/sp500comprel.png" alt="sp500 relative regret" /></p>
<h4 id="bitcoin">Bitcoin</h4>
<p>Maybe we were on the wrong end of the regime. Now let us consider fewer assets and a much higher trading frequency. We consider $n = 2$ assets: Bitcoin and a risk-free asset (e.g., cash or government bonds; unsure about the latter recently) that make up our portfolio. The risk-free asset returns a (moderate) $2\%$ over the full horizon. These days you could consider yourself lucky to get any interest at all: Germany <a href="https://www.cnbc.com/2019/08/20/what-is-a-zero-coupon-bond-germany-set-to-auction-a-zero-percent-30-year-bond.html">just tried to sell a zero-coupon bond</a> at a crazy price (i.e., negative yield), which basically means parking your hard-earned cash for $30$ years in a “safe-haven” and paying for it; demand at the auction was <a href="https://www.bloomberg.com/news/articles/2019-08-21/germany-sees-anemic-demand-for-30-year-bond-sale-at-zero-coupon">anemic</a>. The original bitcoin market data was on a sub-second granularity level and for illustration here has been downsampled to $T = 566159$ time steps. As before, in the graphics below, on the left we have the actual market price in USD and on the right the log (portfolio) value of the two assets.</p>
<p><img src="http://www.pokutta.com/blog/assets/univPort/bitcoinMarketOver.png" alt="bitcoin market" /></p>
<p>Now, for our two regret bounds, we see that the variant utilizing the strong concavity performs much better (as we have only two assets, so that constants are small). In fact, $10^{-4}$ as average additive return error seems to be quite good.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/bitcoincompOver.png" alt="bitcoin additive regret" /></p>
<p>However, unfortunately the reality here is a different one: average price changes between time steps are also quite small, so that the relative errors are <em>huge</em>:</p>
<p class="minimg"><img src="http://www.pokutta.com/blog/assets/univPort/bitcoincomprel.png" alt="bitcoin relative regret" /></p>
<p>Even towards the end of the time horizon our (guaranteed) average relative error bounds can be as large as $10^3$, making them impractical. Sure, here we have the <em>special situation</em> of only two assets, one of which is a risk-free asset so that the allocation cannot be <em>that bad</em>, but that is beside the point.</p>
<h2 id="a-simple-asymptotically-optimal-strategy">A simple asymptotically optimal strategy</h2>
<p>Although CRPs ensure competitiveness with respect to a wide variety of benchmarks (see above), in most papers the reference is the <em>single best asset in hindsight</em>. This intuitively makes sense: you want to perform essentially (at least) as well as the best possible asset. Note that in principle CRPs can even be profitable when none of the assets is (as discussed before) but you basically need a situation where all assets have log geometric mean return of at most $0$. Once you are beyond this, i.e., there is an asset with strictly positive log geometric mean returns, the log portfolio value is usually optimized at a single asset.</p>
<h3 id="split-and-forget">Split-and-Forget</h3>
<p>So let us change our reference measure to not compete with the best CRP but with the <em>best single asset in hindsight</em> in view of this discussion. In this case there is a simple regret optimal strategy with <em>constant</em> regret in terms of terminal logarithmic wealth (as used in (univPortOGDSC) and (univPortOMD)). The strategy, let us call it <em>Split-and-Forget</em>, works as follows: Say we have $n$ assets, then simply allocate a $1/n$-fraction of capital to each of the $n$ assets and then forget about them (i.e., no rebalancing etc. and hence also no additional transaction costs). Now let $V_t^i$ denote the value of asset $i$ at time $t$ and let $M_t$ denote the value of our portfolio at time $t$. Then the regret with respect to the best asset in hindsight is:</p>
<script type="math/tex; mode=display">\max_{i \in [n]} \log V^i_T - \log M_T \leq \max_{i \in [n]} \log V^i_T - \log \frac{1}{n} V^i_T = \log n,</script>
<p>and as such, in the language from before, our <em>average regret</em> (i.e., <em>average additive return error</em>) satisfies</p>
<script type="math/tex; mode=display">\tag{SF}
\max_{i \in [n]} \frac{1}{T }\log V^i_T - \frac{1}{T} \log M_T \leq \max_{i \in [n]}\frac{1}{T } \log V^i_T - \frac{1}{T} \log \frac{1}{n} V^i_T = \frac{\log n}{T},</script>
<p>i.e., the bound has the same logarithmic dependency on $n$ as OMD but an even higher (no log factor) convergence rate than OGD-SC. On top of that you have no transaction costs (except for the initial allocation). In fact, if the optimal CRP is attained by a single asset portfolio this basic strategy performs much better in terms of regret and achieved average returns. Note that [DGU]’s $1/n$-strategy rebalances to a uniform $1/n$ allocation at each rebalancing step and as such is different from Split-and-Forget, which does not alter the allocation after the split up in the initial step.</p>
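<p>To see the $\log n$ bound in action, here is a small numerical sketch of Split-and-Forget. Everything about the data is an assumption made for illustration: the returns are synthetic log-normal price relatives, not the market data used in this post, and $n$, $T$, and the drift and volatility parameters are arbitrary.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 10, 500  # hypothetical number of assets and time steps

# Synthetic per-step price relatives (purely illustrative, not real market data).
price_relatives = np.exp(rng.normal(loc=0.0005, scale=0.02, size=(T, n)))

# Asset values V_t^i, normalized so that every asset starts at value 1.
V = np.cumprod(price_relatives, axis=0)

# Split-and-Forget: put 1/n of the capital into each asset once, never rebalance.
M = V.mean(axis=1)

# Regret in terminal log wealth against the best single asset in hindsight.
regret = np.log(V[-1].max()) - np.log(M[-1])
avg_regret = regret / T

print(f"regret         = {regret:.4f}")
print(f"log n          = {np.log(n):.4f}")
print(f"average regret = {avg_regret:.6f}")
```

<p>Whatever the returns are, the regret stays between $0$ and $\log n$, because the portfolio value is the mean of the asset values and the mean of $n$ nonnegative numbers is at least $1/n$ times their maximum.</p>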
<p><strong>Interlude: Understanding diversification or why old families tend to be rich.</strong> In order to understand a bit better why the above is really working and actually not that bad of a strategy, observe that when doing the initial split up you pay for this <em>once</em>: in the worst case you pay the full $\log n$ additive regret, which is the case where all but one of the assets are wiped out and you are left with only a $1/n$-fraction in the one remaining asset. However, after the split up you are guaranteed that your portfolio is growing at the rate of the best single asset. More mathematically speaking, the split up leads to a shift by $\log n$ in the log portfolio value but after that its growth approaches optimal growth (compared to the single best asset). Sounds weird, but becomes clearer when you think of the asset $i$ compounding at some rate $r_i$. Now the terminal log portfolio wealth would be</p>
<script type="math/tex; mode=display">\log \sum_{i \in [n]} \frac{1}{n} e^{r_i T} = \log \sum_{i \in [n]} e^{r_i T - \log n},</script>
<p>which is the <a href="https://en.wikipedia.org/wiki/LogSumExp">LogSumExp (LSE) function</a> (also sometimes called <em>softmax</em>), which for larger $T$ converges to the maximum: Let $j$ be so that $r_j$ is (uniquely) maximal, then</p>
<script type="math/tex; mode=display">\log \sum_{i \in [n]} e^{r_i T - \log n} = \log e^{r_j T - \log n} + \log \left( 1 + \sum_{j \neq i \in [n]} \frac{e^{r_i T - \log n}}{e^{r_j T - \log n}} \right) \rightarrow r_j T - \log n,</script>
<p>so that the average return of the portfolio approaches $r_j - \frac{\log n}{T}$ as the fractions tend to $0$, which in turn approaches $r_j$ for larger $T$.</p>
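<p>The convergence of the average log wealth to the best rate can also be checked numerically. The following is a sketch under assumed values: the three continuous growth rates $r_i$ are hypothetical, and <code class="highlighter-rouge">avg_log_wealth</code> is a helper name introduced only for this example.</p>

```python
import numpy as np

rates = np.array([0.01, 0.02, 0.04])  # hypothetical continuous growth rates r_i
n = len(rates)
r_max = rates.max()


def avg_log_wealth(T: float) -> float:
    """(1/T) * log( sum_i (1/n) * exp(r_i * T) ), evaluated stably as a LogSumExp."""
    return float(np.logaddexp.reduce(rates * T - np.log(n)) / T)


horizons = (10, 100, 1000)
gaps = [r_max - avg_log_wealth(T) for T in horizons]

for T, gap in zip(horizons, gaps):
    print(f"T = {T:5d}: gap to best rate = {gap:.6f}, log(n)/T = {np.log(n) / T:.6f}")
```

<p>The gap to the best rate never exceeds $\frac{\log n}{T}$ and approaches it from below as the contribution of the slower assets vanishes.</p>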
<p>So what does this have to do with old families tending to be rich? Suppose at some point your family decided to split their wealth and invest equally into various business ventures, say $10$. You only need to get one of these ventures right and then, over time, you wash out your additive offset of $\log 10$; say the best one generates $4\%$ return per year, then it roughly takes $60$ years to “pay” for the cost of the split up. In other words, the cost of the split up translates into a shift in time and if you have enough time then you just wash out the cost. You might be well aware of the supercharged version of this: investing into startups. There you spread the capital even wider and try to harvest your black swan with explosive growth that pays for the losers in the portfolio and returns a hefty profit. This is simply convexity: the average of the individual payoffs can be much higher than the payoff of the average of the individuals.</p>
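<p>The “roughly $60$ years” figure is a back-of-the-envelope number; a short sketch makes the arithmetic explicit. It assumes annual compounding at a fixed rate and that only the single best venture matters, and the function name <code class="highlighter-rouge">years_to_wash_out</code> is hypothetical.</p>

```python
import math


def years_to_wash_out(n_ventures: int, annual_return: float) -> float:
    """Solve (1 + annual_return) ** years = n_ventures for years."""
    return math.log(n_ventures) / math.log(1.0 + annual_return)


years = years_to_wash_out(10, 0.04)
print(f"{years:.1f} years to pay for a 10-way split at 4% per year")
```

<p>With these assumptions the offset of $\log 10$ is washed out in a little under $59$ years, in line with the rough estimate above.</p>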
<h3 id="split-and-forget-compared-to-omd-and-ogd-sc">Split-and-Forget compared to OMD and OGD-SC</h3>
<p>In the following we compare the realized average return errors (both additive and relative) of (SF) on the same examples that we have seen above. In each figure on the left we have the average additive error and on the right the average relative error.</p>
<p>The first one is the $50$ assets $10$ years example:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/50Assets10years52weekscompSOver.png" alt="SF 50 Assets 10 years" /></p>
<p>And here the longer time horizon version used above to demonstrate the dependency on the constants:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/50AssetscompSOver.png" alt="SF 50 Assets long term" /></p>
<p>The next one is the SP500 example from above:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/sp500compSOver.png" alt="SF SP 500" /></p>
<p>And the bitcoin example:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/bitcoincompSOver.png" alt="SF bitcoin" /></p>
<p>As can be seen, the regret of (SF) is significantly smaller than the regret of (OGD-SC) and (OMD), albeit with respect to a slightly different benchmark: the best single asset vs. the best CRP. We could now have a very long discussion about whether (SF) is a good strategy or not. What the (SF) strategy does, however, is shed some light on what one can really expect from universal portfolios.</p>
<p>In actual backtesting performance the algorithms (as usual) perform much better, although I want to stress this is without guarantee and hence puts us back into the realms of justification-by-backtesting; you should always <em>validate</em> by backtesting though. Thus, just for completeness, a quick backtesting experiment. Here we ran the actual algorithms (as compared to just computing the additive and relative bounds) and benchmarked the (SF) strategy against the, for this case better suited, (OGD-SC) strategy. The green curve “Opt” is the in-hindsight-optimal CRP <em>allowing for leverage of a factor of up to $2$</em>, i.e., it is outside of the benchmarking class of CRPs and I added it just for comparison. You can clearly see the offset of (SF) at the start and see that this offset translates into a time shift from where onwards it dominates (OGD-SC) significantly in growth rate, essentially following the market: negative offset but higher rate.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/univPort/backtestingBitcoingOGDSC.png" alt="SF bitcoin backtesting" /></p>
<h3 id="references">References</h3>
<p>[K] Kelly Jr, J. L. (2011). A new interpretation of information rate. In <em>The Kelly Capital Growth Investment Criterion: Theory and Practice</em> (pp. 25-34). (original article from 1956) <a href="http://www.herrold.com/brokerage/kelly.pdf">pdf</a></p>
<p>[L] Latane, H. A. (1959). Criteria for choice among risky ventures. <em>Journal of Political Economy</em>, <em>67</em>(2), 144-155. <a href="http://finance.martinsewell.com/money-management/Latane1959.pdf">pdf</a></p>
<p>[S] Shannon, C. E. (1948). A mathematical theory of communication. <em>Bell system technical journal</em>, <em>27</em>(3), 379-423. <a href="https://pure.mpg.de/rest/items/item_2383162/component/file_2456978/content">pdf</a></p>
<p>[C] Cover, T. M. “Shannon and investment.” <em>IEEE Information Theory Society Newsletter, Summer</em> (1998). <a href="xxx">pdf</a></p>
<p>[T] Thorp, E. O. (1966). <em>Beat the Dealer: a winning strategy for the game of twenty one</em> (Vol. 310). Vintage.</p>
<p>[TK] Thorp, E. O., & Kassouf, S. T. (1967). <em>Beat the market: a scientific stock market system</em>. Random House.</p>
<p>[B] Breiman, L. (1961). Optimal gambling systems for favorable games. <a href="https://apps.dtic.mil/dtic/tr/fulltext/u2/402290.pdf">pdf</a></p>
<p>[P] Poundstone, W. (2010). <em>Fortune’s formula: The untold story of the scientific betting system that beat the casinos and Wall Street</em>. Hill and Wang.</p>
<p>[MTZZ] MacLean, L. C., Thorp, E. O., Zhao, Y., & Ziemba, W. T. (2011). How Does the Fortune’s Formula Kelly Capital Growth Model Perform?. <em>The Journal of Portfolio Management</em>, <em>37</em>(4), 96-111. <a href="http://hari.seshadri.com/docs/kelly-betting/kelly1.pdf">pdf</a></p>
<p>[T2] Thorp, E. O. (2011). Understanding the Kelly criterion. In <em>The Kelly Capital Growth Investment Criterion: Theory and Practice</em> (pp. 509-523).</p>
<p>[T3] Thorp, E. O. (2011). The Kelly criterion in blackjack, sports betting, and the stock market. In <em>The Kelly Capital Growth Investment Criterion: Theory and Practice</em> (pp. 789-832). <a href="https://www.bjrnet.com/thorp/Thorp_KellyCriterion.pdf">pdf</a></p>
<p>[C1] Cover, T. (1984). An algorithm for maximizing expected log investment return. <em>IEEE Transactions on Information Theory</em>, <em>30</em>(2), 369-373. <a href="https://ieeexplore.ieee.org/abstract/document/1056869">pdf</a></p>
<p>[C2] Cover, T. M. (2011). Universal portfolios. In <em>The Kelly Capital Growth Investment Criterion: Theory and Practice</em> (pp. 181-209). (original reference: Cover, T. M. (1991). Universal Portfolios. <em>Mathematical Finance</em>, <em>1</em>(1), 1-29.) <a href="https://stuff.mit.edu/afs/athena.mit.edu/course/6/6.962/www/www_fall_2001/shaas/universal_portfolios.pdf">pdf</a></p>
<p>[CO] Cover, T. M., & Ordentlich, E. (1996). Universal portfolios with side information. <em>IEEE Transactions on Information Theory</em>, <em>42</em>(2), 348-363. <a href="https://pdfs.semanticscholar.org/e3f5/037e8ad65deead506235d0c25d07f7d3f0d6.pdf">pdf</a></p>
<p>[BK] Blum, A., & Kalai, A. (1999). Universal portfolios with and without transaction costs. <em>Machine Learning</em>, <em>35</em>(3), 193-205. <a href="https://link.springer.com/content/pdf/10.1023/A:1007530728748.pdf">pdf</a></p>
<p>[KV] Kalai, A., & Vempala, S. (2002). Efficient algorithms for universal portfolios. <em>Journal of Machine Learning Research</em>, <em>3</em>(Nov), 423-440. <a href="http://www.jmlr.org/papers/volume3/kalai02a/kalai02a.pdf">pdf</a></p>
<p>[HSSW] Helmbold, D. P., Schapire, R. E., Singer, Y., & Warmuth, M. K. (1998). On‐Line Portfolio Selection Using Multiplicative Updates. <em>Mathematical Finance</em>, <em>8</em>(4), 325-347. <a href="http://web.cs.iastate.edu/~honavar/portfolio-selection.pdf">pdf</a></p>
<p>[Z] Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03) (pp. 928-936). <a href="http://www.aaai.org/Papers/ICML/2003/ICML03-120.pdf">pdf</a></p>
<p>[NY] Nemirovsky, A. S., & Yudin, D. B. (1983). Problem complexity and method efficiency in optimization.</p>
<p>[AHK] Arora, S., Hazan, E., & Kale, S. (2012). The multiplicative weights update method: a meta-algorithm and applications. <em>Theory of Computing</em>, <em>8</em>(1), 121-164. <a href="http://www.theoryofcomputing.org/articles/v008a006/v008a006.pdf">pdf</a></p>
<p>[HAK] Hazan, E., Agarwal, A., & Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. <em>Machine Learning</em>, <em>69</em>(2-3), 169-192. <a href="https://link.springer.com/content/pdf/10.1007/s10994-007-5016-8.pdf">pdf</a></p>
<p>[DGU] DeMiguel, V., Garlappi, L., & Uppal, R. (2007). Optimal versus naive diversification: How inefficient is the 1/N portfolio strategy?. <em>The Review of Financial Studies</em>, <em>22</em>(5), 1915-1953. <a href="https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1031.3574&rep=rep1&type=pdf">pdf</a></p>
<p><br /></p>
<h4 id="changelog">Changelog</h4>
<p>09/07/2019: Fixed several typos and added reference [DGU] as pointed out by Steve Wright.</p>
Sebastian Pokutta
TL;DR: How to (not) get rich? Running Universal Portfolios online with Online Convex Optimization techniques.
Toolchain Tuesday No. 6
2019-08-19T02:00:00+02:00
http://www.pokutta.com/blog/random/2019/08/19/toolchain-6
<p><em>TL;DR: Part of a series of posts about tools, services, and packages that I use in day-to-day operations to boost efficiency and free up time for the things that really matter. This time around will be about privacy tools. Use at your own risk - happy to answer questions. For the full, continuously expanding list so far see <a href="/blog/pages/toolchain.html">here</a>.</em>
<!--more--></p>
<p><em>Disclaimer: I am not a security or privacy expert. Do your own due diligence and consider the following as pointers only.</em></p>
<p>With a lot of high profile data leaks (see e.g., <a href="https://www.nytimes.com/2019/07/30/business/capital-one-breach.html">here</a>, <a href="https://about.flipboard.com/support-information-incident-May-2019/">here</a>, <a href="https://www.upguard.com/breaches/rsync-oklahoma-securities-commission">here</a>, <a href="https://www.upguard.com/breaches/facebook-user-data-leak">here</a>, <a href="https://www.forbes.com/sites/zakdoffman/2019/07/16/whatsapptelegram-issue-has-put-a-billion-users-at-risk-check-your-settings-now/#5866d865ab88">here</a>, and <a href="https://research.checkpoint.com/hacking-fortnite/">here</a>), considerations to systematically weaken encryption via backdoors (see e.g., <a href="https://www.engadget.com/2019/07/31/how-ag-barr-is-going-to-get-encryption-backdoors/">here</a>, <a href="https://www.financemagnates.com/cryptocurrency/news/is-facebook-building-the-tools-to-end-encryption-forever/">here</a>, <a href="https://www.forbes.com/sites/kalevleetaru/2019/05/28/facebook-is-already-working-towards-germanys-end-to-end-encryption-backdoor-vision/#37d78a154e4a">here</a>, and <a href="https://www.forbes.com/sites/kalevleetaru/2019/07/26/the-encryption-debate-is-over-dead-at-the-hands-of-facebook/#b8654cd53626">here</a>) to e.g., spy on your WhatsApp messages while still giving the impression of strong end-to-end encryption, <a href="https://eugdpr.org/">GDPR</a> coming into (full) effect (see e.g., <a href="https://www.darkreading.com/endpoint/privacy/companies-anonymized-data-may-violate-gdpr-privacy-regs/d/d-id/1335361">here</a>, <a href="https://www.computerweekly.com/news/252467726/GDPR-taken-more-seriously-after-first-fines">here</a>), and in general all types of issues with user tracking, selling data, etc., I thought it might be a good time to talk about privacy tools. There are tons of great tools out there, however I will only be able to touch upon a few, notably those that I have first-hand experience with. 
If there is a tool that you think should be here, drop me a line.</p>
<p>I will not dive into the question <em>why privacy tools are useful/necessary</em>; this has been done in many places elsewhere. However, I believe it is fair to say that it is quite hard to get a true grasp of what data is really collected how, where, when, and for what purpose (even ignoring higher-order cross-referencing etc.). GDPR is not going to change that, with or without you clicking hundreds of “I accept/consent” buttons a day. In fact GDPR will likely amplify the current imbalance in favor of large tech companies as individuals or smaller companies might outsource their data storage etc. to big tech, rather than trying to navigate the complexities of GDPR themselves. All the more reason to think about your data. Just to get a glimpse, if you have not done so yet: download your facebook or google data (how to <a href="https://www.dataislife.net/download-entire-facebook-account-data-2019/">download your facebook data</a> or <a href="https://www.cnbc.com/2018/03/29/how-to-download-a-copy-of-everything-google-knows-about-you.html">download your google data</a>). Then take half an afternoon to browse through the data trove; the experience might be quite sobering.</p>
<p>I would like to stress that the tools below <em>do not</em> guarantee privacy or safety etc. In particular, even with those tools in place your working assumption should be that <em>there is always a risk of exploits</em> and side channels etc. For example, suppose that through a backdoor, e.g., a keylogger is running on your phone or computer, then even the best tools cannot protect you; in the words of <a href="https://www.forbes.com/sites/kalevleetaru/2019/07/26/the-encryption-debate-is-over-dead-at-the-hands-of-facebook/#75e449ce5362">one of the articles</a> from above:</p>
<blockquote>
<p>The ability of encryption to shield a user’s communications rests upon the assumption that the sender and recipient’s devices are themselves secure, with the encrypted channel the only weak point.</p>
</blockquote>
<h2 id="messaging">Messaging:</h2>
<h3 id="signal">Signal</h3>
<p>The messaging app <code class="highlighter-rouge">Signal</code>, which is available for iOS and Android (as well as on Mac/PC but requiring a phone with <code class="highlighter-rouge">Signal</code>), is considered one of the most secure messaging apps (see, e.g., <a href="https://www.boxcryptor.com/en/blog/post/encryption-comparison-secure-messaging-apps/">here</a>). There are several other messaging apps out there that also use the <code class="highlighter-rouge">signal protocol</code>, however <code class="highlighter-rouge">Signal</code> has the important advantage that it is <a href="https://github.com/signalapp">open source</a>. Moreover it has been scrutinized and reviewed by security experts and while some messengers like <code class="highlighter-rouge">WhatsApp</code> basically use the same protocol, you have no idea whether the app is trustworthy or whether it contains backdoors.</p>
<p>In terms of learning curve, it basically works like your favorite messaging app however lacking some of the bells and whistles. <code class="highlighter-rouge">Signal</code> also supports voice and video calls.</p>
<p><em>Learning curve: ⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://signal.org/">https://signal.org/</a></em></p>
<h3 id="threema">Threema</h3>
<p>Another choice is <code class="highlighter-rouge">Threema</code>. It is also considered a very good messenger with strong encryption, however the code is not open source. This precludes public code review by security experts, which traditionally has led to hardened code with fewer exploits. Apparently there has been some closed-door code review though.</p>
<p><code class="highlighter-rouge">Threema</code> looks like your favorite messenger with a set of features comparable to <code class="highlighter-rouge">Signal</code>.</p>
<p><em>Learning curve: ⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://threema.ch/en">https://threema.ch/en</a></em></p>
<p><em>For more on secure messengers etc, the website <a href="https://www.securemessagingapps.com">securemessagingapps</a> provides a general overview of the different messengers and their security/privacy features.</em></p>
<h2 id="webbrowser">Webbrowser:</h2>
<h3 id="brave">Brave</h3>
<p>Based on the same <code class="highlighter-rouge">Chromium</code> backend as <code class="highlighter-rouge">Google Chrome</code>, <code class="highlighter-rouge">Brave</code> is a very fast webbrowser with extensive privacy tools, blocking various trackers, cookies, and fingerprinting. Moreover, it supports most of the <code class="highlighter-rouge">Chrome</code> extensions, which is quite useful, and basically you can change from <code class="highlighter-rouge">Chrome</code> to <code class="highlighter-rouge">Brave</code> with little to no work. <code class="highlighter-rouge">Brave</code> also supports an experimental model called <code class="highlighter-rouge">Brave Rewards</code> to support content creators not through ads but through a micropayment-like system:</p>
<blockquote>
<p>Activate Brave Rewards (available on desktop only) and give a little back to the sites you frequent most. Help fund the content you love – even when you block ads.</p>
<p>Browsing the web with Brave is free: with Brave Rewards activated, you can support the content creators you love at the amount that works for you.</p>
</blockquote>
<p>Finally <code class="highlighter-rouge">Brave</code> is very fast, in fact much faster than <code class="highlighter-rouge">Chrome</code>, probably due to blocking tons of scripts, trackers etc. <code class="highlighter-rouge">Brave</code> is also available on mobile (iOS and Android). Also, for the geeks, <code class="highlighter-rouge">Brave</code> supports <a href="https://ipfs.io/">IPFS</a> and <a href="https://www.torproject.org/">Tor</a> directly out of the box.</p>
<p><em>Learning curve: ⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://www.brave.com/">https://www.brave.com/</a></em></p>
<h3 id="firefox">Firefox</h3>
<p>Another great browser is <code class="highlighter-rouge">Firefox</code>, which also comes with extensive privacy tools and is arguably one of the first browsers to take privacy seriously. <code class="highlighter-rouge">Firefox</code> is a great browser, however I opted for using <code class="highlighter-rouge">Brave</code> for compatibility reasons (see, e.g., <a href="https://www.theverge.com/2019/3/4/18249623/brave-browser-choice-chrome-vivaldi-replacement-chromium">this article on The Verge</a> for some discussion).</p>
<p><em>Learning curve: ⭐️⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://www.mozilla.org/en-US/firefox/">https://www.mozilla.org/en-US/firefox/</a></em></p>
<h2 id="search-engine">Search Engine:</h2>
<h3 id="duckduckgo">DuckDuckGo</h3>
<p>Complementing a privacy-enhanced webbrowser, it makes sense to consider a search engine that respects your privacy. A great choice is <code class="highlighter-rouge">DuckDuckGo</code>. While not perfect, for your normal day to day use it gets the job more than done and if you feel that you are missing out on something, you can always head over to <code class="highlighter-rouge">google</code>.</p>
<p><em>Learning curve: ⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://duckduckgo.com/">https://duckduckgo.com/</a></em></p>
<h2 id="encryption">Encryption:</h2>
<h3 id="gnupg">GnuPG</h3>
<p>Last, but not least: encryption. This does not directly relate to privacy in the sense from above but is of no lesser importance. If you need state-of-the-art encryption both for emails and also files, then <code class="highlighter-rouge">GnuPG</code> is the answer. It takes some time to get used to but both the tools and underlying protocols are rock solid and it is open source.</p>
<p><em>Learning curve: ⭐️⭐️⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://gnupg.org/">https://gnupg.org/</a></em></p>
Conditional Gradients and Acceleration
2019-07-04T01:00:00+02:00
http://www.pokutta.com/blog/research/2019/07/04/LaCG-abstract
<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/1906.07867">Locally Accelerated Conditional Gradients</a> [CDP] with Alejandro Carderera and Jelena Diakonikolas, showing that although optimal global convergence rates cannot be achieved for Conditional Gradients [CG], acceleration can be achieved after a burn-in phase independent of the accuracy $\epsilon$, giving rise to asymptotically optimal rates, accessing the feasible region only through a linear minimization oracle.</em></p>
<!--more-->
<p><em>Post written by Alejandro Carderera.</em></p>
<p>In this post we will consider optimization problems of the form</p>
<script type="math/tex; mode=display">\min_{x \in P} f(x),</script>
<p>where $f$ is an $L$-smooth convex or $\mu$-strongly convex function and $P$ is a polytope (in particular compact and convex). We will denote a solution to this problem as $x^\esx = \arg\min_{x \in P} f(x)$; note that the solution is unique in the $\mu$-strongly convex case.</p>
<p>In the previous blog post <a href="/blog/research/2019/06/10/cheatsheet-acceleration-first-principles.html">Cheat Sheet: Acceleration from First Principles</a> we saw that we can achieve acceleration (going from $\mathcal{O} ( \frac{L}{\mu} \log \frac{1}{\epsilon} )$ to $\mathcal{O} ( \sqrt{\frac{L}{\mu}} \log \frac{1}{\epsilon} )$ iterations for $L$-smooth $\mu$-strongly convex functions and from $\mathcal{O} (1/\epsilon )$ to $\mathcal{O} ( 1/\sqrt{\epsilon} )$ iterations for (general) $L$-smooth convex functions to achieve an accuracy of $\epsilon$) by simultaneously optimizing the primal and dual update, as opposed to <em>only</em> optimizing the primal or dual update, as vanilla gradient descent and mirror descent do, respectively. A natural question to ask is, can <em>Conditional Gradients</em> (CG) [CG, FW] be accelerated in that same spirit? This would effectively mean that we have a projection-free first-order constrained optimization algorithm with optimal convergence rates for smooth convex and strongly convex functions. Let’s explore the matter in more depth; we only consider the strongly convex case in the following.</p>
<h2 id="global-acceleration-a-fundamental-barrier">Global Acceleration: A fundamental barrier</h2>
<p>Can we achieve accelerated convergence rates globally for CG, i.e., can they hold for all problem instances, throughout the whole algorithm? Let us start with the example already considered in previous posts (<a href="/blog/research/2018/10/05/cheatsheet-fw.html">Cheat Sheet: Frank-Wolfe and Conditional Gradients</a> and <a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">Cheat Sheet: Linear convergence for Conditional Gradients</a>) which was studied in depth in terms of complexity in [L].</p>
<p class="mathcol"><strong>Example:</strong> Consider the function $f(x) \doteq \norm{x}^2$, which is strongly convex and the polytope $P = \mathrm{conv} \setb{ e_1,\dots, e_n} \subseteq \mathbb{R}^n$, the probability simplex in dimension $n$. We want to solve $\min_{x \in P} f(x)$. Clearly, the optimal solution is $x^\esx = (\frac{1}{n}, \dots, \frac{1}{n})$. Recall that at each iteration we call the <em>first-order oracle</em> to obtain a gradient, and the <em>linear optimization oracle</em> to solve a linear program. Whenever we call the linear programming oracle, we will obtain one of the $e_i$ vectors and in lieu of any other information but that the feasible region is convex, we can only form convex combinations of those. Thus after $t$ iterations, the best we can produce as a convex combination is a vector with support $t$, where the minimizer of such vectors for $f(x)$ is, e.g., $x_t = (\frac{1}{t}, \dots,\frac{1}{t},0,\dots,0)$ with $t$ times $1/t$ entries, so that we obtain a gap
<script type="math/tex">h(x_t) \doteq f(x_t) - f(x^\esx) \geq \frac{1}{t}-\frac{1}{n},</script>
which after requiring $\frac{1}{t}-\frac{1}{n} < \varepsilon$ implies $t > \frac{1}{\varepsilon + 1/n} \approx \frac{1}{\varepsilon}$ for $n$ large. In particular, for $t = n/2$, assuming $n$ is even, it holds that:
\[h(x_{t}) \geq \frac{1}{t}-\frac{1}{n} = \frac{1}{2t}.\]</p>
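<p>The lower bound can be checked numerically with a minimal sketch, assuming vanilla Frank-Wolfe with exact line search started at the vertex $e_1$ (the function name <code>frank_wolfe_simplex</code> and the instance size are illustrative choices, not taken from the paper). Since we start at a vertex, after $k$ steps the iterate has support of size at most $k+1$, so the primal gap should never fall below $\frac{1}{k+1} - \frac{1}{n}$:</p>

```python
def frank_wolfe_simplex(n, iters):
    """Vanilla Frank-Wolfe with exact line search for f(x) = ||x||^2 over the simplex.

    Returns the primal gaps h(x_k) for k = 1, ..., iters.
    """
    x = [0.0] * n
    x[0] = 1.0                     # start at the vertex e_1 (support of size 1)
    f_opt = 1.0 / n                # f(x*) for x* = (1/n, ..., 1/n)
    gaps = []
    for _ in range(iters):
        grad = [2.0 * xi for xi in x]
        i = min(range(n), key=grad.__getitem__)   # LMO over the simplex returns e_i
        d = [-xi for xi in x]
        d[i] += 1.0                               # direction e_i - x
        dd = sum(di * di for di in d)
        if dd == 0.0:
            break
        # Exact line search for the quadratic: gamma = -<grad, d> / (2 ||d||^2)
        gamma = -sum(g * di for g, di in zip(grad, d)) / (2.0 * dd)
        gamma = max(0.0, min(1.0, gamma))
        x = [xi + gamma * di for xi, di in zip(x, d)]
        gaps.append(sum(xi * xi for xi in x) - f_opt)
    return gaps


n = 50
gaps = frank_wolfe_simplex(n, 25)
# Compare every gap with the support-based lower bound 1/(k+1) - 1/n.
lower_bounds = [1.0 / (k + 2) - 1.0 / n for k in range(len(gaps))]
```

<p>On this particular instance the exact line search spreads the weight uniformly over the vertices picked up so far, so the computed gaps should essentially coincide with the lower bound while $k+1 \leq n$.</p>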
<p>As is clear from this lower bound on the primal gap, a convergence rate of $\mathcal{O}(1/t^2)$ cannot be attained in general. Moreover, note that the objective function is strongly convex. Recall from the post <a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">Cheat Sheet: Linear convergence for Conditional Gradients</a> that an algorithm has a linear convergence rate if it contracts the primal gap as $f(x_t) - f^\esx \leq e^{-rt} (f(x_0) - f^\esx)$ with $r>0$. If we apply the previous lower bound to a convergence rate of this form for $t = n/2$, i.e., $\frac{1}{2t} \leq f(x_t) - f^\esx \leq e^{-rt} (f(x_0) - f^\esx)$, and we note that $f(x_0) - f^\esx \leq 1$, we arrive at $r \leq 2 \frac{\log 2t}{2t}$. Consider the linearly convergent primal gap evolution of the Fully-Corrective Frank-Wolfe variant [LJ], which has the form $f(x_t) - f^\esx \leq \left(1 - \frac{\mu}{4L} \left( \frac{w(P)}{D} \right)^2 \right)^t (f(x_0) - f^\esx)$, where $w(P)$ is the pyramidal width of $P$ and $D$ is the diameter of $P$. For the problem instance we are considering, with $D = \sqrt{2}$, $\mu = L = 1$, and $w(P) = 2/\sqrt{n}$, the contraction factor is $1 - \frac{1}{2n}$, so that $r \approx \frac{1}{2n} = \frac{1}{4t}$. Therefore the convergence rate of this Frank-Wolfe variant can be improved by at most a logarithmic factor, and a convergence rate of $\mathcal{O} ( \sqrt{\frac{L}{\mu}} \log \frac{1}{\epsilon} )$ is not globally attainable in terms of calls to the <em>linear optimization oracle</em>. Similar conclusions can be drawn for the Away-Step and Pairwise-Step Frank-Wolfe algorithms.</p>
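<p>To make the size of the logarithmic factor concrete, here is a small sketch that evaluates both quantities from the argument above for a given dimension $n$ (the helper name <code>rate_comparison</code> is hypothetical; the constants $D^2 = 2$, $w(P)^2 = 4/n$ and $\mu = L = 1$ are the ones stated in the text):</p>

```python
import math


def rate_comparison(n, mu=1.0, L=1.0):
    """Compare the best possible linear rate r with the Fully-Corrective FW rate."""
    t = n // 2
    # From 1/(2t) <= exp(-r t) we get r <= log(2t) / t.
    r_upper = math.log(2 * t) / t
    diameter_sq = 2.0                  # D^2 for the probability simplex
    pyramidal_width_sq = 4.0 / n       # w(P)^2 for the probability simplex
    # Contraction factor is 1 - r_fcfw with r_fcfw = (mu / 4L) * (w(P) / D)^2 = 1/(2n).
    r_fcfw = (mu / (4.0 * L)) * pyramidal_width_sq / diameter_sq
    return r_upper, r_fcfw


r_upper, r_fcfw = rate_comparison(1000)
ratio = r_upper / r_fcfw   # equals 4 * log(n) for even n
```

<p>For even $n$ the ratio of the two rates is $4 \log n$, which is the logarithmic factor by which the guarantee could at best be improved on this instance.</p>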
<p>Despite the previous global conclusion in terms of linear optimization oracle calls, the Conditional Gradient Sliding algorithm [LZ] achieves a globally optimal convergence rate in terms of <em>first-order oracle calls</em>: it requires $\mathcal{O} ( \sqrt{\frac{L}{\mu}} \log \frac{1}{\epsilon} )$ first-order oracle calls, i.e., gradient evaluations, and $\mathcal{O} ( \frac{L}{\mu} \log \frac{1}{\epsilon} )$ linear optimization oracle calls to achieve an accuracy of $\epsilon$ for $L$-smooth and $\mu$-strongly convex functions. In light of the barrier to global acceleration, can we achieve acceleration in terms of calls to the linear optimization oracle locally with CG? Bear in mind that the lower bound holds only up to the dimension $n$. As such, this bound does not answer the question of whether true acceleration is possible beyond the dimension with respect to both first-order oracle calls <em>and</em> linear optimization oracle calls.</p>
<h2 id="local-acceleration-beyond-the-dimension">Local Acceleration beyond the Dimension</h2>
<p>As the title of this post already suggests, an optimal convergence rate can be achieved with CG, but only <em>locally</em>. We will call this algorithm the <em>Locally Accelerated Conditional Gradient</em> algorithm, obtained as the combination of a linearly convergent CG method and a novel accelerated algorithm. The key ingredient is a <em>Generalized Accelerated Method</em>, an extension of the $\mu AGD+$ algorithm from [CDO], which we call the <em>Modified $\mu AGD+$</em> algorithm, that can be coupled with an alternative algorithm (in this case CG, which is run independently), in such a way that at each iteration the point with lowest objective function value is chosen. The Modified $\mu AGD+$ is run on the convex hull of a set $\mathcal{C}$ (and therefore the point from the alternative algorithm also has to be in $\mathcal{C}$), and if $x^\esx \in \operatorname{conv} \mathcal{C}$ it will converge to $x^\esx$ with an optimal convergence rate in terms of first-order oracle and linear optimization oracle calls. However we cannot include all the vertices of $P$ in $\mathcal{C}$ as then each step of the Modified $\mu AGD+$ algorithm will be prohibitively expensive; note that a polytope can have a number of vertices exponential in the dimension. Ideally what we want is for $\mathcal{C}$ to be formed by the minimum number of vertices of $P$, such that $x^\esx \in \operatorname{conv} \mathcal{C}$ and we know the cardinality of this set should be at most $n+1$ by Carathéodory’s theorem. Note that the use of the Modified $\mu AGD+$ algorithm relaxes the projection-free requirement of the algorithm, as at each iteration it has to solve a convex subproblem over the convex hull of the elements in $\mathcal{C}$ (which will be equivalent to solving a projection problem over a simplex of low dimension).</p>
<p>Let us denote by $\mathcal{C}_t$ the set $\mathcal{C}$ at iteration $t$. The high-level description from above brings to mind the following question: How do we select $\mathcal{C}_t$ at each iteration $t$? This is where the CG algorithm comes in. Consider choosing a CG variant such as the <em>Away-Step Frank-Wolfe</em> (AFW) algorithm with the short-step rule (as detailed in <a href="/blog/research/2018/10/05/cheatsheet-fw.html">Cheat Sheet: Frank-Wolfe and Conditional Gradients</a>). The AFW algorithm maintains its iterates $x_t^{\text{AFW}}$ as a convex combination of the extreme points of $P$. Let us denote by $\mathcal{S}_t^{\text{AFW}}$ the set of these vertices,</p>
<script type="math/tex; mode=display">\mathcal{S}_t^{\text{AFW}} = \setb{ \setb{v_1, \cdots, v_m} \;\vert\; v_i \in P, \lambda_{i} > 0 \; \forall i \in [1,\cdots ,m]; x_t^{\text{AFW}} = \sum_{j = 1}^{m} \lambda_{j}v_j, \sum_{j = 1}^{m} \lambda_{j} =1}.</script>
<p>We also know that by the finiteness of the number of active sets and the closedness of $P$ there exist $r>0$ and $T \in \mathbb{N}_+$ such that $\norm{x_t^{\text{AFW}} - x^\esx} < r$ and $x^\esx \in \operatorname{conv} \mathcal{S}_t^{\text{AFW}}$ for all $t \geq T$. The values of $r$ and $T$ are independent of $\epsilon$ and depend only on $f$ and $P$. It therefore seems natural to consider the $\mathcal{S}_t^{\text{AFW}}$ as potential candidates for $\mathcal{C}_t$, i.e., the CG algorithm is used to <em>identify</em> a small active set such that $x^\esx \in \operatorname{conv} \mathcal{C}_t$, in what we will call a burn-in phase, after which the Modified $\mu AGD+$ algorithm will drive the iterates with the optimal convergence rate.</p>
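<p>The identification of the active set can be illustrated with a minimal sketch of AFW with the short-step rule. The setup is an assumption made purely for illustration: the feasible region is the probability simplex (so the vertices are the unit vectors and the active set is simply the support of the iterate), the objective is $f(x) = \frac{1}{2}\norm{x - b}^2$ with $L = \mu = 1$, and the names <code>afw_simplex</code> and <code>b</code> are hypothetical. This is not the LaCG implementation from [CDP], only the AFW component on a toy instance:</p>

```python
def afw_simplex(b, start, iters=2000, tol=1e-14):
    """Away-step Frank-Wolfe (short step, L = 1) for f(x) = 0.5 * ||x - b||^2 on the simplex.

    Returns the final iterate and its active set (the support of the iterate).
    """
    n = len(b)
    x = [0.0] * n
    x[start] = 1.0                                    # start at the vertex e_start
    for _ in range(iters):
        grad = [xi - bi for xi, bi in zip(x, b)]
        active = [i for i in range(n) if x[i] > 0.0]  # current active set S_t
        s = min(range(n), key=grad.__getitem__)       # Frank-Wolfe vertex (LMO call)
        a = max(active, key=grad.__getitem__)         # away vertex, searched in S_t only
        gx = sum(g * xi for g, xi in zip(grad, x))
        fw_gap = gx - grad[s]                         # <grad, x - e_s>
        away_gap = grad[a] - gx                       # <grad, e_a - x>
        if fw_gap <= tol:
            break
        if fw_gap >= away_gap:                        # Frank-Wolfe step towards e_s
            d = [-xi for xi in x]
            d[s] += 1.0
            gap, gamma_max, is_away = fw_gap, 1.0, False
        else:                                         # away step, moving away from e_a
            d = list(x)
            d[a] -= 1.0
            gap, gamma_max, is_away = away_gap, x[a] / (1.0 - x[a]), True
        gamma = min(gamma_max, gap / sum(di * di for di in d))  # short step with L = 1
        x = [max(0.0, xi + gamma * di) for xi, di in zip(x, d)]
        if is_away and gamma == gamma_max:
            x[a] = 0.0                                # drop step: e_a leaves the active set
    return x, [i for i in range(n) if x[i] > 0.0]


# Toy instance: the constrained optimum is supported on the first three vertices only.
b = [0.6, 0.5, 0.4, -1.0, -1.0, -1.0]
x_afw, active_set = afw_simplex(b, start=5)
```

<p>Here the optimum is the Euclidean projection of $b$ onto the simplex, $x^\esx = (0.6 - \frac{1}{6}, 0.5 - \frac{1}{6}, 0.4 - \frac{1}{6}, 0, 0, 0)$, so even though the run starts at a vertex outside the optimal face, the active set returned by AFW is expected to consist of exactly the three vertices needed to write $x^\esx$ as a convex combination.</p>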
<p>However, we do not know a priori what the value of $T$ is, and we have no way of certifying that for a given iteration $t$ we have $x^\esx \in \operatorname{conv}\mathcal{S}_t^{\text{AFW}}$. Otherwise we could simply run AFW for $T$ iterations, set <script type="math/tex">\mathcal{C}_t = \mathcal{S}_T^{\text{AFW}},</script> and then run the Modified $\mu AGD+$ algorithm in the convex hull of <script type="math/tex">\mathcal{C}_t</script> from then onwards. This is why LaCG runs the AFW algorithm and the Modified $\mu AGD+$ algorithm simultaneously, updating the set <script type="math/tex">\mathcal{C}_{t}</script> relative to <script type="math/tex">\mathcal{S}_t^{\text{AFW}}</script> or restarting the Modified $\mu AGD+$ algorithm when certain conditions are met. As long as $t \leq T$ the convergence is driven by AFW, and once $t >T$ the convergence is driven by the Modified $\mu AGD+$ algorithm. Note that we still restart the accelerated sequence even when $t > T$ (simply because we do not know when we have reached $T$); however, the restarts are <em>delayed</em> to ensure sufficient progress is made (losing at most a factor of 2 in the convergence rate). In practice this regime shift might actually happen already before reaching $T$.</p>
<p>As the careful reader will have noticed, at each iteration we’ll be performing an AFW step and a Modified $\mu AGD+$ step. When $x^\esx \in \operatorname{conv} \mathcal{C}_{t}$, LaCG will have a convergence rate of $\mathcal{O} ( \sqrt{\frac{L}{\mu}} \log \frac{1}{\epsilon} )$ for $L$-smooth and $\mu$-strongly convex functions, as opposed to $\mathcal{O} ( \frac{L}{\mu} \log \frac{1}{\epsilon} )$ for AFW, so we’ll have greater progress-per-iteration compared to AFW. Note also that the accelerated convergence rate we obtain is independent of the dimension $n$. This is not the case for the Catalyst-augmented variant we benchmark against, as we will see below. But how will we perform in terms of wall-clock time?</p>
<h2 id="computational-experiments">Computational Experiments</h2>
<p>We consider three different feasible regions $P$:</p>
<ol>
<li>Probability Simplex $\Delta$</li>
<li>$\ell_1$-unit ball.</li>
<li>Birkhoff polytope.</li>
</ol>
<p>Note that for all instances considered here the global minimum of the objective function is not contained in the feasible region $P$, and therefore the constrained optimum lies on the boundary of $P$.</p>
<h4 id="optimization-over-the-probability-simplex">Optimization over the probability simplex</h4>
<p>Let’s see how our algorithm performs in practice. Consider an $L$-smooth and $\mu$-strongly convex quadratic of the form $f(x) = x^T Mx$, with $L / \mu = 1000$ and $x \in \mathbb{R}^n$ for $n=2000$. Our feasible region $P$ will be the probability simplex (note that this case could also be easily handled with projected AGD). The following image shows on top the normalized primal gap $h(x_t)/h(x_0)$ in terms of the number of iterations for three different methods: the LaCG algorithm, the AFW algorithm, and a Catalyst-augmented AFW algorithm [LMH], which has a <em>global</em> convergence rate in terms of linear oracle calls of $\mathcal{O} ( \sqrt{\frac{L}{\mu}} \frac{D^2}{\delta^2} \log \frac{1}{\epsilon} )$. The image on the bottom shows the percentage of optimal vertices (those that are <em>required</em> to write the optimal solution as a convex combination) that have been picked up by the AFW algorithm. When the black line reaches $100\%$ on the y-axis, it means that <script type="math/tex">x^\esx \in \operatorname{conv} S_{t}^{\text{AFW}},</script> and therefore if we pick <script type="math/tex">\mathcal{C}_{t} = S_{t}^{\text{AFW}}</script> we’ll observe our sought-after acceleration, i.e., a sharp speed-up in convergence. Observe that the LaCG algorithm does not require knowledge of the optimal active set, and there could be multiple optimal and minimal active sets. The black dots indicate the iterations at which LaCG restarts the Modified $\mu AGD+$ algorithm and we set <script type="math/tex">\mathcal{C}_{t} = S_{t}^{\text{AFW}}</script>.</p>
<p class="minimg"><img src="http://www.pokutta.com/blog/assets/lacg/Simplex_It_2019-06-27-16-51-15_Size2000.png" alt="Convergence in terms of iterations" /></p>
<p>As mentioned previously, the LaCG algorithm makes at least as much progress per iteration as AFW. However, we can see in the graph that the algorithm makes substantially more progress per iteration than AFW, even before it has had time to pick up all the optimal vertices; we have partial acceleration already on the sub-optimal active sets. We can also observe that the Catalyst-augmented AFW has a pretty slow convergence rate, looking carefully at the fact that $D^2 = 2$ and $\delta^2 = 4/n$ for the simplex we arrive at $\frac{D^2}{\delta^2} = 1000$, leading to the observation that $\sqrt{\frac{L}{\mu}} \frac{D^2}{\delta^2} \geq \frac{L}{\mu}$ in the convergence rate for this problem instance. This inevitable dimension dependence is one of the shortcomings of the Catalyst method, and it is also the reason why this method does not violate the lower bound we mentioned previously and can obtain global convergence rates that depend on $\sqrt{\frac{L}{\mu}}$. The following image shows the behavior of the algorithms in terms of wall-clock time.</p>
<p class="minimg"><img src="http://www.pokutta.com/blog/assets/lacg/Simplex_t_2019-06-27-16-51-15_Size2000.png" alt="Convergence in terms of time" /></p>
<p>The beauty of LaCG is that due to the small dimensionality of the sets <script type="math/tex">\mathcal{C}_{t}</script>, we can perform the Modified $\mu AGD+$ steps efficiently, and even when we have not picked up all the optimal vertices, we are still competitive with AFW in terms of wall-clock time. Note the speed-up seen in the previous plot once <script type="math/tex">x^\esx \in \operatorname{conv} \mathcal{C}_{t}</script>.</p>
<h4 id="optimization-over-the-ell_1-unit-ball">Optimization over the $\ell_1$-unit ball</h4>
<p>The $\ell_1$ ball is often encountered in Lasso regression problems, where the aim is to find an accurate yet interpretable solution to a least-squares problem. In this example the feasible region will be the $\ell_1$-unit ball, and the objective function will again be a quadratic of the same form with a global minimum outside of the $\ell_1$-unit ball, with $L / \mu = 100$ and $x \in \mathbb{R}^n$ for $n=2000$. The following experiments were run for $240$ seconds of wall-clock time.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/lacg/L1_2019-06-28-16-37-21_Size2000.png" alt="Convergence (l1 ball)" /></p>
<p>As was the case for the probability simplex, LaCG outperforms both AFW and the Catalyst-augmented AFW algorithm in terms of progress per iteration and wall-clock time, outputting a solution that is more than an order of magnitude higher in accuracy in the same wall-clock time interval.</p>
<h4 id="optimization-over-the-birkhoff-polytope">Optimization over the Birkhoff polytope</h4>
<p>The <em>Birkhoff polytope</em>, also called the <em>polytope of doubly stochastic matrices</em>, or <em>assignment polytope</em>, is formed by the square matrices $Q \in \mathbb{R}^{n \times n}$ with nonnegative entries such that each row and each column of the matrix sums up to $1$. Its vertices are the $n!$ permutation matrices, which are matrices in $\mathbb{R}^{n \times n}$ that have exactly one entry equal to $1$ in each row and in each column, with all other entries equal to $0$. Projection in the Euclidean sense onto this polytope is not as easy as projecting onto the probability simplex, which makes projection-free methods like CG and LaCG of considerable interest. Consider again an $L$-smooth and $\mu$-strongly convex quadratic of the form $f(x) = x^T Mx$, with $L / \mu = 100$ and $x \in \mathbb{R}^{n \times n}$ for $n \times n = 1600$. Now our feasible region $P$ will be the Birkhoff polytope. The following images show the evolution of the normalized primal gap $h(x_t)/h(x_0)$ in terms of iterations and in terms of wall-clock time.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/lacg/Birkhoff_2019-05-30-14-47-33_Size40.png" alt="Convergence (Birkhoff)" /></p>
<p>As we saw in the example with the simplex, LaCG outperforms both AFW and the Catalyst-augmented AFW algorithm in terms of progress per iteration, and in terms of wall-clock time, even before the optimal face has been identified (which happens at approximately $t = 1000$). After this has happened, we see a phenomenal performance boost in both metrics. As before, note the effect of the dimension dependence of the convergence rate of the Catalyst-augmented AFW algorithm. This can be further seen in the following image, which shows a quadratic with the same $L/\mu$ over the Birkhoff polytope in $\mathbb{R}^{n \times n}$ for a smaller dimension $n \times n = 400$.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/lacg/Birkhoff_2019-05-23-04-28-01_Size20.png" alt="Convergence (Birkhoff small)" /></p>
<p>For lower dimensions, the Catalyst-augmented AFW becomes competitive with LaCG in terms of progress per iteration; however, LaCG performs much better in terms of wall-clock time. Another thing to note is the stair-like behaviour of the LaCG algorithm. This is due to the fact that between the restarts (the black circles in the image) we do not change the elements in <script type="math/tex">\mathcal{C}_{t}</script>, so LaCG converges to a solution of <script type="math/tex">\min_{x \in \operatorname{conv} \mathcal{C}_{t}} f(x)</script>, and we cannot guarantee that <script type="math/tex">x^\esx \in \operatorname{conv} \mathcal{C}_{t}</script>. However, due to the delayed restarts LaCG resets $\mathcal{C}_{t}$ often enough and converges to $x^\esx$ faster than AFW and the Catalyst-augmented AFW algorithm.</p>
<p>Some other important features of LaCG are:</p>
<ol>
<li>The Modified $\mu AGD+$ subproblems at each time step do not have to be solved to perfect optimality, instead they can be solved approximately to a tolerance $\epsilon_t$ (more details can be found in [CDP]). <br /></li>
<li>If we use a projection-free algorithm to solve the subproblems in the Modified $\mu AGD+$ step, we arrive at a fully projection-free algorithm. We recover a variant of the Conditional Gradient Sliding [LZ] algorithm: ignore the AFW steps in the LaCG algorithm, only perform Modified $\mu AGD+$ steps, and solve the subproblems with a CG algorithm.</li>
</ol>
<h3 id="references">References</h3>
<p>[CDP] Carderera, A., Diakonikolas, J., Pokutta, S., (2019). Locally Accelerated Conditional Gradients. arXiv preprint arXiv:1906.07867. <a href="https://arxiv.org/pdf/1906.07867.pdf">pdf</a></p>
<p>[CG] Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 6(5), 787-823. <a href="http://www.mathnet.ru/php/archive.phtml?wshow=paper&jrnid=zvmmf&paperid=7415&option_lang=eng">pdf</a></p>
<p>[FW] Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1‐2), 95-110. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109">pdf</a></p>
<p>[L] Lan, G., (2013). The complexity of large-scale convex programming under a linear optimization oracle. arXiv preprint arXiv:1309.5550. <a href="https://arxiv.org/pdf/1309.5550.pdf">pdf</a></p>
<p>[LJ] Lacoste-Julien, S., & Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems (pp. 496-504). <a href="http://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf">pdf</a></p>
<p>[LZ] Lan, G. and Zhou, Y., (2016). Conditional gradient sliding for convex optimization. SIAM Journal on Optimization, 26(2), pp.1379-1409. <a href="http://www.optimization-online.org/DB_FILE/2014/10/4605.pdf">pdf</a></p>
<p>[CDO] M. B. Cohen, J. Diakonikolas, and L. Orecchia, (2018). On acceleration with noise-corrupted gradients. In Proc. ICML’18. <a href="https://arxiv.org/pdf/1805.12591.pdf">pdf</a></p>
<p>[LMH] Lin, H., Mairal, J. and Harchaoui, Z., (2015). A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (pp. 3384-3392). <a href="http://papers.nips.cc/paper/5928-a-universal-catalyst-for-first-order-optimization.pdf">pdf</a></p>
<p>Alejandro Carderera</p>
<p>TL;DR: This is an informal summary of our recent paper Locally Accelerated Conditional Gradients [CDP] with Alejandro Carderera and Jelena Diakonikolas, showing that although optimal global convergence rates cannot be achieved for Conditional Gradients [CG], acceleration can be achieved after a burn-in phase independent of the accuracy $\epsilon$, giving rise to asymptotically optimal rates, accessing the feasible region only through a linear minimization oracle.</p>
<p>Cheat Sheet: Acceleration from First Principles, 2019-06-10T01:00:00+02:00, http://www.pokutta.com/blog/research/2019/06/10/cheatsheet-acceleration-first-principles</p>
<p><em>TL;DR: Cheat Sheet for a derivation of acceleration from optimization first principles.</em>
<!--more--></p>
<p><em>Posts in this series (so far).</em></p>
<ol>
<li><a href="/blog/research/2018/12/07/cheatsheet-smooth-idealized.html">Cheat Sheet: Smooth Convex Optimization</a></li>
<li><a href="/blog/research/2018/10/05/cheatsheet-fw.html">Cheat Sheet: Frank-Wolfe and Conditional Gradients</a></li>
<li><a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">Cheat Sheet: Linear convergence for Conditional Gradients</a></li>
<li><a href="/blog/research/2018/11/12/heb-conv.html">Cheat Sheet: Hölder Error Bounds (HEB) for Conditional Gradients</a></li>
<li><a href="/blog/research/2019/02/27/cheatsheet-nonsmooth.html">Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning</a></li>
<li><a href="/blog/research/2019/06/10/cheatsheet-acceleration-first-principles.html">Cheat Sheet: Acceleration from First Principles</a></li>
</ol>
<p><em>My apologies for incomplete references—this should merely serve as an overview.</em></p>
<p>Acceleration in smooth convex optimization has been met with awe and has been subject to extensive research over the last years. In a nutshell, what acceleration does is that it provides an “unexpected” speedup in smooth convex optimization; we will be concerned with acceleration in the Nesterov sense [N1], [N2]. We consider the problem</p>
<script type="math/tex; mode=display">\tag{prob}
\min_{x \in \RR^n} f(x),</script>
<p>where $f$ is an $L$-smooth and $\mu$-strongly convex function. Then with standard arguments that we review below we can show that we need roughly $t(\varepsilon) = \Theta(\frac{L}{\mu} \log \frac{1}{\varepsilon})$ iterations (of, e.g., gradient descent) to achieve a primal gap</p>
<script type="math/tex; mode=display">f(x_{t(\varepsilon)}) - f(x^\esx) \leq \varepsilon,</script>
<p>where $x^\esx$ is the (unique) optimal solution to (prob). Accelerated methods achieve the same accuracy in $\Theta(\sqrt{\frac{L}{\mu}} \log \frac{1}{\varepsilon})$ iterations, which can be a huge improvement in running time.</p>
<p>By now we have various proofs, explanations, and analyses for the phenomenon. To name just a few: if you are looking for a very concise analysis of acceleration for the smooth and (non-strongly) convex case, there is a very nice proof on <a href="https://blogs.princeton.edu/imabandit/2018/11/21/a-short-proof-for-nesterovs-momentum/">Sébastien Bubeck’s blog</a>, which also provides a nice overview and links to other methods such as Polyak’s method [P] (Sébastien also has a very nice post about <a href="https://blogs.princeton.edu/imabandit/2019/01/09/nemirovskis-acceleration/">Nemirovski’s acceleration with line search</a>), and on <a href="https://distill.pub/2017/momentum/">distill there is a very nice post</a> that explains momentum and acceleration for quadratics. There has also been recent work that understands acceleration as a linear coupling of mirror descent and gradient descent [AO], and other work explains acceleration as arising from a “better” discretization of the continuous time dynamics (see [SBC] and follow-up work). An ellipsoid method-like accelerated algorithm was derived in [BLS], providing some nice geometric intuition. Another very interesting perspective on acceleration by means of polynomial approximation and Chebyshev polynomials is given on <a href="http://blog.mrtz.org/2013/09/07/the-zen-of-gradient-descent.html">Moritz Hardt’s blog</a>; in fact I like this quite a bit as a possible explanation of the origin of acceleration. Very recently, in [DO] a unifying framework for the analysis of first-order methods has been presented that significantly streamlines the analysis of more complex first-order methods; we will present the derivation below in that framework and will provide a brief introduction further below.</p>
<p>What this post is about is not providing yet another <em>analysis of acceleration</em> and proving that a <em>given algorithm</em> indeed achieves an improved rate—there are already many excellent resources out there. Rather, what I will try to do is to provide a (relatively) natural <em>derivation of acceleration</em> (and associated algorithm) from optimization first principles, such as smoothness, (strong) convexity, first-order optimality, and Taylor expansions. In particular: no estimated point sequences, no lookahead or extrapolation, no momentum, no quadratic equation magic, no Chebyshev polynomials (although they are awesome!), and no guessing of secret constants: everything will follow (arguably) naturally although I was told that all of the aforementioned can be easily recovered.</p>
<p><em>Disclaimer: Just to be clear, fundamentally nothing new is going to happen here but rather I will provide a somewhat natural derivation of acceleration, which is sliced and diced together from [DO] and some recent work with Alejandro Carderera and Jelena Diakonikolas. Also, note that the derivation below can be significantly compressed but I opted for a more verbose exposition to emphasize that there is no hidden magic.</em></p>
<h2 id="how-we-got-here-the-basic-argument">How we got here: the basic argument</h2>
<p>We will first recall the standard proof of linear convergence of (vanilla) gradient descent for problems of the form (prob) and use this as an opportunity to introduce and recall definitions; see warmup section in <a href="/blog/research/2018/10/05/cheatsheet-fw.html">Cheat Sheet: Frank-Wolfe and Conditional Gradients</a> for an in-depth discussion of these concepts. In the following, let $x^\esx$ denote the (unique) optimal solution to (prob).</p>
<p>We will use the following (standard) definitions:</p>
<p class="mathcol"><strong>Definition (convexity).</strong> A differentiable function $f$ is said to be <em>convex</em> if for all $x,y \in \mathbb R^n$ it holds: <script type="math/tex">f(y) - f(x) \geq \langle \nabla f(x), y-x\rangle</script>.</p>
<p class="mathcol"><strong>Definition (smoothness).</strong> A convex function $f$ is said to be <em>$L$-smooth</em> if for all $x,y \in \mathbb R^n$ it holds: <script type="math/tex">f(y) - f(x) \leq \langle \nabla f(x), y-x\rangle + \frac{L}{2} \norm{x-y}^2</script>.</p>
<p class="mathcol"><strong>Definition (strong convexity).</strong> A convex function $f$ is said to be <em>$\mu$-strongly convex</em> if for all $x,y \in \mathbb R^n$ it holds: <script type="math/tex">f(y) - f(x) \geq \langle \nabla f(x), y-x\rangle + \frac{\mu}{2} \norm{x-y}^2</script>.</p>
<p>(Strong) convexity provides an underestimator of the function whereas smoothness provides an overestimator:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/convexity.png" alt="Convexity and smoothness" /></p>
<p>Now suppose that we consider (vanilla) gradient descent with updates of the form</p>
<script type="math/tex; mode=display">\tag{GD}
x_{t+1} \leftarrow x_t - \frac{1}{L} \nabla f(x_t).</script>
<p>Plugging-in this update into the definition of smoothness, we immediately obtain:</p>
<script type="math/tex; mode=display">\tag{progress}
\underbrace{f(x_{t}) - f(x_{t+1})}_{\text{primal progress}} \geq \frac{\norm{\nabla f(x_t)}^2}{2L}.</script>
<p>Similarly, with standard arguments we obtain from strong convexity the upper bound on the primal gap:</p>
<script type="math/tex; mode=display">f(x_t) - f(x^\esx) \leq \frac{\norm{\nabla f(x_t)}^2}{2 \mu}.</script>
<p>We can now simply plug-in the upper bound into the progress inequality (progress) to obtain:</p>
<script type="math/tex; mode=display">f(x_{t}) - f(x_{t+1}) \geq \frac{\norm{\nabla f(x_t)}^2}{2L} \geq \frac{\mu}{L} (f(x_t) - f(x^\esx)),</script>
<p>or, via rewriting,</p>
<script type="math/tex; mode=display">f(x_{t+1}) - f(x^\esx) \leq \left(1- \frac{\mu}{L}\right) (f(x_t) - f(x^\esx)),</script>
<p>so that we obtain the coveted linear rate:</p>
<script type="math/tex; mode=display">f(x_t) - f(x^\esx) \leq \left(1- \frac{\mu}{L}\right)^t (f(x_0) - f(x^\esx)).</script>
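<p>The contraction can be observed numerically with a minimal sketch, assuming a diagonal quadratic $f(x) = \frac{1}{2} \sum_i \lambda_i x_i^2$ with eigenvalues $\lambda_i \in [\mu, L]$, so that $x^\esx = 0$ and $f(x^\esx) = 0$ (the function name <code>gradient_descent_gaps</code> and the instance are illustrative choices):</p>

```python
def gradient_descent_gaps(eigs, x0, iters):
    """Run (GD) with step size 1/L on f(x) = 0.5 * sum(eigs[i] * x[i]^2).

    Returns the primal gaps f(x_t) - f(x*) for t = 0, ..., iters (here f(x*) = 0).
    """
    L = max(eigs)                      # smoothness constant of the quadratic

    def f(x):
        return 0.5 * sum(lam * xi * xi for lam, xi in zip(eigs, x))

    x = list(x0)
    gaps = [f(x)]
    for _ in range(iters):
        # Gradient of f is (eigs[i] * x[i])_i, so the update is x - (1/L) * grad f(x).
        x = [xi - lam * xi / L for lam, xi in zip(eigs, x)]
        gaps.append(f(x))
    return gaps


eigs = [1.0, 10.0, 100.0]              # mu = 1, L = 100, so mu / L = 0.01
gaps = gradient_descent_gaps(eigs, [1.0, 1.0, 1.0], 50)
contraction = 1.0 - min(eigs) / max(eigs)
```

<p>Each consecutive pair of gaps should satisfy $f(x_{t+1}) - f(x^\esx) \leq (1 - \frac{\mu}{L}) (f(x_t) - f(x^\esx))$; on this simple instance the observed contraction is in fact slightly better than the guarantee.</p>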
<p>While this is great, the best-known lower bound only rules out rates faster than $\Theta(\sqrt{\frac{L}{\mu}} \log \frac{1}{\varepsilon})$ iterations, so that we are potentially quadratically slower (in the condition number $L/\mu$) than the best possible. Acceleration closes this gap, achieving an iteration complexity of $\Theta(\sqrt{\frac{L}{\mu}} \log \frac{1}{\varepsilon})$, which is optimal.</p>
<h3 id="information-left-on-the-table">Information left on the table</h3>
<p>A natural question to ask now is, of course, why gradient descent cannot achieve the optimal rate and whether it is a problem with the algorithm or the analysis (which is an important question one should ask routinely). For example, going from sublinear convergence in the case of a smooth and (non-strongly) convex function to linear convergence in the case of a smooth and strongly convex function <em>does not require any change in the algorithm</em>, e.g., in (vanilla) gradient descent; rather, it is a <em>better analysis</em> that establishes the better rate. A close examination of the argument from above shows that in each iteration $t+1$ we basically rely on two inequalities:</p>
<p>Smoothness at $x_t$:</p>
<script type="math/tex; mode=display">f(y) - f(x_t) \leq \langle \nabla f(x_t), y-x_t\rangle + \frac{L}{2} \norm{x_t-y}^2.</script>
<p>Strong convexity at $x_t$:</p>
<script type="math/tex; mode=display">f(y) - f(x_t) \geq \langle \nabla f(x_t), y-x_t\rangle + \frac{\mu}{2} \norm{x_t-y}^2.</script>
<p>The key point is that we use <em>significantly less</em> information than we actually have available: In fact, in iteration $t+1$, we have iterates $x_0, \dots, x_t$ and for <em>each</em> of them we have these two inequalities. In particular, we have the strong convexity lower bound for each $x_0, \dots, x_t$ potentially providing a much better lower approximation of $f$ than just the bound from the last iterate $x_t$. Roughly in picture world it looks like this, where the left is only using last-iterate information for the lower bound and the right is using all previous iterates:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/acc/MSCSM-comb.png" alt="MSC and SM inequalities" /></p>
<p>Can this additional information be used to improve convergence?</p>
<h2 id="acceleration">Acceleration</h2>
<p>We will now try to derive our accelerated method. To this end assume that we already have a <em>hypothetical</em> algorithm that has generated iterates $y_0, \dots, y_t$ by some (as of now unknown) rule.</p>
<h3 id="a-better-lower-bound">A better lower bound</h3>
<p>Let us see whether we can use the additional information from previous iterates to obtain a better lower bound or approximation for our function. Given the sequence of iterates $y_0, \dots, y_t$ we have the following family of inequalities from strong convexity:</p>
<script type="math/tex; mode=display">f(y_i) + \langle \nabla f(y_i), z - y_i \rangle + \frac{\mu}{2} \norm{y_i-z}^2 \leq f(z),</script>
<p>for $i \in \setb{0, \dots, t}$. Moreover we can take any positive combination of these inequalities with weights $a_0, \dots, a_t \geq 0$ and obtain:</p>
<script type="math/tex; mode=display">\sum_{i = 0}^t a_i [f(y_i) + \langle \nabla f(y_i), z - y_i \rangle + \frac{\mu}{2} \norm{y_i-z}^2] \leq \underbrace{\left(\sum_{i = 0}^t a_i\right)}_{\doteq A_t} f(z),</script>
<p>or equivalently:</p>
<script type="math/tex; mode=display">\frac{1}{A_t}\sum_{i = 0}^t a_i [f(y_i) + \langle \nabla f(y_i), z - y_i \rangle + \frac{\mu}{2} \norm{y_i-z}^2] \leq f(z).</script>
<p>This inequality holds for all $z \in \RR^n$ and in particular the optimal solution $x^\esx \in \RR^n$. We do not know $x^\esx$ however, but what we do know, taking the minimum over $z \in \RR^n$ on both sides, is the bound:</p>
<script type="math/tex; mode=display">L_t \doteq \min_{z \in \RR^n} \frac{1}{A_t}\sum_{i = 0}^t a_i [f(y_i) + \langle \nabla f(y_i), z - y_i \rangle + \frac{\mu}{2} \norm{y_i-z}^2] \leq \min_{z \in \RR^n} f(z) = f(x^\esx).</script>
<p>Note that $L_t$ is a function of $a_0, \dots, a_t$ and $y_0, \dots, y_t$ and clearly the strength of the bound depends heavily on both; we will get back to both later. Next let us compute the minimizer $z$ in the definition of $L_t$. Clearly, the expression being minimized is smooth and (strongly) convex in $z$, and the first-order optimality condition leads to the equation:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
0 & = \frac{1}{A_t}\sum_{i = 0}^t a_i [\nabla f(y_i) - \mu (y_i - z) ] \\
& = \mu z + \frac{1}{A_t}\sum_{i = 0}^t a_i [\nabla f(y_i) - \mu y_i ] \\
\Leftrightarrow z &= \frac{1}{A_t}\sum_{i = 0}^t a_i [y_i - \frac{1}{\mu}\nabla f(y_i)],
\end{align*} %]]></script>
<p>so that the (dual) lower bound is optimized by</p>
<script type="math/tex; mode=display">\tag{dualSeq}
w_t \leftarrow \frac{1}{A_t}\sum_{i = 0}^t a_i [y_i - \frac{1}{\mu}\nabla f(y_i)],</script>
<p>which can also be written recursively using $A_t \doteq \sum_{i = 0}^t a_i$ as:</p>
<script type="math/tex; mode=display">\tag{dualSeqRec}
w_t \leftarrow \frac{A_{t-1}}{A_t} w_{t-1} + \frac{a_t}{A_t} [y_t - \frac{1}{\mu}\nabla f(y_t)].</script>
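<p>To make (dualSeq) and (dualSeqRec) concrete, here is a minimal sketch in plain Python. The diagonal quadratic, the attachment points <code>ys</code>, and the weights <code>a</code> are illustrative assumptions and not part of the argument; the sketch only checks that both formulas give the same $w_t$ and that the resulting bound $L_t$ does not exceed $f(x^\esx) = 0$ on this instance.</p>

```python
# Sketch under assumptions: f(x) = 0.5 * sum(lam[i] * x[i]**2), so x* = 0 and f(x*) = 0.
lam = [1.0, 4.0, 10.0]           # eigenvalues; mu = min(lam)
mu = min(lam)

def f(x):
    return 0.5 * sum(l * xi * xi for l, xi in zip(lam, x))

def grad(x):
    return [l * xi for l, xi in zip(lam, x)]

def shifted(y):
    # y - (1/mu) * grad f(y), the point averaged in (dualSeq)
    return [yi - gi / mu for yi, gi in zip(y, grad(y))]

ys = [[1.0, -2.0, 0.5], [0.3, 0.7, -0.2], [-0.1, 0.4, 0.1]]  # arbitrary attachment points
a = [1.0, 2.0, 3.0]                                          # arbitrary positive weights
S = [shifted(y) for y in ys]
A_t = sum(a)

# Direct formula (dualSeq).
w_direct = [sum(ai * s[j] for ai, s in zip(a, S)) / A_t for j in range(3)]

# Recursive formula (dualSeqRec), starting from A = 0 so that the first step yields w_0.
A, w = 0.0, [0.0, 0.0, 0.0]
for ai, s in zip(a, S):
    A_new = A + ai
    w = [(A / A_new) * wj + (ai / A_new) * sj for wj, sj in zip(w, s)]
    A = A_new

def lower_bound(z):
    # weighted average of the strong-convexity lower bounds, evaluated at z
    total = 0.0
    for ai, y in zip(a, ys):
        g = grad(y)
        lin = sum(gj * (zj - yj) for gj, zj, yj in zip(g, z, y))
        quad = 0.5 * mu * sum((yj - zj) ** 2 for yj, zj in zip(y, z))
        total += ai * (f(y) + lin + quad)
    return total / A_t

L_t = lower_bound(w_direct)
print(w_direct, w, L_t)
```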
<h3 id="primal-steps">Primal steps</h3>
<p>For the primal steps we do exactly the same thing as before in gradient descent. After all, let us first see how far we can get with only adjusting the lower bound. Therefore, given the sequence $y_0, \dots, y_t$ in iteration $t$, we define the primal steps as</p>
<script type="math/tex; mode=display">\tag{primalSeq}
x_t \leftarrow y_t - \frac{1}{L} \nabla f(y_t),</script>
<p>which is the same update as in (GD) but we do not use the update to (directly) define the next iterates $y_{t+1}$ but rather let us give them a different name until we decide what to do with them.</p>
<h3 id="interlude-approximate-dual-gap-technique-adgt">Interlude: Approximate Dual Gap Technique (ADGT)</h3>
<p>Having now a (hopefully) better lower bound we need to see what we can do with it. For this we will use the <em>Approximate Dual Gap Technique (ADGT)</em> of [DO], a conceptually simple yet powerful technique to analyze first-order methods. In a nutshell, ADGT works as follows: Our ultimate aim is to prove that for some first-order algorithm that generates iterates $x_0, \dots, x_t, \dots$ the <em>optimality gap</em> $f(x_t) - f(x^\esx) \rightarrow 0$ with a certain convergence rate. Usually, it is very hard to say something about $f(x_t) - f(x^\esx)$ directly and so typically analyses use bounds on the optimality gap.</p>
<p>ADGT makes this explicit in a first step by working with a lower bound $L_t \leq f(x^\esx)$ in iteration $t$ and an upper bound $f(x_t) \leq U_t$ and then defining a <em>gap estimate</em> in iteration $t$ as $G_t \doteq U_t - L_t$, so that $f(x_t) - f(x^\esx) \leq G_t$ in each iteration $t$. Suppose further that there exists a sequence of suitably chosen, fast growing numbers $0 \leq A_0, \dots, A_t, \dots$, such that</p>
<script type="math/tex; mode=display">A_t G_t \leq A_{t-1} G_{t-1}.</script>
<p>Then in particular the gap estimate in iteration $t$ drops as $G_t \leq \frac{A_{t-1}}{A_t} G_{t-1}$ and after chaining these bounds together we obtain $f(x_t) - f(x^\esx) \leq \frac{A_0}{A_t} G_0$, i.e., the convergence rate is essentially given by $\frac{1}{A_t}$.</p>
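<p>Spelled out, the chaining step is a simple telescoping argument (assuming $A_i > 0$ for all $i$, which is needed in order to divide):</p>
<script type="math/tex; mode=display">f(x_t) - f(x^\esx) \leq G_t \leq \frac{A_{t-1}}{A_t} G_{t-1} \leq \frac{A_{t-1}}{A_t} \cdot \frac{A_{t-2}}{A_{t-1}} G_{t-2} \leq \dots \leq \frac{A_0}{A_t} G_0.</script>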
<p>Before going back to our attempt at acceleration, let us familiarize ourselves with ADGT by analyzing vanilla gradient descent with update (GD). To this end let us consider a <em>simplified and stronger</em> lower bound, given by</p>
<script type="math/tex; mode=display">\hat L_t \doteq \frac{1}{A_t} \sum_{i = 0}^t a_i [f(x_i) + \langle \nabla f(x_i), x^\esx - x_i \rangle + \frac{\mu}{2} \norm{x_i-x^\esx}^2] \leq f(x^\esx),</script>
<p>where we chose $z = x^\esx$ with $A_t \doteq \sum_{i = 0}^t a_i$, so that we only have to pick the $a_i$ at some point. For the upper bound we simply choose $U_t \doteq f(x_{t+1})$; mind the index shift as it will be important.</p>
<p>In order to show that $A_t G_t \leq A_{t-1} G_{t-1}$, we will analyze the upper bound change and the lower bound change separately as our goal is to show:</p>
<script type="math/tex; mode=display">0 \geq A_t G_t - A_{t-1} G_{t-1} = A_t U_t - A_{t-1} U_{t-1} - (A_t L_t - A_{t-1} L_{t-1}).</script>
<p><em>Change in upper bound.</em> <br />
The change in the upper bound can be bounded using smoothness with basically the same argument as for the vanilla GD warmup from above:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
A_t U_t - A_{t-1} U_{t-1} & = A_t f(x_{t+1}) - A_{t-1} f(x_{t}) = A_t (f(x_{t+1}) - f(x_{t})) + a_t f(x_{t}) \\
& \leq - A_t \left(\frac{\norm{\nabla f(x_{t})}^2}{2L}\right) + a_t f(x_{t}).
\end{align*} %]]></script>
<p><em>Change in lower bound.</em> <br />
The change in the lower bound follows from evaluating $\hat L_t$, which used the strong convexity of $f$ in its definition:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
A_t \hat L_t - A_{t-1} \hat L_{t-1} & = a_t [f(x_t) + \langle \nabla f(x_t), x^\esx - x_t \rangle + \frac{\mu}{2} \norm{x_t-x^\esx}^2]
\end{align*} %]]></script>
<p><em>Change in the gap estimate.</em> <br />
With this we immediately obtain that the change in the gap estimate is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
A_t G_t - A_{t-1} G_{t-1} & \leq a_t (\underbrace{f(x_{t}) - f(x_t)}_{= 0}) - A_t \left(\frac{\norm{\nabla f(x_{t})}^2}{2L}\right) - a_t [\langle \nabla f(x_t), x^\esx - x_t \rangle + \frac{\mu}{2} \norm{x_t-x^\esx}^2] \\
& \leq - A_t \left(\frac{\norm{\nabla f(x_{t})}^2}{2L}\right) - \frac{a_t}{2\mu} [\underbrace{2\mu \langle \nabla f(x_t), x^\esx - x_t \rangle + \mu^2 \norm{x_t-x^\esx}^2}_{\geq - \norm{ \nabla f(x_t)}^2 \text{ via } a^2 + 2ab \geq - b^2}] \\
& \leq \norm{\nabla f(x_{t})}^2 \left( - \frac{A_t}{2L} + \frac{a_t}{2\mu} \right),
\end{align*} %]]></script>
<p>using the standard trick $a^2 + 2ab \geq - b^2$ that virtually every proof utilizing strong convexity uses (see <a href="/blog/research/2018/10/05/cheatsheet-fw.html">Cheat Sheet: Frank-Wolfe and Conditional Gradients</a> for a derivation of that estimation). Thus for</p>
<script type="math/tex; mode=display">A_t G_t - A_{t-1} G_{t-1} \leq 0,</script>
<p>it suffices to choose $a_t$, so that $- \frac{A_t}{2L} + \frac{a_t}{2\mu} \leq 0$ and the choice $\frac{a_t}{A_t} \doteq \frac{\mu}{L}$ suffices, leading to a contraction with rate:</p>
<script type="math/tex; mode=display">\frac{A_{t-1}}{A_t} = 1 - \frac{a_t}{A_t} = 1 - \frac{\mu}{L},</script>
<p>which is the standard rate and matches what we have derived above in the warmup. In a last step one would now relate $G_0$ to the initial gap $f(x_0) - f(x^\esx)$ to obtain a bound on the constant $A_0$. We skip this step to keep the exposition clean; it is immediate here and not crucial for our discussion.</p>
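<p>As a quick sanity check of this rate, the following sketch runs vanilla gradient descent with step size $\frac{1}{L}$ and builds the weights via $\frac{a_t}{A_t} = \frac{\mu}{L}$. The diagonal quadratic, the starting point, and the normalization $A_0 = 1$ are assumptions made for illustration only; the check is numerical evidence on one instance, not a proof.</p>

```python
# Sketch under assumptions: f(x) = 0.5 * sum(lam[i] * x[i]**2) with f(x*) = 0 at x* = 0.
lam = [1.0, 3.0, 7.0, 10.0]
mu, L = min(lam), max(lam)

def f(x):
    return 0.5 * sum(l * xi * xi for l, xi in zip(lam, x))

def gd_step(x):
    # x - (1/L) * grad f(x) for the diagonal quadratic
    return [xi - (l * xi) / L for l, xi in zip(lam, x)]

# a_t / A_t = mu / L is the same as A_t = A_{t-1} / (1 - mu / L); A_0 = 1 is our normalization.
A = [1.0]
x = [1.0, -1.0, 2.0, 0.5]
gaps = [f(x)]
for t in range(1, 51):
    A.append(A[-1] / (1.0 - mu / L))
    x = gd_step(x)
    gaps.append(f(x))

# Compare the primal gap with the predicted decay A_0 / A_t = (1 - mu / L)^t.
rate_ok = all(gaps[t] <= (A[0] / A[t]) * gaps[0] + 1e-15 for t in range(len(gaps)))
print(rate_ok, gaps[-1])
```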
<p>Now is a good time to pause for a second. Initially we speculated that maybe not using all available information might be the reason for not obtaining a better rate. Yet, in this argument now, we <em>have used</em> more information, in fact in iteration $t$ we have used all iterates $x_0, \dots, x_{t-1}$; see definition of $\hat L_t$. Maybe this is because the iterates $x_t$ are obtained without <em>any regard</em> for the lower bound and while we use all iterates now, maybe the bound is not much stronger than the bound arising from the last iterate and maybe we could strengthen it by a better choice of the $a_i$ and $y_i$ in the general definition of $L_t$?</p>
<h3 id="adgt-on-the-hypothetical-sequence">ADGT on the hypothetical sequence</h3>
<p>While ADGT seems to be overkill for the standard linear convergence proof compared to the argument from the warmup, we will see soon that ADGT buys us considerable extra freedom. We will now apply the same analysis as above once more; this time, however, we start out with our hypothetical sequence $y_0, \dots, y_t, \dots$ and see whether a way naturally arises to choose the $y_t$ not only to produce primal progress as the rule (GD) does but also dual progress, <em>improving</em> our lower bound estimate by providing better “attachment points” $y_0, \dots, y_t$ from which we obtain a stronger lower bound $L_t$ via strong convexity.</p>
<p>Observe that, given the sequence of iterates $y_0, \dots, y_t$, our primal iterates $x_t$ and dual iterates $w_t$ are the <em>optimal updates</em> in iteration $t$ for primal and dual progress respectively. Note that in general we do not know whether $w_t = x_t$ and usually they are <em>not</em> equal.</p>
<p>We follow the same strategy as above with the aim of analyzing the change in the gap estimate for our partially specified algorithm:</p>
<p><em>Change in upper bound.</em> <br />
The change in the upper bound can be bounded using smoothness but applied to the update $x_t \leftarrow y_t - \frac{1}{L} \nabla f(y_t)$ that we used to define our primal sequence $x_t$, i.e., $f(x_t) - f(y_t) \leq - \frac{\norm{\nabla f(y_t)}^2}{2L}$. Otherwise, except for rearranging, using the upper bound $U_t \doteq f(x_{t})$ and adding zero it is the same; note the index shift from $t+1$ to $t$ in the definition of $U_t$ as we have the intermediate point $y_t$ now as will become clear:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
A_t U_t - A_{t-1} U_{t-1} & = A_t f(x_t) - A_{t-1} f(x_{t-1}) \\
& = A_t f(y_t) - A_t f(y_t) + A_t f(x_t) - A_{t-1} f(x_{t-1}) \\
& = a_t f(y_t) + A_t (f(x_t) - f(y_t)) + A_{t-1} (f(y_{t}) - f(x_{t-1})) \\
& \leq a_t f(y_t) - A_t \left(\frac{\norm{\nabla f(y_t)}^2}{2L}\right) + A_{t-1} (f(y_{t}) - f(x_{t-1})).
\end{align*} %]]></script>
<p><em>Change in lower bound.</em> <br />
The change in the lower bound is more intricate this time around, however, as we now have an optimization problem in the definition of $L_t$ that defines the dual iterates $w_t$. Recall that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
L_t & = \min_{z \in \RR^n} \frac{1}{A_t}\sum_{i = 0}^t a_i [f(y_i) + \langle \nabla f(y_i), z - y_i \rangle + \frac{\mu}{2} \norm{y_i-z}^2] \\
& = \frac{1}{A_t}\sum_{i = 0}^t a_i f(y_i) + \frac{1}{A_t}\min_{z \in \RR^n} \underbrace{\sum_{i = 0}^t a_i [\langle \nabla f(y_i), z - y_i \rangle + \frac{\mu}{2} \norm{y_i-z}^2]}_{\doteq \gamma_t(z)}.
\end{align*} %]]></script>
<p>With this we can conveniently express the change in the lower bound as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
A_t L_t - A_{t-1} L_{t-1} & = a_t f(y_t) + \gamma_t(w_t) - \gamma_{t-1}(w_{t-1}),
\end{align*} %]]></script>
<p>so that it suffices to bound the change $\gamma_t(w_t) - \gamma_{t-1}(w_{t-1})$. We have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\gamma_t(w_t) - \gamma_{t-1}(w_{t-1}) & = \gamma_{t-1}(w_{t}) + a_t \langle \nabla f(y_t), w_{t} - y_t \rangle + a_t\frac{\mu}{2} \norm{y_t-w_{t}}^2 - \gamma_{t-1}(w_{t-1}) \\
& = \gamma_{t-1}(w_{t-1}) + \underbrace{\langle \nabla \gamma_{t-1}(w_{t-1}), w_{t} - w_{t-1}\rangle}_{= 0\text{, as $w_{t-1}$ is a minimizer of $\gamma_{t-1}$}} + \frac{\mu A_{t-1}}{2} \norm{w_t - w_{t-1}}^2 \\ & \qquad \qquad + a_t \langle \nabla f(y_t), w_{t} - y_t \rangle + a_t \frac{\mu}{2} \norm{y_t-w_{t}}^2 - \gamma_{t-1}(w_{t-1}) \\
& = a_t \langle \nabla f(y_t), w_{t} - y_t \rangle + \frac{A_t\mu}{2}\left(\frac{A_{t-1}}{A_t} \norm{w_t - w_{t-1}}^2 + \frac{a_t}{A_t} \norm{y_t-w_{t}}^2 \right) \\
& \geq a_t \langle \nabla f(y_t), w_{t} - y_t \rangle + \frac{A_t \mu}{2} \norm{w_t - \frac{A_{t-1}}{A_t} w_{t-1} - \frac{a_t}{A_t} y_t}^2 \\
& = a_t \langle \nabla f(y_t), w_{t} - y_t \rangle + \frac{A_t \mu}{2} \norm{\frac{a_t}{A_t} \frac{1}{\mu}\nabla f(y_t)}^2,
\end{align*} %]]></script>
<p>where the second equation is by the Taylor expansion of $\gamma_{t-1}$ around $w_{t-1}$ evaluated at $w_t$ (which is exact as $\gamma_{t-1}$ is a quadratic), the first inequality is by Jensen’s inequality, and the fourth equation is by the recursive definition (dualSeqRec) of $w_t$. With this we obtain that the change in the lower bound can be bounded as:</p>
<script type="math/tex; mode=display">A_t L_t - A_{t-1} L_{t-1} \geq a_t f(y_t) + a_t \langle \nabla f(y_t), w_{t} - y_t \rangle + \frac{A_t \mu}{2} \norm{\frac{a_t}{A_t} \frac{1}{\mu}\nabla f(y_t)}^2.</script>
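<p>For completeness, the Jensen step above uses the convexity of $\norm{\cdot}^2$ with the weights $\frac{A_{t-1}}{A_t} + \frac{a_t}{A_t} = 1$:</p>
<script type="math/tex; mode=display">\frac{A_{t-1}}{A_t} \norm{w_t - w_{t-1}}^2 + \frac{a_t}{A_t} \norm{w_t - y_t}^2 \geq \norm{\frac{A_{t-1}}{A_t} (w_t - w_{t-1}) + \frac{a_t}{A_t} (w_t - y_t)}^2 = \norm{w_t - \frac{A_{t-1}}{A_t} w_{t-1} - \frac{a_t}{A_t} y_t}^2.</script>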
<p><em>Change in the gap estimate.</em> <br />
With the above, we can now bound the change in the gap estimate, where the inequality is by convexity, via:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
A_t G_t - A_{t-1} G_{t-1} & \leq a_t f(y_t) - A_t \left(\frac{\norm{\nabla f(y_t)}^2}{2L}\right) + A_{t-1} (f(y_{t}) - f(x_{t-1})) - a_t f(y_t) \\ & \qquad \qquad - a_t \langle \nabla f(y_t), w_{t} - y_t \rangle - \frac{A_t \mu}{2} \norm{\frac{a_t}{A_t} \frac{1}{\mu}\nabla f(y_t)}^2 \\
& = - A_t \left(\frac{\norm{\nabla f(y_t)}^2}{2L} + \frac{a_t^2}{A_t^2} \frac{\norm{\nabla f(y_t)}^2}{2\mu} \right) + A_{t-1} (f(y_{t}) - f(x_{t-1})) - a_t \langle \nabla f(y_t), w_{t} - y_t \rangle \\
& \leq - A_t \left(\frac{\norm{\nabla f(y_t)}^2}{2L} + \frac{a_t^2}{A_t^2} \frac{\norm{\nabla f(y_t)}^2}{2\mu} \right) + A_{t-1} \langle \nabla f(y_t), y_{t} - x_{t-1} \rangle - a_t \langle \nabla f(y_t), w_{t} - y_t \rangle \\
& = - A_t \left(\frac{\norm{\nabla f(y_t)}^2}{2L} + \frac{a_t^2}{A_t^2} \frac{\norm{\nabla f(y_t)}^2}{2\mu} \right) + \langle \nabla f(y_t), A_t y_{t} - A_{t-1} x_{t-1} - a_t w_t \rangle.
\end{align*} %]]></script>
<p>While maybe a little tedious, so far nothing <em>special</em> has happened. We simply computed the change in the gap estimate via the change in the upper bound and the change in the lower bound. Now we need the right-hand side to be non-positive to complete the proof and derive the rate. Our goal is to show:</p>
<script type="math/tex; mode=display">\tag{gapCondition}
- A_t \left(\frac{\norm{\nabla f(y_t)}^2}{2L} + \frac{a_t^2}{A_t^2} \frac{\norm{\nabla f(y_t)}^2}{2\mu} \right) + \langle \nabla f(y_t), A_t y_{t} - A_{t-1} x_{t-1} - a_t w_t \rangle \leq 0,</script>
<p>which we would obtain, dividing by $A_t$ and defining $\tau \doteq \frac{a_t}{A_t}$, if we can ensure:</p>
<script type="math/tex; mode=display">\tag{impY}
y_{t} - (1-\tau) x_{t-1} - \tau w_t = \nabla f(y_t) \left(\frac{1}{2L} + \tau^2 \frac{1}{2\mu} \right),</script>
<p>as then the left hand side in (gapCondition) evaluates to $0$. Note that (impY) <em>almost</em> provides a definition of the $y_t$, which is our last missing piece but not quite: the $w_t$ depends itself on $y_t$ and we would rather have it an explicit function of $w_{t-1}$. Luckily we have the recursive definition of the $w_t$ from (dualSeqRec) that will allow us to unroll one step:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\nabla f(y_t) \left(\frac{1}{2L} + \tau^2 \frac{1}{2\mu} \right) & = y_{t} - (1-\tau) x_{t-1} - \tau w_t \\
& = y_t - (1-\tau) x_{t-1} - \tau (1-\tau) w_{t-1} - \tau^2 (y_t - \frac{1}{\mu} \nabla f(y_t)).
\end{align*} %]]></script>
<p>After rearranging, the above becomes:</p>
<script type="math/tex; mode=display">\tag{expY}
\nabla f(y_t) \left(\frac{1}{2L} - \tau^2 \frac{1}{2\mu} \right) = (1-\tau^2) y_t - (1-\tau) x_{t-1} - \tau (1-\tau) w_{t-1},</script>
<p>and we are free to make some choices. For $\tau = \frac{a_t}{A_t} \doteq \sqrt{\frac{\mu}{L}}$, the left-hand side becomes $0$ and after dividing by $(1-\tau)$, we obtain:</p>
<script type="math/tex; mode=display">(1+\tau) y_t = x_{t-1} + \tau w_{t-1},</script>
<p>which finally provides the desired definition of the $y_t$ and, recalling that we contract at a rate of $1-\frac{a_t}{A_t}$, we achieve a contraction of the gap at a rate of $1-\frac{a_t}{A_t} = 1-\tau = 1 - \sqrt{\frac{\mu}{L}}$ as required:</p>
<script type="math/tex; mode=display">\tag{accRate}
f(x_t) - f(x^\esx) \leq \left(1 - \sqrt{\frac{\mu}{L}}\right)^t (f(x_0) - f(x^\esx)).</script>
<script type="math/tex; mode=display">\qed</script>
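<p>The choice of $y_t$ can be double-checked numerically. The sketch below picks arbitrary points for $x_{t-1}$ and $w_{t-1}$ on a small diagonal quadratic (both the instance and the points are assumptions for illustration), forms $y_t$ and $w_t$ as derived, and verifies that (impY) holds with $\tau = \sqrt{\frac{\mu}{L}}$.</p>

```python
import math

# Sketch under assumptions: f(x) = 0.5 * sum(lam[i] * x[i]**2).
lam = [2.0, 5.0, 8.0]
mu, L = min(lam), max(lam)
tau = math.sqrt(mu / L)

def grad(x):
    return [l * xi for l, xi in zip(lam, x)]

x_prev = [1.0, -1.0, 0.5]        # arbitrary stand-in for x_{t-1}
w_prev = [0.2, 0.4, -0.3]        # arbitrary stand-in for w_{t-1}

# Mixing step: (1 + tau) y_t = x_{t-1} + tau * w_{t-1}.
y = [(xp + tau * wp) / (1.0 + tau) for xp, wp in zip(x_prev, w_prev)]
g = grad(y)

# Dual update (dualSeqRec) with a_t / A_t = tau.
w = [(1.0 - tau) * wp + tau * (yi - gi / mu) for wp, yi, gi in zip(w_prev, y, g)]

# Both sides of (impY).
lhs = [yi - (1.0 - tau) * xp - tau * wi for yi, xp, wi in zip(y, x_prev, w)]
coef = 1.0 / (2.0 * L) + tau ** 2 / (2.0 * mu)
rhs = [coef * gi for gi in g]

residual = max(abs(p - q) for p, q in zip(lhs, rhs))
print(residual)
```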
<p>This completes our argument. Now that we are done with this exercise it is time to pause and recap. First, let us state the full algorithm; we only output the primal sequence $x_1, \dots, x_t, \dots$ here but the other sequences are useful as well:</p>
<p class="mathcol"><strong>Algorithm.</strong> (Accelerated Gradient Descent)<br />
<em>Input:</em> $L$-smooth and $\mu$-strongly convex function $f$. Initial point $x_0$. <br />
<em>Output:</em> Sequence of iterates $x_0, \dots, x_T$ <br />
$w_0 \leftarrow x_0$ <br />
$\tau \leftarrow \sqrt{\frac{\mu}{L}}$ <br />
For $t = 1, \dots, T$ do <br />
$\qquad$ $y_t \leftarrow \frac{1}{1 + \tau} x_{t-1} + \frac{\tau}{1 + \tau} w_{t-1} \qquad \text{{update mixing sequence $y_t$}}$ <br />
$\qquad$ $w_t \leftarrow (1-\tau) w_{t-1} + \tau (y_t - \frac{1}{\mu} \nabla f(y_t)) \qquad \text{{update dual sequence $w_t$}}$ <br />
$\qquad$ $x_t \leftarrow y_t - \frac{1}{L} \nabla f(y_t)\qquad \text{{update primal sequence $x_t$}}$ <br /></p>
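<p>The algorithm translates almost line by line into code. The following is a minimal sketch in plain Python; the function name <code>agd</code>, its signature, and the test instance (a diagonal quadratic with $\mu = 1$ and $L = 100$) are our own choices for illustration and not a reference implementation.</p>

```python
import math

def agd(grad, x0, mu, L, num_iters):
    """Accelerated gradient descent as stated above; returns the primal iterates."""
    tau = math.sqrt(mu / L)
    x, w = list(x0), list(x0)                     # w_0 <- x_0
    xs = [list(x0)]
    for _ in range(num_iters):
        # update mixing sequence y_t
        y = [(xi + tau * wi) / (1.0 + tau) for xi, wi in zip(x, w)]
        g = grad(y)
        # update dual sequence w_t
        w = [(1.0 - tau) * wi + tau * (yi - gi / mu) for wi, yi, gi in zip(w, y, g)]
        # update primal sequence x_t
        x = [yi - gi / L for yi, gi in zip(y, g)]
        xs.append(x)
    return xs

# Illustrative instance (an assumption): f(x) = 0.5 * sum(lam[i] * x[i]**2), f(x*) = 0.
lam = [1.0, 10.0, 25.0, 50.0, 100.0]

def f(x):
    return 0.5 * sum(l * xi * xi for l, xi in zip(lam, x))

def grad_f(x):
    return [l * xi for l, xi in zip(lam, x)]

xs = agd(grad_f, [1.0] * 5, mu=1.0, L=100.0, num_iters=100)
gaps = [f(x) for x in xs]
print(gaps[0], gaps[-1])
```

<p>Note that each iteration needs only a single gradient evaluation at $y_t$, which is shared by the dual and the primal update.</p>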
<p>Next, a few remarks are in order:</p>
<p><strong>Remarks.</strong> <br /></p>
<ol>
<li>What the analysis shows is that acceleration is achieved by <em>simultaneously</em> optimizing the primal and dual, i.e., upper and lower bound on the gap. This is in contrast to vanilla gradient descent that first and foremost maximizes primal progress per iteration and <em>not</em> gap closed per iteration. The key here is the definition of the sequence $y_t$ that balances primal and dual progress and ensures optimal progress in <em>gap closed</em> per iteration. Moreover, it is important to note that this is <em>not</em> simply a better analysis of the same algorithm but rather the iterates in the algorithm do really differ from gradient descent and acceleration really materializes in faster convergence rates; see computations below.</li>
<li>The $y_t$ are chosen to be a convex combination of the primal and the dual step. This combination is formed with fixed weights that do not change across the algorithm’s progression.</li>
<li>The proof establishes that in each iteration we contract the gap by a multiplicative factor $(1-\tau)$. It is neither guaranteed that we make primal progress per iteration nor that we make dual progress in each iteration. What is guaranteed is that <em>in sum</em> we make enough progress; we will get back to this further below.</li>
<li>The primal and dual iterates in iteration $t$ are independent of each other conditioned on the $y_t, \dots, y_0$; see the graphic below. This is quite helpful for modifications as we will see in the next section.</li>
</ol>
<p class="center"><img src="http://www.pokutta.com/blog/assets/acc/inf.png" alt="Information Flow Acceleration" /></p>
<p>We will now compare the method from above to vanilla gradient descent. The instance that we consider is a quadratic with condition number $\theta = \frac{L}{\mu} = 10000$ in $\RR^n$, where $n = 100$. We run the algorithms for $2000$ iterations.</p>
<p>The first figure shows the gap evolution of $U_t$ and $L_t$ across iterations for the accelerated method. It can be seen that indeed the upper bound $U_t$, which is given by the primal function value, is not necessarily monotonic as compared to e.g., gradient descent. The plot is in log-log scale to better visualize behavior.</p>
<p class="minimg"><img src="http://www.pokutta.com/blog/assets/acc/gapEvolution.png" alt="Gap evolution" /></p>
<p>The next figure compares vanilla gradient descent (GD) vs. our accelerated method (AGD) with respect to the (true) primal gap as well as the gap estimate $G_t$. It can be seen that the convergence rate of AGD is much higher than the rate of GD. Note that GD is not converging prematurely but the rate is significantly lower. While the AGD gap estimate has a much higher offset, it contracts basically at the same rate as the (true) primal gap.</p>
<p class="minimg"><img src="http://www.pokutta.com/blog/assets/acc/GD-AGD.png" alt="GD vs. AGD" /></p>
<p>Finally, the last figure depicts the evolution of the distance to the optimal solution as well as the norm of the gradient (both optimality measures) for GD and AGD. Note that while these measures are not monotonic for AGD, they converge much faster.</p>
<p class="minimg"><img src="http://www.pokutta.com/blog/assets/acc/auxil.png" alt="Auxiliary Measures" /></p>
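<p>A comparison of this kind can be sketched in a few lines of plain Python. The exact instance behind the figures is not specified here, so the evenly spaced spectrum and the all-ones starting point below are assumptions; only the dimension $n = 100$, the ratio $L / \mu = 10000$, and the $2000$ iterations follow the description above.</p>

```python
import math

n, mu, L, T = 100, 1.0, 1.0e4, 2000
lam = [mu + (L - mu) * i / (n - 1) for i in range(n)]   # assumed spectrum between mu and L
tau = math.sqrt(mu / L)

def f(x):
    # diagonal quadratic with optimal value 0 at x = 0
    return 0.5 * sum(l * xi * xi for l, xi in zip(lam, x))

x_gd = [1.0] * n                 # gradient descent iterate
x, w = [1.0] * n, [1.0] * n      # accelerated method: primal and dual iterate
gd_gap, agd_gap = [f(x_gd)], [f(x)]

for _ in range(T):
    # vanilla gradient descent with step size 1/L
    x_gd = [xi - l * xi / L for l, xi in zip(lam, x_gd)]
    # accelerated method: mixing, dual, and primal update (gradient at y is lam * y)
    y = [(xi + tau * wi) / (1.0 + tau) for xi, wi in zip(x, w)]
    w = [(1.0 - tau) * wi + tau * (yi - l * yi / mu) for l, wi, yi in zip(lam, w, y)]
    x = [yi - l * yi / L for l, yi in zip(lam, y)]
    gd_gap.append(f(x_gd))
    agd_gap.append(f(x))

print(f"GD  primal gap after {T} iterations: {gd_gap[-1]:.3e}")
print(f"AGD primal gap after {T} iterations: {agd_gap[-1]:.3e}")
```

<p>Plotting <code>gd_gap</code> and <code>agd_gap</code> on a logarithmic scale gives a picture of the same flavor as the figures above, although the curves will differ in detail from the original instance.</p>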
<h2 id="extensions">Extensions</h2>
<p>We will now briefly discuss extensions of the argumentation from above.</p>
<h3 id="monotonic-variant">Monotonic variant</h3>
<p>The argument presented above is for the most basic case and, as mentioned in the remarks, the primal gap is usually not monotonically decreasing. This can be an issue in some cases. In the following we discuss a modification of the above to ensure monotonic primal progress, which, by strong convexity, also ensures monotonic decrease in distance to the optimal solution. Such modifications are reasonably well-known, e.g., <a href="http://www.mathopt.org/Optima-Issues/optima88.pdf">Nesterov</a> used it to define an accelerated method that ensures monotonic progress in the distance to the optimal solution. Here we will see that such modifications are easily handled in the ADGT framework.</p>
<p>In order to achieve the above we will actually prove something stronger: we show that we can mix-in an auxiliary sequence of points $\tilde x_1, \dots, \tilde x_t, \dots$ and the algorithm will choose the better of the two (in terms of function value) between the provided point and the accelerated step. Then it suffices to, e.g., choose the sequence $\tilde x_1, \dots, \tilde x_t, \dots$ to be standard gradient steps $\tilde x_t \leftarrow x_{t-1} - \frac{1}{L} \nabla f(x_{t-1})$ to obtain a monotonic variant.</p>
<p class="mathcol"><strong>Algorithm.</strong> (Accelerated Gradient Descent with mixed-in sequence) <br />
<em>Input:</em> $L$-smooth and $\mu$-strongly convex function $f$. Initial point $x_0$. Sequence $\tilde x_1, \dots, \tilde x_t, \dots$. <br />
<em>Output:</em> Sequence of iterates $x_0, \dots, x_T$ <br />
$w_0 \leftarrow x_0$ <br />
$\tau \leftarrow \sqrt{\frac{\mu}{L}}$ <br />
For $t = 1, \dots, T$ do <br />
$\qquad$ $y_t \leftarrow \frac{1}{1 + \tau} x_{t-1} + \frac{\tau}{1 + \tau} w_{t-1} \qquad \text{{update mixing sequence $y_t$}}$ <br />
$\qquad$ $w_t \leftarrow (1-\tau) w_{t-1} + \tau (y_t - \frac{1}{\mu} \nabla f(y_t)) \qquad \text{{update dual sequence $w_t$}}$ <br />
$\qquad$ $\bar x_t \leftarrow y_t - \frac{1}{L} \nabla f(y_t)\qquad \text{{update primal sequence $x_t$}}$ <br />
$\qquad$ $x_t \leftarrow \arg\min_{x \in \setb{\bar x_t, \tilde x_t}} f(x)\qquad \text{{take better point}}$ <br /></p>
<p>At first this might seem problematic: as we argued before, it is the intricate construction of the $y_t$ that simultaneously optimizes the primal upper bound and the lower bound. In fact, we potentially sacrificed primal progress (the method is not monotonic anymore after all) to close the gap faster. Now that we play around with the definition of the $x_t$ (and in turn with the $y_t$) in such a heavy-handed way we might <em>break</em> acceleration. It turns out however that the above works just fine. To see this let us redo the analysis. First of all, note that for a given $y_t$ the analysis of the lower bound improvement remains the same as there is no dependence on $x_{t-1}$ given $y_t$. So let us re-examine the upper bound:</p>
<p><em>Change in upper bound.</em> <br />
It suffices to observe that $f(x_t) \leq f(\bar x_t)$ and hence the change in the upper bound can be bounded as before using $\bar x_t \leftarrow y_t - \frac{1}{L} \nabla f(y_t)$, which implies $f(\bar x_t) - f(y_t) \leq - \frac{\norm{\nabla f(y_t)}^2}{2L}$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
A_t U_t - A_{t-1} U_{t-1} & = A_t f(x_t) - A_{t-1} f(x_{t-1}) \leq A_t f(\bar x_t) - A_{t-1} f(x_{t-1}) \\
& = A_t f(y_t) - A_t f(y_t) + A_t f(\bar x_t) - A_{t-1} f(x_{t-1}) \\
& = a_t f(y_t) + A_t (f(\bar x_t) - f(y_t)) + A_{t-1} (f(y_{t}) - f(x_{t-1})) \\
& \leq a_t f(y_t) - A_t \left(\frac{\norm{\nabla f(y_t)}^2}{2L}\right) + A_{t-1} (f(y_{t}) - f(x_{t-1})).
\end{align*} %]]></script>
<p>Thus we obtain an identical bound for the change in the upper bound. Not really surprising as we potentially only do better due to the auxiliary sequence. Now it remains to derive the definition of the $y_t$ from $x_{t-1}$ and $w_{t-1}$.</p>
<p><em>Change in the gap estimate.</em> <br />
The first part of the estimation is a direct combination of the upper bound estimate and the lower bound estimate. Note that the definition of the $y_t$ depended on $x_{t-1}$ and $w_{t-1}$ before and we have to carefully check the impact of the changed definition of $x_{t-1}$. The manipulations combining the upper and the lower bound estimate however do not make special use of how $x_{t-1}$ is defined and we similarly end up with:</p>
<script type="math/tex; mode=display">\tag{gapCondition}
- A_t \left(\frac{\norm{\nabla f(y_t)}^2}{2L} + \frac{a_t^2}{A_t^2} \frac{\norm{\nabla f(y_t)}^2}{2\mu} \right) + \langle \nabla f(y_t), A_t y_{t} - A_{t-1} x_{t-1} - a_t w_t \rangle \leq 0,</script>
<p>as before and the goal is again to choose $y_t$ to ensure the above is satisfied. With the same computations as before we obtain with $\tau \doteq \frac{a_t}{A_t} \doteq \sqrt{\frac{\mu}{L}}$:</p>
<script type="math/tex; mode=display">\tag{expY}
\nabla f(y_t) \left(\frac{1}{2L} - \tau^2 \frac{1}{2\mu} \right) = (1-\tau^2) y_t - (1-\tau) x_{t-1} - \tau (1-\tau) w_{t-1},</script>
<p>and plugging in the value of $\tau$ and rearranging leads to:</p>
<script type="math/tex; mode=display">(1+\tau) y_t = x_{t-1} + \tau w_{t-1},</script>
<p>and the conclusion follows as before. The key point here is that also these estimations do not rely on the specific form of $x_{t-1}$. The only thing that we really needed and where the definition of the $x_{t-1}$ played a role is that we make enough progress in terms of the upper bound estimate. Basically, we can define $y_t$ from any $x_{t-1}$ and $w_{t-1}$ as long as they satisfy the upper bound and lower bound estimates.</p>
<p>The following figure compares the monotonic variant of the accelerated method (AGDM) to the non-monotonic accelerated method (AGD) and vanilla gradient descent (GD) in terms of primal gap evolution. Observe that AGDM is not just monotonic but also has an (empirically) higher convergence rate. The instance is the same as above:</p>
<p class="minimg"><img src="http://www.pokutta.com/blog/assets/acc/monotonic.png" alt="Monotonic" /></p>
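<p>For illustration, here is a minimal sketch of the variant with mixed-in gradient steps in plain Python. The function name <code>agd_monotone</code> and the small diagonal quadratic are assumptions for this example; the mixed-in point is the standard gradient step $\tilde x_t \leftarrow x_{t-1} - \frac{1}{L} \nabla f(x_{t-1})$ discussed above.</p>

```python
import math

def agd_monotone(f, grad, x0, mu, L, num_iters):
    """Accelerated method that keeps the better of the accelerated and the plain gradient step."""
    tau = math.sqrt(mu / L)
    x, w = list(x0), list(x0)
    values = [f(x)]
    for _ in range(num_iters):
        # mixed-in candidate: plain gradient step from x_{t-1}
        x_tilde = [xi - gi / L for xi, gi in zip(x, grad(x))]
        # accelerated candidate, exactly as in the basic method
        y = [(xi + tau * wi) / (1.0 + tau) for xi, wi in zip(x, w)]
        g = grad(y)
        w = [(1.0 - tau) * wi + tau * (yi - gi / mu) for wi, yi, gi in zip(w, y, g)]
        x_bar = [yi - gi / L for yi, gi in zip(y, g)]
        # take the better point in terms of function value
        x = x_bar if f(x_bar) <= f(x_tilde) else x_tilde
        values.append(f(x))
    return x, values

# Illustrative instance (an assumption): diagonal quadratic with mu = 1 and L = 100.
lam = [1.0, 10.0, 25.0, 50.0, 100.0]

def f(x):
    return 0.5 * sum(l * xi * xi for l, xi in zip(lam, x))

def grad_f(x):
    return [l * xi for l, xi in zip(lam, x)]

x_final, values = agd_monotone(f, grad_f, [1.0] * 5, mu=1.0, L=100.0, num_iters=100)
monotone = all(values[i + 1] <= values[i] for i in range(len(values) - 1))
print(monotone, values[-1])
```

<p>The price for monotonicity in this sketch is one additional gradient evaluation and two additional function evaluations per iteration.</p>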
<h3 id="the-smooth-and-non-strongly-convex-case">The smooth and (non-strongly) convex case</h3>
<p>It is known that in the smooth and (non-strongly) convex case acceleration is also possible, improving from $O(1/t)$ (or equivalently $O(1/\varepsilon)$) convergence to $O(1/t^2)$ (or equivalently $O(1/\sqrt{\varepsilon})$) convergence. We could establish this result with the analysis from above, by adjusting the lower bound $L_t$ to not rely on strong convexity but convexity only. However, the resulting lower bound is not going to be smooth anymore, so that the simple trick of optimizing out the bound we used above is not going to work anymore. The answer to this is a more complicated (but natural) smoothing of the lower bound function and the interested reader is referred to [DO]. There is, however, another way that essentially achieves the same result, up to log-factors, and leverages what we have proven already. We replicate the argument from [SAB] here.</p>
<p>The basic idea is that we take our smooth function $f$ and given an accuracy $\varepsilon$, we mix in a weak quadratic to make the function strongly convex and then we run the algorithm from above.</p>
<p>Let $f$ be $L$-smooth and assume that $x_0$, our initial iterate, is close enough to the optimal solution so that $D \doteq \norm{x_0 - x^\esx} \geq \norm{x_t - x^\esx}$ for all $t$; the “burn-in” until we reach such a point happens after at most a finite number of iterations, independent of $\varepsilon$. Given a target accuracy $\varepsilon > 0$, we simply define:</p>
<script type="math/tex; mode=display">f_\varepsilon(x) \doteq f(x) + \frac{\varepsilon}{2D^2} \norm{x_0 - x}^2.</script>
<p>Observe that $f_\varepsilon$ is now $(L + \frac{\varepsilon}{2D^2})$-smooth and $\frac{\varepsilon}{2D^2}$-strongly convex. Moreover, minimizing $f_\varepsilon$ is essentially the same as minimizing $f$ up to a small error:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
f(x_t) - f(x^\esx) & = f_\varepsilon(x_t) - \frac{\varepsilon}{2D^2} \norm{x_0 - x_t}^2 - f_\varepsilon(x^\esx) + \frac{\varepsilon}{2D^2} \norm{x_0 - x^\esx}^2 \\
& \leq f_\varepsilon(x_t) - f_\varepsilon(x^\esx) + \frac{\varepsilon}{2} \leq f_\varepsilon(x_t) - f_\varepsilon(x_\varepsilon^\esx) + \frac{\varepsilon}{2},
\end{align*} %]]></script>
<p>where $x_\varepsilon^\esx$ is the optimal solution to $\min_{x} f_\varepsilon(x)$. This shows that finding an $\varepsilon/2$-optimal solution $x_t$ to $\min_{x} f_\varepsilon(x)$ provides an $\varepsilon$-optimal solution to $\min_x f(x)$.</p>
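<p>The reduction can be illustrated on a one-dimensional example. The sketch below uses the Huber function as a smooth but not strongly convex $f$; this function, the starting point $x_0 = 3$, and the accuracy $\varepsilon = 0.1$ are assumptions chosen for illustration. It checks the inequality $f(x) - f(x^\esx) \leq f_\varepsilon(x) - f_\varepsilon(x_\varepsilon^\esx) + \frac{\varepsilon}{2}$ on a grid of points.</p>

```python
eps, x0, x_star = 0.1, 3.0, 0.0
D = abs(x0 - x_star)
reg = eps / (2.0 * D ** 2)       # weight of the mixed-in quadratic

def f(x):
    # Huber function: 1-smooth and convex, but not strongly convex; minimized at x = 0
    return 0.5 * x * x if abs(x) <= 1.0 else abs(x) - 0.5

def f_eps(x):
    return f(x) + reg * (x0 - x) ** 2

# The minimizer of f_eps lies in the region |x| <= 1, where x + 2 * reg * (x - x0) = 0.
x_eps_star = 2.0 * reg * x0 / (1.0 + 2.0 * reg)

grid = [-5.0 + 0.01 * k for k in range(1001)]
holds = all(
    f(x) - f(x_star) <= f_eps(x) - f_eps(x_eps_star) + eps / 2.0 + 1e-12
    for x in grid
)
print(holds, x_eps_star, f(x_eps_star))
```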
<p>Now we run the accelerated method from above on $f_\varepsilon$ with accuracy $\varepsilon/2$. We had an accelerated rate (accRate) of</p>
<script type="math/tex; mode=display">f(x_t) - f(x^\esx) \leq \left(1 - \sqrt{\frac{\mu}{L}}\right)^t (f(x_0) - f(x^\esx)),</script>
<p>for a generic $L$-smooth and $\mu$-strongly convex function $f$. Moreover, $f_\varepsilon(x_0) - f_\varepsilon(x^\esx) \leq \frac{(L+\varepsilon)D^2}{2}$ by smoothness. We now simply plug in the parameters and obtain:</p>
<script type="math/tex; mode=display">f_\varepsilon(x_t) - f_\varepsilon(x^\esx) \leq \left(1 - \sqrt{\frac{\frac{\varepsilon}{2D^2}}{L + \frac{\varepsilon}{2D^2}}}\right)^t \left(\frac{(L+\varepsilon)D^2}{2}\right),</script>
<p>so that in order to achieve $f_\varepsilon(x_t) - f_\varepsilon(x^\esx) \leq \varepsilon/2$ it suffices to satisfy:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
t \log \left(1 - \sqrt{\frac{\frac{\varepsilon}{2D^2}}{L + \frac{\varepsilon}{2D^2}}}\right) & \leq - \log \frac{(L+\varepsilon)D^2}{2} + \log \frac{\varepsilon}{2} \\
\Leftrightarrow t & \geq - \frac{\log \frac{(L+\varepsilon)D^2}{\varepsilon} }{\log \left(1 - \sqrt{\frac{\frac{\varepsilon}{2D^2}}{L + \frac{\varepsilon}{2D^2}}}\right)},
\end{align*} %]]></script>
<p>and using $\log (1-r) \approx - r$ for $r$ small and $L + \frac{\varepsilon}{2D^2} \approx L$ we obtain that in order to ensure $f(x_t) - f(x^\esx) \leq \varepsilon$, we need to run the accelerated method on the smooth function $f_\varepsilon$ for roughly no more than:</p>
<script type="math/tex; mode=display">t \approx \log \left(\frac{(L+\varepsilon)D^2}{\varepsilon} \right) \sqrt{\frac{2LD^2}{\varepsilon}},</script>
<p>iterations. This matches, up to a logarithmic term, the complexity that we would expect from an accelerated method in the smooth (non-strongly) convex case.</p>
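<p>To get a feeling for the quality of the approximation, the following sketch compares the exact expression for $t$ with the simplified one. The helper names and the parameter values $L = D = 1$ are assumptions for illustration; the constants are taken exactly as they appear in the formulas above.</p>

```python
import math

def iterations_exact(L, D, eps):
    # t >= -log((L + eps) D^2 / eps) / log(1 - sqrt(mu_eps / (L + mu_eps)))
    mu_eps = eps / (2.0 * D ** 2)
    r = math.sqrt(mu_eps / (L + mu_eps))
    return -math.log((L + eps) * D ** 2 / eps) / math.log1p(-r)

def iterations_approx(L, D, eps):
    # t ~ log((L + eps) D^2 / eps) * sqrt(2 L D^2 / eps)
    return math.log((L + eps) * D ** 2 / eps) * math.sqrt(2.0 * L * D ** 2 / eps)

for eps in (1e-2, 1e-4, 1e-6):
    exact = iterations_exact(1.0, 1.0, eps)
    approx = iterations_approx(1.0, 1.0, eps)
    print(f"eps = {eps:.0e}: exact = {exact:.1f}, approx = {approx:.1f}")
```

<p>The smaller $\varepsilon$, the closer the two values get in relative terms, as the approximations $\log (1-r) \approx -r$ and $L + \frac{\varepsilon}{2D^2} \approx L$ become more accurate.</p>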
<h3 id="acceleration-and-noise">Acceleration and noise</h3>
<p>One of the often-cited major drawbacks of accelerated methods is that they do not deal well with noisy or inexact gradients, i.e., they are not robust. To make matters worse, in [DGN] it was shown that basically any method that is faster than vanilla <em>Gradient Descent</em> necessarily needs to accumulate errors linearly in the number of iterations. This poses significant challenges depending on the magnitude of the noise. Slightly cheating here and considering the smooth and (non-strongly) convex case (check out [DGN] and Moritz’s post on <a href="http://blog.mrtz.org/2014/08/18/robustness-versus-acceleration.html">Robustness vs. Acceleration</a> for precise definitions and some nice computations), suppose that the magnitude of error in the gradients is $\delta$, then vanilla <em>Gradient Descent</em> after $t$ iterations provides a solution with guarantee:</p>
<script type="math/tex; mode=display">f(x_t) - f(x^\esx) \leq O(1/t) + \delta,</script>
<p>whereas <em>Accelerated Gradient Descent</em> (the standard one, see [DGN]) provides a solution that satisfies:</p>
<script type="math/tex; mode=display">f(x_t) - f(x^\esx) \leq O(1/t^2) + t\delta,</script>
<p>so that there is a tradeoff between accuracy, iterations, and magnitude of error. A detailed analysis of the effects of noise, various restart strategies to combat noise accumulation, as well as the (substantial) differences between noise accumulation in the constrained and unconstrained setting are discussed in [CDO]; check it out for details, here is a quick teaser:</p>
<blockquote>
<p>Our results reveal an interesting discrepancy between noise tolerance in the settings of constrained and unconstrained smooth minimization. Namely, in the setting of constrained optimization, the error due to noise does not accumulate and is proportional to the diameter of the feasible region and the expected norm of the noise. In the setting of unconstrained optimization, the bound on the error incurred due to the noise accumulates, as observed empirically by (Hardt, 2014).</p>
</blockquote>
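<p>To get a feeling for the tradeoff between the two guarantees above, here is a minimal sketch that evaluates both bounds with all hidden constants set to $1$ (an assumption made purely for illustration; the function names are hypothetical). For the accelerated bound $1/t^2 + t\delta$, setting the derivative to zero gives the best horizon $t = (2/\delta)^{1/3}$, after which running longer only hurts.</p>

```python
def gd_bound(t, delta):
    """Illustrative bound 1/t + delta for Gradient Descent (constants set to 1)."""
    return 1.0 / t + delta


def agd_bound(t, delta):
    """Illustrative bound 1/t^2 + t*delta for Accelerated Gradient Descent."""
    return 1.0 / t**2 + t * delta


def best_agd_horizon(delta):
    """Real-valued t minimizing 1/t^2 + t*delta, from -2/t^3 + delta = 0."""
    return (2.0 / delta) ** (1.0 / 3.0)


delta = 1e-6
t_star = best_agd_horizon(delta)
for t in (10, 100, round(t_star), 1000, 10000):
    print(t, gd_bound(t, delta), agd_bound(t, delta))
```

<p>Under these simplified constants the accelerated bound is better for short horizons, whereas for long horizons the non-accumulating error term of vanilla Gradient Descent wins.</p>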
<h3 id="references">References</h3>
<p>[N1] Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate $O (1/k^ 2)$. In Sov. Math. Dokl (Vol. 27, No. 2).</p>
<p>[N2] Nesterov, Y. (2013). Introductory lectures on convex optimization: A basic course (Vol. 87). Springer Science & Business Media. <a href="https://books.google.com/books?hl=en&lr=&id=2-ElBQAAQBAJ&oi=fnd&pg=PA1&dq=Introductory+lectures+on+convex+optimization:+A+basic+course&ots=wltS9osfmv&sig=2kEC_XSXH-OZVyY1ZmK43khv3eQ#v=onepage&q=Introductory%20lectures%20on%20convex%20optimization%3A%20A%20basic%20course&f=false">google books</a></p>
<p>[P] Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1-17. <a href="https://www.sciencedirect.com/science/article/abs/pii/0041555364901375">pdf</a></p>
<p>[AO] Allen-Zhu, Z., & Orecchia, L. (2014). Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537. <a href="https://arxiv.org/abs/1407.1537">pdf</a></p>
<p>[SBC] Su, W., Boyd, S., & Candes, E. (2014). A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems (pp. 2510-2518). <a href="https://arxiv.org/abs/1503.01243">pdf</a></p>
<p>[BLS] Bubeck, S., Lee, Y. T., & Singh, M. (2015). A geometric alternative to Nesterov’s accelerated gradient descent. arXiv preprint arXiv:1506.08187. <a href="https://arxiv.org/abs/1506.08187">pdf</a></p>
<p>[DO] Diakonikolas, J., & Orecchia, L. (2019). The approximate duality gap technique: A unified theory of first-order methods. SIAM Journal on Optimization, 29(1), 660-689. <a href="https://arxiv.org/pdf/1712.02485.pdf">pdf</a></p>
<p>[SAB] Scieur, D., d’Aspremont, A., & Bach, F. (2016). Regularized nonlinear acceleration. In Advances In Neural Information Processing Systems (pp. 712-720). <a href="https://arxiv.org/abs/1606.04133">pdf</a></p>
<p>[DGN] Devolder, O., Glineur, F., & Nesterov, Y. (2014). First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2), 37-75. <a href="http://www.optimization-online.org/DB_FILE/2010/12/2865.pdf">pdf</a></p>
<p>[CDO] Cohen, M. B., Diakonikolas, J., & Orecchia, L. (2018). On acceleration with noise-corrupted gradients. arXiv preprint arXiv:1805.12591. <a href="https://arxiv.org/abs/1805.12591">pdf</a></p>
<p><br /></p>
<h4 id="acknowledgements-and-changelog">Acknowledgements and Changelog</h4>
<p>I would like to thank Alejandro Carderera and Cyrille Combettes for pointing out several typos in an early version of this post. Computations and plots provided by Alejandro Carderera.</p>
<p>Sebastian Pokutta</p>
<p>TL;DR: Cheat Sheet for a derivation of acceleration from optimization first principles.</p>
<p><strong>Blended Matching Pursuit</strong> (2019-05-27T01:00:00+02:00), http://www.pokutta.com/blog/research/2019/05/27/bmp-abstract</p>
<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/1904.12335">Blended Matching Pursuit</a> with <a href="https://www.linkedin.com/in/cyrille-combettes/">Cyrille W. Combettes</a>, showing that the blending approach that we used earlier for conditional gradients can be carried over also to the Matching Pursuit setting, resulting in a new and very fast algorithm for minimizing convex functions over linear spaces while maintaining sparsity close to full orthogonal projection approaches such as Orthogonal Matching Pursuit.</em>
<!--more--></p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>We are interested in solving the following convex optimization problem. Let $f$ be a smooth convex function with potentially additional properties and $D \subseteq \RR^n$ a finite set of vectors. We want to solve:</p>
<script type="math/tex; mode=display">\tag{opt}
\min_{x \in \operatorname{lin}D} f(x)</script>
<p>The set $D$ in the context considered here is often referred to as a <em>dictionary</em> and its elements are called <em>atoms</em>. Note that this problem also makes sense for infinite dictionaries and more general Hilbert spaces, but for the sake of exposition we confine ourselves here to the finite case; see the paper for more details.</p>
<h3 id="sparse-signal-recovery">Sparse Signal Recovery</h3>
<p>The problem (opt) with, e.g., $f(x) \doteq \norm{x - y}_2^2$ for a given vector $y \in \RR^n$ is of particular interest in <em>Signal Processing</em> in the context of <em>Sparse Signal Recovery</em>, where a signal $y \in \RR^n$ is measured that is known to be the sum of a <em>sparse</em> linear combination of elements in $D$ and a <em>noise term</em> $\epsilon$, e.g., $y = x + \epsilon$ for some $x \in \operatorname{lin} D$ and $\epsilon \sim N(0,\Sigma)$; see <a href="https://en.wikipedia.org/wiki/Matching_pursuit">Wikipedia</a> for more details. Here <em>sparsity</em> refers to $x$ being a linear combination of <em>few</em> elements from $D$ and the task is to reconstruct $x$ from $y$. If the signal’s sparsity is known ahead of time, say $m$, then the optimization problem of interest is:</p>
<script type="math/tex; mode=display">\tag{sparseRecovery}
\min_{x \in \RR^k} \setb{\norm{y - Dx}_2^2 \ \mid\ \norm{x}_0 \leq m},</script>
<p>where $|D| = k$ and $m \ll k$ typically. As the above problem is non-convex (and in fact NP-hard to solve), various relaxations have been used and a common one is to solve (opt) instead with an algorithm that promotes sparsity due to its algorithmic design. Other variants include relaxing the $\ell_0$-norm constraint via an $\ell_1$-norm constraint and then solving the arising constrained convex optimization problem over an appropriately scaled $\ell_1$-ball with an optimization method that produces relatively sparse iterates, such as conditional gradients and related methods.</p>
<p>The following graphic is taken from <a href="https://en.wikipedia.org/wiki/Matching_pursuit">Wikipedia’s Matching Pursuit entry</a>. On the bottom the actual signal is depicted in the time domain and on top the inner product of the wavelet atom with the signal is shown as a heat map, where each pixel corresponds to a time-frequency wavelet atom (this would be our dictionary). In this example, we would seek a reconstruction with $3$ elements given by the centers of the ellipsoids.</p>
<p class="center"><img src="https://upload.wikimedia.org/wikipedia/commons/2/21/Matching_pursuit.png" alt="Sparse Signal" /></p>
<p>Without going into detail here, (sparseRecovery) also naturally relates to compressed sensing and our algorithm also applies to this context, as do all other algorithms that solve (sparseRecovery).</p>
<h3 id="the-general-setup">The general setup</h3>
<p>Here we actually consider the more general problem of minimizing an arbitrary smooth convex function $f$ over the linear span of the dictionary $D$ in (opt). This more general setup has many applications including the one from above. Basically, whenever we seek to project a vector into a linear space, writing it as a linear combination of basis elements, we are in the setup of (opt). Moreover, sparsity is often a natural requirement as it helps explainability and interpretation in many cases.</p>
<h3 id="solving-the-optimization-problem">Solving the optimization problem</h3>
<p>Apart from the broad applicability, (opt) is also algorithmically interesting. It is a constrained problem as we optimize subject to $x \in \operatorname{lin} D$, yet at the same time the feasible region is unbounded. Surely one could project into the linear space etc., but this is quite costly if $D$ is large and potentially very challenging if $D$ is countably infinite; in fact it is (opt) that solves exactly this problem for a <em>specific</em> vector $y$ subject to additional constraints such as, e.g., sparsity and good <em>Normalized Mean Squared Error (NMSE)</em>. When solving (opt) we thus face some interesting challenges, such as not being able to bound the diameter of the feasible region (an often used quantity in constrained convex minimization).</p>
<p>There are various algorithms to solve (opt) while maintaining sparsity. One such class consists of Coordinate Descent, Matching Pursuit [MZ], Orthogonal Matching Pursuit [AKGT] and similar algorithms that try to achieve sparsity due to their design. Another class solves a constrained version by introducing an $\ell_1$-constraint as discussed above to induce sparsity. This includes (vanilla) Gradient Descent (not really sparse), Conditional Gradient descent [CG] (aka the Frank-Wolfe algorithm [FW]) and its variants (see e.g., [LJ]) as well as specialized algorithms such as Compressive Sampling Matching Pursuit (CoSaMP) [NT] or Conditional Gradient with Enhancement and Truncation (CoGEnT) [RSW]. Also our recent Blended Conditional Gradients (BCG) algorithm [BPTW] applies to the formulation with $\ell_1$-ball relaxation; see also the <a href="/blog/research/2019/02/18/bcg-abstract.html">summary of the paper</a> for more details.</p>
<p>For an overview of the computational as well as reconstruction advantages and disadvantages of some of those algorithms, see [AKGT].</p>
<h2 id="our-results">Our results</h2>
<p>More recently, in [LKTJ] a unifying view of Conditional Gradients and Matching Pursuit has been established. Apart from presenting new algorithms, the authors also show that basically the Frank-Wolfe algorithm corresponds to Matching Pursuit and the Fully-Corrective Frank-Wolfe algorithm corresponds to Orthogonal Matching Pursuit; shortly after in [LRKRSSJ] an accelerated variant of Matching Pursuit has been provided. The unified view of [LKTJ] motivated us to carry over the blending idea from [BPTW] to the Matching Pursuit context, as the BCG algorithm provided very good sparsity in the constrained case in our tests. Moreover, we wanted to extend the convergence analysis to not just smooth and (strongly) convex functions but more generally smooth and sharp functions, which nicely interpolates between the convex and the strongly convex regime (see <a href="/blog/research/2018/11/12/heb-conv.html">Cheat Sheet: Hölder Error Bounds (HEB) for Conditional Gradients</a> for details on sharpness); the same can be also done for Conditional Gradients (see our recent work [KDP] or the <a href="/blog/research/2019/05/02/restartfw-abstract.html">summary</a>).</p>
<p>The basic idea behind <em>blending</em> is to mix together various types of steps. Here the mixing is between Matching Pursuit style steps and low-complexity Gradient Steps over the currently selected atoms. The former steps make sure that we discover new dictionary elements that we need to make progress, whereas the latter ones usually give more per-iteration progress, are cheaper in wall-clock time, and promote sparsity. Unfortunately, as straightforward as it sounds to carry over the blending to Matching Pursuit, it is not that simple. The blending that we did before in [BPTW] heavily relied on dual gap estimates (in fact variants of the Wolfe gap) to switch between the various steps; however, these gaps are not available here due to the unboundedness of $\operatorname{lin} D$.</p>
<p>After navigating these technical challenges, what we ended up with is a <em>Blended Matching Pursuit (BMP)</em> algorithm that is basically as fast as (or faster than) the standard Matching Pursuit (or its generalized variant, <em>Generalized Matching Pursuit (MP)</em>, for arbitrary smooth convex functions), while maintaining a sparsity close to that of the much slower Orthogonal Matching Pursuit (OMP); the former only performs line search across the newly added atom, while the latter re-optimizes over the <em>full</em> set of selected elements in each iteration, hence offering much better sparsity at the price of much higher running times.</p>
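<p>To make the difference between the two baselines concrete, here is a minimal sketch of plain Matching Pursuit and Orthogonal Matching Pursuit for the least-squares case $f(x) = \norm{y - Dx}_2^2$. This is not the BMP algorithm from the paper; the function names, the random dictionary, and the noise-free test signal are assumptions made for illustration only.</p>

```python
import numpy as np


def matching_pursuit(D, y, iters):
    """Plain MP for f(x) = ||y - D x||_2^2; the columns of D are the atoms."""
    x = np.zeros(D.shape[1])
    r = y.copy()                                  # residual y - D x
    for _ in range(iters):
        corr = D.T @ r                            # correlation of each atom with the residual
        j = int(np.argmax(np.abs(corr)))          # best-aligned atom
        step = corr[j] / (D[:, j] @ D[:, j])      # exact line search along atom j only
        x[j] += step
        r -= step * D[:, j]
    return x


def orthogonal_matching_pursuit(D, y, iters):
    """OMP: same atom selection, but all selected coefficients are re-fit each round."""
    x = np.zeros(D.shape[1])
    support = []
    r = y.copy()
    for _ in range(iters):
        j = int(np.argmax(np.abs(D.T @ r)))
        if j not in support:
            support.append(j)
        # Full re-optimization over the atoms selected so far.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x[:] = 0.0
        x[support] = coef
        r = y - D @ x
    return x


rng = np.random.default_rng(0)
D = rng.standard_normal((50, 200))
D /= np.linalg.norm(D, axis=0)                    # unit-norm atoms
x_true = np.zeros(200)
x_true[rng.choice(200, size=5, replace=False)] = 1.0
y = D @ x_true                                    # noise-free 5-sparse signal

x_mp = matching_pursuit(D, y, iters=20)
x_omp = orthogonal_matching_pursuit(D, y, iters=20)
print("MP :", np.count_nonzero(x_mp), np.linalg.norm(y - D @ x_mp))
print("OMP:", np.count_nonzero(x_omp), np.linalg.norm(y - D @ x_omp))
```

<p>The least-squares solve in every OMP round is what makes it expensive on large supports, whereas the MP update touches a single coefficient per iteration.</p>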
<h3 id="example-computation-1">Example Computation 1:</h3>
<p>The following figure shows a sample computation for a sparse signal recovery instance from [RSW], which we scaled down by a factor of $10$. The actual signal has a sparsity of $s = 100$, we have $m = 500$ measurements, and the measurement happens in $n = 2000$-dimensional space. We choose $A\in\mathbb{R}^{m\times n}$ and $x^\esx \in \mathbb{R}^n$ with $\norm{x^\esx}_0=s$. The measurement is generated as $y=Ax^\esx + \mathcal{N}(0,\sigma^2I_m)$ with $\sigma = 0.05$.</p>
<p>We benchmarked BMP against MP and OMP (see [LKTJ] for pseudo-code). We also benchmarked against BCG (see [BPTW] for pseudo-code) and CoGEnT (see [RSW] for pseudo-code); for these algorithms we optimize subject to a scaled $\ell_1$-ball, where the radius has been empirically chosen so the signal is contained in the ball; otherwise we could not compare primal gap progress. Note that by scaling up the $\ell_1$-ball we might produce less sparse solutions; see [LKTJ] and the contained discussion for relating conditional gradient methods to matching pursuit methods. Each algorithm is run for either $300$ secs or until there is no (substantial) primal improvement anymore; whichever comes first.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/bmp/convergence.png" alt="Comparison BMP vs. others" /></p>
<p>In the aforementioned <em>Sparse Signal Recovery</em> problem, another way to compare the quality of the actual reconstructions is via the <em>Normalized Mean Square Error (NMSE)</em>. The next figure shows the evolution of NMSE across the optimization:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/bmp/nmse.png" alt="NMSE small" /></p>
<p>The rebound likely happens because once the actual signal is reconstructed, overfitting of the noise term starts, deteriorating NMSE. One could clean up the reconstruction by removing all atoms in the support with small coefficients; this is beyond the scope of this post, however. Here is the same NMSE plot truncated after the first $30$ secs for better visibility of the initial phase:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/bmp/nmse-30s.png" alt="NMSE small" /></p>
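<p>An instance of the kind described in this example can be generated along the following lines. This is only a sketch: the post does not state how $A$ or the nonzero entries of $x^\esx$ are drawn, so the Gaussian choices below are assumptions, as is the NMSE formula $\norm{x - x^\esx}_2^2 / \norm{x^\esx}_2^2$, which is one common definition. The function names are hypothetical.</p>

```python
import numpy as np


def make_instance(m=500, n=2000, s=100, sigma=0.05, seed=0):
    """Generate y = A x_true + noise with an s-sparse x_true."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n)) / np.sqrt(m)   # assumption: i.i.d. Gaussian measurements
    x_true = np.zeros(n)
    support = rng.choice(n, size=s, replace=False)
    x_true[support] = rng.standard_normal(s)       # assumption: Gaussian nonzero entries
    y = A @ x_true + sigma * rng.standard_normal(m)
    return A, x_true, y


def nmse(x, x_true):
    """Assumed definition: ||x - x_true||^2 / ||x_true||^2."""
    return float(np.sum((x - x_true) ** 2) / np.sum(x_true ** 2))


A, x_true, y = make_instance()
print(A.shape, np.count_nonzero(x_true), y.shape)
print(nmse(np.zeros_like(x_true), x_true))         # NMSE of the all-zero reconstruction
```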
<h3 id="example-computation-2">Example Computation 2:</h3>
<p>Same setup as above, however this time the actual signal has a sparsity of $s = 100$, we have $m = 1500$ measurements, and the measurement happens in $n = 6000$-dimensional space. This time we run for $1200$ secs or until no (substantial) primal progress is made. Here the performance advantage of BMP is very obvious.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/bmp/convergence2.png" alt="Comparison BMP vs. others" /></p>
<p>The next figure shows the evolution of NMSE across the optimization:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/bmp/nmse2.png" alt="NMSE small" /></p>
<p>And truncated again, here after roughly the first $300$ secs for better visibility. We can see that BMP reaches its NMSE minimum right around $100$ atoms and it is much faster than any of the other algorithms.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/bmp/nmse2-300s.png" alt="NMSE small" /></p>
<h3 id="references">References</h3>
<p>[MZ] Mallat, S. G., & Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on signal processing, 41(12), 3397-3415. <a href="https://pdfs.semanticscholar.org/0b6e/98a6a8cf8283fd76fe1100b23f11f4cfa711.pdf">pdf</a></p>
<p>[TG] Tropp, J. A., & Gilbert, A. C. (2007). Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on information theory, 53(12), 4655-4666. <a href="https://authors.library.caltech.edu/9490/1/TROieeetit07.pdf">pdf</a></p>
<p>[CG] Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 6(5), 787-823. <a href="http://www.mathnet.ru/php/archive.phtml?wshow=paper&jrnid=zvmmf&paperid=7415&option_lang=eng">pdf</a></p>
<p>[FW] Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1‐2), 95-110. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109">pdf</a></p>
<p>[LJ] Lacoste-Julien, S., & Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems (pp. 496-504). <a href="http://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf">pdf</a></p>
<p>[NT] Needell, D., & Tropp, J. A. (2009). CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and computational harmonic analysis, 26(3), 301-321. <a href="https://core.ac.uk/download/pdf/22761532.pdf">pdf</a></p>
<p>[RSW] Rao, N., Shah, P., & Wright, S. (2015). Forward–backward greedy algorithms for atomic norm regularization. IEEE Transactions on Signal Processing, 63(21), 5798-5811. <a href="https://arxiv.org/pdf/1404.5692.pdf">pdf</a></p>
<p>[BPTW] Braun, G., Pokutta, S., Tu, D., & Wright, S. (2018). Blended Conditional Gradients: the unconditioning of conditional gradients. arXiv preprint arXiv:1805.07311. <a href="https://arxiv.org/pdf/1805.07311.pdf">pdf</a></p>
<p>[AKGT] Arjoune, Y., Kaabouch, N., El Ghazi, H., & Tamtaoui, A. (2017, January). Compressive sensing: Performance comparison of sparse recovery algorithms. In 2017 IEEE 7th annual computing and communication workshop and conference (CCWC) (pp. 1-7). IEEE. <a href="https://arxiv.org/pdf/1801.09744.pdf">pdf</a></p>
<p>[LKTJ] Locatello, F., Khanna, R., Tschannen, M., & Jaggi, M. (2017). A unified optimization view on generalized matching pursuit and frank-wolfe. arXiv preprint arXiv:1702.06457. <a href="https://arxiv.org/pdf/1702.06457.pdf">pdf</a></p>
<p>[LRKRSSJ] Locatello, F., Raj, A., Karimireddy, S. P., Rätsch, G., Schölkopf, B., Stich, S. U., & Jaggi, M. (2018). On matching pursuit and coordinate descent. arXiv preprint arXiv:1803.09539. <a href="https://arxiv.org/pdf/1803.09539.pdf">pdf</a></p>
<p>[KDP] Kerdreux, T., d’Aspremont, A., & Pokutta, S. (2018). Restarting Frank-Wolfe. to appear in Proceedings of AISTATS. <a href="https://arxiv.org/abs/1810.02429">pdf</a></p>
<p>Sebastian Pokutta</p>
<p><strong>Sharpness and Restarting Frank-Wolfe</strong> (2019-05-02T01:00:00+02:00), http://www.pokutta.com/blog/research/2019/05/02/restartfw-abstract</p>
<p><em>TL;DR: This is an informal summary of our recent paper <a href="http://arxiv.org/abs/1810.02429">Restarting Frank-Wolfe</a> with <a href="https://www.di.ens.fr/~aspremon/">Alexandre D’Aspremont</a> and <a href="https://www.researchgate.net/profile/Thomas_Kerdreux">Thomas Kerdreux</a>, where we show how to achieve improved convergence rates under sharpness through restarting Frank-Wolfe algorithms.</em>
<!--more--></p>
<p>Note: This summary is shorter than usual as I wrote a whole post about sharpness (aka Hölder Error Bounds) and conditional gradient methods that is strongly correlated with this paper some time back; for the sake of non-duplication the interested reader is referred to <a href="/blog/research/2018/11/12/heb-conv.html">Cheat Sheet: Hölder Error Bounds (HEB) for Conditional Gradients</a>, which also explains the more technical aspects of our work in a significantly broader context.</p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>We often want to solve <em>constrained smooth convex optimization</em> problems of the form</p>
<script type="math/tex; mode=display">\min_{x \in P} f(x),</script>
<p>where $P$ is some compact convex set and $f$ is a smooth function. If the considered function $f$ is strongly convex, then we can expect a linear rate of convergence of $O(\log 1/\varepsilon)$, i.e., it takes about $k \sim \log 1/\varepsilon$ iterations until $f(x_k) - f(x^\esx) \leq \varepsilon$ by using <em>Away-Step Frank-Wolfe</em> or <em>Pairwise Conditional Gradients</em> (see <a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">Cheat Sheet: Linear convergence for Conditional Gradients</a> for more). However, in the absence of strong convexity we often have to fall back to the smooth and (non-strongly) convex case with a much slower rate of $O(1/\varepsilon)$ (without acceleration). In many cases this rate is considerably worse than what is empirically observed. To remedy this with a more fine-grained convergence analysis, the recent paper [RA] analyzed convergence under <em>sharpness</em> (also known as the <em>Hölder Error Bound (HEB) condition</em>), which characterizes the behavior of $f$ around the optimal solutions:</p>
<p class="mathcol"><strong>Definition (Hölder Error Bound (HEB) condition).</strong> A convex function $f$ satisfies the <em>Hölder Error Bound (HEB) condition on $P$</em> with parameters $0 < c < \infty$ and $\theta \in [0,1]$ if for all $x \in P$ it holds:
\[
c (f(x) - f^\esx)^\theta \geq \min_{y \in \Omega^\esx} \norm{x-y},
\]
where $f^\esx$ denotes the optimal value of $f$ over $P$ and $\Omega^\esx$ the set of optimal solutions.</p>
<p>It was shown that using the notion of sharpness, one can derive much better rates, covering the whole range between the sublinear rate of $O(1/\varepsilon)$ and the linear rate $O(\log 1/\varepsilon)$. Moreover, these rates can be realized with adaptive restart schemes, requiring no knowledge about the sharpness parameters.</p>
<p>To establish the link to strong convexity, note that strong convexity, which is a global property, implies sharpness (with appropriate parameterization), which has to hold only locally around the optimal solutions; the converse is not true. In fact, using sharpness one can show linear convergence for certain classes of functions that are not strongly convex.</p>
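<p>As a small numerical illustration of the definition, consider $f(x) = \norm{x}_2^p$ on the unit ball (the choice of the unit ball as $P$ and the function name are assumptions made for this sketch). Here the optimal value is $0$ and the only optimal solution is the origin, so the condition reads $c \, \norm{x}_2^{p\theta} \geq \norm{x}_2$. It holds with $c = 1$ and $\theta = 1/p$, whereas for $p = 4$ it fails with $\theta = 1/2$, the exponent one obtains from strong convexity.</p>

```python
import numpy as np


def heb_holds(p, theta, c=1.0, dim=5, samples=1000, seed=0):
    """Check c * (f(x) - f*)^theta >= dist(x, Omega*) for f(x) = ||x||_2^p on the unit ball.

    Here f* = 0 and Omega* = {0}, so the distance to Omega* is simply ||x||_2.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(samples, dim))
    x /= np.maximum(1.0, np.linalg.norm(x, axis=1, keepdims=True))  # map into the unit ball
    dist = np.linalg.norm(x, axis=1)
    gap = dist ** p                                  # f(x) - f*
    return bool(np.all(c * gap ** theta >= dist - 1e-12))


print(heb_holds(p=2.0, theta=0.5))    # squared norm: sharp with theta = 1/2
print(heb_holds(p=4.0, theta=0.25))   # fourth power: sharp with theta = 1/4
print(heb_holds(p=4.0, theta=0.5))    # fourth power: not sharp with theta = 1/2 and c = 1
```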
<h2 id="our-results">Our results</h2>
<p>An open question was whether adaptive restarts can also be utilized to achieve a similar adaptive behavior for Conditional Gradient type methods that access the feasible region $P$ only through a linear programming oracle, and this is precisely what we study in our recent work [KDP]. There we show that one can modify the Away-Step Frank-Wolfe algorithm (and similarly Pairwise Conditional Gradients) by endowing them with <em>scheduled restarts</em> to automatically adapt to the function’s sharpness. For functions with optimal solutions contained in the strict relative interior of $P$ it even suffices to modify the (vanilla) Frank-Wolfe algorithm. By doing so we obtain, depending on the function’s sharpness parameters, convergence rates of the form $O(1/\varepsilon^p)$ or $O(\log 1/\varepsilon)$. In particular, similar to [RA], we can achieve linear convergence for functions that are sufficiently sharp but not strongly convex.</p>
<p>For illustration, the next graph shows the behavior of the Frank-Wolfe Algorithm under sharpness on the probability simplex of dimension $30$ and function $\norm{x}_2^{1/\theta}$. For $\theta = 1/2$, we observe linear convergence as expected, while for the other values of $\theta$ we observe various degrees of sublinear convergence of the form $O(1/\varepsilon^p)$ with $p \geq 1$.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/heb-simplex-30-noLine.png" alt="HEB with approx minimizer" /></p>
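<p>The setup of this illustration can be sketched as follows: vanilla Frank-Wolfe over the probability simplex in dimension $30$ for $f(x) = \norm{x}_2^{1/\theta}$. The step size $2/(t+2)$, the starting vertex, and the function name are assumptions, since the post does not state how the plot was generated; the sketch is not meant to reproduce the figure exactly.</p>

```python
import numpy as np


def frank_wolfe_simplex(theta, n=30, iters=1000):
    """Vanilla Frank-Wolfe for f(x) = ||x||_2^(1/theta) over the probability simplex."""
    p = 1.0 / theta
    f = lambda x: np.linalg.norm(x) ** p
    grad = lambda x: p * np.linalg.norm(x) ** (p - 2.0) * x   # valid for x != 0
    f_opt = f(np.full(n, 1.0 / n))            # the minimizer is the uniform distribution
    x = np.zeros(n)
    x[0] = 1.0                                # start at a vertex
    gaps = []
    for t in range(iters):
        gaps.append(f(x) - f_opt)
        v = np.zeros(n)
        v[int(np.argmin(grad(x)))] = 1.0      # LP oracle: best vertex of the simplex
        gamma = 2.0 / (t + 2.0)               # agnostic step size (assumption)
        x = (1.0 - gamma) * x + gamma * v
    gaps.append(f(x) - f_opt)
    return x, gaps


for theta in (1 / 2, 1 / 3, 1 / 4):
    x, gaps = frank_wolfe_simplex(theta)
    print(theta, gaps[0], gaps[-1])
```

<p>Each iteration only needs the index of the smallest gradient entry, which is all the linear programming oracle amounts to on the simplex.</p>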
<h3 id="references">References</h3>
<p>[RA] Roulet, V., & d’Aspremont, A. (2017). Sharpness, restart and acceleration. In Advances in Neural Information Processing Systems (pp. 1119-1129). <a href="http://papers.nips.cc/paper/6712-sharpness-restart-and-acceleration">pdf</a></p>
<p>[KDP] Kerdreux, T., d’Aspremont, A., & Pokutta, S. (2018). Restarting Frank-Wolfe. to appear in Proceedings of AISTATS. <a href="https://arxiv.org/abs/1810.02429">pdf</a></p>
<p>Sebastian Pokutta</p>
<p><strong>Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning</strong> (2019-02-27T13:00:00+01:00), http://www.pokutta.com/blog/research/2019/02/27/cheatsheet-nonsmooth</p>
<p><em>TL;DR: Cheat Sheet for non-smooth convex optimization: subgradient descent, mirror descent, and online learning. Long and technical.</em>
<!--more--></p>
<p><em>Posts in this series (so far).</em></p>
<ol>
<li><a href="/blog/research/2018/12/07/cheatsheet-smooth-idealized.html">Cheat Sheet: Smooth Convex Optimization</a></li>
<li><a href="/blog/research/2018/10/05/cheatsheet-fw.html">Cheat Sheet: Frank-Wolfe and Conditional Gradients</a></li>
<li><a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">Cheat Sheet: Linear convergence for Conditional Gradients</a></li>
<li><a href="/blog/research/2018/11/12/heb-conv.html">Cheat Sheet: Hölder Error Bounds (HEB) for Conditional Gradients</a></li>
<li><a href="/blog/research/2019/02/27/cheatsheet-nonsmooth.html">Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning</a></li>
<li><a href="/blog/research/2019/06/10/cheatsheet-acceleration-first-principles.html">Cheat Sheet: Acceleration from First Principles</a></li>
</ol>
<p><em>My apologies for incomplete references—this should merely serve as an overview.</em></p>
<p>This time we will consider non-smooth convex optimization. Our starting point is a very basic argument that is used to prove convergence of <em>Subgradient Descent (SG)</em>. From there we will consider the projected variants in the constrained setting and naturally arrive at <em>Mirror Descent (MD)</em> of [NY]; we follow the proximal point of view as presented in [BT]. We will also see that online learning algorithms such as <em>Online Gradient Descent (OGD)</em> of [Z] or <em>Online Mirror Descent (OMD)</em> and the special case of the <em>Multiplicative Weights Update (MWU)</em> algorithm arise as natural consequences.</p>
<p>This time we will consider a convex function $f: \RR^n \rightarrow \RR$ and we want to solve</p>
<script type="math/tex; mode=display">\min_{x \in K} f(x),</script>
<p>where $K$ is some convex feasible region, e.g., $K = \RR^n$ is the unconstrained case. However compared to previous posts now we will consider the <em>non-smooth</em> case. As before we assume that we only have <em>first-order access</em> to the function, via a so-called <em>first-order oracle</em>, which in the non-smooth case returns subgradients:</p>
<p class="mathcol"><strong>First-Order oracle for $f$</strong> <br />
<em>Input:</em> $x \in \mathbb R^n$ <br />
<em>Output:</em> $\partial f(x)$ and $f(x)$</p>
<p>In the above $\partial f(x)$ denotes a subgradient of the (convex!) function $f$ at point $x$. Recall that a <em>subgradient at $x \in \operatorname{dom}(f)$</em> is any vector $\partial f(x)$ such that $f(z) \geq f(x) + \langle \partial f(x), z-x \rangle$ holds for all $z \in \operatorname{dom}(f)$. This is basically the same inequality that we obtain from convexity for smooth functions, just that in the general non-smooth case there might be more than one vector satisfying this condition. In contrast, for convex and smooth (i.e., differentiable) functions there exists only one subgradient at $x$, which is the gradient, i.e., $\partial f(x) = \nabla f(x)$ in this case. In the following we will use the notation $[n] \doteq \setb{1,\dots, n}$.</p>
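<p>As a small sanity check of the subgradient inequality, take $f(x) = \norm{x}_1$ at a point with a zero coordinate, where $f$ is not differentiable. The sketch below (helper names are hypothetical) tests candidate vectors against the inequality on sample points: any value in $[-1,1]$ works in the zero coordinate, while a value outside this interval is not a subgradient.</p>

```python
import numpy as np


def f(x):
    """f(x) = ||x||_1, applied to a single point or row-wise to a batch of points."""
    return np.abs(x).sum(axis=-1)


def satisfies_subgradient_inequality(g, x, zs, tol=1e-12):
    """Check f(z) >= f(x) + <g, z - x> for every row z of zs."""
    return bool(np.all(f(zs) >= f(x) + (zs - x) @ g - tol))


rng = np.random.default_rng(0)
x = np.array([1.5, 0.0, -2.0])                       # f is not differentiable here
zs = np.vstack([5.0 * rng.standard_normal((1000, 3)), [[1.5, 1.0, -2.0]]])

print(satisfies_subgradient_inequality(np.sign(x), x, zs))                    # sign vector
print(satisfies_subgradient_inequality(np.array([1.0, 0.7, -1.0]), x, zs))    # also valid
print(satisfies_subgradient_inequality(np.array([1.0, 1.5, -1.0]), x, zs))    # violated
```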
<h2 id="a-basic-argument">A basic argument</h2>
<p>We will first consider gradient descent-like algorithms of the form</p>
<p>\[
\tag{dirStep}
x_{t+1} \leftarrow x_t - \eta_t d_t,
\]</p>
<p>where we choose $d_t \doteq \partial f(x_t)$ and we show how we can establish convergence of the above scheme to an (approximately) optimal solution $x_T$ to $\min_{x \in K} f(x)$ in the case $K = \RR^n$; we will choose the step length $\eta_t$ later. For completeness, the full algorithm looks like this:</p>
<p class="mathcol"><strong>Subgradient Descent Algorithm.</strong> <br />
<em>Input:</em> Convex function $f$ with first-order oracle access and some initial point $x_0 \in \RR^n$ <br />
<em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br />
For $t = 0, \dots, T-1$ do: <br />
$\quad x_{t+1} \leftarrow x_t - \eta_t \partial f(x_t)$<br /></p>
<p>In this section we will assume that $\norm{\cdot}$ is the $\ell_2$-norm, however note that later we will allow for other norms. Let $x^\esx$ be an optimal solution to $\min_{x \in K} f(x)$ and consider the following using (dirStep) and $d_t \doteq \partial f(x_t)$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\norm{x_{t+1} - x^\esx}^2 & = \norm{x_t - x^\esx}^2 - 2 \eta_t \langle \partial f(x_t), x_t - x^\esx\rangle + \eta_t^2 \norm{\partial f(x_t)}^2.
\end{align*} %]]></script>
<p>This can be rearranged to</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\tag{basic}
2 \eta_t \langle \partial f(x_t), x_t - x^\esx\rangle & = \norm{x_t - x^\esx}^2 - \norm{x_{t+1} - x^\esx}^2 + \eta_t^2 \norm{\partial f(x_t)}^2,
\end{align*} %]]></script>
<p>as we aim to later estimate $f(x_t) - f(x^\esx) \leq \langle \partial f(x_t), x_t - x^\esx\rangle$, which holds as $\partial f(x_t)$ is a subgradient. However, as we set out to provide a unified perspective on various settings, including online learning, we will do this substitution only at the very end. Adding up those equations until iteration $T-1$ we obtain:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\sum_{t = 0}^{T-1} 2\eta_t \langle \partial f(x_t), x_t - x^\esx\rangle & = \norm{x_0 - x^\esx}^2 - \norm{x_{T} - x^\esx}^2 + \sum_{t = 0}^{T-1} \eta_t^2 \norm{\partial f(x_t)}^2 \\
& \leq \norm{x_0 - x^\esx}^2 + \sum_{t = 0}^{T-1} \eta_t^2 \norm{\partial f(x_t)}^2.
\end{align*} %]]></script>
<p>Let us further assume that $\norm{\partial f(x_t)} \leq G$ for all $t = 0, \dots, T-1$ for some $G \in \RR$ and to simplify the exposition let us choose $\eta_t \doteq \eta > 0$ for now for all $t$ for some $\eta$ to be chosen later. We obtain:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
2\eta \sum_{t = 0}^{T-1} \langle \partial f(x_t), x_t - x^\esx\rangle & \leq \norm{x_0 - x^\esx}^2 + \eta^2 T G^2 \\
\Leftrightarrow \sum_{t = 0}^{T-1} \langle \partial f(x_t), x_t - x^\esx\rangle & \leq \frac{\norm{x_0 - x^\esx}^2}{2\eta} + \frac{\eta}{2} T G^2,
\end{align*} %]]></script>
<p>where the right-hand side is minimized for</p>
<script type="math/tex; mode=display">\eta \doteq \frac{\norm{x_0 - x^\esx}}{G} \sqrt{\frac{1}{T}},</script>
<p>leading to</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\tag{regretBound}
\sum_{t = 0}^{T-1} \langle \partial f(x_t), x_t - x^\esx\rangle & \leq G \norm{x_0 - x^\esx} \sqrt{T}.
\end{align*} %]]></script>
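<p>For completeness, here is where the step size above comes from. Writing $R \doteq \norm{x_0 - x^\esx}$ (a shorthand used only in this remark) and $g(\eta) \doteq \frac{R^2}{2\eta} + \frac{\eta}{2} T G^2$ for the right-hand side, which is convex in $\eta > 0$, setting the derivative to zero gives the minimizer and plugging it back in gives the bound:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
g'(\eta) = -\frac{R^2}{2\eta^2} + \frac{T G^2}{2} = 0 \quad & \Leftrightarrow \quad \eta = \frac{R}{G} \sqrt{\frac{1}{T}}, \\
g\left(\frac{R}{G \sqrt{T}}\right) & = \frac{R G \sqrt{T}}{2} + \frac{R G \sqrt{T}}{2} = G R \sqrt{T}.
\end{align*} %]]></script>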
<p>Note that we usually do not know $\norm{x_0 - x^\esx}$ in advance, however here and later it suffices to upper bound $\norm{x_0 - x^\esx}$ and compute the step length with respect to the upper bound. We will later see that (regretBound) can be used as a starting point to develop online learning algorithms; for now, however, we will derive our convergence guarantee from it. To this end we divide both sides by $T$ and use convexity and the subgradient property to conclude:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\tag{convergenceSG}
f(\bar x) - f(x^\esx) & \leq \frac{1}{T} \sum_{t = 0}^{T-1} \left(f(x_t) - f(x^\esx)\right) \\
& \leq \frac{1}{T} \sum_{t = 0}^{T-1} \langle \partial f(x_t), x_t - x^\esx\rangle \\
& \leq G \norm{x_0 - x^\esx} \frac{1}{\sqrt{T}},
\end{align*} %]]></script>
<p>where $\bar x \doteq \frac{1}{T} \sum_{t=0}^{T-1} x_t$ is the average of all iterates. As such we obtain an $O(1/\sqrt{T})$ convergence rate for our algorithm. It is useful to observe that what the algorithm does is to minimize the average of the dual gaps at points $x_t$ given by $\langle \partial f(x_t), x_t - x^\esx\rangle$; since the average of the dual gaps upper bounds the primal gap of the average point, convergence follows.</p>
<p>This basic analysis is the standard analysis for <em>subgradient descent</em> and will serve as a starting point for what follows.</p>
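<p>To make the two bounds concrete, here is a minimal numerical sketch. The test function $f(x) = \norm{x}_1$, the starting point, and all helper names are assumptions chosen for illustration; they are not part of the analysis above. The minimizer is $x^\esx = 0$ and the sign vector is a subgradient whose $\ell_2$-norm is at most $\sqrt{n}$, which serves as $G$.</p>

```python
import math

def l1_norm(x):
    # Hypothetical test function f(x) = ||x||_1, minimized at x* = 0.
    return sum(abs(v) for v in x)

def l1_subgradient(x):
    # Sign vector; 0 is a valid subgradient entry where x_i == 0.
    return [float((v > 0) - (v < 0)) for v in x]

def subgradient_descent(x0, x_star, G, T):
    """Run T steps with the fixed step length eta = ||x0 - x*|| / (G sqrt(T)).

    Returns the average iterate, the sum of the dual gaps <g_t, x_t - x*>,
    and the initial distance ||x0 - x*||.
    """
    dist0 = math.sqrt(sum((a - b) ** 2 for a, b in zip(x0, x_star)))
    eta = dist0 / (G * math.sqrt(T))
    x = list(x0)
    x_bar = [0.0] * len(x0)
    gap_sum = 0.0
    for _ in range(T):
        g = l1_subgradient(x)
        gap_sum += sum(gi * (xi - si) for gi, xi, si in zip(g, x, x_star))
        x_bar = [b + xi / T for b, xi in zip(x_bar, x)]   # average of x_0, ..., x_{T-1}
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x_bar, gap_sum, dist0
```

<p>Calling <code>subgradient_descent([3.0, -2.0, 1.5], [0.0, 0.0, 0.0], math.sqrt(3), 400)</code> lets one compare the returned dual gap sum against $G \norm{x_0 - x^\esx} \sqrt{T}$ and the value of $f$ at the average iterate against $G \norm{x_0 - x^\esx} / \sqrt{T}$.</p>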
<p>Before we continue the following remarks are in order:</p>
<ol>
<li>An important observation is that in the argument above we never used that $x^\esx$ is an optimal solution and in fact the arguments hold for <em>any</em> point $u$; in particular for some choices $u$ the left-hand side of (regretBound) <em>can be negative</em> (as in the case of (convergenceSG) which becomes vacuous in this case). We will see the implications of this very soon below in the online learning section. Ultimately, subgradient descent (and also mirror descent, as we will see later) is a <em>dual method</em> in the sense that it directly minimizes the duality gap or equivalently maximizes the dual. That is where the strong guarantees with respect to <em>all points</em> $u$ come from.</li>
<li>Another important insight is that the argument from above does not provide a <em>descent algorithm</em>, i.e., it is <em>not guaranteed</em> that we make progress in terms of primal function value decrease in each iteration. However, what we show is that picking $\eta$ as above ensures that the function value at the average point $\bar x$ converges to the optimal value: we make progress on average.</li>
<li>In the current form as stated above the choice of $\eta$ requires prior knowledge of the number of total iterations $T$ and the guarantee <em>only</em> applies to the average point obtained from averaging over all iterations $T$. However, this can be remedied in various ways. The poor man’s approach is to simply run the algorithm with a small $T$ and whenever $T$ is reached to double $T$ and restart the algorithm. This is usually referred to as the <em>doubling-trick</em> and at most doubles the number of performed iterations but now we do not need prior knowledge of $T$ and we obtain guarantees at iterations of the form $t = 2^\ell$ for $\ell = 1,2, \dots$. The smarter way is to use a variable step size as we will show later. This requires however that $\norm{x_t - x^\esx} \leq D$ holds for all iterates for some constant $D$, which might be hard to ensure in general but which can be safely assumed in the compact constrained case by choosing $D$ to be the diameter; the guarantees will depend on that parameter.</li>
</ol>
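<p>The claim that the doubling-trick at most doubles the number of performed iterations can be checked with a small sketch. The function name is hypothetical, and since the remark leaves open what exactly is restarted and from which point, the sketch only tracks the schedule of horizons and not the iterates.</p>

```python
def doubling_horizons(total_needed):
    """Yield the horizons 1, 2, 4, ... until at least total_needed iterations were run.

    Each yielded horizon T would correspond to one restart of the fixed-step
    method with a step length computed for that T (an assumption of this sketch).
    """
    horizon, done = 1, 0
    while done < total_needed:
        yield horizon
        done += horizon
        horizon *= 2
```

<p>For instance, <code>list(doubling_horizons(100))</code> produces the horizons up to $64$, and their sum stays below twice the requested number of iterations.</p>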
<h3 id="an-optimal-update">An “optimal” update</h3>
<p>Similar to the descent approach using smoothness, as done in several previous posts, e.g., <a href="/blog/research/2018/12/07/cheatsheet-smooth-idealized.html">Cheat Sheet: Smooth Convex Optimization</a>, we might try to pick $\eta_t$ in each step to maximize progress. Our starting point is</p>
<script type="math/tex; mode=display">% <![CDATA[
\tag{expand}
\begin{align*}
\norm{x_{t+1} - x^\esx}^2 & = \norm{x_t - x^\esx}^2 - 2 \eta_t \langle \partial f(x_t), x_t - x^\esx\rangle + \eta_t^2 \norm{\partial f(x_t)}^2,
\end{align*} %]]></script>
<p>from above and we want to choose $\eta_t$ to maximize progress in terms of $\norm{x_{t+1} - x^\esx}$ vs. $\norm{x_t - x^\esx}$, i.e., decrease in distance to the optimal solution. Observe that the right-hand side is convex in $\eta_t$ and optimizing over $\eta_t$ leads to</p>
<script type="math/tex; mode=display">\eta_t^\esx \doteq \frac{\langle \partial f(x_t), x_t - x^\esx\rangle}{\norm{\partial f(x_t)}^2}.</script>
<p>This choice of $\eta_t^\esx$ looks very similar to the choice that we have seen before for, e.g., gradient descent in the smooth case (see, e.g., <a href="/blog/research/2018/12/07/cheatsheet-smooth-idealized.html">Cheat Sheet: Smooth Convex Optimization</a>), with some important differences however: we cannot compute the above step length as we do not know $x^\esx$; we ignore this for now.</p>
<p>Plugging the step length back into (expand), we obtain:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\norm{x_{t+1} - x^\esx}^2 & = \norm{x_t - x^\esx}^2 - \frac{\langle \partial f(x_t), x_t - x^\esx\rangle^2}{\norm{\partial f(x_t)}^2}.
\end{align*} %]]></script>
<p>This shows that the squared distance to the optimal solution decreases by $\frac{\langle \partial f(x_t), x_t - x^\esx\rangle^2}{\norm{\partial f(x_t)}^2}$, i.e., the better aligned the subgradient is with the idealized direction $x_t - x^\esx$, along which we would move to reach an optimal solution $x^\esx$, the faster the progress. In particular, if the alignment is perfect, then <em>one step</em> suffices. Note, however, that this is <em>only</em> hypothetical as the computation of the optimal step length requires knowledge of an optimal solution. This is simply to demonstrate that a “non-deterministic” version would only require one step. This is in contrast to, e.g., gradient step progress in function value exploiting smoothness (see <a href="/blog/research/2018/12/07/cheatsheet-smooth-idealized.html">Cheat Sheet: Smooth Convex Optimization</a>). In that case, only using first order information and smoothness we <em>naturally</em> obtain, e.g., an $O(1/t)$-rate for the smooth case, even for the non-deterministic idealized algorithm, where we guess the direction $x_t - x^\esx$ pointing towards the optimum. This is a subtle but important difference.</p>
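<p>The one-step claim and the decrease formula are easy to verify numerically. The following is a sketch under stated assumptions: the helper names are hypothetical, $x^\esx$ is handed to the routine (which is exactly what makes the step idealized), and the two gradients in the usage note belong to the illustrative functions $x_1^2 + x_2^2$ and $x_1^2 + 10 x_2^2$.</p>

```python
def sq_dist(x, y):
    # Squared Euclidean distance ||x - y||^2.
    return sum((a - b) ** 2 for a, b in zip(x, y))

def idealized_step(x, x_star, g):
    """One step with eta* = <g, x - x*> / ||g||^2.

    Requires knowledge of x*, so this is purely hypothetical.
    Returns the new point and the predicted decrease in squared distance.
    """
    d = [xi - si for xi, si in zip(x, x_star)]
    inner = sum(gi * di for gi, di in zip(g, d))
    g_sq = sum(gi * gi for gi in g)
    eta = inner / g_sq
    x_next = [xi - eta * gi for xi, gi in zip(x, g)]
    return x_next, inner ** 2 / g_sq
```

<p>At $x = (2, 1)$ with $x^\esx = 0$, the gradient $(4, 2)$ of $x_1^2 + x_2^2$ is perfectly aligned with $x - x^\esx$ and a single step lands on $x^\esx$; the gradient $(4, 20)$ of $x_1^2 + 10 x_2^2$ is not, and the squared distance only drops by the predicted amount.</p>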
<p>Finally, to add slightly more to the confusion (for now) compare the rearranged (expand) which captures progress <em>in the distance</em></p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\norm{x_t - x^\esx}^2 - \norm{x_{t+1} - x^\esx}^2 & = 2 \eta_t \langle \partial f(x_t), x_t - x^\esx\rangle - \eta_t^2 \norm{\partial f(x_t)}^2,
\end{align*} %]]></script>
<p>to the smoothness induced progress <em>in function value</em> (or primal gap) for the idealized $d \doteq x_t - x^\esx$ (see, e.g., <a href="/blog/research/2018/12/07/cheatsheet-smooth-idealized.html">Cheat Sheet: Smooth Convex Optimization</a>):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
f(x_{t}) - f(x_{t+1}) & \geq \eta_t \langle\nabla f(x_t), x_t - x^\esx \rangle - \eta_t^2 \frac{L}{2}\norm{x_t - x^\esx}^2.
\end{align*} %]]></script>
<p>These two progress-inducing (in-)equalities are very similar. In particular, in the smooth case, for $\eta_t$ tiny, the progress is identical up to the linear factor $2$ and lower order terms; this is for a good reason as I will discuss sometime in the future when we look at the continuous time versions.</p>
<h3 id="online-learning">Online Learning</h3>
<p>In the following we will discuss the connection of the above to online learning. In <em>online learning</em> we typically consider the following setup; I simplified the setup slightly for exposition and the exact requirements will become clear from the actual algorithm that we will use.</p>
<p>We consider two players: the <em>adversary</em> and the <em>player</em>. We then play a game over $T$ rounds of the following form:</p>
<p class="mathcol"><strong>Game.</strong> For $t = 0, \dots, T-1$ do: <br />
(1) Player chooses an action $x_t$ <br />
(2) Adversary picks a (convex) function $f_t$, reveals $\partial f_t(x_t)$ and $f_t(x_t)$ <br />
(3) Player updates/learns via $\partial f_t(x_t)$ and incurs cost $f_t(x_t)$ <br /></p>
<p>The goal of the game is to minimize the so-called <em>regret</em>, which is defined as:</p>
<script type="math/tex; mode=display">\tag{regret}
\sum_{t = 0}^{T-1} f_t(x_t) - \min_{x} \sum_{t = 0}^{T-1} f_t(x),</script>
<p>which measures how well our <em>dynamic strategy</em> $x_0, \dots, x_{T-1}$ compares to the <em>single best decision in hindsight</em>, i.e., a <em>static strategy</em> given perfect information.</p>
<p>Although surprising at first, it turns out that one can show that there exists an algorithm that generates a strategy $x_0, \dots, x_{T-1}$, so that (regret) grows sublinearly, in fact typically of the order $O(\sqrt{T})$, i.e., something of the following form holds:</p>
<script type="math/tex; mode=display">\sum_{t = 0}^{T-1} f_t(x_t) - \min_{x} \sum_{t = 0}^{T-1} f_t(x) \leq O(\sqrt{T}).</script>
<p>What does this mean? If we divide both sides by $T$, we obtain the so-called <em>average regret</em> and the bound becomes:</p>
<script type="math/tex; mode=display">\frac{1}{T} \left(\sum_{t = 0}^{T-1} f_t(x_t) - \min_{x} \sum_{t = 0}^{T-1} f_t(x)\right) \leq O\left(\frac{1}{\sqrt{T}}\right),</script>
<p>showing that the average mistake that we make per round, in the long run, tends to $0$ at a rate of $O\left(\frac{1}{\sqrt{T}}\right)$.</p>
<p>Now it is time to wonder what this has to do with what we have seen so far. It turns out that our basic analysis from above already provides a bound for the most basic unconstrained case for a given time horizon $T$. To this end recall the inequality (regretBound) that we established above:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\sum_{t = 0}^{T-1} \langle \partial f(x_t), x_t - x^\esx\rangle & \leq G \norm{x_0 - x^\esx} \sqrt{T}.
\end{align*} %]]></script>
<p>A careful look at the argument that we used to establish the inequality (regretBound) reveals that it actually does not depend on $f$ being the same in each iteration and also that we can replace $x^\esx$ by any other feasible solution $u$ (as discussed before) so that we also proved:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\sum_{t = 0}^{T-1} \langle \partial f_t(x_t), x_t - u\rangle & \leq G \norm{x_0 - u} \sqrt{T},
\end{align*} %]]></script>
<p>with $G$ now being a bound on the subgradients across the rounds, i.e., $\norm{\partial f_t(x_t)} \leq G$. Now using the fact that $\partial f_t(x_t)$ is a subgradient, we obtain:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\sum_{t = 0}^{T-1} \left (f_t(x_t) - f_t(u) \right) \leq \sum_{t = 0}^{T-1} \langle \partial f_t(x_t), x_t - u \rangle & \leq G \norm{x_0 - u} \sqrt{T},
\end{align*} %]]></script>
<p>which holds for any $u$, in particular for the minimum $x^\esx \doteq \arg\min_{x} \sum_{t = 0}^{T-1} f_t(x)$ (assuming the minimum exists):</p>
<script type="math/tex; mode=display">% <![CDATA[
\tag{regretSG}
\begin{align*}
\sum_{t = 0}^{T-1} f_t(x_t) - \sum_{t = 0}^{T-1} f_t(x^\esx) \leq \sum_{t = 0}^{T-1} \langle \partial f_t(x_t), x_t - x^\esx \rangle & \leq G \norm{x_0 - x^\esx} \sqrt{T}.
\end{align*} %]]></script>
<p>This establishes sublinear regret for the actions played by the player according to:</p>
<p>\[
x_{t+1} \leftarrow x_t - \eta_t \partial f_t(x_t),
\]
with the step length $\eta_t = \eta \doteq \frac{\norm{x_0 - x^\esx}}{G} \sqrt{\frac{1}{T}}$, which in this context is also often referred to as <em>learning rate</em>. This setting requires knowledge of $T$ <em>and</em> knowledge of $\norm{x_0 - x^\esx}$ ahead of time. As discussed earlier the lack of knowledge of $T$ can be overcome, either with the doubling-trick or via the variable step length approach that we discuss further below; the cost in terms of regret is a $\sqrt{2}$-factor for the latter. However, the lack of knowledge of $\norm{x_0 - x^\esx}$, while a non-issue later in the constrained case as we can simply overestimate by the diameter of the feasible region, does pose a <em>significant issue</em> in the unconstrained case and it is unclear how to turn the above algorithmic scheme into an actual algorithm with sublinear regret in the unconstrained case; note that there are other online learning algorithms that <em>can</em> achieve sublinear regret in this setting, however they are (much) more involved and quite different in spirit from the approach from online convex optimization [MS].</p>
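<p>As a rough illustration of the game and the regret bound, consider the following one-dimensional sketch. The losses $f_t(x) = |x - z_t|$, the function name, and the fact that the comparator $u$ is handed to the player in order to set the learning rate are all assumptions of this example; the last one mirrors the requirement of knowing $\norm{x_0 - x^\esx}$ ahead of time.</p>

```python
import math

def online_subgradient_descent(targets, x0, u, G=1.0):
    """Play the losses f_t(x) = |x - z_t| for z_t in targets (hypothetical adversary).

    Returns the regret of the played actions against the fixed point u.
    G = 1 bounds the subgradients of these losses.
    """
    T = len(targets)
    eta = abs(x0 - u) / (G * math.sqrt(T))   # needs T and |x0 - u| ahead of time
    x, regret = x0, 0.0
    for z in targets:
        regret += abs(x - z) - abs(u - z)    # cost incurred minus cost of the comparator
        g = (x > z) - (x < z)                # a subgradient of |x - z_t| at x_t
        x -= eta * g
    return regret
```

<p>For these losses a median of the $z_t$ is a best decision in hindsight, so passing it as <code>u</code> gives the regret in the sense of (regret), which can then be compared to $G \norm{x_0 - u} \sqrt{T}$.</p>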
<p>So what is our algorithm doing when deployed in an online setting? For this it is helpful to consider the update in iteration $t$ through (basic), rearranged for convenience:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\norm{x_{t+1} - u}^2 & = \norm{x_t - u}^2 + \eta_t^2 \norm{\partial f_t(x_t)}^2 - 2 \eta_t \langle \partial f_t(x_t), x_t - u\rangle,
\end{align*} %]]></script>
<p>so for $x_{t+1}$ to move closer to a given $u$, it is necessary that</p>
<script type="math/tex; mode=display">% <![CDATA[
\eta_t^2 \norm{\partial f_t(x_t)}^2 - 2 \eta_t \langle \partial f_t(x_t), x_t - u\rangle < 0, %]]></script>
<p>or equivalently, that</p>
<script type="math/tex; mode=display">% <![CDATA[
\frac{\eta_t}{2} \norm{\partial f_t(x_t)}^2 < \langle \partial f_t(x_t), x_t - u\rangle, %]]></script>
<p>i.e., the <em>potential gain</em>, measured by the dual gap $\langle \partial f_t(x_t), x_t - u\rangle$ must be larger than $\frac{\eta_t}{2} \norm{\partial f_t(x_t)}^2$, where $\norm{\partial f_t(x_t)}^2$ is the maximally possible gain (the scalar product is maximized at $\partial f_t(x_t)$, e.g., by Cauchy-Schwarz). As such we require an $\frac{\eta_t}{2}$ fraction of the total possible gain to move closer to $u$.</p>
<p>We will later see that other online learning variants naturally arise the same way by ‘short-cutting’ the convergence proof as we have done here. In particular, we will see that the famous <em>Multiplicative Weight Update</em> algorithm is basically obtained from short-cutting the Mirror Descent convergence proof for the probability simplex with the relative entropy as Bregman divergence; more on this later.</p>
<h2 id="the-constrained-setting-projected-subgradient-descent">The constrained setting: projected subgradient descent</h2>
<p>We will now move to the constrained setting where we require that the iterates $x_t$ are contained in some convex set $K$, i.e., $x_t \in K$. As in the above, our starting point is the <em>poor man’s identity</em> arising from expanding the norm. To this end we write:</p>
<script type="math/tex; mode=display">\norm{x_{t+1} - x^\esx}^2 = \norm{x_t - x^\esx}^2 - 2 \langle x_t - x_{t+1}, x_t - x^\esx \rangle + \norm{x_t - x_{t+1}}^2,</script>
<p>or in a more convenient form (by rearranging) as:</p>
<script type="math/tex; mode=display">\tag{normExpand}
2 \langle x_t - x_{t+1}, x_t - x^\esx \rangle = \norm{x_t - x^\esx}^2 - \norm{x_{t+1} - x^\esx}^2 + \norm{x_t - x_{t+1}}^2.</script>
<p>In the basic analysis of subgradient descent, we then used the specific form of the update $x_{t+1} \leftarrow x_t - \eta_t \partial f(x_t)$ and then summed and telescoped out. Now, things are different. A hypothetical update $x_{t+1} \leftarrow x_t - \eta_t \partial f(x_t)$ might lead outside of $K$, i.e., $x_{t+1} \not\in K$ might happen. Observe though that (normExpand) still telescopes as before by simply adding up over the iterations; however, we have no idea how $\langle x_t - x_{t+1}, x_t - x^\esx \rangle$ relates to our function $f$ of interest and clearly this has to depend on the actual step we take, i.e., on the properties of $x_{t+1}$. A natural, but slightly too optimistic thing to hope for is to find a step $x_{t+1}$, such that</p>
<script type="math/tex; mode=display">\tag{optimistic} \langle \eta_t \partial f(x_{t}), x_t - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_t - x^\esx \rangle,</script>
<p>holds as this actually held even with equality in the unconstrained case. However, suppose we can show the following:</p>
<script type="math/tex; mode=display">\tag{lookAhead} \langle \eta_t \partial f(x_{t}), x_{t+1} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t+1} - x^\esx \rangle.</script>
<p>Note the subtle difference in the indices in the $x_{t+1} - x^\esx$ term. It is much easier to show (lookAhead), as we will do further below, because the point $x_{t+1}$ that we choose as a function of $\partial f(x_t)$ and $x_t$ is under our control; in comparison $x_t$ is already chosen at time $t$. However, this is not yet good enough to telescope out the sums due to the mismatch in indices. The following observation remedies the situation by undoing the index shift and quantifying the change:</p>
<p class="mathcol"><strong>Observation.</strong> If $\langle \eta_t \partial f(x_{t}), x_{t+1} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t+1} - x^\esx \rangle$, then
\[
\tag{lookAheadIneq}
\begin{align}
\langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle & \leq \langle x_t - x_{t+1}, x_{t} - x^\esx \rangle \newline
\nonumber & - \frac{1}{2}\norm{x_t - x_{t+1}}^2 \newline
\nonumber & +\frac{1}{2}\norm{\eta_t \partial f(x_t)}^2
\end{align}
\]</p>
<p>Before proving the observation, observe that in the unconstrained case, where we choose $x_{t+1} = x_t - \eta_t \partial f(x_t)$, the inequality in the observation reduces to (optimistic), holding even with equality, and when plugging this back into our poor man’s identity this exactly becomes the basic argument from the beginning of the post. This is good news as it indicates that the observation reduces to what we know already in the unconstrained case. As such we might want to think of the observation as relating the step $x_{t+1} - x_t$ that we take with $\partial f(x_t)$, assuming that we can choose $x_{t+1}$ to satisfy (lookAhead).</p>
<p><em>Proof (of observation).</em>
Our starting point is the inequality (lookAhead) whose validity we establish a little later:</p>
<script type="math/tex; mode=display">\langle \eta_t \partial f(x_{t}), x_{t+1} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t+1} - x^\esx \rangle.</script>
<p>We will simply brute-force rewrite the inequality into the desired form and collect the error terms in the process. The above inequality is equivalent to:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
& \langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle + \langle \eta_t \partial f(x_{t}), x_{t+1} - x_t \rangle \\
\leq\ & \langle x_t - x_{t+1}, x_{t} - x^\esx \rangle + \langle x_t - x_{t+1}, x_{t+1} - x_t \rangle.
\end{align*} %]]></script>
<p>Rewriting we obtain:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
& \langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle \\
\leq\ & \langle x_t - x_{t+1}, x_{t} - x^\esx \rangle -\norm{x_{t+1} - x_t}^2 - \langle \eta_t \partial f(x_{t}), x_{t+1} - x_t \rangle \\
=\ & \langle x_t - x_{t+1}, x_{t} - x^\esx \rangle - \frac{1}{2}\norm{x_{t+1} - x_t}^2 - \frac{1}{2}\left(\norm{x_{t+1} - x_t}^2 + 2 \langle \eta_t \partial f(x_{t}), x_{t+1} - x_t \rangle\right) \\
\leq\ & \langle x_t - x_{t+1}, x_{t} - x^\esx \rangle - \frac{1}{2}\norm{x_{t+1} - x_t}^2 + \frac{1}{2} \norm{ \eta_t \partial f(x_{t})}^2,
\end{align*} %]]></script>
<p>where the last inequality uses the binomial formula, i.e., $\norm{a+b}^2 = \norm{a}^2 + 2\langle a, b\rangle + \norm{b}^2 \geq 0$ and hence $-\left(\norm{a}^2 + 2\langle a, b\rangle\right) \leq \norm{b}^2$, applied with $a = x_{t+1} - x_t$ and $b = \eta_t \partial f(x_t)$.</p>
<p>$\qed$</p>
<p>With the observation we can immediately conclude our convergence proof and the argument becomes identical to the basic case from above. Recall that our starting point is (normExpand):</p>
<script type="math/tex; mode=display">2 \langle x_t - x_{t+1}, x_t - x^\esx \rangle = \norm{x_t - x^\esx}^2 - \norm{x_{t+1} - x^\esx}^2 + \norm{x_t - x_{t+1}}^2.</script>
<p>Now we can estimate the term on the left-hand side using our observation. This leads to:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
& 2 \langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle + \norm{x_t - x_{t+1}}^2 - \norm{\eta_t \partial f(x_t)}^2\\
& \leq 2 \langle x_t - x_{t+1}, x_t - x^\esx \rangle \\
& = \norm{x_t - x^\esx}^2 - \norm{x_{t+1} - x^\esx}^2 + \norm{x_t -x_{t+1}}^2,
\end{align*} %]]></script>
<p>and after subtracting $\norm{x_t - x_{t+1}}^2$ and adding $\norm{\eta_t \partial f(x_t)}^2$, we obtain:</p>
<script type="math/tex; mode=display">2 \langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle
\leq \norm{x_t - x^\esx}^2 - \norm{x_{t+1} - x^\esx}^2 + \norm{\eta_t \partial f(x_t)}^2,</script>
<p>which is exactly (basic) as above and we can conclude the argument the same way: summing up and telescoping and then optimizing $\eta_t$. In particular, the convergence rate (convergenceSG) and regret bound (regretBound) stay the same with no deterioration due to constraints or projections:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\sum_{t = 0}^{T-1} \langle \partial f(x_t), x_t - x^\esx\rangle & \leq G \norm{x_0 - x^\esx} \sqrt{T},
\end{align*} %]]></script>
<p>and</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
f(\bar x) - f(x^\esx) & \leq G \norm{x_0 - x^\esx} \frac{1}{\sqrt{T}}.
\end{align*} %]]></script>
<p>So the key is really establishing (lookAhead) as it immediately implies all we need to establish convergence in the constrained case. This is what we will do now, which will also, finally, specify our choice of $x_{t+1}$.</p>
<h3 id="using-optimization-to-prove-what-you-want">Using optimization to prove what you want</h3>
<p>Have you ever wondered why people add these weird 2-norms to their optimization problem to “regularize” the problem, i.e., they solve problems of the form $\min_{x} f(x) + \lambda \norm{x - z}^2$? Then this section might provide some insight into this. We will see that it is actually not about the “problem” that is solved, but about what an optimal solution might guarantee; bear with me.</p>
<p>So what we want to establish is inequality (lookAhead), i.e.,</p>
<script type="math/tex; mode=display">\langle \eta_t \partial f(x_{t}), x_{t+1} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t+1} - x^\esx \rangle,</script>
<p>or slightly more generally stated as our proof will work for <em>all</em> $u \in K$ (and in particular the choice $u = x^\esx$):</p>
<script type="math/tex; mode=display">\langle \eta_t \partial f(x_{t}), x_{t+1} - u \rangle \leq \langle x_t - x_{t+1}, x_{t+1} - u \rangle.</script>
<p>Rearranging the above we obtain:</p>
<script type="math/tex; mode=display">\tag{optCon} \langle \eta_t \partial f(x_{t}), x_{t+1} - u \rangle - \langle x_t - x_{t+1}, x_{t+1} - u \rangle \leq 0.</script>
<p>What we will do now is to interpret the above as an <em>optimality condition</em> to some smooth convex optimization problem of the form $\min_{x \in K} g(x)$, where $g(x)$ is some smooth and convex function. Recall, from the previous posts, e.g., <a href="/blog/research/2018/10/05/cheatsheet-fw.html">Cheat Sheet: Frank-Wolfe and Conditional Gradients</a>, that the first order optimality condition states that for all $u \in K$ it holds:</p>
<script type="math/tex; mode=display">\langle \nabla g(x), x - u \rangle \leq 0,</script>
<p>provided that $x \in K$ is an optimal solution, as otherwise we would be able to make progress via, e.g., a gradient step or a Frank-Wolfe step. By simply reverse engineering (aka remembering how we differentiate), we guess</p>
<script type="math/tex; mode=display">\tag{proj} g(x) \doteq \langle \eta_t \partial f(x_{t}), x \rangle + \frac{1}{2}\norm{x-x_t}^2,</script>
<p>so that its optimality condition produces (optCon). We now simply choose</p>
<script type="math/tex; mode=display">\tag{constrainedStep}x_{t+1} \doteq \arg\min_{x \in K} \langle \eta_t \partial f(x_{t}), x \rangle + \frac{1}{2}\norm{x-x_t}^2,</script>
<p>and (just to be sure) we inspect the optimality condition that states:</p>
<script type="math/tex; mode=display">\begin{align*}
\langle \eta_t \partial f(x_{t}), x_{t+1} - u \rangle - \langle x_t - x_{t+1}, x_{t+1} - u \rangle = \langle \nabla g(x_{t+1}), x_{t+1} - u \rangle \leq 0,
\end{align*}</script>
<p>which is exactly (lookAhead). This step then ensures convergence with (maybe surprisingly) a rate identical to the unconstrained case. The resulting algorithm is often referred to as <em>projected subgradient descent</em> and the problem whose optimal solution defines $x_{t+1}$ is the projection problem. We provide the <em>projected subgradient descent</em> algorithm below:</p>
<p class="mathcol"><strong>Projected Subgradient Descent Algorithm.</strong> <br />
<em>Input:</em> Convex function $f$ with first-order oracle access and some initial point $x_0 \in K$<br />
<em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br />
For $t = 0, \dots, T-1$ do: <br />
$\quad x_{t+1} \leftarrow \arg\min_{x \in K} \langle \eta_t \partial f(x_{t}), x \rangle + \frac{1}{2}\norm{x-x_t}^2$<br /></p>
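<p>Completing the square shows that the minimizer of $\langle \eta_t \partial f(x_{t}), x \rangle + \frac{1}{2}\norm{x-x_t}^2$ over $K$ is the Euclidean projection of $x_t - \eta_t \partial f(x_t)$ onto $K$, which is how the step is usually implemented. Below is a minimal sketch for a box, where the projection is a coordinate-wise clip. The box $K = [0,1]^2$, the objective $f(x) = \norm{x - c}_1$ with $c$ outside the box, and all names are assumptions made for illustration.</p>

```python
C = [2.0, -1.0]   # hypothetical target outside the box K = [0, 1]^2

def f(x):
    # Illustrative objective f(x) = ||x - C||_1; its minimizer over K is C clipped to K.
    return sum(abs(xi - ci) for xi, ci in zip(x, C))

def subgrad(x):
    # A subgradient of f at x: the sign vector of x - C.
    return [float((xi > ci) - (xi < ci)) for xi, ci in zip(x, C)]

def project_box(x, lo=0.0, hi=1.0):
    # Euclidean projection onto the box [lo, hi]^n.
    return [min(max(v, lo), hi) for v in x]

def projected_subgradient_descent(x0, eta, T):
    """Return the iterates x_0, ..., x_T for a fixed step length eta."""
    iterates = [list(x0)]
    for _ in range(T):
        x = iterates[-1]
        g = subgrad(x)
        iterates.append(project_box([xi - eta * gi for xi, gi in zip(x, g)]))
    return iterates
```

<p>With $x_0 = (0, 1)$, $x^\esx = (1, 0)$, $G = \sqrt{2}$ and $\eta = \norm{x_0 - x^\esx} / (G \sqrt{T})$, one can check on the returned iterates that they stay in $K$, that (lookAhead) holds in every step, and that the average of $x_0, \dots, x_{T-1}$ satisfies the same $O(1/\sqrt{T})$ bound as in the unconstrained case.</p>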
<h3 id="variable-step-length">Variable step length</h3>
<p>We will now briefly explain how to replace the constant step length from before that requires a priori knowledge of $T$ by a variable step length, so that the convergence guarantee holds for any iterate $x_t$. To this end let $D \geq 0$ be a constant so that $\max_{x,y \in K} \norm{x-y} \leq D$. We now choose $\eta_t \doteq \tau \sqrt{\frac{1}{t+1}}$, where we will specify the constant $\tau \geq 0$ soon.</p>
<p class="mathcol"><strong>Observation.</strong> For $\eta_t$ as above it holds:
\[\sum_{t = 0}^{T-1} \eta_t \leq \tau\left(2 \sqrt{T} - 1\right).\]</p>
<p><em>Proof.</em> There are various ways of showing the above. We follow the argument in [Z]. We have:
<script type="math/tex">% <![CDATA[
\begin{align*}
\sum_{t = 0}^{T-1} \eta_t & = \tau \sum_{t = 0}^{T-1} \frac{1}{\sqrt{t+1}} \\
& \leq \tau \left(1 + \int_{0}^{T-1}\frac{dt}{\sqrt{t+1}}\right) \\
& \leq \tau \left(1 + \left[2 \sqrt{t+1}\right]_0^{T-1} \right) = \tau (2\sqrt{T}-1) \qed
\end{align*} %]]></script></p>
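<p>The bound on the sum of the step lengths can also be sanity-checked numerically; the helper name below is hypothetical and the check is of course no substitute for the proof.</p>

```python
import math

def step_length_sum(T, tau=1.0):
    # Sum of eta_t = tau / sqrt(t + 1) for t = 0, ..., T - 1.
    return sum(tau / math.sqrt(t + 1) for t in range(T))

def step_length_bound(T, tau=1.0):
    # Right-hand side tau * (2 sqrt(T) - 1) of the observation.
    return tau * (2.0 * math.sqrt(T) - 1.0)
```

<p>For $T = 1$ both sides are equal to $\tau$; for larger $T$ the sum stays below the bound.</p>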
<p>Now we restart from inequality (basic) from earlier:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
2\eta_t \langle \partial f(x_t), x_t - x^\esx\rangle & = \norm{x_t - x^\esx}^2 - \norm{x_{t+1} - x^\esx}^2 + \eta_t^2 \norm{\partial f(x_t)}^2,
\end{align*} %]]></script>
<p>however before we sum up and telescope we first divide by $2\eta_t$, i.e.,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\sum_{t=0}^{T-1}\langle \partial f(x_t), x_t - x^\esx\rangle & = \sum_{t=0}^{T-1} \left(\frac{\norm{x_t - x^\esx}^2}{2\eta_t} - \frac{\norm{x_{t+1} - x^\esx}^2}{2\eta_t} + \frac{\eta_t}{2} \norm{\partial f(x_t)}^2\right) \\
& \leq \frac{\norm{x_0 - x^\esx}^2}{2\eta_{0}} - \frac{\norm{x_{T} - x^\esx}^2}{2\eta_{T-1}} \\ & \qquad + \frac{1}{2} \sum_{t=1}^{T-1} \left(\frac{1}{\eta_{t}} - \frac{1}{\eta_{t-1}} \right) \norm{x_t - x^\esx}^2 + \sum_{t=0}^{T-1}\frac{\eta_t}{2} \norm{\partial f(x_t)}^2 \\
& \leq D^2 \left(\frac{1}{2\eta_0} + \frac{1}{2} \sum_{t=1}^{T-1} \left(\frac{1}{\eta_{t}} - \frac{1}{\eta_{t-1}} \right) \right) + \sum_{t=0}^{T-1}\frac{\eta_t}{2} G^2 \\
& \leq \frac{D^2 }{2\eta_{T-1}} + \sum_{t=0}^{T-1}\frac{\eta_t}{2} G^2 \\
& \leq \frac{1}{2}\left(\frac{D^2 }{\tau} \sqrt{T} + 2 G^2 \tau \sqrt{T} \right) = DG\sqrt{2T},
\end{align*} %]]></script>
<p>where we applied the observation from above in the last inequality, plugged in the definition of $\eta_t$, and used the choice $\tau \doteq \frac{D}{G\sqrt{2}}$ in the last equation, which minimizes the term in the brackets. In summary, we have shown:</p>
<script type="math/tex; mode=display">\tag{regretBoundAnytime}
\sum_{t=0}^{T-1}\langle \partial f(x_t), x_t - x^\esx\rangle \leq DG\sqrt{2T}.</script>
<p>From (regretBoundAnytime) we can now derive convergence rates as usual: summing up, then averaging the iterates, and using convexity.</p>
<p>It is useful to compare (regretBoundAnytime) to the case with fixed step length, which is given in (regretBound): using a variable step length costs us a factor of $\sqrt{2}$, however the above bound in (regretBoundAnytime) now holds for <em>all</em> $t$ and a priori knowledge of $T$ is not required. Such regret bounds are sometimes referred to as <em>anytime regret bounds</em>.</p>
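<p>The anytime nature of the bound can be illustrated by tracking the partial sums of the dual gaps under the variable step length. The sketch below makes the same kind of assumptions as any toy instance: a box $K = [0,1]^2$ with diameter $D = \sqrt{2}$, the objective $f(x) = \norm{x - c}_1$ with subgradient norm at most $G = \sqrt{2}$, and hypothetical helper names.</p>

```python
import math

C = [2.0, -1.0]        # hypothetical data: f(x) = ||x - C||_1 over K = [0, 1]^2
X_STAR = [1.0, 0.0]    # minimizer of f over K (C clipped to the box)

def subgrad(x):
    # A subgradient of f at x: the sign vector of x - C.
    return [float((xi > ci) - (xi < ci)) for xi, ci in zip(x, C)]

def project_box(x, lo=0.0, hi=1.0):
    # Euclidean projection onto the box [lo, hi]^n.
    return [min(max(v, lo), hi) for v in x]

def dual_gap_partial_sums(x0, D, G, T):
    """Projected subgradient descent with eta_t = tau / sqrt(t + 1), tau = D / (G sqrt(2)).

    Entry k of the returned list is the sum of <g_t, x_t - x*> over t = 0, ..., k.
    """
    tau = D / (G * math.sqrt(2.0))
    x, total, sums = list(x0), 0.0, []
    for t in range(T):
        eta = tau / math.sqrt(t + 1)
        g = subgrad(x)
        total += sum(gi * (xi - si) for gi, xi, si in zip(g, x, X_STAR))
        sums.append(total)
        x = project_box([xi - eta * gi for xi, gi in zip(x, g)])
    return sums
```

<p>Every prefix of length $T'$ of the returned list can be compared against $DG\sqrt{2T'}$, without the run having been told any horizon in advance.</p>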
<h3 id="online-sub-gradient-descent">Online (Sub-)Gradient Descent</h3>
<p>Starting from (regretBoundAnytime), we can also follow the same path as in the online learning section from above. This recovers the Online (Sub-)Gradient Descent algorithm of [Z]: Consider the online learning setting from before and choose</p>
<script type="math/tex; mode=display">x_{t+1} \leftarrow \arg\min_{x \in K} \langle \eta_t \partial f_t(x_{t}), x \rangle + \frac{1}{2}\norm{x-x_t}^2.</script>
<p>Then, we obtain the regret bound</p>
<script type="math/tex; mode=display">\tag{regretOGDanytime}
\sum_{t=0}^{T-1} f_t(x_t) - \min_{x \in K} \sum_{t=0}^{T-1} f_t(x) \leq
\sum_{t=0}^{T-1}\langle \partial f_t(x_t), x_t - x^\esx\rangle \leq DG\sqrt{2T},</script>
<p>in the anytime setting and</p>
<script type="math/tex; mode=display">\tag{regretOGD}
\sum_{t=0}^{T-1} f_t(x_t) - \min_{x \in K} \sum_{t=0}^{T-1} f_t(x) \leq
\sum_{t=0}^{T-1}\langle \partial f_t(x_t), x_t - x^\esx\rangle \leq DG\sqrt{T},</script>
<p>when $T$ is known ahead of time, where $D$ and $G$ are bounds on the diameter of the feasible domain and the norm of the subgradients, respectively, as before.</p>
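<p>A minimal sketch of the resulting online algorithm in one dimension follows. It assumes linear losses $f_t(x) = c_t x$ with $|c_t| \leq G$ on an interval $K = [\mathrm{lo}, \mathrm{hi}]$, so that $D = \mathrm{hi} - \mathrm{lo}$ and the best fixed decision in hindsight sits at an endpoint; the function name and this choice of losses are illustrative and not taken from [Z].</p>

```python
import math

def online_gradient_descent(costs, x0=0.0, lo=-1.0, hi=1.0, G=1.0):
    """Play linear losses f_t(x) = c_t * x on K = [lo, hi] with eta_t = tau / sqrt(t + 1).

    Assumes |c_t| <= G for all rounds. Returns the regret against the best
    fixed point of K in hindsight.
    """
    D = hi - lo
    tau = D / (G * math.sqrt(2.0))
    x, loss = x0, 0.0
    for t, c in enumerate(costs):
        loss += c * x                        # incur f_t(x_t)
        eta = tau / math.sqrt(t + 1)
        x = min(max(x - eta * c, lo), hi)    # projected step; the gradient of f_t is c_t
    s = sum(costs)
    best_fixed = min(lo * s, hi * s)         # a linear function attains its minimum at an endpoint
    return loss - best_fixed
```

<p>The returned regret can be compared to $DG\sqrt{2T}$ for any number of rounds, since the step lengths never use $T$.</p>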
<h2 id="mirror-descent">Mirror Descent</h2>
<p>We will now derive Nemirovski’s Mirror Descent algorithm (see e.g., [NY]) and we will be following somewhat the proximal perspective as outlined in [BT]. Simplifying and running the risk of attracting the wrath of the optimization titans, <em>Mirror Descent</em> arises from subgradient descent by replacing the $\ell_2$-norm with a “generalized distance” that satisfies the inequalities that we needed in the basic argument from above.</p>
<p>Why would you want to do that? Adjusting the distance function will allow us to fine-tune the iterates and the resulting dimension-dependent term for the geometry under consideration.</p>
<p>In the following, as we move away from the $\ell_2$-norm, which is self-dual, we will need the definition of the <em>dual norm</em> defined as <script type="math/tex">\norm{w}_\esx \doteq \max\setb{\langle w , x \rangle : \norm{x} = 1}</script>. Note that for the $\ell_p$-norm the $\ell_q$-norm is dual if $\frac{1}{p} + \frac{1}{q} = 1$. For the $\ell_1$-norm the dual norm is $\ell_\infty$. We will also need the <em>(generalized) Cauchy-Schwarz inequality</em> or <em>Hölder inequality</em>: $\langle y , x \rangle \leq \norm{y}_\esx \norm{x}$. A very useful consequence of this inequality is:</p>
<script type="math/tex; mode=display">\tag{genBinomial}
\norm{a}^2 - 2 \langle a , b \rangle + \norm{b}^2_\esx \geq 0,</script>
<p>which follows from</p>
<script type="math/tex; mode=display">\begin{align*}
\norm{a}^2 - 2 \langle a , b \rangle + \norm{b}^2_\esx \geq \norm{a}^2 - 2 \norm{a} \norm{b}_\esx + \norm{b}^2_\esx = (\norm{a} - \norm{b}_\esx)^2 \geq 0.
\end{align*}</script>
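<p>For a concrete pair of norms, take $\norm{\cdot}$ to be the $\ell_1$-norm, whose dual is the $\ell_\infty$-norm. The sketch below evaluates both sides of the Hölder inequality and the left-hand side of (genBinomial) for given vectors; the helper names are hypothetical.</p>

```python
def norm_l1(x):
    return sum(abs(v) for v in x)

def norm_linf(x):
    # Dual norm of the l1-norm.
    return max(abs(v) for v in x)

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

def gen_binomial(a, b):
    # Left-hand side of (genBinomial) with ||.|| = l1 and ||.||_* = l_infinity.
    return norm_l1(a) ** 2 - 2.0 * inner(a, b) + norm_linf(b) ** 2
```

<p>Note that the two arguments play different roles: $a$ is measured in the primal norm and $b$ in the dual norm, so swapping them changes the value.</p>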
<h3 id="generalized-distance-aka-bregman-divergence">“Generalized Distance” aka Bregman divergence</h3>
<p>We will first introduce the generalization of norms that we will be working with. To this end, let us first collect the desired properties that we needed in the proof of the basic argument; I will already suggestively use the final notation to not create notational overload. Let our desired function be called $V_x(y)$ and let us further assume in a first step the choice $V_x(y) = \frac{1}{2} \norm{x-y}^2$; note the factor $\frac{1}{2}$ is only used to make the proofs cleaner.</p>
<p>In the very first step we used the expansion of the $\ell_2$-norm. As $x^\esx$ plays no special role, we write everything with respect to any feasible $u$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\norm{x_{t+1} - u}^2 & = \norm{x_t - u}^2 - 2 \langle x_{t} - x_{t+1}, x_t - u\rangle + \norm{x_{t+1} - x_t}^2.
\end{align*} %]]></script>
<p>Rescaling and substituting $V_x(y) = \frac{1}{2} \norm{x-y}^2$, and observing that $\nabla V_x(y) = y - x$, we obtain:</p>
<script type="math/tex; mode=display">% <![CDATA[
\tag{req-1}
\begin{align*}
V_{x_{t+1}}(u) & = V_{x_{t}}(u) - \langle \nabla V_{x_{t+1}}(x_{t}), x_t - u\rangle + V_{x_{t+1}}(x_{t}),
\end{align*} %]]></script>
<p>where the last choice $V_{x_{t+1}}(x_{t})$ used a non-deterministic guess as by symmetry of the $\ell_2$-norm also $V_{x_{t}}(x_{t+1})$ would have been feasible.</p>
<p>We will need another inequality if we aim to mimic the same proof in the constrained case. Recall that we needed (lookAheadIneq)</p>
<p>\[\langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t} - x^\esx \rangle - \frac{1}{2}\norm{x_t - x_{t+1}}^2+\frac{1}{2}\norm{\eta_t \partial f(x_t)}^2,\]</p>
<p>to relate the step $x_{t+1} - x_t$ that we take with $\partial f(x_t)$, assuming that $x_{t+1}$ was chosen appropriately. In the proof the term $\frac{1}{2}\norm{x_t - x_{t+1}}^2$ simply arose from the mechanics of the (standard) scalar product, which is inherently linked to the $\ell_2$-norm. Slightly jumping ahead (the later proof will make this requirement natural), we additionally require</p>
<script type="math/tex; mode=display">\tag{req-2}
\begin{align*}
V_x(y) \geq \frac{1}{2}\norm{x-y}^2,
\end{align*}</script>
<p>Moreover, the term $\frac{1}{2}\norm{\eta_t \partial f(x_t)}^2$ in (lookAheadIneq) is actually using the <em>dual norm</em>, which we did not have to pay attention to as the $\ell_2$-norm is self-dual. We will redo the full argument in the next sections with the correct distinctions for completeness. First, however, we will complete the definition of $V_x(y)$.</p>
<p>There is a natural class of functions that satisfy (req-1) and (req-2), so-called <em>Bregman divergences</em>, which are defined through <em>Distance Generating Functions (DGFs)</em>. Let us choose some norm $\norm{\cdot}$, which is not necessarily the $\ell_2$-norm.</p>
<p class="mathcol"><strong>Definition. (DGF and Bregman Divergence)</strong> Let $K \subseteq \RR^n$ be a closed convex set. Then $\phi: K \rightarrow \RR$ is called a <em>Distance Generating Function (DGF)</em> if $\phi$ is $1$-strongly convex with respect to $\norm{\cdot}$, i.e., for all $x \in K \setminus \partial K, y \in K$ we have $\phi(y) \geq \phi(x) + \langle \nabla \phi(x), y-x \rangle + \frac{1}{2}\norm{x-y}^2$. The <em>Bregman divergence (induced by $\phi$)</em> is defined as
\[V_x(y) \doteq \phi(y) - \langle \nabla \phi(x), y - x \rangle - \phi(x),\]
$x \in K \setminus \partial K, y \in K$.</p>
<p>Observe that the strong convexity requirement of the DGF is with respect to the chosen norm. This is important as it allows us to “fine-tune” our geometry. Before we establish some basic properties of Bregman divergences, here are two common examples:</p>
<p class="mathcol"><strong>Examples. (Bregman Divergences)</strong> <br />
(a) Let $\norm{x} \doteq \norm{x}_2$ be the $\ell_2$-norm and $\phi(x) \doteq \frac{1}{2} \norm{x}^2$. Clearly, $\phi(x)$ is $1$-strongly convex with respect to $\norm{\cdot}$ (for any $K$). The resulting Bregman divergence is $V_x(y) = \frac{1}{2}\norm{x-y}^2$, which is the choice used for (projected) subgradient descent above. <br />
(b) Let <script type="math/tex">\norm{x} \doteq \norm{x}_1</script> be the $\ell_1$-norm and <script type="math/tex">\phi(x) \doteq \sum_{i \in [n]} x_i \log x_i</script> be the (negative) entropy. Then $\phi(x)$ is $1$-strongly convex with respect to <script type="math/tex">\norm{\cdot}_1</script> for all <script type="math/tex">K \subseteq \Delta_n \doteq \setb{x \geq 0 \mid \sum_{i \in [n]}x_i = 1}</script>, where $\Delta_n$ is the <em>probability simplex</em>. The resulting Bregman divergence is <script type="math/tex">V_x(y) = \sum_{i \in [n]} y_i \log \frac{y_i}{x_i} = D(y \| x)</script>, which is the <em>Kullback-Leibler divergence</em> or <em>relative entropy</em>.</p>
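<p>The two examples can be reproduced with a few lines of code. The following is a minimal sketch that evaluates the definition of $V_x(y)$ literally for a given DGF and its gradient; the function names and the two test points are assumptions made for illustration only.</p>

```python
import math


def bregman(phi, grad_phi, x, y):
    """V_x(y) = phi(y) - <grad phi(x), y - x> - phi(x)."""
    g = grad_phi(x)
    lin = sum(gi * (yi - xi) for gi, xi, yi in zip(g, x, y))
    return phi(y) - lin - phi(x)


# Example (a): phi(x) = 1/2 ||x||_2^2
def sq_norm(x):
    return 0.5 * sum(v * v for v in x)


def sq_norm_grad(x):
    return list(x)


# Example (b): negative entropy (needs strictly positive coordinates)
def neg_entropy(x):
    return sum(v * math.log(v) for v in x)


def neg_entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]


def kl(y, x):
    """Relative entropy D(y || x)."""
    return sum(yi * math.log(yi / xi) for xi, yi in zip(x, y))


x = [0.2, 0.3, 0.5]  # two points in the interior of the simplex
y = [0.6, 0.1, 0.3]
v_euclid = bregman(sq_norm, sq_norm_grad, x, y)
v_entropy = bregman(neg_entropy, neg_entropy_grad, x, y)
print(v_euclid, v_entropy, kl(y, x))
```

<p>On the simplex the entropy-induced divergence coincides with $D(y \| x)$ because the linear terms $\sum_i x_i - \sum_i y_i$ cancel.</p>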
<p>We will now establish some basic properties for $V_x(y)$ and show that $V_x(y)$ satisfies the required properties:</p>
<p class="mathcol"><strong>Lemma. (Properties of the Bregman Divergence)</strong> Let $V_x(y)$ be a Bregman divergence defined via some DGF $\phi$. Then the following holds: <br />
(a) Point-separating: $V_x(x) = 0$ (and also $V_x(y) = 0 \Leftrightarrow x = y$ via (b))<br />
(b) Compatible with norm: $V_x(y) \geq \frac{1}{2} \norm{x-y}^2 \geq 0$<br />
(c) $\Delta$-Inequality: $\langle - \nabla V_x(y), y - u \rangle = V_x(u) - V_y(u) - V_x(y)$</p>
<p><em>Proof.</em> Property (a) follows directly from the definition and (b) follows from $\phi$ in the definition of $V_x(y)$ being $1$-strongly convex. Property (c) follows from straightforward expansion and computation:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\langle - \nabla V_x(y), y - u \rangle & = \langle\nabla\phi(x) -\nabla \phi(y) , y-u \rangle \\
& = (\phi(u) - \phi(x) - \langle \nabla \phi(x), u -x \rangle) \\
& - (\phi(u) - \phi(y) - \langle \nabla \phi(y), u - y \rangle) \\
& - (\phi(y) - \phi(x) - \langle \nabla \phi(x), y - x \rangle) \\
& = V_x(u) - V_y(u) - V_x(y).
\end{align*} %]]></script>
<script type="math/tex; mode=display">\qed</script>
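<p>The identity in (c) is easy to get wrong by a sign, so here is a small numerical sketch that checks the lemma for the negative entropy on three random points of the simplex. The sampling routine and the seed are arbitrary choices of mine.</p>

```python
import math
import random


def phi(x):
    """Negative entropy, used as DGF."""
    return sum(v * math.log(v) for v in x)


def grad_phi(x):
    return [math.log(v) + 1.0 for v in x]


def inner(a, b):
    return sum(s * t for s, t in zip(a, b))


def V(x, y):
    """Bregman divergence V_x(y) induced by phi."""
    diff = [yi - xi for xi, yi in zip(x, y)]
    return phi(y) - inner(grad_phi(x), diff) - phi(x)


def random_simplex_point(n):
    w = [random.uniform(0.1, 1.0) for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]


random.seed(1)
x, y, u = (random_simplex_point(4) for _ in range(3))

# The gradient of V_x(y) with respect to y is grad phi(y) - grad phi(x).
neg_grad_V = [gx - gy for gx, gy in zip(grad_phi(x), grad_phi(y))]
lhs = inner(neg_grad_V, [yi - ui for yi, ui in zip(y, u)])
rhs = V(x, u) - V(y, u) - V(x, y)
print(lhs, rhs)  # property (c): both sides agree up to rounding
```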
<h3 id="back-to-basics">Back to basics</h3>
<p>In a first step we will redo our basic argument from the beginning of the post with a Bregman divergence instead of the expansion of the $\ell_2$-norm. To this end let $K \subseteq \RR^n$ (possibly $K = \RR^n$) be a closed convex set. We consider a generic algorithm that produces iterates $x_1, \dots, x_t, \dots$. We will define the choice of the iterates later. Our starting point is the $\Delta$-inequality of the Bregman divergence with the choices $y \leftarrow x_{t+1}$, $x \leftarrow x_t$, and $u \in K$ arbitrary:</p>
<script type="math/tex; mode=display">\tag{basicBreg}
\langle - \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle = V_{x_t}(u) - V_{x_{t+1}}(u) - V_{x_t}(x_{t+1}).</script>
<p>We could now try the same strategy, summing up and telescoping out:</p>
<script type="math/tex; mode=display">\sum_{t = 0}^{T-1} \langle - \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle = V_{x_0}(u) - V_{x_{T}}(u) - \sum_{t = 0}^{T-1} V_{x_t}(x_{t+1}).</script>
<p>But how to continue? First observe that in contrast to the telescoping of the $\ell_2$-norm expansion we have a <em>negative</em> term on the right-hand side (this is technical and could have been done the same way for the $\ell_2$-norm) and the left-hand side, as of now, has no relation to the function $f$; in fact, we have not even defined our step yet. So let us try the obvious first-order guess, i.e., replacing the $\ell_2$-norm with the Bregman divergence.</p>
<p>To this end, we define
<script type="math/tex">\tag{IteratesMD} x_{t+1} \doteq \arg\min_{x \in K} \langle \eta_t \partial f(x_t), x \rangle + V_{x_t}(x).</script></p>
<p>Mimicking the approach we took for projected gradient descent, let us inspect the optimality condition of the system. For all $u \in K$ it holds:</p>
<script type="math/tex; mode=display">\tag{optConBreg}
\langle \eta_t \partial f(x_t),x_{t+1} - u \rangle + \langle \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle \leq 0</script>
<p>or equivalently we obtain:</p>
<script type="math/tex; mode=display">\tag{lookaheadMD}
\langle \eta_t \partial f(x_t),x_{t+1} - u \rangle \leq - \langle \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle</script>
<p>As before, we now have to fix the index mismatch on the left:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\langle \eta_t \partial f(x_t),x_{t+1} - u \rangle & \leq - \langle \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle \\
\Leftrightarrow \langle \eta_t \partial f(x_t),x_{t} - u \rangle + \langle \eta_t \partial f(x_t),x_{t+1} - x_t \rangle & \leq - \langle \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle
\end{align*} %]]></script>
<p>This we can then plug back into (basicBreg) to obtain the key inequality for Mirror Descent:</p>
<script type="math/tex; mode=display">% <![CDATA[
\tag{basicMD}
\begin{align*}
\langle \eta_t \partial f(x_t),x_{t} - u \rangle & \leq - \langle \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle - \langle \eta_t \partial f(x_t),x_{t+1} - x_t \rangle \\
& = V_{x_t}(u) - V_{x_{t+1}}(u) - V_{x_t}(x_{t+1}) - \langle \eta_t \partial f(x_t),x_{t+1} - x_t \rangle \\
& \leq V_{x_t}(u) - V_{x_{t+1}}(u) - \frac{1}{2} \norm{x_t - x_{t+1}}^2 - \langle \eta_t \partial f(x_t),x_{t+1} - x_t \rangle \\
& = V_{x_t}(u) - V_{x_{t+1}}(u) + \left (\langle \eta_t \partial f(x_t),x_t - x_{t+1} \rangle- \frac{1}{2} \norm{x_t - x_{t+1}}^2 \right) \\
& \leq V_{x_t}(u) - V_{x_{t+1}}(u) + \frac{\eta_t^2}{2}\norm{\partial f(x_t)}_\esx^2,
\end{align*} %]]></script>
<p>where the last inequality follows via (genBinomial). We can now simply sum up and telescope out to obtain the generic regret bound for Mirror Descent:</p>
<script type="math/tex; mode=display">\tag{regretBoundMD}
\begin{align*}
\sum_{t=0}^{T-1} \langle \eta_t \partial f(x_t),x_{t} - u \rangle \leq V_{x_0}(u) + \sum_{t=0}^{T-1} \frac{\eta_t^2}{2} \norm{\partial f(x_t)}_\esx^2,
\end{align*}</script>
<p>and further we can again use convexity, averaging of the iterates, and picking $\eta_t = \eta \doteq \sqrt{\frac{2M}{G^2T}}$ (by optimizing out) to arrive at the convergence rate of Mirror Descent:</p>
<script type="math/tex; mode=display">% <![CDATA[
\tag{convergenceMD}
\begin{align*}
f(\bar x) - f(x^\esx) & \leq \frac{1}{T} \sum_{t=0}^{T-1} \left (f(x_t) - f(x^\esx) \right) \\
& \leq \frac{1}{T} \sum_{t=0}^{T-1} \langle \partial f(x_t),x_{t} - x^\esx \rangle \leq \frac{M}{\eta T} + \frac{\eta G^2}{2} \\
& \leq \sqrt{\frac{2M G^2}{T}},
\end{align*} %]]></script>
<p>where $\norm{\partial f(x_t)}_\esx \leq G$ and <script type="math/tex">V_{x_0}(u) \leq M</script> for all $u \in K$.</p>
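<p>For completeness, here is a short sketch of the "optimizing out" step behind the choice of $\eta$; it only uses the constants $M$, $G$, and $T$ from above. Write $g(\eta) \doteq \frac{M}{\eta T} + \frac{\eta G^2}{2}$ for the bound with constant step size $\eta$. Then</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
g'(\eta) = -\frac{M}{\eta^2 T} + \frac{G^2}{2} = 0 \quad & \Leftrightarrow \quad \eta = \sqrt{\frac{2M}{G^2 T}}, \\
g\left(\sqrt{\frac{2M}{G^2 T}}\right) & = \sqrt{\frac{M G^2}{2T}} + \sqrt{\frac{M G^2}{2T}} = \sqrt{\frac{2 M G^2}{T}},
\end{align*} %]]></script>
<p>so the two terms are balanced at the minimizer and the last step in (convergenceMD) holds with equality for this choice of $\eta$.</p>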
<p>For completeness, the Mirror Descent algorithm is specified below:</p>
<p class="mathcol"><strong>Mirror Descent Algorithm.</strong> <br />
<em>Input:</em> Convex function $f$ with first-order oracle access, step sizes $\eta_t > 0$, and some initial point $x_0 \in K$<br />
<em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br />
For $t = 0, \dots, T-1$ do: <br />
$\quad x_{t+1} \leftarrow \arg\min_{x \in K} \langle \eta_t \partial f(x_t), x \rangle + V_{x_t}(x)$</p>
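<p>A minimal sketch of this loop in code might look as follows. The solution of the $\arg\min$ is passed in as a function <code>mirror_step</code>; as an illustration I use the Euclidean divergence over a box, where the $\arg\min$ is simply clipping of $x_t - \eta_t \partial f(x_t)$. The test function, the box, and all names are assumptions for the example, not part of the algorithm.</p>

```python
def mirror_descent(subgrad, mirror_step, x0, etas):
    """Generic loop: x_{t+1} = mirror_step(x_t, eta_t * g_t), g_t a subgradient at x_t."""
    xs = [list(x0)]
    for eta in etas:
        x = xs[-1]
        g = subgrad(x)
        xs.append(mirror_step(x, [eta * gi for gi in g]))
    return xs


def euclidean_box_step(x, scaled_g, lo=-1.0, hi=1.0):
    # For V_x(y) = 1/2 ||x - y||_2^2 and K = [lo, hi]^n the argmin is the
    # projection of x - eta * g onto the box, i.e., coordinate-wise clipping.
    return [min(hi, max(lo, xi - gi)) for xi, gi in zip(x, scaled_g)]


# Illustrative non-smooth objective f(x) = ||x - c||_1 with minimizer c in K.
c = [0.5, -0.25]


def f(x):
    return sum(abs(xi - ci) for xi, ci in zip(x, c))


def subgrad(x):
    return [(xi > ci) - (xi < ci) for xi, ci in zip(x, c)]


T = 400
M = 1.0          # V_{x_0}(u) = 1/2 ||u||_2^2 <= 1 on the box for x_0 = 0
G = 2 ** 0.5     # l2-norm of any subgradient is at most sqrt(2)
eta = (2 * M / (G * G * T)) ** 0.5
xs = mirror_descent(subgrad, euclidean_box_step, [0.0, 0.0], [eta] * T)

# Average of x_0, ..., x_{T-1} as in (convergenceMD); f(x^*) = 0 here.
x_bar = [sum(x[i] for x in xs[:-1]) / T for i in range(2)]
gap = f(x_bar)
bound = (2 * M * G * G / T) ** 0.5
print(gap, bound)
```

<p>Swapping in a different <code>mirror_step</code> changes the geometry without touching the loop.</p>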
<h3 id="online-mirror-descent-and-multiplicative-weights">Online Mirror Descent and Multiplicative Weights</h3>
<p>Alternatively, starting from (regretBoundMD) we can yet again observe that one could use a different function $f_t$ in each iteration $t$, which leads us to <em>Online Mirror Descent</em> as we will briefly discuss in this section. From (regretBoundMD) we have with $\eta_t = \eta$ chosen below:</p>
<script type="math/tex; mode=display">\begin{align*}
\sum_{t=0}^{T-1} \langle \eta \partial f_t(x_t),x_{t} - u \rangle \leq V_{x_0}(u) + \frac{\eta^2}{2} \sum_{t=0}^{T-1} \norm{\partial f_t(x_t)}_\esx^2.
\end{align*}</script>
<p>Rearranging with $\norm{\partial f_t(x_t)}_\esx \leq G$ and <script type="math/tex">V_{x_0}(u) \leq M</script> for all $u \in K$, gives</p>
<script type="math/tex; mode=display">\begin{align*}
\sum_{t=0}^{T-1} \langle \partial f_t(x_t),x_{t} - u \rangle \leq \frac{M}{\eta} + \frac{\eta T}{2} G^2.
\end{align*}</script>
<p>With the (optimal) choice $\eta = \sqrt{\frac{2M}{G^2T}}$ and using the subgradient property, we obtain the online learning regret bound for Mirror Descent:</p>
<script type="math/tex; mode=display">\tag{regretMD}
\begin{align*}
\sum_{t = 0}^{T-1} f_t(x_t) - \min_{x \in K} \sum_{t = 0}^{T-1} f_t(x) \leq \max_{x \in K} \sum_{t=0}^{T-1} \langle \partial f_t(x_t),x_{t} - x \rangle \leq \sqrt{2M G^2T},
\end{align*}</script>
<p>and, paying another factor $\sqrt{2}$, we can make this bound <em>anytime</em>.</p>
<p>We will now consider the important special case of $K = \Delta_n$ being the probability simplex and <script type="math/tex">V_x(y) = \sum_{i \in [n]} y_i \log \frac{y_i}{x_i} = D(y \| x)</script> being the <em>relative entropy</em>, which will lead to (an alternative proof of) the <em>Multiplicative Weight Update (MWU)</em> algorithm; this argument is folklore and has been widely known by experts, see e.g., [BT, AO]. In particular, it generalizes immediately to the matrix case in contrast to other proofs of the MWU algorithm. We refer the interested reader to [AHK] for an overview of the many applications of the MWU algorithm or equivalently Mirror Descent over the probability simplex with relative entropy as Bregman divergence.</p>
<p>Via information-theoretic inequalities or just by-hand calculations (see [BT]), it can be easily seen that <script type="math/tex">D(x\| y)</script> is $1$-strongly convex with respect to <script type="math/tex">\norm{.}_1</script>, whose dual norm is <script type="math/tex">\norm{.}_\infty</script>. Moreover, <script type="math/tex">D(x \| x_0) \leq \log n</script> for all $x \in \Delta_n$, with $x_0 = (1/n, \dots, 1/n)$ being the uniform distribution.</p>
<p>Recall that the iterates are defined via (IteratesMD), which in our case becomes</p>
<script type="math/tex; mode=display">\tag{IteratesMWU} x_{t+1} \doteq \arg\min_{x \in K} \langle \eta_t \partial f_t(x_t), x \rangle + D(x \| x_t),</script>
<p>and making this explicit amounts to updates of the form (to be read coordinate-wise):</p>
<script type="math/tex; mode=display">\tag{IteratesMWUExp}
x_{t+1} \leftarrow x_t \cdot \frac{e^{-\eta_t \partial f_t(x_t)}}{K_t},</script>
<p>where $K_t$ is chosen such that $\norm{x_{t+1}}_1 = 1$, i.e., $K_t = \norm{x_t \cdot e^{-\eta_t \partial f_t(x_t)}}_1$, which is precisely the Multiplicative Weight Update algorithm.</p>
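<p>In code, (IteratesMWUExp) is a one-liner plus a normalization. The sketch below runs it on a fixed linear loss $f_t(x) = \langle \ell, x \rangle$, so that $\partial f_t(x_t) = \ell$; the loss vector, $n$, $T$, and $G = 1$ are made-up values for illustration.</p>

```python
import math


def mwu_step(x, g, eta):
    """One multiplicative weights update: x_i <- x_i * exp(-eta * g_i) / K_t."""
    w = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]
    k = sum(w)  # normalizer K_t, so that the new iterate sums to 1
    return [wi / k for wi in w]


n, T = 4, 200
G = 1.0                                  # bound on the l_infinity-norm of the losses
eta = math.sqrt(2 * math.log(n) / (G * G * T))
loss = [0.9, 0.5, 0.1, 0.7]              # illustrative fixed loss vector

x = [1.0 / n] * n                        # uniform start, so M = log n
total = 0.0
for _ in range(T):
    total += sum(li * xi for li, xi in zip(loss, x))  # f_t(x_t)
    x = mwu_step(x, loss, eta)

regret = total - T * min(loss)           # best fixed point is a vertex of the simplex
bound = math.sqrt(2 * math.log(n) * G * G * T)
print(regret, bound)
```

<p>The weight vector concentrates on the coordinate with the smallest loss, and the accumulated regret stays below the bound derived next.</p>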
<p>With the bounds $M = \log n$ and $\norm{\partial f_t(x_t)}_\infty \leq G$ for all $t = 0, \dots, T-1$, the regret bound in this case becomes:</p>
<script type="math/tex; mode=display">\tag{regretMWU}
\begin{align*}
\sum_{t = 0}^{T-1} f_t(x_t) - \min_{x \in K} \sum_{t = 0}^{T-1} f_t(x) \leq \max_{x \in K} \sum_{t=0}^{T-1} \langle \partial f_t(x_t),x_{t} - x \rangle \leq \sqrt{2 \log(n) G^2 T},
\end{align*}</script>
<p>for the variant with known $T$, and we can pay another factor $\sqrt{2}$ to make this bound an anytime guarantee.</p>
<h3 id="mirror-descent-vs-gradient-descent">Mirror Descent vs. Gradient Descent</h3>
<p>One of the key questions of course is whether the improvement in convergence rate through fine-tuning against the geometry materializes in actual computations or whether it is just an improvement on paper. Following [AO], some comments are helpful: if $V_x(y) \doteq \frac{1}{2}\norm{x-y}^2$, then Mirror Descent and (Sub-)Gradient Descent produce identical iterates. If on the other hand we, e.g., pick $K = \Delta_n$ the probability simplex in $\RR^n$, then we can pick <script type="math/tex">V_x(y) \doteq D(y\|x)</script> to be the relative entropy, and the iterates from Gradient Descent with the $\ell_2$-norm and Mirror Descent with relative entropy will be very different; and so will be the convergence behavior. Mirror Descent provides a guarantee of</p>
<script type="math/tex; mode=display">\begin{align*}
f(\bar x) - f(x^\esx) \leq \frac{\sqrt{2 \log(n) G^2}}{\sqrt{T}},
\end{align*}</script>
<p>where $\bar x = \frac{1}{T} \sum_{t=0}^{T-1} x_t$ and $\norm{\partial f(x_t)}_\infty \leq G$ for all $t = 0, \dots, T-1$ in this case.</p>
<p>Below we compare Mirror Descent and Gradient Descent over $K = \Delta_n$ with $n = 10000$ (left) and across different values of $n$ (right) for some randomly generated functions; both plots are log-log plots. As can be seen, Mirror Descent can scale much better than Gradient Descent by choosing a Bregman divergence that is optimized for the geometry. For $K = \Delta_n$, the dependence on $n$ for Mirror Descent (with relative entropy and $\ell_1$-norm) is only logarithmic, whereas for Gradient Descent it is linear in the dimension $n$. This logarithmic dependency makes Mirror Descent well suited for large-scale applications in this case.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/md/MD-arranged.png" alt="MD vs. GD" /></p>
<h2 id="extensions">Extensions</h2>
<p>Finally I will talk about some natural extensions. While the full arguments will be beyond the scope, the interested reader might consult [Z2] for proofs. Also, there are various natural extensions in the online learning case, e.g., where we compare to slowly changing strategies; see [Z] for details.</p>
<h3 id="stochastic-versions">Stochastic versions</h3>
<p>It is relatively easy to see that the above bounds can be transferred to the stochastic setting, where we have an unbiased (sub-)gradient estimator only. We then obtain basically the same convergence rates and regret bounds <em>in expectation</em>.</p>
<p>More specifically (see [LNS] for details), for unbiased stochastic subgradients $\partial f(x,\xi)$, where $\xi$ is a random variable, whose second moment is bounded, i.e.,</p>
<script type="math/tex; mode=display">\mathbb E[\norm{\partial f(x,\xi)}_\esx^2] \leq M_\esx^2,</script>
<p>one can achieve (using Mirror Descent) a guarantee of:</p>
<script type="math/tex; mode=display">\mathbb E \left[f(x_T) - f(x^\esx)\right] \leq O\left(\frac{\sqrt{2} D M_\esx}{\sqrt{\alpha}\sqrt{T}}\right),</script>
<p>where $D$ is the “diameter” of the feasible region w.r.t. the distance generating function used in the Bregman divergence and $\alpha$ is the strong convexity parameter of that distance generating function (so $\alpha = 1$ in the normalization used in this post). Note that the iterates $x_t$ are random now depending on the realizations of the $\xi$ when sampling (sub-)gradients. Moreover, via Markov’s inequality one can prove probabilistic statements of the form:</p>
<script type="math/tex; mode=display">\mathbb P\left[f(x_T) - f(x^\esx) > \varepsilon \right]\leq O(1) \frac{\sqrt{2} D M_\esx}{\varepsilon \sqrt{\alpha}\sqrt{T}}.</script>
<h3 id="smooth-case">Smooth case</h3>
<p>When $f$ is smooth we can modify (basicMD) to obtain the improved $O(1/t)$ rate. Recall that $f$ is $L$-smooth with respect to $\norm{\cdot}$ if:</p>
<script type="math/tex; mode=display">\tag{smooth}
f(y) - f(x) \leq \langle \nabla f(x), y-x \rangle + \frac{L}{2} \norm{x-y}^2,</script>
<p>for all $x,y \in \mathbb R^n$. Choosing $x \leftarrow x_t$ and $y \leftarrow x_{t+1}$ we obtain:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\langle \nabla f(x_t), x_{t+1} - x^\esx \rangle & = \langle \nabla f(x_t), x_{t} - x^\esx \rangle + \langle \nabla f(x_t), x_{t+1} - x_t \rangle \\
& \geq f(x_t) - f(x^\esx) + f(x_{t+1}) - f(x_t) - \frac{L}{2} \norm{x_{t+1}-x_t}^2 \\
& = f(x_{t+1}) - f(x^\esx) - \frac{L}{2} \norm{x_{t+1}-x_t}^2
\end{align*} %]]></script>
<p>We now modify (basicMD) with $u = x^\esx$ as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\tag{basicMDSmooth}
\begin{align*}
\langle \eta_t \nabla f(x_t),x_{t+1} - x^\esx \rangle & \leq - \langle \nabla V_{x_t}(x_{t+1}), x_{t+1} - x^\esx \rangle \\
& = V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) - V_{x_t}(x_{t+1}).
\end{align*} %]]></script>
<p>Chaining in the inequality we obtained from smoothness:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\eta_t (f(x_{t+1}) - f(x^\esx)) & \leq \langle \eta_t \nabla f(x_t),x_{t+1} - x^\esx \rangle + \eta_t \frac{L}{2} \norm{x_{t+1}-x_t}^2 \\
& \leq - \langle \nabla V_{x_t}(x_{t+1}), x_{t+1} - x^\esx \rangle + \eta_t \frac{L}{2} \norm{x_{t+1}-x_t}^2 \\
& = V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) - V_{x_t}(x_{t+1}) + \eta_t \frac{L}{2} \norm{x_{t+1}-x_t}^2 \\
& \leq V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) - \frac{1}{2} \norm{x_{t+1}-x_t}^2 + \eta_t \frac{L}{2} \norm{x_{t+1}-x_t}^2,
\end{align*} %]]></script>
<p>where the last inequality used the compatibility of the Bregman divergence with the norm. Picking $\eta_t = \frac{1}{L}$ results in</p>
<script type="math/tex; mode=display">\frac{1}{L} (f(x_{t+1}) - f(x^\esx)) \leq V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx),</script>
<p>which we telescope out to</p>
<script type="math/tex; mode=display">\sum_{t = 0}^{T-1} (f(x_{t+1}) - f(x^\esx)) \leq L V_{x_0}(x^\esx),</script>
<p>and by convexity the average $\bar x = \frac{1}{T} \sum_{t = 1}^{T} x_t$ satisfies:</p>
<script type="math/tex; mode=display">f(\bar x) - f(x^\esx) \leq \frac{L V_{x_0}(x^\esx)}{T},</script>
<p>which is the expected rate for the smooth case. Note that this improvement does not translate to the online case.</p>
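<p>To see the telescoped inequality in action, here is a minimal sketch for the Euclidean case $V_x(y) = \frac{1}{2}\norm{x-y}_2^2$ with $K = \RR^n$, where the Mirror Descent step with $\eta_t = \frac{1}{L}$ is a plain gradient step. The diagonal quadratic and its parameters are invented for the example.</p>

```python
# f(x) = 1/2 * sum_i d_i (x_i - c_i)^2 is L-smooth w.r.t. the l2-norm with L = max(d).
d = [4.0, 1.0, 0.25]
c = [1.0, -2.0, 3.0]   # the minimizer x^*, with f(x^*) = 0
L = max(d)


def f(x):
    return 0.5 * sum(di * (xi - ci) ** 2 for di, xi, ci in zip(d, x, c))


def grad(x):
    return [di * (xi - ci) for di, xi, ci in zip(d, x, c)]


x0 = [0.0, 0.0, 0.0]
V0 = 0.5 * sum((ci - xi) ** 2 for xi, ci in zip(x0, c))  # V_{x_0}(x^*)

T = 50
x = list(x0)
gaps = []                      # f(x_{t+1}) - f(x^*) for t = 0, ..., T-1
for _ in range(T):
    x = [xi - gi / L for xi, gi in zip(x, grad(x))]      # step with eta_t = 1/L
    gaps.append(f(x))

print(sum(gaps), L * V0)       # the sum of primal gaps is at most L * V_{x_0}(x^*)
```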
<h3 id="strongly-convex-case">Strongly convex case</h3>
<p>Finally we will show that if $f$ is $\mu$-strongly convex <em>with respect to $V_x(y)$</em> (not necessarily smooth though), then we can also obtain improved rates. This improvement translates also to the online learning case, i.e., we get the corresponding improvement in regret. Recall that a function is $\mu$-strongly convex with respect to $V_x(y)$ if:</p>
<script type="math/tex; mode=display">f(y) - f(x) \geq \langle \nabla f(x),y-x \rangle + \mu V_x(y),</script>
<p>holds for all $x,y \in \mathbb R^n$. Choosing $x \leftarrow x_t$ and $y \leftarrow x^\esx$, we obtain:</p>
<script type="math/tex; mode=display">\langle \nabla f(x_t), x_t - x^\esx \rangle \geq f(x_t) - f(x^\esx) + \mu V_{x_t}(x^\esx).</script>
<p>Again we start with (basicMD), which we will modify:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\langle \eta_t \nabla f(x_t),x_{t} - x^\esx \rangle & \leq V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) + \frac{\eta_t^2}{2}\norm{\partial f(x_t)}_\esx^2.
\end{align*} %]]></script>
<p>We now plug in the bound from strong convexity to obtain:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\eta_t (f(x_t) - f(x^\esx) + \mu V_{x_t}(x^\esx)) & \leq
\langle \eta_t \nabla f(x_t),x_{t} - x^\esx \rangle \\
& \leq V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) + \frac{\eta_t^2}{2}\norm{\partial f(x_t)}_\esx^2,
\end{align*} %]]></script>
<p>which can be simplified to</p>
<script type="math/tex; mode=display">% <![CDATA[
\tag{basicMDSC}
\begin{align*}
\eta_t (f(x_t) - f(x^\esx))
& \leq V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) + \frac{\eta_t^2}{2}\norm{\partial f(x_t)}_\esx^2 - \eta_t \mu V_{x_t}(x^\esx) \\
& \leq \left(1- \eta_t \mu\right) V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) + \frac{\eta_t^2}{2}\norm{\partial f(x_t)}_\esx^2.
\end{align*} %]]></script>
<p>Choosing $\eta_t = \frac{1}{\mu t}$ now we obtain:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\frac{1}{\mu t} (f(x_t) - f(x^\esx))
& \leq \left(1- \frac{1}{t}\right) V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) + \frac{1}{2\mu^2t^2}\norm{\partial f(x_t)}_\esx^2 \\
\Leftrightarrow \frac{1}{\mu} (f(x_t) - f(x^\esx))
& \leq \left(t- 1\right) V_{x_t}(x^\esx) - t V_{x_{t+1}}(x^\esx) + \frac{1}{2\mu^2t}\norm{\partial f(x_t)}_\esx^2,
\end{align*} %]]></script>
<p>which we can finally sum up (starting at $t=1$), multiply by $\mu$, and telescope out to arrive at:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\sum_{t = 1}^{T} (f(x_t) - f(x^\esx))
& \leq - T \mu V_{x_{T+1}}(x^\esx) + \frac{G^2}{2\mu} \sum_{t = 1}^{T} \frac{1}{t} \leq \frac{G^2 (1 + \log T)}{2\mu},
\end{align*} %]]></script>
<p>where we used $\sum_{t=1}^{T} \frac{1}{t} \leq 1 + \log T$; with the usual averaging and using convexity we obtain:</p>
<script type="math/tex; mode=display">f(\bar x) - f(x^\esx) \leq \frac{G^2 (1 + \log T)}{2\mu T}</script>
<p>for the convergence rate and</p>
<script type="math/tex; mode=display">\sum_{t = 1}^{T} f_t(x_t) - \min_x \sum_{t = 1}^{T} f_t(x) \leq \frac{G^2 }{2\mu} (1 + \log T),</script>
<p>for the regret; note that this bound is already anytime. In order to obtain the regret bound, simply replace $x^\esx$ by an arbitrary $u$ and $f(x_t)$ by $f_t(x_t)$. Note however, that this time the argument is directly on the primal difference $f_t(x_t) - f_t(u)$, rather than the dual gaps, i.e., after plugging-in the strong convexity inequality we start from:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\eta_t (f_t(x_t) - f_t(u) + \mu V_{x_t}(u)) & \leq
\langle \eta_t \nabla f_t(x_t), x_{t} - u \rangle \\
& \leq V_{x_t}(u) - V_{x_{t+1}}(u) + \frac{\eta_t^2}{2}\norm{\partial f_t(x_t)}_\esx^2,
\end{align*} %]]></script>
<p>and continue the same way.</p>
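<p>Here is a minimal one-dimensional sketch of the strongly convex case with step size $\eta_t = \frac{1}{\mu t}$ and the Euclidean divergence, so that the Mirror Descent step is a projected subgradient step. The function, the interval $K = [-2, 2]$, and the starting point are my own illustrative choices; the check uses the harmonic-sum estimate $\sum_{t=1}^{T} \frac{1}{t} \leq 1 + \log T$.</p>

```python
import math

mu = 2.0


def f(x):
    # mu-strongly convex but non-smooth (kink at x = 1)
    return 0.5 * mu * x * x + abs(x - 1.0)


def subgrad(x):
    return mu * x + ((x > 1.0) - (x < 1.0))


lo, hi = -2.0, 2.0
G = mu * hi + 1.0            # |subgradient| <= 5 on K = [-2, 2]
x_star, f_star = 0.5, 0.75   # minimizer of x^2 - x + 1 on the branch x < 1

T = 1000
x = hi                       # starting point x_1
total_gap = 0.0
for t in range(1, T + 1):
    total_gap += f(x) - f_star
    eta = 1.0 / (mu * t)
    x = min(hi, max(lo, x - eta * subgrad(x)))   # projected (Euclidean mirror) step

bound = G * G * (1.0 + math.log(T)) / (2.0 * mu)
print(total_gap, bound, x)
```

<p>In this toy instance the accumulated gap stays far below the logarithmic bound; the bound is a worst-case statement.</p>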
<h3 id="references">References</h3>
<p>[NY] Nemirovsky, A. S., & Yudin, D. B. (1983). Problem complexity and method efficiency in optimization.</p>
<p>[BT] Beck, A., & Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3), 167-175. <a href="https://web.iem.technion.ac.il/images/user-files/becka/papers/3.pdf">pdf</a></p>
<p>[Z] Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03) (pp. 928-936). <a href="http://www.aaai.org/Papers/ICML/2003/ICML03-120.pdf">pdf</a></p>
<p>[AO] Allen-Zhu, Z., & Orecchia, L. (2014). Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537. <a href="https://arxiv.org/abs/1407.1537">pdf</a></p>
<p>[AHK] Arora, S., Hazan, E., & Kale, S. (2012). The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1), 121-164. <a href="http://www.theoryofcomputing.org/articles/v008a006/v008a006.pdf">pdf</a></p>
<p>[Z2] Zhang, X. Bregman Divergence and Mirror Descent. <a href="http://users.cecs.anu.edu.au/~xzhang/teaching/bregman.pdf">pdf</a></p>
<p>[LNS] Lan, G., Nemirovski, A., & Shapiro, A. (2012). Validation analysis of mirror descent stochastic approximation method. Mathematical programming, 134(2), 425-458. <a href="https://www2.isye.gatech.edu/~nemirovs/MP_Valid_2011.pdf">pdf</a></p>
<p>[MS] Mcmahan, B., & Streeter, M. (2012). No-regret algorithms for unconstrained online convex optimization. In Advances in neural information processing systems (pp. 2402-2410). <a href="https://papers.nips.cc/paper/4709-no-regret-algorithms-for-unconstrained-online-convex-optimization.pdf">pdf</a></p>
<p><br /></p>
<h4 id="changelog">Changelog</h4>
<p>03/02/2019: Fixed several typos and added clarifications as pointed out by Matthieu Bloch.</p>
<p>03/04/2019: Fixed several typos and a norm/divergence mismatch in the strongly convex case as pointed out by Cyrille Combettes.</p>
<p>05/15/2019: Added pointer to stochastic case and summary of the results in this case.</p>
<p>08/25/2019: Added brief discussion about no-regret algorithms in the unconstrained online learning setting and a pointer to [MS] as pointed out by Francesco Orabona.</p>
<p>Sebastian Pokutta</p>
<h1>Mixing Frank-Wolfe and Gradient Descent</h1>
<p>2019-02-18, <a href="http://www.pokutta.com/blog/research/2019/02/18/bcg-abstract">http://www.pokutta.com/blog/research/2019/02/18/bcg-abstract</a></p>
<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/1805.07311">Blended Conditional Gradients</a> with <a href="https://users.renyi.hu/~braung/">Gábor Braun</a>, <a href="https://www.linkedin.com/in/dan-tu/">Dan Tu</a>, and <a href="http://pages.cs.wisc.edu/~swright/">Stephen Wright</a>, showing how mixing Frank-Wolfe and Gradient Descent gives a new, very fast, projection-free algorithm for constrained smooth convex minimization.</em>
<!--more--></p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>Frank-Wolfe methods [FW] (also called conditional gradient methods [CG]) have been very successful in solving <em>constrained smooth convex minimization</em> problems of the form:</p>
<script type="math/tex; mode=display">\min_{x \in P} f(x),</script>
<p>where $P$ is some compact and convex feasible region; you might want to think of, e.g., $P$ being a polytope, which is one of the most common cases. We assume so-called <em>first-order access</em> to the objective function $f$, i.e., we have an oracle that returns function evaluation $f(x)$ and gradient information $\nabla f(x)$ for a provided point $x \in P$. Moreover, we assume that we have access to the feasible region $P$ by means of a so-called <em>linear optimization oracle</em>, which upon being presented with a linear objective $c \in \RR^n$ returns $\arg\min_{x \in P} \langle c, x \rangle$. The basic Frank-Wolfe algorithm looks like this:</p>
<p class="mathcol"><strong>Frank-Wolfe Algorithm [FW]</strong> <br />
<em>Input:</em> Smooth convex function $f$ with first-order oracle access, feasible region $P$ with linear optimization oracle access, initial point (usually a vertex) $x_0 \in P$. <br />
<em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br />
For $t = 0, \dots, T-1$ do: <br />
$\quad v_t \leftarrow \arg\min_{x \in P} \langle \nabla f(x_{t}), x \rangle$ <br />
$\quad x_{t+1} \leftarrow (1-\gamma_t) x_t + \gamma_t v_t$</p>
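<p>As an illustration, here is a minimal sketch of the basic Frank-Wolfe loop over the probability simplex, where the linear optimization oracle just picks the vertex with the smallest objective coefficient. The step size $\gamma_t = \frac{2}{t+2}$ is the standard open-loop choice and is an assumption on my part, as the algorithm above leaves $\gamma_t$ unspecified; the quadratic objective is made up for the example.</p>

```python
def lmo_simplex(c):
    """Linear optimization oracle for the simplex: a vertex e_i minimizing <c, x>."""
    i = min(range(len(c)), key=c.__getitem__)
    return [1.0 if j == i else 0.0 for j in range(len(c))]


def frank_wolfe(grad, lmo, x0, T):
    x = list(x0)
    for t in range(T):
        v = lmo(grad(x))          # Frank-Wolfe vertex v_t
        gamma = 2.0 / (t + 2)     # assumed open-loop step size
        x = [(1 - gamma) * xi + gamma * vi for xi, vi in zip(x, v)]
    return x


# Illustrative objective f(x) = 1/2 ||x - b||_2^2 with b inside the simplex.
b = [0.2, 0.3, 0.5]


def f(x):
    return 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))


def grad(x):
    return [xi - bi for xi, bi in zip(x, b)]


x_start = [1.0, 0.0, 0.0]         # a vertex of the simplex
x_T = frank_wolfe(grad, lmo_simplex, x_start, 200)
print(x_T, f(x_T))
```

<p>Note that no projection is ever computed: every iterate is a convex combination of vertices and hence feasible by construction.</p>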
<p>The Frank-Wolfe algorithm has a couple of important advantages:</p>
<ol>
<li>It is very easy to implement</li>
<li>It does not require projections (as projected gradient descent does)</li>
<li>It maintains iterates as a reasonably sparse convex combination of vertices</li>
</ol>
<p>Generally, one can expect an $O(1/t)$ convergence for the general convex smooth case and linear convergence for strongly convex functions with appropriate modifications of the Frank-Wolfe algorithm. The interested reader might check out <a href="/blog/research/2018/10/05/cheatsheet-fw.html">Cheat Sheet: Frank-Wolfe and Conditional Gradients</a> and <a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">Cheat Sheet: Linear convergence for Conditional Gradients</a> for an extensive overview.</p>
<p>In the context of Frank-Wolfe methods a key assumption is that linear optimization is <em>cheap</em>. Compared to the projections one would have to perform, say, for projected gradient descent, this is almost always true (except for very simple feasible regions where projection is trivial). As such, traditionally one would account for the linear optimization oracle call with an $O(1)$ cost and disregard it in the analysis. However, if the feasible region is complex (e.g., arising from an integer program or just being a really large linear program), this assumption is not warranted anymore and one might ask a few natural questions:</p>
<ol>
<li>Do we really have to call the (expensive) linear programming oracle in each iteration?</li>
<li>Do we really need to compute (approximately) optimal solutions to the LP or does something completely different suffice?</li>
<li>More generally, can we reuse information?</li>
</ol>
<p>It turns out that one can replace the linear programming oracle by what we call a <em>weak separation oracle</em>; see [BPZ] for more details. I will have a dedicated post about <em>lazification</em> (as we dubbed this technique), so without going into full detail: the weak separation oracle basically wraps around the actual linear programming oracle. Before calling the linear programming oracle, the weak separation oracle can answer by checking previous answers to oracle calls (caching). Moreover, one does not have to solve the LPs to (approximate) optimality; rather, it suffices to check for a certain minimal improvement, which is compatible with the to-be-achieved convergence rate. In particular, we do not need any optimality proofs. One can then show that one can <em>maintain</em> the same convergence rates using the weak separation oracle as for the respective Frank-Wolfe variant utilizing the linear programming oracle, while drastically reducing the number of LP oracle calls.</p>
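<p>To make the caching idea concrete, here is a rough sketch of such a wrapper. It is not the implementation from [BPZ]; the interface (an improvement target <code>phi</code> and an accuracy parameter <code>K</code> with $K \geq 1$), the names, and the simplex oracle are assumptions chosen for illustration. The oracle returns a point $y$ with $\langle c, x - y \rangle > \Phi / K$ if it finds one, first in the cache and only then via the LP oracle, and otherwise reports that no point improves by more than $\Phi$.</p>

```python
def dot(a, b):
    return sum(s * t for s, t in zip(a, b))


def make_weak_separation_oracle(lp_oracle):
    cache = []                    # points returned by earlier LP oracle calls
    stats = {"lp_calls": 0}

    def oracle(c, x, phi, K=1.0):
        """Return y with <c, x - y> > phi / K, or None if no such point exists."""
        threshold = phi / K
        cx = dot(c, x)
        for y in cache:           # 1) try to answer from the cache
            if cx - dot(c, y) > threshold:
                return y
        stats["lp_calls"] += 1    # 2) fall back to the (expensive) LP oracle
        y = lp_oracle(c)
        cache.append(y)
        if cx - dot(c, y) > threshold:
            return y
        # y minimizes <c, .>, so <c, x - z> <= phi / K <= phi for all feasible z
        return None

    return oracle, stats


def lmo_simplex(c):
    """Stand-in LP oracle: best vertex of the probability simplex."""
    i = min(range(len(c)), key=c.__getitem__)
    return [1.0 if j == i else 0.0 for j in range(len(c))]


oracle, stats = make_weak_separation_oracle(lmo_simplex)
x = [0.25, 0.25, 0.25, 0.25]
first = oracle([1.0, 0.0, 2.0, 3.0], x, phi=0.5)    # needs an LP call
second = oracle([2.0, 1.0, 0.5, 3.0], x, phi=0.5)   # answered from the cache
third = oracle([1.0, 1.0, 1.0, 1.0], x, phi=0.5)    # no improving point exists
print(first, second, third, stats["lp_calls"])
```

<p>In the second call the cached vertex is not the LP optimum for the new objective, but it improves by enough, which is all the oracle promises.</p>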
<h2 id="our-results">Our results</h2>
<p>In practice, while lazification can provide huge speedups for Frank-Wolfe type methods when the LPs are hard to solve, this technique loses its advantage when the LPs are simple. The reason for this is that at the end of the day, there is a trade-off between the quality of the computed directions in terms of providing progress vs. how hard they are to compute: the weak-separation oracle computes potentially worse approximations but does so very fast.</p>
<p>However, what we show in our <em>Blended Conditional Gradients</em> paper is that one can:</p>
<ol>
<li>Cut out a <em>huge fraction of LP oracle calls</em> (sometimes less than 1% of the iterations require an actual LP oracle call)</li>
<li>While working with <em>actual gradients</em> as descent direction providing much better progress than traditional Frank-Wolfe directions and</li>
<li>Staying fully <em>projection-free</em>.</li>
</ol>
<p>This is achieved by <em>blending together</em> conditional gradient descent steps and gradient steps in a special way. The resulting algorithm has a per-iteration cost that is comparable to gradient descent in most of the steps; when the LP oracle is called, the per-iteration cost is comparable to that of the standard Frank-Wolfe Algorithm. In terms of progress per iteration, however, our algorithm, which we call <em>Blended Conditional Gradients (BCG)</em> (see [BPTW] for details), typically outperforms Away-Step and Pairwise Conditional Gradients (the current state-of-the-art methods). We are often even faster in wall-clock performance as we eschew most LP oracle calls. Naturally, we maintain worst-case convergence rates that match those of Away-Step Frank-Wolfe and Pairwise Conditional Gradients; the known lower bounds only assume first-order oracle access and LP oracle access and are unconditional.</p>
<p>The algorithm’s (worst-case) convergence rates are identical to the ones for Away-Step Frank-Wolfe and Pairwise Conditional Gradients: $O(1/\varepsilon)$ convergence for general smooth and convex functions and $O(\log 1/\varepsilon)$ convergence for smooth and strongly convex functions (see [LJ] and [BPTW] for details). Rather than stating them for the various cases, I present some computational results, as they highlight the typical behavior. The following graphics provide a pretty representative overview of the computational performance. Everything is in log-scale and we ran each algorithm with a fixed time limit; we refer the reader to [BPTW] for more details.</p>
<p>The first example is a benchmark of BCG vs. Away-Step Frank-Wolfe (AFW), Pairwise Conditional Gradients (PCG), and Vanilla Frank-Wolfe (FW) on a LASSO instance. BCG significantly outperforms the other variants and in fact the empirical convergence rate of BCG is much higher than the rates of the other algorithms; recall we are in log-scale and we expect linear convergence for all variants due to the characteristics of the instance (optimal solution in strict interior).</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/bcg/bcg4.png" alt="BCG vs. normal" /></p>
<p>One might say that the above is not completely unexpected, in particular because BCG also uses the lazification technique from our previous work in [BPZ]. So let us see how we compare to lazified variants of Frank-Wolfe. In the next graph we compare BCG vs. LPCG vs. PCG. The problem we solve here is a structured regression problem over a spanning tree polytope. Clearly, while LPCG is faster than PCG, BCG is significantly faster than either of those, both in iterations and wall-clock time.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/bcg/bcg3.png" alt="BCG vs. lazy" /></p>
<p>To better understand what is going on, let us see how often the LP oracle is actually called throughout the iterations. In the next graph we plot iterations vs. cumulative number of calls to the (true) LP oracle. Here we also added Fully-Corrective Frank-Wolfe (FCFW) variants that fully optimize over the active set and hence should have the lowest number of required LP calls; we implemented two variants: one that optimizes over the active set for a fixed number of iterations (the faster one in grey) and one that optimizes to a specific accuracy (the slower one in orange). The next plot shows two instances: LASSO (left) and structured regression over a <em>netgen</em> instance (right). For the former, lazification is not helpful as the LP oracle is too simple; for the latter, it is. As expected, for the non-lazy variants such as FW, AFW, and PCG we have a straight line as we perform one LP call per iteration. For LCG, BCG, and the two FCFW variants we obtain a significant reduction in actual calls to the LP oracle, with BCG sitting right between the (non-)lazy variants and the FCFW variants. BCG attains a large fraction of the reduction in calls of the much slower FCFW variants while being extremely fast compared to FCFW and all other variants.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/bcg/bcg1.png" alt="Cache rates" /></p>
<p>On a fundamental level one might argue that what it really comes down to is how well the algorithm uses the information obtained from an LP call. Clearly, there is a trade-off: on the one hand, better utilization of that information becomes increasingly expensive, making the algorithm slower, so that it might be advantageous to rather do another LP call; on the other hand, calling the LP oracle too often results in suboptimal use of the LP call’s information, and these calls can be expensive. Managing this trade-off is critical to achieve high performance and BCG uses a convergence criterion to maintain a very favorable balance. To get an idea how well the various algorithms are using the LP call information, consider the next graphic, where we run various algorithms on a LASSO instance. As a measure we depict primal and dual progress vs. (true) LP oracle calls. As can be seen in primal and dual progress, BCG is using LP information in a much more aggressive way, while maintaining a very high speed. The two FCFW variants, which would use the information from the LP calls even more aggressively, only performed a handful of iterations, as they are extremely slow (see the grey and orange lines right at the beginning of the red line).</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/bcg/bcg2.png" alt="Progress per LP call" /></p>
<h3 id="bcg-code">BCG Code</h3>
<p>If you are interested in using BCG, we made a preliminary version of our code available on <a href="https://github.com/pokutta/bcg">GitHub</a>; a significant update with more options and additional algorithms is coming soon.</p>
<h3 id="references">References</h3>
<p>[FW] Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2), 95-110. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109">pdf</a></p>
<p>[CG] Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 6(5), 787-823. <a href="http://www.mathnet.ru/php/archive.phtml?wshow=paper&jrnid=zvmmf&paperid=7415&option_lang=eng">pdf</a></p>
<p>[BPZ] Braun, G., Pokutta, S., & Zink, D. (2017, August). Lazifying conditional gradient algorithms. In Proceedings of the 34th International Conference on Machine Learning, Volume 70 (pp. 566-575). JMLR.org. <a href="https://arxiv.org/abs/1610.05120">pdf</a></p>
<p>[BPTW] Braun, G., Pokutta, S., Tu, D., & Wright, S. (2018). Blended Conditional Gradients: the unconditioning of conditional gradients. arXiv preprint arXiv:1805.07311. <a href="https://arxiv.org/abs/1805.07311">pdf</a></p>
<p>[LJ] Lacoste-Julien, S., & Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems (pp. 496-504). <a href="http://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf">pdf</a></p>
<p>Sebastian Pokutta</p>
<p>TL;DR: This is an informal summary of our recent paper Blended Conditional Gradients with Gábor Braun, Dan Tu, and Stephen Wright, showing how mixing Frank-Wolfe and Gradient Descent gives a new, very fast, projection-free algorithm for constrained smooth convex minimization.</p>
<p>The Zeroth World</p>
<p>2019-02-06T05:06:41+01:00, http://www.pokutta.com/blog/random/2019/02/06/zeroth-world</p>
<p><em>TL;DR: On the impact of AI on society and economy and its potential to enable a zeroth world with unprecedented economic output.</em>
<!--more--></p>
<p>In this post I want to talk about the impact that artificial intelligence might have on society and economy; not because of the “terminator scenario” but because of what it already can achieve <em>right now</em>. Over the last few months I have had many such discussions within industry, academia, and government and this is a summary of what I think; as always <em>biased and incomplete</em>.</p>
<p>Before delving into the actual discussion, I would like to clarify what I consider artificial intelligence (AI), as this is a very elusive term that has been overloaded several times to suit various narratives. When I talk about <em>artificial intelligence (AI)</em>, what I am talking about is <em>any technology, technology complex, or system</em> that:</p>
<ol>
<li>(Sensing) Gathers information through direct input, sensors, etc.</li>
<li>(Learning) Processes information with the explicit or implicit aim of forming an evaluation of its environment.</li>
<li>(Deciding) Decides on a course of action.</li>
<li>(Acting) Informs or implements that course of action.</li>
</ol>
<p>For those familiar, this is quite similar to the <a href="https://en.wikipedia.org/wiki/OODA_loop">OODA loop</a>, an abstraction that captures dynamic decision-making with feedback. The (minor) difference here is that we (a) consider broader systems and (b) do not necessarily require feedback. In terms of (Acting) we also assume some form of autonomy; however, the action might be either only suggested by the system or directly executed. The purpose of this “definition” is not to add yet another definition to the mix but to make precise, <em>for the purpose of this post</em>, what we will be talking about. For simplicity, from now on we will refer to such systems as AI or AI systems. We will also refer to larger systems as AI systems if they contain such technology at their core.</p>
<p>Examples of where such AI systems are used or appear are:</p>
<ul>
<li>Credit ratings</li>
<li>Amazon’s “people also bought”</li>
<li>Autonomous vehicles</li>
<li>Medical decision-support systems</li>
<li>Facial recognition</li>
<li>…</li>
</ul>
<p>Also, note that I chose the term “AI systems” vs many other equally fitting terms as it seems to be more “accessible” than some of the more technical ones, such as <em>Machine Learning</em> or <em>Decision-Support Systems</em>. Otherwise this choice is really arbitrary; let’s not make it about the choice of words.</p>
<h2 id="impact-through-hybridization">Impact through Hybridization</h2>
<p>A lot of the current discussion has been centered around the direct substitution of technology, workers, etc. by AI systems, as in <em>robot-in-human-out</em>. I believe, however, that this is not the likely scenario in the short to mid term, as it would require a very high maturity level of current AI and machine learning technology that seems far away. Those wary of AI would argue that the <em>singularity</em>, where basically AI systems improve themselves, will drive maturity exponentially fast. Whether this is likely to happen I do not know, as predictions of this type are tough. Most of the voices wary of AI seem to argue from a utilitarian perspective à la Bernoulli and would rather err on the safe side; from a risk management perspective this is not necessarily a bad approach. Most of those unconcerned argue that we have not figured out some very basic challenges and as such there is no real risk.</p>
<p class="center"><img src="https://imgs.xkcd.com/comics/skynet.png" alt="XKCD comic: Skynet" />
<a href="https://xkcd.com/1046/">[Source: XKCD]</a></p>
<p>While this discourse might be important in its own right, I want to focus more on the <em>(relatively) immediate, short-term</em> impact: timelines of the order of 10-20 years, which is really short compared to the speed with which societies and economic systems adapt.</p>
<h3 id="scaling-and-enabling-through-ai">Scaling and Enabling through AI</h3>
<p>In order for AI to have a disruptive impact on society, full maturity is not required; neither is <em>explainability</em>, although this might be desirable. The reason for this is that we can simply “pair up a human with an AI”, which I refer to as <em>Hybridization</em>, forming a symbiotic system in a more Xenoblade-esque fashion. The basic principle is that 90% of the basics can be performed efficiently and faster by an AI and for the remaining 10% we have human override. This will (1) enable an individual to perform, at unprecedented speed, tasks that were previously out of reach and (2) allow an individual to aggressively scale up her/his operations by operating on a higher level, letting the AI take care of the basics.</p>
<p>While this sounds like Sci-Fi at first, a closer look reveals that we have been operating like this for many decades: we build tools to automate basic tasks (where basic is relative to the current level). This leads to an <em>automate-and-elevate</em> paradigm or cycle: automate the basics (e.g., via a machine or computer) and then go to the next level. A couple of examples:</p>
<ul>
<li>Driver + Google Maps</li>
<li>Engineer + finite elements software</li>
<li>Vlogger + Camera + Final Cut Pro</li>
<li>MD + X-Ray</li>
</ul>
<p>I am sure you can come up with hundreds of other examples. What all these examples have in common is (1) an enabling factor and (2) a scale-up factor. Take the “Engineer + finite elements software” example: The engineer can suddenly compute and test designs that were impossible to verify alone beforehand and that required a larger number of other people to be involved. With this tool, however, the number of people involved can be significantly reduced (the individual’s productivity skyrockets) and completely new, previously unthinkable things can suddenly be done.</p>
<p>What AI systems bring to the mix is that they suddenly allow us to (at least partially) tool and automate tasks that were out of reach so far because of “messy inputs”, i.e., these AI systems allow us to redefine what we consider “basic”.</p>
<h3 id="an-example">An example</h3>
<p>Let us consider the example of autonomous driving. Not because I like it particularly but because most of us have a pretty good idea about driving. Also today’s cars already have very basic automation, such as “cruise control” and “lane assist” systems, so that the idea is not that foreign. Traditionally, a car has one driver. While AI for autonomous driving seems far from being completely there yet, we <em>do not need this</em> to achieve disruptive improvements. Here are two use cases:</p>
<p>Use case 1: Let the AI take care of the basic driving tasks. Whenever a situation is unclear, the controls are transferred to a centralized control center, where professional drivers take over for the duration of the “complex task”, and then the controls are passed back to the car. This might allow a single driver, together with AI subsystems, to operate 4-10 cars at a time. The range is arbitrary but seems reasonable: not correcting for correlation and tail risks, a factor of 4x would require the AI to handle 75% of the driven miles autonomously and a factor of 10x would require 90% of the driven miles to be handled autonomously. Current disengagement rates of Waymo seem to be far better than that.</p>
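<p>The back-of-the-envelope relation behind these numbers can be written down directly. The sketch below assumes, as the text does, that a human is only needed for the non-autonomous share of the miles and ignores correlation and tail risks; the function names are made up for this illustration.</p>

```python
def cars_per_driver(autonomous_fraction: float) -> float:
    """Number of cars one remote driver can cover.

    Assumes the driver is busy only during the non-autonomous share of the
    miles and that the demands of different cars never overlap (a strong
    simplification).
    """
    if not 0.0 <= autonomous_fraction < 1.0:
        raise ValueError("autonomous_fraction must be in [0, 1)")
    return 1.0 / (1.0 - autonomous_fraction)


def required_autonomous_fraction(cars: float) -> float:
    """Share of miles the AI must handle so that one driver can cover `cars` cars."""
    if cars < 1.0:
        raise ValueError("cars must be at least 1")
    return 1.0 - 1.0 / cars
```

<p>For instance, <code>required_autonomous_fraction(4)</code> gives 0.75 and <code>required_autonomous_fraction(10)</code> gives 0.9, matching the 75% and 90% figures above.</p>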
<p>Use case 2: Long-haul trucking. Highway autonomy is much easier than intracity operations. Have truck drivers drive the truck to a “handover point” on a highway. Truck driver gets off the truck, the truck drives autonomously via the highway network to the handover point close to its destination. Human truck driver “picks up” the truck for last-mile intracity driving. If you now consider the ratio between the intracity portions and the highway portions of the trip, the number of required drivers can be reduced significantly; a 10x factor seems conservative. Moreover, rest times etc can be cut out as well.</p>
<p>Clearly, we can also combine use case 1 and 2 for extra safety with minimal extra cost. What we see however from this basic example is that AI systems can scale-up what a single human can do by significant multiples. Also in the long-haul example from above, the quality of life of the drivers goes up, e.g., less time spent away from family (that is for those that keep their job). However, the <em>very important</em> flip-side of this hybridization is that it threatens to displace a huge fraction of jobs: at a scaling of 10x about 90% of the jobs might be at risk; this is of course a naive estimate.</p>
<p>Other tasks, which might become “basic” are:</p>
<ul>
<li><em>Call center operations:</em> We already have call systems handling large portions of the call until being passed to an operator. AI-based systems bring this to another level. Think: <a href="https://www.theverge.com/2018/12/5/18123785/google-duplex-how-to-use-reservations">Google Duplex</a></li>
<li><em>Checking NDAs and contracts:</em> Time consuming and not value add. There are several systems (have not verified their accuracy) that offer automatic review, e.g., <a href="https://www.ndalynn.com/">NDALynn</a>, <a href="https://www.lawgeex.com/">LawGeex</a> (see also <a href="https://www.techspot.com/news/77189-machine-learning-algorithm-beats-20-lawyers-nda-legal.html">TechSpot</a>).</li>
<li><em>Managing investment portfolios:</em> Robo-Advisors in the retail space deliver similar or better performance than traditional and costly (and often subpar) investment advisors; after all, the hot shots are mostly working for funds or UHNWIs. (see <a href="http://money.com/money/5330932/best-robo-advisors-beginner-advanced-2018/">here</a> and <a href="https://www.barrons.com/articles/the-top-robo-advisors-an-exclusive-ranking-1532740937">here</a>)</li>
<li><em>Design of (simple) machine learning solutions:</em> Google’s <a href="https://cloud.google.com/automl/">AutoML</a> automates the creation of high-performance machine learning models. Upload your data and get a deployment ready model with REST API etc. No data scientist required.</li>
<li>I know of other large companies using AI systems to automate the RFP process by sifting through thousands of pages of specifications to determine a product offering.</li>
</ul>
<p>Of course, just to be clear, all of the above come also with certain usage risks if not used properly or without the necessary expertise.</p>
<h2 id="the-bigger-picture-learning-rate-and-discovery-rate">The bigger picture: learning rate and discovery rate</h2>
<p>What this all might lead to is a <em>Zeroth World</em> whose advantage (broadly speaking in terms of development: economic, educational, societal, etc) over the First World might be as large as the advantage of the First World over the Third World.</p>
<h3 id="gdp-per-employed-person">GDP per employed person</h3>
<p>A very skewed but still informative metric is GDP per person employed. It gives generally a good idea of the productivity levels achieved <em>on average</em>. There are a couple of special cases, for example China with an extremely high variance. Nonetheless, in the graphics below generated from <a href="https://www.google.com/publicdata/explore?ds=d5bncppjof8f9_&ctype=l&met_y=sl_gdp_pcap_em_kd#!ctype=l&strail=false&bcs=d&nselm=h&met_y=sl_gdp_pcap_em_kd&scale_y=log&ind_y=false&rdim=region&idim=region:NAC&idim=country:SGP:JPN:DEU:CHN:FRA:CMR:COG:ETH:GHA:KEN:NGA:SDN:ECU&ifdim=region&tdim=true&hl=en_US&dl=en_US&ind=false">Google’s dataset</a> you can see a strict separation between (some) First World countries and (some) Third World countries; note that the scale is logarithmic:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/gdp-employed-person-comp.png" alt="GDP per employed person" />
<a href="https://www.google.com/publicdata/explore?ds=d5bncppjof8f9_&ctype=l&met_y=sl_gdp_pcap_em_kd#!ctype=l&strail=false&bcs=d&nselm=h&met_y=sl_gdp_pcap_em_kd&scale_y=log&ind_y=false&rdim=region&idim=region:NAC&idim=country:SGP:JPN:DEU:CHN:FRA:CMR:COG:ETH:GHA:KEN:NGA:SDN:ECU&ifdim=region&tdim=true&hl=en_US&dl=en_US&ind=false">[Source: Google’s dataset]</a></p>
<p>Now imagine that some countries, by leveraging AI systems, achieve a 10x gain in output per employed person. That will be the <em>Zeroth World</em>: people operating at 10x of their First World productivity levels. Hard to imagine, but that is roughly the separation between the US and Ghana for example.</p>
<p>The graph above is very compatible with well-known trends, e.g., <a href="https://www.reuters.com/article/us-singapore-semiconductors-analysis/singapores-automation-incentives-draw-tech-firms-boost-economy-idUSKBN17T3DX">Singapore strongly investing in automation</a> or China being the country with the <a href="https://www.dbs.com/aics/templatedata/article/generic/data/en/GR/042018/180409_insights_understanding_china_automation_drive_is_essential_and_welcome.xml">largest number of industrial robots going online</a>. JP Morgan estimates that automation could add <a href="https://www.businessinsider.com/automation-one-trillion-dollars-global-economy-jpmam-report-2017-11">up to $1.1 trillion</a> to the global economy over the next 10-15 years. While this is only an overall boost of 1-1.5% in global GDP, in actuality the effect might be much more pronounced as it will be concentrated in a few countries, leading to a much stronger separation; still, even if the whole boost were attributed to the US, it would be just about 5%. But AI systems go beyond mere manufacturing automation and it is hard to estimate the cumulative effect. To put things into context, in manufacturing an extreme shift happened around the 2000s when the first wave of strong automation kicked in. Over the last 30 or so years we roughly doubled manufacturing output and close to halved the number of people; see the graphics from <a href="https://www.businessinsider.com/manufacturing-output-versus-employment-chart-2016-12">Business Insider</a>:</p>
<p class="center"><img src="https://amp.businessinsider.com/images/584b0056ca7f0c5c008b4a92-960-720.png" alt="Manufacturing output vs. automation" />
<a href="https://www.businessinsider.com/manufacturing-output-versus-employment-chart-2016-12">[Source: Business Insider]</a></p>
<p>That is 4x in about 30 years in a physical space, with large, tangible assets and more generally with lots of overall inertia in the system. It is quite likely that AI systems will have an even more pronounced effect because they are more widely deployable, so that the 10x scenario is <em>not that</em> ambitious.</p>
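<p>The 4x figure is simply output per person: roughly twice the output with roughly half the people. A small sketch, using only the rough numbers quoted above and hypothetical function names, also shows what this means as a compounded yearly rate:</p>

```python
def productivity_multiple(output_multiple: float, workforce_multiple: float) -> float:
    """Change in output per person given changes in total output and headcount."""
    if workforce_multiple <= 0:
        raise ValueError("workforce_multiple must be positive")
    return output_multiple / workforce_multiple


def annualized_rate(multiple: float, years: float) -> float:
    """Constant yearly growth rate that compounds to `multiple` over `years`."""
    if multiple <= 0 or years <= 0:
        raise ValueError("multiple and years must be positive")
    return multiple ** (1.0 / years) - 1.0


# Rough figures from the text: output doubled, headcount halved, about 30 years.
per_person_gain = productivity_multiple(2.0, 0.5)
yearly_rate = annualized_rate(per_person_gain, 30.0)
```

<p>With these inputs the gain per person is 4x, which corresponds to a bit under 5% per year when compounded over 30 years.</p>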
<h3 id="learning-rate-vs-discovery-rate">Learning rate vs discovery rate</h3>
<p>To better understand what AI systems reasonably can and cannot do, without making strong predictions about the future, we need to differentiate between the <em>learning rate</em> and the <em>discovery rate</em> of a technology. In a nutshell, the learning rate captures how fast, e.g., prices, resources required, etc. fell over time for an <em>existing</em> solution or product, e.g., by how much flying got cheaper over time. This captures various improvements over time in deploying a given technology. The learning rate makes no statement about new discoveries or overcoming fundamental roadblocks. That is exactly what the <em>discovery rate</em> captures. While the learning rate tends to be quite observable and often follows a relatively stable trend over time, the discovery rate is much more unpredictable (due to its nature) and that is where speculation about the future and its various scenarios often comes into play. I will not go there: the learning rate alone can provide us with some insights. Note that we refer to those two as “rates” as it is very insightful to consider the world in logarithmic scale, e.g., measuring time to double or halve. Let us consider the example of <a href="https://aiimpacts.org/wikipedia-history-of-gflops-costs/">historical prices for GFlops</a>:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/History-of-GFLOPS-prices.png" alt="Learning rate GFlops" /></p>
<p class="center"><a href="https://aiimpacts.org/wikipedia-history-of-gflops-costs/">[Source: AIImpacts.org]</a></p>
<p>We can find a very similar trend in <a href="https://jcmit.net/memoryprice.htm">historical prices for storage</a>:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/MemoryDiskPriceGraph-2018Dec.jpg" alt="Learning rate storage" />
<a href="https://jcmit.net/memoryprice.htm">[Source: jcmit.net]</a></p>
<p>These two are probably pretty much expected as they roughly follow <a href="https://en.wikipedia.org/wiki/Moore%27s_law">Moore’s law</a>; however, there are many similar examples in other industries with different rates, for example <a href="https://www.vox.com/2016/8/24/12620920/us-solar-power-costs-falling">historical prices for solar panels</a> or <a href="https://www.theatlantic.com/business/archive/2013/02/how-airline-ticket-prices-fell-50-in-30-years-and-why-nobody-noticed/273506/">flights</a>. Now let us compare this to the recent increase in <a href="https://blog.openai.com/ai-and-compute/">compute deployed for training AI systems</a>:</p>
<p class="center"><img src="https://openai.com/content/images/2018/05/compute_diagram-log@2x-3.png" alt="AI and compute" />
<a href="https://blog.openai.com/ai-and-compute/">[Source: OpenAI Blog]</a></p>
<p>Compared to Moore’s law, with a doubling time of roughly 18 months (so far), the doubling time for the compute deployed here is much shorter, at only about 3.5 months (so far). Clearly, neither can continue forever at such aggressive rates; however, this example points to two things: (a) we are moving <em>much faster</em> than anything that we have seen so far and (b) the deployment of more compute usually comes with a roughly similar increase in required data (the reason being that training algorithms, usually based on variants of stochastic gradient descent, can only make so many passes over the data before overfitting). Notably, those applications in the graph with the highest compute are not relying on labeled data (except for maybe Neural Machine Translation to some extent; not sure) but are reinforcement learning systems, where training data is generated through simulation and (self-)play. For more details see the <a href="https://blog.openai.com/ai-and-compute/">AI and Compute</a> post on OpenAI’s blog. The graph above is not exactly the learning rate as it lacks the relation to, e.g., price; however, it clearly shows how fast we are progressing. It is not hard to imagine that with new hardware architectures, in a not too distant future, that type of power will be available on your cell phone. So even <em>without new discoveries</em>, just following the natural learning rate of the industry and making the current state-of-the-art cheaper will have a profound impact. For example, just a few days ago Google’s DeepMind <a href="https://blog.usejournal.com/an-analysis-on-how-deepminds-starcraft-2-ai-s-superhuman-speed-could-be-a-band-aid-fix-for-the-1702fb8344d6">(not completely uncontroversially)</a> won against pro players at playing <a href="https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/">StarCraft 2</a> (see also <a href="https://www.theverge.com/2019/1/24/18196135/google-deepmind-ai-starcraft-2-victory">here</a>).
The training of this system required an enormous amount of computational resources. Even in light of the controversy, this is still an important achievement in terms of scaling technology, large-scale training with multiple agents, demonstrating that well designed reinforcement learning systems <em>can</em> learn very complex tasks, and more generally to “make it work”; whether reinforcement learning in general is the right approach to such problems is left for another discussion. In a few years we will teach building such integrated large-scale systems at universities end-to-end as a senior design type of project and then a few years later you will be able to download such a bot in the <em>App Store</em>. Crazy? Think of <em>neural style transfer</em> a few years back. You can now get <a href="https://prisma-ai.com/">Prisma</a> on your cell phone. Sure, it might offload the computation to the cloud (at least previous versions did so), but that is not the point. The point is that complex AI system designs at the cutting edge are made available to the broader public only <em>a few years</em> after their inception. <a href="https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html">Google Duplex</a> is another such example, making restaurant reservations and the like for you. To be clear, I am also very well aware of the limitations etc., but at the same time I fail to see a <em>fundamental roadblock</em> and existing limitations might be removed quickly with good engineering and research.</p>
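<p>The gap between the two doubling times is easier to appreciate with a quick calculation. The following sketch assumes clean exponential growth with the two doubling times quoted above (18 months and 3.5 months); the function and constant names are chosen for this illustration only.</p>

```python
from typing import Tuple

MOORE_DOUBLING_MONTHS = 18.0       # doubling time quoted in the text
AI_COMPUTE_DOUBLING_MONTHS = 3.5   # doubling time quoted in the text


def growth_factor(months: float, doubling_time_months: float) -> float:
    """Multiplicative growth after `months` for a quantity with a fixed doubling time."""
    if doubling_time_months <= 0:
        raise ValueError("doubling time must be positive")
    return 2.0 ** (months / doubling_time_months)


def compare(months: float) -> Tuple[float, float]:
    """Growth under the Moore's-law pace vs. the AI training compute pace."""
    return (
        growth_factor(months, MOORE_DOUBLING_MONTHS),
        growth_factor(months, AI_COMPUTE_DOUBLING_MONTHS),
    )
```

<p>Over a single 18-month window, <code>compare(18.0)</code> gives a factor of 2 for the Moore’s-law pace and a factor of about 35 for the training compute pace.</p>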
<h3 id="impact-on-society-and-economy">Impact on society and economy</h3>
<p>In a nutshell: we are moving very fast. In fact, so fast that the consequences are unclear. Forget about the “terminator scenario” as a threat to society. Not because it might or it might not happen but rather because already the <em>current technology</em> just following its natural learning rate cycle poses a much more immediate challenge with the potential to lead to <em>huge</em> disruptions, both positive and negative.</p>
<p>One very critical impact to think about is workforce. If AI enables people to be more productive then either the economic output increases or the number of people required to achieve a given output level will decrease; these are two sides of the same coin. The reality is that while there (likely) will be significant improvements in terms of economic output, there is only so much increase the “world” can absorb in a short period of time: at an economic world growth of about 2-3% per year the time it takes to 10x the output is roughly 80-115 years; even with significantly improved efficiency due to AI systems you can only push the output so far. What this means is that we might be facing a transitory period where efficiency improvements will drastically impact employment levels and it will take a considerable time for the workforce to adjust to these changes. In light of this one might actually contemplate whether populations in several developed countries are shrinking in early anticipation of the times ahead.</p>
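<p>The time scale mentioned here follows from plain compounding. As a rough check, assuming constant yearly growth (a simplification) and using a hypothetical helper name:</p>

```python
import math


def years_to_multiply(multiple: float, annual_growth: float) -> float:
    """Years needed to grow by `multiple` at a constant yearly growth rate.

    Solves (1 + annual_growth) ** years == multiple for years.
    """
    if multiple <= 0 or annual_growth <= 0:
        raise ValueError("multiple and annual_growth must be positive")
    return math.log(multiple) / math.log(1.0 + annual_growth)


# Growth rates of 2% and 3% per year, the range quoted in the text.
years_at_2_percent = years_to_multiply(10.0, 0.02)
years_at_3_percent = years_to_multiply(10.0, 0.03)
```

<p>Under this simple model, a 10x increase takes about 116 years at 2% growth and about 78 years at 3% growth.</p>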
<p>The other critical thing to think about is the concentration of power and wealth that might accompany these shifts. Already today, we see that tech companies accumulate wealth and capital at unprecedented rates, leveraging the network effects of the internet. Yet, as they are still somewhat tied to the physical world, e.g., due to users, there is still <em>some limit</em> to their growth. It is easily imaginable, however, that the next “category of scale” will be defined by AI companies, with an insane concentration of resources, wealth, and power that makes current concentration levels in the Valley pale in comparison.</p>
<p>We will likely also see the empowering of individuals beyond what we could imagine just a few years back by (a) multiplying the sheer output of an individual due to scaling and (b) enabling the individual to do new things leveraging AI support systems. Then the “best” will dominate and technology will enable that individual to act globally, removing the last of the geographic entry barriers. As a simple example, take the recent “vlog” phenomenon, where one-person video productions can achieve a level of professionalism that rivals that of large-scale productions, executed from any place in the world and distributed world-wide through YouTube. Moreover, the individual can directly “sell” to her/his target audience, cutting out the middle man. This might provide a larger diversity and also a democratization of such disciplines but at the same time might also remove a useful filter in some cases.</p>
<p>These shifts, brought about by AI systems and resulting technology, come with a lot of (potential) positives and negatives and the promises of AI systems are great. Being high on possibilities of this new paradigm, it is easy to forget though that there might be severe unintended consequences with potentially critical impact on our societies and economies. In order to enable sustainable progress we need to not just be aware but prepare and actively shape the use of these new technologies.</p>
<p>Sebastian Pokutta</p>