<h1>VAEs are Bayesian</h1>
<p>Jeffrey Ling, 2018-01-09</p>
<div hidden="">
$$
\newcommand{\boldA}{\mathbf{A}}
\newcommand{\boldB}{\mathbf{B}}
\newcommand{\boldC}{\mathbf{C}}
\newcommand{\boldD}{\mathbf{D}}
\newcommand{\boldE}{\mathbf{E}}
\newcommand{\boldF}{\mathbf{F}}
\newcommand{\boldG}{\mathbf{G}}
\newcommand{\boldH}{\mathbf{H}}
\newcommand{\boldI}{\mathbf{I}}
\newcommand{\boldJ}{\mathbf{J}}
\newcommand{\boldK}{\mathbf{K}}
\newcommand{\boldL}{\mathbf{L}}
\newcommand{\boldM}{\mathbf{M}}
\newcommand{\boldN}{\mathbf{N}}
\newcommand{\boldO}{\mathbf{O}}
\newcommand{\boldP}{\mathbf{P}}
\newcommand{\boldQ}{\mathbf{Q}}
\newcommand{\boldR}{\mathbf{R}}
\newcommand{\boldS}{\mathbf{S}}
\newcommand{\boldT}{\mathbf{T}}
\newcommand{\boldU}{\mathbf{U}}
\newcommand{\boldV}{\mathbf{V}}
\newcommand{\boldW}{\mathbf{W}}
\newcommand{\boldX}{\mathbf{X}}
\newcommand{\boldY}{\mathbf{Y}}
\newcommand{\boldZ}{\mathbf{Z}}
\newcommand{\bolda}{\mathbf{a}}
\newcommand{\boldb}{\mathbf{b}}
\newcommand{\boldc}{\mathbf{c}}
\newcommand{\boldd}{\mathbf{d}}
\newcommand{\bolde}{\mathbf{e}}
\newcommand{\boldf}{\mathbf{f}}
\newcommand{\boldg}{\mathbf{g}}
\newcommand{\boldh}{\mathbf{h}}
\newcommand{\boldi}{\mathbf{i}}
\newcommand{\boldj}{\mathbf{j}}
\newcommand{\boldk}{\mathbf{k}}
\newcommand{\boldl}{\mathbf{l}}
\newcommand{\boldm}{\mathbf{m}}
\newcommand{\boldn}{\mathbf{n}}
\newcommand{\boldo}{\mathbf{o}}
\newcommand{\boldp}{\mathbf{p}}
\newcommand{\boldq}{\mathbf{q}}
\newcommand{\boldr}{\mathbf{r}}
\newcommand{\bolds}{\mathbf{s}}
\newcommand{\boldt}{\mathbf{t}}
\newcommand{\boldu}{\mathbf{u}}
\newcommand{\boldv}{\mathbf{v}}
\newcommand{\boldw}{\mathbf{w}}
\newcommand{\boldx}{\mathbf{x}}
\newcommand{\boldy}{\mathbf{y}}
\newcommand{\boldz}{\mathbf{z}}
\newcommand{\mcA}{\mathcal{A}}
\newcommand{\mcB}{\mathcal{B}}
\newcommand{\mcC}{\mathcal{C}}
\newcommand{\mcD}{\mathcal{D}}
\newcommand{\mcE}{\mathcal{E}}
\newcommand{\mcF}{\mathcal{F}}
\newcommand{\mcG}{\mathcal{G}}
\newcommand{\mcH}{\mathcal{H}}
\newcommand{\mcI}{\mathcal{I}}
\newcommand{\mcJ}{\mathcal{J}}
\newcommand{\mcK}{\mathcal{K}}
\newcommand{\mcL}{\mathcal{L}}
\newcommand{\mcM}{\mathcal{M}}
\newcommand{\mcN}{\mathcal{N}}
\newcommand{\mcO}{\mathcal{O}}
\newcommand{\mcP}{\mathcal{P}}
\newcommand{\mcQ}{\mathcal{Q}}
\newcommand{\mcR}{\mathcal{R}}
\newcommand{\mcS}{\mathcal{S}}
\newcommand{\mcT}{\mathcal{T}}
\newcommand{\mcU}{\mathcal{U}}
\newcommand{\mcV}{\mathcal{V}}
\newcommand{\mcW}{\mathcal{W}}
\newcommand{\mcX}{\mathcal{X}}
\newcommand{\mcY}{\mathcal{Y}}
\newcommand{\mcZ}{\mathcal{Z}}
\newcommand{\reals}{\mathbb{R}}
\newcommand{\integers}{\mathbb{Z}}
\newcommand{\rationals}{\mathbb{Q}}
\newcommand{\naturals}{\mathbb{N}}
\newcommand{\ident}{\boldsymbol{I}}
\newcommand{\bzero}{\boldsymbol{0}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\bdelta}{\boldsymbol{\delta}}
\newcommand{\boldeta}{\boldsymbol{\eta}}
\newcommand{\bkappa}{\boldsymbol{\kappa}}
\newcommand{\bgamma}{\boldsymbol{\gamma}}
\newcommand{\bmu}{\boldsymbol{\mu}}
\newcommand{\bphi}{\boldsymbol{\phi}}
\newcommand{\bpi}{\boldsymbol{\pi}}
\newcommand{\bpsi}{\boldsymbol{\psi}}
\newcommand{\bsigma}{\boldsymbol{\sigma}}
\newcommand{\btheta}{\boldsymbol{\theta}}
\newcommand{\bxi}{\boldsymbol{\xi}}
\newcommand{\bGamma}{\boldsymbol{\Gamma}}
\newcommand{\bLambda}{\boldsymbol{\Lambda}}
\newcommand{\bOmega}{\boldsymbol{\Omega}}
\newcommand{\bPhi}{\boldsymbol{\Phi}}
\newcommand{\bPi}{\boldsymbol{\Pi}}
\newcommand{\bPsi}{\boldsymbol{\Psi}}
\newcommand{\bSigma}{\boldsymbol{\Sigma}}
\newcommand{\bTheta}{\boldsymbol{\Theta}}
\newcommand{\bUpsilon}{\boldsymbol{\Upsilon}}
\newcommand{\bXi}{\boldsymbol{\Xi}}
\newcommand{\bepsilon}{\boldsymbol{\epsilon}}
\newcommand{\on}{\operatorname}
\newcommand{\E}{\mathbb{E}}
\newcommand{\Var}{\on{Var}}
$$
</div>
<p>Variational autoencoders (VAEs) have become an extremely popular generative model
in deep learning. While VAE outputs don’t achieve the same level of prettiness that
GANs do, they are well-motivated by probability theory and Bayes’ rule.</p>
<p>However, when deep learning papers discuss VAEs, they tend to ignore the Bayesian
framework entirely and emphasize the <strong>encoder-decoder</strong> architecture, despite the fact that
the original paper (Kingma 2013) was literally titled “Auto-Encoding Variational
Bayes” and named its training procedure Stochastic Gradient Variational Bayes (SGVB).</p>
<p>In this post, we’ll highlight the differences between the deep learning and
Bayesian interpretations.</p>
<h1 id="background">Background</h1>
<p>First, a review of VAEs. We have some data $x_1, \ldots, x_n$ that we want to
model with a generative process. We assume a latent variable $z_i$ for each $x_i$,
and a joint probability distribution $p(x, z) = p(x|z)p(z)$ that the data come from.</p>
<p>In deep learning, we usually assume that $p(z)$ is some simple distribution, e.g.
a standard multivariate Gaussian $\mcN(\bzero, \ident)$, and that $p(x|z)$ is defined
by a complicated neural network, $x = f_\theta(z)$ (this will be the <strong>decoder</strong>). To
generate a data sample, all we need to do is draw a Gaussian sample $z$ (easy) and
apply $f_\theta(z)$ to get $x$.</p>
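<p>As a minimal sketch of the generative direction (pure Python; the tiny tanh decoder here is a hypothetical stand-in for $f_\theta$, not any real architecture):</p>

```python
import math
import random

random.seed(0)

def decoder(z, theta):
    """Stand-in for f_theta: a tiny affine map followed by tanh,
    in one dimension. theta = (weight, bias)."""
    w, b = theta
    return math.tanh(w * z + b)

theta = (0.8, 0.1)
z = random.gauss(0.0, 1.0)   # sample from the prior p(z) = N(0, 1)
x = decoder(z, theta)        # push through the decoder to get a data sample
print(x)
```

<p>The whole generative story is those two lines: sample from the prior, then decode.</p>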
<p>In some sense, $z$ is a <strong>code</strong> for $x$, so motivated by autoencoders, we want a
function $g_\phi$ (the <strong>encoder</strong>, also a neural network with parameters $\phi$)
that encodes $x$. Since we’re working with probabilities here, instead of mapping
to a single $z$, we map to a distribution - in this case, we’ll assume another
Gaussian $\mcN(\mu, \sigma^2)$ is the “posterior” distribution on $z$. Now, if we
ever want to know the latent code for a data point $x$, we get $\mu, \sigma =
g_\phi(x)$, and take samples $z \sim \mcN(\mu, \sigma^2)$. We can think of
$g_\phi$ as representing the approximate posterior distribution $q(z|x)$.</p>
<p>To train the generative model (i.e. get $\theta$), we maximize the data log-likelihood $\log p(x)$. It turns out that doing this directly is hard, but by Jensen’s inequality, we can get a useful lower bound:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\log p(x) &\geq \E_{z \sim q_\phi(z|x)} [ \log p(x, z) - \log q_\phi(z|x) ] \\
&= \E_{z \sim q}[\log p_\theta(x|z)] - KL(q_\phi(z|x) || p(z))
\end{align*} %]]></script>
<p>This expression is known as the evidence lower bound (ELBO), and will be our loss function $\mcL_{ELBO}$ to optimize. Specifically, we will be doing gradient descent on both $\phi$ and $\theta$, the encoder and decoder respectively, to obtain our generative model. (Expectations are generally estimated via one Monte Carlo sample, or closed form if available.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>)</p>
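<p>As a concrete sketch of the objective, here is a one-sample Monte Carlo ELBO estimate for a 1-D toy model (pure Python; the unit-variance Gaussian likelihood and the linear decoder are illustrative assumptions, not the paper’s architecture):</p>

```python
import math
import random

random.seed(0)

def log_normal(x, mu, sigma):
    """log N(x; mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def elbo_estimate(x, mu_q, sigma_q, decode):
    """One-sample Monte Carlo estimate of the ELBO:
    E_q[log p(x|z)] - KL(q(z|x) || N(0,1)), with the KL in closed form."""
    eps = random.gauss(0.0, 1.0)           # reparameterization: z = mu + sigma * eps
    z = mu_q + sigma_q * eps
    recon = log_normal(x, decode(z), 1.0)  # assume p(x|z) = N(x; decode(z), 1)
    kl = 0.5 * (mu_q ** 2 + sigma_q ** 2 - 1.0 - math.log(sigma_q ** 2))
    return recon - kl

decode = lambda z: 0.9 * z  # hypothetical linear decoder
print(elbo_estimate(1.0, 0.5, 0.8, decode))
```

<p>Gradient descent on the negative of this quantity, with respect to both the encoder outputs and the decoder, is the whole training loop.</p>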
<p><img src="/assets/vae_graph.png" alt="VAE computation graph" /></p>
<p>As in the above picture (from <a href="https://arxiv.org/abs/1606.05908">this great tutorial</a>),
here is how the deep learning computation graph unfolds: given $x$, we encode
$\mu, \sigma = g_\phi(x)$, then take a sample $z \sim q(z|x) = \mcN(\mu, \sigma^2)$
(the sample handles the $\E_q$ in the first term of the ELBO). Then, we decode
$x = f_\theta(z)$ so that we can compute $\log p_\theta(x|z)$ for our loss.</p>
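<p>The forward pass just described can be sketched end to end (hypothetical 1-D encoder and decoder; the reparameterization $z = \mu + \sigma \epsilon$ is what keeps the sample differentiable with respect to $\phi$):</p>

```python
import math
import random

random.seed(1)

def encoder(x, phi):
    """Stand-in for g_phi: maps x to the parameters (mu, sigma) of q(z|x)."""
    a, b = phi
    return a * x, math.exp(b)  # exp keeps sigma positive

def decoder(z, theta):
    """Stand-in for f_theta."""
    w, c = theta
    return w * z + c

phi, theta = (0.7, -0.2), (1.1, 0.0)
x = 2.0
mu, sigma = encoder(x, phi)    # encode
eps = random.gauss(0.0, 1.0)
z = mu + sigma * eps           # reparameterized sample from q(z|x) = N(mu, sigma^2)
x_recon = decoder(z, theta)    # decode, ready for the log-likelihood term
print(mu, sigma, x_recon)
```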
<h1 id="the-deep-learning-perspective">The deep learning perspective</h1>
<p>From a deep learning point of view, we have set up an <strong>autoencoder</strong>. We map our $x$ to a latent space of $z$ (encoding), then map it back out to the data space (decoding).</p>
<p>Let’s break down the ELBO objective. We wish to maximize a sum of two terms:</p>
<ol>
<li>First is $\E_{z \sim q}[\log p_\theta(x|z)]$, commonly known as the
reconstruction term. By encoding $x$ into the latent space, sampling a $z$, and
decoding back to a reconstruction $x’$, we measure how well the model recovers its input.
We want the reconstructed $x’$ to be as close to the original $x$ as possible.</li>
<li>The second term is the negative KL divergence, which measures how far our
encoding distribution $q_\phi(z|x)$ is from the prior $p(z)$. If we only had the first
term, the encoding process could be pretty much arbitrary; this term “regularizes”
the encoder towards the simple distribution $p(z)$ by keeping the KL small. Of
course, we don’t want the KL to collapse all the way to 0, since then $q(z|x)$ would
just be the prior and fail to capture anything informative about our data $x$.</li>
</ol>
<p>From this perspective, we see that variational autoencoders are like an upgrade of standard autoencoders, which seek to minimize reconstruction loss upon encoding and decoding.</p>
<h1 id="the-bayesian-perspective">The Bayesian perspective</h1>
<p>We saw where the “autoencoder” part of VAE comes from. But what about the “variational” part? Let’s rephrase everything we’ve covered so far in the language of Bayes.</p>
<p>In Bayesian learning, if we have observed variables $x$ and latents $z$, obtaining
the posterior $p(z|x)$ is often hard. One approach is <strong>variational inference</strong>, where
we approximate the posterior $p(z|x)$ with a member of a family of well-behaved distributions
$q(z|x)$. Variational inference seeks to minimize the KL divergence
$KL(q(z|x) || p(z|x))$ of our guess $q$ from the true posterior.</p>
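<p>A useful identity here is $\log p(x) = \mcL_{ELBO} + KL(q(z|x) || p(z|x))$, so maximizing the ELBO over $q$ is the same as minimizing this KL. We can check it numerically on a toy conjugate model where everything is available in closed form (the model is chosen purely for illustration: $z \sim \mcN(0,1)$, $x|z \sim \mcN(z,1)$):</p>

```python
import math

def log_normal(x, mu, var):
    """log N(x; mu, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def elbo(x, m, s2):
    """Closed-form ELBO for the toy model z ~ N(0,1), x|z ~ N(z,1),
    with variational posterior q(z|x) = N(m, s2)."""
    expected_loglik = log_normal(x, m, 1.0) - 0.5 * s2  # E_q[log p(x|z)]
    kl = 0.5 * (m ** 2 + s2 - 1.0 - math.log(s2))       # KL(q || N(0,1))
    return expected_loglik - kl

x = 1.3
log_evidence = log_normal(x, 0.0, 2.0)     # marginal p(x) = N(0, 2) for this model
at_true_posterior = elbo(x, x / 2.0, 0.5)  # true posterior is N(x/2, 1/2)
at_worse_q = elbo(x, 0.0, 1.0)             # a mismatched q
print(log_evidence, at_true_posterior, at_worse_q)
```

<p>The ELBO matches $\log p(x)$ exactly when $q$ is the true posterior, and is strictly smaller for the mismatched $q$.</p>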
<p>VAEs are really doing variational inference. Let’s see how:</p>
<ul>
<li>$p(z)$ is the prior. In our case, this is a simple Gaussian.</li>
<li>$p(x|z)$ is the likelihood. In our case, this is implicitly defined by a
complicated neural network.</li>
<li>$p(z|x)$ is the true posterior that we don’t know.</li>
<li>$q(z|x)$ is the variational posterior, also defined by a complicated neural network. We usually call this the <strong>inference network</strong>.</li>
</ul>
<p>In typical variational inference, both the likelihood $p(x|z)$ and variational
posterior $q$ come from simple exponential families, so that we can derive closed-form
updates for the variational parameters. In VAEs, however, both the likelihood and
the inference net are parameterized by neural networks, so no closed-form updates
exist. Fortunately, we can get around this with stochastic gradient descent on the ELBO.</p>
<p>Therefore, in a Bayesian sense, VAEs are using variational inference to handle our
complicated generative model. The way we do inference on $z$ <em>happens</em> to be by
using a neural network, since inference with a neural likelihood is intractable.
Note that this interpretation gives us some flexibility: we can change up our prior
$p(z)$, likelihood $p(x|z)$, or even variational posterior $q(z|x)$ and still be
doing variational inference in the correct sense.</p>
<h3 id="amortized-inference">Amortized Inference</h3>
<p>There is an interesting point to make about our inference net $q(z|x)$. It performs
what is known as <strong>amortized inference</strong>: amortized in the sense that the same
parameters $\phi$ are shared across inference for every $z_n$, rather than optimizing a
separate set of variational parameters for each $x_n$ as in traditional VI.
As a bonus, amortized inference generalizes to unseen $x$!</p>
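<p>The distinction can be sketched in a few lines (hypothetical 1-D setup): classical VI keeps one free parameter pair per data point, while the amortized version learns a single function of $x$ that applies to points never seen during training:</p>

```python
# Classical VI: one free variational parameter pair per training point,
# each optimized individually.
xs = [0.5, -1.2, 2.0]
per_point_params = [{"m": 0.0, "s": 1.0} for _ in xs]

# Amortized VI: one shared parameter set phi, reused for every x.
phi = {"a": 0.5, "b": 0.0}

def amortized_q(x, phi):
    """Inference-network stand-in: maps x directly to (m, s)."""
    return phi["a"] * x + phi["b"], 1.0

# Works for an unseen x with no per-point re-optimization.
m_new, s_new = amortized_q(7.3, phi)
print(m_new, s_new)
```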
<h1 id="why-do-we-care">Why do we care?</h1>
<p>So we saw that there are two ways to look at VAEs. Are there advantages to one over the other? It seems that both have value.</p>
<p>For deep learning practitioners, the fact that VAEs are motivated by variational inference is not really relevant. As long as the generative model can produce good results (i.e. pretty pictures) it’s good enough. Therefore, papers that use this interpretation think about how to have better encoders and decoders, and when it makes sense to regularize latent codes with the KL term.</p>
<ul>
<li>One example from NLP is the sentence VAE (Bowman et al, 2015). Here, $x$ is a sentence, $z$ is a Gaussian vector, and the encoder and decoders are LSTMs.</li>
<li>A recent example is the VQ-VAE (van den Oord et al, 2017), which makes $z$ a discrete table of latent vectors. This doesn’t correspond to any standard probability distributions, yet they achieve remarkable results nonetheless.</li>
</ul>
<p>For Bayesian practitioners, the VAE setup is not necessarily about generative modeling; it is a way to do inference in genuinely interesting models (i.e. with meaningful $z$).</p>
<ul>
<li>In <a href="http://approximateinference.org/2017/accepted/SinghEtAl2017.pdf">my last project</a>, we assumed that $z$ was distributed according to an Indian buffet process, a nonparametric feature allocation distribution. We showed that the VAE setup allows us to learn latent features with frequencies that resemble the IBP distribution.</li>
<li>This interpretation can also shed light on principled ways to improve VAEs. One
example is ELBO surgery (Hoffman &amp; Johnson, 2016): they rewrite the ELBO in a
certain way and show that the KL between the prior $p(z)$ and the <em>average marginal</em>
$q(z) = \frac{1}{N} \sum_n q(z|x_n)$ can be small, while the <em>average KL</em>
$\frac{1}{N} \sum_n KL(q(z|x_n) || p(z))$ remains large. This suggests that
traditional priors may be harder for an individual sample to match than previously
thought. My friend Rachit has a <a href="https://rachitsingh.com/elbo_surgery/">great post</a>
explaining this in detail.</li>
</ul>
<p>To conclude, we see that VAEs can be interesting and useful from more than one point of view. This is what makes them such an exciting area of research!</p>
<h1 id="references">References</h1>
<ul>
<li><a id="ref-Kingma2013"></a>
Kingma, Diederik P, and Max Welling. 2013. “Auto-Encoding Variational
Bayes.” In <em>ICLR</em>.</li>
<li><a id="ref-Oord2017"></a>
Oord, Aaron van den, Oriol Vinyals, and Koray Kavukcuoglu. 2017. “Neural
Discrete Representation Learning.” <em>NIPS</em>.</li>
<li><a id="ref-Bowman2016"></a>
Bowman, Samuel R., Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal
Jozefowicz, and Samy Bengio. 2016. “Generating Sentences from a
Continuous Space.” <em>CONLL</em>.</li>
<li><a id="ref-Hoffman2016"></a>
Hoffman, Matthew D, and Matthew J Johnson. 2016. “ELBO surgery: yet
another way to carve up the variational evidence lower bound.” <em>Advances
in Approximate Bayesian Inference, NIPS Workshop</em>.</li>
<li><a id="ref-Roeder2017"></a>
Roeder, Geoffrey, Yuhuai Wu, and David Duvenaud. 2017. “Sticking the
Landing: Simple, Lower-Variance Gradient Estimators for Variational
Inference.” <em>NIPS</em>.</li>
</ul>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>It turns out that it’s counterintuitively <em>not</em> always better to use closed form expressions for the KL, even if we can compute it. (Roeder 2017) explains why: when we’re close to the optimum, sampling instead of using the closed form can actually help us reduce variance of our Monte Carlo gradient estimators. This will help us converge, or in their words, “stick the landing”. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<h1>NIPS 2017</h1>
<p>Jeffrey Ling, 2017-12-11</p>
<p>I attended NIPS 2017 in Long Beach, CA – my first conference!
Rachit and I presented our short paper at the <a href="http://approximateinference.org/">approximate Bayesian inference workshop</a>.</p>
<p>Overall, it was an incredible experience. Six days of little sleep, skipped meals, and a tremendous amount of talks, posters, and papers to process.
I learned a lot, met cool people, and gained exposure to a ton of interesting ideas.
Additionally, I unexpectedly ran into several people who I know from past internships, etc. Turns out everyone is doing machine learning these days!</p>
<p>The poster sessions were undoubtedly the highlight of the conference. While the talks are good, they are often hard to understand, not to mention the fact that several talks happen at the same time. (I think a lot of people end up skipping talks to spend time meeting old friends / properly feeding themselves.)
On the other hand, everyone goes to the poster sessions. Some high profile professors were even presenting their posters! The discussions at the posters are what research is really about - sharing awesome work with people who are interested in the fine-grained details.</p>
<p>While I’ve read so many papers from research labs all over the world, I actually have no idea what most of the authors look like or sound like in real life. At NIPS, I was finally able to put a face and voice to many of these people :)</p>
<p>Some highlights:</p>
<ul>
<li>Flo Rida was at an Intel party, straight out of a Silicon Valley scene</li>
<li>The Imposteriors (stat professors, including Michael Jordan and David Blei), performed in a live band for the final day reception</li>
<li>Ali Rahimi’s test of time award talk, equating machine learning to alchemy - spurred a ton of debate (including with Yann LeCun) on the importance of engineering vs. theory</li>
<li>One of the best paper awards went to work on subgame solving that beat top professional players at no-limit Hold’em!</li>
</ul>
<p>Some research observations:</p>
<ul>
<li>Favorite talks:
<ul>
<li>John Platt, who showed a cool application of variational inference for nuclear fusion engineering</li>
<li>Emma Brunskill, who discussed a ton of interesting challenges and applications for reinforcement learning</li>
</ul>
</li>
<li>Favorite posters / papers:
<ul>
<li>Inverse reward design (Hadfield-Menell et al). An RL agent is Bayesian about a reward function.</li>
<li>Sticking the landing (Roeder et al). Shows that closed form KLs for the gradient estimator may actually have higher variance, and proposes a neat and quick solution.</li>
</ul>
</li>
<li>Favorite panel: Tim Salimans, Max Welling, Zoubin Ghahramani, David Blei, Katherine Heller, Matt Hoffman (moderator) at approximate Bayesian inference workshop.</li>
<li>Favorite quote (by Zoubin): If you put a chicken in a MNIST classifier, you already know it’s not going to be a 1 or 7!</li>
<li>Not as much NLP as I expected.</li>
<li>Deep learning is a big deal. All the deep learning talks were easily the most well-attended.</li>
<li>People still love GANs. Gaussian processes are becoming popular on the Bayesian front.</li>
<li>An increasing interest in how ML will work in society, particularly with issues of bias and fairness.</li>
</ul>
<p>If you’re interested in a blow-by-blow recap of my NIPS experience, read below (warning: lots of details).</p>
<h2 id="day-1">Day 1</h2>
<p>Crazy long line out the door! According to the organizers, about 8000 people signed up for NIPS this year. I arrived Sunday and was able to get my registration that night, which proved to be a good decision. ;)</p>
<p>Also, the career fair booths were pretty next level (compared to a college career fair).</p>
<h1 id="tutorials-1">Tutorials 1</h1>
<p>I attended the first 8am tutorial on “Reinforcement learning for the people, by the people” presented by Emma Brunskill. At a high level, the talk covered two ideas: first, how can we do RL in settings where humans are involved, and second, how can we include people as part of the learning process.</p>
<p>The talk brought up several research problems about RL in these settings. Here’s a bit of a brain dump of what I saw.</p>
<ul>
<li>Sample efficiency. Unlike games / robotics, we can’t endlessly simulate humans.</li>
<li>Multi-task learning for education. Assume students are assigned to latent groups, and do inference. Referenced Finale’s work on better models (e.g. Bayesian neural net).</li>
<li>Policy search. Use Gaussian process to limit search. Shared basis for multi task, representation learning to generalize.</li>
<li>Different metrics. Beyond expectation, need to consider risk as humans will only see one trajectory. Safe exploration.</li>
<li>Batch RL, learning from prior data. Uses counterfactual reasoning, e.g. for classrooms assigned different education methods, or patient treatment. Key difficulty here is policy evaluation, hard because off policy (old data)</li>
<li>Better models can lead to worse policies. Models have high log-likelihood but get bad rewards.</li>
<li>Importance sampling for policy evaluation. Unbiased, but high variance for long horizon.</li>
<li>You can replace education “experts” with an RL policy evaluator :)</li>
</ul>
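<p>The importance-sampling point above can be made concrete with a toy two-armed bandit (pure Python; all numbers are made up): reweight each logged reward by the policy probability ratio to get an unbiased off-policy estimate of the target policy’s value.</p>

```python
import random

random.seed(0)

# Behavior policy b (which logged the data) and target policy pi to evaluate.
b = {0: 0.8, 1: 0.2}
pi = {0: 0.3, 1: 0.7}
reward = {0: 0.1, 1: 1.0}  # deterministic reward per action, for simplicity

# Log actions under b, then estimate pi's value by reweighting each
# logged reward with the importance weight pi(a) / b(a).
n = 100_000
total = 0.0
for _ in range(n):
    a = 0 if random.random() < b[0] else 1
    total += (pi[a] / b[a]) * reward[a]
estimate = total / n

true_value = sum(pi[a] * reward[a] for a in pi)  # 0.3*0.1 + 0.7*1.0
print(estimate, true_value)
```

<p>The estimator is unbiased, but the weight $\pi(a)/b(a)$ blows up when the target policy favors actions the behavior policy rarely took; over a long horizon these ratios multiply, which is exactly the variance problem mentioned above.</p>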
<p>By the people:</p>
<ul>
<li>Reward specification, imitation learning, supervised learning for trajectories is not i.i.d.</li>
<li>How to get access to experts? Which features matter? Difference in showing vs. doing (e.g. teaching surgery).</li>
</ul>
<p>I found this talk really well done! There was a ton of super exciting and awesome work cited. Definitely a flood of information that will take some time to parse.</p>
<h1 id="tutorials-2">Tutorials 2</h1>
<p>I jumped around a bit for the 10:45am tutorial. First, I went to “Fairness in Machine Learning”. The talk went on for a while about discrimination law (NOT the GAN discrimination!!), which is interesting but not what I was looking for.</p>
<p>Went to deep Gaussian process tutorial by Neil Lawrence. Some cool intuition on GP kernels, also claimed that the universe is a Gaussian process (???)</p>
<p>Briefly visited StarAI tutorial but the middle was too technical for my background.</p>
<h1 id="tutorials-3">Tutorials 3</h1>
<p>Went to the probabilistic programming tutorial. Josh Tenenbaum first talked about learning intuitive physics and showed some cat and baby videos.</p>
<p>Vikash Mansinghka talked more about details. Slides on automatic statistician (David Duvenaud’s work), where priors on hyperparameters helped a lot. Their language is Venture. Some cool work on probabilistic graphics program, doing inference on the renderer.</p>
<h1 id="john-platt">John Platt</h1>
<p>John Platt from Google gave an amazing talk on using machine learning to help solve the energy problem. Great speaker - conveyed clearly the science that was needed behind energy economics, current failings, and potential solutions.</p>
<p>He then went on to talk about using Bayesian inference to aid in nuclear fusion engineering, which was awesome.</p>
<h1 id="posters">Posters</h1>
<p>The poster session was absolutely nuts. For the first hour or so, you couldn’t get near any poster due to crowding. Later it was much more reasonable.</p>
<p>Some posters I saw:</p>
<ul>
<li>Neural Expectation Maximization. They changed the sequential updates of EM into an RNN with the M-step gradient as input, lol.</li>
<li>Unsupervised Transformation Learning via Convex Relaxations. Cool direction in manifold learning where transformations between nearest neighbors are learned, instead of latent code approach.</li>
<li>Context Selection for Embedding Models. Learn embeddings for shopping items, modeled as a binary mask with count data. Inference on latents using the VAE setup.</li>
<li>Toward Multimodal Image-to-Image Translation. Bicycle GAN has been added to the cycle GAN zoo. It does matched style transfer but has many modes.</li>
<li>One-Shot Imitation Learning. Do supervised learning on expert actions, using attention over time steps to decide how to imitate at test time.</li>
<li>Variational Inference via Chi Upper Bound Minimization. Uses the CUBO upper bound together with the ELBO to get a sandwich estimator for the log likelihood. Trains better because the chi divergence captures tail behavior better.</li>
<li>Q-LDA: Uncovering Latent Patterns in Text-based Sequential Decision Processes. Model of text game with latent Dirichlet process, and a learned “reward” function. Does mirror descent and MAP inference.</li>
<li>InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations. GANs can now be used for imitation learning, and this adds a latent code that disentangles expert policy behaviors.</li>
</ul>
<p>Definitely missed a lot of posters because of the crowding issues (I couldn’t get close enough to ask questions), and also because there are just too many (>200).</p>
<p>On the other hand, the poster session is clearly the highlight of the conference. Lots of great work and good conversations all around.</p>
<h2 id="day-2">Day 2</h2>
<h1 id="brendan-frey">Brendan Frey</h1>
<p>Brendan Frey gave a talk on using machine learning in gene therapy and on speeding up clinical trials. He observed that problems in biology are unlike game playing: in biology, even humans are bad at the task.</p>
<h1 id="test-of-time">Test of time</h1>
<p>The author of the test-of-time award paper, Ali Rahimi, talked about bringing back scientific rigor in ML. He brought up Andrew Ng’s quote that “machine learning is the new electricity”, and suggested that “electricity” should be replaced with “alchemy”: like alchemy, deep learning has produced useful innovations, but like alchemy, it is built on a fundamentally flawed way of thinking.</p>
<p>Ali also asked the audience if anyone had ever built a neural net, tried to train it and failed, and felt bad for themselves (lol). He says “it’s not your fault, it’s SGD’s fault!”, meaning that there’s basically no theory why anything works or doesn’t work.</p>
<p>Overall, great talk on a trend in machine learning I have definitely been concerned about.</p>
<h1 id="morning-talks">Morning talks</h1>
<p>Two tracks: optimization and algorithms. I was mostly at the optimization talks.</p>
<ul>
<li>Tensor decomposition - understanding the landscape</li>
<li>Robust optimization, treat corrupted datasets with known noise types as optimization over choice of loss functions</li>
<li>Bayesian gradient optimization - use gradient information in GPs for acquisition function</li>
</ul>
<p>Second track on theory, and afternoon talks</p>
<ul>
<li>Best paper on subgame solving, which they used to beat top poker players</li>
<li>Stein estimator for model criticism (best paper)</li>
</ul>
<p>It’s pretty hard to keep track of everything since there’s two tracks and so many talks.</p>
<h1 id="kate-crawford">Kate Crawford</h1>
<p>Kate Crawford spoke in the afternoon about bias in machine learning from a societal perspective. She talked about how biased ML systems cause bad effects in two ways: on the individual level (e.g. people get denied mortgage because of demographic), and on a societal level (misrepresentation in culture, e.g. stereotypes).</p>
<p>Overall was a solid talk, really highlighted a lot of things ML researchers need to keep in mind as ML becomes more prevalent.</p>
<h1 id="posters-1">Posters</h1>
<p>Some posters that caught my eye:</p>
<ul>
<li>Active Bias: Training More Accurate Neural Networks by Emphasizing High Variance Samples. Update your learning based on training examples with bad prediction error. Andrew McCallum was at the poster</li>
<li>A-NICE-MCMC: Adversarial Training for MCMC. Treat Metropolis Hastings as learned process, use a neural net to learn transition function (?). Check paper details.</li>
<li>Identification of Gaussian Process State Space Models. Use inference net to train a GP.</li>
<li>Filtering Variational Objectives. By Chris Maddison, Rachit read this paper. I have yet to read</li>
<li>Poincare Embeddings for Learning Hierarchical Representations. Some crazy visuals on word embeddings on a manifold, didn’t get to look at the poster.</li>
<li>Learning Populations of Parameters. Kevin Tian’s poster! Example: independent coins each flipped twice, should not use maximum likelihood for all of them separately.</li>
</ul>
<p>Overall: lots of approximate inference posters. Tons of posters on GPs, really a hot topic these days. Posters on Stein method, something I need to learn about.</p>
<p>End of day 2: becoming really exhausted, so many posters and so little time! Also, the Flo Rida party was full which was super disappointing. I was looking forward to hearing My House live.</p>
<h2 id="day-3">Day 3</h2>
<h1 id="morning-session">Morning session</h1>
<p>Missed a lot of the morning due to sleeping in and talking to people. Some talk highlights:</p>
<ul>
<li>Generalization gap in large batch deep learning. Apparently small batches lead to large weight distances from initialization, but large batches don’t. This is an optimization issue. Need to read paper for details.</li>
<li>End to end differentiable proving. Using Prolog backward chaining. Try to learn proof success.</li>
</ul>
<h1 id="lunch-session">Lunch session</h1>
<p>I went with Rachit to a lunch session on Pyro, Pytorch distributions, and probtorch, which was led by the leads of all these frameworks (Noah Goodman, Soumith Chintala, Adam Paszke, Dustin Tran guest appearance). Some interesting discussion on the challenges and best way to integrate the distributions library. Looking forward to how this works out, I really love Pytorch :)</p>
<h1 id="pieter-abbeel">Pieter Abbeel</h1>
<p>Pieter Abbeel gave a talk on his RL work (of which there’s a ton!). I actually found the talk a bit hard to follow, even though I enjoy reading his work. Covered topics including MAML, meta learning, hierarchical policies, simulated data, hindsight experience replay. He really believes that meta learning is the answer, since big data beats human ingenuity with enough compute in the long run.</p>
<p>One highlight was when he showed the picture of the chocolate cake with a cherry on top (from last year?) and replaced it with a cake covered with cherries. The cake is unsupervised learning, the icing supervised, and the cherry reinforcement learning – he wants more cherries!!</p>
<h1 id="afternoon-talks">Afternoon talks</h1>
<p>Lots of great talks this afternoon. Sadly can only attend one of two sessions.</p>
<ul>
<li>ELF game simulator. Apparently works faster than everything else out there.</li>
<li>Imagination augmented deep RL. Basically model-based RL with a simulation policy for rollouts. There is a rollout encoder RNN that produces an “imagination augmented code”, which is combined with a model-free code to form a loss. Policy gradient training. At the end, an audience member posing as the “hype police” asked the speakers to retract the word “imagination” – I honestly felt this was really rude. If you want to challenge someone’s work, you can do it respectfully or offline.</li>
<li>Off policy evaluation for the following problem: user interacts with website, website can choose combinatorial list of actions (slates?), get reward. This is a bandit; they come up with a new method to do off-policy estimation. Well done talk!</li>
<li>Hidden Parameter MDPs (Finale’s work). Pacing of talk was a bit weird. Model based RL. Used a Bayesian neural net instead of linear combination of GPs for the model, trained policy with Double DQN.</li>
<li>Inverse reward design. One of the coolest ideas I think, though even after reading the paper I have trouble understanding exactly what their technical methods are. Essentially, they put a Bayesian prior on reward functions, so that even if the specified reward is flawed, the agent can infer the intended reward from the posterior.</li>
<li>Interruptible RL. Puts human in the loop. Defines two concepts: safe interruptibility (policy should be optimal in non-interruptible environment), dynamic safe interruptibility (should not prevent exploration, policy update should not depend on interruption)</li>
<li>Uniform PAC guarantees for RL. Combine regret bounds + PAC framework into uniform PAC. For all epsilon (uniformly), bound number of episodes not within epsilon in probability.</li>
<li>Repeated Inverse RL. Something about multiple tasks: learner chooses a task, human reveals optimal policy for given reward function. Adversary chooses task, learner proposes policy and minimizes mistakes.</li>
<li>RL exploration in high dim state space. Classify novelty of states from past experience, adding a reward bonus for novelty. Use exemplar one vs all models, to do density estimation. (Seems dangerous to do this reward thing?)</li>
<li>Regret minimization in MDPs with options. Options are temporally extended actions, can be harmful. They come up with a way to choose options to minimize regret.</li>
<li>Transfer in RL with multiple tasks. Use generalized policy improvement (GPI) as uniform argmax over set of MDP policies. Successor features (weighted discounted reward) to evaluate policy across all tasks.</li>
</ul>
<h1 id="posters-2">Posters</h1>
<p>Was really tired, didn’t process very many of these. High level: saw a whole section on online learning.</p>
<ul>
<li>Sticking the landing (from David Duvenaud’s group). Apparently adding one stop-gradient can lower the REINFORCE gradient variance once the ELBO is near its optimum. This is counterintuitive, since most people use the closed-form KL instead of the full Monte Carlo expression for the ELBO.</li>
<li>Self supervised learning of motion capture. Some neural net to learn parameters of a human modeling mesh based on keypoint and image input.</li>
<li>Counterfactual fairness. This was a talk I missed. They use counterfactual reasoning to see what would happen to certain datapoints of people if their race etc. were switched. Super cool!</li>
</ul>
<h2 id="day-4">Day 4</h2>
<h1 id="yee-whye-teh">Yee Whye Teh</h1>
<p>Yee Whye Teh gave a nice talk on Bayesian methods and deep learning. He highlighted three projects: first, a Bayesian neural net where the posterior was estimated with a distributed system of workers. The workers do MCMC updates on parameters and send messages to a server, which does EP. He made the interesting point that since parameters are symmetric/non-identifiable, we should look for priors over functions (GPs?), not over parameters as we do now.</p>
<p>The other two projects were Concrete (Maddison) and Filtering Variational Objectives. Even more reason to read about FVO.</p>
<h1 id="morning-talks-1">Morning talks</h1>
<p>Some highlights:</p>
<ul>
<li>Masked autoregressive flow for density estimation. In standard inverse autoregressive flow, try to make invertible neural net. Usually use Gaussians, which aren’t flexible, can fix this with stacking (masked). Real NVP faster (parallel) but less flexible than masked.</li>
<li>Deep sets. Learn equivariant and permutation invariant functions on sets. Surprised they didn’t even mention PointNet, a paper with similar motivations in CVPR (even if they are concurrent work). One cool application with margin training on words from LDA topics, which apparently does better than word2vec.</li>
</ul>
<h1 id="symposia">Symposia</h1>
<p>The afternoon was a bunch of symposia. There were four tracks, all of which had interesting topics and speakers, so I had to jump around a bunch. The four were: (1) Deep RL, (2) Interpretable ML, (3) Kinds of intelligence, (4) Meta-learning.</p>
<ul>
<li>Deep RL: learning with latent language representations (Jacob Andreas). Can do few shot policy learning, by learning concepts with strings. Didn’t give many details, so I have to read paper.</li>
<li>Intelligence: Demis Hassabis talked about AlphaZero etc. He is a pretty good speaker.</li>
<li>Interpretable ML: Explainable ML challenge (Been Kim). Apparently FICO released a dataset, and they want to be able to build a model that’s not only predictive but interpretable, so that if someone has bad credit we can understand why. This has implications if ML becomes regulated in the future, since uninterpretable models won’t be acceptable.</li>
<li>Meta-learning: Max Jaderberg on hyperparameter search methods.</li>
<li>Meta-learning: Pieter Abbeel gave the same talk as yesterday…</li>
<li>Meta-learning: Schmidhuber gave an incomprehensible talk on what meta-learning is.</li>
<li>Meta-learning: Satinder Singh talked about the reward design problem. Agent has an internal representation of reward, and this can be learned through policy gradient somehow.</li>
<li>Meta-learning: Ilya Sutskever gave a hype talk on self play and how it will lead to agents with better intelligence fast (surpassing humans). I don’t really buy that it will work so easily.</li>
</ul>
<p>Food in between was much better, they also gave us a coupon for food trucks.</p>
<p>Evening:</p>
<ul>
<li>Cynthia Dwork came to talk about fairness in ML. David Runciman (philosopher) talked about how AI is like artificial agents (e.g. corporations). Panel with Zoubin Ghahramani, talked a bit about what AI should be… not a clear consensus</li>
</ul>
<p>Our workshop talk is tomorrow so we need to prepare for that! Time to get grilled by the experts in the field.</p>
<h2 id="day-5">Day 5</h2>
<p>Workshop day! Rachit and I gave our talk! We practiced with Finale briefly before.
The talk itself was short and quite uneventful. Zoubin Ghahramani sat in the front row which was pretty awesome… I’m sure other high profile researchers were in the audience.</p>
<p>Some of the other talk highlights:</p>
<ul>
<li>Netflix uses VAEs for recommendation. They use KL annealing hacks; I can’t really see this working in practice.</li>
<li>Andy Miller: Taylor residuals for low variance gradients</li>
<li>Yixin Wang: consistency of variational Bayes</li>
</ul>
<p>A couple other talks I went to at other workshops:</p>
<ul>
<li>Percy Liang: talked about adversaries and using natural language as a collaboration tool. Work on SHRDLURN</li>
<li>QA: computer system beat Quizbowl humans</li>
</ul>
<p>Approximate inference panel: a fun discussion with Matt Hoffman (moderator), Tim Salimans, Dave Blei, Zoubin, Katherine Heller, and Max Welling. Turns out all of these people are incredibly funny.</p>
<ul>
<li>Zoubin introduced himself as a dinosaur of machine learning, Max Welling a half-dinosaur, Blei a woolly mammoth.</li>
<li>Dave Blei randomly talked about Alan Turing and codebreaking with Bayesian statistics, recommending this book: <a href="https://www.amazon.com/Between-Silk-Cyanide-Codemakers-1941-1945/dp/068486780X">Between Silk and Cyanide</a></li>
<li>Welling: we should not miss exponential families; deep learning is good. Dave Blei countered that exponential families can still be useful for some discrete data.</li>
<li>Zoubin: we should do transfer testing. We can’t assume test distribution will be same as train in the real world since it’s so much bigger</li>
<li>Zoubin: deep learning is basically nonparametric since almost infinite parameters</li>
<li>Zoubin: estimate errors for discriminative models? Some notion of p(x) for these? If you stick a chicken in MNIST classifier it’s neither 1 nor 7 (lol)</li>
</ul>
<p>It turns out that all the workshops have amazing programs and I missed a lot of cool invited talks (a lot of the time we had to be at our poster). For example, other workshops had great speakers like Jeff Dean, Fei-Fei, and Ian Goodfellow, just to name a few. It’s too bad they’re all concurrent… the same thing will probably happen tomorrow.</p>
<h2 id="day-6">Day 6</h2>
<p>Last day of NIPS! Started the day off early with Bayesian deep learning workshop.</p>
<ul>
<li>Dustin Tran talked about Edward and probabilistic programming. Seems like the main difficulty is that there’s still no generalized inference procedure.</li>
<li>Really interesting contributed talk: tighter lower bounds are not necessarily better. For the IWAE, higher K can actually lower the signal-to-noise ratio of the gradient, forcing it to 0.</li>
<li>Finale: discussed horseshoe priors for Bayesian neural nets. Better tail / sparsity than Gaussian priors.</li>
<li>Welling: highlighted 3 directions: deep GPs, information bottleneck, variational dropout.</li>
</ul>
<p>Also checked out the theory of deep learning workshop.</p>
<ul>
<li>Peter Bartlett discussed the generalization paper from ICLR. Examine classification margins, scale by network operator norm in some sense.</li>
</ul>
<p>Theory posters: very little conclusive theory; most work focuses on small neural networks, with a lot of empirical work to test hypotheses.</p>
<ul>
<li>Homological properties of neural nets, some truly crazy stuff</li>
<li>Neural nets are robust to some random label noise in data, as in it gracefully degrades performance</li>
<li>Madry tried to get metrics for GANs (besides Inception score, which I actually don’t know). Simple binary classification problems on features of data, show that many common GANs can’t properly match the true distribution classification.</li>
</ul>
<p>Bayesian posters:</p>
<ul>
<li>Matt Hoffman on beta-VAEs. Showed that the beta-VAE loss corresponds to some other prior. Standard beta loss then corresponds to trying to get the posterior without including an entropy promoting term. Should read more about this (ELBO surgery).</li>
</ul>
<p>Later, jumped back and forth between workshops.</p>
<ul>
<li>Matt Hoffman: SGD as approximate Bayesian inference. By making some Gaussian assumptions on the gradients, SGD becomes an OU process, like MCMC. Iterate averaging is optimal, so we can’t beat linear time.</li>
<li>Russ Salakhutdinov: deep kernel learning. Basically in GPs, also learn the hyperparameters of the kernel.</li>
<li>Sham Kakade: policy gradient guarantees for LQR model in RL control. No analogues of robustness in MDPs?</li>
</ul>
<p>Theory Panel: Sanjeev Arora (moderator), Sham Kakade, Percy Liang, Salakhutdinov, Yoshua Bengio (20 min), Peter Bartlett</p>
<ul>
<li>On adversarial examples: Bengio: we should use p(x) so our discriminative model doesn’t fail as much.</li>
<li>Russ: pixels suck, so people used HOG / SIFT features. Now we use CNNs which are pixel based…</li>
<li>Discussion on combining deep learning with rule based methods (60s-70s).</li>
</ul>
<p>Finally, to wrap up, there was an awesome reception. The Imposteriors, a band made of statistics professors (including Michael Jordan!), performed live music! Also, David Blei made a guest appearance on accordion! Truly a highlight of the conference.</p>
<p>Will write up a summary / highlights soon, as this post is very long. The main takeaway for me was putting a face to all of the famous researchers.</p>
<h2 id="welcome">Welcome (2017-11-18)</h2>
<p>Welcome to my blog! I’m starting this space so that I can:</p>
<ul>
<li>organize my ideas</li>
<li>provide interesting and educational content to the public</li>
<li>practice writing</li>
</ul>
<p>Some of my interests include machine learning, computer science, and math, so I will blog about these in addition to other topics.</p>
<p>Hope you enjoy! Please email me with any feedback :)</p>
<h2 id="rl-as-optimization">Reinforcement Learning as Optimization (2017-11-18)</h2>
<div hidden="">
$$
\newcommand{\boldA}{\mathbf{A}}
\newcommand{\boldB}{\mathbf{B}}
\newcommand{\boldC}{\mathbf{C}}
\newcommand{\boldD}{\mathbf{D}}
\newcommand{\boldE}{\mathbf{E}}
\newcommand{\boldF}{\mathbf{F}}
\newcommand{\boldG}{\mathbf{G}}
\newcommand{\boldH}{\mathbf{H}}
\newcommand{\boldI}{\mathbf{I}}
\newcommand{\boldJ}{\mathbf{J}}
\newcommand{\boldK}{\mathbf{K}}
\newcommand{\boldL}{\mathbf{L}}
\newcommand{\boldM}{\mathbf{M}}
\newcommand{\boldN}{\mathbf{N}}
\newcommand{\boldO}{\mathbf{O}}
\newcommand{\boldP}{\mathbf{P}}
\newcommand{\boldQ}{\mathbf{Q}}
\newcommand{\boldR}{\mathbf{R}}
\newcommand{\boldS}{\mathbf{S}}
\newcommand{\boldT}{\mathbf{T}}
\newcommand{\boldU}{\mathbf{U}}
\newcommand{\boldV}{\mathbf{V}}
\newcommand{\boldW}{\mathbf{W}}
\newcommand{\boldX}{\mathbf{X}}
\newcommand{\boldY}{\mathbf{Y}}
\newcommand{\boldZ}{\mathbf{Z}}
\newcommand{\bolda}{\mathbf{a}}
\newcommand{\boldb}{\mathbf{b}}
\newcommand{\boldc}{\mathbf{c}}
\newcommand{\boldd}{\mathbf{d}}
\newcommand{\bolde}{\mathbf{e}}
\newcommand{\boldf}{\mathbf{f}}
\newcommand{\boldg}{\mathbf{g}}
\newcommand{\boldh}{\mathbf{h}}
\newcommand{\boldi}{\mathbf{i}}
\newcommand{\boldj}{\mathbf{j}}
\newcommand{\boldk}{\mathbf{k}}
\newcommand{\boldl}{\mathbf{l}}
\newcommand{\boldm}{\mathbf{m}}
\newcommand{\boldn}{\mathbf{n}}
\newcommand{\boldo}{\mathbf{o}}
\newcommand{\boldp}{\mathbf{p}}
\newcommand{\boldq}{\mathbf{q}}
\newcommand{\boldr}{\mathbf{r}}
\newcommand{\bolds}{\mathbf{s}}
\newcommand{\boldt}{\mathbf{t}}
\newcommand{\boldu}{\mathbf{u}}
\newcommand{\boldv}{\mathbf{v}}
\newcommand{\boldw}{\mathbf{w}}
\newcommand{\boldx}{\mathbf{x}}
\newcommand{\boldy}{\mathbf{y}}
\newcommand{\boldz}{\mathbf{z}}
\newcommand{\mcA}{\mathcal{A}}
\newcommand{\mcB}{\mathcal{B}}
\newcommand{\mcC}{\mathcal{C}}
\newcommand{\mcD}{\mathcal{D}}
\newcommand{\mcE}{\mathcal{E}}
\newcommand{\mcF}{\mathcal{F}}
\newcommand{\mcG}{\mathcal{G}}
\newcommand{\mcH}{\mathcal{H}}
\newcommand{\mcI}{\mathcal{I}}
\newcommand{\mcJ}{\mathcal{J}}
\newcommand{\mcK}{\mathcal{K}}
\newcommand{\mcL}{\mathcal{L}}
\newcommand{\mcM}{\mathcal{M}}
\newcommand{\mcN}{\mathcal{N}}
\newcommand{\mcO}{\mathcal{O}}
\newcommand{\mcP}{\mathcal{P}}
\newcommand{\mcQ}{\mathcal{Q}}
\newcommand{\mcR}{\mathcal{R}}
\newcommand{\mcS}{\mathcal{S}}
\newcommand{\mcT}{\mathcal{T}}
\newcommand{\mcU}{\mathcal{U}}
\newcommand{\mcV}{\mathcal{V}}
\newcommand{\mcW}{\mathcal{W}}
\newcommand{\mcX}{\mathcal{X}}
\newcommand{\mcY}{\mathcal{Y}}
\newcommand{\mcZ}{\mathcal{Z}}
\newcommand{\reals}{\mathbb{R}}
\newcommand{\integers}{\mathbb{Z}}
\newcommand{\rationals}{\mathbb{Q}}
\newcommand{\naturals}{\mathbb{N}}
\newcommand{\ident}{\boldsymbol{I}}
\newcommand{\bzero}{\boldsymbol{0}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\bdelta}{\boldsymbol{\delta}}
\newcommand{\boldeta}{\boldsymbol{\eta}}
\newcommand{\bkappa}{\boldsymbol{\kappa}}
\newcommand{\bgamma}{\boldsymbol{\gamma}}
\newcommand{\bmu}{\boldsymbol{\mu}}
\newcommand{\bphi}{\boldsymbol{\phi}}
\newcommand{\bpi}{\boldsymbol{\pi}}
\newcommand{\bpsi}{\boldsymbol{\psi}}
\newcommand{\bsigma}{\boldsymbol{\sigma}}
\newcommand{\btheta}{\boldsymbol{\theta}}
\newcommand{\bxi}{\boldsymbol{\xi}}
\newcommand{\bGamma}{\boldsymbol{\Gamma}}
\newcommand{\bLambda}{\boldsymbol{\Lambda}}
\newcommand{\bOmega}{\boldsymbol{\Omega}}
\newcommand{\bPhi}{\boldsymbol{\Phi}}
\newcommand{\bPi}{\boldsymbol{\Pi}}
\newcommand{\bPsi}{\boldsymbol{\Psi}}
\newcommand{\bSigma}{\boldsymbol{\Sigma}}
\newcommand{\bTheta}{\boldsymbol{\Theta}}
\newcommand{\bUpsilon}{\boldsymbol{\Upsilon}}
\newcommand{\bXi}{\boldsymbol{\Xi}}
\newcommand{\bepsilon}{\boldsymbol{\epsilon}}
\newcommand{\on}{\operatorname}
\newcommand{\E}{\mathbb{E}}
\newcommand{\Var}{\on{Var}}
$$
</div>
<p>When I first learned about deep reinforcement learning, I was extremely
confused how RL could fit in with deep learning – the two topics seem
totally different, at least in the ways they’re presented! The trick is
to understand that RL is simply optimizing a particular loss function,
and we can thus view it in the context of optimization and make a direct
comparison to supervised learning. In this post, we will show how the
REINFORCE algorithm, used commonly in reinforcement learning, is
actually a very general idea.</p>
<h1 id="classical-reinforcement-learning">Classical Reinforcement Learning</h1>
<p>First, a quick primer on reinforcement learning.</p>
<p>The classical setup of reinforcement learning is that we have an agent
who lives in an environment with state space $\mcS$ and can take actions
from $\mcA$, which cause it to transition to a new state. Depending on
the state and action pair, it then receives some reward $r \in \reals$.</p>
<p>We can formalize these ideas with Markov decision processes (MDPs). An
MDP has the following parameters:</p>
<ul>
<li>
<p>time horizon $T$, which could be infinite</p>
</li>
<li>
<p>action space $\mcA$</p>
</li>
<li>
<p>state space $\mcS$</p>
</li>
<li>
<p>transition distribution $p(s’ | s, a)$ for all
$s,s’ \in \mcS, a \in \mcA$ (this is the Markov ingredient)</p>
</li>
<li>
<p>reward function $r : \mcS \times \mcA \to \reals$ for a given state
and action pair (possibly stochastic)</p>
</li>
<li>
<p>time discount factor $\gamma \in (0,1)$</p>
</li>
</ul>
<p>Suppose we index time with $t = 0, 1, 2, \ldots, T$. The agent starts in
state $s_0$. At time $t$, the agent is in state $s_t$ and decides to
make action $a_t$, receiving reward $r_t = r(s_t, a_t)$. The agent wants
to maximize their total expected discounted reward</p>
<script type="math/tex; mode=display">\mathbb{E}_{s_t, a_t, r_t} \left[ \sum_{t=0}^T \gamma^t r_t \right]</script>
<p>where the expectation is over all sampled <em>trajectories</em>
$s_0, a_0, s_1, a_1, \ldots, s_T, a_T$. To be clear
$s_{t+1} \sim p(s_{t+1}|s_t, a_t)$ throughout the trajectory, and
$r_t \sim r(s_t, a_t)$.</p>
<p>We have here an objective we want to optimize! We’ll revisit this in the
deep learning context, but first we take a look at how we solve it
traditionally.</p>
<h1 id="solving-the-mdp">Solving the MDP</h1>
<p>To solve the MDP, the agent has a policy function $\pi : \mcS \to \mcA$
that selects the action to make in a given state (possibly stochastic).</p>
<p>In general, we assume that the agent doesn’t know the reward function
$r(s,a)$ or the transition distribution $p(s’ | s,a)$ – if they did,
solving the MDP would be possible through dynamic programming methods.
The agent can try to estimate the reward function and transition
distribution – this is known as <em>model-based</em> reinforcement learning,
aptly named since we try to guess the MDP model of the environment.</p>
<p>We instead cover <em>model-free</em> RL, where we estimate $\pi$ directly.
The <strong>overall strategy</strong> to learn the function
$\pi : \mcS \to \mcA$ is to test out actions $a$ and see what rewards we
get (exploration), and once we’ve collected enough evidence, pick the
best action $a$ at each step (exploitation).</p>
<h1 id="estimating-the-policy-and-q-learning">Estimating the Policy and Q-learning</h1>
<p>Note that we have a lot of freedom in our learning strategy.</p>
<ol>
<li>
<p>How do we keep track of our exploration (do we need to save all of
our sample trajectories)?</p>
</li>
<li>
<p>How are we going to do exploration (do we make actions completely
randomly)?</p>
</li>
</ol>
<p>Some preliminary answers:</p>
<ol>
<li>
<p>Short answer: no. Instead of saving our entire sample trajectory, it
suffices to have training data in the form $(s, a, r, s’)$ – this is
because the transition distribution is Markovian: the reward $r$ and
next state $s’$ depend only on $(s, a)$, not on whatever came
before.</p>
</li>
<li>
<p>There are several exploration strategies, the most common of which
is $\epsilon$-greedy, where we pick an action uniformly at random
with probability $\epsilon$ and pick our current best estimate
action with probability $1-\epsilon$.</p>
<p>Another is to make actions in proportion to states we already think
are “good” - i.e. if we got a lot of reward from a certain action
$a_0$, we are more likely to do $a_0$ (but still do other actions
with some probability!).</p>
</li>
</ol>
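<p>As a minimal sketch of $\epsilon$-greedy selection (the Q-table and its values below are made up for illustration):</p>

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

# Hypothetical Q-table: action 1 currently looks best in state 0.
Q = {(0, 0): 0.2, (0, 1): 0.7}

random.seed(0)
picks = [epsilon_greedy(Q, 0, [0, 1], epsilon=0.1) for _ in range(1000)]
print(picks.count(1))  # mostly the greedy action, with occasional exploration
```

<p>Most of the 1000 picks are the greedy action; a small fraction explore, which is what keeps new evidence coming in.</p>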
<p>To effectively learn a policy, we can use the trajectory data to
estimate either a value function $V(s)$ or a Q-function $Q(s,a)$ to
summarize our data. $V(s)$ estimates the best total reward we can get if
we start from state $s$ and act optimally; $Q(s,a)$ tracks the same if
we start with action $a$ in state $s$.</p>
<p>Note the direct relation $V(s) = \max_{a \in \mcA} Q(s,a)$. Also, at
optimality we have</p>
<script type="math/tex; mode=display">V^*(s) = \max_{a \in \mcA} \left[ r(s,a) + \gamma \mathbb{E}_{s' \sim p(s'|s,a)} [V^*(s')] \right]</script>
<script type="math/tex; mode=display">Q^*(s,a) = r(s,a) + \gamma \mathbb{E}_{s' \sim p(s'|s,a)} \left[ \max_{a' \in \mcA} Q^*(s', a') \right]</script>
<p>by recursively defining $V^*$ and $Q^*$ as optimal. These relations are
known as the <strong>Bellman equations</strong>, and they are necessary and
sufficient for optimality.</p>
<p>If we can learn either of these, the optimal policy will be
$\pi(s) = \arg\max_{a \in \mcA} Q(s,a)$, or equivalently
$\pi(s) = \arg\max_{a \in \mcA} \left( r(s,a) + \gamma \mathbb{E}_{s' \sim p(s' | s, a)}[ V(s') ] \right)$.</p>
<h4 id="q-learning">Q-learning</h4>
<p>One of the standard techniques for reinforcement learning is Q-learning.
We initialize our table of $Q(s,a)$ values to 0, so that our parameters
are $\theta_{s,a} = Q(s,a)$. If we’re in state $s$, make action $a$, and
get reward $r$, we make the update</p>
<script type="math/tex; mode=display">Q(s,a) \gets Q(s,a) + \alpha \left(r + \gamma \max_{a' \in \mcA} Q(s', a') - Q(s,a) \right)</script>
<p>where $\alpha$ is a learning rate.</p>
<p>The astute reader will realize that this update rule is actually
gradient descent on some kind of least squares objective! Specifically,
the loss is</p>
<script type="math/tex; mode=display">\mcL = \mathbb{E}_{s,a}\left[ \left( Q(s,a) - (r(s,a) + \gamma \mathbb{E}_{s' \sim p(s'|s,a)}[ \max_{a'} Q(s', a') ]) \right)^2 \right]</script>
<p>At optimality, we have that</p>
<script type="math/tex; mode=display">Q(s,a) \equiv r + \gamma \mathbb{E}_{s' \sim p(s'|s,a)} \max_{a'} Q(s', a')</script>
<p>which is exactly the Bellman condition! In Q-learning, we try to
satisfy the Bellman condition by minimizing $\mcL$: we sample
$s' \sim p(s' | s,a)$ and perform gradient descent on the parameters
$\theta_{s,a} = Q(s,a)$. Treating the sampled target
$r + \gamma \max_{a'} Q(s', a')$ as a constant (a <em>semi-gradient</em>
update), the negative gradient is, up to a factor of 2,</p>
<script type="math/tex; mode=display">-\frac{1}{2}\frac{\partial \mcL}{\partial \theta_{s,a}} \approx r + \gamma \max_{a' \in \mcA} Q(s', a') - Q(s,a)</script>
<p>which is exactly the quantity in our update rule. Thus the Q-learning
update is well motivated.</p>
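<p>To make the tabular update concrete, here is a sketch on a toy deterministic MDP (the MDP and all constants are invented for illustration): from either of two states, action 1 gives reward 1 and moves to state 1, while action 0 gives reward 0 and moves to state 0.</p>

```python
import random

def step(state, action):
    """Toy deterministic MDP: reward 1 and go to state 1 iff action is 1."""
    return (1.0 if action == 1 else 0.0), action

gamma, alpha = 0.9, 0.5
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}  # table initialized to 0

random.seed(0)
s = 0
for _ in range(5000):
    a = random.choice((0, 1))  # pure exploration, for simplicity
    r, s_next = step(s, a)
    target = r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])  # the tabular Q-learning update
    s = s_next

print(Q[(0, 1)], Q[(0, 0)])  # approaches the Bellman fixed point
```

<p>Here the Bellman fixed point is $Q^*(s, 1) = 1/(1 - \gamma) = 10$ and $Q^*(s, 0) = \gamma \cdot 10 = 9$ for both states, and the learned table converges to it.</p>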
<h1 id="tie-in-to-deep-learning">Tie-in to Deep Learning</h1>
<p>In our Q-learning loss, there is nothing that says we have to
parametrize it as a table of values $\theta_{s,a} = Q(s,a)$. We can
instead use a neural network $Q(s,a; \theta)$ where $\theta$ are
parameters of some arbitrary net (could have convolutions, MLPs, etc.).
Then, using training tuples $(s, a, r, s’)$, we can do backpropagation
in the usual way to get gradients $\partial \mcL / \partial \theta$.
Specifically, the gradients come out to be</p>
<script type="math/tex; mode=display">\frac{\partial \mcL}{\partial \theta} := (r + \gamma \max_{a' \in \mcA} Q(s', a') - Q(s,a)) \frac{\partial Q(s, a; \theta)}{\partial \theta}</script>
<p>That’s all there is to deep Q-learning! You have flexibility to choose
how to encode your $s,a$ pair, as well as come up with all kinds of
learning schedules, but the core idea is there. You can read about how
(Mnih et al. <a href="#ref-deepatari2015">2015</a>) use this technique to play
Atari games from pixels.</p>
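<p>The same update works for any differentiable parameterization. As a sketch, here is a single semi-gradient step with a linear approximator $Q(s, a; \theta) = \theta \cdot \phi(s, a)$ standing in for the neural net (the features and numbers are made up):</p>

```python
def q_value(theta, phi):
    """Linear function approximation: Q(s, a; theta) = theta . phi(s, a)."""
    return sum(t * f for t, f in zip(theta, phi))

def semi_gradient_step(theta, phi_sa, r, phis_next, gamma=0.9, lr=0.1):
    """One step of theta += lr * (target - Q) * dQ/dtheta, target held fixed."""
    target = r + gamma * max(q_value(theta, p) for p in phis_next)
    td_error = target - q_value(theta, phi_sa)
    return [t + lr * td_error * f for t, f in zip(theta, phi_sa)]

theta = [0.0, 0.0]
phi_sa = [1.0, 0.0]                    # features of the (s, a) we took
phis_next = [[0.0, 1.0], [1.0, 1.0]]   # features of (s', a') for each a'
theta = semi_gradient_step(theta, phi_sa, r=1.0, phis_next=phis_next)
print(theta)  # nudged toward the bootstrapped target
```

<p>With a real network, the hand-written $\partial Q / \partial \theta$ is simply replaced by backpropagation.</p>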
<h1 id="policy-gradient">Policy Gradient</h1>
<p>Instead of learning a proxy Q-function, we can try to solve the MDP
directly. Let’s revisit the objective we want to optimize:</p>
<script type="math/tex; mode=display">\mcL = \mathbb{E}_{s_t, a_t} \left[ \sum_{t=0}^T \gamma^t r(s_t, a_t) \right]</script>
<p>Now, we assume our policy is a parameterized stochastic map
$\pi(\cdot; \theta) : \mcS \to \mcA$, so that we draw actions according
to $a_t \sim \pi(a_t | s_t; \theta)$. Immediately, we run into a problem
with backpropagation – how do we backpropagate through the expectation?</p>
<p>The answer is in the classic paper Williams
(<a href="#ref-williams1992reinforce">1992</a>), also known as REINFORCE, the
score function estimator, or the likelihood ratio estimator.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> Writing
out the expectation over the $a_t$ in full:</p>
<script type="math/tex; mode=display">\mcL = \sum_t \gamma^t \mathbb{E}_{s_{t+1} \sim p(s_{t+1} | s_t, a_t)} \sum_{a_t} \pi(a_t | s_t; \theta) r(s_t, a_t)</script>
<p>Now we take the gradient with respect to $\theta$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\partial\mcL}{\partial \theta} &= \sum_t \gamma^t \mathbb{E}_{s_{t+1} \sim p(s_{t+1} | s_t, a_t)} \sum_{a_t} \frac{\partial}{\partial \theta} ( \pi(a_t | s_t; \theta) r(s_t, a_t)) \\
&= \sum_t \gamma^t \mathbb{E}_{s_{t+1} \sim p(s_{t+1} | s_t, a_t)} \sum_{a_t} \frac{\partial \pi(a_t | s_t; \theta)}{\partial \theta} r(s_t, a_t) \\
&= \sum_t \gamma^t \mathbb{E}_{s_{t+1} \sim p(s_{t+1} | s_t, a_t)} \sum_{a_t} \frac{\partial \pi(a_t | s_t; \theta)}{\partial \theta} \frac{\pi(a_t | s_t; \theta)}{\pi(a_t | s_t; \theta)} r(s_t, a_t) \\
&= \sum_t \gamma^t \mathbb{E}_{s_{t+1} \sim p(s_{t+1} | s_t, a_t)} \sum_{a_t} \pi(a_t | s_t; \theta) \frac{\partial \log \pi(a_t | s_t; \theta)}{\partial \theta} r(s_t, a_t) \\
&= \sum_t \gamma^t \mathbb{E}_{s_{t+1} \sim p(s_{t+1} | s_t, a_t),\, a_t \sim \pi(a_t | s_t; \theta)} \frac{\partial \log \pi(a_t | s_t; \theta)}{\partial \theta} r(s_t, a_t) \\
&=\mathbb{E}_{s_{t+1} \sim p(s_{t+1} | s_t, a_t),\, a_t \sim \pi(a_t | s_t; \theta)}\left[ \sum_t \gamma^t \frac{\partial \log \pi(a_t | s_t; \theta)}{\partial \theta} r(s_t, a_t) \right]\end{aligned} %]]></script>
<p>Notice our trick: we used the fact that
$\frac{\partial \log y}{ \partial x } = \frac{\partial y}{\partial x} \frac{1}{y}$.
This allows us to efficiently sample our actions $a_t$ to calculate the
expectation instead of needing to compute a (potentially big) sum over
$a_t$.</p>
<p>Some intuition: there is no longer any <em>direct</em> gradient flow backwards
to our parameters. Instead, we need to obtain samples and compute
$\partial \log \pi(a_t | s_t; \theta) / \partial \theta$ and
$r(s_t, a_t)$ – here the signal will come from the reward in
$r(s_t, a_t)$. Good actions $a$ will lead to high reward, leading to a
bigger gradient and increasing the probability $\pi(a)$. Conversely, bad
actions with negative reward will lead to a negative gradient,
decreasing the probability of $\pi(a)$.</p>
<p>The gradients we compute here are known as <em>policy gradients</em> – a
fitting name since we differentiate the log probability of the policy.</p>
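<p>As a sanity check on the estimator, here is a sketch on a one-step problem (a two-armed bandit with a softmax policy; all numbers are invented). The Monte Carlo average of $\frac{\partial \log \pi(a; \theta)}{\partial \theta}\, r(a)$ should match the exact gradient of the expected reward:</p>

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

rewards = [1.0, 0.0]       # toy bandit: action 0 pays, action 1 does not
logits = [0.0, 0.0]        # policy parameters theta
probs = softmax(logits)    # pi = [0.5, 0.5]

# For a softmax policy, d/d theta_i log pi(a) = 1[i == a] - pi(i).
random.seed(0)
n = 20000
grad = [0.0, 0.0]
for _ in range(n):
    a = random.choices(range(2), weights=probs)[0]  # sample an action
    for i in range(2):
        grad[i] += ((1.0 if i == a else 0.0) - probs[i]) * rewards[a] / n

# Exact gradient of E[r] = sum_a pi(a) r(a) at these logits: [0.25, -0.25]
print(grad)
```

<p>The estimate pushes up the logit of the rewarded action and down the other – exactly the behavior described above.</p>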
<h4 id="aside-on-control-variates">Aside on control variates</h4>
<p>The gradient we just computed is unbiased – sampling has the right
expectation. In practice, people often sample a single trajectory to
estimate the gradient, which makes the variance very high. A standard
trick to reduce it is to subtract a <em>baseline</em> from the reward,
an instance of the method of <em>control variates</em>.</p>
<p>Simply, if we want to get samples for
$\mathbb{E}_{a_t \sim \pi(a_t | s_t; \theta)} \left[ \frac{\partial \log \pi(a_t | s_t; \theta)}{\partial \theta} r(s_t, a_t) \right]$,
this turns out to be equal to</p>
<script type="math/tex; mode=display">\mathbb{E}_{a_t \sim \pi(a_t | s_t; \theta)} \left[ \frac{\partial \log \pi(a_t | s_t; \theta)}{\partial \theta} (r(s_t, a_t) - b) \right]</script>
<p>for any $b$ constant with respect to $a_t$. This is because</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}_{a_t} \left[ \frac{\partial\log \pi(a_t | s_t; \theta)}{\partial \theta} b \right] &= b \sum_{a_t} \frac{\partial\log \pi(a_t | s_t; \theta)}{\partial \theta} \pi(a_t | s_t; \theta) \\
&= b\sum_{a_t} \frac{\partial \pi(a_t | s_t; \theta)}{\partial \theta} \frac{\pi(a_t | s_t; \theta)}{\pi(a_t | s_t; \theta)} \\
&= b\sum_{a_t} \frac{\partial \pi(a_t | s_t; \theta)}{\partial \theta} \\
&= b\frac{\partial}{\partial\theta} \sum_{a_t} \pi(a_t | s_t; \theta) = b\frac{\partial}{\partial\theta} 1 = 0\\\end{aligned} %]]></script>
<p>So we can freely pick $b$ such that the estimate has the lowest variance; finding the optimal $b$ is outside the scope of this post. However, we
can interpret what $b$ is: instead of our reward $r(s_t, a_t)$ we have
$r(s_t, a_t) - b$, so we can think of $b$ as a baseline reward. In fact,
if we write $b := V(s_t)$ then we can interpret $r(s_t, a_t) - V(s_t)$
as an “advantage function”, i.e. the advantage of choosing $a_t$ in
state $s_t$.</p>
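<p>To see the variance reduction numerically, here is a sketch on a two-armed bandit with a uniform softmax policy (all numbers invented). Subtracting a baseline $b$ from the reward leaves the mean of the estimate unchanged but can shrink its variance dramatically:</p>

```python
import random

probs, rewards = [0.5, 0.5], [1.0, 0.0]  # toy policy and rewards

def sample_grad0(b):
    """One sample of the d/d theta_0 estimate with baseline b (softmax policy)."""
    a = random.choices(range(2), weights=probs)[0]
    return ((1.0 if a == 0 else 0.0) - probs[0]) * (rewards[a] - b)

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

random.seed(0)
no_baseline = [sample_grad0(b=0.0) for _ in range(20000)]
random.seed(0)
with_baseline = [sample_grad0(b=0.5) for _ in range(20000)]

m0, v0 = mean_var(no_baseline)
m1, v1 = mean_var(with_baseline)
print(v1 < v0)  # → True: same mean (the true gradient 0.25), lower variance
```

<p>In this symmetric example $b = 0.5$ happens to drive the variance all the way to zero; in general the optimal $b$ just makes it smaller.</p>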
<h1 id="stochastic-computation-graphs">Stochastic Computation Graphs</h1>
<p>In the RL setup, our actions were stochastic, and REINFORCE let us
backpropagate gradients to the parameters of the action distribution
$\pi(a_t | s_t; \theta)$. This technique is in fact very general: in a
neural network setting, if we have stochastic nodes sampled from
parameterized distributions, we can still compute gradients for those
parameters, even though sampling itself is not differentiable!</p>
<p>The paper, (Schulman et al. <a href="#ref-schulman2015backprop">2015</a>), neatly
explains how all of this fits together using stochastic computation
graphs.</p>
<p>Basically, a stochastic computation graph is any graph that has nodes
with random variables in addition to deterministic nodes, and the loss
function is defined as the sum of some endpoint nodes. There will then
be three kinds of parameters:</p>
<ol>
<li>
<p>parameters that come after stochastic nodes, with direct
deterministic paths to the loss;</p>
</li>
<li>
<p>parameters that precede stochastic nodes, whose effect on the loss
can’t be differentiated directly;</p>
</li>
<li>
<p>parameters that have both direct paths to the loss and paths
that go through stochastic nodes.</p>
</li>
</ol>
<p>For the first kind, parameters that have direct paths to the loss, we
can compute the gradients with backpropagation as usual.</p>
<p>For the second kind, we turn to REINFORCE to get gradients for parameters
that precede the stochastic nodes. Recall that in reinforcement learning, our
policy gradient had two main components: the reward $r(s_t, a_t)$ and the
log-probability gradient
$\frac{\partial \log \pi(a_t | s_t; \theta) }{ \partial \theta}$. In a
general stochastic computation graph, the reward is replaced by the sum
of all losses downstream of our stochastic sample – this
makes sense, since only those losses provide training signal for our
probability distribution.</p>
<p>For the third kind, we can sum up the gradients from each path due to
linearity of gradients.</p>
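<p>As a numerical sanity check on this decomposition, here is a sketch (all names and numbers are illustrative) of a parameter of the third kind: its total gradient is the REINFORCE term through a stochastic softmax node plus the ordinary gradient of a direct deterministic loss, and the Monte Carlo estimate matches the analytic gradient.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny stochastic computation graph (illustrative): theta feeds both a
# direct deterministic loss g(theta) and a stochastic node a ~ pi(.|theta)
# with downstream loss f(a). Total loss: E[f(a)] + g(theta).
f = np.array([0.0, 1.0, 3.0])        # downstream loss per action
theta = np.array([0.1, -0.2, 0.3])

def softmax(th):
    e = np.exp(th - th.max())
    return e / e.sum()

def total_grad(th, n=200_000):
    p = softmax(th)
    a = rng.choice(3, size=n, p=p)
    score = np.eye(3)[a] - p                  # grad of log pi(a|theta)
    reinforce_part = (score * f[a][:, None]).mean(axis=0)
    direct_part = 2 * th.sum() * np.ones(3)   # grad of g(theta) = (sum theta)^2
    return reinforce_part + direct_part       # sum the two paths

est = total_grad(theta)

# Analytic gradient of E[f] through a softmax is p_i (f_i - E[f]).
p = softmax(theta)
exact = p * (f - p @ f) + 2 * theta.sum()
print(est, exact)
```

The two paths are simply added, exactly as linearity of gradients suggests.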
<p>To make this concrete, here are some examples (from the original paper):</p>
<p><img src="/images/scgs.png" alt="image" /></p>
<p>where circles are stochastic nodes and squares are deterministic nodes.
Notice how each stochastic node contributes a gradient of the form
$\frac{\partial}{\partial \theta} \log p( \cdot | \theta)$ times the sum of
its downstream losses.</p>
<p>It turns out that stochastic computation graphs are everywhere! The
idea has many instantiations – hard attention models (Mnih et al.
<a href="#ref-mnih2014visualattention">2014</a>; Xu et al.
<a href="#ref-xu2015captioning">2015</a>), black box variational inference (BBVI)
(Ranganath, Gerrish, and Blei <a href="#ref-ranganath2014">2014</a>), and of course
policy gradients all use REINFORCE to backpropagate through the stochastic
components of their models. We can therefore differentiate stochastic
functions as complicated as we like: as long as we have a loss function
to optimize, REINFORCE plus backpropagation takes care of the rest.</p>
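<p>For instance, the BBVI connection can be sketched in a few lines. This is a deliberately minimal setup (a 1-D Gaussian with no data, not the full algorithm of Ranganath et al.): with $q(z) = \mathcal{N}(\mu, 1)$ and target $p(z) = \mathcal{N}(0, 1)$, the score-function estimator of the ELBO gradient needs only log-density evaluations, and here matches the analytic value $-\mu$.</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Minimal BBVI-style sketch (illustrative): variational q(z) = N(mu, 1),
# target p(z) = N(0, 1), so the ELBO is -KL(q || p) = -mu^2 / 2.
mu = 1.5
z = rng.normal(mu, 1.0, size=200_000)   # samples from q

log_p = -0.5 * z**2                      # log densities up to shared constants
log_q = -0.5 * (z - mu)**2
score = z - mu                           # grad of log q(z | mu) w.r.t. mu

# REINFORCE / score-function estimate of the ELBO gradient w.r.t. mu.
grad_elbo = np.mean(score * (log_p - log_q))
print(grad_elbo)                         # analytic value is -mu
```

No gradient of the model is ever taken; only samples and log-probabilities are needed, which is exactly what makes the estimator “black box”.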
<h1 id="conclusion">Conclusion</h1>
<p>Armed with our new understanding of RL, we can ask the following
question: why is it harder than supervised learning? The key difference
is the stochasticity: because our signal comes from a sampled scalar
reward, it’s hard to tell which action led directly to a better outcome.
In networks where further computation happens after the stochastic
decision (such as hard attention models), this problem is exacerbated.
Contrast this with supervised learning, where the signal is
backpropagated directly through the entire differentiable network.
So although REINFORCE is very general, in certain
cases it may not be well suited to the problem.</p>
<p>I wrote this blog post because the ideas of REINFORCE and
stochastic computation graphs are very general, yet it’s hard to see the
connections between their instantiations if you don’t already know about
them. While many other topics are outside the scope of this post,
I hope it helps you start making the key connections!</p>
<h1 id="references">References</h1>
<ul>
<li><a id="ref-mnih2014visualattention"></a> Mnih, Volodymyr, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. “Recurrent models of visual attention.” <em>Advances in Neural Information Processing Systems</em>, 2204—2212.</li>
<li><a id="ref-deepatari2015"></a> Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei a Rusu, Joel Veness, Marc G Bellemare, Alex Graves, et al. 2015. “Human-level control through deep reinforcement learning.” <em>Nature</em> 518 (7540): 529–33.</li>
<li><a id="ref-ranganath2014"></a> Ranganath, Rajesh, Sean Gerrish, and David M Blei. 2014. “Black Box Variational Inference.” <em>AISTATS</em> 33.</li>
<li><a id="ref-schulman2015backprop"></a> Schulman, John, Nicolas Heess, Theophane Weber, and Pieter Abbeel. 2015. “Gradient Estimation Using Stochastic Computation Graphs.” <em>NIPS</em>, 1–13.</li>
<li><a id="ref-williams1992reinforce"></a> Williams, Ronald J. 1992. “Simple statistical gradient-following algorithms for connectionist reinforcement learning.” <em>Machine Learning</em> 8 (3-4). Springer: 229–56.</li>
<li><a id="ref-xu2015captioning"></a> Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. 2015. “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” <em>ICML</em> 14: 77—81.</li>
</ul>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>If you try to read the original paper, it probably won’t make a
lot of sense to you, since the language it uses is outdated and
radically different from the current way we talk about things. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>