From cf6258501489d1fd28c48609b876aa141addbb06 Mon Sep 17 00:00:00 2001
From: Adithya Ganesh

-The generator and discriminator both play a two player minimax game, where the generator minimizes a two-sample test objective () and the discriminator maximizes the objective (). Intuitively, the generator tries to fool the discriminator to the best of its ability by generating samples that look indisginguishable from .
+The generator and discriminator play a two-player minimax game, where the generator minimizes a two-sample test objective () and the discriminator maximizes the objective (). Intuitively, the generator tries to fool the discriminator to the best of its ability by generating samples that look indistinguishable from .

Formally, the GAN objective can be written as:

-The JSD satisfies all properties of the KL, and has the additional perk that . With this distance metric, the optimal generator for the GAN objective becomces , and the optimal objective value that we can achieve with optimal generators and discriminators and is .
+The JSD satisfies all properties of the KL, and has the additional perk that . With this distance metric, the optimal generator for the GAN objective becomes , and the optimal objective value that we can achieve with optimal generators and discriminators and is .

-Although GANs have been successfully applied to several domains and tasks, working with them in practice is challenging because of their: (1) unstable optimization procedure, (2) potential for mode collapse, (3) difficulty in evaluation.
+Although GANs have been successfully applied to several domains and tasks, working with them in practice is challenging because of their: (1) unstable optimization procedure, (2) potential for mode collapse, (3) difficulty in performance evaluation.

-During optimization, the generator and discriminator loss often continue to oscillate without converging to a clear stopping point. Due to the lack of a robust stopping criteria, it is difficult to know when exactly the GAN has finished training. Additionally, the generator of a GAN can often get stuck producing one of a few types of samples over and over again (mode collapse).
+During optimization, the generator and discriminator losses often continue to oscillate without converging to a definite stopping point. Due to the lack of a robust stopping criterion, it is difficult to know when exactly the GAN has finished training. Additionally, the generator of a GAN can often get stuck producing one of a few types of samples over and over again (mode collapse).

Most fixes to these challenges are empirically driven, and there has been a significant amount of work put into developing new architectures, regularization schemes, and noise perturbations in an attempt to circumvent these issues. Soumith Chintala has a nice link outlining various tricks of the trade to stabilize GAN training.

-Next, we focus our attention to a few select types of GAN architectures and explore them in more detail.
+Next, we focus our attention on a few select types of GAN architectures and explore them in more detail.

The f-GAN optimizes the variant of the two-sample test objective that we have discussed so far, but using a very general notion of distance: the f-divergence. Given two densities and , the f-divergence can be written as:

-Intuitively, we can think about this objective as the generator trying to minimize the divergence estimate, while the discriminator tries to tighten the lower bound.
+Intuitively, we can think about this objective as the generator trying to minimize the divergence estimate, while the discriminator attempts to tighten the lower bound.

We won't worry too much about the BiGAN in these notes. However, we can think about this model as one that allows us to infer latent representations even within a GAN framework.

Intelligent agents are constantly generating, acquiring, and processing
data. This data could be in the form of images that we capture on our
phones, text messages we share with our friends, graphs that model
-interactions on social media, videos that record important events,
-etc. Natural agents excel at discovering patterns, extracting
+interactions on social media, or videos that record important events. Natural agents excel at discovering patterns, extracting
knowledge, and performing complex reasoning based on the data they observe. How
can we build artificial learning systems to do the same?
@@ -66,10 +65,10 @@ model with such a limited dataset is a highly underdetermined problem.
Fortunately, the real world is highly structured and automatically
discovering the underlying structure is key to learning generative
models. For example, we can hope to learn some basic artifacts about
-dogs even with just a few images: two eyes, two ears, fur etc. Instead
+dogs even with just a few images: two eyes, two ears, fur, etc. Instead
of incorporating this prior knowledge explicitly, we will hope the model
learns the underlying structure directly from data. There is no free
-lunch however, and indeed successful learning of generative models will
+lunch, however, and indeed successful learning of generative models will
involve instantiating the optimization problem in
$$(\ref{eq:learning_gm})$$ in a suitable way. In this course, we will be
primarily interested in the following questions:
@@ -78,9 +77,9 @@ primarily interested in the following questions:
* What is the objective function $$d(\cdot)$$?
* What is the optimization procedure for minimizing $$d(\cdot)$$?
-In the next few set of lectures, we will take a deeper dive into certain
+In the next few lectures, we will take a deeper dive into certain
families of generative models. For each model family, we will note how
-the representation is closely tied with the choice of learning objective
+the representation relates to the choice of learning objective
and the optimization procedure.
Inference
@@ -93,7 +92,7 @@ data.[^2]
While the range of applications to which generative models have been
used continue to grow, we can identify three fundamental inference
-queries for evaluating a generative model.:
+queries for evaluating a generative model:
1. *Density estimation:* Given a datapoint $$\mathbf{x}$$, what is the
probability assigned by the model, i.e., $$p_\theta(\mathbf{x})$$?
From e6a01428262088f8229d2561d69b97026f56b3a1 Mon Sep 17 00:00:00 2001
From: Adithya Ganesh
@@ -59,7 +59,7 @@ f_i(x_1, x_2, \ldots, x_{i-1}) =\sigma(\alpha^{(i)}_0 + \alpha^{(i)}_1 x_1 + \ld
where $$\sigma$$ denotes the sigmoid function and $$\theta_i=\{\alpha^{(i)}_0,\alpha^{(i)}_1, \ldots, \alpha^{(i)}_{i-1}\}$$ denotes the parameters of the mean function. The conditional for variable $$i$$ requires $$i$$ parameters, and hence the total number of parameters in the model is given by $$\sum_{i=1}^n i = O(n^2)$$. Note that the number of parameters is much smaller than the exponential complexity of the tabular case.
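To make the $$O(n^2)$$ parameter count concrete, here is a minimal numpy sketch of such a logistic mean function; the names `alphas` and `conditional_mean`, and the 0-indexing, are illustrative choices, not from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# alphas[i] holds theta_i = (alpha_0, alpha_1, ..., alpha_{i-1}): one bias plus
# one weight per preceding variable, so the total parameter count is O(n^2).
def conditional_mean(i, x, alphas):
    """Logistic mean function f_i for the (0-indexed) i-th variable."""
    bias, weights = alphas[i][0], alphas[i][1:]
    return sigmoid(bias + weights @ x[:i])

# Example: n = 3 variables with random parameters (alphas[i] has length i + 1).
rng = np.random.default_rng(0)
alphas = [rng.normal(size=i + 1) for i in range(3)]
x = np.array([1.0, 0.0, 0.0])
print(conditional_mean(2, x, alphas))  # p(x_3 = 1 | x_1, x_2)
```

Sampling then draws each $$x_i$$ from a Bernoulli with this mean, in order.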
-A natural way to increase the expressiveness of an autoregressive generative model is to use more flexible parameterizations for the mean function e.g., multi-layer perceptrons (MLP). For example, consider the case of a neural network with 1 hidden layer. The mean function for variable $$i$$ can be expressed as
+A natural way to increase the expressiveness of an autoregressive generative model is to use more flexible parameterizations for the mean function, e.g. multi-layer perceptrons (MLP). For example, consider the case of a neural network with 1 hidden layer. The mean function for variable $$i$$ can be expressed as
{% math %}
\mathbf{h}_i = \sigma(A_i \mathbf{x}_{< i} + \mathbf{c}_i)\\
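A corresponding sketch of this one-hidden-layer mean function, assuming the usual output layer $$f_i = \sigma(\boldsymbol{\alpha}^{(i)}\mathbf{h}_i + b_i)$$; all parameter containers here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_mean(i, x, A, c, alpha, b):
    """One-hidden-layer mean function for variable i (illustrative sketch)."""
    h = sigmoid(A[i] @ x[:i] + c[i])     # hidden layer h_i = sigma(A_i x_<i + c_i)
    return sigmoid(alpha[i] @ h + b[i])  # f_i(x_1, ..., x_{i-1})

# Example: hidden width d = 4, variable i = 2 (0-indexed).
rng = np.random.default_rng(0)
d, i = 4, 2
A = {i: rng.normal(size=(d, i))}
c = {i: rng.normal(size=d)}
alpha = {i: rng.normal(size=d)}
b = {i: 0.0}
x = np.array([1.0, 0.0, 1.0])
print(mlp_mean(i, x, A, c, alpha, b))
```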
@@ -105,14 +105,14 @@ Notice that NADE requires specifying a single, fixed ordering of the variables.
Learning and inference
======================
-Recall that learning a generative model involves optimizing the closeness between the data and model distributions. One commonly used notion of closeness in the KL divergence between the data and the model distributions.
+Recall that learning a generative model involves optimizing the closeness between the data and model distributions. One commonly used notion of closeness is the KL divergence between the data and the model distributions.
{% math %}
\min_{\theta\in \mathcal{M}}d_{KL}
(p_{\mathrm{data}}, p_{\theta}) = \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}} }\left[\log p_{\mathrm{data}}(\mathbf{x}) - \log p_{\theta}(\mathbf{x})\right]
{% endmath %}
-Before moving any further, we make two comments about the KL divergence. First, we note that the KL divergence between any two distributions is asymmetric. As we navigate through this chapter, the reader is encouraged to think what could go wrong if we decided to optimize the reverse KL divergence instead. Secondly, the KL divergences heavily penalizes any model distribution $$p_\theta$$ which assigns low probability to a datapoint that is likely to be sampled under $$p_{\mathrm{data}}$$. In the extreme case, if the density $$p_\theta(\mathbf{x})$$ evaluates to zero for a datapoint sampled from $$p_{\mathrm{data}}$$, the objective evaluates to $$+\infty$$.
+Before moving any further, we make two comments about the KL divergence. First, we note that the KL divergence between any two distributions is asymmetric. As we navigate through this chapter, the reader is encouraged to think about what could go wrong if we decided to optimize the reverse KL divergence instead. Secondly, the KL divergence heavily penalizes any model distribution $$p_\theta$$ which assigns a low probability to a datapoint that is likely to be sampled under $$p_{\mathrm{data}}$$. In the extreme case, if the density $$p_\theta(\mathbf{x})$$ evaluates to zero for a datapoint sampled from $$p_{\mathrm{data}}$$, the objective evaluates to $$+\infty$$.
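Both points are easy to see numerically; a small sketch with two discrete distributions standing in for $$p_{\mathrm{data}}$$ and $$p_\theta$$:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (in nats)."""
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.5])   # stand-in for p_data
q = np.array([0.9, 0.1])   # stand-in for p_theta
print(kl(p, q))            # ~0.51
print(kl(q, p))            # ~0.37 -- the divergence is asymmetric

# A model that puts almost no mass on a likely datapoint is penalized heavily:
q_bad = np.array([1.0 - 1e-12, 1e-12])
print(kl(p, q_bad))        # ~13.1, and -> +infinity as the mass -> 0
```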
Since $$p_{\mathrm{data}}$$ does not depend on $$\theta$$, we can equivalently recover the optimal parameters via maximum likelihood estimation.
@@ -138,7 +138,7 @@ In practice, we optimize the MLE objective using mini-batch gradient ascent. The
where $$\theta^{(t+1)}$$ and $$\theta^{(t)}$$ are the parameters at iterations $$t+1$$ and $$t$$ respectively, and $$r_t$$ is the learning rate at iteration $$t$$. Typically, we only specify the initial learning rate $$r_1$$ and update the rate based on a schedule. [Variants](http://cs231n.github.io/optimization-1/) of stochastic gradient ascent, such as RMSProp and Adam, employ modified update rules that work slightly better in practice.
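As a toy illustration of this update rule, the following sketch fits a Bernoulli model by mini-batch gradient ascent; the dataset, schedule, and constants are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.8, size=1000)   # toy dataset drawn from Bernoulli(0.8)

theta = 0.0                              # logit parameter; p_theta = sigmoid(theta)
r1 = 0.5                                 # initial learning rate
for t in range(1, 201):
    r_t = r1 / np.sqrt(t)                # one possible decay schedule
    batch = rng.choice(data, size=64)    # mini-batch
    p = 1.0 / (1.0 + np.exp(-theta))
    grad = np.mean(batch - p)            # gradient of the mean log-likelihood
    theta += r_t * grad                  # ascent: theta_{t+1} = theta_t + r_t * grad

print(1.0 / (1.0 + np.exp(-theta)))      # recovers roughly 0.8
```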
-From a practical standpoint, we must think about how to choose hyperparameters (such as the initial learning rate) and a stopping criteria for the gradient descent. For both these questions, we follow the standard practice in machine learning of monitoring the objective on a validation dataset. Consequently, we choose the hyperparameters with the best performance on the validation dataset and stop updating the parameters when the validation log-likelihoods cease to improve[^1].
+From a practical standpoint, we must think about how to choose hyperparameters (such as the initial learning rate) and a stopping criterion for the gradient descent. For both these questions, we follow the standard practice in machine learning of monitoring the objective on a validation dataset. Consequently, we choose the hyperparameters with the best performance on the validation dataset and stop updating the parameters when the validation log-likelihoods cease to improve[^1].
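A sketch of this stopping rule; `fit_epoch` and `log_likelihood` are assumed interfaces, not from any particular library:

```python
def train_with_early_stopping(model, train_data, val_data, patience=5, max_epochs=1000):
    """Monitor validation log-likelihood; stop when it ceases to improve."""
    best_ll, bad_epochs = float("-inf"), 0
    for _ in range(max_epochs):
        model.fit_epoch(train_data)                 # one pass of gradient ascent
        val_ll = model.log_likelihood(val_data)     # monitored objective
        if val_ll > best_ll:
            best_ll, bad_epochs = val_ll, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:              # no improvement for a while
                break
    return model
```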
Now that we have a well-defined objective and optimization procedure, the only remaining task is to evaluate the objective in the context of an autoregressive generative model. To this end, we substitute the factorized joint distribution of an autoregressive model in the MLE objective to get
@@ -149,13 +149,13 @@ Now that we have a well-defined objective and optimization procedure, the only r
where $$\theta = \{\theta_1, \theta_2, \ldots, \theta_n\}$$ now denotes the
collective set of parameters for the conditionals.
-Inference in an autoregressive model is straightforward. For density estimation of an arbitrary point $$\mathbf{x}$$, we simply evaluate the log-conditionals $$\log p_{\theta_i}(x_i \vert \mathbf{x}_{< i})$$ for each $$i$$ and add these up to obtain the log-likelihood assigned by the model to $$\mathbf{x}$$. Since we know conditioning vector $$\mathbf{x}$$, each of the conditionals can be evaluated in parallel. Hence, density estimation is efficient on modern hardware.
+Inference in an autoregressive model is straightforward. For density estimation of an arbitrary point $$\mathbf{x}$$, we simply evaluate the log-conditionals $$\log p_{\theta_i}(x_i \vert \mathbf{x}_{< i})$$ for each $$i$$ and add these up to obtain the log-likelihood assigned by the model to $$\mathbf{x}$$. Since we know the conditioning vector $$\mathbf{x}$$, each of the conditionals can be evaluated in parallel. Hence, density estimation is efficient on modern hardware.
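In code, density estimation reduces to a sum of per-variable terms; `log_conditional` below is an assumed interface:

```python
import numpy as np

def log_likelihood(x, log_conditional):
    """Sum of log-conditionals log p(x_i | x_<i); given the full vector x the
    n calls are independent, so they could be evaluated in parallel."""
    return sum(log_conditional(i, x) for i in range(len(x)))

# Tiny demo: a "model" whose conditionals ignore the context (independent bits).
demo = lambda i, x: np.log(0.7 if x[i] == 1 else 0.3)
print(log_likelihood(np.array([1, 0, 1]), demo))  # log(0.7 * 0.3 * 0.7)
```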
Sampling from an autoregressive model is a sequential procedure. Here, we first sample $$x_1$$, then we sample $$x_2$$ conditioned on the sampled $$x_1$$, followed by $$x_3$$ conditioned on both $$x_1$$ and $$x_2$$, and so on until we sample $$x_n$$ conditioned on the previously sampled $$\mathbf{x}_{< n}$$. For applications requiring real-time generation of high-dimensional data such as audio synthesis, this sequential sampling can be an expensive process. Later in this course, we will discuss how parallel WaveNet, an autoregressive model, sidesteps this expensive sampling process.
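A sketch of this ancestral sampling loop for a binary autoregressive model; `conditional_mean` is again an assumed interface:

```python
import numpy as np

def sample(n, conditional_mean, rng):
    """Ancestral sampling: x_i depends on the already-sampled x_<i, so the
    loop is inherently sequential (unlike density estimation above)."""
    x = np.zeros(n, dtype=int)
    for i in range(n):
        p_i = conditional_mean(i, x[:i])   # p(x_i = 1 | x_<i)
        x[i] = rng.binomial(1, p_i)
    return x

# Demo with a hand-made conditional: x_i tends to copy the previous bit.
cond = lambda i, prefix: 0.5 if i == 0 else (0.9 if prefix[-1] == 1 else 0.1)
print(sample(10, cond, np.random.default_rng(0)))
```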
-Finally, an autoregressive model does not directly learn unsupervised representations of the data. In the next few set of lectures, we will look at latent variable models (e.g., variational autoencoders) which explicitly learn latent representations of the data.
+Finally, an autoregressive model does not directly learn unsupervised representations of the data. In the next few lectures, we will look at latent variable models (e.g., variational autoencoders) which explicitly learn latent representations of the data.
GAN Objective

D_{\textrm{JSD}}[p, q] = \frac{1}{2} \left( D_{\textrm{KL}}\left[p, \frac{p+q}{2} \right] + D_{\textrm{KL}}\left[q, \frac{p+q}{2} \right] \right)
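A small numerical sketch of the JSD and its symmetry (the distributions are invented for illustration):

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log 2 (nats)."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(jsd(p, q), jsd(q, p))  # both ~0.10 -- equal, unlike the KL
```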
GAN training algorithm

Challenges

Selected GANs

f-GAN
\min_\theta \max_\phi F(\theta,\phi) = \mathbb{E}_{x \sim p_{\textrm{data}}}[T_\phi(\mathbf{x})] - \mathbb{E}_{x \sim p_{G_\theta}}[f^*(T_\phi(\mathbf{x}))]
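A Monte Carlo sketch of this objective for one particular choice of $$f$$, the KL divergence, whose conjugate is $$f^*(t) = \exp(t-1)$$; the discriminator and the samples below are stand-ins, not a trained model:

```python
import numpy as np

def fgan_objective(T, x_data, x_model, f_star):
    """Monte Carlo estimate of F(theta, phi): the discriminator T_phi pushes
    this lower bound on the f-divergence up; the generator pulls it down."""
    return np.mean(T(x_data)) - np.mean(f_star(T(x_model)))

# KL divergence as the f-divergence: its conjugate is f*(t) = exp(t - 1).
f_star_kl = lambda t: np.exp(t - 1.0)
T = lambda x: 1.0 - 0.5 * x**2               # a stand-in discriminator, not optimal
rng = np.random.default_rng(0)
x_data = rng.normal(0.0, 1.0, size=10_000)   # stand-in for p_data
x_model = rng.normal(0.5, 1.0, size=10_000)  # stand-in for p_G
print(fgan_objective(T, x_data, x_model, f_star_kl))
```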
BiGAN

Introduction
underlying distribution, say . At its very core, the
goal of any generative model is then to approximate this data
distribution given access to the dataset . The hope is that
-if we are able to learn a good generative model, we can use the
+if we can learn a good generative model, we can use the
learned model for downstream inference.
In the parametric setting, we can think of the task of learning a generative model as picking the parameters within a family of model distributions that minimizes some notion of distance[^1] between the
-model distribution and the data distribution.
+model distribution and data distribution.

For instance, we might be given access to a dataset of dog images and
-our goal is to learn the paraemeters of a generative model within a model family such that
+our goal is to learn the parameters of a generative model within a model family such that
the model distribution is close to the data distribution over dogs . Mathematically, we can specify our goal as the following optimization problem:
\begin{equation}
@@ -130,17 +129,17 @@
-In practice however, optimizing the above estimate suffers from high variance in gradient estimates.
+In practice, however, optimizing the above estimate suffers from high variance in gradient estimates.
Rather than maximizing the log-likelihood directly, an alternative is to instead construct a lower bound that is more amenable to optimization. To do so, we note that evaluating the marginal likelihood is at least as difficult as evaluating the posterior for any latent vector since by definition .
-Next, we introduce a variational family of distributions that approximate the true, but intractable posterior . Further henceforth, we will assume a parameteric setting where any distribution in the model family is specified via a set of parameters and distributions in the variational family are specified via a set of parameters .
+Next, we introduce a variational family of distributions that approximate the true, but intractable posterior . Henceforth, we will assume a parametric setting where any distribution in the model family is specified via a set of parameters and distributions in the variational family are specified via a set of parameters .
Given and , we note that the following relationships hold true[^1] for any and all variational distributions
@@ -310,7 +310,7 @@
-A noticable limitation of black-box variational inference is that Step 1 executes an optimization subroutine that is computationally expensive. Recall that the goal of the Step 1 is to find
+A noticeable limitation of black-box variational inference is that Step 1 executes an optimization subroutine that is computationally expensive. Recall that the goal of Step 1 is to find
-It is also worth noting that optimizing over the entire dataset as a subroutine everytime we sample a new mini-batch is clearly not reasonable. However, if we believe that is capable of quickly adapting to a close-enough approximation of given the current choice of , then we can interleave the optimization and . The yields the following procedure, where for each mini-batch , we perform the following two updates jointly
+It is also worth noting that optimizing over the entire dataset as a subroutine every time we sample a new mini-batch is clearly not reasonable. However, if we believe that is capable of quickly adapting to a close-enough approximation of given the current choice of , then we can interleave the optimization and . This yields the following procedure, where for each mini-batch , we perform the following two updates jointly
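A sketch of these interleaved updates; the two ELBO gradient oracles are assumed interfaces, not from the notes:

```python
def train_step(theta, phi, minibatch, elbo_grad_theta, elbo_grad_phi, lr=1e-3):
    """One interleaved update on a mini-batch: the model parameters theta and
    the variational parameters phi each take one gradient step on the ELBO,
    rather than fully re-optimizing phi for every new mini-batch."""
    theta = theta + lr * elbo_grad_theta(theta, phi, minibatch)
    phi = phi + lr * elbo_grad_phi(theta, phi, minibatch)
    return theta, phi
```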