From cf6258501489d1fd28c48609b876aa141addbb06 Mon Sep 17 00:00:00 2001 From: Adithya Ganesh Date: Sat, 22 Dec 2018 14:37:47 -0800 Subject: [PATCH 1/7] Fix style issues / typos in intro post. --- introduction/index.md | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/introduction/index.md b/introduction/index.md index 6b7a969..045aefd 100644 --- a/introduction/index.md +++ b/introduction/index.md @@ -6,8 +6,7 @@ title: Introduction Intelligent agents are constantly generating, acquiring, and processing data. This data could be in the form of *images* that we capture on our phones, *text* messages we share with our friends, *graphs* that model -interactions on social media, *videos* that record important events, -etc. Natural agents excel at discovering patterns, extracting +interactions on social media, or *videos* that record important events. Natural agents excel at discovering patterns, extracting knowledge, and performing complex reasoning based on the data they observe. How can we build artificial learning systems to do the same? @@ -17,7 +16,7 @@ observed data, say $$\mathcal{D}$$, as a finite set of samples from an underlying distribution, say $$p_{\mathrm{data}}$$. At its very core, the goal of any generative model is then to approximate this data distribution given access to the dataset $$\mathcal{D}$$. The hope is that -if we are able to *learn* a good generative model, we can use the +if we can *learn* a good generative model, we can use the learned model for downstream *inference*. Learning @@ -32,7 +31,7 @@ limited in the family of distributions they can represent. In the parametric setting, we can think of the task of learning a generative model as picking the parameters within a family of model distributions that minimizes some notion of distance[^1] between the -model distribution and the data distribution. +model distribution and data distribution. drawing @@ -66,10 +65,10 @@ model with such a limited dataset is a highly underdetermined problem. Fortunately, the real world is highly structured and automatically discovering the underlying structure is key to learning generative models. For example, we can hope to learn some basic artifacts about -dogs even with just a few images: two eyes, two ears, fur etc. Instead +dogs even with just a few images: two eyes, two ears, fur, etc. Instead of incorporating this prior knowledge explicitly, we will hope the model learns the underlying structure directly from data. There is no free -lunch however, and indeed successful learning of generative models will +lunch, however, and indeed successful learning of generative models will involve instantiating the optimization problem in $$(\ref{eq:learning_gm})$$ in a suitable way. In this course, we will be primarily interested in the following questions: @@ -78,9 +77,9 @@ primarily interested in the following questions: * What is the objective function $$d(\cdot)$$? * What is the optimization procedure for minimizing $$d(\cdot)$$? -In the next few set of lectures, we will take a deeper dive into certain +In the next few lectures, we will take a deeper dive into certain families of generative models. For each model family, we will note how -the representation is closely tied with the choice of learning objective +the representation relates to the choice of learning objective and the optimization procedure. 
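To make the optimization problem in the Learning section concrete, here is a minimal, self-contained sketch (the Bernoulli(0.7) data distribution, the one-parameter Bernoulli model family, and the grid search are illustrative assumptions, not part of the notes): the distance is the forward KL, and its minimizer coincides with the empirical frequency, i.e., the maximum likelihood estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of min over theta of d(p_data, p_theta):
# p_data = Bernoulli(0.7) (made up), model family = {Bernoulli(theta)},
# and d = forward KL divergence, estimated from samples.
data = (rng.random(10_000) < 0.7).astype(float)
p_hat = data.mean()  # empirical frequency

def kl_bernoulli(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmin([kl_bernoulli(p_hat, t) for t in thetas])]
print(p_hat, best)  # the KL minimizer tracks the empirical frequency (= MLE)
```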
Inference @@ -93,7 +92,7 @@ data.[^2] While the range of applications to which generative models have been used continues to grow, we can identify three fundamental inference -queries for evaluating a generative model.: +queries for evaluating a generative model: 1. *Density estimation:* Given a datapoint $$\mathbf{x}$$, what is the probability assigned by the model, i.e., $$p_\theta(\mathbf{x})$$? From e6a01428262088f8229d2561d69b97026f56b3a1 Mon Sep 17 00:00:00 2001 From: Adithya Ganesh Date: Sat, 22 Dec 2018 14:41:52 -0800 Subject: [PATCH 2/7] Style / minor typos in autoregressive post. --- autoregressive/index.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/autoregressive/index.md b/autoregressive/index.md index 9e01c9f..94be915 100644 --- a/autoregressive/index.md +++ b/autoregressive/index.md @@ -43,7 +43,7 @@ where $$\theta_i$$ denotes the set of parameters used to specify the mean function $$f_i: \{0,1\}^{i-1}\rightarrow [0,1]$$. -The number of parameters of an autoregressive generative model are given by $$\sum_{i=1}^n \vert \theta_i \vert$$. As we shall see in the examples below, the number of parameters are much fewer than the tabular setting considered previously. Unlike the tabular setting however, an autoregressive generative model cannot represent all possible distributions. Its expressiveness is limited by the fact that we are limiting the conditional distributions to correspond to a Bernoulli random variable with the mean specified via a restricted class of parameterized functions. +The number of parameters of an autoregressive generative model is given by $$\sum_{i=1}^n \vert \theta_i \vert$$. As we shall see in the examples below, the number of parameters is much smaller than in the tabular setting considered previously. Unlike the tabular setting, however, an autoregressive generative model cannot represent all possible distributions. Its expressiveness is limited by the fact that we are limiting the conditional distributions to correspond to a Bernoulli random variable with the mean specified via a restricted class of parameterized functions.
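As a concrete illustration of the factorization in this hunk, a small sketch of the chain-rule log-likelihood for binary data (the linear-logistic mean functions and random parameters are placeholders for whatever parameterization of $$f_i$$ one chooses):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder parameters: theta_i is a bias plus one weight per preceding
# variable, so |theta_i| = i + 1 and the total count is sum_i |theta_i|.
rng = np.random.default_rng(0)
n = 5
thetas = [rng.normal(size=i + 1) for i in range(n)]

def log_prob(x, thetas):
    """Chain rule for binary x: log p(x) = sum_i log p(x_i | x_{<i}),
    each conditional a Bernoulli whose mean is f_i(x_{<i})."""
    logp = 0.0
    for i, th in enumerate(thetas):
        mu = sigmoid(th[0] + th[1:] @ x[:i])  # mean function f_i at x_{<i}
        logp += x[i] * np.log(mu) + (1 - x[i]) * np.log(1 - mu)
    return logp

x = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
print(log_prob(x, thetas))
```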
drawing @@ -59,7 +59,7 @@ f_i(x_1, x_2, \ldots, x_{i-1}) =\sigma(\alpha^{(i)}_0 + \alpha^{(i)}_1 x_1 + \ld where $$\sigma$$ denotes the sigmoid function and $$\theta_i=\{\alpha^{(i)}_0,\alpha^{(i)}_1, \ldots, \alpha^{(i)}_{i-1}\}$$ denote the parameters of the mean function. The conditional for variable $$i$$ requires $$i$$ parameters, and hence the total number of parameters in the model is given by $$\sum_{i=1}^ni= O(n^2)$$. Note that the number of parameters is much smaller than the exponential complexity of the tabular case. -A natural way to increase the expressiveness of an autoregressive generative model is to use more flexible parameterizations for the mean function e.g., multi-layer perceptrons (MLP). For example, consider the case of a neural network with 1 hidden layer. The mean function for variable $$i$$ can be expressed as +A natural way to increase the expressiveness of an autoregressive generative model is to use more flexible parameterizations for the mean function, e.g. multi-layer perceptrons (MLP). For example, consider the case of a neural network with 1 hidden layer. The mean function for variable $$i$$ can be expressed as {% math %} \mathbf{h}_i = \sigma(A_i \mathbf{x_{< i}} + \mathbf{c}_i)\\ @@ -105,14 +105,14 @@ Notice that NADE requires specifying a single, fixed ordering of the variables. Learning and inference ====================== -Recall that learning a generative model involves optimizing the closeness between the data and model distributions. One commonly used notion of closeness in the KL divergence between the data and the model distributions. +Recall that learning a generative model involves optimizing the closeness between the data and model distributions. One commonly used notion of closeness is the KL divergence between the data and the model distributions. {% math %} \min_{\theta\in \mathcal{M}}d_{KL} (p_{\mathrm{data}}, p_{\theta}) = \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}} }\left[\log p_{\mathrm{data}}(\mathbf{x}) - \log p_{\theta}(\mathbf{x})\right] {% endmath %} -Before moving any further, we make two comments about the KL divergence. First, we note that the KL divergence between any two distributions is asymmetric. As we navigate through this chapter, the reader is encouraged to think what could go wrong if we decided to optimize the reverse KL divergence instead. Secondly, the KL divergences heavily penalizes any model distribution $$p_\theta$$ which assigns low probability to a datapoint that is likely to be sampled under $$p_{\mathrm{data}}$$. In the extreme case, if the density $$p_\theta(\mathbf{x})$$ evaluates to zero for a datapoint sampled from $$p_{\mathrm{data}}$$, the objective evaluates to $$+\infty$$. +Before moving any further, we make two comments about the KL divergence. First, we note that the KL divergence between any two distributions is asymmetric. As we navigate through this chapter, the reader is encouraged to think what could go wrong if we decided to optimize the reverse KL divergence instead. Secondly, the KL divergence heavily penalizes any model distribution $$p_\theta$$ which assigns a low probability to a datapoint that is likely to be sampled under $$p_{\mathrm{data}}$$. In the extreme case, if the density $$p_\theta(\mathbf{x})$$ evaluates to zero for a datapoint sampled from $$p_{\mathrm{data}}$$, the objective evaluates to $$+\infty$$. Since $$p_{\mathrm{data}}$$ does not depend on $$\theta$$, we can equivalently recover the optimal parameters via maximum likelihood estimation.
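Before moving on to optimization, here is what the one-hidden-layer mean function from the hunk above looks like in code. A sketch under two assumptions: the hidden width `d` is an arbitrary illustrative choice, and the second line of the display, cut off by the hunk boundary, is presumably $$f_i(x_1, \ldots, x_{i-1}) = \sigma(\boldsymbol{\alpha}^{(i)} \cdot \mathbf{h}_i + b_i)$$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, d = 5, 8  # n variables, d hidden units (d is a made-up illustrative width)
rng = np.random.default_rng(0)
params = [(rng.normal(size=(d, i)), np.zeros(d),   # A_i, c_i
           rng.normal(size=d), 0.0)                # alpha_i, b_i
          for i in range(n)]

def mean_fn(i, x_prev):
    """One-hidden-layer mean function:
    h_i = sigma(A_i x_{<i} + c_i),  f_i(x_{<i}) = sigma(alpha_i . h_i + b_i)."""
    A_i, c_i, alpha_i, b_i = params[i]
    h_i = sigmoid(A_i @ x_prev + c_i)
    return sigmoid(alpha_i @ h_i + b_i)

print(mean_fn(3, np.array([1.0, 0.0, 1.0])))  # Bernoulli mean for x_4 given x_{<4}
```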
@@ -138,7 +138,7 @@ In practice, we optimize the MLE objective using mini-batch gradient ascent. The where $$\theta^{(t+1)}$$ and $$\theta^{(t)}$$ are the parameters at iterations $$t+1$$ and $$t$$ respectively, and $$r_t$$ is the learning rate at iteration $$t$$. Typically, we only specify the initial learning rate $$r_1$$ and update the rate based on a schedule. [Variants](http://cs231n.github.io/optimization-1/) of stochastic gradient ascent, such as RMS prop and Adam, employ modified update rules that work slightly better in practice. -From a practical standpoint, we must think about how to choose hyperparameters (such as the initial learning rate) and a stopping criteria for the gradient descent. For both these questions, we follow the standard practice in machine learning of monitoring the objective on a validation dataset. Consequently, we choose the hyperparameters with the best performance on the validation dataset and stop updating the parameters when the validation log-likelihoods cease to improve[^1]. +From a practical standpoint, we must think about how to choose hyperparameters (such as the initial learning rate) and a stopping criterion for the gradient descent. For both these questions, we follow the standard practice in machine learning of monitoring the objective on a validation dataset. Consequently, we choose the hyperparameters with the best performance on the validation dataset and stop updating the parameters when the validation log-likelihoods cease to improve[^1]. Now that we have a well-defined objective and optimization procedure, the only remaining task is to evaluate the objective in the context of an autoregressive generative model. To this end, we substitute the factorized joint distribution of an autoregressive model in the MLE objective to get @@ -149,13 +149,13 @@ Now that we have a well-defined objective and optimization procedure, the only r where $$\theta = \{\theta_1, \theta_2, \ldots, \theta_n\}$$ now denotes the collective set of parameters for the conditionals. -Inference in an autoregressive model is straightforward. For density estimation of an arbitrary point $$\mathbf{x}$$, we simply evaluate the log-conditionals $$\log p_{\theta_i}(x_i \vert \mathbf{x}_{< i})$$ for each $$i$$ and add these up to obtain the log-likelihood assigned by the model to $$\mathbf{x}$$. Since we know conditioning vector $$\mathbf{x}$$, each of the conditionals can be evaluated in parallel. Hence, density estimation is efficient on modern hardware. +Inference in an autoregressive model is straightforward. For density estimation of an arbitrary point $$\mathbf{x}$$, we simply evaluate the log-conditionals $$\log p_{\theta_i}(x_i \vert \mathbf{x}_{< i})$$ for each $$i$$ and add these up to obtain the log-likelihood assigned by the model to $$\mathbf{x}$$. Since we know the conditioning vector $$\mathbf{x}$$, each of the conditionals can be evaluated in parallel. Hence, density estimation is efficient on modern hardware. Sampling from an autoregressive model is a sequential procedure. Here, we first sample $$x_1$$, then we sample $$x_2$$ conditioned on the sampled $$x_1$$, followed by $$x_3$$ conditioned on both $$x_1$$ and $$x_2$$ and so on until we sample $$x_n$$ conditioned on the previously sampled $$\mathbf{x}_{< n}$$. For applications requiring real-time generation of high-dimensional data such as audio synthesis, the sequential sampling can be an expensive process. 
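The sequential nature of sampling is easy to see in code; a minimal sketch reusing the logistic parameterization from the earlier snippet (placeholder parameters again):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 5
thetas = [rng.normal(size=i + 1) for i in range(n)]

def sample(thetas, rng):
    """Ancestral sampling: n strictly sequential steps, since x_i needs the
    already-sampled x_{<i}. Density estimation, by contrast, can evaluate all
    conditionals in parallel because the full x is known."""
    x = np.zeros(len(thetas))
    for i, th in enumerate(thetas):
        mu = sigmoid(th[0] + th[1:] @ x[:i])
        x[i] = float(rng.random() < mu)  # x_i ~ Bernoulli(mu)
    return x

print(sample(thetas, rng))
```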
Later in this course, we will discuss how parallel WaveNet, an autoregressive model sidesteps this expensive sampling process. -Finally, an autoregressive model does not directly learn unsupervised representations of the data. In the next few set of lectures, we will look at latent variable models (e.g., variational autoencoders) which explicitly learn latent representations of the data. +Finally, an autoregressive model does not directly learn unsupervised representations of the data. In the next few lectures, we will look at latent variable models (e.g., variational autoencoders) which explicitly learn latent representations of the data. -

Finally, an autoregressive model does not directly learn unsupervised representations of the data. In the next few set of lectures, we will look at latent variable models (e.g., variational autoencoders) which explicitly learn latent representations of the data.

+

Finally, an autoregressive model does not directly learn unsupervised representations of the data. In the next few lectures, we will look at latent variable models (e.g., variational autoencoders) which explicitly learn latent representations of the data.

-

The generator and discriminator both play a two player minimax game, where the generator minimizes a two-sample test objective ($$p_{\textrm{data}} = p_\theta$$) and the discriminator maximizes the objective ($$p_{\textrm{data}} \neq p_\theta$$). Intuitively, the generator tries to fool the discriminator to the best of its ability by generating samples that look indisginguishable from $$p_{\textrm{data}}$$.

+

The generator and discriminator both play a two player minimax game, where the generator minimizes a two-sample test objective ($$p_{\textrm{data}} = p_\theta$$) and the discriminator maximizes the objective ($$p_{\textrm{data}} \neq p_\theta$$). Intuitively, the generator tries to fool the discriminator to the best of its ability by generating samples that look indistinguishable from $$p_{\textrm{data}}$$.

Formally, the GAN objective can be written as:
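The displayed objective did not survive extraction; it is presumably the standard minimax objective, reconstructed here for reference using the $$G_\theta$$/$$D_\phi$$ notation that appears later in this diff:

{% math %}
\min_\theta \max_\phi V(G_\theta, D_\phi) = \mathbb{E}_{\mathbf{x} \sim p_{\textrm{data}}}[\log D_\phi(\mathbf{x})] + \mathbb{E}_{\mathbf{x} \sim p_{G_\theta}}[\log (1 - D_\phi(\mathbf{x}))]
{% endmath %}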

@@ -125,7 +125,7 @@

GAN Objective

D_{\textrm{JSD}}[p, q] = \frac{1}{2} \left( D_{\textrm{KL}}\left[p, \frac{p+q}{2} \right] + D_{\textrm{KL}}\left[q, \frac{p+q}{2} \right] \right) -

The JSD satisfies all properties of the KL, and has the additional perk that $$D_{\textrm{JSD}}[p, q] = D_{\textrm{JSD}}[q, p]$$. With this distance metric, the optimal generator for the GAN objective becomces $$p_{G_\theta} = p_{\textrm{data}}$$, and the optimal objective value that we can achieve with optimal generators and discriminators $$G^*(\cdot)$$ and $$D^*_{G^*}(\mathbf{x})$$ is $$-\log 4$$.

+

The JSD satisfies all properties of the KL, and has the additional perk that $$D_{\textrm{JSD}}[p, q] = D_{\textrm{JSD}}[q, p]$$. With this distance metric, the optimal generator for the GAN objective becomes $$p_{G_\theta} = p_{\textrm{data}}$$, and the optimal objective value that we can achieve with optimal generators and discriminators $$G^*(\cdot)$$ and $$D^*_{G^*}(\mathbf{x})$$ is $$-\log 4$$.
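The two claims in this hunk, the form of the optimal discriminator and the $$-\log 4$$ optimum, are easy to verify numerically; a small sketch on a made-up discrete domain:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])   # p_data on a toy 3-point domain (made up)
q = np.array([0.4, 0.4, 0.2])   # p_G (made up)

def kl(a, b):
    return np.sum(a * np.log(a / b))

jsd = 0.5 * (kl(p, (p + q) / 2) + kl(q, (p + q) / 2))

# For a fixed generator, the optimal discriminator is D*(x) = p(x) / (p(x) + q(x));
# plugging it into E_p[log D] + E_q[log(1 - D)] gives 2 * JSD(p, q) - log 4,
# which equals -log 4 exactly when p = q.
d_star = p / (p + q)
v_star = np.sum(p * np.log(d_star)) + np.sum(q * np.log(1 - d_star))
assert np.isclose(v_star, 2 * jsd - np.log(4))
print(v_star, 2 * jsd - np.log(4))
```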

GAN training algorithm

@@ -149,13 +149,13 @@

GAN training algorithm

Challenges

-

Although GANs have been successfully applied to several domains and tasks, working with them in practice is challenging because of their: (1) unstable optimization procedure, (2) potential for mode collapse, (3) difficulty in evaluation.

+

Although GANs have been successfully applied to several domains and tasks, working with them in practice is challenging because of their: (1) unstable optimization procedure, (2) potential for mode collapse, (3) difficulty in performance evaluation.

-

During optimization, the generator and discriminator loss often continue to oscillate without converging to a clear stopping point. Due to the lack of a robust stopping criteria, it is difficult to know when exactly the GAN has finished training. Additionally, the generator of a GAN can often get stuck producing one of a few types of samples over and over again (mode collapse). Most fixes to these challenges are empirically driven, and there has been a significant amount of work put into developing new architectures, regularization schemes, and noise perturbations in an attempt to circumvent these issues. Soumith Chintala has a nice link outlining various tricks of the trade to stabilize GAN training.

+

During optimization, the generator and discriminator loss often continue to oscillate without converging to a definite stopping point. Due to the lack of robust stopping criteria, it is difficult to know when exactly the GAN has finished training. Additionally, the generator of a GAN can often get stuck producing one of a few types of samples over and over again (mode collapse). Most fixes to these challenges are empirically driven, and there has been a significant amount of work put into developing new architectures, regularization schemes, and noise perturbations in an attempt to circumvent these issues. Soumith Chintala has a nice link outlining various tricks of the trade to stabilize GAN training.

Selected GANs

-

Next, we focus our attention to a few select types of GAN architectures and explore them in more detail.

+

Next, we focus our attention on a few select types of GAN architectures and explore them in more detail.

f-GAN

The f-GAN optimizes the variant of the two-sample test objective that we have discussed so far, but using a very general notion of distance: the $$f$$-divergence. Given two densities $$p$$ and $$q$$, the $$f$$-divergence can be written as:

@@ -177,7 +177,7 @@

f-GAN

\min_\theta \max_\phi F(\theta,\phi) = \mathbb{E}_{x \sim p_{\textrm{data}}}[T_\phi(\mathbf{x})] - \mathbb{E}_{x \sim p_{G_\theta}}[f^*(T_\phi(\mathbf{x}))] -

Intuitively, we can think about this objective as the generator trying to minimize the divergence estimate, while the discriminator tries to tighten the lower bound.

+

Intuitively, we can think about this objective as the generator trying to minimize the divergence estimate, while the discriminator attempts to tighten the lower bound.
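As a sanity check on the lower-bound view, a sketch for one particular choice of $$f$$: the KL case, $$f(u) = u \log u$$, whose Fenchel conjugate is $$f^*(t) = \exp(t - 1)$$ and whose optimal critic is $$T = 1 + \log(p/q)$$ (the discrete densities below are made up):

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])   # p_data (made up)
q = np.array([0.4, 0.4, 0.2])   # p_G (made up)

# Variational bound: D_f(p, q) >= E_p[T(x)] - E_q[f*(T(x))],
# tight at T = f'(p/q) = 1 + log(p/q) for the KL case.
def bound(T):
    return np.sum(p * T) - np.sum(q * np.exp(T - 1))

kl = np.sum(p * np.log(p / q))
T_opt = 1 + np.log(p / q)
assert np.isclose(bound(T_opt), kl)  # the bound is tight at the optimal critic
assert bound(T_opt - 0.3) < kl       # any other critic gives a strictly lower value
print(bound(T_opt), kl)
```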

BiGAN

We won’t worry too much about the BiGAN in these notes. However, we can think about this model as one that allows us to infer latent representations even within a GAN framework.

diff --git a/docs/introduction/index.html b/docs/introduction/index.html index f0f5990..53241c5 100644 --- a/docs/introduction/index.html +++ b/docs/introduction/index.html @@ -80,8 +80,7 @@

Introduction

Intelligent agents are constantly generating, acquiring, and processing data. This data could be in the form of images that we capture on our phones, text messages we share with our friends, graphs that model -interactions on social media, videos that record important events, -etc. Natural agents excel at discovering patterns, extracting +interactions on social media, or videos that record important events. Natural agents excel at discovering patterns, extracting knowledge, and performing complex reasoning based on the data they observe. How can we build artificial learning systems to do the same?

@@ -91,7 +90,7 @@

Introduction

underlying distribution, say $$p_{\mathrm{data}}$$. At its very core, the goal of any generative model is then to approximate this data distribution given access to the dataset $$\mathcal{D}$$. The hope is that -if we are able to learn a good generative model, we can use the +if we can learn a good generative model, we can use the learned model for downstream inference.

Learning

@@ -105,7 +104,7 @@

Learning

In the parametric setting, we can think of the task of learning a generative model as picking the parameters within a family of model distributions that minimizes some notion of distance1 between the -model distribution and the data distribution.

+model distribution and data distribution.

drawing

@@ -114,7 +113,7 @@

Learning

![goal](learning_2.png =100x20) --> For instance, we might be given access to a dataset of dog images and -our goal is to learn the paraemeters of a generative model within a model family such that +our goal is to learn the parameters of a generative model within a model family such that the model distribution is close to the data distribution over dogs $$p_{\mathrm{data}}$$. Mathematically, we can specify our goal as the following optimization problem: \begin{equation} @@ -130,17 +129,17 @@

For instance, we might be given access to a dataset of dog images and -our goal is to learn the paraemeters of a generative model within a model family such that +our goal is to learn the parameters of a generative model within a model family such that the model distribution is close to the data distribution over dogs . Mathematically, we can specify our goal as the following optimization problem: \begin{equation} @@ -130,17 +129,17 @@

Learning

Each pixel has three channels: R(ed), G(reen) and B(lue) and each channel can take a value between 0 to 255. Hence, the number of possible images is given by . -In contrast, Imagenet, one of the largest publicly available datasets, +In contrast, ImageNet, one of the largest publicly available datasets, consists of only about 15 million images. Hence, learning a generative model with such a limited dataset is a highly underdetermined problem.

Fortunately, the real world is highly structured and automatically discovering the underlying structure is key to learning generative models. For example, we can hope to learn some basic artifacts about -dogs even with just a few images: two eyes, two ears, fur etc. Instead +dogs even with just a few images: two eyes, two ears, fur, etc. Instead of incorporating this prior knowledge explicitly, we will hope the model learns the underlying structure directly from data. There is no free -lunch however, and indeed successful learning of generative models will +lunch, however, and indeed successful learning of generative models will involve instantiating the optimization problem in in a suitable way. In this course, we will be primarily interested in the following questions:

@@ -151,9 +150,9 @@

Learning

  • What is the optimization procedure for minimizing $$d(\cdot)$$?
  • -

    In the next few set of lectures, we will take a deeper dive into certain +

    In the next few lectures, we will take a deeper dive into certain families of generative models. For each model family, we will note how -the representation is closely tied with the choice of learning objective +the representation relates to the choice of learning objective and the optimization procedure.

    Inference

    @@ -165,7 +164,7 @@

    Inference

While the range of applications to which generative models have been used continues to grow, we can identify three fundamental inference -queries for evaluating a generative model.:

    +queries for evaluating a generative model:

diff --git a/docs/vae/index.html b/docs/vae/index.html index c9d2f14..e5544e3 100644 --- a/docs/vae/index.html +++ b/docs/vae/index.html @@ -170,11 +170,11 @@

      Learning Directed Latent Varia \log p(\bx) \approx \log \frac{1}{k} \sum_{i=1}^k p(\bx \vert \bz^{(i)}) \text{, where } \bz^{(i)} \sim p(\bz) -

      In practice however, optimizing the above estimate suffers from high variance in gradient estimates.

      +

      In practice, however, optimizing the above estimate suffers from high variance in gradient estimates.
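The variance issue is easy to reproduce; a sketch on a toy latent variable model (the model itself is made up for illustration, chosen so the exact marginal is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: z ~ N(0, 1), x | z ~ N(z, 1), so the marginal is p(x) = N(0, 2).
def log_p_x_given_z(x, z):
    return -0.5 * ((x - z) ** 2 + np.log(2 * np.pi))

def naive_estimate(x, k):
    """log p(x) ~= log (1/k) sum_i p(x | z_i) with z_i ~ p(z) (the estimate above)."""
    z = rng.normal(size=k)
    return np.log(np.mean(np.exp(log_p_x_given_z(x, z))))

x = 3.0  # a point in the tail, where prior samples rarely explain x
exact = -0.5 * (x ** 2 / 2 + np.log(2 * np.pi * 2))
print(exact, [naive_estimate(x, 100) for _ in range(5)])  # spread shows the variance
```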

Rather than maximizing the log-likelihood directly, an alternative is to instead construct a lower bound that is more amenable to optimization. To do so, we note that evaluating the marginal likelihood is at least as difficult as evaluating the posterior $$p(\bz \vert \bx)$$ for any latent vector $$\bz$$ since by definition $$p(\bz \vert \bx) = p(\bx, \bz) / p(\bx)$$.

      -

Next, we introduce a variational family of distributions that approximate the true, but intractable posterior $$p(\bz \vert \bx)$$. Further henceforth, we will assume a parameteric setting where any distribution in the model family is specified via a set of parameters $$\theta$$ and distributions in the variational family are specified via a set of parameters $$\lambda$$.

      +

Next, we introduce a variational family of distributions that approximate the true, but intractable posterior $$p(\bz \vert \bx)$$. Henceforth, we will assume a parametric setting where any distribution in the model family is specified via a set of parameters $$\theta$$ and distributions in the variational family are specified via a set of parameters $$\lambda$$.

Given $$\theta$$ and $$\lambda$$, we note that the following relationships hold true1 for any $$\bx$$ and all variational distributions $$q_\lambda(\bz)$$
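The displayed relationships were lost in extraction; they are presumably the usual evidence lower bound (ELBO) identity, reproduced here for reference:

{% math %}
\log p(\bx) = \mathbb{E}_{q_\lambda(\bz)}\left[\log \frac{p(\bx, \bz)}{q_\lambda(\bz)}\right] + D_{KL}\left(q_\lambda(\bz) \,\|\, p(\bz \vert \bx)\right) \geq \mathbb{E}_{q_\lambda(\bz)}\left[\log \frac{p(\bx, \bz)}{q_\lambda(\bz)}\right]
{% endmath %}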

      @@ -310,7 +310,7 @@

      Parameterizing Di

      Amortized Variational Inference

      -

      A noticable limitation of black-box variational inference is that Step 1 executes an optimization subroutine that is computationally expensive. Recall that the goal of the Step 1 is to find

      +

A noticeable limitation of black-box variational inference is that Step 1 executes an optimization subroutine that is computationally expensive. Recall that the goal of Step 1 is to find
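The optimization target that should follow was lost in extraction; it is presumably the per-datapoint maximization of the lower bound above:

{% math %}
\lambda^{*} = \arg\max_{\lambda} \mathbb{E}_{q_\lambda(\bz)}\left[\log \frac{p(\bx, \bz)}{q_\lambda(\bz)}\right]
{% endmath %}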

      -

It is also worth noting that optimizing over the entire dataset as a subroutine everytime we sample a new mini-batch is clearly not reasonable. However, if we believe that $$q_\lambda$$ is capable of quickly adapting to a close-enough approximation of $$q_{\lambda^*}(\bz)$$ given the current choice of $$\theta$$, then we can interleave the optimization $$\theta$$ and $$\lambda$$. The yields the following procedure, where for each mini-batch $$\mathcal{B}$$, we perform the following two updates jointly

      +

It is also worth noting that optimizing over the entire dataset as a subroutine every time we sample a new mini-batch is clearly not reasonable. However, if we believe that $$q_\lambda$$ is capable of quickly adapting to a close-enough approximation of $$q_{\lambda^*}(\bz)$$ given the current choice of $$\theta$$, then we can interleave the optimization $$\theta$$ and $$\lambda$$. This yields the following procedure, where for each mini-batch $$\mathcal{B}$$, we perform the following two updates jointly
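The document cuts off before the two updates; in the standard setup they are presumably simultaneous gradient steps on the mini-batch ELBO with respect to the model parameters $$\theta$$ and the variational parameters $$\lambda$$ (a reconstruction, not the notes' own equations):

{% math %}
\theta \leftarrow \theta + \nabla_\theta \sum_{\bx \in \mathcal{B}} \mathbb{E}_{q_\lambda(\bz)}\left[\log \frac{p(\bx, \bz)}{q_\lambda(\bz)}\right] \qquad
\lambda \leftarrow \lambda + \nabla_\lambda \sum_{\bx \in \mathcal{B}} \mathbb{E}_{q_\lambda(\bz)}\left[\log \frac{p(\bx, \bz)}{q_\lambda(\bz)}\right]
{% endmath %}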