Double-backward

Wednesday, November 06, 2019

API, Tutorial

Posted by Kazuki Yoshiyama

Double Backwards

NNabla has supported double backwards since version 1.1.0.

In this blog post, we briefly describe double backwards in Neural Network Libraries and its typical use cases.

Features of Double Backwards

There are two main features of double backwards:

  • Forward Graph Expansion
  • Backward on both the backward and forward graphs

Let’s have a look at each.

Forward Graph Expansion

In the original design of NNabla's graph engine, backpropagation is executed by traversing the forward graph in reverse, without explicitly creating a backward graph. This is still an option for implementing backpropagation. Another way is to expand the forward graph, that is, to flip the forward graph to create an explicit backward graph.

The following code snippets perform the same computation but in different ways.

1. Backpropagation on forward graph
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF

import numpy as np

rng = np.random.RandomState(313)
x = nn.Variable.from_numpy_array(rng.randn(2, 3)).apply(need_grad=True)
x.grad.zero()
y = F.sigmoid(x)
y.forward()
y.backward(clear_buffer=True)
print(x.g)
2. Backpropagation by forward graph expansion
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF

import numpy as np

rng = np.random.RandomState(313)
x = nn.Variable.from_numpy_array(rng.randn(2, 3)).apply(need_grad=True)
x.grad.zero()
y = F.sigmoid(x)
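# Expand the forward graph with the gradient of y w.r.t. x.
# bind_grad_output=True binds the returned grads to x.grad.
grads = nn.grad([y], [x], bind_grad_output=True)
# Run the forward computation over the expanded (forward + backward) graph.
nn.forward_all(grads)
print(x.g)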

In the first example, backpropagation is performed on the forward graph, which is the usual way in NNabla. In the second example, nn.grad expands the forward graph with the computation of the gradient of y with respect to x. Calling nn.forward_all then runs the forward computation over both the forward and backward graphs.

Note that the second example, like the first, is intended to use only first-order gradients. For that reason bind_grad_output=True is passed to nn.grad, which binds the memory region of the returned grads to x.grad. For more details about the API, please visit Grad API.
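As a sanity check that does not depend on NNabla, the values that both snippets are expected to print can be reproduced in plain NumPy. This is only a sketch: it assumes the upstream gradient at y is all ones (which we believe is the default for both y.backward() and nn.grad when no output gradient is given), and the helper names sigmoid and sigmoid_grad are made up for illustration.

```python
import numpy as np


def sigmoid(a):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))


def sigmoid_grad(a, upstream=None):
    """Gradient of sigmoid(a) w.r.t. a, multiplied by an upstream gradient.

    If no upstream gradient is given, ones are used (assumption: this mirrors
    the default used in the NNabla snippets above).
    """
    s = sigmoid(a)
    if upstream is None:
        upstream = np.ones_like(a)
    return upstream * s * (1.0 - s)


# Same seed and shape as in the snippets above.
rng = np.random.RandomState(313)
x_np = rng.randn(2, 3)
expected = sigmoid_grad(x_np)
print(expected)
```

Up to floating-point precision, x.g in either snippet should agree with this array, since the derivative of the sigmoid is y * (1 - y).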

Backward on both backward and forward graphs

In the second example, nn.grad returns a list of nn.Variables, so we can apply any further computation to them as in the first case, and even call the backward method on the result (provided that the corresponding functions support double backwards).

The following example shows further computation on the grads returned by the nn.grad API.

3. Backpropagation on both backward and forward graphs
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF

import numpy as np

rng = np.random.RandomState(313)
b, c, h, w = 16, 3, 32, 32
x = nn.Variable.from_numpy_array(rng.randn(b, c, h, w)).apply(need_grad=True)
x.grad.zero()
y = F.sigmoid(x)
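# Expand the graph with the gradient of y w.r.t. x (grads are nn.Variables).
grads = nn.grad([y], [x])
# Per-sample L2 norm of the gradients over the channel and spatial axes.
norms = [F.sum(g ** 2.0, [1, 2, 3]) ** 0.5 for g in grads]
# Mean squared deviation of the gradient norms from 1.
gp = sum([F.mean((norm - 1.0) ** 2.0) for norm in norms])
gp.forward()
# Backward through both the backward and forward graphs (double backward).
gp.backward()
print(x.g)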

Note that we call x.grad.zero() only so that the gradients can be printed for illustration purposes; it is usually not necessary.
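To see why the third example needs double backwards, it can help to write the same penalty out by hand for the sigmoid. The NumPy sketch below is an illustration under stated assumptions, not NNabla code: it assumes an upstream gradient of ones at y, and the function names gradient_penalty and gradient_penalty_grad are hypothetical. The gradient of the penalty with respect to x involves the second derivative of the sigmoid, which is exactly what backpropagating through the backward graph provides.

```python
import numpy as np


def sigmoid(a):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))


def gradient_penalty(x):
    """Mean over the batch of (||dy/dx||_2 - 1)^2 for y = sigmoid(x)."""
    s = sigmoid(x)
    g = s * (1.0 - s)  # first derivative dy/dx (upstream gradient of ones)
    norms = np.sqrt((g ** 2).sum(axis=(1, 2, 3)))  # per-sample L2 norm
    return np.mean((norms - 1.0) ** 2)


def gradient_penalty_grad(x):
    """Analytic gradient of gradient_penalty w.r.t. x."""
    s = sigmoid(x)
    g = s * (1.0 - s)        # first derivative of the sigmoid
    h = g * (1.0 - 2.0 * s)  # second derivative of the sigmoid
    norms = np.sqrt((g ** 2).sum(axis=(1, 2, 3)))
    # Chain rule: d/dx mean((n - 1)^2) = 2 (n - 1) / B * dn/dx,
    # with dn/dx = g * h / n for each element of a sample.
    coeff = 2.0 * (norms - 1.0) / (norms * x.shape[0])
    return coeff[:, None, None, None] * g * h


rng = np.random.RandomState(313)
x_np = rng.randn(4, 3, 8, 8)
print(gradient_penalty(x_np))
print(gradient_penalty_grad(x_np).shape)
```

The term h never appears in a first-order backward pass, so a function that only implements its plain backward cannot produce this gradient.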

Use Case

Double backwards is normally used when one wants to constrain an optimization problem using first-order gradients. A typical example is the gradient penalty for stabilizing the training of GANs (Generative Adversarial Networks). Common examples of gradient penalties are:

  1. One-centered gradient penalty with randomly linearly interpolated samples, as used in WGAN-GP
  2. Zero-centered gradient penalties such as R1 and R2

The one-centered gradient penalty can be found here in the NNabla examples; in this article we show the second kind, the R1 zero-centered gradient penalty. The following snippet computes it.

4. R1 Zero-centered Gradient Penalty
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF

import numpy as np

# Generator
x_fake = <generator(...)>
p_fake = <discriminator(x_fake)>

# Discriminator Loss
x_real = <nn.Variable(...)>.apply(need_grad=True)
p_real = <discriminator(x_real)>
loss_dis = <gan_loss>(p_real, p_fake)

# R1 Zero-centered Gradient Penalty
grads = nn.grad([p_real], [x_real])
norms = [F.sum(g ** 2.0, [1, 2, 3]) ** 0.5 for g in grads]
r1_zc_gp = sum([F.mean(norm ** 2.0) for norm in norms])

# Total loss for the discriminator
loss_dis += 0.5 * <gamma> * r1_zc_gp

As the code snippet shows, the R1 zero-centered gradient penalty differs from WGAN-GP in two ways: it computes the penalty from the discriminator output on real inputs only, and it pushes the L2 norm of the gradients toward zero rather than toward one.
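The difference between the two targets can be made concrete with a small NumPy sketch. It takes an array of gradients as given (a stand-in for the gradient of the discriminator output with respect to its input, not produced by a real network here), and the function names r1_zero_centered_penalty and one_centered_penalty are hypothetical.

```python
import numpy as np


def r1_zero_centered_penalty(grads):
    """Mean over the batch of the squared L2 norm of the gradients."""
    axes = tuple(range(1, grads.ndim))
    sq_norms = (grads ** 2).sum(axis=axes)
    return sq_norms.mean()


def one_centered_penalty(grads):
    """Mean over the batch of (L2 norm of the gradients - 1)^2."""
    axes = tuple(range(1, grads.ndim))
    norms = np.sqrt((grads ** 2).sum(axis=axes))
    return ((norms - 1.0) ** 2).mean()


# Stand-in gradients: all zeros, and one unit-norm gradient per sample.
zero_grads = np.zeros((4, 3, 2, 2))
unit_grads = np.zeros((4, 3, 2, 2))
unit_grads[:, 0, 0, 0] = 1.0

print(r1_zero_centered_penalty(zero_grads), one_centered_penalty(zero_grads))
print(r1_zero_centered_penalty(unit_grads), one_centered_penalty(unit_grads))
```

The zero-centered penalty vanishes when the gradients are zero, while the one-centered penalty vanishes when every per-sample gradient has unit norm.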