# Double Backwards

NNabla has supported double backward features since version 1.1.0.

In this blog post, we briefly describe double backward in Neural Network Libraries (NNabla) and its typical use cases.

## Features of Double Backward

There are two main features of double backward:

- Forward Graph Expansion
- Backward on Both Forward and Backward Graphs

Let’s have a look at each.

### Forward Graph Expansion

In the original design of NNabla's graph engine, *backpropagation* is executed by back-traversing a forward graph without explicitly creating a backward graph. This is still one way to implement backpropagation. The other is to expand the forward graph, i.e., to flip the forward graph and create an explicit backward graph.

The following code snippets perform the same computation but in different ways.

##### 1. Backpropagation on forward graph

```
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF
import numpy as np
rng = np.random.RandomState(313)
x = nn.Variable.from_numpy_array(rng.randn(2, 3)).apply(need_grad=True)
x.grad.zero()
y = F.sigmoid(x)
y.forward()
y.backward(clear_buffer=True)
print(x.g)
```

##### 2. Backpropagation by forward graph expansion

```
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF
import numpy as np
rng = np.random.RandomState(313)
x = nn.Variable.from_numpy_array(rng.randn(2, 3)).apply(need_grad=True)
x.grad.zero()
y = F.sigmoid(x)
grads = nn.grad([y], [x], bind_grad_output=True)
nn.forward_all(grads)
print(x.g)
```

In the first example, backpropagation is performed directly on the forward graph; this is the usual way in NNabla. In the second example, the forward graph is expanded by *nn.grad*, which differentiates *y* with respect to *x*. The forward computation on both the forward and backward graphs is then performed by calling *nn.forward_all*.

Note that the second example is intended to compute only first-order gradients, like the first example, so *bind_grad_output=True* is passed to *nn.grad*, which binds the memory region of the returned *grads* to that of *x.grad*. For more details, please see the Grad API documentation.
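Both snippets print the same values: the derivative of the sigmoid, which is analytically sigma(x) * (1 - sigma(x)). As a quick sanity check independent of NNabla, a pure-NumPy sketch comparing the analytic derivative against a central finite difference (the function names here are illustrative, not NNabla API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Analytic first derivative: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def numerical_grad(f, x, eps=1e-6):
    # Central finite difference, elementwise
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

rng = np.random.RandomState(313)
x = rng.randn(2, 3)
assert np.allclose(sigmoid_grad(x), numerical_grad(sigmoid, x), atol=1e-6)
```

This is the quantity stored in *x.g* in both snippets above.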

### Backward on Both Forward and Backward Graphs

In the second example, the returned value of *nn.grad* is a list of *nn.Variable*s, so we can perform any further computation on it as in the first case, even call the backward method (as long as the corresponding functions support double backward).

The following example shows a successive computation on the grads returned by the *nn.grad* API.

##### 3. Backpropagation on both backward and forward graphs

```
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF
import numpy as np
rng = np.random.RandomState(313)
b, c, h, w = 16, 3, 32, 32
x = nn.Variable.from_numpy_array(rng.randn(b, c, h, w)).apply(need_grad=True)
x.grad.zero()
y = F.sigmoid(x)
grads = nn.grad([y], [x])
norms = [F.sum(g ** 2.0, [1, 2, 3]) ** 0.5 for g in grads]
gp = sum([F.mean((norm - 1.0) ** 2.0) for norm in norms])
gp.forward()
gp.backward()
print(x.g)
```

Note that we call *x.grad.zero()* only to make the printed gradients clear for illustration purposes; it is usually not necessary.
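Calling backward through the grads, as in the example above, requires the second derivative of the sigmoid, which is sigma'(x) * (1 - 2 * sigma(x)). A pure-NumPy sketch checking this against a finite difference of the first derivative (illustrative helper names, not NNabla API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # First derivative: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def sigmoid_grad2(x):
    # Second derivative: sigma'(x) * (1 - 2 * sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

x = np.linspace(-3.0, 3.0, 7)
eps = 1e-5
num_g2 = (sigmoid_grad(x + eps) - sigmoid_grad(x - eps)) / (2.0 * eps)
assert np.allclose(sigmoid_grad2(x), num_g2, atol=1e-6)
```

When a function's backward is itself built from differentiable functions, NNabla can differentiate through it in exactly this sense.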

## Use Cases

Double backward is typically used when one wants to constrain an optimization problem with first-order gradients. A well-known example is the *gradient penalty* for stabilizing the training of GANs (Generative Adversarial Networks). Common gradient penalties are

- One-centered gradient penalty with randomly linear-interpolated samples used in WGAN-GP
- Zero-centered gradient penalties like R1 and R2
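In formulas, as commonly written in the literature (lambda and gamma are penalty weights, D is the discriminator, and x-hat is a random linear interpolation between real and fake samples):

```latex
% One-centered gradient penalty (WGAN-GP), on interpolated samples:
\mathcal{L}_{\mathrm{GP}}
  = \lambda \, \mathbb{E}_{\hat{x}}
    \left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right]

% Zero-centered R1 penalty, on real samples only:
\mathcal{L}_{\mathrm{R1}}
  = \frac{\gamma}{2} \, \mathbb{E}_{x \sim p_{\mathrm{data}}}
    \left[ \lVert \nabla_{x} D(x) \rVert_2^2 \right]
```

Both terms are functions of first-order gradients, so minimizing them requires differentiating through the gradients, i.e., double backward.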

The one-centered gradient penalty can be found in the NNabla examples; in this article we show the second one, the R1 zero-centered gradient penalty. The following is a snippet for the R1 zero-centered gradient penalty.

##### 4. R1 Zero-centered Gradient Penalty

```
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF
import numpy as np
# Generator
x_fake = <generator(...)>
p_fake = <discriminator(x_fake)>
# Discriminator Loss
x_real = <nn.Variable(...)>.apply(need_grad=True)
p_real = <discriminator(x_real)>
loss_dis = <gan_loss>(p_real, p_fake)
# R1 Zero-centered Gradient Penalty
grads = nn.grad([p_real], [x_real])
norms = [F.sum(g ** 2.0, [1, 2, 3]) ** 0.5 for g in grads]
r1_zc_gp = sum([F.mean(norm ** 2.0) for norm in norms])
# Total loss for the discriminator
loss_dis += 0.5 * <gamma> * r1_zc_gp
```

As the snippet shows, the R1 zero-centered gradient penalty uses only the discriminator output on real input to compute the penalty (no interpolated samples), and the squared L2 norm of the gradients is penalized toward zero rather than toward one; both points differ from WGAN-GP.
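Concretely, the penalty term in the snippet reduces to the batch mean of the squared per-sample L2 norm of the gradients. A minimal NumPy sketch of just that reduction, with a random array standing in for the gradient of *p_real* with respect to *x_real* (shapes assumed for illustration):

```python
import numpy as np

rng = np.random.RandomState(313)
b, c, h, w = 16, 3, 32, 32
g = rng.randn(b, c, h, w)  # stand-in for the gradient of p_real w.r.t. x_real

# Per-sample L2 norm over (c, h, w), as in the F.sum(..., [1, 2, 3]) ** 0.5 line
norm = np.sum(g ** 2.0, axis=(1, 2, 3)) ** 0.5
# R1 penalty: mean of the squared norms
r1_zc_gp = np.mean(norm ** 2.0)

# Squaring the norm cancels the square root, so this equals the batch mean
# of the summed squared gradient entries.
assert np.isclose(r1_zc_gp, np.mean(np.sum(g ** 2.0, axis=(1, 2, 3))))
```

In the NNabla version, *g* is produced by *nn.grad* and stays differentiable, so the penalty can be added to the discriminator loss and backpropagated.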