Released Neural Network Libraries v1.32.0! Performance improvement, Python 3.10 support, etc.

Monday, December 12, 2022


Posted by Takuya Narihira

We have released Neural Network Libraries v1.32.0!
Please see the “Spotlight” section for important changes.


Performance improvement in auto-forward mode

Memory efficiency

In previous versions, auto-forward mode was much less memory-efficient than static graph mode, which made users hesitant to use it even though it is more intuitive and flexible.
Now you can use auto-forward mode without sacrificing memory efficiency. During forward propagation, it automatically releases as much memory as possible, based on which variables are still needed for gradient computation in backpropagation. The following table shows that auto-forward (dynamic-graph) execution of the StyleGAN2 discriminator now uses less than half the memory of the previous version, and almost the same amount as static graph mode.

Static     Dynamic (this version)     Dynamic (previous)
3168 MB    3187 MB                    8143 MB
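
As a quick check of the figures in the table, the following sketch computes the reduction relative to the previous version and the remaining gap to static graph mode. The variable names are only illustrative; the numbers are the ones reported above.

```python
# Memory figures (MB) copied from the table above.
static_mb = 3168
dynamic_new_mb = 3187
dynamic_old_mb = 8143

# How many times less memory auto-forward mode uses than in the previous version.
improvement = dynamic_old_mb / dynamic_new_mb

# Remaining overhead of auto-forward mode relative to static graph mode, in percent.
overhead_pct = (dynamic_new_mb - static_mb) / static_mb * 100

print(f"previous / this version: {improvement:.2f}x")
print(f"overhead vs. static graph: {overhead_pct:.1f}%")
```

The ratio comes out at roughly 2.5x, and the remaining overhead compared with static graph mode is under 1%.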

The following examples demonstrate how memory is released in auto-forward mode.

Usage example 1

import nnabla as nn
import nnabla.functions as F

x = nn.Variable()
with nn.auto_forward():
    y = F.identity(x)  # F.identity has no grad dependencies.
    # Rebinding the Python variable y
    # releases the previous y and its memory.
    y = F.identity(y)

Usage example 2

import nnabla as nn
import nnabla.functions as F

def local_scope(x):
    h = F.identity(x)  # F.identity has no grad dependencies.
    y = F.identity(h)
    return y

x = nn.Variable()
with nn.auto_forward():
    # After exiting local_scope,
    # the local Variable "h" and its memory are released.
    y = local_scope(x)

Note that because the release of an nnabla.Variable depends on the Python GC, the exact timing of memory release is not guaranteed, although it is highly likely to be immediate.
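
The release timing described above can be observed in plain Python without nnabla. The following is a minimal sketch, assuming CPython's reference counting; `Buffer` and `identity` are hypothetical stand-ins for a memory-owning variable and an operation with no grad dependencies, not nnabla APIs.

```python
import weakref

released = []  # names of buffers that have been reclaimed, in order


class Buffer:
    """Hypothetical stand-in for a variable that owns memory."""

    def __init__(self, name):
        self.name = name
        # Record the name when this object is reclaimed.
        weakref.finalize(self, released.append, name)


def identity(buf, name):
    # Like an op with no grad dependencies:
    # the output keeps no reference to its input.
    return Buffer(name)


x = Buffer("x")
y = identity(x, "y1")
y = identity(y, "y2")  # rebinding drops the last reference to "y1"
after_rebind = list(released)


def local_scope(x):
    h = identity(x, "h")
    return identity(h, "y3")


y = local_scope(x)  # "h" is dropped on scope exit, "y2" on rebinding
after_scope = list(released)

print(after_rebind)
print(sorted(after_scope))
```

On CPython both releases happen as soon as the last reference disappears. Other Python implementations may defer them, which is why the timing is not strictly guaranteed.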

Removed overhead in transpose operation

The transpose operation in the CUDA backend has relied on NVIDIA’s cuTENSOR since nnabla v1.30.0. In previous versions, we created a cuTENSOR handle every time a transpose operation was created, which caused significant overhead, especially in auto-forward mode.
Now the created cuTENSOR handle is cached and reused across transpose calls, which greatly reduces the overhead (roughly 100x faster). Because transpose is used internally by several other functions, including LayerNormalization, this improvement benefits a wide range of network architectures, including Transformers.
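
The handle caching described above follows a common pattern: create an expensive resource once and reuse it on later calls. The following is a minimal sketch of that pattern in plain Python. `Handle`, `get_handle`, and the per-device cache key are assumptions made for illustration and do not reflect nnabla's actual implementation.

```python
import functools

creation_count = 0  # how many times the expensive handle was created


class Handle:
    """Hypothetical stand-in for a handle that is expensive to create."""

    def __init__(self, device_id):
        global creation_count
        creation_count += 1
        self.device_id = device_id


@functools.lru_cache(maxsize=None)
def get_handle(device_id):
    # The first call per device creates the handle; later calls reuse it.
    return Handle(device_id)


def transpose(matrix, device_id=0):
    handle = get_handle(device_id)  # cached, not recreated on each call
    assert handle.device_id == device_id
    return [list(row) for row in zip(*matrix)]


for _ in range(1000):
    result = transpose([[1, 2, 3], [4, 5, 6]])

print(creation_count)
print(result)
```

Even after 1000 calls the handle is created only once, so the creation cost no longer scales with the number of transpose operations.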

Fixed SyncBN backward (CPU / GPU)

The previous SyncBN definition and implementation in nnabla had a potential issue: under a standard data-parallel training pipeline, all-reduce of the gradients for beta and gamma was performed twice. As a result, the gradients used for parameter updates were unexpectedly large, scaled up by a factor equal to the number of GPU workers.

# Pseudo code
with auto_forward():
  h = F.sync_batch_normalization(x, comm=comm)
# Before this change, all-reduce for the gradients of beta & gamma
# was performed at the backward function call.
grads = get_grad_arrays()
# All-reduce for the gradients of beta & gamma in sync BN was performed here again

We have now changed the definition of the backward pass of SyncBN so that it no longer performs all-reduce operations on the gradients of beta and gamma, which avoids duplicate sum reductions across devices.
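
To see why the duplicated reduction scales the gradients by the number of workers, consider the following minimal sketch. It simulates a sum all-reduce over a list of per-worker values in plain Python. `all_reduce_sum` and the gradient values are hypothetical and are not part of the nnabla communicator API.

```python
def all_reduce_sum(per_worker_values):
    """Simulated sum all-reduce: every worker ends up with the global sum."""
    total = sum(per_worker_values)
    return [total] * len(per_worker_values)


# Hypothetical local gradients of one parameter on 4 workers.
local_grads = [0.5, 1.0, 1.5, 2.0]
num_workers = len(local_grads)

# Intended behavior: a single all-reduce in the training pipeline.
reduced_once = all_reduce_sum(local_grads)

# Previous behavior: values already reduced in backward are reduced again.
reduced_twice = all_reduce_sum(reduced_once)

factor = reduced_twice[0] / reduced_once[0]
print(reduced_once[0], reduced_twice[0], factor)
```

After the first reduction every worker already holds the global sum, so summing those values again multiplies the result by the number of workers.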

Support Python 3.10 (CPU / GPU)

We have added support for Python 3.10. Along with this update, the onnx and tensorflow packages used by the file format converter have been updated to v1.12.0 and v2.8.x, respectively.


OP Layer



Format Converter