We have released Neural Network Libraries v1.32.0!
Please see the “Spotlight” section for important changes.
Performance improvement in auto-forward mode
In previous versions, auto-forward mode was much less memory-efficient than static graph mode, making users hesitant to use it even though it was more intuitive and flexible.
Now you can use auto-forward mode without sacrificing memory efficiency. During forward propagation, it automatically releases as much memory as possible, based on which variables the gradient computation in backpropagation depends on. The following table shows that auto-forward (dynamic-graph) execution of the StyleGAN2 discriminator now uses less than half the memory of the previous version (about 2.5x less), nearly matching static graph mode.
| Static | Dynamic (this version) | Dynamic (previous) |
|---|---|---|
| 3168 MB | 3187 MB | 8143 MB |
The following examples demonstrate how memory is released in auto-forward mode.
Usage example 1

    import nnabla as nn
    import nnabla.functions as F
    from nnabla import auto_forward

    x = nn.Variable()
    with auto_forward():
        y = F.identity(x)  # F.identity has no grad dependencies.
        # The rebinding of the Python variable y
        # releases the previous y and its memory.
        y = F.identity(y)
Usage example 2

    def local_scope(x):
        h = F.identity(x)  # F.identity has no grad dependencies.
        y = F.identity(h)
        return y

    x = nn.Variable()
    with auto_forward():
        # After exiting local_scope,
        # the local Variable "h" and its memory are released.
        y = local_scope(x)
Note that because the timing of releasing an nnabla.Variable depends on the Python GC, the timing of memory release is not guaranteed, although it is expected to be immediate in practice.
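The dependence on Python's garbage collection can be illustrated without nnabla. The following is a minimal sketch in plain Python, assuming CPython's reference counting. The `Buffer` class, the `identity` helper, and the `released` list are hypothetical stand-ins for a Variable, a function without grad dependencies, and freed memory; they are not nnabla APIs.

```python
import weakref

released = []  # names of buffers whose "memory" has been freed


class Buffer:
    """Hypothetical stand-in for a Variable that owns memory."""

    def __init__(self, name):
        self.name = name
        # Record the release when the object is garbage-collected.
        weakref.finalize(self, released.append, name)


def identity(buf, name):
    # Like a function without grad dependencies, the output keeps
    # no reference to its input.
    return Buffer(name)


def local_scope(x):
    h = identity(x, "h")
    return identity(h, "y2")


x = Buffer("x")
y = identity(x, "y0")
y = identity(y, "y1")  # rebinding drops the last reference to "y0"
after_rebind = list(released)

y = local_scope(x)  # "h" goes out of scope; rebinding also drops "y1"
after_scope = list(released)
```

On interpreters without reference counting, the same objects would still be collected, but possibly later, which is why the release timing is described as expected rather than guaranteed.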
The transpose operation in the CUDA backend has relied on NVIDIA’s cuTENSOR since nnabla v1.30.0. In previous versions, we created a cuTENSOR handle every time we created a transpose operation, which caused significant overhead, especially in auto-forward mode.
Now the created cuTENSOR handle is cached and reused across transpose calls, greatly reducing the overhead (roughly 100x faster). Because transpose is used internally by several other functions, including LayerNormalization, this improvement benefits a wide range of network architectures, including Transformers.
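The caching idea can be sketched in plain Python. This is an illustrative analogy only, not nnabla's actual C++/CUDA implementation: `create_handle` is a hypothetical stand-in for an expensive library-handle creation, `functools.lru_cache` plays the role of the cache, and keying the cache by device ID is an assumption made for the example.

```python
import functools

creation_log = []  # one entry per expensive handle creation


def create_handle(device_id):
    """Hypothetical stand-in for an expensive handle creation."""
    creation_log.append(device_id)
    return object()


@functools.lru_cache(maxsize=None)
def cached_handle(device_id):
    # Created once per device, then reused by every later call.
    return create_handle(device_id)


def transpose(matrix, device_id=0):
    handle = cached_handle(device_id)  # cheap after the first call
    # A real backend would pass `handle` to the library here.
    return [list(row) for row in zip(*matrix)]


results = [transpose([[1, 2], [3, 4]]) for _ in range(100)]
```

After 100 transpose calls, `creation_log` holds a single entry, so the creation cost is paid once instead of on every call.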
The previous SyncBN definition and implementation in nnabla had a potential issue: under a standard data-parallel training pipeline, the all-reduce of the gradients for beta and gamma was performed twice. As a result, the gradients used for parameter updates were unexpectedly large, scaled up by a factor equal to the number of GPU workers.
    # Pseudo code
    with auto_forward():
        h = F.sync_batch_normalization(x, comm=comm)
    # Before this change, all-reduce for the gradients of beta & gamma
    # was performed at the backward function call.
    h.backward()
    grads = get_grad_arrays()
    # All-reduce for the gradients of beta & gamma in sync BN was performed here again
    comm.all_reduce(grads)
    solver.update()
We have changed the definition of the sync BN backward so that it no longer performs all-reduce operations for the gradients of beta and gamma, which avoids duplicate reduce-sum operations across devices.
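The effect of the duplicated all-reduce can be checked with a small arithmetic sketch. This is plain Python with a hypothetical `all_reduce_sum` helper standing in for a communicator; it is not the nnabla communicator API, and the gradient values are made up for illustration.

```python
def all_reduce_sum(per_worker_values):
    """Hypothetical all-reduce: every worker receives the sum over workers."""
    total = sum(per_worker_values)
    return [total for _ in per_worker_values]


num_workers = 4
local_grads = [1.0, 2.0, 3.0, 4.0]  # illustrative gradient of gamma per worker

# Intended pipeline: a single all-reduce after backward.
reduced_once = all_reduce_sum(local_grads)

# Old behavior: sync BN backward already reduced, then the pipeline reduced again.
reduced_twice = all_reduce_sum(all_reduce_sum(local_grads))

# Ratio between the doubly reduced and the intended gradient.
scale = reduced_twice[0] / reduced_once[0]
```

Here `scale` equals `num_workers`, which matches the scaling factor described above.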
We have added support for Python 3.10. Along with this update, the onnx and tensorflow packages used by the file format converter have been updated to v1.12.0 and v2.8.x, respectively.
- Faster LayerNormCuda::setup_impl (GN and IN as well) by removing unnecessary function creation (CPU / GPU)
- Avoid overwriting the net['names'] dict
- Fix save function for the parameters created by narrow
- Fix the timing of fill and zero evaluations in narrowed array
- Fix crash problem of pipeline
- Fix: skip unnecessary download
- Correct index error in generate_cache_dir()
- Complete setup cfg in all conditions
- Use get, not cast, in NaN/Inf check
- Fix cython build failure by missing cutensor.h in non-docker env
- Use a singleton function for cudaGetDeviceProperties to avoid slowdown
- Change python version to 3.9.14 in aarch64 environment
- Change pip installation method (CPU / GPU)
- Added ca-certificates to Dockerfiles (CPU / GPU)
- Fix version mismatch for libarchive.so
- Add python310 to lib_in_wheel
- Python 3.10 converter dependency for onnx and tensorflow
- Loosen converter Python package requirements
- Upgrade onnx to 1.10.0
- Update document for watchdog timeout setting
- Update nnabla converter document
- Add callback APIs to doc