Matrix operations in Deeplearning4j are powered by ND4J, a linear algebra library for n-dimensional arrays. ND4J can be described as “NumPy for the JVM”, with swappable backends supporting CPUs and GPUs. ND4J is available on the most common platforms, including Linux, macOS and Windows on x86_64, as well as Linux on POWER8 (ppc64le). Libnd4j, the native engine that powers ND4J, is written in C++. The CPU backend is implemented with vectorizable OpenMP loops with SIMD support, while the GPU backend is implemented with CUDA.
The key to fast, large-scale linear algebra is exploiting the inherent element-wise nature of matrix operations through parallelization. Java, however, does not offer efficient parallelism below the thread level. Libraries that can efficiently leverage the hardware in this manner are usually written in C/C++ or Fortran. To make the benefits of hardware acceleration available to the JVM, we have to deal with the complexity and challenges of bridging Java and native C++.
To this end, we use JavaCPP (created and maintained by Skymind engineer Samuel Audet). JavaCPP auto-generates JNI bindings by parsing the corresponding C/C++ header files, exploiting the syntactic and semantic similarities between Java and C++. The auto-generated JNI code from JavaCPP has zero overhead compared to manually coded JNI functions. All matrices are stored off-heap, and their memory is managed through JavaCPP. On the Java side, these matrices are accessed simply by passing around pointers. We are therefore not restricted to addressing arrays with Java “ints”, which are always 32 bits wide. In this manner, libnd4j can offer the JVM access to arrays addressed with 64-bit indices.
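To see why the index width matters, here is a small arithmetic sketch in plain Java. The class and method names are hypothetical and chosen only for illustration; nothing here is part of ND4J or JavaCPP. It shows that a 50,000 x 50,000 matrix already has more elements than a 32-bit int can count.

```java
final class IndexWidthSketch {
    // Element count computed in 64-bit arithmetic, so it cannot overflow for these sizes.
    static long elementCount(int rows, int cols) {
        return (long) rows * cols;
    }

    // True if a Java int (and therefore a Java array index) could address every element.
    static boolean fitsInInt(long count) {
        return count <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long count = elementCount(50_000, 50_000);
        System.out.println(count);                 // 2500000000
        System.out.println(Integer.MAX_VALUE);     // 2147483647
        System.out.println(fitsInInt(count));      // false
    }
}
```

A buffer of this size cannot be a single Java array, which is one reason to keep the data off-heap and address it through pointers.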
The mechanics of implementing linear algebra operations in parallel boil down to two cases: all operations can be broken down into element-wise or dimension-wise parallel operations. Here are two simple examples that illustrate these concepts.
Linear element-wise parallelism:
Assume the following code:
INDArray array = Nd4j.create(2, 3);
array.addi(2.5f);
We have an INDArray instance, “array”, which holds a 2x3 two-dimensional tensor, and we want to add a scalar value of 2.5f to each element of this array. In this particular case, our 6-element array will be processed with element-wise parallelism, with one thread (on GPUs) and/or one SIMD lane (on CPUs) processing each element.
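The idea can be mimicked in plain Java. The following is only a rough sketch: the class name is hypothetical, and a parallel stream stands in for the GPU threads or SIMD lanes that the native backend would actually use. The point is that every index is an independent unit of work.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class ElementWiseSketch {
    // Adds a scalar to every element in place. No element depends on any other,
    // so the indices can be processed in any order and in parallel.
    static float[] addScalarInPlace(float[] data, float scalar) {
        IntStream.range(0, data.length)
                 .parallel()
                 .forEach(i -> data[i] += scalar);
        return data;
    }

    public static void main(String[] args) {
        float[] buffer = new float[2 * 3]; // flat, zero-filled buffer behind a 2x3 array
        addScalarInPlace(buffer, 2.5f);
        System.out.println(Arrays.toString(buffer)); // [2.5, 2.5, 2.5, 2.5, 2.5, 2.5]
    }
}
```

Because each write touches a distinct index, no synchronization is needed between workers.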
Assume the same array, but now with a dimensional operation:
INDArray array = Nd4j.create(2, 3);
INDArray sums = array.sum(0);
In this case, we want to create a one-dimensional INDArray, “sums”. It will be a row vector with three elements, where each element is the sum of the two elements in the corresponding column. This operation is processed with respect to dimensions: three separate threads each process a so-called “subarray” along the specified dimension to produce a reduction value.
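A comparable plain-Java sketch of the dimensional case follows. Again, the class name is hypothetical, the buffer is assumed to be stored in row-major order, and a parallel stream stands in for the native threads. Non-zero values are used here so that the result is visible. Each worker owns one column (one subarray along dimension 0) and reduces it to a single value.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class DimensionReduceSketch {
    // Sums a row-major rows x cols buffer along dimension 0, giving one value per column.
    static float[] sumAlongDim0(float[] data, int rows, int cols) {
        float[] sums = new float[cols];
        // One independent task per column; each task walks its own subarray.
        IntStream.range(0, cols).parallel().forEach(c -> {
            float acc = 0f;
            for (int r = 0; r < rows; r++) {
                acc += data[r * cols + c];
            }
            sums[c] = acc; // each task writes to a distinct slot
        });
        return sums;
    }

    public static void main(String[] args) {
        float[] buffer = {1f, 2f, 3f, 4f, 5f, 6f}; // 2x3: [[1, 2, 3], [4, 5, 6]]
        System.out.println(Arrays.toString(sumAlongDim0(buffer, 2, 3))); // [5.0, 7.0, 9.0]
    }
}
```

The parallelism here is across subarrays rather than across individual elements, which is the distinction between the two cases.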
While these simple examples leave out the nuts and bolts of what happens under the hood, they capture the principles in question. It is these same principles that are at play behind more complex cases such as SoftMax or im2col/col2im. Although they involve custom loops, the general idea remains the same, regardless of backends and operations. One master thread controls execution from the JVM side, hooking into the C++ backend to launch a parallel linear algebra operation.
Moreover, to properly utilize the backend, say the GPU, several launch parameters need to be calculated at runtime. For instance, the optimal number of threads depends on the total size of the array in question as well as its extent along each of its dimensions.
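As a purely illustrative sketch of what such a runtime calculation might look like, consider the following. The class, the method names, the caps of 512 threads per block and 256 blocks, and the rounding to multiples of 32 are all assumptions made for this example. They are not the heuristics libnd4j actually uses.

```java
final class LaunchConfigSketch {
    static final int MAX_THREADS_PER_BLOCK = 512; // assumed cap, for illustration only
    static final int MAX_BLOCKS = 256;            // assumed cap, for illustration only

    // Rounds n up to a multiple of 32 and clamps it to [32, MAX_THREADS_PER_BLOCK].
    static int threadsFor(long n) {
        long rounded = ((n + 31) / 32) * 32;
        return (int) Math.max(32, Math.min(MAX_THREADS_PER_BLOCK, rounded));
    }

    // Element-wise case: sized from the total length. Returns {blocks, threadsPerBlock}.
    static int[] forElementWise(long length) {
        int threads = threadsFor(length);
        long needed = (length + threads - 1) / threads; // ceiling division
        int blocks = (int) Math.max(1, Math.min(MAX_BLOCKS, needed));
        return new int[] {blocks, threads};
    }

    // Dimensional case: one block per subarray, threads sized from the subarray length.
    static int[] forReduction(long numSubArrays, long subArrayLength) {
        int blocks = (int) Math.max(1, Math.min(MAX_BLOCKS, numSubArrays));
        return new int[] {blocks, threadsFor(subArrayLength)};
    }

    public static void main(String[] args) {
        int[] small = forElementWise(6);         // the 2x3 array from the examples above
        int[] large = forElementWise(1_000_000);
        int[] reduce = forReduction(3, 2);       // three columns of two elements each
        System.out.println(small[0] + " x " + small[1]);   // 1 x 32
        System.out.println(large[0] + " x " + large[1]);   // 256 x 512
        System.out.println(reduce[0] + " x " + reduce[1]); // 3 x 32
    }
}
```

The same array can thus get a different configuration depending on whether the operation runs over all elements or over subarrays along a dimension.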
Using the approach described here guarantees data consistency across multiple JVM threads, and provides targeted peak performance on all platforms. To boost performance even more and further mitigate other constraints imposed by Java programming practices and the JVM, we have several tricks up our sleeve, some of which we will explore in the following posts.