
High Performance Compute for the JVM - A Prerequisite for DL4J (Part 2)

Sep 20, 2016 / by Susan Eraly and Vyacheslav Kokorin




Unlike most popular Deep Learning frameworks, DL4J was envisioned with Java principles and the JVM in mind. Its backends were once, well, all Java. But those days are long gone, and ND4J now uses native backends for both CPU and CUDA.

CUDA is of special interest to our users, since it dramatically boosts performance in parallel computations. This in turn significantly lowers the time required for tuning and training models. While we seamlessly support NVIDIA’s cuDNN library of deep learning primitives, we also make the power and performance of the GPUs accessible to end users without cuDNN installed.

This kind of stack - all Java with a “native” backend - comes with its own set of unique challenges. These native operations are essentially sets of independent individual operations applied to the same data, which does not play well with Java’s managed memory model; some of the resulting challenges are explored in the next section. To address them, we’re constantly looking for ways to improve performance without breaking code universality for the end user.

Under the hood

Overview

Deep learning operations are mostly linear algebra at their core. All of this magic can be represented as a sequence of algebraic transformations applied to the data in some specific order.
[Figure: a sequence of algebraic transformations applied to the data]

Memory tricks & cheats

Memory Reuse

One of the significant hurdles to address here is allocation cost. Ideally, in a C/CUDA application, device/host memory is allocated once and then reused for as long as possible. Java, however, makes this an entirely different story: local-scope variables are heavily used and might even be considered “good practice”.

To deal with this challenge, we’ve added a special caching layer that guarantees memory reuse over time. The idea is simple: when the JVM releases an INDArray object, the native device/host memory backing it isn’t really freed, but is kept for reuse. When allocation of a similar memory chunk is requested some time later, that request is served directly from the cache, lowering allocation time to a flat constant on the order of tens of nanoseconds. According to our tests, the average cache hit rate for a typical Deep Learning workload is somewhere around 95-100%, depending on the size of the cache.
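A minimal sketch of the idea, in CUDA C++ with hypothetical names (the real ND4J cache is thread-safe, per-device and bounded in size, none of which is shown here):

    #include <cuda_runtime.h>
    #include <unordered_map>
    #include <vector>
    #include <cstdio>

    // Hypothetical size-keyed cache: released device buffers are kept around
    // for reuse instead of being handed back to the driver with cudaFree.
    static std::unordered_map<size_t, std::vector<void*>> freeList;

    void* cachedAlloc(size_t bytes) {
        auto it = freeList.find(bytes);
        if (it != freeList.end() && !it->second.empty()) {
            void* ptr = it->second.back();   // cache hit: constant-time reuse
            it->second.pop_back();
            return ptr;
        }
        void* ptr = nullptr;
        cudaMalloc(&ptr, bytes);             // cache miss: go to the driver
        return ptr;
    }

    void cachedRelease(void* ptr, size_t bytes) {
        freeList[bytes].push_back(ptr);      // keep the buffer instead of freeing it
    }

    int main() {
        void* a = cachedAlloc(1 << 20);      // first allocation hits cudaMalloc
        cachedRelease(a, 1 << 20);
        void* b = cachedAlloc(1 << 20);      // served straight from the cache
        printf("reused: %s\n", a == b ? "yes" : "no");
        return 0;
    }

In practice the cache also has to decide how much memory it may hold on to, which is why the hit rate above depends on the cache size.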

Immutable buffers

Besides what is described above, there are a few cases where the typical workflow involves creating arrays with the same contents over and over again. To optimize for this, we have defined a special case with immutable buffers. This mostly applies to Op parameters and INDArray shapeInformation buffers (the contents of shapeInformation hold rank, dimension sizes, ordering, etc.). On the first use of such an immutable buffer, it is initialized and then moved to device constant memory space for faster access from kernels at runtime. The cache hit rate for this layer is 100% after the first training iteration.
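For illustration, here is roughly how a small immutable buffer such as shape information can be staged in constant memory. This is a simplified sketch with made-up contents, not ND4J’s actual shapeInformation layout:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Simplified "shape information": rank followed by dimension sizes.
    // The real buffer also carries strides, element-wise stride and ordering.
    __constant__ int shapeInfo[8];

    __global__ void printShape() {
        // Reads go through the fast constant cache, shared by all kernels.
        if (threadIdx.x == 0)
            printf("rank=%d, shape=[%d, %d]\n", shapeInfo[0], shapeInfo[1], shapeInfo[2]);
    }

    int main() {
        int hostShape[8] = {2, 3, 4};   // rank 2, shape [3, 4]
        // One-time copy into device constant memory; later kernels just reuse it.
        cudaMemcpyToSymbol(shapeInfo, hostShape, sizeof(hostShape));
        printShape<<<1, 1>>>();
        cudaDeviceSynchronize();
        return 0;
    }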

Dimensional information

The last (but certainly not the least) caching layer is the TAD (Tensor Along Dimension) information cache. As mentioned earlier, Deep Learning involves a sequence of transformations on data, often along specific dimensions. This caching layer targets dimensional information, to speed up navigation within an INDArray along a given dimension or dimensions. The cache stores information partially in device constant memory space and partially in device global memory. The cache hit rate for this layer is also 100% after the first training iteration.
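To make “dimensional information” concrete, here is a toy sketch (not the ND4J TAD implementation) of what a TAD boils down to for a row-major matrix: each row is a tensor along dimension 1, described by an offset and a stride, and it is exactly this kind of data that is worth caching:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Toy TAD example: for a row-major [rows x cols] matrix, the tensors along
    // dimension 1 are the individual rows, each described by an offset and stride.
    __global__ void rowSums(const float* data, float* sums, int rows, int cols) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= rows) return;
        int tadOffset = row * cols;        // where this row's TAD starts
        int tadStride = 1;                 // stride between its elements
        float sum = 0.0f;
        for (int i = 0; i < cols; ++i)
            sum += data[tadOffset + i * tadStride];
        sums[row] = sum;
    }

    int main() {
        const int rows = 4, cols = 3;
        float h[rows * cols], hs[rows];
        for (int i = 0; i < rows * cols; ++i) h[i] = 1.0f;   // all ones

        float *d, *ds;
        cudaMalloc(&d, sizeof(h));
        cudaMalloc(&ds, sizeof(hs));
        cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

        rowSums<<<1, rows>>>(d, ds, rows, cols);
        cudaMemcpy(hs, ds, sizeof(hs), cudaMemcpyDeviceToHost);
        printf("sum of row 0 = %.1f\n", hs[0]);              // prints 3.0

        cudaFree(d);
        cudaFree(ds);
        return 0;
    }

For higher-rank arrays and arbitrary dimension combinations these offsets and strides are no longer trivial to compute, which is why caching them pays off.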

Operations combination

As discussed in an earlier blog post, all op executions involve internal parallelism, and here we address the most important of all CUDA mechanics: memory bandwidth.

Let us consider how an op works from the perspective of a single CUDA kernel thread:

  • A thread reads an array element from global memory
  • An operation is applied to that element
  • The result is written back to global memory

Thousands of such threads do this at the same time, symmetrically. This is what is called the SIMD/SIMT execution model: Single Instruction, Multiple Data / Single Instruction, Multiple Threads.
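In code, that per-thread pattern looks roughly like the following elementwise kernel (a generic CUDA sketch with a made-up op, not an actual ND4J kernel):

    #include <cuda_runtime.h>
    #include <cstdio>

    // One thread per element: read from global memory, apply the op, write back.
    __global__ void scalarMul(const float* x, float* z, float alpha, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            z[i] = x[i] * alpha;   // load -> transform -> store
    }

    int main() {
        const int n = 1 << 20;
        float *x, *z;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&z, n * sizeof(float));
        cudaMemset(x, 0, n * sizeof(float));

        int threads = 256;
        int blocks = (n + threads - 1) / threads;   // thousands of identical threads
        scalarMul<<<blocks, threads>>>(x, z, 0.5f, n);
        cudaDeviceSynchronize();
        printf("processed %d elements\n", n);

        cudaFree(x);
        cudaFree(z);
        return 0;
    }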

However, one can easily imagine a situation where two consecutive operations are, in fact, applied to the same array. This results in double kernel calls, double global memory reads, double op application and double global memory writes. Certainly not ideal. Our solution here is “automatic op combination”, which results in the following scenario:

  • A thread reads an array element from global memory
  • Operation A is applied to this element
  • Operation B is applied to this element
  • The result is written back to global memory

In this manner, we save a whole second global memory read/write cycle, effectively doubling our memory bandwidth.
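A hedged sketch of what such a combination looks like at the kernel level, using two made-up ops (a scalar multiply followed by an exponential) rather than ND4J’s actual fused kernels:

    #include <cuda_runtime.h>

    // Unfused: two kernel launches, two full passes through global memory.
    __global__ void mulOp(const float* x, float* z, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) z[i] = x[i] * a;          // read + write, pass #1
    }
    __global__ void expOp(const float* x, float* z, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) z[i] = expf(x[i]);        // read + write, pass #2
    }

    // Fused: one kernel launch, one read and one write per element.
    __global__ void mulThenExp(const float* x, float* z, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = x[i] * a;              // op A, result stays in a register
            z[i] = expf(v);                  // op B, single write back
        }
    }

Launched with the same grid as the kernel shown earlier, the fused version touches global memory once per element instead of twice, which is exactly where the bandwidth saving comes from.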

In the latest release (0.6.0, that is) we’ve introduced an initial version of a sophisticated execution pipeline that allows us to combine certain sequences of operations into single solid blocks. This feature boosts performance by reducing the memory bandwidth required for individual operations and by reducing the total number of calls going to the GPU.

In subsequent releases, support for such combinations will be extended, based on the evolving demands of the use cases we see from our customer base.

Topics: Engineering


Written by Susan Eraly and Vyacheslav Kokorin

Susan and Vyacheslav are Deep Learning Engineers at Skymind. They work on all things native - OpenMP + CUDA - in the DL4J stack.
