In this post, our founder and CTO, Adam Gibson, is interviewed by one of our open-source contributors, Francois Garillot, on recent changes to Deeplearning4j's parameter server.
Hi Adam, thanks for agreeing to chat with me. I heard some parts of Deeplearning4j are now using Aeron, as of release 0.7.0, and I wanted to ask you a few questions about it. But first, remind me and our audience, what is Deeplearning4j?
In this context, it is actually a whole ecosystem of libraries, one that has everything for deep learning: you have data transforms, you have a UI, you have distributed systems, you have reinforcement learning, you have various kinds of streaming integrations. It's literally a whole ecosystem of libraries for building deep learning applications. The main emphasis is not necessarily on research but actually connecting a deep learning model to production systems, connecting it to a local database, running on your Hadoop cluster — in the same library! We're able to build comprehensive pipelines connecting production systems to new systems, deploying models as microservices, among other things. Deeplearning4j is actually a sub-library now: it is just the name of the library that started all this. It now contains mostly a deep learning DSL.
So that's equivalent to Theano, Tensorflow and all these other libraries. In fact, if you were to map this to the Python ecosystem, think of Tensorflow, plus Keras, plus Pandas, plus various signal-processing libraries, plus Flask, plus...
That's very comprehensive. So, how does Aeron fit into this?
Well, Deeplearning4j is a whole connected ecosystem of libraries focused on deep learning applications. And Aeron is literally a raw UDP library. What it does is facilitate peer-to-peer transactions via something called a Media Driver. A Media Driver is something you can think of as the coordination system that lets you communicate. It's mainly used in high-frequency trading systems. In essence, it does network communications and it's vastly faster than, say, Google RPC.
Okay. So, if I understand correctly, the ecosystem beyond the Deeplearning4j library consists mainly of an interconnection between many vastly different modules that also extends to the distributed world. And in that context, I can see how having a really good message passing library helps. So, how did you come by Aeron, how did you learn of it?
It's funny. I was just Google searching around. I actually saw some of the acronyms floating about. I think I saw some Google RPC benchmarks. I'm always interested in distributed systems, so I read up on it, and was kind of surprised at what I saw. Then, I Googled around a bit and also saw that the Akka guys were starting to use it for a new message passing component in their actor system. I looked at it a bit and I figured it might be a good way to implement a parameter server.
So, to drill down a bit, what are the components that specifically use Aeron in the Deeplearning4j ecosystem? Is that the parameter server, mostly?
Right, because it's mostly message passing. The idea here — and in this case why we did our own — was that among many reasons, we basically wanted something that wasn't just a blob of C code that you'd deploy. There's a big emphasis in the Python world on kind of opaque code that you download and somehow runs, but usually, very inconsistently.
What we wanted was something familiar in Java that was fast.
Something fast in Java? Isn't Java a performance bottleneck?
I know Java has a reputation for being slow for deep learning applications and native code. In reality, it's not true. It's actually a better interface than Python. There's a serious amount of code written in Java that's focused on low-latency trading systems, and those systems are actually faster than what the "Google and Facebook" of the world need.
So what we wanted to do was leverage that extra piece, but in this case for machine learning. People in the machine-learning world usually are on Python and they don't see much of the high-frequency trading world. Trading infrastructure is a little outside the expertise of most of the people in this space: they just don't know about it. And they often don't have an inherent interest in using it because it doesn't have a Python interface.
This is mainly meant for production systems. The neat thing about the Java world is it has this intersection of programmers who built games and wrote databases. Unlike Python, you can actually write a database in Java and actually have it do moderately well in speed. There are also these distributed systems people who know how to bend the Java virtual machine to do things it is normally not meant to do. This whole ecosystem of people exists in various big companies, but you don't typically hear about them. This is where a lot of our ecosystem originates.
It's fascinating that there is so much usable from those communities. So, just to circle back to the subject of the parameter server: what does our parameter server do? Why is it useful in the training phase of deep learning and where does Aeron come in?
Right. So the parameter server is something you can think of as a way of sending neural net weights around. A neural net is essentially a set of weighted connections that represent a learnt state. At the end of the day this is all just one big vector. A vector, to avoid any fancy terms, is just a list of floating-point numbers. We use this to coordinate the learning of weights over a distributed network, or over a distributed cluster. You can parallelize neural network training via model parallelism or data parallelism, but in both contexts, the parameter server is just a way of communicating progress in model training over the cluster.
To boil it down, you have several parts of a dataset that you're training on and you're acting on all of this in some parallel fashion. You need to do some sort of coordination and this is what a parameter server is for.
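To make the coordination step concrete, here is a minimal in-process sketch in Java. It assumes parameter averaging as the coordination scheme, which is one common choice; the interview does not say which scheme the Deeplearning4j parameter server actually uses, and the class and method names below are hypothetical. Each worker's weights are a flat vector of floats, and the "server" simply combines them element by element.

```java
import java.util.Arrays;
import java.util.List;

/** Toy, in-process stand-in for a parameter server (hypothetical, not DL4J's API). */
final class ParameterAveragingSketch {

    /** Element-wise mean of equally sized weight vectors, one vector per worker. */
    static float[] average(List<float[]> workerWeights) {
        if (workerWeights.isEmpty()) {
            throw new IllegalArgumentException("need at least one worker vector");
        }
        int length = workerWeights.get(0).length;
        float[] mean = new float[length];
        // Sum the contribution of every worker into one vector.
        for (float[] weights : workerWeights) {
            if (weights.length != length) {
                throw new IllegalArgumentException("all vectors must have the same length");
            }
            for (int i = 0; i < length; i++) {
                mean[i] += weights[i];
            }
        }
        // Divide by the number of workers to obtain the shared weights.
        for (int i = 0; i < length; i++) {
            mean[i] /= workerWeights.size();
        }
        return mean;
    }

    public static void main(String[] args) {
        float[] workerA = {1.0f, 2.0f, 3.0f};
        float[] workerB = {3.0f, 4.0f, 5.0f};
        float[] shared = average(Arrays.asList(workerA, workerB));
        System.out.println(Arrays.toString(shared)); // prints [2.0, 3.0, 4.0]
    }
}
```

In a real cluster the vectors would travel over the network between machines rather than sit in one list, which is where the messaging layer discussed in this interview comes in.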
So the parameter server is basically at the core of all this vector passing? Is there a specific challenge in the pattern of the communication about why it needs so much performance?
The major thing here is, you're moving gigabytes of data around like it's nothing. You can imagine a vector of a million floats, a couple of megabytes, but here's the thing: when you're doing, say, asynchronous training and a bunch of other things, you're going to end up with a lot of network latency. The other thing here is, if you're going to do distributed deep learning, which is already highly computationally intensive, you don't want the communication overhead over the network to be a bottleneck. The faster that network communication is, the faster your neural net training is. And there are a couple of tricks in our particular parameter server implementation, but the speed brought by Aeron is a huge help.
You also mentioned high-frequency trading, and I find that interesting, because I heard that this domain uses network transports that are slightly different from what we regularly use. Now I'm used to my regular gigabit ethernet, but trading systems always seem to be using the hardware of tomorrow. Is that relevant with respect to deploying deep learning models? How does that play in our area?
The major thing with different hardware is, basically, in a high-frequency trading world, there's a big emphasis on what we'll call messaging. Trades happen in microseconds and nanoseconds. You need everything at that scale, because otherwise you basically lose the race on trading. A lot of trading is basically a race to capture the right price.
But is something like RDMA relevant for deep learning? How does our world change with new hardware?
That's mainly what Aeron is for: you can also use it for RDMA. RDMA stands for Remote Direct Memory Access. It's a special interconnect between two machines, usually over some sort of PCI Express bus. Basically, connecting two CPUs to GPUs. With Aeron, you have that low-level control and you can basically start network communications over either a PCI-E bus, or just over ethernet. In the case of RDMA you're usually connecting GPUs, and they are the hardware of choice in a lot of deep learning applications.
That sounds fantastic. We could have at some point a GPU to GPU communication over the network?
It wouldn't be a network in the normal sense. It would usually be on a shared rack with a special interconnect. You'd have to have special wiring, but it would virtually be a "network".
Fantastic. There's potential in the future for a massive boom in performance by using that hardware that Aeron is already ready for at the message-writer level. How long was the rewrite using Aeron? How much effort did it take from how many people?
I did it by myself and it took about...the first prototype only took about a couple of weeks.
Wow. Do you have any ideas of the numbers? The order of magnitude of this speed-up that we've seen?
What we're seeing now, just from our communications and compression, is a ten-times speed-up so far. Just from some first numbers. It's pivotal to a lot of our communication now: we couldn't even do asynchronous training or anything like that before. Now we're opening up the floodgates to different ways we can communicate. Right now, it's just over Apache Spark, which has a lot of overhead in communicating. The fact that we moved all of our communications over to our Aeron-based system actually allowed us to implement many more training modes. It also allowed us to optimize network communications.
Impressive. And if what I've read is correct, just like Netty, Aeron doesn't use the JVM's memory management. So I assume that it's also a good match for what we are transmitting over the network, which are mostly off-heap ND4J arrays, right?
Great. As a last part, what were the difficult spots, or at least the not-so-easy parts that you would give as advice to somebody who would like to try Aeron for their particular networking needs?
The major thing that I would recommend if you're going to start with Aeron, is that for support, they have, much like we do, a Gitter channel. Start with their Wiki and make sure that you read through the different protocols, the different libraries available. And also, before you do anything else, I would get an understanding of how byte buffers work in general. I would start with NIO before starting with Aeron, because NIO is not only easier, it's been around for a long time, and it has some of the same ideas built into it. Aeron is basically a modern re-thinking of some of that, but it also enables different network communications on top.
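As a starting point for the NIO advice above, the sketch below packs a small weight vector into a direct (off-heap) `java.nio.ByteBuffer` and reads it back. It uses only the standard JDK; the class name, the little-endian byte order, and the sample values are choices made for this illustration, not something prescribed by Aeron or ND4J.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Illustrative only: moves floats in and out of an off-heap NIO buffer. */
final class DirectBufferSketch {

    /** Copies the weights into a direct buffer that is ready to be read from position 0. */
    static ByteBuffer encode(float[] weights) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(weights.length * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (float w : weights) {
            buffer.putFloat(w); // advances the position by 4 bytes per float
        }
        buffer.flip(); // switch from writing to reading: limit = position, position = 0
        return buffer;
    }

    /** Reads every remaining float without disturbing the caller's buffer position. */
    static float[] decode(ByteBuffer buffer) {
        // duplicate() shares the bytes but resets the byte order, so restore it.
        ByteBuffer view = buffer.duplicate().order(buffer.order());
        float[] weights = new float[view.remaining() / Float.BYTES];
        for (int i = 0; i < weights.length; i++) {
            weights[i] = view.getFloat();
        }
        return weights;
    }

    public static void main(String[] args) {
        ByteBuffer encoded = encode(new float[] {0.5f, -1.25f, 3.0f});
        System.out.println(encoded.isDirect());  // prints true
        System.out.println(encoded.remaining()); // prints 12
        System.out.println(Arrays.toString(decode(encoded))); // prints [0.5, -1.25, 3.0]
    }
}
```

The position, limit, and `flip()` mechanics shown here are the part worth getting comfortable with before moving on to a messaging library built around similar buffer abstractions.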
Sounds useful. Did you encounter the concept of back pressure with Aeron?
Actually Aeron already has built-in back pressure. It will give you an error code saying your message was back pressured and you just need to handle it. You meet that concept in producer-subscriber pairs, just like in any modern distributed system. What I would recommend there is just having robust code for reacting to this and retrying your communication. That's a lot of what you'll spend your time handling.
Great. One thing that is very specific about Aeron is that it depends on Java 8. Where do you see this going in terms of advantages and disadvantages? How do you think this dependency will go in the future? What is the uptake and adoption of Java 8 you've seen in these large enterprise companies?
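The "react and retry" pattern can be sketched without any Aeron dependency. In the code below, a `LongSupplier` stands in for a single attempt to send a message, and `BACK_PRESSURED` is a constant defined for this sketch; both are assumptions for illustration. In Aeron itself the equivalent signal is, to the best of our knowledge, a negative return value from `Publication.offer` such as `Publication.BACK_PRESSURED`, so check the Aeron documentation before mapping this onto real code.

```java
import java.util.function.LongSupplier;

/** Illustrative retry loop for a back-pressured send (not Aeron's actual API). */
final class BackPressureRetrySketch {

    /** Status code defined for this sketch: the receiver cannot keep up right now. */
    static final long BACK_PRESSURED = -2L;

    /**
     * Calls the offer up to maxAttempts times. Returns the first result that is not
     * BACK_PRESSURED (a positive position on success, or some other negative error),
     * or BACK_PRESSURED if every attempt was pushed back.
     */
    static long offerWithRetry(LongSupplier offer, int maxAttempts) {
        long result = BACK_PRESSURED;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            result = offer.getAsLong();
            if (result != BACK_PRESSURED) {
                return result; // success, or an error that retrying will not fix
            }
            Thread.yield(); // give the consumer a chance to drain before retrying
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulated publication: pushed back twice, then accepted at position 64.
        int[] calls = {0};
        LongSupplier fakeOffer = () -> ++calls[0] <= 2 ? BACK_PRESSURED : 64L;
        long position = offerWithRetry(fakeOffer, 5);
        System.out.println(position + " after " + calls[0] + " attempts"); // prints 64 after 3 attempts
    }
}
```

Bounding the number of attempts is a deliberate choice here: an unbounded spin would hide a consumer that has stalled for good, whereas a bounded loop hands the decision back to the caller.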
Java 8 is going to be a problem. For our own managed deployments we're fine, but for enterprise customers, we might have issues with them adopting it. We're working on some special things there, since there will be certain places where we just can't use it. Any customer who wants speed will be willing to work with us on upgrading their programs to Java 8. Overall, Java 8 has seen quite a bit of adoption, because it's like Java 5: it has compelling features that you just need when you're coding, like lambdas. I think in general a lot of new code will be Java 8 and people will gradually migrate from Java 7, because Java 7 is at the end-of-life. Another consideration that we had to think about was Android. A lot of our code runs on Android as well. And that's why while we have parts of our code in Java 8, most of our code is going to stay in Java 7.
Okay. Another thing that I wanted to ask with respect to Java 8. Doesn't Skymind use containers to bypass this issue in terms of deployment?
Yes, but despite what the hype would have you believe, containers are not in most places in the world yet. Most people don't even know what containers are. We may sometimes be the first usage of containers that a customer may see. Even then we try to bundle it so they don't see containers. A lot of enterprise customers don't collocate Docker with Hadoop. And many Hadoop administrators may be wary of Docker.
Right. So there's a long road ahead. Thank you so much for talking to me, that was positively enlightening!
Where to learn more:
To learn more about Aeron, please visit http://highscalability.com/
The Akka team re-wrote the remoting module in Akka, now named Artery, and as mentioned above this used Aeron as one of the many optimizations. They reported on this re-write in two fantastic blog posts, a must-read for those interested in refactoring networking for performance. The second post touches specifically on Aeron adoption:
Streams in Artery - http://blog.akka.io/artery/
Aeron in Artery - http://blog.akka.io/artery/