Excerpted from an arXiv paper by Keno Fischer and Elliot Saba; compiled by the Heart of Machine editorial board.

The Julia language is evolving rapidly and is often said to combine the flexibility of Python with the speed of C. However, TensorFlow and PyTorch do not currently offer official Julia support. A group of researchers has therefore built TPU support for Julia on top of the underlying XLA compiler. They report that this approach can compile a VGG19 model written in Julia into a single TPU executable and invoke the TPU for efficient computation. “Julia + TPUs = fast and easily expressible ML computations!” tweeted Jeff Dean, head of Google AI.

1. Introduction

One of the fundamental forces driving the steady advance of machine learning over the past few years is the enormous amount of computation available for training and optimizing models. Many of the underlying techniques have existed for years; only recent increases in computing power have made them good enough for real-world problems. Much of this compute comes from GPUs: their vector-oriented processing power was originally designed for graphics, but machine learning models spend most of their time in large matrix operations, for which GPUs also perform very well.

These developments, and the real-world success of GPUs in machine learning in particular, have sparked a wave of innovation among hardware designers, who are building new accelerators for machine learning workloads. However, while GPUs are backed by mature software stacks such as CUDA, those libraries generally do not carry over to new non-GPU accelerators, and developing software for such devices remains a challenge.

In 2017, Google announced that it would make its proprietary TPU machine learning accelerator available to the public via cloud services. Initially, TPU use was limited to applications written with Google’s TensorFlow machine learning framework. Fortunately, in September 2018 Google opened up TPU access via the IR of the lower-level XLA (Accelerated Linear Algebra) compiler. This IR is general purpose and backed by an optimizing compiler for arbitrary computations built from linear algebra primitives, and it therefore provides a good foundation both for non-TensorFlow users of the TPU and for non-machine-learning workloads.

In this article, we describe initial work on compiling generic Julia code through this interface so that it can run on Google Cloud TPUs. This approach contrasts with the one taken by TensorFlow (Abadi et al., 2016), which does not compile Python code itself but builds a graph in Python and then compiles that graph. It is aesthetically similar to JAX (Frostig et al., 2018), whose goal is to offload computations written in Python by tracing their high-level array operations. Importantly, however, we do not rely on tracing; instead, we use Julia’s static analysis and compilation capabilities to compile the entire program, including all of its control flow, down to the device.

It is worth noting that our approach lets users take full advantage of the expressive power of the Julia language when writing models. This expressiveness shows up in high-level features such as multiple dispatch and higher-order functions, and in existing libraries such as differential equation solvers (Rackauckas & Nie, 2017) and generic linear algebra routines. Since it operates on pure Julia code, our approach is also compatible with the Zygote.jl automatic differentiation tool (Innes, 2018), which performs automatic differentiation as a high-level compiler transform. Overall, we were able to compile a complete machine learning model written with the Flux machine learning framework, fusing the model’s forward propagation, backpropagation, and training loop into a single executable and offloading it to the TPU.

Automatic Full Compilation of Julia Programs and ML Models to Cloud TPUs

Link to the paper: arxiv.org/abs/1810.09…

Abstract: Google’s Cloud TPUs are a promising new hardware architecture for machine learning workloads and have powered many of Google’s milestone machine learning achievements in recent years. Google has now made TPUs available on its cloud platform and has recently opened them up further for use from non-TensorFlow frontends. We describe a method and implementation for offloading suitable portions of Julia programs to TPUs via this new API and the Google XLA compiler. Our method is able to fuse the forward propagation of a VGG19 model written in Julia into a single TPU executable that is offloaded to the device. Our method composes well with existing compiler-based automatic differentiation techniques on Julia code, so we are also able to automatically obtain VGG19’s backpropagation and offload it to the TPU in the same way. Using our compiler to access the TPU, we can run VGG19 forward propagation on a batch of 100 images in 0.23 seconds, compared to 52.4 seconds for the original model on the CPU. Our implementation requires fewer than 1000 lines of Julia code, with no TPU-specific changes to the core Julia compiler or to any other Julia package.

5. Mapping Julia semantics to XLA

As long as a Julia program is written in terms of XLA primitives, we can compile it to XLA. However, Julia programs are not written in terms of arcane HLO operations; they are written in terms of the functions and abstractions provided by Julia’s base library. Fortunately, Julia’s multiple dispatch makes it easy to express the standard library’s abstractions in terms of HLO operations. A few simple examples are sketched below:
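The following is a minimal, self-contained sketch of this dispatch pattern. The names here (XRTArray, HloDot, HloTranspose, HloMap) are illustrative stand-ins rather than the paper’s actual code, and the builders compute eagerly purely so the sketch runs; the real implementation would record HLO instructions instead.

```julia
# Sketch only: a thin wrapper type standing in for a TPU-resident array, plus
# Hlo* "builders" standing in for XLA HLO primitives.

struct XRTArray{T,N}
    data::Array{T,N}
end

# Stand-in HLO builders (eager here, purely for illustration):
HloDot()       = (A, B) -> XRTArray(A.data * B.data)       # HLO Dot
HloTranspose() = A      -> XRTArray(permutedims(A.data))    # HLO Transpose
HloMap(f)      = A      -> XRTArray(map(f, A.data))         # HLO Map

# Express Base library functions in terms of HLO operations via multiple dispatch:
Base.:*(A::XRTArray{T,2}, B::XRTArray{T,2}) where {T} = HloDot()(A, B)
Base.permutedims(A::XRTArray{T,2}) where {T}          = HloTranspose()(A)
Base.map(f, A::XRTArray)                              = HloMap(f)(A)

# Usage:
A = XRTArray(rand(Float32, 3, 4))
B = XRTArray(rand(Float32, 4, 2))
C = A * B                     # dispatches to HloDot
D = map(exp, permutedims(C))  # HloTranspose, then HloMap
```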

In addition to these simple operations, we also provide implementations of the higher-level array abstractions, in particular mapreduce and broadcast. The implementation of broadcast in terms of HLO operations is about 20 lines of code and is omitted here to save space. The implementation of mapreduce, however, is very simple:
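A sketch in the same spirit, continuing the stand-in XRTArray and HloMap definitions from above (again, the real implementation emits HLO Map and HLO Reduce instructions rather than computing eagerly):

```julia
# Sketch only: map f over the array, then reduce with op, seeded with the
# reduction's neutral element. Scalars are wrapped as 0-dimensional arrays.
HloReduce(op, dims) = (A, init) -> begin
    r = reduce(op, A.data; dims = dims, init = init)
    XRTArray(r isa AbstractArray ? r : fill(r))
end

function Base.mapreduce(f, op, A::XRTArray; dims = :)
    # Base.mapreduce_empty is an internal Base helper that supplies the
    # neutral element of the reduction (e.g. 0 for sum, 1 for prod).
    init = Base.mapreduce_empty(f, op, eltype(A))
    HloReduce(op, dims)(HloMap(f)(A), init)
end
```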

Note that arbitrary Julia functions can be passed in as static computation here. Because Julia relies on generic abstractions, a very small number of definitions covers a large number of APIs. In particular, from the mapreduce definition we automatically obtain the reductions defined in Base, such as sum and prod, as sketched below. In fact, obtaining enough API coverage to compile both the forward and backward propagation of the VGG19 model requires fewer than 200 lines of definitions.
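For instance, with the toy definitions above, Base’s own lowering of sum and prod to mapreduce is enough to route those reductions through the single method we defined:

```julia
A = XRTArray(Float32[1 2 3; 4 5 6])

sum(A)            # Base lowers this to a mapreduce call -> our method above
prod(A; dims = 1) # likewise routed through the single mapreduce definition
```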

5.1 Structure Mapping

We make one additional identification: any tuple or immutable structure in the embedded IR is mapped to an XLA tuple. For example, the Julia value 1 + 2im (a complex number, a structure of two integers) is mapped to the XLA tuple (s64[], s64[]). We preserve this structure typing in the Julia embedding of the XLA IR, but XLA itself has no notion of Julia types, so these types are converted to the appropriate tuples in the final translation step. Similarly, (Julia) tuple constructors (and constructors of immutables) become XLA tuple construction, and references to tuple elements or immutable fields become XLA tuple references.
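As an illustration only (this is not the compiler’s code), the identification can be pictured as a recursive flattening of immutable values into nested tuples of their fields:

```julia
# Leaf scalars map to rank-0 XLA arrays (s64[], f64[], ...); tuples and
# immutable structs map to XLA tuples of their recursively converted fields.
to_xla_tuple(x::Real)  = x                            # leaf: a scalar "array"
to_xla_tuple(x::Tuple) = map(to_xla_tuple, x)         # tuple -> XLA tuple
to_xla_tuple(x) =                                     # immutable struct -> XLA tuple
    ntuple(i -> to_xla_tuple(getfield(x, i)), fieldcount(typeof(x)))

to_xla_tuple(1 + 2im)         # (1, 2), i.e. the XLA tuple (s64[], s64[])
to_xla_tuple((1.0, 2 + 3im))  # (1.0, (2, 3))
```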

5.2 Handling control flow

There is one additional complication we have not yet discussed: the semantic mismatch between the imperative control flow Julia provides and the functional control flow XLA provides. To handle if/else control flow, we look at the φ nodes in the Julia compiler’s SSA IR and treat these nodes as the result of XLA’s functional conditional (if there are multiple φ nodes at the same merge point, we construct a tuple of them). The condition that caused the computation to diverge becomes the condition of the functional control flow, and any computation between the two points becomes a function that is called on each branch. Loop control flow is handled similarly to conditional control flow: we identify the strongly connected regions of the control flow graph and turn them into loop bodies.
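A minimal sketch of what this imperative-to-functional rewrite looks like (illustrative only; the hypothetical xla_while helper plays the role of XLA’s functional While, and the tuple it carries corresponds to the φ nodes at the loop header):

```julia
# Imperative Julia: the loop-carried variables x and c become phi nodes in SSA IR.
function halve_count(x)
    c = 0
    while x > 1
        x /= 2
        c += 1
    end
    return x, c
end

# Functional form in the shape XLA expects: a condition function, a body
# function, and an explicit tuple of loop-carried state.
xla_while(cond, body, state) =
    cond(state) ? xla_while(cond, body, body(state)) : state

halve_count_functional(x) =
    xla_while(s -> s[1] > 1,              # loop condition over the carried tuple
              s -> (s[1] / 2, s[2] + 1),  # body produces the next carried tuple
              (x, 0))                     # initial carried state

halve_count(40.0) == halve_count_functional(40.0)   # true
```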

7. Results

7.2 VGG19 forward propagation

Our first complex example is the full VGG19 forward propagation. We use the VGG19 implementation from the Metalhead package (Mike Innes & Contributors, 2018), which builds on the Flux framework (Innes & Contributors, 2017) to translate familiar machine learning layers (convolutional layers, fully connected layers) into linear algebra operations. Importantly, each layer in Flux is just an ordinary generic function that in turn calls generic linear algebra operations. Machine learning models expressed in Flux, including VGG19, are therefore just ordinary Julia functions, and so they can use the method introduced in this paper.
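For reference, this is roughly what using the model looks like in plain Julia, assuming the Metalhead/Flux APIs of the time (the VGG19() constructor and the WHCN input layout are those packages’ conventions, not something introduced by this work):

```julia
# Sketch assuming 2018-era Metalhead/Flux: the model is an ordinary Julia
# callable, so forward propagation is just a function call on an array.
using Flux, Metalhead

m = VGG19()                        # stack of conv / fully connected Flux layers
x = rand(Float32, 224, 224, 3, 1)  # one RGB image in Flux's WHCN layout
y = m(x)                           # forward propagation: ordinary Julia dispatch
```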

Our compiler is able to fully infer, offload, and fuse the entire VGG19 forward propagation. After Julia-level optimization, the final IR for the top-level function contains 181 instructions (each an HloOp with properly inferred constant static parameters and properly shape-inferred dynamic parameters). At each level, the entry computation contains a total of 183 HLO operations (the two extra are parameter instructions that are implicit in the embedding), and there are 361 HLO operations in total across 29 computations. See Figure 3 for a detailed breakdown of instruction counts. Since we are able to offload the entire forward propagation, Julia is not involved in any of the evaluation steps and is free to perform other tasks (such as preparing the data for the next batch) asynchronously. In addition, the performance of the resulting code is limited only by the quality of the code generated by XLA, not by the frontend (see Section 7.4 for a performance evaluation). We evaluated the VGG19 model on the ImageNet validation set and verified that the results match the original Metalhead results, thereby validating the correctness of the generated XLA code.

7.3 VGG19 backpropagation

To obtain backpropagation, we use the compiler-based AD framework Zygote.jl (Innes, 2018). Zygote operates on Julia code, and its output is again a Julia function (suitable both for feeding back into Zygote to obtain higher-order derivatives and for compiling to the TPU). Here is a concrete example:
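A sketch of such an example, reconstructed from the description that follows (it assumes Zygote’s gradient function and, as in the text, uses sum as a stand-in loss):

```julia
# Differentiate a scalar "loss" (the sum of the network output) with respect
# to the model m for one training batch x, via Zygote's reverse-mode AD.
# The result is again ordinary Julia code, so it can be compiled to the TPU
# in the same way as the forward propagation.
using Zygote

backwards(m, x) = Zygote.gradient(m -> sum(m(x)), m)
```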

That is, we obtain the derivative of the model at its current parameter values with respect to a particular training example (or batch of examples); we use sum as a simple stand-in for the loss function. Pleasantly, the type-inference modifications described in Section 6 are also sufficient to make type inference precise for the entire VGG19 backpropagation. As with forward propagation, the optimized and unoptimized instruction counts are shown in Figure 3. Backpropagation generates significantly more XLA instructions than forward propagation; one of the largest contributors is Zygote’s mixed-mode broadcast fusion, which computes both the forward pass and the backward pass in a single map kernel. Because XLA does not currently support multiple outputs from a single map instruction, the function body is duplicated across multiple map instructions, which XLA’s DCE then has to clean up. More generally, our compilation process stresses XLA’s handling of the map instruction, because generic Julia code commonly calls the map and broadcast functions.

7.4 Performance evaluation on the TPU



Figure 2: VGG19 forward propagation time for different batch sizes. Flux CPU is Flux master/Julia master without the XLA compiler. PyTorch CPU is the equivalent PyTorch model on the same CPU. FluXLA CPU is our work running against an XRT implementation on the CPU. FluXLA TPU (total) is the end-to-end time as reported by the client (including kernel launch overhead and data transfer back from Google Cloud; note that this measurement can vary considerably due to additional network traffic). FluXLA TPU (compute) is the total compute time on the TPU as reported by the cloud profiler (unlike the total measurement, this one is stable). All CPU measurements were taken on an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz with AVX512 support; up to 20 cores were available, and the CPU benchmarks were not restricted to a single core (although in practice not all of them made use of parallelism). The TPU benchmarks were restricted to a single TPU core. All timings are over at least 4 runs (except FluXLA CPU at N=100, which could not complete a single run within 10 minutes).

Figure 3: Instruction count breakdown for the forward and backward propagation of Metalhead.jl VGG19 after compilation to XLA. The figure shows unoptimized (after the Julia frontend) and optimized (after an XLA optimization pipeline similar to the one used by the CPU backend, but without HLO fusion) instruction counts. Each count is further split into instructions in the entry computation (E) and instructions across all computations (T).