Wednesday, January 12, 2022

State of Machine Learning in Julia

Question 1: Where does Julia shine?

For scientific machine learning (also known as physics-informed learning, science-guided AI, expert-guided AI, etc.; see the note at the bottom), Julia is really good. If you don’t know what that is, check out this recent seminar talk, which walks through SciML for model discovery in epidemics, climate modeling, and more.

The SciML Benchmarks on neural ODEs and other such dynamical models are pretty damn good. We’re talking 100x, 1000x, etc. across tons of different examples. Here are some. Note that most examples don’t even run in torchdiffeq since it uses the non-robust adjoints and has no real stiff ODE solvers, so the ones that are benchmarked are only the easiest cases, chosen so that torchdiffeq isn’t just exiting early and outputting Inf for the gradients (I guess that’s one way to be fast :sweat_smile:).

Similarly, we see that torchsde’s performance on standard SDEs is not great:

And those even have input from the package’s devs on how to optimize the benchmarks… so… :sweat_smile:. Though note that all of these benchmarks are on examples which are more realistic to scientific machine learning, so smaller kernels, more nonlinearity, at least mild stiffness, etc. I’ll address “big kernel computing” in the second part.

Another thing that people might not know about is that dynamical models also fall into this space. Simply using a good ODE solver is much better than using a “physics engine”. There’s a paper which shows DiffEqFlux outperforming MuJoCo and DiffTaichi by a good order of magnitude.

There’s a lot more to say too. Managing linear solvers and preconditioners is essential to this kind of stuff, and these other libraries don’t even have them. So really, direct usage of Sundials from C or Fortran is our only real competitor in this space, but even then that won’t mix with the vjps of ML libraries automatically, which we know decreases performance by two orders of magnitude, and we do beat it in most of the SciMLBenchmarks (most but not all). So even if you use C/Fortran Sundials, you have to define every Jacobian, vjp, sparsity pattern, etc. by hand just to get close to what DifferentialEquations.jl does by default. So I would assume that for the vast majority of the population, they wouldn’t hit DiffEq speeds even in C/Fortran anymore in this domain :wink:.
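For concreteness, here’s a minimal sketch of what “by default” means, using the classic stiff Robertson problem with OrdinaryDiffEq.jl (the setup is just the textbook example, not taken from the benchmarks); the Jacobian handling, linear solve, and adaptivity all come without any user-written derivative functions:

using OrdinaryDiffEq

# The classic stiff Robertson kinetics problem.
function rober!(du, u, p, t)
    k1, k2, k3 = p
    du[1] = -k1*u[1] + k3*u[2]*u[3]
    du[2] =  k1*u[1] - k2*u[2]^2 - k3*u[2]*u[3]
    du[3] =  k2*u[2]^2
end

prob = ODEProblem(rober!, [1.0, 0.0, 0.0], (0.0, 1e5), (0.04, 3e7, 1e4))
sol = solve(prob, Rodas5())   # stiff solver; no hand-written Jacobian, vjp, or sparsity needed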

The other thing is differentiable programming on non-quasi-static programs, and I’ll let this blog speak for itself.

Another way to define how far ahead we are is to note that one of the NeurIPS spotlight papers is a method that we wrote a tutorial on in 2019 (this kind of thing isn’t too uncommon, it must’ve happened like 10 times by now). Oh wait, they only did the Newton part, not Newton-Krylov, and they didn’t do forward-over-adjoint which is known to be more efficient (as mentioned in Griewank’s AD book, though it’s easy to see with applications), so that paper doesn’t even match our tutorial :sweat_smile:. Feel free to make figures and publish our tutorials if you want.

Question 2: Where is the Julia ML ecosystem currently inferior?

This blog touches on this topic in some detail:

(and the Hacker News discussion is quite good for this one). The Julia ML/AD tools are not inferior by design, but in some cases by focus. If you’re only dealing with big kernels, i.e. lots of big matrix multiplications, then PyTorch does not have measurable overhead. And in that regime XLA will perform fusion operations, like turning the separate matrix-vector products A*v1 and A*v2 into the single matrix-matrix product A*[v1 v2], changing BLAS2 to BLAS3 and exploiting more parallelism. Another case is fusion in conv kernels: cudnn has a billion kernels, and PyTorch listed them out. Here’s a snippet:

cudnn_convolution
cudnn_convolution_backward_input
cudnn_convolution_backward
cudnn_convolution_backward_weight
cudnn_convolution_transpose
cudnn_convolution_transpose_backward
cudnn_convolution_transpose_backward_input
cudnn_convolution_transpose_backward_weight
cudnn_convolution_relu
cudnn_convolution_add_relu

So instead of computing relu.(conv(x) .+ y) as separate kernel launches, on GPUs you’d want a single fused cudnn_convolution_add_relu call. Similarly, some of these extra kernels exist for RNNs. XLA, and thus both TensorFlow and Jax, will do these fusions automatically.
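For concreteness, here’s roughly what the unfused pattern looks like in plain NNlib (the shapes are arbitrary and just for illustration); on a GPU each of the three operations is its own kernel launch:

using NNlib: conv, relu

x = randn(Float32, 28, 28, 3, 8)    # width × height × channels × batch
w = randn(Float32, 3, 3, 3, 16)     # kernel width × height × in channels × out channels
b = randn(Float32, 1, 1, 16, 1)     # broadcastable bias

# Three separate operations: conv, broadcasted add, broadcasted relu.
# A fusing backend would map this whole expression onto one
# cudnn_convolution_add_relu-style kernel instead.
y = relu.(conv(x, w) .+ b)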

Julia does not perform those kinds of optimizations in its “ML compiler” because it’s just using the Julia compiler, and you wouldn’t necessarily want to apply them to all Julia code: they’re kind of like @fastmath in that they change the floating point results. So somehow we need to add such compiler optimizations to the language in a way that only applies in specific user contexts. Even better, it would be nice if mathematical/ML users could help build and maintain these compiler optimizations, like they do with automatic differentiation rules in ChainRules.jl. Not everyone is a compiler engineer, but everyone can help with linear algebra equalities. That’s the purpose of the E-graph project:

However, that is still in its elementary stages and cannot even apply rules to arrays (though Shashi has started that work).
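For comparison, here is what the ChainRules.jl model of “users contribute the math, not the compiler pass” looks like today, with a toy function of my own; the e-graph effort aims at something analogous for algebraic rewrite rules:

using ChainRulesCore

# A toy function and a hand-written reverse-mode rule for it.
# Writing this only requires the calculus, not compiler internals.
square(x::Real) = x^2

function ChainRulesCore.rrule(::typeof(square), x::Real)
    y = square(x)
    square_pullback(ȳ) = (NoTangent(), 2x * ȳ)
    return y, square_pullback
end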

So in theory getting ML performance is simple: you just use fast kernels. In practice, the hard part is allowing users to write simple Julia code while still allowing, in limited contexts, the compiler to swap the high-level calls in their code for more efficient kernels. The AbstractInterpreter will give us the tools to do this, but that work is still in the near future. The Julia ML tools picked a much larger scope, which is good for some things, but the downside is that these optimizations are much harder to write.

This does not affect my SciML use cases, but it is probably the biggest performance issue with standard ML in Julia. And while finishing these optimizations would make Julia really ergonomic for both building and using ML packages, I’m not convinced it will “win over” the standard ML community, because you’d still only expect to match performance in that domain; I’m not convinced we have a silver bullet there.

Question 3: How well does Julia perform in “standard ML”?

I don’t think the individual benchmarks are all that interesting; they pretty much all say the same thing as my response to Question 2:

  1. Julia’s kernel speeds are fine. On CPUs we’re doing really well, beating most other things with Octavian.jl (JuliaLinearAlgebra/Octavian.jl, a multi-threaded BLAS-like library that provides pure-Julia matrix multiplication) and such; see the sketch after this list. On GPUs everyone is just calling the same cudnn etc., so it’s a battle of who calls the better kernels with the right arguments.
  2. Julia’s AD speeds are fine. Zygote can have some overhead, but it’s actually rather fast in most contexts compared to Jax/PyTorch/TensorFlow. PyTorch’s overhead is much higher, and yet it’s not even measurable in standard ML workflows anyway: one matrix multiplication of a large enough matrix eats up the allocation issues and any other O(n) overhead.
  3. Julia does not fuse kernels, so in most benchmarks, if you look closely, you just go “oh, it’s not fusing this conv” or “this RNN cudnn call”.
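The Octavian.jl point from item 1, sketched as a call pattern (sizes are arbitrary; this is not a benchmark):

using Octavian

A = randn(Float32, 512, 512)
B = randn(Float32, 512, 512)
C = similar(A)

# Multi-threaded, pure-Julia matrix multiplication in place of a BLAS gemm call.
Octavian.matmul!(C, A, B)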

Coming back to item 3: a lot of the issues which mention PyTorch, like RNN design for efficient CUDNN usage (FluxML/Flux.jl#1365), are really just redesign issues aimed at making Flux call the better kernels more readily by manually fusing some things. That’s really the main standard-ML issue at the moment.

Question 4: What important experiments and benchmarks should we be tracking?

XLA’s distributed scheduler is pretty good. As we think about scaling, we should probably ignore PyTorch and look at DaggerFlux vs TensorFlow/Jax. XLA has more freedom to move operations around, so I think it should be the winner here, and we will need e-graph tricks to match it.

One other thing to note, though, is that there is a “missing middle in automatic differentiation”, where one has to choose between loopy mutating code (with Enzyme) and kernel/linear-algebra code (with Zygote/Diffractor), and mixing the two code styles does not work right now. For a discussion of that, see:

That said, PyTorch, Jax, and TensorFlow all have this issue too, so it’s not a point where Julia is behind, and Julia is actually closer to solving it than the others. But it’s a big PITA and something we really need to address to make Julia’s differentiable programming feel truly better than the alternatives.
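To illustrate the two styles with toy functions of my own (not from the linked discussion): Zygote is happy with the vectorized version but errors on the mutating loop, which is exactly the kind of code Enzyme targets:

using Zygote

# Kernel/linear-algebra style: no mutation, fine for Zygote/Diffractor.
loss_vec(x) = sum(abs2, 2 .* x)

# Loopy, mutating style: the writes into y make Zygote throw
# "Mutating arrays is not supported"; Enzyme handles this style instead.
function loss_loop(x)
    y = similar(x)
    for i in eachindex(x)
        y[i] = 2 * x[i]
    end
    return sum(abs2, y)
end

x = rand(3)
Zygote.gradient(loss_vec, x)      # works
# Zygote.gradient(loss_loop, x)   # errors on the mutation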

Question 5: Which companies and institutions are looking to give multi-year positions?

I’m not sure, but I see it around. Both Julia Computing and Pumas-AI have many folks doing SciML though, and if you’re interested in SciML stuff please get in touch.

This topic also maps onto the structure of the MIT Julia Lab, where there’s a SciML crew focusing on SciML models and applications, and a compiler/symbolics crew working on these kinds of “customizable compiler for science” issues.

Question 6: Why should independent developers consider contributing?

It’s easy to specialize Julia’s tooling towards weird research. In fact, there’s an entire research domain of non-quasi-static ML algorithms that’s just full of potential gems. Writing your own kernels gets you new algorithms that aren’t in the rut. Our implicit-layer tooling all composes, and we’re far enough ahead that there are entire papers that are just what we’ve been doing for years.

And of course, all of the differential equation stuff mentioned at the beginning.

Question 7: What packages do you use and which packages do you wish existed?

I tend to reach for Flux when I need to, but I try to stick to DiffEqFlux. Flux is just the most complete in terms of the kernels that exist, but its style irks me. I wish there were a Flux which did not use implicit parameters and instead used explicit parameters, and I would like those parameters to be represented by ComponentArrays.

If you haven’t seen the ComponentArrays.jl DiffEqFlux example, it’s really nice:

https://jonniedie.github.io/ComponentArrays.jl/dev/examples/DiffEqFlux/

using UnPack: @unpack

# p is a ComponentArray of the network weights; @unpack pulls out each layer's parameter block by name.
function dudt(u, p, t)
    @unpack L1, L2 = p
    return L2.W * tanh.(L1.W * u.^3 .+ L1.b) .+ L2.b
end
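For context, a minimal sketch of how such a p could be constructed (the layer sizes here are made up for illustration):

using ComponentArrays

# A flat contiguous vector under the hood, addressable as p.L1.W, p.L1.b, etc.
p = ComponentArray(L1 = (W = randn(Float32, 16, 2), b = zeros(Float32, 16)),
                   L2 = (W = randn(Float32, 2, 16), b = zeros(Float32, 2)))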

That would eliminate implicit global state: everything would be explicit, but with nice syntax. Since a ComponentArray is a flat contiguous vector with helper indexers (that return views), it works nicely with linear algebra for putting into BFGS. Mixing that with a universal optimizer interface like GalacticOptim.jl

https://galacticoptim.sciml.ai/dev/

would give me everything I need. It would be a whole lot easier to debug and optimize too. So I would really consider either making that kind of ML package or changing Flux to explicit parameters (which is somewhat underway with Optimisers.jl).
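Roughly the workflow I mean here, as a sketch assuming GalacticOptim’s OptimizationFunction/OptimizationProblem interface with an Optim.jl BFGS backend (the loss is a stand-in, not a real training objective):

using GalacticOptim, Optim, Zygote

# Stand-in loss over an explicit, flat parameter vector.
loss(p, _data) = sum(abs2, p)

optf = OptimizationFunction(loss, GalacticOptim.AutoZygote())
prob = OptimizationProblem(optf, rand(10))
sol  = solve(prob, BFGS())   # the same interface can swap in other optimizers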

[Note on SciML terminology. It’s pretty funny that these days people attribute SciML to the SciML Scientific Machine Learning Open-Source Software Organization. The term originally came from a major workshop by the US DoE which established it and essentially proclaimed it to be a field. We had already been working on this as JuliaDiffEq, and with half of the repos no longer being differential equation solvers it made sense to change to being “The SciML Scientific Machine Learning Open-Source Software Organization”, which is always just abbreviated to SciML. Soon the SciML org became synonymous with the term, so now people are less inclined to use the term because it refers more to us than to the field, and so people ask why we didn’t adopt “standard” terminology like “science-guided AI”, which was first developed to avoid referring to us :laughing:. Fun anecdote to show where our community is in this space.]


