Our Vision for Serverless ML and Everinfer Internals

Hey there! We're Andrew, Igor, and Dimitriy, the founders of Everinfer.

Our message: Stop torturing yourself with deploying ML models as containerized services.

Here's what we think: The way we deploy ML is a mess. We're stuck in a pattern influenced by traditional web apps – packing the model with its Python code into a Docker container. Sure, it makes scaling and redeployment a breeze from a technical standpoint. But is this the best for ML workloads? We don't think so.

The problem: Containerized apps don't use your GPU machines effectively.

You need to download and build tools like nvidia-docker, torch, transformers, and loads of other stuff just to run your code. Not only does it take a significant amount of time to download and construct this environment on a machine, but the Python tooling also steals precious milliseconds from each model call. You pay for GPU time while all this wrapper software runs on the CPU.

An analogy: it makes little sense to reinstall your IDE and the whole fancy toolchain just to compile a single file of C code on a remote server. Why should ML models be any different?

Another issue: With arbitrary Python+Torch/TF code models, optimizing for low latency or placing multiple models on one GPU becomes challenging.

Our solution: We believe in scaling GPU compute separately from the CPU-bound code. As we see it, the future of ML isn't about reinventing the microservices wheel. It's about saying no to unnecessary weight and yes to efficiency.

That's why we're starting a company focused on building ML DevTools around this “decoupling” idea.

Here's the game plan if you're developing AI code with this approach:

  1. Develop your model deployment code in Python.

  2. Pinpoint the most compute-intensive operations (usually model.forward calls).

  3. Export these models to a static weights format that a fast runtime can execute (ONNX, TensorRT, ggml, TorchScript, TFLite, and so on).

  4. Set up a fast inference server that can snatch up these files and serve them (use Triton or build something similar yourself).

  5. Tweak your deployment code to run all the wrapper code (pre/post processing of text inputs, logic, heuristics, model control flow, beamsearch, etc.) on a lean CPU client. This part isn't usually the bottleneck anyway.

  6. Finally, make this code call your swift inference servers hosting static model graphs instead of local models.
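The steps above boil down to a simple split, sketched below in self-contained Python. Here `remote_forward` is a stand-in for an RPC to a fast inference server (Triton, or a hosted service); the toy "tokenizer" and argmax are the CPU-side wrapper code that never needs a GPU. All names are illustrative, not a real API.

```python
def remote_forward(token_ids):
    """Placeholder for an RPC to a GPU inference server hosting the
    static model graph; here it just returns one 'logit' per token id."""
    return [float(t) * 0.5 for t in token_ids]

def preprocess(text):
    # CPU-side wrapper code: a toy whitespace 'tokenizer'.
    return [len(word) for word in text.split()]

def postprocess(logits):
    # CPU-side wrapper code: pick the index of the largest logit.
    return max(range(len(logits)), key=logits.__getitem__)

def predict(text):
    # Only the forward pass leaves the machine; everything else stays on CPU.
    return postprocess(remote_forward(preprocess(text)))

print(predict("decouple the heavy compute"))  # → 0 (index of the longest word)
```

Swapping the stub for a real network call changes nothing else in the pipeline, which is the whole point of the decoupling.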

So, what's in it for you?

Firstly, scaling just got easier. You only need a few instances of your model's "front-end" code, and you scale the AI runtimes independently. These runtimes are light and deploy quickly, which means no more long cold starts.

Next, you get stellar GPU utilization. Models exported to static weight files have a fixed, known VRAM footprint – this means you can fit more models on a single GPU and save some cash.
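A back-of-the-envelope sketch of why fixed footprints matter: once each model's VRAM usage is known up front, packing models onto a card is just bin-packing. The sizes below are made-up illustrative numbers, not measurements of any real model.

```python
gpu_vram_mb = 4_000                                    # e.g. a small 4 GB card
model_footprints_mb = [2_100, 2_100, 1_300, 900, 450, 450, 450]

placed = []
free = gpu_vram_mb
for size in sorted(model_footprints_mb, reverse=True):  # first-fit decreasing
    if size <= free:
        placed.append(size)
        free -= size

print(f"{len(placed)} models placed, {free} MB left")   # → 3 models placed, 150 MB left
```

With arbitrary Python+Torch processes you cannot plan like this, because peak memory depends on code paths you do not control.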

Also, it simplifies development. Say goodbye to the headaches of configuring nvidia-docker or making sure your research-grade libraries play nice with each other. The folks behind the runtime software have sorted that out.

Lastly, storing models is now straightforward – your models are simply binaries.

Big tech has done this for ages. So why shouldn't you? The approach pays off even at a scale of 2-3 GPUs – Andrew did exactly that at his ML consulting agency, on top of servers in a garage – and it only gets better with scale.

But here's the thing: while this concept may make sense, you might not want to take on the headache of developing and hosting such a system yourself. That's precisely where we step in with Everinfer.

We've constructed a serverless model deployment system built around this notion of decoupling. Leveraging our remote hardware is as straightforward as:

from everinfer import Client

client = Client("my_api_key")
engine = client.create_engine("model_id")  # our servers start downloading your model
tasks = [{"input_name": input_array_or_img_path}]
preds = engine.predict(tasks)  # runs on remote GPUs

This Python snippet can be seamlessly integrated into your existing codebase. No need to tinker with Docker, mess around with HTTP/GRPC APIs, or navigate any GUIs. The experience is akin to running models locally.

Explore other pages in our docs complete with examples and benchmarks!

Take a special look at our Stable Diffusion example, which demonstrates the decoupling philosophy perfectly. We swap out a 2GB UNet model for a call to a remote GPU in just five lines of code, without significantly altering the huggingface diffusers pipeline, while running everything else on a CPU.
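The shape of that swap can be sketched without diffusers or a GPU: wrap the remote engine in an object with the same callable interface as the local module, then assign it over the pipeline's attribute. `FakeEngine` and `Pipeline` below are stubs standing in for a remote engine and a diffusers pipeline; none of these names are a real API.

```python
class FakeEngine:
    """Stands in for a remote GPU engine; this stub just doubles its input."""
    def predict(self, tasks):
        return [{"out": [x * 2 for x in t["sample"]]} for t in tasks]

class RemoteUNet:
    """Mimics the callable interface of the local module it replaces,
    forwarding the heavy computation to the remote engine."""
    def __init__(self, engine):
        self.engine = engine
    def __call__(self, sample, timestep=None):
        # Serialize inputs, run the forward pass remotely, unpack the result.
        return self.engine.predict([{"sample": sample}])[0]["out"]

class Pipeline:
    """Stands in for a diffusers pipeline holding a local `unet` module."""
    def __init__(self):
        self.unet = lambda sample, timestep=None: [x * 2 for x in sample]

pipe = Pipeline()
pipe.unet = RemoteUNet(FakeEngine())       # the actual swap: one assignment
print(pipe.unet([1.0, 2.0], timestep=0))   # → [2.0, 4.0]
```

Because the shim keeps the original call signature, the rest of the pipeline never notices that the forward pass now happens on a remote GPU.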

Let's talk tech: how does Everinfer work under the hood?

On the server side, we host a fleet of machines running a highly optimized inference server. This server, essentially a zero-copy wrapper over ONNX Runtime, pulls models on demand from S3 using all available bandwidth and loads them into the latest ONNX Runtime. It also handles client inputs in a streaming fashion, pre-loading available inputs and running them through the model with ORT IOBinding for maximum efficiency.
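A toy, stdlib-only version of that streaming input path: a feeder thread pre-loads inputs into a bounded queue while the worker drains it and "runs" them, so input handling overlaps with compute. The real server binds buffers directly to the GPU via IOBinding; squaring a number stands in for the forward pass here.

```python
import queue
import threading

inputs = list(range(10))
ready = queue.Queue(maxsize=4)   # bounded: pre-load a few inputs ahead
results = []

def feeder():
    for x in inputs:
        ready.put(x)             # pretend: deserialize a request off the network
    ready.put(None)              # sentinel: stream finished

def worker():
    while (x := ready.get()) is not None:
        results.append(x * x)    # pretend: run the model forward pass

t = threading.Thread(target=feeder)
t.start()
worker()
t.join()
print(results)  # squares of 0..9, in arrival order
```

The bounded queue is the key design choice: it keeps the "GPU" fed without buffering an unbounded backlog in memory.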

On the client side, we provide a Python package that handles authentication, manages uploads of your models to our S3, declares the intent to deploy a specific model, and exposes a pub-sub queue for worker servers to pull inputs from. This package can be conveniently installed with pip.

Our control plane identifies clients and workers, connecting them via p2p. Basically, clients act as shards of a distributed queue and push serialized inputs into these queues. Workers, in turn, run a carefully optimized data path that atomically passes inputs from network card to GPU with no unnecessary memory copies.

No load balancers in the middle – no issues with scaling and no bottlenecks.

To quote Greg Brockman, "Much of modern ML engineering is making Python not be your bottleneck."

We do exactly that – both our client-side and server-side software is written in Rust and employs a host of low-level optimizations. In fact, even our Python package is Rust under the hood.

So, what kind of performance can you expect? Tested on a single T4 GPU as a worker and a t2.micro as a client:

  • Dummy model that just takes and returns a single integer: 65ms cold start, 2.6ms latency

  • BERT: 1s cold start, 8ms latency

  • FasterRCNN: 2s cold start, 100ms latency

  • Stable Diffusion: 9s cold start, 30s to generate an image

A short demo: https://www.loom.com/share/50891d4548bc4658b21d6cb8bdadfbfc

Our first customer achieved 146 RPS throughput for a computer vision task with a custom model using just ten lines of Python+everinfer to deploy the model to multiple GPUs.

Kudos to our tech stack built for speed: Rust, Zenoh, ScyllaDB, FlatBuffers, ONNX.

While we're in the early stages and don't currently have enough GPUs for a public demo, we're more than happy to provide API keys and GPUs to all interested hackers.

Drop us a line at team@everinfer.ai – we respond fast.

Looking ahead, our plan is to provide:

  • Managed cloud version with proper controllable scaling and per-second billing.

  • Support for more runtimes (TensorRT, TorchScript, TFLite) and large models.

  • Open-source version for on-premise deployment that will enable you to turn any set of heterogeneous in-house machines into a serverless cloud.

If you think that's cool, join our waitlist at everinfer.ai.
