Introduction


Everinfer is a system that offloads inference of ONNX graphs to remote GPUs.
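
The client-side workflow is only a handful of calls. The snippet below is a minimal sketch of that workflow; the names used here (`Client`, `register_pipeline`, `create_runner`, `predict`) are assumptions for illustration, so consult the Basics page for the actual SDK interface.

```python
# Illustrative sketch only: the calls below (Client, register_pipeline,
# create_runner, predict) are assumed names for this example.
import numpy as np
from everinfer import Client

client = Client("my-api-key")  # key obtained from hello@everinfer.ai

# Upload the ONNX graph once; it is stored and can be reused afterwards.
pipeline = client.register_pipeline("my-model", ["model.onnx"])

# The runner talks peer-to-peer to a C++ ONNX runtime on a remote GPU.
runner = client.create_runner(pipeline["uuid"])

# Inputs are plain NumPy arrays keyed by the graph's input names.
outputs = runner.predict(
    [{"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}]
)
```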

Core Features

  • Remote GPU resources: Run your ONNX-compatible models on remote GPUs managed by Everinfer (see the export sketch after this list).

  • Peer-to-peer communication: The inference client connects directly to highly optimized C++ ONNX runtimes running on remote GPUs. Overhead as low as 2 ms enables near-real-time applications.

  • Instant cold starts: No Docker or Git is involved; cold-start time is limited only by model download time and is close to the theoretical limit (e.g. BERT cold start ≤ 1 s).

  • Scalability: Our architecture allows linear horizontal scaling — scale to 1000s of RPS with no extra effort on your side.

  • Model storage: Upload your models once and reuse them indefinitely.

  • Minimalistic SDK: The client-side SDK is open source, provides simple Pythonic primitives, and is extremely easy to use — you are free to add complexity as needed.

  • On-premise deployment: Want to use your own hardware for added security, or mix and match your hardware with external computing power? On-premise deployment is possible. Contact us!

  • [COMING SOON] Multiple runtimes: TensorRT, TorchScript, and TFLite support through a unified interface.
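
As a reference for the "ONNX-compatible" requirement above, here is one common way to produce an ONNX graph from a PyTorch model via the standard `torch.onnx.export` call; the torchvision model and input shape are placeholders, not part of Everinfer itself.

```python
# Export a PyTorch model to ONNX so it can be uploaded to Everinfer.
# torchvision's resnet18 is used purely as a placeholder model.
import torch
import torchvision

model = torchvision.models.resnet18(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)  # example input with a fixed shape

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
    # allow a variable batch dimension at inference time
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
)
```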

Feel free to contact us even if you are a sole developer. We are quick to respond and ready to give out API keys and provide demos — hello@everinfer.ai.

Superior tech

We use a blazing-fast, buzzword-worthy stack to ensure Everinfer's technical superiority. Honourable mentions: ONNX, Rust, Zenoh, FlatBuffers, ScyllaDB.

Quick links

See the simplest example of Everinfer in action.

Want to skip the boring parts and dive straight in? Take a look at how you could deploy Faster-RCNN while fusing pre- and post-processing in a single graph with the model.

Doubt the latency and scalability claims? Take a look at GPT-2 running at 900 RPS, still with four lines of code.

Stable Diffusion demo - offload the U-Net to remote GPUs while running lightweight models locally.
