Introduction
Everinfer is a system that offloads inference of ONNX graphs to remote GPUs.
- Remote GPU resources: Run your ONNX-compatible models on remote GPUs that are managed by Everinfer.☁
- Peer-to-peer communication: The inference client connects directly to highly optimized C++ ONNX runtimes running on remote GPUs. Overhead as low as 2 ms is possible, which enables near-real-time applications.⚡
- Instant cold starts: No Docker or Git is involved; cold start time is limited only by model download time and is close to the theoretical limit (e.g. BERT cold start <= 1 s).🔥
- Scalability: Our architecture allows linear horizontal scaling — scale to 1000s of RPS with no extra effort on your side.📈
- Model storage: Upload your models once and reuse them indefinitely.📥
- Minimalistic SDK: The client-side SDK is open source, provides simple Pythonic primitives, and is extremely easy to use. You are free to add complexity as needed.👏
- On-premise deployment: Want to use your own hardware for added security, or mix and match your hardware with external computing power? On-premise deployment is possible. Contact us!🏦
- [COMING SOON]: Multiple runtimes: TensorRT, TorchScript, and TFLite support through a unified interface.⏳
Feel free to contact us even if you are a sole developer. We are quick to respond and ready to give out API keys and provide demos — [email protected]
We use a blazing-fast, buzzword-worthy stack to ensure Everinfer's technical superiority. Honourable mentions: ONNX, Rust, Zenoh, FlatBuffers, ScyllaDB.
- Want to skip the boring parts and dive straight in? Take a look at how you could deploy Faster-RCNN while fusing pre- and post-processing in a single graph with the model.
- Doubt the latency and scalability claims? Take a look at GPT-2 running at 900 RPS, still with four lines of code.
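For a quick sanity check of the cold-start and scaling claims before running anything, a back-of-the-envelope sketch like the one below may help. It is not part of the Everinfer SDK. The helper names, the 10 Gbit/s link, and the 150 RPS per-worker figure are illustrative assumptions, not measurements. The only external fact used is that BERT-base has roughly 110M parameters, which is about 440 MB of float32 weights.

```python
import math


def cold_start_lower_bound_s(model_size_mb: float, bandwidth_mb_per_s: float) -> float:
    """Download time for the model: the floor for any cold start."""
    return model_size_mb / bandwidth_mb_per_s


def workers_needed(target_rps: float, per_worker_rps: float) -> int:
    """Workers required if throughput scales linearly with worker count."""
    return math.ceil(target_rps / per_worker_rps)


# Assumed figures for illustration only (not Everinfer measurements).
bert_mb = 110e6 * 4 / 1e6      # ~110M float32 parameters, in MB
link_mb_per_s = 10_000 / 8     # hypothetical 10 Gbit/s link, in MB/s
per_worker_rps = 150           # hypothetical throughput of one GPU worker

floor_s = cold_start_lower_bound_s(bert_mb, link_mb_per_s)
print(f"cold start floor: {floor_s:.2f} s")
print(f"workers for 900 RPS: {workers_needed(900, per_worker_rps)}")
```

Under these assumptions the download floor comes out well below one second, which is consistent with the BERT cold-start figure quoted above. Real numbers depend on your model size, network path, and hardware.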