Latency at Scale
Run GPT-2 at 900+RPS and 6ms latency!
In this example, we run the GPT-2 from Huggingface
Transformers library on basic Everinfer setup and evaluate possible RPS and latency.
All tests are conducted on AWS g4dn.xlarge as a client machine, located in N. Virginia. The example below demonstrates easily achievable RPS and latency. Performance can be scaled up or down to fit your requirements. Please feel free to contact us [email protected] to discuss your perfect setup :)
Huggingface Transformers library provides a trivial way to convert supported models to ONNX.
Run the following command to convert GPT-2 to ONNX:
!python3 -m transformers.onnx --model=gpt2 onnx/
The default Transformers tokenizer output fits perfectly for the creation of Everinfer tasks. Let's create a test input for GPT-2:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Everinfer is blazing fast! Please contact us at [email protected] to try it out!",
Now we are ready to register the model on Everinfer and call it.
Authenticate with your API key and upload the converted model to Everinfer:
from everinfer import Client
client = Client('my_key')
pipeline = client.register_pipeline('gpt2', ['onnx/model.onnx'])
Create an inference engine and run on inputs, while timing it to measure cold start time:
runner = client.create_engine(pipeline['uuid'])
preds = runner.predict([inputs])
Wall time: 10.7 s
Yup, cold start times are far from perfect for now, working on it!
Stay tuned for optimized cold starts as short as 5s for huge models.
Now our model is deployed to multiple machines and is warmed up!
Let's measure single-call latency:
preds = runner.predict([inputs]) # measure single-call latency
5.88 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And scaling capabilities on batch jobs:
preds = runner.predict([inputs]*1000) # horizontal scaling of batch processing
1.1 s ± 4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Congrats! You have achieved 900+ RPS and 6ms latency deployment of GPT-2!
Can be easily scaled up btw, hit us up with your requirements
Check out the fully reproducible Google Colab notebook, and message [email protected] to reproduce that benchmark on any hardware or cloud provider of your choice.