GPT-2: 900+ RPS
Run GPT-2 at 900+ RPS and 6 ms latency!
In this example, we run GPT-2 from the Hugging Face Transformers library on a basic Everinfer setup and measure the achievable RPS and latency.
The Hugging Face Transformers library provides a straightforward way to convert supported models to ONNX.
Install Transformers ONNX as described in the official documentation.
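At the time of writing, the exporter ships as an optional extra of the `transformers` package:

```bash
pip install transformers[onnx]
```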
Run the following command to convert GPT-2 to ONNX:
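```bash
python -m transformers.onnx --model=gpt2 onnx/
```

This writes `model.onnx` into the `onnx/` directory. (Newer Transformers releases route ONNX export through Optimum instead; the command above matches the `transformers.onnx` package referenced here.)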
The default Transformers tokenizer output can be passed to Everinfer tasks as-is. Let's create a test input for GPT-2:
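A minimal sketch; the prompt is arbitrary, and `return_tensors="np"` returns NumPy arrays keyed by the exported model's input names:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# NumPy arrays keyed "input_ids" / "attention_mask", matching the ONNX inputs
inputs = dict(tokenizer("ONNX runtimes love short prompts", return_tensors="np"))
```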
Now we are ready to register the model on Everinfer and call it.
Authenticate with your API key and upload the converted model to Everinfer:
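A sketch of the flow, not the definitive client API: the package name, `Client`, and `register_pipeline` are assumptions here, so check the Everinfer docs for the exact names.

```python
from everinfer import Client  # assumed package / class name

client = Client("YOUR_API_KEY")  # authenticate with your Everinfer API key

# Upload the exported ONNX file under a pipeline name (method name illustrative)
pipeline = client.register_pipeline("gpt2", ["onnx/model.onnx"])
```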
Create an inference engine and run on inputs, while timing it to measure cold start time:
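Again a sketch with assumed names (`create_engine`, `predict`); `%%time` is the IPython cell magic that produced the wall-time reading below.

```python
%%time
engine = client.create_engine(pipeline["id"])  # spin up remote workers (assumed call)
preds = engine.predict([inputs])               # first call pays the cold start
```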
```
Wall time: 2.68 s
```
Yup, cold start times are fast! GPT-2 is ~0.5 GB, and we start it up in less than 3 seconds!
Now our model is deployed to multiple machines and is warmed up!
Let's measure single-call latency:
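With IPython's `%timeit`, and the same assumed `predict` call:

```python
%timeit engine.predict([inputs])
```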
```
5.88 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
And throughput scaling on batch jobs:
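Here we submit 1,000 copies of the test input as a single batch job (again with the assumed `predict`); 1,000 requests in ~1.1 s works out to 900+ RPS:

```python
%timeit engine.predict([inputs] * 1000)
```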
```
1.1 s ± 4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Message hello@everinfer.ai to reproduce this benchmark on any hardware or cloud provider of your choice.
Congrats! You have deployed GPT-2 at 900+ RPS and ~6 ms latency! It scales up easily, too; hit us up with your requirements.