GPT-2: 900+ RPS
Run GPT-2 at 900+ RPS and 6 ms latency!
In this example, we run GPT-2 from the Hugging Face 🤗 Transformers library on a basic Everinfer setup and measure the achievable RPS and latency.
Preparation
The Hugging Face Transformers library provides a simple way to convert supported models to ONNX.
Install Transformers ONNX as described in the official documentation.
Run the following command to convert GPT-2 to ONNX:
!python3 -m transformers.onnx --model=gpt2 onnx/
The default Transformers tokenizer output can be used directly as an Everinfer task. Let's create a test input for GPT-2:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Everinfer is blazing fast! Please contact us at [email protected] to try it out!",
                   return_tensors="np")
Now we are ready to register the model on Everinfer and call it.
Using Everinfer
Authenticate with your API key and upload the converted model to Everinfer:
from everinfer import Client
client = Client('my_key')
pipeline = client.register_pipeline('gpt2', ['onnx/model.onnx'])
Create an inference engine and run it on the inputs, timing the cell to measure cold-start time:
%%time
runner = client.create_engine(pipeline['uuid'])
preds = runner.predict([inputs])
Wall time: 2.68 s
Yup, cold starts are fast! GPT-2 is ~0.5 GB and we start it up in less than 3 seconds!
Now our model is deployed to multiple machines and is warmed up!
Let's measure single-call latency:
%%timeit
preds = runner.predict([inputs])  # measure single-call latency
5.88 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
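Outside a notebook, the cell magics used here are not available. The following is a minimal sketch of a comparable measurement using only the standard library. Note that `fake_predict` and `measure_latency` are hypothetical names introduced for illustration: `fake_predict` is a stand-in that simply sleeps, and it should be swapped for the real `runner.predict` call to get meaningful numbers.

```python
import statistics
import time


def fake_predict(batch):
    """Hypothetical stand-in for runner.predict; replace with the real call."""
    time.sleep(0.001)  # simulate roughly 1 ms of work per call
    return [None] * len(batch)


def measure_latency(fn, batch, runs=7, loops=10):
    """Return (mean, stdev) of per-call latency in seconds across several runs."""
    per_call = []
    for _ in range(runs):
        start = time.perf_counter()
        for _ in range(loops):
            fn(batch)
        # Average the elapsed time over the loops in this run
        per_call.append((time.perf_counter() - start) / loops)
    return statistics.mean(per_call), statistics.stdev(per_call)


mean_s, std_s = measure_latency(fake_predict, [{}])
print(f"{mean_s * 1e3:.2f} ms ± {std_s * 1e6:.0f} µs per call")
```

The numbers printed by this sketch reflect the sleeping stand-in, not Everinfer, so they say nothing about real latency until the stand-in is replaced.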
Now let's measure how throughput scales on a batch job:
%%timeit
preds = runner.predict([inputs]*1000) # horizontal scaling of batch processing
1.1 s ± 4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
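To make the headline figure explicit, here is a short sketch of the arithmetic behind it, using only the timings reported above. The variable names are illustrative, and the "sequential" figure is an assumption about what a single caller issuing one request at a time would see, not a separate measurement.

```python
# Figures taken from the timings reported above
batch_size = 1000              # inputs submitted in one batch call
batch_seconds = 1.1            # wall time for that batch call
single_call_seconds = 5.88e-3  # mean latency of a single call

# Throughput of the batch job
batch_rps = batch_size / batch_seconds

# Throughput if the same calls were issued one at a time (assumption)
sequential_rps = 1 / single_call_seconds

# How much the batch job gains from horizontal scaling
speedup = batch_rps / sequential_rps

print(f"batch throughput:      {batch_rps:.0f} RPS")
print(f"sequential throughput: {sequential_rps:.0f} RPS")
print(f"speedup:               {speedup:.1f}x")
```

With these inputs the batch job works out to roughly 909 RPS, about 5.3 times what one-at-a-time calls at 5.88 ms each would reach.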
Congrats! You have deployed GPT-2 at 900+ RPS (1000 requests in 1.1 s) and 6 ms latency! 🎉 It can easily be scaled up further, so hit us up with your requirements 😏
Message [email protected] to reproduce that benchmark on any hardware or cloud provider of your choice.