# GPT-2: 900+ RPS

In this example, we run GPT-2 from the Hugging Face :hugging: Transformers library on a basic Everinfer setup and evaluate the achievable RPS and latency.

{% hint style="info" %}
All tests are conducted from an AWS g4dn.xlarge client machine located in N. Virginia. \
\
The example below demonstrates easily achievable RPS and latency. Performance can be scaled up or down to fit your requirements.\
\
Please feel free to contact us <hello@everinfer.ai> to discuss your perfect setup :)
{% endhint %}

## Preparation

The Hugging Face Transformers library provides a trivial way to convert supported models to ONNX.

Install the Transformers ONNX export dependencies as described in the [official documentation](https://huggingface.co/docs/transformers/serialization).

Run the following command to convert GPT-2 to ONNX:

```bash
python3 -m transformers.onnx --model=gpt2 onnx/
```
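
Optionally, you can sanity-check the export before uploading it. Here is a minimal sketch using onnxruntime (assuming it is installed locally; it is not required by Everinfer itself):

```python
import onnxruntime as ort

# Load the exported graph on CPU and list its expected inputs and outputs.
session = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
print([inp.name for inp in session.get_inputs()])   # e.g. ['input_ids', 'attention_mask']
print([out.name for out in session.get_outputs()])  # e.g. ['last_hidden_state']
```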

The default Transformers tokenizer output works as-is for creating Everinfer tasks. \
Let's create a test input for GPT-2:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# return_tensors="np" yields NumPy arrays, which we pass to Everinfer directly
inputs = tokenizer("Everinfer is blazing fast! Please contact us at hello@everinfer.ai to try it out!",
                   return_tensors="np")
```
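
To see what a task payload looks like, you can inspect the tokenizer output. A quick sketch (the key names below are the standard GPT-2 ONNX inputs):

```python
# BatchEncoding behaves like a dict of named model inputs.
for name, array in inputs.items():
    print(name, array.shape, array.dtype)
# e.g. input_ids      (1, sequence_length) int64
#      attention_mask (1, sequence_length) int64
```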

Now we are ready to register the model on Everinfer and call it.

## Using Everinfer

Authenticate with your API key and upload the converted model to Everinfer:

```python
from everinfer import Client

client = Client('my_key')  # your Everinfer API key
pipeline = client.register_pipeline('gpt2', ['onnx/model.onnx'])
```

Create an inference engine and run it on the inputs, timing the call to measure cold-start time:

```python
%%time
runner = client.create_engine(pipeline['uuid'])
preds = runner.predict([inputs]) 
```

> Wall time: 2.68 s

{% hint style="success" %}
Yup, cold start times are fast! GPT-2 is \~0.5 GB and we start it up in less than 3 seconds!
{% endhint %}

Now our model is deployed to multiple machines and is warmed up!

Let's measure single-call latency:

```python
%%timeit
preds = runner.predict([inputs]) # measure single-call latency
```

> 5.88 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
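
The exact structure of `preds` is not shown here; as a hypothetical sketch, assuming each result maps ONNX output names to NumPy arrays (an assumption about the client, not documented Everinfer behavior):

```python
# Hypothetical: assumes predict returns one result per input, each a mapping
# from ONNX output names (e.g. 'last_hidden_state') to NumPy arrays.
output = preds[0]
for name, array in output.items():
    print(name, array.shape)
```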

Next, let's measure how batch jobs scale horizontally:

```python
%%timeit
preds = runner.predict([inputs]*1000) # horizontal scaling of batch processing
```

> 1.1 s ± 4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
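
Dividing the batch size by the measured wall time gives the effective throughput:

```python
# 1000 requests completed in ~1.1 s of wall time
print(1000 / 1.1)  # ≈ 909 requests per second
```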

Congrats! You have achieved a **900+ RPS**, **\~6 ms** latency deployment of GPT-2! :tada:\
It can easily be scaled up, by the way; hit us up with your requirements :smirk:

{% hint style="success" %}
Message <hello@everinfer.ai> to reproduce this benchmark on any hardware or cloud provider of your choice.
{% endhint %}
