# GPT-2: 900+ RPS

In this example, we run GPT-2 from the Hugging Face :hugging: Transformers library on a basic Everinfer setup and measure the achievable RPS and latency.

{% hint style="info" %}
All tests were run from an AWS g4dn.xlarge client machine located in N. Virginia. \
\
The example below demonstrates easily achievable RPS and latency. Performance can be scaled up or down to fit your requirements.\
\
Please feel free to contact us at <hello@everinfer.ai> to discuss your perfect setup :)
{% endhint %}

## Preparation

The Hugging Face Transformers library provides a simple way to convert supported models to ONNX.

Install Transformers with ONNX support as described in the [official documentation](https://huggingface.co/docs/transformers/serialization).

Run the following command in a notebook cell to convert GPT-2 to ONNX (drop the leading `!` when running it from a shell):

```
!python3 -m transformers.onnx --model=gpt2 onnx/
```

The default Transformers tokenizer output can be used directly to create Everinfer tasks. \
Let's create a test input for GPT-2:

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Everinfer is blazing fast! Please contact us at hello@everinfer.ai to try it out!",
                   return_tensors="np")
```
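Before sending the input anywhere, it can help to check what the tokenizer actually produced. The helper below is a minimal sketch, not part of Everinfer or Transformers: `describe_task` is a hypothetical name, and it only assumes that each value in the mapping exposes `.shape` and `.dtype`, as NumPy arrays do.

```python
def describe_task(task):
    """Return {input_name: (shape, dtype)} for a mapping of array-like values.

    Hypothetical helper for illustration; it relies only on the `.shape`
    and `.dtype` attributes that NumPy arrays provide.
    """
    return {
        name: (tuple(value.shape), str(value.dtype))
        for name, value in task.items()
    }


# Usage with the tokenizer output created above:
# print(describe_task(inputs))
```

For the GPT-2 tokenizer with `return_tensors="np"`, you would expect the `input_ids` and `attention_mask` entries, each with a leading batch dimension of 1.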

Now we are ready to register the model on Everinfer and call it.

## Using Everinfer

Authenticate with your API key and upload the converted model to Everinfer:

```python
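from everinfer import Client
# Replace 'my_key' with your Everinfer API key.
client = Client('my_key')
# Upload the exported ONNX file and register it as the 'gpt2' pipeline.
pipeline = client.register_pipeline('gpt2', ['onnx/model.onnx'])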
```

Create an inference engine and run it on the inputs, timing the cell to measure the cold start time:

```python
%%time
runner = client.create_engine(pipeline['uuid'])  # create an engine for the registered pipeline
preds = runner.predict([inputs])  # first call, so the timing includes the cold start
```

> Wall time: 2.68 s

{% hint style="success" %}
Yup, cold start times are fast! GPT-2 is \~0.5 GB and we start it up in less than 3 seconds!
{% endhint %}

Now our model is deployed to multiple machines and is warmed up!

Let's measure single-call latency:

```python
%%timeit
preds = runner.predict([inputs]) # measure single-call latency
```

> 5.88 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
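Outside a notebook, where cell magics such as `%%time` and `%%timeit` are unavailable, a similar measurement can be sketched with the standard library. `measure_latency` is a hypothetical helper, not an Everinfer API; it times any zero-argument callable and assumes `repeats` is at least 2 so that a standard deviation can be computed.

```python
import statistics
import time


def measure_latency(call, repeats=100):
    """Run `call()` `repeats` times and return (mean_ms, stdev_ms).

    Hypothetical helper for illustration; `repeats` must be >= 2
    because `statistics.stdev` needs at least two samples.
    """
    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples_ms), statistics.stdev(samples_ms)


# Usage with the engine created above:
# mean_ms, stdev_ms = measure_latency(lambda: runner.predict([inputs]))
# print(f"{mean_ms:.2f} ms ± {stdev_ms:.2f} ms")
```

Unlike `%%timeit`, this sketch does no warm-up or repeated runs, so its numbers will not match the notebook output exactly.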

Next, let's measure how throughput scales on batch jobs:

```python
%%timeit
preds = runner.predict([inputs]*1000) # horizontal scaling of batch processing
```

> 1.1 s ± 4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Congrats! You have deployed GPT-2 with **900+ RPS** throughput and **\~6 ms** latency! :tada:\
This can easily be scaled up, by the way. Hit us up with your requirements :smirk:
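The headline numbers follow from the timings reported above. The arithmetic below is a small sketch using only those figures (1000 inputs in 1.1 s, and 5.88 ms per single call); the variable names are illustrative.

```python
# Figures taken from the timing outputs above.
batch_size = 1000           # inputs sent in one predict() call
batch_seconds = 1.1         # mean wall time for the whole batch
single_call_seconds = 5.88e-3  # mean latency of one predict() call

# Throughput of the batch job.
batch_rps = batch_size / batch_seconds

# Throughput a single client would get by calling predict() one input
# at a time, back to back, at the measured latency.
sequential_rps = 1 / single_call_seconds

# How much faster the batch job is than strictly sequential calls.
speedup = batch_rps / sequential_rps

print(f"batch: {batch_rps:.0f} RPS")            # batch: 909 RPS
print(f"sequential: {sequential_rps:.0f} RPS")  # sequential: 170 RPS
print(f"speedup: {speedup:.1f}x")               # speedup: 5.3x
```

The gap between the two figures is what you would expect if the batch is spread across several machines instead of being processed one input at a time.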

{% hint style="success" %}
Message <hello@everinfer.ai> to reproduce this benchmark on any hardware or cloud provider of your choice.
{% endhint %}

