# BERT With Zero Overhead

In this tutorial we'll show how to deploy BERT, one of the fastest NLP models, to remote GPUs while keeping latency low.

## Why BERT?

The BERT encoder is extremely fast, running in about 1.5 ms on a local GPU (tested on an NVIDIA T4). Deploying that model to remote machines while maintaining low latency is hard.

Everinfer is highly optimized and lets you run the model on remote machines while keeping up with its local speed.

## How to deploy BERT on Everinfer

Install Everinfer and the HuggingFace transformers library.
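
A typical install might look like the following (the `everinfer` package name is assumed from the import used later in this tutorial, and the `[onnx]` extra pulls in the dependencies the export step below needs):

```
!pip install everinfer transformers[onnx]
```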

Convert the model to ONNX format:

```
!python3 -m transformers.onnx --model=distilbert-base-uncased onnx/
```
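
Before uploading, you can optionally sanity-check the exported model locally. A minimal check, assuming `onnxruntime` is installed, is to load the file and list the inputs it expects:

```python
import onnxruntime as ort

# Load the exported model and print the input names, shapes, and types it expects
sess = ort.InferenceSession("onnx/model.onnx")
for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)
```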

Authenticate on Everinfer using your API key, upload the model, and create an inference engine:

```python
from everinfer import Client

client = Client('my_api_key')  # authenticate with your API key
pipeline = client.register_pipeline('bert', ['onnx/model.onnx'])  # upload the ONNX model
runner = client.create_engine(pipeline['uuid'])  # create a remote inference engine
```

You are ready to go!

Since HuggingFace tokenizers produce exactly the input format Everinfer expects, you can feed tokenizer outputs directly to the deployed model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
inputs = tokenizer("Everinfer is fast af", return_tensors="np")
```

After applying the tokenizer to the input text, running the model is as simple as:

```python
preds = runner.predict([inputs]) 
```
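
The same call also covers batches: since `predict` takes a list, you can pass one tokenized dict per text. This is a sketch extrapolated from the single-input call above, not a documented batching API:

```python
texts = ["Everinfer is fast", "BERT encoders are small", "Latency matters"]
batch = [tokenizer(t, return_tensors="np") for t in texts]

# One entry per input dict, mirroring the single-input call above
preds = runner.predict(batch)
```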

## Performance

Remote GPU access overhead is virtually zero!

<figure><img src="https://157857996-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEc85HZIPxdQNjizdmV41%2Fuploads%2FjEVKkC8Irb1qGArHWyMB%2Fimage.png?alt=media&#x26;token=50dfb511-3c38-43af-bff2-eb09d4411981" alt=""><figcaption></figcaption></figure>
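
To reproduce this on your own setup, a simple approach is to time repeated `predict` calls. This is a rough sketch; warm-up iterations, payload size, and your network path will all affect the numbers:

```python
import time

# Warm up the connection and any lazy initialization
for _ in range(10):
    runner.predict([inputs])

# Time sequential requests and report the average end-to-end latency
n = 100
start = time.perf_counter()
for _ in range(n):
    runner.predict([inputs])
elapsed = time.perf_counter() - start
print(f"avg end-to-end latency: {1000 * elapsed / n:.2f} ms")
```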

{% hint style="info" %}
You could deploy this code to AWS Lambda to go fully serverless, or use it as part of your self-hosted web app.
{% endhint %}
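
As a rough illustration of the serverless route, a Lambda handler might look like the sketch below. It assumes the pipeline was registered beforehand; the pipeline UUID placeholder, the handler name, and the `"text"` event field are illustrative, not part of the Everinfer API:

```python
import json

from everinfer import Client
from transformers import AutoTokenizer

# Created once per Lambda container, outside the handler, to avoid per-request setup
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
client = Client("my_api_key")
runner = client.create_engine("PIPELINE_UUID")  # UUID of a previously registered pipeline

def handler(event, context):
    # "text" is an illustrative field name for the incoming request payload
    inputs = tokenizer(event["text"], return_tensors="np")
    preds = runner.predict([inputs])
    # The exact structure of preds depends on the model, so stringify for the demo
    return {"statusCode": 200, "body": json.dumps(str(preds))}
```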
