What is BentoML?

BentoML is a Python library for building online serving systems optimized for AI apps and model inference.

Getting started

Install BentoML:

# Requires Python ≥ 3.9
pip install -U bentoml

Define APIs in a service.py file.

import bentoml

@bentoml.service(
    image=bentoml.images.Image(python_version="3.11").python_packages("torch", "transformers"),
)
class Summarization:
    def __init__(self) -> None:
        import torch
        from transformers import pipeline

        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.pipeline = pipeline('summarization', device=device)

    @bentoml.api(batchable=True)
    def summarize(self, texts: list[str]) -> list[str]:
        results = self.pipeline(texts)
        return [item['summary_text'] for item in results]

💻 Run locally

Install the PyTorch and Transformers packages into your Python virtual environment.

pip install torch transformers # additional dependencies for local run

Run the service code locally (serving at http://localhost:3000 by default):

bentoml serve

You should see output similar to the following:

[INFO] [cli] Starting production HTTP BentoServer from "service:Summarization" listening on http://localhost:3000 (Press CTRL+C to quit)
[INFO] [entry_service:Summarization:1] Service Summarization initialized

Now you can run inference from your browser at http://localhost:3000 or with a Python script:

import bentoml

with bentoml.SyncHTTPClient('http://localhost:3000') as client:
    summarized_text: str = client.summarize([bentoml.__doc__])[0]
    print(f"Result: {summarized_text}")
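The Python client above is a thin wrapper over a plain HTTP POST. An equivalent raw request with curl looks like the following (a sketch: it assumes the endpoint path matches the method name summarize and that the JSON field texts matches its parameter, which is how BentoML maps API signatures to request bodies):

```shell
# POST a JSON body whose "texts" field maps to the texts parameter of summarize()
curl -X POST http://localhost:3000/summarize \
  -H 'Content-Type: application/json' \
  -d '{"texts": ["BentoML is a Python library for building online serving systems optimized for AI apps and model inference."]}'
```

Because summarize is declared batchable, the server may group this request with concurrent ones before running the pipeline.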

🐳 Deploy using Docker

Run bentoml build to package the necessary code, models, and dependency configs into a Bento, the standardized deployable artifact in BentoML:

bentoml build

Ensure Docker is running. Generate a Docker container image for deployment:

bentoml containerize summarization:latest

Run the generated image:

docker run --rm -p 3000:3000 summarization:latest
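To check that the containerized service has come up, you can probe one of BentoML's built-in health endpoints (a sketch; /readyz reports whether the service is ready to accept traffic):

```shell
# Probe the readiness endpoint of the containerized service;
# an HTTP 200 indicates the Summarization service finished initializing
curl -i http://localhost:3000/readyz
```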



Last modified 22 March 2026