BentoML is a Python library for building online serving systems optimized for AI apps and model inference.
Install BentoML:
# Requires Python ≥ 3.9
pip install -U bentoml
Define APIs in a service.py file.
import bentoml

@bentoml.service(
    image=bentoml.images.Image(python_version="3.11").python_packages("torch", "transformers"),
)
class Summarization:
    def __init__(self) -> None:
        import torch
        from transformers import pipeline

        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.pipeline = pipeline('summarization', device=device)

    @bentoml.api(batchable=True)
    def summarize(self, texts: list[str]) -> list[str]:
        results = self.pipeline(texts)
        return [item['summary_text'] for item in results]
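Marking the API with batchable=True lets the server merge concurrent requests into a single call to summarize, which is why the method accepts and returns lists. The snippet below is a minimal stdlib sketch of that idea, not BentoML's actual implementation; micro_batcher and fake_summarize are illustrative names:

```python
def micro_batcher(handle_batch, requests, max_batch_size=4):
    """Toy adaptive batching: group pending (request_id, text) pairs and
    hand each group to the batch handler as one list."""
    results = {}
    pending = list(requests)
    while pending:
        batch, pending = pending[:max_batch_size], pending[max_batch_size:]
        outputs = handle_batch([text for _, text in batch])
        for (req_id, _), out in zip(batch, outputs):
            results[req_id] = out
    return results

def fake_summarize(texts):
    # Stand-in for the real pipeline: "summarize" to the first three words.
    return [" ".join(t.split()[:3]) for t in texts]

reqs = [(i, f"request number {i} with some body text") for i in range(6)]
print(micro_batcher(fake_summarize, reqs))
```

Six single-item requests arrive, but fake_summarize is invoked only twice (a batch of four, then a batch of two); BentoML's adaptive batching applies the same principle to live traffic.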
Install the PyTorch and Transformers packages into your Python virtual environment:
pip install torch transformers # additional dependencies for local run
Run the service code locally (serving at http://localhost:3000 by default):
bentoml serve
You should see output similar to the following:
[INFO] [cli] Starting production HTTP BentoServer from "service:Summarization" listening on http://localhost:3000 (Press CTRL+C to quit)
[INFO] [entry_service:Summarization:1] Service Summarization initialized
Now you can run inference from your browser at http://localhost:3000 or with a Python script:
import bentoml
with bentoml.SyncHTTPClient('http://localhost:3000') as client:
    summarized_text: str = client.summarize([bentoml.__doc__])[0]
    print(f"Result: {summarized_text}")
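You can also call the endpoint over plain HTTP: BentoML maps the API method name to the URL path and the method's parameter names to JSON keys. Assuming the service above is running locally, a request might look like this (the input text is illustrative):

```shell
# POST to /summarize; the JSON key "texts" matches the method's parameter name
curl -s -X POST http://localhost:3000/summarize \
  -H 'Content-Type: application/json' \
  -d '{"texts": ["BentoML is a Python library for building online serving systems."]}'
```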
Run bentoml build to package the necessary code, models, and dependency configurations into a Bento, the standardized deployable artifact in BentoML:
bentoml build
Ensure Docker is running, then generate a Docker container image for deployment:
bentoml containerize summarization:latest
Run the generated image:
docker run --rm -p 3000:3000 summarization:latest
Last modified 22 March 2026