Scaling ML Deployment with Ray Serve on Kubernetes: A Practical Guide for DevOps Teams

Why Ray Serve? And Why Now?

Machine learning (ML) workloads are maturing fast. Models that used to live in experimental notebooks now power real-time user experiences, from recommendations to fraud detection. And with that shift comes pressure — pressure to deploy faster, scale reliably, and recover smoothly.

That’s where Ray Serve steps in. Built on top of the Ray distributed computing framework, Ray Serve gives you an elegant, scalable, and Python-native way to deploy ML models as APIs.

But like any powerful tool, it needs a strong foundation. And for that, we turn to Kubernetes.

In this blog, I’ll walk through how I’ve deployed production-grade ML services using Ray Serve on Kubernetes — and what DevOps teams need to know to make it work smoothly. No hype, no jargon — just practical architecture, lessons learned, and a few bruises I earned along the way.

The Architecture at a Glance

Let’s start with a high-level overview of how it fits together:

  • Kubernetes (EKS/GKE): The orchestrator, managing nodes, scaling, and pod lifecycles.
  • Ray Cluster: A set of Ray pods, split into a head pod and one or more worker pods.
  • Ray Serve: Runs on top of the Ray cluster. Its controller lives on the head pod, while the model replicas that hold your serving logic are scheduled across the workers.
  • Model API: Your Python model, wrapped in a FastAPI-compatible Ray Serve deployment.
  • Ingress (like NGINX or ALB): Routes traffic to the Ray Serve HTTP proxy.
  • Monitoring (Prometheus + Grafana): Metrics tracking CPU, memory, and inference latency.

This setup allows you to deploy any Python-based model — from scikit-learn to PyTorch — as a scalable, auto-replicated service with built-in request batching, versioning, and A/B testing.

Step 1: Setting Up Ray on Kubernetes

Ray has native support for Kubernetes via the Ray Operator. Here’s what you need:

  1. Install the Ray Operator using Helm:

helm repo add ray https://ray-project.github.io/ray-helm/
helm install ray-operator ray/ray-operator


  2. Define a RayCluster YAML, which sets up:
    • A head pod to run the Ray dashboard and controller.
    • One or more worker pods that execute tasks and handle model replicas.
  3. Apply the cluster manifest:

kubectl apply -f ray-cluster.yaml

Once the cluster is up, you’ll use the head pod as your deployment target for Ray Serve applications.
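
If you prefer to drive deployments from outside the cluster (a laptop or a CI job), one option is Ray Client, which talks to the head pod over port 10001. A minimal sketch; the Service address below is a placeholder and depends on what your RayCluster manifest actually creates:

import ray

# Connect to the head pod through its Kubernetes Service on the Ray Client
# port (10001). The hostname is a placeholder: substitute the Service your
# RayCluster manifest creates, or a kubectl port-forwarded localhost address.
ray.init(address="ray://ray-head-svc.ray.svc.cluster.local:10001")

Once connected this way, the serve.run() calls in the next step should target the remote cluster rather than a local Ray instance.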

Step 2: Deploying Your Model with Ray Serve

Ray Serve uses decorators to wrap your model function or class and expose it via HTTP.

Here’s a basic example:

from ray import serve
import joblib

@serve.deployment
class MyModel:
    def __init__(self):
        # Load the serialized model once per replica, at startup.
        self.model = joblib.load("/models/my_model.pkl")

    async def __call__(self, request):
        # Parse the JSON request body and run inference.
        input_data = await request.json()
        prediction = self.model.predict([input_data["features"]])
        return {"prediction": prediction.tolist()}

To deploy this to your running Ray cluster:

serve.run(MyModel.bind())

From here, you get a REST endpoint for inference. Want more capacity? Add .options(num_replicas=3) to your deployment to pin three replicas, or hand Serve an autoscaling_config and let it adjust the replica count with load, as sketched below.
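
Here is a minimal sketch of both scaling modes, reusing the MyModel class from above; the ingress hostname in the smoke test is a placeholder, and the exact autoscaling_config fields can vary between Ray versions:

from ray import serve
import requests

# Option A, fixed scaling: always keep three replicas of the deployment running.
serve.run(MyModel.options(num_replicas=3).bind())

# Option B, autoscaling: let Serve vary the replica count with traffic instead.
serve.run(MyModel.options(autoscaling_config={"min_replicas": 1, "max_replicas": 5}).bind())

# Quick smoke test against the HTTP endpoint exposed in the next step.
resp = requests.post(
    "http://my-ray-ingress.example.com/",
    json={"features": [1.0, 2.0, 3.0]},
)
print(resp.json())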

Step 3: Exposing Ray Serve to the Outside World

Ray Serve includes an HTTP proxy inside the head pod, but it’s internal by default. To expose it:

  1. Create a Kubernetes Service for the Ray Serve HTTP port (default: 8000).
  2. Use an Ingress controller (like NGINX or ALB) to route external traffic.
  3. (Optional) Add an API Gateway for authentication or rate limiting.
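
One gotcha: depending on your Ray version, the Serve HTTP proxy may bind to 127.0.0.1 by default, in which case the Service above has nothing to route to. A minimal sketch of setting the host explicitly when you start Serve yourself (if you deploy through KubeRay's RayService CRD instead, the equivalent lives in its Serve config):

from ray import serve

# Make the Serve HTTP proxy listen on all interfaces on port 8000 so the
# Kubernetes Service from the list above can actually reach it.
serve.start(http_options={"host": "0.0.0.0", "port": 8000})
serve.run(MyModel.bind())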

For production, make sure to:

  • Terminate TLS at the ingress layer.
  • Use readiness probes for Ray pods.
  • Set autoscaling limits on the RayCluster to avoid overconsumption.

Observability and Resilience

No deployment is complete without monitoring.

Here’s what I monitor in every Ray Serve deployment:

  • Inference latency (per endpoint)
  • Queue depth and backlog (Ray metrics)
  • CPU and memory usage per pod
  • Model load failures
  • Error rates (HTTP 5xx)

Prometheus can scrape Ray’s built-in metrics, and Grafana makes it easy to visualize trends. Alerting on model errors or traffic spikes can save you a ton of fire drills.
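
Ray exports most of the infrastructure-level metrics for you; for application-level signals like the ones above, you can emit custom metrics from inside a deployment with ray.util.metrics. A minimal sketch, with made-up metric names and a stand-in for real inference:

import time

from ray import serve
from ray.util import metrics

@serve.deployment
class InstrumentedModel:
    def __init__(self):
        # Custom metrics are exported alongside Ray's built-ins and can be
        # scraped by the same Prometheus job.
        self.error_counter = metrics.Counter(
            "my_model_errors", description="Failed inference requests."
        )
        self.latency_ms = metrics.Histogram(
            "my_model_latency_ms",
            description="Inference latency in milliseconds.",
            boundaries=[5, 10, 25, 50, 100, 250, 500],
        )

    async def __call__(self, request):
        start = time.perf_counter()
        try:
            data = await request.json()
            # Stand-in for real inference; swap in your model call here.
            return {"prediction": sum(data["features"])}
        except Exception:
            self.error_counter.inc()
            raise
        finally:
            self.latency_ms.observe((time.perf_counter() - start) * 1000)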

Also, remember: Ray Serve supports rolling updates, so you can push new models without downtime. That’s a DevOps dream.

Scaling Tips from Real-World Use

Over time, I’ve learned a few things that might help your team:

  • Use node affinity to co-locate Ray worker pods for better performance.
  • Package models separately from code — use object stores or persistent volumes.
  • Set num_cpus per deployment (via ray_actor_options) to avoid oversubscribing pods.
  • Batch requests if you expect high-throughput inference; Ray Serve's built-in batching makes this easy (see the sketch after this list).
  • Avoid stateful logic inside Ray deployments unless you really know what you’re doing.
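
To make the num_cpus and batching tips concrete, here is a minimal sketch: a deployment that caps its CPU request through ray_actor_options and vectorizes inference with Ray Serve's @serve.batch. The model path is the same placeholder used earlier, and the output handling assumes numeric predictions:

from ray import serve
import joblib

@serve.deployment(ray_actor_options={"num_cpus": 1})  # one CPU per replica
class BatchedModel:
    def __init__(self):
        # Same placeholder artifact path as in step 2.
        self.model = joblib.load("/models/my_model.pkl")

    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.05)
    async def predict_batch(self, feature_rows):
        # Ray Serve collects up to 32 concurrent requests (or waits 50 ms),
        # hands them over as a list, and fans the results back out.
        predictions = self.model.predict(feature_rows)
        # Assumes numeric model outputs (regression or integer class labels).
        return [{"prediction": float(p)} for p in predictions]

    async def __call__(self, request):
        data = await request.json()
        # Each individual HTTP request awaits its slot in the next batch.
        return await self.predict_batch(data["features"])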

Common Pitfalls to Avoid

Let me save you some pain:

  • Don’t use default resource limits. Ray needs memory and CPU to scale properly.
  • Don’t run your Ray cluster in the same namespace as unrelated workloads.
  • Avoid using large models without lazy loading — startup times will kill your rollout performance.
  • Remember: Ray clusters don’t persist state between pod terminations unless you configure volume mounts.

Empowering DevOps to Own ML Deployment

If you’re a DevOps engineer, ML deployment might seem like “someone else’s problem.” But that’s changing. As ML becomes a core part of business logic, it belongs in your pipeline — monitored, versioned, and deployed like any other service.

Ray Serve on Kubernetes is the toolset that brings ML to our world — the world of reproducibility, automation, and scale.

The next time your data science team hands you a “model to productionize,” don’t just duct tape Flask onto it. Give it a home in your infrastructure. Give it observability. Give it failover.

In short: treat it like you would any critical microservice.

Because now, it is.
