Leo Snetsinger

Scaling ML Deployment with Ray Serve on Kubernetes: A Practical Guide for DevOps Teams

Leo Snetsinger — Thu, 04 Sep 2025 18:11:46 +0000

Why Ray Serve? And Why Now?

Machine learning (ML) workloads are maturing fast. What used to be experimental notebooks are now powering real-time user experiences, from recommendations to fraud detection. And with that shift comes pressure — pressure to deploy faster, scale reliably, and recover smoothly.

That’s where Ray Serve steps in. Built on top of the Ray distributed computing framework, Ray Serve gives you an elegant, scalable, and Python-native way to deploy ML models as APIs.

But like any powerful tool, it needs a strong foundation. And for that, we turn to Kubernetes.

In this blog, I’ll walk through how I’ve deployed production-grade ML services using Ray Serve on Kubernetes — and what DevOps teams need to know to make it work smoothly. No hype, no jargon — just practical architecture, lessons learned, and a few bruises I earned along the way.

The Architecture at a Glance

Let’s start with a high-level overview of how it fits together:

- Kubernetes (EKS/GKE): The orchestrator, managing nodes, scaling, and pod lifecycles.
- Ray Cluster: A set of Ray pods, made up of one head pod and one or more worker pods.
- Ray Serve: Runs on the Ray cluster. The Serve controller lives on the head pod, and model replicas are scheduled onto worker pods; this is where your model-serving logic lives.
- Model API: Your Python model, wrapped in a FastAPI-compatible Ray Serve deployment.
- Ingress (like NGINX or ALB): Routes traffic to the Ray Serve HTTP proxy.
- Monitoring (Prometheus + Grafana): Metrics tracking CPU, memory, and inference latency.

This setup allows you to deploy any Python-based model, from scikit-learn to PyTorch, as a scalable, replicated service. Request batching is built in, and versioned rollouts and A/B tests can be layered on top using multiple deployments.

Step 1: Setting Up Ray on Kubernetes

Ray has native support for Kubernetes via the KubeRay operator. Here’s what you need:

Install the KubeRay operator using Helm:

helm repo add kuberay https://ray-project.github.io/kuberay-helm/

helm repo update

helm install kuberay-operator kuberay/kuberay-operator

Define a RayCluster YAML, which sets up:
- A head pod to run the Ray dashboard and controller.
- One or more worker pods that execute tasks and handle model replicas.
Apply the cluster manifest:

kubectl apply -f ray-cluster.yaml

Once the cluster is up, you’ll use the head pod as your deployment target for Ray Serve applications.
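
As a rough sketch of what ray-cluster.yaml might contain, the snippet below builds a minimal RayCluster manifest as a Python dict and prints it as JSON, which kubectl also accepts. The cluster name, image tag, group name, resource sizes, and replica bounds are placeholder assumptions rather than values from my setup, and the field names should be checked against the KubeRay version you install.

```python
import json


def build_ray_cluster(name="demo-raycluster", image="rayproject/ray:2.9.0",
                      min_workers=1, max_workers=4):
    """Return a minimal RayCluster manifest as a plain dict.

    Every default here is an illustrative placeholder.
    """
    def pod_template(container_name):
        # Explicit requests/limits so the pods are not left on defaults.
        return {"spec": {"containers": [{
            "name": container_name,
            "image": image,
            "resources": {
                "requests": {"cpu": "1", "memory": "2Gi"},
                "limits": {"cpu": "1", "memory": "2Gi"},
            },
        }]}}

    return {
        # Assumes a recent KubeRay release; older ones used ray.io/v1alpha1.
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": name},
        "spec": {
            "headGroupSpec": {
                "rayStartParams": {"dashboard-host": "0.0.0.0"},
                "template": pod_template("ray-head"),
            },
            "workerGroupSpecs": [{
                "groupName": "workers",
                "replicas": min_workers,
                "minReplicas": min_workers,
                "maxReplicas": max_workers,
                "rayStartParams": {},
                "template": pod_template("ray-worker"),
            }],
        },
    }


manifest = build_ray_cluster()
print(json.dumps(manifest, indent=2))
```

Redirected to a file, the output could be passed to kubectl apply -f in place of a hand-written YAML manifest.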

Step 2: Deploying Your Model with Ray Serve

Ray Serve uses decorators to wrap your model function or class and expose it via HTTP.

Here’s a basic example:

from ray import serve
import joblib


@serve.deployment
class MyModel:
    def __init__(self):
        # Load the serialized model once per replica.
        self.model = joblib.load("/models/my_model.pkl")

    async def __call__(self, request):
        # Expects a JSON body such as {"features": [...]}.
        input_data = await request.json()
        prediction = self.model.predict([input_data["features"]])
        return {"prediction": prediction.tolist()}

To deploy this to your running Ray cluster:

serve.run(MyModel.bind())

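From here, you get a REST endpoint for inference. Need more capacity? Add .options(num_replicas=3) to your deployment for a fixed number of replicas, or set autoscaling_config (with min_replicas and max_replicas) to let Ray Serve scale replicas with load.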
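
To make the difference between a fixed replica count and autoscaling concrete, here is a simplified model of how a request-based autoscaler might pick a replica count. The key names mirror Ray Serve's autoscaling_config, but the values are placeholders, the target key has been named differently in some Ray releases, and the formula is my own simplification rather than Ray's actual algorithm.

```python
import math

# Hypothetical settings; values are placeholders for illustration.
autoscaling_config = {
    "min_replicas": 1,
    "max_replicas": 5,
    "target_ongoing_requests": 2,
}


def desired_replicas(total_ongoing_requests, config):
    """Simplified model: enough replicas to meet the per-replica target,
    clamped to the configured bounds. Not Ray's actual algorithm."""
    wanted = math.ceil(total_ongoing_requests / config["target_ongoing_requests"])
    return max(config["min_replicas"], min(config["max_replicas"], wanted))


for load in (0, 3, 7, 40):
    print(load, "->", desired_replicas(load, autoscaling_config))
```

In Ray Serve itself, a dict like this would be passed as autoscaling_config to @serve.deployment or .options(...) instead of num_replicas.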

Step 3: Exposing Ray Serve to the Outside World

Ray Serve includes an HTTP proxy inside the head pod, but it’s internal by default. To expose it:

- Create a Kubernetes Service for the Ray Serve HTTP port (default: 8000).
- Use an Ingress controller (like NGINX or ALB) to route external traffic.
- (Optional) Add an API Gateway for authentication or rate limiting.

For production, make sure to:

- Terminate TLS at the ingress layer.
- Use readiness probes for Ray pods.
- Set autoscaling limits on the RayCluster to avoid overconsumption.
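
The Service, Ingress, and TLS pieces above can be sketched as two Kubernetes manifests, built here as Python dicts and printed as JSON. The resource names, hostname, TLS secret, ingress class, and pod label selector are all assumptions for illustration; check the labels your operator actually puts on the head pod before relying on the selector.

```python
import json

SERVE_PORT = 8000  # Ray Serve's default HTTP port


def build_service(name="ray-serve-svc", cluster="demo-raycluster"):
    """ClusterIP Service in front of the Serve HTTP port (names are placeholders)."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "ClusterIP",
            # Assumed pod labels; confirm with `kubectl get pods --show-labels`.
            "selector": {"ray.io/cluster": cluster, "ray.io/node-type": "head"},
            "ports": [{"name": "serve", "port": SERVE_PORT, "targetPort": SERVE_PORT}],
        },
    }


def build_ingress(service_name, host="ml.example.com", tls_secret="ml-example-tls"):
    """Ingress that terminates TLS and forwards all paths to the Service."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": service_name + "-ingress"},
        "spec": {
            "ingressClassName": "nginx",  # assumes an NGINX ingress controller
            "tls": [{"hosts": [host], "secretName": tls_secret}],
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {
                        "name": service_name,
                        "port": {"number": SERVE_PORT},
                    }},
                }]},
            }],
        },
    }


service = build_service()
ingress = build_ingress(service["metadata"]["name"])
print(json.dumps([service, ingress], indent=2))
```

Keeping the port in one constant means the Service and the Ingress backend cannot drift apart when the Serve port changes.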

Observability and Resilience

No deployment is complete without monitoring.

Here’s what I monitor in every Ray Serve deployment:

- Inference latency (per endpoint)
- Queue depth and backlog (Ray metrics)
- CPU and memory usage per pod
- Model load failures
- Error rates (HTTP 5xx)

Prometheus can scrape Ray’s built-in metrics, and Grafana makes it easy to visualize trends. Alerting on model errors or traffic spikes can save you a ton of fire drills.
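
To show the kind of check an error-rate alert encodes, here is a small sketch that parses Prometheus-style text output and computes a 5xx error rate. The metric names and sample numbers are hypothetical stand-ins (Ray's real metric names vary by version), and in practice this logic would live in a Prometheus alerting rule rather than in Python.

```python
SAMPLE = """\
# HELP serve_http_requests_total Hypothetical request counter.
serve_http_requests_total{route="/predict"} 950
serve_http_requests_total{route="/health"} 50
serve_http_errors_total{route="/predict",code="500"} 25
"""


def parse_counters(exposition_text):
    """Sum sample values per metric name, ignoring labels.

    Assumes each sample line ends with its value (no trailing timestamp).
    """
    totals = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def error_rate(totals):
    """Fraction of requests that ended in an error (0.0 when idle)."""
    requests = totals.get("serve_http_requests_total", 0.0)
    errors = totals.get("serve_http_errors_total", 0.0)
    return errors / requests if requests else 0.0


ALERT_THRESHOLD = 0.01  # placeholder: alert above 1% errors

totals = parse_counters(SAMPLE)
rate = error_rate(totals)
print(f"error rate {rate:.3f}, alert={rate > ALERT_THRESHOLD}")
```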

Also, remember: Ray Serve supports rolling updates, so you can push new models without downtime. That’s a DevOps dream.

Scaling Tips from Real-World Use

Over time, I’ve learned a few things that might help your team:

- Use node affinity to co-locate Ray worker pods for better performance.
- Package models separately from code, using object stores or persistent volumes.
- Set num_cpus per deployment (through ray_actor_options) to avoid oversubscribing pods.
- Batch requests if you expect high-throughput inference (Ray Serve’s @serve.batch decorator makes this easy).
- Avoid stateful logic inside Ray deployments unless you really know what you’re doing.
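
Of the tips above, batching is the one that benefits most from seeing the mechanics. The sketch below is a standard-library illustration of micro-batching with asyncio: concurrent requests are queued briefly and handed to the model as one list. It is not how Ray Serve implements batching, the class and function names are made up for this example, and error handling is left out for brevity.

```python
import asyncio


class MicroBatcher:
    """Collect concurrent requests into small batches before predicting."""

    def __init__(self, predict_batch, max_batch_size=4, wait_s=0.01):
        self._predict_batch = predict_batch  # takes a list, returns a list
        self._max_batch_size = max_batch_size
        self._wait_s = wait_s
        self._queue = None
        self._worker = None

    async def submit(self, item):
        # Create the queue and worker lazily so they bind to the running loop.
        if self._worker is None:
            self._queue = asyncio.Queue()
            self._worker = asyncio.create_task(self._run())
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future

    async def _run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block for the first item, then gather more until full or timed out.
            batch = [await self._queue.get()]
            deadline = loop.time() + self._wait_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            results = self._predict_batch([item for item, _ in batch])
            for (_, future), result in zip(batch, results):
                future.set_result(result)


batch_sizes = []


def double_all(items):
    # Stand-in for a vectorized model.predict call.
    batch_sizes.append(len(items))
    return [x * 2 for x in items]


async def demo():
    batcher = MicroBatcher(double_all)
    return await asyncio.gather(*(batcher.submit(i) for i in range(5)))


print(asyncio.run(demo()), batch_sizes)
```

The two knobs, maximum batch size and maximum wait, trade throughput against tail latency, which is the same trade-off you tune with Ray Serve's own batching options.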

Common Pitfalls to Avoid

Let me save you some pain:

- Don’t rely on default resource requests and limits. Ray needs explicitly sized memory and CPU to scale properly.
- Don’t run your Ray cluster in the same namespace as unrelated workloads.
- Avoid using large models without lazy loading; slow startup times will drag out your rollouts.
- Remember: Ray clusters don’t persist state between pod terminations unless you configure volume mounts.
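
For the lazy-loading point above, here is one minimal way to defer an expensive model load until the first prediction and then cache it. The LazyModel class and the fake loader are hypothetical names for this sketch; in a real deployment the loader would wrap something like joblib.load.

```python
import threading


class LazyModel:
    """Defer an expensive load until the first prediction, then cache it."""

    def __init__(self, loader):
        self._loader = loader  # zero-argument callable returning a model
        self._model = None
        self._lock = threading.Lock()

    def _get(self):
        if self._model is None:
            with self._lock:  # avoid loading twice under concurrency
                if self._model is None:
                    self._model = self._loader()
        return self._model

    def predict(self, features):
        return self._get()(features)


load_calls = []


def fake_loader():
    # Stand-in for a slow load from disk or an object store.
    load_calls.append(1)
    return lambda features: sum(features)


model = LazyModel(fake_loader)
loads_before_first_call = len(load_calls)
print(model.predict([1, 2, 3]), model.predict([4]), len(load_calls))
```

The trade-off is that the first request pays the load cost, so a warm-up call after startup is worth considering.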

Empowering DevOps to Own ML Deployment

If you’re a DevOps engineer, ML deployment might seem like “someone else’s problem.” But that’s changing. As ML becomes a core part of business logic, it belongs in your pipeline — monitored, versioned, and deployed like any other service.

Ray Serve on Kubernetes is the toolset that brings ML to our world — the world of reproducibility, automation, and scale.

The next time your data science team hands you a “model to productionize,” don’t just duct tape Flask onto it. Give it a home in your infrastructure. Give it observability. Give it failover.

In short: treat it like you would any critical microservice.

Because now, it is.

From Kubernetes to Crustaceans: Applying DevOps Principles to Sustainable Aquaculture

Leo Snetsinger — Thu, 04 Sep 2025 18:09:11 +0000

Bridging the Worlds of Tech and Aquaculture

As a senior platform engineer, I’ve spent over a decade working with cloud-native infrastructure. My daily tools included Kubernetes, GitOps, CI/CD pipelines, and enough YAML to fill a novel. But lately, I’ve found myself applying those same tools — or at least the thinking behind them — to a completely different domain: indoor shrimp farming.

Running my aquaculture startup, Homeland Shrimp, I’ve realized something simple but powerful: the mindset and principles we use in DevOps don’t stop at the data center. In fact, they translate beautifully to building and maintaining environmentally controlled systems in the physical world — where reliability, observability, and automation are just as critical.

This post is a deep dive into how lessons from infrastructure engineering are helping me sustainably raise Pacific white shrimp — and how you can use DevOps thinking beyond the screen.

Infrastructure as Code → Infrastructure as Ecosystem

In cloud infrastructure, we treat our architecture as code. Everything is declarative, repeatable, and version-controlled. You want your system to spin up the same way every time, regardless of who’s deploying it or where it’s running.

In aquaculture, I’ve come to treat the tank environment the same way.

Each tank is a system. It has inputs (feed, oxygen, temperature control), outputs (ammonia, waste, growth), and defined state goals: stable pH, optimal temperature, dissolved oxygen levels. Just like cloud infrastructure, it needs to be configured, monitored, and tuned for uptime and performance.

My background in defining predictable environments with Kubernetes gave me the discipline to build precision control into the physical world — and it’s working. I know what “healthy” looks like in the same way a production cluster knows what “running” looks like.

Observability: You Can’t Fix What You Can’t See

In platform engineering, we build dashboards with metrics like CPU usage, error rates, and deployment status. These let us make fast, informed decisions — and alert us before users even notice something is wrong.

Shrimp don’t file support tickets. But they do give off signals — and if you’re paying attention, they’re just as valuable.

I monitor real-time data from sensors tracking temperature, oxygen saturation, ammonia, nitrates, and water flow. These sensors are my Prometheus. My dashboard isn’t Grafana, but it performs the same function: visualizing system health.

When oxygen levels start to drop, I don’t wait for shrimp to react. I get alerted and let automation kick in. That’s the same feedback loop I’ve relied on in GitOps-based environments — only this one keeps creatures alive.
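
That alert-then-automate loop can be sketched in a few lines. The metric names, thresholds, and actions below are hypothetical placeholders for illustration, not my actual tank settings and not husbandry guidance.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str
    minimum: float
    action: str


def evaluate(reading, rules):
    """Return the actions to trigger for one sensor reading.

    A missing value is treated like a breach, since a dead sensor
    is itself something to react to.
    """
    actions = []
    for rule in rules:
        value = reading.get(rule.metric)
        if value is None or value < rule.minimum:
            actions.append(rule.action)
    return actions


# Placeholder rules for illustration only.
rules = [
    Rule("dissolved_oxygen_mg_l", 5.0, "start_backup_aerator"),
    Rule("water_flow_l_min", 10.0, "switch_to_redundant_pump"),
]

reading = {"dissolved_oxygen_mg_l": 4.2, "water_flow_l_min": 14.0}
print(evaluate(reading, rules))
```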

Kubernetes Taught Me to Design for Failure

Kubernetes assumes that things will break. Pods die. Nodes flake. Containers crash. And that’s OK — because the system recovers automatically.

That principle — “design for failure” — has had a huge influence on how I built Homeland Shrimp.

For example, my heat exchange system doesn’t just work when the power’s on. It holds temperature using passive design principles and thermal insulation. My oxygenation system doesn’t rely on one pump; redundant systems keep water flowing even if something breaks.

This mindset — learned from years of dealing with flaky clusters — is helping me build a shrimp farm that can survive a Minnesota winter power outage or a failed valve without total catastrophe. Resilience isn’t optional when your uptime includes living creatures.

GitOps and Aquaculture: It’s All About the Source of Truth

One of my favorite DevOps concepts is GitOps — managing your infrastructure declaratively, with Git as the source of truth. If your live system drifts from that declared state, the controller steps in and brings it back into sync.

I’ve applied this thinking to how I document, track, and manage the operation of my tanks.

Every feed schedule, filter clean, water change, and sensor calibration gets logged. I track what works, what doesn’t, and version-control key environmental parameters. When something goes wrong, I can roll back to a known-good configuration — just like a Kubernetes deployment.
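
Here is a toy version of that idea: an append-only history of tank parameters with a diff and a rollback. The ParameterHistory class, the parameter names, and the numbers are made up for this sketch; my real records are not structured exactly like this.

```python
class ParameterHistory:
    """Append-only history of settings with diff and rollback."""

    def __init__(self, initial):
        self._versions = [dict(initial)]

    @property
    def current(self):
        return dict(self._versions[-1])

    def commit(self, **changes):
        """Record a new version and return its index."""
        self._versions.append({**self._versions[-1], **changes})
        return len(self._versions) - 1

    def diff(self, old, new):
        """Map each changed key to its (old, new) values."""
        a, b = self._versions[old], self._versions[new]
        return {k: (a.get(k), b.get(k))
                for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}

    def rollback(self, version):
        # A rollback is itself a new commit, so history is never rewritten.
        self._versions.append(dict(self._versions[version]))
        return self.current


# Placeholder values for illustration only.
history = ParameterHistory({"temperature_c": 28.0, "feed_g_per_day": 120})
changed = history.commit(feed_g_per_day=150)
print(history.diff(0, changed))
print(history.rollback(0))
```

Recording the rollback as a new version keeps the full trail of what changed and when, which is the part that matters when diagnosing a problem afterwards.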

It’s not flashy, but it gives me confidence. When you’re dealing with biological systems, knowing what changed — and when — is the difference between a hiccup and a disaster.

Automating the Mundane, Prioritizing the Critical

In DevOps, we automate the repetitive so we can focus on the strategic. Jenkins, ArgoCD, and Terraform run our routines so we can think about system design and scale.

In my shrimp farm, I’ve automated lighting, oxygen cycles, feeding routines, and temperature adjustments. Not because I’m lazy — but because it frees me up to focus on long-term system health, growth rates, and infrastructure improvements.

Every minute I save not having to manually flip a switch is a minute I can use to optimize a filter design or improve shrimp welfare. Automation, whether digital or mechanical, gives me leverage.

Why This Matters: Sustainability and Scalability

At its core, this fusion of DevOps and aquaculture isn’t just a fun side project — it’s a model for sustainable, scalable food production.

Raising shrimp indoors with closed-loop systems and precision control allows us to:

- Reduce water waste
- Avoid harmful chemicals
- Lower energy usage
- Produce clean, local protein with low environmental impact

And by borrowing from DevOps — monitoring, automation, resilience — we can scale this model without compromising on quality or ethics.

Build Systems That Take Care of Themselves

The biggest lesson I’ve learned from both Kubernetes and crustaceans is this:

If you build it right, it takes care of itself.

Whether it’s a cluster or a tank, a system should be observable, resilient, and recoverable. The more you listen to its feedback and respect its complexity, the better your outcomes will be.

Engineering isn’t just about servers or code. It’s a mindset — and when you bring that mindset to the physical world, incredible things can happen.

From YAML to oxygenation loops, from rollbacks to water changes — it’s all just infrastructure. And I’ve never been more excited to build it.
