I stared at our February 2026 AWS bill and realised we were paying a massive premium just to move our own data. The tipping point for our infrastructure team was the realisation that executing a dedicated Enterprise Cloud Repatriation Strategy was no longer optional.
When you evaluate colocation vs public cloud strictly on bandwidth, the public cloud model completely collapses under high-throughput workloads.
We initiated a hard bare-metal migration, ripped our clusters out of EKS, and built Kubernetes on-premise. We cut our monthly infrastructure spend by 60%, but the transition process broke almost every deployment pipeline we had.
The Breaking Point: AWS Egress Fees
Industry consensus holds that the cloud is cheaper due to elastic auto-scaling. That is a lie for steady-state, predictable enterprise workloads.
AWS egress fees run $0.09 per GB for the first 10 TB outbound to the internet each month, which works out to roughly $90 per TB. If you push petabytes of video or machine learning inference data, those AWS egress fees will consume your profit margins entirely.
Worse, you are billed for internal routing. Traffic crossing Availability Zones (AZs) incurs per-GB data transfer charges in each direction, and pushing traffic through a NAT Gateway adds hourly charges plus per-GB processing fees, meaning you pay just to route internal traffic to the outside world.
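To make those charges concrete, here is a minimal sketch of a monthly transfer-cost model. The rates are assumptions based on commonly published us-east-1 pricing at the time of writing; they vary by region and volume tier, so plug in your own numbers.

```python
# Rough monthly cost model for AWS data-transfer charges.
# All rates below are assumptions; check your region's current pricing.

EGRESS_PER_GB = 0.09     # internet egress, first 10 TB tier
CROSS_AZ_PER_GB = 0.02   # assumed $0.01 in + $0.01 out across AZs
NAT_HOURLY = 0.045       # per NAT Gateway per hour (assumed)
NAT_PER_GB = 0.045       # per GB processed by the NAT Gateway (assumed)

def monthly_transfer_cost(egress_gb, cross_az_gb, nat_gb,
                          nat_gateways=1, hours=730):
    """Estimate the data-transfer portion of a monthly AWS bill."""
    egress = egress_gb * EGRESS_PER_GB
    cross_az = cross_az_gb * CROSS_AZ_PER_GB
    nat = nat_gateways * hours * NAT_HOURLY + nat_gb * NAT_PER_GB
    return round(egress + cross_az + nat, 2)

# Example: 50 TB egress, 20 TB cross-AZ, 10 TB through a NAT Gateway.
print(monthly_transfer_cost(50_000, 20_000, 10_000))
```

Even at this moderate scale, the transfer line alone lands in the thousands of dollars per month before a single vCPU is billed.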
The Hidden Costs of Cloud Networking
Infrastructure teams that have executed high-traffic cloud exits, most famously the 37signals repatriation that saved millions annually, have shown that moving to alternative transit providers or bare-metal racks dramatically reduces bandwidth costs. However, the only way to eliminate metered billing completely is to lease unmetered 10G drops in a private rack.
You must run a strict network traffic audit before pulling the plug on your current environment. Export your VPC Flow Logs to a local analytics tool and calculate your exact outbound bandwidth.
If data transfer makes up more than 30% of your total monthly infrastructure bill, repatriation is financially mandatory.
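The audit step above can be sketched in a few lines. This assumes the default VPC Flow Logs version 2 record format (space-separated, destination address in field 4, byte counter in field 9) and a hypothetical VPC CIDR; adjust both for your environment.

```python
import ipaddress

# Assumed VPC range; replace with your own CIDR.
VPC_CIDR = ipaddress.ip_network("172.31.0.0/16")

def outbound_bytes(flow_log_lines):
    """Sum bytes for flows whose destination lies outside the VPC.

    Assumes the default VPC Flow Logs v2 field layout:
    version account eni src dst srcport dstport proto packets bytes ...
    """
    total = 0
    for line in flow_log_lines:
        fields = line.split()
        dst = ipaddress.ip_address(fields[4])
        if dst not in VPC_CIDR:
            total += int(fields[9])
    return total

sample = [
    "2 123456789010 eni-abc 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-abc 172.31.16.139 203.0.113.12 44332 443 6 9000 1500000 1418530010 1418530070 ACCEPT OK",
]
print(outbound_bytes(sample))  # only the second flow leaves the VPC
```

Multiply the monthly byte totals by your egress rate, divide by the total bill, and you have the 30% test in hard numbers.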
Colocation vs Public Cloud: The 2026 Financial Math
When building the financial model for colocation vs public cloud, the calculation is CapEx hardware depreciation versus OpEx billing surprises.
In a colocation facility, your costs are entirely static. You pay for power circuits, physical rack space, and a flat-rate cross-connect to a transit provider. Your servers run at 10% CPU or 100% CPU for the exact same monthly price.
The public cloud monetises every single API call, disk IOP, and packet routed. While you can attempt trimming Kubernetes cost optimization within AWS by right-sizing nodes, the larger your infrastructure grows, the more you inevitably pay for the hyperscaler’s profit margin.
CapEx Depreciation and Hardware Leasing
To win the colocation vs public cloud debate internally, model your current AWS bill against a 36-month hardware lease.
You must include the cost of two senior network engineers in the colocation column to handle the physical routing layer. You also need to budget for remote hands services at the data centre for physical drive swaps.
Even with the added payroll and hardware leasing costs, you will achieve a 60% net savings over a three-year depreciation cycle if your workload is compute and network-heavy.
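The three-year model can be sketched as below. Every figure here is an illustrative assumption, not our actual numbers; substitute your own AWS bill, lease quotes, and payroll costs.

```python
# Hedged 36-month TCO comparison for the internal debate.
# All monthly figures are illustrative assumptions.

MONTHS = 36
cloud_monthly = 155_000          # assumed current AWS bill
colo_monthly = 18_000            # power, rack space, cross-connects
lease_monthly = 14_000           # 36-month hardware lease
engineers_monthly = 2 * 14_000   # two senior network engineers
remote_hands_monthly = 2_000     # budgeted data-centre remote hands

colo_total = MONTHS * (colo_monthly + lease_monthly
                       + engineers_monthly + remote_hands_monthly)
cloud_total = MONTHS * cloud_monthly
savings = 1 - colo_total / cloud_total
print(f"net savings over {MONTHS} months: {savings:.0%}")
```

With these assumed inputs the model lands at roughly 60% net savings; the point of the exercise is that the colocation column stays flat while the cloud column scales with traffic.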
Executing the Bare-Metal Migration Architecture
You cannot execute a successful bare-metal migration by treating physical servers like EC2 instances.
In the cloud, you have an API to instantly replace a degraded motherboard. On-premise, a failed RAM stick requires a physical intervention, a node drain, and a coordinated hardware swap.
A production-grade bare-metal migration requires extreme hardware standardization. Buy identical pizza-box servers from a single vendor, equipped with identical NICs and NVMe drives.
Standardising the Provisioning Layer
If you mix and match hardware generations, your deployment automation will break continuously due to driver and kernel module mismatches.
Standardise your provisioning layer using bare-metal lifecycle tools like Canonical MAAS or Matchbox.
Configure your Top of Rack (ToR) switches to PXE boot your servers and automate the base OS installation. If an engineer has to plug a physical crash cart into a server to install Linux, your Enterprise Cloud Repatriation Strategy has already failed.
The Reality of Building Kubernetes On-Premise
Managed cloud services like EKS hide the operational nightmare of distributed systems control planes.
When you build Kubernetes on-premise, you are entirely responsible for etcd quorum health, API server certificate rotation, and the Container Network Interface (CNI).
If your etcd cluster loses quorum due to high disk latency, your entire orchestration layer goes read-only. We lost a cluster in staging because we put etcd on standard SSDs instead of dedicated NVMe drives.
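Before placing etcd on any disk, measure its fsync latency: etcd's write path is dominated by WAL fsync, and the usual guidance is a p99 well under 10 ms. The sketch below is a rough stdlib approximation, not a replacement for `fio` or etcd's own disk benchmarks.

```python
import os
import tempfile
import time

def fsync_latencies_ms(path: str, samples: int = 200) -> list:
    """Time small write+fsync cycles, roughly mimicking etcd WAL appends."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    latencies = []
    try:
        for _ in range(samples):
            os.write(fd, b"x" * 512)                  # small WAL-sized write
            start = time.perf_counter()
            os.fsync(fd)                              # force to stable storage
            latencies.append((time.perf_counter() - start) * 1000)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies

with tempfile.TemporaryDirectory() as d:
    lat = fsync_latencies_ms(os.path.join(d, "wal-test"))
    lat.sort()
    p99 = lat[int(len(lat) * 0.99) - 1]
    print(f"p99 fsync latency: {p99:.2f} ms")
```

Run this against the actual device that will host the etcd data directory; a standard SSD that looks fast on sequential reads can still post fsync latencies that stall quorum.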
Replacing Managed CNI with Cilium and BGP
Do not attempt to use Flannel or standard Layer 2 networking for a production Kubernetes on-premise cluster. Layer 2 broadcast domains are strictly for lab environments and create massive spanning tree protocol (STP) issues at scale.
Deploy Cilium as your CNI to leverage eBPF for high-performance packet routing directly in the Linux kernel.
Configure Border Gateway Protocol (BGP) within Cilium to peer directly with your Arista or Mellanox ToR switches. This guarantees the fault tolerance and Equal-Cost Multi-Path (ECMP) load balancing required to match public cloud reliability.
Replacing S3: The NVMe Object Storage Dilemma
You cannot run a modern stateless application without highly available object storage. Moving away from S3 means you must build your own resilient storage cluster.
This was the most painful phase of our migration. We initially attempted to deploy a dedicated Ceph cluster using BlueStore backed entirely by NVMe drives to replicate S3 speeds.
Ceph is notoriously hostile to configure. Calculating Placement Groups (PGs) incorrectly during cluster initialization permanently bottlenecked our storage throughput, forcing us to rebuild the cluster from scratch.
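The community rule of thumb we should have applied up front: target roughly 100 PGs per OSD, divide by the replication factor, and round up to a power of two. Verify the result against the Ceph PG calculator or your version's autoscaler before initializing the pool.

```python
# Ceph placement group sizing, per the common community rule of thumb.
# Always cross-check against your Ceph release's pg-autoscaler.

def placement_groups(num_osds: int, replicas: int,
                     pgs_per_osd: int = 100) -> int:
    """Target PG count: (OSDs * 100) / replicas, rounded up to a power of two."""
    target = (num_osds * pgs_per_osd) / replicas
    power = 1
    while power < target:
        power *= 2
    return power

print(placement_groups(num_osds=24, replicas=3))  # → 1024
```

Getting this wrong at initialization is exactly what bottlenecked our throughput; raising PG counts later triggers expensive rebalancing.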
The MinIO Alternative for Object Storage
If you only need S3 API compatibility without the massive operational overhead of Ceph block storage, deploy MinIO instead.
MinIO runs well in a clustered deployment, but sustaining S3-class throughput requires fast NVMe storage and high-bandwidth (ideally 100GbE) networking, because erasure coding is CPU- and network-intensive.
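Erasure coding also changes your capacity planning: with an erasure set of N drives and M parity shards (EC:M), only the data shards' share of raw capacity is usable. The drive figures below are illustrative.

```python
# MinIO erasure coding capacity sketch: usable capacity is the data
# shards' fraction of raw capacity. Drive counts/sizes are illustrative.

def usable_tb(drives: int, parity: int, tb_per_drive: float) -> float:
    """Usable TB for one erasure set of `drives` with `parity` parity shards."""
    data_shards = drives - parity
    return round(drives * tb_per_drive * (data_shards / drives), 2)

# 16 x 7.68 TB NVMe drives with EC:4 parity -> 75% usable.
print(usable_tb(drives=16, parity=4, tb_per_drive=7.68))
```

Budget raw capacity accordingly: an EC:4 set of 16 drives gives you three quarters of what you racked.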
It provides strict S3 API compatibility, meaning you retain critical enterprise features, such as immutable storage for ransomware protection, without rewriting your application code. You simply change the endpoint URL and the access keys. This approach bypasses massive AWS egress fees during the final data sync because you can pull directly into the MinIO buckets over a direct connect.
Database Evacuation: Leaving Amazon RDS
Leaving Amazon RDS means you no longer have automated point-in-time recovery or push-button multi-AZ failover.
For our database bare-metal migration, we abandoned managed Postgres and built our own highly available clusters using Patroni and HAProxy. However, HAProxy TCP routing is only half the battle; if your application relies on fragile connection handling instead of a database access layer with proper retry and exception logic, your frontend will still crash during a Patroni leader election.
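The application-side half looks something like the sketch below: during a leader election, connections through HAProxy fail for a few seconds, so connection attempts need bounded exponential backoff with jitter instead of crashing the frontend. This is a generic sketch, not our production code.

```python
import random
import time

def with_retries(connect, attempts: int = 5, base_delay: float = 0.2):
    """Call `connect` until it succeeds or attempts are exhausted.

    Intended for transient failures such as a Patroni leader election
    briefly leaving no writable backend behind HAProxy.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # Exponential backoff with jitter, capped at ~5 seconds.
            time.sleep(min(5.0, base_delay * 2 ** attempt) * random.random())

# Simulated failover: the first two attempts fail, the third succeeds.
state = {"calls": 0}
def flaky_connect():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("leader election in progress")
    return "connected"

result = with_retries(flaky_connect)
print(result)  # → connected
```

The same pattern applies regardless of language or driver; the key is that a few seconds of failed connections must surface as retries, not as user-facing errors.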
Preventing Postgres Split-Brain
Do not run the Patroni consensus store on the same physical disks as your database engine.
We deployed a dedicated, three-node etcd cluster solely for Patroni leader elections and state management.
We automated our backups using pgBackRest, streaming continuous WAL archives directly to our internal MinIO cluster. You must test your restore procedures manually; a backup script is worthless if you have never proven you can restore the WAL files under pressure.
Escaping AWS IAM: The Identity Problem
AWS IAM is deeply integrated into every cloud-native application. The irony of repatriation is that safely decoupling these proprietary hooks requires deep AWS expertise; engineers who do not understand the cloud will break the application trying to leave it.
Your services likely assume IAM roles via the AWS metadata service to access buckets, queues, and databases.
We stripped out all AWS IAM SDKs from our application code and moved our authentication layer entirely to standard OpenID Connect (OIDC).
Deploying Keycloak for Centralised Auth
We deployed Keycloak on-premises, backed by an internal LDAP directory, to securely handle service-to-service authentication.
We issued short-lived JSON Web Tokens (JWTs) for service accounts and forced all internal microservices to validate tokens against the Keycloak JWKS endpoint.
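For illustration only: our services verify RS256 tokens against Keycloak's JWKS endpoint, typically through a JOSE library. The stdlib sketch below shows the same structural check (header.payload.signature) using HS256 so it runs without dependencies; do not treat it as a production JWT implementation.

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    """Base64url without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_hs256(payload: dict, secret: bytes) -> str:
    """Build a minimal HS256 JWT: header.payload.signature."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{b64url(sig)}"

def verify_hs256(token: str, secret: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    header, body, sig = token.split(".")
    expected = hmac.new(secret, f"{header}.{body}".encode(),
                        hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected), sig)

token = sign_hs256({"sub": "service-a", "aud": "internal"}, b"shared-secret")
print(verify_hs256(token, b"shared-secret"))  # → True
```

In the real deployment the verification key comes from the JWKS endpoint and rotates with Keycloak's realm keys, which is exactly what removes the dependency on any single shared secret.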
This completely removed vendor lock-in. Our applications now verify identity cryptographically rather than relying on an Amazon-controlled API endpoint that we no longer have access to.
Adapting the CI/CD Pipeline for Physical Hardware
Your deployment pipelines will fail the moment you cut the cord to AWS CodePipeline and ECR. To survive the transition, you must rebuild your CI/CD tooling on infrastructure you control.
We stood up our own GitLab runners directly on the new hardware so that build jobs had fast access to the internal network.
Pushing container images across the public internet to Docker Hub during a deployment will quickly throttle your network and trigger Docker’s strict rate limits.
The Local Container Registry Bottleneck
To solve the rate-limit issue, we deployed Harbor as our internal container registry.
Harbor acts as a pull-through cache for external Docker images and stores our proprietary build artifacts locally.
When a Kubernetes node pulls a new image during a deployment, it pulls it over the internal 100GbE network rather than traversing the internet, reducing pod startup times from 45 seconds to under 3 seconds.
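A back-of-envelope check makes the registry win obvious. The throughput and efficiency figures below are illustrative assumptions, not measurements.

```python
# Rough transfer-time comparison: pulling an image over a WAN path
# versus the internal 100GbE link to Harbor. Figures are illustrative.

def transfer_seconds(image_mb: float, throughput_gbps: float,
                     efficiency: float = 0.7) -> float:
    """Seconds to move an image at a given link speed and assumed efficiency."""
    bits = image_mb * 8 * 1e6
    return bits / (throughput_gbps * 1e9 * efficiency)

internet = transfer_seconds(1500, 0.3)   # assumed ~300 Mbit/s WAN path
local = transfer_seconds(1500, 100)      # 100GbE to the internal cache
print(f"internet: {internet:.1f}s, local: {local:.2f}s")
```

The two orders of magnitude between those numbers is what turns a 45-second pod startup into a 3-second one.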
Essential Tooling Stack for Your Migration
A successful Enterprise Cloud Repatriation Strategy requires entirely replacing the managed services you relied on. We finalised our architecture using these specific open-source tools:
- Bare-Metal Provisioning: Canonical MAAS and Matchbox for PXE booting and automated OS installation.
- Orchestration: Kubernetes on-premise, installed via Kubespray or Kubeadm.
- Networking: Cilium for eBPF routing and MetalLB for BGP IP broadcasting.
- Storage: MinIO for S3-compatible object storage and Longhorn for block volumes.
- Database HA: Patroni paired with HAProxy and etcd for Postgres leader election.
- Identity: Keycloak for OIDC and JWT-based service account authentication.
Our Biggest Bottleneck: The Ingress Routing Nightmare
The hardest part of building Kubernetes on-premises is losing the AWS Application Load Balancer.
In a public cloud, deploying a LoadBalancer service automatically provisions external routing and attaches a public IP.
During our migration, that exact same YAML manifest sat in a “Pending” state forever because there was no cloud provider controller to fulfil the request. I wasted four days trying to hack together NodePort services and NGINX host ports before accepting that we needed a dedicated bare-metal load balancer.
I had to deploy MetalLB and configure it to broadcast our public IP space via BGP. If you do not configure your ToR switches to accept BGP advertisements from your worker nodes, your cluster will remain isolated from the internet.
Below is the exact MetalLB configuration we used to fix the routing black hole and force the switches to route inbound traffic to our ingress controllers:
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: tor-switch-peer
  namespace: metallb-system
spec:
  myASN: 64512
  peerASN: 64512
  peerAddress: 10.0.0.1
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: public-ip-pool
  namespace: metallb-system
spec:
  addresses:
  - 198.51.100.10-198.51.100.50
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: public-bgp-adv
  namespace: metallb-system
spec:
  ipAddressPools:
  - public-ip-pool
Once applied, the ToR switches registered the BGP routes, ECMP load balancing engaged across the physical ports, and our ingress controllers successfully picked up the traffic. This was the final step that made our Enterprise Cloud Repatriation Strategy a reality in production, ending our reliance on AWS egress fees once and for all.
Before You Pull the Plug on AWS
A successful bare-metal migration requires accepting that you no longer have a cloud provider to fix your infrastructure. If you cannot check these five boxes, do not terminate your AWS workloads:
- Audit Your Bandwidth: Export VPC Flow Logs and confirm that data transfer (egress and cross-AZ) accounts for at least 30% of your bill.
- Standardise Hardware: Buy identical pizza-box servers and NVMe drives. Mixed hardware generations will break your bare-metal provisioning automation.
- Solve the Storage Dilemma: Deploy MinIO for S3-compatible object storage if your team lacks the deep operational expertise required to maintain Ceph.
- Automate Database Failovers: Run your Postgres consensus store (etcd) on dedicated, isolated disks to prevent split-brain scenarios during network partitions.
- Control Your Ingress: Configure BGP peering between your CNI (Cilium) and your Top-of-Rack switches to ensure ECMP load balancing before routing public traffic.