Interesting how they have kept their ops team the same but now run an entire datacentre.
Overworked teams? I just can’t see how this is possible.
Not defending cloud hosting/costs etc. You generally pay more for cloud to then not have to deal with hardware maintenance, datacentre management. I didn’t see this directly in their post. Other than keeping the same size Ops team
I’m running both physical hardware and cloud stuff for different customers. The problem with maintaining physical hardware is getting a team of people with relevant skills together, not the actual work - the effort is small enough that you can’t justify hiring a dedicated network guy, for example, and same applies for other specialities, so you need people capable of debugging and maintaining a wide variety of things.
Getting those always was difficult - and (partially thanks to the cloud stuff) it has become even more difficult by now.
The actual overhead - even when you’re racking the stuff yourself - is minimal. “Put the server in the rack and cable it up” is not hard - my last rack was filled by a high school student in a part of an afternoon, after explaining once how to cable and label everything. I didn’t need to correct anything - which is a better result than many highly paid people I’ve worked with…
So paying for remote hands in the DC, or - if you’re big enough - just order complete racks with racked and pre-cabled servers gets rid of the “put the hardware in”.
Next step is firmware patching and bootstrapping - that happens automatically via network boot. After that it’s provisioning the containers/VMs to run on there - which at this stage isn’t different from how you’d provision it in the cloud.
You do have some minor overhead for hardware monitoring - but you hopefully have some monitoring solution anyway, so adding hardware, and maybe have the DC guys walk past and inform you of any red LEDs isn’t much of an overhead. If hardware fails you can just fail over to a different system - the cost difference to cloud is so big that just having those spare systems is worth it.
I’m not at all surprised by those numbers - about two years ago somebody was considering moving our stuff into the cloud, and asked us to do some math. We’d have ended up paying roughly our yearly hardware budget (including the hours spent on working with hardware we wouldn’t have with a cloud) to host a single of one of our largest servers in the cloud - and we’d have to pay that every year again, while with our own hardware and proper maintenance planned we can let old servers we paid for years ago slowly age out naturally.
Thank you for the very detailed response!
They’re using a third party called deft to manage the hardware. Which is a reasonable middleground between cloud and self-operated, the more I think about it.
I haven’t seen a lot of info on what the cost of that management is though but it’s likely to be leagues less than AWS/GCP
It’s not just the hardware. “The cloud is expensive” is usually touted by people not understanding why managed services (like Aurora RDS and OpenSearch as suggested in the article) ‘cost more than running it themselves’ by not accounting the management costs.
A database service needs management not only in hardware (I.e. replace dead drives) but also in software (I.e. monitor cluster performance, tweak system settings to fit usage pattern, manage cluster health, etc etc). These management requires time from the ops team, often in multiple roles like SysAdmin, DBA, and Ops engineers. Fact that they claim to have moved to their own hardware without being on new talents to their ops team makes it questionable as to whether or not they actually understand the cost and If they’re overworking their existing ops team.
Or it could be that they haven’t run into problems yet. If you overbuild your hardware or your software is efficient enough, you don’t need as much tweaking.
It’s questionable, but I don’t think implausible.
“yet” is the keyword there for sure. It’s not a matter of if, but a matter of when. Even if they’re flushed with cash and grossly over provision their systems, sooner or later, a huge vulnerability will roll around and someone will need to setup / update the OS, ensuring quorum is available for their cluster, fail over traffic during update windows, etc etc etc.
The stacks are getting so insurmountably huge, it’s not possible to just drop a new cluster at their described scale without significantly increasing the workload for an existing team.
Yup. By moving out, they already let go of a lot of security services that came with their cloud subscription like CASB, automated patching, DB maintenance, security/network monitoring, etc. You have to replace all of that with people and on-prem tools/systems.
Warning. This site claims you’ve been blocked and asks for your email to verify you. Do not provide it. Reloaded and it worked. Just be safe out there
This didn’t happen for me
Nor i
That isn’t happening for me, nor has it ever when I’ve visited DHH’s blog. It’s possible your browser is compromised.
I have strong privacy settings enabled. I believe it might be because they can’t fingerprint me or similar, so are checking for bot activity
That seems extremely unlikely, and almost unheard of. If I wget the page I’m a container, I get the same as in browser, so that would suggest this isn’t the case.
“An entire data center” is 8 rented racks in two enterprise data centers (4 racks in each). They’re paying $60K/month for racks, cooling, and location.
That’s the thing, ‘cloud’ is just another tool in your toolbox. It’s the right tool for some workloads and the wrong one for others. The fact they’ve shifted the work to their own servers and kept the ops team suggests it was the wrong sort of workload to be in the cloud in the first place.
For a while there was an obsession with moving everything to the cloud, and that was always going to be an expensive mistake in a number of different ways. Hopefully, as the hype dies down more nuanced decisions will be made. There’s a whole gamut of options between all in the cloud and all in the data centre, and when people jump straight from one end to the other I’m put in mind of Hamlet’s quote “There are more things in heaven and earth, Horatio, / Than are dreamt of in your philosophy.” Understand your workload, understand your business’ future plans and their needs, and then make a plan, considering all the tools at your disposal.
If there’s anything that 3 decades in Tech have taught me is that fad-following commonly rules it, even with the supposedly logical (but not really) techies.
Cloud storage and cloud computing became a fad about a decade ago (I still remember the hype repeated by people who had never actually designed distruted systems) so there were tons of people jumping headfirst without a plan into it for the hype and the seemingly cheaper price (if you didn’t think your needs and future evolution through) even though it wasn’t the best choice for them.
No doubt well see the same kind of fad-following over making-sense-for-us thing with the latest hype-train: AI.
I hate the obsession to move to the cloud and the obsession towards serverless or functions.
Functions are stupid and crazy for anything that is actually used often.
For small utilities, they make a ton of sense, but next time I see an app with millions of requests per day using functions, I’m going to lose my mind.
[This comment has been deleted by an automated system]
Years ago I was the senior techie in designing and implementing distributed high performance server systems and what you reminded me of just made my blood start to boil… :/
What always kept me off the “cloud” (other people’s computers) is not only giving up my data but giving up control on what I spend. Corporations lure you in with flashy promises and low prices, then usually over time the service gets worse the prices go higher and higher. I’m sure the cloud hosting corporations are good at pricing their services very high but not quite high enough to make most customers cancel.
Lock-in is quite an old strategy in Tech (back in the day Microsoft’s dominance was built on it) and apparently every new generation needs to learn their lesson…
That’s true, back in the 1970s and 1980s IBM locked companies in with mainframes and PCs were their way out.
deleted by creator
Exiting cloud being useful seems to be a very narrow use case.
For one, you have to be at a large enough scale where buying and hosting your own infra is feasible and cheaper.
Second, you have to give up the ability to almost instantly scale up or provision hardware in response to traffic or other events. (which is very common at scale)
Maybe his use case happens to be that very narrow case, but this isn’t something I would take as general advice.
[This comment has been deleted by an automated system]
Your last paragraph is why we’ve heavily used the cloud here in rural Canada for years.
Monitoring data is much easier to push into the cloud and read from there than it is hope for a reliable connection to a farm or rural plant.
Self-hosted services need to be cloud hosted for uptime and because it was getting ever harder to get a routed IPv4 address from any provider. IPv6 is nice to finally have, but Starlink is the only provider at all supporting it and it’s only been a few months at that. Their prefixes change constantly too, come on guys get your shit together.
Even basic remote access systems require a VPS or VPN cloud service as you always need both ends to punch out through layers of CGNAT. Now we can finally have one end available through IPv6 but the remote user is often trying to use a IPv4 CGNAT network to connect… So you still need something in the cloud to punch holes.
Can’t believe it’s been over 20 years for the IPv6 rollout
[This comment has been deleted by an automated system]
I’m still trying to figure out how to use Docker with an unstable prefix (hey Docker, this is as much your problem as the ISPs, honestly) as any of the v6NAT solutions I’ve found that enable the same full containerization available on IPv4 all require you feed the Docker daemon a fixed prefix on startup. Frustrating.
I’m also tired of reading posts about v6NAT being irrelevant because half of the point of containers is the interchangeability, Docker containers aren’t supposed to be routable unless you intentionally put them on the host network! Docker just needs to work the same on v4 and v6!
Tor as a hole puncher is an intriguing idea but I don’t think I would use it for something customer facing… Too many moving parts. We like to use Wireguard and a tiny cloud VPS instance when someone needs to punch into an unreliable network around here.
[This comment has been deleted by an automated system]
DHH is a contrarian. Any benefits of the cloud he might get are overridden by the fact that he needs to be different (and blog about it).
See his stances on Typescript, workplace inclusion, TDD, etc.
This is quite intriguing. But DHH has left so many details out (at least in that post) as pointed out by @[email protected] - it makes it difficult to relate to.
On the other hand, like DHH said, one’s mileage may vary: it’s, in many ways, a case-by-case analysis that companies should do.
I know many businesses shrink the OPs team and hire less experienced OPs people to save $$$. But just to forward those saved $$$ to cloud providers. I can only assume DDH’s team is comprised of a bunch of experienced well-payed OPs people who can pull such feats off.
Nonetheless, looking forward to, hopefully, a follow up post that lays out some more details. Pray share if you come across it 🙏
This is part of a series of posts he has done about find out his cloud bill was stupid high because they do computationally heavy software and switching over to collocation. But the whole going from 100% cloud to colo and saving that much money is not to be scoffed at.
He does say this is an outlier and others won’t get as much roi as they have.
there are a number of blog posts that have different details about the how/why, etc. i just followed the links in the article to other parts of the series.
I expect that the use case is more prevalent than you think, where you are spending a decent chunk on cloud infra. I have been convinced for some time now that the costs are high compared to our on-prem. I really like the idea of a the “deft” type hardware management service, so that look after the DCs, hardware and connectivity, and we look after the software.
Hopefully, they place their servers at 2x the historical peak floodpoint. Or set up standby zones in different geographies in case there’s a power or network outage.
Came upon several projects where folks hadn’t…
Having your compute in “the cloud” doesn’t remove the need for a good backup strategy, it just changes how it works. Yes, disaster recover for natural disasters should be easier (OHV’s fire showed that this may not always be true). But, that doesn’t cover cases like ransomware, insider threats, data mistakes or any other case where data is corrupted/modified by mistake. You still need a plan for these cases. And cloud based backups actually make a lot of sense.
But, just because you put your backups in the cloud, doesn’t mean that your compute should be there as well. There is an advantage that your Time to Recovery is likely lower with both backups and compute in the same cloud. But, is that worth the ongoing cost of running your compute in the cloud? That needs to be considered separately. You also need to consider the cost of running on-prem versus in the cloud. If you have fairly predictable, static loads, it may be cheaper to buy and run servers yourself. For hard to predict, elastic loads, cloud may make more financial sense.
As others have said before, there was a period where companies were just going to the cloud for the sole reason that it was the popular thing to do. For some it actually made financial sense. For some, it didn’t. The OP’s article seems to be the latter.
Exactly. Use cloud for off-site backup and things that need flexibility.
You don’t need any of that to run a basic website. You can almost use an old laptop or PC for most static applications.
The cloud isn’t just for storage or compute. There are a number of managed services that let you build a full application by snapping together lego building blocks.
For example, pop together a REST API handler, an auth service, a few functions-as-a-service, a database, and a storage service. Then add a static website server. Throw a CDN in front. You got yourself a dynamic application service that can be accessed globally for a few pennies and can scale up and down without you doing anything. Add multi-zone support and auto-DNS failover and you’ve got a production quality scalable, resilient back-end, for both web and mobile. When it’s not being used, it costs very little and when it goes big, hopefully it means you’re doing well. Wrap it all in an infrastructures-as-code script and you can bring all this up in 30m.
To host all that in-house, you would have to buy a lot of equipment, stage it, manage it, add cooling, electricity, security patches, upgrades, security, etc. Now you have part of your business just doing all this instead of focusing on what you do best. I won’t bother going into the tax implications of capex vs opex.
This, is what the cloud sales people call ‘undifferentiated heavy lifting.’ There are reasons to have on-prem hardware. For a lot of applications though, it makes more sense to let someone else take care of all that infrastructure cruft.
So how then people using this *miraculous and incredibly safe * (/s) cloud lost their data in OVH datacenter fire?
They used the cheap option without geographic mirrors.
So you say that if you don’t make an additional investment in backup infrastructure your data is at risk… Sounds pretty similar to self-hosting, doesn’t it?
More like the “cloud” provider should have multiple locations and redundancy in place.
That will also depend on if you include that in your subscription and pay for it. Some plans exclude that in the cheaper tiers if I remember correctly
Shocking, right? (/s) You don’t get what you do not pay for. OVH also offers private cloud hosting, basically managed servers in a cloud setup and normal hosting options. I have no idea, what the datacenter was primarily used for.
That was a data center, not a cloud. The sort of place they are moving to from the cloud.
With a cloud solution, you make sure to use services that are redundant. AWS and Azure build each region (geographical location) with **multiple **interconnected independent data centers (availability zones). High durability is one of the strong use cases for public clouds.
But it’s the cloud! It can’t ever go down!
oh god, its dhh
never heard of him before. what about him?
I’ve been saying this for a long time.
There are use cases for the cloud. I put e-mail in the cloud- ain’t nobody got time to deal with providing reliable SMTP or Exchange while keeping spam out. If you have a web app that needs to scale quickly, cloud’s the way. If you’re a startup with limited capital and you don’t want to blow it on a bunch of servers when you’re not sure if you’ll survive more than a year or so, cloud’s the way.
But Cloud ISN’T the end-all answer for everything.
If you have a predictable workload, especially one that relies on more expensive cloud services, de-clouding can save you a bundle. Buying hardware can be cheaper than renting it, if only because (think about it) the cloud provider has to buy the same hardware and rent it to you AND make a profit. If you’re going to be around a while, and you expect to use a piece of hardware for its full service life, that makes a lot of sense.
Yeah ok well when you get ransomware’d you’re going to wish you had Cloud backups.
Ask me how I know
There are also many organizations that wish they has some local backups after their cloud service providers lost all their data. Lesson to learn: Backup properly with offline storage. Tape in a safe, maybe even off-site, etc.
So, what you’re saying is that, regardless of where you run your workloads, you should still follow the 3-2-1 rule?
3 - copies of the data. 2 - different media. 1 - offsite.
It’s funny how cloud doesn’t really change the basics of good systems administration.
Almost like a responsible modern day approach is multifaceted
How do you know?
An earwig told me
As long as you realize that the “cloud” is someone else’s computer, it is a very viable way of hosting your service. However as your service grows all those micro services that your cloud provider charges you for will grow as well. Eventually you’ll get to the point where “data transfer” costs begins to make up >50% of your total cloud spend. At that point (or ideally before) you should have a plan to stop expanding your cloud footprint, because that cost grows geometrically with the size of your cloud data and the number of cloud functions you are using on your data.
Remember Data has Weight. If you don’t understand what that means, you aren’t ready to make a cost comparison between cloud-hosting and data center hosting.
That’s the thing, ‘cloud’ is just another tool in your toolbox. It’s the right tool for some workloads and the wrong one for others. The fact they’ve shifted the work to their own servers and kept the ops team suggests it was the wrong sort of workload to be in the cloud in the first place.
For a while there was an obsession with moving everything to the cloud, and that was always going to be an expensive mistake in a number of different ways. Hopefully, as the hype dies down more nuanced decisions will be made. There’s a whole gamut of options between all in the cloud and all in the data centre, and when people jump straight from one end to the other I’m put in mind of Hamlet’s quote “There are more things in heaven and earth, Horatio, / Than are dreamt of in your philosophy.” Understand your workload, understand your business’ future plans and their needs, and then make a plan, considering all the tools at your disposal.
Lmaoooooo anyone doing price comparison without labor should be immediately ignored as ignorant and with an agenda.
Read the article.
Relevant passage: “While there are some additional other costs associated with the extra servers, it’s relative peanuts in the grand scheme (our ops team stayed the same, for example)” (emphasys mine)
Hand waving. The work is somewhere.
Either the team had free time and weren’t being used to capacity, employees aren’t doing continuing education during work anymore, or they are being overworked. You don’t just magically take on scope and not have labor hours shift at a bare minimum.
I refuse to take anyone at face value when they say they are using physical hardware, setting up automation, and running support on all of that and the labor hours are described as peanuts.
I also wonder what they are doing for security and data privacy.