Cutting costs. These two words often mean reducing performance or sacrificing features – but it doesn’t have to be that way! Managed correctly, a crisis can be an opportunity to drive change, and the results can be startling – in a good way!
This is the story of how VanHack’s tech team managed to reduce our cloud computing costs by close to 50% – while increasing performance by 25%!
As you read this, you’re probably wondering what we aren’t telling you – what’s the downside? Yes, of course there’s something we had to give up to get these numbers.
Our Recovery Time Objective has more than doubled. That sounds bad, but in absolute terms, RTO has gone up from ~8 min to ~20 min.
Given the nature of our business (we don’t process thousands of transactions a second) and the remote possibility of true infrastructure downtime, we considered this an acceptable trade-off. In over two years, we’ve had exactly one infrastructure outage – when Azure’s South Central US region was struck by lightning – and we implemented multi-region failover after that.
Our Recovery Point Objective remains exactly the same. We have not compromised on data recovery in any way since our data is business-critical.
Read on to find out how we did it, and what we learned along the way.
Turning on a Dime
The month of March started out like any other for VanHack’s tech team.
Deploying code, working through the backlog of feature requests and bugfixes, hunting down edge cases and odd bugs, optimizing and refactoring code… the day-to-day tasks of running a SaaS/cloud app based business.
You know what happened next.
On Monday, March 14th, our CTO assigned the job of optimizing our infrastructure to two members of the tech team – Robson Paproski and Anybal da Silva. Both are Senior Full-Stack Developers with overlapping, complementary skill sets – Robson is our cloud infrastructure guru, and Anybal is our performance-testing maven.
TIP: Decisions of this nature need to be made quickly, not by committee. Assign the task to trusted team members, keep the team to only the most essential people, and let them get on with it. We didn’t assign a front-end dev, for example.
VanHack’s Infrastructure – Before
VanHack runs on Microsoft Azure.
We use a React frontend that talks to a .NET Core API backend. Application data is stored in Azure SQL, Redis caches various things, and Azure Blob Storage holds objects.
Our front-end infrastructure is based around Azure Kubernetes Service (AKS), with Rancher managing the orchestration. Rancher runs on a 3-node cluster and deploys to a 3-node AKS cluster. We also run a single-node AKS cluster for development/staging.
The .NET Core application runs on an Azure-managed Windows App Service. We also have another App Service running some legacy code written in .NET 4.5.
Besides this, we have a couple of VMs running our blog, dev/staging miscellany, load balancers, and geo-replication to another Azure region.
Overall costs were in the region of $2000-2500 a month. We have 120,000+ users, with approximately 40,000 active users in any given month. These are application users, not website hits – these people log into the platform and do stuff on the application.
Document, document, document
Robson and Anybal started by examining the infrastructure maps and diagrams. VanHack is quite disciplined about documenting every change to our infrastructure, and our application process flows are all maintained in Confluence.
Since everything was up to date, they could jump right into discussing what was critical and what changes could potentially be made without impacting production services. Within two days, they put together a document that detailed three scenarios along with the projected cost savings.
The next step was to test the scenarios to determine which one was the best.
TIP: Make sure you are documenting everything about your infrastructure. This is less of an issue with larger companies, but startups often make major changes on the fly – and if you don’t know exactly what you have running and how it all comes together, you’ll be wasting time doing network and application audits before you can make any changes.
Performance is Key
Testing the three scenarios took about 10 days. This sounds like a long time, but consider that tests had to be done without impacting production in any way.
First, the team ran synthetic tests in the dev environment to make sure the application wasn’t adversely affected. Things had to be modified based on these tests, and then the tests had to be re-run.
Once they were confident that they weren’t going to break production, they started modifying production and measuring actual user statistics. This required the ability to modify production on the fly while making sure we could roll back. As you can imagine, this was done very carefully, across different time zones.
We set up alternative versions of the infrastructure, sent traffic to them, and measured the performance impact. And then measured it again. And once more for good measure!
TIP: Make sure you can roll back any and all changes. And if you cannot roll back because you’re going to make a destructive change (this is very unusual and should be avoided), the old adage “measure twice and cut once” is extremely applicable – except read “twice” as “dozens of times”.
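As an illustration of how this kind of controlled, reversible production testing can be done on Azure (a sketch, not necessarily our exact setup – the resource and app names below are hypothetical), App Service deployment slots let you send a fraction of live traffic to an alternative configuration and pull it back with a single command:

```shell
# Create a staging slot on a hypothetical app "vanhack-api"
az webapp deployment slot create \
  --resource-group vanhack-rg --name vanhack-api --slot staging

# Route 10% of live traffic to the slot and measure real user impact
az webapp traffic-routing set \
  --resource-group vanhack-rg --name vanhack-api --distribution staging=10

# Instant roll-back: send all traffic back to production
az webapp traffic-routing clear \
  --resource-group vanhack-rg --name vanhack-api
```

The key property is that the roll-back is one non-destructive command, which is exactly what the tip above calls for.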
Serendipity is real
At the end of the performance testing, Robson and Anybal were confident that we could reduce costs by ~30% without impacting our customer experience. Performance would remain the same, though the RTO would rise.
At this point, they were close to calling it done – but looking at the infrastructure end-to-end with fresh eyes had surfaced one more idea that needed to be explored.
The initial versions of VanHack were built on .NET 4.5. After .NET Core 2.0 was released, our team decided in 2018 to move to Core for a variety of good reasons. The migration took about a year, and now only a few legacy standalone services still run on .NET 4.5.
However – we were running our .NET Core application on a Windows App Service in Azure. This was done partly out of familiarity and partly because, initially, deploying from our CI/CD pipeline was easier on Windows. We’d since fixed that, but we stayed on Windows.
And under normal circumstances, things were humming along smoothly and we didn’t really have the necessary motivation to move the service to Linux – but now we really, really did!
So, late one night, Anybal and Robson moved the application to a Linux App Service. After squashing a few bugs (don’t hard-code IP addresses, people), they ran the performance tests… and sat back, amazed!
The move from Windows to Linux increased performance by nearly 30%! And given the cost differential, the RoI was even greater.
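For anyone wanting to try the same move, the basic shape of it on Azure looks like this (a sketch with hypothetical resource names and an assumed .NET Core 2.x runtime – check `az webapp list-runtimes` for the versions available to you):

```shell
# Create a Linux App Service plan (Linux plans are billed at a lower
# rate than equivalent Windows plans)
az appservice plan create \
  --resource-group vanhack-rg --name vanhack-linux-plan \
  --is-linux --sku P1V2

# Create the web app on that plan with a .NET Core runtime,
# then deploy to it from your existing CI/CD pipeline
az webapp create \
  --resource-group vanhack-rg --plan vanhack-linux-plan \
  --name vanhack-api-linux --runtime "DOTNETCORE|2.2"
```

Note that a Windows plan cannot be converted in place – the Linux app is created alongside the old one, which also makes it easy to shift traffic gradually and keep the Windows app as a roll-back target.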
When our CTO learnt of this the next morning, his response was “Implement this NOW!”
TIP: Don’t be afraid to try something new. A crisis is the best time to try out ideas that would normally be shelved out of an abundance of caution.
Expect the unexpected – and don’t forget other SPoFs
Anyone who has done a migration in production knows all about this. No matter how meticulously you plan, there’s always something that goes wrong (and it’s usually DNS). It was no different for us.
Robson was tasked with doing the final migration late one Saturday night. He’d planned for it to take no longer than an hour. Naturally, it took more than four hours, because our AKS cluster didn’t take kindly to being resized and fell over. It took some manual wrangling to get everything back up, but that’s almost par for the course.
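The resize itself is a single command; the wrangling is in verifying that workloads actually recover afterwards. A minimal sketch, with hypothetical resource names:

```shell
# Scale the production AKS cluster down to the new node count
az aks scale \
  --resource-group vanhack-rg --name vanhack-aks \
  --node-count 2

# Don't declare success yet: confirm the remaining nodes are Ready
# and that all pods were rescheduled off the removed node
kubectl get nodes
kubectl get pods --all-namespaces --field-selector=status.phase!=Running
```

If pods are stuck Pending after a scale-down, it usually means the remaining nodes don’t have the capacity (or the tolerations/affinity) to host them – which is the kind of thing that turns a one-hour window into four.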
You also need to plan for unusual single points of failure. In our case, Robson’s access to the internet is considered business critical. We have other people within the company who can take over if necessary, but that will increase the RTO even more (we’re still just a small startup!).
Robson has two independent internet connections from two different service providers, as well as battery backup that will keep him online for 180 minutes in case of a power outage.
TIP: The cloud is wonderful – but don’t forget that your team needs to be able to access the cloud if things go pear-shaped. Your plans need to include the possibility of power and internet outages.
We reduced cluster node counts and VM sizes, killed a couple of redundant VMs, and moved our apps from Windows to Linux
This crisis has made our infrastructure more performant. It’s slightly less resilient, but getting back to the old levels would take only a few clicks. We’ve also removed unnecessary services and optimized our app even more.
We hope that this deep dive helps you as you think about ways to reduce your infrastructure costs.