GDS Podcast #35: How our Site Reliability Engineers migrated GOV.UK Pay
Description
Wondered how to migrate a 24/7 product to a serverless platform? We chat about initial user research, developing DevOps skills and the benefits of GDS's approach to this type of tech project.
---------
The transcript of the episode follows:
Vanessa Schneider:
Hello and welcome to the Government Digital Service podcast. My name is Vanessa Schneider and I am Senior Channels and Community Manager at GDS. Today, I am joined by Jonathan Harden, Senior Site Reliability Engineer, and Kat Stevens, Senior Developer and co-Tech Lead on GOV.UK Pay.
GDS has many products that rely on our expert site reliability engineers and their colleagues to maintain and improve their functionality. Such as GOV.UK Pay - one of GDS’s common platforms that is used by more than 200 organisations across the UK public sector to take and process online payments from service users. Jonathan and Kat recently completed a crucial reliability engineering project to ensure that GOV.UK Pay continues to operate at the highest standards and provide a reliable service for public sector users and their service users.
We'll hear more about that in a moment, but to start off, can you please introduce yourself to our listeners? Kat, would you mind starting?
Kat Stevens:
Hi I'm Kat Stevens, I’m a Senior Developer on GOV.UK Pay. I've been working at GDS since 2017. And before that, I was a developer at start-ups and small companies.
As a co-Tech Lead on the migration team then, I'm kind of jointly responsible for making sure that our platform is running as it should be. That our team is working well together, that we're working on the right things and that we're, what we're working on is of a high quality, and is delivering value for our users. So it's like balancing that up with software engineering, making sure that you know, that we're being compliant. It's very important for Pay. Software [laughs] engineering is so broad: there's like security, reliability, performance, all of those things. So yeah, it's kind of thinking about everything and - at a high level.
Vanessa Schneider:
I'm glad somebody's got a high level overview. Thanks, Kat. Jonathan, would you mind introducing yourself too?
Jonathan Harden:
Hi, I'm Jonathan Harden, and I am Senior Site Reliability Engineer on GOV.UK Pay. I've previously worked for a major UK mobile network operator, in the movie industry and for one of the UK's highest rated ISPs.
So all of GOV.UK Pay's services run, have to run somewhere. Being a Site Reliability Engineer means that I'm helping to build the infrastructure on which it runs, ensure that it is operating correctly and that we keep users’ cardholder data safe and help the developers ease their development lifecycle into getting updates and changes out into the world.
Vanessa Schneider:
Hmm, exciting work. So you both worked on a site reliability project for GOV.UK Pay. Can you please, for the uninitiated, introduce our listeners to the project that you carried out?
Kat Stevens:
Yeah so recently, we finished migrating GOV.UK Pay to run on AWS Fargate. So previously Pay was running its applications on ECS EC2 instances on AWS. That's a lot of acronyms. But it basically means we were maintaining long-lived EC2 instances that were running our applications. And that incurred quite a high maintenance burden for the developers on our team. And we decided that we wanted to move to a serverless platform to kind of reduce that maintenance burden. And after researching a few options, we decided that Fargate was a good fit for Pay, and we spent a few months carefully moving our apps across to the Fargate platform whilst not having any downtime for our users, which is obviously quite important. Like Pay is a 24/7 service, so we wanted to make sure that our end users had no idea that this was happening.
Vanessa Schneider:
Jonathan, how did you contribute to this migration?
Jonathan Harden:
So obviously, I've only been here for three months, so and the project has been going on quite a lot longer than that. But this is the kind of task I've been involved with, uh, several times now in the last few years at different companies. And so when I joined GDS, it was suggested that I join this project on Pay because I'd be able to contribute really quickly and, and help with the kind of the, the long tail of this migration.
So a-anybody else that's been in an SR- that works in SRE capacity will know that when you do these kind of projects, you have like the bulk of the migration where you move your applications, like your frontend services that users actually see when they go to the website and the backend services that processes transactions. But then you also have a lot of supporting services around that. So you have services like: things that provide monitoring and alerting, infrastructure that provides where, where do these applications get stored when they're not in use and like where do you launch them from. And there was, there was still quite a bit of that to tie up at the end. And the team, it's quite a small team. As a lot of SRE and infrastructure teams do tend to be. And so when I started, I joined that team and I've been helping with the, the, these long tail parts of the migration. Like in a lot of software engineering, the bulk of the work is done very quickly and the long tail takes quite a bit of time. So, so that's the kind of work that I've been helping with in the last few months.
Vanessa Schneider:
Great. Kat, as co-Tech Lead, what was your involvement in the migration?
Kat Stevens:
Let’s see where to start. So when I joined the Pay Team, which was around October 2020, we were in the early stages of the, of the project, so we'd made the decision that we needed to migrate and that involved things like analysing, like co-cost benefit things. I-It doesn't sound that exciting, but it was actually quite cool looking at all the different options. So, for example, it meant that we could keep some of our existing infrastructure. We wouldn't have to move our RDS instances for, for example. We could keep our existing security group, subnets - all that kind of glue that holds all the application, like infrastructure together.
Then there was quite a lot of planning of how we would actually do this, how we would roll out the migration application by application. We've got around a dozen microservices that we were going to move one by one. And figuring out what good looked like. How would we know that the migration is successful. How do we know whether to roll back a particular app.
So for the actual rollout of migrating sort of one application from EC2 to Fargate: we basically did DNS weighting. So we could have both run - versions of the app running alongside each other, and then you can have 5% of the traffic going to the new app, 95% to the old app. And you can gradually switch over that weighting and monitor whether there are any errors, whether like the traffic suddenly dips and things aren't getting through. So that was all part of the planning. Like what, what stages would we reach to say like, that yes, we're confident that this change has been positive. And like having a whole, like overview view of what's happening when. Estimating things as well - that's always pretty, [laughs] pretty difficult. But we, as the more apps we did, the quicker we went and we sped up on that. So that was good.
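The weighted rollout described above can be illustrated with a minimal Python sketch. This is not GOV.UK Pay's actual tooling: the stage percentages, the error-rate threshold, and the names `STAGES`, `MAX_ERROR_RATE`, `pick_target`, `run_stage` and `rollout` are all hypothetical, and the weighted DNS records are simulated with a weighted random choice rather than a real DNS provider. It only checks error rate, not the traffic dips also mentioned.

```python
import random

# Hypothetical rollout stages: percentage of traffic sent to the new platform.
STAGES = [5, 25, 50, 100]
# Assumed threshold: roll back if more than 1% of requests to the new app fail.
MAX_ERROR_RATE = 0.01


def pick_target(new_weight, rng):
    """Simulate weighted DNS: return 'new' with probability new_weight percent."""
    return rng.choices(["new", "old"], weights=[new_weight, 100 - new_weight])[0]


def run_stage(new_weight, send_request, requests=1000, rng=None):
    """Send simulated traffic at one weighting; return the new app's error rate."""
    rng = rng or random.Random()
    sent = errors = 0
    for _ in range(requests):
        if pick_target(new_weight, rng) == "new":
            sent += 1
            if not send_request("new"):  # send_request returns True on success
                errors += 1
    return errors / sent if sent else 0.0


def rollout(send_request, rng=None):
    """Step through the stages, rolling back as soon as errors exceed the threshold."""
    for weight in STAGES:
        if run_stage(weight, send_request, rng=rng) > MAX_ERROR_RATE:
            return ("rolled_back", weight)
    return ("migrated", 100)
```

Calling `rollout` with a function that always succeeds walks through every stage, while one that always fails triggers a rollback at the first 5% stage, so only a small share of traffic is ever exposed to a bad deployment.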
And yeah, there's a whole bunch of other things we, we had to get involved with over the last few months as well. So that's things like performance-testing the whole environment to, you know, we wanted to have confidence that the new platform would be able to handle like the high levels of traffic that we see on GOV.UK Pay. Also we wanted to look at how we would actually deploy these apps. Having more confidence in our deployments, moving to continuous deployment where possible. So while those things weren't like directly impacted by Fargate, doing this migration like gave us the opportunity to explore some of those other improvements that we could make. And yeah, I think we've really benefited.
Vanessa Schneider:
That makes sense, it's always nice to not just keep things ticking over, but making big improvements, that feels really rewarding, I think. Can you give us an impression of what the situation was before the migration maybe?
Kat Stevens:
On our previous infrastructure, we were running ECS tasks on EC2 launch types - so those are sort of, relatively long-lived instances that we had to provision, patch, maintain. And the developers on the, on the rest of the team, and I - we're not necessarily infrastructure specialists, but when developers on our support rota would end up spending sort of like maybe 5, 6, 7 hours a week just maintaining our EC2 instances, we kind of realised that something had to change [laughs]. And moving to a serverless infrastructure, it just completely removes that burden of having to provision and make, roll our AMIs, our machine images. We, that just doesn't happen anymore. And we've freed up our developers to work on features. And yeah, the, the infrastructure burden on Pay is just so much less.
Vanessa Schneider:
Oh, that sounds really helpful. I’m not sure if migrations are an every-day kind of job for site reliability engineers or software developers, so I was wondering if there’s anything that stood out about this process, like an opportunity to use new tools, or a different way of working?
Jonath