GCP - Planning for the Worst

Last month, Google Cloud published Planning for the Worst: Reliability, Resilience, Exit and Stressed Exit in Financial Services. This happens to be a topic I have previously worked on, so I was very interested to hear the perspective that GCP would bring.

The wider industry context is that regulators are concerned about risks to the financial system arising from the wholesale migration to cloud computing. In March 2021 the Prudential Regulation Authority in the UK published two supervisory statements closely related to the topic, including Outsourcing and third party risk management, which introduces the concept of a “stressed exit”. That is: if a Cloud Service Provider were to become insolvent, suffer a catastrophic technical failure, or (perhaps more likely) be banned from doing business in a particular geographical region, what would you do as a bank that has outsourced all of its computing services to that provider?

I would rate the Google Cloud paper as “average” - it treads familiar ground and provides very little insight that could not be gathered from an afternoon of reading the regulations (which, however, are helpfully listed in the introduction). The equivalent Azure paper contains much more detailed advice on how to develop an exit plan and was published more than a year earlier. Still, there are a few interesting nuggets here:

Firstly, Google lists its commitment to open source as an advantage for exit planning: many of its products and services are available in open source versions. Two examples which come to mind are Kubernetes and TensorFlow; here Google has adopted a strategy of creating new software categories that the other CSPs have embraced, which does make it easier to avoid vendor lock-in. Similarly, Dataflow pipelines are written against Apache Beam, and Dataproc is essentially managed Hadoop and Spark.

However, an obvious counterexample is listed slightly further down: BigQuery’s capabilities are largely unique to GCP (I predict that migrating large workloads to Amazon Athena would be challenging, for example). Is there a globally consistent competitor to Cloud Spanner yet?

The exit planning benefit of GCP’s open source commitments therefore largely depends on whether the workloads have been designed with exit requirements in mind.

Secondly, common standards for hosting applications in virtual machines or containers are also listed as an advantage; the benefit of these similarly depends on whether the workload is built against these interfaces.

Finally, Anthos (GCP’s multi-cloud management tool) is mentioned under both exit planning and stressed exit planning. It is true that this facilitates management of Kubernetes clusters in other clouds or on-premises; but if you are unexpectedly ending your commercial relationship with Google, how easy is it going to be to migrate those clusters to an alternative “single pane of glass”? If you have configured workload identity for your GKE-on-AWS clusters, and GCP disappears, how much trouble are you in? I have had a brief look, but have not yet found a white paper on this!

When discussing critically-important financial workloads which may well be essential to the stability of the entire financial system, I think the question asked by the regulators is legitimate: what new risks are you introducing when you add a technology outsourcing relationship to the mix? If this relationship suddenly breaks down, what happens?

But this train of thought needs to be taken to its logical conclusion: if one of the hyperscale cloud providers genuinely did collapse, or even if a bank just decided to end a commercial relationship, most of the exit plans in existence today would fail. These plans typically consider each workload individually, and assume the luxury of several months in which to carry out custom development of infrastructure on a new cloud. In reality, all workloads in the bank would be affected at once: there would be a massive shortage of cloud engineers within the organisation (and possibly across the industry, depending on the scenario), and these would be some of the most rushed and risky projects of all time. This is concentration risk, and it is touched on only briefly at the end of the paper.

And in this regard the paper does hint at the right answers - the most critical types of workload need to be built against industry standard interfaces (e.g. containers are essentially an extension of the stable Linux kernel-userspace API boundary); avoid over-reliance on CSP-specific services such as BigQuery; use open source that you can run elsewhere (e.g. Kubernetes, complex as it is, is at least provided by multiple vendors) and replicate data across multiple suppliers to enable sufficiently swift recovery from disasters.
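To make the last two points a little more concrete, here is a minimal Python sketch of what "build against a neutral interface and replicate data across suppliers" could look like. Everything in it is hypothetical and for illustration only: the `BlobStore` interface, the `InMemoryStore` stand-in and the `ReplicatedStore` wrapper are names I have made up, not part of any CSP's SDK, and a real implementation would have to deal with partial writes, consistency and cost.

```python
from typing import Dict, List, Optional, Protocol


class BlobStore(Protocol):
    """Hypothetical provider-neutral storage interface (an assumption)."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in for one supplier's object store, for illustration only."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True  # flip to False to simulate losing the supplier
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        return self._blobs[key]


class ReplicatedStore:
    """Writes to every supplier; reads from the first one that answers."""

    def __init__(self, replicas: List[BlobStore]) -> None:
        if len(replicas) < 2:
            raise ValueError("replication needs at least two suppliers")
        self._replicas = replicas

    def put(self, key: str, data: bytes) -> None:
        # Any failed write raises, so replicas never diverge silently.
        for replica in self._replicas:
            replica.put(key, data)

    def get(self, key: str) -> bytes:
        last_error: Optional[Exception] = None
        for replica in self._replicas:
            try:
                return replica.get(key)
            except (ConnectionError, KeyError) as exc:
                last_error = exc  # try the next supplier
        raise LookupError(f"no replica could serve {key!r}") from last_error


if __name__ == "__main__":
    primary = InMemoryStore("provider-a")
    secondary = InMemoryStore("provider-b")
    store = ReplicatedStore([primary, secondary])
    store.put("ledger/2021-03", b"balances")
    primary.available = False  # simulate a stressed exit from provider A
    print(store.get("ledger/2021-03"))
```

The application only ever talks to `BlobStore`, so losing one supplier is a configuration change, not a rewrite. The trade-off is that you are limited to features every supplier can offer.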

These engineering constraints are necessary because society cannot afford for these services to be locked in to a single vendor; not merely for competition reasons, but to ensure continuity of service.