This crib sheet is aimed at people who manage some kind of infra footprint. The idea originated in a conversation about what kind of forward-looking plan one would expect a “senior engineer” to map out for such a footprint.
If one of these points is below par for your purpose, you need to change it. If it’s fit for purpose, drill down into the more technical topics. These are what I consider “strategic heights”: vantage points that let you map out where the action will be happening and where you stand. Changing these factors takes broader and longer-term effort across the organization than a quick tactical fix. And change close to these central issues is mostly planned and effected by senior personnel.
Your crib sheet can live on a piece of paper, a wiki, or a chalkboard. Since every member of an IT infra team works in the capacity of an engineer, I see strong value in creating visibility into these topics for the entire team. And I see special value in making these factors explicit and part of the onboarding of new team members.
I spent 15 years in service companies and have seen plenty of hair-on-fire teams, lots of bad management, and subpar engineering. None of these things in themselves prevents you from trying to move a platform forward in a systematic and meaningful way.
Even if you’re in a project that devalues your skill and rejects proper planning and technical discussions (this happens constantly), having a page (or spread) in a personal notebook that lists your priorities won’t hurt.
This crib sheet has, in typical Meyer style, an informal form (“hemdsärmelig”, as the Germans say), because I trust that you’re a competent professional and a first-class engineer. It is only a collection of suggestions, less formal than a checklist. The way you move it forward while keeping it fit for purpose is the mark of a good tech/infra/platform strategy and defines the style of a good engineer. And different styles and strategies are always possible.
Workload
A workload is something external stakeholders (“the business”, “our customers”) take an interest in. Etcd is not a workload unless you vend etcd, like in etcd-as-a-service, to an external party.
To describe the workload, use the following scheme: Name of the service (“sandra”), tech behind it (“juice vending machine”), external stakeholders (“employees who need a refreshment”).
It’s not necessary to fully note down the service level expectation for this workload, but it’s important that people understand whether they’re working on the safety mechanism of a nuclear power plant or a juice vending machine. Set the stage or give context, as they say.
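As a minimal sketch of the name/tech/stakeholders scheme above (the fields and the extra criticality note are my own illustration, not a prescribed format), such an entry could look like this:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    """One crib sheet entry: name, tech behind it, external stakeholders."""
    name: str          # e.g. "sandra"
    tech: str          # e.g. "juice vending machine"
    stakeholders: str  # e.g. "employees who need a refreshment"
    context: str       # rough stage-setting: nuclear plant or juice machine?

sheet = [
    Workload(
        name="sandra",
        tech="juice vending machine",
        stakeholders="employees who need a refreshment",
        context="low stakes: nobody gets hurt if the juice stops",
    ),
]

for w in sheet:
    print(f"{w.name}: {w.tech}, for {w.stakeholders} ({w.context})")
```

A list like this is trivially greppable and fits on one wiki page, which is the whole point.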
Examples:
We have two runner setups that process jobs from GitLab. One of these runs on the K8s cluster and builds containers for the developer, for both release and stage. The other one runs on EC2 (compute service) to build a raw Debian disk image for our integration test.
We run a Grafana service, used by our team. This service is made available to other people inside the company, but we made it clear that this is a courtesy only.
We run an IoT platform for electricity grid monitoring. Availability and correctness of the data are governed by our commitment to our regulator.
You might want to note down: “announce in ask-systems-engineering if we do maintenance” or “call the customer’s IT director if we’re facing problems”. The latter often lives in a dedicated runbook in the o11y tooling.
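A hedged sketch of how such notes could be kept as data instead of tribal knowledge; the workload names and channels below are invented for illustration:

```python
# Per-workload operational notes: who do we tell, and when?
# All names here are hypothetical examples.
ops_notes = {
    "gitlab-runners": {
        "maintenance": "announce in ask-systems-engineering",
        "incident": "no external notification needed",
    },
    "iot-platform": {
        "maintenance": "announce per regulator commitment",
        "incident": "call the customer's IT director (runbook in o11y tooling)",
    },
}

def action_for(workload: str, event: str) -> str:
    """Look up the noted action, with a sane default instead of silence."""
    return ops_notes.get(workload, {}).get(event, "no note: escalate to team lead")

print(action_for("iot-platform", "incident"))
```

The default branch matters: an empty lookup should tell you the sheet has a gap, not fail quietly.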
External surface
The term “attack surface” is a staple in IT infra. This section plays on that idea, with a bit of a twist.
Note down what parts of your infrastructure are exposed to the public or other external stakeholders. Examples:
Everything has a public IP, but we only allow port 22 for instance group xy and port 443 for all app servers, as per the console here
We have load balancers in a public subnet, with ports 80 and 443 open
In addition, we have an open Kubernetes control plane
We only serve and allow no egress in the prod application farm (“air-gapped”), and in stage/dev, we only allow egress via NAT gateway.
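One way to keep such notes honest is to treat the noted surface as data and diff it against what a scan actually finds. A sketch, with invented group names and the ports from the examples above:

```python
# The crib sheet's view of the external surface (illustrative names/ports).
EXPECTED_SURFACE: dict[str, set[int]] = {
    "instance-group-xy": {22},
    "app-servers": {443},
    "public-load-balancers": {80, 443},
}

def surface_diff(observed: dict[str, set[int]]) -> dict[str, set[int]]:
    """Return ports that are open in reality but not on the crib sheet."""
    diff = {}
    for group, ports in observed.items():
        unexpected = ports - EXPECTED_SURFACE.get(group, set())
        if unexpected:
            diff[group] = unexpected
    return diff

# e.g. a port scan finds 8080 open on the app servers:
print(surface_diff({"app-servers": {443, 8080}}))  # {'app-servers': {8080}}
```

Whether the observed side comes from a cloud API or an nmap run is an implementation detail; the point is that the sheet is the reference.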
Note that I care about infrastructure issues here. I don’t usually have to account for two services sitting on the same piece of infrastructure but with different failure modes, though it has happened. Think about the failure mode (“http server on server xy crashes”). If the failure modes of two services are identical, list them only once. This follows from the idea of (common) infrastructure.
External surface defines security, team/organizational interactions, and performance/cost parameters for the platform. Think of a NAT gateway case for AWS; this is frequently a performance bottleneck and cost driver.
Security- and safety-wise, this is the worst place for maverick decisions. Make sure that the external surface (e.g., open ports) is known to, or brought up with, security and product. Some organizations maintain a separate risk management/tracking function; it needs to be informed if your setup does not match good/best practice.
Platform access

This is about all access to the platform that is not customer access. It encompasses how people prove their identity (e.g., two-factor, IDP), what kind of credentials they’re granted (e.g., long- or short-lived, tied to an IP), and by what means they access the platform (e.g., in-band or out-of-band/control plane).
If you get this wrong, everybody in the team and beyond will hate you. This requires a measured approach, but I have locked out people from doing weird shit before. And I don’t regret it.
Everything runs across SSH. If you fuck up the firewall rules, you’ll lock yourself and everybody else out of remote access.
We manage the tech for a privilege escalation process to give an external vendor additional permissions. The process can be triggered by product/support/NOC/service desk. Access is limited to account 123, and a full log trail will be generated. Access reverts after four hours.
Every developer has to go through the AWS API/control plane to access the cluster. For that to happen, they need to authenticate with the company IDP. The service desk can assign groups from this set.
If your platform access encompasses pushing large blobs around, you’re doing it wrong. This is by definition a low traffic, high reliability/security control path, and a lot of good things have come from segregating this control path into a control plane. Nothing stops you from moving more in this direction, not even SSH.
Business cadence

Every business has its own cadence, much like nature has its seasons. If you want to see spring flowers, you have to pack a jacket and get out in spring.
Examples:
We have a honeycomb.io subscription for app farm X. It’s used by three feature teams, working on apps a, b, and c.
Project timeline and result expectations: we’re operating on a “it’s done, when it’s done” perspective, but want to start onboarding friendly testers in three months
Yearly company returns will be mostly fixed in February, and we need to spec three new Dell servers by then to make a budget request.
We’re about to hire three new dev teams, and we need to make sure that our infra onboarding process is in top shape when that happens
Frequently, management doesn’t share these things with engineering, which then leads, just as frequently, to management discovering that engineering is not using the resources it has ordered.
That said, nothing prevents you from quietly noting down that you use, e.g., honeycomb.io, and later starting a dialogue with management if the three teams aren’t using it anymore. Of course, honeycomb.io is great, and they should be using it.
There’s one important thing about business decisions: If you ignore these, you might get stuck at the junior engineer role for a decade or more (I’ve seen this happen). F.L. Bauer, a German computer pioneer, suggested the following as the definition of software engineering:
“Establishment and use of sound engineering principles to economically obtain software that is reliable and works on real machines efficiently.”
If you don’t have a relationship with the “economic” and “efficiency” parts of your work, you’re either not an engineer or a very junior engineer.
Special requirements

This is a wide and deep topic under a very short heading. For the matter at hand, just make sure that you (and everybody else) know that there are requirements that deviate from the run-of-the-mill setup. These special cases fall broadly into two categories: a) ugly hacks for no good reason, and b) ugly hacks driven by workload requirements.
“Ugly hacks driven by workload requirements” is a bit of hyperbole. Let me give you a longer, fictional example. We have a very important data set on our blob store. To make sure we can access this data in the foreseeable future, we use technical means (replicating it to a different region) and organizational means (segregation of duties: we’re only allowed to append, and only the security team is able to modify it). Of course, this requires special attention and a certain amount of overhead when you set up or change such a mechanism. Therefore, you must be careful and measured when you establish these kinds of mechanisms. But no level of care on your side will suffice if nobody knows that the mechanism exists. Take special care to note things that are non-obvious.
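The append-only part of this fictional example can be sketched as a tiny policy check; the role names and keys are invented, and a real blob store would enforce this with bucket policies or object locks rather than application code:

```python
class AppendOnlyStore:
    """Toy blob store: ordinary roles may only add new objects;
    modifying an existing object is reserved for the security team."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, role: str, key: str, data: bytes) -> None:
        if key in self._objects and role != "security-team":
            raise PermissionError(f"{role} may only append, not modify {key!r}")
        self._objects[key] = data

store = AppendOnlyStore()
store.put("infra-team", "2024/01/readings.csv", b"reading-data")
try:
    # Same key again from the ordinary role: rejected.
    store.put("infra-team", "2024/01/readings.csv", b"tampered")
except PermissionError as e:
    print(e)
```

The value of writing it down like this is that the non-obvious rule (“why can’t I overwrite my own file?”) is visible instead of being a surprise during an incident.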
Examples:
Everything is set up to autoscale our three K8s clusters, but since all load balancer backends and all K8s nodes are colocated in the same small subnet, a huge traffic spike will just make everything fall over
Broken service X can lead to infra leaks (e.g., instances not being terminated), which in turn leads to very high infra bills
In our strongly coupled microservice architecture, a failure in the database can lead to a cascading failure in all applications, which in turn triggers the cluster to continuously scale until the company runs out of money
Technically, we run service x on a persistent database, but it only stores 5 seconds’ worth of data, and it’s just a support system; if a disk breaks, just re-initialize the entire thing, no recovery needed
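The “infra leak” failure mode above boils down to a set difference between what you think you run and what you’re being billed for. A sketch with invented instance IDs:

```python
def find_leaked(expected_ids: set[str], running_ids: set[str]) -> set[str]:
    """Instances that are running (and costing money) but no longer tracked
    by the orchestrator, i.e. candidates for a leak from broken service X."""
    return running_ids - expected_ids

expected = {"i-alpha", "i-beta"}                 # what our tooling tracks
running = {"i-alpha", "i-beta", "i-zombie"}      # what the cloud bills us for
print(find_leaked(expected, running))            # {'i-zombie'}
```

Run this sort of reconciliation on a schedule and the "very high infra bill" shows up as an alert instead of an invoice.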
Scaling, resiliency, recovery

The three keywords each warrant having designated specialists on staff, and you can take each of them to ridiculous depths. Say, having a couple of mathematicians in a dedicated research arm advancing queuing theory to make your systems more efficient. In other cases, scaling will be handled by the function “f(x) = 1”, which still affords you a good dose of resiliency; this is the equivalent of having a system service autostart. From experience: I have built services that scale to zero, based on workload. I’d always tell the customer/stakeholder that this is in place and that they/we might want to remove it if it’s too annoying for them. A lot of solutions are possible, and it all depends on the wider context.
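The two scaling styles mentioned above can be sketched as replica-count functions; the per-replica threshold is invented for illustration:

```python
def replicas_constant(queue_depth: int) -> int:
    """f(x) = 1: a service that is simply kept running, autostart-style."""
    return 1

def replicas_scale_to_zero(queue_depth: int, jobs_per_replica: int = 100) -> int:
    """Scale with the workload, all the way down to zero when idle."""
    if queue_depth == 0:
        return 0
    return -(-queue_depth // jobs_per_replica)  # ceiling division

assert replicas_constant(0) == 1          # always one instance, come what may
assert replicas_scale_to_zero(0) == 0     # idle: pay nothing
assert replicas_scale_to_zero(250) == 3   # 250 jobs at 100 per replica
```

Both are legitimate strategies; the crib sheet's job is to record which one is in effect, so nobody is surprised when the idle service takes a moment to come back up.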
Note down everything in the infra you don’t understand or that doesn’t make sense, and continue working on these issues.
This sounds a bit like a general janitorial task list, and in a way it is. This is why I say being a “senior engineer” is more like being Mary Poppins than Tony Stark.
Usually, I make sure to do a “walkthrough inspection” of the piece of infrastructure I work with, typically within the first two weeks of a posting. This has led me to turn down postings multiple times, because the mandate didn’t cover fixing egregious problems the platform had. The only thing I regret is that I haven’t turned down more postings and haven’t been more explicit and straightforward about my quality standards.
In any case, it’s time to roll up your sleeves.
Mark Meyer
IT Beratung und Umsetzung
Heitmannstr. 73
22083 Hamburg
Germany
VAT ID: DE 45 603 8776