Infra Crib Sheet

This crib sheet is directed at people who manage some kind of infra footprint. The idea originated from a conversation about how someone would expect a “senior engineer” to map out some kind of forward-pointing plan for the aforementioned infra footprint.

If one of these points is below par for your purpose, you need to change it. If it’s fit for purpose, drill down into the more technical topics beneath it. These are what I consider the “strategic heights” that let you map out where the action will happen and where you stand. Changing these factors takes broader, longer-term effort across the organization than a quick tactical fix. And change close to these central issues is mostly planned and effected by senior personnel.

Your crib sheet can be on a piece of paper, a wiki, or a chalkboard. Since every member of an IT infra team works in the capacity of an engineer, I see strong value in creating visibility into these topics for the entire team. And I see special value in making these factors explicit and part of the onboarding of new team members.

I spent 15 years in service companies and have seen plenty of hair-on-fire teams, lots of bad management, and subpar engineering. None of these things in itself prevents you from trying to move a platform forward in a systematic and meaningful way.

Even if you’re in a project that devalues your skill and rejects proper planning and technical discussions (this happens constantly), having a page (or spread) in a personal notebook that lists your priorities won’t hurt.

This crib sheet has, in typical meyer style, an informal, hands-on form (“hemdsärmelig” — shirt-sleeved — as we say in German), because I trust that you’re a competent professional and a first-class engineer. This is only a collection of suggestions, less formal than a checklist. The way you move it forward while keeping it fit for purpose is the mark of a good tech/infra/platform strategy and defines the style of a good engineer. And different styles and strategies are always possible.

Workload

A workload is something external stakeholders (“the business”, “our customers”) take an interest in. Etcd is not a workload unless you vend etcd, like in etcd-as-a-service, to an external party.

To describe the workload, use the following scheme: Name of the service (“sandra”), tech behind it (“juice vending machine”), external stakeholders (“employees who need a refreshment”).

It’s not necessary to fully note down the service level expectation for this workload, but it’s important that people understand whether they’re working on the safety mechanism of a nuclear power plant or a juice vending machine. Set the stage or give context, as they say.

Examples:

You might want to note down: “announce in ask-systems-engineering if we do maintenance” or “call the customer’s IT director if we’re facing problems”. The latter often sits in a dedicated runbook in the o11y tooling.
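If you keep the crib sheet somewhere machine-readable, the naming scheme above can be captured in a small record. This is only a sketch in Python; the field names and the `ops_notes` entries are illustrative, not a prescribed format:

```python
from dataclasses import dataclass, field

@dataclass
class Workload:
    """One crib-sheet entry: something external stakeholders care about."""
    name: str                 # service name, e.g. "sandra"
    tech: str                 # tech behind it
    stakeholders: list[str]   # who takes an interest in it
    context: str = ""         # nuclear power plant or juice vending machine?
    ops_notes: list[str] = field(default_factory=list)  # who to tell, and when

# The juice vending machine from the text, as a record:
sandra = Workload(
    name="sandra",
    tech="juice vending machine",
    stakeholders=["employees who need a refreshment"],
    context="low stakes: nobody gets hurt if the juice is late",
    ops_notes=["announce in ask-systems-engineering if we do maintenance"],
)
```

The exact shape matters less than the fact that the context field exists at all; it is what tells a new team member which stage they are standing on.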

External surface

The term “attack surface” is a staple in IT infra. This section plays on that idea, with a bit of a twist.

Note down what parts of your infrastructure are exposed to the public or other external stakeholders. Examples:

Note that I care about infrastructure issues here. I don’t usually have to account for two services sitting on the same piece of infrastructure with different failure modes, but it has happened. Think about the failure mode (“http server on server xy crashes”). If the failure modes for two services are identical, they should be listed only once. This follows from the idea of (common) infrastructure.
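The list-once rule can be made mechanical: key the inventory by failure mode and deduplicate. A minimal sketch in Python, with made-up exposures and failure modes:

```python
# An external-surface inventory keyed by failure mode, so that two services
# sharing the same piece of infrastructure (identical failure mode) collapse
# into a single entry. All names below are illustrative.
surface = [
    {"exposure": "https://api.example.com:443",
     "failure_mode": "http server on server xy crashes"},
    {"exposure": "https://www.example.com:443",
     "failure_mode": "http server on server xy crashes"},
    {"exposure": "sftp drop on server ab, port 22",
     "failure_mode": "sshd on server ab crashes"},
]

def dedupe_by_failure_mode(entries):
    """Keep one entry per distinct failure mode, preserving order."""
    seen, result = set(), []
    for entry in entries:
        if entry["failure_mode"] not in seen:
            seen.add(entry["failure_mode"])
            result.append(entry)
    return result
```

Here `dedupe_by_failure_mode(surface)` yields two entries, because the two HTTPS exposures share one piece of infrastructure and therefore one failure mode.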

External surface defines security, team/organizational interactions, and performance/cost parameters for the platform. Think of the NAT gateway on AWS; it is frequently a performance bottleneck and a cost driver.

Security- and safety-wise, this is the worst place for maverick decisions. Make sure that the external surface (e.g., open ports) is known to, or at least brought up with, security and product. Some organizations maintain a separate risk management/tracking function; it needs to be informed if your setup does not match good/best practice.

Platform access

This is about all access to the platform that is not customer access. It encompasses how people prove their identity (e.g., two factor, IDP), what kind of credentials they’re granted (e.g., long or short-lived, tied to an IP), and by what means they access the platform (e.g., in-band or out-of-band/control plane).
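One way to make those three dimensions (identity, credential, channel) explicit is a small record per access path. The roles, lifetimes, and channel names below are assumptions for illustration, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class AccessPath:
    """One way people (not customers) reach the platform."""
    who: str         # role or team holding the access
    identity: str    # how they prove who they are
    credential: str  # what they're granted, and its lifetime
    channel: str     # in-band, or out-of-band / control plane

# Hypothetical entries, one line per path:
paths = [
    AccessPath("on-call engineer", "IdP + two-factor",
               "short-lived token (1 h)", "out-of-band control plane"),
    AccessPath("CI pipeline", "workload identity",
               "short-lived, IP-bound", "in-band API"),
]
```

Even two lines like these answer the question a reviewer (or a new teammate) will ask first: who can touch this, with what, and from where.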

If you get this wrong, everybody in the team and beyond will hate you. This requires a measured approach, but I have locked out people from doing weird shit before. And I don’t regret it.

If your platform access encompasses pushing large blobs around, you’re doing it wrong. This is by definition a low traffic, high reliability/security control path, and a lot of good things have come from segregating this control path into a control plane. Nothing stops you from moving more in this direction, not even SSH.

Business Decisions

Business decisions happen at specific points in the business cadence. Every business has its own cadence, much like nature has seasons. If you want to see spring flowers, you need to pack a jacket and get out in spring.

Examples:

Frequently, management does not share these things with engineering. This, in turn, leads to management discovering that engineering is not using the resources it has ordered.

That said, nothing prevents you from quietly noting down that you use, e.g., honeycomb.io and later starting a dialogue with management if the three teams aren’t using it anymore. Of course, honeycomb.io is great, and they should be using it.

There’s one important thing about business decisions: If you ignore these, you might get stuck at the junior engineer role for a decade or more (I’ve seen this happen). F.L. Bauer, a German computer pioneer, suggested the following as the definition of software engineering:

“Establishment and use of sound engineering principles to economically obtain software that is reliable and works on real machines efficiently.”

If you don’t have a relationship with the “economic” and “efficient” parts of your work, you’re either not an engineer or a very junior one.

Operations (Scaling, Failure, Running Cost)

This is a wide and deep topic in a very short heading. For the matter at hand, just make sure you (and everybody else) know that there are requirements that deviate from the run-of-the-mill setup. These special cases fall broadly into two categories: a) ugly hacks for no good reason, and b) ugly hacks driven by the workload requirements.

“Ugly hacks driven by workload requirements” is a bit of hyperbole. Let me give you a longer, fictional example. We have a very important data set in our blob store. To make sure we can access this data for the foreseeable future, we use technical means (replicating it to a different region) and organizational means (segregation of duties: we’re only allowed to append, and only the security team is able to modify it). Of course, this requires special attention and a certain amount of overhead when you set up or change such a mechanism. Therefore, you must be careful and measured when you establish these kinds of mechanisms. But no level of care on your side will suffice if nobody knows that the mechanism exists. Take special care to note things that are non-obvious.

Examples:

Each of the three keywords warrants having designated specialists on staff, and you can take each of them to ridiculous depths. Say, having a couple of mathematicians in a dedicated research arm advancing queuing theory to make your systems more efficient. In other cases, scaling will be handled by the function “f(x) = 1”, which still affords you a good dose of resiliency. This is equivalent to having a system service autostart. From experience: I have built services that scale to zero, based on workload. I’d always tell the customer/stakeholder that this is in place and that they/we might want to remove it if it’s too annoying for them. A lot of solutions are possible, and it all depends on the wider context.
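To make the “f(x) = 1” remark concrete, here is a sketch of both scaling functions in Python. The per-instance capacity is an assumed parameter, not something from the text:

```python
def constant_one(pending_work: int) -> int:
    """f(x) = 1: always run exactly one instance, regardless of load.
    Equivalent to a system service that simply autostarts and restarts."""
    return 1

def scale_to_zero(pending_work: int, per_instance_capacity: int = 10) -> int:
    """Run just enough instances for the pending work; none when idle."""
    if pending_work <= 0:
        return 0
    # ceiling division: enough instances to cover all pending work
    return -(-pending_work // per_instance_capacity)
```

With these definitions, `constant_one(0)` is still 1 (the resilient default), while `scale_to_zero(0)` is 0 and `scale_to_zero(25)` is 3. The point is not the arithmetic but that both are legitimate strategies, and that whichever one is in place should be written down and discussed with the stakeholder.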

Final Note

Note down everything in the infra you don’t understand or that doesn’t make sense, and continue working on these issues.

This sounds a bit like a general janitorial task list, and in a way it is. This is why I say being a “senior engineer” is more like being Mary Poppins than Tony Stark.

Usually, I make sure to do a “walkthrough inspection” of the piece of infrastructure I work with, and I usually do this within the first two weeks of a posting. This has led me to turn down postings multiple times, because the mandate didn’t cover fixing egregious problems the platform had. The only things I regret are not having turned down more postings and not having been more explicit and straightforward about my quality standards.

In any case, it’s time to roll up your sleeves.

Bio Facts
generalist/architect · systems design · IaC · cloud & dc · product first · 18 years experience · conversational arch · edu: informal/self-taught
Check in below for an informal but personal conversation. I have special rates for purpose-driven and charity work. The informal session is free of charge, of course.

more meyer, less fire (claim)

mail: m@meyer.engineering

Mark Meyer
IT Beratung und Umsetzung

Heitmannstr. 73
22083 Hamburg
Germany

VAT ID: DE 45 603 8776
