<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Mark Meyer: Cloud &amp; Platform Architecture for Digital Products</title><link href="https://meyer.engineering/" rel="alternate"/><link href="https://meyer.engineering/atom.xml" rel="self"/><id>https://meyer.engineering/</id><updated>2025-12-31T00:00:00+01:00</updated><subtitle>Mark Meyer: Cloud &amp; Platform Architecture for Digital Products</subtitle><entry><title>Engineers, Know Your Customer!</title><link href="https://meyer.engineering/posts/2025-12-31-know-your-customer.html" rel="alternate"/><published>2025-12-31T00:00:00+01:00</published><updated>2025-12-31T00:00:00+01:00</updated><author><name>Mark Meyer</name></author><id>tag:meyer.engineering,2025-12-31:/posts/2025-12-31-know-your-customer.html</id><summary type="html">There's quite a bit of value to knowing who your customer is. For one, you'll better understand what's important for the business, and you'll be able to avoid deep pits that could get you (or your customer) into trouble. In my opinion, this kind of research makes sense, even if you're a software engineer. We're not trying to match some regulatory requirement, which means we can keep this light.</summary><content type="html">&lt;p&gt;There&amp;rsquo;s quite a bit of value to knowing who your customer is. For one, you&amp;rsquo;ll better
understand what&amp;rsquo;s important for the business, and you&amp;rsquo;ll be able to avoid deep
pits that could get you (or your customer) into trouble. In my opinion, this kind
of research makes sense, even if you&amp;rsquo;re a software engineer. We&amp;rsquo;re not trying to
match some regulatory requirement, which means we can keep this light.&lt;/p&gt;
&lt;p&gt;For many large-scale
enterprises, the following approach won&amp;rsquo;t work, but for many small and medium-scale
companies with limited business dealings, it works quite well. The same
methodology also applies to suppliers.&lt;/p&gt;
&lt;p&gt;Some of these things are centered around Germany, some cover international postings.
You might need to adapt these to your circumstances.&lt;/p&gt;
&lt;h2&gt;Research&lt;/h2&gt;
&lt;p&gt;For starters, we&amp;rsquo;ll try to map out what kind of business we&amp;rsquo;re actually dealing with.
Somebody asks you: &amp;ldquo;Would you mind doing some Kubernetes revamp at ACME Inc. in
six weeks?&amp;rdquo; That means, I&amp;rsquo;ll look into what ACME is actually trying to achieve,
product-wise.&lt;/p&gt;
&lt;p&gt;If you feel you&amp;rsquo;re getting too much information into your system, I can assure you
that many people have this feeling. My solution is to quit/block janky social media
shit (like LinkedIn) and stick to the evening news. It&amp;rsquo;s worked for a couple
of generations, and it might work for you, too.&lt;/p&gt;
&lt;h3&gt;Wikipedia&lt;/h3&gt;
&lt;p&gt;Check the Wikipedia page &lt;em&gt;and&lt;/em&gt; the discussion page. The Wikipedia page should list
a couple of things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;legal entity type (e.g., GmbH)&lt;/li&gt;
&lt;li&gt;where the headquarters are and from where the company is controlled&lt;/li&gt;
&lt;li&gt;company history: renamings, predecessor companies, and past scandals&lt;/li&gt;
&lt;li&gt;what these people actually sell (e.g., paint and solvents) and roughly
where they stand in the market&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You would be surprised how many people never look up the Wikipedia page of their
customer/posting. I really enjoy this because it gives me some context about
the companies I work with.&lt;/p&gt;
&lt;h3&gt;Public Register&lt;/h3&gt;
&lt;p&gt;The public register is important to verify company ownership (see above) and
the company&amp;rsquo;s current state. If a court has opened insolvency proceedings, the company should
be stricken from the register, i.e., the entry will read that the company has ceased
operations.&lt;/p&gt;
&lt;p&gt;You can find the register for Germany here:
https://www.handelsregister.de&lt;/p&gt;
&lt;p&gt;The register is fairly reliable, but there are quite a few technical hiccups.
It&amp;rsquo;s possible to lock entries, but only temporarily. The register actually has
unnamed and unexplained numeric error codes, and these can be explained by
referencing this document (obtained via the IFG, the German FOIA):&lt;/p&gt;
&lt;p&gt;https://fragdenstaat.de/anfrage/errorcodes-des-gemeinsamen-registerportals-der-laender&lt;/p&gt;
&lt;p&gt;This register includes commercial and not-for-profit companies and associations.
It does not contain foundations, which should be listed in the &amp;ldquo;transparency register&amp;rdquo;.
The transparency register is not public, and last I checked, it was incomplete and out of date.
It&amp;rsquo;s probably time to rename it to &amp;ldquo;intransparency register&amp;rdquo;. A foundation is the standard
way to cloak ownership, e.g., for wealthy families.&lt;/p&gt;
&lt;p&gt;The other standard way is to register a company in, e.g., Luxembourg.&lt;/p&gt;
&lt;p&gt;In general, the register tells you who is responsible, but the information
is less useful for identifying the true owners if the ownership trail ends in
a dead end (like a foundation or a hard-to-find post box in Luxembourg).&lt;/p&gt;
&lt;h3&gt;Regulator&lt;/h3&gt;
&lt;p&gt;Most of the time, regulatory requirements are molded into specific company
regulations. These company regulations usually follow some ISO scheme, and
there are quite a few to choose from.&lt;/p&gt;
&lt;p&gt;If you work in IT, you&amp;rsquo;ll have to deal with a couple of general regulations and laws.
Here&amp;rsquo;s the most general stuff. You should know that these exist and where to find
them. Knowing that these exist gives you some context on why you need to
jump through some extra hoops.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Confidentiality of Mail and Telecommunication, which in German is the Fernmelde-
und Postgeheimnis. This privileged status of communications is internationally
accepted. There&amp;rsquo;s a page in the German Wikipedia (https://de.wikipedia.org/wiki/Fernmeldegeheimnis)
and the ITU has a page here (https://www.itu.int/en/wcit-12/pages/itrs.aspx).&lt;/li&gt;
&lt;li&gt;General Data Protection Regulation (GDPR), known in German as the DS-GVO and
supplemented by the Bundesdatenschutzgesetz (BDSG). Again, Wikipedia has
you covered: https://de.wikipedia.org/wiki/Datenschutz-Grundverordnung&lt;/li&gt;
&lt;li&gt;Critical Infrastructure Regulation; in Germany, this is covered by the Kritis-VO
(https://www.gesetze-im-internet.de/bsi-kritisv/BJNR095800016.html).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Check who the regulator is for your company. Many industries have a dedicated
regulator (as part of the public administration). Companies can be, and often are, regulated
by multiple regulators. All of the following offices should be politically neutral and
base their recommendations on the best current practice.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;data/IT regulators; in Germany, this role is taken by the Bundesamt für Sicherheit
in der Informationstechnik (BSI). They do limited operations (like ordering sinkholing)
and provide standards and guidance for government IT operations and the wider
public. https://www.bsi.bund.de&lt;/li&gt;
&lt;li&gt;financial services authority, like the German BaFin (https://www.bafin.de);
these people directly set IT standards/requirements if you handle financial transactions&lt;/li&gt;
&lt;li&gt;environmental protection agency, like the Umweltbundesamt (https://www.umweltbundesamt.de/);
tracks chemical accidents and the release of dangerous substances; they also engage
in refining regulations like REACH and RoHS, and in tracking lead levels in the
population, among many other things; their task is to provide consulting to politicians
and the wider public&lt;/li&gt;
&lt;li&gt;occupational health and safety; in Germany, this is mostly tracked by the Deutsche Gesetzliche
Unfallversicherung (https://www.dguv.de); these people set standards regarding
exposure to dangerous substances, and they moderate the process
that results in a recognized occupational health risk.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You&amp;rsquo;re less likely to come into conflict with the latter two agencies, but that can
happen, e.g., if your postings focus around industrial settings.&lt;/p&gt;
&lt;p&gt;Whistleblower laws are common around Europe. In Germany, the law in question is the
&amp;ldquo;Hinweisgeberschutzgesetz&amp;rdquo;. Many companies run their own whistleblower portal that
allows you to report violations directly to the management. You can choose whether
to use the company&amp;rsquo;s portal or to send your report to the corresponding regulator.
(See: https://de.wikipedia.org/wiki/Hinweisgeberschutzgesetz)&lt;/p&gt;
&lt;h3&gt;General Sources&lt;/h3&gt;
&lt;p&gt;There are a couple of organizations that provide useful general guidance around
assignments with touchpoints in foreign countries. I picked this particular selection
because these organizations are &amp;ldquo;established&amp;rdquo; and offer a global footprint.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Human Rights Watch https://www.hrw.org&lt;/li&gt;
&lt;li&gt;Amnesty International https://amnesty.org&lt;/li&gt;
&lt;li&gt;Transparency International https://www.transparency.org&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The foreign office has an official list of &amp;ldquo;travel warnings&amp;rdquo; and other information
around security in different countries. This includes security issues like
civil war, civil unrest, and armed conflict.&lt;/p&gt;
&lt;p&gt;https://www.auswaertiges-amt.de/de/reiseundsicherheit/10-2-8reisewarnungen&lt;/p&gt;
&lt;p&gt;Quite a few small and medium-scale companies have extensive overseas engagements,
especially if we go into high-value products. Companies like Cargill, Nestlé, Siemens,
or BMW basically cover the globe. Problems with shady regions or engagement
in shady activity might show up on Wikipedia; the info here can help to contextualize.&lt;/p&gt;
&lt;p&gt;Finally, you can query a newspaper archive. My recommendation is to look into
the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Financial Times, https://ft.com&lt;/li&gt;
&lt;li&gt;NY Times, https://nyt.com&lt;/li&gt;
&lt;li&gt;South China Morning Post, https://scmp.com&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All of these should be obtainable in a good library.&lt;/p&gt;
&lt;h2&gt;Backgrounder: How Kickbacks Work (PEP)&lt;/h2&gt;
&lt;p&gt;You might have had to fill out some forms about your relationship to politically exposed
persons (PEPs) in recent years. The regulations in question try to reduce corruption.
They specifically target scenarios like the following fictional case of bribery.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Your wife is a high-ranking member of the government with executive powers.
Let&amp;rsquo;s say, she&amp;rsquo;s the mayor of a city with 5 million people.&lt;/p&gt;
&lt;p&gt;You&amp;rsquo;re a member of the &amp;ldquo;Club of the Nouveau Riche,&amp;rdquo; which is a registered charity.&lt;/p&gt;
&lt;p&gt;Now ACME Inc pays 5 million EUR to the Club, which gets transferred to you to pay
for your consulting services, or let&amp;rsquo;s say the Club funnels this money to you
in the form of a very nice car.&lt;/p&gt;
&lt;p&gt;In exchange for this very nice gift, your wife buys the latest ACME Inc. bubble
cannons for the city cops.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The bribery here is clear: ACME funnels the money via a registered charity, and this
way they bribe &amp;ldquo;your wife&amp;rdquo; to buy their stuff with taxpayer money.&lt;/p&gt;
&lt;p&gt;The regulation in question does not forbid transferring money to PEPs.
It mandates that ACME Inc. and its compliance department make a strong
effort to prevent this kind of bribery from happening.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s the reason you had to fill out this form. These requirements flow from the Financial Action Task Force (FATF).&lt;/p&gt;
&lt;h2&gt;The Final Countdown&lt;/h2&gt;
&lt;p&gt;I don&amp;rsquo;t think this is super involved. You check Wikipedia and poke the register.
Then make sure that you know which regulator is applicable, and you&amp;rsquo;re mostly
good, IMO.&lt;/p&gt;
&lt;p&gt;I cover some basics in this article. There are quite a few unwritten chapters:
Proliferation risks and financial crime, among others. And a lot of ways to dig
deeper.&lt;/p&gt;
&lt;p&gt;I think this has helped me in the past, even when I worked in an FTE capacity.&lt;/p&gt;
&lt;p&gt;Happy hacking.&lt;/p&gt;</content></entry><entry><title>Infra Crib Sheet</title><link href="https://meyer.engineering/posts/2025-12-28-infra-crib.html" rel="alternate"/><published>2025-12-28T00:00:00+01:00</published><updated>2025-12-28T00:00:00+01:00</updated><author><name>Mark Meyer</name></author><id>tag:meyer.engineering,2025-12-28:/posts/2025-12-28-infra-crib.html</id><summary type="html">This crib sheet is directed at people who manage some kind of infra footprint. The idea originated from a conversation about how someone would expect a "senior engineer" to map out some kind of forward-pointing plan for the aforementioned infra footprint.</summary><content type="html">&lt;p&gt;This crib sheet is directed at people who manage some kind of infra footprint. The idea originated from a conversation about how someone would expect a &amp;ldquo;senior engineer&amp;rdquo; to map out some kind of forward-pointing plan for the aforementioned infra footprint. &lt;/p&gt;
&lt;p&gt;If one of these points is below average for your purpose, you need to change it. If it&amp;rsquo;s fit for purpose, drill down into the more technical topics. These are what I consider &amp;ldquo;strategic heights&amp;rdquo; that let you map out where the action will be happening and where you stand. Changing these factors takes more and longer-term effort across the organization than just accelerating a short step. And the change close to these central issues is mostly planned and effected by senior personnel.&lt;/p&gt;
&lt;p&gt;Your crib sheet can be on a piece of paper, a wiki, or a chalkboard. Since all members of IT infra teams perform in the capacity of an engineer, I see a strong value add in creating visibility into these topics for the entire team. And I see special value in making these factors explicit and part of the onboarding of new team members.&lt;/p&gt;
&lt;p&gt;I spent 15 years in service companies and have seen plenty of hair-on-fire teams, lots of bad management, and subpar engineering. None of these things in themselves prevents you from trying to move a platform forward in a systematic and meaningful way.&lt;/p&gt;
&lt;p&gt;Even if you&amp;rsquo;re in a project that devalues your skill and rejects proper planning and technical discussions (this happens constantly), having a page (or spread) in a personal notebook that lists your priorities won&amp;rsquo;t hurt.&lt;/p&gt;
&lt;p&gt;This crib sheet has, in typical Meyer style, an informal form (&amp;ldquo;hemdsärmelig&amp;rdquo;, i.e., with rolled-up sleeves), because I trust that you&amp;rsquo;re a competent professional and a first-class engineer. This is only a collection of suggestions, less formal than a checklist. The way you move it forward, while maintaining fit-for-purpose, is the mark of a good tech/infra/platform strategy and defines the style of a good engineer. And there are always different styles and strategies possible.&lt;/p&gt;
&lt;h2&gt;Workload&lt;/h2&gt;
&lt;p&gt;A workload is something external stakeholders (&amp;ldquo;the business&amp;rdquo;, &amp;ldquo;our customers&amp;rdquo;) take an interest in. Etcd is not a workload unless you vend etcd, like in etcd-as-a-service, to an external party.&lt;/p&gt;
&lt;p&gt;To describe the workload, use the following scheme: Name of the service (&amp;ldquo;sandra&amp;rdquo;), tech behind it (&amp;ldquo;juice vending machine&amp;rdquo;), external stakeholders (&amp;ldquo;employees who need a refreshment&amp;rdquo;).&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s not necessary to fully note down the service level expectation for this workload, but it&amp;rsquo;s important that people understand whether they&amp;rsquo;re working on the safety mechanism of a nuclear power plant or a juice vending machine. Set the stage or give context, as they say.&lt;/p&gt;
&lt;p&gt;Examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We have two runner setups that process jobs from GitLab. One of these runs on the K8s cluster and builds containers for the developer, for both release and stage. The other one runs on EC2 (compute service) to build a raw Debian disk image for our integration test.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We run a Grafana service, used by our team. This service is made available to other people inside the company, but we made it clear that this is a courtesy only.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We run an IoT platform for electricity grid monitoring. Availability and correctness of the data are governed by our commitment to our regulator.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
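&lt;p&gt;If the crib sheet lives in a repo rather than on paper, the scheme above could be sketched as a small record type. This is only an illustration: the &lt;code&gt;Workload&lt;/code&gt; class, its field names (in particular &lt;code&gt;commitment&lt;/code&gt;), and the sample entries are assumptions made for this sketch, not a prescribed format.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """One crib-sheet entry: what runs, on what, and for whom."""
    name: str                          # name of the service
    tech: str                          # tech behind it
    stakeholders: tuple[str, ...]      # external parties who take an interest
    commitment: str = "unspecified"    # hypothetical field: rough service expectation

    def summary(self) -> str:
        who = ", ".join(self.stakeholders)
        return f"{self.name}: {self.tech}, for {who} ({self.commitment})"


# Sample entries loosely based on the examples in the text; details are made up.
WORKLOADS = [
    Workload("sandra", "juice vending machine", ("employees who need a refreshment",)),
    Workload("grafana", "Grafana service", ("our team", "other people inside the company"),
             commitment="courtesy only"),
    Workload("grid-monitoring", "IoT platform", ("regulator",),
             commitment="availability and correctness governed by the regulator"),
]

if __name__ == "__main__":
    for workload in WORKLOADS:
        print(workload.summary())
```

&lt;p&gt;The point is not the class itself but that every entry is forced to name its stakeholders, which is what separates a workload from an internal component.&lt;/p&gt;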
&lt;p&gt;You might want to note down: &amp;ldquo;announce in ask-systems-engineering if we do maintenance&amp;rdquo; or &amp;ldquo;call the customer&amp;rsquo;s it director if we&amp;rsquo;re facing problems&amp;rdquo;. The latter one, many times, sits in a dedicated runbook in the o11y tooling.&lt;/p&gt;
&lt;h2&gt;External surface&lt;/h2&gt;
&lt;p&gt;The term &amp;ldquo;attack surface&amp;rdquo; is a staple in IT infra. This section plays on that idea, with a bit of a twist.&lt;/p&gt;
&lt;p&gt;Note down what parts of your infrastructure are exposed to the public or other external stakeholders. Examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;everything has a public IP, but we only allow port 22 for instance group xy and 443 for all app servers, as per the console here &amp;lt;insert link&amp;gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We have load balancers in a public subnet; these have ports 80 and 443 open &amp;lt;insert console link&amp;gt;. All other traffic has to move across the AWS control plane&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In addition, we have an open Kubernetes control plane &amp;lt;insert link&amp;gt;, which is secured with standard SigV4 access via the AWS control plane. People have to assume the following role &amp;lt;insert link&amp;gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We only serve and allow no egress in the prod application farm (&amp;ldquo;air-gapped&amp;rdquo;), and in stage/dev, we only allow egress via NAT gateway.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
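&lt;p&gt;One way to keep such notes honest is to write the intended exposure down as data and compare it with what is actually open. A minimal sketch, assuming hypothetical group names and a hand-fed &lt;code&gt;observed&lt;/code&gt; mapping (in practice that would come from a console export or a scan, which is not shown here):&lt;/p&gt;

```python
# Intended exposure per group; group names and ports are illustrative assumptions.
ALLOWED_PORTS = {
    "instance-group-xy": {22},
    "app-servers": {443},
    "load-balancers": {80, 443},
}


def unexpected_exposure(observed: dict[str, set[int]]) -> dict[str, set[int]]:
    """Return, per group, the open ports that the crib sheet does not list."""
    findings = {}
    for group, ports in observed.items():
        # Groups missing from the crib sheet are treated as "nothing allowed".
        extra = ports - ALLOWED_PORTS.get(group, set())
        if extra:
            findings[group] = extra
    return findings


if __name__ == "__main__":
    # Hypothetical observation: someone opened 8080 on the app servers.
    observed = {
        "instance-group-xy": {22},
        "app-servers": {443, 8080},
        "load-balancers": {80, 443},
    }
    print(unexpected_exposure(observed))
```

&lt;p&gt;Anything this reports is either a finding for security and product or a gap in the crib sheet; both are worth knowing.&lt;/p&gt;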
&lt;p&gt;Note that I care about infrastructure issues here. I don&amp;rsquo;t usually have to account for two services that sit on the same piece of infrastructure but have different failure modes, though this has happened. Think about the failure mode (&amp;ldquo;http server on server xy crashes&amp;rdquo;). If the failure modes for two services are identical, they should be listed only once. This follows from the idea of (common) infrastructure.&lt;/p&gt;
&lt;p&gt;External surface defines security, team/organizational interactions, and performance/cost parameters for the platform. Think of a NAT gateway case for AWS; this is frequently a performance bottleneck and cost driver.&lt;/p&gt;
&lt;p&gt;Security &amp;amp; safety-wise, this is the worst place for maverick decisions. Make sure that the external surface (e.g., open ports) is known or brought up with security and product. Some organizations maintain a separate risk management/tracking function; they need to be informed if your setup does not match good/best practice.&lt;/p&gt;
&lt;h2&gt;Platform access&lt;/h2&gt;
&lt;p&gt;This is about all access to the platform that is not customer access. It encompasses how people prove their identity (e.g., two factor, IDP), what kind of credentials they&amp;rsquo;re granted (e.g., long or short-lived, tied to an IP), and by what means they access the platform (e.g., in-band or out-of-band/control plane).&lt;/p&gt;
&lt;p&gt;If you get this wrong, everybody in the team and beyond will hate you. This requires a measured approach, but I have locked out people from doing weird shit before. And I don&amp;rsquo;t regret it.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Everything runs across SSH. If you fuck up the firewall rules, you&amp;rsquo;ll lock yourself and everybody else out of remote access&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We manage the tech for a privilege escalation process to give an external vendor additional permissions. The process can be triggered by product/support/NOC/service desk. Access is limited to account 123, and a full log trail will be generated. Access reverts after four hours.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Every developer has to go through the AWS API/control plane to access the cluster. In order for that to happen, they need to authenticate with the company IDP. The service desk can assign groups from this set &amp;lt;insert link&amp;gt;, and this will be mapped with prefix PLAT_ into an AWS role in the management account 456, having almost zero permissions (see &amp;lt;insert link&amp;gt;). From here, they assume elevated production roles. SSH access is blocked with a VPC-wide ACL (&amp;lt;insert link&amp;gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
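&lt;p&gt;The time-boxed escalation in the second example can be sketched in a few lines. This is a sketch under assumptions: the &lt;code&gt;Grant&lt;/code&gt; type, the &lt;code&gt;audit_line&lt;/code&gt; helper, and the principal name are hypothetical, and a real setup would lean on the identity provider or cloud IAM rather than on application code.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

GRANT_DURATION = timedelta(hours=4)  # "Access reverts after four hours"


@dataclass(frozen=True)
class Grant:
    """A temporary elevation for one principal on one account (hypothetical type)."""
    principal: str
    account: str
    granted_at: datetime

    def is_active(self, now: datetime) -> bool:
        # Active from the moment of granting until the duration has elapsed.
        return now >= self.granted_at and self.granted_at + GRANT_DURATION > now


def audit_line(grant: Grant, triggered_by: str) -> str:
    """Format one entry for the log trail (format is an assumption)."""
    expires = grant.granted_at + GRANT_DURATION
    return (
        f"{grant.granted_at.isoformat()} {triggered_by} granted {grant.principal} "
        f"access to account {grant.account} until {expires.isoformat()}"
    )


if __name__ == "__main__":
    start = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
    grant = Grant("external-vendor", "123", start)
    print(audit_line(grant, "service desk"))
    print(grant.is_active(start + timedelta(hours=3)))  # inside the window
    print(grant.is_active(start + timedelta(hours=4)))  # window has closed
```

&lt;p&gt;Expiry is derived from the grant time instead of being stored separately, so there is no second value that can drift or be edited by hand.&lt;/p&gt;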
&lt;p&gt;If your platform access encompasses pushing large blobs around, you&amp;rsquo;re doing it wrong. This is by definition a low traffic, high reliability/security control path, and a lot of good things have come from segregating this control path into a control plane. Nothing stops you from moving more in this direction, not even SSH.&lt;/p&gt;
&lt;h2&gt;Business Decisions&lt;/h2&gt;
&lt;p&gt;This is a specific point in business cadence. Every business has its own cadence, much like nature has seasons. If you want to see spring flowers, you need to pack a jacket to get out in the spring.&lt;/p&gt;
&lt;p&gt;Examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We have a honeycomb.io subscription for app farm X. It&amp;rsquo;s used by three feature teams, working on apps a, b, and c.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Project timeline and result expectations: we&amp;rsquo;re operating on an &amp;ldquo;it&amp;rsquo;s done when it&amp;rsquo;s done&amp;rdquo; basis, but want to start onboarding friendly testers in three months&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Yearly company returns will be mostly fixed in February, and we need to spec three new Dell servers by then, to make a budget request.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We&amp;rsquo;re about to hire three new dev teams, and we need to make sure that our infra onboarding process is in top shape when that happens&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Frequently, management does not share these things with engineering. In turn, management regularly discovers that engineering is not using the resources they&amp;rsquo;ve ordered.&lt;/p&gt;
&lt;p&gt;That said, nothing prevents you from quietly noting down that you use, e.g., honeycomb.io and later starting a dialogue with management if the three teams aren&amp;rsquo;t using it anymore. Of course, honeycomb.io is great, and they should be using it.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s one important thing about business decisions: If you ignore these, you might get stuck at the junior engineer role for a decade or more (I&amp;rsquo;ve seen this happen). F.L. Bauer, a German computer pioneer, suggested the following as the definition of software engineering:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;Establishment and use of sound engineering principles to economically obtain software that is reliable and works on real machines efficiently.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you don&amp;rsquo;t have a relationship to the &amp;ldquo;economic&amp;rdquo; and &amp;ldquo;efficiency&amp;rdquo; part of your work, you&amp;rsquo;re either not an engineer or a very junior engineer.&lt;/p&gt;
&lt;h2&gt;Operations (Scaling, Failure, Running Cost)&lt;/h2&gt;
&lt;p&gt;This is a wide and deep topic in a very short heading. For the matter at hand, just make sure you (and everybody else) know that there are requirements that deviate from the run-of-the-mill setup. These special cases fall broadly into two categories: a) ugly hacks for no good reason, and b) ugly hacks driven by the workload requirements.&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Ugly hacks driven by workload requirements&amp;rdquo; is a bit of hyperbole. Let me give you a longer, fictional example. We&amp;rsquo;ve a very important data set on our blob store. In order to make sure we can access this data in the foreseeable future, we use technical means (replicate it to a different region) and organizational means (segregation of duties; we&amp;rsquo;re only allowed to append, the security team is able to modify it). Of course, this requires special attention and a certain amount of overhead if you set up or change this mechanism. Therefore, you must be careful and measured when you establish these kinds of mechanisms. But no level of care on your side will suffice if nobody knows that this kind of mechanism exists. Take special care to note things that are non-obvious.&lt;/p&gt;
&lt;p&gt;Examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Everything is set up to autoscale our three K8s clusters, but if we receive a huge traffic spike, since all load balancer backends and all K8s nodes are colocated in the same small subnet, everything will just fall over&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Broken service X can lead to infra leaks (e.g., instances not being terminated), which in turn leads to very high infra bills&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In our strongly coupled microservice architecture, a failure in the database can lead to a cascading failure in all applications, which in turn triggers the cluster to continuously scale until the company runs out of money&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Technically we run service x on a persistent database, but it only stores 5 seconds&amp;rsquo; worth of data, and it&amp;rsquo;s just a support system, so if a disk breaks, just re-initialize the entire thing, no recovery needed&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
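&lt;p&gt;The first example above, autoscaling into a small subnet, comes down to simple address arithmetic. A rough sketch, assuming an AWS-style subnet where five addresses per subnet are reserved; the CIDR, the number of addresses in use, and the per-node address count are made-up numbers:&lt;/p&gt;

```python
import ipaddress

# Assumption: AWS reserves the first four addresses and the last one in every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr: str) -> int:
    """Number of addresses left for nodes, pods, and load balancers in one subnet."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def headroom(cidr: str, in_use: int, per_node: int) -> int:
    """How many more nodes fit, if each new node consumes `per_node` addresses."""
    free = usable_addresses(cidr) - in_use
    return max(free // per_node, 0)


if __name__ == "__main__":
    # Hypothetical numbers: a /26 with 40 addresses taken, 10 addresses per node.
    print(usable_addresses("10.0.0.0/26"))
    print(headroom("10.0.0.0/26", in_use=40, per_node=10))
```

&lt;p&gt;If the headroom is a single-digit number, the autoscaler configuration matters much less than the subnet layout, and that is exactly the kind of non-obvious fact that belongs on the crib sheet.&lt;/p&gt;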
&lt;p&gt;The three keywords in the heading each warrant having designated specialists on staff, and you can go ridiculously deep on each of them. Say, having a couple of mathematicians in a dedicated research arm that deals with advancing queuing theory, to make your systems more efficient. In other cases, scaling will be handled by the function &amp;ldquo;f(x) = 1&amp;rdquo;, which still affords you a good dose of resiliency. This is equivalent to having a system service autostart. I have also built services that scale to zero, based on workload. I&amp;rsquo;d always tell the customer/stakeholder that this is in place and they/we might want to remove it if it&amp;rsquo;s too annoying for them. A lot of solutions are possible, and it all depends on the wider context.&lt;/p&gt;
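&lt;p&gt;The two simple ends of that spectrum can be written down as plain functions from load to instance count. This is a sketch, not a recommendation: the function names, the queue-length input, and the parameters &lt;code&gt;jobs_per_instance&lt;/code&gt; and &lt;code&gt;ceiling&lt;/code&gt; are assumptions chosen for illustration.&lt;/p&gt;

```python
import math


def constant_one(queued_jobs: int) -> int:
    """f(x) = 1: always keep exactly one instance running, whatever the load."""
    return 1


def scale_to_zero(queued_jobs: int, jobs_per_instance: int = 10, ceiling: int = 5) -> int:
    """No work means no instances; otherwise enough instances for the queue, up to a ceiling."""
    if queued_jobs > 0:
        return min(math.ceil(queued_jobs / jobs_per_instance), ceiling)
    return 0


if __name__ == "__main__":
    for load in (0, 1, 25, 1000):
        print(load, constant_one(load), scale_to_zero(load))
```

&lt;p&gt;The ceiling is the part that protects the bill: without it, a runaway queue turns directly into a runaway instance count.&lt;/p&gt;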
&lt;h2&gt;Final Note&lt;/h2&gt;
&lt;p&gt;Note down everything in the infra you don&amp;rsquo;t understand or that doesn&amp;rsquo;t make sense, and continue working on these issues.&lt;/p&gt;
&lt;p&gt;This sounds a bit like a general janitorial task list, and in a way it is. This is why I say being a &amp;ldquo;senior engineer&amp;rdquo; is more like being Mary Poppins than Tony Stark.&lt;/p&gt;
&lt;p&gt;Usually, I make sure that I do a &amp;ldquo;walkthrough inspection&amp;rdquo; of the piece of infrastructure I work with. And I usually do this within the first two weeks of a posting. This has led to turning down a posting multiple times, because the mandate didn&amp;rsquo;t cover fixing egregious problems the platform had. The only thing I regret is that I haven&amp;rsquo;t turned down more postings and haven&amp;rsquo;t been more explicit and straightforward about my quality standards.&lt;/p&gt;
&lt;p&gt;In any case, it&amp;rsquo;s time to roll up your sleeves.&lt;/p&gt;</content></entry></feed>