This is not a post about eventual consistency.
Well, it is. But not just about that.
It’s more about the delicate waltz between software complexity and user experience - which often lapses into a violent tango, or whatever Patrick Swayze was doing in Dirty Dancing. And it is about something I wish I had done better in my many refactor initiatives over the years.
A few years ago, a pitch deck landed in my inbox. It was for a fixed-income investment platform: very specific, very niche, very confusing. Nothing was built yet; it was an aspirational deck (though perspirational is probably a better word for it). One of the slides in it had a proposed architecture diagram for the platform that looked roughly like this:
Ok fine, I coaxed that one out of Dall-E (hence the typos). But it is directionally aligned with what I saw. There were about 7 sub-systems, each containing north of 10 microservices, I think I counted about 19 data-stores, and a dizzying array of arrows that presumably represented data flows or message exchanges. And I’m sitting there wondering:
Where do I begin?
What if there is an error? Where in the name of Donald Knuth would you even start looking?
What would happen if that went down? “That”: any random arrow up there.
Is there a wall large enough to contain the SPOG that tells you wtf is going on at any given time?
Most importantly, how do you ensure a smooth user experience with a back-end architecture that would make Rube Goldberg blush?
I am not getting into a monolith vs microservices polemic here. For one, enough has been written on this topic. More to the point, I actually don’t have strong convictions either way, for the simple reason that I have seen both approaches work beautifully in some situations, and bomb with a theatrical flourish in others. The best I can say on which architecture I prefer is It depends. Or more succinctly: ¯\_(ツ)_/¯
But what I do have adamant views on is user experience - and how it somehow starts taking a backseat as platforms get retooled for scale, a sacrificial lamb at the Altar of the Refactor Gods. And monolith → microservice refactors are most commonly deployed in service of scale.
SiaS
That’s not SaaS with a typo. That’s an actual acronym. It is now, anyway:
Software is a Service.
Software is a tool, a means to provide a service to some user or stakeholder. It is not the end. End users don’t see the wizardry of your architecture or the elegance of your code, and even if they could, they don’t care. If their experience sucks, your work sucks. And their experience is all that matters. Period.
Somewhere along the way of growth, we often tend to lose sight of that. This is particularly endemic in monolith refactors, inherently a very back-end heavy endeavor. An all-too-common narrative: startups grow, sometimes exponentially. Audiences grow, teams grow. To support all this hockey-stickage, the platform tries to scale, often awkwardly. Cracks start showing up in key flows. Inevitably, tech debt bat signals start flashing, fingers start pointing to “that monolith”, and a blueprint emerges to retool the platform for horizontal scalability by breaking it down.
This is all very good. I have been in these discussions many many times, often led them, and I have been consistently humbled by the engineering magic that happens in the room.
My main concern is: who else is in the room where it happens?
The room where it happens
Here’s a scenario.
The Buy Foo flow in your foo.com (or foo.ai probably, these days) has been keeling over a tad too often. That blocking call on the pricing API is bricking performance, the request backlog is piling up your Postgres connections, capping you at 500 concurrent foo purchases. Hardly web-scale.
So, a RefactorMoot is planned to make Buy Foo horizontally scalable. A legion of bright minds gathers in a room. A glance at the whiteboard in the room shows a dense graffiti of arrowheads and boxen and zigs representing module boundaries, and zags representing network boundaries, a bunch of cylinders for data stores. Something like this (or more):
You look around the room and see this collective, armed with coffee and Red Bull and sometimes stronger stuff, engaged in passionate debate, geekspeak flying around like verbal butterflies: “The foo purchase is massively I/O bound”, “Does it need to be strongly consistent?”, “Postgres is the bottleneck” (yes, it usually is), “Let’s split this out into a foo-payment service with a local projection”, “at least once delivery should be good” (shameless plug for this publication’s name).
Heady stuff. But then you look around the room again, and wonder: Where’s the front-end engineer who built the Buy Foo flow? The product designer who Figma-ed it? What about the product owner for the Storefront, was she looped in?
Most of all, is anyone here thinking of Janet over there, who’s waiting to buy some foo?
Almost all refactors for scale, microservices or not, involve the introduction of things like statelessness and async processing. Thread and network boundaries are introduced. Real-time is relaxed to “near-real-time”, a nice non-committal promise. Interactions are morphed from API calls to pub-sub event emissions. Eventual consistency is an almost-imperative for handling scale.
It is also, in my experience, one of the most difficult constructs to design good user experiences around. The back-end toolchain has gotten so rich and sophisticated, with concepts and frameworks like actor systems, Kafka, auto-scaling, serverless, distributed tracing, and so on, that the realization of this refactor is not conceptually hard in that layer of the stack (execution is another thing). But retooling the UX in the wake of a scale-driven refactor without violently violating POLA remains stubbornly, infuriatingly challenging. (Aside: I love the acronyms in our industry. SPOG, SPOF, POLA… someone recently introduced me to git lola.)
See, when Janet clicks that Complete foo purchase button, she expects to see a shiny foo in her pending orders on the next screen. Strong consistency promises that. Eventual consistency promises not that: it literally says it will eventually show up. I have done multiple Google searches for things like “designing for eventual consistency” or “UX for eventually consistent flows” and have been, er, consistently disappointed with the results.
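One pattern that sometimes helps here is to have the client remember what it just submitted and merge that into whatever the eventually consistent read model returns, so the order is visible right away. The sketch below is only an illustration under assumptions: `Order`, `merge_orders`, the status labels, and the order IDs are all hypothetical names, not anything from a real platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    order_id: str
    item: str
    status: str  # "processing" / "confirmed" are assumed labels


def merge_orders(server_orders, local_pending):
    """Return the list to render: the server's view plus any locally
    submitted orders the server has not caught up on yet."""
    known_ids = {o.order_id for o in server_orders}
    still_pending = [o for o in local_pending if o.order_id not in known_ids]
    # Local placeholders go first so the order just placed is visible immediately
    return still_pending + list(server_orders)


# Janet just bought a foo; the read model has not caught up yet
local = [Order("ord-2", "foo", "processing")]
stale_server_view = [Order("ord-1", "bar", "confirmed")]
print([o.order_id for o in merge_orders(stale_server_view, local)])
# prints ['ord-2', 'ord-1']

# Once the projection catches up, the server's copy wins and the placeholder drops out
fresh_server_view = [
    Order("ord-2", "foo", "confirmed"),
    Order("ord-1", "bar", "confirmed"),
]
print([o.status for o in merge_orders(fresh_server_view, local)])
# prints ['confirmed', 'confirmed']
```

The trade-off is that the client now shows something the back-end has not yet confirmed, so the placeholder needs an honest label such as "processing" rather than pretending the order is done.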
The point of all this is not to knock eventual consistency. I have nothing against it, it’s one of the many tools in the toolbox, a great one at that. My lament is the fact that Janet above is often an afterthought in this refactoring charrette. And by the time the ripple reaches her, it’s often too late. While the back-end hums along in its modularized emancipation, and Grafana hums along in the comforting greenspace of passing health checks, Janet proceeds to get a mild coronary on clicking that Complete Foo Purchase button, because she’s just been deposited on her My Orders page and where the hell is the foo she just bought? Or she is rightfully peeved with a confirmation page that displays something like Order ID 9a3dfcf9-1d1b-423c-8f43-96e25fee73a7 submitted, above the fold, and a Check status button she has to keep clicking. All because that’s all you had time to shoehorn in after the scale refactor made that flow eventually consistent, and you never did loop in the front-end engineer did you!
In other words, few people seem to be giving an indented hoot about changes to Janet’s new post-refactor foo-buying experience.
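As a small illustration of what a front-end engineer might have proposed instead of a manual Check status button, here is a sketch of polling with exponential backoff, so the page checks on Janet's behalf. Everything in it is an assumption for the sake of the example: `poll_until_settled`, its parameters, and the status strings are hypothetical, and a real client might prefer server push over polling.

```python
import time


def poll_until_settled(fetch_status, is_settled, max_attempts=6,
                       base_delay=0.5, sleep=time.sleep):
    """Call fetch_status() until is_settled(status) is true or attempts run out.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between tries.
    Returns (last_status, attempts_used).
    """
    status = None
    for attempt in range(1, max_attempts + 1):
        status = fetch_status()
        if is_settled(status):
            return status, attempt
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    return status, max_attempts


# Simulated read model that lags for two reads before catching up
responses = iter(["processing", "processing", "confirmed"])
delays = []
result = poll_until_settled(
    fetch_status=lambda: next(responses),
    is_settled=lambda s: s in {"confirmed", "failed"},
    sleep=delays.append,  # record the delays instead of actually sleeping
)
print(result, delays)
# prints ('confirmed', 3) [0.5, 1.0]
```

If the attempts run out, the caller still gets the last known status back, which is the moment to show a clear "still working on it" message rather than a bare order ID.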
I really need to stop saying foo soon, but bear with me a bit.
Concurrent engineering
So, what do we do?
I was planning to write about concurrent engineering in a later post, perhaps the only meaningful crossover from my grad school research into my professional career. But worth introducing here. It’s an engineering practice that emphasizes concurrent consideration of multiple aspects in the design of a product by giving voice to multiple perspectives at once.
Put simply: Get all the right people in the room!
Get the front-end devs, get the designers, get product, get QA. Treat the refactor like the new product build that it is, up and down the stack. Most importantly, start the discussion with Janet. How does this affect Janet’s foo-buying experience? What will change? What can we do to make it not suck for Janet?
Because remember, you’re doing this refactor for Janet. Not for anyone else.
The scene unfolding in the room looks different now. Channeling my movie geekdom, let’s try this as a 2-minute short. Any WGA format non-compliance is unintentional, and casting is left as a thought exercise for the reader.
There’s a movie I’d like to see. Wish I had done more of that.
Mercifully, this is the point where I stop saying foo.
Elegance, not perfection
At Yieldstreet, we had our own set of core values that I drafted for the Engineering org, additive to the company-level ones. One of those was this:
This summarizes my hard-earned lessons on too many refactor efforts over my time, especially that last bullet. There was a lot of heartburn over the years, across multiple startups. It had to be codified.
Aside: our values all had this “X, not Y” syntax, because why not. For those curious, this was the full set:
Post scriptum
My search for elegant UX paradigms for eventually consistent systems goes on. Would love to know what you have done, or come across, that addresses this gracefully.
It is also quite true that not all refactors, for scale or otherwise, have a direct impact on end human experience. Not all software even has a user interface - API plays for one (though I would argue that the idea holds true in some form there as well). Conversely, there may be other refactors, not involving eventual consistency, not related to scale, that may still have a UX impact.
The point is to start refactoring discussions with the consideration of end user experience impacts, even if to establish right up front that there aren’t any. And that requires getting all the right people in the room where it happens.
My proudest moments at Zapier came watching my engineers solve problems while holding the customer close. We talked about what we wanted to happen to our credit card, to our bank account. We shared frustrating experiences with other vendors as examples.
After building this team of detail-oriented, empathic folks, it was me and my Product & Design partners who got to play bad cop and make the case for the less than perfect experience. There are downsides to this mindset, but it's exactly who I'd want on a team building accounting, billing and fulfillment systems.
I'm thinking through a couple different examples of eventually consistent systems we implemented... we had the most success when we exposed the state machine in clear user-facing language. Failures might have looked like, in increasing severity: confusion, failure to deliver purchases, and incorrect charges. These all led to increased support load and decreased customer trust... and trust was our most important contribution to the company. When designing a CX, only lie if you're 100% sure you can get away with it!
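One possible reading of "exposing the state machine in clear user-facing language" is sketched below: every back-end state gets a plain sentence the customer can see, and illegal transitions are rejected. The states, transitions, and wording are all invented for illustration and are not taken from any system described above.

```python
from enum import Enum


class OrderState(Enum):
    SUBMITTED = "submitted"
    PAYMENT_PENDING = "payment_pending"
    CONFIRMED = "confirmed"
    FAILED = "failed"


# Allowed moves between states (hypothetical; a real flow would likely have more)
TRANSITIONS = {
    OrderState.SUBMITTED: {OrderState.PAYMENT_PENDING, OrderState.FAILED},
    OrderState.PAYMENT_PENDING: {OrderState.CONFIRMED, OrderState.FAILED},
    OrderState.CONFIRMED: set(),
    OrderState.FAILED: set(),
}

# One customer-readable sentence per state, so the UI never shows a raw status code
USER_COPY = {
    OrderState.SUBMITTED: "We've received your order and are processing it.",
    OrderState.PAYMENT_PENDING: "Your payment is being processed. This can take a minute.",
    OrderState.CONFIRMED: "Your order is confirmed.",
    OrderState.FAILED: "Your order did not go through. Please try again or contact support.",
}


def advance(current, new):
    """Move to a new state, refusing transitions the machine does not allow."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value} to {new.value}")
    return new


def describe(state):
    """Return the sentence the customer should see for this state."""
    return USER_COPY[state]


state = OrderState.SUBMITTED
state = advance(state, OrderState.PAYMENT_PENDING)
print(describe(state))
# prints Your payment is being processed. This can take a minute.
```

Keeping the copy next to the states makes it harder to add a new back-end state without someone also deciding what the customer will read.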