# Introduction

Microservices are the latest technology trend and should be approached with deliberation. There is a ludicrous amount of material on the subject, which I’m not going to cover. The important part to understand is whether they are a fit for your organization.

To measure whether the architecture adds business value, it really comes down to how quickly features can be released to production and the stability and maintainability of the codebase. Whether microservices work also depends on the organization’s structure and maturity. Building separate components with independent deployment and loose coupling is certainly one way to do that, and it gives teams a sense of ownership that can help kickstart innovation and creativity. The tradeoff for that independence and potential innovation is that the complexity of building and quickly iterating on a large codebase gets pushed into QA and operations.

However, there are alternatives that offer the advantages of both monoliths and microservices.

# Business value - getting to and operating in production

The costs in an engineering organization fall into two major categories:

1. Build. The cost of iterating on or creating a new feature or product. Over long periods of time, it may become difficult to know whether one change will affect something else in the system, causing development to slow down.
2. Operate. The maintenance cost of these features and products in production. Debugging large systems is difficult, as there are many interlocking calls through classes or services. Reliability of a service in production is paramount.

The complexity of large projects is here to stay, and microservices add additional DevOps, QA, and operational cost. If an organization has built an unmaintainable monolith, what would change when moving to microservices? That being said, they can certainly help with decomposition, given the right discipline.

Microservices are like a gym membership. Just because someone has a gym membership doesn’t mean they will suddenly have the discipline to get in shape. Similarly, just because there’s an expensive network boundary to enforce modularization and a clean interface doesn’t mean clean code will magically get written. Classes in a language are very cheap to create and require no discipline, whereas microservices may make someone think twice because of the network boundary. This may help, but, then again, this strategy might have the same success rate as a New Year’s resolution.

In smaller projects, I’m not sure why anyone would bother with microservices, other than perhaps to convert performance hotspots.

## Costs with a monolith

Now, there are some natural problems in a large monolithic codebase:

1. too many developers trying to make changes causes long merge windows
2. build and test may take too long to run
3. more costly to scale as hotspots may be small compared to full application size
4. longer startup and initialization times

There’s a clear roadmap of engineering practices to get to continuous deployment. It may not be easy to get to the end of the road, but it is well-defined. Note that the QA and operations of the system have moved from a model of mean time between failures (MTBF) to mean time to recovery (MTTR). The engineering changes could involve:

1. Automated testing
2. Metrics and fault detection
3. Blue/green and canary deployments (potentially)
4. Trunk-based development (potentially)
5. Feature toggles (potentially)
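To make the feature-toggle item above concrete, here is a minimal sketch of a runtime flag check. The `FeatureFlags` module, the `:my_app` name, and the config shape are all illustrative assumptions, not something from the article; real projects often back this with a database or a flag service instead of application config.

```elixir
# Minimal feature-toggle sketch: flags live in application config.
# Module and config names are hypothetical.
defmodule FeatureFlags do
  @doc "Returns true when the named flag is enabled in config."
  def enabled?(flag) do
    :my_app
    |> Application.get_env(:feature_flags, %{})
    |> Map.get(flag, false)
  end
end

# Usage: guard unfinished code paths so trunk can always deploy, e.g.
#   if FeatureFlags.enabled?(:new_checkout), do: new_checkout(), else: old_checkout()
```

The point is that the toggle decouples merging code from releasing behavior, which is what makes trunk-based development and continuous deployment workable together.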

However, even if continuous deployment is achieved, it could still be difficult to have ownership and innovation. There could also remain problems with scaling, size, and organizational structure.

## Costs with microservices

Interestingly enough, the smaller codebases are easier to maintain, but the complexity has now moved into the interactions between all the different components. None of the monolithic requirements go away, though, as we still want continuous delivery on all our services.

But now there’s definitely some extra cost. It’s a lot of work. This list comes from both experience and endless conversations. There are many prerequisites, which we’ll go over below.

### Microservice anti-patterns

There are a thousand ways to die in the wild west. For instance, without well-defined service boundaries, moving from a monolith to microservices can leave you with a distributed monolith. There are also additional network costs and flakiness to account for. Because of this, I’m limiting the anti-patterns to those that either make it difficult for services to deploy independently or add a large maintenance cost in production.

• Tight coupling - Services need to be able to deploy independently. Tight coupling could be caused through shared libraries forcing an upgrade throughout the system. Or they could be coupled through a database schema where many services need to upgrade based on a schema change.
• Too fine-grained - this will just drive up latency. These aren’t classes, and treating them as such can cause production problems.
• JSON over REST - a problem if it leads to cascading chains of synchronous blocking calls. Libraries such as Hystrix and GRPC can help with this. Also, JSON requires serialization/deserialization, which can cause problems for chatty services.
• Distributed objects - if you’re constructing objects over the network, you’re going to have a bad time.
• Canonical data model - good luck with that.

### Operations Costs

Before we can even start with microservices, we need a platform to run them on. There’s a good chance that platform will be Kubernetes - an open-source container management system from Google. It makes some excellent trade-offs, moving the complexity into routing and discovery while keeping deployment and namespacing clean. Helm provides a great way to deploy systems onto a cluster. Of course, there also needs to be a container runtime, which will probably end up being Docker. I’m not a big fan, but it does the job.

There is a material cost in operating a cluster. Without going through a definitive list, the most recent bugs that I’ve witnessed are:

• Kubernetes is limited to 50 or 250 services per cluster on AWS depending on the setup.
• Docker doesn’t currently delete containers and requires a reboot before new ones can be created. This has to do with the migration to systemd and not checking available IP addresses.

What can I say - there’s work, but it’s still easier to maintain than OpenStack. Of course, Google Cloud provides managed Kubernetes clusters, which could certainly simplify matters for your organization. And there is a plethora of other choices: vamp, docker, gilliam, rancher, deis, pivotal, etc.

However, there is still the matter of debugging in production, and for that we need to be able to trace requests. The easy way is to add a correlation id as a header on every request and use centralized logging to trace it. If you want more detail, then Zipkin or the newer OpenTracing come to the fore.
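As a sketch of the correlation-id approach, here is what the header handling could look like as a Plug middleware in Elixir. This assumes the Plug library is a dependency; the module name and the `x-correlation-id` header name are my own choices for illustration.

```elixir
# Hedged sketch: reuse or mint a correlation id per request, so centralized
# logs can stitch one request's path across services.
defmodule CorrelationId do
  import Plug.Conn
  require Logger

  def init(opts), do: opts

  def call(conn, _opts) do
    # Reuse the caller's id if present, otherwise mint a new one,
    # so the id follows the request across service hops.
    id =
      case get_req_header(conn, "x-correlation-id") do
        [id | _] -> id
        [] -> Base.encode16(:crypto.strong_rand_bytes(8), case: :lower)
      end

    # Attach the id to logger metadata so every log line carries it,
    # and echo it back so downstream calls and clients can propagate it.
    Logger.metadata(correlation_id: id)
    put_resp_header(conn, "x-correlation-id", id)
  end
end
```

Each service forwards the same header on its outbound calls; grepping the centralized logs for one id then reconstructs the whole request path.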

### Communication fabric

Services need to communicate with each other. Networks fail often, and there needs to be a strategy for fault tolerance. How will communication happen? Is it going to be JSON over REST? Hystrix? GRPC? Linkerd? I haven’t kept up with the PAAS layers in the past months, but if you’re not using Kubernetes, you may also need to consider service discovery and routing. The communication layer could also be responsible for stopping cascading failures via circuit breakers, latency budgets, etc. There may also be a message bus involved; if so, it should probably remain as dumb as possible, lest it turn into an enterprise service bus (ESB).
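To show what the circuit-breaker part of that fabric amounts to, here is a deliberately simplified sketch as a GenServer. Real implementations (Hystrix, Erlang’s `:fuse`) add half-open states and timed recovery; the module name and threshold below are assumptions for illustration.

```elixir
# Minimal circuit breaker: after @threshold consecutive failures the
# circuit "opens" and callers fail fast instead of piling up on a dead
# dependency. No half-open/recovery state - a sketch, not a library.
defmodule Breaker do
  use GenServer

  @threshold 5   # consecutive failures before the circuit opens

  def start_link(_), do: GenServer.start_link(__MODULE__, 0, name: __MODULE__)

  def call(fun) do
    case GenServer.call(__MODULE__, :state) do
      :open ->
        {:error, :circuit_open}   # fail fast, stop the cascade

      :closed ->
        try do
          result = fun.()
          GenServer.cast(__MODULE__, :success)
          {:ok, result}
        rescue
          e ->
            GenServer.cast(__MODULE__, :failure)
            {:error, e}
        end
    end
  end

  @impl true
  def init(failures), do: {:ok, failures}

  @impl true
  def handle_call(:state, _from, failures) do
    state = if failures >= @threshold, do: :open, else: :closed
    {:reply, state, failures}
  end

  @impl true
  def handle_cast(:success, _failures), do: {:noreply, 0}
  def handle_cast(:failure, failures), do: {:noreply, failures + 1}
end
```

Wrapping every cross-service call in `Breaker.call/1` is what turns a slow, failing dependency into an immediate, handleable error instead of a chain of timeouts.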

### Eventual consistency patterns

So let’s say we drew up all the service boundaries, and service A has data service B needs. What then? If service B has direct access to service A’s database, then those are tightly coupled and should probably be one service. If service B keeps a copy of service A’s data, then we have an eventual consistency problem. If service B queries service A every request, then this could be a hard dependency with cascading failure.

Assuming in a microservices architecture that we want service B to keep a copy of the data, this becomes an eventual consistency problem. I know of three major ways to deal with who owns resolving the consistency:

• First party responsibility - distributed transactions. Service A writes the data to its datastore, followed by ensuring that the writes occur on all other systems that desire the data.
• Third party responsibility - event sourcing. Service A sends the event across a bus, all services that require the data subscribe to the event. For reconciliation, this may also require a 3rd party listener that records all of the events. If someone misses an event, this 3rd party owns the data and can replay them. This also could lead us to CQRS and materialized views. Note that sending data over a bus could also be problematic due to delivery and ordering issues. A message could be delivered multiple times or arrive out of order. Would need to implement CRDT’s or some other resolution mechanism on all the consumers.
• Second party responsibility - the owner of the data keeps a transaction log. The owner could send an event across a bus if necessary, but the responsibility belongs to the consumer. Let’s say the consumer decides to poll the source system. If there are 100 events but only 5 ids changed during that time (each having 20 changes), on poll the source would list those 5 ids. The consumer can then fetch those ids at its convenience. This avoids any race conditions or out-of-order message problems. Events can still be sent along a bus if the SLA for consistency demands it.
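The second-party pattern above can be sketched in a few lines. `SourceClient` and its functions are hypothetical stand-ins for whatever client talks to the owning service’s changed-ids endpoint; the cursor could be a timestamp or a log offset.

```elixir
# Sketch of the consumer-driven sync: poll the owner for ids changed since
# our cursor, then fetch current state per id. Because we read the latest
# state rather than replaying events, duplicates and reordering don't matter.
defmodule Consumer do
  def sync(cursor) do
    # e.g. GET /changes?since=cursor -> %{ids: [...], cursor: ...}
    %{ids: ids, cursor: next_cursor} = SourceClient.changed_ids(since: cursor)

    Enum.each(ids, fn id ->
      record = SourceClient.fetch(id)   # current truth from the owner
      upsert_local_copy(record)
    end)

    next_cursor   # persist this and pass it to the next poll
  end

  defp upsert_local_copy(_record), do: :ok   # write to this service's store
end
```

The design choice worth noting: the 100 intermediate events collapse into 5 fetches of final state, which is exactly why this sidesteps ordering and duplicate-delivery problems.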

### Versioning

Eventually, we need to deploy. How do we ensure that service A doesn’t break anything else on deploy? We must have strict contracts for any service going out, or we must maintain full backwards compatibility - whichever is easier. Note that how versioning is handled depends on the communication fabric. For REST, there are a few ways to version. Events in an event-based system also need to be versioned; however, the publisher may never know whether anyone is still consuming the events, so it may never be able to deprecate them. Fun times.

### QA Costs

QA becomes a complex problem, as there may be no particular set of services that can be “validated”. How do multiple teams develop against each other? They could deploy their branches to a namespace, or perhaps use feature flags that are enabled at runtime for certain tests. Mocking with wiremock is also a very good practice here. At a minimum, we need some extra diligence over a monolith:

• Functional tests with mocks of dependent services - it may not be possible to set up a full environment with all the data required. It is nice to have an authoritative mock of a service, and wiremock makes the bar for setting up a mock quite low.
• Contract tests - if service A calls service B, A could write contract tests (partially the functional tests) into B’s repository.

It is very difficult to have effective end-to-end tests on long paths. Let’s say a request goes from A -> B -> … -> F -> G. End-to-end tests can feed an input at A and look for an output at G. This detects an error, but doesn’t help us find where the error is.

At least from what I’ve seen, due to the complexity, we don’t have much choice: we have to go with MTTR over MTBF and count on our contracts and versioning if we want to deploy at a reasonable pace.

### Deployment

Finally, the feature makes it out to production. But wait - it’s not quite ready yet. As with a complex monolith, to detect failures we could still use:

• Canary tests - the ability to put a deploy into production and run it against real data, perhaps for only a set of users/customers.
• Journey tests - tracing major workflows through the system. Ensures major functionality doesn’t break. This goes step by step and isolates where a problem may have occurred.
• Monitoring for exceptions and other faults

Now, it’s not strictly necessary to have this level of coverage, but it certainly helps.

### Security

There are also a variety of novel security issues. If we’re using containers, then we must have a way to update them en masse. For instance, during the Heartbleed vulnerability, the base images would have needed to be rebuilt with the patched libraries and pushed out to production. Compare this with virtual machines, where the OS could have been updated automatically.

Another issue is authorization between service endpoints. If every service can call any endpoint on any other service, what is to prevent server-side request forgery? If a user can cause a forged request to be sent, it could wreak havoc. So we’ll need to implement some sort of authorization mechanism, whether it be JWT tokens with service users or OAuth2 tokens via the client credentials flow. This type of problem doesn’t occur in a monolith, where any user is already authorized and private interfaces aren’t exposed as readily as HTTP endpoints.

And, of course, we need transport encryption on all the services. This could mean we have to maintain a private certificate authority to issue new certificates as novel services get added to the cluster.

## Costs with a microlith/monoservices - Elixir - the middle ground

In breaking a monolith out into microservices, “microlith” was a term coined by a co-worker, which I promptly decided to steal. Quite frankly, I’m not sure whether this architecture should be called monoservices or a microlith. A microlith sounds more like a distributed monolith, but monoservices just doesn’t sound as cool.

Anyway, before we took a deep-dive into microservices, we had two goals to drive business value:

1. Build. Ability to iterate features quickly and deploy to production over long periods of time.
2. Operate. Run in production reliably.

### So what is Elixir?

Firstly, Elixir is a magical potion that will solve all of the problems you may or may not have.

It is also a language built on the Erlang VM, which is known for building low-latency, highly concurrent, distributed, and fault-tolerant systems. It is compatible with existing Erlang libraries, and its ecosystem has grown a great deal in the past year. The language was designed for building reliable systems, and many consider it a functional language.

And although it would be nice if there were a silver bullet, there isn’t one. Elixir provides tooling and a programming paradigm that make it easier to solve difficult problems of reliability. It also includes an entire framework, the OTP platform, which provides basic boilerplate for reliable systems. The problems, however, still need to be solved.

#### Technical aside - robustness principles - Microservices+PAAS vs Erlang/Elixir

José Valim has written an excellent article outlining some of the things that come for free with Erlang/Elixir. To summarize, an application in Elixir is the equivalent of a fully independent microservice. I’m going to take the comparison in a slightly different direction.

The Erlang ecosystem provides these primitives by default for building reliable systems1:

1. Shared-nothing architecture with processes that have their own state and mailbox and are strongly isolated
2. Message passing that is designed to:
   1. be atomic
   2. arrive ordered between two communicating processes2
   3. contain only constants
3. Supervision trees that handle restarts and monitoring
4. Fault detection and reporting
5. Local vs remote clustering transparency and message routing
6. Hot code upgrades
7. Reliable storage3

For instance, the most basic tenet of scaling is a shared-nothing architecture. The same principle scales from multi-core CPUs, since flops per core have been relatively stable, to web services with massive concurrency. This philosophy is baked into the language itself: the fundamental unit is the extremely lightweight process. As a consequence of sharing nothing, all communication must happen via message passing.
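The primitives above fit in a few lines of Elixir. This sketch shows an isolated process holding private state, receiving constant-only messages through its mailbox, and replying by sending a message back; the `Counter` module is mine for illustration.

```elixir
# Shared-nothing in miniature: the count lives only inside the process,
# and the only way in or out is a message.
defmodule Counter do
  def loop(count) do
    receive do
      {:add, n} ->
        loop(count + n)                 # state is private to this process

      {:get, caller} ->
        send(caller, {:count, count})   # reply via message passing
        loop(count)
    end
  end
end

pid = spawn(Counter, :loop, [0])   # a lightweight, strongly isolated process
send(pid, {:add, 2})
send(pid, {:add, 3})
send(pid, {:get, self()})

receive do
  {:count, n} -> IO.puts("count is #{n}")   # prints "count is 5"
end
```

Because messages between two processes arrive in order, the two `:add` messages are applied before the `:get`, without any locks.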

##### Table - comparison of basic principles PAAS vs Erlang
| What | Microservices+PAAS | Erlang VM |
| --- | --- | --- |
| Shared-nothing | Possible | Default |
| Communication fabric | Build it | Default |
| Supervision | Yes | Default |
| Fault detection | No | Default |
| Location transparency/Routing | Usually | Default |
| Hot code upgrades | Probably no | Default |
| Reliable storage | Sort of | Sort of |

Note that although the communication fabric exists, the engineering for reliability still must be done. Even with Elixir, there are plenty of ways to shoot yourself in the foot; circuit breakers, etc. are still needed. Fault detection is an interesting one: with Elixir, if your database goes down, you can inform all the processes that something happened.

#### Technical aside - Let it crash - an example of these principles

A ton has been written on “let it crash”, but I’m going to try to summarize.

By having supervision trees and fault detection, we can concentrate on writing happy path code and not worry about transient failures.

I’ll go with a contrived example of a chat service. Let’s say the user communicates with the service over a websocket. In a lightweight process, we spawn a webserver just for them, with its own private state. Now, if we need to load some data for the user from a cache, we can insert it directly into the state of their webserver, where it stays available for all future interactions. Quite frankly, this entire paragraph may seem absurd to the uninitiated4.
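A per-user process with private state looks something like this sketch. `UserServer`, `load_profile/1`, and the state shape are illustrative; in a real chat service this would be the websocket handler process itself.

```elixir
# One GenServer per user: the user's data is loaded once into process
# state and served from memory on every later interaction.
defmodule UserServer do
  use GenServer

  def start_link(user_id),
    do: GenServer.start_link(__MODULE__, user_id)

  def profile(pid), do: GenServer.call(pid, :profile)

  @impl true
  def init(user_id) do
    # Load once (e.g. from a cache); it now lives in this process's
    # private state, no shared memory involved.
    {:ok, %{user_id: user_id, profile: load_profile(user_id)}}
  end

  @impl true
  def handle_call(:profile, _from, state),
    do: {:reply, state.profile, state}

  defp load_profile(user_id), do: %{id: user_id}   # stand-in for a cache read
end
```

Spawning one of these per connected user is cheap enough on the Erlang VM that it is the normal design, not an extravagance.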

Now imagine that the service needs to talk to a message broker to pass on messages to other users.

So in bad ascii art form, we have:

```
Web browser -> Erlang VM -> User's Webserver -> User's message broker process -> External message broker
```


Let’s say the external message broker crashes for some reason. We can handle the error in a few ways; for instance, we can have the message broker process do retries. But we can also crash the user’s webserver, propagate the error back to the web browser, and have the user retry and reconnect the websocket. Since the user is the owner of the data, we don’t need to build error correction into the entire path.
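The crash-and-propagate choice above is expressed as a supervision strategy. In this sketch, `UserSession`, `UserSocket`, and `UserBroker` are hypothetical modules standing in for the per-user webserver and broker processes; the strategy is the real point.

```elixir
# One supervisor per user session. With :one_for_all, if the broker
# process keeps crashing, the socket process is taken down too - the
# browser sees the disconnect and retries, giving everything fresh state.
defmodule UserSession do
  use Supervisor

  def start_link(user_id),
    do: Supervisor.start_link(__MODULE__, user_id)

  @impl true
  def init(user_id) do
    children = [
      {UserSocket, user_id},   # the user's private webserver process
      {UserBroker, user_id}    # talks to the external message broker
    ]

    # After max_restarts failures within the default window, the
    # supervisor itself exits and the error propagates upward.
    Supervisor.init(children, strategy: :one_for_all, max_restarts: 3)
  end
end
```

The happy-path code in `UserBroker` never handles broker outages explicitly; the supervisor and the browser’s reconnect loop are the error handling.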

I can’t leave this section without mentioning hot code upgrades. They mean we can upgrade any running code without taking the system down. There’s no need for connection draining or blue/green deployments.

### Comparison with microservices

Back to the main point: microservices may reduce the cost of building a new service, but they may also increase the cost of QA and operations. We are concerned with optimizing both.

Now, one of the reasons to move from a monolithic architecture to microservices is to have clear boundaries and avoid a spaghetti mess over time.5 In defining these boundaries, we must establish contracts between different parts of the system and pay an additional network cost. And although each service should be able to deploy and scale independently, to get to a point where we can safely deploy any microservice into production, we made a giant list of work to execute, from debugging to security.

With Elixir, the closest equivalent of a microservice is an umbrella application running inside an Erlang VM cluster with a GenServer as its interface. A great example of this is the Acme Bank application. This greatly simplifies our testing, as we have direct access to all the services at a single commit! We can test the entire system, or we can test an individual application with its dependent services. Yes, that was worth an exclamation mark. We can also deploy the entire system as a package or each application independently, yet we have certainty about the internal interfaces without having to maintain versions and contracts. The key is that we can update the system, a collection of services, to a single state at one point in time without having to worry about deployment independence.

Ironically, we don’t really need to talk about microservices or applications at all. What we wanted was a clean interface with supervision and restarts. We could build these “services” inside a single application with multiple GenServers as interfaces, and later move a supervision tree into its own application/service. At the end of the day, it comes down to the supervision tree: all communication could occur on a single node without the network costs, and scale out later. The point is that we have an interface that separates the unit of development from the unit of deployment.
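A “service” behind a GenServer interface can be sketched as below. The `Accounts` module and its functions are made-up examples; what matters is that callers only see the public functions, so whether the process runs on this node or another node in the cluster is a deployment detail, not an API change.

```elixir
# The service boundary is the module's public functions; the GenServer
# behind them is the implementation. Moving it to its own application
# later does not change the callers.
defmodule Accounts do
  use GenServer

  # --- Public interface: this is the service contract ---

  def start_link(_),
    do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  def deposit(account_id, amount),
    do: GenServer.call(__MODULE__, {:deposit, account_id, amount})

  def balance(account_id),
    do: GenServer.call(__MODULE__, {:balance, account_id})

  # --- Implementation, hidden from callers ---

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:deposit, id, amount}, _from, state),
    do: {:reply, :ok, Map.update(state, id, amount, &(&1 + amount))}

  def handle_call({:balance, id}, _from, state),
    do: {:reply, Map.get(state, id, 0), state}
end
```

Swapping the `name:` registration for a cluster-wide registry is how the same interface later becomes reachable from other nodes without touching the callers.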

#### Table - comparison of costs for a monolith for microservices and elixir

To summarize, we get the benefits of microservices without many of the additional costs. Naturally, for very large projects the overhead increases, but for small to medium projects there are quite a few wins. You’ll see a lot of “it depends” in the last column because the lines are very fuzzy depending on org size, team size, time zones, etc. The interesting part is that with Elixir umbrella apps you can go either way, as the unit of development is different from the unit of deployment.

| Extra work? | Monolith | Microservices + PAAS | Elixir umbrella apps |
| --- | --- | --- | --- |
| Automated testing | Yes | Yes | Yes |
| Metrics and fault detection | Yes | Yes | Yes |
| Canary and Journey tests | Yes | Yes | Yes |
| Monolith problems (shared library, scaling) | Yes | No | Maybe |
| Anti-patterns for distributed systems | No | Yes | Maybe |
| Eventual consistency | No | Maybe | Maybe |
| Blue/green deployments | SLA | SLA | No, hot-code updates |
| Operations - Hotspots | Yes | No | No |
| Operations - PAAS | No | Yes | No |
| Operations - Distributed tracing | No | Yes | No, use observer |
| Operations - Private CA | No | Maybe | No, certificate pin on cluster |
| Communication fabric | No | Yes | No |
| Internal API Versioning | No | Yes | No |
| QA Functional Tests with Service Mocks | No | Yes | Mostly no, use real app |
| QA Contract Tests | No | Yes | Mostly no, entire system tested at once |
| Security - containers | No | Yes | No |
| Security - oauth/application | No | Yes | No |

### Alternatives

The advantages of using Erlang/Elixir in a microlith/monoservices architecture are numerous. However, there are alternatives, such as squbs, built by PayPal, which appears to be something like the OTP framework. There is a great blog post explaining the organizational reasons for choosing Akka/JVM over Erlang. I’m not sure it has the same QA advantages as an Elixir mono-repo with umbrella applications.

GRPC is under active development and could help a great deal with the communication fabric for arbitrary languages; however, there are still many costs in operating the rest of the system (PAAS, QA, etc.).

# Conclusion

There is a middle ground between monoliths and microservices. Whether starting out or trying to decompose a monolith, a microlith might be a good choice. The Elixir ecosystem provides a great starting point for building robust applications in a “microlith” or “monoservice” architecture by allowing clean client/server interfaces with supervision. Large microservices could end up as applications, while smaller ones become simple GenServers. However, if your problem domain involves massive computations or other domain-specific problems, perhaps the squbs ecosystem based on Akka is a better fit. Regardless, this isn’t a silver bullet: without good development practices, anything can become unmaintainable.

Setting aside the constraint of being a single language, this table shows that Elixir with umbrella apps (a monorepo) has numerous complexity and cost advantages in operations, QA, and security over both monoliths and containerized microservices in a PAAS. See the principles table to understand the differences between a PAAS and the Erlang VM. In fact, given the cost and complexity benefits, if it is possible to pick Erlang/Elixir, I’m not sure why anyone would bother with microservices.

A few years ago, even though I would use Erlang for my own projects, I could never recommend it to anyone unless it fit the problem domain exactly, because of the weak ecosystem. With the development and community of Elixir, if you’re thinking about microservices, maybe you should give Elixir a shot first.6

1. Joe Armstrong, one of the creators of Erlang, lists some points on how to create a reliable distributed system in his thesis. [return]
2. I’ve never used this feature myself, and I’m not sure whether it would hold true across multiple nodes, but there is a selective receive that is extremely handy. [return]
3. Doesn’t quite compare to the AP systems that we have today. [return]
4. A comparison with other solutions or languages that support high concurrency: nodejs uses epoll in a single event loop to process connections, but any CPU-bound code may block the scheduler. A better example may be nginx, a webserver that also uses epoll and sets up a number of workers/a thread pool to handle blocking work. With the cowboy webserver running on the Erlang VM, every connection that comes in has its own webserver code running. That is not a typo. This has huge implications for how web applications can be built from the ground up. [return]
5. There’s really no reason why boundaries can’t be well-defined in a monolith. I’m rolling with it. [return]
6. For the longest time, Erlang has been a sanctuary for me, so for personal reasons I have trouble sharing it. It’s like a special spot on an island, just cresting above the rocks at high tide, the shade of a willow tree sheltering against the southern sun, while the waves crash on the shore on the bluest of days. [return]