As a fully distributed team, we cannot organize ad-hoc meetings any time we want, so we have adopted an RFC (Request For Comments/Change) process in which the author writes a document stating the problem, possible solutions, evaluation of attempts, pitfalls/cons, and a conclusion. The RFC process has been quite popular - many companies and open source projects have used it to great success.
For the scheduler problem, it took three RFC documents and testing of the following approaches:
- using Redis as a lock service
- using Zookeeper, Consul, and jGroups as the basis of the leader election mechanism
- introducing a new service, which would be a centralized, distributed scheduler orchestrating jobs over RabbitMQ, similar to Google Cloud’s Scheduler service
The last idea was rejected for introducing too much complexity and making the rest of the system dependent on a single point of failure.
Other approaches were evaluated in the form of an internal Clojure library, which we called Leader - it provides a uniform interface for picking a cluster leader, and is backed by Consul, Zookeeper or jGroups (not all at once of course, backends are pluggable).
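The Leader library itself is written in Clojure, and its actual API isn't shown in this post. Purely as an illustration, the shape of such a pluggable leader-election interface can be sketched in Python - all names here are hypothetical, not Leader's real interface:

```python
from abc import ABC, abstractmethod
from typing import Optional


class LeaderBackend(ABC):
    """A pluggable leader-election backend (Consul, Zookeeper, jGroups, ...)."""

    @abstractmethod
    def acquire_leadership(self, node_id: str) -> bool:
        """Try to become the cluster leader; return True on success."""

    @abstractmethod
    def current_leader(self) -> Optional[str]:
        """Return the node id of the current leader, if any."""


class InMemoryBackend(LeaderBackend):
    """Single-process stand-in backend, useful for tests."""

    def __init__(self) -> None:
        self._leader: Optional[str] = None

    def acquire_leadership(self, node_id: str) -> bool:
        if self._leader is None:
            self._leader = node_id
        return self._leader == node_id

    def current_leader(self) -> Optional[str]:
        return self._leader


# Scheduler code talks only to the interface; the backend is swappable.
backend = InMemoryBackend()
assert backend.acquire_leadership("node-a")      # first candidate wins
assert not backend.acquire_leadership("node-b")  # later candidates lose
```

The point of the uniform interface is that scheduler code never needs to know which coordination system sits underneath.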
In the end, all approaches were documented, built, tested, and… we have rejected all of them.
Run less stuff
Anything that introduces a new piece of infrastructure should be evaluated in the strictest way possible. It’s not only about solving the problem at hand, but also about answering the following questions (and possibly more):
- who’s going to manage the new piece?
- what are the failure modes?
- how are we going to manage upgrades?
- is this suitable for our scale?
- what is the associated infrastructure cost?
- what are the security implications?
- is this a single-purpose tool, or can we use it for other purposes?
Once all of these questions were taken into account, it basically ruled out solutions based on Consul or Zookeeper (and similar systems). Our deployment setup is incredibly simple - we deploy containerized applications onto dedicated VMs fronted by internal load balancers (all managed with Terraform). Adding another stateful system to our deployment would have introduced yet another layer of complexity.
The last remaining choice was the Redis-based solution - but that didn’t feel right either. Our Redis usage is very light - we could potentially add one more workload without affecting the rest of the system. However, the most popular approach, Redlock, seems to have a lot of issues, and we weren’t comfortable dealing with potential problems caused by clock skew and the like.
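To make the trade-off concrete, here is a sketch of the single-instance Redis locking pattern that Redlock builds on - not our code, and using a tiny in-memory stand-in for Redis so the example is self-contained. The subtle part is the release: the holder must delete the key only if it still owns it, because the lock can expire (TTLs, clocks) while the holder believes it is still held:

```python
import time
import uuid


class FakeRedis:
    """Minimal in-memory stand-in for the two Redis operations we need."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key, value, ttl):
        """Like Redis `SET key value NX EX ttl`: set only if absent/expired."""
        entry = self._data.get(key)
        now = time.monotonic()
        if entry is not None and entry[1] > now:
            return False  # lock held and not yet expired
        self._data[key] = (value, now + ttl)
        return True

    def delete_if_equal(self, key, value):
        """Delete the key only if it still holds our token."""
        entry = self._data.get(key)
        if entry is not None and entry[0] == value:
            del self._data[key]
            return True
        return False


def acquire(redis, name, ttl=30):
    token = str(uuid.uuid4())  # unique per holder, proves ownership
    return token if redis.set_nx_ex(f"lock:{name}", token, ttl) else None


def release(redis, name, token):
    # Against real Redis, this check-and-delete must run as a Lua
    # script to be atomic; done naively it can delete another
    # holder's lock after ours has expired.
    return redis.delete_if_equal(f"lock:{name}", token)
```

Against real Redis the acquire step is `SET key token NX EX 30`. Redlock extends this scheme across several independent Redis nodes, and its safety then rests on timing assumptions - which is exactly where the well-known clock-skew criticisms come in.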
Back to the drawing board.
Lockjaw
This is where Lockjaw comes in - it’s a small library which exposes Postgres’ advisory locks as a pluggable Component.
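Lockjaw itself is Clojure, so as an illustration only, here is the Postgres primitive it builds on, sketched in Python: advisory locks are keyed by a signed 64-bit integer, so a job name is typically hashed into that key space and passed to `pg_try_advisory_lock`. The helper name below is hypothetical, not Lockjaw's API:

```python
import hashlib


def advisory_lock_key(name: str) -> int:
    """Map a job name onto Postgres' signed 64-bit advisory-lock key space."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    unsigned = int.from_bytes(digest[:8], "big")
    # Convert to the signed bigint range Postgres expects.
    return unsigned - 2**64 if unsigned >= 2**63 else unsigned


# Each instance runs the same query; exactly one session gets `true` back
# and becomes the job runner. A session-level advisory lock is released
# automatically when the connection drops - no TTLs or clocks involved.
TRY_LOCK_SQL = "SELECT pg_try_advisory_lock(%s);"
UNLOCK_SQL = "SELECT pg_advisory_unlock(%s);"

key = advisory_lock_key("nightly-report")
assert -(2**63) <= key < 2**63  # fits Postgres' bigint
```

The appeal over the rejected options is that the lock's lifetime is tied to a database connection we already run, rather than to a new coordination service or to wall-clock expiry.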