- 0.1 Introduction
- 0.2 Limitations
- 0.3 Goal
- 0.4 Requirements
- 0.4.1 Apps talk directly to a database at all times
- 0.4.2 A solid read and write permission and role system needs to be implemented
- 0.4.3 All data is contained in one database (aggregate) which is replicated between pods
- 0.4.4 Every pod instance copies user docs to the _users database when required
- 0.4.5 Users can never access the main aggregate
- 0.4.6 New users can only register at the write master
- 0.4.7 Aggregate database is partitioned by documents' access signatures
- 0.4.8 Access signature can also be a function of a doc's properties
- 0.4.9 Rean is the name of the set of workers behind couchdb
- 0.4.10 All docs written to the aggregate get validated by a validate doc update function (vdu)
- 0.4.11 All docs written may need to include the access signature of the old doc as a property
- 0.4.12 Rean sets up a database and one-way replication for every access signature
- 0.4.13 Documents with a changed access signature do not get replicated anywhere
- 0.4.14 Rean cleans up docs from partial databases when the access signature changes
- 0.4.15 Every document in the aggregate might be duplicated in one of the partial databases
- 0.4.16 Rean writes special docs with delete instructions
- 0.4.17 A node can fail and this does not affect the pod ring
- 0.4.18 A proxy may be necessary to enforce read restrictions on databases
- 0.4.19 A pod's public port is the proxy
- 0.4.20 The proxy is either something like bounce.js or haproxy
- 0.4.21 Every pod decides independently which pod is the pod master
- 0.4.22 Pods are themselves responsible for deleting docs from their partial databases
- 0.4.23 In a pod ring every pod maintains its own pod status document
- 0.4.24 No dns, or at least no hard dependency
- 0.4.25 No load balancing; instead clients find their own most efficient server
- 0.4.26 Every machine or vps and its pod is completely autonomous
- 0.4.27 Client is totally independent from the back end database and vice versa
- 0.4.28 A cluster, or pod ring, is self adjusting depending on load
- 0.4.29 A pod ring should be robust and be able to cope with all but one pod failing
- 0.4.30 A user can start a new pod ring with just the data accessible in another pod ring
- 0.4.31 Apps should leverage couchdb's replication and changes feed features
- 0.4.32 Every pod has minimal workers behind its database
- 0.4.33 The whole system is message based
- 0.4.34 No possibility of creating document conflicts
- 0.4.35 A proxy is used for basic read access control
- 0.4.36 No sessions or cookies
- 0.4.37 Every pod consists of adea and docker containers
- 0.4.38 A pod ring shares a pod ring configuration document
- 0.4.39 A pod ring can add, activate, deactivate and remove its members
- 0.5 Compromises/trade-offs
- 0.6 Work to be done!
- 0.7 Questions and research
- 0.7.1 How would this system go about implementing the following?
- 0.7.2 User names or emails as ids?
- 0.7.3 Can workers run in parallel?
- 0.7.4 Cache requests in memcache perhaps?
- 0.7.5 Multiple apps per pod ring?
- 0.7.6 Multiple groups of users for one app in a pod ring
- 0.7.7 Splitting of pod rings into two rings?
- 0.7.8 Where is the client app?
- 0.7.9 Deploying/updating software?
- 0.7.10 A/B testing?
- 0.7.11 What's the api for docker? From node? From clojure?
- 0.7.12 How to bootstrap a pod or pod ring?
- 0.7.13 Big files (video, images, sound) should not be replicated around
comments: true
published: true
title: Adea, an experiment in application back end infrastructure
tags: web internet rethink
publishedat: 27/12/2014
I believe firmly in decoupling front and back end, using messages for communication, self reliance for individual machines in a network, auto configuration for the network as a whole, 'backendless' infrastructure and depositing as much functionality as possible in the front end. This can be implemented in lots of different ways; this blog post discusses a sketch of just one, dictated by the limitations chosen.
--------------------------------
0.1 Introduction
As per the last blog post I want to try to sketch a system and infrastructure to serve as the back end for a (web) app. I want to challenge some commonly held ideas of how to organize such a thing. Since the legacy of a document based web architecture has left us with a server-client paradigm and therefore a functional application (the 'server') on the server side (as it has left us with endless dom meddling on the client side), we're going to reevaluate the use and function of this server application. To do this properly we're going to try to make do without it altogether. Removing the server from the server-client schism is one surefire way to remove the schism and therefore break the old paradigm. Some interesting things might follow from this.
The system is described by listing a fair number of ideas that, taken together, describe a possible working setup. Once the ideas crystallize I might (be able to) describe it all a bit more formally.
0.2 Limitations
Since this is an experiment I am setting some limitations, in particular on the technologies used. They are relatively novel, and also few in number:
- docker: For ready-made instances of the following three.
- serf: To coordinate servers in a multi-server setup
- A proxy server: To control access to a particular vps/machine and/or databases
- couchdb: Data has to be stored somewhere...
0.3 Goal
A working setup that needs minimal maintenance, auto scales, can be set up with minimal effort, and can be used for any app. Once this works, writing any web app should involve minimal changes to the back end, leaving the developer to concentrate wholly on the front end. With some libraries a lot of the boilerplate dealing with the back end should again be possible to abstract away. By using reactjs it is possible to abstract away the dom. Combining the two allows the app developer to concentrate on the actual functionality as much as possible.
0.4 Requirements
The following is a list of requirements and ideas for the setup. Some are essential, some can be compromised on. But all try to make use of the technologies as well as possible.
0.4.1 Apps talk directly to a database at all times
This leverages the http endpoint feature of couchdb. Couchdb also has a solid authentication and authorization system built in and has no problem dealing with a lot of connections at the same time (Erlang based). Furthermore it is good at replicating databases between instances and is able to provide a changes feed, thus enabling apps to respond directly to changes in a database.
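As a rough sketch of what this looks like from the client side (the database name and host below are purely illustrative), an app could read docs and follow couchdb's changes feed over plain http:

```javascript
// Minimal sketch: a browser app reading docs and following the changes feed
// directly over couchdb's http api. Database name and host are placeholders.
const base = 'https://pod1.example.com/p-editors';

async function fetchDocs() {
  // _all_docs with include_docs is a standard couchdb endpoint
  const res = await fetch(`${base}/_all_docs?include_docs=true`);
  return (await res.json()).rows.map(r => r.doc);
}

async function followChanges(onChange, since = 'now') {
  for (;;) {
    // long-polling variant of the changes feed; resumes from the last seq
    const res = await fetch(
      `${base}/_changes?feed=longpoll&include_docs=true&since=${since}`);
    const body = await res.json();
    body.results.forEach(onChange);
    since = body.last_seq;
  }
}

followChanges(change => console.log('doc changed:', change.id));
```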
0.4.2 A solid read and write permission and role system needs to be implemented
This is a problem for couchdb since it is not very good at read validation, and only checks access at the database level. The usual solution is to separate the data into multiple databases based on read permissions. But we still want the data to replicate between instances.
0.4.3 All data is contained in one database (aggregate) which is replicated between pods
This is pull and push replication. However, data from the aggregate gets replicated only one way, into partial databases. The aggregate also contains a copy of every user doc.
0.4.4 Every pod instance copies user docs to the _users database when required
But this only needs to be done for active users. This way duplication can be minimized. A client app can send a message requesting access for a particular email address or user name, and rean can then copy the proper user doc to the _users database and set up the required partial databases as well. Something might be possible to set up so that updated user docs are replicated to the _users database, while users who are not active on a given pod are left out of its _users database.
0.4.5 Users can never access the main aggregate
Instead they access duplicates of this data from databases partitioned by access. This can be enforced when couchdb implements the validate doc read function, or when using rcouch. Otherwise this will have to be enforced by the proxy by disallowing any get on the aggregate database.
0.4.6 New users can only register at the write master
This also goes for the forgot password procedure. For new users to register, some hoops have to be jumped through, since we still don't want a back end server and the new user, or the client app, can only talk to the database directly by writing docs to a public database. The procedure is described in my github repo cape and a demo is at authpages.
0.4.7 Aggregate database is partitioned by documents' access signatures
Every doc is accessible by a unique set of users. This can be one user, all users, or a particular subset. Every doc can limit its accessibility by defining it in a property: a list of roles and individual users that are allowed to read the doc. Other properties decide on write and modify rights, enforced by couchdb's validate doc update. Since every doc has a unique 'access signature', the whole set of docs can be partitioned by access signature. For every partition a database can be set up and, using filtered replication, all docs will be replicated to one and only one database.
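For illustration only (the access, roles and users property names and the naming scheme are assumptions, not something this post prescribes), an access signature could simply be a canonical serialization of a doc's access object:

```javascript
// Sketch: derive a deterministic 'access signature' from a doc's access
// object, so every unique combination of roles and users maps to exactly
// one partial database name. Property names are hypothetical.
function accessSignature(doc) {
  const access = doc.access || { roles: [], users: [] };
  const roles = (access.roles || []).slice().sort();
  const users = (access.users || []).slice().sort();
  // couchdb database names only allow lowercase letters, digits and _$()+-/
  return ('p-' + roles.concat(users).join('-'))
    .toLowerCase()
    .replace(/[^a-z0-9_$()+/-]/g, '_');
}

// { access: { roles: ['editors'], users: ['alice@example.com'] } }
//   -> 'p-editors-alice_example_com', the name of its partial database
```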
0.4.8 Access signature can also be a function of a doc's properties
In this case it is not the parameters to a replication's filter that change (as read from a doc's access object) but the filter itself. All partial databases might then have to be torn down, recreated and the replications set up again. Or the change of access signature is recalculated for every doc so they get removed from databases where they no longer belong.
0.4.9 Rean is the name of the set of workers behind couchdb
These workers are either put into action by a particular change in the aggregate database or periodically inspect the state of couchdb themselves. They can for instance respond to messages sent by client apps. Rean will also initialize its couchdb instance. Rean runs in a docker container with a serf instance hooked up to the pod ring, if the pod is part of one. Rean will look for its configuration in the couchdb instance if it's not the first pod in the ring to be started up. Otherwise it will create the aggregate and save its config in it, to be replicated to any other pods. Needs some pondering still about how to start this up properly with minimal effort...
0.4.10 All docs written to the aggregate get validated by a validate doc update function (vdu)
This function has access to the doc being written, the doc to be updated, the database's security object and the user's roles and id. Whether a doc gets written is thus a function of all of these parameters. The function cannot rewrite documents; it can however make sure that a document adheres to certain properties. For instance, it can require that the user who updates a doc is included in it as a property.
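Couchdb's validate_doc_update functions do receive exactly these four arguments; the specific checks below (a required updatedBy property, and the old access signature rule from 0.4.11) are illustrative assumptions:

```javascript
// Sketch of a validate_doc_update (vdu) function for the aggregate's design
// doc. newDoc, oldDoc, userCtx and secObj are couchdb's standard arguments.
// accessSignature is the helper sketched above, assumed to be inlined here.
function validate_doc_update(newDoc, oldDoc, userCtx, secObj) {
  if (userCtx.roles.indexOf('_admin') !== -1) return; // admins bypass checks

  // hypothetical rule: every doc must record who wrote it
  if (newDoc.updatedBy !== userCtx.name) {
    throw({ forbidden: 'updatedBy must equal the authenticated user' });
  }

  // hypothetical rule: when the access signature changes, the updated doc
  // must carry the signature of the doc it replaces (see 0.4.11)
  if (oldDoc && accessSignature(oldDoc) !== accessSignature(newDoc) &&
      newDoc.oldAccessSignature !== accessSignature(oldDoc)) {
    throw({ forbidden: 'oldAccessSignature is missing or incorrect' });
  }
}
```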
0.4.11 All docs written may need to include the access signature of the old doc as a property
This gets enforced by the aggregate's vdu. The vdu has access to the doc that's being updated and calculates its access signature, as well as the access signature of the updated doc. If the access signature of the updated doc is included, this can be validated by the vdu as well. The access signature of the old doc would be the name of the database where the doc was read from. When a client app reads the doc directly from a partial database it can include it in the updated doc as a property; if it doesn't know it, it can include the old doc as a separate property, or it can use the aggregate's update doc function. This update doc function can rewrite the doc and add the old access signature property, as well as the new one. If a doc does not change its signature on update, no signatures need to be added as properties to the updated doc.
Client apps might also be able to calculate a doc's access signature themselves using the access object in the doc (though this won't work if the signature is a function of the doc's properties, as in 0.4.8), or ask a pod to calculate the signature and add it itself.
0.4.12 Rean sets up a database and one-way replication for every access signature
The aggregate database has a design doc with a filter that takes a parameter. Every replication to every partial database uses the same filter but passes a different parameter, so that each partial database receives only the documents matching its access signature. If a doc has a property defining its access signature it is used (it has to be correct because of the vdu); otherwise the filter function calculates it itself.
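A hedged sketch of what that filter and a corresponding _replicator document could look like (the design doc, filter and database names are made up; the filter/query_params/_replicator mechanics themselves are standard couchdb):

```javascript
// Stored as filters.by_signature in the aggregate's _design/partition doc.
// req.query.sig is the access signature a particular replication wants.
// accessSignature is again assumed to be inlined in the design doc.
function by_signature(doc, req) {
  var sig = doc.accessSignature || accessSignature(doc);
  if (doc.oldAccessSignature && doc.oldAccessSignature !== sig) {
    return false; // changed signature: hold back until rean cleans up (0.4.13)
  }
  return sig === req.query.sig;
}

// Replication doc rean would write to the _replicator database, one per
// partial database (deletions are handled separately, see 0.4.16):
const replication = {
  _id: 'aggregate-to-p-editors',
  source: 'aggregate',
  target: 'p-editors',
  filter: 'partition/by_signature',
  query_params: { sig: 'p-editors' },
  continuous: true
};
```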
0.4.13 Documents with a changed access signature do not get replicated anywhere
They await intervention from rean. All replications (to partial databases and other pods) use a filter function that prevents documents such as this from being replicated. This mechanism prevents a doc from lingering in a partial database it no longer belongs in. The changes feed is not guaranteed to provide all changes, only the latest, so intermediate versions may be overlooked by rean, which means docs might not be removed from partial databases when they should be. A document cannot be updated until the old version is removed from the partial database it no longer belongs in.
0.4.14 Rean cleans up docs from partial databases when the access signature changes
Every doc with a changed access signature will get processed by rean. Rean deletes the doc in the partial database with the old access signature. After this it rewrites the doc, setting the old access signature property to the updated value. This will again pass the vdu, but now it will get replicated to the proper partial database. It will also get replicated to other pods, if any replications are set up.
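A rough sketch of that cleanup step, reusing the hypothetical accessSignature helper and property names from above and assuming a node version with a global fetch:

```javascript
// Sketch: rean reacting to a doc whose access signature has changed.
// 1) delete the stale copy from the old partial database,
// 2) rewrite the doc in the aggregate so the replication filter lets it
//    through again (0.4.12/0.4.13).
const couch = 'http://127.0.0.1:5984'; // local couchdb; credentials omitted

async function handleSignatureChange(doc) {
  const oldSig = doc.oldAccessSignature;
  const newSig = accessSignature(doc);

  // remove the stale copy from the partial database named after the old signature
  const res = await fetch(`${couch}/${oldSig}/${doc._id}`);
  if (res.ok) {
    const stale = await res.json();
    await fetch(`${couch}/${oldSig}/${doc._id}?rev=${stale._rev}`, { method: 'DELETE' });
  }

  // rewrite the aggregate doc: once oldAccessSignature matches the current
  // signature, the filtered replication picks the doc up again
  const updated = Object.assign({}, doc, { oldAccessSignature: newSig });
  await fetch(`${couch}/aggregate/${doc._id}`, {
    method: 'PUT',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(updated)
  });
}
```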
0.4.15 Every document in the aggregate might be duplicated in one of the partial databases
But only once: no two partial databases can contain the same document (by id). When there are no users logged in who have access to a particular partial database, it can be torn down by rean and rebuilt when a user requires it again. But this is optional.
0.4.16 Rean writes special docs with delete instructions
For every doc that moves to another partial database (because its access signature changed), rean writes a doc with instructions to delete a particular doc with a particular revision from a particular partial database. These docs get replicated to other pods, where the local reans can then carry out the instructions and remove these docs from their partial databases. These delete instruction docs have a sequential number. Every rean in every pod updates its status doc with the sequence number of the latest delete instruction doc carried out. The write master can periodically check these status docs and remove the delete instruction docs up to the lowest sequence number reported. These deletes will be replicated to other pods. Rean can choose to add delete instructions in bulk to one of these delete instruction docs.
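Purely as an illustration of the shape such a doc could take (all field names are assumptions):

```javascript
// Hypothetical delete instruction doc written by the write master's rean.
// Other pods' reans replay these against their own partial databases and
// record the highest 'seq' they have processed in their status doc (0.4.23).
const deleteInstruction = {
  _id: 'delete-instruction-000042',
  type: 'delete-instruction',
  seq: 42, // sequential number, used for garbage collecting old instructions
  deletes: [ // instructions can be batched in bulk
    { db: 'p-editors', id: 'doc-123', rev: '3-9af1c2' },
    { db: 'p-editors-alice_example_com', id: 'doc-456', rev: '7-01bb3d' }
  ]
};
```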
0.4.17 A node can fail and this does not affect the pod ring
The failure will get reported by serf to all other pods. Every pod writes an info doc that does not get replicated to other pods, but does get replicated to a generally accessible partial database. In this info doc each pod will at all times report the databases it knows are accessible (from serf). It's the client app's responsibility to fetch this document and listen for changes to it. A client that was connected to the failing pod can now try to access a working pod. New clients will not try to connect to the failing one. When the pod comes back online the same process gets repeated to let clients know it's back online. If all else fails a client can appeal to a dns server again, or to a central app registry if one exists.
0.4.18 A proxy may be necessary to enforce read restrictions on databases
Couchdb is good at validating writes, but not reads. In particular it cannot prevent reading from a database while still allowing writes. This is, however, necessary for the aggregate. So until couchdb implements 'validate doc read' a proxy is necessary.
0.4.19 A pod's public port is the proxy
Furthermore serf and couchdb need to be able to connect to other pods/machines, but couchdb itself should not be publicly accessible. So maybe it should go through the proxy as well when replicating with other couchdb instances, and the proxy can check for the proper password when it gets a direct connection request for couchdb. This proxy is also useful for when rean wants to shut down or do maintenance on the pod.
0.4.20 The proxy is either something like bounce.js or haproxy
Both are controllable and programmable, and will sit in a docker container with a serf instance. The proxy can then respond to messages from the local rean, for instance.
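As a minimal sketch of the read restriction such a programmable proxy would enforce (using only node's built-in http module; the rule itself, 'no reads on the aggregate', comes from 0.4.5 and 0.4.18):

```javascript
// Sketch: a tiny node proxy in front of couchdb that refuses read access to
// the aggregate database but passes everything else straight through.
const http = require('http');

const COUCH_HOST = '127.0.0.1';
const COUCH_PORT = 5984;

http.createServer((req, res) => {
  // deny reads on the aggregate; writes still pass and get checked by the vdu
  if ((req.method === 'GET' || req.method === 'HEAD') &&
      req.url.startsWith('/aggregate')) {
    res.writeHead(403, { 'content-type': 'application/json' });
    return res.end(JSON.stringify({ error: 'forbidden', reason: 'no reads on aggregate' }));
  }

  // otherwise proxy the request to the local couchdb
  const upstream = http.request(
    { host: COUCH_HOST, port: COUCH_PORT, method: req.method, path: req.url, headers: req.headers },
    couchRes => {
      res.writeHead(couchRes.statusCode, couchRes.headers);
      couchRes.pipe(res);
    }
  );
  req.pipe(upstream);
}).listen(8080); // the pod's public port
```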
0.4.21 Every pod decides independently which pod is the pod master
A write master is chosen based on its id. The highest id wins, or whatever algorithm or preferences are set in the configuration of rean. Once a write master fails, all other pods will learn of this via serf. Each one will independently know which one should become the new write master. The one with the highest id will then open its aggregate for writing. If there are no docs in the failing pod that haven't replicated yet, this process is safe. All clients will be notified, or can query which pod is the new write master. When a failed write master comes back online, any docs that hadn't replicated before it failed might now end up in a conflicted state. So in general any write master that finds itself newly elected may have to check for conflicts and resolve them.
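A sketch of that independent election rule, assuming serf gives every pod the same member list (the 'highest id wins' rule is from this section; everything else is illustrative):

```javascript
// Sketch: each pod runs this on every serf membership event and reaches the
// same conclusion independently, because all pods see the same member list.
function electWriteMaster(members) {
  // members: [{ id: 'pod-07', alive: true }, ...] as reported via serf
  const alive = members.filter(m => m.alive);
  alive.sort((a, b) => (a.id < b.id ? 1 : -1)); // highest id wins
  return alive.length ? alive[0].id : null;
}

function onMembershipChange(members, myId) {
  const master = electWriteMaster(members);
  if (master === myId) {
    // open the local aggregate for writes and check for conflicts (0.4.21)
    console.log('assuming write master role');
  } else {
    console.log('write master is', master);
  }
}
```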
0.4.22 Pods are themselves responsible for deleting docs from their partial databases
When a pod goes offline and comes back online after some time, the write master might have given up on the pod and removed the delete instruction docs as part of maintenance. The local rean can deduce this from the fact that there is a gap between the earliest delete instruction doc in its aggregate and the latest one it reported as carried out in its own status doc. In this case the local rean can either remove all partial databases and start replication again into new, empty partial databases, or laboriously compare all docs in the partial databases with the ones in the aggregate and clean up appropriately.
0.4.23 In a pod ring every pod maintains its own pod status document
This document gets replicated to all other pods and contains data such as the pod id, couchdb stats, vps stats and the index of the latest delete instruction doc carried out. Nobody writes to this document but the pod itself.
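For illustration, such a status doc might look like this (the field names are assumptions; the contents listed come straight from this section):

```javascript
// Hypothetical pod status doc, written only by the pod it describes and
// replicated to every other pod in the ring.
const podStatus = {
  _id: 'pod-status-pod-07',
  type: 'pod-status',
  podId: 'pod-07',
  lastDeleteSeq: 42, // latest delete instruction doc carried out (0.4.16)
  couchdb: { docCount: 12034, diskSizeBytes: 734003200 },
  vps: { load: 0.4, freeDiskBytes: 9663676416 },
  updatedAt: '2014-12-27T12:00:00Z'
};
```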
0.4.24 No dns, or at least no hard dependency
A user needs to find the app, so a url for that will be needed, and a dns lookup. One could use a central place where pods or pod rings register ip addresses. Once an app is loaded (from a pod's database for instance) one could imagine that the app asks the pod for the ip addresses of other pods in the ring, or again looks in a central registry where pods register their ip addresses. Pods might also be able to configure their own dns settings using the api of some dns server, if possible. An app only needs access to one pod, somehow, to be able to access the others.
0.4.25 No load balancing; instead clients find their own most efficient server
Once an app knows the ip addresses of all the pods in a ring, it can be made the app's responsibility to choose the pod with the most capacity in terms of connections, latency or other parameters it can either measure itself or that are reported by the separate pods. Remember, all pods know all about all other pods in a pod ring.
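A sketch of what such a client-side pick could look like; couchdb's root endpoint returns a small welcome document and is cheap to ping, everything else here is an assumption:

```javascript
// Sketch: the client app measures round-trip time to every known pod and
// connects to the fastest one that answers.
async function pickPod(podUrls) {
  const timed = await Promise.all(podUrls.map(async url => {
    const start = Date.now();
    try {
      await fetch(`${url}/`); // couchdb's root endpoint: a tiny welcome doc
      return { url, ms: Date.now() - start };
    } catch (e) {
      return { url, ms: Infinity }; // unreachable pods lose automatically
    }
  }));
  timed.sort((a, b) => a.ms - b.ms);
  return timed[0].url;
}

// usage:
// const pod = await pickPod(['https://pod1.example.com', 'https://pod2.example.com']);
```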
0.4.26 Every machine or vps and its pod is completely autonomous
Meaning it can take action without being told to by a master vps, and no vps is more important than another. It can manage its own affairs and no decision or action it takes should endanger the cluster. No vps is dependent on another vps. All knowledge of the cluster is shared. Etc, you get the idea.
0.4.27 Client is totally independent from the back end database and vice versa
Neither should expect or demand anything from the other. The client should politely request data and, if it is not granted or received, solve its own problems. Back end workers, though, should do their best to accommodate and anticipate clients' needs, and organize things as well as they can. This means keeping public, reception, post office and mailboxes in order, plus any replications that are needed between them, and responding to client messages as well as possible.
0.4.28 A cluster, or pod ring, is self adjusting depending on load
A master pod should get elected whose responsibility it is to add and remove pods, that is, vps instances. Whether this should happen depends on the pod ring's configuration. Every pod reports its status and load at all times via a shared status doc. Based on that, the master pod can make decisions about scaling.
0.4.29 A pod ring should be robust and be able to cope with all but one pod failing
Using a watts-newman small world network between pods, all pods should keep replicating to each other and stay in sync at all times. The watts-newman network model can be implemented by every single pod independently, without consulting or taking into account what other pods decide to connect to. It also predicts a relatively low average hop count even when there are dozens or possibly hundreds of pods in a pod ring. When a pod disappears from a pod ring the network will self adjust, as it will when a pod is added (again). For this to work every pod needs a working serf instance that has been hooked up to the pod ring's serf network, as the pod knows about the network through serf. A visual demo of the watts-newman network is at deploy-demo.
Clients will also always notice a pod failing and should redirect requests to another pod in the ring if the app is designed properly.
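A sketch of how each pod could derive its own replication peers in such a small-world topology, independently of the others (ring neighbours plus a few deterministic 'shortcut' peers; the hashing trick is an assumption, not something this post prescribes):

```javascript
// Sketch: given the full, sorted member list from serf, every pod computes
// the same topology independently: each pod replicates with its immediate
// ring neighbours plus a couple of pseudo-random long-range shortcuts.
const crypto = require('crypto');

function peersFor(podId, allPodIds, shortcuts = 2) {
  const ring = allPodIds.slice().sort();
  const i = ring.indexOf(podId);
  const peers = new Set([
    ring[(i + 1) % ring.length],              // next neighbour on the ring
    ring[(i - 1 + ring.length) % ring.length] // previous neighbour
  ]);

  // deterministic shortcuts: hash (podId, k) to pick far-away peers, so every
  // pod agrees on the overall picture without any coordination
  for (let k = 0; k < shortcuts; k++) {
    const h = crypto.createHash('sha1').update(podId + ':' + k).digest();
    peers.add(ring[h.readUInt32BE(0) % ring.length]);
  }
  peers.delete(podId); // never replicate with ourselves
  return [...peers];
}
```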
0.4.30 A user can start a new pod ring with just the data accessible in another pod ring
So users own their data. They can replicate their own data to a pod they control and then delete the data in the old pod (ring). When the data is shared with other users, they will also not be able to use the old pod (ring) to access the data.
0.4.31 Apps should leverage couchdb's replication and changes feed features
With a bit of effort, separate users of the same app connecting to the same pod (ring) should be able to sync up with each other using these couchdb features. The ultimate goal is to achieve parity with most features in meteorjs. It should, for instance, be pretty easy to send messages between users.
0.4.32 Every pod has minimal workers behind its database
These workers handle registration, send emails, do maintenance on the databases, monitor and report the pod's state, etc, but should contain minimal app or business logic. That should reside in the client apps themselves. It is an open question how far you can go with this.
0.4.33 The whole system is message based
From pod to pod and from pod to app. This maximizes decoupling. No app or pod should rely on a specific response, or for that matter any response, to a message sent. If a message is confirmed or otherwise returns data the app may use this, but it cannot expect or wait for it. It should be able to make do and not fail or otherwise 'crash', but should always present a reasonable ui to the user and do its best to resolve the situation. Data should always be retrieved by direct database access.
0.4.34 No possibility of creating document conflicts
A logical consequence of having only one writable pod and creating only one-way replications: from aggregate to other aggregates, and from all aggregates to the partial databases on their respective pods. However, a bulk save of docs with the all_or_nothing option set might create conflicts. Maybe the vdu can check for conflicts by comparing revision numbers? Or the proxy might disallow this particular request.
0.4.35 A proxy is used for basic read access control
For maintenance it might be useful to cut off access to a couchdb instance, and certain databases can be made write-only by disallowing get requests on them. This proxy can also do basic reporting and logging of connections and requests.
0.4.36 No sessions or cookies
Instead users get a temporary login and pwd to access a database. This user name/pwd can expire. This bypasses and ignores session cookies and therefore csrf. A web app still needs to store this pwd in local storage. Meteorjs talks about storing session tokens in local storage in this blog post and expands on the rationale. This does require an https connection, and a web app will have to send a user name/pwd with every request. Meteorjs only sends this session token once, when setting up the ddp connection. Since every request is actually a login, and the crypto algorithm couchdb uses to hash a password is pbkdf2, login attempts can be throttled to any arbitrary rate, so timing attacks become difficult. Web apps would mostly be listening to a changes feed, which is a long-lasting connection. The issue still remains of a secret token (login pwd) being sent with every request.
To create a quasi session, create a temp user with the same roles as the actual user, with the added role of the user id (user email). This user can be replicated to other couch instances and it will still work. The session expires when the user account gets deleted, which rean can do after a certain period without pings from the client, or after a message from the client. When the proper user account doesn't have the email as a role it can't be used to access any of the dbs, except to ask for a session.
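A sketch of such a quasi session, using couchdb's standard _users document format (the naming scheme and expiry bookkeeping are assumptions):

```javascript
// Sketch: rean creates a throwaway user in _users that carries the real
// user's roles plus the real user's id as an extra role. Deleting this doc
// later ends the 'session'.
const { randomBytes } = require('crypto');

function makeTempUserDoc(realUser) {
  const name = 'temp-' + realUser.name + '-' + randomBytes(6).toString('hex');
  const password = randomBytes(18).toString('base64');
  return {
    doc: {
      _id: 'org.couchdb.user:' + name, // required _users id format
      name,                            // required fields for a couchdb user
      type: 'user',
      password,                        // couchdb hashes this with pbkdf2 on write
      roles: realUser.roles.concat(realUser.name), // real roles + user id as role
      expiresAt: Date.now() + 30 * 60 * 1000       // hypothetical: rean deletes it later
    },
    credentials: { name, password }    // handed back to the client over https
  };
}
```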
0.4.37 Every pod consists of adea and docker containers
A vps or machine needs to have docker installed, a runtime (node, erlang, clojure), serf, and a port open to the proxy container. Adea is the software responsible for adding, removing, starting and stopping docker containers. It communicates with the pod and the pod ring through serf. It can do some basic host status reporting. This agent should be as simple and robust as possible.
0.4.38 A pod ring shares a pod ring configuration document
In this document are the settings and parameters to properly configure rean. This doc gets replicated to all other pods and their aggregate databases, but it is not replicated to any partial database.
0.4.39 A pod ring can add, activate, deactivate and remove its members
If the rean config doc contains credentials for a virtual server provider such as aws, linode, digital ocean or any other vps provider, and rean is configured to auto scale, one could devise an algorithm where a pod gets elected to start up a new vps, using a vagrant script for instance. Rean could also monitor money spent and shut down or start up pods depending on a budget. This is possible because a pod ring self adjusts to pods being added or removed. A new vps only needs adea, serf and docker installed. Adea needs to be started up, join the right serf cluster (with the proper secret) and be given the password to the couchdb instances of the pod ring. Somehow this secret and password need to be given or passed in when the vps is started up. After this the pod instance should be able to look after itself without any further control or meddling from other pods.
0.5 Compromises/trade-offs
Some compromises have already been discussed in the requirements, but here are some more I could think of:
0.5.1 Easily scalable for read operations, but not for write operations
An app can use any pod to read from, but only one to write to, and that pod is the same for all clients. This is not only to prevent document conflicts, but also to enable proper implementation of an access system based on permissions and roles.
0.5.2 No sharding of data
But one can use bigcouch, cloudant or couchdb 2.0 for this if needed. Every pod has a complete copy of all the data. So this system favors connection-heavy, cpu-heavy applications, since we can keep firing up new pods to deal with additional load. But a vps has a limit on how much data it can store (hard disk limit), plus all of it will have to be replicated to every new pod on creation. This becomes troublesome once we're talking about gigabytes of data. One solution would be to store big files such as images, video and sound files outside the pod ring, in a key value store somewhere. But that would need a server in front of it to enforce permissions.
0.5.3 Every node uses (multiple) duplication of data to control read access
Even when couchdb implements 'validate doc read' this will be necessary, because views are precalculated into indexes and will not use the 'vdr' function to validate read access. Every pod therefore will have to duplicate its data to some degree. If all data in a pod is accessed by users it will have a duplication ratio of at least two. If the pod is nice enough to then aggregate this again to some degree or other for individual users, the duplication ratio might be much higher than two. On the other hand, every pod only needs to leave in one piece the main aggregate that's replicated between pods. All other databases it can destroy and create at its own discretion, taking matters such as load and space into consideration. This will mess with users who are reading from these databases, or have changes feeds set up, of course.
0.5.4 Client apps will have their accessible data spread amongst several partial databases
However, a pod can decide, based on available hard disk space, to set up a database for every user into which all the data accessible to that user gets replicated. This will eat space if there is a lot of data that is shared; if all data is shared, all data will get duplicated as many times as there are users. You could work out an algorithm where rean calculates how much data each user has access to (the sum of the sizes of the partial databases they have access to), divides the free space by the number of (active) users, and then replicates as much as possible to a user specific database. A client app can then ask rean which databases it should read from. The other option is to write messages to a generally accessible database about all changes to the aggregate. Clients can listen for (filtered) changes in this general database and then fetch the docs based on the reported changes in the aggregate.
0.6 Work to be done!
The above sketches a system that has not been fully fleshed out yet but seems like it could work as a whole. The system tries to exploit to the fullest the strongest points of each technology used, such as docker, couchdb, serf, proxies and vps providers. It also tries to approach the problem of the client-server schism in what I hope is a somewhat original way, discarding along the way some truisms about setting up a back end infrastructure: load balancing, sessions, dns and scaling. Whether it works I don't know, I have to build most of it still. What does work so far is the authentication and sign up.
0.7 Questions and research
0.7.1 How would this system go about implementing the following?
- trello
- shop
- wiki
- social network
- inventory
0.7.2 User names or emails as ids?
Lots of authentication systems use both, but this creates confusion for users, because they don't know which one to use for which site or app. User names seem to make more sense because email addresses change. However email addresses are always unique.
0.7.3 Can workers run in parallel?
Can you set multiple reans to work on one couchdb?
0.7.4 Cache requests in memcache perhaps?
Rean is already listening to changes so it can invalidate queries and views.
0.7.5 Multiple apps per pod ring?
Yes, that's possible, just prefix all doc types with the app name. So a client app can just pull the app specific data from the partial databases.
0.7.6 Multiple groups of users for one app in a pod ring
Like a saas?
0.7.7 Splitting of pod rings into two rings?
Don't know why you'd want to do this.
0.7.8 Where is the client app?
In the aggregate? Pulled down from some repository somewhere?
0.7.9 Deploying/updating software?
How to update the app within the pod ring? How to update rean? How to update technologies used, or for that matter the os of a vps?
0.7.10 A/B testing?
Should be easy to do because it's so useful. But how?
0.7.11 What's the api for docker? From node? From clojure?
Because we need to do this from adea. Apparently docker's api is a bit iffy.
0.7.12 How to bootstrap a pod or pod ring?
In principle a pod only needs an open port on a vps (with docker installed) with an instance of adea and possibly serf running. But it needs to be fed a configuration. This could maybe be written straight to the couchdb instance. Rean can react to this and bootstrap itself and configure couchdb properly. But really, all you do is start rean with a particular config and it will provision one or more vps's and build the pod ring.
0.7.13 Big files (video, images, sound) should not be replicated around
But should be in a key value store somewhere (DynamoDB perhaps). Or in a separate database with a validate read access function on it. Or use a proxy to control access. But this complicates access to these resources. Needs more thought.