It's quite surprising how many kinds of systems you can build with traveling data and what a range of tasks they can accomplish. To me it is particularly surprising because the paradigm rarely comes to mind naturally. After all, the most straightforward way of building a system is to store the data somewhere and then run multiple activities upon it. Unfortunately, this approach usually creates tight coupling between data and activities, and even between the activities themselves, or at least their execution sequence.
When the data moves, activities can be totally decoupled from each other and even from the original piece of data that started the whole processing sequence. Imagine event processing where event1 triggers activity1 and activity2. The activities then generate next-generation events with data enriched or transformed, and these events either trigger their own activities or simply get ignored. Such systems are very flexible: it's easy to add a new branch or a processor. If you use locality information to deploy event processors involved in the same chain close to each other, the processing can be really fast. Obviously events should be relatively small, but if you combine event processing with a data storage solution, you can still get access to a much larger data object.
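As a rough sketch of what such a chain might look like, here is a minimal in-process event bus. Everything in it is hypothetical and for illustration only: the event names, the `EventBus` class, and the activity functions are all invented, not taken from any particular framework.

```python
from collections import defaultdict

class EventBus:
    """Minimal in-process event bus: handlers subscribe by event type
    and may publish next-generation events, forming a processing chain."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Events with no subscribers are simply ignored.
        for handler in self.handlers[event_type]:
            handler(self, payload)

def activity1(bus, order):
    # Enrich the data and emit a second-generation event.
    bus.publish("order.enriched", {**order, "priority": "high"})

def activity2(bus, order):
    print("auditing:", order)

def activity3(bus, enriched):
    print("shipping:", enriched)

bus = EventBus()
bus.subscribe("order.created", activity1)   # event1 -> activity1
bus.subscribe("order.created", activity2)   # event1 -> activity2
bus.subscribe("order.enriched", activity3)  # second-generation chain
bus.publish("order.created", {"id": 42})
```

Note how none of the activities know about each other; adding a new branch is just one more `subscribe` call, which is exactly the flexibility described above.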
Events usually include several key pieces of metadata to facilitate routing and monitoring: event type, creation time, TTL, creator, etc.
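A hypothetical event envelope carrying that metadata might look like the sketch below; the field names and the 60-second TTL default are assumptions for illustration, not any standard format.

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Event:
    """Envelope separating routing/monitoring metadata from the payload."""
    event_type: str        # used for routing to subscribers
    payload: dict          # kept small; may hold a key into a larger store
    creator: str           # originating component, for monitoring
    created_at: float = field(default_factory=time.time)
    ttl_seconds: float = 60.0  # assumed default; drop the event once stale
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def expired(self) -> bool:
        return time.time() - self.created_at > self.ttl_seconds
```

A processor can then check `expired()` and drop stale events before doing any work, which keeps old data from propagating down the chain.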
On the other side of data traveling is map-reduce, which requires copying rather large volumes of data. I find the fact that one of the most actively "growing" branches on the map-reduce tree is HBase highly suggestive of that paradigm's viability. Basically, it seems that moving large volumes of data is fine if there is no way to avoid it; otherwise, pretty much anything else might be a better solution.
