It's quite surprising how many kinds of systems you can build with traveling data and what a range of tasks they can accomplish. To me it is particularly surprising because the paradigm rarely comes to mind naturally. After all, the most straightforward way of building a system is to store the data somewhere and then run multiple activities upon it. Unfortunately, this approach usually creates tight coupling between data and activities, and even between the activities themselves, or at least their execution sequence.
When the data moves, activities can be totally decoupled from each other and even from the original piece of data that started the whole processing sequence. Imagine event processing where event1 triggers activity1 and activity2. The activities then generate next-generation events with data enriched or transformed, and these events either trigger their own activities or simply get ignored. Such systems are very flexible: it's easy to add a new branch or a processor. If you use locality information to deploy event processors involved in the same chain close to each other, the processing can be really fast. Obviously events should be relatively small, but if you combine event processing with a data storage solution, you can still get access to a much larger data object.
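As a rough sketch of what such a chain might look like, here is a minimal in-process event bus. Everything in it is hypothetical and for illustration only: the event names, the `EventBus` class, and the activity functions are all invented, not taken from any particular framework.

```python
from collections import defaultdict

class EventBus:
    """Minimal in-process event bus: handlers subscribe by event type
    and may publish next-generation events, forming a processing chain."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Events with no subscribers are simply ignored.
        for handler in self.handlers[event_type]:
            handler(self, payload)

def activity1(bus, order):
    # Enrich the data and emit a second-generation event.
    bus.publish("order.enriched", {**order, "priority": "high"})

def activity2(bus, order):
    print("auditing:", order)

def activity3(bus, enriched):
    print("shipping:", enriched)

bus = EventBus()
bus.subscribe("order.created", activity1)   # event1 -> activity1
bus.subscribe("order.created", activity2)   # event1 -> activity2
bus.subscribe("order.enriched", activity3)  # second-generation chain
bus.publish("order.created", {"id": 42})
```

Note how none of the activities know about each other; adding a new branch is just one more `subscribe` call, which is exactly the flexibility described above.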
Events usually include several key pieces of metadata to facilitate routing and monitoring: event type, creation time, TTL, creator, etc.
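A hypothetical event envelope carrying that metadata might look like the sketch below; the field names and the 60-second TTL default are assumptions for illustration, not any standard format.

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Event:
    """Envelope separating routing/monitoring metadata from the payload."""
    event_type: str        # used for routing to subscribers
    payload: dict          # kept small; may hold a key into a larger store
    creator: str           # originating component, for monitoring
    created_at: float = field(default_factory=time.time)
    ttl_seconds: float = 60.0  # assumed default; drop the event once stale
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def expired(self) -> bool:
        return time.time() - self.created_at > self.ttl_seconds
```

A processor can then check `expired()` and drop stale events before doing any work, which keeps old data from propagating down the chain.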
On the other side of data traveling is map-reduce, which requires copying rather large volumes of data. I find the fact that one of the most actively "growing" branches on the map-reduce tree is HBase highly suggestive of that paradigm's viability. Basically, it seems that moving large volumes of data is fine if there is no way to avoid it; otherwise, pretty much anything else might be a better solution.
