25. October, 2013
Guido Schmutz introduces two frameworks in his talk “Kafka and Storm – event processing in realtime” (slides on slideshare).
Apache Kafka is a publish-subscribe messaging system like JMS, but simpler. Messages are kept in files and are not deleted when they are consumed (in practice they stay for a configurable retention period). Together with the fact that subscribers have to tell the system which message they want next, this means you can recover from bugs that corrupted your data even if you notice them only after some time: just process the affected messages again.
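The replay idea can be sketched with a toy in-memory log. This is not Kafka's client API: `MessageLog`, `append`, `read_from` and the two `*_total` functions are hypothetical names made up for illustration. The sketch only shows that the log keeps every message and that the consumer, not the messaging system, decides which offset to read next.

```python
class MessageLog:
    """Toy append-only log, a stand-in for a Kafka topic (hypothetical, not the Kafka API)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store a message and return its offset (its position in the log)."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset):
        """Return all messages from the given offset onward; nothing is removed."""
        return self._messages[offset:]


def buggy_total(messages):
    # Deliberate bug for the example: negative amounts are silently ignored.
    return sum(m for m in messages if m > 0)


def fixed_total(messages):
    # Corrected consumer logic.
    return sum(messages)


log = MessageLog()
for amount in [10, -3, 5]:
    log.append(amount)

# First pass with the buggy consumer, starting at offset 0.
first_result = buggy_total(log.read_from(0))

# After the bug is noticed, the consumer simply asks for offset 0 again.
replayed_result = fixed_total(log.read_from(0))
```

Because reading does not remove anything, the second pass sees exactly the same input as the first one and can produce the corrected result.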
Storm is a “distributed realtime computation system.” It makes it easy to define topologies (= graphs) of bolts (= places where a computation takes place) and to send your real-time data through this network to process, filter and aggregate it. Just like Akka, it defines all kinds of operations (filters, switches, routers, …) so you can easily and quickly build the topology you need. Trident makes this set-up step even simpler.
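To get a feeling for what such a topology does, here is a minimal single-process sketch using plain Python generators. It does not use Storm's API and is not distributed; the function names are assumptions chosen to mirror Storm's vocabulary (a spout is Storm's term for a data source, a bolt is a processing step).

```python
def sentence_spout():
    """Source of the stream (hard-coded sample data for the sketch)."""
    for sentence in ["storm processes streams", "kafka stores streams"]:
        yield sentence


def split_bolt(sentences):
    """Process: split each sentence into words."""
    for sentence in sentences:
        for word in sentence.split():
            yield word


def filter_bolt(words, min_length=6):
    """Filter: keep only words with at least min_length characters."""
    for word in words:
        if len(word) >= min_length:
            yield word


def count_bolt(words):
    """Aggregate: count how often each word occurs."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


# Wire the steps together: spout -> split -> filter -> count.
word_counts = count_bolt(filter_bolt(split_bolt(sentence_spout())))
```

In Storm the wiring is declared once and the framework distributes the bolts over a cluster; here the "topology" is just nested function calls.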
Compared to Hadoop, Storm is meant for real-time processing. Some projects combine the two.
If you need a good framework for serializing data in Java, have a look at Apache Avro.
24. October, 2013
Big Data is being used everywhere. Kai Wähner mentioned a couple of examples in his talk “Big Data beyond Hadoop – How to integrate ALL your Data” (slides on slideshare):
Anyone else getting worried by these “success stories”? How do you feel as a mobile customer that your mobile company tries to prevent you from leaving? How about using Big Data to notice bad customer service and prevent making customers unhappy? How do Macy’s competitors feel about this “monitoring”?
One great point was “Silence the HiPPOs” (highest-paid person’s opinion). With the ability “to interpret unimaginable large data stream, the gut feeling is no longer justified!”
Why Big Data? 3 V’s: Volume, Velocity, Variety. But don’t forget the fourth: Value (slide 8).
Before you can start analysis, you need to get the data from somewhere. That usually means integration of a foreign system (reading the data), manipulation of the data (like string to int or date conversion, etc.) and filtering (duplicates, importance, …). See slide 9.
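These three steps can be sketched in a few lines of plain Python. The sample rows, the field names and the helper names `convert` and `drop_duplicates` are made up for illustration; a real project would read from the foreign system instead of a hard-coded list.

```python
from datetime import date, datetime

# Integration: pretend these rows were read from a foreign system as strings.
RAW_ROWS = [
    {"id": "1", "amount": "42", "day": "2013-10-24"},
    {"id": "1", "amount": "42", "day": "2013-10-24"},  # duplicate
    {"id": "2", "amount": "7", "day": "2013-10-25"},
]


def convert(row):
    """Manipulation: turn strings into int and date values."""
    return {
        "id": int(row["id"]),
        "amount": int(row["amount"]),
        "day": datetime.strptime(row["day"], "%Y-%m-%d").date(),
    }


def drop_duplicates(rows):
    """Filtering: keep only the first row for each id."""
    seen = set()
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            yield row


clean_rows = list(drop_duplicates(convert(row) for row in RAW_ROWS))
```

Only after this kind of clean-up does it make sense to hand the data to the analysis step.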
Beware that Big Data is no silver bullet. If you have a gigantic amount of data with poor quality, that will just give you huge problems.
When planning a Big Data project, begin with a business opportunity (slide 22). Choose the right data (don’t just import everything because you might need it), combine different sources and use easy tooling (slide 26).
Be wary of ETL tools. The network will quickly become your bottleneck.
For the actual implementation, he suggested looking at Apache Camel (slide 34) as a pure integration framework and Talend Open Studio (slide 56) as an example of an integration suite.