Big Data | Dark Views

TNBT: Proactive IDEs

13. February, 2015

Imagine this situation: You’re working on some code and you get an exception when you run the unit tests. Next to the output is a link with the text: “User Joe had the same exception two months ago and fixed it with the commit b8cfda02.”

How would that work? We’re using big data for all kinds of things, tracking customer happiness, searching the Internet and discovering terrorist threats (or not).

Standard development teams have about 10 people. That means you have a super computer with 40-80 cores, 160 GB of RAM and 20 TB of disk space connected with a fast LAN in your office already. That beast is usually idling while it waits for the developers to press keys. It would be pretty simple to install a clustered log analyzer on this hardware which simply reads all the log files and reports which Maven and running JUnit test creates. It would be as simple to connect the same database to your version control. That means this system could track all the errors and exceptions that you get when you run unit tests or the whole application.

This information could then be used to detect when someone in the team gets a new exception plus the change sets which fixes them. If the system detects an exception which it has seen before, it can tell you which developer has fixed it or who is currently working on it – instead of wasting your time, you could see the code which contains the solution or ask someone who has already solved the problem.

With proper filtering, the data could be split into internal and framework code. That way, the system could report to library projects where consumers struggle most.

On the large scale of things, this system can tell you which parts of the system are most brittle.

As usual with big data, there are some downsides. The same system would tell you which developer breaks the code most often. Who writes the worst code. If your manager isn’t able to see the human value in his charges, this might not be your best bet.

The Next Best Thing – Series in my blog where I dream about the future of software development

Leave a Comment » | Software | Tagged: Big Data, Software Development, TNBT | Permalink
Posted by digulla

Jazoon 2013 – Big Data beyond Hadoop – How to integrate ALL your Data

24. October, 2013

Big Data is being used everywhere. Kai Wähner mentioned a couple of examples in his talk “Big Data beyond Hadoop – How to integrate ALL your Data” (slides on slideshare):

Anyone else getting worried by these “success stories”? How do you feel as a mobile customer that your mobile company tries to prevent you from leaving? How about using Big Data to notice bad customer service and prevent making customers unhappy? How do Macy’s competitors feel about this “monitoring”?

Anyway …

One great point was “Silence the HiPPOs” (highest-paid person’s opinion). With the ability “to interpret unimaginable large data stream, the gut feeling is no longer justified!”

Why Big Data? 3 V’s: Volume, Velocity, Variety. But don’t forget the fourth: Value (slide 8)

Before you can start analysis, you need to get the data from somewhere. That usually means integration of a foreign system (reading the data), manipulation of the data (like string to int or date conversion, etc.) and filtering (duplicates, importance, …). See slide 9.

Beware that Big Data is no silver bullet. If you have a gigantic amount of data with poor quality, that will just give you huge problems.

When planning for a Big Data project, begin with a business opportunity (slide 22). Chose the right data (don’t just import everything because you might need it), combine different sources and use easy tooling (slide 26).

Be wary of ETL tools. The network will quickly become your bottleneck.

For the actual implementation, he suggested to look at Apache Camel (slide 34) as a pure integration framework and the talend Open Studio (slide 56) as an example of an integration suite.

1 Comment | Conference, Software | Tagged: 2013, Big Data, Hadoop, Jazoon | Permalink
Posted by digulla

Jazoon 2012: Building Scalable, Highly Concurrent and Fault-Tolerant Systems: Lessons Learned

29. June, 2012

What do Cloud Computing, multi-core processors and Big Data have in common?

Parallelism.

In his presentation, Jonas Bonér showed what you should care about:

Always prefer immutable
Separate concerns in different layers with the minimum amount of dependencies
Separate error handling from the business logic
There is no free lunch: For every feature, you will have to pay a price
Avoid using RPC/RMI. Try lure you into “convenience over correctness”
Make sure you handle timeouts correctly
Use CALM if you can
Not all your data needs ACID.
Know about CAP and BASE: Drop ACID And Think About Data
Get rid of dependencies by using event sourcing/CQS/CQRS
Frameworks like Hibernate always leak in places where you can’t have it. KISS.

Longer explanation:

Immutables can always be shared between threads. Usually, they are also simple to share between processes, even when they run on different computers. Trying locks and clever concurrency will only get you more bugs, unmaintainable code and a heart attack.

Dependencies kill a project faster and more efficiently than almost any other technique. Avoid them. Split your projects into Maven modules. You can’t import what you don’t have on the classpath.

Error handling in your business logic (BL) will bloat the code and make it harder to maintain. Business logic can’t handle database failures. Parameters should have been validated before they were passed to business logic. Business logic should produce a result and the caller should then decide what to do with it (instead of mixing persistence code into your business layer). The BL shouldn’t be aware that the data comes from a database or that the result goes back into a database. What would your unit tests say? See also Akka 2.0 and “parental supervision.”

Obvious programming has a value: You can see what happens. It has a price: Boiler plate code. You can try to hide this but it will still leak. Hibernate is a prefect example for this. Yes, it hides the fact that getChildren() needs to run a query against the database – unless the entity leaks outside of your transaction. It does generate proxies to save you from seeing the query but that can break equals().

Same applies to RMI. When RMI decides that you can’t handle the message, then you won’t even see it. In many cases, a slightly “unusual” message (like one with additional fields) wouldn’t hurt.

As soon as you add RMI or clustering, you add an invisible network in your method calls. Make sure you have the correct timeouts (so your callers don’t block forever) and that you handle them correctly. New error sources that are caused adding the network:

Failure to serialize the message
Host unreachable
Packet drops
Network lag
Destination doesn’t accept message because of configuration error
Message is sent to the wrong destination
Destination can’t read message

Claim checks allow to resend a message again after a timeout without having it processed twice by the consumer.

CALM and BASE refer to the fact that you can only have two of the tree CAP characteristics: Consistency, Availability and Partition Tolerance. Since Partition Tolerance (necessary for scaling) and Availability (what’s the point of having a consistent but dead database?) are most important, you have to sacrifice consistency. CALM and BASE show ways to eventually reach consistency, even without manual intervention. For all data related to money, you will want consistency as well but think about it: How many accounts are there in your database? And how many comments? Is ACID really necessary for each comment?

Solution: Put your important data (when money is involved) into an old school relational database. Single instance. Feed that database with queues, so it doesn’t hurt (much) when it goes down once in a while. Put comments, recommendations, shopping carts into a NoSQL database. So what if a shopping cart isn’t synchronized over all your partitions? Just make sure that users stay on one shard and they will only notice when the shard dies and you can’t restore the shopping cart quickly enough from the event stream.

Which event stream? The one which your CQRS design created. More on that in another post. You might also want to look at Akka 2.0 which comes with a new EventBus.

Leave a Comment » | Conference, Software | Tagged: 2012, ACID, BASE, Big Data, CAP theorem, Cloud computing, CQRS, CQS, Database, Hibernate, Jazoon, Multi-core processor, nosql, Programming | Permalink
Posted by digulla

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Dark Views

TNBT: Proactive IDEs

Jazoon 2013 – Big Data beyond Hadoop – How to integrate ALL your Data

Email Subscription

Archives

Pages

Meta

Dark Views

TNBT: Proactive IDEs

Share this:

Jazoon 2013 – Big Data beyond Hadoop – How to integrate ALL your Data

Share this:

Jazoon 2012: Building Scalable, Highly Concurrent and Fault-Tolerant Systems: Lessons Learned

Share this:

Email Subscription

Archives

Pages

Meta