Jazoon 2012: Building Scalable, Highly Concurrent and Fault-Tolerant Systems: Lessons Learned

29. June, 2012

What do Cloud Computing, multi-core processors and Big Data have in common?

Parallelism.

In his presentation, Jonas Bonér showed what you should care about:

  • Always prefer immutable
  • Separate concerns in different layers with the minimum amount of dependencies
  • Separate error handling from the business logic
  • There is no free lunch: For every feature, you will have to pay a price
  • Avoid using RPC/RMI. Try lure you into “convenience over correctness”
  • Make sure you handle timeouts correctly
  • Use CALM if you can
  • Not all your data needs ACID.
  • Know about CAP and BASEDrop ACID And Think About Data
  • Get rid of dependencies by using event sourcing/CQS/CQRS
  • Frameworks like Hibernate always leak in places where you can’t have it. KISS.

Longer explanation:

Immutables can always be shared between threads. Usually, they are also simple to share between processes, even when they run on different computers. Trying locks and clever concurrency will only get you more bugs, unmaintainable code and a heart attack.

Dependencies kill a project faster and more efficiently than almost any other technique. Avoid them. Split your projects into Maven modules. You can’t import what you don’t have on the classpath.

Error handling in your business logic (BL) will bloat the code and make it harder to maintain. Business logic can’t handle database failures. Parameters should have been validated before they were passed to business logic. Business logic should produce a result and the caller should then decide what to do with it (instead of mixing persistence code into your business layer). The BL shouldn’t be aware that the data comes from a database or that the result goes back into a database. What would your unit tests say? See also Akka 2.0 and “parental supervision.”

Obvious programming has a value: You can see what happens. It has a price: Boiler plate code. You can try to hide this but it will still leak. Hibernate is a prefect example for this. Yes, it hides the fact that getChildren() needs to run a query against the database – unless the entity leaks outside of your transaction. It does generate proxies to save you from seeing the query but that can break equals().

Same applies to RMI. When RMI decides that you can’t handle the message, then you won’t even see it. In many cases, a slightly “unusual” message (like one with additional fields) wouldn’t hurt.

As soon as you add RMI or clustering, you add an invisible network in your method calls. Make sure you have the correct timeouts (so your callers don’t block forever) and that you handle them correctly. New error sources that are caused adding the network:

  1. Failure to serialize the message
  2. Host unreachable
  3. Packet drops
  4. Network lag
  5. Destination doesn’t accept message because of configuration error
  6. Message is sent to the wrong destination
  7. Destination can’t read message
Claim checks allow to resend a message again after a timeout without having it processed twice by the consumer.

CALM and BASE refer to the fact that you can only have two of the tree CAP characteristics: Consistency, Availability and Partition Tolerance. Since Partition Tolerance (necessary for scaling) and Availability (what’s the point of having a consistent but dead database?) are most important, you have to sacrifice consistency. CALM and BASE show ways to eventually reach consistency, even without manual intervention. For all data related to money, you will want consistency as well but think about it: How many accounts are there in your database? And how many comments? Is ACID really necessary for each comment?

Solution: Put your important data (when money is involved) into an old school relational database. Single instance. Feed that database with queues, so it doesn’t hurt (much) when it goes down once in a while. Put comments, recommendations, shopping carts into a NoSQL database. So what if a shopping cart isn’t synchronized over all your partitions? Just make sure that users stay on one shard and they will only notice when the shard dies and you can’t restore the shopping cart quickly enough from the event stream.

Which event stream? The one which your CQRS design created. More on that in another post. You might also want to look at Akka 2.0 which comes with a new EventBus.


Follow

Get every new post delivered to your Inbox.

Join 340 other followers