Jazoon 2012: CQRS – Trauma treatment for architects

4. July, 2012

A few years ago, concurrency and scalability were hype. Today, they are a must. But how do you write applications that scale painlessly?

Command and Query Responsibility Segregation (CQRS) is an architectural pattern to address these problems. In his talk, Allard Buijze gave a good introduction. First, some of the problems of the standard approach. Your database, everyone says, must be normalized.

That can lead to a couple of problems:

  • Historical data changes retroactively
  • The data model is neither optimized for writes nor for queries

The first problem can result in a scenario like this. Imagine you have a report that tells you the annual turnover. You run the report for 2009 in January, 2010. You run the same report again in 2011 and 2012 and each time, the annual turnover of 2009 gets bigger. What is going on?

The data model is in third normal form. This is great, no data duplication. It’s not so great when data can change over time. So if your invoices point to the products and the products point to the prices, any change of a price will also change all the existing invoices. Or when customers move, all the addresses on the invoices change. There is no way to tell where you sent something.

The solution is to add “valid time range” to each price, address, …, which makes your SQL hideous and helps to keep your bug tracker filled.

It will also make your queries slow since you will need lots and lots of joins. These joins will eventually conflict with your updates, and deadlocks occur.

On the architectural side, some problems will be much easier to solve if you ignore the layer boundaries. You will end up with business logic in the persistence layer.

Don’t get me wrong. All these problems can be solved but the question here is: Is this amount of pain really necessary?

CQRS to the rescue. The basic idea is to use two domain models instead of one. Sounds like more work? That depends.

With CQRS, you will have more code to maintain but the code will be much simpler. There will be more tables and data will be duplicated in the database, but there will never be deadlocks and queries won’t need joins in the usual case (you could get rid of all joins if you wanted). So you trade bugs for code.

How does it work? Split your application into two main parts. One part takes user input and turns that into events which are published. Listeners will then process the events.
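
To make the split concrete, here is a minimal sketch in plain Java. All names are made up and this is not any particular framework’s API: the command side validates user input, turns it into an immutable event and publishes it to whoever listens.

import java.math.BigDecimal;
import java.time.Instant;
import java.util.List;

// Immutable value objects: the event describes what happened and carries no foreign keys.
record CartItem(String productName, int quantity, BigDecimal price) { }
record CartCheckedOut(String cartId, String customerId,
                      List<CartItem> items, BigDecimal total, Instant at) { }

interface EventListener { void on(Object event); }

// Command side: take user input, turn it into an event and publish it to all listeners.
class CheckoutHandler {
    private final List<EventListener> listeners;

    CheckoutHandler(List<EventListener> listeners) { this.listeners = listeners; }

    void checkout(String cartId, String customerId, List<CartItem> items) {
        BigDecimal total = items.stream()
                .map(i -> i.price().multiply(BigDecimal.valueOf(i.quantity())))
                .reduce(BigDecimal.ZERO, BigDecimal::add);
        // Event store, projections, etc. all subscribe here.
        listeners.forEach(l -> l.on(new CartCheckedOut(cartId, customerId, items, total, Instant.now())));
    }
}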

Some listeners will write the events into the database. If you need to, you will be able to replay these later. Imagine your customer calls you because of some bug. Instead of asking your customer to explain what happened, you go to the database, copy the events into a test system and replay them. It might take a few minutes but eventually, you will have a system which is in the exact same state as when the bug happened.
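
Replay is then nothing more than reading the stored events in order and feeding them to the same listeners. A sketch, building on the classes above:

import java.util.List;

// Replays stored events into a freshly set-up system, e.g. a test installation.
class EventReplayer {
    private final List<EventListener> listeners;

    EventReplayer(List<EventListener> listeners) { this.listeners = listeners; }

    void replay(Iterable<Object> storedEvents) {   // e.g. loaded from the event table, oldest first
        for (Object event : storedEvents) {
            listeners.forEach(l -> l.on(event));
        }
    }
}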

Some other listeners will process the events and generate more events (which will also be written to the database). Imagine the event “checkout”. It will contain the current content of the shopping cart. You write that into the database. You need to know what was in the shopping basket? Look for this event.

The trick here is that the event is “independent”. It doesn’t contain foreign keys but immutables or value objects. The value objects are written into a new table. That makes sure that when you come back 10 years later, you will see the exact same shopping cart as the customer saw when she ordered.

When you need to display the shopping cart, you won’t need to join 8 tables. Instead, you’ll need to query 1-2 tables for the ID of the shopping cart. One table will have the header with the customer address, the order number, the date, the total and the second table will contain the items. If you wanted, you could add the foreign keys to the product definition tables but you don’t have to. If that’s enough for you, those two tables could be completely independent of any other table in your database.

The code to fill the database gets the event as input (no database access to read anything from anywhere) and it will only write to those two tables. Minimum amount of dependencies.
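
Building on the sketch above, such a projection could look roughly like this (JDBC, with made-up table and column names). Its only input is the event; its only output is the two inserts:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Read-model writer: listens for CartCheckedOut and fills the cart header and item tables.
class CartProjection implements EventListener {
    private final Connection db;

    CartProjection(Connection db) { this.db = db; }

    @Override public void on(Object event) {
        if (!(event instanceof CartCheckedOut e)) return;
        try (PreparedStatement header = db.prepareStatement(
                 "INSERT INTO cart_header (cart_id, customer_id, total, checked_out_at) VALUES (?, ?, ?, ?)");
             PreparedStatement item = db.prepareStatement(
                 "INSERT INTO cart_item (cart_id, product_name, quantity, price) VALUES (?, ?, ?, ?)")) {
            header.setString(1, e.cartId());
            header.setString(2, e.customerId());
            header.setBigDecimal(3, e.total());
            header.setTimestamp(4, java.sql.Timestamp.from(e.at()));
            header.executeUpdate();
            for (CartItem i : e.items()) {
                item.setString(1, e.cartId());
                item.setString(2, i.productName());
                item.setInt(3, i.quantity());
                item.setBigDecimal(4, i.price());
                item.addBatch();
            }
            item.executeBatch();
        } catch (SQLException ex) {
            throw new RuntimeException(ex);
        }
    }
}

Replaying the event stream through this listener rebuilds the read model from scratch.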

The code to display the cart will only need to read those two tables. No deadlocks possible.

The code will be incredibly simple.

If you make a mistake somewhere, you can always replay all the events with the fixed code.

For tests, you can replay the events. No need for a human to click buttons in a web browser (not more than once, anyway).

Since you don’t need foreign keys unless you want to, you can spread the data model over different databases, computers, data centers. Some data would be better in a NoSQL repository? No problem.

Something crashes? Fix the problem, replay the events which got lost.

Instead of developing one huge monster model where each change possibly dirties some existing feature, you can imagine CQRS as developing thousands of mini-applications that work together.

And the best feature: It allows you to retroactively add features. Imagine you want to give users credits for some action. The idea is born one year after the action was added. In a traditional application, it will be hard to assign credit to the existing users. With CQRS, you simply implement the feature, set up the listeners, disable the listeners which already ran (so the action isn’t executed again) and replay the events. Presto, all the existing users will have their credit.
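
A sketch of what that looks like, reusing the classes from above (CreditService and the event store call are made up):

// A new listener awards credit for an event type that has existed for a year.
interface CreditService { void award(String customerId, int points); }

class CreditListener implements EventListener {
    private final CreditService credits;

    CreditListener(CreditService credits) { this.credits = credits; }

    @Override public void on(Object event) {
        if (event instanceof CartCheckedOut e) {
            credits.award(e.customerId(), 10);        // the brand-new feature
        }
    }
}

// One-off backfill: replay the whole history through the new listener only,
// so none of the old listeners run their actions a second time.
// new EventReplayer(List.of(new CreditListener(credits))).replay(eventStore.allEventsOldestFirst());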

Jazoon 2012: Building Scalable, Highly Concurrent and Fault-Tolerant Systems: Lessons Learned

29. June, 2012

What do Cloud Computing, multi-core processors and Big Data have in common?

Parallelism.

In his presentation, Jonas Bonér showed what you should care about:

  • Always prefer immutable data
  • Separate concerns in different layers with the minimum amount of dependencies
  • Separate error handling from the business logic
  • There is no free lunch: For every feature, you will have to pay a price
  • Avoid using RPC/RMI. They lure you into “convenience over correctness”
  • Make sure you handle timeouts correctly
  • Use CALM if you can
  • Not all your data needs ACID.
  • Know about CAP and BASE (“Drop ACID And Think About Data”)
  • Get rid of dependencies by using event sourcing/CQS/CQRS
  • Frameworks like Hibernate always leak, in places where you can’t afford it. KISS.

Longer explanation:

Immutables can always be shared between threads. Usually, they are also simple to share between processes, even when they run on different computers. Trying to be clever with locks and concurrency will only get you more bugs, unmaintainable code and a heart attack.

Dependencies kill a project faster and more efficiently than almost any other technique. Avoid them. Split your projects into Maven modules. You can’t import what you don’t have on the classpath.

Error handling in your business logic (BL) will bloat the code and make it harder to maintain. Business logic can’t handle database failures. Parameters should have been validated before they were passed to business logic. Business logic should produce a result and the caller should then decide what to do with it (instead of mixing persistence code into your business layer). The BL shouldn’t be aware that the data comes from a database or that the result goes back into a database. What would your unit tests say? See also Akka 2.0 and “parental supervision.”
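
A small sketch of that separation (hypothetical names, not from the talk): the business logic takes validated input and returns a result; the caller decides what to do with it.

import java.math.BigDecimal;

record Order(String id, BigDecimal amount) { }
record Invoice(String orderId, BigDecimal amountWithTax) { }

interface InvoiceRepository { void save(Invoice invoice); }

// Pure business logic: no repository, no session, trivially unit-testable.
class InvoiceCalculator {
    Invoice invoiceFor(Order order, BigDecimal taxRate) {
        BigDecimal tax = order.amount().multiply(taxRate);
        return new Invoice(order.id(), order.amount().add(tax));
    }
}

// The caller owns persistence and error handling.
class InvoiceService {
    private final InvoiceCalculator calculator = new InvoiceCalculator();
    private final InvoiceRepository repository;

    InvoiceService(InvoiceRepository repository) { this.repository = repository; }

    void createInvoice(Order order) {
        Invoice invoice = calculator.invoiceFor(order, new BigDecimal("0.08"));
        repository.save(invoice);                 // persistence decision is made here, not in the BL
    }
}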

Obvious programming has a value: You can see what happens. It has a price: Boilerplate code. You can try to hide this but it will still leak. Hibernate is a perfect example of this. Yes, it hides the fact that getChildren() needs to run a query against the database – unless the entity leaks outside of your transaction. It does generate proxies to save you from seeing the query but that can break equals().

Same applies to RMI. When RMI decides that you can’t handle the message, then you won’t even see it. In many cases, a slightly “unusual” message (like one with additional fields) wouldn’t hurt.

As soon as you add RMI or clustering, you add an invisible network to your method calls. Make sure you have the correct timeouts (so your callers don’t block forever) and that you handle them correctly. New error sources caused by adding the network:

  1. Failure to serialize the message
  2. Host unreachable
  3. Packet drops
  4. Network lag
  5. Destination doesn’t accept message because of configuration error
  6. Message is sent to the wrong destination
  7. Destination can’t read message
Claim checks allow you to resend a message after a timeout without having it processed twice by the consumer.
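
The consumer side of that idea can be as small as this sketch (hypothetical; in a real system the set of processed IDs would live in a database table rather than in memory):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Remembers which message IDs were already handled, so a resend after a timeout is a no-op.
class IdempotentConsumer {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    void onMessage(String messageId, Runnable handler) {
        if (!processed.add(messageId)) {
            return;                 // duplicate delivery: already processed, ignore it
        }
        handler.run();
    }
}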

CALM and BASE refer to the fact that you can only have two of the three CAP characteristics: Consistency, Availability and Partition Tolerance. Since Partition Tolerance (necessary for scaling) and Availability (what’s the point of having a consistent but dead database?) are most important, you have to sacrifice consistency. CALM and BASE show ways to eventually reach consistency, even without manual intervention. For all data related to money, you will want consistency as well but think about it: How many accounts are there in your database? And how many comments? Is ACID really necessary for each comment?

Solution: Put your important data (when money is involved) into an old school relational database. Single instance. Feed that database with queues, so it doesn’t hurt (much) when it goes down once in a while. Put comments, recommendations, shopping carts into a NoSQL database. So what if a shopping cart isn’t synchronized over all your partitions? Just make sure that users stay on one shard and they will only notice when the shard dies and you can’t restore the shopping cart quickly enough from the event stream.

Which event stream? The one which your CQRS design created. More on that in another post. You might also want to look at Akka 2.0 which comes with a new EventBus.


What’s Wrong With XA/Two Phase Commit

27. June, 2011

To recap, two phase commit (XA) means that you have two or more systems which take part in a single transaction. The first phase asks all systems “are you 100% sure that you can commit?” (prepare) and the second phase is the actual commit.

Of course, the answer to the first question can be a lie. You can never be 100% sure that a commit will go through. The system might crash in the middle of the actual commit or the network may break between prepare and commit. Doom.

So XA doesn’t work when it should. How can you solve this?

By using a less brittle protocol. Imagine you want to copy data from database A to B. You could use XA and clean up the mess every once in a while.

Or you could add a field to A which says “this has been copied to B” and when inserting data into B, you must ignore data that is already there. Here is the pseudo code:

  1. Create the connections
  2. Find all rows in A that don’t have the flag set
  3. Read all the rows and insert them into B. If a row already exists, skip it.
  4. Set the flags for all copied rows in A.
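
The same four steps, sketched in JDBC. The table and column names are made up, and it assumes auto-commit is off on both connections:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Copies rows from A to B without XA: re-runnable after any crash, never loses data.
class CopyJob {
    void copy(Connection a, Connection b) throws SQLException {                   // step 1
        List<Long> copied = new ArrayList<>();

        try (PreparedStatement select = a.prepareStatement(
                 "SELECT id, payload FROM source_row WHERE copied = 0");          // step 2
             PreparedStatement exists = b.prepareStatement(
                 "SELECT 1 FROM target_row WHERE id = ?");
             PreparedStatement insert = b.prepareStatement(
                 "INSERT INTO target_row (id, payload) VALUES (?, ?)");
             ResultSet rs = select.executeQuery()) {
            while (rs.next()) {                                                   // step 3
                long id = rs.getLong("id");
                exists.setLong(1, id);
                try (ResultSet found = exists.executeQuery()) {
                    if (found.next()) { copied.add(id); continue; }               // already in B: skip
                }
                insert.setLong(1, id);
                insert.setString(2, rs.getString("payload"));
                insert.executeUpdate();
                copied.add(id);
            }
        }
        b.commit();                            // commit B first: worst case the flag is missing
                                               // in A and the row is simply copied again next run
        try (PreparedStatement flag = a.prepareStatement(
                 "UPDATE source_row SET copied = 1 WHERE id = ?")) {
            for (long id : copied) {                                              // step 4
                flag.setLong(1, id);
                flag.executeUpdate();
            }
        }
        a.commit();
    }
}

If the job dies between the two commits, the flags in A are stale, and the next run simply finds those rows already present in B and skips them.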

Note: You can’t use MAX(ID) or MAX(TIMESTAMP) here. Why not? Imagine:

  1. TX1 is created in A and a row is inserted. MAX(ID) == 2
  2. TX2 is created in A and a row is inserted. MAX(ID) == 3! At this time, you can’t read WHERE ID == 2 but you can conclude from the gap in the ID values that it will soon exist.
  3. TX2 is committed.
  4. TX3 is created to copy the data to B. TX1 is still running! MAX(ID) == 3 but row 2 will not be copied!
  5. TX3 ends without row 2. Since MAX(ID) is now 3, it won’t ever be copied.

If you’d rather avoid a “to copy” or “has been copied” flag in each row, create a “to transfer” table which contains the IDs of the rows to copy. In step #4, delete the copied rows from that table instead of setting flags.

Advantages:

  1. Resilient. If the transfer fails in the middle for any reason, it can pick up where it left off. In the worst case, a lot of data will be copied again but data will never be lost. In the usual case, no data will be copied twice.
  2. If you make a mistake, chances are that it won’t matter much. Say you forget to set the “has been copied” flag. Well, for every transfer, too much data will be copied but it will still work. Slower than expected but you won’t lose data. The database will always be consistent!
  3. Say something goes wrong in B and you need to transfer the data again. With my approach, you just reset the flags. It doesn’t matter if you can’t say for sure which flags to reset. If in doubt, reset all of them. The algorithm will heal itself.
  4. You can copy the data in chunk sizes of your choice. It doesn’t matter if you copy everything in one go or in blocks of 1’000 rows or row by row.
  5. It’s a simple algorithm, so it will be quick to implement and there is only a small chance for bugs. Even if there is a bug, in most cases, you will be able to resolve the situation with one or two simple SQL queries.

Disadvantages:  The XA sales guy won’t make a buck from it. Expect some resistance.

 


Jazoon 2011, Day 2 – NoSQL – Schemaless Data-stores Not Only for the Cloud – Thomas Schank

26. June, 2011


Thomas gave an overview of some NoSQL databases and the theoretical background of it.

The main points are: SQL databases get inefficient as the data grows and if you need to split the data between instances (how do you join tables between two DB servers? Even if you can, performance is going to suffer).

But there are new problems: Data can be inconsistent for a while (keyword: MVCC).

OTOH, these databases don’t need locks and, as Amazon demonstrated, any kind of lock will eventually become a bottleneck:

Each node in a system should be able to make decisions purely based on local state. If you need to do something under high load with failures occurring and you need to reach agreement, you’re lost. If you’re concerned about scalability, any algorithm that forces you to run agreement will eventually become your bottleneck. Take that as a given.

Werner Vogels, Amazon CTO and Vice President

And since the servers can heal inconsistencies, broken connections don’t mean the end of the world. The acronym of the hour is BASE: Basically Available, Soft-state, Eventually consistent.

Interesting stuff, especially since every company owns a super-computer today. My own team has one with 64 cores, 64GB of RAM and 16TB of disk space sitting in 8 boxes spread under the desks of the developers. Not much but it only costs $8 000. And if we need more, we can simply buy another node. Unfortunately, it’s hard to leverage this power. I’ll come back to that in a later post.

One thing to take with you: If you can’t stand JavaScript but you need to write it, have a look at CoffeeScript.

Declaring Relations Between Model Instances

14. March, 2011

I’ve been playing with Xtext 1.0 for the last three weeks. The editor support is most impressive. It’s one of those well-balanced technologies that work “right” out of the box and offer lots of places to hook in and tweak things.

The last few days, I’ve been chewing on a hard problem: Relations. I could do this:

class A {
}

class B {
    A parent 1:*;
}

Read: Instances of B have a field “parent” that references instances of A. There is a list in A with all instances of B where the field “parent” is that exact instance. In simple words, it’s a parent-child relation between A and B. For each B, there is always one A. Each A can have several B’s attached.

Or like this:

class A {
}

class B {
}

relation A children 1:* B parent;

The first solution avoids repetition but when looking at A, you will miss the fact that it has fields. It looks empty but the declaration in B adds a field. I don’t like it. While it might work in this case, how do you map N:M? Where do you declare a many-to-many relation? In A or in B? My guts say: Don’t do it.

The second solution is also flawed, if less so. Here, both classes are empty. I have to look up the relations to see what else is going on. And I have to repeat information. I don’t like it.

Time to take a step back and sleep over it.

Fast forward over the weekend.

How about this:

class A {
}
parent of
class B {
}

One word: Intent. A construct in a programming language should clearly communicate intent. Which is one of the weak points in Java: It’s hard to express what you intend with Java. It’s not impossible but it takes years of experience to avoid the luring shortcuts.

So this new syntax clearly states what I want: A’s are parents of B’s. No repetition. No odd “1:*” which every reader has to decode. Which you can get wrong.

But what if A is parent of more classes? Simple:

class A { }
parent of {
    class B { } parent of class X { }
    class C { }
    ...
}

See? It nests. Let’s add tree-like structures to the mix:

tree class A { }
parent of {
    class B { } parent of class X { }
    class C { }
    ...
}

So A gets a parent-child relation with itself. One word says everything: Ownership. Cardinality. Field names.

My gut likes it. Listen to your guts. Especially when a simple solution tries to lure you into a shortcut.


TNBT: Persistence

19. February, 2011

In this issue of “The Next Big Thing”, I’ll talk about something that every piece of software uses and which is always developed again from scratch for every application: Persistence.

Every application needs to load data from “somewhere” (user preferences, config settings, data to process) and after processing the data, it needs to save the results. Persistence is the most important feature of any software. Without it, the code would be useless.

Oddly, the most important area of the software isn’t a shiny skyscraper but a swamp: Muddy, boggy, suffocating.

Therefore, the next big thing in software development must make loading and saving data a joy. Some features it needs to have:

  • Transaction contexts to define which data needs to be rolled back in case of an error. Changes to the data model must be atomic by default. Even if I add 5,000 elements at once, either all or none of them must be added when an error happens.
  • Persistence must be transparent. The language should support rules for how to transform data for a specific storage (file, database) but these should be generic. I don’t want to poison my data model with thousands of annotations.
  • All types must support persistence by default; not being able to be persisted must be the exception.
  • Creating a binary file format must be as simple as defining the XML format.
  • It must have optimizers (which run in the background like garbage collection does today) that determine how much of the model graph needs to be loaded from storage.

Related Articles:

  • The Next Best Thing - Series in my blog where I dream about the future of software development

ORA-01461: can bind a LONG value only for insert into a LONG column tips

23. August, 2010

Should you ever encounter this message:

ORA-01461: can bind a LONG value only for insert into a LONG column

then chances are you tried to insert more than 4096 characters at once.

Again, the prize for useless error messages goes to … Oracle! ;-)

To add insult to injury, I couldn’t find anything useful related to this error with Oracle’s own site search.


Testing the Impossible: Shifting IDs

7. October, 2009

So you couldn’t resist and wrote tests against the database. With Hibernate, no less. And maybe some Spring. And to make things simpler, toString() contains the IDs of the objects in the database. Now, you want to check the results of queries and other operations using my advice on how to verify several values at once.

During your first run, your DB returns some values, you copy them into a String. When you run the tests again, the tests fail because all the IDs have changed. Bummer. Now what? Just verify that the correct number of results was returned?

Don’t. assertEquals (3, list.size()) doesn’t give you a clue where to search for the error when the assert fails.

The solution is to give each ID a name. The name always stays the same and when you look at the test results, you’ll immediately know that the ID has been “fixated” (unlike when you’d replace the number with another, for example). Here is the code:

import java.util.*;

/** Helper class to fixate changing object IDs */
public class IDFixer
{
    private Map<String, String> replacements = new HashMap<String, String> ();
    private Map<String, Exception> usedNames = new HashMap<String, Exception> ();
    
    /** Assign a name to the number which follows the string <code>id=</code> in the object's
     * <code>toString()</code> method. */
    public void register (Object o, String name)
    {
        if (o == null)
            return;
        
        String s = o.toString ();
        int pos = s.indexOf ("id=");
        if (pos != -1)
            pos = s.indexOf (',', pos);
        if (pos == -1)
            throw new RuntimeException ("Can't find 'id=' in " + s);
        String search = s.substring (0, pos + 1);
        
        pos = search.lastIndexOf ('=');
        String replace = search.substring (0, pos + 1) + name + ",";
        
        add (name, search, replace);
        
        registerSpecial (o, name);
    }

    /** Add a search&replace pattern. */
    protected void add (String name, String search, String replace)
    {
        if (usedNames.containsKey (replace))
            throw new RuntimeException ("Name "+name+" is already in use", usedNames.get (replace));
        
        //System.out.println ("+++ ["+search+"] -> ["+replace+"]");
        usedNames.put (replace, new Exception ("Name was registered here"));
        replacements.put (search, replace);
    }
    
    /** Allow for special mappings */
    protected void registerSpecial (Object o, String name)
    {
        // NOP
    }

    /** Turn a <code>Collection</code> into a <code>String</code>, replacing all IDs with names. */
    public String toString (Collection c)
    {
        StringBuilder buffer = new StringBuilder (10240);
        String delim = "";
        for (Object o: c)
        {
            buffer.append (delim);
            delim = "\n";
            buffer.append (o);
        }
        return toString (buffer);
    }
    
    /** Turn a <code>Map</code> into a <code>String</code>, replacing all IDs with names. */
    public String toString (Map m)
    {
        StringBuilder buffer = new StringBuilder (10240);
        String delim = "";
        for (Iterator iter=m.entrySet ().iterator (); iter.hasNext (); )
        {
            Map.Entry e = (Map.Entry)iter.next ();
            
            buffer.append (delim);
            delim = "\n";
            buffer.append (e.getKey ());
            buffer.append ('=');
            buffer.append (e.getValue ());
        }
        return toString (buffer);
    }
    
    /** Turn an <code>Object</code> to a <code>String</code>, replacing all IDs with names.
     * 
     *  <p>If the object is a <code>Collection</code> or a <code>Map</code>, the special collection handling methods will be called. */
    public String toString (Object o)
    {
        if (o instanceof Collection)
        {
            return toString ((Collection)o);
        }
        else if (o instanceof Map)
        {
            return toString ((Map)o);
        }
        
        if (o == null)
            return "null";

        String s = o.toString ();
        for (Map.Entry<String, String> entry: replacements.entrySet ())
        {
            s = s.replace (entry.getKey (), entry.getValue ());
        }
        
        return s;
    }

    public boolean knowsAbout (String key)
    {
        return replacements.containsKey (key);
    }
}

To use the IDFixer, call register(object, name) to assign a name to the number that follows id= in the result of calling object.toString().

If you call IDFixer.toString(object) for a single object or a collection, it will replace id=number with id=name everywhere. If you have references between objects, you can register additional patterns to be replaced by overriding registerSpecial().

To make it easier to compare lists and maps in assertEquals(), each item gets its own line.
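
Here is a short usage sketch (JUnit 4; the Order class is a made-up stand-in for a Hibernate entity whose toString() contains “id=<number>,”):

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class IDFixerTest
{
    /** Stand-in for a persistent entity whose toString() contains the database ID. */
    static class Order
    {
        long id;
        String customer;

        Order (long id, String customer) { this.id = id; this.customer = customer; }

        @Override
        public String toString () { return "Order[id=" + id + ", customer=" + customer + "]"; }
    }

    @Test
    public void idsAreReplacedByStableNames ()
    {
        IDFixer fixer = new IDFixer ();
        Order order = new Order (4712, "Alice");   // in a real test, this would come from a query

        fixer.register (order, "ALICE_ORDER");

        // The numeric ID changes with every test run; the name never does.
        assertEquals ("Order[id=ALICE_ORDER, customer=Alice]", fixer.toString (order));
    }
}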


Traits for Groovy/Java

25. June, 2009

I’m again toying with the idea of traits for Java (or rather Groovy). Just to give you a rough idea if you haven’t heard about this before, think of my Sensei application template:

class Knowledge {
    Set tags;
    Knowledge parent;
    List children;
    String name;
    String content;
}
class Tag { String name; }
class Relation { String name; Knowledge from, to; }

A most simple model but it contains everything you can encounter in an application: Parent-child/tree structure, 1:N and N:M mappings. Now the idea is to have a way to build a UI and a DB mapping from this code. The idea of traits is to implement real properties in Java.

So instead of fields with primitive types, you have real objects to work with:

    assert "name" == Knowledge.name.getName()

These objects exist partially at the class and at the instance level. There is static information at the class level (the name of the property) and there is instance information (the current value). But it should be possible to add more information at both levels. So a DB mapper can add necessary translation information to the class level and a Hibernate mapper can build on top of that.
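
A very rough sketch of the idea in plain Java (hypothetical API, not the Groovy solution hinted at below; in Java the class-level part needs its own field, written here in upper case):

import java.util.HashMap;
import java.util.Map;

/** Class level: one instance per declared property, shared by all objects of the class. */
class PropertyDef<T>
{
    final String name;
    final Class<T> type;
    final Map<String, Object> metadata = new HashMap<String, Object> ();   // DB column, UI label, …

    PropertyDef (String name, Class<T> type) { this.name = name; this.type = type; }

    String getName () { return name; }
}

/** Instance level: the current value plus a pointer back to its definition. */
class Property<T>
{
    final PropertyDef<T> def;
    T value;

    Property (PropertyDef<T> def) { this.def = def; }
}

class Knowledge
{
    static final PropertyDef<String> NAME = new PropertyDef<String> ("name", String.class);

    final Property<String> name = new Property<String> (NAME);
}

A database mapper could then put its column information into the class-level metadata once, and a Hibernate or UI mapper could build on top of that.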

Oh, I hear you cry “annotations!” But annotations can suck, too. You can’t have smart defaults with annotations. For example, you can’t say “I want all fields called ‘timestamp’ to be mapped with a java.sql.Timestamp”. You have to add the annotation to each timestamp field. That violates DRY. It quickly gets really bad when you have to do this for several mappers: Database, Hibernate, JPA, the UI, Swing, SWT, GWT. Suddenly, each property would need 10+ annotations!

I think I’ve found a solution which should need relatively few lines of code with Groovy. I’ll let that stew for a couple of days in my subconscious and post another article when it’s well done :)


Jazoon, Day 2: XWiki

24. June, 2009

I’m a huge fan of wikis. Not necessarily MediaWiki (holy ugly, Batman, the syntax!). I dig MoinMoin. I just heard the talk by Vincent Massol about next generation wikis. Okay, I can hear you moan under the load of buzzwords but give me a moment. XWiki looks really promising.

Wikis basically allow you to publish mostly unstructured data. I say “mostly” because wikis give it some structure but not (too) much: You can organize it in pages and put it into context (by linking between the pages). Often, this is enough. But recently, MediaWiki has started to add support for structured data. See that infobox in the top left corner? But that’s just half-hearted.

XWiki takes this one step further. XWiki, as I understand it, is a framework of loosely coupled components which allow you to create a wiki. The default one is pretty good, too, so most of the time, you won’t even get into this. The cool part about XWiki is that you can define a class inside of it. Let me repeat: You can create a page (like a normal text page) that XWiki will treat as a class definition. So this class gets versioned, etc. You can then add attributes as you like.

After that, you can create instances of this class. The instances are again wiki pages. You can even use more than a single instance on a page, for example, you can have several tag instances and a single person instance. Instances are versioned, too. Of course they are, this is a wiki!

Now you need to display that data. You can use Velocity or Groovy for that. And guess what, the view is … a wiki page. So your designers can create a beautiful look for the boring raw data. With versions and comments and everything. While some other guys are adding data to the system.

In “normal” wiki pages, you can reference these instances and render them using such a template. The same is true for editors. With a few lines of code, you can create overview pages: All instances of a class or all instances with or without a certain property or you can use Groovy to do whatever you can think of.

Now imagine this: You have an existing database where your marketing guys can, say, plan the next campaign. They can use all the wiki features to collect ideas, filter and verify them, to come up with a really good plan. Some of that data needs to go into a corporate database. In former wikis, you’d have to use an external application, switch back and forth, curse a lot when they get out of sync.

With XWiki, you can finally annotate data in your corporate database with a wiki page, with all the power of a wiki and you can even display the data set in the wiki and edit it there. Granted, the data set won’t be versioned unless your corporate database allows that but it’s simple to do the versioning in the data access layer (for example, you can save all modifications in a log database).

Suddenly, possibilities open up.

