Update to DecentXML

30. July, 2008

I’ve updated my XML parser. The tests now cover 97.7% of the code (well, actually 100% of the code which can be executed; there are a couple of exceptions which will never be thrown but I still have to handle them) and there are classes to read XML from InputStream and Reader sources (including encoding detection).

The XMLInputStreamReader class can be used standalone, if you ever want to read an XML file with the correct encoding.
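The encoding detection such a class has to do starts with the byte order mark, then falls back to peeking at the `<?xml … encoding="…"?>` declaration. Here is a toy sketch of the BOM part; the class and method names are mine, not DecentXML’s:

```java
/** Toy sketch of XML encoding detection via the byte order mark (BOM). */
class EncodingSniffer {

    /** Returns the encoding implied by the BOM, or "UTF-8" as the XML default. */
    static String detect(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF, b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        }
        // No BOM: XML 1.0 says UTF-8 unless the <?xml?> declaration states otherwise.
        return "UTF-8";
    }

    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<' };
        System.out.println(detect(utf8)); // UTF-8
    }
}
```

A complete implementation would also handle UTF-32 BOMs and sniff the encoding pseudo-attribute when no BOM is present.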

You can download the sources and report issues in the new Google Code project I’ve created.


Information Management With Zotero

30. July, 2008

I’ve long been looking for a nice tool to manage my vast extra-brain information collection, i.e. the stuff that I don’t want to commit to long-term memory. Web snippets, notes, that kind of stuff. None of the usual solutions appealed to me: either I was locked to Windows or to a single computer, or the UI was bad, or the feature list lacked some important points.

Zotero to the rescue. This beast is advertised as “Zotero [zoh-TAIR-oh] is a free, easy-to-use Firefox extension to help you collect, manage, and cite your research sources. It lives right where you do your work — in the web browser itself.”

Which makes sense. I view most of my information in my web browser, so why not collect it in there, too? The UI is nice; I’m just missing a few features. Also, being able to sync with my own server would be nice. But I’m sure that will be fixed soon. In the meantime, I can at least tag and order my snippets.


A Decent XML Parser

29. July, 2008

Since there isn’t one, I’ve started writing one myself. Main features:

  • Allows 100% round-tripping, even for weird whitespace between attributes in elements
  • Suitable for building editors and filters which want to preserve the original file layout
  • Error messages have line and column information
  • Easy to reuse
  • XML 1.0 compatible

You can download the latest sources here as a Maven 2 project.
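The round-tripping guarantee boils down to one design decision: the parser must keep every character of the input, including the whitespace between attributes, as part of some token. A toy illustration of the idea (this is not DecentXML’s actual API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy lexer that splits XML into tags and text without losing a single character. */
class RoundTrip {
    // Every character is either inside a <...> tag or part of a text run.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<]+");

    static List<String> tokenize(String xml) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(xml);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Because no character is dropped, joining the tokens restores the input exactly. */
    static String rebuild(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String t : tokens) sb.append(t);
        return sb.toString();
    }

    public static void main(String[] args) {
        String xml = "<root  a = '1'\n   b='2' ><child/> text </root>";
        System.out.println(rebuild(tokenize(xml)).equals(xml)); // true
    }
}
```

A real round-tripping parser does the same at a finer granularity, keeping the whitespace inside each tag attached to the attribute tokens.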


DSLs: Introducing Slang

28. July, 2008

Did you ever wish for a more compact way to express something in your favorite programming language? Say hello to the DSL (Domain Specific Language). A DSL is a slang, a compact way to say what you want. When two astronauts talk, they use slang. They need to get information across, and fast. “Over” instead of “I’ll now clear the frequency so you can start talking.” And if these guys do it, there’s no reason for us not to.

Here is an article on JavaWorld which gives some nice examples of how to create a slang in Java and in Groovy. Pizza lovers of the world, eat your hearts out.
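In Java, the usual trick is a fluent interface: every method returns `this`, so the calls chain into something sentence-like. A minimal sketch with my own `Pizza` example (not the one from the article):

```java
/** A tiny fluent "slang" for ordering pizza: method chaining reads like a sentence. */
class Pizza {
    private final StringBuilder order = new StringBuilder("pizza");

    Pizza with(String topping) {
        order.append(" + ").append(topping);
        return this; // returning this is what makes the calls chain
    }

    Pizza large() {
        order.insert(0, "large ");
        return this;
    }

    String order() {
        return order.toString();
    }

    public static void main(String[] args) {
        String o = new Pizza().large().with("cheese").with("olives").order();
        System.out.println(o); // large pizza + cheese + olives
    }
}
```

The same idea carries much further in Groovy, where you can drop parentheses and dots until the call chain looks like plain English.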


FREE! Really.

28. July, 2008

I just found a nice comment under my blog. It offered a free service. One sentence was: “REGISTRATION IS ABSOLUTELY FREE!” When you see that, you know you’re being ripped off. I’m not mentioning the name of the guys who tried that stunt, so as not to give them any additional advertising. ‘Nuff said.

Tip: If you want me to join your planet or RSS mega feed or whatever, it’s not smart to post a comment in my blog. This is my blog, my reputation, my honor. I decide who gets free advertisement here.


Testing With Databases

26. July, 2008

When it comes to testing, databases are kind of a sore spot. People like to think that “you can’t test when you need a database,” or “it’s too complicated,” or “it’s not worth it.” I’d like to give you some ideas about what you can do when you need to test code that depends on a database. This list is sorted in the order in which I try to tackle the problem:

  1. Use POJOs to carry the data from the database in the real code; for the tests, create dummy objects filled with test data and use those.
  2. Make the database layer a plug-in of your application and, for testing, replace it with a mockup that doesn’t need the database and returns test objects instead.
  3. Instead of connecting to the real database, get H2 or Derby and use an embedded or at least local database. I prefer H2 because it’s smaller, starts faster (and tests should always be fast) and it can emulate a ton of SQL dialects. But Derby has some features which H2 is missing.
  4. Create a second instance of the production database system on a different machine, preferably your own computer.
  5. Create another instance of the real database with test data on the same machine as the real database.
  6. Use database schemas to create a logical test database inside the real database: for example, if all tables are in the schema APP, create APP_TEST and, in your code, add a way to replace the schema name in the SQL statements. If you wrote the DB layer yourself, switch the schema with a system property which isn’t set in production. If you’re using Hibernate, walk the mapping objects after loading the production configuration and replace the table names. Field.setAccessible(true) is your friend.
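Options 1 and 2 are the cheapest: hide the database behind an interface and swap in a fake for the tests. A minimal sketch of option 2 (all names are mine):

```java
import java.util.Arrays;
import java.util.List;

/** The DB layer is a plug-in: the code under test only sees this interface. */
interface UserRepository {
    List<String> findNames();
}

/** Production would wrap JDBC; the tests use this in-memory stand-in instead. */
class FakeUserRepository implements UserRepository {
    public List<String> findNames() {
        return Arrays.asList("alice", "bob");
    }
}

/** Business logic under test; it never touches a real database. */
class Greeter {
    private final UserRepository repo;

    Greeter(UserRepository repo) {
        this.repo = repo;
    }

    String greetAll() {
        StringBuilder sb = new StringBuilder();
        for (String name : repo.findNames()) {
            if (sb.length() > 0) sb.append(", ");
            sb.append("hello " + name);
        }
        return sb.toString();
    }
}

class DbTestSketch {
    public static void main(String[] args) {
        Greeter g = new Greeter(new FakeUserRepository());
        System.out.println(g.greetAll()); // hello alice, hello bob
    }
}
```

The same constructor later accepts the real JDBC-backed implementation in production, so nothing in the business logic changes.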

If you can’t decide, here are a few hints:

Creating two databases using schemas in the same instance can get you into serious trouble without you noticing. For example, the tests should be able to rebuild the test database from scratch at the press of a button, so you can be sure which state the database is really in. If you make a mistake with the schema name during that setup, you’ll destroy the real database. And you might not notice that you did, because the flawed statement is usually hidden under a few hundred others.

Installing a second instance on a different machine might trigger license fees, or your DB admin might not like it for some reason. Also, a test database should be very flexible, because you’ll need to drop and recreate it a dozen times per hour. Your DB admins might not like to give you the necessary rights to do that. Lastly, this means only one developer can run all the tests at any given point in time, because you’re all going against the same database. This is bad, really bad. More often than not, you’ll get spurious errors because of that.

If you can legally get a copy of the real database on your own machine, that’s cool … until you see the memory, CPU and hard disk requirements, plus a DB admin will probably hog your machine for a day or two to install it. Having to run two applications which each need 1GB of RAM (your IDE and the DB) on a machine that has only 1GB of RAM isn’t going to be fun.

For many cases, using HSQLDB or Derby is a good compromise between all the forces that pull at you. While such tests are still slower than tests without any database, they will often run much faster than against the real DB. You can install these databases as many times as you like, without any license issues, fees or DB admins bothering you. They don’t take much memory or hard disk space, and they are under your total control.

Only, they are not the real DB. There might be slight differences, hidden performance issues and other stuff that you won’t notice. Don’t worry about that, though. If you can test your application, you’ll find that you can fix any problems which come up against the real database in little time. If you can’t test your application, though, well, you’re doomed.

I strongly recommend being able to set up the database from scratch automatically. With Derby, you can create a template database and clone it on the first connection. With HSQLDB, loading data is so fast that you can afford to rebuild the database with INSERT statements every time you run the tests.

Still, test as much code as possible without a database connection. For one thing, any test without a DB will run 100 to 1000 times faster. For another, a DB connection adds several points of failure which are really outside the scope of your test. Remember, every test in your suite should test exactly one thing. But if you can’t get rid of the connection, you’re testing the feature you want plus the DB layer plus the DB connection plus the DB itself. On top of that, you’ll have the overhead of setup, etc., and it will be hard to run a single test from your suite.

At the end of the day, testing needs to be fun. If the tests ever become the biggest obstacle to being productive, you wouldn’t be the good developer you are if you didn’t get rid of them.

One last thing: Do not load as much data as possible! It is a common mistake to think that your tests will be “better” if you have “as much data as possible.” Instead, load as little data as possible to make the tests work. When you find a bug, add as little data as possible to create a test for this bug. Otherwise, you’ll clog your database with useless junk which a) costs time, b) no one can tell apart from the useful stuff, and c) gives you a false feeling of safety that isn’t there.

If you don’t know which data is useful and which isn’t, then you don’t know. Loading huge amounts of junk into your database won’t change that. In order to learn, you must start with what you know and work from there. Simply copying the whole production system will only slow you down, and it will overwrite the carefully designed test cases you inserted yesterday.


Nexus, a Maven Repository Manager

26. July, 2008

If you’re using Maven in a corporate environment, then you’re struggling with the same problems all over again: How to make sure that the build builds?

While it looks like a simple task at first glance, there are a few hidden obstacles which boil down to two things: downloads via the Internet, and plugin or dependency version stability. Both can be solved by using a proxy or an in-house repository.

The guys from Sonatype have been busy in the last few months and have released Nexus 1.0.0-beta-4.2, which gives you another option to choose from besides Archiva or DSMP (my own Maven 2 proxy). I tried Nexus yesterday and I have to say that I’m very pleased with the result. As usual for Open Source Software, the beta is more stable than some post-beta commercial products, and it delivers with very little setup (follow the link to see the documentation).

Now, we have the second issue: version stability. Here is my recipe to achieve it. First of all, pin the version of everything in your POM. All dependencies, all plugins, everything. I’m using properties for that, which I define in a common parent POM, plus I’m using dependencyManagement. Maven 2.0.9 helps a lot here because it forces you to add version elements everywhere.
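What that pinning can look like in the parent POM; the artifact and version number below are only placeholders:

```xml
<properties>
  <commons-lang.version>2.4</commons-lang.version>
</properties>

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>commons-lang</groupId>
      <artifactId>commons-lang</artifactId>
      <version>${commons-lang.version}</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Child POMs then declare the dependency without a version element and inherit the pinned one, so bumping a version happens in exactly one place.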

The next step is to make sure the Maven builds can find their stuff. To do that, I suggest setting up two Nexus repositories. The first one is the “build” repository, the second one is the “cache” repository. While all developers should use the “build” repository, the “cache” repository can actually download dependencies from the Internet.

The “build” repository, on the other hand, is just a local repository with no Internet connection. To avoid mistakes, I suggest installing the build repo with the default settings but with all remote repositories deleted or turned into local ones. The “cache” repository should run on an unusual port and with the remote repositories enabled as described in the installation documentation.

Next, you need to create a profile in your settings.xml which switches between the two. When you want to check out a new version of some plugin, switch to the cache repository and have it download all the new stuff. This will pollute your local copy of the Maven repository, but only yours. After you have verified that the build completes (or fixed all the problems you’ve got), check the RSS feeds of Nexus for the stuff it downloaded. Then all you have to do is copy those artifacts to the “build” repository. After a refresh, all the other developers in your company can use the new, verified downloads.
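Since Maven 2 doesn’t allow mirrors inside profiles, one way to get the switch is two profiles which point at different repository URLs. A sketch of the settings.xml; hosts, ports and paths are placeholders:

```xml
<settings>
  <profiles>
    <profile>
      <id>build-repo</id>
      <repositories>
        <repository>
          <id>nexus</id>
          <url>http://nexus.example.com:8081/nexus/content/groups/public</url>
        </repository>
      </repositories>
    </profile>
    <profile>
      <id>cache-repo</id>
      <repositories>
        <repository>
          <id>nexus</id>
          <url>http://nexus.example.com:8082/nexus/content/groups/public</url>
        </repository>
      </repositories>
    </profile>
  </profiles>
</settings>
```

Day to day you build with `mvn -P build-repo install`, and only when pulling in new versions do you run `mvn -P cache-repo install`.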

Clean your local repository and build again to make sure that your colleagues won’t have any problems after the change and you’re set.


Management Is The Art of Choosing What Not To Do

23. July, 2008

From Rands in Repose: “… management is the art of choosing what not to do …”

If you want to know more about management, told in a way an engineer can understand, consider Rands’ book “Managing Humans”.


The Code Reuse Myth

22. July, 2008

The post “The Code Reuse Myth” by James Sugrue got me thinking.

The main problem with code reuse is that our programming languages don’t support it. We sacrificed this to the gods of efficiency four decades ago and, while a few people have dared to question the practice, so far all of them have been struck down by unexpected lightning out of the blue.

As James said, “context” is the keyword. What we’d need is a programming language where you can adapt concepts to a context that goes beyond “object instance” or “class” or “application”. What we need is an efficient way to say “collect all objects and sort them by ID” where the context defines what “ID” is. What we need is an efficient way to describe a model (relations of basic data types) and then have some tool map that efficiently to reality so we can reuse parts or fragments of the model in different contexts.

OO can’t do that because it’s limited to factories and inheritance. Traditional preprocessors can’t do it because they can’t see the AST. Having closures in a programming language is the first tiny step toward being able to push external context into existing code. This allows us to put the code which accesses the database into a library and influence what it does per row of the result with a closure.
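In Java, which has no closures, the closest approximation is passing an interface implementation into the library code. A sketch of the per-row idea, with an in-memory list standing in for the database result set (all names are mine):

```java
import java.util.Arrays;
import java.util.List;

/** The "closure": external context pushed into library code, invoked once per row. */
interface RowCallback {
    void row(String value);
}

/** Library-side iteration; the caller decides what happens for each row. */
class QueryTemplate {
    static void forEachRow(List<String> resultSet, RowCallback callback) {
        for (String row : resultSet) {
            callback.row(row);
        }
    }
}

class ClosureSketch {
    public static void main(String[] args) {
        final StringBuilder out = new StringBuilder();
        List<String> fakeResultSet = Arrays.asList("a", "b", "c");
        QueryTemplate.forEachRow(fakeResultSet, new RowCallback() {
            public void row(String value) { // anonymous class = poor man's closure
                out.append(value.toUpperCase());
            }
        });
        System.out.println(out); // ABC
    }
}
```

This is exactly the pattern Spring’s JdbcTemplate popularized: the library owns the loop and the resource handling, the caller supplies only the per-row behavior.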

But to be a real new paradigm, we need “closures” for data types as well. This means being able to reuse fragments of code and data structure definitions in a new context. These fragments need to pull along all the algorithms and structures they need, without the developer having to pay close attention to what is going on, at least until the point where the result needs peephole optimization because of performance issues.

Example:

fragment Named {
    String name

    String toString () {
        return "name=${name}"
    }
}

class File : Named {...}

class Directory : Named, File {...}

Looks simple, but with OO this will get you into a lot of trouble: Directory gets a name from the class File and from the fragment Named (this is an artificial example, bear with me). Which one should the compiler choose? In OO, I can’t say “I don’t care” or “merge them”.

With real fragments, you could say “Directory is a File in the sense that it supports a lot of the operations of a file (like rename, delete, get last modification time) but not all (you can’t execute a directory or open it for reading).” So the example would look like this:

class File : Named { 
    void delete () {...}
    Reader openForReading () {...}
}

class Directory : Named, File.delete {...}

Now a Directory has a name, and it “copies” the method “delete” from File along with anything this method needs. Internally, the compiler could create an invisible File delegate, or it could clone the source or byte code of the File class, or whatever. Or, even better, we could say “give me a copy of File.delete() but replace all references to the field File.path with Directory.path.”
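Here is what that invisible delegate could look like if you wrote it by hand in plain Java; all names are hypothetical:

```java
/** Hand-written version of what a compiler could generate for "Directory : Named, File.delete". */
class File {
    private final String path;

    File(String path) {
        this.path = path;
    }

    String delete() {
        return "deleted " + path; // stands in for the real delete logic
    }
}

class Directory {
    private final File delegate; // the "invisible" delegate the compiler would create

    Directory(String path) {
        this.delegate = new File(path);
    }

    /** Only delete() is copied over; openForReading() is deliberately absent. */
    String delete() {
        return delegate.delete();
    }
}

class FragmentSketch {
    public static void main(String[] args) {
        System.out.println(new Directory("/tmp/x").delete()); // deleted /tmp/x
    }
}
```

The pain point is that today you have to write this boilerplate yourself, per method; the fragment idea is to let the compiler generate it from a one-line declaration.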

The main goal would be to let me use the compiler as a cut’n’paste tool which checks the syntax and allows me to say “copy that method over there and replace xxx with yyy”. Because that’s why we think code reuse could work: we see the same code over and over and over again, and each time the difference is just a tiny little bit of code which we can’t “patch” because the language just doesn’t allow it.


Lovin’ Linux? Dig This!

22. July, 2008

Want to make Linux better? Ask the Linux Hater. If in doubt: he wouldn’t write 15 articles per month telling you where Linux sucks if he didn’t care.