Another Lesson on Performance

23. January, 2009

Just another story you can tell someone who fears that “XYZ might be too slow”:

I’m toying with the idea to write a new text editor. I mean, I’ve written my own OS, my own XML parser and I once maintained XDME, an editor written originally by Matthew Dillon. XDME has a couple of bugs and major design flaws that I always wanted to fix but never really got to it. Anyway.

So what is the best data structure for a text editor in 2008? List of lines? Gap-Buffer? Multi-Gap-Buffer?

XDME would keep the text in a list of lines and each line would point to a character array with the actual data. When editing, the characters would be copied into an edit buffer, the changes made and after the edit, the changed characters would be copied back, allocation a new character array if necessary.

This worked, it was a simple design but it had a flaw: it didn’t scale. The length of a line was limited to the max size of the edit buffer and loading a huge file took ages because each line was read, analyzed, memory was allocated … you get the idea.

So I wanted to make it better. Much better. I’d start with reading the file into a big buffer, chopped into evenly sized chunks to make reading both fast and memory efficient (imagine loading a 46MB file into a single memory chunk – after a couple of changes, I’d need to allocate a second 46MB chunk, copy the whole stuff over, etc, needing twice the amount of RAM for a short time).

During the weekend, I mulled the various ideas over, planned, started with a complex red-black tree structure for markers (positions in the text that move when you insert before them). It’s huge, complex. It screams “wrong way!”

So today, I sat back and did what I should have done first: Get some figures. How much does it really cost to copy 4MB of RAM? Make a guess. Here is the code to check:

    public static void main (String[] args)
    {
        long start = System.currentTimeMillis ();
        
        int N = 10000;
        for (int i=0; i<N; i++)
        {
            int[] buffer = new int[1024*1024];
            System.arraycopy (buffer, 0, buffer, 1, buffer.length-1);
        }
        
        long duration = System.currentTimeMillis () - start;
        System.out.println (duration);
        System.out.println (duration / N);
    }

On my machine, this prints “135223” and “13”. That’s thirteen milliseconds to copy 4MB of RAM. Okay. It’s obviously not worth to spend a second to think about the cost of moving data around in a big block of bytes.

That leaves the memory issue. I would really like to be able to load and edit a 40MB file in a VM which has 64MB heap. Also, I would like to be effective loading a file with 40MB worth of line-feeds as well as a file which contains just a single line with 40MB data in it.

But this simple test has solved one problem for me: I can keep the lines in an ArrayList for fast access and need not worry too much about performance. The actual character data needs to go into a chunked memory structure, though.

Morale: There is no way to tell the performance of a piece of code by looking at it.


25 Most Dangerous Programming Errors

23. January, 2009

If you want to improve your l33t [0d|ng skillz, especially keeping script kiddies off your back, here is a list of the 25 most common coding errors: http://www.sans.org/top25errors/


Holding a Program in One’s Head

18. December, 2008

In my perpetual search for brain food for programmers, I’ve found this article: Holding a Program in One’s Head


Stuck? Ask Stack Overflow

19. November, 2008

Stuck with a hard programming problem? Just solved an impossible problem and want to show the world your genius? Don’t know how to solve a problem with your favorite OS or programming language? Check out stackoverflow.com.


How To Launch Software

5. September, 2008

If you’re still wondering if “big bang” software releases are a good thing, read this.


The Space Between Two Characters

31. August, 2008

If you’re claustrophobic, you’re afraid of confined spaces. If you’re a software developer, you can be afraid of non-existing space.

When it comes to editing text, we usually don’t think about the space between the characters. There simply isn’t any. When you write a text editor, things start to look different. Suddenly, you have a caret or cursor which goes between the characters and that space between two characters can suddenly become uncomfortably tight.

Fire up your favorite text editor, Word, Writer, whatever. It has to support character formatting, though. Now enter this:

Hello, world!

If in doubt, the bold text ends before the comma and the italic part starts with the “w” and ends with “!” including both.

Now move to the “e” and type “x”. What do you get?

Hxello, world!

Piece of cake.

Now move to the “H” and type “x”. What do you get? Is the new x bold or not? Do you get “xHxello” or “xHxello“? How about typing “x” after the “o” of our abused “Hello”? Is that new x bold or not? If it is, what is the most simple way to make it non-bold? Do you have to delete the comma, do you have to go through menus or toolbars or is there a simple, consistent way to add a character inside and outside of a formatted range of text?

Let’s go one step further. Add a character after the “!”. Is it italic? If not, you’re lucky. If it is … what’s the most simple way to you get rid of the italic? If you press Return now, will the italic leak to the next line? If not, how can you make it leak? If that italic is the last thing in your text, can you add non-italic text beyond without fumbling with the formatting options?

There is no space between two characters and when you write a text editor, that non-existing space is biting you. Which is actually the problem: There is no consistent way to move in and out of a formatted range of characters.

The naive attempt would be to say “depending on the side you came from, you’re inside or outside.” So, if we have this (| is the cursor or caret): “Hello |world” and you type something, the question is: How did the caret end there? Did it come from “w|o” and moved one to the left? Or from “o| ” and move one position right?

That works somewhat but it fails at the beginning and the end of the text plus you’re in trouble during deleting text. What should happen after the last character of “Hello” has been deleted? Should that also delete the character range or should there be an empty, invisible bold range left and when you type something now, it should appear again? If you keep the empty invisible range, when do you drop it? Do you keep it as long as the user stays “in” it? Or until the document is saved? Loaded again from disk?

It’s a mess and there is a reason why neither Word nor OpenOffice get it right: You can’t. There is information in the head of the user (what she wants) but no way for her to tell the computer. Duh.

That is, unless you start to give the user a visual cue what is going on. The problems we have is that there is no simple, obvious way for the user to say “I want …” because there is no space on the screen reserved for this. We barely manage to squeeze a caret between the characters. There is just not enough room.

Well, there could be. A simple solution might be to add a little hint to the cursor to show which way it is leaning right now. Right. How about “A|B“? Here, you have three options. Add bold, italic and normal.

In HTML, this is simple. I’m editing this text in Firefox using the standard text area. What looks fancy to you looks like this to me: “<b>A</b>|<i>B</i>”

And this is the solution: I need to add a visual cue for the start and end of the format ranges. Maybe a simple U-shape which underlines the text for which the character format applies. Or an image (> and < in this example): “>A<|>B<“. And suddenly, it’s completely obvious on which side of the range start and end you are and what you want. You can delete the text in the range without losing it or you can delete both and you can move in and out of the range at will.

The drawback is that you need to keep that information somewhere. It adds a pretty huge cost to the limits of a format range. I’ll have to try and see how much that is and if I can get away with less by cleverly using the information I already have.

Also, it clearly violates WHYSIWYG. On the other hand, we get WYSIWYW which is probably better for the user.


Text Editor Component and JADS

27. August, 2008

While working on DecentXML (1.2 due this weekend), I’ve had those other two things that were bugging me. One is that there is no high-quality, open-source framework with algorithms and data structures. I’m not talking about java.lang.Collections, I’m talking about red-black trees, interval trees, gap buffers, things like that. Powerful data structures you need to build complex software.

Welcome the “Java Algorithm and Data Structure” project – jads. I haven’t started opened a project page on SourceForge or Google Code, yet, but I’ll probably do that this weekend.

Based on that, I’m working on a versatile text editor component for Java software. The final editor will work with user interfaces implemented in Swing, SWT and Qt. It’s an extensible framework where you can easily replace parts with your own code to get the special features you need. I currently have a demo running which can display text, which allows scrolling and where you can do some basic editing. Nothing fancy but it’s coming alone nicely.

If you want to hear more about these projects, post a comment or drop me a mail.


Another Lesson on Performance

19. August, 2008

Just another story you can tell someone who fears that “XYZ might be too slow”:

I’m toying with the idea to write a new text editor. I mean, I’ve written my own OS, my own XML parser and I once maintained XDME, an editor written originally by Matthew Dillon. XDME has a couple of bugs and major design flaws that I always wanted to fix but never really got to it. Anyway.

There are various data structures which are suitable for a text editor and some of those depend on copying data around (see gap buffers). The question is: How effective is that? The first instinct of a developer is to avoid copying large amounts data and to optimize the data structure instead.

After years of training, I’ve yet to overcome this instinct and start to measure:

    public static void main (String[] args)
    {
        long start = System.currentTimeMillis ();
        
        int N = 10000;
        for (int i=0; i<N; i++)
        {
            int[] buffer = new int[1024*1024];
            System.arraycopy (buffer, 0, buffer, 1,
                buffer.length-1);
        }
        
        long duration = System.currentTimeMillis () - start;
        System.out.println (duration);
        System.out.println (duration / N);
    }

On my computer at work (which is pretty fast but not cutting edge), prints: “135223” and “13”. That’s thirteen milliseconds to copy 4MB of RAM. Okay. It’s obviously not worth to spend a second to think about the cost of moving data around in a big block of bytes.

Lesson: If you’re talking about performance and you didn’t measure, you have no idea what you’re talking about.

Still not convinced? Read this.


Quantity Always Trumps Quality

4. August, 2008

While I wouldn’t completely subscribe to that without a grain of salt, the story is nice.


A Decent XML Parser

29. July, 2008

Since there isn’t one, I’ve started writing one myself. Main features:

  • Allows 100% round-tripping, even for weird whitespace between attributes in elements
  • Suitable for building editors and filters which want to preserve the original file layout
  • Error messages have line and column information
  • Easy to reuse
  • XML 1.0 compatible

You can download the latest sources here as a Maven 2 project.