Why Java's new String is bad for us

November 20, 2013

With the arrival of Java 7 update 6 a while ago, one minor change was meant to be the salvation from a few memory leaks. This change has to do with Strings.

The String class has suffered from an undocumented bug-feature that has given headaches to a lot of people for a long time. It turns out that this class can cause a kind of memory leak that is especially transparent to the programmer.

In attempting to fix it, however, the JDK developers have forced different, equally severe bug-features on the programmer. They are equally transparent, or maybe more so, as the new behaviour really makes less sense.

Let me explain myself…

The old String

The String class in Java is implemented by wrapping a char[] object. For performance reasons, it includes two fields, offset and length (the latter is named count in the actual source), that identify the index of the first array character that belongs to the string and the number of characters it spans.

Those performance reasons have to do with the String.substring method. This method allows you to take an already existing String object and create a new one using a smaller character segment of the original. Because strings in Java are immutable, this can be done in constant time just by adjusting those offset and length fields while sharing the same character buffer.
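To make the idea concrete, here is a rough sketch of that layout. SharedString and its retainedChars method are hypothetical names I made up for illustration; this is a simplified model, not the real java.lang.String source.

```java
// Hypothetical model of the old layout; NOT the actual java.lang.String code.
final class SharedString {
    private final char[] value;  // backing buffer, possibly shared with other strings
    private final int offset;    // index in value of this string's first character
    private final int length;    // number of characters that belong to this string

    SharedString(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedString(char[] value, int offset, int length) {
        this.value = value;
        this.offset = offset;
        this.length = length;
    }

    // Constant time: no characters are copied, only the window moves.
    SharedString substring(int beginIndex, int endIndex) {
        if (beginIndex < 0 || endIndex > length || beginIndex > endIndex) {
            throw new StringIndexOutOfBoundsException();
        }
        return new SharedString(value, offset + beginIndex, endIndex - beginIndex);
    }

    int length() {
        return length;
    }

    // Size of the buffer this object keeps reachable (illustration only).
    int retainedChars() {
        return value.length;
    }

    @Override
    public String toString() {
        return new String(value, offset, length);
    }

    public static void main(String[] args) {
        SharedString big = new SharedString("hello, world");
        SharedString small = big.substring(7, 12);
        // prints "world 5 12": five visible characters, twelve retained
        System.out.println(small + " " + small.length() + " " + small.retainedChars());
    }
}
```

Note how the small string reports a length of 5 but still holds on to all 12 characters of the original buffer.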

But speed is not the only thing that improves. By sharing the same buffer we are also reducing the potential memory usage of the system. That’s ok, except this introduces a new problem: a silent memory leak that is quite difficult to spot if one does not pay attention to the whole process.

Suppose you have a big string. A 2 MiB string. You perform a regular expression search over it, find your piece, get it with a substring call, store it, and then let the original 2 MiB string be GC’d. You would expect your final memory usage to be the few bytes your final string uses (plus object overhead), but then reality kicks in and you still have 2 MiB of used heap space.

The reason is simple. Remember when I told you that strings can share the character array? That’s what’s happening. As long as the original string is still used, you’re saving space. If not, you’re wasting space. You’re keeping a reference to a 2 MiB character array just to represent a string of 10-100 bytes. And this is one of the most overlooked memory leaks in Java.

It is, however, very easy to avoid. The problem is that this is not well documented in the JavaDoc, so not everyone knows about it, but it can be fixed just like this:

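    // substring takes (beginIndex, endIndex), so this extracts 100 characters
    String str = twoMegsString.substring(someIndex, someIndex + 100);
    // copy the characters into a right-sized array, releasing the 2 MiB buffer
    str = new String(str);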

That’s it. Really. Creating a new string like this in Java will allocate a new character array of the exact size you need and expect.

This is something that has been known for a while. The memory leak has been explained a thousand times and people who really know how to code in Java know about it. Most people actually use it to their advantage. We all know we can perform trim on everything we get from the user and only pay an object allocation and maybe a few lost bytes. If we know we have a large string, we can use the above trick to trim down the internal array, while still profiting from the extra performance of not rebuilding a large string just to remove 10 bytes of trailing spaces.

But then, reality kicks in again…

The new String

The String class has been rewritten as of Java 7 update 6. Two fields have been lost in the process: offset and length. This means that now a string always has its first character at position 0 and the same size as its char[] buffer.
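Under the same caveats as before, the new layout can be sketched like this. CopyingString and retainedChars are hypothetical names for a simplified model; the real class differs in many details.

```java
import java.util.Arrays;

// Hypothetical model of the post-7u6 layout; NOT the actual java.lang.String code.
final class CopyingString {
    private final char[] value;  // always exactly the characters of this string

    CopyingString(String s) {
        this(s.toCharArray());
    }

    private CopyingString(char[] value) {
        this.value = value;
    }

    // Linear in the size of the result: a new array is allocated and filled.
    CopyingString substring(int beginIndex, int endIndex) {
        if (beginIndex < 0 || endIndex > value.length || beginIndex > endIndex) {
            throw new StringIndexOutOfBoundsException();
        }
        return new CopyingString(Arrays.copyOfRange(value, beginIndex, endIndex));
    }

    int length() {
        return value.length;
    }

    // Size of the buffer this object keeps reachable (illustration only).
    int retainedChars() {
        return value.length;
    }

    @Override
    public String toString() {
        return new String(value);
    }

    public static void main(String[] args) {
        CopyingString big = new CopyingString("hello, world");
        CopyingString small = big.substring(7, 12);
        // prints "world 5 5": nothing of the original buffer is retained
        System.out.println(small + " " + small.length() + " " + small.retainedChars());
    }
}
```

The small string now owns a buffer of exactly its own size, at the price of an allocation and a copy on every substring call.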

This means no more memory leaks. Nobody can ever again substring a 2 MiB string down to a 10 byte micro-string and still get 2 MiB of unused memory filling up the heap. We are effectively protecting people from ever needing to learn how things work, because things work exactly as expected.

Except this introduces new problems. New problems that, sadly, and unlike the previous one, have no solution.

The first of our problems is memory. Every time we use substring (or trim) we copy the buffer. For small strings, this really doesn’t matter. For the average case, even for long strings, this also doesn’t matter, as the original string is rarely kept. It matters in a few key circumstances, when a lot of strings are created from a large one. An application that yesterday used a decent amount of memory now uses a lot more with no direct solution.
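A bit of back-of-the-envelope arithmetic shows the kind of case I mean. The sketch below assumes a text of n characters from which we keep every window of w consecutive characters, and it counts only char[] contents, ignoring object headers. SubstringMemoryDemo and its methods are hypothetical names used just for this estimate.

```java
// Rough estimate of retained characters; object overhead is deliberately ignored.
final class SubstringMemoryDemo {
    // Old model: every window shares the original buffer, so only n chars exist.
    static long sharedChars(int n, int w) {
        checkArgs(n, w);
        return n;
    }

    // New model: the original buffer plus one private copy of w chars per window.
    static long copiedChars(int n, int w) {
        checkArgs(n, w);
        long windows = (long) n - w + 1;  // one window per valid start position
        return n + windows * w;
    }

    private static void checkArgs(int n, int w) {
        if (w < 1 || w > n) {
            throw new IllegalArgumentException("need 1 <= w <= n");
        }
    }

    public static void main(String[] args) {
        // 1000-char text, 100-char windows: 1000 chars shared vs 91100 copied
        System.out.println(sharedChars(1000, 100) + " vs " + copiedChars(1000, 100));
    }
}
```

With a 1000-character text and 100-character windows, that is 1000 characters under the old model against 91100 under the new one.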

The second of our problems is computing time. Every time we use substring or trim, we copy the buffer. Copying a buffer means an O(n) run on the target size. Again, for short strings, this is not a problem. For a long one, it is. We can no longer rely on trim to be fast. Every time we call it, we risk copying a large string just to save a few bytes we didn’t even want to save. Applications that called substring in loops may now behave closer to O(n²) than to their previous linear times. Existing applications may slow down just by running them with the new updates. And again, this doesn’t have a direct solution.
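To put a number on that, take the classic loop that consumes a string one character at a time with rest = rest.substring(1). The sketch below assumes each substring call copies its whole result, as described above, and simply adds up the characters copied. SubstringCostDemo and its methods are hypothetical names for this illustration.

```java
// Counts the work done by a substring-in-a-loop pattern under the copying model.
final class SubstringCostDemo {
    // The loop itself: runs once per character, whatever the String implementation.
    static int consumeOneByOne(String s) {
        int iterations = 0;
        String rest = s;
        while (!rest.isEmpty()) {
            rest = rest.substring(1);  // drops the first character
            iterations++;
        }
        return iterations;
    }

    // Characters copied by that loop for an input of length n, ASSUMING every
    // substring(1) call copies its result: (n-1) + (n-2) + ... + 0 = n(n-1)/2.
    static long charsCopied(int n) {
        long copied = 0;
        for (int remaining = n; remaining > 0; remaining--) {
            copied += remaining - 1;  // the result has one character fewer
        }
        return copied;
    }

    public static void main(String[] args) {
        // 1000 iterations, 499500 characters copied under the copying model
        System.out.println(consumeOneByOne("x".repeat(1000)) + " iterations, "
                + charsCopied(1000) + " chars copied");
    }
}
```

The loop still runs n times, but the total copying grows as n(n-1)/2. With shared buffers, the same loop copied nothing at all.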

These are not trivial problems. This change is introducing a lot of accidental bad behaviour to well-behaved applications only to eliminate an avoidable memory leak. Again: to remove an error that was avoidable.

Conclusions

I can’t stress this enough. The bad behaviour of the old String was avoidable; everything bad introduced with the new String is forced on the programmer. We have introduced bad behaviour just so bad programmers didn’t have to learn to code.

Instead of telling people that they should care about memory in Java (even though they sell it as the language in which you forget about memory management), they break their old users that were using their platform the right way.

This is especially odd coming from the very people who introduced that thing they call type erasure just so new code could run on old platforms, when it should be the other way around. And this time, it goes neither way.


Edit (Nov 23): Looks like the author of this change has made a few comments on Reddit.

To summarize, he says the change does more good than harm because of how strings are used in the average application. Although I mostly agree, I think that it’s a change that potentially breaks a lot of things in those applications that are not the usual ones.

There is one thing I can say about all this: it is not documented anywhere. Neither the old nor the new behaviour. So, in the end, anyone relying on it was relying on undocumented features, something that should not be done.