Today we’ll have a little Java quiz. Most readers probably know the answer.
The question
Let’s create a hash map in Java and fill it with some big number of strings:
We’ll search various strings in this table and report time:
Let’s run it for some arbitrary strings:
Here is the output:
mouse: time 17; sum 0
String#532: time 93; sum 1025032704
a quick brown fox jumps over the lazy dog: time 23; sum 0
aardvark polycyclic bitmap: time 196; sum 0
public static void main (String [] args): time 24; sum 0
Most of the times are very short. One (for "String#532"
) is slightly longer than the others due
to the fact that this string is in fact present in the map. One, however (for the "aardvark polycyclic bitmap"
)
is exceptionally long - eight times longer that the times for similar-sized strings.
What makes this aardvark string so slow, and are there other such strings?
The answer
First of all, other such strings definitely do exists, for instance, "aaron bends nonconformity"
or "maximal java reconstructs"
.
When searching objects in a hash map, the first thing that happens is that their hash code is calculated:
The String
’s hash calculation looks like this:
The hash is calculated by adding all the characters multiplied by various powers of 31. To avoid
performing this calculation every time the hash is needed, and also to avoid performing it
when it is not needed at all, the value is calculated in a lazy way and cached. It is calculated the
first time hashCode()
is called and stored in the hash
field.
This scheme, however, fails when the hash value is 0. This value will always present itself as uncalculated, causing the hash code to be re-calculated each time. And this is exactly what our strings are: the strings with the hash code of zero.
The background
Java’s designers could have provided an extra flag to indicate the hash value being present, or used the highest bit of the hash value for that, or, perhaps, made the hash one if it was calculated as zero. They, probably, considered all of this unnecessary, as the probability is very low: one out of four billion strings.
Most of these strings look very artificial and very few make grammatical sense.
Still, meaningful ones occur, too. I started this study after reading a Russian article on the issue, where they
came up with a very beautiful example: "лжеотождествление электровиолончели"
(false identification of an electric cello). So I took a list of 58K English words and tried combining them.
No single English word produced zero hash code (perhaps, some super-long hyphenated ones could, but I didn’t try them), and no two words with a space in between could do it either.
Three words, however, happened to be very productive, producing 46K strings. The shortest ones
are 13 characters long ("civic sear ha"
, "pleas so fold"
, "venal us burs"
); the longest one
is 48 ("superconducting instrumentals intercommunication"
).
Here are some that make some sense:
covenant selfdestructs nondrinkers
cannabis auctioneers uncorrupted
america envisages inflate
colourful Obama aftereffect
reduce whores sinfulness
shivery rhinoceroses chaser
ostensible communists speech
flowers vendors conversation
And here is the best one:
antipathies included computing
The full list of found phrases, together with the program that calculated it, is in the repository. The program ran for 13 minutes – perhaps, someone can speed it up enough to search four and more words?
Conclusion
Hash map is a very powerful tool to achieve good performance. Still, abnormal cases of unusually bad performance do occur, so be careful.
Comments are welcome below or at reddit.