Michael Truog (okeuday) wrote,

Using Integer Keys and String Keys in Erlang

My benchmarks for key/value data stored with either integer or string keys lacked an explanation of the results, so I will explain my conclusions here. In the benchmark results, the normalized time for each test appears in parentheses next to the raw time it relates to. A normalized value shows how many times slower an operation on a data structure is than the fastest alternative.

The orddict is the fastest data structure for storing data with integer keys; however, all of its other operations are very slow compared with any of the other data structures. Storing data with an integer key is slowest in a tuple, but integer key lookup within a tuple is the fastest among all the data structures. If you set aside the orddict (since a lookup in an orddict is painfully slow), the process dictionary is the fastest for storing data. Usage of the process dictionary is typically discouraged, though, since it eliminates the source code's referential transparency, making the code harder to test, maintain, and understand.
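The trade-off shows up in the APIs themselves: the process dictionary mutates per-process state, while orddict is a pure structure that returns a new copy on every store. A minimal sketch (module and function names are my own) contrasting the two:

```erlang
-module(pd_vs_orddict).
-export([demo/0]).

demo() ->
    %% process dictionary: destructive update, per-process state;
    %% put/2 returns the previous value ('undefined' on first store)
    undefined = erlang:put(1, a),
    a = erlang:get(1),
    %% orddict: pure functional update, a sorted key/value list;
    %% every store returns a new structure
    D0 = orddict:new(),
    D1 = orddict:store(1, a, D0),
    a = orddict:fetch(1, D1),
    ok.
```

The destructive update is what makes the process dictionary fast for stores, and it is also exactly what breaks referential transparency.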

The ETS results in the benchmark showed that concurrent lookups are faster with integer keys but slower with string keys, which suggests that more complex key comparisons limit the scalability of ETS' global data storage sooner (the assumption being that concurrent integer key lookups on ETS would also degrade, but that more concurrent lookups would be needed before the global locking used for consistency becomes the bottleneck). A persistent puzzle has been aadict and gb_trees, which should implement the same algorithm yet differ when storing integer key data (aadict is quicker). The aadict and gb_trees data structures should have faster lookup times than rbdict, but no significant difference can be seen (gb_trees actually has slower lookup times for string keys). The dict data structure is faster for lookups but slower for storing when compared with the balanced-tree data structures (rbdict, aadict, and gb_trees).
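For reference, the ETS operations being measured follow the same insert/lookup pattern regardless of key type; only the key comparison cost differs. A minimal sketch (table and value names are my own) with an integer key and a string key:

```erlang
-module(ets_keys).
-export([demo/0]).

demo() ->
    T = ets:new(kv, [set, public]),
    %% integer key: a single cheap term comparison per probe
    true = ets:insert(T, {42, int_value}),
    [{42, int_value}] = ets:lookup(T, 42),
    %% string key (a list of characters): comparison walks
    %% the list element by element, so each probe costs more
    true = ets:insert(T, {"forty-two", str_value}),
    [{"forty-two", str_value}] = ets:lookup(T, "forty-two"),
    true = ets:delete(T),
    ok.
```

Because an ETS table is shared global storage rather than a per-process structure, every lookup also pays for the locking that keeps concurrent access consistent.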

The array data structure is much quicker than a tuple for storing integer key data, if you can tolerate the slower lookup time (lookups become 7.5 times slower than a tuple's). The trie is a new data structure I created for efficient string key lookups. The trie's lookups are roughly equivalent to the process dictionary's (1.1 times the process dictionary's lookup time for the largest test), but should always be slightly slower since it is an externally implemented data structure (i.e., not a core language feature like the process dictionary). Storing string key data in a trie is roughly equivalent to storing data in a balanced-tree data structure (a trie is slower than rbdict, equivalent to aadict, and faster than gb_trees for the largest test).
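The store/lookup trade-off between tuple and array follows from their representations: setelement/3 copies the entire tuple on every store, while the array module stores elements in a tree of small tuples, so a store only copies one branch (and a lookup must walk that tree). A short sketch (module name is my own; the copying behavior described in the comments is my reading of the standard implementations):

```erlang
-module(array_vs_tuple).
-export([demo/0]).

demo() ->
    %% tuple: O(1) lookup, but every store copies the whole tuple
    T0 = erlang:make_tuple(10, undefined),
    T1 = setelement(3, T0, value),   % 1-based index
    value = element(3, T1),
    %% array: stores only copy one branch of an internal tuple
    %% tree, but lookups must walk that tree
    A0 = array:new(10),
    A1 = array:set(2, value, A0),    % 0-based index
    value = array:get(2, A1),
    ok.
```

Note the index conventions differ: tuples are 1-based, arrays are 0-based.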

The speed of the trie comes from exploiting the quick integer key lookups of a tuple. Storing string key data in the balanced-tree data structures is similar in speed to storing data in a trie, because all of these tree structures use tuples for their nodes. One interesting limitation is that the trie cannot store the atom value 'error', because of the ambiguity within the dict interface (the dict:fetch/2 function) and the fact that the atom 'error' marks empty values within the trie, which keeps the implementation simpler.
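The ambiguity traces back to a convention of the dict-style interface: find/2 returns the bare atom 'error' when a key is absent, so a structure that also uses 'error' internally as its empty marker cannot distinguish "no value" from "the stored value is 'error'". A sketch of that convention using the standard dict module (module name is my own):

```erlang
-module(find_convention).
-export([demo/0]).

demo() ->
    D = dict:store(key, error, dict:new()),
    %% dict:find/2 returns {ok, Value} on a hit and the bare
    %% atom 'error' on a miss; wrapping hits in {ok, _} is what
    %% keeps a stored 'error' value distinguishable here
    {ok, error} = dict:find(key, D),
    error = dict:find(missing, D),
    ok.
```

A trie that instead stores the bare atom 'error' in its nodes to mean "empty" gives up that distinction in exchange for a simpler implementation.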