Valerii: 32-base string encoder and decoder

13 October 2009

You have probably seen URL shorteners that use short, seemingly random strings to identify sites. These strings are not random, they are encoded integer values. My valerii gem allows you to easily and quickly encode and decode integer values. Let me show you. To understand the shortening of integer values to strings you first need to know how integers work.

We use a base-10 system. This means that for any number, every digit can hold 10 different values: 0 through 9. If you want to express more than 10 values, you need to add a digit: 26 means (6 * 1) + (2 * 10).

You may be familiar with hexadecimal numbers. Hexadecimal has a base of 16, meaning every digit can hold 16 distinct values. 0 through 9, a through f.

Converting an integer to a 32-base string

The key to shortening a number to a string is that you need to store as many different values in a single digit. Because we want the digits to be part of an URL, we can only use valid URL characters. Characters like ‘/’ and ‘?’ are a no-go in this situation.

Another reason for sticking with 32 (and not 52 (a-z, A-Z, 0-9)) is that I’m going to use bit encoding. But, first we need to define an alphabet to use to represent the 32 different values:

  %w( B C D F G H J K
      M N P Q R S T V
      W Z b c d f h j
      k m n p r x t v )

For a value of 3 we’d use ‘F’. Easy, right?

Let’s say we want to encode the number 123. What valerii does first is convert 123 to a binary string.

123.to_s(2) # => "1111011"

Cool, now split up this string in blocks of 5 bits. 5 bits can contain 32 different values. We need to start ‘chopping’ from the right to the left, so first we reverse the binary string and split it up.

"1111011".reverse.scan(/.{1,5}/) # => ["11011", "11"]

Now we convert the binary strings to 10-base numbers and map those to the characters defined in ENCODE_CHARS. The resulting characters are reversed again and joined to one string:

123.to_s(2).reverse.scan(/.{1,5}/).map do |bits|
end.reverse.join # => "Fp"

Converting a 32-base string to an integer

Converting a 32-base string back to its original integer value is quite easy now. The only trick is to create another hash that maps each character to its integer value:

DECODE_MAP = ENCODE_CHARS.to_enum(:each_with_index).inject({}) do |h,(c,i)|
  h[c] = i; h

This looks scary, but it’s actually a common way to reverse key/values in a hash.

Next is taking each character from the string and pushing the 5 bits we read in a variable. This variable is the original integer value.

"Fp".split(//).map { |char|
  DECODE_MAP[char] or return nil
}.inject(0) { |result,val| (result << 5) + val } # => 123


Once you have established your way of encoding/decoding you should not change the alphabet you’re using, since it redefines the meaning and value of the encoded strings.

There are a lot of ways to store an integer value into a string. Another common methods is to use a 52 characters alphabet (a-z, A-Z, 0-9). You can’t use the bit encoding here, but there are quite a few algorithms out there that let you do that.

More info on how to get the valerii gem can be found at Gemcutter.

Also checkout my other gems