User:Narc/Serverless URL Shortening
From NarcWiki
See also http://identi.ca/conversation/7861666, in particular http://identi.ca/notice/7868514, which started this adventure.
Upon investigation of RFC 3986, I counted 58 allowable characters in a URL -- which did not include %, bringing the total to 59. This fits in the 63 positions allowed by 6 bits and gives me the "clever" idea of shortening a URL by encoding it in a way similar to reverse base64.
The idea, basically, is to map the allowable characters in a URL into 59 of the 63 (NOT counting 0, as a string terminator) positions in a 6-bit number, and then "smooshing" together the resulting 6-bit numbers into 8-bit bytes. Let me explain:
Assume you somehow got the following numbers:
1, 17, 53, 24
These translate to:
000001 010001 110101 011000
Now we can take these 4 six-bit numbers and turn them into 3 eight-bit (i.e. "normal") bytes:
00000101 00011101 01011000
Which, in hexadecimal, translate to 05 1D 58, which illustrates the first problem with trying to do this in real life -- 0x05 and 0x1D are non-printing "characters" in US-ASCII, and they are very likely to be filtered out.
On the other hand, given our focus is *characters* rather than bytes, we can convert our original four six-bit characters into a UTF-8 characters. As a reminder, UTF-8 characters look like this:
1 byte: 0bbbbbbb (7 bits usable) 2 bytes: 110bbbbb 10bbbbbb (11 bits usable) 3 bytes: 1110bbbb 10bbbbbb 10bbbbbb (16 bits usable) 4 bytes: 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb (21 bits usable)
Since we're encoding a total of 24 bits, that comes to one 4-byte char and one 1-byte char:
To encode: 000001010001110101011 000 Unicode: [11110000 10001010 10001110 10101011] 00000000 Hexadecimal: F08A8EAB 00
The second character in our case is a null character and can be safely omitted, leaving just the one single four-byte unicode character, and the three extra 0 bits can and should be padded back in when decoding.
Note that we went from four characters to one (could've been two) Unicode characters, but the actual length in bytes is still 4, and could have gone up to 5. This underscores that we're NOT looking at compression -- or, rather, we have compression with subsequent expansion from Unicode encoding -- but this is acceptable in our use case, which are character-limited services like identi.ca or Twitter.
| Six-bit encodings for URL characters | |||||
|---|---|---|---|---|---|
| decimal | binary | value | decimal | binary | value |
| 0 | 000000 | NULL char | 30 | 011110 | 3 |
| 1 | 000001 | a | 31 | 011111 | 4 |
| 2 | 000010 | b | 32 | 100000 | 5 |
| 3 | 000011 | c | 33 | 100001 | 6 |
| 4 | 000100 | d | 34 | 100010 | 7 |
| 5 | 000101 | e | 35 | 100011 | 8 |
| 6 | 000110 | f | 36 | 100100 | 9 |
| 7 | 000111 | g | 37 | 100101 | : |
| 8 | 001000 | h | 38 | 100110 | / |
| 9 | 001001 | i | 39 | 100111 | ? |
| 10 | 001010 | j | 40 | 101000 | # |
| 11 | 001011 | k | 41 | 101001 | [ |
| 12 | 001100 | l | 42 | 101010 | ] |
| 13 | 001101 | m | 43 | 101011 | @ |
| 14 | 001110 | n | 44 | 101100 | ! |
| 15 | 001111 | o | 45 | 101101 | $ |
| 16 | 010000 | p | 46 | 101110 | & |
| 17 | 010001 | q | 47 | 101111 | ' |
| 18 | 010010 | r | 48 | 110000 | ( |
| 19 | 010011 | s | 49 | 110001 | ) |
| 20 | 010100 | t | 50 | 110010 | * |
| 21 | 010101 | u | 51 | 110011 | + |
| 22 | 010110 | v | 52 | 110100 | , |
| 23 | 010111 | w | 53 | 110101 | ; |
| 24 | 011000 | x | 54 | 110110 | = |
| 25 | 011001 | y | 55 | 110111 | % |
| 26 | 011010 | z | 56 | 111000 | - |
| 27 | 011011 | 0 | 57 | 111001 | . |
| 28 | 011100 | 1 | 58 | 111010 | _ |
| 29 | 011101 | 2 | 59 | 111011 | ~ |
So, now, a real-life example:
http://narc.ro/
Becomes:
8 20 20 16 37 38 38 14 1 18 3 57 18 15 38 (note that we have to include the protocol) 001000 010100 010100 010000 100101 100110 100110 001110 000001 010010 000011 111001 010010 010010 001111
which are 90 bits, or 4 4-byte UTF chars (4 * 21 == 84) + 6 bits:
001000010100010100010 000100101100110100110 001110000001010010000 011111001010010010010 001111
Then, in UTF:
11110001 10000010 10100010 10100010 11110000 10100101 10100110 10100110 11110001 10110000 10001010 10010000 11110011 10111001 10010010 10010010 00011110 (note: always right-pad any necessary zeroes to make a complete final byte)
Or, in base 16, the same UTF:
F182A2A2 F0A5A6A6 F1B08A90 F3B99292 1E
However, the last character is a non-printable less than 0x20, so we turn into a 2-byte character:
11000111 10100000 C7A0
To summarize, from 15 characters of US-ASCII, we went to 5 UTF characters (admittedly very heavily using 4-byte characters), namely 𥦦Ǡ. To turn this into a ssURL, we will use the su: pseudo-protocol (though remember, this is not really a URL anymore, since it does not use any characters from the correct character set), resulting in:
su:𥦦Ǡ (equivalent to http://narc.ro/)