Codecs

mom.codec

synopsis: Many different types of common encode/decode functions.
module: mom.codec

This module contains codecs for converting between hex, base64, base85, base58, base62, base36, decimal, and binary representations of bytes.

Bytes are simply a base-256 representation of data. Consider a small PNG file:

\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00
\x05\x00\x00\x00\x05\x08\x06\x00\x00\x00\x8do&
\xe5\x00\x00\x00\x1cIDAT\x08\xd7c\xf8\xff\xff?
\xc3\x7f\x06 \x05\xc3 \x12\x84\xd01\xf1\x82X\xcd
\x04\x00\x0e\xf55\xcb\xd1\x8e\x0e\x1f\x00\x00\x00
\x00IEND\xaeB`\x82

That is what an example PNG file looks like as a stream of bytes (base-256) in Python (with line-breaks added for visual-clarity).

If we wanted to send this PNG within an email message, which is restricted to ASCII characters, we cannot simply add these bytes in and hope they go through unchanged. The receiver at the other end expects to get a copy of exactly the same bytes that you send. Because we are limited to using ASCII characters, we need to “encode” this binary data into a subset of ASCII characters before transmitting, and the receiver needs to “decode” those ASCII characters back into binary data before attempting to display it.

Base-encoding raw bytes into ASCII characters is used to safely transmit binary data through any medium that does not inherently support non-ASCII data.

Therefore, we need to convert the above PNG binary data into something that looks like (again, line-breaks have been added for visual clarity only):

iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI
12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJg
gg==
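
As a rough illustration, the first 8 bytes of the PNG above (its signature) can be encoded and decoded with Python's standard-library base64 module. This sketch uses the standard library rather than the mom.codec wrappers documented further down.

```python
import base64

# The 8-byte PNG signature taken from the start of the byte stream above.
png_signature = b"\x89PNG\r\n\x1a\n"

# Encode the raw bytes into a subset of ASCII characters.
encoded = base64.b64encode(png_signature)

# Decode the ASCII characters back into the original bytes.
decoded = base64.b64decode(encoded)

print(encoded)                   # b'iVBORw0KGgo='
print(decoded == png_signature)  # True
```

The encoded signature starts with the same characters as the full encoded PNG shown above.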

The base-encoding method that we can use is limited by these criteria:

  1. The number of ASCII characters, a subset of ASCII, that we can use to represent binary data (case-sensitivity, ambiguity, base, deviation from standard characters, etc.)
  2. Whether human beings are involved in the transmission of data. Ergo, visual clarity, legibility, readability, human-inputability, and even double-click-to-select-ability! (Hint: try double-clicking the encoded data above to see whether it selects all of it. It won’t.) This is a corollary of point 1.
  3. Whether we want the process to be more time-efficient or space-efficient. That is, whether we can process binary data in chunks or whether we need to convert it into an arbitrarily large integer before encoding, respectively.

Terminology

Answer this question:

How many times should I multiply 2 by itself to obtain 8?

You say:

That’s a dumb question. 3 times!

Well, congratulations! You have just re-discovered logarithms. In a system of equations, we may have unknowns. Given an equation with 3 parts, 2 of which are known, we often need to find the 3rd. Logarithms are used when you know the base (radix) and the number, but not the exponent.

Logarithms help you find exponents.

Take for example:

2**0 = 1
2**1 = 2
2**2 = 4
2**3 = 8
2**4 = 16
2**5 = 32
2**6 = 64

Alternatively, logarithms can be thought of as answering the question:

Raising 2 to which exponent gets me 64?

This is the same as doing:

import math
math.log(64, 2)    # 6.0; number 64, base 2.

read as “logarithm to the base 2 of 64” which gives 6. That is, if we raise 2 to the power 6, we get 64.

The concept of roots or radicals is also related. Roots help you find the base (radix) given the exponent and the number. So:

8 ** (1.0 / 3)   # 2.0; cube root. exponent 3, number 8.

Roots help you find the base.

Hopefully, that brings you up to speed and enables you to clearly see the relationship between powers, logarithms, and roots.

We will often refer to the term byte and mean it to be an octet (8) of bits. Strictly speaking, the number of bits in a byte depends on the processor architecture, so a 9-bit byte or even a 36-bit byte is possible.

For our purposes, however, a byte means a chunk of 8 bits–that is, an octet.

By the term “encoding,” throughout this discussion, we mean a way of representing a sequence of bytes in terms of a subset of US-ASCII characters, each of which uses 7 bits. This ensures that in communication and messaging that involves the transmission of binary data, at a small increase in encoded size, we can safely transmit this data encoded as ASCII text. We could be pedantic and use the phrase “ASCII-subset-based encoding” everywhere, but we’ll simply refer to it as “encoding” instead.

How it applies to encodings

Byte, or base-256, representation allows each byte to be represented using one of 256 values (0-255 inclusive). Modern processors can process data in chunks of 32 bits (4 bytes), 64 bits (8 bytes), and so on. Notice that these are powers of 2 given that our processors are binary machines.

We could feed a 64-bit processor 8 bits of data at a time, but the codec would then perform roughly 8 times as many operations as it needs to. If you feed the same 64-bit processor 64 bits of data at a time instead, the encoding process can be up to 8 times as fast. Whoa!

Therefore, in order to ensure that our codecs are fast, we need to feed our processors data in chunks to be more time-efficient. The two types of encoding we discuss here are:

  1. big-integer-based polynomial-time base-conversions
  2. chunked linear-time base-conversions.

These two types of encoding are not always compatible with each other.

Big-integer based encoding

This method of encoding is generally costlier because the raw bytes (base-256 representation) are first converted into a big integer, which is then repeatedly divided by the base to obtain an encoded sequence of bytes. Bases 58, 60, and 62 are not powers of 2, and therefore cannot be reliably or efficiently encoded in chunks of powers of 2 (used by microprocessors) so as to produce the same encoded representations as their big integer encoded representations. Therefore, using these encodings for a large amount of binary data is not advised. The base-58 and base-62 modules in this library are meant to be used with small amounts of binary data.
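
The repeated-division idea can be sketched in a few lines of Python. This is a minimal illustration, not this library's implementation: the names bigint_encode, bigint_decode, and ALPHABET_62 are hypothetical, and leading zero bytes are kept by prefixing one zero character per zero byte, which is an assumption made for the sketch.

```python
# Illustrative alphabet in natural ASCII order (digits, uppercase, lowercase).
ALPHABET_62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def bigint_encode(raw_bytes, alphabet=ALPHABET_62):
    """Encode bytes by treating them as one big base-256 integer."""
    base = len(alphabet)
    number = int.from_bytes(raw_bytes, "big")
    digits = []
    while number > 0:
        # Each division peels off the least significant digit.
        number, remainder = divmod(number, base)
        digits.append(alphabet[remainder])
    # Leading zero bytes vanish in the integer, so re-add them as padding.
    zero_count = len(raw_bytes) - len(raw_bytes.lstrip(b"\x00"))
    return alphabet[0] * zero_count + "".join(reversed(digits))


def bigint_decode(encoded, alphabet=ALPHABET_62):
    """Reverse bigint_encode: rebuild the integer, then the bytes."""
    base = len(alphabet)
    number = 0
    for character in encoded:
        number = number * base + alphabet.index(character)
    zero_count = len(encoded) - len(encoded.lstrip(alphabet[0]))
    body = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return b"\x00" * zero_count + body


print(bigint_encode(b"\xff"))  # 47 (255 = 4 * 62 + 7)
```

Every division touches the whole integer, which is why the cost grows faster than linearly with the input size.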

Chunked encoding

Base encoding a chunk of 4 bytes at a time (32 bits at a time) means we would need a way to represent each of the 256**4 (4294967296) values with our encoding:

256**4 # 4294967296
2**32  # 4294967296

Given an encoding alphabet of 85 ASCII characters, for example, we need to find an exponent (logarithm) that allows us to represent each one of these 4294967296 values:

85**4 # 52200625
85**5 # 4437053125

>>> 85**5 >= 2**32
True

Done using logarithms:

import math
math.log(2**32, 85)   # 4.9926740807111996

Therefore, we would need 5 characters from this encoding alphabet to represent 4 bytes. Since 85 is not a power of 2, there is going to be a little wastage of space and the codec will need to deal with padding and de-padding bytes to ensure that the resulting size is a multiple of the chunk size, but the byte sequence will be more compact than its base-16 (hexadecimal) representation, for example:

import math
math.log(2**32, 16)   # 8.0

As you can see, if we used hexadecimal representation instead, each 4-byte chunk would be represented using 8 characters from the encoding alphabet. This is clearly less space-efficient than using 5 characters per 4 bytes of binary data.
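
The same comparison can be made with integer arithmetic only, which avoids floating-point rounding in the logarithm. The helper name chars_per_chunk is hypothetical and exists only for this sketch.

```python
def chars_per_chunk(chunk_bytes, base):
    """Return how many base-`base` characters can represent `chunk_bytes` bytes."""
    values = 256 ** chunk_bytes
    count = 0
    capacity = 1
    # Keep adding characters until the alphabet covers every chunk value.
    while capacity < values:
        capacity *= base
        count += 1
    return count


for base in (16, 85):
    chars = chars_per_chunk(4, base)
    # Prints the base, characters per 4-byte chunk, and the expansion ratio.
    print(base, chars, chars / 4)
```

For a 4-byte chunk this reports 8 characters for base 16 and 5 characters for base 85, matching the logarithms above.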

Base-64 as another example

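Base-64 works on chunks of 3 bytes rather than 4. It represents each of the 256**3 (16777216) values using 4 characters from an alphabet of 64 ASCII characters, because 64**4 is exactly 256**3.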

Bytes base-encoding

These codecs preserve bytes “as is” when decoding back to bytes. In a more mathematical sense,

g(f(x)) is an identity function

where g is the decoder and f is the encoder.

Why have we reproduced base64 encoding/decoding functions here when the standard library has them? Well, those functions behave differently in Python 2.x and Python 3.x. The Python 3.x equivalents do not accept Unicode strings as their arguments, whereas the Python 2.x versions would happily encode your Unicode strings without warning you. You know that you are supposed to encode them to UTF-8 or another byte encoding before you base64-encode them, right? These wrappers are re-implemented so that you do not make these mistakes. Use them. They will help prevent unexpected bugs.
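
The Python 3.x behaviour described above can be seen with the standard-library base64 module. This sketch does not use the wrappers in this module; it only shows why a Unicode string must be encoded to bytes first.

```python
import base64

text = "深入 Python"

try:
    # Python 3 rejects Unicode strings here.
    base64.b64encode(text)
    rejected = False
except TypeError:
    rejected = True

# Encode to UTF-8 bytes first, then base64-encode the bytes.
encoded = base64.b64encode(text.encode("utf-8"))

# Reverse both steps to get the original string back.
decoded_text = base64.b64decode(encoded).decode("utf-8")

print(rejected)              # True
print(decoded_text == text)  # True
```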

mom.codec.base85_encode(raw_bytes, charset='ASCII85')

Encodes raw bytes into ASCII85 representation.

Encode your Unicode strings to a byte encoding before base85-encoding them.

Parameters:
  • raw_bytes – Bytes to encode.
  • charset – “ASCII85” (default) or “RFC1924”.
Returns:

ASCII85 encoded string.

mom.codec.base85_decode(encoded, charset='ASCII85')

Decodes ASCII85-encoded bytes into raw bytes.

Parameters:
  • encoded – ASCII85 encoded representation.
  • charset – “ASCII85” (default) or “RFC1924”.
Returns:

Raw bytes.

mom.codec.base64_encode(raw_bytes)

Encodes raw bytes into base64 representation without appending a trailing newline character. Not URL-safe.

Encode your Unicode strings to a byte encoding before base64-encoding them.

Parameters: raw_bytes – Bytes to encode.
Returns: Base64 encoded bytes without newline characters.

mom.codec.base64_decode(encoded)

Decodes base64-encoded bytes into raw bytes. Not URL-safe.

Parameters: encoded – Base-64 encoded representation.
Returns: Raw bytes.

mom.codec.base64_urlsafe_encode(raw_bytes)

Encodes raw bytes into URL-safe base64 bytes.

Encode your Unicode strings to a byte encoding before base64-encoding them.

Parameters: raw_bytes – Bytes to encode.
Returns: Base64 encoded string without newline characters.

mom.codec.base64_urlsafe_decode(encoded)

Decodes URL-safe base64-encoded bytes into raw bytes.

Parameters: encoded – Base-64 encoded representation.
Returns: Raw bytes.

mom.codec.base62_encode(raw_bytes)

Encodes raw bytes into base-62 representation. URL-safe and human safe.

Encode your Unicode strings to a byte encoding before base-62-encoding them.

Convenience wrapper for consistency.

Parameters: raw_bytes – Bytes to encode.
Returns: Base-62 encoded bytes.

mom.codec.base62_decode(encoded)

Decodes base-62-encoded bytes into raw bytes.

Convenience wrapper for consistency.

Parameters: encoded – Base-62 encoded bytes.
Returns: Raw bytes.

mom.codec.base58_encode(raw_bytes)

Encodes raw bytes into base-58 representation. URL-safe and human safe.

Encode your Unicode strings to a byte encoding before base-58-encoding them.

Convenience wrapper for consistency.

Parameters: raw_bytes – Bytes to encode.
Returns: Base-58 encoded bytes.

mom.codec.base58_decode(encoded)

Decodes base-58-encoded bytes into raw bytes.

Convenience wrapper for consistency.

Parameters: encoded – Base-58 encoded bytes.
Returns: Raw bytes.

mom.codec.base36_encode(raw_bytes)

Encodes raw bytes into base-36 representation.

Encode your Unicode strings to a byte encoding before base-36-encoding them.

Convenience wrapper for consistency.

Parameters: raw_bytes – Bytes to encode.
Returns: Base-36 encoded bytes.

mom.codec.base36_decode(encoded)

Decodes base-36-encoded bytes into raw bytes.

Convenience wrapper for consistency.

Parameters: encoded – Base-36 encoded bytes.
Returns: Raw bytes.

mom.codec.hex_encode(raw_bytes)

Encodes raw bytes into hexadecimal representation.

Encode your Unicode strings to a byte encoding before hex-encoding them.

Parameters: raw_bytes – Bytes.
Returns: Hex-encoded representation.

mom.codec.hex_decode(encoded)

Decodes hexadecimal-encoded bytes into raw bytes.

Parameters: encoded – Hex representation.
Returns: Raw bytes.

mom.codec.decimal_encode(raw_bytes)

Encodes raw bytes into decimal representation. Leading zero bytes are preserved.

Encode your Unicode strings to a byte encoding before decimal-encoding them.

Parameters: raw_bytes – Bytes.
Returns: Decimal-encoded representation.

mom.codec.decimal_decode(encoded)

Decodes decimal-encoded bytes to raw bytes. Leading zeros are converted to leading zero bytes.

Parameters: encoded – Decimal-encoded representation.
Returns: Raw bytes.

mom.codec.bin_encode(raw_bytes)

Encodes raw bytes into binary representation.

Encode your Unicode strings to a byte encoding before binary-encoding them.

Parameters: raw_bytes – Raw bytes.
Returns: Binary representation.

mom.codec.bin_decode(encoded)

Decodes binary-encoded bytes into raw bytes.

Parameters: encoded – Binary representation.
Returns: Raw bytes.

synopsis: ASCII-85 and RFC1924 Base85 encoding and decoding functions.
module: mom.codec.base85
see: http://en.wikipedia.org/wiki/Ascii85
see: http://tools.ietf.org/html/rfc1924
see: http://www.piclist.com/techref/method/encode.htm

Where should you use base85?

Base85-encoding is used to compactly represent binary data in 7-bit ASCII. It is, therefore, 7-bit MIME-safe but not safe to use in URLs, SGML, HTTP cookies, and other similar places. Example scenarios where Base85 encoding can be put to use are Adobe PDF documents, Adobe PostScript format, binary diffs (patches), efficiently storing RSA keys, etc.

The ASCII85 character set-based encoding is mostly used by Adobe PDF and PostScript formats. It may also be used to store RSA keys or binary data with a lot of zero byte sequences. The RFC1924 character set-based encoding, however, may be used to compactly represent 128-bit unsigned integers (like IPv6 addresses) or binary diffs. Encoding based on RFC1924 does not compact zero byte sequences, so this form of encoding is less space-efficient than the ASCII85 version which compacts redundant zero byte sequences.

About base85 and this implementation

Base-85 represents 4 bytes as 5 ASCII characters, which is a size increase of ~25% over plain binary data. Base-64 represents 3 bytes as 4 ASCII characters, a size increase of ~33% (~37% once MIME line breaks are included), so base-85 output is roughly 7% smaller.

However, because the base64 encoding routines in Python are implemented in C, base-64 may be less expensive to compute. This implementation of base-85 uses a lot of tricks to reduce computation time and is hence generally faster than many other implementations. If computation speed is a concern for you, please contribute a C implementation or wait for one.

Functions

mom.codec.base85.b85encode(raw_bytes, prefix=None, suffix=None, _base85_bytes=array('B', [33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117]), _padding=False, _compact_zero=True, _compact_char='z')

ASCII-85 encodes a sequence of raw bytes.

The character set in use is:

ASCII 33 ("!") to ASCII 117 ("u")

If the number of raw bytes is not divisible by 4, the byte sequence is padded with up to 3 null bytes before encoding. After encoding, as many bytes as were added as padding are removed from the end of the encoded sequence if _padding is False (default).

Encodes a zero-group (four zero bytes) as “z” instead of “!!!!!”.

The resulting encoded ASCII string is not URL-safe nor is it safe to include within SGML/XML/HTML documents. You will need to escape special characters if you decide to include such an encoded string within these documents.

Parameters:
  • raw_bytes – Raw bytes.
  • prefix – The prefix used by the encoded text. None by default.
  • suffix – The suffix used by the encoded text. None by default.
  • _base85_bytes – (Internal) Character set to use.
  • _compact_zero – (Internal) Encodes a zero-group (four zero bytes) as “z” instead of “!!!!!” if this is True (default).
  • _compact_char – (Internal) Character used to represent compact groups (“z” default)
Returns:

ASCII-85 encoded bytes.
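
The padding and zero-group rules described above can be sketched as follows. This is a simplified illustration, not the optimized implementation in this module: the function name ascii85_encode_sketch is hypothetical, and the result is compared against Python's standard-library base64.a85encode.

```python
import base64


def ascii85_encode_sketch(raw_bytes):
    """Encode bytes 4 at a time into 5 characters from "!" to "u"."""
    padding = -len(raw_bytes) % 4
    padded = raw_bytes + b"\x00" * padding
    groups = []
    for offset in range(0, len(padded), 4):
        value = int.from_bytes(padded[offset:offset + 4], "big")
        is_padded_tail = padding > 0 and offset + 4 == len(padded)
        # A full group of four zero bytes is compacted to "z".
        if value == 0 and not is_padded_tail:
            groups.append("z")
            continue
        digits = []
        for _ in range(5):
            value, remainder = divmod(value, 85)
            digits.append(chr(33 + remainder))
        groups.append("".join(reversed(digits)))
    encoded = "".join(groups)
    # Drop as many characters as padding bytes were added.
    if padding:
        encoded = encoded[:-padding]
    return encoded.encode("ascii")


print(ascii85_encode_sketch(b"Man "))              # b'9jqo^'
print(ascii85_encode_sketch(b"\x00\x00\x00\x00"))  # b'z'
```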

mom.codec.base85.b85decode(encoded, prefix=None, suffix=None, _base85_bytes=array('B', [33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117]), _base85_ords={'!': 0, '#': 2, '"': 1, '%': 4, '$': 3, "'": 6, '&': 5, ')': 8, '(': 7, '+': 10, '*': 9, '-': 12, ',': 11, '/': 14, '.': 13, '1': 16, '0': 15, '3': 18, '2': 17, '5': 20, '4': 19, '7': 22, '6': 21, '9': 24, '8': 23, ';': 26, ':': 25, '=': 28, '<': 27, '?': 30, '>': 29, 'A': 32, '@': 31, 'C': 34, 'B': 33, 'E': 36, 'D': 35, 'G': 38, 'F': 37, 'I': 40, 'H': 39, 'K': 42, 'J': 41, 'M': 44, 'L': 43, 'O': 46, 'N': 45, 'Q': 48, 'P': 47, 'S': 50, 'R': 49, 'U': 52, 'T': 51, 'W': 54, 'V': 53, 'Y': 56, 'X': 55, '[': 58, 'Z': 57, ']': 60, '\\': 59, '_': 62, '^': 61, 'a': 64, '`': 63, 'c': 66, 'b': 65, 'e': 68, 'd': 67, 'g': 70, 'f': 69, 'i': 72, 'h': 71, 'k': 74, 'j': 73, 'm': 76, 'l': 75, 'o': 78, 'n': 77, 'q': 80, 'p': 79, 's': 82, 'r': 81, 'u': 84, 't': 83}, _uncompact_zero=True, _compact_char='z')

Decodes an ASCII85-encoded string into raw bytes.

Parameters:
  • encoded – Encoded ASCII string.
  • prefix – The prefix used by the encoded text. None by default.
  • suffix – The suffix used by the encoded text. None by default.
  • _base85_bytes – (Internal) Character set to use.
  • _base85_ords – (Internal) A lookup table that maps a base85 character to its ordinal value. You should not need to use this.
  • _uncompact_zero – (Internal) Treats “z” (a zero-group of four zero bytes) as “!!!!!” if True (default).
  • _compact_char – (Internal) Character used to represent compact groups (“z” default)
Returns:

ASCII85-decoded raw bytes.

mom.codec.base85.rfc1924_b85encode(raw_bytes, _padding=False)

Base85 encodes using the RFC1924 character set.

The character set is:

0–9, A–Z, a–z, and then !#$%&()*+-;<=>?@^_`{|}~

These characters are specifically not included:

"',./:[]\

This is the encoding method used by Mercurial (and git) to generate binary diffs, for example. They chose the IPv6 character set and encode using the ASCII85 encoding method while not compacting zero-byte sequences.

See:

http://tools.ietf.org/html/rfc1924

Parameters:
  • raw_bytes – Raw bytes.
  • _padding – (Internal) Whether padding should be included in the encoded output. (Default False, which is usually what you want.)
Returns:

RFC1924 base85 encoded string.

mom.codec.base85.rfc1924_b85decode(encoded)

Base85 decodes using the RFC1924 character set.

This is the encoding method used by Mercurial (and git) to generate binary diffs, for example. They chose the IPv6 character set and encode using the ASCII85 encoding method while not compacting zero-byte sequences.

See: http://tools.ietf.org/html/rfc1924
Parameters: encoded – RFC1924 Base85 encoded string.
Returns: Decoded bytes.

mom.codec.base85.ipv6_b85encode(uint128, _base85_bytes=array('B', [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 33, 35, 36, 37, 38, 40, 41, 42, 43, 45, 59, 60, 61, 62, 63, 64, 94, 95, 96, 123, 124, 125, 126]))

Encodes a 128-bit unsigned integer using the RFC 1924 base-85 encoding. Used to encode IPv6 addresses or 128-bit chunks.

Parameters:
  • uint128 – A 128-bit unsigned integer to be encoded.
  • _base85_bytes – (Internal) Base85 encoding charset lookup table.
Returns:

RFC1924 Base85-encoded string.

mom.codec.base85.ipv6_b85decode(encoded, _base85_ords={'!': 62, '#': 63, '%': 65, '$': 64, '&': 66, ')': 68, '(': 67, '+': 70, '*': 69, '-': 71, '1': 1, '0': 0, '3': 3, '2': 2, '5': 5, '4': 4, '7': 7, '6': 6, '9': 9, '8': 8, ';': 72, '=': 74, '<': 73, '?': 76, '>': 75, 'A': 10, '@': 77, 'C': 12, 'B': 11, 'E': 14, 'D': 13, 'G': 16, 'F': 15, 'I': 18, 'H': 17, 'K': 20, 'J': 19, 'M': 22, 'L': 21, 'O': 24, 'N': 23, 'Q': 26, 'P': 25, 'S': 28, 'R': 27, 'U': 30, 'T': 29, 'W': 32, 'V': 31, 'Y': 34, 'X': 33, 'Z': 35, '_': 79, '^': 78, 'a': 36, '`': 80, 'c': 38, 'b': 37, 'e': 40, 'd': 39, 'g': 42, 'f': 41, 'i': 44, 'h': 43, 'k': 46, 'j': 45, 'm': 48, 'l': 47, 'o': 50, 'n': 49, 'q': 52, 'p': 51, 's': 54, 'r': 53, 'u': 56, 't': 55, 'w': 58, 'v': 57, 'y': 60, 'x': 59, '{': 81, 'z': 61, '}': 83, '|': 82, '~': 84})

Decodes an RFC1924 Base-85 encoded string to its 128-bit unsigned integral representation. Used to base85-decode IPv6 addresses or 128-bit chunks.

Whitespace is ignored. Raises an OverflowError if stray characters are found.

Parameters:
  • encoded – RFC1924 Base85-encoded string.
  • _base85_ords – (Internal) Look up table.
Returns:

A 128-bit unsigned integer.
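
The 128-bit conversion can be sketched with plain integer arithmetic. This is an illustration only, not this module's implementation: the names RFC1924_ALPHABET, ipv6_encode_sketch, and ipv6_decode_sketch are hypothetical, and the standard-library ipaddress module is used just to obtain an integer from an address. A 128-bit value always fits in 20 base-85 characters because 85**20 is larger than 2**128.

```python
import ipaddress

# Character set listed above: 0-9, A-Z, a-z, then 23 punctuation characters.
RFC1924_ALPHABET = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "!#$%&()*+-;<=>?@^_`{|}~"
)


def ipv6_encode_sketch(uint128):
    """Encode a 128-bit unsigned integer as 20 base-85 characters."""
    if not 0 <= uint128 < 2 ** 128:
        raise OverflowError("value does not fit into 128 bits")
    digits = []
    for _ in range(20):
        uint128, remainder = divmod(uint128, 85)
        digits.append(RFC1924_ALPHABET[remainder])
    return "".join(reversed(digits))


def ipv6_decode_sketch(encoded):
    """Decode 20 base-85 characters back into an unsigned integer."""
    number = 0
    for character in encoded:
        number = number * 85 + RFC1924_ALPHABET.index(character)
    return number


address = int(ipaddress.IPv6Address("1080::8:800:200c:417a"))
encoded_address = ipv6_encode_sketch(address)
print(len(encoded_address))                           # 20
print(ipv6_decode_sketch(encoded_address) == address)  # True
```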

synopsis: Base-62 7-bit ASCII-safe representation for compact human-input.
module: mom.codec.base62

Where should you use base-62?

Base-62 representation is 7-bit ASCII-safe, MIME-safe, URL-safe, HTTP cookie-safe, and almost human being-safe. Base-62 representation can:

  • be readable and editable by a human being;
  • safely and compactly represent numbers;
  • contain only alphanumeric characters;
  • not contain punctuation characters.

For examples of places where you can use base-62, see the documentation for mom.codec.base58.

In general, use base-62 in any 7-bit ASCII-safe compact communication where human beings and communication devices may be significantly involved.

When to prefer base-62 over base-58?

When you don’t care about the visual ambiguity between these characters:

  • 0 (ASCII NUMERAL ZERO)
  • O (ASCII UPPERCASE ALPHABET O)
  • I (ASCII UPPERCASE ALPHABET I)
  • l (ASCII LOWERCASE ALPHABET L)

A practical example (versioned static asset URLs):

In order to reduce the number of HTTP requests for static assets sent to a Web server, developers often include a hash of the asset being served in the URL and set the expiration time of the asset to a very long period (say, 365 days).

This enables an almost perfect form of client-side asset caching while still serving fresh content when it changes. To minimize the size overhead introduced into the URL by such hashed-identifiers, the identifiers themselves can be shortened using base-58 or base-62 encoding. For example:

$ sha1sum file.js
a497f210fc9c5d02fc7dc7bd211cb0c74da0ae16

The asset URL for this file can be:

http://s.example.com/js/a497f210fc9c5d02fc7dc7bd211cb0c74da0ae16/file.js

where example.com is a canonical domain used only for informational purposes. However, the hashed-identifier in the URL is long; it can be shortened using base-58 or base-62 to:

# Base-58
http://s.example.com/js/3HzsRcRETLZ3qFgDzG1QE7CJJNeh/file.js

# Base-62
http://s.example.com/js/NU3qW1G4teZJynubDFZnbzeOUFS/file.js

The first 12 characters of a SHA-1 hash are sufficiently strong for serving static assets while minimizing collision overhead in the context of a small-to-medium-sized Website and considering these are URLs for static served assets that can change over periods of time. You may want to consider using the full hash for large-scale Websites. Therefore, we can shorten the original asset URL to:

http://s.example.com/js/a497f210fc9c/file.js

which can then be reduced utilizing base-58 or base-62 encoding to:

# Base-58
http://s.example.com/js/2QxqmqiFm/file.js

# Base-62
http://s.example.com/js/pO7arZWO/file.js
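
The base-62 identifier in the last URL can be reproduced with a short sketch. The helper names base62_from_bytes and short_identifier are hypothetical and this is not the library's implementation. It assumes the first 12 hexadecimal characters of the hash are converted to 6 raw bytes and then encoded as one big integer with the natural-order base-62 alphabet.

```python
ALPHABET_62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def base62_from_bytes(raw_bytes):
    """Big-integer base-62 encoding that keeps leading zero bytes."""
    number = int.from_bytes(raw_bytes, "big")
    digits = []
    while number > 0:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET_62[remainder])
    zero_count = len(raw_bytes) - len(raw_bytes.lstrip(b"\x00"))
    return ALPHABET_62[0] * zero_count + "".join(reversed(digits))


def short_identifier(hex_digest, hex_chars=12):
    """Shorten a hexadecimal digest to a compact base-62 identifier."""
    return base62_from_bytes(bytes.fromhex(hex_digest[:hex_chars]))


digest = "a497f210fc9c5d02fc7dc7bd211cb0c74da0ae16"
print(short_identifier(digest))  # pO7arZWO
```

For a real asset, the digest would come from something like hashlib.sha1(content).hexdigest().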

These are much shorter URLs than the original. Notice that we have not renamed the file file.js to 2QxqmqiFm.js or pO7arZWO.js because that would cause an unnecessary explosion of files on the server as new files would be generated every time the source files change. Instead, we have chosen to make use of Web server URL-rewriting rules to strip the hashed identifier and serve the file fresh as it is on the server file system. These are therefore non-versioned assets; only the URLs that point at them are versioned. That is, if you took a diff between the files that these URLs point at:

http://s.example.com/js/pO7arZWO/file.js
http://s.example.com/js/2qiFqxEm/file.js

you would not see a difference. Only the URLs differ to trick the browser into caching as well as it can.

The hashed-identifier is not part of the query string for this asset URL because certain proxies do not cache files served from URLs that include query strings. That is, we are not doing this:

# Base-58 -- Don't do this. Not all proxies will cache it.
http://s.example.com/js/file.js?v=2QxqmqiFm

# Base-62 -- Don't do this. Not all proxies will cache it.
http://s.example.com/js/file.js?v=pO7arZWO

If you wish to support versioned assets, however, you may need to rename files to include their hashed identifiers and avoid URL-rewriting instead. For example:

# Base-58
http://s.example.com/js/file-2QxqmqiFm.js

# Base-62
http://s.example.com/js/file-pO7arZWO.js

Note

Do note that the base-58 encoded version of the SHA-1 hash (40 characters in hexadecimal representation) may have a length of either 27 or 28. Similarly, for the SHA-1 hash (40 characters in hex), the base62-encoded version may have a length of either 26 or 27.

Therefore, please ensure that your rewriting rules take variable length into account.
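
A rewriting rule that tolerates identifiers of any length might look like the following sketch. It is written in Python for illustration rather than in any particular Web server's configuration language, and the /js/ path layout, the pattern, and the function name strip_identifier are assumptions.

```python
import re

# Hypothetical layout: /js/<identifier>/<file name>
# The identifier may be any length, so "+" is used instead of a fixed count.
VERSIONED_PATH = re.compile(r"^/js/[0-9A-Za-z]+/(?P<name>[^/]+)$")


def strip_identifier(path):
    """Map a versioned asset path to the path of the file on disk."""
    match = VERSIONED_PATH.match(path)
    if match is None:
        # Not a versioned path; leave it untouched.
        return path
    return "/js/" + match.group("name")


print(strip_identifier("/js/pO7arZWO/file.js"))                     # /js/file.js
print(strip_identifier("/js/NU3qW1G4teZJynubDFZnbzeOUFS/file.js"))  # /js/file.js
```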

The following benefits are therefore achieved:

  • Client-side caching is fully utilized.
  • The number of HTTP requests sent to Web servers by clients is reduced.
  • When assets change, so do their SHA-1 hashed identifiers, and hence their asset URLs.
  • Shorter URLs also imply that fewer bytes are transferred in HTTP responses.
  • Bandwidth consumption is reduced.
  • Multiple versions of assets can be served (if required).

Essentially, URLs shortened using base-58 or base-62 encoding can result in a faster Web-browsing experience for end-users.

Functions

mom.codec.base62.b62encode(raw_bytes, base_bytes='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz', _padding=True)

Base62 encodes a sequence of raw bytes. Zero-byte sequences are preserved by default.

Parameters:
  • raw_bytes – Raw bytes to encode.
  • base_bytes – (Internal) The character set to use. Defaults to ASCII62_BYTES that uses natural ASCII order.
  • _padding – (Internal) True (default) to include prefixed zero-byte sequence padding converted to appropriate representation.
Returns:

Base-62 encoded bytes.

mom.codec.base62.b62decode(encoded, base_bytes='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz', base_ords={'1': 1, '0': 0, '3': 3, '2': 2, '5': 5, '4': 4, '7': 7, '6': 6, '9': 9, '8': 8, 'A': 10, 'C': 12, 'B': 11, 'E': 14, 'D': 13, 'G': 16, 'F': 15, 'I': 18, 'H': 17, 'K': 20, 'J': 19, 'M': 22, 'L': 21, 'O': 24, 'N': 23, 'Q': 26, 'P': 25, 'S': 28, 'R': 27, 'U': 30, 'T': 29, 'W': 32, 'V': 31, 'Y': 34, 'X': 33, 'Z': 35, 'a': 36, 'c': 38, 'b': 37, 'e': 40, 'd': 39, 'g': 42, 'f': 41, 'i': 44, 'h': 43, 'k': 46, 'j': 45, 'm': 48, 'l': 47, 'o': 50, 'n': 49, 'q': 52, 'p': 51, 's': 54, 'r': 53, 'u': 56, 't': 55, 'w': 58, 'v': 57, 'y': 60, 'x': 59, 'z': 61})

Base-62 decodes a sequence of bytes into raw bytes. Whitespace is ignored.

Parameters:
  • encoded – Base-62 encoded bytes.
  • base_bytes – (Internal) The character set to use. Defaults to ASCII62_BYTES that uses natural ASCII order.
  • base_ords – (Internal) Character-to-ordinal lookup table for the specified character set.
Returns:

Raw bytes.

synopsis: Base-58 repr for unambiguous display & compact human-input.
module: mom.codec.base58

Where should you use base-58?

Base-58 representation is 7-bit ASCII-safe, MIME-safe, URL-safe, HTTP cookie-safe, and human being-safe. Base-58 representation can:

  • be readable and editable by a human being;
  • safely and compactly represent numbers;
  • contain only alphanumeric characters (omitting a few with visually ambiguous glyphs, namely “0OIl”);
  • not contain punctuation characters.

Example scenarios where base-58 encoding may be used:

  • Visually-legible account numbers
  • Shortened URL paths
  • OAuth verification codes
  • Unambiguously printable and displayable key codes (for example, net-banking PINs, verification codes sent via SMS, etc.)
  • Bitcoin decentralized crypto-currency addresses
  • CAPTCHAs
  • Revision control changeset identifiers
  • Encoding email addresses compactly into JavaScript that decodes by itself to display on Web pages in order to reduce spam by stopping email harvesters from scraping email addresses from Web pages.

In general, use base-58 in any 7-bit ASCII-safe compact communication where human beings, paper, and communication devices may be significantly involved.

The default base-58 character set is [0-9A-Za-z] (base-62) with some characters omitted to make them visually-legible and unambiguously printable. The characters omitted are:

  • 0 (ASCII NUMERAL ZERO)
  • O (ASCII UPPERCASE ALPHABET O)
  • I (ASCII UPPERCASE ALPHABET I)
  • l (ASCII LOWERCASE ALPHABET L)
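
The relationship between the two character sets can be checked with a short sketch. The variable names are illustrative and are not the module's own constants.

```python
import string

# Base-62 alphabet in natural ASCII order: digits, uppercase, lowercase.
ASCII62 = string.digits + string.ascii_uppercase + string.ascii_lowercase

# The four visually ambiguous characters listed above.
AMBIGUOUS = "0OIl"

# Dropping them leaves the 58-character alphabet.
ASCII58 = "".join(ch for ch in ASCII62 if ch not in AMBIGUOUS)

print(len(ASCII62), len(ASCII58))  # 62 58
print(ASCII58)
```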

For a practical example, see the documentation for mom.codec.base62.

Functions

mom.codec.base58.b58encode(raw_bytes, base_bytes='123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz', _padding=True)

Base58 encodes a sequence of raw bytes. Zero-byte sequences are preserved by default.

Parameters:
  • raw_bytes – Raw bytes to encode.
  • base_bytes – The character set to use. Defaults to ASCII58_BYTES that uses natural ASCII order.
  • _padding – (Internal) True (default) to include prefixed zero-byte sequence padding converted to appropriate representation.
Returns:

Base-58 encoded bytes.

mom.codec.base58.b58decode(encoded, base_bytes='123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz', base_ords={'1': 0, '3': 2, '2': 1, '5': 4, '4': 3, '7': 6, '6': 5, '9': 8, '8': 7, 'A': 9, 'C': 11, 'B': 10, 'E': 13, 'D': 12, 'G': 15, 'F': 14, 'H': 16, 'K': 18, 'J': 17, 'M': 20, 'L': 19, 'N': 21, 'Q': 23, 'P': 22, 'S': 25, 'R': 24, 'U': 27, 'T': 26, 'W': 29, 'V': 28, 'Y': 31, 'X': 30, 'Z': 32, 'a': 33, 'c': 35, 'b': 34, 'e': 37, 'd': 36, 'g': 39, 'f': 38, 'i': 41, 'h': 40, 'k': 43, 'j': 42, 'm': 44, 'o': 46, 'n': 45, 'q': 48, 'p': 47, 's': 50, 'r': 49, 'u': 52, 't': 51, 'w': 54, 'v': 53, 'y': 56, 'x': 55, 'z': 57})

Base-58 decodes a sequence of bytes into raw bytes. Whitespace is ignored.

Parameters:
  • encoded – Base-58 encoded bytes.
  • base_bytes – (Internal) The character set to use. Defaults to ASCII58_BYTES that uses natural ASCII order.
  • base_ords – (Internal) Character-to-ordinal lookup table for the specified character set.
Returns:

Raw bytes.

synopsis: Routines for converting between integers and bytes.
module: mom.codec.integer

Number-bytes conversion

These codecs are “lossy” as they don’t preserve prefixed padding zero bytes. In a more mathematical sense,

g(f(x)) is almost an identity function, but not exactly.

where g is the decoder and f is the encoder.
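
The loss of leading zero bytes can be demonstrated with Python's built-in integer conversions. This sketch uses int.from_bytes and int.to_bytes rather than the functions in this module.

```python
raw = b"\x00\x00\x01\x02"

# Leading zero bytes do not change the numeric value.
number = int.from_bytes(raw, "big")

# Converting back with the minimum length drops the leading zeros.
length = (number.bit_length() + 7) // 8
restored = number.to_bytes(length, "big")

print(number)           # 258
print(restored)         # b'\x01\x02'
print(restored == raw)  # False
```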

mom.codec.integer.bytes_to_uint(raw_bytes)

Converts a series of bytes into an unsigned integer.

Parameters: raw_bytes – Raw bytes (base-256 representation).
Returns: Unsigned integer.

mom.codec.integer.uint_to_bytes(number, fill_size=0, chunk_size=0, overflow=False)

Convert an unsigned integer to bytes (base-256 representation).

Leading zeros are not preserved for positive integers unless a chunk size or a fill size is specified. A single zero byte is returned if the number is 0 and no padding is specified.

When a chunk size or a fill size is specified, the resulting bytes are prefix-padded with zero bytes to satisfy the size. The total size of the number in bytes is either the fill size or an integral multiple of the chunk size.

Parameters:
  • number – Integer value
  • fill_size – The maximum number of bytes with which to represent the integer. Prefix zero padding is added as necessary to satisfy the size. If the number of bytes needed to represent the integer is greater than the fill size, an OverflowError is raised. To suppress this error and allow overflow, you may set the overflow argument to this function to True.
  • chunk_size – If the optional chunk size is given and greater than zero, the resulting sequence of bytes is prefix-padded with zero bytes so that the total number of bytes is a multiple of chunk_size.
  • overflow – False (default). If this is True, no OverflowError will be raised when the fill_size is shorter than the length of the generated byte sequence. Instead the byte sequence will be returned as is.
Returns:

Raw bytes (base-256 representation).

Raises:

OverflowError when a fill size is given and the number takes up more bytes than fit into the block. The error is raised only when the overflow argument is False; otherwise, no error is raised.
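
The padding rules described above can be sketched in plain Python 3. This is an illustrative re-implementation under stated assumptions, not the library's code: the name uint_to_padded_bytes is hypothetical, and letting fill_size take precedence when both sizes are given is an assumption of this sketch.

```python
def uint_to_padded_bytes(number, fill_size=0, chunk_size=0, overflow=False):
    """Hypothetical sketch of the documented padding behaviour."""
    if number < 0:
        raise ValueError("number must be an unsigned integer")

    # Minimal big-endian representation; zero becomes a single zero byte.
    length = max(1, (number.bit_length() + 7) // 8)
    raw = number.to_bytes(length, "big")

    if fill_size > 0:
        # Assumption: fill_size wins if both sizes are supplied.
        if length > fill_size and not overflow:
            raise OverflowError("need %d bytes, fill size is %d"
                                % (length, fill_size))
        # rjust leaves raw untouched when it is already long enough.
        return raw.rjust(fill_size, b"\x00")

    if chunk_size > 0:
        remainder = length % chunk_size
        if remainder:
            # Prefix-pad up to the next multiple of chunk_size.
            raw = b"\x00" * (chunk_size - remainder) + raw
    return raw
```

For example, the integer 258 fits in two bytes, so a fill size of 4 or a chunk size of 4 both yield b"\x00\x00\x01\x02", while a fill size of 1 raises OverflowError unless overflow is True.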

synopsis:More portable JSON encoding and decoding routines.
module:mom.codec.json
mom.codec.json.json_encode(obj)

Encodes a Python value into its equivalent JSON string.

JSON permits but does not require forward slashes to be escaped. Escaping them is useful when JSON data is emitted in a <script> tag in HTML, as it prevents </script> tags from prematurely terminating the JavaScript. Some JSON libraries do this escaping by default, although Python’s standard library does not, so we do it here.

See:http://stackoverflow.com/questions/1580647/json-why-are-forward-slashes-escaped
Parameters:obj – Python value.
Returns:JSON string.
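
A minimal sketch of the escaping idea, using only the standard library json module. The helper name encode_for_script_tag is hypothetical, and replacing every forward slash after serialization is an assumption about how such escaping could be done, not a description of how json_encode is implemented.

```python
import json


def encode_for_script_tag(obj):
    # In serialized JSON a "/" can only occur inside a string, and "\/" is a
    # valid JSON escape for "/", so this keeps the decoded value unchanged.
    return json.dumps(obj).replace("/", "\\/")


payload = {"html": "</script><b>hi</b>"}
encoded = encode_for_script_tag(payload)
decoded = json.loads(encoded)
```

The encoded text no longer contains the literal sequence </script>, yet decoding it gives back the original value.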
mom.codec.json.json_decode(encoded)

Decodes a JSON string into its equivalent Python value.

Parameters:encoded – JSON string.
Returns:Decoded Python value.
synopsis:Common functions for text encodings.
module:mom.codec.text

Text encoding

"There is no such thing as plain text."
                          - Plain Text.

UTF-8 is one of the many ways in which Unicode strings can be represented as a sequence of bytes. Because UTF-8 is more portable between diverse systems, encode your Unicode strings into UTF-8 bytes before they leave your system, and decode UTF-8 encoded bytes back into Unicode strings before you start working with them in your code (that is, if you know those bytes are UTF-8 encoded).

Terminology
  • The process of encoding is that of converting a Unicode string into a sequence of bytes. The method used for this conversion is also called an encoding:

    Unicode string   -> Encoded bytes
    ---------------------------------
    "深入 Python"     -> b"\xe6\xb7\xb1\xe5\x85\xa5 Python"
    

    The encoding (method) used to encode in this example is UTF-8.

  • The process of decoding is that of converting a sequence of bytes into a Unicode string:

    Encoded bytes                      -> Unicode string
    ----------------------------------------------------
    b"\xe6\xb7\xb1\xe5\x85\xa5 Python" -> "深入 Python"
    

    The encoding (method) used to decode in this example is UTF-8.

A very crude explanation of when to use what

Essentially, inside your own system, work with:

"深入 Python"

and not:

b"\xe6\xb7\xb1\xe5\x85\xa5 Python"

but when sending things out to other systems that may not see “深入 Python” the way Python does, you encode it into UTF-8 bytes:

b"\xe6\xb7\xb1\xe5\x85\xa5 Python"

and tell those systems that you’re using UTF-8 to encode your Unicode strings so that those systems can decode the bytes you sent appropriately.

When receiving text from other systems, ask for their encodings. Decode the text using the appropriate encoding method as soon as you receive it and then operate on the resulting Unicode text.
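
The boundary rule above can be shown with Python 3's built-in str.encode and bytes.decode (a sketch that does not use this module's helpers):

```python
text = "深入 Python"

# Encode at the boundary, just before the text leaves the system.
wire_bytes = text.encode("utf-8")

# The receiver decodes with the same, agreed-upon encoding.
received = wire_bytes.decode("utf-8")

# Decoding with a different encoding does not fail here; it silently
# produces different text, which is why the encoding must be communicated.
garbled = wire_bytes.decode("latin-1")
```

Here received equals the original text, while garbled does not.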

Read these before you begin to use these functions
  1. http://www.joelonsoftware.com/articles/Unicode.html
  2. http://diveintopython3.org/strings.html
  3. http://docs.python.org/howto/unicode.html
  4. http://docs.python.org/library/codecs.html
mom.codec.text.utf8_encode(unicode_text)

UTF-8 encodes a Unicode string into bytes; bytes and None are left alone.

Work with Unicode strings in your code and encode your Unicode strings into UTF-8 before they leave your system.

Parameters:unicode_text – If already a byte string or None, it is returned unchanged. Otherwise it must be a Unicode string and is encoded as UTF-8 bytes.
Returns:UTF-8 encoded bytes.
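
The pass-through behaviour described above might look like the following Python 3 sketch. The name utf8_encode_sketch is hypothetical, and raising TypeError for other types is an assumption; the documentation only states that the argument must otherwise be a Unicode string.

```python
def utf8_encode_sketch(value):
    # Bytes and None are returned unchanged.
    if value is None or isinstance(value, bytes):
        return value
    # Unicode strings are encoded as UTF-8.
    if isinstance(value, str):
        return value.encode("utf-8")
    # Assumption: anything else is rejected.
    raise TypeError("expected str, bytes, or None")
```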
mom.codec.text.utf8_decode(utf8_encoded_bytes)

Decodes bytes into a Unicode string using the UTF-8 encoding.

Decode your UTF-8 encoded bytes into Unicode strings as soon as they arrive in your system. Work with Unicode strings in your code.

Parameters:utf8_encoded_bytes – UTF-8 encoded bytes.
Returns:Unicode string.
mom.codec.text.utf8_encode_if_unicode(obj)

UTF-8 encodes the object only if it is a Unicode string.

Parameters:obj – The value that will be UTF-8 encoded if it is a Unicode string.
Returns:UTF-8 encoded bytes if the argument is a Unicode string; otherwise the value is returned unchanged.
mom.codec.text.utf8_decode_if_bytes(obj)

Decodes UTF-8 encoded bytes into a Unicode string.

Parameters:obj – Python object. If this is a bytes instance, it will be decoded into a Unicode string; otherwise, it will be left alone.
Returns:Unicode string if the argument is a bytes instance; the unchanged object otherwise.
mom.codec.text.utf8_encode_recursive(obj)

Walks a simple data structure, converting Unicode strings to UTF-8 encoded byte strings.

Supports lists, tuples, and dictionaries.

Parameters:obj – The Python data structure to walk recursively looking for Unicode strings.
Returns:obj with all the Unicode strings converted to byte strings.
mom.codec.text.utf8_decode_recursive(obj)

Walks a simple data structure, converting bytes to Unicode strings.

Supports lists, tuples, and dictionaries.

Parameters:obj – The Python data structure to walk recursively looking for byte strings.
Returns:obj with all the byte strings converted to Unicode strings.
mom.codec.text.bytes_to_unicode(raw_bytes, encoding='utf-8')

Converts bytes to a Unicode string, decoding them according to the specified encoding.

Parameters:
  • raw_bytes – If already a Unicode string or None, it is returned unchanged. Otherwise it must be a byte string.
  • encoding – The encoding used to decode bytes. Defaults to UTF-8.
mom.codec.text.bytes_to_unicode_recursive(obj, encoding='utf-8')

Walks a simple data structure, converting byte strings to Unicode strings.

Supports lists, tuples, and dictionaries.

Parameters:
  • obj – The Python data structure to walk recursively looking for byte strings.
  • encoding – The encoding to use when decoding the byte string into Unicode. Default UTF-8.
Returns:

obj with all the byte strings converted to Unicode strings.
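
A recursive walk of this kind can be sketched as follows in Python 3. The function name decode_recursive_sketch is hypothetical, and converting dictionary keys as well as values is an assumption of this sketch, not a documented guarantee.

```python
def decode_recursive_sketch(obj, encoding="utf-8"):
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    if isinstance(obj, dict):
        # Assumption: both keys and values are converted.
        return {
            decode_recursive_sketch(key, encoding):
                decode_recursive_sketch(value, encoding)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [decode_recursive_sketch(item, encoding) for item in obj]
    if isinstance(obj, tuple):
        return tuple(decode_recursive_sketch(item, encoding) for item in obj)
    # Anything else (numbers, None, existing Unicode strings) is left alone.
    return obj


data = {b"name": [b"\xe6\xb7\xb1\xe5\x85\xa5", (b"Python", 3)], "n": None}
decoded = decode_recursive_sketch(data)
```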

mom.codec.text.to_unicode_if_bytes(obj, encoding='utf-8')

Decodes encoded bytes into a Unicode string.

Parameters:
  • obj – The value that will be converted to a Unicode string.
  • encoding – The encoding used to decode bytes. Defaults to UTF-8.
Returns:

Unicode string if the argument is a byte string. Otherwise the value is returned unchanged.

mom.codec.text.ascii_encode(obj)

Encodes a string using ASCII encoding.

Parameters:obj – String to encode.
Returns:ASCII-encoded bytes.
mom.codec.text.latin1_encode(obj)

Encodes a string using LATIN-1 encoding.

Parameters:obj – String to encode.
Returns:LATIN-1 encoded bytes.
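
The difference between the two encodings can be seen with Python 3's built-in codecs. This sketch shows the behaviour of str.encode only; whether the functions above handle unencodable characters the same way is not stated here.

```python
text = "café"

# LATIN-1 covers code points 0-255, so "é" (U+00E9) becomes the byte 0xE9.
latin1_bytes = text.encode("latin-1")

# ASCII covers only code points 0-127, so the same text cannot be encoded.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```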
synopsis:Routines used by ASCII-based base converters.
module:mom.codec._base
mom.codec._base.base_encode(raw_bytes, base, base_bytes, base_zero, padding=True)

Encodes raw bytes given a base.

Parameters:
  • raw_bytes – Raw bytes to encode.
  • base – Unsigned integer base.
  • base_bytes – The ASCII bytes used in the encoded string. “Character set” or “alphabet”.
  • base_zero
mom.codec._base.base_decode(encoded, base, base_ords, base_zero, powers)

Decode from base to base 256.

mom.codec._base.base_to_uint(encoded, base, ord_lookup_table, powers)

Decodes bytes from the given base into a big integer.

Parameters:
  • encoded – Encoded bytes.
  • base – The base to use.
  • ord_lookup_table – The ordinal lookup table to use.
  • powers – Pre-computed tuple of powers of length powers_length.
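
Decoding from an arbitrary base into a big integer can be sketched with Horner's method. This is an illustration under stated assumptions, not the library's implementation: the name base_to_uint_sketch is hypothetical, the pre-computed powers argument is omitted, and the alphabet below is one commonly used base-58 character set chosen only for the example.

```python
# Example alphabet (assumption): 58 characters, omitting 0, I, O, and l.
ALPHABET = b"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

# Maps each byte value in the alphabet to its digit value.
ORD_LOOKUP = {byte: index for index, byte in enumerate(ALPHABET)}


def base_to_uint_sketch(encoded, base, ord_lookup_table):
    number = 0
    # Iterating over bytes yields integers in Python 3.
    for byte in encoded:
        # Shift the accumulated value one digit left, then add the new digit.
        number = number * base + ord_lookup_table[byte]
    return number


value = base_to_uint_sketch(b"2z", 58, ORD_LOOKUP)
```

With this alphabet, "2" has digit value 1 and "z" has digit value 57, so b"2z" decodes to 1 × 58 + 57 = 115.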
mom.codec._base.uint_to_base256(number, encoded, base_zero)

Convert uint to base 256.