tree: cacc94914a7375fa84977fc8d23f5d9853f94897 [path history] [tgz]

- README.md
- common_crc32.wuffs
- common_up_arm_crc32.wuffs
- common_up_x86_avx2.wuffs
- common_up_x86_sse42.wuffs

std/crc32/README.md

CRC-32 (Cyclic Redundancy Check 32) is a checksum algorithm that hashes byte sequences to 32 bit values.

The algorithm is based on polynomial division. In theory, a variety of polynomials can be used. In practice, only two are widely used. The IEEE polynomial is used by Bzip2, Ethernet (IEEE 802.3), Gzip, MPEG-2, PNG, SATA, Zip and other formats. The Castagnoli polynomial is used by Btrfs, Ext4, iSCSI, SCTP and other formats.

In arithmetic, two numbers `A`

and `B`

can be divided to yield a quotient `Q`

and a remainder `R`

, so that `A = (B × Q) + R`

. Two polynomials can similarly be divided, yielding quotient and remainder polynomials. This remains true when restricting polynomial coefficients to the Galois Field GF(2), equivalent to the integers modulo 2, i.e. the integers 0 and 1. Addition and multiplication are equivalent to XOR and AND, so that `foo + foo = 0`

and `foo + 0 = foo`

. For example,

((x⁴ + x¹ + x⁰) × (x⁴ + x⁰)) + (x¹ + x⁰) = (x⁸ + x⁵ + x⁴) + (x⁴ + x¹ + x⁰) + (x¹ + x⁰) = x⁸ + x⁵ + (x⁴ + x⁴) + (x¹ + x¹) + (x⁰ + x⁰) = x⁸ + x⁵ + 0 + 0 + 0 = x⁸ + x⁵

Therefore, dividing `A(x) = x⁸ + x⁵`

by `B(x) = x⁴ + x¹ + x⁰`

yields a remainder of `R(x) = x¹ + x⁰`

. Such polynomials can be represented as binary numbers: `A`

is `0b100100000`

, `B`

is `0b10011`

and `R`

is `0b11`

.

Polynomial division over GF(2) can be efficiently implemented as a sequence of XORs and shifts. Extending the example above (with a divisor polynomial `B`

of degree `N`

equal to 4), an input `0b10010`

is first padded with `N`

trailing zeroes to give the initial state: the `A`

value `0b100100000`

.

The algorithm loops, alternating between shifting a copy of the `B`

polynomial so that its left-most 1 bit aligns with the left-most 1 bit of the state, and XOR-ing those two values, producing a new state whose left-most 1 bit is further to right. The loop continues until no such shifted `B`

can be found, or in other words, all of the state's bits to the left of the right-most `N`

bits are all zero. What remains is the remainder polynomial, `R`

.

input 00010010 pad 00010010 0000 divisor 10011 xor 00000001 0000 divisor 1 0011 xor 00000000 0011 remain 0011

This `B`

divisor polynomial, `0b10011`

, is also known as the CRC-4-ITU polynomial. The `4`

in “CRC-4-ITU” is the `N`

in “an N bit CRC”. The polynomial has `N+1`

bits, and the hash value or remainder `0b0011`

(what remains after dividing a longer padded-input polynomial by the shorter hash polynomial) has `N`

bits.

Consider that CRC-4-ITU polynomial again: (x⁴ + x¹ + x⁰) or `0b10011`

in binary. The high bit of an `N`

-degree polynomial is always 1, so an equivalent representation of that polynomial is `N=4, MSB, BITS=0b0011`

. This is the “normal” or Most Significant Bit first order. Another equivalent representation is to visit the bits right-to-left: `N=4, LSB, BITS=0b1100`

. This is the “reversed” or Least Significant Bit first order. The binary number `0b1100`

in hexadecimal is `0xC`

, so yet another equivalent representation of that polynomial is `N=4, LSB, BITS=0xC`

. If `N`

and `LSB`

-ness is agreed beforehand, `0xC`

is all you need to specify the polynomial.

For 32 bit CRCs, the two popular polynomials (presented in `LSB`

order) are `0xEDB8_8320`

and `0x82F6_3B78`

, also called the IEEE and Castagnoli polynomials, also called CRC-32 and CRC-32C. For example, the bit string representation of the IEEE polynomial `0xEDB8_8320`

, un-reversed and with the implicit high bit, is `0b1_00000100_11000001_00011101_10110111 = 0x1_04C1_1DB7`

.

A worked example for calculating the CRC-32 hash of the three byte input “Hi\n” (equivalently, “\x48\x69\x0A” but note in the work below's input that the bits within each byte are reversed to be LSB-first) proceeds similarly to the simpler worked example for the CRC-4-ITU hash, above, with two additional inversion steps, described below.

Both the CRC-4-ITU and the CRC-32 worked examples are output by the `script/print-crc32-example.go`

program.

input 00010010 10010110 01010000 pad 00010010 10010110 01010000 00000000 00000000 00000000 00000000 invert 11101101 01101001 10101111 11111111 00000000 00000000 00000000 divisor 10000010 01100000 10001110 11011011 1 xor 01101111 00001001 00100001 00100100 10000000 00000000 00000000 divisor 1000001 00110000 01000111 01101101 11 xor 00101110 00111001 01100110 01001001 01000000 00000000 00000000 divisor 100000 10011000 00100011 10110110 111 xor 00001110 10100001 01000101 11111111 10100000 00000000 00000000 divisor 1000 00100110 00001000 11101101 10111 xor 00000110 10000111 01001101 00010010 00011000 00000000 00000000 divisor 100 00010011 00000100 01110110 110111 xor 00000010 10010100 01001001 01100100 11000100 00000000 00000000 divisor 10 00001001 10000010 00111011 0110111 xor 00000000 10011101 11001011 01011111 10101010 00000000 00000000 divisor 10000010 01100000 10001110 11011011 1 xor 00000000 00011111 10101011 11010001 01110001 10000000 00000000 divisor 10000 01001100 00010001 11011011 0111 xor 00000000 00001111 11100111 11000000 10101010 11110000 00000000 divisor 1000 00100110 00001000 11101101 10111 xor 00000000 00000111 11000001 11001000 01000111 01001000 00000000 divisor 100 00010011 00000100 01110110 110111 xor 00000000 00000011 11010010 11001100 00110001 10010100 00000000 divisor 10 00001001 10000010 00111011 0110111 xor 00000000 00000001 11011011 01001110 00001010 11111010 00000000 divisor 1 00000100 11000001 00011101 10110111 xor 00000000 00000000 11011111 10001111 00010111 01001101 00000000 divisor 10000010 01100000 10001110 11011011 1 xor 00000000 00000000 01011101 11101111 10011001 10010110 10000000 divisor 1000001 00110000 01000111 01101101 11 xor 00000000 00000000 00011100 11011111 11011110 11111011 01000000 divisor 10000 01001100 00010001 11011011 0111 xor 00000000 00000000 00001100 10010011 11001111 00100000 00110000 divisor 1000 00100110 00001000 11101101 10111 xor 00000000 00000000 00000100 10110101 11000111 11001101 10001000 divisor 100 00010011 00000100 01110110 110111 xor 00000000 00000000 00000000 10100110 11000011 10111011 01010100 remain 10100110 11000011 10111011 01010100 invert 01011001 00111100 01000100 10101011 hex A 9 C 3 2 2 5 D

The final line says that the CRC-32 checksum of “Hi\n” is `0xD522_3C9A`

. This can be verified by running the `/usr/bin/crc32`

program:

$ echo Hi | xxd /dev/stdin 00000000: 4869 0a Hi. $ echo Hi | crc32 /dev/stdin d5223c9a

An `A`

value of `0b100100000`

as a mathematical concept (whether as a binary number of as a polynomial) is unchanged by adding leading 0 bits. Thus, the CRC algorithm in its basic form will compute the same hash value for both some string `s`

and another string that is multiple “\x00” bytes followed by `s`

.

It's easy for network errors or other corruption to introduce multiple “\x00” bytes, and a good checksum should be able detect that. To do so, CRC as used in practice inverts (applies a bitwise NOT to) the first `N`

bits of the padded input, just before the divisor-xor loop of the algorithm.

Similarly but independently of that, the same issue (in a more limited way) can occur with trailing “\x00” bytes. The same trick addresses that, this time inverting the last `N`

bits, just after the divisor-xor loop.

While the two inversions are notionally independent, and it would be possible to implement a CRC flavor that inverted only one side, inverting on both sides results in a nice decomposability property. Calculating the CRC hash of the concatenation of two strings, `s+t`

, can be computed by first hashing `s`

, then hashing `t`

starting with an initial state equal to that hash instead of equal to zero. The second inversion of the `s`

computation cancels out the first inversion of the `t`

computation.

In theory, the mathematics of the CRC algorithm works in terms of bit streams. In practice, the computation of the CRC hash value works with byte streams: 8 (or more) bits at a time. It is faster to do so because CPU and RAM fundamentally work with bytes (or words) instead of bits, and because the bit-by-bit loop (over 8 successive 0 or 1 input bits) can be implemented as a byte-by-byte loop involving a single lookup into a 256-entry table, yielding a 32-bit value to XOR with the cumulative state. For well known polynomials, such as the IEEE and Castagnoli polynomials, these lookup tables can be calculated beforehand, and hard-coded at compile time. A uint32 `state`

variable can be updated by the simple algorithm (with `invert`

meaning `bitwise_not`

):

state = invert(state) for each input byte x { state = table[x ^ (state & 0xFF)] ^ (state >> 8) } state = invert(state)

Better performance can be obtained by processing M bytes at a time, e.g. for an M of 4, 8 or 16. Even at only 4 bytes, a naive implementation would require a more-than-4-billion (256 × 256 × 256 × 256) entry lookup table, which is impractical. A cleverer algorithm can process 4 bytes at a time using (256 + 256 + 256 + 256) entries. See A Systematic Approach to Building High Performance, Software-based, CRC Generators by Kounavis and Berry of Intel Corporation. Stephan Brumme's CRC-32 page also has some more discussion and code examples.

This slicing-by-M algorithm is still applicable even when the input isn't an exact multiple of M bytes. The bulk of the input is processed M bytes at a time, and the remainder is then processed 1 byte at a time.

For M greater than 4, the first 4 bytes of each slice are treated in one way, since the state (a uint32) is 4 bytes long. The remaining M-4 bytes are treated a second way. See the actual code for details.

Even better performance can be obtained through CPU-specific SIMD instructions. See Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction by Gopal, Ozturk, Guilford, Wolrich, Feghali and Dixon of Intel Corporation and Karakoyunlu of the Worcester Polytechnic Institute.

See a couple of Wikipedia articles: