Examining the fixed point in base64

A recent set of internet postings noticed that base64 has a "fixed point." That is, there's a string of text that, when you base64 encode it, gives you itself back out.

http://en.wikipedia.org/wiki/Fixed_point_(mathematics) http://fmota.eu/blog/base64-fixed-point.html http://www.reddit.com/r/compsci/comments/18234a/the_base64_encoder_has_a_fixed_point/c8b1axm

This is kind of fascinating, and leads to all sorts of next questions. I'm going to look at a few of them here.

Is the fixed point an attractor?

One natural question is "Is the fixed point an attractor?" That is, if you start with any arbitrary input and base64 encode it over and over, will it converge to this fixed point? This has been addressed by some people, but there's still room for a more in-depth explanation of why this is the case.

The answer is yes, given the base64 alphabets commonly used. We can demonstrate this by showing that any string which begins with an ASCII-encoded base64 symbol will result in an output beginning with 'V' within a few applications of base64 to the string. Once we demonstrate that, we know that any arbitrary data will be in the base64 alphabet after the first encoding, so any arbitrary input will fit into this argument as well.

The diagram below shows the sequence of symbols the string will start with, after repeated rounds of base64 encoding. We see that , after only 4 iterations, the symbols have converged on 'V'.

The path to 'V' as first character of fixpoint

This shows that the encoder will reach 'V' as the first character, no matter what the input. Once the first byte is fixed, the question is then: will the second byte be similarly fixed? At that point, the fourth byte is deterministic. From there, we need to show that the fifth byte is deterministic no matter what the fourth byte is, and so on, until we've hit the full period.

The path to 'm' as second character of fixpoint

Is the base64 fixed point periodic?

One question that comes up pretty quickly is "Is the base64 fixed point periodic, or does it go on forever?" After a short amount of time, it's clear that it must be periodic, because there are only 64*64*4 states available to the encoder, and every one of those has a deterministic next state, so it has a maximum period of 26+6+2 = 16,384 symbols.

We find this by breaking down the encoding operation into a sequence of operations:
EMIT(symbol, carry, offset) → symbol
CARRY(symbol, offset) → carry
OFFSET(offset) → offset
or
STEP(input, (cursor, carry, offset)) → (output, (cursor, carry, offset))

We can think of the base64 encoding operation as consuming 24 bit blocks of input and yielding 32 bit blocks of output. We'll work through an example; refer to the table below as you read through this.

Let's start with an input of "ABC" and encode it. We begin by lining all 24 of the bits up, end to end. We then work our way down this set of bits in chunks of six, encoding those to a symbol. The first symbol comes from the highest six bits of 'A' and doing a lookup in the base64 table. Thus, we emit a 'Q' but are left with the two lower bits of 'A' to "carry" into the next step (0b01). In this step, we take the two low bits of 'A' as the two high bits of our output, and then take the four high bits of B as the four low bits of our output. This results in a U, with four bits of carry into the next step (0b0010). We then take the two high bits of 'C' as the low two bits of our output, which results in the symbol 'J' as output. Then, we have simply the low six bits of 'C' as our final symbol, which results in 'D'.

Input A B C
Bits 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1
Value 16 20 9 3
Output Q U J D
The process of base64 encoding, laid out

There's some formalization to complete here, but if you imagine it as a state machine that's taking ASCII-encoded base64 symbols as input and emitting symbols as output and moving a cursor, it's pretty straightforward to get it down to a state of < symbol under cursor, last symbol, current encoding state > Once you have this, it's easy to see that there are only 64×64×4 states.