One RGBA pixel is 32 bits, so a 2x2 block of pixels is 128 bits.
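Each max pool index stores 2 bits of data (which of the 4 positions in a 2x2 window held the max). The first convolutional block has 64 channels.
So there are 2*64 = 128 bits of data in the max pooling indices for that block, per 2x2 patch of input, the same number of bits as the pixels themselves. Those get passed straight to the end of the network.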
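
To make the arithmetic above easy to check, here is a rough sketch in plain Python. It assumes 8 bits per RGBA channel and 2x2 pooling with stride 2 applied at input resolution; neither is stated explicitly above, and the constant names are just for illustration.

```python
import math

# Assumptions (not stated in the original): 8 bits per RGBA channel,
# and 2x2 max pooling with stride 2 at input resolution.
BITS_PER_CHANNEL = 8
RGBA_CHANNELS = 4
POOL_WINDOW = 2            # side length of the pooling window
FIRST_BLOCK_CHANNELS = 64  # channels in the first convolutional block

# Bits in the raw input covered by one pooling window.
pixel_bits = BITS_PER_CHANNEL * RGBA_CHANNELS        # 32
patch_bits = pixel_bits * POOL_WINDOW * POOL_WINDOW  # 128

# An index picks one of POOL_WINDOW**2 positions, so it needs log2(4) = 2 bits.
bits_per_index = int(math.log2(POOL_WINDOW * POOL_WINDOW))

# One index per channel for each pooling window.
index_bits = bits_per_index * FIRST_BLOCK_CHANNELS   # 128

print(patch_bits, index_bits)
```

Under those assumptions the two counts come out equal, so in principle the indices alone have enough capacity to carry the whole input patch.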