Sowhat as a tribute to this rather foolish letter.
---
This will be an article about how I optimized the md5 algorithm in Mono and the Microsoft .NET Framework. I haven't completed it yet, but there is quite a bit about optimization in my blog. Check out all the mentions of md5 in my blog for some pointers.
If you need my work on this right now, raw and unpolished - please feel free to email me at tobias at hertkorn dot com. Or to get some hand-on example, go to my post about the optimized Md5CryptoServiceProvider in mono.
Optimizing the md5 algo on little endian machines
This decoding step is a major part within the md5 algo. It translates a 64 byte array into a 16 uint array. The general endian-safe step looks like this:
- for (i = 0; i <16; i++)
- {
- buff[i] = (uint)(inputBuffer[inputOffset + 4 * i])
- | (((uint)(inputBuffer[inputOffset + 4 * i + 1])) <<8 )
- | (((uint)(inputBuffer[inputOffset + 4 * i + 2])) <<16)
- | (((uint)(inputBuffer[inputOffset + 4 * i + 3])) <<24);
- }
If this step is performed on a little endian machine, the bytes are already safed in memory in the right order and there is no need for all the bitlogic. Therefore optimizing it using unsafe code and direct copying 4 consecutive bytes as one uint gives a huge speed increase:
- unsafe
- {
- fixed (byte* bFixed = inputBuffer)
- {
- byte* inputPointer = bFixed + inputOffset;
- for (i = 0; i <16; i++)
- {
- buff[i] = *(uint*)(inputPointer);
- inputPointer += 4;
- }
- }
- }
As we are already in unsafe context using an additional pointer for buff gives an additional speed increase:
- unsafe
- {
- fixed (byte* bFixed = inputBuffer)
- {
- fixed (uint* aFixed = buff)
- {
- byte* inputPointer = bFixed + inputOffset;
- uint* bufferPointer = aFixed;
- for (i = 0; i <16; i++)
- {
- *bufferPointer = *(uint*)(inputPointer);
- inputPointer += 4;
- bufferPointer++;
- }
- }
- }
- }
Be sure to still make this Md5 implementation endian safe by introducing a if:
- if (BitConverter.IsLittleEndian)
- {
- // unsafe decode
- }
- else
- {
- // bitshifting decode
- }
Avoiding unnecessary stloc and ldloc ops
If possible do as much arithmetic as possible in one go. For example:
- b += (((d ^ a) & c) ^ a) + (uint)K[7] + buff[7];
- b = (b <<22) | (b>> 10);
- b += c;
This b += c is not smart to use in this situation and actually produces a performance hit. Better would be:
- b += (((d ^ a) & c) ^ a) + (uint)K[7] + buff[7];
- b = ((b <<22) | (b>> 10)) + c;
Altering the whole segment like this in the md5 algorithm adds up to a 10-15% performance increase!
Let's look at the IL:
- ldloc.1
- ldloc.3
- ldloc.0
- xor
- ldloc.2
- and
- ldloc.0
- xor
- ldsfld unsigned int32[] MD5CryptoServiceProviderMonoOrig::K
- ldc.i4.7
- ldelem.u4
- add
- ldarg.0
- ldfld unsigned int32[] MD5CryptoServiceProviderMonoOrig::buff
- ldc.i4.7
- ldelem.u4
- add
- add
- stloc.1
- ldloc.1
- ldc.i4.s 22
- shl
- ldloc.1
- ldc.i4.s 10
- shr.un
- or
- stloc.1 // stores b
- ldloc.1 // loads b
- ldloc.2
- add
- stloc.1
the second version's:
- ldloc.1
- ldloc.3
- ldloc.0
- xor
- ldloc.2
- and
- ldloc.0
- xor
- ldsfld unsigned int32[] MD5AddOptimization3::K
- ldc.i4.7
- ldelem.u4
- add
- ldarg.0
- ldfld unsigned int32[] MD5AddOptimization3::buff
- ldc.i4.7
- ldelem.u4
- add
- add
- stloc.1
- ldloc.1
- ldc.i4.s 22
- shl
- ldloc.1
- ldc.i4.s 10
- shr.un
- or
- // store, load eliminated
- ldloc.2
- add
- stloc.1
