C# is not (yet) a high perf language

by Tobias Hertkorn on August 2nd, 2009

Disclamer: The lcg used in this article has drawbacks but it was sufficiently random for my purposes. Please be sure to check if that assumption holds for you as well before using!

Well, this won't come to you as a shocker - but C# is not a high performance language... No, wrong, the C# compiler is not primarily optimized for performance. I noticed it while playing around with a linear congruential generator, a fast and lightweight pseudo random generator.

The lcg is in essence this code fragment:

C#:
  1. PRN = (A * PRN + C) % M; // [0;M[

which generates the next pseudo random number (PRN) which is a number between 0 and M (0 inclusive, M exclusive), using carefully chosen constants A, C and M.

In order to generate a double between 0 and max the naive way to do so is like this:

C#:
  1. private static double RandomLcg(double max)
  2. {
  3.    PRN = (A * PRN + C) % M; // [0;M[
  4.    return max * PRN / M;
  5. }

Now what would you say, if I told you that this optimization, no not optimization - reordering,

C#:
  1. private static double RandomLcg(double max)
  2. {
  3.    PRN = (PRN * A + C) % M; // [0;M[
  4.    return max * (PRN * (1.0 / M));
  5. }

is almost TWICE at fast as the original? Curious, isn't it?!

The fun part is - manually inlining the PRN write and access gives another 10% improvement. So the final version

C#:
  1. private static double RandomLcg(double max)
  2. {
  3.    return max * ((PRN = (PRN * A + C) % M) * (1.0 / M));
  4. }

generates 1,000,000,000 pseudo numbers in 11,4 seconds on my machine - the original version needs 20,7 seconds.

This kind of performance improvement is not uncommon in any language, but especially the manual inlining surprised me. I guess it is a sign that the compiler team is - understandibly - focusing on other things than extracting the last bit of performance from C#. After all, who needs to regularly generate a billion PRNs using C#?

Download source for crude test program.

Post to Twitter Tweet this

August 2nd, 2009 6:57 pm | Comments (3)

Optimizing heavy arithmetics in Mono

by Tobias Hertkorn on June 29th, 2009

Sowhat as a tribute to this rather foolish letter.

---

This will be an article about how I optimized the md5 algorithm in Mono and the Microsoft .NET Framework. I haven't completed it yet, but there is quite a bit about optimization in my blog. Check out all the mentions of md5 in my blog for some pointers.

If you need my work on this right now, raw and unpolished - please feel free to email me at tobias at hertkorn dot com. Or to get some hand-on example, go to my post about the optimized Md5CryptoServiceProvider in mono.

Optimizing the md5 algo on little endian machines

This decoding step is a major part within the md5 algo. It translates a 64 byte array into a 16 uint array. The general endian-safe step looks like this:

C#:
  1. for (i = 0; i <16; i++)
  2. {
  3.   buff[i] = (uint)(inputBuffer[inputOffset + 4 * i])
  4.     | (((uint)(inputBuffer[inputOffset + 4 * i + 1])) <<8 )
  5.     | (((uint)(inputBuffer[inputOffset + 4 * i + 2])) <<16)
  6.     | (((uint)(inputBuffer[inputOffset + 4 * i + 3])) <<24);
  7. }

If this step is performed on a little endian machine, the bytes are already safed in memory in the right order and there is no need for all the bitlogic. Therefore optimizing it using unsafe code and direct copying 4 consecutive bytes as one uint gives a huge speed increase:

C#:
  1. unsafe
  2. {
  3.   fixed (byte* bFixed = inputBuffer)
  4.   {
  5.     byte* inputPointer = bFixed + inputOffset;
  6.  
  7.     for (i = 0; i <16; i++)
  8.     {
  9.       buff[i] = *(uint*)(inputPointer);
  10.       inputPointer += 4;
  11.     }
  12.   }
  13. }

As we are already in unsafe context using an additional pointer for buff gives an additional speed increase:

C#:
  1. unsafe
  2. {
  3.   fixed (byte* bFixed = inputBuffer)
  4.   {
  5.     fixed (uint* aFixed = buff)
  6.     {
  7.       byte* inputPointer = bFixed + inputOffset;
  8.       uint* bufferPointer = aFixed;
  9.  
  10.       for (i = 0; i <16; i++)
  11.       {
  12.         *bufferPointer = *(uint*)(inputPointer);
  13.         inputPointer += 4;
  14.         bufferPointer++;
  15.       }
  16.     }
  17.   }
  18. }

Be sure to still make this Md5 implementation endian safe by introducing a if:

C#:
  1. if (BitConverter.IsLittleEndian)
  2. {
  3.   // unsafe decode
  4. }
  5. else
  6. {
  7.   // bitshifting decode
  8. }

Avoiding unnecessary stloc and ldloc ops

If possible do as much arithmetic as possible in one go. For example:

C#:
  1. b += (((d ^ a) & c) ^ a) + (uint)K[7] + buff[7];
  2. b = (b <<22) | (b>> 10);
  3. b += c;

This b += c is not smart to use in this situation and actually produces a performance hit. Better would be:

C#:
  1. b += (((d ^ a) & c) ^ a) + (uint)K[7] + buff[7];
  2. b = ((b <<22) | (b>> 10)) + c;

Altering the whole segment like this in the md5 algorithm adds up to a 10-15% performance increase!

Let's look at the IL:

IL:
  1. ldloc.1
  2. ldloc.3
  3. ldloc.0
  4. xor
  5. ldloc.2
  6. and
  7. ldloc.0
  8. xor
  9. ldsfld unsigned int32[] MD5CryptoServiceProviderMonoOrig::K
  10. ldc.i4.7
  11. ldelem.u4
  12. add
  13. ldarg.0
  14. ldfld unsigned int32[] MD5CryptoServiceProviderMonoOrig::buff
  15. ldc.i4.7
  16. ldelem.u4
  17. add
  18. add
  19. stloc.1
  20. ldloc.1
  21. ldc.i4.s 22
  22. shl
  23. ldloc.1
  24. ldc.i4.s 10
  25. shr.un
  26. or
  27. stloc.1  // stores b
  28. ldloc.1  // loads b
  29. ldloc.2
  30. add
  31. stloc.1

the second version's:

IL:
  1. ldloc.1
  2. ldloc.3
  3. ldloc.0
  4. xor
  5. ldloc.2
  6. and
  7. ldloc.0
  8. xor
  9. ldsfld unsigned int32[] MD5AddOptimization3::K
  10. ldc.i4.7
  11. ldelem.u4
  12. add
  13. ldarg.0
  14. ldfld unsigned int32[] MD5AddOptimization3::buff
  15. ldc.i4.7
  16. ldelem.u4
  17. add
  18. add
  19. stloc.1
  20. ldloc.1
  21. ldc.i4.s 22
  22. shl
  23. ldloc.1
  24. ldc.i4.s 10
  25. shr.un
  26. or
  27.                // store, load eliminated
  28. ldloc.2
  29. add
  30. stloc.1

Post to Twitter Tweet this

June 29th, 2009 8:38 pm | Comments (0)

Stylishly improving Silverlight Toolkit Chart’s Performance

by Tobias Hertkorn on June 27th, 2009

TL;TR: Skip all the way down and download the complete style without "Show" storyboard for the Silverlight Toolkit March 2009 release.

I am currently looking at a couple of Silverlight features and extensions. And what's about the first thing one wants to see - GRAPHS. Well, that's me at least - maybe my background in Physics does push me in weird ways most of the times sometimes...

What I found was that the performance of the charts is not that great - when going to more than a couple datapoints ... to say like 1000?! :-D Oh, and of course those datapoints should change all the time, right? Turns out that does kinda have some impact on performance - since every single datapoint has an animation attached to it when it gets shown or hidden.

I am a total noob when it comes to XAML in general and Silverlight in particular - so it was a huge break-through for me to find out that one can change the style of existing controls. Yeah I know. Knocked me out though. So I got the trial of Expressoin Blend 2, did all the steps that were necessary to get to the templates. And found out that now all my datapoints where bright orange. Umm, that's not fun.

So after a bit more of fumbling around I found out that it's all very easy - when one knows how to do it... I know this is one of the "aha, actually he is not that bright" kinda revelation blog posts. ;) Aaaanyway.

The new template for a datapoint (in this particular case a LineDataPoint for LineSeries) that has no fade in translation looks like this:

XML:
  1. <ControlTemplate TargetType='charting:LineDataPoint' x:Key='LineNoTransition'>
  2.   <Grid Opacity='1' x:Name='Root'>
  3.     <vsm:VisualStateManager.VisualStateGroups>
  4.       <vsm:VisualStateGroup x:Name='CommonStates'>
  5.         <vsm:VisualStateGroup.Transitions>
  6.           <vsm:VisualTransition GeneratedDuration='0:0:0.1' />
  7.         </vsm:VisualStateGroup.Transitions>
  8.         <vsm:VisualState x:Name='Normal' />
  9.         <vsm:VisualState x:Name='MouseOver'>
  10.           <Storyboard>
  11.             <ColorAnimationUsingKeyFrames BeginTime='00:00:00' Duration='00:00:00.0010000' Storyboard.TargetName='MouseOverHighlight' Storyboard.TargetProperty='(Shape.Fill).(SolidColorBrush.Color)'>
  12.               <SplineColorKeyFrame KeyTime='00:00:00' Value='#FFFFDF00' />
  13.             </ColorAnimationUsingKeyFrames>
  14.             <DoubleAnimationUsingKeyFrames BeginTime='00:00:00' Duration='00:00:00.0010000' Storyboard.TargetName='MouseOverHighlight' Storyboard.TargetProperty='(UIElement.Opacity)'>
  15.               <SplineDoubleKeyFrame KeyTime='00:00:00' Value='0.24' />
  16.             </DoubleAnimationUsingKeyFrames>
  17.           </Storyboard>
  18.         </vsm:VisualState>
  19.       </vsm:VisualStateGroup>
  20.       <vsm:VisualStateGroup x:Name='SelectionStates'>
  21.         <vsm:VisualStateGroup.Transitions>
  22.           <vsm:VisualTransition GeneratedDuration='0:0:0.1' />
  23.         </vsm:VisualStateGroup.Transitions>
  24.         <vsm:VisualState x:Name='Unselected' />
  25.         <vsm:VisualState x:Name='Selected'>
  26.           <Storyboard>
  27.             <DoubleAnimationUsingKeyFrames BeginTime='00:00:00' Duration='00:00:00.0010000' Storyboard.TargetName='SelectionHighlight' Storyboard.TargetProperty='(UIElement.Opacity)'>
  28.               <SplineDoubleKeyFrame KeyTime='00:00:00' Value='0.18' />
  29.             </DoubleAnimationUsingKeyFrames>
  30.           </Storyboard>
  31.         </vsm:VisualState>
  32.       </vsm:VisualStateGroup>
  33.       <vsm:VisualStateGroup x:Name='RevealStates'>
  34.         <vsm:VisualState x:Name='Shown' />
  35.         <vsm:VisualState x:Name='Hidden' />
  36.       </vsm:VisualStateGroup>
  37.     </vsm:VisualStateManager.VisualStateGroups>
  38.     <ToolTipService.ToolTip>
  39.       <ContentControl Content='{TemplateBinding FormattedDependentValue}' />
  40.     </ToolTipService.ToolTip>
  41.     <Ellipse Fill='{TemplateBinding Background}' Stroke='{TemplateBinding BorderBrush}' />
  42.     <Ellipse RenderTransformOrigin='0.661,0.321'>
  43.       <Ellipse.Fill>
  44.         <RadialGradientBrush GradientOrigin='0.681,0.308'>
  45.           <GradientStop Color='#00FFFFFF' />
  46.           <GradientStop Color='#FF3D3A3A' Offset='1' />
  47.         </RadialGradientBrush>
  48.       </Ellipse.Fill>
  49.     </Ellipse>
  50.     <Ellipse Fill='Red' Opacity='0' x:Name='SelectionHighlight' />
  51.     <Ellipse Fill='White' Opacity='0' x:Name='MouseOverHighlight' />
  52.   </Grid>
  53. </ControlTemplate>

Notice that I set the opacity of the "Root" grid to 1 (it is 0 in the original style) and removed all storyboards from the "Shown" and "Hidden" VisualStates. So now we have a ControlTemplate that shows the datapoint right away (opaycity to 1) and does not need any additional CPU cycles for animations (no storyboards).

In order to be able to attach these animation-less templates to the respective data series we need an additional template:

XML:
  1. <Style TargetType='charting:LineDataPoint' x:Key='LineDataPointStyleBlue'>
  2.   <Setter Property='Background'>
  3.     <Setter.Value>
  4.       <RadialGradientBrush>
  5.         <RadialGradientBrush.RelativeTransform>
  6.           <TransformGroup>
  7.             <ScaleTransform CenterX='0.5' CenterY='0.5' ScaleX='2.09' ScaleY='1.819' />
  8.             <TranslateTransform X='-0.425' Y='-0.486' />
  9.           </TransformGroup>
  10.         </RadialGradientBrush.RelativeTransform>
  11.         <GradientStop Color='#FFB9D6F7' />
  12.         <GradientStop Color='#FF284B70' Offset='1' />
  13.       </RadialGradientBrush>
  14.     </Setter.Value>
  15.   </Setter>
  16.   <Setter Property='BorderBrush' Value='Gray' />
  17.   <Setter Property='BorderThickness' Value='1' />
  18.   <Setter Property='IsTabStop' Value='False' />
  19.   <Setter Property='Width' Value='8' />
  20.   <Setter Property='Height' Value='8' />
  21.   <Setter Property='Template' Value='{StaticResource LineNoTransition}' />
  22. </Style>

Notice that I went to the source of the chart library and copied the original style for "Background" getting rid of the hideous "orange". Plus notice the Value for the Template that points to our new animation-less template.

Last but not least attach the new dataline template to the chart:

XML:
  1. <charting:Chart x:Name="Performance">
  2.    <charting:LineSeries
  3.       x:Name="SleepTimez"
  4.       ItemsSource="{Binding Path=Performance.SleepTimez}"
  5.       Title="T"
  6.       DependentValuePath="Count"
  7.       IndependentValuePath="Time"
  8.       DataPointStyle="{StaticResource LineDataPointStyleBlue}" />
  9.    <charting:LineSeries
  10.       x:Name="ListSleepTimez"
  11.       ItemsSource="{Binding Path=Performance.ListSleepTimez}"
  12.       Title="LT"
  13.       DependentValuePath="Count"
  14.       IndependentValuePath="Time"
  15.       DataPointStyle="{StaticResource LineDataPointStyleRed}" />
  16.    <charting:LineSeries
  17.       x:Name="ChartSleepTimez"
  18.       ItemsSource="{Binding Path=Performance.ChartSleepTimez}"
  19.       Title="CT"
  20.       DependentValuePath="Count"
  21.       IndependentValuePath="Time"
  22.       DataPointStyle="{StaticResource LineDataPointStyleLightGreen}" />
  23. </charting:Chart>

And that's pretty much it. Unfortunatelly I was not able to make the style palette work again - so right now I have to manually choose and attach colors to the various data series. But that's good enough for me right now.

I have not figured out yet how to remove the "hide" transition. It seems as if the removal of datapoints from a graph still trigger an animation, but I don't know why. Any hints?

Download: Complete LineDataPoint style

Post to Twitter Tweet this

June 27th, 2009 12:13 am | Comments (0)

Measure memory consumption of creating object or executing functions in C#

by Tobias Hertkorn on August 25th, 2007

Yesterday Anton Venter from New Zealand contacted me, because he read about my posts about optimization and stuff like that. And he was looking at different sorting and searching strategies he asked me how to measure execution time and memory comsumption of functions in a black box kind of way.

Timing

Well, the execution time part was answered in my previous post on HighPerformanceCounter and a more general article about measuring performance in C#.

To quickly sum those up those two posts, to time a function use something like:

C#:
  1. // run a sort first to get on the fly code gen out of the way
  2. sort.run(myarray);
  3. //now time it
  4. timer.start();
  5. sort.run(mysecondarray);
  6. timer.stop();
  7. console.writeline(timer.elapsedmilliseconds);

Memory Consumption

After browsing the documentation I found out that it is not too hard to measure the memory usage caused by functions and/or instantiation of and object using System.GC.GetTotalMemory. BUT that's of course "just" the amount allocated in managed memory. (So it does NOT give you an idea about how much JITing those methods would cause. Just a side-comment.)

C#:
  1. long m_memoryStart = 0;
  2. long m_memoryMakeList = 0;
  3.  
  4. Thread.MemoryBarrier();
  5. m_memoryStart = System.GC.GetTotalMemory(true);
  6.  
  7. List<int> testList1 = new List<int>();
  8. Thread.MemoryBarrier();
  9. m_memoryMakeList = System.GC.GetTotalMemory(true);
  10.  
  11. int count = testList1.Count;

For a more complete example on how to use that see my sample solution.

I inserted the Thread.MemoryBarrier() in order to absolutely prevent the compiler from reordering statements for optimizations. And the int count = testList1.Count; is there to keep testList1 alive until AFTER the second GetTotalMemory, which is abolutely necessary! Otherwise the difference between the two memory sizes will indeed be zero. :-)

At this point in encourage you to look at my sample. Especially notice how the first execution of Array.Sort does increase memory usage, the second does not.

Should I do timing or memory tests first?

There is no right order in which you should do these. It absolutely depends on what exactly you want to time or measure. But remember. It is best to separate the testing for memory consumption and the timing. Don't try to do both with one execution of a specific function.

Important: To get accurate readings especially for timing always compile your solution to "Release" and run the .exe outside of Visual Studio!

Thanks Anton for contacting me. It was fun thinking about your question. :-)

Memory Consumption Sample Solution

kick it on DotNetKicks.com

Post to Twitter Tweet this

August 25th, 2007 7:33 pm | Comments (8)

How to measure performance of algorithms in .NET

by Tobias Hertkorn on September 6th, 2006

I have come across some poorly written "performance" tests in the last couple of month, persuing my little hobby improving md5 algos for .NET. Here are some basic things to always have in the back of your head while doing performance optimizations and tests in .NET (and other languages):

NEVER assume you know in advance what will be faster or slower

This is the first and probably only RULE in optimization. Don't even assume that unsafe code must be faster. Actually that is quite wrong since the built-in optimizer in Microsoft's .NET is very good at optimizing regular stuff. That means: ALWAYS test your assumptions and "optimizations".

Always have a spinup cycle that does exactly the same

If you are testing an algorithm or a specific method in your class, always call it at least once BEFORE actually starting your test. This will get rid of the compilation that is done while executing a codepath for the first time. .NET is an interpreted language. Don't underestimate the time it take even today's cpus to do that initial step. I am talking here about hundrets of seconds. Be aware that SQL 2005 does some first time compilation of T-SQL statements as well that will not be visible during productive usage.

Be aware of caching

While the "spinup cycle" is very important one has to be absolutely sure that no caching will interfere the process of the "second call". Especially be careful of the earlier stated caching and compiling of T-SQL statements! This might result in way faster test times than your actual production code will ever run.

Utilize statistics

Never assume you know everything about the test by just performing it once. There are a lot of factors that play into the "actual performance right now" on a multithreaded GUI system. Try creating the environment that will exist during every day usage of the production code. Sum over multiple runs to get more accurate readings of the actual performance during normal operation.

Be aware of Debugging

Don't run your performance test by hitting F5 or CTRL+F5 in Visual Studio. This will get you readings that are very different from you actual production performance. First of all you probably used "Debug" as compiler settings. Plus Visual Studio will watch variables and breakpoints for you by attaching to the process of you test. This will slow down execution time considerably.

Be aware of Console output

Writing to console or updating textfields is very slow and therefore no diagnostic output should be produced during the actual testruns. Always save that for time after/in between testruns. If you want to write to the console between testruns concider doing a Thread.Sleep(50); before starting the next run.

Frameworks and compilers differ

Never assume that the optimization you made will be the best on all platforms and/or compiler. In the .NET world there is the Microsoft .NET Framework and the Mono Framework. Their compilers are very different when it comes to optimization philosophy, focus and maturity. Plus especially the mono interpreter allows for a huge range of additional runtime optimizations. Different code optimizations will gain or loose, depending on compiler and interpreter!

That's the basics. I could get into "testing the empty test-loop", etc. but that's for the hardcore testers. These points above though are very basic, but must be obeyed, if the test timing should be anything close to the actual production timing.

Happy testing. :)

Read on:

Post to Twitter Tweet this

September 6th, 2006 1:11 am | Comments (0)

Still optimizing IL

by Tobias Hertkorn on June 28th, 2006

Here are some pointers to great documents that I was looking at, while still trying to find a faster way to make arithmetic operations in safe C#:

Writing High-Performance Managed Applications : A Primer

Effective C#: 50 Specific Ways to Improve Your C#

Writing Faster Managed Code: Know What Things Cost

A Fast Serialization Technique

Post to Twitter Tweet this

June 28th, 2006 3:32 pm | Comments (0)

On the hunt for safe optimizations for md5

by Tobias Hertkorn on June 10th, 2006

Well, right now I am testing some other optimizations that use strictly safe code. One of the things I came across:

On my fast machine it helps to do as much arithmetic as possible in one go. For example:

C#:
  1. b += (((d ^ a) & c) ^ a) + (uint)K[7] + buff[7];
  2. b = (b <<22) | (b>> 10);
  3. b += c;

This b += c is not smart to use in this situation and actually produces a performance hit. Better would be:

C#:
  1. b += (((d ^ a) & c) ^ a) + (uint)K[7] + buff[7];
  2. b = ((b <<22) | (b>> 10)) + c;

Altering the whole segment like this in the md5 algorithm adds up to a 10-15% performance increase!

Let's look at the IL:

IL:
  1. ldloc.1
  2. ldloc.3
  3. ldloc.0
  4. xor
  5. ldloc.2
  6. and
  7. ldloc.0
  8. xor
  9. ldsfld unsigned int32[] MD5CryptoServiceProviderMonoOrig::K
  10. ldc.i4.7
  11. ldelem.u4
  12. add
  13. ldarg.0
  14. ldfld unsigned int32[] MD5CryptoServiceProviderMonoOrig::buff
  15. ldc.i4.7
  16. ldelem.u4
  17. add
  18. add
  19. stloc.1
  20. ldloc.1
  21. ldc.i4.s 22
  22. shl
  23. ldloc.1
  24. ldc.i4.s 10
  25. shr.un
  26. or
  27. stloc.1  // stores b
  28. ldloc.1  // loads b
  29. ldloc.2
  30. add
  31. stloc.1

the second version's:

IL:
  1. ldloc.1
  2. ldloc.3
  3. ldloc.0
  4. xor
  5. ldloc.2
  6. and
  7. ldloc.0
  8. xor
  9. ldsfld unsigned int32[] MD5AddOptimization3::K
  10. ldc.i4.7
  11. ldelem.u4
  12. add
  13. ldarg.0
  14. ldfld unsigned int32[] MD5AddOptimization3::buff
  15. ldc.i4.7
  16. ldelem.u4
  17. add
  18. add
  19. stloc.1
  20. ldloc.1
  21. ldc.i4.s 22
  22. shl
  23. ldloc.1
  24. ldc.i4.s 10
  25. shr.un
  26. or
  27.                // store, load eliminated
  28. ldloc.2
  29. add
  30. stloc.1

I thought this was weird, because I assumed that the compiler might be able to find these kinds of unnecessary operations, but I guess it is harder to locate them than it seems. Especially since a wrong optimization might end up with broken code.

A weird thing is: I just did some testing on my old, old i586 - and this optimization I described here actually decreases performances under some cirumstances! Obviously the lesson learned is: Never trust "optimizations" without testing them!

Well, I'll soon write about more safe optimizations I found. But this is for sure again: Optimizing is all about trial and error. There is almost no way to tell if anything might perform better or worse.

Post to Twitter Tweet this

June 10th, 2006 8:07 pm | Comments (2)

Md5 – no unsafe in crypto

by Tobias Hertkorn on June 9th, 2006

I got in contact with Sebastien Pouliot in order to contribute back to mono. And I got a really nice email back. He suggested a couple of other optimizations, explained some of the performance gains I was experiencing and was looking forward to seeing more of my optimizations. But unfortunatelly he also wrote this:

For many reasons I don't want to add unsafe code into crypto. Beside being "unsafe" many projects use copies of Mono crypto source code directly into their apps and I don't want to require them to use unsafe code (and limit the condition were the code can be executed, e.g. FullTrust)

That makes it impossible to use the one optimization that gives the biggest performance boost on little endian machines. Since that optimization will not make it into mono, I just wanted to blog about it here - just in case somebody needs the extra edge.

Optimizing the md5 algo on little endian machines

This decoding step is a major part within the md5 algo. It translates a 64 byte array into a 16 uint array. The general endian-safe step looks like this:

C#:
  1. for (i = 0; i <16; i++)
  2. {
  3.   buff[i] = (uint)(inputBuffer[inputOffset + 4 * i])
  4.     | (((uint)(inputBuffer[inputOffset + 4 * i + 1])) <<8 )
  5.     | (((uint)(inputBuffer[inputOffset + 4 * i + 2])) <<16)
  6.     | (((uint)(inputBuffer[inputOffset + 4 * i + 3])) <<24);
  7. }

If this step is performed on a little endian machine, the bytes are already safed in memory in the right order and there is no need for all the bitlogic. Therefore optimizing it using unsafe code and direct copying 4 consecutive bytes as one uint gives a huge speed increase:

C#:
  1. unsafe
  2. {
  3.   fixed (byte* bFixed = inputBuffer)
  4.   {
  5.     byte* inputPointer = bFixed + inputOffset;
  6.  
  7.     for (i = 0; i <16; i++)
  8.     {
  9.       buff[i] = *(uint*)(inputPointer);
  10.       inputPointer += 4;
  11.     }
  12.   }
  13. }

As we are already in unsafe context using an additional pointer for buff gives an additional speed increase:

C#:
  1. unsafe
  2. {
  3.   fixed (byte* bFixed = inputBuffer)
  4.   {
  5.     fixed (uint* aFixed = buff)
  6.     {
  7.       byte* inputPointer = bFixed + inputOffset;
  8.       uint* bufferPointer = aFixed;
  9.  
  10.       for (i = 0; i <16; i++)
  11.       {
  12.         *bufferPointer = *(uint*)(inputPointer);
  13.         inputPointer += 4;
  14.         bufferPointer++;
  15.       }
  16.     }
  17.   }
  18. }

Be sure to still make this Md5 implementation endian safe by introducing a if:

C#:
  1. if (BitConverter.IsLittleEndian)
  2. {
  3.   // unsafe decode
  4. }
  5. else
  6. {
  7.   // bitshifting decode
  8. }

I really can understand why this is important. There should not be any unsafe code in a crypto environment, and I agree on that. Unfortunatelly I have not come up with an alternative solution for that particular decode step. Using

C#:
  1. BitConverter.ToUInt32(inputBuffer, inputOffset + 4 * i);

really is no good. Unfortunatelly this method does some extensive testing before converting the bytes. That slows it down so much, that it performs worse than the bitops.

Post to Twitter Tweet this

June 9th, 2006 2:14 pm | Comments (0)

Optimized MD5CryptoServiceProvider for Mono

by Tobias Hertkorn on June 8th, 2006

Inspired by a comment here, I implemented a couple of my optimization strategies as MD5CryptoServiceProvider. Tests are running as I speak and should be done tomorrow morning, but these figures sure do give some hope:

MD5CopyOptimize (5): 0,446876320604479
MD5CopyOptimize2 (8): 0,447343112157013
MD5Bitlogic (4): 0,448018751258771
MD5Bitlogic2 (7): 0,448770640646336
MD5Inline (3): 0,457765987475873
MD5Inline2 (6): 0,458054073151199
MD5LittleEndian (2): 0,477792175150227
MD5NoArray (1): 0,526980798829037
MD5CryptoServiceProviderMonoOrig (0): 1,45861854618167

That's on my WinXP machine, running Microsoft's .NET 2.0. I do not have any data for Mono yet, but it seems to be about the same - that means, about 3 (!) times faster than the original mono Md5.

To understand which optimizations are done, please download the optimized MD5CryptoServiceProvider: Md5MonoOptimized
UPDATE: 4 and 5 (MD5Bitlogic and MD5CopyOptimize) did have a typo in them and therefore produced wrong results. This problem is fixed as of 15:00 on the 9th.

I'll post the mono test results tomorrow morning.

Post to Twitter Tweet this

June 8th, 2006 11:14 pm | Comments (1)

Fast Dynamic Property Access in C# using Reflection.Emit

by Tobias Hertkorn on May 31st, 2006

I just read an extremely interesting article on CodeProject called: Fast Dynamic Property Access with C# by James Nies. Wow, I am really impressed both by the idea to use Emit in this genius way and by the coding style. Very nice work.

So, I had to sit down and port it to .NET 2.0 and use Generics. Actually, what I found out - it is way simpler, than what he did. ;) How fortunate for me, since this was the very first time I even looked at Emit.

Please remember: This is a quick port. It won't work, if you don't know exactly the return type, especially if it is a ValueType.

Here are links to download the stuff:

GenericPropertyAccessor as Zip

Or as txts:
IGenericPropertyAccessor.cs
GenericPropertyAccessor.cs
And finally the updated test app:
SampleApp.cs

This is not the complete solution you need in order to build the SampleApp. Please go to Codeproject and download the source code, then add my files.

Post to Twitter Tweet this

May 31st, 2006 2:51 am | Comments (2)
Tobi + C# = T# - Blogged blogoscoop