Friday, December 23, 2011

A simple approach that paid off

I am currently migrating a relatively large application from VS2008/NET3.5 to VS2010/NET4. During this migration we ran into a problem. After we had migrated the code we wanted to test the application. For this we have a number of automated tests. We use Ranorex to drive part of our application.
While testing we found that the application reacted in a sluggish way and the automated tests failed due to time-outs.  We also observed that when doing the same tests ,manually the problem did occur. So what was happening and why this strange behavior.

I followed a simple approach  which paid of and which gave us enough information pin point the problem.
I fired up WinDbg attached to the process and loaded the sos extension (.loadby sos clr).
I listed all the thread (!threads) and looked up the thread ID of the UI thread.
I then resumed the program. 

I prepared the following command (17 being the ID of the UI thread):  ~17s;!clrstack;g

I then reproduced the problem. At that point I quickly hit break and executed the above command. 
I repeated this a number of times. After doing so I started looking at the different calls stacks.
What I noticed that all the stacks looked similar.

OS Thread Id: 0x1f08 (17)
Child SP IP       Call Site
0a54be30 5c075e00 Ranorex.Libs.WinForms.AssemblyResolver.AssemblyResolveHandler(System.Object, System.ResolveEventArgs)
0a54be34 68db0de1 System.AppDomain.OnAssemblyResolveEvent(System.Reflection.RuntimeAssembly, System.String)
0a54c074 6a2121db [GCFrame: 0a54c074] 
0a54d0ac 6a2121db [HelperMethodFrame_PROTECTOBJ: 0a54d0ac] System.Reflection.RuntimeAssembly._nLoad(System.Reflection.AssemblyName, System.String, System.Security.Policy.Evidence, System.Reflection.RuntimeAssembly, System.Threading.StackCrawlMark ByRef, Boolean, Boolean, Boolean)
0a54d154 68dce713 System.Reflection.RuntimeAssembly.InternalGetSatelliteAssembly(System.String, System.Globalization.CultureInfo, System.Version, Boolean, System.Threading.StackCrawlMark ByRef)
0a54d17c 68dce4ea System.Resources.ManifestBasedResourceGroveler.GetSatelliteAssembly(System.Globalization.CultureInfo, System.Threading.StackCrawlMark ByRef)
0a54d1c0 68dbd314 System.Resources.ManifestBasedResourceGroveler.GrovelForResourceSet(System.Globalization.CultureInfo, System.Collections.Generic.Dictionary`2<System.String,System.Resources.ResourceSet>, Boolean, Boolean, System.Threading.StackCrawlMark ByRef)
0a54d230 68dbcf95 System.Resources.ResourceManager.InternalGetResourceSet(System.Globalization.CultureInfo, Boolean, Boolean, System.Threading.StackCrawlMark ByRef)
0a54d298 68dbcd51 System.Resources.ResourceManager.InternalGetResourceSet(System.Globalization.CultureInfo, Boolean, Boolean)
0a54d2b4 68dd324a System.Resources.ResourceManager.GetResourceSet(System.Globalization.CultureInfo, Boolean, Boolean)
.....

To further investigate the problem I put a breakpoint on the AssemblyResolveHandler. To get the address of the JIT'ed code I used the following command:

0:083> !name2ee Ranorex.Libs.WinForms.dll Ranorex.Libs.WinForms.AssemblyResolver.AssemblyResolveHandler
Module:      5c021000
Assembly:    Ranorex.Libs.WinForms.dll
Token:       060001cd
MethodDesc:  5c0354a4
Name:        Ranorex.Libs.WinForms.AssemblyResolver.AssemblyResolveHandler(System.Object, System.ResolveEventArgs)
JITTED Code Address: 5c075e00

Next I set a breakpoint using : bu 5c075e00 "!clrstack;g"
and opened a logfile (.logopen <filename>) and reran the test.

This way I found that there were indeed a large number of calls to this method. I repeated the experiment on our VS2008 build and here I did not see a large amount of calls to this method. 

The next step in the investigation was to examine the call-stack in more detail. We found that the "System.AppDomain.OnAssemblyResolveEvent" behave differently in framework version 4. MSDN states: 

Beginning with the .NET Framework 4, the ResolveEventHandler event is raised for all assemblies, including resource assemblies. In earlier versions, the event was not raised for resource assemblies. If the operating system is localized, the handler might be called multiple times: once for each culture in the fallback chain.

Debugging along we found that the library that our application is using could not find a resource (tooltip) which resulted in lots of callback to the Ranorex code. 

So using this simple method and the help of google (leading us to MSDN) we were able to pin point the problem in a relatively short time. 

Happy bug hunting

No comments:

Post a Comment