Poupou's Corner of the Web

Looking for perfect security? Try a wireless brick.
Otherwise you may find some unperfect stuff here...

Weblog

Easy to (mis)use API

Here's another take on reducing string allocations inside Gendarme using the new Log Profiler. This time I focused on a very helpful, but easy to abuse, API: StreamReader.ReadLine. Similar methods suffer similar fates.

The .NET framework has quite a few helpers like this one. They work great when quickly hacking a solution but they also have serious limitations in the real world. E.g. how long is a line? From a Stream it could be infinite, eventually leading to an OutOfMemoryException. The same goes for ReadToEnd with respect to file size, ReadAllLines... (that sounds like a rule in itself ;-)
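To make the "how long is a line?" concern concrete, here is a minimal sketch of a line read that refuses to buffer more than a fixed number of characters. ReadLineBounded and BoundedLines are hypothetical names made up for this illustration; they are not part of .NET or Gendarme.

```csharp
using System;
using System.IO;
using System.Text;

static class BoundedLines {
	// Hypothetical helper: reads one line (without its terminator) but throws
	// instead of growing past maxChars characters. Returns null at end of input.
	public static string ReadLineBounded (TextReader reader, int maxChars)
	{
		int c = reader.Read ();
		if (c < 0)
			return null;

		var sb = new StringBuilder ();
		while (c >= 0) {
			if (c == '\n')
				break;
			if (c == '\r') {
				// swallow the LF of a CR LF pair
				if (reader.Peek () == '\n')
					reader.Read ();
				break;
			}
			if (sb.Length == maxChars)
				throw new InvalidDataException ("line longer than " + maxChars + " characters");
			sb.Append ((char) c);
			c = reader.Read ();
		}
		return sb.ToString ();
	}
}
```

This still allocates one string per line, so it only addresses the unbounded-size problem, not the allocation cost discussed next.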

Even if you control the line/file size there's still a price to pay: each line becomes a new string. That's not a big deal if you actually need each line as is. However, if you read lines and then parse each (or most) of them, a pretty common pattern, you get a lot of extra allocations.

make self-test

When doing a make self-test Gendarme reads two text files to find which known defects should be ignored (i.e. not reported). E.g.

-rw-r--r-- 1 poupou users  3169 2011-01-05 15:32 mono-options.ignore
-rw-r--r-- 1 poupou users 55154 2011-02-28 18:54 self-test.ignore

So that's 58323 bytes for fewer than 700 lines (including blanks and comments). However the (very simple) file format requires splitting each non-comment line into two parts:

  1. an indicator (is this a Rule, Assembly, Type, Method or a # comment); and
  2. a (rule / assembly / type / method) full name

This means that the original string, returned from ReadLine, is often a short-lived object.

So what if we were reading this into a reusable char[] buffer? Could we cut the allocations in half? It was worth a try and StreamLineReader was born. Here are the total allocations before and after IgnoreFileList was updated.

before	Total memory allocated: 71512640 bytes in 823879 objects
after	Total memory allocated: 71322520 bytes in 823084 objects
saved	                          190120 bytes in    795 objects

OK, 190,120 bytes may not be a huge gain (that's about 0.27% of the allocations required for a self-test). Still, it represents about 3.26 bytes saved for each byte read from the files (a good ratio) because other allocations, string and non-string, are now avoided as well.
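The reusable-buffer idea can be sketched roughly as follows. This is not Gendarme's StreamLineReader: ReusableLineReader and IgnoreEntries are hypothetical names, and the "X: full name" line layout (a one-character indicator, a colon, a space, then the name) is an assumption made only for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch of a line reader that fills a caller-supplied buffer.
sealed class ReusableLineReader {
	readonly TextReader reader;

	public ReusableLineReader (TextReader reader)
	{
		this.reader = reader;
	}

	// Copies the next line (without its terminator) into buffer and returns
	// its length, or -1 at end of input. Throws if the line does not fit.
	public int ReadLine (char[] buffer)
	{
		int c = reader.Read ();
		if (c < 0)
			return -1;

		int length = 0;
		while (c >= 0 && c != '\n') {
			if (c == '\r') {
				if (reader.Peek () == '\n')
					reader.Read ();
				break;
			}
			if (length == buffer.Length)
				throw new InvalidDataException ("line does not fit in the buffer");
			buffer [length++] = (char) c;
			c = reader.Read ();
		}
		return length;
	}
}

static class IgnoreEntries {
	// Assumed layout: "R: Some.Rule", "T: Some.Type", ... and "# comment".
	public static List<KeyValuePair<char, string>> Parse (TextReader input)
	{
		var entries = new List<KeyValuePair<char, string>> ();
		var lines = new ReusableLineReader (input);
		char[] buffer = new char [4096]; // allocated once, reused for every line
		int length;
		while ((length = lines.ReadLine (buffer)) >= 0) {
			if (length < 4 || buffer [0] == '#')
				continue; // blank, too short or comment: nothing is allocated
			// the only string created per useful line is the full name
			string name = new string (buffer, 3, length - 3);
			entries.Add (new KeyValuePair<char, string> (buffer [0], name));
		}
		return entries;
	}
}
```

Compared to ReadLine followed by Substring, comments and blank lines cost no string at all and each useful line costs one string instead of two.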

Why bother?

IgnoreFileList was not very high in the profiler logs. However MonoCompatibilityReviewRule is at the top, for the same reason, since it downloads (from the MoMA web service), uncompresses, then reads three text files. Here's an extract of the logs:

Allocation summary
     Bytes      Count  Average Type name
  25515184     153796      165 System.String
	11693296 bytes from:
		Gendarme.Framework.Runner:Initialize ()
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:Initialize (Gendarme.Framework.IRunner)
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:LoadDefinitions (string)
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:Read (System.IO.TextReader)
		System.IO.StreamReader:ReadLine ()
		(wrapper managed-to-managed) string:.ctor (char[],int,int)
		string:CreateString (char[],int,int)
		(wrapper managed-to-native) string:InternalAllocateStr (int)
	1365136 bytes from:
		Gendarme.Framework.Runner:Initialize ()
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:Initialize (Gendarme.Framework.IRunner)
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:LoadDefinitions (string)
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:Read (System.IO.TextReader)
		System.IO.StreamReader:ReadLine ()
		System.Text.StringBuilder:set_Length (int)
		System.Text.StringBuilder:InternalEnsureCapacity (int)
		(wrapper managed-to-native) string:InternalAllocateStr (int)
	1164952 bytes from:
		Gendarme.Framework.Runner:Initialize ()
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:Initialize (Gendarme.Framework.IRunner)
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:LoadDefinitions (string)
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:Read (System.IO.TextReader)
		System.IO.StreamReader:ReadLine ()
		System.Text.StringBuilder:Append (char[],int,int)
		System.Text.StringBuilder:InternalEnsureCapacity (int)
		(wrapper managed-to-native) string:InternalAllocateStr (int)
	1030624 bytes from:
		Gendarme.ConsoleRunner:Initialize ()
		Gendarme.Framework.Runner:Initialize ()
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:Initialize (Gendarme.Framework.IRunner)
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:LoadDefinitions (string)
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:ReadWithComments (System.IO.TextReader)
		string:Substring (int,int)
		string:SubstringUnchecked (int,int)
		(wrapper managed-to-native) string:InternalAllocateStr (int)
	966576 bytes from:
		Gendarme.Framework.Runner:Initialize ()
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:Initialize (Gendarme.Framework.IRunner)
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:LoadDefinitions (string)
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:ReadWithComments (System.IO.TextReader)
		System.IO.StreamReader:ReadLine ()
		(wrapper managed-to-managed) string:.ctor (char[],int,int)
		string:CreateString (char[],int,int)
		(wrapper managed-to-native) string:InternalAllocateStr (int)

We see the System.IO.StreamReader:ReadLine and also the string:Substring - a clear hint that (some) lines are being parsed. Changing the rule to use the StreamLineReader shows how much memory can be saved.

before	Total memory allocated: 71322520 bytes in 823084 objects
after	Total memory allocated: 68936880 bytes in 816067 objects
saved	                         2385640 bytes in   7017 objects
     	                             3.3 %           0.8 %

That's much better percentage-wise. However the ratio (with respect to file size) is much lower because two of the three files used by the rule do not require parsing the lines, i.e. what ReadLine returned was usable "as-is" and kept in a HashSet. Only monotodo.txt, which has an optional text message, needs some extra parsing - even if only to remove the '-' at the end of the line.

Newer logs show, more clearly, that most allocations are done on the unparsed files - i.e. the optimization did not reach them:

  22888072     145733      157 System.String
	13264152 bytes from:
		Gendarme.ConsoleRunner:Initialize ()
		Gendarme.Framework.Runner:Initialize ()
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:Initialize (Gendarme.Framework.IRunner)
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:LoadDefinitions (string)
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:Read (Gendarme.Framework.Helpers.StreamLineReader)
		(wrapper managed-to-managed) string:.ctor (char[],int,int)
		string:CreateString (char[],int,int)
		(wrapper managed-to-native) string:InternalAllocateStr (int)
	1131424 bytes from:
		Gendarme.ConsoleRunner:Initialize ()
		Gendarme.Framework.Runner:Initialize ()
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:Initialize (Gendarme.Framework.IRunner)
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:LoadDefinitions (string)
		Gendarme.Rules.Portability.MonoCompatibilityReviewRule:ReadWithComments (Gendarme.Framework.Helpers.StreamLineReader)
		(wrapper managed-to-managed) string:.ctor (char[],int,int)
		string:CreateString (char[],int,int)
		(wrapper managed-to-native) string:InternalAllocateStr (int)

Right now my options to further reduce string usage are a bit limited - at least without changing the file format, which we inherit from MoMA. E.g. the file missing.txt has more than 55000 lines because it covers every assembly shipped with MS.NET 4.0. Gendarme could easily read (and allocate) only the entries needed by the assemblies referenced by the code being analyzed - if that data was available.

This will become important because I expect (or at least wish for) similar rules (e.g. something similar to CA1903:UseOnlyApiFromTargetedFramework) to be added to Gendarme in the next releases. Yet more planning is needed (other rules' requirements) before changing the format.

Still it's nice to know the tooling needed to guide such work is available and simply waiting for time / hackers :-)


3/4/2011 15:15:05

The views expressed on this website/weblog are mine alone and do not necessarily reflect the views of my employer.