Latch waits on 2:1:103? You are probably creating too many temp tables in Sql Server by Matt Wrock

Last week my team pushed our monthly release and all looked well and good after the launch. Then at about 10PM our site monitoring alerts started to fire as our page load times soared and this continued until early morning. There were no unusual errors in our logs and our traffic load was normal. What was clear was that our database had been running quite hot on CPU.

As we continued to investigate the following day, a Friday, all systems looked quite healthy and we were very perplexed. There was no recurrence of this behavior that Friday or Saturday night, but on Sunday it happened again: a very abrupt and fast rise in page load times, followed by a just as sudden drop at around 9AM. All of this was happening during what we thought was our lull in traffic.

Interestingly (though not so topical to this post), after studying our logs back several weeks, we learned that 10PM to 9AM actually is our traffic peak. So it began to make sense why this was occurring during that window. Knowing that this was manifesting itself as a database CPU issue, we ran the following query:

select * from master..sysprocesses sp cross apply sys.fn_get_sql(sp.sql_handle) order by cpu desc

This query returns a row for every process in descending order of CPU load along with the exact SQL statement each is executing. It will also tell you if the process is blocked by another process and, if so, which process and what resource it is waiting on. Having run this query many times under this load, we were seeing a pattern with results like this:

spid  kpid   blocked  waittype  waittime  lastwaittype         waitresource  dbid
254   9680   0        0x0034    2         PAGELATCH_EX         2:1:103       9
128   10288  163      0x0032    11        PAGELATCH_SH         2:1:103       9
129   0      0        0x0000    0         MISCELLANEOUS                      9
116   10152  0        0x0000    0         SOS_SCHEDULER_YIELD                9
169   0      0        0x0000    0         MISCELLANEOUS                      1
52    0      0        0x0000    0         MISCELLANEOUS                      4
148   9468   163      0x0034    9         PAGELATCH_EX         2:1:103       11
90    0      0        0x0000    0         MISCELLANEOUS                      9
274   6612   0        0x0034    15        PAGELATCH_EX         2:1:103       9
131   8568   163      0x0032    4         PAGELATCH_SH         2:1:103       9
120   0      0        0x0000    0         MISCELLANEOUS                      11
105   8728   163      0x0034    11        PAGELATCH_EX         2:1:103       9
247   5660   0        0x0000    0         PAGELATCH_SH         2:1:103       9
216   0      0        0x0000    0         MISCELLANEOUS                      1
163   8108   0        0x0000    0         PAGELATCH_SH         2:1:103       9
201   7004   0        0x0034    1         PAGELATCH_EX         2:1:103       9
267   0      0        0x0000    0         MISCELLANEOUS                      6
117   11124  163      0x0034    10        PAGELATCH_EX         2:1:103       11
205   10016  0        0x0034    1         PAGELATCH_EX         2:1:103       9
210   10108  0        0x0034    14        PAGELATCH_EX         2:1:103       9
226   0      0        0x0000    0         MISCELLANEOUS                      10
164   0      0        0x0000    0         MISCELLANEOUS                      1
175   0      0        0x0000    0         MISCELLANEOUS                      11
98    0      0        0x0000    0         MISCELLANEOUS                      7

What this tells me is that there are several queries lined up behind the query executing under spid 163, and they are waiting for 163 to release its shared latch so that they can acquire shared or exclusive latches of their own. To quote the SQLCAT team, “Latches are lightweight synchronization primitives that are used by the SQL Server engine to guarantee consistency of in-memory structures including index, data pages and internal structures such as non-leaf pages in a B-Tree.” See this white paper for an in depth discussion of latch contention. So the next question is: what is 2:1:103, the resource everyone is fighting for?

The first number is the database id, the second is the file id within that database, and the third and last number is the page. Database 2 is tempdb. If you don’t believe me, just execute:

select db_name(2)

It says “tempdb” doesn’t it? I just knew that it would.

To track down the actual object that owns that page, execute:

DBCC TRACEON(3604)
DBCC PAGE (2, 1, 103)

The key item of interest in the results is the Metadata: ObjectId with a value of 75. Then execute:

select object_name(75)

and the object is sysmultiobjrefs.

So what the heck am I supposed to do with this? If this was blocking on a user table that was part of my application, it seems I would have something to work with. But tempdb? Really? Ugh.

After recovering from a spiral of primal self-loathing and despair, I turn to Google. I’m sure lots of people have seen contention on 2:1:103 and it will be no time at all before a clear explanation and remedy are revealed. Well, yes, there are several people struggling with contention on this object, but alas, there are no clear explanations or answers. I find several posts talking about 2:1:1 and 2:2:3 (see this KB article for those). But those struggling with 2:1:103 seem to be told to turn on trace flag -T1118, only to complain later that the flag is not doing anything. Or they are told to add more files, but they reply that they already have plenty of files. And by the way, -T1118 was already turned on in our DB and we have plenty of files too.

Well, finally I come across this post from someone with this exact issue. The poster states that they contacted Microsoft and then reports the explanation MS gave, and this explanation actually makes sense. The issue comes from contention on DDL operations in tempdb; in other words, the creating and dropping of temp tables. We just so happen to do a lot of that, and we added more in this last release.

I then came upon this white paper, and while it does not specifically refer to 2:1:103, it provides an in-depth description of the inner workings of tempdb and how to identify and troubleshoot different forms of contention there. Of particular interest was its explanation of contention in DML and DDL operations.

Contention in DML (select, update, insert) operations is identified by contention on 2:1:1 or 2:1:3. This can be caused either by “internal” objects, such as the objects the SQL query engine creates when executing certain hash joins, sorts and groupings, or by user objects like temp tables and table variables. At the root of the issue is SQL Server’s struggle to allocate and deallocate pages and update the metadata tables that track those pages. In addition to suggesting that those who suffer from this issue analyze their query plans to make more efficient use of tempdb, the paper also suggests turning on -T1118 and adding data files to tempdb.
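
For reference, those two standard mitigations look roughly like the following. This is only a sketch; the file name, path and size are placeholders for illustration, not recommendations for your environment:

-- Turn on uniform extent allocation for tempdb (the -T1118 startup flag).
-- Enabling it at runtime like this does not survive a restart; the startup
-- parameter is the durable way to set it.
DBCC TRACEON (1118, -1);

-- Add another tempdb data file (placeholder name, path and size).
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2, FILENAME = 'T:\tempdb2.ndf', SIZE = 1024MB, FILEGROWTH = 256MB);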

My team’s plight seemed to fall under DDL contention. This can be determined by checking whether the object under contention is a system catalog table in sysobjects. Well, according to:

select * from sys.objects where object_id=75

sysmultiobjrefs is a system table. DDL contention is not impacted by internal objects; it deals strictly with the creation and destruction of user objects, namely temp tables and table variables.

In addition to checking on the wait resource, there are two performance counters that you can watch to track the real-time creation and destruction of temp tables (a query to read them from T-SQL follows the list):

  • Temp Tables Creation Rate
  • Temp Tables For Destruction
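
These counters are also exposed through the sys.dm_os_performance_counters DMV if you would rather watch them from a query window than from perfmon. A quick sketch (on a named instance the object_name prefix will differ):

select object_name, counter_name, cntr_value
from sys.dm_os_performance_counters
where counter_name in ('Temp Tables Creation Rate', 'Temp Tables For Destruction')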

You can turn these counters on and run your queries to determine how many temporary resources they are creating. We were able to identify that 70% of our traffic was issuing a query that now created 4 temp tables. The advice given here is to:

  1. Use fewer temp tables and temp table variables.
  2. Check to see if your use of temp tables takes advantage of temp table caching, which reduces contention (see the sketch after this list). SQL Server will automatically cache created temp tables as long as:
    • Named constraints are not created.

    • Data Definition Language (DDL) statements that affect the table are not run after the temp table has been created, such as the CREATE INDEX or CREATE STATISTICS statements.

    • Temp object is not created by using dynamic SQL, such as: sp_executesql N'create table #t(a int)'.

    • Temp object is created inside another object such as a stored procedure, trigger, or user-defined function; or is the return table of a user-defined table-valued function.
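
To make the last two bullets concrete, here is a minimal sketch (the #t table and the GetStuff proc are made-up examples): the first temp table is eligible for caching because it is created inside a stored procedure with no named constraints and no DDL after creation, while the second is not because it is created through dynamic SQL.

-- Eligible for caching: created inside a stored procedure,
-- no named constraints, no DDL after the create.
create procedure dbo.GetStuff
as
begin
    create table #t (a int);
    insert into #t (a) values (1);
    select a from #t;
end
go

-- Not eligible for caching: the temp table is created via dynamic SQL.
exec sp_executesql N'create table #t (a int); insert into #t (a) values (1); select a from #t;';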

Well, we use dynamic SQL, so none of our temp tables get cached, which probably means 2:1:103 is being updated on every creation.

Our “quick fix” was to take the query that 70% of our traffic hits, remove its use of temp tables, and trim it down to something highly optimized for exactly what that traffic was doing. Fortunately, that turned out to be very simple. Moving forward, we will need to seriously curb our use of temp tables or wrap them in SQL functions or stored procs.

I am intentionally putting 2:1:103 in the title of this post with the hope that someone will find it more quickly than it took me to track all of this down.

UPDATE: Michael J. Swart (@MJSwart) wrote a great post on this exact same problem here. I strongly encourage you to read about his run in with this awful awful problem.

Released: nQuant .net 8 bit PNG Quantizer by Matt Wrock

Last weekend I blogged about the quantization algorithm I had been working on in order to optimize images in my automatic CSS minification and spriting framework, RequestReduce. Yesterday, I released this code as its own project on nquant.codeplex.com. While this quantizer provides great value for users of RequestReduce, I wanted to provide it as a separate component for anyone who would like to take advantage of its ability to dramatically reduce the size of their images. As I mentioned before, I have been seeing 3x reductions in the size of the 32 bit PNGs I produce in RequestReduce.

You can incorporate the quantizing DLL into your own code to transform any bitmap and get an 8 bit PNG bitmap back. I also provide a command line wrapper you can use to quantize individual image files from a script.
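
For example, calling the library from C# looks something like the sketch below. The WuQuantizer class name and QuantizeImage method are from memory, so treat them as assumptions and check nquant.codeplex.com for the exact API:

using System.Drawing;
using System.Drawing.Imaging;
using nQuant;

class QuantizeExample
{
    static void Main()
    {
        // WuQuantizer and QuantizeImage are assumed names; verify against the nQuant docs.
        var quantizer = new WuQuantizer();
        using (var bitmap = new Bitmap(@"input.png"))
        using (var quantized = quantizer.QuantizeImage(bitmap))
        {
            // The result is an 8 bit indexed bitmap; save it out as a PNG.
            quantized.Save(@"output.png", ImageFormat.Png);
        }
    }
}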

You can download nQuant at nquant.codeplex.com, or if you have NuGet, simply enter:

Install-Package nQuant

from the Package Manager Console.

I would appreciate any feedback, bug reports or suggestions for added features at nQuant.codeplex.com. You are also welcome to contribute to the code.

Convert 32 bit PNGs to high quality 8 bit PNGs with C# by Matt Wrock

I’ve been working on a C# HttpModule called RequestReduce that sprites CSS background images on the fly as well as merges and minifies all CSS resources in the document head. For those who may not know, image spriting is a technique that takes individual images in separate image files and merges them into a single file. By tracking the offsets of where each image begins, you can use CSS syntax to limit a CSS background image to just one of the images in the file like so:

background: url(/RequestReduceContent/94b703a23cf3a2644577ab68fa465844-e2d1ced7efd8cd883516e2da4e51a9d5.png) -241px 0 no-repeat;

 

By using this technique effectively, page load times will be faster since there will be fewer HTTP connections opened.

One of the items on my backlog has been image optimization. I had envisioned a feature that would either shell out to one of the popular PNG compressors or allow users a means of plugging in a compressor of their choice. About a week or two before releasing this project, I was in a meeting at work (I work on the Microsoft EPX Galleries team, which produces, among many other galleries, the Visual Studio Gallery and the MSDN Code Samples Gallery) discussing image optimization, and someone was demonstrating the benefits of optimizing some of the images that get published in the MSDN dev centers. The impact on size was enormous, on the level of producing images several times smaller than their unoptimized originals.

So I figured, how difficult could it be to add this feature to RequestReduce? My plan was to shell out to optipng.exe, a very popular and effective C library that losslessly compresses PNG images. The answer was: not difficult at all. The code looked something like this:

private void InvokeExecutable(string arguments, string executable)
{
  using (var process = new Process())
  {
    process.StartInfo = new ProcessStartInfo()
    {
      UseShellExecute = false,
      RedirectStandardOutput = true,
      CreateNoWindow = true,
      FileName = executable,
      Arguments = arguments
    };
    process.Start();
    process.StandardOutput.ReadToEnd();
    process.WaitForExit(10000);
    if (!process.HasExited)
    {
      process.Kill();
      throw new OptimizationException(string.Format(
        "Unable to optimize image using executable {0} with arguments {1}",
        executable,
        arguments));
    }
  }
}
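
Calling it for a sprite then looked something like this (the -o2 optimization level and the path variable are just illustrative):

// spritePhysicalPath is a placeholder for wherever the sprite was written to disk.
InvokeExecutable("-o2 \"" + spritePhysicalPath + "\"", "optipng.exe");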

 

However, I was seeing image size reductions that were much less dramatic than the ones demoed in the meeting I attended. The reason was that I work with web devs who are typically good about optimizing images, while the presenter in the meeting was referring to “site managers” who might know little to nothing about image optimization and sometimes upload huge images unaware of the consequences.

So I was a bit disappointed, but I decided to move forward with the feature. I was, after all, seeing some improvement, but more on the order of 5%.

A couple days later I plugged RequestReduce into this blog. While it was very effective in reducing the number of HTTP requests, it appeared as though the total count of downloaded bytes was greater than before implementing RequestReduce. How could this be? With the minification of the CSS, it would certainly be less, right? Apparently not. The main culprit was a JPG image that was originally about 5k and was consuming well over this amount in my sprite. I figured this was a flaw I had to fix, and I knew very little about the details of image optimization.

My first inclination was that I was creating the image incorrectly. I assumed there was some encoder setting I was failing to set correctly that was forcing the generation of 32 bit PNGs. I spent hours googling only to find there were countless others with the same inclination and no solution. Previously I had reviewed the code written for the ASP.NET Sprite Framework. This framework does something similar to RequestReduce but takes a very different approach that requires some intentional organization of the local image files before it can work. It had code, and eventually I had code, that looked like this when generating the final PNG bitmap:

public byte[] GetBytes(string mimeType)
{
  using (var spriteEncoderParameters = new EncoderParameters(1))
  {
    spriteEncoderParameters.Param[0] = new EncoderParameter(Encoder.Quality, 90);
    using (var stream = new MemoryStream())
    {
      SpriteImage.Save(stream, ImageCodecInfo.GetImageEncoders()
        .First(x => x.MimeType == mimeType), spriteEncoderParameters);
      return stream.GetBuffer();
    }
  }
}

 

Let me just set the record straight here, because there are oodles of forum posts and Stack Overflow answers concerning this and many give wrong or misleading answers. The Encoder.Quality parameter, like most of the other encoder parameters, does absolutely nothing in the default PNG encoder. I had thought that this was where the bulk of my problem lay. It wasn’t.

 

Eventually it became clear that this was not going to be solved by some hidden property in the GDI+ API, which is the API exposed in the .NET BCL for image manipulation. I stumbled upon a concept called image quantization: the process of reducing the number of colors in an image. This is a key concept in getting a 32 bit PNG to convert to an 8 bit PNG, because an 8 bit PNG can have no more than 256 colors. It has what is called an indexed color palette: a part of the PNG file structure holds pointers to 256 colors, and each pixel in the image gets its color from one of those pointers. Thus each pixel consumes only one byte, its 0-255 value pointing to its color on the palette. A 32 bit PNG, on the other hand, is 4 bytes per pixel and each pixel can represent a different ARGB color value. A 32 bit PNG also has a lot of wasted space, since it may have multiple pixels that use the same color but each holds its own copy. For example, a 1000x1000 image stored at 32 bits per pixel carries about 4MB of raw pixel data, while the same image at 8 bits per pixel carries about 1MB plus a palette of at most 256 colors.

So with this information it makes sense why my sprites had to be 32 bit PNGs. As I added images to the sprites, even if the individual images being added were 8 bit PNGs, I would inevitably accrue more than 256 colors. Once that happens, an 8 bit PNG can no longer be used without some data loss. That said, data loss does not necessarily need to translate to “perceptible” quality loss. This is where color quantizers come into play.

Typically image optimization for the web takes one of two paths: lossless and lossy. Lossless guarantees no quality loss and implements various compression strategies to shrink the size of a PNG. Popular tools like OptiPNG or PNGCrush utilize this method. The savings can sometimes be significant and other times not so much. However, the savings is always “free” of quality loss and is therefore highly advisable.

Lossy compression certainly does run the risk of unacceptable quality loss, but it can often render images that are dramatically smaller than their 32 bit cousins with quality loss that is practically imperceptible. There is inevitably loss, but if the loss can either not be perceived at all or is so slight that the savings far outweigh the color loss, taking on the loss may be advisable. Often the deciding factor is the number of colors in the original image: the closer that number is to 256, the less quality loss there will be and therefore the less likely that any of this loss will be perceptible. RequestReduce’s default color limit to quantize is 5000. However, this rule is not concrete. There are images where noticeable color loss may occur beyond 3000 colors, and other times (especially when dealing with photos) an image with 10000 colors is fine to quantize.

I only managed to find one open source .NET based library that provided quantization, and I was not comfortable with its license. There are a couple of C based EXEs that perform fast and effective quantization. The ones I used were pngquant and pngnq; the difference between them is the algorithm they employ for the quantization. I found that these produced decent quality images, but one of my images (a rating star) was always off color, and a couple of sprites had images with gradients that looked rather awful after the quantization. I eventually tried tweaking my code to limit the number of colors stored in a single sprite. This meant producing up to 3 sprite images for a page, which seemed to me to compromise the benefit of the sprite.

So I looked further into other quantization algorithms. I found an old sample on MSDN that I also found in an older version of the Paint.NET source code. My images looked terrible using this sample. Then I stumbled upon this article on Code Project. It was a C# library that contained several different quantization algorithms. They all looked quite subpar except for one: an algorithm developed by Xiaolin Wu. My images looked perfect. I could not notice any loss. Furthermore, their sizes were even smaller than the images generated by the C EXE libraries. There was just one catch: the algorithm assumed RGB pixel values with no alpha opacity value. The Code Project sample did include an “alpha blending” technique that made the transparency “appear” to be visible, but you had to know the background color to blend the transparency with, and RequestReduce works with sites where it has no idea what the background color is.

Thus began a three week journey to tweak Xiaolin Wu’s algorithm to include the fourth alpha channel. Let me just preface this with the fact that I have no math or statistics background to speak of. Sadly, someone with such a background probably would have spent a lot less time on this. But it was literally almost all I could think of for three weeks. It was the type of problem that I constantly felt I was about to solve within an hour, but each breakthrough simply exposed a new problem that became more complex than I anticipated as I got closer. Before really jumping in, I had thought that adding the fourth alpha dimension to RGB would be rather trivial. Just look for lines like:

halfDistance = halfRed * halfRed + halfGreen * halfGreen + halfBlue * halfBlue;

 

and change it to:

halfDistance = halfRed * halfRed + halfGreen * halfGreen + halfBlue * halfBlue + halfAlpha * halfAlpha;

 

I can do that, and a lot of it was just that, except for a few key parts. I’m not going to agonize over the details of this discovery process. While perhaps therapeutic for me, it would have negative value for any reader. Suffice it to say that in the end, after lots of scratch pads and notepad.exe windows full of random numbers and sequences, I eventually had a quantized image with 256 colors that looked really good, better and smaller than the ones produced by pngquant or pngnq. However, they were not perfect.

As one can assume, adding an extra byte for alpha translates to a source image with potentially many more colors than a fully opaque image. This is because, technically, the same color with a different alpha value is a different value and competes for one of the 256 slots in the 8 bit PNG indexed palette. In fact, there is the potential for 256x more colors. So I was not surprised to find that these quantized images were of slightly lower quality than the ones produced by the RGB quantizer.

Fortunately, in reality, the number of values used in the alpha channel is usually far smaller than the number used in the RGB channels. That being the case, I took some liberties with the alpha values that increased the image quality while sacrificing the number of unique alpha values used. I used two strategies here:

1. I created an AlphaThreshold value. This is a constant that I set, but it should be configurable if this were production code. Any alpha value equal to or below the threshold gets translated to 0. I set this threshold to 30, finding that values with an alpha of 30 or less are for the most part invisible.

2. I normalized the source alpha values to their closest multiple of 70. Why 70? Honestly, it is an arbitrary number that happened to look good on the images I tested. This essentially limits the number of values the alpha channel can occupy while continuing to allow for decent gradients. I’m sure there are images where this would not work well, but it looks good on all the images I have processed so far.

So let’s dive into the code. You can find a complete working VS solution on the MSDN Samples Gallery. The process of quantization can be split into the following parts (a rough sketch of how they chain together follows the list):

  1. Read the unoptimized source image and create a histogram of its colors. Which colors are used and how often.
  2. Break down this data into several cumulative arrays that will make it much more efficient to slice and dice.
  3. Create 256 clusters of colors from which we will pick the resulting palette. This step is what sets Wu’s algorithm apart from others. It is a process of perpetually splitting the color space into pairs of boxes to find clusters of colors with minimal variance.
  4. Iterate over the source image again to find the actual color closest to the mean of each cluster (these will be our palette colors) and map each pixel to its appropriate cluster.
  5. Generate the new quantized image from the data in step 4.
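
Roughly, these steps chain together like the sketch below. The method names match the code that follows, but the orchestration itself is simplified and not the exact code from the sample:

// Simplified orchestration of the five steps above.
public Bitmap Quantize(Bitmap sourceImage, int colorCount, int alphaThreshold)
{
    var colorData = BuildHistogram(sourceImage);                                     // step 1
    colorData = CalculateMoments(colorData);                                         // step 2
    var cubes = SplitData(ref colorCount, colorData);                                // step 3
    var palette = GetQuantizedPalette(colorCount, colorData, cubes, alphaThreshold); // step 4
    return ProcessImagePixels(sourceImage, palette);                                 // step 5
}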

Again, what lends accuracy to this algorithm is the clustering of colors of minimal variance. It is looking for the core colors in the image and not simply the colors used most often. With lower quality algorithms, it is far more likely for parts of an image to be mapped to a more predominant color that is somewhat similar but quite noticeably different, making those areas appear discolored. From my tests, the Wu algorithm seems to eliminate these inaccuracies. For a deeper analysis from the author himself, see Graphics Gems vol. II, pp. 126-133.

Building the Histogram

private static ColorData BuildHistogram(Bitmap sourceImage)
{
  var data = sourceImage.LockBits(Rectangle.FromLTRB(0, 0, sourceImage.Width, sourceImage.Height),
    ImageLockMode.ReadOnly, sourceImage.PixelFormat);
  var colorData = new ColorData(MaxSideIndex);

  try
  {
    var byteLength = data.Stride < 0 ? -data.Stride : data.Stride;
    var byteCount = Math.Max(1, BitDepth >> 3);
    var offset = 0;
    var buffer = new Byte[byteLength * sourceImage.Height];
    var value = new Byte[byteCount];

    Marshal.Copy(data.Scan0, buffer, 0, buffer.Length);
    for (var y = 0; y < sourceImage.Height; y++)
    {
      var index = 0;
      for (var x = 0; x < sourceImage.Width; x++)
      {
        var indexOffset = index >> 3;

        for (var valueIndex = 0; valueIndex < byteCount; valueIndex++)
          value[valueIndex] = buffer[offset + valueIndex + indexOffset];

        var indexAlpha = (byte)((value[Alpha] >> 3) + 1);
        var indexRed = (byte)((value[Red] >> 3) + 1);
        var indexGreen = (byte)((value[Green] >> 3) + 1);
        var indexBlue = (byte)((value[Blue] >> 3) + 1);

        if (value[Alpha] > AlphaThreshold)
        {
          if (value[Alpha] < 255)
          {
            var alpha = value[Alpha] + (value[Alpha] % 70);
            value[Alpha] = (byte)(alpha > 255 ? 255 : alpha);
            indexAlpha = (byte)((value[Alpha] >> 3) + 1);
          }

          colorData.Weights[indexAlpha, indexRed, indexGreen, indexBlue]++;
          colorData.MomentsRed[indexAlpha, indexRed, indexGreen, indexBlue] += value[Red];
          colorData.MomentsGreen[indexAlpha, indexRed, indexGreen, indexBlue] += value[Green];
          colorData.MomentsBlue[indexAlpha, indexRed, indexGreen, indexBlue] += value[Blue];
          colorData.MomentsAlpha[indexAlpha, indexRed, indexGreen, indexBlue] += value[Alpha];
          colorData.Moments[indexAlpha, indexRed, indexGreen, indexBlue] +=
            (value[Alpha] * value[Alpha]) + (value[Red] * value[Red]) +
            (value[Green] * value[Green]) + (value[Blue] * value[Blue]);
        }

        colorData.QuantizedPixels.Add(BitConverter.ToInt32(new[] {
          indexAlpha, indexRed, indexGreen, indexBlue }, 0));
        colorData.Pixels.Add(new Pixel(value[Alpha], value[Red], value[Green], value[Blue]));
        index += BitDepth;
      }
      offset += byteLength;
    }
  }
  finally
  {
    sourceImage.UnlockBits(data);
  }
  return colorData;
}

This is rather straightforward, with just two things to note. First, note that there are a few ways to iterate over the pixels of a bitmap. Three that I can think of:

  1. The simplest but by far the most inefficient is to simply create a for loop for x and y and call GetPixel(x, y), which returns the color of that pixel.
  2. Use LockBits and UnlockBits (as seen above) but use pointers to iterate over the locked memory of the image. This is much faster than technique number one but requires you to compile the code as unsafe.
  3. Use LockBits and UnlockBits but use Marshal.Copy to copy the memory into a buffer, which is what the code above does.

Second, note that we drop the three rightmost bits of each color dimension. This reduces the granularity of the color maps produced in the next step, making it much faster. It allows us to create our clusters using index values from 1 to 32 instead of 256 distinct values per channel (for example, an alpha of 200 falls into bucket (200 >> 3) + 1 = 26).

Creating the Color Arrays

These are the data structures that all of the upcoming analysis is performed upon. If this step is off, everything else will be off. This is one of the steps that really led me down a rat hole; for the longest time I thought it was correct and was looking for bugs elsewhere.

private static ColorData CalculateMoments(ColorData data)
{
  for (var alphaIndex = 1; alphaIndex <= MaxSideIndex; ++alphaIndex)
  {
    var xarea = new long[SideSize, SideSize, SideSize];
    var xareaAlpha = new long[SideSize, SideSize, SideSize];
    var xareaRed = new long[SideSize, SideSize, SideSize];
    var xareaGreen = new long[SideSize, SideSize, SideSize];
    var xareaBlue = new long[SideSize, SideSize, SideSize];
    var xarea2 = new float[SideSize, SideSize, SideSize];
    for (var redIndex = 1; redIndex <= MaxSideIndex; ++redIndex)
    {
      var area = new long[SideSize];
      var areaAlpha = new long[SideSize];
      var areaRed = new long[SideSize];
      var areaGreen = new long[SideSize];
      var areaBlue = new long[SideSize];
      var area2 = new float[SideSize];
      for (var greenIndex = 1; greenIndex <= MaxSideIndex; ++greenIndex)
      {
        long line = 0;
        long lineAlpha = 0;
        long lineRed = 0;
        long lineGreen = 0;
        long lineBlue = 0;
        var line2 = 0.0f;
        for (var blueIndex = 1; blueIndex <= MaxSideIndex; ++blueIndex)
        {
          line += data.Weights[alphaIndex, redIndex, greenIndex, blueIndex];
          lineAlpha += data.MomentsAlpha[alphaIndex, redIndex, greenIndex, blueIndex];
          lineRed += data.MomentsRed[alphaIndex, redIndex, greenIndex, blueIndex];
          lineGreen += data.MomentsGreen[alphaIndex, redIndex, greenIndex, blueIndex];
          lineBlue += data.MomentsBlue[alphaIndex, redIndex, greenIndex, blueIndex];
          line2 += data.Moments[alphaIndex, redIndex, greenIndex, blueIndex];

          area[blueIndex] += line;
          areaAlpha[blueIndex] += lineAlpha;
          areaRed[blueIndex] += lineRed;
          areaGreen[blueIndex] += lineGreen;
          areaBlue[blueIndex] += lineBlue;
          area2[blueIndex] += line2;

          xarea[redIndex, greenIndex, blueIndex] = xarea[redIndex - 1, greenIndex, blueIndex] + area[blueIndex];
          xareaAlpha[redIndex, greenIndex, blueIndex] = xareaAlpha[redIndex - 1, greenIndex, blueIndex] + areaAlpha[blueIndex];
          xareaRed[redIndex, greenIndex, blueIndex] = xareaRed[redIndex - 1, greenIndex, blueIndex] + areaRed[blueIndex];
          xareaGreen[redIndex, greenIndex, blueIndex] = xareaGreen[redIndex - 1, greenIndex, blueIndex] + areaGreen[blueIndex];
          xareaBlue[redIndex, greenIndex, blueIndex] = xareaBlue[redIndex - 1, greenIndex, blueIndex] + areaBlue[blueIndex];
          xarea2[redIndex, greenIndex, blueIndex] = xarea2[redIndex - 1, greenIndex, blueIndex] + area2[blueIndex];

          data.Weights[alphaIndex, redIndex, greenIndex, blueIndex] = data.Weights[alphaIndex - 1, redIndex, greenIndex, blueIndex] + xarea[redIndex, greenIndex, blueIndex];
          data.MomentsAlpha[alphaIndex, redIndex, greenIndex, blueIndex] = data.MomentsAlpha[alphaIndex - 1, redIndex, greenIndex, blueIndex] + xareaAlpha[redIndex, greenIndex, blueIndex];
          data.MomentsRed[alphaIndex, redIndex, greenIndex, blueIndex] = data.MomentsRed[alphaIndex - 1, redIndex, greenIndex, blueIndex] + xareaRed[redIndex, greenIndex, blueIndex];
          data.MomentsGreen[alphaIndex, redIndex, greenIndex, blueIndex] = data.MomentsGreen[alphaIndex - 1, redIndex, greenIndex, blueIndex] + xareaGreen[redIndex, greenIndex, blueIndex];
          data.MomentsBlue[alphaIndex, redIndex, greenIndex, blueIndex] = data.MomentsBlue[alphaIndex - 1, redIndex, greenIndex, blueIndex] + xareaBlue[redIndex, greenIndex, blueIndex];
          data.Moments[alphaIndex, redIndex, greenIndex, blueIndex] = data.Moments[alphaIndex - 1, redIndex, greenIndex, blueIndex] + xarea2[redIndex, greenIndex, blueIndex];
        }
      }
    }
  }
  return data;
}

In the end you should have a set of multidimensional arrays that cumulatively increase both vertically down each “line” (the most finely grained dimension, blue here) and across to the right from A to R to G to B. The very last value in the array should be the total sum of the original array.

Arranging the data in this way allows you to make very efficient measurements on regions of the color space with just a few calculations instead of having to iterate over every point.
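
This is the same trick a classic summed-area table uses, just extended to four dimensions. As a simpler two dimensional illustration (not code from the sample), once moment[x, y] holds the cumulative sum of everything at or below index x and y, the total inside any rectangle comes from four lookups instead of a nested loop:

// 2-D illustration: sum over the rectangle (x0, x1] x (y0, y1]
// using a cumulative moment array.
static long RectangleSum(long[,] moment, int x0, int y0, int x1, int y1)
{
    return moment[x1, y1]
         - moment[x0, y1]   // strip everything at or left of x0
         - moment[x1, y0]   // strip everything at or below y0
         + moment[x0, y0];  // add back the corner that was stripped twice
}

The Volume function further down does exactly this in four dimensions, which is why the signs alternate the way they do.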

Clustering the Data

There is a fair amount of code involved here so I will not include all of it. Here is the entry point into this step.

private IEnumerable<Box> SplitData(ref int colorCount, ColorData data)
{
  --colorCount;
  var next = 0;
  var volumeVariance = new float[MaxColor];
  var cubes = new Box[MaxColor];
  cubes[0].AlphaMaximum = MaxSideIndex;
  cubes[0].RedMaximum = MaxSideIndex;
  cubes[0].GreenMaximum = MaxSideIndex;
  cubes[0].BlueMaximum = MaxSideIndex;
  for (var cubeIndex = 1; cubeIndex < colorCount; ++cubeIndex)
  {
    if (Cut(data, ref cubes[next], ref cubes[cubeIndex]))
    {
      volumeVariance[next] = cubes[next].Size > 1 ? CalculateVariance(data, cubes[next]) : 0.0f;
      volumeVariance[cubeIndex] = cubes[cubeIndex].Size > 1 ? CalculateVariance(data, cubes[cubeIndex]) : 0.0f;
    }
    else
    {
      volumeVariance[next] = 0.0f;
      cubeIndex--;
    }

    next = 0;
    var temp = volumeVariance[0];

    for (var index = 1; index <= cubeIndex; ++index)
    {
      if (volumeVariance[index] <= temp) continue;
      temp = volumeVariance[index];
      next = index;
    }

    if (temp > 0.0) continue;
    colorCount = cubeIndex + 1;
    break;
  }
  return cubes.Take(colorCount).ToList();
}

To dive deeper into what is going on, follow the Cut method. Note that this is looking to build 256 clusters of the smallest possible variance. One interesting piece of code a bit deeper down that I would like to point out is the calculation that obtains the volume of each cube. It demonstrates how the cumulative array allows you to make this calculation without iterating over each element in the array:

private static long Volume(Box cube, long[,,,] moment)
{
  return (moment[cube.AlphaMaximum, cube.RedMaximum, cube.GreenMaximum, cube.BlueMaximum] -
    moment[cube.AlphaMaximum, cube.RedMaximum, cube.GreenMinimum, cube.BlueMaximum] -
    moment[cube.AlphaMaximum, cube.RedMinimum, cube.GreenMaximum, cube.BlueMaximum] +
    moment[cube.AlphaMaximum, cube.RedMinimum, cube.GreenMinimum, cube.BlueMaximum] -
    moment[cube.AlphaMinimum, cube.RedMaximum, cube.GreenMaximum, cube.BlueMaximum] +
    moment[cube.AlphaMinimum, cube.RedMaximum, cube.GreenMinimum, cube.BlueMaximum] +
    moment[cube.AlphaMinimum, cube.RedMinimum, cube.GreenMaximum, cube.BlueMaximum] -
    moment[cube.AlphaMinimum, cube.RedMinimum, cube.GreenMinimum, cube.BlueMaximum]) -

   (moment[cube.AlphaMaximum, cube.RedMaximum, cube.GreenMaximum, cube.BlueMinimum] -
    moment[cube.AlphaMinimum, cube.RedMaximum, cube.GreenMaximum, cube.BlueMinimum] -
    moment[cube.AlphaMaximum, cube.RedMaximum, cube.GreenMinimum, cube.BlueMinimum] +
    moment[cube.AlphaMinimum, cube.RedMaximum, cube.GreenMinimum, cube.BlueMinimum] -
    moment[cube.AlphaMaximum, cube.RedMinimum, cube.GreenMaximum, cube.BlueMinimum] +
    moment[cube.AlphaMinimum, cube.RedMinimum, cube.GreenMaximum, cube.BlueMinimum] +
    moment[cube.AlphaMaximum, cube.RedMinimum, cube.GreenMinimum, cube.BlueMinimum] -
    moment[cube.AlphaMinimum, cube.RedMinimum, cube.GreenMinimum, cube.BlueMinimum]);
}

Just 15 simple arithmetic operations instead of 32768. This is another formula that I agonized over for days. The trick is figuring out where to put the pluses and minuses. The original RGB method is:

private static Int64 Volume(WuColorCube cube, Int64[, ,] moment)
{
  return moment[cube.RedMaximum, cube.GreenMaximum, cube.BlueMaximum] -
    moment[cube.RedMaximum, cube.GreenMaximum, cube.BlueMinimum] -
    moment[cube.RedMaximum, cube.GreenMinimum, cube.BlueMaximum] +
    moment[cube.RedMaximum, cube.GreenMinimum, cube.BlueMinimum] -
    moment[cube.RedMinimum, cube.GreenMaximum, cube.BlueMaximum] +
    moment[cube.RedMinimum, cube.GreenMaximum, cube.BlueMinimum] +
    moment[cube.RedMinimum, cube.GreenMinimum, cube.BlueMaximum] -
    moment[cube.RedMinimum, cube.GreenMinimum, cube.BlueMinimum];
}

The similarity is obvious. The key is building the arrays correctly. With incorrect data, this formula will never come out right. I also did not expect the addition of alpha to require parentheses.

Generating the Final Palette

This builds the final data structure we need to create our quantized image:

protected override QuantizedPalette GetQuantizedPalette(int colorCount, ColorData data, IEnumerable<Box> cubes, int alphaThreshold)
{
  int imageSize = data.PixelsCount;
  LookupData lookups = BuildLookups(cubes, data);

  IList<int> quantizedPixels = data.QuantizedPixels;
  for (var index = 0; index < imageSize; ++index)
  {
    var indexParts = BitConverter.GetBytes(quantizedPixels[index]);
    quantizedPixels[index] = lookups.Tags[indexParts[Alpha], indexParts[Red], indexParts[Green], indexParts[Blue]];
  }

  var alphas = new int[colorCount + 1];
  var reds = new int[colorCount + 1];
  var greens = new int[colorCount + 1];
  var blues = new int[colorCount + 1];
  var sums = new int[colorCount + 1];
  var palette = new QuantizedPalette(imageSize);

  IList<Pixel> pixels = data.Pixels;
  int pixelsCount = data.PixelsCount;
  IList<Lookup> lookupsList = lookups.Lookups;
  int lookupsCount = lookupsList.Count;

  Dictionary<int, int> cachedMaches = new Dictionary<int, int>();

  for (int pixelIndex = 0; pixelIndex < pixelsCount; pixelIndex++)
  {
    Pixel pixel = pixels[pixelIndex];
    palette.PixelIndex[pixelIndex] = -1;
    if (pixel.Alpha <= alphaThreshold)
        continue;

    int bestMatch;
    int argb = pixel.Argb;

    if (!cachedMaches.TryGetValue(argb, out bestMatch))
    {
      int match = quantizedPixels[pixelIndex];
      bestMatch = match;
      int bestDistance = int.MaxValue;

      for (int lookupIndex = 0; lookupIndex < lookupsCount; lookupIndex++)
      {
        Lookup lookup = lookupsList[lookupIndex];
        var deltaAlpha = pixel.Alpha - lookup.Alpha;
        var deltaRed = pixel.Red - lookup.Red;
        var deltaGreen = pixel.Green - lookup.Green;
        var deltaBlue = pixel.Blue - lookup.Blue;

        int distance = deltaAlpha*deltaAlpha + deltaRed*deltaRed + deltaGreen*deltaGreen + deltaBlue*deltaBlue;

        if (distance >= bestDistance)
            continue;

        bestDistance = distance;
        bestMatch = lookupIndex;
      }

      cachedMaches[argb] = bestMatch;
    }

    alphas[bestMatch] += pixel.Alpha;
    reds[bestMatch] += pixel.Red;
    greens[bestMatch] += pixel.Green;
    blues[bestMatch] += pixel.Blue;
    sums[bestMatch]++;

    palette.PixelIndex[pixelIndex] = bestMatch;
  }

  for (var paletteIndex = 0; paletteIndex < colorCount; paletteIndex++)
  {
    if (sums[paletteIndex] > 0)
    {
      alphas[paletteIndex] /= sums[paletteIndex];
      reds[paletteIndex] /= sums[paletteIndex];
      greens[paletteIndex] /= sums[paletteIndex];
      blues[paletteIndex] /= sums[paletteIndex];
    }

    var color = Color.FromArgb(alphas[paletteIndex], reds[paletteIndex], greens[paletteIndex], blues[paletteIndex]);
    palette.Colors.Add(color);
  }

  palette.Colors.Add(Color.FromArgb(0, 0, 0, 0));

  return palette;
}

 

Building the Image

private static Bitmap ProcessImagePixels(Image sourceImage, QuantizedPalette palette)
{
  var result = new Bitmap(sourceImage.Width, sourceImage.Height, PixelFormat.Format8bppIndexed);
  var newPalette = result.Palette;
  for (var index = 0; index < palette.Colors.Count; index++)
    newPalette.Entries[index] = palette.Colors[index];
  result.Palette = newPalette;

  BitmapData targetData = null;
  try
  {
    targetData = result.LockBits(Rectangle.FromLTRB(0, 0, result.Width, result.Height), ImageLockMode.WriteOnly, result.PixelFormat);
    const byte targetBitDepth = 8;
    var targetByteLength = targetData.Stride < 0 ? -targetData.Stride : targetData.Stride;
    var targetByteCount = Math.Max(1, targetBitDepth >> 3);
    var targetSize = targetByteLength * result.Height;
    var targetOffset = 0;
    var targetBuffer = new byte[targetSize];
    var targetValue = new byte[targetByteCount];
    var pixelIndex = 0;

    for (var y = 0; y < result.Height; y++)
    {
      var targetIndex = 0;
      for (var x = 0; x < result.Width; x++)
      {
        var targetIndexOffset = targetIndex >> 3;
        targetValue[0] = (byte)(palette.PixelIndex[pixelIndex] == -1 ? palette.Colors.Count - 1 : palette.PixelIndex[pixelIndex]);
        pixelIndex++;

        for (var valueIndex = 0; valueIndex < targetByteCount; valueIndex++)
          targetBuffer[targetOffset + valueIndex + targetIndexOffset] = targetValue[valueIndex];

        targetIndex += targetBitDepth;
      }

      targetOffset += targetByteLength;
    }

    Marshal.Copy(targetBuffer, 0, targetData.Scan0, targetSize);
  }
  finally
  {
    if(targetData != null)
      result.UnlockBits(targetData);
  }

  return result;
}


Here we simply use our generated palette and fill in the pixels.

If you would like to download a full copy of this code along with a .EXE wrapper to test the quality of the quantizations, you can download it from the MSDN Code Samples Gallery.