Computing the Mandelbrot Set

The Mandelbrot set is a fractal that most people have seen an image of at some point or another. For my first post on this site, I thought generating images of the Mandelbrot set would be a good place to start; the way the pixel values are computed also makes the computation a good candidate for parallelism. In this post, however, we will just talk about the single-threaded version and save a discussion of profiling for another day. So, we should first describe what the Mandelbrot set is and how we go about calculating images of it.

To define the Mandelbrot set, we need to think about sequences of complex numbers defined by \(z_{n+1}=z_n^2+c\) with \(z_0=c\) for some \(c\in\mathbb{C}\). This type of sequence has a rather important property: if for any \(n\in\mathbb{N}\) it happens to be the case that \(|z_n|^2>4\), the sequence necessarily diverges to infinity. Unfortunately, this is a bit of a pain to prove, so we are going to take this fact for granted; maybe in some future post I'll write up the proof.

So, what this bound tells us is that every initial value with \(|c|>2\) is guaranteed to diverge; any point that doesn't diverge to infinity must lie within the circle of radius 2 in the complex plane. However, even if we start inside this circle, after enough iterations we could still end up with a \(z_n\) outside the circle, in which case we know the rest of the sequence will diverge to infinity.

For every point \(c\) in the circle, we can iterate until we hit a value \(z_n\) which is outside the circle, then associate the index \(n\) with that point. We will call this integer the exit time of the point \(c\). The points for which the exit time is unbounded are the points of the Mandelbrot set. In practice, though, we will have to establish a maximum number of iterations, say \(N\), and if we hit the maximum number of iterations, we call that "good enough" to say the point is in the Mandelbrot set. Once we have this integer computed at all the points of interest, we color the image by treating the integers as intensity values.

Dealing with Memory

I wrote the program to compute the exit times in C++ in Visual Studio. The C++ features will be useful for the rendering portion of the program, but the computation of the exit times is essentially written using only C functionality.

To start with, we need to allocate the array which will hold the exit times. If we are looking to produce an image \(X\) pixels wide and \(Y\) pixels high, we will need to hold \(XY\) exit times. This immediately means the array of exit times will be rather large. Even for a fairly standard full-screen resolution of \(1920\times1080\), we are already looking at an array of over 2 million elements. If we wanted to produce a very large 8K resolution image which has dimensions \(15360\times8640\), there will be over 132 million elements in the array.

The size of this array is not something we can just sweep under the rug, and it will change how we have to do things. For starters, operating systems typically allocate a maximum of only about a megabyte to a program's stack memory. Since a megabyte is about a million bytes, even if we store the exit times as single bytes, limiting \(N\) to be less than 256, generating a standard-resolution image would give us a stack overflow error as we attempt to allocate more memory on the stack than the operating system actually makes available.

On some Linux distributions, I believe it is possible to manually raise the limit on the stack size, but I don't think this is universal, and it certainly isn't true on Windows, which is what I'm developing on. Instead, we need to allocate the array in heap memory, which is limited only by available system resources.

Now, the commonly accepted knowledge is that stack memory is much faster than heap memory. But when we say this, we have to be careful about why and how the statement is true. As it turns out, in our case the situation is not nearly as dire as it might appear. The largest difference in speed between stack and heap memory is allocation time, though there are some small differences in access time we will get to in a moment. Essentially, since the stack is preallocated explicitly for the program's use, and the execution order of the program implicitly determines the next free byte, allocation really only involves bumping the address of the next free byte by the size of the array.

If we look at the same process for heap memory, the program doesn't know where the next free byte is. Instead, the program needs to query the operating system for an address of a block of memory of the correct size. The operating system then needs to kick in and find a section of memory of the correct size which won't interfere with any other running programs. If the operating system is any good, it will also try to place the block so as to minimize the amount of space left between allocated blocks that is too small to likely be requested by an application. Since the operating system has a lot more to worry about, we can expect it to take a fair while to get back to our program with an address for the requested memory.

If we are allocating and freeing many blocks of memory quite frequently, the overhead of using heap memory adds up with every call for more memory, particularly when compared to stack memory which takes comparatively zero time to allocate.

However, we only need to allocate this array once and never again. So the overhead of an allocation call is fairly low compared with the remainder of the program, and using the heap to store our exit-time array won't affect performance in a significant way. We could still worry that heap memory has slower access times than stack memory, though, since the address of the array may not be near the addresses in the stack.

Whether in stack or heap memory, the data needs to make its way into a CPU register before any computation can be done on it. This means the address of the memory in question is read, and the data at that address is moved into a register for whatever computation follows. If there were only registers and RAM, there would be little to no difference between the access times of stack and heap memory. However, between the RAM and the CPU sit several cache memories. The (typically three) levels are used for slightly different purposes, but the important thing for us is that a reference to memory stored in a cache can be one to two orders of magnitude faster than a reference to RAM.

Since it is so much faster to reference cache memory, it would behoove us to read and write through the cache as much as possible. When the data we need is not already in the cache, known as a cache miss, it has to be copied from RAM into the cache first. If our program causes a large number of cache misses, it is liable to run significantly slower than it otherwise could.

Unfortunately, we cannot directly control what memory is loaded into a cache and when, but we can arrange our program in such a way that we make the best possible use of the cache. See, when there is a cache miss and some data is requested to be copied from RAM to the cache, that data alone is not copied. Instead, what is known as a cache line is copied, which includes the requested data but also the bytes following it. Typically a cache line is 64 bytes long.

So, if we are able to structure our program in such a way that memory access is as linear as possible, we won't need to frequently backtrack in memory and reload a cache line we have already loaded once before.

Since our exit time array is a contiguous block of memory in RAM, as long as we make sure to only access these points sequentially, we will avoid needing to reload cache lines we’ve already loaded once before. This will make the heap memory references about as fast as stack memory references. In fact, the only way a heap reference will still be slower is if loading a cache line from the stack segment of memory overwrites a cache line from the heap memory which we haven’t finished using yet. But, since we cannot directly control the cache memory, there’s little we can do about this situation.

Now that we know we need to be careful to make the array of exit times a contiguous block of memory, and to access this block in a sequential manner, it makes sense to allocate the array by writing

//The C++ way
unsigned short* arr = new unsigned short[X*Y];
//The C way
unsigned short* arr = (unsigned short*)malloc(X*Y*sizeof(unsigned short));

Either of these will allocate a block of \(XY\) unsigned shorts contiguously in memory. Note that we do not allocate an array of \(X\) pointers which each point to an array of \(Y\) unsigned shorts. If we were to allocate memory that way, the rows would be unlikely to end up contiguous, which would then increase the number of cache misses we have to deal with.

Since we want to be careful about the access order, it makes sense for us to handle the two-dimensional indexing manually and be sure we are accessing elements in the right order, rather than using a simpler notation and possibly accessing the array out of order. This means accessing the element for pixel \((x,y)\) by writing arr[y*X+x]. Even though there is more explicit computation here than in something like arr[x][y], a true two-dimensional array compiles to the same thing, since this is the address computation the computer must do to walk through the array at the end of the day anyway.
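If writing the index arithmetic inline feels error-prone, it can be wrapped in a tiny helper (this function is my own addition, not part of the original program); an inline function like this compiles down to the same address computation:

```cpp
#include <cstddef>

//Hypothetical helper, not in the original program: maps pixel (x, y)
//to its offset in the flat exit-time array of width X.
inline std::size_t idx(std::size_t x, std::size_t y, std::size_t X) {
    return y * X + x;
}
```

With this, arr[y*X+x] becomes arr[idx(x, y, X)], making the intended access order explicit at each call site.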

Computing the Exit Times

To compute the exit time for each pixel, we first need a way of mapping the pixel indices to points in the complex plane and vice versa. We have already specified the dimensions of the pixel array by the constants \(X\) and \(Y\), so we now need to define a rectangle in the complex plane (hopefully we choose this rectangle to have the same aspect ratio as our image, or there will be distortion in the final image). There are several different ways we could go about defining the rectangle, but I decided to define it by the location of the bottom-left corner, \(z_c=R+iI\), along with the width and height of the box, \(W\) and \(H\), respectively.

This means that the width and height of a single pixel in the complex plane must be $$dw=W/X,\ \ \ \ dh=H/Y.$$ Then we can approximate the location of the pixel \((x,y)\) as \((R+x\cdot dw,\ I+y\cdot dh)\). The computation may then be written

double re = R;
double im = I;
unsigned int x;
unsigned int y;
for(y = 0; y < Y; y++){
    //Reset the horizontal to the left.
    re = R;
    for(x = 0; x < X; x++){
        //The function iterate computes the exit time for the given point.
        arr[y * X + x]=iterate(re, im);
        //Increment the horizontal position by dw to the right.
        re += dw;
    }
    //Increment the vertical position up by dh.
    im += dh;
}

This algorithm computes the pixel for the bottom-left corner first, then works its way through the rest of the image row by row, upwards. Notice that we were careful to make the iteration over \(y\) the outer loop, so that as we make our way through the array, we never skip an element and all our accesses are sequential.

The function iterate, which computes the exit time for a given point, is implemented in terms of purely real-valued computations so we avoid needing to rely on an implementation of complex-valued arithmetic.

unsigned short iterate(double wre, double wim) {
    double re = 0.0;
    double im = 0.0;
    double tmpre = 0.0;
    double tmpim = 0.0;

    unsigned short i = 0;
    while (i < N) {
        //Compute z^2+(wre,wim).
        re = (tmpre * tmpre - tmpim * tmpim) + wre;
        im = (2.0 * tmpre * tmpim) + wim;

        //Check exit condition.
        if (4.0 < re * re + im * im) {
            return i;
        }

        //Otherwise, swap and increment.
        tmpre = re;
        tmpim = im;
        i++;
    }

    //Never exited.
    return 0;
}

Note that if the maximum number of iterations is reached, the algorithm returns zero instead of \(N\). This is an aesthetic choice that has to do with the coloring of the final image: setting these points to zero will color them black, making the final image just the outline of the Mandelbrot set. I personally think this looks better than it being filled in.

Rendering the Image

Once we have the array of exit times computed, we need to somehow map it into an image format. This includes mapping the integers to colors.

For the map from integers to colors, I decided to keep it simple and go for a linear mapping from the exit time to the (R,G,B) color space. In this way, the exit time will act like an intensity.

If we suppose that \(max\) is the maximum value in the exit time array, and \((R,G,B)\) is the vector in RGB space that we want as our maximum-intensity color, then we can compute the color associated with the exit time \(t\) by

unsigned short ratio = (255 * t) / max;
//Use a wider type here: R*R + G*G + B*B can exceed the range of an unsigned short.
unsigned int norm = R * R + G * G + B * B;
norm = (unsigned int)sqrtf((float)norm);
R_out = (ratio * R) / norm;
G_out = (ratio * G) / norm;
B_out = (ratio * B) / norm;

With this, we can compute the color of every pixel in the final image, but we still need some way to encode these color values into an image file and save it to disk. Since we are dealing with high-resolution images, it makes sense to save them as PNG files so the compression is lossless.

I decided to use the GDI+ library to do this because it is available in Visual Studio, free, doesn't come with finicky licensing issues like GNU software, and seemed like it would have a smaller learning curve than OpenGL.

Since this post has already gone on for quite a while, though, I think I will save my discussion of how to use GDI+ for a future post. For now, here's the finished product in 4K resolution with \(N=300\), \((R,I)=(-1.6,-0.3)\), and \((W,H)=(1.0667,0.6)\).

The Mandelbrot Set