poita.org

Simplest Image Format

Posted: 29 Jan 2013 - Link

Just a quick post to share something I discovered a few days ago: the simplest image format ever. I’m actually quite surprised it took me so long to find it, which is why I’m sharing now.

The format I’m talking about is the Portable Pixmap Format (.ppm). It’s so simple to write that it’s easier to describe using code:

// --------------------------------------------------------------------
//     file - the file to write to
//    width - width of the image in pixels
//   height - height of the image in pixels
// rgb_data - the image data, 3 bytes per pixel (R, G, B)
// --------------------------------------------------------------------
void write_ppm(FILE* file, int width, int height, void* rgb_data)
{
    fprintf(file, "P6 %d %d 255\n", width, height);
    fwrite(rgb_data, 1, 3 * width * height, file);
}

That’s it: a small header followed by the raw image data.

It’s incredibly useful to have a simple file format like this when you want to write out some images at runtime and don’t have any other image writing libraries available (or can’t be bothered hooking them up).


D Tip - Beware Narrow Strings

Posted: 22 Dec 2012 - Link

In The D Programming Language, strings are arrays. They are literally aliases to arrays of immutable characters of various width, defined in druntime.

alias immutable(char)[]  string;
alias immutable(wchar)[] wstring;
alias immutable(dchar)[] dstring;

All string types in D express Unicode strings using different encodings. string uses UTF-8, wstring uses UTF-16, and dstring uses UTF-32. With the exception of UTF-32, these encodings are variable length encodings. A single “character” may be represented by a variable number of array elements. These string types are called “narrow strings”.

Variable length encoding is very space efficient, but at the cost of indexing. With string and wstring there is no way to retrieve the n’th character in O(1) time. When you use array indexing on narrow strings, you are actually indexing into the code units. For example, the string “こんにちは世界” has 7 code points (characters), but 21 code units. .length will report 21, and the element at index 0 is the code unit 227, which is not こ!

string s = "こんにちは世界";
writeln(s[0] == 'こ'); // false
writeln(s.length); // 21

As you can see, this behaviour isn’t much use when you want to work with the actual characters. The Phobos developers are aware of this, which is why, when you treat strings as ranges, they do what you would expect.

string s = "こんにちは世界";
writeln(s.front == 'こ'); // true
writeln(s.walkLength); // 7

This puts D programmers in a slightly unusual position. Narrow strings are both arrays of code units, and ranges of code points, depending on how you use them. When writing generic code, you need to be aware of this because it has some quite unintuitive consequences:

T[] is not always a range of T

For example, string (immutable(char)[]) is a range of dchar, not char. typeof("abc".front) is dchar. If you want to store the result of .front then you can use ElementType!R (or just use auto).

hasSlicing!T[] is not always true

The hasSlicing!R trait from std.range is true when R is sliceable. For strings, it returns false because they are only sliceable as arrays of code units.

hasLength!T[] is not always true

Similarly to above, hasLength!R is true when you can get the length of R in O(1) time. For strings, it is false because you can only the get the number of code units in O(1) time, not code points. walkLength on narrow strings runs in O(n) time.

With a T[], you can’t call .popFront() .length times

.length returns the number of code units, but .popFront() pops off a code point, which may be more than one code unit.

In short, try to avoid using arrays directly in generic code. Write your code to use arbitrary ranges, and add the necessary template constraints when you want to use array features:

If you follow those rules, your generic code should handle narrow strings just fine.


Using D Without The GC

Posted: 15 Dec 2012 - Link

One of the major selling points of the D Programming Language is its “native efficiency”. I use the scare quotes because while D compilers do compile down to native machine code, your program will only run efficiently if you code with performance in mind!

In D, the easiest way to degrade performance (besides poor algorithms) is overuse of automatic memory managment. No matter what language you use, memory allocation is non-trivial, and has a non-trivial cost to go with it. With D’s automatic memory management, you not only incur the cost of allocation, but also the cost of automatic deallocation, i.e. garbage collection. If you want high throughput and low latency then you really need to avoid memory allocation as much as possible.

The most important thing to understand about the GC is that it can only run when you try to allocate. Unfortunately D doesn’t make it obvious when you are allocating memory, so I’ve listed the most common ways here.

Things That Allocate in D

Before I start, I should make it clear that these are things that could allocate in D. A good optimiser may be able to remove some of the allocations. When in doubt, test.

new

Any time you use the new keyword, you are allocating memory. Avoid creating class instances frequently, and use structs where possible.

Array literals

In most cases, array literals will allocate.

immutable int[3] a = [1, 2, 3];  // Won't allocate
immutable int[] b = [1, 2, 3];   // Will allocate once
int[] c = [1, 2, 3];             // Will allocate
int[3] d = [1, 2, 3];            // Will allocate
foo([1, 2, 3]);                  // Will allocate

The fact that the initialisation of d allocates may be surprising. It doesn’t need to, but DMD currently doesn’t do the optimisation to remove the allocation. In the meantime, you can do this:

immutable int[3] a = [1, 2, 3];  // Won't allocate
int[3] b = a;                    // Won't allocate

Calling dup or idup on an array will also lead to an allocation.

Array concatenation

Array concatenation won’t always result in an allocation, as arrays are sometimes over-allocated, but if you want to avoid allocations then it is best to avoid array concatenation. Modifying the .length property of an array could also cause allocation.

Associative arrays

Creating, adding, or removing elements from an associative array could cause an allocation. Lookups are safe.

Closures

A closure is a function reference that needs extra context beyond the stack. An example should illustrate this:

int delegate() foo(int x) { return () => x; }
int delegate() bar(int x) { return () => 1; }

Here, foo returns a closure because the returned function relies on the value of x, which will no longer be on the stack after foo returns. bar, on the other hand, does not return a closure, because the function () => 1 has no dependency on x.

In D, creating a closure allocates memory, i.e. every time you call foo you will allocate memory.

Detecting Allocations

The easiest way to detect unexpected allocations is to add some logging to the D runtime. Clone that project, and open up src/gc/gc.d. In there, there’s some functions with names like gc_malloc, gc_calloc, etc. Just add some calls to printf there, and build.

Style

When you try to avoid using the GC, you might find that your usual style of programming doesn’t work very well. If you are used to just building up arrays using concatenation, joining strings together, and creating lots of temporary objects then you’re going to have a hard time.

In general, to avoid these dynamic allocations, you need to make your program more static. Static arrays in particular are your friend. You’ll have to build hard limits into your code, and just “allocate” within those limits.

Using strings can be quite tricky when you want to avoid allocations. string in D is an alias for immutable(char)[], which is fine when you know what your string needs to contain beforehand, but if you want to create a string based on runtime data then you’ll need to build it up inside a char[N], which has no legal conversion to immutable(char)[].

To get around the string problem, you can use assumeUnique. This is a simple function that does nothing more than cast a mutable array to an immutable array. The caveat is that the cast is only safe if the reference is truly unique. Once you have built up your string in the mutable array, and casted using assumeUnique, you can’t safely use the mutable array again. This shouldn’t be a problem, but it’s something to be aware of.

char[16] buf;
snprintf(buf.ptr, buf.length, "1 + 1 = %d", 1 + 1);
string str = assumeUnique(buf[]);

Conclusion

While D was designed with GC use in mind, the language also provides all the necessary abstraction facilities to build usable “GC free” libraries (templates, aliases, delegates, static arrays). It can take a while to get used to not using the GC, but I certainly don’t feel hindered by its absence.


D Tip - Testing with TypeTuple

Posted: 18 Nov 2012 - Link

When testing a function, it’s usually a good idea to test it with a range of different inputs. To do this, you can easily use a for loop over an array of input values, but what if your input is a type, as it often is with template code?

The D Programming Language allows you to iterate over a TypeTuple, so all you need to do is declare a tuple of all the types you want to test, and iterate over them in the normal way:

import std.typetuple;
alias TypeTuple!(int, long, double) Types;
foreach (T; Types)
    test!T();

You might wonder what this compiles to. After all, the body of the loop varies with T, so the generated code must also vary on each iteration. How does the compiler handle this?

The answer is that the loop is completely unrolled. The code above is literally the same as:

test!int();
test!long();
test!double();

For this reason, you might want to keep an eye on the size of your TypeTuples, to avoid code bloat.


Pythagorean Polygons - Part 2

Posted: 14 Sep 2012 - Link

In my last post I showed how to solve the Project Euler problem, Pythagorean Polygons for N = 60 in under a minute. In this post I’ll show how to solve it for the target N = 120, and in under a second.

Our current algorithm uses a brute force depth first search, memoized using the generic Phobos memoize template. One common way to speed up searches is to prune branches that have no possibility of reaching the goal. In this problem, we can prune a search when the distance to the goal is greater than the remaining perimeter:

long dfs(int x, int y, int d, int i)
{
    import std.functional;
    alias memoize!dfs mdfs;

    if (x == 0 && y == 0 && d > 0) return 1;

    // Recurse
    long result = 0;
    foreach (int k; i..cast(int)vs.length)
        if (!(d == vs[k].d && x == -vs[k].x && y == -vs[k].y))
        {
            int nx = x + vs[k].x;  // new X
            int ny = y + vs[k].y;  // new Y
            int nd = d + vs[k].d;  // new used perimeter
            int rd = N - nd;       // remaining perimeter
            if (nx * nx + ny * ny <= rd * rd && rd >= 0)
                result += mdfs(nx, ny, nd, next(k));
        }
    return result;
}

This alone speed up the N = 60 solution to under 2 seconds, and gives us the solution to N = 120 in just over 2 minutes.

We can do better.

In the N = 120 case, the memoization cache uses up roughly 200MB of memory. This is much larger than my L2 cache, and since hash maps distribute elements randomly, we’ll be spending a lot of time going out to memory checking the memoization cache. Reducing memory usage and improving lookup patterns should help performance significantly.

One observation we can make is that dfs(..., i) always calls dfs(..., next(i)). If we change the algorithm from a top-down memoization algorithm to a bottom-up dynamic programming algorithm, then we could do the depth-first search backwards, in layers of different values of i. We start by computing dfs(..., vs.length-1) for all x, y, and d. That gives us everything we need to compute all of dfs(..., prev(vs.length-1)).

prev is just the opposite function to next – it returns the index j, previous to i such that vs[i] is not colinear with vs[j].

int prev(int i)
{
    // Find next, non-colinear edge
    int j = i - 1;
    while (j >= 0 && colinear(vs[i], vs[j]))
        --j;
    return j;
}

Since we aren’t going to be caching all of dfs anymore, we’ll get rid of the memoization and instead just use a big array as our cache. We’ll also need two of them: one for the previously computed layer of i, and one for the next layer that is being computed.

alias long[N+1][N+1][N+1] DP; // cache
DP[] dp = new DP[2];
dp[0][M][M][0] = 1; // start point
int r = 0, w = 1; // read index, write index

Here, dp[r] stores the previously computed layer. dp[r][M+x][M+y][d] represents the number of paths of length d from (0, 0) to (x, y) using the only the triples from the previous layers. Using dp[r] we compute dp[w] by using the next layer down from what dp[r] used. To do this, we start by copying all of dp[r] into dp[w] and then simply iterate over all x, y, and d, as well as all colinear edges in the current layer.

The algorithm looks something like this:

for (int i = cast(int)vs.length - 1; i >= 0; )
{
    dp[w] = dp[r];    // copy previous buffer
    auto j = prev(i); // colinear edges from j..i

    foreach (x; -M..M+1)
    foreach (y; -M..M+1)
    foreach (d; 0..N+1)
        if (dp[r][M+x][M+y][d] > 0)
            for (int k = i; k > j; --k)
                if (!(d == vs[k].d && x == -vs[k].x && y == -vs[k].y))
                {
                    int nx = x + vs[k].x;
                    int ny = y + vs[k].y;
                    int nd = d + vs[k].d;
                    int rd = N - nd;
                    if (nx * nx + ny * ny <= rd * rd && rd >= 0)
                        dp[w][M+nx][M+ny][nd] += dp[r][M+x][M+y][d];
                }

    i = j;       // move to next layer
    swap(r, w);  // swap read/write buffers
}

Once all this has run for each layer, dp[r] will contain the final path counts for each (x, y, d) tuple. All that’s left to do is count up the number of paths that reach back to (0, 0) (dp[r][M][M][...]).

alias reduce!("a + b") sum;
long total = sum(dp[r][M][M][1..N+1]);
writeln(total);

With this improvement, the N = 120 case now runs in just over 1 second, which is easily fast enough for us to be satisfied, but there’s much more that can be done.

Like many geometrical problems, Pythagorean Polygons has a symmetry to it that can be exploited for speed. Notice that we generate the Pythagorean triplets redundantly from all quadrants. If (3, 4) is a triplet, then so is (-3, 4), (3, -4), (-3, 4), (4, 3), (-4, 3), (4, -3), and (-4, -3). What this also means is that if there is a path from (0, 0) to (x, y, d) then there are also paths to (-x, y, d), (x, -y, d) etc. but we compute all of these redundantly!

Taking full advantage of the symmetry is tricky, but there’s one simple way that can speed up our algorithm up significantly. Notice that the first half of triplets travel upwards, and the second half travel downwards (due to the sort). This means our algorithm calculates all paths down to some point (x, y, d) using the second half of the triplets, then calculates all paths from there back to (0, 0) using the second half of the triplets. But we know from the symmetry that if there is a path to (x, y, d) using the first half then there is a path back to (0, 0) using the second half. In fact, there are the same number of paths, so the total unique paths is equal to the product of all combinations where the total distance is less than N.

To implement this, we only need to make one simple change, and add a combining step. The change is to modify the main loop to only use the first half of the triplets, i.e.

for (int i = cast(int)vs.length - 1; i >= 0; )

becomes

for (int i = cast(int)vs.length / 2 - 1; i >= 0; )

Next, instead of just summing up the dp[r][M][M][1..N] entries, we need to sum up the path products. To do this, we iterate over all (x, y, d) tuples and then multiply the cache value with those for (x, y, d2) where d2 <= N - d.

long total = 0;
foreach (x; -M..M+1)
foreach (y; -M..M+1)
foreach (d; 1..N+1)
    if (dp[r][M+x][M+y][d] > 0)
        foreach (d2; 0..N+1-d)
            total += dp[r][M+x][M+y][d] * dp[r][M+x][M+y][d2];
total -= vs.length / 2;

That last line needs some explanation. Previously our algorithm accounted for paths with only two edges, but we don’t do that anymore, so they will be included in the totals. What that line of code does is remove those paths from the total. That value is vs.length / 2 because every edge (dx, dy) has a corresponding edge (-dx, -dy) that will be included, so the total of these two-edge paths is equal to half the triplets.

With this improvement, the N = 120 case gives the solution in 0.36 seconds, which is more than fast enough for my liking. The other axes of symmetry can be exploited further, but I’ll leave that as an exercise for the reader :-)