poita.org

# Range-Based Graph Search in D

Posted: 09 Jan 2014 - Link

I’ve just made my first commit of a range-based graph library for D. At the moment it only contains a few basic search algorithms (DFS, BFS, Dijsktra, and A*), but I wanted to write a bit about the design of the library, and how you can use it.

As a bit of a teaser, the snippet below shows how you can use the library to solve those word change puzzles (you know, the ones when you have to get from one word to another by only changing one letter at a time?)

string[] words = File("/usr/share/dict/words")
.byLine
.filter!(w => w.length == 5)
.map!"a.idup"
.array();
enum string start = "hello";
enum string end = "world";
static d(string a, string b) { return zip(a, b).count!"a[0] != a[1]"; }
auto graph = implicitGraph!(string, u => words.filter!(v => d(u, v) == 1));
graph.aStarSearch!(v => d(v, end))(start).find(end).path.writeln();


This promptly produces the desired result:

["hello", "hollo", "holly", "molly", "mouly", "mould", "would", "world"]


The last two lines are the interesting part. The first of those lines defines our graph as an implicit graph of words connected by one-letter changes. The second line performs the A* search and writes the resultant path to stdout.

On the second-last line, graph is defined as an implicitGraph. An implicit graph is a graph that is represented by functions rather than in-memory data structures. Instead of having an array of adjacency lists, we just define a function that returns a range of adjacent words (that’s the second parameter to implicitGraph). This representation saves us the tedium of generating the graph beforehand, and saves your RAM from having to store it unnecessarily. It also makes for succinct graph definitions!

The last line is where the algorithm is called. Unlike most other graph libraries, stdex.graph implements graph searches as ranges – iterations of the graph vertices. aStarSearch returns a range of vertices in the order they would be visited by the A* search algorithm (parameterized by the A* heuristic). By implementing the search as a range, the user can take advantage of Phobos’ multitude of range algorithms.

For instance, suppose you want to count the number of nodes searched by the algorithm until it reaches the target node. For this task, you can just use countUntil from std.algorithm.

auto search = graph.aStarSearch!(v => d(v, end))(start);
writefln("%d nodes visited", search.countUntil(end));


For tuning and debugging, it might be useful to print out the nodes visited by the algorithm as it runs.

foreach (node; search.until(end))
writeln(node);


When no path is available, graph search algorithms typically end up searching the entire vertex space. It is common to cut-off the search after a certain threshold of nodes have been searched. This can be accomplished by takeing only so many nodes from the search.

auto result = search.take(50).find(end);
if (result.empty)


Similarly, you could have the search continue until 10 seconds have elapsed.

alias now = TickDuration.currSystemTick;
auto t = now;
auto result = search.until!(_ => (now - t).seconds > 10).find(end);
if (result.empty)


The beauty of all this is that none of these features are part of the graph library – they come for free as a side-effect of being a range.

One unsolved part of the design for the graph searches is how to handle visitation. The Boost Graph Library solves this by having users implement a visitor object, which has to define a suite of callback methods, one for each event (vertex discovered, examine edge, etc.) This is certainly mimicable in D, but it may be more effective to model visitation with output ranges, which would again allow composition with existing Phobos constructs. This is what I will be investigating next.

# Ranges Part 1 - Basics

Posted: 16 Jun 2013 - Link

Iteration is one of the most common activities in programming. Few programming tasks can be accomplished without looping or recursing over some set of values, whether it be a stream of bytes from a file, elements of an array, rows in a database, or nodes in an implicitly generated graph. Programs need to iterate, and programming languages need to provide idioms for iteration.

In D, iteration is achieved through the use of ranges. Broadly speaking, a range defines a sequence of values. There are many styles of ranges. For example, some ranges lazily generate their values instead of iterating over data; some ranges are infinite; some ranges can be iterated from both ends; and some ranges allow you to jump around to any index (like arrays). For the most part, the details of how a particular range operates is up to the range implementor.

So how do you implement a range? At the most basic level, a range is just a type with three member functions: front, empty, and popFront.

• front returns the value from the front of the range.
• empty returns true if the range has been exhausted.
• popFront removes the first value from the range.

As a simple example, here is a range that counts from 0 to ‘n’.

struct Iota
{
private int m_current = 0;
private int m_target;

this(int n) { m_target = n; }
@property int front() { return m_current; }
@property bool empty() { return m_current == m_target; }
void popFront() { m_current++; }
}


(The name Iota is a Greek letter. In a tradition started by the programming language APL, it is used to represent consecutive integers in several languages)

Notice that there is no need to inherit from any sort of IRange interface. There’s nothing magic about ranges, they are just simple types with those functions defined.

To iterate over a range, you can manually call those functions using a for loop, or use the built-in foreach loop in D, which knows about ranges.

void main()
{
import std.stdio;
foreach (x; Iota(10))
writeln(x);
}


As expected, this will print out the numbers 0 through 9, one per line. Of course, writeln knows about ranges too, so writeln(Iota(10)) works as well, and will print out [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], formatted like an array.

Writing functions that work with ranges is not difficult. Suppose we want to write a function to count the number of elements in a range. This can be achieved by accepting the range as a template parameter and iterating the range using popFront() explicitly, like so:

size_t count(Range)(Range r)
{
size_t n = 0;
while (!r.empty)
{
++n;
r.popFront();
}
return n;
}

unittest
{
assert(count(Iota(10)) == 10);
}


By using templates, we can create range types that are templated on other range types. For example, consider a skip range that skips every second element in another range. You could define it like this:

struct Skip(Range)
{
private Range m_range;

this(Range r) { m_range = r; }

// front and empty just forward to the sub-range.
@property auto ref front() { return m_range.front; }
@property bool empty() { return m_range.empty; }

// popFront also forwards to the sub-range, but pops off two
// elements at a time, instead of one.
void popFront()
{
m_range.popFront();
if (!m_range.empty)
m_range.popFront();
}
}


We can then use Skip in conjunction with Iota to skip through consecutive integers.

void main()
{
import std.stdio;
writeln(Skip!Iota(Iota(10))); // [0, 2, 4, 6, 8]
}


Notice that we have to redundantly specify Iota as a template parameter to Skip. This is because D doesn’t currently support type inference for template type constructors. A common workaround is to create a module-level function that wraps the constructor:

auto skip(Range)(Range r) { return Skip!Range(r); }
auto iota(int n) { return Iota(n); }


We can then use these like so:

writeln(iota(10).skip());


The above code also makes use of another D feature: uniform function call syntax, which basically means f(x) can be written as x.f() as if f were a member function of x. If you come from the C# world, you can kind of think of it as every free function being an Extension Method. We could have also written 10.iota().skip() or just plain skip(iota(10)). There’s no difference, so choose whatever you think is most readable.

The ability to combine arbitrary ranges is perhaps their most powerful feature. There’s no reason we need to stop at combining just two ranges:

writeln(iota(10)); // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
writeln(iota(10).skip()); // [0, 2, 4, 6, 8];
writeln(iota(10).skip().skip()); // [0, 4, 8];
writeln(iota(10).skip().skip().skip()); // [0, 8];


It shouldn’t be hard to imagine the kind of flexibility and expressiveness that can be achieved once you have a large vocabulary of ranges at your disposal.

In Part 2 we’ll look at some of the different categories of ranges.

# DustMite - Code Minimizer

Posted: 13 Jun 2013 - Link

Yesterday I discovered DustMite, a source code minimizer tool for D.

What’s a source code minimizer? It’s a program that takes some source code, and automatically removes lines, or even whole files from the source while maintaining some invariant (e.g. it still builds, or still gives a particular output). The usual use case is to produce a minimal reproduction test case for a bug, which is exactly what I used it for.

I had a problem in a medium-sized D project where anytime I had some sort of compile-time error, such as a typo in a variable name, DMD (the D compiler), would spew out pages and pages of false errors, triggered by some other module in my codebase. The errors were completely irrelevant. This was quite annoying because it made it difficult to see the actual error. So, like a good programmer, I decided to file a bug report.

Step one in filing a good bug report is to create a minimal test case. I tried to do this manually by pulling out what seemed to be the relevant parts of the code that would reproduce the error, but I couldn’t get it to happen. I could easily and consistently reproduce the bug in the full codebase, but the project is about 10,000 lines of code, so minimizing it manually from there would be very time consuming.

I’d heard about DustMite on the forums, so I thought I’d give it a go.

Installing was easy enough.

% git clone https://github.com/CyberShadow/DustMite.git
% cd DustMite
% dmd dustmite.d dsplit.d
% ln -s ~/DustMite/dustmite ~/bin


Next step was to prepare the project for minification. All you need to do is make a full copy of the codebase, and remove any unnecessary files (e.g. object files, data files – anything not needed for reproduction).

Next you need to devise a test command. This is a command that should return 0 if the bug is still present, and non-zero otherwise. For example, suppose we had this (trivial) program:

import std.stdio;
import std.string;

string world() { return "world!"; }

void main()
{
write(hello());
write(", ");
write(world());
write("\n");
}


I’ve got an obvious bug here in that hello() is undefined. Trying to compile this gives Error: undefined identifier hello. If we wanted to minimize this then we need a command, which, when run, will return 0.

Compiling and greping for the error would be perfect for this.

% dmd test.d 2>&1 | grep -q "undefined identifier hello"
% echo \$?
0


This runs the compiler (dmd test.d), redirects the error messages to stdout (2>&1) then pipes the result (|) to grep, which searches for that string without producing output (-q = quiet). grep returns 0 if it finds anything, and 1 otherwise.

The DustMite wiki provides a list of useful test scripts.

Finally, we just run DustMite with that test command.

% dustmite testdir 'dmd test.d 2>&1 | grep -q "undefined identifier
hello"'
None => Yes
############### ITERATION 0 ################
[  0.0%] Remove [] => No
[  1.6%] Remove [0] => No (cached)
[  3.3%] Remove [01] => No
...
[ 82.4%] Remove [000001] => No (cached)
[ 88.2%] Remove [000000] => No (cached)
[ 94.1%] Remove [0000000] => No (cached)
Done in 36 tests and 10 secs and 165 ms; reduced version is in
testdir.reduced


And checking testdir.reduced/test.d confirms that the source has been reduced.

% cd testdir.reduced
% cat test.d
void main()
{
hello;
}


For my project, DustMite took 8 minutes, and reduced roughly 10,000 lines of code down to about 10, and perfectly reproduced the issue. The problem I was having trying to reproduce manually was that the repro required that you pass the files in a specific order to the compiler, and it also required an extra file that seemingly has nothing to do with the issue. I don’t think I would have ever minimized it without DustMite.

# D Tip - Skipping main() for Unit Tests

Posted: 11 Jun 2013 - Link

When you compile a D program with unit tests enabled (using the -unittest flag), the generated executable will first run the unit tests then continue with executing main().

Sometimes, you just want the unit tests to run in isolation.

A simple way to skip execution of main() for unit test builds is to use the unittest version identifier:

unittest
{
assert(1 + 1 == 2, "Something horrible has gone wrong.");
}

void main()
{
version(unittest)
return;

// This won't be run
}


Code inside the version(unittest) block will only be run when the code is compiled with -unittest, so the code above will exit main() immediately in unit test builds.

There are numerous pre-defined version identifiers, which you can find at http://dlang.org/version.html. Another handy one is version(assert), which is only compiled in when asserts are enabled.

# Associative Array Indexing

Posted: 27 May 2013 - Link

I dislike how associative array indexing is handled in every language I know. In all cases it handled inconsistently with other parts of the language, and in all cases the inconsistency is unnecessary.

There’s two aspects of handling associative arrays that vary among languages:

1. Does array[key] automatically insert a value if the key isn’t present?
2. Do operations on array[key] have “magic” compiler support?

To illustrate these differences, here’s how things are handled in C++ and D.

C++

std::map<std::string, int> days;
days["Mon"] = 1;
std::cout << days["Tue"] << std::endl;


D

int[string] days;
days["Mon"] = 1;
writeln(days["Tue"]);


In C++, array[key] always returns a valid lvalue, whether the key was in the array or not. If the key was not in the array then a default-constructed value is added, so days["Tue"] defaults to 0.

In D, array[key] will return an lvalue only if key was in the array. However, the syntax array[key] = value is given magic compiler treatment, and is converted into a special opIndexAssign operator, which is a combination of indexing and assignment. The attempt to access days["Tue"] results in a range violation.

C++ answers ‘yes’ and ‘no’ to questions 1 and 2, and D answers ‘no’ and ‘yes’. As far as I’m aware, all modern languages follow either the C++ or D approach. For example, Ruby follows C++ and Python follows D.

Now I’ll explain why both of these approaches are bad.

Automatically inserting a default value when the key is present is bad because it is inconsistent with how normal, non-associative arrays work.

std::vector<std::string> days;
days[1] = "Mon"; // ERROR


Arrays don’t automatically expand their domain on indexing, so why should associative arrays? Aren’t arrays just a optimisation of an associative array where the key is a bounded integer? They should have the same semantics.

And the reason the D approach is bad is because it messes with your intuition about how expressions work. If I see an indexing operation and an assignment operation then I expect the AST to have two operation nodes, not one. This has some interesting consequences:

void set(ref int x, int y) { x = y; }

int[string] days;
days["Mon"] = 1;    // Okay
days["Tue"].set(2); // Error?


Intuitively, this should work. set is no different from the int assignment operator, so I would expect that I can replace all int assignments with calls to set. I can’t, only because of the compiler magic.

In my opinion, the correct approach is to disallow implicit insertions, and also avoid any compiler magic. Indexing should be an error when the key isn’t present (just like it is in an array), and if you want to expand the domain of an associative array then you should need to use a specific operation to do that (array.insert would be the obvious choice).

A likely complaint about this approach would be that you can’t write the three-line word count program, i.e.:

int[string] wc;
foreach (word; words)
wc[word]++;


Under my suggested approach, this program would become:

int[string] wc;
foreach (word; words)
if (word in wc)
wc[word]++;
else
wc.insert(word, 1);


Yes, it’s longer, and no, I don’t think it’s an issue, nor do I think it is an argument for implicit insertions or magic. If shorter programs are a sound argument for inconsistent semantics then we’re in a lot of trouble. Besides, it wouldn’t be hard to introduce a lookup function where you can provide a default value if the key is missing.

int[string] wc;
foreach (word; words)
wc.lookup(word, 0)++;


That’s it. Language designers, don’t throw away consistency for the sake of terseness – it’s not worth it!