This is the homepage of Peter Alexander. I am currently at Facebook working on Core Data. Previously a game developer at Codemasters. Any opinions are my own.
The Condenser Part 1
Last night I rewatched Joe Armstrong’s excellent keynote from this year’s Strange Loop conference. He begins the talk with a fairly standard, but humorous lament about the sorry state of software, then goes on to relate this to some physical quantities and limits. He finishes by suggesting some possible directions we can look to improve things.
One of his suggestions is to build something called "The Condenser", a program that takes all the programs in the world and condenses them down in such a way that all redundancy is removed. He breaks this task down into two parts: Part 1 finds all identical files and removes the duplicates, and Part 2 merges all similar files using least compression difference.

He also explains the obvious, easy way to do Part 1: hash every file and compare the hashes.
At this point I remembered that I hadn’t written a lot of D recently, and although Joe is talking about condensing all files in the world, I was kind of curious how much duplication there was just on my laptop. How many precious bytes am I wasting?
It turns out that The Condenser Part 1 is very easy to write in D, and even quite elegant.
I'll explain the code a little, starting from `main`. The first couple of lines use `dirEntries` to recursively list the entries under the directory given on the command line, and `filter` to produce a list of files (`dirEntries` also returns directories). `dirEntries` is a lazy range, so the directories are only iterated as necessary. Similarly, `filter` creates another lazy range, which removes non-file entries.
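A minimal sketch of the whole program in D might look like the following. The output format (`size hash path` per line) and the exact names are my own choices, not necessarily what the original used:

```d
import std.algorithm, std.digest.sha, std.file,
       std.parallelism, std.stdio, std.typecons;

// Hash one file: read it 4kb at a time into an uninitialized
// stack buffer and build up the SHA1 from the chunks.
auto process(DirEntry e)
{
    ubyte[4096] buf = void;
    auto sha1 = File(e.name).byChunk(buf[]).digest!SHA1;
    return tuple(e.size, toHexString(sha1).idup, e.name);
}

void main(string[] args)
{
    // Lazily walk the tree under the given root, keeping only files.
    auto files = dirEntries(args[1], SpanMode.depth, false)
                 .filter!(e => e.isFile);

    // Semi-lazy parallel map: take a chunk of files, hash them on
    // worker threads, then hand the results back to this loop.
    foreach (t; taskPool.map!process(files))
        writefln("%s %s %s", t[0], t[1], t[2]);
}
```

Duplicates can then be found by sorting the output on the hash column and looking for repeated hashes.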
The `foreach` loop corresponds to Joe's `do` loop. While I could have just iterated over `files` directly, I instead iterate over a semi-lazy `map` from David Simcha's `std.parallelism` module, which does the processing in parallel on multiple threads. It is semi-lazy because it eagerly takes a chunk of elements at a time from the front of the range, processes them in parallel, then returns them back to the caller to consume. The task pool size defaults to the number of CPU cores on your machine, but this can be configured.
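For instance, the number of worker threads in the default pool can be set through `std.parallelism`'s `defaultPoolThreads` property before `taskPool` is first used (the choice of `totalCPUs - 1` below is just an example):

```d
import std.parallelism;

void main()
{
    // Must be set before taskPool is first used; leave one core free.
    defaultPoolThreads = totalCPUs > 1 ? totalCPUs - 1 : 1;
}
```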
The `process` function is where the hashing is done. It takes a directory entry, allocates an uninitialized (`= void`) 4kb buffer on the stack, and uses that buffer to read the file 4kb at a time and build up the SHA1. Again, `byChunk` returns a lazy range, and `digest!SHA1` consumes it. Lazy ranges are very common in idiomatic D, especially in these kinds of stream-processing applications. It's worth getting familiar with the common ranges in `std.range` and `std.algorithm`.
I ran the program over my laptop’s home directory. Since the duplicate
file sizes are all in the first column, I can just use
awk to sum them
up and find the total waste:
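Assuming a report with one line per duplicate file and the size in bytes in the first column, a one-liner along these lines does the summing (`dups.txt` is a hypothetical file name):

```shell
# Sum the first column (file size in bytes) across all lines.
awk '{ total += $1 } END { print total }' dups.txt
```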
That's 2.18 GB of duplicate files, which is around 10% of my home directory.
In some cases the duplicates are just files that I have absentmindedly downloaded twice. In other cases it’s duplicate music that I have both in iTunes and Google Play. For some reason, the game Left 4 Dead 2 seems to have a lot of duplicate audio files.
So, in theory, I could save 10% by deduplicating all these files, but it’s not a massive amount of waste. It doesn’t seem worth the effort trying to fix this. Perhaps it would be nice for the OS or file system to solve this automatically, but the implementation would be tricky, and still maybe not worth it.
Part 2 of the Condenser is to merge all similar files using least compression difference. This is decidedly more difficult, since it relies on near-perfect compression, which is apparently an AI-hard problem. Unfortunately, solving AI is out of scope for this post, so I’ll leave that for another time :-)