The Condenser Part 1
Last night I rewatched Joe Armstrong’s excellent keynote from this year’s Strange Loop conference. He begins the talk with a fairly standard, but humorous lament about the sorry state of software, then goes on to relate this to some physical quantities and limits. He finishes by suggesting some possible directions we can look to improve things.
One of his suggestions is to build something called “The Condenser”, which is a program that takes all the programs in the world and condenses them down in such a way that all redundancy is removed. He breaks this task down into two parts:
- Part 1: find and remove all duplicate files.
- Part 2: merge all similar files.
He also explains the obvious and easy way to do Part 1: hash every file and compare the hashes.
At this point I remembered that I hadn’t written a lot of D recently, and although Joe is talking about condensing all files in the world, I was kind of curious how much duplication there was just on my laptop. How many precious bytes am I wasting?
It turns out that The Condenser Part 1 is very easy to write in D, and even quite elegant.
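Here is roughly what the program looks like (the duplicate bookkeeping with an associative array and the exact output format below are just one reasonable way of doing it):
```d
import std.algorithm : filter;
import std.digest.sha;
import std.file : DirEntry, SpanMode, dirEntries;
import std.parallelism : taskPool;
import std.stdio : File, writefln;
import std.typecons : Tuple, tuple;

// Hash one file: read it 4kb at a time into an uninitialized stack
// buffer and feed each chunk to SHA1.
Tuple!(DirEntry, ubyte[20]) process(DirEntry e)
{
    ubyte[4096] buf = void;
    return tuple(e, File(e.name).byChunk(buf[]).digest!SHA1);
}

void main(string[] args)
{
    auto files = dirEntries(args[1], SpanMode.depth, false)
                 .filter!(e => e.isFile);

    string[ubyte[20]] firstSeen;  // digest -> first path with that digest
    foreach (t; taskPool.map!process(files))
    {
        if (auto original = t[1] in firstSeen)
            writefln("%s\t%s\t%s", t[0].size, t[0].name, *original);
        else
            firstSeen[t[1]] = t[0].name;
    }
}
```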
I’ll explain the code a little, starting from main. The files variable uses dirEntries and filter to produce a list of files (the filter is necessary since dirEntries also returns directories). dirEntries is a lazy range, so the directories are only iterated as necessary. Similarly, filter creates another lazy range, which removes non-file entries.
The foreach loop corresponds to Joe’s do loop. While I could have just iterated over files directly, I instead iterate over a taskPool.map, a semi-lazy map from David Simcha’s std.parallelism that does the processing in parallel on multiple threads. It is semi-lazy because it eagerly takes a chunk of elements at a time from the front of the range, processes them in parallel, then returns them back to the caller to consume. The task pool size defaults to the number of CPU cores on your machine, but this can be configured.
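For example, the default pool can be resized before its first use; here is a toy sketch (the squaring lambda is just a stand-in for process):
```d
import std.parallelism;
import std.stdio;

void main()
{
    // Override the worker count before the global taskPool is first used;
    // otherwise it sizes itself from the number of CPU cores.
    defaultPoolThreads = 2;

    // Same semi-lazy behaviour as above: elements are taken in chunks,
    // mapped in parallel, then handed back to the loop to consume.
    foreach (x; taskPool.map!(a => a * a)([1, 2, 3, 4, 5, 6, 7, 8]))
        writeln(x);
}
```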
The process function is where the hashing is done. It takes a DirEntry, allocates an uninitialized (= void) 4kb buffer on the stack, and uses that buffer to read the file 4kb at a time and build up the SHA1. Again, byChunk returns a lazy range, and digest!SHA1 consumes it. Lazy ranges are very common in idiomatic D, especially in these kinds of stream-processing applications. It’s worth getting familiar with the common ranges in std.algorithm and std.range.
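The hashing part also works nicely on its own, e.g. as a little sha1sum-style utility (the output format here is just for show):
```d
import std.digest.sha;
import std.stdio;

void main(string[] args)
{
    // Read the file 4kb at a time into an uninitialized stack buffer;
    // digest!SHA1 consumes the lazy byChunk range without ever holding
    // the whole file in memory.
    ubyte[4096] buf = void;
    auto hash = File(args[1]).byChunk(buf[]).digest!SHA1;
    writefln("%s  %s", toHexString(hash), args[1]);
}
```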
I ran the program over my laptop’s home directory. Since the duplicate file sizes are all in the first column, I can just use awk to sum them up and find the total waste:
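Something along these lines, assuming the program’s output was redirected to a file (dupes.txt is just a placeholder name):
```sh
awk '{ total += $1 } END { print total }' dupes.txt
```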
That’s 2.18GB of duplicate files, which is around 10% of my home dir.
In some cases the duplicates are just files that I have absentmindedly downloaded twice. In other cases it’s duplicate music that I have both in iTunes and Google Play. For some reason, the game Left 4 Dead 2 seems to have a lot of duplicate audio files.
So, in theory, I could save 10% by deduplicating all these files, but it’s not a massive amount of waste. It doesn’t seem worth the effort trying to fix this. Perhaps it would be nice for the OS or file system to solve this automatically, but the implementation would be tricky, and still maybe not worth it.
Part 2 of the Condenser is to merge all similar files using least compression difference. This is decidedly more difficult, since it relies on near-perfect compression, which is apparently an AI-hard problem. Unfortunately, solving AI is out of scope for this post, so I’ll leave that for another time :-)