Crypto Island

Introducing BigTrees!

Posted 2025-07-16

I’ve been working on a Haskell program to dedup large collections of files efficiently. It’s still under heavy development, but I think it’s at the point now where it could be useful for a few intrepid testers/power users.

Most people should default to using an established tool like fdupes for now. Then again, if you’re reading this you may be an exception… consider trying bigtrees if you like/need some of the features I’m working on, or have an idea about how it could be made to fit your workflow better than the established tools!

I’ll go into more detail on all the cool things you can do with the dupes, diff, and find commands (and others) “soon”; for now, here are 3 short examples.

Test data

You can already try bigtrees on your own data, of course! But maybe you’re a little more careful than that? If so, use this Python script. It’ll download the Linux kernel source code and a nice git repo with some example pictures, music etc. Then it’ll duplicate them a few times to get ~1 million test files taking up 18G total.

$ export TMPDIR=/tmp
$ time ./fetch-test-files.py
downloading '/tmp/test-files/josef-friedrichj.zip'... ok
unzipping '/tmp/test-files/josef-friedrichj.zip'... ok
copying '/tmp/test-files/josef-friedrichj' -> '/tmp/test-files/josef-friedrichj-dupe1'... ok
copying '/tmp/test-files/josef-friedrichj' -> '/tmp/test-files/josef-friedrichj-dupe2'... ok
copying '/tmp/test-files/josef-friedrichj' -> '/tmp/test-files/josef-friedrichj-dupe3'... ok
copying '/tmp/test-files/josef-friedrichj' -> '/tmp/test-files/josef-friedrichj-dupe4'... ok
copying '/tmp/test-files/josef-friedrichj' -> '/tmp/test-files/josef-friedrichj-dupe5'... ok
copying '/tmp/test-files/josef-friedrichj' -> '/tmp/test-files/josef-friedrichj-dupe6'... ok
copying '/tmp/test-files/josef-friedrichj' -> '/tmp/test-files/josef-friedrichj-dupe7'... ok
copying '/tmp/test-files/josef-friedrichj' -> '/tmp/test-files/josef-friedrichj-dupe8'... ok
copying '/tmp/test-files/josef-friedrichj' -> '/tmp/test-files/josef-friedrichj-dupe9'... ok
downloading '/tmp/test-files/linux-source-code.zip'... ok
unzipping '/tmp/test-files/linux-source-code.zip'... ok
copying '/tmp/test-files/linux-source-code' -> '/tmp/test-files/linux-source-code-dupe1'... ok
copying '/tmp/test-files/linux-source-code' -> '/tmp/test-files/linux-source-code-dupe2'... ok
copying '/tmp/test-files/linux-source-code' -> '/tmp/test-files/linux-source-code-dupe3'... ok
copying '/tmp/test-files/linux-source-code' -> '/tmp/test-files/linux-source-code-dupe4'... ok
copying '/tmp/test-files/linux-source-code' -> '/tmp/test-files/linux-source-code-dupe5'... ok
copying '/tmp/test-files/linux-source-code' -> '/tmp/test-files/linux-source-code-dupe6'... ok
copying '/tmp/test-files/linux-source-code' -> '/tmp/test-files/linux-source-code-dupe7'... ok
copying '/tmp/test-files/linux-source-code' -> '/tmp/test-files/linux-source-code-dupe8'... ok
copying '/tmp/test-files/linux-source-code' -> '/tmp/test-files/linux-source-code-dupe9'... ok

real    3m11.745s
user    0m23.659s
sys     0m22.355s
$ find test-files | wc -l
957363

$ du -h test-files | tail -n1
18G     test-files

$ tree -L 1 test-files
test-files
├── josef-friedrichj
├── josef-friedrichj-dupe1
├── josef-friedrichj-dupe2
├── josef-friedrichj-dupe3
├── josef-friedrichj-dupe4
├── josef-friedrichj-dupe5
├── josef-friedrichj-dupe6
├── josef-friedrichj-dupe7
├── josef-friedrichj-dupe8
├── josef-friedrichj-dupe9
├── josef-friedrichj.zip
├── linux-source-code
├── linux-source-code-dupe1
├── linux-source-code-dupe2
├── linux-source-code-dupe3
├── linux-source-code-dupe4
├── linux-source-code-dupe5
├── linux-source-code-dupe6
├── linux-source-code-dupe7
├── linux-source-code-dupe8
├── linux-source-code-dupe9
└── linux-source-code.zip

21 directories, 2 files

Minimal dedup command

I’m quite pleased with how simple this looks! It took a lot of work to get it that way.

$ time bigtrees dupes test-files

# This is the default 'suggestions' output format.
# It just suggests what you might delete manually yourself.

# You could save 861228 inodes by deleting all but one of these 10 duplicate directories
test-files/linux-source-code
test-files/linux-source-code-dupe1
test-files/linux-source-code-dupe2
test-files/linux-source-code-dupe3
test-files/linux-source-code-dupe4
test-files/linux-source-code-dupe5
test-files/linux-source-code-dupe6
test-files/linux-source-code-dupe7
test-files/linux-source-code-dupe8
test-files/linux-source-code-dupe9

# You could save 414 inodes by deleting all but one of these 10 duplicate directories
test-files/josef-friedrichj
test-files/josef-friedrichj-dupe1
test-files/josef-friedrichj-dupe2
test-files/josef-friedrichj-dupe3
test-files/josef-friedrichj-dupe4
test-files/josef-friedrichj-dupe5
test-files/josef-friedrichj-dupe6
test-files/josef-friedrichj-dupe7
test-files/josef-friedrichj-dupe8
test-files/josef-friedrichj-dupe9

real    2m42.149s
user    3m29.705s
sys     0m53.310s

(I’ll go into a few cases where it’s not so perfect in future posts.)
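For readers curious how duplicate directories can be found by hashing at all, here’s a minimal Python sketch of the general idea. It’s a simplified illustration under stated assumptions, not the actual bigtrees code: the function names, the choice of SHA-256, and the way a directory hash is built from its children are all made up for this example.

```python
import hashlib
import os
from collections import defaultdict


def hash_file(path):
    """SHA-256 of a file's contents, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(path, hashes):
    """Hash `path` recursively, recording path -> hash in `hashes`.

    A directory's hash is derived from the sorted names and hashes of its
    children, so directories with identical contents get identical hashes.
    (Hypothetical scheme for illustration only.)
    """
    if os.path.islink(path):
        # Hash the link target instead of following it.
        target = os.fsencode(os.readlink(path))
        result = "link:" + hashlib.sha256(target).hexdigest()
    elif os.path.isdir(path):
        digest = hashlib.sha256()
        for name in sorted(os.listdir(path)):
            child_hash = hash_tree(os.path.join(path, name), hashes)
            digest.update(os.fsencode(name) + b"\0" + child_hash.encode() + b"\n")
        result = "dir:" + digest.hexdigest()
    else:
        result = "file:" + hash_file(path)
    hashes[path] = result
    return result


def duplicate_dirs(root):
    """Return groups of directories under `root` that share a hash."""
    hashes = {}
    hash_tree(root, hashes)
    groups = defaultdict(list)
    for path, value in hashes.items():
        if value.startswith("dir:"):
            groups[value].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Unlike the output above, this sketch also reports matching subdirectories nested inside duplicate directories, and it treats all empty directories as duplicates of each other. Filtering those down to the top-most groups is left out to keep it short.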

Using a .bigtree file

A .bigtree file saves the directory structure as well as the hash of each file and folder. It’s useful when you want to hash files once and use the results multiple times. The command above could equivalently be written like so:

$ bigtrees hash test-files --output test-files.bigtree
$ bigtrees dupes test-files.bigtree
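The “hash once, reuse many times” pattern can be sketched in a few lines of Python. This is only an illustration: JSON is a stand-in here and says nothing about the real .bigtree format, and build_index, save_index and load_index are hypothetical names.

```python
import hashlib
import json
import os


def build_index(root):
    """Map each file's path (relative to `root`) to the SHA-256 of its contents."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full, "rb") as handle:
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            index[os.path.relpath(full, root)] = digest.hexdigest()
    return index


def save_index(index, path):
    """Write the index to disk (JSON is an assumption, not the .bigtree format)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(index, handle, indent=1, sort_keys=True)


def load_index(path):
    """Read back an index written by save_index, without touching the files."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

The slow part (reading every file) happens only in build_index; anything that works from load_index afterwards doesn’t need the original files to be present.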

Minimal diff command

This is meant to take an old and a new collection of files. You might use it to compare a backup to your current documents, or an older backup to a newer one. You can also mix and match actual files/dirs with saved .bigtree files. Let’s edit test-files a little, and see if bigtrees diff can detect the changes relative to the test-files.bigtree saved above.

$ rm -r test-files/linux-source-code-dupe7/linux-master/
$ rm -r test-files/linux-source-code-dupe8/linux-master/drivers/pinctrl/realtek/
$ echo "a new file!" > test-files/linux-source-code-dupe8/linux-master/extra.txt

$ time bigtrees diff test-files.bigtree test-files

removed 'linux-source-code-dupe7/linux-master'
added 'linux-source-code-dupe8/linux-master/extra.txt'
removed 'linux-source-code-dupe8/linux-master/drivers/pinctrl/realtek'

real    2m25.713s
user    2m55.189s
sys     0m52.527s

You can also compare things that aren’t time-ordered. Just keep in mind that the reported changes will flip depending on which one you put first: what shows up as “added” one way round shows up as “removed” the other.
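Here’s a small Python sketch of why the order matters. It assumes each side has already been reduced to a flat {path: hash} mapping, which is a simplification for the example; diff_indexes is a hypothetical helper, not part of bigtrees.

```python
def diff_indexes(old, new):
    """Compare two {path: hash} mappings.

    Returns a sorted list of (change, path) tuples, where change is
    "added", "removed" or "changed".
    """
    changes = []
    for path in old.keys() - new.keys():
        changes.append(("removed", path))
    for path in new.keys() - old.keys():
        changes.append(("added", path))
    for path in old.keys() & new.keys():
        if old[path] != new[path]:
            changes.append(("changed", path))
    return sorted(changes)


# Made-up paths and hashes, purely for illustration.
old = {"a.txt": "h1", "b.txt": "h2", "c.txt": "h3"}
new = {"a.txt": "h1", "b.txt": "hX", "d.txt": "h4"}

forward = diff_indexes(old, new)
backward = diff_indexes(new, old)
```

Here forward reports d.txt as added and c.txt as removed, while backward reports the opposite; b.txt is “changed” both ways. This flat version lists every path individually, whereas the output above reports a removed directory as a single line.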

Minimal find command

Once I started hashing my drives + tarballs and keeping .bigtree files indexing their contents, I realised I could also use them to find particular files without having the drives on hand. It may not sound like a big upgrade vs find, locate, or similar commands, but it’s come to be an essential part of my workflow.

Interactive use is a bit like find or tar --list.

$ bigtrees find test-files.bigtree --search-regex '/old.*\.jpg$'

test-files/josef-friedrichj/test-files-master/jpg/old-house.jpg
test-files/josef-friedrichj-dupe1/test-files-master/jpg/old-house.jpg
test-files/josef-friedrichj-dupe2/test-files-master/jpg/old-house.jpg
test-files/josef-friedrichj-dupe3/test-files-master/jpg/old-house.jpg
test-files/josef-friedrichj-dupe4/test-files-master/jpg/old-house.jpg
test-files/josef-friedrichj-dupe5/test-files-master/jpg/old-house.jpg
test-files/josef-friedrichj-dupe6/test-files-master/jpg/old-house.jpg
test-files/josef-friedrichj-dupe7/test-files-master/jpg/old-house.jpg
test-files/josef-friedrichj-dupe8/test-files-master/jpg/old-house.jpg
test-files/josef-friedrichj-dupe9/test-files-master/jpg/old-house.jpg
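The idea of searching saved paths rather than the drive itself can be sketched in Python like this. It assumes the regex is applied as an unanchored search against each full path, which matches the output above but is otherwise an assumption; find_paths is a hypothetical helper, and Python’s re dialect may differ from the one bigtrees uses.

```python
import re


def find_paths(paths, search_regex):
    """Return the saved paths that the regex matches anywhere in the string."""
    pattern = re.compile(search_regex)
    return [path for path in paths if pattern.search(path)]


# A few example paths; the first is taken from the listing above,
# the rest are made up to show what does and doesn't match.
saved_paths = [
    "test-files/josef-friedrichj/test-files-master/jpg/old-house.jpg",
    "test-files/josef-friedrichj/test-files-master/jpg/new-house.jpg",
    "test-files/old/notes.txt",
    "test-files/old-stuff/pic.jpg",
]

matches = find_paths(saved_paths, r"/old.*\.jpg$")
```

Note that `.*` also matches across slashes, so a .jpg file anywhere below a directory whose name starts with “old” is matched as well.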