in2csv_splitYears
csv2journal_mergeYears
Disclaimer: nothing on this blog is advice about the substance of your taxes! I have no background in accounting and no idea whether this code will produce valid results. You need to verify everything yourself and then own your own mistakes or hire a tech-savvy CPA (or equivalent in your country) to go over it and fix any problems.
When using the “full-fledged hledger” system (see my intro) it’s helpful to separate all the imported data by tax year. You can do it manually in a spreadsheet editor, setting up your files like this…
…but copying/pasting rows is tedious and error-prone. Today I’ll go over a couple of functions I added to export/export.hs to automate it instead. Sorry in advance for the messy Haskell! I promise it enables a cleaner workflow.
Parsing in the full-fledged system can have one or two steps depending on how weird the input data is. You always do the second step, csv2journal, which typically uses hledger’s CSV parsing DSL. But you can also add an in2csv script first to do any extra cleanup and pre-processing that doesn’t fit cleanly into that paradigm.
I’ve found it useful to overload these two steps with splitting and merging respectively. There’s no deep reason split/merge need to be coupled to in2csv/csv2journal; it just minimized the code changes and seems to work well in most cases.
in2csv_splitYears
Add this to your export/export.hs, changing the input file patterns to match your importers. The new code does the same thing as in2csv (the Haskell function), but also passes the year as a string when it calls scripts.
import Data.List.Extra (splitOn)
-- ...
-- This goes at the end of the export_all function,
-- below the equivalent list for regular in2csv:
"//import/coinpaprika/csv/*.csv",
[ "//import/etherscan/csv/*.csv" ] |%> in2csv_splitYears
-- ...
-- Based on in2csv, but passes a multi-year file + a year arg to the in2csv script.
-- Requires the out file to be named with "-<year>" at the end, for example "mywallet-2023.csv"
in2csv_splitYears out = do
  let (cdir, file) = splitFileName out
      (base, ext)  = splitExtension file
      year         = last $ splitOn "-" base
      file_noyear  = (concat $ intersperse "-" $ init $ splitOn "-" base) ++ ext
      pdir         = parentOf "csv" cdir
      idir         = replaceDir "csv" "in" cdir
      script       = "in2csv_splitYears"
  possibleInputs <- getDirectoryFiles idir [file_noyear -<.> "*"]
  let inputs = case possibleInputs of
        [] -> error $ "no inputs for " ++ show file_noyear
        _  -> map (idir</>) $ possibleInputs ++ (extraInputs file_noyear)
  let deps = map (pdir </>) $ extraDeps out
  need $ (pdir </> script):(inputs ++ deps)
  (Stdout output) <- cmd (Cwd pdir) Shell ("./" ++ script) (year : map (makeRelative pdir) inputs)
  writeFileChanged out output
Now in each import folder you have the option of keeping your current in2csv script, or renaming it in2csv_splitYears and handling an extra argument for the year. All you have to do is filter out lines not matching that year. I do it with variations on this template Python script.
#!/usr/bin/env python3
from sys import argv, stdout
from csv import DictReader, DictWriter
def match_year(row):
    '''Returns whether to edit + print this row
    '''
    # TODO adjust this to your input format
    date_field = 'Date'
    return row[date_field].startswith(year)

def edit_row(row):
    '''Make any edits you want to the row dict here
    For example:
    - add an account field based on the filename
    - remove commas from numbers
    - combine send amount + fee into a total
    '''
    return row

def main(year, in_file):
    '''Read in_file, keep rows matching year, edit them, print csv
    '''
    with open(in_file, 'r', encoding='utf-8-sig') as f:
        reader = DictReader(f)
        writer = None # wait to make it below
        for row in reader:
            if not match_year(row):
                continue
            row = edit_row(row)
            if writer is None:
                # get header from first row
                header = row.keys()
                writer = DictWriter(stdout, fieldnames=header)
                writer.writeheader()
            writer.writerow(row)

if __name__ == '__main__':
    year = argv[1]
    in_file = argv[2]
    main(year, in_file)
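For example, with the default 'Date' field the filter keeps only rows whose dates start with the year string. These rows are made up just to show the behavior:

>>> year = '2023'
>>> match_year({'Date': '2023-01-05', 'Amount': '1'})
True
>>> match_year({'Date': '2022-12-31', 'Amount': '1'})
False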
Most of that is a good template for an in2csv script too. Just remove the year and match_year parts.
csv2journal_mergeYears
Once you have a CSV per input per year, you can use this to merge journal files by year.
Note that the code assumes you’re also using in2csv_splitYears from above. It will get confused and fail if you manually split up the CSV files and skip the in2csv step, even though that sounds equivalent.
-- ...
-- This goes at the end of the export_all function,
-- below the equivalent list for regular csv2journal:
"//import/coinpaprika/journal/*.journal",
[ "//import/etherscan/journal/*.journal" ] |%> csv2journal_mergeYears
-- ...
-- Based on csv2journal, but merges all the input files for a given year.
-- Name your script csv2journal_mergeYears to use this version.
csv2journal_mergeYears out = do
  let (jdir, file) = splitFileName out
      (base, ext)  = splitExtension file
      year         = last $ splitOn "-" base
      pdir         = parentOf "journal" jdir
      cdir         = replaceDir "journal" "csv" jdir
      idir         = replaceDir "journal" "in" jdir
      script       = "csv2journal_mergeYears"
  csvs <- (fmap . map)
            (\f -> cdir </> (fst $ splitExtension f) ++ "-" ++ year ++ ".csv")
            (getDirectoryFiles idir ["*"])
  let deps = map (pdir </>) $ extraDeps out
  need $ (pdir </> script):(csvs ++ deps)
  (Stdout output) <- cmd (Cwd pdir) Shell ("./" ++ script) $ map (makeRelative pdir) csvs
  writeFileChanged out output
I normally don’t bother with Python for this part because it’s short.
When you name the script csv2journal_mergeYears it will get a list of CSV files instead of a single file. Use hledger to parse them sequentially.
#!/usr/bin/env bash
for csv in "$@"; do
  hledger print --rules-file myimporter.rules -f "$csv"
done
The transactions will be out of order, but the full-fledged system automatically fixes that later.
The last step is to consolidate your include statements.
Here’s an example of how your CoinPaprika price feeds might change. The deduplication is nice, but more importantly now you can’t forget one of them!
-include ./import/coinpaprika/journal/BTC-2023.journal
-include ./import/coinpaprika/journal/ETH-2023.journal
-include ./import/coinpaprika/journal/BNB-2023.journal
-include ./import/coinpaprika/journal/XRP-2023.journal
-include ./import/coinpaprika/journal/ADA-2023.journal
-include ./import/coinpaprika/journal/OKB-2023.journal
-include ./import/coinpaprika/journal/MATIC-2023.journal
-include ./import/coinpaprika/journal/DOGE-2023.journal
+include ./import/coinpaprika/journal/2023.journal
The same trick also works with other data sources like bank accounts and wallets. For example if you have many different ETH wallets each carrying a different ERC20 token, you could include all of them in one line per year too. (I’ll do a post soon on importing EtherScan data)
-include ./import/etherscan/journal/ethwallet-2023.journal
-include ./import/etherscan/journal/linkwallet-2023.journal
-include ./import/etherscan/journal/maticwallet-2023.journal
-include ./import/etherscan/journal/shibwallet-2023.journal
+include ./import/etherscan/journal/2023.journal
As always, here is a tarball of today’s code.
split-and-merge-by-year
├── export
│ └── export.hs
└── import
└── myimporter
        ├── csv2journal_mergeYears
        └── in2csv_splitYears
This time it’s not a complete runnable example though. Just an outline. I assume if you’re interested in this it means you’re starting your own full-fledged repo. So go ahead and check out a temporary branch to try the split/merge thing on your own data!
]]>Disclaimer: nothing on this blog is advice about the substance of your taxes! I have no background in accounting and no idea whether this code will produce valid results. You need to verify everything yourself and then own your own mistakes or hire a tech-savvy CPA (or equivalent in your country) to go over it and fix any problems.
Today we’ll be adding historical crypto price data to hledger and using it to track portfolio value. It’s also important for calculating taxes on staking income. I’ll do a separate post on that.
Here is a tarball of the code, or you can read it on GitHub.
The top-level files are one-off demos, but import/coinpaprika will slot into the rest of the “full-fledged” system if you want it to.
I tried CoinGecko and a Kaggle dataset first, but settled on CoinPaprika because they let you download the complete price history for many different coins/tokens conveniently. No API account signup, no wasted time pasting paginated data into a spreadsheet!
The only downside is that it’s low-resolution weekly data rather than daily or hourly. I think that should be OK for most purposes, because whenever you buy or sell something the trade already has a real price; the historical data is only for estimating what you could have sold things for on other dates. And that’s inherently a little vague because it would have depended on how and when you did it. So as long as you’re not cherry-picking the data in your favor I think any reasonable authority would be OK with it. (Again: not tax advice. I don’t know whether your authorities see it that way.)
We’ll use Ethereum as an example.
Go to the coin page, set the chart range to Max, and Export → CSV.
The link worked for me in Chromium but not Firefox.
The data should look like this:
"DateTime","Price","Price in BTC","Volume (24h)"
"2015-08-03 00:00:00",2.9379853153153,0.010536816026859,145781
"2015-08-10 00:00:00",1.4312609722222,0.0054110598511272,2518067
"2015-08-17 00:00:00",1.3534692708333,0.0057977101345057,1796680
...
"2023-01-30 00:00:00",1672.2780124628,0.07017126906523,8394707643
"2023-02-06 00:00:00",1620.9206133282,0.071827640577938,6702235129 "2023-02-13 00:00:00",1688.5364890397,0.068847315571221,7708162827
I got coinpaprika.rules to generate almost-valid market price directives using empty transactions with just a date and description, then cleaned up the output in csv2journal.
# import/coinpaprika/coinpaprika.rules
skip 1
fields date,price,price_btc,volume_24h
date-format %Y-%m-%d %H:%M:%S
description ETH %price USD
# import/coinpaprika/csv2journal
hledger print --rules-file coinpaprika.rules -f "$1" | while read line; do
[[ -z "$line" ]] || echo "P $line"
done
I’ll do a post later that includes how to infer ETH from the filename rather than hard-coding it in the description. Another minor improvement would be to do everything in csv2journal and skip the rules file, but I’ll leave that as an exercise. The general philosophy is that hacks are fine as long as you version control them!
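If you try that exercise, here’s one possible shape for it: an untested Python sketch that handles both improvements at once, inferring the symbol from a filename like coinpaprika-ETH.csv and printing the P directives directly with no rules file. Field names are taken from the sample data above.

#!/usr/bin/env python3
# Untested sketch: replaces both coinpaprika.rules and the bash cleanup.
import csv, sys
from pathlib import Path

path = sys.argv[1]
symbol = Path(path).stem.split('-')[-1]       # "coinpaprika-ETH" -> "ETH"
with open(path, encoding='utf-8-sig') as f:
    for row in csv.DictReader(f):
        date = row['DateTime'].split(' ')[0]  # drop the "00:00:00" part
        print(f"P {date} {symbol} {row['Price']} USD")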
Let’s try it.
This is roughly what export/export.hs in the “full-fledged” system will do if you include ./import/coinpaprika/journal/coinpaprika-ETH.journal from one of the top-level journal files:
$ cd historical-prices
$ nix-shell
[nix-shell]$ cd import/coinpaprika
[nix-shell]$ chmod +x csv2journal
[nix-shell]$ mkdir journal
[nix-shell]$ ./csv2journal csv/coinpaprika-ETH.csv > journal/coinpaprika-ETH.journal
P 2015-08-03 2.9379853153153 USD
P 2015-08-10 1.4312609722222 USD
P 2015-08-17 1.3534692708333 USD
...
P 2023-01-30 1672.2780124628 USD
P 2023-02-06 1620.9206133282 USD
P 2023-02-13 1688.5364890397 USD
We’ll start a minimal portfolio.journal here for clarity. You can do the same thing with all.journal later.
Here are the files we’ve used so far along with the new journal:
historical-prices
├── import
│ └── coinpaprika
│ ├── coinpaprika.rules
│ ├── csv2journal
│ ├── csv
│ │ └── coinpaprika-ETH.csv
│ └── journal
│ └── coinpaprika-ETH.journal
├── portfolio.journal
└── shell.nix
;; portfolio.journal
commodity 1000.00 USD
commodity 1000.00 ETH
2015-08-04 Buy 1 ETH back in the day
assets:wallets:ancient 1 ETH
equity:opening balances
include ./import/coinpaprika/journal/coinpaprika-ETH.journal
OK, so hledger commands can be a little cryptic. When trying something new I tend to look in the manual, this cheatsheet, and then forum posts if needed. One nice feature is that order usually doesn’t matter, so you can tack more flags on the end to incrementally improve it.
This command says “Using the file portfolio.journal, show the historical balances at the end of each year until today, converted to USD value, only for accounts with ‘assets’ in their names, and transpose the table”. Whew!
[nix-shell]$ hledger -f portfolio.journal bal --historical -Y -e today -X USD assets --transpose
Ending balances (historical) in 2015-01-01..2023-12-31, valued at period ends:
|| assets:wallets:ancient |
============++========================+=============
2015-12-31 || 0.95 USD | 0.95 USD
2016-12-31 || 8.14 USD | 8.14 USD
2017-12-31 || 728.19 USD | 728.19 USD
2018-12-31 || 151.80 USD | 151.80 USD
2019-12-31 || 129.39 USD | 129.39 USD
2020-12-31 || 741.34 USD | 741.34 USD
2021-12-31 || 3709.23 USD | 3709.23 USD
2022-12-31 || 1197.17 USD | 1197.17 USD
2023-12-31 || 1688.54 USD | 1688.54 USD
Did you notice that today’s shell.nix includes tidyverse? That’s so that as a final sanity check we can increase the reporting frequency to weekly (-W) and plot USD value over time. Our 1 ETH “portfolio” should come out looking like the CoinPaprika chart at the top…
# save-portfolio-table.sh
echo 'date usd_value' > portfolio.tsv
hledger -f portfolio.journal bal --historical \
    assets -W -e today -X USD --transpose |
    grep '^\s*20' | awk '{print $1, $3}' \
>> portfolio.tsv
# plot-portfolio-table.R
require(tidyverse)
read_delim('portfolio.tsv') %>%
ggplot(aes(x=date, y=usd_value)) +
geom_line() +
ggtitle('Historical value of ETH in USD') +
xlab('') +
ylab('') +
theme_classic() +
theme(aspect.ratio=1/2)
This might seem trivial because we got the same chart back at the end, but now we’re close to a general solution! With a few more tweaks this can keep track of a real portfolio as we buy/sell/transfer things over time. In future posts I’ll explain how to make portfolio value a report in the “full-fledged” system and how to add more detailed charts by currency, location (bank/exchange/wallet), or accounting category (assets/liabilities/income/expenses).
]]>hledger
Disclaimer 1: nothing on this blog is advice about the substance of your taxes! I have no background in accounting and no idea whether this code will produce valid results. You need to verify everything yourself and then own your own mistakes or hire a tech-savvy CPA (or equivalent in your country) to go over it and fix any problems. That’s what I’ll be doing.
Disclaimer 2: this really is the hard way. If your taxes are relatively simple, consider trying one of the standard crypto tax subscription services or hiring a human instead. I would guess the break even point is probably around 5-10 different arcane data formats (including banks, exchanges, and wallets). If you’re dealing with more than that, it might be worth setting this up.
With that out of the way, I’ve recently started treating my taxes as a software/data science pipeline! It hasn’t been easy, but it has worked better so far than all the easier-sounding ways I’ve attempted to organize them before. Read on if you think you might be in the same boat…
hledger
Plain text accounting is an obvious win in general because you can version control it. But there are several good tools to choose from and many of their features overlap. I’ve gone with hledger (so far) mainly so I could follow this excellent “full-fledged hledger” tutorial. I like the principles in the README and the wiki. For me, this is the most important part:
It should be easy to work towards eventual consistency. Large and daunting tasks (like “I will process 10 years of paper mortgage statements” or “I want to import 5 years of paypal payments”) should not require one big push and perfect planning to finish them. Instead I should be able to do them bit by little bit, leaving things half-done, and picking them up later with little (mental) effort. Eventually my records would be perfect and consistent.
I might adopt the format advocated by the related hledger-flow project at some point too.
Today is a “relatively quick start” guide based on 02-getting-data-in in the full-fledged tutorial with mock exchange data rather than a bank account. I suggest starting your own repo right now, working through this first post, working through the rest of that tutorial, and finally coming back here later for crypto-specific addons like the ones in my other posts (historical price feeds, splitting imports by tax year, and so on).
You’ll probably invent some addons of your own too, and I’d love to hear about them!
Here is a tarball of today’s code to use as a template. I’ll assume you use git for simplicity, but nothing important relies on that. You can also read through it on GitHub.
Installing dependencies will probably take at least a couple minutes.
I use Nix whenever I expect a project to involve more than one language. To try it that way run the Nix install script, then open a new terminal and start nix-shell inside the repo. Alternatively the full-fledged tutorial includes a Dockerfile with a pre-built image you can pull.
Your new repo should look roughly like this (omitting generated files):
crypto-taxes-the-hard-way/
├── shell.nix (or Dockerfile)
├── 2023.journal
├── all.journal
├── config.journal
├── import
│ └── mockex
│ ├── mockex.rules
│ ├── csv
│ │ └── trades-2023.csv
│ └── csv2journal
├── export.sh
└── export
    └── export.hs
Manually written hledger journal files go at the top level, input data + code to parse it in import/, and the export script + everything it generates in export/. The main journal for each tax year depends on the previous year’s main + generated journal files. That makes it possible to automatically open and close the books.
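For example, the generated boundary between two years ends up looking roughly like this. This is my illustration of the idea, not exact output; account names and amounts depend on your setup:

2023-12-31 closing balances
    assets:exchanges:mockex    -20.00 USD = 0.00 USD
    equity:opening/closing balances

2024-01-01 opening balances
    assets:exchanges:mockex     20.00 USD
    equity:opening/closing balances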
Most of the interesting logic is in export/export.hs. It’s written with Shake, which is like Make for Haskell. It parses the journal files for include statements and uses those + some hard-coded rules to build a dependency DAG and update the relevant files in order when one of the inputs changes. export.sh is just a more obvious entry point.
Here is an ASCII diagram of data flow through the “full-fledged” system. It’s confusing at first but worth learning because it was refined over ~10 years of successful journal maintenance. For today we’ll focus on only the parts needed to answer the question, “What happens when you add or edit a CSV file?”
import/mockex/csv/* (1)
+
| csv2journal (2)
v
import/mockex/journal/* (3)
+
|
v
+------------+
|2023.journal| (4)
+------------+
| |
| v
| export/2023-* (5)
|
v
+-----------+
|all.journal| (6)
+-----------+
import/mockex/csv/trades-2023.csv
"Transaction ID","Time","Type","Asset","Amount","Fee","Price"
"078werfgsdaf","1/5/2023","buy","BTC",0.01,0.0001,45 "078blk23598s","1/2/2023","buy","BTC",0.01,0.0001,35
csv2journal parses the CSV according to import/mockex/mockex.rules, which is written in the hledger CSV rules format. You’ll need that because everyone seems to make up their own CSV conventions as they go.
The generated import/mockex/journal/trades-2023.journal in hledger format goes here. You’ll mainly look at these per-import journal files to debug your CSV rules.
2023-02-01 (078blk23598s) MockEx buy
assets:exchanges:mockex BTC0.0100
assets:exchanges:mockex USD-35
expenses:fees BTC0.0001
2023-05-01 (078werfgsdaf) MockEx buy
assets:exchanges:mockex BTC0.0100
assets:exchanges:mockex USD-45
expenses:fees BTC0.0001
You include the per-import journal file in 2023.journal by hand. That tells export/export.hs to generate it from the CSV and hledger to read it. You might also add some transactions here by hand. In this case there’s an opening balance.
;; Settings you want in all your journals
include ./config.journal
;; Opening balances
;; This only needs to be done once for the first year you track
;; After that there are auto-generated opening + closing transactions
2023/01/01 opening balances
assets:exchanges:mockex = 100.00 USD
equity:opening balances
;; Add not-yet-generated files here to tell export.hs to generate them
;; from the corresponding CSV inputs
include ./import/mockex/journal/trades-2023.journal
export/export.hs generates financial reports here along with 2023-all.journal. It looks trivial now (see next section) but once you have dozens of data sources it is very helpful to see them merged into one linear history.
You include the journal for each year into all.journal by hand. It’s what you load to look at your finances interactively.
First, generate all the files:
[nix-shell]$ ./export.sh
# csv2journal (for ../import/mockex/journal/trades-2023.journal)
# hledger (for 2023-balance-sheet.txt)
# hledger (for 2023-all.journal)
# hledger (for 2023-cash-flow.txt)
# hledger (for 2023-income-expenses.txt)
Build completed in 0.09s
This will fail if you have any unbalanced transactions according to hledger’s version of standard double-entry accounting rules.
If you’ve ever worked with Haskell this will be a similar experience: the compiler complains over and over until every little thing is fixed in your CSV rules, then suddenly it all works and magically writes a consistent history to 2023-all.journal
. Cool, right?
2023-01-01 opening balances
assets:exchanges:mockex = 100.00 USD
equity:opening balances
2023-02-01 (078blk23598s) MockEx buy
assets:exchanges:mockex 0.0100 BTC
assets:exchanges:mockex -35.0 USD
expenses:fees 0.0001 BTC
2023-05-01 (078werfgsdaf) MockEx buy
assets:exchanges:mockex 0.0100 BTC
assets:exchanges:mockex -45.0 USD
expenses:fees 0.0001 BTC
The rest of the exported files are standard financial reports. For example a balance sheet:
Balance Sheet 2023-05-01
|| 2023-05-01
=========================++=======================
Assets ||
-------------------------++-----------------------
assets:exchanges:mockex || 0.0200 BTC, 20.00 USD
-------------------------++-----------------------
|| 0.0200 BTC, 20.00 USD
=========================++=======================
Liabilities ||
-------------------------++-----------------------
-------------------------++-----------------------
||
=========================++=======================
Net: || 0.0200 BTC, 20.00 USD
You should version control all of them so you can diff them later! One of the main benefits of this system is being able to refactor aggressively and see what changed.
Try messing up the sign of a number or the name of a field in mockex.rules and re-running it. You should either get an hledger error about improper transactions or it will succeed and you can git diff your final reports. One way to get a good diff would be to change the name of the assets:exchanges:mockex account everywhere.
Finally, play around with some interactive hledger commands:
[nix-shell]$ hledger -f all.journal reg cur:USD assets
2023-01-01 opening balances assets:exchanges:mockex 100.00 USD 100.00 USD
2023-02-01 MockEx buy assets:exchanges:mockex -35.0 USD 65.00 USD
2023-05-01 MockEx buy assets:exchanges:mockex -45.0 USD 20.00 USD
[nix-shell]$ hledger -f all.journal reg cur:BTC
2023-02-01 MockEx buy assets:exchanges:mockex 0.0100 BTC 0.0100 BTC
expenses:fees 0.0001 BTC 0.0101 BTC
2023-05-01 MockEx buy assets:exchanges:mockex 0.0100 BTC 0.0201 BTC
expenses:fees 0.0001 BTC 0.0202 BTC
[nix-shell]$ hledger -f all.journal bal assets
0.0200 BTC
20.00 USD assets:exchanges:mockex
--------------------
0.0200 BTC
20.00 USD
Your finances will end up being too complicated for any one command to give a good overview, but you can do lots of small checks to build confidence that particular things are going well, then codify them as new report files to check for regressions. Towards eventual consistency!
If you’re still interested at this point, take a break to let things sink in (really!), then work through at least part of the “full-fledged hledger” series + do whatever other research seems important. Decide whether you would feel comfortable committing a bunch of time and energy to this stuff.
Speaking of which, the one major disadvantage to hledger (vs ledger or beancount) as I see it is the lack of built-in capital gains handling. There are hacky ways to work around that—I’ll do a post about my solution soon—but it’s something to be aware of from the beginning. I decided the tutorial + CSV parsing infrastructure makes it worth using anyway.
The general idea of plain text accounting is sound, so I think it makes sense to commit to learning it but only provisionally use a specific tool. The journal formats are mostly compatible, minus a few edge cases like capital gains lot handling. So just get started! Once you have your data collected + formatted and a good pipeline structure, changing a few of the commands it invokes isn’t as big a deal as it probably sounds now.
]]>Everyone agrees that GnuPG has a difficult interface, and therefore you need to follow various guides to get stuff done. But they’re often really detailed! So here’s a somewhat-less-detailed guide, intended to get you set up as quickly as possible without missing anything important.
But first, one request: I’m breaking with “best practices” a bit in that most people would advise you not to show the commands you used in case it helps an attacker. That smells of security through obscurity to me, and I think the value of sharing outweighs the risk. But if you do spot a mistake, please contact me so I can update this post rather than hacking me!
With that out of the way, here are the steps I used to set up and share my own keys. I found it easier to run through everything multiple times with new temporary --homedirs than to understand the options before trying them.
If you decide you want more background reading though, start with these:
The key generation command is interactive. Everything after # is an answer to a prompt which should be clear in context.
$ mkdir -p offline-gpghome
$ alias GPG='gpg --homedir=offline-gpghome --keyid-format=long'
$ GPG --expert --full-generate-key # 8, s, e, q, 0, y, no password, name, email, no comment
$ GPG --expert --edit-key jefdaj
> addkey # 4, 4096, 1y, y, y, no password
> addkey # 6, 4096, 1y, y, y, no password
> addkey # 8, a, s, e, q, 4096, 1y, y, y, no password
> save
That created one master [C]ertification key and separate subkeys for [S]igning, [E]ncryption, and [A]uthentication:
$ GPG --list-keys
/tmp/generate-gpg-key/offline-gpghome/pubring.kbx
-------------------------------------------------
pub rsa4096/E604517174B3D49E 2021-09-28 [C]
54C195A345205DCABC2010EEE604517174B3D49E
uid [ultimate] EMAIL REDACTED TO REDUCE SPAM
sub rsa4096/EF8F01E8B5D49300 2021-09-28 [S] [expires: 2022-09-28]
sub rsa4096/73B356CD3CA9E12B 2021-09-28 [E] [expires: 2022-09-28]
sub rsa4096/6E123E190F5FB8BD 2021-09-28 [A] [expires: 2022-09-28]
The certification key never expires, and will be treated like a cold wallet: I’ll have to dig it out of my offline backups once per year to extend or replace the other 3, or to certify anyone else’s public key (confusingly referred to as “key signing” even though you use the [C] key).
Next I export backups of all the keys.
$ mkdir offline-backup
$ GPG --armor --export-secret-keys > offline-backup/secret-keys.asc
$ GPG --armor --export-secret-subkeys > offline-backup/secret-subkeys.asc
$ GPG --armor --export > offline-backup/public-keys.asc
These will be stored offline and encrypted.
I also generate revocation certificates. They will be backed up too, but more importantly I’ll keep copies on hand to import and upload to keyservers in case I get hacked.
$ for keyid in EF8F01E8B5D49300 73B356CD3CA9E12B 6E123E190F5FB8BD; do
> GPG --output offline-backup/revoke-${keyid}.asc --gen-revoke ${keyid}
> done # answer for each: y, 1 (compromised), y
The master certification key probably won’t be hacked, but it should be revocable in case I lose my online backups. So I generate a certificate for that too. I’ll store it separately from the main backups.
$ GPG --output offline-backup/revoke-E604517174B3D49E.asc --gen-revoke E604517174B3D49E # y, 0 (no reason), y
First the public keys. We’re checking for pub in front of the master key, sub in front of each subkey, and that --list-secret-keys doesn’t list anything.
$ mkdir verify-public
$ gpg --homedir=verify-public --import offline-backup/public-keys.asc
gpg: keybox '/tmp/generate-gpg-key/verify-public/pubring.kbx' created
gpg: /tmp/generate-gpg-key/verify-public/trustdb.gpg: trustdb created
gpg: key E604517174B3D49E: public key "EMAIL REDACTED TO REDUCE SPAM" imported
gpg: Total number processed: 1
gpg: imported: 1
$ gpg --homedir=verify-public --list-keys --keyid-format=long
/tmp/generate-gpg-key/verify-public/pubring.kbx
---------------------------------------------
pub rsa4096/E604517174B3D49E 2021-09-28 [C]
54C195A345205DCABC2010EEE604517174B3D49E
uid [ unknown] EMAIL REDACTED TO REDUCE SPAM
sub rsa4096/EF8F01E8B5D49300 2021-09-28 [S] [expires: 2022-09-28]
sub rsa4096/73B356CD3CA9E12B 2021-09-28 [E] [expires: 2022-09-28]
sub rsa4096/6E123E190F5FB8BD 2021-09-28 [A] [expires: 2022-09-28]
$ gpg --homedir=verify-public --list-secret-keys --keyid-format=long
Now the secret subkeys. Look for sec# in front of the master key, meaning that you only have the public half available. You should have the private parts of the subkeys though, which is indicated with ssb.
$ mkdir verify-subkeys
$ gpg --homedir=verify-subkeys --import offline-backup/secret-subkeys.asc
gpg: keybox '/tmp/generate-gpg-key/verify-subkeys/pubring.kbx' created
gpg: /tmp/generate-gpg-key/verify-subkeys/trustdb.gpg: trustdb created
gpg: key E604517174B3D49E: public key "EMAIL REDACTED TO REDUCE SPAM" imported
gpg: To migrate 'secring.gpg', with each smartcard, run: gpg --card-status
gpg: key E604517174B3D49E: secret key imported
gpg: Total number processed: 1
gpg: imported: 1
gpg: secret keys read: 1
gpg: secret keys imported: 1
$ gpg --homedir=verify-subkeys --list-secret-keys --keyid-format=long
/tmp/generate-gpg-key/verify-subkeys/pubring.kbx
----------------------------------------------
sec# rsa4096/E604517174B3D49E 2021-09-28 [C]
54C195A345205DCABC2010EEE604517174B3D49E
uid [ unknown] EMAIL REDACTED TO REDUCE SPAM
ssb rsa4096/EF8F01E8B5D49300 2021-09-28 [S] [expires: 2022-09-28]
ssb rsa4096/73B356CD3CA9E12B 2021-09-28 [E] [expires: 2022-09-28]
ssb rsa4096/6E123E190F5FB8BD 2021-09-28 [A] [expires: 2022-09-28]
Finally the secret keys. Everything should look the same except # should be gone from sec on the master key:
$ mkdir verify-secret
$ gpg --homedir=verify-secret --import offline-backup/secret-keys.asc
gpg: keybox '/tmp/generate-gpg-key/verify-secret/pubring.kbx' created
gpg: /tmp/generate-gpg-key/verify-secret/trustdb.gpg: trustdb created
gpg: key E604517174B3D49E: public key "EMAIL REDACTED TO REDUCE SPAM" imported
gpg: key E604517174B3D49E: secret key imported
gpg: Total number processed: 1
gpg: imported: 1
gpg: secret keys read: 1
$ gpg --homedir=verify-secret --list-secret-keys --keyid-format=long
/tmp/generate-gpg-key/verify-secret/pubring.kbx
---------------------------------------------
sec rsa4096/E604517174B3D49E 2021-09-28 [C]
54C195A345205DCABC2010EEE604517174B3D49E
uid [ unknown] EMAIL REDACTED TO REDUCE SPAM
ssb rsa4096/EF8F01E8B5D49300 2021-09-28 [S] [expires: 2022-09-28]
ssb rsa4096/73B356CD3CA9E12B 2021-09-28 [E] [expires: 2022-09-28]
ssb rsa4096/6E123E190F5FB8BD 2021-09-28 [A] [expires: 2022-09-28]
Next I set passwords to protect the subkeys in case they’re stolen from my online computer…
$ for keyid in EF8F01E8B5D49300 73B356CD3CA9E12B 6E123E190F5FB8BD; do
> GPG --pinentry-mode loopback --passwd $keyid
> done # ignore error message, enter new passphrase twice
… and re-export the password-shielded versions. I also copy over the revocation certificates.
$ mkdir online-pc
$ GPG --armor --export-secret-subkeys > online-pc/secret-subkeys.asc
$ cp offline-backup/revoke-*.asc online-pc/
$ rm online-pc/revoke-E604517174B3D49E.asc
Note: I didn’t use passwords on the backups above because they will already be encrypted. Trying to set them here revealed a bug in pinentry’s handling of empty passwords! I worked around it using --pinentry-mode loopback as suggested here.
After a bit of cleanup, these are the files I’ll be keeping:
$ rm -r offline-gpghome verify-*
$ tree
.
├── offline-backup
│ ├── public-keys.asc
│ ├── revoke-6E123E190F5FB8BD.asc
│ ├── revoke-73B356CD3CA9E12B.asc
│ ├── revoke-E604517174B3D49E.asc
│ ├── revoke-EF8F01E8B5D49300.asc
│ ├── secret-keys.asc
│ └── secret-subkeys.asc
└── online-pc
├── revoke-6E123E190F5FB8BD.asc
├── revoke-73B356CD3CA9E12B.asc
├── revoke-EF8F01E8B5D49300.asc
└── secret-subkeys.asc
2 directories, 12 files
file reports that the pubkeys are public keys, the revocation certificates are signatures, and the secret keys are ASCII.
$ file */*
offline-backup/public-keys.asc: PGP public key block Public-Key (old)
offline-backup/revoke-6E123E190F5FB8BD.asc: PGP public key block Signature (old)
offline-backup/revoke-73B356CD3CA9E12B.asc: PGP public key block Signature (old)
offline-backup/revoke-E604517174B3D49E.asc: PGP public key block Signature (old)
offline-backup/revoke-EF8F01E8B5D49300.asc: PGP public key block Signature (old)
offline-backup/secret-keys.asc: ASCII text
offline-backup/secret-subkeys.asc: ASCII text
online-pc/revoke-6E123E190F5FB8BD.asc: PGP public key block Signature (old)
online-pc/revoke-73B356CD3CA9E12B.asc: PGP public key block Signature (old)
online-pc/revoke-EF8F01E8B5D49300.asc: PGP public key block Signature (old)
online-pc/secret-subkeys.asc: ASCII text
Seems reasonable. I’m ready to back up the raw keys and move the shielded subkeys to an online computer!
I import the secret subkeys, which also include the corresponding public ones.
$ gpg --import online-pc/secret-subkeys.asc
$ gpg --list-secret-keys
/home/jefdaj/.gnupg/pubring.kbx
-------------------------------
sec# rsa4096 2021-09-28 [C]
54C195A345205DCABC2010EEE604517174B3D49E
uid [ unknown] EMAIL REDACTED TO REDUCE SPAM
ssb rsa4096 2021-09-28 [S] [expires: 2022-09-28]
ssb rsa4096 2021-09-28 [E] [expires: 2022-09-28]
ssb rsa4096 2021-09-28 [A] [expires: 2022-09-28]
Just to be sure, I confirm the sec# and 3 ssbs again. I save the revocation certificates somewhere too.
There are lots of confusing options for where to publish a key these days. After reading this I decide to export to a file and upload it to keys.openpgp.org manually.
gpg --armor --export 73B356CD3CA9E12B > 73B356CD3CA9E12B.asc
I expected that would only export the [E]ncryption subkey, but it turns out everything is bundled together. I’m OK with that. If you aren’t, you could --edit-key to delete the ones you don’t want to publish first, then re-import them.
Once they’re uploaded (and my email is verified) I can search for them by key fingerprint or email, or fetch by fingerprint using gpg only:
gpg --keyserver keys.openpgp.org --recv-key 54C195A345205DCABC2010EEE604517174B3D49E
Suppose you have some secret code—a master password or cryptocurrency seed phrase are common examples—and you want to back it up. Is there a clever way to arrange things so that it’s simultaneously secure against the risk of you losing the backup and against the risk of someone else finding it?
We’ll try the naive way, build up to proper secret sharing, and finish with recommendations for how you could actually use it to back up your passwords. Some of this is based on an excellent YouTube video, some on the original 1979 paper by Adi Shamir, and the rest I came up with myself.
Most people would probably try breaking the secret code in half and storing the two halves in separate locations. For example if you have the 8-character password !f8sy=06, you could write !f8s on one piece of paper and y=06 on the other. There are two problems with that though:

1. If you lose either piece, you lose the whole password.
2. Each piece leaks information, making the rest of the password much easier to guess.
The first problem is obvious. You can see the second by using Python to estimate the entropy of each piece:
>>> import string
>>> chars = string.ascii_lowercase + string.ascii_uppercase + string.punctuation
>>> chars
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> len(chars)
84
>>> len(chars) ** 8 # possible 8-char passwords
2406758911082496
>>> len(chars) ** 4 # possible 4-char password halves
49067136
There are only around 50 million combinations per half. Which sounds like a lot, but a modern computer could probably try all of them in a couple seconds! Another way to look at it is that guessing the missing half is 50 million times easier than guessing the entire password from scratch.
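To put a rough number on that, here’s the same estimate in seconds. The guess rate is an assumed ballpark for one modern CPU, not a benchmark:

>>> guesses = 49067136   # possible 4-char halves, from above
>>> rate = 1e8           # assumed guesses per second
>>> guesses / rate       # seconds to try them all
0.49067136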
Of course, you could compensate for that by making the password longer. But the general problem remains, and it gets exponentially worse if you also try to protect yourself against losing some of the pieces (problem #1 above).
Why the trade-off? Because to protect against losing any given part of the password you have to make more than one copy of it. For example, you could break the password into 3 partially redundant pieces so that any 2 of them can be used to reconstruct it. But then each piece would have to contain more of the password:
███sy=06
!f8███06
!f8sy=██
Now you’re protected from losing any one share, but you’ve sacrificed almost all of the entropy! Someone who finds even one piece can easily guess the missing characters:
>>> len(chars) ** 3 # possible missing 3-char chunks
592704
>>> len(chars) ** 2 # possible missing 2-char chunks
7056
If you needed to, you could pick a very long password and split it up like that. But it turns out there’s a more elegant way to solve both problems at once…
The trick is to encode your secret as the Y intercept of a curve, then write down coordinates to points on that curve. Here’s how to do the simplest “2 of N” case, where the curve is a straight line and you can reconstruct it from any 2 points:
For step 1, there are many ways to encode text as numbers. We’ll use ASCII character codes:
>>> def encode(chars):
return [ord(c) for c in chars]
...
...>>> def decode(nums):
return ''.join(chr(n) for n in nums)
...
...>>> encode('!f8sy=06')
33, 102, 56, 115, 121, 61, 48, 54]
[>>>
>>> decode(encode('!f8sy=06'))
'!f8sy=06'
To keep it simple, let’s focus on secret sharing just the exclamation point:
Now anyone who has the (x,y) coordinates of at least two of the blue + brown points can draw a line to find the secret number, 33. This is much better than the naive solution above because finding one pair of coordinates leaks no information at all. You could draw an infinite number of lines through the point you found, and 128 of them would lead to Y intercepts representing ASCII characters. Which is the same as the total number of characters you would have to guess from anyway. Cool, right?
To split up the whole password we could just repeat that process 8 times.
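Here’s a tiny Python sketch of that linear case, with made-up numbers and ordinary real-number arithmetic rather than the prime fields used in practice:

>>> from random import randint
>>> secret = 33                      # ord('!')
>>> slope = randint(1, 100)          # random; thrown away after sharing
>>> points = [(x, slope * x + secret) for x in (1, 2, 3)]
>>> (x1, y1), (x2, y2) = points[0], points[2]
>>> a = (y2 - y1) / (x2 - x1)        # any 2 points give back the slope...
>>> y1 - a * x1                      # ...and then the Y intercept
33.0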
Two points are enough to define a straight line. So if we want to require more than 2 shares to reconstruct the secret, we need a more complicated curve. The number of shares needed is called the “threshold”. It’s the same as the number of variables (besides x and y) in the equation for the curve.
| Threshold | Equation               | Curve     |
|-----------|------------------------|-----------|
| 2         | y = ax + b             | linear    |
| 3         | y = ax² + bx + c       | quadratic |
| 4         | y = ax³ + bx² + cx + d | cubic     |
| …         | …                      | …         |
For example, the cubic version (threshold = 4) could be set up like this:
It’s probably counter-intuitive, but even knowing 3 of the blue + brown points on that curve doesn’t get you any closer to finding the intercept. Without 4 you might as well have none at all.
Note: Shamir’s actual scheme uses number fields defined in terms of large prime numbers; the easier-to-visualize real-number curves and equations shown here are just meant to give some basic intuition for how the math works.
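For the curious, here’s a minimal Python sketch of that prime-field version, with toy parameters I chose for a single byte. Real tools use much larger fields and add other safeguards:

import random

P = 257  # a small prime > 255, enough for one byte

def split(secret, t, n):
    '''Make n shares of secret, any t of which can recover it'''
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    f = lambda x: sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def combine(shares):
    '''Lagrange-interpolate the polynomial at x=0 to recover the secret'''
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * -xj % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

shares = split(ord('!'), t=3, n=5)
print(chr(combine(shares[:3])))  # '!'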
The final step is to hide the math from regular users who just want to back up their cryptocurrency, passwords, or other data. The ssss (“Shamir’s Secret Sharing Scheme”) package is easy to use. Pick the threshold with -t and the total number of shares with -n:
$ ssss-split -t 3 -n 5
Generating shares using a (3,5) scheme with dynamic security level.
Enter the secret, at most 128 ASCII characters: !f8sy=06
Using a 64 bit security level.
1-d8a1c623c3a614a5
2-e38ae6524ad6f239
3-40676cf2da3882d3
4-36b7e761817c962b
5-955a6dc11192e6d3
$ ssss-combine -t 3 -n 5
Enter 3 shares separated by newlines:
Share [1/3]: 1-d8a1c623c3a614a5
Share [2/3]: 3-40676cf2da3882d3
Share [3/3]: 4-36b7e761817c962b
Resulting secret: !f8sy=06
Of course there’s some other fanciness going on too. But I hope that it doesn’t look quite like magic anymore, and that you would consider using something similar to back up your actual master password or seed phrase.
One warning though: don’t use the online demo on the ssss site. You would be sending your secrets to the author + anyone snooping on the unsecure HTTP connection. Ideally you should boot into a Linux LiveCD (I like Lubuntu), install it there with sudo apt install ssss, and only save the secret shares on paper.
If you do decide to use ssss, consider using it via a program I wrote called Horcrux. It’s only a little more complicated. The main advantage is that it lets you create your secret shares once, hide them, and then encrypt new backups later without gathering enough shares to reconstruct the master password each time. I think that makes it much more likely you’ll keep regular backups, and less likely you’ll have the master password written on a Post-It note or in some similarly insecure location.
Everyone seems so interested in Cardano’s concurrency problem that I decided it’s worth posting my own half-baked solution. This is mainly based on (and almost identical to?) Chris from Mirqur’s idea. It sounds like others are thinking along similar lines as well.
Update: MELD has a simpler solution that removes the need for an auction by requiring a predefined set of UTXOs to be used as user inputs. That looks better for most use cases. I could see it having an issue with auction sniping though.
Update: only one bid can reference each original trade UTXO; subsequent bids just reference the previous state
The basic idea is to have a three-phase protocol for updating the DEX state:
First, everyone who wants to interact with the DEX posts a transaction that announces the trade they intend to make. It doesn’t actually consume the current DEX state though. They would just send their tokens to a DEX-controlled script address that acts like a queue with an attached datum specifying the trade parameters they agree to. A large number of them could be submitted at once during one or several slots.
Next, an “auction” where anyone can run a bot that gathers up all the posted trades and executes them at once according to the DEX logic. They would be paid a small fee for doing it, and have to post collateral to be slashed if they leave any valid trades out. Each “bid” has to include all previous valid transactions plus at least one new one.
After waiting a few slots for any more state transition bids to appear, the final step is to apply the highest-bidding transition to the main DEX UTXO, reward the winning bidder, and slash any other bidders’ collateral. Anyone could do this step as well but the winning bot is the one with the most incentive.
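To make the incentives concrete, here’s a loose Python model of the three phases. Everything in it (the names, the slashing rule) is my own simplification, not actual Cardano or DEX code:

# Phase 1: trades announced at the queue-like script address (IDs only)
queued = {'trade1', 'trade2', 'trade3'}

# Phase 2: bots bid candidate state transitions
bids = {
    'bot_a': {'trade1', 'trade2'},            # left out a valid trade
    'bot_b': {'trade1', 'trade2', 'trade3'},  # includes everything
}

# Phase 3: apply the best valid bid, slash bidders that omitted trades
valid = {bot: b for bot, b in bids.items() if b >= queued}
winner = max(valid, key=lambda bot: len(valid[bot]))
slashed = [bot for bot in bids if bot not in valid]
print(winner, slashed)  # bot_b ['bot_a']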
Here’s an example workflow. Rectangles are transactions and ovals are UTXOs. Color indicates who controls each thing: blue for a user, green for a bot, orange for the DEX contract.
Something like this should work for not only a DEX, but for any contract that needs someone to post a state transition and be assured that they include all the valid inputs. For example a concurrent auction contract could tip a bot for calling its close endpoint and gathering all valid bids at the end of the auction. (I imagine this would come up a lot in real auctions, since everyone would wait to snipe at the end)
The simplest way I can think of bootstrapping a generalized bot economy would be to have a cardano-batch-bot-contrib or similar repo where everyone posts code for operating a bot on their protocol, and bot operators choose which ones to include. Any stake pool operator (SPO) who wants some extra revenue could also operate a bot and customize it to run only the protocols they’re comfortable with.
Q: Is this too complicated or too slow to operate at scale?
A: I don’t think so. This protocol could probably be run on the main chain at least as fast as an oracle could come to consensus on price changes. Trades would be ordered by the slot they were submitted in, which is the same level of resolution a trivial “single-threaded” DEX would be capable of. And inside a Hydra head it could go orders of magnitude faster. Probably fast enough that users wouldn’t notice any delay at all! Finally, because cheating bots will always be caught, there should normally only be one bid.
Q: Would this introduce nondeterminism and miner-extractable value (MEV)?
A: It depends whether the DEX protocol is deterministic. Batching bots would probably end up being run by SPOs and the slot leader would be able to include their own batch transaction, but they would be unable to censor or re-order transactions unless the DEX contract + batching protocol allows it.
Q: Would batching bot operators be legally responsible for operating an order book, like what happened to EtherDelta?
A: I have no legal training whatsoever, but it seems to me that that should also depend on whether the protocol is deterministic. If so, bot operators don’t have any decision-making power to abuse and therefore don’t need to be regulated. The only choice they would have is whether to participate in the protocol at all or not. One could imagine situations where everyone might refuse to process money from a hack for example, but I think that would be better regulated at the DEX protocol level so everyone is clear on what counts as a legal trade beforehand. The founders, devs, or token holders might be responsible in that case.
Lots of things would still need to be worked out, of course.
This site is built with Hakyll. I’ve had a great experience with that so far! Here I’ll do a quick overview of how I manage it in case you want to try something similar. Most of it is based on this tutorial, but I switched to self-hosting on a VPS rather than via Github Pages.
The master branch holds the production source code. I make a new branch like master-cssfixes or master-greatidea when starting any task that has a chance of failing, then merge back into master once I know it works. All my draft posts live on one drafts branch. When one is done I check it out onto master, then rebase drafts from there.
Each post is a folder with an index.md like this and possibly some other files too: drawings, standalone scripts, etc. The post should contain links and instructions whenever you can do something non-obvious with the other files. I mainly write in Pandoc markdown, but you can use anything supported by Pandoc. Posting dates are based on the folder structure, and the rest is pulled from the markdown header. I date draft posts 2099/something, which pushes them to the top of the recent posts list and reminds me to fill in the actual posting date later.
To write I checkout the drafts branch, rebase from master if needed, and run build.sh. It builds a local copy of the site, serves it at http://localhost:8000, and auto-updates it as I change things. The tag cloud, RSS feed, CSS, and recent posts list auto-update along with the post contents. The only thing that doesn’t auto-update is the Haskell code; if I edit that I have to kill and re-run the script. One other gotcha is that I have to disable caching in Chrome and Firefox to make sure I’m not looking at old versions of the CSS.
I commit the drafts often. Then when a post is done I:

1. check it out onto master
2. build the site on master, leaving a clean git repo
3. rsync the .site folder to the server
4. rebase drafts from master
To ensure that I don’t accidentally publish draft posts I have a pre-push hook as suggested here:
# .git/hooks/pre-push
if [[ `grep 'draft'` ]]; then
echo "You really don't want to push the drafts branch. Aborting."
exit 1
fi
I also remove them in publish.sh and .gitignore:
# publish.sh
# Just in case, remove accidentally-added draft posts
rm -rf posts/2099
# .gitignore
src/posts/2099