case study · personal data pipeline · running since 2026

Anki vocab pipeline

The mess

Learning six languages produces data faster than any one tool can hold it. Vocabulary arrived from everywhere: words typed into a phone note on the train, immersion sentences captured mid-episode, entries queued in a spreadsheet at a desk. Each capture path had its own format, the exporters disagreed on encodings, and the Anki review logs that tracked all of it had to be normalized before they agreed on anything.

The result was the classic personal data failure: the words existed, the reviews existed, and none of it lined up into cards without an evening of hand-formatting.

Constraints

Capture has to work offline on a phone and reconcile later, because vocabulary shows up on trains and in episodes, not at desks. The output format is nonstandard: Yomitan handles dictionary lookups during card creation, so the pipeline has to emit HTML that Yomitan can read. And the whole thing must run as one command, because a study habit dies the day it needs maintenance.

Architecture

capture: phone note offline, or straight into the queue sheet
vocab queue sheet: single source of truth, language-tagged per row
Python pipeline: parses, validates, marks rows done
sentence generation: Claude writes examples for new vocab, immersion passes through
Yomitan-ready HTML: dated session file per run
Anki cards: created with dictionary lookups intact

One queue, two ways in, one command out. Review logs are normalized on the side into a single dataset.
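
The HTML stage can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the `session-YYYY-MM-DD.html` naming scheme, and the one-paragraph-per-sentence layout are all assumptions. The only thing it relies on is that the output is plain, well-escaped HTML text that a browser extension can scan.

```python
from datetime import date
from html import escape
from pathlib import Path


def render_session(sentences: list[tuple[str, str]]) -> str:
    """Render (language, sentence) pairs as a minimal HTML page."""
    items = "\n".join(
        # Escape both the attribute value and the sentence text.
        f'  <p lang="{escape(lang, quote=True)}">{escape(text)}</p>'
        for lang, text in sentences
    )
    return (
        "<!DOCTYPE html>\n"
        '<html>\n<head><meta charset="utf-8"><title>Session</title></head>\n'
        f"<body>\n{items}\n</body>\n</html>\n"
    )


def write_session(sentences: list[tuple[str, str]], out_dir: Path, day: date) -> Path:
    """Write one file per run, named by ISO date (hypothetical naming scheme)."""
    path = out_dir / f"session-{day.isoformat()}.html"
    path.write_text(render_session(sentences), encoding="utf-8")
    return path
```

With this naming, a second run on the same day would overwrite the first file; a real implementation would need a time suffix or an append mode if that matters.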

What was built

Two Python entry points share one engine. The main script reads unprocessed rows from the queue sheet, generates example sentences for new vocabulary, passes captured immersion sentences through untouched, writes a dated session file of Yomitan-ready HTML, and marks each row done. The offline script takes the other road in: it parses pipe-delimited entries straight off the clipboard from a phone note, syncs them into the same sheet, then continues down the same pipeline.
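
As a sketch of the offline path's parsing step, assume each line of the phone note has the hypothetical shape `term | note | language`, with the last two fields optional. The real field order is not stated here, and `QueueRow` and `parse_note` are illustrative names. Reading the clipboard and syncing to the sheet are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueRow:
    """One row bound for the queue sheet (field names are illustrative)."""
    term: str
    note: str = ""
    language: Optional[str] = None  # None means "detect later"
    done: bool = False


def parse_note(text: str) -> list[QueueRow]:
    """Parse pipe-delimited lines, skipping blank lines and lines with no term."""
    rows = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if not fields[0]:
            continue  # blank line, or a line that starts with a pipe
        note = fields[1] if len(fields) > 1 else ""
        language = fields[2].lower() if len(fields) > 2 and fields[2] else None
        rows.append(QueueRow(fields[0], note, language))
    return rows
```

For example, `parse_note("猫 | cat | JA\n사과|apple")` yields two rows: the first tagged `ja`, the second with no tag, left for detection further down the pipeline.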

Every row carries a language tag, with character detection covering legacy rows that never had one. Japanese, Korean and Mandarin run through the queue today, and the same normalized format carries the rest of the six.
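
One plausible way to do character detection for untagged rows is to check Unicode block ranges. The sketch below is an assumption about the approach, not the pipeline's actual logic, and `detect_language` is a hypothetical name. It covers only the three languages currently in the queue.

```python
from typing import Optional


def detect_language(text: str) -> Optional[str]:
    """Guess ja / ko / zh from Unicode ranges; None if no CJK script is found."""
    has_han = False
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:  # Hiragana and Katakana blocks
            return "ja"
        if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:  # Hangul syllables, jamo
            return "ko"
        if 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs
            has_han = True
    # Han characters with no kana or Hangul are treated as Mandarin.
    return "zh" if has_han else None
```

The known weak spot is a Japanese word written only in kanji, which this heuristic labels `zh`. That is one reason an explicit tag on every new row is the safer default, with detection kept as a fallback for legacy rows.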

Outcome

Capture is now a ten-second act on whatever device is nearest, and card creation is one command at a desk. The queue sheet stays the single source of truth, nothing gets lost between devices, and the dated session files keep a record of every word that ever entered the system.