{"channel":"barsukas","content":"[# The Problem #]\r\n\r\nBuilding a language-learning app is a data problem disguised as a software problem.\r\n\r\nTrakaido is a language-learning application. To teach someone Lithuanian, or Swahili, or Japanese, the app needs more than a translation dictionary. It needs to know that Lithuanian nouns decline into seven grammatical cases, and which declension class each noun belongs to. It needs to know that Chinese has no verb conjugation but every noun requires a specific measure word \u2014 you count flat things differently from round things differently from long things \u2014 and that the measure word depends on the individual noun, not just its shape. It needs to know that German nouns have grammatical gender that doesn't track biological sex or English intuition: *die Sonne* (the sun) is feminine, *der Mond* (the moon) is masculine, *das Kind* (the child) is neuter. It needs to know that Swahili verb infinitives take a *ku-* prefix. These aren't edge cases \u2014 they are the grammar, and the app needs them for every word.\r\n\r\nIt also needs example sentences, IPA pronunciations, synonyms, difficulty ratings, and translations \u2014 not into one other language, but into fourteen simultaneously, with more handled on the margins. And it needs all of this organized: not just a flat list of words, but words grouped by semantic category (food nouns, motion verbs, perception adjectives) and difficulty level, so the app can present vocabulary in a sensible pedagogical order.\r\n\r\nThat data does not exist in one place, in one consistent format, at the required level of detail, for all of these languages. So we built Greenland to create it.\r\n\r\n[# What Greenland Is #]\r\n\r\nGreenland is the backend data and tooling repository for Trakaido. 
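The per-word facts described above all have to live somewhere. As an illustration only \u2014 the field names below are hypothetical, not Greenland's actual schema \u2014 one word's record might look like:

```python
# One word's record, as an illustration only: the field names are
# hypothetical, not Greenland's actual schema.
import json

record = {
    "lemma": "sun",
    "pos": "noun",
    "translations": {"de": "Sonne", "fr": "soleil", "lt": "saulė"},
    "grammar_facts": {
        "de": {"gender": "feminine", "article": "die"},  # die Sonne
        "zh": {"measure_word": "个"},     # value assumed for illustration
        "lt": {"declension_class": 2},    # value assumed for illustration
    },
}

# Serializable as a single JSON line per word.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Each primary language adds a translation, and several add language-specific facts on top; the record grows with each language, not just with each feature.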
Greenland builds, maintains, and exports a multilingual linguistic database covering fourteen primary languages \u2014 English, Lithuanian, Chinese (Simplified), French, German, Spanish, Portuguese, Korean, Swahili, Vietnamese, Japanese, Italian, Dutch, and Swedish \u2014 plus dozens of additional languages handled for source data and supplementary content.\r\n\r\nThe database currently holds around 2,500 words across 90 semantic categories. For each word, it stores translations into all primary languages, inflected forms (conjugations, declensions, plurals), IPA pronunciations, example sentences with translations, grammar facts, synonyms, and learner difficulty levels.\r\n\r\nWords are organized by part of speech and semantic category: food nouns, motion verbs, manner adverbs, and so on. This organization matters both pedagogically \u2014 it shapes how the app sequences material \u2014 and operationally, since the tooling processes batches of related words together.\r\n\r\n[# How It Works #]\r\n\r\n*The Data Layer*\r\n\r\nThe canonical source of truth is a set of JSONL files in the repository \u2014 one file per semantic category, one JSON record per word. Because this lives in version control, every word added, changed, or removed is visible as a code change and can be reviewed like any other commit.\r\n\r\nThe live working database is SQLite. It extends the JSONL baseline with everything generated by the automation pipeline: full morphological forms, sentences, pronunciations, grammar facts, operation logs. It can be rebuilt from the JSONL files at any time, and changes made through the tools can be exported back to JSONL to keep the repository current.\r\n\r\n*The Automation Agents*\r\n\r\nA fleet of autonomous scripts performs the bulk of data generation and quality assurance. Each agent is named after a Lithuanian animal and performs a specific task.\r\n\r\nVoras (the Spider) generates and validates translations. 
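Part of translation validation needs no LLM at all: coverage of the fourteen primary languages is a purely structural check. A minimal sketch \u2014 function and field names here are hypothetical, not Voras's actual interface:

```python
# A structural check a translation agent can run before (and after) any
# LLM call: is every primary language covered? Function and field names
# here are hypothetical, not Voras's actual interface.
PRIMARY = ["en", "lt", "zh", "fr", "de", "es", "pt",
           "ko", "sw", "vi", "ja", "it", "nl", "sv"]  # the 14 primary languages

def missing_translations(record):
    """Return the primary-language codes this word record does not yet cover."""
    have = record.get("translations", {})
    return [code for code in PRIMARY if not have.get(code, "").strip()]

word = {"lemma": "bread", "translations": {"en": "bread", "lt": "duona", "de": "Brot"}}
print(missing_translations(word))  # the eleven languages still to generate
```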
Vilkas (the Wolf) generates word forms \u2014 conjugations, declensions, plurals \u2014 for six languages. \u017dvirblis (the Sparrow) generates example sentences. Lape (the Fox) generates language-specific grammar facts: which measure word a Chinese noun takes, the grammatical gender of French and German nouns, the declension class of Lithuanian nouns. Lokys (the Bear) validates English lemma forms and definitions. Papuga (the Parrot) generates IPA pronunciations. \u0160ernas (the Boar) generates synonyms and alternative forms. Sarka (the Magpie) generates conversational example dialogues.\r\n\r\nOn the quality and export side: Bebras (the Beaver) checks database integrity. Povas (the Peacock) generates HTML reports. Ungurys (the Eel) and Elnias (the Deer) export the database in the formats the app consumes. Strazdas (the Thrush) and Vieversys (the Lark) generate audio files via eSpeak-NG and OpenAI TTS respectively.\r\n\r\nThe agents run in a pipeline: initialization, then translations, then word forms and grammar facts, then sentences, then audio, then export. Each supports a read-only check mode, a dry-run preview, and a fix mode that writes changes. They are idempotent and safe to run repeatedly as the database grows.\r\n\r\nGeneration is done by LLMs \u2014 OpenAI, Anthropic, and Google models, as well as local models via Ollama and LM Studio. Results are stored so each word is only processed once; humans review and correct output through the web editor.\r\n\r\n*The Benchmark Suite*\r\n\r\nGreenland includes a framework for evaluating the LLMs it relies on. 
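At its core, such a framework scores model answers against known-correct answers. A deliberately minimal sketch \u2014 all names are illustrative, not Greenland's benchmark API:

```python
# A deliberately minimal benchmark core: score a model's answers against
# known-correct answers. All names are illustrative, not Greenland's API.
def run_benchmark(ask_model, cases):
    """cases: list of (prompt, expected) pairs; ask_model: prompt -> answer."""
    correct = sum(
        1 for prompt, expected in cases
        if ask_model(prompt).strip().lower() == expected.lower()
    )
    return correct / len(cases)

# Stand-in "model": a lookup table that gets one of two answers wrong.
canned = {"plural of 'child'": "children", "antonym of 'hot'": "warm"}
cases = [("plural of 'child'", "children"), ("antonym of 'hot'", "cold")]
print(run_benchmark(canned.get, cases))  # → 0.5
```

The real suite wraps many such case lists \u2014 one per skill \u2014 around live model calls; scoring against ground truth is what lets systematic errors surface as a falling number rather than as bad data.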
Over 40 benchmarks test different skills:\r\n\r\n- **Token and word processing**: syllable counting, spell checking, IPA transcription, plural generation\r\n- **Lexical semantics**: antonyms, synonyms, definitions, part-of-speech tagging, lemma identification\r\n- **Morphology**: verb form generation, lemma validation\r\n- **Translation**: nine language pairs spanning European and Asian languages\r\n- **Mathematics**: arithmetic, algebra, geometry, unit conversion, word problems, time arithmetic\r\n- **General knowledge**: geography, historical dates, book-author matching, food classification, syllogism validity\r\n\r\nSome benchmarks directly test agent decisions \u2014 for example, whether Lokys's lemma validation judgments match known-correct answers. This closes a quality loop: the models used to build the database are evaluated against ground truth, so systematic errors can be detected before they propagate through thousands of words.\r\n\r\n*Barsukas: The Web Editor*\r\n\r\nBarsukas is a local Flask application for human curators to interact with the database. It provides a full editing interface: browsing and searching the lemma inventory, editing translations and definitions, reviewing and correcting example sentences, managing pronunciations.\r\n\r\nA sync interface compares the live database against the JSONL files \u2014 showing additions, removals, difficulty changes, and translation changes \u2014 and lets curators apply or reject each category. Selected agent workflows can be triggered from within Barsukas, so a curator can generate a translation or sentence for a specific word without going to the command line. Operation logs record what changed, when, and by which agent or user, making it possible to trace the provenance of any piece of data.\r\n\r\n*The Export Layer*\r\n\r\nUngurys exports the database into WireWord format \u2014 the JSON structure the app consumes \u2014 organized by difficulty level and part of speech. 
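The grouping itself is simple to picture. The sketch below shows the shape of a difficulty-and-part-of-speech-keyed export; the nesting is a guess at the idea, not the actual WireWord structure:

```python
# Sketch of grouping word records by difficulty and part of speech for
# export. The nesting is a guess at the idea; the actual WireWord format
# is not reproduced here.
from collections import defaultdict
import json

words = [
    {"lemma": "bread",   "pos": "noun",   "difficulty": 1},
    {"lemma": "run",     "pos": "verb",   "difficulty": 1},
    {"lemma": "quickly", "pos": "adverb", "difficulty": 2},
]

export = defaultdict(lambda: defaultdict(list))
for w in words:
    export[w["difficulty"]][w["pos"]].append(w["lemma"])

print(json.dumps(export, sort_keys=True))
# → {"1": {"noun": ["bread"], "verb": ["run"]}, "2": {"adverb": ["quickly"]}}
```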
Elnias produces a minimal bootstrap format. Audio files, generated by Strazdas and Vieversys, are uploaded to S3.\r\n\r\n[# Why This Approach #]\r\n\r\nThe core design choice is to use LLMs for generation and humans for judgment. LLMs handle the volume: translating 2,500 words into 14 languages, generating every conjugation of every verb in six languages, identifying the measure word for every Chinese noun. Humans handle the exceptions, corrections, and quality bar \u2014 through the web editor, through reviewing commits, through the benchmark suite catching systematic failures.\r\n\r\nThe JSONL files in the repository make this sustainable. Every word has a traceable record. The database can be audited, reconstructed, and extended without losing history. New languages, new word categories, or new types of linguistic data can be added incrementally without disrupting what's already there.","created_at":"2026-03-09T17:02:50.752863","id":772,"llm_annotations":{},"parent_id":null,"processed_content":"","quotes":[],"subject":"Greenland: A Multilingual Linguistic Database and Toolchain"}
