This is a genuinely interesting linguistic modeling problem. After reviewing the current data model, here are my thoughts.
The Core Problem
Each Lemma currently represents one concept anchored to an English label. "hat" and "cap" are two separate lemmas, each with one translation per language. But languages carve up semantic space differently:
English: "hat" (broad brim) vs "cap" (fitted, no brim) — distinct
Lithuanian: "skrybelė" (wide brim) vs "kepurė" (general headwear/cap) — different boundary
Chinese: "帽子" covers nearly all headwear generically
French: "chapeau" (hat, broad) vs "casquette" (baseball cap) — yet another split
The current model has no way to say "these two lemmas occupy overlapping semantic territory, and in language X they might collapse into one word."
Design Ideas
1. Synonym Clusters via Existing Relation Groups (Minimal Change)
Add a new relation_type = "synonym_cluster" to the existing LemmaRelationGroup / LemmaRelationMember system. Group "hat" and "cap" together under concept_label "headwear".
Pros: Almost no schema change. Reuses existing infrastructure and release format (lemma_relations/synonym_cluster/headwear.jsonl).
Cons: Too blunt. It says "these are related" but doesn't capture how — it doesn't tell the app what to do in Lithuanian vs Chinese. A cluster of 5 near-synonyms doesn't tell you which ones merge in which languages.
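A minimal sketch of what idea 1 gives you in practice, assuming a relation group carries a relation_type, a concept_label, and a list of member lemma IDs. The RelationGroup class, its field names, and the "hat" / "cap" IDs are hypothetical stand-ins, not the actual LemmaRelationGroup / LemmaRelationMember schema.

```python
from dataclasses import dataclass, field


@dataclass
class RelationGroup:
    """Hypothetical stand-in for a LemmaRelationGroup plus its members."""
    relation_type: str
    concept_label: str
    member_lemma_ids: list[str] = field(default_factory=list)


def cluster_siblings(groups: list[RelationGroup], lemma_id: str) -> set[str]:
    """Return every other lemma that shares a synonym cluster with lemma_id."""
    siblings: set[str] = set()
    for group in groups:
        if group.relation_type != "synonym_cluster":
            continue  # ignore other relation types stored in the same system
        if lemma_id in group.member_lemma_ids:
            siblings.update(group.member_lemma_ids)
    siblings.discard(lemma_id)
    return siblings


# Illustrative data only: lemma IDs here are placeholders.
groups = [RelationGroup("synonym_cluster", "headwear", ["hat", "cap"])]
print(cluster_siblings(groups, "hat"))
```

The lookup can answer "what is related to hat?", but nothing in the structure says whether the two words collapse in a given language, which is the limitation the cons describe.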
2. Per-Language Acceptable Alternatives on Translations
Add an alternatives field (JSON list) to LemmaTranslation. For "hat" in Chinese, the primary translation is "帽子". For "cap" in Chinese, the primary translation might also be "帽子", or it might be the more specific "鸭舌帽". In the second case, the alternatives field on "cap" would record that "帽子" is also an acceptable answer.
More concretely, a new table like:
TranslationAlternative:
    lemma_id → FK to Lemma
    language_code → str
    alternative_text → str (the acceptable alternative translation)
    source_lemma_id → FK to Lemma (which lemma "owns" this word natively)
    notes → str (why it's acceptable)
So for "hat" (N09_X) in Chinese, you'd have: primary = "帽子", no alternatives needed (it's the generic). For "cap" (N09_Y) in Chinese, primary = "鸭舌帽", alternative = "帽子" (source: hat's lemma).
Pros: Directly solves the Trakaido use case — the app knows which answers to accept. Per-language granularity. Works well for quiz/testing scenarios.
Cons: Doesn't capture the why: it's a flat list of "also accept this." Maintenance burden grows quickly, since every overlapping pair of lemmas needs its own entry in every language where the overlap holds. Doesn't help with teaching (explaining the distinction to learners).
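A rough sketch of how the app might use such a table to decide which answers to accept. The field names mirror the TranslationAlternative outline above; the language code "zh", the lemma IDs "hat" and "cap" (standing in for N09_X and N09_Y), and the accepted_answers helper are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationAlternative:
    lemma_id: str          # lemma being answered
    language_code: str
    alternative_text: str  # the acceptable alternative translation
    source_lemma_id: str   # lemma that "owns" this word natively
    notes: str = ""


def accepted_answers(lemma_id, language_code, primary, alternatives):
    """Return the primary translation plus any alternatives recorded
    for this lemma in this language."""
    accepted = {primary}
    for alt in alternatives:
        if alt.lemma_id == lemma_id and alt.language_code == language_code:
            accepted.add(alt.alternative_text)
    return accepted


# Illustrative rows only, following the hat/cap example in the text.
alternatives = [
    TranslationAlternative(
        lemma_id="cap",
        language_code="zh",
        alternative_text="帽子",
        source_lemma_id="hat",
        notes="generic headwear word",
    ),
]

print(accepted_answers("cap", "zh", "鸭舌帽", alternatives))
print(accepted_answers("hat", "zh", "帽子", alternatives))
```

Keeping source_lemma_id on each row means a curator can later trace an accepted answer back to the lemma it was borrowed from, even though the quiz logic itself never reads it.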
3. Semantic Overlap Groups with Language-Specific Merge Flags
A hybrid: create synonym clusters (idea 1), but annotate per language whether the cluster's members are distinct or merged in that language.
SynonymCluster:
    id, concept_label ("headwear")
SynonymClusterMember:
    cluster_id, lemma_id
SynonymClusterLanguageInfo:
    cluster_id
    language_code
    merge_type → "distinct" | "merged" | "partial"
    dominant_lemma_id → if merged, which lemma's translation covers the group
    notes → free text ("kepurė covers both unless brimmed")
For the headwear cluster in Chinese: merge_type = "merged", dominant_lemma_id = hat (because 帽子 covers everything). In English: merge_type = "distinct". In Lithuanian: merge_type = "partial" with a note.
Pros: Captures the actual linguistic reality. The app can use merge_type to decide behavior (skip testing distinctions in languages where they're merged, accept either answer for "partial"). Useful for pedagogy — you can teach learners "in Chinese, there's one word for all of these."
Cons: More complex schema. "partial" is vague and needs the notes field to be useful. Requires human curation per language per cluster.
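One way the app could turn merge_type into behavior, sketched under assumptions. ClusterLanguageInfo mirrors the SynonymClusterLanguageInfo outline above; the quiz_policy function, its three return values, and the language codes are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterLanguageInfo:
    cluster_id: str
    language_code: str
    merge_type: str                          # "distinct" | "merged" | "partial"
    dominant_lemma_id: Optional[str] = None  # only meaningful when merged
    notes: str = ""


def quiz_policy(info: ClusterLanguageInfo) -> str:
    """Map a cluster's merge_type in one language to a quiz behavior."""
    if info.merge_type == "merged":
        return "skip_distinction"   # one word covers the group, nothing to test
    if info.merge_type == "partial":
        return "accept_either"      # boundary is fuzzy, so be lenient
    if info.merge_type == "distinct":
        return "test_distinction"   # the language keeps the words apart
    raise ValueError(f"unknown merge_type: {info.merge_type!r}")


# Rows follow the headwear example in the text; IDs are placeholders.
zh = ClusterLanguageInfo("headwear", "zh", "merged", dominant_lemma_id="hat")
en = ClusterLanguageInfo("headwear", "en", "distinct")
lt = ClusterLanguageInfo("headwear", "lt", "partial",
                         notes="kepurė covers both unless brimmed")

for info in (zh, en, lt):
    print(info.language_code, quiz_policy(info))
```

Raising on an unknown merge_type is a deliberate choice here: a typo in curated data would fail loudly instead of silently falling back to one of the three behaviors.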
4. Concept Hierarchy with Language-Specific Lexicalization Levels
Model a hypernym/hyponym tree:
headwear
├── hat (brimmed headwear)
│   ├── sun hat
│   └── top hat
├── cap (fitted, no brim)
│   ├── baseball cap
│   └── beanie
└── helmet
Then annotate per language which level has a common word. Chinese lexicalizes at "headwear" level (帽子). English lexicalizes at "hat"/"cap" level. Lithuanian has its own split point.
Pros: Linguistically the most accurate model. Powerful for curriculum design — teach the generic word first, then the specific ones at higher difficulty. Naturally handles the "how many words does this language need for this concept space" question.
Cons: Most complex to build and maintain. Over-engineered for many cases (do you really need a tree for "big" vs "large"?). The hierarchy itself may be English-biased — who says "headwear" is the natural top node?
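A small sketch of the lexicalization-level idea, assuming the tree is stored as a child-to-parent map and each language lists which nodes have a common word. PARENT, LEXICALIZED, nearest_word, and the language codes are hypothetical; only the Chinese and English entries stated in the text are filled in.

```python
from typing import Optional

# Child -> parent map for the example tree (None marks the root).
PARENT = {
    "headwear": None,
    "hat": "headwear",
    "sun hat": "hat",
    "top hat": "hat",
    "cap": "headwear",
    "baseball cap": "cap",
    "beanie": "cap",
    "helmet": "headwear",
}

# Nodes that have a common word, per language. Deliberately incomplete:
# only the levels named in the discussion are listed.
LEXICALIZED = {
    "zh": {"headwear": "帽子"},
    "en": {"hat": "hat", "cap": "cap"},
}


def nearest_word(node: str, language_code: str) -> Optional[str]:
    """Walk from node up toward the root and return the first common
    word found for the language, or None if no level is annotated."""
    words = LEXICALIZED.get(language_code, {})
    current: Optional[str] = node
    while current is not None:
        if current in words:
            return words[current]
        current = PARENT[current]
    return None


print(nearest_word("baseball cap", "zh"))
print(nearest_word("beanie", "en"))
```

Walking upward is what makes "teach the generic word first" cheap to compute: a specific concept falls back to the closest ancestor the language actually has a word for.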
5. Pragmatic Recommendation: Cluster + Alternatives (Ideas 1+2 Combined)
My actual recommendation is to combine the simplest versions of ideas 1 and 2:
A) Synonym clusters using the existing relation group system (relation_type = "synonym_cluster"). This groups related lemmas and is cheap to implement. It lives in lemma_relations/synonym_cluster/ in the release format.
B) Acceptable alternatives as a lightweight annotation on LemmaTranslation (a JSON field or a small sibling table). This gives the app the per-language "accept this answer too" data it needs.
The cluster provides the human/editorial grouping for curation. The alternatives provide the app-facing behavior. They're populated semi-independently — you might have a cluster of 4 words but only 2 of them share a translation in Lithuanian.
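As a sketch of how the two pieces could support each other during curation, the helper below takes one cluster and reports which members share a primary translation in a given language, which are natural candidates for alternatives entries. The function name, the (lemma_id, language_code) key shape, and all data are placeholders, not real vocabulary or the actual schema.

```python
from collections import defaultdict


def shared_translations(cluster_lemma_ids, translations, language_code):
    """Group cluster members by primary translation in one language and
    return only the texts used by more than one member."""
    by_text = defaultdict(set)
    for lemma_id in cluster_lemma_ids:
        text = translations.get((lemma_id, language_code))
        if text is not None:
            by_text[text].add(lemma_id)
    return {text: ids for text, ids in by_text.items() if len(ids) > 1}


# Placeholder cluster of four lemmas; "w1".."w3" stand in for real words.
cluster = ["lemma_a", "lemma_b", "lemma_c", "lemma_d"]
translations = {
    ("lemma_a", "lt"): "w1",
    ("lemma_b", "lt"): "w1",
    ("lemma_c", "lt"): "w2",
    ("lemma_d", "lt"): "w3",
}

print(shared_translations(cluster, translations, "lt"))
```

Here a cluster of four yields only one shared pair, matching the point that the cluster and the alternatives are populated semi-independently.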