{"channel":"barsukas","content":"<teal> <<< This is a genuinely interesting linguistic modeling problem. After reviewing the current data model, here are my thoughts.\r\nThe Core Problem\r\nEach Lemma currently represents one concept anchored to an English label. \"hat\" and \"cap\" are two separate lemmas, each with one translation per language. But languages carve up semantic space differently:\r\nEnglish: \"hat\" (broad brim) vs \"cap\" (fitted, no brim) \u2014 distinct\r\nLithuanian: \"skrybel\u0117\" (wide brim) vs \"kepur\u0117\" (general headwear/cap) \u2014 different boundary\r\nChinese: \"\u5e3d\u5b50\" covers nearly all headwear generically\r\nFrench: \"chapeau\" (hat, broad) vs \"casquette\" (baseball cap) \u2014 yet another split\r\nThe current model has no way to say \"these two lemmas occupy overlapping semantic territory, and in language X they might collapse into one word.\" >>>\r\n\r\n--MORE--\r\n\r\n[# Design Ideas #]\r\n<teal> <<< 1. Synonym Clusters via Existing Relation Groups (Minimal Change)\r\nAdd a new relation_type = \"synonym_cluster\" to the existing LemmaRelationGroup / LemmaRelationMember system. Group \"hat\" and \"cap\" together under concept_label \"headwear\".\r\nPros: Almost no schema change. Reuses existing infrastructure and release format (lemma_relations/synonym_cluster/headwear.jsonl).\r\nCons: Too blunt. It says \"these are related\" but doesn't capture how \u2014 it doesn't tell the app what to do in Lithuanian vs Chinese. A cluster of 5 near-synonyms doesn't tell you which ones merge in which languages.\r\n\r\n2. Per-Language Acceptable Alternatives on Translations\r\nAdd an alternatives field (JSON list) to LemmaTranslation. For \"hat\" in Chinese, the primary translation is \"\u5e3d\u5b50\", and for \"cap\" in Chinese, the primary translation is also \"\u5e3d\u5b50\" (or maybe \"\u9e2d\u820c\u5e3d\"). The alternatives field would note that \"\u5e3d\u5b50\" is acceptable for both.\r\nMore concretely, a new table like:\r\nTranslationAlternative:\r\n    lemma_id          \u2192 FK to Lemma\r\n    language_code     \u2192 str\r\n    alternative_text  \u2192 str  (the acceptable alternative translation)\r\n    source_lemma_id   \u2192 FK to Lemma (which lemma \"owns\" this word natively)\r\n    notes             \u2192 str (why it's acceptable)\r\n\r\nSo for \"hat\" (N09_X) in Chinese, you'd have: primary = \"\u5e3d\u5b50\", no alternatives needed (it's the generic). For \"cap\" (N09_Y) in Chinese, primary = \"\u9e2d\u820c\u5e3d\", alternative = \"\u5e3d\u5b50\" (source: hat's lemma).\r\nPros: Directly solves the Trakaido use case \u2014 the app knows which answers to accept. Per-language granularity. Works well for quiz/testing scenarios.\r\nCons: Doesn't capture the why \u2014 it's a flat list of \"also accept this.\" Maintenance burden grows combinatorially. Doesn't help with teaching (explaining the distinction to learners).\r\n3. Semantic Overlap Groups with Language-Specific Merge Flags\r\nA hybrid: create synonym clusters (idea 1), but annotate per language whether the cluster's members are distinct or merged in that language.\r\nSynonymCluster:\r\n    id, concept_label (\"headwear\")\r\n\r\nSynonymClusterMember:\r\n    cluster_id, lemma_id\r\n\r\nSynonymClusterLanguageInfo:\r\n    cluster_id\r\n    language_code\r\n    merge_type        \u2192 \"distinct\" | \"merged\" | \"partial\"\r\n    dominant_lemma_id \u2192 if merged, which lemma's translation covers the group\r\n    notes             \u2192 free text (\"kepur\u0117 covers both unless brimmed\")\r\n\r\nFor the headwear cluster in Chinese: merge_type = \"merged\", dominant_lemma_id = hat (because \u5e3d\u5b50 covers everything). In English: merge_type = \"distinct\". In Lithuanian: merge_type = \"partial\" with a note.\r\nPros: Captures the actual linguistic reality. The app can use merge_type to decide behavior (skip testing distinctions in languages where they're merged, accept either answer for \"partial\"). Useful for pedagogy \u2014 you can teach learners \"in Chinese, there's one word for all of these.\"\r\nCons: More complex schema. \"partial\" is vague and needs the notes field to be useful. Requires human curation per language per cluster.\r\n\r\n4. Concept Hierarchy with Language-Specific Lexicalization Levels\r\nModel a hypernym/hyponym tree:\r\nheadwear\r\n\u251c\u2500\u2500 hat (brimmed headwear)\r\n\u2502   \u251c\u2500\u2500 sun hat\r\n\u2502   \u2514\u2500\u2500 top hat\r\n\u251c\u2500\u2500 cap (fitted, no brim)\r\n\u2502   \u251c\u2500\u2500 baseball cap\r\n\u2502   \u2514\u2500\u2500 beanie\r\n\u2514\u2500\u2500 helmet\r\n\r\nThen annotate per language which level has a common word. Chinese lexicalizes at \"headwear\" level (\u5e3d\u5b50). English lexicalizes at \"hat\"/\"cap\" level. Lithuanian has its own split point.\r\nPros: Linguistically the most accurate model. Powerful for curriculum design \u2014 teach the generic word first, then the specific ones at higher difficulty. Naturally handles the \"how many words does this language need for this concept space\" question.\r\nCons: Most complex to build and maintain. Over-engineered for many cases (do you really need a tree for \"big\" vs \"large\"?). The hierarchy itself may be English-biased \u2014 who says \"headwear\" is the natural top node?\r\n\r\n5. Pragmatic Recommendation: Cluster + Alternatives (Ideas 1+2 Combined)\r\nMy actual recommendation is to combine the simplest version of two ideas:\r\nA) Synonym clusters using the existing relation group system (relation_type = \"synonym_cluster\"). This groups related lemmas and is cheap to implement. It lives in lemma_relations/synonym_cluster/ in the release format.\r\nB) Acceptable alternatives as a lightweight annotation on LemmaTranslation (a JSON field or a small sibling table). This gives the app the per-language \"accept this answer too\" data it needs.\r\nThe cluster provides the human/editorial grouping for curation. The alternatives provide the app-facing behavior. They're populated semi-independently \u2014 you might have a cluster of 4 words but only 2 of them share a translation in Lithuanian. >>>","created_at":"2026-02-09T15:14:21.411980","id":745,"llm_annotations":{},"parent_id":null,"processed_content":"<div class=\"mlq color-teal\"><button type=\"button\" class=\"mlq-collapse\" aria-label=\"Toggle visibility\"><span class=\"mlq-collapse-icon\">\ud83e\udd16</span></button><div class=\"mlq-content\"><p> This is a genuinely interesting linguistic modeling problem. After reviewing the current data model, here are my thoughts.\r</p>\n<p>The Core Problem\r</p>\n<p>Each Lemma currently represents one concept anchored to an English label. \"hat\" and \"cap\" are two separate lemmas, each with one translation per language. But languages carve up semantic space differently:\r</p>\n<p>English: \"hat\" (broad brim) vs \"cap\" (fitted, no brim) \u2014 distinct\r</p>\n<p>Lithuanian: \"skrybel\u0117\" (wide brim) vs \"kepur\u0117\" (general headwear/cap) \u2014 different boundary\r</p>\n<p>Chinese: \"<span class=\"annotated-chinese\" data-pinyin=\"M\u00c0O ZI\" data-definition=\"hat\">\u5e3d\u5b50</span>\" covers nearly all headwear generically\r</p>\n<p>French: \"chapeau\" (hat, broad) vs \"casquette\" (baseball cap) \u2014 yet another split\r</p>\n<p>The current model has no way to say \"these two lemmas occupy overlapping semantic territory, and in language X they might collapse into one word.\" </p></div></div>\n<div class=\"content-sigil\" aria-label=\"Extended content begins here\">&#9135;&#9135;&#9135;&#9135;&#9135;</div>\n<p><span class=\"inline-title\"> Design Ideas </span>\r</p>\n<div class=\"mlq color-teal\"><button type=\"button\" class=\"mlq-collapse\" aria-label=\"Toggle visibility\"><span class=\"mlq-collapse-icon\">\ud83e\udd16</span></button><div class=\"mlq-content\"><p> 1. Synonym Clusters via Existing Relation Groups (Minimal Change)\r</p>\n<p>Add a new relation_type = \"synonym_cluster\" to the existing LemmaRelationGroup / LemmaRelationMember system. Group \"hat\" and \"cap\" together under concept_label \"headwear\".\r</p>\n<p>Pros: Almost no schema change. Reuses existing infrastructure and release format (lemma_relations/synonym_cluster/headwear.jsonl).\r</p>\n<p>Cons: Too blunt. It says \"these are related\" but doesn't capture how \u2014 it doesn't tell the app what to do in Lithuanian vs Chinese. A cluster of 5 near-synonyms doesn't tell you which ones merge in which languages.\r</p>\n<p>2. Per-Language Acceptable Alternatives on Translations\r</p>\n<p>Add an alternatives field (JSON list) to LemmaTranslation. For \"hat\" in Chinese, the primary translation is \"<span class=\"annotated-chinese\" data-pinyin=\"M\u00c0O ZI\" data-definition=\"hat\">\u5e3d\u5b50</span>\", and for \"cap\" in Chinese, the primary translation is also \"<span class=\"annotated-chinese\" data-pinyin=\"M\u00c0O ZI\" data-definition=\"hat\">\u5e3d\u5b50</span>\" (or maybe \"<span class=\"annotated-chinese\" data-pinyin=\"Y\u0100 SH\u00c9 M\u00c0O\" data-definition=\"peaked cap\">\u9e2d\u820c\u5e3d</span>\"). The alternatives field would note that \"<span class=\"annotated-chinese\" data-pinyin=\"M\u00c0O ZI\" data-definition=\"hat\">\u5e3d\u5b50</span>\" is acceptable for both.\r</p>\n<p>More concretely, a new table like:\r</p>\n<p>TranslationAlternative:\r</p>\n<p>    lemma_id          \u2192 FK to Lemma\r</p>\n<p>    language_code     \u2192 str\r</p>\n<p>    alternative_text  \u2192 str  (the acceptable alternative translation)\r</p>\n<p>    source_lemma_id   \u2192 FK to Lemma (which lemma \"owns\" this word natively)\r</p>\n<p>    notes             \u2192 str (why it's acceptable)\r</p>\n<p>So for \"hat\" (N09_X) in Chinese, you'd have: primary = \"<span class=\"annotated-chinese\" data-pinyin=\"M\u00c0O ZI\" data-definition=\"hat\">\u5e3d\u5b50</span>\", no alternatives needed (it's the generic). For \"cap\" (N09_Y) in Chinese, primary = \"<span class=\"annotated-chinese\" data-pinyin=\"Y\u0100 SH\u00c9 M\u00c0O\" data-definition=\"peaked cap\">\u9e2d\u820c\u5e3d</span>\", alternative = \"<span class=\"annotated-chinese\" data-pinyin=\"M\u00c0O ZI\" data-definition=\"hat\">\u5e3d\u5b50</span>\" (source: hat's lemma).\r</p>\n<p>Pros: Directly solves the Trakaido use case \u2014 the app knows which answers to accept. Per-language granularity. Works well for quiz/testing scenarios.\r</p>\n<p>Cons: Doesn't capture the why \u2014 it's a flat list of \"also accept this.\" Maintenance burden grows combinatorially. Doesn't help with teaching (explaining the distinction to learners).\r</p>\n<p>3. Semantic Overlap Groups with Language-Specific Merge Flags\r</p>\n<p>A hybrid: create synonym clusters (idea 1), but annotate per language whether the cluster's members are distinct or merged in that language.\r</p>\n<p>SynonymCluster:\r</p>\n<p>    id, concept_label (\"headwear\")\r</p>\n<p>SynonymClusterMember:\r</p>\n<p>    cluster_id, lemma_id\r</p>\n<p>SynonymClusterLanguageInfo:\r</p>\n<p>    cluster_id\r</p>\n<p>    language_code\r</p>\n<p>    merge_type        \u2192 \"distinct\" | \"merged\" | \"partial\"\r</p>\n<p>    dominant_lemma_id \u2192 if merged, which lemma's translation covers the group\r</p>\n<p>    notes             \u2192 free text (\"kepur\u0117 covers both unless brimmed\")\r</p>\n<p>For the headwear cluster in Chinese: merge_type = \"merged\", dominant_lemma_id = hat (because <span class=\"annotated-chinese\" data-pinyin=\"M\u00c0O ZI\" data-definition=\"hat\">\u5e3d\u5b50</span> covers everything). In English: merge_type = \"distinct\". In Lithuanian: merge_type = \"partial\" with a note.\r</p>\n<p>Pros: Captures the actual linguistic reality. The app can use merge_type to decide behavior (skip testing distinctions in languages where they're merged, accept either answer for \"partial\"). Useful for pedagogy \u2014 you can teach learners \"in Chinese, there's one word for all of these.\"\r</p>\n<p>Cons: More complex schema. \"partial\" is vague and needs the notes field to be useful. Requires human curation per language per cluster.\r</p>\n<p>4. Concept Hierarchy with Language-Specific Lexicalization Levels\r</p>\n<p>Model a hypernym/hyponym tree:\r</p>\n<p>headwear\r</p>\n<p>\u251c\u2500\u2500 hat (brimmed headwear)\r</p>\n<p>\u2502   \u251c\u2500\u2500 sun hat\r</p>\n<p>\u2502   \u2514\u2500\u2500 top hat\r</p>\n<p>\u251c\u2500\u2500 cap (fitted, no brim)\r</p>\n<p>\u2502   \u251c\u2500\u2500 baseball cap\r</p>\n<p>\u2502   \u2514\u2500\u2500 beanie\r</p>\n<p>\u2514\u2500\u2500 helmet\r</p>\n<p>Then annotate per language which level has a common word. Chinese lexicalizes at \"headwear\" level (<span class=\"annotated-chinese\" data-pinyin=\"M\u00c0O ZI\" data-definition=\"hat\">\u5e3d\u5b50</span>). English lexicalizes at \"hat\"/\"cap\" level. Lithuanian has its own split point.\r</p>\n<p>Pros: Linguistically the most accurate model. Powerful for curriculum design \u2014 teach the generic word first, then the specific ones at higher difficulty. Naturally handles the \"how many words does this language need for this concept space\" question.\r</p>\n<p>Cons: Most complex to build and maintain. Over-engineered for many cases (do you really need a tree for \"big\" vs \"large\"?). The hierarchy itself may be English-biased \u2014 who says \"headwear\" is the natural top node?\r</p>\n<p>5. Pragmatic Recommendation: Cluster + Alternatives (Ideas 1+2 Combined)\r</p>\n<p>My actual recommendation is to combine the simplest version of two ideas:\r</p>\n<p>A) Synonym clusters using the existing relation group system (relation_type = \"synonym_cluster\"). This groups related lemmas and is cheap to implement. It lives in lemma_relations/synonym_cluster/ in the release format.\r</p>\n<p>B) Acceptable alternatives as a lightweight annotation on LemmaTranslation (a JSON field or a small sibling table). This gives the app the per-language \"accept this answer too\" data it needs.\r</p>\n<p>The cluster provides the human/editorial grouping for curation. The alternatives provide the app-facing behavior. They're populated semi-independently \u2014 you might have a cluster of 4 words but only 2 of them share a translation in Lithuanian. </p></div></div>","quotes":[],"subject":"in which the machine tries to design a solution for synonyms"}
