{"channel":"llm","content":"Other than the << fit and finish >>, the *Greenland* project is done.\r\n\r\nI have all the metrics, timing data, and performance data I need. (<orange> well, actually, there will probably be a half-dozen more \"exemplars\", and a half-dozen more \"models\".  but adding each is one line of code, plus 5-30 minutes of \"wait for it to update\") (<orange> well, actually, adding an exemplar starts with << tell Claude one line of directions, and wait for it to write 75 lines of code >> )\r\n\r\n----\r\n\r\nThe zeroth takeaway is that \"token introspection\" is hard.  LLMs aren't designed with the tools to do this correctly.\r\n\r\nA native tool of \"convert this word to letter-tokens\" would make \"count the Rs\" easier.  But not solve everything; small LLMs struggle with even \"count to three\".\r\n\r\n----\r\n\r\nMy estimate is that the task of << write Python code to generate and score a random poker hand >> is halfway to << can code anything >>.\r\n\r\nThe 4B-8B models struggle with the simpler task of \"write code to determine if a poker hand is a straight / flush\".\r\n\r\nThe smallest/older API models (claude-haiku 3, gemini 1.5) mostly get it right, but don't handle un-mentioned edge cases (<green> the \"wheel\" straight, Ace Two Three Four Five, does not always get included).\r\n\r\nFor the largest models tested (which are still \"mid-sized\", like gpt-4.1-mini), there are only style issues.  And style is something which can be specified in the context text. (<red> it is also, somewhat, a matter of personal preference.  the fact that the *machine* did not do my preferred style (without me telling it to do so) is not a point against it)\r\n\r\n----\r\n\r\nThe problems of *commission* are sometimes worse than the problems of *omission*.\r\n\r\nFor various reasons, the models want a second definition of the word \"granite\" (beyond the type of rock).  
This was most commonly << granite as a metaphor >>, but sometimes << granite as a type of countertop >>.  Other definitions were more of a stretch. \r\n\r\nThe example sentences demonstrate the contrived nature. The sentence << The team's resolve granited in the face of adversity. >> is not proper English. << The team\u2019s granite defense kept the opponents from scoring. >> is worse. (<red> and those are from the *larger* models.  The small models have some pure hallucinations.  << \u201cgranite\u201d referred to a unit of weight equal to 40 pounds >>?  Nope.)\r\n\r\n----\r\n\r\nAlmost all the models stated that granite is composed of << quartz, feldspar, and mica >>.  All the models tested knew which battle happened in 1485 during the Wars of the Roses, and who won it.\r\n\r\nIn one sense, this is not surprising.  If you imagine the LLM as << a dictionary that talks >>, it would certainly have this information. (<orange> well, actually, the Wars of the Roses wouldn't be in most dictionaries; that would be an *encyclopedia*.) (<red> I expect that, going forward, this will be a distinction without meaning.)","created_at":"2025-05-02T15:24:32.950481","id":461,"llm_annotations":{},"parent_id":null,"processed_content":"<p>Other than the <span class=\"literal-text\">fit and finish</span>, the <em>Greenland</em> project is done.\r</p>\n<p>I have all the metrics, timing data, and performance data I need. <span class=\"colorblock color-orange\">\n    <span class=\"sigil\">\u2694\ufe0f</span>\n    <span class=\"colortext-content\">( well, actually, there will probably be a half-dozen more \"exemplars\", and a half-dozen more \"models\".  
but adding each is one line of code, plus 5-30 minutes of \"wait for it to update\")</span>\n  </span> <span class=\"colorblock color-orange\">\n    <span class=\"sigil\">\u2694\ufe0f</span>\n    <span class=\"colortext-content\">( well, actually, adding an exemplar starts with <span class=\"literal-text\">tell Claude one line of directions, and wait for it to write 75 lines of code</span> )</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p>The zeroth takeaway is that \"token introspection\" is hard.  LLMs aren't designed with the tools to do this correctly.\r</p>\n<p>A native tool of \"convert this word to letter-tokens\" would make \"count the Rs\" easier.  But not solve everything; small LLMs struggle with even \"count to three\".\r</p> <hr class=\"section-break\" /> <p>My estimate is that the task of <span class=\"literal-text\">write Python code to generate and score a random poker hand</span> is halfway to <span class=\"literal-text\">can code anything</span>.\r</p>\n<p>The 4B-8B models struggle with the simpler task of \"write code to determine if a poker hand is a straight / flush\".\r</p>\n<p>The smallest/older API models (claude-haiku 3, gemini 1.5) mostly get it right, but don't handle un-mentioned edge cases <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( the \"wheel\" straight, Ace Two Three Four Five, does not always get included)</span>\n  </span>.\r</p>\n<p>For the largest models tested (which are still \"mid-sized\", like gpt-4.1-mini), there are only style issues.  And style is something which can be specified in the context text. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( it is also, somewhat, a matter of personal preference.  
the fact that the <em>machine</em> did not do my preferred style (without me telling it to do so) is not a point against it)</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p>The problems of <em>commission</em> are sometimes worse than the problems of <em>omission</em>.\r</p>\n<p>For various reasons, the models want a second definition of the word \"granite\" (beyond the type of rock).  This was most commonly <span class=\"literal-text\">granite as a metaphor</span>, but sometimes <span class=\"literal-text\">granite as a type of countertop</span>.  Other definitions were more of a stretch. \r</p>\n<p>The example sentences demonstrate the contrived nature. The sentence <span class=\"literal-text\">The team's resolve granited in the face of adversity.</span> is not proper English. <span class=\"literal-text\">The team\u2019s granite defense kept the opponents from scoring.</span> is worse. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( and those are from the <em>larger</em> models.  The small models have some pure hallucinations.  <span class=\"literal-text\">\u201cgranite\u201d referred to a unit of weight equal to 40 pounds</span>?  Nope.)</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p>Almost all the models stated that granite is composed of <span class=\"literal-text\">quartz, feldspar, and mica</span>.  All the models tested knew which battle happened in 1485 during the Wars of the Roses, and who won it.\r</p>\n<p>In one sense, this is not surprising.  If you imagine the LLM as <span class=\"literal-text\">a dictionary that talks</span>, it would certainly have this information. 
<span class=\"colorblock color-orange\">\n    <span class=\"sigil\">\u2694\ufe0f</span>\n    <span class=\"colortext-content\">( well, actually, the Wars of the Roses wouldn't be in most dictionaries; that would be an <em>encyclopedia</em>.)</span>\n  </span> <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( I expect that, going forward, this will be a distinction without meaning.)</span>\n  </span></p>","quotes":[],"subject":"greenland, a post-mortem, part 1"}
