{"chain":[{"channel":"llm","content":"Other than the << fit and finish >>, the *Greenland* project is done.\r\n\r\nI have all the metrics, timing data, and performance data I need. (<orange> well, actually, there will probably be a half-dozen more \"exemplars\", and a half-dozen more \"models\".  but adding each is one line of code, plus 5-30 minutes of \"wait for it to update\") (<orange> well, actually, adding an exemplar starts with << tell Claude one line of directions, and wait for it to write 75 lines of code >> )\r\n\r\n----\r\n\r\nThe zeroth takeaway is that \"token introspection\" is hard.  LLMs aren't designed with the tools to do this correctly.\r\n\r\nA native tool of \"convert this word to letter-tokens\" would make \"count the Rs\" easier.  But not solve everything; small LLMs struggle with even \"count to three\".\r\n\r\n----\r\n\r\nMy estimate is that the task of << write Python code to generate and score a random poker hand >> is halfway to << can code anything >>.\r\n\r\nThe 4B-8B models struggle with the simpler task of \"write code to determine if a poker hand is a straight / flush\".\r\n\r\nThe smallest/older API models (claude-haiku 3, gemini 1.5) mostly get it right, but don't handle un-mentioned edge cases (<green> the \"wheel\" straight, Ace Two Three Four Five, does not always get included).\r\n\r\nFor the largest models tested (which are still \"mid-sized\", like gpt-4.1-mini), there are only style issues.  And style is something which can be specified in the context text. (<red> it is also, somewhat, a matter of personal preference.  the fact that the *machine* did not do my preferred style (without me telling it to do so) is not a point against it)\r\n\r\n----\r\n\r\nThe problems of *commission* are sometimes worse than the problems of *omission*.\r\n\r\nFor various reasons, the models want a second definition of the word \"granite\" (beyond the type of rock).  
This was most commonly << granite as a metaphor >>, but sometimes << granite as a type of countertop >>.  Other definitions were more of a stretch. \r\n\r\nThe example sentences demonstrate the contrived nature. The sentence << The team's resolve granited in the face of adversity. >> is not proper English. << The team\u2019s granite defense kept the opponents from scoring. >> is worse. (<red> and those are from the *larger* models.  The small models have some pure hallucinations.  << \u201cgranite\u201d referred to a unit of weight equal to 40 pounds >>?  Nope.)\r\n\r\n----\r\n\r\nAlmost all the models stated that granite is composed of << quartz, feldspar, and mica >>.  All the models tested knew which battle happened in 1485 during the Wars of the Roses, and who won it.\r\n\r\nIn one sense, this is not surprising.  If you imagine the LLM as << a dictionary that talks >>, it would certainly have this information. (<orange> well, actually, the Wars of the Roses wouldn't be in most dictionaries; that would be an *encyclopedia*.) (<red> I expect that, going forward, this will be a distinction without meaning.)","created_at":"2025-05-02T15:24:32.950481","id":461,"is_target":false,"parent_id":null,"processed_content":"<p>Other than the <span class=\"literal-text\">fit and finish</span>, the <em>Greenland</em> project is done.\r</p>\n<p>I have all the metrics, timing data, and performance data I need. <span class=\"colorblock color-orange\">\n    <span class=\"sigil\">\u2694\ufe0f</span>\n    <span class=\"colortext-content\">( well, actually, there will probably be a half-dozen more \"exemplars\", and a half-dozen more \"models\".  
but adding each is one line of code, plus 5-30 minutes of \"wait for it to update\")</span>\n  </span> <span class=\"colorblock color-orange\">\n    <span class=\"sigil\">\u2694\ufe0f</span>\n    <span class=\"colortext-content\">( well, actually, adding an exemplar starts with <span class=\"literal-text\">tell Claude one line of directions, and wait for it to write 75 lines of code</span> )</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p>The zeroth takeaway is that \"token introspection\" is hard.  LLMs aren't designed with the tools to do this correctly.\r</p>\n<p>A native tool of \"convert this word to letter-tokens\" would make \"count the Rs\" easier.  But not solve everything; small LLMs struggle with even \"count to three\".\r</p> <hr class=\"section-break\" /> <p>My estimate is that the task of <span class=\"literal-text\">write Python code to generate and score a random poker hand</span> is halfway to <span class=\"literal-text\">can code anything</span>.\r</p>\n<p>The 4B-8B models struggle with the simpler task of \"write code to determine if a poker hand is a straight / flush\".\r</p>\n<p>The smallest/older API models (claude-haiku 3, gemini 1.5) mostly get it right, but don't handle un-mentioned edge cases <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( the \"wheel\" straight, Ace Two Three Four Five, does not always get included)</span>\n  </span>.\r</p>\n<p>For the largest models tested (which are still \"mid-sized\", like gpt-4.1-mini), there are only style issues.  And style is something which can be specified in the context text. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( it is also, somewhat, a matter of personal preference.  
the fact that the <em>machine</em> did not do my preferred style (without me telling it to do so) is not a point against it)</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p>The problems of <em>commission</em> are sometimes worse than the problems of <em>omission</em>.\r</p>\n<p>For various reasons, the models want a second definition of the word \"granite\" (beyond the type of rock).  This was most commonly <span class=\"literal-text\">granite as a metaphor</span>, but sometimes <span class=\"literal-text\">granite as a type of countertop</span>.  Other definitions were more of a stretch. \r</p>\n<p>The example sentences demonstrate the contrived nature. The sentence <span class=\"literal-text\">The team's resolve granited in the face of adversity.</span> is not proper English. <span class=\"literal-text\">The team\u2019s granite defense kept the opponents from scoring.</span> is worse. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( and those are from the <em>larger</em> models.  The small models have some pure hallucinations.  <span class=\"literal-text\">\u201cgranite\u201d referred to a unit of weight equal to 40 pounds</span>?  Nope.)</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p>Almost all the models stated that granite is composed of <span class=\"literal-text\">quartz, feldspar, and mica</span>.  All the models tested knew which battle happened in 1485 during the Wars of the Roses, and who won it.\r</p>\n<p>In one sense, this is not surprising.  If you imagine the LLM as <span class=\"literal-text\">a dictionary that talks</span>, it would certainly have this information. 
<span class=\"colorblock color-orange\">\n    <span class=\"sigil\">\u2694\ufe0f</span>\n    <span class=\"colortext-content\">( well, actually, the Wars of the Roses wouldn't be in most dictionaries; that would be an <em>encyclopedia</em>.)</span>\n  </span> <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( I expect that, going forward, this will be a distinction without meaning.)</span>\n  </span></p>","subject":"greenland, a post-mortem, part 1"},{"channel":"llm","content":"That link is https://spaceship.computer/greenland/ .\r\n\r\n----\r\n\r\nNobody particularly cares about the \"space-time tradeoff\" with these models. (<red> which is a shame, because it is *very* relevant to both << industrial >> uses and << AI safety >> concerns)\r\n\r\nIf an 8B model does 5% better because of \"chain-of-thought\" but takes 15 times longer, it's generally not actually better than a 14B model would have been.\r\n\r\nAnd, a lot of the \"thought\" should be tools, rather than the illusion-of-thought that (at least the small LLMs) love. (<red> the prime example is << what's the capital of Spain?  Oh, I think I heard once that it is Madrid! >> style bullshit.)\r\n\r\n----\r\n\r\nWe don't need a mythical << super-human AI >> to generate mass unemployment in << knowledge-workers >>.\r\n\r\nWe don't need models that have a desire to \"escape\" or \"replicate\".  We don't need to worry about \"alignment\".  We certainly don't need << By 2035, trillions of tons of planetary material have been launched into space and turned into rings of satellites orbiting the sun. >>\r\n\r\nThe ordinary-intelligence AI, that I can already run on my computer, is already enough to trigger mass-unemployment. (<orange> well, actually, the 8B models aren't quite good enough or fast enough.  but the GPT-4.1-nano size models are cheap enough and good enough to be sufficient.  
once the tools and the workflows are improved.)\r\n\r\nBut, this social change is not something that an << *AI Safety Team* >> can address.  The myth-making of the all-powerful AI is, for lack of a better word, dumb.  If you really want there to be meaning to it, you can use enough << it's a metaphor >> to make their arguments somewhat match the future.  But you can't kill a metaphor with a shotgun.\r\n\r\n----\r\n\r\nThere is an insidious meme in the LLM community, that a benchmark where models can get 100% is a bad benchmark.\r\n\r\nThis could not be farther from the truth.\r\n\r\nIf your only concern is \"how advanced is the state-of-the-art model\", there is a slight amount of sense to this.  But, the new benchmarks are often mind-bogglingly stupid.\r\n\r\nWhen the questions are << obscure trivia that shouldn't even be in the training set >>, << deliberately-obfuscated mathematical puzzles >>, or << evaluate this complicated Python function without using Python >>, it is arguable that getting the question right (from memory, in a short response) is the wrong response.  The *machine* shouldn't know, or should have to spend more time/effort than is allowed. (<red> the *machine* isn't magic.  if you ask it to solve a computational task that takes O(n^3) time in O(n) time, it won't do it.  at best, it will make guesses that evade your spot-checking.)\r\n\r\nI affirmatively *want* benchmarks that GPT-4.1-mini gets a perfect score on.  I want to know which tasks the *machine* can do perfectly, and at what point it starts being able to do so.\r\n\r\n----\r\n\r\nOne approach I have considered but not found any good outcomes from is the << consensus of mediocre models >> approach.\r\n\r\nIf you take seven 8B models, and ask them all the same question, and then \"merge\" the outputs, will you get a better result?\r\n\r\n<red> <<< This is not exactly the same as the \"mixture of experts\" approach for various models.  But, there are similarities.  ... 
Perhaps the difference is that << Mixture of Experts >> is beneficial, and mixing general-purpose models is not. >>>","created_at":"2025-05-02T17:59:42.923475","id":462,"is_target":true,"parent_id":461,"processed_content":"<p>That link is <a href=\"https://spaceship.computer/greenland/\" target=\"_blank\" rel=\"noopener noreferrer\">https://spaceship.computer/greenland/</a> .\r</p> <hr class=\"section-break\" /> <p>Nobody particularly cares about the \"space-time tradeoff\" with these models. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( which is a shame, because it is <em>very</em> relevant to both <span class=\"literal-text\">industrial</span> uses and <span class=\"literal-text\">AI safety</span> concerns)</span>\n  </span>\r</p>\n<p>If an 8B model does 5% better because of \"chain-of-thought\" but takes 15 times longer, it's generally not actually better than a 14B model would have been.\r</p>\n<p>And, a lot of the \"thought\" should be tools, rather than the illusion-of-thought that (at least the small LLMs) love. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( the prime example is <span class=\"literal-text\">what's the capital of Spain?  Oh, I think I heard once that it is Madrid!</span> style bullshit.)</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p>We don't need a mythical <span class=\"literal-text\">super-human AI</span> to generate mass unemployment in <span class=\"literal-text\">knowledge-workers</span>.\r</p>\n<p>We don't need models that have a desire to \"escape\" or \"replicate\".  We don't need to worry about \"alignment\".  
We certainly don't need <span class=\"literal-text\">By 2035, trillions of tons of planetary material have been launched into space and turned into rings of satellites orbiting the sun.</span>\r</p>\n<p>The ordinary-intelligence AI, that I can already run on my computer, is already enough to trigger mass-unemployment. <span class=\"colorblock color-orange\">\n    <span class=\"sigil\">\u2694\ufe0f</span>\n    <span class=\"colortext-content\">( well, actually, the 8B models aren't quite good enough or fast enough.  but the GPT-4.1-nano size models are cheap enough and good enough to be sufficient.  once the tools and the workflows are improved.)</span>\n  </span>\r</p>\n<p>But, this social change is not something that an <span class=\"literal-text\"><em>AI Safety Team</em></span> can address.  The myth-making of the all-powerful AI is, for lack of a better word, dumb.  If you really want there to be meaning to it, you can use enough <span class=\"literal-text\">it's a metaphor</span> to make their arguments somewhat match the future.  But you can't kill a metaphor with a shotgun.\r</p> <hr class=\"section-break\" /> <p>There is an insidious meme in the LLM community, that a benchmark where models can get 100% is a bad benchmark.\r</p>\n<p>This could not be farther from the truth.\r</p>\n<p>If your only concern is \"how advanced is the state-of-the-art model\", there is a slight amount of sense to this.  But, the new benchmarks are often mind-bogglingly stupid.\r</p>\n<p>When the questions are <span class=\"literal-text\">obscure trivia that shouldn't even be in the training set</span>, <span class=\"literal-text\">deliberately-obfuscated mathematical puzzles</span>, or <span class=\"literal-text\">evaluate this complicated Python function without using Python</span>, it is arguable that getting the question right (from memory, in a short response) is the wrong response.  The <em>machine</em> shouldn't know, or should have to spend more time/effort than is allowed. 
<span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( the <em>machine</em> isn't magic.  if you ask it to solve a computational task that takes O(n^3) time in O(n) time, it won't do it.  at best, it will make guesses that evade your spot-checking.)</span>\n  </span>\r</p>\n<p>I affirmatively <em>want</em> benchmarks that GPT-4.1-mini gets a perfect score on.  I want to know which tasks the <em>machine</em> can do perfectly, and at what point it starts being able to do so.\r</p> <hr class=\"section-break\" /> <p>One approach I have considered but not found any good outcomes from is the <span class=\"literal-text\">consensus of mediocre models</span> approach.\r</p>\n<p>If you take seven 8B models, and ask them all the same question, and then \"merge\" the outputs, will you get a better result?\r</p>\n<p><div class=\"mlq color-red\"><button type=\"button\" class=\"mlq-collapse\" aria-label=\"Toggle visibility\"><span class=\"mlq-collapse-icon\">\ud83d\udca1</span></button><div class=\"mlq-content\"><p> This is not exactly the same as the \"mixture of experts\" approach for various models.  But, there are similarities.  ... Perhaps the difference is that <span class=\"literal-text\">Mixture of Experts</span> is beneficial, and mixing general-purpose models is not. </p></div></div></p>","subject":"greenland, a post-mortem, part 2"},{"channel":"llm","content":"JSON output is almost a necessity for an LLM to be usable today.  All of the major LLM platforms have it in some form.  
But, if you are using a model from 2023, it might not support it, or it might not work very well.\r\n\r\nWhile many of the improvements from 2 years ago are in the tools running the LLM (<red> such as the token-selection algorithm), there is some amount of understanding of the output-format that needs to be trained into the model.\r\n\r\n----\r\n\r\nTrying to test Phi-2 (December 2023, 2.7B params) or Mistral-0.3 (September 2023, 7B params) seems unlikely to be worth any time/effort.  I know there are newer models that are better; and I'm not sure there will be usable results at all.\r\n\r\nDoes that mean the models we have today will be useless in 18 months?  *Probably not*.  Maybe there will be a GPT-4.1-nano quality model that is 2c IN / 5c OUT per million tokens (<red> currently GPT-4.1-nano is 10c IN / 40c OUT per million tokens).  For almost all personal uses, this is not a substantial improvement.\r\n\r\n----\r\n\r\nWhether << Falcon 3 >> (<resource> https://huggingface.co/blog/falcon3 ) is worth considering is a different question.\r\n\r\nTheir press-release has benchmarks showing them as slightly better than earlier systems of similar size.  But nothing ground-breaking; and in fact we know there can't be anything too unique.  If there were, it would have already been copied.\r\n\r\nIt is \"just another model\". (<yellow> if you want to build a forest, it helps to have many different trees)\r\n\r\n----\r\n\r\nWhat about Granite (the IBM offering)? (<resource> https://www.ibm.com/granite/docs/ )\r\n\r\nThis one I happened to already test.  The results were very unremarkable.  Like most 8B models, this 8B model gave acceptable results for tasks that did not require deep insight or precision.\r\n\r\n----\r\n\r\nThe highest-profile \"local models\" are Gemma (<context> Google's latest model), Llama (<context> Facebook's latest model), QWEN (<context> Alibaba's latest model), and Phi (<context> Microsoft's latest model). 
(<xantham> Amazon and Apple do not seem to be releasing their own models.  Netflix is not, either.) (<red> there are others; Mistral is probably the leading European provider.) (<xantham> I still don't care about Deepseek; the \"thought\" is largely a party-trick that people will see through soon enough ... also, most other models do that in some way now.)\r\n\r\nAnd, all of these seem to be hitting limits at the 8B param size.  The latest releases are more interesting at the 24-40B param size.  Which *can* be run on a local machine ... just not the ones I own.\r\n\r\n----\r\n\r\nThe 1.5B parameter models are useful for << speculative decoding >> (<context> https://research.google/blog/looking-back-at-speculative-decoding/ ), which is where you use one model to make a cheap \"guess\" for the larger model, allowing more tokens to be calculated at once.\r\n\r\nBeyond that, they are largely toys.  With fine-tuning and testing, you can probably use a model for a single useful task.  But the 1.5B models are not << general-purpose >> AI, and they probably never will be.\r\n\r\n----\r\n\r\nFor \"cloud\" models, there is Gemini (<context> Google), GPT (<context> OpenAI), and Claude (<context> Anthropic).  And, several others that I haven't bothered with. (<red> Perplexity has an API called << Sonar >>.  Amazon has something called << Nova >>.  And there is still TSFKAT's offering.) (<acronym> TSFKAT = \"the site formerly known as Twitter\")\r\n\r\nAnd ... without a specific work-task, it is unlikely that benchmarking / testing these models will come up with any useful data.","created_at":"2025-05-02T19:23:20.769452","id":463,"is_target":false,"parent_id":462,"processed_content":"<p>JSON output is almost a necessity for an LLM to be usable today.  All of the major LLM platforms have it in some form.  
But, if you are using a model from 2023, it might not support it, or it might not work very well.\r</p>\n<p>While many of the improvements from 2 years ago are in the tools running the LLM <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( such as the token-selection algorithm)</span>\n  </span>, there is some amount of understanding of the output-format that needs to be trained into the model.\r</p> <hr class=\"section-break\" /> <p>Trying to test Phi-2 (December 2023, 2.7B params) or Mistral-0.3 (September 2023, 7B params) seems unlikely to be worth any time/effort.  I know there are newer models that are better; and I'm not sure there will be usable results at all.\r</p>\n<p>Does that mean the models we have today will be useless in 18 months?  <em>Probably not</em>.  Maybe there will be a GPT-4.1-nano quality model that is 2c IN / 5c OUT per million tokens <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( currently GPT-4.1-nano is 10c IN / 40c OUT per million tokens)</span>\n  </span>.  For almost all personal uses, this is not a substantial improvement.\r</p> <hr class=\"section-break\" /> <p>Whether <span class=\"literal-text\">Falcon 3</span> <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( <a href=\"https://huggingface.co/blog/falcon3\" target=\"_blank\" rel=\"noopener noreferrer\">https://huggingface.co/blog/falcon3</a> )</span>\n  </span> is worth considering is a different question.\r</p>\n<p>Their press-release has benchmarks showing them as slightly better than earlier systems of similar size.  But nothing ground-breaking; and in fact we know there can't be anything too unique.  If there were, it would have already been copied.\r</p>\n<p>It is \"just another model\". 
<span class=\"colorblock color-yellow\">\n    <span class=\"sigil\">\ud83d\udcac</span>\n    <span class=\"colortext-content\">( if you want to build a forest, it helps to have many different trees)</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p>What about Granite (the IBM offering)? <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( <a href=\"https://www.ibm.com/granite/docs/\" target=\"_blank\" rel=\"noopener noreferrer\">https://www.ibm.com/granite/docs/</a> )</span>\n  </span>\r</p>\n<p>This one I happened to already test.  The results were very unremarkable.  Like most 8B models, this 8B model gave acceptable results for tasks that did not require deep insight or precision.\r</p> <hr class=\"section-break\" /> <p>The highest-profile \"local models\" are Gemma <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( Google's latest model)</span>\n  </span>, Llama <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( Facebook's latest model)</span>\n  </span>, QWEN <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( Alibaba's latest model)</span>\n  </span>, and Phi <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( Microsoft's latest model)</span>\n  </span>. <span class=\"colorblock color-xantham\">\n    <span class=\"sigil\">\ud83d\udd25</span>\n    <span class=\"colortext-content\">( Amazon and Apple do not seem to be releasing their own models.  
Netflix is not, either.)</span>\n  </span> <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( there are others; Mistral is probably the leading European provider.)</span>\n  </span> <span class=\"colorblock color-xantham\">\n    <span class=\"sigil\">\ud83d\udd25</span>\n    <span class=\"colortext-content\">( I still don't care about Deepseek; the \"thought\" is largely a party-trick that people will see through soon enough ... also, most other models do that in some way now.)</span>\n  </span>\r</p>\n<p>And, all of these seem to be hitting limits at the 8B param size.  The latest releases are more interesting at the 24-40B param size.  Which <em>can</em> be run on a local machine ... just not the ones I own.\r</p> <hr class=\"section-break\" /> <p>The 1.5B parameter models are useful for <span class=\"literal-text\">speculative decoding</span> <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( <a href=\"https://research.google/blog/looking-back-at-speculative-decoding/\" target=\"_blank\" rel=\"noopener noreferrer\">https://research.google/blog/looking-back-at-speculative-decoding/</a> )</span>\n  </span>, which is where you use one model to make a cheap \"guess\" for the larger model, allowing more tokens to be calculated at once.\r</p>\n<p>Beyond that, they are largely toys.  With fine-tuning and testing, you can probably use a model for a single useful task.  
But the 1.5B models are not <span class=\"literal-text\">general-purpose</span> AI, and they probably never will be.\r</p> <hr class=\"section-break\" /> <p>For \"cloud\" models, there is Gemini <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( Google)</span>\n  </span>, GPT <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( OpenAI)</span>\n  </span>, and Claude <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( Anthropic)</span>\n  </span>.  And, several others that I haven't bothered with. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( Perplexity has an API called <span class=\"literal-text\">Sonar</span>.  Amazon has something called <span class=\"literal-text\">Nova</span>.  And there is still TSFKAT's offering.)</span>\n  </span> <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( TSFKAT = \"the site formerly known as Twitter\")</span>\n  </span>\r</p>\n<p>And ... without a specific work-task, it is unlikely that benchmarking / testing these models will come up with any useful data.</p>","subject":"greenland, a post-mortem, part 3"}]}
