{"channel":"llm","content":"That link is https://spaceship.computer/greenland/ .\r\n\r\n----\r\n\r\nNobody particularly cares about the \"space-time tradeoff\" with these models. (<red> which is a shame, because it is *very* relevant to both << industrial >> uses and << AI safety >> concerns)\r\n\r\nIf an 8B model does 5% better because of \"chain-of-thought\" but takes 15 times longer, it's generally not actually better than a 14B model would have been.\r\n\r\nAnd, a lot of the \"thought\" should be tools, rather than the illusion-of-thought that (at least the small LLMs) love. (<red> the prime example is << what's the capital of Spain?  Oh, I think I heard once that it is Madrid! >> style bullshit.)\r\n\r\n----\r\n\r\nWe don't need a mythical << super-human AI >> to generate mass unemployment in << knowledge-workers >>.\r\n\r\nWe don't need models that have a desire to \"escape\" or \"replicate\".  We don't need to worry about \"alignment\".  We certainly don't need << By 2035, trillions of tons of planetary material have been launched into space and turned into rings of satellites orbiting the sun. >>\r\n\r\nThe ordinary-intelligence AI, that I can already run on my computer, is already enough to trigger mass-unemployment. (<orange> well, actually, the 8B models aren't quite good enough or fast enough.  but the GPT-4.1-nano size models are cheap enough and good enough to be sufficient.  once the tools and the workflows are improved.)\r\n\r\nBut, this social change is not something that an << *AI Safety Team* >> can address.  The myth-making of the all-powerful AI is, for lack of a better word, dumb.  If you really want there to be meaning to it, you can use enough << it's a metaphor >> to make their arguments somewhat match the future.  But you can't kill a metaphor with a shotgun.\r\n\r\n----\r\n\r\nThere is an insidious meme in the LLM community, that a benchmark where models can get 100% is a bad benchmark.\r\n\r\nThis could not be farther from the truth.\r\n\r\nIf your only concern is \"how advanced is the state-of-the-art model\", there is a slight amount of sense to this.  But, the new benchmarks are often mind-bogglingly stupid.\r\n\r\nWhen the questions are << obscure trivia that shouldn't even be in the training set >>, << deliberately-obfuscated mathematical puzzles >>, or << evaluate this complicated Python function without using Python >>, it is arguable that getting the question right (from memory, in a short response) is the wrong response.  The *machine* shouldn't know, or should have to spend more time/effort than is allowed. (<red> the *machine* isn't magic.  if you ask it to solve a computational task that takes O(n^3) time in O(n) time, it won't do it.  at best, it will make guesses that evade your spot-checking.)\r\n\r\nI affirmatively *want* benchmarks that GPT-4.1-mini gets a perfect score on.  I want to know what the tasks which the *machine* can do perfectly are; and at what point it starts being able to do so.\r\n\r\n----\r\n\r\nOne approach I have considered but not found any good outcomes from is the << consensus of mediocre models >> approach.\r\n\r\nIf you take 7 8B models, and ask them all the same question, and then \"merge\" the outputs, will you get a better result?\r\n\r\n<red> <<< This is not exactly the same as the \"mixture of experts\" approach for various models.  But, there are similarities.  ... Perhaps the difference is that << Mixture of Experts >> is beneficial, and mixing general-purpose models is not. >>>","created_at":"2025-05-02T17:59:42.923475","id":462,"llm_annotations":{},"parent_id":461,"processed_content":"<p>That link is <a href=\"https://spaceship.computer/greenland/\" target=\"_blank\" rel=\"noopener noreferrer\">https://spaceship.computer/greenland/</a> .\r</p> <hr class=\"section-break\" /> <p>Nobody particularly cares about the \"space-time tradeoff\" with these models. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( which is a shame, because it is <em>very</em> relevant to both <span class=\"literal-text\">industrial</span> uses and <span class=\"literal-text\">AI safety</span> concerns)</span>\n  </span>\r</p>\n<p>If an 8B model does 5% better because of \"chain-of-thought\" but takes 15 times longer, it's generally not actually better than a 14B model would have been.\r</p>\n<p>And, a lot of the \"thought\" should be tools, rather than the illusion-of-thought that (at least the small LLMs) love. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( the prime example is <span class=\"literal-text\">what's the capital of Spain?  Oh, I think I heard once that it is Madrid!</span> style bullshit.)</span>\n  </span>\r</p> <hr class=\"section-break\" /> <p>We don't need a mythical <span class=\"literal-text\">super-human AI</span> to generate mass unemployment in <span class=\"literal-text\">knowledge-workers</span>.\r</p>\n<p>We don't need models that have a desire to \"escape\" or \"replicate\".  We don't need to worry about \"alignment\".  We certainly don't need <span class=\"literal-text\">By 2035, trillions of tons of planetary material have been launched into space and turned into rings of satellites orbiting the sun.</span>\r</p>\n<p>The ordinary-intelligence AI, that I can already run on my computer, is already enough to trigger mass-unemployment. <span class=\"colorblock color-orange\">\n    <span class=\"sigil\">\u2694\ufe0f</span>\n    <span class=\"colortext-content\">( well, actually, the 8B models aren't quite good enough or fast enough.  but the GPT-4.1-nano size models are cheap enough and good enough to be sufficient.  once the tools and the workflows are improved.)</span>\n  </span>\r</p>\n<p>But, this social change is not something that an <span class=\"literal-text\"><em>AI Safety Team</em></span> can address.  The myth-making of the all-powerful AI is, for lack of a better word, dumb.  If you really want there to be meaning to it, you can use enough <span class=\"literal-text\">it's a metaphor</span> to make their arguments somewhat match the future.  But you can't kill a metaphor with a shotgun.\r</p> <hr class=\"section-break\" /> <p>There is an insidious meme in the LLM community, that a benchmark where models can get 100% is a bad benchmark.\r</p>\n<p>This could not be farther from the truth.\r</p>\n<p>If your only concern is \"how advanced is the state-of-the-art model\", there is a slight amount of sense to this.  But, the new benchmarks are often mind-bogglingly stupid.\r</p>\n<p>When the questions are <span class=\"literal-text\">obscure trivia that shouldn't even be in the training set</span>, <span class=\"literal-text\">deliberately-obfuscated mathematical puzzles</span>, or <span class=\"literal-text\">evaluate this complicated Python function without using Python</span>, it is arguable that getting the question right (from memory, in a short response) is the wrong response.  The <em>machine</em> shouldn't know, or should have to spend more time/effort than is allowed. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( the <em>machine</em> isn't magic.  if you ask it to solve a computational task that takes O(n^3) time in O(n) time, it won't do it.  at best, it will make guesses that evade your spot-checking.)</span>\n  </span>\r</p>\n<p>I affirmatively <em>want</em> benchmarks that GPT-4.1-mini gets a perfect score on.  I want to know what the tasks which the <em>machine</em> can do perfectly are; and at what point it starts being able to do so.\r</p> <hr class=\"section-break\" /> <p>One approach I have considered but not found any good outcomes from is the <span class=\"literal-text\">consensus of mediocre models</span> approach.\r</p>\n<p>If you take 7 8B models, and ask them all the same question, and then \"merge\" the outputs, will you get a better result?\r</p>\n<p><div class=\"mlq color-red\"><button type=\"button\" class=\"mlq-collapse\" aria-label=\"Toggle visibility\"><span class=\"mlq-collapse-icon\">\ud83d\udca1</span></button><div class=\"mlq-content\"><p> This is not exactly the same as the \"mixture of experts\" approach for various models.  But, there are similarities.  ... Perhaps the difference is that <span class=\"literal-text\">Mixture of Experts</span> is beneficial, and mixing general-purpose models is not. </p></div></div></p>","quotes":[],"subject":"greenland, a post-mortem, part 2"}
