{"channel":"llm","content":"In my testing, I am starting to distinguish between two types of \"tests\" for LLMs.\r\n\r\n----\r\n\r\nA *proficiency* test covers simple tasks.  Some examples:\r\n* << Repeat the misspelled word in this sentence >>.\r\n* << Translate this word from English to French >>. (<red> the linguistic knowledge of models is an unresolved question.  Should a model know 5 languages, or 40, or 400?  In the specific case of English/French: it is plausible to claim that one cannot truly know the English language without knowing French.  The LLM should also know French.)\r\n* << Choose the definition of this word >>.\r\n\r\nAccuracy on these tasks is often surprisingly poor compared to performance on other tasks.  This may be due to a lack of training for these tasks.\r\n\r\n----\r\n\r\nOn the other hand, a *qual* test (<green> possibly for \"qualification\") is more difficult.  \r\n* << Write two paragraphs about the city of Toulouse. >>\r\n* << Explain the theory of relativity to a nine-year-old. >>\r\n* << Answer these questions from the GRE Verbal Reasoning section. >>\r\n\r\nFrom a technical perspective: many of these are free-form responses that are scored by a larger LLM.\r\n\r\nThe interesting question is not whether *any* LLM can answer these, but whether an LLM under 16 GB in size can do so.\r\n\r\n----\r\n\r\nThe << Frontier >> tests are not particularly interesting.","created_at":"2025-03-24T20:07:33.043164","id":326,"llm_annotations":{},"parent_id":null,"processed_content":"<p>In my testing, I am starting to distinguish between two types of \"tests\" for LLMs.\r</p> <hr class=\"section-break\" /> <p>A <em>proficiency</em> test covers simple tasks.  Some examples:\r</p>\n<ul>\n<li class=\"bullet-list\"> <span class=\"literal-text\">Repeat the misspelled word in this sentence</span>.\r</li>\n<li class=\"bullet-list\"> <span class=\"literal-text\">Translate this word from English to French</span>. 
<span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( the linguistic knowledge of models is an unresolved question.  Should a model know 5 languages, or 40, or 400?  In the specific case of English/French: it is plausible to claim that one cannot truly know the English language without knowing French.  The LLM should also know French.)</span>\n  </span>\r</li>\n<li class=\"bullet-list\"> <span class=\"literal-text\">Choose the definition of this word</span>.\r</li>\n</ul>\n<p>Accuracy on these tasks is often surprisingly poor compared to performance on other tasks.  This may be due to a lack of training for these tasks.\r</p> <hr class=\"section-break\" /> <p>On the other hand, a <em>qual</em> test <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( possibly for \"qualification\")</span>\n  </span> is more difficult.  \r</p>\n<ul>\n<li class=\"bullet-list\"> <span class=\"literal-text\">Write two paragraphs about the city of Toulouse.</span>\r</li>\n<li class=\"bullet-list\"> <span class=\"literal-text\">Explain the theory of relativity to a nine-year-old.</span>\r</li>\n<li class=\"bullet-list\"> <span class=\"literal-text\">Answer these questions from the GRE Verbal Reasoning section.</span>\r</li>\n</ul>\n<p>From a technical perspective: many of these are free-form responses that are scored by a larger LLM.\r</p>\n<p>The interesting question is not whether <em>any</em> LLM can answer these, but whether an LLM under 16 GB in size can do so.\r</p> <hr class=\"section-break\" /> <p>The <span class=\"literal-text\">Frontier</span> tests are not particularly interesting.</p>","quotes":[],"subject":"proficiencies and quals"}
