{"channel":"tech","content":"The Harvard Sentences (<green> https://www.cs.columbia.edu/~hgs/audio/harvard.html ) contain about 1880 distinct words across the 720 sentences. (<red> My script split the word \"don't\" into << don >> and << t >>.)\r\n\r\nIt is striking just how often the word \"the\" is used: 746 times.  The next most common words are far less frequent (<green> << a >> at 212, << of >> at 132, << to >> at 123).\r\n\r\n----\r\n\r\nWhile these are useful sample sentences for certain purposes (<red> such as the \"what is the verb in this sentence\" test for 1B-size LLMs), the set is too small and idiosyncratic to stand up a useful word-bank.","created_at":"2025-02-20T20:01:34.992393","id":248,"llm_annotations":{},"parent_id":null,"processed_content":"<p>The Harvard Sentences <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( <a href=\"https://www.cs.columbia.edu/~hgs/audio/harvard.html\" target=\"_blank\" rel=\"noopener noreferrer\">https://www.cs.columbia.edu/~hgs/audio/harvard.html</a> )</span>\n  </span> contain about 1880 distinct words across the 720 sentences. <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( My script split the word \"don't\" into <span class=\"literal-text\">don</span> and <span class=\"literal-text\">t</span>.)</span>\n  </span>\r</p>\n<p>It is striking just how often the word \"the\" is used: 746 times.  
The next most common words are far less frequent <span class=\"colorblock color-green\">\n    <span class=\"sigil\">\u2699\ufe0f</span>\n    <span class=\"colortext-content\">( <span class=\"literal-text\">a</span> at 212, <span class=\"literal-text\">of</span> at 132, <span class=\"literal-text\">to</span> at 123)</span>\n  </span>.\r</p> <hr class=\"section-break\" /> <p>While these are useful sample sentences for certain purposes <span class=\"colorblock color-red\">\n    <span class=\"sigil\">\ud83d\udca1</span>\n    <span class=\"colortext-content\">( such as the \"what is the verb in this sentence\" test for 1B-size LLMs)</span>\n  </span>, the set is too small and idiosyncratic to stand up a useful word-bank.</p>","quotes":[],"subject":"wordlists"}
