{"id":5090,"date":"2021-12-11T00:01:41","date_gmt":"2021-12-11T08:01:41","guid":{"rendered":"https:\/\/c-for-dummies.com\/blog\/?p=5090"},"modified":"2021-12-18T08:36:21","modified_gmt":"2021-12-18T16:36:21","slug":"a-tally-of-unique-words-part-i","status":"publish","type":"post","link":"https:\/\/c-for-dummies.com\/blog\/?p=5090","title":{"rendered":"A Tally of Unique Words, Part I"},"content":{"rendered":"<p>It&#8217;s easy for a good C programmer to code a program to tally the number of unique words in a chunk of text. Further, the computer could track repeating words. This task would drive a human nuts, but a computer? No problem.<br \/>\n<!--more--><br \/>\nOf course, what&#8217;s easy for a computer to do isn&#8217;t always easy for a programmer to code. In this instance, what I desired was to find the number of unique words in a string. This task branched off into an obvious secondary task of finding those words that repeat. Regardless, the job starts with processing a chunk of text.<\/p>\n<p>For a target string, I&#8217;m using <a href=\"https:\/\/c-for-dummies.com\/blog\/wp-content\/uploads\/2021\/12\/sonnet18.txt\">Shakespeare&#8217;s 18th Sonnet<\/a>. The first draft of my code merely stored the sonnet in a buffer and output it. <a href=\"https:\/\/github.com\/dangookin\/C-For-Dummies-Blog\/blob\/master\/2021_12_11-Lesson-a.c\" rel=\"noopener\" target=\"_blank\">Click here<\/a> to view the code on GitHub.<\/p>\n<p>My first draft (link above) is limited in flexibility as the string is kept in an array local to the function. So for my first update to the code, I created a separate text file, <code>sonnet18.txt<\/code> (link above), opened the file, stored it in a dynamic buffer, then output the results. 
This approach keeps the code flexible, able to open any size text file to process, parse, and eventually sort out the unique and duplicate words:<\/p>\n<h3><a href=\"https:\/\/github.com\/dangookin\/C-For-Dummies-Blog\/blob\/master\/2021_12_11-Lesson-b.c\" rel=\"noopener\" target=\"_blank\">2021_12_11-Lesson-b.c<\/a><\/h3>\n<pre class=\"screen\">\r\n#include &lt;stdio.h&gt;\r\n#include &lt;stdlib.h&gt;\r\n#include &lt;string.h&gt;\r\n\r\nint main()\r\n{\r\n    const char filename[] = \"sonnet18.txt\";\r\n    char *buffer;\r\n    FILE *fp;\r\n    int offset,ch;\r\n\r\n\r\n    <span class=\"comments\">\/* open the file *\/<\/span>\r\n    fp = fopen(filename,\"r\");\r\n    if( fp==NULL )\r\n    {\r\n        fprintf(stderr,\"Unable to open %s\\n\",filename);\r\n        exit(1);\r\n    }\r\n\r\n    <span class=\"comments\">\/* allocate the buffer *\/<\/span>\r\n    buffer = malloc( sizeof(char) * BUFSIZ );\r\n    if( buffer==NULL )\r\n    {\r\n        fprintf(stderr,\"Unable to allocate memory\\n\");\r\n        exit(1);\r\n    }\r\n\r\n    <span class=\"comments\">\/* read from the file to fill the buffer *\/<\/span>\r\n    offset = 0;\r\n    while( !feof(fp) )\r\n    {\r\n        ch = fgetc(fp);           <span class=\"comments\">\/* read a character *\/<\/span>\r\n        if( ch==EOF )             <span class=\"comments\">\/* bail on EOF *\/<\/span>\r\n            break;\r\n        *(buffer+offset) = ch;    <span class=\"comments\">\/* store character *\/<\/span>\r\n        offset++;\r\n        if( offset%BUFSIZ==0 )    <span class=\"comments\">\/* check for full buffer *\/<\/span>\r\n        {\r\n            <span class=\"comments\">\/* enlarge the buffer by another BUFSIZ bytes *\/<\/span>\r\n            buffer = realloc(buffer,offset+BUFSIZ);\r\n            if( buffer==NULL )\r\n            {\r\n                fprintf(stderr,\"Unable to allocate memory\\n\");\r\n                exit(1);\r\n            }\r\n        }\r\n    }\r\n    *(buffer+offset) = '\\0';      
<span class=\"comments\">\/* cap the string *\/<\/span>\r\n\r\n    fclose(fp);\r\n\r\n    printf(\"%s\",buffer);\r\n\r\n    return(0);\r\n}<\/pre>\n<p>Lines 13 through 19 attempt to open the file named in <code>filename<\/code>, which is set at Line 7 to <code>sonnet18.txt<\/code>.<\/p>\n<p>At Line 22, storage is allocated for <em>char<\/em> pointer <code>buffer<\/code>. The size is set to <a href=\"https:\/\/c-for-dummies.com\/blog\/?p=4711\">the <code>BUFSIZ<\/code> defined constant<\/a>, which may not be enough storage, a problem the code deals with later.<\/p>\n<p>The <em>while<\/em> loop at Line 31 reads characters from the file, storing them at an offset in <code>buffer<\/code>. At Line 38, the value of variable <code>offset<\/code> is tested to see whether it&#8217;s a multiple of the <code>BUFSIZ<\/code> defined constant:<\/p>\n<p><code>if( offset%BUFSIZ==0 )<\/code><\/p>\n<p>If the condition is true, meaning the number of characters stored has reached a multiple of <code>BUFSIZ<\/code> and the buffer is full, pointer <code>buffer<\/code> is reallocated at Line 41 to make room for <code>BUFSIZ<\/code> more characters.<\/p>\n<p>Line 49 caps the input buffer with a null character, then the file is closed at Line 51.<\/p>\n<p>All this machination replaces the earlier code example, which just stuffed the text into a <em>char<\/em> array. Still, Line 53 outputs the buffer &#8211; the same output as before, but coming from a file.<\/p>\n<p>An obvious improvement to the code would be to provide the filename as a command line argument. Still, for this series of programs, I&#8217;m working with the one text file to keep testing consistent.<\/p>\n<p>With the text stored in a buffer, I can now work on the code to parse out the words. <a href=\"https:\/\/c-for-dummies.com\/blog\/?p=5099\">Next week&#8217;s Lesson<\/a> continues modifying the program to meet this next step on the way toward the project&#8217;s ultimate goal.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The process starts with opening a text file and storing it in a buffer. 
Seems simple, but it requires many lines of code. <a href=\"https:\/\/c-for-dummies.com\/blog\/?p=5090\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[],"class_list":["post-5090","post","type-post","status-publish","format-standard","hentry","category-main"],"_links":{"self":[{"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5090","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5090"}],"version-history":[{"count":7,"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5090\/revisions"}],"predecessor-version":[{"id":5111,"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/5090\/revisions\/5111"}],"wp:attachment":[{"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5090"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5090"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/c-for-dummies.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5090"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}