A Tally of Unique Words, Part I

It’s easy for a good C programmer to code a program to tally the number of unique words in a chunk of text. Further, the computer could track repeating words. This task would drive a human nuts, but a computer? No problem.

Of course, what's easy for a computer to do isn't always easy for a programmer to code. In this instance, what I wanted was to find the number of unique words in a string. This task branched off into an obvious secondary task of finding those words that repeat. Regardless, the job starts with processing a chunk of text.

For a target string, I'm using Shakespeare's 18th Sonnet. The first draft of my code merely contained and output the sonnet stored in a buffer. Click here to view the code on GitHub.

My first draft (link above) is limited in flexibility because the string is kept in an array local to the function. So for my first update to the code, I created a separate text file, sonnet18.txt (link above), opened the file, stored its contents in a dynamic buffer, then output the results. This approach keeps the code flexible, able to open a text file of any size to process, parse, and eventually sort out the unique and duplicate words:

2021_12_11-Lesson-b.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    const char filename[] = "sonnet18.txt";
    char *buffer;
    FILE *fp;
    int offset,ch;


    /* open the file */
    fp = fopen(filename,"r");
    if( fp==NULL )
    {
        fprintf(stderr,"Unable to open %s\n",filename);
        exit(1);
    }

    /* allocate the buffer */
    buffer = malloc( sizeof(char) * BUFSIZ );
    if( buffer==NULL )
    {
        fprintf(stderr,"Unable to allocate memory\n");
        exit(1);
    }

    /* read from the file to fill the buffer */
    offset = 0;
    while( !feof(fp) )
    {
        ch = fgetc(fp);           /* read a character */
        if( ch==EOF )             /* bail on EOF */
            break;
        *(buffer+offset) = ch;    /* store character */
        offset++;
        if( offset%BUFSIZ==0 )    /* check for full buffer */
        {
            /* enlarge the buffer by another BUFSIZ bytes */
            buffer = realloc(buffer,offset+BUFSIZ);
            if( buffer==NULL )
            {
                fprintf(stderr,"Unable to allocate memory\n");
                exit(1);
            }
        }
    }
    *(buffer+offset) = '\0';      /* cap the string */

    fclose(fp);

    printf("%s",buffer);
    free(buffer);                 /* release the buffer */
    return(0);
}

Lines 13 through 19 of the source file attempt to open the file named by filename, which is set to sonnet18.txt at Line 7. (Line numbers here count from the first #include directive.)

At Line 22, storage is allocated for char pointer buffer. The size is set to the BUFSIZ defined constant, which may not be enough storage, a problem the code deals with later.

The while loop at Line 31 reads characters from the file, storing them at an offset in buffer. At Line 38, the value of variable offset is compared with multiples of the BUFSIZ defined constant:

if( offset%BUFSIZ==0 )

If the condition is true, meaning the number of characters stored has reached a multiple of BUFSIZ and the buffer is full, pointer buffer is reallocated at Line 41 to make room for BUFSIZ more characters.

Line 49 caps the input buffer with a null character, then the file is closed at Line 51.
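As a rough sketch of the same read, grow, and cap steps, and not the code from the listing, the logic could be wrapped in a function. The name read_stream() and its return convention are assumptions made for illustration. Two details differ from the listing: the loop tests the return value of fgetc() directly, and the result of realloc() goes into a temporary pointer so the original buffer isn't lost if the call fails.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read everything from fp into a dynamically
   allocated, null-terminated string. Returns NULL on allocation
   failure; the caller frees the result. */
char *read_stream(FILE *fp)
{
    char *buffer,*temp;
    size_t offset;
    int ch;

    buffer = malloc(BUFSIZ);
    if( buffer==NULL )
        return(NULL);

    offset = 0;
    /* fgetc() returns EOF at end-of-file or on error,
       so a separate feof() test isn't needed */
    while( (ch=fgetc(fp)) != EOF )
    {
        *(buffer+offset) = (char)ch;   /* store character */
        offset++;
        if( offset%BUFSIZ==0 )         /* buffer is full */
        {
            /* keep the old pointer until realloc() succeeds */
            temp = realloc(buffer,offset+BUFSIZ);
            if( temp==NULL )
            {
                free(buffer);
                return(NULL);
            }
            buffer = temp;
        }
    }
    *(buffer+offset) = '\0';           /* cap the string */

    return(buffer);
}
```

Calling read_stream(fp) after a successful fopen() should yield the same string the listing builds. The caller is responsible for passing the returned pointer to free().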

All this machination replaces the earlier code example, which just stuffed the text into a char array. Still, Line 53 outputs the buffer – the same output as before, but coming from a file.

An obvious improvement to the code would be to provide the filename as a command line argument. Still, for this series of programs, I’m working with the one text file for testing consistency.
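For the curious, here is one way the command line improvement might look. This is only a sketch: the helper choose_filename() is hypothetical and not part of this series' code. It returns the first command line argument when one is present and falls back to a default filename otherwise.

```c
/* Hypothetical helper: return the first command line argument if
   one was supplied, otherwise the fallback filename. */
const char *choose_filename(int argc, char *argv[], const char *fallback)
{
    if( argc > 1 )
        return(argv[1]);
    return(fallback);
}

/* In a main() declared as int main(int argc, char *argv[]),
   the call might look like this:

       const char *filename = choose_filename(argc,argv,"sonnet18.txt");
       fp = fopen(filename,"r");
*/
```

With this change, running the program without arguments would still process sonnet18.txt, which keeps the testing consistent.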

With the text stored in a buffer, I can now work on the code to parse out the words. Next week’s Lesson continues modifying the program to meet this next step on the way toward the project’s ultimate goal.
