A Tally of Unique Words, Part II

Continuing with my Unique Words project from last week’s Lesson: Once the buffer contains text, the next step is to parse the words, splitting the long string of text stored in memory into separate word chunks. For this task, I turn to my old pal, the strtok() function.

The process starts with the text file sonnet18.txt opened and stored in a buffer. No matter how many lines of text are in the file, it’s stored as one long string, which is perfect for the strtok() function to slice through. But first some additions must be made to the code.

Three new variables are declared:

const char separators[] lists those characters used by the strtok() function to parse words from the buffer.

char *word holds the address of each word that strtok() finds in the buffer.

int count counts the words found.

Here is how the updated variable declaration statements for the main() function appear when added to the source code file presented last week:

const char filename[] = "sonnet18.txt";
char *buffer;
const char separators[] = ",.:;!?\n ";
FILE *fp;
char *word;
int offset,ch,count;

The printf() statement at the end of the main() function is removed. A while loop is added in its place to parse the buffer, count and output the words. These are the new statements added to the code:

count = 0;
word = strtok(buffer,separators);
while( word )
{
    printf("%3d:%s\n",count+1,word);
    word = strtok(NULL,separators);
    count++;
}

The strtok() function appears twice in the code. The initial call at Line 54 (that’s Line 54 from the full source code file) passes the buffer and the separators string. The value returned is a pointer to the first word in the buffer, saved in char pointer word. To terminate that word, strtok() overwrites the separator that follows it with a null character, so the buffer itself is modified. When the value returned is NULL, strtok() has exhausted the search string.

The while loop spins as long as new words are parsed from the buffer. At Line 58, the strtok() function’s first argument is replaced with NULL to keep scanning the same string. The output generated consists of a long list of words in the buffer:

  1:Shall
  2:I
  3:compare
  4:thee
  5:to
  6:a
  7:summer’s
  8:day
  9:Thou
 10:art
...
109:and
110:this
111:gives
112:life
113:to
114:thee

Click here to view the full source code in my GitHub repository.
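For experimenting without the file-reading code, the parsing loop can be wrapped in a small function. This is only a sketch and not part of the posted source: the name count_words() is hypothetical, and the assert() call is a defensive check I've assumed rather than anything the original code does.

```c
#include <assert.h>
#include <string.h>

/* Count the words in text with the same strtok() pattern as the
   loop in main(). The text argument must be writable, because
   strtok() overwrites separators with null characters as it goes. */
int count_words(char *text, const char *separators)
{
    char *word;
    int count = 0;

    assert(text != NULL && separators != NULL);

    word = strtok(text, separators);      /* first call names the string */
    while( word )
    {
        count++;
        word = strtok(NULL, separators);  /* NULL continues the same string */
    }
    return count;
}
```

Pass the function a char array, not a string literal. A literal can't be modified safely, and strtok() modifies whatever it parses.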

At this point the memory addresses saved in the word pointer are lost, overwritten each time the loop repeats. But this problem is okay! The code confirms that words in the buffer can be parsed and counted, which is another step toward finding unique words and those words that repeat.
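One way to see what's being thrown away is to record each word's offset from the start of the buffer before the pointer is overwritten. The following is a sketch under assumptions of my own: the function word_offsets() and its caller-supplied array are hypothetical, and the approach taken later in this series may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Store the offset of each word within text, up to max entries.
   Returns the number of words found, which may exceed max. */
size_t word_offsets(char *text, const char *separators,
                    size_t *offsets, size_t max)
{
    char *word;
    size_t count = 0;

    assert(text != NULL && separators != NULL && offsets != NULL);

    word = strtok(text, separators);
    while( word )
    {
        /* word points inside text, so the difference is the offset */
        if( count < max )
            offsets[count] = (size_t)(word - text);
        count++;
        word = strtok(NULL, separators);
    }
    return count;
}
```

Because strtok() terminates each word with a null character, the expression text + offsets[i] is itself a valid string holding just that word.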

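Oh, and I don’t free the buffer pointer because all allocated memory is released when the program quits. If it makes you feel good, you can add this statement to the end of the main() function, before the return statement:

free(buffer);

Don’t free the word pointer. It was never allocated: it only references locations inside buffer, and it holds NULL once the loop ends.

Freeing a buffer is necessary when it’s allocated for temporary storage in a function that returns while the program keeps running. Otherwise, the memory leaks.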
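As a sketch of that situation, here's a function that allocates a temporary copy of a string, lets strtok() parse the copy so the original stays untouched, and frees the copy before returning. The name count_words_copy() and the -1 error return are hypothetical choices, not part of the posted source.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Count words without modifying the caller's string. A temporary
   copy is allocated for strtok() to chew on, then freed before the
   function returns. Returns -1 if the allocation fails. */
int count_words_copy(const char *text, const char *separators)
{
    char *copy, *word;
    int count = 0;

    assert(text != NULL && separators != NULL);

    copy = malloc(strlen(text) + 1);
    if( copy == NULL )
        return -1;
    strcpy(copy, text);

    word = strtok(copy, separators);
    while( word )
    {
        count++;
        word = strtok(NULL, separators);
    }

    free(copy);     /* the only pointer malloc() returned */
    return count;   /* word is not freed: it pointed into copy */
}
```

Only the pointer returned by malloc() is passed to free(). The word pointer merely wandered through the copy, so it needs no cleanup of its own.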

In next week’s Lesson, I continue the improvement process by retaining the word pointers allocated.
