A Tally of Unique Words, Part VI

Any mortal programmer would have stopped with last week’s Lesson, where a tally of unique and duplicate words is output. This is the desired result, right? Yes, but the unique and duplicate words are jumbled together in a single list.

To separate the unique and duplicate words, each must be stored in a new list. A new list is something I wanted to avoid early on in the program’s development. Yet, creating a new storage structure is necessary to associate words with their repeat count.

My solution involves creating a structure:

struct u {
    char *w;
    int d;
} *unique;

The u structure contains two members: w, a pointer to the word inside the **list buffer; and d, the duplicate count. The same declaration also creates unique, a pointer to a u structure.

After the word list is sorted, I added the following statements to the code. Storage is allocated for the unique structures, with the number of structures allocated matching the word count as stored in variable count. This value is the maximum size required, assuming that each word is unique:

unique = malloc( sizeof(struct u) * count );
if( unique==NULL )
{
    fprintf(stderr,"Unable to allocate structures\n");
    exit(1);
}
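As a rough sketch of this allocation step, the helper below sizes the array for the worst case and reports failure to its caller. The function name alloc_unique is hypothetical, and the overflow guard is an extra precaution that isn't part of the original code.

```c
#include <stdint.h>
#include <stdlib.h>

/* same layout as the structure declared above */
struct u {
    char *w;    /* points at a word owned by the word list; not a copy */
    int d;      /* number of times the word appears */
};

/* Hypothetical helper: allocate room for `count` entries, the worst
   case where every word is unique. Returns NULL on failure. */
struct u *alloc_unique(int count)
{
    if( count <= 0 )
        return NULL;
    /* guard against the multiplication overflowing size_t */
    if( (size_t)count > SIZE_MAX / sizeof(struct u) )
        return NULL;
    return malloc( sizeof(struct u) * (size_t)count );
}
```

The caller still tests the result against NULL and eventually releases the array with free(). Because member w only points into the word list, the list must stay allocated for as long as the structures are in use.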

Immediately following this allocation, the for loop that scans the list for duplicates is updated. Two statements are added, shown at the end of the loop block:

dup = 1;
for( x=0,index=0; x<count-1; x+=dup,index++ )
{
    dup = 1;
    /* check the bound first so the scan never reads past the list's end */
    while( x+dup<count && strcasecmp(*(list+x),*(list+x+dup))==0 )
    {
        dup++;
    }
    (unique+index)->w = *(list+x);
    (unique+index)->d = dup;
}
index--;

Variables index and x are initialized together in the for loop. (See this Lesson for more details on setting multiple expressions in a for loop.)

The index variable is used in the final two statements in the block. It references the offset within the list of structures where the word and its count are stored. Each structure is updated with the word’s address from the **list buffer, and the dup repeat count.
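To try the grouping logic on its own, here is a minimal sketch that wraps it in a function. It is a variation and not the program's exact loop: the name build_unique is hypothetical, the scan runs all the way to count with a bounds test inside the while condition, and the function returns the number of entries written. It assumes the list is already sorted without regard to case.

```c
#include <strings.h>    /* strcasecmp() is POSIX, not standard C */

struct u {
    char *w;
    int d;
};

/* Hypothetical helper: `list` holds `count` words already sorted
   without regard to case; `out` has room for `count` entries.
   Returns the number of entries written. */
int build_unique(char **list, int count, struct u *out)
{
    int x, index;
    int dup = 1;

    for( x=0,index=0; x<count; x+=dup,index++ )
    {
        dup = 1;
        /* the bounds test comes first so list[count] is never read */
        while( x+dup<count && strcasecmp(list[x],list[x+dup])==0 )
            dup++;
        out[index].w = list[x];     /* first word of the group */
        out[index].d = dup;         /* size of the group */
    }
    return index;
}
```

Given the sorted words a, A, and, art, the, The, the, this function writes four entries: a with a count of 2, and with 1, art with 1, and the with 3.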

After the for loop builds the unique structure list, the program decrements variable index with index--. Its value indicates the number of items (counting from zero) in the unique list. This code is followed by two new loops that output each group, unique words and duplicates:

/* unique words... */
printf("Unique words:\n");
for( x=0; x<index; x++ )
{
    if( (unique+x)->d == 1 )
        printf("%s ",
                (unique+x)->w
              );
}
printf("\n\n");

/* duplicates */
printf("Words appearing more than once:\n");
for( x=0; x<index; x++ )
{
    if( (unique+x)->d > 1 )
        printf("%s (%d) ",
                (unique+x)->w,
                (unique+x)->d
              );
}
printf("\n\n");
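Since the two loops differ only in their test and their printf() format, they could be folded into one function. The sketch below is one way to do it; the name print_group and its repeats parameter are hypothetical and don't appear in the original program.

```c
#include <stdio.h>

struct u {
    char *w;
    int d;
};

/* Hypothetical helper: print the entries that belong to one group.
   Pass 0 as `repeats` for words appearing once, nonzero for words
   appearing more than once. Returns the number of entries printed. */
int print_group(FILE *out, const struct u *unique, int n, int repeats)
{
    int x, printed = 0;

    for( x=0; x<n; x++ )
    {
        int is_repeat = unique[x].d > 1;

        /* skip entries that belong to the other group */
        if( is_repeat != (repeats != 0) )
            continue;
        if( is_repeat )
            fprintf(out,"%s (%d) ",unique[x].w,unique[x].d);
        else
            fprintf(out,"%s ",unique[x].w);
        printed++;
    }
    fprintf(out,"\n\n");
    return printed;
}
```

A call such as print_group(stdout,unique,n,0) would then list the single words, and print_group(stdout,unique,n,1) the repeated ones, where n is the number of entries stored.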

Here’s the final program’s output:

Unique words:
all art as brag breathe buds But By chance changing compare complexion course darling date day death declines dimmed do every eye eyes fade from gives gold grow'st hath heaven hot I is lease life lines lives lose lovely May men nature’s not often ow’st possession Rough see shade shake shines short summer temperate that thy Time untrimmed wand'rest

Words appearing more than once:
a (2) And (5) can (2) eternal (2) fair (3) his (2) in (2) long (2) more (2) Nor (2) of (3) or (2) Shall (3) So (2) sometime (2) summer’s (2) the (2) thee (2) this (2) thou (4) to (3) too (2)

Alas, some words are split on the screen in the output. I thought about tracking the horizontal character output to avoid splitting words. But, naaa; I didn’t want to write a Part VII for this series.
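For the curious, tracking the column might look something like the sketch below. It isn't part of the program: the function name print_wrapped and its width parameter are hypothetical, and it handles plain words only, not the counts shown with the duplicates.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print words separated by spaces, starting a
   new line whenever the next word would cross column `width`.
   Returns the number of lines written. */
int print_wrapped(FILE *out, char **words, int n, int width)
{
    int x, col = 0, lines = 0;

    for( x=0; x<n; x++ )
    {
        int len = (int)strlen(words[x]);

        /* the +1 accounts for the space separating two words */
        if( col>0 && col+1+len>width )
        {
            fputc('\n',out);
            lines++;
            col = 0;
        }
        if( col>0 )
        {
            fputc(' ',out);
            col++;
        }
        fputs(words[x],out);
        col += len;
    }
    /* finish the last, partially filled line */
    if( col>0 )
    {
        fputc('\n',out);
        lines++;
    }
    return lines;
}
```

One caveat: strlen() counts bytes, not screen columns, so words containing multibyte characters such as a curly apostrophe would be measured as wider than they appear.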

You can obtain the full code from my GitHub page. I ran the program on another, larger text file and it worked. The Macintosh had some issues reading longer text files, but this problem didn’t occur on other platforms.

This code demonstrates that computers don’t mind performing a task that would drive a human nuts. As a programmer, your job is to write the code.
