A Tally of Unique Words, Part IV

In our last episode, the unique words code is able to parse and list individual words in the buffer. To find unique and duplicate words, the next step is to sort the list.

The word list is held in a dynamically allocated array of pointers, **list. To sort this list, I use my old pal the qsort() function. The function has this horrible man page format:

void qsort(void *base, size_t nel, size_t width, int (*compar)(const void *, const void *));

The first argument is the base of the list to sort. For my word list, it’s the name of the **list pointer: list. That’s it, no asteriskses.

The second argument is the number of items to sort. In the code, this quantity is stored in the count variable. So far so good.

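The third argument is the size of each item in the list. Every item is a char pointer, so the size is sizeof(char *).

The fourth argument is the comparison function, which is pretty much boilerplate except that the items to be sorted are in a double-pointer list. Fret not! I’ve already covered this topic in an older blog post.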

To update the code, the compare() function is added before the main() function:

int compare(const void *a, const void *b)
{
    return( strcmp( *(const char **)a, *(const char **)b ));
}

Again, read my previous blog post if you want to assure yourself that I didn’t just randomly type asterisks in that return statement.

The other change to the code is to add the qsort() call between the final while and for loops:

qsort(list,count,sizeof(char *),compare);

That’s it. The list is sorted. Here is some of the program’s output:

  1:And
  2:And
  3:And
  4:But
  5:By
  6:I
...
107:to
108:to
109:to
110:too
111:too
112:untrimmed
113:wand'rest
114:winds

The list is sorted and the duplicates are easy to spy. But one problem is apparent right away if you peruse the entire list: The sort is case-sensitive, so words that differ only in letter case are treated as different words. The list shows the three capitalized And entries and the two lowercase and entries in separate places.

The fix for this problem is rather sneaky: Replace the strcmp() function in the compare() function’s return statement with strcasecmp(), which compares strings without regard to letter case.

The strcasecmp() function isn’t part of the standard C library. To use it, you must include the strings.h header file. You could use my own version of the function, but it has a flaw: It’s unable to distinguish between matching words of different lengths, such as “to” and “too.” I have an update to my own strcasecmp() function coming in a future Lesson.

Click here to view all the modifications to the code on GitHub. In this update, strcmp() is replaced with strcasecmp() in the compare() function. Oh, and the strings.h header file is included.

Here is the updated output:

  1:a
  2:a
  3:all
  4:And
  5:And
  6:and
...
107:to
108:to
109:too
110:too
111:untrimmed
112:wand'rest
113:When
114:winds

You can see all the duplicates right away, regardless of case. The next step is to add code to find the unique words and the duplicates. This update is covered in next week’s Lesson.
