Finding Four-Letter Words

Not all the nasty words are four letters long, but a good chunk of them are. If you ran the program from last week’s Lesson, you can quickly check the computer’s dictionary for the words you once couldn’t say on TV, gleefully typing them in and confirming that they exist in the dictionary. But how many four-letter words are there?

I was going to make this Lesson’s code a monthly challenge because it took me some time to work out a solution. I assumed it would be easy, but a few things tripped me up.

Rather than use the strlen() function, I decided to check the input buffer, array word[], for a newline at element 4:

if( word[4]=='\n' )

Such a test yields a four-letter word: four characters plus the newline.

Upon testing the program, however, I found a bunch of possessives and contractions included, such as it's. To remove them, I added another test:

if( word[2]=='\'' )

If true, the word is skipped over.

But the biggest problem I had was re-using the word[] buffer when scanning the dictionary. This buffer is filled with each word in the dictionary, but it’s not erased or re-initialized. The effect is that characters linger in the buffer, which can lead to false positives.

For example, when a four-letter word is followed by a two-letter word, the newline at offset word[4] is still present: the shorter word overwrites only the first few elements of the buffer. The program would spit out the two-letter word as a match when it isn’t. To remedy this situation, I use the memset() function to clear the word[] buffer for each iteration of the while loop.

Here is the full code:

2023_11_11-Lesson.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* this code assumes the following path is valid */
#define DICTIONARY "/usr/share/dict/words"
#define SIZE 32

int main()
{
    FILE *dict;
    int wc;
    char word[SIZE],*r;

    /* open the dictionary */
    dict = fopen(DICTIONARY,"r");
    if( dict==NULL )
    {
        fprintf(stderr,"Unable to open %s\n",DICTIONARY);
        exit(1);
    }

    /* scan for four-letter words */
    wc = 0;
    while( !feof(dict) )
    {
        memset(word,'\0',SIZE);        /* clear buffer */
        r = fgets(word,SIZE,dict);    /* read a word */
        if( r==NULL )
            break;
        if( word[2]=='\'' )            /* skip possessives */
            continue;
        if( word[4]=='\n' )            /* four-letter word */
        {
            printf("%s",word);
            wc++;
        }
    }
    printf("I found %d four-letter words!\n",wc);

    /* close */
    fclose(dict);

    return(0);
}

The code borrows from examples shown in previous Lessons. The updated while loop includes calling memset() to clear the input buffer, fetching a new word, then picking out four-letter words that lack an apostrophe at the third character.

The output shows all of the four-letter words in the digital dictionary. This list includes abbreviations and plurals — and, yes, dirty words. Here’s a snippet of the output on my computer:

ABCs
ABMs
ACLU
ACTH
...
zone
zoom
zoos
I found 3376 four-letter words!

The memset() function was the key to making this program run properly. I’ve written about this function before, and how the C23 update adds the memset_explicit() function as a companion to it. The problem is optimization: a compiler may skip a memset() call when it determines that the cleared buffer is never read afterward. Alternative functions are available, though for this example memset()’s flaws shouldn’t affect the output.

Upon reflection, I did make one improvement to the code: The word length value can be set as a constant. For example:

#define LENGTH 4

The code can be updated to reflect a flexible word size:

if( word[LENGTH-2]=='\'' )

and:

if( word[LENGTH]=='\n' )

These changes come in handy for next week’s Lesson, which continues my exploration of the digital dictionary file.

2 thoughts on “Finding Four-Letter Words”

  1. Does the dictionary include words like “I’ve”? If so your code would include them because the apostrophe isn’t at [2]. The problem would be worse for finding longer words. (Does it include fo’c’s’le?)

    I would probably approach this problem by writing a separate function that identified words including non-letter characters, returning a bool, so you could ignore “it’s” for example as not having four actual letters.

    Where’s this going? Hangman? Crossword solver?

  2. I fixed the issue with the next update to the code, pretty close to what you suggest.
    My obsession is with the online game Spelling Bee. It’s mentioned in a future post.
