Finding the Long Words

Beyond knowing how many words are in the computer’s dictionary, another good measure to know is how many characters are in the longest word. Together, these two values give you a profile for the complete word matrix.

Continuing the exploration of the Linux/Unix/macOS dictionary from last week’s Lesson, the task at hand is to find the longest word stored in the dictionary file, /usr/share/dict/words. In last week’s code, the defined constant SIZE is set to 32 and used in the fgets() function to scoop out lines of text from the file and output the words.

The value 32 is a guess. When a word is longer, the fgets() function truncates it, which is okay, but the total word count would be off. That’s because the rest of the truncated word stays in the file, where the next fgets() call reads it as if it were another word, which inflates the total word count output. I know this situation didn’t occur because the second program (from last week’s Lesson) just counted newlines and the results are the same, at least for the dictionary I installed.

Knowing the maximum size of a word in the dictionary is important if you want to manipulate the stored words. For this week’s dictionary-reading program, I set a larger SIZE value to read all the words in the dictionary file. Each word’s length is monitored and successively longer words are output. The result shows the maximum word size.

2023_10_21-Lesson.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* this code assumes the following path is valid */
#define DICTIONARY "/usr/share/dict/words"
#define SIZE 1024

int main()
{
    FILE *dict;
    int maxlen;
    char word[SIZE],*r;

    /* open the dictionary */
    dict = fopen(DICTIONARY,"r");
    if( dict==NULL )
    {
        fprintf(stderr,"Unable to open %s\n",DICTIONARY);
        exit(1);
    }

    /* find the longest word */
    maxlen = 0;
    while( !feof(dict) )
    {
        r = fgets(word,SIZE,dict);    /* read a word */
        if( r==NULL )
            break;
        /* the length includes the newline that fgets() retains */
        if( strlen(word) > maxlen )
        {
            printf("%s",word);
            maxlen = strlen(word);
        }
    }

    /* results */
    printf("The longest word is %d characters long\n",maxlen);

    /* close */
    fclose(dict);

    return(0);
}

Defined constant SIZE is set to 1024 (1K), which should be adequate for any word in my mother tongue.

After the dictionary file is opened, variable maxlen is initialized to zero: maxlen = 0;

A while loop scans the dictionary file, reading words just like last week’s Lesson. The strlen() function returns the word’s length, which is compared with the value stored in maxlen: if( strlen(word) > maxlen ) When the value is greater, the word is output and a new value for maxlen is set.

The program ends with a printf() statement that outputs the character count for the longest word.

Here’s a sample run:

A
AA
AAA
AA's
ABC's
ACLU's
ANZUS's
Aachen's
Aaliyah's
Aberdeen's
Abernathy's
Abyssinian's
Adirondacks's
Afrocentrism's
Americanization
Americanization's
Andrianampoinimerina
Andrianampoinimerina's
electroencephalograph's
The longest word is 24 characters long

Pretty!

My initial guess value from last week’s Lesson was close, seeing how the longest word in my system’s dictionary is reported as 24 characters and the original SIZE value was set to 32. This count includes the newline that fgets() retains at the end of each line, so the word electroencephalograph’s itself is 23 characters long. Different dictionaries yield different results, with the obnoxiously huge dictionary files containing scientific and technical words that may greatly exceed 24 characters.

The point of this exercise isn’t just to know the type of matrix in which the words are stored (total word count and word size), but to avoid potential overflow. It’s too easy to guess at a shorter buffer, which not only can crop the output but can lead to misreading the words and messing up the results.

I have more fun with the dictionary in next week’s Lesson.
