Counting the Non-ASCII Bytes – Solution

Several items are noteworthy for this month’s Exercise, with the biggie being how to detect a non-ASCII character. The secret involves a wee bit of type conversion.

First, you must know how to detect whether a command line option is present.

Second, you must open the argument as a file and properly report when this operation fails.

After the file is open, the third step is to read each character in the file, from first to EOF. Along the way, each character is examined to determine whether it lies outside the ASCII range (zero through 127), and those that do are tallied.

Here is my solution:

2023_04-Exercise.c

#include <stdio.h>
#include <stdlib.h>

int main(int argc,char *argv[])
{
    FILE *fp;
    int count,c;
    char *file;

    /* check for filename argument */
    if( argc<2 )
    {
        fprintf(stderr,"Specify a filename\n");
        exit(1);
    }
    /* open the file */
    file = argv[1];        /* shortcut */
    fp = fopen(file,"rb");    /* binary mode: count raw bytes */
    if( fp==NULL )
    {
        fprintf(stderr,"Can't open '%s'\n",file);
        exit(1);
    }
    printf("Examining '%s'...\n",file);

    /* scan for and count non-ASCII characters */
    count = 0;
    while(!feof(fp))
    {
        c = fgetc(fp);
        if( c==EOF )
            break;
        if((unsigned char)c > 127 )
            count++;
    }

    /* close file and output results */
    fclose(fp);
    printf("%d non-ASCII characters found\n",
            count
          );

    return(0);
}

If present, a filename argument brings the argc value to two. This condition is tested for at Line 11.

Next, the argument stored in argv[1] is copied to char pointer file for convenience. The file is opened. If this operation fails, an error message is output. Otherwise text is output to alert the user that the file is being examined.

The while loop reads characters from the file. It uses the feof() function to halt once the end-of-file is encountered for FILE pointer fp. Within the loop, the fgetc() function reads a character into int variable c. A test is made immediately for the EOF, which breaks the loop once encountered. Because feof() returns true only after a read has already hit the end of the file, it's this EOF test that actually ends the loop.
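The loop can also be written without feof() by testing the return value of fgetc() directly. The sketch below is a common alternative, not the solution shown above; the function name count_non_ascii() is made up for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: count the bytes above 127 in an open stream.
   The return value of fgetc() is tested directly, so no feof() call
   is needed to end the loop. */
long count_non_ascii(FILE *fp)
{
    long count = 0;
    int c;

    while( (c = fgetc(fp)) != EOF )
    {
        if( (unsigned char)c > 127 )
            count++;
    }
    return(count);
}
```

Calling count_non_ascii(fp) right after a successful fopen() would take the place of the while loop in the solution.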

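The magic happens at Line 33, where the value of character c is tested. First, the integer value is typecast to an unsigned char data type. The type must be unsigned: With a plain char, which is signed on many systems, any values above 127 are interpreted as negative and the test never passes. Strictly speaking, fgetc() already returns each byte as an unsigned char converted to an int, so c holds a value from zero through 255; the typecast makes the intent explicit and guards against the signed interpretation. If the character value is greater than 127, variable count is incremented.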
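To see why the unsigned type matters, here is a small sketch that compares the two typecasts. The function names are hypothetical, and the second function assumes an 8-bit char, where a signed char can't hold a value above 127.

```c
/* Hypothetical helper: returns 1 when the byte value lies outside
   the ASCII range (0 through 127), 0 otherwise. */
int is_non_ascii(int c)
{
    return (unsigned char)c > 127;
}

/* The same test through a signed char. Assuming an 8-bit char, the
   converted value can never exceed 127, so this always returns 0. */
int is_non_ascii_signed(int c)
{
    return (signed char)c > 127;
}
```

For a byte such as 0xC3, the first function reports a non-ASCII value and the second does not, which is the failure the unsigned typecast avoids.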

After the loop exits, meaning every byte of the file is read, the file is closed and a printf() statement outputs the results.

The tricky part for this solution is the unsigned char typecast, which allows the ASCII test to pass or fail. I hope your solution met with success.

2 thoughts on “Counting the Non-ASCII Bytes – Solution”

  1. Somehow I missed this exercise – too bad! Thinking about the above, another challenge comes to mind however:

    Instead of »non-ASCII bytes« try to count how many »UTF-8 encoded Uɴɪᴄᴏᴅᴇ codepoints« a given text file contains.

    Just a small suggestion, if you ever wanted to go for a follow-up to the above exercise 🙂
