Counting the Non-ASCII Bytes

Difficulty: ★ ★ ☆ ☆

The file utility in Linux quickly identifies known file types. It recognizes text files, graphics images, and other common file types by using some sort of magic: it examines a file’s contents rather than trusting the filename extension (which isn’t a guarantee). You can employ similar sorcery by writing your own Is It ASCII program.

I originally coded this utility about 15 years ago. My comments in the code read:

/* hasc - Dan Gookin, January 3, 2008
   I just needed this today, but will probably
   [mess] with it some more to add some options
Purpose: To check for and remove non-ascii characters from a text file
*/

I used the hasc program to hunt for wide characters in plain text files, specifically curly quotes that didn’t translate well. The program worked to locate these characters, but I never wrote the second half to remove or replace them. Anyway, my hasc program exists and it’s the basis for your challenge in this month’s C programming exercise.

Right away, know that the ASCII character codes range from zero through 127. You can read more about ASCII here. Though codes zero through 31 don’t display as characters, they’re still considered ASCII text. Only byte values above 127 aren’t ASCII. These codes are sometimes referred to as “Extended ASCII,” though that’s an informal label associated with IBM rather than part of the ASCII standard.

For this challenge, write code that accepts a filename as an argument. Open the file and scan its character values. Tally each character code larger in value than 127. Output the filename and the number of non-ASCII character codes encountered.

Here’s a sample run of my solution, which scans a plain text file:

Examining 'sonnet18.txt'...
0 non-ASCII characters found in 'sonnet18.txt'

And here is a sample run of my solution scanning a wide text file:

Examining 'widetext.txt'...
8 non-ASCII characters found in 'widetext.txt'

Remember that wide text uses multiple bytes to represent a single character. In the above example, two emojis appear in the file widetext.txt. At four bytes apiece, they account for the eight bytes detected by the solution. Pure binary files show even higher numbers.

Try coding your solution to the problem. You can view my solution here.
