Wide Characters and Unicode, Part I

At the dawn of the microcomputer era, lowercase text was considered a big deal. That’s because many home computers at the time displayed only uppercase letters. Memory was tight. Full-ASCII character generators added to the system’s cost. Yes, a microcomputer that displayed lowercase text was a big deal.

Eventually microcomputers hosted the entire gamut of ASCII text. Beyond the 32 control codes, ASCII provides symbols for all the keyboard characters, which is what most early computer users wanted — and lowercase, naturally.

A microcomputer byte held 256 values, but ASCII defines only 128 characters. So many microcomputer manufacturers offered extra characters and symbols in the “upper” or “extended” 128 codes. (These codes weren’t defined by the ASCII standard.) IBM offered its Extended ASCII character set, as well as code pages you could swap in or out. These features provided users with foreign characters, line-drawing characters, and other symbols that are today called wingdings.

A C programmer back in the day could use Extended ASCII characters to create interesting output, but the characters were inconsistent across computer platforms. To provide consistency, as well as to create a code system for all the world’s languages and symbols, people wearing white lab coats and safety goggles developed the Unicode standard.

Unlike ASCII, which stops at 128 codes, Unicode defines more than 100,000 characters. The Unicode Character Table web page presents them all in a useful format.

As a programmer, your question might be, “How can I display Unicode text in my C program’s text output?”

A better question might be, “How can I use C’s char output functions with a string of what are obviously int character values?”

The actual question I’ve received over the years is, “How do I output the Yen character?” That’s because one of my early programming books featured a program that output the Yen sign, ¥. That character was available on the IBM “extended ASCII” code page, but that system is no longer the norm, so the character no longer appears as coded.

Today, if you want to print the Yen sign in a terminal window program’s output, you might think to send its Unicode value, 0x00A5, to standard output. What you see displayed, however, is not the Yen sign unless the program is configured to output wide characters.

On the PC, I see the character Ñ displayed for code 0x00A5, which is probably the active code page character. On my Mac terminal, the ? is displayed. On my Ubuntu Linux system, the symbol appears.

A wide character occupies more than a single byte of storage; its type is wchar_t, an integer type whose size depends on the compiler (commonly 32 bits on Linux and macOS, 16 bits on Windows), not a char. The standard C library features wide-character functions and has a wide character header file, wchar.h. But using those functions isn’t enough: You must set the proper locale.

Locale settings include region and language specific details for a computer program. Items such as the language, time and date format, and character set are included in the locale details. Unless you set the locale to support wide characters, your C program outputs only standard, boring, ASCII text.

In C, you use the setlocale() function to check or set locale information. The function is prototyped in the locale.h header file and it requires two arguments: a locale category constant and a string naming the locale. The function returns a string describing the resulting locale setting, or NULL when the request can’t be honored.

In the following code, the empty string tells setlocale() to adopt the locale named by the user’s environment. The resulting setting is returned and displayed:

#include <stdio.h>
#include <locale.h>

int main()
{
    char *locale;

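    /* "" adopts the locale named by the environment */
    locale = setlocale(LC_ALL,"");
    if( locale==NULL )
    {
        puts("Unable to set the locale");
        return(1);
    }
    printf("The current locale is %s\n",locale);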

    return(0);
}

The LC_ALL constant selects every locale category at once. The program’s output might look like this:

The current locale is en_US.UTF-8

You might see something else output. Whatever the case, to output wide characters, you must set a specific environment. I address that topic in next week’s Lesson.
