
Any C programmer eager to mess with characters or strings knows about the handy ctype functions. I use this name because these functions, which include a few macros, are defined in the ctype.h header file. Their job is to manipulate and examine characters.
I divide the ctype functions into the “to” and “is” categories.
The “to” functions manipulate characters. Only two of them are available: toupper() and tolower(), which convert a character from lowercase to uppercase and vice-versa, respectively. Both functions start with “to.”
The “is” functions return TRUE or FALSE based on the character’s attributes. For example, isalpha() returns TRUE when the character examined is alphabetic, upper- or lowercase. The function starts with “is,” which is how I define this category. Lotsa “is” ctype functions are available:
isalnum()
isalpha()
isascii()
isblank()
iscntrl()
isdigit()
isgraph()
islower()
isprint()
ispunct()
isspace()
isupper()
isxdigit()
All of these functions are defined in the ctype.h header file. They all have a similar man page format. For example:
int isspace(int c)
The argument c is specified as an integer, though it must have the value of an unsigned char or EOF. (The EOF is why the prototype is an integer, which allows this function to work with standard I/O.)
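The return value is non-zero for a TRUE or positive result, zero otherwise. Non-zero doesn’t necessarily mean 1, so test the result for truth rather than comparing it with a specific value. For example, isspace(' ') returns TRUE because the space is a whitespace character.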
These functions work reliably on standard ASCII characters. Supposedly, they can also function in other languages when the locale is set, and variations on the functions are available to handle different alphabets. I’ve been unable to verify whether this feature works. So, my exploration of these functions is limited to standard ASCII, the Latin alphabet.
It’s easy to guess what each of the functions does based on the name, though some are kinda weird. Here are brief descriptions:
isalnum() | returns TRUE for letters of the alphabet (both upper- and lowercase) as well as digits 0 through 9.
isalpha() | returns TRUE for an alphabetic character, both upper- and lowercase.
isascii() | returns TRUE if character c is an ASCII character, codes 0 through 127.
isblank() | returns TRUE for a space (' ') or tab ('\t') character.
iscntrl() | returns TRUE for a control character, ASCII codes 0 through 31 (0x00 through 0x1F) and 127 (0x7F, DEL).
isdigit() | returns TRUE when the character is a digit, zero through 9.
isgraph() | returns TRUE for all printable characters except for a space.
islower() | returns TRUE for a lowercase letter.
isprint() | returns TRUE for all printable characters, including the space.
ispunct() | returns TRUE for a printable character that is neither a space nor alphanumeric.
isspace() | returns TRUE for any whitespace character, including space, tab, form feed, newline, carriage return, and vertical tab.
isupper() | returns TRUE for an uppercase letter.
isxdigit() | returns TRUE for characters used in hexadecimal values, zero through 9 and A through F both upper- and lowercase.
Over the next few weeks, I’ll cover these functions and how they work. I’ll also present code that emulates the functions just because it’s a fun thing to do!
These are “ctype” functions.
I will admit to using the above functions myself from time to time. They can be useful, so I donʼt want to judge them too harshly.
Even with setlocale() (and as noted in the man pages) they only work with extended ASCII, however. These functions donʼt offer any support for Uɴɪᴄᴏᴅᴇ… at all.
The simplest solution to obtain Uɴɪᴄᴏᴅᴇ-compatible character classification would be the FSFʼs unistring library:
/* sudo apt install -y libunistring-dev; Linker: -lunistring */
#include <stdio.h>
#include <unictype.h>
if (uc_is_alpha (U'ま')) /* Hiragana Letter 'ma' */
    fprintf (stdout, "Hiragana 'ma' is classified as a letter.\n");
Regrettably, this library comes with its own weaknesses. For example, it doesnʼt provide any functions to recognize numerical values that arenʼt positional (like the Chinese logographic number system):
if (!uc_is_digit (U'五')) /* Chinese Hanzi Numeral '5' */
    fprintf (stdout, "Wonʼt recognize 五 as a numeral (with value 5).\n");
In practice, the Unicode Consortiumʼs International Components for Unicode is probably the best option:
/* sudo apt install libicu-dev; pkg-config --cflags --libs icu-uc */
#include <unicode/uchar.h>
double value = u_getNumericValue (U'五');
if (value != U_NO_NUMERIC_VALUE)
    fprintf (stdout, "Represents the numeric value %d\n", (int)value);
Hereʼs a list of ICU character classification functions comparable to those in <ctype.h>:
ctype.h     ICU equivalent  Matches
isalpha()   u_isalpha()     All Unicode letters
islower()   u_islower()     Unicode lowercase letters
isupper()   u_isupper()     Unicode uppercase letters
isdigit()   u_isdigit()     Unicode (positional) digits
isxdigit()  u_isxdigit()    Hex digits [0-9a-fA-F]
isalnum()   u_isalnum()     Letters + digits
isspace()   u_isspace()     All Unicode whitespace characters
isblank()   u_isblank()     Horizontal whitespace only
isgraph()   u_isgraph()     Visible (non-space) characters
isprint()   u_isprint()     Printable characters (including space)
ispunct()   u_ispunct()     Punctuation characters
iscntrl()   u_iscntrl()     Control characters
To illustrate the differences, I wrote a small example application that checks for all Uɴɪᴄᴏᴅᴇ characters that should be interpreted as whitespace. Here is its output:
Code    Name                       isspace  uc_is_space  u_isspace
------------------------------------------------------------------
U+0009  HT (TAB)                   1        1            1
U+000A  LF                         1        1            1
U+000B  VT                         1        1            1
U+000C  FF                         1        1            1
U+000D  CR                         1        1            1
U+001C  FS                         0        0            1
U+001D  GS                         0        0            1
U+001E  RS                         0        0            1
U+001F  US                         0        0            1
U+0020  SPACE                      1        1            1
U+0085  NEL                        0        0            1
U+00A0  NO-BREAK SPACE             0        0            1
U+1680  OGHAM SPACE MARK           0        1            1
U+2000  EN QUAD                    0        1            1
U+2001  EM QUAD                    0        1            1
U+2002  EN SPACE                   0        1            1
U+2003  EM SPACE                   0        1            1
U+2004  THREE-PER-EM SPACE         0        1            1
U+2005  FOUR-PER-EM SPACE          0        1            1
U+2006  SIX-PER-EM SPACE           0        1            1
U+2007  FIGURE SPACE               0        0            1
U+2008  PUNCTUATION SPACE          0        1            1
U+2009  THIN SPACE                 0        1            1
U+200A  HAIR SPACE                 0        1            1
U+2028  LINE SEPARATOR             0        1            1
U+2029  PARAGRAPH SEPARATOR        0        1            1
U+202F  NARROW NO-BREAK SPACE      0        0            1
U+205F  MEDIUM MATHEMATICAL SPACE  0        1            1
U+3000  IDEOGRAPHIC SPACE          0        1            1
In my view, <ctype.h> simply doesnʼt cut it anymore; libunistring / ICU4C for the win!
Outstanding info.
I’m too chicken to reconfigure a computer for another language. I did that once; Korean still shows up on one of my old Windows 10 boxes. But I agree that many of these functions were written in an era when locality was assumed to always be local.
Thank you for your kind words! I was hoping that my comments would be useful ☺
My main motivation for the above is that I believe the community needs to clarify that modern C is—of course—a fully Uɴɪᴄᴏᴅᴇ-capable language, a language that provides suitable functions for working with text in various languages and is therefore suitable for developing international applications.
If we donʼt do this, C will inevitably be seen as inferior to modern languages like Rust or Swift… and thus wonʼt be able to survive beyond this half-century outside of legacy applications.
I don’t think C23 improves the standard, though there may be hope for the next standard – or even a subset of C that creates this level of compatibility. It would be nice to see an implementation as an update and not a reintroduction of C as yet another OOP language or some “improved” version of Java.
“It would be nice to see an implementation as an update and not a reintroduction of C as yet another OOP language or some “improved” version of Java.”
My sentiment exactly, I fully agree. Looking through WG14’s document log, the standards committee unfortunately seems to be bent on adding features like defer and Closures in C as well as quite a few other language extensions to bring C closer to C++ once again (also quite a few additions to the preprocessor). In the coming years, C will probably no longer remain the small language with the simple syntax that we have all come to love…
BASIC was extremely popular in the 1970s and early 80s, specifically on microcomputers of the day. Each one came with its own BASIC.
Kemeny and Kurtz, who developed BASIC at Dartmouth, wanted in on the action. So they produced TRUE BASIC, which to me smelled a lot like Pascal and wasn’t very BASIC-y at all. My guess is that the same thing may happen to C – not that it hasn’t happened a dozen times already.
Before I was introduced to Turbo Pascal in secondary school, I experimented with GW-BASIC under MS-DOS 3.x. Having never heard of True BASIC before I just took a look at it. Youʼre right, the syntax does suspiciously look like Pascal!
Anyway, with the renewed momentum in the development of C itʼs unfortunately all too likely that C2y and its successors will feel markedly different from C89…