Dumping the Screen in Color

The hexdump utility is a marvelous tool for grabbing a sneak peek at a file’s innards, especially when debugging code that performs file access. As a text-mode tool, however, it could stand some colorful character improvement.

The hexdump utility is a filter. Run at the command prompt, it dumps a file by name or by using redirected input. Here’s a quickie test run:

~$ hexdump
Hello!
0000000 6548 6c6c 216f 000a
0000007
~$

My favorite view is “canonical” mode, activated by using the -C switch:

~$ hexdump -C
Hello!
00000000  48 65 6c 6c 6f 21 0a                              |Hello!.|
00000007
~$

Seeing both the hex dump and ASCII equivalents can truly help disclose a file’s contents. In fact, I used hexdump a few days ago to dump some word processing files saved years ago by antique software. The tool let me view the text portion of the correspondence, saving me the time of tussling with Microsoft Word to extract the text.

The Exercise for March 2020 was to code a hexdump-like utility. My solution involved reading a file for input, though it can easily be converted into a filter for more flexible input. I’ll use this program as a base for my updated hexdump/color utility, though it requires lotsa modification.
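
My updated utility starts from that Exercise solution, but for anyone curious about the filter conversion itself, here is a minimal sketch: it reads bytes from standard input and dumps them with running offsets. The layout is simplified for illustration; it doesn’t reproduce hexdump’s exact columns or my Exercise code.

#include <stdio.h>

int main()
{
    int ch;                         /* int, so that EOF can be detected */
    unsigned long offset = 0;       /* running byte count doubles as the offset */

    while( (ch=getchar()) != EOF )
    {
        /* start a new output row every 16 bytes, prefixed by the offset */
        if( offset%16 == 0 )
            printf("%s%08lx ",offset ? "\n" : "",offset);
        printf(" %02x",ch);
        offset++;
    }
    if( offset )
        putchar('\n');
    printf("%08lx\n",offset);       /* final offset, as hexdump reports */
    return 0;
}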

My first goal is to decide how to use colors to help present more than just a period in the ASCII column for non-printable characters. For example, control codes can be output color-coded red. Each code has a corresponding character: ^@ for the null character (\0), ^A for control code 1, ^B for 2, on up to ^_ for code 31. No standard equivalent exists for code 127, the “delete” character: Unicode U+2421 (␡) is often used on the web, as is U+247F (⑿) for some reason.

The following code churns through ASCII values zero through 127. For the control codes (values zero through 31), the corresponding character is output in red:

2025_12_06-Lesson.c

#include <stdio.h>

#define RED "\e[31m"        /* ANSI escape sequence for red text */
#define NORMAL "\e[m"       /* ANSI escape sequence to reset attributes */

int main()
{
    unsigned char ch;

    for( ch=0; ch<=127; ch++ )
    {
        printf("%02X %03d ",ch,ch);
        if( ch<32 )
            printf("%s%c%s",RED,ch+'@',NORMAL);
        else
            putchar(ch);
        putchar('\n');
    }
    return 0;
}

Variable ch is declared as an unsigned char, which keeps the for loop from repeating endlessly. (With a signed char, ch wraps to a negative value after 127 and the loop never ends.)

Within the for loop, the hexadecimal and decimal values of variable ch are output. When ch is less than 32, its ASCII equivalent character is output color-coded red. Otherwise, the character is output directly. Refer to this blog post for information on ANSI color text output.

Here’s a truncated sample run:

00 000 @
01 001 A
02 002 B
03 003 C
04 004 D
05 005 E
06 006 F
07 007 G
...
78 120 x
79 121 y
7A 122 z
7B 123 {
7C 124 |
7D 125 }
7E 126 ~
7F 127

For value 127 (del), I’d like to output the Unicode character ␡ as the equivalent. Additionally, for character code values 128 through 255, special characters can also be generated, perhaps in color. In fact, Unicode equivalents for the control code characters exist as well: code 0 (^@) is ␀. This update to the code requires that I retrofit it with wide character output. I begin this task in next week’s Lesson.
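
As a preview of one possible approach (a sketch only, not next week’s code), the delete character can be output by setting the locale and switching to wide-character output:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main()
{
    setlocale(LC_ALL,"");       /* adopt the terminal's locale, typically UTF-8 */
    wprintf(L"7F 127 %lc\n",(wint_t)0x2421);    /* U+2421, SYMBOL FOR DELETE (␡) */
    return 0;
}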

8 thoughts on “Dumping the Screen in Color”

  1. I did something similar once and used an array indexed with ASCII codes and containing the printable characters and text descriptions of non-printable ones. This is the code to create the array. The argument is an empty char pointer array.

    void populate_mappings(char** mappings)
    {
        // initialize to default values
        for(int i = 0; i <= 127; i++)
        {
            mappings[i] = malloc(2);

            sprintf(mappings[i], "%c", i);
        }

        // replace non-printable characters with descriptions
        set_value(mappings, 0, "[null]");
        set_value(mappings, 1, "[start of heading]");
        set_value(mappings, 2, "[start of text]");
        set_value(mappings, 3, "[end of text]");
        set_value(mappings, 4, "[end of transmission]");
        set_value(mappings, 5, "[enquiry]");
        set_value(mappings, 6, "[acknowledge]");
        set_value(mappings, 7, "[bell]");
        set_value(mappings, 8, "[backspace]");
        set_value(mappings, 9, "[tab]");
        set_value(mappings, 10, "[line feed]");
        set_value(mappings, 11, "[vertical tab]");
        set_value(mappings, 12, "[form feed]");
        set_value(mappings, 13, "[carriage return]");
        set_value(mappings, 14, "[shift out]");
        set_value(mappings, 15, "[shift in]");
        set_value(mappings, 16, "[data link escape]");
        set_value(mappings, 17, "[device control 1]");
        set_value(mappings, 18, "[device control 2]");
        set_value(mappings, 19, "[device control 3]");
        set_value(mappings, 20, "[device control 4]");
        set_value(mappings, 21, "[negative acknowledge]");
        set_value(mappings, 22, "[synchronous idle]");
        set_value(mappings, 23, "[end of trans. block]");
        set_value(mappings, 24, "[cancel]");
        set_value(mappings, 25, "[end of medium]");
        set_value(mappings, 26, "[substitute]");
        set_value(mappings, 27, "[escape]");
        set_value(mappings, 28, "[file separator]");
        set_value(mappings, 29, "[group separator]");
        set_value(mappings, 30, "[record separator]");
        set_value(mappings, 31, "[unit separator]");
        set_value(mappings, 32, "[space]");
        set_value(mappings, 127, "[delete]");
    }

    Extended ASCII is “standard” only with the proviso of the old joke: the great thing about standards is that there are so many to choose from.

    I’ve just read Code by Charles Petzold, which includes an interesting chapter on the teleprinter origins of ASCII as well as a good description of the inner workings of Unicode.

  2. I missed a bit, the set_value function.

    bool set_value(char** array, int index, char* value)
    {
        char* p = realloc(array[index], strlen(value) + 1);  // avoid losing the pointer if realloc fails

        if( p == NULL )
            return false;
        array[index] = p;
        strcpy(array[index], value);
        return true;
    }
  3. “In fact, Unicode equivalents for the control code characters exist as well: code 0 (^@) is ␀. This update to the code requires that I retrofit it with wide character output.”

    If youʼre thinking about using wchar_t as well as wprintf (or printf with %ls or %S), then I would advise against it. I think nowadays itʼs generally recognized that the idea of “wide characters” (UTF-16 or UTF-32) was a bad one.

    The reason being that many languages come with a variety of diacritical marks that make it necessary to combine more than one wchar_t—i.e. a “base character” as well as one (or even several) “combining characters” to get what most people think of as a (single) character.

    The name of the computer scientist responsible for pdf(La)TeX, Hàn Thế Thành, is a good example of this—to get the ‘ế’ of his middle name, a total of 3 ‘characters’ are required in NFD form:
    e (base character), U+0302 (combining character), U+0301 (combining character)

    The only sane way to go about this, it seems, is to treat all Uɴɪᴄᴏᴅᴇ characters as (UTF-8) string arrays:

    char e_circumflex_acute [] = "e\xCC\x82\xCC\x81";  // → ế

    … or, better yet, since C23 comes with a char8_t as well as a u8 string literal prefix:

    char8_t e_circumflex_acute [] = u8"e\xCC\x82\xCC\x81";  // → ế

    Using this char8_t data type, Iʼve prepared some sample code that illustrates how ASCII control characters can be displayed to the user using characters from the Uɴɪᴄᴏᴅᴇ Control Pictures block (see "unicode.c" for the char8_t Unicode_ControlPictures[33][4] array).

    The program works on Windows (Visual Studio) as well as Linux (Codelite/GCC), and its output can be seen in the following file: utf8-hexdump_output.txt
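
    For readers who donʼt want to grab the files, the core trick is that the Control Pictures block is contiguous: U+2400 (␀) through U+241F (␟) line up with control codes 0 through 31, and U+2421 (␡) stands in for delete. A reduced sketch in plain C, assuming a UTF-8 terminal:

    #include <stdio.h>

    int main()
    {
        unsigned char pic[4] = { 0xE2, 0x90, 0x80, 0 };  // UTF-8 bytes of U+2400 (␀)

        for( int code = 0; code < 32; code++ )
        {
            pic[2] = 0x80 + code;       // only the third UTF-8 byte changes for codes 0-31
            printf("%3d %s\n", code, (char *)pic);
        }
        pic[2] = 0xA1;                  // U+2421 (␡) for the delete character
        printf("127 %s\n", (char *)pic);

        return 0;
    }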

  4. As usual, thank you for the thorough explanation and examination.

    Your sample code is great, but requires C23. It makes me continue to yearn for compatibility that is slow in coming.

    I’m aware of the “stacking” (my term) issue for overlays in Unicode. I use them frequently in Word, which handles them well . . . depending on the typeface.

  5. Thank you for looking at my code!

    While it is true that char8_t has officially been added with C23, it’s possible to provide a compatibility typedef (as is done in the sample code anyway):

        /* Define the 8-bit character type. */
        typedef unsigned char char8_t;

    (Or just use char I guess.)

    Just my personal opinion, but as sizeof(wchar_t) is not consistent over platforms (UCS-4 on some platforms, UCS-2 on others), and even Windows nowadays comes with support for UTF-8 (since Windows 10, 1803), I think itʼs best to regard wchar_t as a mistake of the past.
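
    Easy to verify with a one-liner; this reports 4 on typical Linux systems and 2 on Windows:

    #include <stdio.h>
    #include <wchar.h>

    int main()
    {
        // wchar_t is 32-bit (UCS-4) on most Unix-likes, 16-bit (UTF-16) on Windows
        printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
        return 0;
    }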

  6. “many languages come with a variety of diacritical marks that make it necessary to combine more than one” Maybe that could be considered a shortcoming of Unicode rather than of the encoding methods? With over 4 billion values, and something over 100,000 currently used, there’s plenty of room for Hàn Thế Thành.

    If you enter “pdflatex --version” you get:


    ...
    Copyright 2015 Peter Breitenlohner (eTeX)/Han The Thanh (pdfTeX).
    ...
    Primary author of pdfTeX: Peter Breitenlohner (eTeX)/Han The Thanh (pdfTeX).
    ...

    Note the depressing lack of diacritics. I feel cheated!

  7. Just tried it under Debian 13: it seems to have been a conscious choice; even under Linux, TeX Liveʼs pdflatex only prints “Han The Thanh”. Maybe still for the best… as Dan pointed out, all this magic only works if the typeface in question supports it.

    Anyway, I thought about this idea—why not just give each character in every human language its own unique Uɴɪᴄᴏᴅᴇ codepoint—quite a bit myself over the years, but the truth is that this “base + combining characters” approach probably is the only sane way to go about it.

    Even OpenType fonts—as you’re surely aware—still exhibit a limit of 64k glyphs per font (a far cry from the 2³² available to UCS-4). Also, who would design such a font? (Even if characters with diacritics were generated automatically at first, with a human just overseeing the process.)

    As it is, itʼs the rendering engineʼs job to get everything drawn nicely onto the screen… and if some glyphs are not displayed correctly, all thatʼs needed is to fix the rendering engine.

    In short, I think NFD is fine; rather, from my point of view, the mistake was to also allow for NFC for compatibility reasons… so most of the time there are actually two different encodings that will result in the same character.
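
    To make that concrete: the same ‘é’ can be written both ways, and on a UTF-8 terminal the two lines below should render identically despite the different byte counts.

    #include <stdio.h>

    int main()
    {
        char nfc[] = "\xC3\xA9";   // U+00E9, precomposed e-acute (NFC)
        char nfd[] = "e\xCC\x81";  // e + U+0301 combining acute (NFD)

        printf("NFC: %s (%zu bytes)\n", nfc, sizeof(nfc)-1);
        printf("NFD: %s (%zu bytes)\n", nfd, sizeof(nfd)-1);
        return 0;
    }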
