The C language woefully lacks a trim() or similar string manipulation function. Rather than let it flail in absent envy, your task for this month’s Exercise is to code such a function. The goal is to remove whitespace characters from either end of a string.
I can think of a number of ways to craft a trim() function. My approach is to pass the string to the function, then return a modified string, leaving the original string untouched. This modified string is allocated in the function, so the new string’s address is returned. Here’s the prototype:
char *trim(const char *s)
This function calls two other functions that do the trimming: rtrim() and ltrim(), which remove whitespace from the right and left sides of a string, respectively. I favor this approach as it not only allows me to piece out the solution, but the two functions are available separately as left-right string trimming functions. Again, such functions are readily found in other programming languages.
Aside from calling rtrim() and ltrim(), the trim() function confirms that the string passed isn’t NULL. It compares the results from rtrim() and ltrim() to ensure that an empty string doesn’t result, which must be specially handled. Otherwise, storage is allocated for the new string and characters copied into it.
I also made a few modifications to the main() function: A test is made for the NULL return from the trim() function. Upon success, the pointer returned is freed. (If not, the allocated memory goes untracked.)
Here is my solution:
2026_04-Exercise.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

/* find the first non-space character on the right */
const char *rtrim(const char *s)
{
    const char *right = NULL;

    right = s+strlen(s);
    /* back up over trailing whitespace, never past the start */
    while( right>s && isspace((unsigned char)*(right-1)) )
    {
        right--;
    }
    /* step onto the character itself; an empty or
       all-whitespace string returns its start */
    if( right>s )
        right--;
    return(right);
}

/* find the first non-space character on the left */
const char *ltrim(const char *s)
{
    const char *left = NULL;

    left = s;
    /* isspace() is false for '\0', so the loop stops at the end */
    while( isspace((unsigned char)*left) )
    {
        left++;
    }
    return(left);
}

/* trim spaces from the left and right of a string;
   return the new string */
char *trim(const char *s)
{
    const char *front,*back;
    char *trimmed;
    size_t size,offset;

    /* test for NULL pointer */
    if( s==NULL )
        return(NULL);

    /* find the ends */
    back = rtrim(s);
    front = ltrim(s);

    /* calculate storage, including the null character */
    if( back>=front && *front!='\0' )
        /* string contains some text */
        size = (size_t)(back-front) + 2;
    else
        /* string is empty or all spaces */
        size = 1;

    /* allocate storage */
    trimmed = malloc( sizeof(char) * size );

    /* create the trimmed string */
    if( trimmed!=NULL )
    {
        /* empty string test */
        if( size==1 )
        {
            *trimmed = '\0';    /* empty string */
        }
        else
        {
            /* copy the center */
            offset = 0;
            while( front<=back )
            {
                *(trimmed+offset) = *front;
                front++;
                offset++;
            }
            *(trimmed+offset)='\0';
        }
    }
    return(trimmed);
}

int main()
{
    const char *sample[] = {
        " one ", "\ttwo\n", "", " ", "a", " x ", NULL,
        " \t three \n", "four", " five", "six ", " seven eight "
    };
    int size,x;
    char *t;

    /* obtain array size */
    size = sizeof(sample)/sizeof(sample[0]);

    /* output trimmed strings */
    for( x=0; x<size; x++ )
    {
        /* passing NULL to %s is undefined, so substitute text */
        printf("'%s' => ",sample[x] ? sample[x] : "(null)");
        t = trim(sample[x]);
        if( t==NULL )
        {
            printf("Bad string\n");
        }
        else
        {
            printf("'%s'\n",t);
            free(t);
        }
    }

    return 0;
}
Both the rtrim() and ltrim() functions use the isspace() function, prototyped in the ctype.h header file, to check for whitespace. The argument is cast to unsigned char because passing a negative char value to isspace() is undefined. The function is used in a while loop to locate whitespace at the start and end of the string. Both functions return a pointer to the first non-space character, right and left, in the string. The rtrim() function also ensures that its pointer never backs up past the start of the string; for an empty or all-whitespace string, it returns the start of the string.
In the trim() function, an initial test is made to determine whether a NULL pointer was passed. If so, NULL is returned. For all other strings, the rtrim() and ltrim() functions are called. Each function returns a pointer holding the address of the first non-space character it finds, saved in back and front.
A test is made to check whether the back pointer is greater than or equal to the front pointer, and whether front references a character other than the terminating null. If so, the string contains some text. The storage size is the number of characters from front through back, which is back-front+1, plus one for the null character.
Otherwise, the string is either empty or contains all whitespace characters. Storage is allocated for only a single character, which creates an empty string.
The next part of the function fills the trimmed string with characters. If the calculated size is one byte, the null character is set. Otherwise, characters are copied from the passed string (s) to the allocated string (trimmed). The string is capped with a null character and returned.
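The sizing arithmetic is the easiest part of this function to get wrong, so here is an alternative way to arrive at it, offered as a rough sketch rather than as part of my solution. It tracks a start pointer and a length instead of two end pointers, then copies the span with memcpy(). The name trim_copy() is made up for this illustration.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant: trim by computing a start pointer and a
   length, then copy the span in one call */
char *trim_copy(const char *s)
{
    const char *start,*end;
    size_t len;
    char *out;

    if( s==NULL )
        return(NULL);

    /* skip leading whitespace; isspace('\0') is false,
       so the loop stops at the terminator */
    start = s;
    while( isspace((unsigned char)*start) )
        start++;

    /* end references one past the last character to keep */
    end = start+strlen(start);
    while( end>start && isspace((unsigned char)*(end-1)) )
        end--;

    /* len is zero for an empty or all-whitespace string */
    len = (size_t)(end-start);
    out = malloc(len+1);
    if( out!=NULL )
    {
        memcpy(out,start,len);
        out[len] = '\0';
    }
    return(out);
}
```

Because end sits one past the final character, the length is simply end-start and the empty string needs no special case. As with trim(), the caller frees the returned pointer.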
Here’s a sample run of my solution:
' one ' => 'one'
' two ' => 'two'
'' => ''
' ' => ''
'a' => 'a'
' x ' => 'x'
'(null)' => Bad string
' three ' => 'three'
'four' => 'four'
' five' => 'five'
'six ' => 'six'
' seven eight ' => 'seven eight'
This exercise proved more difficult than I originally thought. In fact, my first solution didn’t consider the final sample string where spaces are found in the middle. My second solution didn’t account for empty strings or strings composed entirely of spaces. The key for me was first coding the rtrim() and ltrim() functions, which made the rest of the process easier.
I hope that your solution met with success.
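If your solution took a different route, one common alternative is to trim the string in place rather than allocate a new one. The following is a minimal sketch of that idea, not the approach used above. The name trim_in_place() is hypothetical, and the function assumes the caller passes a writable buffer, never a string literal.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical in-place variant: s must be writable */
char *trim_in_place(char *s)
{
    char *start;
    size_t len;

    if( s==NULL )
        return(NULL);

    /* find the first non-space character */
    start = s;
    while( isspace((unsigned char)*start) )
        start++;

    /* shorten the length over trailing whitespace */
    len = strlen(start);
    while( len>0 && isspace((unsigned char)start[len-1]) )
        len--;

    /* source and destination may overlap, so use
       memmove() rather than memcpy() */
    memmove(s,start,len);
    s[len] = '\0';
    return(s);
}
```

The trade-off is that the original string is destroyed, which is exactly what my trim() function avoids. On the other hand, no memory is allocated and nothing needs to be freed.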
Very nice exercise, lots of things to think about in this one! I also had separate functions for trimming left and right, useful in their own right, but failed miserably because I didn’t check for null pointers and empty strings.
That happened to me – and it was frustrating to figure out how to detect for the null pointers and empty strings. I did a lot of debugging. 🙂
Hope I donʼt sound too negative saying this, but in the spirit of Edsger Dijkstraʼs “Go To Statement Considered Harmful” I would argue that using isspace() is harmful in modern code. Thatʼs because it usually classifies only six ASCII characters – ' ', '\t', '\n', '\r', '\v', '\f' – as whitespace, without taking into account all those defined by the Uɴɪᴄᴏᴅᴇ standard.
With that, the first question that arises is which characters the Uɴɪᴄᴏᴅᴇ database actually does classify as whitespace. This StackOverflow posting provides an interesting idea: during the compilation of the Python script interpreter, a script called “makeunicodedata.py” is executed that creates a function _PyUnicode_IsWhitespace() in a file named “unicodetype_db.h”. As it turns out, Python 3.15a8 classifies 29 of the 159,801 characters of the Uɴɪᴄᴏᴅᴇ 17.0 standard as whitespace:
extern bool utf32_isspace (char32_t const c)
{
return ((c >= 0x09 && c <= 0x0D) || (c >= 0x1C && c <= 0x20))
|| c == 0x0085 || c == 0x00A0 || c == 0x1680
|| (c >= 0x2000 && c <= 0x200A)
|| c == 0x2028 || c == 0x2029 || c == 0x202F
|| c == 0x205F || c == 0x3000;
}
This leaves one problem: in order to support Uɴɪᴄᴏᴅᴇ at all, the program would need to be able to work with UTF-8 encoded text. To support this, the attached solution includes two functions (in a file named “UTF.c”) that allow one to iterate over UTF-8 encoded text:
• utf8_unchecked_codepoint(), to read the next UTF-32 codepoint without error handling, and
• utf8_next_codepoint(), defined in “utf8_next_codepoint_bytewise.c”, which includes error handling.
With that, a fully Uɴɪᴄᴏᴅᴇ aware utf8_trim() function can finally be coded up (its implementation can be found in a file named “utf8_trim.c”).
To check its correctness, I added all of the aforementioned Uɴɪᴄᴏᴅᴇ characters to the test_data array:
char8_t const *test_data [13] = {
u8"\xE2\x80\x83 one \xE2\x80\x82",
u8"\t\xE2\x80\x89two\n\xE2\x81\x9F",
u8"",
u8"\xC2\xA0\xE2\x80\x87\xE2\x80\xAF",
u8"a",
u8"\xE2\x80\x8A x \xE2\x80\x84",
NULL,
u8" \t three \n\xE3\x80\x80",
u8"\v\f\r\xC2\x85",
u8"\xE1\x9A\x80 four \xE1\x9A\x80",
u8"\xE2\x80\x80" "five" "\xE2\x80\x81",
u8"six\xE2\x80\x85\xE2\x80\x86 ",
u8"\xE2\x80\xA8seven eight\xE2\x80\xA9"
};
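The actual utf8_trim() lives in the attached files and is not reproduced here. As a rough idea of how such a function might be put together, the following sketch decodes the string once from left to right and remembers where the first and last non-whitespace code points sit. It is not the attached implementation. The names is_unicode_space(), decode_utf8(), and utf8_trim_sketch() are hypothetical, and the decoder is deliberately simplified: it does not reject overlong forms or surrogates.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* the same 29 code points as utf32_isspace() above */
static int is_unicode_space(uint32_t c)
{
    return (c >= 0x09 && c <= 0x0D) || (c >= 0x1C && c <= 0x20)
        || c == 0x0085 || c == 0x00A0 || c == 0x1680
        || (c >= 0x2000 && c <= 0x200A)
        || c == 0x2028 || c == 0x2029 || c == 0x202F
        || c == 0x205F || c == 0x3000;
}

/* Hypothetical helper: decode one code point at p, which must not
   reference the terminator. Returns the number of bytes consumed.
   Malformed input is consumed one byte at a time as U+FFFD. */
static size_t decode_utf8(const unsigned char *p, uint32_t *cp)
{
    size_t need,i;
    uint32_t c;

    if( p[0] < 0x80 )                { *cp = p[0]; return(1); }
    else if( (p[0] & 0xE0) == 0xC0 ) { need = 1; c = p[0] & 0x1F; }
    else if( (p[0] & 0xF0) == 0xE0 ) { need = 2; c = p[0] & 0x0F; }
    else if( (p[0] & 0xF8) == 0xF0 ) { need = 3; c = p[0] & 0x07; }
    else                             { *cp = 0xFFFD; return(1); }

    for( i=1; i<=need; i++ )
    {
        /* the terminator fails this test as well, so the
           loop never reads past the end of the string */
        if( (p[i] & 0xC0) != 0x80 ) { *cp = 0xFFFD; return(1); }
        c = (c << 6) | (p[i] & 0x3F);
    }
    *cp = c;
    return(need+1);
}

/* return a newly allocated copy of s without leading and
   trailing Unicode whitespace; the caller frees the result */
char *utf8_trim_sketch(const char *s)
{
    const unsigned char *p;
    size_t pos = 0, first = 0, last_end = 0, n, len;
    int seen_text = 0;
    uint32_t cp;
    char *out;

    if( s==NULL )
        return(NULL);
    p = (const unsigned char *)s;

    while( p[pos] != '\0' )
    {
        n = decode_utf8(p+pos,&cp);
        if( !is_unicode_space(cp) )
        {
            if( !seen_text ) { first = pos; seen_text = 1; }
            last_end = pos+n;    /* one past this code point */
        }
        pos += n;
    }

    len = seen_text ? last_end-first : 0;
    out = malloc(len+1);
    if( out!=NULL )
    {
        memcpy(out,s+first,len);
        out[len] = '\0';
    }
    return(out);
}
```

Scanning forward only avoids having to step backward through multi-byte sequences, at the cost of decoding the whole string even when only the ends matter.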
Alternatively, if a (UTF-8) text file is passed as the first argument, the program loads it into memory to perform a benchmark… showing that utf8_next_codepoint() can decode approximately 290 million UTF-8 sequences per second on my Core i7-11850H @ 2.50GHz. (All the gory details can be found in a file named “random_utf8_text.log”.)
The program works on Windows (Visual Studio) as well as Linux (Codelite/GCC) and can be downloaded from here.
Bravo.
I noticed that the ctype functions have (supposedly) the capability to detect foreign language characters as well. For example, toupper() should translate ç to Ç. I’m unsure how to get this to work, however, without writing a version of the function on my own similar to what you’ve done above.
Regarding your code. Wow, as usual. 🙂
Thank you! To be honest, I probably should have just used the utf8proc library for all UTF-8 processing. In this case I didnʼt because I actually wanted to implement it myself.
Now, Iʼm not as familiar with ncurses as you are. That said, I am not convinced that ncurses is worth the effort in such cases. For example, when working with complex Uɴɪᴄᴏᴅᴇ text—letʼs take the string "Thế", for example—it is necessary to
(1) use utf8proc_iterate() and utf8proc_grapheme_break_stateful() to break a given string up into grapheme clusters… for 'T' and 'h' this is easy, but the third glyph is actually the following UTF-8 character sequence: uint8_t e_circumflex_and_acute[] = "e\xcc\x82\xcc\x81";
(2) convert such a uint8_t[] array back to a wide character wchar_t[] array:
wchar_t e_circumflex_and_acute_wide[] = L"e\u0302\u0301";
(3) pass this wide character array to setcchar() to get a single cchar_t which can then be rendered onto the terminal by calling add_wch()
If itʼs a TUI application and all of this needs to work, this level of complexity is probably unavoidable… but for pure console applications it is, I think, better to just stay with UTF-8 text (operated on by utf8proc) and to simply output resulting text using printf(), i.e. let the terminal worry about combining characters and all that nonsense.
In any case, sorry to say so, even a locale-aware toupper() is sometimes not enough. The problem again is that this function only works with single codepoints. For example, if the German locale is active, toupper('ß') should normally result in "SS". (Admittedly, only the Unicode Consortiumʼs ICU library/ICU4C (“ICU for C”) does this correctly; utf8procʼs utf8proc_toupper() function has the same problem, at least as long as utf8proc_map() is not used, which allows for proper case-folding, as defined in Unicode Standard Annex #21.)
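To make the single-codepoint limitation concrete, here is a minimal sketch that uses the standard wide-character function towupper() from wctype.h. The helper name upper_wide() is hypothetical. Whether a non-ASCII letter such as ç is converted depends on the active locale and the C library, so treat that part as an assumption to verify on your own system.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: uppercase a wide string in place, one
   code point at a time; returns the string's length */
size_t upper_wide(wchar_t *s)
{
    size_t i;

    for( i=0; s[i]!=L'\0'; i++ )
        s[i] = (wchar_t)towupper((wint_t)s[i]);

    /* each element maps to exactly one element,
       so the length never changes */
    return(i);
}
```

A program would typically call setlocale(LC_ALL,"") from locale.h before using the helper, so that the userʼs locale rather than the default "C" locale decides which letters are converted. Because every wchar_t is replaced by exactly one wchar_t, an expansion such as ß to "SS" cannot be expressed this way, whatever the locale.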