C Programming with Unicode

I write and maintain a program that needs to make use of Unicode characters. Here are some findings on developing such a program effectively under GNU/Linux. Some extra notes are also given on supporting CJK characters (especially Chinese characters) properly.

1. Programming Environment

Current Programming Environment:

  • Ubuntu. The concepts explained here should nonetheless be applicable to other GNU/Linux distributions.
  • Emacs 21.4.x.

Best Practices:

  • Use Unicode. It has become the common standard, and sticking to legacy encodings such as Big5, GB, CNS, or Shift-JIS is impractical because they suffer from many compatibility problems with existing applications.
  • Use UTF-8 encoding for strings of char (multibyte strings). Many current GNU/Linux applications (including the GNU C library since 2.2) have excellent support for UTF-8 due to its compatibility with ASCII.
  • Use UTF-16 or UTF-32 encoding for wide characters (wchar_t). These are the obvious choices for representing Unicode when more bits per character are available. With the GNU C library, wchar_t is 32 bits wide and holds UTF-32 (UCS-4) code points.

If you are not familiar with Unicode, please read up on it before continuing. Some useful references are listed at the bottom. In particular, it is often essential to understand the UTF-8 encoding scheme and its properties to ensure correct programming.

2. How To Do It

2.1. Text Editing

2.1.1. Emacs

For proper text editing with Emacs in Ubuntu or other Debian-based distributions, you need the following:

  • The emacs package.
  • The mule-ucs package. This package greatly improves the UTF-8 support of Emacs 21.4.x. Remember to enable the Unicode support (which is disabled by default) by including the line (require 'un-define) in ~/.emacs. Refer to /usr/share/doc/mule-ucs/README.debian for more details.
  • Extra X fonts. Besides the fonts installed on a typical GNU/Linux system, you may need extra fonts for displaying characters of other languages. To use these extra fonts, install the related xfonts-intl-* packages. You may also install the emacs-intl-fonts package to allow multi-lingual postscript printing from Emacs.
(Note: Without mule-ucs, Emacs 21.4.x cannot display some UTF-8 characters and does not support most input methods, because they use non-Unicode encodings. For example, none of the Chinese input methods packaged with Emacs supports Unicode. mule-ucs solves many of these problems transparently. The upcoming Emacs 22.x should support UTF-8 better, probably without relying on an external package.)

After preparing the items above,

  • Emacs assumes that a normal text file is coded in UTF-8 by default.
  • More Unicode characters are displayed properly (using the X11 fonts you have installed).
  • Many input methods that use non-Unicode encodings are supported. Translation from the supported codings to UTF-8 is handled internally and automatically.

Popular Related Emacs Commands:

  • Select a language environment: M-x set-language-environment, or C-x <RET> l. Use English or UTF-8.
  • Select an input method: M-x set-input-method, or C-x <RET> C-\.
  • Toggle input method: M-x toggle-input-method, or C-\.
  • Set the preferred coding system: M-x prefer-coding-system <RET> coding <RET>.
  • Show the coding systems currently in use: M-x describe-coding-system <RET>, or C-h C <RET>.
  • Run the next I/O command with a specified coding system: M-x universal-coding-system-argument, or C-x <RET> c. For example, to force Emacs to open a file and read it as UTF-8, type C-x <RET> c utf-8 <RET> C-x C-f filename.
  • Display the information about the character at the current cursor position: M-x what-cursor-position, or C-x =. To show more details, use M-x describe-char-after.

2.1.2. Vim

Vim has supported UTF-8 well since version 6.0. Vim starts in UTF-8 mode automatically if your current locale uses a UTF-8 encoding. If you are using another locale, you can force Vim to use UTF-8 by executing ":set encoding=utf-8".

To input a Unicode character U+wxyz in Insert mode, type Ctrl-v u wxyz. For more information, type ":help i_CTRL-V_digit" in Vim. Some characters can be entered by using digraphs. Type ":digraphs" for the mapping.

2.2. Programming Issues

Support of wide characters in C language:

  • Header Files
    • wchar.h, for functions manipulating wide streams and several kinds of strings (includes wide-character versions of some functions in stdio.h, string.h, stdlib.h, and time.h).
    • wctype.h, for functions classifying and mapping wide characters (includes the wide-character versions of those in ctype.h).
    • stdlib.h, for some conversion functions that involve wide characters.
  • Types
    • wchar_t, able to store distinct wide-character codes.
    • wint_t, able to store any valid value of wchar_t, or WEOF.
  • Constants
    • WEOF, wide character equivalent of EOF.
    • WCHAR_MIN, WCHAR_MAX. The minimum and maximum values representable by wchar_t.
    • L'x', a wide character constant.
    • L"...", a wide-character string constant.
  • Library Functions
    • mbtowc(), wctomb(). Convert a multibyte sequence to a wide character, and vice versa.
    • mbstowcs(), wcstombs(). Convert a multibyte string to a wide character string, and vice versa.
    • Many others. Defined in wchar.h, stdlib.h.

Generally, we can manipulate Unicode characters with char (using UTF-8) or wchar_t (using UTF-16 or UTF-32). But which should we stick to? The decision can vary between different parts of the source code, and depends on many factors. Some of them are listed below.

  • The interface of external libraries. Obviously, if a library routine expects UTF-8 strings, you need to pass UTF-8 strings to it. Many existing GUI libraries (such as GTK, FLTK-UTF8) and X terminal emulators (such as xfterm4, gnome-terminal, konsole) expect and handle UTF-8 strings correctly.
  • The complexity of string manipulation. If your program performs many complex string manipulations that can't be done easily with UTF-8 strings, it's better to use wide strings, at least in that portion of the code. Many common string operations (such as string comparison, string concatenation, and character counting) can be performed easily with UTF-8 strings. Wide strings make some other operations easier, such as character substitution and computing display width.

You can use multibyte UTF-8 characters safely in the following places in the source code (tested with GCC):

  • Character strings surrounded by a pair of double quotes. ("...")
  • Comments.

Stick to only ASCII characters (Unicode range U+0000 to U+007F) elsewhere.

2.3. Program Compilation

No extra option is required for GCC, as it normally assumes UTF-8 for char-based strings both when reading the source and in the compiled program. (The behavior is controlled by -finput-charset=charset, -fexec-charset=charset, and -fwide-exec-charset=charset.) However, make sure that you compile with the locale (controlled by the environment variables LANG, LC_CTYPE, LC_MESSAGES, and LC_ALL) set to use a character set encoded in UTF-8. For example, set LANG to en_US.UTF-8 for American English encoded in UTF-8.

2.4. Program Execution

To execute the program properly, you basically only need to ensure that (a) the execution environment (in particular the locale settings) supports UTF-8 correctly, and (b) the required fonts are installed. Luckily, many existing GNU/Linux distributions provide these by default.

2.5. Useful Utilities

iconv converts a given file from one encoding to another. Many popular encodings (such as the Latin character sets, GB, Big5, and Shift-JIS) are supported by default. For example, "iconv -f BIG5 -t UTF-8 input.txt > output.txt" converts a Big5 file to UTF-8.

3. References / Further Readings