I write and maintain a program that needs to handle Unicode characters. Here are some findings on developing such a program effectively under GNU/Linux. Some extra notes (typeset in green) cover proper support for CJK characters, especially Chinese characters.
Current Programming Environment:
Best Practices:
- UTF-8 (with char). Many current GNU/Linux applications (including the GNU C library since 2.2) have excellent support for UTF-8 due to its compatibility with ASCII.
- UTF-16/UTF-32 (with wchar_t). These are the obvious choices for representing Unicode when more bits are available.

If you are not familiar with Unicode, please get familiar with it before continuing. Some useful references are listed at the bottom. In particular, it is often essential to understand the UTF-8 encoding scheme and its properties to ensure correct programming.
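To make the encoding scheme concrete, here is a minimal sketch in C of how one code point maps to between one and four UTF-8 bytes. The function name utf8_encode is hypothetical and not part of any library; it only illustrates the bit layout, and it rejects surrogates and values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 (hypothetical helper).
 * Writes up to 4 bytes into out and returns the number of bytes written,
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* outside the Unicode range */
}
```

For example, U+4E2D encodes to the three bytes E4 B8 AD, while any ASCII character encodes to itself, which is the ASCII compatibility mentioned above.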
For proper text editing with Emacs in Ubuntu or other Debian-based distributions, you need the following:

- The emacs package.
- The mule-ucs package. This package greatly improves the UTF-8 support of Emacs 21.4.x. Remember to enable the Unicode support (which is disabled by default) by including the line (require 'un-define) in ~/.emacs. Refer to /usr/share/doc/mule-ucs/README.debian for more details.
- The xfonts-intl-* packages. You may also install the emacs-intl-fonts package to allow multilingual PostScript printing from Emacs.

(Without mule-ucs, Emacs 21.4.x will not display some UTF-8 characters, and will not support most input methods, as they use non-Unicode encodings. For example, none of the Chinese input methods packaged with Emacs support Unicode. mule-ucs solves many such problems transparently. Note that the upcoming Emacs 22.x should support UTF-8 better, probably without relying on an external package.)
After preparing the items above,
Popular Related Emacs Commands:
- M-x set-language-environment, or C-x <RET> l. Use English or UTF-8.
- M-x set-input-method, or C-x <RET> C-\.
- M-x toggle-input-method, or C-\.
- M-x prefer-coding-system coding <RET>.
- M-x describe-coding-system <RET>, or C-h C <RET>.
- M-x universal-coding-system-argument, or C-x <RET> c. For example, to force Emacs to open and recognize a file with UTF-8 coding, do C-x <RET> c utf-8 <RET> C-x C-f filename.
- M-x what-cursor-position, or C-x =. To show more details, use M-x describe-char-after.

Vim has supported UTF-8 well since version 6.0. Vim will start in UTF-8 mode automatically if your current locale uses a UTF-8 encoding. If you are using another locale, you can force Vim to use UTF-8 by executing ":set encoding=utf-8".
To input a Unicode character U+wxyz in insert mode, type Ctrl-v u wxyz. For more information, type ":help i_CTRL-V-digit" in Vim. Some characters can also be entered using digraphs. Type ":digraphs" to see the mapping.
Support for wide characters in the C language:
- wchar.h, for functions manipulating wide streams and several kinds of strings (including wide-character versions of some functions in stdio.h, string.h, stdlib.h, and time.h).
- wctype.h, for functions classifying and mapping wide characters (including the wide-character versions of those in ctype.h).
- stdlib.h, for some conversion functions that involve wide characters.
- wchar_t, able to store distinct wide-character codes.
- wint_t, able to store any valid value of wchar_t, or WEOF.
- WEOF, the wide-character equivalent of EOF.
- WCHAR_MIN, WCHAR_MAX. The minimum and maximum values representable by wchar_t.
- L'x', a wide-character constant.
- L"...", a wide-character string constant.
- mbtowc(), wctomb(). Convert a multibyte sequence to a wide character, and vice versa.
- mbstowcs(), wcstombs(). Convert a multibyte string to a wide-character string, and vice versa.
- wchar.h, stdlib.h.

Generally, we can manipulate Unicode characters with char (using UTF-8) or wchar_t (using UTF-16/UTF-32). But which should we stick to? The decision can vary for different parts of the source code, and depends on many factors. Some of them are listed below.
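As a rough illustration of moving between the two representations, the sketch below converts a multibyte string into a newly allocated wide-character string with mbstowcs(). The helper name to_wide is hypothetical. It relies on the POSIX/glibc behavior that calling mbstowcs() with a NULL destination returns the required length, and it interprets the input in the encoding of the current locale.

```c
#include <stdlib.h>
#include <wchar.h>

/* Convert a multibyte string (in the current locale's encoding) into a
 * newly allocated wide-character string (hypothetical helper).
 * Returns NULL on an invalid multibyte sequence or allocation failure.
 * The caller is responsible for calling free() on the result. */
wchar_t *to_wide(const char *mbs)
{
    /* With a NULL destination, mbstowcs() only counts the wide characters
     * needed (POSIX/glibc behavior), excluding the terminating null. */
    size_t n = mbstowcs(NULL, mbs, 0);
    if (n == (size_t)-1)
        return NULL;

    wchar_t *ws = malloc((n + 1) * sizeof *ws);
    if (ws == NULL)
        return NULL;

    mbstowcs(ws, mbs, n + 1);   /* n + 1 leaves room for the terminator */
    return ws;
}
```

For UTF-8 input to be decoded correctly, the program would first need to adopt a UTF-8 locale, typically by calling setlocale(LC_CTYPE, "") once at startup; in the default "C" locale only ASCII input is guaranteed to convert.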
You can use multibyte UTF-8 characters safely in the following places in the source code (tested with GCC):
"...")Stick to only ASCII characters (Unicode range U+0000 to U+007F) elsewhere.
No special option is required for GCC, as it normally assumes UTF-8 for the source and narrow (char-based) execution character sets. (The behavior is controlled by -finput-charset=charset, -fexec-charset=charset, and -fwide-exec-charset=charset.) However, make sure that you compile with the locale (controlled by the environment variables LANG, LC_CTYPE, LC_MESSAGES, and LC_ALL) set to one whose character set is encoded in UTF-8. For example, set LANG to en_US.UTF-8 for American English encoded in UTF-8.
To execute the program properly, you basically only need to ensure that (a) the execution environment (in particular the locale settings) supports UTF-8 correctly, and (b) the required fonts are installed. Luckily, many existing GNU/Linux distributions provide these by default.
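One way a program might verify condition (a) at startup is sketched below: it adopts the locale from the environment and asks nl_langinfo(CODESET) which character encoding is in effect. The function names codeset_is_utf8 and locale_uses_utf8 are hypothetical, and the accepted spellings ("UTF-8", "UTF8", in any case) are an assumption about how the codeset may be reported.

```c
#define _POSIX_C_SOURCE 200809L
#include <langinfo.h>
#include <locale.h>
#include <strings.h>

/* Return 1 if a codeset name denotes UTF-8, 0 otherwise.
 * Accepts "UTF-8" and "UTF8" in any letter case (an assumption). */
int codeset_is_utf8(const char *name)
{
    return strcasecmp(name, "UTF-8") == 0 || strcasecmp(name, "UTF8") == 0;
}

/* Adopt the locale named by LANG / LC_* and report whether its
 * character encoding is UTF-8.
 * Returns 1 if it is, 0 if it is not, and -1 if the locale requested
 * by the environment could not be set. */
int locale_uses_utf8(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        return -1;
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

A program could call locale_uses_utf8() early in main() and print a warning if the result is not 1, rather than silently producing garbled output.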
iconv converts a given file from one encoding to another. Many popular encodings (such as the Latin character sets, GB, Big5, Shift-JIS, ...) are supported by default.
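On the command line, a typical invocation looks like "iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt", where the file names are placeholders. The same conversions are also available from C through the iconv_open(), iconv(), and iconv_close() functions declared in iconv.h. The sketch below wraps them in a hypothetical helper, convert_encoding, that converts a whole buffer in one call; it assumes the output buffer is large enough and treats any failure as fatal for that conversion.

```c
#include <iconv.h>
#include <stddef.h>

/* Convert inlen bytes of `in` from encoding `from` to encoding `to`
 * (hypothetical helper). Returns the number of bytes written to `out`,
 * or (size_t)-1 if the encoding pair is unsupported, the input is
 * invalid, or the output buffer is too small. */
size_t convert_encoding(const char *to, const char *from,
                        const char *in, size_t inlen,
                        char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* iconv() takes char ** for historical reasons; it does not modify
     * the input bytes themselves. */
    char *inp = (char *)in;
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1) {
        /* Flush any pending shift state for stateful encodings. */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    }
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return outcap - outleft;
}
```

For instance, converting the single Latin-1 byte E9 (é) to UTF-8 with this helper should produce the two bytes C3 A9, provided the system's iconv supports ISO-8859-1.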