@node Iconv
@chapter Character-set conversions (@file{iconv.h})

This chapter describes the Newlib iconv library.
The iconv function declarations are in
@file{iconv.h}.

@menu
* iconv::                     Character set conversion routines
* iconv architecture::        Architecture of Newlib iconv library
* iconv configuration::       Newlib iconv-specific configure options
* Generating CCS tables::     How to generate CCS tables
* Adding new converter::      Steps on adding a new converter
@end menu

@page
@include iconv/iconv.def

@page
@node iconv architecture
@section iconv architecture
@findex iconv architecture
@findex encoding
@findex CCS
@findex CES
@findex iconv converter

@*
@itemize @bullet
@item
Encoding - a rule to represent computer text by means of bits and bytes.

@item
CCS (Coded Character Set) - a mapping from an abstract character set
to a set of non-negative integers (character codes).

@item
CES (Character Encoding Scheme) - a mapping from a set of character
codes to a sequence of bytes.
@end itemize

@*
Examples of CCS: ASCII, ISO-8859-x, KOI8-R, KSX-1001, GB-2312.@*
Examples of CES: UTF-8, UTF-16, EUC-JP, ISO-2022-JP.

@*
The iconv library is used to convert an array of characters in one
encoding to an array in another encoding.

@*
From a user's point of view, the iconv library is a set of converters.
Each converter corresponds to one encoding (e.g., KOI8-R converter, UTF-8
converter). Internally, however, a converter is built from CCS tables and
CES modules, as described in the rest of this section.

@*
The iconv library always performs conversions through UCS-32: i.e., to
convert from A to B, the iconv library first converts A to UCS-32, and
then UCS-32 to B.

@*
Each encoding consists of a CES and a CCS. A CCS may be represented as
data tables, but a CES always implies some code (an algorithm). Iconv
uses CCS tables to map from some encoding to UCS-32. CCS tables are placed
into the iconv/ccs subdirectory of newlib. The iconv code also uses CES
modules which can convert some CCS to and from UCS-32. CES modules are
placed in the iconv/ces subdirectory.
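
@*
The two-stage scheme described above can be illustrated with a minimal
sketch. This is not the actual Newlib code: the table fragment, the
@code{struct ccs_entry} type and the functions @code{koi8r_to_ucs},
@code{ucs_to_utf8} and @code{koi8r_to_utf8} are hypothetical names
introduced only for this illustration. The first stage plays the role of
a CCS table lookup and the second stage plays the role of a CES algorithm.

@verbatim
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fragment of a KOI8-R CCS table: byte -> UCS-32 code.  */
struct ccs_entry
{
  unsigned char byte;
  uint32_t ucs;
};

static const struct ccs_entry koi8r_fragment[] = {
  { 0xC1, 0x0430 },   /* CYRILLIC SMALL LETTER A  */
  { 0xC2, 0x0431 },   /* CYRILLIC SMALL LETTER BE */
};

/* Stage 1 (CCS): table lookup from a KOI8-R byte to UCS-32.
   Returns 0 on success, -1 if the byte is not in the fragment.  */
static int
koi8r_to_ucs (unsigned char byte, uint32_t *ucs)
{
  size_t i;

  if (byte < 0x80)
    {
      *ucs = byte;              /* The lower half is plain ASCII.  */
      return 0;
    }
  for (i = 0; i < sizeof koi8r_fragment / sizeof koi8r_fragment[0]; i++)
    if (koi8r_fragment[i].byte == byte)
      {
        *ucs = koi8r_fragment[i].ucs;
        return 0;
      }
  return -1;
}

/* Stage 2 (CES): algorithmic encoding of one UCS-32 code as UTF-8.
   Returns the number of bytes stored in out (1 to 4), or 0 if the
   code cannot be encoded.  */
static size_t
ucs_to_utf8 (uint32_t ucs, unsigned char out[4])
{
  if (ucs < 0x80)
    {
      out[0] = (unsigned char) ucs;
      return 1;
    }
  if (ucs < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (ucs >> 6));
      out[1] = (unsigned char) (0x80 | (ucs & 0x3F));
      return 2;
    }
  if (ucs >= 0xD800 && ucs <= 0xDFFF)
    return 0;                   /* Surrogates are not characters.  */
  if (ucs < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (ucs >> 12));
      out[1] = (unsigned char) (0x80 | ((ucs >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (ucs & 0x3F));
      return 3;
    }
  if (ucs < 0x110000)
    {
      out[0] = (unsigned char) (0xF0 | (ucs >> 18));
      out[1] = (unsigned char) (0x80 | ((ucs >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((ucs >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (ucs & 0x3F));
      return 4;
    }
  return 0;
}

/* KOI8-R -> UCS-32 -> UTF-8.  Returns the number of bytes written,
   or (size_t) -1 on an unmapped byte or insufficient output space.  */
size_t
koi8r_to_utf8 (const unsigned char *in, size_t inlen,
               unsigned char *out, size_t outsz)
{
  size_t i, j, n, used = 0;
  unsigned char tmp[4];
  uint32_t ucs;

  for (i = 0; i < inlen; i++)
    {
      if (koi8r_to_ucs (in[i], &ucs) != 0)
        return (size_t) -1;
      n = ucs_to_utf8 (ucs, tmp);
      if (n == 0 || n > outsz - used)
        return (size_t) -1;
      for (j = 0; j < n; j++)
        out[used++] = tmp[j];
    }
  return used;
}
@end verbatim

@*
In this sketch the pivot value is a plain 32-bit integer, so any table-based
source stage could be paired with any algorithmic target stage. That pairing
is the property the UCS-32 pivot is meant to provide.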

@*
Some encodings have CES = CCS (e.g., KOI8-R). For such encodings iconv uses
special subroutines which perform simple table conversions (ccs_table.c).

@*
Among specialized CES modules, the iconv library has generic support for
EUC and ISO-2022-family encodings (ces_euc.c and ces_iso2022.c).

@*
To enable iconv to work with CCS or CES-based encodings, the corresponding
CCS table or CES module should be linked with Newlib. The iconv support can
also load CCS tables dynamically from external files (.cct files from the
iconv/ccs/binary subdirectory). CES modules, on the other hand, can't be
dynamically loaded.

@*
Each iconv converter has one name and a set of aliases. The list of aliases
for each converter's name is in the iconv/charset.aliases file. Note: iconv
always normalizes converter names and aliases before using them.

@page
@node iconv configuration
@section iconv configuration
@findex iconv configuration
@findex iconv converter

@*
To enable iconv, the @option{--enable-newlib-iconv} configuration option
should be used when configuring Newlib.

@*
To link a specific converter (CCS table or CES module) into Newlib, the
@option{--enable-newlib-builtin-converters} option should be used. A
comma-separated list of converters can be passed with this option (e.g.,
@option{--enable-newlib-builtin-converters=koi8-r,euc-jp} to link the
KOI8-R and EUC-JP converters). Either converter names or aliases may be
used.

@*
If the target system has a file system accessible by Newlib, table-based
converters may be loaded dynamically from external files. The iconv code
tries to load files from the iconv_data subdirectory of the directory
specified by the NLSPATH environment variable.

@*
Since Newlib has no generic dynamic module load support, CES-based
converters can't be dynamically loaded and must be linked in.

@page
@node Generating CCS tables
@section Generating CCS tables

@*
CCS tables are placed in the ccs subdirectory of the iconv directory.
This subdirectory contains .cct and .c files.
The .cct files are for dynamic loading whereas the .c files are for static
linking with Newlib. Both .c and .cct files are generated by the
'iconv_mktbl' Perl script from special source files (call them .txt files).
The 'iconv_mktbl' script can be found in the iconv/ccs subdirectory. Input
.txt files can be found at the Unicode.org site or at other locations on
the web.

@*
The .c files are linked with Newlib if the corresponding 'configure' script
option was given. This is needed to use iconv on targets without file system
support. If a CCS table isn't configured to be linked, the iconv library
tries to load it dynamically from the corresponding .cct file.

@*
The following are commands to build .c and .cct CCS table files from .txt
files for several supported encodings.
@*

@itemize
@item
cp775:@*
iconv_mktbl -Co cp775.c cp775.txt@*
iconv_mktbl -o cp775.cct cp775.txt
@end itemize

@itemize
@item
cp850:@*
iconv_mktbl -Co cp850.c cp850.txt@*
iconv_mktbl -o cp850.cct cp850.txt
@end itemize

@itemize
@item
cp852:@*
iconv_mktbl -Co cp852.c cp852.txt@*
iconv_mktbl -o cp852.cct cp852.txt
@end itemize

@itemize
@item
cp855:@*
iconv_mktbl -Co cp855.c cp855.txt@*
iconv_mktbl -o cp855.cct cp855.txt
@end itemize

@itemize
@item
cp866@*
iconv_mktbl -Co cp866.c cp866.txt@*
iconv_mktbl -o cp866.cct cp866.txt
@end itemize

@itemize
@item
iso-8859-1@*
iconv_mktbl -Co iso-8859-1.c iso-8859-1.txt@*
iconv_mktbl -o iso-8859-1.cct iso-8859-1.txt
@end itemize

@itemize
@item
iso-8859-4@*
iconv_mktbl -Co iso-8859-4.c iso-8859-4.txt@*
iconv_mktbl -o iso-8859-4.cct iso-8859-4.txt
@end itemize

@itemize
@item
iso-8859-5@*
iconv_mktbl -Co iso-8859-5.c iso-8859-5.txt@*
iconv_mktbl -o iso-8859-5.cct iso-8859-5.txt
@end itemize

@itemize
@item
iso-8859-2@*
iconv_mktbl -Co iso-8859-2.c iso-8859-2.txt@*
iconv_mktbl -o iso-8859-2.cct iso-8859-2.txt
@end itemize

@itemize
@item
iso-8859-15@*
iconv_mktbl -Co iso-8859-15.c iso-8859-15.txt@*
iconv_mktbl -o iso-8859-15.cct iso-8859-15.txt
@end itemize

@itemize
@item
big5@*
iconv_mktbl -Co big5.c big5.txt@*
iconv_mktbl -o big5.cct big5.txt
@end itemize

@itemize
@item
ksx1001@*
iconv_mktbl -Co ksx1001.c ksx1001.txt@*
iconv_mktbl -o ksx1001.cct ksx1001.txt
@end itemize

@itemize
@item
gb_2312@*
iconv_mktbl -Co gb_2312-80.c gb_2312-80.txt@*
iconv_mktbl -o gb_2312-80.cct gb_2312-80.txt
@end itemize

@itemize
@item
jis_x0201@*
iconv_mktbl -Co jis_x0201.c jis_x0201.txt@*
iconv_mktbl -o jis_x0201.cct jis_x0201.txt
@end itemize

@itemize
@item
shift_jis@*
iconv_mktbl -Co shift_jis.c shift_jis.txt@*
iconv_mktbl -o shift_jis.cct shift_jis.txt
@end itemize

@itemize
@item
jis_x0208@*
iconv_mktbl -C -c 1 -u 2 -o jis_x0208-1983.c jis_x0208-1983.txt@*
iconv_mktbl -c 1 -u 2 -o jis_x0208-1983.cct jis_x0208-1983.txt
@end itemize

@itemize
@item
jis_x0212@*
iconv_mktbl -Co jis_x0212-1990.c jis_x0212-1990.txt@*
iconv_mktbl -o jis_x0212-1990.cct jis_x0212-1990.txt
@end itemize

@itemize
@item
cns11643-plane1@*
iconv_mktbl -C -p 0x1 -o cns11643-plane1.c cns11643.txt@*
iconv_mktbl -p 0x1 -o cns11643-plane1.cct cns11643.txt
@end itemize

@itemize
@item
cns11643-plane2@*
iconv_mktbl -C -p 0x2 -o cns11643-plane2.c cns11643.txt@*
iconv_mktbl -p 0x2 -o cns11643-plane2.cct cns11643.txt
@end itemize

@itemize
@item
cns11643-plane14@*
iconv_mktbl -C -p 0xE -o cns11643-plane14.c cns11643.txt@*
iconv_mktbl -p 0xE -o cns11643-plane14.cct cns11643.txt
@end itemize

@itemize
@item
koi8-r@*
iconv_mktbl -Co koi8-r.c koi8-r.txt@*
iconv_mktbl -o koi8-r.cct koi8-r.txt
@end itemize

@itemize
@item
koi8-u@*
iconv_mktbl -Co koi8-u.c koi8-u.txt@*
iconv_mktbl -o koi8-u.cct koi8-u.txt
@end itemize

@itemize
@item
us-ascii@*
iconv_mktbl -Cao us-ascii.c iso-8859-1.txt@*
iconv_mktbl -ao us-ascii.cct iso-8859-1.txt
@end itemize

@*
Source files for CCS tables can be taken from at least two places:
@*

@enumerate
@item
http://www.unicode.org/Public/MAPPINGS/ contains many encoding map files.
@item
http://www.dante.net/staff/konstantin/FreeBSD/iconv/ contains the original
iconv sources and encoding map files.
@end enumerate

@*
The following are URLs where source files for some of the CCS tables are
found:

@itemize
@item
big5:@*
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
@end itemize

@itemize
@item
cns11643_plane14, cns11643_plane1 and cns11643_plane2:@*
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT
@end itemize

@itemize
@item
cp775, cp850, cp852, cp855, cp866:@*
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
@end itemize

@itemize
@item
gb_2312_80:@*
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/GB/GB2312.TXT
@end itemize

@itemize
@item
iso_8859_15, iso_8859_1, iso_8859_2, iso_8859_4, iso_8859_5:@*
http://www.unicode.org/Public/MAPPINGS/ISO8859/
@end itemize

@itemize
@item
jis_x0201, jis_x0208_1983, jis_x0212_1990, shift_jis@*
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
@end itemize

@itemize
@item
koi8_r@*
http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT
@end itemize

@itemize
@item
ksx1001@*
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
@end itemize

@itemize
@item
koi8-u can be taken from the original FreeBSD iconv library distribution at
http://www.dante.net/staff/konstantin/FreeBSD/iconv/
@end itemize

@*
Moreover, http://www.dante.net/staff/konstantin/FreeBSD/iconv/ contains
many additional CCS tables that you can use with Newlib (iso-2022 and
RFC1345 encodings).

@page
@node Adding new converter
@section Adding a new iconv converter

@*
The following steps should be taken to add a new iconv converter:
@*

@enumerate
@item
The converter's name and list of aliases should be added to the
iconv/charset.aliases file.

@item
All iconv converters are protected by a _ICONV_CONVERTER_XXX macro, where
XXX is the converter name. This protection macro should be added to the
newlib/newlib.hin file.

@item
The converter's name and aliases should also be registered in the
_iconv_builtin_aliases table in iconv/lib/bialiasesi.c. The list should be
protected by the corresponding macro mentioned above.

@item
If a new converter is just a CCS table, the corresponding .cct and .c files
should be added to the iconv/ccs/ subdirectory. The file names should match
the normalized encoding name. The 'iconv_mktbl' Perl script (found in
iconv/ccs) may be used to generate such files. The file names should be
added to the iconv/ccs/Makefile.am and iconv/ccs/binary/Makefile.am files,
and automake should then be run to regenerate the Makefile.in files.

@item
If a new converter has a CES algorithm, the appropriate file should be added
to the iconv/ces/ subdirectory. The file name should again match the
normalized encoding name.

@item
If a converter is an EUC or ISO-2022-family CES, then the converter is just
an array with a list of the CCS it uses (see ccs/euc-jp.c for an example).
This is because iconv already has EUC and ISO-2022 support. The CCS tables
it uses should be provided in iconv/ccs/.

@item
If a converter isn't an EUC or ISO-2022-based CES, the following functions
should be provided (see utf-8.c for an example):

@itemize @minus
@item
A function to convert from the new CES to UCS-32;

@item
A function to convert from UCS-32 to the new CES;

@item
An 'init' function;

@item
A 'close' function;

@item
A 'reset' function to reset the shift state for a stateful CES.
@end itemize

@*
All these functions are registered in a 'struct iconv_ces_desc' object.
The name of the object should be _iconv_ces_module_XXX, where XXX is the
name of the converter.

@item
For CES converters, the corresponding 'struct iconv_ces_desc' reference
should be added to the iconv/lib/bices.c file.
@*
For CCS converters, the corresponding table reference should be added to
the iconv/lib/biccs.c file.
@end enumerate
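
@*
Once the steps above are done, one possible way to check that a converter
can be reached by name is to run a short string through the standard
@code{iconv_open}, @code{iconv} and @code{iconv_close} interface. The
following is only a sketch: the helper name @code{convert_string} is
hypothetical and is not part of Newlib, and the names passed as
@code{from} and @code{to} are assumed to be known to the iconv
implementation in use.

@verbatim
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert the NUL-terminated string IN from encoding FROM to encoding
   TO, storing at most OUTSZ - 1 bytes plus a terminating NUL in OUT.
   Returns the number of bytes stored (without the NUL), or
   (size_t) -1 on any failure.  */
size_t
convert_string (const char *from, const char *to,
                const char *in, char *out, size_t outsz)
{
  iconv_t cd;
  char *inp = (char *) in;      /* iconv does not modify the input.  */
  char *outp = out;
  size_t inleft = strlen (in);
  size_t outleft;
  size_t rc;

  if (outsz == 0)
    return (size_t) -1;
  outleft = outsz - 1;          /* Keep room for the final NUL.  */

  cd = iconv_open (to, from);   /* Note the order: target first.  */
  if (cd == (iconv_t) -1)
    return (size_t) -1;         /* Unknown converter name or alias.  */

  rc = iconv (cd, &inp, &inleft, &outp, &outleft);
  if (rc != (size_t) -1)
    /* Emit any final shift sequence of a stateful CES.  */
    rc = iconv (cd, NULL, NULL, &outp, &outleft);
  iconv_close (cd);

  if (rc == (size_t) -1)
    return (size_t) -1;
  *outp = '\0';
  return (size_t) (outp - out);
}
@end verbatim

@*
For example, calling @code{convert_string} with a newly added converter
name as @code{from} and a known encoding as @code{to} fails at
@code{iconv_open} if the name or alias was not registered or the converter
was not linked in.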