@node Iconv
@chapter Encoding conversions (@file{iconv.h})

This chapter describes the Newlib iconv library.
The iconv function declarations are in
@file{iconv.h}.

@menu
* iconv::                     Encoding conversion routines
* Introduction::              Introduction to iconv and encodings
* Supported encodings::       The list of currently supported encodings
* iconv design decisions::    General iconv library design issues and decisions
* iconv configuration::       iconv-related configure script options
@end menu

@page
@include iconv/iconv.def

@page
@node Introduction
@section Introduction
@findex encoding
@findex character set
@findex charset
@findex CES
@findex CCS
@*
The iconv library is intended to convert characters from one encoding to
another. It implements the iconv(), iconv_open() and iconv_close()
calls defined by the Single Unix Specification.

@*
In addition to these user-level interfaces, the iconv library also has
several useful internal interfaces which are needed to support the coding
capabilities of the Locale infrastructure. Since Locale also needs to
convert various character sets to and from the wide character set, the
iconv library shares its capabilities with the Locale subsystem.
Moreover, iconv supports several features which are only needed for the
Locale infrastructure (for example, the MB_CUR_MAX value).

@*
The Newlib iconv library was created using ideas from another iconv
library implemented by Konstantin Chuguev (ver 2.0). Thus, the Newlib
iconv library has a double copyright. The Newlib iconv library was
rewritten from scratch by Artem B. Bityuckiy and contains many
improvements over the original iconv library.

@*
Terms like @dfn{encoding} or @dfn{character set} aren't well defined and
are used with various meanings.
The following are the definitions of terms used in this documentation as
well as in the iconv library implementation:

@itemize @bullet
@item
@dfn{encoding} - a machine representation of characters by means of bits;

@item
@dfn{Character Set} or @dfn{Charset} - just a collection of
characters, i.e. an encoding is a machine representation of a character set;

@item
@dfn{CCS} (@dfn{Coded Character Set}) - a mapping from a
character set to a set of integers (@dfn{character codes});

@item
@dfn{CES} (@dfn{Character Encoding Scheme}) - a mapping from a set of
character codes to a sequence of bytes;
@end itemize

@*
Users usually deal with encodings, for example, KOI8-R, Unicode, UTF-8,
ASCII, etc. Encodings are formed by the following chain:
@enumerate
@item
The user has a set of characters specific to their language (the character set).

@item
Each character from this set is uniquely numbered, resulting in a CCS.

@item
Each number from the CCS is converted to a sequence of bits or bytes by
means of a CES, resulting in some encoding. Thus, a CES may be considered
as a function of a CCS which produces some encoding. Note that a CES may
be applied to more than one CCS.
@end enumerate

@*
Thus, an encoding may be considered as one or more CCS + CES.

@*
Sometimes there is no CES, and in such cases the encoding is equivalent
to the CCS, e.g. KOI8-R or ASCII.

@*
An example of a more complicated encoding is UTF-8, which is the UCS
(or Unicode) CCS plus the UTF-8 CES.

@*
The following is a brief list of iconv library features:
@itemize
@item
Generic architecture
@item
Locale infrastructure support
@item
Automatic generation of code which handles the various
CES/CCS/Encoding/Names/Aliases dependencies.
@item
The possibility to choose a size- or speed-optimized
configuration
@item
The possibility to exclude almost all unneeded code from linking.
@end itemize

@page
@node Supported encodings
@section Supported encodings
@findex big5
@findex cp775
@findex cp850
@findex cp852
@findex cp855
@findex cp866
@findex euc_jp
@findex euc_kr
@findex euc_tw
@findex iso_8859_1
@findex iso_8859_10
@findex iso_8859_11
@findex iso_8859_13
@findex iso_8859_14
@findex iso_8859_15
@findex iso_8859_2
@findex iso_8859_3
@findex iso_8859_4
@findex iso_8859_5
@findex iso_8859_6
@findex iso_8859_7
@findex iso_8859_8
@findex iso_8859_9
@findex iso_ir_111
@findex koi8_r
@findex koi8_ru
@findex koi8_u
@findex koi8_uni
@findex ucs_2
@findex ucs_2_internal
@findex ucs_2be
@findex ucs_2le
@findex ucs_4
@findex ucs_4_internal
@findex ucs_4be
@findex ucs_4le
@findex us_ascii
@findex utf_16
@findex utf_16be
@findex utf_16le
@findex utf_8
@findex win_1250
@findex win_1251
@findex win_1252
@findex win_1253
@findex win_1254
@findex win_1255
@findex win_1256
@findex win_1257
@findex win_1258
@*
The following is a list of currently supported encodings. The first column
gives the encoding name, the second the list of its aliases, the third
the names of its CES and CCS components, and the fourth a short description.

@multitable @columnfractions .20 .26 .24 .30
@item Name @tab Aliases @tab CES/CCS @tab Short description
@item @tab @tab @tab
@item big5 @tab csbig5, big_five, bigfive, cn_big5, cp950 @tab table_pcs / big5, us_ascii @tab An encoding for Traditional Chinese.
@item cp775 @tab ibm775, cspc775baltic @tab table / cp775 @tab An updated version of CP 437 that supports Baltic languages.
@item cp850 @tab ibm850, 850, cspc850multilingual @tab table / cp850 @tab IBM 850 - an updated version of CP 437 where several Latin 1 characters have been added in place of some less often used characters such as the line-drawing and Greek ones.
@item cp852 @tab ibm852, 852, cspcp852 @tab @tab IBM 852 - an updated version of CP 437 where several Latin 2 characters have been added in place of some less often used characters such as the line-drawing and Greek ones.
@item cp855 @tab ibm855, 855, csibm855 @tab table / cp855 @tab IBM 855 - an updated version of CP 437 that supports Cyrillic.
@item cp866 @tab 866, IBM866, CSIBM866 @tab table / cp866 @tab IBM 866 - an updated version of CP 855 which follows the more logical Russian alphabet ordering of the alternativny variant that is preferred by many Russian users.
@item euc_jp @tab eucjp @tab euc / jis_x0208_1990, jis_x0201_1976, jis_x0212_1990 @tab EUC-JP - The EUC for Japanese.
@item euc_kr @tab euckr @tab euc / ksx1001 @tab EUC-KR - The EUC for Korean.
@item euc_tw @tab euctw @tab euc / cns11643_plane1, cns11643_plane2, cns11643_plane14 @tab EUC-TW - The EUC for Traditional Chinese.
@item iso_8859_1 @tab iso8859_1, iso88591, iso_8859_1:1987, iso_ir_100, latin1, l1, ibm819, cp819, csisolatin1 @tab table / iso_8859_1 @tab ISO 8859-1:1987 - Latin 1, West European.
@item iso_8859_10 @tab iso_8859_10:1992, iso_ir_157, iso885910, latin6, l6, csisolatin6, iso8859_10 @tab table / iso_8859_10 @tab ISO 8859-10:1992 - Latin 6, Nordic.
@item iso_8859_11 @tab iso8859_11, iso885911 @tab table / iso_8859_11 @tab ISO 8859-11 - Thai.
@item iso_8859_13 @tab iso_8859_13:1998, iso8859_13, iso885913 @tab table / iso_8859_13 @tab ISO 8859-13:1998 - Latin 7, Baltic Rim.
@item iso_8859_14 @tab iso_8859_14:1998, iso885914, iso8859_14 @tab table / iso_8859_14 @tab ISO 8859-14:1998 - Latin 8, Celtic.
@item iso_8859_15 @tab iso885915, iso_8859_15:1998, iso8859_15 @tab table / iso_8859_15 @tab ISO 8859-15:1998 - Latin 9, West Europe, successor of Latin 1.
@item iso_8859_2 @tab iso8859_2, iso88592, iso_8859_2:1987, iso_ir_101, latin2, l2, csisolatin2 @tab table / iso_8859_2 @tab ISO 8859-2:1987 - Latin 2, East European.
@item iso_8859_3 @tab iso_8859_3:1988, iso_ir_109, iso8859_3, latin3, l3, csisolatin3, iso88593 @tab table / iso_8859_3 @tab ISO 8859-3:1988 - Latin 3, South European.
@item iso_8859_4 @tab iso8859_4, iso88594, iso_8859_4:1988, iso_ir_110, latin4, l4, csisolatin4 @tab table / iso_8859_4 @tab ISO 8859-4:1988 - Latin 4, North European.
@item iso_8859_5 @tab iso8859_5, iso88595, iso_8859_5:1988, iso_ir_144, cyrillic, csisolatincyrillic @tab table / iso_8859_5 @tab ISO 8859-5:1988 - Cyrillic.
@item iso_8859_6 @tab iso_8859_6:1987, iso_ir_127, iso8859_6, ecma_114, asmo_708, arabic, csisolatinarabic, iso88596 @tab table / iso_8859_6 @tab ISO 8859-6:1987 - Arabic.
@item iso_8859_7 @tab iso_8859_7:1987, iso_ir_126, iso8859_7, elot_928, ecma_118, greek, greek8, csisolatingreek, iso88597 @tab table / iso_8859_7 @tab ISO 8859-7:1987 - Greek.
@item iso_8859_8 @tab iso_8859_8:1988, iso_ir_138, iso8859_8, hebrew, csisolatinhebrew, iso88598 @tab table / iso_8859_8 @tab ISO 8859-8:1988 - Hebrew.
@item iso_8859_9 @tab iso_8859_9:1989, iso_ir_148, iso8859_9, latin5, l5, csisolatin5, iso88599 @tab table / iso_8859_9 @tab ISO 8859-9:1989 - Latin 5, Turkish.
@item iso_ir_111 @tab ecma_cyrillic, koi8_e, koi8e, csiso111ecmacyrillic @tab table / iso_ir_111 @tab ISO IR 111/ECMA Cyrillic.
@item koi8_r @tab cskoi8r, koi8r, koi8 @tab table / koi8_r @tab RFC 1489 Cyrillic.
@item koi8_ru @tab koi8ru @tab table / koi8_ru @tab Obsolete Ukrainian.
@item koi8_u @tab koi8u @tab table / koi8_u @tab RFC 2319 Ukrainian.
@item koi8_uni @tab koi8uni @tab table / koi8_uni @tab KOI8 Unified.
@item ucs_2 @tab ucs2, iso_10646_ucs_2, iso10646_ucs_2, iso_10646_ucs2, iso10646_ucs2, iso10646ucs2, csUnicode @tab ucs_2 / (UCS) @tab ISO-10646-UCS-2. Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
@item ucs_2_internal @tab ucs2_internal, ucs_2internal, ucs2internal @tab ucs_2_internal / (UCS) @tab ISO-10646-UCS-2 in system byte order. NBSP is always interpreted as NBSP (BOM isn't supported).
@item ucs_2be @tab ucs2be @tab ucs_2 / (UCS) @tab Big Endian version of ISO-10646-UCS-2 (in fact, equivalent to ucs_2).
Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
@item ucs_2le @tab ucs2le @tab ucs_2 / (UCS) @tab Little Endian version of ISO-10646-UCS-2. Little Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
@item ucs_4 @tab ucs4, iso_10646_ucs_4, iso10646_ucs_4, iso_10646_ucs4, iso10646_ucs4, iso10646ucs4 @tab ucs_4 / (UCS) @tab ISO-10646-UCS-4. Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
@item ucs_4_internal @tab ucs4_internal, ucs_4internal, ucs4internal @tab ucs_4_internal / (UCS) @tab ISO-10646-UCS-4 in system byte order. NBSP is always interpreted as NBSP (BOM isn't supported).
@item ucs_4be @tab ucs4be @tab ucs_4 / (UCS) @tab Big Endian version of ISO-10646-UCS-4 (in fact, equivalent to ucs_4). Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
@item ucs_4le @tab ucs4le @tab ucs_4 / (UCS) @tab Little Endian version of ISO-10646-UCS-4. Little Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
@item us_ascii @tab ansi_x3.4_1968, ansi_x3.4_1986, iso_646.irv:1991, ascii, iso646_us, us, ibm367, cp367, csascii @tab us_ascii / (ASCII) @tab 7-bit ASCII.
@item utf_16 @tab utf16 @tab utf_16 / (UCS) @tab RFC 2781 UTF-16. The very first NBSP code in stream is interpreted as BOM.
@item utf_16be @tab utf16be @tab utf_16 / (UCS) @tab Big Endian version of RFC 2781 UTF-16. NBSP is always interpreted as NBSP (BOM isn't supported).
@item utf_16le @tab utf16le @tab utf_16 / (UCS) @tab Little Endian version of RFC 2781 UTF-16. NBSP is always interpreted as NBSP (BOM isn't supported).
@item utf_8 @tab utf8 @tab utf_8 / (UCS) @tab RFC 3629 UTF-8.
@item win_1250 @tab cp1250 @tab @tab Win-1250 Croatian.
@item win_1251 @tab cp1251 @tab table / win_1251 @tab Win-1251 - Cyrillic.
@item win_1252 @tab cp1252 @tab table / win_1252 @tab Win-1252 - Latin 1.
@item win_1253 @tab cp1253 @tab table / win_1253 @tab Win-1253 - Greek.
@item win_1254 @tab cp1254 @tab table / win_1254 @tab Win-1254 - Turkish.
@item win_1255 @tab cp1255 @tab table / win_1255 @tab Win-1255 - Hebrew.
@item win_1256 @tab cp1256 @tab table / win_1256 @tab Win-1256 - Arabic.
@item win_1257 @tab cp1257 @tab table / win_1257 @tab Win-1257 - Baltic.
@item win_1258 @tab cp1258 @tab table / win_1258 @tab Win-1258 - Vietnamese.
@end multitable

@page
@node iconv design decisions
@section iconv design decisions
@findex CCS table
@findex CES converter
@*
The first iconv library design issue arises when considering the
following two design approaches:
@enumerate
@item
Have modules which implement conversion from encoding A to encoding B
and vice versa, i.e., one conversion module relates to any two encodings.
@item
Have modules which implement conversion from encoding A to the fixed
encoding C and vice versa, i.e., one conversion module relates to any
one encoding A and one fixed encoding C. In this case, to convert from
encoding A to encoding B, two modules are needed (in order to convert
from A to C and then from C to B).
@end enumerate

@*
There is an obvious tradeoff between generality/flexibility and
efficiency: the first method is more efficient since it converts
directly; on the other hand, it is less flexible since a distinct
module is needed for each encoding pair.

@*
The Newlib iconv uses the second method and always converts through the
32-bit UCS. However, its design also allows specialized conversion
modules to be written if conversion speed is critical.

@*
The second design issue is how to decompose encodings. The Newlib iconv
library uses the fact that any encoding may be considered as one or more
CCS plus a CES. Accordingly, it decomposes its conversion modules into a
@dfn{CES converter} plus one or more @dfn{CCS tables}. CCS tables map
CCS codes to UCS and vice versa; CES converters map CCS codes to the
encoding and vice versa.

@*
As an example, consider the conversion from the big5 encoding to the
EUC-TW encoding. The big5 encoding may be decomposed into the ASCII and
BIG5 CCSes plus the BIG5 CES.
EUC-TW may be decomposed into the CNS11643_PLANE1, CNS11643_PLANE2, and
CNS11643_PLANE14 CCSes plus the EUC CES.

@*
The euc_tw -> big5 conversion happens as follows:
@enumerate
@item
The EUC converter transforms the EUC-TW encoding to the corresponding
CCS codes (CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14 CCSes);
@item
The obtained CCS codes are transformed to UCS codes using the
CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14 CCS tables;
@item
The resulting UCS codes are transformed to ASCII and BIG5 codes using
the corresponding CCS tables;
@item
The obtained CCS codes are transformed to the big5 encoding using the
corresponding CES converter.
@end enumerate

@*
Analogously, the backward conversion is performed as follows:
@enumerate
@item
The BIG5 converter transforms the big5 encoding to the corresponding
CCS codes (ASCII and BIG5 CCSes);
@item
The obtained CCS codes are transformed to UCS codes using the ASCII and
BIG5 CCS tables;
@item
The resulting UCS codes are transformed to CNS11643_PLANE1,
CNS11643_PLANE2 and CNS11643_PLANE14 codes using the corresponding CCS
tables;
@item
The obtained CCS codes are transformed to the EUC-TW encoding using the
corresponding CES converter.
@end enumerate

@*
Note that the above is just an example; the real names (as implemented
in Newlib iconv) of CES converters and CCS tables are slightly different.

@*
The third design issue also relates to flexibility. Obviously, it is
undesirable to always link all CES converters and CCS tables into the
library; instead, it is desirable to be able to load the needed
converters and tables dynamically on demand. This isn't a problem on
"big" machines such as PCs, but may be very problematic on "small"
embedded systems.

@*
Since the CCS tables are just data, it is possible to load them
dynamically from external files. CES converters, by contrast, are
algorithms and contain code, so loading them on demand requires dynamic
library loading support.
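
@*
The conversion-through-UCS design described in this section can be
illustrated with a small, self-contained C sketch. It is not Newlib
code: all @code{toy_} names are hypothetical, the CCS table covers only
two KOI8-R codes (KOI8-R has no separate CES, so each input byte is
already a CCS code), and the target side is a plain UTF-8 CES encoder
limited to code points up to U+FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy CCS table: a few KOI8-R codes and their UCS code points. */
struct toy_ccs_entry { uint8_t code; uint32_t ucs; };

static const struct toy_ccs_entry toy_koi8r_table[] = {
    { 0xC1, 0x0430 },  /* CYRILLIC SMALL LETTER A  */
    { 0xC2, 0x0431 },  /* CYRILLIC SMALL LETTER BE */
};

/* Stage 1: CCS code -> UCS. Returns 0 on success, -1 if the code is unmapped. */
static int toy_koi8r_to_ucs(uint8_t code, uint32_t *ucs)
{
    size_t i;
    if (code < 0x80) {            /* the lower half of KOI8-R is ASCII */
        *ucs = code;
        return 0;
    }
    for (i = 0; i < sizeof toy_koi8r_table / sizeof toy_koi8r_table[0]; i++) {
        if (toy_koi8r_table[i].code == code) {
            *ucs = toy_koi8r_table[i].ucs;
            return 0;
        }
    }
    return -1;
}

/* Stage 2: UCS -> UTF-8 CES. Returns the number of bytes written,
   or 0 if the code point is unsupported or the buffer is too small. */
static size_t toy_ucs_to_utf8(uint32_t ucs, unsigned char *out, size_t cap)
{
    if (ucs >= 0xD800 && ucs <= 0xDFFF)
        return 0;                 /* surrogates are not valid characters */
    if (ucs < 0x80) {
        if (cap < 1) return 0;
        out[0] = (unsigned char)ucs;
        return 1;
    }
    if (ucs < 0x800) {
        if (cap < 2) return 0;
        out[0] = (unsigned char)(0xC0 | (ucs >> 6));
        out[1] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 2;
    }
    if (ucs < 0x10000) {
        if (cap < 3) return 0;
        out[0] = (unsigned char)(0xE0 | (ucs >> 12));
        out[1] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 3;
    }
    return 0;
}

/* Whole pipeline: KOI8-R bytes -> UCS -> UTF-8 bytes.
   Returns the number of output bytes, or (size_t)-1 on any error. */
size_t toy_koi8r_to_utf8(const unsigned char *in, size_t inlen,
                         unsigned char *out, size_t cap)
{
    size_t i, used = 0;
    assert(in != NULL && out != NULL);
    for (i = 0; i < inlen; i++) {
        uint32_t ucs;
        size_t n;
        if (toy_koi8r_to_ucs(in[i], &ucs) != 0)
            return (size_t)-1;
        n = toy_ucs_to_utf8(ucs, out + used, cap - used);
        if (n == 0)
            return (size_t)-1;
        used += n;
    }
    return used;
}
```

Neither stage knows anything about the other encoding; they meet only at
the UCS code point. That is why, in this design, supporting a new
encoding needs one table or converter rather than one module per
encoding pair.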

@*
Apart from possible restrictions imposed by embedded systems (too little
RAM, for example), Newlib itself has no dynamic library support and,
therefore, all CES converters which will ever be used must be linked
into the library. However, dynamic loading of CCS tables is possible; it
is implemented in the Newlib iconv library and may be enabled via Newlib
configure script options.

@*
The next design decision is the possibility of fine-grained iconv
library configuration. This means that iconv doesn't always link all its
converters and tables (if dynamic loading is not enabled); instead, it
makes it possible to enable only those encodings which are planned to be
used (see the section about the configure script options).

@*
Moreover, the Newlib iconv library configure options distinguish between
coding directions. This means that not only the supported encodings are
selectable, but the coding direction too. For example, if the user wants
a configuration which allows conversions from UTF-8 to UTF-16 and
doesn't plan to use UTF-16 to UTF-8 conversions, they can enable only
this conversion direction (i.e., no "UTF-16 -> UTF-8"-related code will
be included), thus saving some memory (note that this technique allows
one half of a CCS table, which may be fairly big, to be excluded from
linking).

@*
One more design decision is speed- and size-optimized tables. The user
can select between them using a configure script option. The
speed-optimized CCS tables are the same as the size-optimized ones in
the case of 8-bit CCS (e.g., KOI8-R), but for 16-bit CCS the
size-optimized tables may be 1.5 to 2 times smaller than the
speed-optimized ones. On the other hand, conversion with the
speed-optimized tables is several times faster.

@*
It's worth stressing that support for new encodings can't be dynamically
added to an already compiled Newlib library.
This holds even if the new encoding needs only an additional CCS table
and iconv is configured to use external files with CCS tables (this
isn't a fundamental restriction, and the possibility of adding support
for new table-based encodings dynamically, by copying a new .cct file,
could easily be added).

@*
Theoretically, the compiled-in CCS tables may be more appropriate for
embedded solutions since they are read-only and are placed in ROM,
whereas dynamic loading needs more RAM. Moreover, in the current
implementation, a distinct copy of the CCS file is loaded for each
opened iconv descriptor, even in the case of the same encoding. This
means, for example, that if two iconv descriptors for
KOI8-R -> UCS-4BE and KOI8-R -> UTF-16BE are opened, two copies of the
koi8-r .cct file will be loaded (actually, iconv loads only the needed
part of these files).

@page
@node iconv configuration
@section iconv configuration
@findex iconv configuration
@*
To enable encoding support, the @option{--enable-newlib-iconv-encodings}
configure script option should be used. This option accepts a
comma-separated list of encodings that should be enabled. The option
enables each encoding in both directions ("to" and "from").

@*
The @option{--enable-newlib-iconv-from-encodings} configure script
option enables "from" support for each encoding that is passed to it.

@*
The @option{--enable-newlib-iconv-to-encodings} configure script option
enables "to" support for each encoding that is passed to it.

@*
Example: if the user plans only KOI8-R -> UTF-8, UTF-8 -> ISO-8859-5 and
KOI8-R -> UCS-2 conversions, the optimal way (minimal iconv code and
data will be linked) is to configure Newlib with the following options:
@*
@code{--enable-newlib-iconv-encodings=UTF-8
--enable-newlib-iconv-from-encodings=KOI8-R
--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5}

@*
The @option{--enable-newlib-iconv-external-ccs} option enables iconv's
ability to work with external CCS files.

@*
Note: CCS files are searched for by iconv_open in the
@file{$NLSPATH/iconv_data/} directory.
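
@*
As an illustration of the user-level interface, the following sketch
converts one buffer with iconv_open(), iconv() and iconv_close(). The
helper name @code{convert_buffer} is hypothetical, the POSIX prototype
of iconv() with a non-const input pointer is assumed, and error handling
is reduced to a single failure value. Both encodings passed to the
helper are assumed to have been enabled at configure time in the
required direction.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts inlen bytes of `in` from encoding `from`
   to encoding `to`, writing at most outcap bytes to `out`.
   Returns the number of bytes written, or (size_t)-1 on any error. */
size_t convert_buffer(const char *to, const char *from,
                      const char *in, size_t inlen,
                      char *out, size_t outcap)
{
    iconv_t cd;
    char *inp = (char *)in;       /* iconv() takes a non-const char ** */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;
    size_t rc;

    assert(to != NULL && from != NULL && in != NULL && out != NULL);

    cd = iconv_open(to, from);    /* argument order: target, then source */
    if (cd == (iconv_t)-1)
        return (size_t)-1;        /* this encoding pair is not available */

    /* Convert the input; iconv() advances the pointers and counters. */
    rc = iconv(cd, &inp, &inleft, &outp, &outleft);

    /* Flush any pending shift state of the target encoding. */
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);

    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return outcap - outleft;
}
```

With a configuration such as the one in the example above, a call like
@code{convert_buffer("UTF-8", "KOI8-R", "\xC1\xC2", 2, out, sizeof out)}
would be expected to write the four bytes D0 B0 D0 B1, the UTF-8 form of
U+0430 and U+0431. If the "from" direction of KOI8-R or the "to"
direction of UTF-8 was not enabled, iconv_open() fails and the helper
returns (size_t)-1.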