@node Iconv
|
|
@chapter Encoding conversions (@file{iconv.h})
|
|
|
|
This chapter describes the Newlib iconv library.
The iconv function declarations are in
@file{iconv.h}.
|
|
|
|
@menu
|
|
* iconv:: Encoding conversion routines
|
|
* Introduction:: Introduction to iconv and encodings
|
|
* Supported encodings:: The list of currently supported encodings
|
|
* iconv design decisions:: General iconv library design issues and decisions
|
|
* iconv configuration:: iconv-related configure script options
|
|
@end menu
|
|
|
|
@page
|
|
@include iconv/iconv.def
|
|
|
|
@page
|
|
@node Introduction
|
|
@section Introduction
|
|
@findex encoding
|
|
@findex character set
|
|
@findex charset
|
|
@findex CES
|
|
@findex CCS
|
|
@*
|
|
The iconv library converts characters from one encoding to
another. It implements the @code{iconv()}, @code{iconv_open()} and
@code{iconv_close()} calls defined by the Single Unix Specification.
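As a rough illustration of how the three calls fit together, the following sketch converts a NUL-terminated string between two encodings. The helper name convert_string and its buffer-handling policy are assumptions made for this example; only iconv_open, iconv and iconv_close come from the library.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of iconv): convert the NUL-terminated
   string `in` from encoding `from` to encoding `to`.  At most
   `outsize - 1` bytes plus a terminating NUL are stored in `out`.
   Returns the number of bytes written, or (size_t)-1 on failure. */
size_t convert_string(const char *to, const char *from,
                      const char *in, char *out, size_t outsize)
{
    assert(to && from && in && out && outsize > 0);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;            /* conversion not supported */

    char *inp = (char *)in;           /* iconv() takes char **, not const */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsize - 1;     /* keep room for the NUL */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        /* Flush any pending shift sequence of a stateful encoding. */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);

    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;            /* E2BIG, EILSEQ or EINVAL */

    *outp = '\0';
    return (size_t)(outp - out);
}
```

A caller would pass encoding names that were enabled when the library was configured, for example converting from ISO-8859-1 to UTF-8. The sketch treats every error as fatal; real code may want to inspect errno and retry with a larger buffer on E2BIG.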
|
|
|
|
@*
|
|
In addition to these user-level interfaces, the iconv library has
several internal interfaces that support the encoding capabilities of
the Locale infrastructure. Since Locale also needs to convert various
character sets to and from the wide character set, the iconv library
shares its capabilities with the Locale subsystem. Moreover, iconv
supports several features that are needed only by the Locale
infrastructure (for example, the MB_CUR_MAX value).
|
|
|
|
@*
|
|
The Newlib iconv library is based on ideas from another iconv library,
implemented by Konstantin Chuguev (version 2.0). For this reason, the
Newlib iconv library carries a dual copyright. It was rewritten from
scratch by Artem B. Bityuckiy and contains many improvements over the
original iconv library.
|
|
|
|
@*
|
|
Terms like @dfn{encoding} or @dfn{character set} aren't well defined and
are used with various meanings. The following definitions are used in
this documentation as well as in the iconv library implementation:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
@dfn{encoding} - a machine representation of characters by means of bits;
|
|
|
|
@item
|
|
@dfn{Character Set} or @dfn{Charset} - just a collection of
characters; an encoding is a machine representation of a character set;
|
|
|
|
@item
|
|
@dfn{CCS} (@dfn{Coded Character Set}) - a mapping from a character set to a
set of integers called @dfn{character codes};
|
|
|
|
@item
|
|
@dfn{CES} (@dfn{Character Encoding Scheme}) - a mapping from a set of character
|
|
codes to a sequence of bytes;
|
|
@end itemize
|
|
|
|
@*
|
|
Users usually deal with encodings, for example, KOI8-R, Unicode, UTF-8,
|
|
ASCII, etc. Encodings are formed by the following chain:
|
|
|
|
@enumerate
|
|
@item
|
|
The user has a set of characters specific to a language (the character set).

@item
Each character in this set is uniquely numbered, resulting in a CCS.

@item
Each number from the CCS is converted to a sequence of bits or bytes by means
of a CES, resulting in some encoding. Thus, a CES may be considered as a
function of a CCS which produces some encoding. Note that a CES may be
applied to more than one CCS.
|
|
@end enumerate
|
|
|
|
@*
|
|
Thus, an encoding may be considered as one or more CCS + CES.
|
|
|
|
@*
|
|
Sometimes there is no CES, and in such cases the encoding is equivalent to the CCS,
e.g., KOI8-R or ASCII.
|
|
|
|
@*
|
|
An example of a more complicated encoding is UTF-8, which is the UCS
(or Unicode) CCS plus the UTF-8 CES.
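To make the CCS/CES split concrete, the sketch below implements only the CES half of UTF-8: it takes a UCS character code (the output of the CCS step) and serialises it as bytes according to RFC 3629. The function name ucs_to_utf8 is an assumption made for this example and is not the name used inside Newlib.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CES step only: serialise one UCS character code as UTF-8 (RFC 3629).
   Returns the number of bytes stored in out[0..3], or 0 if `code` is
   not a valid Unicode scalar value.  Hypothetical name. */
size_t ucs_to_utf8(uint32_t code, unsigned char out[4])
{
    assert(out != NULL);

    if (code < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)code;
        return 1;
    }
    if (code < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (code >> 6));
        out[1] = (unsigned char)(0x80 | (code & 0x3F));
        return 2;
    }
    if (code >= 0xD800 && code <= 0xDFFF)
        return 0;                            /* surrogates are not characters */
    if (code < 0x10000) {                    /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (code >> 12));
        out[1] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (code & 0x3F));
        return 3;
    }
    if (code <= 0x10FFFF) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (code >> 18));
        out[1] = (unsigned char)(0x80 | ((code >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (code & 0x3F));
        return 4;
    }
    return 0;                                /* beyond the Unicode range */
}
```

For instance, the Cyrillic letter with UCS code 0x0416 is serialised as the two bytes 0xD0 0x96. The same character in KOI8-R needs no such step, because there the CCS code is already the byte value.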
|
|
|
|
@*
|
|
The following is a brief list of iconv library features:
|
|
@itemize
|
|
@item
|
|
Generic architecture
|
|
@item
|
|
Locale infrastructure support
|
|
@item
|
|
Automatic generation of code which handles various CES/CCS/Encoding/Names/Aliases
|
|
dependencies.
|
|
@item
|
|
The possibility to choose a size- or speed-optimized configuration
|
|
@item
|
|
The possibility to exclude almost all unneeded code from linking.
|
|
@end itemize
|
|
|
|
|
|
|
|
|
|
@page
|
|
@node Supported encodings
|
|
@section Supported encodings
|
|
@findex big5
|
|
@findex cp775
|
|
@findex cp850
|
|
@findex cp852
|
|
@findex cp855
|
|
@findex cp866
|
|
@findex euc_jp
|
|
@findex euc_kr
|
|
@findex euc_tw
|
|
@findex iso_8859_1
|
|
@findex iso_8859_10
|
|
@findex iso_8859_11
|
|
@findex iso_8859_13
|
|
@findex iso_8859_14
|
|
@findex iso_8859_15
|
|
@findex iso_8859_2
|
|
@findex iso_8859_3
|
|
@findex iso_8859_4
|
|
@findex iso_8859_5
|
|
@findex iso_8859_6
|
|
@findex iso_8859_7
|
|
@findex iso_8859_8
|
|
@findex iso_8859_9
|
|
@findex iso_ir_111
|
|
@findex koi8_r
|
|
@findex koi8_ru
|
|
@findex koi8_u
|
|
@findex koi8_uni
|
|
@findex ucs_2
|
|
@findex ucs_2_internal
|
|
@findex ucs_2be
|
|
@findex ucs_2le
|
|
@findex ucs_4
|
|
@findex ucs_4_internal
|
|
@findex ucs_4be
|
|
@findex ucs_4le
|
|
@findex us_ascii
|
|
@findex utf_16
|
|
@findex utf_16be
|
|
@findex utf_16le
|
|
@findex utf_8
|
|
@findex win_1250
|
|
@findex win_1251
|
|
@findex win_1252
|
|
@findex win_1253
|
|
@findex win_1254
|
|
@findex win_1255
|
|
@findex win_1256
|
|
@findex win_1257
|
|
@findex win_1258
|
|
@*
|
|
The following is a list of currently supported encodings. The first column
gives the encoding name, the second its aliases, the third the names of
its CES and CCS components, and the fourth a short description.
|
|
|
|
@multitable @columnfractions .20 .26 .24 .30
|
|
@item
|
|
Name
|
|
@tab
|
|
Aliases
|
|
@tab
|
|
CES/CCS
|
|
@tab
|
|
Short description
|
|
@item
|
|
@tab
|
|
@tab
|
|
@tab
|
|
|
|
|
|
@item
|
|
big5
|
|
@tab
|
|
csbig5, big_five, bigfive, cn_big5, cp950
|
|
@tab
|
|
table_pcs / big5, us_ascii
|
|
@tab
|
|
An encoding for Traditional Chinese.
|
|
|
|
|
|
@item
|
|
cp775
|
|
@tab
|
|
ibm775, cspc775baltic
|
|
@tab
|
|
table / cp775
|
|
@tab
|
|
An updated version of CP 437 that supports Baltic languages.
|
|
|
|
|
|
@item
|
|
cp850
|
|
@tab
|
|
ibm850, 850, cspc850multilingual
|
|
@tab
|
|
table / cp850
|
|
@tab
|
|
IBM 850 - an updated version of CP 437 where several Latin 1 characters have been
added instead of some less-often used characters like line-drawing and Greek ones.
|
|
|
|
|
|
@item
|
|
cp852
|
|
@tab
|
|
ibm852, 852, cspcp852
|
|
@tab
table / cp852
@tab
IBM 852 - an updated version of CP 437 where several Latin 2 characters have been added
instead of some less-often used characters like line-drawing and Greek ones.
|
|
|
|
|
|
@item
|
|
cp855
|
|
@tab
|
|
ibm855, 855, csibm855
|
|
@tab
|
|
table / cp855
|
|
@tab
|
|
IBM 855 - an updated version of CP 437 that supports Cyrillic.
|
|
|
|
|
|
@item
|
|
cp866
|
|
@tab
|
|
866, IBM866, CSIBM866
|
|
@tab
|
|
table / cp866
|
|
@tab
|
|
IBM 866 - an updated version of CP 855 which follows the more logical Russian alphabet
ordering of the alternativny variant that is preferred by many Russian users.
|
|
|
|
|
|
@item
|
|
euc_jp
|
|
@tab
|
|
eucjp
|
|
@tab
|
|
euc / jis_x0208_1990, jis_x0201_1976, jis_x0212_1990
|
|
@tab
|
|
EUC-JP - The EUC for Japanese.
|
|
|
|
|
|
@item
|
|
euc_kr
|
|
@tab
|
|
euckr
|
|
@tab
|
|
euc / ksx1001
|
|
@tab
|
|
EUC-KR - The EUC for Korean.
|
|
|
|
|
|
@item
|
|
euc_tw
|
|
@tab
|
|
euctw
|
|
@tab
|
|
euc / cns11643_plane1, cns11643_plane2, cns11643_plane14
|
|
@tab
|
|
EUC-TW - The EUC for Traditional Chinese.
|
|
|
|
|
|
@item
|
|
iso_8859_1
|
|
@tab
|
|
iso8859_1, iso88591, iso_8859_1:1987, iso_ir_100, latin1, l1, ibm819, cp819, csisolatin1
|
|
@tab
|
|
table / iso_8859_1
|
|
@tab
|
|
ISO 8859-1:1987 - Latin 1, West European.
|
|
|
|
|
|
@item
|
|
iso_8859_10
|
|
@tab
|
|
iso_8859_10:1992, iso_ir_157, iso885910, latin6, l6, csisolatin6, iso8859_10
|
|
@tab
|
|
table / iso_8859_10
|
|
@tab
|
|
ISO 8859-10:1992 - Latin 6, Nordic.
|
|
|
|
|
|
@item
|
|
iso_8859_11
|
|
@tab
|
|
iso8859_11, iso885911
|
|
@tab
|
|
table / iso_8859_11
|
|
@tab
|
|
ISO 8859-11 - Thai.
|
|
|
|
|
|
@item
|
|
iso_8859_13
|
|
@tab
|
|
iso_8859_13:1998, iso8859_13, iso885913
|
|
@tab
|
|
table / iso_8859_13
|
|
@tab
|
|
ISO 8859-13:1998 - Latin 7, Baltic Rim.
|
|
|
|
|
|
@item
|
|
iso_8859_14
|
|
@tab
|
|
iso_8859_14:1998, iso885914, iso8859_14
|
|
@tab
|
|
table / iso_8859_14
|
|
@tab
|
|
ISO 8859-14:1998 - Latin 8, Celtic.
|
|
|
|
|
|
@item
|
|
iso_8859_15
|
|
@tab
|
|
iso885915, iso_8859_15:1998, iso8859_15
|
|
@tab
|
|
table / iso_8859_15
|
|
@tab
|
|
ISO 8859-15:1998 - Latin 9, West Europe, successor of Latin 1.
|
|
|
|
|
|
@item
|
|
iso_8859_2
|
|
@tab
|
|
iso8859_2, iso88592, iso_8859_2:1987, iso_ir_101, latin2, l2, csisolatin2
|
|
@tab
|
|
table / iso_8859_2
|
|
@tab
|
|
ISO 8859-2:1987 - Latin 2, East European.
|
|
|
|
|
|
@item
|
|
iso_8859_3
|
|
@tab
|
|
iso_8859_3:1988, iso_ir_109, iso8859_3, latin3, l3, csisolatin3, iso88593
|
|
@tab
|
|
table / iso_8859_3
|
|
@tab
|
|
ISO 8859-3:1988 - Latin 3, South European.
|
|
|
|
|
|
@item
|
|
iso_8859_4
|
|
@tab
|
|
iso8859_4, iso88594, iso_8859_4:1988, iso_ir_110, latin4, l4, csisolatin4
|
|
@tab
|
|
table / iso_8859_4
|
|
@tab
|
|
ISO 8859-4:1988 - Latin 4, North European.
|
|
|
|
|
|
@item
|
|
iso_8859_5
|
|
@tab
|
|
iso8859_5, iso88595, iso_8859_5:1988, iso_ir_144, cyrillic, csisolatincyrillic
|
|
@tab
|
|
table / iso_8859_5
|
|
@tab
|
|
ISO 8859-5:1988 - Cyrillic.
|
|
|
|
|
|
@item
|
|
iso_8859_6
|
|
@tab
|
|
iso_8859_6:1987, iso_ir_127, iso8859_6, ecma_114, asmo_708, arabic, csisolatinarabic, iso88596
|
|
@tab
|
|
table / iso_8859_6
|
|
@tab
|
|
ISO 8859-6:1987 - Arabic.
|
|
|
|
|
|
@item
|
|
iso_8859_7
|
|
@tab
|
|
iso_8859_7:1987, iso_ir_126, iso8859_7, elot_928, ecma_118, greek, greek8, csisolatingreek, iso88597
|
|
@tab
|
|
table / iso_8859_7
|
|
@tab
|
|
ISO 8859-7:1987 - Greek.
|
|
|
|
|
|
@item
|
|
iso_8859_8
|
|
@tab
|
|
iso_8859_8:1988, iso_ir_138, iso8859_8, hebrew, csisolatinhebrew, iso88598
|
|
@tab
|
|
table / iso_8859_8
|
|
@tab
|
|
ISO 8859-8:1988 - Hebrew.
|
|
|
|
|
|
@item
|
|
iso_8859_9
|
|
@tab
|
|
iso_8859_9:1989, iso_ir_148, iso8859_9, latin5, l5, csisolatin5, iso88599
|
|
@tab
|
|
table / iso_8859_9
|
|
@tab
|
|
ISO 8859-9:1989 - Latin 5, Turkish.
|
|
|
|
|
|
@item
|
|
iso_ir_111
|
|
@tab
|
|
ecma_cyrillic, koi8_e, koi8e, csiso111ecmacyrillic
|
|
@tab
|
|
table / iso_ir_111
|
|
@tab
|
|
ISO IR 111/ECMA Cyrillic.
|
|
|
|
|
|
@item
|
|
koi8_r
|
|
@tab
|
|
cskoi8r, koi8r, koi8
|
|
@tab
|
|
table / koi8_r
|
|
@tab
|
|
RFC 1489 Cyrillic.
|
|
|
|
|
|
@item
|
|
koi8_ru
|
|
@tab
|
|
koi8ru
|
|
@tab
|
|
table / koi8_ru
|
|
@tab
|
|
Obsolete Ukrainian.
|
|
|
|
|
|
@item
|
|
koi8_u
|
|
@tab
|
|
koi8u
|
|
@tab
|
|
table / koi8_u
|
|
@tab
|
|
RFC 2319 Ukrainian.
|
|
|
|
|
|
@item
|
|
koi8_uni
|
|
@tab
|
|
koi8uni
|
|
@tab
|
|
table / koi8_uni
|
|
@tab
|
|
KOI8 Unified.
|
|
|
|
|
|
@item
|
|
ucs_2
|
|
@tab
|
|
ucs2, iso_10646_ucs_2, iso10646_ucs_2, iso_10646_ucs2, iso10646_ucs2, iso10646ucs2, csUnicode
|
|
@tab
|
|
ucs_2 / (UCS)
|
|
@tab
|
|
ISO-10646-UCS-2, Big Endian. U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
ucs_2_internal
|
|
@tab
|
|
ucs2_internal, ucs_2internal, ucs2internal
|
|
@tab
|
|
ucs_2_internal / (UCS)
|
|
@tab
|
|
ISO-10646-UCS-2 in system byte order.
|
|
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
ucs_2be
|
|
@tab
|
|
ucs2be
|
|
@tab
|
|
ucs_2 / (UCS)
|
|
@tab
|
|
Big Endian version of ISO-10646-UCS-2 (in fact, equivalent to ucs_2).
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
ucs_2le
|
|
@tab
|
|
ucs2le
|
|
@tab
|
|
ucs_2 / (UCS)
|
|
@tab
|
|
Little Endian version of ISO-10646-UCS-2.
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
ucs_4
|
|
@tab
|
|
ucs4, iso_10646_ucs_4, iso10646_ucs_4, iso_10646_ucs4, iso10646_ucs4, iso10646ucs4
|
|
@tab
|
|
ucs_4 / (UCS)
|
|
@tab
|
|
ISO-10646-UCS-4, Big Endian. U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
ucs_4_internal
|
|
@tab
|
|
ucs4_internal, ucs_4internal, ucs4internal
|
|
@tab
|
|
ucs_4_internal / (UCS)
|
|
@tab
|
|
ISO-10646-UCS-4 in system byte order.
|
|
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
ucs_4be
|
|
@tab
|
|
ucs4be
|
|
@tab
|
|
ucs_4 / (UCS)
|
|
@tab
|
|
Big Endian version of ISO-10646-UCS-4 (in fact, equivalent to ucs_4).
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
ucs_4le
|
|
@tab
|
|
ucs4le
|
|
@tab
|
|
ucs_4 / (UCS)
|
|
@tab
|
|
Little Endian version of ISO-10646-UCS-4.
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
us_ascii
|
|
@tab
|
|
ansi_x3.4_1968, ansi_x3.4_1986, iso_646.irv:1991, ascii, iso646_us, us, ibm367, cp367, csascii
|
|
@tab
|
|
us_ascii / (ASCII)
|
|
@tab
|
|
7-bit ASCII.
|
|
|
|
|
|
@item
|
|
utf_16
|
|
@tab
|
|
utf16
|
|
@tab
|
|
utf_16 / (UCS)
|
|
@tab
|
|
RFC 2781 UTF-16. A U+FEFF code at the very start of the stream is interpreted as a BOM.
|
|
|
|
|
|
@item
|
|
utf_16be
|
|
@tab
|
|
utf16be
|
|
@tab
|
|
utf_16 / (UCS)
|
|
@tab
|
|
Big Endian version of RFC 2781 UTF-16.
|
|
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
utf_16le
|
|
@tab
|
|
utf16le
|
|
@tab
|
|
utf_16 / (UCS)
|
|
@tab
|
|
Little Endian version of RFC 2781 UTF-16.
|
|
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
utf_8
|
|
@tab
|
|
utf8
|
|
@tab
|
|
utf_8 / (UCS)
|
|
@tab
|
|
RFC 3629 UTF-8.
|
|
|
|
|
|
@item
|
|
win_1250
|
|
@tab
|
|
cp1250
|
|
@tab
table / win_1250
@tab
Win-1250 - Central European, including Croatian.
|
|
|
|
|
|
@item
|
|
win_1251
|
|
@tab
|
|
cp1251
|
|
@tab
|
|
table / win_1251
|
|
@tab
|
|
Win-1251 - Cyrillic.
|
|
|
|
|
|
@item
|
|
win_1252
|
|
@tab
|
|
cp1252
|
|
@tab
|
|
table / win_1252
|
|
@tab
|
|
Win-1252 - Latin 1.
|
|
|
|
|
|
@item
|
|
win_1253
|
|
@tab
|
|
cp1253
|
|
@tab
|
|
table / win_1253
|
|
@tab
|
|
Win-1253 - Greek.
|
|
|
|
|
|
@item
|
|
win_1254
|
|
@tab
|
|
cp1254
|
|
@tab
|
|
table / win_1254
|
|
@tab
|
|
Win-1254 - Turkish.
|
|
|
|
|
|
@item
|
|
win_1255
|
|
@tab
|
|
cp1255
|
|
@tab
|
|
table / win_1255
|
|
@tab
|
|
Win-1255 - Hebrew.
|
|
|
|
|
|
@item
|
|
win_1256
|
|
@tab
|
|
cp1256
|
|
@tab
|
|
table / win_1256
|
|
@tab
|
|
Win-1256 - Arabic.
|
|
|
|
|
|
@item
|
|
win_1257
|
|
@tab
|
|
cp1257
|
|
@tab
|
|
table / win_1257
|
|
@tab
|
|
Win-1257 - Baltic.
|
|
|
|
|
|
@item
|
|
win_1258
|
|
@tab
|
|
cp1258
|
|
@tab
|
|
table / win_1258
|
|
@tab
|
|
Win-1258 - Vietnamese.
|
|
@end multitable
|
|
|
|
|
|
|
|
|
|
@page
|
|
@node iconv design decisions
|
|
@section iconv design decisions
|
|
@findex CCS table
|
|
@findex CES converter
|
|
@*
|
|
The first iconv library design issue arises when considering the
|
|
following two design approaches:
|
|
|
|
@enumerate
|
|
@item
|
|
Have modules which implement conversion from encoding A to encoding B
|
|
and vice versa, i.e., one conversion module relates to any two
|
|
encodings.
|
|
@item
|
|
Have modules which implement conversion from encoding A to a fixed
encoding C and vice versa, i.e., one conversion module relates to
one encoding A and the fixed encoding C. In this case, converting from
encoding A to encoding B requires two modules: one to convert
from A to C and another to convert from C to B.
|
|
@end enumerate
|
|
|
|
@*
|
|
There is a tradeoff between generality/flexibility and
efficiency: the first method is more efficient since it converts
directly. On the other hand, it is less flexible, since a distinct
module is needed for each encoding pair.
|
|
|
|
@*
|
|
The Newlib iconv uses the second method and always converts through 32-bit
UCS. Its design also allows specialized conversion modules to be written
if the conversion speed is critical.
|
|
|
|
@*
|
|
The second design issue is how to decompose encodings.
The Newlib iconv library uses the fact that any encoding may be
considered as one or more CCS plus a CES. Accordingly, it decomposes its
conversion modules into a @dfn{CES converter} plus one or more @dfn{CCS
tables}. CCS tables map a CCS to UCS and vice versa; CES converters
map a CCS to an encoding and vice versa.
|
|
|
|
@*
|
|
As an example, consider conversion between the big5 and EUC-TW
encodings. The big5 encoding may be decomposed into the ASCII and BIG5 CCSes plus
the BIG5 CES. EUC-TW may be decomposed into the CNS11643_PLANE1, CNS11643_PLANE2,
and CNS11643_PLANE14 CCSes plus the EUC CES.
|
|
|
|
@*
|
|
The EUC-TW -> big5 conversion happens as follows:
|
|
|
|
@enumerate
|
|
@item
|
|
The EUC converter transforms the EUC-TW encoding into the corresponding CCS codes
(CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14 CCSes);
@item
The obtained CCS codes are transformed to UCS codes using the CNS11643_PLANE1,
CNS11643_PLANE2 and CNS11643_PLANE14 CCS tables;
@item
The resulting UCS codes are transformed to ASCII and BIG5 codes using
the corresponding CCS tables;
@item
The obtained CCS codes are transformed to the big5 encoding using the corresponding
CES converter.
|
|
@end enumerate
|
|
|
|
@*
|
|
Analogously, the backward conversion is performed as follows:
|
|
|
|
@enumerate
|
|
@item
|
|
The BIG5 converter transforms the big5 encoding into the corresponding CCS codes
(ASCII and BIG5 CCSes);
@item
The obtained CCS codes are transformed to UCS codes using the ASCII and BIG5 CCS tables;
@item
The resulting UCS codes are transformed to CNS11643_PLANE1, CNS11643_PLANE2
and CNS11643_PLANE14 codes using the corresponding CCS tables;
@item
The obtained CCS codes are transformed to the EUC-TW encoding using the corresponding
CES converter.
|
|
@end enumerate
|
|
|
|
@*
|
|
Note that the above is just an example; the real names (implemented in Newlib
iconv) of the CES converters and CCS tables are slightly different.
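The conversion through UCS described above can be sketched with two single-byte encodings, where no separate CES step is needed. All names and the four-entry table excerpts below are assumptions made for this illustration; they are not the data structures used by Newlib, whose tables cover each CCS completely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One CCS table entry: a character code and its UCS equivalent. */
struct ccs_entry { unsigned char code; uint32_t ucs; };

/* Deliberately tiny excerpts (Cyrillic a, be, A, Be). */
static const struct ccs_entry koi8_r_excerpt[] = {
    { 0xC1, 0x0430 }, { 0xC2, 0x0431 }, { 0xE1, 0x0410 }, { 0xE2, 0x0411 },
};
static const struct ccs_entry iso_8859_5_excerpt[] = {
    { 0xB0, 0x0410 }, { 0xB1, 0x0411 }, { 0xD0, 0x0430 }, { 0xD1, 0x0431 },
};
#define EXCERPT_LEN 4
#define UCS_INVALID 0xFFFFFFFFu

/* "From" direction: CCS code -> UCS code. */
static uint32_t ccs_to_ucs(const struct ccs_entry *t, size_t n, unsigned char c)
{
    assert(t != NULL);
    if (c < 0x80)
        return c;                 /* the ASCII half maps to itself */
    for (size_t i = 0; i < n; i++)
        if (t[i].code == c)
            return t[i].ucs;
    return UCS_INVALID;
}

/* "To" direction: UCS code -> CCS code, or -1 if unmapped. */
static int ucs_to_ccs(const struct ccs_entry *t, size_t n, uint32_t u)
{
    assert(t != NULL);
    if (u < 0x80)
        return (int)u;
    for (size_t i = 0; i < n; i++)
        if (t[i].ucs == u)
            return t[i].code;
    return -1;
}

/* Convert one KOI8-R byte to ISO-8859-5 through UCS; -1 if unmapped. */
int koi8_r_to_iso_8859_5(unsigned char c)
{
    uint32_t u = ccs_to_ucs(koi8_r_excerpt, EXCERPT_LEN, c);
    if (u == UCS_INVALID)
        return -1;
    return ucs_to_ccs(iso_8859_5_excerpt, EXCERPT_LEN, u);
}
```

Neither table knows about the other encoding: adding a third encoding to this scheme means adding one table, not one table per pair. A multi-byte encoding such as EUC-TW would additionally need a CES converter in front of ccs_to_ucs to split the byte stream into CCS codes.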
|
|
|
|
@*
|
|
The third design issue also relates to flexibility. It is undesirable
to always link all CES converters and CCS tables into the library;
instead, it is desirable to load the needed converters and tables
dynamically on demand. This isn't a problem on "big" machines like PCs
but may be very problematic on "small" embedded systems.
|
|
|
|
@*
|
|
Since the CCS tables are just data, it is possible to load them
dynamically from external files. CES converters, in contrast, are algorithms
and contain code, so loading them on demand requires dynamic library loading capability.
|
|
|
|
@*
|
|
Apart from possible restrictions imposed by embedded systems (too little
RAM, for example), Newlib itself has no dynamic library support and,
therefore, all CES converters which will ever be used must be linked into
the library. Dynamic loading of CCS tables, however, is possible; it is
implemented in the Newlib iconv library and may be enabled via Newlib
configure script options.
|
|
|
|
@*
|
|
The next design decision is the possibility of fine-grained iconv library
configuration. This means that iconv doesn't always link all its
converters and tables (if dynamic loading is not enabled); instead, it
makes it possible to enable only those encodings which are planned
to be used (see the section about configure script options).
|
|
|
|
@*
|
|
Moreover, the Newlib iconv library configure options distinguish between
coding directions. This means that not only the supported encodings are
selectable, but the coding direction too. For example, if a user wants a
configuration which allows conversions from UTF-8 to UTF-16 and
doesn't plan to use UTF-16 to UTF-8 conversions, exactly
that conversion direction can be enabled (i.e., no code related to UTF-16 -> UTF-8 will
be included), thus saving some memory (note that this technique allows
one half of a CCS table, which may be quite large, to be excluded from linking).
|
|
|
|
@*
|
|
One more design decision is speed- and size-optimized tables. The user can
select between them using a configure script option. Speed-optimized CCS tables
are the same as size-optimized ones in the case of an 8-bit CCS (e.g., KOI8-R), but for a 16-bit
CCS a size-optimized table may be 1.5 to 2 times smaller than the
speed-optimized one. On the other hand, conversion with speed-optimized
tables is several times faster.
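One common way to obtain such a tradeoff is shown below: a compact sorted list of pairs that is searched, versus a directly indexed array that also stores the unmapped gaps. This is only a sketch of the general idea under that assumption; the actual layout of Newlib's size- and speed-optimized tables is not described here, and all names and code values in the example are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_MAPPING 0xFFFFu

/* Size-oriented layout: only mapped codes are stored, sorted by code. */
struct pair { uint16_t code; uint16_t ucs; };
static const struct pair size_table[] = {
    { 0x10, 0x0410 }, { 0x14, 0x0414 }, { 0x1F, 0x041F },
};
#define SIZE_LEN (sizeof size_table / sizeof size_table[0])

/* Binary search: O(log n) comparisons per character. */
uint16_t lookup_size(uint16_t code)
{
    size_t lo = 0, hi = SIZE_LEN;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (size_table[mid].code == code)
            return size_table[mid].ucs;
        if (size_table[mid].code < code)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NO_MAPPING;
}

/* Speed-oriented layout: one slot per code in 0x10..0x1F, gaps included. */
#define SPEED_BASE 0x10
#define SPEED_LEN  16
#define X NO_MAPPING
static const uint16_t speed_table[SPEED_LEN] = {
    0x0410, X, X, X, 0x0414, X, X, X, X, X, X, X, X, X, X, 0x041F,
};
#undef X

/* Direct indexing: one bounds check and one array access. */
uint16_t lookup_speed(uint16_t code)
{
    assert(SPEED_LEN == sizeof speed_table / sizeof speed_table[0]);
    if (code < SPEED_BASE || code - SPEED_BASE >= SPEED_LEN)
        return NO_MAPPING;
    return speed_table[code - SPEED_BASE];
}
```

Both functions return the same answers; they differ only in memory use and lookup cost. For an 8-bit CCS every code fits in a 256-entry array anyway, which is consistent with the two variants being identical in that case.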
|
|
|
|
@*
|
|
It is worth stressing that support for new encodings can't be
dynamically added to an already compiled Newlib library, even if this
needs only an additional CCS table and iconv is configured to use external
files with CCS tables. This isn't a fundamental restriction: the
possibility to add support for new table-based encodings dynamically, by
copying a new .cct file, could easily be added.
|
|
|
|
@*
|
|
Theoretically, the compiled-in CCS tables may be more appropriate for
embedded solutions since they are read-only and are placed in ROM,
whereas dynamic loading needs more RAM. Moreover, in the current
implementation, a distinct copy of the CCS file is loaded for each
opened iconv descriptor, even in the case of the same encoding.
This means, for example, that if two iconv descriptors for
KOI8-R -> UCS-4BE and KOI8-R -> UTF-16BE are opened, two copies of the
koi8-r .cct file will be loaded (actually, iconv loads only the needed part
of these files).
|
|
|
|
|
|
@page
|
|
@node iconv configuration
|
|
@section iconv configuration
|
|
@findex iconv configuration
|
|
@*
|
|
To enable encoding support, the @option{--enable-newlib-iconv-encodings} configure
script option should be used. This option accepts a comma-separated list
of encodings that should be enabled. It enables each encoding in both
("to" and "from") directions.
|
|
|
|
@*
|
|
The @option{--enable-newlib-iconv-from-encodings} configure script option enables
"from" support for each encoding that is passed to it.
|
|
|
|
@*
|
|
The @option{--enable-newlib-iconv-to-encodings} configure script option enables
"to" support for each encoding that is passed to it.
|
|
|
|
@*
|
|
Example: if the user plans only KOI8-R -> UTF-8, UTF-8 -> ISO-8859-5 and
KOI8-R -> UCS-2 conversions, the optimal configuration (the minimum of iconv
code and data will be linked) is to configure Newlib with
@option{--enable-newlib-iconv-encodings=UTF-8}
@option{--enable-newlib-iconv-from-encodings=KOI8-R}
@option{--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5}
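Because the set of available conversions depends on how the library was configured, a program may want to check for a conversion before relying on it. The sketch below does this by opening and immediately closing a descriptor; the helper name conversion_available is an assumption made for this example.

```c
#include <assert.h>
#include <iconv.h>

/* Hypothetical helper: returns 1 if conversion from encoding `from`
   to encoding `to` can be opened, 0 otherwise. */
int conversion_available(const char *to, const char *from)
{
    assert(to && from);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return 0;          /* typically errno == EINVAL: not supported */

    iconv_close(cd);
    return 1;
}
```

With a library configured for one direction only, the check for the enabled direction would be expected to succeed and the reverse direction to fail, so the helper should be called once per direction that the program needs.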
|
|
|
|
@*
|
|
The @option{--enable-newlib-iconv-external-ccs} option enables iconv's
ability to work with external CCS files.
|
|
|
|
@*
|
|
Note: @code{iconv_open} searches for CCS files in the @file{$NLSPATH/iconv_data/} directory.
|