@node Encoding conversions
|
|
@chapter Encoding conversions (@file{iconv.h})
|
|
|
|
This chapter describes the Newlib iconv library.
|
|
The iconv function declarations are in
@file{iconv.h}.
|
|
|
|
@menu
|
|
* Function iconv:: Encoding conversion routines
|
|
* Introduction to iconv:: Introduction to iconv and encodings
|
|
* Supported encodings:: The list of currently supported encodings
|
|
* iconv design decisions:: General iconv library design issues
|
|
* iconv configuration:: iconv-related configure script options
|
|
* Encoding names:: How encodings are named.
|
|
* CCS tables:: CCS tables format and 'mktbl.pl' Perl script
|
|
* CES converters:: CES converters description
|
|
* The encodings description file:: The 'encoding.deps' file and 'mkdeps.pl'
|
|
* How to add new encoding:: The steps to add new encoding support
|
|
* The locale support interfaces:: Locale-related iconv interfaces
|
|
* Contact:: The author contact
|
|
@end menu
|
|
|
|
@page
|
|
@include iconv/lib/iconv.def
|
|
|
|
@page
|
|
@node Introduction to iconv
|
|
@section Introduction to iconv
|
|
@findex encoding
|
|
@findex character set
|
|
@findex charset
|
|
@findex CES
|
|
@findex CCS
|
|
@*
|
|
The iconv library is intended to convert characters from one encoding to
|
|
another. It implements iconv(), iconv_open() and iconv_close()
|
|
calls, which are defined by the Single Unix Specification.
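@*
As a minimal sketch of how these three calls fit together, the following C function converts a whole buffer in one step. The helper name @code{convert_buffer} and its signature are illustrative assumptions, not part of the library; the encodings passed to it must be ones that are enabled in the particular build.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert in_len bytes of `in` (encoding `from`)
   into `out` (encoding `to`). Returns the number of bytes written,
   or (size_t)-1 if the descriptor cannot be opened or conversion fails. */
size_t convert_buffer(const char *from, const char *to,
                      const char *in, size_t in_len,
                      char *out, size_t out_cap)
{
    char *src = (char *)in;      /* iconv() takes char **; input is not modified */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;
    size_t rc;

    /* Note the argument order: the target encoding comes first. */
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &dst, &dst_left);  /* flush any shift state */

    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : out_cap - dst_left;
}
```

A caller might invoke it as @code{convert_buffer("KOI8-R", "UTF-8", in, in_len, out, sizeof out)}. A production caller would normally loop on partial conversions and inspect @code{errno}, which this sketch omits.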
|
|
|
|
@*
|
|
In addition to these user-level interfaces, the iconv library also has
|
|
several useful interfaces which are needed to support coding
|
|
capabilities of the Newlib Locale infrastructure. Since Locale
|
|
support also needs to
|
|
convert various character sets to and from the @emph{wide character
set}, the iconv library shares its capabilities with the Newlib Locale
|
|
subsystem. Moreover, the iconv library supports several features which are
|
|
only needed for the Locale infrastructure (for example, the MB_CUR_MAX value).
|
|
|
|
@*
|
|
The Newlib iconv library was created using concepts from another iconv
|
|
library implemented by Konstantin Chuguev (ver 2.0). The Newlib iconv library
|
|
was rewritten from scratch and contains a lot of improvements with respect to
|
|
the original iconv library.
|
|
|
|
@*
|
|
Terms like @dfn{encoding} or @dfn{character set} aren't well defined and
|
|
are often used with various meanings. The following are the definitions of terms
|
|
which are used in this documentation as well as in the iconv library
|
|
implementation:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
@dfn{encoding} - a machine representation of characters by means of bits;
|
|
|
|
@item
|
|
@dfn{Character Set} or @dfn{Charset} - just a collection of
|
|
characters, i.e. the encoding is the machine representation of the character set;
|
|
|
|
@item
|
|
@dfn{CCS} (@dfn{Coded Character Set}) - a mapping from a character set to a
set of integers (@dfn{character codes});
|
|
|
|
@item
|
|
@dfn{CES} (@dfn{Character Encoding Scheme}) - a mapping from a set of character
|
|
codes to a sequence of bytes;
|
|
@end itemize
|
|
|
|
@*
|
|
Users usually deal with encodings, for example, KOI8-R, Unicode, UTF-8,
|
|
ASCII, etc. Encodings are formed by the following chain of steps:
|
|
|
|
@enumerate
|
|
@item
|
|
The user has a set of characters which are specific to his or her language (character set).
|
|
|
|
@item
|
|
Each character from this set is uniquely numbered, resulting in a CCS.
|
|
|
|
@item
|
|
Each number from the CCS is converted to a sequence of bits or bytes by means
of a CES, forming some encoding. Thus, a CES may be considered as a
function of a CCS which produces some encoding. Note that a CES may be
applied to more than one CCS.
|
|
@end enumerate
|
|
|
|
@*
|
|
Thus, an encoding may be considered as one or more CCS + CES.
|
|
|
|
@*
|
|
Sometimes, there is no CES and in such cases encoding is equivalent
|
|
to CCS, e.g. KOI8-R or ASCII.
|
|
|
|
@*
|
|
An example of a more complicated encoding is UTF-8 which is the UCS
|
|
(or Unicode) CCS plus the UTF-8 CES.
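@*
To make the CCS/CES split concrete, here is a small sketch of the UTF-8 CES step in C: it takes a UCS character code (the output of the CCS step) and produces the byte sequence. The function name @code{utf8_encode} is an illustrative assumption and is not the converter used inside the library.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one UCS character code as UTF-8 (the CES step).
   Writes 1 to 4 bytes to `out` and returns the byte count, or 0 if
   the code is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t ucs, unsigned char out[4])
{
    if (ucs < 0x80) {                        /* 7 bits: one byte */
        out[0] = (unsigned char)ucs;
        return 1;
    }
    if (ucs < 0x800) {                       /* 11 bits: two bytes */
        out[0] = (unsigned char)(0xC0 | (ucs >> 6));
        out[1] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 2;
    }
    if (ucs >= 0xD800 && ucs <= 0xDFFF)      /* surrogates are not characters */
        return 0;
    if (ucs < 0x10000) {                     /* 16 bits: three bytes */
        out[0] = (unsigned char)(0xE0 | (ucs >> 12));
        out[1] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 3;
    }
    if (ucs <= 0x10FFFF) {                   /* 21 bits: four bytes */
        out[0] = (unsigned char)(0xF0 | (ucs >> 18));
        out[1] = (unsigned char)(0x80 | ((ucs >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 4;
    }
    return 0;
}
```

The character set and its numbering (the CCS) are fixed by UCS; only the last step, turning a number into bytes, belongs to the CES.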
|
|
|
|
@*
|
|
The following is a brief list of iconv library features:
|
|
@itemize
|
|
@item
|
|
Generic architecture;
|
|
@item
|
|
Locale infrastructure support;
|
|
@item
|
|
Automatic generation of the program code which handles
|
|
CES/CCS/Encoding/Names/Aliases dependencies;
|
|
@item
|
|
The ability to choose a size- or speed-optimized
configuration;
|
|
@item
|
|
The ability to exclude a lot of unneeded code and data from the linking step.
|
|
@end itemize
|
|
|
|
|
|
|
|
|
|
@page
|
|
@node Supported encodings
|
|
@section Supported encodings
|
|
@findex big5
|
|
@findex cp775
|
|
@findex cp850
|
|
@findex cp852
|
|
@findex cp855
|
|
@findex cp866
|
|
@findex euc_jp
|
|
@findex euc_kr
|
|
@findex euc_tw
|
|
@findex iso_8859_1
|
|
@findex iso_8859_10
|
|
@findex iso_8859_11
|
|
@findex iso_8859_13
|
|
@findex iso_8859_14
|
|
@findex iso_8859_15
|
|
@findex iso_8859_2
|
|
@findex iso_8859_3
|
|
@findex iso_8859_4
|
|
@findex iso_8859_5
|
|
@findex iso_8859_6
|
|
@findex iso_8859_7
|
|
@findex iso_8859_8
|
|
@findex iso_8859_9
|
|
@findex iso_ir_111
|
|
@findex koi8_r
|
|
@findex koi8_ru
|
|
@findex koi8_u
|
|
@findex koi8_uni
|
|
@findex ucs_2
|
|
@findex ucs_2_internal
|
|
@findex ucs_2be
|
|
@findex ucs_2le
|
|
@findex ucs_4
|
|
@findex ucs_4_internal
|
|
@findex ucs_4be
|
|
@findex ucs_4le
|
|
@findex us_ascii
|
|
@findex utf_16
|
|
@findex utf_16be
|
|
@findex utf_16le
|
|
@findex utf_8
|
|
@findex win_1250
|
|
@findex win_1251
|
|
@findex win_1252
|
|
@findex win_1253
|
|
@findex win_1254
|
|
@findex win_1255
|
|
@findex win_1256
|
|
@findex win_1257
|
|
@findex win_1258
|
|
@*
|
|
The following is the list of currently supported encodings. The first column
gives the encoding name, the second column lists its aliases,
the third column names its CES and CCS components, and the fourth column
is a short description.
|
|
|
|
@multitable @columnfractions .20 .26 .24 .30
|
|
@item
|
|
Name
|
|
@tab
|
|
Aliases
|
|
@tab
|
|
CES/CCS
|
|
@tab
|
|
Short description
|
|
@item
|
|
@tab
|
|
@tab
|
|
@tab
|
|
|
|
|
|
@item
|
|
big5
|
|
@tab
|
|
csbig5, big_five, bigfive, cn_big5, cp950
|
|
@tab
|
|
table_pcs / big5, us_ascii
|
|
@tab
|
|
The encoding for the Traditional Chinese.
|
|
|
|
|
|
@item
|
|
cp775
|
|
@tab
|
|
ibm775, cspc775baltic
|
|
@tab
|
|
table / cp775
|
|
@tab
|
|
The updated version of CP 437 that supports the Baltic languages.
|
|
|
|
|
|
@item
|
|
cp850
|
|
@tab
|
|
ibm850, 850, cspc850multilingual
|
|
@tab
|
|
table / cp850
|
|
@tab
|
|
IBM 850 - the updated version of CP 437 where several Latin 1 characters have been
added instead of some less-often used characters like the line-drawing
and the Greek ones.
|
|
|
|
|
|
@item
|
|
cp852
|
|
@tab
|
|
ibm852, 852, cspcp852
|
|
@tab
table / cp852
|
|
@tab
|
|
IBM 852 - the updated version of CP 437 where several Latin 2 characters have been added
instead of some less-often used characters like the line-drawing and the Greek ones.
|
|
|
|
|
|
@item
|
|
cp855
|
|
@tab
|
|
ibm855, 855, csibm855
|
|
@tab
|
|
table / cp855
|
|
@tab
|
|
IBM 855 - the updated version of CP 437 that supports Cyrillic.
|
|
|
|
|
|
@item
|
|
cp866
|
|
@tab
|
|
866, IBM866, CSIBM866
|
|
@tab
|
|
table / cp866
|
|
@tab
|
|
IBM 866 - the updated version of CP 855 which more closely follows the logical Russian alphabet
ordering of the alternative variant that is preferred by many Russian users.
|
|
|
|
|
|
@item
|
|
euc_jp
|
|
@tab
|
|
eucjp
|
|
@tab
|
|
euc / jis_x0208_1990, jis_x0201_1976, jis_x0212_1990
|
|
@tab
|
|
EUC-JP - The EUC for Japanese.
|
|
|
|
|
|
@item
|
|
euc_kr
|
|
@tab
|
|
euckr
|
|
@tab
|
|
euc / ksx1001
|
|
@tab
|
|
EUC-KR - The EUC for Korean.
|
|
|
|
|
|
@item
|
|
euc_tw
|
|
@tab
|
|
euctw
|
|
@tab
|
|
euc / cns11643_plane1, cns11643_plane2, cns11643_plane14
|
|
@tab
|
|
EUC-TW - The EUC for Traditional Chinese.
|
|
|
|
|
|
@item
|
|
iso_8859_1
|
|
@tab
|
|
iso8859_1, iso88591, iso_8859_1:1987, iso_ir_100, latin1, l1, ibm819, cp819, csisolatin1
|
|
@tab
|
|
table / iso_8859_1
|
|
@tab
|
|
ISO 8859-1:1987 - Latin 1, West European.
|
|
|
|
|
|
@item
|
|
iso_8859_10
|
|
@tab
|
|
iso_8859_10:1992, iso_ir_157, iso885910, latin6, l6, csisolatin6, iso8859_10
|
|
@tab
|
|
table / iso_8859_10
|
|
@tab
|
|
ISO 8859-10:1992 - Latin 6, Nordic.
|
|
|
|
|
|
@item
|
|
iso_8859_11
|
|
@tab
|
|
iso8859_11, iso885911
|
|
@tab
|
|
table / iso_8859_11
|
|
@tab
|
|
ISO 8859-11 - Thai.
|
|
|
|
|
|
@item
|
|
iso_8859_13
|
|
@tab
|
|
iso_8859_13:1998, iso8859_13, iso885913
|
|
@tab
|
|
table / iso_8859_13
|
|
@tab
|
|
ISO 8859-13:1998 - Latin 7, Baltic Rim.
|
|
|
|
|
|
@item
|
|
iso_8859_14
|
|
@tab
|
|
iso_8859_14:1998, iso885914, iso8859_14
|
|
@tab
|
|
table / iso_8859_14
|
|
@tab
|
|
ISO 8859-14:1998 - Latin 8, Celtic.
|
|
|
|
|
|
@item
|
|
iso_8859_15
|
|
@tab
|
|
iso885915, iso_8859_15:1998, iso8859_15
|
|
@tab
|
|
table / iso_8859_15
|
|
@tab
|
|
ISO 8859-15:1998 - Latin 9, West Europe, successor of Latin 1.
|
|
|
|
|
|
@item
|
|
iso_8859_2
|
|
@tab
|
|
iso8859_2, iso88592, iso_8859_2:1987, iso_ir_101, latin2, l2, csisolatin2
|
|
@tab
|
|
table / iso_8859_2
|
|
@tab
|
|
ISO 8859-2:1987 - Latin 2, East European.
|
|
|
|
|
|
@item
|
|
iso_8859_3
|
|
@tab
|
|
iso_8859_3:1988, iso_ir_109, iso8859_3, latin3, l3, csisolatin3, iso88593
|
|
@tab
|
|
table / iso_8859_3
|
|
@tab
|
|
ISO 8859-3:1988 - Latin 3, South European.
|
|
|
|
|
|
@item
|
|
iso_8859_4
|
|
@tab
|
|
iso8859_4, iso88594, iso_8859_4:1988, iso_ir_110, latin4, l4, csisolatin4
|
|
@tab
|
|
table / iso_8859_4
|
|
@tab
|
|
ISO 8859-4:1988 - Latin 4, North European.
|
|
|
|
|
|
@item
|
|
iso_8859_5
|
|
@tab
|
|
iso8859_5, iso88595, iso_8859_5:1988, iso_ir_144, cyrillic, csisolatincyrillic
|
|
@tab
|
|
table / iso_8859_5
|
|
@tab
|
|
ISO 8859-5:1988 - Cyrillic.
|
|
|
|
|
|
@item
|
|
iso_8859_6
|
|
@tab
|
|
iso_8859_6:1987, iso_ir_127, iso8859_6, ecma_114, asmo_708, arabic, csisolatinarabic, iso88596
|
|
@tab
|
|
table / iso_8859_6
|
|
@tab
|
|
ISO 8859-6:1987 - Arabic.
|
|
|
|
|
|
@item
|
|
iso_8859_7
|
|
@tab
|
|
iso_8859_7:1987, iso_ir_126, iso8859_7, elot_928, ecma_118, greek, greek8, csisolatingreek, iso88597
|
|
@tab
|
|
table / iso_8859_7
|
|
@tab
|
|
ISO 8859-7:1987 - Greek.
|
|
|
|
|
|
@item
|
|
iso_8859_8
|
|
@tab
|
|
iso_8859_8:1988, iso_ir_138, iso8859_8, hebrew, csisolatinhebrew, iso88598
|
|
@tab
|
|
table / iso_8859_8
|
|
@tab
|
|
ISO 8859-8:1988 - Hebrew.
|
|
|
|
|
|
@item
|
|
iso_8859_9
|
|
@tab
|
|
iso_8859_9:1989, iso_ir_148, iso8859_9, latin5, l5, csisolatin5, iso88599
|
|
@tab
|
|
table / iso_8859_9
|
|
@tab
|
|
ISO 8859-9:1989 - Latin 5, Turkish.
|
|
|
|
|
|
@item
|
|
iso_ir_111
|
|
@tab
|
|
ecma_cyrillic, koi8_e, koi8e, csiso111ecmacyrillic
|
|
@tab
|
|
table / iso_ir_111
|
|
@tab
|
|
ISO IR 111/ECMA Cyrillic.
|
|
|
|
|
|
@item
|
|
koi8_r
|
|
@tab
|
|
cskoi8r, koi8r, koi8
|
|
@tab
|
|
table / koi8_r
|
|
@tab
|
|
RFC 1489 Cyrillic.
|
|
|
|
|
|
@item
|
|
koi8_ru
|
|
@tab
|
|
koi8ru
|
|
@tab
|
|
table / koi8_ru
|
|
@tab
|
|
The obsolete Ukrainian.
|
|
|
|
|
|
@item
|
|
koi8_u
|
|
@tab
|
|
koi8u
|
|
@tab
|
|
table / koi8_u
|
|
@tab
|
|
RFC 2319 Ukrainian.
|
|
|
|
|
|
@item
|
|
koi8_uni
|
|
@tab
|
|
koi8uni
|
|
@tab
|
|
table / koi8_uni
|
|
@tab
|
|
KOI8 Unified.
|
|
|
|
|
|
@item
|
|
ucs_2
|
|
@tab
|
|
ucs2, iso_10646_ucs_2, iso10646_ucs_2, iso_10646_ucs2, iso10646_ucs2, iso10646ucs2, csUnicode
|
|
@tab
|
|
ucs_2 / (UCS)
|
|
@tab
|
|
ISO-10646-UCS-2. Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
ucs_2_internal
|
|
@tab
|
|
ucs2_internal, ucs_2internal, ucs2internal
|
|
@tab
|
|
ucs_2_internal / (UCS)
|
|
@tab
|
|
ISO-10646-UCS-2 in system byte order.
|
|
NBSP is always interpreted as NBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
ucs_2be
|
|
@tab
|
|
ucs2be
|
|
@tab
|
|
ucs_2 / (UCS)
|
|
@tab
|
|
Big Endian version of ISO-10646-UCS-2 (in fact, equivalent to ucs_2).
|
|
Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
ucs_2le
|
|
@tab
|
|
ucs2le
|
|
@tab
|
|
ucs_2 / (UCS)
|
|
@tab
|
|
Little Endian version of ISO-10646-UCS-2.
|
|
Little Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
ucs_4
|
|
@tab
|
|
ucs4, iso_10646_ucs_4, iso10646_ucs_4, iso_10646_ucs4, iso10646_ucs4, iso10646ucs4
|
|
@tab
|
|
ucs_4 / (UCS)
|
|
@tab
|
|
ISO-10646-UCS-4. Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
ucs_4_internal
|
|
@tab
|
|
ucs4_internal, ucs_4internal, ucs4internal
|
|
@tab
|
|
ucs_4_internal / (UCS)
|
|
@tab
|
|
ISO-10646-UCS-4 in system byte order.
|
|
NBSP is always interpreted as NBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
ucs_4be
|
|
@tab
|
|
ucs4be
|
|
@tab
|
|
ucs_4 / (UCS)
|
|
@tab
|
|
Big Endian version of ISO-10646-UCS-4 (in fact, equivalent to ucs_4).
|
|
Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
ucs_4le
|
|
@tab
|
|
ucs4le
|
|
@tab
|
|
ucs_4 / (UCS)
|
|
@tab
|
|
Little Endian version of ISO-10646-UCS-4.
|
|
Little Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
us_ascii
|
|
@tab
|
|
ansi_x3.4_1968, ansi_x3.4_1986, iso_646.irv:1991, ascii, iso646_us, us, ibm367, cp367, csascii
|
|
@tab
|
|
us_ascii / (ASCII)
|
|
@tab
|
|
7-bit ASCII.
|
|
|
|
|
|
@item
|
|
utf_16
|
|
@tab
|
|
utf16
|
|
@tab
|
|
utf_16 / (UCS)
|
|
@tab
|
|
RFC 2781 UTF-16. The very first NBSP code in stream is interpreted as BOM.
|
|
|
|
|
|
@item
|
|
utf_16be
|
|
@tab
|
|
utf16be
|
|
@tab
|
|
utf_16 / (UCS)
|
|
@tab
|
|
Big Endian version of RFC 2781 UTF-16.
|
|
NBSP is always interpreted as NBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
utf_16le
|
|
@tab
|
|
utf16le
|
|
@tab
|
|
utf_16 / (UCS)
|
|
@tab
|
|
Little Endian version of RFC 2781 UTF-16.
|
|
NBSP is always interpreted as NBSP (BOM isn't supported).
|
|
|
|
|
|
@item
|
|
utf_8
|
|
@tab
|
|
utf8
|
|
@tab
|
|
utf_8 / (UCS)
|
|
@tab
|
|
RFC 3629 UTF-8.
|
|
|
|
|
|
@item
|
|
win_1250
|
|
@tab
|
|
cp1250
|
|
@tab
table / win_1250
|
|
@tab
|
|
Win-1250 - Croatian.
|
|
|
|
|
|
@item
|
|
win_1251
|
|
@tab
|
|
cp1251
|
|
@tab
|
|
table / win_1251
|
|
@tab
|
|
Win-1251 - Cyrillic.
|
|
|
|
|
|
@item
|
|
win_1252
|
|
@tab
|
|
cp1252
|
|
@tab
|
|
table / win_1252
|
|
@tab
|
|
Win-1252 - Latin 1.
|
|
|
|
|
|
@item
|
|
win_1253
|
|
@tab
|
|
cp1253
|
|
@tab
|
|
table / win_1253
|
|
@tab
|
|
Win-1253 - Greek.
|
|
|
|
|
|
@item
|
|
win_1254
|
|
@tab
|
|
cp1254
|
|
@tab
|
|
table / win_1254
|
|
@tab
|
|
Win-1254 - Turkish.
|
|
|
|
|
|
@item
|
|
win_1255
|
|
@tab
|
|
cp1255
|
|
@tab
|
|
table / win_1255
|
|
@tab
|
|
Win-1255 - Hebrew.
|
|
|
|
|
|
@item
|
|
win_1256
|
|
@tab
|
|
cp1256
|
|
@tab
|
|
table / win_1256
|
|
@tab
|
|
Win-1256 - Arabic.
|
|
|
|
|
|
@item
|
|
win_1257
|
|
@tab
|
|
cp1257
|
|
@tab
|
|
table / win_1257
|
|
@tab
|
|
Win-1257 - Baltic.
|
|
|
|
|
|
@item
|
|
win_1258
|
|
@tab
|
|
cp1258
|
|
@tab
|
|
table / win_1258
|
|
@tab
|
|
Win-1258 - Vietnamese.
|
|
@end multitable
|
|
|
|
|
|
|
|
|
|
|
|
@page
|
|
@node iconv design decisions
|
|
@section iconv design decisions
|
|
@findex CCS table
|
|
@findex CES converter
|
|
@findex Speed-optimized tables
|
|
@findex Size-optimized tables
|
|
@*
|
|
The first iconv library design issue arises when considering the
|
|
following two design approaches:
|
|
|
|
@enumerate
|
|
@item
|
|
Have modules which implement conversion from the encoding A to the encoding B
and vice versa, i.e., one conversion module relates to a pair of encodings.
|
|
@item
|
|
Have modules which implement conversion from the encoding A to the fixed
encoding C and vice versa, i.e., one conversion module relates to
one encoding A and one fixed encoding C. In this case, to convert from
the encoding A to the encoding B, two modules are needed (in order to convert
from A to C and then from C to B).
|
|
@end enumerate
|
|
|
|
@*
|
|
Obviously, there is a tradeoff between commonality/flexibility and
efficiency: the first method is more efficient since it converts
directly; however, it is less flexible since a distinct module is
needed for each encoding pair.
|
|
|
|
@*
|
|
The Newlib iconv model uses the second method and always converts through the 32-bit
|
|
UCS but its design also allows one to write specialized conversion
|
|
modules if the conversion speed is critical.
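@*
The second approach can be sketched in C as follows. This is a toy model under stated assumptions: the @code{toy_module} type, the two single-byte modules and @code{pivot_convert} are hypothetical names invented for the illustration and do not reflect the library's internal interfaces.

```c
#include <stddef.h>
#include <stdint.h>

#define UCS_INVALID 0xFFFFFFFFu

/* A toy single-byte module: converts between its own encoding and UCS. */
typedef struct {
    uint32_t (*to_ucs)(unsigned char byte);  /* UCS_INVALID if undefined */
    int (*from_ucs)(uint32_t ucs);           /* byte value, or -1 if absent */
} toy_module;

static uint32_t ascii_to_ucs(unsigned char b) { return b < 0x80 ? b : UCS_INVALID; }
static int ascii_from_ucs(uint32_t u) { return u < 0x80 ? (int)u : -1; }

/* ISO 8859-1 codes coincide with the first 256 UCS codes. */
static uint32_t latin1_to_ucs(unsigned char b) { return b; }
static int latin1_from_ucs(uint32_t u) { return u < 0x100 ? (int)u : -1; }

const toy_module toy_ascii = { ascii_to_ucs, ascii_from_ucs };
const toy_module toy_latin1 = { latin1_to_ucs, latin1_from_ucs };

/* Convert n bytes from encoding A to encoding B through the UCS pivot.
   Returns n on success, or -1 at the first unconvertible character. */
long pivot_convert(const toy_module *a, const toy_module *b,
                   const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        uint32_t ucs = a->to_ucs(in[i]);     /* step 1: A -> UCS */
        int code;
        if (ucs == UCS_INVALID)
            return -1;
        code = b->from_ucs(ucs);             /* step 2: UCS -> B */
        if (code < 0)
            return -1;
        out[i] = (unsigned char)code;
    }
    return (long)n;
}
```

With N encodings this scheme needs N modules rather than one module per encoding pair, at the cost of two lookups per character.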
|
|
|
|
@*
|
|
The second design issue is how to break down (decompose) encodings.
|
|
The Newlib iconv library uses the fact that any encoding may be
|
|
considered as one or more CCS plus a CES. It also decomposes its
conversion modules into a @dfn{CES converter} plus one or more @dfn{CCS
tables}. CCS tables map CCS to UCS and vice versa; the CES converters
map CCS to the encoding and vice versa.
|
|
|
|
@*
|
|
As an example, let's consider conversions between the big5 encoding and
the EUC-TW encoding. The big5 encoding may be decomposed into the ASCII and BIG5
CCS-es plus the BIG5 CES. EUC-TW may be decomposed into the CNS11643_PLANE1, CNS11643_PLANE2,
and CNS11643_PLANE14 CCS-es plus the EUC CES.
|
|
|
|
@*
|
|
The euc_tw -> big5 conversion is performed as follows:
|
|
|
|
@enumerate
|
|
@item
|
|
The EUC converter transforms the EUC-TW encoding to the corresponding CCS-es
(the CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14
CCS-es);
|
|
@item
|
|
The obtained CCS codes are transformed to the UCS codes using the CNS11643_PLANE1,
|
|
CNS11643_PLANE2 and CNS11643_PLANE14 CCS tables;
|
|
@item
|
|
The resulting UCS codes are transformed to the ASCII and BIG5 codes using
|
|
the corresponding CCS tables;
|
|
@item
|
|
The obtained CCS codes are transformed to the big5 encoding using the corresponding
|
|
CES converter.
|
|
@end enumerate
|
|
|
|
@*
|
|
Analogously, the backward conversion is performed as follows:
|
|
|
|
@enumerate
|
|
@item
|
|
The BIG5 converter transforms the big5 encoding to the corresponding CCS-es
(the ASCII and BIG5 CCS-es);
|
|
@item
|
|
The obtained CCS codes are transformed to the UCS codes using the ASCII and BIG5 CCS tables;
|
|
@item
|
|
The resulting UCS codes are transformed to the CNS11643_PLANE1, CNS11643_PLANE2
and CNS11643_PLANE14 codes using the corresponding CCS tables;
|
|
@item
|
|
The obtained CCS codes are transformed to the EUC-TW encoding using the corresponding
|
|
CES converter.
|
|
@end enumerate
|
|
|
|
@*
|
|
Note that the above is just an example; the real names (as implemented
in the Newlib iconv) of the CES converters and the CCS tables are slightly different.
|
|
|
|
@*
|
|
The third design issue also relates to flexibility. Obviously, it isn't
desirable to always link all the CES converters and the CCS tables to the library;
instead, we want to be able to load the needed converters and tables
dynamically on demand. This isn't a problem on "big" machines such as
a PC, but it may be very problematic on "small" embedded systems.
|
|
|
|
@*
|
|
Since the CCS tables are just data, it is possible to load them
|
|
dynamically from external files. The CES converters, on the other hand,
are algorithms with some code, so a dynamic library loading
capability is required.
|
|
|
|
@*
|
|
Apart from possible restrictions applied by embedded systems (small
|
|
RAM for example), Newlib itself has no dynamic library support and
|
|
therefore, all the CES converters which will ever be used must be linked into
|
|
the library. However, loading of the dynamic CCS tables is possible and is
|
|
implemented in the Newlib iconv library. It may be enabled via the Newlib
|
|
configure script options.
|
|
|
|
@*
|
|
The next design issue is fine-tuning the iconv library
|
|
configuration. One important ability is for iconv not to link all its
converters and tables (if dynamic loading is not enabled) but instead to
enable only those encodings which are specified at configuration
time (see the section about the configure script options).
|
|
|
|
@*
|
|
In addition, the Newlib iconv library configure options distinguish between
|
|
conversion directions. This means that not only are supported encodings
|
|
selectable, the conversion direction is as well. For example, if a user wants
a configuration which allows conversions from UTF-8 to UTF-16 and
doesn't plan to use the "UTF-16 to UTF-8" conversions, he or she can
enable only
this conversion direction (i.e., no "UTF-16 -> UTF-8"-related code will
be included), thus saving some memory (note that this technique allows
one half of a CCS table, which may be quite big, to be excluded from linking).
|
|
|
|
@*
|
|
One more design aspect is the choice between speed- and size-optimized tables. Users can
select between them using configure script options. The
speed-optimized CCS tables are the same as the size-optimized ones in
the case of 8-bit CCS-es (e.g., KOI8-R), but for 16-bit CCS-es the size-optimized
CCS tables may be 1.5 to 2 times smaller than the speed-optimized ones. On the
other hand, conversion with the speed-optimized tables is several times faster.
|
|
|
|
@*
|
|
It is worth stressing that new encoding support can't be
dynamically added to an already compiled Newlib library, even if it
needs only an additional CCS table and iconv is configured to use
external files with CCS tables (this isn't a fundamental restriction,
and the ability to add new table-based encoding support dynamically, by
just adding a new .cct file, could easily be implemented).
|
|
|
|
@*
|
|
Theoretically, the compiled-in CCS tables should be more appropriate for
|
|
embedded systems than dynamically loaded CCS tables. This is because the compiled-in tables are read-only and can be placed in ROM
|
|
whereas dynamic loading requires RAM. Moreover, in the current iconv
|
|
implementation, a distinct copy of the dynamic CCS file is loaded for each opened iconv descriptor even in case of the same encoding.
|
|
This means, for example, that if two iconv descriptors for
|
|
"KOI8-R -> UCS-4BE" and "KOI8-R -> UTF-16BE" are opened, two copies of
|
|
the koi8-r .cct file will be loaded (actually, iconv loads only the needed part
|
|
of these files). On the other hand, in the case of compiled-in CCS tables, there will always be only one copy.
|
|
|
|
@page
|
|
@node iconv configuration
|
|
@section iconv configuration
|
|
@findex iconv configuration
|
|
@findex --enable-newlib-iconv-encodings
|
|
@findex --enable-newlib-iconv-from-encodings
|
|
@findex --enable-newlib-iconv-to-encodings
|
|
@findex --enable-newlib-iconv-external-ccs
|
|
@findex NLSPATH
|
|
@*
|
|
To enable an encoding, the @option{--enable-newlib-iconv-encodings} configure
script option should be used. This option accepts a comma-separated list
of @emph{encodings} that should be enabled. The option enables each encoding in both
("to" and "from") directions.
|
|
|
|
@*
|
|
The @option{--enable-newlib-iconv-from-encodings} configure script option enables
|
|
"from" support for each encoding that was passed to it.
|
|
|
|
@*
|
|
The @option{--enable-newlib-iconv-to-encodings} configure script option enables
|
|
"to" support for each encoding that was passed to it.
|
|
|
|
@*
|
|
Example: if a user plans to use only the "KOI8-R -> UTF-8", "UTF-8 -> ISO-8859-5" and
"KOI8-R -> UCS-2" conversions, the optimal way (the minimum of iconv
code and data will be linked) is to configure Newlib with the following
options:
|
|
@*
|
|
@code{--enable-newlib-iconv-encodings=UTF-8
|
|
--enable-newlib-iconv-from-encodings=KOI8-R
|
|
--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5}
|
|
@*
|
|
which is the same as
|
|
@*
|
|
@code{--enable-newlib-iconv-from-encodings=KOI8-R,UTF-8
|
|
--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5,UTF-8}
|
|
@*
|
|
The user may also just use the
|
|
@*
|
|
@code{--enable-newlib-iconv-encodings=KOI8-R,ISO-8859-5,UTF-8,UCS-2}
|
|
@*
|
|
configure script option, but this is less optimal since some unneeded
data and code will be linked.
|
|
|
|
@*
|
|
The @option{--enable-newlib-iconv-external-ccs} option enables iconv's
|
|
capabilities to work with the external CCS files.
|
|
|
|
@*
|
|
The @option{--enable-target-optspace} Newlib configure script option also affects
|
|
the iconv library. If this option is present, the library uses the size-optimized
CCS tables. This means that only the size-optimized CCS
tables will be linked or, if the
@option{--enable-newlib-iconv-external-ccs} configure script option was used,
the iconv library will load the size-optimized tables. If the
@option{--enable-target-optspace} configure script option is disabled,
the speed-optimized CCS tables are used.
|
|
|
|
@*
|
|
Note: iconv_open searches for .cct files in the $NLSPATH/iconv_data/ directory.
Thus, the NLSPATH environment variable should be set.
|
|
|
|
|
|
|
|
|
|
|
|
@page
|
|
@node Encoding names
|
|
@section Encoding names
|
|
@findex encoding name
|
|
@findex encoding alias
|
|
@findex normalized name
|
|
@*
|
|
Each encoding has one @dfn{name} and a number of @dfn{aliases}. When
a user works with the iconv library (i.e., when the @code{iconv_open} call
is used), either the name or any of the aliases may be used. The same applies when encoding
names are used in configure script options.
|
|
|
|
@*
|
|
Names and aliases may be specified in any case (small or capital
|
|
letters) and the @kbd{-} symbol is equivalent to the @kbd{_} symbol.
|
|
|
|
@*
|
|
Internally the Newlib iconv library always converts aliases to names. It
|
|
also converts names and aliases to the @dfn{normalized} form, which means
|
|
that all capital letters are converted to small letters and the @kbd{-}
|
|
symbols are converted to @kbd{_} symbols.
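@*
The normalization rule described above can be sketched in a few lines of C. The function name @code{normalize_encoding_name} is a hypothetical one chosen for this illustration; it shows only the case and separator folding, not the alias-to-name lookup.

```c
#include <ctype.h>
#include <stddef.h>

/* Write the normalized form of `name` to `out`: capital letters become
   small letters and '-' becomes '_'.
   Returns 0 on success, or -1 if `out` (out_size bytes) is too small. */
int normalize_encoding_name(const char *name, char *out, size_t out_size)
{
    size_t i;

    for (i = 0; name[i] != '\0'; i++) {
        if (i + 1 >= out_size)               /* keep room for the terminator */
            return -1;
        if (name[i] == '-')
            out[i] = '_';
        else
            out[i] = (char)tolower((unsigned char)name[i]);
    }
    if (out_size == 0)
        return -1;
    out[i] = '\0';
    return 0;
}
```

Under this rule "ISO-8859-5", "iso_8859-5" and "Iso_8859_5" all normalize to the same string, so comparing normalized forms is enough to match a name regardless of how the user spelled it.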
|
|
|
|
|
|
|
|
|
|
@page
|
|
@node CCS tables
|
|
@section CCS tables
|
|
@findex Size-optimized CCS table
|
|
@findex Speed-optimized CCS table
|
|
@findex mktbl.pl Perl script
|
|
@findex .cct files
|
|
@findex The CCT tables source files
|
|
@findex CCS source files
|
|
@*
|
|
The iconv library stores files with CCS tables in the @emph{ccs/}
subdirectory. The CCS tables for any CCS may be kept in two forms: in binary form
(@dfn{.cct files}, see the @emph{ccs/binary/} subdirectory) and in the form
of compilable .c source files. The .cct files are only used when the
|
|
@option{--enable-newlib-iconv-external-ccs} configure script option is enabled.
|
|
The .c files are linked to the Newlib library if the corresponding
|
|
encoding is enabled.
|
|
|
|
@*
|
|
As stated earlier, the Newlib iconv library performs all
|
|
conversions through the 32-bit UCS, but the codes which are used
in most CCS-es fit into the first 16-bit subset of the 32-bit UCS set.
|
|
Thus, in order to make the CCS tables more compact, the 16-bit UCS-2 is
|
|
used instead of the 32-bit UCS-4.
|
|
|
|
@*
|
|
CCS tables may be 8- or 16-bit wide. 8-bit CCS tables map 8-bit CCS to
|
|
16-bit UCS-2 and vice versa while 16-bit CCS tables map
|
|
16-bit CCS to 16-bit UCS-2 and vice versa.
|
|
8-bit tables are small while 16-bit tables may be quite big.
Because of this, 16-bit CCS tables may be
either speed- or size-optimized. Size-optimized CCS tables are
smaller than speed-optimized ones, but the conversion process is
slower if the size-optimized CCS tables are used. 8-bit CCS tables have only
the speed-optimized variant.
|
|
|
|
Each CCS table (both speed- and size-optimized) consists of
|
|
@dfn{from_ucs} and @dfn{to_ucs} subtables. The "from_ucs" subtable maps
UCS-2 codes to CCS codes, while the "to_ucs" subtable maps CCS codes to
UCS-2 codes.
|
|
|
|
@*
|
|
Almost all 16-bit CCS tables contain fewer than 0xFFFF codes and
a lot of gaps exist.
|
|
|
|
@subsection Speed-optimized tables format
|
|
@*
|
|
In the case of 8-bit speed-optimized CCS tables the "to_ucs" subtable format is
trivial - it is just an array of 256 16-bit UCS codes. Therefore, the
UCS-2 code @emph{Y} corresponding to a CCS code @emph{X} is calculated
as @emph{Y = to_ucs[X]}.
|
|
|
|
@*
|
|
Obviously, the simplest way to create the "from_ucs" table or the
16-bit "to_ucs" table is to use a huge 16-bit array, as in the case
of the 8-bit "to_ucs" table. But almost all the 16-bit CCS tables contain
fewer than 0xFFFF code mappings, and this fact may be exploited to reduce
the size of the CCS tables.
|
|
|
|
@*
|
|
This section describes the "UCS-2 -> CCS" 8-bit CCS table format. The
16-bit "CCS -> UCS-2" CCS table format is the same, except for the mapping
direction and the number of CCS bits.
|
|
|
|
@*
|
|
In the case of the 8-bit speed-optimized table, the "from_ucs" subtable
corresponds to the "from_ucs" array and has the following layout:
|
|
|
|
@*
|
|
from_ucs array:
|
|
@*
|
|
-------------------------------------
|
|
@*
|
|
0xFF mapping (2 bytes) (only for
|
|
8-bit table).
|
|
@*
|
|
-------------------------------------
|
|
@*
|
|
Heading block
|
|
@*
|
|
-------------------------------------
|
|
@*
|
|
Block 1
|
|
@*
|
|
-------------------------------------
|
|
@*
|
|
Block 2
|
|
@*
|
|
-------------------------------------
|
|
@*
|
|
...
|
|
@*
|
|
-------------------------------------
|
|
@*
|
|
Block N
|
|
@*
|
|
-------------------------------------
|
|
|
|
@*
|
|
The 0x0000-0xFFFF 16-bit code range is divided into 256 code subranges. Each
subrange is represented by a 256-element @dfn{block} (256 1-byte
elements, or 256 2-byte elements in the case of a 16-bit CCS table) whose
elements are the CCS codes of this subrange.
If the "UCS-2 -> CCS" mapping has big enough gaps, some blocks will be
absent and there will be fewer than 256 blocks.
|
|
|
|
@*
|
|
Any element number @emph{m} of @dfn{the heading block} (which contains
|
|
256 2-byte elements) corresponds to the @emph{m}-th 256-element subrange.
|
|
If the subrange contains some codes, the value of the @emph{m}-th element of
|
|
the heading block contains the offset of the corresponding block in the
"from_ucs" array. If there are no codes in the subrange, the heading
block element contains 0xFFFF.
|
|
|
|
@*
|
|
If there are some gaps in a block, the corresponding block elements have
the 0xFF value. If there is a 0xFF code present in the CCS, its mapping
is defined in the first 2-byte element of the "from_ucs" array.
|
|
|
|
@*
|
|
Having such a table format, the algorithm of searching the CCS code
|
|
@emph{X} which corresponds to the UCS-2 code @emph{Y} is as follows.
|
|
|
|
@*
|
|
@enumerate
|
|
@item If @emph{Y} is equal to the value of the first 2-byte element
of the "from_ucs" array, @emph{X} is 0xFF. Else, continue to search.
|
|
|
|
@item Calculate the block number: @emph{BlkN = (Y & 0xFF00) >> 8}.
|
|
|
|
@item If the heading block element with number @emph{BlkN} is 0xFFFF, there
is no corresponding CCS code (error, wrong input data). Else, fetch the
"from_ucs" array index @emph{BlkIdx} of the @emph{BlkN}-th block from that element.

@item Calculate the offset of the @emph{X} code in its block:
@emph{Xindex = Y & 0xFF}

@item If the value of the @emph{Xindex}-th element of the block (which is
@emph{from_ucs[BlkIdx+Xindex]}) is 0xFF, there is no corresponding
CCS code (error, wrong input data). Else, @emph{X = from_ucs[BlkIdx+Xindex]}.
|
|
@end enumerate
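@*
The lookup steps above can be sketched in C as follows. This is a simplified model, not the library's code: every element, including block entries, is stored as a 16-bit value; the heading block is assumed to start at index 1; offsets are array indices; and 0xFFFF in the first element is assumed to mean that the CCS has no 0xFF code. The names @code{from_ucs_lookup}, @code{demo_table} and @code{build_demo_table} are hypothetical.

```c
#include <stdint.h>

#define HEAD_OFF    1        /* assumed index of the heading block */
#define INVAL_BLOCK 0xFFFFu  /* heading value: no codes in this subrange */
#define INVAL_CODE  0xFFu    /* block value: gap */

/* Return the 8-bit CCS code X for UCS-2 code y, or -1 if unmapped. */
int from_ucs_lookup(const uint16_t *from_ucs, uint16_t y)
{
    unsigned blkn, xindex;
    uint16_t blk_idx, x;

    /* Step 1: the 0xFF code is kept in the first element. */
    if (from_ucs[0] != INVAL_BLOCK && y == from_ucs[0])
        return 0xFF;

    blkn = (y & 0xFF00u) >> 8;                  /* step 2: block number */
    blk_idx = from_ucs[HEAD_OFF + blkn];        /* step 3: block index */
    if (blk_idx == INVAL_BLOCK)
        return -1;

    xindex = y & 0xFFu;                         /* step 4: offset in block */
    x = from_ucs[blk_idx + xindex];             /* step 5: fetch the code */
    return x == INVAL_CODE ? -1 : (int)x;
}

/* Toy table with two blocks: ASCII identity (block 0x00) and the single
   mapping U+0430 -> 0xC1 (block 0x04), as in KOI8-R. */
uint16_t demo_table[1 + 256 + 2 * 256];

void build_demo_table(void)
{
    unsigned i;

    demo_table[0] = INVAL_BLOCK;                /* no 0xFF code in this toy CCS */
    for (i = 0; i < 256; i++)
        demo_table[HEAD_OFF + i] = INVAL_BLOCK;
    for (i = 0; i < 512; i++)
        demo_table[257 + i] = INVAL_CODE;

    demo_table[HEAD_OFF + 0x00] = 257;          /* first block follows the heading */
    demo_table[HEAD_OFF + 0x04] = 257 + 256;    /* second block */
    for (i = 0; i < 0x80; i++)
        demo_table[257 + i] = (uint16_t)i;
    demo_table[257 + 256 + 0x30] = 0xC1;
}
```

The lookup costs two array reads and no search, which is why this layout is the fast one; the price is a full 256-element block for every subrange that contains at least one code.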
|
|
|
|
@subsection Size-optimized tables format
|
|
@*
|
|
As stated above, size-optimized tables exist only for 16-bit CCS-es.
This is because the difference between the speed-optimized
and the size-optimized table sizes is too small in the case of 8-bit CCS-es.
|
|
|
|
@*
|
|
Formats of the "to_ucs" and "from_ucs" subtables are equivalent in case of
|
|
size-optimized tables.
|
|
|
|
This section describes the format of the "UCS-2 -> CCS" size-optimized
|
|
CCS table. The format of "CCS -> UCS-2" table is the same.
|
|
|
|
The idea of the size-optimized tables is to split the UCS-2 codes
("from" codes) into @dfn{ranges} (a @dfn{range} is a run of consecutive UCS-2 codes).
Then CCS codes ("to" codes) are stored only for the codes from these
ranges. Distinct "from" codes which belong to no range (@dfn{unranged codes}) are stored
together with the corresponding "to" codes.
|
|
|
|
@*
|
|
The following is the layout of the size-optimized table array:
|
|
|
|
@*
|
|
size_arr array:
|
|
@*
|
|
-------------------------------------
|
|
@*
|
|
Ranges number (2 bytes)
|
|
@*
|
|
-------------------------------------
|
|
@*
|
|
Unranged codes number (2 bytes)
|
|
@*
|
|
-------------------------------------
|
|
@*
|
|
Unranged codes array index (2 bytes)
|
|
@*
|
|
-------------------------------------
|
|
@*
|
|
Ranges indexes (triads)
|
|
@*
|
|
-------------------------------------
|
|
@*
|
|
Ranges
|
|
@*
|
|
-------------------------------------
|
|
@*
|
|
Unranged codes array
|
|
@*
|
|
-------------------------------------
|
|
|
|
@*
|
|
The @dfn{Ranges indexes} @emph{size_arr} section helps to find
the offset of the needed range in the @emph{size_arr} and has
the following format (triads):
@*
the first code in range, the last code in range, range offset.
|
|
|
|
@*
|
|
The array of these triads is sorted by the first element, therefore it is
|
|
possible to quickly find the needed range index.
|
|
|
|
@*
|
|
Each range has the corresponding sub-array containing the "to" codes. These
|
|
sub-arrays are stored in the place marked as "Ranges" in the layout
|
|
diagram.
|
|
|
|
@*
|
|
The "Unranged codes array" contains pairs ("from" code, "to" code) for
|
|
each unranged code. The array of these pairs is sorted by "from" code
|
|
values, therefore it is possible to find the needed pair quickly.
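@*
The layout described above can be illustrated with a small packing routine. This is a sketch under assumptions that the text does not confirm: the array is modelled as a Python list of 16-bit values, every offset is an index into that list, and the function name @code{pack_size_table} is hypothetical.

```python
def pack_size_table(ranges, unranged):
    """Pack ranges and unranged codes into one flat list of 16-bit values.

    ranges   - list of (first "from" code, list of "to" codes); absent
               codes inside a range are given as 0xFFFF
    unranged - list of ("from" code, "to" code) pairs
    """
    ranges = sorted(ranges)        # triads must be sorted by first code
    unranged = sorted(unranged)    # pairs must be sorted by "from" code

    header_len = 3                 # ranges number, unranged number, index
    offset = header_len + 3 * len(ranges)

    triads = []
    range_data = []
    for first, to_codes in ranges:
        last = first + len(to_codes) - 1
        triads += [first, last, offset]
        range_data += to_codes
        offset += len(to_codes)

    pairs = []
    for from_code, to_code in unranged:
        pairs += [from_code, to_code]

    # After the loop, 'offset' points just past the last range, which is
    # where the unranged codes array starts.
    return [len(ranges), len(unranged), offset] + triads + range_data + pairs
```

For one range starting at 0x00A0 with four entries and two unranged codes, the result has 3 header elements, one triad, 4 range elements and 2 pairs, so the unranged codes array starts at index 10.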
|
|
|
|
@*
|
|
Note that each range requires 6 bytes to form its index. If, for
|
|
example, there are two ranges (1 - 5 and 9 - 10), and one unranged code
|
|
(7), 12 bytes are needed for two range indexes and 4 bytes for the unranged
|
|
code (total 16). But it is better to join both ranges as 1 - 10 and
|
|
mark codes 6 and 8 as absent. In this case, only 6 additional bytes for the
|
|
range index and 4 bytes to mark codes 6 and 8 as absent are needed
|
|
(total 10 bytes). This optimization is done in the size-optimized tables.
|
|
Thus, ranges may contain small gaps. The absent codes in ranges are marked
|
|
as 0xFFFF.
|
|
|
|
@*
|
|
Note that a pair of consecutive "from" codes is stored by means of unranged
codes, since the range form needs more 2-byte elements than two unranged
codes do: 5 (a 3-element range index plus two "to" codes) against 4 (two
"from"/"to" pairs), that is, 10 bytes against 8.
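@*
The size trade-offs discussed above can be checked with a few lines of arithmetic. The sketch below counts every byte taken by the range indexes, the range contents and the unranged pairs (the fixed 6-byte header is left out), so its totals are larger than the figures quoted in the text, which leave out the "to" codes stored inside the ranges. The function name @code{table_bytes} is made up for illustration.

```python
INDEX_BYTES = 6   # one triad: first code, last code, range offset
CODE_BYTES = 2    # one 16-bit "to" code or one 0xFFFF absent marker
PAIR_BYTES = 4    # one unranged ("from" code, "to" code) pair


def table_bytes(ranges, unranged_count):
    """Bytes used by the given (first, last) ranges plus unranged pairs."""
    total = unranged_count * PAIR_BYTES
    for first, last in ranges:
        total += INDEX_BYTES + (last - first + 1) * CODE_BYTES
    return total


# Two ranges (1 - 5 and 9 - 10) plus the unranged code 7 ...
separate = table_bytes([(1, 5), (9, 10)], 1)
# ... against one joined range 1 - 10 with codes 6 and 8 marked absent.
joined = table_bytes([(1, 10)], 0)

# A pair of consecutive codes as a range, against two unranged codes.
pair_as_range = table_bytes([(20, 21)], 0)
pair_as_unranged = table_bytes([], 2)
```

With these definitions the separate layout takes 30 bytes and the joined one 26, and a two-code range takes 10 bytes against 8 for two unranged pairs, so both comparisons come out the same way as in the text.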
|
|
|
|
@*
|
|
The algorithm for finding the CCS code
|
|
@emph{X} which corresponds to the UCS-2 code @emph{Y} (input) in the "UCS-2 ->
|
|
CCS" size-optimized table is as follows.
|
|
|
|
@*
|
|
@enumerate
|
|
@item Try to find the corresponding triad in the "Ranges indexes"
section. Since the triads are sorted, a binary search can be used
(halve the interval, compare, and so on).
|
|
|
|
@item If the triad is found, fetch the @emph{X} code from the corresponding
|
|
range array. If it is 0xFFFF, return an error.
|
|
|
|
@item If there is no corresponding triad, search the @emph{X} code among the
|
|
sorted unranged codes. Return an error if nothing was found.
|
|
@end enumerate
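@*
The search steps above can be sketched as two binary searches over a flat array. As before, this is only an illustration under assumptions: the array is a Python list of 16-bit values laid out as header, triads, ranges and unranged pairs, all offsets are list indexes, and @code{size_lookup} is a hypothetical name.

```python
INVALC = 0xFFFF   # marks an absent code inside a range


def size_lookup(arr, y):
    """Return the "to" code for "from" code y, or None if there is none."""
    n_ranges, n_unranged, unranged_idx = arr[0], arr[1], arr[2]

    # Step 1: binary search over the sorted (first, last, offset) triads.
    lo, hi = 0, n_ranges - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        first, last, offset = arr[3 + 3 * mid: 6 + 3 * mid]
        if y < first:
            hi = mid - 1
        elif y > last:
            lo = mid + 1
        else:
            # Step 2: the triad was found, fetch the code from its range.
            x = arr[offset + (y - first)]
            return None if x == INVALC else x

    # Step 3: no triad covers y, search the sorted unranged pairs.
    lo, hi = 0, n_unranged - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        from_code = arr[unranged_idx + 2 * mid]
        if y < from_code:
            hi = mid - 1
        elif y > from_code:
            lo = mid + 1
        else:
            return arr[unranged_idx + 2 * mid + 1]
    return None
```

Both searches are logarithmic in the number of ranges and unranged codes, which is slower than the two reads of the speed-optimized format but needs no 256-entry blocks.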
|
|
|
|
@subsection .cct and .c CCS Table files
|
|
@*
|
|
The .c source files for 8-bit CCS tables have "to_ucs" and "from_ucs"
|
|
speed-optimized tables. The .c source files for 16-bit CCS tables have
|
|
"to_ucs_speed", "to_ucs_size", "from_ucs_speed" and "from_ucs_size"
|
|
tables.
|
|
|
|
@*
|
|
When .c files are compiled and used, all the 16-bit and 32-bit values
|
|
have the native endian format (Big Endian for the BE systems and Little
|
|
Endian for the LE systems) since they are compiled for the system before
|
|
they are used.
|
|
|
|
@*
|
|
In case of .cct files, which are intended for dynamic CCS tables
|
|
loading, the CCS tables are stored either in LE or BE format. Since the
|
|
.cct files are generated by the 'mktbl.pl' Perl script, it is possible
|
|
to choose the endianness of the tables. It is also possible to store two
|
|
copies (both LE and BE) of the CCS tables in one .cct file. The default
|
|
.cct files (which come with the Newlib sources) have both LE and BE CCS
|
|
tables. The Newlib iconv library automatically chooses the needed CCS tables
|
|
(with appropriate endianness).
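@*
The effect of storing both copies can be illustrated as follows. The sketch does not reproduce the real .cct file layout: it only packs the same 16-bit values in both byte orders and then picks the copy that matches the host. The names @code{pack_table} and @code{unpack_native} are hypothetical.

```python
import struct
import sys


def pack_table(values, byteorder):
    """Pack 16-bit values as little-endian or big-endian bytes."""
    prefix = "<" if byteorder == "little" else ">"
    return struct.pack(prefix + str(len(values)) + "H", *values)


def unpack_native(le_bytes, be_bytes):
    """Pick the copy matching the host byte order and decode it."""
    chosen = le_bytes if sys.byteorder == "little" else be_bytes
    count = len(chosen) // 2
    # "=" means native byte order with standard sizes.
    return list(struct.unpack("=" + str(count) + "H", chosen))
```

Because the matching copy is chosen once, when the table is loaded, no byte swapping is needed during the conversion itself.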
|
|
|
|
@*
|
|
Note that the .cct files are only used when the
@option{--enable-newlib-iconv-external-ccs} configure option is used.
|
|
|
|
@subsection The 'mktbl.pl' Perl script
|
|
@*
|
|
The 'mktbl.pl' script is intended to generate .cct and .c CCS table
|
|
files from the @dfn{CCS source files}.
|
|
|
|
@*
|
|
The CCS source files are just text files which have one or more columns
with the CCS <-> UCS-2 code mapping. For an example of a CCS source
file, see one of the files at the URLs given below.
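@*
Reading such a column-based file can be sketched as below. The sketch assumes the usual layout of the Unicode.org mapping files (hexadecimal codes in whitespace-separated columns, with comments starting at a hash sign). It is not the parser used by 'mktbl.pl', and the function and parameter names are made up.

```python
def parse_ccs_source(text, ccs_col=1, ucs_col=2):
    """Return (ccs, ucs) pairs read from the chosen 1-based columns."""
    pairs = []
    for line in text.splitlines():
        # Drop the comment part and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        cols = line.split()
        if len(cols) < max(ccs_col, ucs_col):
            continue   # the line has no mapping in the requested columns
        ccs = int(cols[ccs_col - 1], 16)
        ucs = int(cols[ucs_col - 1], 16)
        pairs.append((ccs, ucs))
    return pairs
```

Files with more than two code columns can be handled by choosing which column holds the CCS code and which holds the UCS-2 code, which is why the two column numbers are parameters here.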
|
|
|
|
@*
|
|
The following table describes where the source files for CCS table files
|
|
provided by the Newlib distribution are located.
|
|
|
|
@multitable @columnfractions .25 .75
|
|
@item
|
|
Name
|
|
@tab
|
|
URL
|
|
|
|
@item
|
|
@tab
|
|
|
|
@item
|
|
big5
|
|
@tab
|
|
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
|
|
|
|
@item
|
|
cns11643_plane1
|
|
cns11643_plane14
|
|
cns11643_plane2
|
|
@tab
|
|
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT
|
|
|
|
@item
|
|
cp775
|
|
cp850
|
|
cp852
|
|
cp855
|
|
cp866
|
|
@tab
|
|
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
|
|
|
|
@item
|
|
iso_8859_1
|
|
iso_8859_2
|
|
iso_8859_3
|
|
iso_8859_4
|
|
iso_8859_5
|
|
iso_8859_6
|
|
iso_8859_7
|
|
iso_8859_8
|
|
iso_8859_9
|
|
iso_8859_10
|
|
iso_8859_11
|
|
iso_8859_13
|
|
iso_8859_14
|
|
iso_8859_15
|
|
@tab
|
|
http://www.unicode.org/Public/MAPPINGS/ISO8859/
|
|
|
|
@item
|
|
iso_ir_111
|
|
@tab
|
|
http://crl.nmsu.edu/~mleisher/csets/ISOIR111.TXT
|
|
|
|
@item
|
|
jis_x0201_1976
|
|
jis_x0208_1990
|
|
jis_x0212_1990
|
|
@tab
|
|
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
|
|
|
|
@item
|
|
koi8_r
|
|
@tab
|
|
http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT
|
|
|
|
@item
|
|
koi8_ru
|
|
@tab
|
|
http://crl.nmsu.edu/~mleisher/csets/KOI8RU.TXT
|
|
|
|
@item
|
|
koi8_u
|
|
@tab
|
|
http://crl.nmsu.edu/~mleisher/csets/KOI8U.TXT
|
|
|
|
@item
|
|
koi8_uni
|
|
@tab
|
|
http://crl.nmsu.edu/~mleisher/csets/KOI8UNI.TXT
|
|
|
|
@item
|
|
ksx1001
|
|
@tab
|
|
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
|
|
|
|
@item
|
|
win_1250
|
|
win_1251
|
|
win_1252
|
|
win_1253
|
|
win_1254
|
|
win_1255
|
|
win_1256
|
|
win_1257
|
|
win_1258
|
|
@tab
|
|
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
|
|
@end multitable
|
|
|
|
The CCS source files aren't distributed with Newlib because of license
restrictions on most of the Unicode.org files.
|
|
|
|
The following are the 'mktbl.pl' options which were used to generate the .cct
files. Note that to generate the CCS table source (.c) files, the @option{-s} option
should be added.
|
|
|
|
@enumerate
|
|
@item For the iso_8859_10.cct, iso_8859_13.cct, iso_8859_14.cct, iso_8859_15.cct,
|
|
iso_8859_1.cct, iso_8859_2.cct, iso_8859_3.cct, iso_8859_4.cct,
|
|
iso_8859_5.cct, iso_8859_6.cct, iso_8859_7.cct, iso_8859_8.cct,
|
|
iso_8859_9.cct, iso_8859_11.cct, win_1250.cct, win_1252.cct, win_1254.cct,
|
|
win_1256.cct, win_1258.cct, win_1251.cct,
|
|
win_1253.cct, win_1255.cct, win_1257.cct,
|
|
koi8_r.cct, koi8_ru.cct, koi8_u.cct, koi8_uni.cct, iso_ir_111.cct,
|
|
big5.cct, cp775.cct, cp850.cct, cp852.cct, cp855.cct, cp866.cct, cns11643.cct
|
|
files, only the @option{-i <SRC_FILE_NAME>} option was used.
|
|
|
|
@item To generate the jis_x0208_1990.cct file, the
|
|
@option{-i jis_x0208_1990.txt -x 2 -y 3} options were used.
|
|
|
|
@item To generate the cns11643_plane1.cct file, the
|
|
@option{-i cns11643.txt -p1 -N cns11643_plane1 -o cns11643_plane1.cct}
|
|
options were used.
|
|
|
|
@item To generate the cns11643_plane2.cct file, the
|
|
@option{-i cns11643.txt -p2 -N cns11643_plane2 -o cns11643_plane2.cct}
|
|
options were used.
|
|
|
|
@item To generate the cns11643_plane14.cct file, the
|
|
@option{-i cns11643.txt -p0xE -N cns11643_plane14 -o cns11643_plane14.cct}
|
|
options were used.
|
|
@end enumerate
|
|
|
|
@*
|
|
For more info about the 'mktbl.pl' options, see the 'mktbl.pl -h' output.
|
|
|
|
@*
|
|
It is assumed that CCS codes are at most 16 bits wide. If there are wider CCS codes
in the CCS source file, the bits above the low 16 define the plane (see the
cns11643.txt CCS source file).
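@*
Splitting a wide code into its plane and its 16-bit part is a shift and a mask. The helper names below are hypothetical and only illustrate the rule stated above.

```python
def split_plane(code):
    """Return (plane, 16-bit code) for a CCS code wider than 16 bits."""
    return code >> 16, code & 0xFFFF


def select_plane(pairs, plane):
    """Keep the (ccs, ucs) pairs of one plane, stripping the plane bits."""
    return [(ccs & 0xFFFF, ucs) for ccs, ucs in pairs if ccs >> 16 == plane]
```

With this rule a source code such as 0xE2121 belongs to plane 0xE (14) and has the 16-bit code 0x2121.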
|
|
|
|
@*
|
|
Sometimes, it is impossible to map some CCS codes to the 16-bit UCS if, for example,
|
|
several different CCS codes are mapped to one UCS-2 code or one CCS code is mapped to
|
|
the pair of UCS-2 codes. In these cases, such CCS codes (@dfn{lost
|
|
codes}) aren't rejected; instead, they are mapped to the default
|
|
UCS-2 code (which is currently the @kbd{?} character's code).
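@*
The handling of lost codes can be sketched as follows. The text does not say which of several CCS codes sharing one UCS-2 code keeps the mapping, so the sketch assumes that the first one in file order wins. This rule, like the name @code{resolve_lost_codes}, is an assumption made for illustration only.

```python
DEFAULT_UCS = ord("?")   # the default UCS-2 code named in the text


def resolve_lost_codes(entries):
    """Map lost CCS codes to the default UCS-2 code.

    entries - list of (ccs, list of UCS-2 codes) in source-file order
    Returns (list of (ccs, ucs) pairs, list of lost CCS codes).
    """
    seen_ucs = set()
    to_ucs = []
    lost = []
    for ccs, ucs_list in entries:
        is_sequence = len(ucs_list) != 1
        is_duplicate = not is_sequence and ucs_list[0] in seen_ucs
        if is_sequence or is_duplicate:
            # No unique single UCS-2 code exists: use the default code.
            to_ucs.append((ccs, DEFAULT_UCS))
            lost.append(ccs)
        else:
            seen_ucs.add(ucs_list[0])
            to_ucs.append((ccs, ucs_list[0]))
    return to_ucs, lost
```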
|
|
|
|
|
|
|
|
|
|
|
|
@page
|
|
@node CES converters
|
|
@section CES converters
|
|
@findex PCS
|
|
@*
|
|
Similar to the CCS tables, CES converters are also split into "from UCS"
|
|
and "to UCS" parts. Depending on the iconv library configuration, these
|
|
parts are enabled or disabled.
|
|
|
|
@*
|
|
The following is the list of CES converters which are currently present
|
|
in the Newlib iconv library.
|
|
|
|
@itemize @bullet
|
|
@item
|
|
@emph{euc} - supports the @emph{euc_jp}, @emph{euc_kr} and @emph{euc_tw}
|
|
encodings. The @emph{euc} CES converter uses the @emph{table} and the
|
|
@emph{us_ascii} CES converters.
|
|
|
|
@item
|
|
@emph{table} - this CES converter corresponds to "null" and just performs
|
|
tables-based conversion using 8- and 16-bit CCS tables. This converter
|
|
is also used by any other CES converter which needs the CCS table-based
|
|
conversions. The @emph{table} converter is also responsible for loading
.cct files.
|
|
|
|
@item
|
|
@emph{table_pcs} - this is the wrapper over the @emph{table} converter
|
|
which is intended for 16-bit encodings which also use the @dfn{Portable
|
|
Character Set} (@dfn{PCS}) which is the same as the @emph{US-ASCII}.
|
|
This means that if the first byte of the CCS code is in the range [0x00-0x7f],
|
|
this is the 7-bit PCS code. Else, this is the 16-bit CCS code. Of course,
|
|
the 16-bit codes must not contain bytes in the range of [0x00-0x7f].
|
|
The @emph{big5} encoding uses the @emph{table_pcs} CES converter and the
|
|
@emph{table_pcs} CES converter depends on the @emph{table} CES converter.
|
|
|
|
@item
|
|
@emph{ucs_2} - intended for the @emph{ucs_2}, @emph{ucs_2be} and
|
|
@emph{ucs_2le} encodings support.
|
|
|
|
@item
|
|
@emph{ucs_4} - intended for the @emph{ucs_4}, @emph{ucs_4be} and
|
|
@emph{ucs_4le} encodings support.
|
|
|
|
@item
|
|
@emph{ucs_2_internal} - intended for the @emph{ucs_2_internal} encoding support.
|
|
|
|
@item
|
|
@emph{ucs_4_internal} - intended for the @emph{ucs_4_internal} encoding support.
|
|
|
|
@item
|
|
@emph{us_ascii} - intended for the @emph{us_ascii} encoding support. In
|
|
principle, the most natural way to support the @emph{us_ascii} encoding
|
|
is to define the @emph{us_ascii} CCS and use the @emph{table} CES
|
|
converter. But for the optimization purposes, the specialized
|
|
@emph{us_ascii} CES converter was created.
|
|
|
|
@item
|
|
@emph{utf_16} - intended for the @emph{utf_16}, @emph{utf_16be} and
|
|
@emph{utf_16le} encodings support.
|
|
|
|
@item
|
|
@emph{utf_8} - intended for the @emph{utf_8} encoding support.
|
|
@end itemize
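@*
The first-byte rule described for the @emph{table_pcs} converter in the list above can be sketched as a small tokenizer. It only splits the input into PCS and 16-bit CCS codes; the table lookup that would follow is left out, and the function name is hypothetical.

```python
def split_table_pcs(data):
    """Split bytes into ("pcs", code) and ("ccs", code) items."""
    codes = []
    i = 0
    while i < len(data):
        first = data[i]
        if first <= 0x7F:
            # A byte in [0x00-0x7f] is a 7-bit PCS (US-ASCII) code.
            codes.append(("pcs", first))
            i += 1
        else:
            # Anything else starts a 16-bit CCS code.
            if i + 1 >= len(data):
                raise ValueError("truncated 16-bit CCS code")
            codes.append(("ccs", (first << 8) | data[i + 1]))
            i += 2
    return codes
```

A PCS code would then be passed through unchanged, while a 16-bit code would go to the table-based conversion.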
|
|
|
|
|
|
|
|
|
|
|
|
@page
|
|
@node The encodings description file
|
|
@section The encodings description file
|
|
@findex encoding.deps description file
|
|
@findex mkdeps.pl Perl script
|
|
@*
|
|
To simplify the process of adding support for new encodings, a lot of
"glue" files are generated automatically.
|
|
|
|
@*
|
|
There is the 'encoding.deps' file in the @emph{lib/} subdirectory which
|
|
is used to describe encoding's properties. The 'mkdeps.pl' Perl script
|
|
uses 'encoding.deps' to generate the "glue" files.
|
|
|
|
@*
|
|
The 'encoding.deps' file is composed of sections, each section consists
|
|
of entries, each entry contains some encoding/CES/CCS description.
|
|
|
|
@*
|
|
The 'encoding.deps' file's syntax is very simple. Currently only two
|
|
sections are defined: @emph{ENCODINGS} and @emph{CES_DEPENDENCIES}.
|
|
|
|
@*
|
|
Each @emph{ENCODINGS} section's entry describes one encoding and
|
|
contains the following information.
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Encoding name (the @emph{ENCODING} field). The name should
|
|
be unique and only one name is possible.
|
|
|
|
@item
|
|
The encoding's CES converter name (the @emph{CES} field). Only one CES
|
|
converter is allowed.
|
|
|
|
@item
|
|
The whitespace-separated list of CCS table names which are used by the
|
|
encoding (the @emph{CCS} field).
|
|
|
|
@item
|
|
The whitespace-separated list of alias names (the @emph{ALIASES}
field).
|
|
@end itemize
|
|
|
|
@*
|
|
Note that all names in the 'encoding.deps' file must be in normalized
form.
|
|
|
|
@*
|
|
Each @emph{CES_DEPENDENCIES} section's entry describes dependencies of
one CES converter. For example, the @emph{euc} CES converter depends on
the @emph{table} and the @emph{us_ascii} CES converters since the
@emph{euc} CES converter uses them. This means that both the @emph{table}
and @emph{us_ascii} CES converters should be linked if the @emph{euc}
CES converter is enabled.
|
|
|
|
@*
|
|
The @emph{CES_DEPENDENCIES} section defines the following:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
the CES converter name for which the dependencies are defined in this
|
|
entry (the @emph{CES} field);
|
|
|
|
@item
|
|
the whitespace-separated list of CES converters which are needed for
|
|
this CES converter (the @emph{USED_CES} field).
|
|
@end itemize
|
|
|
|
@*
|
|
The 'mkdeps.pl' Perl script automatically solves the following tasks.
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The user works with the iconv library in terms of encodings and need not know
|
|
anything about CES converters and CCS tables. The script automatically
|
|
generates code which enables all needed CES converters and CCS tables
|
|
for all encodings, which were enabled by the user.
|
|
|
|
@item
|
|
The CES converters may have dependencies and the script automatically
|
|
generates the code which handles these dependencies.
|
|
|
|
@item
|
|
The list of encoding's aliases is also automatically generated.
|
|
|
|
@item
|
|
The script uses a lot of macros in order to enable only the minimum set
|
|
of code/data which is needed to support the requested encodings in the
|
|
requested directions.
|
|
@end itemize
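@*
The dependency handling listed above can be sketched as follows. The real script emits C preprocessor macros, whereas this sketch only computes which CES converters and CCS tables a set of enabled encodings needs. The entries are illustrative, not a copy of 'encoding.deps': the CES dependencies follow the text, while the CCS lists and aliases are assumptions.

```python
# (encoding, CES converter, CCS tables, aliases) - illustrative entries
ENCODINGS = [
    ("big5", "table_pcs", ["big5"], ["bigfive"]),
    ("euc_jp", "euc", ["jis_x0208_1990", "jis_x0212_1990"], ["eucjp"]),
    ("utf_8", "utf_8", [], ["utf8"]),
]

# (CES converter, CES converters it uses)
CES_DEPENDENCIES = [
    ("euc", ["table", "us_ascii"]),
    ("table_pcs", ["table"]),
]


def resolve_alias(name):
    """Return the encoding name for a name or alias, or None."""
    for encoding, _ces, _ccs, aliases in ENCODINGS:
        if name == encoding or name in aliases:
            return encoding
    return None


def required_parts(enabled):
    """Return (sorted CES converters, sorted CCS tables) to link in."""
    deps = dict(CES_DEPENDENCIES)
    ces_needed = set()
    ccs_needed = set()
    for encoding, ces, ccs_tables, _aliases in ENCODINGS:
        if encoding not in enabled:
            continue
        ccs_needed.update(ccs_tables)
        # Follow the CES dependencies transitively.
        stack = [ces]
        while stack:
            current = stack.pop()
            if current not in ces_needed:
                ces_needed.add(current)
                stack.extend(deps.get(current, []))
    return sorted(ces_needed), sorted(ccs_needed)
```

Enabling only @emph{big5} in this model pulls in the @emph{table_pcs} and @emph{table} converters and the @emph{big5} CCS table, and nothing else.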
|
|
|
|
@*
|
|
The 'mkdeps.pl' Perl script interprets the 'encoding.deps'
file and generates the following files.
|
|
|
|
@itemize @bullet
|
|
@item
|
|
@emph{lib/encnames.h} - this header file contains macro definitions for all
encoding names.
|
|
|
|
@item
|
|
@emph{lib/aliasesbi.c} - the array of encoding names and aliases. The array
|
|
is used to find the name of the requested encoding by its alias.
|
|
|
|
@item
|
|
@emph{ces/cesbi.c} - this file defines two arrays
(@code{_iconv_from_ucs_ces} and @code{_iconv_to_ucs_ces}) which contain
descriptions of the enabled "from UCS" and "to UCS" CES converters and the
names of encodings which are supported by these CES converters.
|
|
|
|
@item
|
|
@emph{ces/cesbi.h} - this file contains the macros which determine
which CES converters should be enabled, given only the set of
enabled encodings (through macros defined in the
@emph{newlib.h} file). Note that one CES converter may handle several
encodings.
|
|
|
|
@item
|
|
@emph{ces/cesdeps.h} - the CES converters dependencies are handled in
|
|
this file.
|
|
|
|
@item
|
|
@emph{ccs/ccsdeps.h} - the array of linked-in CCS tables is defined
|
|
here.
|
|
|
|
@item
|
|
@emph{ccs/ccsnames.h} - this header file contains macro definitions for all
|
|
CCS names.
|
|
|
|
@item
|
|
@emph{encoding.aliases} - the list of supported encodings and their
|
|
aliases which is intended for the Newlib configure scripts in order to
|
|
handle the iconv-related configure script options.
|
|
@end itemize
|
|
|
|
|
|
|
|
|
|
|
|
@page
|
|
@node How to add new encoding
|
|
@section How to add new encoding
|
|
@*
|
|
First, the new encoding should be broken down into its CCS and CES. Then,
adding the new encoding consists of the following steps.
|
|
|
|
@enumerate
|
|
@item Generate the .cct CCS file and the .c source file for the new
encoding's CCS (if it isn't already present). To do this, obtain the CCS
source file and run the 'mktbl.pl' script on it.
|
|
|
|
@item Write the corresponding CES converter (if it isn't already
|
|
present). Use the existing CES converters as an example.
|
|
|
|
@item
|
|
Add the corresponding entries to the 'encoding.deps' file and regenerate
|
|
the autogenerated "glue" files using the 'mkdeps.pl' script.
|
|
|
|
@item
|
|
Don't forget to add entries to the newlib/newlib.hin file.
|
|
|
|
@item
|
|
Of course, the 'Makefile.am'-s should also be updated (if new files were
|
|
added) and the 'Makefile.in'-s should be regenerated using the correct
|
|
version of 'automake'.
|
|
|
|
@item
|
|
Don't forget to update the documentation (the list of
|
|
supported encodings and CES converters).
|
|
@end enumerate
|
|
|
|
If a new encoding doesn't fit the CES/CCS decomposition model, or
if specialized (non UCS-based) conversion support is desired,
the Newlib iconv library code has to be extended.
|
|
|
|
|
|
|
|
|
|
|
|
@page
|
|
@node The locale support interfaces
|
|
@section The locale support interfaces
|
|
@*
|
|
The newlib iconv library also has some interface functions (besides the
|
|
@code{iconv}, @code{iconv_open} and @code{iconv_close} interfaces) which
|
|
are intended for the Locale subsystem. All the locale-related code is
|
|
placed in the @emph{lib/iconvnls.c} file.
|
|
|
|
@*
|
|
The following is the description of the locale-related interfaces:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
@code{_iconv_nls_open} - opens two iconv descriptors for "CCS ->
|
|
wchar_t" and "wchar_t -> CCS" conversions. The normalized CCS name is
|
|
passed in the function parameters. The encoding of @emph{wchar_t} characters is
either ucs_2_internal or ucs_4_internal depending on the size of
@emph{wchar_t}.
|
|
|
|
@item
|
|
@code{_iconv_nls_conv} - the function is similar to the @code{iconv}
function, but if there is no character in the output encoding which
corresponds to the character in the input encoding, the default
conversion isn't performed (the @code{iconv} function sets such output
characters to the @kbd{?} symbol, which is the behavior
specified in SUSv3).
|
|
|
|
@item
|
|
@code{_iconv_nls_get_state} - returns the current encoding's shift state
|
|
(the @code{mbstate_t} object).
|
|
|
|
@item
|
|
@code{_iconv_nls_set_state} - sets the current encoding's shift state (the
|
|
@code{mbstate_t} object).
|
|
|
|
@item
|
|
@code{_iconv_nls_is_stateful} - checks whether the encoding is stateful
|
|
or stateless.
|
|
|
|
@item
|
|
@code{_iconv_nls_get_mb_cur_max} - returns the maximum length (the
maximum number of bytes) of the encoding's characters.
|
|
@end itemize
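@*
The choice of the @emph{wchar_t} encoding described for @code{_iconv_nls_open} in the list above depends only on the size of @emph{wchar_t}. The sketch below shows that choice in isolation. It does not call any Newlib interface, the function name is hypothetical, and the host size is taken from Python's @code{ctypes} module.

```python
import ctypes


def wchar_encoding_name(wchar_size=None):
    """Return the internal UCS encoding name for a wchar_t size in bytes."""
    if wchar_size is None:
        # Size of wchar_t on the host running this sketch.
        wchar_size = ctypes.sizeof(ctypes.c_wchar)
    if wchar_size == 2:
        return "ucs_2_internal"
    if wchar_size == 4:
        return "ucs_4_internal"
    raise ValueError("unsupported wchar_t size")
```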
|
|
|
|
|
|
|
|
|
|
@page
|
|
@node Contact
|
|
@section Contact
|
|
@*
|
|
The author of the original BSD iconv library (Alexander Chuguev) no longer
|
|
supports that code.
|
|
|
|
@*
|
|
Any questions regarding the iconv library may be forwarded to
|
|
Artem B. Bityuckiy (dedekind@@oktetlabs.ru or dedekind@@mail.ru) as
|
|
well as to the public Newlib mailing list.
|
|
|