@node Iconv
@chapter Character-set conversions (@file{iconv.h})

This chapter describes the Newlib iconv library.
The iconv function declarations are in
@file{iconv.h}.

@menu
* iconv:: Character set conversion routines
* iconv architecture:: Architecture of Newlib iconv library
* iconv configuration:: Newlib iconv-specific configure options
* Generating CCS tables:: How to generate CCS tables
* Adding new converter:: Steps on adding a new converter
@end menu

@page
@include iconv/iconv.def

@page
@node iconv architecture
@section iconv architecture
@findex iconv architecture
@findex encoding
@findex CCS
@findex CES
@findex iconv converter
@*
@itemize @bullet
@item
Encoding - a rule to represent computer text by means of bits and bytes.
@item
CCS (Coded Character Set) - a mapping from an abstract character set
to a set of non-negative integers (character codes).
@item
CES (Character Encoding Scheme) - a mapping from a set of character codes
to a sequence of bytes.
@end itemize

@*
Examples of CCS: ASCII, ISO-8859-x, KOI8-R, KSX-1001, GB-2312.@*
Examples of CES: UTF-8, UTF-16, EUC-JP, ISO-2022-JP.

@*
The iconv library is used to convert an array of characters in one encoding
to an array in another encoding.
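
As an illustration of such a conversion, the following is a minimal sketch that uses the standard @code{iconv_open}, @code{iconv} and @code{iconv_close} calls. The helper name @code{convert_buffer} is hypothetical and not part of the library, and the encoding names accepted at run time depend on which converters were configured.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not part of iconv): convert `inlen` bytes of `in`
   from encoding `from` to encoding `to`.
   Returns the number of bytes written to `out`, or (size_t)-1 on error. */
size_t convert_buffer(const char *from, const char *to,
                      const char *in, size_t inlen,
                      char *out, size_t outcap)
{
    /* Note the argument order: target encoding first, then source. */
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;   /* iconv() takes char **, so drop the const */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);

    /* A NULL input buffer asks a stateful converter to emit any
       pending shift sequence. */
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);

    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : outcap - outleft;
}
```

For example, calling @code{convert_buffer("ISO-8859-1", "UTF-8", ...)} on the single byte 0xE9 should produce the two bytes 0xC3 0xA9, provided both converters are available.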

@*
From a user's point of view, the iconv library is a set of converters. Each converter
corresponds to one encoding (e.g., KOI8-R converter, UTF-8 converter).
Internally, the term converter has a different meaning.

@*
The iconv library always performs conversions through UCS-32: i.e., to convert
from A to B, the iconv library first converts A to UCS-32, and then UCS-32 to B.
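
The two-step idea can be sketched in a few lines of C. This is only an illustration of converting through a 32-bit pivot, not the Newlib implementation: the function names are hypothetical, ISO-8859-1 plays the role of A, and UTF-8 plays the role of B.

```c
#include <stddef.h>
#include <stdint.h>

/* Step 1 (A -> pivot): ISO-8859-1 bytes map directly to code points 0..255. */
static uint32_t latin1_to_ucs(unsigned char byte)
{
    return (uint32_t)byte;
}

/* Step 2 (pivot -> B): encode one code point as UTF-8.
   Returns the number of bytes written (1..4), or 0 if out of range.
   Surrogate code points are not rejected in this sketch. */
static size_t ucs_to_utf8(uint32_t ucs, unsigned char *out)
{
    if (ucs < 0x80) {
        out[0] = (unsigned char)ucs;
        return 1;
    }
    if (ucs < 0x800) {
        out[0] = (unsigned char)(0xC0 | (ucs >> 6));
        out[1] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 2;
    }
    if (ucs < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (ucs >> 12));
        out[1] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 3;
    }
    if (ucs < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (ucs >> 18));
        out[1] = (unsigned char)(0x80 | ((ucs >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 4;
    }
    return 0;
}

/* Convert n ISO-8859-1 bytes to UTF-8 through the pivot.
   `out` must have room for 2 * n bytes. Returns bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t written = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t ucs = latin1_to_ucs(in[i]);          /* A -> pivot */
        written += ucs_to_utf8(ucs, out + written);   /* pivot -> B */
    }
    return written;
}
```

With a pivot, supporting N encodings needs N pairs of to/from routines rather than one routine for every pair of encodings.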

@*
Each encoding consists of a CES and a CCS. A CCS may be represented as a data table,
but a CES always implies some code (an algorithm). Iconv uses CCS tables
to map from some encoding to UCS-32. CCS tables are placed in
the iconv/ccs subdirectory of Newlib. The iconv code also uses CES
modules, which can convert some CCS to and from UCS-32. CES modules are placed
in the iconv/ces subdirectory.

@*
Some encodings have CES = CCS (e.g., KOI8-R). For such encodings iconv uses
special subroutines which perform simple table conversions (ccs_table.c).
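
A simple table conversion of this kind might look like the sketch below. It is not the code from ccs_table.c: the structure, the function names and the @code{UCS_INVALID} marker are hypothetical, and only four KOI8-R entries are listed where a real table covers the whole code set.

```c
#include <stddef.h>
#include <stdint.h>

#define UCS_INVALID 0xFFFFFFFFu   /* hypothetical "no mapping" marker */

struct ccs_entry {
    unsigned char code;   /* byte value in the 8-bit encoding */
    uint32_t ucs;         /* corresponding 32-bit character code */
};

/* Small excerpt of the KOI8-R mapping, for illustration only. */
static const struct ccs_entry koi8r_excerpt[] = {
    { 0xC1, 0x0430 },   /* CYRILLIC SMALL LETTER A */
    { 0xC2, 0x0431 },   /* CYRILLIC SMALL LETTER BE */
    { 0xE1, 0x0410 },   /* CYRILLIC CAPITAL LETTER A */
    { 0xE2, 0x0411 },   /* CYRILLIC CAPITAL LETTER BE */
};

#define KOI8R_EXCERPT_LEN (sizeof koi8r_excerpt / sizeof koi8r_excerpt[0])

/* Encoding -> pivot: the lower half of KOI8-R is ASCII, the rest is looked up. */
uint32_t koi8r_to_ucs(unsigned char c)
{
    if (c < 0x80)
        return c;
    for (size_t i = 0; i < KOI8R_EXCERPT_LEN; i++)
        if (koi8r_excerpt[i].code == c)
            return koi8r_excerpt[i].ucs;
    return UCS_INVALID;
}

/* Pivot -> encoding: returns the byte value, or -1 if there is no mapping. */
int ucs_to_koi8r(uint32_t ucs)
{
    if (ucs < 0x80)
        return (int)ucs;
    for (size_t i = 0; i < KOI8R_EXCERPT_LEN; i++)
        if (koi8r_excerpt[i].ucs == ucs)
            return koi8r_excerpt[i].code;
    return -1;
}
```

Because the byte sequence of such an encoding is just the character code itself, no separate CES algorithm is needed; the table is the whole converter.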

@*
Among specialized CES modules, the iconv library has
generic support for EUC and ISO-2022-family encodings (ces_euc.c and
ces_iso2022.c).

@*
To enable iconv to work with CCS- or CES-based encodings, the corresponding
CCS table or CES module should be linked with Newlib. The iconv library
can also load CCS tables dynamically from external files (.cct files from the
iconv/ccs/binary subdirectory). CES modules, on the other hand, can't
be dynamically loaded.

@*
Each iconv converter has one name and a set of aliases. The list of
aliases for each converter's name is in the iconv/charset.aliases file.
Note: iconv always normalizes converter names and aliases before using them.
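
The normalization rules are not spelled out in this chapter. The sketch below assumes, purely for illustration, that normalization means lowercasing the name and replacing each hyphen with an underscore; this is inferred from spellings such as iso_8859_15 used elsewhere in this chapter and may not match the library exactly. The function name @code{normalize_name} is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical normalization (an assumption, not the library's routine):
   lowercase every character and turn '-' into '_'.
   Writes at most outcap - 1 characters plus a terminating NUL. */
void normalize_name(const char *name, char *out, size_t outcap)
{
    size_t i = 0;

    if (outcap == 0)
        return;
    for (; name[i] != '\0' && i + 1 < outcap; i++) {
        unsigned char c = (unsigned char)name[i];
        out[i] = (c == '-') ? '_' : (char)tolower(c);
    }
    out[i] = '\0';
}
```

Under these assumed rules, both "ISO-8859-15" and "iso_8859_15" would resolve to the same key, which is what lets a user spell a converter name in more than one way.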

@page
@node iconv configuration
@section iconv configuration
@findex iconv configuration
@findex iconv converter
@*
To enable iconv, the @option{--enable-newlib-iconv} configuration option should be
used when configuring Newlib.

@*
To link a specific converter (CCS table or CES module) into Newlib, the
@option{--enable-newlib-builtin-converters} option should be used. A
comma-separated list of converters can be passed with this option
(e.g., @option{--enable-newlib-builtin-converters=koi8-r,euc-jp} to link the KOI8-R
and EUC-JP converters). Either converter names or aliases may be used.

@*
If the target system has a file system accessible by Newlib, table-based
converters may be loaded dynamically from external files. The iconv
code tries to load files from the iconv_data subdirectory of the directory
specified by the NLSPATH environment variable.
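
The lookup location described above can be sketched as follows. The helper names are hypothetical, and two details are assumptions rather than documented behavior: that the file is named after the converter with a .cct suffix, and that '/' is the path separator.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: build "<base>/iconv_data/<name>.cct" in `out`.
   The ".cct" suffix and the '/' separator are assumptions.
   Returns 0 on success, -1 if `base` is missing or `out` is too small. */
int cct_path_in(const char *base, const char *name, char *out, size_t outcap)
{
    if (base == NULL || *base == '\0')
        return -1;

    int n = snprintf(out, outcap, "%s/iconv_data/%s.cct", base, name);
    return (n < 0 || (size_t)n >= outcap) ? -1 : 0;
}

/* Same as above, taking the base directory from NLSPATH. */
int cct_path(const char *name, char *out, size_t outcap)
{
    return cct_path_in(getenv("NLSPATH"), name, out, outcap);
}
```

If NLSPATH is unset, this sketch simply reports failure; a table that cannot be loaded this way has to be linked in instead.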

@*
Since Newlib has no generic dynamic module load support, CES-based converters
can't be dynamically loaded and should be linked in.

@page
@node Generating CCS tables
@section Generating CCS tables
@*
CCS tables are placed in the ccs subdirectory of the iconv directory.
This subdirectory contains .cct and .c files. The .cct files are for
dynamic loading whereas the .c files are for static linking with Newlib.
Both .c and .cct files are generated by the 'iconv_mktbl' Perl script
from special source files (call them
.txt files). The 'iconv_mktbl' script can be found in the iconv/ccs
subdirectory. Input .txt files can be found at the Unicode.org site or
at other locations on the web.
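
For orientation, the single-byte mapping files published on Unicode.org generally hold one mapping per line: the encoding's code in hexadecimal, the Unicode code point in hexadecimal, and a trailing comment introduced by '#'. The sketch below parses one such line. It only illustrates that input layout and says nothing about how 'iconv_mktbl' itself reads the files; the function name is hypothetical.

```c
#include <stdio.h>

/* Hypothetical parser for one line of a Unicode.org-style mapping file,
   e.g. "0xC1<TAB>0x0430<TAB># CYRILLIC SMALL LETTER A".
   Returns 1 and fills *code and *ucs when the line holds a mapping,
   0 for comment lines, blank lines and codes with no Unicode value. */
int parse_mapping_line(const char *line, unsigned long *code,
                       unsigned long *ucs)
{
    /* Skip leading blanks. */
    while (*line == ' ' || *line == '\t')
        line++;

    /* Comment or empty line: nothing to parse. */
    if (*line == '#' || *line == '\n' || *line == '\0')
        return 0;

    /* %lx accepts an optional 0x prefix; the comment is left unread. */
    return sscanf(line, "%lx %lx", code, ucs) == 2;
}
```

Lines whose second column is missing, as used in some files for undefined codes, fail the second conversion and are therefore skipped.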

@*
The .c files are linked with Newlib if the corresponding 'configure' script
option was given. This is needed to use iconv on targets without file system
support. If a CCS table isn't configured to be linked, the iconv library
tries to load it dynamically from the corresponding .cct file.

@*
The following are commands to build .c and .cct CCS table files from .txt
files for several supported encodings.

@*
@itemize
@item
cp775:@*
iconv_mktbl -Co cp775.c cp775.txt@*
iconv_mktbl -o cp775.cct cp775.txt

@item
cp850:@*
iconv_mktbl -Co cp850.c cp850.txt@*
iconv_mktbl -o cp850.cct cp850.txt

@item
cp852:@*
iconv_mktbl -Co cp852.c cp852.txt@*
iconv_mktbl -o cp852.cct cp852.txt

@item
cp855:@*
iconv_mktbl -Co cp855.c cp855.txt@*
iconv_mktbl -o cp855.cct cp855.txt

@item
cp866:@*
iconv_mktbl -Co cp866.c cp866.txt@*
iconv_mktbl -o cp866.cct cp866.txt

@item
iso-8859-1:@*
iconv_mktbl -Co iso-8859-1.c iso-8859-1.txt@*
iconv_mktbl -o iso-8859-1.cct iso-8859-1.txt

@item
iso-8859-4:@*
iconv_mktbl -Co iso-8859-4.c iso-8859-4.txt@*
iconv_mktbl -o iso-8859-4.cct iso-8859-4.txt

@item
iso-8859-5:@*
iconv_mktbl -Co iso-8859-5.c iso-8859-5.txt@*
iconv_mktbl -o iso-8859-5.cct iso-8859-5.txt

@item
iso-8859-2:@*
iconv_mktbl -Co iso-8859-2.c iso-8859-2.txt@*
iconv_mktbl -o iso-8859-2.cct iso-8859-2.txt

@item
iso-8859-15:@*
iconv_mktbl -Co iso-8859-15.c iso-8859-15.txt@*
iconv_mktbl -o iso-8859-15.cct iso-8859-15.txt

@item
big5:@*
iconv_mktbl -Co big5.c big5.txt@*
iconv_mktbl -o big5.cct big5.txt

@item
ksx1001:@*
iconv_mktbl -Co ksx1001.c ksx1001.txt@*
iconv_mktbl -o ksx1001.cct ksx1001.txt

@item
gb_2312:@*
iconv_mktbl -Co gb_2312-80.c gb_2312-80.txt@*
iconv_mktbl -o gb_2312-80.cct gb_2312-80.txt

@item
jis_x0201:@*
iconv_mktbl -Co jis_x0201.c jis_x0201.txt@*
iconv_mktbl -o jis_x0201.cct jis_x0201.txt

@item
shift_jis:@*
iconv_mktbl -Co shift_jis.c shift_jis.txt@*
iconv_mktbl -o shift_jis.cct shift_jis.txt

@item
jis_x0208:@*
iconv_mktbl -C -c 1 -u 2 -o jis_x0208-1983.c jis_x0208-1983.txt@*
iconv_mktbl -c 1 -u 2 -o jis_x0208-1983.cct jis_x0208-1983.txt

@item
jis_x0212:@*
iconv_mktbl -Co jis_x0212-1990.c jis_x0212-1990.txt@*
iconv_mktbl -o jis_x0212-1990.cct jis_x0212-1990.txt

@item
cns11643-plane1:@*
iconv_mktbl -C -p 0x1 -o cns11643-plane1.c cns11643.txt@*
iconv_mktbl -p 0x1 -o cns11643-plane1.cct cns11643.txt

@item
cns11643-plane2:@*
iconv_mktbl -C -p 0x2 -o cns11643-plane2.c cns11643.txt@*
iconv_mktbl -p 0x2 -o cns11643-plane2.cct cns11643.txt

@item
cns11643-plane14:@*
iconv_mktbl -C -p 0xE -o cns11643-plane14.c cns11643.txt@*
iconv_mktbl -p 0xE -o cns11643-plane14.cct cns11643.txt

@item
koi8-r:@*
iconv_mktbl -Co koi8-r.c koi8-r.txt@*
iconv_mktbl -o koi8-r.cct koi8-r.txt

@item
koi8-u:@*
iconv_mktbl -Co koi8-u.c koi8-u.txt@*
iconv_mktbl -o koi8-u.cct koi8-u.txt

@item
us-ascii:@*
iconv_mktbl -Cao us-ascii.c iso-8859-1.txt@*
iconv_mktbl -ao us-ascii.cct iso-8859-1.txt
@end itemize

@*
Source files for CCS tables can be taken from at least two places:

@*
@enumerate
@item
http://www.unicode.org/Public/MAPPINGS/ contains many encoding
map files.
@item
http://www.dante.net/staff/konstantin/FreeBSD/iconv/ contains the original
iconv sources and encoding map files.
@end enumerate

@*
The following are URLs where source files for some of the CCS tables
can be found:

@itemize
@item
big5:@*
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT

@item
cns11643_plane14, cns11643_plane1 and cns11643_plane2:@*
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT

@item
cp775, cp850, cp852, cp855, cp866:@*
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/

@item
gb_2312_80:@*
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/GB/GB2312.TXT

@item
iso_8859_15, iso_8859_1, iso_8859_2, iso_8859_4, iso_8859_5:@*
http://www.unicode.org/Public/MAPPINGS/ISO8859/

@item
jis_x0201, jis_x0208_1983, jis_x0212_1990, shift_jis:@*
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT

@item
koi8_r:@*
http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT

@item
ksx1001:@*
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT

@item
koi8-u can be obtained from the original FreeBSD iconv library distribution:@*
http://www.dante.net/staff/konstantin/FreeBSD/iconv/
@end itemize

@*
Moreover, http://www.dante.net/staff/konstantin/FreeBSD/iconv/ contains
many additional CCS tables that you can use with Newlib (iso-2022 and
RFC1345 encodings).

@page
@node Adding new converter
@section Adding a new iconv converter
@*
The following steps should be taken to add a new iconv converter:

@*
@enumerate
@item
The converter's name and list of aliases should be added to
the iconv/charset.aliases file.
@item
All iconv converters are protected by a _ICONV_CONVERTER_XXX
macro, where XXX is the converter name. This protection macro should be added to
the newlib/newlib.hin file.
@item
The converter's name and aliases should also be registered in the _iconv_builtin_aliases
table in iconv/lib/bialiasesi.c. The list should be protected by
the corresponding macro mentioned above.
@item
If the new converter is just a CCS table, the corresponding .cct and .c files
should be added to the iconv/ccs/ subdirectory. The file names
should match the normalized encoding name. The 'iconv_mktbl'
Perl script (found in iconv/ccs) may
be used to generate such files. The file names should be added to the
iconv/ccs/Makefile.am and iconv/ccs/binary/Makefile.am files, and then
automake should be used to regenerate the Makefile.in files.
@item
If the new converter has a CES algorithm, the appropriate file should be
added to the
iconv/ces/ subdirectory. Again, the file name should match
the normalized
encoding name.
@item
If the converter is an EUC or ISO-2022-family CES, then the converter
is just an array listing the CCS tables it uses (see ccs/euc-jp.c for an example). This
is because iconv already has EUC and ISO-2022 support. The CCS tables used should
be provided in iconv/ccs/.
@item
If the converter isn't an EUC or ISO-2022-based CES, the following functions
should be provided (see utf-8.c for an example):
@itemize @minus
@item A function to convert from the new CES to UCS-32;
@item A function to convert from UCS-32 to the new CES;
@item An 'init' function;
@item A 'close' function;
@item A 'reset' function to reset the shift state for a stateful CES.
@end itemize

@*
All these functions are registered in a 'struct iconv_ces_desc' object.
The name of the object should be _iconv_ces_module_XXX, where XXX is the
name of the converter.
@item
For CES converters, the corresponding 'struct iconv_ces_desc' reference should
be added to the iconv/lib/bices.c file.

@*
For CCS converters, the corresponding table reference should be added to
the iconv/lib/biccs.c file.
@end enumerate
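
To make the five functions required of a CES module more concrete, here is a minimal sketch of a descriptor for a simple big-endian two-byte encoding. Everything in it is hypothetical: @code{struct demo_ces_desc} only mirrors the list of functions given above and is not the real layout of 'struct iconv_ces_desc', and the function signatures are assumptions chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_UCS_INVALID 0xFFFFFFFFu   /* hypothetical error marker */

/* Hypothetical descriptor: one slot per function listed in the steps above. */
struct demo_ces_desc {
    void *(*init)(void);
    void (*close)(void *state);
    void (*reset)(void *state);
    uint32_t (*to_ucs)(void *state, const unsigned char **in, size_t *inleft);
    int (*from_ucs)(void *state, uint32_t ucs,
                    unsigned char **out, size_t *outleft);
};

/* This encoding is stateless, so init/close/reset have nothing to do. */
static void *ucs2be_init(void) { return NULL; }
static void ucs2be_close(void *state) { (void)state; }
static void ucs2be_reset(void *state) { (void)state; }

/* CES -> pivot: read two bytes, most significant first. */
static uint32_t ucs2be_to_ucs(void *state, const unsigned char **in,
                              size_t *inleft)
{
    (void)state;
    if (*inleft < 2)
        return DEMO_UCS_INVALID;
    uint32_t ucs = ((uint32_t)(*in)[0] << 8) | (*in)[1];
    *in += 2;
    *inleft -= 2;
    return ucs;
}

/* Pivot -> CES: write two bytes; fail if the code does not fit in 16 bits
   or the output buffer is too small. */
static int ucs2be_from_ucs(void *state, uint32_t ucs,
                           unsigned char **out, size_t *outleft)
{
    (void)state;
    if (ucs > 0xFFFF || *outleft < 2)
        return -1;
    (*out)[0] = (unsigned char)(ucs >> 8);
    (*out)[1] = (unsigned char)(ucs & 0xFF);
    *out += 2;
    *outleft -= 2;
    return 0;
}

/* Named after the _iconv_ces_module_XXX pattern, with a demo_ prefix. */
const struct demo_ces_desc demo_ces_module_ucs_2be = {
    ucs2be_init, ucs2be_close, ucs2be_reset, ucs2be_to_ucs, ucs2be_from_ucs
};
```

A stateful CES such as an ISO-2022 variant would allocate its shift state in the init function, release it in the close function, and return it to the initial state in the reset function.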