mirror of
git://sourceware.org/git/newlib-cygwin.git
synced 2025-03-01 12:35:44 +08:00
* new-features.sgml: Add missing GB2312 and eucKR character sets.
* pathnames.sgml: Change "DOS devices" title to "Invalid filenames" and rephrase that section. Add section "Filenames with unusual (foreign) characters". Fix an emphasis. * setup-net.sgml: Integrate setup-locale section. * setup2.sgml: Add locale variables to section "Environment Variables". Add section "Internationalization".
This commit is contained in:
parent
4747078502
commit
f276aab75a
@ -1,3 +1,14 @@
|
||||
2009-03-25 Corinna Vinschen <corinna@vinschen.de>
|
||||
|
||||
* new-features.sgml: Add missing GB2312 and eucKR character sets.
|
||||
* pathnames.sgml: Change "DOS devices" title to "Invalid filenames"
|
||||
and rephrase that section.
|
||||
Add section "Filenames with unusual (foreign) characters".
|
||||
Fix an emphasis.
|
||||
* setup-net.sgml: Integrate setup-locale section.
|
||||
* setup2.sgml: Add locale variables to section "Environment Variables".
|
||||
Add section "Internationalization".
|
||||
|
||||
2009-03-24 Corinna Vinschen <corinna@vinschen.de>
|
||||
|
||||
* new-features.sgml: Add section about changed (no)winsymlink default.
|
||||
|
@ -195,8 +195,9 @@
|
||||
in 1-16, except 12, "UTF-8", Windows codepages "CPxxx", with xxx in
|
||||
(437, 720, 737, 775, 850, 852, 855, 857, 858, 862, 866, 874, 1125,
|
||||
1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258), "JIS", "SJIS",
|
||||
"eucJP", "Big5". The leading language and territory part (en_US) is not
|
||||
used by Cygwin yet, but is required for POSIX compatibility.
|
||||
"GB2312", "eucJP", "eucKR", and "Big5". The leading language and territory
|
||||
part (en_US, for instance) is not used by Cygwin yet, but is required
|
||||
for POSIX compatibility.
|
||||
|
||||
- Allow multiple concurrent read locks per thread for pthread_rwlock_t.
|
||||
|
||||
|
@ -311,21 +311,25 @@ to be readable by the $USER user account itself.</para>
|
||||
|
||||
</sect2>
|
||||
|
||||
<sect2 id="pathnames-dosdevices"><title>DOS devices</title>
|
||||
<sect2 id="pathnames-dosdevices"><title>Invalid filenames</title>
|
||||
|
||||
<para>Filenames invalid under Win32 are not necessarily invalid
|
||||
under Cygwin since release 1.7.0. There are a couple of rules which
|
||||
apply to Windows filenames. First of all, DOS device names like
|
||||
under Cygwin since release 1.7.0. There are a few rules which
|
||||
apply to Windows filenames. Most notably, DOS device names like
|
||||
<filename>AUX</filename>, <filename>COM1</filename>,
|
||||
<filename>LPT1</filename> or <filename>PRN</filename> (to name a few)
|
||||
cannot be used in a native Win32 application, even with an
|
||||
extension (<filename>prn.txt</filename>). Cygwin can handle files with
|
||||
these names just fine.</para>
|
||||
cannot be used as a filename or extension in a native Win32 application.
|
||||
So filenames like <filename>prn.txt</filename> or <filename>foo.aux</filename>
|
||||
are invalid filenames for native Win32 applications.</para>
|
||||
|
||||
<para>This restriction doesn't apply to Cygwin applications. Cygwin
|
||||
can create and access files with such names just fine. Just don't try
|
||||
to use these files with native Win32 applications...</para>
|
||||
|
||||
</sect2>
|
||||
|
||||
<sect2 id="pathnames-specialchars">
|
||||
<title>Special characters in filenames</title>
|
||||
<title>Forbidden characters in filenames</title>
|
||||
|
||||
<para>Win32 filenames can't contain trailing dots and spaces for backward
|
||||
compatibility. When trying to create files with trailing dots or spaces,
|
||||
@ -346,6 +350,48 @@ are converted to special UNICODE characters in the range 0xf000 to 0xf0ff
|
||||
|
||||
</sect2>
|
||||
|
||||
<sect2 id="pathnames-unusual">
|
||||
<title>Filenames with unusual (foreign) characters</title>
|
||||
|
||||
<para> Windows filesystems use the Unicode character set in the UTF-16
|
||||
encoding to store filename information. If you don't use the UTF-8
|
||||
character set (see <xref linkend="setup-locale"></xref>) then there's a
|
||||
chance that a filename is using one or more characters which have no
|
||||
representation in the character set you're using.</para>
|
||||
|
||||
<para>For instance, there are no Chinese characters in the ISO-8859-1
|
||||
character set. So, converting a filename containing a Chinese character
|
||||
to ISO-8859-1 leaves you with a wrongly converted filename, for instance
|
||||
containing a question mark '?' as replacement for the unconvertible
|
||||
character. When trying to access the file, Cygwin has to convert the
|
||||
filename back to UTF-16. However, this doesn't result in the original
|
||||
filename because the question mark will not translate back to the original
|
||||
Chinese character, but to a simple question mark instead. This in turn
|
||||
results in strange "File not found" messages.</para>
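
<para>The lossy round trip described above can be sketched in a few lines of
Python. This is only an illustration using Python's own codecs, not Cygwin's
conversion code, and the filename is a made-up example.</para>

```python
# Illustration only: Python codecs stand in for the UTF-16 <-> charset
# conversion. The filename below is hypothetical.
original = "report_\u4e2d.txt"   # contains one Chinese character (U+4E2D)

# Converting to ISO-8859-1: the Chinese character has no representation,
# so it is replaced by a question mark.
narrow = original.encode("iso-8859-1", errors="replace")

# Converting back: the '?' stays a plain '?', so the original name is lost.
round_trip = narrow.decode("iso-8859-1")

print(narrow)                    # b'report_?.txt'
print(round_trip == original)    # False
```

<para>Looking up <literal>report_?.txt</literal> instead of the real name is
what produces the "File not found" message.</para>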
|
||||
|
||||
<note><para>To avoid this scenario altogether, always use UTF-8 as
|
||||
character set.</para></note>
|
||||
|
||||
<para>If you don't want to or can't use UTF-8 as character set for whatever
|
||||
reason, you will nevertheless be able to access the file. How does that
|
||||
work? When Cygwin converts the filename from UTF-16 to your character
|
||||
set, it recognizes characters which can't be converted. If that occurs,
|
||||
Cygwin replaces the non-convertible character with a special character
|
||||
sequence. The sequence starts with an ASCII SO character (hex code
|
||||
0x0e, equivalent to Control-N), followed by the UTF-8 representation of the
|
||||
character. The result is a filename containing some ugly looking
|
||||
characters. While it doesn't <emphasis>look</emphasis> nice, it
|
||||
<emphasis>is</emphasis> nice, because Cygwin knows how to convert this
|
||||
filename back to UTF-16. The filename will be converted using your
|
||||
usual character set. However, when Cygwin recognizes an ASCII SO
|
||||
character, it skips over the ASCII SO and handles the following bytes as
|
||||
a UTF-8 character. Thus, the filename is symmetrically converted back to
|
||||
UTF-16 and you can access the file.</para>
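
<para>The escape scheme described above can be sketched as follows. This is a
simplified model under stated assumptions, not Cygwin's implementation: the
function names are hypothetical, Python strings stand in for UTF-16 filenames,
and the decoder only handles single-byte charsets.</para>

```python
SO = 0x0E  # ASCII SO (Control-N), the marker byte named in the text


def to_charset(name: str, charset: str) -> bytes:
    """Convert a filename to `charset`; characters that cannot be converted
    are written as SO followed by their UTF-8 bytes."""
    out = bytearray()
    for ch in name:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            out.append(SO)
            out += ch.encode("utf-8")
    return bytes(out)


def from_charset(data: bytes, charset: str) -> str:
    """Convert back: after SO, read one UTF-8 character; every other byte is
    decoded with the usual charset (single-byte charsets only, for brevity)."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == SO:
            lead = data[i + 1]
            # Length of a UTF-8 sequence, derived from its lead byte.
            length = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
            out.append(data[i + 1:i + 1 + length].decode("utf-8"))
            i += 1 + length
        else:
            out.append(data[i:i + 1].decode(charset))
            i += 1
    return "".join(out)


name = "caf\u00e9_\u4e2d.txt"                 # hypothetical filename
escaped = to_charset(name, "iso-8859-1")
print(escaped)                                # b'caf\xe9_\x0e\xe4\xb8\xad.txt'
print(from_charset(escaped, "iso-8859-1") == name)   # True
```

<para>Unlike the question-mark replacement, this conversion is symmetric: the
escaped name looks ugly but maps back to exactly the original filename.</para>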
|
||||
|
||||
<para>Again, by using UTF-8 you can avoid this problem entirely.</para>
|
||||
|
||||
</sect2>
|
||||
|
||||
<sect2 id="pathnames-casesensitive">
|
||||
<title>Case sensitive filenames</title>
|
||||
|
||||
@ -369,7 +415,7 @@ HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\kernel\obcaseinsensitive
|
||||
this registry value also on Windows NT4 and Windows 2000, which usually
|
||||
both don't know this registry key. If you want case-sensitivity on these
|
||||
systems, create that registry value and set it to 0. On these systems
|
||||
(and *only* on these systems) you don't have to reboot to bring it
|
||||
(and <emphasis role='bold'>only</emphasis> on these systems) you don't have to reboot to bring it
|
||||
into effect, rather stopping all Cygwin processes and then restarting them
|
||||
is sufficient.</para>
|
||||
|
||||
|
@ -254,6 +254,7 @@ Problems with Cygwin</ulink>.
|
||||
|
||||
DOCTOOL-INSERT-setup-env
|
||||
DOCTOOL-INSERT-setup-maxmem
|
||||
DOCTOOL-INSERT-setup-locale
|
||||
DOCTOOL-INSERT-ntsec
|
||||
DOCTOOL-INSERT-setup-files
|
||||
</chapter>
|
||||
|
@ -13,12 +13,21 @@ The <envar>CYGWIN</envar> variable is used to configure many global
|
||||
settings for the Cygwin runtime system. Initially you can leave
|
||||
<envar>CYGWIN</envar> unset or set it to <literal>tty</literal> (e.g.
|
||||
to support job control with ^Z etc...) using a syntax like this in the
|
||||
DOS shell, before launching bash. </para>
|
||||
DOS shell, before launching bash.</para>
|
||||
|
||||
<screen>
|
||||
<prompt>C:\></prompt> <userinput>set CYGWIN=tty notitle glob</userinput>
|
||||
</screen>
|
||||
|
||||
<para>
|
||||
Locale support is controlled by the <envar>LANG</envar> and
|
||||
<envar>LC_xxx</envar> environment variables. You can set all of them
|
||||
but Cygwin itself only honors the variables <envar>LC_ALL</envar>,
|
||||
<envar>LC_CTYPE</envar>, and <envar>LANG</envar>, in this order, according
|
||||
to the POSIX standard. The first one found rules. For a more detailed
|
||||
description see <xref linkend="setup-locale"></xref>.
|
||||
</para>
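
<para>The lookup order described above can be modelled with a small helper.
This is a sketch, not Cygwin code; the function name is hypothetical, and the
treatment of an empty value as "unset" follows the usual POSIX convention.</para>

```python
def effective_locale(env: dict) -> str:
    """Return the first non-empty value of LC_ALL, LC_CTYPE and LANG,
    checked in that order; fall back to "C" if none is set."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


print(effective_locale({"LANG": "de_CH", "LC_CTYPE": "fr_FR.UTF-8"}))  # fr_FR.UTF-8
print(effective_locale({}))                                            # C
```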
|
||||
|
||||
<para>
|
||||
The <envar>PATH</envar> environment variable is used by Cygwin
|
||||
applications as a list of directories to search for executable files
|
||||
@ -124,6 +133,279 @@ Run the program and it will output the maximum amount of allocatable memory.
|
||||
|
||||
</sect1>
|
||||
|
||||
<sect1 id="setup-locale"><title>Internationalization</title>
|
||||
|
||||
<sect2 id="setup-locale-ov"><title>Overview</title>
|
||||
|
||||
<para>
|
||||
Internationalization support is controlled by the <envar>LANG</envar> and
|
||||
<envar>LC_xxx</envar> environment variables. You can set all of them
|
||||
but Cygwin itself only honors the variables <envar>LC_ALL</envar>,
|
||||
<envar>LC_CTYPE</envar>, and <envar>LANG</envar>, in this order, according
|
||||
to the POSIX standard. The content of these variables should follow the
|
||||
POSIX standard for a locale specifier. The correct form of a locale
|
||||
specifier is</para>
|
||||
|
||||
<screen>
|
||||
language[[_TERRITORY][.charset][@modifier]]
|
||||
</screen>
|
||||
|
||||
<para>"language" is a lowercase two character string per ISO 639-1,
|
||||
"TERRITORY" is an uppercase two character string per ISO 3166, charset is
|
||||
one of a list of supported character sets, and the modifier doesn't matter
|
||||
here (though it might for some applications). If you're interested in the
|
||||
exact description, you can find it in the online publication of the POSIX
|
||||
manual pages on the homepage of the
|
||||
<ulink url="http://www.opengroup.org/">Open Group</ulink>.</para>
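
<para>As a rough illustration of the specifier form above, the following
sketch splits a locale specifier into its parts. The regular expression and
the function name are assumptions made for this example. It only checks the
syntax, not whether the charset is actually supported.</para>

```python
import re

# language[[_TERRITORY][.charset][@modifier]]
LOCALE_RE = re.compile(
    r"^(?P<language>[a-z]{2})"            # two lowercase letters, ISO 639-1
    r"(?:_(?P<territory>[A-Z]{2}))?"      # two uppercase letters, ISO 3166
    r"(?:\.(?P<charset>[A-Za-z0-9-]+))?"  # e.g. UTF-8, eucKR, CP437
    r"(?:@(?P<modifier>[A-Za-z0-9]+))?$"
)


def parse_locale(spec: str) -> dict:
    """Split a locale specifier into language, territory, charset, modifier.
    Parts that are absent are returned as None."""
    match = LOCALE_RE.match(spec)
    if match is None:
        raise ValueError(f"not a valid locale specifier: {spec!r}")
    return match.groupdict()


print(parse_locale("ko_KR.eucKR"))
print(parse_locale("de_CH")["charset"])   # None, i.e. the default charset applies
```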
|
||||
|
||||
<para>Typical locale specifiers are</para>
|
||||
|
||||
<screen>
|
||||
"de_CH" language = German, territory = Switzerland, default charset
|
||||
"fr_FR.UTF-8" language = french, territory = France, charset = UTF-8
|
||||
"ko_KR.eucKR" language = korean, territory = South Korea, charset = eucKR
|
||||
</screen>
|
||||
|
||||
<para>
|
||||
And let's not forget the default locale called "C" or "POSIX"
|
||||
which basically only supports plain ASCII code. If the aforementioned
|
||||
environment variables are not set, or set to "C" or "POSIX", you get the
|
||||
default ASCII-only behaviour.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Right now the language and territory content is not evaluated by Cygwin any
|
||||
further. The only important part so far is the character set. How does that
|
||||
work?
|
||||
</para>
|
||||
|
||||
</sect2>
|
||||
|
||||
<sect2 id="setup-locale-how"><title>How to set the locale</title>
|
||||
|
||||
<itemizedlist mark="bullet">
|
||||
|
||||
<listitem><para>
|
||||
The default locale is the "C" or "POSIX" locale. In this locale, basically
|
||||
only ASCII characters are supported. Even if one of the aforementioned
|
||||
environment variables is set to something else, it's the application's
|
||||
responsibility to call the function <function>setlocale</function>,
|
||||
typically like this</para>
|
||||
|
||||
<screen>
|
||||
setlocale (LC_ALL, "");
|
||||
</screen>
|
||||
|
||||
<para>to switch to another locale according to the settings of the
|
||||
internationalization environment variables.
|
||||
</para></listitem>
|
||||
|
||||
<listitem><para>
|
||||
Assuming you set one of the aforementioned environment variables to some
|
||||
valid POSIX locale value, other than "C" and "POSIX", and assuming you
|
||||
call an application which calls <function>setlocale</function> as above.</para>
|
||||
|
||||
<para>Assuming further you're living in Japan. So you might want to use
|
||||
the language code "ja" and the territory "JP", thus setting, say,
|
||||
<envar>LANG</envar> to "ja_JP". You didn't set a character set, so
|
||||
what will Cygwin use now? Easy! It will use the default Windows ANSI
|
||||
codepage of your system, if it's supported by Cygwin. Hopefully Cygwin
|
||||
supports all relevant default ANSI codepages...</para>
|
||||
|
||||
<note><para>For a list of supported character sets, see
|
||||
<xref linkend="setup-locale-charsetlist"></xref>
|
||||
</para></note>
|
||||
</listitem>
|
||||
|
||||
<listitem><para>
|
||||
You don't want to use the default Windows codepage as character set?
|
||||
In that case you have to specify the charset explicitly. For instance,
|
||||
assume you're from Italy and don't want to use the default Windows codepage
|
||||
1252, but the more portable ISO-8859-15 character set. What you can do is
|
||||
to set the <envar>LANG</envar> variable in the
|
||||
<filename>C:\cygwin\Cygwin.bat</filename> file which is the batch file
|
||||
to start a Cygwin session from the "Cygwin" desktop shortcut.</para>
|
||||
|
||||
<screen>
|
||||
@echo off
|
||||
|
||||
C:
|
||||
chdir C:\cygwin\bin
|
||||
set LANG=it_IT.ISO-8859-15
|
||||
bash --login -i
|
||||
</screen>
|
||||
</listitem>
|
||||
|
||||
<listitem><para>
|
||||
Most singlebyte or doublebyte charsets have a disadvantage. Windows
|
||||
filesystems use the Unicode character set in the UTF-16 encoding to store filename information. Not all characters
|
||||
from the Unicode character set are available in a singlebyte or doublebyte
|
||||
charset. While Cygwin has a workaround to access files with unusual
|
||||
characters (see <xref linkend="pathnames-unusual"></xref>), a better
|
||||
workaround is to always use the UTF-8 character set. UTF-8 is the only
|
||||
multibyte character set which can represent <emphasis>every</emphasis>
|
||||
Unicode character.</para>
|
||||
|
||||
<screen>
|
||||
set LANG=es_MX.UTF-8
|
||||
</screen>
|
||||
|
||||
<para>For a description of the Unicode standard, see the homepage of the
|
||||
<ulink url="http://www.unicode.org/">Unicode Consortium</ulink>.
|
||||
</para></listitem>
|
||||
|
||||
</itemizedlist>
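
<para>The last list item contrasts singlebyte charsets with UTF-8. A quick way
to see the difference is to test whether a given filename can be represented
in a charset at all. The helper below is a hypothetical sketch that uses
Python's codecs as a stand-in for the conversion, and the filename is
made up.</para>

```python
def representable(name: str, charset: str) -> bool:
    """Return True if every character of `name` exists in `charset`."""
    try:
        name.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


name = "ma\u00f1ana_\u65e5\u672c.txt"   # Latin letters plus two Japanese characters

print(representable(name, "iso-8859-15"))   # False
print(representable(name, "cp1252"))        # False
print(representable(name, "utf-8"))         # True
```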
|
||||
|
||||
</sect2>
|
||||
|
||||
<sect2 id="setup-locale-problems"><title>Potential Problems</title>
|
||||
|
||||
<para>
|
||||
You can set the above internationalization variables not only in
|
||||
<filename>Cygwin.bat</filename> or in the Windows environment, but also
|
||||
in your Cygwin shell on the fly, even switch to yet another character
|
||||
set, and yet another. In bash for instance:</para>
|
||||
|
||||
<screen>
|
||||
<prompt>bash$</prompt> export LC_CTYPE="nl_BE.UTF-8"
|
||||
</screen>
|
||||
|
||||
<para>However, here's a problem. At the start of the first Cygwin process
|
||||
in a session, the Windows environment has to be converted from UTF-16 to
|
||||
some singlebyte or multibyte charset. If the internationalization environment
|
||||
variable hasn't been set <emphasis>before</emphasis> starting this process,
|
||||
Cygwin has to make an educated guess which charset to use to convert
|
||||
the environment itself. The only reproducible way to do that in the absence
|
||||
of <envar>LC_ALL</envar>, <envar>LC_CTYPE</envar>, or <envar>LANG</envar>,
|
||||
is to use the current Windows ANSI codepage.</para>
|
||||
|
||||
<para>As long as the environment only contains ASCII characters, this is
|
||||
no problem. But if it contains non-ASCII characters and you're planning to use, say, UTF-8,
|
||||
the converted environment will contain byte sequences that are invalid in the UTF-8 charset.
|
||||
This would be especially a problem in variables like <envar>PATH</envar>.</para>
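
<para>The mismatch can be reproduced outside of Cygwin. The sketch below
assumes the ANSI codepage is CP1252 and uses a made-up
<envar>PATH</envar> entry. It only illustrates why bytes produced for one
charset are not valid in another.</para>

```python
# Hypothetical PATH entry containing one non-ASCII character (u-umlaut).
path_entry = "C:\\Users\\J\u00fcrgen\\bin"

# Without LC_ALL, LC_CTYPE or LANG, the environment is converted using the
# ANSI codepage; CP1252 is assumed here.
as_ansi = path_entry.encode("cp1252")

# Later interpreting the same bytes as UTF-8 fails, because the single
# byte 0xFC is not a valid UTF-8 sequence.
try:
    as_ansi.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(as_ansi)       # b'C:\\Users\\J\xfcrgen\\bin'
print(valid_utf8)    # False
```

<para>Setting the locale variable before the first Cygwin process starts
avoids the guess, since the environment is then converted with the intended
charset from the beginning.</para>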
|
||||
|
||||
<note><para>Per POSIX, the name of an environment variable should only
|
||||
consist of valid ASCII characters, and only of uppercase letters, digits, and
|
||||
the underscore for maximum portability.</para></note>
|
||||
|
||||
<para>And here's another problem when switching charsets on the fly.
|
||||
Symbolic links. A symbolic link contains the filename of the target
|
||||
file the symlink points to. When a symlink is created, the current
|
||||
character set is used to store the target filename. If the target
|
||||
filename contains non-ASCII characters and you switch to another
|
||||
character set, the target filename of the symlink is now potentially
|
||||
an invalid character sequence in the new character set. This behaviour
|
||||
is not different from the behaviour in other operating systems. So,
|
||||
if you suddenly can't access a symlink anymore, maybe it's because you
|
||||
switched to another character set?
|
||||
</para>
|
||||
|
||||
</sect2>
|
||||
|
||||
<sect2 id="setup-locale-missing"><title>What does not work?</title>
|
||||
|
||||
<para>
|
||||
Except for <envar>LC_ALL</envar>, <envar>LC_CTYPE</envar>,
|
||||
and <envar>LANG</envar>, all other LC_xxx environment variables,
|
||||
<envar>LC_COLLATE</envar>, <envar>LC_MESSAGES</envar>,
|
||||
<envar>LC_MONETARY</envar>, <envar>LC_NUMERIC</envar>,
|
||||
and <envar>LC_TIME</envar>, are ignored right now. This means, while Cygwin
|
||||
supports different character sets, it does <emphasis>not</emphasis> support
|
||||
real localization so far. There's no support for locale-specific monetary
|
||||
symbols, for a decimal point other than '.', no support for native time
|
||||
formats, and no support for native language sorting orders.
|
||||
</para>
|
||||
|
||||
<para>However, internationalization is work in progress and we would be glad
|
||||
for coding help in this area.</para>
|
||||
|
||||
</sect2>
|
||||
|
||||
<sect2 id="setup-locale-charsetlist"><title>List of supported character sets</title>
|
||||
|
||||
<para>Last but not least, here's the list of currently supported character
|
||||
sets. The left-hand expression is the name of the charset, as you would use
|
||||
it in the internationalization environment variables as outlined above.
|
||||
</para>
|
||||
|
||||
<para>The right-hand side is the number of the equivalent Windows
|
||||
codepage as well as the Windows name of the codepage. They are only
|
||||
noted here for reference. Don't try to use the bare codepage number or
|
||||
the Windows name of the codepage as charset in locale specifiers, unless
|
||||
they happen to be identical with the left-hand side. Especially in case
|
||||
of the "CPxxx" style charsets, always use them with the leading "CP".</para>
|
||||
|
||||
<para>This works:</para>
|
||||
|
||||
<screen>
|
||||
set LC_ALL=en_US.CP437
|
||||
</screen>
|
||||
|
||||
<para>This does <emphasis>not</emphasis> work:</para>
|
||||
|
||||
<screen>
|
||||
set LC_ALL=en_US.437
|
||||
</screen>
|
||||
|
||||
<para>You can find a full list of Windows codepages on the Microsoft MSDN page
|
||||
<ulink url="http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx">Code Page Identifiers</ulink>.</para>
|
||||
|
||||
<screen>
|
||||
Charset Codepage
|
||||
|
||||
CP437 437 (OEM United States)
|
||||
CP720 720 (DOS Arabic)
|
||||
CP737 737 (OEM Greek)
|
||||
CP775 775 (OEM Baltic)
|
||||
CP850 850 (OEM Latin 1, Western European)
|
||||
CP852 852 (OEM Latin 2, Central European)
|
||||
CP855 855 (OEM Cyrillic)
|
||||
CP857 857 (OEM Turkish)
|
||||
CP858 858 (OEM Latin 1 + Euro Symbol)
|
||||
CP862 862 (OEM Hebrew)
|
||||
CP866 866 (OEM Russian)
|
||||
CP874 874 (ANSI/OEM Thai)
|
||||
CP1125 1125 (OEM Ukraine)
|
||||
CP1250 1250 (ANSI Central European)
|
||||
CP1251 1251 (ANSI Cyrillic)
|
||||
CP1252 1252 (ANSI Latin 1, Western European)
|
||||
CP1253 1253 (ANSI Greek)
|
||||
CP1254 1254 (ANSI Turkish)
|
||||
CP1255 1255 (ANSI Hebrew)
|
||||
CP1256 1256 (ANSI Arabic)
|
||||
CP1257 1257 (ANSI Baltic)
|
||||
CP1258 1258 (ANSI/OEM Vietnamese)
|
||||
|
||||
ISO-8859-1 28591 (ISO-8859-1)
|
||||
ISO-8859-2 28592 (ISO-8859-2)
|
||||
ISO-8859-3 28593 (ISO-8859-3)
|
||||
ISO-8859-4 28594 (ISO-8859-4)
|
||||
ISO-8859-5 28595 (ISO-8859-5)
|
||||
ISO-8859-6 28596 (ISO-8859-6)
|
||||
ISO-8859-7 28597 (ISO-8859-7)
|
||||
ISO-8859-8 28598 (ISO-8859-8)
|
||||
ISO-8859-9 28599 (ISO-8859-9)
|
||||
ISO-8859-10 - (not available)
|
||||
ISO-8859-11 - (not available)
|
||||
ISO-8859-13 28563 (ISO-8859-13)
|
||||
ISO-8859-14 - (not available)
|
||||
ISO-8859-15 28565 (ISO-8859-15)
|
||||
ISO-8859-16 - (not available)
|
||||
|
||||
SJIS 932 (ANSI/OEM Japanese)
|
||||
GB2312 936 (ANSI/OEM Simplified Chinese, GBK)
|
||||
Big5 950 (ANSI/OEM Traditional Chinese)
|
||||
JIS 50220 (ISO2022 Japanese w/o halfwidth Katakana)
|
||||
eucJP 51932 (EUC Japanese)
|
||||
eucKR 51949 (EUC Korean)
|
||||
|
||||
UTF-8 65001 (UTF-8)
|
||||
</screen>
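
<para>The left-hand column of the table can be turned into a simple lookup,
for instance to check a locale specifier before setting it. This is a sketch
with hypothetical function names. It assumes charset names are matched exactly
as written in the table, which may be stricter than what Cygwin itself
accepts.</para>

```python
# Charset names from the left-hand column of the table above.
CODEPAGES = [437, 720, 737, 775, 850, 852, 855, 857, 858, 862, 866, 874,
             1125, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258]
SUPPORTED = (
    {f"CP{n}" for n in CODEPAGES}
    | {f"ISO-8859-{n}" for n in range(1, 17) if n != 12}
    | {"SJIS", "GB2312", "Big5", "JIS", "eucJP", "eucKR", "UTF-8"}
)


def explicit_charset(spec: str):
    """Return the charset part of a locale specifier, or None if absent."""
    base = spec.split("@", 1)[0]          # drop an optional @modifier
    if "." not in base:
        return None
    return base.split(".", 1)[1]


def is_listed(charset: str) -> bool:
    """Return True if `charset` appears in the table (exact match assumed)."""
    return charset in SUPPORTED


print(is_listed(explicit_charset("en_US.CP437")))   # True
print(is_listed(explicit_charset("en_US.437")))     # False
```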
|
||||
|
||||
</sect2>
|
||||
|
||||
</sect1>
|
||||
|
||||
<sect1 id="setup-files"><title>Customizing bash</title>
|
||||
|
||||
<para>
|
||||
|