Generate lc_def_codeset.h header containing the default mapping from
locale to codeset on Linux. Use this mapping in __set_charset_from_locale
in the first place.
For every locale not covered by this table, just map Windows codepages
to equivalent codesets used on Linux/Unix, getting rid of LCIDs entirely.
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
Since Windows Vista, locale handling has been converted from using numeric
locale identifiers (LCID) to using RFC 5646 (BCP 47) locale strings. In the
meantime Windows introduced new locales which don't even have an LCID
attached. Those were unusable in Cygwin because fetching locale information
for these locales requires calling the new locale functions taking
a locale string.
Convert Cygwin to drop LCIDs and use Windows RFC 5646 locales instead.
The last place using LCIDs is the __set_charset_from_locale function.
Checking numerically is easier and usually faster than checking strings.
However, this function is clearly a TODO
Used on Linux as default codeset for Tajik. There's no matching
Windows codepage, so fake it as CP103.
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
Most locales using latin characters ignore case while sorting.
This is what wcscoll does (correctly so). However, there's an
internal order of collating sequences compared to the base
character, which is case-sensitive, at least in glibc.
There's no way to express this in Windows, because CompareString
and LCMapString *always* use case-insensitivity in those locales,
even if none of the *IGNORECASE sorting flags are used.
We want to follow glibc's behaviour more closely, so we add an
extra check for the case and make sure upper and lower cased
letters don't compare as identical.
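
The extra case check can be pictured with a minimal sketch. It does not
use the Windows API: caseless_coll is a hypothetical stand-in for a
collation primitive that always ignores case, and breaking the tie with
wcscmp is an assumption made for illustration, not necessarily the
order glibc or Cygwin produces.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-in for a collation call that always ignores case. */
static int
caseless_coll (const wchar_t *a, const wchar_t *b)
{
  for (; *a && *b; ++a, ++b)
    {
      wint_t la = towlower ((wint_t) *a);
      wint_t lb = towlower ((wint_t) *b);
      if (la != lb)
        return la < lb ? -1 : 1;
    }
  if (*a == *b)                 /* both strings ended */
    return 0;
  return *a ? 1 : -1;           /* the longer string sorts last */
}

/* Collate case-insensitively first.  Only if that reports equality,
   fall back to a case-sensitive comparison, so that strings differing
   only in case never compare as identical. */
static int
coll_case_tiebreak (const wchar_t *a, const wchar_t *b)
{
  int ret = caseless_coll (a, b);
  if (ret != 0)
    return ret;
  return wcscmp (a, b);
}
```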
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
Allow the [.<sym>.] expression
This requires a string comparison rather than a character
comparison. Introduce and use __wscollate_range_cmp.
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
lc_collelem.h: autogenerated table of collating elements, taken
from glibc
is_unicode_coll_elem: Check if a UTF-32 string is a collating element
next_unicode_char: return length of prefix from a string constituting
a complete character in the current locale, taking
collating elements into account.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=179721
Now that FreeBSD has eventually picked up the bug report after
only 5 years, rename __collate_range_cmp to __wcollate_range_cmp
as suggested all along, and make it type safe (wint_t instead of
wchar_t for hopefully obvious reasons...)
While at it, drop __collate_load_error and fix the checks for
it in glob and fnmatch.
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
is_unicode_equiv compares two UTF-32 values and returns 1 if
both are members of the same Unicode equivalence class, 0 otherwise.
Note that this function only works with precomposed characters
per Unicode normalization form C. It doesn't handle decomposed
characters, just like its counterpart in glibc. I.e., equivalence
class comparison using decomposed chars won't work. Example:
fnmatch("[=n=]", "ñ") == 0
fnmatch("[=ñ=]", "n") == 0
but
fnmatch("[=n=]", "n\x0303") == 1
fnmatch("[=n\x0303=]", "n") == 1
fnmatch("[=n\x0303=]", "n\x0303") == 1
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
So far the input to __collate_range_cmp was handled as a wchar_t.
Change that to handle it as wint_t holding a UTF-32 value and
create surrogate pairs for the call to wcscoll.
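
For illustration, converting a UTF-32 value into UTF-16 code units
could look like the sketch below. The helper name utf32_to_utf16 and
its signature are assumptions made for this example, not the code used
in Cygwin.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one UTF-32 code point as UTF-16.  Returns the number of code
   units written to out (1 or 2), or 0 if cp is not a valid scalar
   value (a surrogate, or beyond U+10FFFF). */
static size_t
utf32_to_utf16 (uint32_t cp, uint16_t out[2])
{
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;           /* BMP: one code unit */
      return 1;
    }
  cp -= 0x10000;                        /* 20 bits remain */
  out[0] = (uint16_t) (0xD800 | (cp >> 10));    /* high surrogate */
  out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));  /* low surrogate */
  return 2;
}
```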
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
i. e. Vista/2008. This drops support for the sr_CS locale.
Regenerate LC_MESSAGES and LC_TIME ERA data from more recent Linux
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
The new function __eval_codepage_from_internal_charset
is a simplified version of the former code in
fhandler_tty.cc. It probably needs some extension,
but the gist is to use knowledge of internals to
be as quick as possible.
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
This should slightly speed up especially path conversions,
given there's one less function call rearranging all function
arguments in registers/stack (and less stack pressure).
For clarity, rename overloaded sys_wcstombs to _sys_wcstombs
and sys_cp_mbstowcs to _sys_mbstowcs.
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
commit c0d7d3e1a2 removed the usage of the
LCMAP_BYTEREV flag in the call to LCMapStringW to work around a strange
bug in LCMapStringW. This patch didn't take a userspace call of
wcsxfrm{_l} with NULL buffer and 0 size to evaluate the required buffer
size into account. This introduced a crash trying to byte swap the
NULL buffer. This patch fixes that problem.
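
The userspace pattern in question, sketched with standard C only: the
first call passes a NULL buffer and size 0 to learn the required
length, the second call performs the transformation. The helper name
xfrm_dup is hypothetical.

```c
#include <stdlib.h>
#include <wchar.h>

/* Return a malloc'ed transformed copy of src, or NULL if allocation
   fails.  The caller frees the result. */
static wchar_t *
xfrm_dup (const wchar_t *src)
{
  /* Size query: with n == 0 the destination may be NULL and nothing
     is written. */
  size_t need = wcsxfrm (NULL, src, 0);
  wchar_t *key = malloc ((need + 1) * sizeof *key);

  if (key == NULL)
    return NULL;
  /* Actual transformation, including the terminating NUL. */
  wcsxfrm (key, src, need + 1);
  return key;
}
```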
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
Work around a bug (or undocumented behaviour) in LCMapStringW:
It's documented(*) that the cchDest parameter is a byte count with
LCMAP_SORTKEY, but a character count otherwise. But the docs don't
state what happens if you combine LCMAP_SORTKEY with LCMAP_BYTEREV.
Tests indicate that LCMAP_SORTKEY treats cchDest as byte count, but
then LCMAP_BYTEREV treats it as char count in the same call. So the
latter swaps twice as many bytes in the destination buffer as the
byte count it returns, which potentially results in writing past the
end of the given output buffer.
Solution: Don't specify LCMAP_BYTEREV in the LCMapStringW(LCMAP_SORTKEY)
call, rather byte swap afterwards.
(*) https://msdn.microsoft.com/en-us/library/windows/desktop/dd318702(v=vs.85).aspx
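
A minimal sketch of the "byte swap afterwards" step, assuming the sort
key has already been written to a buffer and its length is known as a
byte count. The helper name swap_byte_pairs is hypothetical and no
Windows API is called here.

```c
#include <stddef.h>

/* Swap each pair of adjacent bytes in buf, touching at most nbytes
   bytes.  A trailing odd byte is left alone.  With nbytes == 0 the
   loop body never runs, so a NULL buf from a pure size query is never
   dereferenced. */
static void
swap_byte_pairs (unsigned char *buf, size_t nbytes)
{
  for (size_t i = 0; i + 1 < nbytes; i += 2)
    {
      unsigned char tmp = buf[i];
      buf[i] = buf[i + 1];
      buf[i + 1] = tmp;
    }
}
```

Bounding the loop by the byte count is what keeps the swap from
running past the end of the output buffer.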
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
The former __locale_charset always fetched the current locale's charset.
We need the per-locale charset, too, in future. Rename __locale_charset
to __current_locale_charset and change __locale_charset to take a
locale_t as parameter. Accommodate throughout.
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
This allows looping through the structs and buffers. Also
rearrange definitions to follow order of LC_xxx values.
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
Don't use global variables. This allows calling loadlocale from
the yet to be created newlocale().
Rename _thr_locale_t to __locale_t (these locales are not restricted
to threads so the name is misleading).
Along these lines, fix _set_ctype to take a __locale_t as parameter.
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
- Remove charset parameter from low level __foo_wctomb/__foo_mbtowc calls.
- Instead, create arrays of functions for ISO and Windows codepages, each
pointing to a function which does not need to evaluate the charset string
on each call. Create matching helper functions. I.e., __iso_wctomb,
__iso_mbtowc, __cp_wctomb and __cp_mbtowc are functions returning the
right function pointer now.
- Create __WCTOMB/__MBTOWC macros utilizing per-reent locale and replace
calls to __wctomb/__mbtowc with calls to __WCTOMB/__MBTOWC.
- Drop global __wctomb/__mbtowc vars.
- Utilize aforementioned changes in Cygwin to get rid of charset in other,
calling functions and simplify the code.
- In Cygwin restrict global cygheap locale info to the job performed
by internal_setlocale. Use UTF-8 instead of ASCII on the fly in
internal conversion functions.
- In Cygwin dll_entry, make sure to initialize a TLS area with a NULL
_REENT->_locale pointer. Add comment to explain why.
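
The idea of resolving the conversion function once instead of parsing
a charset string on every call can be sketched as follows. All names
here (wctomb_fn, ascii_wctomb, latin1_wctomb, wctomb_table,
lookup_wctomb) and the numeric charset ids are hypothetical and only
illustrate the pattern; they are not the newlib or Cygwin interfaces.

```c
#include <stddef.h>

/* Hypothetical converter signature: store wc in dst and return the
   number of bytes written, or -1 if wc is not representable. */
typedef int (*wctomb_fn) (char *dst, unsigned int wc);

static int
ascii_wctomb (char *dst, unsigned int wc)
{
  if (wc > 0x7f)
    return -1;
  dst[0] = (char) wc;
  return 1;
}

static int
latin1_wctomb (char *dst, unsigned int wc)
{
  if (wc > 0xff)
    return -1;
  dst[0] = (char) (unsigned char) wc;
  return 1;
}

/* Table indexed by a small charset id (hypothetical numbering). */
static const wctomb_fn wctomb_table[] = { ascii_wctomb, latin1_wctomb };

/* Resolve the converter once, e.g. when the locale is set.  Returns
   NULL for an unknown id. */
static wctomb_fn
lookup_wctomb (size_t charset_id)
{
  size_t n = sizeof wctomb_table / sizeof wctomb_table[0];
  return charset_id < n ? wctomb_table[charset_id] : NULL;
}
```

Callers keep the returned pointer and invoke it directly, so the hot
path carries no charset argument at all.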
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
Move all locale category structure definitions into setlocale.h and remove
other headers in locale subdir. Create inline accessor functions for
current category struct pointers and use throughout. Use pointers to
"C" locale category structs by default in __global_locale.
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
Introduce first cut of struct _thr_locale_t used for the locale_t definition.
Introduce global instance called __global_locale used by default.
Introduce internal inline functions __get_global_locale, __get_locale_r,
__get_current_locale.
Remove usage of global variables in favor of accessor functions pointing to
__global_locale for now. Include all local headers in locale subdir from
setlocale.h to get single include for internal locale access.
Introduce __CTYPE_PTR macro to replace direct access to __ctype_ptr__
and use throughout in isxxx functions.
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
Bump GPLv2+ to GPLv3+ for some files, clarify BSD 2-clause.
Everything else stays under GPLv3+.
New Linking Exception exempts resulting executables from LGPLv3 section 4.
Add CONTRIBUTORS file to keep track of licensing.
Remove 'Copyright Red Hat Inc' comments.
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
Cygwin's strxfrm/wcsxfrm treated a too short output buffer as an error
condition and always returned the size value provided as third parameter.
This is not what POSIX.1-2008 documents. Rather, the only error
condition is an invalid input string(*).
Other than that, the functions are supposed to return the length of the
resulting sort key, even if the output buffer is too small. In the latter
case the content of the output array is unspecified, but it's the job
of the application to check whether the return value is greater than or
equal to the provided buffer size.
(*) We have to make an exception in Cygwin: strxfrm has to call the
UNICODE function LCMapStringW for reasons outlined in a source comment.
If the incoming multibyte string is so large that we fail to malloc
the space required to convert it to a wchar_t string, we have to
set errno as well since we have nothing to call LCMapStringW with.
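
From the application's point of view, the check described above can be
sketched with standard C as follows. The helper name xfrm_fits is
hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* Transform src into dst and report whether the result fit.  strxfrm
   returns the length of the full sort key, not counting the NUL, even
   when dst is too small.  If that length is >= dstsize, the content of
   dst is unspecified and must not be used. */
static bool
xfrm_fits (char *dst, size_t dstsize, const char *src)
{
  size_t len = strxfrm (dst, src, dstsize);
  return len < dstsize;
}
```

An application that gets false back would typically allocate a larger
buffer and call strxfrm again.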
* nlsfuncs.cc (wcsxfrm): Fix expression computing offset of
trailing wchar_t NUL. Compute correct return value even if
output buffer is too small.
(strxfrm): Handle failing malloc. Compute correct return value
even if output buffer is too small.
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
* nlsfuncs.cc (setlocaleinfo): New macro calling __setlocaleinfo.
(__setlocaleinfo): New function to set a locale-specific character
to an explicit wchar_t value.
(__set_lc_numeric_from_win): Handle fa_IR and ps_AF locales to return
same decimal point and thousands separator characters as on Linux.
(__set_lc_monetary_from_win): Ditto for monetary characters.
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
* nlsfuncs.cc (__get_lcid_from_locale): Handle LocaleNameToLCID
returning LOCALE_CUSTOM_UNSPECIFIED instead of failing in case of
an unsupported locale on Windows 10.
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
included by default.
* winlean.h: Add long comment to explain why we have to define certain
symbols.
(_NORMALIZE_): Define.
(_WINNLS_): Drop definition and subsequent undef.
(_WINNETWK_): Ditto.
(_WINSVC_): Ditto.
2013-11-23 Eric Blake <eblake@redhat.com>
* nlsfuncs.cc (__get_lcid_from_locale): Update list of Script-only
locales to Windows 8.
(__set_charset_from_locale): Take locales added with Windows 8 and 8.1
into account.
* collate.h: New header.
(__collate_range_cmp): Declare.
(__collate_load_error): Define.
* glob.cc: Pull in latest version from FreeBSD. Simplify and reduce
Cygwin-specific changes.
* regex/regcomp.c: Include collate.h on Cygwin as well.
(__collate_range_cmp): Move from here...
* nlsfuncs.cc (__collate_range_cmp): ...to here.
* miscfuncs.cc (thread_wrapper): Fix typo in comment.
(CygwinCreateThread): Take dead zone of Windows stack into account.
Change how the stack is committed and how guard pages are handled.
Explain how and why.
* thread.h (PTHREAD_DEFAULT_STACKSIZE): Change definition. Explain why.
suffix.
* nlsfuncs.cc (rebase_locale_buf): Reorder arguments. Accommodate
throughout. Add pointer to end of buffer and avoid changing pointers
not pointing into the buffer.
* nlsfuncs.cc (__getlocaleinfo): Drop conversion to multibyte.
(__charfromwchar): New function to convert to multibyte.
(__eval_datetimefmt): Convert to return wchar_t pointer. Work on
wide char string.
(__set_lc_time_from_win): Take additional pointer to "C" category info
to accommodate C.foo locales. Rework to fill wide char members in
category info.
(__set_lc_ctype_from_win): New function.
(__set_lc_numeric_from_win): Take additional pointer to "C" category
info to accommodate C.foo locales. Rework to fill wide char members
in category info.
(__set_lc_monetary_from_win): Ditto.
(__set_lc_messages_from_win): Ditto.
(__get_current_collate_codeset): New function, called from nl_langinfo.
* include/cygwin/config.h (__HAVE_LOCALE_INFO_EXTENDED__): Define.