We restricted LC_CTYPE to ASCII recently but not messages from, say,
gcc. So instead of nice warnings I get '???? ??????? ???????'
(ru_RU.UTF-8 locale) as a gcc warning, which is not nice. So, set
LC_MESSAGES=C too.
Signed-off-by: Sergei Trofimovich <[email protected]>
---
Makefile | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/Makefile b/Makefile
index c628a5c..67bc799 100644
--- a/Makefile
+++ b/Makefile
@@ -21,7 +21,8 @@ unexport LC_ALL
LC_CTYPE=C
LC_COLLATE=C
LC_NUMERIC=C
-export LC_CTYPE LC_COLLATE LC_NUMERIC
+LC_MESSAGES=C
+export LC_CTYPE LC_COLLATE LC_NUMERIC LC_MESSAGES
# We are using a recursive build, so we need to do a little thinking
# to get the ordering right.
--
1.6.4.4
On 12/25/2009 09:13 AM, Sergei Trofimovich wrote:
> We restricted LC_CTYPE to ASCII recently but not messages from, say,
> gcc. So instead of nice warnings I get '???? ??????? ???????'
> (ru_RU.UTF-8 locale) as a gcc warning, which is not nice. So, set
> LC_MESSAGES=C too.
The whole reason with only setting some LC_* to C was to be able to
leave LC_MESSAGES intact, but it seems it breaks on too many real-life
systems.
As such, I suggest we should set LC_ALL=C and get rid of the rest of it:
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
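For readers following the LC_ALL suggestion above, here is a minimal sketch of the POSIX lookup order for a single locale category. It may help show why LC_ALL=C would make the individual LC_* assignments redundant, and why the Makefile context has `unexport LC_ALL` before setting them. `effective_locale` is a hypothetical helper written for illustration only; it is not part of kbuild or any real tool.

```shell
#!/bin/sh
# Sketch of the POSIX lookup order for one locale category:
# LC_ALL wins, then the category's own variable, then LANG.
# effective_locale is a hypothetical helper, not part of kbuild.
effective_locale() {
    category=$1
    # Read the variable whose name is stored in $category.
    eval "value=\${$category:-}"
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$value" ]; then
        printf '%s\n' "$value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Implementation-defined by POSIX; assumed here to be "C".
        printf '%s\n' C
    fi
}
```

Calling `effective_locale LC_MESSAGES` with LC_ALL set reports the LC_ALL value regardless of what LC_MESSAGES holds, which is the behaviour the LC_ALL=C proposal relies on.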
> The whole reason with only setting some LC_* to C was to be able to
> leave LC_MESSAGES intact, but it seems it breaks on too many real-life
> systems.
> As such, I suggest we should set LC_ALL=C and get rid of the rest of it:
Seems unfortunate to lose localized error messages. (Although in my
en_US.UTF-8 case, all I get is non-ASCII quote characters)
This all started because of the awk invocation in arch/x86/lib. Maybe
the best idea would be to confine the locale monkeying to that one
place?
- R.
On 12/25/2009 05:17 PM, Roland Dreier wrote:
>
> > The whole reason with only setting some LC_* to C was to be able to
> > leave LC_MESSAGES intact, but it seems it breaks on too many real-life
> > systems.
>
> > As such, I suggest we should set LC_ALL=C and get rid of the rest of it:
>
> Seems unfortunate to lose localized error messages. (Although in my
> en_US.UTF-8 case, all I get is non-ASCII quote characters)
>
The whole problem is that for some people we lose *all* messages. This
all seems very strange to me, but I guess it tweaks some internal
detail inside the glibc message library, sigh.
> This all started because of the awk invocation in arch/x86/lib. Maybe
> the best idea would be to confine the locale monkeying to that one
> place?
Except that sed, etc. and even the shell itself have the same class of
problems. Perl doesn't, since it has saner rules for how regular
expressions handle ranges.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
> > Seems unfortunate to lose localized error messages. (Although in my
> > en_US.UTF-8 case, all I get is non-ASCII quote characters)
> The whole problem is that for some people we lose *all* messages. This
> all seems very strange to me, but I guess it tweaks some internal
> detail inside the glibc message library, sigh.
I just meant that people used to be able to get localized error messages
by setting LANG or whatever. And now they're stuck with ASCII English.
> > This all started because of the awk invocation in arch/x86/lib. Maybe
> > the best idea would be to confine the locale monkeying to that one
> > place?
> Except that sed, etc. and even the shell itself have the same class of
> problems. Perl doesn't, since it has saner rules for how regular
> expressions handle ranges.
But pretty much everyone on a modern distro has had a UTF8 locale for
quite a while. And as far as I know there have been no problems caused
by collation order or anything else. So this change to always build in
the C locale is just worrying about theoretical problems.
Anyway, not a big deal I guess.
- R.
On 12/25/2009 05:17 PM, Roland Dreier wrote:
>
> > The whole reason with only setting some LC_* to C was to be able to
> > leave LC_MESSAGES intact, but it seems it breaks on too many real-life
> > systems.
>
> > As such, I suggest we should set LC_ALL=C and get rid of the rest of it:
>
> Seems unfortunate to lose localized error messages. (Although in my
> en_US.UTF-8 case, all I get is non-ASCII quote characters)
>
> This all started because of the awk invocation in arch/x86/lib. Maybe
> the best idea would be to confine the locale monkeying to that one
> place?
>
It is also possible that setting only LC_COLLATE will solve the most
fundamental problem, which is the one of character ranges. LC_COLLATE
probably will interfere less with LC_MESSAGES than the setting of LC_CTYPE.
It's still bloody broken that glibc malfunctions like that for an
LC_MESSAGES/LC_CTYPE intentional mismatch, but, sigh, that's glibc for you.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
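A rough sketch of the LC_COLLATE-only idea, as it applies to the shell's own pattern matching: the range test below runs in a subshell with LC_ALL cleared and only LC_COLLATE forced to C, leaving LC_MESSAGES and LC_CTYPE as the user set them. `is_ascii_lower` is a hypothetical name chosen for this illustration. Whether a given shell consults the locale for ranges at all varies between implementations, so this is an assumption-laden sketch rather than a statement about any particular shell.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if $1 is a single character in the
# range [a-z] when collation is forced to the C locale.
is_ascii_lower() {
    (
        # LC_ALL would override LC_COLLATE, so drop it first.
        unset LC_ALL
        LC_COLLATE=C
        export LC_COLLATE
        case $1 in
            [a-z]) exit 0 ;;
            *)     exit 1 ;;
        esac
    )
}
```

With C collation the range covers only the 26 lowercase ASCII letters, so an uppercase letter such as `B` does not match.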
On 26.12.2009 21:04, H. Peter Anvin wrote:
> On 12/25/2009 05:17 PM, Roland Dreier wrote:
>>
>> > The whole reason with only setting some LC_* to C was to be able to
>> > leave LC_MESSAGES intact, but it seems it breaks on too many real-life
>> > systems.
>>
>> > As such, I suggest we should set LC_ALL=C and get rid of the rest of it:
>>
>> Seems unfortunate to lose localized error messages. (Although in my
>> en_US.UTF-8 case, all I get is non-ASCII quote characters)
>>
>> This all started because of the awk invocation in arch/x86/lib. Maybe
>> the best idea would be to confine the locale monkeying to that one
>> place?
>>
>
> It is also possible that setting only LC_COLLATE will solve the most
> fundamental problem, which is the one of character ranges. LC_COLLATE
> probably will interfere less with LC_MESSAGES than the setting of LC_CTYPE.
We need LC_COLLATE=C so that [a-z] really means lowercase ASCII letters
and nothing else (most importantly not uppercase letters) in awk, sed
and the shell. If we stay with LC_CTYPE=$userdefined, the meaning of
[[:classes:]] becomes nondeterministic and so does the mapping of
lowercase and uppercase characters:
$ echo iI | LC_CTYPE=tr_TR.UTF-8 awk '{ print $0 " " toupper($0) " " tolower($0) }'
iI İI iı
Character classes are probably not a big issue (modulo the fact that
mawk doesn't seem to support them), because the input is ASCII text
anyway. Regarding the tolower()/toupper() functions, I found one
potential troublemaker:
$ git grep -E 'to(lower|upper)' | grep -v '\.[ch]:'
arch/sh/tools/gen-mach-types: tolower(mach[i]), mach[i]);
Maybe this awk script should be run with LC_ALL=C, since people mostly
care about (localized) messages from gcc, not from awk.
Michal
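As a sketch of the last suggestion, confining LC_ALL=C to a single awk invocation could look like the following. `lower_c` is a hypothetical wrapper used only for illustration; the real gen-mach-types script is not reproduced here, and this assumes an awk that provides tolower().

```shell
#!/bin/sh
# Hypothetical wrapper: lowercase a string with awk under LC_ALL=C, so
# that the case mapping does not depend on the user's LC_CTYPE.
lower_c() {
    printf '%s\n' "$1" | LC_ALL=C awk '{ print tolower($0) }'
}
```

Because the assignment is a prefix on the awk command only, the rest of the build (gcc in particular) would keep the user's locale and its localized messages.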