2009-12-25 17:13:15

by Sergei Trofimovich

Subject: [PATCH] Kbuild: set LC_MESSAGES=C (as LC_CTYPE=C is)

We restricted LC_CTYPE to ASCII recently but not messages from, say,
gcc. So instead of nice warnings I get '???? ??????? ???????'
(ru_RU.UTF-8 locale) as a gcc warning, which is not nice. So, set
LC_MESSAGES=C too.

Signed-off-by: Sergei Trofimovich <[email protected]>
---
Makefile | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/Makefile b/Makefile
index c628a5c..67bc799 100644
--- a/Makefile
+++ b/Makefile
@@ -21,7 +21,8 @@ unexport LC_ALL
 LC_CTYPE=C
 LC_COLLATE=C
 LC_NUMERIC=C
-export LC_CTYPE LC_COLLATE LC_NUMERIC
+LC_MESSAGES=C
+export LC_CTYPE LC_COLLATE LC_NUMERIC LC_MESSAGES

 # We are using a recursive build, so we need to do a little thinking
 # to get the ordering right.
--
1.6.4.4


2009-12-25 23:37:07

by H. Peter Anvin

Subject: Re: [PATCH] Kbuild: set LC_MESSAGES=C (as LC_CTYPE=C is)

On 12/25/2009 09:13 AM, Sergei Trofimovich wrote:
> We restricted LC_CTYPE to ASCII recently but not messages from, say,
> gcc. So instead of nice warnings I get '???? ??????? ???????'
> (ru_RU.UTF-8 locale) as a gcc warning, which is not nice. So, set
> LC_MESSAGES=C too.

The whole reason with only setting some LC_* to C was to be able to
leave LC_MESSAGES intact, but it seems it breaks on too many real-life
systems.

As such, I suggest we should set LC_ALL=C and get rid of the rest of it:

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.


Attachments:
0001-Makefile-Set-LC_ALL-C.patch (0.99 kB)

2009-12-26 01:18:06

by Roland Dreier

Subject: Re: [PATCH] Kbuild: set LC_MESSAGES=C (as LC_CTYPE=C is)


> The whole reason with only setting some LC_* to C was to be able to
> leave LC_MESSAGES intact, but it seems it breaks on too many real-life
> systems.

> As such, I suggest we should set LC_ALL=C and get rid of the rest of it:

Seems unfortunate to lose localized error messages. (Although in my
en_US.UTF-8 case, all I get is non-ASCII quote characters)

This all started because of the awk invocation in arch/x86/lib. Maybe
the best idea would be to confine the locale monkeying to that one
place?

- R.

2009-12-26 01:31:12

by H. Peter Anvin

Subject: Re: [PATCH] Kbuild: set LC_MESSAGES=C (as LC_CTYPE=C is)

On 12/25/2009 05:17 PM, Roland Dreier wrote:
>
> > The whole reason with only setting some LC_* to C was to be able to
> > leave LC_MESSAGES intact, but it seems it breaks on too many real-life
> > systems.
>
> > As such, I suggest we should set LC_ALL=C and get rid of the rest of it:
>
> Seems unfortunate to lose localized error messages. (Although in my
> en_US.UTF-8 case, all I get is non-ASCII quote characters)
>

The whole problem is that for some people we lose *all* messages. This
all seems very strange to me, but I guess it tweaks some internal
detail inside the glibc message library, sigh.

> This all started because of the awk invocation in arch/x86/lib. Maybe
> the best idea would be to confine the locale monkeying to that one
> place?

Except that sed, etc. and even the shell itself have the same class of
problems. Perl doesn't, since it has saner rules for how regular
expressions handle ranges.
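
To make the range problem concrete, here is a minimal sketch of pinning the locale for a single sed call and for a shell pattern match. The helper names strip_lower and is_ascii_lower are hypothetical and exist only for this illustration; nothing like them is in the tree. With LC_ALL=C in effect, [a-z] should cover the ASCII lowercase letters and nothing else.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; neither exists in the kernel tree.

# Delete ASCII lowercase letters from the argument. The LC_ALL=C prefix
# applies to this single sed process only.
strip_lower() {
    printf '%s\n' "$1" | LC_ALL=C sed 's/[a-z]//g'
}

# Succeed only if the argument is exactly one ASCII lowercase letter.
# The function body is a subshell, so the locale change does not leak
# into the caller.
is_ascii_lower() (
    LC_ALL=C
    export LC_ALL
    case "$1" in
        [a-z]) exit 0 ;;
        *)     exit 1 ;;
    esac
)

strip_lower 'aBc'                          # prints "B"
is_ascii_lower b && echo 'b: lowercase'
is_ascii_lower B || echo 'B: not lowercase'
```

Without the override, whether [a-z] also matches uppercase letters depends on the user's collation order and on the tool, which is the class of problem described above.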

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-12-26 06:58:31

by Roland Dreier

Subject: Re: [PATCH] Kbuild: set LC_MESSAGES=C (as LC_CTYPE=C is)


> > Seems unfortunate to lose localized error messages. (Although in my
> > en_US.UTF-8 case, all I get is non-ASCII quote characters)

> The whole problem is that for some people we lose *all* messages. This
> seems all very strange to me at all, but I guess it tweaks some internal
> detail inside the glibc message library, sigh.

I just meant that people used to be able to get localized error messages
by setting LANG or whatever. And now they're stuck with ASCII English.

> > This all started because of the awk invocation in arch/x86/lib. Maybe
> > the best idea would be to confine the locale monkeying to that one
> > place?

> Except that sed, etc. and even the shell itself have the same class of
> problems. Perl doesn't, since it has saner rules for how regular
> expressions handle ranges.

But pretty much everyone on a modern distro has had a UTF-8 locale for
quite a while. And as far as I know there have been no problems caused
by collation order or anything else. So this change to always build in
the C locale is just worrying about theoretical problems.

Anyway, not a big deal I guess.

- R.

2009-12-26 20:05:17

by H. Peter Anvin

Subject: Re: [PATCH] Kbuild: set LC_MESSAGES=C (as LC_CTYPE=C is)

On 12/25/2009 05:17 PM, Roland Dreier wrote:
>
> > The whole reason with only setting some LC_* to C was to be able to
> > leave LC_MESSAGES intact, but it seems it breaks on too many real-life
> > systems.
>
> > As such, I suggest we should set LC_ALL=C and get rid of the rest of it:
>
> Seems unfortunate to lose localized error messages. (Although in my
> en_US.UTF-8 case, all I get is non-ASCII quote characters)
>
> This all started because of the awk invocation in arch/x86/lib. Maybe
> the best idea would be to confine the locale monkeying to that one
> place?
>

It is also possible that setting only LC_COLLATE will solve the most
fundamental problem, which is the one of character ranges. LC_COLLATE
probably will interfere less with LC_MESSAGES than the setting of LC_CTYPE.

It's still bloody broken that glibc malfunctions like that for an
LC_MESSAGES/LC_CTYPE intentional mismatch, but, sigh, that's glibc for you.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-01-04 14:44:46

by Michal Marek

Subject: Re: [PATCH] Kbuild: set LC_MESSAGES=C (as LC_CTYPE=C is)

On 26.12.2009 21:04, H. Peter Anvin wrote:
> On 12/25/2009 05:17 PM, Roland Dreier wrote:
>>
>> > The whole reason with only setting some LC_* to C was to be able to
>> > leave LC_MESSAGES intact, but it seems it breaks on too many real-life
>> > systems.
>>
>> > As such, I suggest we should set LC_ALL=C and get rid of the rest of it:
>>
>> Seems unfortunate to lose localized error messages. (Although in my
>> en_US.UTF-8 case, all I get is non-ASCII quote characters)
>>
>> This all started because of the awk invocation in arch/x86/lib. Maybe
>> the best idea would be to confine the locale monkeying to that one
>> place?
>>
>
> It is also possible that setting only LC_COLLATE will solve the most
> fundamental problem, which is the one of character ranges. LC_COLLATE
> probably will interfere less with LC_MESSAGES than the setting of LC_CTYPE.

We need LC_COLLATE=C so that [a-z] really means lowercase ASCII letters
and nothing else (most importantly not uppercase letters) in awk, sed
and the shell. If we stay with LC_CTYPE=$userdefined, the meaning of
[[:classes:]] becomes nondeterministic and so does the mapping of
lowercase and uppercase characters:

$ echo iI | LC_CTYPE=tr_TR.UTF-8 awk '{ print $0 " " toupper($0) " " tolower($0) }'
iI İI iı

Character classes are probably not a big issue (modulo the fact that
mawk doesn't seem to support them), because the input is ASCII text
anyway. Regarding the tolower()/toupper() functions, I found one
potential troublemaker:

$ git grep -E 'to(lower|upper)' | grep -v '\.[ch]:'
arch/sh/tools/gen-mach-types: tolower(mach[i]), mach[i]);

Maybe this awk script should be run with LC_ALL=C, since people mostly
care about (localized) messages from gcc, not from awk.
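
A minimal sketch of confining the override to one awk invocation, assuming a hypothetical wrapper named awk_c (not something in the tree, and not how gen-mach-types is actually invoked). LC_ALL takes precedence over the individual LC_* variables and LANG, so prefixing the command is enough, and every other tool in the build still sees the user's locale.

```shell
#!/bin/sh
# Hypothetical wrapper for illustration: run awk in the C locale for this
# one invocation, leaving the caller's environment untouched.
awk_c() {
    LC_ALL=C awk "$@"
}

# Same input as the tr_TR.UTF-8 example, but with ASCII-only case mapping.
echo iI | awk_c '{ print $0 " " toupper($0) " " tolower($0) }'
# prints: iI II ii
```

The variable assignment is scoped to the awk process, so gcc and anything else started afterwards would still honour the user's LC_MESSAGES.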

Michal