2004-10-19 15:15:21

by Richard B. Johnson

Subject: Register corruption --patch

Hello,

This 'C' compiler destroys parameters passed to functions
even though the code does not alter that parameter.

gcc (GCC) 3.3.3 20040412 (Red Hat Linux 3.3.3-7)
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

For instance:

asmlinkage void __up(struct semaphore *sem)
{
	wake_up(&sem->wait);
}

This was from /usr/src/linux-2.6.9/arch/i386/kernel/semaphore.c
In this case, the value of 'sem' is destroyed which means that
certain assembly-language helper functions no longer work.

This was discovered by Aleksey Gorelov <[email protected]>

I have been having trouble with mysterious things like:

(1) Code that sleeps while holding a semaphore sometimes never
releases that semaphore.
(2) SCSI disk files disappearing after boot.
(3) SCSI disk corruption preventing mounting after a boot.
(4) Data errors in email.
(5) Network connections failing to go away; `netstat -c` shows
hundreds of lines of very old history.
... etc.

The 'C' compiler is provided in a recent Fedora distribution.

The following patch seems to fix it all!


--- linux-2.6.9/arch/i386/kernel/semaphore.c.orig 2004-08-14 01:36:56.000000000 -0400
+++ linux-2.6.9/arch/i386/kernel/semaphore.c 2004-10-19 08:06:15.000000000 -0400
@@ -198,9 +198,11 @@
#endif
"pushl %eax\n\t"
"pushl %edx\n\t"
- "pushl %ecx\n\t"
+ "pushl %ecx\n\t" // Register to save
+ "pushl %ecx\n\t" // Passed parameter
"call __down\n\t"
- "popl %ecx\n\t"
+ "addl $0x04, %esp\n\t" // Bypass corrupted parameter
+ "popl %ecx\n\t" // Restore original
"popl %edx\n\t"
"popl %eax\n\t"
#if defined(CONFIG_FRAME_POINTER)
@@ -220,9 +222,11 @@
"movl %esp,%ebp\n\t"
#endif
"pushl %edx\n\t"
- "pushl %ecx\n\t"
+ "pushl %ecx\n\t" // Save register
+ "pushl %ecx\n\t" // Passed parameter
"call __down_interruptible\n\t"
- "popl %ecx\n\t"
+ "addl $0x04, %esp\n\t" // Bypass corrupted parameter
+ "popl %ecx\n\t" // Restore register
"popl %edx\n\t"
#if defined(CONFIG_FRAME_POINTER)
"movl %ebp,%esp\n\t"
@@ -241,9 +245,11 @@
"movl %esp,%ebp\n\t"
#endif
"pushl %edx\n\t"
- "pushl %ecx\n\t"
+ "pushl %ecx\n\t" // Save register
+ "pushl %ecx\n\t" // Passed parameter
"call __down_trylock\n\t"
- "popl %ecx\n\t"
+ "addl $0x04, %esp\n\t" // Bypass corrupted parameter
+ "popl %ecx\n\t" // Restore register
"popl %edx\n\t"
#if defined(CONFIG_FRAME_POINTER)
"movl %ebp,%esp\n\t"
@@ -259,9 +265,11 @@
"__up_wakeup:\n\t"
"pushl %eax\n\t"
"pushl %edx\n\t"
- "pushl %ecx\n\t"
+ "pushl %ecx\n\t" // Save register
+ "pushl %ecx\n\t" // Passed parameter
"call __up\n\t"
- "popl %ecx\n\t"
+ "addl $0x04, %esp\n\t" // Bypass corrupted parameter
+ "popl %ecx\n\t" // Restore register
"popl %edx\n\t"
"popl %eax\n\t"
"ret"


I think these 'helper' functions are no longer useful because
they counted on a certain behavior of a 'C' compiler. This
behavior may not continue to exist. This patch is a temporary
solution to the observed problem. The correct solution is
probably to get rid of these 'helper' functions altogether.
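
As an illustration only (it is not part of the patch), the effect of
the extra push can be modelled in plain C. The sketch below assumes a
toy stack of 32-bit slots and a hypothetical callee that overwrites its
own argument slot with 0xdeadbeef, which is the kind of damage reported
in this message. The names toy_cpu, toy_call_clobbering,
helper_original and helper_patched are invented for this example and
do not exist in the kernel; only %ecx is modelled, the %eax and %edx
saves are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a downward-growing stack of 32-bit slots. */
enum { STACK_SLOTS = 16 };

struct toy_cpu {
	uint32_t ecx;                 /* stands in for %ecx */
	uint32_t stack[STACK_SLOTS];
	int sp;                       /* index of top slot; STACK_SLOTS = empty */
};

static void toy_init(struct toy_cpu *c, uint32_t ecx)
{
	c->ecx = ecx;
	c->sp = STACK_SLOTS;
}

static void toy_push(struct toy_cpu *c, uint32_t v)
{
	assert(c->sp > 0);
	c->stack[--c->sp] = v;
}

static uint32_t toy_pop(struct toy_cpu *c)
{
	assert(c->sp < STACK_SLOTS);
	return c->stack[c->sp++];
}

/* Hypothetical callee: scribbles on its own argument slot at 4(%esp). */
static void toy_call_clobbering(struct toy_cpu *c)
{
	toy_push(c, 0);                       /* "call": dummy return address */
	assert(c->sp + 1 < STACK_SLOTS);
	c->stack[c->sp + 1] = 0xdeadbeefu;    /* argument slot is overwritten */
	(void)toy_pop(c);                     /* "ret" */
}

/* Original sequence: one slot is both the saved %ecx and the argument. */
static uint32_t helper_original(uint32_t ecx)
{
	struct toy_cpu c;

	toy_init(&c, ecx);
	toy_push(&c, c.ecx);          /* pushl %ecx */
	toy_call_clobbering(&c);      /* call __up  */
	c.ecx = toy_pop(&c);          /* popl %ecx: reloads the clobbered slot */
	return c.ecx;
}

/* Patched sequence: separate slots for the saved value and the argument. */
static uint32_t helper_patched(uint32_t ecx)
{
	struct toy_cpu c;

	toy_init(&c, ecx);
	toy_push(&c, c.ecx);          /* pushl %ecx (register to save) */
	toy_push(&c, c.ecx);          /* pushl %ecx (passed parameter) */
	toy_call_clobbering(&c);      /* call __up                     */
	c.sp += 1;                    /* addl $0x04, %esp              */
	c.ecx = toy_pop(&c);          /* popl %ecx: restores the original */
	return c.ecx;
}
```

In this model helper_original(0x1234) hands back the clobbered value,
while helper_patched(0x1234) hands back 0x1234: the second push gives
the callee a scratch copy, and the addl discards that copy before the
real saved value is popped.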

Linus, please check this out.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it!


2004-10-20 01:03:29

by Jacek Kawa

Subject: Re: Register corruption --patch

Richard B. Johnson wrote:

> This 'C' compiler destroys parameters passed to functions
> even though the code does not alter that parameter.
[example]
> This was from /usr/src/linux-2.6.9/arch/i386/kernel/semaphore.c
> In this case, the value of 'sem' is destroyed which means that
> certain assembly-language helper functions no longer work.
>
> This was discovered by Aleksey Gorelov <[email protected]>
>
> I have been having trouble with mysterious things like:
[...]
> (4) Data errors in email.
> (5) Network connections failing to go away; `netstat -c` shows
> hundreds of lines of very old history.
> ... etc.
>

Having trouble with some strange (and, as it seems, temporary)
data corruption here[*], I was wondering whether it would be
possible to diagnose this easily somehow.

[*] For example, running diff several times over the same two files
shows one altered character only once in a while.

bye

--
Jacek Kawa **Define the universe. Give three examples.** [r.h.f.r]

2004-10-20 12:56:40

by Richard B. Johnson

Subject: Re: Register corruption --patch

On Wed, 20 Oct 2004, Jacek Kawa wrote:

> Richard B. Johnson wrote:
>
>> This 'C' compiler destroys parameters passed to functions
>> even though the code does not alter that parameter.
> [example]
>> This was from /usr/src/linux-2.6.9/arch/i386/kernel/semaphore.c
>> In this case, the value of 'sem' is destroyed which means that
>> certain assembly-language helper functions no longer work.
>>
>> This was discovered by Aleksey Gorelov <[email protected]>
>>
>> I have been having trouble with mysterious things like:
> [...]
>> (4) Data errors in email.
>> (5) Network connections failing to go away; `netstat -c` shows
>> hundreds of lines of very old history.
>> ... etc.
>>
>
> Having trouble with some strange (and, as it seems, temporary)
> data corruption here[*], I was wondering whether it would be
> possible to diagnose this easily somehow.
>
> [*] For example, running diff several times over the same two files
> shows one altered character only once in a while.
>
> bye

Register corruption produces strange, seemingly unrelated errors.
The problem shown is related to semaphores. Almost every time I/O
is performed on a shared device, a semaphore is used to obtain
temporary exclusive use of the device. If the semaphore code
is corrupting registers, which it can with recent 'C' compilers,
the result can be random corruption.

You should just patch the kernel with the patch I provided.
It will even patch 2.4.x code because semaphore.c hasn't been
changed very often.

If the corruption goes away, you've either fixed the problem
or have changed the size of something so that something that
was getting trashed before by some completely-unrelated code,
is now able to survive.

Without some specific OOPS, some code to trace, it's just
a crap game. But, the semaphore patch can't hurt anything.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 GrumpyMips).
98.36% of all statistics are fiction.

2004-10-20 20:43:13

by Richard B. Johnson

Subject: Re: Register corruption --patch

On Wed, 20 Oct 2004, Jacek Kawa wrote:

> Richard B. Johnson wrote
>
>> On Wed, 20 Oct 2004, Jacek Kawa wrote:
>>> Richard B. Johnson wrote:
>>>
>>>> This 'C' compiler destroys parameters passed to functions
>>>> even though the code does not alter that parameter.
> [...]
>>>> I have been having trouble with mysterious things like:
>>> [...]
>>>> (4) Data errors in email.
>>>> (5) Network connections failing to go away; `netstat -c` shows
>>>> hundreds of lines of very old history.
>>>> ... etc.
>>>
>>> Having trouble with some strange (and, as it seems, temporary)
>>> data corruption here[*], I was wondering whether it would be
>>> possible to diagnose this easily somehow.
>>>
>>> [*] For example, running diff several times over the same two files
>>> shows one altered character only once in a while.
> [...]
>> If the corruption goes away, you've either fixed the problem
>> or have changed the size of something so that something that
>> was getting trashed before by some completely-unrelated code,
>> is now able to survive.
>
> In a way, the patch helped to track down the error: while compiling
> the new kernel[*] I was hit by a SEGFAULT, so I ran memtest....
> Well, it's not new RAM, so it goes away now, and I will give
> plain 2.6.9 another try.
>
> [*] I compiled -rc4 and -final (well, even twice) not so long
> ago and everything was fine then. :/
>
>> Without some specific OOPS, some code to trace, it's just
>> a crap game. But, the semaphore patch can't hurt anything.
>
> Thanks for the explanation. I will apply the workaround in case
> the 'mysterious' corruption reappears.
>
> BTW, could it be that CONFIG_REGPARM makes the problem visible with
> your compiler (somehow)?
>

That's disabled in my .config.

> --
> Jacek Kawa **So, logically... If she weighs the same as a duck,
> she's made of wood. And therefore-? A witch! A witch!**
>

Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 GrumpyMips).
98.36% of all statistics are fiction.

2004-10-20 20:59:41

by Jacek Kawa

Subject: Re: Register corruption --patch

Richard B. Johnson wrote

> On Wed, 20 Oct 2004, Jacek Kawa wrote:
> >Richard B. Johnson wrote:
> >
> >>This 'C' compiler destroys parameters passed to functions
> >>even though the code does not alter that parameter.
[...]
> >>I have been having trouble with mysterious things like:
> >[...]
> >>(4) Data errors in email.
> >>(5) Network connections failing to go away; `netstat -c` shows
> >>hundreds of lines of very old history.
> >>... etc.
> >
> >Having trouble with some strange (and, as it seems, temporary)
> >data corruption here[*], I was wondering whether it would be
> >possible to diagnose this easily somehow.
> >
> >[*] For example, running diff several times over the same two files
> >shows one altered character only once in a while.
[...]
> If the corruption goes away, you've either fixed the problem
> or have changed the size of something so that something that
> was getting trashed before by some completely-unrelated code,
> is now able to survive.

In a way, the patch helped to track down the error: while compiling
the new kernel[*] I was hit by a SEGFAULT, so I ran memtest....
Well, it's not new RAM, so it goes away now, and I will give
plain 2.6.9 another try.

[*] I compiled -rc4 and -final (well, even twice) not so long
ago and everything was fine then. :/

> Without some specific OOPS, some code to trace, it's just
> a crap game. But, the semaphore patch can't hurt anything.

Thanks for the explanation. I will apply the workaround in case
the 'mysterious' corruption reappears.

BTW, could it be that CONFIG_REGPARM makes the problem visible with
your compiler (somehow)?

--
Jacek Kawa **So, logically... If she weighs the same as a duck,
she's made of wood. And therefore-? A witch! A witch!**