2012-11-08 00:05:58

by Joseph Parmelee

Subject: Binutils test suite freezes kernel

Greetings:

The gas test suite in recent binutils snapshots from
ftp://sourceware.org/pub/binutils/snapshots/ consistently freezes my i386
custom-built kernels. This may be a kernel configuration problem but if so
it has manifested only recently. I have been building kernels since 1995
and this is the first instance I have seen where the kernel is brought down
by a non-privileged user space process. AIUI this should be impossible
regardless of what that process is doing. The problem affects all kernels
between 3.6.2 and 3.6.6. These are merely the kernels where I have seen the
problem; it may well affect other kernels.

My system uses a raid1 array of two SATA disks, each having a root partition
and a much smaller swap partition. Because the raid arrays have been in use
since 2001 on various disks over the years they use the older kernel
automatic raid detection metadata.

When the freeze occurs, most (though not always all) system processes stop:
I can change virtual terminals but cannot enter characters into any of
them, apart from magic SysRq keys. Often this also affects telnet from other
hosts, but not always. If I can kill the test process, either through
telnet or magic SysRq keys, the system recovers, though it appears that the
system clock has also been stopped during the freeze.

If however I press the reset button during the freeze, this results in a
reconstruction of the raided swap partition on system restart. What is most
striking is that this reconstruction is not always successful because of
hard disk errors in one of the swap partitions. They are unrecoverable CRC
read errors which cause the affected partition to be kicked out of the raid
array. However, they disappear when the badblocks program is run with the -w
(write then read) option on the affected partition. The partition can then
be added back into the array without further incident. This suggests to me
that sometimes the system freeze occurs in the middle of swap sector writes
such that they are actually bad on the disk. Just how that is happening is
a mystery to me.

I do not pretend to understand what is happening here but I will do what I
can to provide whatever additional information may be necessary.

Please CC me directly as I am no longer subscribed to the list.

Yours,

Joseph
jparmele at wildbear dot com


2012-11-29 20:11:11

by Andi Kleen

Subject: Re: Binutils test suite freezes kernel

Joseph Parmelee <[email protected]> writes:

> Greetings:
>
> The gas test suite in recent binutils snapshots from
> ftp://sourceware.org/pub/binutils/snapshots/ consistently freezes my i386
> custom-built kernels. This may be a kernel configuration problem but if so
> it has manifested only recently. I have been building kernels since 1995
> and this is the first instance I have seen where the kernel is brought down
> by a non-privileged user space process. AIUI this should be impossible
> regardless of what that process is doing. The problem affects all kernels
> between 3.6.2 and 3.6.6. These are merely the kernels where I have seen the
> problem; it may well affect other kernels.

A common cause of this would be running out of memory.
While this should eventually resolve itself it may take a long time
and the system may appear frozen.

I would rerun with an appropriate ulimit setting.
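
As a rough illustration of that suggestion, the sketch below runs a command
with its address space capped, which is what `ulimit -v` does in a shell. The
helper name run_with_memory_cap and the choice of RLIMIT_AS are assumptions
made for this example, not something specified in the thread.

```python
import resource
import subprocess


def run_with_memory_cap(argv, max_bytes):
    """Run argv with its address space capped at max_bytes; return its exit status.

    Hypothetical helper for illustration. It lowers only the soft RLIMIT_AS
    limit and leaves the hard limit untouched.
    """
    def apply_limit():
        # Runs in the child between fork() and exec(), so only the child
        # process (and its descendants) is limited.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard == resource.RLIM_INFINITY:
            soft = max_bytes
        else:
            soft = min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    return subprocess.run(argv, preexec_fn=apply_limit).returncode
```

With such a cap in place, a runaway process should fail with an allocation
error instead of pushing the machine into heavy swapping. For instance,
run_with_memory_cap(["as", "-o", "rept.o", "rept.s"], 512 * 1024 * 1024)
would limit the assembler to 512 MiB; the file names here are placeholders.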

-Andi

--
[email protected] -- Speaking for myself only

2012-11-30 02:02:33

by Joseph Parmelee

Subject: Re: Binutils test suite freezes kernel

On Thu, 29 Nov 2012, Andi Kleen wrote:

> Joseph Parmelee <[email protected]> writes:
>
>> Greetings:
>>
>> The gas test suite in recent binutils snapshots from
>> ftp://sourceware.org/pub/binutils/snapshots/ consistently freezes my i386
>> custom-built kernels. This may be a kernel configuration problem but if so
>> it has manifested only recently. I have been building kernels since 1995
>> and this is the first instance I have seen where the kernel is brought down
>> by a non-privileged user space process. AIUI this should be impossible
>> regardless of what that process is doing. The problem affects all kernels
>> between 3.6.2 and 3.6.6. These are merely the kernels where I have seen the
>> problem; it may well affect other kernels.
>
> A common cause of this would be running out of memory.
> While this should eventually resolve itself it may take a long time
> and the system may appear frozen.
>
> I would rerun with an appropriate ulimit setting.
>
> -Andi
>
> --
> [email protected] -- Speaking for myself only
>

I appreciate your response, but there is no ulimit on memory usage defined
and this system has 1 GB of RAM and 10 GB of swap.

The offending test is assembly of rept.s in the gas testsuite which
generates a labeled data segment of about 55 MB containing all zeroes. This
is far less than the memory available. Running just this assembly by itself
with the gas in the snapshot will freeze the system. This is probably the
kswapd problem being discussed now on the mailing list.
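
For anyone wanting a comparable input without the snapshot, the sketch below
generates a stand-in assembly source. It is not the actual rept.s from the
gas testsuite, whose contents are not reproduced here; it only assumes that a
single labeled block of about 55 MB of zeroes, built with the GNU as .rept
directive, exercises the assembler in a similar way. The function name, the
label, and the output file name are all hypothetical.

```python
def make_zero_block_source(total_bytes, label="zeros"):
    """Return GNU as source that emits total_bytes of zeroes under one label.

    Hypothetical stand-in for illustration, not the real testsuite file.
    """
    if total_bytes <= 0 or total_bytes % 4:
        raise ValueError("total_bytes must be a positive multiple of 4")
    words = total_bytes // 4  # each .long emits 4 bytes
    return (
        "\t.data\n"
        f"{label}:\n"
        f"\t.rept {words}\n"
        "\t.long 0\n"
        "\t.endr\n"
    )


def write_zero_block_source(path, total_bytes):
    """Write the generated source to path and return the number of characters written."""
    source = make_zero_block_source(total_bytes)
    with open(path, "w") as handle:
        return handle.write(source)
```

Calling write_zero_block_source("zeros.s", 55 * 1024 * 1024) and then running
"as -o zeros.o zeros.s" would be one way to try this. On an affected kernel
that may freeze the machine, so it is best attempted only on a system that
can be reset safely.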

At the time I posted, I was unaware that I also had a hardware problem with
one of the SATA controllers, probably induced by lightning, which did other
damage in October. I have since fixed that issue, which gets rid of the bad
blocks and the reconstruction failures. But the reconstruction is still
occurring because the system is still freezing under the md layer with the
swap array often in an inconsistent state.

With the latest changes in 3.6.8 I can sometimes run the test once without
failures. But repeating the test has so far always resulted in a system
freeze with swap reconstruction usually (but not always) occurring on
reboot. At one point I left it for 30 minutes before using SysRq to sync,
unmount, and reboot, to be sure that it was really frozen. This certainly
sounds like an infinite loop in kswapd as described by others.

Because this machine is in use for other purposes I am unable to run very
many such tests, so I will wait to see what patches the developers produce
and test some more after upgrading to the next "stable" kernel. I will
squawk to the list again if the fix doesn't work for me.

Thanks for your interest.

Joseph