2006-02-27 12:28:12

by Max Kellermann

Subject: 2.6.16-rc[1-5]: soft lockups on Athlon64 X2

Hi,

Kernel 2.6.16-rc5 reports soft lockups when run in SMP mode. When
booted with "nosmp" (same kernel), everything is fine. Hardware: Asus
A8N-SLI (nForce4), Athlon64 X2. I also tested rc1, rc3 and rc4, which
show the same error; 2.6.15.4 is OK.

The soft lockups always occur when I try to mount a dm-crypted XFS
partition (another dm-crypted reiserfs partition on the same disk has
been mounted previously - both AES).

I have attached my kernel configuration and the kernel messages. The
compressed System.map is 280 kB, so I have not attached it to this
email; it is available here:

http://www.duempel.org/~max/linux/System.map-2.6.16-rc5-woodpecker.bz2

Max


Attachments:
(No filename) (666.00 B)
config-2.6.16-rc5-woodpecker (35.00 kB)
minicom.cap (32.46 kB)

2006-03-01 06:21:00

by Andrew Morton

Subject: Re: 2.6.16-rc[1-5]: soft lockups on Athlon64 X2

Max Kellermann <[email protected]> wrote:
>
> Kernel 2.6.16-rc5 reports soft lockups when run in SMP mode. When
> booted with "nosmp" (same kernel), everything is fine. Hardware: Asus
> A8N-SLI (nForce4), Athlon64 X2. I also tested rc1, rc3 and rc4, which
> show the same error; 2.6.15.4 is OK.
>
> The soft lockups always occur when I try to mount a dm-crypted XFS
> partition (another dm-crypted reiserfs partition on the same disk has
> been mounted previously - both AES).

Strange. I did remove a cond_resched() from invalidate_mapping_pages() so
that it could be run under spinlock but I cannot believe that you had so
many pages cached that the invalidate took more than ten seconds.

Does the machine recover and otherwise work OK?

Does it appear to you that the mount _really_ took over ten seconds of
system CPU time?

2006-03-01 10:08:51

by Max Kellermann

Subject: Re: 2.6.16-rc[1-5]: soft lockups on Athlon64 X2

On 2006/03/01 07:19, Andrew Morton <[email protected]> wrote:
> Strange. I did remove a cond_resched() from
> invalidate_mapping_pages() so that it could be run under spinlock
> but I cannot believe that you had so many pages cached that the
> invalidate took more than ten seconds.
>
> Does the machine recover and otherwise work OK?

The mount indeed took more than ten (wallclock) seconds during which
the lockups occurred, and after that it seemed usable (I rebooted
shortly after because I feared fs corruption). Normally, this mount
takes 2 seconds or so - it's a crypted 200GB XFS partition. The same
goes for the "bugged" kernel with "nosmp": mount is finished quickly.

Max

2006-03-01 10:23:49

by Andrew Morton

Subject: Re: 2.6.16-rc[1-5]: soft lockups on Athlon64 X2

Max Kellermann <[email protected]> wrote:
>
> On 2006/03/01 07:19, Andrew Morton <[email protected]> wrote:
> > Strange. I did remove a cond_resched() from
> > invalidate_mapping_pages() so that it could be run under spinlock
> > but I cannot believe that you had so many pages cached that the
> > invalidate took more than ten seconds.
> >
> > Does the machine recover and otherwise work OK?
>
> The mount indeed took more than ten (wallclock) seconds during which
> the lockups occurred, and after that it seemed usable (I rebooted
> shortly after because I feared fs corruption). Normally, this mount
> takes 2 seconds or so - it's a crypted 200GB XFS partition. The same
> goes for the "bugged" kernel with "nosmp": mount is finished quickly.
>

How is it encrypted? (With which kernel encryption stuff?)

I guess it'd be useful to see where all that time is spent, if you have
time. Enable CONFIG_PROFILING, boot with `profile=1', do:

readprofile -r
mount ...
readprofile -n -v -m /boot/System.map | sort -n -k 3 | tail -40

(Make sure it's the correct System.map).
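
The three steps above could be wrapped in a small helper, roughly as
sketched below. This is only an illustration: the function names
system_map_for and profile_cmd, the DRY_RUN switch, and the assumption
that the matching map is installed as /boot/System.map-<release> are not
from this thread. Adjust the path if your System.map lives elsewhere.

```shell
#!/bin/sh
# Hypothetical helper; function names and the DRY_RUN switch are
# illustrative assumptions, not part of readprofile or the kernel.

# Print the System.map path expected for a given kernel release.
# Assumes the common /boot/System.map-<release> naming convention.
system_map_for() {
    printf '/boot/System.map-%s\n' "$1"
}

# Profile one command: reset the counters, run the command, then print
# the 40 entries that sort highest on the third column.
# With DRY_RUN=1 the steps are only printed, not executed.
profile_cmd() {
    map=$(system_map_for "$(uname -r)")
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "readprofile -r"
        echo "$*"
        echo "readprofile -n -v -m $map | sort -n -k 3 | tail -40"
        return 0
    fi
    # Refuse to continue without a map for the running kernel, since a
    # mismatched System.map would attribute samples to wrong symbols.
    [ -r "$map" ] || { echo "missing $map" >&2; return 1; }
    readprofile -r || return 1
    "$@"
    readprofile -n -v -m "$map" | sort -n -k 3 | tail -40
}
```

After sourcing this, something like `profile_cmd mount ...` (as root, on
a kernel booted with profile=1) would run the same sequence. Deriving
the map name from `uname -r` is one way to reduce the risk of reading
the profile against the wrong System.map.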

2006-03-01 22:24:12

by Max Kellermann

Subject: Re: 2.6.16-rc[1-5]: soft lockups on Athlon64 X2

On 2006/03/01 11:22, Andrew Morton <[email protected]> wrote:
> I guess it'd be useful to see where all that time is spent, if you have
> time. Enable CONFIG_PROFILING, boot with `profile=1', do:
>
> readprofile -r
> mount ...
> readprofile -n -v -m /boot/System.map | sort -n -k 3 | tail -40

Here it is. As an explanation of the profile's scope, I have also
sent the shell script which reproduces the problem on my machine.

The script however is not 100% reliable; sometimes, the lockup just
won't occur. Sounds like a timing problem to me.

btw. my other partitions are also reiserfs (but not encrypted).

Max


Attachments:
(No filename) (616.00 B)
profile (1.68 kB)
trigger_soft_lockup.sh (500.00 B)