2013-03-27 01:55:59

by Robert Norris

Subject: PROBLEM: All CPUs in soft lockup

In the last two weeks we've had three servers (identical hardware,
software and load) hang. The details in this report are from one that
hung last night.

They're all IMAP servers servicing many hundreds of users, so several
thousand processes and active connections. There have been two major
application-level changes in the last couple of weeks, corresponding to
the time when these hangs started. One is that we now do mail event
notifications directly to user clients, so more TCP connections. The
other is that we're now maintaining live search indexes, so a lot more
disk and tmpfs IO.

All that said, we're not under what we'd consider to be heavy load. When
they're running, the servers are fast and responsive.

During the hang itself, the machine responds to pings, and TCP
connections can be established, but the servicing processes never
respond. The console shows a new "BUG: soft lockup" line every few
seconds, and will not respond to keyboard input. It is a virtual console
though, which may or may not make a difference, I'm not sure.

The kernel is 3.4.33 with AUFS patches applied. However there are no
AUFS mounts on this machine; we use this elsewhere. If you think that's
a problem I can rebuild for this machine without it.

Attached are various bits of information requested in REPORTING-BUGS.
I'm not entirely sure what else is relevant. I'm happy to supply any
other information and test things, just let me know.

Thanks,
Rob.


Attachments:
4.1.version (116.00 B)
4.2.config (116.80 kB)
6.messages.gz (181.34 kB)
8.1.software (1.41 kB)
8.2.cpuinfo (13.78 kB)
8.3.modules (4.00 kB)
8.4.ioports-iomem (3.12 kB)
8.5.pci (64.00 kB)
8.6.scsi (1.10 kB)

2013-03-27 04:00:19

by Li Guang

Subject: Re: PROBLEM: All CPUs in soft lockup

It seems some tasks are hogging your CPU/memory resources.
Did you check the status of your servicing processes?

On Wed, 2013-03-27 at 12:55 +1100, Robert Norris wrote:
> In the last two weeks we've had three servers (identical hardware,
> software and load) hang. The details in this report are from one that
> hung last night.
>
> They're all IMAP servers servicing many hundreds of users, so several
> thousand processes and active connections. There have been two major
> application-level changes in the last couple of weeks, corresponding to
> the time when these hangs started. One is that we now do mail event
> notifications directly to user clients, so more TCP connections. The
> other is that we're now maintaining live search indexes, so a lot more
> disk and tmpfs IO.
>
> All that said, we're not under what we'd consider to be heavy load. When
> they're running, the servers are fast and responsive.
>
> During the hang itself, the machine responds to pings, and TCP
> connections can be established, but the servicing processes never
> respond. The console shows a new "BUG: soft lockup" line every few
> seconds, and will not respond to keyboard input. It is a virtual console
> though, which may or may not make a difference, I'm not sure.
>
> The kernel is 3.4.33 with AUFS patches applied. However there are no
> AUFS mounts on this machine; we use this elsewhere. If you think that's
> a problem I can rebuild for this machine without it.
>
> Attached are various bits of information requested in REPORTING-BUGS.
> I'm not entirely sure what else is relevant. I'm happy to supply any
> other information and test things, just let me know.
>
> Thanks,
> Rob.

2013-03-27 10:40:51

by Robert Norris

Subject: Re: PROBLEM: All CPUs in soft lockup

On Wed, Mar 27, 2013, at 02:42 PM, li guang wrote:
> It seems some tasks are hogging your CPU/memory resources. Did you check
> the status of your servicing processes?

According to my monitoring, I have plenty of CPU and memory free at the
time the problem occurs. What specifically are you looking at in the data
I provided that makes you think that?

2013-03-28 00:22:33

by Robert Norris

Subject: Re: PROBLEM: All CPUs in soft lockup

On Wed, Mar 27, 2013 at 12:55:41PM +1100, Robert Norris wrote:
> The console shows a new "BUG: soft lockup" line every few seconds

Looking closer, the whole thing starts with a _hard_ lockup.

2013-03-26T08:33:39.921834-04:00 imap30 kernel: [185090.090328] Watchdog detected hard LOCKUP on cpu 3

(also in the logs of the other two servers I mentioned).
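For anyone wanting to repeat this check on the other servers, here is a minimal sketch of a script that finds the earliest lockup message in a syslog file. It assumes the line layout shown above (ISO timestamp, hostname, `kernel:`, bracketed uptime) and matches lockup messages by substring only. The names `lockup_events` and `first_lockup` are made up for this example.

```python
import re

# Matches the syslog prefix seen in the report:
#   <ISO timestamp> <host> kernel: [<uptime seconds>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<host>\S+) kernel: "
    r"\[\s*(?P<uptime>\d+\.\d+)\] (?P<msg>.*)$"
)


def lockup_events(lines):
    """Yield (uptime, kind, message) for each kernel line mentioning a lockup."""
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m is None:
            continue  # not a kernel line in the expected layout
        msg = m.group("msg")
        lowered = msg.lower()
        if "hard lockup" in lowered:
            kind = "hard"
        elif "soft lockup" in lowered:
            kind = "soft"
        else:
            continue
        yield float(m.group("uptime")), kind, msg


def first_lockup(lines):
    """Return the lockup event with the smallest uptime, or None."""
    events = sorted(lockup_events(lines), key=lambda e: e[0])
    return events[0] if events else None


# The only sample line here is the one quoted in this message.
sample = [
    "2013-03-26T08:33:39.921834-04:00 imap30 kernel: "
    "[185090.090328] Watchdog detected hard LOCKUP on cpu 3",
]
print(first_lockup(sample))
```

To run it over a compressed log such as the attached 6.messages.gz, the lines can be read with `gzip.open(path, "rt")` and passed to `first_lockup`.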

Looking down to where the watchdog interrupt comes in:

2013-03-26T08:33:39.921870-04:00 imap30 kernel: [185090.090426] <<EOE>> <IRQ> [<ffffffff8112a5b1>] ? end_buffer_async_read+0x79/0xff

Disassembling:

0xffffffff8112a57a <+66>: mov %rbx,%rdi
0xffffffff8112a57d <+69>: callq 0xffffffff81129265 <buffer_io_error>
0xffffffff8112a582 <+74>: lock orb $0x2,0x0(%rbp)
0xffffffff8112a587 <+79>: mov 0x0(%rbp),%rax
0xffffffff8112a58b <+83>: test $0x8,%ah
0xffffffff8112a58e <+86>: jne 0xffffffff8112a594 <end_buffer_async_read+92>
0xffffffff8112a590 <+88>: ud2
0xffffffff8112a592 <+90>: jmp 0xffffffff8112a592 <end_buffer_async_read+90>

That lock at +74 is presumably the offender here, which is line 275 of fs/buffer.c:

275 SetPageError(page);

So another CPU has these page flags locked right now, and isn't keen to
release that lock?
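One thing worth double-checking: the offset in the trace (`+0x79/0xff`) is hexadecimal, while the `<+66>` to `<+90>` labels in the listing appear to be decimal. 0xffffffff8112a57a minus 66 and 0xffffffff8112a5b1 minus 0x79 both give the same function start, so the frame would sit at `<+121>`, past the end of the listing quoted above. The sketch below parses the frame and does that arithmetic. `parse_frame` is a hypothetical helper, and reading the leading `?` as "not verified by the unwinder" is my understanding of the trace format rather than something taken from the logs.

```python
import re

# Matches a kernel stack frame such as:
#   [<ffffffff8112a5b1>] ? end_buffer_async_read+0x79/0xff
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<guess>\? )?"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frame(text):
    """Parse one stack frame; offset and size are hex in the trace."""
    m = FRAME_RE.search(text)
    if m is None:
        return None
    return {
        "address": int(m.group("addr"), 16),
        "symbol": m.group("sym"),
        "offset": int(m.group("off"), 16),   # decimal value of the hex offset
        "size": int(m.group("size"), 16),
        "reliable": m.group("guess") is None,  # assumption: "?" marks a guess
    }


frame = parse_frame(
    "<<EOE>> <IRQ> [<ffffffff8112a5b1>] ? end_buffer_async_read+0x79/0xff"
)
func_start = frame["address"] - frame["offset"]

# Offset as a disassembler printing decimal <+N> labels would show it.
print(frame["symbol"], "<+%d>" % frame["offset"], hex(func_start))

# Cross-check against the first quoted disassembly line (<+66>).
print(hex(0xFFFFFFFF8112A57A - 66))
```

If the two printed start addresses agree, the instruction to look at is the one just before `<+121>` rather than the `lock orb` at `<+74>`.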

I don't know how to debug this further. What's the next step?

Thanks,
Rob.