2011-03-29 04:10:18

by Dave Jones

[permalink] [raw]
Subject: vma corruption in today's -git

Running a kernel that I built from a git pull from this afternoon,
I ran a barrage of stress tests, including some of the new syscall fuzzing
work that I've been working on. Everything seemed just fine, except that
after I had stopped running everything, loadavg was still at 6, with
top showing kworker processes running at high % of cpu.

I ran perf top, and got this ..

BUG: unable to handle kernel paging request at fffffffffffffff3
IP: [<ffffffff8118d3e9>] vma_stop.clone.3+0x18/0x33
PGD 1a05067 PUD 1a06067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/LNXSYSTM:00/device:00/PNP0C0A:00/power_supply/BAT0/energy_full
CPU 0
Modules linked in: ipx p8022 psnap llc p8023 sctp libcrc32c tun vfat fat usb_storage uas nfs fscache fuse nfsd lockd nfs_acl auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 uinput arc4 iwlagn btusb bluetooth snd_hda_codec_hdmi snd_hda_codec_idt mac80211 snd_hda_intel snd_usb_audio snd_hda_codec dell_wmi snd_seq sparse_keymap uvcvideo snd_pcm dell_laptop snd_hwdep dcdbas snd_usbmidi_lib microcode videodev zaurus cfg80211 snd_rawmidi snd_seq_device cdc_ether v4l2_compat_ioctl32 usbnet snd_timer iTCO_wdt pcspkr joydev cdc_acm i2c_i801 cdc_wdm tg3 snd mii iTCO_vendor_support rfkill soundcore snd_page_alloc wmi i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]

Pid: 324, comm: perf Tainted: G W 2.6.38+ #5 Dell Inc. Adamo 13 /0N70T0
RIP: 0010:[<ffffffff8118d3e9>] [<ffffffff8118d3e9>] vma_stop.clone.3+0x18/0x33
RSP: 0018:ffff880114de1e48 EFLAGS: 00010286
RAX: 00000000fffffff3 RBX: ffff880007f44bc8 RCX: ffff8800b7216806
RDX: ffffffff81618f50 RSI: fffffffffffffff3 RDI: ffff880007f44bd8
RBP: ffff880114de1e58 R08: 0000000000000001 R09: 0000000000000000
R10: ffffffff81a26870 R11: 0000000000000246 R12: ffff88000fe3d880
R13: ffff8800b72168f8 R14: fffffffffffffff3 R15: 0000000000000000
FS: 00007f59e738c720(0000) GS:ffff88013fa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffffffffffff3 CR3: 0000000136566000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process perf (pid: 324, threadinfo ffff880114de0000, task ffff8800b43c23b0)
Stack:
0000000000000000 ffff880007f44bc8 ffff880114de1e78 ffffffff8118d46a
0000000000000000 0000000000000400 ffff880114de1ef8 ffffffff81159a99
0000000000000004 ffff8801fffffff3 0000000000000004 ffff8800b7216930
Call Trace:
[<ffffffff8118d46a>] m_stop+0x1e/0x30
[<ffffffff81159a99>] seq_read+0x240/0x36e
[<ffffffff8113dc28>] vfs_read+0xac/0xff
[<ffffffff8113f0f8>] ? fget_light+0x3a/0xa1
[<ffffffff8113dcc8>] sys_read+0x4d/0x74
[<ffffffff814d6b42>] system_call_fastpath+0x16/0x1b
Code: 8b 54 24 08 48 89 53 30 41 5d 5b 41 5c 31 c0 41 5d c9 c3 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 85 f6 74 1c 48 3b 37 74 17
8b 1e 48 8d bb a8 00 00 00 e8 d9 e6 ee ff 48 89 df e8 9a ab
RIP [<ffffffff8118d3e9>] vma_stop.clone.3+0x18/0x33
RSP <ffff880114de1e48>
CR2: fffffffffffffff3
---[ end trace ea1a722d5e92836a ]---


it smells like something got corrupted, but I'm not sure where
to start. I've got a bunch of debug options turned on, but nothing
else triggered, so I've got nothing other than this oops to go on.

Any thoughts ?

Dave


2011-03-29 04:19:04

by Cong Wang

[permalink] [raw]
Subject: Re: vma corruption in today's -git

On Tue, Mar 29, 2011 at 12:09 PM, Dave Jones <[email protected]> wrote:
> Running a kernel that I built from a git pull from this afternoon,
> I ran a barrage of stress tests, including some of the new syscall fuzzing
> work that I've been working on.  Everything seemed just fine, except that
> after I had stopped running everything, loadavg was still at 6, with
> top showing kworker processes running at high % of cpu.
>
> I ran perf top, and got this ..
>
>  BUG: unable to handle kernel paging request at fffffffffffffff3
>  IP: [<ffffffff8118d3e9>] vma_stop.clone.3+0x18/0x33

Should be fixed by:

commit 76597cd31470fa130784c78fadb4dab2e624a723
Author: Linus Torvalds <[email protected]>
Date: Sun Mar 27 19:09:29 2011 -0700

proc: fix oops on invalid /proc/<pid>/maps access

When m_start returns an error, the seq_file logic will still call m_stop
with that error entry, so we'd better make sure that we check it before
using it as a vma.

Introduced by commit ec6fd8a4355c ("report errors in /proc/*/*map*
sanely"), which replaced NULL with various ERR_PTR() cases.

(On ia64, you happen to get a unaligned fault instead of a page fault,
since the address used is generally some random error code like -EPERM)

Reported-by: Anca Emanuel <[email protected]>
Reported-by: Tony Luck <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Américo Wang <[email protected]>
Cc: Stephen Wilson <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2011-03-29 04:22:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: vma corruption in today's -git

On Mon, Mar 28, 2011 at 9:09 PM, Dave Jones <[email protected]> wrote:
>
> I ran perf top, and got this ..
>
>  BUG: unable to handle kernel paging request at fffffffffffffff3

Should be fixed by commit 76597cd31470.

Except that was already committed yesterday. When did you pull?

Linus

2011-03-29 04:27:30

by Dave Jones

[permalink] [raw]
Subject: Re: vma corruption in today's -git

On Tue, Mar 29, 2011 at 12:19:01PM +0800, Américo Wang wrote:
> On Tue, Mar 29, 2011 at 12:09 PM, Dave Jones <[email protected]> wrote:
> > Running a kernel that I built from a git pull from this afternoon,
> > I ran a barrage of stress tests, including some of the new syscall fuzzing
> > work that I've been working on.  Everything seemed just fine, except that
> > after I had stopped running everything, loadavg was still at 6, with
> > top showing kworker processes running at high % of cpu.
> >
> > I ran perf top, and got this ..
> >
> >  BUG: unable to handle kernel paging request at fffffffffffffff3
> >  IP: [<ffffffff8118d3e9>] vma_stop.clone.3+0x18/0x33
>
> Should be fixed by:
>
> commit 76597cd31470fa130784c78fadb4dab2e624a723
> Author: Linus Torvalds <[email protected]>
> Date: Sun Mar 27 19:09:29 2011 -0700
>
> proc: fix oops on invalid /proc/<pid>/maps access

excellent, thanks. I'll repull and try again.

Dave

2011-03-31 03:09:27

by Dave Jones

[permalink] [raw]
Subject: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Mon, Mar 28, 2011 at 09:22:25PM -0700, Linus Torvalds wrote:
> On Mon, Mar 28, 2011 at 9:09 PM, Dave Jones <[email protected]> wrote:
> >
> > I ran perf top, and got this ..
> >
> >  BUG: unable to handle kernel paging request at fffffffffffffff3
>
> Should be fixed by commit 76597cd31470.
>
> Except that was already committed yesterday. When did you pull?

It did fix it, I must have missed that commit.

But rerunning the same tests on current head (6aba74f2791287ec407e0f92487a725a25908067)
I can still reproduce the problem where kworker threads go nutso
when the machine should be completely idle.

top shows two kworker threads constantly at >80% cpu.

I can now profile it with perf though, so I'll dig deeper.
I need to try a build with slab debug etc turned off, just to be sure that
isn't throwing things into a spin.

Dave

2011-03-31 03:34:41

by Dave Jones

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Wed, Mar 30, 2011 at 11:09:18PM -0400, Dave Jones wrote:

> I can still reproduce the problem where kworker threads go nutso
> when the machine should be completely idle.
>
> top shows two kworker threads constantly at >80% cpu.
>
> I can now profile it with perf though, so I'll dig deeper.
> I need to try a build with slab debug etc turned off, just to be sure that
> isn't throwing things into a spin.

so 'perf kmem record sleep 5' shows hundreds of calls to kmem_cache_free from
the kworker processes, called from shmem_i_callback, __d_free and file_free_rcu.
My guess is that my fuzzing caused so many allocations that the RCU freeing is
still ongoing an hour or so after the process has quit. Does that make any sense?

Although if that hypothesis is true, shouldn't I be able to observe this
by watching slabtop? Because I don't see anything decreasing...
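
A rough way to sanity-check that from userspace - just a sketch, with the
cache names below guessed from the callbacks above - would be to watch the
object counts in /proc/slabinfo and see whether they actually shrink while
the kworkers are spinning:

  $ watch -d -n 1 "grep -E '^(filp|dentry|shmem_inode_cache) ' /proc/slabinfo"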


Dave

2011-03-31 03:38:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Wed, Mar 30, 2011 at 8:09 PM, Dave Jones <[email protected]> wrote:
>
> But rerunning the same tests on current head (6aba74f2791287ec407e0f92487a725a25908067)
> I can still reproduce the problem where kworker threads go nutso
> when the machine should be completely idle.
>
> top shows two kworker threads constantly at >80% cpu.

Ok, I've seen that "tons of cpu by kworker" triggered by a few
different issues. One was the intel graphics driver doing monitor
detection constantly, and spending all its time in one of the worker
threads doing __udelay() for the stupid i2c driver.

The other case I've seen is a wireless thing that falls back to GPIO,
and spends a lot of CPU time on that.

I'm not saying yours is either of those cases, but one of the problems
with that behavior is that it's actually fairly hard to figure out
what the hell is happening. You don't see some nice thread description
in 'top' any more (like you used to when everybody created their own
threads and didn't do the common worker thread thing), and the best
approach literally seems to be something like

perf record -ag sleep 10
perf report

which does tend to show what's going on, but it's still a ridiculous
way to do this.

(Powertop can also do it, and is probably a better thing to use, I'm
just used to "perf record" for other reasons, so..)

Tejun, would there be some saner way to get the information about
_where_ all the kworker time is going?

Linus

2011-03-31 03:44:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Wed, Mar 30, 2011 at 8:34 PM, Dave Jones <[email protected]> wrote:
>
> > so 'perf kmem record sleep 5' shows hundreds of calls to kmem_cache_free from
> > the kworker processes, called from shmem_i_callback, __d_free and file_free_rcu.
> > My guess is that my fuzzing caused so many allocations that the RCU freeing is
> > still ongoing an hour or so after the process has quit. Does that make any sense?

No, that shouldn't be the case. RCU freeing should go on for just a
few RCU periods, and be done. I think there is some "limit the work we
do for rcu each time" in order to not have bad latencies, but even so
that shouldn't take _that_ long. And as you say, you should see the
freeing in the slab stats.

So clearly there are shmem inodes being destroyed, but it shouldn't be
from an hour ago. I wonder if your system isn't as idle as you think
it is.

But I'm cc'ing Paul, maybe he'll disagree and say it's expected and
that the RCU batch size is really small. Or at least give some hint
about how to check the pending rcu state.

Linus

2011-03-31 03:56:16

by Dave Jones

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Wed, Mar 30, 2011 at 08:37:21PM -0700, Linus Torvalds wrote:
> You don't see some nice thread description
> in 'top' any more (like you used to when everybody created their own
> threads and didn't do the common worker thread thing), and the best
> approach literally seems to be something like
>
> perf record -ag sleep 10
> perf report
>
> which does tend to show what's going on, but it's still a ridiculous
> way to do this.

I tried that, and wasn't particularly enlightened.

+ 6.53% kworker/1:2 [kernel.kallsyms] [k] read_hpet
+ 4.83% kworker/0:0 [kernel.kallsyms] [k] read_hpet
+ 4.28% kworker/0:0 [kernel.kallsyms] [k] arch_local_irq_restore
+ 4.03% kworker/1:2 [kernel.kallsyms] [k] arch_local_irq_restore
+ 3.10% kworker/0:0 [kernel.kallsyms] [k] do_raw_spin_trylock
+ 2.88% kworker/1:2 [kernel.kallsyms] [k] do_raw_spin_trylock
+ 2.85% kworker/1:2 [kernel.kallsyms] [k] debug_locks_off
+ 2.69% kworker/0:0 [kernel.kallsyms] [k] debug_locks_off
+ 2.48% kworker/0:0 [kernel.kallsyms] [k] lock_release
+ 2.26% kworker/1:2 [kernel.kallsyms] [k] lock_release
+ 2.03% kworker/0:0 [kernel.kallsyms] [k] lock_acquire
+ 1.88% kworker/0:0 [kernel.kallsyms] [k] arch_local_save_flags
+ 1.87% kworker/1:2 [kernel.kallsyms] [k] lock_acquire
+ 1.82% kworker/1:2 [kernel.kallsyms] [k] arch_local_save_flags
+ 1.81% kworker/1:2 [kernel.kallsyms] [k] arch_local_irq_save
+ 1.78% kworker/0:0 [kernel.kallsyms] [k] arch_local_irq_save
+ 1.56% kworker/0:0 [kernel.kallsyms] [k] lock_acquired
+ 1.53% kworker/1:2 [kernel.kallsyms] [k] __lock_acquire
+ 1.51% kworker/0:0 [kernel.kallsyms] [k] __lock_acquire
+ 1.29% kworker/0:0 [kernel.kallsyms] [k] native_write_msr_safe
+ 1.23% kworker/1:2 [kernel.kallsyms] [k] cpu_relax
+ 1.17% kworker/1:2 [kernel.kallsyms] [k] lock_acquired
+ 1.17% kworker/0:0 [kernel.kallsyms] [k] trace_hardirqs_off_caller
+ 1.11% kworker/1:2 [kernel.kallsyms] [k] trace_hardirqs_off_caller
+ 1.08% kworker/1:2 [kernel.kallsyms] [k] native_write_msr_safe
+ 1.02% kworker/0:0 [kernel.kallsyms] [k] _raw_spin_lock_irqsave
+ 0.92% kworker/0:0 [kernel.kallsyms] [k] process_one_work
+ 0.87% kworker/1:2 [kernel.kallsyms] [k] _raw_spin_lock_irqsave
+ 0.80% kworker/0:0 [kernel.kallsyms] [k] flush_to_ldisc
+ 0.76% kworker/1:2 [kernel.kallsyms] [k] process_one_work
+ 0.76% kworker/1:2 [kernel.kallsyms] [k] flush_to_ldisc
+ 0.72% kworker/0:0 [kernel.kallsyms] [k] arch_local_irq_restore
+ 0.71% kworker/1:2 [kernel.kallsyms] [k] arch_local_irq_restore
+ 0.64% kworker/1:2 [kernel.kallsyms] [k] do_raw_spin_unlock
+ 0.63% kworker/0:0 [kernel.kallsyms] [k] perf_event_task_tick
+ 0.61% kworker/1:2 [kernel.kallsyms] [k] ktime_get
+ 0.59% kworker/0:0 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore

this is what led me to try the other perf methods. the kmem traces were
the only things that really jumped out.

> (Powertop can also do it, and is probably a better thing to use, I'm
> just used to "perf record" for other reasons, so..)

Tried that too, here's what it said..

Summary: 0.0 wakeups/second, 0.0 GPU ops/second and 0.0 VFS ops/sec

Usage Events/s Category Description
-2147483648 ms/s 0.0 Timer
-2147483648 ms/s 0.0 kWork
35151589 ms/s 0.0 Timer
35151588 ms/s 0.0 Timer
35151587 ms/s 0.0 Timer
35151586 ms/s 0.0 Timer
35151585 ms/s 0.0 Timer
35151585 ms/s 0.0 Timer
35151584 ms/s 0.0 Timer
35151583 ms/s 0.0 Timer
35151582 ms/s 0.0 Timer
35151581 ms/s 0.0 Timer
35151581 ms/s 0.0 Timer
35151580 ms/s 0.0 Timer

Not exactly helpful.

Dave

2011-03-31 04:09:15

by Dave Jones

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Wed, Mar 30, 2011 at 08:44:06PM -0700, Linus Torvalds wrote:

> So clearly there are shmem inodes being destroyed, but it shouldn't be
> from an hour ago. I wonder if your system isn't as idle as you think
> it is.

hmm. when I say idle, I mean I'm staring at a gnome-terminal with
an ssh client, and nothing else. Certainly nothing that should be
causing the cpu meter in the taskbar to almost max out, and definitely
nothing that should cause the fans in the laptop to be roaring.

If you want to try and repeat this experiment, this should do it..

git clone git://git.codemonkey.org.uk/trinity
cd trinity ; make
./trinity --mode=random -i

let it run for a while, then kill it. kworker should then go nuts.

Dave

2011-03-31 05:33:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Wed, Mar 30, 2011 at 8:55 PM, Dave Jones <[email protected]> wrote:
>
> I tried that, and wasn't particularly enlightened.
>
> +      6.53%  kworker/1:2  [kernel.kallsyms]  [k] read_hpet
> +      4.83%  kworker/0:0  [kernel.kallsyms]  [k] read_hpet

Go into those, deeper into the callchain. That's how to see what the
upper-level problem is - by seeing who actually calls it. The leafs
themselves (where the time is obviously spent) don't tend to be very
informative, no.

> ?> (Powertop can also do it, and is probably a better thing to use, I'm
> ?> just used to "perf record" for other reasons, so..)
>
> Tried that too, here's what it said..

Hmm. Powertop is _supposed_ to be able to look into the kwork things
(it has special trace points for it), but as mentioned, the times I've
seen it I've used 'perf', so I haven't actually tried it myself.

Linus

2011-03-31 06:56:49

by Tejun Heo

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

Hello,

On Wed, Mar 30, 2011 at 08:37:21PM -0700, Linus Torvalds wrote:
> On Wed, Mar 30, 2011 at 8:09 PM, Dave Jones <[email protected]> wrote:
> >
> > But rerunning the same tests on current head (6aba74f2791287ec407e0f92487a725a25908067)
> > I can still reproduce the problem where kworker threads go nutso
> > when the machine should be completely idle.
> >
> > top shows two kworker threads constantly at >80% cpu.
>
> Ok, I've seen that "tons of cpu by kworker" triggered by a few
> different issues. One was the intel graphics driver doing monitor
> detection constantly, and spending all its time in one of the worker
> threads doing __udelay() for the stupid i2c driver.
>
> The other case I've seen is a wireless thing that falls back to GPIO,
> and spends a lot of CPU time on that.
>
> I'm not saying yours is either of those cases, but one of the problems
> with that behavior is that it's actually fairly hard to figure out
> what the hell is happening. You don't see some nice thread description
> in 'top' any more (like you used to when everybody created their own
> threads and didn't do the common worker thread thing), and the best
> approach literally seems to be something like

Yes, not having dedicated workers decreases immediate visibility via
ps/top, but I don't think the actual visibility is worse. There are two
types of kworker looping - something being scheduled in rapid
succession, or a single work item consuming lots of CPU cycles.

The first one can be tracked using tracing.

$ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace_pipe > out.txt
(wait a few secs)
^C

If something is busy looping on work queueing, it would be dominating
the output and the offender can be determined with the work item
function.
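
If the trace is large, a quick way to see which work functions dominate
(a sketch - it assumes the workqueue_queue_work tracepoint prints a
"function=" field in its output) is something like:

  $ grep -o 'function=[^ ]*' out.txt | sort | uniq -c | sort -rn | head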

For the second type, "cat /proc/THE_OFFENDING_KWORKER/stack" is the
easiest. The work item function will be trivially visible in the
stack trace.

Thanks.

--
tejun

2011-03-31 10:37:48

by Florian Mickler

[permalink] [raw]
Subject: [PATCH] workqueue: document debugging tricks

It is not obvious how to debug run-away workers.

These are some tips given by Tejun on lkml.

Signed-off-by: Florian Mickler <[email protected]>
CC: Tejun Heo <[email protected]>
---
Documentation/workqueue.txt | 38 ++++++++++++++++++++++++++++++++++++++
1 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/Documentation/workqueue.txt b/Documentation/workqueue.txt
index 01c513f..cdbc3c6 100644
--- a/Documentation/workqueue.txt
+++ b/Documentation/workqueue.txt
@@ -12,6 +12,7 @@ CONTENTS
4. Application Programming Interface (API)
5. Example Execution Scenarios
6. Guidelines
+7. Debugging


1. Introduction
@@ -379,3 +380,40 @@ If q1 has WQ_CPU_INTENSIVE set,
* Unless work items are expected to consume a huge amount of CPU
cycles, using a bound wq is usually beneficial due to the increased
level of locality in wq operations and work item execution.
+
+
+7. Debugging
+
+Because the work functions are executed by generic worker threads there are a
+few tricks needed to shed some light on misbehaving workqueue users.
+
+Worker threads show up in the process list as:
+
+root 5671 0.0 0.0 0 0 ? S 12:07 0:00 [kworker/0:1]
+root 5672 0.0 0.0 0 0 ? S 12:07 0:00 [kworker/1:2]
+root 5673 0.0 0.0 0 0 ? S 12:12 0:00 [kworker/0:0]
+root 5674 0.0 0.0 0 0 ? S 12:13 0:00 [kworker/1:0]
+
+If kworkers are going crazy (using too much cpu), there are two types of
+possible problems:
+
+ 1. Something being scheduled in rapid succession
+ 2. a single work item that consumes lots of cpu cycles
+
+The first one can be tracked using tracing:
+
+ $ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace_pipe > out.txt
+ (wait a few secs)
+ ^C
+
+If something is busy looping on work queueing, it would be dominating
+the output and the offender can be determined with the work item
+function.
+
+For the second type of problem it should be possible to just check the stack
+trace of the offending worker thread.
+
+ $ cat /proc/THE_OFFENDING_KWORKER/stack
+
+The work item's function should be trivially visible in the stack trace.
--
1.7.4.1

2011-03-31 11:41:51

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] workqueue: document debugging tricks

On Thu, Mar 31, 2011 at 12:37:06PM +0200, Florian Mickler wrote:
> It is not obvious how to debug run-away workers.
>
> These are some tips given by Tejun on lkml.
>
> Signed-off-by: Florian Mickler <[email protected]>
> CC: Tejun Heo <[email protected]>

Right, good idea. Applied to fixes-2.6.39 with minor formatting
changes.

Thanks.

--
tejun

2011-03-31 14:21:45

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Wed, Mar 30, 2011 at 10:32:11PM -0700, Linus Torvalds wrote:
> On Wed, Mar 30, 2011 at 8:55 PM, Dave Jones <[email protected]> wrote:
> > I tried that, and wasn't particularly enlightened.
> >
> > +      6.53%  kworker/1:2  [kernel.kallsyms]  [k] read_hpet
> > +      4.83%  kworker/0:0  [kernel.kallsyms]  [k] read_hpet
>
> Go into those, deeper into the callchain. That's how to see what the
> upper-level problem is - by seeing who actually calls it. The leafs
> themselves (where the time is obviously spent) don't tend to be very
> informative, no.

Yeah, please press 'E' since you're using the TUI, where you can expand
and collapse callchains at all levels.

Pressing 'E' will expand all of them, 'C' will collapse it all. Enter on
lines with a '+' will expand just that one, etc.

Press 'h' to get more help.

Or use 'perf report --stdio' and you'll get it all expanded and in plain
text.

- Arnaldo

2011-03-31 14:59:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Wed, Mar 30, 2011 at 08:44:06PM -0700, Linus Torvalds wrote:
> On Wed, Mar 30, 2011 at 8:34 PM, Dave Jones <[email protected]> wrote:
> >
> > so 'perf kmem record sleep 5' shows hundreds of calls to kmem_cache_free from
> > the kworker processes, called from shmem_i_callback, __d_free and file_free_rcu.
> > My guess is that my fuzzing caused so many allocations that the RCU freeing is
> > still ongoing an hour or so after the process has quit. Does that make any sense?
>
> No, that shouldn't be the case. RCU freeing should go on for just a
> few RCU periods, and be done. I think there is some "limit the work we
> do for rcu each time" in order to not have bad latencies, but even so
> that shouldn't take _that_ long. And as you say, you should see the
> freeing in the slab stats.
>
> So clearly there are shmem inodes being destroyed, but it shouldn't be
> from an hour ago. I wonder if your system isn't as idle as you think
> it is.
>
> But I'm cc'ing Paul, maybe he'll disagree and say it's expected and
> that the RCU batch size is really small. Or at least give some hint
> about how to check the pending rcu state.

There is by default a ten-callback-per-softirq per-CPU limit (set by
module parameter "blimit"), but if there are more than 10,000 callbacks
queued (set by module parameter "qhimark"), RCU stops being Mr. Nice Guy
and just plows through the callbacks.
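
Both limits can be checked at runtime, assuming TREE_RCU and that the
parameters are exported read-only under /sys/module as usual:

  $ cat /sys/module/rcutree/parameters/blimit
  $ cat /sys/module/rcutree/parameters/qhimark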

I am having a really hard time believing that you could queue a batch of
callbacks and still have RCU processing them an hour later. But one way
this can happen is if the callbacks requeue themselves, for example, if
the RCU callback function contains a call to call_rcu(). I am not seeing
that in the __d_free and file_free_rcu callbacks that Dave mentioned,
but this does happen in some places in the Linux kernel. And maybe such
a place is being invoked in a non-obvious way.

There is a debugfs file "rcu/rcudata" that reports the current RCU queue
length that is enabled with CONFIG_RCU_TRACE. Here is sample output:

rcu_sched:
0 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=10951/1 dn=0 df=1101 of=0 ri=36 ql=0 b=10
1 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=16117/1 dn=0 df=1015 of=0 ri=0 ql=0 b=10
2 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=1445/1 dn=0 df=1839 of=0 ri=0 ql=0 b=10
3 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=6681/1 dn=0 df=1545 of=0 ri=0 ql=0 b=10
4 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=1003/1 dn=0 df=1992 of=0 ri=0 ql=0 b=10
5 c=17829 g=17830 pq=1 pqc=17829 qp=1 dt=3887/1 dn=0 df=3331 of=0 ri=4 ql=2 b=10
6 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=859/1 dn=0 df=3224 of=0 ri=0 ql=0 b=10
7 c=17829 g=17830 pq=0 pqc=17829 qp=1 dt=3761/1 dn=0 df=1818 of=0 ri=0 ql=2 b=10
rcu_bh:
0 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=10951/1 dn=0 df=0 of=0 ri=0 ql=0 b=10
1 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=16117/1 dn=0 df=13 of=0 ri=0 ql=0 b=10
2 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=1445/1 dn=0 df=15 of=0 ri=0 ql=0 b=10
3 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=6681/1 dn=0 df=9 of=0 ri=0 ql=0 b=10
4 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=1003/1 dn=0 df=15 of=0 ri=0 ql=0 b=10
5 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=3887/1 dn=0 df=15 of=0 ri=0 ql=0 b=10
6 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=859/1 dn=0 df=15 of=0 ri=0 ql=0 b=10
7 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=3761/1 dn=0 df=15 of=0 ri=0 ql=0 b=10

The "ql=" gives the per-CPU callback queue size. For example, CPU 7 has
two rcu_sched callbacks queued. In CONFIG_TREE_PREEMPT_RCU builds, there
will be an additional "rcu_preempt" section. If you see some huge number
after a "ql=", that would indicate a problem. ;-)

Does this help?

Thanx, Paul

2011-03-31 14:59:53

by Dave Jones

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Wed, Mar 30, 2011 at 10:32:11PM -0700, Linus Torvalds wrote:
> On Wed, Mar 30, 2011 at 8:55 PM, Dave Jones <[email protected]> wrote:
> >
> > I tried that, and wasn't particularly enlightened.
> >
> > +      6.53%  kworker/1:2  [kernel.kallsyms]  [k] read_hpet
> > +      4.83%  kworker/0:0  [kernel.kallsyms]  [k] read_hpet
>
> Go into those, deeper into the callchain. That's how to see what the
> upper-level problem is - by seeing who actually calls it. The leafs
> themselves (where the time is obviously spent) don't tend to be very
> informative, no.

ok, a bit more interesting ?

- 17.67% kworker/0:1 [kernel.kallsyms] [k] read_hpet
- read_hpet
- 76.07% timekeeping_get_ns
- ktime_get
- 64.02% tick_dev_program_event
tick_program_event
hrtimer_interrupt
smp_apic_timer_interrupt
+ apic_timer_interrupt
- 35.30% sched_clock_tick
scheduler_tick
update_process_times
tick_sched_timer
__run_hrtimer
hrtimer_interrupt
smp_apic_timer_interrupt
- apic_timer_interrupt
+ 65.70% _raw_spin_unlock_irqrestore
+ 27.33% _raw_spin_unlock_irq
+ 1.16% lock_acquire
+ 1.16% put_ldisc
+ 1.16% process_one_work
+ 1.16% lock_release
+ 0.78% worker_thread
+ 0.58% _raw_spin_lock_irqsave
+ 0.55% tick_sched_timer
- 23.82% do_timer
tick_do_update_jiffies64
tick_sched_timer
__run_hrtimer
hrtimer_interrupt
smp_apic_timer_interrupt
- apic_timer_interrupt
+ 63.46% _raw_spin_unlock_irqrestore
+ 22.98% _raw_spin_unlock_irq
+ 3.50% put_ldisc
+ 3.28% lock_acquire



btw, I left this on overnight capturing the output of /proc/loadavg
it didn't drop off at all, and this morning the laptop is almost too
hot to hold.

Dave

2011-03-31 15:04:23

by Dave Jones

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Thu, Mar 31, 2011 at 10:58:50AM -0400, Dave Jones wrote:

bah I missed a level. which seems to be the smoking gun.
something in the tty layer seems to have gone nuts.

> - 17.67% kworker/0:1 [kernel.kallsyms] [k] read_hpet
> - read_hpet
> - 76.07% timekeeping_get_ns
> - ktime_get
> - 64.02% tick_dev_program_event
> tick_program_event
> hrtimer_interrupt
> smp_apic_timer_interrupt
> + apic_timer_interrupt

- 52.24% _raw_spin_unlock_irqrestore
- 39.26% flush_to_ldisc
process_one_work
worker_thread
kthread
kernel_thread_helper
- 32.52% __wake_up
put_ldisc
tty_ldisc_deref
flush_to_ldisc
process_one_work
worker_thread
kthread
kernel_thread_helper
- 28.22% tty_ldisc_try
tty_ldisc_ref
flush_to_ldisc
process_one_work
worker_thread
kthread
kernel_thread_helper
- 18.80% _raw_spin_unlock_irq
process_one_work
worker_thread
kthread
kernel_thread_helper
- 18.80% _raw_spin_unlock_irq
process_one_work
worker_thread
kthread
kernel_thread_helper
- 5.98% process_one_work
worker_thread
kthread
kernel_thread_helper
- 4.81% lock_acquire
process_one_work
worker_thread
kthread
kernel_thread_helper
- 4.70% put_ldisc
tty_ldisc_deref
flush_to_ldisc
process_one_work
worker_thread
kthread
kernel_thread_helper
- 4.17% worker_thread
kthread
kernel_thread_helper
- 4.17% lock_release
process_one_work
worker_thread
kthread
kernel_thread_helper
- 1.28% flush_to_ldisc
process_one_work
worker_thread
kthread
kernel_thread_helper
- 1.28% arch_local_irq_save
- 58.33% lock_release
process_one_work
worker_thread
kthread
kernel_thread_helper
- 41.67% lock_acquire
process_one_work
worker_thread
kthread
kernel_thread_helper
- 0.85% tty_ldisc_deref
flush_to_ldisc
process_one_work
worker_thread
kthread
kernel_thread_helper
- 0.85% _raw_spin_lock_irqsave
- 62.51% flush_to_ldisc
process_one_work
worker_thread
kthread
kernel_thread_helper
- 37.49% tty_ldisc_try
tty_ldisc_ref
flush_to_ldisc
process_one_work
worker_thread
kthread
kernel_thread_helper
- 0.53% tty_ldisc_ref
flush_to_ldisc
process_one_work
worker_thread
kthread

> - 35.30% sched_clock_tick
> scheduler_tick
> update_process_times
> tick_sched_timer
> __run_hrtimer
> hrtimer_interrupt
> smp_apic_timer_interrupt
> - apic_timer_interrupt
> + 65.70% _raw_spin_unlock_irqrestore
> + 27.33% _raw_spin_unlock_irq
> + 1.16% lock_acquire
> + 1.16% put_ldisc
> + 1.16% process_one_work
> + 1.16% lock_release
> + 0.78% worker_thread
> + 0.58% _raw_spin_lock_irqsave
> + 0.55% tick_sched_timer
> - 23.82% do_timer
> tick_do_update_jiffies64
> tick_sched_timer
> __run_hrtimer
> hrtimer_interrupt
> smp_apic_timer_interrupt
> - apic_timer_interrupt
> + 63.46% _raw_spin_unlock_irqrestore
> + 22.98% _raw_spin_unlock_irq
> + 3.50% put_ldisc
> + 3.28% lock_acquire

2011-03-31 15:09:55

by Dave Jones

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Thu, Mar 31, 2011 at 11:03:44AM -0400, Dave Jones wrote:
> On Thu, Mar 31, 2011 at 10:58:50AM -0400, Dave Jones wrote:
>
> bah I missed a level. which seems to be the smoking gun.
> something in the tty layer seems to have gone nuts.

I thought that trace looked familiar.

http://lkml.org/lkml/2010/11/30/592

It's the same thing.

Dave

2011-03-31 15:26:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Thu, Mar 31, 2011 at 8:03 AM, Dave Jones <[email protected]> wrote:
> On Thu, Mar 31, 2011 at 10:58:50AM -0400, Dave Jones wrote:
>
> bah I missed a level. which seems to be the smoking gun.
> something in the tty layer seems to have gone nuts.

Ahhah. That is indeed the smoking gun.

I bet it's an old problem where something causes some stupid tty
ping-pong behaviour, and now it has become a high-CPU thing triggered
by commit f23eb2b2b285 ("tty: stop using "delayed_work" in the tty
layer").

Before that commit, if the tty layer got somehow confused, it would
always delay for a timer tick in between doing nothing. Now it will go
pretty much full blast. Which explains your CPU being hot.

Linus

2011-03-31 15:46:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Thu, Mar 31, 2011 at 8:09 AM, Dave Jones <[email protected]> wrote:
>
> I thought that trace looked familiar.
>
> http://lkml.org/lkml/2010/11/30/592
>
> It's the same thing.

Ok, that's before the "tty: stop using "delayed_work" in the tty
layer" commit I just pointed to.

So apparently you've been able to trigger this even with the old code
too - although maybe the lack of delays anywhere has made it easier,
and has made it use more cpu.

I'll have to think about it, but I wonder if it's the crazy "reflush"
case in flush_to_ldisc. We do

        if (!tty->receive_room || seen_tail) {
                schedule_work(&tty->buf.work);
                break;
        }

inside the routine that is the work itself - basically we're saying
that "if there's no more room to flip, or we've seen a new buffer,
give up now and reschedule ourselves".

Which doesn't really make much sense to me, I have to admit. The code
that actually empties the buffer, or the code that adds one, should
already have scheduled us for a flip _anyway_. So the only thing that
"schedule_work()" is doing is causing infinite work if nothing empties
the buffer, or more likely if we have a flushing bug elsewhere.

So I'm not sure, but my gut feel is that removing that
"schedule_work()" line there is the right thing to do.

At a guess, it was hiding some locking problem - and it's been carried
around even though hopefully we've fixed all the crazy races we used
to have (and it was a mindless "hey, we can retry in one jiffy - it
doesn't really cost us anything")

NOTE! Even if I'm right, and that line is just buggy, the bug may well
have been hiding some other issue - i.e. just someone not flushing the
tty when it made more room available. So I think the "make tty
flush cause a re-flush when it cannot make progress" is wrong, but
removing the line may well expose some other problem.

Linus

2011-03-31 15:50:24

by Dave Jones

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Thu, Mar 31, 2011 at 08:25:36AM -0700, Linus Torvalds wrote:
> On Thu, Mar 31, 2011 at 8:03 AM, Dave Jones <[email protected]> wrote:
> > On Thu, Mar 31, 2011 at 10:58:50AM -0400, Dave Jones wrote:
> >
> > bah I missed a level. which seems to be the smoking gun.
> > something in the tty layer seems to have gone nuts.
>
> Ahhah. That is indeed the smoking gun.
>
> I bet it's an old problem where something causes some stupid tty
> ping-pong behaviour, and now it has become a high-CPU thing triggered
> by commit f23eb2b2b285 ("tty: stop using "delayed_work" in the tty
> layer").
>
> Before that commit, if the tty layer got somehow confused, it would
> always delay for a timer tick in between doing nothing. Now it will go
> pretty much full blast. Which explains your CPU being hot.

That's a recent change though, and I first saw this back in November.

Dave

2011-03-31 15:54:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

And with your fixed trinity thing, I can now do this:

On Wed, Mar 30, 2011 at 9:08 PM, Dave Jones <[email protected]> wrote:
>
> git clone git://git.codemonkey.org.uk/trinity
> cd trinity ; make
> ./trinity --mode=random -i
>
> let it run for a while, then kill it. kworker should then go nuts.

I really thought "hey, that reproduced it", because "top" did indeed
show kworker going nuts after a while. And perf did show it doing all
that

| queue_work
| schedule_work
| flush_to_ldisc

that looks like the re-scheduling.

But when I killed trinity, kworker immediately calmed down too.

Which I'd actually expect, since any tty it is using (and causing
problems with) should calm down when it's closed - even if it was stuck
before.

But maybe I haven't triggered just the right condition to actually get
it into some really infinite case.

Regardless, I'll put my money where my mouth is, and try to remove the
crazy re-flush thing.

Linus

2011-03-31 15:59:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Thu, Mar 31, 2011 at 8:49 AM, Dave Jones <[email protected]> wrote:
>
> That's a recent change though, and I first saw this back in November.

So your November report said that you could see "thousands" of these a
second. But maybe it didn't use up all CPU until recently?

Especially if you have a high CONFIG_HZ value, you'd still see a
thousand flushes a second even with the old "delay a bit". So it would
use a fair amount of CPU, and certainly waste a lot of power. But it
wouldn't pin the CPU entirely.

With that commit f23eb2b2b285, the buggy case would become basically
totally CPU-bound.

I dunno. Right now 'trinity' just ends up printing out a lot of system
call errors for me. I assume that's its normal behavior?

Linus

2011-03-31 16:13:45

by Dave Jones

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Thu, Mar 31, 2011 at 08:58:14AM -0700, Linus Torvalds wrote:
> On Thu, Mar 31, 2011 at 8:49 AM, Dave Jones <[email protected]> wrote:
> >
> > That's a recent change though, and I first saw this back in November.
>
> So your November report said that you could see "thousands" of these a
> second. But maybe it didn't use up all CPU until recently?

From memory, I noticed it back then in the same way "hey, why is the laptop getting hot".

> Especially if you have a high CONFIG_HZ value, you'd still see a
> thousand flushes a second even with the old "delay a bit". So it would
> use a fair amount of CPU, and certainly waste a lot of power. But it
> wouldn't pin the CPU entirely.

HZ=1000 here, so yeah that makes sense.

> With that commit f23eb2b2b285, the buggy case would become basically
> totally CPU-bound.
>
> I dunno. Right now 'trinity' just ends up printing out a lot of system
> call errors for me. I assume that's its normal behavior?

yep. it's throwing semi-random junk and seeing what sticks.
99% of the time, it ends up with an -EINVAL or something providing
the syscalls have sufficient checks in place. We're pretty solid there
these days. (Though I still need to finish adding more annotations of some
of the syscall arguments just to be sure we're passing semi-sane things
to get deeper into the syscalls)

What still seems to throw a curve-ball though is the case where calling
a syscall generates some state. Due to the random nature of the program,
we never have a balanced alloc/free, so lists can grow quite large etc.
I'm wondering if something has created a livelock situation, where some queue
has grown to the point that we're generating new work before we can
process the backlog.

The downside of using randomness of course is with bugs like this, there's
no easy way to figure out wtf happened to get it into this state
other than poring over huge logfiles of all the syscalls that were made.


I'm happily ignorant of most of the inner workings of the tty layer.
I don't see anything useful in procfs/sysfs/debugfs. Is there anything
useful I can do with the trace tools to try and find out things like
the length of queues ?

Dave

2011-03-31 16:22:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Thu, Mar 31, 2011 at 8:53 AM, Linus Torvalds
<[email protected]> wrote:
>
> Regardless, I'll put my money where my mouth is, and try to remove the
> crazy re-flush thing.

Yeah, that doesn't work. The tty does actually lock up when it fills
up. So clearly we actually depended on that reflushing happening.

That said, I do still think it's the right thing to do to remove that
line, it just means that I need to figure out where the missing flush
is.

Linus

2011-03-31 21:39:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

On Thu, Mar 31, 2011 at 9:21 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Mar 31, 2011 at 8:53 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> Regardless, I'll put my money where my mouth is, and try to remove the
>> crazy re-flush thing.
>
> Yeah, that doesn't work. The tty does actually lock up when it fills
> up. So clearly we actually depended on that reflushing happening.
>
> That said, I do still think it's the right thing to do to remove that
> line, it just means that I need to figure out where the missing flush
> is.

Ok, that was unexpected.

So the reason for the need to do that crazy "try to flush from the
flush routine" is that in the case of "receive_room" going down to
zero (which only happens for n_tty and for the case of switching
ldisc's around), if we hit that during flushing, nothing will
apparently ever re-start the flushing when receive_room then opens up
again.

So instead of having that case re-start the flush, we end up saying
"ok, we'll just retry the flush over and over again", and essentially
poll for receive_room opening up. No wonder you've seen high CPU use
and thousands of calls a second.

The "seen_tail" case doesn't have that issue, because anything that
adds a new buffer to the tty list should always be flipping anyway. So
this attached patch would seem to work.

Not heavily tested, but the case that I could trivially trigger before
doesn't trigger for me any more. And I can't seem to get kworker to
waste lots of CPU time any more, but it was kind of hit-and-miss
before too, so I don't know how much that's worth..

The locking here is kind of iffy, but otherwise? Comments?

Linus


Attachments:
patch.diff (1.48 kB)