2010-06-14 22:21:35

by John Kacur

Subject: [PATCH 0/6] Potential rt patches for tip/rt/2.6.33

Thomas:

This is a small collection of patches that I have in my -rt tree on top of
2.6.33.5-rt23.

Please consider them for the next version of tip/rt/2.6.33

They can be pulled from
git://git.kernel.org/pub/scm/linux/kernel/git/jkacur/jk-2.6.git tip-rt-2.6.33-jk

Amit Shah (1):
hvc_console: Fix race between hvc_close and hvc_remove

Anton Blanchard (1):
hvc_console: Fix race between hvc_close and hvc_remove

Arnaldo Carvalho de Melo (1):
tracing: Update the comm field in the right variable in update_max_tr

John Kacur (1):
lockdep: Make MAX_STACK_TRACE_ENTRIES configurable.

Stephane Eranian (1):
perf: Fix errors path in perf_output_begin()

Toshiyuki Okajima (1):
KEYS: find_keyring_by_name() can gain access to a freed keyring

 drivers/char/hvc_console.c |   27 +++++++++++++++++----------
 kernel/lockdep_internals.h |    2 +-
 kernel/perf_event.c        |    2 +-
 kernel/trace/trace.c       |    4 ++--
 lib/Kconfig.debug          |   10 ++++++++++
 security/keys/keyring.c    |   18 +++++++++---------
 6 files changed, 40 insertions(+), 23 deletions(-)


2010-06-14 22:21:38

by John Kacur

Subject: [PATCH 3/6] perf: Fix errors path in perf_output_begin()

From: Stephane Eranian <[email protected]>

In case the sampling buffer has no "payload" pages,
nr_pages is 0. The problem is that the error path in
perf_output_begin() skips to a label which assumes
perf_output_lock() has been issued, which is not the
case. That triggers a WARN_ON() in
perf_output_unlock().

This patch fixes the problem by skipping
perf_output_unlock() in case data->nr_pages is 0.

Signed-off-by: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Upstream-commit: 00d1d0b095ba4e5c0958cb228b2a9c445d4a339d
Signed-off-by: John Kacur <[email protected]>
---
kernel/perf_event.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 43c1dfb..e353be2 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2933,7 +2933,7 @@ int perf_output_begin(struct perf_output_handle *handle,
handle->sample = sample;

if (!data->nr_pages)
- goto fail;
+ goto out;

have_lost = atomic_read(&data->lost);
if (have_lost)
--
1.6.6.1

2010-06-14 22:21:43

by John Kacur

Subject: [PATCH 5/6] hvc_console: Fix race between hvc_close and hvc_remove

From: Amit Shah <[email protected]>

Alan pointed out a race in the code where hvc_remove is invoked. The
recent virtio_console work is the first user of hvc_remove().

Alan describes it thus:

The hvc_console assumes that a close and remove call can't occur at the
same time.

In addition tty_hangup(tty) is problematic as tty_hangup is asynchronous
itself....

So this can happen

	hvc_close			hvc_remove
	hung up ? - no
	lock
	tty = hp->tty
	unlock
					lock
					hp->tty = NULL
					unlock
					notify del
					kref_put the hvc struct
	close completes
	tty is destroyed
					tty_hangup dead tty
					tty->ops will be NULL
					NULL->...

This patch adds some tty krefs and also converts to using tty_vhangup().

Reported-by: Alan Cox <[email protected]>
Signed-off-by: Amit Shah <[email protected]>
CC: Alan Cox <[email protected]>
CC: [email protected]
CC: Rusty Russell <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Upstream-commit: e74d098c66543d0731de62eb747ccd5b636a6f4c
Signed-off-by: John Kacur <[email protected]>
---
drivers/char/hvc_console.c | 31 +++++++++++++++++++++----------
1 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/drivers/char/hvc_console.c b/drivers/char/hvc_console.c
index 416d342..da70f68 100644
--- a/drivers/char/hvc_console.c
+++ b/drivers/char/hvc_console.c
@@ -312,6 +312,7 @@ static int hvc_open(struct tty_struct *tty, struct file * filp)
spin_lock_irqsave(&hp->lock, flags);
/* Check and then increment for fast path open. */
if (hp->count++ > 0) {
+ tty_kref_get(tty);
spin_unlock_irqrestore(&hp->lock, flags);
hvc_kick();
return 0;
@@ -319,7 +320,7 @@ static int hvc_open(struct tty_struct *tty, struct file * filp)

tty->driver_data = hp;

- hp->tty = tty;
+ hp->tty = tty_kref_get(tty);

spin_unlock_irqrestore(&hp->lock, flags);

@@ -336,6 +337,7 @@ static int hvc_open(struct tty_struct *tty, struct file * filp)
spin_lock_irqsave(&hp->lock, flags);
hp->tty = NULL;
spin_unlock_irqrestore(&hp->lock, flags);
+ tty_kref_put(tty);
tty->driver_data = NULL;
kref_put(&hp->kref, destroy_hvc_struct);
printk(KERN_ERR "hvc_open: request_irq failed with rc %d.\n", rc);
@@ -363,13 +365,18 @@ static void hvc_close(struct tty_struct *tty, struct file * filp)
return;

hp = tty->driver_data;
+
spin_lock_irqsave(&hp->lock, flags);
+ tty_kref_get(tty);

if (--hp->count == 0) {
/* We are done with the tty pointer now. */
hp->tty = NULL;
spin_unlock_irqrestore(&hp->lock, flags);

+ /* Put the ref obtained in hvc_open() */
+ tty_kref_put(tty);
+
if (hp->ops->notifier_del)
hp->ops->notifier_del(hp, hp->data);

@@ -389,6 +396,7 @@ static void hvc_close(struct tty_struct *tty, struct file * filp)
spin_unlock_irqrestore(&hp->lock, flags);
}

+ tty_kref_put(tty);
kref_put(&hp->kref, destroy_hvc_struct);
}

@@ -424,10 +432,11 @@ static void hvc_hangup(struct tty_struct *tty)
spin_unlock_irqrestore(&hp->lock, flags);

if (hp->ops->notifier_hangup)
- hp->ops->notifier_hangup(hp, hp->data);
+ hp->ops->notifier_hangup(hp, hp->data);

while(temp_open_count) {
--temp_open_count;
+ tty_kref_put(tty);
kref_put(&hp->kref, destroy_hvc_struct);
}
}
@@ -592,7 +601,7 @@ int hvc_poll(struct hvc_struct *hp)
}

/* No tty attached, just skip */
- tty = hp->tty;
+ tty = tty_kref_get(hp->tty);
if (tty == NULL)
goto bail;

@@ -672,6 +681,8 @@ int hvc_poll(struct hvc_struct *hp)

tty_flip_buffer_push(tty);
}
+ if (tty)
+ tty_kref_put(tty);

return poll_mask;
}
@@ -806,7 +817,7 @@ int hvc_remove(struct hvc_struct *hp)
struct tty_struct *tty;

spin_lock_irqsave(&hp->lock, flags);
- tty = hp->tty;
+ tty = tty_kref_get(hp->tty);

if (hp->index < MAX_NR_HVC_CONSOLES)
vtermnos[hp->index] = -1;
@@ -818,18 +829,18 @@ int hvc_remove(struct hvc_struct *hp)
/*
* We 'put' the instance that was grabbed when the kref instance
* was initialized using kref_init(). Let the last holder of this
- * kref cause it to be removed, which will probably be the tty_hangup
+ * kref cause it to be removed, which will probably be the tty_vhangup
* below.
*/
kref_put(&hp->kref, destroy_hvc_struct);

/*
- * This function call will auto chain call hvc_hangup. The tty should
- * always be valid at this time unless a simultaneous tty close already
- * cleaned up the hvc_struct.
+ * This function call will auto chain call hvc_hangup.
*/
- if (tty)
- tty_hangup(tty);
+ if (tty) {
+ tty_vhangup(tty);
+ tty_kref_put(tty);
+ }
return 0;
}
EXPORT_SYMBOL_GPL(hvc_remove);
--
1.6.6.1

2010-06-14 22:21:49

by John Kacur

Subject: [PATCH 6/6] hvc_console: Fix race between hvc_close and hvc_remove

From: Anton Blanchard <[email protected]>

I don't claim to understand the tty layer, but it seems like hvc_open and
hvc_close should be balanced in their kref reference counting.

Right now we get a kref every call to hvc_open:

	if (hp->count++ > 0) {
		tty_kref_get(tty); <----- here
		spin_unlock_irqrestore(&hp->lock, flags);
		hvc_kick();
		return 0;
	} /* else count == 0 */

	tty->driver_data = hp;

	hp->tty = tty_kref_get(tty); <------ or here if hp->count was 0

But hvc_close has:

	tty_kref_get(tty);

	if (--hp->count == 0) {
	...
		/* Put the ref obtained in hvc_open() */
		tty_kref_put(tty);
	...
	}

	tty_kref_put(tty);

Since the outer tty_kref_get() and tty_kref_put() balance each other out,
we only do a single net kref_put when count reaches 0.

The patch below changes things to call tty_kref_put once for every
hvc_close call, and with that my machine boots fine.

Signed-off-by: Anton Blanchard <[email protected]>
Acked-by: Amit Shah <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
Upstream-commit: 320718ee074acce5ffced6506cb51af1388942aa
Signed-off-by: John Kacur <[email protected]>
---
drivers/char/hvc_console.c | 4 ----
1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/drivers/char/hvc_console.c b/drivers/char/hvc_console.c
index da70f68..2898084 100644
--- a/drivers/char/hvc_console.c
+++ b/drivers/char/hvc_console.c
@@ -367,16 +367,12 @@ static void hvc_close(struct tty_struct *tty, struct file * filp)
hp = tty->driver_data;

spin_lock_irqsave(&hp->lock, flags);
- tty_kref_get(tty);

if (--hp->count == 0) {
/* We are done with the tty pointer now. */
hp->tty = NULL;
spin_unlock_irqrestore(&hp->lock, flags);

- /* Put the ref obtained in hvc_open() */
- tty_kref_put(tty);
-
if (hp->ops->notifier_del)
hp->ops->notifier_del(hp, hp->data);

--
1.6.6.1

2010-06-14 22:22:12

by John Kacur

Subject: [PATCH 4/6] KEYS: find_keyring_by_name() can gain access to a freed keyring

From: Toshiyuki Okajima <[email protected]>

find_keyring_by_name() can gain access to a keyring that has had its reference
count reduced to zero, and is thus ready to be freed. This then allows the
dead keyring to be brought back into use whilst it is being destroyed.

The following timeline illustrates the process:

|(cleaner)                               (user)
|
| free_user(user)                        sys_keyctl()
|  |                                      |
| key_put(user->session_keyring)         keyctl_get_keyring_ID()
| ||    //=> keyring->usage = 0           |
| |schedule_work(&key_cleanup_task)      lookup_user_key()
| ||                                      |
| kmem_cache_free(,user)                  |
| .                                       |[KEY_SPEC_USER_KEYRING]
| .                                      install_user_keyrings()
| .                                       ||
| key_cleanup() [<= worker_thread()]      ||
|  |                                      ||
| [spin_lock(&key_serial_lock)]           |[mutex_lock(&key_user_keyr..mutex)]
|  |                                      ||
| atomic_read() == 0                      ||
| |{ rb_erase(&key->serial_node,) }       ||
|  |                                      ||
| [spin_unlock(&key_serial_lock)]         |find_keyring_by_name()
|  |                                      |||
| keyring_destroy(keyring)                ||[read_lock(&keyring_name_lock)]
| ||                                      |||
| |[write_lock(&keyring_name_lock)]       ||atomic_inc(&keyring->usage)
| |.                                      |||   *** GET freeing keyring ***
| |.                                      ||[read_unlock(&keyring_name_lock)]
| ||                                      ||
| |list_del()                             |[mutex_unlock(&key_user_k..mutex)]
| ||                                      |
| |[write_unlock(&keyring_name_lock)]     ** INVALID keyring is returned **
|  |                                        .
| kmem_cache_free(,keyring)                 .
|                                           .
|                                         atomic_dec(&keyring->usage)
v                                             *** DESTROYED ***
TIME

If CONFIG_SLUB_DEBUG=y then we may see the following message generated:

=============================================================================
BUG key_jar: Poison overwritten
-----------------------------------------------------------------------------

INFO: 0xffff880197a7e200-0xffff880197a7e200. First byte 0x6a instead of 0x6b
INFO: Allocated in key_alloc+0x10b/0x35f age=25 cpu=1 pid=5086
INFO: Freed in key_cleanup+0xd0/0xd5 age=12 cpu=1 pid=10
INFO: Slab 0xffffea000592cb90 objects=16 used=2 fp=0xffff880197a7e200 flags=0x200000000000c3
INFO: Object 0xffff880197a7e200 @offset=512 fp=0xffff880197a7e300

Bytes b4 0xffff880197a7e1f0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
Object 0xffff880197a7e200: 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b jkkkkkkkkkkkkkkk

Alternatively, we may see a system panic happen, such as:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
IP: [<ffffffff810e61a3>] kmem_cache_alloc+0x5b/0xe9
PGD 6b2b4067 PUD 6a80d067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/kernel/kexec_crash_loaded
CPU 1
...
Pid: 31245, comm: su Not tainted 2.6.34-rc5-nofixed-nodebug #2 D2089/PRIMERGY
RIP: 0010:[<ffffffff810e61a3>] [<ffffffff810e61a3>] kmem_cache_alloc+0x5b/0xe9
RSP: 0018:ffff88006af3bd98 EFLAGS: 00010002
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88007d19900b
RDX: 0000000100000000 RSI: 00000000000080d0 RDI: ffffffff81828430
RBP: ffffffff81828430 R08: ffff88000a293750 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000100000 R12: 00000000000080d0
R13: 00000000000080d0 R14: 0000000000000296 R15: ffffffff810f20ce
FS: 00007f97116bc700(0000) GS:ffff88000a280000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000001 CR3: 000000006a91c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process su (pid: 31245, threadinfo ffff88006af3a000, task ffff8800374414c0)
Stack:
0000000512e0958e 0000000000008000 ffff880037f8d180 0000000000000001
0000000000000000 0000000000008001 ffff88007d199000 ffffffff810f20ce
0000000000008000 ffff88006af3be48 0000000000000024 ffffffff810face3
Call Trace:
[<ffffffff810f20ce>] ? get_empty_filp+0x70/0x12f
[<ffffffff810face3>] ? do_filp_open+0x145/0x590
[<ffffffff810ce208>] ? tlb_finish_mmu+0x2a/0x33
[<ffffffff810ce43c>] ? unmap_region+0xd3/0xe2
[<ffffffff810e4393>] ? virt_to_head_page+0x9/0x2d
[<ffffffff81103916>] ? alloc_fd+0x69/0x10e
[<ffffffff810ef4ed>] ? do_sys_open+0x56/0xfc
[<ffffffff81008a02>] ? system_call_fastpath+0x16/0x1b
Code: 0f 1f 44 00 00 49 89 c6 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 60 e8 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 14 4c 89 f9 83 ca ff 44 89 e6 48 89 ef
RIP [<ffffffff810e61a3>] kmem_cache_alloc+0x5b/0xe9

The problem is that find_keyring_by_name() does not confirm that the keyring
is valid before accepting it.

Skipping keyrings that have been reduced to a zero count seems the way to go.
To this end, use atomic_inc_not_zero() to increment the usage count and skip
the candidate keyring if that returns false.

The following script _may_ cause the bug to happen, but there's no guarantee
as the window of opportunity is small:

#!/bin/sh
LOOP=100000
USER=dummy_user
/bin/su -c "exit;" $USER || { /usr/sbin/adduser -m $USER; add=1; }
for ((i=0; i<LOOP; i++))
do
/bin/su -c "echo '$i' > /dev/null" $USER
done
(( add == 1 )) && /usr/sbin/userdel -r $USER
exit

Note that the nominated user must not be in use.

An alternative way of testing this may be:

for ((i=0; i<100000; i++))
do
keyctl session foo /bin/true || break
done >&/dev/null

as that uses a keyring named "foo" rather than relying on the user and
user-session named keyrings.

Reported-by: Toshiyuki Okajima <[email protected]>
Signed-off-by: David Howells <[email protected]>
Tested-by: Toshiyuki Okajima <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: James Morris <[email protected]>
Upstream-commit: cea7daa3589d6b550546a8c8963599f7c1a3ae5c
Signed-off-by: John Kacur <[email protected]>
---
security/keys/keyring.c | 18 +++++++++---------
1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/security/keys/keyring.c b/security/keys/keyring.c
index 8ec0274..e031952 100644
--- a/security/keys/keyring.c
+++ b/security/keys/keyring.c
@@ -524,9 +524,8 @@ struct key *find_keyring_by_name(const char *name, bool skip_perm_check)
struct key *keyring;
int bucket;

- keyring = ERR_PTR(-EINVAL);
if (!name)
- goto error;
+ return ERR_PTR(-EINVAL);

bucket = keyring_hash(name);

@@ -553,17 +552,18 @@ struct key *find_keyring_by_name(const char *name, bool skip_perm_check)
KEY_SEARCH) < 0)
continue;

- /* we've got a match */
- atomic_inc(&keyring->usage);
- read_unlock(&keyring_name_lock);
- goto error;
+ /* we've got a match but we might end up racing with
+ * key_cleanup() if the keyring is currently 'dead'
+ * (ie. it has a zero usage count) */
+ if (!atomic_inc_not_zero(&keyring->usage))
+ continue;
+ goto out;
}
}

- read_unlock(&keyring_name_lock);
keyring = ERR_PTR(-ENOKEY);
-
- error:
+out:
+ read_unlock(&keyring_name_lock);
return keyring;

} /* end find_keyring_by_name() */
--
1.6.6.1

2010-06-14 22:22:32

by John Kacur

Subject: [PATCH 1/6] tracing: Update the comm field in the right variable in update_max_tr

From: Arnaldo Carvalho de Melo <[email protected]>

The latency output showed:

# | task: -3 (uid:0 nice:0 policy:1 rt_prio:99)

The comm is missing in the "task:" and it looks like a minus 3 is
the output. The correct display should be:

# | task: migration/0-3 (uid:0 nice:0 policy:1 rt_prio:99)

The problem is that the comm is being stored in the wrong data
structure. The max_tr.data[cpu] is what stores the comm, not the
tr->data[cpu].

Before this patch the max_tr.data[cpu]->comm was zeroed and the /debug/trace
ended up showing just the '-' sign followed by the pid.

Also remove a needless initialization of max_data.

Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
Upstream commit: 1acaa1b2d9b5904c9cce06122990a2d71046ce16
Signed-off-by: John Kacur <[email protected]>
---
kernel/trace/trace.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 9b66ee1..a1cf0ad 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -520,7 +520,7 @@ static void
__update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu)
{
struct trace_array_cpu *data = tr->data[cpu];
- struct trace_array_cpu *max_data = tr->data[cpu];
+ struct trace_array_cpu *max_data;

max_tr.cpu = cpu;
max_tr.time_start = data->preempt_timestamp;
@@ -530,7 +530,7 @@ __update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu)
max_data->critical_start = data->critical_start;
max_data->critical_end = data->critical_end;

- memcpy(data->comm, tsk->comm, TASK_COMM_LEN);
+ memcpy(max_data->comm, tsk->comm, TASK_COMM_LEN);
max_data->pid = tsk->pid;
max_data->uid = task_uid(tsk);
max_data->nice = tsk->static_prio - 20 - MAX_RT_PRIO;
--
1.6.6.1

2010-06-14 22:22:45

by John Kacur

Subject: [PATCH 2/6: v4] lockdep: Make MAX_STACK_TRACE_ENTRIES configurable.

Certain debug configurations that have LOCKDEP turned on run into the limit
where MAX_STACK_TRACE_ENTRIES is too small. Rather than simply turning
off the locking correctness validator, let the user configure this value
to something reasonable for their system.

This patch was generated against 2.6.33.5-rt23 but is also intended to be
picked-up for mainline.

Message-ID: <[email protected]>
Signed-off-by: John Kacur <[email protected]>
---
 kernel/lockdep_internals.h |    2 +-
 lib/Kconfig.debug          |   10 ++++++++++
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/kernel/lockdep_internals.h b/kernel/lockdep_internals.h
index a2ee95a..d0797bc 100644
--- a/kernel/lockdep_internals.h
+++ b/kernel/lockdep_internals.h
@@ -65,7 +65,7 @@ enum {
* Stack-trace: tightly packed array of stack backtrace
* addresses. Protected by the hash_lock.
*/
-#define MAX_STACK_TRACE_ENTRIES 262144UL
+#define MAX_STACK_TRACE_ENTRIES (CONFIG_MAX_STACK_ENTRIES_KIB*1024UL)

extern struct list_head all_lock_classes;
extern struct lock_chain lock_chains[];
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index cbf6e02..6087fb0 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -509,6 +509,16 @@ config LOCKDEP
select KALLSYMS
select KALLSYMS_ALL

+config MAX_STACK_ENTRIES_KIB
+ int "MAX_STACK_ENTRIES_KIB for LOCKDEP"
+ depends on LOCKDEP
+ default 256
+ help
+ This option allows you to change the default MAX_STACK_TRACE_ENTRIES
+ used by LOCKDEP. The default is 256*1024 = 262144.
+ Warning: increasing this will increase the size of stack_trace_array
+ and therefore the kernel size too.
+
config LOCK_STAT
bool "Lock usage statistics"
depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT && LOCKDEP_SUPPORT
--
1.6.6.1

2010-06-16 18:49:28

by Peter Zijlstra

Subject: Re: [PATCH 2/6: v4] lockdep: Make MAX_STACK_TRACE_ENTRIES configurable.

On Tue, 2010-06-15 at 00:21 +0200, John Kacur wrote:
> Certain debug configurations that have LOCKDEP turned on, run into the limit
> where the MAX_STACK_TRACE_ENTRIES are too small. Rather than simply turning
> off the locking correctness validator, let the user configure this value
> to something reasonable for their system.
>
> This patch was generated against 2.6.33.5-rt23 but is also intended to be
> picked-up for mainline.

NACK

patches like 4726f2a617ebd868a4fdeb5679613b897e5f1676 are the way to go.

2010-06-16 20:38:16

by John Kacur

Subject: Re: [PATCH 2/6: v4] lockdep: Make MAX_STACK_TRACE_ENTRIES configurable.



On Wed, 16 Jun 2010, Peter Zijlstra wrote:

> On Tue, 2010-06-15 at 00:21 +0200, John Kacur wrote:
> > Certain debug configurations that have LOCKDEP turned on, run into the limit
> > where the MAX_STACK_TRACE_ENTRIES are too small. Rather than simply turning
> > off the locking correctness validator, let the user configure this value
> > to something reasonable for their system.
> >
> > This patch was generated against 2.6.33.5-rt23 but is also intended to be
> > picked-up for mainline.
>
> NACK
>
> patches like 4726f2a617ebd868a4fdeb5679613b897e5f1676 are the way to go.

I've been testing 4726f2a617ebd868a4fdeb5679613b897e5f1676 in rt
(Thomas has it in tip/rt/2.6.33 now) and so far it is doing the trick for
me, at least on my laptop. I still need to test it on larger machines.
However, this problem seems to continually come up, and I'm not the only
one who has expressed the wish / need to have this tunable.
See Message-Id <[email protected]>

and here in the same thread is even an argument that some configs
might tune it down. (I needed to increase it for rt-debug configs)
See Message-Id <[email protected]>

fwiw, regarding the nacked patch: v4 is the least intrusive version
(smallest number of changes), based on a few trout-whacking-induced
hints.

Thanks

2010-06-16 21:17:33

by Peter Zijlstra

Subject: Re: [PATCH 2/6: v4] lockdep: Make MAX_STACK_TRACE_ENTRIES configurable.

On Wed, 2010-06-16 at 22:37 +0200, John Kacur wrote:
>
> On Wed, 16 Jun 2010, Peter Zijlstra wrote:
>
> > On Tue, 2010-06-15 at 00:21 +0200, John Kacur wrote:
> > > Certain debug configurations that have LOCKDEP turned on, run into the limit
> > > where the MAX_STACK_TRACE_ENTRIES are too small. Rather than simply turning
> > > off the locking correctness validator, let the user configure this value
> > > to something reasonable for their system.
> > >
> > > This patch was generated against 2.6.33.5-rt23 but is also intended to be
> > > picked-up for mainline.
> >
> > NACK
> >
> > patches like 4726f2a617ebd868a4fdeb5679613b897e5f1676 are the way to go.
>
> I've been testing 4726f2a617ebd868a4fdeb5679613b897e5f1676 in rt
> (Thomas has it in tip/rt/2.6.33 now) and so far it is doing the trick for
> me, at least on my laptop. I still need to test it on larger machines.
> However, this problem seems to continuably come up, and I'm not the only
> one who has expessed the wish / need to have this tunable.

And simply increasing the number without thought is the worst approach
ever and I'm simply not going to merge it.

Also, google doesn't seem to index msg-ids, so I've no idea what you're
referring to.

2010-06-16 21:29:38

by Steven Rostedt

Subject: Re: [PATCH 2/6: v4] lockdep: Make MAX_STACK_TRACE_ENTRIES configurable.

On Wed, 2010-06-16 at 23:16 +0200, Peter Zijlstra wrote:

> Also, google doesn't seem to index msg-ids, so I've no idea what you're
> referring to.

But marc.info does:

http://marc.info/[email protected]

;-)

-- Steve

2010-06-16 21:34:08

by John Kacur

Subject: Re: [PATCH 2/6: v4] lockdep: Make MAX_STACK_TRACE_ENTRIES configurable.



On Wed, 16 Jun 2010, Peter Zijlstra wrote:

> On Wed, 2010-06-16 at 22:37 +0200, John Kacur wrote:
> >
> > On Wed, 16 Jun 2010, Peter Zijlstra wrote:
> >
> > > On Tue, 2010-06-15 at 00:21 +0200, John Kacur wrote:
> > > > Certain debug configurations that have LOCKDEP turned on, run into the limit
> > > > where the MAX_STACK_TRACE_ENTRIES are too small. Rather than simply turning
> > > > off the locking correctness validator, let the user configure this value
> > > > to something reasonable for their system.
> > > >
> > > > This patch was generated against 2.6.33.5-rt23 but is also intended to be
> > > > picked-up for mainline.
> > >
> > > NACK
> > >
> > > patches like 4726f2a617ebd868a4fdeb5679613b897e5f1676 are the way to go.
> >
> > I've been testing 4726f2a617ebd868a4fdeb5679613b897e5f1676 in rt
> > (Thomas has it in tip/rt/2.6.33 now) and so far it is doing the trick for
> > me, at least on my laptop. I still need to test it on larger machines.
> > However, this problem seems to continuably come up, and I'm not the only
> > one who has expessed the wish / need to have this tunable.
>
> And simply increasing the number without thought is the worst approach
> ever and I'm simply not going to merge it.
>
> Also, google doesn't seem to index msg-ids, so I've no idea what you're
> referring to.
> --

The first mail in which Gregory also expressed desire to make this tunable
is here
http://lkml.org/lkml/2010/4/21/123

The second in which Sven said in certain configurations they might even
tune it down (decrease) is here.
http://lkml.org/lkml/2010/4/21/183

2010-06-17 08:36:06

by Peter Zijlstra

Subject: Re: [PATCH 2/6: v4] lockdep: Make MAX_STACK_TRACE_ENTRIES configurable.

On Wed, 2010-06-16 at 17:29 -0400, Steven Rostedt wrote:
> On Wed, 2010-06-16 at 23:16 +0200, Peter Zijlstra wrote:
>
> > Also, google doesn't seem to index msg-ids, so I've no idea what you're
> > referring to.
>
> But marc.info does:
>
> http://marc.info/[email protected]

Right, so that was before Yong Zhang's patch; if it still happens we
need to look again at what is causing this. Again, blindly increasing
the limit is not a good option.

2010-06-17 09:35:32

by John Kacur

Subject: Re: [PATCH 2/6: v4] lockdep: Make MAX_STACK_TRACE_ENTRIES configurable.



On Thu, 17 Jun 2010, Peter Zijlstra wrote:

> On Wed, 2010-06-16 at 17:29 -0400, Steven Rostedt wrote:
> > On Wed, 2010-06-16 at 23:16 +0200, Peter Zijlstra wrote:
> >
> > > Also, google doesn't seem to index msg-ids, so I've no idea what you're
> > > referring to.
> >
> > But marc.info does:
> >
> > http://marc.info/[email protected]
>
> Right, so that was before Yong Zhang's patch, if it still happens we
> need to again look at what is causing this. Again, blindly increasing
> the limit is not a good option.

Well, as I said, I'm testing with Yong Zhang's patch, and it seems to be
doing the trick, so I am actually not pushing for my patch right now.

But please stop characterizing this as "blindly increasing the limit",
because that is not at all what I or others do. We have a debug build with
tons of options turned on in which case we increased it to the minimum
that worked for us, and we have a tracing build in which case we left it
at the default. Also as I pointed out, in Sven's case it sounds like they may
have had a build where they even wanted to decrease it.

Your objection in the past was that it was another tunable that nobody
understands, and I have more sympathy for that argument. My counterargument
is that if we're all putting a version of this patch in our private builds,
then it's a tad counterproductive. Let's leave things the way they are for
now, unless this becomes a problem again.

Thanks.

2010-06-18 03:56:25

by Sven-Thorsten Dietrich

Subject: Re: [PATCH 2/6: v4] lockdep: Make MAX_STACK_TRACE_ENTRIES configurable.

On Thu, 2010-06-17 at 10:46 +0200, John Kacur wrote:
> Also as I pointed out, in Sven's case it sounds like they may
> have had a build where they even wanted to decrease it.
>

I hacked the patch below for SLERT 10 a few years ago, and that's
shipping still.

As you can see from the comment in the patch header, I was aware that
this wasn't a closed case, but rather a stop-gap.

I think if I had spent any more time on it, or run into the issue
repeatedly, my first instinct would have been to do what John did, i.e.
make it configurable.

However, I think that fundamentally, if nesting goes that deep, simply
increasing this define would lead to masking real problems.

So while I hacked this up for a shipping product, with intention to
debug this later, I'd probably favor a proper fix of the offending call
chain upstream.

I do not recall ever trying to decrease it, other than to reproduce the
issue a few times. I never did have time to dig into it further, and we
closed the bug as won't fix for this particular product.

Regards,

Sven


Subject: Increase lockdep's MAX_STACK_TRACE_ENTRIES
From: Sven-Thorsten Dietrich <[email protected]>

For large SMP systems, MAX_STACK_TRACE_ENTRIES was too low, and
lockdep would complain. This change addresses the issue on systems
we have tested.

It remains to be determined whether other bugs (e.g. scheduler
balancing issues) led to unreasonably deep call chains.

Signed-off-by: Sven-Thorsten Dietrich <[email protected]>

Index: linux-2.6.22/kernel/lockdep_internals.h
===================================================================
--- linux-2.6.22.orig/kernel/lockdep_internals.h
+++ linux-2.6.22/kernel/lockdep_internals.h
@@ -27,7 +27,7 @@
* Stack-trace: tightly packed array of stack backtrace
* addresses. Protected by the hash_lock.
*/
-#define MAX_STACK_TRACE_ENTRIES 262144UL
+#define MAX_STACK_TRACE_ENTRIES 524288UL

extern struct list_head all_lock_classes;