I noticed I had a ton of core dumps (like 70G worth) in a directory
I hadn't cleaned up in a while, and set about deleting them.
After a while I noticed the rm wasn't making any progress.
Even stranger, the rm process doesn't show up in the process list.
The shell that spawned it is still there, with no child processes,
but it hasn't returned to accept new input. (No messages about OOM kills or
anything; the pids are just totally missing.)
I did sysrq-t to see if it showed up there. It didn't, but.. I noticed
a ton of processes from my syscall fuzzer were still around, and all
of them were stuck in this trace..
trinity-child2 D 0000000000000000 5528 13066 1 0x00000004
ffff880100a37ce8 0000000000000046 0000000000000006 ffff880129070000
ffff880129070000 ffff880100a37fd8 ffff880100a37fd8 ffff880100a37fd8
ffff880145ec4d60 ffff880129070000 ffff880100a37cd8 ffff88014784e2a0
Call Trace:
[<ffffffff8164b919>] schedule+0x29/0x70
[<ffffffff8164bca8>] schedule_preempt_disabled+0x18/0x30
[<ffffffff8164a186>] mutex_lock_nested+0x196/0x3b0
[<ffffffff811b6d6e>] ? lock_rename+0x3e/0xf0
[<ffffffff811b6d6e>] ? lock_rename+0x3e/0xf0
[<ffffffff811b6d6e>] lock_rename+0x3e/0xf0
[<ffffffff811bcaca>] sys_renameat+0x11a/0x230
[<ffffffff8164d738>] ? _raw_spin_unlock_irqrestore+0x38/0x80
[<ffffffff81050e1c>] ? do_setitimer+0x1cc/0x310
[<ffffffff810b1d7e>] ? put_lock_stats.isra.23+0xe/0x40
[<ffffffff8164d6d0>] ? _raw_spin_unlock_irq+0x30/0x60
[<ffffffff81086f81>] ? get_parent_ip+0x11/0x50
[<ffffffff81655177>] ? sysret_check+0x1b/0x56
[<ffffffff810b7cd5>] ? trace_hardirqs_on_caller+0x115/0x1a0
[<ffffffff813264be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff811bcbfb>] sys_rename+0x1b/0x20
[<ffffffff81655152>] system_call_fastpath+0x16/0x1b
The whole sysrq-t is attached.
I ran mc to try to kill off all those core files, as I was running low on disk space,
and it deleted them without a problem.
The two bash processes are chewing up 100% CPU, though strace shows no output.
It's still up and in this state if you want me to gather any further info
before I reboot it.
Dave
On Sun, Jun 03, 2012 at 06:36:17PM -0400, Dave Jones wrote:
> I noticed I had a ton of core dumps (like 70G worth) in a directory
> I hadn't cleaned up in a while, and set about deleting them.
> After a while I noticed the rm wasn't making any progress.
> Even stranger, the rm process doesn't show up in the process list.
> The shell that spawned it is still there, with no child processes,
> but it hasn't returned to accept new input. (No messages about OOM kills or
> anything; the pids are just totally missing.)
>
>
> I did sysrq-t to see if it showed up there. It didn't, but.. I noticed
> a ton of processes from my syscall fuzzer were still around, and all
> of them were stuck in this trace..
>
>
> trinity-child2 D 0000000000000000 5528 13066 1 0x00000004
> ffff880100a37ce8 0000000000000046 0000000000000006 ffff880129070000
> ffff880129070000 ffff880100a37fd8 ffff880100a37fd8 ffff880100a37fd8
> ffff880145ec4d60 ffff880129070000 ffff880100a37cd8 ffff88014784e2a0
> Call Trace:
> [<ffffffff8164b919>] schedule+0x29/0x70
> [<ffffffff8164bca8>] schedule_preempt_disabled+0x18/0x30
> [<ffffffff8164a186>] mutex_lock_nested+0x196/0x3b0
> [<ffffffff811b6d6e>] ? lock_rename+0x3e/0xf0
> [<ffffffff811b6d6e>] ? lock_rename+0x3e/0xf0
> [<ffffffff811b6d6e>] lock_rename+0x3e/0xf0
> [<ffffffff811bcaca>] sys_renameat+0x11a/0x230
> [<ffffffff8164d738>] ? _raw_spin_unlock_irqrestore+0x38/0x80
> [<ffffffff81050e1c>] ? do_setitimer+0x1cc/0x310
> [<ffffffff810b1d7e>] ? put_lock_stats.isra.23+0xe/0x40
> [<ffffffff8164d6d0>] ? _raw_spin_unlock_irq+0x30/0x60
> [<ffffffff81086f81>] ? get_parent_ip+0x11/0x50
> [<ffffffff81655177>] ? sysret_check+0x1b/0x56
> [<ffffffff810b7cd5>] ? trace_hardirqs_on_caller+0x115/0x1a0
> [<ffffffff813264be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [<ffffffff811bcbfb>] sys_rename+0x1b/0x20
> [<ffffffff81655152>] system_call_fastpath+0x16/0x1b
>
> The whole sysrq-t is attached.
>
> I ran mc to try to kill off all those core files, as I was running low on disk space,
> and it deleted them without a problem.
>
> The two bash processes are chewing up 100% CPU, though strace shows no output.
trying to run perf causes hung perf processes too. hrmph, messed up.
Dave
perf D 0000000000000000 3944 1525 1613 0x00000004
ffff880103e49d58 0000000000000046 0000000000000006 ffff88012c0d4d60
ffff88012c0d4d60 ffff880103e49fd8 ffff880103e49fd8 ffff880103e49fd8
ffff880145edcd60 ffff88012c0d4d60 ffff880103e49d48 ffff88013ea31310
Call Trace:
[<ffffffff8164b919>] schedule+0x29/0x70
[<ffffffff8164bca8>] schedule_preempt_disabled+0x18/0x30
[<ffffffff816498a6>] mutex_lock_killable_nested+0x1a6/0x470
[<ffffffff81046134>] ? mm_access+0x34/0xc0
[<ffffffff81046134>] ? mm_access+0x34/0xc0
[<ffffffff81046134>] mm_access+0x34/0xc0
[<ffffffff8106eef0>] ? pid_task+0xd0/0xd0
[<ffffffff81213b2c>] m_start+0x7c/0x190
[<ffffffff811cfa50>] seq_read+0xa0/0x3e0
[<ffffffff811aae1c>] vfs_read+0xac/0x180
[<ffffffff811aaf3d>] sys_read+0x4d/0x90
[<ffffffff81655152>] system_call_fastpath+0x16/0x1b
perf x 0000000000000000 5496 1526 1525 0x00000004
ffff88013fbf7cb8 0000000000000046 ffff88013fbf7c68 ffffffff810b248c
ffff8801423f8000 ffff88013fbf7fd8 ffff88013fbf7fd8 ffff88013fbf7fd8
ffff880145ee8000 ffff8801423f8000 ffff88013fbf7ca8 ffff8801423f87e0
Call Trace:
[<ffffffff810b248c>] ? lock_release_holdtime.part.24+0xcc/0x140
[<ffffffff8164b919>] schedule+0x29/0x70
[<ffffffff8104f960>] do_exit+0x670/0xb90
[<ffffffff810627e1>] ? get_signal_to_deliver+0x291/0x930
[<ffffffff810501cc>] do_group_exit+0x4c/0xc0
[<ffffffff8106281e>] get_signal_to_deliver+0x2ce/0x930
[<ffffffff8100225f>] do_signal+0x3f/0x610
[<ffffffff812b5cc5>] ? security_file_permission+0x95/0xb0
[<ffffffff811aa961>] ? rw_verify_area+0x61/0xf0
[<ffffffff816551cc>] ? sysret_signal+0x5/0x47
[<ffffffff810028d8>] do_notify_resume+0x88/0xc0
[<ffffffff8165545a>] int_signal+0x12/0x17
On Sun, Jun 3, 2012 at 3:36 PM, Dave Jones <[email protected]> wrote:
>
> It's still up and in this state if you want me to gather any further info
> before I reboot it.
You seem to have lockdep enabled, since I see the lock information,
but presumably you didn't get a lockdep splat before this?
Also, sysrq-w is usually way more interesting than 't' when there are
processes stuck on a mutex.
Because yes, it looks like you have a boatload of trinity processes
stuck on an inode mutex. Looks like every single one of them is in
'lock_rename()'. It *shouldn't* be an ABBA deadlock, since lockdep
should have noticed that, but who knows.
It looks like the lock information is somewhat truncated. I suspect
the dmesg buffer had filled up with all the task data.
Can you do just "sysrq-d" for the lock information (and maybe
separately "sysrq-w" for the blocked tasks)? The non-truncated lock
data might tell us more.
Linus
> Also, sysrq-w is usually way more interesting than 't' when there are
> processes stuck on a mutex.
>
> Because yes, it looks like you have a boatload of trinity processes
> stuck on an inode mutex. Looks like every single one of them is in
> 'lock_rename()'. It *shouldn't* be an ABBA deadlock, since lockdep
> should have noticed that, but who knows.
lock_rename() is a bit of a red herring here - they appear to be all
within-directory renames, so it's just a "trying to rename something
in a directory that has ->i_mutex held by something else".
IOW, something else in there is holding ->i_mutex - something that
either hadn't been through lock_rename() at all or has already
passed through it and still hadn't got around to unlock_rename().
In either case, suspects won't have lock_rename() in the trace...
On Mon, Jun 04, 2012 at 12:17:09AM +0100, Al Viro wrote:
>
> > Also, sysrq-w is usually way more interesting than 't' when there are
> > processes stuck on a mutex.
> >
> > Because yes, it looks like you have a boatload of trinity processes
> > stuck on an inode mutex. Looks like every single one of them is in
> > 'lock_rename()'. It *shouldn't* be an ABBA deadlock, since lockdep
> > should have noticed that, but who knows.
>
> lock_rename() is a bit of a red herring here - they appear to be all
> within-directory renames, so it's just a "trying to rename something
> in a directory that has ->i_mutex held by something else".
>
> IOW, something else in there is holding ->i_mutex - something that
> either hadn't been through lock_rename() at all or has already
> passed through it and still hadn't got around to unlock_rename().
> In either case, suspects won't have lock_rename() in the trace...
Everything in lock_rename() appears to be at lock_rename+0x3e. Unless
there's a really huge amount of filesystems on that box, this has to
be
mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
and everything on that sucker is not holding any locks yet. IOW, that's
the tail hanging off whatever deadlock is there.
One possibility is that something has left the kernel without releasing
i_mutex on some directory, which would make atomic_open patches the most
obvious suspects.
Which kernel is it, and what filesystems are there? Is there nfsd anywhere
in the mix?
On Mon, Jun 04, 2012 at 12:28:20AM +0100, Al Viro wrote:
> Everything in lock_rename() appears to be at lock_rename+0x3e. Unless
> there's a really huge amount of filesystems on that box, this has to
> be
> mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
> and everything on that sucker is not holding any locks yet. IOW, that's
> the tail hanging off whatever deadlock is there.
Er... After another look, probably not - it's ->s_vfs_rename_mutex,
so we are seeing one cross-directory rename stuck on something with
all subsequent ones blocked on attempt to grab said mutex.
The interesting one is the guy stuck at lock_rename+0xc9/0xf0, everything
else in lock_rename() is the consequence.
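For context, lock_rename() at the time looked roughly like the sketch below
(reconstructed from memory of the 3.4-era fs/namei.c, so treat it as
approximate); the early mutex_lock() on ->s_vfs_rename_mutex and the later
mutex_lock_nested() calls on the parents' i_mutex are the two spots the
lock_rename+0x3e and +0xc9 offsets above are being matched against:

	struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
	{
		struct dentry *p;

		/* Same-directory rename: only the one parent i_mutex. */
		if (p1 == p2) {
			mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
			return NULL;
		}

		/* Cross-directory rename: serialize on the per-sb rename mutex. */
		mutex_lock(&p1->d_inode->i_sb->s_vfs_rename_mutex);

		/* If one directory is an ancestor of the other, lock it first. */
		p = d_ancestor(p2, p1);
		if (p) {
			mutex_lock_nested(&p2->d_inode->i_mutex, I_MUTEX_PARENT);
			mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_CHILD);
			return p;
		}

		p = d_ancestor(p1, p2);
		if (p) {
			mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
			mutex_lock_nested(&p2->d_inode->i_mutex, I_MUTEX_CHILD);
			return p;
		}

		/* Unrelated directories: s_vfs_rename_mutex makes the order safe. */
		mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
		mutex_lock_nested(&p2->d_inode->i_mutex, I_MUTEX_CHILD);
		return NULL;
	}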
On Mon, Jun 04, 2012 at 12:40:42AM +0100, Al Viro wrote:
> On Mon, Jun 04, 2012 at 12:28:20AM +0100, Al Viro wrote:
>
> > Everything in lock_rename() appears to be at lock_rename+0x3e. Unless
> > there's a really huge amount of filesystems on that box, this has to
> > be
> > mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
> > and everything on that sucker is not holding any locks yet. IOW, that's
> > the tail hanging off whatever deadlock is there.
>
> Er... After another look, probably not - it's ->s_vfs_rename_mutex,
> so we are seeing one cross-directory rename stuck on something with
> all subsequent ones blocked on attempt to grab said mutex.
>
> The interesting one is the guy stuck at lock_rename+0xc9/0xf0, everything
> else in lock_rename() is the consequence.
BTW, another suspicious patch is the d_splice_alias() one; note that if we
_ever_ pick a dentry that isn't disconnected, we are deeply fucked.
d_move() without the old parent locked is a Bad Thing(tm). I don't see
how that could've triggered without another bug somewhere, but what's
happening in d_splice_alias() right now is wrong.
On Sun, Jun 03, 2012 at 04:07:35PM -0700, Linus Torvalds wrote:
> On Sun, Jun 3, 2012 at 3:36 PM, Dave Jones <[email protected]> wrote:
> >
> > It's still up and in this state if you want me to gather any further info
> > before I reboot it.
>
> You seem to have lockdep enabled, since I see the lock information,
> but presumably you didn't get a lockdep splat before this?
Nothing.
> Because yes, it looks like you have a boatload of trinity processes
> stuck on an inode mutex. Looks like every single one of them is in
> 'lock_rename()'. It *shouldn't* be an ABBA deadlock, since lockdep
> should have noticed that, but who knows.
another data point: the core dumps that were being deleted were in the
directory that trinity runs from. It's feasible that some of the processes
were using those dumps as input data for random syscalls, before they
got deleted.
> It looks like the lock information is somewhat truncated. I suspect
> the dmesg buffer had filled up with all the task data.
>
> Can you do just "sysrq-d" for the lock information (and maybe
> separately "sysrq-w" for the blocked tasks)? The non-truncated lock
> data might tell us more.
sysrq-w: http://fpaste.org/WUd9/
sysrq-d: http://fpaste.org/ow9O/
Dave
On Mon, Jun 04, 2012 at 12:28:20AM +0100, Al Viro wrote:
> On Mon, Jun 04, 2012 at 12:17:09AM +0100, Al Viro wrote:
> >
> > > Also, sysrq-w is usually way more interesting than 't' when there are
> > > processes stuck on a mutex.
> > >
> > > Because yes, it looks like you have a boatload of trinity processes
> > > stuck on an inode mutex. Looks like every single one of them is in
> > > 'lock_rename()'. It *shouldn't* be an ABBA deadlock, since lockdep
> > > should have noticed that, but who knows.
> >
> > lock_rename() is a bit of a red herring here - they appear to be all
> > within-directory renames, so it's just a "trying to rename something
> > in a directory that has ->i_mutex held by something else".
> >
> > IOW, something else in there is holding ->i_mutex - something that
> > either hadn't been through lock_rename() at all or has already
> > passed through it and still hadn't got around to unlock_rename().
> > In either case, suspects won't have lock_rename() in the trace...
>
> Everything in lock_rename() appears to be at lock_rename+0x3e. Unless
> there's a really huge amount of filesystems on that box, this has to
> be
> mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
> and everything on that sucker is not holding any locks yet. IOW, that's
> the tail hanging off whatever deadlock is there.
>
> One possibility is that something has left the kernel without releasing
> i_mutex on some directory, which would make atomic_open patches the most
> obvious suspects.
>
> Which kernel is it, and what filesystems are there? Is there nfsd anywhere
> in the mix?
Linus tree as of rc1, with 5ceb9ce6fe94 reverted, and a bunch of patches to
shut up noisy printk's that get spewed during fuzz testing.
No active nfs exports/mounts, though something caused nfsd.ko to get loaded
at some point (module use count of 13, weird).
Dave
On Sun, Jun 3, 2012 at 5:00 PM, Dave Jones <[email protected]> wrote:
>
> sysrq-d: http://fpaste.org/ow9O/
Ugh. I'm adding PeterZ to the cc, just to see if he can make more sense of it.
Peter, is there no way to make the lock thing print not just the lock
class name, but also the pointer to the actual *instance* of the lock
held?
Also, it's a bit unclear to me, but I *think* that most of those users
don't actually "hold" the lock - they are waiting for it. Yes/no? Does
the lockdep information have the capability to distinguish between
"waiting for" vs "actually successfully owns the lock"?
Linus
On Sun, Jun 03, 2012 at 05:16:30PM -0700, Linus Torvalds wrote:
> On Sun, Jun 3, 2012 at 5:00 PM, Dave Jones <[email protected]> wrote:
> >
> > sysrq-d: http://fpaste.org/ow9O/
>
> Ugh. I'm adding PeterZ to the cc, just to see if he can make more sense of it.
>
> Peter, is there no way to make the lock thing print not just the lock
> class name, but also the pointer to the actual *instance* of the lock
> held?
>
> Also, it's a bit unclear to me, but I *think* that most of those users
> don't actually "hold" the lock - they are waiting for it. Yes/no?
Has to be, unless something _very_ odd is going on... Just how many
->s_vfs_rename_mutex are there on that box?
> Does
> the lockdep information have the capability to distinguish between
> "waiting for" vs "actually successfully owns the lock"?
Another question re lockdep - does it scream loudly if a process returns
to userland without having released some lock? I hope so, but I've never
checked that... ;-/
On Sun, 2012-06-03 at 17:16 -0700, Linus Torvalds wrote:
> On Sun, Jun 3, 2012 at 5:00 PM, Dave Jones <[email protected]> wrote:
> >
> > sysrq-d: http://fpaste.org/ow9O/
>
> Ugh. I'm adding PeterZ to the cc, just to see if he can make more sense of it.
>
> Peter, is there no way to make the lock thing print not just the lock
> class name, but also the pointer to the actual *instance* of the lock
> held?
Sorta; we have a pointer to the struct lockdep_map inside whatever lock
type. But we don't have the lock type, so we cannot actually provide the
pointer to the spinlock_t, struct mutex, etc.
> Also, it's a bit unclear to me, but I *think* that most of those users
> don't actually "hold" the lock - they are waiting for it. Yes/no?
Yes, this is waiting to acquire. Lockdep started out with 2 hooks: one
before the actual acquire and one on release. It's done before the actual
acquire so we can warn before we lock up in case of an actual deadlock.
> Does
> the lockdep information have the capability to distinguish between
> "waiting for" vs "actually successfully owns the lock"?
Sometimes.. when built with lockstat we have enough hooks to do this. I
suppose I could make all those hooks available for all of lockdep and
track the per-lock state more accurately in order to improve this
printout.
On Mon, 2012-06-04 at 01:20 +0100, Al Viro wrote:
>
> Another question re lockdep - does it scream loudly if a process returns
> to userland without having released some lock? I hope so, but I've never
> checked that... ;-/
It does, on some archs. git grep lockdep_sys_exit suggests x86 and
s390.
http://www.google.com/search?q=lock+held+when+returning+to+user+space
will find you people who tripped it ;-) There's a few filesystems that
did various vile things on freeze or so.
On Mon, 2012-06-04 at 11:29 +0200, Peter Zijlstra wrote:
> On Sun, 2012-06-03 at 17:16 -0700, Linus Torvalds wrote:
> > On Sun, Jun 3, 2012 at 5:00 PM, Dave Jones <[email protected]> wrote:
> > >
> > > sysrq-d: http://fpaste.org/ow9O/
> >
> > Ugh. I'm adding PeterZ to the cc, just to see if he can make more sense of it.
> >
> > Peter, is there no way to make the lock thing print not just the lock
> > class name, but also the pointer to the actual *instance* of the lock
> > held?
>
> Sorta; we have a pointer to the struct lockdep_map inside whatever lock
> type. But we don't have the lock type, so we cannot actually provide the
> pointer to the spinlock_t, struct mutex, etc.
>
> > Also, it's a bit unclear to me, but I *think* that most of those users
> > don't actually "hold" the lock - they are waiting for it. Yes/no?
>
> Yes, this is waiting to acquire. Lockdep started out with 2 hooks: one
> before the actual acquire and one on release. It's done before the actual
> acquire so we can warn before we lock up in case of an actual deadlock.
>
> > Does
> > the lockdep information have the capability to distinguish between
> > "waiting for" vs "actually successfully owns the lock"?
>
> Sometimes.. when built with lockstat we have enough hooks to do this. I
> suppose I could make all those hooks available for all of lockdep and
> track the per-lock state more accurately in order to improve this
> printout.
Something like the compile-tested-only patch below should do both
things. It will, however, make that already long output line longer still.
[19247.688328] 1 lock held by agetty/509:
[19247.689732] #0: (&tty->atomic_read_lock){+.+.+.}, at: [<ffffffff813c12f0>] n_tty_read+0x5d0/0x8a0
Will become something like:
[19247.688328] 1 lock on stack by agetty/509:
[19247.689732] #0: blocked: (&tty->atomic_read_lock){+.+.+.}, instance: ffffffffffffffff, at: [<ffffffff813c12f0>] n_tty_read+0x5d0/0x8a0
where instance is clearly a saner number :-)
---
include/linux/lockdep.h | 44 ++++++++-------
include/trace/events/lock.h | 3 -
kernel/lockdep.c | 127 ++++++++++++++++++++++++++++---------------
3 files changed, 106 insertions(+), 68 deletions(-)
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 00e4637..7a989a8 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -211,6 +211,13 @@ struct lock_chain {
*/
#define MAX_LOCKDEP_KEYS ((1UL << MAX_LOCKDEP_KEYS_BITS) - 1)
+enum held_lock_state {
+ hls_unheld = 0,
+ hls_acquiring,
+ hls_contended,
+ hls_acquired,
+};
+
struct held_lock {
/*
* One-way hash of the dependency chain up to this point. We
@@ -254,7 +261,10 @@ struct held_lock {
unsigned int read:2; /* see lock_acquire() comment */
unsigned int check:2; /* see lock_acquire() comment */
unsigned int hardirqs_off:1;
- unsigned int references:11; /* 32 bits */
+ unsigned int state:2; /* see held_lock_state */
+ /* 9 bit hole */
+ unsigned int references:16;
+ /* 16 bit hole */
};
/*
@@ -363,6 +373,18 @@ extern void lockdep_trace_alloc(gfp_t mask);
#define lockdep_recursing(tsk) ((tsk)->lockdep_recursion)
+extern void lock_contended(struct lockdep_map *lock, unsigned long ip);
+extern void lock_acquired(struct lockdep_map *lock, unsigned long ip);
+
+#define LOCK_CONTENDED(_lock, try, lock) \
+do { \
+ if (!try(_lock)) { \
+ lock_contended(&(_lock)->dep_map, _RET_IP_); \
+ lock(_lock); \
+ } \
+ lock_acquired(&(_lock)->dep_map, _RET_IP_); \
+} while (0)
+
#else /* !LOCKDEP */
static inline void lockdep_off(void)
@@ -414,31 +436,13 @@ struct lock_class_key { };
#define lockdep_recursing(tsk) (0)
-#endif /* !LOCKDEP */
-
-#ifdef CONFIG_LOCK_STAT
-
-extern void lock_contended(struct lockdep_map *lock, unsigned long ip);
-extern void lock_acquired(struct lockdep_map *lock, unsigned long ip);
-
-#define LOCK_CONTENDED(_lock, try, lock) \
-do { \
- if (!try(_lock)) { \
- lock_contended(&(_lock)->dep_map, _RET_IP_); \
- lock(_lock); \
- } \
- lock_acquired(&(_lock)->dep_map, _RET_IP_); \
-} while (0)
-
-#else /* CONFIG_LOCK_STAT */
-
#define lock_contended(lockdep_map, ip) do {} while (0)
#define lock_acquired(lockdep_map, ip) do {} while (0)
#define LOCK_CONTENDED(_lock, try, lock) \
lock(_lock)
-#endif /* CONFIG_LOCK_STAT */
+#endif /* !LOCKDEP */
#ifdef CONFIG_LOCKDEP
diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
index 2821b86..dbcb2f5 100644
--- a/include/trace/events/lock.h
+++ b/include/trace/events/lock.h
@@ -61,8 +61,6 @@ DEFINE_EVENT(lock, lock_release,
TP_ARGS(lock, ip)
);
-#ifdef CONFIG_LOCK_STAT
-
DEFINE_EVENT(lock, lock_contended,
TP_PROTO(struct lockdep_map *lock, unsigned long ip),
@@ -78,7 +76,6 @@ DEFINE_EVENT(lock, lock_acquired,
);
#endif
-#endif
#endif /* _TRACE_LOCK_H */
diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index ea9ee45..81ae879 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -541,10 +541,23 @@ static void print_lockdep_cache(struct lockdep_map *lock)
printk("%s", name);
}
+static void print_lock_state(struct held_lock *hlock)
+{
+ switch (hlock->state) {
+ /* holding an unheld lock is fail! */
+ case hls_unheld: printk("FAIL: "); break;
+
+ case hls_acquiring: /* fall-through */
+ case hls_contended: printk("blocked: "); break;
+ case hls_acquired: printk("held: "); break;
+ };
+}
+
static void print_lock(struct held_lock *hlock)
{
+ print_lock_state(hlock);
print_lock_name(hlock_class(hlock));
- printk(", at: ");
+ printk(", instance: %p, at: ", hlock->instance);
print_ip_sym(hlock->acquire_ip);
}
@@ -556,7 +569,7 @@ static void lockdep_print_held_locks(struct task_struct *curr)
printk("no locks held by %s/%d.\n", curr->comm, task_pid_nr(curr));
return;
}
- printk("%d lock%s held by %s/%d:\n",
+ printk("%d lock%s on stack by %s/%d:\n",
depth, depth > 1 ? "s" : "", curr->comm, task_pid_nr(curr));
for (i = 0; i < depth; i++) {
@@ -3093,6 +3106,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
hlock->check = check;
hlock->hardirqs_off = !!hardirqs_off;
hlock->references = references;
+ hlock->state = hls_acquiring;
#ifdef CONFIG_LOCK_STAT
hlock->waittime_stamp = 0;
hlock->holdtime_stamp = lockstat_clock();
@@ -3607,7 +3621,6 @@ void lockdep_clear_current_reclaim_state(void)
current->lockdep_reclaim_gfp = 0;
}
-#ifdef CONFIG_LOCK_STAT
static int
print_lock_contention_bug(struct task_struct *curr, struct lockdep_map *lock,
unsigned long ip)
@@ -3637,14 +3650,72 @@ print_lock_contention_bug(struct task_struct *curr, struct lockdep_map *lock,
return 0;
}
+#ifdef CONFIG_LOCK_STAT
+
+static void lockstat_contended(struct held_lock *hlock)
+{
+ int contention_point, contending_point;
+ struct lock_class_stats *stats;
+
+ hlock->waittime_stamp = lockstat_clock();
+
+ contention_point = lock_point(hlock_class(hlock)->contention_point, ip);
+ contending_point = lock_point(hlock_class(hlock)->contending_point,
+ lock->ip);
+
+ stats = get_lock_stats(hlock_class(hlock));
+ if (contention_point < LOCKSTAT_POINTS)
+ stats->contention_point[contention_point]++;
+ if (contending_point < LOCKSTAT_POINTS)
+ stats->contending_point[contending_point]++;
+ if (lock->cpu != smp_processor_id())
+ stats->bounces[bounce_contended + !!hlock->read]++;
+ put_lock_stats(stats);
+}
+
+static void lockstat_acquired(struct held_lock *hlock)
+{
+ struct lockdep_map *lock = hlock->instance;
+ struct lock_class_stats *stats;
+ u64 now, waittime = 0;
+ int cpu;
+
+ cpu = smp_processor_id();
+ if (hlock->waittime_stamp) {
+ now = lockstat_clock();
+ waittime = now - hlock->waittime_stamp;
+ hlock->holdtime_stamp = now;
+ }
+
+ stats = get_lock_stats(hlock_class(hlock));
+ if (waittime) {
+ if (hlock->read)
+ lock_time_inc(&stats->read_waittime, waittime);
+ else
+ lock_time_inc(&stats->write_waittime, waittime);
+ }
+ if (lock->cpu != cpu)
+ stats->bounces[bounce_acquired + !!hlock->read]++;
+ put_lock_stats(stats);
+
+ lock->cpu = cpu;
+ lock->ip = ip;
+}
+
+#else /* CONFIG_LOCK_STAT */
+
+static inline void lockstat_contended(struct held_lock *hlock) { }
+static inline void lockstat_acquired(struct held_lock *hlock) { }
+
+#endif /* CONFIG_LOCK_STAT */
+
static void
__lock_contended(struct lockdep_map *lock, unsigned long ip)
{
struct task_struct *curr = current;
struct held_lock *hlock, *prev_hlock;
- struct lock_class_stats *stats;
unsigned int depth;
- int i, contention_point, contending_point;
+ int i;
depth = curr->lockdep_depth;
/*
@@ -3673,20 +3744,8 @@ __lock_contended(struct lockdep_map *lock, unsigned long ip)
if (hlock->instance != lock)
return;
- hlock->waittime_stamp = lockstat_clock();
-
- contention_point = lock_point(hlock_class(hlock)->contention_point, ip);
- contending_point = lock_point(hlock_class(hlock)->contending_point,
- lock->ip);
-
- stats = get_lock_stats(hlock_class(hlock));
- if (contention_point < LOCKSTAT_POINTS)
- stats->contention_point[contention_point]++;
- if (contending_point < LOCKSTAT_POINTS)
- stats->contending_point[contending_point]++;
- if (lock->cpu != smp_processor_id())
- stats->bounces[bounce_contended + !!hlock->read]++;
- put_lock_stats(stats);
+ hlock->state = hls_contended;
+ lockstat_contended(hlock);
}
static void
@@ -3694,10 +3753,8 @@ __lock_acquired(struct lockdep_map *lock, unsigned long ip)
{
struct task_struct *curr = current;
struct held_lock *hlock, *prev_hlock;
- struct lock_class_stats *stats;
unsigned int depth;
- u64 now, waittime = 0;
- int i, cpu;
+ int i;
depth = curr->lockdep_depth;
/*
@@ -3726,28 +3783,8 @@ __lock_acquired(struct lockdep_map *lock, unsigned long ip)
if (hlock->instance != lock)
return;
- cpu = smp_processor_id();
- if (hlock->waittime_stamp) {
- now = lockstat_clock();
- waittime = now - hlock->waittime_stamp;
- hlock->holdtime_stamp = now;
- }
-
- trace_lock_acquired(lock, ip);
-
- stats = get_lock_stats(hlock_class(hlock));
- if (waittime) {
- if (hlock->read)
- lock_time_inc(&stats->read_waittime, waittime);
- else
- lock_time_inc(&stats->write_waittime, waittime);
- }
- if (lock->cpu != cpu)
- stats->bounces[bounce_acquired + !!hlock->read]++;
- put_lock_stats(stats);
-
- lock->cpu = cpu;
- lock->ip = ip;
+ hlock->state = hls_acquired;
+ lockstat_acquired(hlock);
}
void lock_contended(struct lockdep_map *lock, unsigned long ip)
@@ -3783,12 +3820,12 @@ void lock_acquired(struct lockdep_map *lock, unsigned long ip)
raw_local_irq_save(flags);
check_flags(flags);
current->lockdep_recursion = 1;
+ trace_lock_acquired(lock, ip);
__lock_acquired(lock, ip);
current->lockdep_recursion = 0;
raw_local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(lock_acquired);
-#endif
/*
* Used by the testsuite, sanitize the validator state
On Mon, Jun 04, 2012 at 12:28:20AM +0100, Al Viro wrote:
> On Mon, Jun 04, 2012 at 12:17:09AM +0100, Al Viro wrote:
> >
> > > Also, sysrq-w is usually way more interesting than 't' when there are
> > > processes stuck on a mutex.
> > >
> > > Because yes, it looks like you have a boatload of trinity processes
> > > stuck on an inode mutex. Looks like every single one of them is in
> > > 'lock_rename()'. It *shouldn't* be an ABBA deadlock, since lockdep
> > > should have noticed that, but who knows.
> >
> > lock_rename() is a bit of a red herring here - they appear to be all
> > within-directory renames, so it's just a "trying to rename something
> > in a directory that has ->i_mutex held by something else".
> >
> > IOW, something else in there is holding ->i_mutex - something that
> > either hadn't been through lock_rename() at all or has already
> > passed through it and still hadn't got around to unlock_rename().
> > In either case, suspects won't have lock_rename() in the trace...
>
> Everything in lock_rename() appears to be at lock_rename+0x3e. Unless
> there's a really huge amount of filesystems on that box, this has to
> be
> mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
> and everything on that sucker is not holding any locks yet. IOW, that's
> the tail hanging off whatever deadlock is there.
>
> One possibility is that something has left the kernel without releasing
> i_mutex on some directory, which would make atomic_open patches the most
> obvious suspects.
Just hit this again on a different box, though this time the stack traces
of the stuck processes seem to vary between fchmod/fchown/getdents calls.
partial dmesg at http://fpaste.org/jBVM/
sysrq-w: http://fpaste.org/uYtj/
sysrq-d: http://fpaste.org/Xxur/
does this give any new clues that the previous traces didn't ?
Dave
So what filesystem is this?
It really looks like something has left i_mutex locked on a directory,
but I'm for the life of me not seeing it. There are lookup changes
mainly by Miklos, but they don't seem to change the i_mutex locking.
They do change some other things, though. In particular, I mislike the
last patch in that series ("vfs: retry last component if opening stale
dentry"), which does the whole "save_parent" thing. There's a few
things that look odd there, and I don't like this code, for example:
+ save_parent.dentry = nd->path.dentry;
+ save_parent.mnt = mntget(path->mnt);
+ nd->path.dentry = path->dentry;
...
+ path_put(&save_parent);
where there isn't a dget() on the dentry (but path_put() will do a
dput() on it). I'm guessing it's because we lose a refcount to it when
we overwrite nd->path.dentry, but why isn't there a dget() on *that*
one? The patch just makes me nervous. Miklos, can you explain more?
The interactions with "path_put_conditional()" makes me extra nervous.
I'm also adding Jan, since he changed the i_mutex rules for the quota
files. That should be totally immaterial, but just the fact that it
touches i_mutex makes me want to hear more. Maybe there's some path
that had a lock, the unlock got deleted, and inode information ended
up leaking through the slab caches or something insane like that.
The lock output doesn't tell me anything new, except that yes, once
more people are waiting for a directory mutex, or waiting for the
rename mutex that is held by another process waiting for the directory
mutex.
Anybody see any i_mutex changes I missed?
Linus
On Wed, Jun 6, 2012 at 12:42 PM, Dave Jones <[email protected]> wrote:
>
> Just hit this again on a different box, though this time the stack traces
> of the stuck processes seem to vary between fchmod/fchown/getdents calls.
>
> partial dmesg at http://fpaste.org/jBVM/
> sysrq-w: http://fpaste.org/uYtj/
> sysrq-d: http://fpaste.org/Xxur/
>
> does this give any new clues that the previous traces didn't ?
>
>        Dave
>
On Wed, Jun 06, 2012 at 03:38:46PM -0700, Linus Torvalds wrote:
> So what filesystem is this?
The box I saw it on a few days ago: ext4. Today's box: btrfs
> It really looks like something has left i_mutex locked on a directory,
> but I'm for the life of me not seeing it. There are lookup changes
> mainly by Miklos, but they don't seem to change the i_mutex locking.
It is possible that this is a bug stretching back further than 3.4.
I'm continually adding new ways to do terrible things to the fuzzer,
and this last week or so has seen quite a few changes. So maybe I've only
just found a way to tickle this particular bug. (Just like the situation
we had with the mbind corruption a few weeks back).
Dave
On Wed, Jun 6, 2012 at 4:00 PM, Dave Jones <[email protected]> wrote:
>
> It is possible that this is a bug stretching back further than 3.4.
> I'm continually adding new ways to do terrible things to the fuzzer,
> and this last week or so has seen quite a few changes. So maybe I've only
> just found a way to tickle this particular bug. (Just like the situation
> we had with the mbind corruption a few weeks back).
Possible. The directory i_mutex locking rules are subtle, and we have
that whole complex d_ancestor() logic for ordering them against the
child ordering locks (*plus* the s_vfs_rename_mutex to then make us
able to not have to worry about the ordering of non-ancestry-related
dentries).
Sure, the dentry->d_lock has some of that too, but not quite as bad.
It's definitely one of our most subtle orderings - for most of the other
ones we can just compare the address of the lock itself or something
like that to avoid deadlock; there's no more complex topology among
the entries. Add to that the fact that "i_mutex" *also* has the mmap_sem
issue, where the dependency is reversed for directories (readdir ->
copy_to_user -> mmap_sem) compared to normal inodes (copy_to/from_user ->
IO -> i_mutex), and it's perhaps no wonder i_mutex is not one of my
all-time favorite locks.
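As an aside, the "compare the address of the lock itself" trick mentioned
above is the usual way to take two locks of the same class when there is no
natural parent/child relationship between them: pick a stable global order
(the pointer value) so concurrent callers can never acquire them in opposite
orders. A minimal illustrative sketch (not a quote of any particular kernel
helper; lock_two_mutexes is a made-up name):

	/* Take two same-class mutexes in a stable (address) order. */
	static void lock_two_mutexes(struct mutex *a, struct mutex *b)
	{
		if (a == b) {
			mutex_lock(a);
			return;
		}
		/* Always lock the lower-addressed mutex first. */
		if (a > b)
			swap(a, b);
		mutex_lock(a);
		/* Tell lockdep the nested same-class acquisition is intentional. */
		mutex_lock_nested(b, SINGLE_DEPTH_NESTING);
	}

The directory case is harder precisely because ancestry, not address, has to
determine the order - hence d_ancestor() and ->s_vfs_rename_mutex.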
At the same time, the fact that you're running with lockdep makes me
think that *if* we got the ordering wrong, lockdep should still show
the deadlock. PeterZ - the fact that we use mutex_lock_nested with
I_MUTEX_PARENT/I_MUTEX_CHILD/nothing hopefully doesn't screw us up in
case we get it wrong, right? IOW, lockdep would still find an *actual*
circular dependency if we hit one, no?
Dave - could you try running with the "instance" patch from PeterZ
earlier in this thread? That should at least help improve the sysrq-d
output a bit, and might give us some clue.
Al, looking at i_mutex use and rename, the only odd thing I see is how
vfs_rename_dir() does the "d_move()" *after* it has dropped the target
i_mutex. That looks odd. But I guess it shouldn't matter, because if
we're doing cross-directory renames we will always serialize everybody
with that rename mutex anyway. Yes/no? But wouldn't it make more sense
to do it inside the i_mutex? And before we do the dput() on the
new_dentry?
Why *does* that vfs_rename_dir() thing do that d_move so late?
vfs_rename_other() looks saner. I'm assuming it wants to do it after
the dput() for some reason (wanting to prune any stale children of the
target?) but I can't for the life of me remember why and I'm too lazy
to look up the git history. Maybe a comment would be in order?
Linus
On Wed, Jun 06, 2012 at 04:31:51PM -0700, Linus Torvalds wrote:
> Al, looking at i_mutex use and rename, the only odd thing I see is how
> vfs_rename_dir() does the "d_move()" *after* it has dropped the target
> i_mutex. That looks odd. But I guess it shouldn't matter, because if
> we're doing cross-directory renames we will always serialize everybody
> with that rename mutex anyway. Yes/no? But wouldn't it make more sense
> to do it inside the i_mutex? And before we do the dput() on the
> new_dentry?
What we need is ->i_mutex on parents. And I'm much more concerned about
this: 7732a557b1342c6e6966efb5f07effcf99f56167 and
3f50fff4dace23d3cfeb195d5cd4ee813cee68b7.
Dave, you seem to be able to reproduce it; could you try with those two
commits reverted? This stuff is *definitely* wrong with the way it
treats d_move(); there we might get it with parents not locked at all.
FWIW, I'd suggest adding a check into d_move(); new parent must be
locked in all cases and old one whenever dentry has one (i.e. isn't
disconnected). If you can find a violation of that, you very likely
have found the cause of that bug.
Al, in the middle of a really messy bisect right now ;-/ It started with
mips panicking (under qemu-system-mips -M malta) in -rc1; bisect has led
to the merge of akpm's patchbomb - as in "both parents work, merge doesn't,
recreating the merge gives the identical tree and no textual conflicts".
I've located the (half of the) problem in akpm branch - that's commit
d6629859b36d953a4b1369b749f178736911bf10 (ipc/mqueue: improve performance
of send/recv). Merge with it => unhandled unaligned access in the kernel,
merge with parent => no problems. The other half of the logical conflict
is harder to find ;-/ On the "akpm patchbomb" side it was just a linear
sequence, so doing cherry-pick of all of that stuff to the other side of
merge has yielded a tree identical to the merge one and that allowed normal
git bisect, which has located the point where it breaks. Can't do that
trick on the other side - there we have shitloads of merges (including the
one from tip, and I *really* hope it doesn't end up being the source of
trouble - topology in that one is horrible). So I'm doing a kinda-sorta
manual bisect - pick a point with gitk, reset the test branch to it,
merge the ipc/mqueue commit into it, test, pick the next point, etc.
Any suggestions re improving that process? Short of setting up a clone
and doing git bisect _there_, while the original tree is used for
merge/build stuff, hopefully... Is there any way to ask where the
next bisection point would be with a given set of goods and bads?
On Mon, Jun 04, 2012 at 12:49:48PM +0200, Peter Zijlstra wrote:
> Something like the compile-tested-only patch below should do both
> things.
>
..
> +#ifdef CONFIG_LOCK_STAT
> +
> +static void lockstat_contended(struct held_lock *hlock)
> +{
> + int contention_point, contending_point;
> + struct lock_class_stats *stats;
> +
> + hlock->waittime_stamp = lockstat_clock();
> +
> + contention_point = lock_point(hlock_class(hlock)->contention_point, ip);
> + contending_point = lock_point(hlock_class(hlock)->contending_point,
> + lock->ip);
> +
> + stats = get_lock_stats(hlock_class(hlock));
> + if (contention_point < LOCKSTAT_POINTS)
> + stats->contention_point[contention_point]++;
> + if (contending_point < LOCKSTAT_POINTS)
> + stats->contending_point[contending_point]++;
> + if (lock->cpu != smp_processor_id())
> + stats->bounces[bounce_contended + !!hlock->read]++;
> + put_lock_stats(stats);
> +}
kernel/lockdep.c: In function ‘lockstat_contended’:
kernel/lockdep.c:3662:70: error: ‘ip’ undeclared (first use in this function)
kernel/lockdep.c:3662:70: note: each undeclared identifier is reported only once for each function it appears in
kernel/lockdep.c:3664:11: error: ‘lock’ undeclared (first use in this function)
kernel/lockdep.c: In function ‘lockstat_acquired’:
kernel/lockdep.c:3702:13: error: ‘ip’ undeclared (first use in this function)
Dave
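For what it's worth, those errors are just the two new helpers still
referring to __lock_contended()'s local variables. A guess at the minimal
fixup (not Peter's actual respin): pass ip in and take the lockdep_map from
hlock->instance, the way lockstat_acquired() already does:

	static void lockstat_contended(struct held_lock *hlock, unsigned long ip)
	{
		/* the map is reachable from the held_lock itself */
		struct lockdep_map *lock = hlock->instance;
		int contention_point, contending_point;
		struct lock_class_stats *stats;

		hlock->waittime_stamp = lockstat_clock();

		contention_point = lock_point(hlock_class(hlock)->contention_point, ip);
		contending_point = lock_point(hlock_class(hlock)->contending_point,
					      lock->ip);

		stats = get_lock_stats(hlock_class(hlock));
		if (contention_point < LOCKSTAT_POINTS)
			stats->contention_point[contention_point]++;
		if (contending_point < LOCKSTAT_POINTS)
			stats->contending_point[contending_point]++;
		if (lock->cpu != smp_processor_id())
			stats->bounces[bounce_contended + !!hlock->read]++;
		put_lock_stats(stats);
	}

lockstat_acquired() would only need to grow the same 'unsigned long ip'
parameter (its body already takes the map from hlock->instance), with the
call sites becoming lockstat_contended(hlock, ip) and lockstat_acquired(hlock, ip).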
On Thu, Jun 07, 2012 at 12:54:04AM +0100, Al Viro wrote:
> What we need is ->i_mutex on parents. And I'm much more concerned about
> this: 7732a557b1342c6e6966efb5f07effcf99f56167 and
> 3f50fff4dace23d3cfeb195d5cd4ee813cee68b7.
>
> Dave, you seem to be able to reproduce it; could you try with those two
> commits reverted? This stuff is *definitely* wrong with the way it
> treats d_move(); there we might get it with parents not locked at all.
>
> FWIW, I'd suggest adding a check into d_move(); new parent must be
> locked in all cases and old one whenever dentry has one (i.e. isn't
> disconnected). If you can find a violation of that, you very likely
> have found the cause of that bug.
Like this ?
void d_move(struct dentry *dentry, struct dentry *target)
{
write_seqlock(&rename_lock);
+
+ BUG_ON(!spin_is_locked(&target->d_parent->d_lock));
+
+ if (dentry->d_parent != NULL)
+ BUG_ON(!spin_is_locked(&dentry->d_parent->d_lock));
+
__d_move(dentry, target);
write_sequnlock(&rename_lock);
}
To be clear, do you want me to try that with or without the reverts ?
Dave
On Wed, Jun 6, 2012 at 4:54 PM, Al Viro <[email protected]> wrote:
> On Wed, Jun 06, 2012 at 04:31:51PM -0700, Linus Torvalds wrote:
>
>> Al, looking at i_mutex use and rename, the only odd thing I see is how
>> vfs_rename_dir() does the "d_move()" *after* it has dropped the target
>> i_mutex. That looks odd. But I guess it shouldn't matter, because if
>> we're doing cross-directory renames we will always serialize everybody
>> with that rename mutex anyway. Yes/no? But wouldn't it make more sense
>> to do it inside the i_mutex? And before we do the dput() on the
>> new_dentry?
>
> What we need is ->i_mutex on parents.
Yes, but the placement is odd as-is, wouldn't you say? *Why* is it
that way? Especially considering that it isn't that way in the other
non-directory case.
> And I'm much more concerned about
> this: 7732a557b1342c6e6966efb5f07effcf99f56167 and
> 3f50fff4dace23d3cfeb195d5cd4ee813cee68b7.
Hmm. If two directory dentries point to the same inode, we're f*cked
for other reasons: we'd consider them separate entries, and then try
to mutex_lock() them both. Causing the obvious deadlock. But I would
have assumed those two commits would make us *less* likely to have
that case, rather than more.
That said, you're right, that d_move() is scary as hell. No parent
semaphores there.. So we're screwed whether we try to alias them or
not.
So yeah, I agree with the suggestion of trying to revert those two and
seeing if that changes anything.
> Al, in the middle of a really messy bisect right now ;-/
> [...] On the "akpm patchbomb" side it was just a linear
> sequence, so doing cherry-pick of all of that stuff to the other side of
> merge has yielded a tree identical to the merge one and that allowed normal
> git bisect, which has located the point where it breaks.
Yeah, we've done that before.
> Can't do that
> trick on the other side - there we have shitloads of merges (including the
> one from tip, and I *really* hope it doesn't end up being the source of
> trouble - topology in that one is horrible). So I'm doing a kinda-sorta
> manual bisect - pick a point with gitk, reset the test branch to it,
> merge the ipc/mqueue commit into it, test, pick the next point, etc.
> Any suggestions re improving that process?
Just do a *real* bisect - not a manual one - but every time you test a
kernel you test it with the merge (or rebase) on top. And then you
just mark the *base* of that merge good/bad, and let bisect sort it
out.
That's effectively how people bisect bugs that are hidden by other
bugs: you have to apply the (known) bugfix on top of the tree you are
bisecting in order to find the unknown one.
Linus
On Wed, Jun 06, 2012 at 08:29:14PM -0400, Dave Jones wrote:
> void d_move(struct dentry *dentry, struct dentry *target)
> {
> write_seqlock(&rename_lock);
> +
> + BUG_ON(!spin_is_locked(&target->d_parent->d_lock));
> +
> + if (dentry->d_parent != NULL)
> + BUG_ON(!spin_is_locked(&dentry->d_parent->d_lock));
> +
> __d_move(dentry, target);
> write_sequnlock(&rename_lock);
> }
>
>
> To be clear, do you want me to try that with or without the reverts ?
No. Not ->d_lock; ->d_inode->i_mutex. _That_ needs to be held for
d_move() to be safe (and on cross-directory move you need ->s_vfs_rename_mutex
as well). I'd do it with WARN_ON, BTW, and without the reverts - that
way we'll have a clear indication if that code is stepped on.
On Wed, Jun 6, 2012 at 5:29 PM, Dave Jones <[email protected]> wrote:
>
> Like this ?
No.
> void d_move(struct dentry *dentry, struct dentry *target)
> {
>        write_seqlock(&rename_lock);
> +
> +       BUG_ON(!spin_is_locked(&target->d_parent->d_lock));
> +
> +       if (dentry->d_parent != NULL)
> +               BUG_ON(!spin_is_locked(&dentry->d_parent->d_lock));
It's true that the d_lock needs to be held too, but we actually do
that in __d_move() (which calls dentry_lock_for_move), so the callers
don't have to worry about that part.
So Al meant you to test mutex_is_locked(dentry->d_inode->i_mutex) of
the parents.
And I'd suggest making it just a WARN_ON_ONCE(), because if you make
it a BUG_ON() and it triggers, your system will likely be dead. And
ONCE is all that matters - it should never happen.
> To be clear, do you want me to try that with or without the reverts ?
I think either would be interesting. *If* that d_move() from
d_splice_alias ever triggers, that would be an interesting backtrace
to see too.
Of course, if that d_splice_alias change is the cause of this, then
with those two commits reverted you'd not be able to reproduce the
problem, which would also be interesting to know, so either way it
would be good information.
Linus
On Wed, Jun 06, 2012 at 05:42:35PM -0700, Linus Torvalds wrote:
> So Al meant you to test mutex_is_locked(dentry->d_inode->i_mutex) of
> the parents.
ok, I ended up with..
WARN_ON_ONCE(!mutex_is_locked(&target->d_parent->d_inode->i_mutex));
if (dentry->d_parent != NULL)
WARN_ON_ONCE(!mutex_is_locked(&dentry->d_inode->i_mutex));
but..
> And I'd suggest making it just a WARN_ON_ONCE(), because if you make
> it a BUG_ON() and it triggers, your system will likely be dead. And
> ONCE is all that matters - it should never happen.
I hit the second WARN very early on, before I even get out of the initramfs.
[ 6.365556] WARNING: at fs/dcache.c:2350 d_move+0xaf/0xc0()
[ 6.365793] Modules linked in:
[ 6.365909] Pid: 134, comm: mount Not tainted 3.5.0-rc1+ #69
[ 6.366030] Call Trace:
[ 6.366147] [<ffffffff8104910f>] warn_slowpath_common+0x7f/0xc0
[ 6.366271] [<ffffffff8104916a>] warn_slowpath_null+0x1a/0x20
[ 6.366396] [<ffffffff811c5c4f>] d_move+0xaf/0xc0
[ 6.366521] [<ffffffff811b9654>] vfs_rename+0x3c4/0x4e0
[ 6.366647] [<ffffffff811bd741>] sys_renameat+0x201/0x230
[ 6.366773] [<ffffffff8132e59c>] ? debug_check_no_obj_freed+0x16c/0x210
[ 6.366902] [<ffffffff81171cfc>] ? vm_munmap+0x5c/0x80
[ 6.367026] [<ffffffff81086f91>] ? get_parent_ip+0x11/0x50
[ 6.367148] [<ffffffff81086f91>] ? get_parent_ip+0x11/0x50
[ 6.367273] [<ffffffff81651919>] ? sub_preempt_count+0x79/0xd0
[ 6.367397] [<ffffffff81655937>] ? sysret_check+0x1b/0x56
[ 6.367521] [<ffffffff810b7dd5>] ? trace_hardirqs_on_caller+0x115/0x1a0
[ 6.367647] [<ffffffff81326f4e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 6.367772] [<ffffffff811bd78b>] sys_rename+0x1b/0x20
[ 6.367891] [<ffffffff81655912>] system_call_fastpath+0x16/0x1b
Did I screw up the test again? I'm feeling a bit hard-of-thinking tonight.
> > To be clear, do you want me to try that with or without the reverts ?
>
> I think either would be interesting.
For now, I still have those two commits applied.
Dave
On Wed, Jun 06, 2012 at 09:19:15PM -0400, Dave Jones wrote:
> On Wed, Jun 06, 2012 at 05:42:35PM -0700, Linus Torvalds wrote:
>
> > So Al meant you to test mutex_is_locked(dentry->d_inode->i_mutex) of
> > the parents.
>
> ok, I ended up with..
>
> WARN_ON_ONCE(!mutex_is_locked(&target->d_parent->d_inode->i_mutex));
>
> if (dentry->d_parent != NULL)
!= dentry)
->d_parent *never* should be NULL; when dentry is disconnected from the
tree (or hadn't been connected to it yet), it points to that dentry itself.
On Thu, Jun 07, 2012 at 02:29:00AM +0100, Al Viro wrote:
> On Wed, Jun 06, 2012 at 09:19:15PM -0400, Dave Jones wrote:
> > On Wed, Jun 06, 2012 at 05:42:35PM -0700, Linus Torvalds wrote:
> >
> > > So Al meant you to test mutex_is_locked(dentry->d_inode->i_mutex) of
> > > the parents.
> >
> > ok, I ended up with..
> >
> > WARN_ON_ONCE(!mutex_is_locked(&target->d_parent->d_inode->i_mutex));
> >
> > if (dentry->d_parent != NULL)
> != dentry)
>
> ->d_parent *never* should be NULL; when dentry is disconnected from the
> tree (or hadn't been connected to it yet), it points to that dentry itself.
ah, thanks. My VFS ignorance showing there.
Dave
On Thu, Jun 07, 2012 at 02:29:00AM +0100, Al Viro wrote:
> On Wed, Jun 06, 2012 at 09:19:15PM -0400, Dave Jones wrote:
> > On Wed, Jun 06, 2012 at 05:42:35PM -0700, Linus Torvalds wrote:
> >
> > > So Al meant you to test mutex_is_locked(dentry->d_inode->i_mutex) of
> > > the parents.
> >
> > ok, I ended up with..
> >
> > WARN_ON_ONCE(!mutex_is_locked(&target->d_parent->d_inode->i_mutex));
> >
> > if (dentry->d_parent != NULL)
> != dentry)
>
> ->d_parent *never* should be NULL; when dentry is disconnected from the
> tree (or hadn't been connected to it yet), it points to that dentry itself.
And you want to check i_mutex on old parent, not the file being moved
itself. IOW, the second one should be
if (dentry->d_parent != dentry)
WARN_ON_ONCE(!mutex_is_locked(&dentry->d_parent->d_inode->i_mutex));
On Wed, Jun 6, 2012 at 6:19 PM, Dave Jones <[email protected]> wrote:
>
> ok, I ended up with..
>
>        WARN_ON_ONCE(!mutex_is_locked(&target->d_parent->d_inode->i_mutex));
>
>        if (dentry->d_parent != NULL)
>                WARN_ON_ONCE(!mutex_is_locked(&dentry->d_inode->i_mutex));
It should just be something like
WARN_ON_ONCE(!mutex_is_locked(&dentry->d_parent->d_inode->i_mutex));
WARN_ON_ONCE(!mutex_is_locked(&target->d_parent->d_inode->i_mutex));
(And doing it as two separate ones rather than one that tests both is
probably a good idea, so that you see *which* one it is that triggers
if they do)
Linus
On Thu, Jun 07, 2012 at 02:31:49AM +0100, Al Viro wrote:
> On Thu, Jun 07, 2012 at 02:29:00AM +0100, Al Viro wrote:
> > On Wed, Jun 06, 2012 at 09:19:15PM -0400, Dave Jones wrote:
> > > On Wed, Jun 06, 2012 at 05:42:35PM -0700, Linus Torvalds wrote:
> > >
> > > > So Al meant you to test mutex_is_locked(dentry->d_inode->i_mutex) of
> > > > the parents.
> > >
> > > ok, I ended up with..
> > >
> > > WARN_ON_ONCE(!mutex_is_locked(&target->d_parent->d_inode->i_mutex));
> > >
> > > if (dentry->d_parent != NULL)
> > != dentry)
> >
> > ->d_parent *never* should be NULL; when dentry is disconnected from the
> > tree (or hadn't been connected to it yet), it points to that dentry itself.
>
> And you want to check i_mutex on old parent, not the file being moved
> itself. IOW, the second one should be
>
> if (dentry->d_parent != dentry)
> WARN_ON_ONCE(!mutex_is_locked(&dentry->d_parent->d_inode->i_mutex));
ok, now that you've both found an error in every line I've written tonight,
things seem to have booted without incident. I'll rerun the tests, and get
back to you if/when they start spewing anything interesting.
Dave
On Wed, Jun 6, 2012 at 6:31 PM, Al Viro <[email protected]> wrote:
>
> And you want to check i_mutex on old parent, not the file being moved
> itself. IOW, the second one should be
>
> if (dentry->d_parent != dentry)
>        WARN_ON_ONCE(!mutex_is_locked(&dentry->d_parent->d_inode->i_mutex));
Do you even need the "dentry->d_parent != dentry" test? When do we
even rename root dentries?
I do notice that we have that IS_ROOT() test in __d_move(), but now
that I look at it I wonder *why* we have it..
Anyway, Dave, you should probably trust Al over me here - that test
obviously won't hurt, even if I can't see immediately when it would
trigger either.
Linus
On Wed, Jun 06, 2012 at 06:45:35PM -0700, Linus Torvalds wrote:
> On Wed, Jun 6, 2012 at 6:31 PM, Al Viro <[email protected]> wrote:
> >
> > And you want to check i_mutex on old parent, not the file being moved
> > itself. IOW, the second one should be
> >
> > if (dentry->d_parent != dentry)
> >        WARN_ON_ONCE(!mutex_is_locked(&dentry->d_parent->d_inode->i_mutex));
>
> Do you even need the "dentry->d_parent != dentry" test? When do we
> even rename root dentries?
Root of the filesystem - never; root of disconnected subtree - sure,
that's how they become connected to the tree. See d_materialize_unique(),
for example...
BTW, I really need more coffee - those checks belong in __d_move(),
not d_move(); aforementioned d_materialize_unique() doesn't use d_move(),
it calls __d_move() directly. Sorry.
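Putting Al's and Linus's corrections together, the check being asked for
would look something like this at the top of __d_move() (a sketch assembled
from the thread, not necessarily the exact diff Dave ended up booting):

	static void __d_move(struct dentry *dentry, struct dentry *target)
	{
		/*
		 * Callers must hold the parents' ->i_mutex (plus
		 * ->s_vfs_rename_mutex for a cross-directory move).  Skip the
		 * old-parent check for disconnected (IS_ROOT) dentries, which
		 * are their own parent.
		 */
		WARN_ON_ONCE(!mutex_is_locked(&target->d_parent->d_inode->i_mutex));
		if (dentry->d_parent != dentry)
			WARN_ON_ONCE(!mutex_is_locked(&dentry->d_parent->d_inode->i_mutex));

		/* ... existing __d_move() body continues unchanged ... */
	}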
On Thu, Jun 07, 2012 at 02:54:10AM +0100, Al Viro wrote:
> > Do you even need the "dentry->d_parent != dentry" test? When do we
> > even rename root dentries?
>
> Root of the filesystem - never; root of disconnected subtree - sure,
> that's how they become connected to the tree. See d_materialize_unique(),
> for example...
>
> BTW, I really need more coffee - those checks belong in __d_move(),
> not d_move(); aforementioned d_materialize_unique() doesn't use d_move(),
> it calls __d_move() directly. Sorry.
np, I'll restart the build/test.
Dave
> They do change some other things, though. In particular, I mislike the
> last patch in that series ("vfs: retry last component if opening stale
> dentry"), which does the whole "save_parent" thing. There's a few
> things that look odd there, and I don't like this code, for example:
>
> + save_parent.dentry = nd->path.dentry;
> + save_parent.mnt = mntget(path->mnt);
> + nd->path.dentry = path->dentry;
> ...
> + path_put(&save_parent);
>
> where there isn't a dget() on the dentry (but path_put() will do a
> dput() on it). I'm guessing it's because we lose a refcount to it when
> we overwrite nd->path.dentry, but why isn't there a dget() on *that*
> one? The patch just makes me nervous. Miklos, can you explain more?
> The interactions with "path_put_conditional()" makes me extra nervous.
Yeah, see the comment in the patch below for how it's supposed to work. I
*think* it's correct.
Thanks,
Miklos
Subject: vfs: add comment to save_parent
From: Miklos Szeredi <[email protected]>
Reported-by: Linus Torvalds <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/namei.c | 10 ++++++++++
1 file changed, 10 insertions(+)
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c 2012-06-05 19:29:39.000000000 +0200
+++ linux-2.6/fs/namei.c 2012-06-07 08:41:55.000000000 +0200
@@ -2355,6 +2355,16 @@ static struct file *do_last(struct namei
if ((nd->flags & LOOKUP_RCU) || nd->path.mnt != path->mnt) {
path_to_nameidata(path, nd);
} else {
+ /*
+ * We are not in RCU mode and nd->path.mnt == path->mnt.
+ * In that case path_to_nameidata() would just do
+ *
+ * dput(nd->path.dentry)
+ * nd->path.dentry = path->dentry;
+ *
+ * So here we can save a dget/dput pair by just transferring the
+ * ref to save_parent.dentry.
+ */
save_parent.dentry = nd->path.dentry;
save_parent.mnt = mntget(path->mnt);
nd->path.dentry = path->dentry;
On Wed, 2012-06-06 at 16:31 -0700, Linus Torvalds wrote:
> PeterZ - the fact that we use mutex_lock_nested with
> I_MUTEX_PARENT/I_MUTEX_CHILD/nothing hopefully doesn't screw us up in
> case we get it wrong, right?
Sadly, if you get that annotation wrong you can annotate an actual
deadlock away. This is the reason you have to be very careful when
annotating stuff.
On Thu, Jun 7, 2012 at 3:26 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-06-06 at 16:31 -0700, Linus Torvalds wrote:
>> PeterZ - the fact that we use mutex_lock_nested with
>> I_MUTEX_PARENT/I_MUTEX_CHILD/nothing hopefully doesn't screw us up in
>> case we get it wrong, right?
>
> Sadly, if you get that annotation wrong you can annotate an actual
> deadlock away. This is the reason you have to be very careful when
> annotating stuff.
Is that true even if you can see an actual deadlock on the lock *instances*?
The point of the locking classes (to me) was always that you could see
the deadlock even if it didn't *actually* happen, just *potentially*
happened. If the lock classes then mean that that hides an actual
deadlock information, that's a big downside.
And yeah, that really means that we absolutely *have* to get the
instance data in the lock debug printout, since we can't see deadlocks
any other way. Can you check the compile error that Dave reported for
your patch? Right now Dave is running without it, afaik, due
to your lock-instance patch not really working in practice.
Linus
On Thu, Jun 7, 2012 at 12:07 AM, Miklos Szeredi <[email protected]> wrote:
>
> Yeah, see the comment in the patch below for how it's supposed to work. I
> *think* it's correct.
Ok, yes, that makes me happier.
What would make me happier still is to get rid of the "save_parent"
thing entirely, though.
And I think you should be able to.
Why?
You already have the rule that:
- save_parent.mnt is always the same as "path->mnt" (or NULL, if not saved)
- save_parent.dentry is the same as "dir" when you use it (you have a
BUG_ON() for it not being the same)
- you currently use "save_parent.dentry != NULL" as a flag to say "do
we have the valid state"
So as far as I can tell, you should get rid of all the refcount games
and the "struct path save_parent", and just replace the
"save_parent.dentry != NULL" thing with a flag of whether you have a
valid state.
That would get rid of the whole
if ((nd->flags & LOOKUP_RCU) || nd->path.mnt != path->mnt) {
path_to_nameidata(path, nd);
} else {
save_parent.dentry = nd->path.dentry;
save_parent.mnt = mntget(path->mnt);
nd->path.dentry = path->dentry;
}
thing, and we could just have the old simple unconditional
path_to_nameidata(path, nd);
back.
And then you'd have
if (filp == ERR_PTR(-EOPENSTALE) && save_parent_flag && !retried) {
dput(nd->path.dentry);
nd->path.dentry = dget(dir);
nd->inode = dir->d_inode;
save_parent_flag = false;
if (want_write) {
mnt_drop_write(nd->path.mnt);
want_write = 0;
}
retried = true;
goto retry_lookup;
or something for the EOPENSTALE thing. That would get all the
save_parent refcount games out of the normal path.
Am I missing some reason why that wouldn't work?
Linus
[maintainers of assorted code involved Cc'd]
On Thu, Jun 07, 2012 at 02:29:00AM +0100, Al Viro wrote:
> On Wed, Jun 06, 2012 at 09:19:15PM -0400, Dave Jones wrote:
> > On Wed, Jun 06, 2012 at 05:42:35PM -0700, Linus Torvalds wrote:
> >
> > > So Al meant you to test mutex_is_locked(dentry->d_inode->i_mutex) of
> > > the parents.
> >
> > ok, I ended up with..
> >
> > WARN_ON_ONCE(!mutex_is_locked(&target->d_parent->d_inode->i_mutex));
> >
> > if (dentry->d_parent != NULL)
> != dentry)
>
> ->d_parent *never* should be NULL; when dentry is disconnected from the
> tree (or hadn't been connected to it yet), it points to that dentry itself.
BTW, I've started to put together some documentation on ->d_parent use.
And already have found idiocies.
0) *all* instances of struct dentry should come from __d_alloc(); never
embed it into another object or declare variables of type struct dentry.
AFAICS, that one is satisfied now; any out-of-tree code violating that
deserves to be broken and *will* be broken.
1) For all code outside of fs/dcache.c it's read-only. Never change it
directly, period. Again, satisfied in mainline, out-of-tree code can
go play in the traffic, for all I care.
2) Places that set ->d_parent: __d_alloc(), d_alloc(), __d_move(),
__d_materialize_dentry(). The first two are doing that to new instance
of struct dentry with no references to it anywhere in shared data
structures. The other two have had dentry passed to dentry_lock_for_move().
3) d_alloc(NULL, name) should never happen (will instantly oops in that case).
4) ->d_parent is *never* NULL; the only way to stumble across a dentry with
that not being true is to crawl through the slab page directly and to run
into an object that has been allocated but not initialized yet. Don't do
that (nothing in mainline does, thanks $DEITY). "No parent" == "dentry
has ->d_parent pointing to dentry itself". There's a helper checking that,
albeit with slightly misleading name - IS_ROOT(dentry). Doesn't have to
be fs root - anything still not attached to the tree or already detached
from it will be that way. Any code checking that ->d_parent is NULL is
almost certainly a bug. And we have a bunch of such places in fs/ceph:
ceph_init_dentry()
ceph_d_prune()
ceph_mdsc_build_path()
a few in fs/cifs:
build_path_from_dentry()
complete idiocy in fs/debugfs (not only a check for NULL ->d_parent, but
also one for ->d_parent being a negative dentry; that also can't happen).
debugfs_remove()
debugfs_remove_recursive()
one in fs/ocfs2:
ocfs2_match_dentry()
and one in security/inode.c:
securityfs_remove() (nicked from debugfs, at a guess).
Some of those guys can be simply dropped, but I really wonder what
ceph_init_dentry/ceph_d_prune intended there.
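(For concreteness, a minimal sketch of the intended idiom; the helper name
is made up and not taken from any of the filesystems listed above.)
#include <linux/dcache.h>

static bool dentry_is_detached(struct dentry *dentry)
{
	/*
	 * Wrong: "if (dentry->d_parent == NULL)" is dead code, because
	 * ->d_parent is never NULL.
	 * Right: a detached (or not-yet-attached) dentry points to itself.
	 */
	return IS_ROOT(dentry);
}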
Incidentally, build_path_from_dentry() ought to be lifted into fs/dcache.c;
we have at least three copies and one of them is not an instant roothole
only because hppfs doesn't follow underlying procfs changes. And I'm going
to export is_subdir() - at least gfs2 has grown itself a b0rken copy...
5) all assignments to dentry->d_parent after initialization are under
dentry->d_lock; dentry_lock_for_move() takes care of that.
6) all reassignments to dentry->d_parent have ->d_lock held on old and new
parent. Again, dentry_lock_for_move() takes care of that and so does
d_alloc() (that one - for new parent only; old parent is dentry itself
here and nothing could have seen it yet).
7) attached dentry moving away from old parent must have ->i_mutex on
that old parent held.
8) dentry getting attached to a new parent must have ->i_mutex held on
that new parent.
9) attached dentry getting moved to a new parent must have ->s_vfs_rename_mutex
held.
Now, (7--9) are interesting; the call graph looks so:
__d_move()
<- d_move()
<- __d_unalias()
<- d_materialise_unique()
__d_materialise_dentry()
<- d_materialise_unique()
rename-related callers of d_move() (vfs_rename_dir/vfs_rename_other/
nfs_rename/ocfs2_dentry_move from ocfs2_rename/ceph_rename/v9fs_vfs_rename)
share the same locking environment; we have both parents + ->s_vfs_rename_mutex
in case of cross-directory move held. Pretty much by an accident we have
one difference between vfs_rename_dir and vfs_rename_other - holding ->i_mutex
on *victim* of overwriting rename() is not something d_move() cares about.
We can make that more uniform, but that's really unrelated to that.
Note that in all these cases both dentries are guaranteed to be attached
all along.
Other callers of d_move():
vfat_lookup() does d_move() of an alias; the validity of that
strongly relies on the lack of anything resembling hardlinks on VFAT;
the parents will be the same (and parent of target is locked).
sysfs_lookup() plays insane games with d_move(); I _really_
don't like the look of that code. It tries to play catch-up after
sysfs_rename() has done its thing. AFAICS, the parent *can* be changed
by sysfs_rename(). No lock is held on said parent here; no
->s_vfs_rename_mutex either. Moreover, none of the sysfs_rename() callers
(kobject_move(), etc.) appear to give a damn about checking if it would
create a loop.
* debugfs_rename() - imitates what vfs_rename() is doing. Same
locking environment. BTW,
trap = lock_rename(new_dir, old_dir);
/* Source or destination directories don't exist? */
if (!old_dir->d_inode || !new_dir->d_inode)
goto exit;
is bogus - lock_rename() is taking ->i_mutex on these inodes, for fsck sake!
If this can be called with old_dir or new_dir negative, it's buggered.
* ceph_fill_trace() in CEPH_MDS_OP_RENAME case. Seems to be
done while ceph_rename() is being executed; i.e. the usual locking
environment holds. No idea why it feels the need to grab references
to old_dentry and new_dentry in ceph_rename(), though - if that thing
can outlive ceph_rename(), we are in bad trouble. And if it can't, there's
no reason to pin them down that way.
* d_splice_alias() - broken. Called without any locking on the
old directory, nevermind ->s_vfs_rename_mutex. I really believe that
this pair of commits needs to be reverted. The earlier code used to
guarantee that alias would be detached.
d_materialise_unique(dentry, inode) must be called with dentry->d_parent
having its ->d_inode->i_mutex held. AFAICS, that's satisfied for all
callers - they come either from ->lookup() or from ->readdir(), both
having the parent inode locked. ceph is a possible exception; it's
damn hard to trace back to the places triggering ceph_fill_trace() in
general ;-/
__d_unalias(), in case we are asked for cross-directory move, grabs
->s_vfs_rename_mutex first (with trylock). That makes ->d_parent of
alias stable; then it grabs ->i_mutex on that parent (again with trylock)
and we are all set.
__d_materialise_dentry(dentry, alias) is called only when alias is
unattached. That looks like a potential race, BTW - what's to prevent it
from becoming attached right under us? We do have __d_materialise_dentry()
and __d_unalias serialized by alias->d_inode->i_lock, but... there's another
way for that unattached alias to get attached - d_splice_alias(). And
it's done without ->i_lock there. OTOH, it's not a problem to extend
the protection of ->i_lock to that spot as well.
I'll continue this series in a couple of hours - there's more. In the
meanwhile, I'd like to hear from ceph, sysfs and nfsd folks - see above
for the reasons.
Hi Al,
On Thu, 7 Jun 2012, Al Viro wrote:
> [maintainers of assorted code involved Cc'd]
>
> On Thu, Jun 07, 2012 at 02:29:00AM +0100, Al Viro wrote:
> > On Wed, Jun 06, 2012 at 09:19:15PM -0400, Dave Jones wrote:
> > > On Wed, Jun 06, 2012 at 05:42:35PM -0700, Linus Torvalds wrote:
> > >
> > > > So Al meant you to test mutex_is_locked(dentry->d_inode->i_mutex) of
> > > > the parents.
> > >
> > > ok, I ended up with..
> > >
> > > WARN_ON_ONCE(!mutex_is_locked(&target->d_parent->d_inode->i_mutex));
> > >
> > > if (dentry->d_parent != NULL)
> > != dentry)
> >
> > ->d_parent *never* should be NULL; when dentry is disconnected from the
> > tree (or hadn't been connected to it yet), it points to that dentry itself.
>
> BTW, I've started to put together some documentation on ->d_parent use.
> And already have found idiocies.
>
> 0) *all* instances of struct dentry should come from __d_alloc(); never
> embed it into another object or declare variables of type struct dentry.
> AFAICS, that one is satisfied now; any out-of-tree code violating that
> deserves to be broken and *will* be broken.
>
> 1) For all code outside of fs/dcache.c it's read-only. Never change it
> directly, period. Again, satisfied in mainline, out-of-tree code can
> go play in the traffic, for all I care.
>
> 2) Places that set ->d_parent: __d_alloc(), d_alloc(), __d_move(),
> __d_materialize_dentry(). The first two are doing that to new instance
> of struct dentry with no references to it anywhere in shared data
> structures. The other two have had dentry passed to dentry_lock_for_move().
>
> 3) d_alloc(NULL, name) should never happen (will instantly oops in that case).
>
> 4) ->d_parent is *never* NULL; the only way to stumble across a dentry with
> that not being true is to crawl through the slab page directly and to run
> into an object that has been allocated by not initialized yet. Don't do
> that (nothing in mainline does, thanks $DEITY). "No parent" == "dentry
> has ->d_parent pointing to dentry itself". There's a helper checking that,
> albeit with slightly misleading name - IS_ROOT(dentry). Doesn't have to
> be fs root - anything still not attached to the tree or already detached
> from it will be that way. Any code checking that ->d_parent is NULL is
> almost certainly a bug. And we have a bunch of such places in fs/ceph:
> ceph_init_dentry()
> ceph_d_prune()
> ceph_mdsc_build_path()
The ceph_init_dentry check was added in 92cf7652 in response to getting a
dentry with null d_parent out of d_obtain_alias(). I suspect it was
papering over an old bug in that case. Pushed a patch to the ceph tree to
clean these checks out, see f3722944a8565edf434ee44bdfa37715ae0d91cd in
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git. Let
me know if you want to carry it, otherwise I'll push it in the next
window.
> a few in fs/cifs:
> build_path_from_dentry()
> complete idiocy in fs/debugfs (not only check for NULL ->d_parent, but
> one for ->d_parent being a negative dentry; also can't happen).
> debugfs_remove()
> debugfs_remove_recursive()
> one in fs/ocfs2:
> ocfs2_match_dentry()
> and one in security/inode.c:
> securityfs_remove() (nicked from debugfs, at a guess).
> Some of those guys can be simply dropped, but I really wonder what
> ceph_init_dentry/ceph_d_prune intended there.
>
> Incidentall, build_path_from_dentry() ought to be lifted into fs/dcache.c;
> we have at least three copies and one of them is not an instant roothole
> only because hppfs doesn't follow underlying procfs changes. And I'm going
> to export is_subdir() - at least gfs2 has grown itself a b0rken copy...
>
> 5) all assignments to dentry->d_parent after initialization are under
> dentry->d_lock; dentry_lock_for_move() takes care of that.
>
> 6) all reassignments to dentry->d_parent have ->d_lock held on old and new
> parent. Again, dentry_lock_for_move() takes care of that and so does
> d_alloc() (that one - for new parent only; old parent is dentry itself
> here and nothing could have seen it yet).
>
> 7) attached dentry moving away from old parent must have ->i_mutex on
> that old parent held.
>
> 8) dentry getting attached to a new parent must have ->i_mutex held on
> that new parent.
>
> 9) attached dentry getting moved to a new parent must have ->s_vfs_rename_mutex
> held.
>
> Now, (7--9) are interesting; the call graph looks so:
> __d_move()
> <- d_move()
> <- __d_unalias()
> <- d_materialise_unique()
> __d_materialise_dentry()
> <- d_materialise_unique()
>
> rename-related callers of d_move() (vfs_rename_dir/vfs_rename_other/
> nfs_rename/ocfs2_dentry_move from ocfs2_rename/ceph_rename/v9fs_vfs_rename)
> share the same locking environment; we have both parents + ->s_vfs_rename_mutex
> in case of cross-directory move held. Pretty much by an accident we have
> one difference between vfs_rename_dir and vfs_rename_other - holding ->i_mutex
> on *victim* of overwriting rename() is not something d_move() cares about.
> We can make that more uniform, but that's really unrelated to that.
> Note that in all these cases both dentries are guaranteed to be attached
> all along.
>
> Other callers of d_move():
> vfat_lookup() does d_move() of an alias; the validity of that
> strongly relies on the lack of anything resembling hardlinks on VFAT;
> the parents will be the same (and parent of target is locked).
> sysfs_lookup() plays insane games with d_move(); I _really_
> don't like the look of that code. It tries to play catch-up after
> sysfs_rename() done its thing. AFAICS, the parent *can* be changed
> by sysfs_rename(). No lock is held on said parent here; no
> ->s_vfs_rename_mutex either. Moreover, none of the sysfs_rename() callers
> (kobject_move(), etc.) appear to give a damn about checking if it would
> create a loop.
> * debugfs_rename() - imitates what vfs_rename() is doing. Same
> locking environment. BTW,
> trap = lock_rename(new_dir, old_dir);
> /* Source or destination directories don't exist? */
> if (!old_dir->d_inode || !new_dir->d_inode)
> goto exit;
> is bogus - lock_rename() is taking ->i_mutex on these inodes, for fsck sake!
> If this can be called with old_dir or new_dir negative, it's buggered.
> * ceph_fill_trace() in CEPH_MDS_OP_RENAME case. Seems to be
> done while ceph_rename() is being executed; i.e. the usual locking
> environment holds. No idea why it feels the need to grab references
> to old_dentry and new_dentry in ceph_rename(), though - if that thing
> can outlive ceph_rename(), we have a bad trouble. And if it can't, there's
> no reasons to pin them down that way.
The references are owned by the request struct, which can outlive
ceph_rename() in the case of ^C or whatever. When that happens, the
request is marked aborted and the ceph_fill_trace() code doesn't run
(since the locks are no longer held). It still needs the refs to maintain
its own sanity, however.
> * d_splice_alias() - broken. Called without any locking on the
> old directory, nevermind ->s_vfs_rename_mutex. I really believe that
> this pair of commits needs to be reverted. The earlier code used to
> guarantee that alias would be detached.
>
> d_materialise_unique(dentry, inode) must be called with dentry->d_parent
> having its ->d_inode->i_mutex held. AFAICS, that's satisfied for all
> callers - they come either from ->lookup() or from ->readdir(), both
> having the parent inode locked. ceph is a possible exceptions; it's
> damn hard to trace back to the places triggering ceph_fill_trace() in
> general ;-/
I know, sorry about that. :)
ceph_fill_trace() runs on the struct ceph_mds_request, which has an
r_locked_dir member that is verified before calling splice_dentry() ->
d_materialise_unique(). I just double-checked and it looks like all call
sites are safe.
If ceph_rename() (or a similar method that is holding the dir i_mutex)
aborts, ceph_mdsc_do_request() marks the aborted flag under r_fill_mutex
to avoid races with ceph_fill_trace(), so we can rely on those locks being
held for the duration.
Does that clear things up?
sage
>
> __d_unalias(), in case we are asked for cross-directory move, grabs
> ->s_vfs_rename_mutex first (with trylock). That makes ->d_parent of
> alias stable; then it grabs ->i_mutex on that parent (again with trylock)
> and we are all set.
>
> __d_materialise_dentry(dentry, alias) is called only when alias is
> unattached. That looks like potential race, BTW - what's to prevent it
> from becoming attached right under us? We do have __d_materialise_dentry()
> and __d_ualias serialized by alias->d_inode->i_lock, but... there's another
> way for that unattached alias to get attached - d_splice_alias(). And
> it's done without ->i_lock there. OTOH, it's not a problem to extend
> the protection of ->i_lock to that spot as well.
>
> I'll continue this series in a couple of hours - there's more. In the
> meanwhile, I'd like to hear from ceph, sysfs and nfsd folks - see above
> for the reasons.
>
>
Al Viro <[email protected]> writes:
> [maintainers of assorted code involved Cc'd]
> 7) attached dentry moving away from old parent must have ->i_mutex on
> that old parent held.
>
> 8) dentry getting attached to a new parent must have ->i_mutex held on
> that new parent.
>
> 9) attached dentry getting moved to a new parent must have ->s_vfs_rename_mutex
> held.
>
> Now, (7--9) are interesting; the call graph looks so:
> __d_move()
> <- d_move()
> <- __d_unalias()
> <- d_materialise_unique()
> __d_materialise_dentry()
> <- d_materialise_unique()
So at least for filesystems that don't implement
inode->i_op->rename, like sysfs, I don't see where you get your
rules 7-9 for d_move.
We take the appropriate dentry locks in the appropriate order, so d_move
and the dcache should not care in the slightest about the inode
mutexes.
If we need the inode mutexes to make the dcache bits safe then
something really is insane. There may be subtle insanities in the
vfs that require the inode mutexes of the parents in d_move but I am
certainly not seeing them. At least as I read it the code in __d_move
only touches and modifies dentry fields.
For any distributed filesystem and for sysfs in particular all of the
vfs inode mutex locking in the case of rename is absolutely useless as
the renames don't go through vfs, and the vfs simply does not have
enough information to use generic locks.
> sysfs_lookup() plays insane games with d_move(); I _really_
> don't like the look of that code. It tries to play catch-up after
> sysfs_rename() done its thing. AFAICS, the parent *can* be changed
> by sysfs_rename(). No lock is held on said parent here; no
> ->s_vfs_rename_mutex either. Moreover, none of the sysfs_rename() callers
> (kobject_move(), etc.) appear to give a damn about checking if it would
> create a loop.
Can sysfs_lookup cause the weird renameat problems we are seeing? No.
We don't leave any locks hanging, and locks are all taken in the order
the vfs expects them to be taken and in the proper nesting order.
Your complaints about the inode mutexes of the parent directories
and s_vfs_rename_mutex seem to be completely irrelevant as those
locks mean nothing to sysfs.
All sysfs_lookup is doing is opportunistically updating the dcache
to reflect reality if we happen to come across a sysfs_dirent that
has been renamed, and since it is just the dcache that is being
updated the locks that d_move takes appear completely sufficient
for that to be safe.
Can sysfs_rename change the sysfs_dirent parent? Yes.
Do we have sufficient locking in sysfs_rename? Yes, we
take the sysfs_mutex; that is overkill but it serializes all
sysfs_dirent changes.
Is there loop prevention for sysfs_rename? No. However there
are only 5 callers of device_move and all of them are known
not to create loops.
It is probably worth it someday to get around to adding a test in
sysfs_rename to double check and verify that loops are not added.
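A minimal sketch of what such a test could look like (a hypothetical helper,
assuming it runs under sysfs_mutex and walks the sysfs_dirent ->s_parent
chain declared in fs/sysfs/sysfs.h):
static bool sysfs_rename_would_create_loop(struct sysfs_dirent *sd,
					   struct sysfs_dirent *new_parent_sd)
{
	struct sysfs_dirent *p;

	/*
	 * Walk up from the proposed new parent; if we meet sd on the way,
	 * the rename would make sd an ancestor of itself.  The caller is
	 * assumed to hold sysfs_mutex so the ->s_parent chain is stable.
	 */
	for (p = new_parent_sd; p; p = p->s_parent)
		if (p == sd)
			return true;
	return false;
}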
For purposes of analyzing sysfs_lookup we can assume that renames do not
create loops, as that would imply a bug in sysfs_rename.
The interactions between the vfs and sysfs are indeed insane. Because
of the way the vfs is designed the vfs must tell lies about deleted
files and directories when there are submounts involved.
Additionally lies also get told to the vfs about renames because the vfs
implements an on-demand consistency model with respect to remote
filesystems. With the result that frequently we don't have full
knowledge about renames when we are revalidating directories, so we make a
rename look like an unlink+link pair instead.
So I believe you are asking: is sysfs_lookup safe? Yes.
What sysfs_lookup does, if it happens to have sufficient knowledge
to detect that a rename happened in the past, is:
- Each sysfs_dirent has at most one struct dentry per super block.
- If d_find_alias finds a dentry then I know that the dentry for my
inode is in the wrong place in the dcache.
- d_move performs all of the necessary dcache locking, to ensure moving
a dentry is safe even if the parent is renamed, at least with respect
to the dcache.
- I hold sysfs_mutex over the whole thing which guarantees that the
sysfs directory structure does not change during this time.
> I'll continue this series in a couple of hours - there's more. In the
> meanwhile, I'd like to hear from ceph, sysfs and nfsd folks - see above
> for the reasons.
I hope my explanations help.
I'd like to hear why you happen to think s_vfs_rename_mutex and the
inode mutexes of the parent directories are at all interesting in the
case of d_move and in the case of sysfs.
Eric
On Thu, Jun 07, 2012 at 04:12:45PM -0700, Eric W. Biederman wrote:
> We take the approprate dentry locks in the approparite order so d_move
> and the dcache should not care in the slightest about the inode
> mutecies.
>
> If we need the inode mutecies to make the dcache bits safe then
> something really is insane. There may be subtle insanities in the
> vfs that require the inode muticies of the parents in d_move but I am
> certainly not seeing them. At least as I read it the code in __d_move
> only touches and modifies dentry fields.
Yes. Now, go take a look at e.g. the locking order on ->d_lock. No,
I'm not saying that I like it. Not at all. But we do rely on the
non-local protections for tree topology, just to make sure that the
damn thing has the locking order consistent - not changing between
the moments you take locks you've ordered, for starters.
I realize that "serialize all operations on a single per-machine mutex" is a
solution. It's just not something feasible when we are talking about all
directory tree modifications on a general-purpose filesystem. So no,
sysfs approach to that kind of problems is not usable here.
I doubt that we have something sysfs-related in the deadlocks davej is seeing,
but I seriously suspect that I can cook one based on sysfs_rename() setting
the things up for silent topology changes on ->lookup(). I would suggest
using d_materialise_unique() there - that one *does* take care to take
locks needed.
BTW, looking at the code in sysfs_lookup()... why bother with d_set_d_op()
instead of just sb->s_d_op = &sysfs_dentry_ops; once during sysfs_fill_super()?
In the worst case you need to do that after you've allocated the root
dentry, depending on whether you are willing or not to make ->d_revalidate()
return 1 whenever it's called on the root dentry...
On Thu, Jun 7, 2012 at 4:12 PM, Eric W. Biederman <[email protected]> wrote:
>
> We take the approprate dentry locks in the approparite order so d_move
> and the dcache should not care in the slightest about the inode
> mutecies.
Part of the problem is that you can't even *determine* the appropriate
order without holding the rename mutex.
Now, it may turn out to be a non-issue for sysfs just because there
are no unconstrained directory renames there, but seriously: even the
d_ancestor() check itself (which is how we determine the dentry lock
order) needs that filesystem to be quiescent wrt directory renames in
order to work.
So it may not actually depend on the inode->i_mutex, but it does need
some serialization outside the dcache subsystem.
Any per-filesystem mutex should do, so if sysfs always holds the
sysfs_mutex - and never allows user-initiated renames - it should be
safe.
Linus
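For reference, this is roughly how lock_rename() picks that order (a
paraphrase of the fs/namei.c code of this era, not the exact text): it takes
->s_vfs_rename_mutex first precisely so that the d_ancestor() answers cannot
change underneath it.
struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
{
	struct dentry *p;

	if (p1 == p2) {
		mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
		return NULL;
	}

	/* Freeze directory topology so the ancestry checks stay valid... */
	mutex_lock(&p1->d_inode->i_sb->s_vfs_rename_mutex);

	/* ...then use d_ancestor() to decide which parent to lock first. */
	p = d_ancestor(p2, p1);
	if (p) {
		mutex_lock_nested(&p2->d_inode->i_mutex, I_MUTEX_PARENT);
		mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_CHILD);
		return p;
	}

	p = d_ancestor(p1, p2);
	if (p) {
		mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
		mutex_lock_nested(&p2->d_inode->i_mutex, I_MUTEX_CHILD);
		return p;
	}

	mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
	mutex_lock_nested(&p2->d_inode->i_mutex, I_MUTEX_CHILD);
	return NULL;
}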
On Thu, Jun 07, 2012 at 04:57:13PM -0700, Linus Torvalds wrote:
> Any per-filesystem mutex should do, so if sysfs always holds the
> sysfs_mutex - and never allows user-initiated renames - it should be
> safe.
Frankly, I would very much prefer to have the same locking rules wherever
possible. The locking system is already overcomplicated and making its
analysis fs-dependent as well... <shudder> Sure, we can do that, and that
might even work, until we find out that some piece of code that started
as a helper to some function never called on sysfs dentries had been
reused on the path that *is* reachable on sysfs. At which point we are
suddenly in trouble.
I wouldn't be bothered so much if the overall picture had been simpler;
unfortunately, it isn't.
Eric, how about this - if nothing else, that makes code in there simpler
and less dependent on details of VFS guts:
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index e6bb9b2..5579826 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -363,7 +363,7 @@ static void sysfs_dentry_iput(struct dentry *dentry, struct inode *inode)
iput(inode);
}
-static const struct dentry_operations sysfs_dentry_ops = {
+const struct dentry_operations sysfs_dentry_ops = {
.d_revalidate = sysfs_dentry_revalidate,
.d_delete = sysfs_dentry_delete,
.d_iput = sysfs_dentry_iput,
@@ -795,16 +795,8 @@ static struct dentry * sysfs_lookup(struct inode *dir, struct dentry *dentry,
}
/* instantiate and hash dentry */
- ret = d_find_alias(inode);
- if (!ret) {
- d_set_d_op(dentry, &sysfs_dentry_ops);
- dentry->d_fsdata = sysfs_get(sd);
- d_add(dentry, inode);
- } else {
- d_move(ret, dentry);
- iput(inode);
- }
-
+ dentry->d_fsdata = sysfs_get(sd);
+ ret = d_materialise_unique(dentry, inode);
out_unlock:
mutex_unlock(&sysfs_mutex);
return ret;
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 52c3bdb..c15a7a3 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -68,6 +68,7 @@ static int sysfs_fill_super(struct super_block *sb, void *data, int silent)
}
root->d_fsdata = &sysfs_root;
sb->s_root = root;
+ sb->s_d_op = &sysfs_dentry_ops;
return 0;
}
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index 661a963..d73c093 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -157,6 +157,7 @@ extern struct kmem_cache *sysfs_dir_cachep;
*/
extern struct mutex sysfs_mutex;
extern spinlock_t sysfs_assoc_lock;
+extern const struct dentry_operations sysfs_dentry_ops;
extern const struct file_operations sysfs_dir_operations;
extern const struct inode_operations sysfs_dir_inode_operations;
On Thu, Jun 7, 2012 at 5:36 PM, Al Viro <[email protected]> wrote:
>
> Frankly, I would very much prefer to have the same locking rules wherever
> possible.  The locking system is already overcomplicated and making its
> analysis fs-dependent as well... <shudder>
I do agree that it would be better if we avoid it. I was just trying
to explain that the dentry locking is *not* enough, for the simple
reason that it relies on upper-level non-dentry locking just to work.
Your patch looks good, but whether it works I have no idea ;)
Linus
On Fri, Jun 08, 2012 at 01:36:04AM +0100, Al Viro wrote:
> Eric, how about this - if nothing else, that makes code in there simpler
> and less dependent on details of VFS guts:
Argh. No, it's not enough. Why are you using ->d_iput()? You are not
doing anything unusual with inode; the natural place for that is in
->d_release() and then it will get simpler rules wrt setting ->d_fsdata.
As it is, you need to do that exactly after the point where you know
that the dentry won't be dropped without going through d_add().
OK, I've split that in two commits and put into vfs.git#sysfs; take a look
and comment, please. Should get to git.kernel.org in a few...
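(A minimal sketch of what moving that work from ->d_iput() to ->d_release()
could look like; an assumption about the direction only, not the actual
vfs.git#sysfs commits.)
static void sysfs_dentry_release(struct dentry *dentry)
{
	/*
	 * Drop the sysfs_dirent reference at dentry teardown; the inode
	 * holds its own reference to the sysfs_dirent, so the default
	 * iput() behaviour is sufficient and no ->d_iput() is needed.
	 */
	sysfs_put(dentry->d_fsdata);
}

static const struct dentry_operations sysfs_dentry_ops = {
	.d_revalidate	= sysfs_dentry_revalidate,
	.d_delete	= sysfs_dentry_delete,
	.d_release	= sysfs_dentry_release,
};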
Al Viro <[email protected]> writes:
> On Thu, Jun 07, 2012 at 04:57:13PM -0700, Linus Torvalds wrote:
>
>> Any per-filesystem mutex should do, so if sysfs always holds the
>> sysfs_mutex - and never allows user-initiated renames - it should be
>> safe.
>
> Frankly, I would very much prefer to have the same locking rules wherever
> possible. The locking system is already overcomplicated and making its
> analysis fs-dependent as well... <shudder> Sure, we can do that, and that
> might even work, until we find out that some piece of code that started
> as a helper to some function never called on sysfs dentries had been
> reused on the path that *is* reachable on sysfs. At which point we are
> suddenly in trouble.
Staring at it I see what I was missing. The practical issue is
lock_rename(), and any parts of the vfs that depend on lock_rename().
d_move and the dcache are made safe just by rename_lock. However other
parts of the vfs that care about using d_ancestor are not. I can't
immediately see a case that really cares but I can't rule such a case
out easily either.
> I wouldn't be bothered so much if the overall picture had been simpler;
> unfortunately, it isn't.
>
> Eric, how about this - if nothing else, that makes code in there simpler
> and less dependent on details of VFS guts:
>
> diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
> index e6bb9b2..5579826 100644
> --- a/fs/sysfs/dir.c
> +++ b/fs/sysfs/dir.c
> @@ -363,7 +363,7 @@ static void sysfs_dentry_iput(struct dentry *dentry, struct inode *inode)
> iput(inode);
> }
>
> -static const struct dentry_operations sysfs_dentry_ops = {
> +const struct dentry_operations sysfs_dentry_ops = {
> .d_revalidate = sysfs_dentry_revalidate,
> .d_delete = sysfs_dentry_delete,
> .d_iput = sysfs_dentry_iput,
> @@ -795,16 +795,8 @@ static struct dentry * sysfs_lookup(struct inode *dir, struct dentry *dentry,
> }
>
> /* instantiate and hash dentry */
> - ret = d_find_alias(inode);
> - if (!ret) {
> - d_set_d_op(dentry, &sysfs_dentry_ops);
> - dentry->d_fsdata = sysfs_get(sd);
> - d_add(dentry, inode);
> - } else {
> - d_move(ret, dentry);
> - iput(inode);
> - }
> -
> + dentry->d_fsdata = sysfs_get(sd);
> + ret = d_materialise_unique(dentry, inode);
I have a small problem with d_materialise_unique. For renames of files
d_materialise_unique calls __d_instantiate_unique. __d_instantiate_unique
does not detect renames of files. Which at least misses the rename
of sysfs symlinks.
Could we put together a d_materialise_unalias for inodes that we know
always have only one dentry? That I would be happy to use.
I think the reason I wound up with my own version was that the dcache
did not provide what I needed and it was just a few lines to code my own.
> diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
> index 52c3bdb..c15a7a3 100644
> --- a/fs/sysfs/mount.c
> +++ b/fs/sysfs/mount.c
> @@ -68,6 +68,7 @@ static int sysfs_fill_super(struct super_block *sb, void *data, int silent)
> }
> root->d_fsdata = &sysfs_root;
> sb->s_root = root;
> + sb->s_d_op = &sysfs_dentry_ops;
I have no problem with this bit. To answer your earlier question, this code
predates s_d_op, which is why sysfs was not using it.
> return 0;
> }
>
> diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
> index 661a963..d73c093 100644
> --- a/fs/sysfs/sysfs.h
> +++ b/fs/sysfs/sysfs.h
> @@ -157,6 +157,7 @@ extern struct kmem_cache *sysfs_dir_cachep;
> */
> extern struct mutex sysfs_mutex;
> extern spinlock_t sysfs_assoc_lock;
> +extern const struct dentry_operations sysfs_dentry_ops;
>
> extern const struct file_operations sysfs_dir_operations;
> extern const struct inode_operations sysfs_dir_inode_operations;
On Thu, Jun 07, 2012 at 08:36:07PM +0100, Al Viro wrote:
> Other callers of d_move():
> * debugfs_rename() - imitates what vfs_rename() is doing. Same
> locking environment. BTW,
> trap = lock_rename(new_dir, old_dir);
> /* Source or destination directories don't exist? */
> if (!old_dir->d_inode || !new_dir->d_inode)
> goto exit;
> is bogus - lock_rename() is taking ->i_mutex on these inodes, for fsck sake!
> If this can be called with old_dir or new_dir negative, it's buggered.
It's worse, actually. If we _ever_ do cross-directory debugfs_rename()
without external serialization, we are in trouble. It does imitate
vfs_rename() (actually - its callers), but there's an unpleasant difference:
instead of "lock parents with lock_rename(), then do lookups and we are
guaranteed nobody will change ->d_parent of children we are working with"
it's "lock the new parent and whatever happens to be the current parent
of the object given to us; do lookup for target, pray that the old parent
still was the parent of our object by the time we got the locks".
AFAICS, there's only one caller doing cross-directory moves (__clk_reparent())
and currently all callers are serialized by a mutex in there, but that's
not documented anywhere - not for __clk_reparent(), not for debugfs_rename().
On Thu, Jun 07, 2012 at 07:08:04PM -0700, Eric W. Biederman wrote:
> > + dentry->d_fsdata = sysfs_get(sd);
> > + ret = d_materialise_unique(dentry, inode);
>
> I have a small problem with d_materialise_unique. For renames of files
> d_materialise_unique calls __d_instantiate_unique. __d_instantiate_unique
> does not detect renames of files. Which at least misses the rename
> of sysfs symlinks.
Er... yes, but why do we care? It's not as if you had a hardwired
reference to dentry from your objects, after all (can't, with multiple
superblocks). So you get old stale dentry at the old location and
a new one where we'd moved that sucker. They have the same inode
and each holds a reference to the same sd; ->d_revalidate() at the
old location must invalidate the old instance anyway, since you are
not guaranteed that lookup at the new one will happen before repeated
lookup at the old one.
Directories *are* special in that respect, but symlinks are trivial...
VFS doesn't care if you have extra dentries for those and neither does
sysfs, AFAICS.
It's not that we couldn't teach d_materialise_unique() about those (e.g.
introduce a new dentry flag and treat dentries with it as directories
for d_materialise_unique() purposes); I would like to understand the
reasons for doing that, though.
Al Viro <[email protected]> writes:
> On Fri, Jun 08, 2012 at 01:36:04AM +0100, Al Viro wrote:
>> Eric, how about this - if nothing else, that makes code in there simpler
>> and less dependent on details of VFS guts:
>
> Argh. No, it's not enough. Why are you using ->d_iput()? You are not
> doing anything unusual with inode; the natural place for that is in
> ->d_release() and then it will get simpler rules wrt setting ->d_fsdata.
No good reason. We do tie inode numbers to the sysfs_dirent but the
inode was changed quite a while ago to hold its own reference to the
sysfs_dirent. So using d_iput looks like sysfs historical baggage.
> As it is, you need to do that exactly after the point where you know
> that it dentry won't be dropped without going through d_add().
>
> OK, I've split that in two commits and put into vfs.git#sysfs; take a look
> and comment, please. Should get to git.kernel.in a few...
The patches on your sysfs branch look reasonable.
I am still leery of d_materialise_unique as it allows creating aliases
on non-directories. It isn't a functional problem as d_revalidate will
catch the issue and make it look like we have an unlink/link pair instead
of a proper rename. However since it is possible I would like to aim
for the higher quality of implementation and show renames as renames.
What would be ideal for sysfs is the d_add_singleton function below. It
does what is needed without the weird d_materialise strangeness that is
in d_materialise_unique. But if an all-singing, all-dancing,
all-confusing function is preferable I would not mind.
What I would really like is an interface so that a distributed/remote
filesystem can provide an inotify-like stream of events and we can really
implement inotify in a distributed filesystem. But since I am too lazy
to do that I am reluctant to give up what progress I have actually made
in that direction.
Eric
diff --git a/fs/dcache.c b/fs/dcache.c
index 85c9e2b..2aab524 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2537,6 +2537,74 @@ out_nolock:
}
EXPORT_SYMBOL_GPL(d_materialise_unique);
+/**
+ * d_add_singleton - add an inode with only a single dentry
+ * @entry: dentry to instantiate
+ * @inode: inode to attach to this dentry
+ *
+ * Fill in inode information in the entry. On success, it returns NULL.
+ * If an alias of "entry" already exists, then we assume that a rename
+ * has occurred and not been reported so the alias is renamed and we
+ * return the aliased dentry and drop one reference to the inode.
+ *
+ * Note that in order to avoid conflicts with rename() etc, the caller
+ * had better be holding the parent directory semaphore.
+ *
+ * This also assumes that the inode count has been incremented
+ * (or otherwise set) by the caller to indicate that it is now
+ * in use by the dcache.
+ */
+struct dentry *d_add_singleton(struct dentry *entry, struct inode *inode)
+{
+ struct dentry *alias, *actual = entry;
+
+ if (!inode) {
+ __d_instantiate(entry, NULL);
+ d_rehash(entry);
+ goto out_nolock;
+ }
+
+ spin_lock(&inode->i_lock);
+
+ /* Does an aliased dentry already exist? */
+ alias = __d_find_alias(inode);
+ if (alias) {
+ write_seqlock(&rename_lock);
+
+ if (d_ancestor(alias, entry)) {
+ /* Check for loops */
+ actual = ERR_PTR(-ELOOP);
+ spin_unlock(&inode->i_lock);
+ } else {
+ /* Avoid aliases. This drops inode->i_lock */
+ actual = __d_unalias(inode, entry, alias);
+ }
+ write_sequnlock(&rename_lock);
+ if (IS_ERR(actual)) {
+ if (PTR_ERR(actual) == -ELOOP)
+ pr_warn_ratelimited(
+ "VFS: Lookup of '%s' in %s %s"
+ " would have caused loop\n",
+ entry->d_name.name,
+ inode->i_sb->s_type->name,
+ inode->i_sb->s_id);
+ dput(alias);
+ }
+ goto out_nolock;
+ }
+ __d_instantiate(entry, inode);
+ spin_unlock(&inode->i_lock);
+ d_rehash(entry);
+out_nolock:
+ if (actual == entry) {
+ security_d_instantiate(entry, inode);
+ return NULL;
+ }
+ iput(inode);
+ return actual;
+}
+
+
static int prepend(char **buffer, int *buflen, const char *str, int namelen)
{
*buflen -= namelen;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 094789f..9613d4c 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -219,6 +219,8 @@ static inline int dname_external(struct dentry *dentry)
extern void d_instantiate(struct dentry *, struct inode *);
extern struct dentry * d_instantiate_unique(struct dentry *, struct inode *);
extern struct dentry * d_materialise_unique(struct dentry *, struct inode *);
+extern struct dentry * d_materialise_unalias(struct dentry *, struct inode *);
+extern struct dentry *d_add_singleton(struct dentry *, struct inode *);
extern void __d_drop(struct dentry *dentry);
extern void d_drop(struct dentry *dentry);
extern void d_delete(struct dentry *);
On Thu, Jun 07, 2012 at 10:25:46PM -0700, Eric W. Biederman wrote:
> I am still learly of d_materialise_unique as it allows to create alias's
> on non-directories. It isn't a functional problem as d_revalidate will
> catch the issue and make it look like we have a unlink/link pair instead
> of a proper rename. However since it is possible I would like to aim
> for the higher quality of implemntation and use show renames as renames.
???
Please, explain. link/unlink pair in which sense? If you are thinking
of inotify and its ilk and tie some of that to dentry eviction, you
have a far worse problem. What stream of events do you expect if you
do
some action that triggers sysfs_rename()
stat(2) on old pathname
stat(2) on new pathname
No matter what we do in lookup at the new pathname, the second step will
be "failed revalidate". If you tie something (presumably unlink side of
your link/unlink pair) to that event, you'll get a very lousy series of
events, no matter what you do on the third step.
I don't see what kind of *notify hookup do you have in mind. Anything that
treats "dentry failed revalidation or got evicted by memory pressure" as
"unlink" is completely nuts, IMO.
On Thu, 2012-06-07 at 08:30 -0700, Linus Torvalds wrote:
> On Thu, Jun 7, 2012 at 3:26 AM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, 2012-06-06 at 16:31 -0700, Linus Torvalds wrote:
> >> PeterZ - the fact that we use mutex_lock_nested with
> >> I_MUTEX_PARENT/I_MUTEX_CHILD/nothing hopefully doesn't screw us up in
> >> case we get it wrong, right?
> >
> > Sadly, if you get that annotation wrong you can annotate an actual
> > deadlock away. This the reason you have to be very careful when
> > annotating stuff.
>
> Is that true even if you can see an actual deadlock on the lock *instances*?
Yep.. :/
> The point of the locking classes (to me) was always that you could see
> the deadlock even if it didn't *actually* happen, just *potentially*
> happened. If the lock classes then mean that that hides an actual
> deadlock information, that's a big downside.
Right, so in order to report a deadlock on instances we'd have to do a
full lock graph walk on every contended lock acquisition. This is quite
expensive.
Although I can look into doing that from say the NMI watchdog and/or
some sysrq key.
> And yeah, that really means that we absolutely *have* to get the
> instance data in the lock debug printout, since we can't see deadlocks
> any other way.
Well, typically deadlocks aren't that hard to find once you trigger
them. The current case of course being the exception to that rule..
> Can you check the compile error that Dave reported for
> your patch, because right now Dave is running without it, afaik, due
> to your lock-instance patch not really working in practice.
Oh, missed that in the thread.. yeah here goes.
Now also compile tested with CONFIG_LOCK_STAT...
---
include/linux/lockdep.h | 44 ++++++++-------
include/trace/events/lock.h | 3 -
kernel/lockdep.c | 128 ++++++++++++++++++++++++++++----------------
3 files changed, 107 insertions(+), 68 deletions(-)
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -211,6 +211,13 @@ struct lock_chain {
*/
#define MAX_LOCKDEP_KEYS ((1UL << MAX_LOCKDEP_KEYS_BITS) - 1)
+enum held_lock_state {
+ hls_unheld = 0,
+ hls_acquiring,
+ hls_contended,
+ hls_acquired,
+};
+
struct held_lock {
/*
* One-way hash of the dependency chain up to this point. We
@@ -254,7 +261,10 @@ struct held_lock {
unsigned int read:2; /* see lock_acquire() comment */
unsigned int check:2; /* see lock_acquire() comment */
unsigned int hardirqs_off:1;
- unsigned int references:11; /* 32 bits */
+ unsigned int state:2; /* see held_lock_state */
+ /* 9 bit hole */
+ unsigned int references:16;
+ /* 16 bit hole */
};
/*
@@ -363,6 +373,18 @@ extern void lockdep_trace_alloc(gfp_t ma
#define lockdep_recursing(tsk) ((tsk)->lockdep_recursion)
+extern void lock_contended(struct lockdep_map *lock, unsigned long ip);
+extern void lock_acquired(struct lockdep_map *lock, unsigned long ip);
+
+#define LOCK_CONTENDED(_lock, try, lock) \
+do { \
+ if (!try(_lock)) { \
+ lock_contended(&(_lock)->dep_map, _RET_IP_); \
+ lock(_lock); \
+ } \
+ lock_acquired(&(_lock)->dep_map, _RET_IP_); \
+} while (0)
+
#else /* !LOCKDEP */
static inline void lockdep_off(void)
@@ -414,31 +436,13 @@ struct lock_class_key { };
#define lockdep_recursing(tsk) (0)
-#endif /* !LOCKDEP */
-
-#ifdef CONFIG_LOCK_STAT
-
-extern void lock_contended(struct lockdep_map *lock, unsigned long ip);
-extern void lock_acquired(struct lockdep_map *lock, unsigned long ip);
-
-#define LOCK_CONTENDED(_lock, try, lock) \
-do { \
- if (!try(_lock)) { \
- lock_contended(&(_lock)->dep_map, _RET_IP_); \
- lock(_lock); \
- } \
- lock_acquired(&(_lock)->dep_map, _RET_IP_); \
-} while (0)
-
-#else /* CONFIG_LOCK_STAT */
-
#define lock_contended(lockdep_map, ip) do {} while (0)
#define lock_acquired(lockdep_map, ip) do {} while (0)
#define LOCK_CONTENDED(_lock, try, lock) \
lock(_lock)
-#endif /* CONFIG_LOCK_STAT */
+#endif /* !LOCKDEP */
#ifdef CONFIG_LOCKDEP
--- a/include/trace/events/lock.h
+++ b/include/trace/events/lock.h
@@ -61,8 +61,6 @@ DEFINE_EVENT(lock, lock_release,
TP_ARGS(lock, ip)
);
-#ifdef CONFIG_LOCK_STAT
-
DEFINE_EVENT(lock, lock_contended,
TP_PROTO(struct lockdep_map *lock, unsigned long ip),
@@ -78,7 +76,6 @@ DEFINE_EVENT(lock, lock_acquired,
);
#endif
-#endif
#endif /* _TRACE_LOCK_H */
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -541,10 +541,23 @@ static void print_lockdep_cache(struct l
printk("%s", name);
}
+static void print_lock_state(struct held_lock *hlock)
+{
+ switch (hlock->state) {
+ /* holding an unheld lock is fail! */
+ case hls_unheld: printk("FAIL: "); break;
+
+ case hls_acquiring: /* fall-through */
+ case hls_contended: printk("blocked: "); break;
+ case hls_acquired: printk("held: "); break;
+ };
+}
+
static void print_lock(struct held_lock *hlock)
{
+ print_lock_state(hlock);
print_lock_name(hlock_class(hlock));
- printk(", at: ");
+ printk(", instance: %p, at: ", hlock->instance);
print_ip_sym(hlock->acquire_ip);
}
@@ -556,7 +569,7 @@ static void lockdep_print_held_locks(str
printk("no locks held by %s/%d.\n", curr->comm, task_pid_nr(curr));
return;
}
- printk("%d lock%s held by %s/%d:\n",
+ printk("%d lock%s on stack by %s/%d:\n",
depth, depth > 1 ? "s" : "", curr->comm, task_pid_nr(curr));
for (i = 0; i < depth; i++) {
@@ -3093,6 +3106,7 @@ static int __lock_acquire(struct lockdep
hlock->check = check;
hlock->hardirqs_off = !!hardirqs_off;
hlock->references = references;
+ hlock->state = hls_acquiring;
#ifdef CONFIG_LOCK_STAT
hlock->waittime_stamp = 0;
hlock->holdtime_stamp = lockstat_clock();
@@ -3607,7 +3621,6 @@ void lockdep_clear_current_reclaim_state
current->lockdep_reclaim_gfp = 0;
}
-#ifdef CONFIG_LOCK_STAT
static int
print_lock_contention_bug(struct task_struct *curr, struct lockdep_map *lock,
unsigned long ip)
@@ -3637,14 +3650,73 @@ print_lock_contention_bug(struct task_st
return 0;
}
+#ifdef CONFIG_LOCK_STAT
+
+static void lockstat_contended(struct held_lock *hlock, unsigned long ip)
+{
+ struct lockdep_map *lock = hlock->instance;
+ int contention_point, contending_point;
+ struct lock_class_stats *stats;
+
+ hlock->waittime_stamp = lockstat_clock();
+
+ contention_point = lock_point(hlock_class(hlock)->contention_point, ip);
+ contending_point = lock_point(hlock_class(hlock)->contending_point,
+ lock->ip);
+
+ stats = get_lock_stats(hlock_class(hlock));
+ if (contention_point < LOCKSTAT_POINTS)
+ stats->contention_point[contention_point]++;
+ if (contending_point < LOCKSTAT_POINTS)
+ stats->contending_point[contending_point]++;
+ if (lock->cpu != smp_processor_id())
+ stats->bounces[bounce_contended + !!hlock->read]++;
+ put_lock_stats(stats);
+}
+
+static void lockstat_acquired(struct held_lock *hlock, unsigned long ip)
+{
+ struct lockdep_map *lock = hlock->instance;
+ struct lock_class_stats *stats;
+ u64 now, waittime = 0;
+ int cpu;
+
+ cpu = smp_processor_id();
+ if (hlock->waittime_stamp) {
+ now = lockstat_clock();
+ waittime = now - hlock->waittime_stamp;
+ hlock->holdtime_stamp = now;
+ }
+
+ stats = get_lock_stats(hlock_class(hlock));
+ if (waittime) {
+ if (hlock->read)
+ lock_time_inc(&stats->read_waittime, waittime);
+ else
+ lock_time_inc(&stats->write_waittime, waittime);
+ }
+ if (lock->cpu != cpu)
+ stats->bounces[bounce_acquired + !!hlock->read]++;
+ put_lock_stats(stats);
+
+ lock->cpu = cpu;
+ lock->ip = ip;
+}
+
+#else /* CONFIG_LOCK_STAT */
+
+static inline void lockstat_contended(struct held_lock *hlock, unsigned long ip) { }
+static inline void lockstat_acquired(struct held_lock *hlock, unsigned long ip) { }
+
+#endif /* CONFIG_LOCK_STAT */
+
static void
__lock_contended(struct lockdep_map *lock, unsigned long ip)
{
struct task_struct *curr = current;
struct held_lock *hlock, *prev_hlock;
- struct lock_class_stats *stats;
unsigned int depth;
- int i, contention_point, contending_point;
+ int i;
depth = curr->lockdep_depth;
/*
@@ -3673,20 +3745,8 @@ __lock_contended(struct lockdep_map *loc
if (hlock->instance != lock)
return;
- hlock->waittime_stamp = lockstat_clock();
-
- contention_point = lock_point(hlock_class(hlock)->contention_point, ip);
- contending_point = lock_point(hlock_class(hlock)->contending_point,
- lock->ip);
-
- stats = get_lock_stats(hlock_class(hlock));
- if (contention_point < LOCKSTAT_POINTS)
- stats->contention_point[contention_point]++;
- if (contending_point < LOCKSTAT_POINTS)
- stats->contending_point[contending_point]++;
- if (lock->cpu != smp_processor_id())
- stats->bounces[bounce_contended + !!hlock->read]++;
- put_lock_stats(stats);
+ hlock->state = hls_contended;
+ lockstat_contended(hlock, ip);
}
static void
@@ -3694,10 +3754,8 @@ __lock_acquired(struct lockdep_map *lock
{
struct task_struct *curr = current;
struct held_lock *hlock, *prev_hlock;
- struct lock_class_stats *stats;
unsigned int depth;
- u64 now, waittime = 0;
- int i, cpu;
+ int i;
depth = curr->lockdep_depth;
/*
@@ -3726,28 +3784,8 @@ __lock_acquired(struct lockdep_map *lock
if (hlock->instance != lock)
return;
- cpu = smp_processor_id();
- if (hlock->waittime_stamp) {
- now = lockstat_clock();
- waittime = now - hlock->waittime_stamp;
- hlock->holdtime_stamp = now;
- }
-
- trace_lock_acquired(lock, ip);
-
- stats = get_lock_stats(hlock_class(hlock));
- if (waittime) {
- if (hlock->read)
- lock_time_inc(&stats->read_waittime, waittime);
- else
- lock_time_inc(&stats->write_waittime, waittime);
- }
- if (lock->cpu != cpu)
- stats->bounces[bounce_acquired + !!hlock->read]++;
- put_lock_stats(stats);
-
- lock->cpu = cpu;
- lock->ip = ip;
+ hlock->state = hls_acquired;
+ lockstat_acquired(hlock, ip);
}
void lock_contended(struct lockdep_map *lock, unsigned long ip)
@@ -3783,12 +3821,12 @@ void lock_acquired(struct lockdep_map *l
raw_local_irq_save(flags);
check_flags(flags);
current->lockdep_recursion = 1;
+ trace_lock_acquired(lock, ip);
__lock_acquired(lock, ip);
current->lockdep_recursion = 0;
raw_local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(lock_acquired);
-#endif
/*
* Used by the testsuite, sanitize the validator state
Al Viro <[email protected]> writes:
> On Thu, Jun 07, 2012 at 10:25:46PM -0700, Eric W. Biederman wrote:
>
>> I am still learly of d_materialise_unique as it allows to create alias's
>> on non-directories. It isn't a functional problem as d_revalidate will
>> catch the issue and make it look like we have a unlink/link pair instead
>> of a proper rename. However since it is possible I would like to aim
>> for the higher quality of implemntation and use show renames as renames.
>
> ???
>
> Please, explain. link/unlink pair in which sense?
In the sense that, if we don't use d_move, a rename will look to
userspace like a pair of sys_link and sys_unlink operations.
If I happen to have a file open with the old name and the dentry passes
through d_drop, /proc/self/fd/N will show the filename as
"...(deleted)".
And in every other way I can think of that is userspace visible this
will look like a pair of link and unlink operations.
> I don't see what kind of *notify hookup do you have in mind. Anything that
> treats "dentry failed revalidation or got evicted by memory pressure" as
> "unlink" is completely nuts, IMO.
In this case, much as it might be convenient to have a *notify report,
what I was thinking of were the much simpler userspace visible aspects,
like what /proc/self/fd/N symlinks report.
In the little corner-case user-visible details, the current state of vfs
support for distributed filesystems looks nuts to me, especially where
we can't apply an appropriate d_move onto a renamed dentry.
The fact that open files, open directories and mount points pin dentries
in memory cause interesting challenges for keeping the local vfs state
in sync with the state of a remote filesystem.
What I would love to be able to do is to replay some kind of journal
that reports what happened to the filesystem outside of the linux vfs
onto the linux vfs so that we can get a more accurate picture of what
really happened to the filesystem. Which should allow *notify and the
like to actually work. Would make the /proc/self/fd/* symlinks more
useful, and allow files that are mount points to be renamed.
But ultimately the change in semantics bugs me. Using d_move less
often feels user visible, and d_materialise_unique, because it
does not handle renames of files, feels like a lurking maintenance bomb
for sysfs.
Especially since renames on files with mount points on them should be
treated differently from normal files.
Speaking of which, I just found a small unhandled case in __d_unalias. We need
to deny renaming of mount points.
Eric
From: "Eric W. Biederman" <[email protected]>
Subject: dcache: Deny renaming via __d_unalias dentries of mountpoints
Make __d_unalias match vfs_rename_dir and vfs_rename_other and don't
allow renaming mount points.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
diff --git a/fs/dcache.c b/fs/dcache.c
index 85c9e2b..d236722 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2380,14 +2380,17 @@ static struct dentry *__d_unalias(struct inode *inode,
struct dentry *dentry, struct dentry *alias)
{
struct mutex *m1 = NULL, *m2 = NULL;
- struct dentry *ret;
+ struct dentry *ret = ERR_PTR(-EBUSY);
+
+ /* Linux does not rename mount points */
+ if (d_mountpoint(alias))
+ goto out_err;
/* If alias and dentry share a parent, then no extra locks required */
if (alias->d_parent == dentry->d_parent)
goto out_unalias;
/* See lock_rename() */
- ret = ERR_PTR(-EBUSY);
if (!mutex_trylock(&dentry->d_sb->s_vfs_rename_mutex))
goto out_err;
m1 = &dentry->d_sb->s_vfs_rename_mutex;
On Fri, Jun 08, 2012 at 09:31:38AM +0200, Peter Zijlstra wrote:
> > Can you check the compile error that Dave reported for
> > your patch, because right now Dave is running without it, afaik, due
> > to your lock-instance patch not really working in practice.
>
> Oh, missed that in the thread.. yeah here goes.
>
> Now also compile tested with CONFIG_LOCK_STAT...
Doesn't apply against Linus' current tree.
$ cat ../lock| patch -p1
patching file include/linux/lockdep.h
Hunk #1 FAILED at 211.
Hunk #2 FAILED at 254.
Hunk #3 FAILED at 363.
Hunk #4 FAILED at 414.
4 out of 4 hunks FAILED -- saving rejects to file include/linux/lockdep.h.rej
patching file include/trace/events/lock.h
Hunk #1 FAILED at 61.
Hunk #2 FAILED at 78.
2 out of 2 hunks FAILED -- saving rejects to file include/trace/events/lock.h.rej
patching file kernel/lockdep.c
Hunk #1 FAILED at 541.
Hunk #2 succeeded at 556 with fuzz 2.
Hunk #3 FAILED at 3093.
patch: **** malformed patch at line 268: ck,
Dave
On Fri, Jun 08, 2012 at 09:31:38AM +0200, Peter Zijlstra wrote:
> On Thu, 2012-06-07 at 08:30 -0700, Linus Torvalds wrote:
> > On Thu, Jun 7, 2012 at 3:26 AM, Peter Zijlstra <[email protected]> wrote:
> > > On Wed, 2012-06-06 at 16:31 -0700, Linus Torvalds wrote:
> > >> PeterZ - the fact that we use mutex_lock_nested with
> > >> I_MUTEX_PARENT/I_MUTEX_CHILD/nothing hopefully doesn't screw us up in
> > >> case we get it wrong, right?
> > >
> > > Sadly, if you get that annotation wrong you can annotate an actual
> > > deadlock away.
What's a (contrived as you want) example where that happens?
> > > This the reason you have to be very careful when
> > > annotating stuff.
Or alternatively--what do I need to check before I call
mutex_lock_nested?
--b.
> >
> > Is that true even if you can see an actual deadlock on the lock *instances*?
>
> Yep.. :/
>
> > The point of the locking classes (to me) was always that you could see
> > the deadlock even if it didn't *actually* happen, just *potentially*
> > happened. If the lock classes then mean that that hides an actual
> > deadlock information, that's a big downside.
>
> Right, so in order to report a deadlock on instances we'd have to do a
> full lock graph walk on every contended lock acquisition. This is quite
> expensive.
>
> Although I can look into doing that from say the NMI watchdog and/or
> some sysrq key.
>
> > And yeah, that really means that we absolutely *have* to get the
> > instance data in the lock debug printout, since we can't see deadlocks
> > any other way.
>
> Well, typically deadlocks aren't that hard to find once you trigger
> them. The current case of course being the exception to that rule..
>
> > Can you check the compile error that Dave reported for
> > your patch, because right now Dave is running without it, afaik, due
> > to your lock-instance patch not really working in practice.
>
> Oh, missed that in the thread.. yeah here goes.
>
>
> Now also compile tested with CONFIG_LOCK_STAT...
>
> ---
> include/linux/lockdep.h | 44 ++++++++-------
> include/trace/events/lock.h | 3 -
> kernel/lockdep.c | 128 ++++++++++++++++++++++++++++----------------
> 3 files changed, 107 insertions(+), 68 deletions(-)
>
> --- a/include/linux/lockdep.h
> +++ b/include/linux/lockdep.h
> @@ -211,6 +211,13 @@ struct lock_chain {
> */
> #define MAX_LOCKDEP_KEYS ((1UL << MAX_LOCKDEP_KEYS_BITS) - 1)
>
> +enum held_lock_state {
> + hls_unheld = 0,
> + hls_acquiring,
> + hls_contended,
> + hls_acquired,
> +};
> +
> struct held_lock {
> /*
> * One-way hash of the dependency chain up to this point. We
> @@ -254,7 +261,10 @@ struct held_lock {
> unsigned int read:2; /* see lock_acquire() comment */
> unsigned int check:2; /* see lock_acquire() comment */
> unsigned int hardirqs_off:1;
> - unsigned int references:11; /* 32 bits */
> + unsigned int state:2; /* see held_lock_state */
> + /* 9 bit hole */
> + unsigned int references:16;
> + /* 16 bit hole */
> };
>
> /*
> @@ -363,6 +373,18 @@ extern void lockdep_trace_alloc(gfp_t ma
>
> #define lockdep_recursing(tsk) ((tsk)->lockdep_recursion)
>
> +extern void lock_contended(struct lockdep_map *lock, unsigned long ip);
> +extern void lock_acquired(struct lockdep_map *lock, unsigned long ip);
> +
> +#define LOCK_CONTENDED(_lock, try, lock) \
> +do { \
> + if (!try(_lock)) { \
> + lock_contended(&(_lock)->dep_map, _RET_IP_); \
> + lock(_lock); \
> + } \
> + lock_acquired(&(_lock)->dep_map, _RET_IP_); \
> +} while (0)
> +
> #else /* !LOCKDEP */
>
> static inline void lockdep_off(void)
> @@ -414,31 +436,13 @@ struct lock_class_key { };
>
> #define lockdep_recursing(tsk) (0)
>
> -#endif /* !LOCKDEP */
> -
> -#ifdef CONFIG_LOCK_STAT
> -
> -extern void lock_contended(struct lockdep_map *lock, unsigned long ip);
> -extern void lock_acquired(struct lockdep_map *lock, unsigned long ip);
> -
> -#define LOCK_CONTENDED(_lock, try, lock) \
> -do { \
> - if (!try(_lock)) { \
> - lock_contended(&(_lock)->dep_map, _RET_IP_); \
> - lock(_lock); \
> - } \
> - lock_acquired(&(_lock)->dep_map, _RET_IP_); \
> -} while (0)
> -
> -#else /* CONFIG_LOCK_STAT */
> -
> #define lock_contended(lockdep_map, ip) do {} while (0)
> #define lock_acquired(lockdep_map, ip) do {} while (0)
>
> #define LOCK_CONTENDED(_lock, try, lock) \
> lock(_lock)
>
> -#endif /* CONFIG_LOCK_STAT */
> +#endif /* !LOCKDEP */
>
> #ifdef CONFIG_LOCKDEP
>
> --- a/include/trace/events/lock.h
> +++ b/include/trace/events/lock.h
> @@ -61,8 +61,6 @@ DEFINE_EVENT(lock, lock_release,
> TP_ARGS(lock, ip)
> );
>
> -#ifdef CONFIG_LOCK_STAT
> -
> DEFINE_EVENT(lock, lock_contended,
>
> TP_PROTO(struct lockdep_map *lock, unsigned long ip),
> @@ -78,7 +76,6 @@ DEFINE_EVENT(lock, lock_acquired,
> );
>
> #endif
> -#endif
>
> #endif /* _TRACE_LOCK_H */
>
> --- a/kernel/lockdep.c
> +++ b/kernel/lockdep.c
> @@ -541,10 +541,23 @@ static void print_lockdep_cache(struct l
> printk("%s", name);
> }
>
> +static void print_lock_state(struct held_lock *hlock)
> +{
> + switch (hlock->state) {
> + /* holding an unheld lock is fail! */
> + case hls_unheld: printk("FAIL: "); break;
> +
> + case hls_acquiring: /* fall-through */
> + case hls_contended: printk("blocked: "); break;
> + case hls_acquired: printk("held: "); break;
> + };
> +}
> +
> static void print_lock(struct held_lock *hlock)
> {
> + print_lock_state(hlock);
> print_lock_name(hlock_class(hlock));
> - printk(", at: ");
> + printk(", instance: %p, at: ", hlock->instance);
> print_ip_sym(hlock->acquire_ip);
> }
>
> @@ -556,7 +569,7 @@ static void lockdep_print_held_locks(str
> printk("no locks held by %s/%d.\n", curr->comm, task_pid_nr(curr));
> return;
> }
> - printk("%d lock%s held by %s/%d:\n",
> + printk("%d lock%s on stack by %s/%d:\n",
> depth, depth > 1 ? "s" : "", curr->comm, task_pid_nr(curr));
>
> for (i = 0; i < depth; i++) {
> @@ -3093,6 +3106,7 @@ static int __lock_acquire(struct lockdep
> hlock->check = check;
> hlock->hardirqs_off = !!hardirqs_off;
> hlock->references = references;
> + hlock->state = hls_acquiring;
> #ifdef CONFIG_LOCK_STAT
> hlock->waittime_stamp = 0;
> hlock->holdtime_stamp = lockstat_clock();
> @@ -3607,7 +3621,6 @@ void lockdep_clear_current_reclaim_state
> current->lockdep_reclaim_gfp = 0;
> }
>
> -#ifdef CONFIG_LOCK_STAT
> static int
> print_lock_contention_bug(struct task_struct *curr, struct lockdep_map *lock,
> unsigned long ip)
> @@ -3637,14 +3650,73 @@ print_lock_contention_bug(struct task_st
> return 0;
> }
>
> +#ifdef CONFIG_LOCK_STAT
> +
> +static void lockstat_contended(struct held_lock *hlock, unsigned long ip)
> +{
> + struct lockdep_map *lock = hlock->instance;
> + int contention_point, contending_point;
> + struct lock_class_stats *stats;
> +
> + hlock->waittime_stamp = lockstat_clock();
> +
> + contention_point = lock_point(hlock_class(hlock)->contention_point, ip);
> + contending_point = lock_point(hlock_class(hlock)->contending_point,
> + lock->ip);
> +
> + stats = get_lock_stats(hlock_class(hlock));
> + if (contention_point < LOCKSTAT_POINTS)
> + stats->contention_point[contention_point]++;
> + if (contending_point < LOCKSTAT_POINTS)
> + stats->contending_point[contending_point]++;
> + if (lock->cpu != smp_processor_id())
> + stats->bounces[bounce_contended + !!hlock->read]++;
> + put_lock_stats(stats);
> +}
> +
> +static void lockstat_acquired(struct held_lock *hlock, unsigned long ip)
> +{
> + struct lockdep_map *lock = hlock->instance;
> + struct lock_class_stats *stats;
> + u64 now, waittime = 0;
> + int cpu;
> +
> + cpu = smp_processor_id();
> + if (hlock->waittime_stamp) {
> + now = lockstat_clock();
> + waittime = now - hlock->waittime_stamp;
> + hlock->holdtime_stamp = now;
> + }
> +
> + stats = get_lock_stats(hlock_class(hlock));
> + if (waittime) {
> + if (hlock->read)
> + lock_time_inc(&stats->read_waittime, waittime);
> + else
> + lock_time_inc(&stats->write_waittime, waittime);
> + }
> + if (lock->cpu != cpu)
> + stats->bounces[bounce_acquired + !!hlock->read]++;
> + put_lock_stats(stats);
> +
> + lock->cpu = cpu;
> + lock->ip = ip;
> +}
> +
> +#else /* CONFIG_LOCK_STAT */
> +
> +static inline void lockstat_contended(struct held_lock *hlock, unsigned long ip) { }
> +static inline void lockstat_acquired(struct held_lock *hlock, unsigned long ip) { }
> +
> +#endif /* CONFIG_LOCK_STAT */
> +
> static void
> __lock_contended(struct lockdep_map *lock, unsigned long ip)
> {
> struct task_struct *curr = current;
> struct held_lock *hlock, *prev_hlock;
> - struct lock_class_stats *stats;
> unsigned int depth;
> - int i, contention_point, contending_point;
> + int i;
>
> depth = curr->lockdep_depth;
> /*
> @@ -3673,20 +3745,8 @@ __lock_contended(struct lockdep_map *loc
> if (hlock->instance != lock)
> return;
>
> - hlock->waittime_stamp = lockstat_clock();
> -
> - contention_point = lock_point(hlock_class(hlock)->contention_point, ip);
> - contending_point = lock_point(hlock_class(hlock)->contending_point,
> - lock->ip);
> -
> - stats = get_lock_stats(hlock_class(hlock));
> - if (contention_point < LOCKSTAT_POINTS)
> - stats->contention_point[contention_point]++;
> - if (contending_point < LOCKSTAT_POINTS)
> - stats->contending_point[contending_point]++;
> - if (lock->cpu != smp_processor_id())
> - stats->bounces[bounce_contended + !!hlock->read]++;
> - put_lock_stats(stats);
> + hlock->state = hls_contended;
> + lockstat_contended(hlock, ip);
> }
>
> static void
> @@ -3694,10 +3754,8 @@ __lock_acquired(struct lockdep_map *lock
> {
> struct task_struct *curr = current;
> struct held_lock *hlock, *prev_hlock;
> - struct lock_class_stats *stats;
> unsigned int depth;
> - u64 now, waittime = 0;
> - int i, cpu;
> + int i;
>
> depth = curr->lockdep_depth;
> /*
> @@ -3726,28 +3784,8 @@ __lock_acquired(struct lockdep_map *lock
> if (hlock->instance != lock)
> return;
>
> - cpu = smp_processor_id();
> - if (hlock->waittime_stamp) {
> - now = lockstat_clock();
> - waittime = now - hlock->waittime_stamp;
> - hlock->holdtime_stamp = now;
> - }
> -
> - trace_lock_acquired(lock, ip);
> -
> - stats = get_lock_stats(hlock_class(hlock));
> - if (waittime) {
> - if (hlock->read)
> - lock_time_inc(&stats->read_waittime, waittime);
> - else
> - lock_time_inc(&stats->write_waittime, waittime);
> - }
> - if (lock->cpu != cpu)
> - stats->bounces[bounce_acquired + !!hlock->read]++;
> - put_lock_stats(stats);
> -
> - lock->cpu = cpu;
> - lock->ip = ip;
> + hlock->state = hls_acquired;
> + lockstat_acquired(hlock, ip);
> }
>
> void lock_contended(struct lockdep_map *lock, unsigned long ip)
> @@ -3783,12 +3821,12 @@ void lock_acquired(struct lockdep_map *l
> raw_local_irq_save(flags);
> check_flags(flags);
> current->lockdep_recursion = 1;
> + trace_lock_acquired(lock, ip);
> __lock_acquired(lock, ip);
> current->lockdep_recursion = 0;
> raw_local_irq_restore(flags);
> }
> EXPORT_SYMBOL_GPL(lock_acquired);
> -#endif
>
> /*
> * Used by the testsuite, sanitize the validator state
>
> --
On Fri, 2012-06-08 at 10:38 -0400, Dave Jones wrote:
> On Fri, Jun 08, 2012 at 09:31:38AM +0200, Peter Zijlstra wrote:
>
> > > Can you check the compile error that Dave reported for
> > > your patch, because right now Dave is running without it, afaik, due
> > > to your lock-instance patch not really working in practice.
> >
> > Oh, missed that in the thread.. yeah here goes.
> >
> > Now also compile tested with CONFIG_LOCK_STAT...
>
> Doesn't apply against Linus' current tree.
>
> $ cat ../lock| patch -p1
> patching file include/linux/lockdep.h
> Hunk #1 FAILED at 211.
> Hunk #2 FAILED at 254.
> Hunk #3 FAILED at 363.
> Hunk #4 FAILED at 414.
> 4 out of 4 hunks FAILED -- saving rejects to file include/linux/lockdep.h.rej
> patching file include/trace/events/lock.h
> Hunk #1 FAILED at 61.
> Hunk #2 FAILED at 78.
> 2 out of 2 hunks FAILED -- saving rejects to file include/trace/events/lock.h.rej
> patching file kernel/lockdep.c
> Hunk #1 FAILED at 541.
> Hunk #2 succeeded at 556 with fuzz 2.
> Hunk #3 FAILED at 3093.
> patch: **** malformed patch at line 268: ck,
weird:
root@laptop:/usr/src/linux-2.6# git checkout -f origin/master
HEAD is now at 48d212a... Revert "mm: correctly synchronize rss-counters at exit/exec"
root@laptop:/usr/src/linux-2.6# quilt push
Applying patch patches/peter_zijlstra-re-processes_hung_after_sys_renameat_and__missing__processes.patch
patching file include/linux/lockdep.h
patching file include/trace/events/lock.h
patching file kernel/lockdep.c
Now at patch patches/peter_zijlstra-re-processes_hung_after_sys_renameat_and__missing__processes.patch
root@laptop:/usr/src/linux-2.6# git describe
v3.5-rc1-130-g48d212a
On Fri, Jun 08, 2012 at 04:51:05PM +0200, Peter Zijlstra wrote:
> > > Now also compile tested with CONFIG_LOCK_STAT...
> >
> > Doesn't apply against Linus' current tree.
> >
> > $ cat ../lock| patch -p1
> > patching file include/linux/lockdep.h
> > Hunk #1 FAILED at 211.
> > Hunk #2 FAILED at 254.
> > Hunk #3 FAILED at 363.
> > Hunk #4 FAILED at 414.
> > 4 out of 4 hunks FAILED -- saving rejects to file include/linux/lockdep.h.rej
> > patching file include/trace/events/lock.h
> > Hunk #1 FAILED at 61.
> > Hunk #2 FAILED at 78.
> > 2 out of 2 hunks FAILED -- saving rejects to file include/trace/events/lock.h.rej
> > patching file kernel/lockdep.c
> > Hunk #1 FAILED at 541.
> > Hunk #2 succeeded at 556 with fuzz 2.
> > Hunk #3 FAILED at 3093.
> > patch: **** malformed patch at line 268: ck,
>
> weird:
gaah, it's quoted-printable damaged. When I save that in mutt, I get a file that contains crap like
- lock->cpu =3D cpu;
- lock->ip =3D ip;
Dave
On Fri, 2012-06-08 at 10:46 -0400, J. Bruce Fields wrote:
> > > > Sadly, if you get that annotation wrong you can annotate an actual
> > > > deadlock away.
>
> What's a (contrived as you want) example where that happens?
spinlock_t lock_array[10];

void init_array(void)
{
        int i;

        for (i = 0; i < ARRAY_SIZE(lock_array); i++)
                spin_lock_init(lock_array + i);
}

void double_lock(int a, int b)
{
        spin_lock(lock_array + a);
        spin_lock_nested(lock_array + b, SINGLE_DEPTH_NESTING);
}
The above places all of the locks in the array in the same class; it then
takes two of them without any defined order, but tells lockdep the nesting is
fine. For example, double_lock(0, 1) on one CPU racing with double_lock(1, 0)
on another can deadlock, yet lockdep stays silent: the annotation puts the
second acquisition in its own subclass, so both call orders generate the same
dependency.
A correct version of the double_lock() function would look like:
void double_lock(int a, int b)
{
        if (b < a)
                swap(a, b);

        spin_lock(lock_array + a);
        spin_lock_nested(lock_array + b, SINGLE_DEPTH_NESTING);
}
This orders the locks in array order.
 > > > > This is the reason you have to be very careful when
> > > > annotating stuff.
>
> Or alternatively--what do I need to check before I call
> mutex_lock_nested?
That the lock order you are telling lockdep about is indeed correct.
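To make that check concrete, here is a minimal sketch (illustrative only, not
code from this thread, assuming <linux/mutex.h> and the kernel's swap() macro)
of the same discipline applied to mutexes: establish a real, global ordering
rule first, and only then tell lockdep that the nesting is intentional.

static void double_mutex_lock(struct mutex *a, struct mutex *b)
{
        /*
         * Establish a genuine global order (by address) so that the
         * SINGLE_DEPTH_NESTING annotation below cannot hide a real
         * inversion between two locks of the same class.
         */
        if (b < a)
                swap(a, b);

        mutex_lock(a);
        mutex_lock_nested(b, SINGLE_DEPTH_NESTING);
}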
On Fri, 2012-06-08 at 11:01 -0400, Dave Jones wrote:
> gaah, it's quoted-printable damaged. When I save that in mutt, I get a
> file that contains crap like
ok, evolution is no longer able to send plain text.. git can deal with
quoted printable, and I have an awk script to decipher it. Still bloody
annoying though.
Attached now :/
On Fri, Jun 08, 2012 at 05:11:23PM +0200, Peter Zijlstra wrote:
> On Fri, 2012-06-08 at 11:01 -0400, Dave Jones wrote:
> > gaah, it's quoted-printable damaged. When I save that in mutt, I get a
> > file that contains crap like
>
> ok, evolution is no longer able to send plain text..
evolution is a parody of a mail client. Not even a funny one at that.
> git can deal with
> quoted printable, and I have an awk script to decipher it. Still bloody
> annoying though.
>
> Attached now :/
that works, thanks.
Dave
On Thu, Jun 07, 2012 at 08:36:07PM +0100, Al Viro wrote:
> * d_splice_alias() - broken. Called without any locking on the
> old directory, nevermind ->s_vfs_rename_mutex.
I was assuming that the callers were in lookup, held i_mutex on a
parent, and that in the case of a directory, existence of an alias with
a different parent could only result from a filesystem bug.
> I really believe that this pair of commits needs to be reverted. The
> earlier code used to guarantee that alias would be detached.
In the case that prompted that first commit, the directory in question
had an alias that was detached (which I'm taking to mean IS_ROOT(dentry)
was true?), but not flagged DISCONNECTED. The particular case was only
reproducible on an older kernel, and I couldn't find a similar
reproducer on recent upstream, but I also couldn't convince myself it
was impossible.
So, maybe the correct thing is to revert that change. Or maybe it
should be picking an IS_ROOT dentry instead of a DISCONNECTED one?
There's some previous discussion at
http://marc.info/?i=<[email protected]>
(in particular a long post from Neil:
http://marc.info/?i=<[email protected]>
that I should review.)
--b.
On Fri, Jun 8, 2012 at 9:22 AM, J. Bruce Fields <[email protected]> wrote:
> On Thu, Jun 07, 2012 at 08:36:07PM +0100, Al Viro wrote:
>
 >> I really believe that this pair of commits needs to be reverted. The
>> earlier code used to guarantee that alias would be detached.
>
> In the case that prompted that first commit, the directory in question
> had an alias that was detached (which I'm taking to mean IS_ROOT(dentry)
 > was true?), but not flagged DISCONNECTED. The particular case was only
 > reproducible on an older kernel, and I couldn't find a similar
> reproducer on recent upstream, but I also couldn't convince myself it
> was impossible.
>
 > So, maybe the correct thing is to revert that change. Or maybe it
> should be picking an IS_ROOT dentry instead of a DISCONNECTED one?
I've reverted the changes for now; it looks like the discussion about
them is still ongoing, and I think I'll feel happier if we just go
back to the old status quo for the nonce.
Linus
On Fri, Jun 08, 2012 at 12:54:00AM -0700, Eric W. Biederman wrote:
> > Please, explain. link/unlink pair in which sense?
>
 > In the sense that, if we don't use d_move, a rename will look to
 > userspace like a pair of sys_link and sys_unlink operations.
>
 > If I happen to have a file open with the old name and the dentry passes
 > through d_drop, /proc/self/fd/N will show the filename as
 > "...(deleted)".
Um... reality check, please - you were talking about symlinks until now. Sure,
these days we *can* open them (with O_PATH), but... And unless I'm misreading
something, none of those sysfs_rename() callers are acting on regular files.
In any case, symlinks or no symlinks, what do you expect to happen if
something in userland does lookup at the old location before anyone gets
around to lookup at the new one? It's not as if that had been under
kernel control...
 > In the small, corner-case, user-visible details, the current state of vfs
 > support for distributed filesystems looks nuts to me, especially where
 > we can't apply an appropriate d_move onto a renamed dentry.
Reread your own code, please. You are *not* guaranteed that lookup triggering
that d_move() of yours will come before ->d_revalidate(). In other words,
the current scheme is not any different in that respect. And nothing in
userland really cares - scenario is too convoluted.
 > Speaking of which, I just found a small unhandled case in __d_unalias. We need
> to deny renaming of mount points.
Umm... Agreed, fix pushed into the same branch.
On Fri, Jun 08, 2012 at 10:44:09AM -0700, Linus Torvalds wrote:
> On Fri, Jun 8, 2012 at 9:22 AM, J. Bruce Fields <[email protected]> wrote:
> > On Thu, Jun 07, 2012 at 08:36:07PM +0100, Al Viro wrote:
> >
> >> I really believe that this pair of commits needs to be reverted. The
> >> earlier code used to guarantee that alias would be detached.
> >
> > In the case that prompted that first commit, the directory in question
> > had an alias that was detached (which I'm taking to mean IS_ROOT(dentry)
> > was true?), but not flagged DISCONNECTED. The particular case was only
 > > reproducible on an older kernel, and I couldn't find a similar
> > reproducer on recent upstream, but I also couldn't convince myself it
> > was impossible.
> >
> > So, maybe the correct thing is to revert that change. Or maybe it
> > should be picking an IS_ROOT dentry instead of a DISCONNECTED one?
>
 > I've reverted the changes for now; it looks like the discussion about
 > them is still ongoing, and I think I'll feel happier if we just go
 > back to the old status quo for the nonce.
Fair enough; I'll keep investigating.
--b.
On Fri, Jun 08, 2012 at 05:08:34PM +0200, Peter Zijlstra wrote:
> On Fri, 2012-06-08 at 10:46 -0400, J. Bruce Fields wrote:
> > > > > Sadly, if you get that annotation wrong you can annotate an actual
> > > > > deadlock away.
> >
> > What's a (contrived as you want) example where that happens?
>
> spinlock_t lock_array[10];
>
> void init_array(void)
> {
> int i;
>
> for (i = 0; i < ARRAY_SIZE(lock_array); i++)
 > spin_lock_init(lock_array + i);
> }
>
> void double_lock(int a, int b)
> {
> spin_lock(lock_array + a);
> spin_lock_nested(lock_array + b, SINGLE_DEPTH_NESTING);
> }
>
> The above places all locks in the array in the same class, it then does
> a double lock without order, but tells lockdep the nesting is ok.
>
> A correct version of the double_lock() function would look like:
>
> void double_lock(int a, int b)
> {
> if (b < a)
> swap(a, b);
>
> spin_lock(lock_array + a);
> spin_lock_nested(lock_array + b, SINGLE_DEPTH_NESTING);
> }
>
> This orders the locks in array order.
Got it, thanks!
--b.
>
 > > > > > This is the reason you have to be very careful when
> > > > > annotating stuff.
> >
> > Or alternatively--what do I need to check before I call
> > mutex_lock_nested?
>
 > That the lock order you are telling lockdep about is indeed correct.
>
>
Linus Torvalds <[email protected]> writes:
> On Thu, Jun 7, 2012 at 12:07 AM, Miklos Szeredi <[email protected]> wrote:
>>
 >> Yeah, see the comment in the patch below for how it's supposed to work. I
>> *think* it's correct.
>
> Ok, yes, that makes me happier.
>
> What would make me happier still is to get rid of the "save_parent"
> thing entirely, though.
>
> And I think you should be able to.
>
> Why?
>
> You already have the rule that:
> - save_parent.mnt is always same as "path->mnt" (or NULL, if not saved)
> - save_parent.dentry is the same as "dir" when you use it (you have a
> BUG_ON() for it not being the same)
> - you currently use "save_parent.dentry != NULL" as a flag to say "do
> we have the valid state"
>
> So as far as I can tell, you should get rid of all the refcount games
> and the "struct path save_parent", and just replace the
> "save_parent.dentry != NULL" thing with a flag of whether you have a
> valid state.
>
> That would get rid of the whole
>
> if ((nd->flags & LOOKUP_RCU) || nd->path.mnt != path->mnt) {
> path_to_nameidata(path, nd);
> } else {
> save_parent.dentry = nd->path.dentry;
> save_parent.mnt = mntget(path->mnt);
> nd->path.dentry = path->dentry;
>
> }
>
> thing, and we could just have the old simple unconditional
>
> path_to_nameidata(path, nd);
>
> back.
>
> And then you'd have
>
> if (filp == ERR_PTR(-EOPENSTALE) && save_parent_flag && !retried) {
> dput(nd->path.dentry);
> nd->path.dentry = dget(dir);
But 'dir' may no longer be valid here since we dput it in
path_to_nameidata() earlier.
So, unfortunately, we do need to play those refcounting games.
The mntget() could be optimized away in theory, but it's tricky because
complete_walk() might do a path_put(nd->path) on failure and then we'd
be left with a dentry ref without a matching vfsmount ref.
Thanks,
Miklos
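To illustrate the reference counting Miklos is describing, here is a minimal
sketch (the helper names are made up for this example; this is not the actual
do_last() code): a saved parent only stays safe to use later if it pins its
own dentry and vfsmount references, which path_put() then releases.

static void stash_parent(struct path *save_parent, struct dentry *dir,
                         struct vfsmount *mnt)
{
        save_parent->dentry = dget(dir);  /* our own pin on the parent dentry */
        save_parent->mnt = mntget(mnt);   /* and on its vfsmount */
}

static void drop_parent(struct path *save_parent)
{
        path_put(save_parent);            /* drops both references */
        save_parent->dentry = NULL;
        save_parent->mnt = NULL;
}

Without the extra pin, a bare 'dir' pointer can go stale as soon as
path_to_nameidata() drops the reference that nd->path used to hold.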