2005-01-14 01:04:35

by John Hawkes

Subject: 2.6.10-mm3 scaling problem with inotify

I believe I've encountered a scaling problem with the inotify code in
2.6.10-mm3.

vfs_read() and vfs_write() (for example) do:

	dnotify_parent(dentry, DN_ACCESS);
	inotify_dentry_parent_queue_event(dentry,
					  IN_ACCESS, 0, dentry->d_name.name);
	inotify_inode_queue_event(inode, IN_ACCESS, 0, NULL);

for the optional "notify" functionality.

dnotify_parent() knows how to exit quickly, looking at a global flag:

	if (!dir_notify_enable)
		return;

and inotify_inode_queue_event() also knows a quick exit:

	if (!inode->inotify_data)
		return;
However, inotify_dentry_parent_queue_event() first has to get the parent
dentry:

	parent = dget_parent(dentry);
	inotify_inode_queue_event(parent->d_inode, mask, cookie, filename);
	dput(parent);

and, sadly, dget_parent(dentry) grabs a spinlock (dentry->d_lock) and
dirties a cacheline (dentry->d_count). I have an AIM7-like workload that
does numerous concurrent reads and suffers a 40% drop in performance
because of this.

Is it possible for a parent's inode->inotify_data to be enabled when none
of its children's inotify_data are enabled? If not, that would make it
easy -- just look at the current inode's inotify_data before walking back
to the parent.

John Hawkes


2005-01-14 02:32:53

by Zou, Nanhai

Subject: RE: 2.6.10-mm3 scaling problem with inotify

> -----Original Message-----
> From: Zou, Nanhai
> Sent: Friday, January 14, 2005 10:28 AM
> To: 'John McCutchan'
> Subject: RE: 2.6.10-mm3 scaling problem with inotify
>
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of John McCutchan
> > Sent: Friday, January 14, 2005 9:31 AM
> > To: Robert Love
> > Cc: John Hawkes; [email protected]
> > Subject: Re: 2.6.10-mm3 scaling problem with inotify
> >
> > On Thu, 2005-01-13 at 19:49 -0500, Robert Love wrote:
> > > On Thu, 2005-01-13 at 15:56 -0800, John Hawkes wrote:
> > > [snip]
> > >
> > > I am open to other ideas, too, but I don't see any nice shortcuts like
> > > what we can do in inotify_inode_queue_event().
> > >
> > > (Other) John? Any ideas?
> >
> > No, you covered things well. This code was really just a straight copy
> > of the dnotify code. Rob cleaned it up at some point, giving us what we
> > have today. The only fix I can think of is the one suggested by Rob --
> > copying the dnotify code again.


There is still a small difference between your implementation of
inotify_dentry_parent_queue_event and dnotify_parent.

In dnotify_parent, if the parent is not watching the event, the code
does not fall through to the dget and dput path.

In inotify_dentry_parent_queue_event, however, the kernel does the dget
and dput even if (inode->inotify_data == NULL).

dget and dput introduce a lot of atomic operations, and, most
importantly, dput grabs the global dcache_lock. I think that is the
reason John Hawkes saw such a large performance drop.

Simply following dnotify_parent, and only calling dget and dput when
inode->inotify_data != NULL, will solve this problem, I think.

2005-01-14 02:35:55

by John McCutchan

Subject: RE: 2.6.10-mm3 scaling problem with inotify

On Fri, 2005-01-14 at 10:31 +0800, Zou, Nanhai wrote:
>
> There is still a small difference between your implementation of
> inotify_dentry_parent_queue_event and dnotify_parent.
>
> In dnotify_parent, if the parent is not watching the event, the code
> does not fall through to the dget and dput path.
>
> In inotify_dentry_parent_queue_event, however, the kernel does the dget
> and dput even if (inode->inotify_data == NULL).
>
> dget and dput introduce a lot of atomic operations, and, most
> importantly, dput grabs the global dcache_lock. I think that is the
> reason John Hawkes saw such a large performance drop.
>
> Simply following dnotify_parent, and only calling dget and dput when
> inode->inotify_data != NULL, will solve this problem, I think.
>

Yeah, the old code was written before inode->inotify_data existed.
Robert caught this before he sent his patch. His patch should fix this
regression.

--
John McCutchan <[email protected]>

2005-01-14 02:40:18

by Robert Love

Subject: RE: 2.6.10-mm3 scaling problem with inotify

On Fri, 2005-01-14 at 10:31 +0800, Zou, Nanhai wrote:

(your email wraps badly)

> There is still a small difference between your implementation of
> inotify_dentry_parent_queue_event and dnotify_parent.
>
> In dnotify_parent, if the parent is not watching the event, the code
> does not fall through to the dget and dput path.
>
> In inotify_dentry_parent_queue_event, however, the kernel does the dget
> and dput even if (inode->inotify_data == NULL).
>
> dget and dput introduce a lot of atomic operations, and, most
> importantly, dput grabs the global dcache_lock. I think that is the
> reason John Hawkes saw such a large performance drop.
>
> Simply following dnotify_parent, and only calling dget and dput when
> inode->inotify_data != NULL, will solve this problem, I think.

This is what I meant in the original email. This is the dnotify
difference I was talking about.

The patch I submitted should put the two in parity.

Robert Love


2005-01-14 18:06:04

by John Hawkes

Subject: Re: 2.6.10-mm3 scaling problem with inotify

From: "Robert Love" <[email protected]>
...
> John Hawkes: Attached patch implements the dget() optimization in the
> same manner as dnotify. Compile-tested but not booted.
>
> Let me know!
...
> inotify: don't do dget() unless we have to
>
> drivers/char/inotify.c | 14 +++++++++++---
> 1 files changed, 11 insertions(+), 3 deletions(-)

The patch shows a substantial 4x improvement for my specific workload (@64p),
although I can still envision workloads that will stimulate high spinlock
contention from spin_lock(&dentry->d_lock). First things first -- let's get
this fix into the -mm tree. Thanks!

John Hawkes

2005-01-14 18:13:56

by Robert Love

Subject: Re: 2.6.10-mm3 scaling problem with inotify

On Fri, 2005-01-14 at 10:05 -0800, John Hawkes wrote:

> The patch shows a substantial 4x improvement for my specific workload (@64p),
> although I can still envision workloads that will stimulate high spinlock
> contention from spin_lock(&dentry->d_lock). First things first -- let's get
> this fix into the -mm tree. Thanks!

Glad it worked!

Does the spin_lock() show up in dnotify's code path? It is getting
called, unconditionally, there, too.

Robert Love


2005-01-14 18:40:42

by John Hawkes

Subject: Re: 2.6.10-mm3 scaling problem with inotify

From: "Robert Love" <[email protected]>
> On Fri, 2005-01-14 at 10:05 -0800, John Hawkes wrote:
>
> > The patch shows a substantial 4x improvement for my specific workload (@64p),
> > although I can still envision workloads that will stimulate high spinlock
> > contention from spin_lock(&dentry->d_lock). First things first -- let's get
> > this fix into the -mm tree. Thanks!
>
> Glad it worked!
>
> Does the spin_lock() show up in dnotify's code path? It is getting
> called, unconditionally, there, too.

No, it doesn't show up (with this workload).

Lock contention and cacheline contention can be complex. Contention in
one area can alter the timing of other areas and reduce contention
there. Anyway, with your inotify_dentry_parent_queue_event() patch, the
PC sampling for my workload (with the lock waits rolled up to their
callers), showing everything above 0.1%, is now:

12:38pm Samples: 1887402, Mode: pc, load average: 1832.89, 1722.88, 1264.14

COUNT ( % ) ADDR NAME
843466 (44.7) a000000100147ba0 vma_adjust
521940 (27.7) a0000001001131a0 USER
278285 (14.7) a000000100150d20 anon_vma_unlink
26952 ( 1.4) a000000100150ba0 anon_vma_link
26099 ( 1.4) a00000010013c1e0 clear_page_range
16638 ( 0.9) a0000001007473e0 established_get_next
12378 ( 0.7) a00000010044b200 _mcount
12150 ( 0.6) a000000100039d20 ia64_brk
10237 ( 0.5) a00000010013dba0 zap_pte_range
9886 ( 0.5) a00000010044aba0 copy_page
9311 ( 0.5) a00000010013cfe0 copy_pte_range
8239 ( 0.4) a00000010014bcc0 do_munmap
5014 ( 0.3) a0000001001c1340 update_atime
4743 ( 0.3) a00000010044b300 __copy_user
4462 ( 0.2) a0000001004dc360 inotify_dentry_parent_queue_event
4439 ( 0.2) a000000100115020 find_get_page
3602 ( 0.2) a0000001000b0e40 copy_mm
3581 ( 0.2) a0000001000d3d80 del_timer_sync
2951 ( 0.2) a0000001001720e0 vfs_read

John Hawkes