2007-01-28 12:10:44

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 0/7] breaking the global file_list_lock

This patch-set breaks up the global file_list_lock which was found to be a
severe contention point under basically any filesystem intensive workload.

It has been part of the -rt kernel for some time now and is deemed solid and
useful enough to post in its own right. This contention should also occur on
large SMP machines.

Feedback would be appreciated, especially with regard to the extra workqueue
that was added to flush the per-CPU lists. Is there an alternative approach?




2007-01-28 12:17:49

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 8/7] fs: free_write_pipe() fix


From: Ingo Molnar <[email protected]>

file_kill() has to look at the file's inode (for the barrier logic),
hence make sure we free the inode before the file.

Signed-off-by: Ingo Molnar <[email protected]>
---
fs/pipe.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

Index: linux/fs/pipe.c
===================================================================
--- linux.orig/fs/pipe.c
+++ linux/fs/pipe.c
@@ -945,12 +945,17 @@ struct file *create_write_pipe(void)
return ERR_PTR(err);
}

-void free_write_pipe(struct file *f)
+void free_write_pipe(struct file *file)
{
- free_pipe_info(f->f_dentry->d_inode);
- dput(f->f_path.dentry);
- mntput(f->f_path.mnt);
- put_filp(f);
+ struct dentry *dentry = file->f_path.dentry;
+ struct vfsmount *mnt = file->f_path.mnt;
+
+ free_pipe_info(file->f_dentry->d_inode);
+ file->f_path.dentry = NULL;
+ file->f_path.mnt = NULL;
+ put_filp(file);
+ dput(dentry);
+ mntput(mnt);
}

struct file *create_read_pipe(struct file *wrf)


2007-01-28 14:43:28

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 0/7] breaking the global file_list_lock

On Sun, Jan 28, 2007 at 12:51:18PM +0100, Peter Zijlstra wrote:
> This patch-set breaks up the global file_list_lock which was found to be a
> severe contention point under basically any filesystem intensive workload.

Benchmarks, please. Where exactly do you see contention for this?


filesystem intensive workload apparently means namespace operation heavy
workload, right? The biggest bottleneck I've seen with those is dcache lock.

Even if this is becoming a real problem there must be simpler ways to fix
this than introducing various new locking primitives and all kinds of
complexity.

2007-01-28 15:11:29

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 0/7] breaking the global file_list_lock

On Sun, Jan 28, 2007 at 02:43:25PM +0000, Christoph Hellwig wrote:
> On Sun, Jan 28, 2007 at 12:51:18PM +0100, Peter Zijlstra wrote:
> > This patch-set breaks up the global file_list_lock which was found to be a
> > severe contention point under basically any filesystem intensive workload.
>
> Benchmarks, please. Where exactly do you see contention for this?
>
>
> filesystem intensive workload apparently means namespace operation heavy
> workload, right? The biggest bottleneck I've seen with those is dcache lock.
>
> Even if this is becoming a real problem there must be simpler ways to fix
> this than introducing various new locking primitives and all kinds of
> complexity.

One good way to fix scalability without all this braindamage is
to get rid of sb->s_files. Current uses are:

- fs/dquot.c:add_dquot_ref()

This performs its actual operation on inodes. We should
be able to check inode->i_writecount to see which inodes
need quota initialization.

- fs/file_table.c:fs_may_remount_ro()

This one is gone in Dave Hansen's per-mountpoint r/o patchkit

- fs/proc/generic.c:proc_kill_inodes()

This can be done with a list inside procfs.

- fs/super.c:mark_files_ro()

This one is only used for do_emergency_remount(), which is
an utter hack. It might be much better to just deny any
kind of write access through a superblock flag here.

- fs/selinuxfs.c:sel_remove_bools()

Utter madness. I have no idea how this ever got merged.
Maybe the selinux folks can explain what crack they were
on when writing this. The problem would go away with
a generic revoke infrastructure.

Once sb->s_files is gone we can also kill off fu_list entirely, replace
it by a list head in the tty code, and make the lock for it per-tty.

2007-01-28 15:25:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 0/7] breaking the global file_list_lock


* Christoph Hellwig <[email protected]> wrote:

> On Sun, Jan 28, 2007 at 12:51:18PM +0100, Peter Zijlstra wrote:
> > This patch-set breaks up the global file_list_lock which was found
> > to be a severe contention point under basically any filesystem
> > intensive workload.
>
> Benchmarks, please. Where exactly do you see contention for this?

it's the most contended spinlock we have during a parallel kernel
compile on an 8-way system. But it's pretty common-sense as well,
without doing any measurements, it's basically the only global lock left
in just about every VFS workload that doesn't involve massive amounts of
dentries created/removed (which is still dominated by the dcache_lock).

> filesystem intensive workload apparently means namespace operation
> heavy workload, right? The biggest bottleneck I've seen with those is
> dcache lock.

the dcache lock is not a problem during kernel compiles. (its
rcu-ification works nicely in that workload)

Ingo

2007-01-28 15:29:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/7] breaking the global file_list_lock

On Sun, 2007-01-28 at 15:11 +0000, Christoph Hellwig wrote:

> > Even if this is becoming a real problem there must be simpler ways to fix
> > this than introducing various new locking primitives and all kinds of
> > complexity.
>
> One good way to fix scalability without all this braindamage is
> to get rid of sb->s_files. Current uses are:
>
> - fs/dquot.c:add_dquot_ref()
>
> This performs its actual operation on inodes. We should
> be able to check inode->i_writecount to see which inodes
> need quota initialization.
>
> - fs/file_table.c:fs_may_remount_ro()
>
> This one is gone in Dave Hansen's per-mountpoint r/o patchkit
>
> - fs/proc/generic.c:proc_kill_inodes()
>
> This can be done with a list inside procfs.
>
> - fs/super.c:mark_files_ro()
>
> This one is only used for do_emergency_remount(), which is
> an utter hack. It might be much better to just deny any
> kind of write access through a superblock flag here.
>
> - fs/selinuxfs.c:sel_remove_bools()
>
> Utter madness. I have no idea how this ever got merged.
> Maybe the selinux folks can explain what crack they were
> on when writing this. The problem would go away with
> a generic revoke infrastructure.
>
> Once sb->s_files is gone we can also kill off fu_list entirely, replace
> it by a list head in the tty code, and make the lock for it per-tty.

I shall pursue this direction. Thanks for the information.

2007-01-28 15:33:29

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 0/7] breaking the global file_list_lock

On Sun, Jan 28, 2007 at 04:29:04PM +0100, Peter Zijlstra wrote:
> I shall pursue this direction. Thanks for the information.

Btw, if you need this short-term there might be an even simpler
solution:

- all files go on sb->s_list and stay there until the file is freed
- ttys grow their private list instead of being mixed up with the
per-sb list

This already allows nice splitting of the locks into per-sb and per-tty
(although all tty_files accesses might be protected by tty_mutex anyway).
Especially for the superblock an RCU conversion might be quite easy
if necessary, but all traversals are in what I'd consider more or less
slowpaths anyway.


2007-01-28 16:52:49

by Martin Bligh

[permalink] [raw]
Subject: Re: [PATCH 0/7] breaking the global file_list_lock

Ingo Molnar wrote:
> * Christoph Hellwig <[email protected]> wrote:
>
>> On Sun, Jan 28, 2007 at 12:51:18PM +0100, Peter Zijlstra wrote:
>>> This patch-set breaks up the global file_list_lock which was found
>>> to be a severe contention point under basically any filesystem
>>> intensive workload.
>> Benchmarks, please. Where exactly do you see contention for this?
>
> it's the most contended spinlock we have during a parallel kernel
> compile on an 8-way system. But it's pretty common-sense as well,
> without doing any measurements, it's basically the only global lock left
> in just about every VFS workload that doesn't involve massive amounts of
> dentries created/removed (which is still dominated by the dcache_lock).
>
>> filesystem intensive workload apparently means namespace operation
>> heavy workload, right? The biggest bottleneck I've seen with those is
>> dcache lock.
>
> the dcache lock is not a problem during kernel compiles. (its
> rcu-ification works nicely in that workload)

Mmm. not wholly convinced that's true. Whilst i don't have lockmeter
stats to hand, the heavy time in __d_lookup seems to indicate we may
still have a problem to me. I guess we could move the spinlocks out
of line again to test this fairly easily (or get lockmeter upstream).

114076 total 0.0545
57766 default_idle 916.9206
11869 prep_new_page 49.4542
3830 find_trylock_page 67.1930
2637 zap_pte_range 3.9125
2486 strnlen_user 54.0435
2018 do_page_fault 1.1941
1940 do_wp_page 1.6973
1869 __d_lookup 7.7231
1331 page_remove_rmap 5.2196
1287 do_no_page 1.6108
1272 buffered_rmqueue 4.6423
1160 __copy_to_user_ll 14.6835
1027 _atomic_dec_and_lock 11.1630
655 release_pages 1.9670
644 do_path_lookup 1.6304
630 schedule 0.4046
617 kunmap_atomic 7.7125
573 __handle_mm_fault 0.7365
548 free_hot_page 78.2857
500 __copy_user_intel 3.3784
483 copy_pte_range 0.5941
482 page_address 2.9571
478 file_move 9.1923
441 do_anonymous_page 0.7424
429 filemap_nopage 0.4450
401 anon_vma_unlink 4.8902

2007-01-28 17:04:19

by Christoph Hellwig

[permalink] [raw]
Subject: lockmeter

On Sun, Jan 28, 2007 at 08:52:25AM -0800, Martin J. Bligh wrote:
> Mmm. not wholly convinced that's true. Whilst i don't have lockmeter
> stats to hand, the heavy time in __d_lookup seems to indicate we may
> still have a problem to me. I guess we could move the spinlocks out
> of line again to test this fairly easily (or get lockmeter upstream).

We definitely should get lockmeter in. Does anyone volunteer for doing
the cleanup and merge?

2007-01-28 17:38:36

by Martin Bligh

[permalink] [raw]
Subject: Re: lockmeter

Christoph Hellwig wrote:
> On Sun, Jan 28, 2007 at 08:52:25AM -0800, Martin J. Bligh wrote:
>> Mmm. not wholly convinced that's true. Whilst i don't have lockmeter
>> stats to hand, the heavy time in __d_lookup seems to indicate we may
>> still have a problem to me. I guess we could move the spinlocks out
>> of line again to test this fairly easily (or get lockmeter upstream).
>
> We definitely should get lockmeter in. Does anyone volunteer for doing
> the cleanup and merge?

On second thoughts .. I don't think it'd actually work for this since
the locks aren't global. Not that it shouldn't be done anyway, but ...

ISTR we still thought dcache scalability was a significant problem last
time anyone looked at it seriously - just never got fixed. Dipankar?

2007-01-28 18:03:06

by Bill Huey

[permalink] [raw]
Subject: Re: lockmeter

On Sun, Jan 28, 2007 at 09:38:16AM -0800, Martin J. Bligh wrote:
> Christoph Hellwig wrote:
> >On Sun, Jan 28, 2007 at 08:52:25AM -0800, Martin J. Bligh wrote:
> >>Mmm. not wholly convinced that's true. Whilst i don't have lockmeter
> >>stats to hand, the heavy time in __d_lookup seems to indicate we may
> >>still have a problem to me. I guess we could move the spinlocks out
> >>of line again to test this fairly easily (or get lockmeter upstream).
> >
> >We definitely should get lockmeter in. Does anyone volunteer for doing
> >the cleanup and merge?
>
> On second thoughts .. I don't think it'd actually work for this since
> the locks aren't global. Not that it shouldn't be done anyway, but ...
>
> ISTR we still thought dcache scalability was a significant problem last
> time anyone looked at it seriously - just never got fixed. Dipankar?

My lock stat stuff shows dcache to be a problem under -rt as well. It
is keyed off the same mechanism as lockdep. It's pretty heavily hit
under even normal loads relative to other kinds of lock overhead, even
for casual file operations on a 2x system. I can't imagine how lousy
it's going to be under real load on an 8x or higher machine.

However, this patch is -rt only and spinlock times are meaningless under
it because of preemptibility.

bill

2007-01-28 18:43:01

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 0/7] breaking the global file_list_lock


* Christoph Hellwig <[email protected]> wrote:

> On Sun, Jan 28, 2007 at 12:51:18PM +0100, Peter Zijlstra wrote:
> > This patch-set breaks up the global file_list_lock which was found to be a
> > severe contention point under basically any filesystem intensive workload.
>
> Benchmarks, please. Where exactly do you see contention for this?

the following very simple workload:

http://redhat.com/~mingo/file-ops-test/file-ops-test.c

starts one process per CPU and open()s/close()s a file all over again,
simulating an open/close-intense workload. This pattern is quite typical
of several key Linux applications.

Using Peter's s_files patchset the following scalability improvement can
be measured (lower numbers are better):

----------------------------------------------------------------------
v2.6.20-rc6 | v2.6.20-rc6+Peter's s_files queue
----------------------------------------------------------------------
dual-core: 2.11 usecs/op | 1.51 usecs/op ( +39.7% win )
8-socket: 6.30 usecs/op | 2.70 usecs/op ( +133.3% win )

[ i'd expect a 64-way box to improve its open/close scalability
dramatically, by a factor of 10x or 20x. On a 1024-CPU (or
4096-CPU) system the effects are very likely even more dramatic. ]

Why does this patch make such a difference? Not because the critical
section is in any way contended - it isnt, we only do a simple list
operation there. But this lock is touched in every sys_open() and
sys_close() system-call, so it is a high-frequency accessed cacheline.
The cost is there due to the global cacheline ping-pong of files_lock.
Furthermore, in this very important VFS codepath this is the 'last'
global cacheline that got eliminated, hence all the scalability benefits
(of previous improvements) get reaped all at once.

Now could you please tell me why i had to waste 3.5 hours on measuring
and profiling this /again/, while a tiny little bit of goodwill from
your side could have avoided this? I told you that we lock-profiled this
under -rt, and that it's an accurate measurement of such things - as the
numbers above prove it too. Would it have been so hard to say something
like: "Cool Peter! That lock had been in our way of good open()/close()
scalability for such a long time and it's an obviously good idea to
eliminate it. Now here's a couple of suggestions of how to do it even
simpler: [...]." Why did you have to in essence piss on his patchset?
Any rational explanation?

Ingo

2007-01-28 19:29:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: lockmeter


* Bill Huey <[email protected]> wrote:

> > ISTR we still thought dcache scalability was a significant problem
> > last time anyone looked at it seriously - just never got fixed.
> > Dipankar?
>
> My lock stat stuff shows dcache to be a problem under -rt as well.
> [...]

yeah, it shows up frequently, but it's hard to fix.

for a well-cached kernel rebuild on a system with lots of RAM and lots
of CPUs, files_lock is the leading lock. (because the dcache is mostly
populated already) files_lock is even more of a problem for workloads
that do lots of open()/close() [Apache comes to mind].

Ingo

2007-01-28 20:38:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 0/7] breaking the global file_list_lock

On Sun, Jan 28, 2007 at 07:41:16PM +0100, Ingo Molnar wrote:
> starts one process per CPU and open()s/close()s a file all over again,
> simulating an open/close-intense workload. This pattern is quite typical
> of several key Linux applications.
>
> Using Peter's s_files patchset the following scalability improvement can
> be measured (lower numbers are better):
>
> ----------------------------------------------------------------------
> v2.6.20-rc6 | v2.6.20-rc6+Peter's s_files queue
> ----------------------------------------------------------------------
> dual-core: 2.11 usecs/op | 1.51 usecs/op ( +39.7% win )
> 8-socket: 6.30 usecs/op | 2.70 usecs/op ( +133.3% win )

Thanks for having some numbers to start with.

> Now could you please tell me why i had to waste 3.5 hours on measuring
> and profiling this /again/, while a tiny little bit of goodwill from
> your side could have avoided this? I told you that we lock-profiled this
> under -rt, and that it's an accurate measurement of such things - as the
> numbers above prove it too. Would it have been so hard to say something
> like: "Cool Peter! That lock had been in our way of good open()/close()
> scalability for such a long time and it's an obviously good idea to
> eliminate it. Now here's a couple of suggestions of how to do it even
> simpler: [...]." Why did you have to in essence piss on his patchset?
> Any rational explanation?

Can we please stop this stupid pissing contest. I'm totally fine to admit
yours is bigger than mine in public, so let's get back to the facts.


The patchkit we're discussing here introduces a lot of complexity:

- a new type of implicitly locked linked lists
- a new synchronization primitive
- a new locking scheme that utilizes the previous two items, as well
as RCU.

I think we definitely want some numbers (which you finally provided)
to justify this.

Then going on to the implementation I don't like trying to "fix" a problem
with this big hammer approach. I've outlined some alternate ways that
actually simplify both the underlying data structures and locking that
should help towards this problem instead of making the code more complex
and really hard to understand.

2007-01-28 21:07:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 0/7] breaking the global file_list_lock


* Christoph Hellwig <[email protected]> wrote:

> Thanks for having some numbers to start with.

you are welcome.

> > Any rational explanation?
>
> Can we please stop this stupid pissing contest. I'm totally fine to
> admit yours is bigger than mine in public, so let's get back to the
> facts.

sure. Could we begin with you not doing derogatory, belittling comments
like:

<hch> so -rt folks also restart the lock brigade crap :P

and:

* hch hands Peter Zijlstra the mister braindead locking madness award

but to show a little bit of .. respect and goodwill towards the work of
others? Those comments might sound funny to you, but they are offensive
to me.

I'll readily admit that the patches are complex and even ugly, but the
problem is complex and ugly to begin with. /Finally/ someone comes along
and tries to sort out the mess that YOU (amongst others, like me) have
not fixed for years, and you welcome his (hard!) efforts with ridicule?

You also are NAK-ing the same patches on lkml for clearly bogus reasons:
for example your 'this is lockdep internals' comment was off the mark by
far. So this isn't just colorful comments, it is also actual prejudice
that is irrational and harmful.

Linux will take 100 years to break into the mainstream if this attitude
prevails. And i think this is a far bigger problem than pretty much
/any/ technological problem we have at the moment so i will continue to
be upset about it, no matter how uncomfortable it is to finally face
this and no matter how ugly such flames might look externally.

Or if you really cannot talk about hard problems without insulting
others then /just don't do it/. Nobody is forcing you to do this if you
cannot find any fun in it, really.

Ingo

2007-01-28 21:18:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: lockmeter


* Bill Huey <[email protected]> wrote:

> My lock stat stuff shows dcache to be a problem under -rt as well.
> It is keyed off the same mechanism as lockdep. [...]

btw., while my plan is to prototype your lock-stat patch in -rt
initially, it should be doable to extend it to be usable with the
upstream kernel as well.

We can gather lock contention events when there is spinlock debugging
enabled, from lib/spinlock_debug.c. For example __spin_lock_debug() does
this:

static void __spin_lock_debug(spinlock_t *lock)
{
...
for (i = 0; i < loops; i++) {
if (__raw_spin_trylock(&lock->raw_lock))
return;
__delay(1);
}

only where the __delay(1) call is done do we notice contention - and there
you could drive the lock-stat code. Similarly, rwlocks have such natural
points too where you could insert a lock-stat callback without impacting
performance (or the code) all that much. mutexes and rwsems have natural
contention points too (kernel/mutex.c:__mutex_lock_slowpath() and
kernel/rwsem.c:rwsem_down_failed_common()), even with mutex debugging
off.

for -rt the natural point to gather contention events is in
kernel/rtmutex.c, as you are doing it currently.

finally, you can enable lockdep's initialization/etc. wrappers so that
infrastructure between lockdep and lock-stat is shared, but you dont
have to call into the lock-tracking code of lockdep.c if LOCK_STAT is
enabled and PROVE_LOCKING is disabled. That should give you the lockdep
infrastructure for LOCK_STAT, without the lockdep overhead.

all in all, one motivation behind my interest in your patch for -rt is
that i think it's useful for upstream too, and that it can merge with
lockdep to a fair degree.

Ingo

2007-01-29 01:09:27

by Arjan van de Ven

[permalink] [raw]
Subject: Re: lockmeter

On Sun, 2007-01-28 at 17:04 +0000, Christoph Hellwig wrote:
> On Sun, Jan 28, 2007 at 08:52:25AM -0800, Martin J. Bligh wrote:
> > Mmm. not wholly convinced that's true. Whilst i don't have lockmeter
> > stats to hand, the heavy time in __d_lookup seems to indicate we may
> > still have a problem to me. I guess we could move the spinlocks out
> > of line again to test this fairly easily (or get lockmeter upstream).
>
> We definitely should get lockmeter in. Does anyone volunteer for doing
> the cleanup and merge?

specifically; implementing it on top of lockdep should be very lean and
simple...


2007-01-29 01:13:19

by Martin Bligh

[permalink] [raw]
Subject: Re: lockmeter

Arjan van de Ven wrote:
> On Sun, 2007-01-28 at 17:04 +0000, Christoph Hellwig wrote:
>> On Sun, Jan 28, 2007 at 08:52:25AM -0800, Martin J. Bligh wrote:
>>> Mmm. not wholly convinced that's true. Whilst i don't have lockmeter
>>> stats to hand, the heavy time in __d_lookup seems to indicate we may
>>> still have a problem to me. I guess we could move the spinlocks out
>>> of line again to test this fairly easily (or get lockmeter upstream).
>> We definitely should get lockmeter in. Does anyone volunteer for doing
>> the cleanup and merge?
>
> specifically; implementing it on top of lockdep should be very lean and
> simple...

cc: John Hawkes, if we're going to discuss this ;-)

M.

2007-01-29 05:28:14

by Bill Huey

[permalink] [raw]
Subject: Re: lockmeter

On Sun, Jan 28, 2007 at 10:17:05PM +0100, Ingo Molnar wrote:
> btw., while my plan is to prototype your lock-stat patch in -rt
> initially, it should be doable to extend it to be usable with the
> upstream kernel as well.
>
> We can gather lock contention events when there is spinlock debugging
> enabled, from lib/spinlock_debug.c. For example __spin_lock_debug() does
> this:
>
> static void __spin_lock_debug(spinlock_t *lock)
> {
> ...
> for (i = 0; i < loops; i++) {
> if (__raw_spin_trylock(&lock->raw_lock))
> return;
> __delay(1);
> }
>
> only where the __delay(1) call is done do we notice contention - and there
> you could drive the lock-stat code. Similarly, rwlocks have such natural
> points too where you could insert a lock-stat callback without impacting
> performance (or the code) all that much. mutexes and rwsems have natural
> contention points too (kernel/mutex.c:__mutex_lock_slowpath() and
> kernel/rwsem.c:rwsem_down_failed_common()), even with mutex debugging
> off.
>
> for -rt the natural point to gather contention events is in
> kernel/rtmutex.c, as you are doing it currently.
>
> finally, you can enable lockdep's initialization/etc. wrappers so that
> infrastructure between lockdep and lock-stat is shared, but you dont
> have to call into the lock-tracking code of lockdep.c if LOCK_STAT is
> enabled and PROVE_LOCKING is disabled. That should give you the lockdep
> infrastructure for LOCK_STAT, without the lockdep overhead.
>
> all in all, one motivation behind my interest in your patch for -rt is
> that i think it's useful for upstream too, and that it can merge with
> lockdep to a fair degree.

Fantastic. I'm going to try and finish up your suggested changes tonight
and get it to work with CONFIG_LOCK_STAT off. It's been challenging to
find time to do Linux these days, so I don't mind handing it off to you
after this point so that you can tweak it to your heart's content.

Yeah, one of the major motivations behind it was to either validate or
invalidate the usefulness of Solaris-style adaptive locks. Because of
this patch, we have an idea of what's going on with regard to adaptive
spinning and such. The sensible conclusion is that it's not the
sophistication of the lock, but parallelizing the code to prevent the
contention in the first place, that's the key philosophical drive.

I forward-merged it into rc6-rt2 and you can expect a drop tonight of
what I have, regardless of whether it's perfect or not.

bill

2007-01-29 10:27:14

by Bill Huey

[permalink] [raw]
Subject: Re: lockmeter

On Sun, Jan 28, 2007 at 09:27:45PM -0800, Bill Huey wrote:
> On Sun, Jan 28, 2007 at 10:17:05PM +0100, Ingo Molnar wrote:
> > btw., while my plan is to prototype your lock-stat patch in -rt
> > initially, it should be doable to extend it to be usable with the
> > upstream kernel as well.
...
> Fantastic. I'm going to try and finish up your suggested changes tonight
> and get it to work with CONFIG_LOCK_STAT off. It's been challenging to
> find time to do Linux these days, so I don't mind handing it off to you
> after this point so that you can tweak it to your heart's content.

Ingo,

Got it.
http://mmlinux.sourceforge.net/public/patch-2.6.20-rc6-rt2.1.lock_stat.patch

This compiles and runs with the CONFIG_LOCK_STAT option turned off now,
and I believe it addresses all of the concerns you mentioned. I could
have missed a detail here and there, but I think it's in pretty good
shape now.

bill

2007-01-29 13:36:41

by Stephen Smalley

[permalink] [raw]
Subject: Re: [PATCH 0/7] breaking the global file_list_lock

On Sun, 2007-01-28 at 15:11 +0000, Christoph Hellwig wrote:
> On Sun, Jan 28, 2007 at 02:43:25PM +0000, Christoph Hellwig wrote:
> > On Sun, Jan 28, 2007 at 12:51:18PM +0100, Peter Zijlstra wrote:
> > > This patch-set breaks up the global file_list_lock which was found to be a
> > > severe contention point under basically any filesystem intensive workload.
> >
> > Benchmarks, please. Where exactly do you see contention for this?
> >
> >
> > filesystem intensive workload apparently means namespace operation heavy
> > workload, right? The biggest bottleneck I've seen with those is dcache lock.
> >
> > Even if this is becoming a real problem there must be simpler ways to fix
> > this than introducing various new locking primitives and all kinds of
> > complexity.
>
> One good way to fix scalability without all this braindamage is
> to get rid of sb->s_files. Current uses are:
>
> - fs/dquot.c:add_dquot_ref()
>
> This performs its actual operation on inodes. We should
> be able to check inode->i_writecount to see which inodes
> need quota initialization.
>
> - fs/file_table.c:fs_may_remount_ro()
>
> This one is gone in Dave Hansen's per-mountpoint r/o patchkit
>
> - fs/proc/generic.c:proc_kill_inodes()
>
> This can be done with a list inside procfs.
>
> - fs/super.c:mark_files_ro()
>
> This one is only used for do_emergency_remount(), which is
> an utter hack. It might be much better to just deny any
> kind of write access through a superblock flag here.
>
> - fs/selinuxfs.c:sel_remove_bools()
>
> Utter madness. I have no idea how this ever got merged.
> Maybe the selinux folks can explain what crack they were
> on when writing this. The problem would go away with
> a generic revoke infrastructure.

It was modeled after proc_kill_inodes(), as noted in the comments. So
if you have a suitable replacement for proc_kill_inodes(), you can apply
the same approach to sel_remove_bools().

>
> Once sb->s_files is gone we can also kill off fu_list entirely, replace
> it by a list head in the tty code, and make the lock for it per-tty.

--
Stephen Smalley
National Security Agency

2007-01-29 18:02:49

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 0/7] breaking the global file_list_lock

On Mon, Jan 29, 2007 at 08:32:53AM -0500, Stephen Smalley wrote:
> > - fs/selinuxfs.c:sel_remove_bools()
> >
> > Utter madness. I have no idea how this ever got merged.
> > Maybe the selinux folks can explain what crack they were
> > on when writing this. The problem would go away with
> > a generic revoke infrastructure.
>
> It was modeled after proc_kill_inodes(), as noted in the comments. So
> if you have a suitable replacement for proc_kill_inodes(), you can apply
> the same approach to sel_remove_bools().

Looking at it again, sel_remove_bools() actually only operates on files
backed by selinuxfs, so yes, we could use the same approach.