Virtually all sleep_on / interruptible_sleep_on users are racy. Actually
there is only one safe case: if both wakeup and sleep happen under
lock_kernel.
Any objections against killing it entirely? Or what about marking it as
deprecated, as the first step towards killing it?
I'll follow with two patches that remove it from shaper and sunrpc/sched
- shaper is racy, rpciod_down is only safe if called with lock_kernel.
--
Manfred
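[For illustration, the lost-wakeup race Manfred describes looks like this. Kernel-style pseudocode, not taken from any particular driver; `resource_available` and `wq` are made-up names:]

```c
/* Consumer, without the BKL: */
while (!resource_available)
        interruptible_sleep_on(&wq);    /* tests first, queues later */

/* Producer, possibly on another CPU: */
resource_available = 1;
wake_up_interruptible(&wq);

/*
 * If the wake_up_interruptible() fires after the consumer's test but
 * before interruptible_sleep_on() has put the task on the wait queue,
 * the wakeup is lost and the consumer sleeps forever.  Running both
 * sides under lock_kernel() closes the window, because the BKL is only
 * dropped across schedule(), i.e. once the task is queued and asleep.
 */
```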
On Sat, 2004-01-17 at 19:43 +0100, Manfred Spraul wrote:
> Virtually all sleep_on / interruptible_sleep_on users are racy. Actually
> there is only one safe case: if both wakeup and sleep happen under
> lock_kernel.
> Any objections against killing it entirely? Or what about marking it as
> deprecated, as the first step towards killing it?
>
> I'll follow with two patches that remove it from shaper and sunrpc/sched
> - shaper is racy, rpciod_down is only safe if called with lock_kernel.
Deprecate it in 2.7.0 and add BUG_ON(BKL not held) to it. Not before
time.
--
dwmw2
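[A sketch of what David is suggesting, assuming 2.6's `__deprecated` attribute and `kernel_locked()`; illustrative, not a tested patch:]

```c
/* kernel/sched.c (sketch) */
void __deprecated sleep_on(wait_queue_head_t *q)
{
        /* Loudly catch anything outside the only pattern that
         * was ever safe: sleeping with the BKL held. */
        BUG_ON(!kernel_locked());
        /* ... existing implementation ... */
}
```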
> Any objections against killing it entirely? Or what about marking it as
> deprecated, as the first step towards killing it?
I have it marked like that in my tree. Lots of warnings... and
unfortunately about half are in "new" code, so IMO we can't mark it
deprecated early enough to prevent more of these bugs from entering the
kernel.
On Sat, Jan 17, 2004 at 07:28:21PM +0000, David Woodhouse wrote:
> On Sat, 2004-01-17 at 19:43 +0100, Manfred Spraul wrote:
> > Virtually all sleep_on / interruptible_sleep_on users are racy. Actually
> > there is only one safe case: if both wakeup and sleep happen under
> > lock_kernel.
> > Any objections against killing it entirely? Or what about marking it as
> > deprecated, as the first step towards killing it?
> >
> > I'll follow with two patches that remove it from shaper and sunrpc/sched
> > - shaper is racy, rpciod_down is only safe if called with lock_kernel.
>
> Deprecate it in 2.7.0 and add BUG_ON(BKL not held) to it. Not before
> time.
We need to remove racy uses anyway - that can't wait for 2.7. And I really
wonder if there will be anything left after that - right now only the reiserfs
uses look like they might not be immediately broken.
AFAICS, _all_ uses of sleep_on() in drivers are broken in one way or another
and BKL won't help them.
On Sat, 17/01/2004 at 15:10, [email protected] wrote:
> We need to remove racy uses anyway - that can't wait for 2.7. And I really
> wonder if there will be anything left after that - right now only the reiserfs
> uses look like they might not be immediately broken.
rpciod is quite safe as it is. I'm not against changing it, but it's not
a high priority patch as far as I'm concerned.
Cheers,
Trond
On Sat, 2004-01-17 at 20:10 +0000,
[email protected] wrote:
> AFAICS, _all_ uses of sleep_on() in drivers are broken in one way or another
> and BKL won't help them.
I think ext3 and nfs get away with it at the moment by using the BKL. It
does want fixing though.
--
dwmw2
David Woodhouse <[email protected]> wrote:
>
> On Sat, 2004-01-17 at 20:10 +0000,
> [email protected] wrote:
> > AFAICS, _all_ uses of sleep_on() in drivers are broken in one way or another
> > and BKL won't help them.
>
> I think ext3 and nfs get away with it at the moment by using the BKL. It
> does want fixing though.
ext3 had all its sleep_on()s and lock_kernel()s eradicated ages ago. In
2.4 it still uses sleep_on(), and lock_kernel() makes that safe.
It would be nice to fix up the lock_kernel()s in the NFS client: SMP lock
contention is quite high in there.
On Sun, 18/01/2004 at 01:41, Andrew Morton wrote:
> It would be nice to fix up the lock_kernel()s in the NFS client: SMP lock
> contention is quite high in there.
Chuck did a study of that particular issue some 2-3 years ago, but we've
addressed all the top problems on his list since then. Do you have any
new numbers to show us? I ask because I'm not at all convinced that the
BKL adds significantly to our latencies for the moment (I've been
looking at those numbers in the last few days due to the readahead
problems that have already been reported).
I am, however, quite convinced that we need new statistics on this sort
of issue. Chuck is therefore working on a set of patches to add an
"iostat"-like tool to the NFS client. Hopefully that will help settle
these questions.
Cheers,
Trond
On Sun, 18/01/2004 at 01:57, Trond Myklebust wrote:
> Chuck is therefore working on a set of patches to add an
> "iostat"-like tool to the NFS client.
Sorry - that was poorly formulated:
Chuck has been working on this issue on his _own_ initiative. I'm now
convinced that he is right to be doing so, and will be pushing to get
his patches into the kernel.
Cheers,
Trond
Trond Myklebust <[email protected]> wrote:
>
> On Sun, 18/01/2004 at 01:41, Andrew Morton wrote:
> > It would be nice to fix up the lock_kernel()s in the NFS client: SMP lock
> > contention is quite high in there.
>
> Chuck did a study of that particular issue some 2-3 years ago, but we've
> addressed all the top problems on his list since then. Do you have any
> new numbers to show us? I ask because I'm not at all convinced that the
> BKL adds significantly to our latencies for the moment (I've been
> looking at those numbers in the last few days due to the readahead
> problems that have already been reported).
>
> I am, however, quite convinced that we need new statistics on this sort
> of issue. Chuck is therefore working on a set of patches to add an
> "iostat"-like tool to the NFS client. Hopefully that will help settle
> these questions.
>
Well a quick profile of `dbench 4' on a localhost mount on a P4-HT:
c01445fc kmem_cache_alloc 495 6.1875
c0140d28 buffered_rmqueue 514 1.7133
c012a8b4 del_timer_sync 1043 3.7250
c013ea58 generic_file_aio_write_nolock 1316 0.4398
c011fa60 __wake_up 1472 33.4545
c0158a35 .text.lock.read_write 1541 26.1186
c012113f .text.lock.sched 2443 4.1197
c011f3f4 schedule 2782 1.8065
c0109034 default_idle 65504 1259.6923
00000000 total 91786 0.2234
Turning on spinlock inlining:
c0143ec0 kmem_cache_alloc 464 6.1053
c01405c8 buffered_rmqueue 510 1.6346
c012a260 __mod_timer 540 1.7308
c012a4d4 del_timer_sync 949 3.3893
c013e378 generic_file_aio_write_nolock 1340 0.4479
c011fa1c __wake_up 1556 29.9231
c0157054 remote_llseek 1886 6.6408
c011f3a4 schedule 5158 3.3407
c0109034 default_idle 33731 648.6731
00000000 total 59918 0.1457
That's quite a lot of contention on the lock_kernel() in remote_llseek().
The context switch rate is 45000/sec, so schedule() gets a workout.
I do recall seeing quite a lot of BKL contention within NFS itself running
across 100bT on a 4-way. That was several months ago - maybe things have
improved?
Andrew Morton wrote:
>That's quite a lot of contention on the lock_kernel() in remote_llseek().
>
>
What about switching to generic_file_llseek, at least for files? The
only references to f_pos are in filldir/readdir.
--
Manfred
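[Sketched concretely, Manfred's suggestion amounts to the following; abbreviated, and assuming 2.6's generic_file_llseek, which serializes on inode->i_sem rather than taking the BKL the way remote_llseek does:]

```c
/* fs/nfs/file.c (sketch; other methods unchanged) */
struct file_operations nfs_file_operations = {
        .llseek = generic_file_llseek,  /* was remote_llseek, which is
                                           default_llseek wrapped in
                                           lock_kernel()/unlock_kernel() */
        /* ... */
};
```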
On Sun, 18/01/2004 at 02:36, Andrew Morton wrote:
> That's quite a lot of contention on the lock_kernel() in remote_llseek().
The NFS use of the BKL may indeed end up leading to latencies within
llseek(), but what you just presented is more of an argument for
eliminating the use of the BKL within llseek()...
In general, though, the latencies involved with the actual RPC call are
so large that anything involving local locks will not tend to register
on the radar screen at all (your numbers for "default_idle" are an order
of magnitude larger than anything else).
This is certainly true of 100Mbit nets. However, those latencies might
start to be measurable on GigE nets with fast servers...
Cheers,
Trond
On Sun, 18/01/2004 at 02:44, Manfred Spraul wrote:
> Andrew Morton wrote:
>
> >That's quite a lot of contention on the lock_kernel() in remote_llseek().
> >
> >
> What about switching to generic_file_llseek, at least for files? The
> only references to f_pos are in filldir/readdir.
I'm not sure that taking inode->i_sem would be much of an improvement.
Both the BKL and the inode semaphore seem superfluous to me in this
situation.
After all, the file size is now protected by neither of the above, but
rather by its own seqlock...
Cheers,
Trond
Trond Myklebust wrote:
>I'm not sure that taking inode->i_sem would be much of an improvement.
>Both th BKL and the inode semaphore seem superfluous to me in this
>situation.
>
I think the purpose of i_sem or lock_kernel is to protect the file
pointer. Most local filesystems use i_sem; it's noticeably faster, being
per-object rather than global locking.
Btw, generic_mapping_read should also lock its accesses to f_pos: right
now it reads and writes f_pos without any locking...
--
Manfred
On Sun, 18/01/2004 at 03:18, Manfred Spraul wrote:
> I think the purpose of i_sem or lock_kernel is to protect the file
> pointer. Most local filesystems use i_sem, it's noticably faster -
> global vs. per-object locking.
Err... Not in the case where someone else has the file open for writing.
In that case, the i_sem can be held for quite long periods of time 'cos
NFS has to flush out conflicting writes, and it has to serialize them
with respect to reads on pages.
> Btw, generic_mapping_read should also lock its accesses to f_pos: right
> now it reads and writes f_pos without any locking...
Is there really any sane application that relies on 2 threads sharing
the same file descriptor, and doing llseek()/read()/write() without some
form of userland serialization mechanism?
It sounds to me as if that is going to be pretty much broken whether or
not we protect the integrity of f_pos in the kernel.
Cheers,
Trond