2005-03-29 23:07:32

by Lee Revell

[permalink] [raw]
Subject: NFS client latencies

I am seeing long latencies in the NFS client code. Attached is a ~1.9
ms latency trace.

Lee


Attachments:
nfs-1919us-latency_trace.bz2 (30.01 kB)

2005-03-29 23:18:26

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS client latencies

ty den 29.03.2005 Klokka 18:04 (-0500) skreiv Lee Revell:
> I am seeing long latencies in the NFS client code. Attached is a ~1.9
> ms latency trace.

What kind of workload are you using to produce these numbers?

Cheers,
Trond
--
Trond Myklebust <[email protected]>

2005-03-29 23:35:08

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS client latencies

ty den 29.03.2005 Klokka 18:32 (-0500) skreiv Lee Revell:
> On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > ty den 29.03.2005 Klokka 18:04 (-0500) skreiv Lee Revell:
> > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > ms latency trace.
> >
> > What kind of workload are you using to produce these numbers?
> >
>
> Just a kernel compile over NFS.

In other words a workload consisting mainly of mmap()ed writes?

Cheers,
Trond

--
Trond Myklebust <[email protected]>

2005-03-29 23:32:37

by Lee Revell

[permalink] [raw]
Subject: Re: NFS client latencies

On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> ty den 29.03.2005 Klokka 18:04 (-0500) skreiv Lee Revell:
> > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > ms latency trace.
>
> What kind of workload are you using to produce these numbers?
>

Just a kernel compile over NFS.

Lee

2005-03-29 23:37:49

by Lee Revell

[permalink] [raw]
Subject: Re: NFS client latencies

On Tue, 2005-03-29 at 18:34 -0500, Trond Myklebust wrote:
> ty den 29.03.2005 Klokka 18:32 (-0500) skreiv Lee Revell:
> > On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > > ty den 29.03.2005 Klokka 18:04 (-0500) skreiv Lee Revell:
> > > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > > ms latency trace.
> > >
> > > What kind of workload are you using to produce these numbers?
> > >
> >
> > Just a kernel compile over NFS.
>
> In other words a workload consisting mainly of mmap()ed writes?
>

If you say so... I don't know the NFS internals very well.

I did use "make -j16", on a UP.

Lee

2005-03-30 08:02:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: NFS client latencies


* Trond Myklebust <[email protected]> wrote:

> ty den 29.03.2005 Klokka 18:32 (-0500) skreiv Lee Revell:
> > On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > > ty den 29.03.2005 Klokka 18:04 (-0500) skreiv Lee Revell:
> > > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > > ms latency trace.
> > >
> > > What kind of workload are you using to produce these numbers?
> > >
> >
> > Just a kernel compile over NFS.
>
> In other words a workload consisting mainly of mmap()ed writes?

new files created during a kernel compile are done via open()/write().

looking at the trace it seems that the NFS client code is doing list
walks over ~7000 entries (!), in nfs_list_add_request(). Whatever
protocol/server-side gain there might be due to the sorting and
coalescing, this CPU overhead seems extremely high - more than 1 msec
for this single insertion!

the comment suggests that this is optimized for append writes (which is
quite common, but by far not the only write workload) - but the
worst-case behavior of this code is very bad. How about disabling this
sorting altogether and benchmarking the result? Maybe it would get
comparable coalescing (higher levels do coalesce after all), but wastly
improved CPU utilization on the client side. (Note that the server
itself will do sorting of any write IO anyway, if this is to hit any
persistent storage - and if not then sorting so agressively on the
client side makes little sense.)

i think normal NFS benchmarks would not show this effect, as writes are
typically streamed in benchmarks. But once you have lots of outstanding
requests and a write comes out of order, CPU utilization (and latency)
skyrockets.

Ingo

2005-03-30 14:11:29

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS client latencies

on den 30.03.2005 Klokka 10:02 (+0200) skreiv Ingo Molnar:

> the comment suggests that this is optimized for append writes (which is
> quite common, but by far not the only write workload) - but the
> worst-case behavior of this code is very bad. How about disabling this
> sorting altogether and benchmarking the result? Maybe it would get
> comparable coalescing (higher levels do coalesce after all), but wastly
> improved CPU utilization on the client side. (Note that the server
> itself will do sorting of any write IO anyway, if this is to hit any
> persistent storage - and if not then sorting so agressively on the
> client side makes little sense.)

No. Coalescing on the client makes tons of sense. The overhead of
sending 8 RPC requests for 4k writes instead of sending 1 RPC request
for a single 32k write is huge: among other things, you end up tying up
8 RPC slots on the client + 8 nfsd threads on the server instead of just
one of each.
You also end up allocating 8 times a much memory for supporting
structures such as rpc_tasks. That sucks when you're in a low memory
situation and are trying to push dirty pages to the server as fast as
possible...

What we can do instead is to do the same thing that the VM does: use a
radix tree with tags to do the sorting of the nfs_pages when we're
actually building up the list of dirty pages to send.
We already have the radix tree to do that, all we need to do is add the
tags and modify the scanning function to use them.

Cheers,
Trond

--
Trond Myklebust <[email protected]>

2005-03-30 14:21:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: NFS client latencies


* Trond Myklebust <[email protected]> wrote:

> > the comment suggests that this is optimized for append writes (which is
> > quite common, but by far not the only write workload) - but the
> > worst-case behavior of this code is very bad. How about disabling this
> > sorting altogether and benchmarking the result? Maybe it would get
> > comparable coalescing (higher levels do coalesce after all), but wastly
> > improved CPU utilization on the client side. (Note that the server
> > itself will do sorting of any write IO anyway, if this is to hit any
> > persistent storage - and if not then sorting so agressively on the
> > client side makes little sense.)
>
> No. Coalescing on the client makes tons of sense. The overhead of
> sending 8 RPC requests for 4k writes instead of sending 1 RPC request
> for a single 32k write is huge: among other things, you end up tying up
> 8 RPC slots on the client + 8 nfsd threads on the server instead of just
> one of each.

yes - coalescing a few pages makes sense, but linearly scanning
thousands of entries is entirely pointless.

Ingo

2005-03-30 14:26:24

by Lee Revell

[permalink] [raw]
Subject: Re: NFS client latencies

On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> ty den 29.03.2005 Klokka 18:04 (-0500) skreiv Lee Revell:
> > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > ms latency trace.
>
> What kind of workload are you using to produce these numbers?
>

Here is the other long latency I am seeing in the NFS client. I posted
this before, but did not cc: the correct people.

It looks like nfs_wait_on_requests is doing thousands of
radix_tree_gang_lookups while holding some lock.

Lee

preemption latency trace v1.1.4 on 2.6.12-rc1-RT-V0.7.41-00
--------------------------------------------------------------------
latency: 3178 �s, #4095/14224, CPU#0 | (M:preempt VP:0, KP:1, SP:1 HP:1
#P:1)
-----------------
| task: ksoftirqd/0-2 (uid:0 nice:-10 policy:0 rt_prio:0)
-----------------

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
(T1/#0) <...> 32105 0 3 00000004 00000000 [0011939614227867]
0.000ms (+4137027.445ms): <6500646c> (<61000000>)
(T1/#2) <...> 32105 0 3 00000004 00000002 [0011939614228097]
0.000ms (+0.000ms): __trace_start_sched_wakeup+0x9a/0xd0 <c013150a>
(try_to_wake_up+0x94/0x140 <c0110474>)
(T1/#3) <...> 32105 0 3 00000003 00000003 [0011939614228436]
0.000ms (+0.000ms): preempt_schedule+0x11/0x80 <c02b57c1>
(try_to_wake_up+0x94/0x140 <c0110474>)
(T3/#4) <...>-32105 0dn.3 0�s : try_to_wake_up+0x11e/0x140
<c01104fe> <<...>-2> (69 76):
(T1/#5) <...> 32105 0 3 00000002 00000005 [0011939614228942]
0.000ms (+0.000ms): preempt_schedule+0x11/0x80 <c02b57c1>
(try_to_wake_up+0xf8/0x140 <c01104d8>)
(T1/#6) <...> 32105 0 3 00000002 00000006 [0011939614229130]
0.000ms (+0.000ms): wake_up_process+0x35/0x40 <c0110555> (do_softirq
+0x3f/0x50 <c011b05f>)
(T6/#7) <...>-32105 0dn.1 1�s < (1)
(T1/#8) <...> 32105 0 2 00000001 00000008 [0011939614229782]
0.001ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee>
(nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
(T1/#9) <...> 32105 0 2 00000001 00000009 [0011939614229985]
0.001ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup
+0x52/0x70 <c01e0632>)
(T1/#10) <...> 32105 0 2 00000001 0000000a [0011939614230480]
0.001ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee>
(nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
(T1/#11) <...> 32105 0 2 00000001 0000000b [0011939614230634]
0.002ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup
+0x52/0x70 <c01e0632>)
(T1/#12) <...> 32105 0 2 00000001 0000000c [0011939614230889]
0.002ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee>
(nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
(T1/#13) <...> 32105 0 2 00000001 0000000d [0011939614231034]
0.002ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup
+0x52/0x70 <c01e0632>)
(T1/#14) <...> 32105 0 2 00000001 0000000e [0011939614231302]
0.002ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee>
(nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
(T1/#15) <...> 32105 0 2 00000001 0000000f [0011939614231419]
0.002ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup
+0x52/0x70 <c01e0632>)

(last two lines just repeat)


> Cheers,
> Trond

2005-03-30 14:51:12

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS client latencies

on den 30.03.2005 Klokka 09:26 (-0500) skreiv Lee Revell:
> On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > ty den 29.03.2005 Klokka 18:04 (-0500) skreiv Lee Revell:
> > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > ms latency trace.
> >
> > What kind of workload are you using to produce these numbers?
> >
>
> Here is the other long latency I am seeing in the NFS client. I posted
> this before, but did not cc: the correct people.
>
> It looks like nfs_wait_on_requests is doing thousands of
> radix_tree_gang_lookups while holding some lock.

That's normal and cannot be avoided: when writing, we have to look for
the existence of old nfs_page requests. The reason is that if one does
exist, we must either coalesce our new dirty area into it or if we
can't, we must flush the old request out to the server.

Cheers,
Trond

--
Trond Myklebust <[email protected]>

2005-03-30 19:53:41

by Lee Revell

[permalink] [raw]
Subject: Re: NFS client latencies

On Wed, 2005-03-30 at 09:50 -0500, Trond Myklebust wrote:
> on den 30.03.2005 Klokka 09:26 (-0500) skreiv Lee Revell:
> > On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > > ty den 29.03.2005 Klokka 18:04 (-0500) skreiv Lee Revell:
> > > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > > ms latency trace.
> > >
> > > What kind of workload are you using to produce these numbers?
> > >
> >
> > Here is the other long latency I am seeing in the NFS client. I posted
> > this before, but did not cc: the correct people.
> >
> > It looks like nfs_wait_on_requests is doing thousands of
> > radix_tree_gang_lookups while holding some lock.
>
> That's normal and cannot be avoided: when writing, we have to look for
> the existence of old nfs_page requests. The reason is that if one does
> exist, we must either coalesce our new dirty area into it or if we
> can't, we must flush the old request out to the server.

But holding a spinlock for 3ms is not acceptable. _Something_ has to be
done. Can't the lock be dropped and reacquired after processing N
requests where N is some reasonable number?

Lee

2005-03-30 19:57:33

by Andrew Morton

[permalink] [raw]
Subject: Re: NFS client latencies

Ingo Molnar <[email protected]> wrote:
>
> > No. Coalescing on the client makes tons of sense. The overhead of
> > sending 8 RPC requests for 4k writes instead of sending 1 RPC request
> > for a single 32k write is huge: among other things, you end up tying up
> > 8 RPC slots on the client + 8 nfsd threads on the server instead of just
> > one of each.
>
> yes - coalescing a few pages makes sense, but linearly scanning
> thousands of entries is entirely pointless.

If each object has a unique ->wb_index then they could all be placed into a
radix tree rather than a list. Use the gang-lookup functions to perform
the sorting.

2005-03-30 20:01:04

by Andrew Morton

[permalink] [raw]
Subject: Re: NFS client latencies

Trond Myklebust <[email protected]> wrote:
>
> on den 30.03.2005 Klokka 09:26 (-0500) skreiv Lee Revell:
> > On Tue, 2005-03-29 at 18:18 -0500, Trond Myklebust wrote:
> > > ty den 29.03.2005 Klokka 18:04 (-0500) skreiv Lee Revell:
> > > > I am seeing long latencies in the NFS client code. Attached is a ~1.9
> > > > ms latency trace.
> > >
> > > What kind of workload are you using to produce these numbers?
> > >
> >
> > Here is the other long latency I am seeing in the NFS client. I posted
> > this before, but did not cc: the correct people.
> >
> > It looks like nfs_wait_on_requests is doing thousands of
> > radix_tree_gang_lookups while holding some lock.
>
> That's normal and cannot be avoided: when writing, we have to look for
> the existence of old nfs_page requests. The reason is that if one does
> exist, we must either coalesce our new dirty area into it or if we
> can't, we must flush the old request out to the server.

One could use the radix-tree tagging stuff so that the gang lookup only
looks up pages which are !NFS_WBACK_BUSY.

2005-03-30 21:19:46

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS client latencies

on den 30.03.2005 Klokka 11:56 (-0800) skreiv Andrew Morton:
> > That's normal and cannot be avoided: when writing, we have to look for
> > the existence of old nfs_page requests. The reason is that if one does
> > exist, we must either coalesce our new dirty area into it or if we
> > can't, we must flush the old request out to the server.
>
> One could use the radix-tree tagging stuff so that the gang lookup only
> looks up pages which are !NFS_WBACK_BUSY.

Yes. Together with the radix tree-based sorting of dirty requests,
that's pretty much what I've spent most of today doing. Lee, could you
see how the attached combined patch changes your latency numbers?

Cheers,
Trond
--
Trond Myklebust <[email protected]>


Attachments:
linux-2.6.12-nfspage_use_tags.dif (13.72 kB)

2005-03-31 02:27:51

by Lee Revell

[permalink] [raw]
Subject: Re: NFS client latencies

On Wed, 2005-03-30 at 16:14 -0500, Trond Myklebust wrote:
> on den 30.03.2005 Klokka 11:56 (-0800) skreiv Andrew Morton:
> > > That's normal and cannot be avoided: when writing, we have to look for
> > > the existence of old nfs_page requests. The reason is that if one does
> > > exist, we must either coalesce our new dirty area into it or if we
> > > can't, we must flush the old request out to the server.
> >
> > One could use the radix-tree tagging stuff so that the gang lookup only
> > looks up pages which are !NFS_WBACK_BUSY.
>
> Yes. Together with the radix tree-based sorting of dirty requests,
> that's pretty much what I've spent most of today doing. Lee, could you
> see how the attached combined patch changes your latency numbers?
>

Different code path, and the latency is worse. See the attached ~7ms
trace.

Lee


Attachments:
nfs-6776usec-trace.bz2 (35.93 kB)

2005-03-31 02:40:21

by Andrew Morton

[permalink] [raw]
Subject: Re: NFS client latencies

Lee Revell <[email protected]> wrote:
>
> > Yes. Together with the radix tree-based sorting of dirty requests,
> > that's pretty much what I've spent most of today doing. Lee, could you
> > see how the attached combined patch changes your latency numbers?
> >
>
> Different code path, and the latency is worse. See the attached ~7ms
> trace.

Is a bunch of gobbledygook. Hows about you interpret it for us?

2005-03-31 02:47:26

by Lee Revell

[permalink] [raw]
Subject: Re: NFS client latencies

On Wed, 2005-03-30 at 18:39 -0800, Andrew Morton wrote:
> Lee Revell <[email protected]> wrote:
> >
> > > Yes. Together with the radix tree-based sorting of dirty requests,
> > > that's pretty much what I've spent most of today doing. Lee, could you
> > > see how the attached combined patch changes your latency numbers?
> > >
> >
> > Different code path, and the latency is worse. See the attached ~7ms
> > trace.
>
> Is a bunch of gobbledygook. Hows about you interpret it for us?
>

Sorry. When I summarized them before, Ingo just asked for the full
verbose trace.

The 7 ms are spent in this loop:

radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
radix_tree_gang_lookup_tag+0xe/0x80 <c01e074e> (nfs_scan_lock_dirty+0x69/0xf0 <c01c39d9>)
__lookup_tag+0xe/0x130 <c01e061e> (radix_tree_gang_lookup_tag+0x59/0x80 <c01e0799>)

Lee

2005-03-31 03:49:16

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS client latencies

on den 30.03.2005 Klokka 21:47 (-0500) skreiv Lee Revell:
> On Wed, 2005-03-30 at 18:39 -0800, Andrew Morton wrote:
> > Lee Revell <[email protected]> wrote:
> > >
> > > > Yes. Together with the radix tree-based sorting of dirty requests,
> > > > that's pretty much what I've spent most of today doing. Lee, could you
> > > > see how the attached combined patch changes your latency numbers?
> > > >
> > >
> > > Different code path, and the latency is worse. See the attached ~7ms
> > > trace.
> >
> > Is a bunch of gobbledygook. Hows about you interpret it for us?
> >
>
> Sorry. When I summarized them before, Ingo just asked for the full
> verbose trace.
>
> The 7 ms are spent in this loop:
>
> radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
> nfs_set_page_writeback_locked+0xe/0x60 <c01c357e> (nfs_scan_lock_dirty+0x8d/0xf0 <c01c39fd>)
> radix_tree_tag_set+0xe/0xa0 <c01e036e> (nfs_set_page_writeback_locked+0x4b/0x60 <c01c35bb>)
> radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)


Which is basically confirming what the guys from Bull already told me,
namely that the radix tree tag stuff is much less efficient that what
we've got now. I never saw their patches, though, so I was curious to
try it for myself.

Cheers,
Trond

--
Trond Myklebust <[email protected]>

2005-03-31 07:00:34

by Ingo Molnar

[permalink] [raw]
Subject: Re: NFS client latencies


* Trond Myklebust <[email protected]> wrote:

> > The 7 ms are spent in this loop:
>
> Which is basically confirming what the guys from Bull already told me,
> namely that the radix tree tag stuff is much less efficient that what
> we've got now. I never saw their patches, though, so I was curious to
> try it for myself.

i think the numbers are being misinterpreted. I believe this patch is a
big step forward. The big thing is that nfs_list_add_request() is O(1)
now - while _a single request addition to the sorted list_ triggered the
1+ msec latency in Lee's previous trace. I.e. the previous trace showed
a 1+ msec latency for a single page IO submission, while his new trace
shows _thousands_ of pages submitted for NFS IO - which is a very
different beast!

i think all it needs now is a lock-breaker in the main radix-lookup loop
in nfs_scan_lock_dirty(), or a latency-oriented reduction in the npages
argument, to make the loop bounded. The nfs_list_add_request() code
unbreakable from a latency POV. The patch looks really great to me,
kudos for pulling it off that fast.

(Also, i'd not declare this variant _slower_ based on latencies, unless
proven in real benchmarks. A faster implementation can easily have
higher latencies, if some detail changed - and lots of details changed
due to your patch. And i'd not even declare latency that much worse,
unless it's been measured with the tracer turned off. (Lee, could you
try such a comparison, worst-case latency with and without the patch,
with LATENCY_TRACE turned off in the .config but latency timing still
enabled?) The latency tracer itself can baloon the latency.)

Ingo

2005-03-31 07:04:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: NFS client latencies


* Lee Revell <[email protected]> wrote:

> > Is a bunch of gobbledygook. Hows about you interpret it for us?
>
> Sorry. When I summarized them before, Ingo just asked for the full
> verbose trace.

please send non-verbose traces if possible. The verbose traces are
useful when it's not clear which portion of a really large function is
the call site - but they are also alot harder to read. Verbose traces
are basically just a debugging mechanism for me, not meant for public
consumption.

i can add back the instruction-'offset' to the non-verbose trace, that
will make even the ext3 traces easily interpretable in the non-verbose
format.

> The 7 ms are spent in this loop:
>
> radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
> radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
> radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
> radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)
> radix_tree_tag_clear+0xe/0xd0 <c01e040e> (nfs_scan_lock_dirty+0xb2/0xf0 <c01c3a22>)

the trace shows thousands of pages getting submitted - each of the line
above is at least one new page. The loop is not preemptible right now
but that should be easy to bound. Note that your earlier traces showed
the list sorting overhead for a _single page_. So it's a huge difference
and a huge step forward.

Ingo

2005-03-31 07:15:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: NFS client latencies


* Ingo Molnar <[email protected]> wrote:

> i think all it needs now is a lock-breaker in the main radix-lookup
> loop in nfs_scan_lock_dirty(), or a latency-oriented reduction in the
> npages argument, to make the loop bounded. [...]

can nfsi->req_lock be dropped within nfs_scan_dirty()? Or does the
scanning have to restart in that case? My guess would be the scanning
does not have to be restarted, since we drop the lock after scanning
anyway, so all it has to take care of is the consistency of the list
itself, and the fact that the whole index range got scanned in a certain
point in time.

Such a lock-breaker was hard before because we had a list 'cursor' which
could move away while we dropped the lock. But with the radix tree it's
an 'index position' now, which is much more invariant. The patch below
attempts this, ontop of your patch - but i'm not sure whether ->req_lock
is the only lock we hold at that point.

Ingo

--- linux/fs/nfs/pagelist.c.orig
+++ linux/fs/nfs/pagelist.c
@@ -291,6 +291,7 @@ nfs_scan_lock_dirty(struct nfs_inode *nf
res++;
}
}
+ cond_resched_lock(&nfsi->req_lock);
}
out:
return res;

2005-03-31 07:18:25

by Andrew Morton

[permalink] [raw]
Subject: Re: NFS client latencies

Ingo Molnar <[email protected]> wrote:
>
>
> * Trond Myklebust <[email protected]> wrote:
>
> > > The 7 ms are spent in this loop:
> >
> > Which is basically confirming what the guys from Bull already told me,
> > namely that the radix tree tag stuff is much less efficient that what
> > we've got now. I never saw their patches, though, so I was curious to
> > try it for myself.
>
> i think the numbers are being misinterpreted. I believe this patch is a
> big step forward. The big thing is that nfs_list_add_request() is O(1)
> now - while _a single request addition to the sorted list_ triggered the
> 1+ msec latency in Lee's previous trace.

Well. The radix-tree approach's best-case is probably quite a lot worse
than the list-based approach's best-case. It hits a lot more cachelines
and involves a lot more code.

But of course the radix-tree's worst-case will be far better than list's.

And presumably that list-based code rarely hits the worst-case, else it
would have been changed by now.

2005-03-31 07:35:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: NFS client latencies


* Andrew Morton <[email protected]> wrote:

> Well. The radix-tree approach's best-case is probably quite a lot
> worse than the list-based approach's best-case. It hits a lot more
> cachelines and involves a lot more code.

The list-based approach's best-case are large continuous append writes.
No sorting overhead, and light data structures.

i'd say this workload should be not that bad under the radix tree either
- the gang lookup stuffs a nice vector of 16 pages into an array.

we definitely can say nothing based on the observation that a _single_
page took 1.9 msecs in Lee's previous measurement, while 7700 pages now
take 6 msecs to process.

> But of course the radix-tree's worst-case will be far better than
> list's.

the generic VM/pagecache has proven that the radix tree wins hands down
for alot more workloads than the worst-case.

> And presumably that list-based code rarely hits the worst-case, else
> it would have been changed by now.

that was my other point in a previous mail: most write benchmarks do
continuous append writes, and CPU overhead easily gets lost in network
latency.

Also, considering that a good portion of the NFS client's code is still
running under the BKL one would assume if the BKL hurts performance it
would have been changed by now? ;-)

Ingo

2005-03-31 07:43:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: NFS client latencies


* Lee Revell <[email protected]> wrote:

> > Yes. Together with the radix tree-based sorting of dirty requests,
> > that's pretty much what I've spent most of today doing. Lee, could you
> > see how the attached combined patch changes your latency numbers?
>
> Different code path, and the latency is worse. See the attached ~7ms
> trace.

could you check the -41-23 -RT kernel at the usual place:

http://redhat.com/~mingo/realtime-preempt/

i've added Trond's radix lookup code, plus the lockbreaker patch.

(one thing that is not covered yet is nfs_scan_list() - that still scans
a list. Trond, is that fixable too?)

Ingo

2005-03-31 07:58:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: NFS client latencies


* Ingo Molnar <[email protected]> wrote:

> could you check the -41-23 -RT kernel at the usual place:
>
> http://redhat.com/~mingo/realtime-preempt/
>
> i've added Trond's radix lookup code, plus the lockbreaker patch.
>
> (one thing that is not covered yet is nfs_scan_list() - that still scans
> a list. Trond, is that fixable too?)

ok, find a new latency trace attached (1.6 msecs). I generated write
loads, and the nfs_scan_list_dirty() latency is gone and indeed
nfs_scan_list() generates the worst latency - while processing 8535
pages in one critical section.

Ingo


Attachments:
(No filename) (582.00 B)
nfs-1682us-latency.bz2 (12.27 kB)
Download all attachments

2005-03-31 08:02:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: NFS client latencies


* Ingo Molnar <[email protected]> wrote:

> ok, find a new latency trace attached (1.6 msecs). I generated write
> loads, and the nfs_scan_list_dirty() latency is gone and indeed
> nfs_scan_list() generates the worst latency - while processing 8535
> pages in one critical section.

here's a more complete one of 4.4 msecs. (there was a bug in the latency
tracer that trimmed latency traces - this one is complete and shows an
nfs_scan_list processing of 22908 pages)

Ingo


Attachments:
(No filename) (477.00 B)
nfs-4481us-latency.bz2 (33.03 kB)
Download all attachments

2005-03-31 11:58:46

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS client latencies

fs/nfs/write.c | 28 +++++++++++++++++-----------
1 files changed, 17 insertions(+), 11 deletions(-)

Index: linux-2.6.12-rc1/fs/nfs/write.c
===================================================================
--- linux-2.6.12-rc1.orig/fs/nfs/write.c
+++ linux-2.6.12-rc1/fs/nfs/write.c
@@ -539,12 +539,15 @@ static int
nfs_scan_dirty(struct inode *inode, struct list_head *dst, unsigned long idx_start, unsigned int npages)
{
struct nfs_inode *nfsi = NFS_I(inode);
- int res;
- res = nfs_scan_lock_dirty(nfsi, dst, idx_start, npages);
- nfsi->ndirty -= res;
- sub_page_state(nr_dirty,res);
- if ((nfsi->ndirty == 0) != list_empty(&nfsi->dirty))
- printk(KERN_ERR "NFS: desynchronized value of nfs_i.ndirty.\n");
+ int res = 0;
+
+ if (nfsi->ndirty != 0) {
+ res = nfs_scan_lock_dirty(nfsi, dst, idx_start, npages);
+ nfsi->ndirty -= res;
+ sub_page_state(nr_dirty,res);
+ if ((nfsi->ndirty == 0) != list_empty(&nfsi->dirty))
+ printk(KERN_ERR "NFS: desynchronized value of nfs_i.ndirty.\n");
+ }
return res;
}

@@ -563,11 +566,14 @@ static int
nfs_scan_commit(struct inode *inode, struct list_head *dst, unsigned long idx_start, unsigned int npages)
{
struct nfs_inode *nfsi = NFS_I(inode);
- int res;
- res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
- nfsi->ncommit -= res;
- if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
- printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
+ int res;
+
+ if (nfsi->ncommit != 0) {
+ res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
+ nfsi->ncommit -= res;
+ if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
+ printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
+ }
return res;
}
#endif


Attachments:
linux-2.6.12-38.5-nfspage_tag_dirty.dif (1.68 kB)

2005-03-31 12:34:59

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS client latencies

to den 31.03.2005 Klokka 06:58 (-0500) skreiv Trond Myklebust:
>
> @@ -563,11 +566,14 @@ static int
> nfs_scan_commit(struct inode *inode, struct list_head *dst, unsigned long idx_start, unsigned int npages)
> {
> struct nfs_inode *nfsi = NFS_I(inode);
> - int res;
> - res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
> - nfsi->ncommit -= res;
> - if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
> - printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
> + int res;
^^^^^^^ Should be "int res = 0;"
> +
> + if (nfsi->ncommit != 0) {
> + res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
> + nfsi->ncommit -= res;
> + if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
> + printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
> + }
> return res;
> }
> #endif

Cheers,
Trond
--
Trond Myklebust <[email protected]>

2005-03-31 14:05:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: NFS client latencies


* Trond Myklebust <[email protected]> wrote:

> > + int res;
> ^^^^^^^ Should be "int res = 0;"

your patch works fine here - but there are still latencies in
nfs_scan_commit()/nfs_scan_list(): see the attached 3.7 msec latency
trace. It happened during a simple big-file writeout and is easily
reproducible. Could the nfsi->commit list searching be replaced with a
radix based approach too?

Ingo


Attachments:
(No filename) (416.00 B)
nfs-3769us-latency.bz2 (26.43 kB)
Download all attachments

2005-03-31 14:32:32

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS client latencies

to den 31.03.2005 Klokka 15:58 (+0200) skreiv Ingo Molnar:
> * Trond Myklebust <[email protected]> wrote:
>
> > > + int res;
> > ^^^^^^^ Should be "int res = 0;"
>
> your patch works fine here - but there are still latencies in
> nfs_scan_commit()/nfs_scan_list(): see the attached 3.7 msec latency
> trace. It happened during a simple big-file writeout and is easily
> reproducible. Could the nfsi->commit list searching be replaced with a
> radix based approach too?

That would be 100% pure overhead. The nfsi->commit list does not need to
be sorted and with these patches applied, it no longer is. In fact one
of the cleanups I still need to do is to get rid of those redundant
checks on wb->index (start is now always set to 0, and end is always
~0UL).

So the overhead you are currently seeing should just be that of
iterating through the list, locking said requests and adding them to our
private list.

Cheers,
Trond

--
Trond Myklebust <[email protected]>

2005-03-31 14:39:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: NFS client latencies


* Trond Myklebust <[email protected]> wrote:

> > your patch works fine here - but there are still latencies in
> > nfs_scan_commit()/nfs_scan_list(): see the attached 3.7 msec latency
> > trace. It happened during a simple big-file writeout and is easily
> > reproducible. Could the nfsi->commit list searching be replaced with a
> > radix based approach too?
>
> That would be 100% pure overhead. The nfsi->commit list does not need
> to be sorted and with these patches applied, it no longer is. In fact
> one of the cleanups I still need to do is to get rid of those
> redundant checks on wb->index (start is now always set to 0, and end
> is always ~0UL).
>
> So the overhead you are currently seeing should just be that of
> iterating through the list, locking said requests and adding them to
> our private list.

ah - cool! This was a 100 MB writeout so having 3.7 msecs to process
20K+ pages is not unreasonable. To break the latency, can i just do a
simple lock-break, via the patch below?

Ingo

--- linux/fs/nfs/pagelist.c.orig
+++ linux/fs/nfs/pagelist.c
@@ -311,8 +311,9 @@ out:
* You must be holding the inode's req_lock when calling this function
*/
int
-nfs_scan_list(struct list_head *head, struct list_head *dst,
- unsigned long idx_start, unsigned int npages)
+nfs_scan_list(struct nfs_inode *nfsi, struct list_head *head,
+ struct list_head *dst, unsigned long idx_start,
+ unsigned int npages)
{
struct list_head *pos, *tmp;
struct nfs_page *req;
@@ -327,6 +328,8 @@ nfs_scan_list(struct list_head *head, st

list_for_each_safe(pos, tmp, head) {

+ cond_resched_lock(&nfsi->req_lock);
+
req = nfs_list_entry(pos);

if (req->wb_index < idx_start)
--- linux/fs/nfs/write.c.orig
+++ linux/fs/nfs/write.c
@@ -569,7 +569,7 @@ nfs_scan_commit(struct inode *inode, str
int res = 0;

if (nfsi->ncommit != 0) {
- res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
+ res = nfs_scan_list(nfsi, &nfsi->commit, dst, idx_start, npages);
nfsi->ncommit -= res;
if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
--- linux/include/linux/nfs_page.h.orig
+++ linux/include/linux/nfs_page.h
@@ -63,8 +63,8 @@ extern void nfs_release_request(struct n

extern int nfs_scan_lock_dirty(struct nfs_inode *nfsi, struct list_head *dst,
unsigned long idx_start, unsigned int npages);
-extern int nfs_scan_list(struct list_head *, struct list_head *,
- unsigned long, unsigned int);
+extern int nfs_scan_list(struct nfs_inode *nfsi, struct list_head *,
+ struct list_head *, unsigned long, unsigned int);
extern int nfs_coalesce_requests(struct list_head *, struct list_head *,
unsigned int);
extern int nfs_wait_on_request(struct nfs_page *);

2005-03-31 14:51:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: NFS client latencies


* Ingo Molnar <[email protected]> wrote:

> > So the overhead you are currently seeing should just be that of
> > iterating through the list, locking said requests and adding them to
> > our private list.
>
> ah - cool! This was a 100 MB writeout so having 3.7 msecs to process
> 20K+ pages is not unreasonable. To break the latency, can i just do a
> simple lock-break, via the patch below?

with this patch the worst-case latency during NFS writeout is down to 40
usecs (!).

Lee: i've uploaded the -42-05 release with this patch included - could
you test it on your (no doubt more complex than mine) NFS setup?

Ingo

2005-03-31 14:55:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: NFS client latencies


* Ingo Molnar <[email protected]> wrote:

> [...] To break the latency, can i just do a simple lock-break, via the
> patch below?

i think it's incorrect in its current form, because 'pos' is not
necessarily safe and might be removed from the commit_list?

the whole loop could be a "while (!list_empty(head))" thing, if it wasnt
for the possibility for an nfs_set_page_writeback_locked() to skip an
entry. Maybe a local list of 'already processed' requests should be
built, and once the main list is emptied, moved back into the main list?
Then the lock-break could be moved to the end of the loop.

Ingo

2005-03-31 14:55:28

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS client latencies

to den 31.03.2005 Klokka 16:39 (+0200) skreiv Ingo Molnar:
> * Trond Myklebust <[email protected]> wrote:
>
> > > your patch works fine here - but there are still latencies in
> > > nfs_scan_commit()/nfs_scan_list(): see the attached 3.7 msec latency
> > > trace. It happened during a simple big-file writeout and is easily
> > > reproducible. Could the nfsi->commit list searching be replaced with a
> > > radix based approach too?
> >
> > That would be 100% pure overhead. The nfsi->commit list does not need
> > to be sorted and with these patches applied, it no longer is. In fact
> > one of the cleanups I still need to do is to get rid of those
> > redundant checks on wb->index (start is now always set to 0, and end
> > is always ~0UL).
> >
> > So the overhead you are currently seeing should just be that of
> > iterating through the list, locking said requests and adding them to
> > our private list.
>
> ah - cool! This was a 100 MB writeout so having 3.7 msecs to process
> 20K+ pages is not unreasonable. To break the latency, can i just do a
> simple lock-break, via the patch below?

Errm... That looks a bit unsafe. What is there then to stop another
process from removing "pos" while you are scheduling?
It seems to me that you should really start afresh in that case.

The good news, though, is that because requests on the "commit" list do
not remain locked for long without being removed from the list, you
should rarely have to skip them. IOW restarting from the head of the
list is pretty much the same as starting from where you left off.

Cheers,
Trond

> Ingo
>
> --- linux/fs/nfs/pagelist.c.orig
> +++ linux/fs/nfs/pagelist.c
> @@ -311,8 +311,9 @@ out:
> * You must be holding the inode's req_lock when calling this function
> */
> int
> -nfs_scan_list(struct list_head *head, struct list_head *dst,
> - unsigned long idx_start, unsigned int npages)
> +nfs_scan_list(struct nfs_inode *nfsi, struct list_head *head,
> + struct list_head *dst, unsigned long idx_start,
> + unsigned int npages)
> {
> struct list_head *pos, *tmp;
> struct nfs_page *req;
> @@ -327,6 +328,8 @@ nfs_scan_list(struct list_head *head, st
>
> list_for_each_safe(pos, tmp, head) {
>
> + cond_resched_lock(&nfsi->req_lock);
> +
> req = nfs_list_entry(pos);
>
> if (req->wb_index < idx_start)
> --- linux/fs/nfs/write.c.orig
> +++ linux/fs/nfs/write.c
> @@ -569,7 +569,7 @@ nfs_scan_commit(struct inode *inode, str
> int res = 0;
>
> if (nfsi->ncommit != 0) {
> - res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
> + res = nfs_scan_list(nfsi, &nfsi->commit, dst, idx_start, npages);
> nfsi->ncommit -= res;
> if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
> printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
> --- linux/include/linux/nfs_page.h.orig
> +++ linux/include/linux/nfs_page.h
> @@ -63,8 +63,8 @@ extern void nfs_release_request(struct n
>
> extern int nfs_scan_lock_dirty(struct nfs_inode *nfsi, struct list_head *dst,
> unsigned long idx_start, unsigned int npages);
> -extern int nfs_scan_list(struct list_head *, struct list_head *,
> - unsigned long, unsigned int);
> +extern int nfs_scan_list(struct nfs_inode *nfsi, struct list_head *,
> + struct list_head *, unsigned long, unsigned int);
> extern int nfs_coalesce_requests(struct list_head *, struct list_head *,
> unsigned int);
> extern int nfs_wait_on_request(struct nfs_page *);
--
Trond Myklebust <[email protected]>

2005-03-31 14:59:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: NFS client latencies


* Trond Myklebust <[email protected]> wrote:

> Errm... That looks a bit unsafe. What is there then to stop another
> process from removing "pos" while you are scheduling? It seems to me
> that you should really start afresh in that case.

yeah.

> The good news, though, is that because requests on the "commit" list
> do not remain locked for long without being removed from the list, you
> should rarely have to skip them. IOW restarting from the head of the
> list is pretty much the same as starting from where you left off.

as we've learned it through painful experience on the ext3 side,
restarting scans where 'skipping' has to be done is unhealthy.

would it be safe to collect locked entries into a separate, local list,
so that the restart would only see newly added entries? Then once the
moving of all entries has been done, all the locked entries could be
added back to the commit_list via one list_add. (can anything happen to
those locked entries that would break this method?)

Ingo

2005-03-31 15:00:57

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS client latencies

to den 31.03.2005 Klokka 16:54 (+0200) skreiv Ingo Molnar:

> i think it's incorrect in its current form, because 'pos' is not
> necessarily safe and might be removed from the commit_list?

Right.

> the whole loop could be a "while (!list_empty(head))" thing, if it wasnt
> for the possibility for an nfs_set_page_writeback_locked() to skip an
> entry. Maybe a local list of 'already processed' requests should be
> built, and once the main list is emptied, moved back into the main list?
> Then the lock-break could be moved to the end of the loop.

Should be pretty much unnecessary. See my previous email.

Cheers,
Trond
--
Trond Myklebust <[email protected]>

2005-03-31 15:11:27

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS client latencies

to den 31.03.2005 Klokka 16:58 (+0200) skreiv Ingo Molnar:


> would it be safe to collect locked entries into a separate, local list,
> so that the restart would only see newly added entries? Then once the
> moving of all entries has been done, all the locked entries could be
> added back to the commit_list via one list_add. (can anything happen to
> those locked entries that would break this method?)

You are not allowed to remove an entry from the list if it is locked by
someone else. Locking grants temporary ownership of the entry.

Cheers,
Trond
--
Trond Myklebust <[email protected]>

2005-03-31 15:13:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: NFS client latencies


* Trond Myklebust <[email protected]> wrote:

> > would it be safe to collect locked entries into a separate, local list,
> > so that the restart would only see newly added entries? Then once the
> > moving of all entries has been done, all the locked entries could be
> > added back to the commit_list via one list_add. (can anything happen to
> > those locked entries that would break this method?)
>
> You are not allowed to remove an entry from the list if it is locked
> by someone else. Locking grants temporary ownership of the entry.

well, removing a neighboring entry will change the locked entry's link
fields (e.g. req->wb_list.prev), so this ownership cannot involve this
particular list, can it?

Ingo

2005-03-31 15:13:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: NFS client latencies


* Ingo Molnar <[email protected]> wrote:

> > The good news, though, is that because requests on the "commit" list
> > do not remain locked for long without being removed from the list, you
> > should rarely have to skip them. IOW restarting from the head of the
> > list is pretty much the same as starting from where you left off.
>
> as we've learned it through painful experience on the ext3 side,
> restarting scans where 'skipping' has to be done is unhealthy.
>
> would it be safe to collect locked entries into a separate, local
> list, so that the restart would only see newly added entries? Then
> once the moving of all entries has been done, all the locked entries
> could be added back to the commit_list via one list_add. (can anything
> happen to those locked entries that would break this method?)

i.e. like the patch below (compiles and is lightly tested). It's a
pretty straightforward technique when dealing with latencies, and lets
us sleep for sure. We dont have to know whether massive amounts of
locked entries ever queue up or not.

(thanks go to Arjan for noticing that it has to be list_splice() not
list_add() :-| )

Ingo

--- linux/fs/nfs/pagelist.c.orig3
+++ linux/fs/nfs/pagelist.c
@@ -311,10 +311,12 @@ out:
* You must be holding the inode's req_lock when calling this function
*/
int
-nfs_scan_list(struct list_head *head, struct list_head *dst,
- unsigned long idx_start, unsigned int npages)
+nfs_scan_list(struct nfs_inode *nfsi, struct list_head *head,
+ struct list_head *dst, unsigned long idx_start,
+ unsigned int npages)
{
- struct list_head *pos, *tmp;
+ LIST_HEAD(locked);
+ struct list_head *pos;
struct nfs_page *req;
unsigned long idx_end;
int res;
@@ -325,21 +327,22 @@ nfs_scan_list(struct list_head *head, st
else
idx_end = idx_start + npages - 1;

- list_for_each_safe(pos, tmp, head) {
+ while (!list_empty(head)) {

+ pos = head->next;
req = nfs_list_entry(pos);

- if (req->wb_index < idx_start)
- continue;
- if (req->wb_index > idx_end)
- break;
-
- if (!nfs_set_page_writeback_locked(req))
- continue;
- nfs_list_remove_request(req);
- nfs_list_add_request(req, dst);
- res++;
+ if (!nfs_set_page_writeback_locked(req)) {
+ list_del(pos);
+ list_add(&req->wb_list, &locked);
+ } else {
+ nfs_list_remove_request(req);
+ nfs_list_add_request(req, dst);
+ res++;
+ }
+ cond_resched_lock(&nfsi->req_lock);
}
+ list_splice(&locked, head);
return res;
}

--- linux/fs/nfs/write.c.orig3
+++ linux/fs/nfs/write.c
@@ -569,7 +569,7 @@ nfs_scan_commit(struct inode *inode, str
int res = 0;

if (nfsi->ncommit != 0) {
- res = nfs_scan_list(&nfsi->commit, dst, idx_start, npages);
+ res = nfs_scan_list(nfsi, &nfsi->commit, dst, idx_start, npages);
nfsi->ncommit -= res;
if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
--- linux/include/linux/nfs_page.h.orig3
+++ linux/include/linux/nfs_page.h
@@ -63,8 +63,8 @@ extern void nfs_release_request(struct n

extern int nfs_scan_lock_dirty(struct nfs_inode *nfsi, struct list_head *dst,
unsigned long idx_start, unsigned int npages);
-extern int nfs_scan_list(struct list_head *, struct list_head *,
- unsigned long, unsigned int);
+extern int nfs_scan_list(struct nfs_inode *nfsi, struct list_head *,
+ struct list_head *, unsigned long, unsigned int);
extern int nfs_coalesce_requests(struct list_head *, struct list_head *,
unsigned int);
extern int nfs_wait_on_request(struct nfs_page *);

2005-03-31 16:01:25

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS client latencies

to den 31.03.2005 Klokka 17:10 (+0200) skreiv Ingo Molnar:
> * Trond Myklebust <[email protected]> wrote:
>
> > > would it be safe to collect locked entries into a separate, local list,
> > > so that the restart would only see newly added entries? Then once the
> > > moving of all entries has been done, all the locked entries could be
> > > added back to the commit_list via one list_add. (can anything happen to
> > > those locked entries that would break this method?)
> >
> > You are not allowed to remove an entry from the list if it is locked
> > by someone else. Locking grants temporary ownership of the entry.
>
> well, removing a neighboring entry will change the locked entry's link
> fields (e.g. req->wb_list.prev), so this ownership cannot involve this
> particular list, can it?

The point is that you are taking something that was previously globally
visible and owned by somebody else and making it globally invisible by
placing it on a private list.
I'm not sure whether or not that is going to cause bugs in the current
code (it may actually be safe with the current code), but as far as
clean ownership semantics go, it sucks.

Cheers,
Trond
--
Trond Myklebust <[email protected]>

2005-04-01 02:28:39

by Lee Revell

[permalink] [raw]
Subject: Re: NFS client latencies

On Thu, 2005-03-31 at 16:50 +0200, Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
> > > So the overhead you are currently seeing should just be that of
> > > iterating through the list, locking said requests and adding them to
> > > our private list.
> >
> > ah - cool! This was a 100 MB writeout so having 3.7 msecs to process
> > 20K+ pages is not unreasonable. To break the latency, can i just do a
> > simple lock-break, via the patch below?
>
> with this patch the worst-case latency during NFS writeout is down to 40
> usecs (!).
>
> Lee: i've uploaded the -42-05 release with this patch included - could
> you test it on your (no doubt more complex than mine) NFS setup?

This fixes all the NFS related latency problems I was seeing. Now the
longest latency from an NFS kernel compile with "make -j64" is 391 usecs
in get_swap_page.

Lee

2005-04-01 04:31:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: NFS client latencies


* Lee Revell <[email protected]> wrote:

> > > ah - cool! This was a 100 MB writeout so having 3.7 msecs to process
> > > 20K+ pages is not unreasonable. To break the latency, can i just do a
> > > simple lock-break, via the patch below?
> >
> > with this patch the worst-case latency during NFS writeout is down to 40
> > usecs (!).
> >
> > Lee: i've uploaded the -42-05 release with this patch included - could
> > you test it on your (no doubt more complex than mine) NFS setup?
>
> This fixes all the NFS related latency problems I was seeing. Now the
> longest latency from an NFS kernel compile with "make -j64" is 391
> usecs in get_swap_page.

great! The latest patches (-42-08 and later) have the reworked
nfs_scan_list() lock-breaker, which should perform similarly.

i bet these NFS patches also improve generic NFS performance on fast
networks. I've attached the full patchset with all fixes and
improvements included - might be worth a try in -mm?

Ingo


Attachments:
(No filename) (985.00 B)
nfs-latency.patch (15.18 kB)
Download all attachments

2005-04-01 16:19:06

by Orion Poplawski

[permalink] [raw]
Subject: Re: NFS client latencies

Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>>This fixes all the NFS related latency problems I was seeing. Now the
>>longest latency from an NFS kernel compile with "make -j64" is 391
>>usecs in get_swap_page.
>
>
> great! The latest patches (-42-08 and later) have the reworked
> nfs_scan_list() lock-breaker, which should perform similarly.
>
> i bet these NFS patches also improve generic NFS performance on fast
> networks. I've attached the full patchset with all fixes and
> improvements included - might be worth a try in -mm?
>

Just a question - would these changes be expected to improve NFS client
*read* access at all, or just write?

Thanks!

2005-04-01 16:34:11

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS client latencies

fr den 01.04.2005 Klokka 09:16 (-0700) skreiv Orion Poplawski:
> Just a question - would these changes be expected to improve NFS client
> *read* access at all, or just write?

Just write.

Cheers,
Trond

--
Trond Myklebust <[email protected]>

2005-04-01 21:21:14

by Lee Revell

[permalink] [raw]
Subject: Re: NFS client latencies

On Fri, 2005-04-01 at 06:30 +0200, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > > > ah - cool! This was a 100 MB writeout so having 3.7 msecs to process
> > > > 20K+ pages is not unreasonable. To break the latency, can i just do a
> > > > simple lock-break, via the patch below?
> > >
> > > with this patch the worst-case latency during NFS writeout is down to 40
> > > usecs (!).
> > >
> > > Lee: i've uploaded the -42-05 release with this patch included - could
> > > you test it on your (no doubt more complex than mine) NFS setup?
> >
> > This fixes all the NFS related latency problems I was seeing. Now the
> > longest latency from an NFS kernel compile with "make -j64" is 391
> > usecs in get_swap_page.
>
> great! The latest patches (-42-08 and later) have the reworked
> nfs_scan_list() lock-breaker, which should perform similarly.
>
> i bet these NFS patches also improve generic NFS performance on fast
> networks. I've attached the full patchset with all fixes and
> improvements included - might be worth a try in -mm?

With tracing disabled on the C3 (which is the NFS server, don't ask),
the maximum latency during the same kernel compile is about 2ms. So
tracing overhead probably doubled or tripled the latencies.

I'll try again with tracing enabled to determine whether these code
paths are related to the NFS server or not. It's either going to be
that or the get_swap_page stuff. But the client side is OK now.

Lee