NFSv4.1+ differs from earlier versions in that it always performs
trunking discovery that results in mounts to the same server sharing a
TCP connection.
It turns out this results in performance regressions for some users;
apparently the workload on one mount interferes with performance of
another mount, and they were previously able to work around the problem
by using different server IP addresses for the different mounts.
Am I overlooking some hack that would reenable the previous behavior?
Or would people be averse to an "-o noshareconn" option?
--b.
> On Oct 6, 2020, at 11:13 AM, [email protected] wrote:
>
> NFSv4.1+ differs from earlier versions in that it always performs
> trunking discovery that results in mounts to the same server sharing a
> TCP connection.
>
> It turns out this results in performance regressions for some users;
> apparently the workload on one mount interferes with performance of
> another mount, and they were previously able to work around the problem
> by using different server IP addresses for the different mounts.
>
> Am I overlooking some hack that would reenable the previous behavior?
> Or would people be averse to an "-o noshareconn" option?
I thought this was what the nconnect mount option was for.
--
Chuck Lever
On Tue, Oct 06, 2020 at 11:20:41AM -0400, Chuck Lever wrote:
>
>
> > On Oct 6, 2020, at 11:13 AM, [email protected] wrote:
> >
> > NFSv4.1+ differs from earlier versions in that it always performs
> > trunking discovery that results in mounts to the same server sharing a
> > TCP connection.
> >
> > It turns out this results in performance regressions for some users;
> > apparently the workload on one mount interferes with performance of
> > another mount, and they were previously able to work around the problem
> > by using different server IP addresses for the different mounts.
> >
> > Am I overlooking some hack that would reenable the previous behavior?
> > Or would people be averse to an "-o noshareconn" option?
>
> I thought this was what the nconnect mount option was for.
I've suggested that. It doesn't isolate the two mounts from each other
in the same way, but I can imagine it might make it less likely that a
user on one mount will block a user on another? I don't know, it might
depend on the details of their workload and a certain amount of luck.
--b.
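For reference, nconnect=N is a client-side mount option in recent kernels
that opens several TCP connections to the same server and spreads RPC
traffic across them. A minimal sketch, with placeholder server address and
export paths:

    mount -t nfs -o vers=4.1,nconnect=4 198.51.100.10:/data1 /mnt/data1
    mount -t nfs -o vers=4.1,nconnect=4 198.51.100.10:/data2 /mnt/data2

Both mounts still share one client ID and one session; nconnect adds
sockets under that shared state rather than isolating the mounts from each
other, which is the distinction being drawn above.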
On 10/6/2020 11:22 AM, Bruce Fields wrote:
> On Tue, Oct 06, 2020 at 11:20:41AM -0400, Chuck Lever wrote:
>>
>>
>>> On Oct 6, 2020, at 11:13 AM, [email protected] wrote:
>>>
>>> NFSv4.1+ differs from earlier versions in that it always performs
>>> trunking discovery that results in mounts to the same server sharing a
>>> TCP connection.
>>>
>>> It turns out this results in performance regressions for some users;
>>> apparently the workload on one mount interferes with performance of
>>> another mount, and they were previously able to work around the problem
>>> by using different server IP addresses for the different mounts.
>>>
>>> Am I overlooking some hack that would reenable the previous behavior?
>>> Or would people be averse to an "-o noshareconn" option?
>>
>> I thought this was what the nconnect mount option was for.
>
> I've suggested that. It doesn't isolate the two mounts from each other
> in the same way, but I can imagine it might make it less likely that a
> user on one mount will block a user on another? I don't know, it might
> depend on the details of their workload and a certain amount of luck.
Wouldn't it be better to fully understand the reason for the
performance difference, before changing the mount API? If it's
a guess, it'll come back to haunt the code for years.
For example, maybe it's lock contention in the xprt transport code,
or in the socket stack.
Just askin'.
Tom.
On Tue, Oct 06, 2020 at 01:07:11PM -0400, Tom Talpey wrote:
> On 10/6/2020 11:22 AM, Bruce Fields wrote:
> >On Tue, Oct 06, 2020 at 11:20:41AM -0400, Chuck Lever wrote:
> >>
> >>
> >>>On Oct 6, 2020, at 11:13 AM, [email protected] wrote:
> >>>
> >>>NFSv4.1+ differs from earlier versions in that it always performs
> >>>trunking discovery that results in mounts to the same server sharing a
> >>>TCP connection.
> >>>
> >>>It turns out this results in performance regressions for some users;
> >>>apparently the workload on one mount interferes with performance of
> >>>another mount, and they were previously able to work around the problem
> >>>by using different server IP addresses for the different mounts.
> >>>
> >>>Am I overlooking some hack that would reenable the previous behavior?
> >>>Or would people be averse to an "-o noshareconn" option?
> >>
> >>I thought this was what the nconnect mount option was for.
> >
> >I've suggested that. It doesn't isolate the two mounts from each other
> >in the same way, but I can imagine it might make it less likely that a
> >user on one mount will block a user on another? I don't know, it might
> >depend on the details of their workload and a certain amount of luck.
>
> Wouldn't it be better to fully understand the reason for the
> performance difference, before changing the mount API? If it's
> a guess, it'll come back to haunt the code for years.
>
> For example, maybe it's lock contention in the xprt transport code,
> or in the socket stack.
Yeah, I wonder too, and I don't have the details.
--b.
On 6 Oct 2020, at 11:13, J. Bruce Fields wrote:
> NFSv4.1+ differs from earlier versions in that it always performs
> trunking discovery that results in mounts to the same server sharing a
> TCP connection.
>
> It turns out this results in performance regressions for some users;
> apparently the workload on one mount interferes with performance of
> another mount, and they were previously able to work around the
> problem
> by using different server IP addresses for the different mounts.
>
> Am I overlooking some hack that would reenable the previous behavior?
> Or would people be averse to an "-o noshareconn" option?
I suppose you could just toggle the nfs4_unique_id parameter. This seems
to work:
flock /sys/module/nfs/parameters/nfs4_unique_id bash -c "OLD_ID=\$(cat /sys/module/nfs/parameters/nfs4_unique_id); \
    echo imalittleteapot > /sys/module/nfs/parameters/nfs4_unique_id; \
    mount -ov4,sec=sys 10.0.1.200:/exports /mnt/fedora2; \
    echo \$OLD_ID > /sys/module/nfs/parameters/nfs4_unique_id"
I'm trying to think of a reason why this is a bad idea, and not coming up
with any. Can we support users that have already found this solution?
Ben
On Tue, Oct 6, 2020 at 3:38 PM Benjamin Coddington <[email protected]> wrote:
>
> On 6 Oct 2020, at 11:13, J. Bruce Fields wrote:
>
> > NFSv4.1+ differs from earlier versions in that it always performs
> > trunking discovery that results in mounts to the same server sharing a
> > TCP connection.
> >
> > It turns out this results in performance regressions for some users;
> > apparently the workload on one mount interferes with performance of
> > another mount, and they were previously able to work around the
> > problem
> > by using different server IP addresses for the different mounts.
> >
> > Am I overlooking some hack that would reenable the previous behavior?
> > Or would people be averse to an "-o noshareconn" option?
>
> I suppose you could just toggle the nfs4_unique_id parameter. This
> seems to
> work:
>
> flock /sys/module/nfs/parameters/nfs4_unique_id bash -c "OLD_ID=\$(cat
> /sys/module/nfs/parameters/nfs4_unique_id); echo imalittleteapot >
> /sys/module/nfs/parameters/nfs4_unique_id; mount -ov4,sec=sys
> 10.0.1.200:/exports /mnt/fedora2; echo \$OLD_ID >
> /sys/module/nfs/parameters/nfs4_unique_id"
>
> I'm trying to think of a reason why this is a bad idea, and not coming
> up
> with any. Can we support users that have already found this solution?
>
What about reboot recovery? How will each mount recover its own state
(and present the same identifier it used before)? The client only keeps
track of one?
> Ben
>
On Tue, Oct 06, 2020 at 05:46:11PM -0400, Olga Kornievskaia wrote:
> On Tue, Oct 6, 2020 at 3:38 PM Benjamin Coddington <[email protected]> wrote:
> >
> > On 6 Oct 2020, at 11:13, J. Bruce Fields wrote:
> >
> > > NFSv4.1+ differs from earlier versions in that it always performs
> > > trunking discovery that results in mounts to the same server sharing a
> > > TCP connection.
> > >
> > > It turns out this results in performance regressions for some users;
> > > apparently the workload on one mount interferes with performance of
> > > another mount, and they were previously able to work around the
> > > problem
> > > by using different server IP addresses for the different mounts.
> > >
> > > Am I overlooking some hack that would reenable the previous behavior?
> > > Or would people be averse to an "-o noshareconn" option?
> >
> > I suppose you could just toggle the nfs4_unique_id parameter. This
> > seems to
> > work:
> >
> > flock /sys/module/nfs/parameters/nfs4_unique_id bash -c "OLD_ID=\$(cat
> > /sys/module/nfs/parameters/nfs4_unique_id); echo imalittleteapot >
> > /sys/module/nfs/parameters/nfs4_unique_id; mount -ov4,sec=sys
> > 10.0.1.200:/exports /mnt/fedora2; echo \$OLD_ID >
> > /sys/module/nfs/parameters/nfs4_unique_id"
> >
> > I'm trying to think of a reason why this is a bad idea, and not coming
> > up
> > with any. Can we support users that have already found this solution?
> >
>
> What about reboot recovery? How will each mount recover its own state
> (and present the same identifier it used before). Client only keeps
> track of one?
Looks like nfs4_init_{non}uniform_client_string() stores it in
cl_owner_id, and I was thinking that meant cl_owner_id would be used
from then on....
But actually, I think it may run that again on recovery, yes, so I bet
changing the nfs4_unique_id parameter midway like this could cause bugs
on recovery.
--b.
On 6 Oct 2020, at 20:18, J. Bruce Fields wrote:
> On Tue, Oct 06, 2020 at 05:46:11PM -0400, Olga Kornievskaia wrote:
>> On Tue, Oct 6, 2020 at 3:38 PM Benjamin Coddington
>> <[email protected]> wrote:
>>>
>>> On 6 Oct 2020, at 11:13, J. Bruce Fields wrote:
>>>
>>>> NFSv4.1+ differs from earlier versions in that it always performs
>>>> trunking discovery that results in mounts to the same server
>>>> sharing a
>>>> TCP connection.
>>>>
>>>> It turns out this results in performance regressions for some
>>>> users;
>>>> apparently the workload on one mount interferes with performance of
>>>> another mount, and they were previously able to work around the
>>>> problem
>>>> by using different server IP addresses for the different mounts.
>>>>
>>>> Am I overlooking some hack that would reenable the previous
>>>> behavior?
>>>> Or would people be averse to an "-o noshareconn" option?
>>>
>>> I suppose you could just toggle the nfs4_unique_id parameter. This
>>> seems to
>>> work:
>>>
>>> flock /sys/module/nfs/parameters/nfs4_unique_id bash -c
>>> "OLD_ID=\$(cat
>>> /sys/module/nfs/parameters/nfs4_unique_id); echo imalittleteapot >
>>> /sys/module/nfs/parameters/nfs4_unique_id; mount -ov4,sec=sys
>>> 10.0.1.200:/exports /mnt/fedora2; echo \$OLD_ID >
>>> /sys/module/nfs/parameters/nfs4_unique_id"
>>>
>>> I'm trying to think of a reason why this is a bad idea, and not
>>> coming
>>> up
>>> with any. Can we support users that have already found this
>>> solution?
>>>
>>
>> What about reboot recovery? How will each mount recover its own state
>> (and present the same identifier it used before). Client only keeps
>> track of one?
>
> Looks like nfs4_init_{non}uniform_client_string() stores it in
> cl_owner_id, and I was thinking that meant cl_owner_id would be used
> from then on....
>
> But actually, I think it may run that again on recovery, yes, so I bet
> changing the nfs4_unique_id parameter midway like this could cause
> bugs
> on recovery.
Ah, that's what I thought as well. Thanks for looking closer Olga!
I don't see why we couldn't store it for the duration of the mount, and
doing so would fix reboot recovery when the uniquifier is changed after a
mount.
Ben
On 7 Oct 2020, at 7:27, Benjamin Coddington wrote:
> On 6 Oct 2020, at 20:18, J. Bruce Fields wrote:
>
>> On Tue, Oct 06, 2020 at 05:46:11PM -0400, Olga Kornievskaia wrote:
>>> On Tue, Oct 6, 2020 at 3:38 PM Benjamin Coddington
>>> <[email protected]> wrote:
>>>>
>>>> On 6 Oct 2020, at 11:13, J. Bruce Fields wrote:
>> Looks like nfs4_init_{non}uniform_client_string() stores it in
>> cl_owner_id, and I was thinking that meant cl_owner_id would be used
>> from then on....
>>
>> But actually, I think it may run that again on recovery, yes, so I
>> bet
>> changing the nfs4_unique_id parameter midway like this could cause
>> bugs
>> on recovery.
>
> Ah, that's what I thought as well. Thanks for looking closer Olga!
Well, no -- it does indeed continue to use the original cl_owner_id. We
only jump through nfs4_init_uniquifier_client_string() if cl_owner_id is
NULL:
6087 static int
6088 nfs4_init_uniform_client_string(struct nfs_client *clp)
6089 {
6090 size_t len;
6091 char *str;
6092
6093 if (clp->cl_owner_id != NULL)
6094 return 0;
6095
6096 if (nfs4_client_id_uniquifier[0] != '\0')
6097 return nfs4_init_uniquifier_client_string(clp);
6098
Testing proves this out as well for both EXCHANGE_ID and SETCLIENTID.
Is there any precedent for stabilizing module parameters as part of a
supported interface? Maybe this ought to be a mount option, so the client
can set a uniquifier per-mount.
Ben
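As an aside, the existing parameter can also be pinned at module load time
instead of being flipped around each mount call; a sketch, with a
placeholder uniquifier string:

    # /etc/modprobe.d/nfs.conf
    options nfs nfs4_unique_id=my-client-uniquifier

or nfs.nfs4_unique_id=<value> on the kernel command line when nfs is built
in. That still yields a single uniquifier for the whole client, which is
exactly why a per-mount knob is being floated here.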
> On Oct 7, 2020, at 8:55 AM, Benjamin Coddington <[email protected]> wrote:
>
> On 7 Oct 2020, at 7:27, Benjamin Coddington wrote:
>
>> On 6 Oct 2020, at 20:18, J. Bruce Fields wrote:
>>
>>> On Tue, Oct 06, 2020 at 05:46:11PM -0400, Olga Kornievskaia wrote:
>>>> On Tue, Oct 6, 2020 at 3:38 PM Benjamin Coddington <[email protected]> wrote:
>>>>>
>>>>> On 6 Oct 2020, at 11:13, J. Bruce Fields wrote:
>
>>> Looks like nfs4_init_{non}uniform_client_string() stores it in
>>> cl_owner_id, and I was thinking that meant cl_owner_id would be used
>>> from then on....
>>>
>>> But actually, I think it may run that again on recovery, yes, so I bet
>>> changing the nfs4_unique_id parameter midway like this could cause bugs
>>> on recovery.
>>
>> Ah, that's what I thought as well. Thanks for looking closer Olga!
>
> Well, no -- it does indeed continue to use the original cl_owner_id. We
> only jump through nfs4_init_uniquifier_client_string() if cl_owner_id is
> NULL:
>
> 6087 static int
> 6088 nfs4_init_uniform_client_string(struct nfs_client *clp)
> 6089 {
> 6090 size_t len;
> 6091 char *str;
> 6092
> 6093 if (clp->cl_owner_id != NULL)
> 6094 return 0;
> 6095
> 6096 if (nfs4_client_id_uniquifier[0] != '\0')
> 6097 return nfs4_init_uniquifier_client_string(clp);
> 6098
>
>
> Testing proves this out as well for both EXCHANGE_ID and SETCLIENTID.
>
> Is there any precedent for stabilizing module parameters as part of a
> supported interface? Maybe this ought to be a mount option, so client can
> set a uniquifier per-mount.
The protocol is designed as one client-ID per client. FreeBSD is
the only client I know of that uses one client-ID per mount, fwiw.
You are suggesting each mount point would have its own lease. There
would likely be deeper implementation changes needed than just
specifying a unique client-ID for each mount point.
--
Chuck Lever
On 10/6/20 10:13 AM, J. Bruce Fields wrote:
> NFSv4.1+ differs from earlier versions in that it always performs
> trunking discovery that results in mounts to the same server sharing a
> TCP connection.
>
> It turns out this results in performance regressions for some users;
> apparently the workload on one mount interferes with performance of
> another mount, and they were previously able to work around the problem
> by using different server IP addresses for the different mounts.
>
> Am I overlooking some hack that would reenable the previous behavior?
> Or would people be averse to an "-o noshareconn" option?
>
> --b.
>
I don't see how sharing a TCP connection can result in a performance
regression (the performance degradation of *not* sharing a TCP
connection is why HTTP 1.x is being replaced), or how using different IP
addresses on the same interface resolves anything. Does anyone have an
explanation?
On Wed, Oct 07, 2020 at 09:45:50AM -0400, Chuck Lever wrote:
>
>
> > On Oct 7, 2020, at 8:55 AM, Benjamin Coddington <[email protected]> wrote:
> >
> > On 7 Oct 2020, at 7:27, Benjamin Coddington wrote:
> >
> >> On 6 Oct 2020, at 20:18, J. Bruce Fields wrote:
> >>
> >>> On Tue, Oct 06, 2020 at 05:46:11PM -0400, Olga Kornievskaia wrote:
> >>>> On Tue, Oct 6, 2020 at 3:38 PM Benjamin Coddington <[email protected]> wrote:
> >>>>>
> >>>>> On 6 Oct 2020, at 11:13, J. Bruce Fields wrote:
> >
> >>> Looks like nfs4_init_{non}uniform_client_string() stores it in
> >>> cl_owner_id, and I was thinking that meant cl_owner_id would be used
> >>> from then on....
> >>>
> >>> But actually, I think it may run that again on recovery, yes, so I bet
> >>> changing the nfs4_unique_id parameter midway like this could cause bugs
> >>> on recovery.
> >>
> >> Ah, that's what I thought as well. Thanks for looking closer Olga!
> >
> > Well, no -- it does indeed continue to use the original cl_owner_id. We
> > only jump through nfs4_init_uniquifier_client_string() if cl_owner_id is
> > NULL:
> >
> > 6087 static int
> > 6088 nfs4_init_uniform_client_string(struct nfs_client *clp)
> > 6089 {
> > 6090 size_t len;
> > 6091 char *str;
> > 6092
> > 6093 if (clp->cl_owner_id != NULL)
> > 6094 return 0;
> > 6095
> > 6096 if (nfs4_client_id_uniquifier[0] != '\0')
> > 6097 return nfs4_init_uniquifier_client_string(clp);
> > 6098
> >
> >
> > Testing proves this out as well for both EXCHANGE_ID and SETCLIENTID.
> >
> > Is there any precedent for stabilizing module parameters as part of a
> > supported interface? Maybe this ought to be a mount option, so client can
> > set a uniquifier per-mount.
>
> The protocol is designed as one client-ID per client. FreeBSD is
> the only client I know of that uses one client-ID per mount, fwiw.
>
> You are suggesting each mount point would have its own lease. There
> would likely be deeper implementation changes needed than just
> specifying a unique client-ID for each mount point.
Huh, I thought that should do it.
Do you have something specific in mind?
--b.
On 10/6/2020 5:26 PM, Igor Ostrovsky wrote:
>
>
> On Tue, Oct 6, 2020 at 12:30 PM Bruce Fields <[email protected]> wrote:
>
> On Tue, Oct 06, 2020 at 01:07:11PM -0400, Tom Talpey wrote:
> > On 10/6/2020 11:22 AM, Bruce Fields wrote:
> > >On Tue, Oct 06, 2020 at 11:20:41AM -0400, Chuck Lever wrote:
> > >>
> > >>
> > >>>On Oct 6, 2020, at 11:13 AM, [email protected] wrote:
> > >>>
> > >>>NFSv4.1+ differs from earlier versions in that it always performs
> > >>>trunking discovery that results in mounts to the same server
> sharing a
> > >>>TCP connection.
> > >>>
> > >>>It turns out this results in performance regressions for some
> users;
> > >>>apparently the workload on one mount interferes with
> performance of
> > >>>another mount, and they were previously able to work around
> the problem
> > >>>by using different server IP addresses for the different mounts.
> > >>>
> > >>>Am I overlooking some hack that would reenable the previous
> behavior?
> > >>>Or would people be averse to an "-o noshareconn" option?
> > >>
> > >>I thought this was what the nconnect mount option was for.
> > >
> > >I've suggested that. It doesn't isolate the two mounts from
> each other
> > >in the same way, but I can imagine it might make it less likely
> that a
> > >user on one mount will block a user on another? I don't know,
> it might
> > >depend on the details of their workload and a certain amount of
> luck.
> >
> > Wouldn't it be better to fully understand the reason for the
> > performance difference, before changing the mount API? If it's
> > a guess, it'll come back to haunt the code for years.
> >
> > For example, maybe it's lock contention in the xprt transport code,
> > or in the socket stack.
>
> Yeah, I wonder too, and I don't have the details.
>
>
> I've seen cases like this:
>
> dd if=/dev/zero of=/mnt/mount1/zeros &
> ls /mnt/mount2/
>
> If /mnt/mount1 and /mnt/mount2 are NFS v3 mounts to the same server IP,
> the access to /mnt/mount2 can take a long time because the RPCs from "ls
> /mnt/mount2/" get stuck behind a bunch of the writes to /mnt/mount1. If
> /mnt/mount1 and /mnt/mount2 are different IPs to the same server, the
> accesses to /mnt/mount2 aren't impacted by the write workload on
> /mnt/mount1 (unless there is a saturation on the server side, obviously).
This is plausible, and if so, I believe it indicates a credit/slot
shortage.
Does the client request more slots when it begins to share another
mount point on the connection? Does the server grant them, if so?
Tom.
>
> It sounds like with NFS v4.1 trunking discovery, using separate IPs for
> the two mounts is no longer a sufficient workaround.
> Igor
> On Oct 7, 2020, at 10:05 AM, Bruce Fields <[email protected]> wrote:
>
> On Wed, Oct 07, 2020 at 09:45:50AM -0400, Chuck Lever wrote:
>>
>>
>>> On Oct 7, 2020, at 8:55 AM, Benjamin Coddington <[email protected]> wrote:
>>>
>>> On 7 Oct 2020, at 7:27, Benjamin Coddington wrote:
>>>
>>>> On 6 Oct 2020, at 20:18, J. Bruce Fields wrote:
>>>>
>>>>> On Tue, Oct 06, 2020 at 05:46:11PM -0400, Olga Kornievskaia wrote:
>>>>>> On Tue, Oct 6, 2020 at 3:38 PM Benjamin Coddington <[email protected]> wrote:
>>>>>>>
>>>>>>> On 6 Oct 2020, at 11:13, J. Bruce Fields wrote:
>>>
>>>>> Looks like nfs4_init_{non}uniform_client_string() stores it in
>>>>> cl_owner_id, and I was thinking that meant cl_owner_id would be used
>>>>> from then on....
>>>>>
>>>>> But actually, I think it may run that again on recovery, yes, so I bet
>>>>> changing the nfs4_unique_id parameter midway like this could cause bugs
>>>>> on recovery.
>>>>
>>>> Ah, that's what I thought as well. Thanks for looking closer Olga!
>>>
>>> Well, no -- it does indeed continue to use the original cl_owner_id. We
>>> only jump through nfs4_init_uniquifier_client_string() if cl_owner_id is
>>> NULL:
>>>
>>> 6087 static int
>>> 6088 nfs4_init_uniform_client_string(struct nfs_client *clp)
>>> 6089 {
>>> 6090 size_t len;
>>> 6091 char *str;
>>> 6092
>>> 6093 if (clp->cl_owner_id != NULL)
>>> 6094 return 0;
>>> 6095
>>> 6096 if (nfs4_client_id_uniquifier[0] != '\0')
>>> 6097 return nfs4_init_uniquifier_client_string(clp);
>>> 6098
>>>
>>>
>>> Testing proves this out as well for both EXCHANGE_ID and SETCLIENTID.
>>>
>>> Is there any precedent for stabilizing module parameters as part of a
>>> supported interface? Maybe this ought to be a mount option, so client can
>>> set a uniquifier per-mount.
>>
>> The protocol is designed as one client-ID per client. FreeBSD is
>> the only client I know of that uses one client-ID per mount, fwiw.
>>
>> You are suggesting each mount point would have its own lease. There
>> would likely be deeper implementation changes needed than just
>> specifying a unique client-ID for each mount point.
>
> Huh, I thought that should do it.
>
> Do you have something specific in mind?
The relationship between nfs_client and nfs_server structs comes to
mind.
Trunking discovery has been around for several years. This is the
first report I've heard of a performance regression.
We do know that nconnect helps relieve head-of-line blocking on TCP.
I think adding a second socket would be a very easy thing to try and
wouldn't have any NFSv4 state recovery ramifications.
--
Chuck Lever
On Wed, Oct 07, 2020 at 10:15:39AM -0400, Chuck Lever wrote:
>
>
> > On Oct 7, 2020, at 10:05 AM, Bruce Fields <[email protected]> wrote:
> >
> > On Wed, Oct 07, 2020 at 09:45:50AM -0400, Chuck Lever wrote:
> >>
> >>
> >>> On Oct 7, 2020, at 8:55 AM, Benjamin Coddington <[email protected]> wrote:
> >>>
> >>> On 7 Oct 2020, at 7:27, Benjamin Coddington wrote:
> >>>
> >>>> On 6 Oct 2020, at 20:18, J. Bruce Fields wrote:
> >>>>
> >>>>> On Tue, Oct 06, 2020 at 05:46:11PM -0400, Olga Kornievskaia wrote:
> >>>>>> On Tue, Oct 6, 2020 at 3:38 PM Benjamin Coddington <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On 6 Oct 2020, at 11:13, J. Bruce Fields wrote:
> >>>
> >>>>> Looks like nfs4_init_{non}uniform_client_string() stores it in
> >>>>> cl_owner_id, and I was thinking that meant cl_owner_id would be used
> >>>>> from then on....
> >>>>>
> >>>>> But actually, I think it may run that again on recovery, yes, so I bet
> >>>>> changing the nfs4_unique_id parameter midway like this could cause bugs
> >>>>> on recovery.
> >>>>
> >>>> Ah, that's what I thought as well. Thanks for looking closer Olga!
> >>>
> >>> Well, no -- it does indeed continue to use the original cl_owner_id. We
> >>> only jump through nfs4_init_uniquifier_client_string() if cl_owner_id is
> >>> NULL:
> >>>
> >>> 6087 static int
> >>> 6088 nfs4_init_uniform_client_string(struct nfs_client *clp)
> >>> 6089 {
> >>> 6090 size_t len;
> >>> 6091 char *str;
> >>> 6092
> >>> 6093 if (clp->cl_owner_id != NULL)
> >>> 6094 return 0;
> >>> 6095
> >>> 6096 if (nfs4_client_id_uniquifier[0] != '\0')
> >>> 6097 return nfs4_init_uniquifier_client_string(clp);
> >>> 6098
> >>>
> >>>
> >>> Testing proves this out as well for both EXCHANGE_ID and SETCLIENTID.
> >>>
> >>> Is there any precedent for stabilizing module parameters as part of a
> >>> supported interface? Maybe this ought to be a mount option, so client can
> >>> set a uniquifier per-mount.
> >>
> >> The protocol is designed as one client-ID per client. FreeBSD is
> >> the only client I know of that uses one client-ID per mount, fwiw.
> >>
> >> You are suggesting each mount point would have its own lease. There
> >> would likely be deeper implementation changes needed than just
> >> specifying a unique client-ID for each mount point.
> >
> > Huh, I thought that should do it.
> >
> > Do you have something specific in mind?
>
> The relationship between nfs_client and nfs_server structs comes to
> mind.
I'm not following. Do you have a specific problem in mind?
--b.
>
> Trunking discovery has been around for several years. This is the
> first report I've heard of a performance regression.
>
> We do know that nconnect helps relieve head-of-line blocking on TCP.
> I think adding a second socket would be a very easy thing to try and
> wouldn't have any NFSv4 state recovery ramifications.
>
>
> --
> Chuck Lever
>
>
On Wed, Oct 7, 2020 at 6:57 AM Patrick Goetz <[email protected]> wrote:
> I don't see how sharing a TCP connection can result in a performance
> regression (the performance degradation of *not* sharing a TCP
> connection is why HTTP 1.x is being replaced), or how using different IP
> addresses on the same interface resolves anything. Does anyone have an
> explanation?
>
The two IPs give you a form of QoS. So, it's about performance isolation
across the mounts, not about improving the aggregate performance.
The example I mentioned was this one:
dd if=/dev/zero of=/mnt/mount1/zeros &
ls /mnt/mount2/
The writes to /mnt/mount1 keep the transport busy transmitting data. As
a result, the "ls" GETATTR (or whatever RPC) needs to wait on the
single transport, potentially for seconds. Putting the two mounts on
different IPs solves the problem, at least prior to trunking discovery.
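A rough way to reproduce and quantify that interference, assuming two
exports from the same server are already mounted at /mnt/mount1 and
/mnt/mount2:

    # keep the shared transport busy with large writes
    dd if=/dev/zero of=/mnt/mount1/zeros bs=1M count=10240 &

    # time small, synchronous operations on the other mount
    time ls -l /mnt/mount2/
    time touch /mnt/mount2/probe

Running the same thing against two distinct server IP addresses, or
against NFSv3 mounts, gives the baseline the old workaround provided.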
On 7 Oct 2020, at 9:56, Patrick Goetz wrote:
> On 10/6/20 10:13 AM, J. Bruce Fields wrote:
>> NFSv4.1+ differs from earlier versions in that it always performs
>> trunking discovery that results in mounts to the same server sharing a
>> TCP connection.
>>
>> It turns out this results in performance regressions for some users;
>> apparently the workload on one mount interferes with performance of
>> another mount, and they were previously able to work around the problem
>> by using different server IP addresses for the different mounts.
>>
>> Am I overlooking some hack that would reenable the previous behavior?
>> Or would people be averse to an "-o noshareconn" option?
>>
>> --b.
>>
>
>
> I don't see how sharing a TCP connection can result in a performance
> regression (the performance degradation of *not* sharing a TCP connection
> is why HTTP 1.x is being replaced), or how using different IP addresses on
> the same interface resolves anything. Does anyone have an explanation?
Well, I think the setup in the report we're getting may be using two
different network interfaces on the server side. The user was previously
doing one mount to each IP address on each interface.
Even if you don't have this arrangement, it may still be possible/desirable
to have separate TCP connections if you want to prioritize some NFS
traffic. Multi-CPU systems with modern NICs have a number of different ways
to "steer" the traffic they receive to certain CPUs, which may have a
beneficial or detrimental effect on performance. You can prioritize
wake-ups from the NIC based on throughput or latency, for example.
I don't know for sure which of these specific details are coming into play,
if any, though.
Ben
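Purely as an illustration of the steering Ben mentions, and assuming a NIC
with ntuple filter support: with the old one-address-per-mount
arrangement, return traffic from each server address could be pinned to
its own receive queue. Device name, address, and queue number below are
placeholders:

    # enable ntuple filters on the client NIC
    ethtool -K eth0 ntuple on
    # steer traffic arriving from one server address to RX queue 4
    ethtool -N eth0 flow-type tcp4 src-ip 10.0.1.200 action 4

None of that can distinguish the two mounts once they share a single
connection to a single address.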
On Wed, 2020-10-07 at 12:05 -0400, Bruce Fields wrote:
> On Wed, Oct 07, 2020 at 10:15:39AM -0400, Chuck Lever wrote:
> >
> > > On Oct 7, 2020, at 10:05 AM, Bruce Fields <[email protected]>
> > > wrote:
> > >
> > > On Wed, Oct 07, 2020 at 09:45:50AM -0400, Chuck Lever wrote:
> > > >
> > > > > On Oct 7, 2020, at 8:55 AM, Benjamin Coddington <
> > > > > [email protected]> wrote:
> > > > >
> > > > > On 7 Oct 2020, at 7:27, Benjamin Coddington wrote:
> > > > >
> > > > > > On 6 Oct 2020, at 20:18, J. Bruce Fields wrote:
> > > > > >
> > > > > > > On Tue, Oct 06, 2020 at 05:46:11PM -0400, Olga
> > > > > > > Kornievskaia wrote:
> > > > > > > > On Tue, Oct 6, 2020 at 3:38 PM Benjamin Coddington <
> > > > > > > > [email protected]> wrote:
> > > > > > > > > On 6 Oct 2020, at 11:13, J. Bruce Fields wrote:
> > > > > > > Looks like nfs4_init_{non}uniform_client_string() stores
> > > > > > > it in
> > > > > > > cl_owner_id, and I was thinking that meant cl_owner_id
> > > > > > > would be used
> > > > > > > from then on....
> > > > > > >
> > > > > > > But actually, I think it may run that again on recovery,
> > > > > > > yes, so I bet
> > > > > > > changing the nfs4_unique_id parameter midway like this
> > > > > > > could cause bugs
> > > > > > > on recovery.
> > > > > >
> > > > > > Ah, that's what I thought as well. Thanks for looking
> > > > > > closer Olga!
> > > > >
> > > > > Well, no -- it does indeed continue to use the original
> > > > > cl_owner_id. We
> > > > > only jump through nfs4_init_uniquifier_client_string() if
> > > > > cl_owner_id is
> > > > > NULL:
> > > > >
> > > > > 6087 static int
> > > > > 6088 nfs4_init_uniform_client_string(struct nfs_client *clp)
> > > > > 6089 {
> > > > > 6090 size_t len;
> > > > > 6091 char *str;
> > > > > 6092
> > > > > 6093 if (clp->cl_owner_id != NULL)
> > > > > 6094 return 0;
> > > > > 6095
> > > > > 6096 if (nfs4_client_id_uniquifier[0] != '\0')
> > > > > 6097 return nfs4_init_uniquifier_client_string(clp);
> > > > > 6098
> > > > >
> > > > >
> > > > > Testing proves this out as well for both EXCHANGE_ID and
> > > > > SETCLIENTID.
> > > > >
> > > > > Is there any precedent for stabilizing module parameters as
> > > > > part of a
> > > > > supported interface? Maybe this ought to be a mount option,
> > > > > so client can
> > > > > set a uniquifier per-mount.
> > > >
> > > > The protocol is designed as one client-ID per client. FreeBSD
> > > > is
> > > > the only client I know of that uses one client-ID per mount,
> > > > fwiw.
> > > >
> > > > You are suggesting each mount point would have its own lease.
> > > > There
> > > > would likely be deeper implementation changes needed than just
> > > > specifying a unique client-ID for each mount point.
> > >
> > > Huh, I thought that should do it.
> > >
> > > Do you have something specific in mind?
> >
> > The relationship between nfs_client and nfs_server structs comes to
> > mind.
>
> I'm not following. Do you have a specific problem in mind?
>
The problem is that all locks etc. are tied to the lease, so if you change
the clientid (and hence change the lease) then you need to ensure that
the client knows to which lease the locks belong, that it is able to
respond appropriately to all delegation recalls, layout recalls, ...
etc.
This need to track things on a per-lease basis is why we have the
struct nfs_client. Things that are tracked on a per-superblock basis
are tracked by the struct nfs_server.
However all this is moot as long as nobody can explain why we'd want to
do all this.
As far as I can tell, this thread started with a complaint that
performance suffers when we don't allow setups that hack the client by
pretending that a multi-homed server is actually multiple different
servers.
AFAICS Tom Talpey's question is the relevant one. Why is there a
performance regression being seen by these setups when they share the
same connection? Is it really the connection, or is it the fact that
they all share the same fixed-slot session?
I did see Igor's claim that there is a QoS issue (which afaics would
also affect NFSv3), but why do I care about QoS as a per-mountpoint
feature?
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
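One source of data on that question is the client's own per-mount RPC
accounting; a sketch, with a placeholder mount point:

    # per-op counts, queue time and RTT, plus the xprt: transport line,
    # which carries a cumulative backlog figure
    grep -A40 'mounted on /mnt/data1' /proc/self/mountstats

or the mountstats(8)/nfsiostat(8) tools from nfs-utils. If per-op queue
time dominates while RTT stays modest, the small requests are waiting on
the client side, in the slot table or transport, rather than on the
server.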
On Wed, Oct 07, 2020 at 12:44:42PM -0400, Trond Myklebust wrote:
> The problem that all locks etc are tied to the lease, so if you change
> the clientid (and hence change the lease) then you need to ensure that
> the client knows to which lease the locks belong, that it is able to
> respond appropriately to all delegation recalls, layout recalls, ...
> etc.
Looks to me like cl_owner_id never actually changes over the lifetime of
a mount even if you change nfs4_unique_id.
> This need to track things on a per-lease basis is why we have the
> struct nfs_client. Things that are tracked on a per-superblock basis
> are tracked by the struct nfs_server.
>
> However all this is moot as long as nobody can explain why we'd want to
> do all this.
>
> As far as I can tell, this thread started with a complaint that
> performance suffers when we don't allow setups that hack the client by
> pretending that a multi-homed server is actually multiple different
> servers.
Yeah, honestly I don't understand the details of that case either.
(There is one related thing I'm curious about, which is how close we are
to keeping clients in different containers completely separate (which
we'd need, for example, if we were to ever permit unprivileged nfs
mounts). It looks to me like as long as two network namespaces use
different client identifiers, the client should keep different state for
them already? Or is there more to do there?)
--b.
On Wed, 2020-10-07 at 13:15 -0400, Bruce Fields wrote:
> On Wed, Oct 07, 2020 at 12:44:42PM -0400, Trond Myklebust wrote:
> > The problem that all locks etc are tied to the lease, so if you
> > change
> > the clientid (and hence change the lease) then you need to ensure
> > that
> > the client knows to which lease the locks belong, that it is able
> > to
> > respond appropriately to all delegation recalls, layout recalls,
> > ...
> > etc.
>
> Looks to me like cl_owner_id never actually changes over the lifetime
> of
> a mount even if you change nfs4_unique_id.
It never changes over the lifetime of the nfs_client. If it did, we'd
be inviting fun scenarios in which we end up conflicting with ourself
over locks etc.
>
> > This need to track things on a per-lease basis is why we have the
> > struct nfs_client. Things that are tracked on a per-superblock
> > basis
> > are tracked by the struct nfs_server.
> >
> > However all this is moot as long as nobody can explain why we'd
> > want to
> > do all this.
> >
> > As far as I can tell, this thread started with a complaint that
> > performance suffers when we don't allow setups that hack the client
> > by
> > pretending that a multi-homed server is actually multiple different
> > servers.
>
> Yeah, honestly I don't understand the details of that case either.
>
> (There is one related thing I'm curious about, which is how close we
> are
> to keeping clients in different containers completely separate (which
> we'd need, for example, if we were to ever permit unprivileged nfs
> mounts). It looks to me like as long as two network namespaces use
> different client identifiers, the client should keep different state
> for
> them already? Or is there more to do there?)
The containerised use case should already work. The containers have
their own private uniquifiers, which can be changed
via /sys/fs/nfs/net/nfs_client/identifier.
In fact, there is also a udev trigger for that pseudofile, so my plan
is (in my copious spare time) to write a /usr/lib/udev/nfs-set-identifier
helper in order to manage the container uniquifier, to allow generation
on the fly and persistence.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
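For reference, setting that per-netns uniquifier is just a sysfs write
from inside the container's network namespace; a sketch, with an arbitrary
uuidgen value:

    # inside the container / network namespace
    uuidgen > /sys/fs/nfs/net/nfs_client/identifier
    cat /sys/fs/nfs/net/nfs_client/identifier

Though note Bruce's follow-up below: at the time of this thread the stored
value was not yet actually folded into the client identifier sent on the
wire.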
On Wed, Oct 07, 2020 at 05:29:26PM +0000, Trond Myklebust wrote:
> On Wed, 2020-10-07 at 13:15 -0400, Bruce Fields wrote:
> > Yeah, honestly I don't understand the details of that case either.
> >
> > (There is one related thing I'm curious about, which is how close we
> > are
> > to keeping clients in different containers completely separate (which
> > we'd need, for example, if we were to ever permit unprivileged nfs
> > mounts). It looks to me like as long as two network namespaces use
> > different client identifiers, the client should keep different state
> > for
> > them already? Or is there more to do there?)
>
> The containerised use case should already work. The containers have
> their own private uniquifiers, which can be changed
> via /sys/fs/nfs/net/nfs_client/identifier.
I was just looking at that commit (bf11fbd20b3 "NFS: Add sysfs support
for per-container identifier"), and I'm confused by it: it adds code to
nfs/sysfs to get and set "identifier", but nothing anywhere that
actually uses the value. I can't figure out what I'm missing.
--b.
> In fact, there is also a udev trigger for that pseudofile, so my plan
> is (in my copious spare time) to write a /usr/lib/udev/nfs-set-
> identifier helper in order to manage the container uniquifier, to allow
> generation on the fly and persistence.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>
On 7 Oct 2020, at 12:44, Trond Myklebust wrote:
> I did see Igor's claim that there is a QoS issue (which afaics would
> also affect NFSv3), but why do I care about QoS as a per-mountpoint
> feature?
Because it's hard to do QoS without being able to classify the traffic on
the network somehow. The separate connection makes it a lot easier. I see
how that's "not our problem", though.
The regular admin might find it surprising to tell their system to
connect to a specific IP address at mount time, only to have it send the
mount's traffic elsewhere.
Are you happy with the state of nconnect, or is there room for something
more dynamic?
Ben
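To make the classification point concrete: with one mount per server
address, egress traffic can be prioritized with ordinary tc rules keyed on
the destination IP. Device and address below are placeholders, and this is
only a sketch:

    # three-band priority qdisc on the client interface
    tc qdisc add dev eth0 root handle 1: prio
    # put traffic to the "interactive" server address in the highest band
    tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
        match ip dst 10.0.1.201/32 flowid 1:1

With a single shared connection to a single address there is nothing left
at the IP/TCP layer to match on.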
On Wed, 2020-10-07 at 14:04 -0400, Benjamin Coddington wrote:
> On 7 Oct 2020, at 12:44, Trond Myklebust wrote:
> > I did see Igor's claim that there is a QoS issue (which afaics
> > would
> > also affect NFSv3), but why do I care about QoS as a per-mountpoint
> > feature?
>
> Because it's hard to do QoS without being able to classify the
> traffic on
> the network somehow. The separate connection makes it a lot
> easier. I see
> how that's - not our problem -, though.
>
> The regular admin might find it surprising to tell their system to
> connect to a specific IP address at mount time, and it instead sends
> the
> mount's traffic elsewhere.
>
> Are you happy with the state of nconnect, or is there room for
> something
> more dynamic?
>
I think there is room for improvement. We did say that we wanted to
eventually hand control over to a userspace policy daemon which should
be able to manage the number of connections based on demand and
networking conditions.
However as I already pointed out, NFSv4.1 also has congestion control
at the session level which may be playing a role here.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Wed, 2020-10-07 at 14:05 -0400, [email protected] wrote:
> On Wed, Oct 07, 2020 at 05:29:26PM +0000, Trond Myklebust wrote:
> > On Wed, 2020-10-07 at 13:15 -0400, Bruce Fields wrote:
> > > Yeah, honestly I don't understand the details of that case
> > > either.
> > >
> > > (There is one related thing I'm curious about, which is how close
> > > we
> > > are
> > > to keeping clients in different containers completely separate
> > > (which
> > > we'd need, for example, if we were to ever permit unprivileged
> > > nfs
> > > mounts). It looks to me like as long as two network namespaces
> > > use
> > > different client identifiers, the client should keep different
> > > state
> > > for
> > > them already? Or is there more to do there?)
> >
> > The containerised use case should already work. The containers have
> > their own private uniquifiers, which can be changed
> > via /sys/fs/nfs/net/nfs_client/identifier.
>
> I was just looking at that commit (bf11fbd20b3 "NFS: Add sysfs
> support
> for per-container identifier"), and I'm confused by it: it adds code
> to
> nfs/sysfs to get and set "identifier", but nothing anywhere that
> actually uses the value. I can't figure out what I'm missing.
>
No, you're right. Something slipped under the radar there. The
intention was that when it is set, the container-specific 'identifier'
should replace the regular system-wide uniquifier. I thought I had
merged patches for that, but apparently something got screwed up. Let
me fix that up for 5.10...
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Wed, Oct 07, 2020 at 07:11:49PM +0000, Trond Myklebust wrote:
> On Wed, 2020-10-07 at 14:05 -0400, [email protected] wrote:
> > On Wed, Oct 07, 2020 at 05:29:26PM +0000, Trond Myklebust wrote:
> > > On Wed, 2020-10-07 at 13:15 -0400, Bruce Fields wrote:
> > > > Yeah, honestly I don't understand the details of that case
> > > > either.
> > > >
> > > > (There is one related thing I'm curious about, which is how close
> > > > we
> > > > are
> > > > to keeping clients in different containers completely separate
> > > > (which
> > > > we'd need, for example, if we were to ever permit unprivileged
> > > > nfs
> > > > mounts). It looks to me like as long as two network namespaces
> > > > use
> > > > different client identifiers, the client should keep different
> > > > state
> > > > for
> > > > them already? Or is there more to do there?)
> > >
> > > The containerised use case should already work. The containers have
> > > their own private uniquifiers, which can be changed
> > > via /sys/fs/nfs/net/nfs_client/identifier.
> >
> > I was just looking at that commit (bf11fbd20b3 "NFS: Add sysfs
> > support
> > for per-container identifier"), and I'm confused by it: it adds code
> > to
> > nfs/sysfs to get and set "identifier", but nothing anywhere that
> > actually uses the value. I can't figure out what I'm missing.
> >
>
> No, you're right. Something slipped under the radar there. The
> intention was that when it is set, the container-specific 'identifier'
> should replace the regular system-wide uniquifier. I thought I had
> merged patches for that, but apparently something got screwed up. Let
> me fix that up for 5.10...
Great, thanks.
--b.
On Wed, Oct 07, 2020 at 04:50:26PM +0000, Trond Myklebust wrote:
> As far as I can tell, this thread started with a complaint that
> performance suffers when we don't allow setups that hack the client by
> pretending that a multi-homed server is actually multiple different
> servers.
>
> AFAICS Tom Talpey's question is the relevant one. Why is there a
> performance regression being seen by these setups when they share the
> same connection? Is it really the connection, or is it the fact that
> they all share the same fixed-slot session?
>
> I did see Igor's claim that there is a QoS issue (which afaics would
> also affect NFSv3), but why do I care about QoS as a per-mountpoint
> feature?
Sorry for being slow to get back to this.
Some more details:
Say an NFS server exports /data1 and /data2.
A client mounts both. Process 'large' starts creating 10G+ files in
/data1, queuing up a lot of nfs WRITE rpc_tasks.
Process 'small' creates a lot of small files in /data2, which requires a
lot of synchronous rpc_tasks, each of which waits in line with the large
WRITE tasks.
The 'small' process makes painfully slow progress.
The customer previously made things work for them by mounting two
different server IP addresses, so the "small" and "large" processes
effectively end up with their own queues.
Frank Sorenson has a test showing the difference; see
https://bugzilla.redhat.com/show_bug.cgi?id=1703850#c42
https://bugzilla.redhat.com/show_bug.cgi?id=1703850#c43
In that test, the "small" process creates files at a rate thousands of
times slower when the "large" process is also running.
Any suggestions?
--b.
On Tue, 2021-01-19 at 17:22 -0500, [email protected] wrote:
> On Wed, Oct 07, 2020 at 04:50:26PM +0000, Trond Myklebust wrote:
> > As far as I can tell, this thread started with a complaint that
> > performance suffers when we don't allow setups that hack the client
> > by
> > pretending that a multi-homed server is actually multiple different
> > servers.
> >
> > AFAICS Tom Talpey's question is the relevant one. Why is there a
> > performance regression being seen by these setups when they share
> > the
> > same connection? Is it really the connection, or is it the fact
> > that
> > they all share the same fixed-slot session?
> >
> > I did see Igor's claim that there is a QoS issue (which afaics
> > would
> > also affect NFSv3), but why do I care about QoS as a per-mountpoint
> > feature?
>
> Sorry for being slow to get back to this.
>
> Some more details:
>
> Say an NFS server exports /data1 and /data2.
>
> A client mounts both. Process 'large' starts creating 10G+ files in
> /data1, queuing up a lot of nfs WRITE rpc_tasks.
>
> Process 'small' creates a lot of small files in /data2, which
> requires a
> lot of synchronous rpc_tasks, each of which wait in line with the
> large
> WRITE tasks.
>
> The 'small' process makes painfully slow progress.
>
> The customer previously made things work for them by mounting two
> different server IP addresses, so the "small" and "large" processes
> effectively end up with their own queues.
>
> Frank Sorenson has a test showing the difference; see
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1703850#c42
> https://bugzilla.redhat.com/show_bug.cgi?id=1703850#c43
>
> In that test, the "small" process creates files at a rate thousands
> of
> times slower when the "large" process is also running.
>
> Any suggestions?
>
I don't see how this answers my questions above?
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Tue, Jan 19, 2021 at 11:09:55PM +0000, Trond Myklebust wrote:
> On Tue, 2021-01-19 at 17:22 -0500, [email protected] wrote:
> > On Wed, Oct 07, 2020 at 04:50:26PM +0000, Trond Myklebust wrote:
> > > As far as I can tell, this thread started with a complaint that
> > > performance suffers when we don't allow setups that hack the client
> > > by
> > > pretending that a multi-homed server is actually multiple different
> > > servers.
> > >
> > > AFAICS Tom Talpey's question is the relevant one. Why is there a
> > > performance regression being seen by these setups when they share
> > > the
> > > same connection? Is it really the connection, or is it the fact
> > > that
> > > they all share the same fixed-slot session?
> > >
> > > I did see Igor's claim that there is a QoS issue (which afaics
> > > would
> > > also affect NFSv3), but why do I care about QoS as a per-mountpoint
> > > feature?
> >
> > Sorry for being slow to get back to this.
> >
> > Some more details:
> >
> > Say an NFS server exports /data1 and /data2.
> >
> > A client mounts both. Process 'large' starts creating 10G+ files in
> > /data1, queuing up a lot of nfs WRITE rpc_tasks.
> >
> > Process 'small' creates a lot of small files in /data2, which
> > requires a
> > lot of synchronous rpc_tasks, each of which wait in line with the
> > large
> > WRITE tasks.
> >
> > The 'small' process makes painfully slow progress.
> >
> > The customer previously made things work for them by mounting two
> > different server IP addresses, so the "small" and "large" processes
> > effectively end up with their own queues.
> >
> > Frank Sorenson has a test showing the difference; see
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=1703850#c42
> > https://bugzilla.redhat.com/show_bug.cgi?id=1703850#c43
> >
> > In that test, the "small" process creates files at a rate thousands
> > of
> > times slower when the "large" process is also running.
> >
> > Any suggestions?
> >
>
> I don't see how this answers my questions above?
So mainly:
> > > Why is there a performance regression being seen by these setups
> > > when they share the same connection? Is it really the connection,
> > > or is it the fact that they all share the same fixed-slot session?
I don't know. Any pointers how we might go about finding the answer?
It's easy to test the case of entirely separate state & tcp connections.
If we want to test with a shared connection but separate slots I guess
we'd need to create a separate session for each nfs4_server, and a lot
of functions that currently take an nfs4_client would need to take an
nfs4_server?
--b.
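It is at least easy to confirm from the client what is and is not being
shared; a sketch, assuming the standard NFS port:

    # one established TCP connection per shared transport; nconnect or a
    # second client ID shows up here as additional sockets
    ss -tn | grep ':2049'

Separating the slot-table question from the connection question would, as
noted, require the nfs4_server/session surgery described above.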
> On Jan 19, 2021, at 5:22 PM, [email protected] wrote:
>
> On Wed, Oct 07, 2020 at 04:50:26PM +0000, Trond Myklebust wrote:
>> As far as I can tell, this thread started with a complaint that
>> performance suffers when we don't allow setups that hack the client by
>> pretending that a multi-homed server is actually multiple different
>> servers.
>>
>> AFAICS Tom Talpey's question is the relevant one. Why is there a
>> performance regression being seen by these setups when they share the
>> same connection? Is it really the connection, or is it the fact that
>> they all share the same fixed-slot session?
>>
>> I did see Igor's claim that there is a QoS issue (which afaics would
>> also affect NFSv3), but why do I care about QoS as a per-mountpoint
>> feature?
>
> Sorry for being slow to get back to this.
>
> Some more details:
>
> Say an NFS server exports /data1 and /data2.
>
> A client mounts both. Process 'large' starts creating 10G+ files in
> /data1, queuing up a lot of nfs WRITE rpc_tasks.
>
> Process 'small' creates a lot of small files in /data2, which requires a
> lot of synchronous rpc_tasks, each of which wait in line with the large
> WRITE tasks.
>
> The 'small' process makes painfully slow progress.
>
> The customer previously made things work for them by mounting two
> different server IP addresses, so the "small" and "large" processes
> effectively end up with their own queues.
>
> Frank Sorenson has a test showing the difference; see
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1703850#c42
> https://bugzilla.redhat.com/show_bug.cgi?id=1703850#c43
>
> In that test, the "small" process creates files at a rate thousands of
> times slower when the "large" process is also running.
>
> Any suggestions?
Based on observation, there is a bottleneck in svc_recv which fully
serializes the receipt of RPC messages on a TCP socket. Large NFS
WRITE requests take longer to remove from the socket, and only one
nfsd can access that socket at a time.
Directing the large operations to a different socket means one nfsd
at a time can service those operations while other nfsd threads can
deal with the burst of small operations.
I don't know of any way to fully address this issue with a socket
transport other than by creating more transport sockets.
For RPC/RDMA I have some patches which enable svc_rdma_recvfrom()
to clear XPT_BUSY as soon as the ingress Receive buffer is dequeued.
--
Chuck Lever
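On the server side, a coarse way to watch for that kind of serialization
is the nfsd pool statistics, which count how often transports queue up
waiting for an available thread; the field layout varies with kernel
version:

    # on the server: transport queueing vs. threads woken
    cat /proc/fs/nfsd/pool_stats
    # and the number of configured nfsd threads
    cat /proc/fs/nfsd/threads

This only hints at the problem; it does not show how long a given socket
stays busy receiving one large WRITE.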
On Wed, Jan 20, 2021 at 10:07:37AM -0500, [email protected] wrote:
> On Tue, Jan 19, 2021 at 11:09:55PM +0000, Trond Myklebust wrote:
> > On Tue, 2021-01-19 at 17:22 -0500, [email protected] wrote:
> > > On Wed, Oct 07, 2020 at 04:50:26PM +0000, Trond Myklebust wrote:
> > > > As far as I can tell, this thread started with a complaint that
> > > > performance suffers when we don't allow setups that hack the client
> > > > by
> > > > pretending that a multi-homed server is actually multiple different
> > > > servers.
> > > >
> > > > AFAICS Tom Talpey's question is the relevant one. Why is there a
> > > > performance regression being seen by these setups when they share
> > > > the
> > > > same connection? Is it really the connection, or is it the fact
> > > > that
> > > > they all share the same fixed-slot session?
> > > >
> > > > I did see Igor's claim that there is a QoS issue (which afaics
> > > > would
> > > > also affect NFSv3), but why do I care about QoS as a per-mountpoint
> > > > feature?
> > >
> > > Sorry for being slow to get back to this.
> > >
> > > Some more details:
> > >
> > > Say an NFS server exports /data1 and /data2.
> > >
> > > A client mounts both. Process 'large' starts creating 10G+ files in
> > > /data1, queuing up a lot of nfs WRITE rpc_tasks.
> > >
> > > Process 'small' creates a lot of small files in /data2, which
> > > requires a
> > > lot of synchronous rpc_tasks, each of which wait in line with the
> > > large
> > > WRITE tasks.
> > >
> > > The 'small' process makes painfully slow progress.
> > >
> > > The customer previously made things work for them by mounting two
> > > different server IP addresses, so the "small" and "large" processes
> > > effectively end up with their own queues.
> > >
> > > Frank Sorenson has a test showing the difference; see
> > >
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1703850#c42
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1703850#c43
> > >
> > > In that test, the "small" process creates files at a rate thousands
> > > of
> > > times slower when the "large" process is also running.
> > >
> > > Any suggestions?
> > >
> >
> > I don't see how this answers my questions above?
>
> So mainly:
>
> > > > Why is there a performance regression being seen by these setups
> > > > when they share the same connection? Is it really the connection,
> > > > or is it the fact that they all share the same fixed-slot session?
>
> I don't know. Any pointers how we might go about finding the answer?
I set this aside and then get bugged about it again.
I apologize, I don't understand what you're asking for here, but it
seemed obvious to you and Tom, so I'm sure the problem is me. Are you
free for a call sometime maybe? Or do you have any suggestions for how
you'd go about investigating this?
Would it be worth experimenting with giving some sort of advantage to
readers? (E.g., reserving a few slots for reads and getattrs and such?)
--b.
> It's easy to test the case of entirely seperate state & tcp connections.
>
> If we want to test with a shared connection but separate slots I guess
> we'd need to create a separate session for each nfs4_server, and a lot
> of functions that currently take an nfs4_client would need to take an
> nfs4_server?
>
> --b.
On Tue, 04 May 2021, [email protected] wrote:
> On Wed, Jan 20, 2021 at 10:07:37AM -0500, [email protected] wrote:
> >
> > So mainly:
> >
> > > > > Why is there a performance regression being seen by these setups
> > > > > when they share the same connection? Is it really the connection,
> > > > > or is it the fact that they all share the same fixed-slot session?
> >
> > I don't know. Any pointers how we might go about finding the answer?
>
> I set this aside and then get bugged about it again.
>
> I apologize, I don't understand what you're asking for here, but it
> seemed obvious to you and Tom, so I'm sure the problem is me. Are you
> free for a call sometime maybe? Or do you have any suggestions for how
> you'd go about investigating this?
I think a useful first step would be to understand what is getting in
the way of the small requests.
- are they in the client waiting for slots which are all consumed by
large writes?
 - are they in the TCP stream behind megabytes of writes that need to be
consumed before they can even be seen by the server?
- are they in a socket buffer on the server waiting to be served
   while all the nfsd threads are busy handling writes?
I cannot see an easy way to measure which it is.
I guess monitoring how much of the time that the client has no free
slots might give hints about the first. If there are always free slots,
the first case cannot be the problem.
With NFSv3, the slot management happened at the RPC layer and there were
several queues (RPC_PRIORITY_LOW/NORMAL/HIGH/PRIVILEGED) where requests
could wait for a free slot. Since we gained dynamic slot allocation -
up to 65536 by default - I wonder if that has much effect any more.
For NFSv4.1+ the slot management is at the NFS level. The server sets a
maximum which defaults to (maybe is limited to) 1024 by the Linux server.
So there are always free rpc slots.
The Linux client only has a single queue for each slot table, and I
think there is one slot table for the forward channel of a session.
So it seems we no longer get any priority management (sync writes used
to get priority over async writes).
Increasing the number of slots advertised by the server might be
interesting. It is unlikely to fix anything, but it might move the
bottle-neck.
Decreasing the maximum number of tcp slots might also be interesting
(below the number of NFS slots at least).
That would allow the RPC priority infrastructure to work, and if the
large-file writes are async, they might get slowed down.
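
If anyone wants to try that experiment, the client's TCP slot limits are
exposed as sunrpc sysctls; the snippet below (values picked arbitrarily,
and assuming the limits only apply to transports set up after the change,
so remount afterwards) is one way to poke them.

# Lower the RPC-level TCP slot limits below the NFSv4.1 session slot count
# so the sunrpc priority queues come back into play. These are the standard
# sunrpc sysctls; the values are only a guess at something worth trying.
SYSCTLS = {
    "/proc/sys/sunrpc/tcp_slot_table_entries": 16,       # initial slot count
    "/proc/sys/sunrpc/tcp_max_slot_table_entries": 128,  # dynamic ceiling (default 65536)
}

for path, value in SYSCTLS.items():
    with open(path, "w") as f:
        f.write(str(value))
    print(f"{path} = {value}")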
If the problem is in the TCP stream (which is possible if the relevant
network buffers are bloated), then you'd really need multiple TCP streams
(which can certainly improve throughput in some cases). That is what
nconnect gives you. nconnect does minimal balancing. In general it will
round-robin, but if the number of requests (not bytes) queued on one
socket is below average, that socket is likely to get the next request.
So just adding more connections with nconnect is unlikely to help. You
would need to add a policy engine (struct rpc_xprt_iter_ops) which
reserves some connections for small requests. That should be fairly
easy to write a proof-of-concept for.
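
To get a feel for what reserving a connection buys, here is a user-space
toy comparison (not sunrpc code, and not the real rpc_xprt_iter_ops
interface; the arrival pattern and service times are invented) of "pick the
least-congested connection" versus "keep one connection for small
requests".

def simulate(nconn, reserve_for_small):
    # completion times of the requests already queued on each connection
    completions = [[] for _ in range(nconn)]
    small_latency = []
    now = 0.0
    for i in range(400):
        now += 0.0002                          # a new request every 0.2 ms
        small = (i % 8 == 0)                   # 1 small op for every 7 large WRITEs
        cost = 0.0005 if small else 0.020      # time on the wire / at the server
        if reserve_for_small and small:
            c = 0                              # connection 0 is kept for small ops
        else:
            candidates = range(1, nconn) if reserve_for_small else range(nconn)
            # "least congested": the connection with the fewest outstanding requests
            c = min(candidates, key=lambda j: sum(t > now for t in completions[j]))
        start = max([now] + completions[c][-1:])
        completions[c].append(start + cost)
        if small:
            small_latency.append(completions[c][-1] - now)
    return 1000 * sum(small_latency) / len(small_latency)

print(f"least-congested over 4 connections: "
      f"{simulate(4, reserve_for_small=False):7.1f} ms average small-op latency")
print(f"one of 4 reserved for small ops:    "
      f"{simulate(4, reserve_for_small=True):7.1f} ms average small-op latency")

In this model the reserved connection keeps small-op latency near the wire
time, while the least-congested policy still leaves every small op behind a
queue of large WRITEs.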
NeilBrown
>
> Would it be worth experimenting with giving some sort of advantage to
> readers? (E.g., reserving a few slots for reads and getattrs and such?)
>
> --b.
>
> > It's easy to test the case of entirely seperate state & tcp connections.
> >
> > If we want to test with a shared connection but separate slots I guess
> > we'd need to create a separate session for each nfs4_server, and a lot
> > of functions that currently take an nfs4_client would need to take an
> > nfs4_server?
> >
> > --b.
>
>
On 5/3/2021 10:08 PM, NeilBrown wrote:
> On Tue, 04 May 2021, [email protected] wrote:
>> On Wed, Jan 20, 2021 at 10:07:37AM -0500, [email protected] wrote:
>>>
>>> So mainly:
>>>
>>>>>> Why is there a performance regression being seen by these setups
>>>>>> when they share the same connection? Is it really the connection,
>>>>>> or is it the fact that they all share the same fixed-slot session?
>>>
>>> I don't know. Any pointers how we might go about finding the answer?
>>
>> I set this aside and then get bugged about it again.
>>
>> I apologize, I don't understand what you're asking for here, but it
>> seemed obvious to you and Tom, so I'm sure the problem is me. Are you
>> free for a call sometime maybe? Or do you have any suggestions for how
>> you'd go about investigating this?
>
> I think a useful first step would be to understand what is getting in
> the way of the small requests.
> - are they in the client waiting for slots which are all consumed by
> large writes?
> - are they in TCP stream behind megabytes of writes that need to be
> consumed before they can even be seen by the server?
> - are they in a socket buffer on the server waiting to be served
> while all the nfsd thread are busy handling writes?
>
> I cannot see an easy way to measure which it is.
I completely agree. The most likely scenario is a slot shortage, which
might be preventing the client from sending new RPCs. And with a round-robin
policy, the first connection with such a shortage will stall them all.
How can we observe whether this is the case?
Tom.
> I guess monitoring how much of the time that the client has no free
> slots might give hints about the first. If there are always free slots,
> the first case cannot be the problem.
>
> With NFSv3, the slot management happened at the RPC layer and there were
> several queues (RPC_PRIORITY_LOW/NORMAL/HIGH/PRIVILEGED) where requests
> could wait for a free slot. Since we gained dynamic slot allocation -
> up to 65536 by default - I wonder if that has much effect any more.
>
> For NFSv4.1+ the slot management is at the NFS level. The server sets a
> maximum which defaults to (maybe is limited to) 1024 by the Linux server.
> So there are always free rpc slots.
> The Linux client only has a single queue for each slot table, and I
> think there is one slot table for the forward channel of a session.
> So it seems we no longer get any priority management (sync writes used
> to get priority over async writes).
>
> Increasing the number of slots advertised by the server might be
> interesting. It is unlikely to fix anything, but it might move the
> bottle-neck.
>
> Decreasing the maximum of number of tcp slots might also be interesting
> (below the number of NFS slots at least).
> That would allow the RPC priority infrastructure to work, and if the
> large-file writes are async, they might gets slowed down.
>
> If the problem is in the TCP stream (which is possible if the relevant
> network buffers are bloated), then you'd really need multiple TCP streams
> (which can certainly improve throughput in some cases). That is what
> nconnect give you. nconnect does minimal balancing. It general it will
> round-robin, but if the number of requests (not bytes) queued on one
> socket is below average, that socket is likely to get the next request.
> So just adding more connections with nconnect is unlikely to help. You
> would need to add a policy engine (struct rpc_xpr_iter_ops) which
> reserves some connections for small requests. That should be fairly
> easy to write a proof-of-concept for.
>
> NeilBrown
>
>
>>
>> Would it be worth experimenting with giving some sort of advantage to
>> readers? (E.g., reserving a few slots for reads and getattrs and such?)
>>
>> --b.
>>
>>> It's easy to test the case of entirely seperate state & tcp connections.
>>>
>>> If we want to test with a shared connection but separate slots I guess
>>> we'd need to create a separate session for each nfs4_server, and a lot
>>> of functions that currently take an nfs4_client would need to take an
>>> nfs4_server?
>>>
>>> --b.
>>
>>
>
On Tue, 2021-05-04 at 12:08 +1000, NeilBrown wrote:
> On Tue, 04 May 2021, [email protected] wrote:
> > On Wed, Jan 20, 2021 at 10:07:37AM -0500, [email protected] wrote:
> > >
> > > So mainly:
> > >
> > > > > > Why is there a performance regression being seen by these
> > > > > > setups
> > > > > > when they share the same connection? Is it really the
> > > > > > connection,
> > > > > > or is it the fact that they all share the same fixed-slot
> > > > > > session?
> > >
> > > I don't know. Any pointers how we might go about finding the
> > > answer?
> >
> > I set this aside and then get bugged about it again.
> >
> > I apologize, I don't understand what you're asking for here, but it
> > seemed obvious to you and Tom, so I'm sure the problem is me. Are
> > you
> > free for a call sometime maybe? Or do you have any suggestions for
> > how
> > you'd go about investigating this?
>
> I think a useful first step would be to understand what is getting in
> the way of the small requests.
> - are they in the client waiting for slots which are all consumed by
> large writes?
> - are they in TCP stream behind megabytes of writes that need to be
> consumed before they can even be seen by the server?
> - are they in a socket buffer on the server waiting to be served
> while all the nfsd thread are busy handling writes?
>
> I cannot see an easy way to measure which it is.
The nfs4_sequence_done tracepoint will give you a running count of the
highest slot id in use.
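
Something like the following will watch that from a script (run as root;
it assumes tracefs is mounted at /sys/kernel/tracing and that the event
output carries a "highest_slotid=" field - check the event's format file if
it does not match):

# Enable the nfs4:nfs4_sequence_done tracepoint and report the highest slot
# id seen so far.
import re

TRACING = "/sys/kernel/tracing"

with open(f"{TRACING}/events/nfs4/nfs4_sequence_done/enable", "w") as f:
    f.write("1")

seen_max = 0
with open(f"{TRACING}/trace_pipe") as pipe:
    for line in pipe:
        m = re.search(r"highest_slotid=(\d+)", line)
        if m:
            slot = int(m.group(1))
            seen_max = max(seen_max, slot)
            print(f"highest_slotid={slot} (max seen {seen_max})")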
The mountstats 'execute time' will give you the time between the
request being created and the time a reply was received. That time
includes the time spent waiting for a NFSv4 session slot.
The mountstats 'backlog wait' will tell you the time spent waiting for
an RPC slot after obtaining the NFSv4 session slot.
The mountstats 'RTT' will give you the time spent waiting for the RPC
request to be received, processed and replied to by the server.
Finally, the mountstats also tells you average per-op bytes sent/bytes
received.
IOW: The mountstats really gives you almost all the information you
need here, particularly if you use it in the 'interval reporting' mode.
The only thing it does not tell you is whether or not the NFSv4 session
slot table is full (which is why you want the tracepoint).
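
If you want the raw numbers without the mountstats tool, the per-op
counters can be read straight from /proc/self/mountstats; the sketch below
assumes the statvers 1.1 field order (ops, transmissions, timeouts, bytes
sent/received, then cumulative backlog, RTT and execute times in
milliseconds) and a hypothetical mount point, so verify both before
trusting it.

# Rough parser for the per-op statistics in /proc/self/mountstats. Assumed
# per-op layout:
#   OP: ops trans timeouts bytes_sent bytes_recv queue_ms rtt_ms execute_ms [errors]
MOUNTPOINT = "/mnt/data1"        # placeholder mount to report on

in_mount = in_perop = False
print(f"{'op':<20} {'ops':>8} {'backlog/op':>12} {'rtt/op':>10} {'exec/op':>10}")
with open("/proc/self/mountstats") as f:
    for line in f:
        if line.startswith("device "):
            in_mount = f" mounted on {MOUNTPOINT} " in line
            in_perop = False
        elif in_mount and line.strip() == "per-op statistics":
            in_perop = True
        elif in_mount and in_perop and ":" in line:
            op, rest = line.split(":", 1)
            fields = rest.split()
            ops = int(fields[0])
            if ops == 0:
                continue
            queue_ms, rtt_ms, exec_ms = (int(x) for x in fields[5:8])
            print(f"{op.strip():<20} {ops:>8} {queue_ms / ops:>10.2f}ms "
                  f"{rtt_ms / ops:>8.2f}ms {exec_ms / ops:>8.2f}ms")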
> I guess monitoring how much of the time that the client has no free
> slots might give hints about the first. If there are always free
> slots,
> the first case cannot be the problem.
>
> With NFSv3, the slot management happened at the RPC layer and there
> were
> several queues (RPC_PRIORITY_LOW/NORMAL/HIGH/PRIVILEGED) where requests
> could wait for a free slot. Since we gained dynamic slot allocation -
> up to 65536 by default - I wonder if that has much effect any more.
>
> For NFSv4.1+ the slot management is at the NFS level. The server sets
> a
> maximum which defaults to (maybe is limited to) 1024 by the Linux
> server.
> So there are always free rpc slots.
> The Linux client only has a single queue for each slot table, and I
> think there is one slot table for the forward channel of a session.
> So it seems we no longer get any priority management (sync writes used
> to get priority over async writes).
>
> Increasing the number of slots advertised by the server might be
> interesting. It is unlikely to fix anything, but it might move the
> bottle-neck.
>
> Decreasing the maximum of number of tcp slots might also be interesting
> (below the number of NFS slots at least).
> That would allow the RPC priority infrastructure to work, and if the
> large-file writes are async, they might gets slowed down.
>
> If the problem is in the TCP stream (which is possible if the relevant
> network buffers are bloated), then you'd really need multiple TCP
> streams
> (which can certainly improve throughput in some cases). That is what
> nconnect give you. nconnect does minimal balancing. It general it
> will
> round-robin, but if the number of requests (not bytes) queued on one
> socket is below average, that socket is likely to get the next request.
It's not round-robin. Transports are allocated to a new RPC request
based on a measure of their queue length in order to skip over those
that show signs of above average congestion.
> So just adding more connections with nconnect is unlikely to help.
> You
> would need to add a policy engine (struct rpc_xpr_iter_ops) which
> reserves some connections for small requests. That should be fairly
> easy to write a proof-of-concept for.
Ideally we would want to tie into cgroups as the control mechanism so
that NFS can be treated like any other I/O resource.
>
> NeilBrown
>
>
> >
> > Would it be worth experimenting with giving some sort of advantage
> > to
> > readers? (E.g., reserving a few slots for reads and getattrs and
> > such?)
> >
> > --b.
> >
> > > It's easy to test the case of entirely seperate state & tcp
> > > connections.
> > >
> > > If we want to test with a shared connection but separate slots I
> > > guess
> > > we'd need to create a separate session for each nfs4_server, and
> > > a lot
> > > of functions that currently take an nfs4_client would need to
> > > take an
> > > nfs4_server?
> > >
> > > --b.
> >
> >
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
Thanks very much to all of you for the explanations and concrete
suggestions for things to look at, I feel much less stuck!
--b.
On Tue, May 04, 2021 at 02:27:04PM +0000, Trond Myklebust wrote:
> On Tue, 2021-05-04 at 12:08 +1000, NeilBrown wrote:
> > On Tue, 04 May 2021, [email protected] wrote:
> > > On Wed, Jan 20, 2021 at 10:07:37AM -0500, [email protected] wrote:
> > > >
> > > > So mainly:
> > > >
> > > > > > > Why is there a performance regression being seen by these
> > > > > > > setups
> > > > > > > when they share the same connection? Is it really the
> > > > > > > connection,
> > > > > > > or is it the fact that they all share the same fixed-slot
> > > > > > > session?
> > > >
> > > > I don't know. Any pointers how we might go about finding the
> > > > answer?
> > >
> > > I set this aside and then get bugged about it again.
> > >
> > > I apologize, I don't understand what you're asking for here, but it
> > > seemed obvious to you and Tom, so I'm sure the problem is me. Are
> > > you
> > > free for a call sometime maybe? Or do you have any suggestions for
> > > how
> > > you'd go about investigating this?
> >
> > I think a useful first step would be to understand what is getting in
> > the way of the small requests.
> > - are they in the client waiting for slots which are all consumed by
> > large writes?
> > - are they in TCP stream behind megabytes of writes that need to be
> > consumed before they can even be seen by the server?
> > - are they in a socket buffer on the server waiting to be served
> > while all the nfsd thread are busy handling writes?
> >
> > I cannot see an easy way to measure which it is.
>
> The nfs4_sequence_done tracepoint will give you a running count of the
> highest slot id in use.
>
> The mountstats 'execute time' will give you the time between the
> request being created and the time a reply was received. That time
> includes the time spent waiting for a NFSv4 session slot.
>
> The mountstats 'backlog wait' will tell you the time spent waiting for
> an RPC slot after obtaining the NFSv4 session slot.
>
> The mountstats 'RTT' will give you the time spend waiting for the RPC
> request to be received, processed and replied to by the server.
>
> Finally, the mountstats also tell you average per-op bytes sent/bytes
> received.
>
> IOW: The mountstats really gives you almost all the information you
> need here, particularly if you use it in the 'interval reporting' mode.
> The only thing it does not tell you is whether or not the NFSv4 session
> slot table is full (which is why you want the tracepoint).
>
> > I guess monitoring how much of the time that the client has no free
> > slots might give hints about the first. If there are always free
> > slots,
> > the first case cannot be the problem.
> >
> > With NFSv3, the slot management happened at the RPC layer and there
> > were
> > several queues (RPC_PRIORITY_LOW/NORMAL/HIGH/PRIVILEGED) where requests
> > could wait for a free slot. Since we gained dynamic slot allocation -
> > up to 65536 by default - I wonder if that has much effect any more.
> >
> > For NFSv4.1+ the slot management is at the NFS level. The server sets
> > a
> > maximum which defaults to (maybe is limited to) 1024 by the Linux
> > server.
> > So there are always free rpc slots.
> > The Linux client only has a single queue for each slot table, and I
> > think there is one slot table for the forward channel of a session.
> > So it seems we no longer get any priority management (sync writes used
> > to get priority over async writes).
> >
> > Increasing the number of slots advertised by the server might be
> > interesting. It is unlikely to fix anything, but it might move the
> > bottle-neck.
> >
> > Decreasing the maximum of number of tcp slots might also be interesting
> > (below the number of NFS slots at least).
> > That would allow the RPC priority infrastructure to work, and if the
> > large-file writes are async, they might gets slowed down.
> >
> > If the problem is in the TCP stream (which is possible if the relevant
> > network buffers are bloated), then you'd really need multiple TCP
> > streams
> > (which can certainly improve throughput in some cases). That is what
> > nconnect give you. nconnect does minimal balancing. It general it
> > will
> > round-robin, but if the number of requests (not bytes) queued on one
> > socket is below average, that socket is likely to get the next request.
>
> It's not round-robin. Transports are allocated to a new RPC request
> based on a measure of their queue length in order to skip over those
> that show signs of above average congestion.
>
> > So just adding more connections with nconnect is unlikely to help.
> > You
> > would need to add a policy engine (struct rpc_xpr_iter_ops) which
> > reserves some connections for small requests. That should be fairly
> > easy to write a proof-of-concept for.
>
> Ideally we would want to tie into cgroups as the control mechanism so
> that NFS can be treated like any other I/O resource.
>
> >
> > NeilBrown
> >
> >
> > >
> > > Would it be worth experimenting with giving some sort of advantage
> > > to
> > > readers? (E.g., reserving a few slots for reads and getattrs and
> > > such?)
> > >
> > > --b.
> > >
> > > > It's easy to test the case of entirely seperate state & tcp
> > > > connections.
> > > >
> > > > If we want to test with a shared connection but separate slots I
> > > > guess
> > > > we'd need to create a separate session for each nfs4_server, and
> > > > a lot
> > > > of functions that currently take an nfs4_client would need to
> > > > take an
> > > > nfs4_server?
> > > >
> > > > --b.
> > >
> > >
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>
Hi,
For what it's worth, I mentioned this on the associated redhat bugzilla but I'll replicate it here - I *think* this issue (bulk reads/writes starving getattrs etc) is one of the issues I was trying to describe in my re-export thread:
https://marc.info/?l=linux-nfs&m=160077787901987&w=4
Long story short, when we have already read lots of data into a client's pagecache (or fscache/cachefiles), we can't reuse it later until we do some metadata lookups to re-validate it. But if we are also continually filling the client cache with new data (lots of reads) as fast as possible, we starve the other processes (in my case - knfsd re-export threads) of the chance to process the re-validation lookups/getattrs in a timely manner.
We have lots of previously cached data but we can't use it for a long time because we can't get the getattrs out and replied to quickly.
When I was testing the client behaviour, it didn't seem like nconnect or NFSv3/NFSv4.2 made much difference - metadata lookups from another client process to the same mountpoint slowed to a crawl when a process had reads dominating the network pipe.
I also found that maxing out the client's network bandwidth really showed this effect best. Either by saturating a client's physical network link or, in the case of reads, using an ingress qdisc + htb on the client to simulate a saturated low speed network.
In all cases where the client's network is (read) saturated (physically or using a qdisc), the metadata performance from another process becomes really poor. If I mount a completely different server on the same client, the metadata performance to that new second server is much better despite the ongoing network saturation caused by the continuing reads from the first server.
I don't know if that helps much, but it was my observation when I last looked at this.
I'd really love to see any kind of improvement to this behaviour as it's a real shame we can't serve cached data quickly when all the cache re-validations (getattrs) are stuck behind bulk IO that just seems to plow through everything else.
Daire
----- On 4 May, 2021, at 17:51, bfields [email protected] wrote:
> Thanks very much to all of you for the explanations and concrete
> suggestions for things to look at, I feel much less stuck!
>
> --b.
>
> On Tue, May 04, 2021 at 02:27:04PM +0000, Trond Myklebust wrote:
>> On Tue, 2021-05-04 at 12:08 +1000, NeilBrown wrote:
>> > On Tue, 04 May 2021, [email protected] wrote:
>> > > On Wed, Jan 20, 2021 at 10:07:37AM -0500, [email protected] wrote:
>> > > >
>> > > > So mainly:
>> > > >
>> > > > > > > Why is there a performance regression being seen by these
>> > > > > > > setups
>> > > > > > > when they share the same connection? Is it really the
>> > > > > > > connection,
>> > > > > > > or is it the fact that they all share the same fixed-slot
>> > > > > > > session?
>> > > >
>> > > > I don't know. Any pointers how we might go about finding the
>> > > > answer?
>> > >
>> > > I set this aside and then get bugged about it again.
>> > >
>> > > I apologize, I don't understand what you're asking for here, but it
>> > > seemed obvious to you and Tom, so I'm sure the problem is me. Are
>> > > you
>> > > free for a call sometime maybe? Or do you have any suggestions for
>> > > how
>> > > you'd go about investigating this?
>> >
>> > I think a useful first step would be to understand what is getting in
>> > the way of the small requests.
>> > - are they in the client waiting for slots which are all consumed by
>> > large writes?
>> > - are they in TCP stream behind megabytes of writes that need to be
>> > consumed before they can even be seen by the server?
>> > - are they in a socket buffer on the server waiting to be served
>> > while all the nfsd thread are busy handling writes?
>> >
>> > I cannot see an easy way to measure which it is.
>>
>> The nfs4_sequence_done tracepoint will give you a running count of the
>> highest slot id in use.
>>
>> The mountstats 'execute time' will give you the time between the
>> request being created and the time a reply was received. That time
>> includes the time spent waiting for a NFSv4 session slot.
>>
>> The mountstats 'backlog wait' will tell you the time spent waiting for
>> an RPC slot after obtaining the NFSv4 session slot.
>>
>> The mountstats 'RTT' will give you the time spend waiting for the RPC
>> request to be received, processed and replied to by the server.
>>
>> Finally, the mountstats also tell you average per-op bytes sent/bytes
>> received.
>>
>> IOW: The mountstats really gives you almost all the information you
>> need here, particularly if you use it in the 'interval reporting' mode.
>> The only thing it does not tell you is whether or not the NFSv4 session
>> slot table is full (which is why you want the tracepoint).
>>
>> > I guess monitoring how much of the time that the client has no free
>> > slots might give hints about the first. If there are always free
>> > slots,
>> > the first case cannot be the problem.
>> >
>> > With NFSv3, the slot management happened at the RPC layer and there
>> > were
>> > several queues (RPC_PRIORITY_LOW/NORMAL/HIGH/PRIVILEGED) where requests
>> > could wait for a free slot. Since we gained dynamic slot allocation -
>> > up to 65536 by default - I wonder if that has much effect any more.
>> >
>> > For NFSv4.1+ the slot management is at the NFS level. The server sets
>> > a
>> > maximum which defaults to (maybe is limited to) 1024 by the Linux
>> > server.
>> > So there are always free rpc slots.
>> > The Linux client only has a single queue for each slot table, and I
>> > think there is one slot table for the forward channel of a session.
>> > So it seems we no longer get any priority management (sync writes used
>> > to get priority over async writes).
>> >
>> > Increasing the number of slots advertised by the server might be
>> > interesting. It is unlikely to fix anything, but it might move the
>> > bottle-neck.
>> >
>> > Decreasing the maximum of number of tcp slots might also be interesting
>> > (below the number of NFS slots at least).
>> > That would allow the RPC priority infrastructure to work, and if the
>> > large-file writes are async, they might gets slowed down.
>> >
>> > If the problem is in the TCP stream (which is possible if the relevant
>> > network buffers are bloated), then you'd really need multiple TCP
>> > streams
>> > (which can certainly improve throughput in some cases). That is what
>> > nconnect give you. nconnect does minimal balancing. It general it
>> > will
>> > round-robin, but if the number of requests (not bytes) queued on one
>> > socket is below average, that socket is likely to get the next request.
>>
>> It's not round-robin. Transports are allocated to a new RPC request
>> based on a measure of their queue length in order to skip over those
>> that show signs of above average congestion.
>>
>> > So just adding more connections with nconnect is unlikely to help.
>> > You
>> > would need to add a policy engine (struct rpc_xpr_iter_ops) which
>> > reserves some connections for small requests. That should be fairly
>> > easy to write a proof-of-concept for.
>>
>> Ideally we would want to tie into cgroups as the control mechanism so
>> that NFS can be treated like any other I/O resource.
>>
>> >
>> > NeilBrown
>> >
>> >
>> > >
>> > > Would it be worth experimenting with giving some sort of advantage
>> > > to
>> > > readers? (E.g., reserving a few slots for reads and getattrs and
>> > > such?)
>> > >
>> > > --b.
>> > >
>> > > > It's easy to test the case of entirely seperate state & tcp
>> > > > connections.
>> > > >
>> > > > If we want to test with a shared connection but separate slots I
>> > > > guess
>> > > > we'd need to create a separate session for each nfs4_server, and
>> > > > a lot
>> > > > of functions that currently take an nfs4_client would need to
>> > > > take an
>> > > > nfs4_server?
>> > > >
>> > > > --b.
>> > >
>> > >
>>
>> --
>> Trond Myklebust
>> Linux NFS client maintainer, Hammerspace
>> [email protected]
>>
On Tue, 2021-05-04 at 22:32 +0100, Daire Byrne wrote:
> Hi,
>
> For what it's worth, I mentioned this on the associated redhat
> bugzilla but I'll replicate it here - I *think* this issue (bulk
> reads/writes starving getattrs etc) is one of the issues I was trying
> to describe in my re-export thread:
>
> https://marc.info/?l=linux-nfs&m=160077787901987&w=4
>
> Long story short, when we have already read lots of data into a
> client's pagecache (or fscache/cachefiles), you can't reuse it again
> later until you do some metadata lookups to re-validate. But if we
> are also continually filling the client cache with new data (lots of
> reads) as fast as possible, we starve the other processes (in my case
> - knfsd re-export threads) from processing the re-validate
> lookups/getattrs in a timely manner.
>
> We have lots of previously cached data but we can't use it for a long
> time because we can't get the getattrs out and replied to quickly.
>
> When I was testing the client behaviour, it didn't seem like nconnect
> or NFSv3/NFSv4.2 made much difference to the behaviour - metadata
> lookups from another client process to the same mountpoint slowed to
> a crawl when a process had reads dominating the network pipe.
>
> I also found that maxing out the client's network bandwidth really
> showed this effect best. Either by saturating a client's physical
> network link or, in the case of reads, using an ingress qdisc + htb
> on the client to simulate a saturated low speed network.
>
> In all cases where the client's network is (read) saturated
> (physically or using a qdisc), the metadata performance from another
> process becomes really poor. If I mount a completely different server
> on the same client, the metadata performance to that new second
> server is much better despite the ongoing network saturation caused
> by the continuing reads from the first server.
>
> I don't know if that helps much, but it was my observation when I
> last looked at this.
>
> I'd really love to see any kind of improvement to this behaviour as
> it's a real shame we can't serve cached data quickly when all the
> cache re-validations (getattrs) are stuck behind bulk IO that just
> seems to plow through everything else.
If you use statx() instead of the regular stat call, and you
specifically don't request the ctime and mtime, then the current kernel
should skip the writeback.
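
For reference, this is roughly what that looks like from a script. Python
has no statx() wrapper that I know of, so this goes through the raw
syscall; the syscall number is x86_64-specific, the struct offsets are my
reading of the UAPI header, and the path is a placeholder.

# Issue statx() without STATX_CTIME/STATX_MTIME in the request mask so the
# NFS client does not have to flush dirty data just to revalidate times.
import ctypes, os, struct

AT_FDCWD = -100
AT_STATX_SYNC_AS_STAT = 0x0000
STATX_BASIC_STATS = 0x07FF
STATX_MTIME = 0x0040
STATX_CTIME = 0x0080
SYS_statx = 332                          # x86_64 only

libc = ctypes.CDLL(None, use_errno=True)
buf = ctypes.create_string_buffer(256)   # struct statx is 256 bytes

mask = STATX_BASIC_STATS & ~(STATX_MTIME | STATX_CTIME)
ret = libc.syscall(SYS_statx, AT_FDCWD, b"/mnt/data2/some-cached-file",
                   AT_STATX_SYNC_AS_STAT, mask, buf)
if ret != 0:
    errno = ctypes.get_errno()
    raise OSError(errno, os.strerror(errno))

stx_mask, = struct.unpack_from("<I", buf, 0)    # which fields the kernel filled in
stx_size, = struct.unpack_from("<Q", buf, 40)   # stx_size at offset 40
print(f"size={stx_size} returned mask=0x{stx_mask:x}")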
Otherwise, you're going to have to wait for the NFSv4.2 protocol
changes that we're trying to push through the IETF to allow the client
to be authoritative for the ctime/mtime when it holds a write
delegation.
> >
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
Trond,
----- On 4 May, 2021, at 22:48, Trond Myklebust [email protected] wrote:
>> I'd really love to see any kind of improvement to this behaviour as
>> it's a real shame we can't serve cached data quickly when all the
>> cache re-validations (getattrs) are stuck behind bulk IO that just
>> seems to plow through everything else.
>
> If you use statx() instead of the regular stat call, and you
> specifically don't request the ctime and mtime, then the current kernel
> should skip the writeback.
>
> Otherwise, you're going to have to wait for the NFSv4.2 protocol
> changes that we're trying to push through the IETF to allow the client
> to be authoritative for the ctime/mtime when it holds a write
> delegation.
In my case, it's less about skipping avoidable getattrs if we have the files open and delegated for read/write or are still within the attribute cache timeout, and it has nothing to do with the re-export specific cache optimisations that went into v5.11 (which really helped us out!).
It's more the fact that we can read a terabyte of data (say) into the client's pagecache or (more likely) fscache/cachefiles, but obviously can't use it again days later (say) until some validation getattrs are sent and replied to. If that mountpoint also happens to be very busy with reads or writes at the time, then all that locally cached data sits idle until we can squeeze through the necessary lookups. This is especially painful if you are also using NFS over the WAN.
When I did some basic benchmarking, metadata ops from one process could be x100 slower when the pipe is full of reads or writes from other processes on the same client. Actually, another detail I just read in my previous notes - the more parallel client processes you have reading data, the slower your metadata ops will get replied to.
So if you have 1 process filling the client's network pipe with reads and another walking the filesystem, the walk will be ~x5 slower than if the pipe wasn't full of reads. If you have 20 processes simultaneously reading, again filling the client's network pipe with reads, then the filesystem walking process is x100 slower. In both cases, the physical network is being maxed out, but the metadata-intensive filesystem walking process gets less and less opportunity to have its requests answered.
And this is exactly the scenario we see with our NFS re-export case, where lots of knfsd threads are doing reads from a mountpoint while others are just trying to have lookup requests answered so they can then serve the locally cached data (it helps that our remote files never get overwritten or updated).
So, similar to the original behaviour described in this thread, we also find that even when one client's NFSv4.2 mount is eating up all the network bandwidth and metadata ops are slowed to a crawl, another independent server (or multi-homed with same filesystem) mounted on the same client still shows very good (barely degraded) metadata performance. Presumably due to the independent slot table (which is good news if you are using a single server to re-export multiple servers).
I think for us, some kind of priority for these small metadata ops would be ideal (assuming you can get enough of them into the slot queue in the first place). I'm not sure a slot limit per client process would help that much? I also wonder if readahead (or async writes) could be gobbling up too many of the available slots, leaving too few for the sequential, metadata-intensive processes?
Daire