2019-08-13 16:11:30

by Jason L Tibbitts III

[permalink] [raw]
Subject: Regression in 5.1.20: Reading long directory fails

A user reported to me that they couldn't see the entirety of their home
directory. And indeed:

[root@ld00 ~]# ls -l ~dblecher|wc -l
ls: reading directory '/home/dblecher': Input/output error
1844
[root@ld00 ~]# cat /proc/version Linux version 5.1.20-300.fc30.x86_64 ([email protected]) (gcc version 9.1.1 20190503 (Red Hat 9.1.1-1) (GCC)) #1 SMP Fri Jul 26 15:03:11 UTC 2019

Mount options are: nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=krb5i
The server is running CentOS 7 (kernel 3.10.0-957.12.2.el7.x86_64).

The problem does not appear in 5.1.19 and all 7657 entries in that
directory are returned.

Looking at the 5.1.20 changelog I see a few NFS-related changes but
commit 3536b79ba75ba44b9ac1a9f1634f2e833bbb735c:
Revert "NFS: readdirplus optimization by cache mechanism" (memleak)
stands out; I'm working on building a kernel with the revert reverted.

Note that this doesn't happen on any directory with lots of files; I've
only managed to see it on this particular user's overly large home
directory. So I can trivially reproduce it but I don't know how anyone
else could. I'm happy to collect any debugging data that might be
needed.

- J<


2019-08-13 17:01:34

by Jason L Tibbitts III

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

To follow up, I built 5.1.20 with just commit
3536b79ba75ba44b9ac1a9f1634f2e833bbb735c reverted and I can now get a
listing of that directory without error. Since it's a revert of
something else, and this is a new problem, I wonder if the revert went
awry or if something else came to depend on the behavior which was
reverted. Again, I'm happy to provide any debugging information you
might request.

Also note that the problem persists in 5.2.8. I see Fedora has a 5.3.0
rc4 build going, so I'll test that one as soon as it finishes.

- J<

2019-08-23 07:50:26

by Jason L Tibbitts III

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

I now have another user reporting the same failure of readdir on a long
directory which showed up in 5.1.20 and was traced to
3536b79ba75ba44b9ac1a9f1634f2e833bbb735c. I'm not sure what to do to
get more traction besides reposting and adding some addresses to the CC
list. If there is any information I can provide which might help to get
to the bottom of this, please let me know.

To recap:

5.1.20 introduced a regression reading some large directories. In this
case, the directory should have 7800 files or so in it:

[root@ld00 ~]# ls -l ~dblecher|wc -l
ls: reading directory '/home/dblecher': Input/output error
1844
[root@ld00 ~]# cat /proc/version Linux version 5.1.20-300.fc30.x86_64 ([email protected]) (gcc version 9.1.1 20190503 (Red Hat 9.1.1-1) (GCC)) #1 SMP Fri Jul 26 15:03:11 UTC 2019

(The server is a Centos 7 machine running kernel 3.10.0-957.12.2.el7.x86_64.)

Building a kernel which reverts commit 3536b79ba75ba44b9ac1a9f1634f2e833bbb735c:
Revert "NFS: readdirplus optimization by cache mechanism" (memleak)
fixes the issue, but of course that revert was fixing a real issue so
I'm not sure what to do.

I can trivially reproduce this by simply trying to list the problematic
directories but I'm not sure how to construct such a directory; simply
creating 10000 files doesn't cause the problem for me. I am willing to
test patches and can build my own kernels, and I'm happy to provide any
debugging information you might require. Unfortunately I don't know
enough to dig in and figure out for myself what's going wrong.

I did file https://bugzilla.redhat.com/show_bug.cgi?id=1740954 just to
have this in a bug tracker somewhere. I'm happy to file one somewhere
else if that would help.

- J<

2019-08-28 17:46:32

by J. Bruce Fields

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

On Thu, Aug 22, 2019 at 02:39:26PM -0500, Jason L Tibbitts III wrote:
> I now have another user reporting the same failure of readdir on a long
> directory which showed up in 5.1.20 and was traced to
> 3536b79ba75ba44b9ac1a9f1634f2e833bbb735c. I'm not sure what to do to
> get more traction besides reposting and adding some addresses to the CC
> list. If there is any information I can provide which might help to get
> to the bottom of this, please let me know.
>
> To recap:
>
> 5.1.20 introduced a regression reading some large directories. In this
> case, the directory should have 7800 files or so in it:
>
> [root@ld00 ~]# ls -l ~dblecher|wc -l
> ls: reading directory '/home/dblecher': Input/output error
> 1844
> [root@ld00 ~]# cat /proc/version Linux version 5.1.20-300.fc30.x86_64 ([email protected]) (gcc version 9.1.1 20190503 (Red Hat 9.1.1-1) (GCC)) #1 SMP Fri Jul 26 15:03:11 UTC 2019
>
> (The server is a Centos 7 machine running kernel 3.10.0-957.12.2.el7.x86_64.)
>
> Building a kernel which reverts commit 3536b79ba75ba44b9ac1a9f1634f2e833bbb735c:
> Revert "NFS: readdirplus optimization by cache mechanism" (memleak)

Looks like that's db531db951f950b8 upstream. (Do you know if it's
reproduceable upstream as well?)

> fixes the issue, but of course that revert was fixing a real issue so
> I'm not sure what to do.
>
> I can trivially reproduce this by simply trying to list the problematic
> directories but I'm not sure how to construct such a directory; simply
> creating 10000 files doesn't cause the problem for me.

Maybe it depends on having names of the right length to place some bit
of xdr on a boundary. I wonder if it'd be possible to reproduce just by
varying the name lengths randomly till you hit it.

The fact that the problematic patch fixed a memory leak also makes me
wonder if it might have gone to far and freed something out from under
the readdir code.

> I am willing to
> test patches and can build my own kernels, and I'm happy to provide any
> debugging information you might require. Unfortunately I don't know
> enough to dig in and figure out for myself what's going wrong.
>
> I did file https://bugzilla.redhat.com/show_bug.cgi?id=1740954 just to
> have this in a bug tracker somewhere. I'm happy to file one somewhere
> else if that would help.

No clever debugging ideas off the top of my head, I'm afraid. I might
start by patching the kernel or doing some tracing to figure out exactly
where that EIO is being generated?

--b.

2019-08-28 18:29:33

by Jason L Tibbitts III

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

>>>>> "BF" == J Bruce Fields <[email protected]> writes:

BF> Looks like that's db531db951f950b8 upstream. (Do you know if it's
BF> reproduceable upstream as well?)

Yes, it's reproducible up in the 5.3.0 RCs as well.

However, while trying to do some further bisecting I ran into an odd
problem. Now kernels which were previously working (i.e. 5.1.19 and
older) are returning errors, but at a different file count. This only
gives me more questions. And so, just to be absolutely sure that there
isn't some weird server issue involved, I'm going to try to schedule a
reboot of the relevant server.

BF> Maybe it depends on having names of the right length to place some
BF> bit of xdr on a boundary. I wonder if it'd be possible to reproduce
BF> just by varying the name lengths randomly till you hit it.

I know I can't reproduce with loads of short names, and with relatively
long names as well (using sha256sum as filename generator).

BF> No clever debugging ideas off the top of my head, I'm afraid. I
BF> might start by patching the kernel or doing some tracing to figure
BF> out exactly where that EIO is being generated?

If I had any idea how to do that, I happily would. I'm certainly
willing to learn. At least I can run strace to see where ls bombs:

getdents64(5, 0x7fc13afaf040, 262144) = -1 EIO (Input/output error)

bcodding on IRC mentioned that is a rather large count. Does make me
wonder if the server is weirding out and sending the client bogus data.
Certainly a server reboot, or maybe even just unmounting and remounting
the filesystem or copying the data to another filesystem would tell me
that. In any case, as soon as I am able to mess with that server, I'll
know more.

_ J<

2019-08-28 18:34:17

by J. Bruce Fields

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

On Wed, Aug 28, 2019 at 01:29:00PM -0500, Jason L Tibbitts III wrote:
> If I had any idea how to do that, I happily would. I'm certainly
> willing to learn. At least I can run strace to see where ls bombs:

Somewhat of a caveman, I might start at the code for getdents, sprinkle
printk's around until I get down to the relevant code in the NFS
layer.... There's probably a better way to do it with tracing.

> getdents64(5, 0x7fc13afaf040, 262144) = -1 EIO (Input/output error)
>
> bcodding on IRC mentioned that is a rather large count. Does make me
> wonder if the server is weirding out and sending the client bogus data.
> Certainly a server reboot, or maybe even just unmounting and remounting
> the filesystem or copying the data to another filesystem would tell me
> that. In any case, as soon as I am able to mess with that server, I'll
> know more.

Might also be worth capturing the network traffic and checking whether
wireshark thinks the server replies are valid xdr.

--b.

2019-09-03 15:50:36

by Jason L Tibbitts III

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

>>>>> "JLT" == Jason L Tibbitts <[email protected]> writes:

JLT> Certainly a server reboot, or maybe even just
JLT> unmounting and remounting the filesystem or copying the data to
JLT> another filesystem would tell me that. In any case, as soon as I
JLT> am able to mess with that server, I'll know more.

Rebooting the server did not make any difference, and now more users are
seeing the problem. At this point I'm in a state where NFS simply isn't
reliable at all, and I'm not sure what to do. If Centos 8 were out,
I'd work on moving to that just so that the server was a little more
modern. (Currently the server is Centos 7.) I guess I could try using
Fedora, or installing one of the upstream kernels, just in case this has
to do with some interaction between the client and the old RHEL7 kernel.

I do have a packet capture of a directory listing that fails with EIO,
but I'm not sure if it's safe to simply post it, and I'm not sure what
tshark options would be useful in decoding it.

I do know that I can rsync one of the problematic directories to a
different server (running the same kernel) and it doesn't have the same
problem. What I'll try next is rsyncing to a different filesystem on
the same server, but again I'll have to wait until people log off to do
proper testing.

- J<

2019-09-03 18:10:14

by Wolfgang Walter

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

Am Dienstag, 3. September 2019, 10:49:48 schrieb Jason L Tibbitts III:
> >>>>> "JLT" == Jason L Tibbitts <[email protected]> writes:
> JLT> Certainly a server reboot, or maybe even just
> JLT> unmounting and remounting the filesystem or copying the data to
> JLT> another filesystem would tell me that. In any case, as soon as I
> JLT> am able to mess with that server, I'll know more.
>
> Rebooting the server did not make any difference, and now more users are
> seeing the problem. At this point I'm in a state where NFS simply isn't
> reliable at all, and I'm not sure what to do. If Centos 8 were out,
> I'd work on moving to that just so that the server was a little more
> modern. (Currently the server is Centos 7.) I guess I could try using
> Fedora, or installing one of the upstream kernels, just in case this has
> to do with some interaction between the client and the old RHEL7 kernel.
>
> I do have a packet capture of a directory listing that fails with EIO,
> but I'm not sure if it's safe to simply post it, and I'm not sure what
> tshark options would be useful in decoding it.
>
> I do know that I can rsync one of the problematic directories to a
> different server (running the same kernel) and it doesn't have the same
> problem. What I'll try next is rsyncing to a different filesystem on
> the same server, but again I'll have to wait until people log off to do
> proper testing.
>
> - J<

What filesystem do you use on the server? xfs? If yes, does it use 64bit
inodes (or started to use them)? Do you set a fsid when you export the
filesystem?

Regards,
--
Wolfgang Walter
Studentenwerk M?nchen
Anstalt des ?ffentlichen Rechts

2019-09-03 19:07:19

by Jason L Tibbitts III

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

>>>>> "WW" == Wolfgang Walter <[email protected]> writes:

WW> What filesystem do you use on the server? xfs?

Yeah, it's XFS.

WW> If yes, does it use 64bit inodes (or started to use them)?

These filesystems aren't super old, and were all created with the
default RHEL7 options. I'm not sure how to check that 64 bit inodes are
being used, though. xfs_info says:

meta-data=/dev/mapper/nas-faculty--08 isize=256 agcount=4, agsize=3276800 blks
= sectsz=512 attr=2, projid32bit=1
= crc=0 finobt=0 spinodes=0
data = bsize=4096 blocks=13107200, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal bsize=4096 blocks=6400, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0

WW> Do you set a fsid when you export the filesystem?

I have never done so on any server.

And note that the servers are basically unchanged for quite some time,
while the problem I'm having is new. I want to find some server-related
cause for this but so far I haven't been able to do so. It seems my
best option now seems to be to migrate all data off of this server and
then wipe, reinstall and see if the problem reoccurs.

- J<

2019-09-03 19:12:31

by Chuck Lever III

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails


On Sep 3, 2019, at 3:06 PM, Jason L Tibbitts III <[email protected]> wrote:

>>>>>> "WW" == Wolfgang Walter <[email protected]> writes:
>
> WW> What filesystem do you use on the server? xfs?
>
> Yeah, it's XFS.
>
> WW> If yes, does it use 64bit inodes (or started to use them)?
>
> These filesystems aren't super old, and were all created with the
> default RHEL7 options.

I think that means no 64-bit inodes.


> I'm not sure how to check that 64 bit inodes are
> being used, though. xfs_info says:
>
> meta-data=/dev/mapper/nas-faculty--08 isize=256 agcount=4, agsize=3276800 blks
> = sectsz=512 attr=2, projid32bit=1
> = crc=0 finobt=0 spinodes=0
> data = bsize=4096 blocks=13107200, imaxpct=25
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0 ftype=0
> log =internal bsize=4096 blocks=6400, version=2
> = sectsz=512 sunit=0 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
>
> WW> Do you set a fsid when you export the filesystem?
>
> I have never done so on any server.
>
> And note that the servers are basically unchanged for quite some time,
> while the problem I'm having is new. I want to find some server-related
> cause for this but so far I haven't been able to do so. It seems my
> best option now seems to be to migrate all data off of this server and
> then wipe, reinstall and see if the problem reoccurs.
>
> - J<

2019-09-03 21:38:24

by Wolfgang Walter

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

Am Dienstag, 3. September 2019, 14:06:33 schrieb Jason L Tibbitts III:
> >>>>> "WW" == Wolfgang Walter <[email protected]> writes:
> WW> What filesystem do you use on the server? xfs?
>
> Yeah, it's XFS.
>
> WW> If yes, does it use 64bit inodes (or started to use them)?
>
> These filesystems aren't super old, and were all created with the
> default RHEL7 options. I'm not sure how to check that 64 bit inodes are
> being used, though. xfs_info says:
>
> meta-data=/dev/mapper/nas-faculty--08 isize=256 agcount=4, agsize=3276800
> blks = sectsz=512 attr=2, projid32bit=1 =
> crc=0 finobt=0 spinodes=0
> data = bsize=4096 blocks=13107200, imaxpct=25
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0 ftype=0
> log =internal bsize=4096 blocks=6400, version=2
> = sectsz=512 sunit=0 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
>
> WW> Do you set a fsid when you export the filesystem?
>
> I have never done so on any server.
>
> And note that the servers are basically unchanged for quite some time,
> while the problem I'm having is new. I want to find some server-related
> cause for this but so far I haven't been able to do so. It seems my
> best option now seems to be to migrate all data off of this server and
> then wipe, reinstall and see if the problem reoccurs.
>
> - J<

I'm not familiar with RHEL7. But kernel 5.1.20 uses the inode64 mount option
by default, as far as I know (see Documentation/filesystems/xfs.txt). So if
you do not use the mount option inode32 your xfs may now uses inode64 for
newly created files?

We had similar problems some time ago. Then, inode64 was indeed the cause of
the problem. With inode64 it seems that only little room is left in the nfs4
handle for the fsid. When nfs mangles fsid and the xfs inode number to form a
nfs4 handle it seems that in large directories different files may end having
the same handle if there inodes do not fit in 32bit.

You may try setting a rather small fsid (say 500) and reexport the fs and then
see, if the problem disappears. I think our problems dissappered then, but I
do not remember exactly. We now use inode32 to avoid the problem.


Regards,
--
Wolfgang Walter
Studentenwerk M?nchen
Anstalt des ?ffentlichen Rechts

2019-09-04 01:51:26

by Jason L Tibbitts III

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

I asked the XFS folks who mentioned that the issues with 64 bit inodes
are old, constrained to larger filesystems than what I'm using, not an
issue with nfsv4, and not present on anything but 32bit clients with old
userspace.

In any case, I have been experimenting a bit and somehow the issue seems
to be related to exporting with sec=krb5i:krb5p or sec=krb5i. If I
export with just sec=krb5p, things magically begin to work.

So basically:

[root@ld00 ~]# ls -l ~tester|wc -l; grep tester /proc/mounts
7685
nas00:/export/misc-00/tester /home/tester nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=krb5p,clientaddr=172.21.84.191,local_lock=none,addr=172.21.86.77 0 0

(unmount, then re-export with krb5i on the server)

[root@ld00 ~]# ls -l ~tester|wc -l; grep tester /proc/mounts
ls: reading directory '/home/tester': Input/output error
5623
nas00:/export/misc-00/tester /home/tester nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=krb5i,clientaddr=172.21.84.191,local_lock=none,addr=172.21.86.77 0 0

(umount, then re-export with krb5i:krb5p on the server)

[root@ld00 ~]# ls -l ~tester|wc -l; grep tester /proc/mounts
ls: reading directory '/home/tester': Input/output error
5623
nas00:/export/misc-00/tester /home/tester nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=krb5i,clientaddr=172.21.84.191,local_lock=none,addr=172.21.86.77 0 0

(umount, switch back to plain krb5p)

[root@ld00 ~]# ls -l ~tester|wc -l; grep tester /proc/mounts
7685
nas00:/export/misc-00/tester /home/tester nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=krb5p,clientaddr=172.21.84.191,local_lock=none,addr=172.21.86.77 0 0

Sometimes the number of files it lists before it fails changes (and in
this case has been as small as a few hundred) but I don't know what
causes it to change.

Anyway, I hope this helps to pinpoint the problem. I now have a really
easy way to reproduce this without having to kick people off of the
server, and if the successes aren't just some kind of false positives
then I guess I also have a workaround. I'm still at a loss as to why a
revert of the readdir changes makes any difference at all here.

- J<

2019-09-06 19:25:36

by J. Bruce Fields

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

On Tue, Sep 03, 2019 at 08:50:39PM -0500, Jason L Tibbitts III wrote:
> I asked the XFS folks who mentioned that the issues with 64 bit inodes
> are old, constrained to larger filesystems than what I'm using, not an
> issue with nfsv4, and not present on anything but 32bit clients with old
> userspace.
>
> In any case, I have been experimenting a bit and somehow the issue seems
> to be related to exporting with sec=krb5i:krb5p or sec=krb5i. If I
> export with just sec=krb5p, things magically begin to work.

That's interesting!

We've occasionally had bugs that are rare corner cases in the xdr
code--e.g. if the encoded directory data hits some limit at the same
time that we reach the end of a page, and the end of the page falls at
some offset with respect to the entry we're encoding.

Something like switching between krb5i and krb5p could affect the
offsets in a way that affected the likelihood of hitting such a case.
That's one guess, anyway.

> Anyway, I hope this helps to pinpoint the problem. I now have a really
> easy way to reproduce this without having to kick people off of the
> server, and if the successes aren't just some kind of false positives
> then I guess I also have a workaround. I'm still at a loss as to why a
> revert of the readdir changes makes any difference at all here.

Those readdir changes were client-side, right? Based on that I'd been
assuming a client bug, but maybe it'd be worth getting a full packet
capture of the readdir reply to make sure it's legit. Looking at it in
wireshark should tell us quickly whether it's corrupted somehow.

--b.

2019-09-07 17:55:56

by Jason L Tibbitts III

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

>>>>> "JBF" == J Bruce Fields <[email protected]> writes:

JBF> Those readdir changes were client-side, right? Based on that I'd
JBF> been assuming a client bug, but maybe it'd be worth getting a full
JBF> packet capture of the readdir reply to make sure it's legit.

I have been working with bcodding on IRC for the past couple of days on
this. Fortunately I was able to come up with way to fill up a directory
in such a way that it will fail with certainty and as a bonus doesn't
include any user data so I can feel OK about sharing packet captures. I
have a capture alongside a kernel trace of the problematic operation in
https://www.math.uh.edu/~tibbs/nfs/. Not that I can particularly tell
anything useful from that, but bcodding says that it seems to point to
some issue in sunrpc.

And because I can easily reproduce this and I was able to do a bisect:

2c94b8eca1a26cd46010d6e73a23da5f2e93a19d is the first bad commit
commit 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d
Author: Chuck Lever <[email protected]>
Date: Mon Feb 11 11:25:41 2019 -0500

SUNRPC: Use au_rslack when computing reply buffer size

au_rslack is significantly smaller than (au_cslack << 2). Using
that value results in smaller receive buffers. In some cases this
eliminates an extra segment in Reply chunks (RPC/RDMA).

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>

:040000 040000 d4d1ce2fbe0035c5bd9df976b8c448df85dcb505 7011a792dfe72ff9cd70d66e45d353f3d7817e3e M net

But of course, I can't say whether this is the actual bad commit or
whether it just introduced a behavior change which alters the conditions
under which the problem appears.

And just to make sure that the blame doesn't lie with the old RHEL7
kernel, I rsynced over the problematic directory to a machine running
something slightly more modern (5.1.11, which I know I need to update,
but it's already set up to do kerberised NFS) and the same problem
exists, though the directory listing does fail at a different place.

- J<

2019-09-08 13:01:23

by Benjamin Coddington

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

On 6 Sep 2019, at 16:50, Chuck Lever wrote:

>> On Sep 6, 2019, at 4:47 PM, Jason L Tibbitts III <[email protected]>
>> wrote:
>>
>>>>>>> "JBF" == J Bruce Fields <[email protected]> writes:
>>
>> JBF> Those readdir changes were client-side, right? Based on that
>> I'd
>> JBF> been assuming a client bug, but maybe it'd be worth getting a
>> full
>> JBF> packet capture of the readdir reply to make sure it's legit.
>>
>> I have been working with bcodding on IRC for the past couple of days
>> on
>> this. Fortunately I was able to come up with way to fill up a
>> directory
>> in such a way that it will fail with certainty and as a bonus doesn't
>> include any user data so I can feel OK about sharing packet captures.
>> I
>> have a capture alongside a kernel trace of the problematic operation
>> in
>> https://www.math.uh.edu/~tibbs/nfs/. Not that I can particularly
>> tell
>> anything useful from that, but bcodding says that it seems to point
>> to
>> some issue in sunrpc.
>>
>> And because I can easily reproduce this and I was able to do a
>> bisect:
>>
>> 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d is the first bad commit
>> commit 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d
>> Author: Chuck Lever <[email protected]>
>> Date: Mon Feb 11 11:25:41 2019 -0500
>>
>> SUNRPC: Use au_rslack when computing reply buffer size
>>
>> au_rslack is significantly smaller than (au_cslack << 2). Using
>> that value results in smaller receive buffers. In some cases this
>> eliminates an extra segment in Reply chunks (RPC/RDMA).
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> Signed-off-by: Anna Schumaker <[email protected]>
>>
>> :040000 040000 d4d1ce2fbe0035c5bd9df976b8c448df85dcb505
>> 7011a792dfe72ff9cd70d66e45d353f3d7817e3e M net
>>
>> But of course, I can't say whether this is the actual bad commit or
>> whether it just introduced a behavior change which alters the
>> conditions
>> under which the problem appears.
>
> The first place I'd start looking is the XDR constants at the head of
> fs/nfs/nfs4xdr.c
> having to do with READDIR.
>
> The report of behavior changes with the use of krb5p also makes this
> commit plausible.

After sprinkling the printk's, we're coming up one word short in the
receive
buffer. I think we're not accounting for the xdr pad of buf->pages for
NFS4
readdir -- but I need to check the RFCs. Anyone know if v4 READDIR
results
have to be aligned?

Also need to check just why krb5i is the only auth that cares..

Ben

2019-09-09 12:56:00

by Trond Myklebust

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

On Sun, 2019-09-08 at 07:39 -0400, Benjamin Coddington wrote:
> On 6 Sep 2019, at 16:50, Chuck Lever wrote:
>
> > > On Sep 6, 2019, at 4:47 PM, Jason L Tibbitts III <
> > > [email protected]>
> > > wrote:
> > >
> > > > > > > > "JBF" == J Bruce Fields <[email protected]> writes:
> > >
> > > JBF> Those readdir changes were client-side, right? Based on
> > > that
> > > I'd
> > > JBF> been assuming a client bug, but maybe it'd be worth getting
> > > a
> > > full
> > > JBF> packet capture of the readdir reply to make sure it's legit.
> > >
> > > I have been working with bcodding on IRC for the past couple of
> > > days
> > > on
> > > this. Fortunately I was able to come up with way to fill up a
> > > directory
> > > in such a way that it will fail with certainty and as a bonus
> > > doesn't
> > > include any user data so I can feel OK about sharing packet
> > > captures.
> > > I
> > > have a capture alongside a kernel trace of the problematic
> > > operation
> > > in
> > > https://www.math.uh.edu/~tibbs/nfs/. Not that I can
> > > particularly
> > > tell
> > > anything useful from that, but bcodding says that it seems to
> > > point
> > > to
> > > some issue in sunrpc.
> > >
> > > And because I can easily reproduce this and I was able to do a
> > > bisect:
> > >
> > > 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d is the first bad commit
> > > commit 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d
> > > Author: Chuck Lever <[email protected]>
> > > Date: Mon Feb 11 11:25:41 2019 -0500
> > >
> > > SUNRPC: Use au_rslack when computing reply buffer size
> > >
> > > au_rslack is significantly smaller than (au_cslack << 2).
> > > Using
> > > that value results in smaller receive buffers. In some cases
> > > this
> > > eliminates an extra segment in Reply chunks (RPC/RDMA).
> > >
> > > Signed-off-by: Chuck Lever <[email protected]>
> > > Signed-off-by: Anna Schumaker <[email protected]>
> > >
> > > :040000 040000 d4d1ce2fbe0035c5bd9df976b8c448df85dcb505
> > > 7011a792dfe72ff9cd70d66e45d353f3d7817e3e M net
> > >
> > > But of course, I can't say whether this is the actual bad commit
> > > or
> > > whether it just introduced a behavior change which alters the
> > > conditions
> > > under which the problem appears.
> >
> > The first place I'd start looking is the XDR constants at the head
> > of
> > fs/nfs/nfs4xdr.c
> > having to do with READDIR.
> >
> > The report of behavior changes with the use of krb5p also makes
> > this
> > commit plausible.
>
> After sprinkling the printk's, we're coming up one word short in the
> receive
> buffer. I think we're not accounting for the xdr pad of buf->pages
> for
> NFS4
> readdir -- but I need to check the RFCs. Anyone know if v4 READDIR
> results
> have to be aligned?
>
> Also need to check just why krb5i is the only auth that cares..
>

I'm not seeing that. If you look at commit 02ef04e432ba, you'll see
that Chuck did add a 'padding term' to decode_readdir_maxsz in the
NFSv4 case.
The other thing to remember is that a readdir 'dirlist4' entry is
always word aligned (irrespective of the length of the filename), so
there is no padding that needs to be taken into account.

I think we probably rather want to look at how auth->au_ralign is being
calculated for the case of krb5i. I'm really not understanding why
auth->au_ralign should not take into account the presence of the mic.
Chuck?


--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2019-09-09 13:01:23

by Chuck Lever III

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails



> On Sep 8, 2019, at 11:19 AM, Trond Myklebust <[email protected]> wrote:
>
> On Sun, 2019-09-08 at 07:39 -0400, Benjamin Coddington wrote:
>> On 6 Sep 2019, at 16:50, Chuck Lever wrote:
>>
>>>> On Sep 6, 2019, at 4:47 PM, Jason L Tibbitts III <
>>>> [email protected]>
>>>> wrote:
>>>>
>>>>>>>>> "JBF" == J Bruce Fields <[email protected]> writes:
>>>>
>>>> JBF> Those readdir changes were client-side, right? Based on
>>>> that
>>>> I'd
>>>> JBF> been assuming a client bug, but maybe it'd be worth getting
>>>> a
>>>> full
>>>> JBF> packet capture of the readdir reply to make sure it's legit.
>>>>
>>>> I have been working with bcodding on IRC for the past couple of
>>>> days
>>>> on
>>>> this. Fortunately I was able to come up with way to fill up a
>>>> directory
>>>> in such a way that it will fail with certainty and as a bonus
>>>> doesn't
>>>> include any user data so I can feel OK about sharing packet
>>>> captures.
>>>> I
>>>> have a capture alongside a kernel trace of the problematic
>>>> operation
>>>> in
>>>> https://www.math.uh.edu/~tibbs/nfs/. Not that I can
>>>> particularly
>>>> tell
>>>> anything useful from that, but bcodding says that it seems to
>>>> point
>>>> to
>>>> some issue in sunrpc.
>>>>
>>>> And because I can easily reproduce this and I was able to do a
>>>> bisect:
>>>>
>>>> 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d is the first bad commit
>>>> commit 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d
>>>> Author: Chuck Lever <[email protected]>
>>>> Date: Mon Feb 11 11:25:41 2019 -0500
>>>>
>>>> SUNRPC: Use au_rslack when computing reply buffer size
>>>>
>>>> au_rslack is significantly smaller than (au_cslack << 2).
>>>> Using
>>>> that value results in smaller receive buffers. In some cases
>>>> this
>>>> eliminates an extra segment in Reply chunks (RPC/RDMA).
>>>>
>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>> Signed-off-by: Anna Schumaker <[email protected]>
>>>>
>>>> :040000 040000 d4d1ce2fbe0035c5bd9df976b8c448df85dcb505
>>>> 7011a792dfe72ff9cd70d66e45d353f3d7817e3e M net
>>>>
>>>> But of course, I can't say whether this is the actual bad commit
>>>> or
>>>> whether it just introduced a behavior change which alters the
>>>> conditions
>>>> under which the problem appears.
>>>
>>> The first place I'd start looking is the XDR constants at the head
>>> of
>>> fs/nfs/nfs4xdr.c
>>> having to do with READDIR.
>>>
>>> The report of behavior changes with the use of krb5p also makes
>>> this
>>> commit plausible.
>>
>> After sprinkling the printk's, we're coming up one word short in the
>> receive
>> buffer. I think we're not accounting for the xdr pad of buf->pages
>> for
>> NFS4
>> readdir -- but I need to check the RFCs. Anyone know if v4 READDIR
>> results
>> have to be aligned?
>>
>> Also need to check just why krb5i is the only auth that cares..
>>
>
> I'm not seeing that. If you look at commit 02ef04e432ba, you'll see
> that Chuck did add a 'padding term' to decode_readdir_maxsz in the
> NFSv4 case.
> The other thing to remember is that a readdir 'dirlist4' entry is
> always word aligned (irrespective of the length of the filename), so
> there is no padding that needs to be taken into account.
>
> I think we probably rather want to look at how auth->au_ralign is being
> calculated for the case of krb5i. I'm really not understanding why
> auth->au_ralign should not take into account the presence of the mic.
> Chuck?

I'm looking at gss_unwrap_resp_integ():

1971 auth->au_rslack = auth->au_verfsize + 2 + 1 + XDR_QUADLEN(mic.len);
1972 auth->au_ralign = auth->au_verfsize + 2;

au_ralign now sets the alignment of the _start_ of the RPC message body.
The MIC comes _after_ the RPC message body for krb5i.

If Ben is off by one quad, that's not the MIC, which is typically 32 octets,
isn't it?

Maybe some variable-length data item in the returned file attributes is missing
an XDR pad.

--
Chuck Lever



2019-09-09 13:10:22

by Trond Myklebust

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

On Sun, 2019-09-08 at 11:48 -0400, Chuck Lever wrote:
> > On Sep 8, 2019, at 11:19 AM, Trond Myklebust <
> > [email protected]> wrote:
> >
> > On Sun, 2019-09-08 at 07:39 -0400, Benjamin Coddington wrote:
> > > On 6 Sep 2019, at 16:50, Chuck Lever wrote:
> > >
> > > > > On Sep 6, 2019, at 4:47 PM, Jason L Tibbitts III <
> > > > > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > > > > > "JBF" == J Bruce Fields <[email protected]>
> > > > > > > > > > writes:
> > > > >
> > > > > JBF> Those readdir changes were client-side, right? Based on
> > > > > that
> > > > > I'd
> > > > > JBF> been assuming a client bug, but maybe it'd be worth
> > > > > getting
> > > > > a
> > > > > full
> > > > > JBF> packet capture of the readdir reply to make sure it's
> > > > > legit.
> > > > >
> > > > > I have been working with bcodding on IRC for the past couple
> > > > > of
> > > > > days
> > > > > on
> > > > > this. Fortunately I was able to come up with way to fill up
> > > > > a
> > > > > directory
> > > > > in such a way that it will fail with certainty and as a bonus
> > > > > doesn't
> > > > > include any user data so I can feel OK about sharing packet
> > > > > captures.
> > > > > I
> > > > > have a capture alongside a kernel trace of the problematic
> > > > > operation
> > > > > in
> > > > > https://www.math.uh.edu/~tibbs/nfs/. Not that I can
> > > > > particularly
> > > > > tell
> > > > > anything useful from that, but bcodding says that it seems to
> > > > > point
> > > > > to
> > > > > some issue in sunrpc.
> > > > >
> > > > > And because I can easily reproduce this and I was able to do
> > > > > a
> > > > > bisect:
> > > > >
> > > > > 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d is the first bad
> > > > > commit
> > > > > commit 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d
> > > > > Author: Chuck Lever <[email protected]>
> > > > > Date: Mon Feb 11 11:25:41 2019 -0500
> > > > >
> > > > > SUNRPC: Use au_rslack when computing reply buffer size
> > > > >
> > > > > au_rslack is significantly smaller than (au_cslack << 2).
> > > > > Using
> > > > > that value results in smaller receive buffers. In some
> > > > > cases
> > > > > this
> > > > > eliminates an extra segment in Reply chunks (RPC/RDMA).
> > > > >
> > > > > Signed-off-by: Chuck Lever <[email protected]>
> > > > > Signed-off-by: Anna Schumaker <[email protected]>
> > > > >
> > > > > :040000 040000 d4d1ce2fbe0035c5bd9df976b8c448df85dcb505
> > > > > 7011a792dfe72ff9cd70d66e45d353f3d7817e3e M net
> > > > >
> > > > > But of course, I can't say whether this is the actual bad
> > > > > commit
> > > > > or
> > > > > whether it just introduced a behavior change which alters
> > > > > the
> > > > > conditions
> > > > > under which the problem appears.
> > > >
> > > > The first place I'd start looking is the XDR constants at the
> > > > head
> > > > of
> > > > fs/nfs/nfs4xdr.c
> > > > having to do with READDIR.
> > > >
> > > > The report of behavior changes with the use of krb5p also makes
> > > > this
> > > > commit plausible.
> > >
> > > After sprinkling the printk's, we're coming up one word short in
> > > the
> > > receive
> > > buffer. I think we're not accounting for the xdr pad of buf-
> > > >pages
> > > for
> > > NFS4
> > > readdir -- but I need to check the RFCs. Anyone know if v4
> > > READDIR
> > > results
> > > have to be aligned?
> > >
> > > Also need to check just why krb5i is the only auth that cares..
> > >
> >
> > I'm not seeing that. If you look at commit 02ef04e432ba, you'll see
> > that Chuck did add a 'padding term' to decode_readdir_maxsz in the
> > NFSv4 case.
> > The other thing to remember is that a readdir 'dirlist4' entry is
> > always word aligned (irrespective of the length of the filename),
> > so
> > there is no padding that needs to be taken into account.
> >
> > I think we probably rather want to look at how auth->au_ralign is
> > being
> > calculated for the case of krb5i. I'm really not understanding why
> > auth->au_ralign should not take into account the presence of the
> > mic.
> > Chuck?
>
> I'm looking at gss_unwrap_resp_integ():
>
> 1971 auth->au_rslack = auth->au_verfsize + 2 + 1 +
> XDR_QUADLEN(mic.len);
> 1972 auth->au_ralign = auth->au_verfsize + 2;
>
> au_ralign now sets the alignment of the _start_ of the RPC message
> body.
> The MIC comes _after_ the RPC message body for krb5i.
>
> If Ben is off by one quad, that's not the MIC, which is typically 32
> octets,
> isn't it?
>
> Maybe some variable-length data item in the returned file attributes
> is missing
> an XDR pad.

The only two pieces of variable length data in the readdir payload are
the file name and the filehandle data. Those might present a problem
when encoding on the server side, but not when decoding on the client
side, since they are embedded in the dirlist4 (which, as I said, is
automatically aligned).

Hmm... One thing that does bother me in both gss_unwrap_resp_integ()
and gss_unwrap_resp_priv() is that if the seqno does not match, then we
return EIO. What if we had to retransmit a request, but the server
managed to squeeze off a reply to the first transmission?
Note: it should be pretty easy to catch issues such as this, since we
do have tracepoints for them. That said, it is pretty hard to imagine
this being the problem here if the bug is always reproducible (since
retransmissions typically are not).

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2019-09-09 13:13:13

by Chuck Lever III

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails



> On Sep 8, 2019, at 12:47 PM, Trond Myklebust <[email protected]> wrote:
>
> On Sun, 2019-09-08 at 11:48 -0400, Chuck Lever wrote:
>>> On Sep 8, 2019, at 11:19 AM, Trond Myklebust <
>>> [email protected]> wrote:
>>>
>>> On Sun, 2019-09-08 at 07:39 -0400, Benjamin Coddington wrote:
>>>> On 6 Sep 2019, at 16:50, Chuck Lever wrote:
>>>>
>>>>>> On Sep 6, 2019, at 4:47 PM, Jason L Tibbitts III <
>>>>>> [email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>>>>>> "JBF" == J Bruce Fields <[email protected]>
>>>>>>>>>>> writes:
>>>>>>
>>>>>> JBF> Those readdir changes were client-side, right? Based on
>>>>>> that
>>>>>> I'd
>>>>>> JBF> been assuming a client bug, but maybe it'd be worth
>>>>>> getting
>>>>>> a
>>>>>> full
>>>>>> JBF> packet capture of the readdir reply to make sure it's
>>>>>> legit.
>>>>>>
>>>>>> I have been working with bcodding on IRC for the past couple
>>>>>> of
>>>>>> days
>>>>>> on
>>>>>> this. Fortunately I was able to come up with way to fill up
>>>>>> a
>>>>>> directory
>>>>>> in such a way that it will fail with certainty and as a bonus
>>>>>> doesn't
>>>>>> include any user data so I can feel OK about sharing packet
>>>>>> captures.
>>>>>> I
>>>>>> have a capture alongside a kernel trace of the problematic
>>>>>> operation
>>>>>> in
>>>>>> https://www.math.uh.edu/~tibbs/nfs/. Not that I can
>>>>>> particularly
>>>>>> tell
>>>>>> anything useful from that, but bcodding says that it seems to
>>>>>> point
>>>>>> to
>>>>>> some issue in sunrpc.
>>>>>>
>>>>>> And because I can easily reproduce this and I was able to do
>>>>>> a
>>>>>> bisect:
>>>>>>
>>>>>> 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d is the first bad
>>>>>> commit
>>>>>> commit 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d
>>>>>> Author: Chuck Lever <[email protected]>
>>>>>> Date: Mon Feb 11 11:25:41 2019 -0500
>>>>>>
>>>>>> SUNRPC: Use au_rslack when computing reply buffer size
>>>>>>
>>>>>> au_rslack is significantly smaller than (au_cslack << 2).
>>>>>> Using
>>>>>> that value results in smaller receive buffers. In some
>>>>>> cases
>>>>>> this
>>>>>> eliminates an extra segment in Reply chunks (RPC/RDMA).
>>>>>>
>>>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>>>> Signed-off-by: Anna Schumaker <[email protected]>
>>>>>>
>>>>>> :040000 040000 d4d1ce2fbe0035c5bd9df976b8c448df85dcb505
>>>>>> 7011a792dfe72ff9cd70d66e45d353f3d7817e3e M net
>>>>>>
>>>>>> But of course, I can't say whether this is the actual bad
>>>>>> commit
>>>>>> or
>>>>>> whether it just introduced a behavior change which alters
>>>>>> the
>>>>>> conditions
>>>>>> under which the problem appears.
>>>>>
>>>>> The first place I'd start looking is the XDR constants at the
>>>>> head
>>>>> of
>>>>> fs/nfs/nfs4xdr.c
>>>>> having to do with READDIR.
>>>>>
>>>>> The report of behavior changes with the use of krb5p also makes
>>>>> this
>>>>> commit plausible.
>>>>
>>>> After sprinkling the printk's, we're coming up one word short in
>>>> the
>>>> receive
>>>> buffer. I think we're not accounting for the xdr pad of buf-
>>>>> pages
>>>> for
>>>> NFS4
>>>> readdir -- but I need to check the RFCs. Anyone know if v4
>>>> READDIR
>>>> results
>>>> have to be aligned?
>>>>
>>>> Also need to check just why krb5i is the only auth that cares..
>>>>
>>>
>>> I'm not seeing that. If you look at commit 02ef04e432ba, you'll see
>>> that Chuck did add a 'padding term' to decode_readdir_maxsz in the
>>> NFSv4 case.
>>> The other thing to remember is that a readdir 'dirlist4' entry is
>>> always word aligned (irrespective of the length of the filename),
>>> so
>>> there is no padding that needs to be taken into account.
>>>
>>> I think we probably rather want to look at how auth->au_ralign is
>>> being
>>> calculated for the case of krb5i. I'm really not understanding why
>>> auth->au_ralign should not take into account the presence of the
>>> mic.
>>> Chuck?
>>
>> I'm looking at gss_unwrap_resp_integ():
>>
>> 1971 auth->au_rslack = auth->au_verfsize + 2 + 1 +
>> XDR_QUADLEN(mic.len);
>> 1972 auth->au_ralign = auth->au_verfsize + 2;
>>
>> au_ralign now sets the alignment of the _start_ of the RPC message
>> body.
>> The MIC comes _after_ the RPC message body for krb5i.
>>
>> If Ben is off by one quad, that's not the MIC, which is typically 32
>> octets,
>> isn't it?
>>
>> Maybe some variable-length data item in the returned file attributes
>> is missing
>> an XDR pad.
>
> The only two pieces of variable length data in the readdir payload are
> the file name and the filehandle data. Those might present a problem
> when encoding on the server side, but not when decoding on the client
> side, since they are embedded in the dirlist4 (which, as I said, is
> automatically aligned).

The next thing I'd try, then, is to match the Wireshark-dissected
READDIR4 reply that fails with the macros at the top of fs/nfs/nfs4xdr.c
and look for anything that is missing.


> Hmm... One thing that does bother me in both gss_unwrap_resp_integ()
> and gss_unwrap_resp_priv() is that if the seqno does not match, then we
> return EIO. What if we had to retransmit a request, but the server
> managed to squeeze off a reply to the first transmission?
> Note: it should be pretty easy to catch issues such as this, since we
> do have tracepoints for them. That said, it is pretty hard to imagine
> this being the problem here if the bug is always reproducible (since
> retransmissions typically are not).
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]

--
Chuck Lever



2019-09-11 16:26:31

by Benjamin Coddington

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

On 6 Sep 2019, at 16:50, Chuck Lever wrote:

>> On Sep 6, 2019, at 4:47 PM, Jason L Tibbitts III <[email protected]>
>> wrote:
>>
>>>>>>> "JBF" == J Bruce Fields <[email protected]> writes:
>>
>> JBF> Those readdir changes were client-side, right? Based on that
>> I'd
>> JBF> been assuming a client bug, but maybe it'd be worth getting a
>> full
>> JBF> packet capture of the readdir reply to make sure it's legit.
>>
>> I have been working with bcodding on IRC for the past couple of days
>> on
>> this. Fortunately I was able to come up with way to fill up a
>> directory
>> in such a way that it will fail with certainty and as a bonus doesn't
>> include any user data so I can feel OK about sharing packet captures.
>> I
>> have a capture alongside a kernel trace of the problematic operation
>> in
>> https://www.math.uh.edu/~tibbs/nfs/. Not that I can particularly
>> tell
>> anything useful from that, but bcodding says that it seems to point
>> to
>> some issue in sunrpc.
>>
>> And because I can easily reproduce this and I was able to do a
>> bisect:
>>
>> 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d is the first bad commit
>> commit 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d
>> Author: Chuck Lever <[email protected]>
>> Date: Mon Feb 11 11:25:41 2019 -0500
>>
>> SUNRPC: Use au_rslack when computing reply buffer size
>>
>> au_rslack is significantly smaller than (au_cslack << 2). Using
>> that value results in smaller receive buffers. In some cases this
>> eliminates an extra segment in Reply chunks (RPC/RDMA).
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> Signed-off-by: Anna Schumaker <[email protected]>
>>
>> :040000 040000 d4d1ce2fbe0035c5bd9df976b8c448df85dcb505
>> 7011a792dfe72ff9cd70d66e45d353f3d7817e3e M net
>>
>> But of course, I can't say whether this is the actual bad commit or
>> whether it just introduced a behavior change which alters the
>> conditions
>> under which the problem appears.
>
> The first place I'd start looking is the XDR constants at the head of
> fs/nfs/nfs4xdr.c
> having to do with READDIR.

Well, one thing is that I wonder about decode_readdir_maxsz including
decode_verifier_maxsz since it is part of READDIR4resok, and so should
be
included in the page data.. but that really doesn't matter here, I think
we
just end up with a larger head. Moving the start of the tail two words
forward doesn't seem to fix things..

I think the mic's xdr_buf_subsegment is getting partially split between
leftover space in the pages and the tail, so the checks for the length
of
the subbuf do not match the length of mic. In that case, we usually
just
have extra room in the tail to copy it over, but now that rslack is not
so
slack, the only extra room (as determined in xdr_buf_read_netobj())
would be
extra space in the page data, which we can't use.

The problem is that xdr_buf_read_netobj() is using xdr_buf_subsegment()
with
an offset in the response data to the mic, but xdr_buf_subsegement()
decides
what to do based on offsets in the already set-up xdr_buf. If the
server's
response leaves a hole in the page_len less than the mic, then the mic
subsegment straddles the page and the tail. Then we try to copy it to
the
end of the tail, but now there's not enough room.

Instead, I think we want to make sure the mic falls squarely into the
tail
every time.

xdr_buf_read_netobj() is only used for the mic, so it seems like we can
refactor it to just use the tail, or get rid of it altogether. I'll see
if
I can do that.

Read on, if you want some numbers to check my work..

Here's how to reproduce:
Init an xdr_buf of 176, so rq_rcvsize is:
4 RPC_REPHDRSIZE
19 auth->au_rslack
21 .p_replen
-----
44<<2 = 176

Then xdr_inline_pages() with:
- offset 140:
21 resp.hdr
4 RPC_REPHDRSIZE
11 auth->au_ralign
-1 xdr pad fixup
-----
35<<2 = 140

- base of 88 (that's static entries for . and ..)
- len of 262056 (64 pages - base of 88)

This gives us the following xdr_buf:

struct xdr_buf {
head->iov_len = 140,
tail->iov_len = 32,
page_len = 262056,
buflen = 262232
}

Then, get a READDIR response of total length 232208:

56 bytes to GSS data
76 bytes to READDIR4resok
262044 bytes of READDIR4resok
32 bytes of mic

That puts the mic at offset 262176.

That all looks right, and the response is well-formed.

Things go badly in xdr_read_netobj() with offset 262176:

We return -ENOMEM when
obj->len (28) is > than buf->buflen (262232) - buf->len (262208)

Ben

2019-09-11 16:44:14

by Chuck Lever III

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails



> On Sep 11, 2019, at 12:25 PM, Benjamin Coddington <[email protected]> wrote:
>
> On 6 Sep 2019, at 16:50, Chuck Lever wrote:
>
>>> On Sep 6, 2019, at 4:47 PM, Jason L Tibbitts III <[email protected]> wrote:
>>>
>>>>>>>> "JBF" == J Bruce Fields <[email protected]> writes:
>>>
>>> JBF> Those readdir changes were client-side, right? Based on that I'd
>>> JBF> been assuming a client bug, but maybe it'd be worth getting a full
>>> JBF> packet capture of the readdir reply to make sure it's legit.
>>>
>>> I have been working with bcodding on IRC for the past couple of days on
>>> this. Fortunately I was able to come up with way to fill up a directory
>>> in such a way that it will fail with certainty and as a bonus doesn't
>>> include any user data so I can feel OK about sharing packet captures. I
>>> have a capture alongside a kernel trace of the problematic operation in
>>> https://www.math.uh.edu/~tibbs/nfs/. Not that I can particularly tell
>>> anything useful from that, but bcodding says that it seems to point to
>>> some issue in sunrpc.
>>>
>>> And because I can easily reproduce this and I was able to do a bisect:
>>>
>>> 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d is the first bad commit
>>> commit 2c94b8eca1a26cd46010d6e73a23da5f2e93a19d
>>> Author: Chuck Lever <[email protected]>
>>> Date: Mon Feb 11 11:25:41 2019 -0500
>>>
>>> SUNRPC: Use au_rslack when computing reply buffer size
>>>
>>> au_rslack is significantly smaller than (au_cslack << 2). Using
>>> that value results in smaller receive buffers. In some cases this
>>> eliminates an extra segment in Reply chunks (RPC/RDMA).
>>>
>>> Signed-off-by: Chuck Lever <[email protected]>
>>> Signed-off-by: Anna Schumaker <[email protected]>
>>>
>>> :040000 040000 d4d1ce2fbe0035c5bd9df976b8c448df85dcb505 7011a792dfe72ff9cd70d66e45d353f3d7817e3e M net
>>>
>>> But of course, I can't say whether this is the actual bad commit or
>>> whether it just introduced a behavior change which alters the conditions
>>> under which the problem appears.
>>
>> The first place I'd start looking is the XDR constants at the head of fs/nfs/nfs4xdr.c
>> having to do with READDIR.
>
> Well, one thing is that I wonder about decode_readdir_maxsz including
> decode_verifier_maxsz since it is part of READDIR4resok, and so should be
> included in the page data.. but that really doesn't matter here, I think we
> just end up with a larger head. Moving the start of the tail two words
> forward doesn't seem to fix things..
>
> I think the mic's xdr_buf_subsegment is getting partially split between
> leftover space in the pages and the tail, so the checks for the length of
> the subbuf do not match the length of mic. In that case, we usually just
> have extra room in the tail to copy it over, but now that rslack is not so
> slack, the only extra room (as determined in xdr_buf_read_netobj()) would be
> extra space in the page data, which we can't use.
>
> The problem is that xdr_buf_read_netobj() is using xdr_buf_subsegment() with
> an offset in the response data to the mic, but xdr_buf_subsegement() decides
> what to do based on offsets in the already set-up xdr_buf. If the server's
> response leaves a hole in the page_len less than the mic, then the mic
> subsegment straddles the page and the tail. Then we try to copy it to the
> end of the tail, but now there's not enough room.
>
> Instead, I think we want to make sure the mic falls squarely into the tail
> every time.

I'm not clear how you could do that. The length of the page data is not
known to the client before it parses the reply. Are you suggesting that
gss_unwrap should do it somehow?


> xdr_buf_read_netobj() is only used for the mic, so it seems like we can
> refactor it to just use the tail, or get rid of it altogether. I'll see if
> I can do that.
>
> Read on, if you want some numbers to check my work..
>
> Here's how to reproduce:
> Init an xdr_buf of 176, so rq_rcvsize is:
> 4 RPC_REPHDRSIZE
> 19 auth->au_rslack
> 21 .p_replen
> -----
> 44<<2 = 176
>
> Then xdr_inline_pages() with:
> - offset 140:
> 21 resp.hdr
> 4 RPC_REPHDRSIZE
> 11 auth->au_ralign
> -1 xdr pad fixup
> -----
> 35<<2 = 140
>
> - base of 88 (that's static entries for . and ..)
> - len of 262056 (64 pages - base of 88)
>
> This gives us the following xdr_buf:
>
> struct xdr_buf {
> head->iov_len = 140,
> tail->iov_len = 32,
> page_len = 262056,
> buflen = 262232
> }
>
> Then, get a READDIR response of total length 232208:
>
> 56 bytes to GSS data
> 76 bytes to READDIR4resok
> 262044 bytes of READDIR4resok
> 32 bytes of mic
>
> That puts the mic at offset 262176.
>
> That all looks right, and the response is well-formed.
>
> Things go badly in xdr_read_netobj() with offset 262176:
>
> We return -ENOMEM when
> obj->len (28) is > than buf->buflen (262232) - buf->len (262208)
>
> Ben

--
Chuck Lever



2019-09-11 17:27:10

by Benjamin Coddington

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails


On 11 Sep 2019, at 12:39, Chuck Lever wrote:

>> On Sep 11, 2019, at 12:25 PM, Benjamin Coddington
>> <[email protected]> wrote:
>>

>> Instead, I think we want to make sure the mic falls squarely into the
>> tail
>> every time.
>
> I'm not clear how you could do that. The length of the page data is
> not
> known to the client before it parses the reply. Are you suggesting
> that
> gss_unwrap should do it somehow?

Is it too niave to always put the mic at the end of the tail?

Ben

2019-09-11 17:30:16

by Benjamin Coddington

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails



On 11 Sep 2019, at 13:26, Benjamin Coddington wrote:

> On 11 Sep 2019, at 12:39, Chuck Lever wrote:
>
>>> On Sep 11, 2019, at 12:25 PM, Benjamin Coddington
>>> <[email protected]> wrote:
>>>
>
>>> Instead, I think we want to make sure the mic falls squarely into
>>> the tail
>>> every time.
>>
>> I'm not clear how you could do that. The length of the page data is
>> not
>> known to the client before it parses the reply. Are you suggesting
>> that
>> gss_unwrap should do it somehow?
>
> Is it too niave to always put the mic at the end of the tail?

Naive, even..?

Ben

2019-09-11 17:35:13

by Chuck Lever III

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails



> On Sep 11, 2019, at 1:26 PM, Benjamin Coddington <[email protected]> wrote:
>
>
> On 11 Sep 2019, at 12:39, Chuck Lever wrote:
>
>>> On Sep 11, 2019, at 12:25 PM, Benjamin Coddington <[email protected]> wrote:
>>>
>
>>> Instead, I think we want to make sure the mic falls squarely into the tail
>>> every time.
>>
>> I'm not clear how you could do that. The length of the page data is not
>> known to the client before it parses the reply. Are you suggesting that
>> gss_unwrap should do it somehow?
>
> Is it too niave to always put the mic at the end of the tail?

The size of the page content is variable.

The only way the MIC will fall into the tail is if the page content is
exactly the largest expected size. When the page content is smaller than
that, the receive logic will place part or all of the MIC in ->pages.


--
Chuck Lever



2019-09-11 17:42:00

by Benjamin Coddington

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

On 11 Sep 2019, at 13:29, Chuck Lever wrote:

>> On Sep 11, 2019, at 1:26 PM, Benjamin Coddington
>> <[email protected]> wrote:
>>
>>
>> On 11 Sep 2019, at 12:39, Chuck Lever wrote:
>>
>>>> On Sep 11, 2019, at 12:25 PM, Benjamin Coddington
>>>> <[email protected]> wrote:
>>>>
>>
>>>> Instead, I think we want to make sure the mic falls squarely into
>>>> the tail
>>>> every time.
>>>
>>> I'm not clear how you could do that. The length of the page data is
>>> not
>>> known to the client before it parses the reply. Are you suggesting
>>> that
>>> gss_unwrap should do it somehow?
>>
>> Is it too niave to always put the mic at the end of the tail?
>
> The size of the page content is variable.
>
> The only way the MIC will fall into the tail is if the page content is
> exactly the largest expected size. When the page content is smaller
> than
> that, the receive logic will place part or all of the MIC in ->pages.

Ok, right. But what I meant is that xdr_buf_read_netobj() should be
renamed
and repurposed to be "move the mic from wherever it is to the end of
xdr_buf's tail".

But now I see what you mean, and I also see that it is already trying to
do
that.. and we don't want to overlap the copy..

So, really, we need the tail to be larger than twice the mic.. less 1.
That
means the fix is probably just increasing rslack for krb5i.

Ben

2019-09-11 17:48:03

by Chuck Lever III

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails



> On Sep 11, 2019, at 1:40 PM, Benjamin Coddington <[email protected]> wrote:
>
> On 11 Sep 2019, at 13:29, Chuck Lever wrote:
>
>>> On Sep 11, 2019, at 1:26 PM, Benjamin Coddington <[email protected]> wrote:
>>>
>>>
>>> On 11 Sep 2019, at 12:39, Chuck Lever wrote:
>>>
>>>>> On Sep 11, 2019, at 12:25 PM, Benjamin Coddington <[email protected]> wrote:
>>>>>
>>>
>>>>> Instead, I think we want to make sure the mic falls squarely into the tail
>>>>> every time.
>>>>
>>>> I'm not clear how you could do that. The length of the page data is not
>>>> known to the client before it parses the reply. Are you suggesting that
>>>> gss_unwrap should do it somehow?
>>>
>>> Is it too niave to always put the mic at the end of the tail?
>>
>> The size of the page content is variable.
>>
>> The only way the MIC will fall into the tail is if the page content is
>> exactly the largest expected size. When the page content is smaller than
>> that, the receive logic will place part or all of the MIC in ->pages.
>
> Ok, right. But what I meant is that xdr_buf_read_netobj() should be renamed
> and repurposed to be "move the mic from wherever it is to the end of
> xdr_buf's tail".
>
> But now I see what you mean, and I also see that it is already trying to do
> that.. and we don't want to overlap the copy..
>
> So, really, we need the tail to be larger than twice the mic.. less 1. That
> means the fix is probably just increasing rslack for krb5i.

What's the justification for that particular maximum size? Are you sure the
page contents are not spilling into the tail?

--
Chuck Lever



2019-09-11 17:52:41

by Benjamin Coddington

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

On 11 Sep 2019, at 13:40, Benjamin Coddington wrote:

> On 11 Sep 2019, at 13:29, Chuck Lever wrote:
>
>>> On Sep 11, 2019, at 1:26 PM, Benjamin Coddington
>>> <[email protected]> wrote:
>>>
>>>
>>> On 11 Sep 2019, at 12:39, Chuck Lever wrote:
>>>
>>>>> On Sep 11, 2019, at 12:25 PM, Benjamin Coddington
>>>>> <[email protected]> wrote:
>>>>>
>>>
>>>>> Instead, I think we want to make sure the mic falls squarely into
>>>>> the tail
>>>>> every time.
>>>>
>>>> I'm not clear how you could do that. The length of the page data is
>>>> not
>>>> known to the client before it parses the reply. Are you suggesting
>>>> that
>>>> gss_unwrap should do it somehow?
>>>
>>> Is it too niave to always put the mic at the end of the tail?
>>
>> The size of the page content is variable.
>>
>> The only way the MIC will fall into the tail is if the page content
>> is
>> exactly the largest expected size. When the page content is smaller
>> than
>> that, the receive logic will place part or all of the MIC in ->pages.
>
> Ok, right. But what I meant is that xdr_buf_read_netobj() should be
> renamed
> and repurposed to be "move the mic from wherever it is to the end of
> xdr_buf's tail".
>
> But now I see what you mean, and I also see that it is already trying
> to do
> that.. and we don't want to overlap the copy..
>
> So, really, we need the tail to be larger than twice the mic.. less 1.
> That
> means the fix is probably just increasing rslack for krb5i.

.. or we can keep the tighter tail space, and if we detect the mic
straddles
the page and tail, we can move the mic into the tail with 2 copies,
first
move the bit in the tail back, then move the bit in the pages.

Which is preferred, less allocation, or in the rare case this occurs,
doing
copy twice?

Ben

2019-09-11 17:57:03

by Chuck Lever III

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails



> On Sep 11, 2019, at 1:50 PM, Benjamin Coddington <[email protected]> wrote:
>
> On 11 Sep 2019, at 13:40, Benjamin Coddington wrote:
>
>> On 11 Sep 2019, at 13:29, Chuck Lever wrote:
>>
>>>> On Sep 11, 2019, at 1:26 PM, Benjamin Coddington <[email protected]> wrote:
>>>>
>>>>
>>>> On 11 Sep 2019, at 12:39, Chuck Lever wrote:
>>>>
>>>>>> On Sep 11, 2019, at 12:25 PM, Benjamin Coddington <[email protected]> wrote:
>>>>>>
>>>>
>>>>>> Instead, I think we want to make sure the mic falls squarely into the tail
>>>>>> every time.
>>>>>
>>>>> I'm not clear how you could do that. The length of the page data is not
>>>>> known to the client before it parses the reply. Are you suggesting that
>>>>> gss_unwrap should do it somehow?
>>>>
>>>> Is it too niave to always put the mic at the end of the tail?
>>>
>>> The size of the page content is variable.
>>>
>>> The only way the MIC will fall into the tail is if the page content is
>>> exactly the largest expected size. When the page content is smaller than
>>> that, the receive logic will place part or all of the MIC in ->pages.
>>
>> Ok, right. But what I meant is that xdr_buf_read_netobj() should be renamed
>> and repurposed to be "move the mic from wherever it is to the end of
>> xdr_buf's tail".
>>
>> But now I see what you mean, and I also see that it is already trying to do
>> that.. and we don't want to overlap the copy..
>>
>> So, really, we need the tail to be larger than twice the mic.. less 1. That
>> means the fix is probably just increasing rslack for krb5i.
>
> .. or we can keep the tighter tail space, and if we detect the mic straddles
> the page and tail, we can move the mic into the tail with 2 copies, first
> move the bit in the tail back, then move the bit in the pages.
>
> Which is preferred, less allocation, or in the rare case this occurs, doing
> copy twice?

It sounds like the bug is that the current code does not deal correctly
when the MIC crosses the boundary between ->pages and ->tail? I'd like
to see that addressed rather than changing rslack.


--
Chuck Lever



2019-09-11 17:59:49

by Benjamin Coddington

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

On 11 Sep 2019, at 13:43, Chuck Lever wrote:

>> On Sep 11, 2019, at 1:40 PM, Benjamin Coddington
>> <[email protected]> wrote:
>>
>> On 11 Sep 2019, at 13:29, Chuck Lever wrote:
>>
>>>> On Sep 11, 2019, at 1:26 PM, Benjamin Coddington
>>>> <[email protected]> wrote:
>>>>
>>>>
>>>> On 11 Sep 2019, at 12:39, Chuck Lever wrote:
>>>>
>>>>>> On Sep 11, 2019, at 12:25 PM, Benjamin Coddington
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>
>>>>>> Instead, I think we want to make sure the mic falls squarely into
>>>>>> the tail
>>>>>> every time.
>>>>>
>>>>> I'm not clear how you could do that. The length of the page data
>>>>> is not
>>>>> known to the client before it parses the reply. Are you suggesting
>>>>> that
>>>>> gss_unwrap should do it somehow?
>>>>
>>>> Is it too niave to always put the mic at the end of the tail?
>>>
>>> The size of the page content is variable.
>>>
>>> The only way the MIC will fall into the tail is if the page content
>>> is
>>> exactly the largest expected size. When the page content is smaller
>>> than
>>> that, the receive logic will place part or all of the MIC in
>>> ->pages.
>>
>> Ok, right. But what I meant is that xdr_buf_read_netobj() should be
>> renamed
>> and repurposed to be "move the mic from wherever it is to the end of
>> xdr_buf's tail".
>>
>> But now I see what you mean, and I also see that it is already trying
>> to do
>> that.. and we don't want to overlap the copy..
>>
>> So, really, we need the tail to be larger than twice the mic.. less
>> 1. That
>> means the fix is probably just increasing rslack for krb5i.
>
> What's the justification for that particular maximum size? Are you
> sure the
> page contents are not spilling into the tail?

In the problem case, I am sure they are not.

The justification is that if the mic straddles pages and tail, today we
try
to copy it to the end of the tail. The room we'd need for that is the
size
of the mic less any of it that is up in the pages.

2019-09-12 12:31:01

by Benjamin Coddington

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

On 11 Sep 2019, at 13:54, Chuck Lever wrote:

>> On Sep 11, 2019, at 1:50 PM, Benjamin Coddington
>> <[email protected]> wrote:
>>
>> On 11 Sep 2019, at 13:40, Benjamin Coddington wrote:
>>
>>> On 11 Sep 2019, at 13:29, Chuck Lever wrote:
>>>
>>>>> On Sep 11, 2019, at 1:26 PM, Benjamin Coddington
>>>>> <[email protected]> wrote:
>>>>>
>>>>>
>>>>> On 11 Sep 2019, at 12:39, Chuck Lever wrote:
>>>>>
>>>>>>> On Sep 11, 2019, at 12:25 PM, Benjamin Coddington
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>
>>>>>>> Instead, I think we want to make sure the mic falls squarely
>>>>>>> into the tail
>>>>>>> every time.
>>>>>>
>>>>>> I'm not clear how you could do that. The length of the page data
>>>>>> is not
>>>>>> known to the client before it parses the reply. Are you
>>>>>> suggesting that
>>>>>> gss_unwrap should do it somehow?
>>>>>
>>>>> Is it too niave to always put the mic at the end of the tail?
>>>>
>>>> The size of the page content is variable.
>>>>
>>>> The only way the MIC will fall into the tail is if the page content
>>>> is
>>>> exactly the largest expected size. When the page content is smaller
>>>> than
>>>> that, the receive logic will place part or all of the MIC in
>>>> ->pages.
>>>
>>> Ok, right. But what I meant is that xdr_buf_read_netobj() should be
>>> renamed
>>> and repurposed to be "move the mic from wherever it is to the end of
>>> xdr_buf's tail".
>>>
>>> But now I see what you mean, and I also see that it is already
>>> trying to do
>>> that.. and we don't want to overlap the copy..
>>>
>>> So, really, we need the tail to be larger than twice the mic.. less
>>> 1. That
>>> means the fix is probably just increasing rslack for krb5i.
>>
>> .. or we can keep the tighter tail space, and if we detect the mic
>> straddles
>> the page and tail, we can move the mic into the tail with 2 copies,
>> first
>> move the bit in the tail back, then move the bit in the pages.
>>
>> Which is preferred, less allocation, or in the rare case this occurs,
>> doing
>> copy twice?
>
> It sounds like the bug is that the current code does not deal
> correctly
> when the MIC crosses the boundary between ->pages and ->tail? I'd like
> to see that addressed rather than changing rslack.

Here's what I'm about to run through my testing:

diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
index 48c93b9e525e..d6ffc9011269 100644
--- a/net/sunrpc/xdr.c
+++ b/net/sunrpc/xdr.c
@@ -1238,14 +1238,21 @@ EXPORT_SYMBOL_GPL(xdr_encode_word);

/* If the netobj starting offset bytes from the start of xdr_buf is
contained
* entirely in the head or the tail, set object to point to it;
otherwise
- * try to find space for it at the end of the tail, copy it there, and
- * set obj to point to it. */
+ * try to find space for it at the end of the tail, and copy it there.
If
+ * the netobj is partly within the page data and tail, shrink the pages
to
+ * move the object into the tail */
int xdr_buf_read_netobj(struct xdr_buf *buf, struct xdr_netobj *obj,
unsigned int offset)
{
struct xdr_buf subbuf;
+ unsigned int page_range;

if (xdr_decode_word(buf, offset, &obj->len))
return -EFAULT;
+
+ page_range = buf->head->iov_len + buf->page_len - offset + 4;
+ if (page_range > 0 && page_range < obj->len)
+ xdr_shrink_pagelen(buf, page_range);
+
if (xdr_buf_subsegment(buf, &subbuf, offset + 4, obj->len))
return -EFAULT;


Is the use of xdr_shrink_pagelen() at this point in the decoding a
problem for RDMA?

Ben

2019-09-12 12:54:50

by Trond Myklebust

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

On Thu, 2019-09-12 at 08:29 -0400, Benjamin Coddington wrote:
> On 11 Sep 2019, at 13:54, Chuck Lever wrote:
>
> > > On Sep 11, 2019, at 1:50 PM, Benjamin Coddington
> > > <[email protected]> wrote:
> > >
> > > On 11 Sep 2019, at 13:40, Benjamin Coddington wrote:
> > >
> > > > On 11 Sep 2019, at 13:29, Chuck Lever wrote:
> > > >
> > > > > > On Sep 11, 2019, at 1:26 PM, Benjamin Coddington
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > >
> > > > > > On 11 Sep 2019, at 12:39, Chuck Lever wrote:
> > > > > >
> > > > > > > > On Sep 11, 2019, at 12:25 PM, Benjamin Coddington
> > > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Instead, I think we want to make sure the mic falls
> > > > > > > > squarely
> > > > > > > > into the tail
> > > > > > > > every time.
> > > > > > >
> > > > > > > I'm not clear how you could do that. The length of the
> > > > > > > page data
> > > > > > > is not
> > > > > > > known to the client before it parses the reply. Are you
> > > > > > > suggesting that
> > > > > > > gss_unwrap should do it somehow?
> > > > > >
> > > > > > Is it too niave to always put the mic at the end of the
> > > > > > tail?
> > > > >
> > > > > The size of the page content is variable.
> > > > >
> > > > > The only way the MIC will fall into the tail is if the page
> > > > > content
> > > > > is
> > > > > exactly the largest expected size. When the page content is
> > > > > smaller
> > > > > than
> > > > > that, the receive logic will place part or all of the MIC in
> > > > > ->pages.
> > > >
> > > > Ok, right. But what I meant is that xdr_buf_read_netobj()
> > > > should be
> > > > renamed
> > > > and repurposed to be "move the mic from wherever it is to the
> > > > end of
> > > > xdr_buf's tail".
> > > >
> > > > But now I see what you mean, and I also see that it is already
> > > > trying to do
> > > > that.. and we don't want to overlap the copy..
> > > >
> > > > So, really, we need the tail to be larger than twice the mic..
> > > > less
> > > > 1. That
> > > > means the fix is probably just increasing rslack for krb5i.
> > >
> > > .. or we can keep the tighter tail space, and if we detect the
> > > mic
> > > straddles
> > > the page and tail, we can move the mic into the tail with 2
> > > copies,
> > > first
> > > move the bit in the tail back, then move the bit in the pages.
> > >
> > > Which is preferred, less allocation, or in the rare case this
> > > occurs,
> > > doing
> > > copy twice?
> >
> > It sounds like the bug is that the current code does not deal
> > correctly
> > when the MIC crosses the boundary between ->pages and ->tail? I'd
> > like
> > to see that addressed rather than changing rslack.
>
> Here's what I'm about to run through my testing:
>
> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
> index 48c93b9e525e..d6ffc9011269 100644
> --- a/net/sunrpc/xdr.c
> +++ b/net/sunrpc/xdr.c
> @@ -1238,14 +1238,21 @@ EXPORT_SYMBOL_GPL(xdr_encode_word);
>
> /* If the netobj starting offset bytes from the start of xdr_buf
> is
> contained
> * entirely in the head or the tail, set object to point to it;
> otherwise
> - * try to find space for it at the end of the tail, copy it there,
> and
> - * set obj to point to it. */
> + * try to find space for it at the end of the tail, and copy it
> there.
> If
> + * the netobj is partly within the page data and tail, shrink the
> pages
> to
> + * move the object into the tail */
> int xdr_buf_read_netobj(struct xdr_buf *buf, struct xdr_netobj
> *obj,
> unsigned int offset)
> {
> struct xdr_buf subbuf;
> + unsigned int page_range;
>
> if (xdr_decode_word(buf, offset, &obj->len))
> return -EFAULT;
> +
> + page_range = buf->head->iov_len + buf->page_len - offset + 4;
> + if (page_range > 0 && page_range < obj->len)
> + xdr_shrink_pagelen(buf, page_range);
> +
> if (xdr_buf_subsegment(buf, &subbuf, offset + 4, obj->len))
> return -EFAULT;
>
>

Let's please just scrap this function and rewrite it as a generic
function for reading the MIC. It clearly is not a generic function for
reading arbitrary netobjs, and modifications like the above just make
the misnomer painfully obvious.

Let's rewrite it as xdr_buf_read_mic() so that we can simplify it where
possible.


Thanks
Trond

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2019-09-12 13:13:15

by Benjamin Coddington

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails


On 12 Sep 2019, at 8:53, Trond Myklebust wrote:
> Let's please just scrap this function and rewrite it as a generic
> function for reading the MIC. It clearly is not a generic function for
> reading arbitrary netobjs, and modifications like the above just make
> the misnomer painfully obvious.
>
> Let's rewrite it as xdr_buf_read_mic() so that we can simplify it where
> possible.

Ok. I want to assume the mic will not land in the head, but I am not sure..
Is there a scenario where the mic might land in the head, or is that bit of
the current function left over from other uses?

Ben

2019-09-12 13:15:31

by J. Bruce Fields

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

On Thu, Sep 12, 2019 at 09:08:51AM -0400, Benjamin Coddington wrote:
>
> On 12 Sep 2019, at 8:53, Trond Myklebust wrote:
> > Let's please just scrap this function and rewrite it as a generic
> > function for reading the MIC. It clearly is not a generic function for
> > reading arbitrary netobjs, and modifications like the above just make
> > the misnomer painfully obvious.
> >
> > Let's rewrite it as xdr_buf_read_mic() so that we can simplify it where
> > possible.
>
> Ok. I want to assume the mic will not land in the head, but I am not sure..
> Is there a scenario where the mic might land in the head, or is that bit of
> the current function left over from other uses?

Any reply that doesn't have page data?

A reply that ends up shorter than expected due to failure of an early op
in the compound?

(Unless I'm missing something. I haven't looked at this code in a
while. Though it was problem me that wrote it originally--apologies for
that....)

--b.

2019-09-12 13:17:52

by Trond Myklebust

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

On Thu, 2019-09-12 at 09:08 -0400, Benjamin Coddington wrote:
> On 12 Sep 2019, at 8:53, Trond Myklebust wrote:
> > Let's please just scrap this function and rewrite it as a generic
> > function for reading the MIC. It clearly is not a generic function
> > for
> > reading arbitrary netobjs, and modifications like the above just
> > make
> > the misnomer painfully obvious.
> >
> > Let's rewrite it as xdr_buf_read_mic() so that we can simplify it
> > where
> > possible.
>
> Ok. I want to assume the mic will not land in the head, but I am not
> sure..
> Is there a scenario where the mic might land in the head, or is that
> bit of
> the current function left over from other uses?

It is likely to land in the head for most RPC calls that have short
replies. Particularly in error cases.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2019-09-12 13:26:59

by Trond Myklebust

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

On Thu, 2019-09-12 at 09:13 -0400, J. Bruce Fields wrote:
> (Unless I'm missing something. I haven't looked at this code in a
> while. Though it was problem me that wrote it originally--apologies
> for
> that....)
>

The function itself is fine. It was just the name I'm objecting to,
since we're actually moving the last 'n' bytes in the message in order
to be able to read them.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2019-09-12 13:38:16

by Benjamin Coddington

[permalink] [raw]
Subject: Re: Regression in 5.1.20: Reading long directory fails

On 12 Sep 2019, at 9:25, Trond Myklebust wrote:

> On Thu, 2019-09-12 at 09:13 -0400, J. Bruce Fields wrote:
>> (Unless I'm missing something. I haven't looked at this code in a
>> while. Though it was problem me that wrote it originally--apologies
>> for
>> that....)
>>
>
> The function itself is fine. It was just the name I'm objecting to,
> since we're actually moving the last 'n' bytes in the message in order
> to be able to read them.

Ok, that's helpful guidance since it saves me from doing a stable fix and
then an attempt to rename/optimize/breakitagain.

I'll just rename it at the same time as the fix.. but now I wonder if that
can potentially mess up other fixes that might retroactively get sent to
stable. Maybe I'm over thinking it. I guess I'll send the fix and then the
rename separately, and maintainers can squash at will.

Ben