2023-01-31 21:15:48

by Chuck Lever

[permalink] [raw]
Subject: git regression failures with v6.2-rc NFS client

Hi-

I upgraded my test client's kernel to v6.2-rc5 and now I get
failures during the git regression suite on all NFS versions.
I bisected to:

85aa8ddc3818 ("NFS: Trigger the "ls -l" readdir heuristic sooner")

The failure looks like:

not ok 6 - git am --skip succeeds despite D/F conflict
#
# test_when_finished "git -C df_plus_edit_edit clean -f" &&
# test_when_finished "git -C df_plus_edit_edit reset --hard" &&
# (
# cd df_plus_edit_edit &&
#
# git checkout f-edit^0 &&
# git format-patch -1 d-edit &&
# test_must_fail git am -3 0001*.patch &&
#
# git am --skip &&
# test_path_is_missing .git/rebase-apply &&
# git ls-files -u >conflicts &&
# test_must_be_empty conflicts
# )
#
# failed 1 among 6 test(s)
1..6
make[2]: *** [Makefile:60: t1015-read-index-unmerged.sh] Error 1
make[2]: *** Waiting for unfinished jobs....

The regression suite is run like this:

RESULTS= some random directory under /tmp
RELEASE="git-2.37.1"

rm -f ${RELEASE}.tar.gz
curl --no-progress-meter -O https://mirrors.edge.kernel.org/pub/software/scm/git/${RELEASE}.tar.gz
/usr/bin/time tar zxf ${RELEASE}.tar.gz >> ${RESULTS}/git 2>&1

cd ${RELEASE}
make clean >> ${RESULTS}/git 2>&1
/usr/bin/time make -j${THREADS} all doc >> ${RESULTS}/git 2>&1

/usr/bin/time make -j${THREADS} test >> ${RESULTS}/git 2>&1

On this client, THREADS=12. A single-thread run doesn't seem to
trigger a problem. So unfortunately the specific data I have is
going to be noisy.

--
Chuck Lever





2023-01-31 22:03:13

by Benjamin Coddington

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client

On 31 Jan 2023, at 16:15, Chuck Lever III wrote:

> Hi-
>
> I upgraded my test client's kernel to v6.2-rc5 and now I get
> failures during the git regression suite on all NFS versions.
> I bisected to:
>
> 85aa8ddc3818 ("NFS: Trigger the "ls -l" readdir heuristic sooner")
>
> The failure looks like:
>
> not ok 6 - git am --skip succeeds despite D/F conflict
> #
> # test_when_finished "git -C df_plus_edit_edit clean -f" &&
> # test_when_finished "git -C df_plus_edit_edit reset --hard" &&
> # (
> # cd df_plus_edit_edit &&
> #
> # git checkout f-edit^0 &&
> # git format-patch -1 d-edit &&
> # test_must_fail git am -3 0001*.patch &&
> #
> # git am --skip &&
> # test_path_is_missing .git/rebase-apply &&
> # git ls-files -u >conflicts &&
> # test_must_be_empty conflicts
> # )
> #
> # failed 1 among 6 test(s)
> 1..6
> make[2]: *** [Makefile:60: t1015-read-index-unmerged.sh] Error 1
> make[2]: *** Waiting for unfinished jobs....
>
> The regression suite is run like this:
>
> RESULTS= some random directory under /tmp
> RELEASE="git-2.37.1"
>
> rm -f ${RELEASE}.tar.gz
> curl --no-progress-meter -O https://mirrors.edge.kernel.org/pub/software/scm/git/${RELEASE}.tar.gz
> /usr/bin/time tar zxf ${RELEASE}.tar.gz >> ${RESULTS}/git 2>&1
>
> cd ${RELEASE}
> make clean >> ${RESULTS}/git 2>&1
> /usr/bin/time make -j${THREADS} all doc >> ${RESULTS}/git 2>&1
>
> /usr/bin/time make -j${THREADS} test >> ${RESULTS}/git 2>&1
>
> On this client, THREADS=12. A single-thread run doesn't seem to
> trigger a problem. So unfortunately the specific data I have is
> going to be noisy.

I'll attempt to reproduce this and see what's up. This is an export of
tmpfs? If so, I suspect you might be running into tmpfs' unstable cookie
problem when two processes race through nfs_do_filldir().. and if so, the
cached listing of the directory on the client won't match a listing on the
server.

Ben


2023-02-01 14:11:24

by Benjamin Coddington

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client

On 31 Jan 2023, at 17:02, Benjamin Coddington wrote:

> On 31 Jan 2023, at 16:15, Chuck Lever III wrote:
>
>> Hi-
>>
>> I upgraded my test client's kernel to v6.2-rc5 and now I get
>> failures during the git regression suite on all NFS versions.
>> I bisected to:
>>
>> 85aa8ddc3818 ("NFS: Trigger the "ls -l" readdir heuristic sooner")
>>
>> The failure looks like:
>>
>> not ok 6 - git am --skip succeeds despite D/F conflict
>> #
>> # test_when_finished "git -C df_plus_edit_edit clean -f" &&
>> # test_when_finished "git -C df_plus_edit_edit reset --hard" &&
>> # (
>> # cd df_plus_edit_edit &&
>> #
>> # git checkout f-edit^0 &&
>> # git format-patch -1 d-edit &&
>> # test_must_fail git am -3 0001*.patch &&
>> #
>> # git am --skip &&
>> # test_path_is_missing .git/rebase-apply &&
>> # git ls-files -u >conflicts &&
>> # test_must_be_empty conflicts
>> # )
>> #
>> # failed 1 among 6 test(s)
>> 1..6
>> make[2]: *** [Makefile:60: t1015-read-index-unmerged.sh] Error 1
>> make[2]: *** Waiting for unfinished jobs....
>>
>> The regression suite is run like this:
>>
>> RESULTS= some random directory under /tmp
>> RELEASE="git-2.37.1"
>>
>> rm -f ${RELEASE}.tar.gz
>> curl --no-progress-meter -O https://mirrors.edge.kernel.org/pub/software/scm/git/${RELEASE}.tar.gz
>> /usr/bin/time tar zxf ${RELEASE}.tar.gz >> ${RESULTS}/git 2>&1
>>
>> cd ${RELEASE}
>> make clean >> ${RESULTS}/git 2>&1
>> /usr/bin/time make -j${THREADS} all doc >> ${RESULTS}/git 2>&1
>>
>> /usr/bin/time make -j${THREADS} test >> ${RESULTS}/git 2>&1
>>
>> On this client, THREADS=12. A single-thread run doesn't seem to
>> trigger a problem. So unfortunately the specific data I have is
>> going to be noisy.
>
> I'll attempt to reproduce this and see what's up. This is an export of
> tmpfs? If so, I suspect you might be running into tmpfs' unstable cookie
> problem when two processes race through nfs_do_filldir().. and if so, the
> cached listing of the directory on the client won't match a listing on the
> server.

It doesn't reproduce on ext4, but I can see it on an export of tmpfs.

Unsurprisingly, the pattern is getdents() returning 19 entries (17 files in
the first emit, plus "." and ".."), then unlinking those, and the next
getdents() returning 0.

Here's a reproducer which fails on tmpfs but works properly on exports of
ext4 and xfs:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <string.h>

#define NFSDIR "/mnt/tmpfs/dirtest"
#define BUF_SIZE 4096
#define COUNT 18

int main(int argc, char **argv)
{
	int i, dir_fd, bpos, total = 0;
	ssize_t nread;
	struct linux_dirent {
		long		d_ino;
		off_t		d_off;
		unsigned short	d_reclen;
		char		d_name[];
	};
	struct linux_dirent *d;
	char buf[BUF_SIZE];

	/* create files */
	for (i = 0; i < COUNT; i++) {
		sprintf(buf, NFSDIR "/file_%03d", i);
		close(open(buf, O_CREAT, 0666));
		total++;
	}
	printf("created %d total dirents\n", total);

	dir_fd = open(NFSDIR, O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC);
	if (dir_fd < 0) {
		perror("cannot open dir");
		return 1;
	}

	/* drop the first page */
	posix_fadvise(dir_fd, 0, 4096, POSIX_FADV_DONTNEED);
	total = 0;

	while (1) {
		nread = syscall(SYS_getdents, dir_fd, buf, BUF_SIZE);
		if (nread == 0 || nread == -1)
			break;
		for (bpos = 0; bpos < nread;) {
			d = (struct linux_dirent *) (buf + bpos);

			if (d->d_name[0] != '.') {
				printf("%s\n", d->d_name);
				unlinkat(dir_fd, d->d_name, 0);
				total++;
			}
			bpos += d->d_reclen;
		}
	}
	printf("found and deleted %d dirents\n", total);
	close(dir_fd);

	printf("rmdir returns %d\n", rmdir(NFSDIR));
	return 0;
}

The client is doing uncached_readdir looking for cookie 19, but tmpfs has
re-ordered the last file into cookie 3 on the second READDIR.

I think this is a different case of the unstable readdir cookie problems
discussed during the last round of directory cache improvements, but since
we're now returning after 17 entries, the problem is exposed on a directory
containing 18 files rather than 128.
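
To make the cookie arithmetic concrete, here's a toy user-space model of
position-based readdir cookies (purely illustrative - the exact cookie
values and the renumbering are my assumptions about how tmpfs/libfs
behaves, not kernel code):

#include <stdio.h>

#define COUNT 18

int main(void)
{
	int present[COUNT], i, cookie;

	for (i = 0; i < COUNT; i++)
		present[i] = 1;

	/* First READDIR: 17 files are emitted at cookies 3..19
	 * ("." and ".." occupy cookies 1 and 2). */
	for (i = 0, cookie = 3; i < 17; i++, cookie++)
		printf("emit file_%03d at cookie %d\n", i, cookie);

	/* The client unlinks everything it has seen; the server then
	 * renumbers the survivors from the front of the list. */
	for (i = 0; i < 17; i++)
		present[i] = 0;

	/* The second READDIR resumes past cookie 19, but the one
	 * remaining file has moved down to cookie 3, so the reply is
	 * empty even though the directory is not. */
	for (i = 0, cookie = 3; i < COUNT; i++)
		if (present[i])
			printf("file_%03d is now at cookie %d\n", i, cookie++);

	return 0;
}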

Working on a fix..

Ben


2023-02-01 15:13:57

by Chuck Lever

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client



> On Jan 31, 2023, at 5:02 PM, Benjamin Coddington <[email protected]> wrote:
>
> On 31 Jan 2023, at 16:15, Chuck Lever III wrote:
>
>> Hi-
>>
>> I upgraded my test client's kernel to v6.2-rc5 and now I get
>> failures during the git regression suite on all NFS versions.
>> I bisected to:
>>
>> 85aa8ddc3818 ("NFS: Trigger the "ls -l" readdir heuristic sooner")
>>
>> The failure looks like:
>>
>> not ok 6 - git am --skip succeeds despite D/F conflict
>> #
>> # test_when_finished "git -C df_plus_edit_edit clean -f" &&
>> # test_when_finished "git -C df_plus_edit_edit reset --hard" &&
>> # (
>> # cd df_plus_edit_edit &&
>> #
>> # git checkout f-edit^0 &&
>> # git format-patch -1 d-edit &&
>> # test_must_fail git am -3 0001*.patch &&
>> #
>> # git am --skip &&
>> # test_path_is_missing .git/rebase-apply &&
>> # git ls-files -u >conflicts &&
>> # test_must_be_empty conflicts
>> # )
>> #
>> # failed 1 among 6 test(s)
>> 1..6
>> make[2]: *** [Makefile:60: t1015-read-index-unmerged.sh] Error 1
>> make[2]: *** Waiting for unfinished jobs....
>>
>> The regression suite is run like this:
>>
>> RESULTS= some random directory under /tmp
>> RELEASE="git-2.37.1"
>>
>> rm -f ${RELEASE}.tar.gz
>> curl --no-progress-meter -O https://mirrors.edge.kernel.org/pub/software/scm/git/${RELEASE}.tar.gz
>> /usr/bin/time tar zxf ${RELEASE}.tar.gz >> ${RESULTS}/git 2>&1
>>
>> cd ${RELEASE}
>> make clean >> ${RESULTS}/git 2>&1
>> /usr/bin/time make -j${THREADS} all doc >> ${RESULTS}/git 2>&1
>>
>> /usr/bin/time make -j${THREADS} test >> ${RESULTS}/git 2>&1
>>
>> On this client, THREADS=12. A single-thread run doesn't seem to
>> trigger a problem. So unfortunately the specific data I have is
>> going to be noisy.
>
> I'll attempt to reproduce this and see what's up. This is an export of
> tmpfs?

Sorry for the delayed response... yes, it is a tmpfs export. I was
trying to recall whether this reproduced on other filesystem types,
and it was just far enough in the past that I couldn't be sure I
remembered correctly.


> If so, I suspect you might be running into tmpfs' unstable cookie
> problem when two processes race through nfs_do_filldir().. and if so, the
> cached listing of the directory on the client won't match a listing on the
> server.

Thanks !

--
Chuck Lever




2023-02-01 15:55:40

by Benjamin Coddington

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client

On 1 Feb 2023, at 9:10, Benjamin Coddington wrote:
>
> Working on a fix..

.. actually, I have no idea how to fix this - if tmpfs is going to modify
the position of its dentries, I can't think of a way for the client to loop
through getdents() and remove every file reliably.

The patch you bisected into just makes this happen on directories with 18
entries instead of 127 which can be verified by changing COUNT in the
reproducer.

As Trond pointed out in:
https://lore.kernel.org/all/[email protected]/

POSIX states very explicitly that if you're making changes to the
directory after the call to opendir() or rewinddir(), then the
behaviour w.r.t. whether that file appears in the readdir() call is
unspecified. See
https://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html

The issue here is not quite the same though, we unlink the first batch of
entries, then do a second getdents(), which returns zero entries even though
some still exist. I don't think POSIX talks about this case directly.

I guess the question now is whether we need to drop the "ls -l" improvement,
because with it we are going to see this behavior on directories with >17
entries instead of >127 entries.

It should be possible to make tmpfs (and friends) generate reliable cookies
by doing something like hashing out the cursor->d_child into the cookie
space.. (waving hands)
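
As a rough illustration of that hand-waving (an assumption about one
possible scheme, not what tmpfs does today): derive the cookie from
something stable about the entry, such as a hash of its name, so that it
survives unlinks of its neighbours. Collisions and the real kernel
plumbing are ignored here:

#include <stdint.h>
#include <stdio.h>

/* FNV-1a, 64-bit: hash the entry name into the cookie space. */
static uint64_t name_to_cookie(const char *name)
{
	uint64_t h = 0xcbf29ce484222325ULL;

	while (*name) {
		h ^= (unsigned char)*name++;
		h *= 0x100000001b3ULL;
	}
	return h | 3;	/* stay clear of the reserved low cookie values */
}

int main(void)
{
	/* The cookie for a given name never changes, no matter what
	 * else is created or unlinked in the directory. */
	printf("file_000 -> %llu\n", (unsigned long long)name_to_cookie("file_000"));
	printf("file_017 -> %llu\n", (unsigned long long)name_to_cookie("file_017"));
	return 0;
}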

Ben


Subject: Re: git regression failures with v6.2-rc NFS client

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 31.01.23 22:15, Chuck Lever III wrote:
>
> I upgraded my test client's kernel to v6.2-rc5 and now I get
> failures during the git regression suite on all NFS versions.
> I bisected to:
>
> 85aa8ddc3818 ("NFS: Trigger the "ls -l" readdir heuristic sooner")
>
> The failure looks like:
> [...]

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced 85aa8ddc3818
#regzbot title nfs: failures during the git regression suite on all NFS versions
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

2023-02-03 14:38:39

by Chuck Lever

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client



> On Feb 1, 2023, at 10:53 AM, Benjamin Coddington <[email protected]> wrote:
>
> On 1 Feb 2023, at 9:10, Benjamin Coddington wrote:
>>
>> Working on a fix..
>
> .. actually, I have no idea how to fix this - if tmpfs is going to modify
> the position of its dentries, I can't think of a way for the client to loop
> through getdents() and remove every file reliably.
>
> The patch you bisected into just makes this happen on directories with 18
> entries instead of 127 which can be verified by changing COUNT in the
> reproducer.
>
> As Trond pointed out in:
> https://lore.kernel.org/all/[email protected]/
>
> POSIX states very explicitly that if you're making changes to the
> directory after the call to opendir() or rewinddir(), then the
> behaviour w.r.t. whether that file appears in the readdir() call is
> unspecified. See
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html
>
> The issue here is not quite the same though, we unlink the first batch of
> entries, then do a second getdents(), which returns zero entries even though
> some still exist. I don't think POSIX talks about this case directly.
>
> I guess the question now is if we need to drop the "ls -l" improvement
> because after it we are going to see this behavior on directories with >17
> entries instead of >127 entries.

I don't have any suggestions about how to fix your optimization.

Technically I think this counts as a regression; Thorsten seems
to agree with that opinion. It's late in the cycle, so it is
appropriate to consider reverting 85aa8ddc3818 and trying again
in v6.3 or v6.4.


> It should be possible to make tmpfs (and friends) generate reliable cookies
> by doing something like hashing out the cursor->d_child into the cookie
> space.. (waving hands)

Sure, but what if there are non-Linux NFS-exported filesystems
that behave this way?


--
Chuck Lever




2023-02-03 15:18:08

by Benjamin Coddington

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client

On 3 Feb 2023, at 9:38, Chuck Lever III wrote:

>> On Feb 1, 2023, at 10:53 AM, Benjamin Coddington <[email protected]> wrote:
>>
>> On 1 Feb 2023, at 9:10, Benjamin Coddington wrote:
>>>
>>> Working on a fix..
>>
>> .. actually, I have no idea how to fix this - if tmpfs is going to modify
>> the position of its dentries, I can't think of a way for the client to loop
>> through getdents() and remove every file reliably.
>>
>> The patch you bisected into just makes this happen on directories with 18
>> entries instead of 127 which can be verified by changing COUNT in the
>> reproducer.
>>
>> As Trond pointed out in:
>> https://lore.kernel.org/all/[email protected]/
>>
>> POSIX states very explicitly that if you're making changes to the
>> directory after the call to opendir() or rewinddir(), then the
>> behaviour w.r.t. whether that file appears in the readdir() call is
>> unspecified. See
>> https://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html
>>
>> The issue here is not quite the same though, we unlink the first batch of
>> entries, then do a second getdents(), which returns zero entries even though
>> some still exist. I don't think POSIX talks about this case directly.
>>
>> I guess the question now is if we need to drop the "ls -l" improvement
>> because after it we are going to see this behavior on directories with >17
>> entries instead of >127 entries.
>
> I don't have any suggestions about how to fix your optimization.

I wasn't trying to fix it. I was trying to fix your testing setup.

> Technically I think this counts as a regression; Thorsten seems
> to agree with that opinion. It's late in the cycle, so it is
> appropriate to consider reverting 85aa8ddc3818 and trying again
> in v6.3 or v6.4.

Thorsten's bot is just scraping your regression report email; I doubt
they've carefully read this thread.

>> It should be possible to make tmpfs (and friends) generate reliable cookies
>> by doing something like hashing out the cursor->d_child into the cookie
>> space.. (waving hands)
>
> Sure, but what if there are non-Linux NFS-exported filesystems
> that behave this way?

Then they would display this same behavior, and claiming it is a server bug
might be a defensible position.

The reality as I understand it is that if the server is going to change the
cookie or offset of the dentries in between calls to READDIR, you're never
going to reliably be able to list the directory completely. Or maybe we
can, but at least I can't think of any way it can be done.

You can ask Trond/Anna to revert this, but that's only going to fix your
test setup. The behavior you're claiming is a regression is still there -
just at a later offset.

Or we can modify the server to make tmpfs and friends generate stable
cookies/offsets.

Or we can patch git so that it doesn't assume it can walk a directory
completely while simultaneously modifying it.

What do you think?

Ben


2023-02-03 15:35:44

by Chuck Lever

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client



> On Feb 3, 2023, at 10:13 AM, Benjamin Coddington <[email protected]> wrote:
>
> On 3 Feb 2023, at 9:38, Chuck Lever III wrote:
>
>>> On Feb 1, 2023, at 10:53 AM, Benjamin Coddington <[email protected]> wrote:
>>>
>>> On 1 Feb 2023, at 9:10, Benjamin Coddington wrote:
>>>>
>>>> Working on a fix..
>>>
>>> .. actually, I have no idea how to fix this - if tmpfs is going to modify
>>> the position of its dentries, I can't think of a way for the client to loop
>>> through getdents() and remove every file reliably.
>>>
>>> The patch you bisected into just makes this happen on directories with 18
>>> entries instead of 127 which can be verified by changing COUNT in the
>>> reproducer.
>>>
>>> As Trond pointed out in:
>>> https://lore.kernel.org/all/[email protected]/
>>>
>>> POSIX states very explicitly that if you're making changes to the
>>> directory after the call to opendir() or rewinddir(), then the
>>> behaviour w.r.t. whether that file appears in the readdir() call is
>>> unspecified. See
>>> https://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html
>>>
>>> The issue here is not quite the same though, we unlink the first batch of
>>> entries, then do a second getdents(), which returns zero entries even though
>>> some still exist. I don't think POSIX talks about this case directly.
>>>
>>> I guess the question now is if we need to drop the "ls -l" improvement
>>> because after it we are going to see this behavior on directories with >17
>>> entries instead of >127 entries.
>>
>> I don't have any suggestions about how to fix your optimization.
>
> I wasn't trying to fix it. I was trying to fix your testing setup.
>
>> Technically I think this counts as a regression; Thorsten seems
>> to agree with that opinion. It's late in the cycle, so it is
>> appropriate to consider reverting 85aa8ddc3818 and trying again
>> in v6.3 or v6.4.
>
> Thorsten's bot is just scraping your regression report email, I doubt
> they've carefully read this thread.
>
>>> It should be possible to make tmpfs (and friends) generate reliable cookies
>>> by doing something like hashing out the cursor->d_child into the cookie
>>> space.. (waving hands)
>>
>> Sure, but what if there are non-Linux NFS-exported filesystems
>> that behave this way?
>
> Then they would display this same behavior, and claiming it is a server bug
> might be defensible position.

It's a server bug if we can cite something (perhaps less confusing
and more on-point than the POSIX specification) that says READDIR
cookies aren't supposed to behave this way. I bet the tmpfs folks
are going to want to see that kind of mandate before allowing a
code change.

I'm wondering if you can trigger the same behavior when running
directly on tmpfs?


> The reality as I understand it is that if the server is going to change the
> cookie or offset of the dentries in between calls to READDIR, you're never
> going to reliably be able to list the directory completely. Or maybe we
> can, but at least I can't think of any way it can be done.
>
> You can ask Trond/Anna to revert this, but that's only going to fix your
> test setup. The behavior you're claiming is a regression is still there -
> just at a later offset.

No-one is complaining about the existing situation, which
suggests this is currently only a latent bug, and harmless in
practice. This is a regression because your optimization exposes
the misbehavior to more common workloads.

Even if this is a server bug, the guidelines about not
introducing behavior regressions mean we have to stick with
the current client side behavior until the server side part
of the issue has been corrected.


> Or we can modify the server to make tmpfs and friends generate stable
> cookies/offsets.
>
> Or we can patch git so that it doesn't assume it can walk a directory
> completely while simultaneously modifying it.

I'm guessing that is something that other workloads might
do, so fixing git is not going to solve the issue. And also,
the test works fine on other filesystem types, so it's not
git that is the problem.


> What do you think?

IMO, since the situation is not easy or not possible to fix,
you should revert 85aa8ddc3818 for v6.2 and work on fixing
tmpfs first.

It's going to have to be a backportable fix because your
optimization will break with any Linux server exporting an
unfixed tmpfs.


--
Chuck Lever




2023-02-03 17:15:14

by Benjamin Coddington

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client

On 3 Feb 2023, at 10:35, Chuck Lever III wrote:

>> On Feb 3, 2023, at 10:13 AM, Benjamin Coddington <[email protected]> wrote:
>>
>> On 3 Feb 2023, at 9:38, Chuck Lever III wrote:
>>
>>>> On Feb 1, 2023, at 10:53 AM, Benjamin Coddington <[email protected]> wrote:
>>>>
>>>> On 1 Feb 2023, at 9:10, Benjamin Coddington wrote:
>>>>>
>>>>> Working on a fix..
>>>>
>>>> .. actually, I have no idea how to fix this - if tmpfs is going to modify
>>>> the position of its dentries, I can't think of a way for the client to loop
>>>> through getdents() and remove every file reliably.
>>>>
>>>> The patch you bisected into just makes this happen on directories with 18
>>>> entries instead of 127 which can be verified by changing COUNT in the
>>>> reproducer.
>>>>
>>>> As Trond pointed out in:
>>>> https://lore.kernel.org/all/[email protected]/
>>>>
>>>> POSIX states very explicitly that if you're making changes to the
>>>> directory after the call to opendir() or rewinddir(), then the
>>>> behaviour w.r.t. whether that file appears in the readdir() call is
>>>> unspecified. See
>>>> https://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html
>>>>
>>>> The issue here is not quite the same though, we unlink the first batch of
>>>> entries, then do a second getdents(), which returns zero entries even though
>>>> some still exist. I don't think POSIX talks about this case directly.
>>>>
>>>> I guess the question now is if we need to drop the "ls -l" improvement
>>>> because after it we are going to see this behavior on directories with >17
>>>> entries instead of >127 entries.
>>>
>>> I don't have any suggestions about how to fix your optimization.
>>
>> I wasn't trying to fix it. I was trying to fix your testing setup.
>>
>>> Technically I think this counts as a regression; Thorsten seems
>>> to agree with that opinion. It's late in the cycle, so it is
>>> appropriate to consider reverting 85aa8ddc3818 and trying again
>>> in v6.3 or v6.4.
>>
>> Thorsten's bot is just scraping your regression report email, I doubt
>> they've carefully read this thread.
>>
>>>> It should be possible to make tmpfs (and friends) generate reliable cookies
>>>> by doing something like hashing out the cursor->d_child into the cookie
>>>> space.. (waving hands)
>>>
>>> Sure, but what if there are non-Linux NFS-exported filesystems
>>> that behave this way?
>>
>> Then they would display this same behavior, and claiming it is a server bug
>> might be defensible position.
>
> It's a server bug if we can cite something (perhaps less confusing
> and more on-point than the POSIX specification) that says READDIR
> cookies aren't supposed to behave this way. I bet the tmpfs folks
> are going to want to see that kind of mandate before allowing a
> code change.

I don't have other stuff to cite for you. All I have is POSIX, and the many
discussions we've had on the linux-nfs list in the past.

> I'm wondering if you can trigger the same behavior when running
> directly on tmpfs?

Not in the same way, because libfs keeps a cursor in the open directory
context, but knfsd has to do a seekdir between replies to READDIR. So, if
you do what the git test is doing locally, it works locally.

The git test is doing:

opendir
while (getdents)
    unlink(dentries)
closedir
rmdir              <- ENOTEMPTY on NFS

If you have NFS in the middle of that, knfsd tries to do a seekdir to a
position (the cookie/offset sent in the /last/ READDIR results). In that
case, yes - tmpfs would display the same behavior, but locally to tmpfs that
looks like:

opendir
getdents
closedir
unlink,unlink,unlink
opendir
seekdir to next batch
getdents <- no dentries (they all shifted positions)
rmdir <- ENOTEMPTY

The other way to think about this is that on a local system, there's state
saved in the open dir file structure between calls to getdents(). With
knfsd in the middle, you lose that state when you close the directory and
have to do seekdir instead, which means you're not guaranteed to be in the
same place in the directory stream.
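
Here's a small sketch of that close/reopen/seekdir pattern done directly
against a local tmpfs directory (assumptions: /tmp/dirtest is on tmpfs,
already exists, and holds at least 18 files; the telldir() offset is
reused across a closedir()/opendir() the way knfsd effectively reuses a
READDIR cookie):

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define DIRPATH "/tmp/dirtest"	/* hypothetical tmpfs directory */

int main(void)
{
	char first_batch[17][256];
	struct dirent *d;
	long pos;
	int i, n = 0;
	DIR *dir = opendir(DIRPATH);

	if (!dir)
		return 1;

	/* Read and remember the first batch of names. */
	while (n < 17 && (d = readdir(dir)))
		if (d->d_name[0] != '.')
			strcpy(first_batch[n++], d->d_name);
	pos = telldir(dir);
	closedir(dir);			/* like knfsd between READDIR calls */

	/* Unlink what we saw, as the git test does. */
	for (i = 0; i < n; i++) {
		char path[512];

		snprintf(path, sizeof(path), DIRPATH "/%s", first_batch[i]);
		unlink(path);
	}

	/* Reopen and seek to the remembered offset: the survivors have
	 * shifted to earlier positions, so nothing is returned. */
	dir = opendir(DIRPATH);
	seekdir(dir, pos);
	while ((d = readdir(dir)))
		printf("still visible: %s\n", d->d_name);
	closedir(dir);

	printf("rmdir returns %d\n", rmdir(DIRPATH));
	return 0;
}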

>> The reality as I understand it is that if the server is going to change the
>> cookie or offset of the dentries in between calls to READDIR, you're never
>> going to reliably be able to list the directory completely. Or maybe we
>> can, but at least I can't think of any way it can be done.
>>
>> You can ask Trond/Anna to revert this, but that's only going to fix your
>> test setup. The behavior you're claiming is a regression is still there -
>> just at a later offset.
>
> No-one is complaining about the existing situation, which
> suggests this is currently only a latent bug, and harmless in
> practice. This is a regression because your optimization exposes
> the misbehavior to more common workloads.

It's not a latent bug - the incorrect readdir behavior of tmpfs has been
analyzed and discussed on this list; have a look.

> Even if this is a server bug, the guidelines about not
> introducing behavior regressions mean we have to stick with
> the current client side behavior until the server side part
> of the issue has been corrected.
>
>
>> Or we can modify the server to make tmpfs and friends generate stable
>> cookies/offsets.
>>
>> Or we can patch git so that it doesn't assume it can walk a directory
>> completely while simultaneously modifying it.
>
> I'm guessing that is something that other workloads might
> do, so fixing git is not going to solve the issue. And also,
> the test works fine on other filesystem types, so it's not
> git that is the problem.
>
>
>> What do you think?
>
> IMO, since the situation is not easy or not possible to fix,
> you should revert 85aa8ddc3818 for v6.2 and work on fixing
> tmpfs first.
>
> It's going to have to be a backportable fix because your
> optimization will break with any Linux server exporting an
> unfixed tmpfs.

Exports of tmpfs on Linux are already broken and seem to always have been.
I spent a great deal of time making points about it and arguing that the
client should try to work around them, and ultimately was told exactly this:
https://lore.kernel.org/linux-nfs/[email protected]/

The optimization you'd like to revert fixes a performance regression on
workloads across exports of all filesystems. This is a fix we've had many
folks asking for repeatedly. I hope the maintainers decide not to revert
it, and that we can also find a way to make readdir work reliably on tmpfs.

Ben


2023-02-03 18:03:33

by Chuck Lever

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client



> On Feb 3, 2023, at 12:14 PM, Benjamin Coddington <[email protected]> wrote:
>
> On 3 Feb 2023, at 10:35, Chuck Lever III wrote:
>
>>> On Feb 3, 2023, at 10:13 AM, Benjamin Coddington <[email protected]> wrote:
>>>
>>> On 3 Feb 2023, at 9:38, Chuck Lever III wrote:
>>>
>>>>> On Feb 1, 2023, at 10:53 AM, Benjamin Coddington <[email protected]> wrote:
>>>>>
>>>>> On 1 Feb 2023, at 9:10, Benjamin Coddington wrote:
>>>>>>
>>>>>> Working on a fix..
>>>>>
>>>>> .. actually, I have no idea how to fix this - if tmpfs is going to modify
>>>>> the position of its dentries, I can't think of a way for the client to loop
>>>>> through getdents() and remove every file reliably.
>>>>>
>>>>> The patch you bisected into just makes this happen on directories with 18
>>>>> entries instead of 127 which can be verified by changing COUNT in the
>>>>> reproducer.
>>>>>
>>>>> As Trond pointed out in:
>>>>> https://lore.kernel.org/all/[email protected]/
>>>>>
>>>>> POSIX states very explicitly that if you're making changes to the
>>>>> directory after the call to opendir() or rewinddir(), then the
>>>>> behaviour w.r.t. whether that file appears in the readdir() call is
>>>>> unspecified. See
>>>>> https://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html
>>>>>
>>>>> The issue here is not quite the same though, we unlink the first batch of
>>>>> entries, then do a second getdents(), which returns zero entries even though
>>>>> some still exist. I don't think POSIX talks about this case directly.
>>>>>
>>>>> I guess the question now is if we need to drop the "ls -l" improvement
>>>>> because after it we are going to see this behavior on directories with >17
>>>>> entries instead of >127 entries.
>>>>
>>>> I don't have any suggestions about how to fix your optimization.
>>>
>>> I wasn't trying to fix it. I was trying to fix your testing setup.
>>>
>>>> Technically I think this counts as a regression; Thorsten seems
>>>> to agree with that opinion. It's late in the cycle, so it is
>>>> appropriate to consider reverting 85aa8ddc3818 and trying again
>>>> in v6.3 or v6.4.
>>>
>>> Thorsten's bot is just scraping your regression report email, I doubt
>>> they've carefully read this thread.
>>>
>>>>> It should be possible to make tmpfs (and friends) generate reliable cookies
>>>>> by doing something like hashing out the cursor->d_child into the cookie
>>>>> space.. (waving hands)
>>>>
>>>> Sure, but what if there are non-Linux NFS-exported filesystems
>>>> that behave this way?
>>>
>>> Then they would display this same behavior, and claiming it is a server bug
>>> might be defensible position.
>>
>> It's a server bug if we can cite something (perhaps less confusing
>> and more on-point than the POSIX specification) that says READDIR
>> cookies aren't supposed to behave this way. I bet the tmpfs folks
>> are going to want to see that kind of mandate before allowing a
>> code change.
>
> I don't have other stuff to cite for you.

It's not for me. tmpfs folks might need to understand what
a proposed fix will need to do.


> All I have is POSIX, and the many
> discussions we've had on the linux-nfs list in the past.

I'm simply encouraging you to assemble a few of the most
salient such discussions as you approach the tmpfs folks
to address this issue.


>> I'm wondering if you can trigger the same behavior when running
>> directly on tmpfs?
>
> Not in the same way, because libfs keeps a cursor in the open directory
> context, but knfsd has to do a seekdir between replies to READDIR. So, if
> you do what the git test is doing locally, it works locally.
>
> The git test is doing:
>
> opendir
> while (getdents)
> unlink(dentries)
> close dir
> rmdir <- ENOTEMPTY on NFS
>
> If you have NFS in the middle of that, knfsd tries to do a seekdir to a
> position (the cookie/offset sent in the /last/ READDIR results). In that
> case, yes - tmpfs would display the same behavior, but locally to tmpfs that
> looks like:
>
> opendir
> getdents
> closedir
> unlink,unlink,unlink
> opendir
> seekdir to next batch
> getdents <- no dentries (they all shifted positions)
> rmdir <- ENOTEMPTY
>
> The other way to think about this is that on a local system, there's state
> saved in the open dir file structure between calls to getdents(). With
> knfsd in the middle, you lose that state when you close the directory and
> have to do seekdir instead, which means you're not guaranteed to be in the
> same place in the directory stream.

Naive suggestion: Would another option be to add server
support for the directory verifier?


>>> The reality as I understand it is that if the server is going to change the
>>> cookie or offset of the dentries in between calls to READDIR, you're never
>>> going to reliably be able to list the directory completely. Or maybe we
>>> can, but at least I can't think of any way it can be done.
>>>
>>> You can ask Trond/Anna to revert this, but that's only going to fix your
>>> test setup. The behavior you're claiming is a regression is still there -
>>> just at a later offset.
>>
>> No-one is complaining about the existing situation, which
>> suggests this is currently only a latent bug, and harmless in
>> practice. This is a regression because your optimization exposes
>> the misbehavior to more common workloads.
>
> Its not a latent bug - the incorrect readdir behavior of tmpfs has been
> analyzed and discussed on this list, have a look.

Well, let's not argue semantics. The optimization exposes
this (previously known) bug to a wider set of workloads.


>> Even if this is a server bug, the guidelines about not
>> introducing behavior regressions mean we have to stick with
>> the current client side behavior until the server side part
>> of the issue has been corrected.
>>
>>
>>> Or we can modify the server to make tmpfs and friends generate stable
>>> cookies/offsets.
>>>
>>> Or we can patch git so that it doesn't assume it can walk a directory
>>> completely while simultaneously modifying it.
>>
>> I'm guessing that is something that other workloads might
>> do, so fixing git is not going to solve the issue. And also,
>> the test works fine on other filesystem types, so it's not
>> git that is the problem.
>>
>>
>>> What do you think?
>>
>> IMO, since the situation is not easy or not possible to fix,
>> you should revert 85aa8ddc3818 for v6.2 and work on fixing
>> tmpfs first.
>>
>> It's going to have to be a backportable fix because your
>> optimization will break with any Linux server exporting an
>> unfixed tmpfs.
>
> Exports of tmpfs on linux are already broken and seem to always have been.

Please, open some bugs that document it. Otherwise this stuff
will never get addressed. Can't fix what we don't know about.

I'm not claiming tmpfs is perfect. But the optimization seems
to make things worse for more workloads. That's always been a
textbook definition of regression, and a non-starter for
upstream.


> I spent a great deal of time making points about it and arguing that the
> client should try to work around them, and ultimately was told exactly this:
> https://lore.kernel.org/linux-nfs/[email protected]/

Ah, well "client should work around them" is going to be a
losing argument every time.


> The optimization you'd like to revert fixes a performance regression on
> workloads across exports of all filesystems. This is a fix we've had many
> folks asking for repeatedly.

Does the community agree that tmpfs has never been a first-tier
filesystem for exporting? That's news to me. I don't recall us
deprecating support for tmpfs.

If an optimization broke ext4, xfs, or btrfs, it would be
reverted until the situation was properly addressed. I don't
see why the situation is different for tmpfs just because it
has more bugs.


> I hope the maintainers decide not to revert
> it, and that we can also find a way to make readdir work reliably on tmpfs.

IMO the guidelines go the other way: readdir on tmpfs needs to
be addressed first. Reverting is a last resort, but I don't see
a fix coming before v6.2. Am I wrong?

What I fear will happen is that the optimization will go in, and
then nobody will address (or even document) the tmpfs problem,
making it unusable. That would be unfortunate and wrong.

Please look at tmpfs and see how difficult it will be to address
the cookie problem properly.


--
Chuck Lever




2023-02-03 20:02:08

by Benjamin Coddington

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client

On 3 Feb 2023, at 13:03, Chuck Lever III wrote:
> Naive suggestion: Would another option be to add server
> support for the directory verifier?

I don't think it would resolve this bug, because what can the client do to
find its place in the directory stream after getting NFS4ERR_NOT_SAME?

Directory verifiers might help another class of bugs, where a linear walk
through the directory produces duplicate entries.. but I think it only helps
some of those cases.

Knfsd /could/ use the directory verifier to keep a reference to an opened
directory. Perhaps knfsd's open file cache can be used. Then, as long as
we're doing a linear walk through the directory, the local directory's
file->private cursor would continue to be valid to produce consistent linear
results even if the dentries are removed in between calls to READDIR.

.. but that's not how the verifier is supposed to be used - rather it's
supposed to signal when the client's cookies are invalid, which, for tmpfs,
would be anytime any changes are made to the directory.
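
For what it's worth, a toy sketch of that intended use (assumption: the
server derives the verifier from a per-directory change counter, so any
modification invalidates outstanding cookies; this is not knfsd code):

#include <stdint.h>
#include <stdio.h>

struct dir_state {
	uint64_t change_counter;	/* bumped on every create/unlink/rename */
};

/* The 8-byte cookie verifier returned with every READDIR reply. */
static uint64_t readdir_verifier(const struct dir_state *d)
{
	return d->change_counter;
}

int main(void)
{
	struct dir_state dir = { .change_counter = 42 };
	uint64_t v = readdir_verifier(&dir);

	dir.change_counter++;		/* directory was modified */

	/* A client resuming with cookies minted under the old verifier
	 * would now be told the directory changed (e.g. NFS4ERR_NOT_SAME)
	 * instead of silently getting a short listing. */
	printf("stale cookies detected: %s\n",
	       readdir_verifier(&dir) != v ? "yes" : "no");
	return 0;
}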

> Well, let's not argue semantics. The optimization exposes
> this (previously known) bug to a wider set of workloads.

Ok, yes.

> Please, open some bugs that document it. Otherwise this stuff
> will never get addressed. Can't fix what we don't know about.
>
> I'm not claiming tmpfs is perfect. But the optimization seems
> to make things worse for more workloads. That's always been a
> textbook definition of regression, and a non-starter for
> upstream.

I guess I can - my impression has been there's no interest in fixing tmpfs
for use over NFS.. after my last round of work on it I resolved to tell any
customers that asked for it to do:

[root@fedora ~]# modprobe brd rd_size=1048576 rd_nr=1
[root@fedora ~]# mkfs.xfs /dev/ram0

.. instead. Though, I think that tmpfs is going to have better performance.

>> I spent a great deal of time making points about it and arguing that the
>> client should try to work around them, and ultimately was told exactly this:
>> https://lore.kernel.org/linux-nfs/[email protected]/
>
> Ah, well "client should work around them" is going to be a
> losing argument every time.

Yeah, I agree with that, though at the time I naively thought the issues
could be solved by better validation of changing directories.

So.. I guess I lost? I wasn't aware of the stable cookie issues fully
until Trond pointed them out and I compared tmpfs and xfs. At that point, I
saw there's not really much of a path forward for tmpfs, unless we want to
work pretty hard at it. But why? Any production server wanting performance
and stability on an NFS export isn't going to use tmpfs on knfsd. There are
good memory-speed alternatives.

>> The optimization you'd like to revert fixes a performance regression on
>> workloads across exports of all filesystems. This is a fix we've had many
>> folks asking for repeatedly.
>
> Does the community agree that tmpfs has never been a first-tier
> filesystem for exporting? That's news to me. I don't recall us
> deprecating support for tmpfs.

I'm definitely not telling folks to use it as exported from knfsd. I'm
instead telling them, "Directory listings are going to be unstable, you'll
see problems." That includes any filesystems with file_operations of
simple_dir_operations.

That said, the whole opendir, getdents, unlink, getdents pattern is maybe
fine for a test that can assume it has exclusive rights (writes?) to a
directory, but also probably insane for anything else that wants to reliably
remove the thing, and we'll find that's why `rm -rf` still works. It does
opendir, getdents, getdents, getdents, unlink, unlink, unlink, .. rmdir.
This more closely corresponds to POSIX stating:

If a file is removed from or added to the directory after the most recent
call to opendir() or rewinddir(), whether a subsequent call to readdir()
returns an entry for that file is unspecified.
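
A minimal sketch of that drain-first pattern (assuming the hypothetical
NFSDIR exists, is writable, and holds fewer than 1024 entries): every
readdir call completes before any unlink, so the server's cookies are
never pulled out from under the walk:

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NFSDIR "/mnt/nfs/dirtest"	/* hypothetical NFS-mounted directory */

int main(void)
{
	char names[1024][256];
	struct dirent *d;
	int count = 0, i;
	DIR *dir = opendir(NFSDIR);

	if (!dir)
		return 1;

	/* Pass 1: read the whole directory before touching it. */
	while (count < 1024 && (d = readdir(dir)))
		if (d->d_name[0] != '.')
			strcpy(names[count++], d->d_name);
	closedir(dir);

	/* Pass 2: only now remove the entries. */
	for (i = 0; i < count; i++) {
		char path[512];

		snprintf(path, sizeof(path), NFSDIR "/%s", names[i]);
		unlink(path);
	}

	printf("removed %d entries, rmdir returns %d\n", count, rmdir(NFSDIR));
	return 0;
}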


> If an optimization broke ext4, xfs, or btrfs, it would be
> reverted until the situation was properly addressed. I don't
> see why the situation is different for tmpfs just because it
> has more bugs.

Maybe it isn't? We've yet to hear from any authoritative sources on this.

>> I hope the maintainers decide not to revert
>> it, and that we can also find a way to make readdir work reliably on tmpfs.
>
> IMO the guidelines go the other way: readdir on tmpfs needs to
> be addressed first. Reverting is a last resort, but I don't see
> a fix coming before v6.2. Am I wrong?

Is your opinion wrong? :) IMO, yes, because this is just another round of
"we don't fix the client for broken server behaviors". If we're going to
disagree, know that it's entirely amicable from my side. :)

> What I fear will happen is that the optimization will go in, and
> then nobody will address (or even document) the tmpfs problem,
> making it unusable. That would be unfortunate and wrong.
>
> Please look at tmpfs and see how difficult it will be to address
> the cookie problem properly.

I think tmpfs doesn't have any requirement to report consistent offsets in
between calls to close(dir) and open(dir) - so if you re-open the directory
a second time and seek to some position, you're not going to be able to
connect some earlier part of the stream to what you see now.

So, I don't think it's a tmpfs problem - it's a bit of a combination of issues
that contrive to make READDIR results inconsistent on NFS. The best place
to improve things might be to have knfsd try to keep the directory open in
between calls to READDIR for filesystems of simple_dir_operations.

Really, the root of the problem is that there's no stateful way to walk a
directory that's changing, especially if you're expecting things to be the
same in between close() and open(). XFS and ext4 do a pretty good job of
making that problem disappear by hashing out their dentries across a very
large range, but without some previous state or versioning, it seems a hard
problem to solve.

Ben


2023-02-03 20:26:07

by Chuck Lever

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client



> On Feb 3, 2023, at 3:01 PM, Benjamin Coddington <[email protected]> wrote:
>
> On 3 Feb 2023, at 13:03, Chuck Lever III wrote:
>> Naive suggestion: Would another option be to add server
>> support for the directory verifier?
>
> I don't think it would resolve this bug, because what can the client do to
> find its place in the directory stream after getting NFS4ERR_NOT_SAME?
>
> Directory verifiers might help another class of bugs, where a linear walk
> through the directory produces duplicate entries.. but I think it only helps
> some of those cases.
>
> Knfsd /could/ use the directory verifier to keep a reference to an opened
> directory. Perhaps knfsd's open file cache can be used. Then, as long as
> we're doing a linear walk through the directory, the local directory's
> file->private cursor would continue to be valid to produce consistent linear
> results even if the dentries are removed in between calls to READDIR.
>
> .. but that's not how the verifier is supposed to be used - rather it's
> supposed to signal when the client's cookies are invalid, which, for tmpfs,
> would be anytime any changes are made to the directory.

Right. Change the verifier whenever a directory is modified.

(Make it an export -> callback per filesystem type).

>> Well, let's not argue semantics. The optimization exposes
>> this (previously known) bug to a wider set of workloads.
>
> Ok, yes.
>
>> Please, open some bugs that document it. Otherwise this stuff
>> will never get addressed. Can't fix what we don't know about.
>>
>> I'm not claiming tmpfs is perfect. But the optimization seems
>> to make things worse for more workloads. That's always been a
>> textbook definition of regression, and a non-starter for
>> upstream.
>
> I guess I can - my impression has been there's no interest in fixing tmpfs
> for use over NFS.. after my last round of work on it I resolved to tell any
> customers that asked for it to do:
>
> [root@fedora ~]# modprobe brd rd_size=1048576 rd_nr=1
> [root@fedora ~]# mkfs.xfs /dev/ram0
>
> .. instead. Though, I think that tmpfs is going to have better performance.
>
>>> I spent a great deal of time making points about it and arguing that the
>>> client should try to work around them, and ultimately was told exactly this:
>>> https://lore.kernel.org/linux-nfs/[email protected]/
>>
>> Ah, well "client should work around them" is going to be a
>> losing argument every time.
>
> Yeah, I agree with that, though at the time I naively thought the issues
> could be solved by better validation of changing directories.
>
> So.. I guess I lost? I wasn't aware of the stable cookie issues fully
> until Trond pointed them out and I compared tmpfs and xfs. At that point, I
> saw there's not really much of a path forward for tmpfs, unless we want to
> work pretty hard at it. But why? Any production server wanting performance
> and stability on an NFS export isn't going to use tmpfs on knfsd. There are
> good memory-speed alternatives.

Just curious, what are they? I have xfs on a pair of NVMe
add-in cards, and it's quite fast. But that's an expensive
replacement for tmpfs.

My point is, until now, I had thought that tmpfs was a fully
supported filesystem type for NFS export. What I'm hearing
is that some people don't agree, and worse, it's not been
documented anywhere.

If we're not going to support tmpfs enough to fix these
significant problems, then it should be made unexportable,
and that deprecation needs to be properly documented.


>>> The optimization you'd like to revert fixes a performance regression on
>>> workloads across exports of all filesystems. This is a fix we've had many
>>> folks asking for repeatedly.
>>
>> Does the community agree that tmpfs has never been a first-tier
>> filesystem for exporting? That's news to me. I don't recall us
>> deprecating support for tmpfs.
>
> I'm definitely not telling folks to use it as exported from knfsd. I'm
> instead telling them, "Directory listings are going to be unstable, you'll
> see problems." That includes any filesystems with file_operations of
> simple_dir_operations.

> That said, the whole opendir, getdents, unlink, getdents pattern is maybe
> fine for a test that can assume it has exclusive rights (writes?) to a
> directory, but also probably insane for anything else that wants to reliably
> remove the thing, and we'll find that's why `rm -rf` still works. It does
> opendir, getdents, getdents, getdents, unlink, unlink, unlink, .. rmdir.
> This more closely corresponds to POSIX stating:
>
> If a file is removed from or added to the directory after the most recent
> call to opendir() or rewinddir(), whether a subsequent call to readdir()
> returns an entry for that file is unspecified.
>
>
>> If an optimization broke ext4, xfs, or btrfs, it would be
>> reverted until the situation was properly addressed. I don't
>> see why the situation is different for tmpfs just because it
>> has more bugs.
>
> Maybe it isn't? We've yet to hear from any authoritative sources on this.

>>> I hope the maintainers decide not to revert
>>> it, and that we can also find a way to make readdir work reliably on tmpfs.
>>
>> IMO the guidelines go the other way: readdir on tmpfs needs to
>> be addressed first. Reverting is a last resort, but I don't see
>> a fix coming before v6.2. Am I wrong?
>
> Is your opinion wrong? :) IMO, yes, because this is just another round of
> "we don't fix the client for broken server behaviors".

In general, we don't fix broken servers on the client, true.

In this case, though, this is a regression. A client change
broke workloads that were working in v6.1.

I feel that makes this different than "we're not changing
the client to work around a server bug."


> If we're going to
> disagree, know that its entirely amicable from my side. :)
>
>> What I fear will happen is that the optimization will go in, and
>> then nobody will address (or even document) the tmpfs problem,
>> making it unusable. That would be unfortunate and wrong.
>>
>> Please look at tmpfs and see how difficult it will be to address
>> the cookie problem properly.
>
> I think tmpfs doesn't have any requirement to report consistent offsets in
> between calls to close(dir) and open(dir)

It has that requirement if we want to continue to make it
exportable by NFS.


> - so if you re-open the directory
> a second time and seek to some position, you're not going to be able to
> connect some earlier part of the stream to what you see now.
>
> So, I don't think its a tmpfs problem - its a bit of a combination of issues
> that contrive to make READDIR results inconsistent on NFS. The best place
> to improve things might be to have knfsd try to keep the directory open in
> between calls to READDIR for filesystems of simple_dir_operations.
>
> Really, the root of the problem is that there's no stateful way to walk a
> directory that's changing, especially if you're expecting things to be the
> same in between close() and open(). XFS and ext4 do a pretty good job of
> making that problem disappear by hashing out their dentries across a very
> large range, but without some previous state or versioning, it seems a hard
> problem to solve.

Adding the tmpfs folks for an opinion. The issue is that
removing directory entries from a tmpfs directory while
directory scanning is in progress results in undefined
behavior because tmpfs permits directory offsets to
change between closedir and opendir.

NFS does not have an explicit opendir or closedir operation.

--
Chuck Lever




2023-02-03 22:27:01

by Trond Myklebust

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client



> On Feb 3, 2023, at 15:25, Chuck Lever III <[email protected]> wrote:
>
>
>
>> On Feb 3, 2023, at 3:01 PM, Benjamin Coddington <[email protected]> wrote:
>>
>> On 3 Feb 2023, at 13:03, Chuck Lever III wrote:
>>> Naive suggestion: Would another option be to add server
>>> support for the directory verifier?
>>
>> I don't think it would resolve this bug, because what can the client do to
>> find its place in the directory stream after getting NFS4ERR_NOT_SAME?
>>
>> Directory verifiers might help another class of bugs, where a linear walk
>> through the directory produces duplicate entries.. but I think it only helps
>> some of those cases.
>>
>> Knfsd /could/ use the directory verifier to keep a reference to an opened
>> directory. Perhaps knfsd's open file cache can be used. Then, as long as
>> we're doing a linear walk through the directory, the local directory's
>> file->private cursor would continue to be valid to produce consistent linear
>> results even if the dentries are removed in between calls to READDIR.
>>
>> .. but that's not how the verifier is supposed to be used - rather it's
>> supposed to signal when the client's cookies are invalid, which, for tmpfs,
>> would be anytime any changes are made to the directory.
>
> Right. Change the verifier whenever a directory is modified.
>
> (Make it an export -> callback per filesystem type).
>
>>> Well, let's not argue semantics. The optimization exposes
>>> this (previously known) bug to a wider set of workloads.
>>
>> Ok, yes.
>>
>>> Please, open some bugs that document it. Otherwise this stuff
>>> will never get addressed. Can't fix what we don't know about.
>>>
>>> I'm not claiming tmpfs is perfect. But the optimization seems
>>> to make things worse for more workloads. That's always been a
>>> textbook definition of regression, and a non-starter for
>>> upstream.
>>
>> I guess I can - my impression has been there's no interest in fixing tmpfs
>> for use over NFS.. after my last round of work on it I resolved to tell any
>> customers that asked for it to do:
>>
>> [root@fedora ~]# modprobe brd rd_size=1048576 rd_nr=1
>> [root@fedora ~]# mkfs.xfs /dev/ram0
>>
>> .. instead. Though, I think that tmpfs is going to have better performance.
>>
>>>> I spent a great deal of time making points about it and arguing that the
>>>> client should try to work around them, and ultimately was told exactly this:
>>>> https://lore.kernel.org/linux-nfs/[email protected]/
>>>
>>> Ah, well "client should work around them" is going to be a
>>> losing argument every time.
>>
>> Yeah, I agree with that, though at the time I naively thought the issues
>> could be solved by better validation of changing directories.
>>
>> So.. I guess I lost? I wasn't aware of the stable cookie issues fully
>> until Trond pointed them out and I compared tmpfs and xfs. At that point, I
>> saw there's not really much of a path forward for tmpfs, unless we want to
>> work pretty hard at it. But why? Any production server wanting performance
>> and stability on an NFS export isn't going to use tmpfs on knfsd. There are
>> good memory-speed alternatives.
>
> Just curious, what are they? I have xfs on a pair of NVMe
> add-in cards, and it's quite fast. But that's an expensive
> replacement for tmpfs.
>
> My point is, until now, I had thought that tmpfs was a fully
> supported filesystem type for NFS export. What I'm hearing
> is that some people don't agree, and worse, it's not been
> documented anywhere.
>
> If we're not going to support tmpfs enough to fix these
> significant problems, then it should be made unexportable,
> and that deprecation needs to be properly documented.
>
>
>>>> The optimization you'd like to revert fixes a performance regression on
>>>> workloads across exports of all filesystems. This is a fix we've had many
>>>> folks asking for repeatedly.
>>>
>>> Does the community agree that tmpfs has never been a first-tier
>>> filesystem for exporting? That's news to me. I don't recall us
>>> deprecating support for tmpfs.
>>
>> I'm definitely not telling folks to use it as exported from knfsd. I'm
>> instead telling them, "Directory listings are going to be unstable, you'll
>> see problems." That includes any filesystems with file_operations of
>> simple_dir_operations.
>
>> That said, the whole opendir, getdents, unlink, getdents pattern is maybe
>> fine for a test that can assume it has exclusive rights (writes?) to a
>> directory, but also probably insane for anything else that wants to reliably
>> remove the thing, and we'll find that's why `rm -rf` still works. It does
>> opendir, getdents, getdents, getdents, unlink, unlink, unlink, .. rmdir.
>> This more closely corresponds to POSIX stating:
>>
>> If a file is removed from or added to the directory after the most recent
>> call to opendir() or rewinddir(), whether a subsequent call to readdir()
>> returns an entry for that file is unspecified.
>>
>>
>>> If an optimization broke ext4, xfs, or btrfs, it would be
>>> reverted until the situation was properly addressed. I don't
>>> see why the situation is different for tmpfs just because it
>>> has more bugs.
>>
>> Maybe it isn't? We've yet to hear from any authoritative sources on this.
>
>>>> I hope the maintainers decide not to revert
>>>> it, and that we can also find a way to make readdir work reliably on tmpfs.
>>>
>>> IMO the guidelines go the other way: readdir on tmpfs needs to
>>> be addressed first. Reverting is a last resort, but I don't see
>>> a fix coming before v6.2. Am I wrong?
>>
>> Is your opinion wrong? :) IMO, yes, because this is just another round of
>> "we don't fix the client for broken server behaviors".
>
> In general, we don't fix broken servers on the client, true.
>
> In this case, though, this is a regression. A client change
> broke workloads that were working in v6.1.

No. We’ve had this discussion on the NFS mailing list every time someone invents a new filesystem that they want to export and then claims that they don’t need stable cookies because the NFS protocol doesn’t bother to spell that part out. The NFS client cannot and will not claim bug-free support for a filesystem that assigns cookie values in any way that is not 100% repeatable.

Concerning the claim that it was unknown this is a problem with tmpfs, then see https://marc.info/?l=linux-kernel&m=100369543808129&w=2
In the years since 2001, we’ve “fixed” the behaviour of tmpfs by circumventing the reliance on cookies for the case of a linear read of a tmpfs directory, but the underlying cookie behaviour is still the same, and so the NFS behaviour ends up being the same.

The bottom line is that you’ve always been playing the lottery when mounting tmpfs over NFS.

_________________________________
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2023-02-03 23:11:42

by Chuck Lever

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client



> On Feb 3, 2023, at 5:26 PM, Trond Myklebust <[email protected]> wrote:
>
>
>
>> On Feb 3, 2023, at 15:25, Chuck Lever III <[email protected]> wrote:
>>
>>
>>
>>> On Feb 3, 2023, at 3:01 PM, Benjamin Coddington <[email protected]> wrote:
>>>
>>> On 3 Feb 2023, at 13:03, Chuck Lever III wrote:
>>>> Naive suggestion: Would another option be to add server
>>>> support for the directory verifier?
>>>
>>> I don't think it would resolve this bug, because what can the client do to
>>> find its place in the directory stream after getting NFS4ERR_NOT_SAME?
>>>
>>> Directory verifiers might help another class of bugs, where a linear walk
>>> through the directory produces duplicate entries.. but I think it only helps
>>> some of those cases.
>>>
>>> Knfsd /could/ use the directory verifier to keep a reference to an opened
>>> directory. Perhaps knfsd's open file cache can be used. Then, as long as
>>> we're doing a linear walk through the directory, the local directory's
>>> file->private cursor would continue to be valid to produce consistent linear
>>> results even if the dentries are removed in between calls to READDIR.
>>>
>>> .. but that's not how the verifier is supposed to be used - rather it's
>>> supposed to signal when the client's cookies are invalid, which, for tmpfs,
>>> would be anytime any changes are made to the directory.
>>
>> Right. Change the verifier whenever a directory is modified.
>>
>> (Make it an export -> callback per filesystem type).
>>
>>>> Well, let's not argue semantics. The optimization exposes
>>>> this (previously known) bug to a wider set of workloads.
>>>
>>> Ok, yes.
>>>
>>>> Please, open some bugs that document it. Otherwise this stuff
>>>> will never get addressed. Can't fix what we don't know about.
>>>>
>>>> I'm not claiming tmpfs is perfect. But the optimization seems
>>>> to make things worse for more workloads. That's always been a
>>>> textbook definition of regression, and a non-starter for
>>>> upstream.
>>>
>>> I guess I can - my impression has been there's no interest in fixing tmpfs
>>> for use over NFS.. after my last round of work on it I resolved to tell any
>>> customers that asked for it to do:
>>>
>>> [root@fedora ~]# modprobe brd rd_size=1048576 rd_nr=1
>>> [root@fedora ~]# mkfs.xfs /dev/ram0
>>>
>>> .. instead. Though, I think that tmpfs is going to have better performance.
>>>
>>>>> I spent a great deal of time making points about it and arguing that the
>>>>> client should try to work around them, and ultimately was told exactly this:
>>>>> https://lore.kernel.org/linux-nfs/[email protected]/
>>>>
>>>> Ah, well "client should work around them" is going to be a
>>>> losing argument every time.
>>>
>>> Yeah, I agree with that, though at the time I naively thought the issues
>>> could be solved by better validation of changing directories.
>>>
>>> So.. I guess I lost? I wasn't aware of the stable cookie issues fully
>>> until Trond pointed them out and I compared tmpfs and xfs. At that point, I
>>> saw there's not really much of a path forward for tmpfs, unless we want to
>>> work pretty hard at it. But why? Any production server wanting performance
>>> and stability on an NFS export isn't going to use tmpfs on knfsd. There are
>>> good memory-speed alternatives.
>>
>> Just curious, what are they? I have xfs on a pair of NVMe
>> add-in cards, and it's quite fast. But that's an expensive
>> replacement for tmpfs.
>>
>> My point is, until now, I had thought that tmpfs was a fully
>> supported filesystem type for NFS export. What I'm hearing
>> is that some people don't agree, and worse, it's not been
>> documented anywhere.
>>
>> If we're not going to support tmpfs enough to fix these
>> significant problems, then it should be made unexportable,
>> and that deprecation needs to be properly documented.
>>
>>
>>>>> The optimization you'd like to revert fixes a performance regression on
>>>>> workloads across exports of all filesystems. This is a fix we've had many
>>>>> folks asking for repeatedly.
>>>>
>>>> Does the community agree that tmpfs has never been a first-tier
>>>> filesystem for exporting? That's news to me. I don't recall us
>>>> deprecating support for tmpfs.
>>>
>>> I'm definitely not telling folks to use it as exported from knfsd. I'm
>>> instead telling them, "Directory listings are going to be unstable, you'll
>>> see problems." That includes any filesystems with file_operations of
>>> simple_dir_operations.
>>
>>> That said, the whole opendir, getdents, unlink, getdents pattern is maybe
>>> fine for a test that can assume it has exclusive rights (writes?) to a
>>> directory, but also probably insane for anything else that wants to reliably
>>> remove the thing, and we'll find that's why `rm -rf` still works. It does
>>> opendir, getdents, getdents, getdents, unlink, unlink, unlink, .. rmdir.
>>> This more closely corresponds to POSIX stating:
>>>
>>> If a file is removed from or added to the directory after the most recent
>>> call to opendir() or rewinddir(), whether a subsequent call to readdir()
>>> returns an entry for that file is unspecified.
>>>
>>>
>>>> If an optimization broke ext4, xfs, or btrfs, it would be
>>>> reverted until the situation was properly addressed. I don't
>>>> see why the situation is different for tmpfs just because it
>>>> has more bugs.
>>>
>>> Maybe it isn't? We've yet to hear from any authoritative sources on this.
>>
>>>>> I hope the maintainers decide not to revert
>>>>> it, and that we can also find a way to make readdir work reliably on tmpfs.
>>>>
>>>> IMO the guidelines go the other way: readdir on tmpfs needs to
>>>> be addressed first. Reverting is a last resort, but I don't see
>>>> a fix coming before v6.2. Am I wrong?
>>>
>>> Is your opinion wrong? :) IMO, yes, because this is just another round of
>>> "we don't fix the client for broken server behaviors".
>>
>> In general, we don't fix broken servers on the client, true.
>>
>> In this case, though, this is a regression. A client change
>> broke workloads that were working in v6.1.
>
> No. We’ve had this discussion on the NFS mailing list every time someone invents a new filesystem that they want to export and then claims that they don’t need stable cookies because the NFS protocol doesn’t bother to spell that part out. The NFS client cannot and will not claim bug-free support for a filesystem that assigns cookie values in any way that is not 100% repeatable.

Nor should it. However:

A. tmpfs is not a new filesystem.

B. Ben says this is more or less an issue on _all_ filesystems,
but others manage to hide it effectively, likely as a side-effect
of having to deal with slow durable storage. Before the optimization,
tmpfs worked "well enough" as well.

C. If we don't want to fully support tmpfs, that's fine. But let's
document that properly, please.

D. If you guys knew beforehand that this change would break tmpfs
exports, there should have been a warning in the patch description.


The guidelines about regressions are clear. I don't agree with
leaving the optimization in place while tmpfs is not working well
enough. And actually, these issues in tmpfs should have been
addressed first. There's loads of precedent for that requirement.

But it appears that as usual I don't have much choice. What I can
do is file some bugs and see if we can address the server side
pieces.

So far no one has said that the tmpfs cookie problem is irreparable.
I'd much prefer to keep it in NFSD's stable of support.

https://bugzilla.linux-nfs.org/show_bug.cgi?id=405

And, if it helps, our server should support directory verifiers.

https://bugzilla.linux-nfs.org/show_bug.cgi?id=404


> Concerning the claim that it was unknown this is a problem with tmpfs, then see https://marc.info/?l=linux-kernel&m=100369543808129&w=2
> In the years since 2001, we’ve “fixed” the behaviour of tmpfs by circumventing the reliance on cookies for the case of a linear read of a tmpfs directory, but the underlying cookie behaviour is still the same, and so the NFS behaviour ends up being the same.
>
> The bottom line is that you’ve always been playing the lottery when mounting tmpfs over NFS.

I'm not debating the truth of that. I just don't think we should
be making that situation needlessly worse.

And I would be much more comfortable with this if it appeared in
a man page or on our wiki, or ... I'm sorry, but "some email in
2001" is not documentation a user should be expected to find.


--
Chuck Lever



2023-02-03 23:53:34

by Hugh Dickins

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client

On Fri, 3 Feb 2023, Chuck Lever III wrote:
> > On Feb 3, 2023, at 5:26 PM, Trond Myklebust <[email protected]> wrote:
> >> On Feb 3, 2023, at 15:25, Chuck Lever III <[email protected]> wrote:
> >>> On Feb 3, 2023, at 3:01 PM, Benjamin Coddington <[email protected]> wrote:
> >>> On 3 Feb 2023, at 13:03, Chuck Lever III wrote:
> >>>> Naive suggestion: Would another option be to add server
> >>>> support for the directory verifier?
> >>>
> >>> I don't think it would resolve this bug, because what can the client do to
> >>> find its place in the directory stream after getting NFS4ERR_NOT_SAME?
> >>>
> >>> Directory verifiers might help another class of bugs, where a linear walk
> >>> through the directory produces duplicate entries.. but I think it only helps
> >>> some of those cases.
> >>>
> >>> Knfsd /could/ use the directory verifier to keep a reference to an opened
> >>> directory. Perhaps knfsd's open file cache can be used. Then, as long as
> >>> we're doing a linear walk through the directory, the local directory's
> >>> file->private cursor would continue to be valid to produce consistent linear
> >>> results even if the dentries are removed in between calls to READDIR.
> >>>
> >>> .. but that's not how the verifier is supposed to be used - rather it's
> >>> supposed to signal when the client's cookies are invalid, which, for tmpfs,
> >>> would be anytime any changes are made to the directory.
> >>
> >> Right. Change the verifier whenever a directory is modified.
> >>
> >> (Make it an export -> callback per filesystem type).
> >>
> >>>> Well, let's not argue semantics. The optimization exposes
> >>>> this (previously known) bug to a wider set of workloads.
> >>>
> >>> Ok, yes.
> >>>
> >>>> Please, open some bugs that document it. Otherwise this stuff
> >>>> will never get addressed. Can't fix what we don't know about.
> >>>>
> >>>> I'm not claiming tmpfs is perfect. But the optimization seems
> >>>> to make things worse for more workloads. That's always been a
> >>>> textbook definition of regression, and a non-starter for
> >>>> upstream.
> >>>
> >>> I guess I can - my impression has been there's no interest in fixing tmpfs
> >>> for use over NFS.. after my last round of work on it I resolved to tell any
> >>> customers that asked for it to do:
> >>>
> >>> [root@fedora ~]# modprobe brd rd_size=1048576 rd_nr=1
> >>> [root@fedora ~]# mkfs.xfs /dev/ram0
> >>>
> >>> .. instead. Though, I think that tmpfs is going to have better performance.
> >>>
> >>>>> I spent a great deal of time making points about it and arguing that the
> >>>>> client should try to work around them, and ultimately was told exactly this:
> >>>>> https://lore.kernel.org/linux-nfs/[email protected]/
> >>>>
> >>>> Ah, well "client should work around them" is going to be a
> >>>> losing argument every time.
> >>>
> >>> Yeah, I agree with that, though at the time I naively thought the issues
> >>> could be solved by better validation of changing directories.
> >>>
> >>> So.. I guess I lost? I wasn't aware of the stable cookie issues fully
> >>> until Trond pointed them out and I compared tmpfs and xfs. At that point, I
> >>> saw there's not really much of a path forward for tmpfs, unless we want to
> >>> work pretty hard at it. But why? Any production server wanting performance
> >>> and stability on an NFS export isn't going to use tmpfs on knfsd. There are
> >>> good memory-speed alternatives.
> >>
> >> Just curious, what are they? I have xfs on a pair of NVMe
> >> add-in cards, and it's quite fast. But that's an expensive
> >> replacement for tmpfs.
> >>
> >> My point is, until now, I had thought that tmpfs was a fully
> >> supported filesystem type for NFS export. What I'm hearing
> >> is that some people don't agree, and worse, it's not been
> >> documented anywhere.
> >>
> >> If we're not going to support tmpfs enough to fix these
> >> significant problems, then it should be made unexportable,
> >> and that deprecation needs to be properly documented.
> >>
> >>
> >>>>> The optimization you'd like to revert fixes a performance regression on
> >>>>> workloads across exports of all filesystems. This is a fix we've had many
> >>>>> folks asking for repeatedly.
> >>>>
> >>>> Does the community agree that tmpfs has never been a first-tier
> >>>> filesystem for exporting? That's news to me. I don't recall us
> >>>> deprecating support for tmpfs.
> >>>
> >>> I'm definitely not telling folks to use it as exported from knfsd. I'm
> >>> instead telling them, "Directory listings are going to be unstable, you'll
> >>> see problems." That includes any filesystems with file_operations of
> >>> simple_dir_operations.
> >>
> >>> That said, the whole opendir, getdents, unlink, getdents pattern is maybe
> >>> fine for a test that can assume it has exclusive rights (writes?) to a
> >>> directory, but also probably insane for anything else that wants to reliably
> >>> remove the thing, and we'll find that's why `rm -rf` still works. It does
> >>> opendir, getdents, getdents, getdents, unlink, unlink, unlink, .. rmdir.
> >>> This more closely corresponds to POSIX stating:
> >>>
> >>> If a file is removed from or added to the directory after the most recent
> >>> call to opendir() or rewinddir(), whether a subsequent call to readdir()
> >>> returns an entry for that file is unspecified.
> >>>
> >>>
> >>>> If an optimization broke ext4, xfs, or btrfs, it would be
> >>>> reverted until the situation was properly addressed. I don't
> >>>> see why the situation is different for tmpfs just because it
> >>>> has more bugs.
> >>>
> >>> Maybe it isn't? We've yet to hear from any authoritative sources on this.
> >>
> >>>>> I hope the maintainers decide not to revert
> >>>>> it, and that we can also find a way to make readdir work reliably on tmpfs.
> >>>>
> >>>> IMO the guidelines go the other way: readdir on tmpfs needs to
> >>>> be addressed first. Reverting is a last resort, but I don't see
> >>>> a fix coming before v6.2. Am I wrong?
> >>>
> >>> Is your opinion wrong? :) IMO, yes, because this is just another round of
> >>> "we don't fix the client for broken server behaviors".
> >>
> >> In general, we don't fix broken servers on the client, true.
> >>
> >> In this case, though, this is a regression. A client change
> >> broke workloads that were working in v6.1.
> >
> > No. We’ve had this discussion on the NFS mailing list every time someone invents a new filesystem that they want to export and then claims that they don’t need stable cookies because the NFS protocol doesn’t bother to spell that part out. The NFS client cannot and will not claim bug-free support for a filesystem that assigns cookie values in any way that is not 100% repeatable.
>
> Nor should it. However:
>
> A. tmpfs is not a new filesystem.
>
> B. Ben says this is more or less an issue on _all_ filesystems,
> but others manage to hide it effectively, likely as a side-effect
> of having to deal with slow durable storage. Before the optimization,
> tmpfs worked "well enough" as well.
>
> C. If we don't want to fully support tmpfs, that's fine. But let's
> document that properly, please.
>
> D. If you guys knew beforehand that this change would break tmpfs
> exports, there should have been a warning in the patch description.
>
>
> The guidelines about regressions are clear. I don't agree with
> leaving the optimization in place while tmpfs is not working well
> enough. And actually, these issues in tmpfs should have been
> addressed first. There's loads of precedent for that requirement.
>
> But it appears that as usual I don't have much choice. What I can
> do is file some bugs and see if we can address the server side
> pieces.
>
> So far no one has said that the tmpfs cookie problem is irreparable.
> I'd much prefer to keep it in NFSD's stable of support.
>
> https://bugzilla.linux-nfs.org/show_bug.cgi?id=405
>
> And, if it helps, our server should support directory verifiers.
>
> https://bugzilla.linux-nfs.org/show_bug.cgi?id=404
>
>
> > Concerning the claim that it was unknown this is a problem with tmpfs, then see https://marc.info/?l=linux-kernel&m=100369543808129&w=2
> > In the years since 2001, we’ve “fixed” the behaviour of tmpfs by circumventing the reliance on cookies for the case of a linear read of a tmpfs directory, but the underlying cookie behaviour is still the same, and so the NFS behaviour ends up being the same.
> >
> > The bottom line is that you’ve always been playing the lottery when mounting tmpfs over NFS.
>
> I'm not debating the truth of that. I just don't think we should
> be making that situation needlessly worse.
>
> And I would be much more comfortable with this if it appeared in
> a man page or on our wiki, or ... I'm sorry, but "some email in
> 2001" is not documentation a user should be expected to find.

I very much agree with you, Chuck. Making something imperfect
significantly worse is called "a regression".

And I would expect the (laudable) optimization which introduced
that regression to be reverted from 6.2 for now, unless (I imagine
not, but have no clue) it can be easily conditionalized somehow on
not-tmpfs or not-simple_dir_operations. But that's not my call.

What is the likelihood that simple_dir_operations will be enhanced,
or a satisfactory complicated_dir_operations added? I can assure
you, never by me! If Al or Amir or some dcache-savvy FS folk have
time on their hands and an urge to add what's wanted, great: but
that surely will not come in 6.2, if ever.

More likely that effort would have to come from the NFS(D) end,
who will see the benefit. And if there's some little tweak to be
made to simple_dir_operations, which will give you the hint you need
to handle it better, I expect fsdevel would welcome a patch or two.

Hugh

2023-02-04 00:07:43

by Trond Myklebust

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client



> On Feb 3, 2023, at 18:53, Hugh Dickins <[email protected]> wrote:
>
> On Fri, 3 Feb 2023, Chuck Lever III wrote:
>>> On Feb 3, 2023, at 5:26 PM, Trond Myklebust <[email protected]> wrote:
>>>> On Feb 3, 2023, at 15:25, Chuck Lever III <[email protected]> wrote:
>>>>> On Feb 3, 2023, at 3:01 PM, Benjamin Coddington <[email protected]> wrote:
>>>>> On 3 Feb 2023, at 13:03, Chuck Lever III wrote:
>>>>>> Naive suggestion: Would another option be to add server
>>>>>> support for the directory verifier?
>>>>>
>>>>> I don't think it would resolve this bug, because what can the client do to
>>>>> find its place in the directory stream after getting NFS4ERR_NOT_SAME?
>>>>>
>>>>> Directory verifiers might help another class of bugs, where a linear walk
>>>>> through the directory produces duplicate entries.. but I think it only helps
>>>>> some of those cases.
>>>>>
>>>>> Knfsd /could/ use the directory verifier to keep a reference to an opened
>>>>> directory. Perhaps knfsd's open file cache can be used. Then, as long as
>>>>> we're doing a linear walk through the directory, the local directory's
>>>>> file->private cursor would continue to be valid to produce consistent linear
>>>>> results even if the dentries are removed in between calls to READDIR.
>>>>>
>>>>> .. but that's not how the verifier is supposed to be used - rather it's
>>>>> supposed to signal when the client's cookies are invalid, which, for tmpfs,
>>>>> would be anytime any changes are made to the directory.
>>>>
>>>> Right. Change the verifier whenever a directory is modified.
>>>>
>>>> (Make it an export -> callback per filesystem type).
>>>>
>>>>>> Well, let's not argue semantics. The optimization exposes
>>>>>> this (previously known) bug to a wider set of workloads.
>>>>>
>>>>> Ok, yes.
>>>>>
>>>>>> Please, open some bugs that document it. Otherwise this stuff
>>>>>> will never get addressed. Can't fix what we don't know about.
>>>>>>
>>>>>> I'm not claiming tmpfs is perfect. But the optimization seems
>>>>>> to make things worse for more workloads. That's always been a
>>>>>> textbook definition of regression, and a non-starter for
>>>>>> upstream.
>>>>>
>>>>> I guess I can - my impression has been there's no interest in fixing tmpfs
>>>>> for use over NFS.. after my last round of work on it I resolved to tell any
>>>>> customers that asked for it to do:
>>>>>
>>>>> [root@fedora ~]# modprobe brd rd_size=1048576 rd_nr=1
>>>>> [root@fedora ~]# mkfs.xfs /dev/ram0
>>>>>
>>>>> .. instead. Though, I think that tmpfs is going to have better performance.
>>>>>
>>>>>>> I spent a great deal of time making points about it and arguing that the
>>>>>>> client should try to work around them, and ultimately was told exactly this:
>>>>>>> https://lore.kernel.org/linux-nfs/[email protected]/
>>>>>>
>>>>>> Ah, well "client should work around them" is going to be a
>>>>>> losing argument every time.
>>>>>
>>>>> Yeah, I agree with that, though at the time I naively thought the issues
>>>>> could be solved by better validation of changing directories.
>>>>>
>>>>> So.. I guess I lost? I wasn't aware of the stable cookie issues fully
>>>>> until Trond pointed them out and I compared tmpfs and xfs. At that point, I
>>>>> saw there's not really much of a path forward for tmpfs, unless we want to
>>>>> work pretty hard at it. But why? Any production server wanting performance
>>>>> and stability on an NFS export isn't going to use tmpfs on knfsd. There are
>>>>> good memory-speed alternatives.
>>>>
>>>> Just curious, what are they? I have xfs on a pair of NVMe
>>>> add-in cards, and it's quite fast. But that's an expensive
>>>> replacement for tmpfs.
>>>>
>>>> My point is, until now, I had thought that tmpfs was a fully
>>>> supported filesystem type for NFS export. What I'm hearing
>>>> is that some people don't agree, and worse, it's not been
>>>> documented anywhere.
>>>>
>>>> If we're not going to support tmpfs enough to fix these
>>>> significant problems, then it should be made unexportable,
>>>> and that deprecation needs to be properly documented.
>>>>
>>>>
>>>>>>> The optimization you'd like to revert fixes a performance regression on
>>>>>>> workloads across exports of all filesystems. This is a fix we've had many
>>>>>>> folks asking for repeatedly.
>>>>>>
>>>>>> Does the community agree that tmpfs has never been a first-tier
>>>>>> filesystem for exporting? That's news to me. I don't recall us
>>>>>> deprecating support for tmpfs.
>>>>>
>>>>> I'm definitely not telling folks to use it as exported from knfsd. I'm
>>>>> instead telling them, "Directory listings are going to be unstable, you'll
>>>>> see problems." That includes any filesystems with file_operations of
>>>>> simple_dir_operations.
>>>>
>>>>> That said, the whole opendir, getdents, unlink, getdents pattern is maybe
>>>>> fine for a test that can assume it has exclusive rights (writes?) to a
>>>>> directory, but also probably insane for anything else that wants to reliably
>>>>> remove the thing, and we'll find that's why `rm -rf` still works. It does
>>>>> opendir, getdents, getdents, getdents, unlink, unlink, unlink, .. rmdir.
>>>>> This more closely corresponds to POSIX stating:
>>>>>
>>>>> If a file is removed from or added to the directory after the most recent
>>>>> call to opendir() or rewinddir(), whether a subsequent call to readdir()
>>>>> returns an entry for that file is unspecified.
>>>>>
>>>>>
>>>>>> If an optimization broke ext4, xfs, or btrfs, it would be
>>>>>> reverted until the situation was properly addressed. I don't
>>>>>> see why the situation is different for tmpfs just because it
>>>>>> has more bugs.
>>>>>
>>>>> Maybe it isn't? We've yet to hear from any authoritative sources on this.
>>>>
>>>>>>> I hope the maintainers decide not to revert
>>>>>>> it, and that we can also find a way to make readdir work reliably on tmpfs.
>>>>>>
>>>>>> IMO the guidelines go the other way: readdir on tmpfs needs to
>>>>>> be addressed first. Reverting is a last resort, but I don't see
>>>>>> a fix coming before v6.2. Am I wrong?
>>>>>
>>>>> Is your opinion wrong? :) IMO, yes, because this is just another round of
>>>>> "we don't fix the client for broken server behaviors".
>>>>
>>>> In general, we don't fix broken servers on the client, true.
>>>>
>>>> In this case, though, this is a regression. A client change
>>>> broke workloads that were working in v6.1.
>>>
>>> No. We’ve had this discussion on the NFS mailing list every time someone invents a new filesystem that they want to export and then claims that they don’t need stable cookies because the NFS protocol doesn’t bother to spell that part out. The NFS client cannot and will not claim bug-free support for a filesystem that assigns cookie values in any way that is not 100% repeatable.
>>
>> Nor should it. However:
>>
>> A. tmpfs is not a new filesystem.
>>
>> B. Ben says this is more or less an issue on _all_ filesystems,
>> but others manage to hide it effectively, likely as a side-effect
>> of having to deal with slow durable storage. Before the optimization,
>> tmpfs worked "well enough" as well.
>>
>> C. If we don't want to fully support tmpfs, that's fine. But let's
>> document that properly, please.
>>
>> D. If you guys knew beforehand that this change would break tmpfs
>> exports, there should have been a warning in the patch description.
>>
>>
>> The guidelines about regressions are clear. I don't agree with
>> leaving the optimization in place while tmpfs is not working well
>> enough. And actually, these issues in tmpfs should have been
>> addressed first. There's loads of precedent for that requirement.
>>
>> But it appears that as usual I don't have much choice. What I can
>> do is file some bugs and see if we can address the server side
>> pieces.
>>
>> So far no one has said that the tmpfs cookie problem is irreparable.
>> I'd much prefer to keep it in NFSD's stable of support.
>>
>> https://bugzilla.linux-nfs.org/show_bug.cgi?id=405
>>
>> And, if it helps, our server should support directory verifiers.
>>
>> https://bugzilla.linux-nfs.org/show_bug.cgi?id=404
>>
>>
>>> Concerning the claim that it was unknown this is a problem with tmpfs, then see https://marc.info/?l=linux-kernel&m=100369543808129&w=2
>>> In the years since 2001, we’ve “fixed” the behaviour of tmpfs by circumventing the reliance on cookies for the case of a linear read of a tmpfs directory, but the underlying cookie behaviour is still the same, and so the NFS behaviour ends up being the same.
>>>
>>> The bottom line is that you’ve always been playing the lottery when mounting tmpfs over NFS.
>>
>> I'm not debating the truth of that. I just don't think we should
>> be making that situation needlessly worse.
>>
>> And I would be much more comfortable with this if it appeared in
>> a man page or on our wiki, or ... I'm sorry, but "some email in
>> 2001" is not documentation a user should be expected to find.
>
> I very much agree with you, Chuck. Making something imperfect
> significantly worse is called "a regression".
>
> And I would expect the (laudable) optimization which introduced
> that regression to be reverted from 6.2 for now, unless (I imagine
> not, but have no clue) it can be easily conditionalized somehow on
> not-tmpfs or not-simple_dir_operations. But that's not my call.
>
> What is the likelihood that simple_dir_operations will be enhanced,
> or a satisfactory complicated_dir_operations added? I can assure
> you, never by me! If Al or Amir or some dcache-savvy FS folk have
> time on their hands and an urge to add what's wanted, great: but
> that surely will not come in 6.2, if ever.
>
> More likely that effort would have to come from the NFS(D) end,
> who will see the benefit. And if there's some little tweak to be
> made to simple_dir_operations, which will give you the hint you need
> to handle it better, I expect fsdevel would welcome a patch or two.
>
> Hugh


No! If it were impossible to hit this problem before the patch, then I might agree with you. However, what it does is expose a problem that has always existed, but was a lot less likely to happen timing-wise when we were allowing glibc to suck in all 50000 or so directory entries in one gulp.

IOW: this patch doesn’t cause the problem, it just makes it easier to hit when you are using a high-performance setup like Chuck’s. It was always easy to hit when you were using slower networking and/or smaller rsize values against a remote server with multiple clients creating + deleting files in the same NFS-exported tmpfs directory.


_________________________________
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2023-02-04 00:15:29

by Hugh Dickins

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client

On Sat, 4 Feb 2023, Trond Myklebust wrote:
> > On Feb 3, 2023, at 18:53, Hugh Dickins <[email protected]> wrote:
> > On Fri, 3 Feb 2023, Chuck Lever III wrote:
> >>> On Feb 3, 2023, at 5:26 PM, Trond Myklebust <[email protected]> wrote:
> >>> The bottom line is that you’ve always been playing the lottery when mounting tmpfs over NFS.
> >>
> >> I'm not debating the truth of that. I just don't think we should
> >> be making that situation needlessly worse.
> >>
> >> And I would be much more comfortable with this if it appeared in
> >> a man page or on our wiki, or ... I'm sorry, but "some email in
> >> 2001" is not documentation a user should be expected to find.
> >
> > I very much agree with you, Chuck. Making something imperfect
> > significantly worse is called "a regression".
> >
> > And I would expect the (laudable) optimization which introduced
> > that regression to be reverted from 6.2 for now, unless (I imagine
> > not, but have no clue) it can be easily conditionalized somehow on
> > not-tmpfs or not-simple_dir_operations. But that's not my call.
> >
> > What is the likelihood that simple_dir_operations will be enhanced,
> > or a satisfactory complicated_dir_operations added? I can assure
> > you, never by me! If Al or Amir or some dcache-savvy FS folk have
> > time on their hands and an urge to add what's wanted, great: but
> > that surely will not come in 6.2, if ever.
> >
> > More likely that effort would have to come from the NFS(D) end,
> > who will see the benefit. And if there's some little tweak to be
> > made to simple_dir_operations, which will give you the hint you need
> > to handle it better, I expect fsdevel would welcome a patch or two.
> >
> > Hugh
>
>
> No! If it was impossible to hit this problem before the patch, then I might agree with you. However what it does is exposes a problem that has always existed, but was a lot less likely to happen timing wise when we were allowing glibc to suck in all 50000 or so directory entries in one gulp.
>
> IOW: this patch doesn’t cause the problem, it just makes it easier to hit when you are using a high performance setup like Chuck's. It was always easy to hit when you were using slower networking and/or smaller rsize values against a remote server with multiple clients creating + deleting files in the same NFS exported tmpfs directory.
> _________________________________
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]

I can only repeat,
making something imperfect significantly worse is called "a regression".

Hugh

2023-02-04 00:59:15

by Trond Myklebust

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client



> On Feb 3, 2023, at 19:15, Hugh Dickins <[email protected]> wrote:
>
> On Sat, 4 Feb 2023, Trond Myklebust wrote:
>>> On Feb 3, 2023, at 18:53, Hugh Dickins <[email protected]> wrote:
>>> On Fri, 3 Feb 2023, Chuck Lever III wrote:
>>>>> On Feb 3, 2023, at 5:26 PM, Trond Myklebust <[email protected]> wrote:
>>>>> The bottom line is that you’ve always been playing the lottery when mounting tmpfs over NFS.
>>>>
>>>> I'm not debating the truth of that. I just don't think we should
>>>> be making that situation needlessly worse.
>>>>
>>>> And I would be much more comfortable with this if it appeared in
>>>> a man page or on our wiki, or ... I'm sorry, but "some email in
>>>> 2001" is not documentation a user should be expected to find.
>>>
>>> I very much agree with you, Chuck. Making something imperfect
>>> significantly worse is called "a regression".
>>>
>>> And I would expect the (laudable) optimization which introduced
>>> that regression to be reverted from 6.2 for now, unless (I imagine
>>> not, but have no clue) it can be easily conditionalized somehow on
>>> not-tmpfs or not-simple_dir_operations. But that's not my call.
>>>
>>> What is the likelihood that simple_dir_operations will be enhanced,
>>> or a satisfactory complicated_dir_operations added? I can assure
>>> you, never by me! If Al or Amir or some dcache-savvy FS folk have
>>> time on their hands and an urge to add what's wanted, great: but
>>> that surely will not come in 6.2, if ever.
>>>
>>> More likely that effort would have to come from the NFS(D) end,
>>> who will see the benefit. And if there's some little tweak to be
>>> made to simple_dir_operations, which will give you the hint you need
>>> to handle it better, I expect fsdevel would welcome a patch or two.
>>>
>>> Hugh
>>
>>
>> No! If it was impossible to hit this problem before the patch, then I might agree with you. However what it does is exposes a problem that has always existed, but was a lot less likely to happen timing wise when we were allowing glibc to suck in all 50000 or so directory entries in one gulp.
>>
>> IOW: this patch doesn’t cause the problem, it just makes it easier to hit when you are using a high performance setup like Chuck's. It was always easy to hit when you were using slower networking and/or smaller rsize values against a remote server with multiple clients creating + deleting files in the same NFS exported tmpfs directory.
>> _________________________________
>> Trond Myklebust
>> Linux NFS client maintainer, Hammerspace
>> [email protected]
>
> I can only repeat,
> making something imperfect significantly worse is called "a regression".
>
> Hugh

<resending due to mailing list rejection>

It is exposing a problem which was always there. You can’t just pick and choose your tests and claim that one result is more significant than the other: that’s called bias.

The functional change here is simply to cut the very first readdir short: i.e., it returns fewer entries than previously. Reducing the readdir buffer size supplied by the userspace application would have the exact same effect. The fact that Chuck’s test happened to work before the patch went in is due to a systematic bias in that test. Change the systematic bias (by changing the buffer size supplied by glibc in the sys_getdents call) and you should see the same problem that this patch exposed.
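
(Rough arithmetic to illustrate the buffer-size point, assuming the
usual linux_dirent64 record layout of a 19-byte fixed header plus the
NUL-terminated name, padded to 8 bytes: with ~20-character names each
record is roughly 40 bytes, so glibc's default 32KB buffer has room
for several hundred entries per getdents call, while a buffer of a
kilobyte or so would return only a couple of dozen -- in the same
ballpark as the shortened first batch the heuristic now produces.)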


_________________________________
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

Subject: Re: git regression failures with v6.2-rc NFS client

On 04.02.23 01:59, Trond Myklebust wrote:
>> On Feb 3, 2023, at 19:15, Hugh Dickins <[email protected]> wrote:
>> On Sat, 4 Feb 2023, Trond Myklebust wrote:
>>>> On Feb 3, 2023, at 18:53, Hugh Dickins <[email protected]> wrote:
>>>> On Fri, 3 Feb 2023, Chuck Lever III wrote:
>>>>>> On Feb 3, 2023, at 5:26 PM, Trond Myklebust <[email protected]> wrote:
>>>>>> The bottom line is that you’ve always been playing the lottery when mounting tmpfs over NFS.
>>>>>
>>>>> I'm not debating the truth of that. I just don't think we should
>>>>> be making that situation needlessly worse.
>>>>>
>>>>> And I would be much more comfortable with this if it appeared in
>>>>> a man page or on our wiki, or ... I'm sorry, but "some email in
>>>>> 2001" is not documentation a user should be expected to find.
>>>>
>>>> I very much agree with you, Chuck. Making something imperfect
>>>> significantly worse is called "a regression".
>>>>
>>>> And I would expect the (laudable) optimization which introduced
>>>> that regression to be reverted from 6.2 for now, unless (I imagine
>>>> not, but have no clue) it can be easily conditionalized somehow on
>>>> not-tmpfs or not-simple_dir_operations. But that's not my call.
>>>>
>>>> What is the likelihood that simple_dir_operations will be enhanced,
>>>> or a satisfactory complicated_dir_operations added? I can assure
>>>> you, never by me! If Al or Amir or some dcache-savvy FS folk have
>>>> time on their hands and an urge to add what's wanted, great: but
>>>> that surely will not come in 6.2, if ever.
>>>>
>>>> More likely that effort would have to come from the NFS(D) end,
>>>> who will see the benefit. And if there's some little tweak to be
>>>> made to simple_dir_operations, which will give you the hint you need
>>>> to handle it better, I expect fsdevel would welcome a patch or two.
>>>
>>> No! If it was impossible to hit this problem before the patch, then I might agree with you. However what it does is exposes a problem that has always existed, but was a lot less likely to happen timing wise when we were allowing glibc to suck in all 50000 or so directory entries in one gulp.
>>>
>>> IOW: this patch doesn’t cause the problem, it just makes it easier to hit when you are using a high performance setup like Chuck's. It was always easy to hit when you were using slower networking and/or smaller rsize values against a remote server with multiple clients creating + deleting files in the same NFS exported tmpfs directory.
>>
>> I can only repeat,
>> making something imperfect significantly worse is called "a regression".
>
> It is exposing a problem which was always there. [...]

But as you said: people are more likely to run into this problem now.
This in the end makes the kernel worse and thus afaics is a regression,
as Hugh mentioned.

There sadly is no quote from Linus in
https://docs.kernel.org/process/handling-regressions.html
that exactly matches and helps in this scenario, but a few that come
close; one of them:

```
Because the only thing that matters IS THE USER.

How hard is that to understand?

Anybody who uses "but it was buggy" as an argument is entirely missing
the point. As far as the USER was concerned, it wasn't buggy - it
worked for him/her.
```

Anyway, I guess we're getting close to the point where I simply
explicitly mention the issue in my weekly regression report; then Linus
can speak up himself if he wants. No hard feelings here, I think that's
just my duty.

BTW, I CCed the regression list, as it should be in the loop for
regressions per
https://docs.kernel.org/admin-guide/reporting-regressions.html

BTW, Benjamin, you earlier in this thread mentioned:

```
Thorsten's bot is just scraping your regression report email, I doubt
they've carefully read this thread.
```

Well, kinda. It's just not the bot that adds the regression to the
tracking, that's me doing it. But yes, I only skim threads, and sometimes
when adding an issue I simply lack the knowledge or details to decide if
something really is a regression or not. But often that becomes clear
sooner or later -- and then I'll remove an issue from the tracking if it
turns out it isn't a regression.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

2023-02-04 13:16:56

by Benjamin Coddington

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client

On 4 Feb 2023, at 6:07, Thorsten Leemhuis wrote:

> But as you said: people are more likely to run into this problem now.
> This in the end makes the kernel worse and thus afaics is a regression,
> as Hugh mentioned.
>
> There sadly is no quote from Linus in
> https://docs.kernel.org/process/handling-regressions.html
> that exactly matches and helps in this scenario, but a few that come
> close; one of them:
>
> ```
> Because the only thing that matters IS THE USER.
>
> How hard is that to understand?
>
> Anybody who uses "but it was buggy" as an argument is entirely missing
> the point. As far as the USER was concerned, it wasn't buggy - it
> worked for him/her.
> ```
>
> Anyway, I guess we get close to the point where I simply explicitly
> mention the issue in my weekly regression report, then Linus can speak
> up himself if he wants. No hard feeling here, I think that's just my duty.
>
> BTW, I CCed the regression list, as it should be in the loop for
> regressions per
> https://docs.kernel.org/admin-guide/reporting-regressions.html]
>
> BTW, Benjamin, you earlier in this thread mentioned:
>
> ```
> Thorsten's bot is just scraping your regression report email, I doubt
> they've carefully read this thread.
> ```
>
> Well, kinda. It's just not the bot that adds the regression to the
> tracking, that's me doing it. But yes, I only skim threads and sometimes
> simply when adding lack knowledge or details to decide if something
> really is a regression or not. But often that sooner or later becomes
> clear -- and then I'll remove an issue from the tracking, if it turns
> out it isn't a regression.
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

Ah, thanks for explaining that.

I'd like to summarize and quantify this problem one last time for folks that
don't want to read everything. If an application wants to remove all files
and the parent directory, and uses this pattern to do it:

opendir
while (getdents)
unlink dents
closedir
rmdir

Before this commit, that would work with up to 126 dentries on NFS from a
tmpfs export. If the directory had 127 or more, the rmdir would fail with
ENOTEMPTY.

After this commit, it only works with up to 17 dentries.
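
For concreteness, here is a minimal sketch of that removal pattern,
using raw getdents64 into a fixed 4KB buffer and unlinking while the
directory is still being read. It's only an illustration under those
assumptions (the remove_dir helper and the struct definition are
written for this sketch), not the actual test or reproducer:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Mirrors the kernel's struct linux_dirent64. */
struct linux_dirent64 {
	unsigned long long d_ino;
	long long          d_off;
	unsigned short     d_reclen;
	unsigned char      d_type;
	char               d_name[];
};

static int remove_dir(const char *path)
{
	char buf[4096];
	long nread;
	int fd = open(path, O_RDONLY | O_DIRECTORY);

	if (fd < 0)
		return -1;
	/* Unlink entries while the getdents stream is still open. */
	while ((nread = syscall(SYS_getdents64, fd, buf, sizeof(buf))) > 0) {
		for (long off = 0; off < nread; ) {
			struct linux_dirent64 *d = (void *)(buf + off);

			if (strcmp(d->d_name, ".") && strcmp(d->d_name, ".."))
				unlinkat(fd, d->d_name, 0);
			off += d->d_reclen;
		}
	}
	close(fd);
	/* ENOTEMPTY here means the getdents stream missed entries. */
	return rmdir(path);
}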

The argument that this is making things worse takes the position that there
are more directories in the universe with >17 dentries that want to be
cleaned up by this "saw off the branch you're sitting on" pattern than
directories with >127. And I guess that's true if Chuck runs that testing
setup enough. :)

We can change the optimization in the commit from
NFS_READDIR_CACHE_MISS_THRESHOLD + 1
to
nfs_readdir_array_maxentries + 1

This would make the regression disappear, and would also keep most of the
optimization.

Ben


2023-02-04 16:52:37

by Trond Myklebust

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client



> On Feb 4, 2023, at 08:15, Benjamin Coddington <[email protected]> wrote:
>
> On 4 Feb 2023, at 6:07, Thorsten Leemhuis wrote:
>
>> But as you said: people are more likely to run into this problem now.
>> This in the end makes the kernel worse and thus afaics is a regression,
>> as Hugh mentioned.
>>
>> There sadly is no quote from Linus in
>> https://docs.kernel.org/process/handling-regressions.html
>> that exactly matches and helps in this scenario, but a few that come
>> close; one of them:
>>
>> ```
>> Because the only thing that matters IS THE USER.
>>
>> How hard is that to understand?
>>
>> Anybody who uses "but it was buggy" as an argument is entirely missing
>> the point. As far as the USER was concerned, it wasn't buggy - it
>> worked for him/her.
>> ```
>>
>> Anyway, I guess we get close to the point where I simply explicitly
>> mention the issue in my weekly regression report, then Linus can speak
>> up himself if he wants. No hard feeling here, I think that's just my duty.
>>
>> BTW, I CCed the regression list, as it should be in the loop for
>> regressions per
>> https://docs.kernel.org/admin-guide/reporting-regressions.html]
>>
>> BTW, Benjamin, you earlier in this thread mentioned:
>>
>> ```
>> Thorsten's bot is just scraping your regression report email, I doubt
>> they've carefully read this thread.
>> ```
>>
>> Well, kinda. It's just not the bot that adds the regression to the
>> tracking, that's me doing it. But yes, I only skim threads and sometimes
>> simply when adding lack knowledge or details to decide if something
>> really is a regression or not. But often that sooner or later becomes
>> clear -- and then I'll remove an issue from the tracking, if it turns
>> out it isn't a regression.
>>
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>
> Ah, thanks for explaining that.
>
> I'd like to summarize and quantify this problem one last time for folks that
> don't want to read everything. If an application wants to remove all files
> and the parent directory, and uses this pattern to do it:
>
> opendir
> while (getdents)
> unlink dents
> closedir
> rmdir
>
> Before this commit, that would work with up to 126 dentries on NFS from
> tmpfs export. If the directory had 127 or more, the rmdir would fail with
> ENOTEMPTY.

For all sizes of filenames, or just the particular set that was chosen here? What about the choice of rsize? Both these values affect how many entries glibc can cache before it has to issue another getdents() call into the kernel. For the record, this is what glibc does in the opendir() code in order to choose a buffer size for the getdents syscalls:

/* The st_blksize value of the directory is used as a hint for the
   size of the buffer which receives struct dirent values from the
   kernel. st_blksize is limited to max_buffer_size, in case the
   file system provides a bogus value. */
enum { max_buffer_size = 1048576 };

enum { allocation_size = 32768 };
_Static_assert (allocation_size >= sizeof (struct dirent64),
                "allocation_size < sizeof (struct dirent64)");

/* Increase allocation if requested, but not if the value appears to
   be bogus. It will be between 32Kb and 1Mb. */
size_t allocation = MIN (MAX ((size_t) statp->st_blksize,
                              (size_t) allocation_size),
                         (size_t) max_buffer_size);

DIR *dirp = (DIR *) malloc (sizeof (DIR) + allocation);

>
> After this commit, it only works with up to 17 dentries.
>
> The argument that this is making things worse takes the position that there
> are more directories in the universe with >17 dentries that want to be
> cleaned up by this "saw off the branch you're sitting on" pattern than
> directories with >127. And I guess that's true if Chuck runs that testing
> setup enough. :)
>
> We can change the optimization in the commit from
> NFS_READDIR_CACHE_MISS_THRESHOLD + 1
> to
> nfs_readdir_array_maxentries + 1
>
> This would make the regression disappear, and would also keep most of the
> optimization.
>
> Ben
>

So in other words the suggestion is to optimise the number of readdir records that we return from NFS to whatever value papers over the known telldir()/seekdir() tmpfs bug that is re-revealed by this particular test when run under these particular conditions? Anyone who tries to use tmpfs with a different number of files, different file name lengths, or different mount options is still SOL because that’s not a “regression"?

_________________________________
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2023-02-04 20:45:21

by Benjamin Coddington

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client

On 4 Feb 2023, at 11:52, Trond Myklebust wrote:
> On Feb 4, 2023, at 08:15, Benjamin Coddington <[email protected]> wrote:
>> Ah, thanks for explaining that.
>>
>> I'd like to summarize and quantify this problem one last time for folks that
>> don't want to read everything. If an application wants to remove all files
>> and the parent directory, and uses this pattern to do it:
>>
>> opendir
>> while (getdents)
>> unlink dents
>> closedir
>> rmdir
>>
>> Before this commit, that would work with up to 126 dentries on NFS from
>> tmpfs export. If the directory had 127 or more, the rmdir would fail with
>> ENOTEMPTY.
>
> For all sizes of filenames, or just the particular set that was chosen
> here? What about the choice of rsize? Both these values affect how many
> entries glibc can cache before it has to issue another getdents() call
> into the kernel. For the record, this is what glibc does in the opendir()
> code in order to choose a buffer size for the getdents syscalls:
>
> /* The st_blksize value of the directory is used as a hint for the
> size of the buffer which receives struct dirent values from the
> kernel. st_blksize is limited to max_buffer_size, in case the
> file system provides a bogus value. */
> enum { max_buffer_size = 1048576 };
>
> enum { allocation_size = 32768 };
> _Static_assert (allocation_size >= sizeof (struct dirent64),
> "allocation_size < sizeof (struct dirent64)");
>
> /* Increase allocation if requested, but not if the value appears to
> be bogus. It will be between 32Kb and 1Mb. */
> size_t allocation = MIN (MAX ((size_t) statp->st_blksize, (size_t)
> allocation_size), (size_t) max_buffer_size);
>
> DIR *dirp = (DIR *) malloc (sizeof (DIR) + allocation);

The behavioral complexity is even higher with glibc in the mix, but both the
test that Chuck's using and the reproducer I've been making claims about
use SYS_getdents directly. I'm using a static 4k buffer size which is big
enough to fit enough entries to prime the heuristic for a single call to
getdents() whether or not we return early at 17 or 126.

>> After this commit, it only works with up to 17 dentries.
>>
>> The argument that this is making things worse takes the position that there
>> are more directories in the universe with >17 dentries that want to be
>> cleaned up by this "saw off the branch you're sitting on" pattern than
>> directories with >127. And I guess that's true if Chuck runs that testing
>> setup enough. :)
>>
>> We can change the optimization in the commit from
>> NFS_READDIR_CACHE_MISS_THRESHOLD + 1
>> to
>> nfs_readdir_array_maxentries + 1
>>
>> This would make the regression disappear, and would also keep most of the
>> optimization.
>>
>> Ben
>>
>
> So in other words the suggestion is to optimise the number of readdir
> records that we return from NFS to whatever value that papers over the
> known telldir()/seekdir() tmpfs bug that is re-revealed by this particular
> test when run under these particular conditions?

Yes. It’s a terrible suggestion. Its only merit may be that it meets the
letter of the no-regressions law. I hate it, and after I started popping
out patches that do it I’ve found they’ve all made the behavior far more
complex due to the way we dynamically optimize dtsize.

> Anyone who tries to use tmpfs with a different number of files, different
> file name lengths, or different mount options is still SOL because that’s
> not a “regression"?

Right. :P

Ben


2023-02-05 11:24:34

by Jeffrey Layton

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client

On Sat, 2023-02-04 at 15:44 -0500, Benjamin Coddington wrote:
> On 4 Feb 2023, at 11:52, Trond Myklebust wrote:
> > On Feb 4, 2023, at 08:15, Benjamin Coddington <[email protected]> wrote:
> > > Ah, thanks for explaining that.
> > >
> > > I'd like to summarize and quantify this problem one last time for folks that
> > > don't want to read everything. If an application wants to remove all files
> > > and the parent directory, and uses this pattern to do it:
> > >
> > > opendir
> > > while (getdents)
> > > unlink dents
> > > closedir
> > > rmdir
> > >
> > > Before this commit, that would work with up to 126 dentries on NFS from
> > > tmpfs export. If the directory had 127 or more, the rmdir would fail with
> > > ENOTEMPTY.
> >
> > For all sizes of filenames, or just the particular set that was chosen
> > here? What about the choice of rsize? Both these values affect how many
> > entries glibc can cache before it has to issue another getdents() call
> > into the kernel. For the record, this is what glibc does in the opendir()
> > code in order to choose a buffer size for the getdents syscalls:
> >
> > /* The st_blksize value of the directory is used as a hint for the
> > size of the buffer which receives struct dirent values from the
> > kernel. st_blksize is limited to max_buffer_size, in case the
> > file system provides a bogus value. */
> > enum { max_buffer_size = 1048576 };
> >
> > enum { allocation_size = 32768 };
> > _Static_assert (allocation_size >= sizeof (struct dirent64),
> > "allocation_size < sizeof (struct dirent64)");
> >
> > /* Increase allocation if requested, but not if the value appears to
> > be bogus. It will be between 32Kb and 1Mb. */
> > size_t allocation = MIN (MAX ((size_t) statp->st_blksize, (size_t)
> > allocation_size), (size_t) max_buffer_size);
> >
> > DIR *dirp = (DIR *) malloc (sizeof (DIR) + allocation);
>
> The behavioral complexity is even higher with glibc in the mix, but both the
> test that Chuck's using and the reproducer I've been making claims about
> use SYS_getdents directly. I'm using a static 4k buffer size which is big
> enough to fit enough entries to prime the heuristic for a single call to
> getdents() whether or not we return early at 17 or 126.
>
> > > After this commit, it only works with up to 17 dentries.
> > >
> > > The argument that this is making things worse takes the position that there
> > > are more directories in the universe with >17 dentries that want to be
> > > cleaned up by this "saw off the branch you're sitting on" pattern than
> > > directories with >127. And I guess that's true if Chuck runs that testing
> > > setup enough. :)
> > >
> > > We can change the optimization in the commit from
> > > NFS_READDIR_CACHE_MISS_THRESHOLD + 1
> > > to
> > > nfs_readdir_array_maxentries + 1
> > >
> > > This would make the regression disappear, and would also keep most of the
> > > optimization.
> > >
> > > Ben
> > >
> >
> > So in other words the suggestion is to optimise the number of readdir
> > records that we return from NFS to whatever value that papers over the
> > known telldir()/seekdir() tmpfs bug that is re-revealed by this particular
> > test when run under these particular conditions?
>
> Yes. It's a terrible suggestion. Its only merit may be that it meets the
> letter of the no regressions law. I hate it, and I after I started popping
> out patches that do it I've found they've all made the behavior far more
> complex due to the way we dynamically optimize dtsize.
>
> > Anyone who tries to use tmpfs with a different number of files, different
> > file name lengths, or different mount options is still SOL because that’s
> > not a “regression"?
>
> Right. :P
>
> Ben
>

I may be missing something, but would it be possible to move to a more
stable scheme for readdir cookies for tmpfs?

It is tmpfs, so we don't need to worry about persisting these values
across reboots. Could we (e.g.) hash dentry pointers to generate
cookies?
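
(A rough sketch of that idea, purely hypothetical and untested -- the
helper name is made up, hash collisions and seekdir() handling are
ignored, and the reserved offsets 0/1 for "." and ".." still need care:)

#include <linux/dcache.h>
#include <linux/hash.h>

/* Derive the cookie from the dentry itself rather than from its
 * current position in the parent's child list, so unrelated unlinks
 * don't shift the cookies of the entries that remain. */
static loff_t tmpfs_dentry_cookie(const struct dentry *dentry)
{
	/* 31 bits keeps the value positive in a loff_t; offset by 2 to
	 * stay clear of the conventional 0 and 1 for "." and "..". */
	return (loff_t)hash_ptr(dentry, 31) + 2;
}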
--
Jeff Layton <[email protected]>

2023-02-05 16:12:01

by Benjamin Coddington

[permalink] [raw]
Subject: Re: git regression failures with v6.2-rc NFS client

On 5 Feb 2023, at 6:24, Jeff Layton wrote:
> I may be missing something, but would it be possible to move to a more
> stable scheme for readdir cookies for tmpfs?

Yes, totally possible, but it was pointed out earlier that that's not going
to land in 6.2, so this thread’s been focused on trying to work out whether
85aa8ddc3818 counts as a regression.

Ben


Subject: Re: git regression failures with v6.2-rc NFS client

[TLDR: This mail in primarily relevant for Linux kernel regression
tracking. See link in footer if these mails annoy you.]

On 03.02.23 13:39, Linux kernel regression tracking (#adding) wrote:
> On 31.01.23 22:15, Chuck Lever III wrote:
>>
>> I upgraded my test client's kernel to v6.2-rc5 and now I get
>> failures during the git regression suite on all NFS versions.
>> I bisected to:
>>
>> 85aa8ddc3818 ("NFS: Trigger the "ls -l" readdir heuristic sooner")
>>
>> The failure looks like:
>> [...]
>
> #regzbot ^introduced 85aa8ddc3818
> #regzbot title nfs: failures during the git regression suite on all NFS
> versions
> #regzbot ignore-activity

#regzbot resolved: corner case Linus was made aware of and apparently
didn't consider something that needs to be fixed
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.