2024-06-14 14:47:59

by Chuck Lever III

Subject: reservation errors during fstests on pNFS block

I've finally gotten kdevops and pNFS block to the point where
it can run fstests smoothly with an iSCSI target. I'm seeing
error messages on occasion in the system journal. This set is
from generic/069:

unknown: run fstests generic/069 at 2024-06-14 09:21:08
systemd[1]: Started fstests-generic-069.scope - /usr/bin/bash -c test -w /proc/self/oom_score_adj && echo 250 > /proc/self/oom_score_adj; exec ./tests/generic/069.
nfsrahead[34274]: setting /media/scratch readahead to 128
systemd[1]: media-scratch.mount: Deactivated successfully.
nfsrahead[34284]: setting /media/scratch readahead to 128
kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2)
kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3)
kernel: sd 6:0:0:1: reservation conflict
kernel: sd 6:0:0:1: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
kernel: sd 6:0:0:1: [sdb] tag#16 CDB: Write(10) 2a 00 00 00 00 50 00 00 08 00
kernel: reservation conflict error, dev sdb, sector 80 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
kernel: sd 6:0:0:1: reservation conflict
kernel: sd 6:0:0:1: reservation conflict
kernel: sd 6:0:0:1: [sdb] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
kernel: sd 6:0:0:1: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
kernel: sd 6:0:0:1: [sdb] tag#18 CDB: Write(10) 2a 00 00 00 00 60 00 00 08 00
kernel: sd 6:0:0:1: [sdb] tag#17 CDB: Write(10) 2a 00 00 00 00 58 00 00 08 00
kernel: reservation conflict error, dev sdb, sector 96 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
kernel: reservation conflict error, dev sdb, sector 88 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
systemd[1]: fstests-generic-069.scope: Deactivated successfully.
systemd[1]: fstests-generic-069.scope: Consumed 5.092s CPU time.
systemd[1]: media-test.mount: Deactivated successfully.
systemd[1]: media-scratch.mount: Deactivated successfully.
kernel: sd 6:0:0:1: reservation conflict
kernel: failed to unregister PR key.

But note that generic/069 is recorded as passing without error.

I'm about to start digging into this. Any advice or guidance
is welcome.

--
Chuck Lever




2024-06-14 17:08:56

by Christoph Hellwig

Subject: Re: reservation errors during fstests on pNFS block

On Fri, Jun 14, 2024 at 02:46:49PM +0000, Chuck Lever III wrote:
> I've finally gotten kdevops and pNFS block to the point where
> it can run fstests smoothly with an iSCSI target. I'm seeing
> error messages on occasion in the system journal. This set is
> from generic/069:

A reservation conflict means another node has an active reservation on that LU.
Either a previous attempt failed and left the reservation lingering, or
something else in the system claimed it.
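
If you want to rule that out, dump the current registrations on the LU
("sg_persist --in --read-keys /dev/sdX" from sg3_utils does it). Below
is a rough hand-rolled equivalent using SG_IO to issue PERSISTENT
RESERVE IN / READ KEYS - only a sketch, with error reporting kept
minimal:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* Issue PERSISTENT RESERVE IN (READ KEYS) via SG_IO and print every
 * registered reservation key on the device named in argv[1]. */
int main(int argc, char **argv)
{
        unsigned char cdb[10] = { 0x5e, 0x00 }; /* PR IN, READ KEYS */
        unsigned char resp[8192], sense[32];
        struct sg_io_hdr hdr;
        int fd, i, nkeys;

        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
                return 1;

        cdb[7] = sizeof(resp) >> 8;             /* allocation length */
        cdb[8] = sizeof(resp) & 0xff;

        memset(&hdr, 0, sizeof(hdr));
        hdr.interface_id = 'S';
        hdr.cmd_len = sizeof(cdb);
        hdr.cmdp = cdb;
        hdr.dxfer_direction = SG_DXFER_FROM_DEV;
        hdr.dxferp = resp;
        hdr.dxfer_len = sizeof(resp);
        hdr.sbp = sense;
        hdr.mx_sb_len = sizeof(sense);
        hdr.timeout = 10000;                    /* milliseconds */

        if (ioctl(fd, SG_IO, &hdr) < 0)
                return 1;

        /* payload: 4-byte PRgeneration, 4-byte additional length,
         * then one big-endian 8-byte key per registration */
        nkeys = (resp[4] << 24 | resp[5] << 16 |
                 resp[6] << 8 | resp[7]) / 8;
        for (i = 0; i < nkeys; i++) {
                uint64_t key = 0;
                int j;

                for (j = 0; j < 8; j++)
                        key = (key << 8) | resp[8 + 8 * i + j];
                printf("key %d: 0x%016llx\n", i,
                       (unsigned long long)key);
        }
        return 0;
}

Any key in that output that belongs to neither the server nor the
client points at a third registrant.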

> But note that generic/069 is recorded as passing without error.

When pNFS layout access fails we fall back to normal access through the
MDS, so this is expected.

Is generic/069 the first test that failed when doing a full xfstests
run? Do you see LAYOUT* ops in /proc/self/mountstats for the previous
tests?


2024-06-14 18:26:24

by Christoph Hellwig

Subject: Re: reservation errors during fstests on pNFS block

On Fri, Jun 14, 2024 at 05:46:21PM +0000, Chuck Lever III wrote:
> > A reservation conflict means another node has an active reservation on that LU.
>
> There are only two accessors of the LUN: the NFS server and
> the NFS client running the test. That's why these errors are
> a little surprising to me.

You can create registrations from userspace, and some cluster managers
do that. But none of that should happen for a default setup.
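
For reference, creating one from userspace is a single ioctl through
the block layer's PR interface (<linux/pr.h>, present since Linux 4.4).
A minimal sketch - the key value here is arbitrary:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/pr.h>

/* Register a (made-up) reservation key on the device in argv[1],
 * the sort of thing a cluster manager might do behind your back. */
int main(int argc, char **argv)
{
        struct pr_registration reg = {
                .old_key = 0,                   /* no prior key */
                .new_key = 0xdeadbeef12345678ULL,
        };
        int fd = open(argv[1], O_RDWR);

        if (fd < 0 || ioctl(fd, IOC_PR_REGISTER, &reg) < 0) {
                perror("IOC_PR_REGISTER");
                return 1;
        }
        return 0;
}

Unregistering is the same ioctl with old_key set to the existing key
and new_key set to 0, which is also one way to clean up a registration
left behind by a crashed run.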

> > When pNFS layout access fails we fall back to normal access through the
> > MDS, so this is expected.
>
> Expected, OK. From a usability standpoint, error messages like
> this would probably be alarming to administrators. I plan to
> convert the printk's and dprintk's in the NFSD layout code into
> trace points, but that doesn't help the messages emitted by the
> block and SCSI drivers. Ideally this should be less noisy.

Well, they really should be alarming because the admin configured
a block layout setup and it did not work as expected. So it should
ring alarm bells.

> > Is generic/069 the first test that failed when doing a full xfstests
> > run?
>
> Yes, it's a full run. generic/069 is the first test where there
> are remarkable system journal messages (ie, PR errors), though
> there are a few subsequent tests that are also whinging.

Interesting. Normally only the server actually reserves the LU;
the clients just register. Yet something went wrong here, and only
for these tests.

> > Do you see LAYOUT* ops in /proc/self/mountstats for the previous
> > tests?
>
> generic/013 is known to generate layout recalls, for example,
> so there is layout activity during the test run.

Ok. The other thing would be to run blktrace on the client and
see what I/O it shows. All this sounds like the tests in
general work, but something is up with generic/069.

generic/069 just does O_APPEND writes, so I can't see what
would be so special about it.
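
(That is, per that description the I/O pattern is nothing more exotic
than an appending writer along these lines - an illustrative sketch,
not the actual test source:)

#include <err.h>
#include <fcntl.h>
#include <unistd.h>

/* keep appending fixed-size records to the end of a shared file */
int main(int argc, char **argv)
{
        char buf[4096] = { 0 };
        int fd, i;

        fd = open(argv[1], O_WRONLY | O_APPEND | O_CREAT, 0644);
        if (fd < 0)
                err(1, "open");
        for (i = 0; i < 1000; i++)
                if (write(fd, buf, sizeof(buf)) != sizeof(buf))
                        err(1, "write");
        return 0;
}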

>
> I can go back and try reproducing with just generic/069 and
> tcpdump as a first step. Is there a way I can tell that the
> PR errors are not reporting a possible data corruption?

xfstests in general does data verification to check for data integrity,
so we should not rely on kernel messages.

I'm a bit busy right now, but I'll try to reproduce this locally next
week.


2024-06-14 18:30:24

by Chuck Lever III

Subject: Re: reservation errors during fstests on pNFS block



> On Jun 14, 2024, at 12:38 PM, Christoph Hellwig <[email protected]> wrote:
>
> On Fri, Jun 14, 2024 at 02:46:49PM +0000, Chuck Lever III wrote:
>> I've finally gotten kdevops and pNFS block to the point where
>> it can run fstests smoothly with an iSCSI target. I'm seeing
>> error messages on occasion in the system journal. This set is
>> from generic/069:
>
> A reservation conflict means another node has an active reservation on that LU.

There are only two accessors of the LUN: the NFS server and
the NFS client running the test. That's why these errors are
a little surprising to me.


> Either a previous attempt failed and left the reservation lingering, or
> something else in the system claimed it.

This is the first fstests run after the systems were provisioned.
kdevops lets me provision from scratch before every run [1].


>> But note that generic/069 is recorded as passing without error.
>
> When pNFS layout access fails we fall back to normal access through the
> MDS, so this is expected.

Expected, OK. From a usability standpoint, error messages like
this would probably be alarming to administrators. I plan to
convert the printk's and dprintk's in the NFSD layout code into
trace points, but that doesn't help the messages emitted by the
block and SCSI drivers. Ideally this should be less noisy.


> Is generic/069 the first test that failed when doing a full xfstests
> run?

Yes, it's a full run. generic/069 is the first test where there
are remarkable system journal messages (ie, PR errors), though
there are a few subsequent tests that are also whinging.


> Do you see LAYOUT* ops in /proc/self/mountstats for the previous
> tests?

generic/013 is known to generate layout recalls, for example,
so there is layout activity during the test run.

I can go back and try reproducing with just generic/069 and
tcpdump as a first step. Is there a way I can tell that the
PR errors are not reporting a possible data corruption? I
guess the PASS report from generic/069 is one way. The pass/fail
log from xfstests for pNFS block looks just like the non-pNFS runs,
so maybe this is much ado about nothing.


--
Chuck Lever

[1] - https://github.com/chucklever/kdevops/tree/pnfs-block-testing

2024-06-14 18:33:46

by Chuck Lever III

Subject: Re: reservation errors during fstests on pNFS block



> On Jun 14, 2024, at 2:26 PM, Christoph Hellwig <[email protected]> wrote:
>
> On Fri, Jun 14, 2024 at 05:46:21PM +0000, Chuck Lever III wrote:
>>> A reservation conflict means another node has an active reservation on that LU.
>>
>> There are only two accessors of the LUN: the NFS server and
>> the NFS client running the test. That's why these errors are
>> a little surprising to me.
>
> You can create registrations from userspace, and some cluster managers
> do that. But none of that should happen for a default setup.
>
>>> When pNFS layout access fails we fall back to normal access through the
>>> MDS, so this is expected.
>>
>> Expected, OK. From a usability standpoint, error messages like
>> this would probably be alarming to administrators. I plan to
>> convert the printk's and dprintk's in the NFSD layout code into
>> trace points, but that doesn't help the messages emitted by the
>> block and SCSI drivers. Ideally this should be less noisy.
>
> Well, they really should be alarming because the admin configured
> a block layout setup and it did not work as expected. So it should
> ring alarm bells.

Yes, I expect that "pNFS: failed to open device
/dev/disk/by-id/dm-uuid-mpath-0x6001 ..." is very likely
operator error.


>>> Is generic/069 the first test that failed when doing a full xfstests
>>> run?
>>
>> Yes, it's a full run. generic/069 is the first test where there
>> are remarkable system journal messages (ie, PR errors), though
>> there are a few subsequent tests that are also whinging.
>
> Interesting. Normally only the server actually reserves the LU;
> the clients just register. Yet something went wrong here, and only
> for these tests.

I just checked the NFS server's system journal, and there's
nothing interesting there.

FWIW, the other two tests that emit unexpected journal
messages that I noted down are generic/108 and generic/616.


>>> Do you see LAYOUT* ops in /proc/self/mountstats for the previous
>>> tests?
>>
>> generic/013 is known to generate layout recalls, for example,
>> so there is layout activity during the test run.
>
> Ok. The other thing would be to run blktrace on the client and
> see what I/O it shows. All this sounds like the tests in
> general work, but something is up with generic/069.
>
> generic/069 just does O_APPEND writes, so I can't see what
> would be so special about it.
>
>>
>> I can go back and try reproducing with just generic/069 and
>> tcpdump as a first step. Is there a way I can tell that the
>> PR errors are not reporting a possible data corruption?
>
> xfstests in general does data verification to check for data integrity,
> so we should not rely on kernel messages.
>
> I'm a bit busy right now, but I'll try to reproduce this locally next
> week.

Thanks, I'll also try to investigate further.


--
Chuck Lever


2024-06-14 18:34:51

by Christoph Hellwig

Subject: Re: reservation errors during fstests on pNFS block

On Fri, Jun 14, 2024 at 06:33:02PM +0000, Chuck Lever III wrote:
> > Well, they really should be alarming because the admin configured
> > a block layout setup and it did not work as expected. So it should
> > ring alarm bells.
>
> Yes, I expect that "pNFS: failed to open device
> /dev/disk/by-id/dm-uuid-mpath-0x6001 ..." is very likely
> operator error.

Actually that one is purely a debug message, if that - we don't know
whether it is an mpath or non-mpath device, so we just try the mpath
name first. So that one is totally fine to downgrade or turn into a
trace point. OTOH a warning if you could open neither option would
still be useful.
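
Roughly this shape, i.e. stay quiet on the mpath miss and only warn
when both names fail. A from-memory sketch using the older
three-argument blkdev_get_by_path() signature; bl_open_volume is a
made-up name, not the literal blocklayout code:

/* Try the device-mapper multipath alias first; it may legitimately
 * not exist. Fall back to the plain SCSI alias, and complain only
 * if neither can be opened. */
static struct block_device *bl_open_volume(const char *wwn_hex)
{
        struct block_device *bdev;
        char *path;

        path = kasprintf(GFP_KERNEL,
                         "/dev/disk/by-id/dm-uuid-mpath-0x%s", wwn_hex);
        if (!path)
                return ERR_PTR(-ENOMEM);
        bdev = blkdev_get_by_path(path, FMODE_READ | FMODE_WRITE, NULL);
        kfree(path);
        if (!IS_ERR(bdev))
                return bdev;

        path = kasprintf(GFP_KERNEL,
                         "/dev/disk/by-id/wwn-0x%s", wwn_hex);
        if (!path)
                return ERR_PTR(-ENOMEM);
        bdev = blkdev_get_by_path(path, FMODE_READ | FMODE_WRITE, NULL);
        kfree(path);
        if (IS_ERR(bdev))
                pr_warn("pNFS: could not open volume 0x%s (%ld)\n",
                        wwn_hex, PTR_ERR(bdev));
        return bdev;
}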


2024-06-15 18:09:41

by Chuck Lever III

Subject: Re: reservation errors during fstests on pNFS block



> On Jun 14, 2024, at 2:33 PM, Chuck Lever III <[email protected]> wrote:
>
>> On Jun 14, 2024, at 2:26 PM, Christoph Hellwig <[email protected]> wrote:
>>
>> On Fri, Jun 14, 2024 at 05:46:21PM +0000, Chuck Lever III wrote:
>>>
>>> I can go back and try reproducing with just generic/069 and
>>> tcpdump as a first step. Is there a way I can tell that the
>>> PR errors are not reporting a possible data corruption?
>>
>> xfstests in general does data verification to check for data integrity,
>> so we should not rely on kernel messages.
>>
>> I'm a bit busy right now, but I'll try to reproduce this locally next
>> week.
>
> Thanks, I'll also try to investigate further.

This is 100% reproducible in my setup.

bl_alloc_lseg() calls this:

static struct nfs4_deviceid_node *
bl_find_get_deviceid(struct nfs_server *server,
                const struct nfs4_deviceid *id, const struct cred *cred,
                gfp_t gfp_mask)
{
        struct nfs4_deviceid_node *node;
        unsigned long start, end;

retry:
        node = nfs4_find_get_deviceid(server, id, cred, gfp_mask);
        if (!node)
                return ERR_PTR(-ENODEV);

nfs4_find_get_deviceid() tries to be clever and do a lookup without
the spin lock first.

If it can't find a matching deviceid, it creates a new device_info
(which calls bl_alloc_deviceid_node, and that registers the device's
PR key).

Then it takes the nfs4_deviceid_lock and looks up the deviceid again.
If it finds it this time, bl_find_get_deviceid() frees the spare
(new) device_info, which unregisters the PR key for the same device.

Any subsequent I/O from this client on that device gets EBADE.

The umount later unregisters the device's PR key again.

Seems like PR key registration should be done from a more
idempotent context...?
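
For the archive, here is the sequence flattened into pseudocode (call
chain abbreviated from memory, fs/nfs/pnfs_dev.c plus the blocklayout
callbacks; helper names are approximate):

node = nfs4_find_get_deviceid(server, id, cred, gfp_mask);
        d = __nfs4_find_get_deviceid(...);  /* lockless lookup misses */
        new = nfs4_get_device_info(...);
                /* -> bl_alloc_deviceid_node(): registers our PR key
                 *    with the device */
        spin_lock(&nfs4_deviceid_lock);
        d = __nfs4_find_get_deviceid(...);  /* now finds a node */
        if (d) {
                spin_unlock(&nfs4_deviceid_lock);
                nfs4_put_deviceid_node(new);
                        /* -> frees the spare device_info and
                         *    UNregisters the same key from the same
                         *    device, killing the surviving node's
                         *    registration as well */
                return d;
        }

/* every subsequent WRITE from this client now draws a reservation
 * conflict (EBADE), and the unregister at umount time fails because
 * the key is already gone */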


--
Chuck Lever