2024-02-28 17:26:44

by Patrick Plenefisch

Subject: Re: [REGRESSION] LVM-on-LVM: error while submitting device barriers

I'm unsure if this is just an LVM bug, or a BTRFS+LVM interaction bug,
but LVM is definitely involved somehow.
Upgrading from 5.10 to 6.1, I noticed one of my filesystems was
read-only. In dmesg, I found:

BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr
0, rd 0, flush 1, corrupt 0, gen 0
BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max
tolerance is 0 for writable mount
BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO
failure (errors while submitting device barriers.)
BTRFS info (device dm-75: state E): forced readonly
BTRFS warning (device dm-75: state E): Skipping commit of aborted transaction.
BTRFS: error (device dm-75: state EA) in cleanup_transaction:1992:
errno=-5 IO failure

At first I suspected a btrfs error, but a scrub found no errors, and
it continued to be read-write on 5.10 kernels.

Here is my setup:

/dev/lvm/brokenDisk is an LVM-on-LVM volume. I have /dev/sd{a,b,c,d}
(of varying sizes) in a lower VG, which has three LVs, all raid1
volumes. Two of those volumes are further used as PVs for upper VGs.
One of the upper VGs has no issues. The non-PV LV has no issues. The
remaining LV, /dev/lowerVG/lvmPool, hosting nested LVM, is used as a
PV for VG "lvm" and has 3 volumes inside. Two of those volumes have
no issues (and are btrfs), but the last one is /dev/lvm/brokenDisk.
That volume is the only one that exhibits this behavior, so something
about it is special.

Or described as layers:
/dev/sd{a,b,c,d} => PV => VG "lowerVG"
/dev/lowerVG/single (RAID1 LV) => BTRFS, works fine
/dev/lowerVG/works (RAID1 LV) => PV => VG "workingUpper"
/dev/workingUpper/{a,b,c} => BTRFS, works fine
/dev/lowerVG/lvmPool (RAID1 LV) => PV => VG "lvm"
/dev/lvm/{a,b} => BTRFS, works fine
/dev/lvm/brokenDisk => BTRFS, Exhibits errors
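
For reference, a stack like this can be assembled roughly as follows
(an illustrative sketch only; names, sizes and device paths are
placeholders, not my exact command history):

# lower VG directly on the partitions
pvcreate /dev/sda3 /dev/sdb3 /dev/sde2 /dev/sdf2
vgcreate lowerVG /dev/sda3 /dev/sdb3 /dev/sde2 /dev/sdf2
lvcreate --type raid1 -m1 -L 2T -n single  lowerVG
lvcreate --type raid1 -m1 -L 1T -n works   lowerVG
lvcreate --type raid1 -m1 -L 3T -n lvmPool lowerVG

# upper VGs: two of the raid1 LVs become PVs themselves
# (may need devices/scan_lvs=1 in lvm.conf so LVs are scanned as PVs)
pvcreate /dev/lowerVG/works /dev/lowerVG/lvmPool
vgcreate workingUpper /dev/lowerVG/works
vgcreate lvm /dev/lowerVG/lvmPool
lvcreate -L 500G -n brokenDisk lvm
mkfs.btrfs /dev/lvm/brokenDisk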

After some investigation, here is what I've found:

1. This regression was introduced in 5.19. On 5.18 and earlier kernels
I can keep this filesystem rw and everything works as expected, while
on 5.19.0 and later the filesystem immediately goes ro on any write
attempt. I couldn't build rc1, but I did confirm rc2 already has this
regression.
2. Passing /dev/lvm/brokenDisk to a KVM VM as /dev/vdb (see the sketch
after this list) exhibits the ro barrier problem even when the kernel
inside the VM is an unaffected one.
3. Passing /dev/lowerVG/lvmPool to a KVM VM as /dev/vdb with an
affected kernel inside the VM and using LVM inside the VM exhibits
correct behavior (I can keep the filesystem rw, no barrier errors on
host or guest)
4. I discussed this in IRC with BTRFS folks, and they think the BTRFS
filesystem is fine (btrfs check and btrfs scrub also agree)
5. The dmesg error can be delayed indefinitely by not writing to the
disk, or by only reading with noatime
6. This affects Debian, Ubuntu, NixOS, and Solus, so I'm fairly
certain it's distro-agnostic, and purely a kernel issue.
7. I can't reproduce this with other LVM-on-LVM setups, so I think the
asymmetric nature of the raid1 volume is potentially contributing
8. There are no new SMART errors/failures on any of the disks; the disks are healthy
9. I previously had raidintegrity=y and caching enabled. They didn't
affect the issue
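
For 2 and 3, the VM tests were just raw block-device passthrough;
roughly (illustrative qemu flags, not my exact invocation):

qemu-system-x86_64 -enable-kvm -m 4G \
  -drive file=/path/to/guest.img,format=qcow2,if=virtio \
  -drive file=/dev/lvm/brokenDisk,format=raw,if=virtio

The second virtio disk shows up as /dev/vdb in the guest. For test 3
the second -drive simply points at /dev/lowerVG/lvmPool instead, and
the nested VG is activated with LVM inside the guest.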


#regzbot introduced v5.18..v5.19-rc2

Patrick


2024-02-28 19:27:59

by Goffredo Baroncelli

Subject: Re: [REGRESSION] LVM-on-LVM: error while submitting device barriers

On 28/02/2024 18.25, Patrick Plenefisch wrote:
> I'm unsure if this is just an LVM bug, or a BTRFS+LVM interaction bug,
> but LVM is definitely involved somehow.
> Upgrading from 5.10 to 6.1, I noticed one of my filesystems was
> read-only. In dmesg, I found:
>
> BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr
> 0, rd 0, flush 1, corrupt 0, gen 0
> BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max
> tolerance is 0 for writable mount
> BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO
> failure (errors while submitting device barriers.)
> BTRFS info (device dm-75: state E): forced readonly
> BTRFS warning (device dm-75: state E): Skipping commit of aborted transaction.
> BTRFS: error (device dm-75: state EA) in cleanup_transaction:1992:
> errno=-5 IO failure
>
> At first I suspected a btrfs error, but a scrub found no errors, and
> it continued to be read-write on 5.10 kernels.
>
> Here is my setup:
>
> /dev/lvm/brokenDisk is a lvm-on-lvm volume. I have /dev/sd{a,b,c,d}
> (of varying sizes) in a lower VG, which has three LVs, all raid1
> volumes. Two of the volumes are further used as PV's for an upper VGs.
> One of the upper VGs has no issues. The non-PV LV has no issue. The
> remaining one, /dev/lowerVG/lvmPool, hosting nested LVM, is used as a
> PV for VG "lvm", and has 3 volumes inside. Two of those volumes have
> no issues (and are btrfs), but the last one is /dev/lvm/brokenDisk.
> This volume is the only one that exhibits this behavior, so something
> is special.
>
> Or described as layers:
> /dev/sd{a,b,c,d} => PV => VG "lowerVG"
> /dev/lowerVG/single (RAID1 LV) => BTRFS, works fine
> /dev/lowerVG/works (RAID1 LV) => PV => VG "workingUpper"
> /dev/workingUpper/{a,b,c} => BTRFS, works fine
> /dev/lowerVG/lvmPool (RAID1 LV) => PV => VG "lvm"
> /dev/lvm/{a,b} => BTRFS, works fine
> /dev/lvm/brokenDisk => BTRFS, Exhibits errors

I am a bit curious about the reasons for this setup. However, I understood it as:

/dev/sda -+                 +-- single (RAID1) -> ok             +-> a ok
/dev/sdb  |                 |                                    |-> b ok
/dev/sdc  +--> [lowerVG] >--+-- works (RAID1) -> [workingUpper] -+-> c ok
/dev/sdd -+                 |
                            |                       +-> a -> ok
                            +-- lvmPool -> [lvm] ->-|
                                                    +-> b -> ok
                                                    |
                                                    +-> brokenDisk -> fail

[xxx] means a VG; the others are LVs, which may also act as a PV in
an upper VG

So, it seems that

1) lowerVG/lvmPool/lvm/a
2) lowerVG/lvmPool/lvm/a
3) lowerVG/lvmPool/lvm/brokenDisk

are equivalent ... so I don't understand how 1) and 2) are fine but 3) is
problematic.

Is my understanding of the LVM layouts correct ?


>
> After some investigation, here is what I've found:
>
> 1. This regression was introduced in 5.19. 5.18 and earlier kernels I
> can keep this filesystem rw and everything works as expected, while
> 5.19.0 and later the filesystem is immediately ro on any write
> attempt. I couldn't build rc1, but I did confirm rc2 already has this
> regression.
> 2. Passing /dev/lvm/brokenDisk to a KVM VM as /dev/vdb with an
> unaffected kernel inside the vm exhibits the ro barrier problem on
> unaffected kernels.

Is /dev/lvm/brokenDisk *always* problematic with affected ( >= 5.19 ) and
UNaffected ( < 5.19 ) kernel ?

> 3. Passing /dev/lowerVG/lvmPool to a KVM VM as /dev/vdb with an
> affected kernel inside the VM and using LVM inside the VM exhibits
> correct behavior (I can keep the filesystem rw, no barrier errors on
> host or guest)

Is /dev/lowerVG/lvmPool problematic with only "affected" kernel ?

[...]

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5


2024-02-28 19:40:41

by Patrick Plenefisch

Subject: Re: [REGRESSION] LVM-on-LVM: error while submitting device barriers

On Wed, Feb 28, 2024 at 2:19 PM Goffredo Baroncelli <[email protected]> wrote:
>
> On 28/02/2024 18.25, Patrick Plenefisch wrote:
> > I'm unsure if this is just an LVM bug, or a BTRFS+LVM interaction bug,
> > but LVM is definitely involved somehow.
> > Upgrading from 5.10 to 6.1, I noticed one of my filesystems was
> > read-only. In dmesg, I found:
> >
> > BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr
> > 0, rd 0, flush 1, corrupt 0, gen 0
> > BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max
> > tolerance is 0 for writable mount
> > BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO
> > failure (errors while submitting device barriers.)
> > BTRFS info (device dm-75: state E): forced readonly
> > BTRFS warning (device dm-75: state E): Skipping commit of aborted transaction.
> > BTRFS: error (device dm-75: state EA) in cleanup_transaction:1992:
> > errno=-5 IO failure
> >
> > At first I suspected a btrfs error, but a scrub found no errors, and
> > it continued to be read-write on 5.10 kernels.
> >
> > Here is my setup:
> >
> > /dev/lvm/brokenDisk is a lvm-on-lvm volume. I have /dev/sd{a,b,c,d}
> > (of varying sizes) in a lower VG, which has three LVs, all raid1
> > volumes. Two of the volumes are further used as PV's for an upper VGs.
> > One of the upper VGs has no issues. The non-PV LV has no issue. The
> > remaining one, /dev/lowerVG/lvmPool, hosting nested LVM, is used as a
> > PV for VG "lvm", and has 3 volumes inside. Two of those volumes have
> > no issues (and are btrfs), but the last one is /dev/lvm/brokenDisk.
> > This volume is the only one that exhibits this behavior, so something
> > is special.
> >
> > Or described as layers:
> > /dev/sd{a,b,c,d} => PV => VG "lowerVG"
> > /dev/lowerVG/single (RAID1 LV) => BTRFS, works fine
> > /dev/lowerVG/works (RAID1 LV) => PV => VG "workingUpper"
> > /dev/workingUpper/{a,b,c} => BTRFS, works fine
> > /dev/lowerVG/lvmPool (RAID1 LV) => PV => VG "lvm"
> > /dev/lvm/{a,b} => BTRFS, works fine
> > /dev/lvm/brokenDisk => BTRFS, Exhibits errors
>
> I am a bit curious about the reasons of this setup.

The lowerVG is supposed to be a pool of storage for several VMs &
containers. [workingUpper] is for one VM, and [lvm] is for another VM.
However, right now I'm still organizing the files directly because I
don't have all the VMs fully set up yet.

> However I understood that:
>
> /dev/sda -+ +-- single (RAID1) -> ok +-> a ok
> /dev/sdb | | |-> b ok
> /dev/sdc +--> [lowerVG]>--+-- works (RAID1) -> [workingUpper] -+-> c ok
> /dev/sdd -+ |
> | +-> a -> ok
> +-- lvmPool -> [lvm] ->-|
> +-> b -> ok
> |
> +->brokenDisk -> fail
>
> [xxx] means VG, the others are LVs that may act also as PV in
> an upper VG

Note that lvmPool is also RAID1, but yes

>
> So, it seems that
>
> 1) lowerVG/lvmPool/lvm/a
> 2) lowerVG/lvmPool/lvm/a
> 3) lowerVG/lvmPool/lvm/brokenDisk
>
> are equivalent ... so I don't understand how 1) and 2) are fine but 3) is
> problematic.

I assume you meant lvm/b for 2?

>
> Is my understanding of the LVM layouts correct ?

Your understanding is correct. The only thing that comes to my mind to
cause the problem is asymmetry of the SATA devices. I have one 8TB
device, plus 1.5TB, 3TB, and 3TB drives. Doing the math on the actual
extents, lowerVG/single spans (3TB+3TB), and
lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have
the other leg of raid1 on the 8TB drive, but my thought was that the
jump across the 1.5+3TB drive gap was at least "interesting".
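
(For reference, the per-leg placement can be inspected with something
like:

sudo lvs -a -o lv_name,lv_size,devices lowerVG

which also lists the hidden _rimage/_rmeta sub-LVs and the PVs backing
each segment; that's one way to see which drives each raid1 leg
actually lands on.)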

>
>
> >
> > After some investigation, here is what I've found:
> >
> > 1. This regression was introduced in 5.19. 5.18 and earlier kernels I
> > can keep this filesystem rw and everything works as expected, while
> > 5.19.0 and later the filesystem is immediately ro on any write
> > attempt. I couldn't build rc1, but I did confirm rc2 already has this
> > regression.
> > 2. Passing /dev/lvm/brokenDisk to a KVM VM as /dev/vdb with an
> > unaffected kernel inside the vm exhibits the ro barrier problem on
> > unaffected kernels.
>
> Is /dev/lvm/brokenDisk *always* problematic with affected ( >= 5.19 ) and
> UNaffected ( < 5.19 ) kernel ?

Yes, I didn't test it in as much depth, but 5.15 and 6.1 in the VM
(and 6.1 on the host) are identically problematic

>
> > 3. Passing /dev/lowerVG/lvmPool to a KVM VM as /dev/vdb with an
> > affected kernel inside the VM and using LVM inside the VM exhibits
> > correct behavior (I can keep the filesystem rw, no barrier errors on
> > host or guest)
>
> Is /dev/lowerVG/lvmPool problematic with only "affected" kernel ?

Uh, passing lvmPool directly to the VM is never problematic. I tested
5.10 and 6.1 in the VM (and 6.1 on the host), and neither setup throws
barrier errors.

> [...]
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
>

2024-02-29 19:59:19

by Goffredo Baroncelli

Subject: Re: [REGRESSION] LVM-on-LVM: error while submitting device barriers

On 28/02/2024 20.37, Patrick Plenefisch wrote:
> On Wed, Feb 28, 2024 at 2:19 PM Goffredo Baroncelli <[email protected]> wrote:
>>
>> On 28/02/2024 18.25, Patrick Plenefisch wrote:
>>> I'm unsure if this is just an LVM bug, or a BTRFS+LVM interaction bug,
>>> but LVM is definitely involved somehow.
>>> Upgrading from 5.10 to 6.1, I noticed one of my filesystems was
>>> read-only. In dmesg, I found:
>>>
>>> BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr
>>> 0, rd 0, flush 1, corrupt 0, gen 0
>>> BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max
>>> tolerance is 0 for writable mount
>>> BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO
>>> failure (errors while submitting device barriers.)
>>> BTRFS info (device dm-75: state E): forced readonly
>>> BTRFS warning (device dm-75: state E): Skipping commit of aborted transaction.
>>> BTRFS: error (device dm-75: state EA) in cleanup_transaction:1992:
>>> errno=-5 IO failure
>>>
>>> At first I suspected a btrfs error, but a scrub found no errors, and
>>> it continued to be read-write on 5.10 kernels.
>>>
>>> Here is my setup:
>>>
>>> /dev/lvm/brokenDisk is a lvm-on-lvm volume. I have /dev/sd{a,b,c,d}
>>> (of varying sizes) in a lower VG, which has three LVs, all raid1
>>> volumes. Two of the volumes are further used as PV's for an upper VGs.
>>> One of the upper VGs has no issues. The non-PV LV has no issue. The
>>> remaining one, /dev/lowerVG/lvmPool, hosting nested LVM, is used as a
>>> PV for VG "lvm", and has 3 volumes inside. Two of those volumes have
>>> no issues (and are btrfs), but the last one is /dev/lvm/brokenDisk.
>>> This volume is the only one that exhibits this behavior, so something
>>> is special.
>>>
>>> Or described as layers:
>>> /dev/sd{a,b,c,d} => PV => VG "lowerVG"
>>> /dev/lowerVG/single (RAID1 LV) => BTRFS, works fine
>>> /dev/lowerVG/works (RAID1 LV) => PV => VG "workingUpper"
>>> /dev/workingUpper/{a,b,c} => BTRFS, works fine
>>> /dev/lowerVG/lvmPool (RAID1 LV) => PV => VG "lvm"
>>> /dev/lvm/{a,b} => BTRFS, works fine
>>> /dev/lvm/brokenDisk => BTRFS, Exhibits errors
>>
>> I am a bit curious about the reasons of this setup.
>
> The lowerVG is supposed to be a pool of storage for several VM's &
> containers. [workingUpper] is for one VM, and [lvm] is for another VM.
> However right now I'm still trying to organize the files directly
> because I don't have all the VM's fully setup yet
>
>> However I understood that:
>>
>> /dev/sda -+ +-- single (RAID1) -> ok +-> a ok
>> /dev/sdb | | |-> b ok
>> /dev/sdc +--> [lowerVG]>--+-- works (RAID1) -> [workingUpper] -+-> c ok
>> /dev/sdd -+ |
>> | +-> a -> ok
>> +-- lvmPool (raid1)-> [lvm] ->-|
>> +-> b -> ok
>> |
>> +->brokenDisk -> fail
>>
>> [xxx] means VG, the others are LVs that may act also as PV in
>> an upper VG
>
> Note that lvmPool is also RAID1, but yes
>
>>
>> So, it seems that
>>
>> 1) lowerVG/lvmPool/lvm/a
>> 2) lowerVG/lvmPool/lvm/a
>> 3) lowerVG/lvmPool/lvm/brokenDisk
>>
>> are equivalent ... so I don't understand how 1) and 2) are fine but 3) is
>> problematic.
>
> I assume you meant lvm/b for 2?

Yes

>>
>> Is my understanding of the LVM layouts correct ?
>
> Your understanding is correct. The only thing that comes to my mind to
> cause the problem is asymmetry of the SATA devices. I have one 8TB
> device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual
> extents, lowerVG/single spans (3TB+3TB), and
> lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have
> the other leg of raid1 on the 8TB drive, but my thought was that the
> jump across the 1.5+3TB drive gap was at least "interesting"


what about lowerVG/works ?

However yes, I agree that the pair of disks involved may be the answer
to the problem.

Could you show us the output of

$ sudo pvdisplay -m

>
>>
>>
>>>
>>> After some investigation, here is what I've found:
>>>
>>> 1. This regression was introduced in 5.19. 5.18 and earlier kernels I
>>> can keep this filesystem rw and everything works as expected, while
>>> 5.19.0 and later the filesystem is immediately ro on any write
>>> attempt. I couldn't build rc1, but I did confirm rc2 already has this
>>> regression.
>>> 2. Passing /dev/lvm/brokenDisk to a KVM VM as /dev/vdb with an
>>> unaffected kernel inside the vm exhibits the ro barrier problem on
>>> unaffected kernels.
>>
>> Is /dev/lvm/brokenDisk *always* problematic with affected ( >= 5.19 ) and
>> UNaffected ( < 5.19 ) kernel ?
>
> Yes, I didn't test it in as much depth, but 5.15 and 6.1 in the VM
> (and 6.1 on the host) are identically problematic
>
>>
>>> 3. Passing /dev/lowerVG/lvmPool to a KVM VM as /dev/vdb with an
>>> affected kernel inside the VM and using LVM inside the VM exhibits
>>> correct behavior (I can keep the filesystem rw, no barrier errors on
>>> host or guest)
>>
>> Is /dev/lowerVG/lvmPool problematic with only "affected" kernel ?
>
> Uh, passing lvmPool directly to the VM is never problematic. I tested
> 5.10 and 6.1 in the VM (and 6.1 on the host), and neither setup throws
> barrier errors.
>
>> [...]
>>
>> --
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
>>

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5


2024-02-29 20:23:29

by Patrick Plenefisch

Subject: Re: [REGRESSION] LVM-on-LVM: error while submitting device barriers

On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <[email protected]> wrote:
>
> > Your understanding is correct. The only thing that comes to my mind to
> > cause the problem is asymmetry of the SATA devices. I have one 8TB
> > device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual
> > extents, lowerVG/single spans (3TB+3TB), and
> > lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have
> > the other leg of raid1 on the 8TB drive, but my thought was that the
> > jump across the 1.5+3TB drive gap was at least "interesting"
>
>
> what about lowerVG/works ?
>

That one is only on two disks, it doesn't span any gaps

> However yes, I agree that the pair of disks involved may be the answer
> of the problem.
>
> Could you show us the output of
>
> $ sudo pvdisplay -m
>
>

I trimmed it, but kept the relevant bits (Free PE is thus not correct):


--- Physical volume ---
PV Name /dev/lowerVG/lvmPool
VG Name lvm
PV Size <3.00 TiB / not usable 3.00 MiB
Allocatable yes
PE Size 4.00 MiB
Total PE 786431
Free PE 82943
Allocated PE 703488
PV UUID 7p3LSU-EAHd-xUg0-r9vT-Gzkf-tYFV-mvlU1M

--- Physical Segments ---
Physical extent 0 to 159999:
Logical volume /dev/lvm/brokenDisk
Logical extents 0 to 159999
Physical extent 160000 to 339199:
Logical volume /dev/lvm/a
Logical extents 0 to 179199
Physical extent 339200 to 349439:
Logical volume /dev/lvm/brokenDisk
Logical extents 160000 to 170239
Physical extent 349440 to 351999:
FREE
Physical extent 352000 to 460026:
Logical volume /dev/lvm/brokenDisk
Logical extents 416261 to 524287
Physical extent 460027 to 540409:
FREE
Physical extent 540410 to 786430:
Logical volume /dev/lvm/brokenDisk
Logical extents 170240 to 416260


--- Physical volume ---
PV Name /dev/sda3
VG Name lowerVG
PV Size <2.70 TiB / not usable 3.00 MiB
Allocatable yes
PE Size 4.00 MiB
Total PE 707154
Free PE 909
Allocated PE 706245
PV UUID W8gJ0P-JuMs-1y3g-b5cO-4RuA-MoFs-3zgKBn

--- Physical Segments ---
Physical extent 0 to 52223:
Logical volume /dev/lowerVG/single_corig_rimage_0_iorig
Logical extents 629330 to 681553
Physical extent 52224 to 628940:
Logical volume /dev/lowerVG/single_corig_rimage_0_iorig
Logical extents 0 to 576716
Physical extent 628941 to 628941:
Logical volume /dev/lowerVG/single_corig_rmeta_0
Logical extents 0 to 0
Physical extent 628942 to 628962:
Logical volume /dev/lowerVG/single_corig_rimage_0_iorig
Logical extents 681554 to 681574
Physical extent 628963 to 634431:
Logical volume /dev/lowerVG/single_corig_rimage_0_imeta
Logical extents 0 to 5468
Physical extent 634432 to 654540:
FREE
Physical extent 654541 to 707153:
Logical volume /dev/lowerVG/single_corig_rimage_0_iorig
Logical extents 576717 to 629329

--- Physical volume ---
PV Name /dev/sdf2
VG Name lowerVG
PV Size <7.28 TiB / not usable 4.00 MiB
Allocatable yes
PE Size 4.00 MiB
Total PE 1907645
Free PE 414967
Allocated PE 1492678
PV UUID my0zQM-832Z-HYPD-sNfW-68ms-nddg-lMyWJM

--- Physical Segments ---
Physical extent 0 to 0:
Logical volume /dev/lowerVG/single_corig_rmeta_1
Logical extents 0 to 0
Physical extent 1 to 681575:
Logical volume /dev/lowerVG/single_corig_rimage_1_iorig
Logical extents 0 to 681574
Physical extent 681576 to 687044:
Logical volume /dev/lowerVG/single_corig_rimage_1_imeta
Logical extents 0 to 5468
Physical extent 687045 to 687045:
Logical volume /dev/lowerVG/lvmPool_rmeta_0
Logical extents 0 to 0
Physical extent 687046 to 1049242:
Logical volume /dev/lowerVG/lvmPool_rimage_0
Logical extents 0 to 362196
Physical extent 1049243 to 1056551:
FREE
Physical extent 1056552 to 1473477:
Logical volume /dev/lowerVG/lvmPool_rimage_0
Logical extents 369506 to 786431
Physical extent 1473478 to 1480786:
Logical volume /dev/lowerVG/lvmPool_rimage_0
Logical extents 362197 to 369505
Physical extent 1480787 to 1907644:
FREE

--- Physical volume ---
PV Name /dev/sdb3
VG Name lowerVG
PV Size 1.33 TiB / not usable 3.00 MiB
Allocatable yes (but full)
PE Size 4.00 MiB
Total PE 349398
Free PE 0
Allocated PE 349398
PV UUID Ncmgdw-ZOXS-qTYL-1jAz-w7zt-38V2-f53EpI

--- Physical Segments ---
Physical extent 0 to 0:
Logical volume /dev/lowerVG/lvmPool_rmeta_1
Logical extents 0 to 0
Physical extent 1 to 349397:
Logical volume /dev/lowerVG/lvmPool_rimage_1
Logical extents 0 to 349396


--- Physical volume ---
PV Name /dev/sde2
VG Name lowerVG
PV Size 2.71 TiB / not usable 3.00 MiB
Allocatable yes
PE Size 4.00 MiB
Total PE 711346
Free PE 255111
Allocated PE 456235
PV UUID xUG8TG-wvp0-roBo-GPo7-sbvn-aE7I-NAHU07

--- Physical Segments ---
Physical extent 0 to 416925:
Logical volume /dev/lowerVG/lvmPool_rimage_1
Logical extents 369506 to 786431
Physical extent 416926 to 437034:
Logical volume /dev/lowerVG/lvmPool_rimage_1
Logical extents 349397 to 369505
Physical extent 437035 to 711345:
FREE


Finally, I am not sure if it's relevant, but I did struggle to expand
the raid1 volumes across gaps when creating this setup. I filed a bug
about that, though it may not matter here, as I have since removed
integrity and cache for brokenDisk & lvmPool:
https://gitlab.com/lvmteam/lvm2/-/issues/6

Patrick

2024-02-29 22:05:37

by Goffredo Baroncelli

Subject: Re: [REGRESSION] LVM-on-LVM: error while submitting device barriers

On 29/02/2024 21.22, Patrick Plenefisch wrote:
> On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <[email protected]> wrote:
>>
>>> Your understanding is correct. The only thing that comes to my mind to
>>> cause the problem is asymmetry of the SATA devices. I have one 8TB
>>> device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual
>>> extents, lowerVG/single spans (3TB+3TB), and
>>> lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have
>>> the other leg of raid1 on the 8TB drive, but my thought was that the
>>> jump across the 1.5+3TB drive gap was at least "interesting"
>>
>>
>> what about lowerVG/works ?
>>
>
> That one is only on two disks, it doesn't span any gaps

Sorry, but re-reading the original email I found something that I missed before:

> BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr
> 0, rd 0, flush 1, corrupt 0, gen 0
> BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> tolerance is 0 for writable mount
> BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO
> failure (errors while submitting device barriers.)

Looking at the code, it seems that if a FLUSH command fails, btrfs
considers the disk to be missing. Then it cannot mount the device RW.

I would investigate with the LVM developers whether the flush/barrier
command is properly passed through all the layers when we have LVM
over LVM (raid1). The fact that the LVM volume is a raid1 is important,
because for a flush command to be honored it has to be honored by all
the devices involved.
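
BTW, those per-device counters can also be read from user space;
assuming the filesystem is mounted (the path below is just a
placeholder):

$ sudo btrfs device stats /mnt/brokenDisk

The flush_io_errs line there should match the "flush 1" value in the
dmesg output quoted above.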


>
>> However yes, I agree that the pair of disks involved may be the answer
>> of the problem.
>>
>> Could you show us the output of
>>
>> $ sudo pvdisplay -m
>>
>>
>
> I trimmed it, but kept the relevant bits (Free PE is thus not correct):
>
>
> --- Physical volume ---
> PV Name /dev/lowerVG/lvmPool
> VG Name lvm
> PV Size <3.00 TiB / not usable 3.00 MiB
> Allocatable yes
> PE Size 4.00 MiB
> Total PE 786431
> Free PE 82943
> Allocated PE 703488
> PV UUID 7p3LSU-EAHd-xUg0-r9vT-Gzkf-tYFV-mvlU1M
>
> --- Physical Segments ---
> Physical extent 0 to 159999:
> Logical volume /dev/lvm/brokenDisk
> Logical extents 0 to 159999
> Physical extent 160000 to 339199:
> Logical volume /dev/lvm/a
> Logical extents 0 to 179199
> Physical extent 339200 to 349439:
> Logical volume /dev/lvm/brokenDisk
> Logical extents 160000 to 170239
> Physical extent 349440 to 351999:
> FREE
> Physical extent 352000 to 460026:
> Logical volume /dev/lvm/brokenDisk
> Logical extents 416261 to 524287
> Physical extent 460027 to 540409:
> FREE
> Physical extent 540410 to 786430:
> Logical volume /dev/lvm/brokenDisk
> Logical extents 170240 to 416260
>
>
> --- Physical volume ---
> PV Name /dev/sda3
> VG Name lowerVG
> PV Size <2.70 TiB / not usable 3.00 MiB
> Allocatable yes
> PE Size 4.00 MiB
> Total PE 707154
> Free PE 909
> Allocated PE 706245
> PV UUID W8gJ0P-JuMs-1y3g-b5cO-4RuA-MoFs-3zgKBn
>
> --- Physical Segments ---
> Physical extent 0 to 52223:
> Logical volume /dev/lowerVG/single_corig_rimage_0_iorig
> Logical extents 629330 to 681553
> Physical extent 52224 to 628940:
> Logical volume /dev/lowerVG/single_corig_rimage_0_iorig
> Logical extents 0 to 576716
> Physical extent 628941 to 628941:
> Logical volume /dev/lowerVG/single_corig_rmeta_0
> Logical extents 0 to 0
> Physical extent 628942 to 628962:
> Logical volume /dev/lowerVG/single_corig_rimage_0_iorig
> Logical extents 681554 to 681574
> Physical extent 628963 to 634431:
> Logical volume /dev/lowerVG/single_corig_rimage_0_imeta
> Logical extents 0 to 5468
> Physical extent 634432 to 654540:
> FREE
> Physical extent 654541 to 707153:
> Logical volume /dev/lowerVG/single_corig_rimage_0_iorig
> Logical extents 576717 to 629329
>
> --- Physical volume ---
> PV Name /dev/sdf2
> VG Name lowerVG
> PV Size <7.28 TiB / not usable 4.00 MiB
> Allocatable yes
> PE Size 4.00 MiB
> Total PE 1907645
> Free PE 414967
> Allocated PE 1492678
> PV UUID my0zQM-832Z-HYPD-sNfW-68ms-nddg-lMyWJM
>
> --- Physical Segments ---
> Physical extent 0 to 0:
> Logical volume /dev/lowerVG/single_corig_rmeta_1
> Logical extents 0 to 0
> Physical extent 1 to 681575:
> Logical volume /dev/lowerVG/single_corig_rimage_1_iorig
> Logical extents 0 to 681574
> Physical extent 681576 to 687044:
> Logical volume /dev/lowerVG/single_corig_rimage_1_imeta
> Logical extents 0 to 5468
> Physical extent 687045 to 687045:
> Logical volume /dev/lowerVG/lvmPool_rmeta_0
> Logical extents 0 to 0
> Physical extent 687046 to 1049242:
> Logical volume /dev/lowerVG/lvmPool_rimage_0
> Logical extents 0 to 362196
> Physical extent 1049243 to 1056551:
> FREE
> Physical extent 1056552 to 1473477:
> Logical volume /dev/lowerVG/lvmPool_rimage_0
> Logical extents 369506 to 786431
> Physical extent 1473478 to 1480786:
> Logical volume /dev/lowerVG/lvmPool_rimage_0
> Logical extents 362197 to 369505
> Physical extent 1480787 to 1907644:
> FREE
>
> --- Physical volume ---
> PV Name /dev/sdb3
> VG Name lowerVG
> PV Size 1.33 TiB / not usable 3.00 MiB
> Allocatable yes (but full)
> PE Size 4.00 MiB
> Total PE 349398
> Free PE 0
> Allocated PE 349398
> PV UUID Ncmgdw-ZOXS-qTYL-1jAz-w7zt-38V2-f53EpI
>
> --- Physical Segments ---
> Physical extent 0 to 0:
> Logical volume /dev/lowerVG/lvmPool_rmeta_1
> Logical extents 0 to 0
> Physical extent 1 to 349397:
> Logical volume /dev/lowerVG/lvmPool_rimage_1
> Logical extents 0 to 349396
>
>
> --- Physical volume ---
> PV Name /dev/sde2
> VG Name lowerVG
> PV Size 2.71 TiB / not usable 3.00 MiB
> Allocatable yes
> PE Size 4.00 MiB
> Total PE 711346
> Free PE 255111
> Allocated PE 456235
> PV UUID xUG8TG-wvp0-roBo-GPo7-sbvn-aE7I-NAHU07
>
> --- Physical Segments ---
> Physical extent 0 to 416925:
> Logical volume /dev/lowerVG/lvmPool_rimage_1
> Logical extents 369506 to 786431
> Physical extent 416926 to 437034:
> Logical volume /dev/lowerVG/lvmPool_rimage_1
> Logical extents 349397 to 369505
> Physical extent 437035 to 711345:
> FREE
>
>
> Finally, I am not sure if it's relevant, but I did struggle to expand
> the raid1 volumes across gaps when creating this setup. I did file a
> bug about that, though I am not sure if it's relevant, as I removed
> integrity and cache for brokenDisk & lvmPool:
> https://gitlab.com/lvmteam/lvm2/-/issues/6
>
> Patrick
>

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5


2024-03-05 17:45:32

by Mike Snitzer

Subject: Re: LVM-on-LVM: error while submitting device barriers

On Thu, Feb 29 2024 at 5:05P -0500,
Goffredo Baroncelli <[email protected]> wrote:

> On 29/02/2024 21.22, Patrick Plenefisch wrote:
> > On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <[email protected]> wrote:
> > >
> > > > Your understanding is correct. The only thing that comes to my mind to
> > > > cause the problem is asymmetry of the SATA devices. I have one 8TB
> > > > device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual
> > > > extents, lowerVG/single spans (3TB+3TB), and
> > > > lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have
> > > > the other leg of raid1 on the 8TB drive, but my thought was that the
> > > > jump across the 1.5+3TB drive gap was at least "interesting"
> > >
> > >
> > > what about lowerVG/works ?
> > >
> >
> > That one is only on two disks, it doesn't span any gaps
>
> Sorry, but re-reading the original email I found something that I missed before:
>
> > BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr
> > 0, rd 0, flush 1, corrupt 0, gen 0
> > BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > tolerance is 0 for writable mount
> > BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO
> > failure (errors while submitting device barriers.)
>
> Looking at the code, it seems that if a FLUSH commands fails, btrfs
> considers that the disk is missing. The it cannot mount RW the device.
>
> I would investigate with the LVM developers, if it properly passes
> the flush/barrier command through all the layers, when we have an
> lvm over lvm (raid1). The fact that the lvm is a raid1, is important because
> a flush command to be honored has to be honored by all the
> devices involved.

Hi Patrick,

Your initial report (start of this thread) mentioned that the
regression occurred with 5.19. The DM changes that landed during the
5.19 merge window refactored quite a bit of DM core's handling for bio
splitting (to simplify DM's newfound support for bio polling) -- Ming
Lei (now cc'd) and I wrote these changes:

e86f2b005a51 dm: simplify basic targets
bdb34759a0db dm: use bio_sectors in dm_accept_partial_bio
b992b40dfcc1 dm: don't pass bio to __dm_start_io_acct and dm_end_io_acct
e6926ad0c988 dm: pass dm_io instance to dm_io_acct directly
d3de6d12694d dm: switch to bdev based IO accounting interfaces
7dd76d1feec7 dm: improve bio splitting and associated IO accounting
2e803cd99ba8 dm: don't grab target io reference in dm_zone_map_bio
0f14d60a023c dm: improve dm_io reference counting
ec211631ae24 dm: put all polled dm_io instances into a single list
9d20653fe84e dm: simplify bio-based IO accounting further
4edadf6dcb54 dm: improve abnormal bio processing

I'll have a closer look at these DM commits (especially relative to
flush bios and your stacked device usage).

The last commit (4edadf6dcb54) is marginally relevant (but likely most
easily reverted from v5.19-rc2, as a simple test to see if it is somehow
a problem... doubtful to be the cause, but worth a try).
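
(If you want to try that yourself, a rough sketch, assuming a clean
v5.19-rc2 tree; the revert may or may not apply without conflicts:

git checkout v5.19-rc2
git revert 4edadf6dcb54
make olddefconfig && make -j$(nproc)

then boot that kernel on the host and retry a write to the affected
filesystem.)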

(FYI, not relevant because it is specific to REQ_NOWAIT but figured I'd
mention it, this commit earlier in the 5.19 DM changes was bogus:
563a225c9fd2 dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio
Jens fixed it with this stable@ commit:
a9ce385344f9 dm: don't attempt to queue IO under RCU protection)

> > > However yes, I agree that the pair of disks involved may be the answer
> > > of the problem.
> > >
> > > Could you show us the output of
> > >
> > > $ sudo pvdisplay -m
> > >
> > >
> >
> > I trimmed it, but kept the relevant bits (Free PE is thus not correct):
> >
> >
> > --- Physical volume ---
> > PV Name /dev/lowerVG/lvmPool
> > VG Name lvm
> > PV Size <3.00 TiB / not usable 3.00 MiB
> > Allocatable yes
> > PE Size 4.00 MiB
> > Total PE 786431
> > Free PE 82943
> > Allocated PE 703488
> > PV UUID 7p3LSU-EAHd-xUg0-r9vT-Gzkf-tYFV-mvlU1M
> >
> > --- Physical Segments ---
> > Physical extent 0 to 159999:
> > Logical volume /dev/lvm/brokenDisk
> > Logical extents 0 to 159999
> > Physical extent 160000 to 339199:
> > Logical volume /dev/lvm/a
> > Logical extents 0 to 179199
> > Physical extent 339200 to 349439:
> > Logical volume /dev/lvm/brokenDisk
> > Logical extents 160000 to 170239
> > Physical extent 349440 to 351999:
> > FREE
> > Physical extent 352000 to 460026:
> > Logical volume /dev/lvm/brokenDisk
> > Logical extents 416261 to 524287
> > Physical extent 460027 to 540409:
> > FREE
> > Physical extent 540410 to 786430:
> > Logical volume /dev/lvm/brokenDisk
> > Logical extents 170240 to 416260

Please provide the following from the guest that activates /dev/lvm/brokenDisk:

lsblk
dmsetup table

Please also provide the same from the host (just for completeness).

Also, I didn't see any kernel logs that show DM-specific errors. I
doubt you'd have left any DM-specific errors out in your report. So
is btrfs the canary here? To be clear: You're only seeing btrfs
errors in the kernel log?

Mike

2024-03-06 16:00:25

by Ming Lei

Subject: Re: LVM-on-LVM: error while submitting device barriers

On Tue, Mar 05, 2024 at 12:45:13PM -0500, Mike Snitzer wrote:
> On Thu, Feb 29 2024 at 5:05P -0500,
> Goffredo Baroncelli <[email protected]> wrote:
>
> > On 29/02/2024 21.22, Patrick Plenefisch wrote:
> > > On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <[email protected]> wrote:
> > > >
> > > > > Your understanding is correct. The only thing that comes to my mind to
> > > > > cause the problem is asymmetry of the SATA devices. I have one 8TB
> > > > > device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual
> > > > > extents, lowerVG/single spans (3TB+3TB), and
> > > > > lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have
> > > > > the other leg of raid1 on the 8TB drive, but my thought was that the
> > > > > jump across the 1.5+3TB drive gap was at least "interesting"
> > > >
> > > >
> > > > what about lowerVG/works ?
> > > >
> > >
> > > That one is only on two disks, it doesn't span any gaps
> >
> > Sorry, but re-reading the original email I found something that I missed before:
> >
> > > BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr
> > > 0, rd 0, flush 1, corrupt 0, gen 0
> > > BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > tolerance is 0 for writable mount
> > > BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO
> > > failure (errors while submitting device barriers.)
> >
> > Looking at the code, it seems that if a FLUSH commands fails, btrfs
> > considers that the disk is missing. The it cannot mount RW the device.
> >
> > I would investigate with the LVM developers, if it properly passes
> > the flush/barrier command through all the layers, when we have an
> > lvm over lvm (raid1). The fact that the lvm is a raid1, is important because
> > a flush command to be honored has to be honored by all the
> > devices involved.

Hello Patrick & Goffredo,

I can trigger this kind of btrfs complaint by simulating one FLUSH failure.

If you can reproduce this issue easily, please collect a log with the
following bpftrace script, which may show where the flush failure is
and may help to narrow down the issue in the whole stack.


#!/usr/bin/bpftrace

#ifndef BPFTRACE_HAVE_BTF
#include <linux/blkdev.h>
#endif

/* remember the submission stack of every bio that carries REQ_PREFLUSH */
kprobe:submit_bio_noacct,
kprobe:submit_bio
/ (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 /
{
        $bio = (struct bio *)arg0;
        @submit_stack[arg0] = kstack;
        @tracked[arg0] = 1;
}

/* on completion, print both stacks if a tracked flush bio failed */
kprobe:bio_endio
/@tracked[arg0] != 0/
{
        $bio = (struct bio *)arg0;

        if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) {
                return;
        }

        if ($bio->bi_status != 0) {
                printf("dev %s bio failed %d, submitter %s completion %s\n",
                        $bio->bi_bdev->bd_disk->disk_name,
                        $bio->bi_status, @submit_stack[arg0], kstack);
        }
        delete(@submit_stack[arg0]);
        delete(@tracked[arg0]);
}

END {
        clear(@submit_stack);
        clear(@tracked);
}
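
(To run it, assuming bpftrace is installed on the host: save the
script as e.g. flush.bt, start it as root with

sudo bpftrace flush.bt

and then trigger a write on the affected filesystem so the failing
flush is captured.)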



Thanks,
Ming


2024-03-09 20:39:19

by Patrick Plenefisch

Subject: Re: LVM-on-LVM: error while submitting device barriers

On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <[email protected]> wrote:
>
> On Tue, Mar 05, 2024 at 12:45:13PM -0500, Mike Snitzer wrote:
> > On Thu, Feb 29 2024 at 5:05P -0500,
> > Goffredo Baroncelli <[email protected]> wrote:
> >
> > > On 29/02/2024 21.22, Patrick Plenefisch wrote:
> > > > On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <[email protected]> wrote:
> > > > >
> > > > > > Your understanding is correct. The only thing that comes to my mind to
> > > > > > cause the problem is asymmetry of the SATA devices. I have one 8TB
> > > > > > device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual
> > > > > > extents, lowerVG/single spans (3TB+3TB), and
> > > > > > lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have
> > > > > > the other leg of raid1 on the 8TB drive, but my thought was that the
> > > > > > jump across the 1.5+3TB drive gap was at least "interesting"
> > > > >
> > > > >
> > > > > what about lowerVG/works ?
> > > > >
> > > >
> > > > That one is only on two disks, it doesn't span any gaps
> > >
> > > Sorry, but re-reading the original email I found something that I missed before:
> > >
> > > > BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr
> > > > 0, rd 0, flush 1, corrupt 0, gen 0
> > > > BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > tolerance is 0 for writable mount
> > > > BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO
> > > > failure (errors while submitting device barriers.)
> > >
> > > Looking at the code, it seems that if a FLUSH commands fails, btrfs
> > > considers that the disk is missing. The it cannot mount RW the device.
> > >
> > > I would investigate with the LVM developers, if it properly passes
> > > the flush/barrier command through all the layers, when we have an
> > > lvm over lvm (raid1). The fact that the lvm is a raid1, is important because
> > > a flush command to be honored has to be honored by all the
> > > devices involved.
>
> Hello Patrick & Goffredo,
>
> I can trigger this kind of btrfs complaint by simulating one FLUSH failure.
>
> If you can reproduce this issue easily, please collect log by the
> following bpftrace script, which may show where the flush failure is,
> and maybe it can help to narrow down the issue in the whole stack.
>
>
> #!/usr/bin/bpftrace
>
> #ifndef BPFTRACE_HAVE_BTF
> #include <linux/blkdev.h>
> #endif
>
> kprobe:submit_bio_noacct,
> kprobe:submit_bio
> / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 /
> {
> $bio = (struct bio *)arg0;
> @submit_stack[arg0] = kstack;
> @tracked[arg0] = 1;
> }
>
> kprobe:bio_endio
> /@tracked[arg0] != 0/
> {
> $bio = (struct bio *)arg0;
>
> if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) {
> return;
> }
>
> if ($bio->bi_status != 0) {
> printf("dev %s bio failed %d, submitter %s completion %s\n",
> $bio->bi_bdev->bd_disk->disk_name,
> $bio->bi_status, @submit_stack[arg0], kstack);
> }
> delete(@submit_stack[arg0]);
> delete(@tracked[arg0]);
> }
>
> END {
> clear(@submit_stack);
> clear(@tracked);
> }
>

Attaching 4 probes...
dev dm-77 bio failed 10, submitter
submit_bio_noacct+5
__send_duplicate_bios+358
__send_empty_flush+179
dm_submit_bio+857
__submit_bio+132
submit_bio_noacct_nocheck+345
write_all_supers+1718
btrfs_commit_transaction+2342
transaction_kthread+345
kthread+229
ret_from_fork+49
ret_from_fork_asm+27
completion
bio_endio+5
dm_submit_bio+955
__submit_bio+132
submit_bio_noacct_nocheck+345
write_all_supers+1718
btrfs_commit_transaction+2342
transaction_kthread+345
kthread+229
ret_from_fork+49
ret_from_fork_asm+27

dev dm-86 bio failed 10, submitter
submit_bio_noacct+5
write_all_supers+1718
btrfs_commit_transaction+2342
transaction_kthread+345
kthread+229
ret_from_fork+49
ret_from_fork_asm+27
completion
bio_endio+5
clone_endio+295
clone_endio+295
process_one_work+369
worker_thread+635
kthread+229
ret_from_fork+49
ret_from_fork_asm+27


For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool
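
(For anyone cross-checking the mapping, the dm-NN kernel names can be
matched to LV names with e.g.

sudo dmsetup ls
ls -l /dev/mapper/

which show each mapped device together with its dm minor / dm-NN node.)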


>
>
> Thanks,
> Ming
>

And to answer Mike's question:
>
> Also, I didn't see any kernel logs that show DM-specific errors. I
> doubt you'd have left any DM-specific errors out in your report. So
> is btrfs the canary here? To be clear: You're only seeing btrfs
> errors in the kernel log?

Correct, that's why I initially thought it was a btrfs issue. No DM
errors in dmesg, btrfs is just the canary

2024-03-10 11:35:06

by Ming Lei

Subject: Re: LVM-on-LVM: error while submitting device barriers

On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote:
> On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <[email protected]> wrote:
> >
> > On Tue, Mar 05, 2024 at 12:45:13PM -0500, Mike Snitzer wrote:
> > > On Thu, Feb 29 2024 at 5:05P -0500,
> > > Goffredo Baroncelli <[email protected]> wrote:
> > >
> > > > On 29/02/2024 21.22, Patrick Plenefisch wrote:
> > > > > On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <[email protected]> wrote:
> > > > > >
> > > > > > > Your understanding is correct. The only thing that comes to my mind to
> > > > > > > cause the problem is asymmetry of the SATA devices. I have one 8TB
> > > > > > > device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual
> > > > > > > extents, lowerVG/single spans (3TB+3TB), and
> > > > > > > lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have
> > > > > > > the other leg of raid1 on the 8TB drive, but my thought was that the
> > > > > > > jump across the 1.5+3TB drive gap was at least "interesting"
> > > > > >
> > > > > >
> > > > > > what about lowerVG/works ?
> > > > > >
> > > > >
> > > > > That one is only on two disks, it doesn't span any gaps
> > > >
> > > > Sorry, but re-reading the original email I found something that I missed before:
> > > >
> > > > > BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr
> > > > > 0, rd 0, flush 1, corrupt 0, gen 0
> > > > > BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max
> > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > tolerance is 0 for writable mount
> > > > > BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO
> > > > > failure (errors while submitting device barriers.)
> > > >
> > > > Looking at the code, it seems that if a FLUSH commands fails, btrfs
> > > > considers that the disk is missing. The it cannot mount RW the device.
> > > >
> > > > I would investigate with the LVM developers, if it properly passes
> > > > the flush/barrier command through all the layers, when we have an
> > > > lvm over lvm (raid1). The fact that the lvm is a raid1, is important because
> > > > a flush command to be honored has to be honored by all the
> > > > devices involved.
> >
> > Hello Patrick & Goffredo,
> >
> > I can trigger this kind of btrfs complaint by simulating one FLUSH failure.
> >
> > If you can reproduce this issue easily, please collect log by the
> > following bpftrace script, which may show where the flush failure is,
> > and maybe it can help to narrow down the issue in the whole stack.
> >
> >
> > #!/usr/bin/bpftrace
> >
> > #ifndef BPFTRACE_HAVE_BTF
> > #include <linux/blkdev.h>
> > #endif
> >
> > kprobe:submit_bio_noacct,
> > kprobe:submit_bio
> > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 /
> > {
> > $bio = (struct bio *)arg0;
> > @submit_stack[arg0] = kstack;
> > @tracked[arg0] = 1;
> > }
> >
> > kprobe:bio_endio
> > /@tracked[arg0] != 0/
> > {
> > $bio = (struct bio *)arg0;
> >
> > if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) {
> > return;
> > }
> >
> > if ($bio->bi_status != 0) {
> > printf("dev %s bio failed %d, submitter %s completion %s\n",
> > $bio->bi_bdev->bd_disk->disk_name,
> > $bio->bi_status, @submit_stack[arg0], kstack);
> > }
> > delete(@submit_stack[arg0]);
> > delete(@tracked[arg0]);
> > }
> >
> > END {
> > clear(@submit_stack);
> > clear(@tracked);
> > }
> >
>
> Attaching 4 probes...
> dev dm-77 bio failed 10, submitter
> submit_bio_noacct+5
> __send_duplicate_bios+358
> __send_empty_flush+179
> dm_submit_bio+857
> __submit_bio+132
> submit_bio_noacct_nocheck+345
> write_all_supers+1718
> btrfs_commit_transaction+2342
> transaction_kthread+345
> kthread+229
> ret_from_fork+49
> ret_from_fork_asm+27
> completion
> bio_endio+5
> dm_submit_bio+955
> __submit_bio+132
> submit_bio_noacct_nocheck+345
> write_all_supers+1718
> btrfs_commit_transaction+2342
> transaction_kthread+345
> kthread+229
> ret_from_fork+49
> ret_from_fork_asm+27
>
> dev dm-86 bio failed 10, submitter
> submit_bio_noacct+5
> write_all_supers+1718
> btrfs_commit_transaction+2342
> transaction_kthread+345
> kthread+229
> ret_from_fork+49
> ret_from_fork_asm+27
> completion
> bio_endio+5
> clone_endio+295
> clone_endio+295
> process_one_work+369
> worker_thread+635
> kthread+229
> ret_from_fork+49
> ret_from_fork_asm+27
>
>
> For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool

io_status is 10 (BLK_STS_IOERR), which is produced in the submission code path
on /dev/dm-77 (/dev/lowerVG/lvmPool) first, so it looks like a device mapper issue.

The error should be from the following code only:

static void __map_bio(struct bio *clone)

        ...
                if (r == DM_MAPIO_KILL)
                        dm_io_dec_pending(io, BLK_STS_IOERR);
                else
                        dm_io_dec_pending(io, BLK_STS_DM_REQUEUE);
                break;

Patrick, you mentioned lvmPool is raid1; can you explain how lvmPool is
built? Is it a dm-raid1 target, or is it on top of a plain raid1 device
built over /dev/lowerVG?
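
(The dmsetup table output Mike asked for should answer this; for
example:

sudo dmsetup table lowerVG-lvmPool
sudo dmsetup table lvm-brokenDisk

The third field of each line is the target type, e.g. "raid" for lvm
raid1 LVs vs "linear" for plain LVs; the device names here follow the
vg-lv pattern seen in /dev/mapper/lvm-brokenDisk above.)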

Mike, the logic in the following code doesn't change from v5.18-rc2 to
v5.19, but I still can't understand why STS_IOERR is set in
dm_io_complete() in case of BLK_STS_DM_REQUEUE && !__noflush_suspending(),
since DMF_NOFLUSH_SUSPENDING is only set in __dm_suspend() which
is supposed to not happen in Patrick's case.

dm_io_complete()
        ...
        if (io->status == BLK_STS_DM_REQUEUE) {
                unsigned long flags;
                /*
                 * Target requested pushing back the I/O.
                 */
                spin_lock_irqsave(&md->deferred_lock, flags);
                if (__noflush_suspending(md) &&
                    !WARN_ON_ONCE(dm_is_zone_write(md, bio))) {
                        /* NOTE early return due to BLK_STS_DM_REQUEUE below */
                        bio_list_add_head(&md->deferred, bio);
                } else {
                        /*
                         * noflush suspend was interrupted or this is
                         * a write to a zoned target.
                         */
                        io->status = BLK_STS_IOERR;
                }
                spin_unlock_irqrestore(&md->deferred_lock, flags);
        }



thanks,
Ming


2024-03-10 15:27:42

by Mike Snitzer

Subject: Re: LVM-on-LVM: error while submitting device barriers

On Sun, Mar 10 2024 at 7:34P -0400,
Ming Lei <[email protected]> wrote:

> On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote:
> > On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <[email protected]> wrote:
> > >
> > > On Tue, Mar 05, 2024 at 12:45:13PM -0500, Mike Snitzer wrote:
> > > > On Thu, Feb 29 2024 at 5:05P -0500,
> > > > Goffredo Baroncelli <[email protected]> wrote:
> > > >
> > > > > On 29/02/2024 21.22, Patrick Plenefisch wrote:
> > > > > > On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <[email protected]> wrote:
> > > > > > >
> > > > > > > > Your understanding is correct. The only thing that comes to my mind to
> > > > > > > > cause the problem is asymmetry of the SATA devices. I have one 8TB
> > > > > > > > device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual
> > > > > > > > extents, lowerVG/single spans (3TB+3TB), and
> > > > > > > > lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have
> > > > > > > > the other leg of raid1 on the 8TB drive, but my thought was that the
> > > > > > > > jump across the 1.5+3TB drive gap was at least "interesting"
> > > > > > >
> > > > > > >
> > > > > > > what about lowerVG/works ?
> > > > > > >
> > > > > >
> > > > > > That one is only on two disks, it doesn't span any gaps
> > > > >
> > > > > Sorry, but re-reading the original email I found something that I missed before:
> > > > >
> > > > > > BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr
> > > > > > 0, rd 0, flush 1, corrupt 0, gen 0
> > > > > > BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max
> > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > > tolerance is 0 for writable mount
> > > > > > BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO
> > > > > > failure (errors while submitting device barriers.)
> > > > >
> > > > > Looking at the code, it seems that if a FLUSH commands fails, btrfs
> > > > > considers that the disk is missing. The it cannot mount RW the device.
> > > > >
> > > > > I would investigate with the LVM developers, if it properly passes
> > > > > the flush/barrier command through all the layers, when we have an
> > > > > lvm over lvm (raid1). The fact that the lvm is a raid1, is important because
> > > > > a flush command to be honored has to be honored by all the
> > > > > devices involved.
> > >
> > > Hello Patrick & Goffredo,
> > >
> > > I can trigger this kind of btrfs complaint by simulating one FLUSH failure.
> > >
> > > If you can reproduce this issue easily, please collect log by the
> > > following bpftrace script, which may show where the flush failure is,
> > > and maybe it can help to narrow down the issue in the whole stack.
> > >
> > >
> > > #!/usr/bin/bpftrace
> > >
> > > #ifndef BPFTRACE_HAVE_BTF
> > > #include <linux/blkdev.h>
> > > #endif
> > >
> > > kprobe:submit_bio_noacct,
> > > kprobe:submit_bio
> > > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 /
> > > {
> > > $bio = (struct bio *)arg0;
> > > @submit_stack[arg0] = kstack;
> > > @tracked[arg0] = 1;
> > > }
> > >
> > > kprobe:bio_endio
> > > /@tracked[arg0] != 0/
> > > {
> > > $bio = (struct bio *)arg0;
> > >
> > > if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) {
> > > return;
> > > }
> > >
> > > if ($bio->bi_status != 0) {
> > > printf("dev %s bio failed %d, submitter %s completion %s\n",
> > > $bio->bi_bdev->bd_disk->disk_name,
> > > $bio->bi_status, @submit_stack[arg0], kstack);
> > > }
> > > delete(@submit_stack[arg0]);
> > > delete(@tracked[arg0]);
> > > }
> > >
> > > END {
> > > clear(@submit_stack);
> > > clear(@tracked);
> > > }
> > >
> >
> > Attaching 4 probes...
> > dev dm-77 bio failed 10, submitter
> > submit_bio_noacct+5
> > __send_duplicate_bios+358
> > __send_empty_flush+179
> > dm_submit_bio+857
> > __submit_bio+132
> > submit_bio_noacct_nocheck+345
> > write_all_supers+1718
> > btrfs_commit_transaction+2342
> > transaction_kthread+345
> > kthread+229
> > ret_from_fork+49
> > ret_from_fork_asm+27
> > completion
> > bio_endio+5
> > dm_submit_bio+955
> > __submit_bio+132
> > submit_bio_noacct_nocheck+345
> > write_all_supers+1718
> > btrfs_commit_transaction+2342
> > transaction_kthread+345
> > kthread+229
> > ret_from_fork+49
> > ret_from_fork_asm+27
> >
> > dev dm-86 bio failed 10, submitter
> > submit_bio_noacct+5
> > write_all_supers+1718
> > btrfs_commit_transaction+2342
> > transaction_kthread+345
> > kthread+229
> > ret_from_fork+49
> > ret_from_fork_asm+27
> > completion
> > bio_endio+5
> > clone_endio+295
> > clone_endio+295
> > process_one_work+369
> > worker_thread+635
> > kthread+229
> > ret_from_fork+49
> > ret_from_fork_asm+27
> >
> >
> > For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool
>
> io_status is 10(BLK_STS_IOERR), which is produced in submission code path on
> /dev/dm-77(/dev/lowerVG/lvmPool) first, so looks it is one device mapper issue.
>
> The error should be from the following code only:
>
> static void __map_bio(struct bio *clone)
>
> ...
> if (r == DM_MAPIO_KILL)
> dm_io_dec_pending(io, BLK_STS_IOERR);
> else
> dm_io_dec_pending(io, BLK_STS_DM_REQUEUE);
> break;

I agree that the above bpf stack traces for dm-77 indicate that
dm_submit_bio failed, which would end up in the above branch if the
target's ->map() returned DM_MAPIO_KILL or DM_MAPIO_REQUEUE.

But such an early failure speaks to the flush bio never being
submitted to the underlying storage. No?

dm-raid.c:raid_map does return DM_MAPIO_REQUEUE with:

        /*
         * If we're reshaping to add disk(s)), ti->len and
         * mddev->array_sectors will differ during the process
         * (ti->len > mddev->array_sectors), so we have to requeue
         * bios with addresses > mddev->array_sectors here or
         * there will occur accesses past EOD of the component
         * data images thus erroring the raid set.
         */
        if (unlikely(bio_end_sector(bio) > mddev->array_sectors))
                return DM_MAPIO_REQUEUE;

But a flush doesn't have an end_sector (it'd be 0 afaik)... so it seems
weird relative to a flush.

> Patrick, you mentioned lvmPool is raid1, can you explain how lvmPool is
> built? It is dm-raid1 target or over plain raid1 device which is
> build over /dev/lowerVG?

In my earlier reply I asked Patrick for both:
lsblk
dmsetup table

Picking over the described IO stacks provided earlier (or Goffredo's
interpretation of it, via ascii art) isn't really a great way to see
the IO stacks that are in use/question.

> Mike, the logic in the following code doesn't change from v5.18-rc2 to
> v5.19, but I still can't understand why STS_IOERR is set in
> dm_io_complete() in case of BLK_STS_DM_REQUEUE && !__noflush_suspending(),
> since DMF_NOFLUSH_SUSPENDING is only set in __dm_suspend() which
> is supposed to not happen in Patrick's case.
>
> dm_io_complete()
> ...
> if (io->status == BLK_STS_DM_REQUEUE) {
> unsigned long flags;
> /*
> * Target requested pushing back the I/O.
> */
> spin_lock_irqsave(&md->deferred_lock, flags);
> if (__noflush_suspending(md) &&
> !WARN_ON_ONCE(dm_is_zone_write(md, bio))) {
> /* NOTE early return due to BLK_STS_DM_REQUEUE below */
> bio_list_add_head(&md->deferred, bio);
> } else {
> /*
> * noflush suspend was interrupted or this is
> * a write to a zoned target.
> */
> io->status = BLK_STS_IOERR;
> }
> spin_unlock_irqrestore(&md->deferred_lock, flags);
> }

Given the reason dm-raid.c:raid_map returns DM_MAPIO_REQUEUE, I think
the DM device could be suspending without flush.

But regardless, given you logged BLK_STS_IOERR let's assume it isn't;
the assumption that "noflush suspend was interrupted" seems like a
stale comment -- especially given that targets like dm-raid are now
using DM_MAPIO_REQUEUE without concern for the historic tight coupling
with noflush suspend (which was always the case for the biggest historic
reason for this code: dm-multipath; see commit 2e93ccc1933d0 from
2006 -- predates my time with developing DM).

So all said, this code seems flawed for dm-raid (and possibly other
targets that return DM_MAPIO_REQUEUE). I'll look closer this week.

Mike

2024-03-10 15:48:08

by Ming Lei

Subject: Re: LVM-on-LVM: error while submitting device barriers

On Sun, Mar 10, 2024 at 11:27:22AM -0400, Mike Snitzer wrote:
> On Sun, Mar 10 2024 at 7:34P -0400,
> Ming Lei <[email protected]> wrote:
>
> > On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote:
> > > On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <[email protected]> wrote:
> > > >
> > > > On Tue, Mar 05, 2024 at 12:45:13PM -0500, Mike Snitzer wrote:
> > > > > On Thu, Feb 29 2024 at 5:05P -0500,
> > > > > Goffredo Baroncelli <[email protected]> wrote:
> > > > >
> > > > > > On 29/02/2024 21.22, Patrick Plenefisch wrote:
> > > > > > > On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Your understanding is correct. The only thing that comes to my mind to
> > > > > > > > > cause the problem is asymmetry of the SATA devices. I have one 8TB
> > > > > > > > > device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual
> > > > > > > > > extents, lowerVG/single spans (3TB+3TB), and
> > > > > > > > > lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have
> > > > > > > > > the other leg of raid1 on the 8TB drive, but my thought was that the
> > > > > > > > > jump across the 1.5+3TB drive gap was at least "interesting"
> > > > > > > >
> > > > > > > >
> > > > > > > > what about lowerVG/works ?
> > > > > > > >
> > > > > > >
> > > > > > > That one is only on two disks, it doesn't span any gaps
> > > > > >
> > > > > > Sorry, but re-reading the original email I found something that I missed before:
> > > > > >
> > > > > > > BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr
> > > > > > > 0, rd 0, flush 1, corrupt 0, gen 0
> > > > > > > BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max
> > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > > > tolerance is 0 for writable mount
> > > > > > > BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO
> > > > > > > failure (errors while submitting device barriers.)
> > > > > >
> > > > > > Looking at the code, it seems that if a FLUSH commands fails, btrfs
> > > > > > considers that the disk is missing. The it cannot mount RW the device.
> > > > > >
> > > > > > I would investigate with the LVM developers, if it properly passes
> > > > > > the flush/barrier command through all the layers, when we have an
> > > > > > lvm over lvm (raid1). The fact that the lvm is a raid1, is important because
> > > > > > a flush command to be honored has to be honored by all the
> > > > > > devices involved.
> > > >
> > > > Hello Patrick & Goffredo,
> > > >
> > > > I can trigger this kind of btrfs complaint by simulating one FLUSH failure.
> > > >
> > > > If you can reproduce this issue easily, please collect log by the
> > > > following bpftrace script, which may show where the flush failure is,
> > > > and maybe it can help to narrow down the issue in the whole stack.
> > > >
> > > >
> > > > #!/usr/bin/bpftrace
> > > >
> > > > #ifndef BPFTRACE_HAVE_BTF
> > > > #include <linux/blkdev.h>
> > > > #endif
> > > >
> > > > kprobe:submit_bio_noacct,
> > > > kprobe:submit_bio
> > > > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 /
> > > > {
> > > > $bio = (struct bio *)arg0;
> > > > @submit_stack[arg0] = kstack;
> > > > @tracked[arg0] = 1;
> > > > }
> > > >
> > > > kprobe:bio_endio
> > > > /@tracked[arg0] != 0/
> > > > {
> > > > $bio = (struct bio *)arg0;
> > > >
> > > > if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) {
> > > > return;
> > > > }
> > > >
> > > > if ($bio->bi_status != 0) {
> > > > printf("dev %s bio failed %d, submitter %s completion %s\n",
> > > > $bio->bi_bdev->bd_disk->disk_name,
> > > > $bio->bi_status, @submit_stack[arg0], kstack);
> > > > }
> > > > delete(@submit_stack[arg0]);
> > > > delete(@tracked[arg0]);
> > > > }
> > > >
> > > > END {
> > > > clear(@submit_stack);
> > > > clear(@tracked);
> > > > }
> > > >
> > >
> > > Attaching 4 probes...
> > > dev dm-77 bio failed 10, submitter
> > > submit_bio_noacct+5
> > > __send_duplicate_bios+358
> > > __send_empty_flush+179
> > > dm_submit_bio+857
> > > __submit_bio+132
> > > submit_bio_noacct_nocheck+345
> > > write_all_supers+1718
> > > btrfs_commit_transaction+2342
> > > transaction_kthread+345
> > > kthread+229
> > > ret_from_fork+49
> > > ret_from_fork_asm+27
> > > completion
> > > bio_endio+5
> > > dm_submit_bio+955
> > > __submit_bio+132
> > > submit_bio_noacct_nocheck+345
> > > write_all_supers+1718
> > > btrfs_commit_transaction+2342
> > > transaction_kthread+345
> > > kthread+229
> > > ret_from_fork+49
> > > ret_from_fork_asm+27
> > >
> > > dev dm-86 bio failed 10, submitter
> > > submit_bio_noacct+5
> > > write_all_supers+1718
> > > btrfs_commit_transaction+2342
> > > transaction_kthread+345
> > > kthread+229
> > > ret_from_fork+49
> > > ret_from_fork_asm+27
> > > completion
> > > bio_endio+5
> > > clone_endio+295
> > > clone_endio+295
> > > process_one_work+369
> > > worker_thread+635
> > > kthread+229
> > > ret_from_fork+49
> > > ret_from_fork_asm+27
> > >
> > >
> > > For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool
> >
> > io_status is 10(BLK_STS_IOERR), which is produced in submission code path on
> > /dev/dm-77(/dev/lowerVG/lvmPool) first, so looks it is one device mapper issue.
> >
> > The error should be from the following code only:
> >
> > static void __map_bio(struct bio *clone)
> >
> > ...
> > if (r == DM_MAPIO_KILL)
> > dm_io_dec_pending(io, BLK_STS_IOERR);
> > else
> > dm_io_dec_pending(io, BLK_STS_DM_REQUEUE);
> > break;
>
> I agree that the above bpf stack traces for dm-77 indicate that
> dm_submit_bio failed, which would end up in the above branch if the
> target's ->map() returned DM_MAPIO_KILL or DM_MAPIO_REQUEUE.
>
> But such an early failure speaks to the flush bio never being
> submitted to the underlying storage. No?
>
> dm-raid.c:raid_map does return DM_MAPIO_REQUEUE with:
>
> /*
> * If we're reshaping to add disk(s)), ti->len and
> * mddev->array_sectors will differ during the process
> * (ti->len > mddev->array_sectors), so we have to requeue
> * bios with addresses > mddev->array_sectors here or
> * there will occur accesses past EOD of the component
> * data images thus erroring the raid set.
> */
> if (unlikely(bio_end_sector(bio) > mddev->array_sectors))
> return DM_MAPIO_REQUEUE;
>
> But a flush doesn't have an end_sector (it'd be 0 afaik).. so it seems
> weird relative to a flush.

Yeah, I also find the above weird, since DM_MAPIO_REQUEUE is
supposed to work only together with noflush suspend, see
2e93ccc1933d ("[PATCH] dm: suspend: add noflush pushback"), which
you already mentioned.

If that is the reason, maybe the following change can make a
difference:

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 5e41fbae3f6b..07af18baa8dd 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3331,7 +3331,7 @@ static int raid_map(struct dm_target *ti, struct bio *bio)
* there will occur accesses past EOD of the component
* data images thus erroring the raid set.
*/
- if (unlikely(bio_end_sector(bio) > mddev->array_sectors))
+ if (unlikely(bio_has_data(bio) && bio_end_sector(bio) > mddev->array_sectors))
return DM_MAPIO_REQUEUE;

md_handle_request(mddev, bio);
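
For context, bio_has_data() is false for an empty flush precisely
because bi_size is zero, so the EOD check above would simply be
skipped for flushes. Roughly, from include/linux/bio.h (copied from a
recent tree; double-check your version):

	static inline bool bio_has_data(struct bio *bio)
	{
		if (bio &&
		    bio->bi_iter.bi_size &&
		    bio_op(bio) != REQ_OP_DISCARD &&
		    bio_op(bio) != REQ_OP_SECURE_ERASE &&
		    bio_op(bio) != REQ_OP_WRITE_ZEROES)
			return true;

		return false;
	}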


>
> > Patrick, you mentioned lvmPool is raid1, can you explain how lvmPool is
> > built? It is dm-raid1 target or over plain raid1 device which is
> > build over /dev/lowerVG?
>
> In my earlier reply I asked Patrick for both:
> lsblk
> dmsetup table
>
> Picking over the described IO stacks provided earlier (or Goffredo's
> interpretation of it, via ascii art) isn't really a great way to see
> the IO stacks that are in use/question.
>
> > Mike, the logic in the following code doesn't change from v5.18-rc2 to
> > v5.19, but I still can't understand why STS_IOERR is set in
> > dm_io_complete() in case of BLK_STS_DM_REQUEUE && !__noflush_suspending(),
> > since DMF_NOFLUSH_SUSPENDING is only set in __dm_suspend() which
> > is supposed to not happen in Patrick's case.
> >
> > dm_io_complete()
> > ...
> > if (io->status == BLK_STS_DM_REQUEUE) {
> > unsigned long flags;
> > /*
> > * Target requested pushing back the I/O.
> > */
> > spin_lock_irqsave(&md->deferred_lock, flags);
> > if (__noflush_suspending(md) &&
> > !WARN_ON_ONCE(dm_is_zone_write(md, bio))) {
> > /* NOTE early return due to BLK_STS_DM_REQUEUE below */
> > bio_list_add_head(&md->deferred, bio);
> > } else {
> > /*
> > * noflush suspend was interrupted or this is
> > * a write to a zoned target.
> > */
> > io->status = BLK_STS_IOERR;
> > }
> > spin_unlock_irqrestore(&md->deferred_lock, flags);
> > }
>
> Given the reason from dm-raid.c:raid_map returning DM_MAPIO_REQUEUE
> I think the DM device could be suspending without flush.
>
> But regardless, given you logged BLK_STS_IOERR lets assume it isn't,
> the assumption that "noflush suspend was interrupted" seems like a
> stale comment -- especially given that target's like dm-raid are now
> using DM_MAPIO_REQUEUE without concern for the historic tight-coupling
> of noflush suspend (which was always the case for the biggest historic
> reason for this code: dm-multipath, see commit 2e93ccc1933d0 from
> 2006 -- predates my time with developing DM).
>
> So all said, this code seems flawed for dm-raid (and possibly other
> targets that return DM_MAPIO_REQUEUE). I'll look closer this week.

Agree, the change is added since 9dbd1aa3a81c ("dm raid: add reshaping
support to the target"), so loop Heinz in.


Thanks,
Ming


2024-03-10 18:11:29

by Patrick Plenefisch

[permalink] [raw]
Subject: Re: LVM-on-LVM: error while submitting device barriers

On Sun, Mar 10, 2024 at 11:27 AM Mike Snitzer <[email protected]> wrote:
>
> On Sun, Mar 10 2024 at 7:34P -0400,
> Ming Lei <[email protected]> wrote:
>
> > On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote:
> > > On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <[email protected]> wrote:
> > > >
> > > > #!/usr/bin/bpftrace
> > > >
> > > > #ifndef BPFTRACE_HAVE_BTF
> > > > #include <linux/blkdev.h>
> > > > #endif
> > > >
> > > > kprobe:submit_bio_noacct,
> > > > kprobe:submit_bio
> > > > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 /
> > > > {
> > > > $bio = (struct bio *)arg0;
> > > > @submit_stack[arg0] = kstack;
> > > > @tracked[arg0] = 1;
> > > > }
> > > >
> > > > kprobe:bio_endio
> > > > /@tracked[arg0] != 0/
> > > > {
> > > > $bio = (struct bio *)arg0;
> > > >
> > > > if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) {
> > > > return;
> > > > }
> > > >
> > > > if ($bio->bi_status != 0) {
> > > > printf("dev %s bio failed %d, submitter %s completion %s\n",
> > > > $bio->bi_bdev->bd_disk->disk_name,
> > > > $bio->bi_status, @submit_stack[arg0], kstack);
> > > > }
> > > > delete(@submit_stack[arg0]);
> > > > delete(@tracked[arg0]);
> > > > }
> > > >
> > > > END {
> > > > clear(@submit_stack);
> > > > clear(@tracked);
> > > > }
> > > >
> > >
> > > Attaching 4 probes...
> > > dev dm-77 bio failed 10, submitter
> > > submit_bio_noacct+5
> > > __send_duplicate_bios+358
> > > __send_empty_flush+179
> > > dm_submit_bio+857
> > > __submit_bio+132
> > > submit_bio_noacct_nocheck+345
> > > write_all_supers+1718
> > > btrfs_commit_transaction+2342
> > > transaction_kthread+345
> > > kthread+229
> > > ret_from_fork+49
> > > ret_from_fork_asm+27
> > > completion
> > > bio_endio+5
> > > dm_submit_bio+955
> > > __submit_bio+132
> > > submit_bio_noacct_nocheck+345
> > > write_all_supers+1718
> > > btrfs_commit_transaction+2342
> > > transaction_kthread+345
> > > kthread+229
> > > ret_from_fork+49
> > > ret_from_fork_asm+27
> > >
> > > dev dm-86 bio failed 10, submitter
> > > submit_bio_noacct+5
> > > write_all_supers+1718
> > > btrfs_commit_transaction+2342
> > > transaction_kthread+345
> > > kthread+229
> > > ret_from_fork+49
> > > ret_from_fork_asm+27
> > > completion
> > > bio_endio+5
> > > clone_endio+295
> > > clone_endio+295
> > > process_one_work+369
> > > worker_thread+635
> > > kthread+229
> > > ret_from_fork+49
> > > ret_from_fork_asm+27
> > >
> > >
> > > For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool
> >
> > io_status is 10(BLK_STS_IOERR), which is produced in submission code path on
> > /dev/dm-77(/dev/lowerVG/lvmPool) first, so looks it is one device mapper issue.
> >
> > The error should be from the following code only:
> >
> > static void __map_bio(struct bio *clone)
> >
> > ...
> > if (r == DM_MAPIO_KILL)
> > dm_io_dec_pending(io, BLK_STS_IOERR);
> > else
> > dm_io_dec_pending(io, BLK_STS_DM_REQUEUE);
> > break;
>
> I agree that the above bpf stack traces for dm-77 indicate that
> dm_submit_bio failed, which would end up in the above branch if the
> target's ->map() returned DM_MAPIO_KILL or DM_MAPIO_REQUEUE.
>
> But such an early failure speaks to the flush bio never being
> submitted to the underlying storage. No?
>
> dm-raid.c:raid_map does return DM_MAPIO_REQUEUE with:
>
> /*
> * If we're reshaping to add disk(s)), ti->len and
> * mddev->array_sectors will differ during the process
> * (ti->len > mddev->array_sectors), so we have to requeue
> * bios with addresses > mddev->array_sectors here or
> * there will occur accesses past EOD of the component
> * data images thus erroring the raid set.
> */
> if (unlikely(bio_end_sector(bio) > mddev->array_sectors))
> return DM_MAPIO_REQUEUE;
>
> But a flush doesn't have an end_sector (it'd be 0 afaik).. so it seems
> weird relative to a flush.
>
> > Patrick, you mentioned lvmPool is raid1, can you explain how lvmPool is
> > built? It is dm-raid1 target or over plain raid1 device which is
> > build over /dev/lowerVG?

LVM raid1:
lvcreate --type raid1 -m 1 ...

I had previously added raidintegrity and caching (like
"lowerVG/single"), but I removed them while trying to root-cause this
bug.
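
To be concrete, the pool was created with something along these lines
(sizes and names here are illustrative, not copied from my shell
history):

  lvcreate --type raid1 -m 1 -L 3T -n lvmPool lowerVG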

>
> In my earlier reply I asked Patrick for both:
> lsblk
> dmsetup table

Oops, here they are, trimmed for relevance:


NAME
sdb
└─sdb2
├─lowerVG-single_corig_rmeta_1
│ └─lowerVG-single_corig
│ └─lowerVG-single
├─lowerVG-single_corig_rimage_1_imeta
│ └─lowerVG-single_corig_rimage_1
│ └─lowerVG-single_corig
│ └─lowerVG-single
├─lowerVG-single_corig_rimage_1_iorig
│ └─lowerVG-single_corig_rimage_1
│ └─lowerVG-single_corig
│ └─lowerVG-single
├─lowerVG-lvmPool_rmeta_0
│ └─lowerVG-lvmPool
│ ├─lvm-a
│ └─lvm-brokenDisk
├─lowerVG-lvmPool_rimage_0
│ └─lowerVG-lvmPool
│ ├─lvm-a
│ └─lvm-brokenDisk
sdc
└─sdc3
├─lowerVG-single_corig_rmeta_0
│ └─lowerVG-single_corig
│ └─lowerVG-single
├─lowerVG-single_corig_rimage_0_imeta
│ └─lowerVG-single_corig_rimage_0
│ └─lowerVG-single_corig
│ └─lowerVG-single
├─lowerVG-single_corig_rimage_0_iorig
│ └─lowerVG-single_corig_rimage_0
│ └─lowerVG-single_corig
│ └─lowerVG-single
sdd
└─sdd3
├─lowerVG-lvmPool_rmeta_1
│ └─lowerVG-lvmPool
│ ├─lvm-a
│ └─lvm-brokenDisk
└─lowerVG-lvmPool_rimage_1
└─lowerVG-lvmPool
├─lvm-a
└─lvm-brokenDisk
sdf
├─sdf2
│ ├─lowerVG-lvmPool_rimage_1
│ │ └─lowerVG-lvmPool
│ │ ├─lvm-a
│ │ └─lvm-brokenDisk



lowerVG-single: 0 5583462400 cache 254:32 254:31 254:71 128 2
metadata2 writethrough mq 0
lowerVG-singleCache_cvol: 0 104857600 linear 259:13 104859648
lowerVG-singleCache_cvol-cdata: 0 104775680 linear 254:30 81920
lowerVG-singleCache_cvol-cmeta: 0 81920 linear 254:30 0
lowerVG-single_corig: 0 5583462400 raid raid1 3 0 region_size 4096 2
254:33 254:36 254:67 254:70
lowerVG-single_corig_rimage_0: 0 5583462400 integrity 254:35 0 4 J 8
meta_device:254:34 recalculate journal_sectors:130944
interleave_sectors:1 buffer_sectors:128 journal_watermark:50
commit_time:10000 internal_hash:crc32c
lowerVG-single_corig_rimage_0_imeta: 0 44802048 linear 8:35 5152466944
lowerVG-single_corig_rimage_0_iorig: 0 4724465664 linear 8:35 427821056
lowerVG-single_corig_rimage_0_iorig: 4724465664 431005696 linear 8:35 5362001920
lowerVG-single_corig_rimage_0_iorig: 5155471360 427819008 linear 8:35 2048
lowerVG-single_corig_rimage_0_iorig: 5583290368 172032 linear 8:35 5152294912
lowerVG-single_corig_rimage_1: 0 5583462400 integrity 254:69 0 4 J 8
meta_device:254:68 recalculate journal_sectors:130944
interleave_sectors:1 buffer_sectors:128 journal_watermark:50
commit_time:10000 internal_hash:crc32c
lowerVG-single_corig_rimage_1_imeta: 0 44802048 linear 8:18 5583472640
lowerVG-single_corig_rimage_1_iorig: 0 5583462400 linear 8:18 10240
lowerVG-single_corig_rmeta_0: 0 8192 linear 8:35 5152286720
lowerVG-single_corig_rmeta_1: 0 8192 linear 8:18 2048
lowerVG-lvmPool: 0 6442450944 raid raid1 3 0 region_size 4096 2 254:73
254:74 254:75 254:76
lowerVG-lvmPool_rimage_0: 0 2967117824 linear 8:18 5628282880
lowerVG-lvmPool_rimage_0: 2967117824 59875328 linear 8:18 12070733824
lowerVG-lvmPool_rimage_0: 3026993152 3415457792 linear 8:18 8655276032
lowerVG-lvmPool_rimage_1: 0 2862260224 linear 8:51 10240
lowerVG-lvmPool_rimage_1: 2862260224 164732928 linear 8:82 3415459840
lowerVG-lvmPool_rimage_1: 3026993152 3415457792 linear 8:82 2048
lowerVG-lvmPool_rmeta_0: 0 8192 linear 8:18 5628274688
lowerVG-lvmPool_rmeta_1: 0 8192 linear 8:51 2048
lvm-a: 0 1468006400 linear 254:77 1310722048
lvm-brokenDisk: 0 1310720000 linear 254:77 2048
lvm-brokenDisk: 1310720000 83886080 linear 254:77 2778728448
lvm-brokenDisk: 1394606080 2015404032 linear 254:77 4427040768
lvm-brokenDisk: 3410010112 884957184 linear 254:77 2883586048



As a side note, is there a way to make lsblk only show things the
first time they come up?

>
> Picking over the described IO stacks provided earlier (or Goffredo's
> interpretation of it, via ascii art) isn't really a great way to see
> the IO stacks that are in use/question.
>
> > Mike, the logic in the following code doesn't change from v5.18-rc2 to
> > v5.19, but I still can't understand why STS_IOERR is set in
> > dm_io_complete() in case of BLK_STS_DM_REQUEUE && !__noflush_suspending(),
> > since DMF_NOFLUSH_SUSPENDING is only set in __dm_suspend() which
> > is supposed to not happen in Patrick's case.
> >
> > dm_io_complete()
> > ...
> > if (io->status == BLK_STS_DM_REQUEUE) {
> > unsigned long flags;
> > /*
> > * Target requested pushing back the I/O.
> > */
> > spin_lock_irqsave(&md->deferred_lock, flags);
> > if (__noflush_suspending(md) &&
> > !WARN_ON_ONCE(dm_is_zone_write(md, bio))) {
> > /* NOTE early return due to BLK_STS_DM_REQUEUE below */
> > bio_list_add_head(&md->deferred, bio);
> > } else {
> > /*
> > * noflush suspend was interrupted or this is
> > * a write to a zoned target.
> > */
> > io->status = BLK_STS_IOERR;
> > }
> > spin_unlock_irqrestore(&md->deferred_lock, flags);
> > }
>
> Given the reason from dm-raid.c:raid_map returning DM_MAPIO_REQUEUE
> I think the DM device could be suspending without flush.
>
> But regardless, given you logged BLK_STS_IOERR lets assume it isn't,
> the assumption that "noflush suspend was interrupted" seems like a
> stale comment -- especially given that target's like dm-raid are now
> using DM_MAPIO_REQUEUE without concern for the historic tight-coupling
> of noflush suspend (which was always the case for the biggest historic
> reason for this code: dm-multipath, see commit 2e93ccc1933d0 from
> 2006 -- predates my time with developing DM).
>
> So all said, this code seems flawed for dm-raid (and possibly other
> targets that return DM_MAPIO_REQUEUE). I'll look closer this week.
>
> Mike

2024-03-11 13:13:59

by Ming Lei

[permalink] [raw]
Subject: Re: LVM-on-LVM: error while submitting device barriers

On Sun, Mar 10, 2024 at 02:11:11PM -0400, Patrick Plenefisch wrote:
> On Sun, Mar 10, 2024 at 11:27 AM Mike Snitzer <[email protected]> wrote:
> >
> > On Sun, Mar 10 2024 at 7:34P -0400,
> > Ming Lei <[email protected]> wrote:
> >
> > > On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote:
> > > > On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <[email protected]> wrote:
> > > > >
> > > > > #!/usr/bin/bpftrace
> > > > >
> > > > > #ifndef BPFTRACE_HAVE_BTF
> > > > > #include <linux/blkdev.h>
> > > > > #endif
> > > > >
> > > > > kprobe:submit_bio_noacct,
> > > > > kprobe:submit_bio
> > > > > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 /
> > > > > {
> > > > > $bio = (struct bio *)arg0;
> > > > > @submit_stack[arg0] = kstack;
> > > > > @tracked[arg0] = 1;
> > > > > }
> > > > >
> > > > > kprobe:bio_endio
> > > > > /@tracked[arg0] != 0/
> > > > > {
> > > > > $bio = (struct bio *)arg0;
> > > > >
> > > > > if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) {
> > > > > return;
> > > > > }
> > > > >
> > > > > if ($bio->bi_status != 0) {
> > > > > printf("dev %s bio failed %d, submitter %s completion %s\n",
> > > > > $bio->bi_bdev->bd_disk->disk_name,
> > > > > $bio->bi_status, @submit_stack[arg0], kstack);
> > > > > }
> > > > > delete(@submit_stack[arg0]);
> > > > > delete(@tracked[arg0]);
> > > > > }
> > > > >
> > > > > END {
> > > > > clear(@submit_stack);
> > > > > clear(@tracked);
> > > > > }
> > > > >
> > > >
> > > > Attaching 4 probes...
> > > > dev dm-77 bio failed 10, submitter
> > > > submit_bio_noacct+5
> > > > __send_duplicate_bios+358
> > > > __send_empty_flush+179
> > > > dm_submit_bio+857
> > > > __submit_bio+132
> > > > submit_bio_noacct_nocheck+345
> > > > write_all_supers+1718
> > > > btrfs_commit_transaction+2342
> > > > transaction_kthread+345
> > > > kthread+229
> > > > ret_from_fork+49
> > > > ret_from_fork_asm+27
> > > > completion
> > > > bio_endio+5
> > > > dm_submit_bio+955
> > > > __submit_bio+132
> > > > submit_bio_noacct_nocheck+345
> > > > write_all_supers+1718
> > > > btrfs_commit_transaction+2342
> > > > transaction_kthread+345
> > > > kthread+229
> > > > ret_from_fork+49
> > > > ret_from_fork_asm+27
> > > >
> > > > dev dm-86 bio failed 10, submitter
> > > > submit_bio_noacct+5
> > > > write_all_supers+1718
> > > > btrfs_commit_transaction+2342
> > > > transaction_kthread+345
> > > > kthread+229
> > > > ret_from_fork+49
> > > > ret_from_fork_asm+27
> > > > completion
> > > > bio_endio+5
> > > > clone_endio+295
> > > > clone_endio+295
> > > > process_one_work+369
> > > > worker_thread+635
> > > > kthread+229
> > > > ret_from_fork+49
> > > > ret_from_fork_asm+27
> > > >
> > > >
> > > > For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool
> > >
> > > io_status is 10(BLK_STS_IOERR), which is produced in submission code path on
> > > /dev/dm-77(/dev/lowerVG/lvmPool) first, so looks it is one device mapper issue.
> > >
> > > The error should be from the following code only:
> > >
> > > static void __map_bio(struct bio *clone)
> > >
> > > ...
> > > if (r == DM_MAPIO_KILL)
> > > dm_io_dec_pending(io, BLK_STS_IOERR);
> > > else
> > > dm_io_dec_pending(io, BLK_STS_DM_REQUEUE);
> > > break;
> >
> > I agree that the above bpf stack traces for dm-77 indicate that
> > dm_submit_bio failed, which would end up in the above branch if the
> > target's ->map() returned DM_MAPIO_KILL or DM_MAPIO_REQUEUE.
> >
> > But such an early failure speaks to the flush bio never being
> > submitted to the underlying storage. No?
> >
> > dm-raid.c:raid_map does return DM_MAPIO_REQUEUE with:
> >
> > /*
> > * If we're reshaping to add disk(s)), ti->len and
> > * mddev->array_sectors will differ during the process
> > * (ti->len > mddev->array_sectors), so we have to requeue
> > * bios with addresses > mddev->array_sectors here or
> > * there will occur accesses past EOD of the component
> > * data images thus erroring the raid set.
> > */
> > if (unlikely(bio_end_sector(bio) > mddev->array_sectors))
> > return DM_MAPIO_REQUEUE;
> >
> > But a flush doesn't have an end_sector (it'd be 0 afaik).. so it seems
> > weird relative to a flush.
> >
> > > Patrick, you mentioned lvmPool is raid1, can you explain how lvmPool is
> > > built? It is dm-raid1 target or over plain raid1 device which is
> > > build over /dev/lowerVG?
>
> LVM raid1:
> lvcreate --type raid1 -m 1 ...

OK, that is the reason, as Mike mentioned.

dm-raid.c:raid_map returns DM_MAPIO_REQUEUE, which is translated into
BLK_STS_IOERR in dm_io_complete().

The empty flush bio is sent from btrfs with both .bi_size and .bi_sector
set to zero, but the top dm device is linear, so linear_map() remaps
bio->bi_iter.bi_sector; the remapped bio is then sent to dm-raid
(raid_map()), which returns DM_MAPIO_REQUEUE.

The one-line patch I sent in my last email should solve this issue.

https://lore.kernel.org/dm-devel/[email protected]/T/#m8fce3ecb2f98370b7d7ce8db6714bbf644af5459

But the DM_MAPIO_REQUEUE misuse needs a closer look, and I believe Mike
is working on that bigger problem.

I guess most dm targets don't deal with empty (flush) bios well -- at
least linear & dm-raid; I haven't looked into the others yet, :-(


Thanks,
Ming


2024-03-12 22:56:17

by Patrick Plenefisch

[permalink] [raw]
Subject: Re: LVM-on-LVM: error while submitting device barriers

On Mon, Mar 11, 2024 at 9:13 AM Ming Lei <[email protected]> wrote:
>
> On Sun, Mar 10, 2024 at 02:11:11PM -0400, Patrick Plenefisch wrote:
> > On Sun, Mar 10, 2024 at 11:27 AM Mike Snitzer <[email protected]> wrote:
> > >
> > > On Sun, Mar 10 2024 at 7:34P -0400,
> > > Ming Lei <[email protected]> wrote:
> > >
> > > > On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote:
> > > > > On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <[email protected]> wrote:
> > > > > >
> > > > > > #!/usr/bin/bpftrace
> > > > > >
> > > > > > #ifndef BPFTRACE_HAVE_BTF
> > > > > > #include <linux/blkdev.h>
> > > > > > #endif
> > > > > >
> > > > > > kprobe:submit_bio_noacct,
> > > > > > kprobe:submit_bio
> > > > > > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 /
> > > > > > {
> > > > > > $bio = (struct bio *)arg0;
> > > > > > @submit_stack[arg0] = kstack;
> > > > > > @tracked[arg0] = 1;
> > > > > > }
> > > > > >
> > > > > > kprobe:bio_endio
> > > > > > /@tracked[arg0] != 0/
> > > > > > {
> > > > > > $bio = (struct bio *)arg0;
> > > > > >
> > > > > > if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) {
> > > > > > return;
> > > > > > }
> > > > > >
> > > > > > if ($bio->bi_status != 0) {
> > > > > > printf("dev %s bio failed %d, submitter %s completion %s\n",
> > > > > > $bio->bi_bdev->bd_disk->disk_name,
> > > > > > $bio->bi_status, @submit_stack[arg0], kstack);
> > > > > > }
> > > > > > delete(@submit_stack[arg0]);
> > > > > > delete(@tracked[arg0]);
> > > > > > }
> > > > > >
> > > > > > END {
> > > > > > clear(@submit_stack);
> > > > > > clear(@tracked);
> > > > > > }
> > > > > >
> > > > >
> > > > > Attaching 4 probes...
> > > > > dev dm-77 bio failed 10, submitter
> > > > > submit_bio_noacct+5
> > > > > __send_duplicate_bios+358
> > > > > __send_empty_flush+179
> > > > > dm_submit_bio+857
> > > > > __submit_bio+132
> > > > > submit_bio_noacct_nocheck+345
> > > > > write_all_supers+1718
> > > > > btrfs_commit_transaction+2342
> > > > > transaction_kthread+345
> > > > > kthread+229
> > > > > ret_from_fork+49
> > > > > ret_from_fork_asm+27
> > > > > completion
> > > > > bio_endio+5
> > > > > dm_submit_bio+955
> > > > > __submit_bio+132
> > > > > submit_bio_noacct_nocheck+345
> > > > > write_all_supers+1718
> > > > > btrfs_commit_transaction+2342
> > > > > transaction_kthread+345
> > > > > kthread+229
> > > > > ret_from_fork+49
> > > > > ret_from_fork_asm+27
> > > > >
> > > > > dev dm-86 bio failed 10, submitter
> > > > > submit_bio_noacct+5
> > > > > write_all_supers+1718
> > > > > btrfs_commit_transaction+2342
> > > > > transaction_kthread+345
> > > > > kthread+229
> > > > > ret_from_fork+49
> > > > > ret_from_fork_asm+27
> > > > > completion
> > > > > bio_endio+5
> > > > > clone_endio+295
> > > > > clone_endio+295
> > > > > process_one_work+369
> > > > > worker_thread+635
> > > > > kthread+229
> > > > > ret_from_fork+49
> > > > > ret_from_fork_asm+27
> > > > >
> > > > >
> > > > > For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool
> > > >
> > > > io_status is 10(BLK_STS_IOERR), which is produced in submission code path on
> > > > /dev/dm-77(/dev/lowerVG/lvmPool) first, so looks it is one device mapper issue.
> > > >
> > > > The error should be from the following code only:
> > > >
> > > > static void __map_bio(struct bio *clone)
> > > >
> > > > ...
> > > > if (r == DM_MAPIO_KILL)
> > > > dm_io_dec_pending(io, BLK_STS_IOERR);
> > > > else
> > > > dm_io_dec_pending(io, BLK_STS_DM_REQUEUE);
> > > > break;
> > >
> > > I agree that the above bpf stack traces for dm-77 indicate that
> > > dm_submit_bio failed, which would end up in the above branch if the
> > > target's ->map() returned DM_MAPIO_KILL or DM_MAPIO_REQUEUE.
> > >
> > > But such an early failure speaks to the flush bio never being
> > > submitted to the underlying storage. No?
> > >
> > > dm-raid.c:raid_map does return DM_MAPIO_REQUEUE with:
> > >
> > > /*
> > > * If we're reshaping to add disk(s)), ti->len and
> > > * mddev->array_sectors will differ during the process
> > > * (ti->len > mddev->array_sectors), so we have to requeue
> > > * bios with addresses > mddev->array_sectors here or
> > > * there will occur accesses past EOD of the component
> > > * data images thus erroring the raid set.
> > > */
> > > if (unlikely(bio_end_sector(bio) > mddev->array_sectors))
> > > return DM_MAPIO_REQUEUE;
> > >
> > > But a flush doesn't have an end_sector (it'd be 0 afaik).. so it seems
> > > weird relative to a flush.
> > >
> > > > Patrick, you mentioned lvmPool is raid1, can you explain how lvmPool is
> > > > built? It is dm-raid1 target or over plain raid1 device which is
> > > > build over /dev/lowerVG?
> >
> > LVM raid1:
> > lvcreate --type raid1 -m 1 ...
>
> OK, that is the reason, as Mike mentioned.
>
> dm-raid.c:raid_map returns DM_MAPIO_REQUEUE, which is translated into
> BLK_STS_IOERR in dm_io_complete().
>
> Empty flush bio is sent from btrfs, both .bi_size and .bi_sector are set
> as zero, but the top dm is linear, which(linear_map()) maps new
> bio->bi_iter.bi_sector, and the mapped bio is sent to dm-raid(raid_map()),
> then DM_MAPIO_REQUEUE is returned.
>
> The one-line patch I sent in last email should solve this issue.
>
> https://lore.kernel.org/dm-devel/[email protected]/T/#m8fce3ecb2f98370b7d7ce8db6714bbf644af5459

With this patch on a 6.6.13 base, I can modify files and the BTRFS
volume stays RW, and no errors are logged in dmesg!


>
> But DM_MAPIO_REQUEUE misuse needs close look, and I believe Mike is working
> on that bigger problem.
>
> I guess most of dm targets don't deal with empty bio well, at least
> linear & dm-raid, not look into others yet, :-(
>
>
> Thanks,
> Ming
>