2024-05-18 02:27:22

by Theodore Ts'o

[permalink] [raw]
Subject: [REGRESSION] dm: use queue_limits_set

#regzbot introduced: 1c0e720228ad

While doing final regression testing before sending a pull request for
the ext4 tree, I found a regression which was triggered by generic/347
and generic/405 on multiple fstests configurations, including
both ext4/4k and xfs/4k.

It bisects cleanly to commit 1c0e720228ad ("dm: use
queue_limits_set"), and the resulting WARNING is attached below. This
stack trace can be seen for both generic/347 and generic/405. And if
I revert this commit on top of linux-next, the failure goes away, so
it pretty clearly root causes to 1c0e720228ad.

For now, I'll add generic/347 and generic/405 to my global exclude
file, but maybe we should consider reverting the commit if it can't be
fixed quickly?

Thanks,

- Ted

% git show 1c0e720228ad
commit 1c0e720228ad1c63bb487cdcead2558353b5a067
Author: Christoph Hellwig <[email protected]>
Date: Wed Feb 28 14:56:42 2024 -0800

dm: use queue_limits_set

Use queue_limits_set which validates the limits and takes care of
updating the readahead settings instead of directly assigning them to
the queue. For that make sure all limits are actually updated before
the assignment.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
...

% git checkout 1c0e720228ad
% install-kconfig ; kbuild
% kvm-xfstests -c xfs/4k generic/347

BEGIN TEST 4k (1 test): XFS 4k block Fri May 17 22:03:06 EDT 2024
DEVICE: /dev/vdd
XFS_MKFS_OPTIONS: -bsize=4096
XFS_MOUNT_OPTIONS:
FSTYP -- xfs (debug)
PLATFORM -- Linux/x86_64 kvm-xfstests 6.9.0-rc2-xfstests-00006-g1c0e720228ad #335 SMP PREEMPT_DYNAMIC Fri May 17 22:02:37 EDT 2024
MKFS_OPTIONS -- -f -bsize=4096 /dev/vdc
MOUNT_OPTIONS -- /dev/vdc /vdc

[ 3.179146] XFS (vdd): EXPERIMENTAL online scrub feature in use. Use at your own risk!
generic/347 65s ... [22:03:06][ 3.389241] run fstests generic/347 at 2024-05-17 22:03:06
[ 4.032221] ------------[ cut here ]------------
[ 4.033087] WARNING: CPU: 1 PID: 30 at drivers/md/dm-bio-prison-v1.c:128 dm_cell_key_has_valid_range+0x3d/0x50
[ 4.034871] CPU: 1 PID: 30 Comm: kworker/u8:1 Not tainted 6.9.0-rc2-xfstests-00006-g1c0e720228ad #335
[ 4.035440] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 4.035829] Workqueue: dm-thin do_worker
[ 4.035998] RIP: 0010:dm_cell_key_has_valid_range+0x3d/0x50
[ 4.036236] Code: c1 48 29 d1 48 81 f9 00 04 00 00 77 1d 48 83 e8 01 48 c1 ea 0a b9 01 00 00 00 48 c1 e8 0a 48 39 c2 75 12 89 c8 c3 cc cc cc cc <0f> 0b 31 c9 89 c8 c3 cc cc cc cc 0f 0b 31 c9 eb e8 66 90 90 90 90
[ 4.037024] RSP: 0018:ffffc90000107d18 EFLAGS: 00010206
[ 4.037249] RAX: 0000000000000fff RBX: 0000000000000000 RCX: 0000000000000fff
[ 4.037556] RDX: 0000000000000000 RSI: 0000000000000fff RDI: ffffc90000107d28
[ 4.037923] RBP: ffff888009aa7b40 R08: ffff888009aa7bc0 R09: ffff888003e7b980
[ 4.038247] R10: 0000000000000008 R11: fefefefefefefeff R12: ffff8880097ca5b8
[ 4.038545] R13: 0000000000000fff R14: ffff8880097ca5b8 R15: ffff88800f4fc400
[ 4.038843] FS: 0000000000000000(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
[ 4.039228] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4.039472] CR2: 0000564ad92d7c00 CR3: 000000000f73e006 CR4: 0000000000770ef0
[ 4.039774] PKRU: 55555554
[ 4.039892] Call Trace:
[ 4.040000] <TASK>
[ 4.040094] ? __warn+0x7b/0x120
[ 4.040235] ? dm_cell_key_has_valid_range+0x3d/0x50
[ 4.040473] ? report_bug+0x174/0x1c0
[ 4.040635] ? handle_bug+0x3a/0x70
[ 4.040789] ? exc_invalid_op+0x17/0x70
[ 4.040958] ? asm_exc_invalid_op+0x1a/0x20
[ 4.041163] ? dm_cell_key_has_valid_range+0x3d/0x50
[ 4.041397] process_discard_bio+0xba/0x190
[ 4.041597] process_thin_deferred_bios+0x290/0x3e0
[ 4.041806] process_deferred_bios+0xba/0x2e0
[ 4.041994] do_worker+0xf5/0x160
[ 4.042139] process_one_work+0x18a/0x3d0
[ 4.042311] worker_thread+0x285/0x3a0
[ 4.042470] ? __pfx_worker_thread+0x10/0x10
[ 4.042655] kthread+0xdd/0x110
[ 4.042795] ? __pfx_kthread+0x10/0x10
[ 4.042979] ret_from_fork+0x31/0x50
[ 4.043153] ? __pfx_kthread+0x10/0x10
[ 4.043317] ret_from_fork_asm+0x1a/0x30
[ 4.043490] </TASK>
[ 4.043586] ---[ end trace 0000000000000000 ]---
[ 4.043784] device-mapper: thin: Discard doesn't respect bio prison limits
[ 4.044142] device-mapper: thin: Discard doesn't respect bio prison limits
[ 4.044585] device-mapper: thin: Discard doesn't respect bio prison limits
[ 6.410678] device-mapper: thin: 253:2: reached low water mark for data device: sending event.


2024-05-19 05:05:54

by Mike Snitzer

[permalink] [raw]
Subject: Re: dm: use queue_limits_set

Hi Ted,

On Fri, May 17, 2024 at 10:26:46PM -0400, Theodore Ts'o wrote:
> #regzbot introduced: 1c0e720228ad
>
> While doing final regression testing before sending a pull request for
> the ext4 tree, I found a regression which was triggered by generic/347
> and generic/405 on multiple fstests configurations, including
> both ext4/4k and xfs/4k.
>
> It bisects cleanly to commit 1c0e720228ad ("dm: use
> queue_limits_set"), and the resulting WARNING is attached below. This
> stack trace can be seen for both generic/347 and generic/405. And if
> I revert this commit on top of linux-next, the failure goes away, so
> it pretty clearly root causes to 1c0e720228ad.
>
> For now, I'll add generic/347 and generic/405 to my global exclude
> file, but maybe we should consider reverting the commit if it can't be
> fixed quickly?

Commit 1c0e720228ad is a red herring; it switches DM over to using
queue_limits_set() which I now see is clearly disregarding DM's desire
to disable discards (in blk_validate_limits).

It looks like the combo of commit d690cb8ae14bd ("block: add an API to
atomically update queue limits") and 4f563a64732da ("block: add a
max_user_discard_sectors queue limit") needs fixing.

This being one potential fix from code inspection I've done to this
point, please see if it resolves your fstests failures (but I haven't
actually looked at those fstests yet _and_ I still need to review
commits d690cb8ae14bd and 4f563a64732da further -- will do on Monday,
sorry for the trouble):

diff --git a/block/blk-settings.c b/block/blk-settings.c
index cdbaef159c4b..c442f7ec3a6b 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -165,11 +165,13 @@ static int blk_validate_limits(struct queue_limits *lim)
lim->max_discard_sectors =
min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors);

- if (!lim->max_discard_segments)
- lim->max_discard_segments = 1;
+ if (lim->max_discard_sectors) {
+ if (!lim->max_discard_segments)
+ lim->max_discard_segments = 1;

- if (lim->discard_granularity < lim->physical_block_size)
- lim->discard_granularity = lim->physical_block_size;
+ if (lim->discard_granularity < lim->physical_block_size)
+ lim->discard_granularity = lim->physical_block_size;
+ }

/*
* By default there is no limit on the segment boundary alignment,
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 88114719fe18..e647e1bcd50c 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1969,6 +1969,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
blk_queue_flag_clear(QUEUE_FLAG_NOWAIT, q);

if (!dm_table_supports_discards(t)) {
+ limits->max_user_discard_sectors = 0;
limits->max_hw_discard_sectors = 0;
limits->discard_granularity = 0;
limits->discard_alignment = 0;
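
For context on why the dm-table.c hunk above zeroes both limits: blk_validate_limits()
now derives the effective max_discard_sectors as min(max_hw_discard_sectors,
max_user_discard_sectors), so a target that only zeroes max_discard_sectors (the old
pattern) has its value recomputed during validation. Below is a minimal userspace model
of that clamp -- not the kernel code itself; the struct, helper, and numbers are
illustrative only:

#include <limits.h>
#include <stdio.h>

/* toy stand-in for the discard-related fields of struct queue_limits */
struct limits_model {
	unsigned int max_hw_discard_sectors;
	unsigned int max_user_discard_sectors;
	unsigned int max_discard_sectors;
};

/* models the clamp blk_validate_limits() applies */
static void validate(struct limits_model *lim)
{
	lim->max_discard_sectors =
		lim->max_hw_discard_sectors < lim->max_user_discard_sectors ?
		lim->max_hw_discard_sectors : lim->max_user_discard_sectors;
}

int main(void)
{
	/* stacking default: the user limit starts out "unlimited" */
	struct limits_model off = { .max_hw_discard_sectors = 0,
				    .max_user_discard_sectors = UINT_MAX };
	struct limits_model on  = { .max_hw_discard_sectors = 8388608, /* hypothetical device */
				    .max_user_discard_sectors = UINT_MAX };

	validate(&off);
	validate(&on);
	printf("hw=0       -> max_discard_sectors=%u (discards stay disabled)\n",
	       off.max_discard_sectors);
	printf("hw=8388608 -> max_discard_sectors=%u (capability passes through)\n",
	       on.max_discard_sectors);
	return 0;
}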

2024-05-19 05:42:12

by Mike Snitzer

[permalink] [raw]
Subject: Re: dm: use queue_limits_set

On Sun, May 19, 2024 at 01:05:40AM -0400, Mike Snitzer wrote:
> Hi Ted,
>
> On Fri, May 17, 2024 at 10:26:46PM -0400, Theodore Ts'o wrote:
> > #regzbot introduced: 1c0e720228ad
> >
> > While doing final regression testing before sending a pull request for
> > the ext4 tree, I found a regression which was triggered by generic/347
> > and generic/405 on multiple fstests configurations, including
> > both ext4/4k and xfs/4k.
> >
> > It bisects cleanly to commit 1c0e720228ad ("dm: use
> > queue_limits_set"), and the resulting WARNING is attached below. This
> > stack trace can be seen for both generic/347 and generic/405. And if
> > I revert this commit on top of linux-next, the failure goes away, so
> > it pretty clearly root causes to 1c0e720228ad.
> >
> > For now, I'll add generic/347 and generic/405 to my global exclude
> > file, but maybe we should consider reverting the commit if it can't be
> > fixed quickly?
>
> Commit 1c0e720228ad is a red herring; it switches DM over to using
> queue_limits_set() which I now see is clearly disregarding DM's desire
> to disable discards (in blk_validate_limits).
>
> It looks like the combo of commit d690cb8ae14bd ("block: add an API to
> atomically update queue limits") and 4f563a64732da ("block: add a
> max_user_discard_sectors queue limit") needs fixing.
>
> This being one potential fix from code inspection I've done to this
> point, please see if it resolves your fstests failures (but I haven't
> actually looked at those fstests yet _and_ I still need to review
> commits d690cb8ae14bd and 4f563a64732da further -- will do on Monday,
> sorry for the trouble):

I looked early, this is needed (max_user_discard_sectors makes discard
limits stacking suck more than it already did -- imho 4f563a64732da is
worthy of revert. Short of that, dm-cache-target.c and possibly other
DM targets will need fixes too -- I'll go over it all Monday):

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 4793ad2aa1f7..c196f39579af 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -4099,8 +4099,10 @@ static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)

if (pt->adjusted_pf.discard_enabled) {
disable_discard_passdown_if_not_supported(pt);
- if (!pt->adjusted_pf.discard_passdown)
- limits->max_discard_sectors = 0;
+ if (!pt->adjusted_pf.discard_passdown) {
+ limits->max_hw_discard_sectors = 0;
+ limits->max_user_discard_sectors = 0;
+ }
/*
* The pool uses the same discard limits as the underlying data
* device. DM core has already set this up.
@@ -4497,7 +4499,8 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)

if (pool->pf.discard_enabled) {
limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
- limits->max_discard_sectors = pool->sectors_per_block * BIO_PRISON_MAX_RANGE;
+ limits->max_hw_discard_sectors = limits->max_user_discard_sectors =
+ pool->sectors_per_block * BIO_PRISON_MAX_RANGE;
}
}




2024-05-20 15:12:04

by Christoph Hellwig

[permalink] [raw]
Subject: Re: dm: use queue_limits_set

On Sun, May 19, 2024 at 01:42:00AM -0400, Mike Snitzer wrote:
> > This being one potential fix from code inspection I've done to this
> > point, please see if it resolves your fstests failures (but I haven't
> > actually looked at those fstests yet _and_ I still need to review
> > commits d690cb8ae14bd and 4f563a64732da further -- will do on Monday,
> > sorry for the trouble):
>
> I looked early, this is needed (max_user_discard_sectors makes discard
> limits stacking suck more than it already did -- imho 4f563a64732da is
> worthy of revert.

Can you explain why? This actually makes the original addition of the
user-space controlled max discard limit work. Now I'm a bit doubtful
that allowing this control was a good idea, but that ship unfortunately
has sailed.

> Short of that, dm-cache-target.c and possibly other
> DM targets will need fixes too -- I'll go over it all Monday):
>
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 4793ad2aa1f7..c196f39579af 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -4099,8 +4099,10 @@ static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
>
> if (pt->adjusted_pf.discard_enabled) {
> disable_discard_passdown_if_not_supported(pt);
> - if (!pt->adjusted_pf.discard_passdown)
> - limits->max_discard_sectors = 0;
> + if (!pt->adjusted_pf.discard_passdown) {
> + limits->max_hw_discard_sectors = 0;
> + limits->max_user_discard_sectors = 0;
> + }

I think the main problem here is that dm targets adjust
max_discard_sectors directly instead of adjusting max_hw_discard_sectors.
In other words, we need to switch all places dm targets set
max_discard_sectors to use max_hw_discard_sectors instead. They should
not touch max_user_discard_sectors ever.

This is probably my fault, I actually found this right at the time
of the original revert of switching dm to the limits API, and then
let it slip as the patch was reverted. The fact that you readded
the commit somehow went past my attention window.
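
A sketch of that pattern for a hypothetical target's io_hints hook (example_io_hints
and the EXAMPLE_* constants are illustrative placeholders, not code from an actual
target):

static void example_io_hints(struct dm_target *ti, struct queue_limits *limits)
{
	/* old pattern: limits->max_discard_sectors = EXAMPLE_MAX_SECTORS; */

	/* advertise the target's discard capability via the hw limit */
	limits->max_hw_discard_sectors = EXAMPLE_MAX_SECTORS;
	limits->discard_granularity = EXAMPLE_GRANULARITY_BYTES;

	/*
	 * max_user_discard_sectors is left alone: it is user policy, and the
	 * block layer derives the effective max_discard_sectors as
	 * min(max_hw_discard_sectors, max_user_discard_sectors).
	 */
}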


2024-05-20 15:39:32

by Mike Snitzer

[permalink] [raw]
Subject: Re: dm: use queue_limits_set

On Mon, May 20, 2024 at 05:06:53PM +0200, Christoph Hellwig wrote:
> On Sun, May 19, 2024 at 01:42:00AM -0400, Mike Snitzer wrote:
> > > This being one potential fix from code inspection I've done to this
> > > point, please see if it resolves your fstests failures (but I haven't
> > > actually looked at those fstests yet _and_ I still need to review
> > > commits d690cb8ae14bd and 4f563a64732da further -- will do on Monday,
> > > sorry for the trouble):
> >
> > I looked early, this is needed (max_user_discard_sectors makes discard
> > limits stacking suck more than it already did -- imho 4f563a64732da is
> > worthy of revert.
>
> Can you explain why? This actually makes the original addition of the
> user-space controlled max discard limit work. Now I'm a bit doubtful
> that allowing this control was a good idea, but that ship unfortunately
> has sailed.

That's fair. My criticism was more about having to fix up DM targets
to cope with the new normal of max_discard_sectors being set as a
function of max_hw_discard_sectors and max_user_discard_sectors.

With stacked devices in particular it is _very_ hard for the user to
know whether their exerting control over a max discard limit is correct.

> > Short of that, dm-cache-target.c and possibly other
> > DM targets will need fixes too -- I'll go over it all Monday):
> >
> > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > index 4793ad2aa1f7..c196f39579af 100644
> > --- a/drivers/md/dm-thin.c
> > +++ b/drivers/md/dm-thin.c
> > @@ -4099,8 +4099,10 @@ static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
> >
> > if (pt->adjusted_pf.discard_enabled) {
> > disable_discard_passdown_if_not_supported(pt);
> > - if (!pt->adjusted_pf.discard_passdown)
> > - limits->max_discard_sectors = 0;
> > + if (!pt->adjusted_pf.discard_passdown) {
> > + limits->max_hw_discard_sectors = 0;
> > + limits->max_user_discard_sectors = 0;
> > + }
>
> I think the main problem here is that dm targets adjust
> max_discard_sectors directly instead of adjusting max_hw_discard_sectors.
> In other words, we need to switch all places dm targets set
> max_discard_sectors to use max_hw_discard_sectors instead. They should
> not touch max_user_discard_sectors ever.

Yeah, but my concern is that if a user sets a value that is too low
it'll break targets like DM thinp (which Ted reported). So forcibly
setting both to indirectly set the required max_discard_sectors seems
necessary.

> This is probably my fault, I actually found this right at the time
> of the original revert of switching dm to the limits API, and then
> let it slip as the patch was reverted. The fact that you readded
> the commit somehow went past my attention window.

It's fine, all we can do now is work through how best to fix it. Open
to suggestions. But this next hunk, which you trimmed in your reply,
_seems_ needed to actually fix the issue Ted reported -- given the
current validate method in blk-settings.c (resharing here to just
continue this thread in a natural way):

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 4793ad2aa1f7..c196f39579af 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -4497,7 +4499,8 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)

if (pool->pf.discard_enabled) {
limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
- limits->max_discard_sectors = pool->sectors_per_block * BIO_PRISON_MAX_RANGE;
+ limits->max_hw_discard_sectors = limits->max_user_discard_sectors =
+ pool->sectors_per_block * BIO_PRISON_MAX_RANGE;
}
}



2024-05-20 15:45:07

by Christoph Hellwig

[permalink] [raw]
Subject: Re: dm: use queue_limits_set

On Mon, May 20, 2024 at 11:39:14AM -0400, Mike Snitzer wrote:
> That's fair. My criticism was more about having to fix up DM targets
> to cope with the new normal of max_discard_sectors being set as a
> function of max_hw_discard_sectors and max_user_discard_sectors.
>
> With stacked devices in particular it is _very_ hard for the user to
> know whether their exerting control over a max discard limit is correct.

The user forcing a limit is always very sketchy, which is why I'm
not a fan of it.

> Yeah, but my concern is that if a user sets a value that is too low
> it'll break targets like DM thinp (which Ted reported). So forcibly
> setting both to indirectly set the required max_discard_sectors seems
> necessary.

dm-thinp requiring a minimum discard size is a rather odd requirement.
Is this just a debug assert, or is there a real technical reason
for it? If so we can introduce a knob to force a minimum size or
disable the user setting the value entirely.

> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 4793ad2aa1f7..c196f39579af 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -4497,7 +4499,8 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
>
> if (pool->pf.discard_enabled) {
> limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
> - limits->max_discard_sectors = pool->sectors_per_block * BIO_PRISON_MAX_RANGE;
> + limits->max_hw_discard_sectors = limits->max_user_discard_sectors =
> + pool->sectors_per_block * BIO_PRISON_MAX_RANGE;
> }

Drivers really have no business setting max_user_discard_sectors;
the whole point of the field is to separate device/driver capabilities
from user policy. So if dm-thinp really has no way of handling
smaller discards, we need to ensure they can't be set.

I'm also kinda curious what actually sets a user limit in Ted's case
as that feels weird.

2024-05-20 15:48:38

by Mike Snitzer

[permalink] [raw]
Subject: Re: dm: use queue_limits_set

On Mon, May 20, 2024 at 11:39:14AM -0400, Mike Snitzer wrote:
> On Mon, May 20, 2024 at 05:06:53PM +0200, Christoph Hellwig wrote:
>
> > This is probably my fault, I actually found this right at the time
> > of the original revert of switching dm to the limits API, and then
> > let it slip as the patch was reverted. The fact that you readded
> > the commit somehow went past my attention window.
>
> It's fine, all we can do now is work through how best to fix it. Open
> to suggestions. But this next hunk, which you trimmed in your reply,
> _seems_ needed to actually fix the issue Ted reported -- given the
> current validate method in blk-settings.c (resharing here to just
> continue this thread in a natural way):
>
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 4793ad2aa1f7..c196f39579af 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -4497,7 +4499,8 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
>
> if (pool->pf.discard_enabled) {
> limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
> - limits->max_discard_sectors = pool->sectors_per_block * BIO_PRISON_MAX_RANGE;
> + limits->max_hw_discard_sectors = limits->max_user_discard_sectors =
> + pool->sectors_per_block * BIO_PRISON_MAX_RANGE;
> }
> }
>
>

Maybe update blk_validate_limits() to ensure max_discard_sectors is a
multiple of discard_granularity?

That way thin_io_hints() (and equivalent functions in other DM
targets) just need to be audited/updated to ensure they are setting
both discard_granularity and max_hw_discard_sectors?
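
A minimal, untested sketch of that idea for blk_validate_limits(), placed after the
existing max_discard_sectors computation; discard_granularity is in bytes while the
limit is in 512-byte sectors, hence the conversion:

	if (lim->discard_granularity) {
		unsigned int granularity_sectors =
			lim->discard_granularity >> SECTOR_SHIFT;

		/* round the effective limit down to whole granularity units */
		if (granularity_sectors)
			lim->max_discard_sectors = rounddown(lim->max_discard_sectors,
							     granularity_sectors);
	}

(rounddown() rather than round_down() since the granularity need not be a power of two.)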

Mike

2024-05-20 15:51:06

by Christoph Hellwig

[permalink] [raw]
Subject: Re: dm: use queue_limits_set

On Mon, May 20, 2024 at 11:47:54AM -0400, Mike Snitzer wrote:
> Maybe update blk_validate_limits() to ensure max_discard_sectors is a
> multiple of discard_granularity?

That's probably a good idea. I'm travelling currently so I'll probably
need a day or two to get to this, feel free to do it yourself if you
are faster.


2024-05-20 15:55:05

by Mike Snitzer

[permalink] [raw]
Subject: Re: dm: use queue_limits_set

On Mon, May 20, 2024 at 05:44:25PM +0200, Christoph Hellwig wrote:
> On Mon, May 20, 2024 at 11:39:14AM -0400, Mike Snitzer wrote:
> > That's fair. My criticism was more about having to fix up DM targets
> > to cope with the new normal of max_discard_sectors being set as a
> > function of max_hw_discard_sectors and max_user_discard_sectors.
> >
> > With stacked devices in particular it is _very_ hard for the user to
> > know whether their exerting control over a max discard limit is correct.
>
> The user forcing a limit is always very sketchy, which is why I'm
> not a fan of it.
>
> > Yeah, but my concern is that if a user sets a value that is too low
> > it'll break targets like DM thinp (which Ted reported). So forcibly
> > setting both to indirectly set the required max_discard_sectors seems
> > necessary.
>
> dm-thinp requiring a minimum discard size is a rather odd requirement.
> Is this just a debug assert, or is there a real technical reason
> for it? If so we can introduce a knob to force a minimum size or
> disable the user setting the value entirely.

thinp's discard implementation is constrained by the dm-bio-prison's
constraints. One of the requirements of dm-bio-prison is that a
discard not exceed BIO_PRISON_MAX_RANGE.

My previous reply is a reasonable way to ensure a best effort to respect
a user's request while taking into account the driver-provided
discard_granularity. It'll force suboptimal (too small) discards to be
issued, but at least they'll cover a full thinp block.
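
For reference, the constraint being tripped is the range check in
dm_cell_key_has_valid_range(); the sketch below paraphrases only the WARN_ON_ONCE
quoted elsewhere in this thread, and the real function may carry additional checks:

bool dm_cell_key_has_valid_range(struct dm_cell_key *key)
{
	/*
	 * A discard of N sectors spans N / pool->sectors_per_block thin-pool
	 * blocks, and that block span must stay within BIO_PRISON_MAX_RANGE --
	 * hence thinp capping its advertised discard limit at
	 * pool->sectors_per_block * BIO_PRISON_MAX_RANGE.
	 */
	if (WARN_ON_ONCE(key->block_end - key->block_begin > BIO_PRISON_MAX_RANGE))
		return false;

	return true;
}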

> > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > index 4793ad2aa1f7..c196f39579af 100644
> > --- a/drivers/md/dm-thin.c
> > +++ b/drivers/md/dm-thin.c
> > @@ -4497,7 +4499,8 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
> >
> > if (pool->pf.discard_enabled) {
> > limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
> > - limits->max_discard_sectors = pool->sectors_per_block * BIO_PRISON_MAX_RANGE;
> > + limits->max_hw_discard_sectors = limits->max_user_discard_sectors =
> > + pool->sectors_per_block * BIO_PRISON_MAX_RANGE;
> > }
>
> Drivers really have no business setting max_user_discard_sectors;
> the whole point of the field is to separate device/driver capabilities
> from user policy. So if dm-thinp really has no way of handling
> smaller discards, we need to ensure they can't be set.

It can handle smaller so long as they respect discard_granularity.

> I'm also kinda curious what actually sets a user limit in Ted's case
> as that feels weird.

I agree, not sure... maybe fstests is using the knob?

2024-05-20 17:18:02

by Mike Snitzer

[permalink] [raw]
Subject: Re: dm: use queue_limits_set

[replying for completeness to explain what I think is happening for
the issue Ted reported]

On Mon, May 20, 2024 at 11:54:53AM -0400, Mike Snitzer wrote:
> On Mon, May 20, 2024 at 05:44:25PM +0200, Christoph Hellwig wrote:
> > On Mon, May 20, 2024 at 11:39:14AM -0400, Mike Snitzer wrote:
> > > That's fair. My criticism was more about having to fix up DM targets
> > > to cope with the new normal of max_discard_sectors being set as a
> > > function of max_hw_discard_sectors and max_user_discard_sectors.
> > >
> > > With stacked devices in particular it is _very_ hard for the user to
> > > know whether their exerting control over a max discard limit is correct.
> >
> > The user forcing a limit is always very sketchy, which is why I'm
> > not a fan of it.
> >
> > > Yeah, but my concern is that if a user sets a value that is too low
> > > it'll break targets like DM thinp (which Ted reported). So forcibly
> > > setting both to indirectly set the required max_discard_sectors seems
> > > necessary.

Could also be that a user sets the max discard too large (e.g. larger
than thinp's BIO_PRISON_MAX_RANGE).

> > dm-thinp requiring a minimum discard size is a rather odd requirement.
> > Is this just a debug assert, or is there a real technical reason
> > for it? If so we can introduce a knob to force a minimum size or
> > disable the user setting the value entirely.
>
> thinp's discard implementation is constrained by the dm-bio-prison's
> constraints. One of the requirements of dm-bio-prison is that a
> discard not exceed BIO_PRISON_MAX_RANGE.
>
> My previous reply is a reasonable way to ensure a best effort to respect
> a user's request while taking into account the driver-provided
> discard_granularity. It'll force suboptimal (too small) discards to be
> issued, but at least they'll cover a full thinp block.

Given the analysis below, this isn't at the heart of the issue Ted
reported. So the change to ensure max_discard_sectors is a multiple of
discard_granularity, while worthwhile, isn't critical to fixing the
reported issue.

> > > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > > index 4793ad2aa1f7..c196f39579af 100644
> > > --- a/drivers/md/dm-thin.c
> > > +++ b/drivers/md/dm-thin.c
> > > @@ -4497,7 +4499,8 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
> > >
> > > if (pool->pf.discard_enabled) {
> > > limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
> > > - limits->max_discard_sectors = pool->sectors_per_block * BIO_PRISON_MAX_RANGE;
> > > + limits->max_hw_discard_sectors = limits->max_user_discard_sectors =
> > > + pool->sectors_per_block * BIO_PRISON_MAX_RANGE;
> > > }
> >
> > Drivers really have no business setting max_user_discard_sectors;
> > the whole point of the field is to separate device/driver capabilities
> > from user policy. So if dm-thinp really has no way of handling
> > smaller discards, we need to ensure they can't be set.
>
> It can handle smaller so long as they respect discard_granularity.
>
> > I'm also kinda curious what actually sets a user limit in Ted's case
> > as that feels weird.
>
> I agree, not sure... maybe fstests is using the knob?

Doubt there was anything in fstests setting max discard user limit
(max_user_discard_sectors) in Ted's case. blk_set_stacking_limits()
sets max_user_discard_sectors to UINT_MAX, so given the use of
min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors) I
suspect blk_stack_limits() stacks up max_discard_sectors to match the
underlying storage's max_hw_discard_sectors.

And max_hw_discard_sectors exceeds BIO_PRISON_MAX_RANGE, resulting in
dm_cell_key_has_valid_range() triggering on:
WARN_ON_ONCE(key->block_end - key->block_begin > BIO_PRISON_MAX_RANGE)
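
To make that concrete, a rough walk-through of the limit flow (dev_hw and the exponent
are hypothetical values, shown only to illustrate the min() behaviour described above):

/*
 * blk_set_stacking_limits():  max_user_discard_sectors = UINT_MAX
 * stacking the data device:   max_hw_discard_sectors   = dev_hw  (e.g. 2^22 sectors)
 * blk_validate_limits():      max_discard_sectors      = min(dev_hw, UINT_MAX) = dev_hw
 *
 * A single discard of dev_hw sectors then spans
 * dev_hw / pool->sectors_per_block thin-pool blocks; once that count exceeds
 * BIO_PRISON_MAX_RANGE, the WARN_ON_ONCE above fires and the discard fails.
 */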

Mike

2024-05-20 20:12:53

by Christoph Hellwig

[permalink] [raw]
Subject: Re: dm: use queue_limits_set

On Mon, May 20, 2024 at 01:17:46PM -0400, Mike Snitzer wrote:
> Doubt there was anything in fstests setting max discard user limit
> (max_user_discard_sectors) in Ted's case. blk_set_stacking_limits()
> sets max_user_discard_sectors to UINT_MAX, so given the use of
> min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors) I
> suspect blk_stack_limits() stacks up max_discard_sectors to match the
> underlying storage's max_hw_discard_sectors.
>
> And max_hw_discard_sectors exceeds BIO_PRISON_MAX_RANGE, resulting in
> dm_cell_key_has_valid_range() triggering on:
> WARN_ON_ONCE(key->block_end - key->block_begin > BIO_PRISON_MAX_RANGE)

Oh, that makes more sense.

I think you just want to set the max_hw_discard_sectors limit before
stacking in the lower device limits so that they can only lower it.

(and in the long run we should just stop stacking the limits except
for request based dm which really needs it)


2024-05-20 22:03:26

by Mike Snitzer

[permalink] [raw]
Subject: Re: dm: use queue_limits_set

On Mon, May 20, 2024 at 10:12:37PM +0200, Christoph Hellwig wrote:
> On Mon, May 20, 2024 at 01:17:46PM -0400, Mike Snitzer wrote:
> > Doubt there was anything in fstests setting max discard user limit
> > (max_user_discard_sectors) in Ted's case. blk_set_stacking_limits()
> > sets max_user_discard_sectors to UINT_MAX, so given the use of
> > min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors) I
> > suspect blk_stack_limits() stacks up max_discard_sectors to match the
> > underlying storage's max_hw_discard_sectors.
> >
> > And max_hw_discard_sectors exceeds BIO_PRISON_MAX_RANGE, resulting in
> > dm_cell_key_has_valid_range() triggering on:
> > WARN_ON_ONCE(key->block_end - key->block_begin > BIO_PRISON_MAX_RANGE)
>
> Oh, that makes more sense.
>
> I think you just want to set the max_hw_discard_sectors limit before
> stacking in the lower device limits so that they can only lower it.
>
> (and in the long run we should just stop stacking the limits except
> for request based dm which really needs it)

This is what I staged, I cannot send a patch out right now..

Ted if you need the patch in email (rather than from linux-dm.git) I
can send it later tonight. Please see:

https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-6.10&id=825d8bbd2f32cb229c3b6653bd454832c3c20acb

2024-05-21 00:45:39

by Mike Snitzer

[permalink] [raw]
Subject: Re: dm: use queue_limits_set

On Mon, May 20, 2024 at 06:03:11PM -0400, Mike Snitzer wrote:
> On Mon, May 20, 2024 at 10:12:37PM +0200, Christoph Hellwig wrote:
> > On Mon, May 20, 2024 at 01:17:46PM -0400, Mike Snitzer wrote:
> > > Doubt there was anything in fstests setting max discard user limit
> > > (max_user_discard_sectors) in Ted's case. blk_set_stacking_limits()
> > > sets max_user_discard_sectors to UINT_MAX, so given the use of
> > > min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors) I
> > > suspect blk_stack_limits() stacks up max_discard_sectors to match the
> > > underlying storage's max_hw_discard_sectors.
> > >
> > > And max_hw_discard_sectors exceeds BIO_PRISON_MAX_RANGE, resulting in
> > > dm_cell_key_has_valid_range() triggering on:
> > > WARN_ON_ONCE(key->block_end - key->block_begin > BIO_PRISON_MAX_RANGE)
> >
> > Oh, that makes more sense.
> >
> > I think you just want to set the max_hw_discard_sectors limit before
> > stacking in the lower device limits so that they can only lower it.
> >
> > (and in the long run we should just stop stacking the limits except
> > for request based dm which really needs it)
>
> This is what I staged, I cannot send a patch out right now..
>
> Ted if you need the patch in email (rather than from linux-dm.git) I
> can send it later tonight. Please see:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-6.10&id=825d8bbd2f32cb229c3b6653bd454832c3c20acb

From: Mike Snitzer <[email protected]>
Date: Mon, 20 May 2024 13:34:06 -0400
Subject: [PATCH] dm: always manage discard support in terms of max_hw_discard_sectors

Commit 4f563a64732d ("block: add a max_user_discard_sectors queue
limit") changed block core to set max_discard_sectors to:
min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors)

Since commit 1c0e720228ad ("dm: use queue_limits_set") it was reported
dm-thinp was failing in a few fstests (generic/347 and generic/405)
with the first WARN_ON_ONCE in dm_cell_key_has_valid_range() being
reported, e.g.:
WARNING: CPU: 1 PID: 30 at drivers/md/dm-bio-prison-v1.c:128 dm_cell_key_has_valid_range+0x3d/0x50

blk_set_stacking_limits() sets max_user_discard_sectors to UINT_MAX,
so given how block core now sets max_discard_sectors (detailed above)
it follows that blk_stack_limits() stacks up the underlying device's
max_hw_discard_sectors and max_discard_sectors is set to match it. If
max_hw_discard_sectors exceeds dm's BIO_PRISON_MAX_RANGE, then
dm_cell_key_has_valid_range() will trigger the warning with:
WARN_ON_ONCE(key->block_end - key->block_begin > BIO_PRISON_MAX_RANGE)

Aside from this warning, the discard will fail. Fix this and other DM
issues by governing discard support in terms of max_hw_discard_sectors
instead of max_discard_sectors.

Reported-by: Theodore Ts'o <[email protected]>
Fixes: 1c0e720228ad ("dm: use queue_limits_set")
Signed-off-by: Mike Snitzer <[email protected]>
---
drivers/md/dm-cache-target.c | 5 ++---
drivers/md/dm-clone-target.c | 4 ++--
drivers/md/dm-log-writes.c | 2 +-
drivers/md/dm-snap.c | 2 +-
drivers/md/dm-target.c | 1 -
drivers/md/dm-thin.c | 4 ++--
drivers/md/dm-zero.c | 1 -
drivers/md/dm-zoned-target.c | 1 -
drivers/md/dm.c | 2 +-
9 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index 911f73f7ebba..1f0bc1173230 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -3394,8 +3394,8 @@ static void set_discard_limits(struct cache *cache, struct queue_limits *limits)

if (!cache->features.discard_passdown) {
/* No passdown is done so setting own virtual limits */
- limits->max_discard_sectors = min_t(sector_t, cache->discard_block_size * 1024,
- cache->origin_sectors);
+ limits->max_hw_discard_sectors = min_t(sector_t, cache->discard_block_size * 1024,
+ cache->origin_sectors);
limits->discard_granularity = cache->discard_block_size << SECTOR_SHIFT;
return;
}
@@ -3404,7 +3404,6 @@ static void set_discard_limits(struct cache *cache, struct queue_limits *limits)
* cache_iterate_devices() is stacking both origin and fast device limits
* but discards aren't passed to fast device, so inherit origin's limits.
*/
- limits->max_discard_sectors = origin_limits->max_discard_sectors;
limits->max_hw_discard_sectors = origin_limits->max_hw_discard_sectors;
limits->discard_granularity = origin_limits->discard_granularity;
limits->discard_alignment = origin_limits->discard_alignment;
diff --git a/drivers/md/dm-clone-target.c b/drivers/md/dm-clone-target.c
index 94b2fc33f64b..2332d9798141 100644
--- a/drivers/md/dm-clone-target.c
+++ b/drivers/md/dm-clone-target.c
@@ -2050,7 +2050,8 @@ static void set_discard_limits(struct clone *clone, struct queue_limits *limits)
if (!test_bit(DM_CLONE_DISCARD_PASSDOWN, &clone->flags)) {
/* No passdown is done so we set our own virtual limits */
limits->discard_granularity = clone->region_size << SECTOR_SHIFT;
- limits->max_discard_sectors = round_down(UINT_MAX >> SECTOR_SHIFT, clone->region_size);
+ limits->max_hw_discard_sectors = round_down(UINT_MAX >> SECTOR_SHIFT,
+ clone->region_size);
return;
}

@@ -2059,7 +2060,6 @@ static void set_discard_limits(struct clone *clone, struct queue_limits *limits)
* device limits but discards aren't passed to the source device, so
* inherit destination's limits.
*/
- limits->max_discard_sectors = dest_limits->max_discard_sectors;
limits->max_hw_discard_sectors = dest_limits->max_hw_discard_sectors;
limits->discard_granularity = dest_limits->discard_granularity;
limits->discard_alignment = dest_limits->discard_alignment;
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index f17a6cf2284e..8d7df8303d0a 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -871,7 +871,7 @@ static void log_writes_io_hints(struct dm_target *ti, struct queue_limits *limit
if (!bdev_max_discard_sectors(lc->dev->bdev)) {
lc->device_supports_discard = false;
limits->discard_granularity = lc->sectorsize;
- limits->max_discard_sectors = (UINT_MAX >> SECTOR_SHIFT);
+ limits->max_hw_discard_sectors = (UINT_MAX >> SECTOR_SHIFT);
}
limits->logical_block_size = bdev_logical_block_size(lc->dev->bdev);
limits->physical_block_size = bdev_physical_block_size(lc->dev->bdev);
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 0ace06d1bee3..f40c18da4000 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -2410,7 +2410,7 @@ static void snapshot_io_hints(struct dm_target *ti, struct queue_limits *limits)

/* All discards are split on chunk_size boundary */
limits->discard_granularity = snap->store->chunk_size;
- limits->max_discard_sectors = snap->store->chunk_size;
+ limits->max_hw_discard_sectors = snap->store->chunk_size;

up_read(&_origins_lock);
}
diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
index 0c4efb0bef8a..652627aea11b 100644
--- a/drivers/md/dm-target.c
+++ b/drivers/md/dm-target.c
@@ -249,7 +249,6 @@ static int io_err_iterate_devices(struct dm_target *ti,

static void io_err_io_hints(struct dm_target *ti, struct queue_limits *limits)
{
- limits->max_discard_sectors = UINT_MAX;
limits->max_hw_discard_sectors = UINT_MAX;
limits->discard_granularity = 512;
}
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 4793ad2aa1f7..e0528a4f809c 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -4100,7 +4100,7 @@ static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
if (pt->adjusted_pf.discard_enabled) {
disable_discard_passdown_if_not_supported(pt);
if (!pt->adjusted_pf.discard_passdown)
- limits->max_discard_sectors = 0;
+ limits->max_hw_discard_sectors = 0;
/*
* The pool uses the same discard limits as the underlying data
* device. DM core has already set this up.
@@ -4497,7 +4497,7 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)

if (pool->pf.discard_enabled) {
limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
- limits->max_discard_sectors = pool->sectors_per_block * BIO_PRISON_MAX_RANGE;
+ limits->max_hw_discard_sectors = pool->sectors_per_block * BIO_PRISON_MAX_RANGE;
}
}

diff --git a/drivers/md/dm-zero.c b/drivers/md/dm-zero.c
index 3b13e6eb1aa4..9a0bb623e823 100644
--- a/drivers/md/dm-zero.c
+++ b/drivers/md/dm-zero.c
@@ -61,7 +61,6 @@ static int zero_map(struct dm_target *ti, struct bio *bio)

static void zero_io_hints(struct dm_target *ti, struct queue_limits *limits)
{
- limits->max_discard_sectors = UINT_MAX;
limits->max_hw_discard_sectors = UINT_MAX;
limits->discard_granularity = 512;
}
diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c
index 621794a9edd6..12236e6f46f3 100644
--- a/drivers/md/dm-zoned-target.c
+++ b/drivers/md/dm-zoned-target.c
@@ -1001,7 +1001,6 @@ static void dmz_io_hints(struct dm_target *ti, struct queue_limits *limits)

limits->discard_alignment = 0;
limits->discard_granularity = DMZ_BLOCK_SIZE;
- limits->max_discard_sectors = chunk_sectors;
limits->max_hw_discard_sectors = chunk_sectors;
limits->max_write_zeroes_sectors = chunk_sectors;

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 7d0746b37c8e..3adfc6b83c01 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1086,7 +1086,7 @@ void disable_discard(struct mapped_device *md)
struct queue_limits *limits = dm_get_queue_limits(md);

/* device doesn't really support DISCARD, disable it */
- limits->max_discard_sectors = 0;
+ limits->max_hw_discard_sectors = 0;
}

void disable_write_zeroes(struct mapped_device *md)
--
2.44.0


2024-05-21 15:32:00

by Mike Snitzer

[permalink] [raw]
Subject: Re: dm: use queue_limits_set

On Mon, May 20, 2024 at 08:45:28PM -0400, Mike Snitzer wrote:
> On Mon, May 20, 2024 at 06:03:11PM -0400, Mike Snitzer wrote:
> > On Mon, May 20, 2024 at 10:12:37PM +0200, Christoph Hellwig wrote:
> > > On Mon, May 20, 2024 at 01:17:46PM -0400, Mike Snitzer wrote:
> > > > Doubt there was anything in fstests setting max discard user limit
> > > > (max_user_discard_sectors) in Ted's case. blk_set_stacking_limits()
> > > > sets max_user_discard_sectors to UINT_MAX, so given the use of
> > > > min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors) I
> > > > suspect blk_stack_limits() stacks up max_discard_sectors to match the
> > > > underlying storage's max_hw_discard_sectors.
> > > >
> > > > And max_hw_discard_sectors exceeds BIO_PRISON_MAX_RANGE, resulting in
> > > > dm_cell_key_has_valid_range() triggering on:
> > > > WARN_ON_ONCE(key->block_end - key->block_begin > BIO_PRISON_MAX_RANGE)
> > >
> > > Oh, that makes more sense.
> > >
> > > I think you just want to set the max_hw_discard_sectors limit before
> > > stacking in the lower device limits so that they can only lower it.
> > >
> > > (and in the long run we should just stop stacking the limits except
> > > for request based dm which really needs it)
> >
> > This is what I staged, I cannot send a patch out right now..
> >
> > Ted if you need the patch in email (rather than from linux-dm.git) I
> > can send it later tonight. Please see:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-6.10&id=825d8bbd2f32cb229c3b6653bd454832c3c20acb
>
> From: Mike Snitzer <[email protected]>
> Date: Mon, 20 May 2024 13:34:06 -0400
> Subject: [PATCH] dm: always manage discard support in terms of max_hw_discard_sectors
>
> Commit 4f563a64732d ("block: add a max_user_discard_sectors queue
> limit") changed block core to set max_discard_sectors to:
> min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors)
>
> Since commit 1c0e720228ad ("dm: use queue_limits_set") it was reported
> dm-thinp was failing in a few fstests (generic/347 and generic/405)
> with the first WARN_ON_ONCE in dm_cell_key_has_valid_range() being
> reported, e.g.:
> WARNING: CPU: 1 PID: 30 at drivers/md/dm-bio-prison-v1.c:128 dm_cell_key_has_valid_range+0x3d/0x50
>
> blk_set_stacking_limits() sets max_user_discard_sectors to UINT_MAX,
> so given how block core now sets max_discard_sectors (detailed above)
> it follows that blk_stack_limits() stacks up the underlying device's
> max_hw_discard_sectors and max_discard_sectors is set to match it. If
> max_hw_discard_sectors exceeds dm's BIO_PRISON_MAX_RANGE, then
> dm_cell_key_has_valid_range() will trigger the warning with:
> WARN_ON_ONCE(key->block_end - key->block_begin > BIO_PRISON_MAX_RANGE)
>
> Aside from this warning, the discard will fail. Fix this and other DM
> issues by governing discard support in terms of max_hw_discard_sectors
> instead of max_discard_sectors.
>
> Reported-by: Theodore Ts'o <[email protected]>
> Fixes: 1c0e720228ad ("dm: use queue_limits_set")
> Signed-off-by: Mike Snitzer <[email protected]>

With this patch applied, I verified xfstests generic/347 and
generic/405 no longer trigger the dm_cell_key_has_valid_range
WARN_ON_ONCE. I'm sending the fix to Linus now.