2021-05-20 07:45:48

by Geert Uytterhoeven

Subject: Re: [PATCH] btrfs: scrub: per-device bandwidth control

Hi David,

On Tue, 18 May 2021, David Sterba wrote:
> Add sysfs interface to limit io during scrub. We relied on the ionice
> interface to do that, eg. the idle class let the system usable while
> scrub was running. This has changed when mq-deadline got widespread and
> did not implement the scheduling classes. That was a CFQ thing that got
> deleted. We've got numerous complaints from users about degraded
> performance.
>
> Currently only BFQ supports that but it's not a common scheduler and we
> can't ask everybody to switch to it.
>
> Alternatively the cgroup io limiting can be used but that also a
> non-trivial setup (v2 required, the controller must be enabled on the
> system). This can still be used if desired.
>
> Other ideas that have been explored: piggy-back on ionice (that is set
> per-process and is accessible) and interpret the class and classdata as
> bandwidth limits, but this does not have enough flexibility as there are
> only 8 allowed and we'd have to map fixed limits to each value. Also
> adjusting the value would need to lookup the process that currently runs
> scrub on the given device, and the value is not sticky so would have to
> be adjusted each time scrub runs.
>
> Running out of options, sysfs does not look that bad:
>
> - it's accessible from scripts, or udev rules
> - the name is similar to what MD-RAID has
> (/proc/sys/dev/raid/speed_limit_max or /sys/block/mdX/md/sync_speed_max)
> - the value is sticky at least for filesystem mount time
> - adjusting the value has immediate effect
> - sysfs is available in constrained environments (eg. system rescue)
> - the limit also applies to device replace
>
> Sysfs:
>
> - raw value is in bytes
> - values written to the file accept suffixes like K, M
> - file is in the per-device directory /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max
> - 0 means use default priority of IO
>
> The scheduler is a simple deadline one and the accuracy is up to nearest
> 128K.
>
> Signed-off-by: David Sterba <[email protected]>

Thanks for your patch, which is now commit b4a9f4bee31449bc ("btrfs:
scrub: per-device bandwidth control") in linux-next.

[email protected] reported the following failures for e.g.
m68k/defconfig:

ERROR: modpost: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!
ERROR: modpost: "__divdi3" [fs/btrfs/btrfs.ko] undefined!

> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -1988,6 +1993,60 @@ static void scrub_page_put(struct scrub_page *spage)
> }
> }
>
> +/*
> + * Throttling of IO submission, bandwidth-limit based, the timeslice is 1
> + * second. Limit can be set via /sys/fs/UUID/devinfo/devid/scrub_speed_max.
> + */
> +static void scrub_throttle(struct scrub_ctx *sctx)
> +{
> +	const int time_slice = 1000;
> +	struct scrub_bio *sbio;
> +	struct btrfs_device *device;
> +	s64 delta;
> +	ktime_t now;
> +	u32 div;
> +	u64 bwlimit;
> +
> +	sbio = sctx->bios[sctx->curr];
> +	device = sbio->dev;
> +	bwlimit = READ_ONCE(device->scrub_speed_max);
> +	if (bwlimit == 0)
> +		return;
> +
> +	/*
> +	 * Slice is divided into intervals when the IO is submitted, adjust by
> +	 * bwlimit and maximum of 64 intervals.
> +	 */
> +	div = max_t(u32, 1, (u32)(bwlimit / (16 * 1024 * 1024)));
> +	div = min_t(u32, 64, div);
> +
> +	/* Start new epoch, set deadline */
> +	now = ktime_get();
> +	if (sctx->throttle_deadline == 0) {
> +		sctx->throttle_deadline = ktime_add_ms(now, time_slice / div);

ERROR: modpost: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!

div_u64(bwlimit, div)

> +		sctx->throttle_sent = 0;
> +	}
> +
> +	/* Still in the time to send? */
> +	if (ktime_before(now, sctx->throttle_deadline)) {
> +		/* If current bio is within the limit, send it */
> +		sctx->throttle_sent += sbio->bio->bi_iter.bi_size;
> +		if (sctx->throttle_sent <= bwlimit / div)
> +			return;
> +
> +		/* We're over the limit, sleep until the rest of the slice */
> +		delta = ktime_ms_delta(sctx->throttle_deadline, now);
> +	} else {
> +		/* New request after deadline, start new epoch */
> +		delta = 0;
> +	}
> +
> +	if (delta)
> +		schedule_timeout_interruptible(delta * HZ / 1000);

ERROR: modpost: "__divdi3" [fs/btrfs/btrfs.ko] undefined!

I'm a bit surprised gcc doesn't emit code for the division by the
constant 1000, but emits a call to __divdi3(). So this has to become
div_u64(), too.
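
For illustration, a minimal userspace sketch of the two 64-bit divisions in the patch and how they might be routed through helpers instead of the plain `/` operator. `div_u64_sketch()` and `div_s64_sketch()` are hypothetical stand-ins for the kernel's `div_u64()` and `div_s64()` from <linux/math64.h>, `HZ_SKETCH` is an assumed tick rate, and the two wrapper functions are invented names. This is not the fix that was eventually committed.

```c
#include <stdint.h>

/* Assumed tick rate for this sketch; the kernel's HZ is a build-time option. */
#define HZ_SKETCH 250

/* Hypothetical userspace stand-in for the kernel's div_u64(u64, u32). */
static inline uint64_t div_u64_sketch(uint64_t dividend, uint32_t divisor)
{
	return dividend / divisor;
}

/* Hypothetical userspace stand-in for the kernel's div_s64(s64, s32). */
static inline int64_t div_s64_sketch(int64_t dividend, int32_t divisor)
{
	return dividend / divisor;
}

/* Per-interval byte budget: replaces the open-coded "bwlimit / div". */
static uint64_t interval_budget(uint64_t bwlimit, uint32_t div)
{
	return div_u64_sketch(bwlimit, div);
}

/* Sleep length in ticks: replaces the open-coded "delta * HZ / 1000". */
static int64_t delta_ms_to_ticks(int64_t delta_ms)
{
	return div_s64_sketch(delta_ms * HZ_SKETCH, 1000);
}
```

Since delta is signed (s64), the signed helper is the closer match for the second division; the unsigned one only fits if delta is known to be non-negative.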

> +	/* Next call will start the deadline period */
> +	sctx->throttle_deadline = 0;
> +}

BTW, any chance you can start adding lore Link: tags to your commits, to
make it easier to find the email thread to reply to when reporting a
regression?

Thanks!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


2021-05-20 13:01:02

by David Sterba

Subject: Re: [PATCH] btrfs: scrub: per-device bandwidth control

On Thu, May 20, 2021 at 09:43:10AM +0200, Geert Uytterhoeven wrote:
> > - values written to the file accept suffixes like K, M
> > - file is in the per-device directory /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max
> > - 0 means use default priority of IO
> >
> > The scheduler is a simple deadline one and the accuracy is up to nearest
> > 128K.
> >
> > Signed-off-by: David Sterba <[email protected]>
>
> Thanks for your patch, which is now commit b4a9f4bee31449bc ("btrfs:
> scrub: per-device bandwidth control") in linux-next.
>
> [email protected] reported the following failures for e.g.
> m68k/defconfig:
>
> ERROR: modpost: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!
> ERROR: modpost: "__divdi3" [fs/btrfs/btrfs.ko] undefined!

I'll fix it, thanks for the report.

> > +static void scrub_throttle(struct scrub_ctx *sctx)
> > +{
> > + const int time_slice = 1000;
> > + struct scrub_bio *sbio;
> > + struct btrfs_device *device;
> > + s64 delta;
> > + ktime_t now;
> > + u32 div;
> > + u64 bwlimit;
> > +
> > + sbio = sctx->bios[sctx->curr];
> > + device = sbio->dev;
> > + bwlimit = READ_ONCE(device->scrub_speed_max);
> > + if (bwlimit == 0)
> > + return;
> > +
> > + /*
> > + * Slice is divided into intervals when the IO is submitted, adjust by
> > + * bwlimit and maximum of 64 intervals.
> > + */
> > + div = max_t(u32, 1, (u32)(bwlimit / (16 * 1024 * 1024)));
> > + div = min_t(u32, 64, div);
> > +
> > + /* Start new epoch, set deadline */
> > + now = ktime_get();
> > + if (sctx->throttle_deadline == 0) {
> > + sctx->throttle_deadline = ktime_add_ms(now, time_slice / div);
>
> ERROR: modpost: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!
>
> div_u64(bwlimit, div)
>
> > + sctx->throttle_sent = 0;
> > + }
> > +
> > + /* Still in the time to send? */
> > + if (ktime_before(now, sctx->throttle_deadline)) {
> > + /* If current bio is within the limit, send it */
> > + sctx->throttle_sent += sbio->bio->bi_iter.bi_size;
> > + if (sctx->throttle_sent <= bwlimit / div)
> > + return;
> > +
> > + /* We're over the limit, sleep until the rest of the slice */
> > + delta = ktime_ms_delta(sctx->throttle_deadline, now);
> > + } else {
> > + /* New request after deadline, start new epoch */
> > + delta = 0;
> > + }
> > +
> > + if (delta)
> > + schedule_timeout_interruptible(delta * HZ / 1000);
>
> ERROR: modpost: "__divdi3" [fs/btrfs/btrfs.ko] undefined!
>
> I'm a bit surprised gcc doesn't emit code for the division by the
> constant 1000, but emits a call to __divdi3(). So this has to become
> div_u64(), too.
>
> > + /* Next call will start the deadline period */
> > + sctx->throttle_deadline = 0;
> > +}
>
> BTW, any chance you can start adding lore Link: tags to your commits, to
> make it easier to find the email thread to reply to when reporting a
> regression?

Well, no I'm not going to do that, sorry. It should be easy enough to
paste the patch subject to the search field on lore.k.org and click the
link leading to the mail, I do that all the time. Making sure that
patches have all the tags and information takes time already so I'm not
too keen to spend time on adding links.

2021-05-21 04:49:00

by Geert Uytterhoeven

Subject: Re: [PATCH] btrfs: scrub: per-device bandwidth control

Hi David,

On Thu, May 20, 2021 at 2:57 PM David Sterba <[email protected]> wrote:
> On Thu, May 20, 2021 at 09:43:10AM +0200, Geert Uytterhoeven wrote:
> > > - values written to the file accept suffixes like K, M
> > > - file is in the per-device directory /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max
> > > - 0 means use default priority of IO
> > >
> > > The scheduler is a simple deadline one and the accuracy is up to nearest
> > > 128K.
> > >
> > > Signed-off-by: David Sterba <[email protected]>
> >
> > Thanks for your patch, which is now commit b4a9f4bee31449bc ("btrfs:
> > scrub: per-device bandwidth control") in linux-next.
> >
> > [email protected] reported the following failures for e.g.
> > m68k/defconfig:
> >
> > ERROR: modpost: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!
> > ERROR: modpost: "__divdi3" [fs/btrfs/btrfs.ko] undefined!
>
> I'll fix it, thanks for the report.

Thanks!

> > BTW, any chance you can start adding lore Link: tags to your commits, to
> > make it easier to find the email thread to reply to when reporting a
> > regression?
>
> Well, no I'm not going to do that, sorry. It should be easy enough to
> paste the patch subject to the search field on lore.k.org and click the
> link leading to the mail, I do that all the time. Making sure that

There's no global search field on lore.kernel.org (yet), so you still have
to guess the mailing list. In this case that was obvious, but that's not always
the case.

> patches have all the tags and information takes time already so I'm not
> too keen to spend time on adding links.

The link can be added automatically by a git hook. Hence if you use b4
to get the series with all tags, you'll get the Link: for free!

Gr{oetje,eeting}s,

Geert


2021-05-21 04:53:22

by Arnd Bergmann

Subject: Re: [PATCH] btrfs: scrub: per-device bandwidth control

On Thu, May 20, 2021 at 9:43 AM Geert Uytterhoeven <[email protected]> wrote:
> On Tue, 18 May 2021, David Sterba wrote:

> > --- a/fs/btrfs/scrub.c
> > +++ b/fs/btrfs/scrub.c
> > @@ -1988,6 +1993,60 @@ static void scrub_page_put(struct scrub_page *spage)
> > }
> > }
> >
> > +/*
> > + * Throttling of IO submission, bandwidth-limit based, the timeslice is 1
> > + * second. Limit can be set via /sys/fs/UUID/devinfo/devid/scrub_speed_max.
> > + */
> > +static void scrub_throttle(struct scrub_ctx *sctx)
> > +{
> > + const int time_slice = 1000;
> > + struct scrub_bio *sbio;
> > + struct btrfs_device *device;
> > + s64 delta;
> > + ktime_t now;
> > + u32 div;
> > + u64 bwlimit;
> > +
> > + sbio = sctx->bios[sctx->curr];
> > + device = sbio->dev;
> > + bwlimit = READ_ONCE(device->scrub_speed_max);
> > + if (bwlimit == 0)
> > + return;
> > +
> > + /*
> > + * Slice is divided into intervals when the IO is submitted, adjust by
> > + * bwlimit and maximum of 64 intervals.
> > + */
> > + div = max_t(u32, 1, (u32)(bwlimit / (16 * 1024 * 1024)));
> > + div = min_t(u32, 64, div);
> > +
> > + /* Start new epoch, set deadline */
> > + now = ktime_get();
> > + if (sctx->throttle_deadline == 0) {
> > + sctx->throttle_deadline = ktime_add_ms(now, time_slice / div);
>
> ERROR: modpost: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!
>
> div_u64(bwlimit, div)

If 'time_slice' is in nanoseconds, the best interface to use
is ktime_divns().

> > + sctx->throttle_sent = 0;
> > + }
> > +
> > + /* Still in the time to send? */
> > + if (ktime_before(now, sctx->throttle_deadline)) {
> > + /* If current bio is within the limit, send it */
> > + sctx->throttle_sent += sbio->bio->bi_iter.bi_size;
> > + if (sctx->throttle_sent <= bwlimit / div)
> > + return;

Doesn't this also need to be changed?

> > + /* We're over the limit, sleep until the rest of the slice */
> > + delta = ktime_ms_delta(sctx->throttle_deadline, now);
> > + } else {
> > + /* New request after deadline, start new epoch */
> > + delta = 0;
> > + }
> > +
> > + if (delta)
> > + schedule_timeout_interruptible(delta * HZ / 1000);
>
> ERROR: modpost: "__divdi3" [fs/btrfs/btrfs.ko] undefined!
>
> I'm a bit surprised gcc doesn't emit code for the division by the
> constant 1000, but emits a call to __divdi3(). So this has to become
> div_u64(), too.

There is schedule_hrtimeout(), which takes a ktime_t directly
but has slightly different behavior. There is also an msecs_to_jiffies
helper that should produce a fast division.
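
As a rough illustration of the second suggestion, the sketch below models a millisecond-to-tick conversion in userspace. `msecs_to_ticks_sketch()` is a hypothetical stand-in, not the kernel's `msecs_to_jiffies()`; it assumes a tick rate (`HZ_SKETCH`) that divides 1000 evenly and rounds up so that a nonzero delay never turns into a zero-tick sleep.

```c
#include <stdint.h>

#define HZ_SKETCH 250u                     /* assumed tick rate */
#define MSEC_PER_TICK (1000u / HZ_SKETCH)  /* 4 ms per tick with the value above */

/* Convert milliseconds to ticks, rounding up. */
static unsigned long msecs_to_ticks_sketch(unsigned int ms)
{
	/*
	 * Both operands are 32 bits wide and the divisor is a compile-time
	 * constant, so no 64-bit division helper is needed.
	 */
	return ms / MSEC_PER_TICK + (ms % MSEC_PER_TICK != 0);
}
```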

Arnd

2021-05-21 20:20:28

by David Sterba

Subject: Re: [PATCH] btrfs: scrub: per-device bandwidth control

On Thu, May 20, 2021 at 03:14:03PM +0200, Arnd Bergmann wrote:
> On Thu, May 20, 2021 at 9:43 AM Geert Uytterhoeven <[email protected]> wrote:
> > On Tue, 18 May 2021, David Sterba wrote:
> > > + /* Start new epoch, set deadline */
> > > + now = ktime_get();
> > > + if (sctx->throttle_deadline == 0) {
> > > + sctx->throttle_deadline = ktime_add_ms(now, time_slice / div);
> >
> > ERROR: modpost: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!
> >
> > div_u64(bwlimit, div)
>
> If 'time_slice' is in nanoseconds, the best interface to use
> is ktime_divns().

It's in milliseconds and the division above is int/int, the problematic
one is below.
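
To make the distinction concrete, a small sketch using the variable types from the patch: `time_slice / div` mixes `int` and `u32` and stays 32 bits wide, while `bwlimit / div` is carried out in 64 bits, which is the case that needs a libgcc helper on 32-bit targets. The function names are made up for this illustration, and the widths assume a 32-bit `int`.

```c
#include <stddef.h>
#include <stdint.h>

/* Width in bytes of "time_slice / div" (int / u32 in the patch). */
static size_t slice_division_width(void)
{
	const int time_slice = 1000;
	uint32_t div = 4;

	/* sizeof does not evaluate the expression, it only reports its type size. */
	return sizeof(time_slice / div);
}

/* Width in bytes of "bwlimit / div" (u64 / u32 in the patch). */
static size_t budget_division_width(void)
{
	uint64_t bwlimit = 64ULL * 1024 * 1024;
	uint32_t div = 4;

	return sizeof(bwlimit / div);
}
```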
>
> > > + sctx->throttle_sent = 0;
> > > + }
> > > +
> > > + /* Still in the time to send? */
> > > + if (ktime_before(now, sctx->throttle_deadline)) {
> > > + /* If current bio is within the limit, send it */
> > > + sctx->throttle_sent += sbio->bio->bi_iter.bi_size;
> > > + if (sctx->throttle_sent <= bwlimit / div)
> > > + return;
>
> Doesn't this also need to be changed?
>
> > > + /* We're over the limit, sleep until the rest of the slice */
> > > + delta = ktime_ms_delta(sctx->throttle_deadline, now);
> > > + } else {
> > > + /* New request after deadline, start new epoch */
> > > + delta = 0;
> > > + }
> > > +
> > > + if (delta)
> > > + schedule_timeout_interruptible(delta * HZ / 1000);
> >
> > ERROR: modpost: "__divdi3" [fs/btrfs/btrfs.ko] undefined!
> >
> > I'm a bit surprised gcc doesn't emit code for the division by the
> > constant 1000, but emits a call to __divdi3(). So this has to become
> > div_u64(), too.
>
> There is schedule_hrtimeout(), which takes a ktime_t directly
> but has slightly different behavior. There is also an msecs_to_jiffies
> helper that should produce a fast division.

I'll use msecs_to_jiffies, thanks. If 'hr' in schedule_hrtimeout stands
for high resolution, it's not necessary here.

2021-05-21 20:20:44

by Geert Uytterhoeven

Subject: Re: [PATCH] btrfs: scrub: per-device bandwidth control

Hi David,

On Fri, May 21, 2021 at 5:18 PM David Sterba <[email protected]> wrote:
> On Thu, May 20, 2021 at 03:14:03PM +0200, Arnd Bergmann wrote:
> > On Thu, May 20, 2021 at 9:43 AM Geert Uytterhoeven <[email protected]> wrote:
> > > On Tue, 18 May 2021, David Sterba wrote:
> > > > + /* Start new epoch, set deadline */
> > > > + now = ktime_get();
> > > > + if (sctx->throttle_deadline == 0) {
> > > > + sctx->throttle_deadline = ktime_add_ms(now, time_slice / div);
> > >
> > > ERROR: modpost: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!
> > >
> > > div_u64(bwlimit, div)
> >
> > If 'time_slice' is in nanoseconds, the best interface to use
> > is ktime_divns().
>
> > It's in milliseconds and the division above is int/int, the problematic
> one is below.

Yep, sorry for the wrong pointer.

> >
> > > > + sctx->throttle_sent = 0;
> > > > + }
> > > > +
> > > > + /* Still in the time to send? */
> > > > + if (ktime_before(now, sctx->throttle_deadline)) {
> > > > + /* If current bio is within the limit, send it */
> > > > + sctx->throttle_sent += sbio->bio->bi_iter.bi_size;
> > > > + if (sctx->throttle_sent <= bwlimit / div)
> > > > + return;
> >
> > Doesn't this also need to be changed?
> >
> > > > + /* We're over the limit, sleep until the rest of the slice */
> > > > + delta = ktime_ms_delta(sctx->throttle_deadline, now);
> > > > + } else {
> > > > + /* New request after deadline, start new epoch */
> > > > + delta = 0;
> > > > + }
> > > > +
> > > > + if (delta)
> > > > + schedule_timeout_interruptible(delta * HZ / 1000);
> > >
> > > ERROR: modpost: "__divdi3" [fs/btrfs/btrfs.ko] undefined!
> > >
> > > I'm a bit surprised gcc doesn't emit code for the division by the
> > > constant 1000, but emits a call to __divdi3(). So this has to become
> > > div_u64(), too.
> >
> > There is schedule_hrtimeout(), which takes a ktime_t directly
> > but has slightly different behavior. There is also an msecs_to_jiffies
> > helper that should produce a fast division.
>
> I'll use msecs_to_jiffies, thanks. If 'hr' in schedule_hrtimeout stands
> for high resolution, it's not necessary here.

msecs_to_jiffies() takes (32-bit) "unsigned int", while delta is "s64".
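
One way to bridge that type mismatch, sketched in userspace: clamp the signed 64-bit delta into the `unsigned int` range before handing it to a 32-bit conversion helper. `clamp_delta_ms()` is a hypothetical helper for illustration only and is not taken from the eventual fix.

```c
#include <limits.h>
#include <stdint.h>

/* Clamp a signed 64-bit millisecond delta into the unsigned int range. */
static unsigned int clamp_delta_ms(int64_t delta_ms)
{
	if (delta_ms <= 0)
		return 0;			/* nothing left to wait for */
	if (delta_ms > (int64_t)UINT_MAX)
		return UINT_MAX;		/* saturate instead of wrapping */
	return (unsigned int)delta_ms;
}
```

In the patch as posted, delta is only nonzero inside the current slice and so cannot exceed the 1000 ms time slice; the clamp would be a safeguard rather than a change in behavior.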

Gr{oetje,eeting}s,

Geert
