2018-10-04 20:09:37

by Bart Van Assche

Subject: Re: [PATCH] block: BFQ default for single queue devices

On Thu, 2018-10-04 at 20:25 +0100, Alan Cox wrote:
> > I agree with Jens that it's best to leave it to the Linux distributors to
> > select a default I/O scheduler.
>
> That assumes such a thing exists. The kernel knows what devices it is
> dealing with. The kernel 'default' ought to be 'whatever is usually best
> for this device'. A distro cannot just pick a correct single default
> because NVME and USB sticks are both normal and rather different in needs.

Which I/O scheduler works best also depends which workload the user will run.
BFQ has significant advantages for interactive workloads like video replay
with concurrent background I/O but probably slows down kernel builds. That's
why I'm not sure whether the kernel should select the default I/O scheduler.

Bart.


2018-10-04 20:40:13

by Paolo Valente

Subject: Re: [PATCH] block: BFQ default for single queue devices



> On 4 Oct 2018, at 22:09, Bart Van Assche <[email protected]> wrote:
>
> On Thu, 2018-10-04 at 20:25 +0100, Alan Cox wrote:
>>> I agree with Jens that it's best to leave it to the Linux distributors to
>>> select a default I/O scheduler.
>>
>> That assumes such a thing exists. The kernel knows what devices it is
>> dealing with. The kernel 'default' ought to be 'whatever is usually best
>> for this device'. A distro cannot just pick a correct single default
>> because NVME and USB sticks are both normal and rather different in needs.
>
> Which I/O scheduler works best also depends which workload the user will run.
> BFQ has significant advantages for interactive workloads like video replay
> with concurrent background I/O but probably slows down kernel builds.

No, kernel build is, for evident reasons, one of the workloads I cared
most about. Actually, I tried to focus on all my main
kernel-development tasks, such as git checkout, git merge, git
grep, ...

According to my test results, with BFQ these tasks are at least as
fast as, or, in most system configurations, much faster than with the
other schedulers. Of course, at the same time the system also remains
responsive with BFQ.

You can repeat these tests using one of my first scripts in the S
suite: kern_dev_tasks_vs_rw.sh (usually, the older the tests, the more
hypertrophied the names I gave :) ).

I also stopped sharing my kernel-build results years ago, because I
kept obtaining the same good results year after year, and I'm
aware that I tend to show and say too much stuff.

Thanks,
Paolo

> That's
> why I'm not sure whether the kernel should select the default I/O scheduler.
>
> Bart.


2018-10-04 22:43:19

by Bart Van Assche

Subject: Re: [PATCH] block: BFQ default for single queue devices

On Thu, 2018-10-04 at 22:39 +0200, Paolo Valente wrote:
> No, kernel build is, for evident reasons, one of the workloads I cared
> most about. Actually, I tried to focus on all my main
> kernel-development tasks, such as git checkout, git merge, git
> grep, ...
>
> According to my test results, with BFQ these tasks are at least as
> fast as, or, in most system configurations, much faster than with the
> other schedulers. Of course, at the same time the system also remains
> responsive with BFQ.
>
> You can repeat these tests using one of my first scripts in the S
> suite: kern_dev_tasks_vs_rw.sh (usually, the older the tests, the more
> hypertrophied the names I gave :) ).
>
> I also stopped sharing my kernel-build results years ago, because I
> kept obtaining the same good results year after year, and I'm
> aware that I tend to show and say too much stuff.

On my test setup building the kernel is slightly slower when using the BFQ
scheduler compared to using scheduler "none" (kernel 4.18.12, NVMe SSD,
single CPU with 6 cores, hyperthreading disabled). I am aware that the
proposal at the start of this thread was to make BFQ the default for devices
with a single hardware queue and not for devices like NVMe SSDs that support
multiple hardware queues.

What I think is missing is measurement results for BFQ on a system with
multiple CPU sockets and against a fast storage medium. Eliminating
the host lock from the SCSI core yielded a significant performance
improvement for such storage devices. Since the BFQ scheduler locks and
unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
will slow down I/O for fast storage devices, even if their driver only
creates a single hardware queue.

Bart.

2018-10-05 06:25:35

by Artem Bityutskiy

Subject: Re: [PATCH] block: BFQ default for single queue devices

On Thu, 2018-10-04 at 13:09 -0700, Bart Van Assche wrote:
> On Thu, 2018-10-04 at 20:25 +0100, Alan Cox wrote:
> > > I agree with Jens that it's best to leave it to the Linux distributors to
> > > select a default I/O scheduler.
> >
> > That assumes such a thing exists. The kernel knows what devices it is
> > dealing with. The kernel 'default' ought to be 'whatever is usually best
> > for this device'. A distro cannot just pick a correct single default
> > because NVME and USB sticks are both normal and rather different in needs.
>
> Which I/O scheduler works best also depends which workload the user will run.
> BFQ has significant advantages for interactive workloads like video replay
> with concurrent background I/O but probably slows down kernel builds. That's
> why I'm not sure whether the kernel should select the default I/O scheduler.

What's wrong with this simple hierarchy?

1. Block core selects the default scheduler.
2. Driver can overrule it early.
3. Userspace can overrule the default later.

Everyone is happy.

Good defaults in block core are great. Those defaults + #3 may cover
99% of the population.
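
For reference, #3 already works today via sysfs and udev; a minimal
sketch (the device name and the udev match are only examples):

    # see which schedulers are available and which one is active
    cat /sys/block/sda/queue/scheduler
    # e.g.: mq-deadline kyber [bfq] none

    # override the default at runtime
    echo bfq > /sys/block/sda/queue/scheduler

    # or persistently, e.g. only for rotational disks, via a udev rule:
    # ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", \
    #     ATTR{queue/scheduler}="bfq"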

1% of the population can use #2. See, Linus wants "bfq" for ubiblock.
Why wouldn't we let him work with the UBI community, show that bfq is
best for ubiblock, and just let the UBI community overrule the block
core's default?

If some day in the future there is a very good reason, we can even make
this a module parameter, and people could just boot with
'ubiblock.iosched=bfq'.

2018-10-05 09:18:18

by Jan Kara

Subject: Re: [PATCH] block: BFQ default for single queue devices

On Thu 04-10-18 15:42:52, Bart Van Assche wrote:
> On Thu, 2018-10-04 at 22:39 +0200, Paolo Valente wrote:
> > No, kernel build is, for evident reasons, one of the workloads I cared
> > most about. Actually, I tried to focus on all my main
> > kernel-development tasks, such as git checkout, git merge, git
> > grep, ...
> >
> > According to my test results, with BFQ these tasks are at least as
> > fast as, or, in most system configurations, much faster than with the
> > other schedulers. Of course, at the same time the system also remains
> > responsive with BFQ.
> >
> > You can repeat these tests using one of my first scripts in the S
> > suite: kern_dev_tasks_vs_rw.sh (usually, the older the tests, the more
> > hypertrophied the names I gave :) ).
> >
> > I also stopped sharing my kernel-build results years ago, because I
> > kept obtaining the same good results year after year, and I'm
> > aware that I tend to show and say too much stuff.
>
> On my test setup building the kernel is slightly slower when using the BFQ
> scheduler compared to using scheduler "none" (kernel 4.18.12, NVMe SSD,
> single CPU with 6 cores, hyperthreading disabled). I am aware that the
> proposal at the start of this thread was to make BFQ the default for devices
> with a single hardware queue and not for devices like NVMe SSDs that support
> multiple hardware queues.
>
> What I think is missing is measurement results for BFQ on a system with
> multiple CPU sockets and against a fast storage medium. Eliminating
> the host lock from the SCSI core yielded a significant performance
> improvement for such storage devices. Since the BFQ scheduler locks and
> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
> will slow down I/O for fast storage devices, even if their driver only
> creates a single hardware queue.

Well, I'm not sure why that is missing. I don't think anyone proposed to
default to BFQ for such setup? Neither was anyone claiming that BFQ is
better in such situation... The proposal has been: Default to BFQ for slow
storage, leave it to deadline-mq otherwise.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-10-05 09:29:06

by Paolo Valente

Subject: Re: [PATCH] block: BFQ default for single queue devices



> On 5 Oct 2018, at 00:42, Bart Van Assche <[email protected]> wrote:
>
> On Thu, 2018-10-04 at 22:39 +0200, Paolo Valente wrote:
>> No, kernel build is, for evident reasons, one of the workloads I cared
>> most about. Actually, I tried to focus on all my main
>> kernel-development tasks, such as git checkout, git merge, git
>> grep, ...
>>
>> According to my test results, with BFQ these tasks are at least as
>> fast as, or, in most system configurations, much faster than with the
>> other schedulers. Of course, at the same time the system also remains
>> responsive with BFQ.
>>
>> You can repeat these tests using one of my first scripts in the S
>> suite: kern_dev_tasks_vs_rw.sh (usually, the older the tests, the more
>> hypertrophied the names I gave :) ).
>>
>> I also stopped sharing my kernel-build results years ago, because I
>> kept obtaining the same good results year after year, and I'm
>> aware that I tend to show and say too much stuff.
>
> On my test setup building the kernel is slightly slower when using the BFQ
> scheduler compared to using scheduler "none" (kernel 4.18.12, NVMe SSD,
> single CPU with 6 cores, hyperthreading disabled). I am aware that the
> proposal at the start of this thread was to make BFQ the default for devices
> with a single hardware queue and not for devices like NVMe SSDs that support
> multiple hardware queues.
>

I'm missing your point: as you yourself note, the proposal is limited to
single-queue devices, exactly because BFQ is not ready for
multiple-queue devices yet.

> What I think is missing is measurement results for BFQ on a system with
> multiple CPU sockets and against a fast storage medium.

It is not missing. As I happened to report in previous threads, we
made a script to measure that too [1], using fio and null_blk.

I have reported the results we obtained, for three classes of
processors, in the in-kernel BFQ documentation [2].

In particular, BFQ reached 400 KIOPS with the fastest CPU mentioned in
that document (Intel i7-4850HQ).

So, since that single-socket commodity CPU is most likely slower than a
multi-socket system as a whole, on such a system you should be
conservatively ok, with BFQ, for single-queue devices in the
300-500 KIOPS range.

[1] https://github.com/Algodev-github/IOSpeed
[2] https://www.kernel.org/doc/Documentation/block/bfq-iosched.txt
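
If you want to reproduce this kind of peak-IOPS measurement without the
script in [1], something along these lines is enough (the parameters
below are only illustrative, not the exact ones used there, and assume
BFQ is built into the kernel):

    # single-queue blk-mq null_blk instance (queue_mode=2 selects blk-mq)
    modprobe null_blk queue_mode=2 submit_queues=1 nr_devices=1
    echo bfq > /sys/block/nullb0/queue/scheduler  # repeat with mq-deadline, none

    fio --name=peak-iops --filename=/dev/nullb0 --direct=1 --rw=randread \
        --bs=4k --ioengine=libaio --iodepth=64 --numjobs=4 \
        --time_based --runtime=30 --group_reporting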

>

> Eliminating
> the host lock from the SCSI core yielded a significant performance
> improvement for such storage devices. Since the BFQ scheduler locks and
> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
> will slow down I/O for fast storage devices, even if their driver only
> creates a single hardware queue.
>

One of the main motivations behind NVMe, and blk-mq itself, is that it
is hard to reach the above IOPS, and more, with a single I/O queue as
the bottleneck.

So, I wouldn't expect that systems
- equipped with single-queue drives reaching more than 500 KIOPS
- using SATA or some other non-NVMe protocol
- fast enough to push these drives to their maximum speed
constitute more than a negligible percentage of devices.

So, by sticking to mq-deadline, we would sacrifice 99% of systems just
to make sure that those very few systems on steroids reach maximum
throughput with random I/O (while still suffering from responsiveness
problems). I think it makes much more sense to default to what is best
for 99% of the single-queue systems, and to have those super systems
properly reconfigured by their users. For sure, other defaults would
have to be changed too, to get the most out of those systems.

Thanks,
Paolo


> Bart.


2018-10-06 03:18:17

by Bart Van Assche

Subject: Re: [PATCH] block: BFQ default for single queue devices

On 10/5/18 2:16 AM, Jan Kara wrote:
> On Thu 04-10-18 15:42:52, Bart Van Assche wrote:
>> What I think is missing is measurement results for BFQ on a system with
>> multiple CPU sockets and against a fast storage medium. Eliminating
>> the host lock from the SCSI core yielded a significant performance
>> improvement for such storage devices. Since the BFQ scheduler locks and
>> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
>> will slow down I/O for fast storage devices, even if their driver only
>> creates a single hardware queue.
>
> Well, I'm not sure why that is missing. I don't think anyone proposed to
> default to BFQ for such setup? Neither was anyone claiming that BFQ is
> better in such situation... The proposal has been: Default to BFQ for slow
> storage, leave it to deadline-mq otherwise.

Hi Jan,

How do you define slow storage? The proposal at the start of this thread
was to make BFQ the default for all block devices that create a single
hardware queue. That includes all SATA storage since scsi-mq only
creates a single hardware queue when using the SATA protocol. The
proposal to make BFQ the default for systems with a single hard disk
probably makes sense but I am not sure that making BFQ the default for
systems equipped with one or more (SATA) SSDs is also a good idea.
Especially for multi-socket systems since BFQ reintroduces a queue-wide
lock. As you know no queue-wide locking happens during I/O in the
scsi-mq core nor in the blk-mq core.
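
As a quick way to see how many hardware queues a given device actually
exposes, you can count the entries under its blk-mq sysfs directory
(the device name is just an example):

    ls /sys/block/sda/mq/
    # a SATA disk behind scsi-mq typically shows a single entry: 0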

Bart.

2018-10-06 06:47:09

by Paolo Valente

Subject: Re: [PATCH] block: BFQ default for single queue devices



> On 6 Oct 2018, at 05:12, Bart Van Assche <[email protected]> wrote:
>
> On 10/5/18 2:16 AM, Jan Kara wrote:
>> On Thu 04-10-18 15:42:52, Bart Van Assche wrote:
>>> What I think is missing is measurement results for BFQ on a system with
>>> multiple CPU sockets and against a fast storage medium. Eliminating
>>> the host lock from the SCSI core yielded a significant performance
>>> improvement for such storage devices. Since the BFQ scheduler locks and
>>> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
>>> will slow down I/O for fast storage devices, even if their driver only
>>> creates a single hardware queue.
>> Well, I'm not sure why that is missing. I don't think anyone proposed to
>> default to BFQ for such setup? Neither was anyone claiming that BFQ is
>> better in such situation... The proposal has been: Default to BFQ for slow
>> storage, leave it to deadline-mq otherwise.
>
> Hi Jan,
>
> How do you define slow storage? The proposal at the start of this thread was to make BFQ the default for all block devices that create a single hardware queue. That includes all SATA storage since scsi-mq only creates a single hardware queue when using the SATA protocol. The proposal to make BFQ the default for systems with a single hard disk probably makes sense but I am not sure that making BFQ the default for systems equipped with one or more (SATA) SSDs is also a good idea. Especially for multi-socket systems since BFQ reintroduces a queue-wide lock.

No, BFQ has no queue-wide lock. The very first change made to BFQ for
porting it to blk-mq was to remove the queue lock. Guided by Jens, I
replaced that lock with the exact, same scheduler lock used in
mq-deadline.

Thanks,
Paolo

> As you know no queue-wide locking happens during I/O in the scsi-mq core nor in the blk-mq core.
>
> Bart.


2018-10-06 16:21:00

by Bart Van Assche

Subject: Re: [PATCH] block: BFQ default for single queue devices

On 10/5/18 11:46 PM, Paolo Valente wrote:
>> On 6 Oct 2018, at 05:12, Bart Van Assche <[email protected]> wrote:
>> On 10/5/18 2:16 AM, Jan Kara wrote:
>>> On Thu 04-10-18 15:42:52, Bart Van Assche wrote:
>>>> What I think is missing is measurement results for BFQ on a system with
>>>> multiple CPU sockets and against a fast storage medium. Eliminating
>>>> the host lock from the SCSI core yielded a significant performance
>>>> improvement for such storage devices. Since the BFQ scheduler locks and
>>>> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
>>>> will slow down I/O for fast storage devices, even if their driver only
>>>> creates a single hardware queue.
>>> Well, I'm not sure why that is missing. I don't think anyone proposed to
>>> default to BFQ for such setup? Neither was anyone claiming that BFQ is
>>> better in such situation... The proposal has been: Default to BFQ for slow
>>> storage, leave it to deadline-mq otherwise.
>>
>> How do you define slow storage? The proposal at the start of this thread
>> was to make BFQ the default for all block devices that create a single
>> hardware queue. That includes all SATA storage since scsi-mq only creates
>> a single hardware queue when using the SATA protocol. The proposal to make
>> BFQ the default for systems with a single hard disk probably makes sense
>> but I am not sure that making BFQ the default for systems equipped with
>> one or more (SATA) SSDs is also a good idea. Especially for multi-socket
>> systems since BFQ reintroduces a queue-wide lock.
>
> No, BFQ has no queue-wide lock. The very first change made to BFQ for
> porting it to blk-mq was to remove the queue lock. Guided by Jens, I
> replaced that lock with the exact, same scheduler lock used in
> mq-deadline.

It's easy to see that both mq-deadline and BFQ define a queue-wide lock.
For mq-deadline it's deadline_data.lock. For BFQ it's bfq_data.lock. That
last lock serializes all bfq_dispatch_request() calls and hence reduces
concurrency while processing I/O requests. From bfq_dispatch_request():

static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
        struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
        [ ... ]
        spin_lock_irq(&bfqd->lock);
        [ ... ]
}

I think the above makes it very clear that bfqd->lock is queue-wide.

It is easy to understand why both I/O schedulers need a queue-wide lock:
the only way to avoid race conditions when considering all pending I/O
requests for scheduling decisions is to use a lock that covers all
pending requests and hence that is queue-wide.

Bart.

2018-10-06 16:46:51

by Paolo Valente

Subject: Re: [PATCH] block: BFQ default for single queue devices



> On 6 Oct 2018, at 18:20, Bart Van Assche <[email protected]> wrote:
>
> On 10/5/18 11:46 PM, Paolo Valente wrote:
>>> On 6 Oct 2018, at 05:12, Bart Van Assche <[email protected]> wrote:
>>> On 10/5/18 2:16 AM, Jan Kara wrote:
>>>> On Thu 04-10-18 15:42:52, Bart Van Assche wrote:
>>>>> What I think is missing is measurement results for BFQ on a system with
>>>>> multiple CPU sockets and against a fast storage medium. Eliminating
>>>>> the host lock from the SCSI core yielded a significant performance
>>>>> improvement for such storage devices. Since the BFQ scheduler locks and
>>>>> unlocks bfqd->lock for every dispatch operation it is very likely that BFQ
>>>>> will slow down I/O for fast storage devices, even if their driver only
>>>>> creates a single hardware queue.
>>>> Well, I'm not sure why that is missing. I don't think anyone proposed to
>>>> default to BFQ for such setup? Neither was anyone claiming that BFQ is
>>>> better in such situation... The proposal has been: Default to BFQ for slow
>>>> storage, leave it to deadline-mq otherwise.
>>>
>>> How do you define slow storage? The proposal at the start of this thread
>>> was to make BFQ the default for all block devices that create a single
>>> hardware queue. That includes all SATA storage since scsi-mq only creates
>>> a single hardware queue when using the SATA protocol. The proposal to make
>>> BFQ the default for systems with a single hard disk probably makes sense
>>> but I am not sure that making BFQ the default for systems equipped with
>>> one or more (SATA) SSDs is also a good idea. Especially for multi-socket
>>> systems since BFQ reintroduces a queue-wide lock.
>> No, BFQ has no queue-wide lock. The very first change made to BFQ for
>> porting it to blk-mq was to remove the queue lock. Guided by Jens, I
>> replaced that lock with the exact, same scheduler lock used in
>> mq-deadline.
>
> It's easy to see that both mq-deadline and BFQ define a queue-wide lock. For mq-deadline its deadline_data.lock. For BFQ it's bfq_data.lock. That last lock serializes all bfq_dispatch_request() calls and hence reduces concurrency while processing I/O requests. From bfq_dispatch_request():
>
> static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
> {
>         struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
>         [ ... ]
>         spin_lock_irq(&bfqd->lock);
>         [ ... ]
> }
>
> I think the above makes it very clear that bfqd->lock is queue-wide.
>
> It is easy to understand why both I/O schedulers need a queue-wide lock: the only way to avoid race conditions when considering all pending I/O requests for scheduling decisions is to use a lock that covers all pending requests and hence that is queue-wide.
>

Absolutely true. "Queue lock" is evidently a very general concept, and
a lock on a scheduler is, in the end, a lock on its internal queue(s).
But the queue lock removed by blk-mq is not that small per-scheduler
lock; it is the big, single-request-queue lock. The effects of the
latter are probably almost an order of magnitude larger than those of
a scheduler lock, even with a non-trivial scheduler like BFQ.

As a simple, concrete proof of this fact, consider the numbers that I
already gave you, and that you can reproduce in five minutes: on a
laptop, BFQ may support up to 400 KIOPS. Probably, even just with noop
as the I/O scheduler, the same PC could not process that many IOPS with
legacy blk (because of its single-request-queue lock).

To sum up, your argument mixes up two different locks.

Anyway, you are digging very deep into this issue. This takes you very
close to what I'm currently working on (still in a design phase):
increasing the parallel efficiency of BFQ, mainly by reducing the
time spent in the pieces of BFQ executed under its scheduler lock.

But the goal of such a non-trivial improvement is to go from the
current 400 KIOPS to more than one million IOPS. This is an
improvement that will most likely provide no benefit for probably 99%
of the systems with single-queue devices. Those systems simply do not
go beyond 300 KIOPS.

So, I'm trying first to devote my limited single-person bandwidth
(sorry, I couldn't resist the temptation to joke about this growing
discussion on single-something issues :) ) to improvements that make
BFQ better within its current hardware scope.

Thanks,
Paolo

> Bart.