2023-06-29 11:32:29

by Chengming Zhou

Subject: [PATCH v2 0/4] blk-mq: optimize the size of struct request

From: Chengming Zhou <[email protected]>

v2:
- Change to use call_single_data_t, which uses __aligned() to avoid
  using 2 cache lines for 1 csd. Thanks Ming Lei.
- [v1] https://lore.kernel.org/all/[email protected]/

Hello,

After commit be4c427809b0 ("blk-mq: use the I/O scheduler for
writes from the flush state machine"), rq->flush can't reuse rq->elv
anymore, since flush_data requests can now go into the I/O scheduler.

That increased the size of struct request by 24 bytes, but this
patchset decreases it by 40 bytes, which I think is a net win.

patch 1 uses a percpu csd for remote completion instead of a per-rq
csd, decreasing the size by 24 bytes.

patches 2-3 reuse rq->queuelist for the flush state machine pending
list and maintain a u64 counter of inflight flush_data requests,
decreasing the size by 16 bytes.

patch 4 is just a cleanup along the way. (The size arithmetic is
sketched below.)
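
For reference, my rough accounting of the savings, assuming a typical
64-bit build (not verified with pahole):

  csd:        struct __call_single_data is 4 pointer-sized words
              (32 bytes), in a union with u64 fifo_time (8 bytes),
              so dropping it saves 32 - 8 = 24 bytes
  flush.list: struct list_head is 2 pointers, so 16 bytes
  total:      40 bytes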

Thanks for comments!

Chengming Zhou (4):
blk-mq: use percpu csd to remote complete instead of per-rq csd
blk-flush: count inflight flush_data requests
blk-flush: reuse rq queuelist in flush state machine
blk-mq: delete unused completion_data in struct request

 block/blk-flush.c      | 19 +++++++++----------
 block/blk-mq.c         | 12 ++++++++----
 block/blk.h            |  5 ++---
 include/linux/blk-mq.h | 10 ++--------
 4 files changed, 21 insertions(+), 25 deletions(-)

--
2.39.2



2023-06-29 11:33:08

by Chengming Zhou

Subject: [PATCH v2 1/4] blk-mq: use percpu csd to remote complete instead of per-rq csd

From: Chengming Zhou <[email protected]>

If a request needs to be completed remotely, we insert it into the percpu
llist, and call smp_call_function_single_async() if the llist was empty
beforehand.

We don't need a per-rq csd; a percpu csd is enough. And the size of
struct request is decreased by 24 bytes.

This way is cleaner, and looks correct: the block softirq is guaranteed to be
scheduled to consume the list whenever a new request is added to this percpu
list, whether smp_call_function_single_async() returns -EBUSY or 0.
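
To illustrate, a minimal sketch of the consumer side, modeled on the
existing blk_done_softirq() path (not the exact kernel code): the csd
handler only raises the block softirq, and the softirq handler drains
the whole percpu llist, which is why one csd per CPU is enough and its
info pointer can be NULL.

static void __blk_mq_complete_request_remote(void *data)
{
        /* data is NULL now; all state lives in the percpu llist */
        __raise_softirq_irqoff(BLOCK_SOFTIRQ);
}

static void blk_done_softirq(struct softirq_action *h)
{
        struct llist_head *list = this_cpu_ptr(&blk_cpu_done);
        struct request *rq, *next;

        /* consume everything queued so far (the real code also
         * reverses the list so completions run in FIFO order) */
        llist_for_each_entry_safe(rq, next, llist_del_all(list), ipi_list)
                rq->q->mq_ops->complete(rq);
}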

Signed-off-by: Chengming Zhou <[email protected]>
---
v2:
- Change to use call_single_data_t, which avoids using 2 cache lines
  for 1 csd, as suggested by Ming Lei.
- Improve the commit log; the explanation is copied from Ming Lei.
---
 block/blk-mq.c         | 12 ++++++++----
 include/linux/blk-mq.h |  5 +----
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index decb6ab2d508..e52200edd2b1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -43,6 +43,7 @@
 #include "blk-ioprio.h"
 
 static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
+static DEFINE_PER_CPU(call_single_data_t, blk_cpu_csd);
 
 static void blk_mq_insert_request(struct request *rq, blk_insert_t flags);
 static void blk_mq_request_bypass_insert(struct request *rq,
@@ -1156,13 +1157,13 @@ static void blk_mq_complete_send_ipi(struct request *rq)
 {
         struct llist_head *list;
         unsigned int cpu;
+        call_single_data_t *csd;
 
         cpu = rq->mq_ctx->cpu;
         list = &per_cpu(blk_cpu_done, cpu);
-        if (llist_add(&rq->ipi_list, list)) {
-                INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
-                smp_call_function_single_async(cpu, &rq->csd);
-        }
+        csd = &per_cpu(blk_cpu_csd, cpu);
+        if (llist_add(&rq->ipi_list, list))
+                smp_call_function_single_async(cpu, csd);
 }
 
 static void blk_mq_raise_softirq(struct request *rq)
@@ -4796,6 +4797,9 @@ static int __init blk_mq_init(void)
 
         for_each_possible_cpu(i)
                 init_llist_head(&per_cpu(blk_cpu_done, i));
+        for_each_possible_cpu(i)
+                INIT_CSD(&per_cpu(blk_cpu_csd, i),
+                         __blk_mq_complete_request_remote, NULL);
         open_softirq(BLOCK_SOFTIRQ, blk_done_softirq);
 
         cpuhp_setup_state_nocalls(CPUHP_BLOCK_SOFTIRQ_DEAD,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f401067ac03a..070551197c0e 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -182,10 +182,7 @@ struct request {
                 rq_end_io_fn *saved_end_io;
         } flush;
 
-        union {
-                struct __call_single_data csd;
-                u64 fifo_time;
-        };
+        u64 fifo_time;
 
         /*
          * completion callback.
--
2.39.2


2023-06-29 11:48:20

by Chengming Zhou

Subject: [PATCH v2 3/4] blk-flush: reuse rq queuelist in flush state machine

From: Chengming Zhou <[email protected]>

Since we don't need to maintain the inflight flush_data requests list
anymore, we can reuse rq->queuelist for the flush pending list.

This patch decreases the size of struct request by 16 bytes.
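
The 16 bytes are just the removed struct list_head (two pointers on
64-bit). The reuse is safe because a request sits on at most one list
at any point in the flush sequence; a rough sketch of the link's
lifecycle, condensed from the hunks below:

list_move_tail(&rq->queuelist, pending);        /* REQ_FSEQ_{PRE,POST}FLUSH */
list_move_tail(&rq->queuelist, &q->flush_list); /* REQ_FSEQ_DATA */
list_del_init(&rq->queuelist);                  /* REQ_FSEQ_DONE */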

Signed-off-by: Chengming Zhou <[email protected]>
---
 block/blk-flush.c      | 12 +++++-------
 include/linux/blk-mq.h |  1 -
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index bb7adfc2a5da..81588edbe8b0 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -183,14 +183,13 @@ static void blk_flush_complete_seq(struct request *rq,
                 /* queue for flush */
                 if (list_empty(pending))
                         fq->flush_pending_since = jiffies;
-                list_move_tail(&rq->flush.list, pending);
+                list_move_tail(&rq->queuelist, pending);
                 break;
 
         case REQ_FSEQ_DATA:
-                list_del_init(&rq->flush.list);
                 fq->flush_data_in_flight++;
                 spin_lock(&q->requeue_lock);
-                list_add_tail(&rq->queuelist, &q->flush_list);
+                list_move_tail(&rq->queuelist, &q->flush_list);
                 spin_unlock(&q->requeue_lock);
                 blk_mq_kick_requeue_list(q);
                 break;
@@ -202,7 +201,7 @@ static void blk_flush_complete_seq(struct request *rq,
                  * flush data request completion path. Restore @rq for
                  * normal completion and end it.
                  */
-                list_del_init(&rq->flush.list);
+                list_del_init(&rq->queuelist);
                 blk_flush_restore_request(rq);
                 blk_mq_end_request(rq, error);
                 break;
@@ -258,7 +257,7 @@ static enum rq_end_io_ret flush_end_io(struct request *flush_rq,
         fq->flush_running_idx ^= 1;
 
         /* and push the waiting requests to the next stage */
-        list_for_each_entry_safe(rq, n, running, flush.list) {
+        list_for_each_entry_safe(rq, n, running, queuelist) {
                 unsigned int seq = blk_flush_cur_seq(rq);
 
                 BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
@@ -292,7 +291,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 {
         struct list_head *pending = &fq->flush_queue[fq->flush_pending_idx];
         struct request *first_rq =
-                list_first_entry(pending, struct request, flush.list);
+                list_first_entry(pending, struct request, queuelist);
         struct request *flush_rq = fq->flush_rq;
 
         /* C1 described at the top of this file */
@@ -386,7 +385,6 @@ static enum rq_end_io_ret mq_flush_data_end_io(struct request *rq,
 static void blk_rq_init_flush(struct request *rq)
 {
         rq->flush.seq = 0;
-        INIT_LIST_HEAD(&rq->flush.list);
         rq->rq_flags |= RQF_FLUSH_SEQ;
         rq->flush.saved_end_io = rq->end_io; /* Usually NULL */
         rq->end_io = mq_flush_data_end_io;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 070551197c0e..96644d6f8d18 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -178,7 +178,6 @@ struct request {
 
         struct {
                 unsigned int seq;
-                struct list_head list;
                 rq_end_io_fn *saved_end_io;
         } flush;

--
2.39.2


2023-07-05 15:45:08

by Ming Lei

Subject: Re: [PATCH v2 1/4] blk-mq: use percpu csd to remote complete instead of per-rq csd

On Thu, Jun 29, 2023 at 07:03:56PM +0800, [email protected] wrote:
> From: Chengming Zhou <[email protected]>
>
> If a request needs to be completed remotely, we insert it into the percpu
> llist, and call smp_call_function_single_async() if the llist was empty
> beforehand.
>
> We don't need a per-rq csd; a percpu csd is enough. And the size of
> struct request is decreased by 24 bytes.
>
> This way is cleaner, and looks correct: the block softirq is guaranteed to be
> scheduled to consume the list whenever a new request is added to this percpu
> list, whether smp_call_function_single_async() returns -EBUSY or 0.
>
> Signed-off-by: Chengming Zhou <[email protected]>
> ---
> v2:
> - Change to use call_single_data_t, which avoids using 2 cache lines
>   for 1 csd, as suggested by Ming Lei.
> - Improve the commit log; the explanation is copied from Ming Lei.

Reviewed-by: Ming Lei <[email protected]>

Thanks,
Ming


2023-07-06 13:13:03

by Christoph Hellwig

Subject: Re: [PATCH v2 1/4] blk-mq: use percpu csd to remote complete instead of per-rq csd

On Thu, Jun 29, 2023 at 07:03:56PM +0800, [email protected] wrote:
> From: Chengming Zhou <[email protected]>
>
> If a request needs to be completed remotely, we insert it into the percpu
> llist, and call smp_call_function_single_async() if the llist was empty
> beforehand.
>
> We don't need a per-rq csd; a percpu csd is enough. And the size of
> struct request is decreased by 24 bytes.
>
> This way is cleaner, and looks correct: the block softirq is guaranteed to be
> scheduled to consume the list whenever a new request is added to this percpu
> list, whether smp_call_function_single_async() returns -EBUSY or 0.

Please trim your commit logs to 73 characters per line so that they
are readable in git log output.

> static void blk_mq_request_bypass_insert(struct request *rq,
> @@ -1156,13 +1157,13 @@ static void blk_mq_complete_send_ipi(struct request *rq)
> {
>         struct llist_head *list;
>         unsigned int cpu;
> +       call_single_data_t *csd;
>
>         cpu = rq->mq_ctx->cpu;
>         list = &per_cpu(blk_cpu_done, cpu);
> -       if (llist_add(&rq->ipi_list, list)) {
> -               INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
> -               smp_call_function_single_async(cpu, &rq->csd);
> -       }
> +       csd = &per_cpu(blk_cpu_csd, cpu);
> +       if (llist_add(&rq->ipi_list, list))
> +               smp_call_function_single_async(cpu, csd);
> }

No need for the list and csd variables here as they are only used
once.

But I think this code has a problem when it is preempted between
the llist_add and smp_call_function_single_async. We either need a
get_cpu/put_cpu around them, or introduce a structure with the list
and csd, and then you can use one pointer from per_cpu and still ensure
the list and csd are for the same CPU.
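
The second option would be something like this sketch (the names are
made up):

struct blk_cpu_complete {
        struct llist_head list;
        call_single_data_t csd;
};

static DEFINE_PER_CPU(struct blk_cpu_complete, blk_cpu_complete);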


2023-07-06 14:59:04

by Christoph Hellwig

Subject: Re: [PATCH v2 1/4] blk-mq: use percpu csd to remote complete instead of per-rq csd

On Thu, Jul 06, 2023 at 10:23:49PM +0800, Chengming Zhou wrote:
> Yes, should I change it like below? The lines get rather long, though. :-)
>
> if (llist_add(&rq->ipi_list, &per_cpu(blk_cpu_done, cpu)))
>         smp_call_function_single_async(cpu, &per_cpu(blk_cpu_csd, cpu));

Doesn't look bad to me.

>
>
> >
> > But I think this code has a problem when it is preempted between
> > the llist_add and smp_call_function_single_async. We either need a
> > get_cpu/put_cpu around them, or introduce a structure with the list
> > and csd, and then you can use one pointer from per_cpu and still ensure
> > the list and csd are for the same CPU.
> >
>
> We use cpu = rq->mq_ctx->cpu; so it's certainly the same CPU, right?

You're right of course - cpu is the submitting CPU and not the current
one, and thus not affected by preemption. Sorry for the noise.


2023-07-06 15:09:36

by Chengming Zhou

Subject: Re: [PATCH v2 1/4] blk-mq: use percpu csd to remote complete instead of per-rq csd

On 2023/7/6 21:07, Christoph Hellwig wrote:
> On Thu, Jun 29, 2023 at 07:03:56PM +0800, [email protected] wrote:
>> From: Chengming Zhou <[email protected]>
>>
>> If a request needs to be completed remotely, we insert it into the percpu
>> llist, and call smp_call_function_single_async() if the llist was empty
>> beforehand.
>>
>> We don't need a per-rq csd; a percpu csd is enough. And the size of
>> struct request is decreased by 24 bytes.
>>
>> This way is cleaner, and looks correct: the block softirq is guaranteed to be
>> scheduled to consume the list whenever a new request is added to this percpu
>> list, whether smp_call_function_single_async() returns -EBUSY or 0.
>
> Please trim your commit logs to 73 characters per line so that they
> are readable in git log output.

Ok, will fix in the next version.

>
>> static void blk_mq_request_bypass_insert(struct request *rq,
>> @@ -1156,13 +1157,13 @@ static void blk_mq_complete_send_ipi(struct request *rq)
>> {
>>         struct llist_head *list;
>>         unsigned int cpu;
>> +       call_single_data_t *csd;
>>
>>         cpu = rq->mq_ctx->cpu;
>>         list = &per_cpu(blk_cpu_done, cpu);
>> -       if (llist_add(&rq->ipi_list, list)) {
>> -               INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
>> -               smp_call_function_single_async(cpu, &rq->csd);
>> -       }
>> +       csd = &per_cpu(blk_cpu_csd, cpu);
>> +       if (llist_add(&rq->ipi_list, list))
>> +               smp_call_function_single_async(cpu, csd);
>> }
>
> No need for the list and csd variables here as they are only used
> once.

Yes, should I change it like below? The lines get rather long, though. :-)

if (llist_add(&rq->ipi_list, &per_cpu(blk_cpu_done, cpu)))
        smp_call_function_single_async(cpu, &per_cpu(blk_cpu_csd, cpu));

>
> But I think this code has a problem when it is preempted between
> the llist_add and smp_call_function_single_async. We either need a
> get_cpu/put_cpu around them, or introduce a structure with the list
> and csd, and then you can use one pointer from per_cpu and still ensure
> the list and csd are for the same CPU.
>

We use cpu = rq->mq_ctx->cpu; so it's certainly the same CPU, right?
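
To summarize, the function in the next version would then look like
this (untested sketch):

static void blk_mq_complete_send_ipi(struct request *rq)
{
        unsigned int cpu = rq->mq_ctx->cpu;

        /*
         * cpu comes from rq->mq_ctx, not smp_processor_id(), so the
         * llist and the csd always belong to the same CPU, even if we
         * are preempted and migrated between the two calls.
         */
        if (llist_add(&rq->ipi_list, &per_cpu(blk_cpu_done, cpu)))
                smp_call_function_single_async(cpu, &per_cpu(blk_cpu_csd, cpu));
}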

Thanks!