2016-03-25 21:36:56

by Shaohua Li

[permalink] [raw]
Subject: [PATCH 1/3] blk-mq: add an API to estimate hardware queue node

we allocate most data structure in device's node, but some data
structures are not for DMA and mostly used by specific cpus/node which
could diff from device's node. Allocating such hot data in device's
node doesn't make sense. Add an API to estimate hardware queue node.
This can be used before blk-mq actually establishes the mapping. This
API runs slow, but it only used in initialization time.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-mq.c | 21 +++++++++++++++++++++
include/linux/blk-mq.h | 2 ++
2 files changed, 23 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 050f7a1..ec214d3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2330,6 +2330,27 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
}
EXPORT_SYMBOL(blk_mq_free_tag_set);

+int blk_mq_estimate_hw_queue_node(unsigned int total_queues,
+ unsigned int index)
+{
+ unsigned int *map;
+ int node;
+
+ if (total_queues == 1)
+ return NUMA_NO_NODE;
+ map = kzalloc(sizeof(*map) * nr_cpu_ids, GFP_KERNEL);
+ if (!map)
+ return NUMA_NO_NODE;
+ if (blk_mq_update_queue_map(map, total_queues, cpu_online_mask)) {
+ kfree(map);
+ return NUMA_NO_NODE;
+ }
+ node = blk_mq_hw_queue_to_node(map, index);
+ kfree(map);
+ return node;
+}
+EXPORT_SYMBOL(blk_mq_estimate_hw_queue_node);
+
int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
{
struct blk_mq_tag_set *set = q->tag_set;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 15a73d4..892b41b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -188,6 +188,8 @@ void blk_mq_insert_request(struct request *, bool, bool, bool);
void blk_mq_free_request(struct request *rq);
void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *, struct request *rq);
bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
+int blk_mq_estimate_hw_queue_node(unsigned int total_queues,
+ unsigned int index);

enum {
BLK_MQ_REQ_NOWAIT = (1 << 0), /* return when out of requests */
--
2.8.0.rc2


2016-03-25 21:36:38

by Shaohua Li

[permalink] [raw]
Subject: [PATCH 3/3] nvme: allocate nvme_queue in correct node

nvme_queue is per-cpu queue (mostly). Allocating it in node where blk-mq
will use it.

Signed-off-by: Shaohua Li <[email protected]>
---
drivers/nvme/host/pci.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f8db70a..8d24701 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1138,9 +1138,9 @@ static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
}

static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
- int depth)
+ int depth, int node)
{
- struct nvme_queue *nvmeq = kzalloc(sizeof(*nvmeq), GFP_KERNEL);
+ struct nvme_queue *nvmeq = kzalloc_node(sizeof(*nvmeq), GFP_KERNEL, node);
if (!nvmeq)
return NULL;

@@ -1318,7 +1318,8 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)

nvmeq = dev->queues[0];
if (!nvmeq) {
- nvmeq = nvme_alloc_queue(dev, 0, NVME_AQ_DEPTH);
+ nvmeq = nvme_alloc_queue(dev, 0, NVME_AQ_DEPTH,
+ dev_to_node(dev->dev));
if (!nvmeq)
return -ENOMEM;
}
@@ -1372,11 +1373,15 @@ static void nvme_watchdog_timer(unsigned long data)

static int nvme_create_io_queues(struct nvme_dev *dev)
{
- unsigned i, max;
+ unsigned i, max, start;
int ret = 0;
+ int node;

+ start = dev->queue_count;
for (i = dev->queue_count; i <= dev->max_qid; i++) {
- if (!nvme_alloc_queue(dev, i, dev->q_depth)) {
+ node = blk_mq_estimate_hw_queue_node(
+ dev->max_qid - start + 1, i - start);
+ if (!nvme_alloc_queue(dev, i, dev->q_depth, node)) {
ret = -ENOMEM;
break;
}
--
2.8.0.rc2

2016-03-25 21:36:36

by Shaohua Li

[permalink] [raw]
Subject: [PATCH 2/3] blk-mq: allocate blk_mq_tags and requests in correct node

blk_mq_tags/requests of specific hardware queue are mostly used in
specific cpus, which might not be in the same numa node as disk. For
example, a nvme card is in node 0. half hardware queue will be used by
node 0, the other node 1.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-mq.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ec214d3..2e37c28 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1465,9 +1465,15 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
struct blk_mq_tags *tags;
unsigned int i, j, entries_per_page, max_order = 4;
size_t rq_size, left;
+ int node;
+
+ node = blk_mq_estimate_hw_queue_node(set->nr_hw_queues,
+ hctx_idx);
+ if (node == NUMA_NO_NODE)
+ node = set->numa_node;

tags = blk_mq_init_tags(set->queue_depth, set->reserved_tags,
- set->numa_node,
+ node,
BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
if (!tags)
return NULL;
@@ -1476,7 +1482,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,

tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
- set->numa_node);
+ node);
if (!tags->rqs) {
blk_mq_free_tags(tags);
return NULL;
@@ -1500,7 +1506,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
this_order--;

do {
- page = alloc_pages_node(set->numa_node,
+ page = alloc_pages_node(node,
GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
this_order);
if (page)
@@ -1531,7 +1537,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
if (set->ops->init_request) {
if (set->ops->init_request(set->driver_data,
tags->rqs[i], hctx_idx, i,
- set->numa_node)) {
+ node)) {
tags->rqs[i] = NULL;
goto fail;
}
--
2.8.0.rc2

2016-03-29 07:24:45

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/3] blk-mq: add an API to estimate hardware queue node

On Fri, Mar 25, 2016 at 02:36:30PM -0700, Shaohua Li wrote:
> we allocate most data structure in device's node, but some data
> structures are not for DMA and mostly used by specific cpus/node which
> could diff from device's node. Allocating such hot data in device's
> node doesn't make sense. Add an API to estimate hardware queue node.
> This can be used before blk-mq actually establishes the mapping. This
> API runs slow, but it only used in initialization time.

I think this is the wrong way around. I've got some proprotype code
that just leaves the cpu assignments to the drivers and picks it up
in blk-mq. Give me a few days to post it..

2016-03-29 16:48:05

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH 1/3] blk-mq: add an API to estimate hardware queue node

On Tue, Mar 29, 2016 at 12:24:43AM -0700, Christoph Hellwig wrote:
> On Fri, Mar 25, 2016 at 02:36:30PM -0700, Shaohua Li wrote:
> > we allocate most data structure in device's node, but some data
> > structures are not for DMA and mostly used by specific cpus/node which
> > could diff from device's node. Allocating such hot data in device's
> > node doesn't make sense. Add an API to estimate hardware queue node.
> > This can be used before blk-mq actually establishes the mapping. This
> > API runs slow, but it only used in initialization time.
>
> I think this is the wrong way around. I've got some proprotype code
> that just leaves the cpu assignments to the drivers and picks it up
> in blk-mq. Give me a few days to post it..

This looks weird, shouldn't the cpu assignment be determined by block
core (blk-mq) because block core decides how to use the queue?

2016-03-29 16:51:34

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 1/3] blk-mq: add an API to estimate hardware queue node

On 03/29/2016 10:47 AM, Shaohua Li wrote:
> On Tue, Mar 29, 2016 at 12:24:43AM -0700, Christoph Hellwig wrote:
>> On Fri, Mar 25, 2016 at 02:36:30PM -0700, Shaohua Li wrote:
>>> we allocate most data structure in device's node, but some data
>>> structures are not for DMA and mostly used by specific cpus/node which
>>> could diff from device's node. Allocating such hot data in device's
>>> node doesn't make sense. Add an API to estimate hardware queue node.
>>> This can be used before blk-mq actually establishes the mapping. This
>>> API runs slow, but it only used in initialization time.
>>
>> I think this is the wrong way around. I've got some proprotype code
>> that just leaves the cpu assignments to the drivers and picks it up
>> in blk-mq. Give me a few days to post it..
>
> This looks weird, shouldn't the cpu assignment be determined by block
> core (blk-mq) because block core decides how to use the queue?

I agree, that belongs in the blk-mq proper, the driver should just
follow the rules outlined, not impose their own in this regard. It'll
also help with irq affinity mappings, once we get that in.

--
Jens Axboe

2016-03-29 17:44:39

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/3] blk-mq: add an API to estimate hardware queue node

On Tue, Mar 29, 2016 at 10:50:11AM -0600, Jens Axboe wrote:
> >This looks weird, shouldn't the cpu assignment be determined by block
> >core (blk-mq) because block core decides how to use the queue?
>
> I agree, that belongs in the blk-mq proper, the driver should just follow
> the rules outlined, not impose their own in this regard. It'll also help
> with irq affinity mappings, once we get that in.

It's not going to work that way unfortunately. Lots of driver simply
have no control over the underlying interrupts. Think of any RDMA
storage or other layer drivers - they get low level queues from a layer
they don't control and need a block queue for each of them.

My plan is to make the block layer follow what the networking layer
does - get the low level queues / MSI-X pairs and then use the
infrastructure in lib/cpu_rmap.c to figure out the number of queues
and queue placement for them.

The lower half of that is about to get rewritten by Thomas after
disccussion between him, me and a few others to provide drivers
nice APIs for spreading MSI-X vectors over CPUs or nodes.

2016-03-29 20:52:13

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 1/3] blk-mq: add an API to estimate hardware queue node

On 03/29/2016 11:44 AM, Christoph Hellwig wrote:
> On Tue, Mar 29, 2016 at 10:50:11AM -0600, Jens Axboe wrote:
>>> This looks weird, shouldn't the cpu assignment be determined by block
>>> core (blk-mq) because block core decides how to use the queue?
>>
>> I agree, that belongs in the blk-mq proper, the driver should just follow
>> the rules outlined, not impose their own in this regard. It'll also help
>> with irq affinity mappings, once we get that in.
>
> It's not going to work that way unfortunately. Lots of driver simply
> have no control over the underlying interrupts. Think of any RDMA
> storage or other layer drivers - they get low level queues from a layer
> they don't control and need a block queue for each of them.
>
> My plan is to make the block layer follow what the networking layer
> does - get the low level queues / MSI-X pairs and then use the
> infrastructure in lib/cpu_rmap.c to figure out the number of queues
> and queue placement for them.

That sounds fine, as long as we get it in general infrastructure. Note
that we need to be able to support sets of queues for blk-mq, once we
start splitting them up (for scheduling). As long as we can map sets of
queues to sets of CPUs, then that's not a concern.

For Shaohua's patches, not binding per-device data to the home node
makes a lot of sense, though.

--
Jens Axboe

2016-03-29 21:20:14

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/3] blk-mq: add an API to estimate hardware queue node

On Tue, Mar 29, 2016 at 02:51:37PM -0600, Jens Axboe wrote:
> For Shaohua's patches, not binding per-device data to the home node makes a
> lot of sense, though.

It does. But exposing the node estimation function just to work around
the fact that the way the cpu/node mapping is done is upside down
currently doesn't seem helpful. Let's try to revamp that first and the
rebase the rest of the series on top of that.

2016-03-29 21:27:56

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 1/3] blk-mq: add an API to estimate hardware queue node

On 03/29/2016 03:20 PM, Christoph Hellwig wrote:
> On Tue, Mar 29, 2016 at 02:51:37PM -0600, Jens Axboe wrote:
>> For Shaohua's patches, not binding per-device data to the home node makes a
>> lot of sense, though.
>
> It does. But exposing the node estimation function just to work around
> the fact that the way the cpu/node mapping is done is upside down
> currently doesn't seem helpful. Let's try to revamp that first and the
> rebase the rest of the series on top of that.

That's fine with me, it's not exactly urgent.

--
Jens Axboe