2015-05-20 12:25:16

by Parav Pandit

Subject: [PATCH] NVMe: nvme_queue made cache friendly.

The nvme_queue structure is reorganized so that the fields used in the
IO submission and completion paths fit within a single 64B cache line;
previously they spanned more than one cache line.

With the new layout the frequently accessed fields are grouped at the
start of the structure, and fields that are not used in the frequent IO
path are moved to the end.
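
As an illustration only (not part of this patch), a build-time check
along the following lines could confirm that the hot fields end within
the first cache line; the exact offsets depend on the kernel config
(e.g. CONFIG_DEBUG_SPINLOCK changes the size of spinlock_t), so treat
the helper below as a hypothetical sketch:

/*
 * Sketch only: verify that the fields used in the submission and
 * completion paths (everything up to and including cqe_seen in the
 * reordered layout) end within the first 64B cache line.  It would
 * need to be called from code that is actually compiled, e.g. the
 * module init path, for BUILD_BUG_ON to be evaluated.
 */
static inline void nvme_queue_layout_check(void)
{
	BUILD_BUG_ON(offsetof(struct nvme_queue, cqe_seen) +
		     sizeof(((struct nvme_queue *)NULL)->cqe_seen) > 64);
}

Running pahole -C nvme_queue on the object file also shows the
resulting offsets and padding for a given build.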

Signed-off-by: Parav Pandit <[email protected]>
---
drivers/block/nvme-core.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index b9ba36f..1585d7d 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -98,23 +98,23 @@ struct async_cmd_info {
struct nvme_queue {
struct device *q_dmadev;
struct nvme_dev *dev;
- char irqname[24]; /* nvme4294967295-65535\0 */
- spinlock_t q_lock;
struct nvme_command *sq_cmds;
+ struct blk_mq_hw_ctx *hctx;
volatile struct nvme_completion *cqes;
- dma_addr_t sq_dma_addr;
- dma_addr_t cq_dma_addr;
u32 __iomem *q_db;
u16 q_depth;
- s16 cq_vector;
u16 sq_head;
u16 sq_tail;
u16 cq_head;
u16 qid;
+ s16 cq_vector;
u8 cq_phase;
u8 cqe_seen;
+ spinlock_t q_lock;
struct async_cmd_info cmdinfo;
- struct blk_mq_hw_ctx *hctx;
+ char irqname[24]; /* nvme4294967295-65535\0 */
+ dma_addr_t sq_dma_addr;
+ dma_addr_t cq_dma_addr;
};

/*
--
1.8.3.1


2015-05-20 13:20:48

by Matthew Wilcox

Subject: Re: [PATCH] NVMe: nvme_queue made cache friendly.

On Wed, May 20, 2015 at 02:01:03PM -0400, Parav Pandit wrote:
> The nvme_queue structure is reorganized so that the fields used in the
> IO submission and completion paths fit within a single 64B cache line;
> previously they spanned more than one cache line.

Have you done any performance measurements on this? I find it hard to
believe that moving q_lock to the second 64B cache line results in a
performance improvement. Seems to me it would result in a performance
loss, since you have to grab the lock before operating on the queue,
and cache line prefetching tends to prefetch the _next_ line, not the
_previous_ line.

> @@ -98,23 +98,23 @@ struct async_cmd_info {
> struct nvme_queue {
> struct device *q_dmadev;
> struct nvme_dev *dev;
> - char irqname[24]; /* nvme4294967295-65535\0 */
> - spinlock_t q_lock;
> struct nvme_command *sq_cmds;
> + struct blk_mq_hw_ctx *hctx;
> volatile struct nvme_completion *cqes;
> - dma_addr_t sq_dma_addr;
> - dma_addr_t cq_dma_addr;
> u32 __iomem *q_db;
> u16 q_depth;
> - s16 cq_vector;
> u16 sq_head;
> u16 sq_tail;
> u16 cq_head;
> u16 qid;
> + s16 cq_vector;
> u8 cq_phase;
> u8 cqe_seen;
> + spinlock_t q_lock;
> struct async_cmd_info cmdinfo;
> - struct blk_mq_hw_ctx *hctx;
> + char irqname[24]; /* nvme4294967295-65535\0 */
> + dma_addr_t sq_dma_addr;
> + dma_addr_t cq_dma_addr;
> };
>
> /*
> --
> 1.8.3.1

2015-05-20 13:34:11

by Parav Pandit

Subject: Re: [PATCH] NVMe: nvme_queue made cache friendly.

On Wed, May 20, 2015 at 6:50 PM, Matthew Wilcox <[email protected]> wrote:
> On Wed, May 20, 2015 at 02:01:03PM -0400, Parav Pandit wrote:
>> The nvme_queue structure is reorganized so that the fields used in the
>> IO submission and completion paths fit within a single 64B cache line;
>> previously they spanned more than one cache line.
>
> Have you done any performance measurements on this?

I have not done performance measurements yet.

> I find it hard to
> believe that moving q_lock to the second 64B cache line results in a
> performance improvement.

The newly aligned structure, including q_lock, actually fits completely
within the first 64 bytes. Did I miss anything in the calculation?
q_lock appears to be taken at the end of the IO processing, by which
point sq_cmds, hctx, etc. have already been accessed in the same line.
Maybe I should move it to just after q_db instead of leaving it near
the end?
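
For reference, the rough arithmetic I used (assuming x86-64 with 8-byte
pointers and no spinlock debugging; pahole on the object file would give
the exact numbers for a given config):

  6 pointers (q_dmadev, dev, sq_cmds, hctx, cqes, q_db): 48 bytes
  6 x 16-bit (q_depth, sq_head, sq_tail, cq_head, qid, cq_vector): 12 -> 60
  2 x u8 (cq_phase, cqe_seen): 2 -> 62

Whether q_lock itself still ends inside the first line then depends on
the size and alignment of spinlock_t for that config; with lock
debugging enabled it certainly will not.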


> Seems to me it would result in a performance
> loss, since you have to grab the lock before operating on the queue,
> and cache line prefetching tends to prefetch the _next_ line, not the
> _previous_ line.
>
>> @@ -98,23 +98,23 @@ struct async_cmd_info {
>> struct nvme_queue {
>> struct device *q_dmadev;
>> struct nvme_dev *dev;
>> - char irqname[24]; /* nvme4294967295-65535\0 */
>> - spinlock_t q_lock;
>> struct nvme_command *sq_cmds;
>> + struct blk_mq_hw_ctx *hctx;
>> volatile struct nvme_completion *cqes;
>> - dma_addr_t sq_dma_addr;
>> - dma_addr_t cq_dma_addr;
>> u32 __iomem *q_db;
>> u16 q_depth;
>> - s16 cq_vector;
>> u16 sq_head;
>> u16 sq_tail;
>> u16 cq_head;
>> u16 qid;
>> + s16 cq_vector;
>> u8 cq_phase;
>> u8 cqe_seen;
>> + spinlock_t q_lock;
>> struct async_cmd_info cmdinfo;
>> - struct blk_mq_hw_ctx *hctx;
>> + char irqname[24]; /* nvme4294967295-65535\0 */
>> + dma_addr_t sq_dma_addr;
>> + dma_addr_t cq_dma_addr;
>> };
>>
>> /*
>> --
>> 1.8.3.1