2012-10-30 20:15:07

by Jeff Moyer

[permalink] [raw]
Subject: [patch 00/10] make I/O path allocations more numa-friendly

Hi,

This patch set makes memory allocations for data structures used in
the I/O path more numa friendly by allocating them from the same numa
node as the storage device. I've only converted a handfull of drivers
at this point. My testing showed that, for workloads where the I/O
processes were not tied to the numa node housing the device, a speedup
of around 6% was observed. When the I/O processes were tied to the
numa node of the device, there was no measurable difference in my test
setup. Given my relatively low-end setup[1], I wouldn't be surprised
if others could show a more significant performance advantage.

Comments would be greatly appreciated.

Cheers,
Jeff

[1] LSI Megaraid SAS controller with 1GB battery-backed cache,
fronting a RAID 6 10+2. The workload I used was tuned to not
have to hit disk. Fio file attached.

Jeff Moyer (10):
scsi: make __scsi_alloc_queue numa-aware
scsi: make scsi_alloc_sdev numa-aware
scsi: allocate scsi_cmnd-s from the device's local numa node
sd: use alloc_disk_node
vfs: pass data to alloc_inode super operation
ata: use scsi_host_alloc_node
megaraid_sas: use scsi_host_alloc_node
mpt2sas: use scsi_host_alloc_node
lpfc: use scsi_host_alloc_node
cciss: use blk_init_queue_node

drivers/ata/libata-scsi.c | 3 ++-
drivers/block/cciss.c | 3 ++-
drivers/scsi/lpfc/lpfc_init.c | 10 ++++++----
drivers/scsi/megaraid/megaraid_sas_base.c | 5 +++--
drivers/scsi/mpt2sas/mpt2sas_scsih.c | 4 ++--
drivers/scsi/scsi.c | 17 +++++++++++------
drivers/scsi/scsi_lib.c | 2 +-
drivers/scsi/scsi_scan.c | 4 ++--
drivers/scsi/sd.c | 2 +-
fs/afs/super.c | 5 +++--
fs/block_dev.c | 17 +++++++++++++++--
fs/btrfs/ctree.h | 2 +-
fs/btrfs/inode.c | 3 ++-
fs/cifs/cifsfs.c | 2 +-
fs/inode.c | 10 +++++-----
fs/ocfs2/dlmfs/dlmfs.c | 3 ++-
fs/ocfs2/super.c | 5 +++--
include/linux/fs.h | 2 +-
18 files changed, 63 insertions(+), 36 deletions(-)


2012-10-30 20:15:28

by Jeff Moyer

[permalink] [raw]
Subject: [patch 09/10] lpfc: use scsi_host_alloc_node


Signed-off-by: Jeff Moyer <[email protected]>
---
drivers/scsi/lpfc/lpfc_init.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
index 7dc4218..65956d3 100644
--- a/drivers/scsi/lpfc/lpfc_init.c
+++ b/drivers/scsi/lpfc/lpfc_init.c
@@ -3051,11 +3051,13 @@ lpfc_create_port(struct lpfc_hba *phba, int instance, struct device *dev)
int error = 0;

if (dev != &phba->pcidev->dev)
- shost = scsi_host_alloc(&lpfc_vport_template,
- sizeof(struct lpfc_vport));
+ shost = scsi_host_alloc_node(&lpfc_vport_template,
+ sizeof(struct lpfc_vport),
+ dev_to_node(&phba->pcidev->dev));
else
- shost = scsi_host_alloc(&lpfc_template,
- sizeof(struct lpfc_vport));
+ shost = scsi_host_alloc_node(&lpfc_template,
+ sizeof(struct lpfc_vport),
+ dev_to_node(&phba->pcidev->dev));
if (!shost)
goto out;

--
1.7.1

2012-10-30 20:15:41

by Jeff Moyer

[permalink] [raw]
Subject: [patch 05/10] vfs: pass data to alloc_inode super operation

This patch passes a data pointer along to the alloc_inode
super_operations function. The value will initially be used by
bdev_alloc_inode to allocate the bdev_inode on the same numa
node as the device to which it is tied.

Signed-off-by: Jeff Moyer <[email protected]>
---
fs/afs/super.c | 5 +++--
fs/block_dev.c | 17 +++++++++++++++--
fs/btrfs/ctree.h | 2 +-
fs/btrfs/inode.c | 3 ++-
fs/cifs/cifsfs.c | 2 +-
fs/inode.c | 10 +++++-----
fs/ocfs2/dlmfs/dlmfs.c | 3 ++-
fs/ocfs2/super.c | 5 +++--
include/linux/fs.h | 2 +-
9 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/fs/afs/super.c b/fs/afs/super.c
index 4316500..e242661 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -32,7 +32,7 @@ static void afs_i_init_once(void *foo);
static struct dentry *afs_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data);
static void afs_kill_super(struct super_block *sb);
-static struct inode *afs_alloc_inode(struct super_block *sb);
+static struct inode *afs_alloc_inode(struct super_block *sb, void *data);
static void afs_destroy_inode(struct inode *inode);
static int afs_statfs(struct dentry *dentry, struct kstatfs *buf);

@@ -470,7 +470,8 @@ static void afs_i_init_once(void *_vnode)
/*
* allocate an AFS inode struct from our slab cache
*/
-static struct inode *afs_alloc_inode(struct super_block *sb)
+static struct inode *afs_alloc_inode(struct super_block *sb,
+ void * __attribute__((unused))data)
{
struct afs_vnode *vnode;

diff --git a/fs/block_dev.c b/fs/block_dev.c
index b3c1d3d..c366fb2 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -454,9 +454,22 @@ EXPORT_SYMBOL(blkdev_fsync);
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(bdev_lock);
static struct kmem_cache * bdev_cachep __read_mostly;

-static struct inode *bdev_alloc_inode(struct super_block *sb)
+static struct inode *bdev_alloc_inode(struct super_block *sb, void *data)
{
- struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL);
+ struct bdev_inode *ei;
+ int partno;
+ int node;
+
+ if (data) {
+ struct gendisk *disk;
+ dev_t dev = *(dev_t *)data;
+ disk = get_gendisk(dev, &partno);
+ node = disk->node_id;
+ put_disk(disk);
+ } else
+ node = -1;
+
+ ei = kmem_cache_alloc_node(bdev_cachep, GFP_KERNEL, node);
if (!ei)
return NULL;

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 926c9ff..0736131 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3324,7 +3324,7 @@ int btrfs_readpage(struct file *file, struct page *page);
void btrfs_evict_inode(struct inode *inode);
int btrfs_write_inode(struct inode *inode, struct writeback_control *wbc);
int btrfs_dirty_inode(struct inode *inode);
-struct inode *btrfs_alloc_inode(struct super_block *sb);
+struct inode *btrfs_alloc_inode(struct super_block *sb, void *data);
void btrfs_destroy_inode(struct inode *inode);
int btrfs_drop_inode(struct inode *inode);
int btrfs_init_cachep(void);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 85a1e50..d4b42d0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7057,7 +7057,8 @@ int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
return err;
}

-struct inode *btrfs_alloc_inode(struct super_block *sb)
+struct inode *btrfs_alloc_inode(struct super_block *sb,
+ void * __attribute__((unused))data)
{
struct btrfs_inode *ei;
struct inode *inode;
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index e7931cc..2e67ed7 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -215,7 +215,7 @@ mempool_t *cifs_req_poolp;
mempool_t *cifs_mid_poolp;

static struct inode *
-cifs_alloc_inode(struct super_block *sb)
+cifs_alloc_inode(struct super_block *sb, void * __attribute__((unused)) data)
{
struct cifsInodeInfo *cifs_inode;
cifs_inode = kmem_cache_alloc(cifs_inode_cachep, GFP_KERNEL);
diff --git a/fs/inode.c b/fs/inode.c
index b03c719..bb7e273 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -199,12 +199,12 @@ out:
}
EXPORT_SYMBOL(inode_init_always);

-static struct inode *alloc_inode(struct super_block *sb)
+static struct inode *alloc_inode(struct super_block *sb, void *data)
{
struct inode *inode;

if (sb->s_op->alloc_inode)
- inode = sb->s_op->alloc_inode(sb);
+ inode = sb->s_op->alloc_inode(sb, data);
else
inode = kmem_cache_alloc(inode_cachep, GFP_KERNEL);

@@ -892,7 +892,7 @@ EXPORT_SYMBOL(get_next_ino);
*/
struct inode *new_inode_pseudo(struct super_block *sb)
{
- struct inode *inode = alloc_inode(sb);
+ struct inode *inode = alloc_inode(sb, NULL);

if (inode) {
spin_lock(&inode->i_lock);
@@ -1004,7 +1004,7 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
return inode;
}

- inode = alloc_inode(sb);
+ inode = alloc_inode(sb, data);
if (inode) {
struct inode *old;

@@ -1073,7 +1073,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
return inode;
}

- inode = alloc_inode(sb);
+ inode = alloc_inode(sb, NULL);
if (inode) {
struct inode *old;

diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index 16b712d..a322984 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -340,7 +340,8 @@ static void dlmfs_init_once(void *foo)
inode_init_once(&ip->ip_vfs_inode);
}

-static struct inode *dlmfs_alloc_inode(struct super_block *sb)
+static struct inode *dlmfs_alloc_inode(struct super_block *sb,
+ void *__attribute__((unused)) data)
{
struct dlmfs_inode_private *ip;

diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 0e91ec2..ada4a81 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -137,7 +137,7 @@ static int ocfs2_get_sector(struct super_block *sb,
struct buffer_head **bh,
int block,
int sect_size);
-static struct inode *ocfs2_alloc_inode(struct super_block *sb);
+static struct inode *ocfs2_alloc_inode(struct super_block *sb, void *data);
static void ocfs2_destroy_inode(struct inode *inode);
static int ocfs2_susp_quotas(struct ocfs2_super *osb, int unsuspend);
static int ocfs2_enable_quotas(struct ocfs2_super *osb);
@@ -554,7 +554,8 @@ static void ocfs2_release_system_inodes(struct ocfs2_super *osb)
}

/* We're allocating fs objects, use GFP_NOFS */
-static struct inode *ocfs2_alloc_inode(struct super_block *sb)
+static struct inode *ocfs2_alloc_inode(struct super_block *sb,
+ void *__attribute__((unused))data)
{
struct ocfs2_inode_info *oi;

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b33cfc9..240b298 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1595,7 +1595,7 @@ extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
unsigned long, loff_t *);

struct super_operations {
- struct inode *(*alloc_inode)(struct super_block *sb);
+ struct inode *(*alloc_inode)(struct super_block *sb, void *data);
void (*destroy_inode)(struct inode *);

void (*dirty_inode) (struct inode *, int flags);
--
1.7.1

2012-10-30 20:15:27

by Jeff Moyer

[permalink] [raw]
Subject: [patch 03/10] scsi: allocate scsi_cmnd-s from the device's local numa node


Signed-off-by: Jeff Moyer <[email protected]>
---
drivers/scsi/scsi.c | 17 +++++++++++------
1 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 2936b44..4db6973 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -173,16 +173,20 @@ static DEFINE_MUTEX(host_cmd_pool_mutex);
* NULL on failure
*/
static struct scsi_cmnd *
-scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask)
+scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask,
+ int node)
{
struct scsi_cmnd *cmd;

- cmd = kmem_cache_zalloc(pool->cmd_slab, gfp_mask | pool->gfp_mask);
+ cmd = kmem_cache_alloc_node(pool->cmd_slab,
+ gfp_mask | pool->gfp_mask | __GFP_ZERO,
+ node);
if (!cmd)
return NULL;

- cmd->sense_buffer = kmem_cache_alloc(pool->sense_slab,
- gfp_mask | pool->gfp_mask);
+ cmd->sense_buffer = kmem_cache_alloc_node(pool->sense_slab,
+ gfp_mask | pool->gfp_mask | __GFP_ZERO,
+ node);
if (!cmd->sense_buffer) {
kmem_cache_free(pool->cmd_slab, cmd);
return NULL;
@@ -223,7 +227,8 @@ scsi_host_alloc_command(struct Scsi_Host *shost, gfp_t gfp_mask)
{
struct scsi_cmnd *cmd;

- cmd = scsi_pool_alloc_command(shost->cmd_pool, gfp_mask);
+ cmd = scsi_pool_alloc_command(shost->cmd_pool, gfp_mask,
+ shost->numa_node);
if (!cmd)
return NULL;

@@ -435,7 +440,7 @@ struct scsi_cmnd *scsi_allocate_command(gfp_t gfp_mask)
if (!pool)
return NULL;

- return scsi_pool_alloc_command(pool, gfp_mask);
+ return scsi_pool_alloc_command(pool, gfp_mask, NUMA_NO_NODE);
}
EXPORT_SYMBOL(scsi_allocate_command);

--
1.7.1

2012-10-30 20:15:25

by Jeff Moyer

[permalink] [raw]
Subject: [patch 04/10] sd: use alloc_disk_node


Signed-off-by: Jeff Moyer <[email protected]>
---
drivers/scsi/sd.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 12f6fdf..8deb915 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2714,7 +2714,7 @@ static int sd_probe(struct device *dev)
if (!sdkp)
goto out;

- gd = alloc_disk(SD_MINORS);
+ gd = alloc_disk_node(SD_MINORS, dev_to_node(dev));
if (!gd)
goto out_free;

--
1.7.1

2012-10-30 20:15:22

by Jeff Moyer

[permalink] [raw]
Subject: [patch 01/10] scsi: make __scsi_alloc_queue numa-aware

Pass the numa node id set in the Scsi_Host on to blk_init_queue_node
in order to keep all allocations local to the numa node the device is
closest to.

Signed-off-by: Jeff Moyer <[email protected]>
---
drivers/scsi/scsi_lib.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index da36a3a..8662a09 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1664,7 +1664,7 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
struct request_queue *q;
struct device *dev = shost->dma_dev;

- q = blk_init_queue(request_fn, NULL);
+ q = blk_init_queue_node(request_fn, NULL, shost->numa_node);
if (!q)
return NULL;

--
1.7.1

2012-10-30 20:17:25

by Jeff Moyer

[permalink] [raw]
Subject: [patch 08/10] mpt2sas: use scsi_host_alloc_node


Signed-off-by: Jeff Moyer <[email protected]>
---
drivers/scsi/mpt2sas/mpt2sas_scsih.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/mpt2sas/mpt2sas_scsih.c b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
index af4e6c4..a4d6b36 100644
--- a/drivers/scsi/mpt2sas/mpt2sas_scsih.c
+++ b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
@@ -8011,8 +8011,8 @@ _scsih_probe(struct pci_dev *pdev, const struct pci_device_id *id)
struct MPT2SAS_ADAPTER *ioc;
struct Scsi_Host *shost;

- shost = scsi_host_alloc(&scsih_driver_template,
- sizeof(struct MPT2SAS_ADAPTER));
+ shost = scsi_host_alloc_node(&scsih_driver_template,
+ sizeof(struct MPT2SAS_ADAPTER), dev_to_node(&pdev->dev));
if (!shost)
return -ENODEV;

--
1.7.1

2012-10-30 20:17:22

by Jeff Moyer

[permalink] [raw]
Subject: [patch 07/10] megaraid_sas: use scsi_host_alloc_node


Signed-off-by: Jeff Moyer <[email protected]>
---
drivers/scsi/megaraid/megaraid_sas_base.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index d2c5366..707a6cd 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -4020,8 +4020,9 @@ megasas_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
if (megasas_set_dma_mask(pdev))
goto fail_set_dma_mask;

- host = scsi_host_alloc(&megasas_template,
- sizeof(struct megasas_instance));
+ host = scsi_host_alloc_node(&megasas_template,
+ sizeof(struct megasas_instance),
+ dev_to_node(&pdev->dev));

if (!host) {
printk(KERN_DEBUG "megasas: scsi_host_alloc failed\n");
--
1.7.1

2012-10-30 20:15:20

by Jeff Moyer

[permalink] [raw]
Subject: [patch 00/10] *** SUBJECT HERE ***

*** BLURB HERE ***

Jeff Moyer (10):
scsi: make __scsi_alloc_queue numa-aware
scsi: make scsi_alloc_sdev numa-aware
scsi: allocate scsi_cmnd-s from the device's local numa node
sd: use alloc_disk_node
vfs: pass data to alloc_inode super operation
ata: use scsi_host_alloc_node
megaraid_sas: use scsi_host_alloc_node
mpt2sas: use scsi_host_alloc_node
lpfc: use scsi_host_alloc_node
cciss: use blk_init_queue_node

drivers/ata/libata-scsi.c | 3 ++-
drivers/block/cciss.c | 3 ++-
drivers/scsi/lpfc/lpfc_init.c | 10 ++++++----
drivers/scsi/megaraid/megaraid_sas_base.c | 5 +++--
drivers/scsi/mpt2sas/mpt2sas_scsih.c | 4 ++--
drivers/scsi/scsi.c | 17 +++++++++++------
drivers/scsi/scsi_lib.c | 2 +-
drivers/scsi/scsi_scan.c | 4 ++--
drivers/scsi/sd.c | 2 +-
fs/afs/super.c | 5 +++--
fs/block_dev.c | 17 +++++++++++++++--
fs/btrfs/ctree.h | 2 +-
fs/btrfs/inode.c | 3 ++-
fs/cifs/cifsfs.c | 2 +-
fs/inode.c | 10 +++++-----
fs/ocfs2/dlmfs/dlmfs.c | 3 ++-
fs/ocfs2/super.c | 5 +++--
include/linux/fs.h | 2 +-
18 files changed, 63 insertions(+), 36 deletions(-)

2012-10-30 20:17:58

by Jeff Moyer

[permalink] [raw]
Subject: [patch 02/10] scsi: make scsi_alloc_sdev numa-aware

Use the numa node id set in the Scsi_Host to allocate the sdev structure
on the device-local numa node.

Signed-off-by: Jeff Moyer <[email protected]>
---
drivers/scsi/scsi_scan.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 3e58b22..f10a308 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -232,8 +232,8 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
extern void scsi_evt_thread(struct work_struct *work);
extern void scsi_requeue_run_queue(struct work_struct *work);

- sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
- GFP_ATOMIC);
+ sdev = kzalloc_node(sizeof(*sdev) + shost->transportt->device_size,
+ GFP_ATOMIC, shost->numa_node);
if (!sdev)
goto out;

--
1.7.1

2012-10-30 20:18:30

by Jeff Moyer

[permalink] [raw]
Subject: [patch 10/10] cciss: use blk_init_queue_node


Signed-off-by: Jeff Moyer <[email protected]>
---
drivers/block/cciss.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index b0f553b..5fe5546 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -1930,7 +1930,8 @@ static void cciss_get_serial_no(ctlr_info_t *h, int logvol,
static int cciss_add_disk(ctlr_info_t *h, struct gendisk *disk,
int drv_index)
{
- disk->queue = blk_init_queue(do_cciss_request, &h->lock);
+ disk->queue = blk_init_queue_node(do_cciss_request, &h->lock,
+ dev_to_node(&h->dev));
if (!disk->queue)
goto init_queue_failure;
sprintf(disk->disk_name, "cciss/c%dd%d", h->ctlr, drv_index);
--
1.7.1

2012-10-30 20:18:54

by Jeff Moyer

[permalink] [raw]
Subject: [patch 06/10] ata: use scsi_host_alloc_node


Signed-off-by: Jeff Moyer <[email protected]>
---
drivers/ata/libata-scsi.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index e3bda07..9d5dd09 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3586,7 +3586,8 @@ int ata_scsi_add_hosts(struct ata_host *host, struct scsi_host_template *sht)
struct Scsi_Host *shost;

rc = -ENOMEM;
- shost = scsi_host_alloc(sht, sizeof(struct ata_port *));
+ shost = scsi_host_alloc_node(sht, sizeof(struct ata_port *),
+ dev_to_node(host->dev));
if (!shost)
goto err_alloc;

--
1.7.1

2012-10-30 20:15:18

by Jeff Moyer

[permalink] [raw]
Subject: [patch 00/10] make I/O path allocations more numa-friendly

Hi,

This patch set makes memory allocations for data structures used
in the I/O path more numa friendly by allocating from the same
numa node on which the device resides. I've only converted a
handfull of drivers at this point. My testing showed that, for
workloads where the I/O processes were not tied to the numa node
housing the device, a speedup of around 6% was observed. When the
I/O processes were tied to the numa node of the device, there was
no measurable different in my test setup. Given my relatively
low-end setup[1], I wouldn't be surprised if others could show a more
significant performance gap.

Comments would be greatly appreciated.

Cheers,
Jeff

[1] LSI Megaraid SAS controller with 1GB battery-backed cache,
fronting a RAID 6 10+2. The workload I used was tuned to not
have to hit disk. Fio file attached.

Jeff Moyer (10):
scsi: make __scsi_alloc_queue numa-aware
scsi: make scsi_alloc_sdev numa-aware
scsi: allocate scsi_cmnd-s from the device's local numa node
sd: use alloc_disk_node
vfs: pass data to alloc_inode super operation
ata: use scsi_host_alloc_node
megaraid_sas: use scsi_host_alloc_node
mpt2sas: use scsi_host_alloc_node
lpfc: use scsi_host_alloc_node
cciss: use blk_init_queue_node

drivers/ata/libata-scsi.c | 3 ++-
drivers/block/cciss.c | 3 ++-
drivers/scsi/lpfc/lpfc_init.c | 10 ++++++----
drivers/scsi/megaraid/megaraid_sas_base.c | 5 +++--
drivers/scsi/mpt2sas/mpt2sas_scsih.c | 4 ++--
drivers/scsi/scsi.c | 17 +++++++++++------
drivers/scsi/scsi_lib.c | 2 +-
drivers/scsi/scsi_scan.c | 4 ++--
drivers/scsi/sd.c | 2 +-
fs/afs/super.c | 5 +++--
fs/block_dev.c | 17 +++++++++++++++--
fs/btrfs/ctree.h | 2 +-
fs/btrfs/inode.c | 3 ++-
fs/cifs/cifsfs.c | 2 +-
fs/inode.c | 10 +++++-----
fs/ocfs2/dlmfs/dlmfs.c | 3 ++-
fs/ocfs2/super.c | 5 +++--
include/linux/fs.h | 2 +-
18 files changed, 63 insertions(+), 36 deletions(-)

2012-10-30 20:51:49

by Al Viro

[permalink] [raw]
Subject: Re: [patch 05/10] vfs: pass data to alloc_inode super operation

On Tue, Oct 30, 2012 at 04:14:39PM -0400, Jeff Moyer wrote:
> This patch passes a data pointer along to the alloc_inode
> super_operations function. The value will initially be used by
> bdev_alloc_inode to allocate the bdev_inode on the same numa
> node as the device to which it is tied.

Yecchhh...

> -static struct inode *bdev_alloc_inode(struct super_block *sb)
> +static struct inode *bdev_alloc_inode(struct super_block *sb, void *data)
> {
> - struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL);
> + struct bdev_inode *ei;
> + int partno;
> + int node;
> +
> + if (data) {
> + struct gendisk *disk;
> + dev_t dev = *(dev_t *)data;
> + disk = get_gendisk(dev, &partno);
> + node = disk->node_id;
> + put_disk(disk);

Oh, _lovely_. Such a stunningly elegant idea. And why not pass the
sodding node_id directly, if I may ask?

[reads further]
[finds completely undocumented reason for that]

Take that crap out and don't bring it back. And consider this kludge NAKed
once and forever. It's unspeakably ugly, even if it didn't break just about
every fs out there (you do realize that almost all of them have ->alloc_inode(),
don't you? Use grep and see).

2012-10-30 21:17:07

by Al Viro

[permalink] [raw]
Subject: Re: [patch 05/10] vfs: pass data to alloc_inode super operation

On Tue, Oct 30, 2012 at 08:51:42PM +0000, Al Viro wrote:
> On Tue, Oct 30, 2012 at 04:14:39PM -0400, Jeff Moyer wrote:
> > This patch passes a data pointer along to the alloc_inode
> > super_operations function. The value will initially be used by
> > bdev_alloc_inode to allocate the bdev_inode on the same numa
> > node as the device to which it is tied.
>
> Yecchhh...
>
> > -static struct inode *bdev_alloc_inode(struct super_block *sb)
> > +static struct inode *bdev_alloc_inode(struct super_block *sb, void *data)
> > {
> > - struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL);
> > + struct bdev_inode *ei;
> > + int partno;
> > + int node;
> > +
> > + if (data) {
> > + struct gendisk *disk;
> > + dev_t dev = *(dev_t *)data;
> > + disk = get_gendisk(dev, &partno);
> > + node = disk->node_id;

BTW, speaking of other reasons this is FUBAR - mknod a block device node
for something that doesn't exist and try to open it. struct block_device
lifetime is not limited to that of struct gendisk; it can be allocated
before gendisk is there. Moreover, with static /dev and demand-loaded
modules this will be the normal scenario.

IOW, this is completely misguided.

2012-10-31 13:23:18

by Jeff Moyer

[permalink] [raw]
Subject: Re: [patch 05/10] vfs: pass data to alloc_inode super operation

Al Viro <[email protected]> writes:

> On Tue, Oct 30, 2012 at 08:51:42PM +0000, Al Viro wrote:
>> On Tue, Oct 30, 2012 at 04:14:39PM -0400, Jeff Moyer wrote:
>> > This patch passes a data pointer along to the alloc_inode
>> > super_operations function. The value will initially be used by
>> > bdev_alloc_inode to allocate the bdev_inode on the same numa
>> > node as the device to which it is tied.
>>
>> Yecchhh...
>>
>> > -static struct inode *bdev_alloc_inode(struct super_block *sb)
>> > +static struct inode *bdev_alloc_inode(struct super_block *sb, void *data)
>> > {
>> > - struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL);
>> > + struct bdev_inode *ei;
>> > + int partno;
>> > + int node;
>> > +
>> > + if (data) {
>> > + struct gendisk *disk;
>> > + dev_t dev = *(dev_t *)data;
>> > + disk = get_gendisk(dev, &partno);
>> > + node = disk->node_id;
>
> Oh, _lovely_. Such a stunningly elegant idea. And why not pass the
> sodding node_id directly, if I may ask?

struct block_device *bdget(dev_t dev)
{
...
inode = iget5_locked(blockdev_superblock, hash(dev),
bdev_test, bdev_set, &dev);

The device is used by bdev_test and bdev_set.

> Take that crap out and don't bring it back. And consider this kludge NAKed
> once and forever. It's unspeakably ugly, even if it didn't break just about
> every fs out there (you do realize that almost all of them have ->alloc_inode(),
> pdon't you? Use grep and see).

Oof. I was using cscope and it didn't show all callers.

> BTW, speaking of other reasons this is FUBAR - mknod a block device node
> for something that doesn't exist and try to open it. struct block_device
> lifetime is not limited to that of struct gendisk; it can be allocated
> before gendisk is there. Moreover, with static /dev and demand-loaded
> modules this will be the normal scenario.
>
> IOW, this is completely misguided.

OK, thanks for the review, Al, it's greatly appreciated. Do you have a
suggestion on how I might achieve this in a way that would be
acceptable? I'm not too fond of the idea of reallocating the inode when
the device does appear. I'll give it some more thought.

Again, thanks for taking a look.

Cheers,
Jeff

2012-10-31 18:56:49

by James Smart

[permalink] [raw]
Subject: Re: [patch 09/10] lpfc: use scsi_host_alloc_node

Thanks Jeff.

-- james s


Acked-By: James Smart <[email protected]>


On 10/30/2012 4:14 PM, Jeff Moyer wrote:
> Signed-off-by: Jeff Moyer <[email protected]>
> ---
> drivers/scsi/lpfc/lpfc_init.c | 10 ++++++----
> 1 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
> index 7dc4218..65956d3 100644
> --- a/drivers/scsi/lpfc/lpfc_init.c
> +++ b/drivers/scsi/lpfc/lpfc_init.c
> @@ -3051,11 +3051,13 @@ lpfc_create_port(struct lpfc_hba *phba, int instance, struct device *dev)
> int error = 0;
>
> if (dev != &phba->pcidev->dev)
> - shost = scsi_host_alloc(&lpfc_vport_template,
> - sizeof(struct lpfc_vport));
> + shost = scsi_host_alloc_node(&lpfc_vport_template,
> + sizeof(struct lpfc_vport),
> + dev_to_node(&phba->pcidev->dev));
> else
> - shost = scsi_host_alloc(&lpfc_template,
> - sizeof(struct lpfc_vport));
> + shost = scsi_host_alloc_node(&lpfc_template,
> + sizeof(struct lpfc_vport),
> + dev_to_node(&phba->pcidev->dev));
> if (!shost)
> goto out;
>

2012-11-02 11:31:42

by Bart Van Assche

[permalink] [raw]
Subject: Re: [patch 01/10] scsi: make __scsi_alloc_queue numa-aware

On 10/30/12 21:14, Jeff Moyer wrote:
> Pass the numa node id set in the Scsi_Host on to blk_init_queue_node
> in order to keep all allocations local to the numa node the device is
> closest to.
>
> Signed-off-by: Jeff Moyer <[email protected]>
> ---
> drivers/scsi/scsi_lib.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index da36a3a..8662a09 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1664,7 +1664,7 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
> struct request_queue *q;
> struct device *dev = shost->dma_dev;
>
> - q = blk_init_queue(request_fn, NULL);
> + q = blk_init_queue_node(request_fn, NULL, shost->numa_node);
> if (!q)
> return NULL;

Hello Jeff,

I haven't seen the patch that introduces numa_node in struct Scsi_Host
nor the cover letter of this patch series ? Have these been posted on
the linux-scsi mailing list ?

Bart.


2012-11-02 14:19:59

by Jeff Moyer

[permalink] [raw]
Subject: Re: [patch 01/10] scsi: make __scsi_alloc_queue numa-aware

Bart Van Assche <[email protected]> writes:

> On 10/30/12 21:14, Jeff Moyer wrote:
>> Pass the numa node id set in the Scsi_Host on to blk_init_queue_node
>> in order to keep all allocations local to the numa node the device is
>> closest to.
>>
>> Signed-off-by: Jeff Moyer <[email protected]>
>> ---
>> drivers/scsi/scsi_lib.c | 2 +-
>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>> index da36a3a..8662a09 100644
>> --- a/drivers/scsi/scsi_lib.c
>> +++ b/drivers/scsi/scsi_lib.c
>> @@ -1664,7 +1664,7 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
>> struct request_queue *q;
>> struct device *dev = shost->dma_dev;
>>
>> - q = blk_init_queue(request_fn, NULL);
>> + q = blk_init_queue_node(request_fn, NULL, shost->numa_node);
>> if (!q)
>> return NULL;
>
> Hello Jeff,
>
> I haven't seen the patch that introduces numa_node in struct Scsi_Host
> nor the cover letter of this patch series ? Have these been posted on
> the linux-scsi mailing list ?

Hi, Bart,

Wow, looks like I left out the first patch! The cover letter I think
only went to lkml. I have to do a repost, so I'll be sure to send the
cover to linux-scsi as well, and CC you (and credit you for the idea,
which I totally forgot to do). I'll send a repost out today.

Cheers,
Jeff