2008-03-12 11:58:40

by Jens Axboe

Subject: [PATCH 0/7] IO CPU affinity testing series

Hi,

Here's a new round of patches to play with IO CPU affinity. As always,
it can also be found in the block git repo; the branch name is
'io-cpu-affinity'.

The major change since the last post is the abandonment of the kthread
approach. It was definitely slower than my 'add IPI to signal remote
block softirq' hack, so I decided to base this on the scalable
smp_call_function_single() that Nick posted. I tweaked it a bit to
make it more suitable for my use and also faster.

As for functionality, the only change is that I added a bio hint
that the submitter can use to ask for completion on the same CPU
that submitted the IO. Set BIO_CPU_AFFINE on the bio for that to occur.
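
For illustration only, a submitter could mark a bio roughly like this
(minimal sketch; submit_bio_cpu_affine() is a hypothetical helper, not
part of the series):

    #include <linux/bio.h>

    /* ask for completion on the CPU that submits this bio */
    static void submit_bio_cpu_affine(int rw, struct bio *bio)
    {
    	bio->bi_flags |= 1UL << BIO_CPU_AFFINE;
    	submit_bio(rw, bio);
    }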

Otherwise the modes are the same as last time:

- You can set a specific cpumask for queuing IO, and the block layer
will move submitters to one of those CPUs.
- You can set a specific cpumask for completion of IO, in which case
the block layer will move the completion to one of those CPUs.
- You can set rq_affinity mode, in which case IOs will always be
completed on the CPU that submitted them.

Look in /sys/block/<dev>/queue/ for the three sysfs variables that
modify this behaviour.

I'd be interested in getting some testing done on this, to see if
it really helps the larger end of the scale. Dave, I know you have
a lot of experience in this area, and I would appreciate your input
and/or testing. I'm not sure if any of the above modes will allow
you to do what you need for e.g. XFS - if you want all metadata IO
completed on one (or a set of) CPU(s), then I can add a mode that
will allow you to play with that. Or if something else, give me
some input and we can take it from there!

Patches are against latest -git.

--
Jens Axboe


2008-03-12 11:57:22

by Jens Axboe

Subject: [PATCH 5/7] Add interface for queuing work on a specific CPU
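
This adds queue_work_on_cpu(), so callers can queue work on a specific
CPU's workqueue thread rather than the local one. As a rough usage
sketch (hypothetical caller, names invented for illustration):

    #include <linux/workqueue.h>

    static void my_work_fn(struct work_struct *work)
    {
    	/* runs in the context of the chosen CPU's workqueue thread */
    }

    static DECLARE_WORK(my_work, my_work_fn);

    /* queue the work on CPU 0 rather than on the submitting CPU */
    static void kick_work_on_cpu0(struct workqueue_struct *wq)
    {
    	queue_work_on_cpu(wq, &my_work, 0);
    }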

Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/workqueue.h | 2 ++
kernel/workqueue.c | 33 ++++++++++++++++++++++++++-------
2 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 542526c..fcc400b 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -181,6 +181,8 @@ extern void destroy_workqueue(struct workqueue_struct *wq);
extern int queue_work(struct workqueue_struct *wq, struct work_struct *work);
extern int queue_delayed_work(struct workqueue_struct *wq,
struct delayed_work *work, unsigned long delay);
+extern int queue_work_on_cpu(struct workqueue_struct *wq,
+ struct work_struct *work, int cpu);
extern int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
struct delayed_work *work, unsigned long delay);

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ff06611..6bbd7b0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -152,25 +152,44 @@ static void __queue_work(struct cpu_workqueue_struct *cwq,
}

/**
- * queue_work - queue work on a workqueue
+ * queue_work_on_cpu - queue work on a workqueue on a specific CPU
* @wq: workqueue to use
* @work: work to queue
+ * @cpu: cpu to queue the work on
*
* Returns 0 if @work was already on a queue, non-zero otherwise.
- *
- * We queue the work to the CPU it was submitted, but there is no
- * guarantee that it will be processed by that CPU.
*/
-int queue_work(struct workqueue_struct *wq, struct work_struct *work)
+int queue_work_on_cpu(struct workqueue_struct *wq, struct work_struct *work,
+ int cpu)
{
int ret = 0;

if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
BUG_ON(!list_empty(&work->entry));
- __queue_work(wq_per_cpu(wq, get_cpu()), work);
- put_cpu();
+ __queue_work(wq_per_cpu(wq, cpu), work);
ret = 1;
}
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(queue_work_on_cpu);
+
+/**
+ * queue_work - queue work on a workqueue
+ * @wq: workqueue to use
+ * @work: work to queue
+ *
+ * Returns 0 if @work was already on a queue, non-zero otherwise.
+ *
+ * We queue the work to the CPU it was submitted, but there is no
+ * guarantee that it will be processed by that CPU.
+ */
+int queue_work(struct workqueue_struct *wq, struct work_struct *work)
+{
+ int ret;
+
+ ret = queue_work_on_cpu(wq, work, get_cpu());
+ put_cpu();
return ret;
}
EXPORT_SYMBOL_GPL(queue_work);
--
1.5.4.GIT

2008-03-12 11:57:33

by Jens Axboe

Subject: [PATCH 4/7] block: split softirq handling into blk-softirq.c
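
This just moves the softirq completion bits out of blk-core.c; the
driver-visible behaviour is unchanged. For reference, the driver-side
pattern looks roughly like this (hypothetical driver, function names
invented for illustration):

    #include <linux/blkdev.h>

    /* runs in BLOCK_SOFTIRQ context once blk_complete_request() fires */
    static void mydrv_softirq_done(struct request *rq)
    {
    	/* per-request completion processing goes here */
    }

    static void mydrv_init_queue(struct request_queue *q)
    {
    	blk_queue_softirq_done(q, mydrv_softirq_done);
    }

    /* called from the hardware interrupt handler */
    static void mydrv_irq_complete(struct request *rq)
    {
    	blk_complete_request(rq);
    }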

Signed-off-by: Jens Axboe <[email protected]>
---
block/Makefile | 3 +-
block/blk-core.c | 88 -------------------------------------------
block/blk-softirq.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 105 insertions(+), 89 deletions(-)
create mode 100644 block/blk-softirq.c

diff --git a/block/Makefile b/block/Makefile
index 5a43c7d..b064190 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -4,7 +4,8 @@

obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-barrier.o blk-settings.o blk-ioc.o blk-map.o \
- blk-exec.o blk-merge.o ioctl.o genhd.o scsi_ioctl.o
+ blk-exec.o blk-merge.o blk-softirq.o ioctl.o genhd.o \
+ scsi_ioctl.o

obj-$(CONFIG_BLK_DEV_BSG) += bsg.o
obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
diff --git a/block/blk-core.c b/block/blk-core.c
index 2a438a9..46819c1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,8 +26,6 @@
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/task_io_accounting_ops.h>
-#include <linux/interrupt.h>
-#include <linux/cpu.h>
#include <linux/blktrace_api.h>
#include <linux/fault-inject.h>

@@ -50,8 +48,6 @@ struct kmem_cache *blk_requestq_cachep;
*/
static struct workqueue_struct *kblockd_workqueue;

-static DEFINE_PER_CPU(struct list_head, blk_cpu_done);
-
static void drive_stat_acct(struct request *rq, int new_io)
{
int rw = rq_data_dir(rq);
@@ -1632,82 +1628,6 @@ static int __end_that_request_first(struct request *req, int error,
}

/*
- * splice the completion data to a local structure and hand off to
- * process_completion_queue() to complete the requests
- */
-static void blk_done_softirq(struct softirq_action *h)
-{
- struct list_head *cpu_list, local_list;
-
- local_irq_disable();
- cpu_list = &__get_cpu_var(blk_cpu_done);
- list_replace_init(cpu_list, &local_list);
- local_irq_enable();
-
- while (!list_empty(&local_list)) {
- struct request *rq;
-
- rq = list_entry(local_list.next, struct request, donelist);
- list_del_init(&rq->donelist);
- rq->q->softirq_done_fn(rq);
- }
-}
-
-static int __cpuinit blk_cpu_notify(struct notifier_block *self,
- unsigned long action, void *hcpu)
-{
- /*
- * If a CPU goes away, splice its entries to the current CPU
- * and trigger a run of the softirq
- */
- if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) {
- int cpu = (unsigned long) hcpu;
-
- local_irq_disable();
- list_splice_init(&per_cpu(blk_cpu_done, cpu),
- &__get_cpu_var(blk_cpu_done));
- raise_softirq_irqoff(BLOCK_SOFTIRQ);
- local_irq_enable();
- }
-
- return NOTIFY_OK;
-}
-
-
-static struct notifier_block blk_cpu_notifier __cpuinitdata = {
- .notifier_call = blk_cpu_notify,
-};
-
-/**
- * blk_complete_request - end I/O on a request
- * @req: the request being processed
- *
- * Description:
- * Ends all I/O on a request. It does not handle partial completions,
- * unless the driver actually implements this in its completion callback
- * through requeueing. The actual completion happens out-of-order,
- * through a softirq handler. The user must have registered a completion
- * callback through blk_queue_softirq_done().
- **/
-
-void blk_complete_request(struct request *req)
-{
- struct list_head *cpu_list;
- unsigned long flags;
-
- BUG_ON(!req->q->softirq_done_fn);
-
- local_irq_save(flags);
-
- cpu_list = &__get_cpu_var(blk_cpu_done);
- list_add_tail(&req->donelist, cpu_list);
- raise_softirq_irqoff(BLOCK_SOFTIRQ);
-
- local_irq_restore(flags);
-}
-EXPORT_SYMBOL(blk_complete_request);
-
-/*
* queue lock must be held
*/
static void end_that_request_last(struct request *req, int error)
@@ -2038,8 +1958,6 @@ EXPORT_SYMBOL(kblockd_flush_work);

int __init blk_dev_init(void)
{
- int i;
-
kblockd_workqueue = create_workqueue("kblockd");
if (!kblockd_workqueue)
panic("Failed to create kblockd\n");
@@ -2050,12 +1968,6 @@ int __init blk_dev_init(void)
blk_requestq_cachep = kmem_cache_create("blkdev_queue",
sizeof(struct request_queue), 0, SLAB_PANIC, NULL);

- for_each_possible_cpu(i)
- INIT_LIST_HEAD(&per_cpu(blk_cpu_done, i));
-
- open_softirq(BLOCK_SOFTIRQ, blk_done_softirq, NULL);
- register_hotcpu_notifier(&blk_cpu_notifier);
-
return 0;
}

diff --git a/block/blk-softirq.c b/block/blk-softirq.c
new file mode 100644
index 0000000..05f9451
--- /dev/null
+++ b/block/blk-softirq.c
@@ -0,0 +1,103 @@
+/*
+ * Functions related to softirq rq completions
+ */
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include <linux/interrupt.h>
+#include <linux/cpu.h>
+
+#include "blk.h"
+
+static DEFINE_PER_CPU(struct list_head, blk_cpu_done);
+
+static int __cpuinit blk_cpu_notify(struct notifier_block *self,
+ unsigned long action, void *hcpu)
+{
+ /*
+ * If a CPU goes away, splice its entries to the current CPU
+ * and trigger a run of the softirq
+ */
+ if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) {
+ int cpu = (unsigned long) hcpu;
+
+ local_irq_disable();
+ list_splice_init(&per_cpu(blk_cpu_done, cpu),
+ &__get_cpu_var(blk_cpu_done));
+ raise_softirq_irqoff(BLOCK_SOFTIRQ);
+ local_irq_enable();
+ }
+
+ return NOTIFY_OK;
+}
+
+
+static struct notifier_block blk_cpu_notifier __cpuinitdata = {
+ .notifier_call = blk_cpu_notify,
+};
+
+/*
+ * splice the completion data to a local structure and hand off to
+ * process_completion_queue() to complete the requests
+ */
+static void blk_done_softirq(struct softirq_action *h)
+{
+ struct list_head *cpu_list, local_list;
+
+ local_irq_disable();
+ cpu_list = &__get_cpu_var(blk_cpu_done);
+ list_replace_init(cpu_list, &local_list);
+ local_irq_enable();
+
+ while (!list_empty(&local_list)) {
+ struct request *rq;
+
+ rq = list_entry(local_list.next, struct request, donelist);
+ list_del_init(&rq->donelist);
+ rq->q->softirq_done_fn(rq);
+ }
+}
+
+/**
+ * blk_complete_request - end I/O on a request
+ * @req: the request being processed
+ *
+ * Description:
+ * Ends all I/O on a request. It does not handle partial completions,
+ * unless the driver actually implements this in its completion callback
+ * through requeueing. The actual completion happens out-of-order,
+ * through a softirq handler. The user must have registered a completion
+ * callback through blk_queue_softirq_done().
+ **/
+
+void blk_complete_request(struct request *req)
+{
+ struct list_head *cpu_list;
+ unsigned long flags;
+
+ BUG_ON(!req->q->softirq_done_fn);
+
+ local_irq_save(flags);
+
+ cpu_list = &__get_cpu_var(blk_cpu_done);
+ list_add_tail(&req->donelist, cpu_list);
+ raise_softirq_irqoff(BLOCK_SOFTIRQ);
+
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL(blk_complete_request);
+
+int __init blk_softirq_init(void)
+{
+ int i;
+
+ for_each_possible_cpu(i)
+ INIT_LIST_HEAD(&per_cpu(blk_cpu_done, i));
+
+ open_softirq(BLOCK_SOFTIRQ, blk_done_softirq, NULL);
+ register_hotcpu_notifier(&blk_cpu_notifier);
+ return 0;
+}
+subsys_initcall(blk_softirq_init);
--
1.5.4.GIT

2008-03-12 11:57:49

by Jens Axboe

Subject: [PATCH 6/7] block: make kblockd_schedule_work() take the queue as parameter

Preparatory patch for checking queuing affinity.

Signed-off-by: Jens Axboe <[email protected]>
---
block/as-iosched.c | 6 +++---
block/blk-core.c | 8 ++++----
block/cfq-iosched.c | 2 +-
include/linux/blkdev.h | 2 +-
4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index 8c39467..6ef766f 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -450,7 +450,7 @@ static void as_antic_stop(struct as_data *ad)
del_timer(&ad->antic_timer);
ad->antic_status = ANTIC_FINISHED;
/* see as_work_handler */
- kblockd_schedule_work(&ad->antic_work);
+ kblockd_schedule_work(ad->q, &ad->antic_work);
}
}

@@ -471,7 +471,7 @@ static void as_antic_timeout(unsigned long data)
aic = ad->io_context->aic;

ad->antic_status = ANTIC_FINISHED;
- kblockd_schedule_work(&ad->antic_work);
+ kblockd_schedule_work(q, &ad->antic_work);

if (aic->ttime_samples == 0) {
/* process anticipated on has exited or timed out*/
@@ -831,7 +831,7 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
}

if (ad->changed_batch && ad->nr_dispatched == 1) {
- kblockd_schedule_work(&ad->antic_work);
+ kblockd_schedule_work(q, &ad->antic_work);
ad->changed_batch = 0;

if (ad->batch_data_dir == REQ_SYNC)
diff --git a/block/blk-core.c b/block/blk-core.c
index 46819c1..ec529dc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -299,7 +299,7 @@ void blk_unplug_timeout(unsigned long data)
blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
q->rq.count[READ] + q->rq.count[WRITE]);

- kblockd_schedule_work(&q->unplug_work);
+ kblockd_schedule_work(q, &q->unplug_work);
}

void blk_unplug(struct request_queue *q)
@@ -340,7 +340,7 @@ void blk_start_queue(struct request_queue *q)
clear_bit(QUEUE_FLAG_REENTER, &q->queue_flags);
} else {
blk_plug_device(q);
- kblockd_schedule_work(&q->unplug_work);
+ kblockd_schedule_work(q, &q->unplug_work);
}
}
EXPORT_SYMBOL(blk_start_queue);
@@ -408,7 +408,7 @@ void blk_run_queue(struct request_queue *q)
clear_bit(QUEUE_FLAG_REENTER, &q->queue_flags);
} else {
blk_plug_device(q);
- kblockd_schedule_work(&q->unplug_work);
+ kblockd_schedule_work(q, &q->unplug_work);
}
}

@@ -1944,7 +1944,7 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
rq->rq_disk = bio->bi_bdev->bd_disk;
}

-int kblockd_schedule_work(struct work_struct *work)
+int kblockd_schedule_work(struct request_queue *q, struct work_struct *work)
{
return queue_work(kblockd_workqueue, work);
}
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 0f962ec..4b31f7c 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -235,7 +235,7 @@ static inline int cfq_bio_sync(struct bio *bio)
static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
{
if (cfqd->busy_queues)
- kblockd_schedule_work(&cfqd->unplug_work);
+ kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
}

static int cfq_queue_empty(struct request_queue *q)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6f79d40..f48f32f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -811,7 +811,7 @@ static inline void put_dev_sector(Sector p)
}

struct work_struct;
-int kblockd_schedule_work(struct work_struct *work);
+int kblockd_schedule_work(struct request_queue *q, struct work_struct *work);
void kblockd_flush_work(struct work_struct *work);

#define MODULE_ALIAS_BLOCKDEV(major,minor) \
--
1.5.4.GIT

2008-03-12 11:58:07

by Jens Axboe

Subject: [PATCH 1/7] x86-64: introduce fast variant of smp_call_function_single()

From: Nick Piggin <[email protected]>
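
This replaces the single global call_data slot with an RCU-protected
queue for smp_call_function_mask(), and adds a per-CPU queue plus a new
CALL_FUNCTION_SINGLE_VECTOR IPI for the single-CPU case. The external
interface is unchanged; a hedged example of a caller (illustrative
only, and not valid with interrupts disabled):

    #include <linux/smp.h>

    static void remote_tick(void *info)
    {
    	/* runs on the target CPU, in interrupt context */
    }

    /* ask CPU 1 to run remote_tick() and wait for it to finish */
    static void poke_cpu1(void)
    {
    	smp_call_function_single(1, remote_tick, NULL, 0, 1);
    }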

Signed-off-by: Jens Axboe <[email protected]>
---
arch/x86/kernel/entry_64.S | 3 +
arch/x86/kernel/i8259_64.c | 1 +
arch/x86/kernel/smp_64.c | 303 +++++++++++++++++++++--------
include/asm-x86/hw_irq_64.h | 4 +-
include/asm-x86/mach-default/entry_arch.h | 1 +
include/linux/smp.h | 2 +-
6 files changed, 232 insertions(+), 82 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index c20c9e7..22caf56 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -713,6 +713,9 @@ END(invalidate_interrupt\num)
ENTRY(call_function_interrupt)
apicinterrupt CALL_FUNCTION_VECTOR,smp_call_function_interrupt
END(call_function_interrupt)
+ENTRY(call_function_single_interrupt)
+ apicinterrupt CALL_FUNCTION_SINGLE_VECTOR,smp_call_function_single_interrupt
+END(call_function_single_interrupt)
ENTRY(irq_move_cleanup_interrupt)
apicinterrupt IRQ_MOVE_CLEANUP_VECTOR,smp_irq_move_cleanup_interrupt
END(irq_move_cleanup_interrupt)
diff --git a/arch/x86/kernel/i8259_64.c b/arch/x86/kernel/i8259_64.c
index fa57a15..2b0b6d2 100644
--- a/arch/x86/kernel/i8259_64.c
+++ b/arch/x86/kernel/i8259_64.c
@@ -493,6 +493,7 @@ void __init native_init_IRQ(void)

/* IPI for generic function call */
set_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt);
+ set_intr_gate(CALL_FUNCTION_SINGLE_VECTOR, call_function_single_interrupt);

/* Low priority IPI to cleanup after moving an irq */
set_intr_gate(IRQ_MOVE_CLEANUP_VECTOR, irq_move_cleanup_interrupt);
diff --git a/arch/x86/kernel/smp_64.c b/arch/x86/kernel/smp_64.c
index 2fd74b0..1196a12 100644
--- a/arch/x86/kernel/smp_64.c
+++ b/arch/x86/kernel/smp_64.c
@@ -18,6 +18,7 @@
#include <linux/kernel_stat.h>
#include <linux/mc146818rtc.h>
#include <linux/interrupt.h>
+#include <linux/rcupdate.h>

#include <asm/mtrr.h>
#include <asm/pgalloc.h>
@@ -295,21 +296,29 @@ void smp_send_reschedule(int cpu)
send_IPI_mask(cpumask_of_cpu(cpu), RESCHEDULE_VECTOR);
}

+#define CALL_WAIT 0x01
+#define CALL_FALLBACK 0x02
/*
* Structure and data for smp_call_function(). This is designed to minimise
* static memory requirements. It also looks cleaner.
*/
static DEFINE_SPINLOCK(call_lock);

-struct call_data_struct {
+struct call_data {
+ spinlock_t lock;
+ struct list_head list;
void (*func) (void *info);
void *info;
- atomic_t started;
- atomic_t finished;
- int wait;
+ unsigned int flags;
+ unsigned int refs;
+ cpumask_t cpumask;
+ struct rcu_head rcu_head;
};

-static struct call_data_struct * call_data;
+static LIST_HEAD(call_queue);
+
+static unsigned long call_fallback_used;
+static struct call_data call_data_fallback;

void lock_ipi_call_lock(void)
{
@@ -321,55 +330,47 @@ void unlock_ipi_call_lock(void)
spin_unlock_irq(&call_lock);
}

-/*
- * this function sends a 'generic call function' IPI to all other CPU
- * of the system defined in the mask.
- */
-static int __smp_call_function_mask(cpumask_t mask,
- void (*func)(void *), void *info,
- int wait)
-{
- struct call_data_struct data;
- cpumask_t allbutself;
- int cpus;

- allbutself = cpu_online_map;
- cpu_clear(smp_processor_id(), allbutself);
-
- cpus_and(mask, mask, allbutself);
- cpus = cpus_weight(mask);
-
- if (!cpus)
- return 0;
-
- data.func = func;
- data.info = info;
- atomic_set(&data.started, 0);
- data.wait = wait;
- if (wait)
- atomic_set(&data.finished, 0);
+struct call_single_data {
+ struct list_head list;
+ void (*func) (void *info);
+ void *info;
+ unsigned int flags;
+};

- call_data = &data;
- wmb();
+struct call_single_queue {
+ spinlock_t lock;
+ struct list_head list;
+};
+static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);

- /* Send a message to other CPUs */
- if (cpus_equal(mask, allbutself))
- send_IPI_allbutself(CALL_FUNCTION_VECTOR);
- else
- send_IPI_mask(mask, CALL_FUNCTION_VECTOR);
+static unsigned long call_single_fallback_used;
+static struct call_single_data call_single_data_fallback;

- /* Wait for response */
- while (atomic_read(&data.started) != cpus)
- cpu_relax();
+int __cpuinit init_smp_call(void)
+{
+ int i;

- if (!wait)
- return 0;
+ for_each_cpu_mask(i, cpu_possible_map) {
+ spin_lock_init(&per_cpu(call_single_queue, i).lock);
+ INIT_LIST_HEAD(&per_cpu(call_single_queue, i).list);
+ }
+ return 0;
+}
+core_initcall(init_smp_call);

- while (atomic_read(&data.finished) != cpus)
- cpu_relax();
+static void rcu_free_call_data(struct rcu_head *head)
+{
+ struct call_data *data;
+ data = container_of(head, struct call_data, rcu_head);
+ kfree(data);
+}

- return 0;
+static void free_call_data(struct call_data *data)
+{
+ call_rcu(&data->rcu_head, rcu_free_call_data);
}
+
/**
* smp_call_function_mask(): Run a function on a set of other CPUs.
* @mask: The set of cpus to run on. Must not include the current cpu.
@@ -389,15 +390,69 @@ int smp_call_function_mask(cpumask_t mask,
void (*func)(void *), void *info,
int wait)
{
- int ret;
+ struct call_data *data;
+ cpumask_t allbutself;
+ unsigned int flags;
+ int cpus;

/* Can deadlock when called with interrupts disabled */
WARN_ON(irqs_disabled());
+ WARN_ON(preemptible());
+
+ allbutself = cpu_online_map;
+ cpu_clear(smp_processor_id(), allbutself);
+
+ cpus_and(mask, mask, allbutself);
+ cpus = cpus_weight(mask);
+
+ if (!cpus)
+ return 0;

- spin_lock(&call_lock);
- ret = __smp_call_function_mask(mask, func, info, wait);
+ flags = wait ? CALL_WAIT : 0;
+ data = kmalloc(sizeof(struct call_data), GFP_ATOMIC);
+ if (unlikely(!data)) {
+ while (test_and_set_bit_lock(0, &call_fallback_used))
+ cpu_relax();
+ data = &call_data_fallback;
+ flags |= CALL_FALLBACK;
+ /* XXX: can IPI all to "synchronize" RCU? */
+ }
+
+ spin_lock_init(&data->lock);
+ data->func = func;
+ data->info = info;
+ data->flags = flags;
+ data->refs = cpus;
+ data->cpumask = mask;
+
+ local_irq_disable();
+ while (!spin_trylock(&call_lock)) {
+ local_irq_enable();
+ cpu_relax();
+ local_irq_disable();
+ }
+ /* could do ipi = list_empty(&dst->list) || !cpumask_ipi_pending() */
+ list_add_tail_rcu(&data->list, &call_queue);
spin_unlock(&call_lock);
- return ret;
+ local_irq_enable();
+
+ /* Send a message to other CPUs */
+ if (cpus_equal(mask, allbutself))
+ send_IPI_allbutself(CALL_FUNCTION_VECTOR);
+ else
+ send_IPI_mask(mask, CALL_FUNCTION_VECTOR);
+
+ if (wait) {
+ /* Wait for response */
+ while (data->flags)
+ cpu_relax();
+ if (likely(!(flags & CALL_FALLBACK)))
+ free_call_data(data);
+ else
+ clear_bit_unlock(0, &call_fallback_used);
+ }
+
+ return 0;
}
EXPORT_SYMBOL(smp_call_function_mask);

@@ -414,11 +469,11 @@ EXPORT_SYMBOL(smp_call_function_mask);
* or is or has executed.
*/

-int smp_call_function_single (int cpu, void (*func) (void *info), void *info,
+int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
int nonatomic, int wait)
{
/* prevent preemption and reschedule on another processor */
- int ret, me = get_cpu();
+ int me = get_cpu();

/* Can deadlock when called with interrupts disabled */
WARN_ON(irqs_disabled());
@@ -427,14 +482,54 @@ int smp_call_function_single (int cpu, void (*func) (void *info), void *info,
local_irq_disable();
func(info);
local_irq_enable();
- put_cpu();
- return 0;
- }
+ } else {
+ struct call_single_data d;
+ struct call_single_data *data;
+ struct call_single_queue *dst;
+ cpumask_t mask = cpumask_of_cpu(cpu);
+ unsigned int flags = wait ? CALL_WAIT : 0;
+ int ipi;
+
+ if (!wait) {
+ data = kmalloc(sizeof(struct call_single_data), GFP_ATOMIC);
+ if (unlikely(!data)) {
+ while (test_and_set_bit_lock(0, &call_single_fallback_used))
+ cpu_relax();
+ data = &call_single_data_fallback;
+ flags |= CALL_FALLBACK;
+ }
+ } else {
+ data = &d;
+ }
+
+ data->func = func;
+ data->info = info;
+ data->flags = flags;
+ dst = &per_cpu(call_single_queue, cpu);
+
+ local_irq_disable();
+ while (!spin_trylock(&dst->lock)) {
+ local_irq_enable();
+ cpu_relax();
+ local_irq_disable();
+ }
+ ipi = list_empty(&dst->list);
+ list_add_tail(&data->list, &dst->list);
+ spin_unlock(&dst->lock);
+ local_irq_enable();

- ret = smp_call_function_mask(cpumask_of_cpu(cpu), func, info, wait);
+ if (ipi)
+ send_IPI_mask(mask, CALL_FUNCTION_SINGLE_VECTOR);
+
+ if (wait) {
+ /* Wait for response */
+ while (data->flags)
+ cpu_relax();
+ }
+ }

put_cpu();
- return ret;
+ return 0;
}
EXPORT_SYMBOL(smp_call_function_single);

@@ -474,18 +569,13 @@ static void stop_this_cpu(void *dummy)

void smp_send_stop(void)
{
- int nolock;
unsigned long flags;

if (reboot_force)
return;

- /* Don't deadlock on the call lock in panic */
- nolock = !spin_trylock(&call_lock);
local_irq_save(flags);
- __smp_call_function_mask(cpu_online_map, stop_this_cpu, NULL, 0);
- if (!nolock)
- spin_unlock(&call_lock);
+ smp_call_function(stop_this_cpu, NULL, 0, 0);
disable_local_APIC();
local_irq_restore(flags);
}
@@ -503,28 +593,83 @@ asmlinkage void smp_reschedule_interrupt(void)

asmlinkage void smp_call_function_interrupt(void)
{
- void (*func) (void *info) = call_data->func;
- void *info = call_data->info;
- int wait = call_data->wait;
+ struct list_head *pos, *tmp;
+ int cpu = smp_processor_id();

ack_APIC_irq();
- /*
- * Notify initiating CPU that I've grabbed the data and am
- * about to execute the function
- */
- mb();
- atomic_inc(&call_data->started);
- /*
- * At this point the info structure may be out of scope unless wait==1
- */
exit_idle();
irq_enter();
- (*func)(info);
+
+ list_for_each_safe_rcu(pos, tmp, &call_queue) {
+ struct call_data *data;
+ int refs;
+
+ data = list_entry(pos, struct call_data, list);
+ if (!cpu_isset(cpu, data->cpumask))
+ continue;
+
+ data->func(data->info);
+ spin_lock(&data->lock);
+ WARN_ON(!cpu_isset(cpu, data->cpumask));
+ cpu_clear(cpu, data->cpumask);
+ WARN_ON(data->refs == 0);
+ data->refs--;
+ refs = data->refs;
+ spin_unlock(&data->lock);
+
+ if (refs == 0) {
+ WARN_ON(cpus_weight(data->cpumask));
+ spin_lock(&call_lock);
+ list_del_rcu(&data->list);
+ spin_unlock(&call_lock);
+ if (data->flags & CALL_WAIT) {
+ smp_wmb();
+ data->flags = 0;
+ } else {
+ if (likely(!(data->flags & CALL_FALLBACK)))
+ free_call_data(data);
+ else
+ clear_bit_unlock(0, &call_fallback_used);
+ }
+ }
+ }
+
add_pda(irq_call_count, 1);
irq_exit();
- if (wait) {
- mb();
- atomic_inc(&call_data->finished);
+}
+
+asmlinkage void smp_call_function_single_interrupt(void)
+{
+ struct call_single_queue *q;
+ LIST_HEAD(list);
+
+ ack_APIC_irq();
+ exit_idle();
+ irq_enter();
+
+ q = &__get_cpu_var(call_single_queue);
+ spin_lock(&q->lock);
+ list_replace_init(&q->list, &list);
+ spin_unlock(&q->lock);
+
+ while (!list_empty(&list)) {
+ struct call_single_data *data;
+
+ data = list_entry(list.next, struct call_single_data, list);
+ list_del(&data->list);
+
+ data->func(data->info);
+ if (data->flags & CALL_WAIT) {
+ smp_wmb();
+ data->flags = 0;
+ } else {
+ if (likely(!(data->flags & CALL_FALLBACK)))
+ kfree(data);
+ else
+ clear_bit_unlock(0, &call_single_fallback_used);
+ }
}
+ add_pda(irq_call_count, 1);
+ irq_exit();
}

diff --git a/include/asm-x86/hw_irq_64.h b/include/asm-x86/hw_irq_64.h
index 312a58d..06ac80c 100644
--- a/include/asm-x86/hw_irq_64.h
+++ b/include/asm-x86/hw_irq_64.h
@@ -68,8 +68,7 @@
#define ERROR_APIC_VECTOR 0xfe
#define RESCHEDULE_VECTOR 0xfd
#define CALL_FUNCTION_VECTOR 0xfc
-/* fb free - please don't readd KDB here because it's useless
- (hint - think what a NMI bit does to a vector) */
+#define CALL_FUNCTION_SINGLE_VECTOR 0xfb
#define THERMAL_APIC_VECTOR 0xfa
#define THRESHOLD_APIC_VECTOR 0xf9
/* f8 free */
@@ -102,6 +101,7 @@ void spurious_interrupt(void);
void error_interrupt(void);
void reschedule_interrupt(void);
void call_function_interrupt(void);
+void call_function_single_interrupt(void);
void irq_move_cleanup_interrupt(void);
void invalidate_interrupt0(void);
void invalidate_interrupt1(void);
diff --git a/include/asm-x86/mach-default/entry_arch.h b/include/asm-x86/mach-default/entry_arch.h
index bc86146..9283b60 100644
--- a/include/asm-x86/mach-default/entry_arch.h
+++ b/include/asm-x86/mach-default/entry_arch.h
@@ -13,6 +13,7 @@
BUILD_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR)
BUILD_INTERRUPT(invalidate_interrupt,INVALIDATE_TLB_VECTOR)
BUILD_INTERRUPT(call_function_interrupt,CALL_FUNCTION_VECTOR)
+BUILD_INTERRUPT(call_function_single_interrupt,CALL_FUNCTION_SINGLE_VECTOR)
#endif

/*
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 55232cc..c938d26 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -53,7 +53,6 @@ extern void smp_cpus_done(unsigned int max_cpus);
* Call a function on all other processors
*/
int smp_call_function(void(*func)(void *info), void *info, int retry, int wait);
-
int smp_call_function_single(int cpuid, void (*func) (void *info), void *info,
int retry, int wait);

@@ -92,6 +91,7 @@ static inline int up_smp_call_function(void (*func)(void *), void *info)
}
#define smp_call_function(func, info, retry, wait) \
(up_smp_call_function(func, info))
+
#define on_each_cpu(func,info,retry,wait) \
({ \
local_irq_disable(); \
--
1.5.4.GIT

2008-03-12 11:58:24

by Jens Axboe

Subject: [PATCH 7/7] block: add test code for testing CPU affinity

Supports several modes:

- Force IO queue affinity to a specific mask of CPUs
- Force IO completion affinity to a specific mask of CPUs
- Force completion of a request on the same CPU that queued it
- Allow IO submitter to set BIO_CPU_AFFINE in the bio, in which case
completion will be done on the same CPU as the submitter

Test code so far; this variant is based on the more scalable
__smp_call_function_single().
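
Besides the sysfs knobs, the patch also exports blk_queue_set_queue_cpu()
and blk_queue_set_completion_cpu(), so a driver can pin queuing or
completion itself. A minimal sketch of that (hypothetical driver code):

    #include <linux/kernel.h>
    #include <linux/blkdev.h>

    /*
     * Steer all completions for this queue to CPU 2. Passing -1 means
     * "any CPU" (the default); an impossible CPU fails with -EINVAL.
     */
    static void mydrv_pin_completions(struct request_queue *q)
    {
    	if (blk_queue_set_completion_cpu(q, 2))
    		printk(KERN_WARNING "mydrv: cannot pin completions to CPU 2\n");
    }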

Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-core.c | 77 ++++++++++++++++++++++-------------
block/blk-settings.c | 49 ++++++++++++++++++++++-
block/blk-softirq.c | 98 +++++++++++++++++++++++++++++++--------------
block/blk-sysfs.c | 86 ++++++++++++++++++++++++++++++++++++++++
include/linux/bio.h | 1 +
include/linux/blkdev.h | 12 +++++-
include/linux/elevator.h | 8 ++--
7 files changed, 264 insertions(+), 67 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ec529dc..8b04a15 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -110,7 +110,7 @@ EXPORT_SYMBOL(blk_get_backing_dev_info);
void rq_init(struct request_queue *q, struct request *rq)
{
INIT_LIST_HEAD(&rq->queuelist);
- INIT_LIST_HEAD(&rq->donelist);
+ rq->cpu = -1;
rq->q = q;
rq->sector = rq->hard_sector = (sector_t) -1;
rq->nr_sectors = rq->hard_nr_sectors = 0;
@@ -197,6 +197,11 @@ void blk_dump_rq_flags(struct request *rq, char *msg)
}
EXPORT_SYMBOL(blk_dump_rq_flags);

+static inline int blk_is_io_cpu(struct request_queue *q)
+{
+ return cpu_isset(smp_processor_id(), q->queue_cpu);
+}
+
/*
* "plug" the device if there are no outstanding requests: this will
* force the transfer to start only after we have put all the requests
@@ -217,6 +222,10 @@ void blk_plug_device(struct request_queue *q)
return;

if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
+ /*
+ * no need to care about the io cpu here, since the
+ * timeout handler needs to punt to kblockd anyway
+ */
mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG);
}
@@ -316,6 +325,22 @@ void blk_unplug(struct request_queue *q)
}
EXPORT_SYMBOL(blk_unplug);

+static void blk_invoke_request_fn(struct request_queue *q)
+{
+ /*
+ * one level of recursion is ok and is much faster than kicking
+ * the unplug handling
+ */
+ if (blk_is_io_cpu(q) &&
+ !test_and_set_bit(QUEUE_FLAG_REENTER, &q->queue_flags)) {
+ q->request_fn(q);
+ clear_bit(QUEUE_FLAG_REENTER, &q->queue_flags);
+ } else {
+ set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags);
+ kblockd_schedule_work(q, &q->unplug_work);
+ }
+}
+
/**
* blk_start_queue - restart a previously stopped queue
* @q: The &struct request_queue in question
@@ -330,18 +355,7 @@ void blk_start_queue(struct request_queue *q)
WARN_ON(!irqs_disabled());

clear_bit(QUEUE_FLAG_STOPPED, &q->queue_flags);
-
- /*
- * one level of recursion is ok and is much faster than kicking
- * the unplug handling
- */
- if (!test_and_set_bit(QUEUE_FLAG_REENTER, &q->queue_flags)) {
- q->request_fn(q);
- clear_bit(QUEUE_FLAG_REENTER, &q->queue_flags);
- } else {
- blk_plug_device(q);
- kblockd_schedule_work(q, &q->unplug_work);
- }
+ blk_invoke_request_fn(q);
}
EXPORT_SYMBOL(blk_start_queue);

@@ -398,19 +412,8 @@ void blk_run_queue(struct request_queue *q)
spin_lock_irqsave(q->queue_lock, flags);
blk_remove_plug(q);

- /*
- * Only recurse once to avoid overrunning the stack, let the unplug
- * handling reinvoke the handler shortly if we already got there.
- */
- if (!elv_queue_empty(q)) {
- if (!test_and_set_bit(QUEUE_FLAG_REENTER, &q->queue_flags)) {
- q->request_fn(q);
- clear_bit(QUEUE_FLAG_REENTER, &q->queue_flags);
- } else {
- blk_plug_device(q);
- kblockd_schedule_work(q, &q->unplug_work);
- }
- }
+ if (!elv_queue_empty(q))
+ blk_invoke_request_fn(q);

spin_unlock_irqrestore(q->queue_lock, flags);
}
@@ -469,6 +472,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
if (!q)
return NULL;

+ cpus_setall(q->queue_cpu);
+ cpus_setall(q->complete_cpu);
q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
q->backing_dev_info.unplug_io_data = q;
err = bdi_init(&q->backing_dev_info);
@@ -872,7 +877,10 @@ EXPORT_SYMBOL(blk_get_request);
*/
void blk_start_queueing(struct request_queue *q)
{
- if (!blk_queue_plugged(q))
+ if (!blk_is_io_cpu(q)) {
+ set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags);
+ kblockd_schedule_work(q, &q->unplug_work);
+ } else if (!blk_queue_plugged(q))
q->request_fn(q);
else
__generic_unplug_device(q);
@@ -1190,13 +1198,15 @@ get_rq:
init_request_from_bio(req, bio);

spin_lock_irq(q->queue_lock);
+ if (q->queue_flags & (1 << QUEUE_FLAG_SAME_COMP) ||
+ bio_flagged(bio, BIO_CPU_AFFINE))
+ req->cpu = smp_processor_id();
if (elv_queue_empty(q))
blk_plug_device(q);
add_request(q, req);
out:
if (sync)
__generic_unplug_device(q);
-
spin_unlock_irq(q->queue_lock);
return 0;

@@ -1946,7 +1956,16 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,

int kblockd_schedule_work(struct request_queue *q, struct work_struct *work)
{
- return queue_work(kblockd_workqueue, work);
+ int cpu;
+
+ if (blk_is_io_cpu(q))
+ return queue_work(kblockd_workqueue, work);
+
+ /*
+ * would need to be improved, of course...
+ */
+ cpu = first_cpu(q->queue_cpu);
+ return queue_work_on_cpu(kblockd_workqueue, work, cpu);
}
EXPORT_SYMBOL(kblockd_schedule_work);

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 1344a0e..1365dd4 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -405,7 +405,54 @@ void blk_queue_update_dma_alignment(struct request_queue *q, int mask)
}
EXPORT_SYMBOL(blk_queue_update_dma_alignment);

-static int __init blk_settings_init(void)
+static int blk_queue_set_cpumask(cpumask_t *cpumask, int cpu)
+{
+ if (cpu == -1)
+ cpus_setall(*cpumask);
+ else if (!cpu_isset(cpu, cpu_possible_map)) {
+ cpus_setall(*cpumask);
+ return -EINVAL;
+ } else {
+ cpus_clear(*cpumask);
+ cpu_set(cpu, *cpumask);
+ }
+
+ return 0;
+}
+
+/**
+ * blk_queue_set_completion_cpu - Set IO CPU for completions
+ * @q: the request queue for the device
+ * @cpu: cpu
+ *
+ * Description:
+ * This function allows a driver to set a CPU that should handle completions
+ * for this device.
+ *
+ **/
+int blk_queue_set_completion_cpu(struct request_queue *q, int cpu)
+{
+ return blk_queue_set_cpumask(&q->complete_cpu, cpu);
+}
+EXPORT_SYMBOL(blk_queue_set_completion_cpu);
+
+/**
+ * blk_queue_set_queue_cpu - Set IO CPU for queuing
+ * @q: the request queue for the device
+ * @cpu: cpu
+ *
+ * Description:
+ * This function allows a driver to set a CPU that should handle queuing
+ * for this device.
+ *
+ **/
+int blk_queue_set_queue_cpu(struct request_queue *q, int cpu)
+{
+ return blk_queue_set_cpumask(&q->queue_cpu, cpu);
+}
+EXPORT_SYMBOL(blk_queue_set_queue_cpu);
+
+int __init blk_settings_init(void)
{
blk_max_low_pfn = max_low_pfn - 1;
blk_max_pfn = max_pfn - 1;
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 05f9451..0f90383 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -13,6 +13,50 @@

static DEFINE_PER_CPU(struct list_head, blk_cpu_done);

+static void blk_done_softirq(struct softirq_action *h)
+{
+ struct list_head *cpu_list, local_list;
+
+ local_irq_disable();
+ cpu_list = &__get_cpu_var(blk_cpu_done);
+ list_replace_init(cpu_list, &local_list);
+ local_irq_enable();
+
+ while (!list_empty(&local_list)) {
+ struct request *rq;
+
+ rq = list_entry(local_list.next, struct request, csd.list);
+ list_del_init(&rq->csd.list);
+ rq->q->softirq_done_fn(rq);
+ }
+}
+
+static void trigger_softirq(void *data)
+{
+ struct list_head *list = &__get_cpu_var(blk_cpu_done);
+ struct request *rq = data;
+
+ if (!list_empty(list)) {
+ INIT_LIST_HEAD(&rq->csd.list);
+ local_irq_disable();
+ list_add_tail(&rq->csd.list, list);
+ local_irq_enable();
+ blk_done_softirq(NULL);
+ } else
+ rq->q->softirq_done_fn(rq);
+}
+
+static void raise_blk_irq(int cpu, struct request *rq)
+{
+ struct call_single_data *data = &rq->csd;
+
+ data->func = trigger_softirq;
+ data->info = rq;
+ data->flags = 0;
+
+ __smp_call_function_single(cpu, data);
+}
+
static int __cpuinit blk_cpu_notify(struct notifier_block *self,
unsigned long action, void *hcpu)
{
@@ -33,33 +77,10 @@ static int __cpuinit blk_cpu_notify(struct notifier_block *self,
return NOTIFY_OK;
}

-
-static struct notifier_block blk_cpu_notifier __cpuinitdata = {
+static struct notifier_block __cpuinitdata blk_cpu_notifier = {
.notifier_call = blk_cpu_notify,
};

-/*
- * splice the completion data to a local structure and hand off to
- * process_completion_queue() to complete the requests
- */
-static void blk_done_softirq(struct softirq_action *h)
-{
- struct list_head *cpu_list, local_list;
-
- local_irq_disable();
- cpu_list = &__get_cpu_var(blk_cpu_done);
- list_replace_init(cpu_list, &local_list);
- local_irq_enable();
-
- while (!list_empty(&local_list)) {
- struct request *rq;
-
- rq = list_entry(local_list.next, struct request, donelist);
- list_del_init(&rq->donelist);
- rq->q->softirq_done_fn(rq);
- }
-}
-
/**
* blk_complete_request - end I/O on a request
* @req: the request being processed
@@ -71,25 +92,40 @@ static void blk_done_softirq(struct softirq_action *h)
* through a softirq handler. The user must have registered a completion
* callback through blk_queue_softirq_done().
**/
-
void blk_complete_request(struct request *req)
{
- struct list_head *cpu_list;
+ struct request_queue *q = req->q;
unsigned long flags;
+ int ccpu, cpu;

- BUG_ON(!req->q->softirq_done_fn);
+ BUG_ON(!q->softirq_done_fn);

local_irq_save(flags);
+ cpu = smp_processor_id();

- cpu_list = &__get_cpu_var(blk_cpu_done);
- list_add_tail(&req->donelist, cpu_list);
- raise_softirq_irqoff(BLOCK_SOFTIRQ);
+ if ((q->queue_flags & (1 << QUEUE_FLAG_SAME_COMP)) && req->cpu != -1)
+ ccpu = req->cpu;
+ else if (cpu_isset(cpu, q->complete_cpu))
+ ccpu = cpu;
+ else
+ ccpu = first_cpu(q->complete_cpu);
+
+ if (ccpu == cpu) {
+ struct list_head *list = &__get_cpu_var(blk_cpu_done);
+
+ INIT_LIST_HEAD(&req->csd.list);
+ list_add_tail(&req->csd.list, list);
+
+ if (list->next == &req->csd.list)
+ raise_softirq_irqoff(BLOCK_SOFTIRQ);
+ } else
+ raise_blk_irq(ccpu, req);

local_irq_restore(flags);
}
EXPORT_SYMBOL(blk_complete_request);

-int __init blk_softirq_init(void)
+__init int blk_softirq_init(void)
{
int i;

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 54d0db1..947e463 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -135,6 +135,71 @@ static ssize_t queue_max_hw_sectors_show(struct request_queue *q, char *page)
return queue_var_show(max_hw_sectors_kb, (page));
}

+static ssize_t queue_complete_affinity_show(struct request_queue *q, char *page)
+{
+ ssize_t len = cpumask_scnprintf(page, PAGE_SIZE, q->complete_cpu);
+
+ len += sprintf(page + len, "\n");
+ return len;
+}
+
+static ssize_t queue_complete_affinity_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ char *p = (char *) page;
+ long val;
+
+ val = simple_strtol(p, &p, 10);
+ spin_lock_irq(q->queue_lock);
+ blk_queue_set_completion_cpu(q, val);
+ spin_unlock_irq(q->queue_lock);
+ return count;
+}
+
+static ssize_t queue_queue_affinity_show(struct request_queue *q, char *page)
+{
+ ssize_t len = cpumask_scnprintf(page, PAGE_SIZE, q->queue_cpu);
+
+ len += sprintf(page + len, "\n");
+ return len;
+}
+
+static ssize_t queue_queue_affinity_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ char *p = (char *) page;
+ long val;
+
+ val = simple_strtol(p, &p, 10);
+ spin_lock_irq(q->queue_lock);
+ blk_queue_set_queue_cpu(q, val);
+ spin_unlock_irq(q->queue_lock);
+ return count;
+}
+
+static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page)
+{
+ unsigned int same = (q->queue_flags & 1 << (QUEUE_FLAG_SAME_COMP)) != 0;
+
+ return queue_var_show(same, page);
+}
+
+static ssize_t
+queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
+{
+ unsigned long val;
+ ssize_t ret;
+
+ ret = queue_var_store(&val, page, count);
+ spin_lock_irq(q->queue_lock);
+ if (val)
+ q->queue_flags |= (1 << QUEUE_FLAG_SAME_COMP);
+ else
+ q->queue_flags &= ~(1 << QUEUE_FLAG_SAME_COMP);
+ spin_unlock_irq(q->queue_lock);
+
+ return ret;
+}

static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -170,6 +235,24 @@ static struct queue_sysfs_entry queue_hw_sector_size_entry = {
.show = queue_hw_sector_size_show,
};

+static struct queue_sysfs_entry queue_complete_affinity_entry = {
+ .attr = {.name = "completion_affinity", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_complete_affinity_show,
+ .store = queue_complete_affinity_store,
+};
+
+static struct queue_sysfs_entry queue_queue_affinity_entry = {
+ .attr = {.name = "queue_affinity", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_queue_affinity_show,
+ .store = queue_queue_affinity_store,
+};
+
+static struct queue_sysfs_entry queue_rq_affinity_entry = {
+ .attr = {.name = "rq_affinity", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_rq_affinity_show,
+ .store = queue_rq_affinity_store,
+};
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -177,6 +260,9 @@ static struct attribute *default_attrs[] = {
&queue_max_sectors_entry.attr,
&queue_iosched_entry.attr,
&queue_hw_sector_size_entry.attr,
+ &queue_complete_affinity_entry.attr,
+ &queue_queue_affinity_entry.attr,
+ &queue_rq_affinity_entry.attr,
NULL,
};

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 4c59bdc..6c4c8d7 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -127,6 +127,7 @@ struct bio {
#define BIO_BOUNCED 5 /* bio is a bounce bio */
#define BIO_USER_MAPPED 6 /* contains user pages */
#define BIO_EOPNOTSUPP 7 /* not supported */
+#define BIO_CPU_AFFINE 8 /* complete bio on same CPU as submitted */
#define bio_flagged(bio, flag) ((bio)->bi_flags & (1 << (flag)))

/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f48f32f..4038b6f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -17,6 +17,7 @@
#include <linux/module.h>
#include <linux/stringify.h>
#include <linux/bsg.h>
+#include <linux/smp.h>

#include <asm/scatterlist.h>

@@ -143,7 +144,8 @@ enum rq_flag_bits {
*/
struct request {
struct list_head queuelist;
- struct list_head donelist;
+ struct call_single_data csd;
+ int cpu;

struct request_queue *q;

@@ -295,9 +297,12 @@ struct request_queue
unplug_fn *unplug_fn;
merge_bvec_fn *merge_bvec_fn;
prepare_flush_fn *prepare_flush_fn;
- softirq_done_fn *softirq_done_fn;
dma_drain_needed_fn *dma_drain_needed;

+ softirq_done_fn *softirq_done_fn;
+ cpumask_t queue_cpu;
+ cpumask_t complete_cpu;
+
/*
* Dispatch queue sorting
*/
@@ -405,6 +410,7 @@ struct request_queue
#define QUEUE_FLAG_PLUGGED 7 /* queue is plugged */
#define QUEUE_FLAG_ELVSWITCH 8 /* don't use elevator, just do FIFO */
#define QUEUE_FLAG_BIDI 9 /* queue supports bidi requests */
+#define QUEUE_FLAG_SAME_COMP 10 /* force complete on same CPU */

enum {
/*
@@ -710,6 +716,8 @@ extern void blk_queue_segment_boundary(struct request_queue *, unsigned long);
extern void blk_queue_prep_rq(struct request_queue *, prep_rq_fn *pfn);
extern void blk_queue_merge_bvec(struct request_queue *, merge_bvec_fn *);
extern void blk_queue_dma_alignment(struct request_queue *, int);
+extern int blk_queue_set_queue_cpu(struct request_queue *, int);
+extern int blk_queue_set_completion_cpu(struct request_queue *, int);
extern void blk_queue_update_dma_alignment(struct request_queue *, int);
extern void blk_queue_softirq_done(struct request_queue *, softirq_done_fn *);
extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 639624b..bb791c3 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -173,15 +173,15 @@ enum {
#define rb_entry_rq(node) rb_entry((node), struct request, rb_node)

/*
- * Hack to reuse the donelist list_head as the fifo time holder while
+ * Hack to reuse the csd.list list_head as the fifo time holder while
* the request is in the io scheduler. Saves an unsigned long in rq.
*/
-#define rq_fifo_time(rq) ((unsigned long) (rq)->donelist.next)
-#define rq_set_fifo_time(rq,exp) ((rq)->donelist.next = (void *) (exp))
+#define rq_fifo_time(rq) ((unsigned long) (rq)->csd.list.next)
+#define rq_set_fifo_time(rq,exp) ((rq)->csd.list.next = (void *) (exp))
#define rq_entry_fifo(ptr) list_entry((ptr), struct request, queuelist)
#define rq_fifo_clear(rq) do { \
list_del_init(&(rq)->queuelist); \
- INIT_LIST_HEAD(&(rq)->donelist); \
+ INIT_LIST_HEAD(&(rq)->csd.list); \
} while (0)

/*
--
1.5.4.GIT

2008-03-12 11:58:53

by Jens Axboe

Subject: [PATCH 2/7] x86-64: speedup and tweak smp_call_function_single()

Add a __smp_call_function_single() that allows passing in the
caller data to avoid an allocation.

It's OK to have interrupts disabled, as long as we don't wait for
the IPI call to finish.

Get rid of the fallback data and pass back an error instead. Callers
that don't want to handle errors must either use wait == 1, or pre-allocate
the data and use the __smp_call_function_single() variant.
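
For the async case the caller embeds a struct call_single_data in its
own structure and uses __smp_call_function_single() directly, which is
what the block layer bits in this series do. A hedged sketch (names
like my_unit are invented for illustration):

    #include <linux/smp.h>

    struct my_unit {
    	struct call_single_data csd;	/* must stay valid until the call runs */
    	int value;
    };

    static void remote_work(void *info)
    {
    	struct my_unit *u = info;

    	/* runs on the target CPU, in interrupt context */
    	u->value++;
    }

    /* asynchronous: no allocation, no waiting */
    static void kick_remote(struct my_unit *u, int cpu)
    {
    	u->csd.func = remote_work;
    	u->csd.info = u;
    	u->csd.flags = 0;	/* not CALL_WAIT: the handler won't free it either */
    	__smp_call_function_single(cpu, &u->csd);
    }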

Signed-off-by: Jens Axboe <[email protected]>
---
arch/x86/kernel/smp_64.c | 117 +++++++++++++++++++++++----------------------
include/linux/smp.h | 8 +++
2 files changed, 68 insertions(+), 57 deletions(-)

diff --git a/arch/x86/kernel/smp_64.c b/arch/x86/kernel/smp_64.c
index 1196a12..b1a3d3c 100644
--- a/arch/x86/kernel/smp_64.c
+++ b/arch/x86/kernel/smp_64.c
@@ -298,6 +298,8 @@ void smp_send_reschedule(int cpu)

#define CALL_WAIT 0x01
#define CALL_FALLBACK 0x02
+#define CALL_DATA_ALLOC 0x04
+
/*
* Structure and data for smp_call_function(). This is designed to minimise
* static memory requirements. It also looks cleaner.
@@ -330,23 +332,12 @@ void unlock_ipi_call_lock(void)
spin_unlock_irq(&call_lock);
}

-
-struct call_single_data {
- struct list_head list;
- void (*func) (void *info);
- void *info;
- unsigned int flags;
-};
-
struct call_single_queue {
spinlock_t lock;
struct list_head list;
};
static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);

-static unsigned long call_single_fallback_used;
-static struct call_single_data call_single_data_fallback;
-
int __cpuinit init_smp_call(void)
{
int i;
@@ -416,7 +407,8 @@ int smp_call_function_mask(cpumask_t mask,
data = &call_data_fallback;
flags |= CALL_FALLBACK;
/* XXX: can IPI all to "synchronize" RCU? */
- }
+ } else
+ flags |= CALL_DATA_ALLOC;

spin_lock_init(&data->lock);
data->func = func;
@@ -446,7 +438,7 @@ int smp_call_function_mask(cpumask_t mask,
/* Wait for response */
while (data->flags)
cpu_relax();
- if (likely(!(flags & CALL_FALLBACK)))
+ if (flags & CALL_DATA_ALLOC)
free_call_data(data);
else
clear_bit_unlock(0, &call_fallback_used);
@@ -457,6 +449,45 @@ int smp_call_function_mask(cpumask_t mask,
EXPORT_SYMBOL(smp_call_function_mask);

/*
+ * __smp_call_function_single - Run a function on a specific CPU
+ * @data: Associated data
+ *
+ * Returns 0 on success, else a negative status code.
+ *
+ * Does not return until the remote CPU is nearly ready to execute <func>
+ * or is or has executed. Also see smp_call_function_single()
+ */
+void __smp_call_function_single(int cpu, struct call_single_data *data)
+{
+ cpumask_t mask = cpumask_of_cpu(cpu);
+ struct call_single_queue *dst;
+ unsigned long flags;
+ /* prevent preemption and reschedule on another processor */
+ int ipi;
+
+ /* Can deadlock when called with interrupts disabled */
+ WARN_ON((data->flags & CALL_WAIT) && irqs_disabled());
+
+ INIT_LIST_HEAD(&data->list);
+ dst = &per_cpu(call_single_queue, cpu);
+
+ spin_lock_irqsave(&dst->lock, flags);
+ ipi = list_empty(&dst->list);
+ list_add_tail(&data->list, &dst->list);
+ spin_unlock_irqrestore(&dst->lock, flags);
+
+ if (ipi)
+ send_IPI_mask(mask, CALL_FUNCTION_SINGLE_VECTOR);
+
+ if (data->flags & CALL_WAIT) {
+ /* Wait for response */
+ while (data->flags)
+ cpu_relax();
+ }
+}
+EXPORT_SYMBOL(__smp_call_function_single);
+
+/*
* smp_call_function_single - Run a function on a specific CPU
* @func: The function to run. This must be fast and non-blocking.
* @info: An arbitrary pointer to pass to the function.
@@ -468,68 +499,44 @@ EXPORT_SYMBOL(smp_call_function_mask);
* Does not return until the remote CPU is nearly ready to execute <func>
* or is or has executed.
*/
-
int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
int nonatomic, int wait)
{
+ unsigned long flags;
/* prevent preemption and reschedule on another processor */
int me = get_cpu();
+ int ret = 0;

/* Can deadlock when called with interrupts disabled */
- WARN_ON(irqs_disabled());
+ WARN_ON(wait && irqs_disabled());

if (cpu == me) {
- local_irq_disable();
+ local_irq_save(flags);
func(info);
- local_irq_enable();
+ local_irq_restore(flags);
} else {
struct call_single_data d;
struct call_single_data *data;
- struct call_single_queue *dst;
- cpumask_t mask = cpumask_of_cpu(cpu);
- unsigned int flags = wait ? CALL_WAIT : 0;
- int ipi;

if (!wait) {
- data = kmalloc(sizeof(struct call_single_data), GFP_ATOMIC);
+ data = kmalloc(sizeof(*data), GFP_ATOMIC);
if (unlikely(!data)) {
- while (test_and_set_bit_lock(0, &call_single_fallback_used))
- cpu_relax();
- data = &call_single_data_fallback;
- flags |= CALL_FALLBACK;
+ ret = -ENOMEM;
+ goto out;
}
+ data->flags = CALL_DATA_ALLOC;
} else {
data = &d;
+ data->flags = CALL_WAIT;
}

data->func = func;
data->info = info;
- data->flags = flags;
- dst = &per_cpu(call_single_queue, cpu);
-
- local_irq_disable();
- while (!spin_trylock(&dst->lock)) {
- local_irq_enable();
- cpu_relax();
- local_irq_disable();
- }
- ipi = list_empty(&dst->list);
- list_add_tail(&data->list, &dst->list);
- spin_unlock(&dst->lock);
- local_irq_enable();
-
- if (ipi)
- send_IPI_mask(mask, CALL_FUNCTION_SINGLE_VECTOR);
-
- if (wait) {
- /* Wait for response */
- while (data->flags)
- cpu_relax();
- }
+ __smp_call_function_single(cpu, data);
}
-
+out:
put_cpu();
- return 0;
+ return ret;
}
EXPORT_SYMBOL(smp_call_function_single);

@@ -626,7 +633,7 @@ asmlinkage void smp_call_function_interrupt(void)
smp_wmb();
data->flags = 0;
} else {
- if (likely(!(data->flags & CALL_FALLBACK)))
+ if (likely(data->flags & CALL_DATA_ALLOC))
free_call_data(data);
else
clear_bit_unlock(0, &call_fallback_used);
@@ -662,12 +669,8 @@ asmlinkage void smp_call_function_single_interrupt(void)
if (data->flags & CALL_WAIT) {
smp_wmb();
data->flags = 0;
- } else {
- if (likely(!(data->flags & CALL_FALLBACK)))
- kfree(data);
- else
- clear_bit_unlock(0, &call_single_fallback_used);
- }
+ } else if (data->flags & CALL_DATA_ALLOC)
+ kfree(data);
}
add_pda(irq_call_count, 1);
irq_exit();
diff --git a/include/linux/smp.h b/include/linux/smp.h
index c938d26..629a44a 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -49,12 +49,20 @@ extern int __cpu_up(unsigned int cpunum);
*/
extern void smp_cpus_done(unsigned int max_cpus);

+struct call_single_data {
+ struct list_head list;
+ void (*func) (void *info);
+ void *info;
+ unsigned int flags;
+};
+
/*
* Call a function on all other processors
*/
int smp_call_function(void(*func)(void *info), void *info, int retry, int wait);
int smp_call_function_single(int cpuid, void (*func) (void *info), void *info,
int retry, int wait);
+void __smp_call_function_single(int cpuid, struct call_single_data *data);

/*
* Call a function on all processors
--
1.5.4.GIT

2008-03-12 11:59:17

by Jens Axboe

Subject: [PATCH 3/7] x86: add fast smp_call_function_single()

Based on Nick's patch for x86-64, with my tweaks thrown in.

Signed-off-by: Jens Axboe <[email protected]>
---
arch/x86/kernel/smp_32.c | 309 +++++++++++++++++++++-------
arch/x86/kernel/smpboot_32.c | 4 +
arch/x86/kernel/smpcommon_32.c | 34 ---
include/asm-x86/hw_irq_32.h | 1 +
include/asm-x86/mach-default/irq_vectors.h | 1 +
5 files changed, 242 insertions(+), 107 deletions(-)

diff --git a/arch/x86/kernel/smp_32.c b/arch/x86/kernel/smp_32.c
index dc0cde9..dec7cd3 100644
--- a/arch/x86/kernel/smp_32.c
+++ b/arch/x86/kernel/smp_32.c
@@ -476,20 +476,32 @@ static void native_smp_send_reschedule(int cpu)
send_IPI_mask(cpumask_of_cpu(cpu), RESCHEDULE_VECTOR);
}

+#define CALL_WAIT 0x01
+#define CALL_FALLBACK 0x02
+#define CALL_DATA_ALLOC 0x04
+
/*
* Structure and data for smp_call_function(). This is designed to minimise
* static memory requirements. It also looks cleaner.
*/
static DEFINE_SPINLOCK(call_lock);

-struct call_data_struct {
+struct call_data {
+ spinlock_t lock;
+ struct list_head list;
void (*func) (void *info);
void *info;
- atomic_t started;
- atomic_t finished;
- int wait;
+ unsigned int flags;
+ unsigned int refs;
+ cpumask_t cpumask;
+ struct rcu_head rcu_head;
};

+static LIST_HEAD(call_queue);
+
+static unsigned long call_fallback_used;
+static struct call_data call_data_fallback;
+
void lock_ipi_call_lock(void)
{
spin_lock_irq(&call_lock);
@@ -500,39 +512,35 @@ void unlock_ipi_call_lock(void)
spin_unlock_irq(&call_lock);
}

-static struct call_data_struct *call_data;
+struct call_single_queue {
+ spinlock_t lock;
+ struct list_head list;
+};
+static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);

-static void __smp_call_function(void (*func) (void *info), void *info,
- int nonatomic, int wait)
+int __cpuinit init_smp_call(void)
{
- struct call_data_struct data;
- int cpus = num_online_cpus() - 1;
-
- if (!cpus)
- return;
-
- data.func = func;
- data.info = info;
- atomic_set(&data.started, 0);
- data.wait = wait;
- if (wait)
- atomic_set(&data.finished, 0);
+ int i;

- call_data = &data;
- mb();
-
- /* Send a message to all other CPUs and wait for them to respond */
- send_IPI_allbutself(CALL_FUNCTION_VECTOR);
+ for_each_cpu_mask(i, cpu_possible_map) {
+ spin_lock_init(&per_cpu(call_single_queue, i).lock);
+ INIT_LIST_HEAD(&per_cpu(call_single_queue, i).list);
+ }
+ return 0;
+}
+core_initcall(init_smp_call);

- /* Wait for response */
- while (atomic_read(&data.started) != cpus)
- cpu_relax();
+static void rcu_free_call_data(struct rcu_head *head)
+{
+ struct call_data *data = container_of(head, struct call_data, rcu_head);

- if (wait)
- while (atomic_read(&data.finished) != cpus)
- cpu_relax();
+ kfree(data);
}

+static void free_call_data(struct call_data *data)
+{
+ call_rcu(&data->rcu_head, rcu_free_call_data);
+}

/**
* smp_call_function_mask(): Run a function on a set of other CPUs.
@@ -554,15 +562,14 @@ native_smp_call_function_mask(cpumask_t mask,
void (*func)(void *), void *info,
int wait)
{
- struct call_data_struct data;
+ struct call_data *data;
cpumask_t allbutself;
+ unsigned int flags;
int cpus;

/* Can deadlock when called with interrupts disabled */
WARN_ON(irqs_disabled());
-
- /* Holding any lock stops cpus from going down. */
- spin_lock(&call_lock);
+ WARN_ON(preemptible());

allbutself = cpu_online_map;
cpu_clear(smp_processor_id(), allbutself);
@@ -570,20 +577,37 @@ native_smp_call_function_mask(cpumask_t mask,
cpus_and(mask, mask, allbutself);
cpus = cpus_weight(mask);

- if (!cpus) {
- spin_unlock(&call_lock);
+ if (!cpus)
return 0;
- }

- data.func = func;
- data.info = info;
- atomic_set(&data.started, 0);
- data.wait = wait;
- if (wait)
- atomic_set(&data.finished, 0);
+ flags = wait ? CALL_WAIT : 0;
+ data = kmalloc(sizeof(struct call_data), GFP_ATOMIC);
+ if (unlikely(!data)) {
+ while (test_and_set_bit_lock(0, &call_fallback_used))
+ cpu_relax();
+ data = &call_data_fallback;
+ flags |= CALL_FALLBACK;
+ /* XXX: can IPI all to "synchronize" RCU? */
+ } else
+ flags |= CALL_DATA_ALLOC;
+
+ spin_lock_init(&data->lock);
+ data->func = func;
+ data->info = info;
+ data->flags = flags;
+ data->refs = cpus;
+ data->cpumask = mask;

- call_data = &data;
- mb();
+ local_irq_disable();
+ while (!spin_trylock(&call_lock)) {
+ local_irq_enable();
+ cpu_relax();
+ local_irq_disable();
+ }
+ /* could do ipi = list_empty(&dst->list) || !cpumask_ipi_pending() */
+ list_add_tail_rcu(&data->list, &call_queue);
+ spin_unlock(&call_lock);
+ local_irq_enable();

/* Send a message to other CPUs */
if (cpus_equal(mask, allbutself))
@@ -591,18 +615,111 @@ native_smp_call_function_mask(cpumask_t mask,
else
send_IPI_mask(mask, CALL_FUNCTION_VECTOR);

- /* Wait for response */
- while (atomic_read(&data.started) != cpus)
- cpu_relax();
-
- if (wait)
- while (atomic_read(&data.finished) != cpus)
+ if (wait) {
+ /* Wait for response */
+ while (data->flags)
cpu_relax();
- spin_unlock(&call_lock);
+ if (flags & CALL_DATA_ALLOC)
+ free_call_data(data);
+ else
+ clear_bit_unlock(0, &call_fallback_used);
+ }

return 0;
}

+/*
+ * __smp_call_function_single - Run a function on a specific CPU
+ * @cpu: The target CPU
+ * @data: Associated call data (function, argument and flags)
+ *
+ * Does not return until the remote CPU is nearly ready to execute the
+ * function, or has executed it if CALL_WAIT is set in @data->flags.
+ * Also see smp_call_function_single().
+ */
+void __smp_call_function_single(int cpu, struct call_single_data *data)
+{
+ cpumask_t mask = cpumask_of_cpu(cpu);
+ struct call_single_queue *dst;
+ unsigned long flags;
+ /* prevent preemption and reschedule on another processor */
+ int ipi;
+
+ /* Can deadlock when called with interrupts disabled */
+ WARN_ON((data->flags & CALL_WAIT) && irqs_disabled());
+
+ INIT_LIST_HEAD(&data->list);
+ dst = &per_cpu(call_single_queue, cpu);
+
+ spin_lock_irqsave(&dst->lock, flags);
+ ipi = list_empty(&dst->list);
+ list_add_tail(&data->list, &dst->list);
+ spin_unlock_irqrestore(&dst->lock, flags);
+
+ if (ipi)
+ send_IPI_mask(mask, CALL_FUNCTION_SINGLE_VECTOR);
+
+ if (data->flags & CALL_WAIT) {
+ /* Wait for response */
+ while (data->flags)
+ cpu_relax();
+ }
+}
+EXPORT_SYMBOL(__smp_call_function_single);
+
+/*
+ * smp_call_function_single - Run a function on a specific CPU
+ * @cpu: The target CPU.
+ * @func: The function to run. This must be fast and non-blocking.
+ * @info: An arbitrary pointer to pass to the function.
+ * @nonatomic: Currently unused.
+ * @wait: If true, wait until the function has completed on the other CPU.
+ *
+ * Returns 0 on success, else a negative status code. Does not return
+ * until the remote CPU is nearly ready to execute <func>, or has
+ * executed it if @wait is set.
+ */
+int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
+ int nonatomic, int wait)
+{
+ unsigned long flags;
+ /* prevent preemption and reschedule on another processor */
+ int me = get_cpu();
+ int ret = 0;
+
+ /* Can deadlock when called with interrupts disabled */
+ WARN_ON(wait && irqs_disabled());
+
+ if (cpu == me) {
+ local_irq_save(flags);
+ func(info);
+ local_irq_restore(flags);
+ } else {
+ struct call_single_data d;
+ struct call_single_data *data;
+
+ if (!wait) {
+ data = kmalloc(sizeof(*data), GFP_ATOMIC);
+ if (unlikely(!data)) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ data->flags = CALL_DATA_ALLOC;
+ } else {
+ data = &d;
+ data->flags = CALL_WAIT;
+ }
+
+ data->func = func;
+ data->info = info;
+ __smp_call_function_single(cpu, data);
+ }
+out:
+ put_cpu();
+ return ret;
+}
+EXPORT_SYMBOL(smp_call_function_single);
+
static void stop_this_cpu (void * dummy)
{
local_irq_disable();
@@ -622,14 +739,10 @@ static void stop_this_cpu (void * dummy)

static void native_smp_send_stop(void)
{
- /* Don't deadlock on the call lock in panic */
- int nolock = !spin_trylock(&call_lock);
unsigned long flags;

local_irq_save(flags);
- __smp_call_function(stop_this_cpu, NULL, 0, 0);
- if (!nolock)
- spin_unlock(&call_lock);
+ smp_call_function(stop_this_cpu, NULL, 0, 0);
disable_local_APIC();
local_irq_restore(flags);
}
@@ -647,29 +760,79 @@ void smp_reschedule_interrupt(struct pt_regs *regs)

void smp_call_function_interrupt(struct pt_regs *regs)
{
- void (*func) (void *info) = call_data->func;
- void *info = call_data->info;
- int wait = call_data->wait;
+ struct list_head *pos, *tmp;
+ int cpu = smp_processor_id();

ack_APIC_irq();
- /*
- * Notify initiating CPU that I've grabbed the data and am
- * about to execute the function
- */
- mb();
- atomic_inc(&call_data->started);
- /*
- * At this point the info structure may be out of scope unless wait==1
- */
irq_enter();
- (*func)(info);
+
+ list_for_each_safe_rcu(pos, tmp, &call_queue) {
+ struct call_data *data;
+ int refs;
+
+ data = list_entry(pos, struct call_data, list);
+ if (!cpu_isset(cpu, data->cpumask))
+ continue;
+
+ data->func(data->info);
+ spin_lock(&data->lock);
+ WARN_ON(!cpu_isset(cpu, data->cpumask));
+ cpu_clear(cpu, data->cpumask);
+ WARN_ON(data->refs == 0);
+ data->refs--;
+ refs = data->refs;
+ spin_unlock(&data->lock);
+
+ if (refs == 0) {
+ WARN_ON(cpus_weight(data->cpumask));
+ spin_lock(&call_lock);
+ list_del_rcu(&data->list);
+ spin_unlock(&call_lock);
+ if (data->flags & CALL_WAIT) {
+ smp_wmb();
+ data->flags = 0;
+ } else {
+ if (likely(data->flags & CALL_DATA_ALLOC))
+ free_call_data(data);
+ else
+ clear_bit_unlock(0, &call_fallback_used);
+ }
+ }
+ }
+
__get_cpu_var(irq_stat).irq_call_count++;
irq_exit();
+}

- if (wait) {
- mb();
- atomic_inc(&call_data->finished);
+void smp_call_function_single_interrupt(void)
+{
+ struct call_single_queue *q;
+ LIST_HEAD(list);
+
+ ack_APIC_irq();
+ irq_enter();
+
+ q = &__get_cpu_var(call_single_queue);
+ spin_lock(&q->lock);
+ list_replace_init(&q->list, &list);
+ spin_unlock(&q->lock);
+
+ while (!list_empty(&list)) {
+ struct call_single_data *data;
+
+ data = list_entry(list.next, struct call_single_data, list);
+ list_del(&data->list);
+
+ data->func(data->info);
+ if (data->flags & CALL_WAIT) {
+ smp_wmb();
+ data->flags = 0;
+ } else if (data->flags & CALL_DATA_ALLOC)
+ kfree(data);
}
+
+ __get_cpu_var(irq_stat).irq_call_count++;
+ irq_exit();
}

static int convert_apicid_to_cpu(int apic_id)
diff --git a/arch/x86/kernel/smpboot_32.c b/arch/x86/kernel/smpboot_32.c
index 579b9b7..d250388 100644
--- a/arch/x86/kernel/smpboot_32.c
+++ b/arch/x86/kernel/smpboot_32.c
@@ -1304,6 +1304,10 @@ void __init smp_intr_init(void)

/* IPI for generic function call */
set_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt);
+
+ /* IPI for single call function */
+ set_intr_gate(CALL_FUNCTION_SINGLE_VECTOR,
+ call_function_single_interrupt);
}

/*
diff --git a/arch/x86/kernel/smpcommon_32.c b/arch/x86/kernel/smpcommon_32.c
index 8bc38af..4590a67 100644
--- a/arch/x86/kernel/smpcommon_32.c
+++ b/arch/x86/kernel/smpcommon_32.c
@@ -46,37 +46,3 @@ int smp_call_function(void (*func) (void *info), void *info, int nonatomic,
return smp_call_function_mask(cpu_online_map, func, info, wait);
}
EXPORT_SYMBOL(smp_call_function);
-
-/**
- * smp_call_function_single - Run a function on a specific CPU
- * @cpu: The target CPU. Cannot be the calling CPU.
- * @func: The function to run. This must be fast and non-blocking.
- * @info: An arbitrary pointer to pass to the function.
- * @nonatomic: Unused.
- * @wait: If true, wait until function has completed on other CPUs.
- *
- * Returns 0 on success, else a negative status code.
- *
- * If @wait is true, then returns once @func has returned; otherwise
- * it returns just before the target cpu calls @func.
- */
-int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
- int nonatomic, int wait)
-{
- /* prevent preemption and reschedule on another processor */
- int ret;
- int me = get_cpu();
- if (cpu == me) {
- local_irq_disable();
- func(info);
- local_irq_enable();
- put_cpu();
- return 0;
- }
-
- ret = smp_call_function_mask(cpumask_of_cpu(cpu), func, info, wait);
-
- put_cpu();
- return ret;
-}
-EXPORT_SYMBOL(smp_call_function_single);
diff --git a/include/asm-x86/hw_irq_32.h b/include/asm-x86/hw_irq_32.h
index ea88054..a87b132 100644
--- a/include/asm-x86/hw_irq_32.h
+++ b/include/asm-x86/hw_irq_32.h
@@ -32,6 +32,7 @@ extern void (*const interrupt[NR_IRQS])(void);
void reschedule_interrupt(void);
void invalidate_interrupt(void);
void call_function_interrupt(void);
+void call_function_single_interrupt(void);
#endif

#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/include/asm-x86/mach-default/irq_vectors.h b/include/asm-x86/mach-default/irq_vectors.h
index 881c63c..ed7d495 100644
--- a/include/asm-x86/mach-default/irq_vectors.h
+++ b/include/asm-x86/mach-default/irq_vectors.h
@@ -48,6 +48,7 @@
#define INVALIDATE_TLB_VECTOR 0xfd
#define RESCHEDULE_VECTOR 0xfc
#define CALL_FUNCTION_VECTOR 0xfb
+#define CALL_FUNCTION_SINGLE_VECTOR 0xfa

#define THERMAL_APIC_VECTOR 0xf0
/*
--
1.5.4.GIT

2008-03-12 16:41:56

by Alan D. Brunelle

[permalink] [raw]
Subject: Re: [PATCH 0/7] IO CPU affinity testing series


Subject: [PATCH] Fixed race: using potentially invalid pointer

When data->flags & CSD_FLAG_ALLOC is true, the data could be freed by the other processor before we check for CSD_FLAG_WAIT.
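
In outline, the problem and the fix look like this (a simplified sketch
using the CSD_FLAG_* names from the patch below; the queueing and IPI
details are elided):

void __smp_call_function_single(int cpu, struct call_single_data *data)
{
	/*
	 * Latch the wait decision *before* the data is handed off.  Once
	 * the IPI is sent, a !CSD_FLAG_WAIT, CSD_FLAG_ALLOC data may
	 * already have been run and kfree()d by the target CPU, so
	 * reading data->flags afterwards is a use-after-free.
	 */
	int wait_done = data->flags & CSD_FLAG_WAIT;

	/* ... add data to the target CPU's queue and send the IPI ... */

	if (wait_done) {
		/* waiters pass stack-allocated data, so this is safe */
		while (data->flags)
			cpu_relax();
	}
}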

Also: removed old comment, doesn't quite fit anymore.

This is applied against Jens' git tree w/ the ia64 additional commit.

Signed-off-by: Alan D. Brunelle <[email protected]>
---
arch/ia64/kernel/smp.c | 5 ++---
arch/x86/kernel/smp_32.c | 5 ++---
arch/x86/kernel/smp_64.c | 5 ++---
3 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/arch/ia64/kernel/smp.c b/arch/ia64/kernel/smp.c
index 521bc52..ad153e2 100644
--- a/arch/ia64/kernel/smp.c
+++ b/arch/ia64/kernel/smp.c
@@ -407,8 +407,7 @@ void __smp_call_function_single(int cpu, struct call_single_data *data)
{
struct call_single_queue *dst;
unsigned long flags;
- /* prevent preemption and reschedule on another processor */
- int ipi;
+ int ipi, wait_done = data->flags & CSD_FLAG_WAIT;

/* Can deadlock when called with interrupts disabled */
WARN_ON((data->flags & CSD_FLAG_WAIT) && irqs_disabled());
@@ -424,7 +423,7 @@ void __smp_call_function_single(int cpu, struct call_single_data *data)
if (ipi)
send_IPI_single(cpu, IPI_CALL_FUNC_SINGLE);

- if (data->flags & CSD_FLAG_WAIT) {
+ if (wait_done) {
/* Wait for response */
while (data->flags)
cpu_relax();
diff --git a/arch/x86/kernel/smp_32.c b/arch/x86/kernel/smp_32.c
index dcbb89c..8239814 100644
--- a/arch/x86/kernel/smp_32.c
+++ b/arch/x86/kernel/smp_32.c
@@ -638,8 +638,7 @@ void __smp_call_function_single(int cpu, struct call_single_data *data)
cpumask_t mask = cpumask_of_cpu(cpu);
struct call_single_queue *dst;
unsigned long flags;
- /* prevent preemption and reschedule on another processor */
- int ipi;
+ int ipi, wait_done = data->flags & CSD_FLAG_WAIT;

/* Can deadlock when called with interrupts disabled */
WARN_ON((data->flags & CSD_FLAG_WAIT) && irqs_disabled());
@@ -655,7 +654,7 @@ void __smp_call_function_single(int cpu, struct call_single_data *data)
if (ipi)
send_IPI_mask(mask, CALL_FUNCTION_SINGLE_VECTOR);

- if (data->flags & CSD_FLAG_WAIT) {
+ if (wait_done) {
/* Wait for response */
while (data->flags)
cpu_relax();
diff --git a/arch/x86/kernel/smp_64.c b/arch/x86/kernel/smp_64.c
index 7e4e300..c89a4f7 100644
--- a/arch/x86/kernel/smp_64.c
+++ b/arch/x86/kernel/smp_64.c
@@ -458,8 +458,7 @@ void __smp_call_function_single(int cpu, struct call_single_data *data)
cpumask_t mask = cpumask_of_cpu(cpu);
struct call_single_queue *dst;
unsigned long flags;
- /* prevent preemption and reschedule on another processor */
- int ipi;
+ int ipi, wait_done = data->flags & CSD_FLAG_WAIT;

/* Can deadlock when called with interrupts disabled */
WARN_ON((data->flags & CSD_FLAG_WAIT) && irqs_disabled());
@@ -475,7 +474,7 @@ void __smp_call_function_single(int cpu, struct call_single_data *data)
if (ipi)
send_IPI_mask(mask, CALL_FUNCTION_SINGLE_VECTOR);

- if (data->flags & CSD_FLAG_WAIT) {
+ if (wait_done) {
/* Wait for response */
while (data->flags)
cpu_relax();
--
1.5.2.5

2008-03-12 17:54:57

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 0/7] IO CPU affinity testing series

On Wed, Mar 12 2008, Alan D. Brunelle wrote:
>
> Subject: [PATCH] Fixed race: using potentially invalid pointer
>
> When data->flags & CSD_FLAG_ALLOC is true, the data could be freed by the other processor before we check for CSD_FLAG_WAIT.

Oops, that was pretty dumb. Thanks!

> Also: removed old comment, doesn't quite fit anymore.
>
> This is applied against Jens' git tree w/ the ia64 additional commit.

Thanks, I'll split this into 3 parts and apply them.

--
Jens Axboe

2008-03-12 20:37:26

by Max Krasnyansky

[permalink] [raw]
Subject: Re: [PATCH 0/7] IO CPU affinity testing series

Jens Axboe wrote:
> Hi,
>
> Here's a new round of patches to play with io cpu affinity. It can,
> as always, also be found in the block git repo. The branch name is
> 'io-cpu-affinity'.
>
> The major change since last post is the abandonment of the kthread
> approach. It was definitely slower then may 'add IPI to signal remote
> block softirq' hack. So I decided to base this on the scalable
> smp_call_function_single() that Nick posted. I tweaked it a bit to
> make it more suitable for my use and also faster.
>
> As for functionality, the only change is that I added a bio hint
> that the submitter can use to ask for completion on the same CPU
> that submitted the IO. Pass in BIO_CPU_AFFINE for that to occur.
>
> Otherwise the modes are the same as last time:
>
> - You can set a specific cpumask for queuing IO, and the block layer
> will move submitters to one of those CPUs.
> - You can set a specific cpumask for completion of IO, in which case
> the block layer will move the completion to one of those CPUs.
> - You can set rq_affinity mode, in which case IOs will always be
> completed on the CPU that submitted them.
>
> Look in /sys/block/<dev>/queue/ for the three sysfs variables that
> modify this behaviour.
>
> I'd be interested in getting some testing done on this, to see if
> it really helps the larger end of the scale. Dave, I know you
> have a lot of experience in this area and would appreciate your
> input and/or testing. I'm not sure if any of the above modes will
> allow you to do what you need for eg XFS - if you want all meta data
> IO completed on one (or a set of) CPU(s), then I can add a mode
> that will allow you to play with that. Or if something else, give me
> some input and we can take it from there!

Very cool stuff. I think I can use it for cpu isolation purposes.
i.e. isolating a cpu from the io activity.

You may have noticed that I started a bunch of discussion on CPU isolation.
One thing that came out of that is the suggestion to use cpusets for managing
these affinity masks. We're still discussing the details, but the general idea is
to provide extra flags in the cpusets that enable/disable various activities
on the cpus that belong to the set.

For example in this particular case we'd have something like a "cpusets.io" flag
that would indicate whether cpus in the set are allowed to do the IO or not.
In other words:
/dev/cpuset/io (cpus=0,1,2; io=1)
/dev/cpuset/no-io (cpus=3,4,5; io=0)

I'm not sure whether this makes sense or not. One advantage is that it's more
dynamic and more flexible. If for example you add a cpu to the io cpuset it will
automatically start handling io requests.

btw What did you mean by "to see if it really helps the larger end of the
scale", what problem were you guys trying to solve ? I'm guessing cpu
isolation would probably be an unexpected user of io cpu affinity :).

Max

2008-03-13 12:13:32

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 0/7] IO CPU affinity testing series

On Wed, Mar 12 2008, Max Krasnyanskiy wrote:
> Jens Axboe wrote:
> >Hi,
> >
> >Here's a new round of patches to play with io cpu affinity. It can,
> >as always, also be found in the block git repo. The branch name is
> >'io-cpu-affinity'.
> >
> >The major change since last post is the abandonment of the kthread
> >approach. It was definitely slower then may 'add IPI to signal remote
> >block softirq' hack. So I decided to base this on the scalable
> >smp_call_function_single() that Nick posted. I tweaked it a bit to
> >make it more suitable for my use and also faster.
> >
> >As for functionality, the only change is that I added a bio hint
> >that the submitter can use to ask for completion on the same CPU
> >that submitted the IO. Pass in BIO_CPU_AFFINE for that to occur.
> >
> >Otherwise the modes are the same as last time:
> >
> >- You can set a specific cpumask for queuing IO, and the block layer
> > will move submitters to one of those CPUs.
> >- You can set a specific cpumask for completion of IO, in which case
> > the block layer will move the completion to one of those CPUs.
> >- You can set rq_affinity mode, in which case IOs will always be
> > completed on the CPU that submitted them.
> >
> >Look in /sys/block/<dev>/queue/ for the three sysfs variables that
> >modify this behaviour.
> >
> >I'd be interested in getting some testing done on this, to see if
> >it really helps the larger end of the scale. Dave, I know you
> >have a lot of experience in this area and would appreciate your
> >input and/or testing. I'm not sure if any of the above modes will
> >allow you to do what you need for eg XFS - if you want all meta data
> >IO completed on one (or a set of) CPU(s), then I can add a mode
> >that will allow you to play with that. Or if something else, give me
> >some input and we can take it from there!
>
> Very cool stuff. I think I can use it for cpu isolation purposes.
> i.e. isolating a cpu from the io activity.
>
> You may have noticed that I started a bunch of discussion on CPU isolation.
> One thing that came out of that is the suggestion to use cpusets for
> managing these affinity masks. We're still discussing the details, but the
> general idea is to provide extra flags in the cpusets that enable/disable
> various activities
> on the cpus that belong to the set.
>
> For example in this particular case we'd have something like a "cpusets.io"
> flag that would indicate whether cpus in the set are allowed to do the IO
> or not.
> In other words:
> /dev/cpuset/io (cpus=0,1,2; io=1)
> /dev/cpuset/no-io (cpus=3,4,5; io=0)
>
> I'm not sure whether this makes sense or not. One advantage is that it's
> more dynamic and more flexible. If for example you add a cpu to the io cpuset
> it will automatically start handling io requests.

The code posted here works on the queue level, whereas you want this to
be a global setting. So it'll require a bit of extra stuff to handle
that case, but the base infrastructure would not care.

> btw What did you mean by "to see if it really helps the larger end of the
> scale", what problem were you guys trying to solve ? I'm guessing cpu
> isolation would probably be an unexpected user of io cpu affinity :).

Nope, I didn't really consider isolation :-)

It's meant to speed up IO on larger SMP systems by reducing cache line
contention (or bouncing) by keeping data and/or locks local to a CPU (or
a set of CPUs).

--
Jens Axboe

2008-03-13 14:54:53

by Alan D. Brunelle

[permalink] [raw]
Subject: Re: [PATCH 0/7] IO CPU affinity testing series


Your suggestion worked, Jens; will do some benchmarking and try to figure out why on the side...

Subject: [PATCH] IA64 boots with direct call of generic init single data

Need to figure out why it needs to be done earlier on ia64.

Signed-off-by: Alan D. Brunelle <[email protected]>
---
arch/ia64/kernel/setup.c | 2 ++
arch/ia64/kernel/smp.c | 7 -------
2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
index 4aa9eae..36a0fe5 100644
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -518,6 +518,8 @@ setup_arch (char **cmdline_p)
acpi_boot_init();
#endif

+ generic_init_call_single_data();
+
#ifdef CONFIG_VT
if (!conswitchp) {
# if defined(CONFIG_DUMMY_CONSOLE)
diff --git a/arch/ia64/kernel/smp.c b/arch/ia64/kernel/smp.c
index d8ee005..04ba9f8 100644
--- a/arch/ia64/kernel/smp.c
+++ b/arch/ia64/kernel/smp.c
@@ -113,13 +113,6 @@ stop_this_cpu (void)

DEFINE_PER_CPU(struct call_single_queue, call_single_queue);

-int __cpuinit init_smp_call(void)
-{
- generic_init_call_single_data();
- return 0;
-}
-core_initcall(init_smp_call);
-
void
cpu_die(void)
{
--
1.5.2.5

2008-03-13 15:00:44

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 0/7] IO CPU affinity testing series

On Thu, Mar 13 2008, Alan D. Brunelle wrote:
>
> Your suggestion worked, Jens; will do some benchmarking and try to
> figure out why on the side...

Good!

> Subject: [PATCH] IA64 boots with direct call of generic init single
> data
>
> Need to figure out why it needs to be done earlier on ia64.

Sometimes I find that __init and friends need some include magic to
actually work, even if it compiles and links just fine. If you could
pick at it a bit once the benchmark is out of the way, that would be
great. This is a fine work-around for now.

--
Jens Axboe

2008-03-14 18:23:34

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 1/7] x86-64: introduce fast variant of smp_call_function_single()

Jens Axboe wrote:
> From: Nick Piggin <[email protected]>
>

Why is this necessary? How is smp_call_function_single slow?

J

> Signed-off-by: Jens Axboe <[email protected]>
> ---
> arch/x86/kernel/entry_64.S | 3 +
> arch/x86/kernel/i8259_64.c | 1 +
> arch/x86/kernel/smp_64.c | 303 +++++++++++++++++++++--------
> include/asm-x86/hw_irq_64.h | 4 +-
> include/asm-x86/mach-default/entry_arch.h | 1 +
> include/linux/smp.h | 2 +-
> 6 files changed, 232 insertions(+), 82 deletions(-)
>
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index c20c9e7..22caf56 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -713,6 +713,9 @@ END(invalidate_interrupt\num)
> ENTRY(call_function_interrupt)
> apicinterrupt CALL_FUNCTION_VECTOR,smp_call_function_interrupt
> END(call_function_interrupt)
> +ENTRY(call_function_single_interrupt)
> + apicinterrupt CALL_FUNCTION_SINGLE_VECTOR,smp_call_function_single_interrupt
> +END(call_function_single_interrupt)
> ENTRY(irq_move_cleanup_interrupt)
> apicinterrupt IRQ_MOVE_CLEANUP_VECTOR,smp_irq_move_cleanup_interrupt
> END(irq_move_cleanup_interrupt)
> diff --git a/arch/x86/kernel/i8259_64.c b/arch/x86/kernel/i8259_64.c
> index fa57a15..2b0b6d2 100644
> --- a/arch/x86/kernel/i8259_64.c
> +++ b/arch/x86/kernel/i8259_64.c
> @@ -493,6 +493,7 @@ void __init native_init_IRQ(void)
>
> /* IPI for generic function call */
> set_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt);
> + set_intr_gate(CALL_FUNCTION_SINGLE_VECTOR, call_function_single_interrupt);
>
> /* Low priority IPI to cleanup after moving an irq */
> set_intr_gate(IRQ_MOVE_CLEANUP_VECTOR, irq_move_cleanup_interrupt);
> diff --git a/arch/x86/kernel/smp_64.c b/arch/x86/kernel/smp_64.c
> index 2fd74b0..1196a12 100644
> --- a/arch/x86/kernel/smp_64.c
> +++ b/arch/x86/kernel/smp_64.c
> @@ -18,6 +18,7 @@
> #include <linux/kernel_stat.h>
> #include <linux/mc146818rtc.h>
> #include <linux/interrupt.h>
> +#include <linux/rcupdate.h>
>
> #include <asm/mtrr.h>
> #include <asm/pgalloc.h>
> @@ -295,21 +296,29 @@ void smp_send_reschedule(int cpu)
> send_IPI_mask(cpumask_of_cpu(cpu), RESCHEDULE_VECTOR);
> }
>
> +#define CALL_WAIT 0x01
> +#define CALL_FALLBACK 0x02
> /*
> * Structure and data for smp_call_function(). This is designed to minimise
> * static memory requirements. It also looks cleaner.
> */
> static DEFINE_SPINLOCK(call_lock);
>
> -struct call_data_struct {
> +struct call_data {
> + spinlock_t lock;
> + struct list_head list;
> void (*func) (void *info);
> void *info;
> - atomic_t started;
> - atomic_t finished;
> - int wait;
> + unsigned int flags;
> + unsigned int refs;
> + cpumask_t cpumask;
> + struct rcu_head rcu_head;
> };
>
> -static struct call_data_struct * call_data;
> +static LIST_HEAD(call_queue);
> +
> +static unsigned long call_fallback_used;
> +static struct call_data call_data_fallback;
>
> void lock_ipi_call_lock(void)
> {
> @@ -321,55 +330,47 @@ void unlock_ipi_call_lock(void)
> spin_unlock_irq(&call_lock);
> }
>
> -/*
> - * this function sends a 'generic call function' IPI to all other CPU
> - * of the system defined in the mask.
> - */
> -static int __smp_call_function_mask(cpumask_t mask,
> - void (*func)(void *), void *info,
> - int wait)
> -{
> - struct call_data_struct data;
> - cpumask_t allbutself;
> - int cpus;
>
> - allbutself = cpu_online_map;
> - cpu_clear(smp_processor_id(), allbutself);
> -
> - cpus_and(mask, mask, allbutself);
> - cpus = cpus_weight(mask);
> -
> - if (!cpus)
> - return 0;
> -
> - data.func = func;
> - data.info = info;
> - atomic_set(&data.started, 0);
> - data.wait = wait;
> - if (wait)
> - atomic_set(&data.finished, 0);
> +struct call_single_data {
> + struct list_head list;
> + void (*func) (void *info);
> + void *info;
> + unsigned int flags;
> +};
>
> - call_data = &data;
> - wmb();
> +struct call_single_queue {
> + spinlock_t lock;
> + struct list_head list;
> +};
> +static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);
>
> - /* Send a message to other CPUs */
> - if (cpus_equal(mask, allbutself))
> - send_IPI_allbutself(CALL_FUNCTION_VECTOR);
> - else
> - send_IPI_mask(mask, CALL_FUNCTION_VECTOR);
> +static unsigned long call_single_fallback_used;
> +static struct call_single_data call_single_data_fallback;
>
> - /* Wait for response */
> - while (atomic_read(&data.started) != cpus)
> - cpu_relax();
> +int __cpuinit init_smp_call(void)
> +{
> + int i;
>
> - if (!wait)
> - return 0;
> + for_each_cpu_mask(i, cpu_possible_map) {
> + spin_lock_init(&per_cpu(call_single_queue, i).lock);
> + INIT_LIST_HEAD(&per_cpu(call_single_queue, i).list);
> + }
> + return 0;
> +}
> +core_initcall(init_smp_call);
>
> - while (atomic_read(&data.finished) != cpus)
> - cpu_relax();
> +static void rcu_free_call_data(struct rcu_head *head)
> +{
> + struct call_data *data;
> + data = container_of(head, struct call_data, rcu_head);
> + kfree(data);
> +}
>
> - return 0;
> +static void free_call_data(struct call_data *data)
> +{
> + call_rcu(&data->rcu_head, rcu_free_call_data);
> }
> +
> /**
> * smp_call_function_mask(): Run a function on a set of other CPUs.
> * @mask: The set of cpus to run on. Must not include the current cpu.
> @@ -389,15 +390,69 @@ int smp_call_function_mask(cpumask_t mask,
> void (*func)(void *), void *info,
> int wait)
> {
> - int ret;
> + struct call_data *data;
> + cpumask_t allbutself;
> + unsigned int flags;
> + int cpus;
>
> /* Can deadlock when called with interrupts disabled */
> WARN_ON(irqs_disabled());
> + WARN_ON(preemptible());
> +
> + allbutself = cpu_online_map;
> + cpu_clear(smp_processor_id(), allbutself);
> +
> + cpus_and(mask, mask, allbutself);
> + cpus = cpus_weight(mask);
> +
> + if (!cpus)
> + return 0;
>
> - spin_lock(&call_lock);
> - ret = __smp_call_function_mask(mask, func, info, wait);
> + flags = wait ? CALL_WAIT : 0;
> + data = kmalloc(sizeof(struct call_data), GFP_ATOMIC);
> + if (unlikely(!data)) {
> + while (test_and_set_bit_lock(0, &call_fallback_used))
> + cpu_relax();
> + data = &call_data_fallback;
> + flags |= CALL_FALLBACK;
> + /* XXX: can IPI all to "synchronize" RCU? */
> + }
> +
> + spin_lock_init(&data->lock);
> + data->func = func;
> + data->info = info;
> + data->flags = flags;
> + data->refs = cpus;
> + data->cpumask = mask;
> +
> + local_irq_disable();
> + while (!spin_trylock(&call_lock)) {
> + local_irq_enable();
> + cpu_relax();
> + local_irq_disable();
> + }
> + /* could do ipi = list_empty(&dst->list) || !cpumask_ipi_pending() */
> + list_add_tail_rcu(&data->list, &call_queue);
> spin_unlock(&call_lock);
> - return ret;
> + local_irq_enable();
> +
> + /* Send a message to other CPUs */
> + if (cpus_equal(mask, allbutself))
> + send_IPI_allbutself(CALL_FUNCTION_VECTOR);
> + else
> + send_IPI_mask(mask, CALL_FUNCTION_VECTOR);
> +
> + if (wait) {
> + /* Wait for response */
> + while (data->flags)
> + cpu_relax();
> + if (likely(!(flags & CALL_FALLBACK)))
> + free_call_data(data);
> + else
> + clear_bit_unlock(0, &call_fallback_used);
> + }
> +
> + return 0;
> }
> EXPORT_SYMBOL(smp_call_function_mask);
>
> @@ -414,11 +469,11 @@ EXPORT_SYMBOL(smp_call_function_mask);
> * or is or has executed.
> */
>
> -int smp_call_function_single (int cpu, void (*func) (void *info), void *info,
> +int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
> int nonatomic, int wait)
> {
> /* prevent preemption and reschedule on another processor */
> - int ret, me = get_cpu();
> + int me = get_cpu();
>
> /* Can deadlock when called with interrupts disabled */
> WARN_ON(irqs_disabled());
> @@ -427,14 +482,54 @@ int smp_call_function_single (int cpu, void (*func) (void *info), void *info,
> local_irq_disable();
> func(info);
> local_irq_enable();
> - put_cpu();
> - return 0;
> - }
> + } else {
> + struct call_single_data d;
> + struct call_single_data *data;
> + struct call_single_queue *dst;
> + cpumask_t mask = cpumask_of_cpu(cpu);
> + unsigned int flags = wait ? CALL_WAIT : 0;
> + int ipi;
> +
> + if (!wait) {
> + data = kmalloc(sizeof(struct call_single_data), GFP_ATOMIC);
> + if (unlikely(!data)) {
> + while (test_and_set_bit_lock(0, &call_single_fallback_used))
> + cpu_relax();
> + data = &call_single_data_fallback;
> + flags |= CALL_FALLBACK;
> + }
> + } else {
> + data = &d;
> + }
> +
> + data->func = func;
> + data->info = info;
> + data->flags = flags;
> + dst = &per_cpu(call_single_queue, cpu);
> +
> + local_irq_disable();
> + while (!spin_trylock(&dst->lock)) {
> + local_irq_enable();
> + cpu_relax();
> + local_irq_disable();
> + }
> + ipi = list_empty(&dst->list);
> + list_add_tail(&data->list, &dst->list);
> + spin_unlock(&dst->lock);
> + local_irq_enable();
>
> - ret = smp_call_function_mask(cpumask_of_cpu(cpu), func, info, wait);
> + if (ipi)
> + send_IPI_mask(mask, CALL_FUNCTION_SINGLE_VECTOR);
> +
> + if (wait) {
> + /* Wait for response */
> + while (data->flags)
> + cpu_relax();
> + }
> + }
>
> put_cpu();
> - return ret;
> + return 0;
> }
> EXPORT_SYMBOL(smp_call_function_single);
>
> @@ -474,18 +569,13 @@ static void stop_this_cpu(void *dummy)
>
> void smp_send_stop(void)
> {
> - int nolock;
> unsigned long flags;
>
> if (reboot_force)
> return;
>
> - /* Don't deadlock on the call lock in panic */
> - nolock = !spin_trylock(&call_lock);
> local_irq_save(flags);
> - __smp_call_function_mask(cpu_online_map, stop_this_cpu, NULL, 0);
> - if (!nolock)
> - spin_unlock(&call_lock);
> + smp_call_function(stop_this_cpu, NULL, 0, 0);
> disable_local_APIC();
> local_irq_restore(flags);
> }
> @@ -503,28 +593,83 @@ asmlinkage void smp_reschedule_interrupt(void)
>
> asmlinkage void smp_call_function_interrupt(void)
> {
> - void (*func) (void *info) = call_data->func;
> - void *info = call_data->info;
> - int wait = call_data->wait;
> + struct list_head *pos, *tmp;
> + int cpu = smp_processor_id();
>
> ack_APIC_irq();
> - /*
> - * Notify initiating CPU that I've grabbed the data and am
> - * about to execute the function
> - */
> - mb();
> - atomic_inc(&call_data->started);
> - /*
> - * At this point the info structure may be out of scope unless wait==1
> - */
> exit_idle();
> irq_enter();
> - (*func)(info);
> +
> + list_for_each_safe_rcu(pos, tmp, &call_queue) {
> + struct call_data *data;
> + int refs;
> +
> + data = list_entry(pos, struct call_data, list);
> + if (!cpu_isset(cpu, data->cpumask))
> + continue;
> +
> + data->func(data->info);
> + spin_lock(&data->lock);
> + WARN_ON(!cpu_isset(cpu, data->cpumask));
> + cpu_clear(cpu, data->cpumask);
> + WARN_ON(data->refs == 0);
> + data->refs--;
> + refs = data->refs;
> + spin_unlock(&data->lock);
> +
> + if (refs == 0) {
> + WARN_ON(cpus_weight(data->cpumask));
> + spin_lock(&call_lock);
> + list_del_rcu(&data->list);
> + spin_unlock(&call_lock);
> + if (data->flags & CALL_WAIT) {
> + smp_wmb();
> + data->flags = 0;
> + } else {
> + if (likely(!(data->flags & CALL_FALLBACK)))
> + free_call_data(data);
> + else
> + clear_bit_unlock(0, &call_fallback_used);
> + }
> + }
> + }
> +
> add_pda(irq_call_count, 1);
> irq_exit();
> - if (wait) {
> - mb();
> - atomic_inc(&call_data->finished);
> +}
> +
> +asmlinkage void smp_call_function_single_interrupt(void)
> +{
> + struct call_single_queue *q;
> + LIST_HEAD(list);
> +
> + ack_APIC_irq();
> + exit_idle();
> + irq_enter();
> +
> + q = &__get_cpu_var(call_single_queue);
> + spin_lock(&q->lock);
> + list_replace_init(&q->list, &list);
> + spin_unlock(&q->lock);
> +
> + while (!list_empty(&list)) {
> + struct call_single_data *data;
> +
> + data = list_entry(list.next, struct call_single_data, list);
> + list_del(&data->list);
> +
> + data->func(data->info);
> + if (data->flags & CALL_WAIT) {
> + smp_wmb();
> + data->flags = 0;
> + } else {
> + if (likely(!(data->flags & CALL_FALLBACK)))
> + kfree(data);
> + else
> + clear_bit_unlock(0, &call_single_fallback_used);
> + }
> }
> + add_pda(irq_call_count, 1);
> + irq_exit();
> }
>
> diff --git a/include/asm-x86/hw_irq_64.h b/include/asm-x86/hw_irq_64.h
> index 312a58d..06ac80c 100644
> --- a/include/asm-x86/hw_irq_64.h
> +++ b/include/asm-x86/hw_irq_64.h
> @@ -68,8 +68,7 @@
> #define ERROR_APIC_VECTOR 0xfe
> #define RESCHEDULE_VECTOR 0xfd
> #define CALL_FUNCTION_VECTOR 0xfc
> -/* fb free - please don't readd KDB here because it's useless
> - (hint - think what a NMI bit does to a vector) */
> +#define CALL_FUNCTION_SINGLE_VECTOR 0xfb
> #define THERMAL_APIC_VECTOR 0xfa
> #define THRESHOLD_APIC_VECTOR 0xf9
> /* f8 free */
> @@ -102,6 +101,7 @@ void spurious_interrupt(void);
> void error_interrupt(void);
> void reschedule_interrupt(void);
> void call_function_interrupt(void);
> +void call_function_single_interrupt(void);
> void irq_move_cleanup_interrupt(void);
> void invalidate_interrupt0(void);
> void invalidate_interrupt1(void);
> diff --git a/include/asm-x86/mach-default/entry_arch.h b/include/asm-x86/mach-default/entry_arch.h
> index bc86146..9283b60 100644
> --- a/include/asm-x86/mach-default/entry_arch.h
> +++ b/include/asm-x86/mach-default/entry_arch.h
> @@ -13,6 +13,7 @@
> BUILD_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR)
> BUILD_INTERRUPT(invalidate_interrupt,INVALIDATE_TLB_VECTOR)
> BUILD_INTERRUPT(call_function_interrupt,CALL_FUNCTION_VECTOR)
> +BUILD_INTERRUPT(call_function_single_interrupt,CALL_FUNCTION_SINGLE_VECTOR)
> #endif
>
> /*
> diff --git a/include/linux/smp.h b/include/linux/smp.h
> index 55232cc..c938d26 100644
> --- a/include/linux/smp.h
> +++ b/include/linux/smp.h
> @@ -53,7 +53,6 @@ extern void smp_cpus_done(unsigned int max_cpus);
> * Call a function on all other processors
> */
> int smp_call_function(void(*func)(void *info), void *info, int retry, int wait);
> -
> int smp_call_function_single(int cpuid, void (*func) (void *info), void *info,
> int retry, int wait);
>
> @@ -92,6 +91,7 @@ static inline int up_smp_call_function(void (*func)(void *), void *info)
> }
> #define smp_call_function(func, info, retry, wait) \
> (up_smp_call_function(func, info))
> +
> #define on_each_cpu(func,info,retry,wait) \
> ({ \
> local_irq_disable(); \
>

2008-03-16 18:45:24

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 1/7] x86-64: introduce fast variant of smp_call_function_single()

On Fri, Mar 14 2008, Jeremy Fitzhardinge wrote:
> Jens Axboe wrote:
> >From: Nick Piggin <[email protected]>
> >
>
> Why is this necessary? How is smp_call_function_single slow?

Because it's completely serialized by the call_lock spinlock.
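
For reference, every caller in the current code funnels through a single
global call_data slot guarded by call_lock; condensed from the code these
patches remove (mask handling stripped down to the one-CPU case, helper
name made up), it is roughly:

static DEFINE_SPINLOCK(call_lock);
static struct call_data_struct *call_data;	/* one global slot for everyone */

static int old_call_one(int cpu, void (*func)(void *), void *info, int wait)
{
	struct call_data_struct data;

	data.func = func;
	data.info = info;
	atomic_set(&data.started, 0);
	data.wait = wait;
	if (wait)
		atomic_set(&data.finished, 0);

	spin_lock(&call_lock);		/* every caller, any CPU, same lock */
	call_data = &data;
	wmb();
	send_IPI_mask(cpumask_of_cpu(cpu), CALL_FUNCTION_VECTOR);

	while (atomic_read(&data.started) != 1)	/* spin with the lock held */
		cpu_relax();
	if (wait)
		while (atomic_read(&data.finished) != 1)
			cpu_relax();
	spin_unlock(&call_lock);	/* only now can the next caller proceed */

	return 0;
}

Two unrelated single calls from different CPUs therefore take turns on
call_lock, each holding it for a full IPI round trip.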

--
Jens Axboe

2008-03-16 23:01:37

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 1/7] x86-64: introduce fast variant of smp_call_function_single()

Jens Axboe wrote:
> On Fri, Mar 14 2008, Jeremy Fitzhardinge wrote:
>
>> Jens Axboe wrote:
>>
>>> From: Nick Piggin <[email protected]>
>>>
>>>
>> Why is this necessary? How is smp_call_function_single slow?
>>
>
> Because it's completely serialized by the call_lock spinlock.
>

Hm, yes. Would it be possible to implement smp_call_function_mask in a
generic way to avoid that? Turn the static structure into a per-cpu
request list?
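
For the single-CPU case the posted patches already do essentially this:
each CPU gets its own locked request list plus a dedicated vector.
Condensed from __smp_call_function_single() in this series (the helper
name below is illustrative):

struct call_single_queue {
	spinlock_t		lock;
	struct list_head	list;
};
static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);

static void queue_single_call(int cpu, struct call_single_data *data)
{
	struct call_single_queue *dst = &per_cpu(call_single_queue, cpu);
	unsigned long flags;
	int ipi;

	spin_lock_irqsave(&dst->lock, flags);	/* per-target lock, not global */
	ipi = list_empty(&dst->list);		/* IPI only if the list was idle */
	list_add_tail(&data->list, &dst->list);
	spin_unlock_irqrestore(&dst->lock, flags);

	if (ipi)
		send_IPI_mask(cpumask_of_cpu(cpu), CALL_FUNCTION_SINGLE_VECTOR);
}

Callers on different CPUs only contend when they target the same CPU, and
the lock is never held across the IPI.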

J

2008-03-17 02:25:00

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 1/7] x86-64: introduce fast variant of smp_call_function_single()

On Monday 17 March 2008 09:58, Jeremy Fitzhardinge wrote:
> Jens Axboe wrote:
> > On Fri, Mar 14 2008, Jeremy Fitzhardinge wrote:
> >> Jens Axboe wrote:
> >>> From: Nick Piggin <[email protected]>
> >>
> >> Why is this necessary? How is smp_call_function_single slow?
> >
> > Because it's completely serialized by the call_lock spinlock.
>
> Hm, yes. Would it be possible to implement smp_call_function_mask in a
> generic way to avoid that? Turn the static structure into a per-cpu
> request list?

Not really. The common cases (that I can see) are either call all,
or call one. In the call all case, you would have to touch every
other CPU's request list, and that's not really any better than
what I've done in my patchset for that.

There would presumably be some cutoff where it makes more sense to
queue events to the percpu IPI lists if you are only sending to a
few CPUs. That would be trivial to implement, but... what are the
use-cases for that? The big one that I really know of is user TLB
shootdown, but that has its own vector anyway.
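
A hypothetical cutoff along those lines could reuse the per-cpu single-call
path, roughly as below (SINGLE_CALL_CUTOFF and call_function_few() are
made-up names, not part of the posted patches, and error handling is
simplified):

#define SINGLE_CALL_CUTOFF	4	/* made-up threshold */

static int call_function_few(cpumask_t mask, void (*func)(void *), void *info)
{
	int cpu;

	if (cpus_weight(mask) > SINGLE_CALL_CUTOFF)
		return -EAGAIN;			/* fall back to the shared queue */

	for_each_cpu_mask(cpu, mask) {
		struct call_single_data *data;

		data = kmalloc(sizeof(*data), GFP_ATOMIC);
		if (!data)
			return -ENOMEM;		/* sketch: no fallback slot here */
		data->func = func;
		data->info = info;
		data->flags = CALL_DATA_ALLOC;	/* target CPU kfree()s it */
		__smp_call_function_single(cpu, data);
	}
	return 0;
}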

2008-03-17 07:25:28

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 1/7] x86-64: introduce fast variant of smp_call_function_single()

On Sun, Mar 16 2008, Jeremy Fitzhardinge wrote:
> Jens Axboe wrote:
> >On Fri, Mar 14 2008, Jeremy Fitzhardinge wrote:
> >
> >>Jens Axboe wrote:
> >>
> >>>From: Nick Piggin <[email protected]>
> >>>
> >>>
> >>Why is this necessary? How is smp_call_function_single slow?
> >>
> >
> >Because it's completely serialized by the call_lock spinlock.
> >
>
> Hm, yes. Would it be possible to implement smp_call_function_mask in a
> generic way to avoid that? Turn the static structure into a per-cpu
> request list?

Have you looked at the patches you are replying to? :-)

--
Jens Axboe