2008-02-03 09:53:14

by Nick Piggin

Subject: Re: [rfc] direct IO submission and completion scalability issues

On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
>
> Second experiment which we did was migrating the IO submission to the
> IO completion cpu. Instead of submitting the IO on the same cpu where the
> request arrived, in this experiment the IO submission gets migrated to the
> cpu that is processing IO completions(interrupt). This will minimize the
> access to remote cachelines (that happens in timers, slab, scsi layers). The
> IO submission request is forwarded to the kblockd thread on the cpu receiving
> the interrupts. As part of this, we also made kblockd thread on each cpu as the
> highest priority thread, so that IO gets submitted as soon as possible on the
> interrupt cpu without any delay. On x86_64 SMP platform with 16 cores, this
> resulted in 2% performance improvement and 3.3% improvement on two node ia64
> platform.
>
> Quick and dirty prototype patch(not meant for inclusion) for this io migration
> experiment is appended to this e-mail.
>
> Observation #1 mentioned above is also applicable to this experiment. CPUs
> processing interrupts will now have to cater to IO submission/processing
> load as well.
>
> Observation #2: This introduces some migration overhead during IO submission.
> With the current prototype, every incoming IO request results in an IPI and
> context switch(to kblockd thread) on the interrupt processing cpu.
> This issue needs to be addressed and main challenge to address is
> the efficient mechanism of doing this IO migration(how much batching to do and
> when to send the migrate request?), so that we don't delay the IO much and at
> the same point, don't cause much overhead during migration.

Hi guys,

Just had another way we might do this. Migrate the completions out to
the submitting CPUs rather than migrate submission into the completing
CPU.

I've got a basic patch that passes some stress testing. It seems fairly
simple to do at the block layer, and the bulk of the patch involves
introducing a scalable smp_call_function for it.

Now it could be optimised more by looking at batching up IPIs or
optimising the call function path, or even migrating the completion event
at a different level...

However, this is a first cut. It actually seems like it might be taking
slightly more CPU to process block IO (~0.2%)... however, this is on my
dual core system that shares an llc, which means that there are very few
cache benefits to the migration, but non-zero overhead. So on multisocket
systems hopefully it might get to positive territory.

---

Index: linux-2.6/arch/x86/kernel/smp_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp_64.c
+++ linux-2.6/arch/x86/kernel/smp_64.c
@@ -321,6 +321,99 @@ void unlock_ipi_call_lock(void)
spin_unlock_irq(&call_lock);
}

+struct call_single_data {
+ struct list_head list;
+ void (*func) (void *info);
+ void *info;
+ int wait;
+};
+
+struct call_single_queue {
+ spinlock_t lock;
+ struct list_head list;
+};
+static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);
+
+int __cpuinit init_smp_call(void)
+{
+ int i;
+
+ for_each_cpu_mask(i, cpu_possible_map) {
+ spin_lock_init(&per_cpu(call_single_queue, i).lock);
+ INIT_LIST_HEAD(&per_cpu(call_single_queue, i).list);
+ }
+ return 0;
+}
+core_initcall(init_smp_call);
+
+/*
+ * this function sends a 'generic call function' IPI to a single
+ * target CPU.
+ */
+int smp_call_function_fast(int cpu, void (*func)(void *), void *info,
+ int wait)
+{
+ struct call_single_data *data;
+ struct call_single_queue *dst = &per_cpu(call_single_queue, cpu);
+ cpumask_t mask = cpumask_of_cpu(cpu);
+ int ipi;
+
+ data = kmalloc(sizeof(struct call_single_data), GFP_ATOMIC);
+ data->func = func;
+ data->info = info;
+ data->wait = wait;
+
+ spin_lock_irq(&dst->lock);
+ ipi = list_empty(&dst->list);
+ list_add_tail(&data->list, &dst->list);
+ spin_unlock_irq(&dst->lock);
+
+ if (ipi)
+ send_IPI_mask(mask, CALL_FUNCTION_SINGLE_VECTOR);
+
+ if (wait) {
+ /* Wait for response */
+ while (data->wait)
+ cpu_relax();
+ kfree(data);
+ }
+
+ return 0;
+}
+
+asmlinkage void smp_call_function_fast_interrupt(void)
+{
+ struct call_single_queue *q;
+ unsigned long flags;
+ LIST_HEAD(list);
+
+ ack_APIC_irq();
+
+ q = &__get_cpu_var(call_single_queue);
+ spin_lock_irqsave(&q->lock, flags);
+ list_replace_init(&q->list, &list);
+ spin_unlock_irqrestore(&q->lock, flags);
+
+ exit_idle();
+ irq_enter();
+ while (!list_empty(&list)) {
+ struct call_single_data *data;
+
+ data = list_entry(list.next, struct call_single_data, list);
+ list_del(&data->list);
+
+ data->func(data->info);
+ if (data->wait) {
+ smp_mb();
+ data->wait = 0;
+ } else {
+ kfree(data);
+ }
+ }
+ add_pda(irq_call_count, 1);
+ irq_exit();
+}
+
/*
* this function sends a 'generic call function' IPI to all other CPU
* of the system defined in the mask.
Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c
+++ linux-2.6/block/blk-core.c
@@ -1604,6 +1604,13 @@ static int __end_that_request_first(stru
return 1;
}

+static void blk_done_softirq_other(void *data)
+{
+ struct request *rq = data;
+
+ blk_complete_request(rq);
+}
+
/*
* splice the completion data to a local structure and hand off to
* process_completion_queue() to complete the requests
@@ -1622,7 +1629,15 @@ static void blk_done_softirq(struct soft

rq = list_entry(local_list.next, struct request, donelist);
list_del_init(&rq->donelist);
- rq->q->softirq_done_fn(rq);
+ if (rq->submission_cpu != smp_processor_id()) {
+ /*
+ * Could batch up IPIs here, but we should measure how
+ * often blk_done_softirq gets a large batch...
+ */
+ smp_call_function_fast(rq->submission_cpu,
+ blk_done_softirq_other, rq, 0);
+ } else
+ rq->q->softirq_done_fn(rq);
}
}

Index: linux-2.6/include/asm-x86/hw_irq_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/hw_irq_64.h
+++ linux-2.6/include/asm-x86/hw_irq_64.h
@@ -68,8 +68,7 @@
#define ERROR_APIC_VECTOR 0xfe
#define RESCHEDULE_VECTOR 0xfd
#define CALL_FUNCTION_VECTOR 0xfc
-/* fb free - please don't readd KDB here because it's useless
- (hint - think what a NMI bit does to a vector) */
+#define CALL_FUNCTION_SINGLE_VECTOR 0xfb
#define THERMAL_APIC_VECTOR 0xfa
#define THRESHOLD_APIC_VECTOR 0xf9
/* f8 free */
@@ -102,6 +101,7 @@ void spurious_interrupt(void);
void error_interrupt(void);
void reschedule_interrupt(void);
void call_function_interrupt(void);
+void call_function_fast_interrupt(void);
void irq_move_cleanup_interrupt(void);
void invalidate_interrupt0(void);
void invalidate_interrupt1(void);
Index: linux-2.6/include/linux/smp.h
===================================================================
--- linux-2.6.orig/include/linux/smp.h
+++ linux-2.6/include/linux/smp.h
@@ -53,6 +53,7 @@ extern void smp_cpus_done(unsigned int m
* Call a function on all other processors
*/
int smp_call_function(void(*func)(void *info), void *info, int retry, int wait);
+int smp_call_function_fast(int cpuid, void(*func)(void *info), void *info, int wait);

int smp_call_function_single(int cpuid, void (*func) (void *info), void *info,
int retry, int wait);
@@ -92,6 +93,11 @@ static inline int up_smp_call_function(v
}
#define smp_call_function(func, info, retry, wait) \
(up_smp_call_function(func, info))
+static inline int smp_call_function_fast(int cpuid, void(*func)(void *info), void *info, int wait)
+{
+ return 0;
+}
+
#define on_each_cpu(func,info,retry,wait) \
({ \
local_irq_disable(); \
Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c
+++ linux-2.6/block/elevator.c
@@ -648,6 +648,8 @@ void elv_insert(struct request_queue *q,
void __elv_add_request(struct request_queue *q, struct request *rq, int where,
int plug)
{
+ rq->submission_cpu = smp_processor_id();
+
if (q->ordcolor)
rq->cmd_flags |= REQ_ORDERED_COLOR;

Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h
+++ linux-2.6/include/linux/blkdev.h
@@ -208,6 +208,8 @@ struct request {

int ref_count;

+ int submission_cpu;
+
/*
* when request is used as a packet command carrier
*/
Index: linux-2.6/arch/x86/kernel/entry_64.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/entry_64.S
+++ linux-2.6/arch/x86/kernel/entry_64.S
@@ -696,6 +696,9 @@ END(invalidate_interrupt\num)
ENTRY(call_function_interrupt)
apicinterrupt CALL_FUNCTION_VECTOR,smp_call_function_interrupt
END(call_function_interrupt)
+ENTRY(call_function_fast_interrupt)
+ apicinterrupt CALL_FUNCTION_SINGLE_VECTOR,smp_call_function_fast_interrupt
+END(call_function_fast_interrupt)
ENTRY(irq_move_cleanup_interrupt)
apicinterrupt IRQ_MOVE_CLEANUP_VECTOR,smp_irq_move_cleanup_interrupt
END(irq_move_cleanup_interrupt)
Index: linux-2.6/arch/x86/kernel/i8259_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/i8259_64.c
+++ linux-2.6/arch/x86/kernel/i8259_64.c
@@ -493,6 +493,7 @@ void __init native_init_IRQ(void)

/* IPI for generic function call */
set_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt);
+ set_intr_gate(CALL_FUNCTION_SINGLE_VECTOR, call_function_fast_interrupt);

/* Low priority IPI to cleanup after moving an irq */
set_intr_gate(IRQ_MOVE_CLEANUP_VECTOR, irq_move_cleanup_interrupt);
Index: linux-2.6/include/asm-x86/mach-default/entry_arch.h
===================================================================
--- linux-2.6.orig/include/asm-x86/mach-default/entry_arch.h
+++ linux-2.6/include/asm-x86/mach-default/entry_arch.h
@@ -13,6 +13,7 @@
BUILD_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR)
BUILD_INTERRUPT(invalidate_interrupt,INVALIDATE_TLB_VECTOR)
BUILD_INTERRUPT(call_function_interrupt,CALL_FUNCTION_VECTOR)
+BUILD_INTERRUPT(call_function_fast_interrupt,CALL_FUNCTION_SINGLE_VECTOR)
#endif

/*


2008-02-03 10:53:17

by Pekka Enberg

Subject: Re: [rfc] direct IO submission and completion scalability issues

Hi Nick,

On Feb 3, 2008 11:52 AM, Nick Piggin <[email protected]> wrote:
> +asmlinkage void smp_call_function_fast_interrupt(void)
> +{

[snip]

> + while (!list_empty(&list)) {
> + struct call_single_data *data;
> +
> + data = list_entry(list.next, struct call_single_data, list);
> + list_del(&data->list);
> +
> + data->func(data->info);
> + if (data->wait) {
> + smp_mb();
> + data->wait = 0;

Why do we need smp_mb() here (maybe add a comment to keep
Andrew/checkpatch happy)?

Pekka

2008-02-03 11:58:18

by Nick Piggin

Subject: Re: [rfc] direct IO submission and completion scalability issues

On Sun, Feb 03, 2008 at 12:53:02PM +0200, Pekka Enberg wrote:
> Hi Nick,
>
> On Feb 3, 2008 11:52 AM, Nick Piggin <[email protected]> wrote:
> > +asmlinkage void smp_call_function_fast_interrupt(void)
> > +{
>
> [snip]
>
> > + while (!list_empty(&list)) {
> > + struct call_single_data *data;
> > +
> > + data = list_entry(list.next, struct call_single_data, list);
> > + list_del(&data->list);
> > +
> > + data->func(data->info);
> > + if (data->wait) {
> > + smp_mb();
> > + data->wait = 0;
>
> Why do we need smp_mb() here (maybe add a comment to keep
> Andrew/checkpatch happy)?

Yeah, definitely... it's just a really basic RFC, but I should get
into the habit of just doing it anyway.

Thanks,
Nick

2008-02-04 02:11:44

by David Chinner

Subject: Re: [rfc] direct IO submission and completion scalability issues

On Sun, Feb 03, 2008 at 10:52:52AM +0100, Nick Piggin wrote:
> On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
> >
> > Second experiment which we did was migrating the IO submission to the
> > IO completion cpu. Instead of submitting the IO on the same cpu where the
> > request arrived, in this experiment the IO submission gets migrated to the
> > cpu that is processing IO completions(interrupt). This will minimize the
> > access to remote cachelines (that happens in timers, slab, scsi layers). The
> > IO submission request is forwarded to the kblockd thread on the cpu receiving
> > the interrupts. As part of this, we also made kblockd thread on each cpu as the
> > highest priority thread, so that IO gets submitted as soon as possible on the
> > interrupt cpu without any delay. On x86_64 SMP platform with 16 cores, this
> > resulted in 2% performance improvement and 3.3% improvement on two node ia64
> > platform.
> >
> > Quick and dirty prototype patch(not meant for inclusion) for this io migration
> > experiment is appended to this e-mail.
> >
> > Observation #1 mentioned above is also applicable to this experiment. CPUs
> > processing interrupts will now have to cater to IO submission/processing
> > load as well.
> >
> > Observation #2: This introduces some migration overhead during IO submission.
> > With the current prototype, every incoming IO request results in an IPI and
> > context switch(to kblockd thread) on the interrupt processing cpu.
> > This issue needs to be addressed and main challenge to address is
> > the efficient mechanism of doing this IO migration(how much batching to do and
> > when to send the migrate request?), so that we don't delay the IO much and at
> > the same point, don't cause much overhead during migration.
>
> Hi guys,
>
> Just had another way we might do this. Migrate the completions out to
> the submitting CPUs rather than migrate submission into the completing
> CPU.

Hi Nick,

When Matthew was describing this work at an LCA presentation (not
sure whether you were at that presentation or not), Zach came up
with the idea that allowing the submitting application to control the
CPU on which the io completion processing occurs would be a good
approach to try. That is, we submit a "completion cookie" with the
bio that indicates where we want completion to run, rather than
dictating that completion runs on the submission CPU.

The reasoning is that only the higher level context really knows
what is optimal, and that changes from application to application.
The "complete on the submission CPU" policy _may_ be more optimal
for database workloads, but it is definitely suboptimal for XFS and
transaction I/O completion handling because it simply drags a bunch
of global filesystem state around between all the CPUs running
completions. In that case, we really only want a single CPU to be
handling the completions.....

(Zach - please correct me if I've missed anything)

Looking at your patch - if you turn it around so that the
"submission CPU" field can be specified as the "completion cpu" then
I think the patch will expose the policy knobs needed to do the
above. Add the bio -> rq linkage to enable filesystems and DIO to
control the completion CPU field and we're almost done.... ;)

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2008-02-04 04:17:04

by Arjan van de Ven

Subject: Re: [rfc] direct IO submission and completion scalability issues

David Chinner wrote:
> Hi Nick,
>
> When Matthew was describing this work at an LCA presentation (not
> sure whether you were at that presentation or not), Zach came up
> with the idea that allowing the submitting application control the
> CPU that the io completion processing was occurring would be a good
> approach to try. That is, we submit a "completion cookie" with the
> bio that indicates where we want completion to run, rather than
> dictating that completion runs on the submission CPU.
>
> The reasoning is that only the higher level context really knows
> what is optimal, and that changes from application to application.

well.. kinda. One of the really hard parts of the submit/completion stuff is that
the slab/slob/slub/slib allocator ends up basically "cycling" memory through the system;
there's a sink of free memory on all the submission cpus and a source of free memory
on the completion cpu. I don't think applications are capable of working out what is
best in this scenario..

2008-02-04 04:41:10

by David Chinner

Subject: Re: [rfc] direct IO submission and completion scalability issues

On Sun, Feb 03, 2008 at 08:14:45PM -0800, Arjan van de Ven wrote:
> David Chinner wrote:
> >Hi Nick,
> >
> >When Matthew was describing this work at an LCA presentation (not
> >sure whether you were at that presentation or not), Zach came up
> >with the idea that allowing the submitting application control the
> >CPU that the io completion processing was occurring would be a good
> >approach to try. That is, we submit a "completion cookie" with the
> >bio that indicates where we want completion to run, rather than
> >dictating that completion runs on the submission CPU.
> >
> >The reasoning is that only the higher level context really knows
> >what is optimal, and that changes from application to application.
>
> well.. kinda. One of the really hard parts of the submit/completion stuff
> is that the slab/slob/slub/slib allocator ends up basically "cycling"
> memory through the system; there's a sink of free memory on all the
> submission cpus and a source of free memory on the completion cpu. I
> don't think applications are capable of working out what is best in this
> scenario..

Applications as in "anything that calls submit_bio()", i.e. direct I/O,
filesystems, etc. That is, in-kernel applications, not userspace.

In XFS, simultaneous io completion on multiple CPUs can contribute greatly to
contention of global filesystem structures. By controlling where completions are
delivered, we can greatly reduce this contention, especially on large,
multipathed devices that deliver interrupts to multiple CPUs that may be far
distant from each other. We have all the state and intelligence necessary
to control this sort of policy decision effectively.....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2008-02-04 10:10:19

by Nick Piggin

Subject: Re: [rfc] direct IO submission and completion scalability issues

On Mon, Feb 04, 2008 at 03:40:20PM +1100, David Chinner wrote:
> On Sun, Feb 03, 2008 at 08:14:45PM -0800, Arjan van de Ven wrote:
> > David Chinner wrote:
> > >Hi Nick,
> > >
> > >When Matthew was describing this work at an LCA presentation (not
> > >sure whether you were at that presentation or not), Zach came up
> > >with the idea that allowing the submitting application control the
> > >CPU that the io completion processing was occurring would be a good
> > >approach to try. That is, we submit a "completion cookie" with the
> > >bio that indicates where we want completion to run, rather than
> > >dictating that completion runs on the submission CPU.
> > >
> > >The reasoning is that only the higher level context really knows
> > >what is optimal, and that changes from application to application.
> >
> > well.. kinda. One of the really hard parts of the submit/completion
> > stuff is that the slab/slob/slub/slib allocator ends up basically
> > "cycling" memory through the system; there's a sink of free memory on
> > all the submission cpus and a source of free memory on the completion
> > cpu. I don't think applications are capable of working out what is
> > best in this scenario..
>
> Applications as in "anything that calls submit_bio()". i.e, direct I/O,
> filesystems, etc. i.e. not userspace but in-kernel applications.
>
> In XFS, simultaneous io completion on multiple CPUs can contribute greatly to
> contention of global structures in XFS. By controlling where completions are
> delivered, we can greatly reduce this contention, especially on large,
> multipathed devices that deliver interrupts to multiple CPUs that may be far
> distant from each other. We have all the state and intelligence necessary
> to control this sort of policy decision effectively.....

Hi Dave,

Thanks for taking a look at the patch... yes it would be easy to turn
this bit of state into a more flexible cookie (eg. complete on submitter;
complete on interrupt; complete on CPUx/nodex etc.). Maybe we'll need
something that complex... I'm not sure, it would probably need more
fine tuning. That said, I just wanted to get this approach out there
early for rfc.

I guess both you and Arjan have points. For a _lot_ of things, completing
on the same CPU as the submitter is a win (whether that means migrating
submission as in the original patch in this thread, or migrating
completion like I do).

You get better behaviour in the slab and page allocators, and locality
and cache hotness of memory. For example, I guess in a filesystem /
pagecache heavy workload, you have to touch each struct page, buffer head,
fs private state, and also often have to wake the thread for completion.
Much of this data has just been touched at submit time, so doing this on
the same CPU is nice...

I'm surprised that the xfs global state bouncing would outweigh the
bouncing of all the per-page/block/bio/request/etc data that gets touched
during completion. We'll see.

2008-02-04 10:12:59

by Jens Axboe

Subject: Re: [rfc] direct IO submission and completion scalability issues

On Sun, Feb 03 2008, Nick Piggin wrote:
> On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
> >
> > Second experiment which we did was migrating the IO submission to the
> > IO completion cpu. Instead of submitting the IO on the same cpu where the
> > request arrived, in this experiment the IO submission gets migrated to the
> > cpu that is processing IO completions(interrupt). This will minimize the
> > access to remote cachelines (that happens in timers, slab, scsi layers). The
> > IO submission request is forwarded to the kblockd thread on the cpu receiving
> > the interrupts. As part of this, we also made kblockd thread on each cpu as the
> > highest priority thread, so that IO gets submitted as soon as possible on the
> > interrupt cpu without any delay. On x86_64 SMP platform with 16 cores, this
> > resulted in 2% performance improvement and 3.3% improvement on two node ia64
> > platform.
> >
> > Quick and dirty prototype patch(not meant for inclusion) for this io migration
> > experiment is appended to this e-mail.
> >
> > Observation #1 mentioned above is also applicable to this experiment. CPUs
> > processing interrupts will now have to cater to IO submission/processing
> > load as well.
> >
> > Observation #2: This introduces some migration overhead during IO submission.
> > With the current prototype, every incoming IO request results in an IPI and
> > context switch(to kblockd thread) on the interrupt processing cpu.
> > This issue needs to be addressed and main challenge to address is
> > the efficient mechanism of doing this IO migration(how much batching to do and
> > when to send the migrate request?), so that we don't delay the IO much and at
> > the same point, don't cause much overhead during migration.
>
> Hi guys,
>
> Just had another way we might do this. Migrate the completions out to
> the submitting CPUs rather than migrate submission into the completing
> CPU.
>
> I've got a basic patch that passes some stress testing. It seems fairly
> simple to do at the block layer, and the bulk of the patch involves
> introducing a scalable smp_call_function for it.
>
> Now it could be optimised more by looking at batching up IPIs or
> optimising the call function path or even mirating the completion event
> at a different level...
>
> However, this is a first cut. It actually seems like it might be taking
> slightly more CPU to process block IO (~0.2%)... however, this is on my
> dual core system that shares an llc, which means that there are very few
> cache benefits to the migration, but non-zero overhead. So on multisocket
> systems hopefully it might get to positive territory.

That's pretty funny, I did pretty much the exact same thing last week!
The primary difference between yours and mine is that I used a more
private interface to signal a softirq raise on another CPU, instead of
allocating call data and exposing a generic interface. That put the
locking in blk-core instead, turning blk_cpu_done into a structure with
a lock and list_head instead of just being a list head, and intercepted
at blk_complete_request() time instead of waiting for an already raised
softirq on that CPU.

Didn't get around to any performance testing yet, though. Will try and
clean it up a bit and do that.

--
Jens Axboe

2008-02-04 10:30:23

by Andi Kleen

Subject: Re: [rfc] direct IO submission and completion scalability issues

> + q = &__get_cpu_var(call_single_queue);
> + spin_lock_irqsave(&q->lock, flags);
> + list_replace_init(&q->list, &list);
> + spin_unlock_irqrestore(&q->lock, flags);

I think you could do that lockless if you use a similar data structure
as netchannels (essentially a fixed size single buffer queue with atomic
exchange of the first/last pointers) and not using a list. That would avoid
at least one bounce for the lock and likely another one for the list
manipulation.

Also the right way would be to not add a second mechanism for this,
but fix the standard smp_call_function_single() to support it.

-Andi

2008-02-04 10:31:46

by Nick Piggin

Subject: Re: [rfc] direct IO submission and completion scalability issues

On Mon, Feb 04, 2008 at 11:12:44AM +0100, Jens Axboe wrote:
> On Sun, Feb 03 2008, Nick Piggin wrote:
> > On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
> >
> > Hi guys,
> >
> > Just had another way we might do this. Migrate the completions out to
> > the submitting CPUs rather than migrate submission into the completing
> > CPU.
> >
> > I've got a basic patch that passes some stress testing. It seems fairly
> > simple to do at the block layer, and the bulk of the patch involves
> > introducing a scalable smp_call_function for it.
> >
> > Now it could be optimised more by looking at batching up IPIs or
> > optimising the call function path or even migrating the completion event
> > at a different level...
> >
> > However, this is a first cut. It actually seems like it might be taking
> > slightly more CPU to process block IO (~0.2%)... however, this is on my
> > dual core system that shares an llc, which means that there are very few
> > cache benefits to the migration, but non-zero overhead. So on multisocket
> > systems hopefully it might get to positive territory.
>
> That's pretty funny, I did pretty much the exact same thing last week!

Oh nice ;)


> The primary difference between yours and mine is that I used a more
> private interface to signal a softirq raise on another CPU, instead of
> allocating call data and exposing a generic interface. That put the
> locking in blk-core instead, turning blk_cpu_done into a structure with
> a lock and list_head instead of just being a list head, and intercepted
> at blk_complete_request() time instead of waiting for an already raised
> softirq on that CPU.

Yeah I was looking at that... didn't really want to add the spinlock
overhead to the non-migration case. Anyway, I guess that sort of
fine implementation detail is going to have to be sorted out with
results.

2008-02-04 10:34:08

by Jens Axboe

Subject: Re: [rfc] direct IO submission and completion scalability issues

On Mon, Feb 04 2008, Nick Piggin wrote:
> On Mon, Feb 04, 2008 at 11:12:44AM +0100, Jens Axboe wrote:
> > On Sun, Feb 03 2008, Nick Piggin wrote:
> > > On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
> > >
> > > Hi guys,
> > >
> > > Just had another way we might do this. Migrate the completions out to
> > > the submitting CPUs rather than migrate submission into the completing
> > > CPU.
> > >
> > > I've got a basic patch that passes some stress testing. It seems fairly
> > > simple to do at the block layer, and the bulk of the patch involves
> > > introducing a scalable smp_call_function for it.
> > >
> > > Now it could be optimised more by looking at batching up IPIs or
> > > optimising the call function path or even migrating the completion event
> > > at a different level...
> > >
> > > However, this is a first cut. It actually seems like it might be taking
> > > slightly more CPU to process block IO (~0.2%)... however, this is on my
> > > dual core system that shares an llc, which means that there are very few
> > > cache benefits to the migration, but non-zero overhead. So on multisocket
> > > systems hopefully it might get to positive territory.
> >
> > That's pretty funny, I did pretty much the exact same thing last week!
>
> Oh nice ;)
>
>
> > The primary difference between yours and mine is that I used a more
> > private interface to signal a softirq raise on another CPU, instead of
> > allocating call data and exposing a generic interface. That put the
> > locking in blk-core instead, turning blk_cpu_done into a structure with
> > a lock and list_head instead of just being a list head, and intercepted
> > at blk_complete_request() time instead of waiting for an already raised
> > softirq on that CPU.
>
> Yeah I was looking at that... didn't really want to add the spinlock
> overhead to the non-migration case. Anyway, I guess that sort of
> fine implementation detail is going to have to be sorted out with
> results.

As Andi mentions, we can look into making that lockless. For the initial
implementation I didn't really care, just wanted something to play with
that would nicely allow me to control both the submit and complete side
of the affinity issue.

--
Jens Axboe

2008-02-04 18:21:28

by Zach Brown

Subject: Re: [rfc] direct IO submission and completion scalability issues

[ ugh, still jet lagged. ]

> Hi Nick,
>
> When Matthew was describing this work at an LCA presentation (not
> sure whether you were at that presentation or not), Zach came up
> with the idea that allowing the submitting application control the
> CPU that the io completion processing was occurring would be a good
> approach to try. That is, we submit a "completion cookie" with the
> bio that indicates where we want completion to run, rather than
> dictating that completion runs on the submission CPU.
>
> The reasoning is that only the higher level context really knows
> what is optimal, and that changes from application to application.
> The "complete on the submission CPU" policy _may_ be more optimal
> for database workloads, but it is definitely suboptimal for XFS and
> transaction I/O completion handling because it simply drags a bunch
> of global filesystem state around between all the CPUs running
> completions. In that case, we really only want a single CPU to be
> handling the completions.....
>
> (Zach - please correct me if I've missed anything)

Yeah, I think Nick's patch (and Jens' approach, presumably) is just the
sort of thing we were hoping for when discussing this during Matthew's talk.

I was imagining the patch a little bit differently (per-cpu tasks, do a
wake_up from the driver instead of cpu nr testing up in blk, work
queues, whatever), but we know how to iron out these kinds of details ;).

> Looking at your patch - if you turn it around so that the
> "submission CPU" field can be specified as the "completion cpu" then
> I think the patch will expose the policy knobs needed to do the
> above.

Yeah, that seems pretty straightforward.

We might need some logic for noticing that the desired cpu has been
hot-plugged away while the IO was in flight, it occurs to me.

- z

2008-02-04 20:10:42

by Jens Axboe

Subject: Re: [rfc] direct IO submission and completion scalability issues

On Mon, Feb 04 2008, Zach Brown wrote:
> [ ugh, still jet lagged. ]
>
> > Hi Nick,
> >
> > When Matthew was describing this work at an LCA presentation (not
> > sure whether you were at that presentation or not), Zach came up
> > with the idea that allowing the submitting application to control the
> > CPU on which the io completion processing occurs would be a good
> > approach to try. That is, we submit a "completion cookie" with the
> > bio that indicates where we want completion to run, rather than
> > dictating that completion runs on the submission CPU.
> >
> > The reasoning is that only the higher level context really knows
> > what is optimal, and that changes from application to application.
> > The "complete on the submission CPU" policy _may_ be more optimal
> > for database workloads, but it is definitely suboptimal for XFS and
> > transaction I/O completion handling because it simply drags a bunch
> > of global filesystem state around between all the CPUs running
> > completions. In that case, we really only want a single CPU to be
> > handling the completions.....
> >
> > (Zach - please correct me if I've missed anything)
>
> Yeah, I think Nick's patch (and Jens' approach, presumably) is just the
> sort of thing we were hoping for when discussing this during Matthew's talk.
>
> I was imagining the patch a little bit differently (per-cpu tasks, do a
> wake_up from the driver instead of cpu nr testing up in blk, work
> queues, whatever), but we know how to iron out these kinds of details ;).

per-cpu tasks/wq's might be better, it's a little awkward to jump
through hoops

> > Looking at your patch - if you turn it around so that the
> > "submission CPU" field can be specified as the "completion cpu" then
> > I think the patch will expose the policy knobs needed to do the
> > above.
>
> Yeah, that seems pretty straight forward.
>
> It occurs to me that we might need some logic for noticing that the
> desired cpu has been hot-plugged away while the IO was in flight.

the softirq completion stuff already handles cpus going away, at least
with my patch that stuff works fine (with a dead flag added).

--
Jens Axboe

2008-02-04 21:46:38

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [rfc] direct IO submission and completion scalability issues

Jens Axboe wrote:
>> I was imagining the patch a little bit differently (per-cpu tasks, do a
>> wake_up from the driver instead of cpu nr testing up in blk, work
>> queues, whatever), but we know how to iron out these kinds of details ;).
>
> per-cpu tasks/wq's might be better, it's a little awkward to jump
> through hoops
>

one caveat btw; when the multiqueue storage hw becomes available for Linux,
we need to figure out how to deal with the preference thing; since there,
honoring a "non-logical" preference would be quite expensive (it means
you can't make the local submit queues lockless etc etc). So before we go down
the road of having widespread APIs for this stuff, we need to make sure we're
not going to do something that's going to be really stupid 6 to 18 months down the road.

2008-02-04 21:49:24

by Suresh Siddha

[permalink] [raw]
Subject: Re: [rfc] direct IO submission and completion scalability issues

On Sun, Feb 03, 2008 at 10:52:52AM +0100, Nick Piggin wrote:
> Hi guys,
>
> Just had another way we might do this. Migrate the completions out to
> the submitting CPUs rather than migrate submission into the completing
> CPU.

Hi Nick, this was the first experiment I tried on a quad-core, four-package
SMP platform, and it didn't show much improvement in my prototype (my
prototype was migrating the softirq to the kblockd context of the
submitting CPU).

In the OLTP workload, quite a bit of activity happens below the block layer,
and by the time we come to the softirq, some damage is already done in the
slab, scsi cmds, timers, etc. Last year's OLS paper
(http://ols.108.redhat.com/2007/Reprints/gough-Reprint.pdf)
shows the different cache lines that are contended in the kernel for the
OLTP workload.

Softirq migration should at least reduce the cacheline contention that
happens in the sched and AIO layers. I didn't spend much time looking into
why my softirq migration patch didn't help much (as I was chasing the
bigger bird of migrating IO submission to the completion CPU at that time).
If this solution has fewer side-effects and is more easily acceptable, then
we can analyze the softirq migration patch further and find out its potential.

While there is some potential in softirq migration, the full potential can
be exploited only by doing the IO submission and completion on the same CPU.

thanks,
suresh

2008-02-04 22:29:22

by James Bottomley

[permalink] [raw]
Subject: Re: [rfc] direct IO submission and completion scalability issues

On Mon, 2008-02-04 at 05:33 -0500, Jens Axboe wrote:
> As Andi mentions, we can look into making that lockless. For the initial
> implementation I didn't really care, just wanted something to play with
> that would nicely allow me to control both the submit and complete side
> of the affinity issue.

Sorry, late to the party ... it went to my steeleye address, not my
current one.

Could you try re-running the tests with a low queue depth (say around 8)
and the card interrupt bound to a single CPU?

The reason for asking you to do this is that it should emulate almost
precisely what you're looking for: The submit path will be picked up in
the SCSI softirq where the queue gets run, so you should find that all
submits and returns happen on a single CPU, so everything gets cache hot
there.

James

p.s. if everyone could also update my email address to the
hansenpartnership one, the people at steeleye who monitor my old email
account would be grateful.

2008-02-05 00:15:05

by David Chinner

[permalink] [raw]
Subject: Re: [rfc] direct IO submission and completion scalability issues

On Mon, Feb 04, 2008 at 11:09:59AM +0100, Nick Piggin wrote:
> You get better behaviour in the slab and page allocators and locality
> and cache hotness of memory. For example, I guess in a filesystem /
> pagecache heavy workload, you have to touch each struct page, buffer head,
> fs private state, and also often have to wake the thread for completion.
> Much of this data has just been touched at submit time, so doing this on
> the same CPU is nice...

[....]

> I'm surprised that the xfs global state bouncing would outweigh the
> bouncing of all the per-page/block/bio/request/etc data that gets touched
> during completion. We'll see.

Per-page/block/bio/request/etc state is local to a single I/O. The only
penalty is a cacheline bounce for each of the structures from one
CPU to another. That is, there is no global state modified by these
completions.

The real issue is metadata. The transaction log I/O completion
funnels through a state machine protected by a single lock, which
means completions on different CPUs pull that lock to all
completion CPUs. Given that the same lock is used during transaction
completion for other state transitions (in task context, not intr),
the more CPUs touching it at once, the worse the problem gets.

Then there's metadata I/O completion, which funnels through a larger
set of global locks in the transaction subsystem (e.g. the active
item list lock, the log reservation locks, the log state lock, etc)
which once again means the more CPUs we have delivering I/O
completions, the worse the problem gets.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2008-02-05 08:25:01

by Jens Axboe

[permalink] [raw]
Subject: Re: [rfc] direct IO submission and completion scalability issues

On Mon, Feb 04 2008, Arjan van de Ven wrote:
> Jens Axboe wrote:
> >>I was imagining the patch a little bit differently (per-cpu tasks, do a
> >>wake_up from the driver instead of cpu nr testing up in blk, work
> >>queues, whatever), but we know how to iron out these kinds of details ;).
> >
> >per-cpu tasks/wq's might be better, it's a little awkward to jump
> >through hoops
> >
>
> one caveat btw; when the multiqueue storage hw becomes available for Linux,
> we need to figure out how to deal with the preference thing; since there,
> honoring a "non-logical" preference would be quite expensive (it means

non-local?

> you can't make the local submit queues lockless etc etc). So before we
> go down the road of having widespread APIs for this stuff, we need to
> make sure we're not going to do something that's going to be really
> stupid 6 to 18 months down the road.

As far as I'm concerned, so far this is just playing around with
affinity (and to some extent taking it too far, on purpose). For
instance, my current patch can move submissions and completions
independently, with a set mask or by 'binding' a request to a CPU. Most
of that doesn't make sense. 'complete on the same CPU, if possible'
makes sense and would fit fine with multi-queue hw.

Moving submissions at the block layer to a defined set of CPUs is a bit
silly imho; it's pretty costly, and it's a lot more sane to simply bind the
submitters instead. So if you can set irq affinity, then just make the
submitters follow that.

--
Jens Axboe

2008-02-08 07:50:43

by Nick Piggin

[permalink] [raw]
Subject: Re: [rfc] direct IO submission and completion scalability issues

On Tue, Feb 05, 2008 at 11:14:19AM +1100, David Chinner wrote:
> On Mon, Feb 04, 2008 at 11:09:59AM +0100, Nick Piggin wrote:
> > You get better behaviour in the slab and page allocators and locality
> > and cache hotness of memory. For example, I guess in a filesystem /
> > pagecache heavy workload, you have to touch each struct page, buffer head,
> > fs private state, and also often have to wake the thread for completion.
> > Much of this data has just been touched at submit time, so doing this on
> > the same CPU is nice...
>
> [....]
>
> > I'm surprised that the xfs global state bouncing would outweigh the
> > bouncing of all the per-page/block/bio/request/etc data that gets touched
> > during completion. We'll see.
>
> Per-page/block/bio/request/etc state is local to a single I/O. The only
> penalty is a cacheline bounce for each of the structures from one
> CPU to another. That is, there is no global state modified by these
> completions.

Yeah, but it is going from _all_ submitting CPUs to the one completing
CPU. So you could bottleneck the interconnect at the completing CPU
just as much as if you had cachelines being pulled the other way (ie.
many CPUs trying to pull in a global cacheline).


> The real issue is metadata. The transaction log I/O completion
> funnels through a state machine protected by a single lock, which
> means completions on different CPUs pull that lock to all
> completion CPUs. Given that the same lock is used during transaction
> completion for other state transitions (in task context, not intr),
> the more CPUs touching it at once, the worse the problem gets.

OK, once you add locking (and not simply cacheline contention), then
the problem gets harder I agree. But I think that if the submitting
side takes the same locks as log completion (eg. maybe for starting a
new transaction), then it is not going to be a clear win either way,
and you'd have to measure it in the end.