We've been poking around in semtimedop for a while now, mostly because
it is consistently showing up at the top of the CPU profiles for benchmarking
runs on big numa systems. The biggest problem seems to be the IPC lock, and
the fact that we hold it for a long time while we loop over different lists and
try to do semaphore operations.
Zach Brown came up with a set of patches a while ago that switched away from
the global pending list, and semtimedop was recently optimized for the
single sop case by Nick and Manfred.
This patch series tries to build on ideas from all of these patches. The
list of pending semaphore operations is pushed down to the individual
semaphore and the locking is also pushed down into the semaphore. The
result is much faster with my microbenchmark:
http://oss.oracle.com/~mason/sembench.c
It more than doubles the total number of post/wait cycles the benchmark
can complete in 30s. Before this patch, semtimedop scored about the
same as futexes on post/wait cycles, so now it is 2x faster.
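(The post/wait pattern being measured boils down to something like the
sketch below. This is illustrative only, not the benchmark itself; the
real thing at the URL above forks many waiter processes and adds timing.)

#define _GNU_SOURCE
#include <sys/ipc.h>
#include <sys/sem.h>

/* each waiter sleeps on its own semaphore in the array */
static void wait_for_post(int semid, unsigned short me)
{
        struct sembuf sop = { .sem_num = me, .sem_op = -1, .sem_flg = 0 };

        semtimedop(semid, &sop, 1, NULL);
}

/* a single poster wakes a whole batch of waiters with one syscall */
static void post_bulk(int semid, unsigned short first, int count)
{
        struct sembuf sops[count];
        int i;

        for (i = 0; i < count; i++) {
                sops[i].sem_num = first + i;
                sops[i].sem_op = 1;
                sops[i].sem_flg = 0;
        }
        semtimedop(semid, sops, count, NULL);
}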
I ran this code through all of the LTP IPC tests, and later this week I
hope to get a full TPC database benchmark run on it.
When IPC semaphores are used in a bulk post and wait system, we
can end up waking a very large number of processes per semtimedop call.
At least one major database will use a single process to kick hundreds
of other processes at a time.
This patch tries to reduce the runqueue lock contention by ordering the
wakeups based on the CPU the waiting process was on when it went to
sleep.
A later patch could add some code in the scheduler to help
wake these up in bulk and take the various runqueue locks less often.
Signed-off-by: Chris Mason <[email protected]>
---
include/linux/sem.h | 1 +
ipc/sem.c | 37 ++++++++++++++++++++++++++++++++++++-
2 files changed, 37 insertions(+), 1 deletions(-)
diff --git a/include/linux/sem.h b/include/linux/sem.h
index 8b97b51..4a37319 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -104,6 +104,7 @@ struct sem_array {
struct sem_queue {
struct list_head list; /* queue of pending operations */
struct task_struct *sleeper; /* this process */
+ unsigned long sleep_cpu;
struct sem_undo *undo; /* undo structure */
int pid; /* process id of requesting process */
int status; /* completion status of operation */
diff --git a/ipc/sem.c b/ipc/sem.c
index 335cd35..07fe1d5 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -544,6 +544,25 @@ static void wake_up_sem_queue(struct sem_queue *q, int error)
preempt_enable();
}
+/*
+ * sorting helper for struct sem_queues in a list. This is used to
+ * sort by the CPU they are likely to be on when waking them.
+ */
+int list_comp(void *priv, struct list_head *a, struct list_head *b)
+{
+ struct sem_queue *qa;
+ struct sem_queue *qb;
+
+ qa = list_entry(a, struct sem_queue, list);
+ qb = list_entry(b, struct sem_queue, list);
+
+ if (qa->sleep_cpu < qb->sleep_cpu)
+ return -1;
+ if (qa->sleep_cpu > qb->sleep_cpu)
+ return 1;
+ return 0;
+}
+
/**
* update_queue(sma, semnum): Look for tasks that can be completed.
* @sma: semaphore array.
@@ -557,6 +576,7 @@ static void update_queue(struct sem_array *sma, struct list_head *pending_list)
struct sem_queue *q;
LIST_HEAD(new_pending);
LIST_HEAD(work_list);
+ LIST_HEAD(wake_list);
/*
* this seems strange, but what we want to do is process everything
@@ -591,7 +611,10 @@ again:
spin_unlock(&blocker->lock);
continue;
}
- wake_up_sem_queue(q, error);
+ if (error)
+ wake_up_sem_queue(q, error);
+ else
+ list_add_tail(&q->list, &wake_list);
}
@@ -599,6 +622,13 @@ again:
list_splice_init(&new_pending, &work_list);
goto again;
}
+
+ list_sort(NULL, &wake_list, list_comp);
+ while (!list_empty(&wake_list)) {
+ q = list_entry(wake_list.next, struct sem_queue, list);
+ list_del_init(&q->list);
+ wake_up_sem_queue(q, 0);
+ }
}
/* The following counts are associated to each semaphore:
@@ -1440,6 +1470,11 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
queue.status = -EINTR;
queue.sleeper = current;
+ /*
+ * the sleep_cpu number allows sorting by the CPU we expect
+ * their runqueue entry to be on, hopefully making wakeups faster
+ */
+ queue.sleep_cpu = my_cpu_offset;
current->state = TASK_INTERRUPTIBLE;
/*
--
1.7.0.3
One feature of ipc semaphores is that they are defined to be
atomic for the full set of operations done per syscall. So if a
semtimedop syscall changes 100 semaphores, the kernel needs to try all
100 changes and only apply them when all 100 can succeed without
putting the process to sleep.
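For example, a single call like the following (illustrative only; semid
is assumed to name an existing array) must take one unit from both
semaphores or neither, sleeping until it can:

        struct sembuf sops[2] = {
                { .sem_num = 0, .sem_op = -1, .sem_flg = 0 },
                { .sem_num = 1, .sem_op = -1, .sem_flg = 0 },
        };

        /* both decrements are applied atomically, or the caller blocks */
        semtimedop(semid, sops, 2, NULL);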
Today we use a single lock per semaphore array (the ipc lock). This lock is
held every time we try a set of operations requested by userland, and it
is taken again when a process is woken up.
Whenever a given set of changes sent to semtimedop would sleep, that
set is queued up on a big list of pending changes for the entire
semaphore array.
Whenever a semtimedop call changes a single semaphore value, it
walks the entire list of pending operations to see if any of them
can now succeed. The ipc lock is held for this entire loop.
This patch makes two major changes, pushing both the list of pending
operations and a spinlock down to each individual semaphore. Now:
Whenever a given semaphore modification is going to block, the set of
operations semtimedop wants to do is saved onto that semaphore's list.
Whenever a given semtimedop call changes a single semaphore value, it
walks the list of pending operations on that single semaphore to see if
they can now succeed. If any of the operations will block on a
different semaphore, they are moved to that semaphore's list.
The locking is now done per-semaphore. In order to get the changes done
atomically, the lock of every semaphore being changed is taken while we
test the requested operations. We sort the operations by semaphore id
to make sure we don't deadlock in the kernel.
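In userland terms, the deadlock the sort prevents looks roughly like
this sketch (without the sort, the per-semaphore spinlocks for these two
calls would be taken in opposite orders):

        /* task A locks sem 1 then sem 2; task B the reverse */
        struct sembuf ab[2] = { { .sem_num = 1, .sem_op = -1, .sem_flg = 0 },
                                { .sem_num = 2, .sem_op = -1, .sem_flg = 0 } };
        struct sembuf ba[2] = { { .sem_num = 2, .sem_op = -1, .sem_flg = 0 },
                                { .sem_num = 1, .sem_op = -1, .sem_flg = 0 } };

        semtimedop(semid, ab, 2, NULL);         /* task A */
        semtimedop(semid, ba, 2, NULL);         /* task B */

Once both arrays are sorted by sem_num, both tasks take semaphore 1's
lock first, so the loser just waits there instead of holding semaphore
2's lock forever.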
I have a microbenchmark to test how quickly we can post and wait in
bulk. With this change, semtimedop is able to do more than twice
as much work in the same run. On a large numa machine, it brings
the IPC lock system time (reported by perf) down from 85% to 15%.
Signed-off-by: Chris Mason <[email protected]>
---
include/linux/sem.h | 4 +-
ipc/sem.c | 504 +++++++++++++++++++++++++++++++++++++--------------
2 files changed, 372 insertions(+), 136 deletions(-)
diff --git a/include/linux/sem.h b/include/linux/sem.h
index 8a4adbe..8b97b51 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -86,6 +86,7 @@ struct task_struct;
struct sem {
int semval; /* current value */
int sempid; /* pid of last operation */
+ spinlock_t lock;
struct list_head sem_pending; /* pending single-sop operations */
};
@@ -95,15 +96,12 @@ struct sem_array {
time_t sem_otime; /* last semop time */
time_t sem_ctime; /* last change time */
struct sem *sem_base; /* ptr to first semaphore in array */
- struct list_head sem_pending; /* pending operations to be processed */
struct list_head list_id; /* undo requests on this array */
int sem_nsems; /* no. of semaphores in array */
- int complex_count; /* pending complex operations */
};
/* One queue for each sleeping process in the system. */
struct sem_queue {
- struct list_head simple_list; /* queue of pending operations */
struct list_head list; /* queue of pending operations */
struct task_struct *sleeper; /* this process */
struct sem_undo *undo; /* undo structure */
diff --git a/ipc/sem.c b/ipc/sem.c
index dbef95b..335cd35 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -83,6 +83,8 @@
#include <linux/rwsem.h>
#include <linux/nsproxy.h>
#include <linux/ipc_namespace.h>
+#include <linux/sort.h>
+#include <linux/list_sort.h>
#include <asm/uaccess.h>
#include "util.h"
@@ -195,24 +197,23 @@ static inline void sem_rmid(struct ipc_namespace *ns, struct sem_array *s)
* Without the check/retry algorithm a lockless wakeup is possible:
* - queue.status is initialized to -EINTR before blocking.
* - wakeup is performed by
- * * unlinking the queue entry from sma->sem_pending
* * setting queue.status to IN_WAKEUP
* This is the notification for the blocked thread that a
* result value is imminent.
* * call wake_up_process
* * set queue.status to the final value.
* - the previously blocked thread checks queue.status:
- * * if it's IN_WAKEUP, then it must wait until the value changes
- * * if it's not -EINTR, then the operation was completed by
- * update_queue. semtimedop can return queue.status without
- * performing any operation on the sem array.
- * * otherwise it must acquire the spinlock and check what's up.
+ * * if it's IN_WAKEUP, then it must wait until the value changes
+ * * if it's not -EINTR, then the operation was completed by
+ * update_queue. semtimedop can return queue.status without
+ * performing any operation on the sem array.
+ * * otherwise it must find itself on the list of pending operations.
*
* The two-stage algorithm is necessary to protect against the following
* races:
* - if queue.status is set after wake_up_process, then the woken up idle
- * thread could race forward and try (and fail) to acquire sma->lock
- * before update_queue had a chance to set queue.status
+ * thread could race forward and not realize its semaphore operation had
+ * happened.
* - if queue.status is written before wake_up_process and if the
* blocked process is woken up by a signal between writing
* queue.status and the wake_up_process, then the woken up
@@ -275,11 +276,11 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
sma->sem_base = (struct sem *) &sma[1];
- for (i = 0; i < nsems; i++)
+ for (i = 0; i < nsems; i++) {
INIT_LIST_HEAD(&sma->sem_base[i].sem_pending);
+ spin_lock_init(&sma->sem_base[i].lock);
+ }
- sma->complex_count = 0;
- INIT_LIST_HEAD(&sma->sem_pending);
INIT_LIST_HEAD(&sma->list_id);
sma->sem_nsems = nsems;
sma->sem_ctime = get_seconds();
@@ -338,34 +339,115 @@ SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
}
/*
+ * when a semaphore is modified, we want to retry the series of operations
+ * for anyone that was blocking on that semaphore. This breaks down into
+ * a few different common operations:
+ *
+ * 1) One modification releases one or more waiters for zero.
+ * 2) Many waiters are trying to get a single lock, only one will get it.
+ * 3) Many modifications to the count will succeed.
+ *
+ * For case one, we copy over anyone waiting for zero when the semval is
+ * zero. We don't bother copying them over if the semval isn't zero yet.
+ *
+ * For case two, we copy over the first queue trying to modify the semaphore,
+ * assuming it is trying to get a lock.
+ *
+ * For case three, after the first queue trying to change this semaphore is
+ * run, it will call this function again. It'll find the next queue
+ * that wants to change things at that time.
+ *
+ * The goal behind all of this is to avoid retrying atomic ops that have
+ * no hope of actually completing. It is optimized for the case where a
+ * call modifies a single semaphore at a time.
+ */
+static void copy_sem_queue(unsigned long semval,
+ unsigned short sem_num, struct list_head *queue,
+ struct list_head *dest)
+{
+ struct sem_queue *q;
+ struct sem_queue *safe;
+
+ list_for_each_entry_safe(q, safe, queue, list) {
+ /*
+ * if this is a complex operation, we don't really know what is
+ * going on. Splice the whole list over to preserve the queue
+ * order.
+ */
+ if (q->sops[0].sem_num != sem_num) {
+ list_splice_tail_init(queue, dest);
+ break;
+ }
+
+ /*
+ * they are waiting for zero, leave it on the list if
+ * we're not at zero yet, otherwise copy it over
+ */
+ if (q->sops[0].sem_op == 0) {
+ if (semval == 0) {
+ list_del(&q->list);
+ list_add_tail(&q->list, dest);
+ }
+ continue;
+ }
+
+ /*
+ * at this point we know the first sop in the queue is
+ * changing this semaphore. Copy this one queue over
+ * and leave the rest. If more than one alter is going
+ * to succeed, the others will bubble in after each
+ * one is able to modify the queue.
+ */
+ list_del(&q->list);
+ list_add_tail(&q->list, dest);
+ break;
+ }
+}
+
+/*
* Determine whether a sequence of semaphore operations would succeed
* all at once. Return 0 if yes, 1 if need to sleep, else return error code.
*/
-
-static int try_atomic_semop (struct sem_array * sma, struct sembuf * sops,
- int nsops, struct sem_undo *un, int pid)
+static noinline int try_atomic_semop (struct sem_array * sma, struct sembuf * sops,
+ int nsops, struct sem_undo *un, int pid,
+ struct list_head *pending, struct sem **blocker)
{
int result, sem_op;
struct sembuf *sop;
struct sem * curr;
+ int last = 0;
for (sop = sops; sop < sops + nsops; sop++) {
curr = sma->sem_base + sop->sem_num;
+
+ /*
+ * deal with userland sending the same
+ * sem_num twice. Thanks to sort they will
+ * be adjacent. We unlock in the loops below.
+ */
+ if (sop == sops || last != sop->sem_num)
+ spin_lock(&curr->lock);
+
+ last = sop->sem_num;
sem_op = sop->sem_op;
result = curr->semval;
-
- if (!sem_op && result)
+
+ if (!sem_op && result) {
+ *blocker = curr;
goto would_block;
+ }
result += sem_op;
- if (result < 0)
+ if (result < 0) {
+ *blocker = curr;
goto would_block;
+ }
if (result > SEMVMX)
goto out_of_range;
if (sop->sem_flg & SEM_UNDO) {
int undo = un->semadj[sop->sem_num] - sem_op;
/*
- * Exceeding the undo range is an error.
+ * Exceeding the undo range is an error.
*/
if (undo < (-SEMAEM - 1) || undo > SEMAEM)
goto out_of_range;
@@ -380,7 +462,27 @@ static int try_atomic_semop (struct sem_array * sma, struct sembuf * sops,
un->semadj[sop->sem_num] -= sop->sem_op;
sop--;
}
-
+
+ /*
+ * our operation is going to succeed, do any list splicing
+ * required so that we can try to wakeup people waiting on the
+ * sems we've changed.
+ */
+ for (sop = sops; sop < sops + nsops; sop++) {
+ /* if there are duplicate sem_nums in the list
+ * we only want to process the first one
+ */
+ if (sop != sops && last == sop->sem_num)
+ continue;
+
+ curr = sma->sem_base + sop->sem_num;
+ if (sop->sem_op)
+ copy_sem_queue(curr->semval, sop->sem_num,
+ &curr->sem_pending, pending);
+ spin_unlock(&curr->lock);
+ last = sop->sem_num;
+ }
+
sma->sem_otime = get_seconds();
return 0;
@@ -389,15 +491,32 @@ out_of_range:
goto undo;
would_block:
- if (sop->sem_flg & IPC_NOWAIT)
+ if (sop->sem_flg & IPC_NOWAIT) {
result = -EAGAIN;
- else
+ if (*blocker) {
+ /*
+ * the blocker doesn't put itself on any
+ * list for -EAGAIN, unlock it here
+ */
+ spin_unlock(&(*blocker)->lock);
+ *blocker = NULL;
+ }
+ } else
result = 1;
undo:
sop--;
while (sop >= sops) {
- sma->sem_base[sop->sem_num].semval -= sop->sem_op;
+ curr = sma->sem_base + sop->sem_num;
+
+ curr->semval -= sop->sem_op;
+ /* we leave the blocker locked, and we make sure not
+ * to unlock duplicates in the list twice
+ */
+ if (curr != *blocker &&
+ (sop == sops || (sop - 1)->sem_num != sop->sem_num)) {
+ spin_unlock(&curr->lock);
+ }
sop--;
}
@@ -425,88 +544,60 @@ static void wake_up_sem_queue(struct sem_queue *q, int error)
preempt_enable();
}
-static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
-{
- list_del(&q->list);
- if (q->nsops == 1)
- list_del(&q->simple_list);
- else
- sma->complex_count--;
-}
-
-
/**
* update_queue(sma, semnum): Look for tasks that can be completed.
* @sma: semaphore array.
- * @semnum: semaphore that was modified.
+ * @pending_list: list of struct sem_queues to try
*
* update_queue must be called after a semaphore in a semaphore array
- * was modified. If multiple semaphore were modified, then @semnum
- * must be set to -1.
+ * was modified.
*/
-static void update_queue(struct sem_array *sma, int semnum)
+static void update_queue(struct sem_array *sma, struct list_head *pending_list)
{
struct sem_queue *q;
- struct list_head *walk;
- struct list_head *pending_list;
- int offset;
+ LIST_HEAD(new_pending);
+ LIST_HEAD(work_list);
- /* if there are complex operations around, then knowing the semaphore
- * that was modified doesn't help us. Assume that multiple semaphores
- * were modified.
+ /*
+ * this seems strange, but what we want to do is process everything
+ * on the pending list, and then process any queues that have a chance
+ * to finish because of processing the pending list.
+ *
+ * So, we send new_pending to try_atomic_semop each time, and it
+ * splices any additional queues we have to try into new_pending.
+ * When the work list is empty, we splice new_pending into the
+ * work list and loop again.
+ *
+ * At the end of the whole thing, after we've built the largest
+ * possible list of tasks to wake up, we wake them in bulk.
*/
- if (sma->complex_count)
- semnum = -1;
-
- if (semnum == -1) {
- pending_list = &sma->sem_pending;
- offset = offsetof(struct sem_queue, list);
- } else {
- pending_list = &sma->sem_base[semnum].sem_pending;
- offset = offsetof(struct sem_queue, simple_list);
- }
-
+ list_splice_init(pending_list, &work_list);
again:
- walk = pending_list->next;
- while (walk != pending_list) {
- int error, alter;
-
- q = (struct sem_queue *)((char *)walk - offset);
- walk = walk->next;
-
- /* If we are scanning the single sop, per-semaphore list of
- * one semaphore and that semaphore is 0, then it is not
- * necessary to scan the "alter" entries: simple increments
- * that affect only one entry succeed immediately and cannot
- * be in the per semaphore pending queue, and decrements
- * cannot be successful if the value is already 0.
- */
- if (semnum != -1 && sma->sem_base[semnum].semval == 0 &&
- q->alter)
- break;
+ while (!list_empty(&work_list)) {
+ struct sem *blocker;
+ int error;
+ q = list_entry(work_list.next, struct sem_queue, list);
+ list_del_init(&q->list);
+
+ blocker = NULL;
error = try_atomic_semop(sma, q->sops, q->nsops,
- q->undo, q->pid);
+ q->undo, q->pid, &new_pending,
+ &blocker);
/* Does q->sleeper still need to sleep? */
- if (error > 0)
+ if (error > 0) {
+ list_add_tail(&q->list, &blocker->sem_pending);
+ spin_unlock(&blocker->lock);
continue;
+ }
+ wake_up_sem_queue(q, error);
- unlink_queue(sma, q);
+ }
- /*
- * The next operation that must be checked depends on the type
- * of the completed operation:
- * - if the operation modified the array, then restart from the
- * head of the queue and check for threads that might be
- * waiting for the new semaphore values.
- * - if the operation didn't modify the array, then just
- * continue.
- */
- alter = q->alter;
- wake_up_sem_queue(q, error);
- if (alter && !error)
- goto again;
+ if (!list_empty(&new_pending)) {
+ list_splice_init(&new_pending, &work_list);
+ goto again;
}
}
@@ -523,9 +614,11 @@ static int count_semncnt (struct sem_array * sma, ushort semnum)
{
int semncnt;
struct sem_queue * q;
+ struct sem *curr;
+ curr = &sma->sem_base[semnum];
semncnt = 0;
- list_for_each_entry(q, &sma->sem_pending, list) {
+ list_for_each_entry(q, &curr->sem_pending, list) {
struct sembuf * sops = q->sops;
int nsops = q->nsops;
int i;
@@ -542,9 +635,12 @@ static int count_semzcnt (struct sem_array * sma, ushort semnum)
{
int semzcnt;
struct sem_queue * q;
+ struct sem *curr;
+
+ curr = &sma->sem_base[semnum];
semzcnt = 0;
- list_for_each_entry(q, &sma->sem_pending, list) {
+ list_for_each_entry(q, &curr->sem_pending, list) {
struct sembuf * sops = q->sops;
int nsops = q->nsops;
int i;
@@ -572,6 +668,7 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
struct sem_undo *un, *tu;
struct sem_queue *q, *tq;
struct sem_array *sma = container_of(ipcp, struct sem_array, sem_perm);
+ int i;
/* Free the existing undo structures for this semaphore set. */
assert_spin_locked(&sma->sem_perm.lock);
@@ -584,10 +681,15 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
call_rcu(&un->rcu, free_un);
}
- /* Wake up all pending processes and let them fail with EIDRM. */
- list_for_each_entry_safe(q, tq, &sma->sem_pending, list) {
- unlink_queue(sma, q);
- wake_up_sem_queue(q, -EIDRM);
+
+ for (i = 0; i < sma->sem_nsems; i++) {
+ struct sem *curr = sma->sem_base + i;
+ spin_lock(&curr->lock);
+ list_for_each_entry_safe(q, tq, &curr->sem_pending, list) {
+ list_del_init(&q->list);
+ wake_up_sem_queue(q, -EIDRM);
+ }
+ spin_unlock(&curr->lock);
}
/* Remove the semaphore set from the IDR */
@@ -766,6 +868,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
{
int i;
struct sem_undo *un;
+ LIST_HEAD(pending);
sem_getref_and_unlock(sma);
@@ -797,8 +900,15 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
goto out_free;
}
- for (i = 0; i < nsems; i++)
- sma->sem_base[i].semval = sem_io[i];
+ for (i = 0; i < nsems; i++) {
+ curr = &sma->sem_base[i];
+
+ spin_lock(&curr->lock);
+ curr->semval = sem_io[i];
+ copy_sem_queue(curr->semval, i,
+ &curr->sem_pending, &pending);
+ spin_unlock(&curr->lock);
+ }
assert_spin_locked(&sma->sem_perm.lock);
list_for_each_entry(un, &sma->list_id, list_id) {
@@ -807,7 +917,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
}
sma->sem_ctime = get_seconds();
/* maybe some queued-up processes were waiting for this */
- update_queue(sma, -1);
+ update_queue(sma, &pending);
err = 0;
goto out_unlock;
}
@@ -836,6 +946,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
{
int val = arg.val;
struct sem_undo *un;
+ LIST_HEAD(pending);
err = -ERANGE;
if (val > SEMVMX || val < 0)
@@ -845,11 +956,16 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
list_for_each_entry(un, &sma->list_id, list_id)
un->semadj[semnum] = 0;
+ spin_lock(&curr->lock);
curr->semval = val;
+ copy_sem_queue(curr->semval, semnum,
+ &curr->sem_pending, &pending);
curr->sempid = task_tgid_vnr(current);
+ spin_unlock(&curr->lock);
+
sma->sem_ctime = get_seconds();
/* maybe some queued-up processes were waiting for this */
- update_queue(sma, semnum);
+ update_queue(sma, &pending);
err = 0;
goto out_unlock;
}
@@ -1117,6 +1233,67 @@ out:
return un;
}
+/*
+ * since we take spinlocks on the semaphores based on the
+ * values from userland, we have to sort them to make sure
+ * we lock them in order
+ */
+static int sembuf_compare(const void *a, const void *b)
+{
+ const struct sembuf *abuf = a;
+ const struct sembuf *bbuf = b;
+
+ if (abuf->sem_num < bbuf->sem_num)
+ return -1;
+ if (abuf->sem_num > bbuf->sem_num)
+ return 1;
+ return 0;
+}
+
+/*
+ * if a process wakes up on its own while on a semaphore list
+ * we have to take it off the list before that process can exit.
+ *
+ * We check all the semaphores the sem_queue was trying to modify
+ * and if we find the sem_queue, we remove it and return.
+ *
+ * If we don't find the sem_queue, it's because someone is about to
+ * wake us up, and they have removed us from the list.
+ * We schedule and try again in hopes that they do it real soon now.
+ *
+ * We check queue->status to detect if someone did actually manage to
+ * wake us up.
+ */
+static int remove_queue_from_lists(struct sem_array *sma,
+ struct sem_queue *queue)
+{
+ struct sembuf *sops = queue->sops;
+ struct sembuf *sop;
+ struct sem * curr;
+ struct sem_queue *test;
+
+again:
+ for (sop = sops; sop < sops + queue->nsops; sop++) {
+ curr = sma->sem_base + sop->sem_num;
+ spin_lock(&curr->lock);
+ list_for_each_entry(test, &curr->sem_pending, list) {
+ if (test == queue) {
+ list_del(&test->list);
+ spin_unlock(&curr->lock);
+ goto found;
+ }
+ }
+ spin_unlock(&curr->lock);
+ }
+ if (queue->status == -EINTR) {
+ set_current_state(TASK_RUNNING);
+ schedule();
+ goto again;
+ }
+found:
+ return 0;
+}
+
SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
unsigned, nsops, const struct timespec __user *, timeout)
{
@@ -1129,6 +1306,8 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
struct sem_queue queue;
unsigned long jiffies_left = 0;
struct ipc_namespace *ns;
+ struct sem *blocker = NULL;
+ LIST_HEAD(pending);
ns = current->nsproxy->ipc_ns;
@@ -1168,6 +1347,14 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
alter = 1;
}
+ /*
+ * try_atomic_semop takes all the locks of all the semaphores in
+ * the sops array. We have to make sure we don't deadlock if userland
+ * happens to send them out of order, so we sort them by semnum.
+ */
+ if (nsops > 1)
+ sort(sops, nsops, sizeof(*sops), sembuf_compare, NULL);
+
if (undos) {
un = find_alloc_undo(ns, semid);
if (IS_ERR(un)) {
@@ -1222,45 +1409,52 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
if (error)
goto out_unlock_free;
- error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current));
+ /*
+ * undos are scary, keep the lock if we have to deal with undos.
+ * Otherwise, drop the big fat ipc lock and use the fine grained
+ * per-semaphore locks instead.
+ */
+ if (!un)
+ sem_getref_and_unlock(sma);
+
+ error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current),
+ &pending, &blocker);
if (error <= 0) {
if (alter && error == 0)
- update_queue(sma, (nsops == 1) ? sops[0].sem_num : -1);
-
- goto out_unlock_free;
+ update_queue(sma, &pending);
+ if (un)
+ goto out_unlock_free;
+ else
+ goto out_putref;
}
/* We need to sleep on this operation, so we put the current
* task into the pending queue and go to sleep.
*/
-
+
queue.sops = sops;
queue.nsops = nsops;
queue.undo = un;
queue.pid = task_tgid_vnr(current);
queue.alter = alter;
- if (alter)
- list_add_tail(&queue.list, &sma->sem_pending);
- else
- list_add(&queue.list, &sma->sem_pending);
-
- if (nsops == 1) {
- struct sem *curr;
- curr = &sma->sem_base[sops->sem_num];
-
- if (alter)
- list_add_tail(&queue.simple_list, &curr->sem_pending);
- else
- list_add(&queue.simple_list, &curr->sem_pending);
- } else {
- INIT_LIST_HEAD(&queue.simple_list);
- sma->complex_count++;
- }
-
queue.status = -EINTR;
queue.sleeper = current;
+
current->state = TASK_INTERRUPTIBLE;
- sem_unlock(sma);
+
+ /*
+ * we could be woken up at any time after we add ourselves to the
+ * blocker's list and unlock the spinlock. So, all queue setup
+ * must be done before this point
+ */
+ if (alter)
+ list_add_tail(&queue.list, &blocker->sem_pending);
+ else
+ list_add(&queue.list, &blocker->sem_pending);
+ spin_unlock(&blocker->lock);
+
+ if (un)
+ sem_getref_and_unlock(sma);
if (timeout)
jiffies_left = schedule_timeout(jiffies_left);
@@ -1268,40 +1462,76 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
schedule();
error = queue.status;
+
while(unlikely(error == IN_WAKEUP)) {
cpu_relax();
error = queue.status;
}
- if (error != -EINTR) {
+ /*
+ * we are lock free right here, and we could have timed out or
+ * gotten a signal, so we need to be really careful with how we
+ * play with queue.status. It has three possible states:
+ *
+ * -EINTR, which means nobody has changed it since we slept. This
+ * means we woke up on our own.
+ *
+ * IN_WAKEUP, someone is currently waking us up. We need to loop
+ * here until they change it to the operation error value. If
+ * we don't loop, our process could exit before they are done waking us
+ *
+ * operation error value: we've been properly woken up and can exit
+ * at any time.
+ *
+ * If queue.status is currently -EINTR, we are still being processed
+ * by the semtimedop core. Someone either has us on a list head
+ * or is currently poking our queue struct. We need to find that
+ * reference and remove it, which is what remove_queue_from_lists
+ * does.
+ *
+ * We always check for both -EINTR and IN_WAKEUP because we have no
+ * locks held. Someone could change us from -EINTR to IN_WAKEUP at
+ * any time.
+ */
+ if (error != -EINTR && error != IN_WAKEUP) {
/* fast path: update_queue already obtained all requested
* resources */
- goto out_free;
- }
-
- sma = sem_lock(ns, semid);
- if (IS_ERR(sma)) {
- error = -EIDRM;
- goto out_free;
+ goto out_putref;
}
/*
- * If queue.status != -EINTR we are woken up by another process
+ * Someone has a reference on us, lets find it.
*/
+ remove_queue_from_lists(sma, &queue);
+
+ /* check the status again in case we were woken up */
error = queue.status;
- if (error != -EINTR) {
- goto out_unlock_free;
+ while(unlikely(error == IN_WAKEUP)) {
+ cpu_relax();
+ error = queue.status;
}
/*
+ * at this point we know nobody can possibly wake us up, if error
+ * isn't -EINTR, the wakeup did happen and our semaphore operation is
+ * complete. Otherwise, we return -EAGAIN.
+ */
+ if (error != -EINTR)
+ goto out_putref;
+
+ /*
* If an interrupt occurred we have to clean up the queue
*/
if (timeout && jiffies_left == 0)
error = -EAGAIN;
- unlink_queue(sma, &queue);
+
+out_putref:
+ sem_putref(sma);
+ goto out_free;
out_unlock_free:
sem_unlock(sma);
+
out_free:
if(sops != fast_sops)
kfree(sops);
@@ -1360,11 +1590,14 @@ void exit_sem(struct task_struct *tsk)
return;
for (;;) {
+ struct list_head pending;
struct sem_array *sma;
struct sem_undo *un;
int semid;
int i;
+ INIT_LIST_HEAD(&pending);
+
rcu_read_lock();
un = list_entry_rcu(ulp->list_proc.next,
struct sem_undo, list_proc);
@@ -1404,6 +1637,7 @@ void exit_sem(struct task_struct *tsk)
for (i = 0; i < sma->sem_nsems; i++) {
struct sem * semaphore = &sma->sem_base[i];
if (un->semadj[i]) {
+ spin_lock(&semaphore->lock);
semaphore->semval += un->semadj[i];
/*
* Range checks of the new semaphore value,
@@ -1423,11 +1657,15 @@ void exit_sem(struct task_struct *tsk)
if (semaphore->semval > SEMVMX)
semaphore->semval = SEMVMX;
semaphore->sempid = task_tgid_vnr(current);
+ copy_sem_queue(semaphore->semval, i,
+ &semaphore->sem_pending,
+ &pending);
+ spin_unlock(&semaphore->lock);
}
}
sma->sem_otime = get_seconds();
/* maybe some queued-up processes were waiting for this */
- update_queue(sma, -1);
+ update_queue(sma, &pending);
sem_unlock(sma);
call_rcu(&un->rcu, free_un);
--
1.7.0.3
Hi Chris,
On 04/12/2010 08:49 PM, Chris Mason wrote:
> /*
> + * when a semaphore is modified, we want to retry the series of operations
> + * for anyone that was blocking on that semaphore. This breaks down into
> + * a few different common operations:
> + *
> + * 1) One modification releases one or more waiters for zero.
> + * 2) Many waiters are trying to get a single lock, only one will get it.
> + * 3) Many modifications to the count will succeed.
> + *
>
Have you thought about odd corner cases:
Nick noticed the last time that it is possible to wait for arbitrary values:
in one semop:
- decrease semaphore 5 by 10
- wait until semaphore 5 is 0
- increase semaphore 5 by 10.
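i.e., in sembuf terms, roughly:

        struct sembuf sops[3] = {
                { .sem_num = 5, .sem_op = -10 },        /* decrease by 10 */
                { .sem_num = 5, .sem_op =   0 },        /* wait for zero */
                { .sem_num = 5, .sem_op =  10 },        /* increase by 10 */
        };

The net effect is to sleep until semaphore 5 is exactly 10, without
changing it.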
> SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
> unsigned, nsops, const struct timespec __user *, timeout)
> {
> @@ -1129,6 +1306,8 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
> struct sem_queue queue;
> unsigned long jiffies_left = 0;
> struct ipc_namespace *ns;
> + struct sem *blocker = NULL;
> + LIST_HEAD(pending);
>
> ns = current->nsproxy->ipc_ns;
>
> @@ -1168,6 +1347,14 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
> alter = 1;
> }
>
> + /*
> + * try_atomic_semop takes all the locks of all the semaphores in
> + * the sops array. We have to make sure we don't deadlock if userland
> + * happens to send them out of order, so we sort them by semnum.
> + */
> + if (nsops> 1)
> + sort(sops, nsops, sizeof(*sops), sembuf_compare, NULL);
> +
>
Does sorting preserve the behavior?
On Tue, Apr 13, 2010 at 07:15:30PM +0200, Manfred Spraul wrote:
> Hi Chris,
>
>
> On 04/12/2010 08:49 PM, Chris Mason wrote:
> > /*
> >+ * when a semaphore is modified, we want to retry the series of operations
> >+ * for anyone that was blocking on that semaphore. This breaks down into
> >+ * a few different common operations:
> >+ *
> >+ * 1) One modification releases one or more waiters for zero.
> >+ * 2) Many waiters are trying to get a single lock, only one will get it.
> >+ * 3) Many modifications to the count will succeed.
> >+ *
> Have you thought about odd corner cases:
> Nick noticed the last time that it is possible to wait for arbitrary values:
> in one semop:
> - decrease semaphore 5 by 10
> - wait until semaphore 5 is 0
> - increase semaphore 5 by 10.
Do you mean within a single sop array doing all three of these? I don't
know if the sort is going to leave the three operations on semaphore 5
in the same order (it probably won't).
But I could change that by having it include the slot in the original
sop array in the sorting. That way if we have duplicate semnums in the
array, they will end up in the same position relative to each other in
the sorted result.
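Roughly, that would mean sorting a wrapper instead of the raw sembufs,
something like this (untested, the wrapper is only for illustration):

struct indexed_sembuf {
        struct sembuf sop;
        int slot;               /* index in the original sops array */
};

static int indexed_sembuf_compare(const void *a, const void *b)
{
        const struct indexed_sembuf *abuf = a;
        const struct indexed_sembuf *bbuf = b;

        if (abuf->sop.sem_num != bbuf->sop.sem_num)
                return abuf->sop.sem_num < bbuf->sop.sem_num ? -1 : 1;
        /* duplicate semnums keep their original relative order */
        return abuf->slot - bbuf->slot;
}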
(ewwww ;)
-chris
On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
> On Tue, Apr 13, 2010 at 07:15:30PM +0200, Manfred Spraul wrote:
> > Hi Chris,
> >
> >
> > On 04/12/2010 08:49 PM, Chris Mason wrote:
> > > /*
> > >+ * when a semaphore is modified, we want to retry the series of operations
> > >+ * for anyone that was blocking on that semaphore. This breaks down into
> > >+ * a few different common operations:
> > >+ *
> > >+ * 1) One modification releases one or more waiters for zero.
> > >+ * 2) Many waiters are trying to get a single lock, only one will get it.
> > >+ * 3) Many modifications to the count will succeed.
> > >+ *
> > Have you thought about odd corner cases:
> > Nick noticed the last time that it is possible to wait for arbitrary values:
> > in one semop:
> > - decrease semaphore 5 by 10
> > - wait until semaphore 5 is 0
> > - increase semaphore 5 by 10.
>
> Do you mean within a single sop array doing all three of these? I don't
> know if the sort is going to leave the three operations on semaphore 5
> in the same order (it probably won't).
>
> But I could change that by having it include the slot in the original
> sop array in the sorting. That way if we have duplicate semnums in the
> array, they will end up in the same position relative to each other in
> the sorted result.
>
> (ewwww ;)
I had a bit of a hack at doing per-semaphore stuff when I was looking
at the first optimization, but it was tricky to make it work.
The other thing I don't know if your patch gets right is the requeueing
of the operations. When you requeue from one list to another, then you
seem to lose ordering with other pending operations, so that would
seem to break the API as well (can't remember if the API strictly
mandates FIFO, but anyway it can open up starvation cases).
I was looking at doing a sequence number to be able to sort these, but
it ended up getting over complex (and SAP was only using simple ops so
it didn't seem to need much better).
We want to be careful not to change semantics at all. And it gets
tricky quickly :( What about Zach's simpler wakeup API?
On Wed, Apr 14, 2010 at 04:09:45AM +1000, Nick Piggin wrote:
> On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
> > On Tue, Apr 13, 2010 at 07:15:30PM +0200, Manfred Spraul wrote:
> > > Hi Chris,
> > >
> > >
> > > On 04/12/2010 08:49 PM, Chris Mason wrote:
> > > > /*
> > > >+ * when a semaphore is modified, we want to retry the series of operations
> > > >+ * for anyone that was blocking on that semaphore. This breaks down into
> > > >+ * a few different common operations:
> > > >+ *
> > > >+ * 1) One modification releases one or more waiters for zero.
> > > >+ * 2) Many waiters are trying to get a single lock, only one will get it.
> > > >+ * 3) Many modifications to the count will succeed.
> > > >+ *
> > > Have you thought about odd corner cases:
> > > Nick noticed the last time that it is possible to wait for arbitrary values:
> > > in one semop:
> > > - decrease semaphore 5 by 10
> > > - wait until semaphore 5 is 0
> > > - increase semaphore 5 by 10.
> >
> > Do you mean within a single sop array doing all three of these? I don't
> > know if the sort is going to leave the three operations on semaphore 5
> > in the same order (it probably won't).
> >
> > But I could change that by having it include the slot in the original
> > sop array in the sorting. That way if we have duplicate semnums in the
> > array, they will end up in the same position relative to each other in
> > the sorted result.
> >
> > (ewwww ;)
>
> I had a bit of a hack at doing per-semaphore stuff when I was looking
> at the first optimization, but it was tricky to make it work.
>
> The other thing I don't know if your patch gets right is the requeueing
> of the operations. When you requeue from one list to another, then you
> seem to lose ordering with other pending operations, so that would
> seem to break the API as well (can't remember if the API strictly
> mandates FIFO, but anyway it can open up starvation cases).
I don't see anything in the docs about the FIFO order. I could add an
extra sort on sequence number pretty easily, but is the starvation case
really that bad?
>
> I was looking at doing a sequence number to be able to sort these, but
> it ended up getting over complex (and SAP was only using simple ops so
> it didn't seem to need much better).
>
> We want to be careful not to change semantics at all. And it gets
> tricky quickly :( What about Zach's simpler wakeup API?
Yeah, that's why my patches include code to handle userland sending
duplicate semids. Zach's simpler API is cooking too, but if I can get
this done without insane complexity it helps with more than just the
post/wait oracle workload.
-chris
> What about Zach's simpler wakeup API?
It's making slow progress in the background as a longer-term experiment.
http://oss.oracle.com/~zab/wake-many/
That URL still has an API description, patches, and little test
utilities for the simple first draft.
- z
On Tue, Apr 13, 2010 at 02:19:37PM -0400, Chris Mason wrote:
> On Wed, Apr 14, 2010 at 04:09:45AM +1000, Nick Piggin wrote:
> > On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
> > > On Tue, Apr 13, 2010 at 07:15:30PM +0200, Manfred Spraul wrote:
> > > > Hi Chris,
> > > >
> > > >
> > > > On 04/12/2010 08:49 PM, Chris Mason wrote:
> > > > > /*
> > > > >+ * when a semaphore is modified, we want to retry the series of operations
> > > > >+ * for anyone that was blocking on that semaphore. This breaks down into
> > > > >+ * a few different common operations:
> > > > >+ *
> > > > >+ * 1) One modification releases one or more waiters for zero.
> > > > >+ * 2) Many waiters are trying to get a single lock, only one will get it.
> > > > >+ * 3) Many modifications to the count will succeed.
> > > > >+ *
> > > > Have you thought about odd corner cases:
> > > > Nick noticed the last time that it is possible to wait for arbitrary values:
> > > > in one semop:
> > > > - decrease semaphore 5 by 10
> > > > - wait until semaphore 5 is 0
> > > > - increase semaphore 5 by 10.
> > >
> > > Do you mean within a single sop array doing all three of these? I don't
> > > know if the sort is going to leave the three operations on semaphore 5
> > > in the same order (it probably won't).
> > >
> > > But I could change that by having it include the slot in the original
> > > sop array in the sorting. That way if we have duplicate semnums in the
> > > array, they will end up in the same position relative to each other in
> > > the sorted result.
> > >
> > > (ewwww ;)
> >
> > I had a bit of a hack at doing per-semaphore stuff when I was looking
> > at the first optimization, but it was tricky to make it work.
> >
> > The other thing I don't know if your patch gets right is the requeueing
> > of the operations. When you requeue from one list to another, then you
> > seem to lose ordering with other pending operations, so that would
> > seem to break the API as well (can't remember if the API strictly
> > mandates FIFO, but anyway it can open up starvation cases).
>
> I don't see anything in the docs about the FIFO order. I could add an
> extra sort on sequence number pretty easily, but is the starvation case
> really that bad?
Yes, because it's not just a theoretical livelock, it can be basically
a certainty, given the right pattern of semops.
You could have two mostly-independent groups of processes, each taking
and releasing a different sem, which are always contended (eg. if it is
being used for a producer-consumer type situation, or even just mutual
exclusion with high contention).
Then you could have some overall management process for example which
tries to take both sems. It will never get it.
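Schematically, the manager is stuck on something like:

        /* needs sem 0 and sem 1 free at the same instant; if each is
         * permanently contended by its own group and requeueing loses
         * our place in line, this never completes */
        struct sembuf both[2] = { { .sem_num = 0, .sem_op = -1, .sem_flg = 0 },
                                  { .sem_num = 1, .sem_op = -1, .sem_flg = 0 } };

        semop(semid, both, 2);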
> > I was looking at doing a sequence number to be able to sort these, but
> > it ended up getting over complex (and SAP was only using simple ops so
> > it didn't seem to need much better).
> >
> > We want to be careful not to change semantics at all. And it gets
> > tricky quickly :( What about Zach's simpler wakeup API?
>
> Yeah, that's why my patches include code to handle userland sending
> duplicate semids.
Duplicate semids? What do you mean?
> Zach's simpler API is cooking too, but if I can get
> this done without insane complexity it helps with more than just the
> post/wait oracle workload.
I am worried about complexity and slowing other cases, given that Oracle
DB seems willing to adapt to the (better suited) new API. So I'd be
interested to know what it helps outside Oracle.
On Wed, Apr 14, 2010 at 04:57:56AM +1000, Nick Piggin wrote:
> On Tue, Apr 13, 2010 at 02:19:37PM -0400, Chris Mason wrote:
> > On Wed, Apr 14, 2010 at 04:09:45AM +1000, Nick Piggin wrote:
> > > On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
> > > > On Tue, Apr 13, 2010 at 07:15:30PM +0200, Manfred Spraul wrote:
> > > > > Hi Chris,
> > > > >
> > > > >
> > > > > On 04/12/2010 08:49 PM, Chris Mason wrote:
> > > > > > /*
> > > > > >+ * when a semaphore is modified, we want to retry the series of operations
> > > > > >+ * for anyone that was blocking on that semaphore. This breaks down into
> > > > > >+ * a few different common operations:
> > > > > >+ *
> > > > > >+ * 1) One modification releases one or more waiters for zero.
> > > > > >+ * 2) Many waiters are trying to get a single lock, only one will get it.
> > > > > >+ * 3) Many modifications to the count will succeed.
> > > > > >+ *
> > > > > Have you thought about odd corner cases:
> > > > > Nick noticed the last time that it is possible to wait for arbitrary values:
> > > > > in one semop:
> > > > > - decrease semaphore 5 by 10
> > > > > - wait until semaphore 5 is 0
> > > > > - increase semaphore 5 by 10.
> > > >
> > > > Do you mean within a single sop array doing all three of these? I don't
> > > > know if the sort is going to leave the three operations on semaphore 5
> > > > in the same order (it probably won't).
> > > >
> > > > But I could change that by having it include the slot in the original
> > > > sop array in the sorting. That way if we have duplicate semnums in the
> > > > array, they will end up in the same position relative to each other in
> > > > the sorted result.
> > > >
> > > > (ewwww ;)
> > >
> > > I had a bit of a hack at doing per-semaphore stuff when I was looking
> > > at the first optimization, but it was tricky to make it work.
> > >
> > > The other thing I don't know if your patch gets right is the requeueing
> > > of the operations. When you requeue from one list to another, then you
> > > seem to lose ordering with other pending operations, so that would
> > > seem to break the API as well (can't remember if the API strictly
> > > mandates FIFO, but anyway it can open up starvation cases).
> >
> > I don't see anything in the docs about the FIFO order. I could add an
> > extra sort on sequence number pretty easily, but is the starvation case
> > really that bad?
>
> Yes, because it's not just a theoretical livelock, it can be basically
> a certainty, given the right pattern of semops.
>
> You could have two mostly-independent groups of processes, each taking
> and releasing a different sem, which are always contended (eg. if it is
> being used for a producer-consumer type situation, or even just mutual
> exclusion with high contention).
>
> Then you could have some overall management process for example which
> tries to take both sems. It will never get it.
Ok, fair enough, I'll add the sequence number.
>
>
> > > I was looking at doing a sequence number to be able to sort these, but
> > > it ended up getting over complex (and SAP was only using simple ops so
> > > it didn't seem to need much better).
> > >
> > > We want to be careful not to change semantics at all. And it gets
> > > tricky quickly :( What about Zach's simpler wakeup API?
> >
> > Yeah, that's why my patches include code to handle userland sending
> > duplicate semids.
>
> Duplicate semids? What do you mean?
Sorry, semnums...index into the array of semaphores.
>
>
> > Zach's simpler API is cooking too, but if I can get
> > this done without insane complexity it helps with more than just the
> > post/wait oracle workload.
>
> I am worried about complexity and slowing other cases, given that Oracle
> DB seems willing to adapt to the (better suited) new API. So I'd be
> interested to know what it helps outside Oracle.
>
Sure, I'd hope that your benchmark from last time around is faster now.
-chris
On Tue, Apr 13, 2010 at 03:01:10PM -0400, Chris Mason wrote:
> On Wed, Apr 14, 2010 at 04:57:56AM +1000, Nick Piggin wrote:
> > Yes, because it's not just a theoretical livelock, it can be basically
> > a certainty, given the right pattern of semops.
> >
> > You could have two mostly-independent groups of processes, each taking
> > and releasing a different sem, which are always contended (eg. if it is
> > being used for a producer-consumer type situation, or even just mutual
> > exclusion with high contention).
> >
> > Then you could have some overall management process for example which
> > tries to take both sems. It will never get it.
>
> Ok, fair enough, I'll add the sequence number.
>
> >
> >
> > > > I was looking at doing a sequence number to be able to sort these, but
> > > > it ended up getting over complex (and SAP was only using simple ops so
> > > > it didn't seem to need much better).
> > > >
> > > > We want to be careful not to change semantics at all. And it gets
> > > > tricky quickly :( What about Zach's simpler wakeup API?
> > >
> > > Yeah, that's why my patches include code to handle userland sending
> > > duplicate semids.
> >
> > Duplicate semids? What do you mean?
>
> Sorry, semnums...index into the array of semaphores.
OK, I wonder just how much it helps, and what.
> > > Zach's simpler API is cooking too, but if I can get
> > > this done without insane complexity it helps with more than just the
> > > post/wait oracle workload.
> >
> > I am worried about complexity and slowing other cases, given that Oracle
> > DB seems willing to adapt to the (better suited) new API. So I'd be
> > interested to know what it helps outside Oracle.
> >
>
> Sure, I'd hope that your benchmark from last time around is faster now.
I didn't actually reproduce it here, I think it was a customer or
partner workload. But SAP only seemed to have one contended semnum in
its array, and it was being operated on with "simple" semops (so that's
about as far as the patches went).
I didn't notice anything that should make that go faster?
Yes, with such a workload, using semops is basically legacy and simple
mutexes should work better. So I'm not outright against improving sysv
sem performance for more complex cases where nothing else we have works
as well.
On Wed, Apr 14, 2010 at 05:25:51AM +1000, Nick Piggin wrote:
> On Tue, Apr 13, 2010 at 03:01:10PM -0400, Chris Mason wrote:
> > On Wed, Apr 14, 2010 at 04:57:56AM +1000, Nick Piggin wrote:
> > > Yes, because it's not just a theoretical livelock, it can be basically
> > > a certainty, given the right pattern of semops.
> > >
> > > You could have two mostly-independent groups of processes, each taking
> > > and releasing a different sem, which are always contended (eg. if it is
> > > being used for a producer-consumer type situation, or even just mutual
> > > exclusion with high contention).
> > >
> > > Then you could have some overall management process for example which
> > > tries to take both sems. It will never get it.
> >
> > Ok, fair enough, I'll add the sequence number.
> >
> > >
> > >
> > > > > I was looking at doing a sequence number to be able to sort these, but
> > > > > it ended up getting over complex (and SAP was only using simple ops so
> > > > > it didn't seem to need much better).
> > > > >
> > > > > We want to be careful not to change semantics at all. And it gets
> > > > > tricky quickly :( What about Zach's simpler wakeup API?
> > > >
> > > > Yeah, that's why my patches include code to handle userland sending
> > > > duplicate semids.
> > >
> > > Duplicate semids? What do you mean?
> >
> > Sorry, semnums...index into the array of semaphores.
>
> OK, I wonder just how much it helps, and what.
Detecting the dups just keeps me from deadlocking. I'm locking each
individual semaphore in sequence, so if userland does something strange
and sends two updates to the same semaphore, the code detects that and
only locks the first one.
>
>
> > > > Zach's simpler API is cooking too, but if I can get
> > > > this done without insane complexity it helps with more than just the
> > > > post/wait oracle workload.
> > >
> > > I am worried about complexity and slowing other cases, given that Oracle
> > > DB seems willing to adapt to the (better suited) new API. So I'd be
> > > interested to know what it helps outside Oracle.
> > >
> >
> > Sure, I'd hope that your benchmark from last time around is faster now.
>
> I didn't actually reproduce it here, I think it was a customer or
> partner workload. But SAP only seemed to have one contended semnum in
> its array, and it was being operated on with "simple" semops (so that's
> about as far as the patches went).
>
> I didn't notice anything that should make that go faster?
Since I'm avoiding the ipc lock while operating on the array, it'll help
any workload that hits on two or more semaphores in the array at
once.
>
> Yes, with such a workload, using semops is basically legacy and simple
> mutexes should work better. So I'm not outright against improving sysv
> sem performance for more complex cases where nothing else we have works
> as well.
>
I'm not in a hurry to overhaul a part of the kernel that has been stable
for a long time. But it really needs some love I think. I'll have more
numbers from a tpc run later this week.
-chris
On Tue, Apr 13, 2010 at 03:38:01PM -0400, Chris Mason wrote:
> On Wed, Apr 14, 2010 at 05:25:51AM +1000, Nick Piggin wrote:
> > I didn't notice anything that should make that go faster?
>
> Since I'm avoiding the ipc lock while operating on the array, it'll help
> any workload that hits on two or more semaphores in the array at
> once.
Yeah, I don't think SAP did that significantly enough to matter. Possibly
some others (aside from Oracle, of course) do, though.
> > Yes, with such a workload, using semops is basically legacy and simple
> > mutexes should work better. So I'm not outright against improving sysv
> > sem performance for more complex cases where nothing else we have works
> > as well.
> >
>
> I'm not in a hurry to overhaul a part of the kernel that has been stable
> for a long time. But it really needs some love I think. I'll have more
> numbers from a tpc run later this week.
Yep, I'm not against it. "industry standard benchmark" numbers would
be great.
I do think we need to be really careful with semantics though. The
API's been around for long enough that it is going to have been
(ab)used in every way possible :)
On 04/13/2010 08:19 PM, Chris Mason wrote:
> On Wed, Apr 14, 2010 at 04:09:45AM +1000, Nick Piggin wrote:
>
>> On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
>>
>> The other thing I don't know if your patch gets right is the requeueing
>> of the operations. When you requeue from one list to another, then you
>> seem to lose ordering with other pending operations, so that would
>> seem to break the API as well (can't remember if the API strictly
>> mandates FIFO, but anyway it can open up starvation cases).
>>
> I don't see anything in the docs about the FIFO order. I could add an
> extra sort on sequence number pretty easily, but is the starvation case
> really that bad?
>
>
How do you want to determine the sequence number?
Is atomic_inc_return() on a per-semaphore array counter sufficiently fast?
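e.g. something like this (untested; the field and helper names are made
up, not from the patch):

        /* sem_array would grow an atomic64_t queue_seq, and sem_queue
         * an unsigned long seq that update_queue can sort on */
        static inline unsigned long sem_next_seq(struct sem_array *sma)
        {
                return atomic64_inc_return(&sma->queue_seq);
        }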
>> I was looking at doing a sequence number to be able to sort these, but
>> it ended up getting over complex (and SAP was only using simple ops so
>> it didn't seem to need much better).
>>
>> We want to be careful not to change semantics at all. And it gets
>> tricky quickly :( What about Zach's simpler wakeup API?
>>
> Yeah, that's why my patches include code to handle userland sending
> duplicate semids. Zach's simpler API is cooking too, but if I can get
> this done without insane complexity it helps with more than just the
> post/wait oracle workload.
>
>
What is the oracle workload? Which multi-sembuf operations does it use?
How many semaphores are in one array?
When the last optimizations were written, I've searched a bit:
- postgres uses per-process semaphores, with small semaphore arrays.
[process sleeps on its own semaphore and is woken up by someone
else when it can make progress]
- with google, I couldn't find anything relevant that uses multi-sembuf
semop() calls.
And I agree with Nick: We should be careful about changing the API.
--
Manfred
On Wed, Apr 14, 2010 at 06:16:53PM +0200, Manfred Spraul wrote:
> On 04/13/2010 08:19 PM, Chris Mason wrote:
> >On Wed, Apr 14, 2010 at 04:09:45AM +1000, Nick Piggin wrote:
> >>On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
> >>The other thing I don't know if your patch gets right is the requeueing
> >>of the operations. When you requeue from one list to another, then you
> >>seem to lose ordering with other pending operations, so that would
> >>seem to break the API as well (can't remember if the API strictly
> >>mandates FIFO, but anyway it can open up starvation cases).
> >I don't see anything in the docs about the FIFO order. I could add an
> >extra sort on sequence number pretty easily, but is the starvation case
> >really that bad?
> >
> How do you want to determine the sequence number?
> Is atomic_inc_return() on a per-semaphore array counter sufficiently fast?
I haven't tried yet, but hopefully it won't be a problem. A later patch
does atomics on the reference count and it doesn't show up in the
profiles.
>
> >>I was looking at doing a sequence number to be able to sort these, but
> >>it ended up getting over complex (and SAP was only using simple ops so
> >>it didn't seem to need much better).
> >>
> >>We want to be careful not to change semantics at all. And it gets
> >>tricky quickly :( What about Zach's simpler wakeup API?
> >Yeah, that's why my patches include code to handle userland sending
> >duplicate semids. Zach's simpler API is cooking too, but if I can get
> >this done without insane complexity it helps with more than just the
> >post/wait oracle workload.
> >
> What is the oracle workload? Which multi-sembuf operations does it use?
> How many semaphores are in one array?
>
> When the last optimizations were written, I've searched a bit:
> - postgres uses per-process semaphores, with small semaphore arrays.
> [process sleeps on its own semaphore and is woken up by someone
> else when it can make progress]
This is similar to Oracle (and the sembench program). Each process has
a semaphore and when it is waiting for a commit it goes to sleep on it.
They are woken up in bulk with semtimedop calls from a single process.
But oracle also uses semaphores for locking in a traditional sense.
Putting the waiters into a per-semaphore list is really only part of the
speedup. The real boost comes from the patch to break up the locks into
a per semaphore lock.
We gain another 10-15% from a later patch that uses atomics on the
refcount, which lets us do sem_putref without a lock (meaning we're
lockless once we get woken up).
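Roughly (a sketch only, not the actual later patch; the refcount field
and free function are made-up names):

static void sem_putref(struct sem_array *sma)
{
        /* dropping a reference no longer needs the ipc lock */
        if (atomic_dec_and_test(&sma->refcount))
                free_sem_array(sma);    /* RCU-delayed free */
}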
I'm cleaning up fixes based on suggestions here and will repost.
> - with google, I couldn't find anything relevant that uses
> multi-sembuf semop() calls.
>
I think this should help any workload that has more than one semaphore
per array, even if they only do one sem per call.
> And I agree with Nick: We should be careful about changing the API.
Definitely, thanks for reading through it.
-chris
On 04/14/2010 07:33 PM, Chris Mason wrote:
> On Wed, Apr 14, 2010 at 06:16:53PM +0200, Manfred Spraul wrote:
>
>> On 04/13/2010 08:19 PM, Chris Mason wrote:
>>
>>> On Wed, Apr 14, 2010 at 04:09:45AM +1000, Nick Piggin wrote:
>>>
>>>> On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
>>>> The other thing I don't know if your patch gets right is the requeueing
>>>> of the operations. When you requeue from one list to another, then you
>>>> seem to lose ordering with other pending operations, so that would
>>>> seem to break the API as well (can't remember if the API strictly
>>>> mandates FIFO, but anyway it can open up starvation cases).
>>>>
>>> I don't see anything in the docs about the FIFO order. I could add an
>>> extra sort on sequence number pretty easily, but is the starvation case
>>> really that bad?
>>>
>>>
>> How do you want to determine the sequence number?
>> Is atomic_inc_return() on a per-semaphore array counter sufficiently fast?
>>
> I haven't tried yet, but hopefully it won't be a problem. A later patch
> does atomics on the reference count and it doesn't show up in the
> profiles.
>
>
>>
>>>> I was looking at doing a sequence number to be able to sort these, but
>>>> it ended up getting overly complex (and SAP was only using simple ops so
>>>> it didn't seem to need much better).
>>>>
>>>> We want to be careful not to change semantics at all. And it gets
>>>> tricky quickly :( What about Zach's simpler wakeup API?
>>>>
>>> Yeah, that's why my patches include code to handle userland sending
>>> duplicate semids. Zach's simpler API is cooking too, but if I can get
>>> this done without insane complexity it helps with more than just the
>>> post/wait oracle workload.
>>>
>>>
>> What is the oracle workload? Which multi-sembuf operations does it use?
>> How many semaphores are in one array?
>>
>> When the last optimizations were written, I searched a bit:
>> - postgres uses per-process semaphores, with small semaphore arrays.
>> [process sleeps on its own semaphore and is woken up by someone
>> else when it can make progress]
>>
> This is similar to Oracle (and the sembench program). Each process has
> a semaphore and when it is waiting for a commit it goes to sleep on it.
> They are woken up in bulk with semtimedop calls from a single process.
>
>
Hmm. Thus you have:
- single sembuf decrease operations that are waiting frequently.
- multi-sembuf increase operations.
What about optimizing for that case?
Increase operations succeed immediately and never sleep; thus complex_count stays 0.
If we have performed an update operation, then we can scan all
simple_lists that have seen an increase instead of checking the global
list - as long as there are no complex operations waiting.
Right now, we give up if the update operation was a complex operation -
but that does not matter.
All that matters are the sleeping operations, not the operation that did
the wakeup.
I've attached an untested idea.
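The core of it, sketched (untested; 'pending' stands for the per-semaphore
pending list from your patch):

	static void update_increased(struct sem_array *sma,
				     struct sembuf *sops, int nsops)
	{
		int i;

		if (sma->complex_count)
			return;	/* complex waiters: fall back to global scan */

		for (i = 0; i < nsops; i++) {
			if (sops[i].sem_op <= 0)
				continue;	/* only increases can wake someone */
			/* 'pending' is the per-semaphore list from the patch */
			update_queue(sma, &sma->sem_base[sops[i].sem_num].pending);
		}
	}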
> But oracle also uses semaphores for locking in a traditional sense.
>
> Putting the waiters into a per-semaphore list is really only part of the
> speedup. The real boost comes from the patch to break up the locks into
> a per semaphore lock.
>
>
Ok. Then simple tricks won't help.
How many semaphores are in one array?
--
Manfred
On Wed, Apr 14, 2010 at 09:11:44PM +0200, Manfred Spraul wrote:
> On 04/14/2010 07:33 PM, Chris Mason wrote:
> >On Wed, Apr 14, 2010 at 06:16:53PM +0200, Manfred Spraul wrote:
> >>On 04/13/2010 08:19 PM, Chris Mason wrote:
> >>>On Wed, Apr 14, 2010 at 04:09:45AM +1000, Nick Piggin wrote:
> >>>>On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
> >>>>The other thing I don't know if your patch gets right is requeueing
> >>>>of the operations. When you requeue from one list to another, then you
> >>>>seem to lose ordering with other pending operations, so that would
> >>>>seem to break the API as well (can't remember if the API strictly
> >>>>mandates FIFO, but anyway it can open up starvation cases).
> >>>I don't see anything in the docs about the FIFO order. I could add an
> >>>extra sort on sequence number pretty easily, but is the starvation case
> >>>really that bad?
> >>>
> >>How do you want to determine the sequence number?
> >>Is atomic_inc_return() on a per-semaphore array counter sufficiently fast?
> >I haven't tried yet, but hopefully it won't be a problem. A later patch
> >does atomics on the reference count and it doesn't show up in the
> >profiles.
> >
> >>>>I was looking at doing a sequence number to be able to sort these, but
> >>>>it ended up getting overly complex (and SAP was only using simple ops so
> >>>>it didn't seem to need much better).
> >>>>
> >>>>We want to be careful not to change semantics at all. And it gets
> >>>>tricky quickly :( What about Zach's simpler wakeup API?
> >>>Yeah, that's why my patches include code to handle userland sending
> >>>duplicate semids. Zach's simpler API is cooking too, but if I can get
> >>>this done without insane complexity it helps with more than just the
> >>>post/wait oracle workload.
> >>>
> >>What is the oracle workload? Which multi-sembuf operations does it use?
> >>How many semaphores are in one array?
> >>
> >>When the last optimizations were written, I searched a bit:
> >>- postgres uses per-process semaphores, with small semaphore arrays.
> >> [process sleeps on its own semaphore and is woken up by someone
> >>else when it can make progress]
> >This is similar to Oracle (and the sembench program). Each process has
> >a semaphore and when it is waiting for a commit it goes to sleep on it.
> >They are woken up in bulk with semtimedop calls from a single process.
> >
> Hmm. Thus you have:
> - single sembuf decrease operations that are waiting frequently.
> - multi-sembuf increase operations.
>
> What about optimizing for that case?
> Increase operations succeed immediately and never sleep; thus complex_count stays 0.
I've been wondering about that. I can optimize the patch to special-case
the increase operations. The only problem I saw was checking for
the range overflow. Current behavior will abort the whole set if the
range overflow happens.
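Something like this pre-check would keep the current semantics (untested
sketch):

	/* verify the whole set against SEMVMX before taking an
	 * increase-only fast path, so the all-or-nothing behavior on
	 * range overflow is preserved.
	 */
	static int increases_fit(struct sem_array *sma,
				 struct sembuf *sops, int nsops)
	{
		int i;

		for (i = 0; i < nsops; i++) {
			struct sem *curr = sma->sem_base + sops[i].sem_num;

			if (sops[i].sem_op <= 0)
				return 0;	/* not increase-only */
			if (curr->semval + sops[i].sem_op > SEMVMX)
				return -ERANGE;	/* abort the whole set */
		}
		return 1;	/* all increases fit */
	}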
>
> If we have performed an update operation, then we can scan all
> simple_lists that have seen an increase instead of checking the
> global list - as long as there are no complex operations waiting.
> Right now, we give up if the update operation was a complex
> operation - but that does not matter.
> All that matters are the sleeping operations, not the operation that
> did the wakeup.
> I've attached an untested idea.
Zach Brown's original patch set tried just the list magic and not
the spinlocks. I'm afraid it didn't help very much overall.
>
> >But oracle also uses semaphores for locking in a traditional sense.
> >
> >Putting the waiters into a per-semaphore list is really only part of the
> >speedup. The real boost comes from the patch to break up the locks into
> >a per semaphore lock.
> >
> Ok. Then simple tricks won't help.
> How many semaphores are in one array?
On a big system I saw about 4000 semaphores total. The database will
just allocate as many as it can into a single array and keep creating
arrays until it has all it needs.
-chris
On 04/14/2010 09:50 PM, Chris Mason wrote:
> On a big system I saw about 4000 semaphores total. The database will
> just allocate as many as it can into a single array and keep creating
> arrays until it has all it needs.
>
>
What happens if SEMMSL is reduced (first entry in /proc/sys/kernel/sem)?
--
Manfred
On Thu, Apr 15, 2010 at 06:33:13PM +0200, Manfred Spraul wrote:
> On 04/14/2010 09:50 PM, Chris Mason wrote:
> >On a big system I saw about 4000 semaphores total. The database will
> >just allocate as many as it can into a single array and keep creating
> >arrays until it has all it needs.
> >
> What happens if SEMMSL is reduced (first entry in /proc/sys/kernel/sem)?
Performance improves slightly but the ipc lock is still at the top of the
profile ;)
-chris
On 04/12/2010 08:49 PM, Chris Mason wrote:
> I have a microbenchmark to test how quickly we can post and wait in
> bulk. With this change, semtimedop is able to do more than twice
> as much work in the same run. On a large numa machine, it brings
> the IPC lock system time (reported by perf) down from 85% to 15%.
>
>
Looking at the current code:
- update_queue() can be O(N^2) if only some of the waiting tasks are
woken up.
Actually: all tasks that have not been woken are rescanned each time a
task that can be woken up is found.
- Your test app tests the best case for the current code:
You wake up the tasks in the same order as they called semop().
If you invert the order (i.e.: worklist_add() adds to head instead of
tail), I would expect even worse performance from the current code.
The O(N^2) is simple to fix, I've attached a patch.
For your micro-benchmark, the patch does not change much: you wake up
in order, thus the current code does not misbehave.
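The rough shape of the fix, as an untested sketch: walk the list once per
pass and loop again only while a pass actually woke someone, instead of
restarting from the head after every single wakeup:

	struct sem_queue *q, *t;
	int error, progress;

	do {
		progress = 0;
		list_for_each_entry_safe(q, t, pending_list, list) {
			error = try_atomic_semop(sma, q->sops, q->nsops,
						 q->undo, q->pid);
			if (error > 0)
				continue;	/* still blocked, keep it queued */
			list_del(&q->list);
			wake_up_sem_queue(q, error);
			if (!error)
				progress = 1;	/* semaphore values changed */
		}
	} while (progress);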
Do you know how Oracle wakes up the tasks?
FIFO, LIFO, un-ordered?
> while (unlikely(error == IN_WAKEUP)) {
> cpu_relax();
> error = queue.status;
> }
>
> - if (error != -EINTR) {
> + /*
> + * we are lock free right here, and we could have timed out or
> + * gotten a signal, so we need to be really careful with how we
> + * play with queue.status. It has three possible states:
> + *
> + * -EINTR, which means nobody has changed it since we slept. This
> + * means we woke up on our own.
> + *
> + * IN_WAKEUP, someone is currently waking us up. We need to loop
> + * here until they change it to the operation error value. If
> + * we don't loop, our process could exit before they are done waking us
> + *
> + * operation error value: we've been properly woken up and can exit
> + * at any time.
> + *
> + * If queue.status is currently -EINTR, we are still being processed
> + * by the semtimedop core. Someone either has us on a list head
> + * or is currently poking our queue struct. We need to find that
> + * reference and remove it, which is what remove_queue_from_lists
> + * does.
> + *
> + * We always check for both -EINTR and IN_WAKEUP because we have no
> + * locks held. Someone could change us from -EINTR to IN_WAKEUP at
> + * any time.
> + */
> + if (error != -EINTR && error != IN_WAKEUP) {
> /* fast path: update_queue already obtained all requested
> * resources */
No: The code accesses a local variable. The loop above the comment
guarantees that the error can't be IN_WAKEUP.
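For reference, the waker's side of that handshake (a sketch close to the
current wake_up_sem_queue()):

	static void wake_up_sem_queue(struct sem_queue *q, int error)
	{
		/* lockless wakeup: status goes -EINTR -> IN_WAKEUP -> error,
		 * so once the sleeper has spun past IN_WAKEUP it can never
		 * read IN_WAKEUP again.
		 */
		preempt_disable();
		q->status = IN_WAKEUP;
		wake_up_process(q->sleeper);
		/* hands off: q may vanish right after the final store */
		smp_wmb();
		q->status = error;
		preempt_enable();
	}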
> +
> +out_putref:
> + sem_putref(sma);
> + goto out_free;
>
Is it possible to move the sem_putref into wakeup_sem_queue()?
Right now, the exit path of semtimedop doesn't touch the spinlock.
You remove that optimization.
--
Manfred
On Fri, Apr 16, 2010 at 01:26:15PM +0200, Manfred Spraul wrote:
> On 04/12/2010 08:49 PM, Chris Mason wrote:
> >I have a microbenchmark to test how quickly we can post and wait in
> >bulk. With this change, semtimedop is able to do more than twice
> >as much work in the same run. On a large numa machine, it brings
> >the IPC lock system time (reported by perf) down from 85% to 15%.
> >
> Looking at the current code:
> - update_queue() can be O(N^2) if only some of the waiting tasks are
> woken up.
> Actually: all tasks that have not been woken are rescanned each time
> a task that can be woken up is found.
>
> - Your test app tests the best case for the current code:
> You wake up the tasks in the same order as they called semop().
> If you invert the order (i.e.: worklist_add() adds to head instead
> of tail), I would expect even worse performance from the current
> code.
>
> The O(N^2) is simple to fix, I've attached a patch.
Good point.
> For your micro-benchmark, the patch does not change much: you wake
> up in order, thus the current code does not misbehave.
>
> Do you know how Oracle wakes up the tasks?
> FIFO, LIFO, un-ordered?
Ordering in terms of the sem array? I had them try many variations ;) I
don't think it will be ordered as well as sembench most of the time.
>
> > while (unlikely(error == IN_WAKEUP)) {
> > cpu_relax();
> > error = queue.status;
> > }
> >
> >- if (error != -EINTR) {
> >+ /*
> >+ * we are lock free right here, and we could have timed out or
> >+ * gotten a signal, so we need to be really careful with how we
> >+ * play with queue.status. It has three possible states:
> >+ *
> >+ * -EINTR, which means nobody has changed it since we slept. This
> >+ * means we woke up on our own.
> >+ *
> >+ * IN_WAKEUP, someone is currently waking us up. We need to loop
> >+ * here until they change it to the operation error value. If
> >+ * we don't loop, our process could exit before they are done waking us
> >+ *
> >+ * operation error value: we've been properly woken up and can exit
> >+ * at any time.
> >+ *
> >+ * If queue.status is currently -EINTR, we are still being processed
> >+ * by the semtimedop core. Someone either has us on a list head
> >+ * or is currently poking our queue struct. We need to find that
> >+ * reference and remove it, which is what remove_queue_from_lists
> >+ * does.
> >+ *
> >+ * We always check for both -EINTR and IN_WAKEUP because we have no
> >+ * locks held. Someone could change us from -EINTR to IN_WAKEUP at
> >+ * any time.
> >+ */
> >+ if (error != -EINTR && error != IN_WAKEUP) {
> > /* fast path: update_queue already obtained all requested
> > * resources */
> No: The code accesses a local variable. The loop above the comment
> guarantees that the error can't be IN_WAKEUP.
Whoops, thanks.
>
> >+
> >+out_putref:
> >+ sem_putref(sma);
> >+ goto out_free;
> Is it possible to move the sem_putref into wakeup_sem_queue()?
> Right now, the exit path of semtimedop doesn't touch the spinlock.
> You remove that optimization.
I'll look at this; we need to be able to go through the sma to remove
the process from the lists if it woke up on its own, but I don't see why
we can't putref in wakeup.
My current revision of the patch uses an atomic instead of the lock, so it
restores the lockless wakeup either way. Still it is better to putref in
wakeup.
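Roughly (sketch; the sma back-pointer in sem_queue is an assumption):

	static void wake_up_sem_queue(struct sem_queue *q, int error)
	{
		struct sem_array *sma = q->sma;	/* assumed back-pointer */

		preempt_disable();
		q->status = IN_WAKEUP;
		wake_up_process(q->sleeper);
		smp_wmb();
		q->status = error;
		preempt_enable();

		sem_putref(sma);	/* drop the ref taken when the task queued */
	}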
-chris
Hi Chris,
On 04/12/2010 08:49 PM, Chris Mason wrote:
> @@ -599,6 +622,13 @@ again:
> list_splice_init(&new_pending, &work_list);
> goto again;
> }
> +
> + list_sort(NULL, &wake_list, list_comp);
> + while (!list_empty(&wake_list)) {
> + q = list_entry(wake_list.next, struct sem_queue, list);
> + list_del_init(&q->list);
> + wake_up_sem_queue(q, 0);
> + }
> }
>
What about moving this step much later?
There is no need to hold any locks for the actual wake_up_process().
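I.e. roughly (sketch):

	struct sem_queue *q;
	LIST_HEAD(wake_list);

	spin_lock(&sma->sem_perm.lock);
	/* ... update values, move now-runnable entries to wake_list ... */
	spin_unlock(&sma->sem_perm.lock);

	/* the wake_up_process() calls run with no locks held */
	while (!list_empty(&wake_list)) {
		q = list_entry(wake_list.next, struct sem_queue, list);
		list_del_init(&q->list);
		wake_up_sem_queue(q, 0);
	}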
I've updated my patch:
- improved update_queue that guarantees no O(N^2) for your workload.
- move the actual wake-up after dropping all locks
- optimize setting sem_otime
- cacheline align the ipc spinlock (rough shape sketched below).
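The alignment change is roughly this (sketch; mainline spells the
attribute ____cacheline_aligned_in_smp):

	struct kern_ipc_perm {
		spinlock_t	lock ____cacheline_aligned_in_smp;
		/* ... remaining fields unchanged ... */
	};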
But the odd thing: it doesn't improve the sembench result at all (AMD
Phenom X4). The only thing that is reduced is the system time: from
~1 min of system time for "sembench -t 250 -w 250 -r 30 -o 0" to ~30 sec.
CPU-binding the sembench threads results in an improvement of ~50%, at
the cost of a significant increase in system time (from 30 seconds to
1 min) and user time (from 2 seconds to 14 seconds).
Are you sure that the problem is contention on the semaphore array spinlock?
With the above changes, the code that is under the spin_lock is very short.
Especially:
- Why does optimizing ipc/sem.c only reduce the system time [reported by
time] and not the sembench output?
- Why is there no improvement from the ___cache_line_align?
If there were contention, then there should be thrashing from
accessing the lock and writing sem_otime and reading sem_base.
- Additionally: you wrote that reducing the array size does not help much.
But: The arrays are 100% independent, the ipc code scales linearly.
Spreading the work over multiple spinlocks is - like cache line aligning
- usually a 100% guaranteed improvement if there is contention.
I've attached a modified sembench.c and the proposal for ipc/sem.c.
Could you try it?
What do you think?
How many cores do you have in your test system?
--
Manfred