Hi,
The following set of patches is based on the discussion of holding the
ipc lock unnecessarily, such as for permission and security checks:
https://lkml.org/lkml/2013/2/28/540
Patch 1/4: Remove the bogus comment from ipc_checkid() requiring that
the ipc lock be held before calling it. Also simplify the function
return. This is a new patch, not present in the RFC.
Patch 2/4: Introduce functions to obtain the ipc object without holding
the lock. Two functions, ipc_obtain_object() and
ipc_obtain_object_check(), are created, which are analogous to
ipc_lock() and ipc_lock_check(). This patch was acked by Michel
Lespinasse and reviewed by Chegu Vinod.
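For reference, the lockless lookup is roughly of this shape - an idr
lookup done entirely under the RCU read lock, with no spinlock taken
(a sketch of the idea, not the literal patch text):

struct kern_ipc_perm *ipc_obtain_object(struct ipc_ids *ids, int id)
{
	struct kern_ipc_perm *out;
	int lid = ipcid_to_idx(id);

	/* Caller must hold rcu_read_lock(); no ipc spinlock is taken. */
	out = idr_find(&ids->ipcs_idr, lid);
	if (!out)
		return ERR_PTR(-EINVAL);

	return out;
}

ipc_obtain_object_check() additionally validates the sequence number
with ipc_checkid() before returning the object.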
Patch 3/4: Introduce ipcctl_pre_down_nolock() function, which is a
lockless version of ipcctl_pre_down(). This function is common to sem,
msg and shm and does some common checking for IPC_RMID and IPC_SET
commands. The older version was kept but now calls the lockless version
internally, preserving the semantics, so it is transparent to callers. This
was suggested by Linus. Once all users are updated, the
ipcctl_pre_down() function can be removed.
Patch 4/4: Use the new, lockless, functions introduced above to only
hold the ipc lock when necessary. The idea is simple: only check ipc
security and permissions within the rcu read region, *without* holding
the ipc lock. This patch was acked by Michel Lespinasse and reviewed by
Chegu Vinod.
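To illustrate the intended fast path (a sketch of the idea, not the
exact patch text - the sem_obtain_object_check() signature is assumed):

rcu_read_lock();
sma = sem_obtain_object_check(ns, semid);	/* no spinlock taken */
if (IS_ERR(sma)) {
	rcu_read_unlock();
	return PTR_ERR(sma);
}
if (ipcperms(ns, &sma->sem_perm, S_IRUGO)) {	/* checked lock-free */
	rcu_read_unlock();
	return -EACCES;
}
ipc_lock_object(&sma->sem_perm);	/* lock held only for the update */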
Changes since v1 (RFC):
- Add patches 1 and 3.
- Patch 2: In ipc_lock(), instead of checking the return of
ipc_obtain_object_check() against NULL, use IS_ERR(). Suggested by
Michel Lespinasse.
- Patch 2,4: In order for the rcu read lock/unlock calls to be paired up
more obviously, force the user to call rcu_read_lock *before* calling
ipc_obtain_object[_check](). Suggested by Michel Lespinasse.
- Patch 4: Return ERR_CAST() in sem_obtain_object[_check]() instead of a
cast to struct sem_array *. Suggested by Linus.
- Patch 4: Change open coded spin_lock calls to ipc_object_lock in
semaphore code. Suggested by Linus.
- Patch 4: Added an 'out_wakeup' label to semctl_main() and semtimedop()
so the functions can return without calling sem_unlock (and hence
spin_unlock) when the lock is not held.
- More tests: For the past few days I've been running this patchset on
my own laptop, and on 2 and 8 socket machines running my Oracle
swingbench workloads. I have not encountered any issues so far. The main
fix, suggested by Linus, was for the bogus ipcctl_pre_down() changes
that had been made without updating the callers.
Ok, some numbers...
1) With Rik's semop-multi.c microbenchmark we can see the following
results:
Baseline (3.9-rc1):
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 151452270, ops/sec 5048409
+ 59.40% a.out [kernel.kallsyms] [k] _raw_spin_lock
+ 6.14% a.out [kernel.kallsyms] [k] sys_semtimedop
+ 3.84% a.out [kernel.kallsyms] [k] avc_has_perm_flags
+ 3.64% a.out [kernel.kallsyms] [k] __audit_syscall_exit
+ 2.06% a.out [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 1.86% a.out [kernel.kallsyms] [k] ipc_lock
With this patchset:
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 273156400, ops/sec 9105213
+ 18.54% a.out [kernel.kallsyms] [k] _raw_spin_lock
+ 11.72% a.out [kernel.kallsyms] [k] sys_semtimedop
+ 7.70% a.out [kernel.kallsyms] [k] ipc_has_perm.isra.21
+ 6.58% a.out [kernel.kallsyms] [k] avc_has_perm_flags
+ 6.54% a.out [kernel.kallsyms] [k] __audit_syscall_exit
+ 4.71% a.out [kernel.kallsyms] [k] ipc_obtain_object_check
2) While on an Oracle swingbench DSS (data mining) workload the
improvements are not as exciting as with Rik's benchmark, we can see
some positive numbers. For an 8 socket machine the following are the
percentages of %sys time incurred in the ipc lock:
Baseline (3.9-rc1):
100 swingbench users: 8.74%
400 swingbench users: 21.86%
800 swingbench users: 84.35%
With this patchset:
100 swingbench users: 8.11%
400 swingbench users: 19.93%
800 swingbench users: 77.69%
Thanks,
Davidlohr
On Tue, Mar 5, 2013 at 1:35 AM, Davidlohr Bueso <[email protected]> wrote:
>
> The following set of patches is based on the discussion of holding the
> ipc lock unnecessarily, such as for permission and security checks:
Ok, looks fine from a quick look (but then, so did your previous patch-set ;)
You still open-code the spinlock in at least a few places (I saw
sem_getref), but I still don't care deeply.
> 2) While on an Oracle swingbench DSS (data mining) workload the
> improvements are not as exciting as with Rik's benchmark, we can see
> some positive numbers. For an 8 socket machine the following are the
> percentages of %sys time incurred in the ipc lock:
Ok, I hoped for it being more noticeable. Since that benchmark is less
trivial than Rik's, can you do a perf record -fg of it and give a more
complete picture of what the kernel footprint is - and in particular
who now gets that ipc lock function? Is it purely semtimedop, or what?
Look out for inlining - ipc_rcu_getref() looks like it would be
inlined, for example.
It would be good to get a "top twenty kernel functions" from the
profile, along with some call data on where the lock callers are.. I
know that Rik's benchmark *only* had that one call-site, I'm wondering
if the swingbench one has slightly more complex behavior...
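Something along these lines would do (exact flags vary by perf version,
so treat this as a sketch):

  perf record -a -g -- <run the swingbench workload>
  perf report --sort symbol	# top kernel functions
  perf report -g		# call chains under _raw_spin_lock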
Linus
On 03/05/2013 04:35 AM, Davidlohr Bueso wrote:
> 2) While on an Oracle swingbench DSS (data mining) workload the
> improvements are not as exciting as with Rik's benchmark, we can see
> some positive numbers. For an 8 socket machine the following are the
> percentages of %sys time incurred in the ipc lock:
>
> Baseline (3.9-rc1):
> 100 swingbench users: 8.74%
> 400 swingbench users: 21.86%
> 800 swingbench users: 84.35%
>
> With this patchset:
> 100 swingbench users: 8.11%
> 400 swingbench users: 19.93%
> 800 swingbench users: 77.69%
Does the swingbench DSS workload use multiple semaphores, or
just one?
Your patches look like a great start to make the semaphores
more scalable. If the swingbench DSS workload uses multiple
semaphores, I have ideas for follow-up patches to make things
scale better.
What does ipcs output look like while running swingbench DSS?
--
All rights reversed
On 03/05/2013 12:10 PM, Rik van Riel wrote:
> On 03/05/2013 04:35 AM, Davidlohr Bueso wrote:
>
>> 2) While on an Oracle swingbench DSS (data mining) workload the
>> improvements are not as exciting as with Rik's benchmark, we can see
>> some positive numbers. For an 8 socket machine the following are the
>> percentages of %sys time incurred in the ipc lock:
>>
>> Baseline (3.9-rc1):
>> 100 swingbench users: 8.74%
>> 400 swingbench users: 21.86%
>> 800 swingbench users: 84.35%
>>
>> With this patchset:
>> 100 swingbench users: 8.11%
>> 400 swingbench users: 19.93%
>> 800 swingbench users: 77.69%
>
> Does the swingbench DSS workload use multiple semaphores, or
> just one?
>
> Your patches look like a great start to make the semaphores
> more scalable. If the swingbench DSS workload uses multiple
> semaphores, I have ideas for follow-up patches to make things
> scale better.
>
> What does ipcs output look like while running swingbench DSS?
>
For Oracle, the semaphores are set up when the instance is started,
irrespective of the workload. For an 8-socket, 80-core test system, the
output of ipcs looks like:
------ Semaphore Arrays --------
key semid owner perms nsems
0x00000000 0 root 600 1
0x00000000 65537 root 600 1
0xcd9652f0 4718594 oracle 640 226
0xcd9652f1 4751363 oracle 640 226
0xcd9652f2 4784132 oracle 640 226
0xcd9652f3 4816901 oracle 640 226
0xcd9652f4 4849670 oracle 640 226
The recommended kernel.sem value from Oracle is "250 32000 100 128". I
have tried to reduce the maximum semaphores per array (1st value) while
increasing the max number of arrays. That tends to reduce the ipc_lock
contention in the kernel, but it is against Oracle's recommendation.
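For context, the four kernel.sem fields are SEMMSL (max semaphores per
array), SEMMNS (max semaphores system-wide), SEMOPM (max operations per
semop call) and SEMMNI (max number of arrays). So the tuning described
above amounts to something like (illustrative values, not a
recommendation):

# /etc/sysctl.conf -  SEMMSL SEMMNS SEMOPM SEMMNI
kernel.sem =           50    32000   100    640

i.e. lowering the first field while raising the fourth spreads the
semaphores over more arrays, and hence over more per-array ipc locks.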
-Longman
On Tue, Mar 5, 2013 at 11:42 AM, Waiman Long <[email protected]> wrote:
>
> The recommended kernel.sem value from Oracle is "250 32000 100 128". I have
> tried to reduce the maximum semaphores per array (1st value) while
> increasing the max number of arrays. That tends to reduce the ipc_lock
> contention in the kernel, but it is against Oracle's recommendation.
Ok, the Oracle recommendations seem to be assuming that we'd be
scaling the semaphore locking sanely, which we don't. Since we share
one single lock for all semaphores in the whole array, Oracle's
recommendation does the wrong thing for our ipc_lock contention.
At the same time, I have to say that Oracle's recommendation is the
right thing to do, and it's really a kernel limitation that we scale
badly with lots of semaphores in the array. I'm surprised this hasn't
really come up before. It seems such a basic scalability issue for
such a traditional Unix load. And while everybody hates the SysV IPC
stuff, it's not like it's all *that* complicated. We've had people who
worked on much more fundamental and complex scalability things.
David's patch should make it much easier to do the locking more
fine-grained, and it sounds like Rik is actively working on that, so
I'm hopeful that we can actually do this right in the not too distant
future. The fact that oracle recommends using large semaphore arrays
actually makes me very hopeful that they use semaphores correctly, so
that if we just do our scalability work, you'd get the full advantage
of it..
Linus
On 03/05/2013 03:52 PM, Linus Torvalds wrote:
> On Tue, Mar 5, 2013 at 11:42 AM, Waiman Long <[email protected]> wrote:
>>
>> The recommended kernel.sem value from Oracle is "250 32000 100 128". I have
>> tried to reduce the maximum semaphores per array (1st value) while
>> increasing the max number of arrays. That tends to reduce the ipc_lock
>> contention in the kernel, but it is against Oracle's recommendation.
>
> Ok, the Oracle recommendations seem to be assuming that we'd be
> scaling the semaphore locking sanely, which we don't. Since we share
> one single lock for all semaphores in the whole array, Oracle's
> recommendation does the wrong thing for our ipc_lock contention.
> David's patch should make it much easier to do the locking more
> fine-grained, and it sounds like Rik is actively working on that,
Indeed. Though how well my patches will work with Oracle will
depend a lot on what kind of semctl syscalls they are doing.
Does Oracle typically do one semop per semctl syscall, or does
it pass in a whole bunch at once?
--
All rights reversed
On 03/05/2013 03:53 PM, Rik van Riel wrote:
> On 03/05/2013 03:52 PM, Linus Torvalds wrote:
>> On Tue, Mar 5, 2013 at 11:42 AM, Waiman Long <[email protected]> wrote:
>>>
>>> The recommended kernel.sem value from Oracle is "250 32000 100 128".
>>> I have
>>> tried to reduce the maximum semaphores per array (1st value) while
>>> increasing the max number of arrays. That tends to reduce the ipc_lock
>>> contention in the kernel, but it is against Oracle's recommendation.
>>
>> Ok, the Oracle recommendations seem to be assuming that we'd be
>> scaling the semaphore locking sanely, which we don't. Since we share
>> one single lock for all semaphores in the whole array, Oracle's
>> recommendation does the wrong thing for our ipc_lock contention.
>
>> David's patch should make it much easier to do the locking more
>> fine-grained, and it sounds like Rik is actively working on that,
>
> Indeed. Though how well my patches will work with Oracle will
> depend a lot on what kind of semctl syscalls they are doing.
>
> Does Oracle typically do one semop per semctl syscall, or does
> it pass in a whole bunch at once?
>
I had collected a strace log of Oracle instance startup a while ago. In
the log, almost all of the semctl() calls set a single semaphore
value in one element of the array using SETVAL. Also there are
far more semtimedop() calls than semctl() calls, about 100:1. Again, all the
semtimedop() operations are on a single element of the semaphore array.
Please note that the behavior of Oracle at startup time may not be
indicative of what it will do when running benchmarks like Swingbench.
However, I don't think there will be a dramatic change in behavior.
-Longman
On 03/05/2013 10:46 PM, Waiman Long wrote:
> On 03/05/2013 03:53 PM, Rik van Riel wrote:
>> Indeed. Though how well my patches will work with Oracle will
>> depend a lot on what kind of semctl syscalls they are doing.
>>
>> Does Oracle typically do one semop per semctl syscall, or does
>> it pass in a whole bunch at once?
>
> I had collected a strace log of Oracle instance startup a while ago. In
> the log, almost all of the semctl() calls set a single semaphore
> value in one element of the array using SETVAL. Also there are
> far more semtimedop() calls than semctl() calls, about 100:1. Again, all the
> semtimedop() operations are on a single element of the semaphore array.
That is good to hear. Just what I was hoping when I started
working on my patches. You should expect them tomorrow or
Thursday.
--
All rights reversed
On Tue, 2013-03-05 at 07:40 -0800, Linus Torvalds wrote:
> On Tue, Mar 5, 2013 at 1:35 AM, Davidlohr Bueso <[email protected]> wrote:
> >
> > The following set of patches is based on the discussion of holding the
> > ipc lock unnecessarily, such as for permission and security checks:
>
> Ok, looks fine from a quick look (but then, so did your previous patch-set ;)
>
> You still open-code the spinlock in at least a few places (I saw
> sem_getref), but I still don't care deeply.
>
> > 2) While on an Oracle swingbench DSS (data mining) workload the
> > improvements are not as exciting as with Rik's benchmark, we can see
> > some positive numbers. For an 8 socket machine the following are the
> > percentages of %sys time incurred in the ipc lock:
>
> Ok, I hoped for it being more noticeable. Since that benchmark is less
> trivial than Rik's, can you do a perf record -fg of it and give a more
> complete picture of what the kernel footprint is - and in particular
> who now gets that ipc lock function? Is it purely semtimedop, or what?
> Look out for inlining - ipc_rcu_getref() looks like it would be
> inlined, for example.
>
> It would be good to get a "top twenty kernel functions" from the
> profile, along with some call data on where the lock callers are.. I
> know that Rik's benchmark *only* had that one call-site, I'm wondering
> if the swingbench one has slightly more complex behavior...
For a 400 user workload (the kernel functions remain basically the same
for any number of users):
17.86% oracle [kernel.kallsyms] [k] _raw_spin_lock
8.46% swapper [kernel.kallsyms] [k] intel_idle
5.51% oracle [kernel.kallsyms] [k] try_atomic_semop
5.05% oracle [kernel.kallsyms] [k] update_sd_lb_stats
2.81% oracle [kernel.kallsyms] [k] tg_load_down
2.41% swapper [kernel.kallsyms] [k] update_blocked_averages
2.38% oracle [kernel.kallsyms] [k] idle_cpu
2.37% swapper [kernel.kallsyms] [k] native_write_msr_safe
2.28% oracle [kernel.kallsyms] [k] update_cfs_rq_blocked_load
1.84% oracle [kernel.kallsyms] [k] update_blocked_averages
1.79% oracle [kernel.kallsyms] [k] update_queue
1.73% swapper [kernel.kallsyms] [k] update_cfs_rq_blocked_load
1.29% oracle [kernel.kallsyms] [k] native_write_msr_safe
1.07% java [kernel.kallsyms] [k] update_sd_lb_stats
0.91% swapper [kernel.kallsyms] [k] poll_idle
0.86% oracle [kernel.kallsyms] [k] try_to_wake_up
0.80% java [kernel.kallsyms] [k] tg_load_down
0.72% oracle [kernel.kallsyms] [k] load_balance
0.67% oracle [kernel.kallsyms] [k] __schedule
0.67% oracle [kernel.kallsyms] [k] cpumask_next_and
Digging into the _raw_spin_lock call:
17.86% oracle [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--49.55%-- sys_semtimedop
| |
| |--77.41%-- system_call
| | semtimedop
| | skgpwwait
| | ksliwat
| | kslwaitctx
Thanks,
Davidlohr
On Tue, 2013-03-05 at 22:53 -0500, Rik van Riel wrote:
> On 03/05/2013 10:46 PM, Waiman Long wrote:
> > On 03/05/2013 03:53 PM, Rik van Riel wrote:
>
> >> Indeed. Though how well my patches will work with Oracle will
> >> depend a lot on what kind of semctl syscalls they are doing.
> >>
> >> Does Oracle typically do one semop per semctl syscall, or does
> >> it pass in a whole bunch at once?
> >
> > I had collected a strace log of Oracle instance startup a while ago. In
> > the log, almost all of the semctl() calls set a single semaphore
> > value in one element of the array using SETVAL. Also there are
> > far more semtimedop() calls than semctl() calls, about 100:1. Again, all the
> > semtimedop() operations are on a single element of the semaphore array.
>
> That is good to hear. Just what I was hoping when I started
> working on my patches. You should expect them tomorrow or
> Thursday.
Great, looking forward.
Thanks,
Davidlohr
On Tue, Mar 5, 2013 at 11:13 PM, Davidlohr Bueso <[email protected]> wrote:
>
> Digging into the _raw_spin_lock call:
>
> 17.86% oracle [kernel.kallsyms] [k] _raw_spin_lock
> |
> --- _raw_spin_lock
> |
> |--49.55%-- sys_semtimedop
> | |
> | |--77.41%-- system_call
> | | semtimedop
> | | skgpwwait
> | | ksliwat
> | | kslwaitctx
Hmm. It looks like you cut that off a bit too early. This shows that
half the cases came from sys_semtimedop. Where did the other half come
from?
Linus
These patches go on top of Davidlohr's 4 patches. Performance numbers in patch "7"
---8<---
Replace the ipc_perm.lock with an rwlock, in preparation for
finer grained locking for semaphore operations.
Signed-off-by: Rik van Riel <[email protected]>
---
include/linux/ipc.h | 2 +-
ipc/sem.c | 11 +++--------
ipc/shm.c | 2 +-
ipc/util.c | 12 ++++++------
ipc/util.h | 6 +++---
5 files changed, 14 insertions(+), 19 deletions(-)
diff --git a/include/linux/ipc.h b/include/linux/ipc.h
index 8d861b2..3c5e2aa 100644
--- a/include/linux/ipc.h
+++ b/include/linux/ipc.h
@@ -10,7 +10,7 @@
/* used by in-kernel data structures */
struct kern_ipc_perm
{
- spinlock_t lock;
+ rwlock_t lock;
int deleted;
int id;
key_t key;
diff --git a/ipc/sem.c b/ipc/sem.c
index f06a853..efb49e7 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -260,7 +260,7 @@ static inline void sem_putref(struct sem_array *sma)
*/
static inline void sem_getref(struct sem_array *sma)
{
- spin_lock(&(sma)->sem_perm.lock);
+ write_lock(&(sma)->sem_perm.lock);
ipc_rcu_getref(sma);
ipc_unlock(&(sma)->sem_perm);
}
@@ -778,7 +778,6 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
struct list_head tasks;
/* Free the existing undo structures for this semaphore set. */
- assert_spin_locked(&sma->sem_perm.lock);
list_for_each_entry_safe(un, tu, &sma->list_id, list_id) {
list_del(&un->list_id);
spin_lock(&un->ulp->lock);
@@ -977,7 +976,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
}
}
- spin_lock(&sma->sem_perm.lock);
+ write_lock(&sma->sem_perm.lock);
for (i = 0; i < sma->sem_nsems; i++)
sem_io[i] = sma->sem_base[i].semval;
sem_unlock(sma);
@@ -1025,7 +1024,6 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
for (i = 0; i < nsems; i++)
sma->sem_base[i].semval = sem_io[i];
- assert_spin_locked(&sma->sem_perm.lock);
list_for_each_entry(un, &sma->list_id, list_id) {
for (i = 0; i < nsems; i++)
un->semadj[i] = 0;
@@ -1042,7 +1040,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
if(semnum < 0 || semnum >= nsems)
goto out_unlock;
- spin_lock(&sma->sem_perm.lock);
+ write_lock(&sma->sem_perm.lock);
curr = &sma->sem_base[semnum];
switch (cmd) {
@@ -1067,7 +1065,6 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
if (val > SEMVMX || val < 0)
goto out_unlock;
- assert_spin_locked(&sma->sem_perm.lock);
list_for_each_entry(un, &sma->list_id, list_id)
un->semadj[semnum] = 0;
@@ -1345,7 +1342,6 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
new->semid = semid;
assert_spin_locked(&ulp->lock);
list_add_rcu(&new->list_proc, &ulp->list_proc);
- assert_spin_locked(&sma->sem_perm.lock);
list_add(&new->list_id, &sma->list_id);
un = new;
@@ -1696,7 +1692,6 @@ void exit_sem(struct task_struct *tsk)
}
/* remove un from the linked lists */
- assert_spin_locked(&sma->sem_perm.lock);
list_del(&un->list_id);
spin_lock(&ulp->lock);
diff --git a/ipc/shm.c b/ipc/shm.c
index cb858df..62b9f1d 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -141,7 +141,7 @@ static inline struct shmid_kernel *shm_lock(struct ipc_namespace *ns, int id)
static inline void shm_lock_by_ptr(struct shmid_kernel *ipcp)
{
rcu_read_lock();
- spin_lock(&ipcp->shm_perm.lock);
+ write_lock(&ipcp->shm_perm.lock);
}
static inline struct shmid_kernel *shm_lock_check(struct ipc_namespace *ns,
diff --git a/ipc/util.c b/ipc/util.c
index 6a98e62..8a87900 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -263,17 +263,17 @@ int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size)
idr_preload(GFP_KERNEL);
- spin_lock_init(&new->lock);
+ rwlock_init(&new->lock);
new->deleted = 0;
rcu_read_lock();
- spin_lock(&new->lock);
+ write_lock(&new->lock);
id = idr_alloc(&ids->ipcs_idr, new,
(next_id < 0) ? 0 : ipcid_to_idx(next_id), 0,
GFP_NOWAIT);
idr_preload_end();
if (id < 0) {
- spin_unlock(&new->lock);
+ write_unlock(&new->lock);
rcu_read_unlock();
return id;
}
@@ -708,7 +708,7 @@ struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
if (IS_ERR(out))
goto err1;
- spin_lock(&out->lock);
+ write_lock(&out->lock);
/* ipc_rmid() may have already freed the ID while ipc_lock
* was spinning: here verify that the structure is still valid
@@ -718,7 +718,7 @@ struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
return out;
err0:
- spin_unlock(&out->lock);
+ write_unlock(&out->lock);
err1:
rcu_read_unlock();
return ERR_PTR(-EINVAL);
@@ -830,7 +830,7 @@ struct kern_ipc_perm *ipcctl_pre_down(struct ipc_namespace *ns,
if (IS_ERR(ipcp))
goto out;
- spin_lock(&ipcp->lock);
+ write_lock(&ipcp->lock);
out:
return ipcp;
}
diff --git a/ipc/util.h b/ipc/util.h
index c36b997..6f157ab 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -162,18 +162,18 @@ static inline int ipc_checkid(struct kern_ipc_perm *ipcp, int uid)
static inline void ipc_lock_by_ptr(struct kern_ipc_perm *perm)
{
rcu_read_lock();
- spin_lock(&perm->lock);
+ write_lock(&perm->lock);
}
static inline void ipc_unlock(struct kern_ipc_perm *perm)
{
- spin_unlock(&perm->lock);
+ write_unlock(&perm->lock);
rcu_read_unlock();
}
static inline void ipc_lock_object(struct kern_ipc_perm *perm)
{
- spin_lock(&perm->lock);
+ write_lock(&perm->lock);
}
struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id);
Rename sem_lock to sem_obtain_lock, so we can introduce a sem_lock
function later that only locks the sem_array and does nothing else.
Open code the locking from ipc_lock in sem_obtain_lock, so we can
introduce finer grained locking for the sem_array in the next patch.
Signed-off-by: Rik van Riel <[email protected]>
---
ipc/sem.c | 23 +++++++++++++++++++----
1 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/ipc/sem.c b/ipc/sem.c
index efb49e7..d92ba32 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -194,14 +194,29 @@ void __init sem_init (void)
* sem_lock_(check_) routines are called in the paths where the rw_mutex
* is not held.
*/
-static inline struct sem_array *sem_lock(struct ipc_namespace *ns, int id)
+static inline struct sem_array *sem_obtain_lock(struct ipc_namespace *ns, int id)
{
- struct kern_ipc_perm *ipcp = ipc_lock(&sem_ids(ns), id);
+ struct kern_ipc_perm *ipcp;
+ rcu_read_lock();
+ ipcp = ipc_obtain_object(&sem_ids(ns), id);
if (IS_ERR(ipcp))
- return (struct sem_array *)ipcp;
+ goto err1;
+
+ write_lock(&ipcp->lock);
+
+ /* ipc_rmid() may have already freed the ID while write_lock
+ * was spinning: verify that the structure is still valid
+ */
+ if (ipcp->deleted)
+ goto err0;
return container_of(ipcp, struct sem_array, sem_perm);
+err0:
+ write_unlock(&ipcp->lock);
+err1:
+ rcu_read_unlock();
+ return ERR_PTR(-EINVAL);
}
static inline struct sem_array *sem_obtain_object(struct ipc_namespace *ns, int id)
@@ -1558,7 +1573,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
goto out_free;
}
- sma = sem_lock(ns, semid);
+ sma = sem_obtain_lock(ns, semid);
/*
* Wait until it's guaranteed that no wakeup_sem_queue_do() is ongoing.
Introduce finer grained locking for semtimedop, to handle the
common case of a program wanting to manipulate one semaphore
from an array with multiple semaphores.
Each semaphore array has a read/write lock. If something
complex is going on (manipulation of the array, of multiple
semaphores in one syscall, etc), the lock is taken in exclusive
mode.
If the call is a semop manipulating just one semaphore in
an array with multiple semaphores, the read/write lock for
the semaphore array is taken in shared (read) mode, and the
individual semaphore's lock is taken.
On a 24 CPU system, performance numbers with the semop-multi
test with N threads and N semaphores look like this:

threads    vanilla    Davidlohr's    Davidlohr's +
                      patches        rwlock patches
     10     610652        726325        1783589
     20     341570        365699        1520453
     30     288102        307037        1498167
     40     290714        305955        1612665
     50     288620        312890        1733453
     60     289987        306043        1649360
     70     291298        306347        1723167
     80     290948        305662        1729545
     90     290996        306680        1736021
    100     292243        306700        1773700
Signed-off-by: Rik van Riel <[email protected]>
---
ipc/sem.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 79 insertions(+), 21 deletions(-)
diff --git a/ipc/sem.c b/ipc/sem.c
index d92ba32..3ab9385 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -94,6 +94,7 @@
struct sem {
int semval; /* current value */
int sempid; /* pid of last operation */
+ spinlock_t lock; /* spinlock for fine-grained semtimedop */
struct list_head sem_pending; /* pending single-sop operations */
};
@@ -138,7 +139,6 @@ struct sem_undo_list {
#define sem_ids(ns) ((ns)->ids[IPC_SEM_IDS])
-#define sem_unlock(sma) ipc_unlock(&(sma)->sem_perm)
#define sem_checkid(sma, semid) ipc_checkid(&sma->sem_perm, semid)
static int newary(struct ipc_namespace *, struct ipc_params *);
@@ -191,19 +191,75 @@ void __init sem_init (void)
}
/*
+ * If the sem_array contains just one semaphore, or if multiple
+ * semops are performed in one syscall, or if there are complex
+ * operations pending, the whole sem_array is locked exclusively.
+ * If one semop is performed on an array with multiple semaphores,
+ * get a shared lock on the array, and lock the individual semaphore.
+ *
+ * Carefully guard against sma->complex_count changing between zero
+ * and non-zero while we are spinning for the lock. The value of
+ * sma->complex_count cannot change while we are holding the lock,
+ * so sem_unlock should be fine.
+ */
+static inline void sem_lock(struct sem_array *sma, struct sembuf *sops,
+ int nsops)
+{
+ again:
+ if (nsops == 1 && sma->sem_nsems > 1 && !sma->complex_count) {
+ struct sem *sem = sma->sem_base + sops->sem_num;
+ /* Shared access to the sem_array. */
+ read_lock(&sma->sem_perm.lock);
+
+ /* Was sma->complex_count set while we were spinning? */
+ if (unlikely(sma->complex_count)) {
+ read_unlock(&sma->sem_perm.lock);
+ goto again;
+ }
+
+ /* Lock the one semaphore we are interested in. */
+ spin_lock(&sem->lock);
+ } else {
+ /* Lock the sem_array for exclusive access. */
+ write_lock(&sma->sem_perm.lock);
+
+ /* Was sma->complex_count reset while we were spinning? */
+ if (unlikely(!sma->complex_count && nsops == 1 &&
+ sma->sem_nsems > 1)) {
+ write_unlock(&sma->sem_perm.lock);
+ goto again;
+ }
+ }
+}
+
+static inline void sem_unlock(struct sem_array *sma, struct sembuf *sops,
+ int nsops)
+{
+ if (nsops == 1 && sma->sem_nsems > 1 && !sma->complex_count) {
+ struct sem *sem = sma->sem_base + sops->sem_num;
+ spin_unlock(&sem->lock);
+ read_unlock(&sma->sem_perm.lock);
+ } else
+ write_unlock(&sma->sem_perm.lock);
+}
+
+/*
* sem_lock_(check_) routines are called in the paths where the rw_mutex
* is not held.
*/
-static inline struct sem_array *sem_obtain_lock(struct ipc_namespace *ns, int id)
+static inline struct sem_array *sem_obtain_lock(struct ipc_namespace *ns,
+ int id, struct sembuf *sops, int nsops)
{
struct kern_ipc_perm *ipcp;
+ struct sem_array *sma;
rcu_read_lock();
ipcp = ipc_obtain_object(&sem_ids(ns), id);
if (IS_ERR(ipcp))
goto err1;
- write_lock(&ipcp->lock);
+ sma = container_of(ipcp, struct sem_array, sem_perm);
+ sem_lock(sma, sops, nsops);
/* ipc_rmid() may have already freed the ID while write_lock
* was spinning: verify that the structure is still valid
@@ -211,9 +267,9 @@ static inline struct sem_array *sem_obtain_lock(struct ipc_namespace *ns, int id
if (ipcp->deleted)
goto err0;
- return container_of(ipcp, struct sem_array, sem_perm);
+ return sma;
err0:
- write_unlock(&ipcp->lock);
+ sem_unlock(sma, sops, nsops);
err1:
rcu_read_unlock();
return ERR_PTR(-EINVAL);
@@ -370,15 +426,17 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
sma->sem_base = (struct sem *) &sma[1];
- for (i = 0; i < nsems; i++)
+ for (i = 0; i < nsems; i++) {
+ spin_lock_init(&sma->sem_base[i].lock);
INIT_LIST_HEAD(&sma->sem_base[i].sem_pending);
+ }
sma->complex_count = 0;
INIT_LIST_HEAD(&sma->sem_pending);
INIT_LIST_HEAD(&sma->list_id);
sma->sem_nsems = nsems;
sma->sem_ctime = get_seconds();
- sem_unlock(sma);
+ sem_unlock(sma, NULL, -1);
return sma->sem_perm.id;
}
@@ -811,7 +869,7 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
/* Remove the semaphore set from the IDR */
sem_rmid(ns, sma);
- sem_unlock(sma);
+ sem_unlock(sma, NULL, -1);
wake_up_sem_queue_do(&tasks);
ns->used_sems -= sma->sem_nsems;
@@ -985,7 +1043,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
sem_lock_and_putref(sma);
if (sma->sem_perm.deleted) {
- sem_unlock(sma);
+ sem_unlock(sma, NULL, -1);
err = -EIDRM;
goto out_free;
}
@@ -994,7 +1052,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
write_lock(&sma->sem_perm.lock);
for (i = 0; i < sma->sem_nsems; i++)
sem_io[i] = sma->sem_base[i].semval;
- sem_unlock(sma);
+ sem_unlock(sma, NULL, -1);
err = 0;
if(copy_to_user(array, sem_io, nsems*sizeof(ushort)))
err = -EFAULT;
@@ -1031,7 +1089,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
}
sem_lock_and_putref(sma);
if (sma->sem_perm.deleted) {
- sem_unlock(sma);
+ sem_unlock(sma, NULL, -1);
err = -EIDRM;
goto out_free;
}
@@ -1094,7 +1152,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
}
out_unlock:
- sem_unlock(sma);
+ sem_unlock(sma, NULL, -1);
out_wakeup:
wake_up_sem_queue_do(&tasks);
out_free:
@@ -1179,7 +1237,7 @@ static int semctl_down(struct ipc_namespace *ns, int semid,
}
out_unlock:
- sem_unlock(sma);
+ sem_unlock(sma, NULL, -1);
out_up:
up_write(&sem_ids(ns).rw_mutex);
return err;
@@ -1336,7 +1394,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
/* step 3: Acquire the lock on semaphore array */
sem_lock_and_putref(sma);
if (sma->sem_perm.deleted) {
- sem_unlock(sma);
+ sem_unlock(sma, NULL, -1);
kfree(new);
un = ERR_PTR(-EIDRM);
goto out;
@@ -1363,7 +1421,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
success:
spin_unlock(&ulp->lock);
rcu_read_lock();
- sem_unlock(sma);
+ sem_unlock(sma, NULL, -1);
out:
return un;
}
@@ -1493,7 +1551,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
* "un" itself is guaranteed by rcu.
*/
error = -EIDRM;
- ipc_lock_object(&sma->sem_perm);
+ sem_lock(sma, sops, nsops);
if (un) {
if (un->semid == -1) {
rcu_read_unlock();
@@ -1550,7 +1608,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
queue.sleeper = current;
sleep_again:
- sem_unlock(sma);
+ sem_unlock(sma, sops, nsops);
current->state = TASK_INTERRUPTIBLE;
if (timeout)
@@ -1573,7 +1631,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
goto out_free;
}
- sma = sem_obtain_lock(ns, semid);
+ sma = sem_obtain_lock(ns, semid, sops, nsops);
/*
* Wait until it's guaranteed that no wakeup_sem_queue_do() is ongoing.
@@ -1612,7 +1670,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
unlink_queue(sma, &queue);
out_unlock_free:
- sem_unlock(sma);
+ sem_unlock(sma, sops, nsops);
out_wakeup:
wake_up_sem_queue_do(&tasks);
out_free:
@@ -1702,7 +1760,7 @@ void exit_sem(struct task_struct *tsk)
/* exit_sem raced with IPC_RMID+semget() that created
* exactly the same semid. Nothing to do.
*/
- sem_unlock(sma);
+ sem_unlock(sma, NULL, -1);
continue;
}
@@ -1741,7 +1799,7 @@ void exit_sem(struct task_struct *tsk)
/* maybe some queued-up processes were waiting for this */
INIT_LIST_HEAD(&tasks);
do_smart_update(sma, NULL, 0, 1, &tasks);
- sem_unlock(sma);
+ sem_unlock(sma, NULL, -1);
wake_up_sem_queue_do(&tasks);
kfree_rcu(un, rcu);
On Wed, Mar 6, 2013 at 5:15 PM, Rik van Riel <[email protected]> wrote:
>
> If the call is a semop manipulating just one semaphore in
> an array with multiple semaphores, the read/write lock for
> the semaphore array is taken in shared (read) mode, and the
> individual semaphore's lock is taken.
You know, we do something like this already elsewhere, and I think we
do it slightly better. See the mm_take_all_locks() logic in mm/mmap.c.
The optimal strategy if you have many items, and the *common* case is
that you need just one lock, is actually to only take that single lock
for the common case. No top-level lock at all.
Then, for the complex (but unlikely) case, you take a *separate*
top-level lock, and then you take the lower-level locks one by one,
while testing first if you already hold them (using a separate data
structure that is protected by the top-level lock).
This way, the common case only needs that single lock that protects
just that particular data structure.
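Applied to semaphores, that would look roughly like the following
(hypothetical sketch - complex_lock and the helper names are
illustrative, not from any patch in this thread):

/* Common case: no array-level lock at all, just the one semaphore. */
static void sem_lock_one(struct sem_array *sma, int semnum)
{
	spin_lock(&sma->sem_base[semnum].lock);
}

/* Rare, complex case: take a separate top-level lock, then take
 * the per-semaphore locks one by one. */
static void sem_lock_all(struct sem_array *sma)
{
	int i;

	spin_lock(&sma->complex_lock);
	for (i = 0; i < sma->sem_nsems; i++)
		spin_lock_nested(&sma->sem_base[i].lock,
				 SINGLE_DEPTH_NESTING);
}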
That said, judging by your numbers, your read-write lock seems to work
fine too, even though I'd worry about cacheline ping-pong (but not
contention) on the readers. So it doesn't seem optimal, but it sure as
hell seems better than what we do now ;)
Linus
On 03/06/2013 05:57 PM, Linus Torvalds wrote:
> On Wed, Mar 6, 2013 at 5:15 PM, Rik van Riel <[email protected]> wrote:
>>
>> If the call is a semop manipulating just one semaphore in
>> an array with multiple semaphores, the read/write lock for
>> the semaphore array is taken in shared (read) mode, and the
>> individual semaphore's lock is taken.
>
> You know, we do something like this already elsewhere, and I think we
> do it slightly better. See the mm_take_all_locks() logic in mm/mmap.c.
That would work. If we are about to do one of the uncommon operations, or
sma->complex_count is set, we need to take the outer lock and all of the
inner locks.
The only complication would be interactions with the non-semaphore code
in ipc/util.c, which manipulates the kern_ipc_perm structures, which are
part of the sem_array structure.
> That said, judging by your numbers, your read-write lock seems to work
> fine too, even though I'd worry about cacheline ping-pong (but not
> contention) on the readers. So it doesn't seem optimal, but it sure as
> hell seems better than what we do now ;)
I can take a stab at implementing the take_all_locks approach tomorrow.
If things turn out to be easier than I fear, I will send an updated
patch. If the resulting changes to the rest of ipc/ turn out to be
too ugly to live, the rwlock performance is likely to be good enough
for a while, and I'll just send an email without a patch :)
On Wed, 2013-03-06 at 17:15 -0500, Rik van Riel wrote:
> Introduce finer grained locking for semtimedop, to handle the
> common case of a program wanting to manipulate one semaphore
> from an array with multiple semaphores.
>
> Each semaphore array has a read/write lock. If something
> complex is going on (manipulation of the array, of multiple
> semaphores in one syscall, etc), the lock is taken in exclusive
> mode.
>
> If the call is a semop manipulating just one semaphore in
> an array with multiple semaphores, the read/write lock for
> the semaphore array is taken in shared (read) mode, and the
> individual semaphore's lock is taken.
>
> On a 24 CPU system, performance numbers with the semop-multi
> test with N threads and N semaphores look like this:
>
> threads    vanilla    Davidlohr's    Davidlohr's +
>                       patches        rwlock patches
>      10     610652        726325        1783589
>      20     341570        365699        1520453
>      30     288102        307037        1498167
>      40     290714        305955        1612665
>      50     288620        312890        1733453
>      60     289987        306043        1649360
>      70     291298        306347        1723167
>      80     290948        305662        1729545
>      90     290996        306680        1736021
>     100     292243        306700        1773700
Lovely numbers :)
On my laptop:
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 281430894, ops/sec 9381029
+ 20.87% a.out [kernel.kallsyms] [k] sys_semtimedop
+ 8.31% a.out [kernel.kallsyms] [k] ipc_has_perm.isra.21
+ 6.88% a.out [kernel.kallsyms] [k] _raw_read_lock
+ 6.78% a.out [kernel.kallsyms] [k] avc_has_perm_flags
+ 5.26% a.out [kernel.kallsyms] [k] ipcperms
+ 4.91% a.out [kernel.kallsyms] [k] ipc_obtain_object_check
+ 4.69% a.out [kernel.kallsyms] [k] __audit_syscall_exit
+ 4.21% a.out [kernel.kallsyms] [k] copy_user_enhanced_fast_string
+ 3.61% a.out [kernel.kallsyms] [k] _raw_spin_lock
+ 3.55% a.out [kernel.kallsyms] [k] system_call
+ 3.35% a.out [kernel.kallsyms] [k] do_smart_update
+ 2.77% a.out [kernel.kallsyms] [k] __audit_syscall_entry
But my 8 socket 160 CPU box sure isn't happy. I'm getting all sorts of
issues (sometimes it will boot, sometimes it won't). When it does, Linux
will hang as soon as I start my benchmarking:
BUG: soft lockup - CPU#77 stuck for 23s! [oracle:129877]
Modules linked in: fuse autofs4 sunrpc pcc_cpufreq ipv6 dm_mirror dm_region_hash dm_log dm_mod uinput iTCO_wdt iTCO_vendor_support sg freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode pcspkr lpc_ich mfd_core hpilo hpwdt i7core_edac edac_core netxen_nic ext4 mbcache jbd2 sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul hpsa radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core
CPU 77
Pid: 129877, comm: oracle Tainted: G D W 3.9.0-rc1+ #20 HP ProLiant DL980 G7
RIP: 0010:[<ffffffff812777fa>] [<ffffffff812777fa>] __read_lock_failed+0xa/0x20
RSP: 0018:ffff8b87b8cf9ca8 EFLAGS: 00000297
RAX: ffffc900293c1020 RBX: 000000010007a021 RCX: 000000000000d3a5
RDX: 0000000000000001 RSI: ffff8b87b8cf9d58 RDI: ffffc900293c1020
RBP: ffff8b87b8cf9ca8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b87b8cf9c68
R13: ffff8b87b8cf9c68 R14: 0000000000000286 R15: ffff8b87caf10100
FS: 00007f7a689b2700(0000) GS:ffff8987ff9c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc49426d000 CR3: 00000187cf08f000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process oracle (pid: 129877, threadinfo ffff8b87b8cf8000, task ffff8b87caf10100)
Stack:
ffff8b87b8cf9cb8 ffffffff8155f374 ffff8b87b8cf9ce8 ffffffff81205245
0000000000000001 0000000000090002 00007fff82d3aa08 0000000000000000
ffff8b87b8cf9f78 ffffffff812069e1 0000000000cbc000 ffff8b87b8cf9f38
Call Trace:
[<ffffffff8155f374>] _raw_read_lock+0x14/0x20
[<ffffffff81205245>] sem_lock+0x85/0xa0
[<ffffffff812069e1>] sys_semtimedop+0x521/0x7c0
[<ffffffff81089e2c>] ? task_sched_runtime+0x4c/0x90
[<ffffffff8101c1b3>] ? native_sched_clock+0x13/0x80
[<ffffffff8101b7b9>] ? sched_clock+0x9/0x10
[<ffffffff8108f9ed>] ? sched_clock_cpu+0xcd/0x110
[<ffffffff8108914b>] ? update_rq_clock+0x2b/0x50
[<ffffffff81089e2c>] ? task_sched_runtime+0x4c/0x90
[<ffffffff8108fe48>] ? thread_group_cputime+0x88/0xc0
[<ffffffff8108fd1d>] ? cputime_adjust+0x3d/0x90
[<ffffffff8108fece>] ? thread_group_cputime_adjusted+0x4e/0x60
[<ffffffff81568119>] system_call_fastpath+0x16/0x1b
Code: 90 55 48 89 e5 f0 ff 07 f3 90 83 3f 01 75 f9 f0 ff 0f 75 f1 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 f0 48 ff 07 f3 90 <48> 83 3f 01 78 f8 f0 48 ff 0f 78 ee 5d c3 90 90 90 90 90 90 90
> +static inline void sem_unlock(struct sem_array *sma, struct sembuf *sops,
> + int nsops)
> +{
> + if (nsops == 1 && sma->sem_nsems > 1 && !sma->complex_count) {
> + struct sem *sem = sma->sem_base + sops->sem_num;
> + spin_unlock(&sem->lock);
> + read_unlock(&sma->sem_perm.lock);
> + } else
> + write_unlock(&sma->sem_perm.lock);
> +}
I believe (haven't tested it yet) the issue could be in sem_unlock() as
we aren't unlocking the RCU read section - before it was just a trivial
ipc_unlock call.
The rest looks good but I will look more thoroughly tonight.
Thanks,
Davidlohr
On Tue, 2013-03-05 at 15:53 -0500, Rik van Riel wrote:
> Indeed. Though how well my patches will work with Oracle will
> depend a lot on what kind of semctl syscalls they are doing.
>
> Does Oracle typically do one semop per semctl syscall, or does
> it pass in a whole bunch at once?
https://oss.oracle.com/~mason/sembench.c
I think Chris wrote that to match a particular pattern of semaphore
operations the database engine in question does. I haven't checked to
see if it triggers the case in point though.
Also, Chris has since left Oracle, but maybe he knows who to poke.
On Thu, Mar 07, 2013 at 01:45:33AM -0700, Peter Zijlstra wrote:
> On Tue, 2013-03-05 at 15:53 -0500, Rik van Riel wrote:
>
> > Indeed. Though how well my patches will work with Oracle will
> > depend a lot on what kind of semctl syscalls they are doing.
> >
> > Does Oracle typically do one semop per semctl syscall, or does
> > it pass in a whole bunch at once?
>
> https://oss.oracle.com/~mason/sembench.c
>
> I think Chris wrote that to match a particular pattern of semaphore
> operations the database engine in question does. I haven't checked to
> see if it triggers the case in point though.
>
> Also, Chris has since left Oracle, but maybe he knows who to poke.
>
Dave Kleikamp (cc'd) took over my patches and did the most recent
benchmarking. Ported against 3.0:
https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commit;h=c7fa322dd72b08450a440ef800124705a1fa148c
The current versions are still in the 2.6.32 oracle kernel, but it looks
like they reverted this 3.0 commit. I think with Manfred's upstream
work my more complex approach wasn't required anymore, but hopefully
Dave can fill in details.
Here is some of the original discussion around the patch:
https://lkml.org/lkml/2010/4/12/257
In terms of how oracle uses IPC, the part that shows up in profiles is
using semtimedop for bulk wakeups. They can configure things to use
either a bunch of small arrays or a huge single array (and anything in
between).
There is one IPC semaphore per process and they use this to wait for
some event (like a log commit). When the event comes in, everyone
waiting is woken in bulk via a semtimedop call.
So, single proc waking many waiters at once.
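That pattern looks roughly like the following (illustrative userspace
sketch, not Oracle's actual code; error handling omitted):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Each waiter blocks on its own semaphore in a shared array. */
static void wait_for_event(int semid, unsigned short mysem)
{
	struct sembuf op = { .sem_num = mysem, .sem_op = -1, .sem_flg = 0 };

	semtimedop(semid, &op, 1, NULL);	/* NULL timeout: wait forever */
}

/* One process wakes every waiter with a single semtimedop() call. */
static void wake_all(int semid, unsigned short nwaiters)
{
	struct sembuf ops[nwaiters];
	unsigned short i;

	for (i = 0; i < nwaiters; i++) {
		ops[i].sem_num = i;
		ops[i].sem_op = 1;	/* +1 wakes the waiter on sem i */
		ops[i].sem_flg = 0;
	}
	semtimedop(semid, ops, nwaiters, NULL);	/* bulk wakeup */
}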
-chris
On 03/07/2013 06:55 AM, Chris Mason wrote:
> On Thu, Mar 07, 2013 at 01:45:33AM -0700, Peter Zijlstra wrote:
>> On Tue, 2013-03-05 at 15:53 -0500, Rik van Riel wrote:
>>
>>> Indeed. Though how well my patches will work with Oracle will
>>> depend a lot on what kind of semctl syscalls they are doing.
>>>
>>> Does Oracle typically do one semop per semctl syscall, or does
>>> it pass in a whole bunch at once?
>>
>> https://oss.oracle.com/~mason/sembench.c
>>
>> I think Chris wrote that to match a particular pattern of semaphore
>> operations the database engine in question does. I haven't checked to
>> see if it triggers the case in point though.
>>
>> Also, Chris has since left Oracle, but maybe he knows who to poke.
>>
>
> Dave Kleikamp (cc'd) took over my patches and did the most recent
> benchmarking. Ported against 3.0:
>
> https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commit;h=c7fa322dd72b08450a440ef800124705a1fa148c
>
> The current versions are still in the 2.6.32 oracle kernel, but it looks
> like they reverted this 3.0 commit. I think with Manfred's upstream
> work my more complex approach wasn't required anymore, but hopefully
> Dave can fill in details.
From what I recall, I could never get better performance from your
patches than we saw with Manfred's work alone. I can't remember the
reasons for including and then reverting the patches from the 3.0
(2.6.39) Oracle kernel, but in the end we weren't able to justify their
inclusion.
> Here is some of the original discussion around the patch:
>
> https://lkml.org/lkml/2010/4/12/257
>
> In terms of how oracle uses IPC, the part that shows up in profiles is
> using semtimedop for bulk wakeups. They can configure things to use
> either a bunch of small arrays or a huge single array (and anything in
> between).
>
> There is one IPC semaphore per process and they use this to wait for
> some event (like a log commit). When the event comes in, everyone
> waiting is woken in bulk via a semtimedop call.
>
> So, single proc waking many waiters at once.
>
> -chris
>
On Thu, Mar 07, 2013 at 08:54:55AM -0700, Dave Kleikamp wrote:
> On 03/07/2013 06:55 AM, Chris Mason wrote:
> > On Thu, Mar 07, 2013 at 01:45:33AM -0700, Peter Zijlstra wrote:
> >> On Tue, 2013-03-05 at 15:53 -0500, Rik van Riel wrote:
> >>
> >>> Indeed. Though how well my patches will work with Oracle will
> >>> depend a lot on what kind of semctl syscalls they are doing.
> >>>
> >>> Does Oracle typically do one semop per semctl syscall, or does
> >>> it pass in a whole bunch at once?
> >>
> >> https://oss.oracle.com/~mason/sembench.c
> >>
> >> I think Chris wrote that to match a particular pattern of semaphore
> >> operations the database engine in question does. I haven't checked to
> >> see if it triggers the case in point though.
> >>
> >> Also, Chris has since left Oracle, but maybe he knows who to poke.
> >>
> >
> > Dave Kleikamp (cc'd) took over my patches and did the most recent
> > benchmarking. Ported against 3.0:
> >
> > https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commit;h=c7fa322dd72b08450a440ef800124705a1fa148c
> >
> > The current versions are still in the 2.6.32 oracle kernel, but it looks
> > like they reverted this 3.0 commit. I think with Manfred's upstream
> > work my more complex approach wasn't required anymore, but hopefully
> > Dave can fill in details.
>
> From what I recall, I could never get better performance from your
> patches than we saw with Manfred's work alone. I can't remember the
> reasons for including and then reverting the patches from the 3.0
> (2.6.39) Oracle kernel, but in the end we weren't able to justify their
> inclusion.
Ok, so after this commit, oracle was happy:
commit fd5db42254518fbf241dc454e918598fbe494fa2
Author: Manfred Spraul <[email protected]>
Date: Wed May 26 14:43:40 2010 -0700
ipc/sem.c: optimize update_queue() for bulk wakeup calls
But that doesn't explain why Davidlohr saw semtimedop at the top of the
oracle profiles in his runs.
Looking through the patches in this thread, I don't see anything that
I'd expect to slow down oracle TPC numbers.
I dealt with the ipc_perm lock a little differently:
https://oss.oracle.com/git/?p=linux-uek-2.6.39.git;a=commitdiff;h=78fe45325c8e2e3f4b6ebb1ee15b6c2e8af5ddb1;hp=8102e1ff9d667661b581209323faaf7a84f0f528
My code switched the ipc_rcu_hdr refcount to an atomic, which changed
where I needed the spinlock. It may make things easier in patches 3/4
and 4/4.
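Illustratively, with an atomic refcount the getref path no longer needs
the object's spinlock at all (a sketch; the field placement is
hypothetical, not the actual commit):

static void ipc_rcu_getref(struct kern_ipc_perm *perm)
{
	/* Safe under rcu_read_lock() alone - no spinlock taken. */
	atomic_inc(&perm->refcount);
}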
(some of this code was Jens, but at the time he made me promise to
pretend he never touched it)
-chris
On 03/06/2013 08:36 PM, Davidlohr Bueso wrote:
> But my 8 socket 160 CPU box sure isn't happy. I'm getting all sorts of
> issues (sometimes it will boot, sometimes it won't). When it does, Linux
> will hang as soon as I start my benchmarking:
I think I may have found the cause of this bug.
It looks like it would be possible for a caller with a single
semop to come into semtimedop, see that sma->complex_count is
elevated, and take the write lock.
Further down in semtimedop, that caller can call do_smart_update,
which calls update_queue, which in turn calls unlink_queue.
The function unlink_queue can then decrement sma->complex_count.
In other words, we correctly grab the write lock when changing
sma->complex_count. However, we do not remember that we took
the write lock, and call read_unlock when we leave semtimedop.
The next caller will get stuck.
To fix this bug, we need to remember whether or not we took
the lock in exclusive mode, and unlock it the right way.
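For instance (an illustrative sketch of that fix, not a tested patch),
sem_lock() could return the mode it actually took, and the caller would
hand that value back to sem_unlock() instead of recomputing it from
sma->complex_count:

/* Returns 1 if the array lock was taken exclusively, 0 if shared. */
static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
			   int nsops)
{
	if (nsops == 1 && sma->sem_nsems > 1 && !sma->complex_count) {
		read_lock(&sma->sem_perm.lock);
		if (likely(!sma->complex_count)) {
			spin_lock(&sma->sem_base[sops->sem_num].lock);
			return 0;
		}
		/* complex_count was set while we spun; fall back. */
		read_unlock(&sma->sem_perm.lock);
	}
	write_lock(&sma->sem_perm.lock);
	return 1;
}

static inline void sem_unlock(struct sem_array *sma, struct sembuf *sops,
			      int exclusive)
{
	if (exclusive) {
		write_unlock(&sma->sem_perm.lock);
	} else {
		spin_unlock(&sma->sem_base[sops->sem_num].lock);
		read_unlock(&sma->sem_perm.lock);
	}
}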
Now that I have figured this bug out, it should be easy to
re-implement things the way Linus suggested, subject to the
same bug fixes :)
> BUG: soft lockup - CPU#77 stuck for 23s! [oracle:129877]
> Modules linked in: fuse autofs4 sunrpc pcc_cpufreq ipv6 dm_mirror dm_region_hash dm_log dm_mod uinput iTCO_wdt iTCO_vendor_support sg freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode pcspkr lpc_ich mfd_core hpilo hpwdt i7core_edac edac_core netxen_nic ext4 mbcache jbd2 sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul hpsa radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core
> CPU 77
> Pid: 129877, comm: oracle Tainted: G D W 3.9.0-rc1+ #20 HP ProLiant DL980 G7
> RIP: 0010:[<ffffffff812777fa>] [<ffffffff812777fa>] __read_lock_failed+0xa/0x20
> RSP: 0018:ffff8b87b8cf9ca8 EFLAGS: 00000297
> RAX: ffffc900293c1020 RBX: 000000010007a021 RCX: 000000000000d3a5
> RDX: 0000000000000001 RSI: ffff8b87b8cf9d58 RDI: ffffc900293c1020
> RBP: ffff8b87b8cf9ca8 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b87b8cf9c68
> R13: ffff8b87b8cf9c68 R14: 0000000000000286 R15: ffff8b87caf10100
> FS: 00007f7a689b2700(0000) GS:ffff8987ff9c0000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fc49426d000 CR3: 00000187cf08f000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process oracle (pid: 129877, threadinfo ffff8b87b8cf8000, task ffff8b87caf10100)
> Stack:
> ffff8b87b8cf9cb8 ffffffff8155f374 ffff8b87b8cf9ce8 ffffffff81205245
> 0000000000000001 0000000000090002 00007fff82d3aa08 0000000000000000
> ffff8b87b8cf9f78 ffffffff812069e1 0000000000cbc000 ffff8b87b8cf9f38
> Call Trace:
> [<ffffffff8155f374>] _raw_read_lock+0x14/0x20
> [<ffffffff81205245>] sem_lock+0x85/0xa0
> [<ffffffff812069e1>] sys_semtimedop+0x521/0x7c0
> [<ffffffff81089e2c>] ? task_sched_runtime+0x4c/0x90
> [<ffffffff8101c1b3>] ? native_sched_clock+0x13/0x80
> [<ffffffff8101b7b9>] ? sched_clock+0x9/0x10
> [<ffffffff8108f9ed>] ? sched_clock_cpu+0xcd/0x110
> [<ffffffff8108914b>] ? update_rq_clock+0x2b/0x50
> [<ffffffff81089e2c>] ? task_sched_runtime+0x4c/0x90
> [<ffffffff8108fe48>] ? thread_group_cputime+0x88/0xc0
> [<ffffffff8108fd1d>] ? cputime_adjust+0x3d/0x90
> [<ffffffff8108fece>] ? thread_group_cputime_adjusted+0x4e/0x60
> [<ffffffff81568119>] system_call_fastpath+0x16/0x1b
> Code: 90 55 48 89 e5 f0 ff 07 f3 90 83 3f 01 75 f9 f0 ff 0f 75 f1 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 f0 48 ff 07 f3 90 <48> 83 3f 01 78 f8 f0 48 ff 0f 78 ee 5d c3 90 90 90 90 90 90 90
>
>
>> +static inline void sem_unlock(struct sem_array *sma, struct sembuf *sops,
>> + int nsops)
>> +{
>> + if (nsops == 1 && sma->sem_nsems > 1 && !sma->complex_count) {
>> + struct sem *sem = sma->sem_base + sops->sem_num;
>> + spin_unlock(&sem->lock);
>> + read_unlock(&sma->sem_perm.lock);
>> + } else
>> + write_unlock(&sma->sem_perm.lock);
>> +}
>
> I believe (haven't tested it yet) the issue could be in sem_unlock() as
> we aren't unlocking the RCU read section - before it was just a trivial
> ipc_unlock call.
That is an additional bug. You are right that sem_unlock does need
to call rcu_read_unlock() and that my patches fail to do so.
Good spotting.
--
All rights reversed