2011-05-15 14:34:23

by Török Edwin

Subject: rw_semaphore down_write a lot faster if wrapped by mutex ?!

Hi semaphore/mutex maintainers,

Looks like rw_semaphore's down_write is not as efficient as it could be.
It can have a latency in the milliseconds range, but if I wrap it in yet
another mutex then it becomes faster (100 us range).

One difference I noticed between the rwsem and the mutex is that the mutex
code does optimistic spinning. But adding something similar to the
rwsem code didn't improve timings (it made things worse).
My guess is that this has something to do with excessive scheduler
ping-pong (spurious wakeups, scheduling a task that won't be able to
take the semaphore, etc.), but I'm not sure which tools are best to
confirm or rule this out. perf sched / perf lock / ftrace?

Also, the huge slowdowns only happen if I trigger a pagefault in the
just-mapped area: if I remove the ' *((volatile char*)h) = 0;' line from
mmapsem.c, then mmap() time is back in the 50us range.
(And using MAP_POPULATE is even worse, presumably due to zero-filling,
but even with MAP_POPULATE the mutex helps.)

First some background: this all started out when I was investigating why
mmap()/munmap() is still faster in ClamAV when it is wrapped with a
pthread mutex. Initially the reason was that mmap_sem was held during
disk I/O, but that's supposedly been fixed, and ClamAV only uses anon
mmap + pread now anyway.

So I wrote the attached microbenchmark to illustrate the latency
difference. Note that in a real app (ClamAV), the difference is not as
large, only ~5-10%.
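
Each benchmark iteration is roughly the following (a simplified sketch
of what mmapsem.c does, not the attached file itself; MAP_SIZE and the
function name are made up for illustration):

#include <stdio.h>
#include <sys/mman.h>

#define MAP_SIZE (128 * 1024)

/* One iteration: map an anonymous region, touch the first page to
 * force a pagefault while mmap_sem is contended by other threads,
 * then unmap. The [spinlock]/[mutex] variants wrap this whole
 * sequence in a userspace lock. */
static void one_iteration(void)
{
	char *h = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (h == MAP_FAILED) {
		perror("mmap");
		return;
	}
	*((volatile char *)h) = 0;	/* the pagefault mentioned above */
	munmap(h, MAP_SIZE);
}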

Yield Time: 0.002225s, Latency: 0.222500us
Mmap Time [nolock]: 21.647090s, Latency: 2164.709000us
Mmap Time [spinlock]: 0.649472s, Latency: 64.947200us
Mmap Time [mutex]: 0.720323s, Latency: 72.032300us

The difference is huge: switching between threads takes <1us and
context switching between processes takes ~2us, so I don't know what
rw_sem is doing that takes ~2ms!

To track the problem further I patched the kernel slightly, wrapping
down_write/up_write in a regular mutex (in a hackish way; this should be
a per-process mutex, not a global one), see the attached patch.
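
The hack is along these lines (a sketch of the experiment, not the
attached mmapsem_mutex.patch itself; the mutex and helper names are
made up):

#include <linux/mutex.h>
#include <linux/rwsem.h>

/* Serialize all mmap_sem writers behind one extra mutex, so the rwsem
 * itself never sees more than one contending writer at a time. */
static DEFINE_MUTEX(mmap_write_hack_mutex);

static inline void hack_down_write(struct rw_semaphore *sem)
{
	mutex_lock(&mmap_write_hack_mutex);
	down_write(sem);
}

static inline void hack_up_write(struct rw_semaphore *sem)
{
	up_write(sem);
	mutex_unlock(&mmap_write_hack_mutex);
}
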
Sure enough mmap() improved now:
Yield Time: 0.002289s, Latency: 0.228900us
Mmap Time [nolock]: 1.000317s, Latency: 100.031700us
Mmap Time [spinlock]: 0.618873s, Latency: 61.887300us
Mmap Time [mutex]: 0.739471s, Latency: 73.947100us

Of course the attached patch is not a solution, it is just a test. The
nolock case is now very close to the userspace-locking versions; the
remaining slowdown is due to the double locking.
I could write a patch that adds a mutex to rwsem and wraps all writes
with it, but I'd rather see the rwsem code fixed / optimized.

The .config I used for testing is attached.

Best regards,
--Edwin


Attachments:
mmapsem.c (2.33 kB)
mmapsem_mutex.patch (1.04 kB)
.config (52.52 kB)

2011-05-15 15:30:20

by Török Edwin

Subject: Re: rw_semaphore down_write a lot faster if wrapped by mutex ?!

On 05/15/2011 05:34 PM, Török Edwin wrote:
> Hi semaphore/mutex maintainers,
>
> Looks like rw_semaphore's down_write is not as efficient as it could be.
> It can have a latency in the milliseconds range, but if I wrap it in yet
> another mutex then it becomes faster (100 us range).
>
> One difference I noticed between the rwsem and the mutex is that the mutex
> code does optimistic spinning. But adding something similar to the
> rwsem code didn't improve timings (it made things worse).
> My guess is that this has something to do with excessive scheduler
> ping-pong (spurious wakeups, scheduling a task that won't be able to
> take the semaphore, etc.), but I'm not sure which tools are best to
> confirm or rule this out. perf sched / perf lock / ftrace?

Hmm, with the added mutex the reader side of mmap_sem only sees one
contending locker at a time (the rest of the write-side contention is
hidden by the mutex), so this might give the readers a better chance to
run, even in the face of heavy write-side contention.
The up_write will see there are no more writers and will always wake the
readers, whereas without the mutex it would wake the other writer.

Perhaps rw_semaphore should have a flag to prefer waking readers over
writers, or take the number of waiting readers into account when choosing
between waking a reader and a writer.

Waking a writer will cause additional latency, because more readers will
go to sleep:

  latency = (enqueued_readers / enqueued_writers) *
            (avg_write_hold_time + context_switch_time)

Whereas waking (all) the readers will delay the writer only by:

  latency = avg_reader_hold_time + context_switch_time

If the semaphore code could (approximately) measure these, then maybe it
could make a better choice for future lock requests based on (recent)
lock-contention history.
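
As a rough illustration (not actual rwsem code; the stats structure,
its fields and the helper are made up, and assume the semaphore already
tracked these numbers), the wakeup path could compare the two estimates:

struct rwsem_stats {
	unsigned long enqueued_readers;
	unsigned long enqueued_writers;
	unsigned long avg_write_hold_ns;
	unsigned long avg_read_hold_ns;
	unsigned long ctx_switch_ns;
};

/* Return nonzero if waking all waiting readers is estimated to add
 * less latency than waking the next writer. */
static int prefer_waking_readers(const struct rwsem_stats *s)
{
	unsigned long wake_writer_cost, wake_readers_cost;

	if (!s->enqueued_writers)
		return 1;	/* no writer to wake anyway */

	/* Waking a writer: the queued readers keep sleeping for roughly
	 * one writer hold time per queued writer ahead of them. */
	wake_writer_cost = (s->enqueued_readers / s->enqueued_writers) *
			   (s->avg_write_hold_ns + s->ctx_switch_ns);

	/* Waking (all) readers: the writer is delayed by about one
	 * reader hold time, since the readers run in parallel. */
	wake_readers_cost = s->avg_read_hold_ns + s->ctx_switch_ns;

	return wake_readers_cost <= wake_writer_cost;
}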

Best regards,
--Edwin