Subject: [RFC][PATCH 3/4] (Refcount + Waitqueue) implementation for cpu_hotplug "locking"


--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"


Attachment: 3of4.patch (8.01 kB)

From: Ingo Molnar
Date: 2006-08-24 11:22:00
Subject: Re: [RFC][PATCH 3/4] (Refcount + Waitqueue) implementation for cpu_hotplug "locking"


* Gautham R Shenoy <[email protected]> wrote:

> void lock_cpu_hotplug(void)
> {

> + DECLARE_WAITQUEUE(wait, current);
> + spin_lock(&cpu_hotplug.lock);
> + cpu_hotplug.reader_count++;

this should be per-CPU - lock_cpu_hotplug() should _not_ be a globally
synchronized event.

CPU removal is such a rare event that we can easily do something like a
global read-mostly 'CPU is locked for writes' flag (plus a completion
queue) that the 'write' side takes atomically - combined with per-CPU
refcount and a waitqueue that the read side increases/decreases and
wakes. Read-locking of the CPU is much more common and should be
fundamentally scalable: it should increase the per-CPU refcount, then
check the global 'writer active' flag, and if the writer flag is set, it
should wait on the global completion queue. When a reader drops the
refcount it should wake up the per-CPU waitqueue. [in which a writer
might be waiting for the refcount to go down to 0.]
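
Roughly something like this sketch (all names here are made up for
illustration; memory-barrier and preemption details are glossed over,
and the per-cpu waitqueue heads would need init_waitqueue_head() at boot):

/* sketch only: illustrative names, not from the patch */
static DEFINE_PER_CPU(int, reader_refcount);
static DEFINE_PER_CPU(wait_queue_head_t, reader_wq);
static int writer_active;			/* read-mostly flag */
static DECLARE_WAIT_QUEUE_HEAD(writer_done_wq);

void lock_cpu_hotplug(void)			/* read side, common */
{
	int *cnt;

	for (;;) {
		cnt = &get_cpu_var(reader_refcount);
		(*cnt)++;			/* per-CPU: no global cacheline */
		smp_mb();			/* order increment vs. flag check */
		if (!writer_active) {
			put_cpu_var(reader_refcount);
			return;
		}
		/* writer active: back out, wake it, then wait for it */
		if (--(*cnt) == 0)
			wake_up(&__get_cpu_var(reader_wq));
		put_cpu_var(reader_refcount);
		wait_event(writer_done_wq, !writer_active);
	}
}

void unlock_cpu_hotplug(void)
{
	int *cnt = &get_cpu_var(reader_refcount);

	if (--(*cnt) == 0)			/* a writer may wait for this cpu */
		wake_up(&__get_cpu_var(reader_wq));
	put_cpu_var(reader_refcount);
}

void lock_cpu_hotplug_writer(void)		/* write side, rare */
{
	int cpu;

	writer_active = 1;
	smp_mb();				/* flag visible before the scan */
	for_each_possible_cpu(cpu)
		wait_event(per_cpu(reader_wq, cpu),
			   per_cpu(reader_refcount, cpu) == 0);
}

void unlock_cpu_hotplug_writer(void)
{
	writer_active = 0;
	wake_up_all(&writer_done_wq);		/* release blocked readers */
}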

Ingo

From: Gautham R Shenoy
Subject: Re: [RFC][PATCH 3/4] (Refcount + Waitqueue) implementation for cpu_hotplug "locking"

On Thu, Aug 24, 2006 at 01:14:40PM +0200, Ingo Molnar wrote:
>
> * Gautham R Shenoy <[email protected]> wrote:
>
> > void lock_cpu_hotplug(void)
> > {
>
> > + DECLARE_WAITQUEUE(wait, current);
> > + spin_lock(&cpu_hotplug.lock);
> > + cpu_hotplug.reader_count++;
>
> this should be per-CPU - lock_cpu_hotplug() should _not_ be a globally
> synchronized event.
> CPU removal is such a rare event that we can easily do something like a
> global read-mostly 'CPU is locked for writes' flag (plus a completion
> queue) that the 'write' side takes atomically - combined with per-CPU
> refcount and a waitqueue that the read side increases/decreases and
> wakes. Read-locking of the CPU is much more common and should be
> fundamentally scalable: it should increase the per-CPU refcount, then
> check the global 'writer active' flag, and if the writer flag is set, it
> should wait on the global completion queue. When a reader drops the
> refcount it should wake up the per-CPU waitqueue. [in which a writer
> might be waiting for the refcount to go down to 0.]
This was the approach I tried in order to make it cache-friendly.
These are the problems I faced.

- A reader checks the write_active flag. If it is set, he waits in the global
read queue; else, he takes the lock and increments the per-cpu refcount.

- The writer would have to check each cpu's read refcount, and ensure that
the read refcount == 0 on all of them, before he sets write_active and
begins a write operation (see the sketch after this list).
This creates a big race window: while the writer is checking
the refcount on cpu(j), a reader comes in on cpu(i), where i < j
(assume the writer checks refcounts in increasing order of cpus).
Should the reader on cpu(i) wait or go ahead? If we use a global
lock to serialize this operation, the whole purpose of maintaining
per-cpu data is lost.

- If the reader decides to wait on cpu(i) and the writer on cpu(j+1)
finds refcount != 0, do we have both a reader and a writer waiting?
Or should the writer perform some sort of rollback, where he wakes up
the readers on all cpus i < j+1?

- When a reader is done, he decrements his per-cpu refcount. But a
per-cpu refcount of 0 does not mean there are no active readers in the
system. So the reader, too, would have to check each cpu's refcount
before waking up the writer in his queue. This would mean
referencing other cpus' data.

- How do we deal with a reader who takes the lock on cpu(i) but has been
migrated to cpu(j) by the time of the unlock? Again, we have to
cross-reference other cpus' data.
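
To make the race concrete, the writer-side scan I was attempting looked
roughly like this (illustrative names again, and simplified to a trylock
to keep it short):

/* illustrative only: the racy writer-side scan described above */
int lock_cpu_hotplug_writer(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		if (per_cpu(reader_refcount, cpu) != 0)
			return -EBUSY;	/* an active reader somewhere */
		/*
		 * RACE WINDOW: a new reader can increment the
		 * refcount of an already-checked cpu(i) right here,
		 * while we move on to check cpu(i+1), cpu(i+2), ...
		 */
	}
	writer_active = 1;	/* too late: a reader may have slipped in */
	return 0;
}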

I tried and gave up. But I would love to have this whole thing
implemented in a more cache-friendly manner if we can.

Thanks and Regards
ego
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

From: Ingo Molnar
Date: 2006-08-24 12:32:43
Subject: Re: [RFC][PATCH 3/4] (Refcount + Waitqueue) implementation for cpu_hotplug "locking"


* Gautham R Shenoy <[email protected]> wrote:

> This was the approach I tried in order to make it cache-friendly.
> These are the problems I faced.
>
> - A reader checks the write_active flag. If it is set, he waits in the global
> read queue; else, he takes the lock and increments the per-cpu refcount.
>
> - The writer would have to check each cpu's read refcount, and ensure that
> the read refcount == 0 on all of them, before he sets write_active and
> begins a write operation.
> This creates a big race window: while the writer is checking
> the refcount on cpu(j), a reader comes in on cpu(i), where i < j
> (assume the writer checks refcounts in increasing order of cpus).
> Should the reader on cpu(i) wait or go ahead? If we use a global
> lock to serialize this operation, the whole purpose of maintaining
> per-cpu data is lost.

no. The writer first sets the global write_active flag, and _then_ goes
on to wait for all readers (if any) to get out of their critical
sections. (That's the purpose of the per-cpu waitqueue that readers use
to wake up a writer waiting for the refcount to go to 0.)

can you still see problems with this scheme?

(the 'write_active' flag is probably best implemented as a mutex, where
readers check mutex_is_locked(), and writers try to take it.)
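
i.e. roughly this sketch (illustrative only, reusing the made-up per-cpu
names from the earlier sketch; unlock_cpu_hotplug() stays as before):

static DEFINE_MUTEX(cpu_hotplug_mutex);	/* the 'write_active' flag */

void lock_cpu_hotplug(void)			/* read side */
{
	for (;;) {
		get_cpu_var(reader_refcount)++;
		if (!mutex_is_locked(&cpu_hotplug_mutex)) {
			put_cpu_var(reader_refcount);
			return;
		}
		/* a writer holds the mutex: back out, wake it, wait */
		if (--__get_cpu_var(reader_refcount) == 0)
			wake_up(&__get_cpu_var(reader_wq));
		put_cpu_var(reader_refcount);
		mutex_lock(&cpu_hotplug_mutex);	/* sleeps until writer done */
		mutex_unlock(&cpu_hotplug_mutex);
	}
}

void lock_cpu_hotplug_writer(void)		/* write side */
{
	int cpu;

	mutex_lock(&cpu_hotplug_mutex);		/* step 1: set 'writer active' */
	for_each_possible_cpu(cpu)		/* step 2: wait for readers to drain */
		wait_event(per_cpu(reader_wq, cpu),
			   per_cpu(reader_refcount, cpu) == 0);
}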

Ingo

From: Srivatsa Vaddagiri
Date: 2006-08-24 12:58:54
Subject: Re: [RFC][PATCH 3/4] (Refcount + Waitqueue) implementation for cpu_hotplug "locking"

On Thu, Aug 24, 2006 at 02:25:27PM +0200, Ingo Molnar wrote:
> no. The writer first sets the global write_active flag, and _then_ goes
> on to wait for all readers (if any) to get out of their critical
> sections. (That's the purpose of the per-cpu waitqueue that readers use
> to wake up a writer waiting for the refcount to go to 0.)
>
> can you still see problems with this scheme?

This can cause a deadlock sometimes, when a thread tries to take the
read_lock() recursively, with a writer having come in between the two
recursive reads:

Reader1 on CPU0                    Writer1 on CPU1

read_lock() - success
                                   write_lock() - blocks on Reader1
                                   (writer_active = 1)

read_lock() - blocks on Writer1

The only way to avoid this deadlock is to either keep track of a
cpu_hp_lock_count per-task (like the preemption count kept per-task)
or allow read_lock() to succeed if reader_count > 1 (even if
writer_active = 1). The latter makes the lock unduly biased towards
readers.
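
The per-task variant would look something like this (sketch only:
cpu_hp_lock_count is a hypothetical new task_struct field, the globals
are the illustrative names from earlier in the thread, and the
flag-vs-refcount races are glossed over):

void lock_cpu_hotplug(void)
{
	/* recursive read: the task already holds a read lock,
	 * so no writer can have completed write_lock() yet */
	if (current->cpu_hp_lock_count++ > 0)
		return;

	/* outermost read: safe to block while a writer is active */
	wait_event(writer_done_wq, !writer_active);
	get_cpu_var(reader_refcount)++;
	put_cpu_var(reader_refcount);
}

void unlock_cpu_hotplug(void)
{
	if (--current->cpu_hp_lock_count > 0)
		return;		/* still inside an outer read section */

	if (--get_cpu_var(reader_refcount) == 0)
		wake_up(&__get_cpu_var(reader_wq));
	put_cpu_var(reader_refcount);
}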


--
Regards,
vatsa

From: Gautham R Shenoy
Subject: Re: [RFC][PATCH 3/4] (Refcount + Waitqueue) implementation for cpu_hotplug "locking"

On Thu, Aug 24, 2006 at 06:28:14PM +0530, Srivatsa Vaddagiri wrote:
> On Thu, Aug 24, 2006 at 02:25:27PM +0200, Ingo Molnar wrote:
> > no. The writer first sets the global write_active flag, and _then_ goes
> > on to wait for all readers (if any) to get out of their critical
> > sections. (That's the purpose of the per-cpu waitqueue that readers use
> > to wake up a writer waiting for the refcount to go to 0.)
> >
> > can you still see problems with this scheme?
>
> This can cause a deadlock sometimes, when a thread tries to take the
> read_lock() recursively, with a writer having come in between the two
> recursive reads:
>
> Reader1 on CPU0                    Writer1 on CPU1
>
> read_lock() - success
>                                    write_lock() - blocks on Reader1
>                                    (writer_active = 1)
>
> read_lock() - blocks on Writer1
>
> The only way to avoid this deadlock is to either keep track of a
> cpu_hp_lock_count per-task (like the preemption count kept per-task)
> or allow read_lock() to succeed if reader_count > 1 (even if
> writer_active = 1). The latter makes the lock unduly biased towards
> readers.

The reason why recursive read-side locking works in the patches I posted is
the fact that the _locking_is_unfair_. This means that even when a writer is
waiting, if there are readers in the system, a new reader will go ahead.

I can try incorporating this unfair model into Ingo's suggestion
as follows (see the sketch after this list):
- A writer, on arrival, sets the global flag to writer_waiting.
- A reader on cpuX checks whether the global flag == writer_waiting. If yes
and percpu(refcount) == 0, the reader blocks. If percpu(refcount) != 0,
the reader increments it and goes ahead, because there are readers
in the system.
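
A sketch of that reader path (global_flag and WRITER_WAITING are made-up
names for this example, the rest are the illustrative names from earlier):

void lock_cpu_hotplug(void)			/* unfair read side */
{
	int *cnt;

	for (;;) {
		cnt = &get_cpu_var(reader_refcount);
		if (*cnt != 0 || global_flag != WRITER_WAITING) {
			(*cnt)++;	/* readers present here: go ahead */
			put_cpu_var(reader_refcount);
			return;
		}
		/* no reader on this cpu: defer to the waiting writer */
		put_cpu_var(reader_refcount);
		wait_event(writer_done_wq, global_flag != WRITER_WAITING);
	}
}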

This would work, if not for task migration. It may so happen that
a task has already taken a read lock on cpuX and gets migrated to cpuY,
where percpu(refcount) == 0. Now a writer arrives and sets the global flag.
The reader then tries to take a recursive read lock and gets blocked, since
percpu(refcount) on cpuY is 0.

Ingo, I am wondering if lockdep would be of some help here.
Since lockdep already checks for recursive reads, can I use it in
the above case and allow the new reader only if it is recursive?
I don't like the idea of explicitly checking for recursiveness
in the locking scheme; it's just that I can't think of a better way now.

Thanks and Regards
ego
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

From: Nick Piggin
Date: 2006-08-25 06:20:06
Subject: Re: [RFC][PATCH 3/4] (Refcount + Waitqueue) implementation for cpu_hotplug "locking"

Gautham R Shenoy wrote:
> On Thu, Aug 24, 2006 at 06:28:14PM +0530, Srivatsa Vaddagiri wrote:
>
>>On Thu, Aug 24, 2006 at 02:25:27PM +0200, Ingo Molnar wrote:
>>
>>>no. The writer first sets the global write_active flag, and _then_ goes
>>>on to wait for all readers (if any) to get out of their critical
>>>sections. (That's the purpose of the per-cpu waitqueue that readers use
>>>to wake up a writer waiting for the refcount to go to 0.)
>>>
>>>can you still see problems with this scheme?
>>
>>This can cause a deadlock sometimes, when a thread tries to take the
>>read_lock() recursively, with a writer having come in between the two
>>recursive reads:
>>
>> Reader1 on CPU0                    Writer1 on CPU1
>>
>> read_lock() - success
>>                                    write_lock() - blocks on Reader1
>>                                    (writer_active = 1)
>>
>> read_lock() - blocks on Writer1
>>
>>The only way to avoid this deadlock is to either keep track of a
>>cpu_hp_lock_count per-task (like the preemption count kept per-task)
>>or allow read_lock() to succeed if reader_count > 1 (even if
>>writer_active = 1). The latter makes the lock unduly biased towards
>>readers.
>
>
> The reason why recursive read-side locking works in the patches I posted is
> the fact that the _locking_is_unfair_. This means that even when a writer is
> waiting, if there are readers in the system, a new reader will go ahead.
>
> I can try incorporating this unfair model into Ingo's suggestion
> as follows:
> - A writer, on arrival, sets the global flag to writer_waiting.
> - A reader on cpuX checks whether the global flag == writer_waiting. If yes
> and percpu(refcount) == 0, the reader blocks. If percpu(refcount) != 0,
> the reader increments it and goes ahead, because there are readers
> in the system.
>
> This would work, if not for task migration. It may so happen that
> a task has already taken a read lock on cpuX and gets migrated to cpuY,
> where percpu(refcount) == 0. Now a writer arrives and sets the global flag.
> The reader then tries to take a recursive read lock and gets blocked, since
> percpu(refcount) on cpuY is 0.

This could easily block hotplug forever though, if you have lots of
tasks in the system.

>
> Ingo, I am wondering if lockdep would be of some help here.
> Since lockdep already checks for recursive reads, can I use it in
> the above case and allow the new reader only if it is recursive?
> I don't like the idea of explicitly checking for recursiveness
> in the locking scheme; it's just that I can't think of a better way now.

Well you would just have a depth count in the task_struct... in fact that
could *be* the read lock (i.e. the writer traverses all tasks instead of all
per-CPU locks), and would save a cacheline in the read path...

But I think the point was that we didn't want to add yet another field
to task_struct. Maybe it is acceptable... one day it will be like
page_flags though ;)
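
Something like this sketch, say (cpu_hp_lock_count again being the
hypothetical task_struct field, writer_active/writer_done_wq the
illustrative globals from earlier; the writer just polls here for
brevity, where a real implementation would use a wakeup handshake):

void lock_cpu_hotplug(void)	/* the per-task count *is* the read lock */
{
	if (current->cpu_hp_lock_count++ > 0)
		return;				/* recursion: always allowed */
	while (writer_active) {			/* outermost read defers */
		current->cpu_hp_lock_count--;
		wait_event(writer_done_wq, !writer_active);
		current->cpu_hp_lock_count++;
	}
}

static int any_reader_active(void)
{
	struct task_struct *g, *p;
	int ret = 0;

	read_lock(&tasklist_lock);
	do_each_thread(g, p) {
		if (p->cpu_hp_lock_count) {
			ret = 1;
			goto out;	/* 'break' would only exit the inner loop */
		}
	} while_each_thread(g, p);
out:
	read_unlock(&tasklist_lock);
	return ret;
}

void lock_cpu_hotplug_writer(void)
{
	writer_active = 1;
	smp_mb();
	while (any_reader_active())		/* walk tasks, not cpus */
		schedule_timeout_uninterruptible(1);
}

void unlock_cpu_hotplug_writer(void)
{
	writer_active = 0;
	wake_up_all(&writer_done_wq);
}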


--
SUSE Labs, Novell Inc.

From: Srivatsa Vaddagiri
Date: 2006-08-25 06:30:29
Subject: Re: [RFC][PATCH 3/4] (Refcount + Waitqueue) implementation for cpu_hotplug "locking"

On Fri, Aug 25, 2006 at 04:19:29PM +1000, Nick Piggin wrote:
> Well you would just have a depth count in the task_struct...

That (if we can have it) would make life so easy :)

--
Regards,
vatsa