2008-03-12 16:17:41

by Peter Teoh

Subject: per cpu + spin locks coexistence?

Help me out this one - in fs/file.c, there is a function free_fdtable_rcu():

void free_fdtable_rcu(struct rcu_head *rcu)
{
	struct fdtable *fdt = container_of(rcu, struct fdtable, rcu);
	struct fdtable_defer *fddef;

	BUG_ON(!fdt);

	if (fdt->max_fds <= NR_OPEN_DEFAULT) {
		/*
		 * This fdtable is embedded in the files structure and that
		 * structure itself is getting destroyed.
		 */
		kmem_cache_free(files_cachep,
				container_of(fdt, struct files_struct, fdtab));
		return;
	}
	if (fdt->max_fds <= (PAGE_SIZE / sizeof(struct file *))) {
		kfree(fdt->fd);
		kfree(fdt->open_fds);
		kfree(fdt);
	} else {
		fddef = &get_cpu_var(fdtable_defer_list);
		spin_lock(&fddef->lock);
		fdt->next = fddef->next;
		fddef->next = fdt;
		/* vmallocs are handled from the workqueue context */
		schedule_work(&fddef->wq);
		spin_unlock(&fddef->lock);
		put_cpu_var(fdtable_defer_list);
	}
}

Notice above that get_cpu_var() is followed by spin_lock(). Does this
make sense? get_cpu_var() will return a variable that is only
accessible by the current CPU - guaranteed it will not be touched (read or
written) by another CPU, right? So why do we need to spin_lock() it?

Thanks.

--
Regards,
Peter Teoh


2008-03-14 17:56:33

by Jeremy Fitzhardinge

Subject: Re: per cpu + spin locks coexistence?

Peter Teoh wrote:
> Help me out this one - in fs/file.c, there is a function free_fdtable_rcu():
>
> void free_fdtable_rcu(struct rcu_head *rcu)
> {
> struct fdtable *fdt = container_of(rcu, struct fdtable, rcu);
> struct fdtable_defer *fddef;
>
> BUG_ON(!fdt);
>
> if (fdt->max_fds <= NR_OPEN_DEFAULT) {
> /*
> * This fdtable is embedded in the files structure and that
> * structure itself is getting destroyed.
> */
> kmem_cache_free(files_cachep,
> container_of(fdt, struct files_struct,
> fdtab));
> return;
> }
> if (fdt->max_fds <= (PAGE_SIZE / sizeof(struct file *))) {
> kfree(fdt->fd);
> kfree(fdt->open_fds);
> kfree(fdt);
> } else {
> fddef = &get_cpu_var(fdtable_defer_list);
> spin_lock(&fddef->lock);
> fdt->next = fddef->next;
> fddef->next = fdt;
> /* vmallocs are handled from the workqueue context */
> schedule_work(&fddef->wq);
> spin_unlock(&fddef->lock);
> put_cpu_var(fdtable_defer_list);
> }
> }
>
> Notice above that get_cpu_var() is followed by spin_lock(). Does this
> make sense? get_cpu_var() will return a variable that is only
> accessible by the current CPU - guaranteed it will not be touch (read or
> write) by another CPU, right?

No, not true. percpu is for stuff which is generally only touched by
one CPU, but there's nothing stopping other processors from accessing it
with per_cpu(var, cpu).

Besides, the lock isn't locking the percpu list head, but the thing on
the head of the list, presumably to prevent races with the workqueue.
(Though the list structure is nonstandard, so it's not completely clear.)

J

2008-03-16 16:30:33

by Peter Teoh

Subject: Re: per cpu + spin locks coexistence?

Thanks for the answer. But I still don't get it.

On Sat, Mar 15, 2008 at 1:54 AM, Jeremy Fitzhardinge <[email protected]> wrote:
>
> Peter Teoh wrote:
> > Help me out this one - in fs/file.c, there is a function free_fdtable_rcu():
> >
> > void free_fdtable_rcu(struct rcu_head *rcu)
> > {
> > struct fdtable *fdt = container_of(rcu, struct fdtable, rcu);
> > struct fdtable_defer *fddef;
> >
> > BUG_ON(!fdt);
> >
> > if (fdt->max_fds <= NR_OPEN_DEFAULT) {
> > /*
> > * This fdtable is embedded in the files structure and that
> > * structure itself is getting destroyed.
> > */
> > kmem_cache_free(files_cachep,
> > container_of(fdt, struct files_struct,
> > fdtab));
> > return;
> > }
> > if (fdt->max_fds <= (PAGE_SIZE / sizeof(struct file *))) {
> > kfree(fdt->fd);
> > kfree(fdt->open_fds);
> > kfree(fdt);
> > } else {
> > fddef = &get_cpu_var(fdtable_defer_list);
> > spin_lock(&fddef->lock);
> > fdt->next = fddef->next;
> > fddef->next = fdt;
> > /* vmallocs are handled from the workqueue context */
> > schedule_work(&fddef->wq);
> > spin_unlock(&fddef->lock);
> > put_cpu_var(fdtable_defer_list);
> > }
> > }
> >
> > Notice above that get_cpu_var() is followed by spin_lock(). Does this
> > make sense? get_cpu_var() will return a variable that is only
> > accessible by the current CPU - guaranteed it will not be touch (read or
> > write) by another CPU, right?
>
> No, not true. percpu is for stuff which is generally only touched by
> one CPU, but there's nothing stopping other processors from accessing it
> with per_cpu(var, cpu).

get_cpu_var() above will return a pointer specific to the current CPU
only - is that correct?

#define get_cpu_var(var) (*({				\
	extern int simple_identifier_##var(void);	\
	preempt_disable();				\
	&__get_cpu_var(var); }))

SMP:

#define __get_cpu_var(var) \
	(*SHIFT_PERCPU_PTR(&per_cpu_var(var), my_cpu_offset))

#define SHIFT_PERCPU_PTR(__p, __offset) RELOC_HIDE((__p), (__offset))

I don't understand RELOC_HIDE(). So from what you said,
per_cpu_var() returns a unique variable for each CPU, but __get_cpu_var()
may not be unique among the different CPUs - is that correct?

When cpuA and cpuB call get_cpu_var(), the returned pointer is specific
to cpuA and cpuB respectively, right? So yes, as you said, different CPUs
can call get_cpu_var(), but the returned pointer will be unique to each CPU,
so it is guaranteed that another CPU will not get hold of the
returned result of get_cpu_var(), right? So why does spin_lock() come
after get_cpu_var()?

>
> Besides, the lock isn't locking the percpu list head, but the thing on
> the head of the list, presumably to prevent races with the workqueue.

I think I have something much deeper to learn. Can you point me to
some resources to read more about this?
I don't understand the difference between locking the percpu list head
and locking things on the head of the list. To me, spin_lock() is
always applied to some shared variable - so that another CPU will
block when it attempts to access it - whether it is an item on a list,
the head of the list, etc.

> (Though the list structure is nonstandard, so its not completely clear.)
>
> J

Thank you in advance for all the help rendered. :-)

--
Regards,
Peter Teoh

2008-03-16 20:22:44

by Johannes Weiner

Subject: Re: per cpu + spin locks coexistence?

Hi Peter,

"Peter Teoh" <[email protected]> writes:

>> > Notice above that get_cpu_var() is followed by spin_lock(). Does this
>> > make sense? get_cpu_var() will return a variable that is only
>> > accessible by the current CPU - guaranteed it will not be touch (read or
>> > write) by another CPU, right?
>>
>> No, not true. percpu is for stuff which is generally only touched by
>> one CPU, but there's nothing stopping other processors from accessing it
>> with per_cpu(var, cpu).
>
> get_cpu_var() above, will return a ptr specific for a particular CPU
> only, is correct?
>
> #define get_cpu_var(var) (*({ \
> extern int simple_identifier_##var(void); \
> preempt_disable(); \
> &__get_cpu_var(var); }))
>
> SMP:
>
> #define __get_cpu_var(var) \
> (*SHIFT_PERCPU_PTR(&per_cpu_var(var), my_cpu_offset))
>
> #define SHIFT_PERCPU_PTR(__p, __offset) RELOC_HIDE((__p), (__offset))
>
> and RELOC_HIDE() i don't understand.

RELOC_HIDE is a GCC hack. It basically does p + offset.

> So from what u said, per_cpu_var() returns uniquely for each CPU, but
> __get_cpu_var() may not be unique among the different CPU - is that
> correct?
>
> When cpuA and cpuB call get_cpu_var(), the returned ptr is specific
> only for cpuA and cpuB, right? So yes, as u said, different cpu can
> call get_cpu_var(), but the returned ptr will be unique to each cpu,
> therefore it is guaranteed that another CPU will not get hold of the
> returned results of get_cpu_var(), right? So why spin_lock() comes
> after get_cpu_var()?

A per-cpu variable is basically an array the size of the number of
possible CPUs in the system. get_cpu_var() checks what current CPU we
are running on and gets the array-element corresponding to this CPU.

So, really oversimplified, get_cpu_var(foo) translates to something like
foo[smp_processor_id()].

But per_cpu(), as Jeremy said, allows you to retrieve an element from
the array that is not meant for your current CPU. So even if you run on
CPU1, you might fetch the element that is meant for use on CPU0.

But even if you don't access the cpu-local variable from other CPUs, the
element might still be a reference to a shared structure that you have
to lock properly. Another CPU might have a reference to this same
structure as well in its per-cpu variable.

Hannes

2008-03-17 17:07:13

by Peter Teoh

Subject: Re: per cpu + spin locks coexistence?

Thanks for the explanation, and my apologies for this newbie discussion.
But I still find it puzzling:

On Mon, Mar 17, 2008 at 4:20 AM, Johannes Weiner <[email protected]> wrote:
>
> A per-cpu variable is basically an array the size of the number of
> possible CPUs in the system. get_cpu_var() checks what current CPU we
> are running on and gets the array-element corresponding to this CPU.
>
> So, really oversimplified, get_cpu_var(foo) translates to something like
> foo[smp_processor_id()].
>

Ok, so calling get_cpu_var() always returns the array element for the
current CPU, and by design only the current CPU can
modify/write to this array element (this is my assumption - correct?),
while the other CPUs will just read it (using the per_cpu() construct).
Correct so far? So why do you still need spin_lock() to lock out other
CPUs - the other CPUs will always just READ it, so just
go ahead and let them read it. That seems to defeat the purpose of
get_cpu_var()'s design?

But suppose you really want to take a spin_lock(), just to be sure
nobody is even reading or modifying it - then what is the
original purpose of get_cpu_var()? Isn't it to implement something
that can be parallelized among different CPUs, without them affecting
each other, and using no locks?

The dual use of spin_lock() + get_cpu_var() confuses me here :-). (Not
per_cpu(), which I agree is supposed to be callable from all the
different CPUs, for the purpose of enumeration or data collection.)

--
Regards,
Peter Teoh

2008-03-17 17:53:58

by Johannes Weiner

Subject: Re: per cpu + spin locks coexistence?

Hi,

"Peter Teoh" <[email protected]> writes:

> Thanks for the explanation, much apologies for this newbie discussion.
> But I still find it inexplicable:
>
> On Mon, Mar 17, 2008 at 4:20 AM, Johannes Weiner <[email protected]> wrote:
>>
>> A per-cpu variable is basically an array the size of the number of
>> possible CPUs in the system. get_cpu_var() checks what current CPU we
>> are running on and gets the array-element corresponding to this CPU.
>>
>> So, really oversimplified, get_cpu_var(foo) translates to something like
>> foo[smp_processor_id()].
>>
>
> Ok, so calling get_cpu_var() always return the array-element for the
> current CPU, and since by design, only the current CPU can
> modify/write to this array element (this is my assumption - correct?),
> and the other CPU will just read it (using the per_cpu construct).
> So far correct? So why do u still need to spin_lock() to lock other
> CPU from accessing - the other CPU will always just READ it, so just
> go ahead and let them read it. Seemed like it defeats the purpose of
> get_cpu_var()'s design?

You ignore the fact that there are structures and pointers in C :) One
and the same object may be referenced from different structure objects.

For example, if you have an object foo saved in a cpu-local variable and
this object references another object bar (foo.bar), you may do
whatever you want with foo. But if bar is a structure, you might
still have to take a lock when you want to change its fields, to prevent
races if this bar object can also be referenced from somewhere else, like
an instance of some_struct:

PER-CPU [foo|foo|foo]
          |   |   |
          |   |   \
          |   \    +---> [bar]
          \    +---> [bar]
           +---> [bar] <--+
                           \
                      [some_struct]

A foo _might_ be protected by cpu-locality, but as you can see, bar can
only be changed under a lock, since it can be modified through foo _and_
some_struct.

I hope this clears it up. Please correct me if I am wrong, or supply a
better scenario than this ;)

Hannes

2008-03-17 19:23:00

by Eric Dumazet

Subject: Re: per cpu + spin locks coexistence?

Peter Teoh a écrit :
> Thanks for the explanation, much apologies for this newbie discussion.
> But I still find it inexplicable:
>
> On Mon, Mar 17, 2008 at 4:20 AM, Johannes Weiner <[email protected]> wrote:
>
>> A per-cpu variable is basically an array the size of the number of
>> possible CPUs in the system. get_cpu_var() checks what current CPU we
>> are running on and gets the array-element corresponding to this CPU.
>>
>> So, really oversimplified, get_cpu_var(foo) translates to something like
>> foo[smp_processor_id()].
>>
>>
>
> Ok, so calling get_cpu_var() always return the array-element for the
> current CPU, and since by design, only the current CPU can
> modify/write to this array element (this is my assumption - correct?),
> and the other CPU will just read it (using the per_cpu construct).
> So far correct? So why do u still need to spin_lock() to lock other
> CPU from accessing - the other CPU will always just READ it, so just
> go ahead and let them read it. Seemed like it defeats the purpose of
> get_cpu_var()'s design?
>
> But supposed u really want to put a spin_lock(), just to be sure
> nobody is even reading it, or modifying it, so then what is the
> original purpose of get_cpu_var() - is it not to implement something
> that can be parallelized among different CPU, without affecting each
> other, and using no locks?
>
> The dual use of spin_lock+get_cpu_var() confuses me here :-). (not
> the per_cpu(), which I agree is supposed to be callabe from all the
> different CPU, for purpose of enumeration or data collection).
>
>
You are right, Peter, that fs/file.c contains some leftovers from a
previous implementation of the defer queue, which used a timer.

So we can probably provide a patch that:

- Uses spin_lock() & spin_unlock() instead of spin_lock_bh() &
spin_unlock_bh() in free_fdtable_work(), since we no longer use a
softirq (timer) to reschedule the workqueue.

(This timer was deleted by the following patch:
http://readlist.com/lists/vger.kernel.org/linux-kernel/50/251040.html )


But you cannot avoid the use of spin_lock()/spin_unlock(), because
schedule_work() makes no guarantee that the work will be done by this CPU.

(free_fdtable_work() can be called to flush the fd defer queue of CPU X
on behalf of CPU Y, with X != Y. You can then have corruption because
CPU X is inside free_fdtable_rcu() while CPU Y is inside
free_fdtable_work().)

So both spin_lock() and get_cpu_var() are necessary:

- One to get the per-cpu data for optimal performance on SMP (but not
mandatory)
- One to protect the data from corruption on SMP.



2008-03-19 19:47:54

by Dipankar Sarma

Subject: Re: per cpu + spin locks coexistence?

On Wed, Mar 19, 2008 at 01:00:27AM +0800, Peter Teoh wrote:
> First, the following is possible:
>
> fddef = &get_cpu_var(fdtable_defer_list);
> spin_lock(&fddef->lock);
> fdt->next = fddef->next;
> fddef->next = fdt;==============>executing at CPU A
> /* vmallocs are handled from the workqueue context */
> schedule_work(&fddef->wq);
> spin_unlock(&fddef->lock);==============>executing at CPU B
> put_cpu_var(fdtable_defer_list);
>
> where the execution can switch CPU after the schedule_work() API, then
> LOGICALLY u definitely need the spin_lock(), and the per_cpu data is
> really not necessary.
>
> But without the per_cpu structure, then the following "dedicated
> chunk" can only execute on one processor, with the possibility of
> switching to another processor after schedule_work():

Wrong. You cannot switch to another processor in schedule_work();
it doesn't block. Just read the code.

The reason why I used a per-cpu fdtable_defer_list is that I
did not want to make it a global list of deferred fdtable
structures to be freed and then have to protect it by a global lock.
Every fdtable free would have had to take the same global lock
in that case and this would have resulted in scalability
issues (cacheline bouncing, lock contention). Instead,
I used a per-cpu list, in essence creating finer-grained locking.
Two CPUs doing fdtable free operations simultaneously will
not contend for the same lock in this case. The fddef lock
is still required for the reasons Eric pointed out.


> and per_cpu is so that the same chunk of code can be executing at
> multiple CPUs all at the same time.
>
> Now the key issue rises up - as I have just asked before u answered my question:
>
> http://mail.nl.linux.org/kernelnewbies/2008-03/msg00236.html
>
> can schedule_work() sleep? (just like schedule(), whcih can sleep right?)
> schedule_work() is guaranteed to execute the work queue at least once,
> and so this thread may or may not sleep. correct? Or wrong?

Wrong.

Thanks
Dipankar

2008-03-19 21:05:55

by Peter Teoh

Subject: Re: per cpu + spin locks coexistence?

Eric Dumazet wrote:
> Peter Teoh a écrit :
>> On 3/18/08, Eric Dumazet <[email protected]> wrote:
>>
>>
>>> You are right Peter, that fs/file.c contains some leftover from
>>> previous
>>> implementation of defer queue,
>>> that was using a timer.
>>>
>>> So we can probably provide a patch that :
>>>
>>> - Use spin_lock() & spin_unlock() instead of spin_lock_bh() &
>>> spin_unlock_bh() in free_fdtable_work()
>>> since we dont anymore use a softirq (timer) to reschedule the
>>> workqueue.
>>>
>>> ( this timer was deleted by the following patch :
>>> http://readlist.com/lists/vger.kernel.org/linux-kernel/50/251040.html
>>>
>>>
>>> But, you cannot avoid use of spin_lock()/spin_unlock() because
>>> schedule_work() makes no garantee that the work will be done by
>>> this cpu.
>>>
>>
>> Ah.....u have hit the nail....and combine with Johannes Weiner's
>> explanation, I have pieced together the full scenario:
>>
>> First, the following is possible:
>>
>> fddef = &get_cpu_var(fdtable_defer_list);
>> spin_lock(&fddef->lock);
>> fdt->next = fddef->next;
>> fddef->next = fdt;==============>executing at CPU A
>> /* vmallocs are handled from the workqueue context */
>> schedule_work(&fddef->wq);
>> spin_unlock(&fddef->lock);==============>executing at
>> CPU B
>> put_cpu_var(fdtable_defer_list);
>>
>> where the execution can switch CPU after the schedule_work() API, then
>> LOGICALLY u definitely need the spin_lock(), and the per_cpu data is
>> really not necessary.
>>
>> But without the per_cpu structure, then the following "dedicated
>> chunk" can only execute on one processor, with the possibility of
>> switching to another processor after schedule_work():
>>
> Hum, you misunderstood the point.
>
> schedule_work(); wont switch your current CPU, since you are inside a
> spin_lock
> ()/spin_unlock() pair, so preemption is not possible.
>
>
>
>> So then we introduce the per_cpu structure - so that the "dedicated
>> chunk" can be executing on multiple processor ALL AT THE SAME TIME,
>> without interferring each other, as fddef are per-cpu (rightfully
>> owned only before schedule_work() is called, but after schedule_work()
>> is called, an arbitrary CPU will be executing this fddef).
>>
>> spin_lock() is necessary because of the possibility of CPU switch
>> (schedule_work()).
>>
>> and per_cpu is so that the same chunk of code can be executing at
>> multiple CPUs all at the same time.
>>
>> Now the key issue rises up - as I have just asked before u answered
>> my question:
>>
>> http://mail.nl.linux.org/kernelnewbies/2008-03/msg00236.html
>>
>> can schedule_work() sleep? (just like schedule(), whcih can sleep
>> right?)
>> schedule_work() is guaranteed to execute the work queue at least once,
>> and so this thread may or may not sleep. correct? Or wrong?
>>
>>
> schedule_work() cannot sleep. It only queues a work to be done later
> by a special thread.
>
> We need this because vfree() should not be called from softirq handler
> (rcu in this case), so we queue a (small) job.
>> Problem is when u sleep and never wake up, then the spin_lock become
>> permanently locked, and when later the same CPU (have to be the same
>> fddef CPU) is being reschedule to execute the get_cpu_var() again, it
>> will spin_lock() infinitely, resulting in 100% CPU utilization error.
>>
>> To prevent these types of error, spin_lock are always not to be used
>> with to wrap around functions that can sleep, and can only containing
>> short routines between lock and unlock.
>>
>> Is my analysis correct?
>>
>>
> Not exactly :) , but please continue to learn :)
>

Thank you, everyone, for a very informative education. I will go
back and analyse it in more detail. :-)

2008-03-20 00:08:04

by Peter Teoh

Subject: Re: per cpu + spin locks coexistence?

On 3/18/08, Eric Dumazet <[email protected]> wrote:

>
> You are right Peter, that fs/file.c contains some leftover from previous
> implementation of defer queue,
> that was using a timer.
>
> So we can probably provide a patch that :
>
> - Use spin_lock() & spin_unlock() instead of spin_lock_bh() &
> spin_unlock_bh() in free_fdtable_work()
> since we dont anymore use a softirq (timer) to reschedule the workqueue.
>
> ( this timer was deleted by the following patch :
> http://readlist.com/lists/vger.kernel.org/linux-kernel/50/251040.html
>
>
> But, you cannot avoid use of spin_lock()/spin_unlock() because
> schedule_work() makes no garantee that the work will be done by this cpu.

Ah... you have hit the nail on the head... and combined with Johannes
Weiner's explanation, I have pieced together the full scenario:

First, the following is possible:

fddef = &get_cpu_var(fdtable_defer_list);
spin_lock(&fddef->lock);
fdt->next = fddef->next;
fddef->next = fdt;==============>executing at CPU A
/* vmallocs are handled from the workqueue context */
schedule_work(&fddef->wq);
spin_unlock(&fddef->lock);==============>executing at CPU B
put_cpu_var(fdtable_defer_list);

where the execution can switch CPUs after the schedule_work() call - then
LOGICALLY you definitely need the spin_lock(), and the per_cpu data is
really not necessary.

But without the per_cpu structure, the following "dedicated
chunk" could only execute on one processor, with the possibility of
switching to another processor after schedule_work():

So then we introduce the per_cpu structure - so that the "dedicated
chunk" can be executing on multiple processors ALL AT THE SAME TIME,
without interfering with each other, as fddef is per-cpu (rightfully
owned only before schedule_work() is called; after schedule_work()
is called, an arbitrary CPU may be executing on this fddef).

spin_lock() is necessary because of the possibility of a CPU switch
(schedule_work()).

And per_cpu is so that the same chunk of code can be executing on
multiple CPUs all at the same time.

Now the key issue arises - as I had asked just before you answered my question:

http://mail.nl.linux.org/kernelnewbies/2008-03/msg00236.html

Can schedule_work() sleep? (Just like schedule(), which can sleep, right?)
schedule_work() is guaranteed to execute the work queue at least once,
and so this thread may or may not sleep. Correct? Or wrong?

The problem is, if you sleep and never wake up, then the spin_lock becomes
permanently locked, and when the same CPU (it has to be the same
fddef CPU) is later rescheduled to execute get_cpu_var() again, it
will spin in spin_lock() forever, resulting in 100% CPU utilization.

To prevent these kinds of errors, spin_locks are never to be wrapped
around functions that can sleep, and should only contain
short routines between lock and unlock.

Is my analysis correct?

--
Regards,
Peter Teoh

2008-03-20 09:34:44

by Eric Dumazet

Subject: Re: per cpu + spin locks coexistence?

Peter Teoh a écrit :
> On 3/18/08, Eric Dumazet <[email protected]> wrote:
>
>
>> You are right Peter, that fs/file.c contains some leftover from previous
>> implementation of defer queue,
>> that was using a timer.
>>
>> So we can probably provide a patch that :
>>
>> - Use spin_lock() & spin_unlock() instead of spin_lock_bh() &
>> spin_unlock_bh() in free_fdtable_work()
>> since we dont anymore use a softirq (timer) to reschedule the workqueue.
>>
>> ( this timer was deleted by the following patch :
>> http://readlist.com/lists/vger.kernel.org/linux-kernel/50/251040.html
>>
>>
>> But, you cannot avoid use of spin_lock()/spin_unlock() because
>> schedule_work() makes no garantee that the work will be done by this cpu.
>>
>
> Ah.....u have hit the nail....and combine with Johannes Weiner's
> explanation, I have pieced together the full scenario:
>
> First, the following is possible:
>
> fddef = &get_cpu_var(fdtable_defer_list);
> spin_lock(&fddef->lock);
> fdt->next = fddef->next;
> fddef->next = fdt;==============>executing at CPU A
> /* vmallocs are handled from the workqueue context */
> schedule_work(&fddef->wq);
> spin_unlock(&fddef->lock);==============>executing at CPU B
> put_cpu_var(fdtable_defer_list);
>
> where the execution can switch CPU after the schedule_work() API, then
> LOGICALLY u definitely need the spin_lock(), and the per_cpu data is
> really not necessary.
>
> But without the per_cpu structure, then the following "dedicated
> chunk" can only execute on one processor, with the possibility of
> switching to another processor after schedule_work():
>
Hum, you misunderstood the point.

schedule_work() won't switch your current CPU, since you are inside a
spin_lock()/spin_unlock() pair, so preemption is not possible.



> So then we introduce the per_cpu structure - so that the "dedicated
> chunk" can be executing on multiple processor ALL AT THE SAME TIME,
> without interferring each other, as fddef are per-cpu (rightfully
> owned only before schedule_work() is called, but after schedule_work()
> is called, an arbitrary CPU will be executing this fddef).
>
> spin_lock() is necessary because of the possibility of CPU switch
> (schedule_work()).
>
> and per_cpu is so that the same chunk of code can be executing at
> multiple CPUs all at the same time.
>
> Now the key issue rises up - as I have just asked before u answered my question:
>
> http://mail.nl.linux.org/kernelnewbies/2008-03/msg00236.html
>
> can schedule_work() sleep? (just like schedule(), whcih can sleep right?)
> schedule_work() is guaranteed to execute the work queue at least once,
> and so this thread may or may not sleep. correct? Or wrong?
>
>
schedule_work() cannot sleep. It only queues a work item to be done
later by a special thread.

We need this because vfree() should not be called from a softirq handler
(RCU in this case), so we queue a (small) job.
> Problem is when u sleep and never wake up, then the spin_lock become
> permanently locked, and when later the same CPU (have to be the same
> fddef CPU) is being reschedule to execute the get_cpu_var() again, it
> will spin_lock() infinitely, resulting in 100% CPU utilization error.
>
> To prevent these types of error, spin_lock are always not to be used
> with to wrap around functions that can sleep, and can only containing
> short routines between lock and unlock.
>
> Is my analysis correct?
>
>
Not exactly :), but please continue to learn :)