I've been reading i386 interrupt handling code for a couple of days
and encountered something that looks like a race condition. It's
between include/asm-i386/hardirq.h:irq_enter() and
arch/i386/kernel/irq.c:get_irqlock(). They seem to be using lockless
synchronization with local_irq_count of each cpu and global_irq_lock
variable.

A. locking CPU

 1. Do test_and_set_bit() on global_irq_lock; if it fails, repeat.
 2. If all local_irq_counts are zero, we're the winner. Check other
    stuff; otherwise, clear global_irq_lock and retry.

B. other CPUs

 1. Increment local_irq_count.
 2. test_bit() on global_irq_lock; if zero, continue handling the
    interrupt; otherwise, wait till it's cleared.

For this to work, the locking CPU should fetch the value of
local_irq_count after global_irq_lock value becomes visible to other
CPUs, and other CPUs should fetch the value of global_irq_lock after
making the incremented local_irq_count visible to other CPUs.
The locking CPU is OK because test_and_set_bit() forces ordering on
x86, but there should be an mb() between steps 1 and 2 for the other
CPUs because neither ++ nor test_bit() enforces ordering. Part B is
irq_enter() in hardirq.h, which looks like the following.
static inline void irq_enter(int cpu, int irq)
{
	++local_irq_count(cpu);

	while (test_bit(0,&global_irq_lock)) {
		cpu_relax();
	}
}
Is it a race condition or am I getting it horribly wrong? Thx in
advance.
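
For readers following along, here is a minimal userspace sketch of the two sides described above. It is an illustration under stated assumptions, not the kernel code: C11 atomics stand in for test_and_set_bit(), test_bit() and mb(), the CPU count is made up, and every name ending in _sketch is hypothetical. The fence sits where the message above argues a barrier belongs.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count for this sketch */

static atomic_int local_irq_count_sketch[NR_CPUS_SKETCH];
static atomic_uint global_irq_lock_sketch;

/* Side A, one attempt: true means the lock is held and no handler runs. */
static bool get_irqlock_try_sketch(void)
{
	/* Step A1: sequentially consistent read-modify-write, standing in
	 * for test_and_set_bit() (assumed to act as a full barrier). */
	if (atomic_fetch_or(&global_irq_lock_sketch, 1u) & 1u)
		return false;	/* somebody else already holds the bit */

	/* Step A2: every per-CPU counter must read zero. */
	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		if (atomic_load(&local_irq_count_sketch[cpu]) != 0) {
			/* A handler may be running: back off, caller retries. */
			atomic_fetch_and(&global_irq_lock_sketch, ~1u);
			return false;
		}
	}
	return true;
}

static void release_irqlock_sketch(void)
{
	atomic_fetch_and(&global_irq_lock_sketch, ~1u);
}

/* Side B: entering an interrupt handler on one CPU. */
static void irq_enter_sketch(int cpu)
{
	/* Step B1: increment this CPU's counter with no ordering attached. */
	atomic_fetch_add_explicit(&local_irq_count_sketch[cpu], 1,
				  memory_order_relaxed);

	/* The full barrier argued for above, between steps B1 and B2. */
	atomic_thread_fence(memory_order_seq_cst);

	/* Step B2: wait until nobody holds the global lock. */
	while (atomic_load_explicit(&global_irq_lock_sketch,
				    memory_order_relaxed) & 1u)
		;	/* the kernel code calls cpu_relax() here */
}

static void irq_exit_sketch(int cpu)
{
	atomic_fetch_sub_explicit(&local_irq_count_sketch[cpu], 1,
				  memory_order_release);
}
```

A caller on side A would loop on get_irqlock_try_sketch() until it returns true; a caller on side B brackets its handler with irq_enter_sketch() and irq_exit_sketch().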
On Thu, 21 Aug 2003, TeJun Huh wrote:
> I've been reading i386 interrupt handling code for a couple of days
> and encountered something that looks like a race condition. It's
> between include/asm-i386/hardirq.h:irq_enter() and
> arch/i386/kernel/irq.c:get_irqlock(). They seem to be using lockless
> synchronization with local_irq_count of each cpu and global_irq_lock
> variable.
OK, 2.4 (but in future, try to mention which kernel version). You'll have
to forgive me if I misunderstand you.
> A. locking CPU
>
> 1. Do test_and_set_bit() on global_irq_lock, if fail, repeat.
> 2. If all local_irq_count's are zero, we're the winner. Check other
> stuff; otherwise, clear global_irq_lock and retry.
Are you referring to hardirq_trylock()?
> B. other CPUs
>
> 1. Increment local_irq_count
> 2. test_bit() on global_irq_lock, if zero, continue handling interrupt;
> otherwise, wait till it's cleared.
>
> For this to work, the locking CPU should fetch the value of
> local_irq_count after global_irq_lock value becomes visible to other
> CPUs, and other CPUs should fetch the value of global_irq_lock after
> making the incremented local_irq_count visible to other CPUs.
Why after? It's currently in an interrupt anyway, and local_irq_count is
per cpu, so it's not used on other cpus. Why do you need to make it
visible on other processors? (save irqs_running(), but even that's ok)
> The locking CPU is OK because test_and_set_bit() forces ordering on
> x86, but there should be a mb() between step 1 and 2 for other CPUs
> because none of ++ and test_bit is ordering. The B part is irq_enter()
> in hardirq.h which looks like the following.
>
> static inline void irq_enter(int cpu, int irq)
> {
> 	++local_irq_count(cpu);
>
> 	while (test_bit(0,&global_irq_lock)) {
> 		cpu_relax();
> 	}
> }
>
> Is it a race condition or am I getting it horribly wrong? Thx in
> advance.
I don't see or understand the race condition you're describing,
local_irq_count is per cpu.
Zwane
On Thu, Aug 21, 2003 at 06:07:34AM -0400, Zwane Mwaikambo wrote:
>
> Ok 2.4 (but for future try and mention which kernel version). You'll have
> to forgive me if i misunderstand you..
The version I'm looking at is 2.4.21. Sorry for forgetting to
mention it.
> Are you referring to hardirq_trylock()?
>...cut...
> > For this to work, the locking CPU should fetch the value of
> > local_irq_count after global_irq_lock value becomes visible to other
> > CPUs, and other CPUs should fetch the value of global_irq_lock after
> > making the incremented local_irq_count visible to other CPUs.
>
> Why after? it's currently in an interrupt anyway, the local_irq_count is
> per cpu so it's not used on other cpus why do you need to make it
> visible on other processors? (save irqs_running() but even that's ok)
I'm talking about global_irq_lock synchronization. local_irq_count
_is_ local but is used to synchronize the global irq lock. Sparc uses a
big reader lock for this purpose, but the x86 code seems to use
memory-ordered lockless synchronization.
I'll describe it in more detail. On MP, cli() is __global_cli(),
which in turn calls get_irqlock(). get_irqlock() uses
test_and_set_bit() and wait_on_irq() to achieve global irq locking.
The counterpart of this locking is irq_enter() and irq_exit().
A simplified version of the mechanism is as follows.

A. get_irqlock() -> wait_on_irq()

 1. Repeat test_and_set_bit(0, &global_irq_lock) until we're the winner.
 2. Test if all local_irq_counts are zero. If there is any non-zero
    value, that CPU might have entered an interrupt handler already.
    Clear global_irq_lock and go back to step 1.

 => If the test succeeded, we should be sure that no other cpu is
    running an interrupt handler and none will enter an interrupt
    handler until global_irq_lock is cleared.

B. irq_enter()

 1. Increment local_irq_count.
 2. Do test_bit(0, &global_irq_lock). If it's set, someone is trying to
    grab or has grabbed global_irq_lock; loop until it gets cleared.
    If global_irq_lock is clear, the CPU enters the interrupt handler.

The race condition occurs because there is no mb() between step 1 and
2 of irq_enter(). Example scenarios would be

 [AM]: atomic & memory barrier
 [L] : local to cpu (not yet visible to other cpus)
 [G] : became global

         A                               B

 calls cli()                     Interrupt occurs
 executing get_irqlock()         executing irq_enter()

** Scenario #1

                                 [L]++local_irq_counter
                                 fetch global_irq_lock
 [AM]set global_irq_lock         test global_irq_lock
 fetch local_irq_counter
 test local_irq_counter          [G]++local_irq_counter

** Scenario #2

                                 fetch global_irq_lock
 [AM]set global_irq_lock
 fetch local_irq_counter
 test local_irq_counter          [L]++local_irq_counter
                                 [G]++local_irq_counter
                                 test global_irq_lock

In the above scenarios, B enters the interrupt handler and A returns
successfully from cli(), so B will be executing an interrupt handler
while A is inside its cli()/sti() critical section. This occurs because
nothing forces the fetch of global_irq_lock to occur after the
local_irq_counter increment has become visible to other cpus.
If I misunderstood the synchronization mechanism or the architectural
characteristics, please point it out.
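
Scenario #1 can be replayed deterministically with a toy model. The sketch below is only an illustration: it assumes that CPU B has a one-entry store buffer (the [L] state) which a full barrier drains (the [G] state), and that A's atomic set is visible immediately. The struct, the function names and the flag are all hypothetical and exist only for this example.

```c
#include <stdbool.h>

/* Toy model: shared memory plus a one-entry store buffer for CPU B. */
struct sb_model {
	int mem_global_irq_lock;	/* value visible to every CPU */
	int mem_local_irq_count;	/* B's counter as seen in shared memory */
	bool b_store_pending;		/* B's increment still sits in its buffer */
	int b_pending_count;		/* value waiting in the buffer */
};

/* Make B's buffered store globally visible ([L] becomes [G]). */
static void b_drain(struct sb_model *m)
{
	if (m->b_store_pending) {
		m->mem_local_irq_count = m->b_pending_count;
		m->b_store_pending = false;
	}
}

/*
 * Replay Scenario #1 in its fixed interleaving. Returns true if both
 * sides proceed, i.e. A believes no handler is running while B enters
 * its handler.
 */
static bool replay_scenario1(bool barrier_in_irq_enter)
{
	struct sb_model m = { 0, 0, false, 0 };

	/* B: [L]++local_irq_count, buffered locally. */
	m.b_pending_count = m.mem_local_irq_count + 1;
	m.b_store_pending = true;

	/* B: an optional full barrier drains the buffer before the load. */
	if (barrier_in_irq_enter)
		b_drain(&m);

	/* B: fetch global_irq_lock. */
	int b_saw_lock = m.mem_global_irq_lock;

	/* A: [AM] set global_irq_lock, visible at once in this model. */
	m.mem_global_irq_lock = 1;

	/* A: fetch and test local_irq_count from shared memory. */
	int a_saw_count = m.mem_local_irq_count;

	/* B: [G] the increment finally reaches shared memory. */
	b_drain(&m);

	bool a_proceeds = (a_saw_count == 0);
	bool b_proceeds = (b_saw_lock == 0);
	return a_proceeds && b_proceeds;
}
```

Calling replay_scenario1(false) reproduces the bad outcome, with both sides proceeding. With replay_scenario1(true) the increment is visible before A's check, so in this interleaving A sees a non-zero counter and has to back off.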
TeJun wrote:
> static inline void irq_enter(int cpu, int irq)
> {
> 	++local_irq_count(cpu);
>
> 	while (test_bit(0,&global_irq_lock)) {
> 		cpu_relax();
> 	}
> }
>
> Is it a race condition or am I getting it horribly wrong? Thx in
> advance.
Yes, it's a race. Actually, it's a variant of the race that led to the
introduction of set_current_state():
test_bit is a simple read instruction. i386 cpus are free to execute it early, i.e. they can execute it before the write part of "++local_irq_count(cpu)".
I think smp_rmb() is the right barrier - could you write a patch and send it to Marcelo?
--
Manfred
On Thu, Aug 21, 2003 at 07:01:39PM +0200, Manfred Spraul wrote:
> TeJun wrote:
> >static inline void irq_enter(int cpu, int irq)
> >{
> >	++local_irq_count(cpu);
> >
> >	while (test_bit(0,&global_irq_lock)) {
> >		cpu_relax();
> >	}
> >}
> >
> > Is it a race condition or am I getting it horribly wrong? Thx in
> >advance.
>
> Yes, it's a race. Actually a variant of the race that led to the
> introduction of set_current_state():
>
> test_bit is a simple read instruction. i386 cpus are free to execute it
> early, i.e. they can execute it before the write part of
> "++local_irq_count(cpu)".
>
> I think smp_rmb() is the right barrier - could you write a patch and send
> it to Marcelo?
smp_rmb() is enough in practice for x86 (in asm-i386), but it's not the
right barrier in general because rmb only serializes reads against
reads, so it would also make little sense to someone reading the i386
code. Here you have to serialize a write against a read, so it would be
misleading unless you know exactly the low-level implementations of
those barriers.
smp_mb() before the while loop should be the correct barrier for all
archs, and the asm generated on x86 will be the same.
alpha, ia64 and x86-64 (and probably others) need it too.
Andrea
> smp_rmb is enough in practice for x86 (in asm-i386), but not the right
> barrier in general because rmb only serializes reads against reads, so
> it would also make little sense while reading the i386 code. here you've
> to serialize a write against a read so it would be misleading unless you
> know exactly the lowlevel implementations of those barriers.
>
> smp_mb() before the while loop should be the correct barrier for all
> archs and the asm generated on x86 will be the same.
>
> alpha, ia64 and x86-64 (and probably others) needs it too.
Can some kind soul please provide me with the needed mini-patch. I would like
to try that on my constantly crashing SMP test box...
Regards,
Stephan
On Thu, Aug 21, 2003 at 11:48:24PM +0200, Stephan von Krawczynski wrote:
>
> > smp_rmb is enough in practice for x86 (in asm-i386), but not the right
> > barrier in general because rmb only serializes reads against reads, so
> > it would also make little sense while reading the i386 code. here you've
> > to serialize a write against a read so it would be misleading unless you
> > know exactly the lowlevel implementations of those barriers.
> >
> > smp_mb() before the while loop should be the correct barrier for all
> > archs and the asm generated on x86 will be the same.
> >
> > alpha, ia64 and x86-64 (and probably others) needs it too.
>
> Can some kind soul please provide me with the needed mini-patch. I would like
> to try that on my constantly crashing SMP test box...
--- 2.4.22pre7aa1/include/asm-i386/hardirq.h.~1~ 2003-07-20 18:39:04.000000000 +0200
+++ 2.4.22pre7aa1/include/asm-i386/hardirq.h 2003-08-22 00:24:08.000000000 +0200
@@ -71,6 +71,8 @@ static inline void irq_enter(int cpu, in
 {
 	++local_irq_count(cpu);
 
+	smp_mb();
+
 	while (test_bit(0,&global_irq_lock)) {
 		cpu_relax();
 	}
Andrea
I'm attaching a patch for i386. It makes three changes.
1. add smp_mb() between local_irq_count++ and global_irq_lock test
in irq_enter().
2. add smp_mb__after_clear_bit() before irqs_running() test in
wait_on_irq().
3. remove irqs_running() test from synchronize_irq()
Removing the irqs_running() test from synchronize_irq() is needed for
the same reason: with the test in place, other interrupts might still be
running on successful return from synchronize_irq().
smp_mb__after_clear_bit() should really be
smp_mb__after_test_and_set_bit(), which doesn't exist. Should I add it?
Once the smp_mb__after_clear_bit() question is settled, I'll make a
patch for every affected architecture. Please comment.
# ------------ patch follows --------------
diff -Nru a/arch/i386/kernel/irq.c b/arch/i386/kernel/irq.c
--- a/arch/i386/kernel/irq.c Fri Aug 22 10:07:50 2003
+++ b/arch/i386/kernel/irq.c Fri Aug 22 10:07:50 2003
@@ -271,6 +271,8 @@
 		 * for bottom half handlers unless we're
 		 * already executing in one..
 		 */
+		smp_mb__after_clear_bit(); /* Synchronize with irq_enter() */
+
 		if (!irqs_running())
 			if (local_bh_count(cpu) || !spin_is_locked(&global_bh_lock))
 				break;
@@ -307,11 +309,9 @@
  */
 void synchronize_irq(void)
 {
-	if (irqs_running()) {
-		/* Stupid approach */
-		cli();
-		sti();
-	}
+	/* Stupid approach */
+	cli();
+	sti();
 }
 
 static inline void get_irqlock(int cpu)
diff -Nru a/include/asm-i386/hardirq.h b/include/asm-i386/hardirq.h
--- a/include/asm-i386/hardirq.h Fri Aug 22 10:07:50 2003
+++ b/include/asm-i386/hardirq.h Fri Aug 22 10:07:50 2003
@@ -67,6 +67,8 @@
 {
 	++local_irq_count(cpu);
 
+	smp_mb(); /* Synchronize with wait_on_irq() */
+
 	while (test_bit(0,&global_irq_lock)) {
 		cpu_relax();
 	}
On Fri, 22 Aug 2003 10:18:40 +0900
TeJun Huh wrote:
> I'm attaching patch for i386. It makes three changes.
>
> 1. add smp_mb() between local_irq_count++ and global_irq_lock test
> in irq_enter().
> 2. add smp_mb__after_clear_bit() before irqs_running() test in
> wait_on_irq().
> 3. remove irqs_running() test from synchronize_irq()
Thank you TeJun,
I have started tests and will provide feedback if your patch has any influence
on my problem. This may take some days.
Regards,
Stephan
thanks TeJun,
just one comment
On Fri, Aug 22, 2003 at 10:18:40AM +0900, TeJun Huh wrote:
> 3. remove irqs_running() test from synchronize_irq()
I'm not convinced this one is needed. An irq can still run on another
cpu but the cli();sti() may execute while it's here:

    irq running                 synchronize_irq()
    --------------              -----------------
    do_IRQ
    handle_IRQ_event
                                cli()
                                sti()

    irq_enter -> way too late

in short, doing irqs_running() doesn't seem to weaken the semantics of
synchronize_irq() to me.
I think it should be changed this way instead:
void synchronize_irq(void)
{
	smp_mb();
	if (irqs_running()) {
		/* Stupid approach */
		cli();
		sti();
	}
}
to be sure to read the local irq area after the previous code (the
test_and_set_bit of the global_irq_lock of a cli() in your version would
achieve the same implicit smp_mb too, so maybe your only point for doing
cli()/sti() was to execute the smp_mb before the irqs_running?). the
above version is more fine-grained and it looks equivalent to yours.
Andrea
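
As a rough userspace illustration of the shape suggested above (full barrier first, then the irqs_running() check, then the slow path only when needed), here is a sketch under stated assumptions. A C11 seq_cst fence stands in for smp_mb(), the CPU count is made up, and the cli()/sti() pair is replaced by a placeholder that only counts how often the slow path is taken. Every name ending in _sketch is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count for this sketch */

static atomic_int local_irq_count_sketch[NR_CPUS_SKETCH];
static int slow_path_calls_sketch;	/* how often cli()/sti() would run */

/* True if any CPU currently reports a non-zero interrupt count. */
static bool irqs_running_sketch(void)
{
	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		if (atomic_load_explicit(&local_irq_count_sketch[cpu],
					 memory_order_relaxed) != 0)
			return true;
	return false;
}

/* Placeholder for cli(); sti(). The real pair waits for running
 * handlers; this stand-in only records that the slow path was taken. */
static void global_cli_sti_sketch(void)
{
	slow_path_calls_sketch++;
}

static void synchronize_irq_sketch(void)
{
	/* Order the caller's earlier stores before the counter reads. */
	atomic_thread_fence(memory_order_seq_cst);

	if (irqs_running_sketch())
		global_cli_sti_sketch();
}
```

With all counters at zero, synchronize_irq_sketch() returns without touching the slow path; once any counter is non-zero, it takes the slow path once per call.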
Hello Andrea,
On Fri, Aug 22, 2003 at 06:25:46PM +0200, Andrea Arcangeli wrote:
> thanks TeJun,
>
> just one comment
>
> On Fri, Aug 22, 2003 at 10:18:40AM +0900, TeJun Huh wrote:
> > 3. remove irqs_running() test from synchronize_irq()
>
> I'm not convinced this one is needed. An irq can still run on another
> cpu but the cli();sti() may execute while it's here:
>
>     irq running                 synchronize_irq()
>     --------------              -----------------
>     do_IRQ
>     handle_IRQ_event
>                                 cli()
>                                 sti()
>
>     irq_enter -> way too late
>
> in short, doing irqs_running() doesn't seem to weaken the semantics of
> synchronize_irq() to me.
>
> I think it should be changed this way instead:
>
> void synchronize_irq(void)
> {
> 	smp_mb();
> 	if (irqs_running()) {
> 		/* Stupid approach */
> 		cli();
> 		sti();
> 	}
> }
>
> to be sure to read the local irq area after the previous code (the
> test_and_set_bit of the global_irq_lock of a cli() in your version would
> achieve the same implicit smp_mb too, so maybe your only point for doing
> cli()/sti() was to execute the smp_mb before the irqs_running?). the
> above version is more fine-grained and it looks equivalent to yours.
>
> Andrea
Yes, you're right. Adding just smp_mb() should guarantee that, once
synchronize_irq() returns, no cpu is executing an interrupt handler
that might not see memory contents modified before the
synchronize_irq() call. I think we need some decent comments
there. :-)
Now that I know test_and_set_bit() implies a memory barrier,
smp_mb__after_clear_bit() can be removed. I'll make and post a patch
which fixes this race and the bh race from the other thread.
Thanks.
--
tejun
On Sun, Aug 24, 2003 at 12:06:51PM +0900, TeJun Huh wrote:
> As now I know that test_and_set_bit() implies memory barrier,
> smp_mb__after_clear_bit() can be removed. I'll make and post a patch
;) right
> which fixes this race and the bh race of the other thread.
thanks,
Andrea