2006-02-14 22:36:03

by Gerard J Snitselaar

Subject: Problem: Possible deadlock for 2.4 SMP systems

Problem: Possible deadlock for 2.4 SMP systems
Arch: i386

Full Description: I ran into this with the serial driver,
but it might affect other drivers and possibly other
architectures. On an SMP system, one CPU (cpu0) was in the
process of shutting down the serial port while another CPU
(cpu1) was trying to service the interrupt for that port.
What appears to happen is that cpu0 calls cli() in
shutdown() (drivers/char/serial.c), grabbing global_irq_lock.
Meanwhile, cpu1 sets IRQ_INPROGRESS, eventually calls
handle_IRQ_event(), and spins on global_irq_lock in irq_enter().
Then cpu0 calls free_irq() and eventually reaches the point where
it spins while IRQ_INPROGRESS is set. Since cpu0 is holding
global_irq_lock, cpu1 can't do its work and clear IRQ_INPROGRESS,
so both CPUs spin forever.
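To make the cycle concrete, here is a rough userspace model of the interleaving described above. It is a sketch, not kernel code: two C11 atomics stand in for global_irq_lock and IRQ_INPROGRESS, a pthread plays cpu1, and all helper names (spin_lock, cpu0_shutdown_original, run_original_order, and so on) are hypothetical. Unlike the real kernel, the spins have timeouts, so the model reports the deadlock instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Userspace model only: these atomics stand in for kernel state. */
static atomic_bool global_irq_lock;  /* models global_irq_lock */
static atomic_bool irq_inprogress;   /* models IRQ_INPROGRESS */
static atomic_bool cli_done;         /* handshake: cpu0 has done cli() */
static atomic_bool isr_started;      /* handshake: cpu1 set IRQ_INPROGRESS */

static long now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

/* Test-and-set spin lock. The real kernel spins forever; the timeout
   only lets the model report a deadlock instead of hanging. */
static bool spin_lock(atomic_bool *lock, long timeout_ms) {
    long deadline = now_ms() + timeout_ms;
    while (atomic_exchange(lock, true)) {
        if (now_ms() > deadline) return false;
        sched_yield();
    }
    return true;
}

/* Spin while *flag is set (as free_irq() does on IRQ_INPROGRESS). */
static bool spin_while_set(atomic_bool *flag, long timeout_ms) {
    long deadline = now_ms() + timeout_ms;
    while (atomic_load(flag)) {
        if (now_ms() > deadline) return false;
        sched_yield();
    }
    return true;
}

static void wait_for(atomic_bool *flag) {
    while (!atomic_load(flag)) sched_yield();
}

/* cpu1: IRQ_INPROGRESS is set before irq_enter() takes the lock,
   and cleared only after the handler has run under it. */
static void *cpu1_interrupt(void *arg) {
    (void)arg;
    wait_for(&cli_done);
    atomic_store(&irq_inprogress, true);
    atomic_store(&isr_started, true);
    if (spin_lock(&global_irq_lock, 500)) {     /* irq_enter() */
        atomic_store(&global_irq_lock, false);  /* irq_exit() */
        atomic_store(&irq_inprogress, false);
    }
    return NULL;
}

/* cpu0: shutdown() in the original order. Returns 0 if the free_irq()
   wait completed, -1 if it gave up (the model deadlocked). */
static int cpu0_shutdown_original(void) {
    (void)spin_lock(&global_irq_lock, 500);    /* cli() */
    atomic_store(&cli_done, true);
    wait_for(&isr_started);                    /* cpu1 is in the IRQ path */
    int rc = spin_while_set(&irq_inprogress, 200) ? 0 : -1; /* free_irq() */
    atomic_store(&global_irq_lock, false);     /* restore_flags() */
    return rc;
}

/* Runs the interleaving once (state is not reset between calls). */
int run_original_order(void) {
    pthread_t cpu1;
    pthread_create(&cpu1, NULL, cpu1_interrupt, NULL);
    int rc = cpu0_shutdown_original();
    pthread_join(cpu1, NULL);
    return rc;
}
```

In this model run_original_order() returns -1: cpu0 gives up waiting for IRQ_INPROGRESS while still holding the lock, because cpu1 can clear that flag only after taking the same lock. Build with -pthread.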

I read somewhere that global_irq_lock is deprecated. Is there
something the serial driver should be doing instead of cli()
and restore_flags() in shutdown()?


2006-02-16 19:55:21

by Paul Fulghum

Subject: Re: Problem: Possible deadlock for 2.4 SMP systems

Gerard Snitselaar wrote:
> What appears to happen is cpu0 calls cli() in
> shutdown() (drivers/char/serial.c), grabbing global_irq_lock.
> Meanwhile cpu1 sets IRQ_INPROGRESS, and eventually calls
> handle_IRQ_event() and spins on global_irq_lock in irq_enter().
> CPU0 calls free_irq() and eventually gets to the point where
> it spins while IRQ_INPROGRESS is set. Since cpu0 is holding
> global_irq_lock, cpu1 can't do its work and clear IRQ_INPROGRESS.

From looking at irq.c (2.4.31), it appears that calling free_irq()
after cli() on SMP is not safe because of the race you describe.

> I read somewhere that global_irq_lock is deprecated, so is there
> something that the serial driver should be doing instead of cli()
> and restore_flags() in shutdown()?

shutdown() seems a little backwards:
it calls free_irq() and only then disables device interrupts.

One way of handling this may be to move the code
block that calls free_irq() (the if statement after the
'Free the IRQ' comment) to after the restore_flags() call.

At that point, the device is no longer generating
interrupts and has been removed from the IRQ_ports
list, so the ISR will not touch the device instance
and free_irq() can finish safely.
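A minimal userspace sketch of the proposed reordering, under the same kind of modeling assumptions: it is not kernel code, C11 atomics stand in for global_irq_lock, IRQ_INPROGRESS, and the port's membership in IRQ_ports, and the names (port_linked, device_touched, run_reordered, etc.) are hypothetical. The port is unlinked while the lock is held, the lock is dropped, and only then does the free_irq() wait begin.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Userspace model only: these atomics stand in for kernel state. */
static atomic_bool global_irq_lock;       /* models global_irq_lock */
static atomic_bool irq_inprogress;        /* models IRQ_INPROGRESS */
static atomic_bool port_linked = true;    /* port is on the IRQ_ports list */
atomic_int device_touched;                /* ISR accesses to the port */
static atomic_bool cli_done, isr_started; /* force the reported race */

static long now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

/* Test-and-set spin lock with a timeout (the kernel has none). */
static bool spin_lock(atomic_bool *lock, long timeout_ms) {
    long deadline = now_ms() + timeout_ms;
    while (atomic_exchange(lock, true)) {
        if (now_ms() > deadline) return false;
        sched_yield();
    }
    return true;
}

/* Spin while *flag is set (as free_irq() does on IRQ_INPROGRESS). */
static bool spin_while_set(atomic_bool *flag, long timeout_ms) {
    long deadline = now_ms() + timeout_ms;
    while (atomic_load(flag)) {
        if (now_ms() > deadline) return false;
        sched_yield();
    }
    return true;
}

static void wait_for(atomic_bool *flag) {
    while (!atomic_load(flag)) sched_yield();
}

/* cpu1: the ISR only touches ports still linked on IRQ_ports. */
static void *cpu1_interrupt(void *arg) {
    (void)arg;
    wait_for(&cli_done);
    atomic_store(&irq_inprogress, true);        /* set before irq_enter() */
    atomic_store(&isr_started, true);
    if (spin_lock(&global_irq_lock, 1000)) {    /* irq_enter() */
        if (atomic_load(&port_linked))
            atomic_fetch_add(&device_touched, 1);
        atomic_store(&global_irq_lock, false);  /* irq_exit() */
        atomic_store(&irq_inprogress, false);
    }
    return NULL;
}

/* cpu0: shutdown() with free_irq() moved after restore_flags().
   Returns 0 if the free_irq() wait completed, -1 otherwise. */
static int cpu0_shutdown_reordered(void) {
    (void)spin_lock(&global_irq_lock, 1000);    /* save_flags(); cli(); */
    atomic_store(&cli_done, true);
    wait_for(&isr_started);                     /* same race window */
    atomic_store(&port_linked, false);          /* unlink from IRQ_ports */
    /* device interrupts would also be disabled here, under the lock */
    atomic_store(&global_irq_lock, false);      /* restore_flags() */
    /* free_irq(): cpu1 can now take the lock and clear IRQ_INPROGRESS */
    return spin_while_set(&irq_inprogress, 1000) ? 0 : -1;
}

/* Runs the interleaving once (state is not reset between calls). */
int run_reordered(void) {
    pthread_t cpu1;
    pthread_create(&cpu1, NULL, cpu1_interrupt, NULL);
    int rc = cpu0_shutdown_reordered();
    pthread_join(cpu1, NULL);
    return rc;
}
```

Here run_reordered() returns 0 and device_touched stays 0: cpu1 gets the lock only after the port has been unlinked, so the modeled handler skips the device, releases the lock, and clears IRQ_INPROGRESS, which lets the free_irq() wait finish. Build with -pthread.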

--
Paul Fulghum
Microgate Systems, Ltd.