2007-05-29 17:40:43

by Mark Hounschell

Subject: floppy.c soft lockup

Changes in floppy.c between 2.6.17 and 2.6.18 have broken an application I have. I have tracked
it down to a single line of code. When the following patch is applied to the version in 2.6.18,
my application works.

--- linux-2.6.18/drivers/block/floppy.c 2006-09-19 23:42:06.000000000 -0400
+++ linux-2.6.18-crt/drivers/block/floppy.c 2007-05-29 09:12:20.000000000 -0400
@@ -893,7 +893,6 @@
set_current_state(TASK_RUNNING);
remove_wait_queue(&fdc_wait, &wait);

- flush_scheduled_work();
}
command_status = FD_COMMAND_NONE;

I don't claim to understand the changes from 2.6.17 to 2.6.18 except for the devfs removal.
All I can say is this one line of code kills the application. I have tried to write a short program
that reproduces my problem, but everything else I write seems to work. The application only runs
on SMP machines and uses process and irq affinities with real-time scheduling. When I turn
off process and irq affinities the application runs.

I have tried kernels up through 2.6.21.1 with the same results. All kernels from 2.6.18 up
require that I remove this one line of code, or my application does not work.

Regards
Mark




2007-05-31 05:47:18

by Andrew Morton

Subject: Re: floppy.c soft lockup

On Tue, 29 May 2007 13:31:05 -0400 Mark Hounschell <[email protected]> wrote:

> Changes in floppy.c between 2.6.17 and 2.6.18 have broken an application I have. I have tracked
> it down to a single line of code. When the following patch is applied to the version in 2.6.18,
> my application works.
>
> --- linux-2.6.18/drivers/block/floppy.c 2006-09-19 23:42:06.000000000 -0400
> +++ linux-2.6.18-crt/drivers/block/floppy.c 2007-05-29 09:12:20.000000000 -0400
> @@ -893,7 +893,6 @@
> set_current_state(TASK_RUNNING);
> remove_wait_queue(&fdc_wait, &wait);
>
> - flush_scheduled_work();
> }
> command_status = FD_COMMAND_NONE;
>
> I don't claim to understand the changes from 2.6.17 to 2.6.18 except for the devfs removal.
> All I can say is this one line of code kills the application. I have tried to write a short program
> that reproduces my problem, but everything else I write seems to work. The application only runs
> on SMP machines and uses process and irq affinities with real-time scheduling. When I turn
> off process and irq affinities the application runs.
>
> I have tried kernels up through 2.6.21.1 with the same results. All kernels from 2.6.18 up
> require that I remove this one line of code, or my application does not work.

Interesting. I'd expect that the calling process is spinning with realtime
policy and expecting some other process to do something (i.e. run a workqueue).

If you keep the process and irq affinities, and disable the realtime policy
does that also prevent the problem?

It would be interesting if you could capture a few task traces while it is stuck:
echo 1 > /proc/sys/kernel/sysrq, then do ALT-SYSRQ-P a bunch of times and ALT-SYSRQ-T,
and see if you can work out where the CPU is stuck.
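
[Editorial note: the same dumps can be requested from a shell via /proc/sysrq-trigger instead of the console keyboard, which helps when the machine is remote. This is the standard kernel magic-SysRq interface, not anything specific to this bug; it must be run as root, and the output lands in the kernel log.]

```shell
# Enable the magic SysRq interface.
echo 1 > /proc/sys/kernel/sysrq
# 'p' dumps registers/backtrace for the current CPU; repeat it a few times.
echo p > /proc/sysrq-trigger
# 't' dumps the stack of every task on the system.
echo t > /proc/sysrq-trigger
# Inspect the captured traces.
dmesg | tail -n 200
```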

Also, 2.6.22-rc3 might have accidentally fixed this.

2007-05-31 14:28:38

by Mark Hounschell

Subject: Re: floppy.c soft lockup

May 31 10:13:11 harley kernel: =======================
May 31 10:13:11 harley kernel: vrsx S E6513DD8 0 12647 12288 (NOTLB)
May 31 10:13:11 harley kernel: e6513dec 00200082 00000002 e6513dd8 e6513dd4 00000000 00000000 00000000
May 31 10:13:11 harley kernel: 00000001 e8c0c030 dbde5030 edfda2ef 00000298 00000ef0 e8c0c150 c180aa80
May 31 10:13:11 harley kernel: 00000000 c199f480 e6513e14 c012fc27 00200202 e6513eac e6513e20 c199f4b4
May 31 10:13:11 harley kernel: Call Trace:
May 31 10:13:11 harley kernel: [<c012fc27>] unqueue_me+0x7c/0x84
May 31 10:13:11 harley kernel: [<c0130522>] futex_wait+0x1db/0x3bd
May 31 10:13:11 harley kernel: [<c014ee95>] find_extend_vma+0x12/0x49
May 31 10:13:11 harley kernel: [<c012fcb6>] get_futex_key+0x6e/0x126
May 31 10:13:11 harley kernel: [<c0130a00>] futex_wake+0xa9/0xb3
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0130a79>] do_futex+0x6f/0xf4a
May 31 10:13:11 harley kernel: [<c02a998c>] preempt_schedule+0x3d/0x59
May 31 10:13:11 harley kernel: [<c0115235>] migrate_task+0x46/0x74
May 31 10:13:11 harley kernel: [<c0115fd6>] set_cpus_allowed+0x70/0x8e
May 31 10:13:11 harley kernel: [<c0131a1c>] sys_futex+0xc8/0xda
May 31 10:13:11 harley kernel: [<c0103d3e>] sysenter_past_esp+0x5f/0x85
May 31 10:13:11 harley kernel: =======================
May 31 10:13:11 harley kernel: vrsx S CA655DD8 0 12648 12288 (NOTLB)
May 31 10:13:11 harley kernel: ca655dec 00200082 00000002 ca655dd8 ca655dd4 00000000 00000000 b6c97248
May 31 10:13:11 harley kernel: 00000001 dbde5030 dbde5a90 edfdb21a 00000298 00000f2b dbde5150 c180aa80
May 31 10:13:11 harley kernel: 00000000 c199f480 ca655e14 c012fc27 00200202 ca655eac ca655e20 c199f4b4
May 31 10:13:11 harley kernel: Call Trace:
May 31 10:13:11 harley kernel: [<c012fc27>] unqueue_me+0x7c/0x84
May 31 10:13:11 harley kernel: [<c0130522>] futex_wait+0x1db/0x3bd
May 31 10:13:11 harley kernel: [<c014ee95>] find_extend_vma+0x12/0x49
May 31 10:13:11 harley kernel: [<c012fcb6>] get_futex_key+0x6e/0x126
May 31 10:13:11 harley kernel: [<c0130a00>] futex_wake+0xa9/0xb3
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0130a79>] do_futex+0x6f/0xf4a
May 31 10:13:11 harley kernel: [<c02a998c>] preempt_schedule+0x3d/0x59
May 31 10:13:11 harley kernel: [<c0115235>] migrate_task+0x46/0x74
May 31 10:13:11 harley kernel: [<c0115fd6>] set_cpus_allowed+0x70/0x8e
May 31 10:13:11 harley kernel: [<c0131a1c>] sys_futex+0xc8/0xda
May 31 10:13:11 harley kernel: [<c0103d3e>] sysenter_past_esp+0x5f/0x85
May 31 10:13:11 harley kernel: =======================
May 31 10:13:11 harley kernel: vrsx S DED1BDD8 0 12649 12288 (NOTLB)
May 31 10:13:11 harley kernel: ded1bdec 00200082 00000002 ded1bdd8 ded1bdd4 00000000 00000060 00200206
May 31 10:13:11 harley kernel: 00000001 dbde5a90 d629aa90 edfdbf8c 00000298 00000d72 dbde5bb0 c180aa80
May 31 10:13:11 harley kernel: 00000000 c199f480 ded1be14 c012fc27 00200202 ded1beac ded1be20 c199f4b4
May 31 10:13:11 harley kernel: Call Trace:
May 31 10:13:11 harley kernel: [<c012fc27>] unqueue_me+0x7c/0x84
May 31 10:13:11 harley kernel: [<c0130522>] futex_wait+0x1db/0x3bd
May 31 10:13:11 harley kernel: [<c014ee95>] find_extend_vma+0x12/0x49
May 31 10:13:11 harley kernel: [<c012fcb6>] get_futex_key+0x6e/0x126
May 31 10:13:11 harley kernel: [<c0130a00>] futex_wake+0xa9/0xb3
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0130a79>] do_futex+0x6f/0xf4a
May 31 10:13:11 harley kernel: [<c02a998c>] preempt_schedule+0x3d/0x59
May 31 10:13:11 harley kernel: [<c0115235>] migrate_task+0x46/0x74
May 31 10:13:11 harley kernel: [<c0115fd6>] set_cpus_allowed+0x70/0x8e
May 31 10:13:11 harley kernel: [<c0131a1c>] sys_futex+0xc8/0xda
May 31 10:13:11 harley kernel: [<c0103d3e>] sysenter_past_esp+0x5f/0x85
May 31 10:13:11 harley kernel: =======================
May 31 10:13:11 harley kernel: vrsx S DC153DD8 0 12650 12288 (NOTLB)
May 31 10:13:11 harley kernel: dc153dec 00200082 00000002 dc153dd8 dc153dd4 00000000 00000060 00200206
May 31 10:13:11 harley kernel: 00000001 d629aa90 d629a030 edfdce5a 00000298 00000ece d629abb0 c180aa80
May 31 10:13:11 harley kernel: 00000000 c199f480 dc153e14 c012fc27 00200202 dc153eac dc153e20 c199f4b4
May 31 10:13:11 harley kernel: Call Trace:
May 31 10:13:11 harley kernel: [<c012fc27>] unqueue_me+0x7c/0x84
May 31 10:13:11 harley kernel: [<c0130522>] futex_wait+0x1db/0x3bd
May 31 10:13:11 harley kernel: [<c014ee95>] find_extend_vma+0x12/0x49
May 31 10:13:11 harley kernel: [<c012fcb6>] get_futex_key+0x6e/0x126
May 31 10:13:11 harley kernel: [<c0130a00>] futex_wake+0xa9/0xb3
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0130a79>] do_futex+0x6f/0xf4a
May 31 10:13:11 harley kernel: [<c02a998c>] preempt_schedule+0x3d/0x59
May 31 10:13:11 harley kernel: [<c0115235>] migrate_task+0x46/0x74
May 31 10:13:11 harley kernel: [<c0115fd6>] set_cpus_allowed+0x70/0x8e
May 31 10:13:11 harley kernel: [<c0131a1c>] sys_futex+0xc8/0xda
May 31 10:13:11 harley kernel: [<c0103d3e>] sysenter_past_esp+0x5f/0x85
May 31 10:13:11 harley kernel: =======================
May 31 10:13:11 harley kernel: vrsx S C8E1DDD8 0 12651 12288 (NOTLB)
May 31 10:13:11 harley kernel: c8e1ddec 00200082 00000002 c8e1ddd8 c8e1ddd4 00000000 c199f480 dfe47d34
May 31 10:13:11 harley kernel: 00000001 d629a030 f3bcba90 edfddd76 00000298 00000f1c d629a150 c180aa80
May 31 10:13:11 harley kernel: 00000000 c199f480 c8e1de14 c012fc27 00200202 c8e1deac c8e1de20 c199f4b4
May 31 10:13:11 harley kernel: Call Trace:
May 31 10:13:11 harley kernel: [<c012fc27>] unqueue_me+0x7c/0x84
May 31 10:13:11 harley kernel: [<c0130522>] futex_wait+0x1db/0x3bd
May 31 10:13:11 harley kernel: [<c014ee95>] find_extend_vma+0x12/0x49
May 31 10:13:11 harley kernel: [<c012fcb6>] get_futex_key+0x6e/0x126
May 31 10:13:11 harley kernel: [<c0130a00>] futex_wake+0xa9/0xb3
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0130a79>] do_futex+0x6f/0xf4a
May 31 10:13:11 harley kernel: [<c02a998c>] preempt_schedule+0x3d/0x59
May 31 10:13:11 harley kernel: [<c0115235>] migrate_task+0x46/0x74
May 31 10:13:11 harley kernel: [<c0115fd6>] set_cpus_allowed+0x70/0x8e
May 31 10:13:11 harley kernel: [<c0131a1c>] sys_futex+0xc8/0xda
May 31 10:13:11 harley kernel: [<c0103d3e>] sysenter_past_esp+0x5f/0x85
May 31 10:13:11 harley kernel: =======================
May 31 10:13:11 harley kernel: vrsx S C3F61DD8 0 12652 12288 (NOTLB)
May 31 10:13:11 harley kernel: c3f61dec 00200082 00000002 c3f61dd8 c3f61dd4 00000000 00000060 00200206
May 31 10:13:11 harley kernel: 00000001 f3bcba90 f75b8030 edfdebdd 00000298 00000e67 f3bcbbb0 c180aa80
May 31 10:13:11 harley kernel: 00000000 c199f480 c3f61e14 c012fc27 00200202 c3f61eac c3f61e20 c199f4b4
May 31 10:13:11 harley kernel: Call Trace:
May 31 10:13:11 harley kernel: [<c012fc27>] unqueue_me+0x7c/0x84
May 31 10:13:11 harley kernel: [<c0130522>] futex_wait+0x1db/0x3bd
May 31 10:13:11 harley kernel: [<c014ee95>] find_extend_vma+0x12/0x49
May 31 10:13:11 harley kernel: [<c012fcb6>] get_futex_key+0x6e/0x126
May 31 10:13:11 harley kernel: [<c0130a00>] futex_wake+0xa9/0xb3
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0130a79>] do_futex+0x6f/0xf4a
May 31 10:13:11 harley kernel: [<c02a998c>] preempt_schedule+0x3d/0x59
May 31 10:13:11 harley kernel: [<c0115235>] migrate_task+0x46/0x74
May 31 10:13:11 harley kernel: [<c0115fd6>] set_cpus_allowed+0x70/0x8e
May 31 10:13:11 harley kernel: [<c0131a1c>] sys_futex+0xc8/0xda
May 31 10:13:11 harley kernel: [<c0103d3e>] sysenter_past_esp+0x5f/0x85
May 31 10:13:11 harley kernel: =======================
May 31 10:13:11 harley kernel: vrsx S C4025DD8 0 12653 12288 (NOTLB)
May 31 10:13:11 harley kernel: c4025dec 00200082 00000002 c4025dd8 c4025dd4 00000000 f75b8030 0ddeea02
May 31 10:13:11 harley kernel: 00000001 f75b8030 f75b8a90 edfdf9d6 00000298 00000df9 f75b8150 c180aa80
May 31 10:13:11 harley kernel: 00000000 c199f480 c4025e14 c012fc27 00200202 c4025eac c4025e20 c199f4b4
May 31 10:13:11 harley kernel: Call Trace:
May 31 10:13:11 harley kernel: [<c012fc27>] unqueue_me+0x7c/0x84
May 31 10:13:11 harley kernel: [<c0130522>] futex_wait+0x1db/0x3bd
May 31 10:13:11 harley kernel: [<c014ee95>] find_extend_vma+0x12/0x49
May 31 10:13:11 harley kernel: [<c012fcb6>] get_futex_key+0x6e/0x126
May 31 10:13:11 harley kernel: [<c0130a00>] futex_wake+0xa9/0xb3
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0130a79>] do_futex+0x6f/0xf4a
May 31 10:13:11 harley kernel: [<c02a998c>] preempt_schedule+0x3d/0x59
May 31 10:13:11 harley kernel: [<c0115235>] migrate_task+0x46/0x74
May 31 10:13:11 harley kernel: [<c0115fd6>] set_cpus_allowed+0x70/0x8e
May 31 10:13:11 harley kernel: [<c0131a1c>] sys_futex+0xc8/0xda
May 31 10:13:11 harley kernel: [<c0103d3e>] sysenter_past_esp+0x5f/0x85
May 31 10:13:11 harley kernel: =======================
May 31 10:13:11 harley kernel: vrsx S F4F5DDD8 0 12654 12288 (NOTLB)
May 31 10:13:11 harley kernel: f4f5ddec 00200082 00000002 f4f5ddd8 f4f5ddd4 00000000 f75b8a90 0df68c4e
May 31 10:13:11 harley kernel: 00000001 f75b8a90 c4846a90 edfe094e 00000298 00000f78 f75b8bb0 c180aa80
May 31 10:13:11 harley kernel: 00000000 c199f480 f4f5de14 c012fc27 00200202 f4f5deac f4f5de20 c199f4b4
May 31 10:13:11 harley kernel: Call Trace:
May 31 10:13:11 harley kernel: [<c012fc27>] unqueue_me+0x7c/0x84
May 31 10:13:11 harley kernel: [<c0130522>] futex_wait+0x1db/0x3bd
May 31 10:13:11 harley kernel: [<c014ee95>] find_extend_vma+0x12/0x49
May 31 10:13:11 harley kernel: [<c012fcb6>] get_futex_key+0x6e/0x126
May 31 10:13:11 harley kernel: [<c0130a00>] futex_wake+0xa9/0xb3
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0130a79>] do_futex+0x6f/0xf4a
May 31 10:13:11 harley kernel: [<c02a998c>] preempt_schedule+0x3d/0x59
May 31 10:13:11 harley kernel: [<c0115235>] migrate_task+0x46/0x74
May 31 10:13:11 harley kernel: [<c0115fd6>] set_cpus_allowed+0x70/0x8e
May 31 10:13:11 harley kernel: [<c0131a1c>] sys_futex+0xc8/0xda
May 31 10:13:11 harley kernel: [<c0103d3e>] sysenter_past_esp+0x5f/0x85
May 31 10:13:11 harley kernel: =======================
May 31 10:13:11 harley kernel: vrsx S F1AD1DD8 0 12655 12288 (NOTLB)
May 31 10:13:11 harley kernel: f1ad1dec 00200082 00000002 f1ad1dd8 f1ad1dd4 00000000 c4846a90 0e11dd88
May 31 10:13:11 harley kernel: 00000001 c4846a90 dc68aa90 edfe16d9 00000298 00000d8b c4846bb0 c180aa80
May 31 10:13:11 harley kernel: 00000000 c199f480 f1ad1e14 c012fc27 00200202 f1ad1eac f1ad1e20 c199f4b4
May 31 10:13:11 harley kernel: Call Trace:
May 31 10:13:11 harley kernel: [<c012fc27>] unqueue_me+0x7c/0x84
May 31 10:13:11 harley kernel: [<c0130522>] futex_wait+0x1db/0x3bd
May 31 10:13:11 harley kernel: [<c014ee95>] find_extend_vma+0x12/0x49
May 31 10:13:11 harley kernel: [<c012fcb6>] get_futex_key+0x6e/0x126
May 31 10:13:11 harley kernel: [<c0130a00>] futex_wake+0xa9/0xb3
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0130a79>] do_futex+0x6f/0xf4a
May 31 10:13:11 harley kernel: [<c02a998c>] preempt_schedule+0x3d/0x59
May 31 10:13:11 harley kernel: [<c0115235>] migrate_task+0x46/0x74
May 31 10:13:11 harley kernel: [<c0115fd6>] set_cpus_allowed+0x70/0x8e
May 31 10:13:11 harley kernel: [<c0131a1c>] sys_futex+0xc8/0xda
May 31 10:13:11 harley kernel: [<c0103d3e>] sysenter_past_esp+0x5f/0x85
May 31 10:13:11 harley kernel: =======================
May 31 10:13:11 harley kernel: vrsx S F0A07DD8 0 12656 12288 (NOTLB)
May 31 10:13:11 harley kernel: f0a07dec 00200082 00000002 f0a07dd8 f0a07dd4 00000000 dc68aa90 0e2c331d
May 31 10:13:11 harley kernel: 00000001 dc68aa90 dc68a560 edfe254f 00000298 00000e76 dc68abb0 c180aa80
May 31 10:13:11 harley kernel: 00000000 c199f480 f0a07e14 c012fc27 00200202 f0a07eac f0a07e20 c199f4b4
May 31 10:13:11 harley kernel: Call Trace:
May 31 10:13:11 harley kernel: [<c012fc27>] unqueue_me+0x7c/0x84
May 31 10:13:11 harley kernel: [<c0130522>] futex_wait+0x1db/0x3bd
May 31 10:13:11 harley kernel: [<c014ee95>] find_extend_vma+0x12/0x49
May 31 10:13:11 harley kernel: [<c012fcb6>] get_futex_key+0x6e/0x126
May 31 10:13:11 harley kernel: [<c0130a00>] futex_wake+0xa9/0xb3
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0130a79>] do_futex+0x6f/0xf4a
May 31 10:13:11 harley kernel: [<c02a998c>] preempt_schedule+0x3d/0x59
May 31 10:13:11 harley kernel: [<c0115235>] migrate_task+0x46/0x74
May 31 10:13:11 harley kernel: [<c0115fd6>] set_cpus_allowed+0x70/0x8e
May 31 10:13:11 harley kernel: [<c0131a1c>] sys_futex+0xc8/0xda
May 31 10:13:11 harley kernel: [<c0103d3e>] sysenter_past_esp+0x5f/0x85
May 31 10:13:11 harley kernel: =======================
May 31 10:13:11 harley kernel: vrsx S F0293DD8 0 12657 12288 (NOTLB)
May 31 10:13:11 harley kernel: f0293dec 00200082 00000002 f0293dd8 f0293dd4 00000000 00000060 00200287
May 31 10:13:11 harley kernel: 00000001 dc68a560 f0f04a90 edfe32a3 00000298 00000d54 dc68a680 c180aa80
May 31 10:13:11 harley kernel: 00000000 c199f480 f0293e14 c012fc27 00200202 f0293eac f0293e20 c199f4b4
May 31 10:13:11 harley kernel: Call Trace:
May 31 10:13:11 harley kernel: [<c012fc27>] unqueue_me+0x7c/0x84
May 31 10:13:11 harley kernel: [<c0130522>] futex_wait+0x1db/0x3bd
May 31 10:13:11 harley kernel: [<c014ee95>] find_extend_vma+0x12/0x49
May 31 10:13:11 harley kernel: [<c012fcb6>] get_futex_key+0x6e/0x126
May 31 10:13:11 harley kernel: [<c0130a00>] futex_wake+0xa9/0xb3
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0130a79>] do_futex+0x6f/0xf4a
May 31 10:13:11 harley kernel: [<c02a998c>] preempt_schedule+0x3d/0x59
May 31 10:13:11 harley kernel: [<c0115235>] migrate_task+0x46/0x74
May 31 10:13:11 harley kernel: [<c0115fd6>] set_cpus_allowed+0x70/0x8e
May 31 10:13:11 harley kernel: [<c0131a1c>] sys_futex+0xc8/0xda
May 31 10:13:11 harley kernel: [<c0103d3e>] sysenter_past_esp+0x5f/0x85
May 31 10:13:11 harley kernel: =======================
May 31 10:13:11 harley kernel: vrsx S E5C0FDD8 0 12658 12288 (NOTLB)
May 31 10:13:11 harley kernel: e5c0fdec 00200082 00000002 e5c0fdd8 e5c0fdd4 00000000 f0f04a90 0e5de043
May 31 10:13:11 harley kernel: 00000001 f0f04a90 dabeb030 edfe3fbd 00000298 00000d1a f0f04bb0 c180aa80
May 31 10:13:11 harley kernel: 00000000 c199f480 e5c0fe14 c012fc27 00200202 e5c0feac e5c0fe20 c199f4b4
May 31 10:13:11 harley kernel: Call Trace:
May 31 10:13:11 harley kernel: [<c012fc27>] unqueue_me+0x7c/0x84
May 31 10:13:11 harley kernel: [<c0130522>] futex_wait+0x1db/0x3bd
May 31 10:13:11 harley kernel: [<c014ee95>] find_extend_vma+0x12/0x49
May 31 10:13:11 harley kernel: [<c012fcb6>] get_futex_key+0x6e/0x126
May 31 10:13:11 harley kernel: [<c0130a00>] futex_wake+0xa9/0xb3
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0130a79>] do_futex+0x6f/0xf4a
May 31 10:13:11 harley kernel: [<c02a998c>] preempt_schedule+0x3d/0x59
May 31 10:13:11 harley kernel: [<c0115235>] migrate_task+0x46/0x74
May 31 10:13:11 harley kernel: [<c0115fd6>] set_cpus_allowed+0x70/0x8e
May 31 10:13:11 harley kernel: [<c0131a1c>] sys_futex+0xc8/0xda
May 31 10:13:11 harley kernel: [<c0103d3e>] sysenter_past_esp+0x5f/0x85
May 31 10:13:11 harley kernel: =======================
May 31 10:13:11 harley kernel: vrsx S F2FC3DD8 0 12659 12288 (NOTLB)
May 31 10:13:11 harley kernel: f2fc3dec 00200082 00000002 f2fc3dd8 f2fc3dd4 00000000 dabeb030 0e76131e
May 31 10:13:11 harley kernel: 00000001 dabeb030 dabeb560 edfe4d8c 00000298 00000dcf dabeb150 c180aa80
May 31 10:13:11 harley kernel: 00000000 c199f480 f2fc3e14 c012fc27 00200202 f2fc3eac f2fc3e20 c199f4b4
May 31 10:13:11 harley kernel: Call Trace:
May 31 10:13:11 harley kernel: [<c012fc27>] unqueue_me+0x7c/0x84
May 31 10:13:11 harley kernel: [<c0130522>] futex_wait+0x1db/0x3bd
May 31 10:13:11 harley kernel: [<c014ee95>] find_extend_vma+0x12/0x49
May 31 10:13:11 harley kernel: [<c012fcb6>] get_futex_key+0x6e/0x126
May 31 10:13:11 harley kernel: [<c0130a00>] futex_wake+0xa9/0xb3
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0130a79>] do_futex+0x6f/0xf4a
May 31 10:13:11 harley kernel: [<c02a998c>] preempt_schedule+0x3d/0x59
May 31 10:13:11 harley kernel: [<c0115235>] migrate_task+0x46/0x74
May 31 10:13:11 harley kernel: [<c0115fd6>] set_cpus_allowed+0x70/0x8e
May 31 10:13:11 harley kernel: [<c0131a1c>] sys_futex+0xc8/0xda
May 31 10:13:11 harley kernel: [<c0103d3e>] sysenter_past_esp+0x5f/0x85
May 31 10:13:11 harley kernel: =======================
May 31 10:13:11 harley kernel: vrsx S E353BDD8 0 12660 12288 (NOTLB)
May 31 10:13:11 harley kernel: e353bdec 00200082 00000002 e353bdd8 e353bdd4 00000000 dabeb560 0e8d67f3
May 31 10:13:11 harley kernel: 00000001 dabeb560 c19f3560 edfe5a65 00000298 00000cd9 dabeb680 c180aa80
May 31 10:13:11 harley kernel: 00000000 c199f480 e353be14 c012fc27 00200202 e353beac e353be20 c199f4b4
May 31 10:13:11 harley kernel: Call Trace:
May 31 10:13:11 harley kernel: [<c012fc27>] unqueue_me+0x7c/0x84
May 31 10:13:11 harley kernel: [<c0130522>] futex_wait+0x1db/0x3bd
May 31 10:13:11 harley kernel: [<c014ee95>] find_extend_vma+0x12/0x49
May 31 10:13:11 harley kernel: [<c012fcb6>] get_futex_key+0x6e/0x126
May 31 10:13:11 harley kernel: [<c0130a00>] futex_wake+0xa9/0xb3
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0130a79>] do_futex+0x6f/0xf4a
May 31 10:13:11 harley kernel: [<c02a998c>] preempt_schedule+0x3d/0x59
May 31 10:13:11 harley kernel: [<c0115235>] migrate_task+0x46/0x74
May 31 10:13:11 harley kernel: [<c0115fd6>] set_cpus_allowed+0x70/0x8e
May 31 10:13:11 harley kernel: [<c0131a1c>] sys_futex+0xc8/0xda
May 31 10:13:11 harley kernel: [<c0103d3e>] sysenter_past_esp+0x5f/0x85
May 31 10:13:11 harley kernel: =======================
May 31 10:13:11 harley kernel: vrsx S E5063DD8 0 12661 12288 (NOTLB)
May 31 10:13:11 harley kernel: e5063dec 00200082 00000002 e5063dd8 e5063dd4 00000000 00000000 00000000
May 31 10:13:11 harley kernel: 00000001 e77e8560 c032e360 0eac2c38 00000297 00002691 e77e8680 c180aa80
May 31 10:13:11 harley kernel: 00000000 c199f480 00097b02 ffffffff 00000000 00000000 e5063e20 c199f4b4
May 31 10:13:11 harley kernel: Call Trace:
May 31 10:13:11 harley kernel: [<c0130522>] futex_wait+0x1db/0x3bd
May 31 10:13:11 harley kernel: [<c02a97ae>] __sched_text_start+0x73e/0x7f0
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0130a79>] do_futex+0x6f/0xf4a
May 31 10:13:11 harley kernel: [<c02a998c>] preempt_schedule+0x3d/0x59
May 31 10:13:11 harley kernel: [<c0115235>] migrate_task+0x46/0x74
May 31 10:13:11 harley kernel: [<c0115fd6>] set_cpus_allowed+0x70/0x8e
May 31 10:13:11 harley kernel: [<c0131a1c>] sys_futex+0xc8/0xda
May 31 10:13:11 harley kernel: [<c0103d3e>] sysenter_past_esp+0x5f/0x85
May 31 10:13:11 harley kernel: [<c02a0000>] clone_policy+0x42/0xe5
May 31 10:13:11 harley kernel: =======================
May 31 10:13:11 harley kernel: vrsx S E6E5DB50 0 12662 12288 (NOTLB)
May 31 10:13:11 harley kernel: e6e5db64 00200082 00000002 e6e5db50 e6e5db4c 00000000 00000000 00000000
May 31 10:13:11 harley kernel: 00000001 e77e8a90 c032e360 fba335be 00000312 00003057 e77e8bb0 c180aa80
May 31 10:13:11 harley kernel: 00000000 c199f480 000b83bb fffffffd 00000000 00000000 e6e5db74 000b8401
May 31 10:13:11 harley kernel: Call Trace:
May 31 10:13:11 harley kernel: [<c02aa015>] schedule_timeout+0x70/0x8d
May 31 10:13:11 harley kernel: [<c01211af>] process_timeout+0x0/0x5
May 31 10:13:11 harley kernel: [<c0167e1b>] do_select+0x393/0x3e0
May 31 10:13:11 harley kernel: [<c01683a8>] __pollwait+0x0/0xac
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0113ba9>] __activate_task+0x1c/0x29
May 31 10:13:11 harley kernel: [<c0115b1e>] try_to_wake_up+0x399/0x3a3
May 31 10:13:11 harley kernel: [<c01299c1>] autoremove_wake_function+0x15/0x35
May 31 10:13:11 harley kernel: [<c011377e>] __wake_up_common+0x32/0x55
May 31 10:13:11 harley kernel: [<c0113cca>] __wake_up+0x32/0x43
May 31 10:13:11 harley kernel: [<c0126ce0>] insert_work+0x61/0x65
May 31 10:13:11 harley kernel: [<c02a97ae>] __sched_text_start+0x73e/0x7f0
May 31 10:13:11 harley kernel: [<c02a97d5>] __sched_text_start+0x765/0x7f0
May 31 10:13:11 harley kernel: [<c0115b1e>] try_to_wake_up+0x399/0x3a3
May 31 10:13:11 harley kernel: [<c012bddd>] lock_hrtimer_base+0x15/0x2f
May 31 10:13:11 harley kernel: [<c016cefb>] iput+0x39/0x62
May 31 10:13:11 harley kernel: [<c012fc27>] unqueue_me+0x7c/0x84
May 31 10:13:11 harley kernel: [<c013067c>] futex_wait+0x335/0x3bd
May 31 10:13:11 harley kernel: [<c0113ba9>] __activate_task+0x1c/0x29
May 31 10:13:11 harley kernel: [<c0115b1e>] try_to_wake_up+0x399/0x3a3
May 31 10:13:11 harley kernel: [<c0168105>] core_sys_select+0x29d/0x2ba
May 31 10:13:11 harley kernel: [<c011377e>] __wake_up_common+0x32/0x55
May 31 10:13:11 harley kernel: [<c0113cca>] __wake_up+0x32/0x43
May 31 10:13:11 harley kernel: [<c012f678>] wake_futex+0x3b/0x45
May 31 10:13:11 harley kernel: [<c01318eb>] do_futex+0xee1/0xf4a
May 31 10:13:11 harley kernel: [<c02a9817>] __sched_text_start+0x7a7/0x7f0
May 31 10:13:11 harley kernel: [<c0113b01>] set_load_weight+0x4a/0x8e
May 31 10:13:11 harley kernel: [<c01684f8>] sys_select+0xa4/0x187
May 31 10:13:11 harley kernel: [<c0103d3e>] sysenter_past_esp+0x5f/0x85
May 31 10:13:11 harley kernel: =======================
May 31 10:13:11 harley kernel: vi S DB833B50 0 12663 12276 (NOTLB)
May 31 10:13:11 harley kernel: db833b64 00200082 00000002 db833b50 db833b4c 00000000 00000000 f706b2e0
May 31 10:13:11 harley kernel: 00000006 e8c0ca90 c032e360 895791b6 000002ad 00005cff e8c0cbb0 c180aa80
May 31 10:13:11 harley kernel: 00000000 dfce7c40 0009d902 fffffffd 00000000 00000000 7fffffff d2aa0140
May 31 10:13:11 harley kernel: Call Trace:
May 31 10:13:11 harley kernel: [<c02a9fb8>] schedule_timeout+0x13/0x8d
May 31 10:13:11 harley kernel: [<c020942b>] tty_poll+0x56/0x63
May 31 10:13:11 harley kernel: [<c0167e1b>] do_select+0x393/0x3e0
May 31 10:13:11 harley kernel: [<c01683a8>] __pollwait+0x0/0xac
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<c0115b28>] default_wake_function+0x0/0xc
May 31 10:13:11 harley kernel: [<f8901d87>] do_get_write_access+0x58d/0x5ba [jbd]
May 31 10:13:11 harley kernel: [<f8917ab0>] __ext3_get_inode_loc+0x10f/0x2dd [ext3]
May 31 10:13:11 harley kernel: [<f8923196>] __ext3_journal_dirty_metadata+0x16/0x3a [ext3]
May 31 10:13:11 harley kernel: [<f8901dd5>] journal_get_write_access+0x21/0x26 [jbd]
May 31 10:13:11 harley kernel: [<f891793b>] ext3_mark_iloc_dirty+0x27f/0x2e5 [ext3]
May 31 10:13:11 harley kernel: [<f8917d0c>] ext3_mark_inode_dirty+0x20/0x27 [ext3]
May 31 10:13:11 harley kernel: [<c0113cca>] __wake_up+0x32/0x43
May 31 10:13:11 harley kernel: [<f89017ee>] journal_stop+0x294/0x2a0 [jbd]
May 31 10:13:11 harley kernel: [<f891ef33>] __ext3_journal_stop+0x19/0x34 [ext3]
May 31 10:13:11 harley kernel: [<f8919c14>] ext3_ordered_commit_write+0xba/0xd8 [ext3]
May 31 10:13:11 harley kernel: [<c01477df>] __pagevec_lru_add+0x91/0x9c
May 31 10:13:11 harley kernel: [<c0113d42>] task_rq_lock+0x31/0x58
May 31 10:13:11 harley kernel: [<c0115b1e>] try_to_wake_up+0x399/0x3a3
May 31 10:13:11 harley kernel: [<c02a97ae>] __sched_text_start+0x73e/0x7f0
May 31 10:13:11 harley kernel: [<c02a97d5>] __sched_text_start+0x765/0x7f0
May 31 10:13:11 harley kernel: [<c011377e>] __wake_up_common+0x32/0x55
May 31 10:13:11 harley kernel: [<c0113cca>] __wake_up+0x32/0x43
May 31 10:13:11 harley kernel: [<c0168105>] core_sys_select+0x29d/0x2ba
May 31 10:13:11 harley kernel: [<c0113ba9>] __activate_task+0x1c/0x29
May 31 10:13:11 harley kernel: [<c0115b1e>] try_to_wake_up+0x399/0x3a3
May 31 10:13:11 harley kernel: [<c011377e>] __wake_up_common+0x32/0x55
May 31 10:13:11 harley kernel: [<c0113cca>] __wake_up+0x32/0x43
May 31 10:13:11 harley kernel: [<c02085ac>] tty_wakeup+0x4c/0x50
May 31 10:13:11 harley kernel: [<c0113cca>] __wake_up+0x32/0x43
May 31 10:13:11 harley kernel: [<c0208509>] tty_ldisc_deref+0x55/0x64
May 31 10:13:11 harley kernel: [<c016852a>] sys_select+0xd6/0x187
May 31 10:13:11 harley kernel: [<c0103d3e>] sysenter_past_esp+0x5f/0x85
May 31 10:13:11 harley kernel: [<c02a0000>] clone_policy+0x42/0xe5
May 31 10:13:11 harley kernel: =======================


Attachments:
sysrq.txt (25.17 kB)

2007-05-31 17:06:21

by Oleg Nesterov

[permalink] [raw]
Subject: Re: floppy.c soft lockup

On 05/31, Mark Hounschell wrote:
>
> Andrew Morton wrote:
> > On Tue, 29 May 2007 13:31:05 -0400 Mark Hounschell <[email protected]> wrote:
> >
> >> Changes in floppy.c between 2.6.17 and 2.6.18 have broken an application I have. I have tracked
> >> it down to a single line of code. When the following patch is applied to the version in 2.6.18,
> >> my application works.
> >>
> >> --- linux-2.6.18/drivers/block/floppy.c 2006-09-19 23:42:06.000000000 -0400
> >> +++ linux-2.6.18-crt/drivers/block/floppy.c 2007-05-29 09:12:20.000000000 -0400
> >> @@ -893,7 +893,6 @@
> >> set_current_state(TASK_RUNNING);
> >> remove_wait_queue(&fdc_wait, &wait);
> >>
> >> - flush_scheduled_work();
> >> }
> >> command_status = FD_COMMAND_NONE;
> >>
> > Interesting. I'd expect that the calling process is spinning with realtime
> > policy and expecting some other process to do something (i.e. run a workqueue).
> >
> > If you keep the process and irq affinities, and disable the realtime policy
> > does that also prevent the problem?
> >
>
> Yes it does.
>
> > It would be interesting if you could capture a few task traces while it is stuck:
> > echo 1 > /proc/sys/kernel/sysrq, then do ALT-SYSRQ-P a bunch of times and ALT-SYSRQ-T,
> > and see if you can work out where the CPU is stuck.
> >
>
> I've attached the syslog output as a result of doing the above. I can't really make any kind of
> determination from it myself, as I don't really know what I'm looking at.

Could you show the full output? There are no events/* threads or processes doing ioctl()
in the sysrq.txt you attached.

> > Also, 2.6.22-rc3 might have accidentally fixed this.
> >
>
> No. Same thing there. The attached traces were taken with 2.6.22-rc3.
>
> Basically the main RT-process (which is a CPU bound process on processor-2) signals a
> thread to do some I/O. That RT-thread (running on the other processor) does a simple

If the main RT-process monopolizes processor-2, flush_workqueue() (or cancel_work_sync())
can of course hang; there is nothing we can do about that.

> ioctl(Q->DevSpec1, FDSETPRM, &medprm)
>
> and there is no return from the call. That thread is hung.

What happens if you kill the main RT-process?

Could you try the patch below? Just to see if it makes any difference.

Oleg.

(against 2.6.22-rcX)

--- OLD/drivers/block/floppy.c~ 2007-04-03 13:04:58.000000000 +0400
+++ OLD/drivers/block/floppy.c 2007-05-31 20:50:18.000000000 +0400
@@ -862,6 +862,8 @@ static void set_fdc(int drive)
FDCS->reset = 1;
}

+static DECLARE_WORK(floppy_work, NULL);
+
/* locks the driver */
static int _lock_fdc(int drive, int interruptible, int line)
{
@@ -893,7 +895,7 @@ static int _lock_fdc(int drive, int inte
set_current_state(TASK_RUNNING);
remove_wait_queue(&fdc_wait, &wait);

- flush_scheduled_work();
+ cancel_work_sync(&floppy_work);
}
command_status = FD_COMMAND_NONE;

@@ -992,8 +994,6 @@ static void empty(void)
{
}

-static DECLARE_WORK(floppy_work, NULL);
-
static void schedule_bh(void (*handler) (void))
{
PREPARE_WORK(&floppy_work, (work_func_t)handler);

2007-05-31 18:00:57

by Mark Hounschell

Subject: Re: floppy.c soft lockup

Oleg Nesterov wrote:
> On 05/31, Mark Hounschell wrote:
>> Andrew Morton wrote:
>>> On Tue, 29 May 2007 13:31:05 -0400 Mark Hounschell <[email protected]> wrote:
>>>
>>>> Changes in floppy.c between 2.6.17 and 2.6.18 have broken an application I have. I have tracked
>>>> it down to a single line of code. When the following patch is applied to the version in 2.6.18,
>>>> my application works.
>>>>
>>>> --- linux-2.6.18/drivers/block/floppy.c 2006-09-19 23:42:06.000000000 -0400
>>>> +++ linux-2.6.18-crt/drivers/block/floppy.c 2007-05-29 09:12:20.000000000 -0400
>>>> @@ -893,7 +893,6 @@
>>>> set_current_state(TASK_RUNNING);
>>>> remove_wait_queue(&fdc_wait, &wait);
>>>>
>>>> - flush_scheduled_work();
>>>> }
>>>> command_status = FD_COMMAND_NONE;
>>>>
>>> Interesting. I'd expect that the calling process is spinning, with realtime
>>> policy and is expecting some other process to do something (ie: run a workqueue).
>>>
>>> If you keep the process and irq affinities, and disable the realtime policy
>>> does that also prevent the problem?
>>>
>> Yes it does.
>>
>>> It would be interesting it you could capture a few task traces while it is stuck:
>>> echo 1 > /proc/sys/kernel/sysrq then do ALT-SYSRQ-P a bunch of times and ALT-SYSRQ-T,
>>> see if you can work out where the CPU is stuck.
>>>
>> I've attached the syslog output as a result of doing the above. I can't really make any kind of
>> determination from it myself as I don't really knowing what I'm looking at.
>
> Could you show the full output? There are no events/* or process doing ioctl()
> in sysrq.txt you attached.
>
>>> ALso, 2.6.22-rc3 might have accidentally fixed this.
>>>
>> No. Same thing there. The traces attached are using 2.6.22-rc3.
>>
>> Basically the main RT-process (which is a CPU bound process on processor-2) signals a
>> thread to do some I/O. That RT-thread (running on the other processor) does a simple
>
> If the main RT-process monopolizes processor-2, flush_workqueue() (or cancel_work_sync())
> can hang of course, we can do nothing.
>
>> ioctl(Q->DevSpec1, FDSETPRM, &medprm)
>>
>> and there is no return from the call. That thread is hung.
>
> What happens if you kill the main RT-process?
>
> Could you try the patch below? Just to see if it makes any difference.
>
> Oleg.
>
> (against 2.6.22-rcX)
>
> --- OLD/drivers/block/floppy.c~ 2007-04-03 13:04:58.000000000 +0400
> +++ OLD/drivers/block/floppy.c 2007-05-31 20:50:18.000000000 +0400
> @@ -862,6 +862,8 @@ static void set_fdc(int drive)
> FDCS->reset = 1;
> }
>
> +static DECLARE_WORK(floppy_work, NULL);
> +
> /* locks the driver */
> static int _lock_fdc(int drive, int interruptible, int line)
> {
> @@ -893,7 +895,7 @@ static int _lock_fdc(int drive, int inte
> set_current_state(TASK_RUNNING);
> remove_wait_queue(&fdc_wait, &wait);
>
> - flush_scheduled_work();
> + cancel_work_sync(&floppy_work);
> }
> command_status = FD_COMMAND_NONE;
>
> @@ -992,8 +994,6 @@ static void empty(void)
> {
> }
>
> -static DECLARE_WORK(floppy_work, NULL);
> -
> static void schedule_bh(void (*handler) (void))
> {
> PREPARE_WORK(&floppy_work, (work_func_t)handler);
>

The patch does make it work. Would you like me to try again to get a trace with
something meaningful in it?

Regards
Mark

2007-05-31 18:44:46

by Mark Hounschell

Subject: Re: floppy.c soft lockup

Oleg Nesterov wrote:
> On 05/31, Mark Hounschell wrote:
>> Andrew Morton wrote:
>>> On Tue, 29 May 2007 13:31:05 -0400 Mark Hounschell <[email protected]> wrote:
>>>
>>>> Changes in floppy.c from 2.6.17 and 2.6.18 have broken an application I have. I have tracked
>>>> it down to a single line of code. When the following patch is applied to the version in 2.6.18
>>>> my application works.
>>>>
>>>> --- linux-2.6.18/drivers/block/floppy.c 2006-09-19 23:42:06.000000000 -0400
>>>> +++ linux-2.6.18-crt/drivers/block/floppy.c 2007-05-29 09:12:20.000000000 -0400
>>>> @@ -893,7 +893,6 @@
>>>> set_current_state(TASK_RUNNING);
>>>> remove_wait_queue(&fdc_wait, &wait);
>>>>
>>>> - flush_scheduled_work();
>>>> }
>>>> command_status = FD_COMMAND_NONE;
>>>>
>>> Interesting. I'd expect that the calling process is spinning, with realtime
>>> policy and is expecting some other process to do something (ie: run a workqueue).
>>>
>>> If you keep the process and irq affinities, and disable the realtime policy
>>> does that also prevent the problem?
>>>
>> Yes it does.
>>
>>> It would be interesting it you could capture a few task traces while it is stuck:
>>> echo 1 > /proc/sys/kernel/sysrq then do ALT-SYSRQ-P a bunch of times and ALT-SYSRQ-T,
>>> see if you can work out where the CPU is stuck.
>>>
>> I've attached the syslog output as a result of doing the above. I can't really make any kind of
>> determination from it myself as I don't really knowing what I'm looking at.
>
> Could you show the full output? There are no events/* or process doing ioctl()
> in sysrq.txt you attached.
>
>>> ALso, 2.6.22-rc3 might have accidentally fixed this.
>>>
>> No. Same thing there. The traces attached are using 2.6.22-rc3.
>>
>> Basically the main RT-process (which is a CPU bound process on processor-2) signals a
>> thread to do some I/O. That RT-thread (running on the other processor) does a simple
>
> If the main RT-process monopolizes processor-2, flush_workqueue() (or cancel_work_sync())
> can hang of course, we can do nothing.
>
>> ioctl(Q->DevSpec1, FDSETPRM, &medprm)
>>
>> and there is no return from the call. That thread is hung.
>
> What happens if you kill the main RT-process?
>

When I kill the main process all its threads also go away, including the floppy thread.
Nothing notable happens with this kernel. On previous kernels (2.6.18) I would get a dump
from the floppy driver in the syslog when I killed the process.

> Could you try the patch below? Just to see if it makes any difference.
>
> Oleg.
>
> (against 2.6.22-rcX)
>
> --- OLD/drivers/block/floppy.c~ 2007-04-03 13:04:58.000000000 +0400
> +++ OLD/drivers/block/floppy.c 2007-05-31 20:50:18.000000000 +0400
> @@ -862,6 +862,8 @@ static void set_fdc(int drive)
> FDCS->reset = 1;
> }
>
> +static DECLARE_WORK(floppy_work, NULL);
> +
> /* locks the driver */
> static int _lock_fdc(int drive, int interruptible, int line)
> {
> @@ -893,7 +895,7 @@ static int _lock_fdc(int drive, int inte
> set_current_state(TASK_RUNNING);
> remove_wait_queue(&fdc_wait, &wait);
>
> - flush_scheduled_work();
> + cancel_work_sync(&floppy_work);
> }
> command_status = FD_COMMAND_NONE;
>
> @@ -992,8 +994,6 @@ static void empty(void)
> {
> }
>
> -static DECLARE_WORK(floppy_work, NULL);
> -
> static void schedule_bh(void (*handler) (void))
> {
> PREPARE_WORK(&floppy_work, (work_func_t)handler);
>

The patch does make it work.

Regards
Mark

2007-05-31 19:22:58

by Oleg Nesterov

Subject: Re: floppy.c soft lockup

On 05/31, Mark Hounschell wrote:
>
> Oleg Nesterov wrote:
> > On 05/31, Mark Hounschell wrote:
> >>
> >> Basically the main RT-process (which is a CPU bound process on processor-2) signals a
> >> thread to do some I/O. That RT-thread (running on the other processor) does a simple
> >
> > If the main RT-process monopolizes processor-2, flush_workqueue() (or cancel_work_sync())
> > can hang of course, we can do nothing.
> >
> >> ioctl(Q->DevSpec1, FDSETPRM, &medprm)
> >>
> >> and there is no return from the call. That thread is hung.
> >
> > What happens if you kill the main RT-process?
> >
>
> When I kill the main process all its threads also go away. Including the floppy thread.
> Nothing notable happens with this kernel.

Aha, I missed the word "thread"; this is a single process.

Still, this means that flush_workqueue() completes when other sub-threads go away,
otherwise the thread doing ioctl() couldn't exit.

Thank you very much.

So, the main question is: is it possible that one of the RT processes/threads pins itself
to some CPU and eats 100% of its CPU power?

> On previous (2.6.18) I would get a dump
> from the floppy driver in the syslog when I killed the process.

Could you send me this output? just in case...

> > --- OLD/drivers/block/floppy.c~ 2007-04-03 13:04:58.000000000 +0400
> > +++ OLD/drivers/block/floppy.c 2007-05-31 20:50:18.000000000 +0400
> > @@ -862,6 +862,8 @@ static void set_fdc(int drive)
> > FDCS->reset = 1;
> > }
> >
> > +static DECLARE_WORK(floppy_work, NULL);
> > +
> > /* locks the driver */
> > static int _lock_fdc(int drive, int interruptible, int line)
> > {
> > @@ -893,7 +895,7 @@ static int _lock_fdc(int drive, int inte
> > set_current_state(TASK_RUNNING);
> > remove_wait_queue(&fdc_wait, &wait);
> >
> > - flush_scheduled_work();
> > + cancel_work_sync(&floppy_work);
> > }
> > command_status = FD_COMMAND_NONE;
> >
> > @@ -992,8 +994,6 @@ static void empty(void)
> > {
> > }
> >
> > -static DECLARE_WORK(floppy_work, NULL);
> > -
> > static void schedule_bh(void (*handler) (void))
> > {
> > PREPARE_WORK(&floppy_work, (work_func_t)handler);
> >
>
> The patch does make it work.

I do not understand floppy.c, absolutely, so I am not sure this patch is correct.

Even if correct, this patch doesn't solve this problem (if we really understand
what's going on). cancel_work_sync() may still hang if floppy_work->func() runs
on the starved CPU. This is unlikely, but possible.

Thanks!

Oleg.

2007-05-31 20:18:29

by Mark Hounschell

Subject: Re: floppy.c soft lockup

Oleg Nesterov wrote:
> On 05/31, Mark Hounschell wrote:
>> Oleg Nesterov wrote:
>>> On 05/31, Mark Hounschell wrote:
>>>> Basically the main RT-process (which is a CPU bound process on processor-2) signals a
>>>> thread to do some I/O. That RT-thread (running on the other processor) does a simple
>>> If the main RT-process monopolizes processor-2, flush_workqueue() (or cancel_work_sync())
>>> can hang of course, we can do nothing.
>>>
>>>> ioctl(Q->DevSpec1, FDSETPRM, &medprm)
>>>>
>>>> and there is no return from the call. That thread is hung.
>>> What happens if you kill the main RT-process?
>>>
>> When I kill the main process all its threads also go away. Including the floppy thread.
>> Nothing notable happens with this kernel.
>
> Aha, I missed the word "thread", this is the single process.
>
> Still, this means that flush_workqueue() completes when other sub-threads go away,
> otherwise the thread doing ioctl() couldn't exit.
>
> Thank you very much.
>
> So, the main question is: is it possible that one of RT processes/threads pins itself
> to some CPU and eats 100% cpu power?
>

The main process is pinned to a processor(2) with all _non-kernel_ processes/threads forced over to processor 1.
Any already affinitized processes or kernel threads are left as is. Only user land stuff is moved. The main process
is for sure _not_ relinquishing its processor(2) intentionally. All the I/O threads, floppy included, are running
on the other processor(1). During this failure only 1 or 2 of the I/O threads are actually doing anything.
I assume that whatever is going on in the kernel/floppy driver on behalf of the floppy thread is being done on processor 1?
Processor 1 has lots of CPU time available. Processor 2 is running balls to the wall.

>> On previous (2.6.18) I would get a dump
>> from the floppy driver in the syslog when I killed the process.
>
> Could you send me this output? just in case...
>

Today, 2.6.18 is doing the same as 2.6.22-rc3. I hate it when that happens. Maybe it was
on my box at home. I'll verify when I get there. Nothing from here now though.

>>> --- OLD/drivers/block/floppy.c~ 2007-04-03 13:04:58.000000000 +0400
>>> +++ OLD/drivers/block/floppy.c 2007-05-31 20:50:18.000000000 +0400
>>> @@ -862,6 +862,8 @@ static void set_fdc(int drive)
>>> FDCS->reset = 1;
>>> }
>>>
>>> +static DECLARE_WORK(floppy_work, NULL);
>>> +
>>> /* locks the driver */
>>> static int _lock_fdc(int drive, int interruptible, int line)
>>> {
>>> @@ -893,7 +895,7 @@ static int _lock_fdc(int drive, int inte
>>> set_current_state(TASK_RUNNING);
>>> remove_wait_queue(&fdc_wait, &wait);
>>>
>>> - flush_scheduled_work();
>>> + cancel_work_sync(&floppy_work);
>>> }
>>> command_status = FD_COMMAND_NONE;
>>>
>>> @@ -992,8 +994,6 @@ static void empty(void)
>>> {
>>> }
>>>
>>> -static DECLARE_WORK(floppy_work, NULL);
>>> -
>>> static void schedule_bh(void (*handler) (void))
>>> {
>>> PREPARE_WORK(&floppy_work, (work_func_t)handler);
>>>
>> The patch does make it work.
>
> I do not understand floppy.c, absolutely, so I am not sure this patch is correct.
>
> Even if correct, this patch doesn't solve this problem (if we really understand
> what's going on). cancel_work_sync() may still hang if floppy_work->func() runs
> on the starved CPU. This is unlikely, but possible.
>
> Thanks!
>
> Oleg.
>

2007-06-01 09:51:56

by Mark Hounschell

Subject: Re: floppy.c soft lockup

Mark Hounschell wrote:
> Oleg Nesterov wrote:
>> On 05/31, Mark Hounschell wrote:
>>> Oleg Nesterov wrote:
>>>> On 05/31, Mark Hounschell wrote:
>>>>> Basically the main RT-process (which is a CPU bound process on processor-2) signals a
>>>>> thread to do some I/O. That RT-thread (running on the other processor) does a simple
>>>> If the main RT-process monopolizes processor-2, flush_workqueue() (or cancel_work_sync())
>>>> can hang of course, we can do nothing.
>>>>
>>>>> ioctl(Q->DevSpec1, FDSETPRM, &medprm)
>>>>>
>>>>> and there is no return from the call. That thread is hung.
>>>> What happens if you kill the main RT-process?
>>>>
>>> When I kill the main process all its threads also go away. Including the floppy thread.
>>> Nothing notable happens with this kernel.
>> Aha, I missed the word "thread", this is the single process.
>>
>> Still, this means that flush_workqueue() completes when other sub-threads go away,
>> otherwise the thread doing ioctl() couldn't exit.
>>
>> Thank you very much.
>>
>> So, the main question is: is it possible that one of RT processes/threads pins itself
>> to some CPU and eats 100% cpu power?
>>
>
> The main process is pinned to a processor(2) with all _non-kernel_ processes/threads forced over to processor 1.
> Any already affinitized processes or kernel threads are left as is. Only user land stuff is moved. The main process
> is for sure _not_ relinquishing it's processor(2) intentionally. All the I/O threads, floppy included, are running
> on the other processor(1). During this failure only 1 or 2 of the I/O threads are actually doing anything.
> I assume that what ever is going on in the kernel/floppy driver on behalf of the floppy thread is being done on processor 1?
> Processor 1 has lots of CPU time available. Processor 2 is running balls to the wall.
>
>>> On previous (2.6.18) I would get a dump
>>> from the floppy driver in the syslog when I killed the process.
>> Could you send me this output? just in case...
>>
>
> Today, 2.6.18 is doing the same as 2.6.22-rc3. I hate it when that happens. Maybe it was
> on my box at home. I'll verify when I get there. Nothing from here now though.
>

Those syslog dumps must have been a result of something I was doing
while trying to pinpoint my problem. I do not get these now. Sorry.

>>>> --- OLD/drivers/block/floppy.c~ 2007-04-03 13:04:58.000000000 +0400
>>>> +++ OLD/drivers/block/floppy.c 2007-05-31 20:50:18.000000000 +0400
>>>> @@ -862,6 +862,8 @@ static void set_fdc(int drive)
>>>> FDCS->reset = 1;
>>>> }
>>>>
>>>> +static DECLARE_WORK(floppy_work, NULL);
>>>> +
>>>> /* locks the driver */
>>>> static int _lock_fdc(int drive, int interruptible, int line)
>>>> {
>>>> @@ -893,7 +895,7 @@ static int _lock_fdc(int drive, int inte
>>>> set_current_state(TASK_RUNNING);
>>>> remove_wait_queue(&fdc_wait, &wait);
>>>>
>>>> - flush_scheduled_work();
>>>> + cancel_work_sync(&floppy_work);
>>>> }
>>>> command_status = FD_COMMAND_NONE;
>>>>
>>>> @@ -992,8 +994,6 @@ static void empty(void)
>>>> {
>>>> }
>>>>
>>>> -static DECLARE_WORK(floppy_work, NULL);
>>>> -
>>>> static void schedule_bh(void (*handler) (void))
>>>> {
>>>> PREPARE_WORK(&floppy_work, (work_func_t)handler);
>>>>
>>> The patch does make it work.
>> I do not understand floppy.c, absolutely, so I am not sure this patch is correct.
>>
>> Even if correct, this patch doesn't solve this problem (if we really understand
>> what's going on). cancel_work_sync() may still hang if floppy_work->func() runs
>> on the starved CPU. This is unlikely, but possible.
>>
>> Thanks!
>>
>> Oleg.
>>

Thanks and Regards
Mark

2007-06-01 11:01:09

by Oleg Nesterov

Subject: Re: floppy.c soft lockup

I hope Ingo will correct me if I am wrong,

On 05/31, Mark Hounschell wrote:
>
> Oleg Nesterov wrote:
> >
> > So, the main question is: is it possible that one of RT processes/threads pins itself
> > to some CPU and eats 100% cpu power?
> >
>
> The main process is pinned to a processor(2) with all _non-kernel_ processes/threads forced over to processor 1.
> Any already affinitized processes or kernel threads are left as is. Only user land stuff is moved. The main process
> is for sure _not_ relinquishing it's processor(2) intentionally.

This means that a non-rt kernel thread bound to CPU 2 can't run; in particular,
events/2. So the problem is not directly connected to floppy.c: no
flush_scheduled_work() (or schedule_on_each_cpu()) can succeed.

You can change irq/X/smp_affinity, but smp_apic_timer_interrupt() still can
queue work_struct on CPU 2 (for example, mm/slab.c uses per-cpu reap_work).
Since events/2 is blocked by the main RT thread, such a work_struct can't be
executed, and so flush_scheduled_work() hangs.

> All the I/O threads, floppy included, are running
> on the other processor(1). During this failure only 1 or 2 of the I/O threads are actually doing anything.
> I assume that what ever is going on in the kernel/floppy driver on behalf of the floppy thread is being done on processor 1?
> Processor 1 has lots of CPU time available.

Yes, but see above. flush_scheduled_work() needs a cooperation from events/2
which is bound to CPU 2.

If you changed irq/X/smp_affinity, the patch I sent should help, because
floppy_work can't be scheduled on CPU 2, but still I don't think it is right
to run 100% cpu-bound RT-process.

Oleg.

2007-06-01 14:10:35

by Mark Hounschell

Subject: Re: floppy.c soft lockup

Oleg Nesterov wrote:
> I hope Ingo will correct me if I am wrong,
>
> On 05/31, Mark Hounschell wrote:
>> Oleg Nesterov wrote:
>>> So, the main question is: is it possible that one of RT processes/threads pins itself
>>> to some CPU and eats 100% cpu power?
>>>
>> The main process is pinned to a processor(2) with all _non-kernel_ processes/threads forced over to processor 1.
>> Any already affinitized processes or kernel threads are left as is. Only user land stuff is moved. The main process
>> is for sure _not_ relinquishing it's processor(2) intentionally.
>
> This means that a non-rt kernel thread bound to CPU 2 can't run. In particular,
> events/2. This means that the problem is not directly connected to floppy.c,
> any flush_scheduled_work() (or schedule_on_each_cpu()) can't succeed.
>

Well, I have multiple I/O threads for many other types of I/O that don't have any problems.
And until these changes in 2.6.18 I didn't have any problems with the floppy either. I have multiple
ethernet threads, multiple scsi (SG) device threads, multiple rs232 device threads, a parallel port
thread, and others, with no problems.


> You can change irq/X/smp_affinity, but smp_apic_timer_interrupt() still can
> queue work_struct on CPU 2 (for example, mm/slab.c uses per-cpu reap_work).
> Since events/2 is blocked by the main RT thread, such a work_struct can't be
> executed, and so flush_scheduled_work() hangs.
>

I don't mean to sound stupid, but why would a process running on processor 1 require anything
from events/2 when there is an events/1? Forgive my ignorance, please.

>> All the I/O threads, floppy included, are running
>> on the other processor(1). During this failure only 1 or 2 of the I/O threads are actually doing anything.
>> I assume that what ever is going on in the kernel/floppy driver on behalf of the floppy thread is being done on processor 1?
>> Processor 1 has lots of CPU time available.
>
> Yes, but see above. flush_scheduled_work() needs a cooperation from events/2
> which is bound to CPU 2.
>

Again, I don't understand why flush_scheduled_work() running on behalf of a process
affinitized to processor-1 requires cooperation from events/2 (affinitized to processor-2)
when there is an events/1 already affinitized to processor 1. Again though, forgive my
ignorance please.

> If you changed irq/X/smp_affinity, the patch I sent should help, because
> floppy_work can't be scheduled on CPU 2, but still I don't think it is right
> to run 100% cpu-bound RT-process.
>
> Oleg.
>

The patch you sent helps with no other intervention from me. But then so does
the patch mentioned in the original post. I am able to bang on the floppies pretty
hard doing all kinds of things with no trouble using either.

As far as a 100% cpu-bound RT-process goes, well I say I don't intentionally relinquish
the processor but it's not really 100% cpu-bound. Running xosview I see some spare time.

Thanks
Mark

2007-06-01 15:16:18

by Oleg Nesterov

Subject: Re: floppy.c soft lockup

On 06/01, Mark Hounschell wrote:
>
> Oleg Nesterov wrote:
> > Yes, but see above. flush_scheduled_work() needs a cooperation from events/2
> > which is bound to CPU 2.
> >
>
> Again I don't understand why flush_scheduled_work() running on behalf of a process
> affinitized to processor-1 requires cooperation from events/2 (affinitized to processor-2)
> when there is an events/1 already affinitized to processor 1?

flush_workqueue() blocks until any scheduled work on any CPU has run to
completion. If we have some work_struct pending on CPU 2, it can be completed
only when events/2 executes it.

> > If you changed irq/X/smp_affinity, the patch I sent should help, because
> > floppy_work can't be scheduled on CPU 2, but still I don't think it is right
> > to run 100% cpu-bound RT-process.
>
> The patch you sent helps with no other intervention from me. But then so does
> the patch mentioned in the original post. I am able to bang on the floppies pretty
> hard doing all kinds of things with no trouble using either.

This patch replaces flush_scheduled_work() with cancel_work_sync(). The latter
can still hang if the floppy interrupt happens on CPU 2 and does schedule_bh():
events/2 starts running floppy_work->func() and is preempted by the RT-thread. This is
very unlikely, but possible.

From your original post:
>
> The application only runs on SMP machines and uses process and irq affinities

if the floppy interrupt can't happen on CPU 2, the above scenario is not possible.

> As far as a 100% cpu-bound RT-process goes, well I say I don't intentionally relinquish
> the processor but it's not really 100% cpu-bound. Running xosview I see some spare time.

Well, I don't know what xosview is, sorry :) so I don't understand what
"spare time" precisely means. If this thread does some i/o or something which
can sleep, then...

OK. In that case we may have another reason for deadlock, say a pending
floppy_work needs open_lock or test_and_set_bit(0, &fdc_busy).

Could you apply the trivial patch below, and change the i/o thread to do

prctl(1234); // hangs ???
printf(something);
ioctl(Q->DevSpec1, FDSETPRM, &medprm); // this hangs

to see if prctl() hangs or not? This way we can narrow the problem.
(of course, you can just kill the above ioctl() if this is possible).

Thanks!

Oleg.

--- OLD/kernel/sys.c~ 2007-04-03 13:05:02.000000000 +0400
+++ OLD/kernel/sys.c 2007-06-01 18:56:22.000000000 +0400
@@ -2147,6 +2147,11 @@ asmlinkage long sys_prctl(int option, un
{
long error;

+ if (option == 1234) {
+ flush_scheduled_work();
+ return 0;
+ }
+
error = security_task_prctl(option, arg2, arg3, arg4, arg5);
if (error)
return error;

2007-06-01 17:12:16

by Mark Hounschell

Subject: Re: floppy.c soft lockup

Oleg Nesterov wrote:
> On 06/01, Mark Hounschell wrote:
>> Oleg Nesterov wrote:
>>> Yes, but see above. flush_scheduled_work() needs a cooperation from events/2
>>> which is bound to CPU 2.
>>>
>> Again I don't understand why flush_scheduled_work() running on behalf of a process
>> affinitized to processor-1 requires cooperation from events/2 (affinitized to processor-2)
>> when there is an events/1 already affinitized to processor 1?
>
> flush_workqueue() blocks until any scheduled work on any CPU has run to
> completion. If we have some work_struct pending on CPU 2, it can be completed
> only when events/2 executes it.
>
>>> If you changed irq/X/smp_affinity, the patch I sent should help, because
>>> floppy_work can't be scheduled on CPU 2, but still I don't think it is right
>>> to run 100% cpu-bound RT-process.
>> The patch you sent helps with no other intervention from me. But then so does
>> the patch mentioned in the original post. I am able to bang on the floppies pretty
>> hard doing all kinds of things with no trouble using either.
>
> This patch replaces flush_scheduled_work() with cancel_work_sync(). The latter
> can still hang if the floppy interrupt happens on CPU 2 and does schedule_bh(),
> events/2 starts running floppy_work->func() and preempted by RT-thread. This is
> very unlikely, but possible.
>
> From your original post:
> >
> > The application only runs on SMP machines and uses process and irq affinities
>
> if the floppy interrupt can't happen on CPU 2, the above scenario is not possible.
>

All the irq affinities but one are set to processor-1. The only one that is not
comes from an rtom (Real-Time Option Module); its irq is handled by
processor-2.

>> As far as a 100% cpu-bound RT-process goes, well I say I don't intentionally relinquish
>> the processor but it's not really 100% cpu-bound. Running xosview I see some spare time.
>
> Well, I don't know what is xosview, sorry :) so I don't understand what does
> "spare time" precisely mean. If this thread does some i/o or something which
> can sleep, then...
>

I don't understand the _real_ meaning of spare time either, but xosview
is just a little graphical window showing information obtained from the
proc fs, I think.

> OK. In that case we may have another reason for deadlock, say a pending
> floppy_work needs open_lock or test_and_set_bit(0, &fdc_busy).
>
> Could you apply the trivial patch below, and change the i/o thread to do
>
> prctl(1234); // hangs ???
> printf(something);
> ioctl(Q->DevSpec1, FDSETPRM, &medprm); // this hangs
>
> to see if prctl() hangs or not? This way we can narrow the problem.
> (of course, you can just kill the above ioctl() if this is possible).
>
> Thanks!
>
> Oleg.
>
> --- OLD/kernel/sys.c~ 2007-04-03 13:05:02.000000000 +0400
> +++ OLD/kernel/sys.c 2007-06-01 18:56:22.000000000 +0400
> @@ -2147,6 +2147,11 @@ asmlinkage long sys_prctl(int option, un
> {
> long error;
>
> + if (option == 1234) {
> + flush_scheduled_work();
> + return 0;
> + }
> +
> error = security_task_prctl(option, arg2, arg3, arg4, arg5);
> if (error)
> return error;
>
> -


OK, the prctl never returned. I just replaced the ioctl with it and added
a printf before and after. I only get the one before. The thread is hung
at this point, just as if I'd done the ioctl.

Regards
Mark

2007-06-01 18:36:45

by Oleg Nesterov

Subject: Re: floppy.c soft lockup

On 06/01, Mark Hounschell wrote:
>
> Oleg Nesterov wrote:
> >
> > Could you apply the trivial patch below, and change the i/o thread to do
> >
> > prctl(1234); // hangs ???
> > printf(something);
> > ioctl(Q->DevSpec1, FDSETPRM, &medprm); // this hangs
> >
> > to see if prctl() hangs or not? This way we can narrow the problem.
> > (of course, you can just kill the above ioctl() if this is possible).
> >
> > Thanks!
> >
> > Oleg.
> >
> > --- OLD/kernel/sys.c~ 2007-04-03 13:05:02.000000000 +0400
> > +++ OLD/kernel/sys.c 2007-06-01 18:56:22.000000000 +0400
> > @@ -2147,6 +2147,11 @@ asmlinkage long sys_prctl(int option, un
> > {
> > long error;
> >
> > + if (option == 1234) {
> > + flush_scheduled_work();
> > + return 0;
> > + }
> > +
> > error = security_task_prctl(option, arg2, arg3, arg4, arg5);
> > if (error)
> > return error;
> >
> > -
>
>
> Ok the prctl never returned. I just replaced the ioctl with it and added
> a printf before and after. I only get the one before. The thread is hung
> at this point just as if I'd done the ioctl?

Thanks. So we can rule out floppy.c. flush_scheduled_work/flush_workqueue
is broken by this RT application. Imho, this is not a kernel problem.

Now I am very sure that the initial suspicion was correct: CPU starvation.
I can cook up a debug patch tomorrow to be 100% sure; which kernel version is
most convenient for you?

Oleg.

2007-06-01 19:52:57

by Mark Hounschell

Subject: Re: floppy.c soft lockup

Oleg Nesterov wrote:
> On 06/01, Mark Hounschell wrote:
>> Oleg Nesterov wrote:
>>> Could you apply the trivial patch below, and change the i/o thread to do
>>>
>>> prctl(1234); // hangs ???
>>> printf(something);
>>> ioctl(Q->DevSpec1, FDSETPRM, &medprm); // this hangs
>>>
>>> to see if prctl() hangs or not? This way we can narrow the problem.
>>> (of course, you can just kill the above ioctl() if this is possible).
>>>
>>> Thanks!
>>>
>>> Oleg.
>>>
>>> --- OLD/kernel/sys.c~ 2007-04-03 13:05:02.000000000 +0400
>>> +++ OLD/kernel/sys.c 2007-06-01 18:56:22.000000000 +0400
>>> @@ -2147,6 +2147,11 @@ asmlinkage long sys_prctl(int option, un
>>> {
>>> long error;
>>>
>>> + if (option == 1234) {
>>> + flush_scheduled_work();
>>> + return 0;
>>> + }
>>> +
>>> error = security_task_prctl(option, arg2, arg3, arg4, arg5);
>>> if (error)
>>> return error;
>>>
>>> -
>>
>> Ok the prctl never returned. I just replaced the ioctl with it and added
>> a printf before and after. I only get the one before. The thread is hung
>> at this point just as if I'd done the ioctl?
>
> Thanks. So we can rule out floppy.c. flush_scheduled_work/flush_workqueue
> is broken by this RT application. Imho, this is not the kernel problem.
>
> Now I am very sure that the initial suspect was correct: cpu starvation.
> I can cook a debug patch to be 100% sure tomorrow, which kernel version is
> most convenient to you?
>

2.6.22-rc3 is fine thanks.

Regards
Mark

2007-06-02 12:30:47

by Oleg Nesterov

Subject: Re: floppy.c soft lockup

On 06/01, Mark Hounschell wrote:
>
> Oleg Nesterov wrote:
> > On 06/01, Mark Hounschell wrote:
> >>
> >> Ok the prctl never returned. I just replaced the ioctl with it and added
> >> a printf before and after. I only get the one before. The thread is hung
> >> at this point just as if I'd done the ioctl?
> >
> > Thanks. So we can rule out floppy.c. flush_scheduled_work/flush_workqueue
> > is broken by this RT application. Imho, this is not the kernel problem.
> >
> > Now I am very sure that the initial suspect was correct: cpu starvation.
> > I can cook a debug patch to be 100% sure tomorrow, which kernel version is
> > most convenient to you?
> >
>
> 2.6.22-rc3 is fine thanks.

Please try this patch; it should dump some debug info when flush_workqueue()
hangs (after 30 seconds). You can use it with or without the previous patch
I sent. Please wait for a couple of minutes to collect more info.

Oleg.

--- OLD/kernel/sched.c~TST 2007-04-05 12:20:35.000000000 +0400
+++ OLD/kernel/sched.c 2007-06-02 15:41:53.000000000 +0400
@@ -4177,6 +4177,20 @@ struct task_struct *idle_task(int cpu)
return cpu_rq(cpu)->idle;
}

+struct task_struct *get_cpu_curr(int cpu)
+{
+ unsigned long flags;
+ struct task_struct *curr;
+ struct rq *rq = cpu_rq(cpu);
+
+ spin_lock_irqsave(&rq->lock, flags);
+ curr = rq->curr;
+ get_task_struct(curr);
+ spin_unlock_irqrestore(&rq->lock, flags);
+
+ return curr;
+}
+
/**
* find_process_by_pid - find a process with a matching PID value.
* @pid: the pid in question.
--- OLD/kernel/workqueue.c~TST 2007-06-02 13:34:57.000000000 +0400
+++ OLD/kernel/workqueue.c 2007-06-02 16:18:02.000000000 +0400
@@ -49,6 +49,7 @@ struct cpu_workqueue_struct {
struct task_struct *thread;

int run_depth; /* Detect run_workqueue() recursion depth */
+ int jobs;
} ____cacheline_aligned;

/*
@@ -253,6 +254,7 @@ static void run_workqueue(struct cpu_wor

cwq->current_work = work;
list_del_init(cwq->worklist.next);
+ cwq->jobs++;
spin_unlock_irq(&cwq->lock);

BUG_ON(get_wq_data(work) != cwq);
@@ -328,7 +330,48 @@ static void insert_wq_barrier(struct cpu
insert_work(cwq, &barr->work, tail);
}

-static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
+extern struct task_struct *get_cpu_curr(int cpu);
+
+static void flush_wait(struct cpu_workqueue_struct *cwq, int cpu, struct completion *done)
+{
+ struct task_struct *curr;
+ struct work_struct *work;
+ int old_pid, jobs;
+
+ if (is_single_threaded(cwq->wq))
+ cpu = raw_smp_processor_id();
+
+again:
+ work = cwq->current_work;
+ jobs = cwq->jobs;
+
+ curr = get_cpu_curr(cpu);
+ old_pid = curr->pid;
+ put_task_struct(curr);
+
+ if (wait_for_completion_timeout(done, HZ * 30))
+ return;
+
+ printk(KERN_ERR "ERR!! %s flush hang: %p %p %d %d\n", cwq->thread->comm,
+ work, cwq->current_work, jobs, cwq->jobs);
+
+ curr = get_cpu_curr(cpu);
+ printk(KERN_ERR "CURR: %d %d %s %ld %ld\n", old_pid, curr->pid,
+ curr->comm, curr->nivcsw, curr->nvcsw);
+ put_task_struct(curr);
+
+ spin_lock_irq(&cwq->lock);
+ list_for_each_entry(work, &cwq->worklist, entry)
+ print_symbol(" %s\n", (unsigned long) work->func);
+ printk(" ----\n");
+ if (cwq->current_work)
+ print_symbol(" %s\n", (unsigned long) cwq->current_work->func);
+ spin_unlock_irq(&cwq->lock);
+
+ goto again;
+}
+
+static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq, int cpu)
{
int active;

@@ -351,7 +394,7 @@ static int flush_cpu_workqueue(struct cp
spin_unlock_irq(&cwq->lock);

if (active)
- wait_for_completion(&barr.done);
+ flush_wait(cwq, cpu, &barr.done);
}

return active;
@@ -377,7 +420,7 @@ void fastcall flush_workqueue(struct wor

might_sleep();
for_each_cpu_mask(cpu, *cpu_map)
- flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
+ flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu), cpu);
}
EXPORT_SYMBOL_GPL(flush_workqueue);

@@ -748,7 +791,7 @@ static void cleanup_workqueue_thread(str
* checks list_empty(), and a "normal" queue_work() can't use
* a dead CPU.
*/
- while (flush_cpu_workqueue(cwq))
+ while (flush_cpu_workqueue(cwq, cpu))
;

kthread_stop(cwq->thread);

2007-06-02 20:44:38

by Mark Hounschell

[permalink] [raw]
Subject: Re: floppy.c soft lockup

Oleg Nesterov wrote:
> On 06/01, Mark Hounschell wrote:
>> Oleg Nesterov wrote:
>>> On 06/01, Mark Hounschell wrote:
>>>> Ok the prctl never returned. I just replaced the ioctl with it and added
>>>> a printf before and after. I only get the one before. The thread is hung
>>>> at this point just as if I'd done the ioctl?
>>> Thanks. So we can rule out floppy.c. flush_scheduled_work/flush_workqueue
>>> is broken by this RT application. Imho, this is not the kernel problem.
>>>
>>> Now I am very sure that the initial suspect was correct: cpu starvation.
>>> I can cook a debug patch to be 100% sure tomorrow, which kernel version is
>>> most convenient to you?
>>>
>> 2.6.22-rc3 is fine thanks.
>
> Please try this patch, it should dump some debug info when flush_workqueue()
> hangs (after 30 seconds). You can use it with or without the previous patch
> I sent. Please wait for a couple of minutes to collect more info.
>
> Oleg.
>
> --- OLD/kernel/sched.c~TST 2007-04-05 12:20:35.000000000 +0400
> +++ OLD/kernel/sched.c 2007-06-02 15:41:53.000000000 +0400
> @@ -4177,6 +4177,20 @@ struct task_struct *idle_task(int cpu)
> return cpu_rq(cpu)->idle;
> }
>
> +struct task_struct *get_cpu_curr(int cpu)
> +{
> + unsigned long flags;
> + struct task_struct *curr;
> + struct rq *rq = cpu_rq(cpu);
> +
> + spin_lock_irqsave(&rq->lock, flags);
> + curr = rq->curr;
> + get_task_struct(curr);
> + spin_unlock_irqrestore(&rq->lock, flags);
> +
> + return curr;
> +}
> +
> /**
> * find_process_by_pid - find a process with a matching PID value.
> * @pid: the pid in question.
> --- OLD/kernel/workqueue.c~TST 2007-06-02 13:34:57.000000000 +0400
> +++ OLD/kernel/workqueue.c 2007-06-02 16:18:02.000000000 +0400
> @@ -49,6 +49,7 @@ struct cpu_workqueue_struct {
> struct task_struct *thread;
>
> int run_depth; /* Detect run_workqueue() recursion depth */
> + int jobs;
> } ____cacheline_aligned;
>
> /*
> @@ -253,6 +254,7 @@ static void run_workqueue(struct cpu_wor
>
> cwq->current_work = work;
> list_del_init(cwq->worklist.next);
> + cwq->jobs++;
> spin_unlock_irq(&cwq->lock);
>
> BUG_ON(get_wq_data(work) != cwq);
> @@ -328,7 +330,48 @@ static void insert_wq_barrier(struct cpu
> insert_work(cwq, &barr->work, tail);
> }
>
> -static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
> +extern struct task_struct *get_cpu_curr(int cpu);
> +
> +static void flush_wait(struct cpu_workqueue_struct *cwq, int cpu, struct completion *done)
> +{
> + struct task_struct *curr;
> + struct work_struct *work;
> + int old_pid, jobs;
> +
> + if (is_single_threaded(cwq->wq))
> + cpu = raw_smp_processor_id();
> +
> +again:
> + work = cwq->current_work;
> + jobs = cwq->jobs;
> +
> + curr = get_cpu_curr(cpu);
> + old_pid = curr->pid;
> + put_task_struct(curr);
> +
> + if (wait_for_completion_timeout(done, HZ * 30))
> + return;
> +
> + printk(KERN_ERR "ERR!! %s flush hang: %p %p %d %d\n", cwq->thread->comm,
> + work, cwq->current_work, jobs, cwq->jobs);
> +
> + curr = get_cpu_curr(cpu);
> + printk(KERN_ERR "CURR: %d %d %s %ld %ld\n", old_pid, curr->pid,
> + curr->comm, curr->nivcsw, curr->nvcsw);
> + put_task_struct(curr);
> +
> + spin_lock_irq(&cwq->lock);
> + list_for_each_entry(work, &cwq->worklist, entry)
> + print_symbol(" %s\n", (unsigned long) work->func);
> + printk(" ----\n");
> + if (cwq->current_work)
> + print_symbol(" %s\n", (unsigned long) cwq->current_work->func);
> + spin_unlock_irq(&cwq->lock);
> +
> + goto again;
> +}
> +
> +static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq, int cpu)
> {
> int active;
>
> @@ -351,7 +394,7 @@ static int flush_cpu_workqueue(struct cp
> spin_unlock_irq(&cwq->lock);
>
> if (active)
> - wait_for_completion(&barr.done);
> + flush_wait(cwq, cpu, &barr.done);
> }
>
> return active;
> @@ -377,7 +420,7 @@ void fastcall flush_workqueue(struct wor
>
> might_sleep();
> for_each_cpu_mask(cpu, *cpu_map)
> - flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
> + flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu), cpu);
> }
> EXPORT_SYMBOL_GPL(flush_workqueue);
>
> @@ -748,7 +791,7 @@ static void cleanup_workqueue_thread(str
> * checks list_empty(), and a "normal" queue_work() can't use
> * a dead CPU.
> */
> - while (flush_cpu_workqueue(cwq))
> + while (flush_cpu_workqueue(cwq, cpu))
> ;
>
> kthread_stop(cwq->thread);
>



Jun 2 16:36:11 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
Jun 2 16:36:11 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun 2 16:36:11 harley kernel: wq_barrier_func+0x0/0x8
Jun 2 16:36:11 harley kernel: vmstat_update+0x0/0x24
Jun 2 16:36:11 harley kernel: ----
Jun 2 16:36:11 harley kernel: cache_reap+0x0/0xf4
Jun 2 16:36:41 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
Jun 2 16:36:41 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun 2 16:36:41 harley kernel: wq_barrier_func+0x0/0x8
Jun 2 16:36:41 harley kernel: vmstat_update+0x0/0x24
Jun 2 16:36:41 harley kernel: ----
Jun 2 16:36:41 harley kernel: cache_reap+0x0/0xf4
Jun 2 16:37:11 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
Jun 2 16:37:11 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun 2 16:37:11 harley kernel: wq_barrier_func+0x0/0x8
Jun 2 16:37:11 harley kernel: vmstat_update+0x0/0x24
Jun 2 16:37:11 harley kernel: ----
Jun 2 16:37:11 harley kernel: cache_reap+0x0/0xf4
Jun 2 16:37:41 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
Jun 2 16:37:41 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun 2 16:37:41 harley kernel: wq_barrier_func+0x0/0x8
Jun 2 16:37:41 harley kernel: vmstat_update+0x0/0x24
Jun 2 16:37:41 harley kernel: ----
Jun 2 16:37:41 harley kernel: cache_reap+0x0/0xf4
Jun 2 16:37:51 harley kernel: RTOM: In int handler for 12 usec.
Jun 2 16:38:11 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
Jun 2 16:38:11 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun 2 16:38:11 harley kernel: wq_barrier_func+0x0/0x8
Jun 2 16:38:11 harley kernel: vmstat_update+0x0/0x24
Jun 2 16:38:11 harley kernel: ----
Jun 2 16:38:11 harley kernel: cache_reap+0x0/0xf4
Jun 2 16:38:41 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
Jun 2 16:38:41 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun 2 16:38:41 harley kernel: wq_barrier_func+0x0/0x8
Jun 2 16:38:41 harley kernel: vmstat_update+0x0/0x24
Jun 2 16:38:41 harley kernel: ----
Jun 2 16:38:41 harley kernel: cache_reap+0x0/0xf4
Jun 2 16:39:11 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
Jun 2 16:39:11 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun 2 16:39:11 harley kernel: wq_barrier_func+0x0/0x8
Jun 2 16:39:11 harley kernel: vmstat_update+0x0/0x24
Jun 2 16:39:11 harley kernel: ----
Jun 2 16:39:11 harley kernel: cache_reap+0x0/0xf4
Jun 2 16:39:41 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
Jun 2 16:39:41 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun 2 16:39:41 harley kernel: wq_barrier_func+0x0/0x8
Jun 2 16:39:41 harley kernel: vmstat_update+0x0/0x24
Jun 2 16:39:41 harley kernel: ----
Jun 2 16:39:41 harley kernel: cache_reap+0x0/0xf4
Jun 2 16:40:11 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
Jun 2 16:40:11 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun 2 16:40:11 harley kernel: wq_barrier_func+0x0/0x8
Jun 2 16:40:11 harley kernel: vmstat_update+0x0/0x24
Jun 2 16:40:11 harley kernel: ----
Jun 2 16:40:11 harley kernel: cache_reap+0x0/0xf4
Jun 2 16:40:41 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
Jun 2 16:40:41 harley kernel: CURR: 7974 7974 vrsx 93 26
Jun 2 16:40:41 harley kernel: wq_barrier_func+0x0/0x8
Jun 2 16:40:41 harley kernel: vmstat_update+0x0/0x24
Jun 2 16:40:41 harley kernel: ----
Jun 2 16:40:41 harley kernel: cache_reap+0x0/0xf4


Regards
Mark

2007-06-03 08:14:33

by Oleg Nesterov

[permalink] [raw]
Subject: Re: floppy.c soft lockup

On 06/02, Mark Hounschell wrote:
>
> Jun 2 16:36:11 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
> Jun 2 16:36:11 harley kernel: CURR: 7974 7974 vrsx 93 26
> Jun 2 16:36:11 harley kernel: wq_barrier_func+0x0/0x8
> Jun 2 16:36:11 harley kernel: vmstat_update+0x0/0x24
> Jun 2 16:36:11 harley kernel: ----
> Jun 2 16:36:11 harley kernel: cache_reap+0x0/0xf4

As expected.

Note that ->nivcsw/->nvcsw don't change. There is no "spare time"
on CPU 1; "vrsx" monopolizes the CPU. events/1->cache_reap() was
preempted by vrsx and has had no chance to run since. Note that
jobs == 10012 doesn't change either. I forgot to print
cwq->thread->state, but it should be TASK_RUNNING. It would not
be possible to kill vrsx if cache_reap() stalled.

I don't think this is a kernel problem; vrsx breaks flush_workqueue().
Ingo can answer authoritatively, but I think SCHED_RR/SCHED_FIFO were
not designed for tasks that are 100% cpu-bound.

That said, I think it makes sense to get rid of flush_scheduled_work()
in floppy.c.

Thanks!

Oleg.

2007-06-04 14:01:15

by Mark Hounschell

[permalink] [raw]
Subject: Re: floppy.c soft lockup

Oleg Nesterov wrote:
> On 06/02, Mark Hounschell wrote:
>> Jun 2 16:36:11 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
>> Jun 2 16:36:11 harley kernel: CURR: 7974 7974 vrsx 93 26
>> Jun 2 16:36:11 harley kernel: wq_barrier_func+0x0/0x8
>> Jun 2 16:36:11 harley kernel: vmstat_update+0x0/0x24
>> Jun 2 16:36:11 harley kernel: ----
>> Jun 2 16:36:11 harley kernel: cache_reap+0x0/0xf4
>
> As expected.
>
> Note that ->nivcsw/->nvcsw doesn't change. There is no "spare time"
> on CPU 1, "vrsx" monopolizes CPU. events/1->cache_reap() was preempted
> by vrsx, it had no chance to run since then. Note that jobs == 7974
> doesn't change too. I forgot to print cwq->thread->state, but it should
> be TASK_RUNNING. It would not be possible to kill vrsx if cache_reap()
> stalled.
>
> I don't think this is a kernel problem, vrsx breaks flush_workqueue().
> Ingo can answer authoritatively, but I think SCHED_RR/SCHED_FIFO were
> not designed to be 100% cpu-bound.
>
> That said, I think it makes sense to get rid of flush_scheduled_work()
> in floppy.c.
>

Oleg, thanks for your time in diagnosing this.

As far as a 100% CPU bound task being a valid thing to do, it has been
done for many years on SMP machines. Any kernel limitation on this
surely must be considered a bug?

Thanks again
Regards
Mark

2007-06-06 13:12:26

by Mark Hounschell

[permalink] [raw]
Subject: Re: floppy.c soft lockup

Mark Hounschell wrote:
> Oleg Nesterov wrote:
>> On 06/02, Mark Hounschell wrote:
>>> Jun 2 16:36:11 harley kernel: ERR!! events/1 flush hang: c201dbc0 c201dbc0 10012 10012
>>> Jun 2 16:36:11 harley kernel: CURR: 7974 7974 vrsx 93 26
>>> Jun 2 16:36:11 harley kernel: wq_barrier_func+0x0/0x8
>>> Jun 2 16:36:11 harley kernel: vmstat_update+0x0/0x24
>>> Jun 2 16:36:11 harley kernel: ----
>>> Jun 2 16:36:11 harley kernel: cache_reap+0x0/0xf4
>> As expected.
>>
>> Note that ->nivcsw/->nvcsw doesn't change. There is no "spare time"
>> on CPU 1, "vrsx" monopolizes CPU. events/1->cache_reap() was preempted
>> by vrsx, it had no chance to run since then. Note that jobs == 7974
>> doesn't change too. I forgot to print cwq->thread->state, but it should
>> be TASK_RUNNING. It would not be possible to kill vrsx if cache_reap()
>> stalled.
>>
>> I don't think this is a kernel problem, vrsx breaks flush_workqueue().
>> Ingo can answer authoritatively, but I think SCHED_RR/SCHED_FIFO were
>> not designed to be 100% cpu-bound.
>>
>> That said, I think it makes sense to get rid of flush_scheduled_work()
>> in floppy.c.
>>
>
> Oleg, thanks for your time in diagnosing this.
>
> As far as a 100% CPU bound task being a valid thing to do, it has been
> done for many years on SMP machines. Any kernel limitation on this
> surely must be considered a bug?
>

Could someone authoritatively comment on this? Is a SCHED_RR/SCHED_FIFO
100% Cpu bound process supported in an SMP env on Linux? (vanilla or -rt)

Thanks and Regards
Mark

2007-06-06 17:29:20

by Andrew Morton

[permalink] [raw]
Subject: Re: floppy.c soft lockup

On Wed, 06 Jun 2007 09:12:04 -0400 Mark Hounschell <[email protected]> wrote:

> >
> > As far as a 100% CPU bound task being a valid thing to do, it has been
> > done for many years on SMP machines. Any kernel limitation on this
> > surely must be considered a bug?
> >
>
> Could someone authoritatively comment on this? Is a SCHED_RR/SCHED_FIFO
> 100% Cpu bound process supported in an SMP env on Linux? (vanilla or -rt)

It will kill the kernel, sorry.

The only way in which we can fix that is to allow kernel threads to preempt
rt-priority userspace threads. But if we were to do that (to benefit the
few) it would cause _all_ people's rt-prio processes to experience glitches
due to kernel activity, which we believe to be worse.

So we're between a rock and a hard place here.

If we really did want to solve this then I guess the kernel would need some
new code to detect a 100%-busy rt-prio process and to then start permitting
preemption of it for kernel thread activity. That detector would need to
be smart enough to detect a number of 100%-busy rt-prio processes which are
yielding to each other, and one rt-prio process which keeps forking others,
etc. It might get tricky.

2007-06-07 01:32:48

by Matt Mackall

[permalink] [raw]
Subject: Re: floppy.c soft lockup

On Wed, Jun 06, 2007 at 10:28:28AM -0700, Andrew Morton wrote:
> On Wed, 06 Jun 2007 09:12:04 -0400 Mark Hounschell <[email protected]> wrote:
>
> > >
> > > As far as a 100% CPU bound task being a valid thing to do, it has been
> > > done for many years on SMP machines. Any kernel limitation on this
> > > surely must be considered a bug?
> > >
> >
> > Could someone authoritatively comment on this? Is a SCHED_RR/SCHED_FIFO
> > 100% Cpu bound process supported in an SMP env on Linux? (vanilla or -rt)
>
> It will kill the kernel, sorry.
>
> The only way in which we can fix that is to allow kernel threads to preempt
> rt-priority userspace threads. But if we were to do that (to benefit the
> few) it would cause _all_ people's rt-prio processes to experience glitches
> due to kernel activity, which we believe to be worse.
>
> So we're between a rock and a hard place here.
>
> If we really did want to solve this then I guess the kernel would need some
> new code to detect a 100%-busy rt-prio process and to then start permitting
> preemption of it for kernel thread activity. That detector would need to
> be smart enough to detect a number of 100%-busy rt-prio processes which are
> yielding to each other, and one rt-prio process which keeps forking others,
> etc. It might get tricky.

The usual alternative is to manually chrt the relevant kernel threads
to RT priority and adjust the priority scheme of their processes appropriately.

--
Mathematics is the supreme nostalgia of our time.

2007-06-07 10:19:34

by Mark Hounschell

[permalink] [raw]
Subject: Re: floppy.c soft lockup

Matt Mackall wrote:
> On Wed, Jun 06, 2007 at 10:28:28AM -0700, Andrew Morton wrote:
>> On Wed, 06 Jun 2007 09:12:04 -0400 Mark Hounschell <[email protected]> wrote:
>>
>>>> As far as a 100% CPU bound task being a valid thing to do, it has been
>>>> done for many years on SMP machines. Any kernel limitation on this
>>>> surely must be considered a bug?
>>>>
>>> Could someone authoritatively comment on this? Is a SCHED_RR/SCHED_FIFO
>>> 100% Cpu bound process supported in an SMP env on Linux? (vanilla or -rt)
>> It will kill the kernel, sorry.
>>
>> The only way in which we can fix that is to allow kernel threads to preempt
>> rt-priority userspace threads. But if we were to do that (to benefit the
>> few) it would cause _all_ people's rt-prio processes to experience glitches
>> due to kernel activity, which we believe to be worse.
>>
>> So we're between a rock and a hard place here.
>>
>> If we really did want to solve this then I guess the kernel would need some
>> new code to detect a 100%-busy rt-prio process and to then start permitting
>> preemption of it for kernel thread activity. That detector would need to
>> be smart enough to detect a number of 100%-busy rt-prio processes which are
>> yielding to each other, and one rt-prio process which keeps forking others,
>> etc. It might get tricky.
>
> The usual alternative is to manually chrt the relevant kernel threads
> to RT priority and adjust the priority scheme of their processes appropriately.
>

From an earlier message in the thread:

>> Mark writes:
>> Again I don't understand why flush_scheduled_work() running on behalf
>> of a process affinitized to processor-1 requires cooperation from
>> events/2 (affinitized to processor-2)
>> when there is an events/1 already affinitized to processor 1?

>Oleg writes:
>flush_workqueue() blocks until any scheduled work on any CPU has run to
>completion. If we have some work_struct pending on CPU 2, it can be
>completed only when events/2 executes it.

Couldn't flush_scheduled_work() just follow the affinity mask of the
task that made the call in the first place? If the calling task had a
cpu-mask of 3, flush_scheduled_work() would flush events/0 and
events/1; if the calling task had an affinity mask of 1, only events/0
would be flushed.

In other words changing what Oleg says above just slightly:

flush_workqueue() blocks until any scheduled work on any CPU in the
calling task's affinity mask has run to completion?

Thanks
Mark

2007-06-07 14:26:19

by Matt Mackall

[permalink] [raw]
Subject: Re: floppy.c soft lockup

On Thu, Jun 07, 2007 at 06:18:52AM -0400, Mark Hounschell wrote:
> Matt Mackall wrote:
> > On Wed, Jun 06, 2007 at 10:28:28AM -0700, Andrew Morton wrote:
> >> On Wed, 06 Jun 2007 09:12:04 -0400 Mark Hounschell <[email protected]> wrote:
> >>
> >>>> As far as a 100% CPU bound task being a valid thing to do, it has been
> >>>> done for many years on SMP machines. Any kernel limitation on this
> >>>> surely must be considered a bug?
> >>>>
> >>> Could someone authoritatively comment on this? Is a SCHED_RR/SCHED_FIFO
> >>> 100% Cpu bound process supported in an SMP env on Linux? (vanilla or -rt)
> >> It will kill the kernel, sorry.
> >>
> >> The only way in which we can fix that is to allow kernel threads to preempt
> >> rt-priority userspace threads. But if we were to do that (to benefit the
> >> few) it would cause _all_ people's rt-prio processes to experience glitches
> >> due to kernel activity, which we believe to be worse.
> >>
> >> So we're between a rock and a hard place here.
> >>
> >> If we really did want to solve this then I guess the kernel would need some
> >> new code to detect a 100%-busy rt-prio process and to then start permitting
> >> preemption of it for kernel thread activity. That detector would need to
> >> be smart enough to detect a number of 100%-busy rt-prio processes which are
> >> yielding to each other, and one rt-prio process which keeps forking others,
> >> etc. It might get tricky.
> >
> > The usual alternative is to manually chrt the relevant kernel threads
> > to RT priority and adjust the priority scheme of their processes appropriately.
> >
>
> From an earlier message in the thread:
>
> >> Mark writes:
> >> Again I don't understand why flush_scheduled_work() running on behalf
> >> of a process affinitized to processor-1 requires cooperation from
> >> events/2 (affinitized to processor-2)
> >> when there is an events/1 already affinitized to processor 1?
>
> >Oleg writes:
> >flush_workqueue() blocks until any scheduled work on any CPU has run to
> >completion. If we have some work_struct pending on CPU 2, it can be
> >completed only when events/2 executes it.
>
> Could not flush_scheduled_work() just follow the affinity mask of the
> task that caused the call to begin with. If calling task had a cpu-mask
> of 3 then flush_scheduled_work() would do the events/0 and events/1
> thing and if the calling task had an affinity mask of 1 then only
> events/0 would be done?

The kernel's internal event API doesn't track any of this stuff, and
it's not clear we'd want it to. It'd perhaps be simpler to just allow
SIGSTOPing events/0. This might even work from userspace today.

In general, it's considered a mistake to mark CPU hogs as RT precisely
because they present a starvation risk to everything else in the
system, not just kernel threads. We could add kernel infrastructure to
make events survive this sort of thing, but that will very likely just
expose another kernel or userspace livelock.

--
Mathematics is the supreme nostalgia of our time.

2007-06-08 09:55:29

by Mark Hounschell

[permalink] [raw]
Subject: Re: floppy.c soft lockup

Matt Mackall wrote:
> On Thu, Jun 07, 2007 at 06:18:52AM -0400, Mark Hounschell wrote:
>> Matt Mackall wrote:
>>> On Wed, Jun 06, 2007 at 10:28:28AM -0700, Andrew Morton wrote:
>>>> On Wed, 06 Jun 2007 09:12:04 -0400 Mark Hounschell <[email protected]> wrote:
>>>>
>>>>>> As far as a 100% CPU bound task being a valid thing to do, it has been
>>>>>> done for many years on SMP machines. Any kernel limitation on this
>>>>>> surely must be considered a bug?
>>>>>>
>>>>> Could someone authoritatively comment on this? Is a SCHED_RR/SCHED_FIFO
>>>>> 100% Cpu bound process supported in an SMP env on Linux? (vanilla or -rt)
>>>> It will kill the kernel, sorry.
>>>>
>>>> The only way in which we can fix that is to allow kernel threads to preempt
>>>> rt-priority userspace threads. But if we were to do that (to benefit the
>>>> few) it would cause _all_ people's rt-prio processes to experience glitches
>>>> due to kernel activity, which we believe to be worse.
>>>>
>>>> So we're between a rock and a hard place here.
>>>>
>>>> If we really did want to solve this then I guess the kernel would need some
> >>>> new code to detect a 100%-busy rt-prio process and to then start permitting
>>>> preemption of it for kernel thread activity. That detector would need to
>>>> be smart enough to detect a number of 100%-busy rt-prio processes which are
>>>> yielding to each other, and one rt-prio process which keeps forking others,
>>>> etc. It might get tricky.
>>> The usual alternative is to manually chrt the relevant kernel threads
>>> to RT priority and adjust the priority scheme of their processes appropriately.
>>>
>> From an earlier message in the thread:
>>
>>>> Mark writes:
>>>> Again I don't understand why flush_scheduled_work() running on behalf
>>>> of a process affinitized to processor-1 requires cooperation from
>>>> events/2 (affinitized to processor-2)
>>>> when there is an events/1 already affinitized to processor 1?
>>> Oleg writes:
>>> flush_workqueue() blocks until any scheduled work on any CPU has run to
>>> completion. If we have some work_struct pending on CPU 2, it can be
>>> completed only when events/2 executes it.
>> Could not flush_scheduled_work() just follow the affinity mask of the
>> task that caused the call to begin with. If calling task had a cpu-mask
>> of 3 then flush_scheduled_work() would do the events/0 and events/1
>> thing and if the calling task had an affinity mask of 1 then only
>> events/0 would be done?
>
> The kernel's internal event API doesn't track any of this stuff and
> it's not clear we'd want it to. It'd be a bit simpler perhaps to
> simply allow SIGSTOPing events/0. This might even work today from
> userspace.
>
> In general, it's considered a mistake to mark CPU hogs as RT precisely
> because they present a starvation risk to everything else in the
> system, not just kernel threads. We could add kernel infrastructure to
> make events survive this sort of thing, but that will very likely just
> expose another kernel or userspace livelock.
>

In general maybe, but we are not really talking in general terms here.
For everything other than kernel stuff, I (userland) would assume
responsibility. That is why I force all other userland tasks that I want
to run onto all the other processors before my RT-HOG starts.

From a userland point of view, if I have an 8-processor box, I should
be able to have 7 RT-HOGs running on it while all other processes run
on the 8th. Yet in Linux I can't have even one, because it breaks the
kernel. I'm sorry, that does not sound right to me. To me, and to the
people in my world, this is clearly a kernel deficiency.

Regards
Mark

2007-06-13 16:17:42

by Oleg Nesterov

[permalink] [raw]
Subject: Re: floppy.c soft lockup

Sorry for delay,

On 06/07, Mark Hounschell wrote:
>
> >From an earlier thread member:
>
> >> Mark writes:
> >> Again I don't understand why flush_scheduled_work() running on behalf
> >> of a process affinitized to processor-1 requires cooperation from
> >> events/2 (affinitized to processor-2)
> >> when there is an events/1 already affinitized to processor 1?
>
> >Oleg write:
> >flush_workqueue() blocks until any scheduled work on any CPU has run to
> >completion. If we have some work_struct pending on CPU 2, it can be
> >completed only when events/2 executes it.
>
> Could not flush_scheduled_work() just follow the affinity mask of the
> task that caused the call to begin with. If calling task had a cpu-mask
> of 3 then flush_scheduled_work() would do the events/0 and events/1
> thing and if the calling task had an affinity mask of 1 then only
> events/0 would be done?
>
> In other words changing what Oleg says above just slightly:
>
> flush_workqueue() blocks until any scheduled work on any CPU in the
> calling tasks affinity mask has run to completion?

No, we can't do this; it would make flush_workqueue() meaningless.

Even if we could, it wouldn't help. Suppose that a kernel thread takes
some global lock (for example, in our case cache_reap() takes
cache_chain_mutex) and is then preempted by an RT task which doesn't
relinquish the CPU.

So this problem is "wider"; flush_workqueue() was just a random victim.

Oleg.