2006-11-11 20:48:51

by Mikael Pettersson

[permalink] [raw]
Subject: [BUG] floppy: broken after resume due to 2.6.18-rc1 lockdep changes

On my old Dell Latitude laptop, the first access to the floppy
after having resumed from APM suspend fails miserably and generates
these kernel messages (from 2.6.19-rc5):

floppy driver state
-------------------
now=1023244 last interrupt=256121 diff=767123 last called handler=c01e7580
timeout_message=lock fdc
last output bytes:
4 90 256004
0 90 256004
1 90 256004
2 90 256004
12 90 256004
1b 90 256004
ff 90 256004
f 80 256061
0 90 256061
5 90 256061
8 81 256062
e6 80 256064
0 90 256064
5 90 256064
0 90 256064
1 90 256064
2 90 256064
12 90 256064
1b 90 256064
ff 90 256064
last result at 256121
last redo_fd_request at 256121
4 0 0 6 0 1 2
status=0
fdc_busy=1
do_floppy=c01ed580
cont=c02a5b24
current_req=00000000
command_status=-1

floppy0: floppy timeout called
floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 0
Buffer I/O error on device fd0, logical block 0
floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 8
Buffer I/O error on device fd0, logical block 1
floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 16
Buffer I/O error on device fd0, logical block 2
floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 24
Buffer I/O error on device fd0, logical block 3
floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 32
Buffer I/O error on device fd0, logical block 4
floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 40
Buffer I/O error on device fd0, logical block 5
floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 48
Buffer I/O error on device fd0, logical block 6
floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 56
Buffer I/O error on device fd0, logical block 7
floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 0
Buffer I/O error on device fd0, logical block 0

It's only the first post-resume access that triggers this failure,
subsequent accesses do work.

I've traced the cause to Ingo's lockdep patch in 2.6.18-rc1
(see below): reverting it makes the floppy work after resume again.

Before these changes the driver reserved and released the HW
dynamically, but after the changes it no longer does that.
What I suspect is happening is that an APM suspend/resume cycle
doesn't fully preserve the HW state. Previously the driver
wouldn't notice since it would re-reserve and reinitialise
the HW anyway when needed.

So putting a floppy_release_irq_and_dma() in a ->suspend()
hook and a floppy_grab_irq_and_dma() in a ->resume() hook
should do the trick. Unfortunately I'm not sure where in
floppy.c to do that since these routines aren't per-device
but work on all devices in one go.

/Mikael

>From: Ingo Molnar <[email protected]>
Date: Mon, 3 Jul 2006 07:24:23 +0000 (-0700)
Subject: [PATCH] lockdep: floppy.c irq release fix
X-Git-Tag: v2.6.18-rc1
X-Git-Url: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3e541a4ae534a7e59ad464af9abea382b3035724

[PATCH] lockdep: floppy.c irq release fix

The lock validator triggered a number of bugs in the floppy driver, all
related to the floppy driver allocating and freeing irq and dma resources from
interrupt context. The initial solution was to use schedule_work() to push
this into process context, but this caused further problems: for example the
current floppy driver in -mm2 is totally broken and all floppy commands time
out with an error. (as reported by Barry K. Nathan)

This patch tries another solution: simply get rid of all that dynamic IRQ and
DMA allocation/freeing. I doubt it made much sense back in the heydays of
floppies (if two devices raced for DMA or IRQ resources then we didnt handle
those cases too gracefully anyway), and today it makes near zero sense.

So the new code does the simplest and most straightforward thing: allocate IRQ
and DMA resources at module init time, and free them at module removal time.
Dont try to release while the driver is operational. This, besides making the
floppy driver functional again has an added bonus, floppy IRQ stats are
finally persistent and visible in /proc/interrupts:

6: 63 XT-PIC-level floppy

Besides normal floppy IO i have also tested IO error handling, motor-off
timeouts, etc. - and everything seems to be working fine.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
---

--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -249,18 +249,6 @@ static int irqdma_allocated;
#include <linux/cdrom.h> /* for the compatibility eject ioctl */
#include <linux/completion.h>

-/*
- * Interrupt freeing also means /proc VFS work - dont do it
- * from interrupt context. We push this work into keventd:
- */
-static void fd_free_irq_fn(void *data)
-{
- fd_free_irq();
-}
-
-static DECLARE_WORK(fd_free_irq_work, fd_free_irq_fn, NULL);
-
-
static struct request *current_req;
static struct request_queue *floppy_queue;
static void do_fd_request(request_queue_t * q);
@@ -826,15 +814,6 @@ static int set_dor(int fdc, char mask, c
UDRS->select_date = jiffies;
}
}
- /*
- * We should propagate failures to grab the resources back
- * nicely from here. Actually we ought to rewrite the fd
- * driver some day too.
- */
- if (newdor & FLOPPY_MOTOR_MASK)
- floppy_grab_irq_and_dma();
- if (olddor & FLOPPY_MOTOR_MASK)
- floppy_release_irq_and_dma();
return olddor;
}

@@ -892,8 +871,6 @@ static int _lock_fdc(int drive, int inte
line);
return -1;
}
- if (floppy_grab_irq_and_dma() == -1)
- return -EBUSY;

if (test_and_set_bit(0, &fdc_busy)) {
DECLARE_WAITQUEUE(wait, current);
@@ -915,6 +892,8 @@ static int _lock_fdc(int drive, int inte

set_current_state(TASK_RUNNING);
remove_wait_queue(&fdc_wait, &wait);
+
+ flush_scheduled_work();
}
command_status = FD_COMMAND_NONE;

@@ -948,7 +927,6 @@ static inline void unlock_fdc(void)
if (elv_next_request(floppy_queue))
do_fd_request(floppy_queue);
spin_unlock_irqrestore(&floppy_lock, flags);
- floppy_release_irq_and_dma();
wake_up(&fdc_wait);
}

@@ -3694,8 +3672,8 @@ static int floppy_release(struct inode *
}
if (!UDRS->fd_ref)
opened_bdev[drive] = NULL;
- floppy_release_irq_and_dma();
mutex_unlock(&open_lock);
+
return 0;
}

@@ -3726,9 +3704,6 @@ static int floppy_open(struct inode *ino
if (UDRS->fd_ref == -1 || (UDRS->fd_ref && (filp->f_flags & O_EXCL)))
goto out2;

- if (floppy_grab_irq_and_dma())
- goto out2;
-
if (filp->f_flags & O_EXCL)
UDRS->fd_ref = -1;
else
@@ -3805,7 +3780,6 @@ out:
UDRS->fd_ref--;
if (!UDRS->fd_ref)
opened_bdev[drive] = NULL;
- floppy_release_irq_and_dma();
out2:
mutex_unlock(&open_lock);
return res;
@@ -3822,14 +3796,9 @@ static int check_floppy_change(struct ge
return 1;

if (time_after(jiffies, UDRS->last_checked + UDP->checkfreq)) {
- if (floppy_grab_irq_and_dma()) {
- return 1;
- }
-
lock_fdc(drive, 0);
poll_drive(0, 0);
process_fd_request();
- floppy_release_irq_and_dma();
}

if (UTESTF(FD_DISK_CHANGED) ||
@@ -4346,7 +4315,6 @@ static int __init floppy_init(void)
fdc = 0;
del_timer(&fd_timeout);
current_drive = 0;
- floppy_release_irq_and_dma();
initialising = 0;
if (have_no_fdc) {
DPRINT("no floppy controllers found\n");
@@ -4504,7 +4472,7 @@ static void floppy_release_irq_and_dma(v
if (irqdma_allocated) {
fd_disable_dma();
fd_free_dma();
- schedule_work(&fd_free_irq_work);
+ fd_free_irq();
irqdma_allocated = 0;
}
set_dor(0, ~0, 8);
@@ -4600,8 +4568,6 @@ void cleanup_module(void)
/* eject disk, if any */
fd_eject(0);

- flush_scheduled_work(); /* fd_free_irq() might be pending */
-
wait_for_completion(&device_release);
}


2006-11-12 15:48:05

by Ingo Molnar

[permalink] [raw]
Subject: [patch] floppy: suspend/resume fix


* Mikael Pettersson <[email protected]> wrote:

> On my old Dell Latitude laptop, the first access to the floppy after
> having resumed from APM suspend fails miserably and generates these
> kernel messages (from 2.6.19-rc5):
[...]

> It's only the first post-resume access that triggers this failure,
> subsequent accesses do work.
>
> I've traced the cause to Ingo's lockdep patch in 2.6.18-rc1 (see
> below): reverting it makes the floppy work after resume again.

could you check the patch below? I had to add a platform driver to
floppy.c to get suspend/resume callbacks, but otherwise it's relatively
straightforward.

Ingo

----------------------->
Subject: [patch] floppy: suspend/resume fix
From: Ingo Molnar <[email protected]>

introduce a floppy platform-driver and suspend/resume ops to
stop/start the floppy driver. Bug reported by Mikael Pettersson.

Signed-off-by: Ingo Molnar <[email protected]>
---
drivers/block/floppy.c | 31 ++++++++++++++++++++++++++++++-
1 file changed, 30 insertions(+), 1 deletion(-)

Index: linux/drivers/block/floppy.c
===================================================================
--- linux.orig/drivers/block/floppy.c
+++ linux/drivers/block/floppy.c
@@ -4157,6 +4157,28 @@ static void floppy_device_release(struct
complete(&device_release);
}

+static int floppy_suspend(struct platform_device *dev, pm_message_t state)
+{
+ floppy_release_irq_and_dma();
+
+ return 0;
+}
+
+static int floppy_resume(struct platform_device *dev)
+{
+ floppy_grab_irq_and_dma();
+
+ return 0;
+}
+
+static struct platform_driver floppy_driver = {
+ .suspend = floppy_suspend,
+ .resume = floppy_resume,
+ .driver = {
+ .name = "floppy",
+ },
+};
+
static struct platform_device floppy_device[N_DRIVE];

static struct kobject *floppy_find(dev_t dev, int *part, void *data)
@@ -4205,10 +4227,14 @@ static int __init floppy_init(void)
if (err)
goto out_put_disk;

+ err = platform_driver_register(&floppy_driver);
+ if (err)
+ goto out_unreg_blkdev;
+
floppy_queue = blk_init_queue(do_fd_request, &floppy_lock);
if (!floppy_queue) {
err = -ENOMEM;
- goto out_unreg_blkdev;
+ goto out_unreg_driver;
}
blk_queue_max_sectors(floppy_queue, 64);

@@ -4352,6 +4378,8 @@ out_flush_work:
out_unreg_region:
blk_unregister_region(MKDEV(FLOPPY_MAJOR, 0), 256);
blk_cleanup_queue(floppy_queue);
+out_unreg_driver:
+ platform_driver_unregister(&floppy_driver);
out_unreg_blkdev:
unregister_blkdev(FLOPPY_MAJOR, "fd");
out_put_disk:
@@ -4543,6 +4571,7 @@ void cleanup_module(void)
init_completion(&device_release);
blk_unregister_region(MKDEV(FLOPPY_MAJOR, 0), 256);
unregister_blkdev(FLOPPY_MAJOR, "fd");
+ platform_driver_unregister(&floppy_driver);

for (drive = 0; drive < N_DRIVE; drive++) {
del_timer_sync(&motor_off_timer[drive]);

2006-11-12 17:53:32

by Mikael Pettersson

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

On Sun, 12 Nov 2006 16:47:12 +0100, Ingo Molnar wrote:
>* Mikael Pettersson <[email protected]> wrote:
>
>> On my old Dell Latitude laptop, the first access to the floppy after
>> having resumed from APM suspend fails miserably and generates these
>> kernel messages (from 2.6.19-rc5):
>[...]
>
>> It's only the first post-resume access that triggers this failure,
>> subsequent accesses do work.
>>
>> I've traced the cause to Ingo's lockdep patch in 2.6.18-rc1 (see
>> below): reverting it makes the floppy work after resume again.
>
>could you check the patch below? I had to add a platform driver to
>floppy.c to get suspend/resume callbacks, but otherwise it's relatively
>straightforward.

Sorry, no joy. The first access post-resume still fails and generates:

floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 0
Buffer I/O error on device fd0, logical block 0
floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 8
Buffer I/O error on device fd0, logical block 1
floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 16
Buffer I/O error on device fd0, logical block 2
floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 24
Buffer I/O error on device fd0, logical block 3
floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 32
Buffer I/O error on device fd0, logical block 4
floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 40
Buffer I/O error on device fd0, logical block 5
floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 48
Buffer I/O error on device fd0, logical block 6
floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 56
Buffer I/O error on device fd0, logical block 7
floppy0: disk absent or changed during operation
end_request: I/O error, dev fd0, sector 0
Buffer I/O error on device fd0, logical block 0

The 2nd etc post-resume accesses work like before.

Your patch did change the failure behaviour slightly.
Previously the first post-resume access would experience
a delay of about a second or so, and then report an error.
With this patch the error is immediate.

/Mikael

> Ingo
>
>----------------------->
>Subject: [patch] floppy: suspend/resume fix
>From: Ingo Molnar <[email protected]>
>
>introduce a floppy platform-driver and suspend/resume ops to
>stop/start the floppy driver. Bug reported by Mikael Pettersson.
>
>Signed-off-by: Ingo Molnar <[email protected]>
>---
> drivers/block/floppy.c | 31 ++++++++++++++++++++++++++++++-
> 1 file changed, 30 insertions(+), 1 deletion(-)
>
>Index: linux/drivers/block/floppy.c
>===================================================================
>--- linux.orig/drivers/block/floppy.c
>+++ linux/drivers/block/floppy.c
>@@ -4157,6 +4157,28 @@ static void floppy_device_release(struct
> complete(&device_release);
> }
>
>+static int floppy_suspend(struct platform_device *dev, pm_message_t state)
>+{
>+ floppy_release_irq_and_dma();
>+
>+ return 0;
>+}
>+
>+static int floppy_resume(struct platform_device *dev)
>+{
>+ floppy_grab_irq_and_dma();
>+
>+ return 0;
>+}
>+
>+static struct platform_driver floppy_driver = {
>+ .suspend = floppy_suspend,
>+ .resume = floppy_resume,
>+ .driver = {
>+ .name = "floppy",
>+ },
>+};
>+
> static struct platform_device floppy_device[N_DRIVE];
>
> static struct kobject *floppy_find(dev_t dev, int *part, void *data)
>@@ -4205,10 +4227,14 @@ static int __init floppy_init(void)
> if (err)
> goto out_put_disk;
>
>+ err = platform_driver_register(&floppy_driver);
>+ if (err)
>+ goto out_unreg_blkdev;
>+
> floppy_queue = blk_init_queue(do_fd_request, &floppy_lock);
> if (!floppy_queue) {
> err = -ENOMEM;
>- goto out_unreg_blkdev;
>+ goto out_unreg_driver;
> }
> blk_queue_max_sectors(floppy_queue, 64);
>
>@@ -4352,6 +4378,8 @@ out_flush_work:
> out_unreg_region:
> blk_unregister_region(MKDEV(FLOPPY_MAJOR, 0), 256);
> blk_cleanup_queue(floppy_queue);
>+out_unreg_driver:
>+ platform_driver_unregister(&floppy_driver);
> out_unreg_blkdev:
> unregister_blkdev(FLOPPY_MAJOR, "fd");
> out_put_disk:
>@@ -4543,6 +4571,7 @@ void cleanup_module(void)
> init_completion(&device_release);
> blk_unregister_region(MKDEV(FLOPPY_MAJOR, 0), 256);
> unregister_blkdev(FLOPPY_MAJOR, "fd");
>+ platform_driver_unregister(&floppy_driver);
>
> for (drive = 0; drive < N_DRIVE; drive++) {
> del_timer_sync(&motor_off_timer[drive]);
>

2006-11-12 18:10:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix


* Mikael Pettersson <[email protected]> wrote:

> Sorry, no joy. The first access post-resume still fails and generates:

ok, then someone who knows the floppy driver better than me should put
the right stuff into the suspend/resume hooks :-)

Ingo

2006-11-12 19:30:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

On Sun, 12 Nov 2006 19:09:53 +0100
Ingo Molnar <[email protected]> wrote:

>
> * Mikael Pettersson <[email protected]> wrote:
>
> > Sorry, no joy. The first access post-resume still fails and generates:
>
> ok, then someone who knows the floppy driver better than me should put
> the right stuff into the suspend/resume hooks :-)

I don't think anyone understands the floppy driver.

How about we just revert the lockdep change?

2006-11-12 19:40:12

by Russell King

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

On Sun, Nov 12, 2006 at 07:09:53PM +0100, Ingo Molnar wrote:
>
> * Mikael Pettersson <[email protected]> wrote:
>
> > Sorry, no joy. The first access post-resume still fails and generates:
>
> ok, then someone who knows the floppy driver better than me should put
> the right stuff into the suspend/resume hooks :-)

At a guess, what's probably happening is that the floppy drive, when
powered on after resume, reports "disk changed" because it doesn't
know any better.

We interpret "disk changed" to mean the disk has been removed and
possibly changed (which _is_ correct) and thereby abort any further
IO (irrespective of resume.)

Now, consider the following two scenarios:

1. You suspend and then resume, leaving the disk in the floppy drive.

2. You suspend, remove the floppy disk, insert a totally different disk
in the same drive, and then resume.

What should you do? (Hint: without reading the disk and comparing it
with what you have cached you don't know if the disk has been changed
or not.)

If you argue that in case (1) you should continue to allow IO, then
you potentially end up scribbling over a disk when someone does (2).

So I'd argue that the behaviour being seen by Mikael is the _safest_
behaviour, and the most correct behaviour given the limitations of
the hardware.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2006-11-12 20:35:22

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

On Sun, 2006-11-12 at 11:30 -0800, Andrew Morton wrote:
> On Sun, 12 Nov 2006 19:09:53 +0100
> Ingo Molnar <[email protected]> wrote:
>
> >
> > * Mikael Pettersson <[email protected]> wrote:
> >
> > > Sorry, no joy. The first access post-resume still fails and generates:
> >
> > ok, then someone who knows the floppy driver better than me should put
> > the right stuff into the suspend/resume hooks :-)
>
> I don't think anyone understands the floppy driver.
>
> How about we just revert the lockdep change?

the lockdep change wasn't "just" an annotation, but an actual bugfix
though :(

but yeah arguably for a bug that isn't hit much :(

--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2006-11-12 20:47:28

by Mikael Pettersson

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

On Sun, 12 Nov 2006 19:40:01 +0000, Russell King wrote:
> On Sun, Nov 12, 2006 at 07:09:53PM +0100, Ingo Molnar wrote:
> >
> > * Mikael Pettersson <[email protected]> wrote:
> >
> > > Sorry, no joy. The first access post-resume still fails and generates:
> >
> > ok, then someone who knows the floppy driver better than me should put
> > the right stuff into the suspend/resume hooks :-)
>
> At a guess, what's probably happening is that the floppy drive, when
> powered on after resume, reports "disk changed" because it doesn't
> know any better.
>
> We interpret "disk changed" to mean the disk has been removed and
> possibly changed (which _is_ correct) and thereby abort any further
> IO (irrespective of resume.)
>
> Now, consider the following two scenarios:
>
> 1. You suspend and then resume, leaving the disk in the floppy drive.
>
> 2. You suspend, remove the floppy disk, insert a totally different disk
> in the same drive, and then resume.
>
> What should you do? (Hint: without reading the disk and comparing it
> with what you have cached you don't know if the disk has been changed
> or not.)
>
> If you argue that in case (1) you should continue to allow IO, then
> you potentially end up scribbling over a disk when someone does (2).
>
> So I'd argue that the behaviour being seen by Mikael is the _safest_
> behaviour, and the most correct behaviour given the limitations of
> the hardware.

Note that my usage scenario consists of separate I/O commands:

1. boot
2. tar cf /dev/fd0 somefile
3. tar tvf /dev/fd0
4. suspend (close lid) and later resume (open lid)
5. tar tvf /dev/fd0 (this one fails since 2.6.18-rc1)

Ages and ages ago the floppy driver would cache data so that
after e.g. one "tar tvf /dev/fd0" command, a new "tar tvf /dev/fd0"
wouldn't actually do any I/O. But nowadays, at least in all 2.6
kernels, that's not done and both "tar tvf /dev/fd0" commands
will read the media.

Thus I don't see any excuse for the kernel failing step 5 above.

/Mikael

2006-11-12 21:29:49

by Russell King

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

On Sun, Nov 12, 2006 at 09:47:08PM +0100, Mikael Pettersson wrote:
> On Sun, 12 Nov 2006 19:40:01 +0000, Russell King wrote:
> > On Sun, Nov 12, 2006 at 07:09:53PM +0100, Ingo Molnar wrote:
> > >
> > > * Mikael Pettersson <[email protected]> wrote:
> > >
> > > > Sorry, no joy. The first access post-resume still fails and generates:
> > >
> > > ok, then someone who knows the floppy driver better than me should put
> > > the right stuff into the suspend/resume hooks :-)
> >
> > At a guess, what's probably happening is that the floppy drive, when
> > powered on after resume, reports "disk changed" because it doesn't
> > know any better.
> >
> > We interpret "disk changed" to mean the disk has been removed and
> > possibly changed (which _is_ correct) and thereby abort any further
> > IO (irrespective of resume.)
> >
> > Now, consider the following two scenarios:
> >
> > 1. You suspend and then resume, leaving the disk in the floppy drive.
> >
> > 2. You suspend, remove the floppy disk, insert a totally different disk
> > in the same drive, and then resume.
> >
> > What should you do? (Hint: without reading the disk and comparing it
> > with what you have cached you don't know if the disk has been changed
> > or not.)
> >
> > If you argue that in case (1) you should continue to allow IO, then
> > you potentially end up scribbling over a disk when someone does (2).
> >
> > So I'd argue that the behaviour being seen by Mikael is the _safest_
> > behaviour, and the most correct behaviour given the limitations of
> > the hardware.
>
> Note that my usage scenario consists of separate I/O commands:
>
> 1. boot
> 2. tar cf /dev/fd0 somefile
> 3. tar tvf /dev/fd0
> 4. suspend (close lid) and later resume (open lid)
> 5. tar tvf /dev/fd0 (this one fails since 2.6.18-rc1)
>
> Ages and ages ago the floppy driver would cache data so that
> after e.g. one "tar tvf /dev/fd0" command, a new "tar tvf /dev/fd0"
> wouldn't actually do any I/O. But nowadays, at least in all 2.6
> kernels, that's not done and both "tar tvf /dev/fd0" commands
> will read the media.
>
> Thus I don't see any excuse for the kernel failing step 5 above.

In which case isn't the real regression that it does IO?

Nevertheless, I give you two options:

1. Abort all IO do inserted floppy disk after resume.
2. Corrupt replaced floppy disk after resume.

You have to pick one and exactly one. Which is inherently less risky to
the end user?

(I'm not saying that I can change the behaviour; I'm merely trying to
point out that you're arguing for an inherently dangerous behaviour.)

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2006-11-12 21:44:46

by Bodo Eggert

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

Russell King <[email protected]> wrote:

> At a guess, what's probably happening is that the floppy drive, when
> powered on after resume, reports "disk changed" because it doesn't
> know any better.
>
> We interpret "disk changed" to mean the disk has been removed and
> possibly changed (which _is_ correct) and thereby abort any further
> IO (irrespective of resume.)
>
> Now, consider the following two scenarios:
>
> 1. You suspend and then resume, leaving the disk in the floppy drive.
>
> 2. You suspend, remove the floppy disk, insert a totally different disk
> in the same drive, and then resume.
>
> What should you do? (Hint: without reading the disk and comparing it
> with what you have cached you don't know if the disk has been changed
> or not.)
>
> If you argue that in case (1) you should continue to allow IO, then
> you potentially end up scribbling over a disk when someone does (2).
>
> So I'd argue that the behaviour being seen by Mikael is the _safest_
> behaviour, and the most correct behaviour given the limitations of
> the hardware.

- If a user suspends with a floppy in the drive, it will mostly be an error,
and he'll unsuspend in order to correct it.
- If it is no error, putting a different/modified floppy into the drive
before resume is unlikely
- Even if somebody does this, you can mostly detect the different disk
by comparing the first sector or just the FAT "serial number".

Therefore you can implement a relatively safe resume that will mostly DTRT
but destroy data in some unlikely cases, while doing the "safe thing" would
mostly cause a harmless (unless not noticed) kind of data loss and sometimes
safe you from real data loss.

I think you should let the user choose which foot to shoot.
--
Ich danke GMX daf?r, die Verwendung meiner Adressen mittels per SPF
verbreiteten L?gen zu sabotieren.

http://david.woodhou.se/why-not-spf.html

2006-11-12 22:04:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix


* Russell King <[email protected]> wrote:

> In which case isn't the real regression that it does IO?
>
> Nevertheless, I give you two options:
>
> 1. Abort all IO do inserted floppy disk after resume.
> 2. Corrupt replaced floppy disk after resume.
>
> You have to pick one and exactly one. Which is inherently less risky
> to the end user?

this isnt about in-flight IO (suspend doesnt succeed if IO is in flight
anyway). The bug is this:

1) you use the floppy and then stop using it
2) 1 hour passes. Nothing uses the floppy.
3) you suspend and later resume
4) another hour passes. Nothing uses the floppy.
5) you try to use the floppy: you get a bunch of IO errors!
6) you try to use the floppy again: this time it works

that's the regression. For some reason suspend/resume puts the floppy
hardware into a state that confuses the floppy driver.

my patch adds all the right suspend/resume hooks for this, without
reintroducing the bug that was noticed by lockdep and which was fixed by
my patch that introduced this regression - we just have to figure out
what to do upon resume to get the floppy driver and the hardware match
up each other, without passing spurious IO errors to the user.

Ingo

2006-11-12 22:40:50

by Mikael Pettersson

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

On Sun, 12 Nov 2006 22:44:28 +0100, Bodo Eggert wrote:
> > 1. You suspend and then resume, leaving the disk in the floppy drive.
> >
> > 2. You suspend, remove the floppy disk, insert a totally different disk
> > in the same drive, and then resume.
> >
> > What should you do? (Hint: without reading the disk and comparing it
> > with what you have cached you don't know if the disk has been changed
> > or not.)
> >
> > If you argue that in case (1) you should continue to allow IO, then
> > you potentially end up scribbling over a disk when someone does (2).
> >
> > So I'd argue that the behaviour being seen by Mikael is the _safest_
> > behaviour, and the most correct behaviour given the limitations of
> > the hardware.
>
> - If a user suspends with a floppy in the drive, it will mostly be an error,
> and he'll unsuspend in order to correct it.
> - If it is no error, putting a different/modified floppy into the drive
> before resume is unlikely
> - Even if somebody does this, you can mostly detect the different disk
> by comparing the first sector or just the FAT "serial number".
>
> Therefore you can implement a relatively safe resume that will mostly DTRT
> but destroy data in some unlikely cases, while doing the "safe thing" would
> mostly cause a harmless (unless not noticed) kind of data loss and sometimes
> safe you from real data loss.

The bug occurs regardless of whether I leave the floppy disc in the drive
during suspend or not. 2.6.19-rc5 (vanilla and with Ingo's suspend/resume
hooks) fails the following use case as well:

1. boot
2. insert floppy disc
3. tar tvf /dev/fd0 (works)
4. manually eject floppy disc
5. suspend, later resume
6. insert floppy disc
7. tar tvf /dev/fd0 (fails with I/O errors)
8. tar tvf /dev/fd0 (works)

Like Ingo said, something happens to the HW during suspend and we
need to figure out how to reinitialise the HW and the driver so that
things work immediately after resume.

/Mikael

2006-11-12 23:54:18

by Russell King

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

On Sun, Nov 12, 2006 at 11:03:18PM +0100, Ingo Molnar wrote:
>
> * Russell King <[email protected]> wrote:
>
> > In which case isn't the real regression that it does IO?
> >
> > Nevertheless, I give you two options:
> >
> > 1. Abort all IO do inserted floppy disk after resume.
> > 2. Corrupt replaced floppy disk after resume.
> >
> > You have to pick one and exactly one. Which is inherently less risky
> > to the end user?
>
> this isnt about in-flight IO (suspend doesnt succeed if IO is in flight
> anyway).

I wasn't talking about in-flight IO. Take a moment to think about it.

- You have a floppy inserted and mounted.
- You write a file to it, and then suspend the machine.
*** No IO was in progress when the suspend occurred.
- You remove the floppy disk and insert a different disk
- You resume
- The kernel submits the dirty buffers for writing out to the disk.

That would lead to a corrupted floppy disk.

> The bug is this:
>
> 1) you use the floppy and then stop using it
> 2) 1 hour passes. Nothing uses the floppy.
> 3) you suspend and later resume
> 4) another hour passes. Nothing uses the floppy.
> 5) you try to use the floppy: you get a bunch of IO errors!
> 6) you try to use the floppy again: this time it works
>
> that's the regression. For some reason suspend/resume puts the floppy
> hardware into a state that confuses the floppy driver.
>
> my patch adds all the right suspend/resume hooks for this, without
> reintroducing the bug that was noticed by lockdep and which was fixed by
> my patch that introduced this regression - we just have to figure out
> what to do upon resume to get the floppy driver and the hardware match
> up each other, without passing spurious IO errors to the user.

You need to step the head off track 0 and back again. That clears the
disk changed status on the floppy drive hardware.

Now, if you'd actually taken the time to read my messages you'd have
realised that the floppy drive hardware defaults to reporting a
"disk changed" status after power on. Ergo, after resume, it says
"disk changed" as well.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2006-11-12 23:58:16

by Russell King

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

On Sun, Nov 12, 2006 at 11:40:28PM +0100, Mikael Pettersson wrote:
> The bug occurs regardless of whether I leave the floppy disc in the drive
> during suspend or not. 2.6.19-rc5 (vanilla and with Ingo's suspend/resume
> hooks) fails the following use case as well:
>
> 1. boot
> 2. insert floppy disc
> 3. tar tvf /dev/fd0 (works)
> 4. manually eject floppy disc
> 5. suspend, later resume
> 6. insert floppy disc
> 7. tar tvf /dev/fd0 (fails with I/O errors)
> 8. tar tvf /dev/fd0 (works)
>
> Like Ingo said, something happens to the HW during suspend and we
> need to figure out how to reinitialise the HW and the driver so that
> things work immediately after resume.

Now this is interesting - I know there's been a long standing bug with
kernels on my Thinkpad which behave in a similar way to your description
above. Basically whenever I change the disk in the drive I tend to need
_two_ goes to do anything with it - the first mostly always fails with
IO errors.

I've not bothered to report it because it runs a _very_ old 2.6 kernel
and upgrading the machine to something modern would be extremely painful
(read - would need new hard drive, and probably more RAM than the laptop
can take.)

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2006-11-14 11:05:55

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

Hi!

> - Even if somebody does this, you can mostly detect the different disk
> by comparing the first sector or just the FAT "serial number".

Feel free to send a patch; it would be useful, but it would probably
also mean an ugly layering violation.

> I think you should let the user choose which foot to shoot.

-ENOPATCH.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-14 11:10:15

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

Hi!

> > > Nevertheless, I give you two options:
> > >
> > > 1. Abort all IO do inserted floppy disk after resume.
> > > 2. Corrupt replaced floppy disk after resume.
> > >
> > > You have to pick one and exactly one. Which is inherently less risky
> > > to the end user?
> >
> > this isnt about in-flight IO (suspend doesnt succeed if IO is in flight
> > anyway).
>
> I wasn't talking about in-flight IO. Take a moment to think about it.
>
> - You have a floppy inserted and mounted.

Notice that Ingo is not talking about floppy being mounted.

> - You write a file to it, and then suspend the machine.
> *** No IO was in progress when the suspend occurred.
> - You remove the floppy disk and insert a different disk
> - You resume
> - The kernel submits the dirty buffers for writing out to the disk.
>
> That would lead to a corrupted floppy disk.

Suspending with mounted floppy is a user error. Write tarball to a
floppy, suspend, resume, write another tarball to a floppy is not, and
we want to fix that.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-14 16:34:58

by Lee Revell

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

On Tue, 2006-11-14 at 12:09 +0100, Pavel Machek wrote:
> Suspending with mounted floppy is a user error.

Huh? How so?

Lee

2006-11-15 18:48:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix


* Pavel Machek <[email protected]> wrote:

> > I wasn't talking about in-flight IO. Take a moment to think about
> > it.
> >
> > - You have a floppy inserted and mounted.
>
> Notice that Ingo is not talking about floppy being mounted.

yeah. I was thinking in terms of "mdir a:".

if a floppy is mounted then i agree that suspending is probably a bit
too much to expect from the kernel - but this particular bug affects
normal mtool use (which you can think of to be equivalent to mounting,
using and then umounting the floppy).

Ingo

2006-11-15 18:54:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix


* Russell King <[email protected]> wrote:

> On Sun, Nov 12, 2006 at 11:40:28PM +0100, Mikael Pettersson wrote:
> > The bug occurs regardless of whether I leave the floppy disc in the drive
> > during suspend or not. 2.6.19-rc5 (vanilla and with Ingo's suspend/resume
> > hooks) fails the following use case as well:
> >
> > 1. boot
> > 2. insert floppy disc
> > 3. tar tvf /dev/fd0 (works)
> > 4. manually eject floppy disc
> > 5. suspend, later resume
> > 6. insert floppy disc
> > 7. tar tvf /dev/fd0 (fails with I/O errors)
> > 8. tar tvf /dev/fd0 (works)
> >
> > Like Ingo said, something happens to the HW during suspend and we
> > need to figure out how to reinitialise the HW and the driver so that
> > things work immediately after resume.
>
> Now this is interesting - I know there's been a long standing bug with
> kernels on my Thinkpad which behave in a similar way to your
> description above. Basically whenever I change the disk in the drive
> I tend to need _two_ goes to do anything with it - the first mostly
> always fails with IO errors.

yeah. But somehow the pre-lockdep-change driver gets this right - purely
by virtue of unregistering the IRQ line and the DMA channel - neither of
which should have any material effect on behavior ... [and when i
restored this in suspend/resume it didnt fix the bug] Weird.

Ingo

2006-11-15 20:24:35

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

On Tue 2006-11-14 11:34:21, Lee Revell wrote:
> On Tue, 2006-11-14 at 12:09 +0100, Pavel Machek wrote:
> > Suspending with mounted floppy is a user error.
>
> Huh? How so?

Floppy is removable, and you are expected to umount removable devices
before suspend.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-15 20:35:30

by Lee Revell

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

On Wed, 2006-11-15 at 21:24 +0100, Pavel Machek wrote:
> On Tue 2006-11-14 11:34:21, Lee Revell wrote:
> > On Tue, 2006-11-14 at 12:09 +0100, Pavel Machek wrote:
> > > Suspending with mounted floppy is a user error.
> >
> > Huh? How so?
>
> Floppy is removable, and you are expected to umount removable devices
> before suspend.

Ah, OK. I read that as "it's an error for the user to close the lid
with a floppy or CD in the drive". But really HAL or the desktop
environment or whatever is supposed to do it...

Lee

2006-11-15 20:44:04

by Alan

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

On Wed, 15 Nov 2006 21:24:18 +0100
Pavel Machek <[email protected]> wrote:

> On Tue 2006-11-14 11:34:21, Lee Revell wrote:
> > On Tue, 2006-11-14 at 12:09 +0100, Pavel Machek wrote:
> > > Suspending with mounted floppy is a user error.
> >
> > Huh? How so?
>
> Floppy is removable, and you are expected to umount removable devices
> before suspend.

That seems pretty crude. There are lots of cases where an apparently
removable device is/should be preserved properly and left mounted (eg
builtin CF).

We really want to be smarter than that - which means the drivers ought to
be doing stuff in their suspend/resume paths to figure out if the media
changed when really possible (eg IDE removable)

Floppy is probably not too fixable, but calling it a "user error" is
insulting - user expectation is reasonable that suspend/resume should
just work. The implementation is just rather trickier/nonsensical in this
case.

2006-11-15 20:49:49

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

Hi!

> > > > Suspending with mounted floppy is a user error.
> > >
> > > Huh? How so?
> >
> > Floppy is removable, and you are expected to umount removable devices
> > before suspend.
>
> That seems pretty crude. There are lots of cases where an apparently
> removable device is/should be preserved properly and left mounted (eg
> builtin CF).
>
> We really want to be smarter than that - which means the drivers ought to
> be doing stuff in their suspend/resume paths to figure out if the media
> changed when really possible (eg IDE removable)
>
> Floppy is probably not too fixable, but calling it a "user error" is
> insulting - user expectation is reasonable that suspend/resume should
> just work. The implementation is just rather trickier/nonsensical in this
> case.

Yep, it would be nice to do something about that; but I'm not sure how
this "was media changed" should be implemented, and if it should be
done in kernel or in userland.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-15 21:06:26

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

On Wednesday, 15 November 2006 21:49, Pavel Machek wrote:
> Hi!
>
> > > > > Suspending with mounted floppy is a user error.
> > > >
> > > > Huh? How so?
> > >
> > > Floppy is removable, and you are expected to umount removable devices
> > > before suspend.
> >
> > That seems pretty crude. There are lots of cases where an apparently
> > removable device is/should be preserved properly and left mounted (eg
> > builtin CF).
> >
> > We really want to be smarter than that - which means the drivers ought to
> > be doing stuff in their suspend/resume paths to figure out if the media
> > changed when really possible (eg IDE removable)
> >
> > Floppy is probably not too fixable, but calling it a "user error" is
> > insulting - user expectation is reasonable that suspend/resume should
> > just work. The implementation is just rather trickier/nonsensical in this
> > case.
>
> Yep, it would be nice to do something about that; but I'm not sure how
> this "was media changed" should be implemented, and if it should be
> done in kernel or in userland.

<heresy>
Use something like subfs?
</heresy>

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-11-15 21:14:42

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix


> Yep, it would be nice to do something about that; but I'm not sure how
> this "was media changed" should be implemented, and if it should be
> done in kernel or in userland.

well I guess step 1 is to sync_bdev() or whatever it is called nowadays
before suspend. And maybe force unmount on resume always?


--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2006-11-16 12:39:04

by Russell King

[permalink] [raw]
Subject: Re: [patch] floppy: suspend/resume fix

On Wed, Nov 15, 2006 at 09:49:33PM +0100, Pavel Machek wrote:
> > > > > Suspending with mounted floppy is a user error.
> > > >
> > > > Huh? How so?
> > >
> > > Floppy is removable, and you are expected to umount removable devices
> > > before suspend.
> >
> > That seems pretty crude. There are lots of cases where an apparently
> > removable device is/should be preserved properly and left mounted (eg
> > builtin CF).
> >
> > We really want to be smarter than that - which means the drivers ought to
> > be doing stuff in their suspend/resume paths to figure out if the media
> > changed when really possible (eg IDE removable)
> >
> > Floppy is probably not too fixable, but calling it a "user error" is
> > insulting - user expectation is reasonable that suspend/resume should
> > just work. The implementation is just rather trickier/nonsensical in this
> > case.
>
> Yep, it would be nice to do something about that; but I'm not sure how
> this "was media changed" should be implemented, and if it should be
> done in kernel or in userland.

With a floppy the only way to do that is to read data off the disk and
compare it with what we know was on the disk prior to suspend/resume.

Since we don't have a "standard floppy format" (since we're flexible)
you can't rely on MSDOS boot sectors and the like to uniquely identify
floppy disks.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core