Hi!
Here's patch to prevent random scribling over disks during
suspend... In the meantime alan killed (unreferenced at that time)
idedisk_suspend() and idedisk_release(), so I have to reintroduce
them.
Now... There is *another* do_idedisk_suspend / do_idedisk_resume
method in ide_disk.c. As far as I can see, it is not used anywhere,
and is certainly not enough to prevent corruption on swsusp. [Besides,
doing suspend/resume support generically using devicefs, not with ide
stuff seems more right.] Should I go ahead and kill do_idedisk_suspend
and related code?
Pavel
--- clean/drivers/ide/ide-disk.c 2002-11-01 00:37:14.000000000 +0100
+++ linux-swsusp/drivers/ide/ide-disk.c 2002-11-01 21:36:54.000000000 +0100
@@ -1561,6 +1561,56 @@
#endif
}
+static int idedisk_suspend(struct device *dev, u32 state, u32 level)
+{
+ ide_drive_t *drive = dev->driver_data;
+
+ printk("Suspending device %p\n", dev->driver_data);
+
+ /* I hope that every freeze operation from the upper levels have
+ * already been done...
+ */
+
+ if (level != SUSPEND_SAVE_STATE)
+ return 0;
+ BUG_ON(in_interrupt());
+
+ printk("Waiting for commands to finish\n");
+
+ /* wait until all commands are finished */
+ /* FIXME: waiting for spinlocks should be done instead. */
+ if (!(HWGROUP(drive)))
+ printk("No hwgroup?\n");
+ while (HWGROUP(drive)->handler)
+ yield();
+
+ /* set the drive to standby */
+ printk(KERN_INFO "suspending: %s ", drive->name);
+ if (drive->driver) {
+ if (drive->driver->standby)
+ drive->driver->standby(drive);
+ }
+ drive->blocked = 1;
+
+ while (HWGROUP(drive)->handler)
+ yield();
+
+ return 0;
+}
+
+static int idedisk_resume(struct device *dev, u32 level)
+{
+ ide_drive_t *drive = dev->driver_data;
+
+ if (level != RESUME_RESTORE_STATE)
+ return 0;
+ if (!drive->blocked)
+ panic("ide: Resume but not suspended?\n");
+
+ drive->blocked = 0;
+ return 0;
+}
+
static void idedisk_setup (ide_drive_t *drive)
{
struct hd_driveid *id = drive->id;
@@ -1709,6 +1759,10 @@
.proc = idedisk_proc,
.attach = idedisk_attach,
.drives = LIST_HEAD_INIT(idedisk_driver.drives),
+ .gen_driver = {
+ .suspend = idedisk_suspend,
+ .resume = idedisk_resume,
+ }
};
static int idedisk_open(struct inode *inode, struct file *filp)
@@ -1835,8 +1889,7 @@
static int idedisk_init (void)
{
- ide_register_driver(&idedisk_driver);
- return 0;
+ return ide_register_driver(&idedisk_driver);
}
module_init(idedisk_init);
--- clean/drivers/ide/ide-probe.c 2002-11-01 00:37:14.000000000 +0100
+++ linux-swsusp/drivers/ide/ide-probe.c 2002-11-01 01:26:37.000000000 +0100
@@ -1054,6 +1054,7 @@
"%s","IDE Drive");
drive->gendev.parent = &hwif->gendev;
drive->gendev.bus = &ide_bus_type;
+ drive->gendev.driver_data = drive;
sprintf (name, "host%d/bus%d/target%d/lun%d",
(hwif->channel && hwif->mate) ?
hwif->mate->index : hwif->index,
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?
In article <[email protected]>,
Pavel Machek <[email protected]> wrote:
>
>Here's patch to prevent random scribling over disks during
>suspend... In the meantime alan killed (unreferenced at that time)
>idedisk_suspend() and idedisk_release(), so I have to reintroduce
>them.
I _still_ haven't gotten an explanation for the difference between the
do_xxx_suspend and xxxx_suspend, and why we have both.
> Should I go ahead and kill do_idedisk_suspend and related code?
Please. Along with an explanation of what the differences are, and who
cares, and who calls which, and why having two different versions aren't
needed any more (or if they _are_ needed, explain that). Right now I'm
not applying this patch just because I want an explanation of it (like I
did last time).
The two different cases are just too confusing, I want to be unconfused.
Thanks,
Linus
Hi!
> > Here's patch to prevent random scribling over disks during
> > suspend... In the meantime alan killed (unreferenced at that time)
> > idedisk_suspend() and idedisk_release(), so I have to reintroduce
> > them.
>
> Please fix this at a different level. idedisk is not the place to be
> doing this. If the device layer is doing the right thing then the
> request queue will be idle.
How can I wait for all requests to be finished? Should it be at
ll_rw_blk level?
Anyway if I want disk to be spinned down (so their caches are
flushed), some kind of idedisk_suspend is required, right?
> > + .gen_driver = {
> > + .suspend = idedisk_suspend,
> > + .resume = idedisk_resume,
> > + }
>
> Some disks are going to be settint their own power methods too.
Fine with me as long as they call idedisk_suspend() first ;-).
Pavel
--
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.
> Here's patch to prevent random scribling over disks during
> suspend... In the meantime alan killed (unreferenced at that time)
> idedisk_suspend() and idedisk_release(), so I have to reintroduce
> them.
Please fix this at a different level. idedisk is not the place to be
doing this. If the device layer is doing the right thing then the
request queue will be idle.
> + .gen_driver = {
> + .suspend = idedisk_suspend,
> + .resume = idedisk_resume,
> + }
Some disks are going to be settint their own power methods too.
On Sat, 2002-11-02 at 20:25, Pavel Machek wrote:
> > Some disks are going to be settint their own power methods too.
>
> Fine with me as long as they call idedisk_suspend() first ;-).
They may or may not do so depending upon other considerations. The new
driver layer is supposed to handle suspend/power ordering. If it doesn't
then it needs fixing. You can't go around hacking weird suspend shit
into every single block driver for a final system, sure for a prototype
but not the real thing. No way
>> Here's patch to prevent random scribling over disks during
>> suspend... In the meantime alan killed (unreferenced at that time)
>> idedisk_suspend() and idedisk_release(), so I have to reintroduce
>> them.
>
>Please fix this at a different level. idedisk is not the place to be
>doing this. If the device layer is doing the right thing then the
>request queue will be idle.
Hrm... I don't think so Alan. The PM ordering is bus driven,
so actual bus binding of the disk is it's controller, not
the request queue which is the functional binding. It's up to
the disk driver to shut down processing of the request queue.
On the same idea, it's not the network layer that will stop
sending packets to an eth driver, it's the eth driver that
gets suspended by it's bus binding (PCI for example) that
will eventually request the network layer not to send it
any more packets (netif_stop_queue() typically).
So in our case, what we want is the pci controller driver
getting the suspend request (and non-PCI IDE controller will
have to write their own bus binding according to the new model).
It will then tell it's own subdevices (disks in this case) to
suspend (hrm... I don't have the source at hand, I'm away for
a few days right now, does disks have struct device attached
as childs of the ATA controller ? they should...). At that
point, it's up to the _disk_ suspend() function to actually
request the block layer to stop queue processing, typically
this can be done by just not eating requests from the queue,
or better, by (un)plugging the queue on suspend/resume.
Now, there is indeed as subtle race if the disk driver wants
to queue itself a request to actually send the suspend command
to the disk. This request has to be the _last_ of the queue.
That is, the queue must be stopped right after processing this
request with no room for a new request to be taken in between.
The way I see it is that the queue should be stopped not by
the suspend function itself, but by the request processing
function when fetching that STANDBY request from the queue.
We need special care not to mixup with hdparm originated
STANDBY though (but do those go through the queue ? I'm not
sure how far the cleanup went here....)
Ben.
On Sun, 2002-11-03 at 14:57, [email protected] wrote:
> Hrm... I don't think so Alan. The PM ordering is bus driven,
> so actual bus binding of the disk is it's controller, not
> the request queue which is the functional binding. It's up to
> the disk driver to shut down processing of the request queue.
That requires code in every driver. Duplicated, hard to write, likely to
be racey code. Thats bad.
The bigger picture really should be
ACPI etc "I want to suspend to disk"
PM layer
Suspend the non I/O tasks (btw reminds me - eh tasks and
all workqueues may be I/O tasks at times)
Complete all the block I/O queues
Throw out the pages we can evict
Write suspend image
Jump to PM layer "power off" logic
If you do it that way up then no drivers need to be hacked about.
Alan
>That requires code in every driver. Duplicated, hard to write, likely to
>be racey code. Thats bad.
>
>The bigger picture really should be
>
>ACPI etc "I want to suspend to disk"
>
>PM layer
> Suspend the non I/O tasks (btw reminds me - eh tasks and
> all workqueues may be I/O tasks at times)
> Complete all the block I/O queues
> Throw out the pages we can evict
> Write suspend image
>
> Jump to PM layer "power off" logic
>
>If you do it that way up then no drivers need to be hacked about.
Hrm... thanks to the miracle of having a BIOS that will deal
with the grunt work of actually shutting down the chipsets,
resuming them, etc...
This is definitely not the case on pmac and embedded. Or did I
miss something to your explanation ?
I really don't like the above as it basically bypasse the
bus ordering, which is the only sane way I see to deal with
dependencies when the drivers are actually shutting down HW
While suspend to disk may justify suspention of all non IO
tasks etc..., when doing suspend to RAM, I get very good
results (and very fast suspend/resume cycles) by letting
just about everything run and relying on implicitly beeing
suspended as a result of relying on a service/driver that
has blocked it's queue.
But doing that properly definitely involve a precise process
to be followed by driver (though most drivers can actually
implement only a simplify version of it) so that we don't
fall in a case where a driver trying to allocate memory (to
save state for example) ends up blocking for ever because
it's hung waiting for a page to be swapped out on a device
that is already asleep.
We have debated this a lot with Patrick, Greg, Pavel, ...
and at the "improvised" PM BOF during OLS, and I really
think we should do it what I think is "the right way",
even if it seems slightly more complicated driver-side.
I'm pretty sure it's not actually _that_ complicated, and
in most case, the actual functionality can be provided by
helper routines in each functional subsystem, the important
point beeing that those have to be called by the driver at
appropriate moment so that ordering stays correct.
Also, I don't like, from the design point of view, the notion
of suspending tasks to "hope" no new IO requests will get
triggered. That involve specific coloring of various kernel
threads, and can be nasty if for any reason we end up having
IOs triggered asynchronously by non-tasks (though I'm not
sure that is possible in linux today).
Also, what about a driver that, for it's own suspend operation
needs to actually trigger an IO ? The disk is a typical case
where we want to send a STANDBY command (though the queue ?
synchronously ? after waiting for the initial queue emptyness ?).
I'd add to that the problem isn't specific to BIO queues. It's
the same problem we have to deal with URB lists, 1394 request
queues, etc... as we don't have a unified model for IO queues
(and that's good that way imho).
Anyway, all this is talk, I've started playing with moving
my pmac stuff to the new model, and I intend to actually
make things work the way I want at first as a proof of concept.
Then, I volunteer writing a HOWTO explaining clearly what a
driver should do for proper PM, and I'm pretty sure that won't
be that nasty and race prone as you are afraid of ;)
Ben.
> > Jump to PM layer "power off" logic
> >
> >If you do it that way up then no drivers need to be hacked about.
>
> Hrm... thanks to the miracle of having a BIOS that will deal
> with the grunt work of actually shutting down the chipsets,
> resuming them, etc...
As I said "jump to PM layer power off logic"
So after you have suspended you neatly power it all off
> I really don't like the above as it basically bypasse the
> bus ordering, which is the only sane way I see to deal with
> dependencies when the drivers are actually shutting down HW
Bus ordering applies to power off not to suspend to disk sequence
> Then, I volunteer writing a HOWTO explaining clearly what a
> driver should do for proper PM, and I'm pretty sure that won't
> be that nasty and race prone as you are afraid of ;)
Good. It'll be nice to have suspend to disk in 2.7
Hi!
> > > Some disks are going to be settint their own power methods too.
> >
> > Fine with me as long as they call idedisk_suspend() first ;-).
>
> They may or may not do so depending upon other considerations. The new
> driver layer is supposed to handle suspend/power ordering. If it doesn't
> then it needs fixing. You can't go around hacking weird suspend shit
> into every single block driver for a final system, sure for a prototype
> but not the real thing. No way
Suspend *needs* to touch all drivers.
I do stopping at high levels already (all tasks are put into
refrigerator), but that can still me that DMA on particular driver is
in progress. Only driver itself knows if it is doing any DMA and how
to stop it. I do not see how to do it on upper layers.
[Take for example USB. It is not block device, still it does DMA which
means it has to be stopped.]
Pavel
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?
Hi!
> > Hrm... I don't think so Alan. The PM ordering is bus driven,
> > so actual bus binding of the disk is it's controller, not
> > the request queue which is the functional binding. It's up to
> > the disk driver to shut down processing of the request queue.
>
> That requires code in every driver. Duplicated, hard to write, likely to
> be racey code. Thats bad.
>
> The bigger picture really should be
>
> ACPI etc "I want to suspend to disk"
>
> PM layer
> Suspend the non I/O tasks (btw reminds me - eh tasks and
> all workqueues may be I/O tasks at times)
> Complete all the block I/O queues
> Throw out the pages we can evict
...DMA from disk may be still running here...
> Write suspend image
>
> Jump to PM layer "power off" logic
>
> If you do it that way up then no drivers need to be hacked about.
...and at resume you find out that your memory is not consistent
because DMA was still running when you were doing copy.
Pavel
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?
On Sun, 2002-11-03 at 20:12, Pavel Machek wrote:
> > Throw out the pages we can evict
>
> ...DMA from disk may be still running here...
Only if a request is still active and therfore the queue is not quiesced
> ...and at resume you find out that your memory is not consistent
> because DMA was still running when you were doing copy.
I can see how that can be a problem for some other things but not block
devices.
Hi!
> > > Throw out the pages we can evict
> >
> > ...DMA from disk may be still running here...
>
> Only if a request is still active and therfore the queue is not
> quiesced
>
How do I quiesce a queue? Is it ll_rw_blk stuff?
>
> > ...and at resume you find out that your memory is not consistent
> > because DMA was still running when you were doing copy.
>
> I can see how that can be a problem for some other things but not block
> devices.
You are probably right that for ide disk quiescing a queue is enough,
but nothing prevents block device to do some DMA just for fun. Also I
want to spindown on suspend (andre wanted that, to flush caches), so I
guess that the patch is quite good as-is....
Pavel
--
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.
On Sun, 2002-11-03 at 22:09, Pavel Machek wrote:
> You are probably right that for ide disk quiescing a queue is enough,
> but nothing prevents block device to do some DMA just for fun. Also I
> want to spindown on suspend (andre wanted that, to flush caches), so I
> guess that the patch is quite good as-is....
That will get done by the power down part of the process as its needed
in both cases
Hi!
> > You are probably right that for ide disk quiescing a queue is enough,
> > but nothing prevents block device to do some DMA just for fun. Also I
> > want to spindown on suspend (andre wanted that, to flush caches), so I
> > guess that the patch is quite good as-is....
>
> That will get done by the power down part of the process as its needed
> in both cases
At least in suspend-to-ram (S3), power down part is not even
called. [Its suspend, we are not powering off, after all.]
On S3 resume you should wait for disks to spin up, so you need resume
handler.
I used same stuff for S3 and S4, which means I do need to spin them
down even for S4. I believe same handlers for S3 and S4 suspend/resume
is right thing to do...
Pavel
--
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.
On Sun, 3 Nov 2002, Pavel Machek wrote:
>
> How do I quiesce a queue? Is it ll_rw_blk stuff?
Just send a request down the request list, and make sure that
- the command is marked as being non-mergeable or re-orderable by
software (as all special commands are)
- the command is not re-orderable / mergeable by hardware (and since the
command in question would be something like "flush" or "spindown",
hardware really would be quite broken if it re-ordered it ;)
and then just wait for its completion.
The code is not that complicated, it looks roughly something like
struct request *rq;
rq = blk_get_request(q, WRITE, __GFP_WAIT);
rq->flags = REQ_BLOCK_PC;
rq->data = NULL;
rq->data_len = 0;
rq->timeout = 5*HZ; /* Or whatever */
memset(rq->cmd, 0, sizeof(rq->cmd));
rq->cmd[0] = SYNCHRONIZE_CACHE;
.. fill in whatever bytes the SYNCHRONIZE_CACHE cmd needs ..
rq->cmd_len = 10;
err = blk_do_rq(q, bdev, rq);
blk_put_request(rq);
and you're done. The above should work pretty much on all block drivers
out there, btw: the ones that don't understand SCSI commands should just
ignore requests that aren't the regular REQ_CMD commands.
See drivers/block/scsi_ioctl.c for other examples of sending down commands
to block devices.
Linus
Hi!
> > How do I quiesce a queue? Is it ll_rw_blk stuff?
>
> Just send a request down the request list, and make sure that
>
> - the command is marked as being non-mergeable or re-orderable by
> software (as all special commands are)
>
> - the command is not re-orderable / mergeable by hardware (and since the
> command in question would be something like "flush" or "spindown",
> hardware really would be quite broken if it re-ordered it ;)
>
> and then just wait for its completion.
Okay, so it can be done.
Is that really the right way to prepare disks for suspend?
I sleep all devices by telling driverfs to sleep them. Should I tell
all block devices, then tell driverfs? Seems hacky to me. Or should
idedisk_suspend generate request for itself, then pass it through
queues?
> The code is not that complicated, it looks roughly something like
>
> struct request *rq;
>
> rq = blk_get_request(q, WRITE, __GFP_WAIT);
> rq->flags = REQ_BLOCK_PC;
> rq->data = NULL;
> rq->data_len = 0;
> rq->timeout = 5*HZ; /* Or whatever */
> memset(rq->cmd, 0, sizeof(rq->cmd));
> rq->cmd[0] = SYNCHRONIZE_CACHE;
> .. fill in whatever bytes the SYNCHRONIZE_CACHE cmd needs ..
> rq->cmd_len = 10;
> err = blk_do_rq(q, bdev, rq);
> blk_put_request(rq);
Thanx.
Pavel
--
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.
On Sun, 3 Nov 2002, Linus Torvalds wrote:
>
> The above should work pretty much on all block drivers out there, btw:
> the ones that don't understand SCSI commands should just ignore requests
> that aren't the regular REQ_CMD commands.
Note that "should work" does not necessarily mean "does work". For
example, in the IDE world, some of the generic packet command stuff is
only understood by ide-cd.c, and the generic IDE layer doesn't necessarily
understand it even if you have a disk that speaks ATAPI. I think Jens will
fix that wart.
Linus
On Sun, 2002-11-03 at 22:27, Pavel Machek wrote:
> Hi!
>
> > > You are probably right that for ide disk quiescing a queue is enough,
> > > but nothing prevents block device to do some DMA just for fun. Also I
> > > want to spindown on suspend (andre wanted that, to flush caches), so I
> > > guess that the patch is quite good as-is....
> >
> > That will get done by the power down part of the process as its needed
> > in both cases
>
> At least in suspend-to-ram (S3), power down part is not even
> called. [Its suspend, we are not powering off, after all.]
> On S3 resume you should wait for disks to spin up, so you need resume
> handler.
>
> I used same stuff for S3 and S4, which means I do need to spin them
> down even for S4. I believe same handlers for S3 and S4 suspend/resume
> is right thing to do...
S4 the bios has spun the disks back up, S3 we may need to let the disks
perform the IDE power on and diskware load. Ben has some possible code
for that
On Sun, 3 Nov 2002, Pavel Machek wrote:
>
> Is that really the right way to prepare disks for suspend?
It probably is, although I suspect it should just be a default action, and
drivers can choose to implement their own "suspend()" functionality if
they want to.
> I sleep all devices by telling driverfs to sleep them. Should I tell
> all block devices, then tell driverfs? Seems hacky to me. Or should
> idedisk_suspend generate request for itself, then pass it through
> queues?
I would strongly encourage letting the device hierarchy suspend() (now
called sysfs, not driverfs) call be the _only_ call the disk controller
ever gets. Having two different suspend mechanisms is just too confusing
for words, and there's no point.
Linus
On Sun, 3 Nov 2002, Pavel Machek wrote:
> Suspend *needs* to touch all drivers.
Indeed, but ...
> I do stopping at high levels already
... does swsusp really need to duplicate the functionality of
the APM / ACPI layers ?
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Current spamtrap: <a href=mailto:"[email protected]">[email protected]</a>
>
>Bus ordering applies to power off not to suspend to disk sequence
Sure it does. Drivers have to save & resume state don't they ?
How can a driver save a consistent state if that state isn't
saved after blocking IOs and how can that be done if childs
of that device haven't already done their save_state/block
sequence ?
The problem is tricky with things like USB or FireWire. Basically
the host chip state save will not save pending requests state (it
would be way too nasty), so child devices will have to make sure
they stop any pending request before saving a consistent state.
Later on, during the actual "suspend" phase of the process, the
child drivers may eventually re-issue a new request, but that
will not result to a state change in the saved state (as it will
be too late for suspend to disk typically).
The suspending of tasks etc... as done currently by swsusp makes
definitely things easier as far as VM & block IO interactions
are concerned though, I agree. Might be a good idea to keep
that part "simple" for now as there is still enough work
with things like USB & 1394.
>> Then, I volunteer writing a HOWTO explaining clearly what a
>> driver should do for proper PM, and I'm pretty sure that won't
>> be that nasty and race prone as you are afraid of ;)
>
>Good. It'll be nice to have suspend to disk in 2.7
Well, it's mostly driver work, and the hooks are in there already
with the device model, so this will be driver updates going to
2.5/2.6 over time I guess. We just need to get the semantics right.
Ben.
>Just send a request down the request list, and make sure that
>
> - the command is marked as being non-mergeable or re-orderable by
> software (as all special commands are)
>
> - the command is not re-orderable / mergeable by hardware (and since the
> command in question would be something like "flush" or "spindown",
> hardware really would be quite broken if it re-ordered it ;)
>
>and then just wait for its completion.
Ok, good, this is exactly what I was talking about in a
previous mail, escept you have the real code ;)
Though I do insist of this beeing bus ordering driven, that
is this command is to be sent down the queue by the ide-disk driver
itself, when asked for suspend by it's parent, whataver it is
(typically the PCI based controller driver).
That would exactly provide the implementation for the save_state
callback. Though there is still a small issue in using synchronize
cache vs. standby.
Our model currently specify we have save_state (which blocks IOs),
then later on, suspend, which does the actual power off. While this
is actually a good things (especially with swsusp, that should allow
us to not poweroff devices on the path to the disk, a future
optimisation avoiding a full wakeup of all devices), in this specific
case, it also means we can't use the queue at the suspend() state
to send the STANDBY command. If we send the STANDBY command (to
spin off the platters) at save_state() time instead, we get the
chance of have to wait again for re-spinning up on suspend to
disk.
Maybe the fix is as simple as doing sync. cache in save_state, standby
in suspend(), but the later beeing done without using the queue (which
is what I do in my current pmac implementation in 2.4, direct ATA reg.
blasting, ugh !). Or maybe we can find a way to carry a "hint" during
the suspend process so that save_state "knows" the device is marked
as the target for a later suspend to disk process, and "avoids"
doing a standby in that case.
What do you think ?
Ben.
>Note that "should work" does not necessarily mean "does work". For
>example, in the IDE world, some of the generic packet command stuff is
>only understood by ide-cd.c, and the generic IDE layer doesn't necessarily
>understand it even if you have a disk that speaks ATAPI. I think Jens will
>fix that wart.
Which is why, IMHO (am I repeating myself ? :) that command has to
be sent down the queue by the _lowest_ level device driver, that
is ide-cd, ide-disk, etc...
This is the way our new device model is designed, at least from
my understanding of our numerous discussions with Patrick. The
actual device beeing suspended is the ide-disk (/cd/tape/...),
it is the target of the suspend request, it gets it from it's
parent in the bus binding (PCI mostly, non-PCI controllers will
have to provide something here), and it's the only place of code
that _knows_ what have to be done.
"knowing" that an ATA disk wants a SYNC. CACHE and/or STANDBY
command while an ATAPI CD wants a packet command (with eventually
a door unlock, and a check unit attention on resume) is not
the responsibility of the block layer nor even the ATA layer,
that would be a layering violation. It's the driver for the
actual device which is the one to know what to do with it's
device and when to do it (when = bus binding).
Pavel propose to stop processes & threads to make things easier
regarding VM, I now tend to think it will indeed make things
less intricated at first and agree we should keep it for now
especially with suspend-to-disk, but it's not the responsibility
of any generic, subsystem or whatever code to actually go suspend
the IO queues down to the drivers (it may be to stop feeding them,
which is the point of stopping processes).
I have the feeling that Alan is trying to avoid any kind of
responsibility upon drivers ;) Well, unfortunately, I think
we have no choice here, and that mean yes, we will have
eventually a few broken drivers that won't play nice with
suspend-to-disk/ram at first, and yes, users will notice,
post bug reports, and hopefully this will be fixed over
time...
Ben.
>S4 the bios has spun the disks back up, S3 we may need to let the disks
>perform the IDE power on and diskware load. Ben has some possible code
>for that
On pmac, I just hard reset them as I have GPIOs to control
the disk reset line, then wait for BSY to go away. We need something
smarter for the generic case, typically a mix of ATA/ATAPI reset,
and eventually execute diagnostics. I need to talk to Andre a bit
about it.
Also, I know at least of one nasty device here that won't play
nice unless it gets the identify command after reset (special
hacked device that lies about it's device type, a ZIP that
masquerades as an IDE-CD, to workaround firmware bugs in some
older laptops, ugh !)
For now, I suspect sending an ATA reset or ATAPI reset depending
on the device type, and making sure BSY is gone should cover
most cases though.
For disks, you may also need to redo the LBA setting (never played
with that, I only own sane enough disks...) and the SET_FEATURE for
PIO & DMA modes (provided your host controller chipset driver
saves & restore them).
But with the current design we have, this should probably be done
by the host chipset driver too. In 2.4, I just re-do a dma_check()
at the end of the resume phase, but that's incomplete in theory.
Ben.
On Sun, Nov 03 2002, Linus Torvalds wrote:
> struct request *rq;
>
> rq = blk_get_request(q, WRITE, __GFP_WAIT);
> rq->flags = REQ_BLOCK_PC;
> rq->data = NULL;
> rq->data_len = 0;
> rq->timeout = 5*HZ; /* Or whatever */
> memset(rq->cmd, 0, sizeof(rq->cmd));
> rq->cmd[0] = SYNCHRONIZE_CACHE;
> .. fill in whatever bytes the SYNCHRONIZE_CACHE cmd needs ..
> rq->cmd_len = 10;
> err = blk_do_rq(q, bdev, rq);
> blk_put_request(rq);
Warning, redundant blk_put_request().
I agree with Linus' example though. The queue is nice that way, for
completely synchronizing access to the device.
> and you're done. The above should work pretty much on all block drivers
> out there, btw: the ones that don't understand SCSI commands should just
> ignore requests that aren't the regular REQ_CMD commands.
Except ide drives?
--
Jens Axboe
On Sun, Nov 03 2002, Linus Torvalds wrote:
>
> On Sun, 3 Nov 2002, Linus Torvalds wrote:
> >
> > The above should work pretty much on all block drivers out there, btw:
> > the ones that don't understand SCSI commands should just ignore requests
> > that aren't the regular REQ_CMD commands.
>
> Note that "should work" does not necessarily mean "does work". For
> example, in the IDE world, some of the generic packet command stuff is
> only understood by ide-cd.c, and the generic IDE layer doesn't necessarily
> understand it even if you have a disk that speaks ATAPI. I think Jens will
> fix that wart.
It's probably not a good idea to rely on ata drives that also speak
atapi, to my knowledge only a few old WDC drives ever did that. Since we
are basically moving to the point where "SCSI" commands is the
commandset that the block layer uses to make drivers do things for it,
I had the idea of doing a rq -> taskfile conversion for ide. Just for
simple things like read/write and sync cache, basically stuff that is
directly translatable. That would make Linus' example actually work, and
it would also make the direct read/write programs using SG_IO work on
IDE drives as well.
--
Jens Axboe
On Mon, Nov 04 2002, Jens Axboe wrote:
> On Sun, Nov 03 2002, Linus Torvalds wrote:
> > struct request *rq;
> >
> > rq = blk_get_request(q, WRITE, __GFP_WAIT);
> > rq->flags = REQ_BLOCK_PC;
> > rq->data = NULL;
> > rq->data_len = 0;
> > rq->timeout = 5*HZ; /* Or whatever */
> > memset(rq->cmd, 0, sizeof(rq->cmd));
> > rq->cmd[0] = SYNCHRONIZE_CACHE;
> > .. fill in whatever bytes the SYNCHRONIZE_CACHE cmd needs ..
> > rq->cmd_len = 10;
> > err = blk_do_rq(q, bdev, rq);
> > blk_put_request(rq);
>
> Warning, redundant blk_put_request().
Sorry no, I fixed that so the example is fine!
--
Jens Axboe
On Mon, 2002-11-04 at 09:44, Jens Axboe wrote:
> It's probably not a good idea to rely on ata drives that also speak
> atapi, to my knowledge only a few old WDC drives ever did that. Since we
> are basically moving to the point where "SCSI" commands is the
> commandset that the block layer uses to make drivers do things for it,
> I had the idea of doing a rq -> taskfile conversion for ide. Just for
> simple things like read/write and sync cache, basically stuff that is
> directly translatable. That would make Linus' example actually work, and
> it would also make the direct read/write programs using SG_IO work on
> IDE drives as well.
Going beyond IDE it might be cleaner to be able to do
struct bio_command_ops
{
eject: idedisk_eject,
suspend: idedisk_suspend,
identify: idedisk_identify,
...
[maybe even read:/write: in some cases
like smart scsi raids]
}
that way IDE disk and all the other weirdass drives can have -one-
command parser not the twenty differently buggy ones we have now simply
by doing
if(rq_is_command(rq))
bio_do_command(rq, &bio_command_ops);
Its also very convenient as we can add fields to the structure and then
to drives without breaking the API and without so much updating
Alan
On Mon, 2002-11-04 at 08:16, [email protected] wrote:
> Also, I know at least of one nasty device here that won't play
> nice unless it gets the identify command after reset (special
> hacked device that lies about it's device type, a ZIP that
> masquerades as an IDE-CD, to workaround firmware bugs in some
> older laptops, ugh !)
And old old quantum drives a few of which if I remember rightly were not
averse to doing one or two command types before the disk head reached
speed
On Mon, 4 Nov 2002 [email protected] wrote:
>
> >Note that "should work" does not necessarily mean "does work". For
> >example, in the IDE world, some of the generic packet command stuff is
> >only understood by ide-cd.c, and the generic IDE layer doesn't necessarily
> >understand it even if you have a disk that speaks ATAPI. I think Jens will
> >fix that wart.
>
> Which is why, IMHO (am I repeating myself ? :) that command has to
> be sent down the queue by the _lowest_ level device driver, that
> is ide-cd, ide-disk, etc...
I would agree with you, but for the fact that:
- I really think we want to have the drivers try to translate SCSI
commands _anyway_.
- that request queue is _damn_important_ in also acting as a
synchronization barrier.
The thing is, many of these commands might well be things user space wants
to do as well, and you have two choices:
- add an ioctl for each kind of command you want to do, and let the
low-level driver do it.
Oh, and btw, we've largely done it this way in the past, and they have
pretty much _all_ gotten the synchronization wrong.
- add one generic ioctl (already done), which pushes a SCSI command down
the pipe, and let the pipe be the synchronization, and cause the switch
to be at run-time.
The thing is, you need to have a case statement for the ioctl, and you
need to have a case statement for the command byte to parse it. And the
SCSI command vs ioctl has a number of advantages:
- you can think of the SCSI command as an ioctl with a standard
numbering and automatic synchronization with the queue.
- a lot of devices can use the raw command as-is, with no case statements
or translation what-so-ever.
Anyway, this is why I'd much rather have higher layers use a standardized
queue packet (a SCSI command) to inform lower-level drivers about special
events, rather than have the lower levels decide on their own command set
and have specialized ways to try to tell them to use that specialized
command some other way (a bdev "ops" structure would probably be the way
we'd go).
So assuming that drivers will accepts commands down the request queue
anyway (because it's the only sane way to push them down and get any kind
of reasonable ordering), then that would make it a waste of time and extra
complexity to _also_ have another interface to push special commands.
Especially as that other interface would end up being almost certainly
broken wrt synchronization (proof: look at the current mess).
Linus
>Anyway, this is why I'd much rather have higher layers use a standardized
>queue packet (a SCSI command) to inform lower-level drivers about special
>events, rather than have the lower levels decide on their own command set
>and have specialized ways to try to tell them to use that specialized
>command some other way (a bdev "ops" structure would probably be the way
>we'd go).
>
>So assuming that drivers will accepts commands down the request queue
>anyway (because it's the only sane way to push them down and get any kind
>of reasonable ordering), then that would make it a waste of time and extra
>complexity to _also_ have another interface to push special commands.
>Especially as that other interface would end up being almost certainly
>broken wrt synchronization (proof: look at the current mess).
Ok, I see your point. However, even if having a common mecanism to
push down a "generic" (ie SCSI packet) "suspend" command down the
queue, the problem of doing proper suspend/resume has other issues
to deal with, and it's basically not a higher level layer thing.
One is ordering (regarding the controller's own power management
facilities if any, I do power down the controller itself on pmacs
for example, or regarding other PM issues on the bus path down to
that controller). A given IDE disk/cd/tape (same with SCSI or
anything else) must be suspended/resumed at the proper moment,
that is exactly just before it's parent, as a result of the
suspend request getting down the device tree.
Another is proper save/restore. We aren't just sending a "please
stop that disk spinning" command, like one would do for automatic
power management of inactive disks or whatever. Indeed, a common
packet command would be very well suited for that. But in our case,
we are dealing with suspend-to-disk/ram, which usually involves
a bit more than that. We want to block the queue atomically with
reception of this suspend command for example until machine is
resumed. We need to send commands to restore the device state
(ATA/ATAPI reset, LBA setting, timings, ...) on resume before
we actually accept new commands from the queue (pushing them
at head of queue before de-blocking it ?)
All of the above is quite device specific. You won't need the
same commands & state restore stuffs for a disk, cd, tape, ...
So it's really the driver for that specific device that is
the only one to know what has to be done for this specific
device.
I'm not quite yet sure what is the best way to deal with all
that yet for IDE, what I did for pmac so far was a lot simpler
but not definitely generic enough (basically blocking the
hwgroup state to "busy" and hand-blasting ATA regs to send
STANDBY command).
Ben.
Hi!
> > Suspend *needs* to touch all drivers.
>
> Indeed, but ...
>
> > I do stopping at high levels already
>
> ... does swsusp really need to duplicate the functionality of
> the APM / ACPI layers ?
Its the other way. ACPI uses swsusp to implement the sleep, and does
not do this itself.
APM is really different because in such cases BIOS does the work.
Pavel
--
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.
Hi!
> > I sleep all devices by telling driverfs to sleep them. Should I tell
> > all block devices, then tell driverfs? Seems hacky to me. Or should
> > idedisk_suspend generate request for itself, then pass it through
> > queues?
>
> I would strongly encourage letting the device hierarchy suspend() (now
> called sysfs, not driverfs) call be the _only_ call the disk controller
> ever gets. Having two different suspend mechanisms is just too confusing
> for words, and there's no point.
Okay, here's the fix. It kills private suspend/resume infrastructure
in ide, it was unused, anyway. Only change is that standby is called
from cleanup(). That's okay, only place where standby() was not called
before cleanup was ATAPI and that does not apply to disks.
Pavel
--- linux-swsusp.prague/drivers/ide/ide-disk.c 2002-11-01 21:36:54.000000000 +0100
+++ linux-swsusp/drivers/ide/ide-disk.c 2002-11-06 14:33:35.000000000 +0100
@@ -1412,24 +1412,6 @@
return call_idedisk_standby(drive, 0);
}
-static int call_idedisk_suspend (ide_drive_t *drive, int arg)
-{
- ide_task_t args;
- u8 suspend = (arg) ? WIN_SLEEPNOW2 : WIN_SLEEPNOW1;
- memset(&args, 0, sizeof(ide_task_t));
- args.tfRegister[IDE_COMMAND_OFFSET] = suspend;
- args.command_type = ide_cmd_type_parser(&args);
- return ide_raw_taskfile(drive, &args, NULL);
-}
-
-static int do_idedisk_suspend (ide_drive_t *drive)
-{
- if (drive->suspend_reset)
- return 1;
-
- return call_idedisk_suspend(drive, 0);
-}
-
#if 0
static int call_idedisk_checkpower (ide_drive_t *drive, int arg)
{
@@ -1456,13 +1438,6 @@
}
#endif
-static int do_idedisk_resume (ide_drive_t *drive)
-{
- if (!drive->suspend_reset)
- return 1;
- return 0;
-}
-
static int do_idedisk_flushcache (ide_drive_t *drive)
{
ide_task_t args;
@@ -1721,6 +1693,7 @@
{
struct gendisk *g = drive->disk;
+ do_idedisk_standby(drive);
if ((drive->id->cfs_enable_2 & 0x3000) && drive->wcache)
if (do_idedisk_flushcache(drive))
printk (KERN_INFO "%s: Write Cache FAILED Flushing!\n",
@@ -1746,9 +1719,6 @@
.supports_dma = 1,
.supports_dsc_overlap = 0,
.cleanup = idedisk_cleanup,
- .standby = do_idedisk_standby,
- .suspend = do_idedisk_suspend,
- .resume = do_idedisk_resume,
.flushcache = do_idedisk_flushcache,
.do_request = do_rw_disk,
.sense = idedisk_dump_status,
--- linux-swsusp.prague/drivers/ide/ide.c 2002-11-01 00:44:15.000000000 +0100
+++ linux-swsusp/drivers/ide/ide.c 2002-11-06 14:08:53.000000000 +0100
@@ -3111,21 +3111,6 @@
#endif
}
-static int default_standby (ide_drive_t *drive)
-{
- return 0;
-}
-
-static int default_suspend (ide_drive_t *drive)
-{
- return 0;
-}
-
-static int default_resume (ide_drive_t *drive)
-{
- return 0;
-}
-
static int default_flushcache (ide_drive_t *drive)
{
return 0;
@@ -3182,9 +3167,6 @@
{
ide_driver_t *d = drive->driver;
- if (d->standby == NULL) d->standby = default_standby;
- if (d->suspend == NULL) d->suspend = default_suspend;
- if (d->resume == NULL) d->resume = default_resume;
if (d->flushcache == NULL) d->flushcache = default_flushcache;
if (d->do_request == NULL) d->do_request = default_do_request;
if (d->end_request == NULL) d->end_request = default_end_request;
@@ -3262,13 +3244,8 @@
ide_drive_t * drive = container_of(dev,ide_drive_t,gendev);
ide_driver_t * driver = drive->driver;
- if (driver) {
- if (driver->standby)
- driver->standby(drive);
- if (driver->cleanup)
- driver->cleanup(drive);
- }
-
+ if (driver && driver->cleanup)
+ driver->cleanup(drive);
return 0;
}
--- linux-swsusp.prague/include/linux/ide.h 2002-11-01 00:45:02.000000000 +0100
+++ linux-swsusp/include/linux/ide.h 2002-11-06 14:07:05.000000000 +0100
@@ -1180,9 +1180,6 @@
unsigned supports_dma : 1;
unsigned supports_dsc_overlap : 1;
int (*cleanup)(ide_drive_t *);
- int (*standby)(ide_drive_t *);
- int (*suspend)(ide_drive_t *);
- int (*resume)(ide_drive_t *);
int (*flushcache)(ide_drive_t *);
ide_startstop_t (*do_request)(ide_drive_t *, struct request *, sector_t);
int (*end_request)(ide_drive_t *, int, int);
--
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.
[email protected] said:
> The bigger picture really should be
> ACPI etc "I want to suspend to disk"
Er, what?
The 'I want to suspend to disk' instruction comes from the generic PM code
too, surely? ACPI just registers a handful of methods with the generic PM
code for actually doing stuff like entering sleep states, etc.
--
dwmw2
Hi!
> > The bigger picture really should be
> > ACPI etc "I want to suspend to disk"
>
> Er, what?
>
> The 'I want to suspend to disk' instruction comes from the generic PM code
> too, surely? ACPI just registers a handful of methods with the generic PM
> code for actually doing stuff like entering sleep states, etc.
ACPI wants to enter S4 when user asks it to, and swsusp is the way to
do it. When machine overheats, ACPI wants to enter S4 too.
Of course there should be generic way (and it is, sys_reboot) to enter
S4 without asking acpi. echo 4 > /proc/acpi/sleep is just easier for
most people.
Pavel
--
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.
[email protected] said:
> ACPI wants to enter S4 when user asks it to, and swsusp is the way to
> do it. When machine overheats, ACPI wants to enter S4 too.
> Of course there should be generic way (and it is, sys_reboot) to enter
> S4 without asking acpi. echo 4 > /proc/acpi/sleep is just easier for
> most people.
Why /proc/acpi/sleep ?
Other PM implementations gave us /proc/sys/pm/suspend -- why doesn't ACPI
use that?
The stuff in /proc/acpi should be ACPI-specific. Anything _generic_ like
battery info, sleep states, etc. should have a generic interface which can
be used by any implementation.
--
dwmw2
> From: David Woodhouse [mailto:[email protected]]
> Why /proc/acpi/sleep ?
>
> Other PM implementations gave us /proc/sys/pm/suspend -- why
> doesn't ACPI
> use that?
>
> The stuff in /proc/acpi should be ACPI-specific. Anything
> _generic_ like
> battery info, sleep states, etc. should have a generic
> interface which can
> be used by any implementation.
Yes, ACPI can and should use standard kernel interfaces when they are
available. The interfaces you're talking about don't exist yet, but could be
added.
An example of this is cpufreq. ACPI exposed its own proc interface before
cpufreq was integrated, but now it recently started working through cpufreq.
It should be relatively easy to establish generic interfaces for other areas
and hook ACPI into them.
Regards -- Andy
Hi!
> > ACPI wants to enter S4 when user asks it to, and swsusp is the way to
> > do it. When machine overheats, ACPI wants to enter S4 too.
>
> > Of course there should be generic way (and it is, sys_reboot) to enter
> > S4 without asking acpi. echo 4 > /proc/acpi/sleep is just easier for
> > most people.
>
> Why /proc/acpi/sleep ?
>
> Other PM implementations gave us /proc/sys/pm/suspend -- why doesn't ACPI
> use that?
Well, /proc/sys/pm/suspend also does not seem like the right name for
suspend...
I believe sys_reboot() is the right way to do. Perhaps
/proc/acpi/sleep should be killed in favor of that?
Pavel
--
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.
[email protected] said:
> Yes, ACPI can and should use standard kernel interfaces when they are
> available.
I'd go further than that. Where it's blatantly obvious that a standard
kernel interface is going to be required for some functionality for which
ACPI is the first implementation, that generic interface should have been
implemented from the start.
> The interfaces you're talking about don't exist yet, but
> could be added.
Actually I do have boxes on which I "echo 1 > /proc/sys/pm/suspend" to make
them sleep. Pavel's right though -- that's not a particularly wonderful
interface either. Using sys_reboot() makes some sense to me.
But stuff like battery info in /proc/acpi just has no excuse.
--
dwmw2
On Thu, 2002-11-07 at 23:25, David Woodhouse wrote:
>
> Actually I do have boxes on which I "echo 1 > /proc/sys/pm/suspend" to make
> them sleep. Pavel's right though -- that's not a particularly wonderful
> interface either. Using sys_reboot() makes some sense to me.
>
> But stuff like battery info in /proc/acpi just has no excuse.
(David: resent, forgot the list)
We need more than just an interface to put the machine to sleep
in fact.
We also need a way for userland to be notified that the machine
will sleep and that it woke up. And if possible in a blocking
way (that is the sleep process waits for the notified userland
app to ack, or refuse).
Typically, that happens today with X via /dev/apm_bios. On PPC,
I emulate this interface, I think ARM does as well. But that
should definitely be replaced by something. Maybe not in kernel
though, that all depends if the kernel can be the initiator
of a suspend process or not.
If it can (upon request from the firmware or whatever), then we need
such an API provided by the kernel. If we decide only userland can
trigger suspend, then we probably need to design some PM daemon
that provides a known interface.
It's very important for apps like X that bang the HW directly to
be notified of the sleep process properly.
Ben.
Hi!
> But stuff like battery info in /proc/acpi just has no excuse.
Yes... But how should "generic" battery info look like?
On apm you only know percentages and ETA left.
On acpi you know voltages, capacities and present rate.
On zaurus you only know voltages.
It will be quite hard to decide "one correct interface". It should
probably be called "/proc/power".
Pavel
--
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.
[email protected] said:
> Yes... But how should "generic" battery info look like?
> On apm you only know percentages and ETA left.
> On acpi you know voltages, capacities and present rate.
> On zaurus you only know voltages.
> It will be quite hard to decide "one correct interface". It should
> probably be called "/proc/power".
Battery info call returns a structure where some elements can be 'unknown'.
ACPI does it like that already, IIRC -- it's not mandatory to actually fill
in every field correctly.
--
dwmw2
Hi!
> > Yes... But how should "generic" battery info look like?
> > On apm you only know percentages and ETA left.
> > On acpi you know voltages, capacities and present rate.
> > On zaurus you only know voltages.
> > It will be quite hard to decide "one correct interface". It should
> > probably be called "/proc/power".
>
> Battery info call returns a structure where some elements can be 'unknown'.
> ACPI does it like that already, IIRC -- it's not mandatory to actually fill
> in every field correctly.
Would you care to suggest how battery info should look like?
Pavel
--
When do you have heart between your knees?