2004-10-11 00:46:46

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Totally broken PCI PM calls

Any reason why this totally broken code was ever merged upstream ?
We debated that again and again... the result of the below code is
to _REMOVE_ useful information from PCI drivers (suspend to disk vs.
suspend to RAM) for no good reason, making PCI drivers suddently
take a different state than any other driver, breaking radeonfb
suspend code, etc....

I insist again and again (with WORKING code here that does both
suspend-to-RAM & to-disk when not broken by patches like that)
that this is the WRONG thing to do, there is no such generic
correspondance between a suspend state and a PCI D state, this
is under driver or platform control, and this just makes things
more confusing for everybody.

Please, revert that to something sane before 2.6.9... I'll be an
asshole on this one and the bunch of PPC suspend-to-disk patch I'll
submit after 2.6.9 will contain a patch reverting that if not done
and I'll bomb it ever and ever again until some sense gets into
all of this.

Ben.


static int pci_device_suspend(struct device * dev, u32 state)
{
struct pci_dev * pci_dev = to_pci_dev(dev);
struct pci_driver * drv = pci_dev->driver;
u32 dev_state;
int i = 0;

/* Translate PM_SUSPEND_xx states to PCI device states */
static u32 state_conversion[] = {
[PM_SUSPEND_ON] = 0,
[PM_SUSPEND_STANDBY] = 1,
[PM_SUSPEND_MEM] = 3,
[PM_SUSPEND_DISK] = 3,
};

if (state >= sizeof(state_conversion) / sizeof(state_conversion[1]))
return -EINVAL;

dev_state = state_conversion[state];
if (drv && drv->suspend)
i = drv->suspend(pci_dev, dev_state);

pci_save_state(pci_dev, pci_dev->saved_config_space);
return i;
}


2004-10-11 02:41:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls



On Mon, 11 Oct 2004, Benjamin Herrenschmidt wrote:
>
> Any reason why this totally broken code was ever merged upstream ?

Because it fixes a lot of drivers.

> Please, revert that to something sane before 2.6.9...

Nope. There's just too much confusion abou what the state thing means.
See the TG3 driver, for one, all the USB drivers for another.

The long-term solution is to make this thing be not a number at all, but a
restricted type (ie a "struct" with one member, or similar) to make sure
you _cannot_ mis-use it. As it is, most PCI drivers do seem to expect a
PCI suspend state.

> I'll bomb it ever and ever again until some sense gets into
> all of this.

If you have the energy to complain about it, maybe you have the energy to
change it to use a "struct", and fix up all the drivers and verify that
they actually get it right.

Until that happens, we pass in PCI suspend states, and we _also_ make sure
that the PCI suspend states "almost match" the PM_SUSPEND_xxx constants,
so that when there is confusion, it tends to be as benign as possible.

Linus

2004-10-11 03:42:53

by Paul Mackerras

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Linus Torvalds writes:

> On Mon, 11 Oct 2004, Benjamin Herrenschmidt wrote:
> >
> > Any reason why this totally broken code was ever merged upstream ?
>
> Because it fixes a lot of drivers.
>
> > Please, revert that to something sane before 2.6.9...
>
> Nope. There's just too much confusion abou what the state thing means.
> See the TG3 driver, for one, all the USB drivers for another.

The USB drivers aren't a good example, they are currently quite broken
as far as suspend/resume is concerned. They used to work just fine
but got broken some time in the last few months.

The problem I have at the moment is that PCI drivers get asked to go
to D3 for both suspend-to-ram and suspend-to-disk. In particular the
radeonfb driver wants to do different things in these two cases.

Really, what is bogus is pci-driver.c thinking it can tell what state
to ask the PCI device driver to put the device into from the system
power state, without using any platform information. For suspend to
ram, we may need to put devices into D2, D3hot or D3cold, depending on
the motherboard design. For suspend to disk, we don't really want to
change the device's power state at all, but just quiesce the device
and save its state.

In the radeonfb case, it needs to put the device into either D3hot or
D3cold (depending on the motherboard design) for suspend to ram. For
suspend to disk it doesn't want to do much of anything (otherwise we
lose the console at that point).

> The long-term solution is to make this thing be not a number at all, but a
> restricted type (ie a "struct" with one member, or similar) to make sure
> you _cannot_ mis-use it. As it is, most PCI drivers do seem to expect a
> PCI suspend state.

Maybe the real problem is that we are trying to use the device suspend
functions for suspend-to-disk, when we don't really want to change the
device's power state at all.

Paul.

2004-10-11 03:46:19

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Mon, 2004-10-11 at 12:41, Linus Torvalds wrote:

> > Any reason why this totally broken code was ever merged upstream ?
>
> Because it fixes a lot of drivers.

Disagreed. Sorry, but can you give me a good example ? The drivers still
do the broken assumptions of passing directly the state parameter to
pci_set_power_state() (or whatever we call this one these days) but this
is worked around by defining PM_SUSPEND_MEM to 3 in pm.h.

(Which was actually done in the tree)

What this code in the PCI layer does is basically just turning the "4"
of suspend-to-disk into a "3" for D3 (and by doing so, confirms what is
imho a broken interface). This is wrong. That means that driver D3 all
over the place on suspend-to-disk which is totally useless and even not
wanted in various cases. It prevents things like radeonfb from properly
special casing suspend-to-disk to keep us with a working console until
late in the process.

> > Please, revert that to something sane before 2.6.9...
>
> Nope. There's just too much confusion abou what the state thing means.
> See the TG3 driver, for one, all the USB drivers for another.

tg3 is fixed for now with 2 lines:

===== drivers/net/tg3.c 1.209 vs edited =====
--- 1.209/drivers/net/tg3.c 2004-09-15 15:15:06 +10:00
+++ edited/drivers/net/tg3.c 2004-10-11 13:36:19 +10:00
@@ -8445,6 +8445,9 @@
spin_unlock(&tp->tx_lock);
spin_unlock_irq(&tp->lock);

+ if (state != PM_SUSPEND_MEM)
+ return 0;
+
err = tg3_set_power_state(tp, state);
if (err) {
spin_lock_irq(&tp->lock);

USB suffer from various breakage still in PM code. For one, that PCI
change will do nothing with the USB "device" drivers, and as far as
hosts are concerned, well, I don't see the problem. Just go to D3 when
passed PM_SUSPEND_MEM, and disconnect when passed PM_SUSPEND_DISK. But
have you tried removing a USB device while the machine is asleep with
any of the later versions of the USB code ? deadlock on some semaphore
on resume, 100% reproduceable, which is actually a regression as it used
to work with earlier versions of the kernel. The USB code path are
something I haven't found my way in to try to fix it though, I think we
have some fundamental issues with adding/removing devices vs. walking
the PM tree.

> The long-term solution is to make this thing be not a number at all, but a
> restricted type (ie a "struct" with one member, or similar) to make sure
> you _cannot_ mis-use it. As it is, most PCI drivers do seem to expect a
> PCI suspend state.

Maybe. There is more confusion than that actually. The device-level PM
via /sysfs allows you to pass the number down directly, which is bogus
as those have totally different semantics than the system sleep and
won't properly walk the tree to suspend children etc...

> > I'll bomb it ever and ever again until some sense gets into
> > all of this.
>
> If you have the energy to complain about it, maybe you have the energy to
> change it to use a "struct", and fix up all the drivers and verify that
> they actually get it right.
>
> Until that happens, we pass in PCI suspend states, and we _also_ make sure
> that the PCI suspend states "almost match" the PM_SUSPEND_xxx constants,
> so that when there is confusion, it tends to be as benign as possible.

If we can ever agree about the format of that structure and what kind
of information we want to pass to drives ... The actual D state we want is
something which is a "mix" of driver-specific and platform-specific knowledge.

The problem with your fix is that it removes from the drivers the knowledge
of PM_SUSPEND_DISK (well ,maybe still obtainable via a global, but that's
really ugly). Also, it causes drivers to default to D3 for suspend-to-disk
which doesn't make much sense.

I'd rather have drivers try to pass "4" to pci_set_power_state() and having
the later just "do nothign" in this case.

Ben.


2004-10-11 04:04:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls



On Mon, 11 Oct 2004, Paul Mackerras wrote:
>
> The USB drivers aren't a good example, they are currently quite broken
> as far as suspend/resume is concerned. They used to work just fine
> but got broken some time in the last few months.

And they are unbroken again (well, at least they work for me again).
Partly by the PM_ renumbering under discussion.

> The problem I have at the moment is that PCI drivers get asked to go
> to D3 for both suspend-to-ram and suspend-to-disk. In particular the
> radeonfb driver wants to do different things in these two cases.

Hey, I don't disagree. But I pointed out why it's done the way it is done.
I even told you what can be done about it - so please argue _those_ points
instead of just ignoring them.

Linus

2004-10-11 04:08:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls



On Mon, 11 Oct 2004, Benjamin Herrenschmidt wrote:
>
> Disagreed. Sorry, but can you give me a good example ? The drivers still
> do the broken assumptions of passing directly the state parameter to
> pci_set_power_state() (or whatever we call this one these days) but this
> is worked around by defining PM_SUSPEND_MEM to 3 in pm.h.

.. take a look at PM_SUSPEND_DISK for a moment.

If you only care about PM_SUSPEND_MEM, then what's your problem? You get
the right value already.

And if you _do_ care about PM_SUSPEND_DISK, then don't ignore it in the
discussion. You can't have it both ways.

The fact is, my laptop can now (finally) do suspend-to-disk. It never
could do that before. And yes, it does use radeonfb, so your arguments
hold no water with me.

I told you what can done to fix things up. Stop ignoring that reality.

Linus

2004-10-11 04:24:54

by Paul Mackerras

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Linus Torvalds writes:

> And they are unbroken again (well, at least they work for me again).
> Partly by the PM_ renumbering under discussion.

Interesting. I find that with suspend-to-ram, USB keyboards don't
work after resume, and that the system will hang on resume if you
remove a USB device during sleep.

Are you using suspend-to-ram or suspend-to-disk?

> > The problem I have at the moment is that PCI drivers get asked to go
> > to D3 for both suspend-to-ram and suspend-to-disk. In particular the
> > radeonfb driver wants to do different things in these two cases.
>
> Hey, I don't disagree. But I pointed out why it's done the way it is done.
> I even told you what can be done about it - so please argue _those_ points
> instead of just ignoring them.

OK, how about we pass a struct that contains both the target system
power state and the target device power state? That would fit in 32
bits.

For now (i.e. 2.6.9), could we have the following patch? It only
affects suspend-to-disk, and it tells the drivers that we are going to
D3cold (4) when we are doing suspend-to-disk.

Signed-off-by: Paul Mackerras <[email protected]>

diff -urN linux-2.5/drivers/pci/pci-driver.c pmac-2.5/drivers/pci/pci-driver.c
--- linux-2.5/drivers/pci/pci-driver.c 2004-10-04 13:31:01.000000000 +1000
+++ pmac-2.5/drivers/pci/pci-driver.c 2004-10-11 14:15:27.986286792 +1000
@@ -307,7 +307,7 @@
[PM_SUSPEND_ON] = 0,
[PM_SUSPEND_STANDBY] = 1,
[PM_SUSPEND_MEM] = 3,
- [PM_SUSPEND_DISK] = 3,
+ [PM_SUSPEND_DISK] = 4,
};

if (state >= sizeof(state_conversion) / sizeof(state_conversion[1]))

2004-10-11 04:26:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls



On Mon, 11 Oct 2004, Paul Mackerras wrote:
>
> Maybe the real problem is that we are trying to use the device suspend
> functions for suspend-to-disk, when we don't really want to change the
> device's power state at all.

An acceptable solution is certainly to instead of passing down "go to D3",
just not do anything at all. HOWEVER, I doubt that is actually all that
good a solution either: devices quite possibly do want to save state
and/or set wake-on-events.

But I don't know. Somebody would need to go through the drivers and verify
that.

NOTE! I don't mind passing down "D3cold" (aka 4 - same as PM_SUSPEND_DISK)
in theory. The problem is that the very first machine I ever tested it on
clearly didn't like it and refused to suspend. Maybe I was unlucky, but
the point is, the real solution again requires people to _verify_ that.

See the pattern?

Which is why I suggested making it a separate type that is _not_ a normal
number. Exactly so that you cannot think it's a PCI state by mistake, when
clearly drivers _do_ think that. And force people to verify it.

You could do it with "sparse" and "bitwise" types too. Sparse will
complain if you use the type in an inappropriate manner. But the basic
issue remains: there are PCI power states, and there are "suspend" power
states, and they are different. And right now people _are_ confused about
them.

Arguing against that is futile. It's a fact.

I'm more than happy to learn that there are other alternatives to solving
the confusion problem. But quite frankly, arguing for undoing the
translation is just stupid. It's putting your head in the sand and saying
"la-la-laa-I-can't-hear-you".

And until something actually tries to sort _out_ the confusion, the state
translation stays. Does it put devices into sleep modes when you shouldn't
need to? Sure. But at least it's not confused about things.

Linus

2004-10-11 04:26:46

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Mon, 2004-10-11 at 14:08, Linus Torvalds wrote:
> On Mon, 11 Oct 2004, Benjamin Herrenschmidt wrote:
> >
> > Disagreed. Sorry, but can you give me a good example ? The drivers still
> > do the broken assumptions of passing directly the state parameter to
> > pci_set_power_state() (or whatever we call this one these days) but this
> > is worked around by defining PM_SUSPEND_MEM to 3 in pm.h.
>
> .. take a look at PM_SUSPEND_DISK for a moment.
>
> If you only care about PM_SUSPEND_MEM, then what's your problem? You get
> the right value already.

How so ? I care about both. We had 2 different problems. One was
PM_SUSPEND_MEM would be defined to 2 which caused dumb drivers to try to
go to D2 instead of D3, and one is that some drivers are still mixing up
PM_SUSPEND_DISK, and I don't think the "fix" in pci-driver.c is any good
for that...

> And if you _do_ care about PM_SUSPEND_DISK, then don't ignore it in the
> discussion. You can't have it both ways.

I'm not ignoring it. I pretend that it's wrong.

> The fact is, my laptop can now (finally) do suspend-to-disk. It never
> could do that before. And yes, it does use radeonfb, so your arguments
> hold no water with me.

But radeonfb ends up suspending the display at a wrong time and you miss
half of the output, which makes any kind of debugging near to
impossible.

> I told you what can done to fix things up. Stop ignoring that reality.

I'm not ignoring that reality and I may well come up with a patch for
after 2.6.9 but I consider the current state of things broken.

Ben.


2004-10-11 04:32:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls



On Mon, 11 Oct 2004, Benjamin Herrenschmidt wrote:
>
> But radeonfb ends up suspending the display at a wrong time and you miss
> half of the output, which makes any kind of debugging near to
> impossible.

Agreed.

But notice? It _works_. It's suspendign too damn eagerly, and it's hard to
debug, but it's a "safe" solution to the confusion that does exist. Which
is why I did it.

And please do realize that I'd love to solve the confusion, and remove the
hack. It's a hack, I admit it. But it's better than just saying "be
confused, be broken, I don't care".

If the hack ends up motivating somebody (hint hint) to solve the problem
properly, I'll be really happy. Paul suggested one solution (don't call
down to suspend at all - which is also a hack, but I suspect it might be
about as good a hack as the current one). I suggested another: using type
checking to make sure drivers _aren't_ confused.

The more the merrier. Care to come up with a solution of your own?

And no, I'm not interested in the type "let's fix one driver" kind of
thing. That's what we've had for the last year or more, and the fact is,
my laptop _never_ suspended during that time. So I really think it needs a
_proper_ solution.

Linus

2004-10-11 04:55:42

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Mon, 2004-10-11 at 14:32, Linus Torvalds wrote:
> On Mon, 11 Oct 2004, Benjamin Herrenschmidt wrote:
> >
> > But radeonfb ends up suspending the display at a wrong time and you miss
> > half of the output, which makes any kind of debugging near to
> > impossible.
>
> Agreed.
>
> But notice? It _works_. It's suspendign too damn eagerly, and it's hard to
> debug, but it's a "safe" solution to the confusion that does exist. Which
> is why I did it.

Well... it doesn't work on paul's laptop, but anyway, ok, let's go for
the struct thing and forget about this for 2.6.9.

Now paul and I are trying to figure out what to put in that struct and
what kind of information actually make any sense. It's not trivial. We
have to deal with several things.

We have the system state (cause of the request if you prefer), that is
"idle" (we mostly don't implement that one yet but it's useful to make
"room" for it, handhelds wants that badly), "suspend to ram" and
"suspend to disk".

But what about user /sysfs originated requests ? (that is random numbers
the user whacks in /sys/devices/...../power) what are their semantics ?

Also, do we carry around a "suggested" D state for what it means ? it's
really an obscure PCI concept. However, as you can see with the hacks
in drivers like radeonfb, we would be happy to be able to tell the
driver wether the chip will be leaved alone, powered off, unclocked,
etc... so the driver can take the right decision vs. what it supports.

That would mean, at least for PCI, a kind of platform hook that provides
that information, and in what form ? a D state ? I would vote for a
simple PCI specific pci_* (no good name comes to mind at the moment)
that would provide the platform suggested D state based on the pci_dev
and the struct we pass.

> And please do realize that I'd love to solve the confusion, and remove the
> hack. It's a hack, I admit it. But it's better than just saying "be
> confused, be broken, I don't care".

Ok, ok ... well, it's broken with your "fix" for paul's box, but it's
ok, the ppc suspend-to-disk code isn't upstream yet anyway.

> If the hack ends up motivating somebody (hint hint) to solve the problem
> properly, I'll be really happy. Paul suggested one solution (don't call
> down to suspend at all - which is also a hack, but I suspect it might be
> about as good a hack as the current one). I suggested another: using type
> checking to make sure drivers _aren't_ confused.

Paul and I would love do the right thing, it's just difficult to define
what the right thing is at this point. Actually, I have a pretty good
idea for a lot but what happens via /sysfs...

> The more the merrier. Care to come up with a solution of your own?
>
> And no, I'm not interested in the type "let's fix one driver" kind of
> thing. That's what we've had for the last year or more, and the fact is,
> my laptop _never_ suspended during that time. So I really think it needs a
> _proper_ solution.

Agreed.

Ben.


2004-10-11 09:51:05

by Pavel Machek

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Hi!

> Any reason why this totally broken code was ever merged upstream ?
> We debated that again and again... the result of the below code is
> to _REMOVE_ useful information from PCI drivers (suspend to disk vs.
> suspend to RAM) for no good reason, making PCI drivers suddently
> take a different state than any other driver, breaking radeonfb
> suspend code, etc....

Well, old code was not too working, either... In -mm, we decided
s-t-disk vs. s-t-RAM according to global system_state, which did the
trick without touching all the drivers. Unfortunately that part did
not get merged. (And got dropped from -mm; I still have the patch).

Pavel
--
Boycott Kodak -- for their patent abuse against Java.

2004-10-11 09:59:21

by Pavel Machek

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Hi!

> For now (i.e. 2.6.9), could we have the following patch? It only
> affects suspend-to-disk, and it tells the drivers that we are going to
> D3cold (4) when we are doing suspend-to-disk.

If you want stop-gap solution, patch that was in -mm is better
idea. Please do *not* apply this one.

Pavel

> --- linux-2.5/drivers/pci/pci-driver.c 2004-10-04 13:31:01.000000000 +1000
> +++ pmac-2.5/drivers/pci/pci-driver.c 2004-10-11 14:15:27.986286792 +1000
> @@ -307,7 +307,7 @@
> [PM_SUSPEND_ON] = 0,
> [PM_SUSPEND_STANDBY] = 1,
> [PM_SUSPEND_MEM] = 3,
> - [PM_SUSPEND_DISK] = 3,
> + [PM_SUSPEND_DISK] = 4,
> };
>
> if (state >= sizeof(state_conversion) / sizeof(state_conversion[1]))

--
Boycott Kodak -- for their patent abuse against Java.

2004-10-11 10:08:42

by Pavel Machek

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Hi!

> The problem with your fix is that it removes from the drivers the knowledge
> of PM_SUSPEND_DISK (well ,maybe still obtainable via a global, but that's
> really ugly). Also, it causes drivers to default to D3 for suspend-to-disk
> which doesn't make much sense.

Global does not sound *that* bad to me... Actually I want drivers to
default to D3 for suspend-to-disk, because that's the state where DMA
is guaranteed to be off. It might be ugly but its better safe than
sorry situation.
Pavel
--
Boycott Kodak -- for their patent abuse against Java.

2004-10-11 10:18:27

by Pavel Machek

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Hi!

> > Maybe the real problem is that we are trying to use the device suspend
> > functions for suspend-to-disk, when we don't really want to change the
> > device's power state at all.
>
> An acceptable solution is certainly to instead of passing down "go to D3",
> just not do anything at all. HOWEVER, I doubt that is actually all that
> good a solution either: devices quite possibly do want to save state
> and/or set wake-on-events.

And DMA needs to be stopped, or it is "bye bye data" situation.

> Which is why I suggested making it a separate type that is _not_ a normal
> number. Exactly so that you cannot think it's a PCI state by mistake, when
> clearly drivers _do_ think that. And force people to verify it.
>
> You could do it with "sparse" and "bitwise" types too. Sparse will
> complain if you use the type in an inappropriate manner. But the basic
> issue remains: there are PCI power states, and there are "suspend" power
> states, and they are different. And right now people _are_ confused about
> them.

Does sparse now have typechecking on enums? Solution that was in -mm
was basically "put enums there so drivers can't be confused" + "signal
global state out-of-band in global variable". It was not too nice, but
it certainly was working.
Pavel
--
Boycott Kodak -- for their patent abuse against Java.

2004-10-11 10:54:54

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Mon, 2004-10-11 at 20:18, Pavel Machek wrote:
> Hi!
>
> > > Maybe the real problem is that we are trying to use the device suspend
> > > functions for suspend-to-disk, when we don't really want to change the
> > > device's power state at all.
> >
> > An acceptable solution is certainly to instead of passing down "go to D3",
> > just not do anything at all. HOWEVER, I doubt that is actually all that
> > good a solution either: devices quite possibly do want to save state
> > and/or set wake-on-events.
>
> And DMA needs to be stopped, or it is "bye bye data" situation.

This is true for pretty much any PM state

> Does sparse now have typechecking on enums? Solution that was in -mm
> was basically "put enums there so drivers can't be confused" + "signal
> global state out-of-band in global variable". It was not too nice, but
> it certainly was working.
> Pavel
--
Benjamin Herrenschmidt <[email protected]>

2004-10-11 14:44:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls



On Mon, 11 Oct 2004, Paul Mackerras wrote:
>
> Are you using suspend-to-ram or suspend-to-disk?

Suspend-to-disk. suspend-to-ram still doesn't work for me (never has, oh,
well.. Slow progress).

> For now (i.e. 2.6.9), could we have the following patch? It only
> affects suspend-to-disk, and it tells the drivers that we are going to
> D3cold (4) when we are doing suspend-to-disk.

No.

If we do this, then we don't need the translation at all. And I guarantee
you that drivers _will_ be broken - I had this in my tree earlier by
virtue of "no translation". I mentioned the tg3 driver already, and I
assume others will too.

I repeat: we can remove the hack if somebody checks the drivers.

Linus

2004-10-11 15:00:32

by Pavel Machek

[permalink] [raw]
Subject: suspend-to-RAM [was Re: Totally broken PCI PM calls]

Hi!

> > Are you using suspend-to-ram or suspend-to-disk?
>
> Suspend-to-disk. suspend-to-ram still doesn't work for me (never has, oh,
> well.. Slow progress).

Which machine is that, btw? Evo N620c has probably BIOS/firmware bug
that kills machine on attempt to enter S3 or S4. It takes pressing
power button 3 times (!) to get machine back.
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-10-11 15:36:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: suspend-to-RAM [was Re: Totally broken PCI PM calls]



On Mon, 11 Oct 2004, Pavel Machek wrote:
>
> Which machine is that, btw? Evo N620c has probably BIOS/firmware bug
> that kills machine on attempt to enter S3 or S4. It takes pressing
> power button 3 times (!) to get machine back.

That's the one. suspend-to-disk works, but suspend-to-ram leaves the fam
going, and does not come back no matter how many times you press the power
button. You need to kill it (twice) by holding the power button for five
seconds (which is the "hard-power-off" signal to the southbridge, when
everything else is locked up).

suspend-to-disk really shuts off, and comes back after just a single power
button event. Of course, it's slow and boring, I'd much rather have STR
working ;)

Linus

2004-10-11 15:53:39

by Brice Goglin

[permalink] [raw]
Subject: Re: suspend-to-RAM [was Re: Totally broken PCI PM calls]

> Which machine is that, btw? Evo N620c has probably BIOS/firmware bug
> that kills machine on attempt to enter S3 or S4. It takes pressing
> power button 3 times (!) to get machine back.
> Pavel

Hi,

On my N600c, suspend-to-RAM seems to complete... but when I try to wake
up the laptop (by pressing the power button), it blinks strangely and
then immediately shutdowns instead of resuming...

Brice

2004-10-11 16:06:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls



On Mon, 11 Oct 2004, Pavel Machek wrote:
>
> Does sparse now have typechecking on enums?

You can mark an enum "bitwise" (by making all of it's values be
"bitwise"), and it will be considered a type of its own, yes. But then you
also cannot do arithmetic on it (which _usually_ is what you want, but not
necessarily always).

(You'd also need to pass in the "-Wbitwise" flag to sparse, to get the
checks).

By the time you mark something "bitwise", you don't even need to use an
enum, btw. You can just do a regular integer typedef and mark the typedef
to be "bitwise" - that generates a unique type right there. That's what
the endianness checking does.

Linus

2004-10-11 16:19:02

by David Brownell

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Sunday 10 October 2004 9:55 pm, Benjamin Herrenschmidt wrote:
>
> Well... it doesn't work on paul's laptop, but anyway, ok, let's go for
> the struct thing and forget about this for 2.6.9.

PM hasn't worked for me on 2.6 with most hardware, about since
the "new PMcore" kicked in, so it's hard for me to judge progress
except at the level of unit tests. Where I see two steps forward,
one step back ... on a good day.

The root cause of many of these problem is that there's
a confusion between system-wide sleep states and the
device-specific power states from which they're built.
They _should_ be using two distinct data types. Not one
integer/enum type ... especially not one int/enum type
that's got multiple conflicting "legacy" interpretations!

Something else to remember: the design center should
be (or at least include!) "partially suspended" systems,
where most hardware is in a low power mode unless it's
needed. That's not something the current PM framework
does well; it gets confused on system suspend later on,
if it finds devices it didnt' suspend.


> Now paul and I are trying to figure out what to put in that struct and
> what kind of information actually make any sense. It's not trivial. We
> have to deal with several things.

In USB-land, we've identified a need for a state name. Each different
bus can have different states, and device-specific power states are
also common. Sysfs should show those device-specific states.

Doesn't seem like anything else is needed ... so long as the power
management interactions get (re)factored sanely! However, each
bus or device could use container_of() to access any additional state
information ... including linkage to other hardware, or for PCI the
actual number of the state.


> We have the system state (cause of the request if you prefer),

The system state shouldn't matter to a device-specific suspend
routine ... AT ALL.

I like the model that what gets passed to drivers is more of a
local policy update than anything else. Within the bounds of
that policy, the driver can pick many device power states.
When the system is going to power itself off, it can pass a
device power policy that's rather inflexible.


> that is
> "idle" (we mostly don't implement that one yet but it's useful to make
> "room" for it, handhelds wants that badly),

In USB-land, we've been discussing "idle" as something orthogonal
to the PM states. Not clear on all the details yet, but I'd vote against
anything that tries to make "driver doing nothing" a power state,
or doesn't have a way to idle drivers.

Drivers can be "idle" without entering low power states, and can
use wakeup events to enter/exit low power states without being
fully idle. (Hardware allowing.) That's an example of a policy that
drivers should be able to choose without affecting PMcore.

I've sent separate posts about how to add wakeup support to the
PM framework ... that omission should have been a great big
red (or is it chartreuse?) warning flag that the PM framework was
lacking some very basic functionality.


> "suspend to ram" and "suspend to disk".

Those aren't device power states at all!! They're system sleep
states (or transitions). And by the way, "suspend to disk" is
an odd model for systems without disks, or swap ... which oddly
enough tend to be ones that really need good PM, and which
(like HH.org) need the "partially suspended" system model
to work well.


> But what about user /sysfs originated requests ? (that is random numbers
> the user whacks in /sys/devices/...../power) what are their semantics ?

Sysfs should only read/write the names of the states that particular
device can support. Plus probably some generic requests for policies
that the sleep framework would hand to individual drivers.


> Also, do we carry around a "suggested" D state for what it means ? it's
> really an obscure PCI concept. However, as you can see with the hacks
> in drivers like radeonfb, ....

I can't think of PCI D-states as obscure, they're the core of its PM support.
PCI drivers need to use them to implement power policies.

Without looking at that code, I'll just say that while many PCI drivers
can probably offload the decision making to some PCI core code
("use D1 or D2 when idle if available; D3hot otherwise"), not all
can do that.


> > The more the merrier. Care to come up with a solution of your own?
> >
> > And no, I'm not interested in the type "let's fix one driver" kind of
> > thing. That's what we've had for the last year or more, and the fact is,
> > my laptop _never_ suspended during that time. So I really think it needs a
> > _proper_ solution.
>
> Agreed.

Right, and let's not make something that only works for PC-class hardware
or only for system-wide suspend operations ...

- Dave


2004-10-11 16:19:36

by Pavel Machek

[permalink] [raw]
Subject: Re: suspend-to-RAM [was Re: Totally broken PCI PM calls]

Hi!

> >Which machine is that, btw? Evo N620c has probably BIOS/firmware bug
> >that kills machine on attempt to enter S3 or S4. It takes pressing
> >power button 3 times (!) to get machine back.
> > Pavel
>
> On my N600c, suspend-to-RAM seems to complete... but when I try to wake
> up the laptop (by pressing the power button), it blinks strangely and
> then immediately shutdowns instead of resuming...

Your machine is probabl different from N620c i this regard...

Can you test if it reaches start of wakeup.S? Just insert infinite
loop there...
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-10-11 16:44:58

by David Brownell

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Sunday 10 October 2004 8:42 pm, Paul Mackerras wrote:
>
> The USB drivers aren't a good example, they are currently quite broken
> as far as suspend/resume is concerned. They used to work just fine
> but got broken some time in the last few months.

It would have been interesting to have seen some problem report
on linux-usb-devel. Which USB drivers, by the way? There have
to be at least a hundred of them.

I can say that USB suspend/resume works much better now on
some x86 hardware, especially with patches in Greg's bk-usb tree;
at the level of "those never worked on 2.6 before!" There are still
several PMcore problems getting in the way though, and I'd not
be surprised if your "used to work" translated as "somehow a
bunch of bugs canceled each other out on PPC" ...


> Maybe the real problem is that we are trying to use the device suspend
> functions for suspend-to-disk, when we don't really want to change the
> device's power state at all.

I've made that point too. STD is logically a few steps: quiesce system,
write image to swap, change power state. The ACPI spec talks about
that as keeping the system in a G1/S4 powered state, but "swusp"
doesn't use that ... it does a full power-off. And of course, full power-off
means that the BIOS probably mucks with the USB hardware, so it's
not a real resume any more.

- Dave

2004-10-11 17:04:32

by David Brownell

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Sunday 10 October 2004 9:24 pm, Paul Mackerras wrote:
> Linus Torvalds writes:
>
> > And they are unbroken again (well, at least they work for me again).
> > Partly by the PM_ renumbering under discussion.
>
> Interesting. I find that with suspend-to-ram, USB keyboards don't
> work after resume, and that the system will hang on resume if you
> remove a USB device during sleep.

A "hang" sounds like the pmcore bug I reported about a year ago...

It's rather foolish of the PM core to use the same semaphore to
protect system-wide suspend/resume operations that it uses to
for mutual exclusion on the device add/remove (which suspend
and resume callbacks did happily in 2.4) ... since it's routine to
unplug peripherals on suspended systems!

Alternatively, if you're combininging USB_SUSPEND with any
system-wide suspend operation, you're asking for trouble;
the PM core is just not ready for that. (In fact I've wondered
if maybe 2.6.9 shouldn't discourage that combination more
actively...)

But a keyboard-specific issue might be improved with the
HID patch I posted last week, teaching that driver how to
handle suspend() and resume() callbacks.

- Dave

2004-10-11 17:11:58

by Brice Goglin

[permalink] [raw]
Subject: Re: suspend-to-RAM [was Re: Totally broken PCI PM calls]

>>On my N600c, suspend-to-RAM seems to complete... but when I try to wake
>>up the laptop (by pressing the power button), it blinks strangely and
>>then immediately shutdowns instead of resuming...
>
>
> Your machine is probabl different from N620c i this regard...
>
> Can you test if it reaches start of wakeup.S? Just insert infinite
> loop there...
> Pavel

Well, my apologies, the behavior changed recently:
* S1 seems to work.
* S3 is so buggy: suspend still seem to complete. Pushing the power
button to resume doesn't shutdown the machine anymore: the laptop wakes
up but doesn't write or do anything. I'm only able to stop it by
pressing the power button.

Regards,
Brice Goglin

2004-10-11 17:39:12

by Olivier Galibert

[permalink] [raw]
Subject: Re: suspend-to-RAM [was Re: Totally broken PCI PM calls]

On Mon, Oct 11, 2004 at 08:30:16AM -0700, Linus Torvalds wrote:
> That's the one. suspend-to-disk works, but suspend-to-ram leaves the fam
> going, and does not come back no matter how many times you press the power
> button. You need to kill it (twice) by holding the power button for five
> seconds (which is the "hard-power-off" signal to the southbridge, when
> everything else is locked up).

I had a somewhat equivalent problem with suspend-to-ram (working but
no wakeup) which required two patches:
- add PWRB and LID0 as wakeup devices[1]
- ignore PRWF since it doesn't send acpi events [2]

It was with a 2.6.8-rc2 kernel, so the situation may have changed since then.

OG.

[1] Seems added now, did you check the contents of /proc/acpi/wakeup_devices ?
[2] http://bugzilla.kernel.org/show_bug.cgi?id=1920

2004-10-11 18:22:17

by Pavel Machek

[permalink] [raw]
Subject: Re: suspend-to-RAM [was Re: Totally broken PCI PM calls]

H!

> > That's the one. suspend-to-disk works, but suspend-to-ram leaves the fam
> > going, and does not come back no matter how many times you press the power
> > button. You need to kill it (twice) by holding the power button for five
> > seconds (which is the "hard-power-off" signal to the southbridge, when
> > everything else is locked up).
>
> I had a somewhat equivalent problem with suspend-to-ram (working but
> no wakeup) which required two patches:
> - add PWRB and LID0 as wakeup devices[1]
> - ignore PRWF since it doesn't send acpi events [2]
>
> It was with a 2.6.8-rc2 kernel, so the situation may have changed since then.

Actually on N620c it does not even suspend, so Linus has different
problem.

Pavel

--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-10-11 18:24:52

by Pavel Machek

[permalink] [raw]
Subject: Re: suspend-to-RAM [was Re: Totally broken PCI PM calls]

Hi!

> >>On my N600c, suspend-to-RAM seems to complete... but when I try to wake
> >>up the laptop (by pressing the power button), it blinks strangely and
> >>then immediately shutdowns instead of resuming...
> >
> >
> >Your machine is probabl different from N620c i this regard...
> >
> >Can you test if it reaches start of wakeup.S? Just insert infinite
> >loop there...
>
> Well, my apologies, the behavior changed recently:
> * S1 seems to work.
> * S3 is so buggy: suspend still seem to complete. Pushing the power
> button to resume doesn't shutdown the machine anymore: the laptop wakes
> up but doesn't write or do anything. I'm only able to stop it by
> pressing the power button.

Hmm, and is the linux alive (like capslock works?) or do you need
hard poweroff?
Pavel

--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-10-11 18:41:26

by Brice Goglin

[permalink] [raw]
Subject: Re: suspend-to-RAM [was Re: Totally broken PCI PM calls]

>>* S1 seems to work.
>>* S3 is so buggy: suspend still seem to complete. Pushing the power
>>button to resume doesn't shutdown the machine anymore: the laptop wakes
>>up but doesn't write or do anything. I'm only able to stop it by
>>pressing the power button.
>
>
> Hmm, and is the linux alive (like capslock works?) or do you need
> hard poweroff?
> Pavel

Hard poweroff only. capslock doesn't work.

Thanks
Brice

2004-10-11 21:17:56

by Nigel Cunningham

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Hi David.

On Tue, 2004-10-12 at 02:36, David Brownell wrote:
> I've made that point too. STD is logically a few steps: quiesce system,
> write image to swap, change power state. The ACPI spec talks about
> that as keeping the system in a G1/S4 powered state, but "swusp"
> doesn't use that ... it does a full power-off. And of course, full power-off
> means that the BIOS probably mucks with the USB hardware, so it's
> not a real resume any more.

That's not necessarily true. Swsusp and suspend2 both include support
for enter ACPI S4 state. For suspend2 it's optional (to allow for broken
bioses). Not sure about whether it is with swsusp.

Regards,

Nigel
--
Nigel Cunningham
Pastoral Worker
Christian Reformed Church of Tuggeranong
PO Box 1004, Tuggeranong, ACT 2901

Many today claim to be tolerant. True tolerance, however, can cope with others
being intolerant.

2004-10-11 21:37:17

by David Brownell

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Hi Nigel,

On Monday 11 October 2004 2:17 pm, Nigel Cunningham wrote:
> On Tue, 2004-10-12 at 02:36, David Brownell wrote:
> > I've made that point too. STD is logically a few steps: quiesce system,
> > write image to swap, change power state.

I'm hoping you agree with that abbreviated summary of
what's involved! Pavel seemed to. Of course the devil is
in the details, which I hope to leave mostly to others ... ;)


> > The ACPI spec talks about
> > that as keeping the system in a G1/S4 powered state, but "swusp"
> > doesn't use that ... it does a full power-off. And of course,
> > full power-off
> > means that the BIOS probably mucks with the USB hardware, so it's
> > not a real resume any more.
>
> That's not necessarily true. Swsusp and suspend2 both include support
> for enter ACPI S4 state. For suspend2 it's optional (to allow for broken
> bioses). Not sure about whether it is with swsusp.

The machines I've tested with relatively generic 2.6.9-rc kernels
don't use BIOS support for S4 when I call swsusp.

Of course the ACPI spec muddies the water by talking about two
different states called "S4": "S4 Sleeping", which is what I was
talking about as G1/S4; and "S4 Non-Volatile Sleep" that's more
what I've seen with swusp: more like a G2 or G3 poweroff.

I'm willing to believe that there are systems on which swsusp
tells drivers a less confusing story ... but I don't seem to have
tested with those!

- Dave

2004-10-11 22:17:02

by Stefan Seyfried

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

David Brownell wrote:

> The machines I've tested with relatively generic 2.6.9-rc kernels
> don't use BIOS support for S4 when I call swsusp.

first do either
echo platform > /sys/power/disk # for S4
echo shutdown > /sys/power/disk # for poweroff

then do
echo disk > /sys/power/state


Stefan

2004-10-11 22:27:26

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Tue, 2004-10-12 at 02:36, David Brownell wrote:

> I can say that USB suspend/resume works much better now on
> some x86 hardware, especially with patches in Greg's bk-usb tree;
> at the level of "those never worked on 2.6 before!" There are still
> several PMcore problems getting in the way though, and I'd not
> be surprised if your "used to work" translated as "somehow a
> bunch of bugs canceled each other out on PPC" ...

Well... the difference may be that PPC had working suspend to RAM for
ages way before x86 had anything working better than shitty APM, _and_
based on the device model ...

One thing is that CONFIG_USB_SUSPEND is a 100% killer with USB and
suspend. When set, we will deadlock on some semaphores randomly at
wakeup under various conditions. The typical case is removing your
device (like the mouse) while the machine is asleep, but that's not the
only one.

> > Maybe the real problem is that we are trying to use the device suspend
> > functions for suspend-to-disk, when we don't really want to change the
> > device's power state at all.
>
> I've made that point too. STD is logically a few steps: quiesce system,
> write image to swap, change power state. The ACPI spec talks about
> that as keeping the system in a G1/S4 powered state, but "swusp"
> doesn't use that ... it does a full power-off. And of course, full power-off
> means that the BIOS probably mucks with the USB hardware, so it's
> not a real resume any more.

Well, on a lot of platforms, you don't have the choice but do a full power
off anyway. swsusp can do an ACPI S4 if asked to. The only problem at the
moment is that the first round does actually ask for a suspend (D3 with the
new PCI translation) while it should only ask for an "idle" state (I wouldn't
call it D1 neither, we only wnat the driver to stop all activities & DMA,
we don't care about the HW power state proper).

Ben.


2004-10-11 22:25:56

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Tue, 2004-10-12 at 02:15, David Brownell wrote:

> In USB-land, we've been discussing "idle" as something orthogonal
> to the PM states. Not clear on all the details yet, but I'd vote against
> anything that tries to make "driver doing nothing" a power state,
> or doesn't have a way to idle drivers.

Well, having an "idle" system state that asks all drivers to go to
"idle" state doesn't prevent local bus/driver management from having
it's own dynamic idle state ...

"idle" as a system state is a kind of "light sleep" state where you come
back up very quickly and makes a lot of sense for handheld devices.

"idle" as a device-state is what suspend-to-disk really wants :)

So you don't wnat the system state passed down to drivers but a policy
instead ... We probably need more than that, like some additional flags
along with a platform filter. For example, during a system suspend, a
given piece of HW may end up beeing unclocked or powered off by the
system, the driver will want to know that. The firmware may reboot the
device on wakeup or not. The driver need to know that too.

But I agree that we can avoid passing down a system state if we define
a "policy" state along with a few flags.

There is something else that we need to take into account. For system
state, we want the driver to stay idle until woken up explicitely. But
there are also more "dynamic" PM states that we may want to be triggered
by the user via sysfs for which the driver will come back automatically
when a request comes in from upstream (equivalent to disk idle sleep).

> Drivers can be "idle" without entering low power states, and can
> use wakeup events to enter/exit low power states without being
> fully idle. (Hardware allowing.) That's an example of a policy that
> drivers should be able to choose without affecting PMcore.

But that doesn't prevent the system from enforcing all drivers into an
idle state, that is no request processing, consistent state image in
memory and no DMA, but no need to actually power down the device. That
is suitable for quick-wakeup idle or for suspend-to-disk.

> I've sent separate posts about how to add wakeup support to the
> PM framework ... that omission should have been a great big
> red (or is it chartreuse?) warning flag that the PM framework was
> lacking some very basic functionality.

Right.

>
> > "suspend to ram" and "suspend to disk".
>
> Those aren't device power states at all!! They're system sleep
> states (or transitions). And by the way, "suspend to disk" is
> an odd model for systems without disks, or swap ... which oddly
> enough tend to be ones that really need good PM, and which
> (like HH.org) need the "partially suspended" system model
> to work well.

They are, but the choice of a device power state from a system power
state can most of the time only be done ... by the device-driver itself,
eventually with support from the platform. Only the device-driver knows
what it's device really supports as far as states as concerned, what the
driver supports waking the device from, etc... Though it needs
informations and help from the platform as well, like knowing if the
firmware will bring the device back up from power-on-reset (this is very
important for video cards).
>
> > But what about user /sysfs originated requests ? (that is random numbers
> > the user whacks in /sys/devices/...../power) what are their semantics ?
>
> Sysfs should only read/write the names of the states that particular
> device can support. Plus probably some generic requests for policies
> that the sleep framework would hand to individual drivers.
>
> > Also, do we carry around a "suggested" D state for what it means ? it's
> > really an obscure PCI concept. However, as you can see with the hacks
> > in drivers like radeonfb, ....
>
> I can't think of PCI D-states as obscure, they're the core of its PM support.
> PCI drivers need to use them to implement power policies.

No, they are obscure. The signification of a given D state at the HW
level and the way a given state is actually supported by a given device
is really totally device-specific. The PCI spec is nice but HW rarely
follow it in my experience.

> Without looking at that code, I'll just say that while many PCI drivers
> can probably offload the decision making to some PCI core code
> ("use D1 or D2 when idle if available; D3hot otherwise"), not all
> can do that.

The problem, again, is that chosing the right state is a decision that
can only be done by the driver, provided it knows information about the
system state (or policy if you prefer, your policy thing is just a way
to pass system states without calling them this way), and informations
from the platform about what will actually happen on this specific slot
after the state is entered (and a lot of systems don't give a shit about
PCI D states at this point, some machines will just power down all
slots, or a random set of them, some will cut clocks off, some will do
nothing, etc...)

Ben.


2004-10-11 22:30:17

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls


> A "hang" sounds like the pmcore bug I reported about a year ago...
>
> It's rather foolish of the PM core to use the same semaphore to
> protect system-wide suspend/resume operations that it uses to
> for mutual exclusion on the device add/remove (which suspend
> and resume callbacks did happily in 2.4) ... since it's routine to
> unplug peripherals on suspended systems!

Definitely. One thing is: how to do it instead ? I've been thinking
about it for a while and am still wondering... do we want a list
mecanism with add/remove notifiers so the PM walk can keep in sync
with devices added/removed ? or should addition/removal be simply
postponed until the end of the sleep/wakeup process (I tend to vote
for that).

> Alternatively, if you're combininging USB_SUSPEND with any
> system-wide suspend operation, you're asking for trouble;
> the PM core is just not ready for that. (In fact I've wondered
> if maybe 2.6.9 shouldn't discourage that combination more
> actively...)
>
> But a keyboard-specific issue might be improved with the
> HID patch I posted last week, teaching that driver how to
> handle suspend() and resume() callbacks.
>
> - Dave
--
Benjamin Herrenschmidt <[email protected]>

2004-10-11 22:59:39

by Dmitry Torokhov

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Monday 11 October 2004 05:28 pm, Benjamin Herrenschmidt wrote:
> > A "hang" sounds like the pmcore bug I reported about a year ago...
> >
> > It's rather foolish of the PM core to use the same semaphore to
> > protect system-wide suspend/resume operations that it uses to
> > for mutual exclusion on the device add/remove (which suspend
> > and resume callbacks did happily in 2.4) ... since it's routine to
> > unplug peripherals on suspended systems!
>
> Definitely. One thing is: how to do it instead ? I've been thinking
> about it for a while and am still wondering... do we want a list
> mecanism with add/remove notifiers so the PM walk can keep in sync
> with devices added/removed ? or should addition/removal be simply
> postponed until the end of the sleep/wakeup process (I tend to vote
> for that).
>

Yes, I think that devices that failed to resume (and all their children)
have to be moved by the core resume function into a separate list and
then destroyed (again by the driver core). For that we might need to add
bus_type->remove_device() handler as it seems that all buses do alot
of work outside of driver->remove handlers. The remove_device should
accept additional argument - something like dead_device that would
suggest that driver should not be alarmed by any errors during unbind/
removal process as the device (or rather usually its parent) is simply
not there anynore.

Just my $.02

--
Dmitry

2004-10-11 23:12:12

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Tue, 2004-10-12 at 08:58, Dmitry Torokhov wrote:

> Yes, I think that devices that failed to resume (and all their children)
> have to be moved by the core resume function into a separate list and
> then destroyed (again by the driver core). For that we might need to add
> bus_type->remove_device() handler as it seems that all buses do alot
> of work outside of driver->remove handlers. The remove_device should
> accept additional argument - something like dead_device that would
> suggest that driver should not be alarmed by any errors during unbind/
> removal process as the device (or rather usually its parent) is simply
> not there anynore.

They already do... think USB...

It's really only a locking problem within the PM core, that's getting
OT at the moment. I'll see if I can come up with something later.

Ben.


2004-10-12 01:24:50

by Nigel Cunningham

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Hi.

On Tue, 2004-10-12 at 07:37, David Brownell wrote:
> On Monday 11 October 2004 2:17 pm, Nigel Cunningham wrote:
> > On Tue, 2004-10-12 at 02:36, David Brownell wrote:
> > > I've made that point too. STD is logically a few steps: quiesce system,
> > > write image to swap, change power state.
>
> I'm hoping you agree with that abbreviated summary of
> what's involved! Pavel seemed to. Of course the devil is
> in the details, which I hope to leave mostly to others ... ;)

It certainly fits for Pavel, but, I do have things slightly different,
so that don't have the maximum-image-size-is-half-of-RAM problem. At the
moment I do:

quiesce system & prepare metadata
suspend devices not used for writing the image or user I/O (this is just
seeking to remove any chance of them allocating memory, which might
strangle writing the image)
write LRU pages ('pageset 2')
suspend used devices
save CPU context & make atomic copy of remaining pages
resume used devices
save image
power down used devices
power down system/enter S4.

This is why I made that 'device tree' patch - so I could separate the
devices used for writing the image from those unused and treat them
separately. I'm pretty sure that I could get away with leaving the
unused ones alone until snapshot time, but it seems more ideal to me to
get them to save state and power down at the start, especially if you're
trying to suspend when the battery is low!

At the very least, I'd like to see that 'snapshot' state implemented
separately to the 'powerdown' state.

> Of course the ACPI spec muddies the water by talking about two
> different states called "S4": "S4 Sleeping", which is what I was
> talking about as G1/S4; and "S4 Non-Volatile Sleep" that's more
> what I've seen with swusp: more like a G2 or G3 poweroff.

Okay. I've only looked at the ACPI spec occasionally, and have generally
just followed the lead of Patrick and Pavel in implementing ACPI support
in s-t-d.

> I'm willing to believe that there are systems on which swsusp
> tells drivers a less confusing story ... but I don't seem to have
> tested with those!

:> I'm confused by all these changes; no wonder drivers are!

Nigel
--
Nigel Cunningham
Pastoral Worker
Christian Reformed Church of Tuggeranong
PO Box 1004, Tuggeranong, ACT 2901

Many today claim to be tolerant. True tolerance, however, can cope with others
being intolerant.

2004-10-12 03:03:43

by David Brownell

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Monday 11 October 2004 3:22 pm, Benjamin Herrenschmidt wrote:
> On Tue, 2004-10-12 at 02:15, David Brownell wrote:
>
> > In USB-land, we've been discussing "idle" as something orthogonal
> > to the PM states. Not clear on all the details yet, but I'd vote against
> > anything that tries to make "driver doing nothing" a power state,
> > or doesn't have a way to idle drivers.
>
> Well, having an "idle" system state that asks all drivers to go to
> "idle" state doesn't prevent local bus/driver management from having
> it's own dynamic idle state ...

Right ... if it were a CPU, one could say it's the difference between
a default "performance" governor or instead "ondemand". But
it's an single device; so we can't currently say such things.


> "idle" as a system state is a kind of "light sleep" state where you come
> back up very quickly and makes a lot of sense for handheld devices.
>
> "idle" as a device-state is what suspend-to-disk really wants :)

Some of those handheld devices want the analogue of an
"ondemand" governor applied to each device. On a truly
idle system, that should end up looking like an "idle" system
state ... except that it wakes as needed!


> So you don't wnat the system state passed down to drivers but a policy
> instead ... We probably need more than that, like some additional flags
> along with a platform filter. For example, during a system suspend, a
> given piece of HW may end up beeing unclocked or powered off by the
> system, the driver will want to know that. The firmware may reboot the
> device on wakeup or not. The driver need to know that too.

Right; losing one clock may just save power, but losing another might
force a reset. But most of that would be the driver's knowledge of how
the system was built ... more common with SOC systems than with PCs,
and a general PM core issue not a PCI-only one.

For PCI, the drivers are often going to stick to the basics: if it's really
in a D3hot, D2, or D1 state then it's still powered and didn't need to
lose any state. Simple drivers might always do another reset. More
complex ones (like a USB HCD!) will avoid that because of evil
consequences.


> But I agree that we can avoid passing down a system state if we define
> a "policy" state along with a few flags.
>
> There is something else that we need to take into account. For system
> state, we want the driver to stay idle until woken up explicitely. But
> there are also more "dynamic" PM states that we may want to be triggered
> by the user via sysfs for which the driver will come back automatically
> when a request comes in from upstream (equivalent to disk idle sleep).

Yes, in fact what you talked about as a "system idle" state seems
like it might be a per-device version of your "dynamic device idle" state,
but with wakeup events globally disabled. (And maybe also automatic
power state transitions disabled...)

Or I suppose "idle" could just be another kind of device policy
governor, along with "performance" and "ondemand".


> They are, but the choice of a device power state from a system power
> state can most of the time only be done ... by the device-driver itself,
> eventually with support from the platform. Only the device-driver knows
> what it's device really supports as far as states as concerned, what the
> driver supports waking the device from, etc... Though it needs
> informations and help from the platform as well, like knowing if the
> firmware will bring the device back up from power-on-reset (this is very
> important for video cards).

Sure, though PCI is standardized enough that many non-video drivers
can have very simple suspend/resume logic (as I sketched above).
The framework needs to handle those well, too!



> The problem, again, is that chosing the right state is a decision that
> can only be done by the driver, provided it knows information about the
> system state (or policy if you prefer, your policy thing is just a way
> to pass system states without calling them this way),

It's not "just" that. Consider N devices that can each be made to use
one of three policies: performance, ondemand, or idle. Clearly
there are more than 3N potential system states! Though I suppose
it all comes down to what a "state" is. It'd certainly make sense to
make it easy to set all devices to use the same policy; also, to let
userspace manage the states of particular devices.


> and informations
> from the platform about what will actually happen on this specific slot
> after the state is entered (and a lot of systems don't give a shit about
> PCI D states at this point, some machines will just power down all
> slots, or a random set of them, some will cut clocks off, some will do
> nothing, etc...)

Well that sounds like no fun! But it's not all that different from how
various non-PCI systems work either.

- Dave

2004-10-12 03:03:47

by David Brownell

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Monday 11 October 2004 4:08 pm, Benjamin Herrenschmidt wrote:
> On Tue, 2004-10-12 at 08:58, Dmitry Torokhov wrote:
>
> > Yes, I think that devices that failed to resume (and all their children)
> > have to be moved by the core resume function into a separate list and
> > then destroyed (again by the driver core).

OHCI has to do this when the controller loses power during suspend;
which includes many suspend-to-disk cases. It marks the devices dead,
kills the I/O queues, and then makes khubd do all the work.


> > For that we might need to add
> > bus_type->remove_device() handler as it seems that all buses do alot
> > of work outside of driver->remove handlers. The remove_device should
> > accept additional argument - something like dead_device that would
> > suggest that driver should not be alarmed by any errors during unbind/
> > removal process as the device (or rather usually its parent) is simply
> > not there anynore.
>
> They already do... think USB...

USB decided against the extra argument; drivers don't much care
at that point. And anyway, they can tell the device is gone by looking
at status codes returned by URB completion or submission.

- Dave


> Ben.
>
>
>

2004-10-12 03:07:54

by David Brownell

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Monday 11 October 2004 3:12 pm, Stefan Seyfried wrote:
> David Brownell wrote:
>
> > The machines I've tested with relatively generic 2.6.9-rc kernels
> > don't use BIOS support for S4 when I call swsusp.
>
> first do either
> echo platform > /sys/power/disk # for S4
> echo shutdown > /sys/power/disk # for poweroff
>
> then do
> echo disk > /sys/power/state

Oddly enough, neither of them work lately for me.
They each resume immediately after writing the
image to disk.

- Dave

p.s. I find the /sys/power/disk file mildly cryptic, maybe
other folk will find the attached patch slightly more
informative about what this interface can do.




Attachments:
(No filename) (666.00 B)
disk.patch (577.00 B)
Download all attachments

2004-10-12 04:02:59

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls


> Right ... if it were a CPU, one could say it's the difference between
> a default "performance" governor or instead "ondemand". But
> it's an single device; so we can't currently say such things.

Yes. And we need partial tree suspend for that.

> > "idle" as a system state is a kind of "light sleep" state where you come
> > back up very quickly and makes a lot of sense for handheld devices.
> >
> > "idle" as a device-state is what suspend-to-disk really wants :)
>
> Some of those handheld devices want the analogue of an
> "ondemand" governor applied to each device. On a truly
> idle system, that should end up looking like an "idle" system
> state ... except that it wakes as needed!

Yup.

> Right; losing one clock may just save power, but losing another might
> force a reset. But most of that would be the driver's knowledge of how
> the system was built ... more common with SOC systems than with PCs,
> and a general PM core issue not a PCI-only one.
>
> For PCI, the drivers are often going to stick to the basics: if it's really
> in a D3hot, D2, or D1 state then it's still powered and didn't need to
> lose any state. Simple drivers might always do another reset. More
> complex ones (like a USB HCD!) will avoid that because of evil
> consequences.

Well... the typical problem is video cards and most of them are PCI.
Depending on what your system & firmware does, you want to use D2
(usually called "APM" mode by card vendors) or D3 (which end up beeing
cold most of the time as the system will remove power). Then it depends
if the firmware will wake the card for you or not. In some cases, the
driver must refuse suspend if it can't do it. For example, on PowerMacs,
I have these cases:

- Some laptops get the chip unclocked when suspending to RAM. We do D2
state in the ATI drivers, we have the code for that with most mobility
chips and that's what we had working for a while

- Recent laptops and desktops power off the chip. Paulus managed to use
a tool I wrote to spy the MacOS driver and we got the re-init sequence
for some of these (currently 2 models) but not all. So the driver must
make the decision here of interrupting the sleep process if it can't
wake the chip.

- A bunch of x86 laptops will have the chip powered off, but the BIOS
will re-POST it for you. (Ok, here we may have a problem identifying if
a given BIOS will do it or not ...)

So the driver decision of what state to enter into and how to wakeup
depends on critierias some beeing platform specific infos on what
happens with a given slot...

> Yes, in fact what you talked about as a "system idle" state seems
> like it might be a per-device version of your "dynamic device idle" state,
> but with wakeup events globally disabled. (And maybe also automatic
> power state transitions disabled...)

Or globally enabled ... that is the high level "monitoring" stuff
detects nothing happened anywhere for some time and broadcasts a system
"idle" state to all devices. But it has the same wakeup semantics as a
sleep state, that is a keyboard (or pencil) hit will trigger a wakeup,
difference is wakeup there is very fast.

> Or I suppose "idle" could just be another kind of device policy
> governor, along with "performance" and "ondemand".

Well, we have three different informations if you think about it:

- device state (the state the HW is put into)
- the driver state (the state the driver is)
- the 'stickyness" of those state (do we send back a wakeup
even globally to the system ? does the driver just wakes up
itself automatically if a request comes from above ?)

Though the later can be just flags, along with, eventually, some of my
platform thingies, those can be flags.

An example of the difference between device state and driver state: a
system "idle" state want devices to be put into some sort of D1 state,
that is power managed with very fast wakeup. On another hand,
suspend-to-disk don't care about devices beeing put in any power state
at all, but the driver must be "frozen" in a consistent state (not
process requests etc...) so we get a consistent image.
In the first case, it makes even sense to keep the driver operating
while the device is D1, the driver would then just wake the device on an
incoming request (provided this is allowed by the policy). In the later
case, the driver state is the only thing that matters.

The "will lose power" flag could be useful for USB too when you think
about it. Some controllers will need a disconnect (some don't do PCI PM
for example, or we can decide that in suspend-to-disk without S4 BIOS
support, we always disconnect everybody). That doesn't preclude the USB
driver from wanting to act before suspend. For example, an externally
powered mass storage may want to issue a spin-down command.

> > They are, but the choice of a device power state from a system power
> > state can most of the time only be done ... by the device-driver itself,
> > eventually with support from the platform. Only the device-driver knows
> > what it's device really supports as far as states as concerned, what the
> > driver supports waking the device from, etc... Though it needs
> > informations and help from the platform as well, like knowing if the
> > firmware will bring the device back up from power-on-reset (this is very
> > important for video cards).
>
> Sure, though PCI is standardized enough that many non-video drivers
> can have very simple suspend/resume logic (as I sketched above).
> The framework needs to handle those well, too!

Agreed. That's why I was thinking about a "helper" that would convert
the passed-in struct into a PCI D state (with the help eventually of
the platform, that is the helper beeing platform-overridable) and would
give the suggested D state.

> It's not "just" that. Consider N devices that can each be made to use
> one of three policies: performance, ondemand, or idle. Clearly
> there are more than 3N potential system states! Though I suppose
> it all comes down to what a "state" is. It'd certainly make sense to
> make it easy to set all devices to use the same policy; also, to let
> userspace manage the states of particular devices.

Yup, with the exception that it becomes hell when those devices are
anywhere on the VM path... which makes userspace policy unuseable for
system suspend.

> > and informations
> > from the platform about what will actually happen on this specific slot
> > after the state is entered (and a lot of systems don't give a shit about
> > PCI D states at this point, some machines will just power down all
> > slots, or a random set of them, some will cut clocks off, some will do
> > nothing, etc...)
>
> Well that sounds like no fun! But it's not all that different from how
> various non-PCI systems work either.

Yup.

Ben.


2004-10-12 04:10:00

by Dmitry Torokhov

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Monday 11 October 2004 10:00 pm, David Brownell wrote:
> On Monday 11 October 2004 4:08 pm, Benjamin Herrenschmidt wrote:
> > On Tue, 2004-10-12 at 08:58, Dmitry Torokhov wrote:
> >
> > > Yes, I think that devices that failed to resume (and all their children)
> > > have to be moved by the core resume function into a separate list and
> > > then destroyed (again by the driver core).
>
> OHCI has to do this when the controller loses power during suspend;
> which includes many suspend-to-disk cases. It marks the devices dead,
> kills the I/O queues, and then makes khubd do all the work.
>

Yes, I see that. But so far every bus implements it in its own way - USB,
IEEE1394, serio, pcmcia... It would be nice if there was a standard
mechanism to deal with that situation. And actually a standard way of
pruning part of the device tree which is useful for other things as well.

>
> > > For that we might need to add
> > > bus_type->remove_device() handler as it seems that all buses do alot
> > > of work outside of driver->remove handlers. The remove_device should
> > > accept additional argument - something like dead_device that would
> > > suggest that driver should not be alarmed by any errors during unbind/
> > > removal process as the device (or rather usually its parent) is simply
> > > not there anynore.
> >
> > They already do... think USB...
>
> USB decided against the extra argument; drivers don't much care
> at that point. And anyway, they can tell the device is gone by looking
> at status codes returned by URB completion or submission.
>

It really depends on the bus I think... I do not know USB well enough but
does status code gives you enough information to determine that the device
is gone or it's just not responding for mose reason? I probably would not
want to log errors when device is really gone, but when I fail to talk to
it becuase its stuck I would complain.

--
Dmitry

2004-10-12 08:54:51

by Pavel Machek

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Hi!

> The machines I've tested with relatively generic 2.6.9-rc kernels
> don't use BIOS support for S4 when I call swsusp.
>
> Of course the ACPI spec muddies the water by talking about two
> different states called "S4": "S4 Sleeping", which is what I was
> talking about as G1/S4; and "S4 Non-Volatile Sleep" that's more
> what I've seen with swusp: more like a G2 or G3 poweroff.
>
> I'm willing to believe that there are systems on which swsusp
> tells drivers a less confusing story ... but I don't seem to have
> tested with those!

If you are entering S4 or S5 at the end of swsusp basically should not
matter to anyone. What we tell the drivers is same in both cases.

Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-10-12 08:58:26

by Pavel Machek

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Hi!

> > > The machines I've tested with relatively generic 2.6.9-rc kernels
> > > don't use BIOS support for S4 when I call swsusp.
> >
> > first do either
> > echo platform > /sys/power/disk # for S4
> > echo shutdown > /sys/power/disk # for poweroff
> >
> > then do
> > echo disk > /sys/power/state
>
> Oddly enough, neither of them work lately for me.
> They each resume immediately after writing the
> image to disk.

dmesg would help....

> p.s. I find the /sys/power/disk file mildly cryptic, maybe
> other folk will find the attached patch slightly more
> informative about what this interface can do.

> --- 1.19/kernel/power/disk.c Thu Sep 9 08:45:13 2004
> +++ edited/kernel/power/disk.c Fri Oct 1 11:01:41 2004
> @@ -282,7 +282,14 @@
>
> static ssize_t disk_show(struct subsystem * subsys, char * buf)
> {
> - return sprintf(buf,"%s\n",pm_disk_modes[pm_disk_mode]);
> + return sprintf(buf,"%s%s %s%s %s%s\n",
> + (pm_disk_mode == pm_ops->pm_disk_mode) ? "*" : "",
> + pm_disk_modes[pm_ops->pm_disk_mode],
> + (pm_disk_mode == PM_DISK_SHUTDOWN) ? "*" : "",
> + pm_disk_modes[PM_DISK_SHUTDOWN],
> + (pm_disk_mode == PM_DISK_REBOOT) ? "*" : "",
> + pm_disk_modes[PM_DISK_REBOOT]
> + );
> }
>
>

Hmm, its interface change, and was not /sys expected to be "one file,
one value"?

Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-10-12 09:27:37

by Russell King

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Tue, Oct 12, 2004 at 08:28:07AM +1000, Benjamin Herrenschmidt wrote:
> Definitely. One thing is: how to do it instead ? I've been thinking
> about it for a while and am still wondering... do we want a list
> mecanism with add/remove notifiers so the PM walk can keep in sync
> with devices added/removed ? or should addition/removal be simply
> postponed until the end of the sleep/wakeup process (I tend to vote
> for that).

What about the case where you're walking the tree for a resume, and
you've hotplugged a whole tree of devices which have a similar bus
setup to the original.

Yes, I'm thinking of the case of Cardbus with hotpluggable PCI buses.
If we detect that the "bridge" at the top of the chain has changed,
we _really_ don't want to try to restore the state of the child
devices - they may have the same bus IDs, but they could well object
to being inappropriately setup.

Sure, we can say "don't do that then" but I suspect the exact same
problem is present with USB, and USB is far more liable to have this
type of abuse than PCI.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core

2004-10-12 10:41:49

by Stefan Seyfried

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Hello,

Pavel Machek wrote:

>>p.s. I find the /sys/power/disk file mildly cryptic, maybe
>> other folk will find the attached patch slightly more
>> informative about what this interface can do.
>
>>--- 1.19/kernel/power/disk.c Thu Sep 9 08:45:13 2004
>>+++ edited/kernel/power/disk.c Fri Oct 1 11:01:41 2004
>>@@ -282,7 +282,14 @@
>>
>> static ssize_t disk_show(struct subsystem * subsys, char * buf)
>> {
>>- return sprintf(buf,"%s\n",pm_disk_modes[pm_disk_mode]);
>>+ return sprintf(buf,"%s%s %s%s %s%s\n",
>>+ (pm_disk_mode == pm_ops->pm_disk_mode) ? "*" : "",
>>+ pm_disk_modes[pm_ops->pm_disk_mode],
>>+ (pm_disk_mode == PM_DISK_SHUTDOWN) ? "*" : "",
>>+ pm_disk_modes[PM_DISK_SHUTDOWN],
>>+ (pm_disk_mode == PM_DISK_REBOOT) ? "*" : "",
>>+ pm_disk_modes[PM_DISK_REBOOT]
>>+ );
>> }
>
> Hmm, its interface change,

yes, this would break _my_ userspace app that checks for successfully
setting shutdown mode by reading out the file, but well, i could live
with that.

> and was not /sys expected to be "one file, one value"?

# cat /sys/power/state
standby mem disk

so one could assume that /sys/power/disk would have similar semantics.
OTOH, /sys/power/state does not indicate the active state but
/sys/power/disk does, so they _are_ different.
--
Stefan Seyfried, QA / R&D Team Mobile Devices, SUSE LINUX AG N?rnberg.

"Any ideas, John?"
"Well, surrounding them's out."

2004-10-12 10:51:22

by Nigel Cunningham

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Hi.

On Tue, 2004-10-12 at 14:02, Benjamin Herrenschmidt wrote:

[...]

> An example of the difference between device state and driver state: a
> system "idle" state want devices to be put into some sort of D1 state,
> that is power managed with very fast wakeup. On another hand,
> suspend-to-disk don't care about devices beeing put in any power state
> at all, but the driver must be "frozen" in a consistent state (not
> process requests etc...) so we get a consistent image.

> In the first case, it makes even sense to keep the driver operating
> while the device is D1, the driver would then just wake the device on an
> incoming request (provided this is allowed by the policy). In the later
> case, the driver state is the only thing that matters.

Not quite - you have to be able to get the device into a matching state
at resume time. Probably not a problem in most cases, I realise, but
thought it was worth a mention.

[...]

> Yup, with the exception that it becomes hell when those devices are
> anywhere on the VM path... which makes userspace policy unuseable for
> system suspend.

A device on the VM path? I don't follow here. Can I have a hand please?

Regards,

Nigel
--
Nigel Cunningham
Pastoral Worker
Christian Reformed Church of Tuggeranong
PO Box 1004, Tuggeranong, ACT 2901

Many today claim to be tolerant. True tolerance, however, can cope with others
being intolerant.

2004-10-12 11:26:28

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls


> What about the case where you're walking the tree for a resume, and
> you've hotplugged a whole tree of devices which have a similar bus
> setup to the original.
>
> Yes, I'm thinking of the case of Cardbus with hotpluggable PCI buses.
> If we detect that the "bridge" at the top of the chain has changed,
> we _really_ don't want to try to restore the state of the child
> devices - they may have the same bus IDs, but they could well object
> to being inappropriately setup.
>
> Sure, we can say "don't do that then" but I suspect the exact same
> problem is present with USB, and USB is far more liable to have this
> type of abuse than PCI.

Most of the time, this scenario works fine (provided you have paulus
recent patch tho), since the "host controller" of whatever tree of
devices will only start registering it's child after beeing itself woken
up, stuffs will end up at the right place... In fact, the only real race
we have, after paulus patch is applied, is some tiny window when adding
devices suring the suspend process. It's so tiny that we can probably
find a way to just error out the insertion in this case tho...

Ben.



2004-10-12 11:30:20

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls


> > In the first case, it makes even sense to keep the driver operating
> > while the device is D1, the driver would then just wake the device on an
> > incoming request (provided this is allowed by the policy). In the later
> > case, the driver state is the only thing that matters.
>
> Not quite - you have to be able to get the device into a matching state
> at resume time. Probably not a problem in most cases, I realise, but
> thought it was worth a mention.

Wait, I'm not talking about swsusp here, that's exactly my distinction
between driver state and device state :) Only system suspend like swsusp
and suspend-to-ram require a completely coherent and frozen memory
"image".

All sorts of dynamic PM, including system-wide "idle" can deal with
scenarios where the driver can stay functional and decide by itself to
wake up the device upon incoming requests.

> [...]
>
> > Yup, with the exception that it becomes hell when those devices are
> > anywhere on the VM path... which makes userspace policy unuseable for
> > system suspend.
>
> A device on the VM path? I don't follow here. Can I have a hand please?

Any block device basically, I don't even want to think about the issues
between NFS & suspend :)

That is anything that userspace potentially requires to operate (device on
the path to an mmap'd file, swap, whatever)...

Ben.


2004-10-12 11:38:50

by Nigel Cunningham

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Hi.

On Tue, 2004-10-12 at 21:27, Benjamin Herrenschmidt wrote:
> > > In the first case, it makes even sense to keep the driver operating
> > > while the device is D1, the driver would then just wake the device on an
> > > incoming request (provided this is allowed by the policy). In the later
> > > case, the driver state is the only thing that matters.
> >
> > Not quite - you have to be able to get the device into a matching state
> > at resume time. Probably not a problem in most cases, I realise, but
> > thought it was worth a mention.
>
> Wait, I'm not talking about swsusp here, that's exactly my distinction
> between driver state and device state :) Only system suspend like swsusp
> and suspend-to-ram require a completely coherent and frozen memory
> "image".

Oh ok. I thought suspend was one application of what you were talking
about (not necessarily the only one, but one...)

> All sorts of dynamic PM, including system-wide "idle" can deal with
> scenarios where the driver can stay functional and decide by itself to
> wake up the device upon incoming requests.
>
> > [...]
> >
> > > Yup, with the exception that it becomes hell when those devices are
> > > anywhere on the VM path... which makes userspace policy unuseable for
> > > system suspend.
> >
> > A device on the VM path? I don't follow here. Can I have a hand please?
>
> Any block device basically, I don't even want to think about the issues
> between NFS & suspend :)

> That is anything that userspace potentially requires to operate (device on
> the path to an mmap'd file, swap, whatever)...

Ah.. I'm with you now. The 'VM path' was too abstract for me at 9:30pm
:>.

Nigel
--
Nigel Cunningham
Pastoral Worker
Christian Reformed Church of Tuggeranong
PO Box 1004, Tuggeranong, ACT 2901

Many today claim to be tolerant. True tolerance, however, can cope with others
being intolerant.

2004-10-12 11:53:00

by Pavel Machek

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Hi!

> > Well... it doesn't work on paul's laptop, but anyway, ok, let's go for
> > the struct thing and forget about this for 2.6.9.
>
> PM hasn't worked for me on 2.6 with most hardware, about since
> the "new PMcore" kicked in, so it's hard for me to judge progress
> except at the level of unit tests. Where I see two steps forward,
> one step back ... on a good day.
>
> The root cause of many of these problem is that there's
> a confusion between system-wide sleep states and the
> device-specific power states from which they're built.
> They _should_ be using two distinct data types. Not one
> integer/enum type ... especially not one int/enum type
> that's got multiple conflicting "legacy" interpretations!

Actually I do not see what is so wrong with one enum type; with sparse
we have typechecking, and if someone assigns value from one enum into
another enum, he's clearly doing something wrong.
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-10-12 17:30:02

by David Brownell

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Monday 11 October 2004 9:09 pm, Dmitry Torokhov wrote:
> On Monday 11 October 2004 10:00 pm, David Brownell wrote:

> > OHCI has to do this when the controller loses power during suspend;
> > which includes many suspend-to-disk cases. It marks the devices dead,
> > kills the I/O queues, and then makes khubd do all the work.
> >
>
> Yes, I see that. But so far every bus implements it in its own way - USB,
> IEEE1394, serio, pcmcia... It would be nice if there was a standard
> mechanism to deal with that situation. And actually a standard way of
> pruning part of the device tree which is useful for other things as well.

Certainly. It _shouldn't_ be too tricky adding that to the PM core,
but there are issues including how to get the locking right when
tasks are concurrently walking the device tree. The issue comes
up with PCI/Cardbus hotplugging too, so it's not really offtopic ... ;)


> > USB decided against the extra argument; drivers don't much care
> > at that point. And anyway, they can tell the device is gone by looking
> > at status codes returned by URB completion or submission.
> >
>
> It really depends on the bus I think... I do not know USB well enough but
> does status code gives you enough information to determine that the device
> is gone or it's just not responding for mose reason?

There's a window between "unplug" and "khubd notices unplug"
where requests give controller-specific status reports, some of which
translate directly to "not responding". The driver's fault recovery
mechanism will eventually start seeing "device gone", once khubd
notices. Other than network drivers, most drivers usually see the
"gone" status first, or disconnect() if they had no pending URBs.

That's much the same as PCMCIA/Cardbus: register reads will
return ~0 for a while before disconnect notification.

- Dave

2004-10-12 18:54:30

by David Brownell

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Tuesday 12 October 2004 1:53 am, Pavel Machek wrote:
>
> If you are entering S4 or S5 at the end of swsusp basically should not
> matter to anyone. What we tell the drivers is same in both cases.

The problem cases are on resume, where drivers
can see different controller state. Both S4 and S5
resume can leave it in reset; fine. But from S4
the other option is the controller being in the state
set up previously by the driver ... yet from S5 the
other option is boot firmware (BIOS etc) mucking
with it, leaving it in any of several states that are
not otherwise documented for resume() paths.

Drivers that don't reset the controller in resume()
will need special handling for those BIOS cases.
That means USB HCDs, and maybe not a lot else
yet in Linux.

- Dave

2004-10-12 18:53:35

by David Brownell

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Tuesday 12 October 2004 1:54 am, Pavel Machek wrote:
> > > echo disk > /sys/power/state
> >
> > Oddly enough, neither of them work lately for me.
> > They each resume immediately after writing the
> > image to disk.
>
> dmesg would help....

This is with /sys/power/disk set up for "shutdown";
the system didn't actually shut down, it restarted
the CPU right after snapshotting.

Stopping tasks: ===================|
Freeing
memory: ........................................................................................................|
Freezing CPUs (at 0)...ok
PM: Attempting to suspend to disk.
PM: snapshotting memory.
Restarting CPUs...ok
Restarting tasks... done
eth0: Media Link On 10mbps half-duplex

I've not had time to try that on other systems. Reverting
the change to map PCI states didn't improve things.


> > p.s. I find the /sys/power/disk file mildly cryptic, maybe
> > other folk will find the attached patch slightly more
> > informative about what this interface can do.
> >
>
> Hmm, its interface change,

To an file that was just added recently, making it more
like the other file in that same directory.

> and was not /sys expected to be "one file,
> one value"?

It is one value -- a set! OK, the active member
of that set is distinguished. The power/state file
could do the same thing (but the active state
there would always be "on").


- Dave

2004-10-12 19:50:55

by Pavel Machek

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Hi!

> > If you are entering S4 or S5 at the end of swsusp basically should not
> > matter to anyone. What we tell the drivers is same in both cases.
>
> The problem cases are on resume, where drivers
> can see different controller state. Both S4 and S5
> resume can leave it in reset; fine. But from S4
> the other option is the controller being in the state
> set up previously by the driver ... yet from S5 the
> other option is boot firmware (BIOS etc) mucking
> with it, leaving it in any of several states that are
> not otherwise documented for resume() paths.

I do not think that S4 ad S5 differ in this regard. During resume, you
go through normal boot in both cases, bootloader, linux-kernel boot.

Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-10-12 20:30:30

by Stefan Seyfried

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

David Brownell wrote:

> This is with /sys/power/disk set up for "shutdown";
> the system didn't actually shut down, it restarted
> the CPU right after snapshotting.
>
> Stopping tasks: ===================|
> Freeing
> memory: ........................................................................................................|
> Freezing CPUs (at 0)...ok
> PM: Attempting to suspend to disk.
> PM: snapshotting memory.
> Restarting CPUs...ok
> Restarting tasks... done
> eth0: Media Link On 10mbps half-duplex

you have a swap partition?
swap enabled?

> To an file that was just added recently, making it more
> like the other file in that same directory.

recently?

>>and was not /sys expected to be "one file,
>>one value"?
>
> It is one value -- a set! OK, the active member
> of that set is distinguished. The power/state file
> could do the same thing (but the active state
> there would always be "on").

But then we would need another file which provides the information that
/sys/power/state provides now, namely what power states are available.


Stefan

2004-10-12 22:14:22

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Wed, 2004-10-13 at 04:52, David Brownell wrote:

> Drivers that don't reset the controller in resume()
> will need special handling for those BIOS cases.
> That means USB HCDs, and maybe not a lot else
> yet in Linux.

Usually, at least for OHCI, you can read the controller
status and know if it got reset or is still in suspend state,
at least that how we did so far (and how apple does as well
afaik) and seems to work. I don't know about EHCI.

Ben.

2004-10-12 22:36:10

by David Brownell

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Tuesday 12 October 2004 3:13 pm, Benjamin Herrenschmidt wrote:
> On Wed, 2004-10-13 at 04:52, David Brownell wrote:
>
> > Drivers that don't reset the controller in resume()
> > will need special handling for those BIOS cases.
> > That means USB HCDs, and maybe not a lot else
> > yet in Linux.
>
> Usually, at least for OHCI, you can read the controller
> status and know if it got reset or is still in suspend state,
> at least that how we did so far (and how apple does as well
> afaik) and seems to work.

Either of those cases have been handled OK for ages;
but the MM tree has OHCI code that also knows about
some (I hope all!) of the "new" BIOS states.


> I don't know about EHCI.

It's on my list to find out ... :) One part of it should be
as simple OHCI states.

- Dave


> Ben.
>
>

2004-10-13 14:40:21

by David Brownell

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

On Tuesday 12 October 2004 1:28 pm, Stefan Seyfried wrote:
> David Brownell wrote:
>
> > This is with /sys/power/disk set up for "shutdown";
> > the system didn't actually shut down, it restarted
> > the CPU right after snapshotting.
> >
> > ...
>
> you have a swap partition?
> swap enabled?

Yes. "echo suspend > /sys/power/state" works though;
it's only STD that stopped behaving on that system.
Probably OT for this thread though.

2004-10-15 14:09:40

by Pavel Machek

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Hi!

> > Does sparse now have typechecking on enums?
>
> You can mark an enum "bitwise" (by making all of it's values be
> "bitwise"), and it will be considered a type of its own, yes. But then you
> also cannot do arithmetic on it (which _usually_ is what you want, but not
> necessarily always).
>
> (You'd also need to pass in the "-Wbitwise" flag to sparse, to get the
> checks).
>
> By the time you mark something "bitwise", you don't even need to use an
> enum, btw. You can just do a regular integer typedef and mark the typedef
> to be "bitwise" - that generates a unique type right there. That's what
> the endianness checking does.

I'm trying to learn how to work with bitwise on obsolete stuff, but
checking there is good, too, right?

Is this right way to do it?

+typedef enum pm_request __bitwise {
+ __bitwise PM_SUSPEND, /* enter D1-D3 */
+ __bitwise PM_RESUME, /* enter D0 */
+} pm_request_t;
+
+/*
+ * Device types... these are passed to pm_register
+ */
+typedef enum pm_dev_type __bitwise {
+ __bitwise PM_UNKNOWN_DEV = 0, /* generic */
+ __bitwise PM_SYS_DEV, /* system device (fan, KB
controller, ...) */
+ __bitwise PM_PCI_DEV, /* PCI device */
+ __bitwise PM_USB_DEV, /* USB device */
+ __bitwise PM_SCSI_DEV, /* SCSI device */
+ __bitwise PM_ISA_DEV, /* ISA device */
+ __bitwise PM_MTD_DEV, /* Memory Technology Device */
+} pm_dev_t;

Having __bitwise at every line in enum looks quite ugly to my
eyes. [Where to get sparse? I tried to google for it but "sparse" is
very common word (as in sparse matrix). And theres no
kernel/people/linus on kernel.org...]
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-10-15 15:58:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls



On Fri, 15 Oct 2004, Pavel Machek wrote:
>
> I'm trying to learn how to work with bitwise on obsolete stuff, but
> checking there is good, too, right?
>
> Is this right way to do it?
>
> +typedef enum pm_request __bitwise {
> + __bitwise PM_SUSPEND, /* enter D1-D3 */
> + __bitwise PM_RESUME, /* enter D0 */
> +} pm_request_t;

No, "__bitwise" is a type attribute, so you'd have to do it something like
this:

typedef int __bitwise pm_request_t;

enum pm_request {
PM_SUSPEND = (__force pm_request_t) 1,
PM_RESUME = (__force pm_request_t) 2
};

which makes PM_SUSPEND and PM_RESUME "bitwise" integers (the "__force" is
there because sparse will complain about casting to/from a bitwise type,
but in this case we really _do_ want to force the conversion). And because
the enum values are all the same type, now "enum pm_request" will be that
type too.

And with gcc, all the __bitwise/__force stuff goes away, and it all ends
up looking just like integers to gcc.

Quite frankly, you don't need the enum there. The above all really just
boils down to one special "int __bitwise" type.

So the simpler way is to just do

typedef int __bitwise pm_request_t;

#define PM_SUSPEND ((__force pm_request_t) 1)
#define PM_RESUME ((__force pm_request_t) 2)

and you now have all the infrastructure needed for strict typechecking.

One small note: the constant integer "0" is special. You can use a
constant zero as a bitwise integer type without sparse ever complaining.
This is because "bitwise" (as the name implies) was designed for making
sure that bitwise types don't get mixed up (little-endian vs big-endian
vs cpu-endian vs whatever), and there the constant "0" really _is_
special.

Also, because of the "bitwise" nature of bitwise types, you cannot add,
subtract or do a lot of things with bitwise types. But you _can_ use the
bitwise operations on them, and you can test them for equality.

So at some point we might add a separate "__opaque" type that allows no
operations at all (except for assignment and equality comparison), and
where "0" isn't special. But in the meantime, __bitwise gets you most of
the way. Just keep in mind that sparse won't warn about use of the
constant zero.

> Having __bitwise at every line in enum looks quite ugly to my
> eyes.

And in fact you cannot do it that way. "__bitwise" will always create a
_new_ type, so every time you use it you get a _different_ type. So to use
it sanely, you have to create _one_ typedef for each type you want to use,
and make that one __bitwise, and that will be the only __bitwise that
you'll ever see for that particular usage. After that, you use the
typedef, because it is now a unique type, thanks to the __bitwise.

> [Where to get sparse? I tried to google for it but "sparse" is
> very common word (as in sparse matrix). And theres no
> kernel/people/linus on kernel.org...]

With BK, you can just get it from

bk://sparse.bkbits.net/sparse

and I think DaveJ does tar-balls somewhere. If you search for "sparse
checker linux" you'll find a number of hits..

Once you have it, just do

make
make install

as your regular user, and it will install sparse in your ~/bin directory.
After that, doing a kernel make with "make C=1" will run sparse on all the
C files that get recompiled, or with "make C=2" will run sparse on the
files whether they need to be recompiled or not (ie the latter is fast way
to check the whole tree if you have already built it).

Linus

2004-10-24 20:59:10

by Pavel Machek

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls

Hi!

> > I'm trying to learn how to work with bitwise on obsolete stuff, but
> > checking there is good, too, right?
> >
> > Is this right way to do it?
> >
> > +typedef enum pm_request __bitwise {
> > + __bitwise PM_SUSPEND, /* enter D1-D3 */
> > + __bitwise PM_RESUME, /* enter D0 */
> > +} pm_request_t;
>
> No, "__bitwise" is a type attribute, so you'd have to do it something like
> this:

Ok, after -Wbitwise for sparse, strict typechecking seems to
work. Unfortunately, it produces a *lot* of noise, for code such as

static ssize_t disk_show(struct subsystem * subsys, char * buf)
{
return sprintf(buf, "%s\n", pm_disk_modes[pm_disk_mode]);
}

...where pm_disk_mode is __bitwise. That is not really what we
want. Would it be possible to get something similar to __bitwise where
arithmetic is still okay to do? With __bitwise, I'd need to do:

@@ -292,15 +297,15 @@
int i;
int len;
char *p;
- u32 mode = 0;
+ suspend_disk_method_t mode = 0;

p = memchr(buf, '\n', n);
len = p ? p - buf : n;

down(&pm_sem);
- for (i = PM_DISK_FIRMWARE; i < PM_DISK_MAX; i++) {
+ for (i = (int __force) PM_DISK_FIRMWARE; i < (int __force) PM_DISK_MAX; i++) {
if (!strncmp(buf, pm_disk_modes[i], len)) {
- mode = i;
+ mode = (suspend_disk_method_t __force) i;
break;
}
}


...thats ugly.

Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-10-24 21:19:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: Totally broken PCI PM calls



On Sun, 24 Oct 2004, Pavel Machek wrote:
>
> Ok, after -Wbitwise for sparse, strict typechecking seems to
> work. Unfortunately, it produces a *lot* of noise, for code such as
>
> static ssize_t disk_show(struct subsystem * subsys, char * buf)
> {
> return sprintf(buf, "%s\n", pm_disk_modes[pm_disk_mode]);
> }
>
> ...where pm_disk_mode is __bitwise. That is not really what we
> want. Would it be possible to get something similar to __bitwise where
> arithmetic is still okay to do?

I'll try to get to it - I've spent the last two days on sparse making sure
that I can do flow-control checking (things like "a function that gets a
spinlock needs to release it"), and it will take a while before I get out
of it.

But yes, I think it makes sense to have a "unique type" thing that allows
arithmetic. And I think it makes sense to have another unique type that
disallows _all_ operations (ie truly opaque cookies, where the only valid
op is to compare it with another cookie).

Maybe Al is interested..

Linus

---
> With __bitwise, I'd need to do:
>
> @@ -292,15 +297,15 @@
> int i;
> int len;
> char *p;
> - u32 mode = 0;
> + suspend_disk_method_t mode = 0;
>
> p = memchr(buf, '\n', n);
> len = p ? p - buf : n;
>
> down(&pm_sem);
> - for (i = PM_DISK_FIRMWARE; i < PM_DISK_MAX; i++) {
> + for (i = (int __force) PM_DISK_FIRMWARE; i < (int __force) PM_DISK_MAX; i++) {
> if (!strncmp(buf, pm_disk_modes[i], len)) {
> - mode = i;
> + mode = (suspend_disk_method_t __force) i;
> break;
> }
> }
>
>
> ...thats ugly.
>
> Pavel
> --
> People were complaining that M$ turns users into beta-testers...
> ...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!
>