2006-08-15 22:10:48

by Dave Jones

Subject: peculiar suspend/resume bug.

Here's a fun one.
- Get a dual core cpufreq aware laptop (Like say, a core-duo)
- Add a cpufreq monitor to gnome-panel. Configure it
to watch the 2nd core.
- Suspend.
- Resume.

Watch the cpufreq monitor die horribly.

I believe this is because we take down the 2nd core at suspend
time with cpu hotplug, and for some reason we're scheduling
userspace before we bring that second core back up.

Anyone have any clues why this is happening?

Dave

--
http://www.codemonkey.org.uk


2006-08-16 00:19:35

by Nigel Cunningham

Subject: Re: peculiar suspend/resume bug.

Hi Dave.

On Tue, 2006-08-15 at 18:10 -0400, Dave Jones wrote:
> Here's a fun one.
> - Get a dual core cpufreq aware laptop (Like say, a core-duo)
> - Add a cpufreq monitor to gnome-panel. Configure it
> to watch the 2nd core.
> - Suspend.
> - Resume.
>
> Watch the cpufreq monitor die horribly.
>
> I believe this is because we take down the 2nd core at suspend
> time with cpu hotplug, and for some reason we're scheduling
> userspace before we bring that second core back up.
>
> Anyone have any clues why this is happening?

If you hotunplug and replug the cpu using the sysfs interface, rather
than suspending and resuming, does the same thing happen?

Regards,

Nigel
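
For anyone wanting to repeat the test Nigel suggests, here is a minimal sketch of offlining and re-onlining a CPU through sysfs. It assumes the usual /sys/devices/system/cpu/cpuN/online control file. The helper names (set_cpu_online, is_cpu_online, replug) are made up for illustration, and the root parameter exists only so the logic can be tried against a scratch directory.

```python
from pathlib import Path

# Standard location of the per-CPU sysfs directories on Linux.
SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def online_file(cpu: int, root: Path = SYSFS_CPU_ROOT) -> Path:
    """Return the control file for one CPU, e.g. .../cpu1/online."""
    return root / f"cpu{cpu}" / "online"


def set_cpu_online(cpu: int, online: bool, root: Path = SYSFS_CPU_ROOT) -> None:
    """Write 1 (online) or 0 (offline) to the CPU's control file."""
    online_file(cpu, root).write_text("1" if online else "0")


def is_cpu_online(cpu: int, root: Path = SYSFS_CPU_ROOT) -> bool:
    """Report whether the control file currently reads 1."""
    return online_file(cpu, root).read_text().strip() == "1"


def replug(cpu: int, root: Path = SYSFS_CPU_ROOT) -> None:
    """Hypothetical helper: take the CPU down, then bring it back up."""
    set_cpu_online(cpu, False, root)
    set_cpu_online(cpu, True, root)
```

Against the real sysfs tree this needs root privileges, e.g. replug(1) for the second core. On some systems cpu0 has no online file, so the sketch is only meant for the non-boot CPUs.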

2006-08-16 00:37:34

by Dave Jones

Subject: Re: peculiar suspend/resume bug.

On Wed, Aug 16, 2006 at 10:19:59AM +1000, Nigel Cunningham wrote:
> Hi Dave.
>
> On Tue, 2006-08-15 at 18:10 -0400, Dave Jones wrote:
> > Here's a fun one.
> > - Get a dual core cpufreq aware laptop (Like say, a core-duo)
> > - Add a cpufreq monitor to gnome-panel. Configure it
> > to watch the 2nd core.
> > - Suspend.
> > - Resume.
> >
> > Watch the cpufreq monitor die horribly.
> >
> > I believe this is because we take down the 2nd core at suspend
> > time with cpu hotplug, and for some reason we're scheduling
> > userspace before we bring that second core back up.
> >
> > Anyone have any clues why this is happening?
>
> If you hotunplug and replug the cpu using the sysfs interface, rather
> than suspending and resuming, does the same thing happen?

cpufreq-applet crashes as soon as the cpu goes offline.
Now, the applet should be written to deal with this scenario more
gracefully, but I'm questioning whether or not userspace should
*see* the unplug/replug that suspend does at all.

IMO, we shouldn't schedule userspace until the system is
in the exact state it was in before we suspended.

Dave

--
http://www.codemonkey.org.uk
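
As a rough illustration of the "deal with this scenario more gracefully" point, here is a sketch of how a monitor could poll a core's frequency without dying when the core goes away. This is not the cpufreq-applet's actual code. It assumes the frequency is read from cpuN/cpufreq/scaling_cur_freq (in kHz) and that the file disappears or becomes unreadable while the CPU is offline, which may vary between kernel versions. The function names are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Standard location of the per-CPU sysfs directories on Linux.
SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def read_cur_freq_khz(cpu: int, root: Path = SYSFS_CPU_ROOT) -> Optional[int]:
    """Return the current frequency in kHz, or None if it cannot be read."""
    freq_file = root / f"cpu{cpu}" / "cpufreq" / "scaling_cur_freq"
    try:
        return int(freq_file.read_text().strip())
    except (OSError, ValueError):
        # Assumption: a missing or unreadable file means the CPU is offline
        # (or has no cpufreq driver), so report "unknown" instead of failing.
        return None


def describe(cpu: int, root: Path = SYSFS_CPU_ROOT) -> str:
    """Build the label a monitor might display for this CPU."""
    khz = read_cur_freq_khz(cpu, root)
    if khz is None:
        return f"cpu{cpu}: offline or no cpufreq data"
    return f"cpu{cpu}: {khz // 1000} MHz"
```

A monitor built this way would keep polling and simply show the fallback label until the core comes back, rather than treating the missing file as fatal.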

2006-08-16 01:05:40

by Nigel Cunningham

Subject: Re: peculiar suspend/resume bug.

Hi Dave.

On Tue, 2006-08-15 at 20:37 -0400, Dave Jones wrote:
> On Wed, Aug 16, 2006 at 10:19:59AM +1000, Nigel Cunningham wrote:
> > Hi Dave.
> >
> > On Tue, 2006-08-15 at 18:10 -0400, Dave Jones wrote:
> > > Here's a fun one.
> > > - Get a dual core cpufreq aware laptop (Like say, a core-duo)
> > > - Add a cpufreq monitor to gnome-panel. Configure it
> > > to watch the 2nd core.
> > > - Suspend.
> > > - Resume.
> > >
> > > Watch the cpufreq monitor die horribly.
> > >
> > > I believe this is because we take down the 2nd core at suspend
> > > time with cpu hotplug, and for some reason we're scheduling
> > > userspace before we bring that second core back up.
> > >
> > > Anyone have any clues why this is happening?
> >
> > If you hotunplug and replug the cpu using the sysfs interface, rather
> > than suspending and resuming, does the same thing happen?
>
> cpufreq-applet crashes as soon as the cpu goes offline.
> Now, the applet should be written to deal with this scenario more
> gracefully, but I'm questioning whether or not userspace should
> *see* the unplug/replug that suspend does at all.
>
> IMO, we shouldn't schedule userspace until the system is
> in the exact state it was in before we suspended.

At the moment, the cpu hotplugging/unplugging is done outside of
freezing processes because once we've frozen processes we can't (afaik)
move ones that are tied to the cpu being unplugged to another processor,
and we also won't be able to kill kernel threads that are tied to the
processor(s) being taken down.

Personally, I wouldn't mind seeing this addressed, as I see a few
other benefits to being able to hot[un]plug later, besides simplifying
life for the cpufreq-applet (although it shouldn't crash if a cpu is
offlined anyway).

Regards,

Nigel
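
The ordering argument can be made concrete with a toy model. This is only a sketch of the sequence as described in this thread (unplug before freezing, replug after thawing), not kernel code. The step names and both orderings are illustrative labels rather than real kernel functions.

```python
def visible_to_userspace(steps):
    """Return the CPU state changes that occur while userspace can run."""
    userspace_running = True
    seen = []
    for step in steps:
        if step == "freeze_userspace":
            userspace_running = False
        elif step == "thaw_userspace":
            userspace_running = True
        elif step in ("cpu1_offline", "cpu1_online") and userspace_running:
            # Userspace is schedulable, so it can observe this transition.
            seen.append(step)
    return seen


# Ordering as described in the thread: hotplug sits outside the freezer.
CURRENT_ORDER = [
    "cpu1_offline", "freeze_userspace", "suspend",
    "resume", "thaw_userspace", "cpu1_online",
]

# Ordering Dave is asking for: hotplug hidden inside the frozen window.
PROPOSED_ORDER = [
    "freeze_userspace", "cpu1_offline", "suspend",
    "resume", "cpu1_online", "thaw_userspace",
]
```

With the first ordering both transitions land in the window where userspace is runnable. With the second, none do, which is the "exact state it was before" behaviour being requested. The model ignores the difficulties raised above (migrating frozen tasks, killing per-cpu kernel threads).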

2006-08-16 02:41:53

by Matthew Garrett

Subject: Re: peculiar suspend/resume bug.

On Tue, Aug 15, 2006 at 08:37:28PM -0400, Dave Jones wrote:

> cpufreq-applet crashes as soon as the cpu goes offline.
> Now, the applet should be written to deal with this scenario more
> gracefully, but I'm questioning whether or not userspace should
> *see* the unplug/replug that suspend does at all.

As Nigel mentioned, cpu unplug happens just before processes are frozen,
so I guess there's a chance for it to be scheduled. On the other hand,
it's not unreasonable for CPUs to be unplugged during runtime anyway -
perhaps userspace should be able to deal with that?

--
Matthew Garrett | [email protected]

2006-08-16 03:54:00

by Dave Jones

Subject: Re: peculiar suspend/resume bug.

On Wed, Aug 16, 2006 at 03:41:40AM +0100, Matthew Garrett wrote:
> On Tue, Aug 15, 2006 at 08:37:28PM -0400, Dave Jones wrote:
>
> > cpufreq-applet crashes as soon as the cpu goes offline.
> > Now, the applet should be written to deal with this scenario more
> > gracefully, but I'm questioning whether or not userspace should
> > *see* the unplug/replug that suspend does at all.
>
> As Nigel mentioned, cpu unplug happens just before processes are frozen,
> so I guess there's a chance for it to be scheduled. On the other hand,
> it's not unreasonable for CPUs to be unplugged during runtime anyway -
> perhaps userspace should be able to deal with that?

Sure, I'm not debating that point. It's a bug in the applet that needs fixing,
but it also seems that we could be saving a whole lot of pain by
hiding this from userspace at suspend/resume time.

Dave

--
http://www.codemonkey.org.uk

2006-08-16 08:51:00

by Rafael J. Wysocki

Subject: Re: peculiar suspend/resume bug.

Hi,

On Wednesday 16 August 2006 05:53, Dave Jones wrote:
> On Wed, Aug 16, 2006 at 03:41:40AM +0100, Matthew Garrett wrote:
> > On Tue, Aug 15, 2006 at 08:37:28PM -0400, Dave Jones wrote:
> >
> > > cpufreq-applet crashes as soon as the cpu goes offline.
> > > Now, the applet should be written to deal with this scenario more
> > > gracefully, but I'm questioning whether or not userspace should
> > > *see* the unplug/replug that suspend does at all.
> >
> > As Nigel mentioned, cpu unplug happens just before processes are frozen,
> > so I guess there's a chance for it to be scheduled. On the other hand,
> > it's not unreasonable for CPUs to be unplugged during runtime anyway -
> > perhaps userspace should be able to deal with that?
>
> Sure, I'm not debating that point. It's a bug in the applet that needs fixing,
> but it also seems that we could be saving a whole lot of pain by
> hiding this from userspace at suspend/resume time.

Yes, that's the plan, but for now the freezer is not SMP-friendly, so to
speak, and we have some work to do to make it possible.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-08-16 22:06:37

by Pavel Machek

Subject: Re: peculiar suspend/resume bug.

Hi!

> Here's a fun one.
> - Get a dual core cpufreq aware laptop (Like say, a core-duo)
> - Add a cpufreq monitor to gnome-panel. Configure it
> to watch the 2nd core.
> - Suspend.
> - Resume.
>
> Watch the cpufreq monitor die horribly.
>
> I believe this is because we take down the 2nd core at suspend
> time with cpu hotplug, and for some reason we're scheduling
> userspace before we bring that second core back up.
>
> Anyone have any clues why this is happening?

It's by design; we do unplug first. Okay, maybe it is more of a design
bug :-).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-08-17 01:44:08

by Nigel Cunningham

Subject: Re: peculiar suspend/resume bug.

Hi.

On Wed, 2006-08-16 at 03:41 +0100, Matthew Garrett wrote:
> On Tue, Aug 15, 2006 at 08:37:28PM -0400, Dave Jones wrote:
>
> > cpufreq-applet crashes as soon as the cpu goes offline.
> > Now, the applet should be written to deal with this scenario more
> > gracefully, but I'm questioning whether or not userspace should
> > *see* the unplug/replug that suspend does at all.
>
> As Nigel mentioned, cpu unplug happens just before processes are frozen,
> so I guess there's a chance for it to be scheduled. On the other hand,
> it's not unreasonable for CPUs to be unplugged during runtime anyway -
> perhaps userspace should be able to deal with that?

Agreed.

I've spent a little more time thinking about this, and want to put a few
thoughts forward for discussion/ignoring/flame bait/whatever.

I see two main issues at the moment with freezing before hotplugging.
The first is that we have cpu specific kernel threads that we're going
to want to kill, and the second is that we have userspace threads that
we want to migrate to another cpu. Have I missed anything?

The first issue could be helped by splitting the freezing of userspace
processes from kernel space. The kernel threads could thus die without
us having to worry about userspace seeing what's going on. I haven't
looked at vanilla in a while; this might already be in. Alternatively,
if it's viable, per-cpu kernel threads could perhaps be made NO_FREEZE.

The second issue is migrating userspace threads. I'm no scheduling
expert, so I'll just speculate :>. I wondered if it's possible to make
the migration happen lazily; in such a way that if, when we come to thaw
userspace, the cpu has been hotplugged again, the migration never
happens. Does that sound possible?

Regards,

Nigel

2006-08-17 05:40:49

by Rafael J. Wysocki

Subject: Re: peculiar suspend/resume bug.

On Thursday 17 August 2006 03:44, Nigel Cunningham wrote:
> Hi.
>
> On Wed, 2006-08-16 at 03:41 +0100, Matthew Garrett wrote:
> > On Tue, Aug 15, 2006 at 08:37:28PM -0400, Dave Jones wrote:
> >
> > > cpufreq-applet crashes as soon as the cpu goes offline.
> > > Now, the applet should be written to deal with this scenario more
> > > gracefully, but I'm questioning whether or not userspace should
> > > *see* the unplug/replug that suspend does at all.
> >
> > As Nigel mentioned, cpu unplug happens just before processes are frozen,
> > so I guess there's a chance for it to be scheduled. On the other hand,
> > it's not unreasonable for CPUs to be unplugged during runtime anyway -
> > perhaps userspace should be able to deal with that?
>
> Agreed.
>
> I've spent a little more time thinking about this, and want to put a few
> thoughts forward for discussion/ignoring/flame bait/whatever.
>
> I see two main issues at the moment with freezing before hotplugging.
> The first is that we have cpu specific kernel threads that we're going
> to want to kill, and the second is that we have userspace threads that
> we want to migrate to another cpu. Have I missed anything?

I have bad memories from the time we were not using the CPU-hotplug and
tried to freeze tasks with all CPUs on-line. There were some very subtle
race conditions appearing between the freezer and the running tasks
which were a nightmare to figure out. I'm not sure that they will appear
now, but something tells me so. :-)

> The first issue could be helped by splitting the freezing of userspace
> processes from kernel space. The kernel threads could thus die without
> us having to worry about userspace seeing what's going on. I haven't
> looked at vanilla in a while; this might already be in.

Yes, it is.

> Alternatively, if it's viable, per-cpu kernel threads could perhaps be made
> NO_FREEZE.
>
> The second issue is migrating userspace threads. I'm no scheduling
> expert, so I'll just speculate :>. I wondered if it's possible to make
> the migration happen lazily; in such a way that if, when we come to thaw
> userspace, the cpu has been hotplugged again, the migration never
> happens. Does that sound possible?

The CPU hotplug makes the tasks migrate automatically, but that's not
a problem, as I see it. The problem is some tasks may have specific CPU
affinities set and these should not change across suspend/resume.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-08-17 05:54:41

by Nigel Cunningham

Subject: Re: peculiar suspend/resume bug.

Hi.

Thanks for the reply.

On Thu, 2006-08-17 at 07:44 +0200, Rafael J. Wysocki wrote:
> On Thursday 17 August 2006 03:44, Nigel Cunningham wrote:
> > Hi.
> >
> > On Wed, 2006-08-16 at 03:41 +0100, Matthew Garrett wrote:
> > > On Tue, Aug 15, 2006 at 08:37:28PM -0400, Dave Jones wrote:
> > >
> > > > cpufreq-applet crashes as soon as the cpu goes offline.
> > > > Now, the applet should be written to deal with this scenario more
> > > > gracefully, but I'm questioning whether or not userspace should
> > > > *see* the unplug/replug that suspend does at all.
> > >
> > > As Nigel mentioned, cpu unplug happens just before processes are frozen,
> > > so I guess there's a chance for it to be scheduled. On the other hand,
> > > it's not unreasonable for CPUs to be unplugged during runtime anyway -
> > > perhaps userspace should be able to deal with that?
> >
> > Agreed.
> >
> > I've spent a little more time thinking about this, and want to put a few
> > thoughts forward for discussion/ignoring/flame bait/whatever.
> >
> > I see two main issues at the moment with freezing before hotplugging.
> > The first is that we have cpu specific kernel threads that we're going
> > to want to kill, and the second is that we have userspace threads that
> > we want to migrate to another cpu. Have I missed anything?
>
> I have bad memories from the time we were not using the CPU-hotplug and
> tried to freeze tasks with all CPUs on-line. There were some very subtle
> race conditions appearing between the freezer and the running tasks
> which were a nightmare to figure out. I'm not sure that they will appear
> now, but something tells me so. :-)

I think you'll find that the separate freezing of kernel space will
help. We had SMP support in Suspend2 long before cpu hotplugging was
added, and it was stable and reliable. I'm reasonably certain that the
switch to splitting freezing was pre-cpu hotplugging.

> > The first issue could be helped by splitting the freezing of userspace
> > processes from kernel space. The kernel threads could thus die without
> > us having to worry about userspace seeing what's going on. I haven't
> > looked at vanilla in a while; this might already be in.
>
> Yes, it is.

Great. Sorry for my slowness. I just keep too many things on the go at
once.

> > Alternatively, if it's viable, per-cpu kernel threads could perhaps be made
> > NO_FREEZE.
> >
> > The second issue is migrating userspace threads. I'm no scheduling
> > expert, so I'll just speculate :>. I wondered if it's possible to make
> > the migration happen lazily; in such a way that if, when we come to thaw
> > userspace, the cpu has been hotplugged again, the migration never
> > happens. Does that sound possible?
>
> The CPU hotplug makes the tasks migrate automatically, but that's not
> a problem, as I see it. The problem is some tasks may have specific CPU
> affinities set and these should not change across suspend/resume.

Mmm. My concern was that cpu hotplug might somehow deadlock if the
process it was trying to migrate was frozen. You don't think that's a
possibility?

With affinities, would saving and restoring be a possibility?

Regards,

Nigel
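
To make the save/restore idea concrete, here is a userspace analogy using Python's os.sched_getaffinity and os.sched_setaffinity (Linux only). The kernel would have to do the equivalent on its own per-task CPU masks, so this is only a sketch of the shape of the idea, not a proposed implementation. save_affinities and restore_affinities are hypothetical names.

```python
import os


def save_affinities(pids):
    """Record the allowed-CPU set of each task that still exists."""
    saved = {}
    for pid in pids:
        try:
            saved[pid] = os.sched_getaffinity(pid)
        except OSError:
            # The task exited (or is inaccessible); nothing to save.
            continue
    return saved


def restore_affinities(saved):
    """Reapply the recorded masks; return the pids that could not be restored."""
    failed = []
    for pid, cpus in saved.items():
        try:
            os.sched_setaffinity(pid, cpus)
        except OSError:
            # E.g. the task is gone, we lack permission, or none of the
            # saved CPUs are usable at the moment.
            failed.append(pid)
    return failed
```

Usage would be saved = save_affinities(pids) before the CPUs go down and restore_affinities(saved) once they are all back. The failure list hints at why this needs care: a mask can only be restored after the CPUs it names are online again.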

2006-08-17 06:27:11

by Rafael J. Wysocki

Subject: Re: peculiar suspend/resume bug.

Hi,

On Thursday 17 August 2006 07:55, Nigel Cunningham wrote:
> Hi.
>
> Thanks for the reply.
>
> On Thu, 2006-08-17 at 07:44 +0200, Rafael J. Wysocki wrote:
> > On Thursday 17 August 2006 03:44, Nigel Cunningham wrote:
> > > Hi.
> > >
> > > On Wed, 2006-08-16 at 03:41 +0100, Matthew Garrett wrote:
> > > > On Tue, Aug 15, 2006 at 08:37:28PM -0400, Dave Jones wrote:
> > > >
> > > > > cpufreq-applet crashes as soon as the cpu goes offline.
> > > > > Now, the applet should be written to deal with this scenario more
> > > > > gracefully, but I'm questioning whether or not userspace should
> > > > > *see* the unplug/replug that suspend does at all.
> > > >
> > > > As Nigel mentioned, cpu unplug happens just before processes are frozen,
> > > > so I guess there's a chance for it to be scheduled. On the other hand,
> > > > it's not unreasonable for CPUs to be unplugged during runtime anyway -
> > > > perhaps userspace should be able to deal with that?
> > >
> > > Agreed.
> > >
> > > I've spent a little more time thinking about this, and want to put a few
> > > thoughts forward for discussion/ignoring/flame bait/whatever.
> > >
> > > I see two main issues at the moment with freezing before hotplugging.
> > > The first is that we have cpu specific kernel threads that we're going
> > > to want to kill, and the second is that we have userspace threads that
> > > we want to migrate to another cpu. Have I missed anything?
> >
> > I have bad memories from the time we were not using the CPU-hotplug and
> > tried to freeze tasks with all CPUs on-line. There were some very subtle
> > race conditions appearing between the freezer and the running tasks
> > which were a nightmare to figure out. I'm not sure that they will appear
> > now, but something tells me so. :-)
>
> I think you'll find that the separate freezing of kernel space will
> help.

That certainly is possible, but will need some testing.

> We had SMP support in Suspend2 long before cpu hotplugging was
> added, and it was stable and reliable. I'm reasonably certain that the
> switch to splitting freezing was pre-cpu hotplugging.
>
> > > The first issue could be helped by splitting the freezing of userspace
> > > processes from kernel space. The kernel threads could thus die without
> > > us having to worry about userspace seeing what's going on. I haven't
> > > looked at vanilla in a while; this might already be in.
> >
> > Yes, it is.
>
> Great. Sorry for my slowness. I just keep too many things on the go at
> once.
>
> > > Alternatively, if it's viable, per-cpu kernel threads could perhaps be made
> > > NO_FREEZE.
> > >
> > > The second issue is migrating userspace threads. I'm no scheduling
> > > expert, so I'll just speculate :>. I wondered if it's possible to make
> > > the migration happen lazily; in such a way that if, when we come to thaw
> > > userspace, the cpu has been hotplugged again, the migration never
> > > happens. Does that sound possible?
> >
> > The CPU hotplug makes the tasks migrate automatically, but that's not
> > a problem, as I see it. The problem is some tasks may have specific CPU
> > affinities set and these should not change across suspend/resume.
>
> Mmm. My concern was that cpu hotplug might somehow deadlock if the
> process it was trying to migrate was frozen. You don't think that's a
> possibility?

No, I don't. Of course it'll have to be tested anyway. :-)

> With affinities, would saving and restoring be a possibility?

I haven't thought about it yet. Perhaps, but it will need to be done with
care.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller