by Rafael J. Wysocki

Subject: Re: [PATCH 7/7] rtc: cmos: Add suspend/resume endurance testing hook

On Thu, May 19, 2022 at 4:33 AM Len Brown <[email protected]> wrote:
>
> First let's agree on why this should not be ignored.
>
> Our development team at Intel has a lab with laptops, we run sleepgraph
> on every RC, and we publish the tool in public:
> https://www.intel.com/content/www/us/en/developer/topic-technology/open/pm-graph/overview.html
>
> But even if we were funded to do it (which we are not), we can't
> possibly test every kind of device.
> We need the community to help testing Linux (suspend/resume,
> specifically) on a broad range of devices, so together we can make it
> better for all.
>
> The community is made up mostly of users, rather than kernel hackers,
> and so this effectively means that distro binary kernels need to be
> able to support testing.
>
> Enabling that broad community of users/contributors is the goal.
>
> As Rui explained, this patch does nothing and breaks nothing if the
> new hook remains unused.
> If it is used, then it overrides the wakeup duration for all subsequent
> system suspends, until it is cleared.
> If it does more than that, or does that in a clumsy way, then let's fix that.
>
> Today it gives us two new capabilities:
>
> 1. Prevents a lost wake event. Commonly we see this with kcompactd
> taking 20 seconds when we had previously armed the RTC for 15 seconds.
> The system will sleep forever, until the user intervenes -- which may
> be a very long time later.
>
> Rafael, if you have a better way to fix that, I'm all ears. Aborted
> suspend flows are ugly -- particularly when the user didn't want them,
> but they are much less ugly than losing a wake event, which can result
> in losing, say, 10 hours of test time.

The real problem here is the missed wakeup events. I've already said
in this thread how that can be fixed, and Rui appears to agree with
me.

So I'd say why don't we just go and fix it?

And it is orthogonal to the first 3 patches in this series, because
they move the PCH thermal delay to the noirq phase, which is later
than the arming of the RTC Fixed Event, IIUC.

> 2. Allows more suspends/resume cycles per time. Say the early wake is
> fixed. Then we have to decide how long to sleep before being
> suspended. If we set it for 1 second, and suspend takes longer than 1
> second, then all of our tests will fail with early wakeups and we have
> tested nothing.

We have tested "early" wakeups which is what the users would see on
the system in question if they set the RTC wake time to 1 second
before suspending.

This may not be what we want to test, but that is a different
problem, as discussed below.

> If we set it to 60 seconds, and suspend takes 1
> second, then 59/60 seconds are spent sleeping, when they could be
> spent testing Linux. With this patch, we can set it to the minimum of
> 2 seconds right before we sleep, guaranteeing that we spend at least 1
> second, and under 2 seconds sleeping, and the rest of the time testing
> -- which allows us to meet the goal.

So the goal specifically is to test the last phase of system suspend
and in particular whether or not the platform has reached the specific
minimum-power state at the end of it.

In order to do that, we basically want to ignore all of the wakeup
events except for the special RTC wakeup 1 second after the platform
has been asked to get into the minimum-power state, so what we are
talking about here really is a special suspend test mode using the RTC
as a wakeup vehicle.
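
For reference, the timing argument quoted above can be put into numbers
with a rough Python sketch. It is illustrative arithmetic only: the
function names are made up, resume time is ignored, and the late-armed
case uses the 2 second upper bound from the quoted text, so it is
slightly pessimistic.

```python
# Illustrative arithmetic only; names and the timing model are assumptions.

def fixed_alarm_cycle(alarm_s, suspend_s):
    """RTC armed alarm_s seconds ahead, before the suspend flow starts.

    Returns (cycle_s, asleep_s), or None when suspend takes at least as
    long as the alarm, i.e. the alarm fires before the platform is asleep.
    """
    if suspend_s >= alarm_s:
        return None
    return (alarm_s, alarm_s - suspend_s)


def late_arm_cycle(suspend_s, sleep_s):
    """RTC armed for sleep_s seconds at the very end of the suspend flow."""
    return (suspend_s + sleep_s, sleep_s)


def cycles_per_hour(cycle_s):
    """Whole suspend/resume cycles that fit into one hour."""
    return 3600 // cycle_s


if __name__ == "__main__":
    for label, cycle in (
        ("fixed 60 s alarm, 1 s suspend", fixed_alarm_cycle(60, 1)),
        ("fixed 1 s alarm, 1 s suspend", fixed_alarm_cycle(1, 1)),
        ("armed late for 2 s, 1 s suspend", late_arm_cycle(1, 2)),
    ):
        if cycle is None:
            print(f"{label}: alarm fires before the platform is asleep")
        else:
            total, asleep = cycle
            print(f"{label}: {cycles_per_hour(total)} cycles/hour, "
                  f"{asleep} s of each {total} s cycle asleep")
```

Under these assumptions the fixed 60 second alarm gives 60 cycles per
hour with 59 of every 60 seconds spent asleep, whereas arming the RTC
late gives on the order of 1200 cycles per hour.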