s2idle works like a regular suspend with freezing processes and freezing
devices. All CPUs except the control CPU go into idle. Once this is
completed the control CPU kicks all other CPUs out of idle, so that they
reenter the idle loop and then enter s2idle state. The control CPU then
issues an swait() on the suspend state and therefore enters the idle loop
as well.
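
For reference, the relevant part of s2idle_enter() in kernel/power/suspend.c
looks roughly like this (simplified sketch, tracing and the wakeup-pending
check omitted):

    static void s2idle_enter(void)
    {
            ...
            s2idle_state = S2IDLE_STATE_ENTER;
            raw_spin_unlock_irq(&s2idle_lock);

            cpus_read_lock();

            /* Kick all other CPUs so that they reenter the idle loop */
            wake_up_all_idle_cpus();
            /* Wait for wakeup here, which puts this CPU into idle as well */
            swait_event_exclusive(s2idle_wait_head,
                                  s2idle_state == S2IDLE_STATE_WAKE);

            cpus_read_unlock();
            ...
    }
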
Due to being kicked out of idle, the other CPUs leave their NOHZ states,
which means the tick is active and the corresponding hrtimer is programmed
to the next jiffie.

On entering s2idle the CPUs shut down their local clockevent device to
prevent wakeups. The last CPU which enters s2idle shuts down its local
clockevent and freezes timekeeping.
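
Per CPU this happens in tick_freeze(); a simplified sketch of the logic in
kernel/time/tick-common.c (tracing and sched_clock handling omitted):

    void tick_freeze(void)
    {
            raw_spin_lock(&tick_freeze_lock);

            tick_freeze_depth++;
            if (tick_freeze_depth == num_online_cpus()) {
                    /* Last CPU to enter s2idle: freeze timekeeping */
                    system_state = SYSTEM_SUSPEND;
                    timekeeping_suspend();
            } else {
                    /* Shut down this CPU's local clockevent device */
                    tick_suspend_local();
            }

            raw_spin_unlock(&tick_freeze_lock);
    }
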
On resume, one of the CPUs receives the wakeup interrupt, unfreezes
timekeeping and its local clockevent and starts the resume process. At that
point all other CPUs are still in s2idle with their clockevents switched
off. They only resume when they are kicked by another CPU or after resuming
devices and then receiving a device interrupt.
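
The counterpart on wakeup is tick_unfreeze(), again roughly (simplified
sketch):

    void tick_unfreeze(void)
    {
            raw_spin_lock(&tick_freeze_lock);

            if (tick_freeze_depth == num_online_cpus()) {
                    /* First CPU to wake up resumes timekeeping */
                    timekeeping_resume();
                    system_state = SYSTEM_RUNNING;
            } else {
                    /* Every later CPU resumes its local clockevent device */
                    tick_resume_local();
            }

            tick_freeze_depth--;

            raw_spin_unlock(&tick_freeze_lock);
    }

Note that the else branch only runs once a CPU actually leaves s2idle, which
is exactly what is not guaranteed to happen here.
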
That means there is no guarantee that all CPUs will wakeup directly on
resume. As the consequence there is no guarantee that timers which are
queued on those CPUs and should expire directly after resume, are
handled. Also timer list timers which are remotely queued to one of those
CPUs after resume will not result in a reporgramming IPI as the tick is
active. A queue hrtimer will also not result in a reprogramming IPI because
the first hrtimer event is already in the past.
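
The checks responsible for this look roughly as follows (heavily simplified,
for illustration only):

    /*
     * Remote enqueue of a timer list timer (kernel/time/timer.c): the
     * target CPU is only kicked when its timer base is marked idle,
     * which it is not because its tick is considered active.
     */
    if (base->is_idle)
            wake_up_nohz_cpu(base->cpu);

    /*
     * hrtimer enqueue (kernel/time/hrtimer.c): the clockevent is only
     * reprogrammed when the new expiry is earlier than the already
     * programmed event. The stale tick event from before suspend is in
     * the past, so every newly queued hrtimer appears to be later and
     * nothing is reprogrammed.
     */
    if (expires >= cpu_base->expires_next)
            return;
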
The recent introduction of the timer pull model (7ee988770326 ("timers:
Implement the hierarchical pull model")) amplifies this problem, if the
current migrator is one of the non woken up CPUs. When a non pinned timer
list timer is queued and the queueing CPU goes idle, it relies on the still
suspended migrator CPU to expire the timer which will happen by chance.

The problem existis since commit 8d89835b0467 ("PM: suspend: Do not pause
cpuidle in the suspend-to-idle path"). There the cpuidle_pause() call which
in turn invoked a wakeup for all idle CPUs was moved to a later point in
the resume process. This might not be reached or reached very late because
it waits on a timer of a still suspended CPU.

Address this by kicking all CPUs out of idle after the control CPU returns
from swait() so that they resume their timers and restore consistent system
state.

Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218641
Fixes: 8d89835b0467 ("PM: suspend: Do not pause cpuidle in the suspend-to-idle path")
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Tested-by: Mario Limonciello <[email protected]>
Cc: [email protected]
---
kernel/power/suspend.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index e3ae93bbcb9b..09f8397bae15 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -106,6 +106,12 @@ static void s2idle_enter(void)
 	swait_event_exclusive(s2idle_wait_head,
 			      s2idle_state == S2IDLE_STATE_WAKE);
 
+	/*
+	 * Kick all CPUs to ensure that they resume their timers and restore
+	 * consistent system state.
+	 */
+	wake_up_all_idle_cpus();
+
 	cpus_read_unlock();
 
 	raw_spin_lock_irq(&s2idle_lock);
-- 
2.39.2

On Fri, Apr 05 2024 at 10:34, Anna-Maria Behnsen wrote:
> s2idle works like a regular suspend with freezing processes and freezing
> devices. All CPUs except the control CPU go into idle. Once this is
> completed the control CPU kicks all other CPUs out of idle, so that they
> reenter the idle loop and then enter s2idle state. The control CPU then
> issues an swait() on the suspend state and therefore enters the idle loop
> as well.
>
> Due to being kicked out of idle, the other CPUs leave their NOHZ states,
> which means the tick is active and the corresponding hrtimer is programmed
> to the next jiffie.
>
> On entering s2idle the CPUs shut down their local clockevent device to
> prevent wakeups. The last CPU which enters s2idle shuts down its local
> clockevent and freezes timekeeping.
>
> On resume, one of the CPUs receives the wakeup interrupt, unfreezes
> timekeeping and its local clockevent and starts the resume process. At that
> point all other CPUs are still in s2idle with their clockevents switched
> off. They only resume when they are kicked by another CPU or after resuming
> devices and then receiving a device interrupt.
>
> That means there is no guarantee that all CPUs will wakeup directly on
> resume. As the consequence there is no guarantee that timers which are
s/As the/As a/
> queued on those CPUs and should expire directly after resume, are
> handled. Also timer list timers which are remotely queued to one of those
> CPUs after resume will not result in a reporgramming IPI as the tick is
s/reporgramming/reprogamming/
> active. A queue hrtimer will also not result in a reprogramming IPI because
s/A queue/Queueing a/
> the first hrtimer event is already in the past.
>
> The recent introduction of the timer pull model (7ee988770326 ("timers:
> Implement the hierarchical pull model")) amplifies this problem, if the
> current migrator is one of the non woken up CPUs. When a non pinned timer
> list timer is queued and the queueing CPU goes idle, it relies on the still
> suspended migrator CPU to expire the timer which will happen by chance.
>
> The problem existis since commit 8d89835b0467 ("PM: suspend: Do not pause
> cpuidle in the suspend-to-idle path"). There the cpuidle_pause() call which
> in turn invoked a wakeup for all idle CPUs was moved to a later point in
> the resume process. This might not be reached or reached very late because
> it waits on a timer of a still suspended CPU.
>
> Address this by kicking all CPUs out of idle after the control CPU returns
> from swait() so that they resume their timers and restore consistent system
> state.
>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218641
> Fixes: 8d89835b0467 ("PM: suspend: Do not pause cpuidle in the suspend-to-idle path")
> Signed-off-by: Anna-Maria Behnsen <[email protected]>
> Tested-by: Mario Limonciello <[email protected]>
> Cc: [email protected]

Reviewed-by: Thomas Gleixner <[email protected]>

On Fri, Apr 05 2024 at 19:52, Thomas Gleixner wrote:
> On Fri, Apr 05 2024 at 10:34, Anna-Maria Behnsen wrote:
>> queued on those CPUs and should expire directly after resume, are
>> handled. Also timer list timers which are remotely queued to one of those
>> CPUs after resume will not result in a reporgramming IPI as the tick is
>
> s/reporgramming/reprogamming/

Haha. I can't spell either. reprogramming obviously.

s2idle works like a regular suspend with freezing processes and freezing
devices. All CPUs except the control CPU go into idle. Once this is
completed the control CPU kicks all other CPUs out of idle, so that they
reenter the idle loop and then enter s2idle state. The control CPU then
issues an swait() on the suspend state and therefore enters the idle loop
as well.

Due to being kicked out of idle, the other CPUs leave their NOHZ states,
which means the tick is active and the corresponding hrtimer is programmed
to the next jiffie.

On entering s2idle the CPUs shut down their local clockevent device to
prevent wakeups. The last CPU which enters s2idle shuts down its local
clockevent and freezes timekeeping.

On resume, one of the CPUs receives the wakeup interrupt, unfreezes
timekeeping and its local clockevent and starts the resume process. At that
point all other CPUs are still in s2idle with their clockevents switched
off. They only resume when they are kicked by another CPU or after resuming
devices and then receiving a device interrupt.

That means there is no guarantee that all CPUs will wakeup directly on
resume. As a consequence there is no guarantee that timers which are queued
on those CPUs and should expire directly after resume, are handled. Also
timer list timers which are remotely queued to one of those CPUs after
resume will not result in a reprogramming IPI as the tick is
active. Queueing a hrtimer will also not result in a reprogramming IPI
because the first hrtimer event is already in the past.

The recent introduction of the timer pull model (7ee988770326 ("timers:
Implement the hierarchical pull model")) amplifies this problem, if the
current migrator is one of the non woken up CPUs. When a non pinned timer
list timer is queued and the queuing CPU goes idle, it relies on the still
suspended migrator CPU to expire the timer which will happen by chance.

The problem exists since commit 8d89835b0467 ("PM: suspend: Do not pause
cpuidle in the suspend-to-idle path"). There the cpuidle_pause() call which
in turn invoked a wakeup for all idle CPUs was moved to a later point in
the resume process. This might not be reached or reached very late because
it waits on a timer of a still suspended CPU.

Address this by kicking all CPUs out of idle after the control CPU returns
from swait() so that they resume their timers and restore consistent system
state.

Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218641
Fixes: 8d89835b0467 ("PM: suspend: Do not pause cpuidle in the suspend-to-idle path")
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Tested-by: Mario Limonciello <[email protected]>
Cc: [email protected]
---
v2: Fix typos in commit message
---
kernel/power/suspend.c | 6 ++++++
1 file changed, 6 insertions(+)
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -106,6 +106,12 @@ static void s2idle_enter(void)
 	swait_event_exclusive(s2idle_wait_head,
 			      s2idle_state == S2IDLE_STATE_WAKE);
 
+	/*
+	 * Kick all CPUs to ensure that they resume their timers and restore
+	 * consistent system state.
+	 */
+	wake_up_all_idle_cpus();
+
 	cpus_read_unlock();
 
 	raw_spin_lock_irq(&s2idle_lock);

On Mon, Apr 08, 2024 at 09:02:23AM +0200, Anna-Maria Behnsen wrote:
> s2idle works like a regular suspend with freezing processes and freezing
> devices. All CPUs except the control CPU go into idle. Once this is
> completed the control CPU kicks all other CPUs out of idle, so that they
> reenter the idle loop and then enter s2idle state. The control CPU then
> issues an swait() on the suspend state and therefore enters the idle loop
> as well.
>
> Due to being kicked out of idle, the other CPUs leave their NOHZ states,
> which means the tick is active and the corresponding hrtimer is programmed
> to the next jiffie.
>
> On entering s2idle the CPUs shut down their local clockevent device to
> prevent wakeups. The last CPU which enters s2idle shuts down its local
> clockevent and freezes timekeeping.
>
> On resume, one of the CPUs receives the wakeup interrupt, unfreezes
> timekeeping and its local clockevent and starts the resume process. At that
> point all other CPUs are still in s2idle with their clockevents switched
> off. They only resume when they are kicked by another CPU or after resuming
> devices and then receiving a device interrupt.
>
> That means there is no guarantee that all CPUs will wakeup directly on
> resume. As a consequence there is no guarantee that timers which are queued
> on those CPUs and should expire directly after resume, are handled. Also
> timer list timers which are remotely queued to one of those CPUs after
> resume will not result in a reprogramming IPI as the tick is
> active. Queueing a hrtimer will also not result in a reprogramming IPI
> because the first hrtimer event is already in the past.
>
> The recent introduction of the timer pull model (7ee988770326 ("timers:
> Implement the hierarchical pull model")) amplifies this problem, if the
> current migrator is one of the non woken up CPUs. When a non pinned timer
> list timer is queued and the queuing CPU goes idle, it relies on the still
> suspended migrator CPU to expire the timer which will happen by chance.
>
> The problem exists since commit 8d89835b0467 ("PM: suspend: Do not pause
> cpuidle in the suspend-to-idle path"). There the cpuidle_pause() call which
> in turn invoked a wakeup for all idle CPUs was moved to a later point in
> the resume process. This might not be reached or reached very late because
> it waits on a timer of a still suspended CPU.
>
> Address this by kicking all CPUs out of idle after the control CPU returns
> from swait() so that they resume their timers and restore consistent system
> state.
>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218641
> Fixes: 8d89835b0467 ("PM: suspend: Do not pause cpuidle in the suspend-to-idle path")
> Signed-off-by: Anna-Maria Behnsen <[email protected]>
> Reviewed-by: Thomas Gleixner <[email protected]>
> Tested-by: Mario Limonciello <[email protected]>
> Cc: [email protected]

Cute,

Acked-by: Peter Zijlstra (Intel) <[email protected]>

> ---
> kernel/power/suspend.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -106,6 +106,12 @@ static void s2idle_enter(void)
> swait_event_exclusive(s2idle_wait_head,
> s2idle_state == S2IDLE_STATE_WAKE);
>
> + /*
> + * Kick all CPUs to ensure that they resume their timers and restore
> + * consistent system state.
> + */
> + wake_up_all_idle_cpus();
> +
> cpus_read_unlock();
>
> 	raw_spin_lock_irq(&s2idle_lock);

On Mon, 8 Apr 2024 at 09:02, Anna-Maria Behnsen
<[email protected]> wrote:
>
> s2idle works like a regular suspend with freezing processes and freezing
> devices. All CPUs except the control CPU go into idle. Once this is
> completed the control CPU kicks all other CPUs out of idle, so that they
> reenter the idle loop and then enter s2idle state. The control CPU then
> issues an swait() on the suspend state and therefore enters the idle loop
> as well.
>
> Due to being kicked out of idle, the other CPUs leave their NOHZ states,
> which means the tick is active and the corresponding hrtimer is programmed
> to the next jiffie.
>
> On entering s2idle the CPUs shut down their local clockevent device to
> prevent wakeups. The last CPU which enters s2idle shuts down its local
> clockevent and freezes timekeeping.
>
> On resume, one of the CPUs receives the wakeup interrupt, unfreezes
> timekeeping and its local clockevent and starts the resume process. At that
> point all other CPUs are still in s2idle with their clockevents switched
> off. They only resume when they are kicked by another CPU or after resuming
> devices and then receiving a device interrupt.
>
> That means there is no guarantee that all CPUs will wakeup directly on
> resume. As a consequence there is no guarantee that timers which are queued
> on those CPUs and should expire directly after resume, are handled. Also
> timer list timers which are remotely queued to one of those CPUs after
> resume will not result in a reprogramming IPI as the tick is
> active. Queueing a hrtimer will also not result in a reprogramming IPI
> because the first hrtimer event is already in the past.
>
> The recent introduction of the timer pull model (7ee988770326 ("timers:
> Implement the hierarchical pull model")) amplifies this problem, if the
> current migrator is one of the non woken up CPUs. When a non pinned timer
> list timer is queued and the queuing CPU goes idle, it relies on the still
> suspended migrator CPU to expire the timer which will happen by chance.
>
> The problem exists since commit 8d89835b0467 ("PM: suspend: Do not pause
> cpuidle in the suspend-to-idle path"). There the cpuidle_pause() call which
> in turn invoked a wakeup for all idle CPUs was moved to a later point in
> the resume process. This might not be reached or reached very late because
> it waits on a timer of a still suspended CPU.
>
> Address this by kicking all CPUs out of idle after the control CPU returns
> from swait() so that they resume their timers and restore consistent system
> state.
>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218641
> Fixes: 8d89835b0467 ("PM: suspend: Do not pause cpuidle in the suspend-to-idle path")
> Signed-off-by: Anna-Maria Behnsen <[email protected]>
> Reviewed-by: Thomas Gleixner <[email protected]>
> Tested-by: Mario Limonciello <[email protected]>
> Cc: [email protected]

Thanks for the detailed commit message! Please add:

Reviewed-by: Ulf Hansson <[email protected]>

Kind regards
Uffe

> ---
> v2: Fix typos in commit message
> ---
> kernel/power/suspend.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -106,6 +106,12 @@ static void s2idle_enter(void)
> swait_event_exclusive(s2idle_wait_head,
> s2idle_state == S2IDLE_STATE_WAKE);
>
> + /*
> + * Kick all CPUs to ensure that they resume their timers and restore
> + * consistent system state.
> + */
> + wake_up_all_idle_cpus();
> +
> cpus_read_unlock();
>
> 	raw_spin_lock_irq(&s2idle_lock);

On Mon, Apr 8, 2024 at 2:43 PM Ulf Hansson <[email protected]> wrote:
>
> On Mon, 8 Apr 2024 at 09:02, Anna-Maria Behnsen
> <[email protected]> wrote:
> >
> > s2idle works like a regular suspend with freezing processes and freezing
> > devices. All CPUs except the control CPU go into idle. Once this is
> > completed the control CPU kicks all other CPUs out of idle, so that they
> > reenter the idle loop and then enter s2idle state. The control CPU then
> > issues an swait() on the suspend state and therefore enters the idle loop
> > as well.
> >
> > Due to being kicked out of idle, the other CPUs leave their NOHZ states,
> > which means the tick is active and the corresponding hrtimer is programmed
> > to the next jiffie.
> >
> > On entering s2idle the CPUs shut down their local clockevent device to
> > prevent wakeups. The last CPU which enters s2idle shuts down its local
> > clockevent and freezes timekeeping.
> >
> > On resume, one of the CPUs receives the wakeup interrupt, unfreezes
> > timekeeping and its local clockevent and starts the resume process. At that
> > point all other CPUs are still in s2idle with their clockevents switched
> > off. They only resume when they are kicked by another CPU or after resuming
> > devices and then receiving a device interrupt.
> >
> > That means there is no guarantee that all CPUs will wakeup directly on
> > resume. As a consequence there is no guarantee that timers which are queued
> > on those CPUs and should expire directly after resume, are handled. Also
> > timer list timers which are remotely queued to one of those CPUs after
> > resume will not result in a reprogramming IPI as the tick is
> > active. Queueing a hrtimer will also not result in a reprogramming IPI
> > because the first hrtimer event is already in the past.
> >
> > The recent introduction of the timer pull model (7ee988770326 ("timers:
> > Implement the hierarchical pull model")) amplifies this problem, if the
> > current migrator is one of the non woken up CPUs. When a non pinned timer
> > list timer is queued and the queuing CPU goes idle, it relies on the still
> > suspended migrator CPU to expire the timer which will happen by chance.
> >
> > The problem exists since commit 8d89835b0467 ("PM: suspend: Do not pause
> > cpuidle in the suspend-to-idle path"). There the cpuidle_pause() call which
> > in turn invoked a wakeup for all idle CPUs was moved to a later point in
> > the resume process. This might not be reached or reached very late because
> > it waits on a timer of a still suspended CPU.
> >
> > Address this by kicking all CPUs out of idle after the control CPU returns
> > from swait() so that they resume their timers and restore consistent system
> > state.
> >
> > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218641
> > Fixes: 8d89835b0467 ("PM: suspend: Do not pause cpuidle in the suspend-to-idle path")
> > Signed-off-by: Anna-Maria Behnsen <[email protected]>
> > Reviewed-by: Thomas Gleixner <[email protected]>
> > Tested-by: Mario Limonciello <[email protected]>
> > Cc: [email protected]
>
> Thanks for the detailed commit message! Please add:
>
> Reviewed-by: Ulf Hansson <[email protected]>

Applied as 6.9-rc material, many thanks to everyone involved!