Hi,
The goal is to make Linux kernel PM / PM domains / cpuidle friendlier for
Realtime systsems (PREEMPT_RT). Realtime changes regular spinlocks into
sleeping primitives, thus other parts of the code must be ready for it.
Changes since v1
================
1. Patch #1: Add missing WARN for parent domain
2. New patches 3-5 for other issues encountered with PREEMPT_RT.
Best regards,
Krzysztof
---
Cc: Adrien Thierry <[email protected]>
Cc: Brian Masney <[email protected]>
Cc: [email protected]
Krzysztof Kozlowski (5):
PM: domains: Add GENPD_FLAG_RT_SAFE for PREEMPT_RT
cpuidle: psci: Mark as PREEMPT_RT safe
cpuidle: psci: Do not suspend topology CPUs on PREEMPT_RT
PM: Allow calling dev_pm_domain_set() with raw spinlock
PM: domains: Do not call device_pm_check_callbacks() when holding
genpd_lock()
drivers/base/power/common.c | 27 ++++++++++-
drivers/base/power/domain.c | 65 +++++++++++++++++++++++++--
drivers/cpuidle/cpuidle-psci-domain.c | 3 +-
drivers/cpuidle/cpuidle-psci.c | 4 +-
include/linux/pm_domain.h | 16 +++++++
5 files changed, 107 insertions(+), 8 deletions(-)
--
2.34.1
The PSCI cpuidle power domain in power_off callback uses
__this_cpu_write() so it is PREEMPT_RT safe. This allows to use it in
Realtime kernels and solves errors like:
BUG: scheduling while atomic: swapper/2/0/0x00000002
Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
Call trace:
dump_backtrace.part.0+0xe0/0xf0
show_stack+0x18/0x40
dump_stack_lvl+0x68/0x84
dump_stack+0x18/0x34
__schedule_bug+0x60/0x80
__schedule+0x628/0x800
schedule_rtlock+0x28/0x5c
rtlock_slowlock_locked+0x360/0xd30
rt_spin_lock+0x88/0xb0
genpd_lock_nested_spin+0x1c/0x30
genpd_power_off.part.0.isra.0+0x20c/0x2a0
genpd_runtime_suspend+0x150/0x2bc
__rpm_callback+0x48/0x170
rpm_callback+0x6c/0x7c
rpm_suspend+0x108/0x660
__pm_runtime_suspend+0x4c/0x8c
__psci_enter_domain_idle_state.constprop.0+0x54/0xe0
psci_enter_domain_idle_state+0x18/0x2c
cpuidle_enter_state+0x8c/0x4e0
cpuidle_enter+0x38/0x50
do_idle+0x248/0x2f0
cpu_startup_entry+0x24/0x30
secondary_start_kernel+0x130/0x154
__secondary_switched+0xb0/0xb4
Cc: Adrien Thierry <[email protected]>
Cc: Brian Masney <[email protected]>
Cc: [email protected]
Signed-off-by: Krzysztof Kozlowski <[email protected]>
---
drivers/cpuidle/cpuidle-psci-domain.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/cpuidle/cpuidle-psci-domain.c b/drivers/cpuidle/cpuidle-psci-domain.c
index c80cf9ddabd8..d15a91fb7048 100644
--- a/drivers/cpuidle/cpuidle-psci-domain.c
+++ b/drivers/cpuidle/cpuidle-psci-domain.c
@@ -62,7 +62,8 @@ static int psci_pd_init(struct device_node *np, bool use_osi)
if (!pd_provider)
goto free_pd;
- pd->flags |= GENPD_FLAG_IRQ_SAFE | GENPD_FLAG_CPU_DOMAIN;
+ pd->flags |= GENPD_FLAG_IRQ_SAFE | GENPD_FLAG_RT_SAFE | \
+ GENPD_FLAG_CPU_DOMAIN;
/* Allow power off when OSI has been successfully enabled. */
if (use_osi)
--
2.34.1
device_pm_check_callbacks() uses dev->power spinlock, which on
PREEMPT_RT sleeps. However some PM domains on PREEMPT_RT might be using
raw spinlocks as genpd_lock(), thus dev_pm_domain_set() must not call
device_pm_check_callbacks(). In fact device_pm_check_callbacks() is not
strictly related to dev_pm_domain_set() and calls for these two can be
made separately.
Add new helper dev_pm_domain_set_no_cb() which will only set PM domain
but will not check the callbacks, leaving the checl to the caller.
Cc: Adrien Thierry <[email protected]>
Cc: Brian Masney <[email protected]>
Cc: [email protected]
Signed-off-by: Krzysztof Kozlowski <[email protected]>
---
drivers/base/power/common.c | 27 +++++++++++++++++++++++++--
include/linux/pm_domain.h | 3 +++
2 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/drivers/base/power/common.c b/drivers/base/power/common.c
index 72115917e0bd..f81cab6990ad 100644
--- a/drivers/base/power/common.c
+++ b/drivers/base/power/common.c
@@ -218,6 +218,30 @@ EXPORT_SYMBOL_GPL(dev_pm_domain_start);
* This function must be called with the device lock held.
*/
void dev_pm_domain_set(struct device *dev, struct dev_pm_domain *pd)
+{
+ if (dev->pm_domain == pd)
+ return;
+
+ dev_pm_domain_set_no_cb(dev, pd);
+ device_pm_check_callbacks(dev);
+}
+EXPORT_SYMBOL_GPL(dev_pm_domain_set);
+
+/**
+ * dev_pm_domain_set_no_cb - Set PM domain of a device.
+ * @dev: Device whose PM domain is to be set.
+ * @pd: PM domain to be set, or NULL.
+ *
+ * Sets the PM domain the device belongs to. The PM domain of a device needs
+ * to be set before its probe finishes (it's bound to a driver).
+ *
+ * This is exactly like dev_pm_domain_set(), however device_pm_check_callbacks()
+ * is not called and the caller is responsible to invoke
+ * device_pm_check_callbacks() with device lock held.
+ *
+ * This function must be called with the device lock held.
+ */
+void dev_pm_domain_set_no_cb(struct device *dev, struct dev_pm_domain *pd)
{
if (dev->pm_domain == pd)
return;
@@ -225,6 +249,5 @@ void dev_pm_domain_set(struct device *dev, struct dev_pm_domain *pd)
WARN(pd && device_is_bound(dev),
"PM domains can only be changed for unbound devices\n");
dev->pm_domain = pd;
- device_pm_check_callbacks(dev);
}
-EXPORT_SYMBOL_GPL(dev_pm_domain_set);
+EXPORT_SYMBOL_GPL(dev_pm_domain_set_no_cb);
diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
index 0a1600244963..352d0c76bfec 100644
--- a/include/linux/pm_domain.h
+++ b/include/linux/pm_domain.h
@@ -438,6 +438,7 @@ struct device *dev_pm_domain_attach_by_name(struct device *dev,
void dev_pm_domain_detach(struct device *dev, bool power_off);
int dev_pm_domain_start(struct device *dev);
void dev_pm_domain_set(struct device *dev, struct dev_pm_domain *pd);
+void dev_pm_domain_set_no_cb(struct device *dev, struct dev_pm_domain *pd);
#else
static inline int dev_pm_domain_attach(struct device *dev, bool power_on)
{
@@ -460,6 +461,8 @@ static inline int dev_pm_domain_start(struct device *dev)
}
static inline void dev_pm_domain_set(struct device *dev,
struct dev_pm_domain *pd) {}
+static inline void dev_pm_domain_set_no_cb(struct device *dev,
+ struct dev_pm_domain *pd) {}
#endif
#endif /* _LINUX_PM_DOMAIN_H */
--
2.34.1
Realtime kernels are not supposed to go into deep sleep modes (neither
system nor CPUs) because the latencies might become unpredictable.
Therefore actual power suspending of CPU topology is not that important.
On the other hand, this runtime PM of CPU topology is not compatible
with PREEMPT_RT:
1. Core cpuidle path disables IRQs.
2. Core cpuidle calls cpuidle-psci.
3. cpuidle-psci in __psci_enter_domain_idle_state() calls
pm_runtime_put_sync_suspend() and pm_runtime_get_sync() which use
spinlocks (which are sleeping on PREEMPT_RT):
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, name: swapper/0
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
1 lock held by swapper/0/0:
#0: ffff20524613c1a0 (&dev->power.lock){+.+.}-{3:3}, at: __pm_runtime_suspend+0x30/0xac
irq event stamp: 18776
hardirqs last enabled at (18775): [<ffffa4853720ac80>] tick_nohz_idle_enter+0x7c/0x17c
hardirqs last disabled at (18776): [<ffffa4853717ae00>] do_idle+0xe0/0x300
softirqs last enabled at (4310): [<ffffa48537126398>] __local_bh_enable_ip+0x98/0x280
softirqs last disabled at (4288): [<ffffa4853721c250>] cgroup_idr_alloc.constprop.0+0x0/0xd0
Preemption disabled at:
[<ffffa485381e55a0>] schedule_preempt_disabled+0x20/0x30
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 6.1.0-rt5-00322-gb49b67f1d8dc #1
Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
Call trace:
dump_backtrace.part.0+0xe0/0xf0
show_stack+0x18/0x40
dump_stack_lvl+0x8c/0xb8
dump_stack+0x18/0x34
__might_resched+0x17c/0x214
rt_spin_lock+0x5c/0x100
__pm_runtime_suspend+0x30/0xac
__psci_enter_domain_idle_state.constprop.0+0x60/0x104
psci_enter_domain_idle_state+0x18/0x2c
cpuidle_enter_state+0x220/0x37c
cpuidle_enter+0x38/0x50
do_idle+0x258/0x300
cpu_startup_entry+0x28/0x30
rest_init+0x104/0x180
arch_post_acpi_subsys_init+0x0/0x18
start_kernel+0x6fc/0x73c
__primary_switched+0xbc/0xc4
Cc: Adrien Thierry <[email protected]>
Cc: Brian Masney <[email protected]>
Cc: [email protected]
Signed-off-by: Krzysztof Kozlowski <[email protected]>
---
drivers/cpuidle/cpuidle-psci.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/cpuidle/cpuidle-psci.c b/drivers/cpuidle/cpuidle-psci.c
index 57bc3e3ae391..9d971cc4b12b 100644
--- a/drivers/cpuidle/cpuidle-psci.c
+++ b/drivers/cpuidle/cpuidle-psci.c
@@ -72,7 +72,7 @@ static int __psci_enter_domain_idle_state(struct cpuidle_device *dev,
ct_irq_enter_irqson();
if (s2idle)
dev_pm_genpd_suspend(pd_dev);
- else
+ else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
pm_runtime_put_sync_suspend(pd_dev);
ct_irq_exit_irqson();
@@ -85,7 +85,7 @@ static int __psci_enter_domain_idle_state(struct cpuidle_device *dev,
ct_irq_enter_irqson();
if (s2idle)
dev_pm_genpd_resume(pd_dev);
- else
+ else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
pm_runtime_get_sync(pd_dev);
ct_irq_exit_irqson();
--
2.34.1
Realtime kernels with PREEMPT_RT must use raw_spinlock_t for domains
which are invoked from CPU idle (thus from atomic section). Example is
cpuidle PSCI domain driver which itself is PREEMPT_RT safe, but is being
called as part of cpuidle.
Add a flag allowing a power domain provider to indicate it is RT safe.
The flag is supposed to be used with existing GENPD_FLAG_IRQ_SAFE.
Cc: Adrien Thierry <[email protected]>
Cc: Brian Masney <[email protected]>
Cc: [email protected]
Signed-off-by: Krzysztof Kozlowski <[email protected]>
---
Independently from Adrien, I encountered the same problem around genpd
when using PREEMPT_RT kernel.
Previous patch by Adrien:
https://lore.kernel.org/all/[email protected]/
---
drivers/base/power/domain.c | 59 +++++++++++++++++++++++++++++++++++--
include/linux/pm_domain.h | 13 ++++++++
2 files changed, 70 insertions(+), 2 deletions(-)
diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
index 967bcf9d415e..4dfce1d476f4 100644
--- a/drivers/base/power/domain.c
+++ b/drivers/base/power/domain.c
@@ -119,6 +119,48 @@ static const struct genpd_lock_ops genpd_spin_ops = {
.unlock = genpd_unlock_spin,
};
+static void genpd_lock_rawspin(struct generic_pm_domain *genpd)
+ __acquires(&genpd->rslock)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&genpd->rslock, flags);
+ genpd->rlock_flags = flags;
+}
+
+static void genpd_lock_nested_rawspin(struct generic_pm_domain *genpd,
+ int depth)
+ __acquires(&genpd->rslock)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave_nested(&genpd->rslock, flags, depth);
+ genpd->rlock_flags = flags;
+}
+
+static int genpd_lock_interruptible_rawspin(struct generic_pm_domain *genpd)
+ __acquires(&genpd->rslock)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&genpd->rslock, flags);
+ genpd->rlock_flags = flags;
+ return 0;
+}
+
+static void genpd_unlock_rawspin(struct generic_pm_domain *genpd)
+ __releases(&genpd->rslock)
+{
+ raw_spin_unlock_irqrestore(&genpd->rslock, genpd->rlock_flags);
+}
+
+static const struct genpd_lock_ops genpd_rawspin_ops = {
+ .lock = genpd_lock_rawspin,
+ .lock_nested = genpd_lock_nested_rawspin,
+ .lock_interruptible = genpd_lock_interruptible_rawspin,
+ .unlock = genpd_unlock_rawspin,
+};
+
#define genpd_lock(p) p->lock_ops->lock(p)
#define genpd_lock_nested(p, d) p->lock_ops->lock_nested(p, d)
#define genpd_lock_interruptible(p) p->lock_ops->lock_interruptible(p)
@@ -126,6 +168,8 @@ static const struct genpd_lock_ops genpd_spin_ops = {
#define genpd_status_on(genpd) (genpd->status == GENPD_STATE_ON)
#define genpd_is_irq_safe(genpd) (genpd->flags & GENPD_FLAG_IRQ_SAFE)
+#define genpd_is_rt_safe(genpd) (genpd_is_irq_safe(genpd) && \
+ (genpd->flags & GENPD_FLAG_RT_SAFE))
#define genpd_is_always_on(genpd) (genpd->flags & GENPD_FLAG_ALWAYS_ON)
#define genpd_is_active_wakeup(genpd) (genpd->flags & GENPD_FLAG_ACTIVE_WAKEUP)
#define genpd_is_cpu_domain(genpd) (genpd->flags & GENPD_FLAG_CPU_DOMAIN)
@@ -1838,6 +1882,12 @@ static int genpd_add_subdomain(struct generic_pm_domain *genpd,
return -EINVAL;
}
+ if (!genpd_is_rt_safe(genpd) && genpd_is_rt_safe(subdomain)) {
+ WARN(1, "Parent %s of subdomain %s must be RT safe\n",
+ genpd->name, subdomain->name);
+ return -EINVAL;
+ }
+
link = kzalloc(sizeof(*link), GFP_KERNEL);
if (!link)
return -ENOMEM;
@@ -2008,8 +2058,13 @@ static void genpd_free_data(struct generic_pm_domain *genpd)
static void genpd_lock_init(struct generic_pm_domain *genpd)
{
if (genpd->flags & GENPD_FLAG_IRQ_SAFE) {
- spin_lock_init(&genpd->slock);
- genpd->lock_ops = &genpd_spin_ops;
+ if (genpd->flags & GENPD_FLAG_RT_SAFE) {
+ raw_spin_lock_init(&genpd->rslock);
+ genpd->lock_ops = &genpd_rawspin_ops;
+ } else {
+ spin_lock_init(&genpd->slock);
+ genpd->lock_ops = &genpd_spin_ops;
+ }
} else {
mutex_init(&genpd->mlock);
genpd->lock_ops = &genpd_mtx_ops;
diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
index 1cd41bdf73cf..0a1600244963 100644
--- a/include/linux/pm_domain.h
+++ b/include/linux/pm_domain.h
@@ -61,6 +61,14 @@
* GENPD_FLAG_MIN_RESIDENCY: Enable the genpd governor to consider its
* components' next wakeup when determining the
* optimal idle state.
+ *
+ * GENPD_FLAG_RT_SAFE: When used with GENPD_FLAG_IRQ_SAFE, this informs
+ * genpd that its backend callbacks, ->power_on|off(),
+ * do not use other spinlocks. They might use
+ * raw_spinlocks or other pre-emption-disable
+ * methods, all of which are PREEMPT_RT safe. Note
+ * that, a genpd having this flag set, requires its
+ * masterdomains to also have it set.
*/
#define GENPD_FLAG_PM_CLK (1U << 0)
#define GENPD_FLAG_IRQ_SAFE (1U << 1)
@@ -69,6 +77,7 @@
#define GENPD_FLAG_CPU_DOMAIN (1U << 4)
#define GENPD_FLAG_RPM_ALWAYS_ON (1U << 5)
#define GENPD_FLAG_MIN_RESIDENCY (1U << 6)
+#define GENPD_FLAG_RT_SAFE (1U << 7)
enum gpd_status {
GENPD_STATE_ON = 0, /* PM domain is on */
@@ -164,6 +173,10 @@ struct generic_pm_domain {
spinlock_t slock;
unsigned long lock_flags;
};
+ struct {
+ raw_spinlock_t rslock;
+ unsigned long rlock_flags;
+ };
};
};
--
2.34.1
If PM domain on PREEMPT_RT is marked as GENPD_FLAG_RT_SAFE(), the
genpd_lock() will be a raw spin lock, thus device_pm_check_callbacks()
must be called outside of the domain lock.
This solves on PREEMPT_RT:
[ BUG: Invalid wait context ]
6.1.0-rt5-00325-g8a5f56bcfcca #8 Tainted: G W
-----------------------------
swapper/0/1 is trying to lock:
ffff76e045dec9a0 (&dev->power.lock){+.+.}-{3:3}, at: device_pm_check_callbacks+0x20/0xf0
other info that might help us debug this:
context-{5:5}
3 locks held by swapper/0/1:
#0: ffff76e045deb8e8 (&dev->mutex){....}-{4:4}, at: __device_attach+0x38/0x1c0
#1: ffffa92b81f825e0 (gpd_list_lock){+.+.}-{4:4}, at: __genpd_dev_pm_attach+0x7c/0x250
#2: ffff76e04105c7a0 (&genpd->rslock){....}-{2:2}, at: genpd_lock_rawspin+0x1c/0x30
stack backtrace:
CPU: 5 PID: 1 Comm: swapper/0 Tainted: G W 6.1.0-rt5-00325-g8a5f56bcfcca #8
Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
Call trace:
dump_backtrace.part.0+0xe0/0xf0
show_stack+0x18/0x40
dump_stack_lvl+0x8c/0xb8
dump_stack+0x18/0x34
__lock_acquire+0x938/0x2100
lock_acquire.part.0+0x104/0x28c
lock_acquire+0x68/0x84
rt_spin_lock+0x40/0x100
device_pm_check_callbacks+0x20/0xf0
dev_pm_domain_set+0x54/0x64
genpd_add_device+0x258/0x340
__genpd_dev_pm_attach+0xa8/0x250
genpd_dev_pm_attach_by_id+0xc4/0x190
genpd_dev_pm_attach_by_name+0x3c/0x60
dev_pm_domain_attach_by_name+0x20/0x30
dt_idle_attach_cpu+0x24/0x90
psci_cpuidle_probe+0x300/0x4b0
platform_probe+0x68/0xe0
really_probe+0xbc/0x2dc
__driver_probe_device+0x78/0xe0
driver_probe_device+0x3c/0x160
__device_attach_driver+0xb8/0x140
bus_for_each_drv+0x78/0xd0
__device_attach+0xa8/0x1c0
device_initial_probe+0x14/0x20
bus_probe_device+0x9c/0xa4
device_add+0x3b4/0x8dc
platform_device_add+0x114/0x234
platform_device_register_full+0x108/0x1a4
psci_idle_init+0x6c/0xb0
do_one_initcall+0x74/0x450
kernel_init_freeable+0x2e0/0x350
kernel_init+0x24/0x130
ret_from_fork+0x10/0x20
Cc: Adrien Thierry <[email protected]>
Cc: Brian Masney <[email protected]>
Cc: [email protected]
Signed-off-by: Krzysztof Kozlowski <[email protected]>
---
drivers/base/power/domain.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
index 4dfce1d476f4..db499ba40497 100644
--- a/drivers/base/power/domain.c
+++ b/drivers/base/power/domain.c
@@ -1666,10 +1666,14 @@ static int genpd_add_device(struct generic_pm_domain *genpd, struct device *dev,
if (ret)
goto out;
+
+ /* PREEMPT_RT: Must be outside of genpd_lock */
+ device_pm_check_callbacks(dev);
+
genpd_lock(genpd);
genpd_set_cpumask(genpd, gpd_data->cpu);
- dev_pm_domain_set(dev, &genpd->domain);
+ dev_pm_domain_set_no_cb(dev, &genpd->domain);
genpd->device_count++;
if (gd)
--
2.34.1
Hi Krzysztof,
Thanks for looking into this!
I tested your patchset on the QDrive3 on a CentOS Stream 9 RT kernel (I
couldn't test it on mainline because the latest RT patchset only supports
6.1 which is missing some bits needed to boot QDrive3).
It fixes the PSCI cpuidle issue I was encountering in [1]. However, I may
have found another code path that triggers a similar issue:
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 113, name: kworker/4:2
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
4 locks held by kworker/4:2/113:
#0: ffff09b0c2376928 ((wq_completion)pm){+.+.}-{0:0}, at: process_one_work+0x1f4/0x7c0
#1: ffff800008bf3dd0 ((work_completion)(&genpd->power_off_work)){+.+.}-{0:0}, at: process_one_work+0x1f4/0x7c0
#2: ffff09b0c2e44860 (&genpd->rslock){....}-{2:2}, at: genpd_lock_rawspin+0x20/0x30
#3: ffff09b0c6696a20 (&dev->power.lock){+.+.}-{2:2}, at: dev_pm_qos_flags+0x2c/0x60
irq event stamp: 170
hardirqs last enabled at (169): [<ffffa1be822f8a78>] _raw_spin_unlock_irq+0x48/0xc4
hardirqs last disabled at (170): [<ffffa1be822f8df4>] _raw_spin_lock_irqsave+0xb0/0xfc
softirqs last enabled at (0): [<ffffa1be814cfff0>] copy_process+0x68c/0x1500
softirqs last disabled at (0): [<0000000000000000>] 0x0
Preemption disabled at:
[<ffffa1be81d7e620>] genpd_lock_rawspin+0x20/0x30
CPU: 4 PID: 113 Comm: kworker/4:2 Tainted: G X --------- --- 5.14.0-rt14+ #2
Hardware name: Qualcomm SA8540 ADP (DT)
Workqueue: pm genpd_power_off_work_fn
Call trace:
dump_backtrace+0xb4/0x12c
show_stack+0x1c/0x70
dump_stack_lvl+0x98/0xd0
dump_stack+0x14/0x2c
__might_resched+0x180/0x220
rt_spin_lock+0x74/0x11c
dev_pm_qos_flags+0x2c/0x60
genpd_power_off.part.0.isra.0+0xac/0x2d0
genpd_power_off_work_fn+0x68/0x8c
process_one_work+0x2b8/0x7c0
worker_thread+0x15c/0x44c
kthread+0xf8/0x104
ret_from_fork+0x10/0x20
This happens consistently during boot. But on the mainline kernel, this
code path has changed: genpd_power_off no longer calls dev_pm_qos_flags.
So it might not happen on mainline. I hope to be able to test your
patchset again soon on mainline with the next version of the RT patchset
(which should be able to boot the QDrive3).
Best,
Adrien
[1] https://lore.kernel.org/all/[email protected]/
On Tue, 20 Dec 2022 at 22:36, Adrien Thierry <[email protected]> wrote:
>
> Hi Krzysztof,
> Thanks for looking into this!
>
> I tested your patchset on the QDrive3 on a CentOS Stream 9 RT kernel (I
> couldn't test it on mainline because the latest RT patchset only supports
> 6.1 which is missing some bits needed to boot QDrive3).
>
> It fixes the PSCI cpuidle issue I was encountering in [1]. However, I may
> have found another code path that triggers a similar issue:
>
> BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
> in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 113, name: kworker/4:2
> preempt_count: 1, expected: 0
> RCU nest depth: 0, expected: 0
> 4 locks held by kworker/4:2/113:
> #0: ffff09b0c2376928 ((wq_completion)pm){+.+.}-{0:0}, at: process_one_work+0x1f4/0x7c0
> #1: ffff800008bf3dd0 ((work_completion)(&genpd->power_off_work)){+.+.}-{0:0}, at: process_one_work+0x1f4/0x7c0
> #2: ffff09b0c2e44860 (&genpd->rslock){....}-{2:2}, at: genpd_lock_rawspin+0x20/0x30
> #3: ffff09b0c6696a20 (&dev->power.lock){+.+.}-{2:2}, at: dev_pm_qos_flags+0x2c/0x60
> irq event stamp: 170
> hardirqs last enabled at (169): [<ffffa1be822f8a78>] _raw_spin_unlock_irq+0x48/0xc4
> hardirqs last disabled at (170): [<ffffa1be822f8df4>] _raw_spin_lock_irqsave+0xb0/0xfc
> softirqs last enabled at (0): [<ffffa1be814cfff0>] copy_process+0x68c/0x1500
> softirqs last disabled at (0): [<0000000000000000>] 0x0
> Preemption disabled at:
> [<ffffa1be81d7e620>] genpd_lock_rawspin+0x20/0x30
> CPU: 4 PID: 113 Comm: kworker/4:2 Tainted: G X --------- --- 5.14.0-rt14+ #2
> Hardware name: Qualcomm SA8540 ADP (DT)
> Workqueue: pm genpd_power_off_work_fn
> Call trace:
> dump_backtrace+0xb4/0x12c
> show_stack+0x1c/0x70
> dump_stack_lvl+0x98/0xd0
> dump_stack+0x14/0x2c
> __might_resched+0x180/0x220
> rt_spin_lock+0x74/0x11c
> dev_pm_qos_flags+0x2c/0x60
> genpd_power_off.part.0.isra.0+0xac/0x2d0
> genpd_power_off_work_fn+0x68/0x8c
> process_one_work+0x2b8/0x7c0
> worker_thread+0x15c/0x44c
> kthread+0xf8/0x104
> ret_from_fork+0x10/0x20
>
> This happens consistently during boot. But on the mainline kernel, this
> code path has changed: genpd_power_off no longer calls dev_pm_qos_flags.
> So it might not happen on mainline. I hope to be able to test your
> patchset again soon on mainline with the next version of the RT patchset
> (which should be able to boot the QDrive3).
You are right, since commit 3f9ee7da724a ("PM: domains: Don't check
PM_QOS_FLAG_NO_POWER_OFF in genpd") dev_pm_qos_flags() doesn't get
called in genpd_power_off() anymore. That patch was introduced in
v5.19.
>
> Best,
> Adrien
>
> [1] https://lore.kernel.org/all/[email protected]/
>
Kind regards
Uffe
On Mon, 19 Dec 2022 at 16:15, Krzysztof Kozlowski
<[email protected]> wrote:
>
> Realtime kernels with PREEMPT_RT must use raw_spinlock_t for domains
> which are invoked from CPU idle (thus from atomic section). Example is
> cpuidle PSCI domain driver which itself is PREEMPT_RT safe, but is being
> called as part of cpuidle.
Just so I don't get this wrong, since the cpuidle-psci also calls
pm_runtime_* functions so it isn't PREEMPT_RT safe, at least not yet?
>
> Add a flag allowing a power domain provider to indicate it is RT safe.
> The flag is supposed to be used with existing GENPD_FLAG_IRQ_SAFE.
>
> Cc: Adrien Thierry <[email protected]>
> Cc: Brian Masney <[email protected]>
> Cc: [email protected]
> Signed-off-by: Krzysztof Kozlowski <[email protected]>
For genpd, overall, I think this looks like an okay approach to me.
Although, let me check the whole series (I need some more time to do
that) before I give this my blessing.
Kind regards
Uffe
>
> ---
>
> Independently from Adrien, I encountered the same problem around genpd
> when using PREEMPT_RT kernel.
>
> Previous patch by Adrien:
> https://lore.kernel.org/all/[email protected]/
> ---
> drivers/base/power/domain.c | 59 +++++++++++++++++++++++++++++++++++--
> include/linux/pm_domain.h | 13 ++++++++
> 2 files changed, 70 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
> index 967bcf9d415e..4dfce1d476f4 100644
> --- a/drivers/base/power/domain.c
> +++ b/drivers/base/power/domain.c
> @@ -119,6 +119,48 @@ static const struct genpd_lock_ops genpd_spin_ops = {
> .unlock = genpd_unlock_spin,
> };
>
> +static void genpd_lock_rawspin(struct generic_pm_domain *genpd)
> + __acquires(&genpd->rslock)
> +{
> + unsigned long flags;
> +
> + raw_spin_lock_irqsave(&genpd->rslock, flags);
> + genpd->rlock_flags = flags;
> +}
> +
> +static void genpd_lock_nested_rawspin(struct generic_pm_domain *genpd,
> + int depth)
> + __acquires(&genpd->rslock)
> +{
> + unsigned long flags;
> +
> + raw_spin_lock_irqsave_nested(&genpd->rslock, flags, depth);
> + genpd->rlock_flags = flags;
> +}
> +
> +static int genpd_lock_interruptible_rawspin(struct generic_pm_domain *genpd)
> + __acquires(&genpd->rslock)
> +{
> + unsigned long flags;
> +
> + raw_spin_lock_irqsave(&genpd->rslock, flags);
> + genpd->rlock_flags = flags;
> + return 0;
> +}
> +
> +static void genpd_unlock_rawspin(struct generic_pm_domain *genpd)
> + __releases(&genpd->rslock)
> +{
> + raw_spin_unlock_irqrestore(&genpd->rslock, genpd->rlock_flags);
> +}
> +
> +static const struct genpd_lock_ops genpd_rawspin_ops = {
> + .lock = genpd_lock_rawspin,
> + .lock_nested = genpd_lock_nested_rawspin,
> + .lock_interruptible = genpd_lock_interruptible_rawspin,
> + .unlock = genpd_unlock_rawspin,
> +};
> +
> #define genpd_lock(p) p->lock_ops->lock(p)
> #define genpd_lock_nested(p, d) p->lock_ops->lock_nested(p, d)
> #define genpd_lock_interruptible(p) p->lock_ops->lock_interruptible(p)
> @@ -126,6 +168,8 @@ static const struct genpd_lock_ops genpd_spin_ops = {
>
> #define genpd_status_on(genpd) (genpd->status == GENPD_STATE_ON)
> #define genpd_is_irq_safe(genpd) (genpd->flags & GENPD_FLAG_IRQ_SAFE)
> +#define genpd_is_rt_safe(genpd) (genpd_is_irq_safe(genpd) && \
> + (genpd->flags & GENPD_FLAG_RT_SAFE))
> #define genpd_is_always_on(genpd) (genpd->flags & GENPD_FLAG_ALWAYS_ON)
> #define genpd_is_active_wakeup(genpd) (genpd->flags & GENPD_FLAG_ACTIVE_WAKEUP)
> #define genpd_is_cpu_domain(genpd) (genpd->flags & GENPD_FLAG_CPU_DOMAIN)
> @@ -1838,6 +1882,12 @@ static int genpd_add_subdomain(struct generic_pm_domain *genpd,
> return -EINVAL;
> }
>
> + if (!genpd_is_rt_safe(genpd) && genpd_is_rt_safe(subdomain)) {
> + WARN(1, "Parent %s of subdomain %s must be RT safe\n",
> + genpd->name, subdomain->name);
> + return -EINVAL;
> + }
> +
> link = kzalloc(sizeof(*link), GFP_KERNEL);
> if (!link)
> return -ENOMEM;
> @@ -2008,8 +2058,13 @@ static void genpd_free_data(struct generic_pm_domain *genpd)
> static void genpd_lock_init(struct generic_pm_domain *genpd)
> {
> if (genpd->flags & GENPD_FLAG_IRQ_SAFE) {
> - spin_lock_init(&genpd->slock);
> - genpd->lock_ops = &genpd_spin_ops;
> + if (genpd->flags & GENPD_FLAG_RT_SAFE) {
> + raw_spin_lock_init(&genpd->rslock);
> + genpd->lock_ops = &genpd_rawspin_ops;
> + } else {
> + spin_lock_init(&genpd->slock);
> + genpd->lock_ops = &genpd_spin_ops;
> + }
> } else {
> mutex_init(&genpd->mlock);
> genpd->lock_ops = &genpd_mtx_ops;
> diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
> index 1cd41bdf73cf..0a1600244963 100644
> --- a/include/linux/pm_domain.h
> +++ b/include/linux/pm_domain.h
> @@ -61,6 +61,14 @@
> * GENPD_FLAG_MIN_RESIDENCY: Enable the genpd governor to consider its
> * components' next wakeup when determining the
> * optimal idle state.
> + *
> + * GENPD_FLAG_RT_SAFE: When used with GENPD_FLAG_IRQ_SAFE, this informs
> + * genpd that its backend callbacks, ->power_on|off(),
> + * do not use other spinlocks. They might use
> + * raw_spinlocks or other pre-emption-disable
> + * methods, all of which are PREEMPT_RT safe. Note
> + * that, a genpd having this flag set, requires its
> + * masterdomains to also have it set.
> */
> #define GENPD_FLAG_PM_CLK (1U << 0)
> #define GENPD_FLAG_IRQ_SAFE (1U << 1)
> @@ -69,6 +77,7 @@
> #define GENPD_FLAG_CPU_DOMAIN (1U << 4)
> #define GENPD_FLAG_RPM_ALWAYS_ON (1U << 5)
> #define GENPD_FLAG_MIN_RESIDENCY (1U << 6)
> +#define GENPD_FLAG_RT_SAFE (1U << 7)
>
> enum gpd_status {
> GENPD_STATE_ON = 0, /* PM domain is on */
> @@ -164,6 +173,10 @@ struct generic_pm_domain {
> spinlock_t slock;
> unsigned long lock_flags;
> };
> + struct {
> + raw_spinlock_t rslock;
> + unsigned long rlock_flags;
> + };
> };
>
> };
> --
> 2.34.1
>
On 04/01/2023 16:45, Ulf Hansson wrote:
> On Mon, 19 Dec 2022 at 16:15, Krzysztof Kozlowski
> <[email protected]> wrote:
>>
>> Realtime kernels with PREEMPT_RT must use raw_spinlock_t for domains
>> which are invoked from CPU idle (thus from atomic section). Example is
>> cpuidle PSCI domain driver which itself is PREEMPT_RT safe, but is being
>> called as part of cpuidle.
>
> Just so I don't get this wrong, since the cpuidle-psci also calls
> pm_runtime_* functions so it isn't PREEMPT_RT safe, at least not yet?
You are correct. Patch 3 here addresses it by... just not doing runtime
PM. This is a hacky workaround but:
1. I don't have any other idea,
2. It's not a big problem because RT systems are not supposed to have
any CPU idle (one of first things during RT system tuning is to disable
cpuidle).
>
>>
>> Add a flag allowing a power domain provider to indicate it is RT safe.
>> The flag is supposed to be used with existing GENPD_FLAG_IRQ_SAFE.
>>
>> Cc: Adrien Thierry <[email protected]>
>> Cc: Brian Masney <[email protected]>
>> Cc: [email protected]
>> Signed-off-by: Krzysztof Kozlowski <[email protected]>
>
> For genpd, overall, I think this looks like an okay approach to me.
> Although, let me check the whole series (I need some more time to do
> that) before I give this my blessing.
Sure, we are all have too many mails in inbox. :)
Best regards,
Krzysztof
On 2022-12-19 16:14:59 [+0100], Krzysztof Kozlowski wrote:
> Realtime kernels with PREEMPT_RT must use raw_spinlock_t for domains
> which are invoked from CPU idle (thus from atomic section). Example is
> cpuidle PSCI domain driver which itself is PREEMPT_RT safe, but is being
> called as part of cpuidle.
I think it needs to be clarified what PREEMPT_RT safe means. PSCI is an
external interface which does not inform us what it does and how long
the operation will take.
The ACPI table for instance populate several idle states and their
entry/exit time. Then you can decide if and when an entry/exit latency
of 500us is PREEMPT_RT safe.
> Add a flag allowing a power domain provider to indicate it is RT safe.
> The flag is supposed to be used with existing GENPD_FLAG_IRQ_SAFE.
>
> Cc: Adrien Thierry <[email protected]>
> Cc: Brian Masney <[email protected]>
> Cc: [email protected]
> Signed-off-by: Krzysztof Kozlowski <[email protected]>
>
…
> index 1cd41bdf73cf..0a1600244963 100644
> --- a/include/linux/pm_domain.h
> +++ b/include/linux/pm_domain.h
> @@ -61,6 +61,14 @@
> * GENPD_FLAG_MIN_RESIDENCY: Enable the genpd governor to consider its
> * components' next wakeup when determining the
> * optimal idle state.
> + *
> + * GENPD_FLAG_RT_SAFE: When used with GENPD_FLAG_IRQ_SAFE, this informs
> + * genpd that its backend callbacks, ->power_on|off(),
> + * do not use other spinlocks. They might use
> + * raw_spinlocks or other pre-emption-disable
> + * methods, all of which are PREEMPT_RT safe. Note
Please use spinlock_t and raw_spinlock_t. Wouldn't it be better to write
"preemption" instead "pre-emption"?
The important part is probably that once a raw_spinlock_t has been
acquired, it is not possible to invoke any function that acquries
sleeping locks (which includes memory allocations). While even without
that flag it is possible to invoke a function which disables and enables
preemption on its own.
> + * that, a genpd having this flag set, requires its
> + * masterdomains to also have it set.
This could be verified upon registration, no?
It might be worth noting that preemption-off section during PM
operations contribute to the system's max latency. Depending on how low
the operation is, it may or may not be a problem.
The ->power_on|off() refers to the sate of the component, right?
> */
> #define GENPD_FLAG_PM_CLK (1U << 0)
> #define GENPD_FLAG_IRQ_SAFE (1U << 1)
Sebastian
On 2022-12-19 16:15:00 [+0100], Krzysztof Kozlowski wrote:
> The PSCI cpuidle power domain in power_off callback uses
> __this_cpu_write() so it is PREEMPT_RT safe. This allows to use it in
Why does __this_cpu_write() matter here?
> Realtime kernels and solves errors like:
>
> BUG: scheduling while atomic: swapper/2/0/0x00000002
> Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
> Call trace:
> dump_backtrace.part.0+0xe0/0xf0
> show_stack+0x18/0x40
> dump_stack_lvl+0x68/0x84
> dump_stack+0x18/0x34
> __schedule_bug+0x60/0x80
> __schedule+0x628/0x800
> schedule_rtlock+0x28/0x5c
> rtlock_slowlock_locked+0x360/0xd30
> rt_spin_lock+0x88/0xb0
> genpd_lock_nested_spin+0x1c/0x30
> genpd_power_off.part.0.isra.0+0x20c/0x2a0
> genpd_runtime_suspend+0x150/0x2bc
> __rpm_callback+0x48/0x170
> rpm_callback+0x6c/0x7c
> rpm_suspend+0x108/0x660
> __pm_runtime_suspend+0x4c/0x8c
> __psci_enter_domain_idle_state.constprop.0+0x54/0xe0
> psci_enter_domain_idle_state+0x18/0x2c
> cpuidle_enter_state+0x8c/0x4e0
> cpuidle_enter+0x38/0x50
> do_idle+0x248/0x2f0
> cpu_startup_entry+0x24/0x30
> secondary_start_kernel+0x130/0x154
> __secondary_switched+0xb0/0xb4
This is to a sleeping lock (spinlock_t) in an IRQ-off region which
starts in do_idle(). You could describe the problem and to solution you
aim for instead pasting a backtrace into the commit description and
adding a flow in the code.
I don't see how your commit description matches your change in code. One
might think "Oh. If I see this pattern, I need an irqsafe lock to fix
it".
Sebastian
On 2023-01-06 15:52:57 [+0100], Krzysztof Kozlowski wrote:
> > Just so I don't get this wrong, since the cpuidle-psci also calls
> > pm_runtime_* functions so it isn't PREEMPT_RT safe, at least not yet?
>
> You are correct. Patch 3 here addresses it by... just not doing runtime
> PM. This is a hacky workaround but:
> 1. I don't have any other idea,
> 2. It's not a big problem because RT systems are not supposed to have
> any CPU idle (one of first things during RT system tuning is to disable
> cpuidle).
so you say you use idle=poll instead?
> Best regards,
> Krzysztof
Sebastian
On 2022-12-19 16:15:02 [+0100], Krzysztof Kozlowski wrote:
> device_pm_check_callbacks() uses dev->power spinlock, which on
> PREEMPT_RT sleeps. However some PM domains on PREEMPT_RT might be using
> raw spinlocks as genpd_lock(), thus dev_pm_domain_set() must not call
> device_pm_check_callbacks(). In fact device_pm_check_callbacks() is not
> strictly related to dev_pm_domain_set() and calls for these two can be
> made separately.
>
> Add new helper dev_pm_domain_set_no_cb() which will only set PM domain
> but will not check the callbacks, leaving the checl to the caller.
s/checl/check/
But this I comprehend.
> Cc: Adrien Thierry <[email protected]>
> Cc: Brian Masney <[email protected]>
> Cc: [email protected]
> Signed-off-by: Krzysztof Kozlowski <[email protected]>
Sebastian
On 12/01/2023 11:36, Sebastian Andrzej Siewior wrote:
> On 2023-01-06 15:52:57 [+0100], Krzysztof Kozlowski wrote:
>>> Just so I don't get this wrong, since the cpuidle-psci also calls
>>> pm_runtime_* functions so it isn't PREEMPT_RT safe, at least not yet?
>>
>> You are correct. Patch 3 here addresses it by... just not doing runtime
>> PM. This is a hacky workaround but:
>> 1. I don't have any other idea,
>> 2. It's not a big problem because RT systems are not supposed to have
>> any CPU idle (one of first things during RT system tuning is to disable
>> cpuidle).
>
> so you say you use idle=poll instead?
This was generic comment that system is not supposed to go into deeper
idle states.
Best regards,
Krzysztof
On 2022-12-19 16:15:01 [+0100], Krzysztof Kozlowski wrote:
> diff --git a/drivers/cpuidle/cpuidle-psci.c b/drivers/cpuidle/cpuidle-psci.c
> index 57bc3e3ae391..9d971cc4b12b 100644
> --- a/drivers/cpuidle/cpuidle-psci.c
> +++ b/drivers/cpuidle/cpuidle-psci.c
> @@ -72,7 +72,7 @@ static int __psci_enter_domain_idle_state(struct cpuidle_device *dev,
> ct_irq_enter_irqson();
> if (s2idle)
> dev_pm_genpd_suspend(pd_dev);
> - else
> + else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> pm_runtime_put_sync_suspend(pd_dev);
So based on the commit description you run into a sleeping lock in
pm_runtime_put_sync_suspend() while the CPU is in an IRQ-off region.
Why is it okay to skip it on PREEMPT_RT?
> ct_irq_exit_irqson();
>
Sebastian
On 12/01/2023 12:00, Sebastian Andrzej Siewior wrote:
> On 2022-12-19 16:15:00 [+0100], Krzysztof Kozlowski wrote:
>> The PSCI cpuidle power domain in power_off callback uses
>> __this_cpu_write() so it is PREEMPT_RT safe. This allows to use it in
>
> Why does __this_cpu_write() matter here?
I'll reword to "not using sleeping primitives nor spinlock_t"
>
>> Realtime kernels and solves errors like:
>>
>> BUG: scheduling while atomic: swapper/2/0/0x00000002
>> Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
>> Call trace:
>> dump_backtrace.part.0+0xe0/0xf0
>> show_stack+0x18/0x40
>> dump_stack_lvl+0x68/0x84
>> dump_stack+0x18/0x34
>> __schedule_bug+0x60/0x80
>> __schedule+0x628/0x800
>> schedule_rtlock+0x28/0x5c
>> rtlock_slowlock_locked+0x360/0xd30
>> rt_spin_lock+0x88/0xb0
>> genpd_lock_nested_spin+0x1c/0x30
>> genpd_power_off.part.0.isra.0+0x20c/0x2a0
>> genpd_runtime_suspend+0x150/0x2bc
>> __rpm_callback+0x48/0x170
>> rpm_callback+0x6c/0x7c
>> rpm_suspend+0x108/0x660
>> __pm_runtime_suspend+0x4c/0x8c
>> __psci_enter_domain_idle_state.constprop.0+0x54/0xe0
>> psci_enter_domain_idle_state+0x18/0x2c
>> cpuidle_enter_state+0x8c/0x4e0
>> cpuidle_enter+0x38/0x50
>> do_idle+0x248/0x2f0
>> cpu_startup_entry+0x24/0x30
>> secondary_start_kernel+0x130/0x154
>> __secondary_switched+0xb0/0xb4
>
> This is to a sleeping lock (spinlock_t) in an IRQ-off region which
> starts in do_idle(). You could describe the problem and to solution you
> aim for instead pasting a backtrace into the commit description and
> adding a flow in the code.
I'll extend the description.
>
> I don't see how your commit description matches your change in code. One
> might think "Oh. If I see this pattern, I need an irqsafe lock to fix
> it".
Best regards,
Krzysztof
On 12/01/2023 11:32, Sebastian Andrzej Siewior wrote:
> On 2022-12-19 16:14:59 [+0100], Krzysztof Kozlowski wrote:
>> Realtime kernels with PREEMPT_RT must use raw_spinlock_t for domains
>> which are invoked from CPU idle (thus from atomic section). Example is
>> cpuidle PSCI domain driver which itself is PREEMPT_RT safe, but is being
>> called as part of cpuidle.
>
> I think it needs to be clarified what PREEMPT_RT safe means.
OK
> PSCI is an
> external interface which does not inform us what it does and how long
> the operation will take.
> The ACPI table for instance populate several idle states and their
> entry/exit time. Then you can decide if and when an entry/exit latency
> of 500us is PREEMPT_RT safe.
>
>> Add a flag allowing a power domain provider to indicate it is RT safe.
>> The flag is supposed to be used with existing GENPD_FLAG_IRQ_SAFE.
>>
>> Cc: Adrien Thierry <[email protected]>
>> Cc: Brian Masney <[email protected]>
>> Cc: [email protected]
>> Signed-off-by: Krzysztof Kozlowski <[email protected]>
>>
> …
>> index 1cd41bdf73cf..0a1600244963 100644
>> --- a/include/linux/pm_domain.h
>> +++ b/include/linux/pm_domain.h
>> @@ -61,6 +61,14 @@
>> * GENPD_FLAG_MIN_RESIDENCY: Enable the genpd governor to consider its
>> * components' next wakeup when determining the
>> * optimal idle state.
>> + *
>> + * GENPD_FLAG_RT_SAFE: When used with GENPD_FLAG_IRQ_SAFE, this informs
>> + * genpd that its backend callbacks, ->power_on|off(),
>> + * do not use other spinlocks. They might use
>> + * raw_spinlocks or other pre-emption-disable
>> + * methods, all of which are PREEMPT_RT safe. Note
>
> Please use spinlock_t and raw_spinlock_t. Wouldn't it be better to write
> "preemption" instead "pre-emption"?
Sure.
> The important part is probably that once a raw_spinlock_t has been
> acquired, it is not possible to invoke any function that acquries
> sleeping locks (which includes memory allocations). While even without
> that flag it is possible to invoke a function which disables and enables
> preemption on its own.
>
>> + * that, a genpd having this flag set, requires its
>> + * masterdomains to also have it set.
>
> This could be verified upon registration, no?
It is, just like the IRQ_SAFE flag. The code is symmetrical to IRQ_SAFE.
> It might be worth noting that preemption-off section during PM
> operations contribute to the system's max latency.
You mean in the commit msg? In the doc, I don't want to deviate from
IRQ_SAFE. It's not really related to the flag.
> Depending on how low
> the operation is, it may or may not be a problem.
> The ->power_on|off() refers to the sate of the component, right?
It refers to genpd framework.
Best regards,
Krzysztof
On 12/01/2023 12:09, Sebastian Andrzej Siewior wrote:
> On 2022-12-19 16:15:01 [+0100], Krzysztof Kozlowski wrote:
>> diff --git a/drivers/cpuidle/cpuidle-psci.c b/drivers/cpuidle/cpuidle-psci.c
>> index 57bc3e3ae391..9d971cc4b12b 100644
>> --- a/drivers/cpuidle/cpuidle-psci.c
>> +++ b/drivers/cpuidle/cpuidle-psci.c
>> @@ -72,7 +72,7 @@ static int __psci_enter_domain_idle_state(struct cpuidle_device *dev,
>> ct_irq_enter_irqson();
>> if (s2idle)
>> dev_pm_genpd_suspend(pd_dev);
>> - else
>> + else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
>> pm_runtime_put_sync_suspend(pd_dev);
>
> So based on the commit description you run into a sleeping lock in
> pm_runtime_put_sync_suspend() while the CPU is in an IRQ-off region.
> Why is it okay to skip it on PREEMPT_RT?
It is okay to skip it everywhere, you just don't get a suspended CPU.
Why PREEMPT_RT is different here - having suspended CPU is a great way
to have longer or even unpredictable (as it goes to firmware which is
out of control of Linux) latencies.
Best regards,
Krzysztof
On 2022-12-19 16:15:03 [+0100], Krzysztof Kozlowski wrote:
> If PM domain on PREEMPT_RT is marked as GENPD_FLAG_RT_SAFE(), the
> genpd_lock() will be a raw spin lock, thus device_pm_check_callbacks()
a raw_spinlock_t
> must be called outside of the domain lock.
Right. First the sleeping lock, followed by the spinning locks. This is
covered in
Documentation/locking/locktypes.rst
at the end.
> This solves on PREEMPT_RT:
Yes but
> [ BUG: Invalid wait context ]
This "Invalid wait context" should also trigger on !PREEMPT_RT with
CONFIG_PROVE_RAW_LOCK_NESTING.
> 6.1.0-rt5-00325-g8a5f56bcfcca #8 Tainted: G W
> -----------------------------
> swapper/0/1 is trying to lock:
> ffff76e045dec9a0 (&dev->power.lock){+.+.}-{3:3}, at: device_pm_check_callbacks+0x20/0xf0
> other info that might help us debug this:
> context-{5:5}
> 3 locks held by swapper/0/1:
> #0: ffff76e045deb8e8 (&dev->mutex){....}-{4:4}, at: __device_attach+0x38/0x1c0
> #1: ffffa92b81f825e0 (gpd_list_lock){+.+.}-{4:4}, at: __genpd_dev_pm_attach+0x7c/0x250
> #2: ffff76e04105c7a0 (&genpd->rslock){....}-{2:2}, at: genpd_lock_rawspin+0x1c/0x30
> stack backtrace:
> CPU: 5 PID: 1 Comm: swapper/0 Tainted: G W 6.1.0-rt5-00325-g8a5f56bcfcca #8
> Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
> Call trace:
> dump_backtrace.part.0+0xe0/0xf0
> show_stack+0x18/0x40
> dump_stack_lvl+0x8c/0xb8
> dump_stack+0x18/0x34
> __lock_acquire+0x938/0x2100
> lock_acquire.part.0+0x104/0x28c
> lock_acquire+0x68/0x84
> rt_spin_lock+0x40/0x100
> device_pm_check_callbacks+0x20/0xf0
> dev_pm_domain_set+0x54/0x64
> genpd_add_device+0x258/0x340
> __genpd_dev_pm_attach+0xa8/0x250
> genpd_dev_pm_attach_by_id+0xc4/0x190
> genpd_dev_pm_attach_by_name+0x3c/0x60
> dev_pm_domain_attach_by_name+0x20/0x30
> dt_idle_attach_cpu+0x24/0x90
> psci_cpuidle_probe+0x300/0x4b0
> platform_probe+0x68/0xe0
> really_probe+0xbc/0x2dc
> __driver_probe_device+0x78/0xe0
> driver_probe_device+0x3c/0x160
> __device_attach_driver+0xb8/0x140
> bus_for_each_drv+0x78/0xd0
> __device_attach+0xa8/0x1c0
> device_initial_probe+0x14/0x20
> bus_probe_device+0x9c/0xa4
> device_add+0x3b4/0x8dc
> platform_device_add+0x114/0x234
> platform_device_register_full+0x108/0x1a4
> psci_idle_init+0x6c/0xb0
> do_one_initcall+0x74/0x450
> kernel_init_freeable+0x2e0/0x350
> kernel_init+0x24/0x130
> ret_from_fork+0x10/0x20
I would prefer a description of the issue instead hacing this
backtrace.
> Cc: Adrien Thierry <[email protected]>
> Cc: Brian Masney <[email protected]>
> Cc: [email protected]
> Signed-off-by: Krzysztof Kozlowski <[email protected]>
> ---
> drivers/base/power/domain.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
> index 4dfce1d476f4..db499ba40497 100644
> --- a/drivers/base/power/domain.c
> +++ b/drivers/base/power/domain.c
> @@ -1666,10 +1666,14 @@ static int genpd_add_device(struct generic_pm_domain *genpd, struct device *dev,
> if (ret)
> goto out;
>
> +
> + /* PREEMPT_RT: Must be outside of genpd_lock */
Could this comment be rewritten if needed?
The callback, which acquires sleeping locks on PREEMPT_RT, can be moved
before genpd_lock() because it does not rely on this lock. It must be
moved because the latter may acquire spinning locks.
It might be also be part of the commit message…
> + device_pm_check_callbacks(dev);
> +
> genpd_lock(genpd);
>
> genpd_set_cpumask(genpd, gpd_data->cpu);
> - dev_pm_domain_set(dev, &genpd->domain);
> + dev_pm_domain_set_no_cb(dev, &genpd->domain);
>
> genpd->device_count++;
> if (gd)
Sebastian
On 12/01/2023 12:31, Sebastian Andrzej Siewior wrote:
> On 2022-12-19 16:15:03 [+0100], Krzysztof Kozlowski wrote:
>> If PM domain on PREEMPT_RT is marked as GENPD_FLAG_RT_SAFE(), the
>> genpd_lock() will be a raw spin lock, thus device_pm_check_callbacks()
>
> a raw_spinlock_t
>
>> must be called outside of the domain lock.
>
> Right. First the sleeping lock, followed by the spinning locks. This is
> covered in
> Documentation/locking/locktypes.rst
>
> at the end.
I don't understand your comment. Do you expect me to change something?
>
>> This solves on PREEMPT_RT:
> Yes but
>> [ BUG: Invalid wait context ]
>
> This "Invalid wait context" should also trigger on !PREEMPT_RT with
> CONFIG_PROVE_RAW_LOCK_NESTING.
Could be, I just did not hit it.
>
>> 6.1.0-rt5-00325-g8a5f56bcfcca #8 Tainted: G W
>> -----------------------------
>> swapper/0/1 is trying to lock:
>> ffff76e045dec9a0 (&dev->power.lock){+.+.}-{3:3}, at: device_pm_check_callbacks+0x20/0xf0
>> other info that might help us debug this:
>> context-{5:5}
>> 3 locks held by swapper/0/1:
>> #0: ffff76e045deb8e8 (&dev->mutex){....}-{4:4}, at: __device_attach+0x38/0x1c0
>> #1: ffffa92b81f825e0 (gpd_list_lock){+.+.}-{4:4}, at: __genpd_dev_pm_attach+0x7c/0x250
>> #2: ffff76e04105c7a0 (&genpd->rslock){....}-{2:2}, at: genpd_lock_rawspin+0x1c/0x30
>> stack backtrace:
>> CPU: 5 PID: 1 Comm: swapper/0 Tainted: G W 6.1.0-rt5-00325-g8a5f56bcfcca #8
>> Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
>> Call trace:
>> dump_backtrace.part.0+0xe0/0xf0
>> show_stack+0x18/0x40
>> dump_stack_lvl+0x8c/0xb8
>> dump_stack+0x18/0x34
>> __lock_acquire+0x938/0x2100
>> lock_acquire.part.0+0x104/0x28c
>> lock_acquire+0x68/0x84
>> rt_spin_lock+0x40/0x100
>> device_pm_check_callbacks+0x20/0xf0
>> dev_pm_domain_set+0x54/0x64
>> genpd_add_device+0x258/0x340
>> __genpd_dev_pm_attach+0xa8/0x250
>> genpd_dev_pm_attach_by_id+0xc4/0x190
>> genpd_dev_pm_attach_by_name+0x3c/0x60
>> dev_pm_domain_attach_by_name+0x20/0x30
>> dt_idle_attach_cpu+0x24/0x90
>> psci_cpuidle_probe+0x300/0x4b0
>> platform_probe+0x68/0xe0
>> really_probe+0xbc/0x2dc
>> __driver_probe_device+0x78/0xe0
>> driver_probe_device+0x3c/0x160
>> __device_attach_driver+0xb8/0x140
>> bus_for_each_drv+0x78/0xd0
>> __device_attach+0xa8/0x1c0
>> device_initial_probe+0x14/0x20
>> bus_probe_device+0x9c/0xa4
>> device_add+0x3b4/0x8dc
>> platform_device_add+0x114/0x234
>> platform_device_register_full+0x108/0x1a4
>> psci_idle_init+0x6c/0xb0
>> do_one_initcall+0x74/0x450
>> kernel_init_freeable+0x2e0/0x350
>> kernel_init+0x24/0x130
>> ret_from_fork+0x10/0x20
>
> I would prefer a description of the issue instead hacing this
> backtrace.
I'll extend the commit msg.
>
>> Cc: Adrien Thierry <[email protected]>
>> Cc: Brian Masney <[email protected]>
>> Cc: [email protected]
>> Signed-off-by: Krzysztof Kozlowski <[email protected]>
>> ---
>> drivers/base/power/domain.c | 6 +++++-
>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
>> index 4dfce1d476f4..db499ba40497 100644
>> --- a/drivers/base/power/domain.c
>> +++ b/drivers/base/power/domain.c
>> @@ -1666,10 +1666,14 @@ static int genpd_add_device(struct generic_pm_domain *genpd, struct device *dev,
>> if (ret)
>> goto out;
>>
>> +
>> + /* PREEMPT_RT: Must be outside of genpd_lock */
>
> Could this comment be rewritten if needed?
> The callback, which acquires sleeping locks on PREEMPT_RT, can be moved
> before genpd_lock() because it does not rely on this lock. It must be
> moved because the latter may acquire spinning locks.
Sure
Best regards,
Krzysztof
On Mon, 19 Dec 2022 at 16:15, Krzysztof Kozlowski
<[email protected]> wrote:
>
> If PM domain on PREEMPT_RT is marked as GENPD_FLAG_RT_SAFE(), the
> genpd_lock() will be a raw spin lock, thus device_pm_check_callbacks()
> must be called outside of the domain lock.
>
> This solves on PREEMPT_RT:
>
> [ BUG: Invalid wait context ]
> 6.1.0-rt5-00325-g8a5f56bcfcca #8 Tainted: G W
> -----------------------------
> swapper/0/1 is trying to lock:
> ffff76e045dec9a0 (&dev->power.lock){+.+.}-{3:3}, at: device_pm_check_callbacks+0x20/0xf0
> other info that might help us debug this:
> context-{5:5}
> 3 locks held by swapper/0/1:
> #0: ffff76e045deb8e8 (&dev->mutex){....}-{4:4}, at: __device_attach+0x38/0x1c0
> #1: ffffa92b81f825e0 (gpd_list_lock){+.+.}-{4:4}, at: __genpd_dev_pm_attach+0x7c/0x250
> #2: ffff76e04105c7a0 (&genpd->rslock){....}-{2:2}, at: genpd_lock_rawspin+0x1c/0x30
> stack backtrace:
> CPU: 5 PID: 1 Comm: swapper/0 Tainted: G W 6.1.0-rt5-00325-g8a5f56bcfcca #8
> Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
> Call trace:
> dump_backtrace.part.0+0xe0/0xf0
> show_stack+0x18/0x40
> dump_stack_lvl+0x8c/0xb8
> dump_stack+0x18/0x34
> __lock_acquire+0x938/0x2100
> lock_acquire.part.0+0x104/0x28c
> lock_acquire+0x68/0x84
> rt_spin_lock+0x40/0x100
> device_pm_check_callbacks+0x20/0xf0
> dev_pm_domain_set+0x54/0x64
> genpd_add_device+0x258/0x340
> __genpd_dev_pm_attach+0xa8/0x250
> genpd_dev_pm_attach_by_id+0xc4/0x190
> genpd_dev_pm_attach_by_name+0x3c/0x60
> dev_pm_domain_attach_by_name+0x20/0x30
> dt_idle_attach_cpu+0x24/0x90
> psci_cpuidle_probe+0x300/0x4b0
> platform_probe+0x68/0xe0
> really_probe+0xbc/0x2dc
> __driver_probe_device+0x78/0xe0
> driver_probe_device+0x3c/0x160
> __device_attach_driver+0xb8/0x140
> bus_for_each_drv+0x78/0xd0
> __device_attach+0xa8/0x1c0
> device_initial_probe+0x14/0x20
> bus_probe_device+0x9c/0xa4
> device_add+0x3b4/0x8dc
> platform_device_add+0x114/0x234
> platform_device_register_full+0x108/0x1a4
> psci_idle_init+0x6c/0xb0
> do_one_initcall+0x74/0x450
> kernel_init_freeable+0x2e0/0x350
> kernel_init+0x24/0x130
> ret_from_fork+0x10/0x20
>
> Cc: Adrien Thierry <[email protected]>
> Cc: Brian Masney <[email protected]>
> Cc: [email protected]
> Signed-off-by: Krzysztof Kozlowski <[email protected]>
> ---
> drivers/base/power/domain.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
> index 4dfce1d476f4..db499ba40497 100644
> --- a/drivers/base/power/domain.c
> +++ b/drivers/base/power/domain.c
> @@ -1666,10 +1666,14 @@ static int genpd_add_device(struct generic_pm_domain *genpd, struct device *dev,
> if (ret)
> goto out;
>
> +
> + /* PREEMPT_RT: Must be outside of genpd_lock */
> + device_pm_check_callbacks(dev);
> +
> genpd_lock(genpd);
>
> genpd_set_cpumask(genpd, gpd_data->cpu);
> - dev_pm_domain_set(dev, &genpd->domain);
> + dev_pm_domain_set_no_cb(dev, &genpd->domain);
>
> genpd->device_count++;
> if (gd)
Rather than splitting up the assignment in two steps, I think it
should be perfectly fine to move the call to dev_pm_domain_set()
outside the genpd lock.
Note that, genpd_add_device() is always being called with
gpd_list_lock mutex being held. This prevents the genpd from being
removed, while we use it here.
Moreover, we need a similar change for the call to dev_pm_domain_set()
in genpd_remove_device().
> --
> 2.34.1
>
Kind regards
Uffe
On Mon, 19 Dec 2022 at 16:15, Krzysztof Kozlowski
<[email protected]> wrote:
>
> The PSCI cpuidle power domain in power_off callback uses
> __this_cpu_write() so it is PREEMPT_RT safe. This allows to use it in
> Realtime kernels and solves errors like:
>
> BUG: scheduling while atomic: swapper/2/0/0x00000002
> Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
> Call trace:
> dump_backtrace.part.0+0xe0/0xf0
> show_stack+0x18/0x40
> dump_stack_lvl+0x68/0x84
> dump_stack+0x18/0x34
> __schedule_bug+0x60/0x80
> __schedule+0x628/0x800
> schedule_rtlock+0x28/0x5c
> rtlock_slowlock_locked+0x360/0xd30
> rt_spin_lock+0x88/0xb0
> genpd_lock_nested_spin+0x1c/0x30
> genpd_power_off.part.0.isra.0+0x20c/0x2a0
> genpd_runtime_suspend+0x150/0x2bc
> __rpm_callback+0x48/0x170
> rpm_callback+0x6c/0x7c
> rpm_suspend+0x108/0x660
> __pm_runtime_suspend+0x4c/0x8c
> __psci_enter_domain_idle_state.constprop.0+0x54/0xe0
> psci_enter_domain_idle_state+0x18/0x2c
> cpuidle_enter_state+0x8c/0x4e0
> cpuidle_enter+0x38/0x50
> do_idle+0x248/0x2f0
> cpu_startup_entry+0x24/0x30
> secondary_start_kernel+0x130/0x154
> __secondary_switched+0xb0/0xb4
>
> Cc: Adrien Thierry <[email protected]>
> Cc: Brian Masney <[email protected]>
> Cc: [email protected]
> Signed-off-by: Krzysztof Kozlowski <[email protected]>
> ---
> drivers/cpuidle/cpuidle-psci-domain.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpuidle/cpuidle-psci-domain.c b/drivers/cpuidle/cpuidle-psci-domain.c
> index c80cf9ddabd8..d15a91fb7048 100644
> --- a/drivers/cpuidle/cpuidle-psci-domain.c
> +++ b/drivers/cpuidle/cpuidle-psci-domain.c
> @@ -62,7 +62,8 @@ static int psci_pd_init(struct device_node *np, bool use_osi)
> if (!pd_provider)
> goto free_pd;
>
> - pd->flags |= GENPD_FLAG_IRQ_SAFE | GENPD_FLAG_CPU_DOMAIN;
> + pd->flags |= GENPD_FLAG_IRQ_SAFE | GENPD_FLAG_RT_SAFE | \
> + GENPD_FLAG_CPU_DOMAIN;
My main concern with this, is that it will affect the parent domains
too. Whether those would be able to use the GENPD_FLAG_RT_SAFE or not,
is a different story.
In one way or the other, I think it would be better to limit the
GENPD_FLAG_RT_SAFE to be used only for PREEMPT_RT kernels.
>
> /* Allow power off when OSI has been successfully enabled. */
> if (use_osi)
> --
> 2.34.1
>
Kind regards
Uffe
On 17/01/2023 16:27, Ulf Hansson wrote:
> On Mon, 19 Dec 2022 at 16:15, Krzysztof Kozlowski
> <[email protected]> wrote:
>>
>> The PSCI cpuidle power domain in power_off callback uses
>> __this_cpu_write() so it is PREEMPT_RT safe. This allows to use it in
>> Realtime kernels and solves errors like:
>>
>> BUG: scheduling while atomic: swapper/2/0/0x00000002
>> Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
>> Call trace:
>> dump_backtrace.part.0+0xe0/0xf0
>> show_stack+0x18/0x40
>> dump_stack_lvl+0x68/0x84
>> dump_stack+0x18/0x34
>> __schedule_bug+0x60/0x80
>> __schedule+0x628/0x800
>> schedule_rtlock+0x28/0x5c
>> rtlock_slowlock_locked+0x360/0xd30
>> rt_spin_lock+0x88/0xb0
>> genpd_lock_nested_spin+0x1c/0x30
>> genpd_power_off.part.0.isra.0+0x20c/0x2a0
>> genpd_runtime_suspend+0x150/0x2bc
>> __rpm_callback+0x48/0x170
>> rpm_callback+0x6c/0x7c
>> rpm_suspend+0x108/0x660
>> __pm_runtime_suspend+0x4c/0x8c
>> __psci_enter_domain_idle_state.constprop.0+0x54/0xe0
>> psci_enter_domain_idle_state+0x18/0x2c
>> cpuidle_enter_state+0x8c/0x4e0
>> cpuidle_enter+0x38/0x50
>> do_idle+0x248/0x2f0
>> cpu_startup_entry+0x24/0x30
>> secondary_start_kernel+0x130/0x154
>> __secondary_switched+0xb0/0xb4
>>
>> Cc: Adrien Thierry <[email protected]>
>> Cc: Brian Masney <[email protected]>
>> Cc: [email protected]
>> Signed-off-by: Krzysztof Kozlowski <[email protected]>
>> ---
>> drivers/cpuidle/cpuidle-psci-domain.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/cpuidle/cpuidle-psci-domain.c b/drivers/cpuidle/cpuidle-psci-domain.c
>> index c80cf9ddabd8..d15a91fb7048 100644
>> --- a/drivers/cpuidle/cpuidle-psci-domain.c
>> +++ b/drivers/cpuidle/cpuidle-psci-domain.c
>> @@ -62,7 +62,8 @@ static int psci_pd_init(struct device_node *np, bool use_osi)
>> if (!pd_provider)
>> goto free_pd;
>>
>> - pd->flags |= GENPD_FLAG_IRQ_SAFE | GENPD_FLAG_CPU_DOMAIN;
>> + pd->flags |= GENPD_FLAG_IRQ_SAFE | GENPD_FLAG_RT_SAFE | \
>> + GENPD_FLAG_CPU_DOMAIN;
>
> My main concern with this, is that it will affect the parent domains
> too. Whether those would be able to use the GENPD_FLAG_RT_SAFE or not,
> is a different story.
>
> In one way or the other, I think it would be better to limit the
> GENPD_FLAG_RT_SAFE to be used only for PREEMPT_RT kernels.
I can do it... or maybe we should just drop the flags (RT and IRQ safe)
when parent domain does not have it?
Best regards,
Krzysztof
On 17/01/2023 16:11, Ulf Hansson wrote:
> On Mon, 19 Dec 2022 at 16:15, Krzysztof Kozlowski
> <[email protected]> wrote:
>>
>> If PM domain on PREEMPT_RT is marked as GENPD_FLAG_RT_SAFE(), the
>> genpd_lock() will be a raw spin lock, thus device_pm_check_callbacks()
>> must be called outside of the domain lock.
>>
>> This solves on PREEMPT_RT:
>>
>> [ BUG: Invalid wait context ]
>> 6.1.0-rt5-00325-g8a5f56bcfcca #8 Tainted: G W
>> -----------------------------
>> swapper/0/1 is trying to lock:
>> ffff76e045dec9a0 (&dev->power.lock){+.+.}-{3:3}, at: device_pm_check_callbacks+0x20/0xf0
>> other info that might help us debug this:
>> context-{5:5}
>> 3 locks held by swapper/0/1:
>> #0: ffff76e045deb8e8 (&dev->mutex){....}-{4:4}, at: __device_attach+0x38/0x1c0
>> #1: ffffa92b81f825e0 (gpd_list_lock){+.+.}-{4:4}, at: __genpd_dev_pm_attach+0x7c/0x250
>> #2: ffff76e04105c7a0 (&genpd->rslock){....}-{2:2}, at: genpd_lock_rawspin+0x1c/0x30
>> stack backtrace:
>> CPU: 5 PID: 1 Comm: swapper/0 Tainted: G W 6.1.0-rt5-00325-g8a5f56bcfcca #8
>> Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
>> Call trace:
>> dump_backtrace.part.0+0xe0/0xf0
>> show_stack+0x18/0x40
>> dump_stack_lvl+0x8c/0xb8
>> dump_stack+0x18/0x34
>> __lock_acquire+0x938/0x2100
>> lock_acquire.part.0+0x104/0x28c
>> lock_acquire+0x68/0x84
>> rt_spin_lock+0x40/0x100
>> device_pm_check_callbacks+0x20/0xf0
>> dev_pm_domain_set+0x54/0x64
>> genpd_add_device+0x258/0x340
>> __genpd_dev_pm_attach+0xa8/0x250
>> genpd_dev_pm_attach_by_id+0xc4/0x190
>> genpd_dev_pm_attach_by_name+0x3c/0x60
>> dev_pm_domain_attach_by_name+0x20/0x30
>> dt_idle_attach_cpu+0x24/0x90
>> psci_cpuidle_probe+0x300/0x4b0
>> platform_probe+0x68/0xe0
>> really_probe+0xbc/0x2dc
>> __driver_probe_device+0x78/0xe0
>> driver_probe_device+0x3c/0x160
>> __device_attach_driver+0xb8/0x140
>> bus_for_each_drv+0x78/0xd0
>> __device_attach+0xa8/0x1c0
>> device_initial_probe+0x14/0x20
>> bus_probe_device+0x9c/0xa4
>> device_add+0x3b4/0x8dc
>> platform_device_add+0x114/0x234
>> platform_device_register_full+0x108/0x1a4
>> psci_idle_init+0x6c/0xb0
>> do_one_initcall+0x74/0x450
>> kernel_init_freeable+0x2e0/0x350
>> kernel_init+0x24/0x130
>> ret_from_fork+0x10/0x20
>>
>> Cc: Adrien Thierry <[email protected]>
>> Cc: Brian Masney <[email protected]>
>> Cc: [email protected]
>> Signed-off-by: Krzysztof Kozlowski <[email protected]>
>> ---
>> drivers/base/power/domain.c | 6 +++++-
>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
>> index 4dfce1d476f4..db499ba40497 100644
>> --- a/drivers/base/power/domain.c
>> +++ b/drivers/base/power/domain.c
>> @@ -1666,10 +1666,14 @@ static int genpd_add_device(struct generic_pm_domain *genpd, struct device *dev,
>> if (ret)
>> goto out;
>>
>> +
>> + /* PREEMPT_RT: Must be outside of genpd_lock */
>> + device_pm_check_callbacks(dev);
>> +
>> genpd_lock(genpd);
>>
>> genpd_set_cpumask(genpd, gpd_data->cpu);
>> - dev_pm_domain_set(dev, &genpd->domain);
>> + dev_pm_domain_set_no_cb(dev, &genpd->domain);
>>
>> genpd->device_count++;
>> if (gd)
>
> Rather than splitting up the assignment in two steps, I think it
> should be perfectly fine to move the call to dev_pm_domain_set()
> outside the genpd lock.
>
> Note that, genpd_add_device() is always being called with
> gpd_list_lock mutex being held. This prevents the genpd from being
> removed, while we use it here.
Hm, indeed, should be fine.
>
> Moreover, we need a similar change for the call to dev_pm_domain_set()
> in genpd_remove_device().
Right.
Best regards,
Krzysztof
On 19/01/2023 16:40, Krzysztof Kozlowski wrote:
> On 17/01/2023 16:27, Ulf Hansson wrote:
>> On Mon, 19 Dec 2022 at 16:15, Krzysztof Kozlowski
>> <[email protected]> wrote:
>>>
>>> The PSCI cpuidle power domain in power_off callback uses
>>> __this_cpu_write() so it is PREEMPT_RT safe. This allows to use it in
>>> Realtime kernels and solves errors like:
>>>
>>> BUG: scheduling while atomic: swapper/2/0/0x00000002
>>> Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
>>> Call trace:
>>> dump_backtrace.part.0+0xe0/0xf0
>>> show_stack+0x18/0x40
>>> dump_stack_lvl+0x68/0x84
>>> dump_stack+0x18/0x34
>>> __schedule_bug+0x60/0x80
>>> __schedule+0x628/0x800
>>> schedule_rtlock+0x28/0x5c
>>> rtlock_slowlock_locked+0x360/0xd30
>>> rt_spin_lock+0x88/0xb0
>>> genpd_lock_nested_spin+0x1c/0x30
>>> genpd_power_off.part.0.isra.0+0x20c/0x2a0
>>> genpd_runtime_suspend+0x150/0x2bc
>>> __rpm_callback+0x48/0x170
>>> rpm_callback+0x6c/0x7c
>>> rpm_suspend+0x108/0x660
>>> __pm_runtime_suspend+0x4c/0x8c
>>> __psci_enter_domain_idle_state.constprop.0+0x54/0xe0
>>> psci_enter_domain_idle_state+0x18/0x2c
>>> cpuidle_enter_state+0x8c/0x4e0
>>> cpuidle_enter+0x38/0x50
>>> do_idle+0x248/0x2f0
>>> cpu_startup_entry+0x24/0x30
>>> secondary_start_kernel+0x130/0x154
>>> __secondary_switched+0xb0/0xb4
>>>
>>> Cc: Adrien Thierry <[email protected]>
>>> Cc: Brian Masney <[email protected]>
>>> Cc: [email protected]
>>> Signed-off-by: Krzysztof Kozlowski <[email protected]>
>>> ---
>>> drivers/cpuidle/cpuidle-psci-domain.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/cpuidle/cpuidle-psci-domain.c b/drivers/cpuidle/cpuidle-psci-domain.c
>>> index c80cf9ddabd8..d15a91fb7048 100644
>>> --- a/drivers/cpuidle/cpuidle-psci-domain.c
>>> +++ b/drivers/cpuidle/cpuidle-psci-domain.c
>>> @@ -62,7 +62,8 @@ static int psci_pd_init(struct device_node *np, bool use_osi)
>>> if (!pd_provider)
>>> goto free_pd;
>>>
>>> - pd->flags |= GENPD_FLAG_IRQ_SAFE | GENPD_FLAG_CPU_DOMAIN;
>>> + pd->flags |= GENPD_FLAG_IRQ_SAFE | GENPD_FLAG_RT_SAFE | \
>>> + GENPD_FLAG_CPU_DOMAIN;
>>
>> My main concern with this, is that it will affect the parent domains
>> too. Whether those would be able to use the GENPD_FLAG_RT_SAFE or not,
>> is a different story.
>>
>> In one way or the other, I think it would be better to limit the
>> GENPD_FLAG_RT_SAFE to be used only for PREEMPT_RT kernels.
>
> I can do it... or maybe we should just drop the flags (RT and IRQ safe)
> when parent domain does not have it?
Actually, with next patch, I can skip this one entirely. This is needed
if PSCI cpuidle driver invokes runtime PM functions which eventually
puts PSCI cpuidle power domain into suspend/resume. If the former does
not happen, the domain driver won't be even called so my problem disappears.
Since I need patch 3/5 - effectively disabling PSCI cpuidle runtime PM -
we can drop this one, till we find a real user needing it.
Best regards,
Krzysztof
On 2023-01-12 12:34:35 [+0100], Krzysztof Kozlowski wrote:
> On 12/01/2023 12:09, Sebastian Andrzej Siewior wrote:
> > On 2022-12-19 16:15:01 [+0100], Krzysztof Kozlowski wrote:
> >> diff --git a/drivers/cpuidle/cpuidle-psci.c b/drivers/cpuidle/cpuidle-psci.c
> >> index 57bc3e3ae391..9d971cc4b12b 100644
> >> --- a/drivers/cpuidle/cpuidle-psci.c
> >> +++ b/drivers/cpuidle/cpuidle-psci.c
> >> @@ -72,7 +72,7 @@ static int __psci_enter_domain_idle_state(struct cpuidle_device *dev,
> >> ct_irq_enter_irqson();
> >> if (s2idle)
> >> dev_pm_genpd_suspend(pd_dev);
> >> - else
> >> + else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> >> pm_runtime_put_sync_suspend(pd_dev);
> >
> > So based on the commit description you run into a sleeping lock in
> > pm_runtime_put_sync_suspend() while the CPU is in an IRQ-off region.
> > Why is it okay to skip it on PREEMPT_RT?
>
> It is okay to skip it everywhere, you just don't get a suspended CPU.
> Why PREEMPT_RT is different here - having suspended CPU is a great way
> to have longer or even unpredictable (as it goes to firmware which is
> out of control of Linux) latencies.
On X86 C1 has a latency of less than 5us and this is exposed by the
firmware. C1E and everything that follows has a much higher entry/ exit
latency which makes not usable.
The entry/exit latency seems not to be exposed by PSCI. My understanding
is that the driver is now enabled but not doing any suspending, right?
If so, why isn't it completely disabled?
> Best regards,
> Krzysztof
Sebastian