2021-06-16 06:48:57

by Viresh Kumar

Subject: [PATCH V2 0/3] cpufreq: cppc: Add support for frequency invariance

Hello,

Changes since V1:

- A few patches migrating users to the ->exit() callback are posted separately.

- The CPPC patch was completely reverted earlier, so FIE support is added here
again from scratch.

- The start_cpu() and stop_cpu() interface is reworked a little so stop_cpu() is
only ever called for a CPU if start_cpu() was called for it earlier.

- A new patch to implement RCU locking in arch_topology core to avoid some
races.

- Some cleanup; the FIE paths in the cppc driver are now clearly separated.


-------------------------8<-------------------------

The CPPC cpufreq driver is used for ARM servers, and this patch series tries to
provide counter-based frequency invariance support for them in the absence of
architecture-specific counters (like AMUs).

This was reverted earlier for the 5.13 kernel after Qian Cai reported a kernel
oops during suspend/resume.

This is based on pm/linux-next plus a cleanup patchset:

https://lore.kernel.org/linux-pm/[email protected]/

All the patches are pushed together here for people to test:

https://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm.git/log/?h=cpufreq/cppc

This was tested on my Hikey platform (without the actual reads/writes to
performance counters), with this script running for over an hour:

while true; do
    for i in `seq 1 7`; do
        echo 0 > /sys/devices/system/cpu/cpu$i/online;
    done;

    for i in `seq 1 7`; do
        echo 1 > /sys/devices/system/cpu/cpu$i/online;
    done;
done


Vincent will be giving this patchset a try on ThunderX2.

Meanwhile it is up for review. Ideally I would like to get this merged for 5.14,
but let's see how it goes.

Thanks.

--
Viresh

Viresh Kumar (3):
cpufreq: Add start_cpu() and stop_cpu() callbacks
arch_topology: Avoid use-after-free for scale_freq_data
cpufreq: CPPC: Add support for frequency invariance

Documentation/cpu-freq/cpu-drivers.rst | 7 +-
drivers/base/arch_topology.c | 27 ++-
drivers/cpufreq/Kconfig.arm | 10 ++
drivers/cpufreq/cppc_cpufreq.c | 232 +++++++++++++++++++++++--
drivers/cpufreq/cpufreq.c | 19 +-
include/linux/arch_topology.h | 1 +
include/linux/cpufreq.h | 5 +-
kernel/sched/core.c | 1 +
8 files changed, 272 insertions(+), 30 deletions(-)

--
2.31.1.272.g89b43f80a514


2021-06-16 06:49:06

by Viresh Kumar

Subject: [PATCH V2 1/3] cpufreq: Add start_cpu() and stop_cpu() callbacks

On CPU hot-unplug, the cpufreq core doesn't call any driver-specific
callback unless all the CPUs of a policy have gone away, in which case we
call the stop_cpu() callback.

For the CPPC cpufreq driver, we need to perform per-CPU init/exit work
that can't be performed from the policy-specific init()/exit()
callbacks.

This patch adds a new callback, start_cpu(), and modifies the stop_cpu()
callback, to perform such CPU-specific work.

These routines are called whenever a CPU is added or removed from a
policy.

Note that this also moves the masking of policy->cpus to online CPUs
only outside of the rwsem, as start_cpu() needs to be called for online
CPUs only. This shouldn't have any side effects.

Signed-off-by: Viresh Kumar <[email protected]>
---
Documentation/cpu-freq/cpu-drivers.rst | 7 +++++--
drivers/cpufreq/cpufreq.c | 19 +++++++++++++++----
include/linux/cpufreq.h | 5 ++++-
3 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/Documentation/cpu-freq/cpu-drivers.rst b/Documentation/cpu-freq/cpu-drivers.rst
index a697278ce190..15cfe42b4075 100644
--- a/Documentation/cpu-freq/cpu-drivers.rst
+++ b/Documentation/cpu-freq/cpu-drivers.rst
@@ -71,8 +71,11 @@ And optionally
.exit - A pointer to a per-policy cleanup function called during
CPU_POST_DEAD phase of cpu hotplug process.

- .stop_cpu - A pointer to a per-policy stop function called during
- CPU_DOWN_PREPARE phase of cpu hotplug process.
+ .start_cpu - A pointer to a per-policy per-cpu start function called
+ during CPU online phase.
+
+ .stop_cpu - A pointer to a per-policy per-cpu stop function called
+ during CPU offline phase.

.suspend - A pointer to a per-policy suspend function which is called
with interrupts disabled and _after_ the governor is stopped for the
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 802abc925b2a..128dfb1b0cdf 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1119,6 +1119,10 @@ static int cpufreq_add_policy_cpu(struct cpufreq_policy *policy, unsigned int cp

cpumask_set_cpu(cpu, policy->cpus);

+ /* Do CPU specific initialization if required */
+ if (cpufreq_driver->start_cpu)
+ cpufreq_driver->start_cpu(policy, cpu);
+
if (has_target()) {
ret = cpufreq_start_governor(policy);
if (ret)
@@ -1375,13 +1379,19 @@ static int cpufreq_online(unsigned int cpu)
cpumask_copy(policy->related_cpus, policy->cpus);
}

- down_write(&policy->rwsem);
/*
* affected cpus must always be the one, which are online. We aren't
* managing offline cpus here.
*/
cpumask_and(policy->cpus, policy->cpus, cpu_online_mask);

+ /* Do CPU specific initialization if required */
+ if (cpufreq_driver->start_cpu) {
+ for_each_cpu(j, policy->cpus)
+ cpufreq_driver->start_cpu(policy, j);
+ }
+
+ down_write(&policy->rwsem);
if (new_policy) {
for_each_cpu(j, policy->related_cpus) {
per_cpu(cpufreq_cpu_data, j) = policy;
@@ -1581,6 +1591,10 @@ static int cpufreq_offline(unsigned int cpu)
policy->cpu = cpumask_any(policy->cpus);
}

+ /* Do CPU specific de-initialization if required */
+ if (cpufreq_driver->stop_cpu)
+ cpufreq_driver->stop_cpu(policy, cpu);
+
/* Start governor again for active policy */
if (!policy_is_inactive(policy)) {
if (has_target()) {
@@ -1597,9 +1611,6 @@ static int cpufreq_offline(unsigned int cpu)
policy->cdev = NULL;
}

- if (cpufreq_driver->stop_cpu)
- cpufreq_driver->stop_cpu(policy);
-
if (has_target())
cpufreq_exit_governor(policy);

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 353969c7acd3..c281b3df4e2f 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -371,7 +371,10 @@ struct cpufreq_driver {
int (*online)(struct cpufreq_policy *policy);
int (*offline)(struct cpufreq_policy *policy);
int (*exit)(struct cpufreq_policy *policy);
- void (*stop_cpu)(struct cpufreq_policy *policy);
+
+ /* CPU specific start/stop */
+ void (*start_cpu)(struct cpufreq_policy *policy, unsigned int cpu);
+ void (*stop_cpu)(struct cpufreq_policy *policy, unsigned int cpu);
int (*suspend)(struct cpufreq_policy *policy);
int (*resume)(struct cpufreq_policy *policy);

--
2.31.1.272.g89b43f80a514

2021-06-16 06:49:43

by Viresh Kumar

Subject: [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data

Currently topology_scale_freq_tick() may end up using a pointer to
struct scale_freq_data that was previously cleared by
topology_clear_scale_freq_source(), as there is no protection in place
here. The users of topology_clear_scale_freq_source(), though, need a
guarantee that the previous scale_freq_data isn't used anymore.

Since topology_scale_freq_tick() is called from the scheduler tick, we don't
want to add locking in there. Use the RCU update mechanism instead
(which is already used by the scheduler's utilization update path) to
guarantee race-free updates here.

Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Viresh Kumar <[email protected]>
---
drivers/base/arch_topology.c | 27 +++++++++++++++++++++------
1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index c1179edc0f3b..921312a8d957 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -18,10 +18,11 @@
#include <linux/cpumask.h>
#include <linux/init.h>
#include <linux/percpu.h>
+#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/smp.h>

-static DEFINE_PER_CPU(struct scale_freq_data *, sft_data);
+static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
static struct cpumask scale_freq_counters_mask;
static bool scale_freq_invariant;

@@ -66,16 +67,20 @@ void topology_set_scale_freq_source(struct scale_freq_data *data,
if (cpumask_empty(&scale_freq_counters_mask))
scale_freq_invariant = topology_scale_freq_invariant();

+ rcu_read_lock();
+
for_each_cpu(cpu, cpus) {
- sfd = per_cpu(sft_data, cpu);
+ sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));

/* Use ARCH provided counters whenever possible */
if (!sfd || sfd->source != SCALE_FREQ_SOURCE_ARCH) {
- per_cpu(sft_data, cpu) = data;
+ rcu_assign_pointer(per_cpu(sft_data, cpu), data);
cpumask_set_cpu(cpu, &scale_freq_counters_mask);
}
}

+ rcu_read_unlock();
+
update_scale_freq_invariant(true);
}
EXPORT_SYMBOL_GPL(topology_set_scale_freq_source);
@@ -86,22 +91,32 @@ void topology_clear_scale_freq_source(enum scale_freq_source source,
struct scale_freq_data *sfd;
int cpu;

+ rcu_read_lock();
+
for_each_cpu(cpu, cpus) {
- sfd = per_cpu(sft_data, cpu);
+ sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));

if (sfd && sfd->source == source) {
- per_cpu(sft_data, cpu) = NULL;
+ rcu_assign_pointer(per_cpu(sft_data, cpu), NULL);
cpumask_clear_cpu(cpu, &scale_freq_counters_mask);
}
}

+ rcu_read_unlock();
+
+ /*
+ * Make sure all references to previous sft_data are dropped to avoid
+ * use-after-free races.
+ */
+ synchronize_rcu();
+
update_scale_freq_invariant(false);
}
EXPORT_SYMBOL_GPL(topology_clear_scale_freq_source);

void topology_scale_freq_tick(void)
{
- struct scale_freq_data *sfd = *this_cpu_ptr(&sft_data);
+ struct scale_freq_data *sfd = rcu_dereference_sched(*this_cpu_ptr(&sft_data));

if (sfd)
sfd->set_freq_scale();
--
2.31.1.272.g89b43f80a514

2021-06-16 06:49:52

by Viresh Kumar

Subject: [PATCH V2 3/3] cpufreq: CPPC: Add support for frequency invariance

The Frequency Invariance Engine (FIE) provides a frequency scaling
correction factor that helps achieve more accurate load-tracking.

Normally, this scaling factor can be obtained directly with the help of
the cpufreq drivers as they know the exact frequency the hardware is
running at. But that isn't the case for the CPPC cpufreq driver.

Another way of obtaining it is to use the arch-specific counter
support, which is already present in the kernel, but such hardware is
optional for platforms.

This patch updates the CPPC driver to register itself with the topology
core to provide its own implementation (cppc_scale_freq_tick()) of
topology_scale_freq_tick() which gets called by the scheduler on every
tick. Note that the arch specific counters have higher priority than
CPPC counters, if available, though the CPPC driver doesn't need to have
any special handling for that.

On an invocation of cppc_scale_freq_tick(), we schedule an irq work
(since we reach there from hard-irq context), which then schedules a
normal work item; cppc_scale_freq_workfn() then updates the per_cpu
arch_freq_scale variable based on the counter updates since the last
tick.

To allow platforms to disable this CPPC counter-based frequency
invariance support, this is all done under CONFIG_ACPI_CPPC_CPUFREQ_FIE,
which is enabled by default.

This also exports sched_setattr_nocheck() as the CPPC driver can be
built as a module.

Cc: [email protected]
Signed-off-by: Viresh Kumar <[email protected]>
---
drivers/cpufreq/Kconfig.arm | 10 ++
drivers/cpufreq/cppc_cpufreq.c | 232 ++++++++++++++++++++++++++++++---
include/linux/arch_topology.h | 1 +
kernel/sched/core.c | 1 +
4 files changed, 227 insertions(+), 17 deletions(-)

diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm
index e65e0a43be64..a5c5f70acfc9 100644
--- a/drivers/cpufreq/Kconfig.arm
+++ b/drivers/cpufreq/Kconfig.arm
@@ -19,6 +19,16 @@ config ACPI_CPPC_CPUFREQ

If in doubt, say N.

+config ACPI_CPPC_CPUFREQ_FIE
+ bool "Frequency Invariance support for CPPC cpufreq driver"
+ depends on ACPI_CPPC_CPUFREQ && GENERIC_ARCH_TOPOLOGY
+ default y
+ help
+ This extends frequency invariance support in the CPPC cpufreq driver,
+ by using CPPC delivered and reference performance counters.
+
+ If in doubt, say N.
+
config ARM_ALLWINNER_SUN50I_CPUFREQ_NVMEM
tristate "Allwinner nvmem based SUN50I CPUFreq driver"
depends on ARCH_SUNXI
diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index be4f62e2c5f1..63e4cff46f95 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -10,14 +10,18 @@

#define pr_fmt(fmt) "CPPC Cpufreq:" fmt

+#include <linux/arch_topology.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/delay.h>
#include <linux/cpu.h>
#include <linux/cpufreq.h>
#include <linux/dmi.h>
+#include <linux/irq_work.h>
+#include <linux/kthread.h>
#include <linux/time.h>
#include <linux/vmalloc.h>
+#include <uapi/linux/sched/types.h>

#include <asm/unaligned.h>

@@ -57,6 +61,183 @@ static struct cppc_workaround_oem_info wa_info[] = {
}
};

+#ifdef CONFIG_ACPI_CPPC_CPUFREQ_FIE
+
+/* Frequency invariance support */
+struct cppc_freq_invariance {
+ int cpu;
+ struct irq_work irq_work;
+ struct kthread_work work;
+ struct cppc_perf_fb_ctrs prev_perf_fb_ctrs;
+ struct cppc_cpudata *cpu_data;
+};
+
+static DEFINE_PER_CPU(struct cppc_freq_invariance, cppc_freq_inv);
+static struct kthread_worker *kworker_fie;
+
+static struct cpufreq_driver cppc_cpufreq_driver;
+static unsigned int hisi_cppc_cpufreq_get_rate(unsigned int cpu);
+static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,
+ struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+ struct cppc_perf_fb_ctrs *fb_ctrs_t1);
+
+/**
+ * cppc_scale_freq_workfn - CPPC arch_freq_scale updater for frequency invariance
+ * @work: The work item.
+ *
+ * The CPPC driver registers itself with the topology core to provide its own
+ * implementation (cppc_scale_freq_tick()) of topology_scale_freq_tick() which
+ * gets called by the scheduler on every tick.
+ *
+ * Note that the arch specific counters have higher priority than CPPC counters,
+ * if available, though the CPPC driver doesn't need to have any special
+ * handling for that.
+ *
+ * On an invocation of cppc_scale_freq_tick(), we schedule an irq work (since we
+ * reach here from hard-irq context), which then schedules a normal work item
+ * and cppc_scale_freq_workfn() updates the per_cpu arch_freq_scale variable
+ * based on the counter updates since the last tick.
+ */
+static void cppc_scale_freq_workfn(struct kthread_work *work)
+{
+ struct cppc_freq_invariance *cppc_fi;
+ struct cppc_perf_fb_ctrs fb_ctrs = {0};
+ struct cppc_cpudata *cpu_data;
+ unsigned long local_freq_scale;
+ u64 perf;
+
+ cppc_fi = container_of(work, struct cppc_freq_invariance, work);
+ cpu_data = cppc_fi->cpu_data;
+
+ if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) {
+ pr_warn("%s: failed to read perf counters\n", __func__);
+ return;
+ }
+
+ perf = cppc_perf_from_fbctrs(cpu_data, &cppc_fi->prev_perf_fb_ctrs,
+ &fb_ctrs);
+ cppc_fi->prev_perf_fb_ctrs = fb_ctrs;
+
+ perf <<= SCHED_CAPACITY_SHIFT;
+ local_freq_scale = div64_u64(perf, cpu_data->perf_caps.highest_perf);
+ if (WARN_ON(local_freq_scale > 1024))
+ local_freq_scale = 1024;
+
+ per_cpu(arch_freq_scale, cppc_fi->cpu) = local_freq_scale;
+}
+
+static void cppc_irq_work(struct irq_work *irq_work)
+{
+ struct cppc_freq_invariance *cppc_fi;
+
+ cppc_fi = container_of(irq_work, struct cppc_freq_invariance, irq_work);
+ kthread_queue_work(kworker_fie, &cppc_fi->work);
+}
+
+static void cppc_scale_freq_tick(void)
+{
+ struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, smp_processor_id());
+
+ /*
+ * cppc_get_perf_ctrs() can potentially sleep, call that from the right
+ * context.
+ */
+ irq_work_queue(&cppc_fi->irq_work);
+}
+
+static struct scale_freq_data cppc_sftd = {
+ .source = SCALE_FREQ_SOURCE_CPPC,
+ .set_freq_scale = cppc_scale_freq_tick,
+};
+
+static void cppc_cpufreq_start_cpu(struct cpufreq_policy *policy,
+ unsigned int cpu)
+{
+ struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, cpu);
+ int ret;
+
+ cppc_fi->cpu = cpu;
+ cppc_fi->cpu_data = policy->driver_data;
+ kthread_init_work(&cppc_fi->work, cppc_scale_freq_workfn);
+ init_irq_work(&cppc_fi->irq_work, cppc_irq_work);
+
+ ret = cppc_get_perf_ctrs(cpu, &cppc_fi->prev_perf_fb_ctrs);
+ if (ret) {
+ pr_warn("%s: failed to read perf counters: %d\n", __func__,
+ ret);
+ return;
+ }
+
+ /* Register for freq-invariance */
+ topology_set_scale_freq_source(&cppc_sftd, cpumask_of(cpu));
+}
+
+static void cppc_cpufreq_stop_cpu(struct cpufreq_policy *policy,
+ unsigned int cpu)
+{
+ struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, cpu);
+
+ topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, cpumask_of(cpu));
+
+ irq_work_sync(&cppc_fi->irq_work);
+ kthread_cancel_work_sync(&cppc_fi->work);
+}
+
+static void __init cppc_freq_invariance_init(void)
+{
+ struct sched_attr attr = {
+ .size = sizeof(struct sched_attr),
+ .sched_policy = SCHED_DEADLINE,
+ .sched_nice = 0,
+ .sched_priority = 0,
+ /*
+ * Fake (unused) bandwidth; workaround to "fix"
+ * priority inheritance.
+ */
+ .sched_runtime = 1000000,
+ .sched_deadline = 10000000,
+ .sched_period = 10000000,
+ };
+ int ret;
+
+ if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+ return;
+
+ kworker_fie = kthread_create_worker(0, "cppc_fie");
+ if (IS_ERR(kworker_fie))
+ return;
+
+ ret = sched_setattr_nocheck(kworker_fie->task, &attr);
+ if (ret) {
+ pr_warn("%s: failed to set SCHED_DEADLINE: %d\n", __func__,
+ ret);
+ kthread_destroy_worker(kworker_fie);
+ return;
+ }
+
+ cppc_cpufreq_driver.start_cpu = cppc_cpufreq_start_cpu;
+ cppc_cpufreq_driver.stop_cpu = cppc_cpufreq_stop_cpu;
+}
+
+static void cppc_freq_invariance_exit(void)
+{
+ if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+ return;
+
+ kthread_destroy_worker(kworker_fie);
+ kworker_fie = NULL;
+}
+
+#else
+static inline void cppc_freq_invariance_init(void)
+{
+}
+
+static inline void cppc_freq_invariance_exit(void)
+{
+}
+#endif /* CONFIG_ACPI_CPPC_CPUFREQ_FIE */
+
/* Callback function used to retrieve the max frequency from DMI */
static void cppc_find_dmi_mhz(const struct dmi_header *dm, void *private)
{
@@ -362,26 +543,35 @@ static inline u64 get_delta(u64 t1, u64 t0)
return (u32)t1 - (u32)t0;
}

-static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,
- struct cppc_perf_fb_ctrs fb_ctrs_t0,
- struct cppc_perf_fb_ctrs fb_ctrs_t1)
+static int cppc_perf_from_fbctrs(struct cppc_cpudata *cpu_data,
+ struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+ struct cppc_perf_fb_ctrs *fb_ctrs_t1)
{
u64 delta_reference, delta_delivered;
- u64 reference_perf, delivered_perf;
+ u64 reference_perf;

- reference_perf = fb_ctrs_t0.reference_perf;
+ reference_perf = fb_ctrs_t0->reference_perf;

- delta_reference = get_delta(fb_ctrs_t1.reference,
- fb_ctrs_t0.reference);
- delta_delivered = get_delta(fb_ctrs_t1.delivered,
- fb_ctrs_t0.delivered);
+ delta_reference = get_delta(fb_ctrs_t1->reference,
+ fb_ctrs_t0->reference);
+ delta_delivered = get_delta(fb_ctrs_t1->delivered,
+ fb_ctrs_t0->delivered);

- /* Check to avoid divide-by zero */
- if (delta_reference || delta_delivered)
- delivered_perf = (reference_perf * delta_delivered) /
- delta_reference;
- else
- delivered_perf = cpu_data->perf_ctrls.desired_perf;
+ /* Check to avoid divide-by zero and invalid delivered_perf */
+ if (!delta_reference || !delta_delivered)
+ return cpu_data->perf_ctrls.desired_perf;
+
+ return (reference_perf * delta_delivered) / delta_reference;
+}
+
+static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu_data,
+ struct cppc_perf_fb_ctrs *fb_ctrs_t0,
+ struct cppc_perf_fb_ctrs *fb_ctrs_t1)
+{
+ u64 delivered_perf;
+
+ delivered_perf = cppc_perf_from_fbctrs(cpu_data, fb_ctrs_t0,
+ fb_ctrs_t1);

return cppc_cpufreq_perf_to_khz(cpu_data, delivered_perf);
}
@@ -405,7 +595,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
if (ret)
return ret;

- return cppc_get_rate_from_fbctrs(cpu_data, fb_ctrs_t0, fb_ctrs_t1);
+ return cppc_get_rate_from_fbctrs(cpu_data, &fb_ctrs_t0, &fb_ctrs_t1);
}

static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
@@ -506,14 +696,21 @@ static void cppc_check_hisi_workaround(void)

static int __init cppc_cpufreq_init(void)
{
+ int ret;
+
if ((acpi_disabled) || !acpi_cpc_valid())
return -ENODEV;

INIT_LIST_HEAD(&cpu_data_list);

cppc_check_hisi_workaround();
+ cppc_freq_invariance_init();

- return cpufreq_register_driver(&cppc_cpufreq_driver);
+ ret = cpufreq_register_driver(&cppc_cpufreq_driver);
+ if (ret)
+ cppc_freq_invariance_exit();
+
+ return ret;
}

static inline void free_cpu_data(void)
@@ -531,6 +728,7 @@ static inline void free_cpu_data(void)
static void __exit cppc_cpufreq_exit(void)
{
cpufreq_unregister_driver(&cppc_cpufreq_driver);
+ cppc_freq_invariance_exit();

free_cpu_data();
}
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 11e555cfaecb..f180240dc95f 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -37,6 +37,7 @@ bool topology_scale_freq_invariant(void);
enum scale_freq_source {
SCALE_FREQ_SOURCE_CPUFREQ = 0,
SCALE_FREQ_SOURCE_ARCH,
+ SCALE_FREQ_SOURCE_CPPC,
};

struct scale_freq_data {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ca80df205ce..5226cc26a095 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6389,6 +6389,7 @@ int sched_setattr_nocheck(struct task_struct *p, const struct sched_attr *attr)
{
return __sched_setscheduler(p, attr, false, true);
}
+EXPORT_SYMBOL_GPL(sched_setattr_nocheck);

/**
* sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
--
2.31.1.272.g89b43f80a514

2021-06-16 07:59:04

by Greg Kroah-Hartman

Subject: Re: [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data

On Wed, Jun 16, 2021 at 12:18:08PM +0530, Viresh Kumar wrote:
> Currently topology_scale_freq_tick() may end up using a pointer to
> struct scale_freq_data, which was previously cleared by
> topology_clear_scale_freq_source(), as there is no protection in place
> here. The users of topology_clear_scale_freq_source() though needs a
> guarantee that the previous scale_freq_data isn't used anymore.
>
> Since topology_scale_freq_tick() is called from scheduler tick, we don't
> want to add locking in there. Use the RCU update mechanism instead
> (which is already used by the scheduler's utilization update path) to
> guarantee race free updates here.
>
> Cc: Paul E. McKenney <[email protected]>
> Signed-off-by: Viresh Kumar <[email protected]>

So this is a bugfix for problems in the current codebase? What commit
does this fix? Should it go to the stable kernels?

> ---
> drivers/base/arch_topology.c | 27 +++++++++++++++++++++------
> 1 file changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index c1179edc0f3b..921312a8d957 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -18,10 +18,11 @@
> #include <linux/cpumask.h>
> #include <linux/init.h>
> #include <linux/percpu.h>
> +#include <linux/rcupdate.h>
> #include <linux/sched.h>
> #include <linux/smp.h>
>
> -static DEFINE_PER_CPU(struct scale_freq_data *, sft_data);
> +static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
> static struct cpumask scale_freq_counters_mask;
> static bool scale_freq_invariant;
>
> @@ -66,16 +67,20 @@ void topology_set_scale_freq_source(struct scale_freq_data *data,
> if (cpumask_empty(&scale_freq_counters_mask))
> scale_freq_invariant = topology_scale_freq_invariant();
>
> + rcu_read_lock();
> +
> for_each_cpu(cpu, cpus) {
> - sfd = per_cpu(sft_data, cpu);
> + sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));
>
> /* Use ARCH provided counters whenever possible */
> if (!sfd || sfd->source != SCALE_FREQ_SOURCE_ARCH) {
> - per_cpu(sft_data, cpu) = data;
> + rcu_assign_pointer(per_cpu(sft_data, cpu), data);
> cpumask_set_cpu(cpu, &scale_freq_counters_mask);
> }
> }
>
> + rcu_read_unlock();
> +
> update_scale_freq_invariant(true);
> }
> EXPORT_SYMBOL_GPL(topology_set_scale_freq_source);
> @@ -86,22 +91,32 @@ void topology_clear_scale_freq_source(enum scale_freq_source source,
> struct scale_freq_data *sfd;
> int cpu;
>
> + rcu_read_lock();
> +
> for_each_cpu(cpu, cpus) {
> - sfd = per_cpu(sft_data, cpu);
> + sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));
>
> if (sfd && sfd->source == source) {
> - per_cpu(sft_data, cpu) = NULL;
> + rcu_assign_pointer(per_cpu(sft_data, cpu), NULL);
> cpumask_clear_cpu(cpu, &scale_freq_counters_mask);
> }
> }
>
> + rcu_read_unlock();
> +
> + /*
> + * Make sure all references to previous sft_data are dropped to avoid
> + * use-after-free races.
> + */
> + synchronize_rcu();

What race is happening? How could the current code race? Only when a
cpu is removed?

thanks,

greg k-h

2021-06-16 08:20:56

by Viresh Kumar

Subject: Re: [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data

On 16-06-21, 09:57, Greg Kroah-Hartman wrote:
> On Wed, Jun 16, 2021 at 12:18:08PM +0530, Viresh Kumar wrote:
> > Currently topology_scale_freq_tick() may end up using a pointer to
> > struct scale_freq_data, which was previously cleared by
> > topology_clear_scale_freq_source(), as there is no protection in place
> > here. The users of topology_clear_scale_freq_source() though needs a
> > guarantee that the previous scale_freq_data isn't used anymore.
> >
> > Since topology_scale_freq_tick() is called from scheduler tick, we don't
> > want to add locking in there. Use the RCU update mechanism instead
> > (which is already used by the scheduler's utilization update path) to
> > guarantee race free updates here.
> >
> > Cc: Paul E. McKenney <[email protected]>
> > Signed-off-by: Viresh Kumar <[email protected]>
>
> So this is a bugfix for problems in the current codebase? What commit
> does this fix? Should it go to the stable kernels?

There is only one user of topology_clear_scale_freq_source() (the
cppc-cpufreq driver, which is already reverted in pm/linux-next). So
in the upcoming 5.13 kernel release there will be no one using this
API, and so nothing will break.

And so I skipped the Fixes tag; I can add it though.

> > ---
> > drivers/base/arch_topology.c | 27 +++++++++++++++++++++------
> > 1 file changed, 21 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > index c1179edc0f3b..921312a8d957 100644
> > --- a/drivers/base/arch_topology.c
> > +++ b/drivers/base/arch_topology.c
> > @@ -18,10 +18,11 @@
> > #include <linux/cpumask.h>
> > #include <linux/init.h>
> > #include <linux/percpu.h>
> > +#include <linux/rcupdate.h>
> > #include <linux/sched.h>
> > #include <linux/smp.h>
> >
> > -static DEFINE_PER_CPU(struct scale_freq_data *, sft_data);
> > +static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
> > static struct cpumask scale_freq_counters_mask;
> > static bool scale_freq_invariant;
> >
> > @@ -66,16 +67,20 @@ void topology_set_scale_freq_source(struct scale_freq_data *data,
> > if (cpumask_empty(&scale_freq_counters_mask))
> > scale_freq_invariant = topology_scale_freq_invariant();
> >
> > + rcu_read_lock();
> > +
> > for_each_cpu(cpu, cpus) {
> > - sfd = per_cpu(sft_data, cpu);
> > + sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));
> >
> > /* Use ARCH provided counters whenever possible */
> > if (!sfd || sfd->source != SCALE_FREQ_SOURCE_ARCH) {
> > - per_cpu(sft_data, cpu) = data;
> > + rcu_assign_pointer(per_cpu(sft_data, cpu), data);
> > cpumask_set_cpu(cpu, &scale_freq_counters_mask);
> > }
> > }
> >
> > + rcu_read_unlock();
> > +
> > update_scale_freq_invariant(true);
> > }
> > EXPORT_SYMBOL_GPL(topology_set_scale_freq_source);
> > @@ -86,22 +91,32 @@ void topology_clear_scale_freq_source(enum scale_freq_source source,
> > struct scale_freq_data *sfd;
> > int cpu;
> >
> > + rcu_read_lock();
> > +
> > for_each_cpu(cpu, cpus) {
> > - sfd = per_cpu(sft_data, cpu);
> > + sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));
> >
> > if (sfd && sfd->source == source) {
> > - per_cpu(sft_data, cpu) = NULL;
> > + rcu_assign_pointer(per_cpu(sft_data, cpu), NULL);
> > cpumask_clear_cpu(cpu, &scale_freq_counters_mask);
> > }
> > }
> >
> > + rcu_read_unlock();
> > +
> > + /*
> > + * Make sure all references to previous sft_data are dropped to avoid
> > + * use-after-free races.
> > + */
> > + synchronize_rcu();
>
> What race is happening? How could the current code race? Only when a
> cpu is removed?

topology_scale_freq_tick() is called by the scheduler for each CPU
from scheduler_tick().

It is possible that topology_scale_freq_tick() ends up using an older
copy of the sft_data pointer while it is being removed by
topology_clear_scale_freq_source(), because a CPU went away, a
cpufreq driver went away, or during normal suspend/resume (where CPUs
are hot-unplugged).

synchronize_rcu() makes sure that all RCU critical sections that
started before it was called will finish before it returns. And so the
callers of topology_clear_scale_freq_source() don't need to worry
about their callback getting called anymore.

--
viresh

2021-06-16 08:35:22

by Greg Kroah-Hartman

Subject: Re: [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data

On Wed, Jun 16, 2021 at 01:48:59PM +0530, Viresh Kumar wrote:
> On 16-06-21, 09:57, Greg Kroah-Hartman wrote:
> > On Wed, Jun 16, 2021 at 12:18:08PM +0530, Viresh Kumar wrote:
> > > Currently topology_scale_freq_tick() may end up using a pointer to
> > > struct scale_freq_data, which was previously cleared by
> > > topology_clear_scale_freq_source(), as there is no protection in place
> > > here. The users of topology_clear_scale_freq_source() though needs a
> > > guarantee that the previous scale_freq_data isn't used anymore.
> > >
> > > Since topology_scale_freq_tick() is called from scheduler tick, we don't
> > > want to add locking in there. Use the RCU update mechanism instead
> > > (which is already used by the scheduler's utilization update path) to
> > > guarantee race free updates here.
> > >
> > > Cc: Paul E. McKenney <[email protected]>
> > > Signed-off-by: Viresh Kumar <[email protected]>
> >
> > So this is a bugfix for problems in the current codebase? What commit
> > does this fix? Should it go to the stable kernels?
>
> There is only one user of topology_clear_scale_freq_source()
> (cppc-cpufreq driver, which is already reverted in pm/linux-next). So
> in the upcoming 5.13 kernel release, there will be no one using this
> API and so no one will break.
>
> And so I skipped the fixes tag, I can add it though.

It would be nice to have, to answer this type of question; otherwise you
will have automated scripts trying to backport this to kernels where it
does not belong :)

thanks,

greg k-h

2021-06-16 09:12:51

by Viresh Kumar

Subject: Re: [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data

On 16-06-21, 10:31, Greg Kroah-Hartman wrote:
> On Wed, Jun 16, 2021 at 01:48:59PM +0530, Viresh Kumar wrote:
> > On 16-06-21, 09:57, Greg Kroah-Hartman wrote:
> > > On Wed, Jun 16, 2021 at 12:18:08PM +0530, Viresh Kumar wrote:
> > > > Currently topology_scale_freq_tick() may end up using a pointer to
> > > > struct scale_freq_data, which was previously cleared by
> > > > topology_clear_scale_freq_source(), as there is no protection in place
> > > > here. The users of topology_clear_scale_freq_source() though needs a
> > > > guarantee that the previous scale_freq_data isn't used anymore.
> > > >
> > > > Since topology_scale_freq_tick() is called from scheduler tick, we don't
> > > > want to add locking in there. Use the RCU update mechanism instead
> > > > (which is already used by the scheduler's utilization update path) to
> > > > guarantee race free updates here.
> > > >
> > > > Cc: Paul E. McKenney <[email protected]>
> > > > Signed-off-by: Viresh Kumar <[email protected]>
> > >
> > > So this is a bugfix for problems in the current codebase? What commit
> > > does this fix? Should it go to the stable kernels?
> >
> > There is only one user of topology_clear_scale_freq_source()
> > (cppc-cpufreq driver, which is already reverted in pm/linux-next). So
> > in the upcoming 5.13 kernel release, there will be no one using this
> > API, and so nothing will break.
> >
> > And so I skipped the Fixes tag; I can add it, though.
>
> It would be nice to have the answer to this type of question; otherwise you
> will have automated scripts trying to backport this to kernels where it
> does not belong :)

Sure, I will add these to the next version.

--
viresh

2021-06-16 10:04:42

by Vincent Guittot

Subject: Re: [PATCH V2 0/3] cpufreq: cppc: Add support for frequency invariance

On Wed, 16 Jun 2021 at 08:48, Viresh Kumar <[email protected]> wrote:
>
> Hello,
>
> Changes since V1:
>
> - Few of the patches migrating users to ->exit() callback are posted separately.
>
> - The CPPC patch was completely reverted and so the support for FIE is again
> added here from scratch.
>
> - The start_cpu() and stop_cpu() interface is reworked a little so stop_cpu() is
> only ever called for a CPU if start_cpu() was called for it earlier.
>
> - A new patch to implement RCU locking in arch_topology core to avoid some
> races.
>
> - Some cleanup and very clear/separate paths for FIE in cppc driver now.
>
>
> -------------------------8<-------------------------
>
> The CPPC cpufreq driver is used for ARM servers, and this patch series tries to
> provide counter-based frequency invariance support for them in the absence of
> architecture-specific counters (like AMUs).
>
> This was reverted earlier for the 5.13 kernel after Qian Cai reported a kernel
> oops during suspend/resume.
>
> This is based on pm/linux-next + a cleanup patchset:
>
> https://lore.kernel.org/linux-pm/[email protected]/
>
> All the patches are pushed here together for people to run.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm.git/log/?h=cpufreq/cppc
>
> This is tested on my Hikey platform (without the actual read/write to
> performance counters), with this script for over an hour:
>
> while true; do
> for i in `seq 1 7`;
> do
> echo 0 > /sys/devices/system/cpu/cpu$i/online;
> done;
>
> for i in `seq 1 7`;
> do
> echo 1 > /sys/devices/system/cpu/cpu$i/online;
> done;
> done
>
>
> Vincent will be giving this patchset a try on ThunderX2.

I tested your branch and got the following while booting:

[ 24.454543] zswap: loaded using pool lzo/zbud
[ 24.454753] pstore: Using crash dump compression: deflate
[ 24.454776] AppArmor: AppArmor sha1 policy hashing enabled
[ 24.454784] ima: No TPM chip found, activating TPM-bypass!
[ 24.454789] ima: Allocated hash algorithm: sha256
[ 24.454801] ima: No architecture policies found
[ 24.455750] pcieport 0000:0f:00.0: Adding to iommu group 0
[ 24.893888] ------------[ cut here ]------------
[ 24.893891] WARNING: CPU: 95 PID: 1442 at
drivers/cpufreq/cppc_cpufreq.c:123 cppc_scale_freq_workfn+0xc8/0xf8
[ 24.893901] Modules linked in:
[ 24.893906] CPU: 95 PID: 1442 Comm: cppc_fie Not tainted 5.13.0-rc6+ #359
[ 24.893910] Hardware name: To be filled by O.E.M. Saber/Saber, BIOS
0ACKL026 03/19/2019
[ 24.893912] pstate: 20400009 (nzCv daif +PAN -UAO -TCO BTYPE=--)
[ 24.893915] pc : cppc_scale_freq_workfn+0xc8/0xf8
[ 24.893918] lr : cppc_scale_freq_workfn+0x5c/0xf8
[ 24.893921] sp : ffff80003727bd90
[ 24.893922] x29: ffff80003727bd90 x28: 0000000000000000 x27: ffff800010ec2000
[ 24.893928] x26: ffff800010ec2000 x25: ffff8000107c3d90 x24: 0000000000000001
[ 24.893932] x23: ffff000816244880 x22: ffff8000113f9000 x21: ffff009f825a0a80
[ 24.893935] x20: ffff009efc394220 x19: ffff800011199000 x18: 000000000000001b
[ 24.893939] x17: 0000000000000007 x16: 0000000000000001 x15: 00000000000000bf
[ 24.893943] x14: 0000000000000016 x13: 000000000000029b x12: 0000000000000016
[ 24.893946] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff009efc6958c0
[ 24.893950] x8 : ffff009efc394248 x7 : 0000000002bde780 x6 : 00000000ffffffff
[ 24.893954] x5 : 00000000916e502a x4 : 00000000d9730e80 x3 : ffffffffffffffff
[ 24.893958] x2 : 00000000001e8480 x1 : 00000000002625a0 x0 : 0000000000000401
[ 24.893962] Call trace:
[ 24.893964] cppc_scale_freq_workfn+0xc8/0xf8
[ 24.893967] kthread_worker_fn+0x110/0x318
[ 24.893971] kthread+0xf4/0x120
[ 24.893973] ret_from_fork+0x10/0x18
[ 24.893977] ---[ end trace ea6dbaf832bce3e4 ]---


>
> Meanwhile it is up for review. Ideally I would like to get this merged for 5.14,
> but let's see how it goes.
>
> Thanks.
>
> --
> Viresh
>
> Viresh Kumar (3):
> cpufreq: Add start_cpu() and stop_cpu() callbacks
> arch_topology: Avoid use-after-free for scale_freq_data
> cpufreq: CPPC: Add support for frequency invariance
>
> Documentation/cpu-freq/cpu-drivers.rst | 7 +-
> drivers/base/arch_topology.c | 27 ++-
> drivers/cpufreq/Kconfig.arm | 10 ++
> drivers/cpufreq/cppc_cpufreq.c | 232 +++++++++++++++++++++++--
> drivers/cpufreq/cpufreq.c | 19 +-
> include/linux/arch_topology.h | 1 +
> include/linux/cpufreq.h | 5 +-
> kernel/sched/core.c | 1 +
> 8 files changed, 272 insertions(+), 30 deletions(-)
>
> --
> 2.31.1.272.g89b43f80a514
>

2021-06-16 11:43:37

by Ionela Voinescu

Subject: Re: [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data

Hey,

On Wednesday 16 Jun 2021 at 12:18:08 (+0530), Viresh Kumar wrote:
> Currently topology_scale_freq_tick() may end up using a pointer to
> struct scale_freq_data, which was previously cleared by
> topology_clear_scale_freq_source(), as there is no protection in place
> here. The users of topology_clear_scale_freq_source(), though, need a
> guarantee that the previous scale_freq_data isn't used anymore.
>

Please correct me if I'm wrong, but my understanding is that this is
only a problem for the cppc cpufreq invariance functionality. Let's
consider a scenario where CPUs are either hotplugged out or the cpufreq
CPPC driver module is removed; topology_clear_scale_freq_source() would
get called and the sfd_data will be set to NULL. But if at the same
time topology_scale_freq_tick() got an old reference of sfd_data,
set_freq_scale() will be called. This is only a problem for CPPC cpufreq
as cppc_scale_freq_tick() will end up using driver internal data that
might have been freed during the hotplug callbacks or the exit path.

If this is the case, wouldn't the synchronisation issue be better
resolved in the CPPC cpufreq driver, rather than here?

Thank you,
Ionela.

> Since topology_scale_freq_tick() is called from scheduler tick, we don't
> want to add locking in there. Use the RCU update mechanism instead
> (which is already used by the scheduler's utilization update path) to
> guarantee race free updates here.
>
> Cc: Paul E. McKenney <[email protected]>
> Signed-off-by: Viresh Kumar <[email protected]>
> ---
> drivers/base/arch_topology.c | 27 +++++++++++++++++++++------
> 1 file changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index c1179edc0f3b..921312a8d957 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -18,10 +18,11 @@
>  #include <linux/cpumask.h>
>  #include <linux/init.h>
>  #include <linux/percpu.h>
> +#include <linux/rcupdate.h>
>  #include <linux/sched.h>
>  #include <linux/smp.h>
>
> -static DEFINE_PER_CPU(struct scale_freq_data *, sft_data);
> +static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
>  static struct cpumask scale_freq_counters_mask;
>  static bool scale_freq_invariant;
>
> @@ -66,16 +67,20 @@ void topology_set_scale_freq_source(struct scale_freq_data *data,
>          if (cpumask_empty(&scale_freq_counters_mask))
>                  scale_freq_invariant = topology_scale_freq_invariant();
>
> +        rcu_read_lock();
> +
>          for_each_cpu(cpu, cpus) {
> -                sfd = per_cpu(sft_data, cpu);
> +                sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));
>
>                  /* Use ARCH provided counters whenever possible */
>                  if (!sfd || sfd->source != SCALE_FREQ_SOURCE_ARCH) {
> -                        per_cpu(sft_data, cpu) = data;
> +                        rcu_assign_pointer(per_cpu(sft_data, cpu), data);
>                          cpumask_set_cpu(cpu, &scale_freq_counters_mask);
>                  }
>          }
>
> +        rcu_read_unlock();
> +
>          update_scale_freq_invariant(true);
>  }
>  EXPORT_SYMBOL_GPL(topology_set_scale_freq_source);
> @@ -86,22 +91,32 @@ void topology_clear_scale_freq_source(enum scale_freq_source source,
>          struct scale_freq_data *sfd;
>          int cpu;
>
> +        rcu_read_lock();
> +
>          for_each_cpu(cpu, cpus) {
> -                sfd = per_cpu(sft_data, cpu);
> +                sfd = rcu_dereference(*per_cpu_ptr(&sft_data, cpu));
>
>                  if (sfd && sfd->source == source) {
> -                        per_cpu(sft_data, cpu) = NULL;
> +                        rcu_assign_pointer(per_cpu(sft_data, cpu), NULL);
>                          cpumask_clear_cpu(cpu, &scale_freq_counters_mask);
>                  }
>          }
>
> +        rcu_read_unlock();
> +
> +        /*
> +         * Make sure all references to previous sft_data are dropped to avoid
> +         * use-after-free races.
> +         */
> +        synchronize_rcu();
> +
>          update_scale_freq_invariant(false);
>  }
>  EXPORT_SYMBOL_GPL(topology_clear_scale_freq_source);
>
>  void topology_scale_freq_tick(void)
>  {
> -        struct scale_freq_data *sfd = *this_cpu_ptr(&sft_data);
> +        struct scale_freq_data *sfd = rcu_dereference_sched(*this_cpu_ptr(&sft_data));
>
>          if (sfd)
>                  sfd->set_freq_scale();
> --
> 2.31.1.272.g89b43f80a514
>

2021-06-16 11:44:54

by Viresh Kumar

Subject: Re: [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data

Hi Ionela,

On 16-06-21, 12:25, Ionela Voinescu wrote:
> Please correct me if I'm wrong, but my understanding is that this is
> only a problem for the cppc cpufreq invariance functionality. Let's
> consider a scenario where CPUs are either hotplugged out or the cpufreq
> CPPC driver module is removed; topology_clear_scale_freq_source() would
> get called and the sfd_data will be set to NULL. But if at the same
> time topology_scale_freq_tick() got an old reference of sfd_data,
> set_freq_scale() will be called. This is only a problem for CPPC cpufreq
> as cppc_scale_freq_tick() will end up using driver internal data that
> might have been freed during the hotplug callbacks or the exit path.

For now, yes, CPPC is the only one affected.

> If this is the case, wouldn't the synchronisation issue be better
> resolved in the CPPC cpufreq driver, rather than here?

Hmm, the way I see it is that topology_clear_scale_freq_source() is an API
provided by the topology core, and the topology core needs to guarantee that it
doesn't use the data any longer after topology_clear_scale_freq_source() is
called.

The same is true for other APIs, like:

irq_work_sync();
kthread_cancel_work_sync();

It isn't the user which needs to take this into account, but the API provider.

There may be more users of this in the future, let's say another cpufreq driver,
and so keeping this synchronization at the API provider is the right thing to do
IMHO.

And from the perspective of a user like cppc, it doesn't have any control over
who is using its callback, or how and when. It would be very difficult, and
redundant anyway, to provide something like this at each user's end. For
example, cppc won't ever know when topology_scale_freq_tick() has stopped
calling its callback.

For example this is what cppc driver needs to do now:

+static void cppc_cpufreq_stop_cpu(struct cpufreq_policy *policy,
+                                  unsigned int cpu)
+{
+        struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, cpu);
+
+        topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, cpumask_of(cpu));
+        irq_work_sync(&cppc_fi->irq_work);
+        kthread_cancel_work_sync(&cppc_fi->work);
+}

The driver uses APIs provided by 3 layers, topology, irq-work, kthread and all
must provide these guarantees.

A very similar thing is implemented in kernel/sched/cpufreq.c for example.

--
viresh

2021-06-16 12:27:36

by Viresh Kumar

Subject: Re: [PATCH V2 0/3] cpufreq: cppc: Add support for frequency invariance

On 16-06-21, 12:02, Vincent Guittot wrote:
> I tested your branch and got the following while booting:
>
> [ 24.454543] zswap: loaded using pool lzo/zbud
> [ 24.454753] pstore: Using crash dump compression: deflate
> [ 24.454776] AppArmor: AppArmor sha1 policy hashing enabled
> [ 24.454784] ima: No TPM chip found, activating TPM-bypass!
> [ 24.454789] ima: Allocated hash algorithm: sha256
> [ 24.454801] ima: No architecture policies found
> [ 24.455750] pcieport 0000:0f:00.0: Adding to iommu group 0
> [ 24.893888] ------------[ cut here ]------------
> [ 24.893891] WARNING: CPU: 95 PID: 1442 at
> drivers/cpufreq/cppc_cpufreq.c:123 cppc_scale_freq_workfn+0xc8/0xf8
> [ 24.893901] Modules linked in:
> [ 24.893906] CPU: 95 PID: 1442 Comm: cppc_fie Not tainted 5.13.0-rc6+ #359
> [ 24.893910] Hardware name: To be filled by O.E.M. Saber/Saber, BIOS
> 0ACKL026 03/19/2019
> [ 24.893912] pstate: 20400009 (nzCv daif +PAN -UAO -TCO BTYPE=--)
> [ 24.893915] pc : cppc_scale_freq_workfn+0xc8/0xf8
> [ 24.893918] lr : cppc_scale_freq_workfn+0x5c/0xf8
> [ 24.893921] sp : ffff80003727bd90
> [ 24.893922] x29: ffff80003727bd90 x28: 0000000000000000 x27: ffff800010ec2000
> [ 24.893928] x26: ffff800010ec2000 x25: ffff8000107c3d90 x24: 0000000000000001
> [ 24.893932] x23: ffff000816244880 x22: ffff8000113f9000 x21: ffff009f825a0a80
> [ 24.893935] x20: ffff009efc394220 x19: ffff800011199000 x18: 000000000000001b
> [ 24.893939] x17: 0000000000000007 x16: 0000000000000001 x15: 00000000000000bf
> [ 24.893943] x14: 0000000000000016 x13: 000000000000029b x12: 0000000000000016
> [ 24.893946] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff009efc6958c0
> [ 24.893950] x8 : ffff009efc394248 x7 : 0000000002bde780 x6 : 00000000ffffffff
> [ 24.893954] x5 : 00000000916e502a x4 : 00000000d9730e80 x3 : ffffffffffffffff
> [ 24.893958] x2 : 00000000001e8480 x1 : 00000000002625a0 x0 : 0000000000000401
> [ 24.893962] Call trace:
> [ 24.893964] cppc_scale_freq_workfn+0xc8/0xf8
> [ 24.893967] kthread_worker_fn+0x110/0x318
> [ 24.893971] kthread+0xf4/0x120
> [ 24.893973] ret_from_fork+0x10/0x18
> [ 24.893977] ---[ end trace ea6dbaf832bce3e4 ]---

Thanks Vincent.

This is triggering from cppc_scale_freq_workfn():

if (WARN_ON(local_freq_scale > 1024))

Looks like there is something fishy about the perf calculations here
after reading the counters; we tried to scale the value into the range 0-1024
and it came out larger than that.

Will keep you posted.

--
viresh

2021-06-16 12:28:06

by Ionela Voinescu

Subject: Re: [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data

On Wednesday 16 Jun 2021 at 17:06:04 (+0530), Viresh Kumar wrote:
> Hi Ionela,
>
> On 16-06-21, 12:25, Ionela Voinescu wrote:
> > Please correct me if I'm wrong, but my understanding is that this is
> > only a problem for the cppc cpufreq invariance functionality. Let's
> > consider a scenario where CPUs are either hotplugged out or the cpufreq
> > CPPC driver module is removed; topology_clear_scale_freq_source() would
> > get called and the sfd_data will be set to NULL. But if at the same
> > time topology_scale_freq_tick() got an old reference of sfd_data,
> > set_freq_scale() will be called. This is only a problem for CPPC cpufreq
> > as cppc_scale_freq_tick() will end up using driver internal data that
> > might have been freed during the hotplug callbacks or the exit path.
>
> For now, yes, CPPC is the only one affected.
>
> > If this is the case, wouldn't the synchronisation issue be better
> > resolved in the CPPC cpufreq driver, rather than here?
>
> Hmm, the way I see it is that topology_clear_scale_freq_source() is an API
> provided by the topology core, and the topology core needs to guarantee that it
> doesn't use the data any longer after topology_clear_scale_freq_source() is
> called.
>
> The same is true for other APIs, like:
>
> irq_work_sync();
> kthread_cancel_work_sync();
>
> It isn't the user which needs to take this into account, but the API provider.
>

I would agree if it wasn't for the fact that the driver provides the
set_freq_scale() implementation that ends up using driver internal data
which could have been freed by the driver's own .exit()/stop_cpu()
callbacks. The API and the generic implementation have the responsibility
of making sure of sane access to their own structures.

Even if we wanted to keep drivers from shooting themselves in the foot, I
would prefer we postpone that until we have more users of this, before we
add any synchronisation mechanisms to functionality called on the tick.

Let's see if there's a less invasive solution to fix CPPC for now, what
do you think?

Thanks,
Ionela.

> There may be more users of this in the future, let's say another cpufreq driver,
> and so keeping this synchronization at the API provider is the right thing to do
> IMHO.
>
> And from the perspective of a user like cppc, it doesn't have any control over
> who is using its callback, or how and when. It would be very difficult, and
> redundant anyway, to provide something like this at each user's end. For
> example, cppc won't ever know when topology_scale_freq_tick() has stopped
> calling its callback.
>
> For example this is what cppc driver needs to do now:
>
> +static void cppc_cpufreq_stop_cpu(struct cpufreq_policy *policy,
> +                                  unsigned int cpu)
> +{
> +        struct cppc_freq_invariance *cppc_fi = &per_cpu(cppc_freq_inv, cpu);
> +
> +        topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, cpumask_of(cpu));
> +        irq_work_sync(&cppc_fi->irq_work);
> +        kthread_cancel_work_sync(&cppc_fi->work);
> +}
>
> The driver uses APIs provided by 3 layers, topology, irq-work, kthread and all
> must provide these guarantees.
>
> A very similar thing is implemented in kernel/sched/cpufreq.c for example.
>
> --
> viresh

2021-06-17 03:16:36

by Viresh Kumar

Subject: Re: [PATCH V2 2/3] arch_topology: Avoid use-after-free for scale_freq_data

On 16-06-21, 13:00, Ionela Voinescu wrote:
> I would agree if it wasn't for the fact that the driver provides the
> set_freq_scale() implementation that ends up using driver internal data
> which could have been freed by the driver's own .exit()/stop_cpu()
> callbacks. The API and the generic implementation have the responsibility
> of making sure of sane access to their own structures.

How do you see the timer core or workqueue core then? Why do they make
sure they don't end up calling a user's function once we ask them not
to?

And the same goes for the scheduler's own util update mechanism; the exact
same thing happens there. Users (cpufreq governors) call
cpufreq_add_update_util_hook() and cpufreq_remove_update_util_hook()
to pass their own data structure, "struct update_util_data", which has
a callback within. This is what happens from the scheduler's context in
those cases:

static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
{
        struct update_util_data *data;

        data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
                                                  cpu_of(rq)));
        if (data)
                data->func(data, rq_clock(rq), flags);
}

Also note that some kind of synchronisation mechanism is indeed
required between topology_set_scale_freq_source() and
topology_clear_scale_freq_source(). There is no race there today,
sure, but this is an API which is meant to be generic.

> Even if we would want to keep drivers from shooting themselves in the
> foot, I would prefer we postpone it until we have more users for this,
> before we add any synchronisation mechanisms to functionality called
> on the tick.

The RCU mechanism is very much used in the scheduler itself because it
is lightweight. Honestly, I don't even see any other way (w.r.t.
locking) for users to fix this at their end; they don't know which was
the last tick that used their callback.

> Let's see if there's a less invasive solution to fix CPPC for now, what
> do you think?

For me, this change is required in the API despite how CPPC ends up
using it. Else we are failing at defining the API itself IMHO.

--
viresh

2021-06-17 13:32:00

by Viresh Kumar

Subject: Re: [PATCH V2 3/3] cpufreq: CPPC: Add support for frequency invariance

On 17-06-21, 11:34, Ionela Voinescu wrote:
> I might be missing something, but when you offline a single CPU in a
> policy, the worse that can happen is that a last call to
> cppc_scale_freq_tick() would have sneaked in before irqs and the tick
> are disabled. But even if we have a last call to
> cppc_scale_freq_workfn(), the counter read methods would know how to
> cope with hotplug, and the cppc_cpudata structure would still be
> allocated and have valid desired_perf and highest_perf values.

Okay, I somehow assumed that cppc_scale_freq_workfn() needs to run on the local
CPU, while it can actually land anywhere. My fault.

But the irq-work being queued here is per-cpu and it will get queued on the
local CPU where the tick occurred.

Now I am not sure what will happen to this irq-work in that case. I looked
just now and I see that these irq-work items are processed first on the tick,
and then the scheduler's tick handler is called, so that will queue the cppc
irq-work again.

What happens if this races with CPU hotplug? I am not sure I understand
that; there may or may not be any side effects. Let's assume the work
item is left in the queue as is and no tick happens after that, as the CPU is
offlined. We are good.

Now if we unload the cpufreq driver at this moment, the driver will call
irq_work_sync(), which may end up in a while loop? There is no
irq-work-cancel() API.

Peter: Can you help here with this? Let me try to explain a bit:

We are starting an irq-work (in the cppc cpufreq driver) from
scheduler_tick()->arch_scale_freq_tick(). What will happen if the driver doesn't
take care of CPU hotplug explicitly and doesn't make sure this work isn't queued
again from the next tick?

Is it important for the user to make sure it gets rid of the irq-work during
hotplug here?

> Worst case, the last scale factor set for the CPU will be meaningless,
> but it's already meaningless as the CPU is going down.
>
> When you are referring to the issue reported by Qian I suppose you are
> referring to this [1]. I think this is the case where you hotplug the
> last CPU in a policy and free cppc_cpudata.
>
> [1] https://lore.kernel.org/linux-pm/[email protected]/

Yes, I was talking about this report only; I am not sure now that I understand
what actually happened here :)

Ionela: I have skipped replying to the rest of your email; I will get back to
it once we have clarification on the above.

Thanks a lot for your reviews, always on time :)

--
viresh

2021-06-17 18:00:04

by Rafael J. Wysocki

Subject: Re: [PATCH V2 1/3] cpufreq: Add start_cpu() and stop_cpu() callbacks

On Wed, Jun 16, 2021 at 8:48 AM Viresh Kumar <[email protected]> wrote:
>
> On CPU hot-unplug, the cpufreq core doesn't call any driver specific
> callback unless all the CPUs of a policy went away, in which case we
> call stop_cpu() callback.
>
> For the CPPC cpufreq driver, we need to perform per-cpu init/exit work
> which can't be performed from the policy-specific init()/exit()
> callbacks.
>
> This patch adds a new callback, start_cpu() and modifies the stop_cpu()
> callback, to perform such CPU specific work.
>
> These routines are called whenever a CPU is added or removed from a
> policy.
>
> Note that this also moves the setting of policy->cpus to online CPUs
> only, outside of the rwsem, as we need to call start_cpu() for online CPUs
> only. This shouldn't have any side effects.
>
> Signed-off-by: Viresh Kumar <[email protected]>
> ---
> Documentation/cpu-freq/cpu-drivers.rst | 7 +++++--
> drivers/cpufreq/cpufreq.c | 19 +++++++++++++++----
> include/linux/cpufreq.h | 5 ++++-
> 3 files changed, 24 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/cpu-freq/cpu-drivers.rst b/Documentation/cpu-freq/cpu-drivers.rst
> index a697278ce190..15cfe42b4075 100644
> --- a/Documentation/cpu-freq/cpu-drivers.rst
> +++ b/Documentation/cpu-freq/cpu-drivers.rst
> @@ -71,8 +71,11 @@ And optionally
> .exit - A pointer to a per-policy cleanup function called during
> CPU_POST_DEAD phase of cpu hotplug process.
>
> - .stop_cpu - A pointer to a per-policy stop function called during
> - CPU_DOWN_PREPARE phase of cpu hotplug process.
> + .start_cpu - A pointer to a per-policy per-cpu start function called
> + during CPU online phase.
> +
> + .stop_cpu - A pointer to a per-policy per-cpu stop function called
> + during CPU offline phase.
>
> .suspend - A pointer to a per-policy suspend function which is called
> with interrupts disabled and _after_ the governor is stopped for the
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 802abc925b2a..128dfb1b0cdf 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -1119,6 +1119,10 @@ static int cpufreq_add_policy_cpu(struct cpufreq_policy *policy, unsigned int cp
>
>          cpumask_set_cpu(cpu, policy->cpus);
>
> +        /* Do CPU specific initialization if required */
> +        if (cpufreq_driver->start_cpu)
> +                cpufreq_driver->start_cpu(policy, cpu);
> +
>          if (has_target()) {
>                  ret = cpufreq_start_governor(policy);
>                  if (ret)
> @@ -1375,13 +1379,19 @@ static int cpufreq_online(unsigned int cpu)
>                  cpumask_copy(policy->related_cpus, policy->cpus);
>          }
>
> -        down_write(&policy->rwsem);
>          /*
>           * affected cpus must always be the one, which are online. We aren't
>           * managing offline cpus here.
>           */
>          cpumask_and(policy->cpus, policy->cpus, cpu_online_mask);
>
> +        /* Do CPU specific initialization if required */
> +        if (cpufreq_driver->start_cpu) {
> +                for_each_cpu(j, policy->cpus)
> +                        cpufreq_driver->start_cpu(policy, j);
> +        }
> +
> +        down_write(&policy->rwsem);
>          if (new_policy) {
>                  for_each_cpu(j, policy->related_cpus) {
>                          per_cpu(cpufreq_cpu_data, j) = policy;
> @@ -1581,6 +1591,10 @@ static int cpufreq_offline(unsigned int cpu)
>                  policy->cpu = cpumask_any(policy->cpus);
>          }
>
> +        /* Do CPU specific de-initialization if required */
> +        if (cpufreq_driver->stop_cpu)
> +                cpufreq_driver->stop_cpu(policy, cpu);
> +
>          /* Start governor again for active policy */
>          if (!policy_is_inactive(policy)) {
>                  if (has_target()) {
> @@ -1597,9 +1611,6 @@ static int cpufreq_offline(unsigned int cpu)
>                  policy->cdev = NULL;
>          }
>
> -        if (cpufreq_driver->stop_cpu)
> -                cpufreq_driver->stop_cpu(policy);
> -

This should be a separate patch IMO, after you've migrated everyone to
->offline/->exit.

BTW, IMO it might be better to migrate ->stop_cpu to ->offline rather
than to ->exit.

>          if (has_target())
>                  cpufreq_exit_governor(policy);
>
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 353969c7acd3..c281b3df4e2f 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -371,7 +371,10 @@ struct cpufreq_driver {
>          int             (*online)(struct cpufreq_policy *policy);
>          int             (*offline)(struct cpufreq_policy *policy);
>          int             (*exit)(struct cpufreq_policy *policy);
> -        void            (*stop_cpu)(struct cpufreq_policy *policy);
> +
> +        /* CPU specific start/stop */
> +        void            (*start_cpu)(struct cpufreq_policy *policy, unsigned int cpu);
> +        void            (*stop_cpu)(struct cpufreq_policy *policy, unsigned int cpu);
>          int             (*suspend)(struct cpufreq_policy *policy);
>          int             (*resume)(struct cpufreq_policy *policy);
>
> --
> 2.31.1.272.g89b43f80a514
>

2021-06-18 10:43:29

by Viresh Kumar

Subject: Re: [PATCH V2 1/3] cpufreq: Add start_cpu() and stop_cpu() callbacks

On 17-06-21, 15:33, Rafael J. Wysocki wrote:
> On Wed, Jun 16, 2021 at 8:48 AM Viresh Kumar <[email protected]> wrote:
> > + /* Do CPU specific de-initialization if required */
> > + if (cpufreq_driver->stop_cpu)
> > + cpufreq_driver->stop_cpu(policy, cpu);
> > +
> > /* Start governor again for active policy */
> > if (!policy_is_inactive(policy)) {
> > if (has_target()) {
> > @@ -1597,9 +1611,6 @@ static int cpufreq_offline(unsigned int cpu)
> > policy->cdev = NULL;
> > }
> >
> > - if (cpufreq_driver->stop_cpu)
> > - cpufreq_driver->stop_cpu(policy);
> > -
>
> This should be a separate patch IMO, after you've migrated everyone to
> ->offline/->exit.

Right, anyway this patch may not be required anymore. I will send a patch to
remove this.

> BTW, IMO it might be better to migrate ->stop_cpu to ->offline rather
> than to ->exit.

This is a bit tricky IMO.

First, offline() isn't implemented by everyone; out of the three implementations
which were using stop_cpu(), only intel_pstate had offline() as well.

Second, the primary purpose of the online/offline callbacks was suspend/resume
oriented, and for the same reason we don't call online() when the policy first
comes up; so in case of errors during bring-up, we end up calling exit()
directly and not offline().

IMO this is very specific to each driver, and they need to see what fits best
for them: exit(), offline(), or both.

--
viresh