LinuxLists.cc - [PATCH 1/3] arm64: Add TIF_POLLING_NRFLAG

2024-04-02 01:47:56

by Okanovic, Haris

[permalink] [raw]

Subject: [PATCH 1/3] arm64: Add TIF_POLLING_NRFLAG

TIF_POLLING_NRFLAG was removed from arm64 as there were no polling
idle states. Add back TIF_POLLING_NRFLAG in preparation for an arm64
cpuidle driver which supports polling.

Signed-off-by: Haris Okanovic <[email protected]>
---
arch/arm64/include/asm/thread_info.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index e72a3bf9e563..ab22c7d1967e 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -81,6 +81,7 @@ void arch_setup_new_exec(void);
#define TIF_SME 27 /* SME in use */
#define TIF_SME_VL_INHERIT 28 /* Inherit SME vl_onexec across exec */
#define TIF_KERNEL_FPSTATE 29 /* Task is in a kernel mode FPSIMD section */
+#define TIF_POLLING_NRFLAG 30

#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
@@ -97,6 +98,7 @@ void arch_setup_new_exec(void);
#define _TIF_SVE (1 << TIF_SVE)
#define _TIF_MTE_ASYNC_FAULT (1 << TIF_MTE_ASYNC_FAULT)
#define _TIF_NOTIFY_SIGNAL (1 << TIF_NOTIFY_SIGNAL)
+#define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG)

#define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
_TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
--
2.34.1

2024-04-02 01:48:09

by Okanovic, Haris

[permalink] [raw]

Subject: [PATCH 2/3] arm64: add __READ_ONCE_EX()

Perform an exclusive load, which atomically loads a word and arms the
execusive monitor to enable wfe() polling of an address.

Adding this macro in preparation for an arm64 cpuidle driver which
supports a wfe() based polling state.

https://developer.arm.com/documentation/dht0008/a/arm-synchronization-primitives/exclusive-accesses/exclusive-monitors

Signed-off-by: Haris Okanovic <[email protected]>
---
arch/arm64/include/asm/readex.h | 46 +++++++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)
create mode 100644 arch/arm64/include/asm/readex.h

diff --git a/arch/arm64/include/asm/readex.h b/arch/arm64/include/asm/readex.h
new file mode 100644
index 000000000000..51963c3107e1
--- /dev/null
+++ b/arch/arm64/include/asm/readex.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Based on arch/arm64/include/asm/rwonce.h
+ *
+ * Copyright (C) 2020 Google LLC.
+ * Copyright (C) 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#ifndef __ASM_READEX_H
+#define __ASM_READEX_H
+
+#define __LOAD_EX(sfx, regs...) "ldaxr" #sfx "\t" #regs
+
+#define __READ_ONCE_EX(x) \
+({ \
+ typeof(&(x)) __x = &(x); \
+ int atomic = 1; \
+ union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u; \
+ switch (sizeof(x)) { \
+ case 1: \
+ asm volatile(__LOAD_EX(b, %w0, %1) \
+ : "=r" (*(__u8 *)__u.__c) \
+ : "Q" (*__x) : "memory"); \
+ break; \
+ case 2: \
+ asm volatile(__LOAD_EX(h, %w0, %1) \
+ : "=r" (*(__u16 *)__u.__c) \
+ : "Q" (*__x) : "memory"); \
+ break; \
+ case 4: \
+ asm volatile(__LOAD_EX(, %w0, %1) \
+ : "=r" (*(__u32 *)__u.__c) \
+ : "Q" (*__x) : "memory"); \
+ break; \
+ case 8: \
+ asm volatile(__LOAD_EX(, %0, %1) \
+ : "=r" (*(__u64 *)__u.__c) \
+ : "Q" (*__x) : "memory"); \
+ break; \
+ default: \
+ atomic = 0; \
+ } \
+ atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(__x))__x);\
+})
+
+#endif
--
2.34.1

2024-04-02 01:51:28

by Okanovic, Haris

[permalink] [raw]

Subject: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle

An arm64 cpuidle driver with two states: (1) First polls for new runable
tasks up to 100 us (by default) before (2) a wfi idle and awoken by
interrupt (the current arm64 behavior). It allows CPUs to return from
idle more quickly by avoiding the longer interrupt wakeup path, which
may require EL1/EL2 transition in certain VM scenarios.

Poll duration is optionally configured at load time via the poll_limit
module parameter.

The default 100 us duration was experimentally chosen, by measuring QPS
(queries per sec) of the MLPerf bert inference benchmark, which seems
particularly susceptible to this change; see procedure below. 100 us is
the inflection point where QPS stopped growing in a range of tested
values. All results are from AWS m7g.16xlarge instances (Graviton3 SoC)
with dedicated tenancy (dedicated hardware).

| before | 10us | 25us | 50us | 100us | 125us | 150us | 200us | 300us |
| 5.87 | 5.91 | 5.96 | 6.01 | 6.06 | 6.07 | 6.06 | 6.06 | 6.06 |

Perf's scheduler benchmarks also improve with a range of poll_limit
values >= 10 us. Higher limits produce near identical results within a
3% noise margin. The following tables are `perf bench sched` results,
run times in seconds.

`perf bench sched messaging -l 80000`
| AWS instance | SoC | Before | After | % Change |
| c6g.16xl (VM) | Graviton2 | 18.974 | 18.400 | none |
| c7g.16xl (VM) | Graviton3 | 13.852 | 13.859 | none |
| c6g.metal | Graviton2 | 17.621 | 16.744 | none |
| c7g.metal | Graviton3 | 13.430 | 13.404 | none |

`perf bench sched pipe -l 2500000`
| AWS instance | SoC | Before | After | % Change |
| c6g.16xl (VM) | Graviton2 | 30.158 | 15.181 | -50% |
| c7g.16xl (VM) | Graviton3 | 18.289 | 12.067 | -34% |
| c6g.metal | Graviton2 | 17.609 | 15.170 | -14% |
| c7g.metal | Graviton3 | 14.103 | 12.304 | -13% |

`perf bench sched seccomp-notify -l 2500000`
| AWS instance | SoC | Before | After | % Change |
| c6g.16xl (VM) | Graviton2 | 28.784 | 13.754 | -52% |
| c7g.16xl (VM) | Graviton3 | 16.964 | 11.430 | -33% |
| c6g.metal | Graviton2 | 15.717 | 13.536 | -14% |
| c7g.metal | Graviton3 | 13.301 | 11.491 | -14% |

Steps to run MLPerf bert inference on Ubuntu 22.04:
sudo apt install build-essential python3 python3-pip
pip install "pybind11[global]" tensorflow transformers
export TF_ENABLE_ONEDNN_OPTS=1
export DNNL_DEFAULT_FPMATH_MODE=BF16
git clone https://github.com/mlcommons/inference.git --recursive
cd inference
git checkout v2.0
cd loadgen
CFLAGS="-std=c++14" python3 setup.py bdist_wheel
pip install dist/*.whl
cd ../language/bert
make setup
python3 run.py --backend=tf --scenario=SingleStream

Suggested-by: Ali Saidi <[email protected]>
Reviewed-by: Ali Saidi <[email protected]>
Reviewed-by: Geoff Blake <[email protected]>
Cc: Brian Silver <[email protected]>
Signed-off-by: Haris Okanovic <[email protected]>
---
drivers/cpuidle/Kconfig.arm | 13 ++
drivers/cpuidle/Makefile | 1 +
drivers/cpuidle/cpuidle-arm-polling.c | 171 ++++++++++++++++++++++++++
3 files changed, 185 insertions(+)
create mode 100644 drivers/cpuidle/cpuidle-arm-polling.c

diff --git a/drivers/cpuidle/Kconfig.arm b/drivers/cpuidle/Kconfig.arm
index a1ee475d180d..484666dda38d 100644
--- a/drivers/cpuidle/Kconfig.arm
+++ b/drivers/cpuidle/Kconfig.arm
@@ -14,6 +14,19 @@ config ARM_CPUIDLE
initialized by calling the CPU operations init idle hook
provided by architecture code.

+config ARM_POLL_CPUIDLE
+ bool "ARM64 CPU idle Driver with polling"
+ depends on ARM64
+ depends on ARM_ARCH_TIMER_EVTSTREAM
+ select CPU_IDLE_MULTIPLE_DRIVERS
+ help
+ Select this to enable a polling cpuidle driver for ARM64:
+ The first state polls TIF_NEED_RESCHED for best latency on short
+ sleep intervals. The second state falls back to arch_cpu_idle() to
+ wait for interrupt. This is can be helpful in workloads that
+ frequently block/wake at short intervals or VMs where wakeup IPIs
+ are more expensive.
+
config ARM_PSCI_CPUIDLE
bool "PSCI CPU idle Driver"
depends on ARM_PSCI_FW
diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
index d103342b7cfc..23c21422792d 100644
--- a/drivers/cpuidle/Makefile
+++ b/drivers/cpuidle/Makefile
@@ -22,6 +22,7 @@ obj-$(CONFIG_ARM_U8500_CPUIDLE) += cpuidle-ux500.o
obj-$(CONFIG_ARM_AT91_CPUIDLE) += cpuidle-at91.o
obj-$(CONFIG_ARM_EXYNOS_CPUIDLE) += cpuidle-exynos.o
obj-$(CONFIG_ARM_CPUIDLE) += cpuidle-arm.o
+obj-$(CONFIG_ARM_POLL_CPUIDLE) += cpuidle-arm-polling.o
obj-$(CONFIG_ARM_PSCI_CPUIDLE) += cpuidle-psci.o
obj-$(CONFIG_ARM_PSCI_CPUIDLE_DOMAIN) += cpuidle-psci-domain.o
obj-$(CONFIG_ARM_TEGRA_CPUIDLE) += cpuidle-tegra.o
diff --git a/drivers/cpuidle/cpuidle-arm-polling.c b/drivers/cpuidle/cpuidle-arm-polling.c
new file mode 100644
index 000000000000..bca128568114
--- /dev/null
+++ b/drivers/cpuidle/cpuidle-arm-polling.c
@@ -0,0 +1,171 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ARM64 CPU idle driver using wfe polling
+ *
+ * Copyright 2024 Amazon.com, Inc. or its affiliates. All rights reserved.
+ *
+ * Authors:
+ * Haris Okanovic <[email protected]>
+ * Brian Silver <[email protected]>
+ *
+ * Based on cpuidle-arm.c
+ * Copyright (C) 2014 ARM Ltd.
+ * Author: Lorenzo Pieralisi <[email protected]>
+ */
+
+#include <linux/cpu.h>
+#include <linux/cpu_cooling.h>
+#include <linux/cpuidle.h>
+#include <linux/sched/clock.h>
+
+#include <asm/cpuidle.h>
+#include <asm/readex.h>
+
+#include "dt_idle_states.h"
+
+/* Max duration of the wfe() poll loop in us, before transitioning to
+ * arch_cpu_idle()/wfi() sleep.
+ */
+#define DEFAULT_POLL_LIMIT_US 100
+static unsigned int poll_limit __read_mostly = DEFAULT_POLL_LIMIT_US;
+
+/*
+ * arm_idle_wfe_poll - Polls state in wfe loop until reschedule is
+ * needed or timeout
+ */
+static int __cpuidle arm_idle_wfe_poll(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv, int idx)
+{
+ u64 time_start, time_limit;
+
+ time_start = local_clock();
+ dev->poll_time_limit = false;
+
+ local_irq_enable();
+
+ if (current_set_polling_and_test())
+ goto end;
+
+ time_limit = cpuidle_poll_time(drv, dev);
+
+ do {
+ // exclusive read arms the monitor for wfe
+ if (__READ_ONCE_EX(current_thread_info()->flags) & _TIF_NEED_RESCHED)
+ goto end;
+
+ // may exit prematurely, see ARM_ARCH_TIMER_EVTSTREAM
+ wfe();
+ } while (local_clock() - time_start < time_limit);
+
+ dev->poll_time_limit = true;
+
+end:
+ current_clr_polling();
+ return idx;
+}
+
+/*
+ * arm_idle_wfi - Places cpu in lower power state until interrupt,
+ * a fallback to polling
+ */
+static int __cpuidle arm_idle_wfi(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv, int idx)
+{
+ if (current_clr_polling_and_test()) {
+ local_irq_enable();
+ return idx;
+ }
+ arch_cpu_idle();
+ return idx;
+}
+
+static struct cpuidle_driver arm_poll_idle_driver __initdata = {
+ .name = "arm_poll_idle",
+ .owner = THIS_MODULE,
+ .states = {
+ {
+ .enter = arm_idle_wfe_poll,
+ .exit_latency = 0,
+ .target_residency = 0,
+ .exit_latency_ns = 0,
+ .power_usage = UINT_MAX,
+ .flags = CPUIDLE_FLAG_POLLING,
+ .name = "WFE",
+ .desc = "ARM WFE",
+ },
+ {
+ .enter = arm_idle_wfi,
+ .exit_latency = DEFAULT_POLL_LIMIT_US,
+ .target_residency = DEFAULT_POLL_LIMIT_US,
+ .power_usage = UINT_MAX,
+ .name = "WFI",
+ .desc = "ARM WFI",
+ },
+ },
+ .state_count = 2,
+};
+
+/*
+ * arm_poll_init_cpu - Initializes arm cpuidle polling driver for one cpu
+ */
+static int __init arm_poll_init_cpu(int cpu)
+{
+ int ret;
+ struct cpuidle_driver *drv;
+
+ drv = kmemdup(&arm_poll_idle_driver, sizeof(*drv), GFP_KERNEL);
+ if (!drv)
+ return -ENOMEM;
+
+ drv->cpumask = (struct cpumask *)cpumask_of(cpu);
+ drv->states[1].exit_latency = poll_limit;
+ drv->states[1].target_residency = poll_limit;
+
+ ret = cpuidle_register(drv, NULL);
+ if (ret) {
+ pr_err("failed to register driver: %d, cpu %d\n", ret, cpu);
+ goto out_kfree_drv;
+ }
+
+ pr_info("registered driver cpu %d\n", cpu);
+
+ cpuidle_cooling_register(drv);
+
+ return 0;
+
+out_kfree_drv:
+ kfree(drv);
+ return ret;
+}
+
+/*
+ * arm_poll_init - Initializes arm cpuidle polling driver
+ */
+static int __init arm_poll_init(void)
+{
+ int cpu, ret;
+ struct cpuidle_driver *drv;
+ struct cpuidle_device *dev;
+
+ for_each_possible_cpu(cpu) {
+ ret = arm_poll_init_cpu(cpu);
+ if (ret)
+ goto out_fail;
+ }
+
+ return 0;
+
+out_fail:
+ pr_info("de-register all");
+ while (--cpu >= 0) {
+ dev = per_cpu(cpuidle_devices, cpu);
+ drv = cpuidle_get_cpu_driver(dev);
+ cpuidle_unregister(drv);
+ kfree(drv);
+ }
+
+ return ret;
+}
+
+module_param(poll_limit, uint, 0444);
+device_initcall(arm_poll_init);
--
2.34.1

2024-04-02 02:30:51

by Okanovic, Haris

[permalink] [raw]

Subject: Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle

On Mon, 2024-04-01 at 20:47 -0500, Haris Okanovic wrote:
> +static int __cpuidle arm_idle_wfe_poll(struct cpuidle_device *dev,
> +                               struct cpuidle_driver *drv, int idx)
> +{
> +       u64 time_start, time_limit;
> +
> +       time_start = local_clock();
> +       dev->poll_time_limit = false;
> +
> +       local_irq_enable();

Re this comment from Peter Zijlstra [1], should I use
raw_local_irq_enable() instead? I see examples of both under
drivers/cpuidle/.

[1]
https://elixir.bootlin.com/linux/v6.9-rc2/source/include/linux/cpuidle.h#L119

> +
> +       if (current_set_polling_and_test())
> +               goto end;
> +
> +       time_limit = cpuidle_poll_time(drv, dev);
> +
> +       do {
> +               // exclusive read arms the monitor for wfe
> +               if (__READ_ONCE_EX(current_thread_info()->flags) &
> _TIF_NEED_RESCHED)
> +                       goto end;
> +
> +               // may exit prematurely, see ARM_ARCH_TIMER_EVTSTREAM
> +               wfe();
> +       } while (local_clock() - time_start < time_limit);
> +
> +       dev->poll_time_limit = true;
> +
> +end:
> +       current_clr_polling();
> +       return idx;
> +}
> +

Thanks,
Haris Okanovic

2024-04-02 17:23:26

by Mark Rutland

[permalink] [raw]

Subject: Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle

On Mon, Apr 01, 2024 at 08:47:06PM -0500, Haris Okanovic wrote:
> An arm64 cpuidle driver with two states: (1) First polls for new runable
> tasks up to 100 us (by default) before (2) a wfi idle and awoken by
> interrupt (the current arm64 behavior). It allows CPUs to return from
> idle more quickly by avoiding the longer interrupt wakeup path, which
> may require EL1/EL2 transition in certain VM scenarios.

Please start off with an explanation of the problem you're trying to solve
(which IIUC is to wake up more quickly in certain cases), before describing the
solution. That makes it *significantly* easier for people to review this, since
once you have the problem statement in mind it's much easier to understand how
the solution space follows from that.

> Poll duration is optionally configured at load time via the poll_limit
> module parameter.

Why should this be a configurable parameter?

(note, at this point you haven't introduced any of the data below, so the
trade-off isn't clear to anyone).

> The default 100 us duration was experimentally chosen, by measuring QPS
> (queries per sec) of the MLPerf bert inference benchmark, which seems
> particularly susceptible to this change; see procedure below. 100 us is
> the inflection point where QPS stopped growing in a range of tested
> values. All results are from AWS m7g.16xlarge instances (Graviton3 SoC)
> with dedicated tenancy (dedicated hardware).
>
> | before | 10us | 25us | 50us | 100us | 125us | 150us | 200us | 300us |
> | 5.87 | 5.91 | 5.96 | 6.01 | 6.06 | 6.07 | 6.06 | 6.06 | 6.06 |
>
> Perf's scheduler benchmarks also improve with a range of poll_limit
> values >= 10 us. Higher limits produce near identical results within a
> 3% noise margin. The following tables are `perf bench sched` results,
> run times in seconds.
>
> `perf bench sched messaging -l 80000`
> | AWS instance | SoC | Before | After | % Change |
> | c6g.16xl (VM) | Graviton2 | 18.974 | 18.400 | none |
> | c7g.16xl (VM) | Graviton3 | 13.852 | 13.859 | none |
> | c6g.metal | Graviton2 | 17.621 | 16.744 | none |
> | c7g.metal | Graviton3 | 13.430 | 13.404 | none |
>
> `perf bench sched pipe -l 2500000`
> | AWS instance | SoC | Before | After | % Change |
> | c6g.16xl (VM) | Graviton2 | 30.158 | 15.181 | -50% |
> | c7g.16xl (VM) | Graviton3 | 18.289 | 12.067 | -34% |
> | c6g.metal | Graviton2 | 17.609 | 15.170 | -14% |
> | c7g.metal | Graviton3 | 14.103 | 12.304 | -13% |
>
> `perf bench sched seccomp-notify -l 2500000`
> | AWS instance | SoC | Before | After | % Change |
> | c6g.16xl (VM) | Graviton2 | 28.784 | 13.754 | -52% |
> | c7g.16xl (VM) | Graviton3 | 16.964 | 11.430 | -33% |
> | c6g.metal | Graviton2 | 15.717 | 13.536 | -14% |
> | c7g.metal | Graviton3 | 13.301 | 11.491 | -14% |

Ok, so perf numbers for a busy workload go up.

What happens for idle state residency on a mostly idle system?

> Steps to run MLPerf bert inference on Ubuntu 22.04:
> sudo apt install build-essential python3 python3-pip
> pip install "pybind11[global]" tensorflow transformers
> export TF_ENABLE_ONEDNN_OPTS=1
> export DNNL_DEFAULT_FPMATH_MODE=BF16
> git clone https://github.com/mlcommons/inference.git --recursive
> cd inference
> git checkout v2.0
> cd loadgen
> CFLAGS="-std=c++14" python3 setup.py bdist_wheel
> pip install dist/*.whl
> cd ../language/bert
> make setup
> python3 run.py --backend=tf --scenario=SingleStream
>
> Suggested-by: Ali Saidi <[email protected]>
> Reviewed-by: Ali Saidi <[email protected]>
> Reviewed-by: Geoff Blake <[email protected]>
> Cc: Brian Silver <[email protected]>
> Signed-off-by: Haris Okanovic <[email protected]>
> ---
> drivers/cpuidle/Kconfig.arm | 13 ++
> drivers/cpuidle/Makefile | 1 +
> drivers/cpuidle/cpuidle-arm-polling.c | 171 ++++++++++++++++++++++++++
> 3 files changed, 185 insertions(+)
> create mode 100644 drivers/cpuidle/cpuidle-arm-polling.c
>
> diff --git a/drivers/cpuidle/Kconfig.arm b/drivers/cpuidle/Kconfig.arm
> index a1ee475d180d..484666dda38d 100644
> --- a/drivers/cpuidle/Kconfig.arm
> +++ b/drivers/cpuidle/Kconfig.arm
> @@ -14,6 +14,19 @@ config ARM_CPUIDLE
> initialized by calling the CPU operations init idle hook
> provided by architecture code.
>
> +config ARM_POLL_CPUIDLE
> + bool "ARM64 CPU idle Driver with polling"
> + depends on ARM64
> + depends on ARM_ARCH_TIMER_EVTSTREAM
> + select CPU_IDLE_MULTIPLE_DRIVERS
> + help
> + Select this to enable a polling cpuidle driver for ARM64:
> + The first state polls TIF_NEED_RESCHED for best latency on short
> + sleep intervals. The second state falls back to arch_cpu_idle() to
> + wait for interrupt. This is can be helpful in workloads that
> + frequently block/wake at short intervals or VMs where wakeup IPIs
> + are more expensive.

Why is this a separate driver rather than an optional feature in the existing
driver?

The fact that this duplicates a bunch of code indicates to me that this should
not be a separate driver.

> +
> config ARM_PSCI_CPUIDLE
> bool "PSCI CPU idle Driver"
> depends on ARM_PSCI_FW
> diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
> index d103342b7cfc..23c21422792d 100644
> --- a/drivers/cpuidle/Makefile
> +++ b/drivers/cpuidle/Makefile
> @@ -22,6 +22,7 @@ obj-$(CONFIG_ARM_U8500_CPUIDLE) += cpuidle-ux500.o
> obj-$(CONFIG_ARM_AT91_CPUIDLE) += cpuidle-at91.o
> obj-$(CONFIG_ARM_EXYNOS_CPUIDLE) += cpuidle-exynos.o
> obj-$(CONFIG_ARM_CPUIDLE) += cpuidle-arm.o
> +obj-$(CONFIG_ARM_POLL_CPUIDLE) += cpuidle-arm-polling.o
> obj-$(CONFIG_ARM_PSCI_CPUIDLE) += cpuidle-psci.o
> obj-$(CONFIG_ARM_PSCI_CPUIDLE_DOMAIN) += cpuidle-psci-domain.o
> obj-$(CONFIG_ARM_TEGRA_CPUIDLE) += cpuidle-tegra.o
> diff --git a/drivers/cpuidle/cpuidle-arm-polling.c b/drivers/cpuidle/cpuidle-arm-polling.c
> new file mode 100644
> index 000000000000..bca128568114
> --- /dev/null
> +++ b/drivers/cpuidle/cpuidle-arm-polling.c
> @@ -0,0 +1,171 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * ARM64 CPU idle driver using wfe polling
> + *
> + * Copyright 2024 Amazon.com, Inc. or its affiliates. All rights reserved.
> + *
> + * Authors:
> + * Haris Okanovic <[email protected]>
> + * Brian Silver <[email protected]>
> + *
> + * Based on cpuidle-arm.c
> + * Copyright (C) 2014 ARM Ltd.
> + * Author: Lorenzo Pieralisi <[email protected]>
> + */
> +
> +#include <linux/cpu.h>
> +#include <linux/cpu_cooling.h>
> +#include <linux/cpuidle.h>
> +#include <linux/sched/clock.h>
> +
> +#include <asm/cpuidle.h>
> +#include <asm/readex.h>
> +
> +#include "dt_idle_states.h"
> +
> +/* Max duration of the wfe() poll loop in us, before transitioning to
> + * arch_cpu_idle()/wfi() sleep.
> + */

/*
* Comments should have the leading '/*' on a separate line.
* See https://www.kernel.org/doc/html/v6.8/process/coding-style.html#commenting
*/

> +#define DEFAULT_POLL_LIMIT_US 100
> +static unsigned int poll_limit __read_mostly = DEFAULT_POLL_LIMIT_US;
> +
> +/*
> + * arm_idle_wfe_poll - Polls state in wfe loop until reschedule is
> + * needed or timeout
> + */
> +static int __cpuidle arm_idle_wfe_poll(struct cpuidle_device *dev,
> + struct cpuidle_driver *drv, int idx)
> +{
> + u64 time_start, time_limit;
> +
> + time_start = local_clock();
> + dev->poll_time_limit = false;
> +
> + local_irq_enable();

Why enable IRQs here? We don't do that in the regular cpuidle-arm driver, nor
the cpuidle-psci driver, and there's no explanation here or in the commit message.

How does this interact with RCU? Is that still watching or are we in an
extended quiescent state? For PSCI idle states we enter an EQS, and it seems
like we probably should here...

> +
> + if (current_set_polling_and_test())
> + goto end;
> +
> + time_limit = cpuidle_poll_time(drv, dev);
> +
> + do {
> + // exclusive read arms the monitor for wfe
> + if (__READ_ONCE_EX(current_thread_info()->flags) & _TIF_NEED_RESCHED)
> + goto end;
> +
> + // may exit prematurely, see ARM_ARCH_TIMER_EVTSTREAM
> + wfe();
> + } while (local_clock() - time_start < time_limit);

.. and if the EVTSTREAM is disabled, we'll sit in WFE forever rather than
entering a deeper idle state, which doesn't seem desirable.

It's worth noting that now that we have WFET, we'll probably want to disable
the EVTSTREAM by default at some point, at least in some configurations, since
that'll be able to sit in a WFE state for longer while also reliably waking up
when required.

I suspect we want something like an smp_load_acquire_timeout() here to do the
wait in arch code (allowing us to use WFET), and enabling this state will
depend on either having WFET or EVTSTREAM.

> +
> + dev->poll_time_limit = true;
> +
> +end:
> + current_clr_polling();
> + return idx;
> +}
> +
> +/*
> + * arm_idle_wfi - Places cpu in lower power state until interrupt,
> + * a fallback to polling
> + */
> +static int __cpuidle arm_idle_wfi(struct cpuidle_device *dev,
> + struct cpuidle_driver *drv, int idx)
> +{
> + if (current_clr_polling_and_test()) {
> + local_irq_enable();
> + return idx;
> + }

Same as above, why enable IRQs here?

> + arch_cpu_idle();
> + return idx;

.. and if we need to enable IRQs in the other cases above, why do we *not*
need to enable them here?

> +}
> +
> +static struct cpuidle_driver arm_poll_idle_driver __initdata = {
> + .name = "arm_poll_idle",
> + .owner = THIS_MODULE,
> + .states = {
> + {
> + .enter = arm_idle_wfe_poll,
> + .exit_latency = 0,
> + .target_residency = 0,
> + .exit_latency_ns = 0,
> + .power_usage = UINT_MAX,
> + .flags = CPUIDLE_FLAG_POLLING,
> + .name = "WFE",
> + .desc = "ARM WFE",
> + },
> + {
> + .enter = arm_idle_wfi,
> + .exit_latency = DEFAULT_POLL_LIMIT_US,
> + .target_residency = DEFAULT_POLL_LIMIT_US,
> + .power_usage = UINT_MAX,
> + .name = "WFI",
> + .desc = "ARM WFI",
> + },
> + },
> + .state_count = 2,
> +};

How does this interact with the existing driver?

How does DEFAULT_POLL_LIMIT_US compare with PSCI idle states?

> +
> +/*
> + * arm_poll_init_cpu - Initializes arm cpuidle polling driver for one cpu
> + */
> +static int __init arm_poll_init_cpu(int cpu)
> +{
> + int ret;
> + struct cpuidle_driver *drv;
> +
> + drv = kmemdup(&arm_poll_idle_driver, sizeof(*drv), GFP_KERNEL);
> + if (!drv)
> + return -ENOMEM;
> +
> + drv->cpumask = (struct cpumask *)cpumask_of(cpu);
> + drv->states[1].exit_latency = poll_limit;
> + drv->states[1].target_residency = poll_limit;
> +
> + ret = cpuidle_register(drv, NULL);
> + if (ret) {
> + pr_err("failed to register driver: %d, cpu %d\n", ret, cpu);
> + goto out_kfree_drv;
> + }
> +
> + pr_info("registered driver cpu %d\n", cpu);

This does not need to be printed for each CPU.

Mark.

> +
> + cpuidle_cooling_register(drv);
> +
> + return 0;
> +
> +out_kfree_drv:
> + kfree(drv);
> + return ret;
> +}
> +
> +/*
> + * arm_poll_init - Initializes arm cpuidle polling driver
> + */
> +static int __init arm_poll_init(void)
> +{
> + int cpu, ret;
> + struct cpuidle_driver *drv;
> + struct cpuidle_device *dev;
> +
> + for_each_possible_cpu(cpu) {
> + ret = arm_poll_init_cpu(cpu);
> + if (ret)
> + goto out_fail;
> + }
> +
> + return 0;
> +
> +out_fail:
> + pr_info("de-register all");
> + while (--cpu >= 0) {
> + dev = per_cpu(cpuidle_devices, cpu);
> + drv = cpuidle_get_cpu_driver(dev);
> + cpuidle_unregister(drv);
> + kfree(drv);
> + }
> +
> + return ret;
> +}
> +
> +module_param(poll_limit, uint, 0444);
> +device_initcall(arm_poll_init);
> --
> 2.34.1
>
>

2024-04-02 17:27:45

by Mark Rutland

[permalink] [raw]

Subject: Re: [PATCH 2/3] arm64: add __READ_ONCE_EX()

On Mon, Apr 01, 2024 at 08:47:05PM -0500, Haris Okanovic wrote:
> Perform an exclusive load, which atomically loads a word and arms the
> execusive monitor to enable wfe() polling of an address.
>
> Adding this macro in preparation for an arm64 cpuidle driver which
> supports a wfe() based polling state.
>
> https://developer.arm.com/documentation/dht0008/a/arm-synchronization-primitives/exclusive-accesses/exclusive-monitors
>
> Signed-off-by: Haris Okanovic <[email protected]>
> ---
> arch/arm64/include/asm/readex.h | 46 +++++++++++++++++++++++++++++++++
> 1 file changed, 46 insertions(+)
> create mode 100644 arch/arm64/include/asm/readex.h
>
> diff --git a/arch/arm64/include/asm/readex.h b/arch/arm64/include/asm/readex.h
> new file mode 100644
> index 000000000000..51963c3107e1
> --- /dev/null
> +++ b/arch/arm64/include/asm/readex.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Based on arch/arm64/include/asm/rwonce.h
> + *
> + * Copyright (C) 2020 Google LLC.
> + * Copyright (C) 2024 Amazon.com, Inc. or its affiliates.
> + */
> +
> +#ifndef __ASM_READEX_H
> +#define __ASM_READEX_H
> +
> +#define __LOAD_EX(sfx, regs...) "ldaxr" #sfx "\t" #regs
> +
> +#define __READ_ONCE_EX(x) \
> +({ \
> + typeof(&(x)) __x = &(x); \
> + int atomic = 1; \
> + union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u; \
> + switch (sizeof(x)) { \
> + case 1: \
> + asm volatile(__LOAD_EX(b, %w0, %1) \
> + : "=r" (*(__u8 *)__u.__c) \
> + : "Q" (*__x) : "memory"); \
> + break; \
> + case 2: \
> + asm volatile(__LOAD_EX(h, %w0, %1) \
> + : "=r" (*(__u16 *)__u.__c) \
> + : "Q" (*__x) : "memory"); \
> + break; \
> + case 4: \
> + asm volatile(__LOAD_EX(, %w0, %1) \
> + : "=r" (*(__u32 *)__u.__c) \
> + : "Q" (*__x) : "memory"); \
> + break; \
> + case 8: \
> + asm volatile(__LOAD_EX(, %0, %1) \
> + : "=r" (*(__u64 *)__u.__c) \
> + : "Q" (*__x) : "memory"); \
> + break; \
> + default: \
> + atomic = 0; \
> + } \
> + atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(__x))__x);\
> +})

Why can't you use the existing smp_cond_load_relaxed() or
smp_cond_load_acquire()?

I don't believe this is necessary.

Mark.

2024-04-02 23:16:58

[permalink] [raw]

Subject: Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle

Mark Rutland <[email protected]> writes:

> On Mon, Apr 01, 2024 at 08:47:06PM -0500, Haris Okanovic wrote:
>> An arm64 cpuidle driver with two states: (1) First polls for new runable
>> tasks up to 100 us (by default) before (2) a wfi idle and awoken by
>> interrupt (the current arm64 behavior). It allows CPUs to return from
>> idle more quickly by avoiding the longer interrupt wakeup path, which
>> may require EL1/EL2 transition in certain VM scenarios.
>
> Please start off with an explanation of the problem you're trying to solve
> (which IIUC is to wake up more quickly in certain cases), before describing the
> solution. That makes it *significantly* easier for people to review this, since
> once you have the problem statement in mind it's much easier to understand how
> the solution space follows from that.
>
>> Poll duration is optionally configured at load time via the poll_limit
>> module parameter.
>
> Why should this be a configurable parameter?
>
> (note, at this point you haven't introduced any of the data below, so the
> trade-off isn't clear to anyone).
>
>> The default 100 us duration was experimentally chosen, by measuring QPS
>> (queries per sec) of the MLPerf bert inference benchmark, which seems
>> particularly susceptible to this change; see procedure below. 100 us is
>> the inflection point where QPS stopped growing in a range of tested
>> values. All results are from AWS m7g.16xlarge instances (Graviton3 SoC)
>> with dedicated tenancy (dedicated hardware).
>>
>> | before | 10us | 25us | 50us | 100us | 125us | 150us | 200us | 300us |
>> | 5.87 | 5.91 | 5.96 | 6.01 | 6.06 | 6.07 | 6.06 | 6.06 | 6.06 |
>>
>> Perf's scheduler benchmarks also improve with a range of poll_limit
>> values >= 10 us. Higher limits produce near identical results within a
>> 3% noise margin. The following tables are `perf bench sched` results,
>> run times in seconds.
>>
>> `perf bench sched messaging -l 80000`
>> | AWS instance | SoC | Before | After | % Change |
>> | c6g.16xl (VM) | Graviton2 | 18.974 | 18.400 | none |
>> | c7g.16xl (VM) | Graviton3 | 13.852 | 13.859 | none |
>> | c6g.metal | Graviton2 | 17.621 | 16.744 | none |
>> | c7g.metal | Graviton3 | 13.430 | 13.404 | none |
>>
>> `perf bench sched pipe -l 2500000`
>> | AWS instance | SoC | Before | After | % Change |
>> | c6g.16xl (VM) | Graviton2 | 30.158 | 15.181 | -50% |
>> | c7g.16xl (VM) | Graviton3 | 18.289 | 12.067 | -34% |
>> | c6g.metal | Graviton2 | 17.609 | 15.170 | -14% |
>> | c7g.metal | Graviton3 | 14.103 | 12.304 | -13% |
>>
>> `perf bench sched seccomp-notify -l 2500000`
>> | AWS instance | SoC | Before | After | % Change |
>> | c6g.16xl (VM) | Graviton2 | 28.784 | 13.754 | -52% |
>> | c7g.16xl (VM) | Graviton3 | 16.964 | 11.430 | -33% |
>> | c6g.metal | Graviton2 | 15.717 | 13.536 | -14% |
>> | c7g.metal | Graviton3 | 13.301 | 11.491 | -14% |
>
> Ok, so perf numbers for a busy workload go up.
>
> What happens for idle state residency on a mostly idle system?
>
>> Steps to run MLPerf bert inference on Ubuntu 22.04:
>> sudo apt install build-essential python3 python3-pip
>> pip install "pybind11[global]" tensorflow transformers
>> export TF_ENABLE_ONEDNN_OPTS=1
>> export DNNL_DEFAULT_FPMATH_MODE=BF16
>> git clone https://github.com/mlcommons/inference.git --recursive
>> cd inference
>> git checkout v2.0
>> cd loadgen
>> CFLAGS="-std=c++14" python3 setup.py bdist_wheel
>> pip install dist/*.whl
>> cd ../language/bert
>> make setup
>> python3 run.py --backend=tf --scenario=SingleStream
>>
>> Suggested-by: Ali Saidi <[email protected]>
>> Reviewed-by: Ali Saidi <[email protected]>
>> Reviewed-by: Geoff Blake <[email protected]>
>> Cc: Brian Silver <[email protected]>
>> Signed-off-by: Haris Okanovic <[email protected]>
>> ---
>> drivers/cpuidle/Kconfig.arm | 13 ++
>> drivers/cpuidle/Makefile | 1 +
>> drivers/cpuidle/cpuidle-arm-polling.c | 171 ++++++++++++++++++++++++++
>> 3 files changed, 185 insertions(+)
>> create mode 100644 drivers/cpuidle/cpuidle-arm-polling.c
>>
>> diff --git a/drivers/cpuidle/Kconfig.arm b/drivers/cpuidle/Kconfig.arm
>> index a1ee475d180d..484666dda38d 100644
>> --- a/drivers/cpuidle/Kconfig.arm
>> +++ b/drivers/cpuidle/Kconfig.arm
>> @@ -14,6 +14,19 @@ config ARM_CPUIDLE
>> initialized by calling the CPU operations init idle hook
>> provided by architecture code.
>>
>> +config ARM_POLL_CPUIDLE
>> + bool "ARM64 CPU idle Driver with polling"
>> + depends on ARM64
>> + depends on ARM_ARCH_TIMER_EVTSTREAM
>> + select CPU_IDLE_MULTIPLE_DRIVERS
>> + help
>> + Select this to enable a polling cpuidle driver for ARM64:
>> + The first state polls TIF_NEED_RESCHED for best latency on short
>> + sleep intervals. The second state falls back to arch_cpu_idle() to
>> + wait for interrupt. This is can be helpful in workloads that
>> + frequently block/wake at short intervals or VMs where wakeup IPIs
>> + are more expensive.
>
> Why is this a separate driver rather than an optional feature in the existing
> driver?
>
> The fact that this duplicates a bunch of code indicates to me that this should
> not be a separate driver.

Also, the cpuidle-haltpoll driver is meant to do something quite similar.
That driver polls adaptively based on the haltpoll governor's tuning of
the polling period.

However, cpuidle-haltpoll is currently x86 only. Mihai (also from Oracle)
posted patches [1] adding support for ARM64.

Haris, could you take a look at it and see if it does what you are
looking for? The polling path in the linked version also uses
smp_cond_load_relaxed() so even the mechanisms for both of these
are fairly similar.

(I'll be sending out the next version shortly. Happy to Cc you if you
would like to try that out.)

Thanks
Ankur

[1] https://lore.kernel.org/lkml/[email protected]/

>
>> +
>> config ARM_PSCI_CPUIDLE
>> bool "PSCI CPU idle Driver"
>> depends on ARM_PSCI_FW
>> diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
>> index d103342b7cfc..23c21422792d 100644
>> --- a/drivers/cpuidle/Makefile
>> +++ b/drivers/cpuidle/Makefile
>> @@ -22,6 +22,7 @@ obj-$(CONFIG_ARM_U8500_CPUIDLE) += cpuidle-ux500.o
>> obj-$(CONFIG_ARM_AT91_CPUIDLE) += cpuidle-at91.o
>> obj-$(CONFIG_ARM_EXYNOS_CPUIDLE) += cpuidle-exynos.o
>> obj-$(CONFIG_ARM_CPUIDLE) += cpuidle-arm.o
>> +obj-$(CONFIG_ARM_POLL_CPUIDLE) += cpuidle-arm-polling.o
>> obj-$(CONFIG_ARM_PSCI_CPUIDLE) += cpuidle-psci.o
>> obj-$(CONFIG_ARM_PSCI_CPUIDLE_DOMAIN) += cpuidle-psci-domain.o
>> obj-$(CONFIG_ARM_TEGRA_CPUIDLE) += cpuidle-tegra.o
>> diff --git a/drivers/cpuidle/cpuidle-arm-polling.c b/drivers/cpuidle/cpuidle-arm-polling.c
>> new file mode 100644
>> index 000000000000..bca128568114
>> --- /dev/null
>> +++ b/drivers/cpuidle/cpuidle-arm-polling.c
>> @@ -0,0 +1,171 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * ARM64 CPU idle driver using wfe polling
>> + *
>> + * Copyright 2024 Amazon.com, Inc. or its affiliates. All rights reserved.
>> + *
>> + * Authors:
>> + * Haris Okanovic <[email protected]>
>> + * Brian Silver <[email protected]>
>> + *
>> + * Based on cpuidle-arm.c
>> + * Copyright (C) 2014 ARM Ltd.
>> + * Author: Lorenzo Pieralisi <[email protected]>
>> + */
>> +
>> +#include <linux/cpu.h>
>> +#include <linux/cpu_cooling.h>
>> +#include <linux/cpuidle.h>
>> +#include <linux/sched/clock.h>
>> +
>> +#include <asm/cpuidle.h>
>> +#include <asm/readex.h>
>> +
>> +#include "dt_idle_states.h"
>> +
>> +/* Max duration of the wfe() poll loop in us, before transitioning to
>> + * arch_cpu_idle()/wfi() sleep.
>> + */
>
> /*
> * Comments should have the leading '/*' on a separate line.
> * See https://www.kernel.org/doc/html/v6.8/process/coding-style.html#commenting
> */
>
>> +#define DEFAULT_POLL_LIMIT_US 100
>> +static unsigned int poll_limit __read_mostly = DEFAULT_POLL_LIMIT_US;
>> +
>> +/*
>> + * arm_idle_wfe_poll - Polls state in wfe loop until reschedule is
>> + * needed or timeout
>> + */
>> +static int __cpuidle arm_idle_wfe_poll(struct cpuidle_device *dev,
>> + struct cpuidle_driver *drv, int idx)
>> +{
>> + u64 time_start, time_limit;
>> +
>> + time_start = local_clock();
>> + dev->poll_time_limit = false;
>> +
>> + local_irq_enable();
>
> Why enable IRQs here? We don't do that in the regular cpuidle-arm driver, nor
> the cpuidle-psci driver, and there's no explanation here or in the commit message.
>
> How does this interact with RCU? Is that still watching or are we in an
> extended quiescent state? For PSCI idle states we enter an EQS, and it seems
> like we probably should here...
>
>> +
>> + if (current_set_polling_and_test())
>> + goto end;
>> +
>> + time_limit = cpuidle_poll_time(drv, dev);
>> +
>> + do {
>> + // exclusive read arms the monitor for wfe
>> + if (__READ_ONCE_EX(current_thread_info()->flags) & _TIF_NEED_RESCHED)
>> + goto end;
>> +
>> + // may exit prematurely, see ARM_ARCH_TIMER_EVTSTREAM
>> + wfe();
>> + } while (local_clock() - time_start < time_limit);
>
> .. and if the EVTSTREAM is disabled, we'll sit in WFE forever rather than
> entering a deeper idle state, which doesn't seem desirable.
>
> It's worth noting that now that we have WFET, we'll probably want to disable
> the EVTSTREAM by default at some point, at least in some configurations, since
> that'll be able to sit in a WFE state for longer while also reliably waking up
> when required.
>
> I suspect we want something like an smp_load_acquire_timeout() here to do the
> wait in arch code (allowing us to use WFET), and enabling this state will
> depend on either having WFET or EVTSTREAM.
>
>> +
>> + dev->poll_time_limit = true;
>> +
>> +end:
>> + current_clr_polling();
>> + return idx;
>> +}
>> +
>> +/*
>> + * arm_idle_wfi - Places cpu in lower power state until interrupt,
>> + * a fallback to polling
>> + */
>> +static int __cpuidle arm_idle_wfi(struct cpuidle_device *dev,
>> + struct cpuidle_driver *drv, int idx)
>> +{
>> + if (current_clr_polling_and_test()) {
>> + local_irq_enable();
>> + return idx;
>> + }
>
> Same as above, why enable IRQs here?
>
>> + arch_cpu_idle();
>> + return idx;
>
> .. and if we need to enable IRQs in the other cases above, why do we *not*
> need to enable them here?
>
>> +}
>> +
>> +static struct cpuidle_driver arm_poll_idle_driver __initdata = {
>> + .name = "arm_poll_idle",
>> + .owner = THIS_MODULE,
>> + .states = {
>> + {
>> + .enter = arm_idle_wfe_poll,
>> + .exit_latency = 0,
>> + .target_residency = 0,
>> + .exit_latency_ns = 0,
>> + .power_usage = UINT_MAX,
>> + .flags = CPUIDLE_FLAG_POLLING,
>> + .name = "WFE",
>> + .desc = "ARM WFE",
>> + },
>> + {
>> + .enter = arm_idle_wfi,
>> + .exit_latency = DEFAULT_POLL_LIMIT_US,
>> + .target_residency = DEFAULT_POLL_LIMIT_US,
>> + .power_usage = UINT_MAX,
>> + .name = "WFI",
>> + .desc = "ARM WFI",
>> + },
>> + },
>> + .state_count = 2,
>> +};
>
> How does this interact with the existing driver?
>
> How does DEFAULT_POLL_LIMIT_US compare with PSCI idle states?
>
>> +
>> +/*
>> + * arm_poll_init_cpu - Initializes arm cpuidle polling driver for one cpu
>> + */
>> +static int __init arm_poll_init_cpu(int cpu)
>> +{
>> + int ret;
>> + struct cpuidle_driver *drv;
>> +
>> + drv = kmemdup(&arm_poll_idle_driver, sizeof(*drv), GFP_KERNEL);
>> + if (!drv)
>> + return -ENOMEM;
>> +
>> + drv->cpumask = (struct cpumask *)cpumask_of(cpu);
>> + drv->states[1].exit_latency = poll_limit;
>> + drv->states[1].target_residency = poll_limit;
>> +
>> + ret = cpuidle_register(drv, NULL);
>> + if (ret) {
>> + pr_err("failed to register driver: %d, cpu %d\n", ret, cpu);
>> + goto out_kfree_drv;
>> + }
>> +
>> + pr_info("registered driver cpu %d\n", cpu);
>
> This does not need to be printed for each CPU.
>
> Mark.
>
>> +
>> + cpuidle_cooling_register(drv);
>> +
>> + return 0;
>> +
>> +out_kfree_drv:
>> + kfree(drv);
>> + return ret;
>> +}
>> +
>> +/*
>> + * arm_poll_init - Initializes arm cpuidle polling driver
>> + */
>> +static int __init arm_poll_init(void)
>> +{
>> + int cpu, ret;
>> + struct cpuidle_driver *drv;
>> + struct cpuidle_device *dev;
>> +
>> + for_each_possible_cpu(cpu) {
>> + ret = arm_poll_init_cpu(cpu);
>> + if (ret)
>> + goto out_fail;
>> + }
>> +
>> + return 0;
>> +
>> +out_fail:
>> + pr_info("de-register all");
>> + while (--cpu >= 0) {
>> + dev = per_cpu(cpuidle_devices, cpu);
>> + drv = cpuidle_get_cpu_driver(dev);
>> + cpuidle_unregister(drv);
>> + kfree(drv);
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +module_param(poll_limit, uint, 0444);
>> +device_initcall(arm_poll_init);
>> --
>> 2.34.1
>>
>>

--
ankur

2024-04-05 19:37:05

by Okanovic, Haris

[permalink] [raw]

Subject: Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle

On Tue, 2024-04-02 at 16:17 -0700, Ankur Arora wrote:
> CAUTION: This email originated from outside of the organization. Do
> not click links or open attachments unless you can confirm the sender
> and know the content is safe.
>
>
>
> Mark Rutland <[email protected]> writes:
>
> > On Mon, Apr 01, 2024 at 08:47:06PM -0500, Haris Okanovic wrote:
> > > An arm64 cpuidle driver with two states: (1) First polls for new
> > > runable
> > > tasks up to 100 us (by default) before (2) a wfi idle and awoken
> > > by
> > > interrupt (the current arm64 behavior). It allows CPUs to return
> > > from
> > > idle more quickly by avoiding the longer interrupt wakeup path,
> > > which
> > > may require EL1/EL2 transition in certain VM scenarios.
> >
> > Please start off with an explanation of the problem you're trying
> > to solve
> > (which IIUC is to wake up more quickly in certain cases), before
> > describing the
> > solution. That makes it *significantly* easier for people to review
> > this, since
> > once you have the problem statement in mind it's much easier to
> > understand how
> > the solution space follows from that.
> >
> > > Poll duration is optionally configured at load time via the
> > > poll_limit
> > > module parameter.
> >
> > Why should this be a configurable parameter?
> >
> > (note, at this point you haven't introduced any of the data below,
> > so the
> > trade-off isn't clear to anyone).
> >
> > > The default 100 us duration was experimentally chosen, by
> > > measuring QPS
> > > (queries per sec) of the MLPerf bert inference benchmark, which
> > > seems
> > > particularly susceptible to this change; see procedure below. 100
> > > us is
> > > the inflection point where QPS stopped growing in a range of
> > > tested
> > > values. All results are from AWS m7g.16xlarge instances
> > > (Graviton3 SoC)
> > > with dedicated tenancy (dedicated hardware).
> > >
> > > > before | 10us | 25us | 50us | 100us | 125us | 150us | 200us |
> > > > 300us |
> > > > 5.87   | 5.91 | 5.96 | 6.01 | 6.06 | 6.07 | 6.06 | 6.06 |
> > > > 6.06 |
> > >
> > > Perf's scheduler benchmarks also improve with a range of
> > > poll_limit
> > > values >= 10 us. Higher limits produce near identical results
> > > within a
> > > 3% noise margin. The following tables are `perf bench sched`
> > > results,
> > > run times in seconds.
> > >
> > > `perf bench sched messaging -l 80000`
> > > > AWS instance | SoC       | Before | After | % Change |
> > > > c6g.16xl (VM) | Graviton2 | 18.974 | 18.400 | none     |
> > > > c7g.16xl (VM) | Graviton3 | 13.852 | 13.859 | none     |
> > > > c6g.metal     | Graviton2 | 17.621 | 16.744 | none     |
> > > > c7g.metal     | Graviton3 | 13.430 | 13.404 | none     |
> > >
> > > `perf bench sched pipe -l 2500000`
> > > > AWS instance | SoC       | Before | After | % Change |
> > > > c6g.16xl (VM) | Graviton2 | 30.158 | 15.181 | -50%     |
> > > > c7g.16xl (VM) | Graviton3 | 18.289 | 12.067 | -34%     |
> > > > c6g.metal     | Graviton2 | 17.609 | 15.170 | -14%     |
> > > > c7g.metal     | Graviton3 | 14.103 | 12.304 | -13%     |
> > >
> > > `perf bench sched seccomp-notify -l 2500000`
> > > > AWS instance | SoC       | Before | After | % Change |
> > > > c6g.16xl (VM) | Graviton2 | 28.784 | 13.754 | -52%     |
> > > > c7g.16xl (VM) | Graviton3 | 16.964 | 11.430 | -33%     |
> > > > c6g.metal     | Graviton2 | 15.717 | 13.536 | -14%     |
> > > > c7g.metal     | Graviton3 | 13.301 | 11.491 | -14%     |
> >
> > Ok, so perf numbers for a busy workload go up.
> >
> > What happens for idle state residency on a mostly idle system?
> >
> > > Steps to run MLPerf bert inference on Ubuntu 22.04:
> > > sudo apt install build-essential python3 python3-pip
> > > pip install "pybind11[global]" tensorflow transformers
> > > export TF_ENABLE_ONEDNN_OPTS=1
> > > export DNNL_DEFAULT_FPMATH_MODE=BF16
> > > git clone https://github.com/mlcommons/inference.git --recursive
> > > cd inference
> > > git checkout v2.0
> > > cd loadgen
> > > CFLAGS="-std=c++14" python3 setup.py bdist_wheel
> > > pip install dist/*.whl
> > > cd ../language/bert
> > > make setup
> > > python3 run.py --backend=tf --scenario=SingleStream
> > >
> > > Suggested-by: Ali Saidi <[email protected]>
> > > Reviewed-by: Ali Saidi <[email protected]>
> > > Reviewed-by: Geoff Blake <[email protected]>
> > > Cc: Brian Silver <[email protected]>
> > > Signed-off-by: Haris Okanovic <[email protected]>
> > > ---
> > > drivers/cpuidle/Kconfig.arm           | 13 ++
> > > drivers/cpuidle/Makefile              |   1 +
> > > drivers/cpuidle/cpuidle-arm-polling.c | 171
> > > ++++++++++++++++++++++++++
> > > 3 files changed, 185 insertions(+)
> > > create mode 100644 drivers/cpuidle/cpuidle-arm-polling.c
> > >
> > > diff --git a/drivers/cpuidle/Kconfig.arm
> > > b/drivers/cpuidle/Kconfig.arm
> > > index a1ee475d180d..484666dda38d 100644
> > > --- a/drivers/cpuidle/Kconfig.arm
> > > +++ b/drivers/cpuidle/Kconfig.arm
> > > @@ -14,6 +14,19 @@ config ARM_CPUIDLE
> > >        initialized by calling the CPU operations init idle hook
> > >        provided by architecture code.
> > >
> > > +config ARM_POLL_CPUIDLE
> > > +    bool "ARM64 CPU idle Driver with polling"
> > > +    depends on ARM64
> > > +    depends on ARM_ARCH_TIMER_EVTSTREAM
> > > +    select CPU_IDLE_MULTIPLE_DRIVERS
> > > +    help
> > > +      Select this to enable a polling cpuidle driver for ARM64:
> > > +      The first state polls TIF_NEED_RESCHED for best latency on
> > > short
> > > +      sleep intervals. The second state falls back to
> > > arch_cpu_idle() to
> > > +      wait for interrupt. This is can be helpful in workloads
> > > that
> > > +      frequently block/wake at short intervals or VMs where
> > > wakeup IPIs
> > > +      are more expensive.
> >
> > Why is this a separate driver rather than an optional feature in
> > the existing
> > driver?
> >
> > The fact that this duplicates a bunch of code indicates to me that
> > this should
> > not be a separate driver.
>
> Also, the cpuidle-haltpoll driver is meant to do something quite
> similar.
> That driver polls adaptively based on the haltpoll governor's tuning
> of
> the polling period.
>
> However, cpuidle-haltpoll is currently x86 only. Mihai (also from
> Oracle)
> posted patches [1] adding support for ARM64.
>
> Haris, could you take a look at it and see if it does what you are
> looking for? The polling path in the linked version also uses
> smp_cond_load_relaxed() so even the mechanisms for both of these
> are fairly similar.

Hi Ankur,

I agree, except for that small bug in exit condition, your haltpoll
changes fundamentally do the same thing:

> @ int __cpuidle poll_idle(...
> - if (!(ret & _TIF_NEED_RESCHED))
> + if (ret & _TIF_NEED_RESCHE

I'll follow up with another patch for AWS Graviton when your team is
finished.

Do you have a rough ETA of when your changes will land in master?

>
> (I'll be sending out the next version shortly. Happy to Cc you if you
> would like to try that out.)

Yes, please do!

Thanks,
Haris Okanovic

>
> Thanks
> Ankur
>
> [1]
> https://lore.kernel.org/lkml/[email protected]/
>
> >
> > > +
> > > config ARM_PSCI_CPUIDLE
> > >      bool "PSCI CPU idle Driver"
> > >      depends on ARM_PSCI_FW
> > > diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
> > > index d103342b7cfc..23c21422792d 100644
> > > --- a/drivers/cpuidle/Makefile
> > > +++ b/drivers/cpuidle/Makefile
> > > @@ -22,6 +22,7 @@ obj-$(CONFIG_ARM_U8500_CPUIDLE)         +=
> > > cpuidle-ux500.o
> > > obj-$(CONFIG_ARM_AT91_CPUIDLE)          += cpuidle-at91.o
> > > obj-$(CONFIG_ARM_EXYNOS_CPUIDLE)        += cpuidle-exynos.o
> > > obj-$(CONFIG_ARM_CPUIDLE)           += cpuidle-arm.o
> > > +obj-$(CONFIG_ARM_POLL_CPUIDLE)              += cpuidle-arm-
> > > polling.o
> > > obj-$(CONFIG_ARM_PSCI_CPUIDLE)              += cpuidle-psci.o
> > > obj-$(CONFIG_ARM_PSCI_CPUIDLE_DOMAIN)       += cpuidle-psci-
> > > domain.o
> > > obj-$(CONFIG_ARM_TEGRA_CPUIDLE)             += cpuidle-tegra.o
> > > diff --git a/drivers/cpuidle/cpuidle-arm-polling.c
> > > b/drivers/cpuidle/cpuidle-arm-polling.c
> > > new file mode 100644
> > > index 000000000000..bca128568114
> > > --- /dev/null
> > > +++ b/drivers/cpuidle/cpuidle-arm-polling.c
> > > @@ -0,0 +1,171 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * ARM64 CPU idle driver using wfe polling
> > > + *
> > > + * Copyright 2024 Amazon.com, Inc. or its affiliates. All rights
> > > reserved.
> > > + *
> > > + * Authors:
> > > + *   Haris Okanovic <[email protected]>
> > > + *   Brian Silver <[email protected]>
> > > + *
> > > + * Based on cpuidle-arm.c
> > > + * Copyright (C) 2014 ARM Ltd.
> > > + * Author: Lorenzo Pieralisi <[email protected]>
> > > + */
> > > +
> > > +#include <linux/cpu.h>
> > > +#include <linux/cpu_cooling.h>
> > > +#include <linux/cpuidle.h>
> > > +#include <linux/sched/clock.h>
> > > +
> > > +#include <asm/cpuidle.h>
> > > +#include <asm/readex.h>
> > > +
> > > +#include "dt_idle_states.h"
> > > +
> > > +/* Max duration of the wfe() poll loop in us, before
> > > transitioning to
> > > + * arch_cpu_idle()/wfi() sleep.
> > > + */
> >
> > /*
> > * Comments should have the leading '/*' on a separate line.
> > * See
> > https://www.kernel.org/doc/html/v6.8/process/coding-style.html#commenting
> > */
> >
> > > +#define DEFAULT_POLL_LIMIT_US 100
> > > +static unsigned int poll_limit __read_mostly =
> > > DEFAULT_POLL_LIMIT_US;
> > > +
> > > +/*
> > > + * arm_idle_wfe_poll - Polls state in wfe loop until reschedule
> > > is
> > > + * needed or timeout
> > > + */
> > > +static int __cpuidle arm_idle_wfe_poll(struct cpuidle_device
> > > *dev,
> > > +                            struct cpuidle_driver *drv, int idx)
> > > +{
> > > +    u64 time_start, time_limit;
> > > +
> > > +    time_start = local_clock();
> > > +    dev->poll_time_limit = false;
> > > +
> > > +    local_irq_enable();
> >
> > Why enable IRQs here? We don't do that in the regular cpuidle-arm
> > driver, nor
> > the cpuidle-psci driver, and there's no explanation here or in the
> > commit message.
> >
> > How does this interact with RCU? Is that still watching or are we
> > in an
> > extended quiescent state? For PSCI idle states we enter an EQS, and
> > it seems
> > like we probably should here...
> >
> > > +
> > > +    if (current_set_polling_and_test())
> > > +            goto end;
> > > +
> > > +    time_limit = cpuidle_poll_time(drv, dev);
> > > +
> > > +    do {
> > > +            // exclusive read arms the monitor for wfe
> > > +            if (__READ_ONCE_EX(current_thread_info()->flags) &
> > > _TIF_NEED_RESCHED)
> > > +                    goto end;
> > > +
> > > +            // may exit prematurely, see
> > > ARM_ARCH_TIMER_EVTSTREAM
> > > +            wfe();
> > > +    } while (local_clock() - time_start < time_limit);
> >
> > .. and if the EVTSTREAM is disabled, we'll sit in WFE forever
> > rather than
> > entering a deeper idle state, which doesn't seem desirable.
> >
> > It's worth noting that now that we have WFET, we'll probably want
> > to disable
> > the EVTSTREAM by default at some point, at least in some
> > configurations, since
> > that'll be able to sit in a WFE state for longer while also
> > reliably waking up
> > when required.
> >
> > I suspect we want something like an smp_load_acquire_timeout() here
> > to do the
> > wait in arch code (allowing us to use WFET), and enabling this
> > state will
> > depend on either having WFET or EVTSTREAM.
> >
> > > +
> > > +    dev->poll_time_limit = true;
> > > +
> > > +end:
> > > +    current_clr_polling();
> > > +    return idx;
> > > +}
> > > +
> > > +/*
> > > + * arm_idle_wfi - Places cpu in lower power state until
> > > interrupt,
> > > + * a fallback to polling
> > > + */
> > > +static int __cpuidle arm_idle_wfi(struct cpuidle_device *dev,
> > > +                            struct cpuidle_driver *drv, int idx)
> > > +{
> > > +    if (current_clr_polling_and_test()) {
> > > +            local_irq_enable();
> > > +            return idx;
> > > +    }
> >
> > Same as above, why enable IRQs here?
> >
> > > +    arch_cpu_idle();
> > > +    return idx;
> >
> > .. and if we need to enable IRQs in the other cases above, why do
> > we *not*
> > need to enable them here?
> >
> > > +}
> > > +
> > > +static struct cpuidle_driver arm_poll_idle_driver __initdata = {
> > > +    .name = "arm_poll_idle",
> > > +    .owner = THIS_MODULE,
> > > +    .states = {
> > > +            {
> > > +                    .enter                  = arm_idle_wfe_poll,
> > > +                    .exit_latency           = 0,
> > > +                    .target_residency       = 0,
> > > +                    .exit_latency_ns        = 0,
> > > +                    .power_usage            = UINT_MAX,
> > > +                    .flags                  =
> > > CPUIDLE_FLAG_POLLING,
> > > +                    .name                   = "WFE",
> > > +                    .desc                   = "ARM WFE",
> > > +            },
> > > +            {
> > > +                    .enter                  = arm_idle_wfi,
> > > +                    .exit_latency           =
> > > DEFAULT_POLL_LIMIT_US,
> > > +                    .target_residency       =
> > > DEFAULT_POLL_LIMIT_US,
> > > +                    .power_usage            = UINT_MAX,
> > > +                    .name                   = "WFI",
> > > +                    .desc                   = "ARM WFI",
> > > +            },
> > > +    },
> > > +    .state_count = 2,
> > > +};
> >
> > How does this interact with the existing driver?
> >
> > How does DEFAULT_POLL_LIMIT_US compare with PSCI idle states?
> >
> > > +
> > > +/*
> > > + * arm_poll_init_cpu - Initializes arm cpuidle polling driver
> > > for one cpu
> > > + */
> > > +static int __init arm_poll_init_cpu(int cpu)
> > > +{
> > > +    int ret;
> > > +    struct cpuidle_driver *drv;
> > > +
> > > +    drv = kmemdup(&arm_poll_idle_driver, sizeof(*drv),
> > > GFP_KERNEL);
> > > +    if (!drv)
> > > +            return -ENOMEM;
> > > +
> > > +    drv->cpumask = (struct cpumask *)cpumask_of(cpu);
> > > +    drv->states[1].exit_latency = poll_limit;
> > > +    drv->states[1].target_residency = poll_limit;
> > > +
> > > +    ret = cpuidle_register(drv, NULL);
> > > +    if (ret) {
> > > +            pr_err("failed to register driver: %d, cpu %d\n",
> > > ret, cpu);
> > > +            goto out_kfree_drv;
> > > +    }
> > > +
> > > +    pr_info("registered driver cpu %d\n", cpu);
> >
> > This does not need to be printed for each CPU.
> >
> > Mark.
> >
> > > +
> > > +    cpuidle_cooling_register(drv);
> > > +
> > > +    return 0;
> > > +
> > > +out_kfree_drv:
> > > +    kfree(drv);
> > > +    return ret;
> > > +}
> > > +
> > > +/*
> > > + * arm_poll_init - Initializes arm cpuidle polling driver
> > > + */
> > > +static int __init arm_poll_init(void)
> > > +{
> > > +    int cpu, ret;
> > > +    struct cpuidle_driver *drv;
> > > +    struct cpuidle_device *dev;
> > > +
> > > +    for_each_possible_cpu(cpu) {
> > > +            ret = arm_poll_init_cpu(cpu);
> > > +            if (ret)
> > > +                    goto out_fail;
> > > +    }
> > > +
> > > +    return 0;
> > > +
> > > +out_fail:
> > > +    pr_info("de-register all");
> > > +    while (--cpu >= 0) {
> > > +            dev = per_cpu(cpuidle_devices, cpu);
> > > +            drv = cpuidle_get_cpu_driver(dev);
> > > +            cpuidle_unregister(drv);
> > > +            kfree(drv);
> > > +    }
> > > +
> > > +    return ret;
> > > +}
> > > +
> > > +module_param(poll_limit, uint, 0444);
> > > +device_initcall(arm_poll_init);
> > > --
> > > 2.34.1
> > >
> > >
>
>
> --
> ankur

2024-04-05 20:06:24

by Okanovic, Haris

[permalink] [raw]

Subject: Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle

On Tue, 2024-04-02 at 18:23 +0100, Mark Rutland wrote:
> CAUTION: This email originated from outside of the organization. Do
> not click links or open attachments unless you can confirm the sender
> and know the content is safe.
>
>
>
> On Mon, Apr 01, 2024 at 08:47:06PM -0500, Haris Okanovic wrote:
> > An arm64 cpuidle driver with two states: (1) First polls for new
> > runable
> > tasks up to 100 us (by default) before (2) a wfi idle and awoken by
> > interrupt (the current arm64 behavior). It allows CPUs to return
> > from
> > idle more quickly by avoiding the longer interrupt wakeup path,
> > which
> > may require EL1/EL2 transition in certain VM scenarios.
>
> Please start off with an explanation of the problem you're trying to
> solve
> (which IIUC is to wake up more quickly in certain cases), before
> describing the
> solution. That makes it *significantly* easier for people to review
> this, since
> once you have the problem statement in mind it's much easier to
> understand how
> the solution space follows from that.

Hi Mark,

I'll try to clarify:

This change improves performance of certain multithreaded workloads
that rapidly cycles through blocked/running states, by avoiding a
relatively costly IPI wakeup path.

One example is Tensorflow as demonstated by the benchmark shared
earlier. It generates 250k reschedule IPIs per sec on AWS m7g.16xlarge
(Graviton3 SoC) in Linux 6.9.0-rc2, which drops to 120k/sec with this
change and a ~3% gain in throughput.

Some Java workloads have a similar pattern. For example, Renissance
Reactors benchmark (Reactors.IO framework app) drops from 370k to 80k
IPIs and sees a ~25% latency drop.

Renissance Reactors repro steps on Ubuntu 22.04 on AWS:
sudo apt-get install java-common
wget
https://corretto.aws/downloads/latest/amazon-corretto-17-aarch64-linux-jdk.deb
sudo dpkg -i amazon-corretto-17-aarch64-linux-jdk.deb
wget
https://github.com/renaissance-benchmarks/renaissance/releases/download/v0.15.0/renaissance-gpl-0.15.0.jar
java -jar renaissance-gpl-0.15.0.jar reactors —repetitions 1

>
> > Poll duration is optionally configured at load time via the
> > poll_limit
> > module parameter.
>
> Why should this be a configurable parameter?
>
> (note, at this point you haven't introduced any of the data below, so
> the
> trade-off isn't clear to anyone).

Some applications could improve with higher poll periods. For example,
Tensorflow throughput improves ~3% at a 100us+ poll limit and ~1% at
10us limit. A configurable knob allows users to experiment more easily.

There’s also no performance penatly for having the option. New values
are patched during load and all other code paths remain the same.

>
> > The default 100 us duration was experimentally chosen, by measuring
> > QPS
> > (queries per sec) of the MLPerf bert inference benchmark, which
> > seems
> > particularly susceptible to this change; see procedure below. 100
> > us is
> > the inflection point where QPS stopped growing in a range of tested
> > values. All results are from AWS m7g.16xlarge instances (Graviton3
> > SoC)
> > with dedicated tenancy (dedicated hardware).
> >
> > > before | 10us | 25us | 50us | 100us | 125us | 150us | 200us |
> > > 300us |
> > > 5.87   | 5.91 | 5.96 | 6.01 | 6.06 | 6.07 | 6.06 | 6.06 |
> > > 6.06 |
> >
> > Perf's scheduler benchmarks also improve with a range of poll_limit
> > values >= 10 us. Higher limits produce near identical results
> > within a
> > 3% noise margin. The following tables are `perf bench sched`
> > results,
> > run times in seconds.
> >
> > `perf bench sched messaging -l 80000`
> > > AWS instance | SoC       | Before | After | % Change |
> > > c6g.16xl (VM) | Graviton2 | 18.974 | 18.400 | none     |
> > > c7g.16xl (VM) | Graviton3 | 13.852 | 13.859 | none     |
> > > c6g.metal     | Graviton2 | 17.621 | 16.744 | none     |
> > > c7g.metal     | Graviton3 | 13.430 | 13.404 | none     |
> >
> > `perf bench sched pipe -l 2500000`
> > > AWS instance | SoC       | Before | After | % Change |
> > > c6g.16xl (VM) | Graviton2 | 30.158 | 15.181 | -50%     |
> > > c7g.16xl (VM) | Graviton3 | 18.289 | 12.067 | -34%     |
> > > c6g.metal     | Graviton2 | 17.609 | 15.170 | -14%     |
> > > c7g.metal     | Graviton3 | 14.103 | 12.304 | -13%     |
> >
> > `perf bench sched seccomp-notify -l 2500000`
> > > AWS instance | SoC       | Before | After | % Change |
> > > c6g.16xl (VM) | Graviton2 | 28.784 | 13.754 | -52%     |
> > > c7g.16xl (VM) | Graviton3 | 16.964 | 11.430 | -33%     |
> > > c6g.metal     | Graviton2 | 15.717 | 13.536 | -14%     |
> > > c7g.metal     | Graviton3 | 13.301 | 11.491 | -14%     |
>
> Ok, so perf numbers for a busy workload go up.
>
> What happens for idle state residency on a mostly idle system?

It decreases by about 0.005% on AWS m7g.16xlarge (Graviton3 SoC).

>
> > Steps to run MLPerf bert inference on Ubuntu 22.04:
> > sudo apt install build-essential python3 python3-pip
> > pip install "pybind11[global]" tensorflow transformers
> > export TF_ENABLE_ONEDNN_OPTS=1
> > export DNNL_DEFAULT_FPMATH_MODE=BF16
> > git clone https://github.com/mlcommons/inference.git --recursive
> > cd inference
> > git checkout v2.0
> > cd loadgen
> > CFLAGS="-std=c++14" python3 setup.py bdist_wheel
> > pip install dist/*.whl
> > cd ../language/bert
> > make setup
> > python3 run.py --backend=tf --scenario=SingleStream
> >
> > Suggested-by: Ali Saidi <[email protected]>
> > Reviewed-by: Ali Saidi <[email protected]>
> > Reviewed-by: Geoff Blake <[email protected]>
> > Cc: Brian Silver <[email protected]>
> > Signed-off-by: Haris Okanovic <[email protected]>
> > ---
> > drivers/cpuidle/Kconfig.arm           | 13 ++
> > drivers/cpuidle/Makefile              |   1 +
> > drivers/cpuidle/cpuidle-arm-polling.c | 171
> > ++++++++++++++++++++++++++
> > 3 files changed, 185 insertions(+)
> > create mode 100644 drivers/cpuidle/cpuidle-arm-polling.c
> >
> > diff --git a/drivers/cpuidle/Kconfig.arm
> > b/drivers/cpuidle/Kconfig.arm
> > index a1ee475d180d..484666dda38d 100644
> > --- a/drivers/cpuidle/Kconfig.arm
> > +++ b/drivers/cpuidle/Kconfig.arm
> > @@ -14,6 +14,19 @@ config ARM_CPUIDLE
> >         initialized by calling the CPU operations init idle hook
> >         provided by architecture code.
> >
> > +config ARM_POLL_CPUIDLE
> > +     bool "ARM64 CPU idle Driver with polling"
> > +     depends on ARM64
> > +     depends on ARM_ARCH_TIMER_EVTSTREAM
> > +     select CPU_IDLE_MULTIPLE_DRIVERS
> > +     help
> > +       Select this to enable a polling cpuidle driver for ARM64:
> > +       The first state polls TIF_NEED_RESCHED for best latency on
> > short
> > +       sleep intervals. The second state falls back to
> > arch_cpu_idle() to
> > +       wait for interrupt. This is can be helpful in workloads
> > that
> > +       frequently block/wake at short intervals or VMs where
> > wakeup IPIs
> > +       are more expensive.
>
> Why is this a separate driver rather than an optional feature in the
> existing
> driver?
>
> The fact that this duplicates a bunch of code indicates to me that
> this should
> not be a separate driver.

I'll explore using Ankur Arora’s haltpoll changes on AWS Graviton; see
related thread.

>
> > +
> > config ARM_PSCI_CPUIDLE
> >       bool "PSCI CPU idle Driver"
> >       depends on ARM_PSCI_FW
> > diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
> > index d103342b7cfc..23c21422792d 100644
> > --- a/drivers/cpuidle/Makefile
> > +++ b/drivers/cpuidle/Makefile
> > @@ -22,6 +22,7 @@ obj-$(CONFIG_ARM_U8500_CPUIDLE)         +=
> > cpuidle-ux500.o
> > obj-$(CONFIG_ARM_AT91_CPUIDLE)          += cpuidle-at91.o
> > obj-$(CONFIG_ARM_EXYNOS_CPUIDLE)        += cpuidle-exynos.o
> > obj-$(CONFIG_ARM_CPUIDLE)            += cpuidle-arm.o
> > +obj-$(CONFIG_ARM_POLL_CPUIDLE)               += cpuidle-arm-
> > polling.o
> > obj-$(CONFIG_ARM_PSCI_CPUIDLE)               += cpuidle-psci.o
> > obj-$(CONFIG_ARM_PSCI_CPUIDLE_DOMAIN)        += cpuidle-psci-
> > domain.o
> > obj-$(CONFIG_ARM_TEGRA_CPUIDLE)              += cpuidle-tegra.o
> > diff --git a/drivers/cpuidle/cpuidle-arm-polling.c
> > b/drivers/cpuidle/cpuidle-arm-polling.c
> > new file mode 100644
> > index 000000000000..bca128568114
> > --- /dev/null
> > +++ b/drivers/cpuidle/cpuidle-arm-polling.c
> > @@ -0,0 +1,171 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * ARM64 CPU idle driver using wfe polling
> > + *
> > + * Copyright 2024 Amazon.com, Inc. or its affiliates. All rights
> > reserved.
> > + *
> > + * Authors:
> > + *   Haris Okanovic <[email protected]>
> > + *   Brian Silver <[email protected]>
> > + *
> > + * Based on cpuidle-arm.c
> > + * Copyright (C) 2014 ARM Ltd.
> > + * Author: Lorenzo Pieralisi <[email protected]>
> > + */
> > +
> > +#include <linux/cpu.h>
> > +#include <linux/cpu_cooling.h>
> > +#include <linux/cpuidle.h>
> > +#include <linux/sched/clock.h>
> > +
> > +#include <asm/cpuidle.h>
> > +#include <asm/readex.h>
> > +
> > +#include "dt_idle_states.h"
> > +
> > +/* Max duration of the wfe() poll loop in us, before transitioning
> > to
> > + * arch_cpu_idle()/wfi() sleep.
> > + */
>
> /*
> * Comments should have the leading '/*' on a separate line.
> * See
> https://www.kernel.org/doc/html/v6.8/process/coding-style.html#commenting
> */
>
> > +#define DEFAULT_POLL_LIMIT_US 100
> > +static unsigned int poll_limit __read_mostly =
> > DEFAULT_POLL_LIMIT_US;
> > +
> > +/*
> > + * arm_idle_wfe_poll - Polls state in wfe loop until reschedule is
> > + * needed or timeout
> > + */
> > +static int __cpuidle arm_idle_wfe_poll(struct cpuidle_device *dev,
> > +                             struct cpuidle_driver *drv, int idx)
> > +{
> > +     u64 time_start, time_limit;
> > +
> > +     time_start = local_clock();
> > +     dev->poll_time_limit = false;
> > +
> > +     local_irq_enable();
>
> Why enable IRQs here? We don't do that in the regular cpuidle-arm
> driver, nor
> the cpuidle-psci driver, and there's no explanation here or in the
> commit message.
>
> How does this interact with RCU? Is that still watching or are we in
> an
> extended quiescent state? For PSCI idle states we enter an EQS, and
> it seems
> like we probably should here...

I'll circle back to your code comments (here and elsewhere) if we
proceede with this patch instead of Ankur's.

>
> > +
> > +     if (current_set_polling_and_test())
> > +             goto end;
> > +
> > +     time_limit = cpuidle_poll_time(drv, dev);
> > +
> > +     do {
> > +             // exclusive read arms the monitor for wfe
> > +             if (__READ_ONCE_EX(current_thread_info()->flags) &
> > _TIF_NEED_RESCHED)
> > +                     goto end;
> > +
> > +             // may exit prematurely, see ARM_ARCH_TIMER_EVTSTREAM
> > +             wfe();
> > +     } while (local_clock() - time_start < time_limit);
>
> ... and if the EVTSTREAM is disabled, we'll sit in WFE forever rather
> than
> entering a deeper idle state, which doesn't seem desirable.

I took a depdnency on ARM_ARCH_TIMER_EVTSTREAM in Kconfig.arm to ensure
event stream is enabled.

>
> It's worth noting that now that we have WFET, we'll probably want to
> disable
> the EVTSTREAM by default at some point, at least in some
> configurations, since
> that'll be able to sit in a WFE state for longer while also reliably
> waking up
> when required.
>
> I suspect we want something like an smp_load_acquire_timeout() here
> to do the
> wait in arch code (allowing us to use WFET), and enabling this state
> will
> depend on either having WFET or EVTSTREAM.
>
> > +
> > +     dev->poll_time_limit = true;
> > +
> > +end:
> > +     current_clr_polling();
> > +     return idx;
> > +}
> > +
> > +/*
> > + * arm_idle_wfi - Places cpu in lower power state until interrupt,
> > + * a fallback to polling
> > + */
> > +static int __cpuidle arm_idle_wfi(struct cpuidle_device *dev,
> > +                             struct cpuidle_driver *drv, int idx)
> > +{
> > +     if (current_clr_polling_and_test()) {
> > +             local_irq_enable();
> > +             return idx;
> > +     }
>
> Same as above, why enable IRQs here?
>
> > +     arch_cpu_idle();
> > +     return idx;
>
> ... and if we need to enable IRQs in the other cases above, why do we
> *not*
> need to enable them here?
>
> > +}
> > +
> > +static struct cpuidle_driver arm_poll_idle_driver __initdata = {
> > +     .name = "arm_poll_idle",
> > +     .owner = THIS_MODULE,
> > +     .states = {
> > +             {
> > +                     .enter                  = arm_idle_wfe_poll,
> > +                     .exit_latency           = 0,
> > +                     .target_residency       = 0,
> > +                     .exit_latency_ns        = 0,
> > +                     .power_usage            = UINT_MAX,
> > +                     .flags                  =
> > CPUIDLE_FLAG_POLLING,
> > +                     .name                   = "WFE",
> > +                     .desc                   = "ARM WFE",
> > +             },
> > +             {
> > +                     .enter                  = arm_idle_wfi,
> > +                     .exit_latency           =
> > DEFAULT_POLL_LIMIT_US,
> > +                     .target_residency       =
> > DEFAULT_POLL_LIMIT_US,
> > +                     .power_usage            = UINT_MAX,
> > +                     .name                   = "WFI",
> > +                     .desc                   = "ARM WFI",
> > +             },
> > +     },
> > +     .state_count = 2,
> > +};
>
> How does this interact with the existing driver?
>
> How does DEFAULT_POLL_LIMIT_US compare with PSCI idle states?
>
> > +
> > +/*
> > + * arm_poll_init_cpu - Initializes arm cpuidle polling driver for
> > one cpu
> > + */
> > +static int __init arm_poll_init_cpu(int cpu)
> > +{
> > +     int ret;
> > +     struct cpuidle_driver *drv;
> > +
> > +     drv = kmemdup(&arm_poll_idle_driver, sizeof(*drv),
> > GFP_KERNEL);
> > +     if (!drv)
> > +             return -ENOMEM;
> > +
> > +     drv->cpumask = (struct cpumask *)cpumask_of(cpu);
> > +     drv->states[1].exit_latency = poll_limit;
> > +     drv->states[1].target_residency = poll_limit;
> > +
> > +     ret = cpuidle_register(drv, NULL);
> > +     if (ret) {
> > +             pr_err("failed to register driver: %d, cpu %d\n",
> > ret, cpu);
> > +             goto out_kfree_drv;
> > +     }
> > +
> > +     pr_info("registered driver cpu %d\n", cpu);
>
> This does not need to be printed for each CPU.
>
> Mark.
>
> > +
> > +     cpuidle_cooling_register(drv);
> > +
> > +     return 0;
> > +
> > +out_kfree_drv:
> > +     kfree(drv);
> > +     return ret;
> > +}
> > +
> > +/*
> > + * arm_poll_init - Initializes arm cpuidle polling driver
> > + */
> > +static int __init arm_poll_init(void)
> > +{
> > +     int cpu, ret;
> > +     struct cpuidle_driver *drv;
> > +     struct cpuidle_device *dev;
> > +
> > +     for_each_possible_cpu(cpu) {
> > +             ret = arm_poll_init_cpu(cpu);
> > +             if (ret)
> > +                     goto out_fail;
> > +     }
> > +
> > +     return 0;
> > +
> > +out_fail:
> > +     pr_info("de-register all");
> > +     while (--cpu >= 0) {
> > +             dev = per_cpu(cpuidle_devices, cpu);
> > +             drv = cpuidle_get_cpu_driver(dev);
> > +             cpuidle_unregister(drv);
> > +             kfree(drv);
> > +     }
> > +
> > +     return ret;
> > +}
> > +
> > +module_param(poll_limit, uint, 0444);
> > +device_initcall(arm_poll_init);
> > --
> > 2.34.1
> >
> >

Thanks,
Haris Okanovic

2024-04-05 20:22:05

[permalink] [raw]

Subject: Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle

Okanovic, Haris <[email protected]> writes:

> On Tue, 2024-04-02 at 16:17 -0700, Ankur Arora wrote:
>> CAUTION: This email originated from outside of the organization. Do
>> not click links or open attachments unless you can confirm the sender
>> and know the content is safe.
>>
>>
>>
>> Mark Rutland <[email protected]> writes:
>>
>> > On Mon, Apr 01, 2024 at 08:47:06PM -0500, Haris Okanovic wrote:
>> > > An arm64 cpuidle driver with two states: (1) First polls for new
>> > > runable
>> > > tasks up to 100 us (by default) before (2) a wfi idle and awoken
>> > > by
>> > > interrupt (the current arm64 behavior). It allows CPUs to return
>> > > from
>> > > idle more quickly by avoiding the longer interrupt wakeup path,
>> > > which
>> > > may require EL1/EL2 transition in certain VM scenarios.
>> >
>> > Please start off with an explanation of the problem you're trying
>> > to solve
>> > (which IIUC is to wake up more quickly in certain cases), before
>> > describing the
>> > solution. That makes it *significantly* easier for people to review
>> > this, since
>> > once you have the problem statement in mind it's much easier to
>> > understand how
>> > the solution space follows from that.
>> >
>> > > Poll duration is optionally configured at load time via the
>> > > poll_limit
>> > > module parameter.
>> >
>> > Why should this be a configurable parameter?
>> >
>> > (note, at this point you haven't introduced any of the data below,
>> > so the
>> > trade-off isn't clear to anyone).
>> >
>> > > The default 100 us duration was experimentally chosen, by
>> > > measuring QPS
>> > > (queries per sec) of the MLPerf bert inference benchmark, which
>> > > seems
>> > > particularly susceptible to this change; see procedure below. 100
>> > > us is
>> > > the inflection point where QPS stopped growing in a range of
>> > > tested
>> > > values. All results are from AWS m7g.16xlarge instances
>> > > (Graviton3 SoC)
>> > > with dedicated tenancy (dedicated hardware).
>> > >
>> > > > before | 10us | 25us | 50us | 100us | 125us | 150us | 200us |
>> > > > 300us |
>> > > > 5.87 | 5.91 | 5.96 | 6.01 | 6.06 | 6.07 | 6.06 | 6.06 |
>> > > > 6.06 |
>> > >
>> > > Perf's scheduler benchmarks also improve with a range of
>> > > poll_limit
>> > > values >= 10 us. Higher limits produce near identical results
>> > > within a
>> > > 3% noise margin. The following tables are `perf bench sched`
>> > > results,
>> > > run times in seconds.
>> > >
>> > > `perf bench sched messaging -l 80000`
>> > > > AWS instance | SoC | Before | After | % Change |
>> > > > c6g.16xl (VM) | Graviton2 | 18.974 | 18.400 | none |
>> > > > c7g.16xl (VM) | Graviton3 | 13.852 | 13.859 | none |
>> > > > c6g.metal | Graviton2 | 17.621 | 16.744 | none |
>> > > > c7g.metal | Graviton3 | 13.430 | 13.404 | none |
>> > >
>> > > `perf bench sched pipe -l 2500000`
>> > > > AWS instance | SoC | Before | After | % Change |
>> > > > c6g.16xl (VM) | Graviton2 | 30.158 | 15.181 | -50% |
>> > > > c7g.16xl (VM) | Graviton3 | 18.289 | 12.067 | -34% |
>> > > > c6g.metal | Graviton2 | 17.609 | 15.170 | -14% |
>> > > > c7g.metal | Graviton3 | 14.103 | 12.304 | -13% |
>> > >
>> > > `perf bench sched seccomp-notify -l 2500000`
>> > > > AWS instance | SoC | Before | After | % Change |
>> > > > c6g.16xl (VM) | Graviton2 | 28.784 | 13.754 | -52% |
>> > > > c7g.16xl (VM) | Graviton3 | 16.964 | 11.430 | -33% |
>> > > > c6g.metal | Graviton2 | 15.717 | 13.536 | -14% |
>> > > > c7g.metal | Graviton3 | 13.301 | 11.491 | -14% |
>> >
>> > Ok, so perf numbers for a busy workload go up.
>> >
>> > What happens for idle state residency on a mostly idle system?
>> >
>> > > Steps to run MLPerf bert inference on Ubuntu 22.04:
>> > > sudo apt install build-essential python3 python3-pip
>> > > pip install "pybind11[global]" tensorflow transformers
>> > > export TF_ENABLE_ONEDNN_OPTS=1
>> > > export DNNL_DEFAULT_FPMATH_MODE=BF16
>> > > git clone https://github.com/mlcommons/inference.git --recursive
>> > > cd inference
>> > > git checkout v2.0
>> > > cd loadgen
>> > > CFLAGS="-std=c++14" python3 setup.py bdist_wheel
>> > > pip install dist/*.whl
>> > > cd ../language/bert
>> > > make setup
>> > > python3 run.py --backend=tf --scenario=SingleStream
>> > >
>> > > Suggested-by: Ali Saidi <[email protected]>
>> > > Reviewed-by: Ali Saidi <[email protected]>
>> > > Reviewed-by: Geoff Blake <[email protected]>
>> > > Cc: Brian Silver <[email protected]>
>> > > Signed-off-by: Haris Okanovic <[email protected]>
>> > > ---
>> > > drivers/cpuidle/Kconfig.arm | 13 ++
>> > > drivers/cpuidle/Makefile | 1 +
>> > > drivers/cpuidle/cpuidle-arm-polling.c | 171
>> > > ++++++++++++++++++++++++++
>> > > 3 files changed, 185 insertions(+)
>> > > create mode 100644 drivers/cpuidle/cpuidle-arm-polling.c
>> > >
>> > > diff --git a/drivers/cpuidle/Kconfig.arm
>> > > b/drivers/cpuidle/Kconfig.arm
>> > > index a1ee475d180d..484666dda38d 100644
>> > > --- a/drivers/cpuidle/Kconfig.arm
>> > > +++ b/drivers/cpuidle/Kconfig.arm
>> > > @@ -14,6 +14,19 @@ config ARM_CPUIDLE
>> > > initialized by calling the CPU operations init idle hook
>> > > provided by architecture code.
>> > >
>> > > +config ARM_POLL_CPUIDLE
>> > > + bool "ARM64 CPU idle Driver with polling"
>> > > + depends on ARM64
>> > > + depends on ARM_ARCH_TIMER_EVTSTREAM
>> > > + select CPU_IDLE_MULTIPLE_DRIVERS
>> > > + help
>> > > + Select this to enable a polling cpuidle driver for ARM64:
>> > > + The first state polls TIF_NEED_RESCHED for best latency on
>> > > short
>> > > + sleep intervals. The second state falls back to
>> > > arch_cpu_idle() to
>> > > + wait for interrupt. This is can be helpful in workloads
>> > > that
>> > > + frequently block/wake at short intervals or VMs where
>> > > wakeup IPIs
>> > > + are more expensive.
>> >
>> > Why is this a separate driver rather than an optional feature in
>> > the existing
>> > driver?
>> >
>> > The fact that this duplicates a bunch of code indicates to me that
>> > this should
>> > not be a separate driver.
>>
>> Also, the cpuidle-haltpoll driver is meant to do something quite
>> similar.
>> That driver polls adaptively based on the haltpoll governor's tuning
>> of
>> the polling period.
>>
>> However, cpuidle-haltpoll is currently x86 only. Mihai (also from
>> Oracle)
>> posted patches [1] adding support for ARM64.
>>
>> Haris, could you take a look at it and see if it does what you are
>> looking for? The polling path in the linked version also uses
>> smp_cond_load_relaxed() so even the mechanisms for both of these
>> are fairly similar.
>
> Hi Ankur,
>
> I agree, except for that small bug in exit condition, your haltpoll
> changes fundamentally do the same thing:

Yup. Will address that bug and a few other things in the next version.

>> @ int __cpuidle poll_idle(...
>> - if (!(ret & _TIF_NEED_RESCHED))
>> + if (ret & _TIF_NEED_RESCHE
>
> I'll follow up with another patch for AWS Graviton when your team is
> finished.
>
> Do you have a rough ETA of when your changes will land in master?

That I guess would be determined by the maintainers, but I should be
able to send it out the coming week.

Thanks
Ankur

>>
>> (I'll be sending out the next version shortly. Happy to Cc you if you
>> would like to try that out.)
>
> Yes, please do!
>
> Thanks,
> Haris Okanovic
>
>>
>> Thanks
>> Ankur
>>
>> [1]
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> >
>> > > +
>> > > config ARM_PSCI_CPUIDLE
>> > > bool "PSCI CPU idle Driver"
>> > > depends on ARM_PSCI_FW
>> > > diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
>> > > index d103342b7cfc..23c21422792d 100644
>> > > --- a/drivers/cpuidle/Makefile
>> > > +++ b/drivers/cpuidle/Makefile
>> > > @@ -22,6 +22,7 @@ obj-$(CONFIG_ARM_U8500_CPUIDLE) +=
>> > > cpuidle-ux500.o
>> > > obj-$(CONFIG_ARM_AT91_CPUIDLE) += cpuidle-at91.o
>> > > obj-$(CONFIG_ARM_EXYNOS_CPUIDLE) += cpuidle-exynos.o
>> > > obj-$(CONFIG_ARM_CPUIDLE) += cpuidle-arm.o
>> > > +obj-$(CONFIG_ARM_POLL_CPUIDLE) += cpuidle-arm-
>> > > polling.o
>> > > obj-$(CONFIG_ARM_PSCI_CPUIDLE) += cpuidle-psci.o
>> > > obj-$(CONFIG_ARM_PSCI_CPUIDLE_DOMAIN) += cpuidle-psci-
>> > > domain.o
>> > > obj-$(CONFIG_ARM_TEGRA_CPUIDLE) += cpuidle-tegra.o
>> > > diff --git a/drivers/cpuidle/cpuidle-arm-polling.c
>> > > b/drivers/cpuidle/cpuidle-arm-polling.c
>> > > new file mode 100644
>> > > index 000000000000..bca128568114
>> > > --- /dev/null
>> > > +++ b/drivers/cpuidle/cpuidle-arm-polling.c
>> > > @@ -0,0 +1,171 @@
>> > > +// SPDX-License-Identifier: GPL-2.0
>> > > +/*
>> > > + * ARM64 CPU idle driver using wfe polling
>> > > + *
>> > > + * Copyright 2024 Amazon.com, Inc. or its affiliates. All rights
>> > > reserved.
>> > > + *
>> > > + * Authors:
>> > > + * Haris Okanovic <[email protected]>
>> > > + * Brian Silver <[email protected]>
>> > > + *
>> > > + * Based on cpuidle-arm.c
>> > > + * Copyright (C) 2014 ARM Ltd.
>> > > + * Author: Lorenzo Pieralisi <[email protected]>
>> > > + */
>> > > +
>> > > +#include <linux/cpu.h>
>> > > +#include <linux/cpu_cooling.h>
>> > > +#include <linux/cpuidle.h>
>> > > +#include <linux/sched/clock.h>
>> > > +
>> > > +#include <asm/cpuidle.h>
>> > > +#include <asm/readex.h>
>> > > +
>> > > +#include "dt_idle_states.h"
>> > > +
>> > > +/* Max duration of the wfe() poll loop in us, before
>> > > transitioning to
>> > > + * arch_cpu_idle()/wfi() sleep.
>> > > + */
>> >
>> > /*
>> > * Comments should have the leading '/*' on a separate line.
>> > * See
>> > https://www.kernel.org/doc/html/v6.8/process/coding-style.html#commenting
>> > */
>> >
>> > > +#define DEFAULT_POLL_LIMIT_US 100
>> > > +static unsigned int poll_limit __read_mostly =
>> > > DEFAULT_POLL_LIMIT_US;
>> > > +
>> > > +/*
>> > > + * arm_idle_wfe_poll - Polls state in wfe loop until reschedule
>> > > is
>> > > + * needed or timeout
>> > > + */
>> > > +static int __cpuidle arm_idle_wfe_poll(struct cpuidle_device
>> > > *dev,
>> > > + struct cpuidle_driver *drv, int idx)
>> > > +{
>> > > + u64 time_start, time_limit;
>> > > +
>> > > + time_start = local_clock();
>> > > + dev->poll_time_limit = false;
>> > > +
>> > > + local_irq_enable();
>> >
>> > Why enable IRQs here? We don't do that in the regular cpuidle-arm
>> > driver, nor
>> > the cpuidle-psci driver, and there's no explanation here or in the
>> > commit message.
>> >
>> > How does this interact with RCU? Is that still watching or are we
>> > in an
>> > extended quiescent state? For PSCI idle states we enter an EQS, and
>> > it seems
>> > like we probably should here...
>> >
>> > > +
>> > > + if (current_set_polling_and_test())
>> > > + goto end;
>> > > +
>> > > + time_limit = cpuidle_poll_time(drv, dev);
>> > > +
>> > > + do {
>> > > + // exclusive read arms the monitor for wfe
>> > > + if (__READ_ONCE_EX(current_thread_info()->flags) &
>> > > _TIF_NEED_RESCHED)
>> > > + goto end;
>> > > +
>> > > + // may exit prematurely, see
>> > > ARM_ARCH_TIMER_EVTSTREAM
>> > > + wfe();
>> > > + } while (local_clock() - time_start < time_limit);
>> >
>> > .. and if the EVTSTREAM is disabled, we'll sit in WFE forever
>> > rather than
>> > entering a deeper idle state, which doesn't seem desirable.
>> >
>> > It's worth noting that now that we have WFET, we'll probably want
>> > to disable
>> > the EVTSTREAM by default at some point, at least in some
>> > configurations, since
>> > that'll be able to sit in a WFE state for longer while also
>> > reliably waking up
>> > when required.
>> >
>> > I suspect we want something like an smp_load_acquire_timeout() here
>> > to do the
>> > wait in arch code (allowing us to use WFET), and enabling this
>> > state will
>> > depend on either having WFET or EVTSTREAM.
>> >
>> > > +
>> > > + dev->poll_time_limit = true;
>> > > +
>> > > +end:
>> > > + current_clr_polling();
>> > > + return idx;
>> > > +}
>> > > +
>> > > +/*
>> > > + * arm_idle_wfi - Places cpu in lower power state until
>> > > interrupt,
>> > > + * a fallback to polling
>> > > + */
>> > > +static int __cpuidle arm_idle_wfi(struct cpuidle_device *dev,
>> > > + struct cpuidle_driver *drv, int idx)
>> > > +{
>> > > + if (current_clr_polling_and_test()) {
>> > > + local_irq_enable();
>> > > + return idx;
>> > > + }
>> >
>> > Same as above, why enable IRQs here?
>> >
>> > > + arch_cpu_idle();
>> > > + return idx;
>> >
>> > .. and if we need to enable IRQs in the other cases above, why do
>> > we *not*
>> > need to enable them here?
>> >
>> > > +}
>> > > +
>> > > +static struct cpuidle_driver arm_poll_idle_driver __initdata = {
>> > > + .name = "arm_poll_idle",
>> > > + .owner = THIS_MODULE,
>> > > + .states = {
>> > > + {
>> > > + .enter = arm_idle_wfe_poll,
>> > > + .exit_latency = 0,
>> > > + .target_residency = 0,
>> > > + .exit_latency_ns = 0,
>> > > + .power_usage = UINT_MAX,
>> > > + .flags =
>> > > CPUIDLE_FLAG_POLLING,
>> > > + .name = "WFE",
>> > > + .desc = "ARM WFE",
>> > > + },
>> > > + {
>> > > + .enter = arm_idle_wfi,
>> > > + .exit_latency =
>> > > DEFAULT_POLL_LIMIT_US,
>> > > + .target_residency =
>> > > DEFAULT_POLL_LIMIT_US,
>> > > + .power_usage = UINT_MAX,
>> > > + .name = "WFI",
>> > > + .desc = "ARM WFI",
>> > > + },
>> > > + },
>> > > + .state_count = 2,
>> > > +};
>> >
>> > How does this interact with the existing driver?
>> >
>> > How does DEFAULT_POLL_LIMIT_US compare with PSCI idle states?
>> >
>> > > +
>> > > +/*
>> > > + * arm_poll_init_cpu - Initializes arm cpuidle polling driver
>> > > for one cpu
>> > > + */
>> > > +static int __init arm_poll_init_cpu(int cpu)
>> > > +{
>> > > + int ret;
>> > > + struct cpuidle_driver *drv;
>> > > +
>> > > + drv = kmemdup(&arm_poll_idle_driver, sizeof(*drv),
>> > > GFP_KERNEL);
>> > > + if (!drv)
>> > > + return -ENOMEM;
>> > > +
>> > > + drv->cpumask = (struct cpumask *)cpumask_of(cpu);
>> > > + drv->states[1].exit_latency = poll_limit;
>> > > + drv->states[1].target_residency = poll_limit;
>> > > +
>> > > + ret = cpuidle_register(drv, NULL);
>> > > + if (ret) {
>> > > + pr_err("failed to register driver: %d, cpu %d\n",
>> > > ret, cpu);
>> > > + goto out_kfree_drv;
>> > > + }
>> > > +
>> > > + pr_info("registered driver cpu %d\n", cpu);
>> >
>> > This does not need to be printed for each CPU.
>> >
>> > Mark.
>> >
>> > > +
>> > > + cpuidle_cooling_register(drv);
>> > > +
>> > > + return 0;
>> > > +
>> > > +out_kfree_drv:
>> > > + kfree(drv);
>> > > + return ret;
>> > > +}
>> > > +
>> > > +/*
>> > > + * arm_poll_init - Initializes arm cpuidle polling driver
>> > > + */
>> > > +static int __init arm_poll_init(void)
>> > > +{
>> > > + int cpu, ret;
>> > > + struct cpuidle_driver *drv;
>> > > + struct cpuidle_device *dev;
>> > > +
>> > > + for_each_possible_cpu(cpu) {
>> > > + ret = arm_poll_init_cpu(cpu);
>> > > + if (ret)
>> > > + goto out_fail;
>> > > + }
>> > > +
>> > > + return 0;
>> > > +
>> > > +out_fail:
>> > > + pr_info("de-register all");
>> > > + while (--cpu >= 0) {
>> > > + dev = per_cpu(cpuidle_devices, cpu);
>> > > + drv = cpuidle_get_cpu_driver(dev);
>> > > + cpuidle_unregister(drv);
>> > > + kfree(drv);
>> > > + }
>> > > +
>> > > + return ret;
>> > > +}
>> > > +
>> > > +module_param(poll_limit, uint, 0444);
>> > > +device_initcall(arm_poll_init);
>> > > --
>> > > 2.34.1
>> > >
>> > >
>>
>>
>> --
>> ankur

--
ankur

2024-04-08 14:52:11

by David Laight

[permalink] [raw]

Subject: RE: [PATCH 2/3] arm64: add __READ_ONCE_EX()

From: Haris Okanovic
> Sent: 02 April 2024 02:47
>
> Perform an exclusive load, which atomically loads a word and arms the
> execusive monitor to enable wfe() polling of an address.
>
> Adding this macro in preparation for an arm64 cpuidle driver which
> supports a wfe() based polling state.
>
> https://developer.arm.com/documentation/dht0008/a/arm-synchronization-primitives/exclusive-
> accesses/exclusive-monitors
>
> Signed-off-by: Haris Okanovic <[email protected]>
> ---
> arch/arm64/include/asm/readex.h | 46 +++++++++++++++++++++++++++++++++
> 1 file changed, 46 insertions(+)
> create mode 100644 arch/arm64/include/asm/readex.h
>
> diff --git a/arch/arm64/include/asm/readex.h b/arch/arm64/include/asm/readex.h
> new file mode 100644
> index 000000000000..51963c3107e1
> --- /dev/null
> +++ b/arch/arm64/include/asm/readex.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Based on arch/arm64/include/asm/rwonce.h
> + *
> + * Copyright (C) 2020 Google LLC.
> + * Copyright (C) 2024 Amazon.com, Inc. or its affiliates.
> + */
> +
> +#ifndef __ASM_READEX_H
> +#define __ASM_READEX_H
> +
> +#define __LOAD_EX(sfx, regs...) "ldaxr" #sfx "\t" #regs
> +
> +#define __READ_ONCE_EX(x) \
> +({ \
> + typeof(&(x)) __x = &(x); \
> + int atomic = 1; \
> + union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u; \
> + switch (sizeof(x)) { \
> + case 1: \
> + asm volatile(__LOAD_EX(b, %w0, %1) \
> + : "=r" (*(__u8 *)__u.__c) \
> + : "Q" (*__x) : "memory"); \
> + break; \
> + case 2: \
> + asm volatile(__LOAD_EX(h, %w0, %1) \
> + : "=r" (*(__u16 *)__u.__c) \
> + : "Q" (*__x) : "memory"); \
> + break; \
> + case 4: \
> + asm volatile(__LOAD_EX(, %w0, %1) \
> + : "=r" (*(__u32 *)__u.__c) \
> + : "Q" (*__x) : "memory"); \
> + break; \
> + case 8: \
> + asm volatile(__LOAD_EX(, %0, %1) \
> + : "=r" (*(__u64 *)__u.__c) \
> + : "Q" (*__x) : "memory"); \
> + break; \
> + default: \
> + atomic = 0; \
> + } \
> + atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(__x))__x);\

I'm pretty sure that doesn't work the way you expect.
The ?: operator promotes 'unsigned char' to 'int'.
So you can fall foul of signedness tests (eg in min()).
It also isn't going to work for non-scalers.

Replacing the ?: with __builtin_choose_expr() may help.

(This is probably a bug in the code you copied.)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)