The housekeeping CPU masks, set up by the "isolcpus" and "nohz_full"
boot command line options, are used at boot time to exclude selected
CPUs from running some kernel housekeeping facilities to minimize
disturbance to latency-sensitive userspace applications such as DPDK.
However, these options can have negative consequences for "normal"
workloads. Both nohz_full and rcu_nocbs can be applied to a subset of
the CPUs on a server (so as to not impact the "normal" workloads), but
they can only be changed with a reboot. This is a problem for
containerized workloads running on OpenShift (i.e., Kubernetes), where a
mix of low latency and "normal" workloads can be created/destroyed
dynamically and the number of CPUs allocated to each workload is often
not known at boot time.
This series of patches is based on the series
"isolation: Exclude dynamically isolated CPUs from housekeeping masks"
https://lore.kernel.org/lkml/[email protected]/
Its purpose is to exclude dynamically isolated CPUs from some
housekeeping masks so that subsystems that check the housekeeping masks
at run time will not use those isolated CPUs.
However, some subsystems may keep using stale housekeeping CPU masks.
Therefore, to prevent the use of the isolated CPUs, changes to the
housekeeping masks must be explicitly propagated to every subsystem
that depends on them.
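For illustration, once a CPU has been moved into an isolated cpuset
partition at run time (see the cgroup2 test example in the first patch),
the effect of the propagation can be observed from user space. A minimal
sketch, assuming the isolated CPU number is in $isolate:

# list the timers still queued on the isolated CPU; with this series
# only the tick (tick_nohz_handler) is expected to remain there
awk "/cpu:/{y=0};/cpu: $isolate\$/{y=1};/ #[0-9]/ && y" /proc/timer_list

# managed IRQ affinities can be inspected per interrupt, e.g.:
grep . /proc/irq/*/smp_affinity_list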
Costa Shulyupin (7):
sched/isolation: Add infrastructure to adjust affinity for dynamic CPU
isolation
sched/isolation: Adjust affinity of timers according to change of
housekeeping cpumask
sched/isolation: Adjust affinity of hrtimers according to change of
housekeeping cpumask
sched/isolation: Adjust affinity of managed irqs according to change
of housekeeping cpumask
[NOT-FOR-MERGE] test timers affinity adjustment
[NOT-FOR-MERGE] test timers and hrtimers affinity adjustment
[NOT-FOR-MERGE] test managed irqs affinity adjustment
include/linux/hrtimer.h | 2 +
include/linux/timer.h | 2 +
init/Kconfig | 1 +
kernel/cgroup/cpuset.c | 3 +-
kernel/sched/isolation.c | 119 +++++++++++++++++++++++++++++++++++++--
kernel/time/hrtimer.c | 81 ++++++++++++++++++++++++++
kernel/time/timer.c | 64 +++++++++++++++++++++
tests/managed_irq.c | 71 +++++++++++++++++++++++
tests/timers.c | 58 +++++++++++++++++++
9 files changed, 395 insertions(+), 6 deletions(-)
create mode 100644 tests/managed_irq.c
create mode 100644 tests/timers.c
--
2.45.0
Adjust affinity of watchdog_cpumask, hrtimers according to
change of housekeeping.cpumasks[HK_TYPE_TIMER].
Function migrate_hrtimer_list_except() is prototyped from
migrate_hrtimer_list() and is more generic.
Potentially it can be used instead of migrate_hrtimer_list.
Function hrtimers_resettle_from_cpu() is blindly prototyped
from hrtimers_cpu_dying(). local_irq_disable() is used because
cpuhp_thread_fun() uses it before cpuhp_invoke_callback().
Core test snippets without infrastructure:
1. Create hrtimer on specific cpu with:
set_cpus_allowed_ptr(current, cpumask_of(test_cpu));
hrtimer_init(&test_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
test_hrtimer.function = test_hrtimer_cb;
hrtimer_start(&test_hrtimer, -1, HRTIMER_MODE_REL);
2. Call housekeeping_update()
3. Ensure that only tick_nohz_handler remains on the specified CPU
in /proc/timer_list, either manually or with the script:
grep -E 'cpu| #[0-9]' /proc/timer_list | \
awk "/cpu:/{y=0};/cpu: $test_cpu\$/{y=1};y"
An alternative solution to migrate hrtimers:
1. Use cpuhp to set sched_timer offline
2. Resettle all hrtimers likewise migrate_hrtimer_list
3. Use cpuhp to set sched_timer online
Signed-off-by: Costa Shulyupin <[email protected]>
---
include/linux/hrtimer.h | 2 +
kernel/sched/isolation.c | 2 +
kernel/time/hrtimer.c | 81 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 85 insertions(+)
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index aa1e65ccb6158..004632fc7d643 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -381,8 +381,10 @@ extern void sysrq_timer_list_show(void);
int hrtimers_prepare_cpu(unsigned int cpu);
#ifdef CONFIG_HOTPLUG_CPU
int hrtimers_cpu_dying(unsigned int cpu);
+void hrtimers_resettle_from_cpu(unsigned int cpu);
#else
#define hrtimers_cpu_dying NULL
+static inline void hrtimers_resettle_from_cpu(unsigned int cpu) { }
#endif
#endif
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 3b63f0212887e..85a17d39d8bb0 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -126,10 +126,12 @@ static void resettle_all_timers(cpumask_var_t enable_mask, cpumask_var_t disable
for_each_cpu(cpu, enable_mask) {
timers_prepare_cpu(cpu);
+ hrtimers_prepare_cpu(cpu);
}
for_each_cpu(cpu, disable_mask) {
timers_resettle_from_cpu(cpu);
+ hrtimers_resettle_from_cpu(cpu);
}
}
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 492c14aac642b..7e71ebbb72348 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -2201,6 +2201,87 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
}
}
+/*
+ * migrate_hrtimer_list_except - migrates hrtimers from one base to another,
+ * except specified one.
+ */
+static void migrate_hrtimer_list_except(struct hrtimer_clock_base *old_base,
+ struct hrtimer_clock_base *new_base, struct hrtimer *except)
+{
+ struct hrtimer *timer;
+ struct timerqueue_node *node;
+
+ node = timerqueue_getnext(&old_base->active);
+ while (node) {
+ timer = container_of(node, struct hrtimer, node);
+ node = timerqueue_iterate_next(node);
+ if (timer == except)
+ continue;
+
+ BUG_ON(hrtimer_callback_running(timer));
+ debug_deactivate(timer);
+
+ /*
+ * Mark it as ENQUEUED not INACTIVE otherwise the
+ * timer could be seen as !active and just vanish away
+ * under us on another CPU
+ */
+ __remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, 0);
+ timer->base = new_base;
+ /*
+ * Enqueue the timers on the new cpu. This does not
+ * reprogram the event device in case the timer
+ * expires before the earliest on this CPU, but we run
+ * hrtimer_interrupt after we migrated everything to
+ * sort out already expired timers and reprogram the
+ * event device.
+ */
+ enqueue_hrtimer(timer, new_base, HRTIMER_MODE_ABS);
+ }
+}
+
+/**
+ * hrtimers_resettle_from_cpu - resettles hrtimers from
+ * specified cpu to housekeeping cpus.
+ */
+void hrtimers_resettle_from_cpu(unsigned int isol_cpu)
+{
+ int ncpu, i;
+ struct tick_sched *ts = tick_get_tick_sched(isol_cpu);
+ struct hrtimer_cpu_base *old_base, *new_base;
+
+ local_irq_disable();
+ ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
+
+ old_base = &per_cpu(hrtimer_bases, isol_cpu);
+ new_base = &per_cpu(hrtimer_bases, ncpu);
+
+ /*
+ * The caller is globally serialized and nobody else
+ * takes two locks at once, deadlock is not possible.
+ */
+ raw_spin_lock(&old_base->lock);
+ raw_spin_lock_nested(&new_base->lock, SINGLE_DEPTH_NESTING);
+ for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
+ migrate_hrtimer_list_except(&old_base->clock_base[i],
+ &new_base->clock_base[i],
+ &ts->sched_timer);
+ }
+
+ /*
+ * The migration might have changed the first expiring softirq
+ * timer on this CPU. Update it.
+ */
+ __hrtimer_get_next_event(new_base, HRTIMER_ACTIVE_SOFT);
+
+ raw_spin_unlock(&new_base->lock);
+ raw_spin_unlock(&old_base->lock);
+ local_irq_enable();
+
+ /* Tell the other CPU to retrigger the next event */
+ smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);
+}
+
int hrtimers_cpu_dying(unsigned int dying_cpu)
{
int i, ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
--
2.45.0
Kernel module that creates timers and hrtimers for the test, together
with the shell commands to run it.
Signed-off-by: Costa Shulyupin <[email protected]>
---
tests/timers.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)
create mode 100644 tests/timers.c
diff --git a/tests/timers.c b/tests/timers.c
new file mode 100644
index 0000000000000..bf6ef3244bc05
--- /dev/null
+++ b/tests/timers.c
@@ -0,0 +1,58 @@
+#include <linux/timer.h>
+#include <linux/hrtimer.h>
+#include <linux/module.h>
+
+/*
+ * Testing instructions:
+ *
+ * isolate=1
+ * insmod timers.ko test_cpu=$isolate
+ * cd /sys/fs/cgroup/
+ * echo +cpuset > cgroup.subtree_control
+ * mkdir test
+ * echo isolated > test/cpuset.cpus.partition
+ * echo $isolate > test/cpuset.cpus
+ *
+ * awk "/cpu:/{y=0};/cpu: $isolate\$/{y=1};/ #[0-9]/ && y;" /proc/timer_list \
+ * | grep -q test_hrtimer_cb && echo FAIL || echo PASS
+ *
+ * Ensure that there are no timers on the isolated CPU.
+ */
+
+static void test_timer_cb(struct timer_list *unused) { }
+
+static struct timer_list test_timer;
+
+static struct hrtimer test_hrtimer;
+
+static enum hrtimer_restart test_hrtimer_cb(struct hrtimer *hr_timer)
+{
+ return HRTIMER_NORESTART;
+}
+
+static int test_cpu = 1;
+module_param(test_cpu, int, 0444);
+
+static int timers_init(void)
+{
+ set_cpus_allowed_ptr(current, cpumask_of(test_cpu));
+ timer_setup(&test_timer, test_timer_cb, TIMER_PINNED);
+ test_timer.expires = KTIME_MAX;
+ add_timer(&test_timer);
+
+ hrtimer_init(&test_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ test_hrtimer.function = test_hrtimer_cb;
+ hrtimer_start(&test_hrtimer, -1, HRTIMER_MODE_REL);
+ return 0;
+}
+
+static void timers_exit(void)
+{
+ del_timer(&test_timer);
+ hrtimer_cancel(&test_hrtimer);
+}
+
+module_init(timers_init);
+module_exit(timers_exit);
+
+MODULE_LICENSE("GPL");
--
2.45.0
Kernel module that creates a managed IRQ for the test, together with
the shell commands to run it.
Signed-off-by: Costa Shulyupin <[email protected]>
---
tests/managed_irq.c | 71 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 71 insertions(+)
create mode 100644 tests/managed_irq.c
diff --git a/tests/managed_irq.c b/tests/managed_irq.c
new file mode 100644
index 0000000000000..fd4e91e44e59d
--- /dev/null
+++ b/tests/managed_irq.c
@@ -0,0 +1,71 @@
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/module.h>
+
+/*
+ * Testing instructions:
+ *
+ * isolate=1
+ * insmod managed_irq.ko test_cpu=$isolate
+ * cd /sys/fs/cgroup/
+ * echo +cpuset > cgroup.subtree_control
+ * mkdir test
+ * echo isolated > test/cpuset.cpus.partition
+ * echo $isolate > test/cpuset.cpus
+ * read test_irq < /sys/module/managed_irq/parameters/test_irq
+ *
+ * cat /proc/irq/$test_irq/smp_affinity_list
+ *
+ * read affinity < /proc/irq/$test_irq/smp_affinity
+ * test 0 -ne $((0x$affinity & 1 << $isolate)) && echo FAIL || echo PASS
+ *
+ * Ensure that the IRQ affinity doesn't contain the isolated CPU.
+ */
+
+static int test_cpu = 1;
+module_param(test_cpu, int, 0444);
+
+static int test_irq;
+module_param(test_irq, int, 0444);
+
+static irqreturn_t test_irq_cb(int irq, void *dev_id)
+{
+ return IRQ_HANDLED;
+}
+
+static int test_set_affinity(struct irq_data *d, const struct cpumask *m, bool f)
+{
+ irq_data_update_effective_affinity(d, m);
+ return 0;
+}
+
+static int make_test_irq(void)
+{
+ struct irq_affinity_desc a = {
+ mask: *cpumask_of(test_cpu),
+ is_managed: true
+ };
+ int test_irq = __irq_alloc_descs(-1, 1, 1, 0, THIS_MODULE, &a);
+ static struct irq_chip test_chip = { .irq_set_affinity =
+ test_set_affinity };
+
+ irq_set_chip(test_irq, &test_chip);
+ irq_set_status_flags(test_irq, IRQ_MOVE_PCNTXT);
+ pr_debug("test_irq=%d\n", test_irq);
+ int err = request_irq(test_irq, test_irq_cb, 0, "test_affinity", 0);
+
+ if (err)
+ pr_err("%d\n", err);
+
+ return test_irq;
+}
+
+static int managed_irq_init(void)
+{
+ test_irq = make_test_irq();
+ return 0;
+}
+
+module_init(managed_irq_init);
+
+MODULE_LICENSE("GPL");
--
2.45.0
Introduce the infrastructure function housekeeping_update() to change a
housekeeping cpumask at runtime and adjust the affinities of the
dependent subsystems.
Affinity adjustments of subsystems follow in subsequent patches.
Parent patch:
"sched/isolation: Exclude dynamically isolated CPUs from housekeeping masks"
https://lore.kernel.org/lkml/[email protected]/
Test example for cgroup2:
cd /sys/fs/cgroup/
echo +cpuset > cgroup.subtree_control
mkdir test
echo isolated > test/cpuset.cpus.partition
echo $isolate > test/cpuset.cpus
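A read-back to confirm that the isolated partition took effect (a
sketch only, not part of the patch; assumes $isolate holds the CPU
number):

cat test/cpuset.cpus.partition    # expected to read back "isolated"
cat test/cpuset.cpus.effective    # expected to show only $isolate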
Signed-off-by: Costa Shulyupin <[email protected]>
---
kernel/sched/isolation.c | 48 +++++++++++++++++++++++++++++++++++-----
1 file changed, 43 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 948b9ee0dc2cc..036e48f0e7d1b 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -116,6 +116,39 @@ static void __init housekeeping_setup_type(enum hk_type type,
housekeeping_staging);
}
+/*
+ * housekeeping_update - change housekeeping.cpumasks[type] and propagate the
+ * change.
+ *
+ * Assuming cpuset_mutex is held in sched_partition_write or
+ * cpuset_write_resmask.
+ */
+static int housekeeping_update(enum hk_type type, cpumask_var_t update)
+{
+ struct {
+ struct cpumask changed;
+ struct cpumask enable;
+ struct cpumask disable;
+ } *masks;
+
+ masks = kmalloc(sizeof(*masks), GFP_KERNEL);
+ if (!masks)
+ return -ENOMEM;
+
+ lockdep_assert_cpus_held();
+ cpumask_xor(&masks->changed, housekeeping_cpumask(type), update);
+ cpumask_and(&masks->enable, &masks->changed, update);
+ cpumask_andnot(&masks->disable, &masks->changed, update);
+ cpumask_copy(housekeeping.cpumasks[type], update);
+ housekeeping.flags |= BIT(type);
+ if (!static_branch_unlikely(&housekeeping_overridden))
+ static_key_enable_cpuslocked(&housekeeping_overridden.key);
+
+ kfree(masks);
+
+ return 0;
+}
+
static int __init housekeeping_setup(char *str, unsigned long flags)
{
cpumask_var_t non_housekeeping_mask, housekeeping_staging;
@@ -314,9 +347,12 @@ int housekeeping_exlude_isolcpus(const struct cpumask *isolcpus, unsigned long f
/*
* Reset housekeeping to bootup default
*/
- for_each_set_bit(type, &housekeeping_boot.flags, HK_TYPE_MAX)
- cpumask_copy(housekeeping.cpumasks[type],
- housekeeping_boot.cpumasks[type]);
+ for_each_set_bit(type, &housekeeping_boot.flags, HK_TYPE_MAX) {
+ int err = housekeeping_update(type, housekeeping_boot.cpumasks[type]);
+
+ if (err)
+ return err;
+ }
WRITE_ONCE(housekeeping.flags, housekeeping_boot.flags);
if (!housekeeping_boot.flags &&
@@ -344,9 +380,11 @@ int housekeeping_exlude_isolcpus(const struct cpumask *isolcpus, unsigned long f
cpumask_andnot(tmp_mask, src_mask, isolcpus);
if (!cpumask_intersects(tmp_mask, cpu_online_mask))
return -EINVAL; /* Invalid isolated CPUs */
- cpumask_copy(housekeeping.cpumasks[type], tmp_mask);
+ int err = housekeeping_update(type, tmp_mask);
+
+ if (err)
+ return err;
}
- WRITE_ONCE(housekeeping.flags, housekeeping_boot.flags | flags);
excluded = true;
if (!static_branch_unlikely(&housekeeping_overridden))
static_key_enable_cpuslocked(&housekeeping_overridden.key);
--
2.45.0
irq_affinity_adjust() is prototyped from irq_affinity_online_cpu()
and irq_restore_affinity_of_irq().
Core test snippets without infrastructure:
1. Create managed IRQ on specific cpu with:
static int test_set_affinity(struct irq_data *data,
const struct cpumask *m, bool f)
{
irq_data_update_effective_affinity(data, m);
return 0;
}
static int make_test_irq(void)
{
struct irq_affinity_desc a = { mask: *cpumask_of(test_cpu),
is_managed: true };
static struct irq_chip test_chip = { .irq_set_affinity = test_set_affinity };
int test_irq = __irq_alloc_descs(-1, 1, 1, 0, THIS_MODULE, &a);
irq_set_chip(test_irq, &test_chip);
irq_set_status_flags(test_irq, IRQ_MOVE_PCNTXT);
request_irq(test_irq, test_irq_cb, 0, "test_affinity", 0);
return test_irq;
}
2. Isolate the specified CPU.
3. Ensure that test_irq is no longer affine to test_cpu:
cat /proc/irq/$test_irq/smp_affinity_list
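A scripted pass/fail form of the same check (a sketch; it assumes a
machine small enough that smp_affinity is a single hex word, and that
$test_irq and $test_cpu are set accordingly):

read affinity < /proc/irq/$test_irq/smp_affinity
test $((0x$affinity & 1 << $test_cpu)) -eq 0 && echo PASS || echo FAIL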
Signed-off-by: Costa Shulyupin <[email protected]>
---
kernel/cgroup/cpuset.c | 3 ++-
kernel/sched/isolation.c | 44 +++++++++++++++++++++++++++++++++++++++-
2 files changed, 45 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 9d01e8e0a3ed9..2e59a2983eb2e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -225,7 +225,8 @@ static struct list_head remote_children;
/*
* The set of housekeeping flags to be updated for CPU isolation
*/
-#define HOUSEKEEPING_FLAGS (BIT(HK_TYPE_TIMER) | BIT(HK_TYPE_RCU))
+#define HOUSEKEEPING_FLAGS (BIT(HK_TYPE_TIMER) | BIT(HK_TYPE_RCU) \
+ | BIT(HK_TYPE_MANAGED_IRQ))
/*
* Partition root states:
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 85a17d39d8bb0..b0503ed362fce 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -135,6 +135,43 @@ static void resettle_all_timers(cpumask_var_t enable_mask, cpumask_var_t disable
}
}
+static int irq_affinity_adjust(cpumask_var_t disable_mask)
+{
+ unsigned int irq;
+ cpumask_var_t mask;
+
+ if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+ return -ENOMEM;
+
+ irq_lock_sparse();
+ for_each_active_irq(irq) {
+ struct irq_desc *desc = irq_to_desc(irq);
+
+ raw_spin_lock_irq(&desc->lock);
+ struct irq_data *data = irq_desc_get_irq_data(desc);
+
+ if (irqd_affinity_is_managed(data) && cpumask_weight_and(disable_mask,
+ irq_data_get_affinity_mask(data))) {
+
+ cpumask_and(mask, cpu_online_mask, irq_default_affinity);
+ cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
+ irq_set_affinity_locked(data, mask, true);
+ WARN_ON(cpumask_weight_and(irq_data_get_effective_affinity_mask(data),
+ disable_mask));
+ WARN_ON(!cpumask_subset(irq_data_get_effective_affinity_mask(data),
+ cpu_online_mask));
+ WARN_ON(!cpumask_subset(irq_data_get_effective_affinity_mask(data),
+ housekeeping_cpumask(HK_TYPE_MANAGED_IRQ)));
+ }
+ raw_spin_unlock_irq(&desc->lock);
+ }
+ irq_unlock_sparse();
+
+ free_cpumask_var(mask);
+
+ return 0;
+}
+
/*
* housekeeping_update - change housekeeping.cpumasks[type] and propagate the
* change.
@@ -144,6 +181,8 @@ static void resettle_all_timers(cpumask_var_t enable_mask, cpumask_var_t disable
*/
static int housekeeping_update(enum hk_type type, cpumask_var_t update)
{
+ int err = 0;
+
struct {
struct cpumask changed;
struct cpumask enable;
@@ -171,11 +210,14 @@ static int housekeeping_update(enum hk_type type, cpumask_var_t update)
lockup_detector_reconfigure();
#endif
break;
+ case HK_TYPE_MANAGED_IRQ:
+ err = irq_affinity_adjust(&masks->disable);
+ break;
default:
}
kfree(masks);
- return 0;
+ return err;
}
static int __init housekeeping_setup(char *str, unsigned long flags)
--
2.45.0
On Thu, May 16 2024 at 22:04, Costa Shulyupin wrote:
> Introduce the infrastructure function housekeeping_update() to change a
> housekeeping cpumask at runtime and adjust the affinities of the
> dependent subsystems.
>
> Affinity adjustments of subsystems follow in subsequent patches.
>
> Parent patch:
> "sched/isolation: Exclude dynamically isolated CPUs from housekeeping masks"
> https://lore.kernel.org/lkml/[email protected]/
>
> Test example for cgroup2:
>
> cd /sys/fs/cgroup/
> echo +cpuset > cgroup.subtree_control
> mkdir test
> echo isolated > test/cpuset.cpus.partition
> echo $isolate > test/cpuset.cpus
This changelog is not telling me anything. Please see
Documentation/process/ for what changelogs should contain.
> +/*
> + * housekeeping_update - change housekeeping.cpumasks[type] and propagate the
> + * change.
> + *
> + * Assuming cpuset_mutex is held in sched_partition_write or
> + * cpuset_write_resmask.
Locking cannot be assumed. lockdep_assert_held() is there to document
and enforce such requirements.
> + */
> +static int housekeeping_update(enum hk_type type, cpumask_var_t update)
Please use 'struct cpumask *update' as it makes it clear what this is
about. cpumask_var_t is a hack to make onstack and embedded cpumask and
their allocated counterparts possible without #ifdeffery in the code.
But any function which is not related to alloc/free of cpumask_var_t
should simply use 'struct cpumask *' as argument type.
> + housekeeping.flags |= BIT(type);
The existing code uses WRITE_ONCE() probably for a reason. Why is that
no longer required here?
> static int __init housekeeping_setup(char *str, unsigned long flags)
> {
> cpumask_var_t non_housekeeping_mask, housekeeping_staging;
> @@ -314,9 +347,12 @@ int housekeeping_exlude_isolcpus(const struct cpumask *isolcpus, unsigned long f
> /*
> * Reset housekeeping to bootup default
> */
> - for_each_set_bit(type, &housekeeping_boot.flags, HK_TYPE_MAX)
> - cpumask_copy(housekeeping.cpumasks[type],
> - housekeeping_boot.cpumasks[type]);
> + for_each_set_bit(type, &housekeeping_boot.flags, HK_TYPE_MAX) {
> + int err = housekeeping_update(type, housekeeping_boot.cpumasks[type]);
> +
> + if (err)
> + return err;
> + }
>
> WRITE_ONCE(housekeeping.flags, housekeeping_boot.flags);
> if (!housekeeping_boot.flags &&
> @@ -344,9 +380,11 @@ int housekeeping_exlude_isolcpus(const struct cpumask *isolcpus, unsigned long f
> cpumask_andnot(tmp_mask, src_mask, isolcpus);
> if (!cpumask_intersects(tmp_mask, cpu_online_mask))
> return -EINVAL; /* Invalid isolated CPUs */
> - cpumask_copy(housekeeping.cpumasks[type], tmp_mask);
> + int err = housekeeping_update(type, tmp_mask);
> +
> + if (err)
> + return err;
Do we really need two places to define 'int err' or might it be possible
to have one instance defined at function scope?
Thanks,
tglx
On Thu, May 16 2024 at 22:04, Costa Shulyupin wrote:
> Adjust affinity of watchdog_cpumask, hrtimers according to
> change of housekeeping.cpumasks[HK_TYPE_TIMER].
>
> Function migrate_hrtimer_list_except() is prototyped from
> migrate_hrtimer_list() and is more generic.
>
> Potentially it can be used instead of migrate_hrtimer_list.
>
> Function hrtimers_resettle_from_cpu() is blindly prototyped
> from hrtimers_cpu_dying(). local_irq_disable() is used because
> cpuhp_thread_fun() uses it before cpuhp_invoke_callback().
I'm again impressed by the depth of analysis.
Q: What the heck has cpuhp_thread_fun() to do with hrtimers_cpu_dying()?
A: Absolutely nothing.
> Core test snippets without infrastructure:
>
> 1. Create hrtimer on specific cpu with:
>
> set_cpus_allowed_ptr(current, cpumask_of(test_cpu));
> hrtimer_init(&test_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> test_hrtimer.function = test_hrtimer_cb;
> hrtimer_start(&test_hrtimer, -1, HRTIMER_MODE_REL);
>
> 2. Call housekeeping_update()
>
> 3. Ensure that only tick_nohz_handler remains on the specified CPU
> in /proc/timer_list, either manually or with the script:
>
> grep -E 'cpu| #[0-9]' /proc/timer_list | \
> awk "/cpu:/{y=0};/cpu: $test_cpu\$/{y=1};y"
>
> An alternative solution to migrate hrtimers:
> 1. Use cpuhp to set sched_timer offline
> 2. Resettle all hrtimers likewise migrate_hrtimer_list
> 3. Use cpuhp to set sched_timer online
Another pile of incomprehensible word salad which pretends to be a
change log.
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -126,10 +126,12 @@ static void resettle_all_timers(cpumask_var_t enable_mask, cpumask_var_t disable
>
> for_each_cpu(cpu, enable_mask) {
> timers_prepare_cpu(cpu);
> + hrtimers_prepare_cpu(cpu);
And yet another lockless re-initialization of an active, in-use data
structure ...
> +/*
> + * migrate_hrtimer_list_except - migrates hrtimers from one base to another,
> + * except specified one.
> + */
> +static void migrate_hrtimer_list_except(struct hrtimer_clock_base *old_base,
> + struct hrtimer_clock_base *new_base, struct hrtimer *except)
> +{
> + struct hrtimer *timer;
> + struct timerqueue_node *node;
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#variable-declarations
> + node = timerqueue_getnext(&old_base->active);
> + while (node) {
> + timer = container_of(node, struct hrtimer, node);
> + node = timerqueue_iterate_next(node);
> + if (timer == except)
> + continue;
What's the rationale that there is only a single timer to except from
the migration?
> + BUG_ON(hrtimer_callback_running(timer));
Q: What prevents a hrtimer callback from running on the CPU which was
not isolated before?
A: Nothing. Ergo this is prone to kill a perfectly correct system just
because of blindly copying something.
At least your attempt of a changelog is clear about that...
> + debug_deactivate(timer);
> +
> + /*
> + * Mark it as ENQUEUED not INACTIVE otherwise the
> + * timer could be seen as !active and just vanish away
> + * under us on another CPU
> + */
> + __remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, 0);
> + timer->base = new_base;
> + /*
> + * Enqueue the timers on the new cpu. This does not
> + * reprogram the event device in case the timer
> + * expires before the earliest on this CPU, but we run
> + * hrtimer_interrupt after we migrated everything to
> + * sort out already expired timers and reprogram the
> + * event device.
> + */
> + enqueue_hrtimer(timer, new_base, HRTIMER_MODE_ABS);
> + }
> +}
> +
> +/**
> + * hrtimers_resettle_from_cpu - resettles hrtimers from
> + * specified cpu to housekeeping cpus.
> + */
> +void hrtimers_resettle_from_cpu(unsigned int isol_cpu)
> +{
> + int ncpu, i;
> + struct tick_sched *ts = tick_get_tick_sched(isol_cpu);
> + struct hrtimer_cpu_base *old_base, *new_base;
> +
> + local_irq_disable();
> + ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
> +
> + old_base = &per_cpu(hrtimer_bases, isol_cpu);
> + new_base = &per_cpu(hrtimer_bases, ncpu);
> +
> + /*
> + * The caller is globally serialized and nobody else
> + * takes two locks at once, deadlock is not possible.
> + */
> + raw_spin_lock(&old_base->lock);
> + raw_spin_lock_nested(&new_base->lock, SINGLE_DEPTH_NESTING);
> + for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
> + migrate_hrtimer_list_except(&old_base->clock_base[i],
> + &new_base->clock_base[i],
> + &ts->sched_timer);
> + }
> +
> + /*
> + * The migration might have changed the first expiring softirq
> + * timer on this CPU. Update it.
> + */
> + __hrtimer_get_next_event(new_base, HRTIMER_ACTIVE_SOFT);
> +
> + raw_spin_unlock(&new_base->lock);
> + raw_spin_unlock(&old_base->lock);
> + local_irq_enable();
> +
> + /* Tell the other CPU to retrigger the next event */
> + smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);
> +}
We clearly need another copy of hrtimers_cpu_dying() and
migrate_hrtimer_list() to add a local_irq_dis/enable() pair and a
completely ill-defined exclusion mechanism which assumes that the tick
hrtimer is the only one which has to be excluded from migration.
Even if the exclusion mechanism were correct, there is ZERO
justification for this copy and pasta orgy even if you marked this
series RFC. RFC != POC hackery.
Thanks,
tglx
On Thu, May 16 2024 at 22:04, Costa Shulyupin wrote:
> irq_affinity_adjust() is prototyped from irq_affinity_online_cpu()
> and irq_restore_affinity_of_irq().
I'm used to this prototyped phrase by now. It still does not justify
exposing me to this POC hackery.
My previous comments about change logs still apply.
> +static int irq_affinity_adjust(cpumask_var_t disable_mask)
> +{
> + unsigned int irq;
> + cpumask_var_t mask;
> +
> + if (!alloc_cpumask_var(&mask, GFP_KERNEL))
> + return -ENOMEM;
> +
> + irq_lock_sparse();
> + for_each_active_irq(irq) {
> + struct irq_desc *desc = irq_to_desc(irq);
> +
> + raw_spin_lock_irq(&desc->lock);
That's simply broken. This is not CPU hotplug on an outgoing CPU. Why
are you assuming that your isolation change code can rely on the
implicit guarantees of CPU hot(un)plug?
Also there is a reason why interrupt related code is in kernel/irq/* and
not in some random other location. Even if C allows you to fiddle with
everything that does not mean that hiding random hacks in other places
is correct in any way.
> + struct irq_data *data = irq_desc_get_irq_data(desc);
> +
> + if (irqd_affinity_is_managed(data) && cpumask_weight_and(disable_mask,
> + irq_data_get_affinity_mask(data))) {
Interrupt target isolation is only relevant for managed interrupts, and
non-managed interrupts are clearly going to migrate themselves away
magically, right?
> +
> + cpumask_and(mask, cpu_online_mask, irq_default_affinity);
> + cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
There are clearly a lot of comments explaining what this is doing and
why it is correct as there is a guarantee that these masks overlap by
definition.
> + irq_set_affinity_locked(data, mask, true);
Plus the extensive explanation why using 'force=true' is even remotely
correct here.
I concede that the documentation of that function and its arguments is
close to non-existent, but if you follow the call chain of that function
there are enough hints down the road, no?
> + WARN_ON(cpumask_weight_and(irq_data_get_effective_affinity_mask(data),
> + disable_mask));
> + WARN_ON(!cpumask_subset(irq_data_get_effective_affinity_mask(data),
> + cpu_online_mask));
> + WARN_ON(!cpumask_subset(irq_data_get_effective_affinity_mask(data),
> + housekeeping_cpumask(HK_TYPE_MANAGED_IRQ)));
These warnings are required and useful within the spinlock-held and
interrupt-disabled section because of what?
- Because the resulting stack trace provides a well known call chain
- Because the resulting warnings do not tell anything about the
affected interrupt number
- Because the resulting warnings do not tell anything about the CPU
masks which cause the problem
- Because the aggregate information of the above is utterly useless
Impressive...
Thanks,
tglx
Costa!
On Sat, May 18 2024 at 03:17, Thomas Gleixner wrote:
> Impressive...
Now let's take a step back because none of this makes any sense at all
conceptually.
Reconfiguring the housekeeping CPUs on a live system is expensive and a
slow-path operation no matter what.
So why inflict all of this nonsense on the kernel instead of
cleverly (ab)using CPU hotplug for it in user space:
for_each_cpu(cpu, new_house_keeping_mask) {
if (cpu_ishk(cpu))
continue;
cpu_offline(cpu);
set_cpu_in_hkmask(cpu);
cpu_online(cpu);
}
for_each_cpu(cpu, new_isolated_mask) {
if (!cpu_ishk(cpu))
continue;
cpu_offline(cpu);
clear_cpu_in_hkmask(cpu);
cpu_online(cpu);
}
Or something like that. You get the idea, right?
IOW, the only kernel change which is required to achieve your goal is to
ensure that changing the housekeeping/isolated property of a CPU at
runtime is only possible when the CPU is "offline".
Then all of the magic things you try to solve just work out of the box
because the existing and well exercised hotplug code takes care of it
already, no?
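From user space that boils down to something like the sketch below. The
sysfs hotplug files exist today; the per-CPU "isolated" attribute is
made up here and would be the one new kernel interface:

    for cpu in $CPUS_TO_ISOLATE; do
        echo 0 > /sys/devices/system/cpu/cpu$cpu/online
        echo 1 > /sys/devices/system/cpu/cpu$cpu/isolated   # hypothetical
        echo 1 > /sys/devices/system/cpu/cpu$cpu/online
    done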
I might be missing something obvious as always, so feel free to educate
me on it.
Thanks,
tglx
Hello Thomas,
Thank you so much for the very valuable feedback!
I've tested your proposal to use CPU hotplug for the isolation.
It works well for both timers and hrtimers, but managed IRQs are not
ready for this scenario.
The current implementation was introduced by:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=11ea68f553e244851d15793a7fa33a97c46d8271
How do you suggest we proceed?
> So why inflict all of this nonsense on the kernel instead of
> cleverly (ab)using CPU hotplug for it in user space:
..
> Or something like that. You get the idea, right?
>
> IOW, the only kernel change which is required to achieve your goal is to
> ensure that changing the housekeeping/isolated property of a CPU at
> runtime is only possible when the CPU is "offline".
On Mon, May 27 2024 at 15:21, Costa Shulyupin wrote:
> Thank you so much for the very valuable feedback!
> I've tested your proposal to use CPU hotplug for the isolation.
> It works well for both timers and hrtimers, but managed IRQs are not
> ready for this scenario.
>
> How do you suggest we proceed?
That's non-trivial because it's not only the interrupt affinity.
Drivers also have their queues associated accordingly, and if you just
change the interrupt affinity under the hood, then all the other
associations don't get updated and don't work.
That needs to be solved at some other level IMO.
Thanks,
tglx