commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
introduced logic to distribute tasks within a cpumask upon initial
wakeup. In telco RAN deployments, isolcpus are a necessity to meet the
requirements of low-latency applications. These isolcpus are generally
tickless so that high-priority SCHED_FIFO tasks can execute without any
OS jitter. Since load balancing is disabled on isolcpus, any task
which gets placed on these CPUs cannot be migrated away on its own.
For RT applications to execute on isolcpus, a guaranteed kubernetes pod
containing all isolcpus is required, and these RT applications are
affined to a specific isolcpu within the kubernetes pod.
However, some non-RT tasks may also be scheduled in the same kubernetes
pod without being affined to any specific CPU (they inherit the pod's
cpuset affinity). With multiple containers spawning and running inside
the pod, the container runtime spawns several non-RT initialization
tasks ("runc init") inside the pod. Due to the above-mentioned commits,
these non-RT tasks may get placed on any isolcpu and may starve if they
happen to wake up on the same CPU as a SCHED_FIFO task, because RT
throttling is also disabled in telco setups. Thus, the RAN deployment
fails and eventually the system hangs.
Introduce a kernel cmdline parameter, 'sched_pick_firstcpu', which
provides an option for such use cases to disable the distribution of
tasks within the cpumask and fall back to the previous 'pick first CPU'
approach for initial task placement. This is viable because many telco
vendors configure the system such that the first CPU within a pod's
cpuset does not run any SCHED_FIFO or other high-priority tasks.
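For example, a RAN node might boot with something like the following
(illustrative only; the exact isolcpus/nohz_full CPU lists are
deployment-specific):

    isolcpus=domain,managed_irq,2-15 nohz_full=2-15 sched_pick_firstcpu

With the parameter present, the pr_info() in the patch below records in
dmesg that task distribution within cpumask is disabled.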
Co-developed-by: Alexey Makhalov <[email protected]>
Signed-off-by: Alexey Makhalov <[email protected]>
Signed-off-by: Ankit Jain <[email protected]>
---
lib/cpumask.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/lib/cpumask.c b/lib/cpumask.c
index e77ee9d46f71..3dea87d5ec1f 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -154,6 +154,23 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
}
EXPORT_SYMBOL(cpumask_local_spread);
+/*
+ * Task distribution within the cpumask feature disabled?
+ */
+static bool cpumask_pick_firstcpu __read_mostly;
+
+/*
+ * Disable task distribution within the cpumask feature
+ */
+static int __init cpumask_pick_firstcpu_setup(char *str)
+{
+ cpumask_pick_firstcpu = true;
+ pr_info("cpumask: Task distribution within cpumask is disabled.\n");
+ return 1;
+}
+
+__setup("sched_pick_firstcpu", cpumask_pick_firstcpu_setup);
+
static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
/**
@@ -171,6 +188,13 @@ unsigned int cpumask_any_and_distribute(const struct cpumask *src1p,
{
unsigned int next, prev;
+ /*
+ * Don't distribute if task distribution within
+ * the cpumask feature is disabled
+ */
+ if (cpumask_pick_firstcpu)
+ return cpumask_any_and(src1p, src2p);
+
/* NOTE: our first selection will skip 0. */
prev = __this_cpu_read(distribute_cpu_mask_prev);
--
2.23.1
On Tue, Apr 30, 2024 at 02:34:31PM +0530, Ankit Jain wrote:
> commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
> introduced the logic to distribute the tasks within cpumask upon initial
> wakeup.
So let's add the authors to the CC list?
> For Telco RAN deployments, isolcpus are a necessity to cater to
> the requirement of low latency applications. These isolcpus are generally
> tickless so that high priority SCHED_FIFO tasks can execute without any
> OS jitter. Since load balancing is disabled on isocpus, any task
> which gets placed on these CPUs can not be migrated on its own.
> For RT applications to execute on isolcpus, a guaranteed kubernetes pod
> with all isolcpus becomes the requirement and these RT applications are
> affine to execute on a specific isolcpu within the kubernetes pod.
> However, there may be some non-RT tasks which could also schedule in the
> same kubernetes pod without being affine to any specific CPU(inherits the
> pod cpuset affinity).
OK... It looks like adding scheduler maintainers is also a necessity to
cater here...
> With multiple spawning and running containers inside
> the pod, container runtime spawns several non-RT initializing tasks
> ("runc init") inside the pod and due to above mentioned commits, these
> non-RT tasks may get placed on any isolcpus and may starve if it happens
> to wakeup on the same CPU as SCHED_FIFO task because RT throttling is also
> disabled in telco setup. Thus, RAN deployment fails and eventually leads
> to system hangs.
Not that I'm familiar with your setup, but this sounds like a userspace
configuration problem. Can you try to move your non-RT tasks into a
cgroup attached to non-RT CPUs, or something like that?
> With the introduction of kernel cmdline param 'sched_pick_firstcpu',
> there is an option provided for such usecases to disable the distribution
> of tasks within the cpumask logic and use the previous 'pick first cpu'
> approach for initial placement of tasks. Because many telco vendors
> configure the system in such a way that the first cpu within a cpuset
> of pod doesn't run any SCHED_FIFO or High priority tasks.
>
> Co-developed-by: Alexey Makhalov <[email protected]>
> Signed-off-by: Alexey Makhalov <[email protected]>
> Signed-off-by: Ankit Jain <[email protected]>
> ---
> lib/cpumask.c | 24 ++++++++++++++++++++++++
> 1 file changed, 24 insertions(+)
>
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index e77ee9d46f71..3dea87d5ec1f 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -154,6 +154,23 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
> }
> EXPORT_SYMBOL(cpumask_local_spread);
>
> +/*
> + * Task distribution within the cpumask feature disabled?
> + */
> +static bool cpumask_pick_firstcpu __read_mostly;
> +
> +/*
> + * Disable Tasks distribution within the cpumask feature
> + */
> +static int __init cpumask_pick_firstcpu_setup(char *str)
> +{
> + cpumask_pick_firstcpu = 1;
> + pr_info("cpumask: Tasks distribution within cpumask is disabled.");
> + return 1;
> +}
> +
> +__setup("sched_pick_firstcpu", cpumask_pick_firstcpu_setup);
> +
> static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
>
> /**
> @@ -171,6 +188,13 @@ unsigned int cpumask_any_and_distribute(const struct cpumask *src1p,
> {
> unsigned int next, prev;
>
> + /*
> + * Don't distribute, if tasks distribution
> + * within cpumask feature is disabled
> + */
> + if (cpumask_pick_firstcpu)
> + return cpumask_any_and(src1p, src2p);
No, this is the wrong way.
To begin with, this parameter shouldn't control a single random
function. At the very least, the other cpumask_*_distribute() helpers
should be consistent with the policy.
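For illustration, a consistent version of the proposal would need the
same early-out in cpumask_any_distribute() as well. A sketch, with the
function body paraphrased from the current lib/cpumask.c implementation
(only the two lines behind the new flag are added):

    unsigned int cpumask_any_distribute(const struct cpumask *srcp)
    {
            unsigned int next, prev;

            /* Same early-out as proposed for cpumask_any_and_distribute() */
            if (cpumask_pick_firstcpu)
                    return cpumask_any(srcp);

            /* NOTE: our first selection will skip 0. */
            prev = __this_cpu_read(distribute_cpu_mask_prev);
            next = find_next_bit_wrap(cpumask_bits(srcp), nr_cpumask_bits, prev + 1);
            if (next < nr_cpu_ids)
                    __this_cpu_write(distribute_cpu_mask_prev, next);

            return next;
    }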
But in general... I don't think we should do things like that at all.
The cpumask API is a simple and plain wrapper around bitmaps. If you
want to modify the behavior of the scheduler, you should do that at the
scheduler level, not in a random helper function.
Consider 2 cases:
- Someone unrelated to the scheduler uses the same helper and is
inadvertently affected by this parameter.
- The scheduler switches to another function to distribute CPUs,
and your setups will suddenly get broken again. This time deeply in
production.
Thanks,
Yury
> /* NOTE: our first selection will skip 0. */
> prev = __this_cpu_read(distribute_cpu_mask_prev);
>
> --
> 2.23.1
Hi Yury,
On Tue, Apr 30, 2024 at 11:23:07AM -0700 Yury Norov wrote:
> On Tue, Apr 30, 2024 at 02:34:31PM +0530, Ankit Jain wrote:
> > commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> > and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
> > introduced the logic to distribute the tasks within cpumask upon initial
> > wakeup.
>
> So let's add the authors in CC list?
>
> > For Telco RAN deployments, isolcpus are a necessity to cater to
> > the requirement of low latency applications. These isolcpus are generally
> > tickless so that high priority SCHED_FIFO tasks can execute without any
> > OS jitter. Since load balancing is disabled on isocpus, any task
> > which gets placed on these CPUs can not be migrated on its own.
> > For RT applications to execute on isolcpus, a guaranteed kubernetes pod
> > with all isolcpus becomes the requirement and these RT applications are
> > affine to execute on a specific isolcpu within the kubernetes pod.
> > However, there may be some non-RT tasks which could also schedule in the
> > same kubernetes pod without being affine to any specific CPU(inherits the
> > pod cpuset affinity).
>
> OK... It looks like adding scheduler maintainers is also a necessity to
> cater here...
>
> > With multiple spawning and running containers inside
> > the pod, container runtime spawns several non-RT initializing tasks
> > ("runc init") inside the pod and due to above mentioned commits, these
> > non-RT tasks may get placed on any isolcpus and may starve if it happens
> > to wakeup on the same CPU as SCHED_FIFO task because RT throttling is also
> > disabled in telco setup. Thus, RAN deployment fails and eventually leads
> > to system hangs.
>
> Not that I'm familiar to your setup, but this sounds like a userspace
> configuration problems. Can you try to move your non-RT tasks into a
> cgroup attached to non-RT CPUs, or something like that?
>
It's not really. In a container environment, just logging in to the
container could end up with the exec'd task landing on one of
the polling or latency-sensitive cores.
In a telco deployment the applications will run in containers with
isolated (pinned) cpus with load balancing disabled. These
containers typically use one of these cpus for "housekeeping",
with the remainder used for the latency-sensitive workloads.
Also, this is a change in kernel behavior which is breaking
userspace.
We are also hitting this and are interested in a way to get the
old behavior back for some workloads.
> > With the introduction of kernel cmdline param 'sched_pick_firstcpu',
> > there is an option provided for such usecases to disable the distribution
> > of tasks within the cpumask logic and use the previous 'pick first cpu'
> > approach for initial placement of tasks. Because many telco vendors
> > configure the system in such a way that the first cpu within a cpuset
> > of pod doesn't run any SCHED_FIFO or High priority tasks.
> >
> > Co-developed-by: Alexey Makhalov <[email protected]>
> > Signed-off-by: Alexey Makhalov <[email protected]>
> > Signed-off-by: Ankit Jain <[email protected]>
> > ---
> > lib/cpumask.c | 24 ++++++++++++++++++++++++
> > 1 file changed, 24 insertions(+)
> >
> > diff --git a/lib/cpumask.c b/lib/cpumask.c
> > index e77ee9d46f71..3dea87d5ec1f 100644
> > --- a/lib/cpumask.c
> > +++ b/lib/cpumask.c
> > @@ -154,6 +154,23 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
> > }
> > EXPORT_SYMBOL(cpumask_local_spread);
> >
> > +/*
> > + * Task distribution within the cpumask feature disabled?
> > + */
> > +static bool cpumask_pick_firstcpu __read_mostly;
> > +
> > +/*
> > + * Disable Tasks distribution within the cpumask feature
> > + */
> > +static int __init cpumask_pick_firstcpu_setup(char *str)
> > +{
> > + cpumask_pick_firstcpu = 1;
> > + pr_info("cpumask: Tasks distribution within cpumask is disabled.");
> > + return 1;
> > +}
> > +
> > +__setup("sched_pick_firstcpu", cpumask_pick_firstcpu_setup);
> > +
> > static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
> >
> > /**
> > @@ -171,6 +188,13 @@ unsigned int cpumask_any_and_distribute(const struct cpumask *src1p,
> > {
> > unsigned int next, prev;
> >
> > + /*
> > + * Don't distribute, if tasks distribution
> > + * within cpumask feature is disabled
> > + */
> > + if (cpumask_pick_firstcpu)
> > + return cpumask_any_and(src1p, src2p);
>
> No, this is a wrong way.
>
> To begin with, this parameter shouldn't control a single random
> function. At least, the other cpumask_*_distribute() should be
> consistent to the policy.
>
> But in general... I don't think we should do things like that at all.
> Cpumask API is a simple and plain wrapper around bitmaps. If you want
> to modify a behavior of the scheduler, you could do that at scheduler
> level, not in a random helper function.
>
> Consider 2 cases:
> - Someone unrelated to scheduler would use the same helper and will
> be affected by this parameter inadvertently.
> - Scheduler will switch to using another function to distribute CPUs,
> and your setups will suddenly get broken again. This time deeply in
> production.
>
Yeah, I think I agree with this part. Doing this at the scheduler level,
where the helper is called, makes more sense.
Note, this is "deeply in production" now...
Cheers,
Phil
> Thanks,
> Yury
>
> > /* NOTE: our first selection will skip 0. */
> > prev = __this_cpu_read(distribute_cpu_mask_prev);
> >
> > --
> > 2.23.1
>
--
On Wed, May 01, 2024 at 09:36:08AM -0400, Phil Auld wrote:
>
> Hi Yuri,
[...]
> > Not that I'm familiar to your setup, but this sounds like a userspace
> > configuration problems. Can you try to move your non-RT tasks into a
> > cgroup attached to non-RT CPUs, or something like that?
> >
>
> It's not really. In a container environment just logging in to the
> container could end up with the exec'd task landing on one of
> the polling or latency sensitive cores.
>
> In a telco deployment the applications will run containers with
> isolated(pinned) cpus with load balacning disabled. These
> containers typically use one of these cpus for its "housekeeping"
> with the remainder used for the latency sensitive workloads.
>
> Also, this is a change in kernel behavior which is breaking
> userspace.
Alright, that's a different story.
> We are also hitting this and are interested in a way to get the
> old behavior back for some workloads.
>
> > > With the introduction of kernel cmdline param 'sched_pick_firstcpu',
> > > there is an option provided for such usecases to disable the distribution
> > > of tasks within the cpumask logic and use the previous 'pick first cpu'
> > > approach for initial placement of tasks. Because many telco vendors
> > > configure the system in such a way that the first cpu within a cpuset
> > > of pod doesn't run any SCHED_FIFO or High priority tasks.
> > >
> > > Co-developed-by: Alexey Makhalov <[email protected]>
> > > Signed-off-by: Alexey Makhalov <[email protected]>
> > > Signed-off-by: Ankit Jain <[email protected]>
> > > ---
> > > lib/cpumask.c | 24 ++++++++++++++++++++++++
> > > 1 file changed, 24 insertions(+)
> > >
> > > diff --git a/lib/cpumask.c b/lib/cpumask.c
> > > index e77ee9d46f71..3dea87d5ec1f 100644
> > > --- a/lib/cpumask.c
> > > +++ b/lib/cpumask.c
> > > @@ -154,6 +154,23 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
> > > }
> > > EXPORT_SYMBOL(cpumask_local_spread);
> > >
> > > +/*
> > > + * Task distribution within the cpumask feature disabled?
> > > + */
> > > +static bool cpumask_pick_firstcpu __read_mostly;
> > > +
> > > +/*
> > > + * Disable Tasks distribution within the cpumask feature
> > > + */
> > > +static int __init cpumask_pick_firstcpu_setup(char *str)
> > > +{
> > > + cpumask_pick_firstcpu = 1;
> > > + pr_info("cpumask: Tasks distribution within cpumask is disabled.");
> > > + return 1;
> > > +}
> > > +
> > > +__setup("sched_pick_firstcpu", cpumask_pick_firstcpu_setup);
> > > +
> > > static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
> > >
> > > /**
> > > @@ -171,6 +188,13 @@ unsigned int cpumask_any_and_distribute(const struct cpumask *src1p,
> > > {
> > > unsigned int next, prev;
> > >
> > > + /*
> > > + * Don't distribute, if tasks distribution
> > > + * within cpumask feature is disabled
> > > + */
> > > + if (cpumask_pick_firstcpu)
> > > + return cpumask_any_and(src1p, src2p);
> >
> > No, this is a wrong way.
> >
> > To begin with, this parameter shouldn't control a single random
> > function. At least, the other cpumask_*_distribute() should be
> > consistent to the policy.
> >
> > But in general... I don't think we should do things like that at all.
> > Cpumask API is a simple and plain wrapper around bitmaps. If you want
> > to modify a behavior of the scheduler, you could do that at scheduler
> > level, not in a random helper function.
> >
> > Consider 2 cases:
> > - Someone unrelated to scheduler would use the same helper and will
> > be affected by this parameter inadvertently.
> > - Scheduler will switch to using another function to distribute CPUs,
> > and your setups will suddenly get broken again. This time deeply in
> > production.
> >
>
> Yeah, I think I agree with this part. At the scheduler level, where this
> is called, makes more sense.
>
> Note, this is "deeply in production" now...
So, if we all agree that touching cpumasks is a bad idea, let's drop
this patch and try figuring out a better solution.
Now that you're saying the scheduler patches break userspace, I think
it would be legitimate to revert them, unless there's a simple fix for
that.
Let's see what the folks will say. Please keep me in CC.
Thanks,
Yury
On Wed, May 01, 2024 at 08:27:58AM -0700 Yury Norov wrote:
> On Wed, May 01, 2024 at 09:36:08AM -0400, Phil Auld wrote:
> >
> > Hi Yuri,
>
> [...]
>
> > > Not that I'm familiar to your setup, but this sounds like a userspace
> > > configuration problems. Can you try to move your non-RT tasks into a
> > > cgroup attached to non-RT CPUs, or something like that?
> > >
> >
> > It's not really. In a container environment just logging in to the
> > container could end up with the exec'd task landing on one of
> > the polling or latency sensitive cores.
> >
> > In a telco deployment the applications will run containers with
> > isolated(pinned) cpus with load balacning disabled. These
> > containers typically use one of these cpus for its "housekeeping"
> > with the remainder used for the latency sensitive workloads.
> >
> > Also, this is a change in kernel behavior which is breaking
> > userspace.
>
> Alright, that's a different story.
>
It's a specific edge case. I'd prefer to push for a forward solution
rather than a revert.
> > We are also hitting this and are interested in a way to get the
> > old behavior back for some workloads.
> >
> > > > With the introduction of kernel cmdline param 'sched_pick_firstcpu',
> > > > there is an option provided for such usecases to disable the distribution
> > > > of tasks within the cpumask logic and use the previous 'pick first cpu'
> > > > approach for initial placement of tasks. Because many telco vendors
> > > > configure the system in such a way that the first cpu within a cpuset
> > > > of pod doesn't run any SCHED_FIFO or High priority tasks.
> > > >
> > > > Co-developed-by: Alexey Makhalov <[email protected]>
> > > > Signed-off-by: Alexey Makhalov <[email protected]>
> > > > Signed-off-by: Ankit Jain <[email protected]>
> > > > ---
> > > > lib/cpumask.c | 24 ++++++++++++++++++++++++
> > > > 1 file changed, 24 insertions(+)
> > > >
> > > > diff --git a/lib/cpumask.c b/lib/cpumask.c
> > > > index e77ee9d46f71..3dea87d5ec1f 100644
> > > > --- a/lib/cpumask.c
> > > > +++ b/lib/cpumask.c
> > > > @@ -154,6 +154,23 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
> > > > }
> > > > EXPORT_SYMBOL(cpumask_local_spread);
> > > >
> > > > +/*
> > > > + * Task distribution within the cpumask feature disabled?
> > > > + */
> > > > +static bool cpumask_pick_firstcpu __read_mostly;
> > > > +
> > > > +/*
> > > > + * Disable Tasks distribution within the cpumask feature
> > > > + */
> > > > +static int __init cpumask_pick_firstcpu_setup(char *str)
> > > > +{
> > > > + cpumask_pick_firstcpu = 1;
> > > > + pr_info("cpumask: Tasks distribution within cpumask is disabled.");
> > > > + return 1;
> > > > +}
> > > > +
> > > > +__setup("sched_pick_firstcpu", cpumask_pick_firstcpu_setup);
> > > > +
> > > > static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
> > > >
> > > > /**
> > > > @@ -171,6 +188,13 @@ unsigned int cpumask_any_and_distribute(const struct cpumask *src1p,
> > > > {
> > > > unsigned int next, prev;
> > > >
> > > > + /*
> > > > + * Don't distribute, if tasks distribution
> > > > + * within cpumask feature is disabled
> > > > + */
> > > > + if (cpumask_pick_firstcpu)
> > > > + return cpumask_any_and(src1p, src2p);
> > >
> > > No, this is a wrong way.
> > >
> > > To begin with, this parameter shouldn't control a single random
> > > function. At least, the other cpumask_*_distribute() should be
> > > consistent to the policy.
> > >
> > > But in general... I don't think we should do things like that at all.
> > > Cpumask API is a simple and plain wrapper around bitmaps. If you want
> > > to modify a behavior of the scheduler, you could do that at scheduler
> > > level, not in a random helper function.
> > >
> > > Consider 2 cases:
> > > - Someone unrelated to scheduler would use the same helper and will
> > > be affected by this parameter inadvertently.
> > > - Scheduler will switch to using another function to distribute CPUs,
> > > and your setups will suddenly get broken again. This time deeply in
> > > production.
> > >
> >
> > Yeah, I think I agree with this part. At the scheduler level, where this
> > is called, makes more sense.
> >
> > Note, this is "deeply in production" now...
>
> So, if we all agree that touching cpumasks is a bad idea, let's drop
> this patch and try figuring out a better solution.
>
> Now that you're saying the scheduler patches break userspace, I think
> it would be legitimate to revert them, unless there's a simple fix for
> that.
As I said above, let's try to go forward if we can. I'd argue that relying
on the old first-cpu selection is not really an API, nor documented, so I
don't think a revert is needed.
I think a static key at the one or two places _distribute() is used
in the scheduler (and workqueue?) code would have the same effect as
this and be a better fit.
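For example, a rough sketch of that idea (untested; the key name here is
invented, and the real call sites, e.g. in __set_cpus_allowed_ptr(),
would need auditing):

    /* kernel/sched/core.c -- hypothetical opt-out via a static key */
    static DEFINE_STATIC_KEY_FALSE(sched_pick_firstcpu);

    static int __init sched_pick_firstcpu_setup(char *str)
    {
            static_branch_enable(&sched_pick_firstcpu);
            return 1;
    }
    __setup("sched_pick_firstcpu", sched_pick_firstcpu_setup);

    /* ...and at each scheduler call site that currently distributes: */
    if (static_branch_unlikely(&sched_pick_firstcpu))
            dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
    else
            dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, new_mask);

That would keep the toggle out of the generic cpumask helpers entirely.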
Cheers,
Phil
>
> Let's see what the folks will say. Please keep me in CC.
>
> Thanks,
> Yury
>
--
On Tue, Apr 30, 2024 at 11:23:07AM -0700, Yury Norov wrote:
> On Tue, Apr 30, 2024 at 02:34:31PM +0530, Ankit Jain wrote:
> > commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> > and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
> > introduced the logic to distribute the tasks within cpumask upon initial
> > wakeup.
>
> So let's add the authors in CC list?
>
> > For Telco RAN deployments, isolcpus are a necessity to cater to
> > the requirement of low latency applications. These isolcpus are generally
> > tickless so that high priority SCHED_FIFO tasks can execute without any
> > OS jitter. Since load balancing is disabled on isocpus, any task
> > which gets placed on these CPUs can not be migrated on its own.
> > For RT applications to execute on isolcpus, a guaranteed kubernetes pod
> > with all isolcpus becomes the requirement and these RT applications are
> > affine to execute on a specific isolcpu within the kubernetes pod.
> > However, there may be some non-RT tasks which could also schedule in the
> > same kubernetes pod without being affine to any specific CPU(inherits the
> > pod cpuset affinity).
>
> OK... It looks like adding scheduler maintainers is also a necessity to
> cater here...
So 14e292f8d453 is very specifically only using sched_domain_span(), and
if you're using partitioned CPUs they should not show up there.
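For context, the selection that 14e292f8d453 touched in find_lowest_rq()
(kernel/sched/rt.c) looks roughly like this -- paraphrased, not verbatim:

    for_each_domain(cpu, sd) {
            if (sd->flags & SD_WAKE_AFFINE) {
                    int best_cpu;

                    /* Only CPUs inside this domain's span are candidates. */
                    best_cpu = cpumask_any_and_distribute(lowest_mask,
                                                          sched_domain_span(sd));
                    if (best_cpu < nr_cpu_ids)
                            return best_cpu;
            }
    }

CPUs in a fully isolated partition are not attached to any sched domain,
so sched_domain_span() can never hand them out here.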
As to 46a87b3851f0, if you're explicitly creating tasks with an affinity
mask that spans your partition then you're getting what you ask for.
In fact, I already explained this to you earlier, so why are you
suggesting horrible hacks again? This behaviour toggle you suggest is
absolutely unacceptable.
I even explained what the problem was and where to look for solutions.
https://lkml.kernel.org/r/[email protected]
> On 02-May-2024, at 1:39 PM, Ankit Jain <[email protected]> wrote:
>
> On Tue, Apr 30, 2024 at 11:23:07AM -0700, Yury Norov wrote:
>> On Tue, Apr 30, 2024 at 02:34:31PM +0530, Ankit Jain wrote:
>>> commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
>>> and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
>>> introduced the logic to distribute the tasks within cpumask upon initial
>>> wakeup.
>>
>> So let's add the authors in CC list?
>>
>>> For Telco RAN deployments, isolcpus are a necessity to cater to
>>> the requirement of low latency applications. These isolcpus are generally
>>> tickless so that high priority SCHED_FIFO tasks can execute without any
>>> OS jitter. Since load balancing is disabled on isocpus, any task
>>> which gets placed on these CPUs can not be migrated on its own.
>>> For RT applications to execute on isolcpus, a guaranteed kubernetes pod
>>> with all isolcpus becomes the requirement and these RT applications are
>>> affine to execute on a specific isolcpu within the kubernetes pod.
>>> However, there may be some non-RT tasks which could also schedule in the
>>> same kubernetes pod without being affine to any specific CPU(inherits the
>>> pod cpuset affinity).
>>
>> OK... It looks like adding scheduler maintainers is also a necessity to
>> cater here...
>
> So 14e292f8d453 is very specifically only using sched_domain_span(), and
> if you're using partitioned CPUs they should not show up there.
>
> As to 46a87b3851f0, if you're explicitly creating tasks with an affinity
> masks that spans your partition then you're getting what you ask for.
>
> In fact, I already explained this to you earlier, so why are you
> suggesting horrible hacks again? This behaviour toggle you suggest is
> absolutely unacceptable.
>
> I even explained what the problem was and where to look for solutions.
>
> https://lkml.kernel.org/r/[email protected]
Hi Peter,
Apologies for not trying out the solution you suggested earlier. I will test this
solution for our use case and share the result.
Thanks,
Ankit
Hi Peter,
On Thu, May 02, 2024 at 10:43:49AM +0200 Peter Zijlstra wrote:
> On Tue, Apr 30, 2024 at 11:23:07AM -0700, Yury Norov wrote:
> > On Tue, Apr 30, 2024 at 02:34:31PM +0530, Ankit Jain wrote:
> > > commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> > > and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
> > > introduced the logic to distribute the tasks within cpumask upon initial
> > > wakeup.
> >
> > So let's add the authors in CC list?
> >
> > > For Telco RAN deployments, isolcpus are a necessity to cater to
> > > the requirement of low latency applications. These isolcpus are generally
> > > tickless so that high priority SCHED_FIFO tasks can execute without any
> > > OS jitter. Since load balancing is disabled on isocpus, any task
> > > which gets placed on these CPUs can not be migrated on its own.
> > > For RT applications to execute on isolcpus, a guaranteed kubernetes pod
> > > with all isolcpus becomes the requirement and these RT applications are
> > > affine to execute on a specific isolcpu within the kubernetes pod.
> > > However, there may be some non-RT tasks which could also schedule in the
> > > same kubernetes pod without being affine to any specific CPU(inherits the
> > > pod cpuset affinity).
> >
> > OK... It looks like adding scheduler maintainers is also a necessity to
> > cater here...
>
> So 14e292f8d453 is very specifically only using sched_domain_span(), and
> if you're using partitioned CPUs they should not show up there.
>
> As to 46a87b3851f0, if you're explicitly creating tasks with an affinity
> masks that spans your partition then you're getting what you ask for.
I think you are skipping some details. We've also asked for no load
balancing and this spreading is a form of load balancing. So that's
not getting what was asked for.
And the tasks being created with this affinity are not being explicitly
created that way; it's implicit. Anything exec'd into the container
(== a cgroup with a cpuset set) gets the cpu affinity spanning the cpus in
the container. There are layers. It's not someone using "taskset bash"
at a command prompt.
>
> In fact, I already explained this to you earlier, so why are you
> suggesting horrible hacks again? This behaviour toggle you suggest is
> absolutely unacceptable.
>
> I even explained what the problem was and where to look for solutions.
>
> https://lkml.kernel.org/r/[email protected]
>
I was not aware of the history here. Thanks. I'll look into that. At
first blush it's not obvious how that helps but I've been wrong before...
Cheers,
Phil
--