From the sched domain's perspective, the DIE layer should be larger
than or at least equal to the MC layer. In some cases MC is defined by
arch-specific hardware (MPIDR for example), but NUMA can be defined by
the user, for instance with the following system configuration:
*************************************
NUMA: 0-2, 3-7
core_siblings: 0-3, 4-7
*************************************
Per the current code, for core 3 the MC cpu map falls back to 3-7 (its
core_siblings is 0-3 while its NUMA node map is 3-7).
For the sched MC level, when the sched groups are built:
step1. core3's sched groups chain is built like this: 3->4->5->6->7->3
step2. core4's sched groups chain is built like this: 4->5->6->7->4
so after step2, core3's sched groups for the MC level overlap and, more
importantly, any do { } while (sg != sd->groups) walk of the group list
falls into a dead loop.
Obviously the NUMA node with cpus 3-7 conflicts with the MC level cpu
map, but unfortunately there is currently no way to even detect such
cases.
This patch prints a warning message to help with the above cases.
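For illustration only (a standalone userspace sketch, not part of the
kernel change; the array names and mask values are made up to encode the
example configuration above), the fallback can be modelled with plain
bitmasks:

#include <stdint.h>
#include <stdio.h>

/* Toy model of cpu_coregroup_mask() for the 8-cpu example above:
 * NUMA nodes {0-2, 3-7}, core_siblings {0-3, 4-7}.
 */
static const uint8_t node_mask[8] = {
        0x07, 0x07, 0x07,                       /* cpus 0-2 -> node 0-2 */
        0xf8, 0xf8, 0xf8, 0xf8, 0xf8            /* cpus 3-7 -> node 3-7 */
};
static const uint8_t core_sibling[8] = {
        0x0f, 0x0f, 0x0f, 0x0f,                 /* cpus 0-3 -> cluster 0-3 */
        0xf0, 0xf0, 0xf0, 0xf0                  /* cpus 4-7 -> cluster 4-7 */
};

static uint8_t coregroup_mask(int cpu)
{
        uint8_t mask = node_mask[cpu];

        /* Same idea as the cpumask_subset() check: keep core_sibling only
         * if it fits inside the NUMA node, else fall back to the node mask.
         */
        if ((core_sibling[cpu] & ~mask) == 0)
                mask = core_sibling[cpu];
        return mask;
}

int main(void)
{
        /* cpu3: core_sibling 0-3 is not a subset of node 3-7, so its MC
         * span falls back to 3-7 (0xf8); cpu4's MC span is 4-7 (0xf0).
         */
        printf("cpu3 MC span: 0x%02x\n", coregroup_mask(3));
        printf("cpu4 MC span: 0x%02x\n", coregroup_mask(4));
        return 0;
}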
Signed-off-by: Zeng Tao <[email protected]>
---
drivers/base/arch_topology.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 1eb81f11..5fe44b3 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -439,10 +439,18 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
/* not numa in package, lets use the package siblings */
core_mask = &cpu_topology[cpu].core_sibling;
- }
+ } else
+ pr_warn_once("Warning: suspicous broken topology: cpu:[%d]'s core_sibling:[%*pbl] not a subset of numa node:[%*pbl]\n",
+ cpu, cpumask_pr_args(&cpu_topology[cpu].core_sibling),
+ cpumask_pr_args(core_mask));
+
if (cpu_topology[cpu].llc_id != -1) {
if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
core_mask = &cpu_topology[cpu].llc_sibling;
+ else
+ pr_warn_once("Warning: suspicous broken topology: cpu:[%d]'s llc_sibling:[%*pbl] not a subset of numa node:[%*pbl]\n",
+ cpu, cpumask_pr_args(&cpu_topology[cpu].llc_sibling),
+ cpumask_pr_args(core_mask));
}
return core_mask;
--
2.8.1
On Mon, Dec 23, 2019 at 04:16:19PM +0800, z00214469 wrote:
> As we know, from sched domain's perspective, the DIE layer should be
> larger than or at least equal to the MC layer, and in some cases, MC
> is defined by the arch specified hardware, MPIDR for example, but NUMA
> can be defined by users,
Who are the users you are referring to above?
> with the following system configrations:
Do you mean ACPI tables or DT or some firmware tables ?
> *************************************
> NUMA: 0-2, 3-7
Is the above simply wrong with respect to the hardware, and does it
actually match core_siblings?
> core_siblings: 0-3, 4-7
> *************************************
> Per the current code, for core 3, its MC cpu map fallbacks to 3~7(its
> core_sibings is 0~3 while its numa node map is 3~7).
>
> For the sched MC, when we are build sched groups:
> step1. core3 's sched groups chain is built like this: 3->4->5->6->7->3
> step2. core4's sched groups chain is built like this: 4->5->6->7->4
> so after step2, core3's sched groups for MC level is overlapped, more
> importantly, it will fall to dead loop if while(sg != sg->groups)
>
> Obviously, the NUMA node with cpu 3-7 conflict with the MC level cpu
> map, but unfortunately, there is no way even detect such cases.
>
Again, are cpus 3-7 actually in a NUMA node, or is it 4-7?
> In this patch, prompt a warning message to help with the above cases.
>
> Signed-off-by: Zeng Tao <[email protected]>
> ---
> drivers/base/arch_topology.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 1eb81f11..5fe44b3 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -439,10 +439,18 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
> if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
> /* not numa in package, lets use the package siblings */
> core_mask = &cpu_topology[cpu].core_sibling;
> - }
> + } else
> + pr_warn_once("Warning: suspicous broken topology: cpu:[%d]'s core_sibling:[%*pbl] not a subset of numa node:[%*pbl]\n",
> + cpu, cpumask_pr_args(&cpu_topology[cpu].core_sibling),
> + cpumask_pr_args(core_mask));
> +
Won't this print a warning on all systems that don't have NUMA within a
package? What are you trying to achieve here?
> if (cpu_topology[cpu].llc_id != -1) {
> if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
> core_mask = &cpu_topology[cpu].llc_sibling;
> + else
> + pr_warn_once("Warning: suspicous broken topology: cpu:[%d]'s llc_sibling:[%*pbl] not a subset of numa node:[%*pbl]\n",
> + cpu, cpumask_pr_args(&cpu_topology[cpu].llc_sibling),
> + cpumask_pr_args(core_mask));
> }
>
This will trigger warning on all systems that lack cacheinfo topology.
I don't understand the intent of this patch at all. Can you explain
all the steps you follow and the issue you face ?
--
Regards,
Sudeep
Hi Sudeep:
Thanks for your reply.
> -----Original Message-----
> From: Sudeep Holla [mailto:[email protected]]
> Sent: Wednesday, January 01, 2020 12:41 AM
> To: Zengtao (B)
> Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> [email protected]; Sudeep Holla; Morten Rasmussen
> Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations conflicts
> with lower layer
>
> On Mon, Dec 23, 2019 at 04:16:19PM +0800, z00214469 wrote:
> > As we know, from sched domain's perspective, the DIE layer should be
> > larger than or at least equal to the MC layer, and in some cases, MC
> > is defined by the arch specified hardware, MPIDR for example, but
> NUMA
> > can be defined by users,
>
> Who are the users you are referring above ?
For example, when I use QEMU to start a guest Linux, I can define
whatever NUMA topology I want for the guest.
> > with the following system configrations:
>
> Do you mean ACPI tables or DT or some firmware tables ?
>
> > *************************************
> > NUMA: 0-2, 3-7
>
> Is the above simply wrong with respect to hardware and it actually match
> core_siblings ?
>
Actually, we can't simply say this is wrong; I just want to show an example.
This example could also be:
NUMA: 0-23, 24-47
core_siblings: 0-15, 16-31, 32-47
> > core_siblings: 0-3, 4-7
> > *************************************
> > Per the current code, for core 3, its MC cpu map fallbacks to 3~7(its
> > core_sibings is 0~3 while its numa node map is 3~7).
> >
> > For the sched MC, when we are build sched groups:
> > step1. core3 's sched groups chain is built like this: 3->4->5->6->7->3
> > step2. core4's sched groups chain is built like this: 4->5->6->7->4
> > so after step2, core3's sched groups for MC level is overlapped, more
> > importantly, it will fall to dead loop if while(sg != sg->groups)
> >
> > Obviously, the NUMA node with cpu 3-7 conflict with the MC level cpu
> > map, but unfortunately, there is no way even detect such cases.
> >
>
> Again, is cpu 3-7 actually in a NUMA node or is it 4-7 ?
>
> > In this patch, prompt a warning message to help with the above cases.
> >
> > Signed-off-by: Zeng Tao <[email protected]>
> > ---
> > drivers/base/arch_topology.c | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > index 1eb81f11..5fe44b3 100644
> > --- a/drivers/base/arch_topology.c
> > +++ b/drivers/base/arch_topology.c
> > @@ -439,10 +439,18 @@ const struct cpumask
> *cpu_coregroup_mask(int cpu)
> > if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
> > /* not numa in package, lets use the package siblings */
> > core_mask = &cpu_topology[cpu].core_sibling;
> > - }
> > + } else
> > + pr_warn_once("Warning: suspicous broken topology: cpu:[%d]'s
> core_sibling:[%*pbl] not a subset of numa node:[%*pbl]\n",
> > + cpu, cpumask_pr_args(&cpu_topology[cpu].core_sibling),
> > + cpumask_pr_args(core_mask));
> > +
>
> Won't this print warning on all systems that don't have numa within a
> package ? What are you trying to achieve here ?
Since in my case, when this corner case happens, the Linux kernel just falls
into a dead loop with no prompt, a helpful message here would help a lot.
>
> > if (cpu_topology[cpu].llc_id != -1) {
> > if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
> > core_mask = &cpu_topology[cpu].llc_sibling;
> > + else
> > + pr_warn_once("Warning: suspicous broken topology:
> cpu:[%d]'s llc_sibling:[%*pbl] not a subset of numa node:[%*pbl]\n",
> > + cpu,
> cpumask_pr_args(&cpu_topology[cpu].llc_sibling),
> > + cpumask_pr_args(core_mask));
> > }
> >
>
> This will trigger warning on all systems that lack cacheinfo topology.
> I don't understand the intent of this patch at all. Can you explain
> all the steps you follow and the issue you face ?
Can you show me an example? What I really want to warn about is the case
where the NUMA topology conflicts with a lower level.
>
> --
> Regards,
> Sudeep
On Thu, Jan 02, 2020 at 03:05:40AM +0000, Zengtao (B) wrote:
> Hi Sudeep:
>
> Thanks for your reply.
>
> > -----Original Message-----
> > From: Sudeep Holla [mailto:[email protected]]
> > Sent: Wednesday, January 01, 2020 12:41 AM
> > To: Zengtao (B)
> > Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> > [email protected]; Sudeep Holla; Morten Rasmussen
> > Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations conflicts
> > with lower layer
> >
> > On Mon, Dec 23, 2019 at 04:16:19PM +0800, z00214469 wrote:
> > > As we know, from sched domain's perspective, the DIE layer should be
> > > larger than or at least equal to the MC layer, and in some cases, MC
> > > is defined by the arch specified hardware, MPIDR for example, but
> > NUMA
> > > can be defined by users,
> >
> > Who are the users you are referring above ?
> For example, when I use QEMU to start a guest linux, I can define the
> NUMA topology of the guest linux whatever i want.
OK, and how is the information passed to the kernel? DT or ACPI?
We need to fix the mismatch, if any, during the initial parsing of that
information.
> > > with the following system configrations:
> >
> > Do you mean ACPI tables or DT or some firmware tables ?
> >
> > > *************************************
> > > NUMA: 0-2, 3-7
> >
> > Is the above simply wrong with respect to hardware and it actually match
> > core_siblings ?
> >
> Actually, we can't simply say this is wrong, i just want to show an example.
> And this example also can be:
> NUMA: 0-23, 24-47
> core_siblings: 0-15, 16-31, 32-47
>
Are you sure about the above? Possible values w.r.t. the hardware config:
core_siblings: 0-15, 16-23, 24-31, 32-47
But the core_siblings you have specified above still look wrong IMO.
[...]
> > > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > > index 1eb81f11..5fe44b3 100644
> > > --- a/drivers/base/arch_topology.c
> > > +++ b/drivers/base/arch_topology.c
> > > @@ -439,10 +439,18 @@ const struct cpumask
> > *cpu_coregroup_mask(int cpu)
> > > if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
> > > /* not numa in package, lets use the package siblings */
> > > core_mask = &cpu_topology[cpu].core_sibling;
> > > - }
> > > + } else
> > > + pr_warn_once("Warning: suspicous broken topology: cpu:[%d]'s
> > core_sibling:[%*pbl] not a subset of numa node:[%*pbl]\n",
> > > + cpu, cpumask_pr_args(&cpu_topology[cpu].core_sibling),
> > > + cpumask_pr_args(core_mask));
> > > +
> >
> > Won't this print warning on all systems that don't have numa within a
> > package ? What are you trying to achieve here ?
>
> Since in my case, when this corner case happens, the linux kernel just fall into
> dead loop with no prompt, here this is a helping message will help a lot.
>
As I said, wrong configurations need to be detected when generating
DT/ACPI if possible. The above will print a warning on systems with NUMA
within a package.
NUMA: 0-7, 8-15
core_siblings: 0-15
The above is an example where the die has 16 CPUs and 2 NUMA nodes
within a package; your change throws an error for this config, which is
wrong.
> >
> > > if (cpu_topology[cpu].llc_id != -1) {
> > > if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
> > > core_mask = &cpu_topology[cpu].llc_sibling;
> > > + else
> > > + pr_warn_once("Warning: suspicous broken topology:
> > cpu:[%d]'s llc_sibling:[%*pbl] not a subset of numa node:[%*pbl]\n",
> > > + cpu,
> > cpumask_pr_args(&cpu_topology[cpu].llc_sibling),
> > > + cpumask_pr_args(core_mask));
> > > }
> > >
> >
> > This will trigger warning on all systems that lack cacheinfo topology.
> > I don't understand the intent of this patch at all. Can you explain
> > all the steps you follow and the issue you face ?
>
> Can you show me an example, what I really want to warn is the case that
> NUMA topology conflicts with lower level.
>
I was wrong here, I misread this section. I still fail to understand
why the above change is needed. I understood the QEMU example, but you
haven't specified what the cacheinfo looks like there.
--
Regards,
Sudeep
> -----Original Message-----
> From: Sudeep Holla [mailto:[email protected]]
> Sent: Thursday, January 02, 2020 7:30 PM
> To: Zengtao (B)
> Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> [email protected]; Morten Rasmussen
> Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations conflicts
> with lower layer
>
> On Thu, Jan 02, 2020 at 03:05:40AM +0000, Zengtao (B) wrote:
> > Hi Sudeep:
> >
> > Thanks for your reply.
> >
> > > -----Original Message-----
> > > From: Sudeep Holla [mailto:[email protected]]
> > > Sent: Wednesday, January 01, 2020 12:41 AM
> > > To: Zengtao (B)
> > > Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> > > [email protected]; Sudeep Holla; Morten Rasmussen
> > > Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations
> conflicts
> > > with lower layer
> > >
> > > On Mon, Dec 23, 2019 at 04:16:19PM +0800, z00214469 wrote:
> > > > As we know, from sched domain's perspective, the DIE layer should
> be
> > > > larger than or at least equal to the MC layer, and in some cases, MC
> > > > is defined by the arch specified hardware, MPIDR for example, but
> > > NUMA
> > > > can be defined by users,
> > >
> > > Who are the users you are referring above ?
> > For example, when I use QEMU to start a guest linux, I can define the
> > NUMA topology of the guest linux whatever i want.
>
> OK and how is the information passed to the kernel ? DT or ACPI ?
> We need to fix the miss match if any during the initial parse of those
> information.
>
Both. For the current QEMU, we don't have the correct cpu topology
passed to Linux. Luckily drjones planned to deal with the issue:
https://patchwork.ozlabs.org/cover/939301/
> > > > with the following system configrations:
> > >
> > > Do you mean ACPI tables or DT or some firmware tables ?
> > >
> > > > *************************************
> > > > NUMA: 0-2, 3-7
> > >
> > > Is the above simply wrong with respect to hardware and it actually
> match
> > > core_siblings ?
> > >
> > Actually, we can't simply say this is wrong, i just want to show an
> example.
> > And this example also can be:
> > NUMA: 0-23, 24-47
> > core_siblings: 0-15, 16-31, 32-47
> >
>
> Are you sure of the above ? Possible values w.r.t hardware config:
> core_siblings: 0-15, 16-23, 24-31, 32-47
>
> But what you have specified above is still wrong core_siblings IMO.
>
It depends on the hardware; on my platform there are 16 cores per cluster.
>
> [...]
>
> > > > diff --git a/drivers/base/arch_topology.c
> b/drivers/base/arch_topology.c
> > > > index 1eb81f11..5fe44b3 100644
> > > > --- a/drivers/base/arch_topology.c
> > > > +++ b/drivers/base/arch_topology.c
> > > > @@ -439,10 +439,18 @@ const struct cpumask
> > > *cpu_coregroup_mask(int cpu)
> > > > if (cpumask_subset(&cpu_topology[cpu].core_sibling,
> core_mask)) {
> > > > /* not numa in package, lets use the package siblings */
> > > > core_mask = &cpu_topology[cpu].core_sibling;
> > > > - }
> > > > + } else
> > > > + pr_warn_once("Warning: suspicous broken topology:
> cpu:[%d]'s
> > > core_sibling:[%*pbl] not a subset of numa node:[%*pbl]\n",
> > > > + cpu,
> cpumask_pr_args(&cpu_topology[cpu].core_sibling),
> > > > + cpumask_pr_args(core_mask));
> > > > +
> > >
> > > Won't this print warning on all systems that don't have numa within a
> > > package ? What are you trying to achieve here ?
> >
> > Since in my case, when this corner case happens, the linux kernel just fall
> into
> > dead loop with no prompt, here this is a helping message will help a lot.
> >
>
> As I said, wrong configurations need to be detected when generating
> DT/ACPI if possible. The above will print warning on systems with NUMA
> within package.
>
> NUMA: 0-7, 8-15
> core_siblings: 0-15
>
> The above is the example where the die has 16 CPUs and 2 NUMA nodes
> within a package, your change throws error to the above config which is
> wrong.
>
From your example, cores 7 and 8 have got different LLCs but the same low
level cache?
From the scheduler's point of view, a lower level sched domain should be a
subset of a higher level sched domain.
> > >
> > > > if (cpu_topology[cpu].llc_id != -1) {
> > > > if (cpumask_subset(&cpu_topology[cpu].llc_sibling,
> core_mask))
> > > > core_mask = &cpu_topology[cpu].llc_sibling;
> > > > + else
> > > > + pr_warn_once("Warning: suspicous broken topology:
> > > cpu:[%d]'s llc_sibling:[%*pbl] not a subset of numa node:[%*pbl]\n",
> > > > + cpu,
> > > cpumask_pr_args(&cpu_topology[cpu].llc_sibling),
> > > > + cpumask_pr_args(core_mask));
> > > > }
> > > >
> > >
> > > This will trigger warning on all systems that lack cacheinfo topology.
> > > I don't understand the intent of this patch at all. Can you explain
> > > all the steps you follow and the issue you face ?
> >
> > Can you show me an example, what I really want to warn is the case that
> > NUMA topology conflicts with lower level.
> >
>
> I was wrong here, I mis-read this section. I still fail to understand
> why the above change is needed. I understood the QEMU example, but you
> haven't specified how cacheinfo looks like there.
>
My idea:
(1) If the dtb/acpi is not legal, the Linux kernel should check it or prompt to help debugging.
(2) From the scheduler's point of view, a low level sched domain should be a subset of a high
level sched domain.
I am very sure about (2); is there any example showing that (2) is wrong?
Thanks.
Zengtao
On 02/01/2020 12:47, Zengtao (B) wrote:
>>
>> As I said, wrong configurations need to be detected when generating
>> DT/ACPI if possible. The above will print warning on systems with NUMA
>> within package.
>>
>> NUMA: 0-7, 8-15
>> core_siblings: 0-15
>>
>> The above is the example where the die has 16 CPUs and 2 NUMA nodes
>> within a package, your change throws error to the above config which is
>> wrong.
>>
> From your example, the core 7 and core 8 has got different LLC but the same Low
> Level cache?
AFAIA what matters here is memory controllers, less so LLCs. Cores within
a single die could have private LLCs and separate memory controllers, or
shared LLC and separate memory controllers.
> From schedule view of point, lower level sched domain should be a subset of higher
> Level sched domain.
>
Right, and that is checked when you have sched_debug on the cmdline
(or write 1 to /sys/kernel/debug/sched_debug & regenerate the sched domains)
Now, I don't know how this plays out for the numa-in-package topologies like
the one suggested by Sudeep. x86 and AMD had to play some games to get
numa-in-package topologies working, see e.g.
cebf15eb09a2 ("x86, sched: Add new topology for multi-NUMA-node CPUs")
perhaps you need to "lie" here and ensure sub-NUMA sched domain levels are
a subset of the NUMA levels, i.e. lie for core_siblings. I might go and
play with this to see how it behaves.
On Thu, Jan 02, 2020 at 12:47:01PM +0000, Zengtao (B) wrote:
> > -----Original Message-----
> > From: Sudeep Holla [mailto:[email protected]]
> > Sent: Thursday, January 02, 2020 7:30 PM
> > To: Zengtao (B)
> > Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> > [email protected]; Morten Rasmussen
> > Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations conflicts
> > with lower layer
> >
> > On Thu, Jan 02, 2020 at 03:05:40AM +0000, Zengtao (B) wrote:
> > > Hi Sudeep:
> > >
> > > Thanks for your reply.
> > >
> > > > -----Original Message-----
> > > > From: Sudeep Holla [mailto:[email protected]]
> > > > Sent: Wednesday, January 01, 2020 12:41 AM
> > > > To: Zengtao (B)
> > > > Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> > > > [email protected]; Sudeep Holla; Morten Rasmussen
> > > > Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations
> > conflicts
> > > > with lower layer
> > > >
> > > > On Mon, Dec 23, 2019 at 04:16:19PM +0800, z00214469 wrote:
> > > > > As we know, from sched domain's perspective, the DIE layer should
> > be
> > > > > larger than or at least equal to the MC layer, and in some cases, MC
> > > > > is defined by the arch specified hardware, MPIDR for example, but
> > > > NUMA
> > > > > can be defined by users,
> > > >
> > > > Who are the users you are referring above ?
> > > For example, when I use QEMU to start a guest linux, I can define the
> > > NUMA topology of the guest linux whatever i want.
> >
> > OK and how is the information passed to the kernel ? DT or ACPI ?
> > We need to fix the miss match if any during the initial parse of those
> > information.
> >
>
> Both, For the current QEMU, we don't have the correct cpu topology
> passed to linux. Luckily drjones planed to deal with the issue.
> https://patchwork.ozlabs.org/cover/939301/
>
> > > > > with the following system configrations:
> > > >
> > > > Do you mean ACPI tables or DT or some firmware tables ?
> > > >
> > > > > *************************************
> > > > > NUMA: 0-2, 3-7
> > > >
> > > > Is the above simply wrong with respect to hardware and it actually
> > match
> > > > core_siblings ?
> > > >
> > > Actually, we can't simply say this is wrong, i just want to show an
> > example.
> > > And this example also can be:
> > > NUMA: 0-23, 24-47
> > > core_siblings: 0-15, 16-31, 32-47
> > >
> >
> > Are you sure of the above ? Possible values w.r.t hardware config:
> > core_siblings: 0-15, 16-23, 24-31, 32-47
> >
> > But what you have specified above is still wrong core_siblings IMO.
> >
> It depends on the hardware, on my platform, 16 cores per cluster.
>
Sorry, I made a mistake with my examples above, I was assuming 8 CPUs
per cluster but didn't represent it well. Anyway, my point was:
can a few CPUs in a cluster be part of one NUMA node while the remaining
CPUs of the same cluster are part of another NUMA node? That sounds like
an interesting and quite complex topology to me. What does the cache
topology look like in that case?
--
Regards,
Sudeep
On 02/01/2020 14:22, Valentin Schneider wrote:
> On 02/01/2020 12:47, Zengtao (B) wrote:
[...]
>> From schedule view of point, lower level sched domain should be a subset of higher
>> Level sched domain.
>>
>
> Right, and that is checked when you have sched_debug on the cmdline
> (or write 1 to /sys/kernel/debug/sched_debug & regenerate the sched domains)
You should even get informed in case CONFIG_SCHED_DEBUG is not set.
BUG: arch topology borken
With CONFIG_SCHED_DEBUG (and a CPU removed from the cpu_mask (DIE level)
on an Arm64 Juno board) you get extra information:
BUG: arch topology borken
the MC domain not a subset of the DIE domain
> Now, I don't know how this plays out for the numa-in-package topologies like
> the one suggested by Sudeep. x86 and AMD had to play some games to get
> numa-in-package topologies working, see e.g.
>
> cebf15eb09a2 ("x86, sched: Add new topology for multi-NUMA-node CPUs")
>
Yeah, the reason why we need this change would be interesting.
[...]
> -----Original Message-----
> From: Valentin Schneider [mailto:[email protected]]
> Sent: Thursday, January 02, 2020 9:22 PM
> To: Zengtao (B); Sudeep Holla
> Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> [email protected]; Morten Rasmussen
> Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations conflicts
> with lower layer
>
> On 02/01/2020 12:47, Zengtao (B) wrote:
> >>
> >> As I said, wrong configurations need to be detected when generating
> >> DT/ACPI if possible. The above will print warning on systems with
> NUMA
> >> within package.
> >>
> >> NUMA: 0-7, 8-15
> >> core_siblings: 0-15
> >>
> >> The above is the example where the die has 16 CPUs and 2 NUMA
> nodes
> >> within a package, your change throws error to the above config which is
> >> wrong.
> >>
> > From your example, the core 7 and core 8 has got different LLC but the
> same Low
> > Level cache?
>
> AFAIA what matters here is memory controllers, less so LLCs. Cores within
> a single die could have private LLCs and separate memory controllers, or
> shared LLC and separate memory controllers.
>
> > From schedule view of point, lower level sched domain should be a subset
> of higher
> > Level sched domain.
> >
>
> Right, and that is checked when you have sched_debug on the cmdline
> (or write 1 to /sys/kernel/debug/sched_debug & regenerate the sched
> domains)
>
No, here I think you don't get my issue; please try to understand my example
first:
*************************************
NUMA: 0-2, 3-7
core_siblings: 0-3, 4-7
*************************************
When we are building the sched domains, per the current code:
(1) For core 3
MC sched domain falls back to 3-7
DIE sched domain is 3-7
(2) For core 4:
MC sched domain is 4-7
DIE sched domain is 3-7
When we build the sched groups for the MC level:
(1). core3's sched groups chain is built like this: 3->4->5->6->7->3
(2). core4's sched groups chain is built like this: 4->5->6->7->4
so after (2),
core3's sched groups overlap, and the list is not a chain any more.
In later uses of core3's sched groups, a dead loop happens.
And it's difficult for the scheduler to find out such errors;
that is why I think a warning is necessary here.
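To make this concrete, here is a minimal standalone sketch (plain C with
made-up names; it only mimics the ring linking done in build_sched_groups()
for the example spans above, it is not the kernel code). Building core4's
4-7 ring after core3's 3-7 ring rewires the shared group objects, so core3's
ring no longer closes on itself and the final walk would never terminate
without the step limit:

#include <stdio.h>

#define NR_CPUS 8

/* One shared group object per CPU at the MC level, as with sdd->sg. */
struct group {
        int cpu;
        struct group *next;
};

static struct group groups[NR_CPUS];

/* MC span per CPU for the broken example:
 * cpus 0-2: span 0-2, cpu 3: falls back to 3-7, cpus 4-7: span 4-7.
 */
static const int span_lo[NR_CPUS] = { 0, 0, 0, 3, 4, 4, 4, 4 };
static const int span_hi[NR_CPUS] = { 2, 2, 2, 7, 7, 7, 7, 7 };

/* Roughly what build_sched_groups() does for one CPU: walk the CPUs in
 * its span and link their (shared) group objects into a ring.
 */
static void build_groups(int cpu)
{
        struct group *first = NULL, *last = NULL, *sg;
        int i;

        for (i = span_lo[cpu]; i <= span_hi[cpu]; i++) {
                sg = &groups[i];
                if (!first)
                        first = sg;
                if (last)
                        last->next = sg;
                last = sg;
        }
        last->next = first;
}

int main(void)
{
        struct group *sg, *first;
        int i, steps = 0;

        for (i = 0; i < NR_CPUS; i++)
                groups[i].cpu = i;

        build_groups(3);        /* ring: 3->4->5->6->7->3 */
        build_groups(4);        /* rewires 7->next to 4: core3's ring breaks */

        /* Walk core3's "ring"; it prints 3 4 5 6 7 4 5 6 7 ... and, without
         * the step limit, never gets back to 3 - the dead loop above.
         */
        first = &groups[3];
        sg = first;
        do {
                printf("%d ", sg->cpu);
                sg = sg->next;
        } while (sg != first && ++steps < 16);
        printf("\n");
        return 0;
}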
> Now, I don't know how this plays out for the numa-in-package topologies
> like
> the one suggested by Sudeep. x86 and AMD had to play some games to
> get
> numa-in-package topologies working, see e.g.
>
> cebf15eb09a2 ("x86, sched: Add new topology for multi-NUMA-node
> CPUs")
>
> perhaps you need to "lie" here and ensure sub-NUMA sched domain levels
> are
> a subset of the NUMA levels, i.e. lie for core_siblings. I might go and
> play with this to see how it behaves.
On 03/01/2020 04:24, Zengtao (B) wrote:
>>> From your example, the core 7 and core 8 has got different LLC but the
>> same Low
>>> Level cache?
>>
>> AFAIA what matters here is memory controllers, less so LLCs. Cores within
>> a single die could have private LLCs and separate memory controllers, or
>> shared LLC and separate memory controllers.
>>
>>> From schedule view of point, lower level sched domain should be a subset
>> of higher
>>> Level sched domain.
>>>
>>
>> Right, and that is checked when you have sched_debug on the cmdline
>> (or write 1 to /sys/kernel/debug/sched_debug & regenerate the sched
>> domains)
>>
>
> No, here I think you don't get my issue, please try to understand my example
> First:.
>
> *************************************
> NUMA: 0-2, 3-7
> core_siblings: 0-3, 4-7
> *************************************
> When we are building the sched domain, per the current code:
> (1) For core 3
> MC sched domain fallbacks to 3~7
> DIE sched domain is 3~7
> (2) For core 4:
> MC sched domain is 4~7
> DIE sched domain is 3~7
>
> When we are build sched groups for the MC level:
> (1). core3's sched groups chain is built like as: 3->4->5->6->7->3
> (2). core4's sched groups chain is built like as: 4->5->6->7->4
> so after (2),
> core3's sched groups is overlapped, and it's not a chain any more.
> In the afterwards usecase of core3's sched groups, deadloop happens.
>
> And it's difficult for the scheduler to find out such errors,
> that is why I think a warning is necessary here.
>
I was mostly commenting on Sudeep's example, sorry for the confusion.
The case you describe in the changelog (and again here in your reply)
is indeed slightly different, and quite more dramatic (we "corrupt" the
sched_group list).
I believe that could still be detected by the scheduler: the problem is
that we have non-equal yet non-disjoint non-NUMA sched domain spans; AFAIA
the only accepted configurations are either completely equal or completely
disjoint (otherwise we end up "corrupting" the sched group chain as above,
because we violate the assumptions used for building groups, see
topology.c::get_group()).
I'm juggling with other things atm, but let me have a think and see if we
couldn't detect that in the scheduler itself.
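As a rough sketch of that invariant (untested, userspace-style, using plain
bitmasks instead of the kernel's cpumask API; the helper name is
hypothetical), the per-CPU spans of a non-NUMA level could be checked
pairwise for being either identical or disjoint:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: at a non-NUMA level the per-CPU spans must be
 * pairwise either identical or disjoint, otherwise linking the shared
 * group objects (as in topology.c::get_group()) corrupts the ring.
 * span[cpu] is the bitmask of CPUs in that CPU's domain at this level.
 */
static bool spans_equal_or_disjoint(const uint64_t *span, int nr_cpus)
{
        int i, j;

        for (i = 0; i < nr_cpus; i++)
                for (j = i + 1; j < nr_cpus; j++)
                        if ((span[i] & span[j]) && span[i] != span[j])
                                return false;   /* overlapping, not equal */
        return true;
}

int main(void)
{
        /* MC spans for the NUMA 0-2/3-7 vs core_siblings 0-3/4-7 example:
         * cpus 0-2 -> 0x07, cpu 3 -> 0xf8 (fallback), cpus 4-7 -> 0xf0.
         */
        uint64_t mc_span[8] = { 0x07, 0x07, 0x07, 0xf8,
                                0xf0, 0xf0, 0xf0, 0xf0 };

        printf("MC spans sane: %s\n",
               spans_equal_or_disjoint(mc_span, 8) ? "yes" : "no");
        return 0;
}

With the example spans, 0xf8 (cpu3) and 0xf0 (cpus 4-7) intersect without
being equal, so the check reports the broken layout.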
>> Now, I don't know how this plays out for the numa-in-package topologies
>> like
>> the one suggested by Sudeep. x86 and AMD had to play some games to
>> get
>> numa-in-package topologies working, see e.g.
>>
>> cebf15eb09a2 ("x86, sched: Add new topology for multi-NUMA-node
>> CPUs")
>>
>> perhaps you need to "lie" here and ensure sub-NUMA sched domain levels
>> are
>> a subset of the NUMA levels, i.e. lie for core_siblings. I might go and
>> play with this to see how it behaves.
On Fri, Jan 03, 2020 at 04:24:04AM +0000, Zengtao (B) wrote:
> > -----Original Message-----
> > From: Valentin Schneider [mailto:[email protected]]
> > Sent: Thursday, January 02, 2020 9:22 PM
> > To: Zengtao (B); Sudeep Holla
> > Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> > [email protected]; Morten Rasmussen
> > Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations conflicts
> > with lower layer
> >
[...]
> >
> > Right, and that is checked when you have sched_debug on the cmdline
> > (or write 1 to /sys/kernel/debug/sched_debug & regenerate the sched
> > domains)
> >
>
> No, here I think you don't get my issue, please try to understand my example
> First:.
>
> *************************************
> NUMA: 0-2, 3-7
> core_siblings: 0-3, 4-7
> *************************************
> When we are building the sched domain, per the current code:
> (1) For core 3
> MC sched domain fallbacks to 3~7
> DIE sched domain is 3~7
> (2) For core 4:
> MC sched domain is 4~7
> DIE sched domain is 3~7
>
> When we are build sched groups for the MC level:
> (1). core3's sched groups chain is built like as: 3->4->5->6->7->3
> (2). core4's sched groups chain is built like as: 4->5->6->7->4
> so after (2),
> core3's sched groups is overlapped, and it's not a chain any more.
> In the afterwards usecase of core3's sched groups, deadloop happens.
>
> And it's difficult for the scheduler to find out such errors,
> that is why I think a warning is necessary here.
>
We can figure out a way to warn if it's absolutely necessary, but I
would like to understand the system topology here. You haven't answered
my query on cache topology. Please give a more detailed description of why
the NUMA configuration is like the above example, with specific hardware
design details. Is this just a case where the user can specify anything
they wish?
--
Regards,
Sudeep
On 03/01/2020 10:57, Valentin Schneider wrote:
> I'm juggling with other things atm, but let me have a think and see if we
> couldn't detect that in the scheduler itself.
>
Something like this ought to catch your case; might need to compare group
spans rather than pure group pointers.
---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6ec1e595b1d4..c4151e11afcd 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1120,6 +1120,13 @@ build_sched_groups(struct sched_domain *sd, int cpu)
sg = get_group(i, sdd);
+ /* sg's are inited as self-looping. If 'last' is not self
+ * looping, we set it in a previous visit. No further visit
+ * should change the link order, if we do then the topology
+ * description is terminally broken.
+ */
+ BUG_ON(last && last->next != last && last->next != sg);
+
cpumask_or(covered, covered, sched_group_span(sg));
if (!first)
On 03/01/2020 13:14, Valentin Schneider wrote:
> On 03/01/2020 10:57, Valentin Schneider wrote:
>> I'm juggling with other things atm, but let me have a think and see if we
>> couldn't detect that in the scheduler itself.
If this is a common problem, we should detect it in the scheduler rather than in
the arch code.
> Something like this ought to catch your case; might need to compare group
> spans rather than pure group pointers.
>
> ---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 6ec1e595b1d4..c4151e11afcd 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1120,6 +1120,13 @@ build_sched_groups(struct sched_domain *sd, int cpu)
>
> sg = get_group(i, sdd);
>
> + /* sg's are inited as self-looping. If 'last' is not self
> + * looping, we set it in a previous visit. No further visit
> + * should change the link order, if we do then the topology
> + * description is terminally broken.
> + */
> + BUG_ON(last && last->next != last && last->next != sg);
> +
> cpumask_or(covered, covered, sched_group_span(sg));
>
> if (!first)
>
Still don't see the actual problem case. The closest I came is:
qemu-system-aarch64 -kernel ... -append ' ... loglevel=8 sched_debug'
-smp cores=4,sockets=2 ... -numa node,cpus=0-2,nodeid=0
-numa node,cpus=3-7,nodeid=1
but this behaves sanely. Since DIE and NUMA have the same span, the former degenerates.
[ 0.654451] CPU0 attaching sched-domain(s):
[ 0.654483] domain-0: span=0-2 level=MC
[ 0.654635] groups: 0:{ span=0 cap=1008 }, 1:{ span=1 cap=1015 }, 2:{ span=2 cap=1014 }
[ 0.654787] domain-1: span=0-7 level=NUMA
[ 0.654805] groups: 0:{ span=0-2 cap=3037 }, 3:{ span=3-7 cap=5048 }
[ 0.655326] CPU1 attaching sched-domain(s):
[ 0.655339] domain-0: span=0-2 level=MC
[ 0.655356] groups: 1:{ span=1 cap=1015 }, 2:{ span=2 cap=1014 }, 0:{ span=0 cap=1008 }
[ 0.655391] domain-1: span=0-7 level=NUMA
[ 0.655407] groups: 0:{ span=0-2 cap=3037 }, 3:{ span=3-7 cap=5048 }
[ 0.655480] CPU2 attaching sched-domain(s):
[ 0.655492] domain-0: span=0-2 level=MC
[ 0.655507] groups: 2:{ span=2 cap=1014 }, 0:{ span=0 cap=1008 }, 1:{ span=1 cap=1015 }
[ 0.655541] domain-1: span=0-7 level=NUMA
[ 0.655556] groups: 0:{ span=0-2 cap=3037 }, 3:{ span=3-7 cap=5048 }
[ 0.655603] CPU3 attaching sched-domain(s):
[ 0.655614] domain-0: span=3-7 level=MC
[ 0.655628] groups: 3:{ span=3 cap=984 }, 4:{ span=4 cap=1015 }, 5:{ span=5 cap=1016 }, 6:{ span=6 cap=1016 }, 7:{ span=7 cap=1017 }
[ 0.655693] domain-1: span=0-7 level=NUMA
[ 0.655721] groups: 3:{ span=3-7 cap=5048 }, 0:{ span=0-2 cap=3037 }
[ 0.655769] CPU4 attaching sched-domain(s):
[ 0.655780] domain-0: span=3-7 level=MC
[ 0.655795] groups: 4:{ span=4 cap=1015 }, 5:{ span=5 cap=1016 }, 6:{ span=6 cap=1016 }, 7:{ span=7 cap=1017 }, 3:{ span=3 cap=984 }
[ 0.655841] domain-1: span=0-7 level=NUMA
[ 0.655855] groups: 3:{ span=3-7 cap=5048 }, 0:{ span=0-2 cap=3037 }
[ 0.655902] CPU5 attaching sched-domain(s):
[ 0.655916] domain-0: span=3-7 level=MC
[ 0.655930] groups: 5:{ span=5 cap=1016 }, 6:{ span=6 cap=1016 }, 7:{ span=7 cap=1017 }, 3:{ span=3 cap=984 }, 4:{ span=4 cap=1015 }
[ 0.656545] domain-1: span=0-7 level=NUMA
[ 0.656562] groups: 3:{ span=3-7 cap=5048 }, 0:{ span=0-2 cap=3037 }
[ 0.656775] CPU6 attaching sched-domain(s):
[ 0.656796] domain-0: span=3-7 level=MC
[ 0.656835] groups: 6:{ span=6 cap=1016 }, 7:{ span=7 cap=1017 }, 3:{ span=3 cap=984 }, 4:{ span=4 cap=1015 }, 5:{ span=5 cap=1016 }
[ 0.656881] domain-1: span=0-7 level=NUMA
[ 0.656911] groups: 3:{ span=3-7 cap=5048 }, 0:{ span=0-2 cap=3037 }
[ 0.657102] CPU7 attaching sched-domain(s):
[ 0.657113] domain-0: span=3-7 level=MC
[ 0.657128] groups: 7:{ span=7 cap=1017 }, 3:{ span=3 cap=984 }, 4:{ span=4 cap=1015 }, 5:{ span=5 cap=1016 }, 6:{ span=6 cap=1016 }
[ 0.657172] domain-1: span=0-7 level=NUMA
[ 0.657186] groups: 3:{ span=3-7 cap=5048 }, 0:{ span=0-2 cap=3037 }
[ 0.657241] root domain span: 0-7 (max cpu_capacity = 1024)
> -----Original Message-----
> From: Sudeep Holla [mailto:[email protected]]
> Sent: Friday, January 03, 2020 7:40 PM
> To: Zengtao (B)
> Cc: Valentin Schneider; Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> [email protected]; Morten Rasmussen; Sudeep Holla
> Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations conflicts
> with lower layer
>
> On Fri, Jan 03, 2020 at 04:24:04AM +0000, Zengtao (B) wrote:
> > > -----Original Message-----
> > > From: Valentin Schneider [mailto:[email protected]]
> > > Sent: Thursday, January 02, 2020 9:22 PM
> > > To: Zengtao (B); Sudeep Holla
> > > Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> > > [email protected]; Morten Rasmussen
> > > Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations
> conflicts
> > > with lower layer
> > >
>
> [...]
>
> > >
> > > Right, and that is checked when you have sched_debug on the cmdline
> > > (or write 1 to /sys/kernel/debug/sched_debug & regenerate the sched
> > > domains)
> > >
> >
> > No, here I think you don't get my issue, please try to understand my
> example
> > First:.
> >
> > *************************************
> > NUMA: 0-2, 3-7
> > core_siblings: 0-3, 4-7
> > *************************************
> > When we are building the sched domain, per the current code:
> > (1) For core 3
> > MC sched domain fallbacks to 3~7
> > DIE sched domain is 3~7
> > (2) For core 4:
> > MC sched domain is 4~7
> > DIE sched domain is 3~7
> >
> > When we are build sched groups for the MC level:
> > (1). core3's sched groups chain is built like as: 3->4->5->6->7->3
> > (2). core4's sched groups chain is built like as: 4->5->6->7->4
> > so after (2),
> > core3's sched groups is overlapped, and it's not a chain any more.
> > In the afterwards usecase of core3's sched groups, deadloop happens.
> >
> > And it's difficult for the scheduler to find out such errors,
> > that is why I think a warning is necessary here.
> >
>
> We can figure out a way to warn if it's absolutely necessary, but I
> would like to understand the system topology here. You haven't answered
> my query on cache topology. Please give more description on why the
> NUMA configuration is like the above example with specific hardware
> design details. Is this just a case where user can specify anything
> they wish ?
>
Sorry for the late response. In fact, it's a VM use case; you can simply
treat it as a test case. It's a corner case, but it will hang the kernel,
which is why I suggest a warning is needed.
I think we need a sanity check or at least a simple warning, either in the
scheduler or in the arch topology parsing.
Regards
Zengtao
> -----Original Message-----
> From: Dietmar Eggemann [mailto:[email protected]]
> Sent: Saturday, January 04, 2020 1:21 AM
> To: Valentin Schneider; Zengtao (B); Sudeep Holla
> Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> [email protected]; Morten Rasmussen
> Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations conflicts
> with lower layer
>
> On 03/01/2020 13:14, Valentin Schneider wrote:
> > On 03/01/2020 10:57, Valentin Schneider wrote:
> >> I'm juggling with other things atm, but let me have a think and see if we
> >> couldn't detect that in the scheduler itself.
>
> If this is a common problem, we should detect it in the scheduler rather
> than in
> the arch code.
>
> > Something like this ought to catch your case; might need to compare
> group
> > spans rather than pure group pointers.
> >
> > ---
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 6ec1e595b1d4..c4151e11afcd 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1120,6 +1120,13 @@ build_sched_groups(struct sched_domain
> *sd, int cpu)
> >
> > sg = get_group(i, sdd);
> >
> > + /* sg's are inited as self-looping. If 'last' is not self
> > + * looping, we set it in a previous visit. No further visit
> > + * should change the link order, if we do then the topology
> > + * description is terminally broken.
> > + */
> > + BUG_ON(last && last->next != last && last->next != sg);
> > +
> > cpumask_or(covered, covered, sched_group_span(sg));
> >
> > if (!first)
> >
>
> Still don't see the actual problem case. The closest I came is:
>
> qemu-system-aarch64 -kernel ... -append ' ... loglevel=8 sched_debug'
> -smp cores=4,sockets=2 ... -numa node,cpus=0-2,nodeid=0
> -numa node,cpus=3-7,nodeid=1
>
It's related to the HW topology: if your hw has got 2 clusters, 0-3 and 4-7,
you will see the issue with mainline qemu.
I think you can manually modify the MPIDR parsing to reproduce the
issue.
Linux will use the MPIDR to guess the MC topology since qemu currently
doesn't provide it.
Refer to: https://patchwork.ozlabs.org/cover/939301/
Refer to: https://patchwork.ozlabs.org/cover/939301/
> but this behaves sane. Since DIE and NUMA have the same span, the
> former degenerates.
>
> [ 0.654451] CPU0 attaching sched-domain(s):
> [ 0.654483] domain-0: span=0-2 level=MC
> [ 0.654635] groups: 0:{ span=0 cap=1008 }, 1:{ span=1 cap=1015 },
> 2:{ span=2 cap=1014 }
> [ 0.654787] domain-1: span=0-7 level=NUMA
> [ 0.654805] groups: 0:{ span=0-2 cap=3037 }, 3:{ span=3-7
> cap=5048 }
> [ 0.655326] CPU1 attaching sched-domain(s):
> [ 0.655339] domain-0: span=0-2 level=MC
> [ 0.655356] groups: 1:{ span=1 cap=1015 }, 2:{ span=2 cap=1014 },
> 0:{ span=0 cap=1008 }
> [ 0.655391] domain-1: span=0-7 level=NUMA
> [ 0.655407] groups: 0:{ span=0-2 cap=3037 }, 3:{ span=3-7
> cap=5048 }
> [ 0.655480] CPU2 attaching sched-domain(s):
> [ 0.655492] domain-0: span=0-2 level=MC
> [ 0.655507] groups: 2:{ span=2 cap=1014 }, 0:{ span=0 cap=1008 },
> 1:{ span=1 cap=1015 }
> [ 0.655541] domain-1: span=0-7 level=NUMA
> [ 0.655556] groups: 0:{ span=0-2 cap=3037 }, 3:{ span=3-7
> cap=5048 }
> [ 0.655603] CPU3 attaching sched-domain(s):
> [ 0.655614] domain-0: span=3-7 level=MC
> [ 0.655628] groups: 3:{ span=3 cap=984 }, 4:{ span=4 cap=1015 },
> 5:{ span=5 cap=1016 }, 6:{ span=6 cap=1016 }, 7:{ span=7 cap=1017 }
> [ 0.655693] domain-1: span=0-7 level=NUMA
> [ 0.655721] groups: 3:{ span=3-7 cap=5048 }, 0:{ span=0-2
> cap=3037 }
> [ 0.655769] CPU4 attaching sched-domain(s):
> [ 0.655780] domain-0: span=3-7 level=MC
> [ 0.655795] groups: 4:{ span=4 cap=1015 }, 5:{ span=5 cap=1016 },
> 6:{ span=6 cap=1016 }, 7:{ span=7 cap=1017 }, 3:{ span=3 cap=984 }
> [ 0.655841] domain-1: span=0-7 level=NUMA
> [ 0.655855] groups: 3:{ span=3-7 cap=5048 }, 0:{ span=0-2
> cap=3037 }
> [ 0.655902] CPU5 attaching sched-domain(s):
> [ 0.655916] domain-0: span=3-7 level=MC
> [ 0.655930] groups: 5:{ span=5 cap=1016 }, 6:{ span=6 cap=1016 },
> 7:{ span=7 cap=1017 }, 3:{ span=3 cap=984 }, 4:{ span=4 cap=1015 }
> [ 0.656545] domain-1: span=0-7 level=NUMA
> [ 0.656562] groups: 3:{ span=3-7 cap=5048 }, 0:{ span=0-2
> cap=3037 }
> [ 0.656775] CPU6 attaching sched-domain(s):
> [ 0.656796] domain-0: span=3-7 level=MC
> [ 0.656835] groups: 6:{ span=6 cap=1016 }, 7:{ span=7 cap=1017 },
> 3:{ span=3 cap=984 }, 4:{ span=4 cap=1015 }, 5:{ span=5 cap=1016 }
> [ 0.656881] domain-1: span=0-7 level=NUMA
> [ 0.656911] groups: 3:{ span=3-7 cap=5048 }, 0:{ span=0-2
> cap=3037 }
> [ 0.657102] CPU7 attaching sched-domain(s):
> [ 0.657113] domain-0: span=3-7 level=MC
> [ 0.657128] groups: 7:{ span=7 cap=1017 }, 3:{ span=3 cap=984 },
> 4:{ span=4 cap=1015 }, 5:{ span=5 cap=1016 }, 6:{ span=6 cap=1016 }
> [ 0.657172] domain-1: span=0-7 level=NUMA
> [ 0.657186] groups: 3:{ span=3-7 cap=5048 }, 0:{ span=0-2
> cap=3037 }
> [ 0.657241] root domain span: 0-7 (max cpu_capacity = 1024)
> -----Original Message-----
> From: Valentin Schneider [mailto:[email protected]]
> Sent: Friday, January 03, 2020 8:15 PM
> To: Zengtao (B); Sudeep Holla
> Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> [email protected]; Morten Rasmussen
> Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations conflicts
> with lower layer
>
> On 03/01/2020 10:57, Valentin Schneider wrote:
> > I'm juggling with other things atm, but let me have a think and see if we
> > couldn't detect that in the scheduler itself.
> >
>
> Something like this ought to catch your case; might need to compare group
> spans rather than pure group pointers.
>
Good suggestion, I will need to have a think and give it a try.
Thanks.
> ---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 6ec1e595b1d4..c4151e11afcd 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1120,6 +1120,13 @@ build_sched_groups(struct sched_domain *sd,
> int cpu)
>
> sg = get_group(i, sdd);
>
> + /* sg's are inited as self-looping. If 'last' is not self
> + * looping, we set it in a previous visit. No further visit
> + * should change the link order, if we do then the topology
> + * description is terminally broken.
> + */
> + BUG_ON(last && last->next != last && last->next != sg);
> +
> cpumask_or(covered, covered, sched_group_span(sg));
>
> if (!first)
On 06/01/2020 02:48, Zengtao (B) wrote:
[...]
>> -----Original Message-----
>> From: Dietmar Eggemann [mailto:[email protected]]
>> Sent: Saturday, January 04, 2020 1:21 AM
>> To: Valentin Schneider; Zengtao (B); Sudeep Holla
>> Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
>> [email protected]; Morten Rasmussen
>> Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations conflicts
>> with lower layer
>>
>> On 03/01/2020 13:14, Valentin Schneider wrote:
>>> On 03/01/2020 10:57, Valentin Schneider wrote:
>> Still don't see the actual problem case. The closest I came is:
>>
>> qemu-system-aarch64 -kernel ... -append ' ... loglevel=8 sched_debug'
>> -smp cores=4,sockets=2 ... -numa node,cpus=0-2,nodeid=0
>> -numa node,cpus=3-7,nodeid=1
>>
>
> It's related to the HW topology, if you hw have got 2 clusters 0~3, 4~7,
> with the mainline qemu, you will see the issue.
> I think you can manually modify the MPIDR parsing to reproduce the
> issue.
> Linux will use the MPIDR to guess the MC topology since currently qemu
> don't provide it.
> Refer to: https://patchwork.ozlabs.org/cover/939301/
That makes sense to me. Valentin and I already discussed this setup as a
possible system where this issue can happen.
I already suspected that virt machines only support a flat cpu topology.
Good to know. Although I was able to pass '... -smp cores=8 -dtb
foo.dtb ...' into mainline qemu to achieve a 2 cluster system (MC and
DIE sd level) with an extra cpu-map entry in the dts file:
cpu-map {
cluster0 {
core0 {
cpu = <&A53_0>;
};
...
};
cluster1 {
core0 {
cpu = <&A53_4>;
};
...
};
};
But I didn't succeed in combining this with the '... -numa
node,cpus=0-3,nodeid=0 -numa node,cpus=4-7,nodeid=1 ...' params to
create a system like yours.
Your issue is related to the 'numa mask check for scheduler MC
selection' functionality. It was introduced by commit 37c3ec2d810f and
re-introduced by commit e67ecf647020 later. I don't know why we need
this functionality?
How does your setup behave when you revert commit e67ecf647020? Or do
you want an explicit warning in case of NUMA boundaries not respecting
physical topology?
> -----Original Message-----
> From: Dietmar Eggemann [mailto:[email protected]]
> Sent: Monday, January 06, 2020 10:31 PM
> To: Zengtao (B); Valentin Schneider; Sudeep Holla
> Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> [email protected]; Morten Rasmussen
> Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations conflicts
> with lower layer
>
> On 06/01/2020 02:48, Zengtao (B) wrote:
>
> [...]
>
> >> -----Original Message-----
> >> From: Dietmar Eggemann [mailto:[email protected]]
> >> Sent: Saturday, January 04, 2020 1:21 AM
> >> To: Valentin Schneider; Zengtao (B); Sudeep Holla
> >> Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> >> [email protected]; Morten Rasmussen
> >> Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations
> conflicts
> >> with lower layer
> >>
> >> On 03/01/2020 13:14, Valentin Schneider wrote:
> >>> On 03/01/2020 10:57, Valentin Schneider wrote:
>
> >> Still don't see the actual problem case. The closest I came is:
> >>
> >> qemu-system-aarch64 -kernel ... -append ' ... loglevel=8 sched_debug'
> >> -smp cores=4,sockets=2 ... -numa node,cpus=0-2,nodeid=0
> >> -numa node,cpus=3-7,nodeid=1
> >>
> >
> > It's related to the HW topology, if you hw have got 2 clusters 0~3, 4~7,
> > with the mainline qemu, you will see the issue.
> > I think you can manually modify the MPIDR parsing to reproduce the
> > issue.
> > Linux will use the MPIDR to guess the MC topology since currently qemu
> > don't provide it.
> > Refer to: https://patchwork.ozlabs.org/cover/939301/
>
> That makes sense to me. Valentin and I already discussed this setup as a
> possible system where this issue can happen.
>
> I already suspected that virt machines only support flat cpu toplogy.
> Good to know. Although I was able to to pass '... -smp cores=8 -dtb
> foo.dtb ...' into mainline qemu to achieve a 2 cluster system (MC and
> DIE sd level) with an extra cpu-map entry in the dts file:
>
> cpu-map {
> cluster0 {
> core0 {
> cpu = <&A53_0>;
> };
> ...
> };
>
> cluster1 {
> core0 {
> cpu = <&A53_4>;
> };
> ...
> };
> };
>
> But I didn't succeed in combining this with the '... -numa
> node,cpus=0-3,nodeid=0 -numa node,cpus=4-7,nodeid=1 ...' params to
> create a system like yours.
I guess that you have used your own dtb, so maybe you need to specify the
numa_node_id in the device tree.
Maybe you can refer to:
Documentation/devicetree/bindings/numa.txt
>
> Your issue is related to the 'numa mask check for scheduler MC
> selection' functionality. It was introduced by commit 37c3ec2d810f and
> re-introduced by commit e67ecf647020 later. I don't know why we need
> this functionality?
>
> How does your setup behave when you revert commit e67ecf647020? Or
> do
> you want an explicit warning in case of NUMA boundaries not respecting
> physical topology?
I will need to have a look at commit e67ecf647020
Thanks
Regards
Zengtao
On Mon, Jan 06, 2020 at 01:37:59AM +0000, Zengtao (B) wrote:
> > -----Original Message-----
> > From: Sudeep Holla [mailto:[email protected]]
> > Sent: Friday, January 03, 2020 7:40 PM
> > To: Zengtao (B)
> > Cc: Valentin Schneider; Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> > [email protected]; Morten Rasmussen; Sudeep Holla
> > Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations conflicts
> > with lower layer
> >
> > On Fri, Jan 03, 2020 at 04:24:04AM +0000, Zengtao (B) wrote:
> > > > -----Original Message-----
> > > > From: Valentin Schneider [mailto:[email protected]]
> > > > Sent: Thursday, January 02, 2020 9:22 PM
> > > > To: Zengtao (B); Sudeep Holla
> > > > Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> > > > [email protected]; Morten Rasmussen
> > > > Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations
> > conflicts
> > > > with lower layer
> > > >
> >
> > [...]
> >
> > > >
> > > > Right, and that is checked when you have sched_debug on the cmdline
> > > > (or write 1 to /sys/kernel/debug/sched_debug & regenerate the sched
> > > > domains)
> > > >
> > >
> > > No, here I think you don't get my issue, please try to understand my
> > example
> > > First:.
> > >
> > > *************************************
> > > NUMA: 0-2, 3-7
> > > core_siblings: 0-3, 4-7
> > > *************************************
> > > When we are building the sched domain, per the current code:
> > > (1) For core 3
> > > MC sched domain fallbacks to 3~7
> > > DIE sched domain is 3~7
> > > (2) For core 4:
> > > MC sched domain is 4~7
> > > DIE sched domain is 3~7
> > >
> > > When we are build sched groups for the MC level:
> > > (1). core3's sched groups chain is built like as: 3->4->5->6->7->3
> > > (2). core4's sched groups chain is built like as: 4->5->6->7->4
> > > so after (2),
> > > core3's sched groups is overlapped, and it's not a chain any more.
> > > In the afterwards usecase of core3's sched groups, deadloop happens.
> > >
> > > And it's difficult for the scheduler to find out such errors,
> > > that is why I think a warning is necessary here.
> > >
> >
> > We can figure out a way to warn if it's absolutely necessary, but I
> > would like to understand the system topology here. You haven't answered
> > my query on cache topology. Please give more description on why the
> > NUMA configuration is like the above example with specific hardware
> > design details. Is this just a case where user can specify anything
> > they wish ?
> >
>
> Sorry for the late response, In fact, it's a VM usecase, you can simply
> understand it as a test case. It's a corner case, but it will hang the kernel,
> that is why I suggest a warning is needed.
>
> I think we need an sanity check or just simply warning, either in the scheduler
> or arch topology parsing.
IIUC, the problem is that virt can set up a broken topology in some
cases where MPIDR doesn't line up correctly with the defined NUMA nodes.
We could argue that it is a qemu/virt problem, but it would be nice if
we could at least detect it. The proposed patch isn't really the right
solution as it warns on some valid topologies as Sudeep already pointed
out.
It sounds more like we need a mask subset check in the sched_domain
building code, if there isn't already one?
Morten
On Thu, Jan 02, 2020 at 01:22:19PM +0000, Valentin Schneider wrote:
> On 02/01/2020 12:47, Zengtao (B) wrote:
> >>
> >> As I said, wrong configurations need to be detected when generating
> >> DT/ACPI if possible. The above will print warning on systems with NUMA
> >> within package.
> >>
> >> NUMA: 0-7, 8-15
> >> core_siblings: 0-15
> >>
> >> The above is the example where the die has 16 CPUs and 2 NUMA nodes
> >> within a package, your change throws error to the above config which is
> >> wrong.
> >>
> > From your example, the core 7 and core 8 has got different LLC but the same Low
> > Level cache?
>
> AFAIA what matters here is memory controllers, less so LLCs. Cores within
> a single die could have private LLCs and separate memory controllers, or
> shared LLC and separate memory controllers.
Don't confuse cache boundaries, packages and nodes :-)
core_siblings are the cpus in the same package and don't say anything
about cache boundaries. It is not a given that there is a sched_domain
that matches the core_sibling span.
The MC sched_domain is supposed to match the LLC span, which might
differ from core_siblings. So the above example should be valid for a
NUMA-in-package system with one package containing two nodes.
Morten
On Mon, Jan 06, 2020 at 03:31:24PM +0100, Dietmar Eggemann wrote:
> Your issue is related to the 'numa mask check for scheduler MC
> selection' functionality. It was introduced by commit 37c3ec2d810f and
> re-introduced by commit e67ecf647020 later. I don't know why we need
> this functionality?
That functionality is to ensure that we don't break the sched_domain
hierarchy for numa-in-cluster systems. We have to be sure that the MC
domain is always smaller or equal to the NUMA node span.
Morten
On 09/01/2020 12:05, Morten Rasmussen wrote:
> On Mon, Jan 06, 2020 at 03:31:24PM +0100, Dietmar Eggemann wrote:
>> Your issue is related to the 'numa mask check for scheduler MC
>> selection' functionality. It was introduced by commit 37c3ec2d810f and
>> re-introduced by commit e67ecf647020 later. I don't know why we need
>> this functionality?
>
> That functionality is to ensure that we don't break the sched_domain
> hierarchy for numa-in-cluster systems. We have to be sure that the MC
> domain is always smaller or equal to the NUMA node span.
Thanks! And we already have Arm64 systems today using 'numa-in-cluster',
as I learned yesterday.
> -----Original Message-----
> From: Morten Rasmussen [mailto:[email protected]]
> Sent: Thursday, January 09, 2020 6:43 PM
> To: Zengtao (B)
> Cc: Sudeep Holla; Valentin Schneider; Linuxarm; Greg Kroah-Hartman;
> Rafael J. Wysocki; [email protected]
> Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations conflicts
> with lower layer
>
> On Mon, Jan 06, 2020 at 01:37:59AM +0000, Zengtao (B) wrote:
> > > -----Original Message-----
> > > From: Sudeep Holla [mailto:[email protected]]
> > > Sent: Friday, January 03, 2020 7:40 PM
> > > To: Zengtao (B)
> > > Cc: Valentin Schneider; Linuxarm; Greg Kroah-Hartman; Rafael J.
> Wysocki;
> > > [email protected]; Morten Rasmussen; Sudeep Holla
> > > Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations
> conflicts
> > > with lower layer
> > >
> > > On Fri, Jan 03, 2020 at 04:24:04AM +0000, Zengtao (B) wrote:
> > > > > -----Original Message-----
> > > > > From: Valentin Schneider [mailto:[email protected]]
> > > > > Sent: Thursday, January 02, 2020 9:22 PM
> > > > > To: Zengtao (B); Sudeep Holla
> > > > > Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> > > > > [email protected]; Morten Rasmussen
> > > > > Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations
> > > conflicts
> > > > > with lower layer
> > > > >
> > >
> > > [...]
> > >
> > > > >
> > > > > Right, and that is checked when you have sched_debug on the
> cmdline
> > > > > (or write 1 to /sys/kernel/debug/sched_debug & regenerate the
> sched
> > > > > domains)
> > > > >
> > > >
> > > > No, I think you haven't understood my issue here; please look at my
> > > > example first:
> > > >
> > > > *************************************
> > > > NUMA: 0-2, 3-7
> > > > core_siblings: 0-3, 4-7
> > > > *************************************
> > > > When we are building the sched domain, per the current code:
> > > > (1) For core 3
> > > > MC sched domain falls back to 3~7
> > > > DIE sched domain is 3~7
> > > > (2) For core 4:
> > > > MC sched domain is 4~7
> > > > DIE sched domain is 3~7
> > > >
> > > > When we build the sched groups for the MC level:
> > > > (1). core3's sched group chain is built as: 3->4->5->6->7->3
> > > > (2). core4's sched group chain is built as: 4->5->6->7->4
> > > > so after (2),
> > > > core3's sched groups are overlapped, and they no longer form a chain.
> > > > When core3's sched groups are used afterwards, a dead loop happens.
> > > >
> > > > And it's difficult for the scheduler to detect such errors,
> > > > which is why I think a warning is necessary here.
> > > >
> > >
> > > We can figure out a way to warn if it's absolutely necessary, but I
> > > would like to understand the system topology here. You haven't answered
> > > my query on the cache topology. Please describe in more detail why the
> > > NUMA configuration looks like the above example, with specific hardware
> > > design details. Is this just a case where a user can specify anything
> > > they wish?
> > >
> >
> > Sorry for the late response. In fact, it's a VM use case; you can simply
> > think of it as a test case. It's a corner case, but it will hang the
> > kernel, which is why I suggest a warning is needed.
> >
> > I think we need a sanity check, or simply a warning, either in the
> > scheduler or in the arch topology parsing.
>
> IIUC, the problem is that virt can set up a broken topology in some
> cases where MPIDR doesn't line up correctly with the defined NUMA
> nodes.
>
> We could argue that it is a qemu/virt problem, but it would be nice if
> we could at least detect it. The proposed patch isn't really the right
> solution as it warns on some valid topologies as Sudeep already pointed
> out.
>
> It sounds more like we need a mask subset check in the sched_domain
> building code, if there isn't already one?
Currently no; it's a bit complex to do the check in the sched_domain building code.
I need to think about that.
Suggestions welcome.
Thanks
Zengtao
>
> Morten
On 09/01/2020 12:58, Zengtao (B) wrote:
>> IIUC, the problem is that virt can set up a broken topology in some
>> cases where MPIDR doesn't line up correctly with the defined NUMA
>> nodes.
>>
>> We could argue that it is a qemu/virt problem, but it would be nice if
>> we could at least detect it. The proposed patch isn't really the right
>> solution as it warns on some valid topologies as Sudeep already pointed
>> out.
>>
>> It sounds more like we need a mask subset check in the sched_domain
>> building code, if there isn't already one?
>
> Currently no; it's a bit complex to do the check in the sched_domain building code.
> I need to think about that.
> Suggestions welcome.
>
Doing a search on the sched_domain spans themselves should look something like
the completely untested:
---8<---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6ec1e595b1d4..96128d12ec23 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1879,6 +1879,43 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
return sd;
}
+/* Ensure topology masks are sane; non-NUMA spans shouldn't overlap */
+static int validate_topology_spans(const struct cpumask *cpu_map)
+{
+ struct sched_domain_topology_level *tl;
+ int i, j;
+
+ for_each_sd_topology(tl) {
+ /* NUMA levels are allowed to overlap */
+ if (tl->flags & SDTL_OVERLAP)
+ break;
+
+ /*
+ * Non-NUMA levels cannot partially overlap - they must be
+ * either equal or wholly disjoint. Otherwise we can end up
+ * breaking the sched_group lists - i.e. a later get_group()
+ * pass breaks the linking done for an earlier span.
+ */
+ for_each_cpu(i, cpu_map) {
+ for_each_cpu(j, cpu_map) {
+ if (i == j)
+ continue;
+ /*
+ * We should 'and' all those masks with 'cpu_map'
+ * to exactly match the topology we're about to
+ * build, but that can only remove CPUs, which
+ * only lessens our ability to detect overlaps
+ */
+ if (!cpumask_equal(tl->mask(i), tl->mask(j)) &&
+ cpumask_intersects(tl->mask(i), tl->mask(j)))
+ return -1;
+ }
+ }
+ }
+
+ return 0;
+}
+
/*
* Find the sched_domain_topology_level where all CPU capacities are visible
* for all CPUs.
@@ -1953,7 +1990,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
struct sched_domain_topology_level *tl_asym;
bool has_asym = false;
- if (WARN_ON(cpumask_empty(cpu_map)))
+ if (WARN_ON(cpumask_empty(cpu_map)) ||
+ WARN_ON(validate_topology_spans(cpu_map)))
goto error;
alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
--->8---
Alternatively the assertion on the sched_group linking I suggested earlier
in the thread should suffice, since this should trigger whenever we have
overlapping non-NUMA sched domains.
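A rough sketch of what such an assertion could look like, assuming it lived in
kernel/sched/topology.c next to the group-building code (illustration only,
not necessarily the exact assertion referred to above; the function name is
made up), would be to bound the walk of the sg->next ring and warn if it does
not close back on sd->groups:

/*
 * Rough sketch only: after the groups of a sched_domain have been built,
 * walk the circular sg->next list and warn if it does not return to
 * sd->groups within num_possible_cpus() steps. A later get_group() pass
 * relinking an earlier span is exactly what breaks this ring.
 */
static void assert_sched_group_ring(struct sched_domain *sd)
{
	struct sched_group *sg = sd->groups;
	unsigned int steps = 0;

	if (!sg)
		return;

	do {
		sg = sg->next;
	} while (sg != sd->groups && ++steps < num_possible_cpus());

	WARN_ON(sg != sd->groups);
}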
Since you have a setup where you can reproduce the issue, could you please give
either (ideally both!) a try? Thanks.
> Thanks
> Zengtao
>
>>
>> Morten
On 09/01/2020 10:52, Morten Rasmussen wrote:
>> AFAIA what matters here is memory controllers, less so LLCs. Cores within
>> a single die could have private LLCs and separate memory controllers, or
>> shared LLC and separate memory controllers.
>
> Don't confuse cache boundaries, packages and nodes :-)
>
> core_siblings are the cpus in the same package and don't say anything
> about cache boundaries. It is not a given that there is a sched_domain that
> matches the core_sibling span.
>
> The MC sched_domain is supposed to match the LLC span, which might
> differ from the core_sibling span. So the above example should be valid for a
> NUMA-in-package system with one package containing two nodes.
>
Right, the point I was trying to make is that node boundaries can be pretty
much anything, so nodes can span over LLCs, or LLCs can span over nodes,
which is why we need checks such as the one in arch_topology() that lets us
build up a usable domain hierarchy (which cares about LLCs, at least at some
level).
> Morten
>
> -----Original Message-----
> From: Valentin Schneider [mailto:[email protected]]
> Sent: Sunday, January 12, 2020 4:56 AM
> To: Zengtao (B); Morten Rasmussen
> Cc: Sudeep Holla; Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> [email protected]
> Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations conflicts
> with lower layer
>
> On 09/01/2020 12:58, Zengtao (B) wrote:
> >> IIUC, the problem is that virt can set up a broken topology in some
> >> cases where MPIDR doesn't line up correctly with the defined NUMA
> >> nodes.
> >>
> >> We could argue that it is a qemu/virt problem, but it would be nice if
> >> we could at least detect it. The proposed patch isn't really the right
> >> solution as it warns on some valid topologies as Sudeep already pointed
> >> out.
> >>
> >> It sounds more like we need a mask subset check in the sched_domain
> >> building code, if there isn't already one?
> >
> > Currently no; it's a bit complex to do the check in the sched_domain
> > building code. I need to think about that.
> > Suggestions welcome.
> >
>
> Doing a search on the sched_domain spans themselves should look
> something like
> the completely untested:
>
> ---8<---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 6ec1e595b1d4..96128d12ec23 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1879,6 +1879,43 @@ static struct sched_domain
> *build_sched_domain(struct sched_domain_topology_leve
> return sd;
> }
>
> +/* Ensure topology masks are sane; non-NUMA spans shouldn't overlap */
> +static int validate_topology_spans(const struct cpumask *cpu_map)
> +{
> + struct sched_domain_topology_level *tl;
> + int i, j;
> +
> + for_each_sd_topology(tl) {
> + /* NUMA levels are allowed to overlap */
> + if (tl->flags & SDTL_OVERLAP)
> + break;
> +
> + /*
> + * Non-NUMA levels cannot partially overlap - they must be
> + * either equal or wholly disjoint. Otherwise we can end up
> + * breaking the sched_group lists - i.e. a later get_group()
> + * pass breaks the linking done for an earlier span.
> + */
> + for_each_cpu(i, cpu_map) {
> + for_each_cpu(j, cpu_map) {
> + if (i == j)
> + continue;
> + /*
> + * We should 'and' all those masks with 'cpu_map'
> + * to exactly match the topology we're about to
> + * build, but that can only remove CPUs, which
> + * only lessens our ability to detect overlaps
> + */
> + if (!cpumask_equal(tl->mask(i), tl->mask(j)) &&
> + cpumask_intersects(tl->mask(i), tl->mask(j)))
> + return -1;
> + }
> + }
> + }
> +
> + return 0;
> +}
> +
> /*
> * Find the sched_domain_topology_level where all CPU capacities are
> visible
> * for all CPUs.
> @@ -1953,7 +1990,8 @@ build_sched_domains(const struct cpumask
> *cpu_map, struct sched_domain_attr *att
> struct sched_domain_topology_level *tl_asym;
> bool has_asym = false;
>
> - if (WARN_ON(cpumask_empty(cpu_map)))
> + if (WARN_ON(cpumask_empty(cpu_map)) ||
> + WARN_ON(validate_topology_spans(cpu_map)))
> goto error;
>
> alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
> --->8---
>
> Alternatively the assertion on the sched_group linking I suggested earlier
> in the thread should suffice, since this should trigger whenever we have
> overlapping non-NUMA sched domains.
>
> Since you have a setup where you can reproduce the issue, could you please
> give either (ideally both!) a try? Thanks.
I have tried both; the previous one doesn't work, but this one works
correctly, with the warning message printed out as expected.
This patch is based on the assumption that "non-NUMA spans shouldn't overlap";
I am not quite sure whether that is always true.
Anyway, could you help to post the new patch?
Thanks
Zengtao
>
> > Thanks
> > Zengtao
> >
> >>
> >> Morten
On 13/01/2020 06:51, Zengtao (B) wrote:
> I have tried both; the previous one doesn't work, but this one works
> correctly, with the warning message printed out as expected.
>
Thanks for trying it out.
> This patch is based on the assumption that "non-NUMA spans shouldn't overlap";
> I am not quite sure whether that is always true.
>
I think this is required for get_group() to work properly. Otherwise,
successive get_group() calls may override (and break) the sd->groups
linking as you initially reported.
In your example, for MC level we have
tl->mask(3) == 3-7
tl->mask(4) == 4-7
Which partially overlaps, causing the relinking of '7->3' to '7->4'. Valid
configurations would be
wholly disjoint:
tl->mask(3) == 0-3
tl->mask(4) == 4-7
equal:
tl->mask(3) == 3-7
tl->mask(4) == 3-7
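To make the failure mode concrete, here is a small self-contained simulation
(plain userspace C, not kernel code; the helper and variable names are made up
for illustration) of how get_group()-style ring linking for the partially
overlapping spans 3-7 and 4-7 breaks the first ring:

/*
 * Standalone illustration (not kernel code): mimic get_group()-style ring
 * linking for two partially overlapping spans. Span 3-7 is linked as
 * 3->4->5->6->7->3, then span 4-7 relinks 7->next to 4, so a walk that
 * starts at 3 never gets back to 3.
 */
#include <stdio.h>

#define NR_CPUS 8

static int next_cpu[NR_CPUS];

static void link_span(const int *span, int len)
{
	int i;

	for (i = 0; i < len; i++)
		next_cpu[span[i]] = span[(i + 1) % len];
}

int main(void)
{
	const int mc3[] = { 3, 4, 5, 6, 7 };	/* tl->mask(3) == 3-7 */
	const int mc4[] = { 4, 5, 6, 7 };	/* tl->mask(4) == 4-7 */
	int cpu = 3, steps = 0;

	link_span(mc3, 5);
	link_span(mc4, 4);	/* relinks 7->3 to 7->4 */

	do {
		cpu = next_cpu[cpu];
	} while (cpu != 3 && ++steps < 2 * NR_CPUS);	/* bounded, unlike the kernel walk */

	printf(cpu == 3 ? "ring closed\n" : "ring broken: never returns to 3\n");
	return 0;
}

With the overlapping spans it prints "ring broken: never returns to 3", which
corresponds to the unbounded dead loop reported earlier in the thread; with
wholly disjoint or equal spans the ring stays closed.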
> Anyway, Could you help to raise the new patch?
>
Ideally I'd like to be able to reproduce this locally first (TBH I'd like
to get my first suggestion to work since it's less intrusive). Could you
share how you were able to trigger this? Dietmar's been trying to reproduce
this with qemu but I don't think he's there just yet.
> Thanks
> Zengtao
>
>>
>>> Thanks
>>> Zengtao
>>>
>>>>
>>>> Morten
> -----Original Message-----
> From: Valentin Schneider [mailto:[email protected]]
> Sent: Monday, January 13, 2020 7:17 PM
> To: Zengtao (B); Morten Rasmussen
> Cc: Sudeep Holla; Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> [email protected]
> Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations
> conflicts with lower layer
>
> On 13/01/2020 06:51, Zengtao (B) wrote:
> > I have tried both; the previous one doesn't work, but this one works
> > correctly, with the warning message printed out as expected.
> >
>
> Thanks for trying it out.
>
> > This patch is based on the assumption that "non-NUMA spans shouldn't
> > overlap"; I am not quite sure whether that is always true.
> >
>
> I think this is required for get_group() to work properly. Otherwise,
> successive get_group() calls may override (and break) the sd->groups
> linking as you initially reported.
>
> In your example, for MC level we have
>
> tl->mask(3) == 3-7
> tl->mask(4) == 4-7
>
> Which partially overlaps, causing the relinking of '7->3' to '7->4'. Valid
> configurations would be
>
> wholly disjoint:
> tl->mask(3) == 0-3
> tl->mask(4) == 4-7
>
> equal:
> tl->mask(3) == 3-7
> tl->mask(4) == 3-7
>
> > Anyway, could you help to post the new patch?
> >
>
> Ideally I'd like to be able to reproduce this locally first (TBH I'd like
> to get my first suggestion to work since it's less intrusive). Could you
> share how you were able to trigger this? Dietmar's been trying to
> reproduce
> this with qemu but I don't think he's there just yet.
Do you have a hardware platform with clusters? What's the hardware
CPU topology?
Regards
Zengtao
On 13.01.20 13:08, Zengtao (B) wrote:
>> -----Original Message-----
>> From: Valentin Schneider [mailto:[email protected]]
>> Sent: Monday, January 13, 2020 7:17 PM
>> To: Zengtao (B); Morten Rasmussen
>> Cc: Sudeep Holla; Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
>> [email protected]
>> Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations
>> conflicts with lower layer
>>
>> On 13/01/2020 06:51, Zengtao (B) wrote:
>>> I have tried both; the previous one doesn't work, but this one works
>>> correctly, with the warning message printed out as expected.
>>>
>>
>> Thanks for trying it out.
>>
>>> This patch is based on the assumption that "non-NUMA spans shouldn't
>>> overlap"; I am not quite sure whether that is always true.
>>>
>>
>> I think this is required for get_group() to work properly. Otherwise,
>> successive get_group() calls may override (and break) the sd->groups
>> linking as you initially reported.
>>
>> In your example, for MC level we have
>>
>> tl->mask(3) == 3-7
>> tl->mask(4) == 4-7
>>
>> Which partially overlaps, causing the relinking of '7->3' to '7->4'. Valid
>> configurations would be
>>
>> wholly disjoint:
>> tl->mask(3) == 0-3
>> tl->mask(4) == 4-7
>>
>> equal:
>> tl->mask(3) == 3-7
>> tl->mask(4) == 3-7
>>
>>> Anyway, could you help to post the new patch?
>>>
>>
>> Ideally I'd like to be able to reproduce this locally first (TBH I'd like
>> to get my first suggestion to work since it's less intrusive). Could you
>> share how you were able to trigger this? Dietmar's been trying to
>> reproduce
>> this with qemu but I don't think he's there just yet.
>
> Do you have a hardware platform with clusters? What's the hardware
> CPU topology?
I can test this with:
sudo qemu-system-aarch64 -kernel ./Image -hda ./qemu-image-aarch64.img
-append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -smp
cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt -numa
node,cpus=0-2,nodeid=0 -numa node,cpus=3-7,nodeid=1
and
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 1eb81f113786..e941a402e5f1 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -465,6 +465,9 @@ void update_siblings_masks(unsigned int cpuid)
if (cpuid_topo->package_id != cpu_topo->package_id)
continue;
+ if ((cpu < 4 && cpuid > 3) || (cpu > 3 && cpuid < 4))
+ continue;
+
cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
on mainline qemu. I do need the hack in update_siblings_masks() since
the topology detection via -smp cores=X, sockets=Y doesn't work yet.
On Sun, Jan 12, 2020 at 01:22:02PM +0000, Valentin Schneider wrote:
> On 09/01/2020 10:52, Morten Rasmussen wrote:
> >> AFAIA what matters here is memory controllers, less so LLCs. Cores within
> >> a single die could have private LLCs and separate memory controllers, or
> >> shared LLC and separate memory controllers.
> >
> > Don't confuse cache boundaries, packages and nodes :-)
> >
> > core_siblings are the cpus in the same package and don't say anything
> > about cache boundaries. It is not a given that there is a sched_domain that
> > matches the core_sibling span.
> >
> > The MC sched_domain is supposed to match the LLC span, which might
> > differ from the core_sibling span. So the above example should be valid for a
> > NUMA-in-package system with one package containing two nodes.
> >
>
> Right, the point I was trying to make is that node boundaries can be pretty
> much anything, so nodes can span over LLCs, or LLCs can span over nodes,
> which is why we need checks such as the one in arch_topology() that lets us
> build up a usable domain hierarchy (which cares about LLCs, at least at some
> level).
Indeed. The topology masks can't always be used as is to define the
sched_domain hierarchy.
On 11.01.20 21:56, Valentin Schneider wrote:
> On 09/01/2020 12:58, Zengtao (B) wrote:
>>> IIUC, the problem is that virt can set up a broken topology in some
>>> cases where MPIDR doesn't line up correctly with the defined NUMA
>>> nodes.
>>>
>>> We could argue that it is a qemu/virt problem, but it would be nice if
>>> we could at least detect it. The proposed patch isn't really the right
>>> solution as it warns on some valid topologies as Sudeep already pointed
>>> out.
>>>
>>> It sounds more like we need a mask subset check in the sched_domain
>>> building code, if there isn't already one?
>>
>> Currently no; it's a bit complex to do the check in the sched_domain building code.
>> I need to think about that.
>> Suggestions welcome.
>>
>
> Doing a search on the sched_domain spans themselves should look something like
> the completely untested:
[...]
LGTM. This code detects the issue in cpu_coregroup_mask(), which is the
cpumask function of the sched domain MC level struct
sched_domain_topology_level of ARM64's (and other archs')
default_topology[].
I wonder how x86 copes with such a config error?
Maybe they do it inside their cpu_coregroup_mask()?
We could move validate_topology_spans() into the existing
for_each_cpu(i, cpu_map)
for_each_sd_topology(tl)
loop in build_sched_domains() saving some code?
---8<---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index e6ff114e53f2..5f2764433a3d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1880,37 +1880,34 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
}
/* Ensure topology masks are sane; non-NUMA spans shouldn't overlap */
-static int validate_topology_spans(const struct cpumask *cpu_map)
+static int validate_topology_spans(struct sched_domain_topology_level *tl,
+ const struct cpumask *cpu_map, int cpu)
{
- struct sched_domain_topology_level *tl;
- int i, j;
+ const struct cpumask* mask = tl->mask(cpu);
+ int i;
- for_each_sd_topology(tl) {
- /* NUMA levels are allowed to overlap */
- if (tl->flags & SDTL_OVERLAP)
- break;
+ /* NUMA levels are allowed to overlap */
+ if (tl->flags & SDTL_OVERLAP)
+ return 0;
+ /*
+ * Non-NUMA levels cannot partially overlap - they must be
+ * either equal or wholly disjoint. Otherwise we can end up
+ * breaking the sched_group lists - i.e. a later get_group()
+ * pass breaks the linking done for an earlier span.
+ */
+ for_each_cpu(i, cpu_map) {
+ if (i == cpu)
+ continue;
/*
- * Non-NUMA levels cannot partially overlap - they must be
- * either equal or wholly disjoint. Otherwise we can end up
- * breaking the sched_group lists - i.e. a later get_group()
- * pass breaks the linking done for an earlier span.
+ * We should 'and' all those masks with 'cpu_map'
+ * to exactly match the topology we're about to
+ * build, but that can only remove CPUs, which
+ * only lessens our ability to detect overlaps
*/
- for_each_cpu(i, cpu_map) {
- for_each_cpu(j, cpu_map) {
- if (i == j)
- continue;
- /*
- * We should 'and' all those masks with 'cpu_map'
- * to exactly match the topology we're about to
- * build, but that can only remove CPUs, which
- * only lessens our ability to detect overlaps
- */
- if (!cpumask_equal(tl->mask(i), tl->mask(j)) &&
- cpumask_intersects(tl->mask(i), tl->mask(j)))
- return -1;
- }
- }
+ if (!cpumask_equal(mask, tl->mask(i)) &&
+ cpumask_intersects(mask, tl->mask(i)))
+ return -1;
}
return 0;
@@ -1990,8 +1987,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
struct sched_domain_topology_level *tl_asym;
bool has_asym = false;
- if (WARN_ON(cpumask_empty(cpu_map)) ||
- WARN_ON(validate_topology_spans(cpu_map)))
+ if (WARN_ON(cpumask_empty(cpu_map)))
goto error;
alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
@@ -2013,6 +2009,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
has_asym = true;
}
+ if (WARN_ON(validate_topology_spans(tl, cpu_map, i)))
+ goto error;
+
sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i);
if (tl == sched_domain_topology)
On 13/01/2020 14:49, Dietmar Eggemann wrote:
> LGTM. This code detects the issue in cpu_coregroup_mask(), which is the
> cpumask function of the sched domain MC level struct
> sched_domain_topology_level of ARM64's (and other archs')
> default_topology[].
> I wonder how x86 copes with such a config error?
> Maybe they do it inside their cpu_coregroup_mask()?
>
>
> We could move validate_topology_spans() into the existing
>
> for_each_cpu(i, cpu_map)
> for_each_sd_topology(tl)
>
> loop in build_sched_domains() saving some code?
>
[...]
Yeah that should work. Folks might want to gate it under SCHED_DEBUG, but
that's another discussion.