There is a report again that the TSC clocksource on a 4-socket x86
Skylake server was wrongly judged as 'unstable' by the 'jiffies'
watchdog, and disabled [1]. We also got similar reports for an
8-socket platform from internal testing.

Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC
on qualified platorms") was introduced to deal with these false
alarms of TSC instability, covering qualified platforms with 2 or
fewer sockets.

Extend the exemption to 4/8 sockets as well to fix the issue.

Rui also proposed another way: disabling 'jiffies' as a clocksource
watchdog [2]. That can also solve this specific problem in an
architecture-independent way, with one limitation: some TSC false
alarms are reported by other watchdogs like HPET after boot, while
'jiffies' is mostly used in the boot phase before hardware
clocksources are initialized.
[1]. https://lore.kernel.org/all/[email protected]/
[2]. https://lore.kernel.org/all/[email protected]/
Reported-by: Yu Liao <[email protected]>
Tested-by: Yu Liao <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
arch/x86/kernel/tsc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index cafacb2e58cc..b4ea79cb1d1a 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
- nr_online_nodes <= 2)
+ nr_online_nodes <= 8)
tsc_disable_clocksource_watchdog();
}
--
2.34.1
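For background, the clocksource watchdog periodically cross-checks a
clocksource under test against a trusted reference and marks it
'unstable' when the two drift apart. A much-simplified sketch of the
idea follows; the real logic lives in kernel/time/clocksource.c and
handles wrapping, suspend and re-rating, and the margin and helper
names below are made up for illustration:

/*
 * Much-simplified sketch of the clocksource watchdog idea.
 * clocksource_mark_unstable() is a real kernel function;
 * MY_MARGIN_NS and read_delta_ns() are illustrative only.
 */
#define MY_MARGIN_NS	(NSEC_PER_SEC / 16)	/* made-up margin */

static void watchdog_check(struct clocksource *cs, struct clocksource *wd)
{
	s64 cs_ns = read_delta_ns(cs);	/* ns advanced since last check */
	s64 wd_ns = read_delta_ns(wd);

	/* If the two clocks disagree beyond the margin, distrust 'cs' */
	if (abs(cs_ns - wd_ns) > MY_MARGIN_NS)
		clocksource_mark_unstable(cs);
}

Because 'jiffies' itself is tick-based and coarse, it makes a poor
reference, which is how the false 'unstable' verdicts discussed in
this thread can happen in early boot.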
On Sun, Oct 09, 2022 at 01:12:09PM +0800, Feng Tang wrote:
> There is a report again that the TSC clocksource on a 4-socket x86
> Skylake server was wrongly judged as 'unstable' by the 'jiffies'
> watchdog, and disabled [1]. We also got similar reports for an
> 8-socket platform from internal testing.
>
> Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC
> on qualified platorms") was introduced to deal with these false
> alarms of TSC instability, covering qualified platforms with 2 or
> fewer sockets.
>
> Extend the exemption to 4/8 sockets as well to fix the issue.
>
> Rui also proposed another way: disabling 'jiffies' as a clocksource
> watchdog [2]. That can also solve this specific problem in an
> architecture-independent way, with one limitation: some TSC false
> alarms are reported by other watchdogs like HPET after boot, while
> 'jiffies' is mostly used in the boot phase before hardware
> clocksources are initialized.
>
> [1]. https://lore.kernel.org/all/[email protected]/
> [2]. https://lore.kernel.org/all/[email protected]/
>
> Reported-by: Yu Liao <[email protected]>
> Tested-by: Yu Liao <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>
> ---
> arch/x86/kernel/tsc.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index cafacb2e58cc..b4ea79cb1d1a 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
> if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
> - nr_online_nodes <= 2)
> + nr_online_nodes <= 8)
So you're saying all 8 socket systems since Broadwell (?) are TSC
sync'ed ?
AFAIK there is no architectural guarantee for >4 sockets to have a sane
TSC. If there is one, the above should be limited to architectures that
conform.
On Sun, Oct 09, 2022 at 03:01:32PM +0200, Peter Zijlstra wrote:
> On Sun, Oct 09, 2022 at 01:12:09PM +0800, Feng Tang wrote:
> > There is a report again that the TSC clocksource on a 4-socket x86
> > Skylake server was wrongly judged as 'unstable' by the 'jiffies'
> > watchdog, and disabled [1]. We also got similar reports for an
> > 8-socket platform from internal testing.
> >
> > Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC
> > on qualified platorms") was introduced to deal with these false
> > alarms of TSC instability, covering qualified platforms with 2 or
> > fewer sockets.
> >
> > Extend the exemption to 4/8 sockets as well to fix the issue.
> >
> > Rui also proposed another way: disabling 'jiffies' as a clocksource
> > watchdog [2]. That can also solve this specific problem in an
> > architecture-independent way, with one limitation: some TSC false
> > alarms are reported by other watchdogs like HPET after boot, while
> > 'jiffies' is mostly used in the boot phase before hardware
> > clocksources are initialized.
> >
> > [1]. https://lore.kernel.org/all/[email protected]/
> > [2]. https://lore.kernel.org/all/[email protected]/
> >
> > Reported-by: Yu Liao <[email protected]>
> > Tested-by: Yu Liao <[email protected]>
> > Signed-off-by: Feng Tang <[email protected]>
> > ---
> > arch/x86/kernel/tsc.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index cafacb2e58cc..b4ea79cb1d1a 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
> > if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
> > - nr_online_nodes <= 2)
> > + nr_online_nodes <= 8)
>
> So you're saying all 8 socket systems since Broadwell (?) are TSC
> sync'ed ?
No, I didn't mean that. I haven't got a chance to access any 8-socket
machine, and I got a report last month that on one 8S machine, the
TSC was judged 'unstable' with HPET as the watchdog.
> AFAIK there is no architectural guarantee for >4 sockets to have a sane
> TSC. If there is one, the above should be limited to architectures that
> conform.
Thanks for the note! Yes, we should be very cautious about 8-socket
machines. I will limit the max sockets to 4, which was also originally
suggested by Thomas.
Thanks,
Feng
On 10/9/22 18:23, Feng Tang wrote:
>>> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
>>> index cafacb2e58cc..b4ea79cb1d1a 100644
>>> --- a/arch/x86/kernel/tsc.c
>>> +++ b/arch/x86/kernel/tsc.c
>>> @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
>>> if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
>>> boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
>>> boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
>>> - nr_online_nodes <= 2)
>>> + nr_online_nodes <= 8)
>> So you're saying all 8 socket systems since Broadwell (?) are TSC
>> sync'ed ?
> No, I didn't mean that. I haven't got a chance to access any 8-socket
> machine, and I got a report last month that on one 8S machine, the
> TSC was judged 'unstable' with HPET as the watchdog.
That's not a great check. Think about numa=fake=4U, for instance. Or a
single-socket system with persistent memory and high bandwidth memory.
Basically 'nr_online_nodes' is a software construct. It's going to be
really hard to infer anything from it about what the _hardware_ is.
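Dave's point can be seen directly in how 'nr_online_nodes' is derived.
A rough sketch, based on include/linux/nodemask.h (the exact
definition varies across kernel versions):

/*
 * Rough sketch of the nr_online_nodes derivation (see
 * include/linux/nodemask.h; the exact form differs between kernel
 * versions). It counts nodes marked online in the node_states[]
 * nodemask, so fake NUMA nodes and memory-only nodes (PMEM, HBM)
 * inflate it exactly like real sockets would.
 */
static inline int num_node_state(enum node_states state)
{
	return nodes_weight(node_states[state]);
}

#define nr_online_nodes	num_node_state(N_ONLINE)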
On Mon, Oct 10, 2022 at 07:23:10AM -0700, Dave Hansen wrote:
> On 10/9/22 18:23, Feng Tang wrote:
> >>> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> >>> index cafacb2e58cc..b4ea79cb1d1a 100644
> >>> --- a/arch/x86/kernel/tsc.c
> >>> +++ b/arch/x86/kernel/tsc.c
> >>> @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
> >>> if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> >>> boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> >>> boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
> >>> - nr_online_nodes <= 2)
> >>> + nr_online_nodes <= 8)
> >> So you're saying all 8 socket systems since Broadwell (?) are TSC
> >> sync'ed ?
> > No, I didn't mean that. I haven't got a chance to access any 8-socket
> > machine, and I got a report last month that on one 8S machine, the
> > TSC was judged 'unstable' with HPET as the watchdog.
>
> That's not a great check. Think about numa=fake=4U, for instance. Or a
> single-socket system with persistent memory and high bandwidth memory.
>
> Basically 'nr_online_nodes' is a software construct. It's going to be
> really hard to infer anything from it about what the _hardware_ is.
You are right! How to get the socket number was indeed a problem when
I worked on commit b50db7095fe0; the trouble is related to the
initialization order. This TSC check needs to be done in tsc_init(),
while node_states[] gets initialized in the later call of smp_init().
For the cases you mentioned above, I dug out some old logs which show
the init order:
numa=fake=4 on a SKL desktop
================
[ 0.000066] [tsc_early_init()]: nr_online_nodes = 1
[ 0.000068] [tsc_early_init()]: nr_cpu_nodes = 0
[ 0.000070] [tsc_early_init()]: nr_mem_nodes = 0
[ 0.104015] [tsc_init()]: nr_online_nodes = 4
[ 0.104019] [tsc_init()]: nr_cpu_nodes = 0
[ 0.104022] [tsc_init()]: nr_mem_nodes = 4
[ 0.124778] smp: Brought up 4 nodes, 4 CPUs
[ 0.760915] [init_tsc_clocksource()]: nr_online_nodes = 4
[ 0.760919] [init_tsc_clocksource()]: nr_cpu_nodes = 4
[ 0.760922] [init_tsc_clocksource()]: nr_mem_nodes = 4
QEMU with 2 CPU-DRAM nodes + 2 Persistent memory nodes
========================================================
[ 0.066651] [tsc_early_init()]: nr_online_nodes = 1
[ 0.067494] [tsc_early_init()]: nr_cpu_nodes = 0
[ 0.068288] [tsc_early_init()]: nr_mem_nodes = 0
[ 0.677694] [tsc_init()]: nr_online_nodes = 4
[ 0.678862] [tsc_init()]: nr_cpu_nodes = 0
[ 0.679962] [tsc_init()]: nr_mem_nodes = 4
[ 1.139240] [init_tsc_clocksource()]: nr_online_nodes = 4
[ 1.140576] [init_tsc_clocksource()]: nr_cpu_nodes = 2
[ 1.141823] [init_tsc_clocksource()]: nr_mem_nodes = 4
[ 1.660100] [kernel_init()]: nr_online_nodes = 4
[ 1.661234] [kernel_init()]: nr_cpu_nodes = 2
[ 1.662300] [kernel_init()]: nr_mem_nodes = 4
'nr_online_nodes' was chosen in the hope that, in the worst case, the
patch would just be a nop and wouldn't wrongly lift the check.

One possible solution for this problem is to leverage the early SRAT
table parsing, which runs before tsc_init() and can provide CPU node
info. I will try this way.
Thanks,
Feng
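(The per-stage counters in the logs above came from ad-hoc debug
instrumentation, presumably along these lines; this exact macro is
hypothetical and not part of any posted patch:)

/*
 * Hypothetical debug macro matching the log format above. N_CPU and
 * N_MEMORY are real node states; using a macro makes __func__ expand
 * to the calling function's name, as seen in the logs.
 */
#define dump_node_counts()						\
do {									\
	pr_info("[%s()]: nr_online_nodes = %d\n", __func__,		\
		nr_online_nodes);					\
	pr_info("[%s()]: nr_cpu_nodes = %d\n", __func__,		\
		num_node_state(N_CPU));					\
	pr_info("[%s()]: nr_mem_nodes = %d\n", __func__,		\
		num_node_state(N_MEMORY));				\
} while (0)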
On Tue, Oct 11, 2022 at 09:09:12AM +0800, Feng Tang wrote:
> On Mon, Oct 10, 2022 at 07:23:10AM -0700, Dave Hansen wrote:
> > On 10/9/22 18:23, Feng Tang wrote:
> > >>> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > >>> index cafacb2e58cc..b4ea79cb1d1a 100644
> > >>> --- a/arch/x86/kernel/tsc.c
> > >>> +++ b/arch/x86/kernel/tsc.c
> > >>> @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
> > >>> if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > >>> boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > >>> boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
> > >>> - nr_online_nodes <= 2)
> > >>> + nr_online_nodes <= 8)
> > >> So you're saying all 8 socket systems since Broadwell (?) are TSC
> > >> sync'ed ?
> > > No, I didn't mean that. I haven't got a chance to access any 8-socket
> > > machine, and I got a report last month that on one 8S machine, the
> > > TSC was judged 'unstable' with HPET as the watchdog.
> >
> > That's not a great check. Think about numa=fake=4U, for instance. Or a
> > single-socket system with persistent memory and high bandwidth memory.
> >
> > Basically 'nr_online_nodes' is a software construct. It's going to be
> > really hard to infer anything from it about what the _hardware_ is.
>
> You are right! How to get the socket number was indeed a problem when
> I worked on commit b50db7095fe0; the trouble is related to the
> initialization order. This TSC check needs to be done in tsc_init(),
> while node_states[] gets initialized in the later call of smp_init().
>
> For the cases you mentioned above, I dug out some old logs which show
> the init order:
>
> numa=fake=4 on a SKL desktop
> ================
> [ 0.000066] [tsc_early_init()]: nr_online_nodes = 1
> [ 0.000068] [tsc_early_init()]: nr_cpu_nodes = 0
> [ 0.000070] [tsc_early_init()]: nr_mem_nodes = 0
> [ 0.104015] [tsc_init()]: nr_online_nodes = 4
> [ 0.104019] [tsc_init()]: nr_cpu_nodes = 0
> [ 0.104022] [tsc_init()]: nr_mem_nodes = 4
> [ 0.124778] smp: Brought up 4 nodes, 4 CPUs
> [ 0.760915] [init_tsc_clocksource()]: nr_online_nodes = 4
> [ 0.760919] [init_tsc_clocksource()]: nr_cpu_nodes = 4
> [ 0.760922] [init_tsc_clocksource()]: nr_mem_nodes = 4
>
> QEMU with 2 CPU-DRAM nodes + 2 Persistent memory nodes
> ========================================================
> [ 0.066651] [tsc_early_init()]: nr_online_nodes = 1
> [ 0.067494] [tsc_early_init()]: nr_cpu_nodes = 0
> [ 0.068288] [tsc_early_init()]: nr_mem_nodes = 0
> [ 0.677694] [tsc_init()]: nr_online_nodes = 4
> [ 0.678862] [tsc_init()]: nr_cpu_nodes = 0
> [ 0.679962] [tsc_init()]: nr_mem_nodes = 4
> [ 1.139240] [init_tsc_clocksource()]: nr_online_nodes = 4
> [ 1.140576] [init_tsc_clocksource()]: nr_cpu_nodes = 2
> [ 1.141823] [init_tsc_clocksource()]: nr_mem_nodes = 4
> [ 1.660100] [kernel_init()]: nr_online_nodes = 4
> [ 1.661234] [kernel_init()]: nr_cpu_nodes = 2
> [ 1.662300] [kernel_init()]: nr_mem_nodes = 4
>
> 'nr_online_nodes' was chosen in the hope that, in the worst case, the
> patch would just be a nop and wouldn't wrongly lift the check.
>
> One possible solution for this problem is to leverage the early SRAT
> table parsing, which runs before tsc_init() and can provide CPU node
> info. I will try this way.
The simple patch below adds a dedicated CPU nodemask and sets it in
the early SRAT CPU parsing. It still has a problem when sub-NUMA
clustering is enabled in the BIOS, where there are more NUMA nodes in
the SRAT table. (Also, I'm not sure the change to amdtopology.c is
right.)
Thanks,
Feng
diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index e3bae2b60a0d..e745053a5f9a 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -31,6 +31,7 @@ extern int numa_off;
*/
extern s16 __apicid_to_node[MAX_LOCAL_APIC];
extern nodemask_t numa_nodes_parsed __initdata;
+extern nodemask_t numa_cpu_nodes __initdata;
extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
extern void __init numa_set_distance(int from, int to, int distance);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 179e0b1ba5cc..a2a7fc5aa15c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -29,6 +29,7 @@
#include <asm/intel-family.h>
#include <asm/i8259.h>
#include <asm/uv/uv.h>
+#include <asm/numa.h>
unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
EXPORT_SYMBOL(cpu_khz);
@@ -1218,7 +1219,7 @@ first_dump();
if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
- nr_online_nodes <= 2)
+ nodes_weight(numa_cpu_nodes) <= 2)
tsc_disable_clocksource_watchdog();
}
diff --git a/arch/x86/mm/amdtopology.c b/arch/x86/mm/amdtopology.c
index b3ca7d23e4b0..6b982a16cc38 100644
--- a/arch/x86/mm/amdtopology.c
+++ b/arch/x86/mm/amdtopology.c
@@ -152,6 +152,7 @@ int __init amd_numa_init(void)
prevbase = base;
numa_add_memblk(nodeid, base, limit);
node_set(nodeid, numa_nodes_parsed);
+ node_set(nodeid, numa_cpu_nodes);
}
if (nodes_empty(numa_nodes_parsed))
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 090125b3ee1f..82798fee97a2 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -21,6 +21,7 @@
int numa_off;
nodemask_t numa_nodes_parsed __initdata;
+nodemask_t numa_cpu_nodes __initdata;
struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
EXPORT_SYMBOL(node_data);
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 7688117ac2f4..11b08b317306 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -59,6 +59,7 @@ acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa)
}
set_apicid_to_node(apic_id, node);
node_set(node, numa_nodes_parsed);
+ node_set(node, numa_cpu_nodes);
printk(KERN_INFO "SRAT: PXM %u -> APIC 0x%04x -> Node %u\n",
@@ -106,6 +107,7 @@ acpi_numa_processor_affinity_init(struct acpi_srat_cpu_affinity *pa)
set_apicid_to_node(apic_id, node);
node_set(node, numa_nodes_parsed);
+ node_set(node, numa_cpu_nodes);
printk(KERN_INFO "SRAT: PXM %u -> APIC 0x%02x -> Node %u\n",
On Mon, Oct 10, 2022 at 07:23:10AM -0700, Dave Hansen wrote:
> On 10/9/22 18:23, Feng Tang wrote:
> >>> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> >>> index cafacb2e58cc..b4ea79cb1d1a 100644
> >>> --- a/arch/x86/kernel/tsc.c
> >>> +++ b/arch/x86/kernel/tsc.c
> >>> @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
> >>> if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> >>> boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> >>> boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
> >>> - nr_online_nodes <= 2)
> >>> + nr_online_nodes <= 8)
> >> So you're saying all 8 socket systems since Broadwell (?) are TSC
> >> sync'ed ?
> > > No, I didn't mean that. I haven't got a chance to access any 8-socket
> > > machine, and I got a report last month that on one 8S machine, the
> > > TSC was judged 'unstable' with HPET as the watchdog.
>
> That's not a great check. Think about numa=fake=4U, for instance. Or a
> single-socket system with persistent memory and high bandwidth memory.
>
> Basically 'nr_online_nodes' is a software construct. It's going to be
> really hard to infer anything from it about what the _hardware_ is.
We have both c->phys_proc_id and c->logical_proc_id along with
logical_packages.
I'm thinking you want something like max(c->phys_proc_id) <= 4. Because
even if you only populate 4 sockets of an 8 socket server you're up a
creek without no paddles.
But it all comes down to how much drugs the firmware teams have had :/
It is entirely possible to enumerate with phys_proc_id==42 on a 2 socket
system.
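A rough sketch of what gating on package ids could look like
(illustrative only; tsc_few_sockets() is a made-up helper, cpu_data()
and phys_proc_id are real kernel names, and as Peter notes sparse
firmware numbering would defeat a max-based check):

/*
 * Illustrative sketch, not a posted patch: bound the check by the
 * largest physical package id seen. It would also only work after
 * SMP bringup has populated the per-CPU data, which is exactly the
 * ordering problem discussed below.
 */
static bool tsc_few_sockets(void)
{
	unsigned int cpu, max_id = 0;

	for_each_online_cpu(cpu)
		max_id = max_t(unsigned int, max_id,
			       cpu_data(cpu).phys_proc_id);

	return max_id < 4;	/* package ids are 0-based */
}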
On Tue, Oct 11, 2022 at 03:51:21PM +0800, Feng Tang wrote:
> The simple patch below adds a dedicated CPU nodemask and sets it in
> the early SRAT CPU parsing. It still has a problem when sub-NUMA
> clustering is enabled in the BIOS, where there are more NUMA nodes in
> the SRAT table. (Also, I'm not sure the change to amdtopology.c is
> right.)
No; none of this has anything to do with nodes. This is about sockets.
On Tue, 2022-10-11 at 09:52 +0200, Peter Zijlstra wrote:
> On Mon, Oct 10, 2022 at 07:23:10AM -0700, Dave Hansen wrote:
> > On 10/9/22 18:23, Feng Tang wrote:
> > > > > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > > > > index cafacb2e58cc..b4ea79cb1d1a 100644
> > > > > --- a/arch/x86/kernel/tsc.c
> > > > > +++ b/arch/x86/kernel/tsc.c
> > > > > @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
> > > > > if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > > > > boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > > > > boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
> > > > > - nr_online_nodes <= 2)
> > > > > + nr_online_nodes <= 8)
> > > > So you're saying all 8 socket systems since Broadwell (?) are TSC
> > > > sync'ed ?
> > > No, I didn't mean that. I haven't got a chance to access any 8-socket
> > > machine, and I got a report last month that on one 8S machine, the
> > > TSC was judged 'unstable' with HPET as the watchdog.
> >
> > That's not a great check. Think about numa=fake=4U, for instance. Or a
> > single-socket system with persistent memory and high bandwidth memory.
> >
> > Basically 'nr_online_nodes' is a software construct. It's going to be
> > really hard to infer anything from it about what the _hardware_ is.
>
> We have both c->phys_proc_id and c->logical_proc_id along with
> logical_packages.
>
> I'm thinking you want something like max(c->phys_proc_id) <= 4. Because
> even if you only populate 4 sockets of an 8 socket server you're up a
> creek without no paddles.
>
> But it all comes down to how much drugs the firmware teams have had :/
> It is entirely possible to enumerate with phys_proc_id==42 on a 2 socket
> system.
topology_max_packages() or the variable logical_packages can tell the
maximum number of packages. But check_system_tsc_reliable() is done
in the early boot phase where we have the boot CPU only, and the CPU
topology is not built up at this stage.
thanks,
rui
On Tue, Oct 11, 2022 at 09:33:26PM +0800, Zhang Rui wrote:
> topology_max_packages() or the variable logical_packages can tell the
> maximum number of packages. But check_system_tsc_reliable() is done
> in the early boot phase where we have the boot CPU only, and the CPU
> topology is not built up at this stage.
Is there a problem with disabling the TSC watchdog later in boot --
after SMP bringup for example?
On Tue, Oct 11, 2022 at 04:01:46PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 11, 2022 at 09:33:26PM +0800, Zhang Rui wrote:
>
> > topology_max_packages() or the variable logical_packages can tell the
> > maximum number of packages. But check_system_tsc_reliable() is done
> > in the early boot phase where we have the boot CPU only, and the CPU
> > topology is not built up at this stage.
>
> Is there a problem with disabling the TSC watchdog later in boot --
> after SMP bringup for example?
Currently the watchdog is disabled inside tsc_init(), right before the
'tsc-early' clocksource is registered; otherwise it starts to be
monitored with 'jiffies' as the watchdog. There have been many cases
where the 'jiffies' watchdog misjudged the TSC as 'unstable' in the
early boot phase, including Yu Liao's recent report on a 4-socket
Skylake server.
Thanks,
Feng
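The ordering constraint looks roughly like this (a simplified sketch
of tsc_init() with the calibration details elided; the two calls
shown are real):

/*
 * Simplified sketch of the relevant ordering in tsc_init(): the
 * reliability check must run first, because registering 'tsc-early'
 * is what puts the TSC under the current watchdog, which this early
 * in boot is 'jiffies'.
 */
void __init tsc_init(void)
{
	/* ... frequency calibration etc. elided ... */

	check_system_tsc_reliable();	/* may disable the watchdog */

	clocksource_register_khz(&clocksource_tsc_early, tsc_khz);
}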
On Tue, Oct 11, 2022 at 03:01:15PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 11, 2022 at 03:51:21PM +0800, Feng Tang wrote:
> > The simple patch below adds a dedicated CPU nodemask and sets it in
> > the early SRAT CPU parsing. It still has a problem when sub-NUMA
> > clustering is enabled in the BIOS, where there are more NUMA nodes in
> > the SRAT table. (Also, I'm not sure the change to amdtopology.c is
> > right.)
>
> No; none of this has anything to do with nodes. This is about sockets.
Exactly. All we try to do is get a number closer to the socket count
(as also stated in the current code comments).

According to our discussion, we haven't found a way to get a very
accurate socket number, so I plan to (if there is no objection):

* Send a patch lifting the socket number check from 2 to 4, to fix
  the issue reported by Yu Liao.

* Send another RFC patch [1], which makes the socket number more
  accurate, as it solves the two problems mentioned by Dave:
  - fake NUMA (numa=fake=4 etc.)
  - systems with CPU-DRAM nodes + HBM nodes + Persistent Memory nodes
  though it still can't cover the case where sub-NUMA clustering is
  enabled.
[1]. https://lore.kernel.org/lkml/Y0UgeUIJSFNR4mQB@feng-clx/
Thanks,
Feng