2023-01-20 03:28:33

by Nick Bowler

Subject: Re: PROBLEM: Only one CPU active on Ultra 60 since ~4.8 (regression)

Hi,

I'm resending this report CC'd to linux-kernel as there was no response
on the sparclinux list.

I tried 6.2-rc4 and there is no change in behaviour. Reverting the
indicated commit still works to fix the problem.

On 2022-07-12, Nick Bowler <[email protected]> wrote:
> When using newer kernels on my Ultra 60 with dual 450MHz UltraSPARC-II
> CPUs, I noticed that only CPU 0 comes up, while older kernels (including
> 4.7) are working fine with both CPUs.
>
> I bisected the failure to this commit:
>
> 9b2f753ec23710aa32c0d837d2499db92fe9115b is the first bad commit
> commit 9b2f753ec23710aa32c0d837d2499db92fe9115b
> Author: Atish Patra <[email protected]>
> Date: Thu Sep 15 14:54:40 2016 -0600
>
> sparc64: Fix cpu_possible_mask if nr_cpus is set
>
> This is a small change that reverts very easily on top of 5.18: there is
> just one trivial conflict. Once reverted, both CPUs work again.
>
> Maybe this is related to the fact that the CPUs on this system are
> numbered CPU0 and CPU2 (there is no CPU1)?
>
> Here is /proc/cpuinfo on a working kernel:
>
> % cat /proc/cpuinfo
> cpu : TI UltraSparc II (BlackBird)
> fpu : UltraSparc II integrated FPU
> pmu : ultra12
> prom : OBP 3.23.1 1999/07/16 12:08
> type : sun4u
> ncpus probed : 2
> ncpus active : 2
> D$ parity tl1 : 0
> I$ parity tl1 : 0
> cpucaps : flush,stbar,swap,muldiv,v9,mul32,div32,v8plus,vis
> Cpu0ClkTck : 000000001ad31b4f
> Cpu2ClkTck : 000000001ad31b4f
> MMU Type : Spitfire
> MMU PGSZs : 8K,64K,512K,4MB
> State:
> CPU0: online
> CPU2: online
>
> And on a broken kernel:
>
> % cat /proc/cpuinfo
> cpu : TI UltraSparc II (BlackBird)
> fpu : UltraSparc II integrated FPU
> pmu : ultra12
> prom : OBP 3.23.1 1999/07/16 12:08
> type : sun4u
> ncpus probed : 2
> ncpus active : 1
> D$ parity tl1 : 0
> I$ parity tl1 : 0
> cpucaps : flush,stbar,swap,muldiv,v9,mul32,div32,v8plus,vis
> Cpu0ClkTck : 000000001ad31861
> MMU Type : Spitfire
> MMU PGSZs : 8K,64K,512K,4MB
> State:
> CPU0: online
>
> Let me know if you need any more info.
>
> Thanks,
> Nick


Subject: Re: PROBLEM: Only one CPU active on Ultra 60 since ~4.8 (regression)

CCing the sparc maintainer. Also CCing the regression list, as it should
be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html

The mail address of the culprit's author bounces. There is another
Atish Patra still active; does anyone know if those two are the same
person?

Anyway, that's it from my side.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

On 20.01.23 04:15, Nick Bowler wrote:
> Hi,
>
> I'm resending this report CC'd to linux-kernel as there was no response
> on the sparclinux list.
>
> I tried 6.2-rc4 and there is no change in behaviour. Reverting the
> indicated commit still works to fix the problem.
>
> [...]

2024-03-22 09:02:38

by Nick Bowler

Subject: Re: PROBLEM: Only one CPU active on Ultra 60 since ~4.8 (regression)

Hi,

Just a friendly reminder that this issue still happens on Linux 6.8 and
reverting commit 9b2f753ec237 as indicated below is still sufficient to
resolve the problem.

On 2023-01-21 08:31, Linux kernel regression tracking (Thorsten Leemhuis) wrote:
> CCing the sparc maintainer. Also CCing the regression list, as it should
> be in the loop for regressions:
> https://docs.kernel.org/admin-guide/reporting-regressions.html
>
> The mail address of the culprit's author bounces. There is another
> Atish Patra still active; does anyone know if those two are the same
> person?
>
> Anyway, that's it from my side.
[...]
> On 20.01.23 04:15, Nick Bowler wrote:
>> Hi,
>>
>> I'm resending this report CC'd to linux-kernel as there was no response
>> on the sparclinux list.
>>
>> I tried 6.2-rc4 and there is no change in behaviour. Reverting the
>> indicated commit still works to fix the problem.
>>
>> [...]

Subject: Re: PROBLEM: Only one CPU active on Ultra 60 since ~4.8 (regression)

[CCing Linus, in case I say something to his disliking]

On 22.03.24 05:57, Nick Bowler wrote:
>
> Just a friendly reminder that this issue still happens on Linux 6.8 and
> reverting commit 9b2f753ec237 as indicated below is still sufficient to
> resolve the problem.

FWIW, that commit 9b2f753ec23710 ("sparc64: Fix cpu_possible_mask if
nr_cpus is set") is from v4.8. Reverting it after all that time might
easily lead to even bigger trouble. That's why it might be better to
handle this like a bug and not like a regression. At least unless we
find someone to judge how likely such an outcome is. But it seems nobody
really cared so far, so unless this mail makes someone act you might be
out of luck. :-/

I wish it was different, but in the end we (including the maintainers)
are all just volunteers here, whom you can only motivate or compel (up
to some point) to look into an issue, but cannot force to do so.

Ciao, Thorsten

> On 2023-01-21 08:31, Linux kernel regression tracking (Thorsten Leemhuis) wrote:
>> CCing the sparc maintainer. Also CCing the regression list, as it should
>> be in the loop for regressions:
>> https://docs.kernel.org/admin-guide/reporting-regressions.html
>>
>> The mail address of the culprit's author bounces. There is another
>> Atish Patra still active; does anyone know if those two are the same
>> person?
>>
>> Anyway, that's it from my side.
> [...]
>> On 20.01.23 04:15, Nick Bowler wrote:
>>> Hi,
>>>
>>> I'm resending this report CC'd to linux-kernel as there was no response
>>> on the sparclinux list.
>>>
>>> I tried 6.2-rc4 and there is no change in behaviour. Reverting the
>>> indicated commit still works to fix the problem.
>>>
>>> [...]

2024-03-28 20:09:34

by Linus Torvalds

Subject: Re: PROBLEM: Only one CPU active on Ultra 60 since ~4.8 (regression)

On Thu, 28 Mar 2024 at 12:36, Linux regression tracking (Thorsten
Leemhuis) <[email protected]> wrote:
>
> [CCing Linus, in case I say something to his disliking]
>
> On 22.03.24 05:57, Nick Bowler wrote:
> >
> > Just a friendly reminder that this issue still happens on Linux 6.8 and
> > reverting commit 9b2f753ec237 as indicated below is still sufficient to
> > resolve the problem.
>
> FWIW, that commit 9b2f753ec23710 ("sparc64: Fix cpu_possible_mask if
> nr_cpus is set") is from v4.8. Reverting it after all that time might
> easily lead to even bigger trouble.

I'm definitely not reverting a patch from almost a decade ago as a regression.

If it took that long to find, it can't be that critical of a regression.

So yes, let's treat it as a regular bug. And let's bring in Andreas to
the discussion too (although presumably he has seen it on the
sparclinux mailing list).

Andreas, if not, here's the link to lore for the beginning of the thread:

https://lore.kernel.org/all/CADyTPEwt=ZNams+1bpMB1F9w_vUdPsGCt92DBQxxq_VtaLoTdw@mail.gmail.com/

And from a quick look I do think that commit is buggy, and yes, the
fix probably is just to revert it.

As the original report makes clear, that commit 9b2f753ec23710 is
clearly confused about the difference between "number of CPU's", and
"index of CPU numbers".

When that smp_fill_in_cpu_possible_map() does

    int possible_cpus = num_possible_cpus();

and then uses that to fill in &__cpu_possible_mask, that's completely
nonsensical. Because we literally have

    #define cpu_possible_mask ((const struct cpumask *)&__cpu_possible_mask)
    #define num_possible_cpus() cpumask_weight(cpu_possible_mask)

so it's reading cpu_possible_mask to figure out how many cpus it might
have, and then using that number to set possibly *different* bits in
the same bitmap that is just used to judge what the max number is.
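
The effect on a machine with CPU0 and CPU2 can be reproduced in user
space. What follows is only a minimal sketch, not kernel code: the
cpumask is modelled as a plain unsigned long, mask_weight() is a
hypothetical stand-in for cpumask_weight(), NR_CPUS is an arbitrary
small value chosen for the demo, and nr_cpu_ids is passed in as a
parameter rather than read from a global.

```c
#define NR_CPUS 8   /* arbitrary small limit, assumed for this sketch */

/* Count the set bits; hypothetical stand-in for cpumask_weight(). */
static int mask_weight(unsigned long mask)
{
    int n = 0;

    for (; mask; mask &= mask - 1)
        n++;
    return n;
}

/*
 * Same control flow as smp_fill_in_cpu_possible_map(), applied to a
 * plain bitmap: the *count* of possible CPUs is reused as an upper
 * bound on CPU *indices*.
 */
static unsigned long fill_in_possible_map(unsigned long possible,
                                          int nr_cpu_ids)
{
    int possible_cpus = mask_weight(possible);
    int i;

    if (possible_cpus > nr_cpu_ids)
        possible_cpus = nr_cpu_ids;

    for (i = 0; i < possible_cpus; i++)
        possible |= 1UL << i;       /* set bits 0..count-1 */
    for (; i < NR_CPUS; i++)
        possible &= ~(1UL << i);    /* clear every higher bit */
    return possible;
}
```

With bits 0 and 2 set on entry (mask 0x5), fill_in_possible_map(0x5, 8)
returns 0x3: the nonexistent CPU1 becomes possible and CPU2 is dropped,
which is consistent with the "ncpus active : 1" in the report.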

So I do think a revert is called for, but I'm not going to treat this
as a regression, I'm going to just treat it as "sparc bug" and hope
that the sparc people try to figure out why that crazy code was
written.

And maybe it made more sense back a decade ago than it does now.

Andreas?

Linus

2024-03-28 21:09:24

by Nick Bowler

Subject: Re: PROBLEM: Only one CPU active on Ultra 60 since ~4.8 (regression)

On 2024-03-28 16:09, Linus Torvalds wrote:
> On Thu, 28 Mar 2024 at 12:36, Linux regression tracking (Thorsten
> Leemhuis) <[email protected]> wrote:
>>
>> [CCing Linus, in case I say something to his disliking]
>>
>> On 22.03.24 05:57, Nick Bowler wrote:
>>>
>>> Just a friendly reminder that this issue still happens on Linux 6.8 and
>>> reverting commit 9b2f753ec237 as indicated below is still sufficient to
>>> resolve the problem.
>>
>> FWIW, that commit 9b2f753ec23710 ("sparc64: Fix cpu_possible_mask if
>> nr_cpus is set") is from v4.8. Reverting it after all that time might
>> easily lead to even bigger trouble.
>
> I'm definitely not reverting a patch from almost a decade ago as a regression.
>
> If it took that long to find, it can't be that critical of a regression.

FWIW I'm not the first person to notice this problem. Searching the sparclinux
archive for "ultra 60" turns up this very similar report[1] from two years
prior to mine, which also went nowhere (sadly, this reporter did not perform a
bisection to find the problematic commit -- perhaps because nobody asked).

[1] https://lore.kernel.org/sparclinux/[email protected]/

Cheers,
Nick

2024-03-29 09:46:20

by Sam Ravnborg

Subject: Re: PROBLEM: Only one CPU active on Ultra 60 since ~4.8 (regression)

Hi Nick,

On Thu, Mar 28, 2024 at 05:08:50PM -0400, Nick Bowler wrote:
> On 2024-03-28 16:09, Linus Torvalds wrote:
> > On Thu, 28 Mar 2024 at 12:36, Linux regression tracking (Thorsten
> > Leemhuis) <[email protected]> wrote:
> >>
> >> [CCing Linus, in case I say something to his disliking]
> >>
> >> On 22.03.24 05:57, Nick Bowler wrote:
> >>>
> >>> Just a friendly reminder that this issue still happens on Linux 6.8 and
> >>> reverting commit 9b2f753ec237 as indicated below is still sufficient to
> >>> resolve the problem.
> >>
> >> FWIW, that commit 9b2f753ec23710 ("sparc64: Fix cpu_possible_mask if
> >> nr_cpus is set") is from v4.8. Reverting it after all that time might
> >> easily lead to even bigger trouble.
> >
> > I'm definitely not reverting a patch from almost a decade ago as a regression.
> >
> > If it took that long to find, it can't be that critical of a regression.
>
> FWIW I'm not the first person to notice this problem. Searching the sparclinux
> archive for "ultra 60" turns up this very similar report[1] from two years
> prior to mine, which also went nowhere (sadly, this reporter did not perform a
> bisection to find the problematic commit -- perhaps because nobody asked).
>
> [1] https://lore.kernel.org/sparclinux/[email protected]/

I took a look at this and may have a fix. Could you try the following
patch. It builds - but I have not tested it.

Sam


From a0fb7c6e6817849550d07b4c5a354ccc58382bc1 Mon Sep 17 00:00:00 2001
From: Sam Ravnborg <[email protected]>
Date: Fri, 29 Mar 2024 10:34:07 +0100
Subject: [PATCH] sparc64: Fix number of online CPUs

Nick Bowler reported:
When using newer kernels on my Ultra 60 with dual 450MHz UltraSPARC-II
CPUs, I noticed that only CPU 0 comes up, while older kernels (including
4.7) are working fine with both CPUs.

I bisected the failure to this commit:

9b2f753ec23710aa32c0d837d2499db92fe9115b is the first bad commit
commit 9b2f753ec23710aa32c0d837d2499db92fe9115b
Author: Atish Patra <[email protected]>
Date: Thu Sep 15 14:54:40 2016 -0600

sparc64: Fix cpu_possible_mask if nr_cpus is set

This is a small change that reverts very easily on top of 5.18: there is
just one trivial conflict. Once reverted, both CPUs work again.

Maybe this is related to the fact that the CPUs on this system are
numbered CPU0 and CPU2 (there is no CPU1)?

The current code that adjusts cpu_possible based on nr_cpu_ids does not
take into account that CPU ids may not be consecutive.
Move the check to the function that sets up the cpu_possible mask
so there is no need to adjust it later.

Signed-off-by: Sam Ravnborg <[email protected]>
Reported-by: Nick Bowler <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: "David S. Miller" <[email protected]>
---
arch/sparc/include/asm/smp_64.h | 2 --
arch/sparc/kernel/prom_64.c | 4 +++-
arch/sparc/kernel/setup_64.c | 1 -
arch/sparc/kernel/smp_64.c | 14 --------------
4 files changed, 3 insertions(+), 18 deletions(-)

diff --git a/arch/sparc/include/asm/smp_64.h b/arch/sparc/include/asm/smp_64.h
index 505b6700805d..0964fede0b2c 100644
--- a/arch/sparc/include/asm/smp_64.h
+++ b/arch/sparc/include/asm/smp_64.h
@@ -47,7 +47,6 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask);
 int hard_smp_processor_id(void);
 #define raw_smp_processor_id() (current_thread_info()->cpu)

-void smp_fill_in_cpu_possible_map(void);
 void smp_fill_in_sib_core_maps(void);
 void __noreturn cpu_play_dead(void);

@@ -77,7 +76,6 @@ void __cpu_die(unsigned int cpu);
 #define smp_fill_in_sib_core_maps() do { } while (0)
 #define smp_fetch_global_regs() do { } while (0)
 #define smp_fetch_global_pmu() do { } while (0)
-#define smp_fill_in_cpu_possible_map() do { } while (0)
 #define smp_init_cpu_poke() do { } while (0)
 #define scheduler_poke() do { } while (0)

diff --git a/arch/sparc/kernel/prom_64.c b/arch/sparc/kernel/prom_64.c
index 998aa693d491..ba82884cb92a 100644
--- a/arch/sparc/kernel/prom_64.c
+++ b/arch/sparc/kernel/prom_64.c
@@ -483,7 +483,9 @@ static void *record_one_cpu(struct device_node *dp, int cpuid, int arg)
 	ncpus_probed++;
 #ifdef CONFIG_SMP
 	set_cpu_present(cpuid, true);
-	set_cpu_possible(cpuid, true);
+
+	if (num_possible_cpus() < nr_cpu_ids)
+		set_cpu_possible(cpuid, true);
 #endif
 	return NULL;
 }
diff --git a/arch/sparc/kernel/setup_64.c b/arch/sparc/kernel/setup_64.c
index 6a4797dec34b..6bbe8e394ad3 100644
--- a/arch/sparc/kernel/setup_64.c
+++ b/arch/sparc/kernel/setup_64.c
@@ -671,7 +671,6 @@ void __init setup_arch(char **cmdline_p)

 	paging_init();
 	init_sparc64_elf_hwcap();
-	smp_fill_in_cpu_possible_map();
 	/*
 	 * Once the OF device tree and MDESC have been setup and nr_cpus has
 	 * been parsed, we know the list of possible cpus. Therefore we can
diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index f3969a3600db..e50c38eba2b8 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1220,20 +1220,6 @@ void __init smp_setup_processor_id(void)
 		xcall_deliver_impl = hypervisor_xcall_deliver;
 }

-void __init smp_fill_in_cpu_possible_map(void)
-{
-	int possible_cpus = num_possible_cpus();
-	int i;
-
-	if (possible_cpus > nr_cpu_ids)
-		possible_cpus = nr_cpu_ids;
-
-	for (i = 0; i < possible_cpus; i++)
-		set_cpu_possible(i, true);
-	for (; i < NR_CPUS; i++)
-		set_cpu_possible(i, false);
-}
-
 void smp_fill_in_sib_core_maps(void)
 {
 	unsigned int i;
--
2.34.1
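
The check added to record_one_cpu() can be modelled in user space to
see how it behaves with sparse CPU ids. This is only a rough sketch
under stated assumptions, not the kernel implementation: the masks are
plain unsigned longs, mask_weight() and probe_cpus() are hypothetical
helpers, and nr_cpu_ids is passed as a parameter. It only exercises
the counting logic shown in the prom_64.c hunk.

```c
struct cpu_masks {
    unsigned long present;
    unsigned long possible;
};

/* Count the set bits; hypothetical stand-in for cpumask_weight(). */
static int mask_weight(unsigned long mask)
{
    int n = 0;

    for (; mask; mask &= mask - 1)
        n++;
    return n;
}

/*
 * Models the patched record_one_cpu(): a probed cpuid is always marked
 * present, but only marked possible while fewer than nr_cpu_ids CPUs
 * have been marked possible so far.
 */
static void record_one_cpu(struct cpu_masks *m, int cpuid, int nr_cpu_ids)
{
    m->present |= 1UL << cpuid;

    if (mask_weight(m->possible) < nr_cpu_ids)
        m->possible |= 1UL << cpuid;
}

/* Probe the given cpuids in order and return the resulting possible mask. */
static unsigned long probe_cpus(const int *cpuids, int n, int nr_cpu_ids)
{
    struct cpu_masks m = { 0, 0 };
    int i;

    for (i = 0; i < n; i++)
        record_one_cpu(&m, cpuids[i], nr_cpu_ids);
    return m.possible;
}
```

Probing cpuids 0 and 2 with a limit of 8 yields the mask 0x5, so both
CPUs stay possible under their real ids; with a limit of 1 only CPU0 is
kept (mask 0x1). Because the bits are set by cpuid at probe time, no
later pass has to rewrite the mask.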


2024-03-29 20:11:19

by Nick Bowler

Subject: Re: PROBLEM: Only one CPU active on Ultra 60 since ~4.8 (regression)

Hi Sam,

On 2024-03-29 05:44, Sam Ravnborg wrote:
> I took a look at this and may have a fix. Could you try the following
> patch. It builds - but I have not tested it.

With this patch applied on top of 6.9-rc1, both CPUs appear to come up:

% cat /proc/cpuinfo
[...]
ncpus probed : 2
ncpus active : 2
[...]
State:
CPU0: online
CPU2: online

Thanks,
Nick

2024-03-30 09:17:24

by Sam Ravnborg

Subject: Re: PROBLEM: Only one CPU active on Ultra 60 since ~4.8 (regression)

On Fri, Mar 29, 2024 at 04:11:06PM -0400, Nick Bowler wrote:
> Hi Sam,
>
> On 2024-03-29 05:44, Sam Ravnborg wrote:
> > I took a look at this and may have a fix. Could you try the following
> > patch. It builds - but I have not tested it.
>
> With this patch applied on top of 6.9-rc1, both CPUs appear to come up:
>
> % cat /proc/cpuinfo
> [...]
> ncpus probed : 2
> ncpus active : 2
> [...]
> State:
> CPU0: online
> CPU2: online

Thanks, I will add a Tested-by: Nick Bowler <[email protected]>
and submit the patch properly along with a few other sparc64 related
fixes.

Sam

2024-04-05 15:11:56

by Andreas Larsson

Subject: Re: PROBLEM: Only one CPU active on Ultra 60 since ~4.8 (regression)



On 2024-03-28 21:09, Linus Torvalds wrote:
> On Thu, 28 Mar 2024 at 12:36, Linux regression tracking (Thorsten
> Leemhuis) <[email protected]> wrote:
>>
>> [CCing Linus, in case I say something to his disliking]
>>
>> On 22.03.24 05:57, Nick Bowler wrote:
>>>
>>> Just a friendly reminder that this issue still happens on Linux 6.8 and
>>> reverting commit 9b2f753ec237 as indicated below is still sufficient to
>>> resolve the problem.
>>
>> FWIW, that commit 9b2f753ec23710 ("sparc64: Fix cpu_possible_mask if
>> nr_cpus is set") is from v4.8. Reverting it after all that time might
>> easily lead to even bigger trouble.
>
> I'm definitely not reverting a patch from almost a decade ago as a regression.
>
> If it took that long to find, it can't be that critical of a regression.
>
> So yes, let's treat it as a regular bug. And let's bring in Andreas to
> the discussion too (although presumably he has seen it on the
> sparclinux mailing list).
Yes, I am aware and I agree we should treat it as a regular bug.

Reverting it as a regression fix would lead to followup issues like
canceling the effect of commit ebb99a4c12e4 ("sparc64: Fix irq stack
bootmem allocation.") but with misleading comments left in place.

Sam's fix looks like a good solution for me to pick up into my
for-next branch.

Thanks,
Andreas