2022-10-24 18:53:34

by Joe Korty

Subject: [PATCH v2] arm64: arch_timer: XGene-1: TVAL register math error breaks timer expiry calculation.

arm64: arch_timer: XGene-1: TVAL register math error breaks timer expiry calculation.

The TVAL register is 32 bit signed. Thus only the lower 31 bits are
available to specify when an interrupt is to occur at some time in the
near future. Attempting to specify a larger interval with TVAL results
in a negative time delta which means the timer fires immediately upon
being programmed, rather than firing at that expected future time.

The solution is for linux to declare that TVAL is a 31 bit register rather
than give its true size of 32 bits. This prevents linux from programming
TVAL with a too-large value. Note that, prior to 5.16, this little trick
was the standard way to handle TVAL in linux, so there is nothing new
happening here on that front.

Test procedure: for some reason, the lockup watchdog is sensitive to
this bug. When we turn the watchdog off, then run a little 'hello world'
program in each of the CPUs, there will often be one that either 1) hangs
forever, or 2) hangs for what seems like forever, but actually continues
after a few minutes. In either case, the program cannot be freed by a ^C.
This test sequence requires CONFIG_SOFTLOCKUP_DETECTOR, and probably
requires that one of the NO_HZ Kconfig options be specified.

The sequence is, for an 8 cpu Mustang XGene-1:

echo 0 >/proc/sys/kernel/watchdog
for i in {0..7}; do taskset -c $i echo hi there $i; done

Note that though the hangup usually happens, it does not
always happen.

Some comments on the v1 version of this patch by Marc Zyngier:

XGene implements CVAL (a 64bit comparator) in terms of TVAL (a countdown
register) instead of the other way around. TVAL being a 32bit register,
the width of the counter should equally be 32. However, TVAL is a
*signed* value, and keeps counting down in the negative range once the
timer fires.

It means that any TVAL value with bit 31 set will fire immediately,
as it cannot be distinguished from an already expired timer. Reducing
the timer range back to a paltry 31 bits papers over the issue.

Another problem cannot be fixed though, which is that the timer interrupt
*must* be handled within the negative countdown period, or the interrupt
will be lost (TVAL will rollover to a positive value, indicative of a
new timer deadline).

[ v2: Expanded CC list - jak ]
[ v2: Revamped changelog - jak ]
[ v2: streamlined inlined comments - jak ]

Cc: [email protected] # 5.16+
Fixes: 012f18850452 ("clocksource/drivers/arm_arch_timer: Work around broken CVAL implementations")
Signed-off-by: Joe Korty <[email protected]>
---
base-commit: v6.0
Index: b/drivers/clocksource/arm_arch_timer.c
===================================================================
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -804,6 +804,9 @@ static u64 __arch_timer_check_delta(void
/*
* XGene-1 implements CVAL in terms of TVAL, meaning
* that the maximum timer range is 32bit. Shame on them.
+ *
+ * Note that TVAL is signed, thus has only 31 of its
+ * 32 bits to express magnitude.
*/
MIDR_ALL_VERSIONS(MIDR_CPU_MODEL(ARM_CPU_IMP_APM,
APM_CPU_PART_POTENZA)),
@@ -811,8 +814,8 @@ static u64 __arch_timer_check_delta(void
};

if (is_midr_in_range_list(read_cpuid_id(), broken_cval_midrs)) {
- pr_warn_once("Broken CNTx_CVAL_EL1, limiting width to 32bits");
- return CLOCKSOURCE_MASK(32);
+ pr_warn_once("Broken CNTx_CVAL_EL1, using 32 bit TVAL instead.\n");
+ return CLOCKSOURCE_MASK(31);
}
#endif
return CLOCKSOURCE_MASK(arch_counter_get_width());
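
To make the sign problem concrete, here is a minimal sketch in plain userspace C, not the kernel code. The helper names low_bits_mask(), tval_looks_expired() and clamp_delta() are hypothetical and exist only for this illustration; low_bits_mask() is meant to mimic the value that CLOCKSOURCE_MASK() yields, on the assumption that it is simply a mask of the low bits.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: mask with the low `bits` bits set, standing in
 * for the value the kernel's CLOCKSOURCE_MASK(bits) is assumed to give.
 */
static inline uint64_t low_bits_mask(unsigned int bits)
{
	return bits >= 64 ? UINT64_MAX : (1ULL << bits) - 1;
}

/*
 * A countdown value reads as "already expired" when, interpreted as a
 * signed 32-bit quantity, it is zero or negative. The cast assumes the
 * usual two's complement conversion.
 */
static inline bool tval_looks_expired(uint32_t tval)
{
	return (int32_t)tval <= 0;
}

/* Limit a requested delta (in ticks) to the advertised maximum. */
static inline uint64_t clamp_delta(uint64_t delta, uint64_t max_delta)
{
	return delta > max_delta ? max_delta : delta;
}
```

With a 32-bit limit, a request of 0xC0000000 ticks passes through clamp_delta() unchanged and tval_looks_expired() reports true, so the timer would appear to have fired already. With a 31-bit limit the same request is clamped to 0x7fffffff, which is still a positive countdown.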


2022-10-25 12:43:08

by Marc Zyngier

Subject: Re: [PATCH v2] arm64: arch_timer: XGene-1: TVAL register math error breaks timer expiry calculation.

Hi Joe,

Thanks for respinning this. Some comments below. Nothing major, but
worth keeping in mind for your next patches.

On Mon, 24 Oct 2022 17:54:22 +0100,
Joe Korty <[email protected]> wrote:
>
> arm64: arch_timer: XGene-1: TVAL register math error breaks timer expiry calculation.
>

nit: this line is already in the email subject which will be captured
when the patch is applied. No need to repeat it. Also, no final '.' at
the end of the patch subject.

> The TVAL register is 32 bit signed. Thus only the lower 31 bits are
> available to specify when an interrupt is to occur at some time in the
> near future. Attempting to specify a larger interval with TVAL results
> in a negative time delta which means the timer fires immediately upon
> being programmed, rather than firing at that expected future time.
>
> The solution is for linux to declare that TVAL is a 31 bit register rather

nit: s/linux/Linux/

> than give its true size of 32 bits. This prevents linux from programming
> TVAL with a too-large value. Note that, prior to 5.16, this little trick
> was the standard way to handle TVAL in linux, so there is nothing new
> happening here on that front.
>
> Test procedure: for some reason, the lockup watchdog is sensitive to
> this bug.

My interpretation is that the softlockup detector hides the issue,
because it keeps generating short timer deadlines that are within the
scope of the broken timer.

Disable it, and you start using NO_HZ with much longer timer
deadlines, which turns into an interrupt flood:

11: 1124855130 949168462 758009394 76417474 104782230 30210281 310890 1734323687 GICv2 29 Level arch_timer

And "much longer" isn't that long: it takes less than 43s to underflow
TVAL at 50MHz (the frequency of the counter on XGene-1).
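
The 43s figure can be checked with a little arithmetic: a signed 32-bit TVAL has 2^31 - 1 positive ticks, and at 50MHz that is just under 43 seconds. A rough sketch in userspace C follows; the macro and helper names are made up for illustration and are not kernel identifiers.

```c
#include <stdint.h>

/* Counter frequency on XGene-1, as stated above. */
#define XGENE1_TIMER_HZ 50000000ULL

/* Largest positive value a signed 32-bit countdown can hold. */
static inline uint64_t tval_max_positive_ticks(void)
{
	return (1ULL << 31) - 1;
}

/* Whole milliseconds covered by `ticks` counter ticks at `hz`. */
static inline uint64_t ticks_to_ms(uint64_t ticks, uint64_t hz)
{
	return ticks * 1000ULL / hz;
}
```

Calling ticks_to_ms(tval_max_positive_ticks(), XGENE1_TIMER_HZ) gives 42949 ms, so any deadline further out than roughly 42.9s cannot be expressed as a positive TVAL on this hardware.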

> When we turn the watchdog off, then run a little 'hello world'
> program in each of the CPUs, there will often be one that either 1) hangs
> forever, or 2) hangs for what seems like forever, but actually continues
> after a few minutes. In either case, the program cannot be freed by a ^C.
> This test sequence requires CONFIG_SOFTLOCKUP_DETECTOR, and probably
> requires that one of the NO_HZ Kconfig options be specified.
>
> The sequence is, for an 8 cpu Mustang XGene-1:
>
> echo 0 >/proc/sys/kernel/watchdog
> for i in {0..7}; do taskset -c $i echo hi there $i; done
>
> Note that though the hangup usually happens, it does not
> always happen.

The first line is enough to kill the machine here. Nice one! :D

>
> Some comments on the v1 version of this patch by Marc Zyngier:

nit: mentioning previous versions in the commit message isn't very
helpful, as the git tree won't carry that patch. Just saying
"additional comment from John Doe:" is enough.

>
> XGene implements CVAL (a 64bit comparator) in terms of TVAL (a countdown
> register) instead of the other way around. TVAL being a 32bit register,
> the width of the counter should equally be 32. However, TVAL is a
> *signed* value, and keeps counting down in the negative range once the
> timer fires.
>
> It means that any TVAL value with bit 31 set will fire immediately,
> as it cannot be distinguished from an already expired timer. Reducing
> the timer range back to a paltry 31 bits papers over the issue.
>
> Another problem cannot be fixed though, which is that the timer interrupt
> *must* be handled within the negative countdown period, or the interrupt
> will be lost (TVAL will rollover to a positive value, indicative of a
> new timer deadline).
>
> [ v2: Expanded CC list - jak ]
> [ v2: Revamped changelog - jak ]
> [ v2: streamlined inlined comments - jak ]

This information should be stashed below the '---' line so that it
isn't captured in the commit when applying the patch.

>
> Cc: [email protected] # 5.16+
> Fixes: 012f18850452 ("clocksource/drivers/arm_arch_timer: Work around broken CVAL implementations")
> Signed-off-by: Joe Korty <[email protected]>
> ---
> base-commit: v6.0
> Index: b/drivers/clocksource/arm_arch_timer.c
> ===================================================================
> --- a/drivers/clocksource/arm_arch_timer.c
> +++ b/drivers/clocksource/arm_arch_timer.c
> @@ -804,6 +804,9 @@ static u64 __arch_timer_check_delta(void
> /*
> * XGene-1 implements CVAL in terms of TVAL, meaning
> * that the maximum timer range is 32bit. Shame on them.
> + *
> + * Note that TVAL is signed, thus has only 31 of its
> + * 32 bits to express magnitude.
> */
> MIDR_ALL_VERSIONS(MIDR_CPU_MODEL(ARM_CPU_IMP_APM,
> APM_CPU_PART_POTENZA)),
> @@ -811,8 +814,8 @@ static u64 __arch_timer_check_delta(void
> };
>
> if (is_midr_in_range_list(read_cpuid_id(), broken_cval_midrs)) {
> - pr_warn_once("Broken CNTx_CVAL_EL1, limiting width to 32bits");
> - return CLOCKSOURCE_MASK(32);
> + pr_warn_once("Broken CNTx_CVAL_EL1, using 32 bit TVAL instead.\n");

s/TVAL/CNTx_TVAL_EL1/ instead.

> + return CLOCKSOURCE_MASK(31);
> }
> #endif
> return CLOCKSOURCE_MASK(arch_counter_get_width());
>

With the commit message suitably amended:

Reviewed-by: Marc Zyngier <[email protected]>

Daniel, would you mind fixing it up when applying this patch? XGene is
trivially broken without this fix, and it would be good if it could
make it in one of the 6.1-rc.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2022-11-03 16:03:26

by Marc Zyngier

Subject: Re: [PATCH v2] arm64: arch_timer: XGene-1: TVAL register math error breaks timer expiry calculation.

On 2022-10-25 13:31, Marc Zyngier wrote:
> Daniel, would you mind fixing it up when applying this patch? XGene is
> trivially broken without this fix, and it would be good if it could
> make it in one of the 6.1-rc.

Daniel, are you able to take this patch? I don't mind respinning
it myself if necessary.

Thanks,

M.
--
Jazz is not dead. It just smells funny...

Subject: [tip: timers/urgent] clocksource/drivers/arm_arch_timer: Fix XGene-1 TVAL register math error

The following commit has been merged into the timers/urgent branch of tip:

Commit-ID: 839a973988a94c15002cbd81536e4af6ced2bd30
Gitweb: https://git.kernel.org/tip/839a973988a94c15002cbd81536e4af6ced2bd30
Author: Joe Korty <[email protected]>
AuthorDate: Mon, 21 Nov 2022 14:53:43
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Mon, 21 Nov 2022 16:01:56 +01:00

clocksource/drivers/arm_arch_timer: Fix XGene-1 TVAL register math error

The TVAL register is 32 bit signed. Thus only the lower 31 bits are
available to specify when an interrupt is to occur at some time in the
near future. Attempting to specify a larger interval with TVAL results
in a negative time delta which means the timer fires immediately upon
being programmed, rather than firing at that expected future time.

The solution is for Linux to declare that TVAL is a 31 bit register rather
than give its true size of 32 bits. This prevents Linux from programming
TVAL with a too-large value. Note that, prior to 5.16, this little trick
was the standard way to handle TVAL in Linux, so there is nothing new
happening here on that front.

The softlockup detector hides the issue, because it keeps generating
short timer deadlines that are within the scope of the broken timer.

Disable it, and you start using NO_HZ with much longer timer deadlines,
which turns into an interrupt flood:

11: 1124855130 949168462 758009394 76417474 104782230 30210281
310890 1734323687 GICv2 29 Level arch_timer

And "much longer" isn't that long: it takes less than 43s to underflow
TVAL at 50MHz (the frequency of the counter on XGene-1).

Some comments on the v1 version of this patch by Marc Zyngier:

XGene implements CVAL (a 64bit comparator) in terms of TVAL (a countdown
register) instead of the other way around. TVAL being a 32bit register,
the width of the counter should equally be 32. However, TVAL is a
*signed* value, and keeps counting down in the negative range once the
timer fires.

It means that any TVAL value with bit 31 set will fire immediately,
as it cannot be distinguished from an already expired timer. Reducing
the timer range back to a paltry 31 bits papers over the issue.

Another problem cannot be fixed though, which is that the timer interrupt
*must* be handled within the negative countdown period, or the interrupt
will be lost (TVAL will rollover to a positive value, indicative of a
new timer deadline).

Fixes: 012f18850452 ("clocksource/drivers/arm_arch_timer: Work around broken CVAL implementations")
Signed-off-by: Joe Korty <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]

[maz: revamped the commit message]
---
drivers/clocksource/arm_arch_timer.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index a7ff775..933bb96 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -806,6 +806,9 @@ static u64 __arch_timer_check_delta(void)
/*
* XGene-1 implements CVAL in terms of TVAL, meaning
* that the maximum timer range is 32bit. Shame on them.
+ *
+ * Note that TVAL is signed, thus has only 31 of its
+ * 32 bits to express magnitude.
*/
MIDR_ALL_VERSIONS(MIDR_CPU_MODEL(ARM_CPU_IMP_APM,
APM_CPU_PART_POTENZA)),
@@ -813,8 +816,8 @@ static u64 __arch_timer_check_delta(void)
};

if (is_midr_in_range_list(read_cpuid_id(), broken_cval_midrs)) {
- pr_warn_once("Broken CNTx_CVAL_EL1, limiting width to 32bits");
- return CLOCKSOURCE_MASK(32);
+ pr_warn_once("Broken CNTx_CVAL_EL1, using 31 bit TVAL instead.\n");
+ return CLOCKSOURCE_MASK(31);
}
#endif
return CLOCKSOURCE_MASK(arch_counter_get_width());

2022-12-02 13:56:42

by Daniel Lezcano

Subject: Re: [PATCH v2] arm64: arch_timer: XGene-1: TVAL register math error breaks timer expiry calculation.


Hi Marc,

On 03/11/2022 16:05, Marc Zyngier wrote:
> On 2022-10-25 13:31, Marc Zyngier wrote:
>> Daniel, would you mind fixing it up when applying this patch? XGene is
>> trivially broken without this fix, and it would be good if it could
>> make it in one of the 6.1-rc.
>
> Daniel, are you able to take this patch? I don't mind respinning
> it myself if necessary.

Yes please, if you can take care of updating the patch, that will help.
I've been on leave for a long time and I'm still processing all the
submitted changes I received in the meantime.



2022-12-02 14:23:11

by Thomas Gleixner

Subject: Re: [PATCH v2] arm64: arch_timer: XGene-1: TVAL register math error breaks timer expiry calculation.

On Fri, Dec 02 2022 at 13:36, Daniel Lezcano wrote:
> On 03/11/2022 16:05, Marc Zyngier wrote:
>> On 2022-10-25 13:31, Marc Zyngier wrote:
>>> Daniel, would you mind fixing it up when applying this patch? XGene is
>>> trivially broken without this fix, and it would be good if it could
>>> make it in one of the 6.1-rc.
>>
>> Daniel, are you able to take this patch? I don't mind respinning
>> it myself if necessary.
>
> Yes please, if you can take care of updating the patch, that will help.
> I've been on leave for a long time and I'm still processing all the
> submitted changes I received in the meantime.

It's in Linus' tree already.

Subject: [tip: timers/core] clocksource/drivers/arm_arch_timer: Fix XGene-1 TVAL register math error

The following commit has been merged into the timers/core branch of tip:

Commit-ID: 45ae272a948a03a7d55748bf52d2f47d3b4e1d5a
Gitweb: https://git.kernel.org/tip/45ae272a948a03a7d55748bf52d2f47d3b4e1d5a
Author: Joe Korty <[email protected]>
AuthorDate: Mon, 21 Nov 2022 14:53:43
Committer: Daniel Lezcano <[email protected]>
CommitterDate: Fri, 02 Dec 2022 12:48:28 +01:00

clocksource/drivers/arm_arch_timer: Fix XGene-1 TVAL register math error

The TVAL register is 32 bit signed. Thus only the lower 31 bits are
available to specify when an interrupt is to occur at some time in the
near future. Attempting to specify a larger interval with TVAL results
in a negative time delta which means the timer fires immediately upon
being programmed, rather than firing at that expected future time.

The solution is for Linux to declare that TVAL is a 31 bit register rather
than give its true size of 32 bits. This prevents Linux from programming
TVAL with a too-large value. Note that, prior to 5.16, this little trick
was the standard way to handle TVAL in Linux, so there is nothing new
happening here on that front.

The softlockup detector hides the issue, because it keeps generating
short timer deadlines that are within the scope of the broken timer.

Disable it, and you start using NO_HZ with much longer timer deadlines,
which turns into an interrupt flood:

11: 1124855130 949168462 758009394 76417474 104782230 30210281
310890 1734323687 GICv2 29 Level arch_timer

And "much longer" isn't that long: it takes less than 43s to underflow
TVAL at 50MHz (the frequency of the counter on XGene-1).

Some comments on the v1 version of this patch by Marc Zyngier:

XGene implements CVAL (a 64bit comparator) in terms of TVAL (a countdown
register) instead of the other way around. TVAL being a 32bit register,
the width of the counter should equally be 32. However, TVAL is a
*signed* value, and keeps counting down in the negative range once the
timer fires.

It means that any TVAL value with bit 31 set will fire immediately,
as it cannot be distinguished from an already expired timer. Reducing
the timer range back to a paltry 31 bits papers over the issue.

Another problem cannot be fixed though, which is that the timer interrupt
*must* be handled within the negative countdown period, or the interrupt
will be lost (TVAL will rollover to a positive value, indicative of a
new timer deadline).

Cc: [email protected] # 5.16+
Fixes: 012f18850452 ("clocksource/drivers/arm_arch_timer: Work around broken CVAL implementations")
Signed-off-by: Joe Korty <[email protected]>
Reviewed-by: Marc Zyngier <[email protected]>
[maz: revamped the commit message]
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/clocksource/arm_arch_timer.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index 9c3420a..e2920da 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -806,6 +806,9 @@ static u64 __arch_timer_check_delta(void)
/*
* XGene-1 implements CVAL in terms of TVAL, meaning
* that the maximum timer range is 32bit. Shame on them.
+ *
+ * Note that TVAL is signed, thus has only 31 of its
+ * 32 bits to express magnitude.
*/
MIDR_ALL_VERSIONS(MIDR_CPU_MODEL(ARM_CPU_IMP_APM,
APM_CPU_PART_POTENZA)),
@@ -813,8 +816,8 @@ static u64 __arch_timer_check_delta(void)
};

if (is_midr_in_range_list(read_cpuid_id(), broken_cval_midrs)) {
- pr_warn_once("Broken CNTx_CVAL_EL1, limiting width to 32bits");
- return CLOCKSOURCE_MASK(32);
+ pr_warn_once("Broken CNTx_CVAL_EL1, using 32 bit TVAL instead.\n");
+ return CLOCKSOURCE_MASK(31);
}
#endif
return CLOCKSOURCE_MASK(arch_counter_get_width());