LinuxLists.cc - [PATCH] clocksource: disable irq when holding watchdog

[permalink] [raw]

Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.

On Mon, Oct 16 2023 at 10:46, John Stultz wrote:
> On Fri, Oct 13, 2023 at 7:51 AM Tetsuo Handa
> <[email protected]> wrote:
>>
>> Lockdep found that spin_lock(&watchdog_lock) from call_timer_fn()
>> is not safe. Use spin_lock_irqsave(&watchdog_lock, flags) instead.
>>
>> [ 0.378387] TSC synchronization [CPU#0 -> CPU#1]:
>> [ 0.378387] Measured 55060 cycles TSC warp between CPUs, turning off TSC clock.

<SNIP/>

Please trim splats down to the relevant content. Timestamps and the bulk
of the backtraces are completely uninteresting. Something like this is
completely sufficient and readable:

tsc: Marking TSC unstable due to check_tsc_sync_source failed

WARNING: inconsistent lock state
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
ffffffff8cfa1c78 (watchdog_lock){?.-.}-{2:2}, at: clocksource_watchdog+0x23/0x5a0
{IN-HARDIRQ-W} state was registered at:
_raw_spin_lock_irqsave+0x3f/0x60
clocksource_mark_unstable+0x1b/0x90
mark_tsc_unstable+0x41/0x50
check_tsc_sync_source+0x14f/0x180
sysvec_call_function_single+0x69/0x90

Possible unsafe locking scenario:
lock(watchdog_lock);
<Interrupt>
lock(watchdog_lock);

stack backtrace:
_raw_spin_lock+0x30/0x40
clocksource_watchdog+0x23/0x5a0
run_timer_softirq+0x2a/0x50
sysvec_apic_timer_interrupt+0x6e/0x90

20 lines of useful information vs. ~100 lines containing mostly
noise. No?

Documentation/process/* clearly asks for this.

>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>> index c108ed8a9804..4e8fc0a5ca9d 100644
>> --- a/kernel/time/clocksource.c
>> +++ b/kernel/time/clocksource.c
>> @@ -405,8 +405,9 @@ static void clocksource_watchdog(struct timer_list *unused)
>> enum wd_read_status read_ret;
>> unsigned long extra_wait = 0;
>> u32 md;
>> + unsigned long flags;

https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#variable-declarations

>> - spin_lock(&watchdog_lock);
>> + spin_lock_irqsave(&watchdog_lock, flags);

That looks obvious, but it's just making the problem worse. See below.

> At first glance this looks sane to me.

Emphasis on "first glance" :)

> The only thing that gives me pause is that I'm a little shocked this
> hasn't been caught earlier, and I don't right off see recent changes
> that would have caused this to trip. But Thomas may have more insight.

From looking at the (for you neatly trimmed) backtrace:

{IN-HARDIRQ-W} state was registered at:
_raw_spin_lock_irqsave+0x3f/0x60
clocksource_mark_unstable+0x1b/0x90
mark_tsc_unstable+0x41/0x50
check_tsc_sync_source+0x14f/0x180
sysvec_call_function_single+0x69/0x90

it's bloody obvious which change caused that:

9d349d47f0e3 ("x86/smpboot: Make TSC synchronization function call based")

Even if you don't recognize it immediately then:

git grep -l check_tsc_sync_source | xargs git blame | grep check_tsc_sync_source

is not really rocket science either, right?

Vs. you being shocked: As long as that synchronization function does not
try to mark the TSC unstable, lockdep won't notice. Definitely
surprising that this took a few months to surface.

But that's not what shocks me...

That spinlock_irqsave() conversion is not a solution. Why?

If that happens on RT then this is even worse. mark_tsc_unstable()
is invoked from hard interrupt context via smp function call which
is a NONO due to watchdog_lock being a spinlock and the _irqsave()
change won't help anything.

Yes, I broke it and did not think about that particular side effect
when moving check_tsc_sync_source() from thread to smp function call
context.

But that's just a distraction. What's even worse is that the patch
blindly slaps spinlock_irqsave() at the problem without actually looking
what the function does and why it does not use the irqsave variant in
the first place.

The watchdog function deliberately runs with interrupts enabled and
only explicitly disables them for the actual readout because there is
no point in delaying interrupts for this horrorshow accross the whole
function.

But the explicit local_irq_disable()/enable() pair around the readout
means that the patch above is obviously broken:

spin_lock_irqsave(&watchdog_lock, flags);
cs_watchdog_read()
local_irq_disable();
...
local_irq_enable(); <--- FAIL

Lockdep should have yelled at that too, but even if it failed to do
so then it's still very obviously wrong.

Something like the uncompiled/untested below should cure it for real. It
really does not matter whether the TSC unstable event happens a bit
later. The system is unhappy no matter what.

That said, this whole clocksource watchdog mess wants a proper
overhaul. It has become a pile of warts and duct tape by now and after
staring at it long enough there is no real reason to run it in a timer
callback anymore. It just can move into delayed work and the whole
locking problem can be reduced to the clocksource_mutex and some well
thought out atomic operations to handle the mark unstable case. But
that's a different story and not relevant for curing the problem at
hand.

Thanks,

tglx
---
--- a/arch/x86/kernel/tsc_sync.c
+++ b/arch/x86/kernel/tsc_sync.c
@@ -15,6 +15,7 @@
* ( The serial nature of the boot logic and the CPU hotplug lock
* protects against more than 2 CPUs entering this code. )
*/
+#include <linux/workqueue.h>
#include <linux/topology.h>
#include <linux/spinlock.h>
#include <linux/kernel.h>
@@ -342,6 +343,13 @@ static inline unsigned int loop_timeout(
return (cpumask_weight(topology_core_cpumask(cpu)) > 1) ? 2 : 20;
}

+static void tsc_sync_mark_tsc_unstable(struct work_struct *work)
+{
+ mark_tsc_unstable("check_tsc_sync_source failed");
+}
+
+static DECLARE_WORK(tsc_sync_work, tsc_sync_mark_tsc_unstable);
+
/*
* The freshly booted CPU initiates this via an async SMP function call.
*/
@@ -395,7 +403,7 @@ static void check_tsc_sync_source(void *
"turning off TSC clock.\n", max_warp);
if (random_warps)
pr_warn("TSC warped randomly between CPUs\n");
- mark_tsc_unstable("check_tsc_sync_source failed");
+ schedule_work(&tsc_sync_work);
}

/*

2023-10-16 23:03:58

[permalink] [raw]

Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.

On Mon, Oct 16, 2023 at 11:47:55PM +0200, Thomas Gleixner wrote:
> On Mon, Oct 16 2023 at 10:46, John Stultz wrote:
> > On Fri, Oct 13, 2023 at 7:51 AM Tetsuo Handa
> > <[email protected]> wrote:
> >>
> >> Lockdep found that spin_lock(&watchdog_lock) from call_timer_fn()
> >> is not safe. Use spin_lock_irqsave(&watchdog_lock, flags) instead.
> >>
> >> [ 0.378387] TSC synchronization [CPU#0 -> CPU#1]:
> >> [ 0.378387] Measured 55060 cycles TSC warp between CPUs, turning off TSC clock.

[ . . . ]

> Something like the uncompiled/untested below should cure it for real. It
> really does not matter whether the TSC unstable event happens a bit
> later. The system is unhappy no matter what.

This does pass my acceptance tests:

Tested-by: Paul E. McKenney <[email protected]>

> That said, this whole clocksource watchdog mess wants a proper
> overhaul. It has become a pile of warts and duct tape by now and after
> staring at it long enough there is no real reason to run it in a timer
> callback anymore. It just can move into delayed work and the whole
> locking problem can be reduced to the clocksource_mutex and some well
> thought out atomic operations to handle the mark unstable case. But
> that's a different story and not relevant for curing the problem at
> hand.

Moving the code to delayed work seems quite reasonable.

But Thomas, you do understand that the way things have been going for
the clocksource watchdog, pushing it out to delayed work will no doubt
add yet more hair on large busy systems, right? Yeah, yeah, I know,
delayed work shouldn't be any worse than ksoftirqd. The key word of
course being "shouldn't". ;-)

Thanx, Paul

> Thanks,
>
> tglx
> ---
> --- a/arch/x86/kernel/tsc_sync.c
> +++ b/arch/x86/kernel/tsc_sync.c
> @@ -15,6 +15,7 @@
> * ( The serial nature of the boot logic and the CPU hotplug lock
> * protects against more than 2 CPUs entering this code. )
> */
> +#include <linux/workqueue.h>
> #include <linux/topology.h>
> #include <linux/spinlock.h>
> #include <linux/kernel.h>
> @@ -342,6 +343,13 @@ static inline unsigned int loop_timeout(
> return (cpumask_weight(topology_core_cpumask(cpu)) > 1) ? 2 : 20;
> }
>
> +static void tsc_sync_mark_tsc_unstable(struct work_struct *work)
> +{
> + mark_tsc_unstable("check_tsc_sync_source failed");
> +}
> +
> +static DECLARE_WORK(tsc_sync_work, tsc_sync_mark_tsc_unstable);
> +
> /*
> * The freshly booted CPU initiates this via an async SMP function call.
> */
> @@ -395,7 +403,7 @@ static void check_tsc_sync_source(void *
> "turning off TSC clock.\n", max_warp);
> if (random_warps)
> pr_warn("TSC warped randomly between CPUs\n");
> - mark_tsc_unstable("check_tsc_sync_source failed");
> + schedule_work(&tsc_sync_work);
> }
>
> /*

2023-10-17 06:50:04

[permalink] [raw]

Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.

On Mon, Oct 16 2023 at 16:03, Paul E. McKenney wrote:
> On Mon, Oct 16, 2023 at 11:47:55PM +0200, Thomas Gleixner wrote:
>> That said, this whole clocksource watchdog mess wants a proper
>> overhaul. It has become a pile of warts and duct tape by now and after
>> staring at it long enough there is no real reason to run it in a timer
>> callback anymore. It just can move into delayed work and the whole
>> locking problem can be reduced to the clocksource_mutex and some well
>> thought out atomic operations to handle the mark unstable case. But
>> that's a different story and not relevant for curing the problem at
>> hand.
>
> Moving the code to delayed work seems quite reasonable.
>
> But Thomas, you do understand that the way things have been going for
> the clocksource watchdog, pushing it out to delayed work will no doubt
> add yet more hair on large busy systems, right? Yeah, yeah, I know,
> delayed work shouldn't be any worse than ksoftirqd. The key word of
> course being "shouldn't". ;-)

Yes, I'm aware of that. I still think it's worth at least to try it.

Thanks,

tglx

2023-10-17 10:38:50

[permalink] [raw]

Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.

On 2023/10/17 6:47, Thomas Gleixner wrote:
> Something like the uncompiled/untested below should cure it for real. It
> really does not matter whether the TSC unstable event happens a bit
> later. The system is unhappy no matter what.

Thank you. That patch fixes the lockdep warning.

But I'm hitting something different (but might be timer/scheduler related) problem.
What config option would cause taking more than 2 minutes to bring up only 8 CPUs?
(This environment is Oracle VM VirtualBox on Windows 11 host.)

Before:
----------------------------------------
[ 0.871340] Callback from call_rcu_tasks_trace() invoked.
[ 0.872337] rcu: Max phase no-delay instances is 400.
[ 0.875815] NMI watchdog: Perf NMI watchdog permanently disabled
[ 0.877552] smp: Bringing up secondary CPUs ...
[ 0.879599] smpboot: x86: Booting SMP configuration:
[ 0.881333] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
[ 0.892319] TSC synchronization [CPU#0 -> CPU#1]:
[ 0.892319] Measured 66852 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.892319] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[ 1.108337]
[ 1.109319] ================================
[ 1.109319] WARNING: inconsistent lock state
[ 1.109319] 6.6.0-rc5+ #7 Not tainted
[ 1.109319] --------------------------------
[ 1.109319] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ 1.109319] rcu_tasks_kthre/13 [HC0[0]:SC1[1]:HE1:SE0] takes:
[ 1.109319] ffffffff86a85818 (watchdog_lock){?.-.}-{2:2}, at: clocksource_watchdog+0x26/0x620
[ 1.109319] {IN-HARDIRQ-W} state was registered at:
[ 1.109319] lock_acquire+0xca/0x2d0
[ 1.109319] _raw_spin_lock_irqsave+0x4f/0x70
[ 1.109319] clocksource_mark_unstable+0x20/0xa0
[ 1.109319] mark_tsc_unstable+0x4a/0x70
[ 1.109319] check_tsc_sync_source+0x169/0x1d0
[ 1.109319] __flush_smp_call_function_queue+0x16d/0x590
[ 1.109319] generic_smp_call_function_single_interrupt+0x17/0x20
[ 1.109319] __sysvec_call_function_single+0x3a/0x110
[ 1.109319] sysvec_call_function_single+0x7f/0x90
[ 1.109319] asm_sysvec_call_function_single+0x1f/0x30
[ 1.109319] pv_native_safe_halt+0xf/0x20
[ 1.109319] arch_cpu_idle+0xd/0x20
[ 1.109319] default_idle_call+0x83/0x180
[ 1.109319] do_idle+0x21a/0x260
[ 1.109319] cpu_startup_entry+0x31/0x40
[ 1.109319] rest_init+0x10f/0x200
[ 1.109319] arch_call_rest_init+0x12/0x50
[ 1.109319] start_kernel+0x80b/0xa30
[ 1.109319] x86_64_start_reservations+0x1c/0x30
[ 1.109319] x86_64_start_kernel+0xbf/0x110
[ 1.109319] secondary_startup_64_no_verify+0x17d/0x18b
[ 1.109319] irq event stamp: 232
[ 1.109319] hardirqs last enabled at (232): [<ffffffff850c93ac>] _raw_spin_unlock_irq+0x2c/0x60
[ 1.109319] hardirqs last disabled at (231): [<ffffffff850c9081>] _raw_spin_lock_irq+0x61/0x70
[ 1.109319] softirqs last enabled at (88): [<ffffffff841c200f>] rcu_tasks_invoke_cbs+0xff/0x2f0
[ 1.109319] softirqs last disabled at (173): [<ffffffff840ece40>] __irq_exit_rcu+0x80/0xb0
[ 1.109319]
[ 1.109319] other info that might help us debug this:
[ 1.109319] Possible unsafe locking scenario:
[ 1.109319]
[ 1.109319] CPU0
[ 1.109319] ----
[ 1.109319] lock(watchdog_lock);
[ 1.109319] <Interrupt>
[ 1.109319] lock(watchdog_lock);
[ 1.109319]
[ 1.109319] *** DEADLOCK ***
[ 1.109319]
[ 1.109319] 3 locks held by rcu_tasks_kthre/13:
[ 1.109319] #0: ffffffff869d6730 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x361/0x510
[ 1.109319] #1: ffffffff86a7ceb8 (rcu_state.exp_mutex){+.+.}-{3:3}, at: synchronize_rcu_expedited+0x43c/0x510
[ 1.109319] #2: ffffb67680003e68 ((&watchdog_timer)){+.-.}-{0:0}, at: call_timer_fn+0x7b/0x240
[ 1.109319]
[ 1.109319] stack backtrace:
[ 1.109319] CPU: 0 PID: 13 Comm: rcu_tasks_kthre Not tainted 6.6.0-rc5+ #7
[ 1.109319] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 1.109319] Call Trace:
[ 1.109319] <IRQ>
[ 1.109319] dump_stack_lvl+0x60/0xa0
[ 1.109319] dump_stack+0x14/0x20
[ 1.109319] print_usage_bug.part.0+0x25c/0x350
[ 1.109319] mark_lock.part.0+0x72f/0xa00
[ 1.109319] __lock_acquire+0x390/0x24f0
[ 1.109319] ? lockdep_unlock+0x56/0xc0
[ 1.109319] lock_acquire+0xca/0x2d0
[ 1.109319] ? clocksource_watchdog+0x26/0x620
[ 1.109319] ? __pfx_clocksource_watchdog+0x10/0x10
[ 1.109319] _raw_spin_lock+0x34/0x50
[ 1.109319] ? clocksource_watchdog+0x26/0x620
[ 1.109319] clocksource_watchdog+0x26/0x620
[ 1.109319] ? lock_acquire+0xca/0x2d0
[ 1.109319] ? __pfx_clocksource_watchdog+0x10/0x10
[ 1.109319] call_timer_fn+0xa4/0x240
[ 1.109319] ? __pfx_clocksource_watchdog+0x10/0x10
[ 1.109319] __run_timers.part.0+0x207/0x2d0
[ 1.109319] run_timer_softirq+0x2e/0x60
[ 1.109319] __do_softirq+0xc6/0x446
[ 1.109319] __irq_exit_rcu+0x80/0xb0
[ 1.109319] irq_exit_rcu+0x12/0x20
[ 1.109319] sysvec_apic_timer_interrupt+0x84/0x90
[ 1.109319] </IRQ>
[ 1.109319] <TASK>
[ 1.109319] asm_sysvec_apic_timer_interrupt+0x1f/0x30
[ 1.109319] RIP: 0010:_flat_send_IPI_mask+0x5d/0x70
[ 1.109319] Code: f8 c9 c3 cc cc cc cc 89 75 f4 e8 ae 24 20 00 8b 75 f4 ba 00 08 00 00 44 89 e7 e8 ae 63 ff ff e8 c9 23 20 00 fb 0f 1f 44 00 00 <4c> 8b 65 f8 c9 c3 cc cc cc cc 66 0f 1f 84 00 00 00 00 00 90 90 90
[ 1.109319] RSP: 0000:ffffb67680073ac0 EFLAGS: 00010202
[ 1.109319] RAX: 0000000000000073 RBX: 0000000000000000 RCX: 0000000000000040
[ 1.109319] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff840a4357
[ 1.109319] RBP: ffffb67680073ad0 R08: 0000000000000001 R09: 0000000000000001
[ 1.109319] R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000020
[ 1.109319] R13: 00000000000000fb R14: 000000000000001f R15: ffffffff869da348
[ 1.109319] ? _flat_send_IPI_mask+0x57/0x70
[ 1.109319] flat_send_IPI_mask+0x15/0x20
[ 1.109319] default_send_IPI_single+0x44/0x60
[ 1.109319] native_send_call_func_single_ipi+0x17/0x20
[ 1.109319] __smp_call_single_queue+0xd1/0x1c0
[ 1.109319] generic_exec_single+0x3a/0x1e0
[ 1.109319] smp_call_function_single+0xc2/0x140
[ 1.109319] ? _raw_spin_unlock_irqrestore+0x35/0x60
[ 1.109319] __sync_rcu_exp_select_node_cpus+0x284/0x500
[ 1.109319] sync_rcu_exp_select_cpus+0x232/0x550
[ 1.109319] synchronize_rcu_expedited+0x44e/0x510
[ 1.109319] ? lock_is_held_type+0xaf/0x120
[ 1.109319] synchronize_rcu+0x1ad/0x1e0
[ 1.109319] ? __mutex_lock+0x9b/0xd30
[ 1.109319] ? __mutex_lock+0xa86/0xd30
[ 1.109319] ? lock_acquire+0xca/0x2d0
[ 1.109319] ? find_held_lock+0x31/0x90
[ 1.109319] rcu_tasks_pregp_step+0x12/0x20
[ 1.109319] rcu_tasks_wait_gp+0x5c/0x430
[ 1.109319] rcu_tasks_one_gp+0x49f/0x510
[ 1.109319] ? rcu_tasks_one_gp+0x75/0x510
[ 1.109319] ? __pfx_rcu_tasks_kthread+0x10/0x10
[ 1.109319] rcu_tasks_kthread+0xb4/0xe0
[ 1.109319] kthread+0x109/0x140
[ 1.109319] ? __pfx_kthread+0x10/0x10
[ 1.109319] ret_from_fork+0x3d/0x60
[ 1.109319] ? __pfx_kthread+0x10/0x10
[ 1.109319] ret_from_fork_asm+0x1b/0x30
[ 1.109319] </TASK>
[ 40.748350] smp: Brought up 1 node, 8 CPUs
[ 48.744332] smpboot: Max logical packages: 1
[ 57.896327] tasks_rcu_exit_srcu_stall: rcu_tasks grace period number 5 (since boot) gp_state: RTGS_POST_SCAN_TASKLIST is 11129 jiffies old.
[ 69.324322] Please check any exiting tasks stuck between calls to exit_tasks_rcu_start() and exit_tasks_rcu_finish()
[ 83.516363] smpboot: Total of 8 processors activated (41779.32 BogoMIPS)
[ 103.157341] Callback from call_rcu_tasks_rude() invoked.
[ 105.961362] rcu_tasks_wait_gp: rcu_tasks grace period number 5 (since boot) is 104910 jiffies old.
[ 137.781409] rcu_tasks_wait_gp: rcu_tasks grace period number 5 (since boot) is 136730 jiffies old.
[ 142.589347] Callback from call_rcu_tasks() invoked.
----------------------------------------

After:
----------------------------------------
[ 0.000000] Linux version 6.6.0-rc5+ (root@ubuntu) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #8 SMP PREEMPT_DYNAMIC Tue Oct 17 09:56:49 UTC 2023
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.6.0-rc5+ root=UUID=a2500e32-bbc8-4f32-bf86-94a6ae160b4b ro console=ttyS0,115200n8 console=tty0
(...snipped...)
[ 0.340002] Dynamic Preempt: full
[ 0.340209] Running RCU self tests
[ 0.340210] Running RCU synchronous self tests
[ 0.340214] rcu: Preemptible hierarchical RCU implementation.
[ 0.340215] rcu: RCU lockdep checking is enabled.
[ 0.340215] rcu: RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=8.
[ 0.340217] Trampoline variant of Tasks RCU enabled.
[ 0.340217] Rude variant of Tasks RCU enabled.
[ 0.340218] Tracing variant of Tasks RCU enabled.
[ 0.340219] rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies.
[ 0.340219] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=8
[ 0.340275] Running RCU synchronous self tests
[ 0.343652] NR_IRQS: 524544, nr_irqs: 488, preallocated irqs: 16
[ 0.344029] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[ 0.376872] Console: colour VGA+ 80x25
[ 0.376893] printk: console [tty0] enabled
[ 0.585051] printk: console [ttyS0] enabled
[ 0.807871] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
[ 0.812562] ... MAX_LOCKDEP_SUBCLASSES: 8
[ 0.814986] ... MAX_LOCK_DEPTH: 48
[ 0.817368] ... MAX_LOCKDEP_KEYS: 8192
[ 0.819835] ... CLASSHASH_SIZE: 4096
[ 0.822452] ... MAX_LOCKDEP_ENTRIES: 32768
[ 0.825290] ... MAX_LOCKDEP_CHAINS: 65536
[ 0.827689] ... CHAINHASH_SIZE: 32768
[ 0.830278] memory used by lock dependency info: 6493 kB
[ 0.833130] memory used for stack traces: 4224 kB
[ 0.836051] per task-struct memory footprint: 1920 bytes
[ 0.838946] ACPI: Core revision 20230628
[ 0.841933] APIC: Switch to symmetric I/O mode setup
[ 0.855141] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 0.858422] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x25a39815946, max_idle_ns: 440795282683 ns
[ 0.863903] Calibrating delay loop (skipped) preset value.. 5222.41 BogoMIPS (lpj=2611208)
[ 0.864890] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 0.864890] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
[ 0.864890] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[ 0.864890] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!
[ 0.864890] Spectre V2 : Mitigation: Enhanced / Automatic IBRS
[ 0.864890] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[ 0.864903] Spectre V2 : Spectre v2 / PBRSB-eIBRS: Retire a single CALL on VMEXIT
[ 0.866894] RETBleed: Mitigation: Enhanced IBRS
[ 0.868893] Speculative Store Bypass: Vulnerable
[ 0.870902] x86/fpu: x87 FPU will use FXSAVE
[ 0.896596] Freeing SMP alternatives memory: 40K
[ 0.897898] pid_max: default: 32768 minimum: 301
[ 0.900001] LSM: initializing lsm=capability,bpf
[ 0.901913] LSM support for eBPF active
[ 0.904221] Mount-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
[ 0.907152] Mountpoint-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
[ 0.910834] Running RCU synchronous self tests
[ 0.911894] Running RCU synchronous self tests
[ 1.029893] APIC calibration not consistent with PM-Timer: 148ms instead of 100ms
[ 1.030890] APIC delta adjusted to PM-Timer: 6288407 (9311079)
[ 1.031135] smpboot: CPU0: 12th Gen Intel(R) Core(TM) i7-1255U (family: 0x6, model: 0x9a, stepping: 0x4)
[ 1.031890] RCU Tasks: Setting shift to 3 and lim to 1 rcu_task_cb_adjust=1.
[ 1.031890] RCU Tasks Rude: Setting shift to 3 and lim to 1 rcu_task_cb_adjust=1.
[ 1.031890] RCU Tasks Trace: Setting shift to 3 and lim to 1 rcu_task_cb_adjust=1.
[ 1.031890] Running RCU-tasks wait API self tests
[ 1.136395] Performance Events: unsupported p6 CPU model 154 no PMU driver, software events only.
[ 1.138894] signal: max sigframe size: 1440
[ 1.141093] rcu: Hierarchical SRCU implementation.
[ 1.142904] Callback from call_rcu_tasks_trace() invoked.
[ 1.144897] rcu: Max phase no-delay instances is 400.
[ 1.148535] NMI watchdog: Perf NMI watchdog permanently disabled
[ 1.150496] smp: Bringing up secondary CPUs ...
[ 1.153270] smpboot: x86: Booting SMP configuration:
[ 1.154900] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
[ 1.163890] TSC synchronization [CPU#0 -> CPU#1]:
[ 1.163890] Measured 66840 cycles TSC warp between CPUs, turning off TSC clock.
[ 1.163890] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[ 26.569928] rcu_tasks_wait_gp: rcu_tasks grace period number 5 (since boot) is 25133 jiffies old.
[ 78.872032] Callback from call_rcu_tasks() invoked.
[ 91.327971] smp: Brought up 1 node, 8 CPUs
[ 96.402957] smpboot: Max logical packages: 1
[ 119.253936] Callback from call_rcu_tasks_rude() invoked.
[ 127.412964] smpboot: Total of 8 processors activated (41779.32 BogoMIPS)
----------------------------------------

2023-10-17 14:11:15

[permalink] [raw]

Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.

On Tue, Oct 17, 2023 at 07:37:50PM +0900, Tetsuo Handa wrote:
> On 2023/10/17 6:47, Thomas Gleixner wrote:
> > Something like the uncompiled/untested below should cure it for real. It
> > really does not matter whether the TSC unstable event happens a bit
> > later. The system is unhappy no matter what.
>
> Thank you. That patch fixes the lockdep warning.
>
> But I'm hitting something different (but might be timer/scheduler related) problem.
> What config option would cause taking more than 2 minutes to bring up only 8 CPUs?
> (This environment is Oracle VM VirtualBox on Windows 11 host.)
>
> Before:
> ----------------------------------------
> [ 0.871340] Callback from call_rcu_tasks_trace() invoked.
> [ 0.872337] rcu: Max phase no-delay instances is 400.
> [ 0.875815] NMI watchdog: Perf NMI watchdog permanently disabled
> [ 0.877552] smp: Bringing up secondary CPUs ...
> [ 0.879599] smpboot: x86: Booting SMP configuration:
> [ 0.881333] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
> [ 0.892319] TSC synchronization [CPU#0 -> CPU#1]:
> [ 0.892319] Measured 66852 cycles TSC warp between CPUs, turning off TSC clock.
> [ 0.892319] tsc: Marking TSC unstable due to check_tsc_sync_source failed
> [ 1.108337]
> [ 1.109319] ================================
> [ 1.109319] WARNING: inconsistent lock state
> [ 1.109319] 6.6.0-rc5+ #7 Not tainted
> [ 1.109319] --------------------------------
> [ 1.109319] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
> [ 1.109319] rcu_tasks_kthre/13 [HC0[0]:SC1[1]:HE1:SE0] takes:
> [ 1.109319] ffffffff86a85818 (watchdog_lock){?.-.}-{2:2}, at: clocksource_watchdog+0x26/0x620
> [ 1.109319] {IN-HARDIRQ-W} state was registered at:
> [ 1.109319] lock_acquire+0xca/0x2d0
> [ 1.109319] _raw_spin_lock_irqsave+0x4f/0x70
> [ 1.109319] clocksource_mark_unstable+0x20/0xa0
> [ 1.109319] mark_tsc_unstable+0x4a/0x70
> [ 1.109319] check_tsc_sync_source+0x169/0x1d0
> [ 1.109319] __flush_smp_call_function_queue+0x16d/0x590
> [ 1.109319] generic_smp_call_function_single_interrupt+0x17/0x20
> [ 1.109319] __sysvec_call_function_single+0x3a/0x110
> [ 1.109319] sysvec_call_function_single+0x7f/0x90
> [ 1.109319] asm_sysvec_call_function_single+0x1f/0x30
> [ 1.109319] pv_native_safe_halt+0xf/0x20
> [ 1.109319] arch_cpu_idle+0xd/0x20
> [ 1.109319] default_idle_call+0x83/0x180
> [ 1.109319] do_idle+0x21a/0x260
> [ 1.109319] cpu_startup_entry+0x31/0x40
> [ 1.109319] rest_init+0x10f/0x200
> [ 1.109319] arch_call_rest_init+0x12/0x50
> [ 1.109319] start_kernel+0x80b/0xa30
> [ 1.109319] x86_64_start_reservations+0x1c/0x30
> [ 1.109319] x86_64_start_kernel+0xbf/0x110
> [ 1.109319] secondary_startup_64_no_verify+0x17d/0x18b
> [ 1.109319] irq event stamp: 232
> [ 1.109319] hardirqs last enabled at (232): [<ffffffff850c93ac>] _raw_spin_unlock_irq+0x2c/0x60
> [ 1.109319] hardirqs last disabled at (231): [<ffffffff850c9081>] _raw_spin_lock_irq+0x61/0x70
> [ 1.109319] softirqs last enabled at (88): [<ffffffff841c200f>] rcu_tasks_invoke_cbs+0xff/0x2f0
> [ 1.109319] softirqs last disabled at (173): [<ffffffff840ece40>] __irq_exit_rcu+0x80/0xb0
> [ 1.109319]
> [ 1.109319] other info that might help us debug this:
> [ 1.109319] Possible unsafe locking scenario:
> [ 1.109319]
> [ 1.109319] CPU0
> [ 1.109319] ----
> [ 1.109319] lock(watchdog_lock);
> [ 1.109319] <Interrupt>
> [ 1.109319] lock(watchdog_lock);
> [ 1.109319]
> [ 1.109319] *** DEADLOCK ***
> [ 1.109319]
> [ 1.109319] 3 locks held by rcu_tasks_kthre/13:
> [ 1.109319] #0: ffffffff869d6730 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x361/0x510
> [ 1.109319] #1: ffffffff86a7ceb8 (rcu_state.exp_mutex){+.+.}-{3:3}, at: synchronize_rcu_expedited+0x43c/0x510
> [ 1.109319] #2: ffffb67680003e68 ((&watchdog_timer)){+.-.}-{0:0}, at: call_timer_fn+0x7b/0x240
> [ 1.109319]
> [ 1.109319] stack backtrace:
> [ 1.109319] CPU: 0 PID: 13 Comm: rcu_tasks_kthre Not tainted 6.6.0-rc5+ #7
> [ 1.109319] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> [ 1.109319] Call Trace:
> [ 1.109319] <IRQ>
> [ 1.109319] dump_stack_lvl+0x60/0xa0
> [ 1.109319] dump_stack+0x14/0x20
> [ 1.109319] print_usage_bug.part.0+0x25c/0x350
> [ 1.109319] mark_lock.part.0+0x72f/0xa00
> [ 1.109319] __lock_acquire+0x390/0x24f0
> [ 1.109319] ? lockdep_unlock+0x56/0xc0
> [ 1.109319] lock_acquire+0xca/0x2d0
> [ 1.109319] ? clocksource_watchdog+0x26/0x620
> [ 1.109319] ? __pfx_clocksource_watchdog+0x10/0x10
> [ 1.109319] _raw_spin_lock+0x34/0x50
> [ 1.109319] ? clocksource_watchdog+0x26/0x620
> [ 1.109319] clocksource_watchdog+0x26/0x620
> [ 1.109319] ? lock_acquire+0xca/0x2d0
> [ 1.109319] ? __pfx_clocksource_watchdog+0x10/0x10
> [ 1.109319] call_timer_fn+0xa4/0x240
> [ 1.109319] ? __pfx_clocksource_watchdog+0x10/0x10
> [ 1.109319] __run_timers.part.0+0x207/0x2d0
> [ 1.109319] run_timer_softirq+0x2e/0x60
> [ 1.109319] __do_softirq+0xc6/0x446
> [ 1.109319] __irq_exit_rcu+0x80/0xb0
> [ 1.109319] irq_exit_rcu+0x12/0x20
> [ 1.109319] sysvec_apic_timer_interrupt+0x84/0x90
> [ 1.109319] </IRQ>
> [ 1.109319] <TASK>
> [ 1.109319] asm_sysvec_apic_timer_interrupt+0x1f/0x30
> [ 1.109319] RIP: 0010:_flat_send_IPI_mask+0x5d/0x70
> [ 1.109319] Code: f8 c9 c3 cc cc cc cc 89 75 f4 e8 ae 24 20 00 8b 75 f4 ba 00 08 00 00 44 89 e7 e8 ae 63 ff ff e8 c9 23 20 00 fb 0f 1f 44 00 00 <4c> 8b 65 f8 c9 c3 cc cc cc cc 66 0f 1f 84 00 00 00 00 00 90 90 90
> [ 1.109319] RSP: 0000:ffffb67680073ac0 EFLAGS: 00010202
> [ 1.109319] RAX: 0000000000000073 RBX: 0000000000000000 RCX: 0000000000000040
> [ 1.109319] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff840a4357
> [ 1.109319] RBP: ffffb67680073ad0 R08: 0000000000000001 R09: 0000000000000001
> [ 1.109319] R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000020
> [ 1.109319] R13: 00000000000000fb R14: 000000000000001f R15: ffffffff869da348
> [ 1.109319] ? _flat_send_IPI_mask+0x57/0x70
> [ 1.109319] flat_send_IPI_mask+0x15/0x20
> [ 1.109319] default_send_IPI_single+0x44/0x60
> [ 1.109319] native_send_call_func_single_ipi+0x17/0x20
> [ 1.109319] __smp_call_single_queue+0xd1/0x1c0
> [ 1.109319] generic_exec_single+0x3a/0x1e0
> [ 1.109319] smp_call_function_single+0xc2/0x140
> [ 1.109319] ? _raw_spin_unlock_irqrestore+0x35/0x60
> [ 1.109319] __sync_rcu_exp_select_node_cpus+0x284/0x500
> [ 1.109319] sync_rcu_exp_select_cpus+0x232/0x550
> [ 1.109319] synchronize_rcu_expedited+0x44e/0x510
> [ 1.109319] ? lock_is_held_type+0xaf/0x120
> [ 1.109319] synchronize_rcu+0x1ad/0x1e0
> [ 1.109319] ? __mutex_lock+0x9b/0xd30
> [ 1.109319] ? __mutex_lock+0xa86/0xd30
> [ 1.109319] ? lock_acquire+0xca/0x2d0
> [ 1.109319] ? find_held_lock+0x31/0x90
> [ 1.109319] rcu_tasks_pregp_step+0x12/0x20
> [ 1.109319] rcu_tasks_wait_gp+0x5c/0x430
> [ 1.109319] rcu_tasks_one_gp+0x49f/0x510
> [ 1.109319] ? rcu_tasks_one_gp+0x75/0x510
> [ 1.109319] ? __pfx_rcu_tasks_kthread+0x10/0x10
> [ 1.109319] rcu_tasks_kthread+0xb4/0xe0
> [ 1.109319] kthread+0x109/0x140
> [ 1.109319] ? __pfx_kthread+0x10/0x10
> [ 1.109319] ret_from_fork+0x3d/0x60
> [ 1.109319] ? __pfx_kthread+0x10/0x10
> [ 1.109319] ret_from_fork_asm+0x1b/0x30
> [ 1.109319] </TASK>
> [ 40.748350] smp: Brought up 1 node, 8 CPUs
> [ 48.744332] smpboot: Max logical packages: 1
> [ 57.896327] tasks_rcu_exit_srcu_stall: rcu_tasks grace period number 5 (since boot) gp_state: RTGS_POST_SCAN_TASKLIST is 11129 jiffies old.
> [ 69.324322] Please check any exiting tasks stuck between calls to exit_tasks_rcu_start() and exit_tasks_rcu_finish()
> [ 83.516363] smpboot: Total of 8 processors activated (41779.32 BogoMIPS)
> [ 103.157341] Callback from call_rcu_tasks_rude() invoked.
> [ 105.961362] rcu_tasks_wait_gp: rcu_tasks grace period number 5 (since boot) is 104910 jiffies old.
> [ 137.781409] rcu_tasks_wait_gp: rcu_tasks grace period number 5 (since boot) is 136730 jiffies old.
> [ 142.589347] Callback from call_rcu_tasks() invoked.

The timing is about the same in both cases. Does this happen in kernels
built with CONFIG_PROVE_LOCKING=n?

Either way, what mainline or -stable version is this?

Thanx, Paul

> ----------------------------------------
>
> After:
> ----------------------------------------
> [ 0.000000] Linux version 6.6.0-rc5+ (root@ubuntu) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #8 SMP PREEMPT_DYNAMIC Tue Oct 17 09:56:49 UTC 2023
> [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.6.0-rc5+ root=UUID=a2500e32-bbc8-4f32-bf86-94a6ae160b4b ro console=ttyS0,115200n8 console=tty0
> (...snipped...)
> [ 0.340002] Dynamic Preempt: full
> [ 0.340209] Running RCU self tests
> [ 0.340210] Running RCU synchronous self tests
> [ 0.340214] rcu: Preemptible hierarchical RCU implementation.
> [ 0.340215] rcu: RCU lockdep checking is enabled.
> [ 0.340215] rcu: RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=8.
> [ 0.340217] Trampoline variant of Tasks RCU enabled.
> [ 0.340217] Rude variant of Tasks RCU enabled.
> [ 0.340218] Tracing variant of Tasks RCU enabled.
> [ 0.340219] rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies.
> [ 0.340219] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=8
> [ 0.340275] Running RCU synchronous self tests
> [ 0.343652] NR_IRQS: 524544, nr_irqs: 488, preallocated irqs: 16
> [ 0.344029] rcu: srcu_init: Setting srcu_struct sizes based on contention.
> [ 0.376872] Console: colour VGA+ 80x25
> [ 0.376893] printk: console [tty0] enabled
> [ 0.585051] printk: console [ttyS0] enabled
> [ 0.807871] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
> [ 0.812562] ... MAX_LOCKDEP_SUBCLASSES: 8
> [ 0.814986] ... MAX_LOCK_DEPTH: 48
> [ 0.817368] ... MAX_LOCKDEP_KEYS: 8192
> [ 0.819835] ... CLASSHASH_SIZE: 4096
> [ 0.822452] ... MAX_LOCKDEP_ENTRIES: 32768
> [ 0.825290] ... MAX_LOCKDEP_CHAINS: 65536
> [ 0.827689] ... CHAINHASH_SIZE: 32768
> [ 0.830278] memory used by lock dependency info: 6493 kB
> [ 0.833130] memory used for stack traces: 4224 kB
> [ 0.836051] per task-struct memory footprint: 1920 bytes
> [ 0.838946] ACPI: Core revision 20230628
> [ 0.841933] APIC: Switch to symmetric I/O mode setup
> [ 0.855141] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [ 0.858422] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x25a39815946, max_idle_ns: 440795282683 ns
> [ 0.863903] Calibrating delay loop (skipped) preset value.. 5222.41 BogoMIPS (lpj=2611208)
> [ 0.864890] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
> [ 0.864890] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
> [ 0.864890] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
> [ 0.864890] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!
> [ 0.864890] Spectre V2 : Mitigation: Enhanced / Automatic IBRS
> [ 0.864890] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
> [ 0.864903] Spectre V2 : Spectre v2 / PBRSB-eIBRS: Retire a single CALL on VMEXIT
> [ 0.866894] RETBleed: Mitigation: Enhanced IBRS
> [ 0.868893] Speculative Store Bypass: Vulnerable
> [ 0.870902] x86/fpu: x87 FPU will use FXSAVE
> [ 0.896596] Freeing SMP alternatives memory: 40K
> [ 0.897898] pid_max: default: 32768 minimum: 301
> [ 0.900001] LSM: initializing lsm=capability,bpf
> [ 0.901913] LSM support for eBPF active
> [ 0.904221] Mount-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
> [ 0.907152] Mountpoint-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
> [ 0.910834] Running RCU synchronous self tests
> [ 0.911894] Running RCU synchronous self tests
> [ 1.029893] APIC calibration not consistent with PM-Timer: 148ms instead of 100ms
> [ 1.030890] APIC delta adjusted to PM-Timer: 6288407 (9311079)
> [ 1.031135] smpboot: CPU0: 12th Gen Intel(R) Core(TM) i7-1255U (family: 0x6, model: 0x9a, stepping: 0x4)
> [ 1.031890] RCU Tasks: Setting shift to 3 and lim to 1 rcu_task_cb_adjust=1.
> [ 1.031890] RCU Tasks Rude: Setting shift to 3 and lim to 1 rcu_task_cb_adjust=1.
> [ 1.031890] RCU Tasks Trace: Setting shift to 3 and lim to 1 rcu_task_cb_adjust=1.
> [ 1.031890] Running RCU-tasks wait API self tests
> [ 1.136395] Performance Events: unsupported p6 CPU model 154 no PMU driver, software events only.
> [ 1.138894] signal: max sigframe size: 1440
> [ 1.141093] rcu: Hierarchical SRCU implementation.
> [ 1.142904] Callback from call_rcu_tasks_trace() invoked.
> [ 1.144897] rcu: Max phase no-delay instances is 400.
> [ 1.148535] NMI watchdog: Perf NMI watchdog permanently disabled
> [ 1.150496] smp: Bringing up secondary CPUs ...
> [ 1.153270] smpboot: x86: Booting SMP configuration:
> [ 1.154900] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
> [ 1.163890] TSC synchronization [CPU#0 -> CPU#1]:
> [ 1.163890] Measured 66840 cycles TSC warp between CPUs, turning off TSC clock.
> [ 1.163890] tsc: Marking TSC unstable due to check_tsc_sync_source failed
> [ 26.569928] rcu_tasks_wait_gp: rcu_tasks grace period number 5 (since boot) is 25133 jiffies old.
> [ 78.872032] Callback from call_rcu_tasks() invoked.
> [ 91.327971] smp: Brought up 1 node, 8 CPUs
> [ 96.402957] smpboot: Max logical packages: 1
> [ 119.253936] Callback from call_rcu_tasks_rude() invoked.
> [ 127.412964] smpboot: Total of 8 processors activated (41779.32 BogoMIPS)
> ----------------------------------------
>

2023-10-17 14:11:32

[permalink] [raw]

Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.

On Tue, Oct 17, 2023 at 08:49:46AM +0200, Thomas Gleixner wrote:
> On Mon, Oct 16 2023 at 16:03, Paul E. McKenney wrote:
> > On Mon, Oct 16, 2023 at 11:47:55PM +0200, Thomas Gleixner wrote:
> >> That said, this whole clocksource watchdog mess wants a proper
> >> overhaul. It has become a pile of warts and duct tape by now and after
> >> staring at it long enough there is no real reason to run it in a timer
> >> callback anymore. It just can move into delayed work and the whole
> >> locking problem can be reduced to the clocksource_mutex and some well
> >> thought out atomic operations to handle the mark unstable case. But
> >> that's a different story and not relevant for curing the problem at
> >> hand.
> >
> > Moving the code to delayed work seems quite reasonable.
> >
> > But Thomas, you do understand that the way things have been going for
> > the clocksource watchdog, pushing it out to delayed work will no doubt
> > add yet more hair on large busy systems, right? Yeah, yeah, I know,
> > delayed work shouldn't be any worse than ksoftirqd. The key word of
> > course being "shouldn't". ;-)
>
> Yes, I'm aware of that. I still think it's worth at least to try it.

OK, good. And again, agreed.

Thanx, Paul

2023-10-19 09:31:20

[permalink] [raw]

Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.

On 2023/10/17 23:10, Paul E. McKenney wrote:
>> But I'm hitting something different (but might be timer/scheduler related) problem.
>> What config option would cause taking more than 2 minutes to bring up only 8 CPUs?
>> (This environment is Oracle VM VirtualBox on Windows 11 host.)
>>
> The timing is about the same in both cases. Does this happen in kernels
> built with CONFIG_PROVE_LOCKING=n?

Disabling all options in "Lock Debugging (spinlocks, mutexes, etc...)" does not help.

>
> Either way, what mainline or -stable version is this?

The kernel is latest commit of upstream linux.git tree.
I'm seeing this slowdown (almost hung-up) problem for many releases,
but nobody else seems to be reporting this problem.

The only reliable workaround is to specify "nosmp" kernel command line option.

2023-10-19 12:14:46

[permalink] [raw]

Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.

On Thu, Oct 19 2023 at 18:30, Tetsuo Handa wrote:
> On 2023/10/17 23:10, Paul E. McKenney wrote:
>>> But I'm hitting something different (but might be timer/scheduler related) problem.
>>> What config option would cause taking more than 2 minutes to bring up only 8 CPUs?
>>> (This environment is Oracle VM VirtualBox on Windows 11 host.)
>>>
>> The timing is about the same in both cases. Does this happen in kernels
>> built with CONFIG_PROVE_LOCKING=n?
>
> Disabling all options in "Lock Debugging (spinlocks, mutexes, etc...)" does not help.
>
>>
>> Either way, what mainline or -stable version is this?
>
> The kernel is latest commit of upstream linux.git tree.
> I'm seeing this slowdown (almost hung-up) problem for many releases,
> but nobody else seems to be reporting this problem.

Maybe because of this:
>>> (This environment is Oracle VM VirtualBox on Windows 11 host.)

> The only reliable workaround is to specify "nosmp" kernel command line option.

Which makes the kernel keep TSC as clocksource because there is no
synchronization problem between CPU0 and CPU0 :)

What is the fallback clocksource when the TSC is discarded on SMP?

From the dmesg snippets I assume it is the ACPI PMTIMER, which is I/O
port based and presumably takes a VMEXIT on every read and depending on
the emulation this might be slooooow.

So you could boot with "nosmp clocksource=acpi_pm" on the command line
and compare that against a "nosmp" boot.

If my assumption is right, then on UP the ACPI PM variant won't see the
massive slowdown SMP observes, but there should be still an observable
difference.

Thanks,

tglx

2023-10-19 14:27:03

[permalink] [raw]

Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.

On 2023/10/19 21:14, Thomas Gleixner wrote:
> Maybe because of this:
>>>> (This environment is Oracle VM VirtualBox on Windows 11 host.)

I don't think that I'm the only user running Linux on Oracle VM VirtualBox on Windows. ;-)

> So you could boot with "nosmp clocksource=acpi_pm" on the command line
> and compare that against a "nosmp" boot.

This did not make much difference.

Just for testing, I booted several Fedora/Ubuntu kernels used by installation ISO image.
5.11 / 6.0 / 6.6 Fedora kernels all showed this problem. On the other hand, 5.15 Ubuntu
kernel does not show this problem. Thus, I'd rather suspect kernel config dependent
problem than platform dependent problem.

Fedora-Everything-netinst-x86_64-Rawhide-20231018.n.0.iso
[ 0.000000] Linux version 6.6.0-0.rc6.47.fc40.x86_64 (mockbuild@65b4f242683a45f6bd4621fe8c018a9d) (gcc (GCC) 13.2.1 20231011 (Red Hat 13.2.1-4), GNU ld version 2.41-7.fc40) #1 SMP PREEMPT_DYNAMIC Mon Oct 16 13:34:35 UTC 2023

Fedora-Server-netinst-x86_64-37-1.7.iso
[ 0.000000] Linux version 6.0.7-301.fc37.x86_64 ([email protected]) (gcc (GCC) 12.2.1 20220819 (Red Hat 12.2.1-2), GNU ld version 2.38-24.fc37) #1 SMP PREEMPT_DYNAMIC Fri Nov 4 18:35:48 UTC 2022

Fedora-Everything-netinst-x86_64-34-1.2.iso
[ 0.000000] Linux version 5.11.12-300.fc34.x86_64 ([email protected]) (gcc (GCC) 11.0.1 20210324 (Red Hat 11.0.1-0), GNU ld version 2.35.1-41.fc34) #1 SMP Wed Apr 7 16:31:13 UTC 2021

ubuntu-22.04.3-live-server-amd64.iso
[ 0.000000] Linux version 5.15.0-78-generic (buildd@lcy02-amd64-008) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023 (Ubuntu 5.15.0-78.85-generic 5.15.99)
[ 0.000000] Command line: BOOT_IMAGE=/casper/vmlinuz console=ttyS0,115200n8 console=tty0 ---
[ 0.000000] KERNEL supported cpus:
[ 0.000000] Intel GenuineIntel
[ 0.000000] AMD AuthenticAMD
[ 0.000000] Hygon HygonGenuine
[ 0.000000] Centaur CentaurHauls
[ 0.000000] zhaoxin Shanghai
[ 0.000000] x86/fpu: x87 FPU will use FXSAVE
[ 0.000000] signal: max sigframe size: 1440
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000dffeffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000dfff0000-0x00000000dfffffff] ACPI data
[ 0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000019fffffff] usable
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] SMBIOS 2.5 present.
[ 0.000000] DMI: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 0.000000] Hypervisor detected: KVM
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000000] kvm-clock: cpu 0, msr 18e01001, primary cpu clock
[ 0.000001] kvm-clock: using sched offset of 33909880646 cycles
[ 0.000004] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.000006] tsc: Detected 2611.208 MHz processor
[ 0.001408] last_pfn = 0x1a0000 max_arch_pfn = 0x400000000
[ 0.001457] Disabled
[ 0.001458] x86/PAT: MTRRs disabled, skipping PAT initialization too.
[ 0.001463] CPU MTRRs all blank - virtualized system.
[ 0.001464] x86/PAT: Configuration [0-7]: WB WT UC- UC WB WT UC- UC
[ 0.001466] last_pfn = 0xdfff0 max_arch_pfn = 0x400000000
[ 0.001514] found SMP MP-table at [mem 0x0009fff0-0x0009ffff]
[ 0.001529] Incomplete global flushes, disabling PCID
[ 0.001751] RAMDISK: [mem 0x2a8db000-0x31464fff]
[ 0.001756] ACPI: Early table checksum verification disabled
[ 0.001760] ACPI: RSDP 0x00000000000E0000 000024 (v02 VBOX )
[ 0.001764] ACPI: XSDT 0x00000000DFFF0030 00003C (v01 VBOX VBOXXSDT 00000001 ASL 00000061)
[ 0.001769] ACPI: FACP 0x00000000DFFF00F0 0000F4 (v04 VBOX VBOXFACP 00000001 ASL 00000061)
[ 0.001774] ACPI: DSDT 0x00000000DFFF0640 002353 (v02 VBOX VBOXBIOS 00000002 INTL 20100528)
[ 0.001777] ACPI: FACS 0x00000000DFFF0200 000040
[ 0.001780] ACPI: FACS 0x00000000DFFF0200 000040
[ 0.001782] ACPI: APIC 0x00000000DFFF0240 00008C (v02 VBOX VBOXAPIC 00000001 ASL 00000061)
[ 0.001785] ACPI: SSDT 0x00000000DFFF02D0 00036C (v01 VBOX VBOXCPUT 00000002 INTL 20100528)
[ 0.001787] ACPI: Reserving FACP table memory at [mem 0xdfff00f0-0xdfff01e3]
[ 0.001789] ACPI: Reserving DSDT table memory at [mem 0xdfff0640-0xdfff2992]
[ 0.001789] ACPI: Reserving FACS table memory at [mem 0xdfff0200-0xdfff023f]
[ 0.001790] ACPI: Reserving FACS table memory at [mem 0xdfff0200-0xdfff023f]
[ 0.001791] ACPI: Reserving APIC table memory at [mem 0xdfff0240-0xdfff02cb]
[ 0.001792] ACPI: Reserving SSDT table memory at [mem 0xdfff02d0-0xdfff063b]
[ 0.002486] No NUMA configuration found
[ 0.002487] Faking a node at [mem 0x0000000000000000-0x000000019fffffff]
[ 0.002493] NODE_DATA(0) allocated [mem 0x19ffd2000-0x19fffbfff]
[ 0.003086] Zone ranges:
[ 0.003087] DMA [mem 0x0000000000001000-0x0000000000ffffff]
[ 0.003089] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
[ 0.003090] Normal [mem 0x0000000100000000-0x000000019fffffff]
[ 0.003091] Device empty
[ 0.003092] Movable zone start for each node
[ 0.003094] Early memory node ranges
[ 0.003094] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.003095] node 0: [mem 0x0000000000100000-0x00000000dffeffff]
[ 0.003097] node 0: [mem 0x0000000100000000-0x000000019fffffff]
[ 0.003098] Initmem setup node 0 [mem 0x0000000000001000-0x000000019fffffff]
[ 0.003128] On node 0, zone DMA: 1 pages in unavailable ranges
[ 0.003600] On node 0, zone DMA: 97 pages in unavailable ranges
[ 0.181055] On node 0, zone Normal: 16 pages in unavailable ranges
[ 0.182110] ACPI: PM-Timer IO Port: 0x4008
[ 0.182310] IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
[ 0.182313] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[ 0.182315] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
[ 0.182318] ACPI: Using ACPI (MADT) for SMP configuration information
[ 0.182338] smpboot: Allowing 8 CPUs, 0 hotplug CPUs
[ 0.182433] PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
[ 0.182435] PM: hibernation: Registered nosave memory: [mem 0x0009f000-0x0009ffff]
[ 0.182435] PM: hibernation: Registered nosave memory: [mem 0x000a0000-0x000effff]
[ 0.182436] PM: hibernation: Registered nosave memory: [mem 0x000f0000-0x000fffff]
[ 0.182437] PM: hibernation: Registered nosave memory: [mem 0xdfff0000-0xdfffffff]
[ 0.182438] PM: hibernation: Registered nosave memory: [mem 0xe0000000-0xfebfffff]
[ 0.182438] PM: hibernation: Registered nosave memory: [mem 0xfec00000-0xfec00fff]
[ 0.182439] PM: hibernation: Registered nosave memory: [mem 0xfec01000-0xfedfffff]
[ 0.182440] PM: hibernation: Registered nosave memory: [mem 0xfee00000-0xfee00fff]
[ 0.182440] PM: hibernation: Registered nosave memory: [mem 0xfee01000-0xfffbffff]
[ 0.182441] PM: hibernation: Registered nosave memory: [mem 0xfffc0000-0xffffffff]
[ 0.182442] [mem 0xe0000000-0xfebfffff] available for PCI devices
[ 0.182444] Booting paravirtualized kernel on KVM
[ 0.182446] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[ 0.182461] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
[ 0.186444] percpu: Embedded 60 pages/cpu s208896 r8192 d28672 u262144
[ 0.186530] kvm-guest: PV spinlocks enabled
[ 0.186532] PV qspinlock hash table entries: 256 (order: 0, 4096 bytes, linear)
[ 0.186554] Built 1 zonelists, mobility grouping on. Total pages: 1548016
[ 0.186555] Policy zone: Normal
[ 0.186556] Kernel command line: BOOT_IMAGE=/casper/vmlinuz console=ttyS0,115200n8 console=tty0 ---
[ 0.186591] Unknown kernel command line parameters "--- BOOT_IMAGE=/casper/vmlinuz", will be passed to user space.
[ 0.201567] Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes, linear)
[ 0.208917] Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes, linear)
[ 0.209043] mem auto-init: stack:off, heap alloc:on, heap free:off
[ 0.284849] Memory: 5943520K/6291000K available (16393K kernel code, 4383K rwdata, 10844K rodata, 3248K init, 18832K bss, 347220K reserved, 0K cma-reserved)
[ 0.286016] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
[ 0.286060] ftrace: allocating 50600 entries in 198 pages
[ 0.304937] ftrace: allocated 198 pages with 4 groups
[ 0.305656] rcu: Hierarchical RCU implementation.
[ 0.305658] rcu: RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=8.
[ 0.305659] Rude variant of Tasks RCU enabled.
[ 0.305660] Tracing variant of Tasks RCU enabled.
[ 0.305660] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[ 0.305661] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=8
[ 0.309135] NR_IRQS: 524544, nr_irqs: 488, preallocated irqs: 16
[ 0.309742] random: crng init done
[ 0.338006] Console: colour VGA+ 80x25
[ 0.508207] printk: console [tty0] enabled
[ 0.683408] printk: console [ttyS0] enabled
[ 0.685593] ACPI: Core revision 20210730
[ 0.687749] APIC: Switch to symmetric I/O mode setup
[ 0.700751] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 0.703694] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x25a39815946, max_idle_ns: 440795282683 ns
[ 0.708547] Calibrating delay loop (skipped) preset value.. 5222.41 BogoMIPS (lpj=10444832)
[ 0.712492] pid_max: default: 32768 minimum: 301
[ 0.712542] LSM: Security Framework initializing
[ 0.712542] landlock: Up and running.
[ 0.712542] Yama: becoming mindful.
[ 0.712542] AppArmor: AppArmor initialized
[ 0.712542] Mount-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
[ 0.712542] Mountpoint-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
[ 0.716559] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 0.719146] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
[ 0.720548] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[ 0.724545] Spectre V2 : Mitigation: Enhanced IBRS
[ 0.726808] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[ 0.728544] Spectre V2 : Spectre v2 / PBRSB-eIBRS: Retire a single CALL on VMEXIT
[ 0.732544] RETBleed: Mitigation: Enhanced IBRS
[ 0.736546] Speculative Store Bypass: Vulnerable
[ 0.761213] Freeing SMP alternatives memory: 44K
[ 0.888543] APIC calibration not consistent with PM-Timer: 112ms instead of 100ms
[ 0.892542] APIC delta adjusted to PM-Timer: 6460077 (7235567)
[ 0.892738] smpboot: CPU0: 12th Gen Intel(R) Core(TM) i7-1255U (family: 0x6, model: 0x9a, stepping: 0x4)
[ 0.897089] Performance Events: unsupported p6 CPU model 154 no PMU driver, software events only.
[ 0.900604] rcu: Hierarchical SRCU implementation.
[ 0.903574] NMI watchdog: Perf NMI watchdog permanently disabled
[ 0.904961] smp: Bringing up secondary CPUs ...
[ 0.909089] x86: Booting SMP configuration:
[ 0.911031] .... node #0, CPUs: #1
[ 0.402809] kvm-clock: cpu 1, msr 18e01041, secondary cpu clock
[ 0.916542] TSC synchronization [CPU#0 -> CPU#1]:
[ 0.916542] Measured 57359 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.916542] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[ 0.916968] #2
[ 0.402809] kvm-clock: cpu 2, msr 18e01081, secondary cpu clock
[ 0.920985] #3
[ 0.402809] kvm-clock: cpu 3, msr 18e010c1, secondary cpu clock
[ 0.926782] #4
[ 0.402809] kvm-clock: cpu 4, msr 18e01101, secondary cpu clock
[ 0.935490] #5
[ 0.402809] kvm-clock: cpu 5, msr 18e01141, secondary cpu clock
[ 0.939186] #6
[ 0.402809] kvm-clock: cpu 6, msr 18e01181, secondary cpu clock
[ 0.948125] #7
[ 0.402809] kvm-clock: cpu 7, msr 18e011c1, secondary cpu clock
[ 0.960542] smp: Brought up 1 node, 8 CPUs
[ 0.962451] smpboot: Max logical packages: 1
[ 0.964401] smpboot: Total of 8 processors activated (41779.32 BogoMIPS)

2023-10-20 03:31:21

[permalink] [raw]

Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.

On Thu, Oct 19, 2023 at 11:26:17PM +0900, Tetsuo Handa wrote:
> On 2023/10/19 21:14, Thomas Gleixner wrote:
> > Maybe because of this:
> >>>> (This environment is Oracle VM VirtualBox on Windows 11 host.)
>
> I don't think that I'm the only user running Linux on Oracle VM VirtualBox on Windows. ;-)
>
> > So you could boot with "nosmp clocksource=acpi_pm" on the command line
> > and compare that against a "nosmp" boot.
>
> This did not make much difference.
>
> Just for testing, I booted several Fedora/Ubuntu kernels used by installation ISO image.
> 5.11 / 6.0 / 6.6 Fedora kernels all showed this problem. On the other hand, 5.15 Ubuntu
> kernel does not show this problem. Thus, I'd rather suspect kernel config dependent
> problem than platform dependent problem.
>
> Fedora-Everything-netinst-x86_64-Rawhide-20231018.n.0.iso
> [ 0.000000] Linux version 6.6.0-0.rc6.47.fc40.x86_64 (mockbuild@65b4f242683a45f6bd4621fe8c018a9d) (gcc (GCC) 13.2.1 20231011 (Red Hat 13.2.1-4), GNU ld version 2.41-7.fc40) #1 SMP PREEMPT_DYNAMIC Mon Oct 16 13:34:35 UTC 2023
>
> Fedora-Server-netinst-x86_64-37-1.7.iso
> [ 0.000000] Linux version 6.0.7-301.fc37.x86_64 ([email protected]) (gcc (GCC) 12.2.1 20220819 (Red Hat 12.2.1-2), GNU ld version 2.38-24.fc37) #1 SMP PREEMPT_DYNAMIC Fri Nov 4 18:35:48 UTC 2022
>
> Fedora-Everything-netinst-x86_64-34-1.2.iso
> [ 0.000000] Linux version 5.11.12-300.fc34.x86_64 ([email protected]) (gcc (GCC) 11.0.1 20210324 (Red Hat 11.0.1-0), GNU ld version 2.35.1-41.fc34) #1 SMP Wed Apr 7 16:31:13 UTC 2021
>
> ubuntu-22.04.3-live-server-amd64.iso
> [ 0.000000] Linux version 5.15.0-78-generic (buildd@lcy02-amd64-008) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023 (Ubuntu 5.15.0-78.85-generic 5.15.99)
> [ 0.000000] Command line: BOOT_IMAGE=/casper/vmlinuz console=ttyS0,115200n8 console=tty0 ---

A bit of a long shot, but given your earlier console log, could you
please try Frederic's patches?

https://lore.kernel.org/lkml/[email protected]/

Thanx, Paul

> [ 0.000000] KERNEL supported cpus:
> [ 0.000000] Intel GenuineIntel
> [ 0.000000] AMD AuthenticAMD
> [ 0.000000] Hygon HygonGenuine
> [ 0.000000] Centaur CentaurHauls
> [ 0.000000] zhaoxin Shanghai
> [ 0.000000] x86/fpu: x87 FPU will use FXSAVE
> [ 0.000000] signal: max sigframe size: 1440
> [ 0.000000] BIOS-provided physical RAM map:
> [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000dffeffff] usable
> [ 0.000000] BIOS-e820: [mem 0x00000000dfff0000-0x00000000dfffffff] ACPI data
> [ 0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000019fffffff] usable
> [ 0.000000] NX (Execute Disable) protection: active
> [ 0.000000] SMBIOS 2.5 present.
> [ 0.000000] DMI: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> [ 0.000000] Hypervisor detected: KVM
> [ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [ 0.000000] kvm-clock: cpu 0, msr 18e01001, primary cpu clock
> [ 0.000001] kvm-clock: using sched offset of 33909880646 cycles
> [ 0.000004] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> [ 0.000006] tsc: Detected 2611.208 MHz processor
> [ 0.001408] last_pfn = 0x1a0000 max_arch_pfn = 0x400000000
> [ 0.001457] Disabled
> [ 0.001458] x86/PAT: MTRRs disabled, skipping PAT initialization too.
> [ 0.001463] CPU MTRRs all blank - virtualized system.
> [ 0.001464] x86/PAT: Configuration [0-7]: WB WT UC- UC WB WT UC- UC
> [ 0.001466] last_pfn = 0xdfff0 max_arch_pfn = 0x400000000
> [ 0.001514] found SMP MP-table at [mem 0x0009fff0-0x0009ffff]
> [ 0.001529] Incomplete global flushes, disabling PCID
> [ 0.001751] RAMDISK: [mem 0x2a8db000-0x31464fff]
> [ 0.001756] ACPI: Early table checksum verification disabled
> [ 0.001760] ACPI: RSDP 0x00000000000E0000 000024 (v02 VBOX )
> [ 0.001764] ACPI: XSDT 0x00000000DFFF0030 00003C (v01 VBOX VBOXXSDT 00000001 ASL 00000061)
> [ 0.001769] ACPI: FACP 0x00000000DFFF00F0 0000F4 (v04 VBOX VBOXFACP 00000001 ASL 00000061)
> [ 0.001774] ACPI: DSDT 0x00000000DFFF0640 002353 (v02 VBOX VBOXBIOS 00000002 INTL 20100528)
> [ 0.001777] ACPI: FACS 0x00000000DFFF0200 000040
> [ 0.001780] ACPI: FACS 0x00000000DFFF0200 000040
> [ 0.001782] ACPI: APIC 0x00000000DFFF0240 00008C (v02 VBOX VBOXAPIC 00000001 ASL 00000061)
> [ 0.001785] ACPI: SSDT 0x00000000DFFF02D0 00036C (v01 VBOX VBOXCPUT 00000002 INTL 20100528)
> [ 0.001787] ACPI: Reserving FACP table memory at [mem 0xdfff00f0-0xdfff01e3]
> [ 0.001789] ACPI: Reserving DSDT table memory at [mem 0xdfff0640-0xdfff2992]
> [ 0.001789] ACPI: Reserving FACS table memory at [mem 0xdfff0200-0xdfff023f]
> [ 0.001790] ACPI: Reserving FACS table memory at [mem 0xdfff0200-0xdfff023f]
> [ 0.001791] ACPI: Reserving APIC table memory at [mem 0xdfff0240-0xdfff02cb]
> [ 0.001792] ACPI: Reserving SSDT table memory at [mem 0xdfff02d0-0xdfff063b]
> [ 0.002486] No NUMA configuration found
> [ 0.002487] Faking a node at [mem 0x0000000000000000-0x000000019fffffff]
> [ 0.002493] NODE_DATA(0) allocated [mem 0x19ffd2000-0x19fffbfff]
> [ 0.003086] Zone ranges:
> [ 0.003087] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> [ 0.003089] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
> [ 0.003090] Normal [mem 0x0000000100000000-0x000000019fffffff]
> [ 0.003091] Device empty
> [ 0.003092] Movable zone start for each node
> [ 0.003094] Early memory node ranges
> [ 0.003094] node 0: [mem 0x0000000000001000-0x000000000009efff]
> [ 0.003095] node 0: [mem 0x0000000000100000-0x00000000dffeffff]
> [ 0.003097] node 0: [mem 0x0000000100000000-0x000000019fffffff]
> [ 0.003098] Initmem setup node 0 [mem 0x0000000000001000-0x000000019fffffff]
> [ 0.003128] On node 0, zone DMA: 1 pages in unavailable ranges
> [ 0.003600] On node 0, zone DMA: 97 pages in unavailable ranges
> [ 0.181055] On node 0, zone Normal: 16 pages in unavailable ranges
> [ 0.182110] ACPI: PM-Timer IO Port: 0x4008
> [ 0.182310] IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
> [ 0.182313] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
> [ 0.182315] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
> [ 0.182318] ACPI: Using ACPI (MADT) for SMP configuration information
> [ 0.182338] smpboot: Allowing 8 CPUs, 0 hotplug CPUs
> [ 0.182433] PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
> [ 0.182435] PM: hibernation: Registered nosave memory: [mem 0x0009f000-0x0009ffff]
> [ 0.182435] PM: hibernation: Registered nosave memory: [mem 0x000a0000-0x000effff]
> [ 0.182436] PM: hibernation: Registered nosave memory: [mem 0x000f0000-0x000fffff]
> [ 0.182437] PM: hibernation: Registered nosave memory: [mem 0xdfff0000-0xdfffffff]
> [ 0.182438] PM: hibernation: Registered nosave memory: [mem 0xe0000000-0xfebfffff]
> [ 0.182438] PM: hibernation: Registered nosave memory: [mem 0xfec00000-0xfec00fff]
> [ 0.182439] PM: hibernation: Registered nosave memory: [mem 0xfec01000-0xfedfffff]
> [ 0.182440] PM: hibernation: Registered nosave memory: [mem 0xfee00000-0xfee00fff]
> [ 0.182440] PM: hibernation: Registered nosave memory: [mem 0xfee01000-0xfffbffff]
> [ 0.182441] PM: hibernation: Registered nosave memory: [mem 0xfffc0000-0xffffffff]
> [ 0.182442] [mem 0xe0000000-0xfebfffff] available for PCI devices
> [ 0.182444] Booting paravirtualized kernel on KVM
> [ 0.182446] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
> [ 0.182461] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
> [ 0.186444] percpu: Embedded 60 pages/cpu s208896 r8192 d28672 u262144
> [ 0.186530] kvm-guest: PV spinlocks enabled
> [ 0.186532] PV qspinlock hash table entries: 256 (order: 0, 4096 bytes, linear)
> [ 0.186554] Built 1 zonelists, mobility grouping on. Total pages: 1548016
> [ 0.186555] Policy zone: Normal
> [ 0.186556] Kernel command line: BOOT_IMAGE=/casper/vmlinuz console=ttyS0,115200n8 console=tty0 ---
> [ 0.186591] Unknown kernel command line parameters "--- BOOT_IMAGE=/casper/vmlinuz", will be passed to user space.
> [ 0.201567] Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes, linear)
> [ 0.208917] Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes, linear)
> [ 0.209043] mem auto-init: stack:off, heap alloc:on, heap free:off
> [ 0.284849] Memory: 5943520K/6291000K available (16393K kernel code, 4383K rwdata, 10844K rodata, 3248K init, 18832K bss, 347220K reserved, 0K cma-reserved)
> [ 0.286016] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
> [ 0.286060] ftrace: allocating 50600 entries in 198 pages
> [ 0.304937] ftrace: allocated 198 pages with 4 groups
> [ 0.305656] rcu: Hierarchical RCU implementation.
> [ 0.305658] rcu: RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=8.
> [ 0.305659] Rude variant of Tasks RCU enabled.
> [ 0.305660] Tracing variant of Tasks RCU enabled.
> [ 0.305660] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
> [ 0.305661] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=8
> [ 0.309135] NR_IRQS: 524544, nr_irqs: 488, preallocated irqs: 16
> [ 0.309742] random: crng init done
> [ 0.338006] Console: colour VGA+ 80x25
> [ 0.508207] printk: console [tty0] enabled
> [ 0.683408] printk: console [ttyS0] enabled
> [ 0.685593] ACPI: Core revision 20210730
> [ 0.687749] APIC: Switch to symmetric I/O mode setup
> [ 0.700751] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [ 0.703694] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x25a39815946, max_idle_ns: 440795282683 ns
> [ 0.708547] Calibrating delay loop (skipped) preset value.. 5222.41 BogoMIPS (lpj=10444832)
> [ 0.712492] pid_max: default: 32768 minimum: 301
> [ 0.712542] LSM: Security Framework initializing
> [ 0.712542] landlock: Up and running.
> [ 0.712542] Yama: becoming mindful.
> [ 0.712542] AppArmor: AppArmor initialized
> [ 0.712542] Mount-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
> [ 0.712542] Mountpoint-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
> [ 0.716559] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
> [ 0.719146] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
> [ 0.720548] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
> [ 0.724545] Spectre V2 : Mitigation: Enhanced IBRS
> [ 0.726808] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
> [ 0.728544] Spectre V2 : Spectre v2 / PBRSB-eIBRS: Retire a single CALL on VMEXIT
> [ 0.732544] RETBleed: Mitigation: Enhanced IBRS
> [ 0.736546] Speculative Store Bypass: Vulnerable
> [ 0.761213] Freeing SMP alternatives memory: 44K
> [ 0.888543] APIC calibration not consistent with PM-Timer: 112ms instead of 100ms
> [ 0.892542] APIC delta adjusted to PM-Timer: 6460077 (7235567)
> [ 0.892738] smpboot: CPU0: 12th Gen Intel(R) Core(TM) i7-1255U (family: 0x6, model: 0x9a, stepping: 0x4)
> [ 0.897089] Performance Events: unsupported p6 CPU model 154 no PMU driver, software events only.
> [ 0.900604] rcu: Hierarchical SRCU implementation.
> [ 0.903574] NMI watchdog: Perf NMI watchdog permanently disabled
> [ 0.904961] smp: Bringing up secondary CPUs ...
> [ 0.909089] x86: Booting SMP configuration:
> [ 0.911031] .... node #0, CPUs: #1
> [ 0.402809] kvm-clock: cpu 1, msr 18e01041, secondary cpu clock
> [ 0.916542] TSC synchronization [CPU#0 -> CPU#1]:
> [ 0.916542] Measured 57359 cycles TSC warp between CPUs, turning off TSC clock.
> [ 0.916542] tsc: Marking TSC unstable due to check_tsc_sync_source failed
> [ 0.916968] #2
> [ 0.402809] kvm-clock: cpu 2, msr 18e01081, secondary cpu clock
> [ 0.920985] #3
> [ 0.402809] kvm-clock: cpu 3, msr 18e010c1, secondary cpu clock
> [ 0.926782] #4
> [ 0.402809] kvm-clock: cpu 4, msr 18e01101, secondary cpu clock
> [ 0.935490] #5
> [ 0.402809] kvm-clock: cpu 5, msr 18e01141, secondary cpu clock
> [ 0.939186] #6
> [ 0.402809] kvm-clock: cpu 6, msr 18e01181, secondary cpu clock
> [ 0.948125] #7
> [ 0.402809] kvm-clock: cpu 7, msr 18e011c1, secondary cpu clock
> [ 0.960542] smp: Brought up 1 node, 8 CPUs
> [ 0.962451] smpboot: Max logical packages: 1
> [ 0.964401] smpboot: Total of 8 processors activated (41779.32 BogoMIPS)
>

2023-10-20 09:03:26

by Sebastian Andrzej Siewior

[permalink] [raw]

Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.

On 2023-10-16 23:47:55 [+0200], Thomas Gleixner wrote:
> But the explicit local_irq_disable()/enable() pair around the readout
> means that the patch above is obviously broken:
>
> spin_lock_irqsave(&watchdog_lock, flags);
> cs_watchdog_read()
> local_irq_disable();
> ...
> local_irq_enable(); <--- FAIL
>
> Lockdep should have yelled at that too, but even if it failed to do
> so then it's still very obviously wrong.

Lockdep will shout from local_irq_enable() if the lock is still held.
That is working and I just verified it. However I didn't get to that
part in cs_watchdog_read() in my testing so that might be why lockdep
didn't complain in Tetsuo's case. After adding a irq-enable before the
unlock for lockdep-testing I did see the splat.

> Something like the uncompiled/untested below should cure it for real. It
> really does not matter whether the TSC unstable event happens a bit
> later. The system is unhappy no matter what.
The snippet below is perfect :)

> That said, this whole clocksource watchdog mess wants a proper
> overhaul. It has become a pile of warts and duct tape by now and after
> staring at it long enough there is no real reason to run it in a timer
> callback anymore. It just can move into delayed work and the whole
> locking problem can be reduced to the clocksource_mutex and some well
> thought out atomic operations to handle the mark unstable case. But
> that's a different story and not relevant for curing the problem at
> hand.

A pile of warts. Okay. This explains why the lock is acquired everywhere
with irqsave while in-IRQ access is not desired.

> Thanks,
>
> tglx

Sebastian

2023-10-20 12:51:34

[permalink] [raw]

Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.

On 2023/10/20 12:30, Paul E. McKenney wrote:
> A bit of a long shot, but given your earlier console log, could you
> please try Frederic's patches?
>
> https://lore.kernel.org/lkml/[email protected]/

Didn't make difference.

I booted another Ubuntu kernel used by installation ISO image, and found no problem.
Maybe the only way to narrow down is to build various kernel configurations?

ubuntu-23.10-live-server-amd64.iso
[ 0.000000] Linux version 6.5.0-9-generic (buildd@bos03-amd64-043) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.41) #9-Ubuntu SMP PREEMPT_DYNAMIC Sat Oct 7 01:35:40 UTC 2023 (Ubuntu 6.5.0-9.9-generic 6.5.3)
[ 0.000000] Command line: BOOT_IMAGE=/casper/vmlinuz console=ttyS0,115200n8 console=tty0 ---
(...snipped...)
[ 0.896948] smp: Bringing up secondary CPUs ...
[ 0.899571] smpboot: x86: Booting SMP configuration:
[ 0.901912] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
[ 0.914558] TSC synchronization [CPU#0 -> CPU#1]:
[ 0.916496] Measured 62205 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.916496] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[ 0.940504] smp: Brought up 1 node, 8 CPUs
[ 0.944498] smpboot: Max logical packages: 1
[ 0.946468] smpboot: Total of 8 processors activated (41779.32 BogoMIPS)

2023-10-20 13:40:26

[permalink] [raw]

Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.

On Fri, Oct 20, 2023 at 09:50:48PM +0900, Tetsuo Handa wrote:
> On 2023/10/20 12:30, Paul E. McKenney wrote:
> > A bit of a long shot, but given your earlier console log, could you
> > please try Frederic's patches?
> >
> > https://lore.kernel.org/lkml/[email protected]/
>
> Didn't make difference.

Huh. Thank you for trying it out!

Thanx, Paul

> I booted another Ubuntu kernel used by installation ISO image, and found no problem.
> Maybe the only way to narrow down is to build various kernel configurations?
>
> ubuntu-23.10-live-server-amd64.iso
> [ 0.000000] Linux version 6.5.0-9-generic (buildd@bos03-amd64-043) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.41) #9-Ubuntu SMP PREEMPT_DYNAMIC Sat Oct 7 01:35:40 UTC 2023 (Ubuntu 6.5.0-9.9-generic 6.5.3)
> [ 0.000000] Command line: BOOT_IMAGE=/casper/vmlinuz console=ttyS0,115200n8 console=tty0 ---
> (...snipped...)
> [ 0.896948] smp: Bringing up secondary CPUs ...
> [ 0.899571] smpboot: x86: Booting SMP configuration:
> [ 0.901912] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
> [ 0.914558] TSC synchronization [CPU#0 -> CPU#1]:
> [ 0.916496] Measured 62205 cycles TSC warp between CPUs, turning off TSC clock.
> [ 0.916496] tsc: Marking TSC unstable due to check_tsc_sync_source failed
> [ 0.940504] smp: Brought up 1 node, 8 CPUs
> [ 0.944498] smpboot: Max logical packages: 1
> [ 0.946468] smpboot: Total of 8 processors activated (41779.32 BogoMIPS)
>

2023-10-24 13:01:25

[permalink] [raw]

Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.

On Thu, Oct 19 2023 at 23:26, Tetsuo Handa wrote:
> On 2023/10/19 21:14, Thomas Gleixner wrote:
>> Maybe because of this:
>>>>> (This environment is Oracle VM VirtualBox on Windows 11 host.)
>
> I don't think that I'm the only user running Linux on Oracle VM VirtualBox on Windows. ;-)
>
>> So you could boot with "nosmp clocksource=acpi_pm" on the command line
>> and compare that against a "nosmp" boot.
>
> This did not make much difference.

Interesting. Can you please tell what the replacement clocksource is
when the TSC is disabled?

> Just for testing, I booted several Fedora/Ubuntu kernels used by installation ISO image.
> 5.11 / 6.0 / 6.6 Fedora kernels all showed this problem. On the other hand, 5.15 Ubuntu
> kernel does not show this problem. Thus, I'd rather suspect kernel config dependent
> problem than platform dependent problem.
>
> Fedora-Everything-netinst-x86_64-Rawhide-20231018.n.0.iso
> [ 0.000000] Linux version 6.6.0-0.rc6.47.fc40.x86_64 (mockbuild@65b4f242683a45f6bd4621fe8c018a9d) (gcc (GCC) 13.2.1 20231011 (Red Hat 13.2.1-4), GNU ld version 2.41-7.fc40) #1 SMP PREEMPT_DYNAMIC Mon Oct 16 13:34:35 UTC 2023
>
> Fedora-Server-netinst-x86_64-37-1.7.iso
> [ 0.000000] Linux version 6.0.7-301.fc37.x86_64 ([email protected]) (gcc (GCC) 12.2.1 20220819 (Red Hat 12.2.1-2), GNU ld version 2.38-24.fc37) #1 SMP PREEMPT_DYNAMIC Fri Nov 4 18:35:48 UTC 2022
>
> Fedora-Everything-netinst-x86_64-34-1.2.iso
> [ 0.000000] Linux version 5.11.12-300.fc34.x86_64 ([email protected]) (gcc (GCC) 11.0.1 20210324 (Red Hat 11.0.1-0), GNU ld version 2.35.1-41.fc34) #1 SMP Wed Apr 7 16:31:13 UTC 2021
>
> ubuntu-22.04.3-live-server-amd64.iso
> [ 0.000001] kvm-clock: using sched offset of 33909880646 cycles
> [ 0.000004] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns

Are the others using kvm-clock too? I haven't seen that in the snippets
you provided before.

Thanks,

tglx

2023-10-24 14:52:57

[permalink] [raw]

Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.

On 2023/10/24 22:00, Thomas Gleixner wrote:
> Interesting. Can you please tell what the replacement clocksource is
> when the TSC is disabled?

Where can I find the replacement clocksource from?

This slowdown is timing dependent. The same kernel can sometimes boot without slowdown.
I put a small kernel config at https://I-love.SAKURA.ne.jp/tmp/config-6.6-rc7 .

---------- bad case ----------
[ 0.000000][ T0] Linux version 6.6.0-rc7+ (root@ubuntu) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #1055 SMP PREEMPT_DYNAMIC Tue Oct 24 14:20:03 UTC 2023
[ 0.000000][ T0] Command line: BOOT_IMAGE=/boot/vmlinuz-6.6.0-rc7+ root=UUID=a2500e32-bbc8-4f32-bf86-94a6ae160b4b ro lsm=bpf console=ttyS0,115200n8 console=tty0
[ 0.000000][ T0] KERNEL supported cpus:
[ 0.000000][ T0] Intel GenuineIntel
[ 0.000000][ T0] AMD AuthenticAMD
[ 0.000000][ T0] Hygon HygonGenuine
[ 0.000000][ T0] Centaur CentaurHauls
[ 0.000000][ T0] zhaoxin Shanghai
[ 0.000000][ T0] BIOS-provided physical RAM map:
[ 0.000000][ T0] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000][ T0] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000][ T0] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000][ T0] BIOS-e820: [mem 0x0000000000100000-0x00000000dffeffff] usable
[ 0.000000][ T0] BIOS-e820: [mem 0x00000000dfff0000-0x00000000dfffffff] ACPI data
[ 0.000000][ T0] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
[ 0.000000][ T0] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
[ 0.000000][ T0] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000][ T0] BIOS-e820: [mem 0x0000000100000000-0x000000019fffffff] usable
[ 0.000000][ T0] NX (Execute Disable) protection: active
[ 0.000000][ T0] APIC: Static calls initialized
[ 0.000000][ T0] SMBIOS 2.5 present.
[ 0.000000][ T0] DMI: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 0.000000][ T0] Hypervisor detected: KVM
[ 0.000000][ T0] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000000][ T0] kvm-clock: using sched offset of 4651200357824 cycles
[ 0.000002][ T0] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.000004][ T0] tsc: Detected 2611.208 MHz processor
[ 0.001251][ T0] last_pfn = 0x1a0000 max_arch_pfn = 0x400000000
[ 0.001265][ T0] MTRRs disabled by BIOS
[ 0.001267][ T0] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
[ 0.001290][ T0] last_pfn = 0xdfff0 max_arch_pfn = 0x400000000
[ 0.001329][ T0] found SMP MP-table at [mem 0x0009fff0-0x0009ffff]
[ 0.001340][ T0] Incomplete global flushes, disabling PCID
[ 0.001500][ T0] RAMDISK: [mem 0x35563000-0x36aa8fff]
[ 0.001509][ T0] No NUMA configuration found
[ 0.001510][ T0] Faking a node at [mem 0x0000000000000000-0x000000019fffffff]
[ 0.001515][ T0] NODE_DATA(0) allocated [mem 0x19ffda000-0x19fffbfff]
[ 0.001676][ T0] Zone ranges:
[ 0.001677][ T0] DMA [mem 0x0000000000001000-0x0000000000ffffff]
[ 0.001678][ T0] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
[ 0.001679][ T0] Normal [mem 0x0000000100000000-0x000000019fffffff]
[ 0.001680][ T0] Movable zone start for each node
[ 0.001681][ T0] Early memory node ranges
[ 0.001681][ T0] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.001682][ T0] node 0: [mem 0x0000000000100000-0x00000000dffeffff]
[ 0.001683][ T0] node 0: [mem 0x0000000100000000-0x000000019fffffff]
[ 0.001683][ T0] Initmem setup node 0 [mem 0x0000000000001000-0x000000019fffffff]
[ 0.001686][ T0] On node 0, zone DMA: 1 pages in unavailable ranges
[ 0.001738][ T0] On node 0, zone DMA: 97 pages in unavailable ranges
[ 0.011334][ T0] On node 0, zone Normal: 16 pages in unavailable ranges
[ 0.011338][ T0] Intel MultiProcessor Specification v1.4
[ 0.011341][ T0] MPTABLE: OEM ID: VBOXCPU
[ 0.011342][ T0] MPTABLE: Product ID: VirtualBox
[ 0.011342][ T0] MPTABLE: APIC at: 0xFEE00000
[ 0.011403][ T0] Processor #0 (Bootup-CPU)
[ 0.011404][ T0] Processor #1
[ 0.011405][ T0] Processor #2
[ 0.011405][ T0] Processor #3
[ 0.011405][ T0] Processor #4
[ 0.011406][ T0] Processor #5
[ 0.011406][ T0] Processor #6
[ 0.011406][ T0] Processor #7
[ 0.011558][ T0] IOAPIC[0]: apic_id 0, version 32, address 0xfec00000, GSI 0-23
[ 0.011560][ T0] Processors: 8
[ 0.011561][ T0] smpboot: Allowing 8 CPUs, 0 hotplug CPUs
[ 0.011738][ T0] [mem 0xe0000000-0xfebfffff] available for PCI devices
[ 0.011739][ T0] Booting paravirtualized kernel on KVM
[ 0.011740][ T0] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[ 0.011744][ T0] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
[ 0.012191][ T0] percpu: Embedded 43 pages/cpu s147032 r0 d29096 u262144
[ 0.012323][ T0] kvm-guest: PV spinlocks enabled
[ 0.012324][ T0] PV qspinlock hash table entries: 256 (order: 0, 4096 bytes, linear)
[ 0.012328][ T0] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.6.0-rc7+ root=UUID=a2500e32-bbc8-4f32-bf86-94a6ae160b4b ro lsm=bpf console=ttyS0,115200n8 console=tty0
[ 0.012344][ T0] Unknown kernel command line parameters "BOOT_IMAGE=/boot/vmlinuz-6.6.0-rc7+ lsm=bpf", will be passed to user space.
[ 0.012359][ T0] random: crng init done
[ 0.012937][ T0] Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes, linear)
[ 0.013283][ T0] Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes, linear)
[ 0.013400][ T0] Fallback order for Node 0: 0
[ 0.013403][ T0] Built 1 zonelists, mobility grouping on. Total pages: 1548016
[ 0.013403][ T0] Policy zone: Normal
[ 0.013407][ T0] mem auto-init: stack:off, heap alloc:on, heap free:off
[ 0.013409][ T0] software IO TLB: area num 8.
[ 0.034155][ T0] Memory: 6067192K/6291000K available (6144K kernel code, 7969K rwdata, 1888K rodata, 2240K init, 4064K bss, 223548K reserved, 0K cma-reserved)
[ 0.034284][ T0] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
[ 0.034321][ T0] Dynamic Preempt: full
[ 0.034347][ T0] rcu: Preemptible hierarchical RCU implementation.
[ 0.034348][ T0] rcu: RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=8.
[ 0.034349][ T0] rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies.
[ 0.034350][ T0] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=8
[ 0.034398][ T0] NR_IRQS: 4352, nr_irqs: 104, preallocated irqs: 16
[ 0.034655][ T0] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[ 0.062932][ T0] Console: colour VGA+ 80x25
[ 0.062937][ T0] printk: console [tty0] enabled
[ 0.201799][ T0] printk: console [ttyS0] enabled
[ 0.347985][ T0] APIC: Switch to symmetric I/O mode setup
[ 0.358057][ T0] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 0.361420][ T0] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x25a39815946, max_idle_ns: 440795282683 ns
[ 0.366555][ T0] Calibrating delay loop (skipped) preset value.. 5222.41 BogoMIPS (lpj=2611208)
[ 0.367549][ T0] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 0.368557][ T0] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
[ 0.370554][ T0] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[ 0.372552][ T0] Spectre V2 : Mitigation: Enhanced / Automatic IBRS
[ 0.374551][ T0] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[ 0.376551][ T0] Spectre V2 : Spectre v2 / PBRSB-eIBRS: Retire a single CALL on VMEXIT
[ 0.378552][ T0] RETBleed: Mitigation: Enhanced IBRS
[ 0.380551][ T0] Speculative Store Bypass: Vulnerable
[ 0.382553][ T0] x86/fpu: x87 FPU will use FXSAVE
[ 0.391441][ T0] Freeing SMP alternatives memory: 16K
[ 0.392553][ T0] pid_max: default: 32768 minimum: 301
[ 0.394628][ T0] Mount-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
[ 0.396566][ T0] Mountpoint-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
[ 0.601817][ T1] smpboot: CPU0: 12th Gen Intel(R) Core(TM) i7-1255U (family: 0x6, model: 0x9a, stepping: 0x4)
[ 0.603653][ T1] Performance Events: unsupported p6 CPU model 154 no PMU driver, software events only.
[ 0.605559][ T1] signal: max sigframe size: 1440
[ 0.607568][ T1] rcu: Hierarchical SRCU implementation.
[ 0.609550][ T1] rcu: Max phase no-delay instances is 400.
[ 0.611622][ T1] NMI watchdog: Perf NMI watchdog permanently disabled
[ 0.613627][ T1] smp: Bringing up secondary CPUs ...
[ 0.616629][ T1] smpboot: x86: Booting SMP configuration:
[ 0.619552][ T1] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
[ 0.624549][ C0] TSC synchronization [CPU#0 -> CPU#1]:
[ 0.624549][ C0] Measured 112923 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.624549][ T8] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[ 11.793585][ T1] smp: Brought up 1 node, 8 CPUs
[ 22.463567][ T1] smpboot: Max logical packages: 1
[ 26.831718][ T1] smpboot: Total of 8 processors activated (41779.32 BogoMIPS)
(...too slow to wait...)
---------- bad case ----------

---------- another bad case (the same kernel, different try) ----------
(...snipped...)
[ 0.624575][ T1] smp: Bringing up secondary CPUs ...
[ 0.624575][ T1] smpboot: x86: Booting SMP configuration:
[ 0.624575][ T1] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
[ 0.626575][ C0] TSC synchronization [CPU#0 -> CPU#1]:
[ 0.626575][ C0] Measured 61974 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.626575][ T8] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[ 20.677606][ T1] smp: Brought up 1 node, 8 CPUs
[ 31.942609][ T1] smpboot: Max logical packages: 1
[ 46.890605][ T1] smpboot: Total of 8 processors activated (41779.32 BogoMIPS)
(...too slow to wait...)
---------- another bad case ----------

---------- good case (the same kernel, different try) ----------
[ 0.000000][ T0] Linux version 6.6.0-rc7+ (root@ubuntu) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #1055 SMP PREEMPT_DYNAMIC Tue Oct 24 14:20:03 UTC 2023
[ 0.000000][ T0] Command line: BOOT_IMAGE=/boot/vmlinuz-6.6.0-rc7+ root=UUID=a2500e32-bbc8-4f32-bf86-94a6ae160b4b ro lsm=bpf console=ttyS0,115200n8 console=tty0
[ 0.000000][ T0] KERNEL supported cpus:
[ 0.000000][ T0] Intel GenuineIntel
[ 0.000000][ T0] AMD AuthenticAMD
[ 0.000000][ T0] Hygon HygonGenuine
[ 0.000000][ T0] Centaur CentaurHauls
[ 0.000000][ T0] zhaoxin Shanghai
[ 0.000000][ T0] BIOS-provided physical RAM map:
[ 0.000000][ T0] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000][ T0] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000][ T0] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000][ T0] BIOS-e820: [mem 0x0000000000100000-0x00000000dffeffff] usable
[ 0.000000][ T0] BIOS-e820: [mem 0x00000000dfff0000-0x00000000dfffffff] ACPI data
[ 0.000000][ T0] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
[ 0.000000][ T0] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
[ 0.000000][ T0] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000][ T0] BIOS-e820: [mem 0x0000000100000000-0x000000019fffffff] usable
[ 0.000000][ T0] NX (Execute Disable) protection: active
[ 0.000000][ T0] APIC: Static calls initialized
[ 0.000000][ T0] SMBIOS 2.5 present.
[ 0.000000][ T0] DMI: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 0.000000][ T0] Hypervisor detected: KVM
[ 0.000000][ T0] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000002][ T0] kvm-clock: using sched offset of 189117280516 cycles
[ 0.000004][ T0] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.000006][ T0] tsc: Detected 2611.208 MHz processor
[ 0.001341][ T0] last_pfn = 0x1a0000 max_arch_pfn = 0x400000000
[ 0.001356][ T0] MTRRs disabled by BIOS
[ 0.001358][ T0] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
[ 0.001367][ T0] last_pfn = 0xdfff0 max_arch_pfn = 0x400000000
[ 0.001409][ T0] found SMP MP-table at [mem 0x0009fff0-0x0009ffff]
[ 0.001422][ T0] Incomplete global flushes, disabling PCID
[ 0.001559][ T0] RAMDISK: [mem 0x35563000-0x36aa8fff]
[ 0.001568][ T0] No NUMA configuration found
[ 0.001568][ T0] Faking a node at [mem 0x0000000000000000-0x000000019fffffff]
[ 0.001573][ T0] NODE_DATA(0) allocated [mem 0x19ffda000-0x19fffbfff]
[ 0.001704][ T0] Zone ranges:
[ 0.001704][ T0] DMA [mem 0x0000000000001000-0x0000000000ffffff]
[ 0.001706][ T0] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
[ 0.001707][ T0] Normal [mem 0x0000000100000000-0x000000019fffffff]
[ 0.001707][ T0] Movable zone start for each node
[ 0.001708][ T0] Early memory node ranges
[ 0.001709][ T0] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.001709][ T0] node 0: [mem 0x0000000000100000-0x00000000dffeffff]
[ 0.001710][ T0] node 0: [mem 0x0000000100000000-0x000000019fffffff]
[ 0.001711][ T0] Initmem setup node 0 [mem 0x0000000000001000-0x000000019fffffff]
[ 0.001714][ T0] On node 0, zone DMA: 1 pages in unavailable ranges
[ 0.001738][ T0] On node 0, zone DMA: 97 pages in unavailable ranges
[ 0.011645][ T0] On node 0, zone Normal: 16 pages in unavailable ranges
[ 0.011651][ T0] Intel MultiProcessor Specification v1.4
[ 0.011653][ T0] MPTABLE: OEM ID: VBOXCPU
[ 0.011654][ T0] MPTABLE: Product ID: VirtualBox
[ 0.011654][ T0] MPTABLE: APIC at: 0xFEE00000
[ 0.011736][ T0] Processor #0 (Bootup-CPU)
[ 0.011737][ T0] Processor #1
[ 0.011738][ T0] Processor #2
[ 0.011738][ T0] Processor #3
[ 0.011738][ T0] Processor #4
[ 0.011739][ T0] Processor #5
[ 0.011739][ T0] Processor #6
[ 0.011739][ T0] Processor #7
[ 0.011914][ T0] IOAPIC[0]: apic_id 0, version 32, address 0xfec00000, GSI 0-23
[ 0.011917][ T0] Processors: 8
[ 0.011918][ T0] smpboot: Allowing 8 CPUs, 0 hotplug CPUs
[ 0.012112][ T0] [mem 0xe0000000-0xfebfffff] available for PCI devices
[ 0.012113][ T0] Booting paravirtualized kernel on KVM
[ 0.012115][ T0] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[ 0.012119][ T0] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
[ 0.012381][ T0] percpu: Embedded 43 pages/cpu s147032 r0 d29096 u262144
[ 0.012533][ T0] kvm-guest: PV spinlocks enabled
[ 0.012534][ T0] PV qspinlock hash table entries: 256 (order: 0, 4096 bytes, linear)
[ 0.012539][ T0] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.6.0-rc7+ root=UUID=a2500e32-bbc8-4f32-bf86-94a6ae160b4b ro lsm=bpf console=ttyS0,115200n8 console=tty0
[ 0.012557][ T0] Unknown kernel command line parameters "BOOT_IMAGE=/boot/vmlinuz-6.6.0-rc7+ lsm=bpf", will be passed to user space.
[ 0.012574][ T0] random: crng init done
[ 0.013371][ T0] Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes, linear)
[ 0.013777][ T0] Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes, linear)
[ 0.013889][ T0] Fallback order for Node 0: 0
[ 0.013892][ T0] Built 1 zonelists, mobility grouping on. Total pages: 1548016
[ 0.013893][ T0] Policy zone: Normal
[ 0.013896][ T0] mem auto-init: stack:off, heap alloc:on, heap free:off
[ 0.013897][ T0] software IO TLB: area num 8.
[ 0.035006][ T0] Memory: 6067192K/6291000K available (6144K kernel code, 7969K rwdata, 1888K rodata, 2240K init, 4064K bss, 223548K reserved, 0K cma-reserved)
[ 0.035165][ T0] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
[ 0.035212][ T0] Dynamic Preempt: full
[ 0.035244][ T0] rcu: Preemptible hierarchical RCU implementation.
[ 0.035245][ T0] rcu: RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=8.
[ 0.035246][ T0] rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies.
[ 0.035247][ T0] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=8
[ 0.035309][ T0] NR_IRQS: 4352, nr_irqs: 104, preallocated irqs: 16
[ 0.035572][ T0] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[ 0.066412][ T0] Console: colour VGA+ 80x25
[ 0.066416][ T0] printk: console [tty0] enabled
[ 0.206184][ T0] printk: console [ttyS0] enabled
[ 0.355879][ T0] APIC: Switch to symmetric I/O mode setup
[ 0.365732][ T0] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 0.368881][ T0] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x25a39815946, max_idle_ns: 440795282683 ns
[ 0.373903][ T0] Calibrating delay loop (skipped) preset value.. 5222.41 BogoMIPS (lpj=2611208)
[ 0.377974][ T0] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 0.379900][ T0] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
[ 0.381903][ T0] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[ 0.383901][ T0] Spectre V2 : Mitigation: Enhanced / Automatic IBRS
[ 0.385900][ T0] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[ 0.387900][ T0] Spectre V2 : Spectre v2 / PBRSB-eIBRS: Retire a single CALL on VMEXIT
[ 0.389900][ T0] RETBleed: Mitigation: Enhanced IBRS
[ 0.391900][ T0] Speculative Store Bypass: Vulnerable
[ 0.394902][ T0] x86/fpu: x87 FPU will use FXSAVE
[ 0.403537][ T0] Freeing SMP alternatives memory: 16K
[ 0.404902][ T0] pid_max: default: 32768 minimum: 301
[ 0.406940][ T0] Mount-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
[ 0.409131][ T0] Mountpoint-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
[ 0.617243][ T1] smpboot: CPU0: 12th Gen Intel(R) Core(TM) i7-1255U (family: 0x6, model: 0x9a, stepping: 0x4)
[ 0.617898][ T1] Performance Events: unsupported p6 CPU model 154 no PMU driver, software events only.
[ 0.617898][ T1] signal: max sigframe size: 1440
[ 0.617898][ T1] rcu: Hierarchical SRCU implementation.
[ 0.617898][ T1] rcu: Max phase no-delay instances is 400.
[ 0.617898][ T1] NMI watchdog: Perf NMI watchdog permanently disabled
[ 0.617898][ T1] smp: Bringing up secondary CPUs ...
[ 0.617898][ T1] smpboot: x86: Booting SMP configuration:
[ 0.617898][ T1] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
[ 0.620898][ C0] TSC synchronization [CPU#0 -> CPU#1]:
[ 0.620898][ C0] Measured 61957 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.620898][ T8] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[ 0.856914][ T1] smp: Brought up 1 node, 8 CPUs
[ 0.910911][ T1] smpboot: Max logical packages: 1
[ 1.041908][ T1] smpboot: Total of 8 processors activated (41779.32 BogoMIPS)
[ 1.051188][ T1] devtmpfs: initialized
[ 1.210943][ T1] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[ 1.234916][ T1] futex hash table entries: 2048 (order: 5, 131072 bytes, linear)
[ 1.530084][ T1] cpuidle: using governor ladder
[ 1.531920][ T1] cpuidle: using governor menu
[ 1.577978][ T1] iommu: Default domain type: Translated
[ 1.579900][ T1] iommu: DMA domain TLB invalidation policy: lazy mode
[ 1.585010][ T1] clocksource: Switched to clocksource kvm-clock
[ 1.626044][ T1] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[ 1.626972][ T1] software IO TLB: mapped [mem 0x00000000dbff0000-0x00000000dfff0000] (64MB)
[ 1.626972][ T1] platform rtc_cmos: registered platform RTC device (no PNP device found)
(...snipped...)
---------- good case ----------

2023-10-25 21:29:05

[permalink] [raw]

Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.

On Tue, Oct 24 2023 at 23:52, Tetsuo Handa wrote:

> On 2023/10/24 22:00, Thomas Gleixner wrote:
>> Interesting. Can you please tell what the replacement clocksource is
>> when the TSC is disabled?
>
> Where can I find the replacement clocksource from?

/sys/kernel/devices/system/clocksource/clocksource0/active_clocksource

but I already figured it out. It selects kvm_clock as fallback, but that
happens way later. At that point during AP bringup timekeeping is still
based on jiffies. So that TSC unstable marking is not an issue
vs. timekeeping.

I have no idea what the kernel, VirtualPox or Windoze are doing during
that time. I fear you need to add some debug on your own or if
VirtualPox has a monitor/debugger you might use that to inspect what the
guest is doing.

Thanks,

tglx

2023-10-25 21:31:52

by tip-bot2 for Dhananjay Ugwekar

[permalink] [raw]

Subject: [PATCH] x86/tsc: Defer marking TSC unstable to a worker

Tetsuo reported the following lockdep splat when the TSC synchronization
fails during CPU hotplug:

tsc: Marking TSC unstable due to check_tsc_sync_source failed

WARNING: inconsistent lock state
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
ffffffff8cfa1c78 (watchdog_lock){?.-.}-{2:2}, at: clocksource_watchdog+0x23/0x5a0
{IN-HARDIRQ-W} state was registered at:
_raw_spin_lock_irqsave+0x3f/0x60
clocksource_mark_unstable+0x1b/0x90
mark_tsc_unstable+0x41/0x50
check_tsc_sync_source+0x14f/0x180
sysvec_call_function_single+0x69/0x90

Possible unsafe locking scenario:
lock(watchdog_lock);
<Interrupt>
lock(watchdog_lock);

stack backtrace:
_raw_spin_lock+0x30/0x40
clocksource_watchdog+0x23/0x5a0
run_timer_softirq+0x2a/0x50
sysvec_apic_timer_interrupt+0x6e/0x90

The reason is the recent conversion of the TSC synchronization function
during CPU hotplug on the control CPU to a SMP function call. In case
that the synchronization with the upcoming CPU fails, the TSC has to be
marked unstable via clocksource_mark_unstable().

clocksource_mark_unstable() acquires 'watchdog_lock', but that lock is
taken with interrupts enabled in the watchdog timer callback to minimize
interrupt disabled time. That's obviously a possible deadlock scenario,

Before that change the synchronization function was invoked in thread
context so this could not happen.

As it is not crucical whether the unstable marking happens slightly
delayed, defer the call to a worker thread which avoids the lock context
problem.

Fixes: 9d349d47f0e3 ("x86/smpboot: Make TSC synchronization function call based")
Reported-by: Tetsuo Handa <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Tetsuo Handa <[email protected]>
Cc: [email protected]

---
arch/x86/kernel/tsc_sync.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

--- a/arch/x86/kernel/tsc_sync.c
+++ b/arch/x86/kernel/tsc_sync.c
@@ -15,6 +15,7 @@
* ( The serial nature of the boot logic and the CPU hotplug lock
* protects against more than 2 CPUs entering this code. )
*/
+#include <linux/workqueue.h>
#include <linux/topology.h>
#include <linux/spinlock.h>
#include <linux/kernel.h>
@@ -342,6 +343,13 @@ static inline unsigned int loop_timeout(
return (cpumask_weight(topology_core_cpumask(cpu)) > 1) ? 2 : 20;
}

+static void tsc_sync_mark_tsc_unstable(struct work_struct *work)
+{
+ mark_tsc_unstable("check_tsc_sync_source failed");
+}
+
+static DECLARE_WORK(tsc_sync_work, tsc_sync_mark_tsc_unstable);
+
/*
* The freshly booted CPU initiates this via an async SMP function call.
*/
@@ -395,7 +403,7 @@ static void check_tsc_sync_source(void *
"turning off TSC clock.\n", max_warp);
if (random_warps)
pr_warn("TSC warped randomly between CPUs\n");
- mark_tsc_unstable("check_tsc_sync_source failed");
+ schedule_work(&tsc_sync_work);
}

/*

2023-10-26 02:33:38

by Oliver Sang

[permalink] [raw]

Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.

Hello,

kernel test robot noticed "WARNING:at_kernel/locking/irqflag-debug.c:#warn_bogus_irq_restore" on:

commit: e0e3e69931bce73960e6583cd6385ebbd77fa12f ("[PATCH] clocksource: disable irq when holding watchdog_lock.")
url: https://github.com/intel-lab-lkp/linux/commits/Tetsuo-Handa/clocksource-disable-irq-when-holding-watchdog_lock/20231017-143723
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 8ceea12d183cf29f28072dede218a04eda2a789c
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [PATCH] clocksource: disable irq when holding watchdog_lock.

in testcase: boot

compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)

+-------------------------------------------------------------------+------------+------------+
| | 8ceea12d18 | e0e3e69931 |
+-------------------------------------------------------------------+------------+------------+
| WARNING:at_kernel/locking/irqflag-debug.c:#warn_bogus_irq_restore | 0 | 13 |
| RIP:warn_bogus_irq_restore | 0 | 13 |
+-------------------------------------------------------------------+------------+------------+

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]

[ 33.794687][ C0] ------------[ cut here ]------------
[ 33.795935][ C0] raw_local_irq_restore() called with IRQs enabled
[ 33.797202][ C0] WARNING: CPU: 0 PID: 10 at kernel/locking/irqflag-debug.c:10 warn_bogus_irq_restore (kernel/locking/irqflag-debug.c:10 (discriminator 3))
[ 33.798562][ C0] Modules linked in:
[ 33.799208][ C0] CPU: 0 PID: 10 Comm: kworker/u4:0 Not tainted 6.6.0-rc1-00009-ge0e3e69931bc #1 1a5570c26ece064894076fe3d70a31d4a0268b39
[ 33.801328][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 33.802561][ C0] Workqueue: events_unbound padata_mt_helper
[ 33.803751][ C0] RIP: 0010:warn_bogus_irq_restore (kernel/locking/irqflag-debug.c:10 (discriminator 3))
[ 33.804801][ C0] Code: 0f 1f 44 00 00 f3 0f 1e fa 90 80 3d 92 d0 f9 00 00 74 02 90 c3 c6 05 87 d0 f9 00 01 90 48 c7 c7 68 ad 59 82 e8 0c a4 29 ff 90 <0f> 0b 90 90 eb e3 0f 1f 44 00 00 f3 0f 1e fa 44 8b 05 39 1a fc 00
All code
========
0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
5: f3 0f 1e fa endbr64
9: 90 nop
a: 80 3d 92 d0 f9 00 00 cmpb $0x0,0xf9d092(%rip) # 0xf9d0a3
11: 74 02 je 0x15
13: 90 nop
14: c3 ret
15: c6 05 87 d0 f9 00 01 movb $0x1,0xf9d087(%rip) # 0xf9d0a3
1c: 90 nop
1d: 48 c7 c7 68 ad 59 82 mov $0xffffffff8259ad68,%rdi
24: e8 0c a4 29 ff call 0xffffffffff29a435
29: 90 nop
2a:* 0f 0b ud2 <-- trapping instruction
2c: 90 nop
2d: 90 nop
2e: eb e3 jmp 0x13
30: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
35: f3 0f 1e fa endbr64
39: 44 8b 05 39 1a fc 00 mov 0xfc1a39(%rip),%r8d # 0xfc1a79

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 90 nop
3: 90 nop
4: eb e3 jmp 0xffffffffffffffe9
6: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
b: f3 0f 1e fa endbr64
f: 44 8b 05 39 1a fc 00 mov 0xfc1a39(%rip),%r8d # 0xfc1a4f
[ 33.806560][ C0] RSP: 0000:ffff88842fa05e40 EFLAGS: 00010286
[ 33.807965][ C0] RAX: 0000000000000000 RBX: ffffffff82c290c0 RCX: 0000000000000100
[ 33.809247][ C0] RDX: ffff888100578000 RSI: ffffffff810c1d4f RDI: 0000000000000001
[ 33.810559][ C0] RBP: 0000000000000246 R08: 0000000000000000 R09: 0000000000000001
[ 33.812116][ C0] R10: 0000000000000001 R11: ffffffff82c10bf8 R12: ffffffff811c03c0
[ 33.813603][ C0] R13: 00000000fffedb90 R14: 00000000fffedb90 R15: ffff88842fa05f08
[ 33.814564][ C0] FS: 0000000000000000(0000) GS:ffff88842fa00000(0000) knlGS:0000000000000000
[ 33.816171][ C0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 33.817290][ C0] CR2: ffff88843ffff000 CR3: 0000000002853000 CR4: 00000000000406f0
[ 33.818566][ C0] Call Trace:
[ 33.819066][ C0] <IRQ>
[ 33.819497][ C0] ? show_regs (arch/x86/kernel/dumpstack.c:479)
[ 33.820147][ C0] ? __warn (kernel/panic.c:673)
[ 33.820756][ C0] ? warn_bogus_irq_restore (kernel/locking/irqflag-debug.c:10 (discriminator 3))
[ 33.821557][ C0] ? report_bug (lib/bug.c:180 lib/bug.c:219)
[ 33.822432][ C0] ? handle_bug (arch/x86/kernel/traps.c:237)
[ 33.822560][ C0] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1))
[ 33.823248][ C0] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568)
[ 33.823984][ C0] ? clocksource_watchdog_kthread (kernel/time/clocksource.c:400)
[ 33.824914][ C0] ? __warn_printk (kernel/panic.c:722)
[ 33.825736][ C0] ? warn_bogus_irq_restore (kernel/locking/irqflag-debug.c:10 (discriminator 3))
[ 33.826566][ C0] _raw_spin_unlock_irqrestore (arch/x86/include/asm/irqflags.h:134 include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
[ 33.827489][ C0] call_timer_fn (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/timer.h:127 kernel/time/timer.c:1701)
[ 33.828252][ C0] ? clocksource_watchdog_kthread (kernel/time/clocksource.c:400)
[ 33.829277][ C0] __run_timers (kernel/time/timer.c:1752 kernel/time/timer.c:2022)
[ 33.830126][ C0] __do_softirq (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/irq.h:142 kernel/softirq.c:554)
[ 33.830563][ C0] ? sched_clock_cpu (kernel/sched/clock.c:270 kernel/sched/clock.c:405)
[ 33.831364][ C0] irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:644)
[ 33.832124][ C0] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1074 (discriminator 14))
[ 33.833107][ C0] </IRQ>
[ 33.833608][ C0] <TASK>
[ 33.834109][ C0] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:645)
[ 33.834562][ C0] RIP: 0010:clear_page_rep (arch/x86/lib/clear_page_64.S:20)
[ 33.835452][ C0] Code: 5b 08 48 83 c7 18 e8 24 ab fe ff 48 89 5d 20 48 85 db 75 e2 5b 5d e9 34 e8 3f ff 0f 1f 40 00 f3 0f 1e fa b9 00 02 00 00 31 c0 <f3> 48 ab c3 90 f3 0f 1e fa 31 c0 b9 40 00 00 00 0f 1f 44 00 00 ff
All code
========
0: 5b pop %rbx
1: 08 48 83 or %cl,-0x7d(%rax)
4: c7 (bad)
5: 18 e8 sbb %ch,%al
7: 24 ab and $0xab,%al
9: fe (bad)
a: ff 48 89 decl -0x77(%rax)
d: 5d pop %rbp
e: 20 48 85 and %cl,-0x7b(%rax)
11: db 75 e2 (bad) -0x1e(%rbp)
14: 5b pop %rbx
15: 5d pop %rbp
16: e9 34 e8 3f ff jmp 0xffffffffff3fe84f
1b: 0f 1f 40 00 nopl 0x0(%rax)
1f: f3 0f 1e fa endbr64
23: b9 00 02 00 00 mov $0x200,%ecx
28: 31 c0 xor %eax,%eax
2a:* f3 48 ab rep stos %rax,%es:(%rdi) <-- trapping instruction
2d: c3 ret
2e: 90 nop
2f: f3 0f 1e fa endbr64
33: 31 c0 xor %eax,%eax
35: b9 40 00 00 00 mov $0x40,%ecx
3a: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
3f: ff .byte 0xff

Code starting with the faulting instruction
===========================================
0: f3 48 ab rep stos %rax,%es:(%rdi)
3: c3 ret
4: 90 nop
5: f3 0f 1e fa endbr64
9: 31 c0 xor %eax,%eax
b: b9 40 00 00 00 mov $0x40,%ecx
10: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
15: ff .byte 0xff
[ 33.838561][ C0] RSP: 0000:ffff88810057fc98 EFLAGS: 00010246
[ 33.839599][ C0] RAX: 0000000000000000 RBX: ffffea0005140000 RCX: 0000000000000200
[ 33.840946][ C0] RDX: 0000000000000200 RSI: ffffea0005148000 RDI: ffff888145200000
[ 33.842280][ C0] RBP: 0000000000000700 R08: 0000000000000001 R09: 0000000000000001
[ 33.842562][ C0] R10: 0000000000000001 R11: ffffffff84799a30 R12: 0000000000145000
[ 33.843895][ C0] R13: 000000000000000a R14: 0000000000000001 R15: 0000000000000400
[ 33.845247][ C0] __free_pages_ok (mm/page_alloc.c:1070 mm/page_alloc.c:1164 mm/page_alloc.c:1266)
[ 33.846083][ C0] deferred_free_range (mm/mm_init.c:1990)
[ 33.846565][ C0] deferred_init_maxorder (mm/mm_init.c:2145)
[ 33.847517][ C0] deferred_init_memmap_chunk (mm/mm_init.c:2168)
[ 33.848497][ C0] padata_mt_helper (include/linux/spinlock.h:351 kernel/padata.c:448)
[ 33.849347][ C0] process_one_work (kernel/workqueue.c:2635)
[ 33.850240][ C0] worker_thread (kernel/workqueue.c:2697 kernel/workqueue.c:2784)
[ 33.850569][ C0] ? process_one_work (kernel/workqueue.c:2730)
[ 33.851504][ C0] kthread (kernel/kthread.c:388)
[ 33.852237][ C0] ? kthread_complete_and_exit (kernel/kthread.c:341)
[ 33.853263][ C0] ret_from_fork (arch/x86/kernel/process.c:153)
[ 33.854093][ C0] ? kthread_complete_and_exit (kernel/kthread.c:341)
[ 33.854566][ C0] ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
[ 33.855469][ C0] </TASK>
[ 33.856020][ C0] irq event stamp: 110440
[ 33.856786][ C0] hardirqs last enabled at (110448): console_unlock (arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:67 arch/x86/include/asm/irqflags.h:127 kernel/printk/printk.c:347 kernel/printk/printk.c:2718 kernel/printk/printk.c:3037)
[ 33.858488][ C0] hardirqs last disabled at (110455): console_unlock (kernel/printk/printk.c:345 kernel/printk/printk.c:2718 kernel/printk/printk.c:3037)
[ 33.858561][ C0] softirqs last enabled at (105760): __do_softirq (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:400 kernel/softirq.c:582)
[ 33.860220][ C0] softirqs last disabled at (109873): irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:644)
[ 33.861864][ C0] ---[ end trace 0000000000000000 ]---
[ 43.589614][ T12] rcu_tasks_wait_gp: rcu_tasks grace period number 5 (since boot) is 2542 jiffies old.
[ 47.973598][ T26] node 0 deferred pages initialised in 14620ms
[ 47.974563][ T26] pgdatinit0 (26) used greatest stack depth: 14104 bytes left
[ 48.006348][ T1] allocated 268435456 bytes of page_ext
[ 48.007874][ T1] Node 0, zone DMA32: page owner found early allocated 9 pages
[ 48.013919][ T12] Callback from call_rcu_tasks() invoked.
[ 48.059811][ T1] Node 0, zone Normal: page owner found early allocated 67566 pages
[ 48.063845][ T1] devtmpfs: initialized
[ 48.063942][ T1] x86/mm: Memory block size: 128MB
[ 48.105976][ T1] Running RCU synchronous self tests
[ 48.106615][ T1] Running RCU synchronous self tests
[ 48.106615][ T1] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[ 48.106615][ T1] futex hash table entries: 512 (order: 4, 65536 bytes, linear)
[ 48.110571][ T1] prandom: seed boundary self test passed
[ 48.115313][ T1] prandom: 100 self tests passed

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231026/[email protected]

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-10-26 09:39:34

[permalink] [raw]

Subject: [tip: x86/urgent] x86/tsc: Defer marking TSC unstable to a worker

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 4f0d5e708ef88b58c954581ca54161064beaaaf5
Gitweb: https://git.kernel.org/tip/4f0d5e708ef88b58c954581ca54161064beaaaf5
Author: Thomas Gleixner <[email protected]>
AuthorDate: Wed, 25 Oct 2023 23:31:35 +02:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Thu, 26 Oct 2023 11:31:45 +02:00

x86/tsc: Defer marking TSC unstable to a worker

Tetsuo reported the following lockdep splat when the TSC synchronization
fails during CPU hotplug:

tsc: Marking TSC unstable due to check_tsc_sync_source failed

WARNING: inconsistent lock state
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
ffffffff8cfa1c78 (watchdog_lock){?.-.}-{2:2}, at: clocksource_watchdog+0x23/0x5a0
{IN-HARDIRQ-W} state was registered at:
_raw_spin_lock_irqsave+0x3f/0x60
clocksource_mark_unstable+0x1b/0x90
mark_tsc_unstable+0x41/0x50
check_tsc_sync_source+0x14f/0x180
sysvec_call_function_single+0x69/0x90

Possible unsafe locking scenario:
lock(watchdog_lock);
<Interrupt>
lock(watchdog_lock);

stack backtrace:
_raw_spin_lock+0x30/0x40
clocksource_watchdog+0x23/0x5a0
run_timer_softirq+0x2a/0x50
sysvec_apic_timer_interrupt+0x6e/0x90

The reason is the recent conversion of the TSC synchronization function
during CPU hotplug on the control CPU to a SMP function call. In case
that the synchronization with the upcoming CPU fails, the TSC has to be
marked unstable via clocksource_mark_unstable().

clocksource_mark_unstable() acquires 'watchdog_lock', but that lock is
taken with interrupts enabled in the watchdog timer callback to minimize
interrupt disabled time. That's obviously a possible deadlock scenario,

Before that change the synchronization function was invoked in thread
context so this could not happen.

As it is not crucical whether the unstable marking happens slightly
delayed, defer the call to a worker thread which avoids the lock context
problem.

Fixes: 9d349d47f0e3 ("x86/smpboot: Make TSC synchronization function call based")
Reported-by: Tetsuo Handa <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Tetsuo Handa <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/87zg064ceg.ffs@tglx
---
arch/x86/kernel/tsc_sync.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tsc_sync.c b/arch/x86/kernel/tsc_sync.c
index bbc440c..1123ef3 100644
--- a/arch/x86/kernel/tsc_sync.c
+++ b/arch/x86/kernel/tsc_sync.c
@@ -15,6 +15,7 @@
* ( The serial nature of the boot logic and the CPU hotplug lock
* protects against more than 2 CPUs entering this code. )
*/
+#include <linux/workqueue.h>
#include <linux/topology.h>
#include <linux/spinlock.h>
#include <linux/kernel.h>
@@ -342,6 +343,13 @@ static inline unsigned int loop_timeout(int cpu)
return (cpumask_weight(topology_core_cpumask(cpu)) > 1) ? 2 : 20;
}

+static void tsc_sync_mark_tsc_unstable(struct work_struct *work)
+{
+ mark_tsc_unstable("check_tsc_sync_source failed");
+}
+
+static DECLARE_WORK(tsc_sync_work, tsc_sync_mark_tsc_unstable);
+
/*
* The freshly booted CPU initiates this via an async SMP function call.
*/
@@ -395,7 +403,7 @@ retry:
"turning off TSC clock.\n", max_warp);
if (random_warps)
pr_warn("TSC warped randomly between CPUs\n");
- mark_tsc_unstable("check_tsc_sync_source failed");
+ schedule_work(&tsc_sync_work);
}

/*

2023-10-27 15:19:22