2015-06-02 09:46:01

by Gu Zheng

Subject: [RFC PATCH V2] x86, espfix: postpone the initialization of espfix stack for AP

The following lockdep warning occurs when running with the latest kernel:
[ 3.178000] ------------[ cut here ]------------
[ 3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 lockdep_trace_alloc+0xdd/0xe0()
[ 3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[ 3.199000] Modules linked in:

[ 3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
[ 3.221000] 0000000000000000 2d6601fb3e6d4e4c ffff88086fd5fc38 ffffffff81773f0a
[ 3.230000] 0000000000000000 ffff88086fd5fc90 ffff88086fd5fc78 ffffffff8108c85a
[ 3.238000] ffff88086fd60000 0000000000000092 ffff88086fd60000 00000000000000d0
[ 3.246000] Call Trace:
[ 3.249000] [<ffffffff81773f0a>] dump_stack+0x4c/0x65
[ 3.255000] [<ffffffff8108c85a>] warn_slowpath_common+0x8a/0xc0
[ 3.261000] [<ffffffff8108c8e5>] warn_slowpath_fmt+0x55/0x70
[ 3.268000] [<ffffffff810ee24d>] lockdep_trace_alloc+0xdd/0xe0
[ 3.274000] [<ffffffff811cda0d>] __alloc_pages_nodemask+0xad/0xca0
[ 3.281000] [<ffffffff810ec7ad>] ? __lock_acquire+0xf6d/0x1560
[ 3.288000] [<ffffffff81219c8a>] alloc_page_interleave+0x3a/0x90
[ 3.295000] [<ffffffff8121b32d>] alloc_pages_current+0x17d/0x1a0
[ 3.301000] [<ffffffff811c869e>] ? __get_free_pages+0xe/0x50
[ 3.308000] [<ffffffff811c869e>] __get_free_pages+0xe/0x50
[ 3.314000] [<ffffffff8102640b>] init_espfix_ap+0x17b/0x320
[ 3.320000] [<ffffffff8105c691>] start_secondary+0xf1/0x1f0
[ 3.327000] ---[ end trace 1b3327d9d6a1d62c ]---

This seems to be a mis-warning by lockdep: we allocate pages with GFP_KERNEL in
init_espfix_ap(), which is called before local irqs are enabled, and the lockdep
sub-system treats this as allocating memory with GFP_FS while local irqs are
disabled, which triggers the warning shown above.

We could allocate them on the boot CPU side and hand them over to the
secondary CPU, but that seems a bit wasteful if some of the CPUs are offline.
As there is no need for these pages (the espfix stack) until we try to run user
code, we postpone the initialization of the espfix stack until after the cpu
has booted, to avoid the noise.
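
For illustration, a minimal sketch of the pattern lockdep_trace_alloc() is
flagging (this is not code from the patch, just the shape of the problem): a
potentially sleeping allocation issued while local interrupts are still
disabled on the early AP bring-up path.

#include <linux/gfp.h>
#include <linux/irqflags.h>

/*
 * Illustrative sketch only: GFP_KERNEL implies __GFP_FS and may sleep,
 * so issuing such an allocation with local interrupts disabled is
 * exactly what lockdep_trace_alloc() warns about.
 */
static void espfix_lockdep_example(void)
{
	unsigned long p;

	local_irq_disable();			/* the AP still has irqs off this early */
	p = __get_free_page(GFP_KERNEL);	/* may sleep => lockdep warns */
	local_irq_enable();
	free_page(p);
}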


Signed-off-by: Gu Zheng <[email protected]>
---
v2:
Let the boot up routine init the espfix stack for the target cpu after it
booted.
---
arch/x86/include/asm/espfix.h | 2 +-
arch/x86/kernel/espfix_64.c | 15 +++++++--------
arch/x86/kernel/smpboot.c | 14 +++++++-------
3 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/espfix.h b/arch/x86/include/asm/espfix.h
index 99efebb..b074c4f 100644
--- a/arch/x86/include/asm/espfix.h
+++ b/arch/x86/include/asm/espfix.h
@@ -9,7 +9,7 @@ DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);

extern void init_espfix_bsp(void);
-extern void init_espfix_ap(void);
+extern void init_espfix_ap(int cpu);

#endif /* CONFIG_X86_64 */

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index f5d0730..37a4404 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -131,12 +131,12 @@ void __init init_espfix_bsp(void)
init_espfix_random();

/* The rest is the same as for any other processor */
- init_espfix_ap();
+ init_espfix_ap(0);
}

-void init_espfix_ap(void)
+void init_espfix_ap(int cpu)
{
- unsigned int cpu, page;
+ unsigned int page;
unsigned long addr;
pud_t pud, *pud_p;
pmd_t pmd, *pmd_p;
@@ -146,10 +146,9 @@ void init_espfix_ap(void)
pteval_t ptemask;

/* We only have to do this once... */
- if (likely(this_cpu_read(espfix_stack)))
+ if (likely(per_cpu(espfix_stack, cpu)))
return; /* Already initialized */

- cpu = smp_processor_id();
addr = espfix_base_addr(cpu);
page = cpu/ESPFIX_STACKS_PER_PAGE;

@@ -199,7 +198,7 @@ void init_espfix_ap(void)
unlock_done:
mutex_unlock(&espfix_init_mutex);
done:
- this_cpu_write(espfix_stack, addr);
- this_cpu_write(espfix_waddr, (unsigned long)stack_page
- + (addr & ~PAGE_MASK));
+ per_cpu(espfix_stack, cpu) = addr;
+ per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page
+ + (addr & ~PAGE_MASK);
}
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 50e547e..e9fdd0e 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -240,13 +240,6 @@ static void notrace start_secondary(void *unused)
check_tsc_sync_target();

/*
- * Enable the espfix hack for this CPU
- */
-#ifdef CONFIG_X86_ESPFIX64
- init_espfix_ap();
-#endif
-
- /*
* We need to hold vector_lock so there the set of online cpus
* does not change while we are assigning vectors to cpus. Holding
* this lock ensures we don't half assign or remove an irq from a cpu.
@@ -901,6 +894,13 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
}
}

+ /*
+ * Enable the espfix hack for this CPU
+ */
+#ifdef CONFIG_X86_ESPFIX64
+ init_espfix_ap(cpu);
+#endif
+
/* mark "stuck" area as not stuck */
*trampoline_status = 0;

--
1.7.7


2015-06-02 11:59:34

by Ingo Molnar

Subject: Re: [RFC PATCH V2] x86, espfix: postpone the initialization of espfix stack for AP


* Gu Zheng <[email protected]> wrote:

> The following lockdep warning occurs when running with the latest kernel:
> [ 3.178000] ------------[ cut here ]------------
> [ 3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 lockdep_trace_alloc+0xdd/0xe0()
> [ 3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
> [ 3.199000] Modules linked in:
>
> [ 3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
> [ 3.221000] 0000000000000000 2d6601fb3e6d4e4c ffff88086fd5fc38 ffffffff81773f0a
> [ 3.230000] 0000000000000000 ffff88086fd5fc90 ffff88086fd5fc78 ffffffff8108c85a
> [ 3.238000] ffff88086fd60000 0000000000000092 ffff88086fd60000 00000000000000d0
> [ 3.246000] Call Trace:
> [ 3.249000] [<ffffffff81773f0a>] dump_stack+0x4c/0x65
> [ 3.255000] [<ffffffff8108c85a>] warn_slowpath_common+0x8a/0xc0
> [ 3.261000] [<ffffffff8108c8e5>] warn_slowpath_fmt+0x55/0x70
> [ 3.268000] [<ffffffff810ee24d>] lockdep_trace_alloc+0xdd/0xe0
> [ 3.274000] [<ffffffff811cda0d>] __alloc_pages_nodemask+0xad/0xca0
> [ 3.281000] [<ffffffff810ec7ad>] ? __lock_acquire+0xf6d/0x1560
> [ 3.288000] [<ffffffff81219c8a>] alloc_page_interleave+0x3a/0x90
> [ 3.295000] [<ffffffff8121b32d>] alloc_pages_current+0x17d/0x1a0
> [ 3.301000] [<ffffffff811c869e>] ? __get_free_pages+0xe/0x50
> [ 3.308000] [<ffffffff811c869e>] __get_free_pages+0xe/0x50
> [ 3.314000] [<ffffffff8102640b>] init_espfix_ap+0x17b/0x320
> [ 3.320000] [<ffffffff8105c691>] start_secondary+0xf1/0x1f0
> [ 3.327000] ---[ end trace 1b3327d9d6a1d62c ]---
>
> This seems to be a mis-warning by lockdep: we allocate pages with GFP_KERNEL in
> init_espfix_ap(), which is called before local irqs are enabled, and the lockdep
> sub-system treats this as allocating memory with GFP_FS while local irqs are
> disabled, which triggers the warning shown above.

Why should this be a 'mis-warning'? If the GFP_KERNEL allocation sleeps then we'll
sleep with irqs disabled => bad.

This looks like a real (albeit hard to trigger) bug.

> We could allocate them on the boot CPU side and hand them over to the secondary
> CPU, but that seems a bit wasteful if some of the CPUs are offline. As there is
> no need for these pages (the espfix stack) until we try to run user code, we
> postpone the initialization of the espfix stack until after the cpu has booted,
> to avoid the noise.

> -void init_espfix_ap(void)
> +void init_espfix_ap(int cpu)
> {

So how about the concern I raised in a former thread, that the allocation should
be done for the node the target CPU is on? The 'cpu' parameter should be
propagated to the allocation as well, and turned into a node allocation or so.

Even though some CPUs will share the espfix stack, some won't.
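
In code terms, the change being asked for is roughly the following (a sketch
only, assuming the init_espfix_ap(cpu) interface from this patch; the V1
posting further down implements essentially this):

	/* Sketch: allocate the espfix page on the node of the target CPU,
	 * not on whatever node the CPU doing the allocation happens to be on. */
	int node = cpu_to_node(cpu);
	struct page *page = alloc_pages_node(node, GFP_KERNEL, 0);
	void *stack_page = page ? page_address(page) : NULL;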

Thanks,

Ingo

2015-06-03 10:18:56

by Gu Zheng

Subject: Re: [RFC PATCH V2] x86, espfix: postpone the initialization of espfix stack for AP

Hi Ingo,

On 06/02/2015 07:59 PM, Ingo Molnar wrote:

>
> * Gu Zheng <[email protected]> wrote:
>
>> The following lockdep warning occurs when running with the latest kernel:
>> [ 3.178000] ------------[ cut here ]------------
>> [ 3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 lockdep_trace_alloc+0xdd/0xe0()
>> [ 3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
>> [ 3.199000] Modules linked in:
>>
>> [ 3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
>> [ 3.221000] 0000000000000000 2d6601fb3e6d4e4c ffff88086fd5fc38 ffffffff81773f0a
>> [ 3.230000] 0000000000000000 ffff88086fd5fc90 ffff88086fd5fc78 ffffffff8108c85a
>> [ 3.238000] ffff88086fd60000 0000000000000092 ffff88086fd60000 00000000000000d0
>> [ 3.246000] Call Trace:
>> [ 3.249000] [<ffffffff81773f0a>] dump_stack+0x4c/0x65
>> [ 3.255000] [<ffffffff8108c85a>] warn_slowpath_common+0x8a/0xc0
>> [ 3.261000] [<ffffffff8108c8e5>] warn_slowpath_fmt+0x55/0x70
>> [ 3.268000] [<ffffffff810ee24d>] lockdep_trace_alloc+0xdd/0xe0
>> [ 3.274000] [<ffffffff811cda0d>] __alloc_pages_nodemask+0xad/0xca0
>> [ 3.281000] [<ffffffff810ec7ad>] ? __lock_acquire+0xf6d/0x1560
>> [ 3.288000] [<ffffffff81219c8a>] alloc_page_interleave+0x3a/0x90
>> [ 3.295000] [<ffffffff8121b32d>] alloc_pages_current+0x17d/0x1a0
>> [ 3.301000] [<ffffffff811c869e>] ? __get_free_pages+0xe/0x50
>> [ 3.308000] [<ffffffff811c869e>] __get_free_pages+0xe/0x50
>> [ 3.314000] [<ffffffff8102640b>] init_espfix_ap+0x17b/0x320
>> [ 3.320000] [<ffffffff8105c691>] start_secondary+0xf1/0x1f0
>> [ 3.327000] ---[ end trace 1b3327d9d6a1d62c ]---
>>
>> This seems to be a mis-warning by lockdep: we allocate pages with GFP_KERNEL in
>> init_espfix_ap(), which is called before local irqs are enabled, and the lockdep
>> sub-system treats this as allocating memory with GFP_FS while local irqs are
>> disabled, which triggers the warning shown above.
>
> Why should this be a 'mis-warning'? If the GFP_KERNEL allocation sleeps then we'll
> sleep with irqs disabled => bad.
>
> This looks like a real (albeit hard to trigger) bug.


You are right.
Thanks for correcting me, I misread the log.

>
>> We could allocate them on the boot CPU side and hand them over to the secondary
>> CPU, but that seems a bit wasteful if some of the CPUs are offline. As there is
>> no need for these pages (the espfix stack) until we try to run user code, we
>> postpone the initialization of the espfix stack until after the cpu has booted,
>> to avoid the noise.
>
>> -void init_espfix_ap(void)
>> +void init_espfix_ap(int cpu)
>> {
>
> So how about the concern I raised in a former thread, that the allocation should
> be done for the node the target CPU is on? The 'cpu' parameter should be
> propagated to the allocation as well, and turned into a node allocation or so.
>
> Even though some CPUs will share the espfix stack, some won't.


Hmm, sounds reasonable.

Regards,
Gu

>
> Thanks,
>
> Ingo
> .
>



2015-06-04 10:06:30

by Gu Zheng

Subject: [PATCH V1] x86, espfix: postpone the initialization of espfix stack for AP

The following lockdep warning occurs when running with the latest kernel:
[ 3.178000] ------------[ cut here ]------------
[ 3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 lockdep_trace_alloc+0xdd/0xe0()
[ 3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[ 3.199000] Modules linked in:

[ 3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
[ 3.221000] 0000000000000000 2d6601fb3e6d4e4c ffff88086fd5fc38 ffffffff81773f0a
[ 3.230000] 0000000000000000 ffff88086fd5fc90 ffff88086fd5fc78 ffffffff8108c85a
[ 3.238000] ffff88086fd60000 0000000000000092 ffff88086fd60000 00000000000000d0
[ 3.246000] Call Trace:
[ 3.249000] [<ffffffff81773f0a>] dump_stack+0x4c/0x65
[ 3.255000] [<ffffffff8108c85a>] warn_slowpath_common+0x8a/0xc0
[ 3.261000] [<ffffffff8108c8e5>] warn_slowpath_fmt+0x55/0x70
[ 3.268000] [<ffffffff810ee24d>] lockdep_trace_alloc+0xdd/0xe0
[ 3.274000] [<ffffffff811cda0d>] __alloc_pages_nodemask+0xad/0xca0
[ 3.281000] [<ffffffff810ec7ad>] ? __lock_acquire+0xf6d/0x1560
[ 3.288000] [<ffffffff81219c8a>] alloc_page_interleave+0x3a/0x90
[ 3.295000] [<ffffffff8121b32d>] alloc_pages_current+0x17d/0x1a0
[ 3.301000] [<ffffffff811c869e>] ? __get_free_pages+0xe/0x50
[ 3.308000] [<ffffffff811c869e>] __get_free_pages+0xe/0x50
[ 3.314000] [<ffffffff8102640b>] init_espfix_ap+0x17b/0x320
[ 3.320000] [<ffffffff8105c691>] start_secondary+0xf1/0x1f0
[ 3.327000] ---[ end trace 1b3327d9d6a1d62c ]---

We allocate pages with GFP_KERNEL in init_espfix_ap(), which is called
before local irqs are enabled, and the lockdep sub-system treats this as
allocating memory with GFP_FS while local irqs are disabled, which
triggers the warning shown above.

We could allocate them on the boot CPU side and hand them over to the
secondary CPU, but that seems a bit wasteful if some of the CPUs are offline.
As there is no need for these pages (the espfix stack) until we try to run
user code, we postpone the initialization of the espfix stack and let the
boot-up routine init the espfix stack for the target cpu after it has booted,
to avoid the noise.

Signed-off-by: Gu Zheng <[email protected]>
---
v1:
Alloc the page on the node the target CPU is on.
RFC:
Let the boot up routine init the espfix stack for the target cpu after it
booted.
---
arch/x86/include/asm/espfix.h | 2 +-
arch/x86/kernel/espfix_64.c | 28 ++++++++++++++++------------
arch/x86/kernel/smpboot.c | 14 +++++++-------
3 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/espfix.h b/arch/x86/include/asm/espfix.h
index 99efebb..ca3ce9a 100644
--- a/arch/x86/include/asm/espfix.h
+++ b/arch/x86/include/asm/espfix.h
@@ -9,7 +9,7 @@ DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);

extern void init_espfix_bsp(void);
-extern void init_espfix_ap(void);
+extern void init_espfix_ap(int cpu);

#endif /* CONFIG_X86_64 */

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index f5d0730..e397583 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -131,25 +131,24 @@ void __init init_espfix_bsp(void)
init_espfix_random();

/* The rest is the same as for any other processor */
- init_espfix_ap();
+ init_espfix_ap(0);
}

-void init_espfix_ap(void)
+void init_espfix_ap(int cpu)
{
- unsigned int cpu, page;
+ unsigned int page;
unsigned long addr;
pud_t pud, *pud_p;
pmd_t pmd, *pmd_p;
pte_t pte, *pte_p;
- int n;
+ int n, node;
void *stack_page;
pteval_t ptemask;

/* We only have to do this once... */
- if (likely(this_cpu_read(espfix_stack)))
+ if (likely(per_cpu(espfix_stack, cpu)))
return; /* Already initialized */

- cpu = smp_processor_id();
addr = espfix_base_addr(cpu);
page = cpu/ESPFIX_STACKS_PER_PAGE;

@@ -165,12 +164,15 @@ void init_espfix_ap(void)
if (stack_page)
goto unlock_done;

+ node = cpu_to_node(cpu);
ptemask = __supported_pte_mask;

pud_p = &espfix_pud_page[pud_index(addr)];
pud = *pud_p;
if (!pud_present(pud)) {
- pmd_p = (pmd_t *)__get_free_page(PGALLOC_GFP);
+ struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+ pmd_p = (pmd_t *)page_address(page);
pud = __pud(__pa(pmd_p) | (PGTABLE_PROT & ptemask));
paravirt_alloc_pmd(&init_mm, __pa(pmd_p) >> PAGE_SHIFT);
for (n = 0; n < ESPFIX_PUD_CLONES; n++)
@@ -180,7 +182,9 @@ void init_espfix_ap(void)
pmd_p = pmd_offset(&pud, addr);
pmd = *pmd_p;
if (!pmd_present(pmd)) {
- pte_p = (pte_t *)__get_free_page(PGALLOC_GFP);
+ struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+ pte_p = (pte_t *)page_address(page);
pmd = __pmd(__pa(pte_p) | (PGTABLE_PROT & ptemask));
paravirt_alloc_pte(&init_mm, __pa(pte_p) >> PAGE_SHIFT);
for (n = 0; n < ESPFIX_PMD_CLONES; n++)
@@ -188,7 +192,7 @@ void init_espfix_ap(void)
}

pte_p = pte_offset_kernel(&pmd, addr);
- stack_page = (void *)__get_free_page(GFP_KERNEL);
+ stack_page = page_address(alloc_pages_node(node, GFP_KERNEL, 0));
pte = __pte(__pa(stack_page) | (__PAGE_KERNEL_RO & ptemask));
for (n = 0; n < ESPFIX_PTE_CLONES; n++)
set_pte(&pte_p[n*PTE_STRIDE], pte);
@@ -199,7 +203,7 @@ void init_espfix_ap(void)
unlock_done:
mutex_unlock(&espfix_init_mutex);
done:
- this_cpu_write(espfix_stack, addr);
- this_cpu_write(espfix_waddr, (unsigned long)stack_page
- + (addr & ~PAGE_MASK));
+ per_cpu(espfix_stack, cpu) = addr;
+ per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page
+ + (addr & ~PAGE_MASK);
}
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 50e547e..e9fdd0e 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -240,13 +240,6 @@ static void notrace start_secondary(void *unused)
check_tsc_sync_target();

/*
- * Enable the espfix hack for this CPU
- */
-#ifdef CONFIG_X86_ESPFIX64
- init_espfix_ap();
-#endif
-
- /*
* We need to hold vector_lock so there the set of online cpus
* does not change while we are assigning vectors to cpus. Holding
* this lock ensures we don't half assign or remove an irq from a cpu.
@@ -901,6 +894,13 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
}
}

+ /*
+ * Enable the espfix hack for this CPU
+ */
+#ifdef CONFIG_X86_ESPFIX64
+ init_espfix_ap(cpu);
+#endif
+
/* mark "stuck" area as not stuck */
*trampoline_status = 0;

--
1.7.7

2015-06-17 05:54:51

by Zhu Guihua

Subject: Re: [PATCH V1] x86, espfix: postpone the initialization of espfix stack for AP

Any feedback about this?

On 06/04/2015 05:45 PM, Gu Zheng wrote:
> The following lockdep warning occurs when running with the latest kernel:
> [ 3.178000] ------------[ cut here ]------------
> [ 3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 lockdep_trace_alloc+0xdd/0xe0()
> [ 3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
> [ 3.199000] Modules linked in:
>
> [ 3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
> [ 3.221000] 0000000000000000 2d6601fb3e6d4e4c ffff88086fd5fc38 ffffffff81773f0a
> [ 3.230000] 0000000000000000 ffff88086fd5fc90 ffff88086fd5fc78 ffffffff8108c85a
> [ 3.238000] ffff88086fd60000 0000000000000092 ffff88086fd60000 00000000000000d0
> [ 3.246000] Call Trace:
> [ 3.249000] [<ffffffff81773f0a>] dump_stack+0x4c/0x65
> [ 3.255000] [<ffffffff8108c85a>] warn_slowpath_common+0x8a/0xc0
> [ 3.261000] [<ffffffff8108c8e5>] warn_slowpath_fmt+0x55/0x70
> [ 3.268000] [<ffffffff810ee24d>] lockdep_trace_alloc+0xdd/0xe0
> [ 3.274000] [<ffffffff811cda0d>] __alloc_pages_nodemask+0xad/0xca0
> [ 3.281000] [<ffffffff810ec7ad>] ? __lock_acquire+0xf6d/0x1560
> [ 3.288000] [<ffffffff81219c8a>] alloc_page_interleave+0x3a/0x90
> [ 3.295000] [<ffffffff8121b32d>] alloc_pages_current+0x17d/0x1a0
> [ 3.301000] [<ffffffff811c869e>] ? __get_free_pages+0xe/0x50
> [ 3.308000] [<ffffffff811c869e>] __get_free_pages+0xe/0x50
> [ 3.314000] [<ffffffff8102640b>] init_espfix_ap+0x17b/0x320
> [ 3.320000] [<ffffffff8105c691>] start_secondary+0xf1/0x1f0
> [ 3.327000] ---[ end trace 1b3327d9d6a1d62c ]---
>
> We allocate pages with GFP_KERNEL in init_espfix_ap(), which is called
> before local irqs are enabled, and the lockdep sub-system treats this as
> allocating memory with GFP_FS while local irqs are disabled, which
> triggers the warning shown above.
>
> We could allocate them on the boot CPU side and hand them over to the
> secondary CPU, but that seems a bit wasteful if some of the CPUs are offline.
> As there is no need for these pages (the espfix stack) until we try to run
> user code, we postpone the initialization of the espfix stack and let the
> boot-up routine init the espfix stack for the target cpu after it has booted,
> to avoid the noise.
>
> Signed-off-by: Gu Zheng <[email protected]>
> ---
> v1:
> Alloc the page on the node the target CPU is on.
> RFC:
> Let the boot up routine init the espfix stack for the target cpu after it
> booted.
> ---
> arch/x86/include/asm/espfix.h | 2 +-
> arch/x86/kernel/espfix_64.c | 28 ++++++++++++++++------------
> arch/x86/kernel/smpboot.c | 14 +++++++-------
> 3 files changed, 24 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/include/asm/espfix.h b/arch/x86/include/asm/espfix.h
> index 99efebb..ca3ce9a 100644
> --- a/arch/x86/include/asm/espfix.h
> +++ b/arch/x86/include/asm/espfix.h
> @@ -9,7 +9,7 @@ DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
> DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
>
> extern void init_espfix_bsp(void);
> -extern void init_espfix_ap(void);
> +extern void init_espfix_ap(int cpu);
>
> #endif /* CONFIG_X86_64 */
>
> diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
> index f5d0730..e397583 100644
> --- a/arch/x86/kernel/espfix_64.c
> +++ b/arch/x86/kernel/espfix_64.c
> @@ -131,25 +131,24 @@ void __init init_espfix_bsp(void)
> init_espfix_random();
>
> /* The rest is the same as for any other processor */
> - init_espfix_ap();
> + init_espfix_ap(0);
> }
>
> -void init_espfix_ap(void)
> +void init_espfix_ap(int cpu)
> {
> - unsigned int cpu, page;
> + unsigned int page;
> unsigned long addr;
> pud_t pud, *pud_p;
> pmd_t pmd, *pmd_p;
> pte_t pte, *pte_p;
> - int n;
> + int n, node;
> void *stack_page;
> pteval_t ptemask;
>
> /* We only have to do this once... */
> - if (likely(this_cpu_read(espfix_stack)))
> + if (likely(per_cpu(espfix_stack, cpu)))
> return; /* Already initialized */
>
> - cpu = smp_processor_id();
> addr = espfix_base_addr(cpu);
> page = cpu/ESPFIX_STACKS_PER_PAGE;
>
> @@ -165,12 +164,15 @@ void init_espfix_ap(void)
> if (stack_page)
> goto unlock_done;
>
> + node = cpu_to_node(cpu);
> ptemask = __supported_pte_mask;
>
> pud_p = &espfix_pud_page[pud_index(addr)];
> pud = *pud_p;
> if (!pud_present(pud)) {
> - pmd_p = (pmd_t *)__get_free_page(PGALLOC_GFP);
> + struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
> +
> + pmd_p = (pmd_t *)page_address(page);
> pud = __pud(__pa(pmd_p) | (PGTABLE_PROT & ptemask));
> paravirt_alloc_pmd(&init_mm, __pa(pmd_p) >> PAGE_SHIFT);
> for (n = 0; n < ESPFIX_PUD_CLONES; n++)
> @@ -180,7 +182,9 @@ void init_espfix_ap(void)
> pmd_p = pmd_offset(&pud, addr);
> pmd = *pmd_p;
> if (!pmd_present(pmd)) {
> - pte_p = (pte_t *)__get_free_page(PGALLOC_GFP);
> + struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
> +
> + pte_p = (pte_t *)page_address(page);
> pmd = __pmd(__pa(pte_p) | (PGTABLE_PROT & ptemask));
> paravirt_alloc_pte(&init_mm, __pa(pte_p) >> PAGE_SHIFT);
> for (n = 0; n < ESPFIX_PMD_CLONES; n++)
> @@ -188,7 +192,7 @@ void init_espfix_ap(void)
> }
>
> pte_p = pte_offset_kernel(&pmd, addr);
> - stack_page = (void *)__get_free_page(GFP_KERNEL);
> + stack_page = page_address(alloc_pages_node(node, GFP_KERNEL, 0));
> pte = __pte(__pa(stack_page) | (__PAGE_KERNEL_RO & ptemask));
> for (n = 0; n < ESPFIX_PTE_CLONES; n++)
> set_pte(&pte_p[n*PTE_STRIDE], pte);
> @@ -199,7 +203,7 @@ void init_espfix_ap(void)
> unlock_done:
> mutex_unlock(&espfix_init_mutex);
> done:
> - this_cpu_write(espfix_stack, addr);
> - this_cpu_write(espfix_waddr, (unsigned long)stack_page
> - + (addr & ~PAGE_MASK));
> + per_cpu(espfix_stack, cpu) = addr;
> + per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page
> + + (addr & ~PAGE_MASK);
> }
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 50e547e..e9fdd0e 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -240,13 +240,6 @@ static void notrace start_secondary(void *unused)
> check_tsc_sync_target();
>
> /*
> - * Enable the espfix hack for this CPU
> - */
> -#ifdef CONFIG_X86_ESPFIX64
> - init_espfix_ap();
> -#endif
> -
> - /*
> * We need to hold vector_lock so there the set of online cpus
> * does not change while we are assigning vectors to cpus. Holding
> * this lock ensures we don't half assign or remove an irq from a cpu.
> @@ -901,6 +894,13 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
> }
> }
>
> + /*
> + * Enable the espfix hack for this CPU
> + */
> +#ifdef CONFIG_X86_ESPFIX64
> + init_espfix_ap(cpu);
> +#endif
> +
> /* mark "stuck" area as not stuck */
> *trampoline_status = 0;
>

2015-06-17 07:27:43

by H. Peter Anvin

Subject: Re: [PATCH V1] x86, espfix: postpone the initialization of espfix stack for AP

On 06/04/2015 02:45 AM, Gu Zheng wrote:
> The following lockdep warning occurs when running with the latest kernel:
> [ 3.178000] ------------[ cut here ]------------
> [ 3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 lockdep_trace_alloc+0xdd/0xe0()
> [ 3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
> [ 3.199000] Modules linked in:
>
> [ 3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
> [ 3.221000] 0000000000000000 2d6601fb3e6d4e4c ffff88086fd5fc38 ffffffff81773f0a
> [ 3.230000] 0000000000000000 ffff88086fd5fc90 ffff88086fd5fc78 ffffffff8108c85a
> [ 3.238000] ffff88086fd60000 0000000000000092 ffff88086fd60000 00000000000000d0
> [ 3.246000] Call Trace:
> [ 3.249000] [<ffffffff81773f0a>] dump_stack+0x4c/0x65
> [ 3.255000] [<ffffffff8108c85a>] warn_slowpath_common+0x8a/0xc0
> [ 3.261000] [<ffffffff8108c8e5>] warn_slowpath_fmt+0x55/0x70
> [ 3.268000] [<ffffffff810ee24d>] lockdep_trace_alloc+0xdd/0xe0
> [ 3.274000] [<ffffffff811cda0d>] __alloc_pages_nodemask+0xad/0xca0
> [ 3.281000] [<ffffffff810ec7ad>] ? __lock_acquire+0xf6d/0x1560
> [ 3.288000] [<ffffffff81219c8a>] alloc_page_interleave+0x3a/0x90
> [ 3.295000] [<ffffffff8121b32d>] alloc_pages_current+0x17d/0x1a0
> [ 3.301000] [<ffffffff811c869e>] ? __get_free_pages+0xe/0x50
> [ 3.308000] [<ffffffff811c869e>] __get_free_pages+0xe/0x50
> [ 3.314000] [<ffffffff8102640b>] init_espfix_ap+0x17b/0x320
> [ 3.320000] [<ffffffff8105c691>] start_secondary+0xf1/0x1f0
> [ 3.327000] ---[ end trace 1b3327d9d6a1d62c ]---
>
> We allocate pages with GFP_KERNEL in init_espfix_ap(), which is called
> before local irqs are enabled, and the lockdep sub-system treats this as
> allocating memory with GFP_FS while local irqs are disabled, which
> triggers the warning shown above.
>
> We could allocate them on the boot CPU side and hand them over to the
> secondary CPU, but that seems a bit wasteful if some of the CPUs are offline.
> As there is no need for these pages (the espfix stack) until we try to run
> user code, we postpone the initialization of the espfix stack and let the
> boot-up routine init the espfix stack for the target cpu after it has booted,
> to avoid the noise.
>

It isn't *at all* obvious to me at least that if the GFP_KERNEL
allocation fails we may not get rescheduled on another CPU and/or get stuck.

I'm starting to think that the right thing to do is to allocate these on
the CPU that is bringing up the other CPU, at the same time we allocate
the percpu area. This won't affect offline CPUs.

-hpa

2015-06-17 21:04:24

by Borislav Petkov

Subject: Re: [PATCH V1] x86, espfix: postpone the initialization of espfix stack for AP

On Wed, Jun 17, 2015 at 12:27:05AM -0700, H. Peter Anvin wrote:
> On 06/04/2015 02:45 AM, Gu Zheng wrote:
> > The following lockdep warning occurs when running with the latest kernel:
> > [ 3.178000] ------------[ cut here ]------------
> > [ 3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 lockdep_trace_alloc+0xdd/0xe0()
> > [ 3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
> > [ 3.199000] Modules linked in:
> >
> > [ 3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
> > [ 3.221000] 0000000000000000 2d6601fb3e6d4e4c ffff88086fd5fc38 ffffffff81773f0a
> > [ 3.230000] 0000000000000000 ffff88086fd5fc90 ffff88086fd5fc78 ffffffff8108c85a
> > [ 3.238000] ffff88086fd60000 0000000000000092 ffff88086fd60000 00000000000000d0
> > [ 3.246000] Call Trace:
> > [ 3.249000] [<ffffffff81773f0a>] dump_stack+0x4c/0x65
> > [ 3.255000] [<ffffffff8108c85a>] warn_slowpath_common+0x8a/0xc0
> > [ 3.261000] [<ffffffff8108c8e5>] warn_slowpath_fmt+0x55/0x70
> > [ 3.268000] [<ffffffff810ee24d>] lockdep_trace_alloc+0xdd/0xe0
> > [ 3.274000] [<ffffffff811cda0d>] __alloc_pages_nodemask+0xad/0xca0
> > [ 3.281000] [<ffffffff810ec7ad>] ? __lock_acquire+0xf6d/0x1560
> > [ 3.288000] [<ffffffff81219c8a>] alloc_page_interleave+0x3a/0x90
> > [ 3.295000] [<ffffffff8121b32d>] alloc_pages_current+0x17d/0x1a0
> > [ 3.301000] [<ffffffff811c869e>] ? __get_free_pages+0xe/0x50
> > [ 3.308000] [<ffffffff811c869e>] __get_free_pages+0xe/0x50
> > [ 3.314000] [<ffffffff8102640b>] init_espfix_ap+0x17b/0x320
> > [ 3.320000] [<ffffffff8105c691>] start_secondary+0xf1/0x1f0
> > [ 3.327000] ---[ end trace 1b3327d9d6a1d62c ]---
> >
> > We allocate pages with GFP_KERNEL in init_espfix_ap(), which is called
> > before local irqs are enabled, and the lockdep sub-system treats this as
> > allocating memory with GFP_FS while local irqs are disabled, which
> > triggers the warning shown above.
> >
> > We could allocate them on the boot CPU side and hand them over to the
> > secondary CPU, but that seems a bit wasteful if some of the CPUs are offline.
> > As there is no need for these pages (the espfix stack) until we try to run
> > user code, we postpone the initialization of the espfix stack and let the
> > boot-up routine init the espfix stack for the target cpu after it has booted,
> > to avoid the noise.
> >
>
> It isn't *at all* obvious to me at least that if the GFP_KERNEL
> allocation fails we may not get rescheduled on another CPU and/or get stuck.
>
> I'm starting to think that the right thing to do is to allocate these on
> the CPU that is bringing up the other CPU, at the same time we allocate
> the percpu area. This won't affect offline CPUs.

Btw, as part of experimenting for something else, I was able to trigger
this even on a guest here. It is an insane guest though: 16 NUMA nodes,
with 8 cores each:

[ 0.032000] ------------[ cut here ]------------
[ 0.032000] WARNING: CPU: 64 PID: 0 at kernel/locking/lockdep.c:2755 lockdep_trace_alloc+0x10c/0x120()
[ 0.032000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[ 0.032000] Modules linked in:
[ 0.032000] CPU: 64 PID: 0 Comm: swapper/64 Not tainted 4.1.0-rc3+ #4
[ 0.032000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 0.032000] ffffffff818dd1a1 ffff880006a1fca8 ffffffff816a9685 0000000000000000
[ 0.032000] ffff880006a1fcf8 ffff880006a1fce8 ffffffff81058585 00000000001d74c0
[ 0.032000] 0000000000000080 0000000000000046 ffff880047ffcd00 ffff88003e804058
[ 0.032000] Call Trace:
[ 0.032000] [<ffffffff816a9685>] dump_stack+0x4f/0x7b
[ 0.032000] [<ffffffff81058585>] warn_slowpath_common+0x95/0xe0
[ 0.032000] [<ffffffff81058616>] warn_slowpath_fmt+0x46/0x50
[ 0.032000] [<ffffffff810a662c>] lockdep_trace_alloc+0x10c/0x120
[ 0.032000] [<ffffffff8113d6ed>] __alloc_pages_nodemask+0xad/0xab0
[ 0.032000] [<ffffffff813442d7>] ? debug_smp_processor_id+0x17/0x20
[ 0.032000] [<ffffffff810a370e>] ? put_lock_stats.isra.19+0xe/0x30
[ 0.032000] [<ffffffff816ae288>] ? mutex_lock_nested+0x2e8/0x420
[ 0.032000] [<ffffffff8117e0cc>] alloc_page_interleave+0x3c/0x90
[ 0.032000] [<ffffffff8117e995>] alloc_pages_current+0xc5/0xd0
[ 0.032000] [<ffffffff81138734>] __get_free_pages+0x14/0x50
[ 0.032000] [<ffffffff8100a484>] init_espfix_ap.part.5+0x164/0x270
[ 0.032000] [<ffffffff8100a5b1>] init_espfix_ap+0x21/0x30
[ 0.032000] [<ffffffff8103cd28>] start_secondary+0xe8/0x180
[ 0.032000] ---[ end trace 6a7abdb28fbb7667 ]---

Now I can test the future fix too. :)

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-06-17 21:12:14

by H. Peter Anvin

Subject: Re: [PATCH V1] x86, espfix: postpone the initialization of espfix stack for AP

On 06/17/2015 02:04 PM, Borislav Petkov wrote:
>>
>> It isn't *at all* obvious to me at least that if the GFP_KERNEL
>> allocation fails we may not get rescheduled on another CPU and/or get stuck.
>>
>> I'm starting to think that the right thing to do is to allocate these on
>> the CPU that is bringing up the other CPU, at the same time we allocate
>> the percpu area. This won't affect offline CPUs.
>
> Btw, as part of experimenting for something else, I was able to trigger
> this even on a guest here. It is an insane guest though: 16 NUMA nodes,
> with 8 cores each:
>

Is this reliable?

-hpa

2015-06-17 21:50:32

by Borislav Petkov

Subject: Re: [PATCH V1] x86, espfix: postpone the initialization of espfix stack for AP

On Wed, Jun 17, 2015 at 02:11:27PM -0700, H. Peter Anvin wrote:
> Is this reliable?

10 out of 10 :)

I'm booting with:

...
-smp 128
-numa node,nodeid=0,cpus=0-7
-numa node,nodeid=1,cpus=8-15
-numa node,nodeid=2,cpus=16-23
-numa node,nodeid=3,cpus=24-31
-numa node,nodeid=4,cpus=32-39
-numa node,nodeid=5,cpus=40-47
-numa node,nodeid=6,cpus=48-55
-numa node,nodeid=7,cpus=56-63
-numa node,nodeid=8,cpus=64-71
-numa node,nodeid=9,cpus=72-79
-numa node,nodeid=10,cpus=80-87
-numa node,nodeid=11,cpus=88-95
-numa node,nodeid=12,cpus=96-103
-numa node,nodeid=13,cpus=104-111
-numa node,nodeid=14,cpus=112-119
-numa node,nodeid=15,cpus=120-127
...

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--