Hi Linus,
Here are the SLAB updates for 2.6.39-rc1. The interesting bits are the SLUB
lockless fastpath patches from Christoph, which improve performance, and
updates from Lai to support a bigger 'struct rcu_head'. I pulled the per-CPU
tree into slub/lockless, which is why some already-merged patches show up in
the pull request.
Pekka
The following changes since commit a952baa034ae7c2e4a66932005cbc7ebbccfe28d:
Linus Torvalds (1):
Merge branch 'for-linus' of git://git.kernel.org/.../dtor/input
are available in the git repository at:
ssh://master.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6.git for-linus
Christoph Lameter (5):
mm: Remove support for kmem_cache_name()
slub: min_partial needs to be in first cacheline
slub: Get rid of slab_free_hook_irq()
Lockless (and preemptless) fastpaths for slub
slub: Dont define useless label in the !CONFIG_CMPXCHG_LOCAL case
Eric Dumazet (1):
slub: fix kmemcheck calls to match ksize() hints
Lai Jiangshan (3):
slub: automatically reserve bytes at the end of slab
slub,rcu: don't assume the size of struct rcu_head
slab,rcu: don't assume the size of struct rcu_head
Mariusz Kozlowski (1):
slub: fix ksize() build error
Pekka Enberg (6):
Revert "slab: Fix missing DEBUG_SLAB last user"
Merge branch 'for-2.6.39' of git://git.kernel.org/.../tj/percpu into slub/lockless
Merge branch 'slab/rcu' into slab/next
Merge branch 'slab/urgent' into slab/next
Merge branch 'slab/next' into for-linus
Merge branch 'slub/lockless' into for-linus
arch/alpha/kernel/vmlinux.lds.S | 5 +-
arch/arm/kernel/vmlinux.lds.S | 2 +-
arch/blackfin/kernel/vmlinux.lds.S | 2 +-
arch/cris/kernel/vmlinux.lds.S | 2 +-
arch/frv/kernel/vmlinux.lds.S | 2 +-
arch/ia64/kernel/vmlinux.lds.S | 2 +-
arch/m32r/kernel/vmlinux.lds.S | 2 +-
arch/mips/kernel/vmlinux.lds.S | 2 +-
arch/mn10300/kernel/vmlinux.lds.S | 2 +-
arch/parisc/kernel/vmlinux.lds.S | 2 +-
arch/powerpc/kernel/vmlinux.lds.S | 2 +-
arch/s390/kernel/vmlinux.lds.S | 2 +-
arch/sh/kernel/vmlinux.lds.S | 2 +-
arch/sparc/kernel/vmlinux.lds.S | 2 +-
arch/tile/kernel/vmlinux.lds.S | 2 +-
arch/um/include/asm/common.lds.S | 2 +-
arch/x86/include/asm/percpu.h | 48 +++++
arch/x86/kernel/vmlinux.lds.S | 4 +-
arch/x86/lib/Makefile | 1 +
arch/x86/lib/cmpxchg16b_emu.S | 59 ++++++
arch/xtensa/kernel/vmlinux.lds.S | 2 +-
include/asm-generic/vmlinux.lds.h | 35 +++--
include/linux/percpu.h | 128 +++++++++++++
include/linux/slab.h | 1 -
include/linux/slub_def.h | 8 +-
mm/slab.c | 55 +++---
mm/slob.c | 6 -
mm/slub.c | 366 +++++++++++++++++++++++++++++-------
28 files changed, 612 insertions(+), 136 deletions(-)
create mode 100644 arch/x86/lib/cmpxchg16b_emu.S
Hi Pekka,
* Pekka Enberg <[email protected]> wrote:
> Hi Linus,
>
> Here are the SLAB updates for 2.6.39-rc1. The interesting bits are the
> SLUB lockless fastpath patches from Christoph, which improve performance,
> and updates from Lai to support a bigger 'struct rcu_head'. I pulled
> the per-CPU tree into slub/lockless, which is why some already-merged
> patches show up in the pull request.
>
> Pekka
>
> The following changes since commit a952baa034ae7c2e4a66932005cbc7ebbccfe28d:
> Linus Torvalds (1):
> Merge branch 'for-linus' of git://git.kernel.org/.../dtor/input
>
> are available in the git repository at:
>
> ssh://master.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6.git for-linus
>
> Christoph Lameter (5):
> mm: Remove support for kmem_cache_name()
> slub: min_partial needs to be in first cacheline
> slub: Get rid of slab_free_hook_irq()
> Lockless (and preemptless) fastpaths for slub
> slub: Dont define useless label in the !CONFIG_CMPXCHG_LOCAL case
>
> Eric Dumazet (1):
> slub: fix kmemcheck calls to match ksize() hints
>
> Lai Jiangshan (3):
> slub: automatically reserve bytes at the end of slab
> slub,rcu: don't assume the size of struct rcu_head
> slab,rcu: don't assume the size of struct rcu_head
>
> Mariusz Kozlowski (1):
> slub: fix ksize() build error
>
> Pekka Enberg (6):
> Revert "slab: Fix missing DEBUG_SLAB last user"
> Merge branch 'for-2.6.39' of git://git.kernel.org/.../tj/percpu into slub/lockless
> Merge branch 'slab/rcu' into slab/next
> Merge branch 'slab/urgent' into slab/next
> Merge branch 'slab/next' into for-linus
> Merge branch 'slub/lockless' into for-linus
FYI, some sort of boot crash has snuck upstream in the last 24 hours:
BUG: unable to handle kernel paging request at ffff87ffc147e020
IP: [<ffffffff811aa762>] this_cpu_cmpxchg16b_emu+0x2/0x1c
[<ffffffff810d9cbc>] ? kmem_cache_alloc+0x4c/0x110
[<ffffffff8151cf06>] kmem_cache_init+0xeb/0x2b0
[<ffffffff81504a06>] start_kernel+0x1de/0x49b
[<ffffffff8150432b>] x86_64_start_reservations+0x132/0x136
[<ffffffff81504140>] ? early_idt_handlers+0x140/0x140
And the SLAB changes are one of the suspects. It triggers in about 5% of all
randconfigs. I'm bisecting it currently.
Config attached.
Ingo
On Thu, 24 Mar 2011, Ingo Molnar wrote:
> FYI, some sort of boot crash has snuck upstream in the last 24 hours:
>
> BUG: unable to handle kernel paging request at ffff87ffc147e020
> IP: [<ffffffff811aa762>] this_cpu_cmpxchg16b_emu+0x2/0x1c
Hmmm.. This is the fallback code for the case that the processor does not
support cmpxchg16b.
* Ingo Molnar <[email protected]> wrote:
> FYI, some sort of boot crash has snuck upstream in the last 24 hours:
>
> BUG: unable to handle kernel paging request at ffff87ffc147e020
> IP: [<ffffffff811aa762>] this_cpu_cmpxchg16b_emu+0x2/0x1c
>
> [<ffffffff810d9cbc>] ? kmem_cache_alloc+0x4c/0x110
> [<ffffffff8151cf06>] kmem_cache_init+0xeb/0x2b0
> [<ffffffff81504a06>] start_kernel+0x1de/0x49b
> [<ffffffff8150432b>] x86_64_start_reservations+0x132/0x136
> [<ffffffff81504140>] ? early_idt_handlers+0x140/0x140
>
> And the SLAB changes are one of the suspects. It triggers in about 5% of all
> randconfigs. I'm bisecting it currently.
Caused by:
| 8a5ec0ba42c4919e2d8f4c3138cc8b987fdb0b79 is the first bad commit
| commit 8a5ec0ba42c4919e2d8f4c3138cc8b987fdb0b79
| Author: Christoph Lameter <[email protected]>
| Date: Fri Feb 25 11:38:54 2011 -0600
|
| Lockless (and preemptless) fastpaths for slub
I'll try to revert these:
2fd66c517d5e: slub: Add missing irq restore for the OOM path
a24c5a0ea902: slub: Dont define useless label in the !CONFIG_CMPXCHG_LOCAL case
8a5ec0ba42c4: Lockless (and preemptless) fastpaths for slub
Thanks,
Ingo
* Ingo Molnar <[email protected]> wrote:
> Caused by:
>
> | 8a5ec0ba42c4919e2d8f4c3138cc8b987fdb0b79 is the first bad commit
> | commit 8a5ec0ba42c4919e2d8f4c3138cc8b987fdb0b79
> | Author: Christoph Lameter <[email protected]>
> | Date: Fri Feb 25 11:38:54 2011 -0600
> |
> | Lockless (and preemptless) fastpaths for slub
>
> I'll try to revert these:
>
> 2fd66c517d5e: slub: Add missing irq restore for the OOM path
> a24c5a0ea902: slub: Dont define useless label in the !CONFIG_CMPXCHG_LOCAL case
> 8a5ec0ba42c4: Lockless (and preemptless) fastpaths for slub
The combo revert below solves the boot crash.
Thanks,
Ingo
------------------------------->
From 1b322eee05a96e8395b62e04a3d1f10fb483c259 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Thu, 24 Mar 2011 17:03:51 +0100
Subject: [PATCH] Revert various slub patches
Revert "slub: Add missing irq restore for the OOM path"
This reverts commit 2fd66c517d5e98de2528d86e0e62f5069ff99f59.
Revert "slub: Dont define useless label in the !CONFIG_CMPXCHG_LOCAL case"
This reverts commit a24c5a0ea902bcda348f086bd909cc2d6e305bf8.
Revert "slub,rcu: don't assume the size of struct rcu_head"
This reverts commit da9a638c6f8fc0633fa94a334f1c053f5e307177.
Revert "slub: Add statistics for this_cmpxchg_double failures"
This reverts commit 4fdccdfbb4652a7bbac8adbce7449eb093775118.
Revert "Lockless (and preemptless) fastpaths for slub"
This reverts commit 8a5ec0ba42c4919e2d8f4c3138cc8b987fdb0b79.
---
include/linux/slub_def.h | 6 +-
mm/slub.c | 243 ++--------------------------------------------
2 files changed, 9 insertions(+), 240 deletions(-)
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 45ca123..74af3d6 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -32,14 +32,10 @@ enum stat_item {
DEACTIVATE_TO_TAIL, /* Cpu slab was moved to the tail of partials */
DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
ORDER_FALLBACK, /* Number of times fallback was necessary */
- CMPXCHG_DOUBLE_CPU_FAIL,/* Failure of this_cpu_cmpxchg_double */
NR_SLUB_STAT_ITEMS };
struct kmem_cache_cpu {
- void **freelist; /* Pointer to next available object */
-#ifdef CONFIG_CMPXCHG_LOCAL
- unsigned long tid; /* Globally unique transaction id */
-#endif
+ void **freelist; /* Pointer to first free per cpu object */
struct page *page; /* The slab from which we are allocating */
int node; /* The node of the page (or -1 for debug) */
#ifdef CONFIG_SLUB_STATS
diff --git a/mm/slub.c b/mm/slub.c
index 93de30d..bd84722 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -217,7 +217,7 @@ static inline void sysfs_slab_remove(struct kmem_cache *s)
#endif
-static inline void stat(const struct kmem_cache *s, enum stat_item si)
+static inline void stat(struct kmem_cache *s, enum stat_item si)
{
#ifdef CONFIG_SLUB_STATS
__this_cpu_inc(s->cpu_slab->stat[si]);
@@ -1285,38 +1285,21 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
__free_pages(page, order);
}
-#define need_reserve_slab_rcu \
- (sizeof(((struct page *)NULL)->lru) < sizeof(struct rcu_head))
-
static void rcu_free_slab(struct rcu_head *h)
{
struct page *page;
- if (need_reserve_slab_rcu)
- page = virt_to_head_page(h);
- else
- page = container_of((struct list_head *)h, struct page, lru);
-
+ page = container_of((struct list_head *)h, struct page, lru);
__free_slab(page->slab, page);
}
static void free_slab(struct kmem_cache *s, struct page *page)
{
if (unlikely(s->flags & SLAB_DESTROY_BY_RCU)) {
- struct rcu_head *head;
-
- if (need_reserve_slab_rcu) {
- int order = compound_order(page);
- int offset = (PAGE_SIZE << order) - s->reserved;
-
- VM_BUG_ON(s->reserved != sizeof(*head));
- head = page_address(page) + offset;
- } else {
- /*
- * RCU free overloads the RCU head over the LRU
- */
- head = (void *)&page->lru;
- }
+ /*
+ * RCU free overloads the RCU head over the LRU
+ */
+ struct rcu_head *head = (void *)&page->lru;
call_rcu(head, rcu_free_slab);
} else
@@ -1540,78 +1523,6 @@ static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
}
}
-#ifdef CONFIG_CMPXCHG_LOCAL
-#ifdef CONFIG_PREEMPT
-/*
- * Calculate the next globally unique transaction for disambiguiation
- * during cmpxchg. The transactions start with the cpu number and are then
- * incremented by CONFIG_NR_CPUS.
- */
-#define TID_STEP roundup_pow_of_two(CONFIG_NR_CPUS)
-#else
-/*
- * No preemption supported therefore also no need to check for
- * different cpus.
- */
-#define TID_STEP 1
-#endif
-
-static inline unsigned long next_tid(unsigned long tid)
-{
- return tid + TID_STEP;
-}
-
-static inline unsigned int tid_to_cpu(unsigned long tid)
-{
- return tid % TID_STEP;
-}
-
-static inline unsigned long tid_to_event(unsigned long tid)
-{
- return tid / TID_STEP;
-}
-
-static inline unsigned int init_tid(int cpu)
-{
- return cpu;
-}
-
-static inline void note_cmpxchg_failure(const char *n,
- const struct kmem_cache *s, unsigned long tid)
-{
-#ifdef SLUB_DEBUG_CMPXCHG
- unsigned long actual_tid = __this_cpu_read(s->cpu_slab->tid);
-
- printk(KERN_INFO "%s %s: cmpxchg redo ", n, s->name);
-
-#ifdef CONFIG_PREEMPT
- if (tid_to_cpu(tid) != tid_to_cpu(actual_tid))
- printk("due to cpu change %d -> %d\n",
- tid_to_cpu(tid), tid_to_cpu(actual_tid));
- else
-#endif
- if (tid_to_event(tid) != tid_to_event(actual_tid))
- printk("due to cpu running other code. Event %ld->%ld\n",
- tid_to_event(tid), tid_to_event(actual_tid));
- else
- printk("for unknown reason: actual=%lx was=%lx target=%lx\n",
- actual_tid, tid, next_tid(tid));
-#endif
- stat(s, CMPXCHG_DOUBLE_CPU_FAIL);
-}
-
-#endif
-
-void init_kmem_cache_cpus(struct kmem_cache *s)
-{
-#if defined(CONFIG_CMPXCHG_LOCAL) && defined(CONFIG_PREEMPT)
- int cpu;
-
- for_each_possible_cpu(cpu)
- per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu);
-#endif
-
-}
/*
* Remove the cpu slab
*/
@@ -1643,9 +1554,6 @@ static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
page->inuse--;
}
c->page = NULL;
-#ifdef CONFIG_CMPXCHG_LOCAL
- c->tid = next_tid(c->tid);
-#endif
unfreeze_slab(s, page, tail);
}
@@ -1780,19 +1688,6 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
{
void **object;
struct page *new;
-#ifdef CONFIG_CMPXCHG_LOCAL
- unsigned long flags;
-
- local_irq_save(flags);
-#ifdef CONFIG_PREEMPT
- /*
- * We may have been preempted and rescheduled on a different
- * cpu before disabling interrupts. Need to reload cpu area
- * pointer.
- */
- c = this_cpu_ptr(s->cpu_slab);
-#endif
-#endif
/* We handle __GFP_ZERO in the caller */
gfpflags &= ~__GFP_ZERO;
@@ -1819,10 +1714,6 @@ load_freelist:
c->node = page_to_nid(c->page);
unlock_out:
slab_unlock(c->page);
-#ifdef CONFIG_CMPXCHG_LOCAL
- c->tid = next_tid(c->tid);
- local_irq_restore(flags);
-#endif
stat(s, ALLOC_SLOWPATH);
return object;
@@ -1858,9 +1749,6 @@ new_slab:
}
if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
slab_out_of_memory(s, gfpflags, node);
-#ifdef CONFIG_CMPXCHG_LOCAL
- local_irq_restore(flags);
-#endif
return NULL;
debug:
if (!alloc_debug_processing(s, c->page, object, addr))
@@ -1887,76 +1775,23 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
{
void **object;
struct kmem_cache_cpu *c;
-#ifdef CONFIG_CMPXCHG_LOCAL
- unsigned long tid;
-#else
unsigned long flags;
-#endif
if (slab_pre_alloc_hook(s, gfpflags))
return NULL;
-#ifndef CONFIG_CMPXCHG_LOCAL
local_irq_save(flags);
-#else
-redo:
-#endif
-
- /*
- * Must read kmem_cache cpu data via this cpu ptr. Preemption is
- * enabled. We may switch back and forth between cpus while
- * reading from one cpu area. That does not matter as long
- * as we end up on the original cpu again when doing the cmpxchg.
- */
c = __this_cpu_ptr(s->cpu_slab);
-
-#ifdef CONFIG_CMPXCHG_LOCAL
- /*
- * The transaction ids are globally unique per cpu and per operation on
- * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
- * occurs on the right processor and that there was no operation on the
- * linked list in between.
- */
- tid = c->tid;
- barrier();
-#endif
-
object = c->freelist;
if (unlikely(!object || !node_match(c, node)))
object = __slab_alloc(s, gfpflags, node, addr, c);
else {
-#ifdef CONFIG_CMPXCHG_LOCAL
- /*
- * The cmpxchg will only match if there was no additonal
- * operation and if we are on the right processor.
- *
- * The cmpxchg does the following atomically (without lock semantics!)
- * 1. Relocate first pointer to the current per cpu area.
- * 2. Verify that tid and freelist have not been changed
- * 3. If they were not changed replace tid and freelist
- *
- * Since this is without lock semantics the protection is only against
- * code executing on this cpu *not* from access by other cpus.
- */
- if (unlikely(!this_cpu_cmpxchg_double(
- s->cpu_slab->freelist, s->cpu_slab->tid,
- object, tid,
- get_freepointer(s, object), next_tid(tid)))) {
-
- note_cmpxchg_failure("slab_alloc", s, tid);
- goto redo;
- }
-#else
c->freelist = get_freepointer(s, object);
-#endif
stat(s, ALLOC_FASTPATH);
}
-
-#ifndef CONFIG_CMPXCHG_LOCAL
local_irq_restore(flags);
-#endif
if (unlikely(gfpflags & __GFP_ZERO) && object)
memset(object, 0, s->objsize);
@@ -2034,13 +1869,9 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
{
void *prior;
void **object = (void *)x;
-#ifdef CONFIG_CMPXCHG_LOCAL
- unsigned long flags;
- local_irq_save(flags);
-#endif
- slab_lock(page);
stat(s, FREE_SLOWPATH);
+ slab_lock(page);
if (kmem_cache_debug(s))
goto debug;
@@ -2070,9 +1901,6 @@ checks_ok:
out_unlock:
slab_unlock(page);
-#ifdef CONFIG_CMPXCHG_LOCAL
- local_irq_restore(flags);
-#endif
return;
slab_empty:
@@ -2084,9 +1912,6 @@ slab_empty:
stat(s, FREE_REMOVE_PARTIAL);
}
slab_unlock(page);
-#ifdef CONFIG_CMPXCHG_LOCAL
- local_irq_restore(flags);
-#endif
stat(s, FREE_SLAB);
discard_slab(s, page);
return;
@@ -2113,56 +1938,21 @@ static __always_inline void slab_free(struct kmem_cache *s,
{
void **object = (void *)x;
struct kmem_cache_cpu *c;
-#ifdef CONFIG_CMPXCHG_LOCAL
- unsigned long tid;
-#else
unsigned long flags;
-#endif
slab_free_hook(s, x);
-#ifndef CONFIG_CMPXCHG_LOCAL
local_irq_save(flags);
-
-#else
-redo:
-#endif
-
- /*
- * Determine the currently cpus per cpu slab.
- * The cpu may change afterward. However that does not matter since
- * data is retrieved via this pointer. If we are on the same cpu
- * during the cmpxchg then the free will succedd.
- */
c = __this_cpu_ptr(s->cpu_slab);
-#ifdef CONFIG_CMPXCHG_LOCAL
- tid = c->tid;
- barrier();
-#endif
-
if (likely(page == c->page && c->node != NUMA_NO_NODE)) {
set_freepointer(s, object, c->freelist);
-
-#ifdef CONFIG_CMPXCHG_LOCAL
- if (unlikely(!this_cpu_cmpxchg_double(
- s->cpu_slab->freelist, s->cpu_slab->tid,
- c->freelist, tid,
- object, next_tid(tid)))) {
-
- note_cmpxchg_failure("slab_free", s, tid);
- goto redo;
- }
-#else
c->freelist = object;
-#endif
stat(s, FREE_FASTPATH);
} else
__slab_free(s, page, x, addr);
-#ifndef CONFIG_CMPXCHG_LOCAL
local_irq_restore(flags);
-#endif
}
void kmem_cache_free(struct kmem_cache *s, void *x)
@@ -2354,23 +2144,9 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu));
-#ifdef CONFIG_CMPXCHG_LOCAL
- /*
- * Must align to double word boundary for the double cmpxchg instructions
- * to work.
- */
- s->cpu_slab = __alloc_percpu(sizeof(struct kmem_cache_cpu), 2 * sizeof(void *));
-#else
- /* Regular alignment is sufficient */
s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
-#endif
-
- if (!s->cpu_slab)
- return 0;
- init_kmem_cache_cpus(s);
-
- return 1;
+ return s->cpu_slab != NULL;
}
static struct kmem_cache *kmem_cache_node;
@@ -2609,9 +2385,6 @@ static int kmem_cache_open(struct kmem_cache *s,
s->flags = kmem_cache_flags(size, flags, name, ctor);
s->reserved = 0;
- if (need_reserve_slab_rcu && (s->flags & SLAB_DESTROY_BY_RCU))
- s->reserved = sizeof(struct rcu_head);
-
if (!calculate_sizes(s, -1))
goto error;
if (disable_higher_order_debug) {
On Thu, Mar 24, 2011 at 4:41 PM, Christoph Lameter <[email protected]> wrote:
> On Thu, 24 Mar 2011, Ingo Molnar wrote:
>
>> FYI, some sort of boot crash has snuck upstream in the last 24 hours:
>>
>> BUG: unable to handle kernel paging request at ffff87ffc147e020
>> IP: [<ffffffff811aa762>] this_cpu_cmpxchg16b_emu+0x2/0x1c
>
> Hmmm.. This is the fallback code for the case that the processor does not
> support cmpxchg16b.
How does alternative_io() work? Does it require
alternative_instructions() to be executed? If so, the fallback code
won't be active when we enter kmem_cache_init(). Is there any reason
check_bugs() is called so late during boot? Can we do something like
the totally untested attached patch?
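As a side note on the mechanism being asked about here: the patching that alternative_io() relies on can be sketched roughly as below. This is a hypothetical userspace simulation for illustration only; the names and struct layout are invented, and the real x86 implementation (apply_alternatives() and struct alt_instr in arch/x86/kernel/alternative.c) differs. The key property is that until the patching loop runs, every call site still executes its default bytes.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical sketch of alternative-instruction patching, loosely
 * modeled on apply_alternatives().  Names and layout are invented;
 * the real struct alt_instr is different.
 */
struct alt_entry {
	unsigned char *instr;			/* default bytes in .text */
	const unsigned char *replacement;	/* feature-dependent bytes */
	int feature;				/* required CPU feature bit */
	unsigned char len;			/* number of bytes to patch */
};

/* Helpers so the behavior can be exercised without real CPUID checks. */
static int feature_always(int feature) { (void)feature; return 1; }
static int feature_never(int feature)  { (void)feature; return 0; }

/*
 * Patch every site whose required CPU feature is present.  Anything
 * that executed before this loop ran the default (unpatched) bytes --
 * which is exactly the kmem_cache_init() ordering question above.
 */
static void apply_alternatives_sketch(struct alt_entry *start,
				      struct alt_entry *end,
				      int (*cpu_has)(int feature))
{
	struct alt_entry *a;

	for (a = start; a < end; a++) {
		if (cpu_has(a->feature))
			memcpy(a->instr, a->replacement, a->len);
	}
}
```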
On Thu, Mar 24, 2011 at 6:14 PM, Ingo Molnar <[email protected]> wrote:
> The combo revert below solves the boot crash.
>
> Thanks,
>
>        Ingo
>
> ------------------------------->
> From 1b322eee05a96e8395b62e04a3d1f10fb483c259 Mon Sep 17 00:00:00 2001
> From: Ingo Molnar <[email protected]>
> Date: Thu, 24 Mar 2011 17:03:51 +0100
> Subject: [PATCH] Revert various slub patches
>
> Revert "slub: Add missing irq restore for the OOM path"
>
> This reverts commit 2fd66c517d5e98de2528d86e0e62f5069ff99f59.
>
> Revert "slub: Dont define useless label in the !CONFIG_CMPXCHG_LOCAL case"
>
> This reverts commit a24c5a0ea902bcda348f086bd909cc2d6e305bf8.
>
> Revert "slub,rcu: don't assume the size of struct rcu_head"
>
> This reverts commit da9a638c6f8fc0633fa94a334f1c053f5e307177.
Btw, this RCU commit is unrelated. It just happens to have a merge
conflict with the lockless patches, which is why you needed to revert
it.
Pekka
* Pekka Enberg <[email protected]> wrote:
> On Thu, Mar 24, 2011 at 6:14 PM, Ingo Molnar <[email protected]> wrote:
> > The combo revert below solves the boot crash.
> >
> > Thanks,
> >
> >        Ingo
> >
> > ------------------------------->
> > From 1b322eee05a96e8395b62e04a3d1f10fb483c259 Mon Sep 17 00:00:00 2001
> > From: Ingo Molnar <[email protected]>
> > Date: Thu, 24 Mar 2011 17:03:51 +0100
> > Subject: [PATCH] Revert various slub patches
> >
> > Revert "slub: Add missing irq restore for the OOM path"
> >
> > This reverts commit 2fd66c517d5e98de2528d86e0e62f5069ff99f59.
> >
> > Revert "slub: Dont define useless label in the !CONFIG_CMPXCHG_LOCAL case"
> >
> > This reverts commit a24c5a0ea902bcda348f086bd909cc2d6e305bf8.
> >
> > Revert "slub,rcu: don't assume the size of struct rcu_head"
> >
> > This reverts commit da9a638c6f8fc0633fa94a334f1c053f5e307177.
>
> Btw, this RCU commit is unrelated. It just happens to have a merge
> conflict with the lockless patches which is why you needed to revert
> it.
Yeah - this was just a brute-force combo patch to remove the buggy commit.
Thanks,
Ingo
* Pekka Enberg <[email protected]> wrote:
> On Thu, Mar 24, 2011 at 4:41 PM, Christoph Lameter <[email protected]> wrote:
> > On Thu, 24 Mar 2011, Ingo Molnar wrote:
> >
> >> FYI, some sort of boot crash has snuck upstream in the last 24 hours:
> >>
> >> BUG: unable to handle kernel paging request at ffff87ffc147e020
> >> IP: [<ffffffff811aa762>] this_cpu_cmpxchg16b_emu+0x2/0x1c
> >
> > Hmmm.. This is the fallback code for the case that the processor does not
> > support cmpxchg16b.
>
> How does alternative_io() work? Does it require
> alternative_instructions() to be executed? If so, the fallback code
> won't be active when we enter kmem_cache_init(). Is there any reason
> check_bugs() is called so late during boot? Can we do something like
> the totally untested attached patch?
Does the config I sent you boot on your box? I think the bug is pretty generic
and should trigger on any box.
Thanks,
Ingo
On Thu, 24 Mar 2011, Ingo Molnar wrote:
> > How does alternative_io() work? Does it require
> > alternative_instructions() to be executed? If so, the fallback code
> > won't be active when we enter kmem_cache_init(). Is there any reason
> > check_bugs() is called so late during boot? Can we do something like
> > the totally untested attached patch?
>
> Does the config I sent you boot on your box? I think the bug is pretty generic
> and should trigger on any box.
The bug should only trigger on old AMD64 boxes that do not support
cmpxchg16b.
On Thu, 24 Mar 2011, Ingo Molnar wrote:
>> > How does alternative_io() work? Does it require
>> > alternative_instructions() to be executed? If so, the fallback code
>> > won't be active when we enter kmem_cache_init(). Is there any reason
>> > check_bugs() is called so late during boot? Can we do something like
>> > the totally untested attached patch?
>>
>> Does the config I sent you boot on your box? I think the bug is pretty generic
>> and should trigger on any box.
On Thu, Mar 24, 2011 at 7:43 PM, Christoph Lameter <[email protected]> wrote:
> The bug should only trigger on old AMD64 boxes that do not support
> cmpxchg16b.
Yup. Ingo, is it possible to see /proc/cpuinfo of one of the affected
boxes? I'll try your config but I'm pretty sure the problem doesn't
trigger here. Like I said, I think the problem is that alternative
instructions are not patched early enough for cmpxchg16b emulation to
work for kmem_cache_init(). I tried my check_bugs() patch but it hangs
during boot. I'll see if I can cook up a patch that does
alternative_instructions() before kmem_cache_init() because I think
those *should* be available during boot too.
Pekka
On Thu, 24 Mar 2011, Pekka Enberg wrote:
> On Thu, Mar 24, 2011 at 7:43 PM, Christoph Lameter <[email protected]> wrote:
> > The bug should only trigger on old AMD64 boxes that do not support
> > cmpxchg16b.
>
> Yup. Ingo is it possible to see /proc/cpuinfo of one of the affected
> boxes? I'll try your config but I'm pretty sure the problem doesn't
> trigger here. Like I said, I think the problem is that alternative
> instructions are not patched early enough for cmpxchg16b emulation to
> work for kmem_cache_init(). I tried my check_bugs() patch but it hangs
> during boot. I'll see if I can cook up a patch that does
> alternative_instructions() before kmem_cache_init() because I think
> those *should* be available during boot too.
I forced the fallback to the _emu function to occur but could not trigger
the bug in kvm.
On Thu, Mar 24, 2011 at 8:01 PM, Christoph Lameter <[email protected]> wrote:
> On Thu, 24 Mar 2011, Pekka Enberg wrote:
>
>> On Thu, Mar 24, 2011 at 7:43 PM, Christoph Lameter <[email protected]> wrote:
>> > The bug should only trigger on old AMD64 boxes that do not support
>> > cmpxchg16b.
>>
>> Yup. Ingo is it possible to see /proc/cpuinfo of one of the affected
>> boxes? I'll try your config but I'm pretty sure the problem doesn't
>> trigger here. Like I said, I think the problem is that alternative
>> instructions are not patched early enough for cmpxchg16b emulation to
>> work for kmem_cache_init(). I tried my check_bugs() patch but it hangs
>> during boot. I'll see if I can cook up a patch that does
>> alternative_instructions() before kmem_cache_init() because I think
>> those *should* be available during boot too.
>
> I forced the fallback to the _emu function to occur but could not trigger
> the bug in kvm.
That's not the problem. I'm sure the fallback is just fine. What I'm
saying is that the fallback is *not patched* to kernel text on Ingo's
machines because alternative_instructions() happens late in the boot!
So the problem is that on Ingo's boxes (that presumably have old AMD
CPUs) we execute cmpxchg16b, not the fallback code.
Pekka
Hi Ingo,
On Thu, 24 Mar 2011, Ingo Molnar wrote:
> * Pekka Enberg <[email protected]> wrote:
>
>> On Thu, Mar 24, 2011 at 4:41 PM, Christoph Lameter <[email protected]> wrote:
>>> On Thu, 24 Mar 2011, Ingo Molnar wrote:
>>>
>>>> FYI, some sort of boot crash has snuck upstream in the last 24 hours:
>>>>
>>>> BUG: unable to handle kernel paging request at ffff87ffc147e020
>>>> IP: [<ffffffff811aa762>] this_cpu_cmpxchg16b_emu+0x2/0x1c
>>>
>>> Hmmm.. This is the fallback code for the case that the processor does not
>>> support cmpxchg16b.
>>
>> How does alternative_io() work? Does it require
>> alternative_instructions() to be executed? If so, the fallback code
>> won't be active when we enter kmem_cache_init(). Is there any reason
>> check_bugs() is called so late during boot? Can we do something like
>> the totally untested attached patch?
>
> Does the config I sent you boot on your box? I think the bug is pretty generic
> and should trigger on any box.
Here's a patch that reorganizes the alternatives fixup to happen earlier
so that the fallback should work. It boots on my machine so please give it
a spin if possible.
I'll try out your .config next.
Pekka
From dd1534455196d2a8f6c9c912db614e59986c9f0e Mon Sep 17 00:00:00 2001
From: Pekka Enberg <[email protected]>
Date: Thu, 24 Mar 2011 19:59:35 +0200
Subject: [PATCH] x86: Early boot alternative instructions
Commit 8a5ec0ba42c4919e2d8f4c3138cc8b987fdb0b79 ("Lockless (and preemptless)
fastpaths for slub") added a use of cmpxchg16b in kmem_cache_init(), which happens
early during boot. As cmpxchg16b is not supported on some older AMD CPUs, there's
an alternative_io() fallback for cmpxchg16b emulation. Unfortunately,
alternative_instructions() happens late in the boot sequence, so the fallback code
is not patched into the kernel text, which causes the following oops:
BUG: unable to handle kernel paging request at ffff87ffc147e020
IP: [<ffffffff811aa762>] this_cpu_cmpxchg16b_emu+0x2/0x1c
[<ffffffff810d9cbc>] ? kmem_cache_alloc+0x4c/0x110
[<ffffffff8151cf06>] kmem_cache_init+0xeb/0x2b0
[<ffffffff81504a06>] start_kernel+0x1de/0x49b
[<ffffffff8150432b>] x86_64_start_reservations+0x132/0x136
[<ffffffff81504140>] ? early_idt_handlers+0x140/0x140
This patch reorganizes the code so that alternative_io()-tagged fallbacks are
patched into the kernel text before kmem_cache_init() is called.
Reported-by: Ingo Molnar <[email protected]>
Signed-off-by: Pekka Enberg <[email protected]>
---
arch/x86/include/asm/alternative.h | 1 +
arch/x86/kernel/alternative.c | 12 +++++++++++-
arch/x86/kernel/cpu/bugs.c | 2 --
arch/x86/kernel/cpu/bugs_64.c | 2 --
include/asm-generic/alternative.h | 19 +++++++++++++++++++
init/main.c | 3 +++
6 files changed, 34 insertions(+), 5 deletions(-)
create mode 100644 include/asm-generic/alternative.h
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 13009d1..ca819c6 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -54,6 +54,7 @@ struct alt_instr {
#endif
};
+extern void alternative_instructions_early(void);
extern void alternative_instructions(void);
extern void apply_alternatives(struct alt_instr *start, struct alt_instr *end);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 4a23467..32e96dd 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -453,7 +453,7 @@ extern struct paravirt_patch_site __start_parainstructions[],
__stop_parainstructions[];
#endif /* CONFIG_PARAVIRT */
-void __init alternative_instructions(void)
+void __init alternative_instructions_early(void)
{
/* The patching is not fully atomic, so try to avoid local interruptions
that might execute the to be patched code.
@@ -473,6 +473,16 @@ void __init alternative_instructions(void)
apply_alternatives(__alt_instructions, __alt_instructions_end);
+ restart_nmi();
+}
+
+void __init alternative_instructions(void)
+{
+ /* The patching is not fully atomic, so try to avoid local interruptions
+ that might execute the to be patched code.
+ Other CPUs are not running. */
+ stop_nmi();
+
/* switch to patch-once-at-boottime-only mode and free the
* tables in case we know the number of CPUs will never ever
* change */
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index c39576c..3854e58 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -15,7 +15,6 @@
#include <asm/i387.h>
#include <asm/msr.h>
#include <asm/paravirt.h>
-#include <asm/alternative.h>
static int __init no_halt(char *s)
{
@@ -165,5 +164,4 @@ void __init check_bugs(void)
check_popad();
init_utsname()->machine[1] =
'0' + (boot_cpu_data.x86 > 6 ? 6 : boot_cpu_data.x86);
- alternative_instructions();
}
diff --git a/arch/x86/kernel/cpu/bugs_64.c b/arch/x86/kernel/cpu/bugs_64.c
index 04f0fe5..1064db5 100644
--- a/arch/x86/kernel/cpu/bugs_64.c
+++ b/arch/x86/kernel/cpu/bugs_64.c
@@ -5,7 +5,6 @@
#include <linux/kernel.h>
#include <linux/init.h>
-#include <asm/alternative.h>
#include <asm/bugs.h>
#include <asm/processor.h>
#include <asm/mtrr.h>
@@ -18,7 +17,6 @@ void __init check_bugs(void)
printk(KERN_INFO "CPU: ");
print_cpu_info(&boot_cpu_data);
#endif
- alternative_instructions();
/*
* Make sure the first 2MB area is not mapped by huge pages
diff --git a/include/asm-generic/alternative.h b/include/asm-generic/alternative.h
new file mode 100644
index 0000000..893a070
--- /dev/null
+++ b/include/asm-generic/alternative.h
@@ -0,0 +1,19 @@
+/*
+ * linux/include/asm-generic/alternative.h
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef __ASM_GENERIC_ALTERNATIVE_H__
+#define __ASM_GENERIC_ALTERNATIVE_H__
+
+static inline void alternative_instructions_early(void)
+{
+}
+
+static inline void alternative_instructions(void)
+{
+}
+
+#endif /* __ASM_GENERIC_ALTERNATIVE_H__ */
diff --git a/init/main.c b/init/main.c
index 4a9479e..0bb3c52 100644
--- a/init/main.c
+++ b/init/main.c
@@ -74,6 +74,7 @@
#include <asm/setup.h>
#include <asm/sections.h>
#include <asm/cacheflush.h>
+#include <asm/alternative.h>
#ifdef CONFIG_X86_LOCAL_APIC
#include <asm/smp.h>
@@ -508,6 +509,7 @@ asmlinkage void __init start_kernel(void)
vfs_caches_init_early();
sort_main_extable();
trap_init();
+ alternative_instructions_early();
mm_init();
/*
* Set up the scheduler prior starting any interrupts (such as the
@@ -614,6 +616,7 @@ asmlinkage void __init start_kernel(void)
taskstats_init_early();
delayacct_init();
+ alternative_instructions();
check_bugs();
acpi_early_init(); /* before LAPIC and SMP init */
--
1.7.0.4
On Thu, 24 Mar 2011, Pekka Enberg wrote:
> > I forced the fallback to the _emu function to occur but could not trigger
> > the bug in kvm.
>
> That's not the problem. I'm sure the fallback is just fine. What I'm
> saying is that the fallback is *not patched* to kernel text on Ingo's
> machines because alternative_instructions() happens late in the boot!
> So the problem is that on Ingo's boxes (that presumably have old AMD
> CPUs) we execute cmpxchg16b, not the fallback code.
But then we would get the bug in kmem_cache_alloc() and not in the
*_emu() function. So the _emu is executing but failing on Ingo's system
but not on mine. Question is why.
For some reason the first reference to %gs:(%rsi) won't work right on his
system:
From arch/x86/lib/cmpxchg16b_emu
#
# Emulate 'cmpxchg16b %gs:(%rsi)' except we return the result in %al not
# via the ZF. Caller will access %al to get result.
#
# Note that this is only useful for a cpuops operation. Meaning that we
# do *not* have a fully atomic operation but just an operation that is
# *atomic* on a single cpu (as provided by the this_cpu_xx class of
# macros).
#
this_cpu_cmpxchg16b_emu:
pushf
cli
cmpq %gs:(%rsi), %rax
jne not_same
cmpq %gs:8(%rsi), %rdx
jne not_same
movq %rbx, %gs:(%rsi)
movq %rcx, %gs:8(%rsi)
popf
mov $1, %al
ret
not_same:
popf
xor %al,%al
ret
CFI_ENDPROC
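For readers without the assembly in front of them, the emulation's logic can be modelled in plain C (a hypothetical user-space sketch, not the kernel routine; names are invented, and the pushf/cli ... popf interrupt masking is omitted):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of this_cpu_cmpxchg16b_emu: a 16-byte
 * compare-and-swap built from two 8-byte compares and two 8-byte
 * stores.  The real routine brackets this in pushf/cli ... popf, so
 * the sequence is atomic only with respect to the local CPU. */
struct u128 { uint64_t lo, hi; };

static bool cmpxchg16b_model(struct u128 *mem, struct u128 old, struct u128 new)
{
        if (mem->lo != old.lo || mem->hi != old.hi)
                return false;   /* "not_same": memory left untouched */
        mem->lo = new.lo;       /* both halves matched: store new value */
        mem->hi = new.hi;
        return true;            /* mirrors "mov $1, %al" */
}
```

Note the success/failure result is a return value rather than the ZF flag, matching the comment in the asm about returning the result in %al.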
On Thursday, March 24, 2011 at 13:15 -0500, Christoph Lameter wrote:
> But then we would get the bug in kmem_cache_alloc() and not in the
> *_emu() function. So the _emu is executing but failing on Ingo's system
> but not on mine. Question is why.
>
> For some reason the first reference to %gs:(%rsi) won't work right on his
> system:
>
> From arch/x86/lib/cmpxchg16b_emu
>
> #
> # Emulate 'cmpxchg16b %gs:(%rsi)' except we return the result in %al not
> # via the ZF. Caller will access %al to get result.
> #
> # Note that this is only useful for a cpuops operation. Meaning that we
> # do *not* have a fully atomic operation but just an operation that is
> # *atomic* on a single cpu (as provided by the this_cpu_xx class of
> # macros).
> #
> this_cpu_cmpxchg16b_emu:
> pushf
> cli
>
> cmpq %gs:(%rsi), %rax
> jne not_same
> cmpq %gs:8(%rsi), %rdx
> jne not_same
>
> movq %rbx, %gs:(%rsi)
> movq %rcx, %gs:8(%rsi)
>
> popf
> mov $1, %al
> ret
>
> not_same:
> popf
> xor %al,%al
> ret
>
> CFI_ENDPROC
Random guess
Masking interrupts, and accessing vmalloc() based memory for the first
time?
On Thu, 24 Mar 2011, Eric Dumazet wrote:
> > this_cpu_cmpxchg16b_emu:
> > pushf
> > cli
> >
> > cmpq %gs:(%rsi), %rax
> Random guess
>
> Masking interrupts, and accessing vmalloc() based memory for the first
> time ?
Hmmm.. Could be. KVM would not really disable interrupts so this may
explain why the test case works here.
Simple fix would be to do a load before the cli I guess.
On Thursday, March 24, 2011 at 13:47 -0500, Christoph Lameter wrote:
> Hmmm.. Could be. KVM would not really disable interrupts so this may
> explain that the test case works here.
>
> Simple fix would be to do a load before the cli I guess.
>
Hmm...
If we have a preemption and migration right after this load...
There's a different crash triggered by the slub merge as well, i've bisected it
back to one of these commits:
09b9cc4: sd: Fail discard requests when logical block provisioning has been disabled
a24c5a0: slub: Dont define useless label in the !CONFIG_CMPXCHG_LOCAL case
5bfe53a: slab,rcu: don't assume the size of struct rcu_head
da9a638: slub,rcu: don't assume the size of struct rcu_head
ab9a0f1: slub: automatically reserve bytes at the end of slab
8a5ec0b: Lockless (and preemptless) fastpaths for slub
d3f661d: slub: Get rid of slab_free_hook_irq()
1a757fe: slub: min_partial needs to be in first cacheline
d71f606: slub: fix ksize() build error
b3d4188: slub: fix kmemcheck calls to match ksize() hints
3ff84a7: Revert "slab: Fix missing DEBUG_SLAB last user"
6331046: mm: Remove support for kmem_cache_name()
The crash is below.
Ingo
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/sda6
general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
last sysfs file: /sys/block/sda/sda10/dev
CPU 0
Pid: 0, comm: swapper Tainted: G W 2.6.38-tip-09247-g0637536-dirty #110370 System manufacturer System Product Name/A8N-E
RIP: 0010:[<ffffffff810570a9>] [<ffffffff810570a9>] get_next_timer_interrupt+0x119/0x260
RSP: 0018:ffff88003fa03ec8 EFLAGS: 00010002
RAX: 6b6b6b6b6b6b6b6b RBX: 000000013fff034e RCX: ffffffff82808cc0
RDX: 6b6b6b6b6b6b6b6b RSI: 0000000000000000 RDI: 000000000000000e
RBP: ffff88003fa03f18 R08: 000000000000000e R09: ffffffff82808be0
R10: 0000000000000000 R11: 0000000003fffc0e R12: 00000000ffff034e
R13: ffffffff828087c0 R14: 0000000000000010 R15: ffff88003fa0c200
FS: 000000000071d8f0(0000) GS:ffff88003fa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000072fc5f CR3: 000000003bc32000 CR4: 00000000000006b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff82400000, task ffffffff8242f020)
Stack:
ffff88003fa03f18 ffffffff810929bb ffffffff82808be0 ffffffff82808ce0
ffffffff82808de0 ffffffff82808ee0 ffff88003fa0dbe0 0000000000000000
0000000000000086 00000013a4a5c86d ffff88003fa03f78 ffffffff81076761
Call Trace:
<IRQ>
[<ffffffff810929bb>] ? rcu_needs_cpu+0x6b/0x220
[<ffffffff81076761>] tick_nohz_stop_sched_tick+0x2c1/0x3d0
[<ffffffff81d64b6c>] ? call_softirq+0x1c/0x30
[<ffffffff8104dcc4>] irq_exit+0x84/0xb0
[<ffffffff81d651e0>] smp_apic_timer_interrupt+0x70/0x9b
[<ffffffff81d64653>] apic_timer_interrupt+0x13/0x20
<EOI>
[<ffffffff8100a262>] ? default_idle+0x42/0x110
[<ffffffff810011cd>] cpu_idle+0x5d/0xb0
[<ffffffff81cc7d1e>] rest_init+0x72/0x74
[<ffffffff825dec49>] start_kernel+0x44d/0x458
[<ffffffff825de322>] x86_64_start_reservations+0x132/0x136
[<ffffffff825de416>] x86_64_start_kernel+0xf0/0xf7
Code: 04 4c 01 c9 48 8b 01 eb 22 66 0f 1f 84 00 00 00 00 00 f6 40 18 01 75 10 48 8b 40 10 be 01 00 00 00 48 39 d8 48 0f 48 d8 48 89 d0
8b 10 48 39 c1 0f 18 0a 75 dc 85 f6 75 5d 83 c7 01 83 e7 0f
RIP [<ffffffff810570a9>] get_next_timer_interrupt+0x119/0x260
RSP <ffff88003fa03ec8>
---[ end trace cea3203dccec701b ]---
Kernel panic - not syncing: Fatal exception
Pid: 0, comm: swapper Tainted: G D W 2.6.38-tip-09247-g0637536-dirty #110370
Call Trace:
<IRQ> [<ffffffff81d4da7f>] panic+0x91/0x18e
[<ffffffff81005ec4>] oops_end+0xd4/0xf0
[<ffffffff81006038>] die+0x58/0x90
[<ffffffff810033d2>] do_general_protection+0x162/0x170
[<ffffffff812bf265>] ? __cfq_slice_expired+0x295/0x490
[<ffffffff81d6382f>] general_protection+0x1f/0x30
[<ffffffff810570a9>] ? get_next_timer_interrupt+0x119/0x260
[<ffffffff81056fd4>] ? get_next_timer_interrupt+0x44/0x260
[<ffffffff810929bb>] ? rcu_needs_cpu+0x6b/0x220
[<ffffffff81076761>] tick_nohz_stop_sched_tick+0x2c1/0x3d0
[<ffffffff81d64b6c>] ? call_softirq+0x1c/0x30
[<ffffffff8104dcc4>] irq_exit+0x84/0xb0
[<ffffffff81d651e0>] smp_apic_timer_interrupt+0x70/0x9b
[<ffffffff81d64653>] apic_timer_interrupt+0x13/0x20
<EOI> [<ffffffff8100a262>] ? default_idle+0x42/0x110
[<ffffffff810011cd>] cpu_idle+0x5d/0xb0
[<ffffffff81cc7d1e>] rest_init+0x72/0x74
[<ffffffff825dec49>] start_kernel+0x44d/0x458
[<ffffffff825de322>] x86_64_start_reservations+0x132/0x136
[<ffffffff825de416>] x86_64_start_kernel+0xf0/0xf7
On Thu, Mar 24, 2011 at 8:47 PM, Christoph Lameter <[email protected]> wrote:
>
> On Thu, 24 Mar 2011, Eric Dumazet wrote:
>
>> > this_cpu_cmpxchg16b_emu:
>> >         pushf
>> >         cli
>> >
>> >         cmpq %gs:(%rsi), %rax
>
>> Random guess
>>
>> Masking interrupts, and accessing vmalloc() based memory for the first
>> time ?
>
> Hmmm.. Could be. KVM would not really disable interrupts so this may
> explain that the test case works here.
>
> Simple fix would be to do a load before the cli I guess.
Btw, I tried Ingo's .config and it doesn't boot here so it's somehow
.config related.
* Pekka Enberg <[email protected]> wrote:
> On Thu, Mar 24, 2011 at 8:47 PM, Christoph Lameter <[email protected]> wrote:
> >
> > On Thu, 24 Mar 2011, Eric Dumazet wrote:
> >
> >> > this_cpu_cmpxchg16b_emu:
> >> >         pushf
> >> >         cli
> >> >
> >> >         cmpq %gs:(%rsi), %rax
> >
> >> Random guess
> >>
> >> Masking interrupts, and accessing vmalloc() based memory for the first
> >> time ?
> >
> > Hmmm.. Could be. KVM would not really disable interrupts so this may
> > explain that the test case works here.
> >
> > Simple fix would be to do a load before the cli I guess.
>
> Btw, I tried Ingo's .config and it doesn't boot here so it's somehow
> .config related.
did you get a similar early crash as i did? I'd not expect my .config to have all
the drivers that are needed on your box.
Thanks,
Ingo
On Thu, 24 Mar 2011, Eric Dumazet wrote:
On Thursday, March 24, 2011 at 13:47 -0500, Christoph Lameter wrote:
>
> > Hmmm.. Could be. KVM would not really disable interrupts so this may
> > explain that the test case works here.
> >
> > Simple fix would be to do a load before the cli I guess.
> >
>
> Hmm...
>
> If we have a preemption and migration right after this load...
Cannot be the issue here since init_kmem_cache_cpus already
touches the per cpu data. At least if CONFIG_PREEMPT is on.
Is it on?
void init_kmem_cache_cpus(struct kmem_cache *s)
{
#if defined(CONFIG_CMPXCHG_LOCAL) && defined(CONFIG_PREEMPT)
int cpu;
for_each_possible_cpu(cpu)
per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu);
#endif
}
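The pre-touch idea behind that init loop can be illustrated with a small model (hypothetical user-space C, not kernel code): the write per CPU stands in for the page fault that establishes the vmalloc mapping, so a later access with interrupts masked finds the mapping already present.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Models the lazily-established vmalloc mapping of the per-CPU area:
 * a slot becomes "mapped" on first write (the page fault), and an
 * access with interrupts masked must never hit an unmapped slot. */
static bool mapped[NR_CPUS];
static unsigned long tid[NR_CPUS];

/* Models init_kmem_cache_cpus() doing one write per possible CPU,
 * faulting each mapping in while faults can still be serviced. */
static void init_tids(void)
{
        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                mapped[cpu] = true;     /* the write establishes the mapping */
                tid[cpu] = (unsigned long)cpu;
        }
}

/* Models the cmpxchg16b emulation path, which runs with interrupts
 * masked: an unmapped access here would be the boot-time crash. */
static unsigned long read_tid_irqs_off(int cpu)
{
        assert(mapped[cpu]);            /* would oops if not pre-touched */
        return tid[cpu];
}
```

This is also why the #ifdef mattered: with CONFIG_PREEMPT off, the init loop was compiled out and the first touch happened inside the interrupts-off emulation.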
On Thu, 24 Mar 2011, Pekka Enberg wrote:
> > Simple fix would be to do a load before the cli I guess.
>
> Btw, I tried Ingo's .config and it doesn't boot here so it's somehow
> .config related.
CONFIG_PREEMPT related?
On Thu, Mar 24, 2011 at 8:59 PM, Ingo Molnar <[email protected]> wrote:
>
> * Pekka Enberg <[email protected]> wrote:
>
>> On Thu, Mar 24, 2011 at 8:47 PM, Christoph Lameter <[email protected]> wrote:
>> >
>> > On Thu, 24 Mar 2011, Eric Dumazet wrote:
>> >
>> >> > this_cpu_cmpxchg16b_emu:
>> >> >         pushf
>> >> >         cli
>> >> >
>> >> >         cmpq %gs:(%rsi), %rax
>> >
>> >> Random guess
>> >>
>> >> Masking interrupts, and accessing vmalloc() based memory for the first
>> >> time ?
>> >
>> > Hmmm.. Could be. KVM would not really disable interrupts so this may
>> > explain that the test case works here.
>> >
>> > Simple fix would be to do a load before the cli I guess.
>>
>> Btw, I tried Ingo's .config and it doesn't boot here so it's somehow
>> .config related.
>
> did you get a similar early crash as i? I'd not expect my .config to have all
> the drivers that are needed on your box.
It hung here which is pretty much expected on this box if
kmem_cache_init() oopses. I'm now trying to see if I'm able to find
the config option that breaks things. CONFIG_PREEMPT_NONE is a
suspect:
penberg@tiger:~/linux$ grep PREEMPT ../config-ingo
# CONFIG_PREEMPT_RCU is not set
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
On Thu, Mar 24, 2011 at 9:02 PM, Christoph Lameter <[email protected]> wrote:
> On Thu, 24 Mar 2011, Eric Dumazet wrote:
>
>> On Thursday, March 24, 2011 at 13:47 -0500, Christoph Lameter wrote:
>>
>> > Hmmm.. Could be. KVM would not really disable interrupts so this may
>> > explain that the test case works here.
>> >
>> > Simple fix would be to do a load before the cli I guess.
>> >
>>
>> Hmm...
>>
>> If we have a preemption and migration right after this load...
>
> Cannot be the issue here since init_kmem_cache_cpus already
> touches the per cpu data. At least if CONFIG_PREEMPT is on.
> Is it on?
It's not.
On Thu, 24 Mar 2011, Pekka Enberg wrote:
> It hanged here which is pretty much expected on this box if
> kmem_cache_init() oopses. I'm now trying to see if I'm able to find
> the config option that breaks things. CONFIG_PREEMPT_NONE is a
> suspect:
>
> penberg@tiger:~/linux$ grep PREEMPT ../config-ingo
> # CONFIG_PREEMPT_RCU is not set
> CONFIG_PREEMPT_NONE=y
> # CONFIG_PREEMPT_VOLUNTARY is not set
> # CONFIG_PREEMPT is not set
The following patch should ensure that all percpu data is touched
before any emulation functions are called:
---
mm/slub.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2011-03-24 14:03:10.000000000 -0500
+++ linux-2.6/mm/slub.c 2011-03-24 14:04:08.000000000 -0500
@@ -1604,7 +1604,7 @@ static inline void note_cmpxchg_failure(
void init_kmem_cache_cpus(struct kmem_cache *s)
{
-#if defined(CONFIG_CMPXCHG_LOCAL) && defined(CONFIG_PREEMPT)
+#ifdef CONFIG_CMPXCHG_LOCAL
int cpu;
for_each_possible_cpu(cpu)
On Thu, 24 Mar 2011, Ingo Molnar wrote:
> RIP: 0010:[<ffffffff810570a9>] [<ffffffff810570a9>] get_next_timer_interrupt+0x119/0x260
That's a typical timer crash, but you were unable to debug it with
debugobjects because commit d3f661d6 broke those.
Christoph, debugobjects do not need to run with interrupts
disabled. And just because they were in that section to keep all the
debug stuff together does not make an excuse for not looking at the
code and just slopping it into some totally unrelated ifdef along with
a completely bogus comment.
Signed-off-by: Thomas Gleixner <[email protected]>
---
mm/slub.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -849,11 +849,11 @@ static inline void slab_free_hook(struct
local_irq_save(flags);
kmemcheck_slab_free(s, x, s->objsize);
debug_check_no_locks_freed(x, s->objsize);
- if (!(s->flags & SLAB_DEBUG_OBJECTS))
- debug_check_no_obj_freed(x, s->objsize);
local_irq_restore(flags);
}
#endif
+ if (!(s->flags & SLAB_DEBUG_OBJECTS))
+ debug_check_no_obj_freed(x, s->objsize);
}
/*
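The point of that fix can be shown with a compile-time model (hypothetical C, not the kernel source; the config symbol is invented): the debugobjects check had been lumped into an unrelated, conditionally-compiled debug section, so in configs without that section it never ran. Hoisting it out makes it unconditional.

```c
#include <assert.h>

/* CONFIG_OTHER_DEBUG_HOOKS stands in for the unrelated ifdef the
 * check had landed in; it is deliberately left undefined here, which
 * models the configuration where the breakage showed up. */
/* #define CONFIG_OTHER_DEBUG_HOOKS */

static int obj_checks_run;

/* Models debug_check_no_obj_freed(). */
static void debug_check_no_obj_freed_model(void)
{
        obj_checks_run++;
}

/* Models the fixed slab_free_hook(): the debugobjects check now sits
 * outside the conditional, irq-disabled debug section, so it runs in
 * every configuration. */
static void slab_free_hook_model(void)
{
#ifdef CONFIG_OTHER_DEBUG_HOOKS
        /* irq-off kmemcheck/lockdep checks would go here */
#endif
        debug_check_no_obj_freed_model();
}
```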
On Thu, Mar 24, 2011 at 9:05 PM, Christoph Lameter <[email protected]> wrote:
> On Thu, 24 Mar 2011, Pekka Enberg wrote:
>
>> It hanged here which is pretty much expected on this box if
>> kmem_cache_init() oopses. I'm now trying to see if I'm able to find
>> the config option that breaks things. CONFIG_PREEMPT_NONE is a
>> suspect:
>>
>> penberg@tiger:~/linux$ grep PREEMPT ../config-ingo
>> # CONFIG_PREEMPT_RCU is not set
>> CONFIG_PREEMPT_NONE=y
>> # CONFIG_PREEMPT_VOLUNTARY is not set
>> # CONFIG_PREEMPT is not set
>
> The following patch should ensure that all percpu data is touched
> before any emulation functions are called:
>
> ---
>  mm/slub.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c    2011-03-24 14:03:10.000000000 -0500
> +++ linux-2.6/mm/slub.c 2011-03-24 14:04:08.000000000 -0500
> @@ -1604,7 +1604,7 @@ static inline void note_cmpxchg_failure(
>
>  void init_kmem_cache_cpus(struct kmem_cache *s)
>  {
> -#if defined(CONFIG_CMPXCHG_LOCAL) && defined(CONFIG_PREEMPT)
> +#ifdef CONFIG_CMPXCHG_LOCAL
>        int cpu;
>
>        for_each_possible_cpu(cpu)
>
Ingo, can you try this patch out, please? I'm compiling here but
unfortunately I'm stuck with a really slow laptop...
After we made debugobjects work again, we got the following:
WARNING: at lib/debugobjects.c:262 debug_print_object+0x8e/0xb0()
Hardware name: System Product Name
ODEBUG: free active (active state 0) object type: timer_list hint: hci_cmd_timer+0x0/0x60
Pid: 2125, comm: dmsetup Tainted: G W 2.6.38-06707-gc62b389 #110375
Call Trace:
[<ffffffff8104700a>] warn_slowpath_common+0x7a/0xb0
[<ffffffff810470b6>] warn_slowpath_fmt+0x46/0x50
[<ffffffff812d3a5e>] debug_print_object+0x8e/0xb0
[<ffffffff81bd8810>] ? hci_cmd_timer+0x0/0x60
[<ffffffff812d4685>] debug_check_no_obj_freed+0x125/0x230
[<ffffffff810f1063>] ? check_object+0xb3/0x2b0
[<ffffffff810f3630>] kfree+0x150/0x190
[<ffffffff81be4d06>] ? bt_host_release+0x16/0x20
[<ffffffff81be4d06>] bt_host_release+0x16/0x20
[<ffffffff813a1907>] device_release+0x27/0xa0
[<ffffffff812c519c>] kobject_release+0x4c/0xa0
[<ffffffff812c5150>] ? kobject_release+0x0/0xa0
[<ffffffff812c61f6>] kref_put+0x36/0x70
[<ffffffff812c4d37>] kobject_put+0x27/0x60
[<ffffffff813a21f7>] put_device+0x17/0x20
[<ffffffff81bda4f9>] hci_free_dev+0x29/0x30
[<ffffffff81928be6>] vhci_release+0x36/0x70
[<ffffffff810fb366>] fput+0xd6/0x1f0
[<ffffffff810f8fe6>] filp_close+0x66/0x90
[<ffffffff810f90a9>] sys_close+0x99/0xf0
[<ffffffff81d4c96b>] system_call_fastpath+0x16/0x1b
That timer was introduced with commit 6bd32326cda (Bluetooth: Use
proper timer for hci command timout)
Timer seems to be running when the thing is closed. Removing the timer
unconditionally fixes the problem. And yes, it needs to be fixed
before the HCI_UP check.
Signed-off-by: Thomas Gleixner <[email protected]>
---
net/bluetooth/hci_core.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
Index: linux-2.6/net/bluetooth/hci_core.c
===================================================================
--- linux-2.6.orig/net/bluetooth/hci_core.c
+++ linux-2.6/net/bluetooth/hci_core.c
@@ -584,6 +584,9 @@ static int hci_dev_do_close(struct hci_d
hci_req_cancel(hdev, ENODEV);
hci_req_lock(hdev);
+ /* Stop timer, it might be running */
+ del_timer_sync(&hdev->cmd_timer);
+
if (!test_and_clear_bit(HCI_UP, &hdev->flags)) {
hci_req_unlock(hdev);
return 0;
@@ -623,7 +626,6 @@ static int hci_dev_do_close(struct hci_d
/* Drop last sent command */
if (hdev->sent_cmd) {
- del_timer_sync(&hdev->cmd_timer);
kfree_skb(hdev->sent_cmd);
hdev->sent_cmd = NULL;
}
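The bug pattern here is a general one: a resource is torn down while its timer can still be pending. A small model (hypothetical C, not the hci_core code) shows why the cancellation has to precede the early-return check:

```c
#include <assert.h>
#include <stdbool.h>

/* Model of hci_dev_do_close(): the command timer may be pending even
 * when HCI_UP is already clear, so it must be cancelled before the
 * early return, not only on the sent_cmd path further down. */
struct hdev_model {
        bool up;                /* models the HCI_UP bit */
        bool timer_pending;     /* models hdev->cmd_timer being armed */
};

static int do_close_fixed(struct hdev_model *h)
{
        h->timer_pending = false;       /* del_timer_sync(&hdev->cmd_timer) */

        if (!h->up)
                return 0;               /* early return no longer leaks a live timer */

        h->up = false;                  /* rest of the teardown */
        return 0;
}
```

In the buggy ordering, a close with HCI_UP clear returned early and left the armed timer pointing into freed memory, which is exactly what debugobjects flagged.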
* Thomas Gleixner <[email protected]> wrote:
> On Thu, 24 Mar 2011, Ingo Molnar wrote:
> > RIP: 0010:[<ffffffff810570a9>] [<ffffffff810570a9>] get_next_timer_interrupt+0x119/0x260
>
> That's a typical timer crash, but you were unable to debug it with
> debugobjects because commit d3f661d6 broke those.
>
> Christoph, debugobjects do not need to run with interrupts
> disabled. And just because they were in that section to keep all the
> debug stuff together does not make an excuse for not looking at the
> code and just slopping it into some totally unrelated ifdef along with
> a completely bogus comment.
>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> mm/slub.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c
> +++ linux-2.6/mm/slub.c
> @@ -849,11 +849,11 @@ static inline void slab_free_hook(struct
> local_irq_save(flags);
> kmemcheck_slab_free(s, x, s->objsize);
> debug_check_no_locks_freed(x, s->objsize);
> - if (!(s->flags & SLAB_DEBUG_OBJECTS))
> - debug_check_no_obj_freed(x, s->objsize);
> local_irq_restore(flags);
> }
> #endif
> + if (!(s->flags & SLAB_DEBUG_OBJECTS))
> + debug_check_no_obj_freed(x, s->objsize);
Thanks, this did the trick!
Tested-by: Ingo Molnar <[email protected]>
With this fix i got the warning below - pinpointing a net/bluetooth/hci_core.c
timer bug.
Thanks,
Ingo
------------[ cut here ]------------
WARNING: at lib/debugobjects.c:262 debug_print_object+0x8e/0xb0()
Hardware name: System Product Name
ODEBUG: free active (active state 0) object type: timer_list hint: hci_cmd_timer+0x0/0x60
Pid: 2076, comm: dmsetup Not tainted 2.6.38-tip-09251-ged68fd8-dirty #110378
Call Trace:
[<ffffffff8104703a>] warn_slowpath_common+0x7a/0xb0
[<ffffffff810470e6>] warn_slowpath_fmt+0x46/0x50
[<ffffffff812d3eee>] debug_print_object+0x8e/0xb0
[<ffffffff81bee870>] ? bt_sock_wait_state+0x150/0x150
[<ffffffff812d4b15>] debug_check_no_obj_freed+0x125/0x230
[<ffffffff810f1173>] ? check_object+0xb3/0x2b0
[<ffffffff81bfad56>] ? bt_host_release+0x16/0x20
[<ffffffff81bfad56>] ? bt_host_release+0x16/0x20
[<ffffffff810f373c>] kfree+0x14c/0x190
[<ffffffff81bfad56>] bt_host_release+0x16/0x20
[<ffffffff813a1b87>] device_release+0x27/0xa0
[<ffffffff812c53bc>] kobject_release+0x4c/0xa0
[<ffffffff812c5370>] ? kobject_del+0x40/0x40
[<ffffffff812c6416>] kref_put+0x36/0x70
[<ffffffff812c4f57>] kobject_put+0x27/0x60
[<ffffffff813a2477>] put_device+0x17/0x20
[<ffffffff81bf0549>] hci_free_dev+0x29/0x30
[<ffffffff8193ed16>] vhci_release+0x36/0x70
[<ffffffff810fb4f6>] fput+0xd6/0x1f0
[<ffffffff810f9176>] filp_close+0x66/0x90
[<ffffffff810f9239>] sys_close+0x99/0xf0
[<ffffffff81d63dab>] system_call_fastpath+0x16/0x1b
---[ end trace ea6ca6434ee730b9 ]---
* Thomas Gleixner <[email protected]> wrote:
> After we made debugobjects working again, we got the following:
>
> WARNING: at lib/debugobjects.c:262 debug_print_object+0x8e/0xb0()
> Hardware name: System Product Name
> ODEBUG: free active (active state 0) object type: timer_list hint: hci_cmd_timer+0x0/0x60
> Pid: 2125, comm: dmsetup Tainted: G W 2.6.38-06707-gc62b389 #110375
> Call Trace:
> [<ffffffff8104700a>] warn_slowpath_common+0x7a/0xb0
> [<ffffffff810470b6>] warn_slowpath_fmt+0x46/0x50
> [<ffffffff812d3a5e>] debug_print_object+0x8e/0xb0
> [<ffffffff81bd8810>] ? hci_cmd_timer+0x0/0x60
> [<ffffffff812d4685>] debug_check_no_obj_freed+0x125/0x230
> [<ffffffff810f1063>] ? check_object+0xb3/0x2b0
> [<ffffffff810f3630>] kfree+0x150/0x190
> [<ffffffff81be4d06>] ? bt_host_release+0x16/0x20
> [<ffffffff81be4d06>] bt_host_release+0x16/0x20
> [<ffffffff813a1907>] device_release+0x27/0xa0
> [<ffffffff812c519c>] kobject_release+0x4c/0xa0
> [<ffffffff812c5150>] ? kobject_release+0x0/0xa0
> [<ffffffff812c61f6>] kref_put+0x36/0x70
> [<ffffffff812c4d37>] kobject_put+0x27/0x60
> [<ffffffff813a21f7>] put_device+0x17/0x20
> [<ffffffff81bda4f9>] hci_free_dev+0x29/0x30
> [<ffffffff81928be6>] vhci_release+0x36/0x70
> [<ffffffff810fb366>] fput+0xd6/0x1f0
> [<ffffffff810f8fe6>] filp_close+0x66/0x90
> [<ffffffff810f90a9>] sys_close+0x99/0xf0
> [<ffffffff81d4c96b>] system_call_fastpath+0x16/0x1b
>
> That timer was introduced with commit 6bd32326cda(Bluetooth: Use
> proper timer for hci command timout)
>
> Timer seems to be running when the thing is closed. Removing the timer
> unconditionally fixes the problem. And yes, it needs to be fixed
> before the HCI_UP check.
>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> net/bluetooth/hci_core.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/net/bluetooth/hci_core.c
> ===================================================================
> --- linux-2.6.orig/net/bluetooth/hci_core.c
> +++ linux-2.6/net/bluetooth/hci_core.c
> @@ -584,6 +584,9 @@ static int hci_dev_do_close(struct hci_d
> hci_req_cancel(hdev, ENODEV);
> hci_req_lock(hdev);
>
> + /* Stop timer, it might be running */
> + del_timer_sync(&hdev->cmd_timer);
> +
> if (!test_and_clear_bit(HCI_UP, &hdev->flags)) {
> hci_req_unlock(hdev);
> return 0;
> @@ -623,7 +626,6 @@ static int hci_dev_do_close(struct hci_d
>
> /* Drop last sent command */
> if (hdev->sent_cmd) {
> - del_timer_sync(&hdev->cmd_timer);
> kfree_skb(hdev->sent_cmd);
> hdev->sent_cmd = NULL;
> }
Yes, this fixes the warning.
Tested-by: Ingo Molnar <[email protected]>
Thanks,
Ingo
On Thu, Mar 24, 2011 at 9:22 PM, Ingo Molnar <[email protected]> wrote:
>
> * Thomas Gleixner <[email protected]> wrote:
>
>> On Thu, 24 Mar 2011, Ingo Molnar wrote:
>> > RIP: 0010:[<ffffffff810570a9>] ?[<ffffffff810570a9>] get_next_timer_interrupt+0x119/0x260
>>
>> That's a typical timer crash, but you were unable to debug it with
>> debugobjects because commit d3f661d6 broke those.
>>
>> Christoph, debugobjects do not need to run with interrupts
>> disabled. And just because they were in that section to keep all the
>> debug stuff together does not make an excuse for not looking at the
>> code and just slopping it into some totally unrelated ifdef along with
>> a completely bogus comment.
>>
>> Signed-off-by: Thomas Gleixner <[email protected]>
>> ---
>>  mm/slub.c |    4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> Index: linux-2.6/mm/slub.c
>> ===================================================================
>> --- linux-2.6.orig/mm/slub.c
>> +++ linux-2.6/mm/slub.c
>> @@ -849,11 +849,11 @@ static inline void slab_free_hook(struct
>>               local_irq_save(flags);
>>               kmemcheck_slab_free(s, x, s->objsize);
>>               debug_check_no_locks_freed(x, s->objsize);
>> -             if (!(s->flags & SLAB_DEBUG_OBJECTS))
>> -                     debug_check_no_obj_freed(x, s->objsize);
>>               local_irq_restore(flags);
>>       }
>>  #endif
>> +     if (!(s->flags & SLAB_DEBUG_OBJECTS))
>> +             debug_check_no_obj_freed(x, s->objsize);
>
> Thanks, this did the trick!
>
> Tested-by: Ingo Molnar <[email protected]>
>
> With this fix i got the warning below - pinpointing a net/bluetooth/hci_core.c
> timer bug.
Applied, thanks!
On Thu, Mar 24, 2011 at 9:27 PM, Ingo Molnar <[email protected]> wrote:
>
> * Pekka Enberg <[email protected]> wrote:
>
>> From dd1534455196d2a8f6c9c912db614e59986c9f0e Mon Sep 17 00:00:00 2001
>> From: Pekka Enberg <[email protected]>
>> Date: Thu, 24 Mar 2011 19:59:35 +0200
>> Subject: [PATCH] x86: Early boot alternative instructions
>
> hm, patch is whitespace damaged.
>
> Also, the fix looks rather intrusive.
>
> Could we please disable the lockless slub code first and then do everything
> with proper testing and re-enable the lockless code *after* we know that the
> alternatives fixup change is robust, etc? That way there's no rush needed.
>
> There's a lot of code that could break from tweaking the alternatives code.
Just ignore this patch. As explained by Christoph, if alternative_io()
was the issue, we'd see the crash in kmem_cache_alloc().
* Pekka Enberg <[email protected]> wrote:
> From dd1534455196d2a8f6c9c912db614e59986c9f0e Mon Sep 17 00:00:00 2001
> From: Pekka Enberg <[email protected]>
> Date: Thu, 24 Mar 2011 19:59:35 +0200
> Subject: [PATCH] x86: Early boot alternative instructions
hm, patch is whitespace damaged.
Also, the fix looks rather intrusive.
Could we please disable the lockless slub code first and then do everything
with proper testing and re-enable the lockless code *after* we know that the
alternatives fixup change is robust, etc? That way there's no rush needed.
There's a lot of code that could break from tweaking the alternatives code.
Thanks,
Ingo
* Pekka Enberg <[email protected]> wrote:
> > -#if defined(CONFIG_CMPXCHG_LOCAL) && defined(CONFIG_PREEMPT)
> > +#ifdef CONFIG_CMPXCHG_LOCAL
> >        int cpu;
> >
> >        for_each_possible_cpu(cpu)
> >
>
> Ingo, can you try this patch out, please? I'm compiling here but
> unfortunately I'm stuck with a really slow laptop...
Yes, it does the trick with the config i sent.
Tested-by: Ingo Molnar <[email protected]>
Thanks,
Ingo
On Thu, Mar 24, 2011 at 9:36 PM, Ingo Molnar <[email protected]> wrote:
>
> * Pekka Enberg <[email protected]> wrote:
>
>> > -#if defined(CONFIG_CMPXCHG_LOCAL) && defined(CONFIG_PREEMPT)
>> > +#ifdef CONFIG_CMPXCHG_LOCAL
>> >        int cpu;
>> >
>> >        for_each_possible_cpu(cpu)
>> >
>>
>> Ingo, can you try this patch out, please? I'm compiling here but
>> unfortunately I'm stuck with a really slow laptop...
>
> Yes, it does the trick with the config i sent.
>
> Tested-by: Ingo Molnar <[email protected]>
Thanks, Ingo! Christoph, may I have your sign-off for the patch and
I'll send it to Linus?
On Thu, 24 Mar 2011, Pekka Enberg wrote:
> Thanks, Ingo! Christoph, may I have your sign-off for the patch and
> I'll send it to Linus?
Subject: SLUB: Write to per cpu data when allocating it
It turns out that the cmpxchg16b emulation has to access vmalloced
percpu memory with interrupts disabled. If the memory has never
been touched before then the fault necessary to establish the
mapping will not occur and the kernel will fail on boot.
Fix that by reusing the CONFIG_PREEMPT code that writes the
cpu number into a field on every cpu. Writing to the per-cpu
area beforehand causes the mapping to be established before we
get to the cmpxchg16b emulation.
Tested-by: Ingo Molnar <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
---
mm/slub.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2011-03-24 14:03:10.000000000 -0500
+++ linux-2.6/mm/slub.c 2011-03-24 14:04:08.000000000 -0500
@@ -1604,7 +1604,7 @@ static inline void note_cmpxchg_failure(
void init_kmem_cache_cpus(struct kmem_cache *s)
{
-#if defined(CONFIG_CMPXCHG_LOCAL) && defined(CONFIG_PREEMPT)
+#ifdef CONFIG_CMPXCHG_LOCAL
int cpu;
for_each_possible_cpu(cpu)
* Christoph Lameter <[email protected]> wrote:
> On Thu, 24 Mar 2011, Pekka Enberg wrote:
>
> > Thanks, Ingo! Christoph, may I have your sign-off for the patch and
> > I'll send it to Linus?
>
>
> Subject: SLUB: Write to per cpu data when allocating it
The commit title should obviously start with "SLUB: Fix boot crash ..."
Thanks,
Ingo
* Ingo Molnar <[email protected]> [2011-03-24 20:23:33 +0100]:
>
> * Thomas Gleixner <[email protected]> wrote:
>
> > After we made debugobjects working again, we got the following:
> >
> > WARNING: at lib/debugobjects.c:262 debug_print_object+0x8e/0xb0()
> > Hardware name: System Product Name
> > ODEBUG: free active (active state 0) object type: timer_list hint: hci_cmd_timer+0x0/0x60
> > Pid: 2125, comm: dmsetup Tainted: G W 2.6.38-06707-gc62b389 #110375
> > Call Trace:
> > [<ffffffff8104700a>] warn_slowpath_common+0x7a/0xb0
> > [<ffffffff810470b6>] warn_slowpath_fmt+0x46/0x50
> > [<ffffffff812d3a5e>] debug_print_object+0x8e/0xb0
> > [<ffffffff81bd8810>] ? hci_cmd_timer+0x0/0x60
> > [<ffffffff812d4685>] debug_check_no_obj_freed+0x125/0x230
> > [<ffffffff810f1063>] ? check_object+0xb3/0x2b0
> > [<ffffffff810f3630>] kfree+0x150/0x190
> > [<ffffffff81be4d06>] ? bt_host_release+0x16/0x20
> > [<ffffffff81be4d06>] bt_host_release+0x16/0x20
> > [<ffffffff813a1907>] device_release+0x27/0xa0
> > [<ffffffff812c519c>] kobject_release+0x4c/0xa0
> > [<ffffffff812c5150>] ? kobject_release+0x0/0xa0
> > [<ffffffff812c61f6>] kref_put+0x36/0x70
> > [<ffffffff812c4d37>] kobject_put+0x27/0x60
> > [<ffffffff813a21f7>] put_device+0x17/0x20
> > [<ffffffff81bda4f9>] hci_free_dev+0x29/0x30
> > [<ffffffff81928be6>] vhci_release+0x36/0x70
> > [<ffffffff810fb366>] fput+0xd6/0x1f0
> > [<ffffffff810f8fe6>] filp_close+0x66/0x90
> > [<ffffffff810f90a9>] sys_close+0x99/0xf0
> > [<ffffffff81d4c96b>] system_call_fastpath+0x16/0x1b
> >
> > That timer was introduced with commit 6bd32326cda(Bluetooth: Use
> > proper timer for hci command timout)
> >
> > Timer seems to be running when the thing is closed. Removing the timer
> > unconditionally fixes the problem. And yes, it needs to be fixed
> > before the HCI_UP check.
> >
> > Signed-off-by: Thomas Gleixner <[email protected]>
> > ---
> > net/bluetooth/hci_core.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > Index: linux-2.6/net/bluetooth/hci_core.c
> > ===================================================================
> > --- linux-2.6.orig/net/bluetooth/hci_core.c
> > +++ linux-2.6/net/bluetooth/hci_core.c
> > @@ -584,6 +584,9 @@ static int hci_dev_do_close(struct hci_d
> > hci_req_cancel(hdev, ENODEV);
> > hci_req_lock(hdev);
> >
> > + /* Stop timer, it might be running */
> > + del_timer_sync(&hdev->cmd_timer);
> > +
> > if (!test_and_clear_bit(HCI_UP, &hdev->flags)) {
> > hci_req_unlock(hdev);
> > return 0;
> > @@ -623,7 +626,6 @@ static int hci_dev_do_close(struct hci_d
> >
> > /* Drop last sent command */
> > if (hdev->sent_cmd) {
> > - del_timer_sync(&hdev->cmd_timer);
> > kfree_skb(hdev->sent_cmd);
> > hdev->sent_cmd = NULL;
> > }
>
> Yes, this fixes the warning.
>
> Tested-by: Ingo Molnar <[email protected]>
I applied this patch to the Bluetooth tree; it should reach mainline soon.
But if you want to bypass me to get this fixed quickly, then:
Acked-by: Gustavo F. Padovan <[email protected]>
--
Gustavo F. Padovan
http://profusion.mobi
Le jeudi 24 mars 2011 à 14:51 -0500, Christoph Lameter a écrit :
> On Thu, 24 Mar 2011, Pekka Enberg wrote:
>
> > Thanks, Ingo! Christoph, may I have your sign-off for the patch and
> > I'll send it to Linus?
>
>
> Subject: SLUB: Write to per cpu data when allocating it
>
> It turns out that the cmpxchg16b emulation has to access vmalloced
> percpu memory with interrupts disabled. If the memory has never
> been touched before then the fault necessary to establish the
> mapping will not occur and the kernel will fail on boot.
>
> Fix that by reusing the CONFIG_PREEMPT code that writes the
> cpu number into a field on every cpu. Writing to the per-cpu
> area beforehand causes the mapping to be established before we
> get to the cmpxchg16b emulation.
>
That's strange, alloc_percpu() is supposed to zero the memory already ...
Are you sure it's really this problem of interrupts being disabled?
On Thu, 24 Mar 2011, Eric Dumazet wrote:
> That's strange, alloc_percpu() is supposed to zero the memory already ...
True.
> Are you sure it's really this problem of interrupts being disabled?
Guess so since Ingo and Pekka reported that it fixed the problem.
Tejun: Can you help us with this mystery?
Hello,
On Thu, Mar 24, 2011 at 03:43:25PM -0500, Christoph Lameter wrote:
> > That's strange, alloc_percpu() is supposed to zero the memory already ...
>
> True.
>
> > Are you sure it's really this problem of interrupts being disabled?
>
> Guess so since Ingo and Pekka reported that it fixed the problem.
>
> Tejun: Can you help us with this mystery?
I've looked through the code but can't figure out what the difference
is. The memset code is in mm/percpu-vm.c::pcpu_populate_chunk().
for_each_possible_cpu(cpu)
memset((void *)pcpu_chunk_addr(chunk, cpu, 0) + off, 0, size);
(pcpu_chunk_addr(chunk, cpu, 0) + off) is the same vaddr as will be
obtained by per_cpu_ptr(ptr, cpu), so all allocated memory regions are
accessed before being returned. Dazed and confused (seems like the
theme of today for me).
Could it be that the vmalloc page is taking more than one fault?
Thanks.
--
tejun
On Fri, 25 Mar 2011, Tejun Heo wrote:
> I've looked through the code but can't figure out what the difference
> is. The memset code is in mm/percpu-vm.c::pcpu_populate_chunk().
>
> for_each_possible_cpu(cpu)
> memset((void *)pcpu_chunk_addr(chunk, cpu, 0) + off, 0, size);
>
> (pcpu_chunk_addr(chunk, cpu, 0) + off) is the same vaddr as will be
> obtained by per_cpu_ptr(ptr, cpu), so all allocated memory regions are
> accessed before being returned. Dazed and confused (seems like the
> theme of today for me).
>
> Could it be that the vmalloc page is taking more than one fault?
The vmalloc page only contains per cpu data from a single cpu, right?
Could anyone have set write access restrictions that would require a fault
to get rid of?
Or does an access from a different cpu require a "page table sync"?
There is some rather strange looking code in arch/x86/mm/fault.c:vmalloc_fault
* Pekka Enberg <[email protected]> wrote:
> On Thu, Mar 24, 2011 at 9:22 PM, Ingo Molnar <[email protected]> wrote:
> >
> > * Thomas Gleixner <[email protected]> wrote:
> >
> >> On Thu, 24 Mar 2011, Ingo Molnar wrote:
> >> > RIP: 0010:[<ffffffff810570a9>]  [<ffffffff810570a9>] get_next_timer_interrupt+0x119/0x260
> >>
> >> That's a typical timer crash, but you were unable to debug it with
> >> debugobjects because commit d3f661d6 broke those.
> >>
> >> Christoph, debugobjects do not need to run with interupts
> >> disabled. And just because they were in that section to keep all the
> >> debug stuff together does not make an excuse for not looking at the
> >> code and just slopping it into some totally unrelated ifdef along with
> >> a completely bogus comment.
> >>
> >> Signed-off-by: Thomas Gleixner <[email protected]>
> >> ---
> >> mm/slub.c |    4 ++--
> >> 1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> Index: linux-2.6/mm/slub.c
> >> ===================================================================
> >> --- linux-2.6.orig/mm/slub.c
> >> +++ linux-2.6/mm/slub.c
> >> @@ -849,11 +849,11 @@ static inline void slab_free_hook(struct
> >> 		local_irq_save(flags);
> >> 		kmemcheck_slab_free(s, x, s->objsize);
> >> 		debug_check_no_locks_freed(x, s->objsize);
> >> -		if (!(s->flags & SLAB_DEBUG_OBJECTS))
> >> -			debug_check_no_obj_freed(x, s->objsize);
> >> 		local_irq_restore(flags);
> >> 	}
> >> #endif
> >> +	if (!(s->flags & SLAB_DEBUG_OBJECTS))
> >> +		debug_check_no_obj_freed(x, s->objsize);
> >
> > Thanks, this did the trick!
> >
> > Tested-by: Ingo Molnar <[email protected]>
> >
> > With this fix i got the warning below - pinpointing a net/bluetooth/hci_core.c
> > timer bug.
>
> Applied, thanks!
Just an update: all SLAB/SLUB problems appear to be fixed with the latest
patches, so i'm a happy camper!
Thanks Pekka,
Ingo
ok, bad news - the bootcrash has come back - see the crashlog attached below.
Config attached as well.
I reproduced this on upstream 16c29dafcc86 - i.e. all the latest Slab fixes
applied.
Thanks,
Ingo
------------------->
Linux version 2.6.38-08569-g16c29da (mingo@sirius) (gcc version 4.6.0 20110304 (Red Hat 4.6.0-0.12) (GCC) ) #110593 Sat Mar 26 13:46:21 CET 2011
Command line: root=/dev/sda6 earlyprintk=ttyS0,115200 console=ttyS0,115200 debug initcall_debug sysrq_always_enabled ignore_loglevel selinux=0 nmi_watchdog=1 panic=1 3
KERNEL supported cpus:
Intel GenuineIntel
AMD AuthenticAMD
Centaur CentaurHauls
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000003fff0000 (usable)
BIOS-e820: 000000003fff0000 - 000000003fff3000 (ACPI NVS)
BIOS-e820: 000000003fff3000 - 0000000040000000 (ACPI data)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
bootconsole [earlyser0] enabled
debug: ignoring loglevel setting.
NX (Execute Disable) protection: active
DMI 2.3 present.
DMI: System manufacturer System Product Name/A8N-E, BIOS ASUS A8N-E ACPI BIOS Revision 1008 08/22/2005
e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
last_pfn = 0x3fff0 max_arch_pfn = 0x400000000
MTRR default type: uncachable
MTRR fixed ranges enabled:
00000-9FFFF write-back
A0000-BFFFF uncachable
C0000-C7FFF write-protect
C8000-FFFFF uncachable
MTRR variable ranges enabled:
0 base 0000000000 mask FFC0000000 write-back
1 disabled
2 disabled
3 disabled
4 disabled
5 disabled
6 disabled
7 disabled
x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
found SMP MP-table at [ffff8800000f5680] f5680
initial memory mapped : 0 - 20000000
Base memory trampoline at [ffff88000009d000] 9d000 size 8192
init_memory_mapping: 0000000000000000-000000003fff0000
0000000000 - 003fe00000 page 2M
003fe00000 - 003fff0000 page 4k
kernel direct mapping tables up to 3fff0000 @ 3ffed000-3fff0000
[ffffea0000000000-ffffea0000dfffff] PMD -> [ffff88003e600000-ffff88003f3fffff] on node 0
Zone PFN ranges:
DMA 0x00000010 -> 0x00001000
DMA32 0x00001000 -> 0x00100000
Normal empty
Movable zone start PFN for each node
early_node_map[2] active PFN ranges
0: 0x00000010 -> 0x0000009f
0: 0x00000100 -> 0x0003fff0
On node 0 totalpages: 262015
DMA zone: 56 pages used for memmap
DMA zone: 2 pages reserved
DMA zone: 3925 pages, LIFO batch:0
DMA32 zone: 3528 pages used for memmap
DMA32 zone: 254504 pages, LIFO batch:31
SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
Intel MultiProcessor Specification v1.4
MPTABLE: OEM ID: OEM00000
MPTABLE: Product ID: PROD00000000
MPTABLE: APIC at: 0xFEE00000
Processor #0 (Bootup-CPU)
Processor #1
ACPI: NR_CPUS/possible_cpus limit of 1 reached. Processor 1/0x1 ignored.
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
Processors: 1
nr_irqs_gsi: 40
Allocating PCI resources starting at 40000000 (gap: 40000000:a0000000)
Booting paravirtualized kernel on bare hardware
pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
pcpu-alloc: [0] 0
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 258429
Kernel command line: root=/dev/sda6 earlyprintk=ttyS0,115200 console=ttyS0,115200 debug initcall_debug sysrq_always_enabled ignore_loglevel selinux=0 nmi_watchdog=1 panic=1 3
sysrq: sysrq always enabled.
PID hash table entries: 4096 (order: 3, 32768 bytes)
Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
Memory: 1011352k/1048512k available (11595k kernel code, 452k absent, 36708k reserved, 6343k data, 1024k init)
BUG: unable to handle kernel paging request at ffff87ffc1fdd020
IP: [<ffffffff812b50c2>] this_cpu_cmpxchg16b_emu+0x2/0x1c
PGD 0
Oops: 0000 [#1]
last sysfs file:
CPU 0
Pid: 0, comm: swapper Not tainted 2.6.38-08569-g16c29da #110593 System manufacturer System Product Name/A8N-E
RIP: 0010:[<ffffffff812b50c2>] [<ffffffff812b50c2>] this_cpu_cmpxchg16b_emu+0x2/0x1c
RSP: 0000:ffffffff82003e78 EFLAGS: 00010086
RAX: ffff88003f8020c0 RBX: ffff88003f802180 RCX: 0000000000000002
RDX: 0000000000000001 RSI: ffff88003ffc2020 RDI: ffffffff8219be43
RBP: ffffffff82003ed8 R08: 0000000000000000 R09: ffffea0000de40d0
R10: ffffffff821666b0 R11: ffffffff821666b0 R12: ffff88003f8020c0
R13: ffff88003f802000 R14: ffffffffffffffff R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffffffff8201b000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff87ffc1fdd020 CR3: 000000000200b000 CR4: 00000000000006b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff82002000, task ffffffff82015300)
Stack:
0000000000000086 ffffffff810dcd03 ffffffffffffffff 0000000000093e30
ffffffff82003ef8 0000000000000000 00000000000000c0 0000000000000000
ffff88003f800000 ffff88003f8000c0 ffffffffffffffff 0000000000093e30
Call Trace:
[<ffffffff810dcd03>] ? kmem_cache_alloc+0x53/0x130
[<ffffffff8219be43>] kmem_cache_init+0xeb/0x2b0
[<ffffffff821869eb>] start_kernel+0x1cb/0x322
[<ffffffff82186338>] x86_64_start_reservations+0xfe/0x102
[<ffffffff82186140>] ? early_idt_handlers+0x140/0x140
[<ffffffff82186409>] x86_64_start_kernel+0xcd/0xdc
Code: 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89 47 28 48 89 47 30 48 89 47 38 48 8d 7f 40 75 d9 90 c3 90 90 90 90 90 90 90 9c fa
48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e
RIP [<ffffffff812b50c2>] this_cpu_cmpxchg16b_emu+0x2/0x1c
RSP <ffffffff82003e78>
CR2: ffff87ffc1fdd020
---[ end trace a7919e7f17c0a725 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
Pid: 0, comm: swapper Tainted: G D 2.6.38-08569-g16c29da #110593
Call Trace:
[<ffffffff81b23b97>] panic+0x91/0x18c
[<ffffffff81042ce3>] do_exit+0x323/0x340
[<ffffffff81005786>] oops_end+0x96/0xe0
[<ffffffff81b233a5>] no_context+0x141/0x14e
[<ffffffff81b2353d>] __bad_area_nosemaphore+0x18b/0x1ae
[<ffffffff81b23573>] bad_area_nosemaphore+0x13/0x15
[<ffffffff8101f561>] do_page_fault+0x2a1/0x4b0
[<ffffffff810b1703>] ? get_page_from_freelist+0x1a3/0x3d0
[<ffffffff810c3089>] ? pcpu_chunk_slot+0x9/0x50
[<ffffffff81b4f6a5>] page_fault+0x25/0x30
[<ffffffff8219be43>] ? kmem_cache_init+0xeb/0x2b0
[<ffffffff812b50c2>] ? this_cpu_cmpxchg16b_emu+0x2/0x1c
[<ffffffff810dcd03>] ? kmem_cache_alloc+0x53/0x130
[<ffffffff8219be43>] kmem_cache_init+0xeb/0x2b0
[<ffffffff821869eb>] start_kernel+0x1cb/0x322
[<ffffffff82186338>] x86_64_start_reservations+0xfe/0x102
[<ffffffff82186140>] ? early_idt_handlers+0x140/0x140
[<ffffffff82186409>] x86_64_start_kernel+0xcd/0xdc
Rebooting in 1 seconds..
The commit below solves this crash for me. Could we please apply this simple
patch, until the real bug has been found, to keep upstream debuggable? The
eventual fix can then re-enable the lockless allocator.
Thanks,
Ingo
--------------->
From bb764182707216faeb815c3bbdf6a9e9992a459a Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Sat, 26 Mar 2011 12:40:30 +0100
Subject: [PATCH] slub: Disable the lockless allocator
This boot crash:
Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
Memory: 1011352k/1048512k available (11595k kernel code, 452k absent, 36708k reserved, 6343k data, 1024k init)
BUG: unable to handle kernel paging request at ffff87ffc1fdd020
IP: [<ffffffff812b50c2>] this_cpu_cmpxchg16b_emu+0x2/0x1c
PGD 0
Oops: 0000 [#1]
last sysfs file:
CPU 0
Pid: 0, comm: swapper Not tainted 2.6.38-08569-g16c29da #110593 System manufacturer System Product Name/A8N-E
RIP: 0010:[<ffffffff812b50c2>] [<ffffffff812b50c2>] this_cpu_cmpxchg16b_emu+0x2/0x1c
RSP: 0000:ffffffff82003e78 EFLAGS: 00010086
RAX: ffff88003f8020c0 RBX: ffff88003f802180 RCX: 0000000000000002
RDX: 0000000000000001 RSI: ffff88003ffc2020 RDI: ffffffff8219be43
RBP: ffffffff82003ed8 R08: 0000000000000000 R09: ffffea0000de40d0
R10: ffffffff821666b0 R11: ffffffff821666b0 R12: ffff88003f8020c0
R13: ffff88003f802000 R14: ffffffffffffffff R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffffffff8201b000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff87ffc1fdd020 CR3: 000000000200b000 CR4: 00000000000006b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff82002000, task ffffffff82015300)
Stack:
0000000000000086 ffffffff810dcd03 ffffffffffffffff 0000000000093e30
ffffffff82003ef8 0000000000000000 00000000000000c0 0000000000000000
ffff88003f800000 ffff88003f8000c0 ffffffffffffffff 0000000000093e30
Call Trace:
[<ffffffff810dcd03>] ? kmem_cache_alloc+0x53/0x130
[<ffffffff8219be43>] kmem_cache_init+0xeb/0x2b0
[<ffffffff821869eb>] start_kernel+0x1cb/0x322
[<ffffffff82186338>] x86_64_start_reservations+0xfe/0x102
[<ffffffff82186140>] ? early_idt_handlers+0x140/0x140
[<ffffffff82186409>] x86_64_start_kernel+0xcd/0xdc
Code: 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89 47 28 48 89 47 30 48 89 47 38 48 8d 7f 40 75 d9 90 c3 90 90 90 90 90 90 90 9c fa
48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e
RIP [<ffffffff812b50c2>] this_cpu_cmpxchg16b_emu+0x2/0x1c
RSP <ffffffff82003e78>
CR2: ffff87ffc1fdd020
---[ end trace a7919e7f17c0a725 ]---
Occurs due to:
8a5ec0ba42c4: Lockless (and preemptless) fastpaths for slub
Disable this code for now, it's not yet fully cooked.
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/slub.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index f881874..37ccf77 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -30,6 +30,7 @@
#include <trace/events/kmem.h>
+#undef CONFIG_CMPXCHG_LOCAL
/*
* Lock order:
* 1. slab_lock(page)
Le samedi 26 mars 2011 à 12:47 +0100, Ingo Molnar a écrit :
> The commit below solves this crash for me. Could we please apply this simple
> patch, until the real bug has been found, to keep upstream debuggable? The
> eventual fix can then re-enable the lockless allocator.
>
> Thanks,
>
> Ingo
>
> --------------->
> From bb764182707216faeb815c3bbdf6a9e9992a459a Mon Sep 17 00:00:00 2001
> From: Ingo Molnar <[email protected]>
> Date: Sat, 26 Mar 2011 12:40:30 +0100
> Subject: [PATCH] slub: Disable the lockless allocator
>
> This boot crash:
>
> Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
> Memory: 1011352k/1048512k available (11595k kernel code, 452k absent, 36708k reserved, 6343k data, 1024k init)
> BUG: unable to handle kernel paging request at ffff87ffc1fdd020
> IP: [<ffffffff812b50c2>] this_cpu_cmpxchg16b_emu+0x2/0x1c
> PGD 0
> Oops: 0000 [#1]
> last sysfs file:
> CPU 0
> Pid: 0, comm: swapper Not tainted 2.6.38-08569-g16c29da #110593 System manufacturer System Product Name/A8N-E
> RIP: 0010:[<ffffffff812b50c2>] [<ffffffff812b50c2>] this_cpu_cmpxchg16b_emu+0x2/0x1c
> RSP: 0000:ffffffff82003e78 EFLAGS: 00010086
> RAX: ffff88003f8020c0 RBX: ffff88003f802180 RCX: 0000000000000002
> RDX: 0000000000000001 RSI: ffff88003ffc2020 RDI: ffffffff8219be43
Interesting. I am wondering if it's not a per_cpu problem, or another
problem uncovered by the lockless allocator?
Old values of cmpxchg16b() :
RAX = ffff88003f8020c0 , RDX = 1
New values
RBX = ffff88003f802180 , RCX = 2
address : RSI = ffff88003ffc2020
RSI is supposed to be a dynamic percpu addr,
yet this value seems to be well outside the pcpu pools:
CR2: ffff87ffc1fdd020
On my machine, if I add this patch to display cpu_slab values :
diff --git a/mm/slub.c b/mm/slub.c
index f881874..3b118f3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2364,7 +2364,7 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
/* Regular alignment is sufficient */
s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
#endif
-
+ pr_err("cpu_slab=%p\n", s->cpu_slab);
if (!s->cpu_slab)
return 0;
I get following traces :
The first ones are allocated from the static pcpu area (zero-based).
This looks consistent with:
00000000000127c0 D __per_cpu_end
[ 0.000000] cpu_slab=00000000000147c0
[ 0.000000] cpu_slab=00000000000147e0
[ 0.000000] cpu_slab=0000000000014800
[ 0.000000] cpu_slab=0000000000014820
[ 0.000000] cpu_slab=0000000000014840
[ 0.000000] cpu_slab=0000000000014860
[ 0.000000] cpu_slab=0000000000014880
[ 0.000000] cpu_slab=00000000000148a0
[ 0.000000] cpu_slab=00000000000148c0
[ 0.000000] cpu_slab=00000000000148e0
[ 0.000000] cpu_slab=0000000000014900
[ 0.000000] cpu_slab=0000000000014920
[ 0.000000] cpu_slab=0000000000014940
[ 0.000000] cpu_slab=0000000000014960
[ 0.000000] cpu_slab=0000000000014980
[ 0.000000] cpu_slab=00000000000149a0
[ 0.000000] cpu_slab=00000000000149c0
[ 0.000000] cpu_slab=00000000000149e0
[ 0.000000] cpu_slab=0000000000014a00
[ 0.000000] cpu_slab=0000000000014a20
[ 0.000000] cpu_slab=0000000000014a40
[ 0.000000] cpu_slab=0000000000014a60
[ 0.000000] cpu_slab=0000000000014a80
[ 0.000000] cpu_slab=0000000000014aa0
[ 0.000000] cpu_slab=0000000000014ac0
[ 0.000000] cpu_slab=0000000000014ae0
[ 0.000000] cpu_slab=0000000000014b00
[ 0.000000] cpu_slab=0000000000014b20
[ 0.000000] cpu_slab=0000000000014b40
[ 0.000000] cpu_slab=0000000000014c80
[ 0.000000] cpu_slab=0000000000017240
[ 0.000000] cpu_slab=0000000000017260
[ 0.000350] cpu_slab=0000000000017280
[ 0.000461] cpu_slab=00000000000172a0
[ 0.000573] cpu_slab=00000000000172c0
[ 0.000684] cpu_slab=00000000000172e0
[ 0.000797] cpu_slab=0000000000017300
[ 0.000918] cpu_slab=0000000000017320
[ 0.001031] cpu_slab=0000000000017340
[ 0.001141] cpu_slab=0000000000017360
[ 0.001253] cpu_slab=0000000000017380
[ 0.001365] cpu_slab=00000000000173a0
[ 0.001477] cpu_slab=00000000000173c0
[ 0.001589] cpu_slab=00000000000173e0
[ 0.001713] cpu_slab=0000000000017400
[ 0.001833] cpu_slab=0000000000017420
[ 0.001946] cpu_slab=0000000000017440
[ 0.002057] cpu_slab=0000000000017460
[ 0.003548] cpu_slab=0000000000017480
[ 0.004380] cpu_slab=00000000000174a0
[ 0.004508] cpu_slab=00000000000174d0
[ 0.004727] cpu_slab=00000000000174f0
[ 0.005006] cpu_slab=0000000000017550
[ 0.005187] cpu_slab=0000000000017590
[ 0.005345] cpu_slab=00000000000175c0
[ 0.007824] cpu_slab=0000000000017600
[ 0.007935] cpu_slab=0000000000017620
[ 0.008048] cpu_slab=0000000000017640
[ 0.008159] cpu_slab=0000000000017660
[ 0.008271] cpu_slab=0000000000017680
[ 1.496852] cpu_slab=00000000000177b0
[ 1.497347] cpu_slab=0000000000017870
[ 1.497509] cpu_slab=0000000000017890
[ 1.497632] cpu_slab=00000000000178b0
[ 1.497749] cpu_slab=00000000000178d0
[ 1.497868] cpu_slab=0000000000017970
[ 1.510931] cpu_slab=0000000000017a70
[ 1.511000] cpu_slab=0000000000017a90
[ 1.511068] cpu_slab=0000000000017ab0
[ 1.511139] cpu_slab=0000000000017ad0
[ 1.511208] cpu_slab=0000000000017c70
[ 1.511430] cpu_slab=0000000000017c90
[ 1.511543] cpu_slab=0000000000017cb0
[ 1.511659] cpu_slab=0000000000017cd0
[ 1.511816] cpu_slab=0000000000017d70
[ 1.511929] cpu_slab=0000000000017d90
[ 1.538272] cpu_slab=0000000000017db0
[ 1.538384] cpu_slab=0000000000017dd0
[ 1.538500] cpu_slab=0000000000017e70
[ 1.538615] cpu_slab=0000000000017e90
[ 1.538731] cpu_slab=0000000000017eb0
[ 1.538851] cpu_slab=0000000000017ed0
[ 1.555795] cpu_slab=0000000000017f70
[ 1.555908] cpu_slab=0000000000017f90
[ 1.578009] cpu_slab=0000000000017fc0
[ 1.578122] cpu_slab=0000000000017fe0
[ 1.578240] cpu_slab=0000000000018070
[ 1.578352] cpu_slab=0000000000018090
[ 1.578466] cpu_slab=00000000000180b0
[ 1.578685] cpu_slab=00000000000180d0
[ 1.578843] cpu_slab=00000000000181d0
[ 1.579371] cpu_slab=0000000000018270
[ 1.579453] cpu_slab=0000000000018290
[ 1.579531] cpu_slab=00000000000182b0
[ 1.579636] cpu_slab=00000000000182d0
[ 1.580122] cpu_slab=0000000000019370
[ 1.580202] cpu_slab=0000000000019390
[ 1.580482] cpu_slab=0000000000019e20
[ 1.606265] cpu_slab=0000000000019e40
[ 1.606337] cpu_slab=0000000000019e60
[ 1.606405] cpu_slab=0000000000019e80
[ 1.633467] cpu_slab=0000000000019ea0
[ 1.633622] cpu_slab=0000000000019ec0
[ 1.633753] cpu_slab=0000000000019ee0
[ 1.633881] cpu_slab=0000000000019f00
[ 1.634013] cpu_slab=0000000000019f20
[ 1.634139] cpu_slab=0000000000019f40
[ 1.634294] cpu_slab=0000000000019f60
[ 1.634582] cpu_slab=0000000000019f80
[ 1.634918] cpu_slab=0000000000019fc0
[ 1.635051] cpu_slab=0000000000019fe0
After exhaustion of the static pcpu area we switch to vmalloc()-based ones:
[ 1.635175] cpu_slab=000060fee0002070
[ 1.635324] cpu_slab=000060fee0002090
[ 1.635445] cpu_slab=000060fee00020b0
[ 1.635563] cpu_slab=000060fee00020d0
[ 1.635690] cpu_slab=000060fee00020f0
[ 1.635802] cpu_slab=000060fee0002110
[ 1.635951] cpu_slab=000060fee0002140
[ 1.636081] cpu_slab=000060fee0002170
[ 1.636309] cpu_slab=000060fee0002190
[ 1.636882] cpu_slab=000060fee00021d0
[ 1.637257] cpu_slab=000060fee0002270
[ 1.637381] cpu_slab=000060fee0002290
[ 3.264327] cpu_slab=000060fee0002860
[ 3.264465] cpu_slab=000060fee0002880
[ 3.267217] cpu_slab=000060fee0002900
[ 3.294910] cpu_slab=000060fee0002950
[ 3.302653] cpu_slab=000060fee0002bb0
[ 3.302783] cpu_slab=000060fee0002bd0
[ 3.302899] cpu_slab=000060fee0002bf0
[ 3.303033] cpu_slab=000060fee0002c10
[ 3.303153] cpu_slab=000060fee0002c30
[ 3.303268] cpu_slab=000060fee0002c50
[ 3.303539] cpu_slab=000060fee0004f90
[ 3.303675] cpu_slab=000060fee0004fb0
[ 3.303811] cpu_slab=000060fee0005030
[ 3.303939] cpu_slab=000060fee0005050
# grep pcpu_get_vm_areas /proc/vmallocinfo
0xffffe8ff5fc00000-0xffffe8ff5fe00000 2097152 pcpu_get_vm_areas+0x0/0x580 vmalloc
0xffffe8ffffc00000-0xffffe8ffffe00000 2097152 pcpu_get_vm_areas+0x0/0x580 vmalloc
On Sat, Mar 26, 2011 at 4:27 AM, Ingo Molnar <[email protected]> wrote:
>
> ok, bad news - the bootcrash has come back - see the crashlog attached below.
> Config attached as well.
>
> I reproduced this on upstream 16c29dafcc86 - i.e. all the latest Slab fixes
> applied.
>
> BUG: unable to handle kernel paging request at ffff87ffc1fdd020
> IP: [<ffffffff812b50c2>] this_cpu_cmpxchg16b_emu+0x2/0x1c
> ...
> Code: 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89 47 28 48 89 47 30 48 89 47 38 48 8d 7f 40 75 d9 90 c3 90 90 90 90 90 90 90 9c fa
> 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e
Your "Code:" line is buggy, and is missing the first byte of the
faulting instruction (which should have that "<>" around it). But from
the offset, we know it's <65>, and it's the first read of %gs. In
which case it all decodes to the right thing, ie
0: 9c pushfq
1: fa cli
2:* 65 48 3b 06 cmp %gs:(%rsi),%rax <-- trapping instruction
6: 75 14 jne 0x1c
8: 65 48 3b 56 08 cmp %gs:0x8(%rsi),%rdx
d: 75 0d jne 0x1c
f: 65 48 89 1e mov %rbx,%gs:(%rsi)
(Heh, the "trapping instruction" points to the %gs override itself,
which looks odd but is technically not incorrect).
And quite frankly, I don't see how this can have anything to do with
the emulated code. It's doing exactly the same thing as the cmpxchg16b
instruction is, except for the fact that a real cmpxchg16b would have
(a) not done this with interrupts disabled and (b) would have done the
fault as a write-fault.
But neither of those should make any difference what-so-ever. If this
was really about the vmalloc space, arch/x86/mm/fault.c should have
fixed it up. That code is entirely happy fixing up stuff with
interrupts disabled too, and is in fact designed to do so.
So I don't see how this could be about the cmpxchg16b instruction
emulation. We should have gotten pretty much the exact same page
fault even for a real cmpxchg16b instruction.
I wonder if there is something wrong in the percpu allocation, and the
whole new slub thing just ends up causing us to do those allocations
earlier or in a different pattern. The percpu offset is fairly close
to the beginning of a page (offset 0x20).
Tejun, do you see anything suspicious/odd about the percpu allocations
that alloc_kmem_cache_cpus() does? In particular, should the slab
init code use the reserved chunks so that we don't get some kind of
crazy "slab wants to do percpu alloc to initialize, which wants to use
slab to allocate the chunk"?
Linus
On Sat, 26 Mar 2011, Eric Dumazet wrote:
> address : RSI = ffff88003ffc2020
>
> RSI is supposed to be a dynamic percpu addr
> Yet this value seems pretty outside of pcpu pools
>
> CR2: ffff87ffc1fdd020
Right. RSI should be in the range of the values you printed.
However, the determination of RSI is independent of the emulation of the
instruction. If RSI is set wrong then we should see these failures on
machines that do not do the instruction emulation. A kvm run with Ingo's
config should show the same issues. Will do that now.
Indeed I can reproduce it easily.
With a few printk's this shows that alloc_percpu() returns a wrong
address which is the cause of all of this:
logs:
Memory: 1011444k/1048564k available (11622k kernel code, 452k absent,
36668k reserved, 6270k data, 1028k init)
alloc_kmem_cache_cpus kmem_cache_node = ffff88003ffcf000
alloc_kmem_cache_cpus kmem_cache = ffff88003ffcf020
This means that the cpu_slab is not a percpu pointer.
Now the first allocation attempt:
Slab kmem_cache cpu_slab=ffff88003ffcf020 freelist=ffff88003f8020c0
BUG: unable to handle kernel paging request at ffff87ffc1fdc020
IP: [<ffffffff812c3852>] this_cpu_cmpxchg16b_emu+0x2/0x1c
Tejun: What's going on there? I should be getting offsets into the per cpu
area and not kernel addresses.
On Sat, Mar 26, 2011 at 12:18 PM, Christoph Lameter <[email protected]> wrote:
>
> Right. RSI should be in the range of the values you printed.
> However, the determination of RSI is independent of the emulation of the
> instruction. If RSI is set wrong then we should see these failures on
> machines that do not do the instruction emulation. A kvm run with Ingo's
> config should show the same issues. Will do that now.
I bet it's timing-dependent and/or dependent on some code layout
issue. Why else would the totally pointless previous patch have made
any difference for Ingo (the "SLUB: Write to per cpu data when
allocating it" thing seems to be pure voodoo programming)?
That said, this early in the boot I don't think we should have any
parallelism going on, so I don't see what could make those kinds of
random effects.
Linus
On Sat, 26 Mar 2011, Christoph Lameter wrote:
> Tejun: Whats going on there? I should be getting offsets into the per cpu
> area and not kernel addresses.
It's a UP kernel running on a dual Athlon. So it's okay ... Argh.... The
following patch fixes it by using the fallback code for cmpxchg_double:
Subject: per_cpu: Fixup cmpxchg_double for !SMP
cmpxchg_double should only be provided for SMP. In the UP case
the GS register is not defined and the function will fail.
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h 2011-03-26 11:11:19.175244998 -0500
+++ linux-2.6/arch/x86/include/asm/percpu.h 2011-03-26 14:45:29.254088998 -0500
@@ -507,6 +507,7 @@
* it in software. The address used in the cmpxchg16 instruction must be
* aligned to a 16 byte boundary.
*/
+#ifdef CONFIG_SMP
#define percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2) \
({ \
char __ret; \
@@ -529,6 +530,7 @@
#define irqsafe_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2) percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)
#endif
+#endif
/* This is not atomic against other CPUs -- CPU preemption needs to be off */
#define x86_test_and_clear_bit_percpu(bit, var) \
* Christoph Lameter <[email protected]> wrote:
> On Sat, 26 Mar 2011, Christoph Lameter wrote:
>
> > Tejun: Whats going on there? I should be getting offsets into the per cpu
> > area and not kernel addresses.
>
> Its a UP kernel running on dual Athlon. So its okay ... Argh.... The
> following patch fixes it by using the fallback code for cmpxchg_double:
>
>
>
> Subject: per_cpu: Fixup cmpxchg_double for !SMP
>
> cmpxchg_double should only be provided for SMP. In the UP case
> the GS register is not defined and the function will fail.
>
> Signed-off-by: Christoph Lameter <[email protected]>
I.e. the bug got introduced by:
| commit b9ec40af0e18fb7d02106be148036c2ea490fdf9
| Author: Christoph Lameter <[email protected]>
| Date: Mon Feb 28 11:02:24 2011 +0100
|
| percpu, x86: Add arch-specific this_cpu_cmpxchg_double() support
and then the lockless allocator made use of it, right?
Thanks,
Ingo
* Christoph Lameter <[email protected]> wrote:
> On Sat, 26 Mar 2011, Christoph Lameter wrote:
>
> > Tejun: Whats going on there? I should be getting offsets into the per cpu
> > area and not kernel addresses.
>
> Its a UP kernel running on dual Athlon. So its okay ... Argh.... The
> following patch fixes it by using the fallback code for cmpxchg_double:
The fix works here too:
Tested-by: Ingo Molnar <[email protected]>
Thanks,
Ingo
On Sat, 26 Mar 2011, Ingo Molnar wrote:
> > Subject: per_cpu: Fixup cmpxchg_double for !SMP
> >
> > cmpxchg_double should only be provided for SMP. In the UP case
> > the GS register is not defined and the function will fail.
> >
> > Signed-off-by: Christoph Lameter <[email protected]>
>
> I.e. the bug got introduced by:
>
> | commit b9ec40af0e18fb7d02106be148036c2ea490fdf9
> | Author: Christoph Lameter <[email protected]>
> | Date: Mon Feb 28 11:02:24 2011 +0100
> |
> | percpu, x86: Add arch-specific this_cpu_cmpxchg_double() support
>
> and then the lockless allocator made use of it, right?
Correct.
On Sat, Mar 26, 2011 at 12:49 PM, Christoph Lameter <[email protected]> wrote:
> On Sat, 26 Mar 2011, Christoph Lameter wrote:
>
>> Tejun: Whats going on there? I should be getting offsets into the per cpu
>> area and not kernel addresses.
>
> Its a UP kernel running on dual Athlon. So its okay ... Argh.... The
> following patch fixes it by using the fallback code for cmpxchg_double:
Hmm.
Looking closer, I think there are more bugs in that cmpxchg_double thing.
In particular, it doesn't mark memory as changed, so gcc might
miscompile it even on SMP. Also, I think we'd be better off using the
'cmpxchg16b' instruction even on UP, so it's sad to disable it
entirely there.
Wouldn't something like the attached be better?
NOTE! TOTALLY UNTESTED!
Linus
On Sat, 26 Mar 2011, Linus Torvalds wrote:
> Wouldn't something like the attached be better?
At first glance I cannot find fault with it.
But then the same fix must also be used in the asm code or the fallback
(turns out that the fallback is always used in kmem_cache_init since
the instruction patching comes later).
Patch boots fine both in UP and SMP mode
Subject: percpu: Omit segment prefix in the UP case for cmpxchg_double
Omit the segment prefix in the UP case. GS is not used then
and we will generate segfaults if cmpxchg16b is used otherwise.
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
arch/x86/include/asm/percpu.h | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)
Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h 2011-03-26 20:43:03.994089001 -0500
+++ linux-2.6/arch/x86/include/asm/percpu.h 2011-03-26 20:43:22.414089004 -0500
@@ -45,7 +45,7 @@
#include <linux/stringify.h>
#ifdef CONFIG_SMP
-#define __percpu_arg(x) "%%"__stringify(__percpu_seg)":%P" #x
+#define __percpu_prefix "%%"__stringify(__percpu_seg)":"
#define __my_cpu_offset percpu_read(this_cpu_off)
/*
@@ -62,9 +62,11 @@
(typeof(*(ptr)) __kernel __force *)tcp_ptr__; \
})
#else
-#define __percpu_arg(x) "%P" #x
+#define __percpu_prefix ""
#endif
+#define __percpu_arg(x) __percpu_prefix "%P" #x
+
/*
* Initialized pointers to per-cpu variables needed for the boot
* processor need to use these macros to get the proper address
@@ -516,11 +518,11 @@
typeof(o2) __n2 = n2; \
typeof(o2) __dummy; \
alternative_io("call this_cpu_cmpxchg16b_emu\n\t" P6_NOP4, \
- "cmpxchg16b %%gs:(%%rsi)\n\tsetz %0\n\t", \
+ "cmpxchg16b " __percpu_prefix "(%%rsi)\n\tsetz %0\n\t", \
X86_FEATURE_CX16, \
ASM_OUTPUT2("=a"(__ret), "=d"(__dummy)), \
"S" (&pcp1), "b"(__n1), "c"(__n2), \
- "a"(__o1), "d"(__o2)); \
+ "a"(__o1), "d"(__o2) : "memory"); \
__ret; \
})
Index: linux-2.6/arch/x86/lib/cmpxchg16b_emu.S
===================================================================
--- linux-2.6.orig/arch/x86/lib/cmpxchg16b_emu.S 2011-03-26 20:43:57.384089004 -0500
+++ linux-2.6/arch/x86/lib/cmpxchg16b_emu.S 2011-03-26 20:48:42.684088999 -0500
@@ -10,6 +10,12 @@
#include <asm/frame.h>
#include <asm/dwarf2.h>
+#ifdef CONFIG_SMP
+#define SEG_PREFIX %gs:
+#else
+#define SEG_PREFIX
+#endif
+
.text
/*
@@ -37,13 +43,13 @@
pushf
cli
- cmpq %gs:(%rsi), %rax
+ cmpq SEG_PREFIX(%rsi), %rax
jne not_same
- cmpq %gs:8(%rsi), %rdx
+ cmpq SEG_PREFIX 8(%rsi), %rdx
jne not_same
- movq %rbx, %gs:(%rsi)
- movq %rcx, %gs:8(%rsi)
+ movq %rbx, SEG_PREFIX(%rsi)
+ movq %rcx, SEG_PREFIX 8(%rsi)
popf
mov $1, %al
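In plain C, the emulated sequence in the hunk above amounts to the following sketch (portable and illustrative only: the real code disables interrupts around the compare-and-store with pushf/cli ... popf, whereas here atomicity is simply assumed, and the function name is made up):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of what this_cpu_cmpxchg16b_emu implements: compare two
 * adjacent 64-bit words against (o1, o2) and replace both with
 * (n1, n2) only if both match. */
static bool cmpxchg_double_sketch(uint64_t *p,
				  uint64_t o1, uint64_t o2,
				  uint64_t n1, uint64_t n2)
{
	if (p[0] != o1 || p[1] != o2)	/* cmpq ... jne not_same */
		return false;
	p[0] = n1;			/* movq %rbx, (%rsi)  */
	p[1] = n2;			/* movq %rcx, 8(%rsi) */
	return true;			/* mov $1, %al */
}
```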
Le samedi 26 mars 2011 à 20:57 -0500, Christoph Lameter a écrit :
> But then the same fix must also be used in the asm code or the fallback
> (turns out that the fallback is always used in kmem_cache_init since
> the instruction patching comes later).
>
> Patch boots fine both in UP and SMP mode
>
>
>
>
> Subject: percpu: Omit segment prefix in the UP case for cmpxchg_double
>
> Omit the segment prefix in the UP case. GS is not used then
> and we will generate segfaults if cmpxchg16b is used otherwise.
>
> Signed-off-by: Linus Torvalds <[email protected]>
> Signed-off-by: Christoph Lameter <[email protected]>
>
> arch/x86/include/asm/percpu.h | 10 ++++++----
> 1 files changed, 6 insertions(+), 4 deletions(-)
>
> Index: linux-2.6/arch/x86/include/asm/percpu.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/percpu.h 2011-03-26 20:43:03.994089001 -0500
> +++ linux-2.6/arch/x86/include/asm/percpu.h 2011-03-26 20:43:22.414089004 -0500
> @@ -45,7 +45,7 @@
> #include <linux/stringify.h>
>
> #ifdef CONFIG_SMP
> -#define __percpu_arg(x) "%%"__stringify(__percpu_seg)":%P" #x
> +#define __percpu_prefix "%%"__stringify(__percpu_seg)":"
> #define __my_cpu_offset percpu_read(this_cpu_off)
>
> /*
> @@ -62,9 +62,11 @@
> (typeof(*(ptr)) __kernel __force *)tcp_ptr__; \
> })
> #else
> -#define __percpu_arg(x) "%P" #x
> +#define __percpu_prefix ""
> #endif
>
> +#define __percpu_arg(x) __percpu_prefix "%P" #x
> +
> /*
> * Initialized pointers to per-cpu variables needed for the boot
> * processor need to use these macros to get the proper address
> @@ -516,11 +518,11 @@
> typeof(o2) __n2 = n2; \
> typeof(o2) __dummy; \
> alternative_io("call this_cpu_cmpxchg16b_emu\n\t" P6_NOP4, \
I guess you should make P6_NOP4 be P6_NOP3 in !SMP builds.
> - "cmpxchg16b %%gs:(%%rsi)\n\tsetz %0\n\t", \
> + "cmpxchg16b " __percpu_prefix "(%%rsi)\n\tsetz %0\n\t", \
> X86_FEATURE_CX16, \
> ASM_OUTPUT2("=a"(__ret), "=d"(__dummy)), \
> "S" (&pcp1), "b"(__n1), "c"(__n2), \
> - "a"(__o1), "d"(__o2)); \
> + "a"(__o1), "d"(__o2) : "memory"); \
> __ret; \
> })
>
On 03/24/2011 08:47 PM, Christoph Lameter wrote:
> On Thu, 24 Mar 2011, Eric Dumazet wrote:
>
> > > this_cpu_cmpxchg16b_emu:
> > > pushf
> > > cli
> > >
> > > cmpq %gs:(%rsi), %rax
>
> > Random guess
> >
> > Masking interrupts, and accessing vmalloc() based memory for the first
> > time ?
>
> Hmmm.. Could be. KVM would not really disable interrupts, so this may
> explain why the test case works here.
kvm does really disable guest interrupts (it doesn't disable host
interrupts, but these are invisible to the guest).
--
error compiling committee.c: too many arguments to function
On 03/24/2011 09:51 PM, Christoph Lameter wrote:
> On Thu, 24 Mar 2011, Pekka Enberg wrote:
>
> > Thanks, Ingo! Christoph, may I have your sign-off for the patch and
> > I'll send it to Linus?
>
>
> Subject: SLUB: Write to per cpu data when allocating it
>
> It turns out that the cmpxchg16b emulation has to access vmalloced
> percpu memory with interrupts disabled. If the memory has never
> been touched before then the fault necessary to establish the
> mapping will not occur and the kernel will fail on boot.
>
I don't get it. Disabling interrupts doesn't disable faults.
> Fix that by reusing the CONFIG_PREEMPT code that writes the
> cpu number into a field on every cpu. Writing to the per-cpu
> area beforehand causes the mapping to be established before we
> get to the cmpxchg16b emulation.
>
> Tested-by: Ingo Molnar<[email protected]>
> Signed-off-by: Christoph Lameter<[email protected]>
>
> ---
> mm/slub.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2011-03-24 14:03:10.000000000 -0500
> +++ linux-2.6/mm/slub.c 2011-03-24 14:04:08.000000000 -0500
> @@ -1604,7 +1604,7 @@ static inline void note_cmpxchg_failure(
>
> void init_kmem_cache_cpus(struct kmem_cache *s)
> {
> -#if defined(CONFIG_CMPXCHG_LOCAL)&& defined(CONFIG_PREEMPT)
> +#ifdef CONFIG_CMPXCHG_LOCAL
> int cpu;
>
> for_each_possible_cpu(cpu)
> --
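The pre-touch fix quoted above can be sketched in plain C (the names, sizes, and struct layout here are illustrative, not the kernel's actual per-cpu machinery): the write itself is the point, because it forces any lazily established vmalloc mapping to be faulted in while faulting is still safe, before the irq-disabled cmpxchg16b emulation ever touches it:

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4

/* stand-in for the real struct kmem_cache_cpu */
struct kmem_cache_cpu_sketch {
	unsigned long tid;
};

static struct kmem_cache_cpu_sketch cpu_slab_sketch[NR_CPUS_SKETCH];

static void init_kmem_cache_cpus_sketch(void)
{
	/* touch every CPU's copy once with a write; in the kernel
	 * this is what establishes the vmalloc mapping early */
	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		cpu_slab_sketch[cpu].tid = cpu;
}
```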
On 3/27/11 4:57 AM, Christoph Lameter wrote:
> But then the same fix must also be used in the asm code or the fallback
> (turns out that the fallback is always used in kmem_cache_init since
> the instruction patching comes later).
>
> Patch boots fine both in UP and SMP mode
>
>
>
>
> Subject: percpu: Omit segment prefix in the UP case for cmpxchg_double
>
> Omit the segment prefix in the UP case. GS is not used then
> and we will generate segfaults if cmpxchg16b is used otherwise.
>
> Signed-off-by: Linus Torvalds<[email protected]>
> Signed-off-by: Christoph Lameter<[email protected]>
>
> arch/x86/include/asm/percpu.h | 10 ++++++----
> 1 files changed, 6 insertions(+), 4 deletions(-)
>
> Index: linux-2.6/arch/x86/include/asm/percpu.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/percpu.h 2011-03-26 20:43:03.994089001 -0500
> +++ linux-2.6/arch/x86/include/asm/percpu.h 2011-03-26 20:43:22.414089004 -0500
> @@ -45,7 +45,7 @@
> #include<linux/stringify.h>
>
> #ifdef CONFIG_SMP
> -#define __percpu_arg(x) "%%"__stringify(__percpu_seg)":%P" #x
> +#define __percpu_prefix "%%"__stringify(__percpu_seg)":"
> #define __my_cpu_offset percpu_read(this_cpu_off)
>
> /*
> @@ -62,9 +62,11 @@
> (typeof(*(ptr)) __kernel __force *)tcp_ptr__; \
> })
> #else
> -#define __percpu_arg(x) "%P" #x
> +#define __percpu_prefix ""
> #endif
>
> +#define __percpu_arg(x) __percpu_prefix "%P" #x
> +
> /*
> * Initialized pointers to per-cpu variables needed for the boot
> * processor need to use these macros to get the proper address
> @@ -516,11 +518,11 @@
> typeof(o2) __n2 = n2; \
> typeof(o2) __dummy; \
> alternative_io("call this_cpu_cmpxchg16b_emu\n\t" P6_NOP4, \
> - "cmpxchg16b %%gs:(%%rsi)\n\tsetz %0\n\t", \
> + "cmpxchg16b " __percpu_prefix "(%%rsi)\n\tsetz %0\n\t", \
> X86_FEATURE_CX16, \
> ASM_OUTPUT2("=a"(__ret), "=d"(__dummy)), \
> "S" (&pcp1), "b"(__n1), "c"(__n2), \
> - "a"(__o1), "d"(__o2)); \
> + "a"(__o1), "d"(__o2) : "memory"); \
> __ret; \
> })
>
> Index: linux-2.6/arch/x86/lib/cmpxchg16b_emu.S
> ===================================================================
> --- linux-2.6.orig/arch/x86/lib/cmpxchg16b_emu.S 2011-03-26 20:43:57.384089004 -0500
> +++ linux-2.6/arch/x86/lib/cmpxchg16b_emu.S 2011-03-26 20:48:42.684088999 -0500
> @@ -10,6 +10,12 @@
> #include<asm/frame.h>
> #include<asm/dwarf2.h>
>
> +#ifdef CONFIG_SMP
> +#define SEG_PREFIX %gs:
> +#else
> +#define SEG_PREFIX
> +#endif
> +
> .text
>
> /*
> @@ -37,13 +43,13 @@
> pushf
> cli
>
> - cmpq %gs:(%rsi), %rax
> + cmpq SEG_PREFIX(%rsi), %rax
> jne not_same
> - cmpq %gs:8(%rsi), %rdx
> + cmpq SEG_PREFIX 8(%rsi), %rdx
> jne not_same
>
> - movq %rbx, %gs:(%rsi)
> - movq %rcx, %gs:8(%rsi)
> + movq %rbx, SEG_PREFIX(%rsi)
> + movq %rcx, SEG_PREFIX 8(%rsi)
>
> popf
> mov $1, %al
Tejun, does this look good to you as well? I think it should go through
the percpu tree. It's needed to fix a boot crash with lockless SLUB
fastpaths enabled.
Pekka
* Pekka Enberg <[email protected]> wrote:
> On 3/27/11 4:57 AM, Christoph Lameter wrote:
> >But then the same fix must also be used in the asm code or the fallback
> >(turns out that the fallback is always used in kmem_cache_init since
> >the instruction patching comes later).
> >
> >Patch boots fine both in UP and SMP mode
> >
> >
> >
> >
> >Subject: percpu: Omit segment prefix in the UP case for cmpxchg_double
> >
> >Omit the segment prefix in the UP case. GS is not used then
> >and we will generate segfaults if cmpxchg16b is used otherwise.
> >
> >Signed-off-by: Linus Torvalds<[email protected]>
> >Signed-off-by: Christoph Lameter<[email protected]>
> >
> > arch/x86/include/asm/percpu.h | 10 ++++++----
> > 1 files changed, 6 insertions(+), 4 deletions(-)
> >
> >Index: linux-2.6/arch/x86/include/asm/percpu.h
> >===================================================================
> >--- linux-2.6.orig/arch/x86/include/asm/percpu.h 2011-03-26 20:43:03.994089001 -0500
> >+++ linux-2.6/arch/x86/include/asm/percpu.h 2011-03-26 20:43:22.414089004 -0500
> >@@ -45,7 +45,7 @@
> > #include<linux/stringify.h>
> >
> > #ifdef CONFIG_SMP
> >-#define __percpu_arg(x) "%%"__stringify(__percpu_seg)":%P" #x
> >+#define __percpu_prefix "%%"__stringify(__percpu_seg)":"
> > #define __my_cpu_offset percpu_read(this_cpu_off)
> >
> > /*
> >@@ -62,9 +62,11 @@
> > (typeof(*(ptr)) __kernel __force *)tcp_ptr__; \
> > })
> > #else
> >-#define __percpu_arg(x) "%P" #x
> >+#define __percpu_prefix ""
> > #endif
> >
> >+#define __percpu_arg(x) __percpu_prefix "%P" #x
> >+
> > /*
> > * Initialized pointers to per-cpu variables needed for the boot
> > * processor need to use these macros to get the proper address
> >@@ -516,11 +518,11 @@
> > typeof(o2) __n2 = n2; \
> > typeof(o2) __dummy; \
> > alternative_io("call this_cpu_cmpxchg16b_emu\n\t" P6_NOP4, \
> >- "cmpxchg16b %%gs:(%%rsi)\n\tsetz %0\n\t", \
> >+ "cmpxchg16b " __percpu_prefix "(%%rsi)\n\tsetz %0\n\t", \
> > X86_FEATURE_CX16, \
> > ASM_OUTPUT2("=a"(__ret), "=d"(__dummy)), \
> > "S" (&pcp1), "b"(__n1), "c"(__n2), \
> >- "a"(__o1), "d"(__o2)); \
> >+ "a"(__o1), "d"(__o2) : "memory"); \
> > __ret; \
> > })
> >
> >Index: linux-2.6/arch/x86/lib/cmpxchg16b_emu.S
> >===================================================================
> >--- linux-2.6.orig/arch/x86/lib/cmpxchg16b_emu.S 2011-03-26 20:43:57.384089004 -0500
> >+++ linux-2.6/arch/x86/lib/cmpxchg16b_emu.S 2011-03-26 20:48:42.684088999 -0500
> >@@ -10,6 +10,12 @@
> > #include<asm/frame.h>
> > #include<asm/dwarf2.h>
> >
> >+#ifdef CONFIG_SMP
> >+#define SEG_PREFIX %gs:
> >+#else
> >+#define SEG_PREFIX
> >+#endif
> >+
> > .text
> >
> > /*
> >@@ -37,13 +43,13 @@
> > pushf
> > cli
> >
> >- cmpq %gs:(%rsi), %rax
> >+ cmpq SEG_PREFIX(%rsi), %rax
> > jne not_same
> >- cmpq %gs:8(%rsi), %rdx
> >+ cmpq SEG_PREFIX 8(%rsi), %rdx
> > jne not_same
> >
> >- movq %rbx, %gs:(%rsi)
> >- movq %rcx, %gs:8(%rsi)
> >+ movq %rbx, SEG_PREFIX(%rsi)
> >+ movq %rcx, SEG_PREFIX 8(%rsi)
> >
> > popf
> > mov $1, %al
>
> Tejun, does this look good to you as well? I think it should go
> through the percpu tree. It's needed to fix a boot crash with
> lockless SLUB fastpaths enabled.
AFAICS Linus applied it already:
d7c3f8cee81f: percpu: Omit segment prefix in the UP case for cmpxchg_double
Thanks,
Ingo
On Mon, Mar 28, 2011 at 9:19 AM, Ingo Molnar <[email protected]> wrote:
>> Tejun, does this look good to you as well? I think it should go
>> through the percpu tree. It's needed to fix a boot crash with
>> lockless SLUB fastpaths enabled.
>
> AFAICS Linus applied it already:
>
> d7c3f8cee81f: percpu: Omit segment prefix in the UP case for cmpxchg_double
Oh, I missed that. Did you test the patch, Ingo? It's missing
attributions and a reference to the LKML discussion, unfortunately...
* Pekka Enberg <[email protected]> wrote:
> On Mon, Mar 28, 2011 at 9:19 AM, Ingo Molnar <[email protected]> wrote:
> >> Tejun, does this look good to you as well? I think it should go
> >> through the percpu tree. It's needed to fix a boot crash with
> >> lockless SLUB fastpaths enabled.
> >
> > AFAICS Linus applied it already:
> >
> > d7c3f8cee81f: percpu: Omit segment prefix in the UP case for cmpxchg_double
>
> Oh, I missed that. Did you test the patch, Ingo? It's missing attributions
> and reference to the LKML discussion unfortunately...
I think we might still be missing the hunk below - or is it now not needed
anymore?
Thanks,
Ingo
-------------->
>From 53c0eceb7bf64f2a89c59ae4f14a676fa4128462 Mon Sep 17 00:00:00 2001
From: Christoph Lameter <[email protected]>
Date: Sat, 26 Mar 2011 14:49:56 -0500
Subject: [PATCH] per_cpu: Fix cmpxchg_double() for !SMP
cmpxchg_double() should only be provided for SMP. In the UP case
the GS register is not defined and the function will fail.
Signed-off-by: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Eric Dumazet <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/percpu.h | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index a09e1f0..52330a4 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -507,6 +507,7 @@ do { \
* it in software. The address used in the cmpxchg16 instruction must be
* aligned to a 16 byte boundary.
*/
+#ifdef CONFIG_SMP
#define percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2) \
({ \
char __ret; \
@@ -529,6 +530,7 @@ do { \
#define irqsafe_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2) percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)
#endif
+#endif
/* This is not atomic against other CPUs -- CPU preemption needs to be off */
#define x86_test_and_clear_bit_percpu(bit, var) \
Le lundi 28 mars 2011 à 08:36 +0200, Ingo Molnar a écrit :
> I think we might still be missing the hunk below - or is it now not needed
> anymore?
It's not needed anymore once the cmpxchg16b implementation works
correctly on a !SMP build. (It does now, with the current linux-2.6 tree.)
Hello,
Heh, a lot of activity over the weekend.
On Mon, Mar 28, 2011 at 09:12:24AM +0300, Pekka Enberg wrote:
> >@@ -37,13 +43,13 @@
> > pushf
> > cli
> >
> >- cmpq %gs:(%rsi), %rax
> >+ cmpq SEG_PREFIX(%rsi), %rax
> > jne not_same
> >- cmpq %gs:8(%rsi), %rdx
> >+ cmpq SEG_PREFIX 8(%rsi), %rdx
> > jne not_same
> >
> >- movq %rbx, %gs:(%rsi)
> >- movq %rcx, %gs:8(%rsi)
> >+ movq %rbx, SEG_PREFIX(%rsi)
> >+ movq %rcx, SEG_PREFIX 8(%rsi)
> >
> > popf
> > mov $1, %al
>
> Tejun, does this look good to you as well? I think it should go
> through the percpu tree. It's needed to fix a boot crash with
> lockless SLUB fastpaths enabled.
Linus already applied it, so it's all done now. The patch looks okay
to me, although I would like to have SEG_PREFIX defined in
asm/percpu.h instead. Well, we can do that later.
Thanks.
--
tejun
percpu_cmpxchg16b_double() uses alternative_io() and looks like:
e8 .. .. .. .. call this_cpu_cmpxchg16b_emu
X bytes NOPX
or, once patched (if the CPU supports the native instruction), on an SMP build:
65 48 0f c7 0e cmpxchg16b %gs:(%rsi)
0f 94 c0 sete %al
on a !SMP build:
48 0f c7 0e cmpxchg16b (%rsi)
0f 94 c0 sete %al
Therefore, NOPX should be:
P6_NOP3 on SMP
P6_NOP2 on !SMP
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tejun Heo <[email protected]>
---
arch/x86/include/asm/percpu.h | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index d475b43..d68fca6 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -509,6 +509,11 @@ do { \
* it in software. The address used in the cmpxchg16 instruction must be
* aligned to a 16 byte boundary.
*/
+#ifdef CONFIG_SMP
+#define CMPXCHG16B_EMU_CALL "call this_cpu_cmpxchg16b_emu\n\t" P6_NOP3
+#else
+#define CMPXCHG16B_EMU_CALL "call this_cpu_cmpxchg16b_emu\n\t" P6_NOP2
+#endif
#define percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2) \
({ \
char __ret; \
@@ -517,7 +522,7 @@ do { \
typeof(o2) __o2 = o2; \
typeof(o2) __n2 = n2; \
typeof(o2) __dummy; \
- alternative_io("call this_cpu_cmpxchg16b_emu\n\t" P6_NOP4, \
+ alternative_io(CMPXCHG16B_EMU_CALL, \
"cmpxchg16b " __percpu_prefix "(%%rsi)\n\tsetz %0\n\t", \
X86_FEATURE_CX16, \
ASM_OUTPUT2("=a"(__ret), "=d"(__dummy)), \
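As a quick sanity check of Eric's byte counts above, the original sequence (the 5-byte call plus NOP padding) must come out exactly as long as the patched-in replacement (cmpxchg16b plus the 3-byte sete); the helper below just restates that arithmetic:

```c
#include <assert.h>

/* instruction lengths taken from the disassembly in the mail above */
static void check_nop_padding(void)
{
	const int call_rel32  = 5;	/* e8 xx xx xx xx */
	const int cmpxchg_smp = 5;	/* 65 48 0f c7 0e  (%gs: prefix) */
	const int cmpxchg_up  = 4;	/* 48 0f c7 0e */
	const int sete        = 3;	/* 0f 94 c0 */

	/* call + NOP padding must equal the replacement length */
	assert(call_rel32 + 3 == cmpxchg_smp + sete);	/* P6_NOP3, SMP  */
	assert(call_rel32 + 2 == cmpxchg_up  + sete);	/* P6_NOP2, !SMP */
}
```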
On Mon, 28 Mar 2011, Ingo Molnar wrote:
> I think we might still be missing the hunk below - or is it now not needed
> anymore?
It's not needed anymore.
>
> Thanks,
>
> Ingo
>
> -------------->
> >From 53c0eceb7bf64f2a89c59ae4f14a676fa4128462 Mon Sep 17 00:00:00 2001
> From: Christoph Lameter <[email protected]>
> Date: Sat, 26 Mar 2011 14:49:56 -0500
> Subject: [PATCH] per_cpu: Fix cmpxchg_double() for !SMP
>
> cmpxchg_double() should only be provided for SMP. In the UP case
> the GS register is not defined and the function will fail.
>
> Signed-off-by: Christoph Lameter <[email protected]>
> Cc: Pekka Enberg <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Eric Dumazet <[email protected]>
> LKML-Reference: <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> arch/x86/include/asm/percpu.h | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
> index a09e1f0..52330a4 100644
> --- a/arch/x86/include/asm/percpu.h
> +++ b/arch/x86/include/asm/percpu.h
> @@ -507,6 +507,7 @@ do { \
> * it in software. The address used in the cmpxchg16 instruction must be
> * aligned to a 16 byte boundary.
> */
> +#ifdef CONFIG_SMP
> #define percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2) \
> ({ \
> char __ret; \
> @@ -529,6 +530,7 @@ do { \
> #define irqsafe_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2) percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)
>
> #endif
> +#endif
>
> /* This is not atomic against other CPUs -- CPU preemption needs to be off */
> #define x86_test_and_clear_bit_percpu(bit, var) \
>
On Mon, 28 Mar 2011, Eric Dumazet wrote:
> Therefore, NOPX should be :
>
> P6_NOP3 on SMP
> P6_NOP2 on !SMP
Acked-by: Christoph Lameter <[email protected]>
On Mon, Mar 28, 2011 at 08:46:05AM -0500, Christoph Lameter wrote:
> On Mon, 28 Mar 2011, Eric Dumazet wrote:
>
> > Therefore, NOPX should be :
> >
> > P6_NOP3 on SMP
> > P6_NOP2 on !SMP
>
> Acked-by: Christoph Lameter <[email protected]>
Applied to percpu#fixes-2.6.39.
Thanks.
--
tejun