2017-07-26 16:50:27

by Dima Zavin

Subject: [RFC PATCH] mm/slub: fix a deadlock due to incomplete patching of cpusets_enabled()

In codepaths that use the begin/retry interface for reading
mems_allowed_seq with irqs disabled, there exists a race condition that
stalls the patch process after only modifying a subset of the
static_branch call sites.

This problem manifested itself as a deadlock in the slub
allocator, inside get_any_partial. The loop reads the
mems_allowed_seq value (via read_mems_allowed_begin),
performs the defrag operation, and then verifies the consistency
of mems_allowed via read_mems_allowed_retry and the cookie
returned by xxx_begin. The issue here is that both begin and retry
first check if cpusets are enabled via the cpusets_enabled() static branch.
This branch can be rewritten dynamically (via cpuset_inc) if a new
cpuset is created. The x86 jump label code fully synchronizes across
all CPUs for every entry it rewrites. If it rewrites only one of the
callsites (specifically the one in read_mems_allowed_retry) and then
waits for the smp_call_function(do_sync_core) to complete while a CPU is
inside the begin/retry section with IRQs off and the mems_allowed value
is changed, we can hang. This is because begin() will always return 0
(since it wasn't patched yet) while retry() will test the 0 against
the actual value of the seq counter.

The fix is to cache the value that's returned by cpusets_enabled() at the
top of the loop, and only operate on the seqlock (both begin and retry) if
it was true.
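
To make the failure mode concrete, here is a minimal user-space sketch of
the mismatch described above. It is an illustration only, not kernel code:
struct model, model_begin(), model_retry() and model_loop() are hypothetical
names, the two "patched" flags stand in for the state of the two static
branch call sites, and the seqcount is simplified to a plain integer
compare. The iteration cap exists only so the model terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct model {
	unsigned int seq;    /* stands in for current->mems_allowed_seq */
	bool begin_patched;  /* call site in begin() already rewritten? */
	bool retry_patched;  /* call site in retry() already rewritten? */
};

/* An unpatched begin() site still behaves as "cpusets disabled". */
static inline unsigned int model_begin(const struct model *m)
{
	if (!m->begin_patched)
		return 0;
	return m->seq;
}

/* A patched retry() site compares the cookie with the live counter. */
static inline bool model_retry(const struct model *m, unsigned int cookie)
{
	if (!m->retry_patched)
		return false;
	return m->seq != cookie;
}

/*
 * Run the begin/retry loop and return the number of iterations.
 * The cap models "spins forever": with IRQs off the patcher can
 * never finish, so the two flags never change inside the loop.
 */
static inline int model_loop(const struct model *m, int max_iters)
{
	unsigned int cookie;
	int iters = 0;

	do {
		cookie = model_begin(m);
		iters++;
	} while (model_retry(m, cookie) && iters < max_iters);

	return iters;
}
```

With only the retry site patched and a nonzero seq, the loop runs until the
cap. With both sites patched, or neither, it exits after one pass.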

The relevant stack traces of the two stuck threads:

CPU: 107 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
RIP: smp_call_function_many+0x1f9/0x260
Call Trace:
? setup_data_read+0xa0/0xa0
? ___slab_alloc+0x28b/0x5a0
smp_call_function+0x3b/0x70
? setup_data_read+0xa0/0xa0
on_each_cpu+0x2f/0x90
? ___slab_alloc+0x28a/0x5a0
? ___slab_alloc+0x28b/0x5a0
text_poke_bp+0x87/0xd0
? ___slab_alloc+0x28a/0x5a0
arch_jump_label_transform+0x93/0x100
__jump_label_update+0x77/0x90
jump_label_update+0xaa/0xc0
static_key_slow_inc+0x9e/0xb0
cpuset_css_online+0x70/0x2e0
online_css+0x2c/0xa0
cgroup_apply_control_enable+0x27f/0x3d0
cgroup_mkdir+0x2b7/0x420
kernfs_iop_mkdir+0x5a/0x80
vfs_mkdir+0xf6/0x1a0
SyS_mkdir+0xb7/0xe0
entry_SYSCALL_64_fastpath+0x18/0xad

...

CPU: 22 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8818087c0000 task.stack: ffffc90000030000
RIP: int3+0x39/0x70
Call Trace:
<#DB> ? ___slab_alloc+0x28b/0x5a0
<EOE> ? copy_process.part.40+0xf7/0x1de0
? __slab_alloc.isra.80+0x54/0x90
? copy_process.part.40+0xf7/0x1de0
? copy_process.part.40+0xf7/0x1de0
? kmem_cache_alloc_node+0x8a/0x280
? copy_process.part.40+0xf7/0x1de0
? _do_fork+0xe7/0x6c0
? _raw_spin_unlock_irq+0x2d/0x60
? trace_hardirqs_on_caller+0x136/0x1d0
? entry_SYSCALL_64_fastpath+0x5/0xad
? do_syscall_64+0x27/0x350
? SyS_clone+0x19/0x20
? do_syscall_64+0x60/0x350
? entry_SYSCALL64_slow_path+0x25/0x25

Reported-by: Cliff Spradlin <[email protected]>
Signed-off-by: Dima Zavin <[email protected]>
---

We were reproducing the issue here with some regularity on ubuntu 14.04
running v4.9 (v4.9.36 at the time). The patch applies cleanly to 4.12 but
was only compile-tested there.

This is kind of a hacky solution that solves our immediate issue, but it
looks like a more general problem that can affect other unsuspecting users
of these APIs. I suppose an irqs-off seqlock loop that is optimized away via
static_branch rewrites is rare. And, technically, it would actually be ok
except for the all-cpu sync in the x86 jump-label code between each entry
rewrite. I don't know enough about all the implications of changing that,
or anything else in this path, so I went with a targeted "fix" and am
relying on the collective wisdom here to sort out what the correct solution
to the problem should be.

include/linux/cpuset.h | 14 ++++++++++++--
mm/slub.c | 13 +++++++++++--
2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index bfc204e70338..2a0f217413c6 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -111,12 +111,17 @@ extern void cpuset_print_current_mems_allowed(void);
* causing process failure. A retry loop with read_mems_allowed_begin and
* read_mems_allowed_retry prevents these artificial failures.
*/
+static inline unsigned int raw_read_mems_allowed_begin(void)
+{
+ return read_seqcount_begin(&current->mems_allowed_seq);
+}
+
static inline unsigned int read_mems_allowed_begin(void)
{
if (!cpusets_enabled())
return 0;

- return read_seqcount_begin(&current->mems_allowed_seq);
+ return raw_read_mems_allowed_begin();
}

/*
@@ -125,12 +130,17 @@ static inline unsigned int read_mems_allowed_begin(void)
* update of mems_allowed. It is up to the caller to retry the operation if
* appropriate.
*/
+static inline bool raw_read_mems_allowed_retry(unsigned int seq)
+{
+ return read_seqcount_retry(&current->mems_allowed_seq, seq);
+}
+
static inline bool read_mems_allowed_retry(unsigned int seq)
{
if (!cpusets_enabled())
return false;

- return read_seqcount_retry(&current->mems_allowed_seq, seq);
+ return raw_read_mems_allowed_retry(seq);
}

static inline void set_mems_allowed(nodemask_t nodemask)
diff --git a/mm/slub.c b/mm/slub.c
index edc79ca3c6d5..7a6c74851250 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1847,6 +1847,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
enum zone_type high_zoneidx = gfp_zone(flags);
void *object;
unsigned int cpuset_mems_cookie;
+ bool csets_enabled;

/*
* The defrag ratio allows a configuration of the tradeoffs between
@@ -1871,7 +1872,14 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
return NULL;

do {
- cpuset_mems_cookie = read_mems_allowed_begin();
+ if (cpusets_enabled()) {
+ csets_enabled = true;
+ cpuset_mems_cookie = raw_read_mems_allowed_begin();
+ } else {
+ csets_enabled = false;
+ cpuset_mems_cookie = 0;
+ }
+
zonelist = node_zonelist(mempolicy_slab_node(), flags);
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
struct kmem_cache_node *n;
@@ -1893,7 +1901,8 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
}
}
}
- } while (read_mems_allowed_retry(cpuset_mems_cookie));
+ } while (csets_enabled &&
+ raw_read_mems_allowed_retry(cpuset_mems_cookie));
#endif
return NULL;
}
--
2.14.0.rc0.400.g1c36432dff-goog


by Christopher Lameter

Subject: Re: [RFC PATCH] mm/slub: fix a deadlock due to incomplete patching of cpusets_enabled()

On Wed, 26 Jul 2017, Dima Zavin wrote:

> The fix is to cache the value that's returned by cpusets_enabled() at the
> top of the loop, and only operate on the seqlock (both begin and retry) if
> it was true.

I think the proper fix would be to ensure that the calls to
read_mems_allowed_{begin,retry} cannot cause the deadlock. Otherwise you
have to fix this in multiple places.

Maybe read_mems_allowed_* can do some form of synchronization or *_retry
can implicitly rely on the results of cpusets_enabled() by *_begin?

2017-07-26 19:54:48

by Dima Zavin

Subject: Re: [RFC PATCH] mm/slub: fix a deadlock due to incomplete patching of cpusets_enabled()

On Wed, Jul 26, 2017 at 10:02 AM, Christopher Lameter <[email protected]> wrote:
> On Wed, 26 Jul 2017, Dima Zavin wrote:
>
>> The fix is to cache the value that's returned by cpusets_enabled() at the
>> top of the loop, and only operate on the seqlock (both begin and retry) if
>> it was true.
>
> I think the proper fix would be to ensure that the calls to
> read_mems_allowed_{begin,retry} cannot cause the deadlock. Otherwise you
> have to fix this in multiple places.
>
> Maybe read_mems_allowed_* can do some form of synchronization or *_retry
> can implicitly rely on the results of cpusets_enabled() by *_begin?
>

(re-sending because gmail hates me, sorry).

Thanks for the quick reply!

I can turn the cookie into a uint64, put the sequence into the low
order 32 bits and put the enabled state into bit 33 (or 63 :) ). Then
retry will not query cpusets_enabled() and will just look at the
enabled bit. This means that *_retry will always have a conditional
jump (i.e. lose the whole static_branch optimization) but maybe that's
ok since that's pretty rare and the *_begin() will still benefit from
it?
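
A minimal sketch of that packing idea, for illustration only. The names
COOKIE_ENABLED_BIT, cookie_pack(), cookie_enabled(), cookie_seq() and
cookie_retry() are hypothetical, and the live sequence value is passed in
as a plain argument instead of being read from current->mems_allowed_seq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* "Bit 33" when counting from 1: the first bit above the 32-bit seq. */
#define COOKIE_ENABLED_BIT (UINT64_C(1) << 32)

/* begin(): record the enabled state next to the sequence value. */
static inline uint64_t cookie_pack(bool enabled, uint32_t seq)
{
	return enabled ? (COOKIE_ENABLED_BIT | seq) : 0;
}

static inline bool cookie_enabled(uint64_t cookie)
{
	return (cookie & COOKIE_ENABLED_BIT) != 0;
}

static inline uint32_t cookie_seq(uint64_t cookie)
{
	return (uint32_t)cookie;	/* low-order 32 bits */
}

/*
 * retry(): looks only at the cookie, never at the live static key,
 * so begin and retry cannot disagree about whether cpusets are on.
 */
static inline bool cookie_retry(uint64_t cookie, uint32_t current_seq)
{
	if (!cookie_enabled(cookie))
		return false;
	return cookie_seq(cookie) != current_seq;
}
```

The cost noted above is visible in cookie_retry(): the enabled test is an
ordinary conditional on data, not a patched jump.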

2017-07-27 16:46:14

by Dima Zavin

Subject: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()

In codepaths that use the begin/retry interface for reading
mems_allowed_seq with irqs disabled, there exists a race condition that
stalls the patch process after only modifying a subset of the
static_branch call sites.

This problem manifested itself as a deadlock in the slub
allocator, inside get_any_partial. The loop reads the
mems_allowed_seq value (via read_mems_allowed_begin),
performs the defrag operation, and then verifies the consistency
of mems_allowed via read_mems_allowed_retry and the cookie
returned by xxx_begin. The issue here is that both begin and retry
first check if cpusets are enabled via the cpusets_enabled() static branch.
This branch can be rewritten dynamically (via cpuset_inc) if a new
cpuset is created. The x86 jump label code fully synchronizes across
all CPUs for every entry it rewrites. If it rewrites only one of the
callsites (specifically the one in read_mems_allowed_retry) and then
waits for the smp_call_function(do_sync_core) to complete while a CPU is
inside the begin/retry section with IRQs off and the mems_allowed value
is changed, we can hang. This is because begin() will always return 0
(since it wasn't patched yet) while retry() will test the 0 against
the actual value of the seq counter.

The fix is to cache the value that's returned by cpusets_enabled() at the
top of the loop, and only operate on the seqcount (both begin and retry) if
it was true.

The relevant stack traces of the two stuck threads:

CPU: 107 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
RIP: smp_call_function_many+0x1f9/0x260
Call Trace:
? setup_data_read+0xa0/0xa0
? ___slab_alloc+0x28b/0x5a0
smp_call_function+0x3b/0x70
? setup_data_read+0xa0/0xa0
on_each_cpu+0x2f/0x90
? ___slab_alloc+0x28a/0x5a0
? ___slab_alloc+0x28b/0x5a0
text_poke_bp+0x87/0xd0
? ___slab_alloc+0x28a/0x5a0
arch_jump_label_transform+0x93/0x100
__jump_label_update+0x77/0x90
jump_label_update+0xaa/0xc0
static_key_slow_inc+0x9e/0xb0
cpuset_css_online+0x70/0x2e0
online_css+0x2c/0xa0
cgroup_apply_control_enable+0x27f/0x3d0
cgroup_mkdir+0x2b7/0x420
kernfs_iop_mkdir+0x5a/0x80
vfs_mkdir+0xf6/0x1a0
SyS_mkdir+0xb7/0xe0
entry_SYSCALL_64_fastpath+0x18/0xad

...

CPU: 22 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8818087c0000 task.stack: ffffc90000030000
RIP: int3+0x39/0x70
Call Trace:
<#DB> ? ___slab_alloc+0x28b/0x5a0
<EOE> ? copy_process.part.40+0xf7/0x1de0
? __slab_alloc.isra.80+0x54/0x90
? copy_process.part.40+0xf7/0x1de0
? copy_process.part.40+0xf7/0x1de0
? kmem_cache_alloc_node+0x8a/0x280
? copy_process.part.40+0xf7/0x1de0
? _do_fork+0xe7/0x6c0
? _raw_spin_unlock_irq+0x2d/0x60
? trace_hardirqs_on_caller+0x136/0x1d0
? entry_SYSCALL_64_fastpath+0x5/0xad
? do_syscall_64+0x27/0x350
? SyS_clone+0x19/0x20
? do_syscall_64+0x60/0x350
? entry_SYSCALL64_slow_path+0x25/0x25

Reported-by: Cliff Spradlin <[email protected]>
Signed-off-by: Dima Zavin <[email protected]>
---

v2:
- Moved the cached cpusets_enabled() state into the cookie, turned
the cookie into a struct and updated all the other call sites.
- Applied on top of v4.12 since one of the callers in page_alloc.c changed.
Still only tested on v4.9.36 and compile tested against v4.12.

include/linux/cpuset.h | 27 +++++++++++++++++----------
mm/filemap.c | 6 +++---
mm/hugetlb.c | 12 ++++++------
mm/mempolicy.c | 12 ++++++------
mm/page_alloc.c | 8 ++++----
mm/slab.c | 6 +++---
mm/slub.c | 6 +++---
7 files changed, 42 insertions(+), 35 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 119a3f9604b0..f64f6d3b1dce 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -16,6 +16,11 @@
#include <linux/mm.h>
#include <linux/jump_label.h>

+struct cpuset_mems_cookie {
+ unsigned int seq;
+ bool was_enabled;
+};
+
#ifdef CONFIG_CPUSETS

extern struct static_key_false cpusets_enabled_key;
@@ -113,12 +118,15 @@ extern void cpuset_print_current_mems_allowed(void);
* causing process failure. A retry loop with read_mems_allowed_begin and
* read_mems_allowed_retry prevents these artificial failures.
*/
-static inline unsigned int read_mems_allowed_begin(void)
+static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie)
{
- if (!cpusets_enabled())
- return 0;
+ if (!cpusets_enabled()) {
+ cookie->was_enabled = false;
+ return;
+ }

- return read_seqcount_begin(&current->mems_allowed_seq);
+ cookie->was_enabled = true;
+ cookie->seq = read_seqcount_begin(&current->mems_allowed_seq);
}

/*
@@ -127,12 +135,11 @@ static inline unsigned int read_mems_allowed_begin(void)
* update of mems_allowed. It is up to the caller to retry the operation if
* appropriate.
*/
-static inline bool read_mems_allowed_retry(unsigned int seq)
+static inline bool read_mems_allowed_retry(struct cpuset_mems_cookie *cookie)
{
- if (!cpusets_enabled())
+ if (!cookie->was_enabled)
return false;
-
- return read_seqcount_retry(&current->mems_allowed_seq, seq);
+ return read_seqcount_retry(&current->mems_allowed_seq, cookie->seq);
}

static inline void set_mems_allowed(nodemask_t nodemask)
@@ -249,12 +256,12 @@ static inline void set_mems_allowed(nodemask_t nodemask)
{
}

-static inline unsigned int read_mems_allowed_begin(void)
+static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie)
{
return 0;
}

-static inline bool read_mems_allowed_retry(unsigned int seq)
+static inline bool read_mems_allowed_retry(struct cpuset_mems_cookie *cookie)
{
return false;
}
diff --git a/mm/filemap.c b/mm/filemap.c
index 6f1be573a5e6..c0730b377519 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -716,12 +716,12 @@ struct page *__page_cache_alloc(gfp_t gfp)
struct page *page;

if (cpuset_do_page_mem_spread()) {
- unsigned int cpuset_mems_cookie;
+ struct cpuset_mems_cookie cpuset_mems_cookie;
do {
- cpuset_mems_cookie = read_mems_allowed_begin();
+ read_mems_allowed_begin(&cpuset_mems_cookie);
n = cpuset_mem_spread_node();
page = __alloc_pages_node(n, gfp, 0);
- } while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
+ } while (!page && read_mems_allowed_retry(&cpuset_mems_cookie));

return page;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3eedb187e549..1defa44f4fe6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -907,7 +907,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
struct zonelist *zonelist;
struct zone *zone;
struct zoneref *z;
- unsigned int cpuset_mems_cookie;
+ struct cpuset_mems_cookie cpuset_mems_cookie;

/*
* A child process with MAP_PRIVATE mappings created by their parent
@@ -923,7 +923,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
goto err;

retry_cpuset:
- cpuset_mems_cookie = read_mems_allowed_begin();
+ read_mems_allowed_begin(&cpuset_mems_cookie);
zonelist = huge_zonelist(vma, address,
htlb_alloc_mask(h), &mpol, &nodemask);

@@ -945,7 +945,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
}

mpol_cond_put(mpol);
- if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
+ if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie)))
goto retry_cpuset;
return page;

@@ -1511,7 +1511,7 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
{
int order = huge_page_order(h);
gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
- unsigned int cpuset_mems_cookie;
+ struct cpuset_mems_cookie cpuset_mems_cookie;

/*
* We need a VMA to get a memory policy. If we do not
@@ -1548,13 +1548,13 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
struct zonelist *zl;
nodemask_t *nodemask;

- cpuset_mems_cookie = read_mems_allowed_begin();
+ read_mems_allowed_begin(&cpuset_mems_cookie);
zl = huge_zonelist(vma, addr, gfp, &mpol, &nodemask);
mpol_cond_put(mpol);
page = __alloc_pages_nodemask(gfp, order, zl, nodemask);
if (page)
return page;
- } while (read_mems_allowed_retry(cpuset_mems_cookie));
+ } while (read_mems_allowed_retry(&cpuset_mems_cookie));

return NULL;
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 37d0b334bfe9..b4f2513a2296 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1971,13 +1971,13 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
{
struct mempolicy *pol;
struct page *page;
- unsigned int cpuset_mems_cookie;
+ struct cpuset_mems_cookie cpuset_mems_cookie;
struct zonelist *zl;
nodemask_t *nmask;

retry_cpuset:
pol = get_vma_policy(vma, addr);
- cpuset_mems_cookie = read_mems_allowed_begin();
+ read_mems_allowed_begin(&cpuset_mems_cookie);

if (pol->mode == MPOL_INTERLEAVE) {
unsigned nid;
@@ -2019,7 +2019,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
page = __alloc_pages_nodemask(gfp, order, zl, nmask);
mpol_cond_put(pol);
out:
- if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
+ if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie)))
goto retry_cpuset;
return page;
}
@@ -2047,13 +2047,13 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
{
struct mempolicy *pol = &default_policy;
struct page *page;
- unsigned int cpuset_mems_cookie;
+ struct cpuset_mems_cookie cpuset_mems_cookie;

if (!in_interrupt() && !(gfp & __GFP_THISNODE))
pol = get_task_policy(current);

retry_cpuset:
- cpuset_mems_cookie = read_mems_allowed_begin();
+ read_mems_allowed_begin(&cpuset_mems_cookie);

/*
* No reference counting needed for current->mempolicy
@@ -2066,7 +2066,7 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
policy_zonelist(gfp, pol, numa_node_id()),
policy_nodemask(gfp, pol));

- if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
+ if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie)))
goto retry_cpuset;

return page;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2302f250d6b1..36cd4e95fb38 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3688,7 +3688,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
int no_progress_loops;
unsigned long alloc_start = jiffies;
unsigned int stall_timeout = 10 * HZ;
- unsigned int cpuset_mems_cookie;
+ struct cpuset_mems_cookie cpuset_mems_cookie;

/*
* In the slowpath, we sanity check order to avoid ever trying to
@@ -3713,7 +3713,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
compaction_retries = 0;
no_progress_loops = 0;
compact_priority = DEF_COMPACT_PRIORITY;
- cpuset_mems_cookie = read_mems_allowed_begin();
+ read_mems_allowed_begin(&cpuset_mems_cookie);

/*
* The fast path uses conservative alloc_flags to succeed only until
@@ -3872,7 +3872,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* It's possible we raced with cpuset update so the OOM would be
* premature (see below the nopage: label for full explanation).
*/
- if (read_mems_allowed_retry(cpuset_mems_cookie))
+ if (read_mems_allowed_retry(&cpuset_mems_cookie))
goto retry_cpuset;

/* Reclaim has failed us, start killing things */
@@ -3900,7 +3900,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* to fail, check if the cpuset changed during allocation and if so,
* retry.
*/
- if (read_mems_allowed_retry(cpuset_mems_cookie))
+ if (read_mems_allowed_retry(&cpuset_mems_cookie))
goto retry_cpuset;

/*
diff --git a/mm/slab.c b/mm/slab.c
index 2a31ee3c5814..391fe9d9d24e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3195,13 +3195,13 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
void *obj = NULL;
struct page *page;
int nid;
- unsigned int cpuset_mems_cookie;
+ struct cpuset_mems_cookie cpuset_mems_cookie;

if (flags & __GFP_THISNODE)
return NULL;

retry_cpuset:
- cpuset_mems_cookie = read_mems_allowed_begin();
+ read_mems_allowed_begin(&cpuset_mems_cookie);
zonelist = node_zonelist(mempolicy_slab_node(), flags);

retry:
@@ -3245,7 +3245,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
}
}

- if (unlikely(!obj && read_mems_allowed_retry(cpuset_mems_cookie)))
+ if (unlikely(!obj && read_mems_allowed_retry(&cpuset_mems_cookie)))
goto retry_cpuset;
return obj;
}
diff --git a/mm/slub.c b/mm/slub.c
index 8addc535bcdc..55c4862852ec 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1849,7 +1849,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
struct zone *zone;
enum zone_type high_zoneidx = gfp_zone(flags);
void *object;
- unsigned int cpuset_mems_cookie;
+ struct cpuset_mems_cookie cpuset_mems_cookie;

/*
* The defrag ratio allows a configuration of the tradeoffs between
@@ -1874,7 +1874,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
return NULL;

do {
- cpuset_mems_cookie = read_mems_allowed_begin();
+ read_mems_allowed_begin(&cpuset_mems_cookie);
zonelist = node_zonelist(mempolicy_slab_node(), flags);
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
struct kmem_cache_node *n;
@@ -1896,7 +1896,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
}
}
}
- } while (read_mems_allowed_retry(cpuset_mems_cookie));
+ } while (read_mems_allowed_retry(&cpuset_mems_cookie));
#endif
return NULL;
}
--
2.14.0.rc0.400.g1c36432dff-goog

2017-07-27 19:49:00

by Andrew Morton

Subject: Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()

On Thu, 27 Jul 2017 09:46:08 -0700 Dima Zavin <[email protected]> wrote:

> In codepaths that use the begin/retry interface for reading
> mems_allowed_seq with irqs disabled, there exists a race condition that
> stalls the patch process after only modifying a subset of the
> static_branch call sites.
>
> This problem manifested itself as a deadlock in the slub
> allocator, inside get_any_partial. The loop reads the
> mems_allowed_seq value (via read_mems_allowed_begin),
> performs the defrag operation, and then verifies the consistency
> of mems_allowed via read_mems_allowed_retry and the cookie
> returned by xxx_begin. The issue here is that both begin and retry
> first check if cpusets are enabled via the cpusets_enabled() static branch.
> This branch can be rewritten dynamically (via cpuset_inc) if a new
> cpuset is created. The x86 jump label code fully synchronizes across
> all CPUs for every entry it rewrites. If it rewrites only one of the
> callsites (specifically the one in read_mems_allowed_retry) and then
> waits for the smp_call_function(do_sync_core) to complete while a CPU is
> inside the begin/retry section with IRQs off and the mems_allowed value
> is changed, we can hang. This is because begin() will always return 0
> (since it wasn't patched yet) while retry() will test the 0 against
> the actual value of the seq counter.
>
> The fix is to cache the value that's returned by cpusets_enabled() at the
> top of the loop, and only operate on the seqcount (both begin and retry) if
> it was true.

Tricky. Hence we should have a nice code comment somewhere describing
all of this.

> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -16,6 +16,11 @@
> #include <linux/mm.h>
> #include <linux/jump_label.h>
>
> +struct cpuset_mems_cookie {
> + unsigned int seq;
> + bool was_enabled;
> +};

At cpuset_mems_cookie would be a good site - why it exists, what it
does, when it is used and how.
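
One possible shape for such a comment, offered as a sketch only and not as
the wording that was eventually written. The struct matches the quoted
patch; demo_begin() and demo_retry() are hypothetical user-space stand-ins
that take the key state and the current sequence value as arguments so the
behaviour can be exercised outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * struct cpuset_mems_cookie - snapshot taken by read_mems_allowed_begin()
 * @seq:         mems_allowed_seq value sampled at begin time; only
 *               meaningful when @was_enabled is true
 * @was_enabled: what cpusets_enabled() returned at begin time
 *
 * Why: the static branch behind cpusets_enabled() is patched one call
 * site at a time. If retry re-evaluated it, retry could see "enabled"
 * while the matching begin saw "disabled" and never sampled @seq.
 * When: filled in by begin, consumed by the paired retry.
 * How: retry trusts @was_enabled and never re-checks the live key.
 */
struct cpuset_mems_cookie {
	unsigned int seq;
	bool was_enabled;
};

/* Hypothetical stand-in for read_mems_allowed_begin(). */
static inline void demo_begin(struct cpuset_mems_cookie *cookie,
			      bool key_enabled, unsigned int cur_seq)
{
	cookie->was_enabled = key_enabled;
	cookie->seq = key_enabled ? cur_seq : 0;
}

/* Hypothetical stand-in for read_mems_allowed_retry(). */
static inline bool demo_retry(const struct cpuset_mems_cookie *cookie,
			      unsigned int cur_seq)
{
	if (!cookie->was_enabled)
		return false;
	return cookie->seq != cur_seq;
}
```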


2017-07-27 19:51:15

by Andrew Morton

Subject: Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()

On Thu, 27 Jul 2017 09:46:08 -0700 Dima Zavin <[email protected]> wrote:

> - Applied on top of v4.12 since one of the callers in page_alloc.c changed.
> Still only tested on v4.9.36 and compile tested against v4.12.

That's a problem - this doesn't come close to applying on current
mainline. I can fix that I guess, but the result should be tested
well.

2017-07-27 21:42:15

by Dima Zavin

Subject: Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()

On Thu, Jul 27, 2017 at 12:48 PM, Andrew Morton
<[email protected]> wrote:
> On Thu, 27 Jul 2017 09:46:08 -0700 Dima Zavin <[email protected]> wrote:
>
>> In codepaths that use the begin/retry interface for reading
>> mems_allowed_seq with irqs disabled, there exists a race condition that
>> stalls the patch process after only modifying a subset of the
>> static_branch call sites.
>>
>> This problem manifested itself as a deadlock in the slub
>> allocator, inside get_any_partial. The loop reads the
>> mems_allowed_seq value (via read_mems_allowed_begin),
>> performs the defrag operation, and then verifies the consistency
>> of mems_allowed via read_mems_allowed_retry and the cookie
>> returned by xxx_begin. The issue here is that both begin and retry
>> first check if cpusets are enabled via the cpusets_enabled() static branch.
>> This branch can be rewritten dynamically (via cpuset_inc) if a new
>> cpuset is created. The x86 jump label code fully synchronizes across
>> all CPUs for every entry it rewrites. If it rewrites only one of the
>> callsites (specifically the one in read_mems_allowed_retry) and then
>> waits for the smp_call_function(do_sync_core) to complete while a CPU is
>> inside the begin/retry section with IRQs off and the mems_allowed value
>> is changed, we can hang. This is because begin() will always return 0
>> (since it wasn't patched yet) while retry() will test the 0 against
>> the actual value of the seq counter.
>>
>> The fix is to cache the value that's returned by cpusets_enabled() at the
>> top of the loop, and only operate on the seqcount (both begin and retry) if
>> it was true.
>
> Tricky. Hence we should have a nice code comment somewhere describing
> all of this.
>
>> --- a/include/linux/cpuset.h
>> +++ b/include/linux/cpuset.h
>> @@ -16,6 +16,11 @@
>> #include <linux/mm.h>
>> #include <linux/jump_label.h>
>>
>> +struct cpuset_mems_cookie {
>> + unsigned int seq;
>> + bool was_enabled;
>> +};
>
> At cpuset_mems_cookie would be a good site - why it exists, what it
> does, when it is used and how.

Will do. I actually had a comment here but removed it in favor of the
commit message :) Will put it back.

2017-07-27 21:42:22

by Dima Zavin

Subject: Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()

On Thu, Jul 27, 2017 at 12:51 PM, Andrew Morton
<[email protected]> wrote:
> On Thu, 27 Jul 2017 09:46:08 -0700 Dima Zavin <[email protected]> wrote:
>
>> - Applied on top of v4.12 since one of the callers in page_alloc.c changed.
>> Still only tested on v4.9.36 and compile tested against v4.12.
>
> That's a problem - this doesn't come close to applying on current
> mainline. I can fix that I guess, but the result should be tested
> well.
>

I'll fix up for latest, and see if I can test it. I should be able to
boot vanilla with not too much trouble. May take a few days.

2017-07-28 07:45:21

by Vlastimil Babka

Subject: Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()

[+CC PeterZ]

On 07/27/2017 06:46 PM, Dima Zavin wrote:
> In codepaths that use the begin/retry interface for reading
> mems_allowed_seq with irqs disabled, there exists a race condition that
> stalls the patch process after only modifying a subset of the
> static_branch call sites.
>
> This problem manifested itself as a deadlock in the slub
> allocator, inside get_any_partial. The loop reads the
> mems_allowed_seq value (via read_mems_allowed_begin),
> performs the defrag operation, and then verifies the consistency
> of mems_allowed via read_mems_allowed_retry and the cookie
> returned by xxx_begin. The issue here is that both begin and retry
> first check if cpusets are enabled via the cpusets_enabled() static branch.
> This branch can be rewritten dynamically (via cpuset_inc) if a new
> cpuset is created. The x86 jump label code fully synchronizes across
> all CPUs for every entry it rewrites. If it rewrites only one of the
> callsites (specifically the one in read_mems_allowed_retry) and then
> waits for the smp_call_function(do_sync_core) to complete while a CPU is
> inside the begin/retry section with IRQs off and the mems_allowed value
> is changed, we can hang. This is because begin() will always return 0
> (since it wasn't patched yet) while retry() will test the 0 against
> the actual value of the seq counter.

Hm, I wonder if there are other static branch users potentially having a
similar problem. Then it would be best to fix this at the static branch
level. Any idea, Peter? An inelegant solution would be to have
static_branch_(un)likely() callsites indicate an ordering for the patching.
I.e. here we would make sure that read_mems_allowed_begin() callsites are
patched before read_mems_allowed_retry() when enabling the static key,
and in the opposite order when disabling the static key.
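
The ordering rule can be sketched as a small user-space model. Nothing
here exists in the jump label code: struct patch_site, the site_role enum
and the helpers are hypothetical, and "consistent" is defined as "no retry
site is live while some begin site is not", which is the state that
produced the hang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum site_role { SITE_BEGIN, SITE_RETRY };

struct patch_site {
	enum site_role role;
	bool enabled;
};

/* Dangerous state: a live retry site while a begin site is still off. */
static inline bool sites_consistent(const struct patch_site *s, size_t n)
{
	bool begin_off = false, retry_on = false;

	for (size_t i = 0; i < n; i++) {
		if (s[i].role == SITE_BEGIN && !s[i].enabled)
			begin_off = true;
		if (s[i].role == SITE_RETRY && s[i].enabled)
			retry_on = true;
	}
	return !(begin_off && retry_on);
}

/* Patch every site of one role, checking the invariant after each. */
static inline bool patch_role(struct patch_site *s, size_t n,
			      enum site_role role, bool enable)
{
	for (size_t i = 0; i < n; i++) {
		if (s[i].role != role)
			continue;
		s[i].enabled = enable;
		if (!sites_consistent(s, n))
			return false;
	}
	return true;
}

/* Begin sites first when enabling, retry sites first when disabling. */
static inline bool ordered_update(struct patch_site *s, size_t n, bool enable)
{
	enum site_role first = enable ? SITE_BEGIN : SITE_RETRY;
	enum site_role second = enable ? SITE_RETRY : SITE_BEGIN;

	return patch_role(s, n, first, enable) &&
	       patch_role(s, n, second, enable);
}

/* Patching in table order, with no regard for role. */
static inline bool array_order_update(struct patch_site *s, size_t n,
				      bool enable)
{
	for (size_t i = 0; i < n; i++) {
		s[i].enabled = enable;
		if (!sites_consistent(s, n))
			return false;
	}
	return true;
}
```

In this model a table that lists a retry site before a begin site breaks
the invariant under array_order_update() but not under ordered_update().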

> The fix is to cache the value that's returned by cpusets_enabled() at the
> top of the loop, and only operate on the seqcount (both begin and retry) if
> it was true.

Maybe we could just return e.g. -1 in read_mems_allowed_begin() when
cpusets are disabled, and test it in read_mems_allowed_retry() before
doing a proper seqcount retry check? Also I think you can still do the
cpusets_enabled() check in read_mems_allowed_retry() before the
was_enabled (or cookie == -1) test?
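
A sketch of the sentinel variant, for illustration only. The names
MEMS_COOKIE_DISABLED, sentinel_begin() and sentinel_retry() are
hypothetical, and the key state and sequence value are passed in as
arguments. It assumes read_seqcount_begin() only ever returns even values,
so an all-ones cookie should not collide with a real sequence number; that
assumption is worth checking against the seqlock code.

```c
#include <assert.h>
#include <stdbool.h>

/* The "-1" cookie: begin ran while the static key was still off. */
#define MEMS_COOKIE_DISABLED (~0u)

static inline unsigned int sentinel_begin(bool key_enabled,
					  unsigned int cur_seq)
{
	if (!key_enabled)
		return MEMS_COOKIE_DISABLED;
	return cur_seq;
}

static inline bool sentinel_retry(bool key_enabled, unsigned int cookie,
				  unsigned int cur_seq)
{
	/* The static branch check stays first, keeping the fast path. */
	if (!key_enabled)
		return false;
	/* Key flipped on after begin: there is nothing to compare with. */
	if (cookie == MEMS_COOKIE_DISABLED)
		return false;
	return cookie != cur_seq;
}
```

The cookie stays an unsigned int here, so callers would not need to change
type, and the extra compare only runs once cpusets are enabled.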

> The relevant stack traces of the two stuck threads:
>
> CPU: 107 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4
> Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
> task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
> RIP: smp_call_function_many+0x1f9/0x260
> Call Trace:
> ? setup_data_read+0xa0/0xa0
> ? ___slab_alloc+0x28b/0x5a0
> smp_call_function+0x3b/0x70
> ? setup_data_read+0xa0/0xa0
> on_each_cpu+0x2f/0x90
> ? ___slab_alloc+0x28a/0x5a0
> ? ___slab_alloc+0x28b/0x5a0
> text_poke_bp+0x87/0xd0
> ? ___slab_alloc+0x28a/0x5a0
> arch_jump_label_transform+0x93/0x100
> __jump_label_update+0x77/0x90
> jump_label_update+0xaa/0xc0
> static_key_slow_inc+0x9e/0xb0
> cpuset_css_online+0x70/0x2e0
> online_css+0x2c/0xa0
> cgroup_apply_control_enable+0x27f/0x3d0
> cgroup_mkdir+0x2b7/0x420
> kernfs_iop_mkdir+0x5a/0x80
> vfs_mkdir+0xf6/0x1a0
> SyS_mkdir+0xb7/0xe0
> entry_SYSCALL_64_fastpath+0x18/0xad
>
> ...
>
> CPU: 22 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4
> Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
> task: ffff8818087c0000 task.stack: ffffc90000030000
> RIP: int3+0x39/0x70
> Call Trace:
> <#DB> ? ___slab_alloc+0x28b/0x5a0
> <EOE> ? copy_process.part.40+0xf7/0x1de0
> ? __slab_alloc.isra.80+0x54/0x90
> ? copy_process.part.40+0xf7/0x1de0
> ? copy_process.part.40+0xf7/0x1de0
> ? kmem_cache_alloc_node+0x8a/0x280
> ? copy_process.part.40+0xf7/0x1de0
> ? _do_fork+0xe7/0x6c0
> ? _raw_spin_unlock_irq+0x2d/0x60
> ? trace_hardirqs_on_caller+0x136/0x1d0
> ? entry_SYSCALL_64_fastpath+0x5/0xad
> ? do_syscall_64+0x27/0x350
> ? SyS_clone+0x19/0x20
> ? do_syscall_64+0x60/0x350
> ? entry_SYSCALL64_slow_path+0x25/0x25
>
> Reported-by: Cliff Spradlin <[email protected]>
> Signed-off-by: Dima Zavin <[email protected]>
> ---
>
> v2:
> - Moved the cached cpusets_enabled() state into the cookie, turned
> the cookie into a struct and updated all the other call sites.
> - Applied on top of v4.12 since one of the callers in page_alloc.c changed.
> Still only tested on v4.9.36 and compile tested against v4.12.
>
> include/linux/cpuset.h | 27 +++++++++++++++++----------
> mm/filemap.c | 6 +++---
> mm/hugetlb.c | 12 ++++++------
> mm/mempolicy.c | 12 ++++++------
> mm/page_alloc.c | 8 ++++----
> mm/slab.c | 6 +++---
> mm/slub.c | 6 +++---
> 7 files changed, 42 insertions(+), 35 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 119a3f9604b0..f64f6d3b1dce 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -16,6 +16,11 @@
> #include <linux/mm.h>
> #include <linux/jump_label.h>
>
> +struct cpuset_mems_cookie {
> + unsigned int seq;
> + bool was_enabled;
> +};
> +
> #ifdef CONFIG_CPUSETS
>
> extern struct static_key_false cpusets_enabled_key;
> @@ -113,12 +118,15 @@ extern void cpuset_print_current_mems_allowed(void);
> * causing process failure. A retry loop with read_mems_allowed_begin and
> * read_mems_allowed_retry prevents these artificial failures.
> */
> -static inline unsigned int read_mems_allowed_begin(void)
> +static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie)
> {
> - if (!cpusets_enabled())
> - return 0;
> + if (!cpusets_enabled()) {
> + cookie->was_enabled = false;
> + return;
> + }
>
> - return read_seqcount_begin(&current->mems_allowed_seq);
> + cookie->was_enabled = true;
> + cookie->seq = read_seqcount_begin(&current->mems_allowed_seq);
> }
>
> /*
> @@ -127,12 +135,11 @@ static inline unsigned int read_mems_allowed_begin(void)
> * update of mems_allowed. It is up to the caller to retry the operation if
> * appropriate.
> */
> -static inline bool read_mems_allowed_retry(unsigned int seq)
> +static inline bool read_mems_allowed_retry(struct cpuset_mems_cookie *cookie)
> {
> - if (!cpusets_enabled())
> + if (!cookie->was_enabled)
> return false;
> -
> - return read_seqcount_retry(&current->mems_allowed_seq, seq);
> + return read_seqcount_retry(&current->mems_allowed_seq, cookie->seq);
> }
>
> static inline void set_mems_allowed(nodemask_t nodemask)
> @@ -249,12 +256,12 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> {
> }
>
> -static inline unsigned int read_mems_allowed_begin(void)
> +static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie)
> {
> return 0;
> }
>
> -static inline bool read_mems_allowed_retry(unsigned int seq)
> +static inline bool read_mems_allowed_retry(struct cpuset_mems_cookie *cookie)
> {
> return false;
> }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6f1be573a5e6..c0730b377519 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -716,12 +716,12 @@ struct page *__page_cache_alloc(gfp_t gfp)
> struct page *page;
>
> if (cpuset_do_page_mem_spread()) {
> - unsigned int cpuset_mems_cookie;
> + struct cpuset_mems_cookie cpuset_mems_cookie;
> do {
> - cpuset_mems_cookie = read_mems_allowed_begin();
> + read_mems_allowed_begin(&cpuset_mems_cookie);
> n = cpuset_mem_spread_node();
> page = __alloc_pages_node(n, gfp, 0);
> - } while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
> + } while (!page && read_mems_allowed_retry(&cpuset_mems_cookie));
>
> return page;
> }
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3eedb187e549..1defa44f4fe6 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -907,7 +907,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> struct zonelist *zonelist;
> struct zone *zone;
> struct zoneref *z;
> - unsigned int cpuset_mems_cookie;
> + struct cpuset_mems_cookie cpuset_mems_cookie;
>
> /*
> * A child process with MAP_PRIVATE mappings created by their parent
> @@ -923,7 +923,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> goto err;
>
> retry_cpuset:
> - cpuset_mems_cookie = read_mems_allowed_begin();
> + read_mems_allowed_begin(&cpuset_mems_cookie);
> zonelist = huge_zonelist(vma, address,
> htlb_alloc_mask(h), &mpol, &nodemask);
>
> @@ -945,7 +945,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> }
>
> mpol_cond_put(mpol);
> - if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
> + if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie)))
> goto retry_cpuset;
> return page;
>
> @@ -1511,7 +1511,7 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
> {
> int order = huge_page_order(h);
> gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
> - unsigned int cpuset_mems_cookie;
> + struct cpuset_mems_cookie cpuset_mems_cookie;
>
> /*
> * We need a VMA to get a memory policy. If we do not
> @@ -1548,13 +1548,13 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
> struct zonelist *zl;
> nodemask_t *nodemask;
>
> - cpuset_mems_cookie = read_mems_allowed_begin();
> + read_mems_allowed_begin(&cpuset_mems_cookie);
> zl = huge_zonelist(vma, addr, gfp, &mpol, &nodemask);
> mpol_cond_put(mpol);
> page = __alloc_pages_nodemask(gfp, order, zl, nodemask);
> if (page)
> return page;
> - } while (read_mems_allowed_retry(cpuset_mems_cookie));
> + } while (read_mems_allowed_retry(&cpuset_mems_cookie));
>
> return NULL;
> }
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 37d0b334bfe9..b4f2513a2296 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1971,13 +1971,13 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> {
> struct mempolicy *pol;
> struct page *page;
> - unsigned int cpuset_mems_cookie;
> + struct cpuset_mems_cookie cpuset_mems_cookie;
> struct zonelist *zl;
> nodemask_t *nmask;
>
> retry_cpuset:
> pol = get_vma_policy(vma, addr);
> - cpuset_mems_cookie = read_mems_allowed_begin();
> + read_mems_allowed_begin(&cpuset_mems_cookie);
>
> if (pol->mode == MPOL_INTERLEAVE) {
> unsigned nid;
> @@ -2019,7 +2019,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> page = __alloc_pages_nodemask(gfp, order, zl, nmask);
> mpol_cond_put(pol);
> out:
> - if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
> + if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie)))
> goto retry_cpuset;
> return page;
> }
> @@ -2047,13 +2047,13 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
> {
> struct mempolicy *pol = &default_policy;
> struct page *page;
> - unsigned int cpuset_mems_cookie;
> + struct cpuset_mems_cookie cpuset_mems_cookie;
>
> if (!in_interrupt() && !(gfp & __GFP_THISNODE))
> pol = get_task_policy(current);
>
> retry_cpuset:
> - cpuset_mems_cookie = read_mems_allowed_begin();
> + read_mems_allowed_begin(&cpuset_mems_cookie);
>
> /*
> * No reference counting needed for current->mempolicy
> @@ -2066,7 +2066,7 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
> policy_zonelist(gfp, pol, numa_node_id()),
> policy_nodemask(gfp, pol));
>
> - if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
> + if (unlikely(!page && read_mems_allowed_retry(&cpuset_mems_cookie)))
> goto retry_cpuset;
>
> return page;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2302f250d6b1..36cd4e95fb38 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3688,7 +3688,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> int no_progress_loops;
> unsigned long alloc_start = jiffies;
> unsigned int stall_timeout = 10 * HZ;
> - unsigned int cpuset_mems_cookie;
> + struct cpuset_mems_cookie cpuset_mems_cookie;
>
> /*
> * In the slowpath, we sanity check order to avoid ever trying to
> @@ -3713,7 +3713,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> compaction_retries = 0;
> no_progress_loops = 0;
> compact_priority = DEF_COMPACT_PRIORITY;
> - cpuset_mems_cookie = read_mems_allowed_begin();
> + read_mems_allowed_begin(&cpuset_mems_cookie);
>
> /*
> * The fast path uses conservative alloc_flags to succeed only until
> @@ -3872,7 +3872,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> * It's possible we raced with cpuset update so the OOM would be
> * premature (see below the nopage: label for full explanation).
> */
> - if (read_mems_allowed_retry(cpuset_mems_cookie))
> + if (read_mems_allowed_retry(&cpuset_mems_cookie))
> goto retry_cpuset;
>
> /* Reclaim has failed us, start killing things */
> @@ -3900,7 +3900,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> * to fail, check if the cpuset changed during allocation and if so,
> * retry.
> */
> - if (read_mems_allowed_retry(cpuset_mems_cookie))
> + if (read_mems_allowed_retry(&cpuset_mems_cookie))
> goto retry_cpuset;
>
> /*
> diff --git a/mm/slab.c b/mm/slab.c
> index 2a31ee3c5814..391fe9d9d24e 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3195,13 +3195,13 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
> void *obj = NULL;
> struct page *page;
> int nid;
> - unsigned int cpuset_mems_cookie;
> + struct cpuset_mems_cookie cpuset_mems_cookie;
>
> if (flags & __GFP_THISNODE)
> return NULL;
>
> retry_cpuset:
> - cpuset_mems_cookie = read_mems_allowed_begin();
> + read_mems_allowed_begin(&cpuset_mems_cookie);
> zonelist = node_zonelist(mempolicy_slab_node(), flags);
>
> retry:
> @@ -3245,7 +3245,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
> }
> }
>
> - if (unlikely(!obj && read_mems_allowed_retry(cpuset_mems_cookie)))
> + if (unlikely(!obj && read_mems_allowed_retry(&cpuset_mems_cookie)))
> goto retry_cpuset;
> return obj;
> }
> diff --git a/mm/slub.c b/mm/slub.c
> index 8addc535bcdc..55c4862852ec 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1849,7 +1849,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
> struct zone *zone;
> enum zone_type high_zoneidx = gfp_zone(flags);
> void *object;
> - unsigned int cpuset_mems_cookie;
> + struct cpuset_mems_cookie cpuset_mems_cookie;
>
> /*
> * The defrag ratio allows a configuration of the tradeoffs between
> @@ -1874,7 +1874,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
> return NULL;
>
> do {
> - cpuset_mems_cookie = read_mems_allowed_begin();
> + read_mems_allowed_begin(&cpuset_mems_cookie);
> zonelist = node_zonelist(mempolicy_slab_node(), flags);
> for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> struct kmem_cache_node *n;
> @@ -1896,7 +1896,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
> }
> }
> }
> - } while (read_mems_allowed_retry(cpuset_mems_cookie));
> + } while (read_mems_allowed_retry(&cpuset_mems_cookie));
> #endif
> return NULL;
> }
>

2017-07-28 08:49:16

by Dima Zavin

Subject: Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()

On Fri, Jul 28, 2017 at 12:45 AM, Vlastimil Babka <[email protected]> wrote:
> [+CC PeterZ]
>
> On 07/27/2017 06:46 PM, Dima Zavin wrote:
>> In codepaths that use the begin/retry interface for reading
>> mems_allowed_seq with irqs disabled, there exists a race condition that
>> stalls the patch process after only modifying a subset of the
>> static_branch call sites.
>>
>> This problem manifested itself as a deadlock in the slub
>> allocator, inside get_any_partial. The loop reads the
>> mems_allowed_seq value (via read_mems_allowed_begin),
>> performs the defrag operation, and then verifies the consistency
>> of mems_allowed via read_mems_allowed_retry and the cookie
>> returned by xxx_begin. The issue here is that both begin and retry
>> first check if cpusets are enabled via the cpusets_enabled() static branch.
>> This branch can be rewritten dynamically (via cpuset_inc) if a new
>> cpuset is created. The x86 jump label code fully synchronizes across
>> all CPUs for every entry it rewrites. If it rewrites only one of the
>> callsites (specifically the one in read_mems_allowed_retry) and then
>> waits for the smp_call_function(do_sync_core) to complete while a CPU is
>> inside the begin/retry section with IRQs off and the mems_allowed value
>> is changed, we can hang. This is because begin() will always return 0
>> (since it wasn't patched yet) while retry() will test the 0 against
>> the actual value of the seq counter.
>
> Hm I wonder if there are other static branch users potentially having
> similar problem. Then it would be best to fix this at static branch
> level. Any idea, Peter? An inelegant solution would be to indicate
> static_branch_(un)likely() callsites ordering for the patching. I.e.
> here we would make sure that read_mems_allowed_begin() callsites are
> patched before read_mems_allowed_retry() when enabling the static key,
> and the opposite order when disabling the static key.
>

This was my main worry, that I'm just patching up one incarnation of
this problem and other clients will eventually trip over this.

>> The fix is to cache the value that's returned by cpusets_enabled() at the
>> top of the loop, and only operate on the seqcount (both begin and retry) if
>> it was true.
>
> Maybe we could just return e.g. -1 in read_mems_allowed_begin() when
> cpusets are disabled, and test it in read_mems_allowed_retry() before
> doing a proper seqcount retry check? Also I think you can still do the
> cpusets_enabled() check in read_mems_allowed_retry() before the
> was_enabled (or cookie == -1) test?

Hmm, good point! If cpusets_enabled() is true, then we can still test against
was_enabled and do the right thing (adds one extra branch in that case). When
it's false, we still benefit from the static_branch fanciness. Thanks!

Re setting the cookie to -1, I didn't really want to overload the
cookie value but rather just make the state explicit so it's easier
to grok, as this is all already subtle enough.
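
For illustration only, the combination discussed above (keep the static-branch check in retry, then consult the cached was_enabled state) could look roughly like the userspace model below. This is a sketch under assumptions, not kernel code: model_cpusets_enabled and model_seq are hypothetical stand-ins for the static branch and current->mems_allowed_seq, and the seqcount comparison is simplified to a plain inequality.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the v2 cookie struct. */
struct cookie_model {
	unsigned int seq;
	bool was_enabled;
};

static bool model_cpusets_enabled;	/* stands in for the static branch */
static unsigned int model_seq;		/* stands in for mems_allowed_seq */

static void cookie_begin(struct cookie_model *c)
{
	/* Cache what begin() observed so retry() can stay consistent with it. */
	c->was_enabled = model_cpusets_enabled;
	c->seq = c->was_enabled ? model_seq : 0;
}

static bool cookie_retry(const struct cookie_model *c)
{
	/* Cheap check first: with cpusets off nothing needs retrying. */
	if (!model_cpusets_enabled)
		return false;
	/* begin() ran before the key flipped, so its seq is not meaningful. */
	if (!c->was_enabled)
		return false;
	/* Simplified read_seqcount_retry(): retry if the counter moved. */
	return model_seq != c->seq;
}
```

In this model the extra was_enabled test is only reached when cpusets are enabled, so the disabled case still costs just the one branch.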

2017-07-28 09:31:10

by Peter Zijlstra

Subject: Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()

On Fri, Jul 28, 2017 at 09:45:16AM +0200, Vlastimil Babka wrote:
> [+CC PeterZ]
>
> On 07/27/2017 06:46 PM, Dima Zavin wrote:
> > In codepaths that use the begin/retry interface for reading
> > mems_allowed_seq with irqs disabled, there exists a race condition that
> > stalls the patch process after only modifying a subset of the
> > static_branch call sites.
> >
> > This problem manifested itself as a deadlock in the slub
> > allocator, inside get_any_partial. The loop reads the
> > mems_allowed_seq value (via read_mems_allowed_begin),
> > performs the defrag operation, and then verifies the consistency
> > of mems_allowed via read_mems_allowed_retry and the cookie
> > returned by xxx_begin. The issue here is that both begin and retry
> > first check if cpusets are enabled via the cpusets_enabled() static branch.
> > This branch can be rewritten dynamically (via cpuset_inc) if a new
> > cpuset is created. The x86 jump label code fully synchronizes across
> > all CPUs for every entry it rewrites. If it rewrites only one of the
> > callsites (specifically the one in read_mems_allowed_retry) and then
> > waits for the smp_call_function(do_sync_core) to complete while a CPU is
> > inside the begin/retry section with IRQs off and the mems_allowed value
> > is changed, we can hang. This is because begin() will always return 0
> > (since it wasn't patched yet) while retry() will test the 0 against
> > the actual value of the seq counter.
>
> Hm I wonder if there are other static branch users potentially having
> similar problem. Then it would be best to fix this at static branch
> > level. Any idea, Peter? An inelegant solution would be to indicate
> static_branch_(un)likely() callsites ordering for the patching. I.e.
> here we would make sure that read_mems_allowed_begin() callsites are
> patched before read_mems_allowed_retry() when enabling the static key,
> and the opposite order when disabling the static key.

I'm not aware of any other such ordering requirements. But you can
manually create this order by using 2 static keys. Then flip them in the
desired order.
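
As a rough sketch of the two-key idea, the model below uses plain counters in place of static keys and checks the one invariant that matters here: the key gating retry() must never be on while the key gating begin() is still off. All names are hypothetical, and real static keys patch call sites rather than incrementing integers, so this only illustrates the ordering.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two static keys. */
struct two_keys {
	int pre_key;		/* gates begin() */
	int enabled_key;	/* gates retry() */
};

/* Invariant: retry() must never be live while begin() is not. */
static bool keys_consistent(const struct two_keys *k)
{
	return !(k->enabled_key > 0 && k->pre_key == 0);
}

/* Enable: flip the begin() key first, then the retry() key. */
static bool keys_inc(struct two_keys *k)
{
	bool ok;

	k->pre_key++;
	ok = keys_consistent(k);	/* intermediate state */
	k->enabled_key++;
	return ok && keys_consistent(k);
}

/* Disable: reverse order, so retry() goes quiet before begin() does. */
static bool keys_dec(struct two_keys *k)
{
	bool ok;

	k->enabled_key--;
	ok = keys_consistent(k);	/* intermediate state */
	k->pre_key--;
	return ok && keys_consistent(k);
}

/* For comparison: the wrong enable order breaks the invariant midway. */
static bool keys_inc_wrong_order(struct two_keys *k)
{
	bool ok;

	k->enabled_key++;
	ok = keys_consistent(k);	/* retry() live, begin() not */
	k->pre_key++;
	return ok && keys_consistent(k);
}
```

Each function returns whether every state it passed through, including the intermediate one, satisfied the invariant.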

2017-07-28 14:05:49

by Vlastimil Babka

Subject: Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()

On 07/28/2017 11:30 AM, Peter Zijlstra wrote:
> On Fri, Jul 28, 2017 at 09:45:16AM +0200, Vlastimil Babka wrote:
>> [+CC PeterZ]
>>
>> On 07/27/2017 06:46 PM, Dima Zavin wrote:
>>> In codepaths that use the begin/retry interface for reading
>>> mems_allowed_seq with irqs disabled, there exists a race condition that
>>> stalls the patch process after only modifying a subset of the
>>> static_branch call sites.
>>>
>>> This problem manifested itself as a deadlock in the slub
>>> allocator, inside get_any_partial. The loop reads the
>>> mems_allowed_seq value (via read_mems_allowed_begin),
>>> performs the defrag operation, and then verifies the consistency
>>> of mems_allowed via read_mems_allowed_retry and the cookie
>>> returned by xxx_begin. The issue here is that both begin and retry
>>> first check if cpusets are enabled via the cpusets_enabled() static branch.
>>> This branch can be rewritten dynamically (via cpuset_inc) if a new
>>> cpuset is created. The x86 jump label code fully synchronizes across
>>> all CPUs for every entry it rewrites. If it rewrites only one of the
>>> callsites (specifically the one in read_mems_allowed_retry) and then
>>> waits for the smp_call_function(do_sync_core) to complete while a CPU is
>>> inside the begin/retry section with IRQs off and the mems_allowed value
>>> is changed, we can hang. This is because begin() will always return 0
>>> (since it wasn't patched yet) while retry() will test the 0 against
>>> the actual value of the seq counter.
>>
>> Hm I wonder if there are other static branch users potentially having
>> similar problem. Then it would be best to fix this at static branch
>> level. Any idea, Peter? An inelegant solution would be to indicate
>> static_branch_(un)likely() callsites ordering for the patching. I.e.
>> here we would make sure that read_mems_allowed_begin() callsites are
>> patched before read_mems_allowed_retry() when enabling the static key,
>> and the opposite order when disabling the static key.
>
> I'm not aware of any other such ordering requirements. But you can
> manually create this order by using 2 static keys. Then flip them in the
> desired order.

Right, thanks for the suggestion. I think that would be preferable to
complicating the cookie handling. Add a new key next to
cpusets_enabled_key, let's say "cpusets_enabled_pre_key". Make
read_mems_allowed_begin() check this key instead of cpusets_enabled().
Change cpuset_inc/dec to inc/dec also this new key in the right order
and that should be it. Dima, can you try that or should I?

Thanks,
Vlastimil

> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>

2017-07-28 16:52:45

by Dima Zavin

Subject: Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()

On Fri, Jul 28, 2017 at 7:05 AM, Vlastimil Babka <[email protected]> wrote:
> On 07/28/2017 11:30 AM, Peter Zijlstra wrote:
>> On Fri, Jul 28, 2017 at 09:45:16AM +0200, Vlastimil Babka wrote:
>>> [+CC PeterZ]
>>>
>>> On 07/27/2017 06:46 PM, Dima Zavin wrote:
>>>> In codepaths that use the begin/retry interface for reading
>>>> mems_allowed_seq with irqs disabled, there exists a race condition that
>>>> stalls the patch process after only modifying a subset of the
>>>> static_branch call sites.
>>>>
>>>> This problem manifested itself as a deadlock in the slub
>>>> allocator, inside get_any_partial. The loop reads the
>>>> mems_allowed_seq value (via read_mems_allowed_begin),
>>>> performs the defrag operation, and then verifies the consistency
>>>> of mems_allowed via read_mems_allowed_retry and the cookie
>>>> returned by xxx_begin. The issue here is that both begin and retry
>>>> first check if cpusets are enabled via the cpusets_enabled() static branch.
>>>> This branch can be rewritten dynamically (via cpuset_inc) if a new
>>>> cpuset is created. The x86 jump label code fully synchronizes across
>>>> all CPUs for every entry it rewrites. If it rewrites only one of the
>>>> callsites (specifically the one in read_mems_allowed_retry) and then
>>>> waits for the smp_call_function(do_sync_core) to complete while a CPU is
>>>> inside the begin/retry section with IRQs off and the mems_allowed value
>>>> is changed, we can hang. This is because begin() will always return 0
>>>> (since it wasn't patched yet) while retry() will test the 0 against
>>>> the actual value of the seq counter.
>>>
>>> Hm I wonder if there are other static branch users potentially having
>>> similar problem. Then it would be best to fix this at static branch
>>> level. Any idea, Peter? An inelegant solution would be to indicate
>>> static_branch_(un)likely() callsites ordering for the patching. I.e.
>>> here we would make sure that read_mems_allowed_begin() callsites are
>>> patched before read_mems_allowed_retry() when enabling the static key,
>>> and the opposite order when disabling the static key.
>>
>> I'm not aware of any other such ordering requirements. But you can
>> manually create this order by using 2 static keys. Then flip them in the
>> desired order.
>
> Right, thanks for the suggestion. I think that would be preferable to
> complicating the cookie handling. Add a new key next to
> cpusets_enabled_key, let's say "cpusets_enabled_pre_key". Make
> read_mems_allowed_begin() check this key instead of cpusets_enabled().
> Change cpuset_inc/dec to inc/dec also this new key in the right order
> and that should be it. Dima, can you try that or should I?

Yeah, I like that approach much better. I'll re-spin a new version in a bit.

--Dima

>
> Thanks,
> Vlastimil
>

2017-07-29 04:57:55

by kernel test robot

Subject: Re: [PATCH v2] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()

Hi Dima,

[auto build test WARNING on v4.12]
[cannot apply to cgroup/for-next linus/master v4.13-rc2 v4.13-rc1 next-20170728]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Dima-Zavin/cpuset-fix-a-deadlock-due-to-incomplete-patching-of-cpusets_enabled/20170729-123852
config: i386-randconfig-x019-201730 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386

All warnings (new ones prefixed by >>):

In file included from kernel/sched/core.c:13:0:
include/linux/cpuset.h: In function 'read_mems_allowed_begin':
>> include/linux/cpuset.h:261:9: warning: 'return' with a value, in function returning void
return 0;
^
include/linux/cpuset.h:259:20: note: declared here
static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie)
^~~~~~~~~~~~~~~~~~~~~~~

vim +/return +261 include/linux/cpuset.h

58568d2a Miao Xie 2009-06-16 258
6b828cc9 Dima Zavin 2017-07-27 259 static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie)
c0ff7453 Miao Xie 2010-05-24 260 {
cc9a6c87 Mel Gorman 2012-03-21 @261 return 0;
c0ff7453 Miao Xie 2010-05-24 262 }
c0ff7453 Miao Xie 2010-05-24 263

:::::: The code at line 261 was first introduced by commit
:::::: cc9a6c8776615f9c194ccf0b63a0aa5628235545 cpuset: mm: reduce large amounts of memory barrier related damage v3

:::::: TO: Mel Gorman <[email protected]>
:::::: CC: Linus Torvalds <[email protected]>

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
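
For context, the warning above comes from the !CONFIG_CPUSETS stub keeping its old `return 0;` after the v2 patch changed the return type to void. One possible shape for stubs that match the v2 cookie interface is sketched below. This is an assumption about how the warning could be resolved, not something taken from a posted patch; the struct is repeated here only so the fragment compiles on its own.

```c
#include <stdbool.h>

/* Same layout as the cookie introduced by the v2 patch. */
struct cpuset_mems_cookie {
	unsigned int seq;
	bool was_enabled;
};

/* Stub for builds without cpusets: nothing to sample, so mark it disabled. */
static inline void read_mems_allowed_begin(struct cpuset_mems_cookie *cookie)
{
	cookie->seq = 0;
	cookie->was_enabled = false;
}

/* Stub for builds without cpusets: there is never a reason to retry. */
static inline bool read_mems_allowed_retry(struct cpuset_mems_cookie *cookie)
{
	(void)cookie;	/* unused in the stub */
	return false;
}
```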



2017-07-31 04:01:19

by Dima Zavin

Subject: [PATCH v3] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()

In codepaths that use the begin/retry interface for reading
mems_allowed_seq with irqs disabled, there exists a race condition that
stalls the patch process after only modifying a subset of the
static_branch call sites.

This problem manifested itself as a deadlock in the slub
allocator, inside get_any_partial. The loop reads the
mems_allowed_seq value (via read_mems_allowed_begin),
performs the defrag operation, and then verifies the consistency
of mems_allowed via read_mems_allowed_retry and the cookie
returned by xxx_begin. The issue here is that both begin and retry
first check if cpusets are enabled via the cpusets_enabled() static branch.
This branch can be rewritten dynamically (via cpuset_inc) if a new
cpuset is created. The x86 jump label code fully synchronizes across
all CPUs for every entry it rewrites. If it rewrites only one of the
callsites (specifically the one in read_mems_allowed_retry) and then
waits for the smp_call_function(do_sync_core) to complete while a CPU is
inside the begin/retry section with IRQs off and the mems_allowed value
is changed, we can hang. This is because begin() will always return 0
(since it wasn't patched yet) while retry() will test the 0 against
the actual value of the seq counter.
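
As a rough illustration of the hang described above, the following userspace model treats each patched call site as a boolean flag. It is a sketch under assumptions, not kernel code: the struct and function names are hypothetical, the seqcount check is reduced to a plain inequality, and the loop is capped at max_iters so the model terminates where the real loop would spin with IRQs off.

```c
#include <stdbool.h>

/* Each flag stands for one call site of the cpusets_enabled() branch. */
struct patch_state {
	bool begin_site_patched;	/* check inside begin() */
	bool retry_site_patched;	/* check inside retry() */
	unsigned int seq;		/* stands in for mems_allowed_seq */
};

static unsigned int model_begin(const struct patch_state *s)
{
	if (!s->begin_site_patched)
		return 0;		/* still sees "cpusets disabled" */
	return s->seq;
}

static bool model_retry(const struct patch_state *s, unsigned int cookie)
{
	if (!s->retry_site_patched)
		return false;
	return s->seq != cookie;	/* simplified read_seqcount_retry() */
}

/* Returns the iteration count; max_iters means the loop never settled. */
static unsigned int model_loop(const struct patch_state *s,
			       unsigned int max_iters)
{
	unsigned int iters = 0;
	unsigned int cookie;

	do {
		cookie = model_begin(s);
		iters++;
	} while (model_retry(s, cookie) && iters < max_iters);

	return iters;
}
```

With only the retry site patched and a nonzero seq, model_loop() runs until the cap, which mirrors the CPU that can never leave the begin/retry section and therefore never services the sync IPI.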

The fix is to use two different static keys: one for begin
(cpusets_pre_enable_key) and one for retry (cpusets_enabled_key). In
cpuset_inc(), we first bump the pre_enable key to ensure that
read_mems_allowed_begin() always returns a valid seqcount if we are
enabling cpusets. Similarly, when disabling cpusets via cpuset_dec(),
we first ensure that callers of read_mems_allowed_retry() will start
ignoring the seqcount value before we let read_mems_allowed_begin()
return 0.

The relevant stack traces of the two stuck threads:

CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
RIP: smp_call_function_many+0x1f9/0x260
Call Trace:
? setup_data_read+0xa0/0xa0
? ___slab_alloc+0x28b/0x5a0
smp_call_function+0x3b/0x70
? setup_data_read+0xa0/0xa0
on_each_cpu+0x2f/0x90
? ___slab_alloc+0x28a/0x5a0
? ___slab_alloc+0x28b/0x5a0
text_poke_bp+0x87/0xd0
? ___slab_alloc+0x28a/0x5a0
arch_jump_label_transform+0x93/0x100
__jump_label_update+0x77/0x90
jump_label_update+0xaa/0xc0
static_key_slow_inc+0x9e/0xb0
cpuset_css_online+0x70/0x2e0
online_css+0x2c/0xa0
cgroup_apply_control_enable+0x27f/0x3d0
cgroup_mkdir+0x2b7/0x420
kernfs_iop_mkdir+0x5a/0x80
vfs_mkdir+0xf6/0x1a0
SyS_mkdir+0xb7/0xe0
entry_SYSCALL_64_fastpath+0x18/0xad

...

CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8818087c0000 task.stack: ffffc90000030000
RIP: int3+0x39/0x70
Call Trace:
<#DB> ? ___slab_alloc+0x28b/0x5a0
<EOE> ? copy_process.part.40+0xf7/0x1de0
? __slab_alloc.isra.80+0x54/0x90
? copy_process.part.40+0xf7/0x1de0
? copy_process.part.40+0xf7/0x1de0
? kmem_cache_alloc_node+0x8a/0x280
? copy_process.part.40+0xf7/0x1de0
? _do_fork+0xe7/0x6c0
? _raw_spin_unlock_irq+0x2d/0x60
? trace_hardirqs_on_caller+0x136/0x1d0
? entry_SYSCALL_64_fastpath+0x5/0xad
? do_syscall_64+0x27/0x350
? SyS_clone+0x19/0x20
? do_syscall_64+0x60/0x350
? entry_SYSCALL64_slow_path+0x25/0x25

Reported-by: Cliff Spradlin <[email protected]>
Signed-off-by: Dima Zavin <[email protected]>
---

v3:
- Changed the implementation based on Peter Zijlstra's suggestion. Now
using two keys for begin/retry instead of hacking the state into the
cookie.
- Rebased and tested on top of v4.13-rc3.

v2:
- Moved the cached cpusets_enabled() state into the cookie, turned
the cookie into a struct and updated all the other call sites.
- Applied on top of v4.12 since one of the callers in page_alloc.c changed.
Still only tested on v4.9.36 and compile tested against v4.12.

include/linux/cpuset.h | 19 +++++++++++++++++--
kernel/cgroup/cpuset.c | 1 +
2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 119a3f9604b0..e5a684c04c70 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -18,6 +18,19 @@

#ifdef CONFIG_CPUSETS

+/*
+ * Static branch rewrites can happen in an arbitrary order for a given
+ * key. In code paths where we need to loop with read_mems_allowed_begin() and
+ * read_mems_allowed_retry() to get a consistent view of mems_allowed, we need
+ * to ensure that begin() always gets rewritten before retry() in the
+ * disabled -> enabled transition. If not, then if local irqs are disabled
+ * around the loop, we can deadlock since retry() would always be
+ * comparing the latest value of the mems_allowed seqcount against 0 as
+ * begin() still would see cpusets_enabled() as false. The enabled -> disabled
+ * transition should happen in reverse order for the same reasons (want to stop
+ * looking at real value of mems_allowed.sequence in retry() first).
+ */
+extern struct static_key_false cpusets_pre_enable_key;
extern struct static_key_false cpusets_enabled_key;
static inline bool cpusets_enabled(void)
{
@@ -32,12 +45,14 @@ static inline int nr_cpusets(void)

static inline void cpuset_inc(void)
{
+ static_branch_inc(&cpusets_pre_enable_key);
static_branch_inc(&cpusets_enabled_key);
}

static inline void cpuset_dec(void)
{
static_branch_dec(&cpusets_enabled_key);
+ static_branch_dec(&cpusets_pre_enable_key);
}

extern int cpuset_init(void);
@@ -115,7 +130,7 @@ extern void cpuset_print_current_mems_allowed(void);
*/
static inline unsigned int read_mems_allowed_begin(void)
{
- if (!cpusets_enabled())
+ if (!static_branch_unlikely(&cpusets_pre_enable_key))
return 0;

return read_seqcount_begin(&current->mems_allowed_seq);
@@ -129,7 +144,7 @@ static inline unsigned int read_mems_allowed_begin(void)
*/
static inline bool read_mems_allowed_retry(unsigned int seq)
{
- if (!cpusets_enabled())
+ if (!static_branch_unlikely(&cpusets_enabled_key))
return false;

return read_seqcount_retry(&current->mems_allowed_seq, seq);
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index ca8376e5008c..8d5151688504 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -63,6 +63,7 @@
#include <linux/cgroup.h>
#include <linux/wait.h>

+DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);

/* See "Frequency meter" comments, below. */
--
2.14.0.rc0.400.g1c36432dff-goog

2017-07-31 04:04:44

by Dima Zavin

Subject: Re: [PATCH v3] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()

On Sun, Jul 30, 2017 at 9:01 PM, Dima Zavin <[email protected]> wrote:
> In codepaths that use the begin/retry interface for reading
> mems_allowed_seq with irqs disabled, there exists a race condition that
> stalls the patch process after only modifying a subset of the
> static_branch call sites.
>
> This problem manifested itself as a deadlock in the slub
> allocator, inside get_any_partial. The loop reads the
> mems_allowed_seq value (via read_mems_allowed_begin),
> performs the defrag operation, and then verifies the consistency
> of mems_allowed via read_mems_allowed_retry and the cookie
> returned by xxx_begin. The issue here is that both begin and retry
> first check if cpusets are enabled via the cpusets_enabled() static branch.
> This branch can be rewritten dynamically (via cpuset_inc) if a new
> cpuset is created. The x86 jump label code fully synchronizes across
> all CPUs for every entry it rewrites. If it rewrites only one of the
> callsites (specifically the one in read_mems_allowed_retry) and then
> waits for the smp_call_function(do_sync_core) to complete while a CPU is
> inside the begin/retry section with IRQs off and the mems_allowed value
> is changed, we can hang. This is because begin() will always return 0
> (since it wasn't patched yet) while retry() will test the 0 against
> the actual value of the seq counter.
>
> The fix is to use two different static keys: one for begin
> (pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we
> first bump the pre_enable key to ensure that read_mems_allowed_begin()
> always returns a valid seqcount if we are enabling cpusets. Similarly,
> when disabling cpusets via cpuset_dec(), we first ensure that callers
> of read_mems_allowed_retry() will start ignoring the seqcount
> value before we let read_mems_allowed_begin() return 0.
>
> The relevant stack traces of the two stuck threads:
>
> CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4
> Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
> task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
> RIP: smp_call_function_many+0x1f9/0x260
> Call Trace:
> ? setup_data_read+0xa0/0xa0
> ? ___slab_alloc+0x28b/0x5a0
> smp_call_function+0x3b/0x70
> ? setup_data_read+0xa0/0xa0
> on_each_cpu+0x2f/0x90
> ? ___slab_alloc+0x28a/0x5a0
> ? ___slab_alloc+0x28b/0x5a0
> text_poke_bp+0x87/0xd0
> ? ___slab_alloc+0x28a/0x5a0
> arch_jump_label_transform+0x93/0x100
> __jump_label_update+0x77/0x90
> jump_label_update+0xaa/0xc0
> static_key_slow_inc+0x9e/0xb0
> cpuset_css_online+0x70/0x2e0
> online_css+0x2c/0xa0
> cgroup_apply_control_enable+0x27f/0x3d0
> cgroup_mkdir+0x2b7/0x420
> kernfs_iop_mkdir+0x5a/0x80
> vfs_mkdir+0xf6/0x1a0
> SyS_mkdir+0xb7/0xe0
> entry_SYSCALL_64_fastpath+0x18/0xad
>
> ...
>
> CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4
> Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
> task: ffff8818087c0000 task.stack: ffffc90000030000
> RIP: int3+0x39/0x70
> Call Trace:
> <#DB> ? ___slab_alloc+0x28b/0x5a0
> <EOE> ? copy_process.part.40+0xf7/0x1de0
> ? __slab_alloc.isra.80+0x54/0x90
> ? copy_process.part.40+0xf7/0x1de0
> ? copy_process.part.40+0xf7/0x1de0
> ? kmem_cache_alloc_node+0x8a/0x280
> ? copy_process.part.40+0xf7/0x1de0
> ? _do_fork+0xe7/0x6c0
> ? _raw_spin_unlock_irq+0x2d/0x60
> ? trace_hardirqs_on_caller+0x136/0x1d0
> ? entry_SYSCALL_64_fastpath+0x5/0xad
> ? do_syscall_64+0x27/0x350
> ? SyS_clone+0x19/0x20
> ? do_syscall_64+0x60/0x350
> ? entry_SYSCALL64_slow_path+0x25/0x25
>
> Reported-by: Cliff Spradlin <[email protected]>
> Signed-off-by: Dima Zavin <[email protected]>
> ---
>
> v3:
> - Changed the implementation based on Peter Zijlstra's suggestion. Now
> using two keys for begin/retry instead of hacking the state into the
> cookie.
> - Rebased and tested on top of v4.13-rc3.
>
> v4:

Doh, latest patch is v3, I obviously meant v2 here instead of v4. Sigh. Sorry.

--Dima

> - Moved the cached cpusets_enabled() state into the cookie, turned
> the cookie into a struct and updated all the other call sites.
> - Applied on top of v4.12 since one of the callers in page_alloc.c changed.
> Still only tested on v4.9.36 and compile tested against v4.12.
>
> include/linux/cpuset.h | 19 +++++++++++++++++--
> kernel/cgroup/cpuset.c | 1 +
> 2 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 119a3f9604b0..e5a684c04c70 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -18,6 +18,19 @@
>
> #ifdef CONFIG_CPUSETS
>
> +/*
> + * Static branch rewrites can happen in an arbitrary order for a given
> + * key. In code paths where we need to loop with read_mems_allowed_begin() and
> + * read_mems_allowed_retry() to get a consistent view of mems_allowed, we need
> + * to ensure that begin() always gets rewritten before retry() in the
> + * disabled -> enabled transition. If not, then if local irqs are disabled
> + * around the loop, we can deadlock since retry() would always be
> + * comparing the latest value of the mems_allowed seqcount against 0 as
> + * begin() still would see cpusets_enabled() as false. The enabled -> disabled
> + * transition should happen in reverse order for the same reasons (want to stop
> + * looking at real value of mems_allowed.sequence in retry() first).
> + */
> +extern struct static_key_false cpusets_pre_enable_key;
> extern struct static_key_false cpusets_enabled_key;
> static inline bool cpusets_enabled(void)
> {
> @@ -32,12 +45,14 @@ static inline int nr_cpusets(void)
>
> static inline void cpuset_inc(void)
> {
> + static_branch_inc(&cpusets_pre_enable_key);
> static_branch_inc(&cpusets_enabled_key);
> }
>
> static inline void cpuset_dec(void)
> {
> static_branch_dec(&cpusets_enabled_key);
> + static_branch_dec(&cpusets_pre_enable_key);
> }
>
> extern int cpuset_init(void);
> @@ -115,7 +130,7 @@ extern void cpuset_print_current_mems_allowed(void);
> */
> static inline unsigned int read_mems_allowed_begin(void)
> {
> - if (!cpusets_enabled())
> + if (!static_branch_unlikely(&cpusets_pre_enable_key))
> return 0;
>
> return read_seqcount_begin(&current->mems_allowed_seq);
> @@ -129,7 +144,7 @@ static inline unsigned int read_mems_allowed_begin(void)
> */
> static inline bool read_mems_allowed_retry(unsigned int seq)
> {
> - if (!cpusets_enabled())
> + if (!static_branch_unlikely(&cpusets_enabled_key))
> return false;
>
> return read_seqcount_retry(&current->mems_allowed_seq, seq);
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index ca8376e5008c..8d5151688504 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -63,6 +63,7 @@
> #include <linux/cgroup.h>
> #include <linux/wait.h>
>
> +DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
> DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
>
> /* See "Frequency meter" comments, below. */
> --
> 2.14.0.rc0.400.g1c36432dff-goog
>

2017-07-31 08:02:27

by Vlastimil Babka

Subject: Re: [PATCH v3] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()

On 07/31/2017 06:01 AM, Dima Zavin wrote:
> In codepaths that use the begin/retry interface for reading
> mems_allowed_seq with irqs disabled, there exists a race condition that
> stalls the patch process after only modifying a subset of the
> static_branch call sites.
>
> This problem manifested itself as a deadlock in the slub
> allocator, inside get_any_partial. The loop reads the
> mems_allowed_seq value (via read_mems_allowed_begin),
> performs the defrag operation, and then verifies the consistency
> of mems_allowed via read_mems_allowed_retry and the cookie
> returned by xxx_begin. The issue here is that both begin and retry
> first check if cpusets are enabled via the cpusets_enabled() static branch.
> This branch can be rewritten dynamically (via cpuset_inc) if a new
> cpuset is created. The x86 jump label code fully synchronizes across
> all CPUs for every entry it rewrites. If it rewrites only one of the
> callsites (specifically the one in read_mems_allowed_retry) and then
> waits for the smp_call_function(do_sync_core) to complete while a CPU is
> inside the begin/retry section with IRQs off and the mems_allowed value
> is changed, we can hang. This is because begin() will always return 0
> (since it wasn't patched yet) while retry() will test the 0 against
> the actual value of the seq counter.
>
> The fix is to use two different static keys: one for begin
> (pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we
> first bump the pre_enable key to ensure that read_mems_allowed_begin()
> always returns a valid seqcount if we are enabling cpusets. Similarly,
> when disabling cpusets via cpuset_dec(), we first ensure that callers
> of read_mems_allowed_retry() will start ignoring the seqcount
> value before we let read_mems_allowed_begin() return 0.
>
> The relevant stack traces of the two stuck threads:
>
> CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4
> Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
> task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
> RIP: smp_call_function_many+0x1f9/0x260
> Call Trace:
> ? setup_data_read+0xa0/0xa0
> ? ___slab_alloc+0x28b/0x5a0
> smp_call_function+0x3b/0x70
> ? setup_data_read+0xa0/0xa0
> on_each_cpu+0x2f/0x90
> ? ___slab_alloc+0x28a/0x5a0
> ? ___slab_alloc+0x28b/0x5a0
> text_poke_bp+0x87/0xd0
> ? ___slab_alloc+0x28a/0x5a0
> arch_jump_label_transform+0x93/0x100
> __jump_label_update+0x77/0x90
> jump_label_update+0xaa/0xc0
> static_key_slow_inc+0x9e/0xb0
> cpuset_css_online+0x70/0x2e0
> online_css+0x2c/0xa0
> cgroup_apply_control_enable+0x27f/0x3d0
> cgroup_mkdir+0x2b7/0x420
> kernfs_iop_mkdir+0x5a/0x80
> vfs_mkdir+0xf6/0x1a0
> SyS_mkdir+0xb7/0xe0
> entry_SYSCALL_64_fastpath+0x18/0xad
>
> ...
>
> CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4
> Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
> task: ffff8818087c0000 task.stack: ffffc90000030000
> RIP: int3+0x39/0x70
> Call Trace:
> <#DB> ? ___slab_alloc+0x28b/0x5a0
> <EOE> ? copy_process.part.40+0xf7/0x1de0
> ? __slab_alloc.isra.80+0x54/0x90
> ? copy_process.part.40+0xf7/0x1de0
> ? copy_process.part.40+0xf7/0x1de0
> ? kmem_cache_alloc_node+0x8a/0x280
> ? copy_process.part.40+0xf7/0x1de0
> ? _do_fork+0xe7/0x6c0
> ? _raw_spin_unlock_irq+0x2d/0x60
> ? trace_hardirqs_on_caller+0x136/0x1d0
> ? entry_SYSCALL_64_fastpath+0x5/0xad
> ? do_syscall_64+0x27/0x350
> ? SyS_clone+0x19/0x20
> ? do_syscall_64+0x60/0x350
> ? entry_SYSCALL64_slow_path+0x25/0x25
>
> Reported-by: Cliff Spradlin <[email protected]>
> Signed-off-by: Dima Zavin <[email protected]>

Looks good. Could you verify it fixes the issue, or was it too hard to
reproduce? Also is this a stable candidate patch, and can you identify
an exact commit hash it fixes?

Acked-by: Vlastimil Babka <[email protected]>

> ---
>
> v3:
> - Changed the implementation based on Peter Zijlstra's suggestion. Now
> using two keys for begin/retry instead of hacking the state into the
> cookie.
> - Rebased and tested on top of v4.13-rc3.
>
> v2:
> - Moved the cached cpusets_enabled() state into the cookie, turned
> the cookie into a struct and updated all the other call sites.
> - Applied on top of v4.12 since one of the callers in page_alloc.c changed.
> Still only tested on v4.9.36 and compile tested against v4.12.
>
> include/linux/cpuset.h | 19 +++++++++++++++++--
> kernel/cgroup/cpuset.c | 1 +
> 2 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 119a3f9604b0..e5a684c04c70 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -18,6 +18,19 @@
>
> #ifdef CONFIG_CPUSETS
>
> +/*
> + * Static branch rewrites can happen in an arbitrary order for a given
> + * key. In code paths where we need to loop with read_mems_allowed_begin() and
> + * read_mems_allowed_retry() to get a consistent view of mems_allowed, we need
> + * to ensure that begin() always gets rewritten before retry() in the
> + * disabled -> enabled transition. If not, then if local irqs are disabled
> + * around the loop, we can deadlock since retry() would always be
> + * comparing the latest value of the mems_allowed seqcount against 0 as
> + * begin() still would see cpusets_enabled() as false. The enabled -> disabled
> + * transition should happen in reverse order for the same reasons (want to stop
> + * looking at real value of mems_allowed.sequence in retry() first).
> + */
> +extern struct static_key_false cpusets_pre_enable_key;
> extern struct static_key_false cpusets_enabled_key;
> static inline bool cpusets_enabled(void)
> {
> @@ -32,12 +45,14 @@ static inline int nr_cpusets(void)
>
> static inline void cpuset_inc(void)
> {
> + static_branch_inc(&cpusets_pre_enable_key);
> static_branch_inc(&cpusets_enabled_key);
> }
>
> static inline void cpuset_dec(void)
> {
> static_branch_dec(&cpusets_enabled_key);
> + static_branch_dec(&cpusets_pre_enable_key);
> }
>
> extern int cpuset_init(void);
> @@ -115,7 +130,7 @@ extern void cpuset_print_current_mems_allowed(void);
> */
> static inline unsigned int read_mems_allowed_begin(void)
> {
> - if (!cpusets_enabled())
> + if (!static_branch_unlikely(&cpusets_pre_enable_key))
> return 0;
>
> return read_seqcount_begin(&current->mems_allowed_seq);
> @@ -129,7 +144,7 @@ static inline unsigned int read_mems_allowed_begin(void)
> */
> static inline bool read_mems_allowed_retry(unsigned int seq)
> {
> - if (!cpusets_enabled())
> + if (!static_branch_unlikely(&cpusets_enabled_key))
> return false;
>
> return read_seqcount_retry(&current->mems_allowed_seq, seq);
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index ca8376e5008c..8d5151688504 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -63,6 +63,7 @@
> #include <linux/cgroup.h>
> #include <linux/wait.h>
>
> +DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
> DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
>
> /* See "Frequency meter" comments, below. */
>

2017-07-31 09:05:43

by Dima Zavin

Subject: Re: [PATCH v3] cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()

On Mon, Jul 31, 2017 at 1:02 AM, Vlastimil Babka <[email protected]> wrote:
> On 07/31/2017 06:01 AM, Dima Zavin wrote:
>> In codepaths that use the begin/retry interface for reading
>> mems_allowed_seq with irqs disabled, there exists a race condition that
>> stalls the patch process after only modifying a subset of the
>> static_branch call sites.
>>
>> This problem manifested itself as a deadlock in the slub
>> allocator, inside get_any_partial. The loop reads the
>> mems_allowed_seq value (via read_mems_allowed_begin),
>> performs the defrag operation, and then verifies the consistency
>> of mems_allowed via read_mems_allowed_retry and the cookie
>> returned by xxx_begin. The issue here is that both begin and retry
>> first check if cpusets are enabled via the cpusets_enabled() static branch.
>> This branch can be rewritten dynamically (via cpuset_inc) if a new
>> cpuset is created. The x86 jump label code fully synchronizes across
>> all CPUs for every entry it rewrites. If it rewrites only one of the
>> callsites (specifically the one in read_mems_allowed_retry) and then
>> waits for the smp_call_function(do_sync_core) to complete while a CPU is
>> inside the begin/retry section with IRQs off and the mems_allowed value
>> is changed, we can hang. This is because begin() will always return 0
>> (since it wasn't patched yet) while retry() will test the 0 against
>> the actual value of the seq counter.
>>
>> The fix is to use two different static keys: one for begin
>> (pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we
>> first bump the pre_enable key to ensure that read_mems_allowed_begin()
>> always returns a valid seqcount if we are enabling cpusets. Similarly,
>> when disabling cpusets via cpuset_dec(), we first ensure that callers
>> of read_mems_allowed_retry() will start ignoring the seqcount
>> value before we let read_mems_allowed_begin() return 0.
>>
>> The relevant stack traces of the two stuck threads:
>>
>> CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4
>> Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
>> task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
>> RIP: smp_call_function_many+0x1f9/0x260
>> Call Trace:
>> ? setup_data_read+0xa0/0xa0
>> ? ___slab_alloc+0x28b/0x5a0
>> smp_call_function+0x3b/0x70
>> ? setup_data_read+0xa0/0xa0
>> on_each_cpu+0x2f/0x90
>> ? ___slab_alloc+0x28a/0x5a0
>> ? ___slab_alloc+0x28b/0x5a0
>> text_poke_bp+0x87/0xd0
>> ? ___slab_alloc+0x28a/0x5a0
>> arch_jump_label_transform+0x93/0x100
>> __jump_label_update+0x77/0x90
>> jump_label_update+0xaa/0xc0
>> static_key_slow_inc+0x9e/0xb0
>> cpuset_css_online+0x70/0x2e0
>> online_css+0x2c/0xa0
>> cgroup_apply_control_enable+0x27f/0x3d0
>> cgroup_mkdir+0x2b7/0x420
>> kernfs_iop_mkdir+0x5a/0x80
>> vfs_mkdir+0xf6/0x1a0
>> SyS_mkdir+0xb7/0xe0
>> entry_SYSCALL_64_fastpath+0x18/0xad
>>
>> ...
>>
>> CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4
>> Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
>> task: ffff8818087c0000 task.stack: ffffc90000030000
>> RIP: int3+0x39/0x70
>> Call Trace:
>> <#DB> ? ___slab_alloc+0x28b/0x5a0
>> <EOE> ? copy_process.part.40+0xf7/0x1de0
>> ? __slab_alloc.isra.80+0x54/0x90
>> ? copy_process.part.40+0xf7/0x1de0
>> ? copy_process.part.40+0xf7/0x1de0
>> ? kmem_cache_alloc_node+0x8a/0x280
>> ? copy_process.part.40+0xf7/0x1de0
>> ? _do_fork+0xe7/0x6c0
>> ? _raw_spin_unlock_irq+0x2d/0x60
>> ? trace_hardirqs_on_caller+0x136/0x1d0
>> ? entry_SYSCALL_64_fastpath+0x5/0xad
>> ? do_syscall_64+0x27/0x350
>> ? SyS_clone+0x19/0x20
>> ? do_syscall_64+0x60/0x350
>> ? entry_SYSCALL64_slow_path+0x25/0x25
>>
>> Reported-by: Cliff Spradlin <[email protected]>
>> Signed-off-by: Dima Zavin <[email protected]>
>
> Looks good. Could you verify it fixes the issue, or was it too hard to
> reproduce? Also is this a stable candidate patch, and can you identify
> an exact commit hash it fixes?

It's tough to reproduce, but this fix works as well as the original
hack. I think the problematic commit must have been:

commit 46e700abc44ce215acb4341d9702ce3972eda571
Author: Mel Gorman <[email protected]>
Date: Fri Nov 6 16:28:15 2015 -0800
mm, page_alloc: remove unnecessary taking of a seqlock when
cpusets are disabled

It's probably stable-worthy, but I don't know who can make the call on that.

Thanks for the reviews!

--Dima

>
> Acked-by: Vlastimil Babka <[email protected]>
>
>> ---
>>
>> v3:
>> - Changed the implementation based on Peter Zijlstra's suggestion. Now
>> using two keys for begin/retry instead of hacking the state into the
>> cookie.
>> - Rebased and tested on top of v4.13-rc3.
>>
>> v2:
>> - Moved the cached cpusets_enabled() state into the cookie, turned
>> the cookie into a struct and updated all the other call sites.
>> - Applied on top of v4.12 since one of the callers in page_alloc.c changed.
>> Still only tested on v4.9.36 and compile tested against v4.12.
>>
>> include/linux/cpuset.h | 19 +++++++++++++++++--
>> kernel/cgroup/cpuset.c | 1 +
>> 2 files changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
>> index 119a3f9604b0..e5a684c04c70 100644
>> --- a/include/linux/cpuset.h
>> +++ b/include/linux/cpuset.h
>> @@ -18,6 +18,19 @@
>>
>> #ifdef CONFIG_CPUSETS
>>
>> +/*
>> + * Static branch rewrites can happen in an arbitrary order for a given
>> + * key. In code paths where we need to loop with read_mems_allowed_begin() and
>> + * read_mems_allowed_retry() to get a consistent view of mems_allowed, we need
>> + * to ensure that begin() always gets rewritten before retry() in the
>> + * disabled -> enabled transition. If not, then if local irqs are disabled
>> + * around the loop, we can deadlock since retry() would always be
>> + * comparing the latest value of the mems_allowed seqcount against 0 as
>> + * begin() still would see cpusets_enabled() as false. The enabled -> disabled
>> + * transition should happen in reverse order for the same reasons (want to stop
>> + * looking at real value of mems_allowed.sequence in retry() first).
>> + */
>> +extern struct static_key_false cpusets_pre_enable_key;
>> extern struct static_key_false cpusets_enabled_key;
>> static inline bool cpusets_enabled(void)
>> {
>> @@ -32,12 +45,14 @@ static inline int nr_cpusets(void)
>>
>> static inline void cpuset_inc(void)
>> {
>> + static_branch_inc(&cpusets_pre_enable_key);
>> static_branch_inc(&cpusets_enabled_key);
>> }
>>
>> static inline void cpuset_dec(void)
>> {
>> static_branch_dec(&cpusets_enabled_key);
>> + static_branch_dec(&cpusets_pre_enable_key);
>> }
>>
>> extern int cpuset_init(void);
>> @@ -115,7 +130,7 @@ extern void cpuset_print_current_mems_allowed(void);
>> */
>> static inline unsigned int read_mems_allowed_begin(void)
>> {
>> - if (!cpusets_enabled())
>> + if (!static_branch_unlikely(&cpusets_pre_enable_key))
>> return 0;
>>
>> return read_seqcount_begin(&current->mems_allowed_seq);
>> @@ -129,7 +144,7 @@ static inline unsigned int read_mems_allowed_begin(void)
>> */
>> static inline bool read_mems_allowed_retry(unsigned int seq)
>> {
>> - if (!cpusets_enabled())
>> + if (!static_branch_unlikely(&cpusets_enabled_key))
>> return false;
>>
>> return read_seqcount_retry(&current->mems_allowed_seq, seq);
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index ca8376e5008c..8d5151688504 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -63,6 +63,7 @@
>> #include <linux/cgroup.h>
>> #include <linux/wait.h>
>>
>> +DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
>> DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
>>
>> /* See "Frequency meter" comments, below. */
>>
>