2016-03-28 05:27:23

by Joonsoo Kim

Subject: mm/slab: reduce lock contention in alloc path

From: Joonsoo Kim <[email protected]>

While processing concurrent allocations, SLAB can be contended heavily
because it does a lot of work while holding a lock. This patchset tries
to reduce the number and size of the critical sections in order to
reduce lock contention. The major changes are a lockless decision to
allocate more slabs and a lockless cpu cache refill from the newly
allocated slab, sketched below.
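
At a high level, the reworked slow path looks roughly like the sketch
below. This is a simplified illustration distilled from patches 09-11,
not the literal code; cache_grow_begin()/cache_grow_end() and
alloc_block() are the helpers introduced by those patches.

/*
 * Simplified sketch of the reworked cache_alloc_refill() slow path.
 * The node lock is taken only to pull from existing slabs or to attach
 * the newly allocated slab at the end, not while refilling the cpu
 * cache from that new slab.
 */
ac = cpu_cache_get(cachep);
n = get_node(cachep, node);

/* lockless decision: skip the node lock if nothing is free there */
if (n->free_objects || (n->shared && n->shared->avail)) {
        spin_lock(&n->list_lock);
        /* refill ac from the shared array or from partial/free slabs */
        spin_unlock(&n->list_lock);
}

if (!ac->avail) {
        /* grow without the node lock; the new slab is private to us */
        page = cache_grow_begin(cachep, gfp_exact_node(flags), node);
        if (page)
                alloc_block(cachep, ac, page, ac->batchcount);
        cache_grow_end(cachep, page);   /* node lock taken only here */
}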

Below are the results for concurrent allocation/free in the slab
allocation benchmark Christoph made a long time ago. I simplified the
output. The numbers are cycle counts for alloc/free respectively, so
lower is better.

* Before
Kmalloc N*alloc N*free(32): Average=365/806
Kmalloc N*alloc N*free(64): Average=452/690
Kmalloc N*alloc N*free(128): Average=736/886
Kmalloc N*alloc N*free(256): Average=1167/985
Kmalloc N*alloc N*free(512): Average=2088/1125
Kmalloc N*alloc N*free(1024): Average=4115/1184
Kmalloc N*alloc N*free(2048): Average=8451/1748
Kmalloc N*alloc N*free(4096): Average=16024/2048

* After
Kmalloc N*alloc N*free(32): Average=344/792
Kmalloc N*alloc N*free(64): Average=347/882
Kmalloc N*alloc N*free(128): Average=390/959
Kmalloc N*alloc N*free(256): Average=393/1067
Kmalloc N*alloc N*free(512): Average=683/1229
Kmalloc N*alloc N*free(1024): Average=1295/1325
Kmalloc N*alloc N*free(2048): Average=2513/1664
Kmalloc N*alloc N*free(4096): Average=4742/2172

It shows that performance improves greatly (roughly by more than 50%)
for object size classes larger than 128 bytes.
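
For reference, each reported pair is the average cycle count per
kmalloc() and per kfree() for the given size class. The per-size
measurement is roughly of the following shape, run concurrently on the
cpus (a hypothetical reconstruction for illustration only; the original
benchmark module is not part of this posting):

#include <linux/slab.h>         /* kmalloc, kfree */
#include <linux/timex.h>        /* get_cycles */
#include <linux/printk.h>

/* hypothetical reconstruction of one per-size measurement */
static void kmalloc_n_alloc_n_free(size_t size, int n, void **objs)
{
        cycles_t start;
        unsigned long long alloc_avg, free_avg;
        int i;

        start = get_cycles();
        for (i = 0; i < n; i++)
                objs[i] = kmalloc(size, GFP_KERNEL);
        alloc_avg = (get_cycles() - start) / n;

        start = get_cycles();
        for (i = 0; i < n; i++)
                kfree(objs[i]);
        free_avg = (get_cycles() - start) / n;

        pr_info("Kmalloc N*alloc N*free(%zu): Average=%llu/%llu\n",
                size, alloc_avg, free_avg);
}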

Thanks.

Joonsoo Kim (11):
mm/slab: hold a slab_mutex when calling __kmem_cache_shrink()
mm/slab: remove BAD_ALIEN_MAGIC again
mm/slab: drain the free slab as much as possible
mm/slab: factor out kmem_cache_node initialization code
mm/slab: clean-up kmem_cache_node setup
mm/slab: don't keep free slabs if free_objects exceeds free_limit
mm/slab: racy access/modify the slab color
mm/slab: make cache_grow() handle the page allocated on arbitrary node
mm/slab: separate cache_grow() to two parts
mm/slab: refill cpu cache through a new slab without holding a node lock
mm/slab: lockless decision to grow cache

mm/slab.c | 495 ++++++++++++++++++++++++++++---------------------------
mm/slab_common.c | 4 +
2 files changed, 255 insertions(+), 244 deletions(-)

--
1.9.1


2016-03-28 05:27:37

by Joonsoo Kim

Subject: [PATCH 03/11] mm/slab: drain the free slab as much as possible

From: Joonsoo Kim <[email protected]>

slabs_tofree() computes the number of slabs needed to cover all free
objects, i.e. it asks drain_freelist() to drain every free slab. We can
achieve the same thing by simply passing INT_MAX.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index a5a205b..ba2eacf 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -888,12 +888,6 @@ static int init_cache_node_node(int node)
return 0;
}

-static inline int slabs_tofree(struct kmem_cache *cachep,
- struct kmem_cache_node *n)
-{
- return (n->free_objects + cachep->num - 1) / cachep->num;
-}
-
static void cpuup_canceled(long cpu)
{
struct kmem_cache *cachep;
@@ -958,7 +952,7 @@ free_slab:
n = get_node(cachep, node);
if (!n)
continue;
- drain_freelist(cachep, n, slabs_tofree(cachep, n));
+ drain_freelist(cachep, n, INT_MAX);
}
}

@@ -1110,7 +1104,7 @@ static int __meminit drain_cache_node_node(int node)
if (!n)
continue;

- drain_freelist(cachep, n, slabs_tofree(cachep, n));
+ drain_freelist(cachep, n, INT_MAX);

if (!list_empty(&n->slabs_full) ||
!list_empty(&n->slabs_partial)) {
@@ -2280,7 +2274,7 @@ int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)

check_irq_on();
for_each_kmem_cache_node(cachep, node, n) {
- drain_freelist(cachep, n, slabs_tofree(cachep, n));
+ drain_freelist(cachep, n, INT_MAX);

ret += !list_empty(&n->slabs_full) ||
!list_empty(&n->slabs_partial);
--
1.9.1

2016-03-28 05:27:48

by Joonsoo Kim

Subject: [PATCH 04/11] mm/slab: factor out kmem_cache_node initialization code

From: Joonsoo Kim <[email protected]>

This code can be reused elsewhere, so factor it out. A following
patch will use it.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 68 ++++++++++++++++++++++++++++++++++++---------------------------
1 file changed, 39 insertions(+), 29 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index ba2eacf..569d7db 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -841,6 +841,40 @@ static inline gfp_t gfp_exact_node(gfp_t flags)
}
#endif

+static int init_cache_node(struct kmem_cache *cachep, int node, gfp_t gfp)
+{
+ struct kmem_cache_node *n;
+
+ /*
+ * Set up the kmem_cache_node for cpu before we can
+ * begin anything. Make sure some other cpu on this
+ * node has not already allocated this
+ */
+ n = get_node(cachep, node);
+ if (n)
+ return 0;
+
+ n = kmalloc_node(sizeof(struct kmem_cache_node), gfp, node);
+ if (!n)
+ return -ENOMEM;
+
+ kmem_cache_node_init(n);
+ n->next_reap = jiffies + REAPTIMEOUT_NODE +
+ ((unsigned long)cachep) % REAPTIMEOUT_NODE;
+
+ n->free_limit =
+ (1 + nr_cpus_node(node)) * cachep->batchcount + cachep->num;
+
+ /*
+ * The kmem_cache_nodes don't come and go as CPUs
+ * come and go. slab_mutex is sufficient
+ * protection here.
+ */
+ cachep->node[node] = n;
+
+ return 0;
+}
+
/*
* Allocates and initializes node for a node on each slab cache, used for
* either memory or cpu hotplug. If memory is being hot-added, the kmem_cache_node
@@ -852,39 +886,15 @@ static inline gfp_t gfp_exact_node(gfp_t flags)
*/
static int init_cache_node_node(int node)
{
+ int ret;
struct kmem_cache *cachep;
- struct kmem_cache_node *n;
- const size_t memsize = sizeof(struct kmem_cache_node);

list_for_each_entry(cachep, &slab_caches, list) {
- /*
- * Set up the kmem_cache_node for cpu before we can
- * begin anything. Make sure some other cpu on this
- * node has not already allocated this
- */
- n = get_node(cachep, node);
- if (!n) {
- n = kmalloc_node(memsize, GFP_KERNEL, node);
- if (!n)
- return -ENOMEM;
- kmem_cache_node_init(n);
- n->next_reap = jiffies + REAPTIMEOUT_NODE +
- ((unsigned long)cachep) % REAPTIMEOUT_NODE;
-
- /*
- * The kmem_cache_nodes don't come and go as CPUs
- * come and go. slab_mutex is sufficient
- * protection here.
- */
- cachep->node[node] = n;
- }
-
- spin_lock_irq(&n->list_lock);
- n->free_limit =
- (1 + nr_cpus_node(node)) *
- cachep->batchcount + cachep->num;
- spin_unlock_irq(&n->list_lock);
+ ret = init_cache_node(cachep, node, GFP_KERNEL);
+ if (ret)
+ return ret;
}
+
return 0;
}

--
1.9.1

2016-03-28 05:27:57

by Joonsoo Kim

Subject: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again

From: Joonsoo Kim <[email protected]>

The initial attempt to remove BAD_ALIEN_MAGIC was reverted by
commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")
because it caused a problem on m68k, which has many nodes but
!CONFIG_NUMA. In that case the alien cache isn't used at all, but a
garbage value, BAD_ALIEN_MAGIC, is assigned to cope with some
initialization paths. Now that this patch sets use_alien_caches to 0
when !CONFIG_NUMA, there is no initialization path problem, so we
don't need BAD_ALIEN_MAGIC at all. Remove it.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 043606a..a5a205b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -421,8 +421,6 @@ static struct kmem_cache kmem_cache_boot = {
.name = "kmem_cache",
};

-#define BAD_ALIEN_MAGIC 0x01020304ul
-
static DEFINE_PER_CPU(struct delayed_work, slab_reap_work);

static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep)
@@ -637,7 +635,7 @@ static int transfer_objects(struct array_cache *to,
static inline struct alien_cache **alloc_alien_cache(int node,
int limit, gfp_t gfp)
{
- return (struct alien_cache **)BAD_ALIEN_MAGIC;
+ return NULL;
}

static inline void free_alien_cache(struct alien_cache **ac_ptr)
@@ -1205,7 +1203,7 @@ void __init kmem_cache_init(void)
sizeof(struct rcu_head));
kmem_cache = &kmem_cache_boot;

- if (num_possible_nodes() == 1)
+ if (!IS_ENABLED(CONFIG_NUMA) || num_possible_nodes() == 1)
use_alien_caches = 0;

for (i = 0; i < NUM_INIT_LISTS; i++)
--
1.9.1

2016-03-28 05:28:05

by Joonsoo Kim

Subject: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink()

From: Joonsoo Kim <[email protected]>

Major kmem_cache metadata in the slab subsystem is synchronized with
the slab_mutex. In SLAB, if some of it changes, the node's shared
array cache is freed and re-populated. If __kmem_cache_shrink() is
called at the same time, it will call drain_array() with n->shared
without holding the node lock, so a problem can happen.

We could fix this small theoretical race condition by holding the node
lock in drain_array(), but holding the slab_mutex in
kmem_cache_shrink() looks like the more appropriate solution, because
a stable state makes things less error-prone and this is not a
performance-critical path.

In addition, annotate the relevant SLAB functions.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 2 ++
mm/slab_common.c | 4 ++++
2 files changed, 6 insertions(+)

diff --git a/mm/slab.c b/mm/slab.c
index a53a0f6..043606a 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2218,6 +2218,7 @@ static void do_drain(void *arg)
ac->avail = 0;
}

+/* Should be called with slab_mutex to prevent from freeing shared array */
static void drain_cpu_caches(struct kmem_cache *cachep)
{
struct kmem_cache_node *n;
@@ -3871,6 +3872,7 @@ skip_setup:
* Drain an array if it contains any elements taking the node lock only if
* necessary. Note that the node listlock also protects the array_cache
* if drain_array() is used on the shared array.
+ * Should be called with slab_mutex to prevent from freeing shared array.
*/
static void drain_array(struct kmem_cache *cachep, struct kmem_cache_node *n,
struct array_cache *ac, int force, int node)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index a65dad7..5bed565 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -755,7 +755,11 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
get_online_cpus();
get_online_mems();
kasan_cache_shrink(cachep);
+
+ mutex_lock(&slab_mutex);
ret = __kmem_cache_shrink(cachep, false);
+ mutex_unlock(&slab_mutex);
+
put_online_mems();
put_online_cpus();
return ret;
--
1.9.1

2016-03-28 05:28:25

by Joonsoo Kim

Subject: [PATCH 10/11] mm/slab: refill cpu cache through a new slab without holding a node lock

From: Joonsoo Kim <[email protected]>

Until now, growing the cache puts a free slab on the node's slab list
and only then do we allocate free objects from it. This necessarily
requires holding the node lock, which is heavily contended. If we
refill the cpu cache before attaching the slab to the node's slab
list, we can avoid holding the node lock as much as possible, because
the newly allocated slab is visible only to the current task. This
reduces lock contention.

Below are the results for concurrent allocation/free in the slab
allocation benchmark Christoph made a long time ago. I simplified the
output. The numbers are cycle counts for alloc/free respectively, so
lower is better.

* Before
Kmalloc N*alloc N*free(32): Average=355/750
Kmalloc N*alloc N*free(64): Average=452/812
Kmalloc N*alloc N*free(128): Average=559/1070
Kmalloc N*alloc N*free(256): Average=1176/980
Kmalloc N*alloc N*free(512): Average=1939/1189
Kmalloc N*alloc N*free(1024): Average=3521/1278
Kmalloc N*alloc N*free(2048): Average=7152/1838
Kmalloc N*alloc N*free(4096): Average=13438/2013

* After
Kmalloc N*alloc N*free(32): Average=248/966
Kmalloc N*alloc N*free(64): Average=261/949
Kmalloc N*alloc N*free(128): Average=314/1016
Kmalloc N*alloc N*free(256): Average=741/1061
Kmalloc N*alloc N*free(512): Average=1246/1152
Kmalloc N*alloc N*free(1024): Average=2437/1259
Kmalloc N*alloc N*free(2048): Average=4980/1800
Kmalloc N*alloc N*free(4096): Average=9000/2078

It shows that contention is reduced for all object sizes and that
performance increases by roughly 30~40%.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 68 +++++++++++++++++++++++++++++++++------------------------------
1 file changed, 36 insertions(+), 32 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 401e60c..029d6b3 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2827,6 +2827,30 @@ static noinline void *cache_alloc_pfmemalloc(struct kmem_cache *cachep,
return obj;
}

+/*
+ * Slab list should be fixed up by fixup_slab_list() for existing slab
+ * or cache_grow_end() for new slab
+ */
+static __always_inline int alloc_block(struct kmem_cache *cachep,
+ struct array_cache *ac, struct page *page, int batchcount)
+{
+ /*
+ * There must be at least one object available for
+ * allocation.
+ */
+ BUG_ON(page->active >= cachep->num);
+
+ while (page->active < cachep->num && batchcount--) {
+ STATS_INC_ALLOCED(cachep);
+ STATS_INC_ACTIVE(cachep);
+ STATS_SET_HIGH(cachep);
+
+ ac->entry[ac->avail++] = slab_get_obj(cachep, page);
+ }
+
+ return batchcount;
+}
+
static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
{
int batchcount;
@@ -2839,7 +2863,6 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
check_irq_off();
node = numa_mem_id();

-retry:
ac = cpu_cache_get(cachep);
batchcount = ac->batchcount;
if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -2869,21 +2892,7 @@ retry:

check_spinlock_acquired(cachep);

- /*
- * The slab was either on partial or free list so
- * there must be at least one object available for
- * allocation.
- */
- BUG_ON(page->active >= cachep->num);
-
- while (page->active < cachep->num && batchcount--) {
- STATS_INC_ALLOCED(cachep);
- STATS_INC_ACTIVE(cachep);
- STATS_SET_HIGH(cachep);
-
- ac->entry[ac->avail++] = slab_get_obj(cachep, page);
- }
-
+ batchcount = alloc_block(cachep, ac, page, batchcount);
fixup_slab_list(cachep, n, page, &list);
}

@@ -2903,21 +2912,18 @@ alloc_done:
}

page = cache_grow_begin(cachep, gfp_exact_node(flags), node);
- cache_grow_end(cachep, page);

/*
* cache_grow_begin() can reenable interrupts,
* then ac could change.
*/
ac = cpu_cache_get(cachep);
- node = numa_mem_id();
+ if (!ac->avail && page)
+ alloc_block(cachep, ac, page, batchcount);
+ cache_grow_end(cachep, page);

- /* no objects in sight? abort */
- if (!page && ac->avail == 0)
+ if (!ac->avail)
return NULL;
-
- if (!ac->avail) /* objects refilled by interrupt? */
- goto retry;
}
ac->touched = 1;

@@ -3111,14 +3117,13 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
{
struct page *page;
struct kmem_cache_node *n;
- void *obj;
+ void *obj = NULL;
void *list = NULL;

VM_BUG_ON(nodeid < 0 || nodeid >= MAX_NUMNODES);
n = get_node(cachep, nodeid);
BUG_ON(!n);

-retry:
check_irq_off();
spin_lock(&n->list_lock);
page = get_first_slab(n, false);
@@ -3140,19 +3145,18 @@ retry:

spin_unlock(&n->list_lock);
fixup_objfreelist_debug(cachep, &list);
- goto done;
+ return obj;

must_grow:
spin_unlock(&n->list_lock);
page = cache_grow_begin(cachep, gfp_exact_node(flags), nodeid);
+ if (page) {
+ /* This slab isn't counted yet so don't update free_objects */
+ obj = slab_get_obj(cachep, page);
+ }
cache_grow_end(cachep, page);
- if (page)
- goto retry;

- return fallback_alloc(cachep, flags);
-
-done:
- return obj;
+ return obj ? obj : fallback_alloc(cachep, flags);
}

static __always_inline void *
--
1.9.1

2016-03-28 05:28:33

by Joonsoo Kim

Subject: [PATCH 11/11] mm/slab: lockless decision to grow cache

From: Joonsoo Kim <[email protected]>

To check precisely whether free objects exist, we need to grab a lock.
But accuracy isn't that important here: the race window is small, and
if there end up being too many free objects, the cache reaper will
reap them. So this patch makes the check for free-object existence
lockless. This will reduce lock contention on allocation-heavy
workloads.

Note that, until now, n->shared could be freed while being used, by a
write to slabinfo. With a small trick in this patch we can access it
safely within an interrupt-disabled section: the old shared array is
freed only after kick_all_cpus_sync(), so a non-NULL pointer observed
with interrupts disabled stays valid until interrupts are re-enabled.

Below are the results for concurrent allocation/free in the slab
allocation benchmark Christoph made a long time ago. I simplified the
output. The numbers are cycle counts for alloc/free respectively, so
lower is better.

* Before
Kmalloc N*alloc N*free(32): Average=248/966
Kmalloc N*alloc N*free(64): Average=261/949
Kmalloc N*alloc N*free(128): Average=314/1016
Kmalloc N*alloc N*free(256): Average=741/1061
Kmalloc N*alloc N*free(512): Average=1246/1152
Kmalloc N*alloc N*free(1024): Average=2437/1259
Kmalloc N*alloc N*free(2048): Average=4980/1800
Kmalloc N*alloc N*free(4096): Average=9000/2078

* After
Kmalloc N*alloc N*free(32): Average=344/792
Kmalloc N*alloc N*free(64): Average=347/882
Kmalloc N*alloc N*free(128): Average=390/959
Kmalloc N*alloc N*free(256): Average=393/1067
Kmalloc N*alloc N*free(512): Average=683/1229
Kmalloc N*alloc N*free(1024): Average=1295/1325
Kmalloc N*alloc N*free(2048): Average=2513/1664
Kmalloc N*alloc N*free(4096): Average=4742/2172

It shows that allocation performance decreases for object sizes up to
128 bytes, possibly due to the extra checks in cache_alloc_refill().
But considering the improvement in free performance, the net result
looks about the same. The results for the other size classes look very
promising: roughly a 50% performance improvement.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 029d6b3..b70aabf 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -951,6 +951,15 @@ static int setup_kmem_cache_node(struct kmem_cache *cachep,
spin_unlock_irq(&n->list_lock);
slabs_destroy(cachep, &list);

+ /*
+ * To protect lockless access to n->shared during irq disabled context.
+ * If n->shared isn't NULL in irq disabled context, accessing to it is
+ * guaranteed to be valid until irq is re-enabled, because it will be
+ * freed after kick_all_cpus_sync().
+ */
+ if (force_change)
+ kick_all_cpus_sync();
+
fail:
kfree(old_shared);
kfree(new_shared);
@@ -2855,7 +2864,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
{
int batchcount;
struct kmem_cache_node *n;
- struct array_cache *ac;
+ struct array_cache *ac, *shared;
int node;
void *list = NULL;
struct page *page;
@@ -2876,11 +2885,16 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
n = get_node(cachep, node);

BUG_ON(ac->avail > 0 || !n);
+ shared = READ_ONCE(n->shared);
+ if (!n->free_objects && (!shared || !shared->avail))
+ goto direct_grow;
+
spin_lock(&n->list_lock);
+ shared = READ_ONCE(n->shared);

/* See if we can refill from the shared array */
- if (n->shared && transfer_objects(ac, n->shared, batchcount)) {
- n->shared->touched = 1;
+ if (shared && transfer_objects(ac, shared, batchcount)) {
+ shared->touched = 1;
goto alloc_done;
}

@@ -2902,6 +2916,7 @@ alloc_done:
spin_unlock(&n->list_lock);
fixup_objfreelist_debug(cachep, &list);

+direct_grow:
if (unlikely(!ac->avail)) {
/* Check if we can use obj in pfmemalloc slab */
if (sk_memalloc_socks()) {
--
1.9.1

2016-03-28 05:28:52

by Joonsoo Kim

Subject: [PATCH 07/11] mm/slab: racy access/modify the slab color

From: Joonsoo Kim <[email protected]>

The slab color doesn't need to be updated strictly, and taking a lock
just to change it can add lock contention, so this patch makes access
to and modification of the slab color racy. This is a preparation step
for implementing a lockless allocation path for when there are no free
objects in the kmem_cache.

Below are the results for concurrent allocation/free in the slab
allocation benchmark Christoph made a long time ago. I simplified the
output. The numbers are cycle counts for alloc/free respectively, so
lower is better.

* Before
Kmalloc N*alloc N*free(32): Average=365/806
Kmalloc N*alloc N*free(64): Average=452/690
Kmalloc N*alloc N*free(128): Average=736/886
Kmalloc N*alloc N*free(256): Average=1167/985
Kmalloc N*alloc N*free(512): Average=2088/1125
Kmalloc N*alloc N*free(1024): Average=4115/1184
Kmalloc N*alloc N*free(2048): Average=8451/1748
Kmalloc N*alloc N*free(4096): Average=16024/2048

* After
Kmalloc N*alloc N*free(32): Average=355/750
Kmalloc N*alloc N*free(64): Average=452/812
Kmalloc N*alloc N*free(128): Average=559/1070
Kmalloc N*alloc N*free(256): Average=1176/980
Kmalloc N*alloc N*free(512): Average=1939/1189
Kmalloc N*alloc N*free(1024): Average=3521/1278
Kmalloc N*alloc N*free(2048): Average=7152/1838
Kmalloc N*alloc N*free(4096): Average=13438/2013

It shows that contention is reduced for object sizes >= 1024 bytes and
that performance increases by roughly 15%.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index df11757..52fc5e3 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2536,20 +2536,7 @@ static int cache_grow(struct kmem_cache *cachep,
}
local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);

- /* Take the node list lock to change the colour_next on this node */
check_irq_off();
- n = get_node(cachep, nodeid);
- spin_lock(&n->list_lock);
-
- /* Get colour for the slab, and cal the next value. */
- offset = n->colour_next;
- n->colour_next++;
- if (n->colour_next >= cachep->colour)
- n->colour_next = 0;
- spin_unlock(&n->list_lock);
-
- offset *= cachep->colour_off;
-
if (gfpflags_allow_blocking(local_flags))
local_irq_enable();

@@ -2570,6 +2557,19 @@ static int cache_grow(struct kmem_cache *cachep,
if (!page)
goto failed;

+ n = get_node(cachep, nodeid);
+
+ /* Get colour for the slab, and cal the next value. */
+ n->colour_next++;
+ if (n->colour_next >= cachep->colour)
+ n->colour_next = 0;
+
+ offset = n->colour_next;
+ if (offset >= cachep->colour)
+ offset = 0;
+
+ offset *= cachep->colour_off;
+
/* Get slab management. */
freelist = alloc_slabmgmt(cachep, page, offset,
local_flags & ~GFP_CONSTRAINT_MASK, nodeid);
--
1.9.1

2016-03-28 05:28:37

by Joonsoo Kim

Subject: [PATCH 09/11] mm/slab: separate cache_grow() to two parts

From: Joonsoo Kim <[email protected]>

This is a preparation step for implementing a lockless allocation path
for when there are no free objects in the kmem_cache. What we'd like to
do here is refill the cpu cache without holding the node lock. To
accomplish this, the refill should be done after the new slab is
allocated but before the slab is attached to the management list. So
this patch separates cache_grow() into two parts, allocation and
attaching to the list, in order to add some code in between them in the
following patch, as sketched below.
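
The resulting call pattern at the allocation sites becomes roughly the
following (a sketch of the intent, not a new hunk):

page = cache_grow_begin(cachep, gfp_exact_node(flags), node);
/* the following patch refills the cpu cache from 'page' right here,
 * while the slab is still invisible to other cpus */
cache_grow_end(cachep, page);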

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 74 ++++++++++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 52 insertions(+), 22 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index ce8ed65..401e60c 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -213,6 +213,11 @@ static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list);
static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp);
static void cache_reap(struct work_struct *unused);

+static inline void fixup_objfreelist_debug(struct kmem_cache *cachep,
+ void **list);
+static inline void fixup_slab_list(struct kmem_cache *cachep,
+ struct kmem_cache_node *n, struct page *page,
+ void **list);
static int slab_early_init = 1;

#define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node))
@@ -1796,7 +1801,7 @@ static size_t calculate_slab_order(struct kmem_cache *cachep,

/*
* Needed to avoid possible looping condition
- * in cache_grow()
+ * in cache_grow_begin()
*/
if (OFF_SLAB(freelist_cache))
continue;
@@ -2518,7 +2523,8 @@ static void slab_map_pages(struct kmem_cache *cache, struct page *page,
* Grow (by 1) the number of slabs within a cache. This is called by
* kmem_cache_alloc() when there are no active objs left in a cache.
*/
-static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid)
+static struct page *cache_grow_begin(struct kmem_cache *cachep,
+ gfp_t flags, int nodeid)
{
void *freelist;
size_t offset;
@@ -2584,21 +2590,40 @@ static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid)

if (gfpflags_allow_blocking(local_flags))
local_irq_disable();
- check_irq_off();
- spin_lock(&n->list_lock);

- /* Make slab active. */
- list_add_tail(&page->lru, &(n->slabs_free));
- STATS_INC_GROWN(cachep);
- n->free_objects += cachep->num;
- spin_unlock(&n->list_lock);
- return page_node;
+ return page;
+
opps1:
kmem_freepages(cachep, page);
failed:
if (gfpflags_allow_blocking(local_flags))
local_irq_disable();
- return -1;
+ return NULL;
+}
+
+static void cache_grow_end(struct kmem_cache *cachep, struct page *page)
+{
+ struct kmem_cache_node *n;
+ void *list = NULL;
+
+ check_irq_off();
+
+ if (!page)
+ return;
+
+ INIT_LIST_HEAD(&page->lru);
+ n = get_node(cachep, page_to_nid(page));
+
+ spin_lock(&n->list_lock);
+ if (!page->active)
+ list_add_tail(&page->lru, &(n->slabs_free));
+ else
+ fixup_slab_list(cachep, n, page, &list);
+ STATS_INC_GROWN(cachep);
+ n->free_objects += cachep->num - page->active;
+ spin_unlock(&n->list_lock);
+
+ fixup_objfreelist_debug(cachep, &list);
}

#if DEBUG
@@ -2809,6 +2834,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
struct array_cache *ac;
int node;
void *list = NULL;
+ struct page *page;

check_irq_off();
node = numa_mem_id();
@@ -2836,7 +2862,6 @@ retry:
}

while (batchcount > 0) {
- struct page *page;
/* Get slab alloc is to come from. */
page = get_first_slab(n, false);
if (!page)
@@ -2869,8 +2894,6 @@ alloc_done:
fixup_objfreelist_debug(cachep, &list);

if (unlikely(!ac->avail)) {
- int x;
-
/* Check if we can use obj in pfmemalloc slab */
if (sk_memalloc_socks()) {
void *obj = cache_alloc_pfmemalloc(cachep, n, flags);
@@ -2879,14 +2902,18 @@ alloc_done:
return obj;
}

- x = cache_grow(cachep, gfp_exact_node(flags), node);
+ page = cache_grow_begin(cachep, gfp_exact_node(flags), node);
+ cache_grow_end(cachep, page);

- /* cache_grow can reenable interrupts, then ac could change. */
+ /*
+ * cache_grow_begin() can reenable interrupts,
+ * then ac could change.
+ */
ac = cpu_cache_get(cachep);
node = numa_mem_id();

/* no objects in sight? abort */
- if (x < 0 && ac->avail == 0)
+ if (!page && ac->avail == 0)
return NULL;

if (!ac->avail) /* objects refilled by interrupt? */
@@ -3019,6 +3046,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
struct zone *zone;
enum zone_type high_zoneidx = gfp_zone(flags);
void *obj = NULL;
+ struct page *page;
int nid;
unsigned int cpuset_mems_cookie;

@@ -3054,8 +3082,10 @@ retry:
* We may trigger various forms of reclaim on the allowed
* set and go into memory reserves if necessary.
*/
- nid = cache_grow(cache, flags, numa_mem_id());
- if (nid >= 0) {
+ page = cache_grow_begin(cache, flags, numa_mem_id());
+ cache_grow_end(cache, page);
+ if (page) {
+ nid = page_to_nid(page);
obj = ____cache_alloc_node(cache,
gfp_exact_node(flags), nid);

@@ -3083,7 +3113,6 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
struct kmem_cache_node *n;
void *obj;
void *list = NULL;
- int x;

VM_BUG_ON(nodeid < 0 || nodeid >= MAX_NUMNODES);
n = get_node(cachep, nodeid);
@@ -3115,8 +3144,9 @@ retry:

must_grow:
spin_unlock(&n->list_lock);
- x = cache_grow(cachep, gfp_exact_node(flags), nodeid);
- if (x >= 0)
+ page = cache_grow_begin(cachep, gfp_exact_node(flags), nodeid);
+ cache_grow_end(cachep, page);
+ if (page)
goto retry;

return fallback_alloc(cachep, flags);
--
1.9.1

2016-03-28 05:28:47

by Joonsoo Kim

Subject: [PATCH 08/11] mm/slab: make cache_grow() handle the page allocated on arbitrary node

From: Joonsoo Kim <[email protected]>

Currently, cache_grow() assumes that the allocated page's node is the
same as the nodeid parameter used for the allocation request. If we
drop this assumption, we can handle the fallback_alloc() case
gracefully. So this patch makes cache_grow() handle a page allocated
on an arbitrary node and cleans up the relevant code.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 60 +++++++++++++++++++++---------------------------------------
1 file changed, 21 insertions(+), 39 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 52fc5e3..ce8ed65 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2518,13 +2518,14 @@ static void slab_map_pages(struct kmem_cache *cache, struct page *page,
* Grow (by 1) the number of slabs within a cache. This is called by
* kmem_cache_alloc() when there are no active objs left in a cache.
*/
-static int cache_grow(struct kmem_cache *cachep,
- gfp_t flags, int nodeid, struct page *page)
+static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid)
{
void *freelist;
size_t offset;
gfp_t local_flags;
+ int page_node;
struct kmem_cache_node *n;
+ struct page *page;

/*
* Be lazy and only check for valid flags here, keeping it out of the
@@ -2552,12 +2553,12 @@ static int cache_grow(struct kmem_cache *cachep,
* Get mem for the objs. Attempt to allocate a physical page from
* 'nodeid'.
*/
- if (!page)
- page = kmem_getpages(cachep, local_flags, nodeid);
+ page = kmem_getpages(cachep, local_flags, nodeid);
if (!page)
goto failed;

- n = get_node(cachep, nodeid);
+ page_node = page_to_nid(page);
+ n = get_node(cachep, page_node);

/* Get colour for the slab, and cal the next value. */
n->colour_next++;
@@ -2572,7 +2573,7 @@ static int cache_grow(struct kmem_cache *cachep,

/* Get slab management. */
freelist = alloc_slabmgmt(cachep, page, offset,
- local_flags & ~GFP_CONSTRAINT_MASK, nodeid);
+ local_flags & ~GFP_CONSTRAINT_MASK, page_node);
if (OFF_SLAB(cachep) && !freelist)
goto opps1;

@@ -2591,13 +2592,13 @@ static int cache_grow(struct kmem_cache *cachep,
STATS_INC_GROWN(cachep);
n->free_objects += cachep->num;
spin_unlock(&n->list_lock);
- return 1;
+ return page_node;
opps1:
kmem_freepages(cachep, page);
failed:
if (gfpflags_allow_blocking(local_flags))
local_irq_disable();
- return 0;
+ return -1;
}

#if DEBUG
@@ -2878,14 +2879,14 @@ alloc_done:
return obj;
}

- x = cache_grow(cachep, gfp_exact_node(flags), node, NULL);
+ x = cache_grow(cachep, gfp_exact_node(flags), node);

/* cache_grow can reenable interrupts, then ac could change. */
ac = cpu_cache_get(cachep);
node = numa_mem_id();

/* no objects in sight? abort */
- if (!x && ac->avail == 0)
+ if (x < 0 && ac->avail == 0)
return NULL;

if (!ac->avail) /* objects refilled by interrupt? */
@@ -3014,7 +3015,6 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
{
struct zonelist *zonelist;
- gfp_t local_flags;
struct zoneref *z;
struct zone *zone;
enum zone_type high_zoneidx = gfp_zone(flags);
@@ -3025,8 +3025,6 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
if (flags & __GFP_THISNODE)
return NULL;

- local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
-
retry_cpuset:
cpuset_mems_cookie = read_mems_allowed_begin();
zonelist = node_zonelist(mempolicy_slab_node(), flags);
@@ -3056,33 +3054,17 @@ retry:
* We may trigger various forms of reclaim on the allowed
* set and go into memory reserves if necessary.
*/
- struct page *page;
+ nid = cache_grow(cache, flags, numa_mem_id());
+ if (nid >= 0) {
+ obj = ____cache_alloc_node(cache,
+ gfp_exact_node(flags), nid);

- if (gfpflags_allow_blocking(local_flags))
- local_irq_enable();
- kmem_flagcheck(cache, flags);
- page = kmem_getpages(cache, local_flags, numa_mem_id());
- if (gfpflags_allow_blocking(local_flags))
- local_irq_disable();
- if (page) {
/*
- * Insert into the appropriate per node queues
+ * Another processor may allocate the objects in
+ * the slab since we are not holding any locks.
*/
- nid = page_to_nid(page);
- if (cache_grow(cache, flags, nid, page)) {
- obj = ____cache_alloc_node(cache,
- gfp_exact_node(flags), nid);
- if (!obj)
- /*
- * Another processor may allocate the
- * objects in the slab since we are
- * not holding any locks.
- */
- goto retry;
- } else {
- /* cache_grow already freed obj */
- obj = NULL;
- }
+ if (!obj)
+ goto retry;
}
}

@@ -3133,8 +3115,8 @@ retry:

must_grow:
spin_unlock(&n->list_lock);
- x = cache_grow(cachep, gfp_exact_node(flags), nodeid, NULL);
- if (x)
+ x = cache_grow(cachep, gfp_exact_node(flags), nodeid);
+ if (x >= 0)
goto retry;

return fallback_alloc(cachep, flags);
--
1.9.1

2016-03-28 05:29:05

by Joonsoo Kim

Subject: [PATCH 06/11] mm/slab: don't keep free slabs if free_objects exceeds free_limit

From: Joonsoo Kim <[email protected]>

Currently, the decision to free a slab is made every time a free
object is put back into a slab. The problem is that free slabs are not
freed even when we have free slabs and more free_objects than
free_limit, if the slab currently being processed isn't itself a free
slab. This can keep too much memory in the slab subsystem. This patch
tries to fix that by checking the number of free objects after all the
free work is done. If there are free slabs at that point, we can free
them and so keep the number of free slabs as small as possible.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index b96f381..df11757 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3258,6 +3258,9 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
{
int i;
struct kmem_cache_node *n = get_node(cachep, node);
+ struct page *page;
+
+ n->free_objects += nr_objects;

for (i = 0; i < nr_objects; i++) {
void *objp;
@@ -3270,17 +3273,11 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
check_spinlock_acquired_node(cachep, node);
slab_put_obj(cachep, page, objp);
STATS_DEC_ACTIVE(cachep);
- n->free_objects++;

/* fixup slab chains */
- if (page->active == 0) {
- if (n->free_objects > n->free_limit) {
- n->free_objects -= cachep->num;
- list_add_tail(&page->lru, list);
- } else {
- list_add(&page->lru, &n->slabs_free);
- }
- } else {
+ if (page->active == 0)
+ list_add(&page->lru, &n->slabs_free);
+ else {
/* Unconditionally move a slab to the end of the
* partial list on free - maximum time for the
* other objects to be freed, too.
@@ -3288,6 +3285,14 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
list_add_tail(&page->lru, &n->slabs_partial);
}
}
+
+ while (n->free_objects > n->free_limit && !list_empty(&n->slabs_free)) {
+ n->free_objects -= cachep->num;
+
+ page = list_last_entry(&n->slabs_free, struct page, lru);
+ list_del(&page->lru);
+ list_add(&page->lru, list);
+ }
}

static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
--
1.9.1

2016-03-28 05:29:51

by Joonsoo Kim

Subject: [PATCH 05/11] mm/slab: clean-up kmem_cache_node setup

From: Joonsoo Kim <[email protected]>

The code for setting up a kmem_cache_node is mostly the same in
cpuup_prepare() and alloc_kmem_cache_node(). Factor it out and clean
it up.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 167 +++++++++++++++++++++++++-------------------------------------
1 file changed, 67 insertions(+), 100 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 569d7db..b96f381 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -898,6 +898,62 @@ static int init_cache_node_node(int node)
return 0;
}

+static int setup_kmem_cache_node(struct kmem_cache *cachep,
+ int node, gfp_t gfp, bool force_change)
+{
+ int ret = -ENOMEM;
+ struct kmem_cache_node *n;
+ struct array_cache *old_shared = NULL;
+ struct array_cache *new_shared = NULL;
+ struct alien_cache **new_alien = NULL;
+ LIST_HEAD(list);
+
+ if (use_alien_caches) {
+ new_alien = alloc_alien_cache(node, cachep->limit, gfp);
+ if (!new_alien)
+ goto fail;
+ }
+
+ if (cachep->shared) {
+ new_shared = alloc_arraycache(node,
+ cachep->shared * cachep->batchcount, 0xbaadf00d, gfp);
+ if (!new_shared)
+ goto fail;
+ }
+
+ ret = init_cache_node(cachep, node, gfp);
+ if (ret)
+ goto fail;
+
+ n = get_node(cachep, node);
+ spin_lock_irq(&n->list_lock);
+ if (n->shared) {
+ free_block(cachep, n->shared->entry,
+ n->shared->avail, node, &list);
+ }
+
+ if (!n->shared || force_change) {
+ old_shared = n->shared;
+ n->shared = new_shared;
+ new_shared = NULL;
+ }
+
+ if (!n->alien) {
+ n->alien = new_alien;
+ new_alien = NULL;
+ }
+
+ spin_unlock_irq(&n->list_lock);
+ slabs_destroy(cachep, &list);
+
+fail:
+ kfree(old_shared);
+ kfree(new_shared);
+ free_alien_cache(new_alien);
+
+ return ret;
+}
+
static void cpuup_canceled(long cpu)
{
struct kmem_cache *cachep;
@@ -969,7 +1025,6 @@ free_slab:
static int cpuup_prepare(long cpu)
{
struct kmem_cache *cachep;
- struct kmem_cache_node *n = NULL;
int node = cpu_to_mem(cpu);
int err;

@@ -988,44 +1043,9 @@ static int cpuup_prepare(long cpu)
* array caches
*/
list_for_each_entry(cachep, &slab_caches, list) {
- struct array_cache *shared = NULL;
- struct alien_cache **alien = NULL;
-
- if (cachep->shared) {
- shared = alloc_arraycache(node,
- cachep->shared * cachep->batchcount,
- 0xbaadf00d, GFP_KERNEL);
- if (!shared)
- goto bad;
- }
- if (use_alien_caches) {
- alien = alloc_alien_cache(node, cachep->limit, GFP_KERNEL);
- if (!alien) {
- kfree(shared);
- goto bad;
- }
- }
- n = get_node(cachep, node);
- BUG_ON(!n);
-
- spin_lock_irq(&n->list_lock);
- if (!n->shared) {
- /*
- * We are serialised from CPU_DEAD or
- * CPU_UP_CANCELLED by the cpucontrol lock
- */
- n->shared = shared;
- shared = NULL;
- }
-#ifdef CONFIG_NUMA
- if (!n->alien) {
- n->alien = alien;
- alien = NULL;
- }
-#endif
- spin_unlock_irq(&n->list_lock);
- kfree(shared);
- free_alien_cache(alien);
+ err = setup_kmem_cache_node(cachep, node, GFP_KERNEL, false);
+ if (err)
+ goto bad;
}

return 0;
@@ -3652,72 +3672,19 @@ EXPORT_SYMBOL(kfree);
/*
* This initializes kmem_cache_node or resizes various caches for all nodes.
*/
-static int alloc_kmem_cache_node(struct kmem_cache *cachep, gfp_t gfp)
+static int setup_kmem_cache_node_node(struct kmem_cache *cachep, gfp_t gfp)
{
+ int ret;
int node;
struct kmem_cache_node *n;
- struct array_cache *new_shared;
- struct alien_cache **new_alien = NULL;

for_each_online_node(node) {
-
- if (use_alien_caches) {
- new_alien = alloc_alien_cache(node, cachep->limit, gfp);
- if (!new_alien)
- goto fail;
- }
-
- new_shared = NULL;
- if (cachep->shared) {
- new_shared = alloc_arraycache(node,
- cachep->shared*cachep->batchcount,
- 0xbaadf00d, gfp);
- if (!new_shared) {
- free_alien_cache(new_alien);
- goto fail;
- }
- }
-
- n = get_node(cachep, node);
- if (n) {
- struct array_cache *shared = n->shared;
- LIST_HEAD(list);
-
- spin_lock_irq(&n->list_lock);
-
- if (shared)
- free_block(cachep, shared->entry,
- shared->avail, node, &list);
-
- n->shared = new_shared;
- if (!n->alien) {
- n->alien = new_alien;
- new_alien = NULL;
- }
- n->free_limit = (1 + nr_cpus_node(node)) *
- cachep->batchcount + cachep->num;
- spin_unlock_irq(&n->list_lock);
- slabs_destroy(cachep, &list);
- kfree(shared);
- free_alien_cache(new_alien);
- continue;
- }
- n = kmalloc_node(sizeof(struct kmem_cache_node), gfp, node);
- if (!n) {
- free_alien_cache(new_alien);
- kfree(new_shared);
+ ret = setup_kmem_cache_node(cachep, node, gfp, true);
+ if (ret)
goto fail;
- }

- kmem_cache_node_init(n);
- n->next_reap = jiffies + REAPTIMEOUT_NODE +
- ((unsigned long)cachep) % REAPTIMEOUT_NODE;
- n->shared = new_shared;
- n->alien = new_alien;
- n->free_limit = (1 + nr_cpus_node(node)) *
- cachep->batchcount + cachep->num;
- cachep->node[node] = n;
}
+
return 0;

fail:
@@ -3759,7 +3726,7 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
cachep->shared = shared;

if (!prev)
- goto alloc_node;
+ goto setup_node;

for_each_online_cpu(cpu) {
LIST_HEAD(list);
@@ -3776,8 +3743,8 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
}
free_percpu(prev);

-alloc_node:
- return alloc_kmem_cache_node(cachep, gfp);
+setup_node:
+ return setup_kmem_cache_node_node(cachep, gfp);
}

static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
--
1.9.1

2016-03-28 08:58:40

by Geert Uytterhoeven

Subject: Re: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again

Hi Joonsoo,

On Mon, Mar 28, 2016 at 7:26 AM, <[email protected]> wrote:
> From: Joonsoo Kim <[email protected]>
>
> Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by
> 'commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")'
> because it causes a problem on m68k which has many node
> but !CONFIG_NUMA. In this case, although alien cache isn't used
> at all but to cope with some initialization path, garbage value
> is used and that is BAD_ALIEN_MAGIC. Now, this patch set
> use_alien_caches to 0 when !CONFIG_NUMA, there is no initialization
> path problem so we don't need BAD_ALIEN_MAGIC at all. So remove it.
>
> Signed-off-by: Joonsoo Kim <[email protected]>

I gave this a try on m68k/ARAnyM, and it didn't crash, unlike the previous
version that was reverted, so
Tested-by: Geert Uytterhoeven <[email protected]>

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2016-03-28 21:19:14

by Andrew Morton

Subject: Re: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again

On Mon, 28 Mar 2016 14:26:52 +0900 [email protected] wrote:

> From: Joonsoo Kim <[email protected]>
>
> Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by
> 'commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")'
> because it causes a problem on m68k which has many node
> but !CONFIG_NUMA.

Whaaa? How is that even possible? I'd have thought that everything
would break at compile time (at least) with such a setup.

> In this case, although alien cache isn't used
> at all but to cope with some initialization path, garbage value
> is used and that is BAD_ALIEN_MAGIC. Now, this patch set
> use_alien_caches to 0 when !CONFIG_NUMA, there is no initialization
> path problem so we don't need BAD_ALIEN_MAGIC at all. So remove it.
>
> ...
>
> @@ -1205,7 +1203,7 @@ void __init kmem_cache_init(void)
> sizeof(struct rcu_head));
> kmem_cache = &kmem_cache_boot;
>
> - if (num_possible_nodes() == 1)
> + if (!IS_ENABLED(CONFIG_NUMA) || num_possible_nodes() == 1)
> use_alien_caches = 0;
>
> for (i = 0; i < NUM_INIT_LISTS; i++)

This does look screwy. How can num_possible_nodes() possibly return
anything but "1" if CONFIG_NUMA=n.

Can we please get a code comment in here to explain things to the poor
old reader and to prevent people from trying to "fix" it?
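
For illustration, based on the changelog above, such a comment might
look roughly like this (just a sketch):

/*
 * m68k can be built with multiple memory nodes but CONFIG_NUMA=n, so
 * num_possible_nodes() may be larger than 1 even on !NUMA kernels.
 * Disable alien caches explicitly in that case.
 */
if (!IS_ENABLED(CONFIG_NUMA) || num_possible_nodes() == 1)
        use_alien_caches = 0;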

by Christoph Lameter

Subject: Re: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink()

On Mon, 28 Mar 2016, [email protected] wrote:

> Major kmem_cache metadata in slab subsystem is synchronized with
> the slab_mutex. In SLAB, if some of them is changed, node's shared
> array cache would be freed and re-populated. If __kmem_cache_shrink()
> is called at the same time, it will call drain_array() with n->shared
> without holding node lock so problem can happen.
>
> We can fix this small theoretical race condition by holding node lock
> in drain_array(), but, holding a slab_mutex in kmem_cache_shrink()
> looks more appropriate solution because stable state would make things
> less error-prone and this is not performance critical path.

Ummm.. The mutex taking is added to common code. So this will also affect
SLUB. The patch needs to consider this. Do we want to force all
allocators to run shrinking only when holding the lock? SLUB does not
need to hold the mutex. And frankly the mutex is for reconfiguration of
metadata which is *not* occurring here. A shrink operation does not do
that. Can we figure out a slab specific way of handling synchronization
in the strange free/realloc cycle?

It seems that taking the node lock is the appropriate level of
synchronization since the concern is with the contents of a shared cache
at that level. There is no change of metadata which would require the
mutex.

Subject: Re: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again

On Mon, 28 Mar 2016, Andrew Morton wrote:

> On Mon, 28 Mar 2016 14:26:52 +0900 [email protected] wrote:
>
> > From: Joonsoo Kim <[email protected]>
> >
> > Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by
> > 'commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")'
> > because it causes a problem on m68k which has many node
> > but !CONFIG_NUMA.
>
> Whaaa? How is that even possible? I'd have thought that everything
> would break at compile time (at least) with such a setup.

Yes we have that and the support for this caused numerous issues. Can we
stop supporting such a configuration?

by Christoph Lameter

Subject: Re: [PATCH 04/11] mm/slab: factor out kmem_cache_node initialization code

On Mon, 28 Mar 2016, [email protected] wrote:

> From: Joonsoo Kim <[email protected]>
> - spin_lock_irq(&n->list_lock);
> - n->free_limit =
> - (1 + nr_cpus_node(node)) *
> - cachep->batchcount + cachep->num;
> - spin_unlock_irq(&n->list_lock);
> + ret = init_cache_node(cachep, node, GFP_KERNEL);
> + if (ret)
> + return ret;

Drop ret and do a

return init_cache_node(...);

instead?

by Christoph Lameter

Subject: Re: [PATCH 05/11] mm/slab: clean-up kmem_cache_node setup

On Mon, 28 Mar 2016, [email protected] wrote:

> * This initializes kmem_cache_node or resizes various caches for all nodes.
> */
> -static int alloc_kmem_cache_node(struct kmem_cache *cachep, gfp_t gfp)
> +static int setup_kmem_cache_node_node(struct kmem_cache *cachep, gfp_t gfp)

... _node_node? Isn't there a better name for it?

by Christoph Lameter

Subject: Re: [PATCH 06/11] mm/slab: don't keep free slabs if free_objects exceeds free_limit

On Mon, 28 Mar 2016, [email protected] wrote:

> From: Joonsoo Kim <[email protected]>
>
> Currently, determination to free a slab is done whenever free object is
> put into the slab. This has a problem that free slabs are not freed
> even if we have free slabs and have more free_objects than free_limit

There needs to be a better explanation here since I do not get why there
is an issue with checking after free if a slab is actually free.

> when processed slab isn't a free slab. This would cause to keep
> too much memory in the slab subsystem. This patch try to fix it
> by checking number of free object after all free work is done. If there
> is free slab at that time, we can free it so we keep free slab as minimal
> as possible.

Ok if we check after free work is done then the number of free slabs may
be higher than the limit set and then we free the additional slabs to get
down to the limit that was set?

by Christoph Lameter

Subject: Re: [PATCH 07/11] mm/slab: racy access/modify the slab color

On Mon, 28 Mar 2016, [email protected] wrote:

> From: Joonsoo Kim <[email protected]>
>
> Slab color isn't needed to be changed strictly. Because locking
> for changing slab color could cause more lock contention so this patch
> implements racy access/modify the slab color. This is a preparation step
> to implement lockless allocation path when there is no free objects in
> the kmem_cache.

Acked-by: Christoph Lameter <[email protected]>

The rest of the description does not relate to this patch and does not
actually reflect the improvement of applying this patch. Remove the rest?


> Below is the result of concurrent allocation/free in slab allocation
> benchmark made by Christoph a long time ago. I make the output simpler.
> The number shows cycle count during alloc/free respectively so less
> is better.
>
> * Before
> Kmalloc N*alloc N*free(32): Average=365/806
> Kmalloc N*alloc N*free(64): Average=452/690
> Kmalloc N*alloc N*free(128): Average=736/886
> Kmalloc N*alloc N*free(256): Average=1167/985
> Kmalloc N*alloc N*free(512): Average=2088/1125
> Kmalloc N*alloc N*free(1024): Average=4115/1184
> Kmalloc N*alloc N*free(2048): Average=8451/1748
> Kmalloc N*alloc N*free(4096): Average=16024/2048
>
> * After
> Kmalloc N*alloc N*free(32): Average=355/750
> Kmalloc N*alloc N*free(64): Average=452/812
> Kmalloc N*alloc N*free(128): Average=559/1070
> Kmalloc N*alloc N*free(256): Average=1176/980
> Kmalloc N*alloc N*free(512): Average=1939/1189
> Kmalloc N*alloc N*free(1024): Average=3521/1278
> Kmalloc N*alloc N*free(2048): Average=7152/1838
> Kmalloc N*alloc N*free(4096): Average=13438/2013
>
> It shows that contention is reduced for object size >= 1024
> and performance increases by roughly 15%.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> mm/slab.c | 26 +++++++++++++-------------
> 1 file changed, 13 insertions(+), 13 deletions(-)
>
> diff --git a/mm/slab.c b/mm/slab.c
> index df11757..52fc5e3 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -2536,20 +2536,7 @@ static int cache_grow(struct kmem_cache *cachep,
> }
> local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
>
> - /* Take the node list lock to change the colour_next on this node */
> check_irq_off();
> - n = get_node(cachep, nodeid);
> - spin_lock(&n->list_lock);
> -
> - /* Get colour for the slab, and cal the next value. */
> - offset = n->colour_next;
> - n->colour_next++;
> - if (n->colour_next >= cachep->colour)
> - n->colour_next = 0;
> - spin_unlock(&n->list_lock);
> -
> - offset *= cachep->colour_off;
> -
> if (gfpflags_allow_blocking(local_flags))
> local_irq_enable();
>
> @@ -2570,6 +2557,19 @@ static int cache_grow(struct kmem_cache *cachep,
> if (!page)
> goto failed;
>
> + n = get_node(cachep, nodeid);
> +
> + /* Get colour for the slab, and cal the next value. */
> + n->colour_next++;
> + if (n->colour_next >= cachep->colour)
> + n->colour_next = 0;
> +
> + offset = n->colour_next;
> + if (offset >= cachep->colour)
> + offset = 0;
> +
> + offset *= cachep->colour_off;
> +
> /* Get slab management. */
> freelist = alloc_slabmgmt(cachep, page, offset,
> local_flags & ~GFP_CONSTRAINT_MASK, nodeid);
>

2016-03-30 08:09:40

by Joonsoo Kim

Subject: Re: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again

On Mon, Mar 28, 2016 at 10:58:38AM +0200, Geert Uytterhoeven wrote:
> Hi Jonsoo,
>
> On Mon, Mar 28, 2016 at 7:26 AM, <[email protected]> wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by
> > 'commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")'
> > because it causes a problem on m68k which has many node
> > but !CONFIG_NUMA. In this case, although alien cache isn't used
> > at all but to cope with some initialization path, garbage value
> > is used and that is BAD_ALIEN_MAGIC. Now, this patch set
> > use_alien_caches to 0 when !CONFIG_NUMA, there is no initialization
> > path problem so we don't need BAD_ALIEN_MAGIC at all. So remove it.
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
>
> I gave this a try on m68k/ARAnyM, and it didn't crash, unlike the previous
> version that was reverted, so
> Tested-by: Geert Uytterhoeven <[email protected]>

Thanks for testing!!!

Thanks.

2016-03-30 08:09:59

by Joonsoo Kim

Subject: Re: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink()

On Mon, Mar 28, 2016 at 07:50:36PM -0500, Christoph Lameter wrote:
> On Mon, 28 Mar 2016, [email protected] wrote:
>
> > Major kmem_cache metadata in slab subsystem is synchronized with
> > the slab_mutex. In SLAB, if some of them is changed, node's shared
> > array cache would be freed and re-populated. If __kmem_cache_shrink()
> > is called at the same time, it will call drain_array() with n->shared
> > without holding node lock so problem can happen.
> >
> > We can fix this small theoretical race condition by holding node lock
> > in drain_array(), but, holding a slab_mutex in kmem_cache_shrink()
> > looks more appropriate solution because stable state would make things
> > less error-prone and this is not performance critical path.
>
> Ummm.. The mutex taking is added to common code. So this will also affect
> SLUB. The patch needs to consider this. Do we want to force all
> allocators to run shrinking only when holding the lock? SLUB does not
> need to hold the mutex. And frankly the mutex is for reconfiguration of
> metadata which is *not* occurring here. A shrink operation does not do
> that. Can we figure out a slab specific way of handling synchronization
> in the strange free/realloc cycle?
>
> It seems that taking the node lock is the appropriate level of
> synchrnonization since the concern is with the contents of a shared cache
> at that level. There is no change of metadata which would require the
> mutex.

Okay. I will fix it.

Thanks.

2016-03-30 08:11:03

by Joonsoo Kim

Subject: Re: [PATCH 04/11] mm/slab: factor out kmem_cache_node initialization code

On Mon, Mar 28, 2016 at 07:56:15PM -0500, Christoph Lameter wrote:
> On Mon, 28 Mar 2016, [email protected] wrote:
>
> > From: Joonsoo Kim <[email protected]>
> > - spin_lock_irq(&n->list_lock);
> > - n->free_limit =
> > - (1 + nr_cpus_node(node)) *
> > - cachep->batchcount + cachep->num;
> > - spin_unlock_irq(&n->list_lock);
> > + ret = init_cache_node(cachep, node, GFP_KERNEL);
> > + if (ret)
> > + return ret;
>
> Drop ret and do a
>
> return init_cache_node(...);
>
> instead?

Will do it.

Thanks.

2016-03-30 08:13:31

by Joonsoo Kim

Subject: Re: [PATCH 05/11] mm/slab: clean-up kmem_cache_node setup

On Mon, Mar 28, 2016 at 07:58:29PM -0500, Christoph Lameter wrote:
> On Mon, 28 Mar 2016, [email protected] wrote:
>
> > * This initializes kmem_cache_node or resizes various caches for all nodes.
> > */
> > -static int alloc_kmem_cache_node(struct kmem_cache *cachep, gfp_t gfp)
> > +static int setup_kmem_cache_node_node(struct kmem_cache *cachep, gfp_t gfp)
>
> ... _node_node? Isnt there a better name for it?

I will think about it more. The reason I use this name is that another
site uses this naming convention; I'm just mimicking it. :)
A suggestion would be much appreciated.

Thanks.

2016-03-30 08:23:41

by Joonsoo Kim

Subject: Re: [PATCH 06/11] mm/slab: don't keep free slabs if free_objects exceeds free_limit

On Mon, Mar 28, 2016 at 08:03:16PM -0500, Christoph Lameter wrote:
> On Mon, 28 Mar 2016, [email protected] wrote:
>
> > From: Joonsoo Kim <[email protected]>
> >
> > Currently, determination to free a slab is done whenever free object is
> > put into the slab. This has a problem that free slabs are not freed
> > even if we have free slabs and have more free_objects than free_limit
>
> There needs to be a better explanation here since I do not get why there
> is an issue with checking after free if a slab is actually free.

Okay. Consider the following situation, where 3 objects are freed.

free_limit = 10
nr_free = 9

free(free slab) free(free slab) free(not-free slab)

If we check one by one, then when nr_free > free_limit (at the last
free), we cannot free a slab because the slab currently being
processed isn't a free slab.

But if we check once at the end, we can free one free slab.
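
In code terms, the patch defers the trimming to a single pass after
the whole batch has been returned; this is the loop added at the end
of free_block() in the patch:

while (n->free_objects > n->free_limit && !list_empty(&n->slabs_free)) {
        n->free_objects -= cachep->num;

        page = list_last_entry(&n->slabs_free, struct page, lru);
        list_del(&page->lru);
        list_add(&page->lru, list);
}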

I will add more explanation on the next version.

>
> > when processed slab isn't a free slab. This would cause to keep
> > too much memory in the slab subsystem. This patch try to fix it
> > by checking number of free object after all free work is done. If there
> > is free slab at that time, we can free it so we keep free slab as minimal
> > as possible.
>
> Ok if we check after free work is done then the number of free slabs may
> be higher than the limit set and then we free the additional slabs to get
> down to the limit that was set?

Yes.

Thanks.

2016-03-30 08:24:09

by Joonsoo Kim

Subject: Re: [PATCH 07/11] mm/slab: racy access/modify the slab color

On Mon, Mar 28, 2016 at 08:05:41PM -0500, Christoph Lameter wrote:
> On Mon, 28 Mar 2016, [email protected] wrote:
>
> > From: Joonsoo Kim <[email protected]>
> >
> > Slab color isn't needed to be changed strictly. Because locking
> > for changing slab color could cause more lock contention so this patch
> > implements racy access/modify the slab color. This is a preparation step
> > to implement lockless allocation path when there is no free objects in
> > the kmem_cache.
>
> Acked-by: Christoph Lameter <[email protected]>
>
> The rest of the description does not relate to this patch and does not
> actually reflect the improvement of applying this patch. Remove the rest?

No, the improvement below is for this individual patch.

Thanks.

>
>
> > Below is the result of concurrent allocation/free in slab allocation
> > benchmark made by Christoph a long time ago. I make the output simpler.
> > The number shows cycle count during alloc/free respectively so less
> > is better.
> >
> > * Before
> > Kmalloc N*alloc N*free(32): Average=365/806
> > Kmalloc N*alloc N*free(64): Average=452/690
> > Kmalloc N*alloc N*free(128): Average=736/886
> > Kmalloc N*alloc N*free(256): Average=1167/985
> > Kmalloc N*alloc N*free(512): Average=2088/1125
> > Kmalloc N*alloc N*free(1024): Average=4115/1184
> > Kmalloc N*alloc N*free(2048): Average=8451/1748
> > Kmalloc N*alloc N*free(4096): Average=16024/2048
> >
> > * After
> > Kmalloc N*alloc N*free(32): Average=355/750
> > Kmalloc N*alloc N*free(64): Average=452/812
> > Kmalloc N*alloc N*free(128): Average=559/1070
> > Kmalloc N*alloc N*free(256): Average=1176/980
> > Kmalloc N*alloc N*free(512): Average=1939/1189
> > Kmalloc N*alloc N*free(1024): Average=3521/1278
> > Kmalloc N*alloc N*free(2048): Average=7152/1838
> > Kmalloc N*alloc N*free(4096): Average=13438/2013
> >
> > It shows that contention is reduced for object size >= 1024
> > and performance increases by roughly 15%.
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
> > ---
> > mm/slab.c | 26 +++++++++++++-------------
> > 1 file changed, 13 insertions(+), 13 deletions(-)
> >
> > diff --git a/mm/slab.c b/mm/slab.c
> > index df11757..52fc5e3 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -2536,20 +2536,7 @@ static int cache_grow(struct kmem_cache *cachep,
> > }
> > local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
> >
> > - /* Take the node list lock to change the colour_next on this node */
> > check_irq_off();
> > - n = get_node(cachep, nodeid);
> > - spin_lock(&n->list_lock);
> > -
> > - /* Get colour for the slab, and cal the next value. */
> > - offset = n->colour_next;
> > - n->colour_next++;
> > - if (n->colour_next >= cachep->colour)
> > - n->colour_next = 0;
> > - spin_unlock(&n->list_lock);
> > -
> > - offset *= cachep->colour_off;
> > -
> > if (gfpflags_allow_blocking(local_flags))
> > local_irq_enable();
> >
> > @@ -2570,6 +2557,19 @@ static int cache_grow(struct kmem_cache *cachep,
> > if (!page)
> > goto failed;
> >
> > + n = get_node(cachep, nodeid);
> > +
> > + /* Get colour for the slab, and cal the next value. */
> > + n->colour_next++;
> > + if (n->colour_next >= cachep->colour)
> > + n->colour_next = 0;
> > +
> > + offset = n->colour_next;
> > + if (offset >= cachep->colour)
> > + offset = 0;
> > +
> > + offset *= cachep->colour_off;
> > +
> > /* Get slab management. */
> > freelist = alloc_slabmgmt(cachep, page, offset,
> > local_flags & ~GFP_CONSTRAINT_MASK, nodeid);
> >
>

2016-03-31 10:53:20

by Angel Shtilianov

Subject: Re: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink()



On 03/28/2016 08:26 AM, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> Major kmem_cache metadata in slab subsystem is synchronized with
> the slab_mutex. In SLAB, if some of them is changed, node's shared
> array cache would be freed and re-populated. If __kmem_cache_shrink()
> is called at the same time, it will call drain_array() with n->shared
> without holding node lock so problem can happen.
>
> We can fix this small theoretical race condition by holding node lock
> in drain_array(), but, holding a slab_mutex in kmem_cache_shrink()
> looks more appropriate solution because stable state would make things
> less error-prone and this is not performance critical path.
>
> In addtion, annotate on SLAB functions.

Just a nit, but wouldn't it be better to use lockdep_assert_held/_once
instead of the comment-style annotation? In both cases, someone who
wants to understand which locks have to be held will go and read the
source, and in my mind it's easier to miss a comment line than a
lockdep_assert. Furthermore, if lockdep is enabled, a locking
violation would spew useful info to dmesg.
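
For example, a minimal sketch of what that would look like (untested):

static void drain_cpu_caches(struct kmem_cache *cachep)
{
        /* lockdep complains at runtime if slab_mutex is not held */
        lockdep_assert_held(&slab_mutex);

        /* ... existing drain logic unchanged ... */
}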

2016-04-01 02:16:04

by Joonsoo Kim

Subject: Re: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink()

On Thu, Mar 31, 2016 at 01:53:14PM +0300, Nikolay Borisov wrote:
>
>
> On 03/28/2016 08:26 AM, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > Major kmem_cache metadata in slab subsystem is synchronized with
> > the slab_mutex. In SLAB, if some of them is changed, node's shared
> > array cache would be freed and re-populated. If __kmem_cache_shrink()
> > is called at the same time, it will call drain_array() with n->shared
> > without holding node lock so problem can happen.
> >
> > We can fix this small theoretical race condition by holding node lock
> > in drain_array(), but, holding a slab_mutex in kmem_cache_shrink()
> > looks more appropriate solution because stable state would make things
> > less error-prone and this is not performance critical path.
> >
> > In addtion, annotate on SLAB functions.
>
> Just a nit but would it not be better instead of doing comment-style
> annotation to use lockdep_assert_held/_once. In both cases for someone
> to understand what locks have to be held will go and read the source. In
> my mind it's easier to miss a comment line, rather than the
> lockdep_assert. Furthermore in case lockdep is enabled a locking
> violation would spew useful info to dmesg.

Good idea. I'm not sure if lockdep_assert is the best fit, but I will
add something to check it rather than just adding the comment.

Thanks.