From: Joonsoo Kim <[email protected]>
Major changes from v1
o hold node lock instead of slab_mutex in kmem_cache_shrink()
o fix suspend-to-ram issue reported by Nishanth
o use synchronize_sched() instead of kick_all_cpus_sync()
Under concurrent allocation, SLAB can be contended heavily because it does
a lot of work while holding a lock. This patchset tries to shrink the
critical sections to reduce lock contention. The major changes are a
lockless decision to allocate more slabs and a lockless cpu cache refill
from the newly allocated slab.
Below are the results of the concurrent allocation/free test in the slab
allocation benchmark Christoph made a long time ago. I simplified the
output. The numbers are the cycle counts for alloc/free respectively, so
lower is better.
* Before
Kmalloc N*alloc N*free(32): Average=365/806
Kmalloc N*alloc N*free(64): Average=452/690
Kmalloc N*alloc N*free(128): Average=736/886
Kmalloc N*alloc N*free(256): Average=1167/985
Kmalloc N*alloc N*free(512): Average=2088/1125
Kmalloc N*alloc N*free(1024): Average=4115/1184
Kmalloc N*alloc N*free(2048): Average=8451/1748
Kmalloc N*alloc N*free(4096): Average=16024/2048
* After
Kmalloc N*alloc N*free(32): Average=344/792
Kmalloc N*alloc N*free(64): Average=347/882
Kmalloc N*alloc N*free(128): Average=390/959
Kmalloc N*alloc N*free(256): Average=393/1067
Kmalloc N*alloc N*free(512): Average=683/1229
Kmalloc N*alloc N*free(1024): Average=1295/1325
Kmalloc N*alloc N*free(2048): Average=2513/1664
Kmalloc N*alloc N*free(4096): Average=4742/2172
The results show that performance improves greatly (roughly by more than
50%) for object classes larger than 128 bytes.
Joonsoo Kim (11):
mm/slab: fix the theoretical race by holding proper lock
mm/slab: remove BAD_ALIEN_MAGIC again
mm/slab: drain the free slab as much as possible
mm/slab: factor out kmem_cache_node initialization code
mm/slab: clean-up kmem_cache_node setup
mm/slab: don't keep free slabs if free_objects exceeds free_limit
mm/slab: racy access/modify the slab color
mm/slab: make cache_grow() handle the page allocated on arbitrary node
mm/slab: separate cache_grow() to two parts
mm/slab: refill cpu cache through a new slab without holding a node
lock
mm/slab: lockless decision to grow cache
mm/slab.c | 562 +++++++++++++++++++++++++++++++++-----------------------------
1 file changed, 295 insertions(+), 267 deletions(-)
--
1.9.1
From: Joonsoo Kim <[email protected]>
The initial attempt to remove BAD_ALIEN_MAGIC was reverted by commit
edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"") because it caused a
problem on m68k, which has many nodes but !CONFIG_NUMA. In that case, the
alien cache isn't used at all, but to cope with some initialization paths,
a garbage value is used, and that garbage value is BAD_ALIEN_MAGIC.
Now that this patch sets use_alien_caches to 0 when !CONFIG_NUMA, there is
no initialization path problem, so we don't need BAD_ALIEN_MAGIC at all.
Remove it.
Tested-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index d8746c0..373b8be 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -421,8 +421,6 @@ static struct kmem_cache kmem_cache_boot = {
.name = "kmem_cache",
};
-#define BAD_ALIEN_MAGIC 0x01020304ul
-
static DEFINE_PER_CPU(struct delayed_work, slab_reap_work);
static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep)
@@ -637,7 +635,7 @@ static int transfer_objects(struct array_cache *to,
static inline struct alien_cache **alloc_alien_cache(int node,
int limit, gfp_t gfp)
{
- return (struct alien_cache **)BAD_ALIEN_MAGIC;
+ return NULL;
}
static inline void free_alien_cache(struct alien_cache **ac_ptr)
@@ -1205,7 +1203,7 @@ void __init kmem_cache_init(void)
sizeof(struct rcu_head));
kmem_cache = &kmem_cache_boot;
- if (num_possible_nodes() == 1)
+ if (!IS_ENABLED(CONFIG_NUMA) || num_possible_nodes() == 1)
use_alien_caches = 0;
for (i = 0; i < NUM_INIT_LISTS; i++)
--
1.9.1
From: Joonsoo Kim <[email protected]>
Under concurrent allocation, SLAB can be contended heavily because it does
a lot of work while holding a lock. This patchset tries to shrink the
critical sections to reduce lock contention. The major changes are a
lockless decision to allocate more slabs and a lockless cpu cache refill
from the newly allocated slab.
Below are the results of the concurrent allocation/free test in the slab
allocation benchmark Christoph made a long time ago. I simplified the
output. The numbers are the cycle counts for alloc/free respectively, so
lower is better.
* Before
Kmalloc N*alloc N*free(32): Average=365/806
Kmalloc N*alloc N*free(64): Average=452/690
Kmalloc N*alloc N*free(128): Average=736/886
Kmalloc N*alloc N*free(256): Average=1167/985
Kmalloc N*alloc N*free(512): Average=2088/1125
Kmalloc N*alloc N*free(1024): Average=4115/1184
Kmalloc N*alloc N*free(2048): Average=8451/1748
Kmalloc N*alloc N*free(4096): Average=16024/2048
* After
Kmalloc N*alloc N*free(32): Average=344/792
Kmalloc N*alloc N*free(64): Average=347/882
Kmalloc N*alloc N*free(128): Average=390/959
Kmalloc N*alloc N*free(256): Average=393/1067
Kmalloc N*alloc N*free(512): Average=683/1229
Kmalloc N*alloc N*free(1024): Average=1295/1325
Kmalloc N*alloc N*free(2048): Average=2513/1664
Kmalloc N*alloc N*free(4096): Average=4742/2172
The results show that performance improves greatly (roughly by more than
50%) for object classes larger than 128 bytes.
This patch (of 11):
If we hold neither the slab_mutex nor the node lock, the node's shared
array cache could be freed and re-populated. If __kmem_cache_shrink() is
called at the same time, it will call drain_array() with n->shared without
holding the node lock, so a problem can happen. This patch fixes the
situation by holding the node lock before trying to drain the shared
array.
In addition, add a debug check to confirm that no n->shared access race
exists.
v2:
o Hold the node lock instead of holding the slab_mutex (per Christoph)
o Add a debug check rather than a code comment (per Nikolay)
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 68 ++++++++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 45 insertions(+), 23 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index a53a0f6..d8746c0 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2173,6 +2173,11 @@ static void check_irq_on(void)
BUG_ON(irqs_disabled());
}
+static void check_mutex_acquired(void)
+{
+ BUG_ON(!mutex_is_locked(&slab_mutex));
+}
+
static void check_spinlock_acquired(struct kmem_cache *cachep)
{
#ifdef CONFIG_SMP
@@ -2192,13 +2197,27 @@ static void check_spinlock_acquired_node(struct kmem_cache *cachep, int node)
#else
#define check_irq_off() do { } while(0)
#define check_irq_on() do { } while(0)
+#define check_mutex_acquired() do { } while(0)
#define check_spinlock_acquired(x) do { } while(0)
#define check_spinlock_acquired_node(x, y) do { } while(0)
#endif
-static void drain_array(struct kmem_cache *cachep, struct kmem_cache_node *n,
- struct array_cache *ac,
- int force, int node);
+static void drain_array_locked(struct kmem_cache *cachep, struct array_cache *ac,
+ int node, bool free_all, struct list_head *list)
+{
+ int tofree;
+
+ if (!ac || !ac->avail)
+ return;
+
+ tofree = free_all ? ac->avail : (ac->limit + 4) / 5;
+ if (tofree > ac->avail)
+ tofree = (ac->avail + 1) / 2;
+
+ free_block(cachep, ac->entry, tofree, node, list);
+ ac->avail -= tofree;
+ memmove(ac->entry, &(ac->entry[tofree]), sizeof(void *) * ac->avail);
+}
static void do_drain(void *arg)
{
@@ -2222,6 +2241,7 @@ static void drain_cpu_caches(struct kmem_cache *cachep)
{
struct kmem_cache_node *n;
int node;
+ LIST_HEAD(list);
on_each_cpu(do_drain, cachep, 1);
check_irq_on();
@@ -2229,8 +2249,13 @@ static void drain_cpu_caches(struct kmem_cache *cachep)
if (n->alien)
drain_alien_cache(cachep, n->alien);
- for_each_kmem_cache_node(cachep, node, n)
- drain_array(cachep, n, n->shared, 1, node);
+ for_each_kmem_cache_node(cachep, node, n) {
+ spin_lock_irq(&n->list_lock);
+ drain_array_locked(cachep, n->shared, node, true, &list);
+ spin_unlock_irq(&n->list_lock);
+
+ slabs_destroy(cachep, &list);
+ }
}
/*
@@ -3873,29 +3898,26 @@ skip_setup:
* if drain_array() is used on the shared array.
*/
static void drain_array(struct kmem_cache *cachep, struct kmem_cache_node *n,
- struct array_cache *ac, int force, int node)
+ struct array_cache *ac, int node)
{
LIST_HEAD(list);
- int tofree;
+
+ /* ac from n->shared can be freed if we don't hold the slab_mutex. */
+ check_mutex_acquired();
if (!ac || !ac->avail)
return;
- if (ac->touched && !force) {
+
+ if (ac->touched) {
ac->touched = 0;
- } else {
- spin_lock_irq(&n->list_lock);
- if (ac->avail) {
- tofree = force ? ac->avail : (ac->limit + 4) / 5;
- if (tofree > ac->avail)
- tofree = (ac->avail + 1) / 2;
- free_block(cachep, ac->entry, tofree, node, &list);
- ac->avail -= tofree;
- memmove(ac->entry, &(ac->entry[tofree]),
- sizeof(void *) * ac->avail);
- }
- spin_unlock_irq(&n->list_lock);
- slabs_destroy(cachep, &list);
+ return;
}
+
+ spin_lock_irq(&n->list_lock);
+ drain_array_locked(cachep, ac, node, false, &list);
+ spin_unlock_irq(&n->list_lock);
+
+ slabs_destroy(cachep, &list);
}
/**
@@ -3933,7 +3955,7 @@ static void cache_reap(struct work_struct *w)
reap_alien(searchp, n);
- drain_array(searchp, n, cpu_cache_get(searchp), 0, node);
+ drain_array(searchp, n, cpu_cache_get(searchp), node);
/*
* These are racy checks but it does not matter
@@ -3944,7 +3966,7 @@ static void cache_reap(struct work_struct *w)
n->next_reap = jiffies + REAPTIMEOUT_NODE;
- drain_array(searchp, n, n->shared, 0, node);
+ drain_array(searchp, n, n->shared, node);
if (n->free_touched)
n->free_touched = 0;
--
1.9.1
From: Joonsoo Kim <[email protected]>
There is mostly the same code for setting up a kmem_cache_node in both
cpuup_prepare() and alloc_kmem_cache_node(). Factor it out and clean it up.
v2
o Rename setup_kmem_cache_node_node to setup_kmem_cache_nodes
o Fix suspend-to-ram issue reported by Nishanth
Tested-by: Nishanth Menon <[email protected]>
Tested-by: Jon Hunter <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 168 +++++++++++++++++++++++++-------------------------------------
1 file changed, 68 insertions(+), 100 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 49af685..27cb390 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -898,6 +898,63 @@ static int init_cache_node_node(int node)
return 0;
}
+static int setup_kmem_cache_node(struct kmem_cache *cachep,
+ int node, gfp_t gfp, bool force_change)
+{
+ int ret = -ENOMEM;
+ struct kmem_cache_node *n;
+ struct array_cache *old_shared = NULL;
+ struct array_cache *new_shared = NULL;
+ struct alien_cache **new_alien = NULL;
+ LIST_HEAD(list);
+
+ if (use_alien_caches) {
+ new_alien = alloc_alien_cache(node, cachep->limit, gfp);
+ if (!new_alien)
+ goto fail;
+ }
+
+ if (cachep->shared) {
+ new_shared = alloc_arraycache(node,
+ cachep->shared * cachep->batchcount, 0xbaadf00d, gfp);
+ if (!new_shared)
+ goto fail;
+ }
+
+ ret = init_cache_node(cachep, node, gfp);
+ if (ret)
+ goto fail;
+
+ n = get_node(cachep, node);
+ spin_lock_irq(&n->list_lock);
+ if (n->shared && force_change) {
+ free_block(cachep, n->shared->entry,
+ n->shared->avail, node, &list);
+ n->shared->avail = 0;
+ }
+
+ if (!n->shared || force_change) {
+ old_shared = n->shared;
+ n->shared = new_shared;
+ new_shared = NULL;
+ }
+
+ if (!n->alien) {
+ n->alien = new_alien;
+ new_alien = NULL;
+ }
+
+ spin_unlock_irq(&n->list_lock);
+ slabs_destroy(cachep, &list);
+
+fail:
+ kfree(old_shared);
+ kfree(new_shared);
+ free_alien_cache(new_alien);
+
+ return ret;
+}
+
static void cpuup_canceled(long cpu)
{
struct kmem_cache *cachep;
@@ -969,7 +1026,6 @@ free_slab:
static int cpuup_prepare(long cpu)
{
struct kmem_cache *cachep;
- struct kmem_cache_node *n = NULL;
int node = cpu_to_mem(cpu);
int err;
@@ -988,44 +1044,9 @@ static int cpuup_prepare(long cpu)
* array caches
*/
list_for_each_entry(cachep, &slab_caches, list) {
- struct array_cache *shared = NULL;
- struct alien_cache **alien = NULL;
-
- if (cachep->shared) {
- shared = alloc_arraycache(node,
- cachep->shared * cachep->batchcount,
- 0xbaadf00d, GFP_KERNEL);
- if (!shared)
- goto bad;
- }
- if (use_alien_caches) {
- alien = alloc_alien_cache(node, cachep->limit, GFP_KERNEL);
- if (!alien) {
- kfree(shared);
- goto bad;
- }
- }
- n = get_node(cachep, node);
- BUG_ON(!n);
-
- spin_lock_irq(&n->list_lock);
- if (!n->shared) {
- /*
- * We are serialised from CPU_DEAD or
- * CPU_UP_CANCELLED by the cpucontrol lock
- */
- n->shared = shared;
- shared = NULL;
- }
-#ifdef CONFIG_NUMA
- if (!n->alien) {
- n->alien = alien;
- alien = NULL;
- }
-#endif
- spin_unlock_irq(&n->list_lock);
- kfree(shared);
- free_alien_cache(alien);
+ err = setup_kmem_cache_node(cachep, node, GFP_KERNEL, false);
+ if (err)
+ goto bad;
}
return 0;
@@ -3676,72 +3697,19 @@ EXPORT_SYMBOL(kfree);
/*
* This initializes kmem_cache_node or resizes various caches for all nodes.
*/
-static int alloc_kmem_cache_node(struct kmem_cache *cachep, gfp_t gfp)
+static int setup_kmem_cache_nodes(struct kmem_cache *cachep, gfp_t gfp)
{
+ int ret;
int node;
struct kmem_cache_node *n;
- struct array_cache *new_shared;
- struct alien_cache **new_alien = NULL;
for_each_online_node(node) {
-
- if (use_alien_caches) {
- new_alien = alloc_alien_cache(node, cachep->limit, gfp);
- if (!new_alien)
- goto fail;
- }
-
- new_shared = NULL;
- if (cachep->shared) {
- new_shared = alloc_arraycache(node,
- cachep->shared*cachep->batchcount,
- 0xbaadf00d, gfp);
- if (!new_shared) {
- free_alien_cache(new_alien);
- goto fail;
- }
- }
-
- n = get_node(cachep, node);
- if (n) {
- struct array_cache *shared = n->shared;
- LIST_HEAD(list);
-
- spin_lock_irq(&n->list_lock);
-
- if (shared)
- free_block(cachep, shared->entry,
- shared->avail, node, &list);
-
- n->shared = new_shared;
- if (!n->alien) {
- n->alien = new_alien;
- new_alien = NULL;
- }
- n->free_limit = (1 + nr_cpus_node(node)) *
- cachep->batchcount + cachep->num;
- spin_unlock_irq(&n->list_lock);
- slabs_destroy(cachep, &list);
- kfree(shared);
- free_alien_cache(new_alien);
- continue;
- }
- n = kmalloc_node(sizeof(struct kmem_cache_node), gfp, node);
- if (!n) {
- free_alien_cache(new_alien);
- kfree(new_shared);
+ ret = setup_kmem_cache_node(cachep, node, gfp, true);
+ if (ret)
goto fail;
- }
- kmem_cache_node_init(n);
- n->next_reap = jiffies + REAPTIMEOUT_NODE +
- ((unsigned long)cachep) % REAPTIMEOUT_NODE;
- n->shared = new_shared;
- n->alien = new_alien;
- n->free_limit = (1 + nr_cpus_node(node)) *
- cachep->batchcount + cachep->num;
- cachep->node[node] = n;
}
+
return 0;
fail:
@@ -3783,7 +3751,7 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
cachep->shared = shared;
if (!prev)
- goto alloc_node;
+ goto setup_node;
for_each_online_cpu(cpu) {
LIST_HEAD(list);
@@ -3800,8 +3768,8 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
}
free_percpu(prev);
-alloc_node:
- return alloc_kmem_cache_node(cachep, gfp);
+setup_node:
+ return setup_kmem_cache_nodes(cachep, gfp);
}
static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
--
1.9.1
From: Joonsoo Kim <[email protected]>
Currently, cache_grow() assumes that the allocated page's nodeid is the
same as the nodeid parameter used for the allocation request. If we
discard this assumption, we can handle the fallback_alloc() case
gracefully. So, this patch makes cache_grow() handle a page allocated on
an arbitrary node and cleans up the relevant code.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 60 +++++++++++++++++++++---------------------------------------
1 file changed, 21 insertions(+), 39 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index a3422bc..1910589 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2543,13 +2543,14 @@ static void slab_map_pages(struct kmem_cache *cache, struct page *page,
* Grow (by 1) the number of slabs within a cache. This is called by
* kmem_cache_alloc() when there are no active objs left in a cache.
*/
-static int cache_grow(struct kmem_cache *cachep,
- gfp_t flags, int nodeid, struct page *page)
+static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid)
{
void *freelist;
size_t offset;
gfp_t local_flags;
+ int page_node;
struct kmem_cache_node *n;
+ struct page *page;
/*
* Be lazy and only check for valid flags here, keeping it out of the
@@ -2577,12 +2578,12 @@ static int cache_grow(struct kmem_cache *cachep,
* Get mem for the objs. Attempt to allocate a physical page from
* 'nodeid'.
*/
- if (!page)
- page = kmem_getpages(cachep, local_flags, nodeid);
+ page = kmem_getpages(cachep, local_flags, nodeid);
if (!page)
goto failed;
- n = get_node(cachep, nodeid);
+ page_node = page_to_nid(page);
+ n = get_node(cachep, page_node);
/* Get colour for the slab, and cal the next value. */
n->colour_next++;
@@ -2597,7 +2598,7 @@ static int cache_grow(struct kmem_cache *cachep,
/* Get slab management. */
freelist = alloc_slabmgmt(cachep, page, offset,
- local_flags & ~GFP_CONSTRAINT_MASK, nodeid);
+ local_flags & ~GFP_CONSTRAINT_MASK, page_node);
if (OFF_SLAB(cachep) && !freelist)
goto opps1;
@@ -2616,13 +2617,13 @@ static int cache_grow(struct kmem_cache *cachep,
STATS_INC_GROWN(cachep);
n->free_objects += cachep->num;
spin_unlock(&n->list_lock);
- return 1;
+ return page_node;
opps1:
kmem_freepages(cachep, page);
failed:
if (gfpflags_allow_blocking(local_flags))
local_irq_disable();
- return 0;
+ return -1;
}
#if DEBUG
@@ -2903,14 +2904,14 @@ alloc_done:
return obj;
}
- x = cache_grow(cachep, gfp_exact_node(flags), node, NULL);
+ x = cache_grow(cachep, gfp_exact_node(flags), node);
/* cache_grow can reenable interrupts, then ac could change. */
ac = cpu_cache_get(cachep);
node = numa_mem_id();
/* no objects in sight? abort */
- if (!x && ac->avail == 0)
+ if (x < 0 && ac->avail == 0)
return NULL;
if (!ac->avail) /* objects refilled by interrupt? */
@@ -3039,7 +3040,6 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
{
struct zonelist *zonelist;
- gfp_t local_flags;
struct zoneref *z;
struct zone *zone;
enum zone_type high_zoneidx = gfp_zone(flags);
@@ -3050,8 +3050,6 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
if (flags & __GFP_THISNODE)
return NULL;
- local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
-
retry_cpuset:
cpuset_mems_cookie = read_mems_allowed_begin();
zonelist = node_zonelist(mempolicy_slab_node(), flags);
@@ -3081,33 +3079,17 @@ retry:
* We may trigger various forms of reclaim on the allowed
* set and go into memory reserves if necessary.
*/
- struct page *page;
+ nid = cache_grow(cache, flags, numa_mem_id());
+ if (nid >= 0) {
+ obj = ____cache_alloc_node(cache,
+ gfp_exact_node(flags), nid);
- if (gfpflags_allow_blocking(local_flags))
- local_irq_enable();
- kmem_flagcheck(cache, flags);
- page = kmem_getpages(cache, local_flags, numa_mem_id());
- if (gfpflags_allow_blocking(local_flags))
- local_irq_disable();
- if (page) {
/*
- * Insert into the appropriate per node queues
+ * Another processor may allocate the objects in
+ * the slab since we are not holding any locks.
*/
- nid = page_to_nid(page);
- if (cache_grow(cache, flags, nid, page)) {
- obj = ____cache_alloc_node(cache,
- gfp_exact_node(flags), nid);
- if (!obj)
- /*
- * Another processor may allocate the
- * objects in the slab since we are
- * not holding any locks.
- */
- goto retry;
- } else {
- /* cache_grow already freed obj */
- obj = NULL;
- }
+ if (!obj)
+ goto retry;
}
}
@@ -3158,8 +3140,8 @@ retry:
must_grow:
spin_unlock(&n->list_lock);
- x = cache_grow(cachep, gfp_exact_node(flags), nodeid, NULL);
- if (x)
+ x = cache_grow(cachep, gfp_exact_node(flags), nodeid);
+ if (x >= 0)
goto retry;
return fallback_alloc(cachep, flags);
--
1.9.1
From: Joonsoo Kim <[email protected]>
Currently, the decision to free a slab is made each time a freed object is
put back into the slab. This has the following problem.
Assume free_limit = 10 and nr_free = 9.
Frees happen in the following sequence and nr_free changes as follows:
first free (slab becomes a free slab), second free (slab does not become a free slab)
nr_free: 9 -> 10 (after first free) -> 11 (after second free)
If we check whether we can free the current slab on each object free, we
can't free any slab in this situation, because the current slab isn't a
free slab at the moment nr_free exceeds free_limit (at the second free),
even though a free slab exists.
However, if we check at the end, we can free one free slab.
This problem can cause the slab subsystem to keep too much memory. This
patch fixes it by checking the number of free objects after all the free
work is done. If there are free slabs at that point, we free as many as
possible, so we keep the number of free slabs minimal.
v2: explain more about the problem
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 27cb390..6e61461 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3283,6 +3283,9 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
{
int i;
struct kmem_cache_node *n = get_node(cachep, node);
+ struct page *page;
+
+ n->free_objects += nr_objects;
for (i = 0; i < nr_objects; i++) {
void *objp;
@@ -3295,17 +3298,11 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
check_spinlock_acquired_node(cachep, node);
slab_put_obj(cachep, page, objp);
STATS_DEC_ACTIVE(cachep);
- n->free_objects++;
/* fixup slab chains */
- if (page->active == 0) {
- if (n->free_objects > n->free_limit) {
- n->free_objects -= cachep->num;
- list_add_tail(&page->lru, list);
- } else {
- list_add(&page->lru, &n->slabs_free);
- }
- } else {
+ if (page->active == 0)
+ list_add(&page->lru, &n->slabs_free);
+ else {
/* Unconditionally move a slab to the end of the
* partial list on free - maximum time for the
* other objects to be freed, too.
@@ -3313,6 +3310,14 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
list_add_tail(&page->lru, &n->slabs_partial);
}
}
+
+ while (n->free_objects > n->free_limit && !list_empty(&n->slabs_free)) {
+ n->free_objects -= cachep->num;
+
+ page = list_last_entry(&n->slabs_free, struct page, lru);
+ list_del(&page->lru);
+ list_add(&page->lru, list);
+ }
}
static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
--
1.9.1
From: Joonsoo Kim <[email protected]>
This is a preparation step for implementing a lockless allocation path for
when there are no free objects in the kmem_cache. What we'd like to do
here is refill the cpu cache without holding the node lock. To accomplish
this, the refill should be done after the new slab is allocated but before
the slab is attached to the management list. So, this patch separates
cache_grow() into two parts, allocation and attaching to the list, in
order to add some code in between them in a following patch.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 74 ++++++++++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 52 insertions(+), 22 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 1910589..2c28ad5 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -213,6 +213,11 @@ static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list);
static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp);
static void cache_reap(struct work_struct *unused);
+static inline void fixup_objfreelist_debug(struct kmem_cache *cachep,
+ void **list);
+static inline void fixup_slab_list(struct kmem_cache *cachep,
+ struct kmem_cache_node *n, struct page *page,
+ void **list);
static int slab_early_init = 1;
#define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node))
@@ -1797,7 +1802,7 @@ static size_t calculate_slab_order(struct kmem_cache *cachep,
/*
* Needed to avoid possible looping condition
- * in cache_grow()
+ * in cache_grow_begin()
*/
if (OFF_SLAB(freelist_cache))
continue;
@@ -2543,7 +2548,8 @@ static void slab_map_pages(struct kmem_cache *cache, struct page *page,
* Grow (by 1) the number of slabs within a cache. This is called by
* kmem_cache_alloc() when there are no active objs left in a cache.
*/
-static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid)
+static struct page *cache_grow_begin(struct kmem_cache *cachep,
+ gfp_t flags, int nodeid)
{
void *freelist;
size_t offset;
@@ -2609,21 +2615,40 @@ static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid)
if (gfpflags_allow_blocking(local_flags))
local_irq_disable();
- check_irq_off();
- spin_lock(&n->list_lock);
- /* Make slab active. */
- list_add_tail(&page->lru, &(n->slabs_free));
- STATS_INC_GROWN(cachep);
- n->free_objects += cachep->num;
- spin_unlock(&n->list_lock);
- return page_node;
+ return page;
+
opps1:
kmem_freepages(cachep, page);
failed:
if (gfpflags_allow_blocking(local_flags))
local_irq_disable();
- return -1;
+ return NULL;
+}
+
+static void cache_grow_end(struct kmem_cache *cachep, struct page *page)
+{
+ struct kmem_cache_node *n;
+ void *list = NULL;
+
+ check_irq_off();
+
+ if (!page)
+ return;
+
+ INIT_LIST_HEAD(&page->lru);
+ n = get_node(cachep, page_to_nid(page));
+
+ spin_lock(&n->list_lock);
+ if (!page->active)
+ list_add_tail(&page->lru, &(n->slabs_free));
+ else
+ fixup_slab_list(cachep, n, page, &list);
+ STATS_INC_GROWN(cachep);
+ n->free_objects += cachep->num - page->active;
+ spin_unlock(&n->list_lock);
+
+ fixup_objfreelist_debug(cachep, &list);
}
#if DEBUG
@@ -2834,6 +2859,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
struct array_cache *ac;
int node;
void *list = NULL;
+ struct page *page;
check_irq_off();
node = numa_mem_id();
@@ -2861,7 +2887,6 @@ retry:
}
while (batchcount > 0) {
- struct page *page;
/* Get slab alloc is to come from. */
page = get_first_slab(n, false);
if (!page)
@@ -2894,8 +2919,6 @@ alloc_done:
fixup_objfreelist_debug(cachep, &list);
if (unlikely(!ac->avail)) {
- int x;
-
/* Check if we can use obj in pfmemalloc slab */
if (sk_memalloc_socks()) {
void *obj = cache_alloc_pfmemalloc(cachep, n, flags);
@@ -2904,14 +2927,18 @@ alloc_done:
return obj;
}
- x = cache_grow(cachep, gfp_exact_node(flags), node);
+ page = cache_grow_begin(cachep, gfp_exact_node(flags), node);
+ cache_grow_end(cachep, page);
- /* cache_grow can reenable interrupts, then ac could change. */
+ /*
+ * cache_grow_begin() can reenable interrupts,
+ * then ac could change.
+ */
ac = cpu_cache_get(cachep);
node = numa_mem_id();
/* no objects in sight? abort */
- if (x < 0 && ac->avail == 0)
+ if (!page && ac->avail == 0)
return NULL;
if (!ac->avail) /* objects refilled by interrupt? */
@@ -3044,6 +3071,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
struct zone *zone;
enum zone_type high_zoneidx = gfp_zone(flags);
void *obj = NULL;
+ struct page *page;
int nid;
unsigned int cpuset_mems_cookie;
@@ -3079,8 +3107,10 @@ retry:
* We may trigger various forms of reclaim on the allowed
* set and go into memory reserves if necessary.
*/
- nid = cache_grow(cache, flags, numa_mem_id());
- if (nid >= 0) {
+ page = cache_grow_begin(cache, flags, numa_mem_id());
+ cache_grow_end(cache, page);
+ if (page) {
+ nid = page_to_nid(page);
obj = ____cache_alloc_node(cache,
gfp_exact_node(flags), nid);
@@ -3108,7 +3138,6 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
struct kmem_cache_node *n;
void *obj;
void *list = NULL;
- int x;
VM_BUG_ON(nodeid < 0 || nodeid >= MAX_NUMNODES);
n = get_node(cachep, nodeid);
@@ -3140,8 +3169,9 @@ retry:
must_grow:
spin_unlock(&n->list_lock);
- x = cache_grow(cachep, gfp_exact_node(flags), nodeid);
- if (x >= 0)
+ page = cache_grow_begin(cachep, gfp_exact_node(flags), nodeid);
+ cache_grow_end(cachep, page);
+ if (page)
goto retry;
return fallback_alloc(cachep, flags);
--
1.9.1
From: Joonsoo Kim <[email protected]>
Until now, growing the cache puts a free slab on the node's slab list and
then we allocate free objects from it. This necessarily requires holding
the node lock, which is heavily contended. If we refill the cpu cache
before attaching the slab to the node's slab list, we can avoid holding
the node lock as much as possible, because the newly allocated slab is
only visible to the current task. This reduces lock contention.
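As an illustration only (a condensed sketch of the flow after this patch,
not a copy of the kernel code; refill_sketch() is a made-up name), the
refill slow path becomes roughly:

static void *refill_sketch(struct kmem_cache *cachep, gfp_t flags, int node)
{
	struct array_cache *ac = cpu_cache_get(cachep);
	struct page *page;

	/* 1. Allocate a new slab page; the node lock is not held here. */
	page = cache_grow_begin(cachep, gfp_exact_node(flags), node);

	/*
	 * 2. The new slab is still invisible to other CPUs, so the cpu
	 *    cache can be refilled from it without taking n->list_lock.
	 *    Re-read ac because cache_grow_begin() may re-enable irqs.
	 */
	ac = cpu_cache_get(cachep);
	if (page && !ac->avail)
		alloc_block(cachep, ac, page, ac->batchcount);

	/*
	 * 3. Only now is n->list_lock taken (inside cache_grow_end()) to
	 *    attach the slab to the node's slabs_partial/slabs_free list.
	 */
	cache_grow_end(cachep, page);

	return ac->avail ? ac->entry[--ac->avail] : NULL;
}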
Below are the results of the concurrent allocation/free test in the slab
allocation benchmark Christoph made a long time ago. I simplified the
output. The numbers are the cycle counts for alloc/free respectively, so
lower is better.
* Before
Kmalloc N*alloc N*free(32): Average=355/750
Kmalloc N*alloc N*free(64): Average=452/812
Kmalloc N*alloc N*free(128): Average=559/1070
Kmalloc N*alloc N*free(256): Average=1176/980
Kmalloc N*alloc N*free(512): Average=1939/1189
Kmalloc N*alloc N*free(1024): Average=3521/1278
Kmalloc N*alloc N*free(2048): Average=7152/1838
Kmalloc N*alloc N*free(4096): Average=13438/2013
* After
Kmalloc N*alloc N*free(32): Average=248/966
Kmalloc N*alloc N*free(64): Average=261/949
Kmalloc N*alloc N*free(128): Average=314/1016
Kmalloc N*alloc N*free(256): Average=741/1061
Kmalloc N*alloc N*free(512): Average=1246/1152
Kmalloc N*alloc N*free(1024): Average=2437/1259
Kmalloc N*alloc N*free(2048): Average=4980/1800
Kmalloc N*alloc N*free(4096): Average=9000/2078
The results show that contention is reduced for all object sizes and that
performance increases by 30~40%.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 68 +++++++++++++++++++++++++++++++++------------------------------
1 file changed, 36 insertions(+), 32 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 2c28ad5..cf12fbd 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2852,6 +2852,30 @@ static noinline void *cache_alloc_pfmemalloc(struct kmem_cache *cachep,
return obj;
}
+/*
+ * Slab list should be fixed up by fixup_slab_list() for existing slab
+ * or cache_grow_end() for new slab
+ */
+static __always_inline int alloc_block(struct kmem_cache *cachep,
+ struct array_cache *ac, struct page *page, int batchcount)
+{
+ /*
+ * There must be at least one object available for
+ * allocation.
+ */
+ BUG_ON(page->active >= cachep->num);
+
+ while (page->active < cachep->num && batchcount--) {
+ STATS_INC_ALLOCED(cachep);
+ STATS_INC_ACTIVE(cachep);
+ STATS_SET_HIGH(cachep);
+
+ ac->entry[ac->avail++] = slab_get_obj(cachep, page);
+ }
+
+ return batchcount;
+}
+
static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
{
int batchcount;
@@ -2864,7 +2888,6 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
check_irq_off();
node = numa_mem_id();
-retry:
ac = cpu_cache_get(cachep);
batchcount = ac->batchcount;
if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -2894,21 +2917,7 @@ retry:
check_spinlock_acquired(cachep);
- /*
- * The slab was either on partial or free list so
- * there must be at least one object available for
- * allocation.
- */
- BUG_ON(page->active >= cachep->num);
-
- while (page->active < cachep->num && batchcount--) {
- STATS_INC_ALLOCED(cachep);
- STATS_INC_ACTIVE(cachep);
- STATS_SET_HIGH(cachep);
-
- ac->entry[ac->avail++] = slab_get_obj(cachep, page);
- }
-
+ batchcount = alloc_block(cachep, ac, page, batchcount);
fixup_slab_list(cachep, n, page, &list);
}
@@ -2928,21 +2937,18 @@ alloc_done:
}
page = cache_grow_begin(cachep, gfp_exact_node(flags), node);
- cache_grow_end(cachep, page);
/*
* cache_grow_begin() can reenable interrupts,
* then ac could change.
*/
ac = cpu_cache_get(cachep);
- node = numa_mem_id();
+ if (!ac->avail && page)
+ alloc_block(cachep, ac, page, batchcount);
+ cache_grow_end(cachep, page);
- /* no objects in sight? abort */
- if (!page && ac->avail == 0)
+ if (!ac->avail)
return NULL;
-
- if (!ac->avail) /* objects refilled by interrupt? */
- goto retry;
}
ac->touched = 1;
@@ -3136,14 +3142,13 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
{
struct page *page;
struct kmem_cache_node *n;
- void *obj;
+ void *obj = NULL;
void *list = NULL;
VM_BUG_ON(nodeid < 0 || nodeid >= MAX_NUMNODES);
n = get_node(cachep, nodeid);
BUG_ON(!n);
-retry:
check_irq_off();
spin_lock(&n->list_lock);
page = get_first_slab(n, false);
@@ -3165,19 +3170,18 @@ retry:
spin_unlock(&n->list_lock);
fixup_objfreelist_debug(cachep, &list);
- goto done;
+ return obj;
must_grow:
spin_unlock(&n->list_lock);
page = cache_grow_begin(cachep, gfp_exact_node(flags), nodeid);
+ if (page) {
+ /* This slab isn't counted yet so don't update free_objects */
+ obj = slab_get_obj(cachep, page);
+ }
cache_grow_end(cachep, page);
- if (page)
- goto retry;
- return fallback_alloc(cachep, flags);
-
-done:
- return obj;
+ return obj ? obj : fallback_alloc(cachep, flags);
}
static __always_inline void *
--
1.9.1
From: Joonsoo Kim <[email protected]>
To check precisely whether free objects exist or not, we need to grab a
lock. But accuracy isn't that important here, because the race window is
small and, if there are too many free objects, the cache reaper will reap
them. So, this patch makes the check for the existence of free objects
lockless. This reduces lock contention in the heavy-allocation case.
Note that, until now, n->shared could be freed by a write to slabinfo
while we are using it, but with a small trick in this patch we can safely
access it within an interrupt-disabled section.
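To illustrate the trick (a minimal sketch of the lifetime rule as I read
it, not the actual kernel code), the resize path frees the old shared
array only after synchronize_sched(), so a reader that sampled the pointer
with irqs disabled may keep using it until it re-enables irqs:

	/* Updater side (cache resize via slabinfo), sketch: */
	old_shared = n->shared;
	n->shared = new_shared;		/* publish the replacement */
	spin_unlock_irq(&n->list_lock);
	synchronize_sched();		/* wait for every irq-disabled
					 * (sched-RCU) section to finish */
	kfree(old_shared);		/* no reader can still see it */

	/* Reader side (allocation slow path), irqs already disabled: */
	shared = READ_ONCE(n->shared);	/* sample the pointer once */
	if (!n->free_objects && (!shared || !shared->avail))
		/* racy but safe: shared stays valid while irqs are off,
		 * so skip the node lock and grow the cache directly */
		goto direct_grow;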
Below are the results of the concurrent allocation/free test in the slab
allocation benchmark Christoph made a long time ago. I simplified the
output. The numbers are the cycle counts for alloc/free respectively, so
lower is better.
* Before
Kmalloc N*alloc N*free(32): Average=248/966
Kmalloc N*alloc N*free(64): Average=261/949
Kmalloc N*alloc N*free(128): Average=314/1016
Kmalloc N*alloc N*free(256): Average=741/1061
Kmalloc N*alloc N*free(512): Average=1246/1152
Kmalloc N*alloc N*free(1024): Average=2437/1259
Kmalloc N*alloc N*free(2048): Average=4980/1800
Kmalloc N*alloc N*free(4096): Average=9000/2078
* After
Kmalloc N*alloc N*free(32): Average=344/792
Kmalloc N*alloc N*free(64): Average=347/882
Kmalloc N*alloc N*free(128): Average=390/959
Kmalloc N*alloc N*free(256): Average=393/1067
Kmalloc N*alloc N*free(512): Average=683/1229
Kmalloc N*alloc N*free(1024): Average=1295/1325
Kmalloc N*alloc N*free(2048): Average=2513/1664
Kmalloc N*alloc N*free(4096): Average=4742/2172
The results show that allocation performance decreases for object sizes up
to 128 bytes, probably due to the extra checks in cache_alloc_refill().
But, considering the improvement in free performance, the net result looks
about the same. The results for the other size classes look very
promising: roughly a 50% performance improvement.
v2: replace kick_all_cpus_sync() with synchronize_sched().
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index cf12fbd..13e74aa 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -952,6 +952,15 @@ static int setup_kmem_cache_node(struct kmem_cache *cachep,
spin_unlock_irq(&n->list_lock);
slabs_destroy(cachep, &list);
+ /*
+ * To protect lockless access to n->shared during irq disabled context.
+ * If n->shared isn't NULL in irq disabled context, accessing to it is
+ * guaranteed to be valid until irq is re-enabled, because it will be
+ * freed after synchronize_sched().
+ */
+ if (force_change)
+ synchronize_sched();
+
fail:
kfree(old_shared);
kfree(new_shared);
@@ -2880,7 +2889,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
{
int batchcount;
struct kmem_cache_node *n;
- struct array_cache *ac;
+ struct array_cache *ac, *shared;
int node;
void *list = NULL;
struct page *page;
@@ -2901,11 +2910,16 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
n = get_node(cachep, node);
BUG_ON(ac->avail > 0 || !n);
+ shared = READ_ONCE(n->shared);
+ if (!n->free_objects && (!shared || !shared->avail))
+ goto direct_grow;
+
spin_lock(&n->list_lock);
+ shared = READ_ONCE(n->shared);
/* See if we can refill from the shared array */
- if (n->shared && transfer_objects(ac, n->shared, batchcount)) {
- n->shared->touched = 1;
+ if (shared && transfer_objects(ac, shared, batchcount)) {
+ shared->touched = 1;
goto alloc_done;
}
@@ -2927,6 +2941,7 @@ alloc_done:
spin_unlock(&n->list_lock);
fixup_objfreelist_debug(cachep, &list);
+direct_grow:
if (unlikely(!ac->avail)) {
/* Check if we can use obj in pfmemalloc slab */
if (sk_memalloc_socks()) {
--
1.9.1
From: Joonsoo Kim <[email protected]>
The slab color doesn't need to be updated strictly. Since locking just to
change the slab color would cause more lock contention, this patch makes
the access and update of the slab color racy. This is a preparation step
for implementing a lockless allocation path for when there are no free
objects in the kmem_cache.
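For illustration (my reading of the hunk below, not additional kernel
code): without the lock, two CPUs can both bump colour_next past
cachep->colour, so the locally computed offset is clamped on its own; the
worst case is two slabs getting the same colour, which only affects
cache-line spreading, not correctness.

	n->colour_next++;		/* racy increment, no list_lock held */
	if (n->colour_next >= cachep->colour)
		n->colour_next = 0;

	offset = n->colour_next;	/* racy read, may be out of range */
	if (offset >= cachep->colour)
		offset = 0;		/* clamp locally */
	offset *= cachep->colour_off;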
Below are the results of the concurrent allocation/free test in the slab
allocation benchmark Christoph made a long time ago. I simplified the
output. The numbers are the cycle counts for alloc/free respectively, so
lower is better.
* Before
Kmalloc N*alloc N*free(32): Average=365/806
Kmalloc N*alloc N*free(64): Average=452/690
Kmalloc N*alloc N*free(128): Average=736/886
Kmalloc N*alloc N*free(256): Average=1167/985
Kmalloc N*alloc N*free(512): Average=2088/1125
Kmalloc N*alloc N*free(1024): Average=4115/1184
Kmalloc N*alloc N*free(2048): Average=8451/1748
Kmalloc N*alloc N*free(4096): Average=16024/2048
* After
Kmalloc N*alloc N*free(32): Average=355/750
Kmalloc N*alloc N*free(64): Average=452/812
Kmalloc N*alloc N*free(128): Average=559/1070
Kmalloc N*alloc N*free(256): Average=1176/980
Kmalloc N*alloc N*free(512): Average=1939/1189
Kmalloc N*alloc N*free(1024): Average=3521/1278
Kmalloc N*alloc N*free(2048): Average=7152/1838
Kmalloc N*alloc N*free(4096): Average=13438/2013
The results show that contention is reduced for object sizes >= 1024 and
that performance increases by roughly 15%.
Acked-by: Christoph Lameter <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 6e61461..a3422bc 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2561,20 +2561,7 @@ static int cache_grow(struct kmem_cache *cachep,
}
local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
- /* Take the node list lock to change the colour_next on this node */
check_irq_off();
- n = get_node(cachep, nodeid);
- spin_lock(&n->list_lock);
-
- /* Get colour for the slab, and cal the next value. */
- offset = n->colour_next;
- n->colour_next++;
- if (n->colour_next >= cachep->colour)
- n->colour_next = 0;
- spin_unlock(&n->list_lock);
-
- offset *= cachep->colour_off;
-
if (gfpflags_allow_blocking(local_flags))
local_irq_enable();
@@ -2595,6 +2582,19 @@ static int cache_grow(struct kmem_cache *cachep,
if (!page)
goto failed;
+ n = get_node(cachep, nodeid);
+
+ /* Get colour for the slab, and cal the next value. */
+ n->colour_next++;
+ if (n->colour_next >= cachep->colour)
+ n->colour_next = 0;
+
+ offset = n->colour_next;
+ if (offset >= cachep->colour)
+ offset = 0;
+
+ offset *= cachep->colour_off;
+
/* Get slab management. */
freelist = alloc_slabmgmt(cachep, page, offset,
local_flags & ~GFP_CONSTRAINT_MASK, nodeid);
--
1.9.1
From: Joonsoo Kim <[email protected]>
It can be reused elsewhere, so factor it out. A following patch will use
it.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 68 ++++++++++++++++++++++++++++++++++++---------------------------
1 file changed, 39 insertions(+), 29 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 5451929..49af685 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -841,6 +841,40 @@ static inline gfp_t gfp_exact_node(gfp_t flags)
}
#endif
+static int init_cache_node(struct kmem_cache *cachep, int node, gfp_t gfp)
+{
+ struct kmem_cache_node *n;
+
+ /*
+ * Set up the kmem_cache_node for cpu before we can
+ * begin anything. Make sure some other cpu on this
+ * node has not already allocated this
+ */
+ n = get_node(cachep, node);
+ if (n)
+ return 0;
+
+ n = kmalloc_node(sizeof(struct kmem_cache_node), gfp, node);
+ if (!n)
+ return -ENOMEM;
+
+ kmem_cache_node_init(n);
+ n->next_reap = jiffies + REAPTIMEOUT_NODE +
+ ((unsigned long)cachep) % REAPTIMEOUT_NODE;
+
+ n->free_limit =
+ (1 + nr_cpus_node(node)) * cachep->batchcount + cachep->num;
+
+ /*
+ * The kmem_cache_nodes don't come and go as CPUs
+ * come and go. slab_mutex is sufficient
+ * protection here.
+ */
+ cachep->node[node] = n;
+
+ return 0;
+}
+
/*
* Allocates and initializes node for a node on each slab cache, used for
* either memory or cpu hotplug. If memory is being hot-added, the kmem_cache_node
@@ -852,39 +886,15 @@ static inline gfp_t gfp_exact_node(gfp_t flags)
*/
static int init_cache_node_node(int node)
{
+ int ret;
struct kmem_cache *cachep;
- struct kmem_cache_node *n;
- const size_t memsize = sizeof(struct kmem_cache_node);
list_for_each_entry(cachep, &slab_caches, list) {
- /*
- * Set up the kmem_cache_node for cpu before we can
- * begin anything. Make sure some other cpu on this
- * node has not already allocated this
- */
- n = get_node(cachep, node);
- if (!n) {
- n = kmalloc_node(memsize, GFP_KERNEL, node);
- if (!n)
- return -ENOMEM;
- kmem_cache_node_init(n);
- n->next_reap = jiffies + REAPTIMEOUT_NODE +
- ((unsigned long)cachep) % REAPTIMEOUT_NODE;
-
- /*
- * The kmem_cache_nodes don't come and go as CPUs
- * come and go. slab_mutex is sufficient
- * protection here.
- */
- cachep->node[node] = n;
- }
-
- spin_lock_irq(&n->list_lock);
- n->free_limit =
- (1 + nr_cpus_node(node)) *
- cachep->batchcount + cachep->num;
- spin_unlock_irq(&n->list_lock);
+ ret = init_cache_node(cachep, node, GFP_KERNEL);
+ if (ret)
+ return ret;
}
+
return 0;
}
--
1.9.1
From: Joonsoo Kim <[email protected]>
slabs_tofree() means freeing all free slabs. We can do the same by simply
passing INT_MAX to drain_freelist().
Acked-by: Christoph Lameter <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 373b8be..5451929 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -888,12 +888,6 @@ static int init_cache_node_node(int node)
return 0;
}
-static inline int slabs_tofree(struct kmem_cache *cachep,
- struct kmem_cache_node *n)
-{
- return (n->free_objects + cachep->num - 1) / cachep->num;
-}
-
static void cpuup_canceled(long cpu)
{
struct kmem_cache *cachep;
@@ -958,7 +952,7 @@ free_slab:
n = get_node(cachep, node);
if (!n)
continue;
- drain_freelist(cachep, n, slabs_tofree(cachep, n));
+ drain_freelist(cachep, n, INT_MAX);
}
}
@@ -1110,7 +1104,7 @@ static int __meminit drain_cache_node_node(int node)
if (!n)
continue;
- drain_freelist(cachep, n, slabs_tofree(cachep, n));
+ drain_freelist(cachep, n, INT_MAX);
if (!list_empty(&n->slabs_full) ||
!list_empty(&n->slabs_partial)) {
@@ -2304,7 +2298,7 @@ int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
check_irq_on();
for_each_kmem_cache_node(cachep, node, n) {
- drain_freelist(cachep, n, slabs_tofree(cachep, n));
+ drain_freelist(cachep, n, INT_MAX);
ret += !list_empty(&n->slabs_full) ||
!list_empty(&n->slabs_partial);
--
1.9.1
On Tue, 12 Apr 2016 13:51:06 +0900
[email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> To check whther free objects exist or not precisely, we need to grab a
^^^^^^
(spelling)
> lock. But, accuracy isn't that important because race window would be
> even small and if there is too much free object, cache reaper would reap
> it. So, this patch makes the check for free object exisistence not to
^^^^^^^^^^^
(spelling)
> hold a lock. This will reduce lock contention in heavily allocation case.
>
> Note that until now, n->shared can be freed during the processing by
> writing slabinfo, but, with some trick in this patch, we can access it
> freely within interrupt disabled period.
>
> Below is the result of concurrent allocation/free in slab allocation
> benchmark made by Christoph a long time ago. I make the output simpler.
> The number shows cycle count during alloc/free respectively so less is
> better.
I cannot figure out which of Christoph's tests you are using. And I
even have a copy of his test here:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_test.c
I think you need to describe the test a bit better...
After looking at the output on my own system for a long time, I guess you
are showing results from the "Concurrent allocs" test. Then it would be
relevant how many CPUs your system has.
It would also be relevant to mention that N=10000. And perhaps mention
what that means, e.g. all CPUs do N=10000 allocs concurrently, then
synchronize before doing N frees concurrently.
> * Before
> Kmalloc N*alloc N*free(32): Average=248/966
> Kmalloc N*alloc N*free(64): Average=261/949
> Kmalloc N*alloc N*free(128): Average=314/1016
> Kmalloc N*alloc N*free(256): Average=741/1061
> Kmalloc N*alloc N*free(512): Average=1246/1152
> Kmalloc N*alloc N*free(1024): Average=2437/1259
> Kmalloc N*alloc N*free(2048): Average=4980/1800
> Kmalloc N*alloc N*free(4096): Average=9000/2078
>
> * After
> Kmalloc N*alloc N*free(32): Average=344/792
> Kmalloc N*alloc N*free(64): Average=347/882
> Kmalloc N*alloc N*free(128): Average=390/959
> Kmalloc N*alloc N*free(256): Average=393/1067
> Kmalloc N*alloc N*free(512): Average=683/1229
> Kmalloc N*alloc N*free(1024): Average=1295/1325
> Kmalloc N*alloc N*free(2048): Average=2513/1664
> Kmalloc N*alloc N*free(4096): Average=4742/2172
>
> It shows that allocation performance decreases for the object size up to
> 128 and it may be due to extra checks in cache_alloc_refill(). But, with
> considering improvement of free performance, net result looks the same.
> Result for other size class looks very promising, roughly, 50% performance
> improvement.
Super nice performance boost. The numbers on my system are
significantly smaller, but this is a before/after test and the absolute
numbers are not that important.
Oh, maybe this was because I ran the test with SLUB... recompiling with
SLAB... and the results are comparable to your numbers (on my 8 core
i7-4790K CPU @ 4.00GHz)
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
On Tue, Apr 12, 2016 at 09:24:34AM +0200, Jesper Dangaard Brouer wrote:
> On Tue, 12 Apr 2016 13:51:06 +0900
> [email protected] wrote:
>
> > From: Joonsoo Kim <[email protected]>
> >
> > To check whther free objects exist or not precisely, we need to grab a
> ^^^^^^
> (spelling)
Will fix.
> > lock. But, accuracy isn't that important because race window would be
> > even small and if there is too much free object, cache reaper would reap
> > it. So, this patch makes the check for free object exisistence not to
> ^^^^^^^^^^^
> (spelling)
Ditto.
>
> > hold a lock. This will reduce lock contention in heavily allocation case.
> >
> > Note that until now, n->shared can be freed during the processing by
> > writing slabinfo, but, with some trick in this patch, we can access it
> > freely within interrupt disabled period.
> >
> > Below is the result of concurrent allocation/free in slab allocation
> > benchmark made by Christoph a long time ago. I make the output simpler.
> > The number shows cycle count during alloc/free respectively so less is
> > better.
>
> I cannot figure out which if Christoph's tests you are using. And I
> even have a copy of his test here:
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_test.c
I don't remember where I grabbed the source, but it's the same thing you
have. However, my version has some modifications for stable results: I run
each test 50 times and take the average.
> I think you need to describe the test a bit better...
Okay. I assumed that the relevant people (like Christoph or you) could
understand the results easily, but it seems not.
> Looking a long time at the output on my own system, I guess you are
> showing results from the "Concurrent allocs". Then it would be
> relevant how many CPUs your system have.
Right. I'm doing the test with my 8 core i7-3770 CPU @ 3.40GHz.
> It would also be relevant to mention that N=10000. And perhaps mention
> that it means, e.g all CPUs do N=10000 alloc concurrently, synchronize
> before doing N free concurrently.
I'm doing the test with N=100000.
>
> > * Before
> > Kmalloc N*alloc N*free(32): Average=248/966
> > Kmalloc N*alloc N*free(64): Average=261/949
> > Kmalloc N*alloc N*free(128): Average=314/1016
> > Kmalloc N*alloc N*free(256): Average=741/1061
> > Kmalloc N*alloc N*free(512): Average=1246/1152
> > Kmalloc N*alloc N*free(1024): Average=2437/1259
> > Kmalloc N*alloc N*free(2048): Average=4980/1800
> > Kmalloc N*alloc N*free(4096): Average=9000/2078
> >
> > * After
> > Kmalloc N*alloc N*free(32): Average=344/792
> > Kmalloc N*alloc N*free(64): Average=347/882
> > Kmalloc N*alloc N*free(128): Average=390/959
> > Kmalloc N*alloc N*free(256): Average=393/1067
> > Kmalloc N*alloc N*free(512): Average=683/1229
> > Kmalloc N*alloc N*free(1024): Average=1295/1325
> > Kmalloc N*alloc N*free(2048): Average=2513/1664
> > Kmalloc N*alloc N*free(4096): Average=4742/2172
> >
> > It shows that allocation performance decreases for the object size up to
> > 128 and it may be due to extra checks in cache_alloc_refill(). But, with
> > considering improvement of free performance, net result looks the same.
> > Result for other size class looks very promising, roughly, 50% performance
> > improvement.
>
> Super nice performance boost. The numbers on my system are
Thanks!
> significantly smaller, but this is a before/after test and the absolute
> numbers are not that important.
>
> Oh, maybe this was because I ran the test with SLUB... recompiling with
> SLAB... and the results are comparable to your numbers (on my 8 core
> i7-4790K CPU @ 4.00GHz)
Okay.
Thanks.
On Tue, 12 Apr 2016, [email protected] wrote:
> @@ -2222,6 +2241,7 @@ static void drain_cpu_caches(struct kmem_cache *cachep)
> {
> struct kmem_cache_node *n;
> int node;
> + LIST_HEAD(list);
>
> on_each_cpu(do_drain, cachep, 1);
> check_irq_on();
> @@ -2229,8 +2249,13 @@ static void drain_cpu_caches(struct kmem_cache *cachep)
> if (n->alien)
> drain_alien_cache(cachep, n->alien);
>
> - for_each_kmem_cache_node(cachep, node, n)
> - drain_array(cachep, n, n->shared, 1, node);
> + for_each_kmem_cache_node(cachep, node, n) {
> + spin_lock_irq(&n->list_lock);
> + drain_array_locked(cachep, n->shared, node, true, &list);
> + spin_unlock_irq(&n->list_lock);
> +
> + slabs_destroy(cachep, &list);
Can the slabs_destroy() call be moved outside of the loop? It may be
faster then?
Acked-by: Christoph Lameter <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
On Tue, Apr 12, 2016 at 11:38:39AM -0500, Christoph Lameter wrote:
> On Tue, 12 Apr 2016, [email protected] wrote:
>
> > @@ -2222,6 +2241,7 @@ static void drain_cpu_caches(struct kmem_cache *cachep)
> > {
> > struct kmem_cache_node *n;
> > int node;
> > + LIST_HEAD(list);
> >
> > on_each_cpu(do_drain, cachep, 1);
> > check_irq_on();
> > @@ -2229,8 +2249,13 @@ static void drain_cpu_caches(struct kmem_cache *cachep)
> > if (n->alien)
> > drain_alien_cache(cachep, n->alien);
> >
> > - for_each_kmem_cache_node(cachep, node, n)
> > - drain_array(cachep, n, n->shared, 1, node);
> > + for_each_kmem_cache_node(cachep, node, n) {
> > + spin_lock_irq(&n->list_lock);
> > + drain_array_locked(cachep, n->shared, node, true, &list);
> > + spin_unlock_irq(&n->list_lock);
> > +
> > + slabs_destroy(cachep, &list);
>
> Can the slabs_destroy() call be moved outside of the loop? It may be
> faster then?
Yes, it can. But I'd prefer to call it for each node. That would be
better for the cache, although the benefit would be marginal.
Thanks.
On Tue, Apr 12, 2016 at 01:50:59PM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> It can be reused on other place, so factor out it. Following patch will
> use it.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> mm/slab.c | 68 ++++++++++++++++++++++++++++++++++++---------------------------
> 1 file changed, 39 insertions(+), 29 deletions(-)
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 5451929..49af685 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -841,6 +841,40 @@ static inline gfp_t gfp_exact_node(gfp_t flags)
> }
> #endif
>
> +static int init_cache_node(struct kmem_cache *cachep, int node, gfp_t gfp)
> +{
> + struct kmem_cache_node *n;
> +
> + /*
> + * Set up the kmem_cache_node for cpu before we can
> + * begin anything. Make sure some other cpu on this
> + * node has not already allocated this
> + */
> + n = get_node(cachep, node);
> + if (n)
> + return 0;
> +
> + n = kmalloc_node(sizeof(struct kmem_cache_node), gfp, node);
> + if (!n)
> + return -ENOMEM;
> +
> + kmem_cache_node_init(n);
> + n->next_reap = jiffies + REAPTIMEOUT_NODE +
> + ((unsigned long)cachep) % REAPTIMEOUT_NODE;
> +
> + n->free_limit =
> + (1 + nr_cpus_node(node)) * cachep->batchcount + cachep->num;
> +
> + /*
> + * The kmem_cache_nodes don't come and go as CPUs
> + * come and go. slab_mutex is sufficient
> + * protection here.
> + */
> + cachep->node[node] = n;
> +
> + return 0;
> +}
> +
Hello, Andrew.
Could you apply the following fix for this patch to mmotm?
Thanks.
------>8-----------
Date: Thu, 14 Apr 2016 10:28:11 +0900
Subject: [PATCH] mm/slab: fix bug
n->free_limit is set once during boot-up, before the other cpus are
enabled, so it could be a very low value. If we don't re-set it when
another cpu comes up, it stays too low. Fix it.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/slab.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/mm/slab.c b/mm/slab.c
index 13e74aa..59dd94a 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -856,8 +856,14 @@ static int init_cache_node(struct kmem_cache *cachep, int node, gfp_t gfp)
* node has not already allocated this
*/
n = get_node(cachep, node);
- if (n)
+ if (n) {
+ spin_lock_irq(&n->list_lock);
+ n->free_limit = (1 + nr_cpus_node(node)) * cachep->batchcount +
+ cachep->num;
+ spin_unlock_irq(&n->list_lock);
+
return 0;
+ }
n = kmalloc_node(sizeof(struct kmem_cache_node), gfp, node);
if (!n)
--
1.9.1