2019-05-20 06:09:50

by Tobin C. Harding

Subject: [RFC PATCH v5 00/16] Slab Movable Objects (SMO)

Hi,

Another iteration of the SMO patch set. Updates in this version are
restricted to the XArray patches (#9 and #10, tested with the module
implemented in #11).

Applies on top of Linus' tree (tag: v5.2-rc1).

This is a patch set implementing movable objects within the SLUB
allocator. This is work based on Christoph Lameter's patch set:

https://lore.kernel.org/patchwork/project/lkml/list/?series=377335

The original code logic is from that set and was implemented by Christoph.
Clean up, refactoring, documentation, and additional features are by me.
Responsibility for any remaining bugs falls solely on me.

I intend to send a non-RFC version soon after this one (if the
XArray stuff is ok). If anyone has any objections to SMO in general
please yell at me now.

Changes to this version:

Patch XArray to use a separate slab cache. Currently the radix tree and
XArray use the same slab cache. Radix tree nodes cannot be moved but
XArray nodes can.

Matthew,

Does this fit in ok with your plans for the XArray and radix tree? I
don't really like the function names used here or the init function name
(xarray_slabcache_init()). If there is a better way to do this please
mercilessly correct me :)


Thanks for looking at this,
Tobin.


Tobin C. Harding (16):
slub: Add isolate() and migrate() methods
tools/vm/slabinfo: Add support for -C and -M options
slub: Sort slab cache list
slub: Slab defrag core
tools/vm/slabinfo: Add remote node defrag ratio output
tools/vm/slabinfo: Add defrag_used_ratio output
tools/testing/slab: Add object migration test module
tools/testing/slab: Add object migration test suite
lib: Separate radix_tree_node and xa_node slab cache
xarray: Implement migration function for xa_node objects
tools/testing/slab: Add XArray movable objects tests
slub: Enable moving objects to/from specific nodes
slub: Enable balancing slabs across nodes
dcache: Provide a dentry constructor
dcache: Implement partial shrink via Slab Movable Objects
dcache: Add CONFIG_DCACHE_SMO

Documentation/ABI/testing/sysfs-kernel-slab | 14 +
fs/dcache.c | 110 ++-
include/linux/slab.h | 71 ++
include/linux/slub_def.h | 10 +
include/linux/xarray.h | 3 +
init/main.c | 2 +
lib/radix-tree.c | 2 +-
lib/xarray.c | 109 ++-
mm/Kconfig | 14 +
mm/slab_common.c | 2 +-
mm/slub.c | 819 ++++++++++++++++++--
tools/testing/slab/Makefile | 10 +
tools/testing/slab/slub_defrag.c | 567 ++++++++++++++
tools/testing/slab/slub_defrag.py | 451 +++++++++++
tools/testing/slab/slub_defrag_xarray.c | 211 +++++
tools/vm/slabinfo.c | 51 +-
16 files changed, 2343 insertions(+), 103 deletions(-)
create mode 100644 tools/testing/slab/Makefile
create mode 100644 tools/testing/slab/slub_defrag.c
create mode 100755 tools/testing/slab/slub_defrag.py
create mode 100644 tools/testing/slab/slub_defrag_xarray.c

--
2.21.0



2019-05-20 06:17:42

by Tobin C. Harding

Subject: [RFC PATCH v5 06/16] tools/vm/slabinfo: Add defrag_used_ratio output

Add output for the newly added defrag_used_ratio sysfs knob.

Signed-off-by: Tobin C. Harding <[email protected]>
---
tools/vm/slabinfo.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index d2c22f9ee2d8..ef4ff93df4cc 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -34,6 +34,7 @@ struct slabinfo {
unsigned int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
int movable, ctor;
+ int defrag_used_ratio;
int remote_node_defrag_ratio;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
@@ -549,6 +550,8 @@ static void report(struct slabinfo *s)
printf("** Slabs are destroyed via RCU\n");
if (s->reclaim_account)
printf("** Reclaim accounting active\n");
+ if (s->movable)
+ printf("** Defragmentation at %d%%\n", s->defrag_used_ratio);

printf("\nSizes (bytes) Slabs Debug Memory\n");
printf("------------------------------------------------------------------------\n");
@@ -1279,6 +1282,7 @@ static void read_slab_dir(void)
slab->deactivate_bypass = get_obj("deactivate_bypass");
slab->remote_node_defrag_ratio =
get_obj("remote_node_defrag_ratio");
+ slab->defrag_used_ratio = get_obj("defrag_used_ratio");
chdir("..");
if (read_slab_obj(slab, "ops")) {
if (strstr(buffer, "ctor :"))
--
2.21.0


2019-05-20 06:20:54

by Tobin C. Harding

Subject: [RFC PATCH v5 10/16] xarray: Implement migration function for xa_node objects

Recently Slab Movable Objects (SMO) was implemented for the SLUB
allocator. The XArray can take advantage of this and make the xa_node
slab cache objects movable.

Implement functions to migrate objects and activate SMO when we
initialise the XArray slab cache.

This is based on initial code by Matthew Wilcox and was modified to work
with slab object migration.
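
The pointer fix-ups a node migration has to perform can be sketched in
userspace. This is only an illustration of the idea behind
xa_object_migrate(); the names Node, Tree, migrate and CHUNK are made up
for the sketch, and locking, RCU and the private_list handling are left
out entirely.

```python
CHUNK = 4  # illustrative stand-in for XA_CHUNK_SIZE


class Node:
    """Toy tree node: a parent pointer, an offset in the parent, and slots."""

    def __init__(self, parent=None, offset=0):
        self.parent = parent
        self.offset = offset
        self.slots = [None] * CHUNK


class Tree:
    """Toy tree with a single head pointer (compare xa->xa_head)."""

    def __init__(self):
        self.head = None


def migrate(tree, old):
    """Replace `old` with a fresh copy and repoint everything that referenced it."""
    new = Node(old.parent, old.offset)
    new.slots = list(old.slots)  # copy the contents, like the memcpy()

    # Children that are nodes must now point at the new parent.
    for entry in new.slots:
        if isinstance(entry, Node):
            entry.parent = new

    # The slot that referenced the old node must reference the new one.
    if new.parent is None:
        tree.head = new
    else:
        new.parent.slots[new.offset] = new
    return new


tree = Tree()
root = Node()
tree.head = root
child = Node(parent=root, offset=1)
root.slots[1] = child
grandchild = Node(parent=child, offset=2)
child.slots[2] = grandchild
child.slots[0] = "value"

new_child = migrate(tree, child)
new_root = migrate(tree, root)
```

After both calls nothing reachable from the tree head refers to the old
root or the old child any more, which is the property that lets the slab
holding the old objects be freed.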

Cc: Matthew Wilcox <[email protected]>
Co-developed-by: Christoph Lameter <[email protected]>
Signed-off-by: Tobin C. Harding <[email protected]>
---
lib/xarray.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 61 insertions(+)

diff --git a/lib/xarray.c b/lib/xarray.c
index a528a5277c9d..c6b077f59e88 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1993,12 +1993,73 @@ static void xa_node_ctor(void *arg)
INIT_LIST_HEAD(&node->private_list);
}

+static void xa_object_migrate(struct xa_node *node, int numa_node)
+{
+ struct xarray *xa = READ_ONCE(node->array);
+ void __rcu **slot;
+ struct xa_node *new_node;
+ int i;
+
+ /* Freed or not yet in tree then skip */
+ if (!xa || xa == XA_RCU_FREE)
+ return;
+
+ new_node = kmem_cache_alloc_node(xa_node_cachep, GFP_KERNEL, numa_node);
+ if (!new_node) {
+ pr_err("%s: slab cache allocation failed\n", __func__);
+ return;
+ }
+
+ xa_lock_irq(xa);
+
+ /* Check again..... */
+ if (xa != node->array) {
+ node = new_node;
+ goto unlock;
+ }
+
+ memcpy(new_node, node, sizeof(struct xa_node));
+
+ if (list_empty(&node->private_list))
+ INIT_LIST_HEAD(&new_node->private_list);
+ else
+ list_replace(&node->private_list, &new_node->private_list);
+
+ for (i = 0; i < XA_CHUNK_SIZE; i++) {
+ void *x = xa_entry_locked(xa, new_node, i);
+
+ if (xa_is_node(x))
+ rcu_assign_pointer(xa_to_node(x)->parent, new_node);
+ }
+ if (!new_node->parent)
+ slot = &xa->xa_head;
+ else
+ slot = &xa_parent_locked(xa, new_node)->slots[new_node->offset];
+ rcu_assign_pointer(*slot, xa_mk_node(new_node));
+
+unlock:
+ xa_unlock_irq(xa);
+ xa_node_free(node);
+ rcu_barrier();
+}
+
+static void xa_migrate(struct kmem_cache *s, void **objects, int nr,
+ int node, void *_unused)
+{
+ int i;
+
+ for (i = 0; i < nr; i++)
+ xa_object_migrate(objects[i], node);
+}
+
+
void __init xarray_slabcache_init(void)
{
xa_node_cachep = kmem_cache_create("xarray_node",
sizeof(struct xa_node), 0,
SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
xa_node_ctor);
+ kmem_cache_setup_mobility(xa_node_cachep, NULL, xa_migrate);
}

#ifdef XA_DEBUG
--
2.21.0


2019-05-20 06:21:54

by Tobin C. Harding

Subject: [RFC PATCH v5 11/16] tools/testing/slab: Add XArray movable objects tests

We just implemented movable objects for the XArray. Let's test it
in-tree.

Add test module for the XArray's movable objects implementation.

Functionality of the XArray Slab Movable Object implementation can
usually be seen simply by using `slabinfo` on a running machine, since
the radix tree is typically in use and will have partial slabs. For
repeated testing we can use the test module to simulate a workload on
the XArray, then use `slabinfo` to check that object migration is
functioning.

If testing on a freshly spun up VM (low radix tree workload) it may be
necessary to load/unload the module a number of times to create partial
slabs.

Example test session
--------------------

Relevant /proc/slabinfo column headers:

name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>

Prior to testing slabinfo report for radix_tree_node:

# slabinfo radix_tree_node --report

Slabcache: radix_tree_node Aliases: 0 Order : 2 Objects: 8352
** Reclaim accounting active
** Defragmentation at 30%

Sizes (bytes) Slabs Debug Memory
------------------------------------------------------------------------
Object : 576 Total : 497 Sanity Checks : On Total: 8142848
SlabObj: 912 Full : 473 Redzoning : On Used : 4810752
SlabSiz: 16384 Partial: 24 Poisoning : On Loss : 3332096
Loss : 336 CpuSlab: 0 Tracking : On Lalig: 2806272
Align : 8 Objects: 17 Tracing : Off Lpadd: 437360

Here you can see the kernel was built with Slab Movable Objects enabled
for the XArray (XArray uses the radix tree below the surface).

After inserting the test module (note we have triggered allocation of a
number of radix tree nodes increasing the object count but decreasing the
number of partial slabs):

# slabinfo radix_tree_node --report

Slabcache: radix_tree_node Aliases: 0 Order : 2 Objects: 8442
** Reclaim accounting active
** Defragmentation at 30%

Sizes (bytes) Slabs Debug Memory
------------------------------------------------------------------------
Object : 576 Total : 499 Sanity Checks : On Total: 8175616
SlabObj: 912 Full : 484 Redzoning : On Used : 4862592
SlabSiz: 16384 Partial: 15 Poisoning : On Loss : 3313024
Loss : 336 CpuSlab: 0 Tracking : On Lalig: 2836512
Align : 8 Objects: 17 Tracing : Off Lpadd: 439120

Now we can shrink the radix_tree_node cache:

# slabinfo radix_tree_node --shrink
# slabinfo radix_tree_node --report

Slabcache: radix_tree_node Aliases: 0 Order : 2 Objects: 8515
** Reclaim accounting active
** Defragmentation at 30%

Sizes (bytes) Slabs Debug Memory
------------------------------------------------------------------------
Object : 576 Total : 501 Sanity Checks : On Total: 8208384
SlabObj: 912 Full : 500 Redzoning : On Used : 4904640
SlabSiz: 16384 Partial: 1 Poisoning : On Loss : 3303744
Loss : 336 CpuSlab: 0 Tracking : On Lalig: 2861040
Align : 8 Objects: 17 Tracing : Off Lpadd: 440880

Note the single remaining partial slab.
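
As a cross-check, the derived columns of the first report above can be
recomputed from its raw values. The formulas below are inferred from the
numbers in that report and agree with them; they are a sketch, not a
statement about how slabinfo computes the columns internally. The
variable names are chosen for this example.

```python
# Raw values copied from the first `slabinfo radix_tree_node --report` above.
objects = 8352          # "Objects" in the header line
object_size = 576       # "Object"
slab_obj_size = 912     # "SlabObj" (object plus debug metadata)
slab_size = 16384       # "SlabSiz"
slabs = 497             # "Total" in the Slabs column

objs_per_slab = slab_size // slab_obj_size              # objects that fit in one slab
total = slabs * slab_size                               # "Total" in the Memory column
used = objects * object_size                            # "Used"
loss = total - used                                     # "Loss" in the Memory column
per_object_loss = slab_obj_size - object_size           # "Loss" in the Sizes column
align_loss = objects * per_object_loss                  # "Lalig"
padding_loss = slabs * (slab_size - objs_per_slab * slab_obj_size)  # "Lpadd"

print(objs_per_slab, total, used, loss, per_object_loss, align_loss, padding_loss)
```

The same relations hold for the later reports, for example 499 slabs
with 880 bytes of padding each gives the 439120 shown under Lpadd after
the module was inserted.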

Signed-off-by: Tobin C. Harding <[email protected]>
---
tools/testing/slab/Makefile | 2 +-
tools/testing/slab/slub_defrag_xarray.c | 211 ++++++++++++++++++++++++
2 files changed, 212 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/slab/slub_defrag_xarray.c

diff --git a/tools/testing/slab/Makefile b/tools/testing/slab/Makefile
index 440c2e3e356f..44c18d9a4d52 100644
--- a/tools/testing/slab/Makefile
+++ b/tools/testing/slab/Makefile
@@ -1,4 +1,4 @@
-obj-m += slub_defrag.o
+obj-m += slub_defrag.o slub_defrag_xarray.o

KTREE=../../..

diff --git a/tools/testing/slab/slub_defrag_xarray.c b/tools/testing/slab/slub_defrag_xarray.c
new file mode 100644
index 000000000000..41143f73256c
--- /dev/null
+++ b/tools/testing/slab/slub_defrag_xarray.c
@@ -0,0 +1,211 @@
+// SPDX-License-Identifier: GPL-2.0+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/list.h>
+#include <linux/gfp.h>
+#include <linux/xarray.h>
+
+#define SMOX_CACHE_NAME "smox_test"
+static struct kmem_cache *cachep;
+
+/*
+ * Declare XArrays globally so we can clean them up on module unload.
+ */
+
+/* Used by test_smo_xarray() */
+DEFINE_XARRAY(things);
+
+/* Thing to store pointers to in the XArray */
+struct smox_thing {
+ long id;
+};
+
+/* It's up to the caller to ensure id is unique */
+static struct smox_thing *alloc_thing(int id)
+{
+ struct smox_thing *thing;
+
+ thing = kmem_cache_alloc(cachep, GFP_KERNEL);
+ if (!thing)
+ return ERR_PTR(-ENOMEM);
+
+ thing->id = id;
+ return thing;
+}
+
+/**
+ * smox_object_ctor() - SMO object constructor function.
+ * @ptr: Pointer to memory where the object should be constructed.
+ */
+void smox_object_ctor(void *ptr)
+{
+ struct smox_thing *thing = ptr;
+
+ thing->id = -1;
+}
+
+/**
+ * smox_cache_migrate() - kmem_cache migrate function.
+ * @cp: kmem_cache pointer.
+ * @objs: Array of pointers to objects to migrate.
+ * @size: Number of objects in @objs.
+ * @node: NUMA node where the object should be allocated.
+ * @private: Pointer returned by kmem_cache_isolate_func().
+ */
+void smox_cache_migrate(struct kmem_cache *cp, void **objs, int size,
+ int node, void *private)
+{
+ struct smox_thing **ptrs = (struct smox_thing **)objs;
+ struct smox_thing *old, *new;
+ struct smox_thing *thing;
+ unsigned long index;
+ void *entry;
+ int i;
+
+ for (i = 0; i < size; i++) {
+ old = ptrs[i];
+
+ new = kmem_cache_alloc(cachep, GFP_KERNEL);
+ if (!new) {
+ pr_debug("kmem_cache_alloc failed\n");
+ return;
+ }
+
+ new->id = old->id;
+
+ /* Update reference the brain dead way */
+ xa_for_each(&things, index, thing) {
+ if (thing == old) {
+ entry = xa_store(&things, index, new, GFP_KERNEL);
+ if (entry != old) {
+ pr_err("failed to exchange new/old\n");
+ return;
+ }
+ }
+ }
+ kmem_cache_free(cachep, old);
+ }
+}
+
+/*
+ * test_smo_xarray() - Run some tests using an XArray.
+ */
+static int test_smo_xarray(void)
+{
+ const int keep = 6; /* Free 5 out of 6 items */
+ const int nr_items = 10000;
+ struct smox_thing *thing;
+ unsigned long index;
+ void *entry;
+ int expected;
+ int i;
+
+ /*
+ * Populate XArray, this adds to the radix_tree_node cache as
+ * well as the smox_test cache.
+ */
+ for (i = 0; i < nr_items; i++) {
+ thing = alloc_thing(i);
+ entry = xa_store(&things, i, thing, GFP_KERNEL);
+ if (xa_is_err(entry)) {
+ pr_err("smox: failed to allocate entry: %d\n", i);
+ return -ENOMEM;
+ }
+ }
+
+ /* Now free items, putting holes in both caches. */
+ for (i = 0; i < nr_items; i++) {
+ if (i % keep == 0)
+ continue;
+
+ thing = xa_erase(&things, i);
+ if (xa_is_err(thing))
+ pr_err("smox: error erasing entry: %d\n", i);
+ kmem_cache_free(cachep, thing);
+ }
+
+ expected = 0;
+ xa_for_each(&things, index, thing) {
+ if (thing->id != expected || index != expected) {
+ pr_err("smox: error; got %ld want %d at %ld\n",
+ thing->id, expected, index);
+ return -1;
+ }
+ expected += keep;
+ }
+
+ /*
+ * Leave caches sparsely allocated. Shrink caches manually with:
+ *
+ * slabinfo radix_tree_node --shrink
+ * slabinfo smox_test --shrink
+ */
+
+ return 0;
+}
+
+static int __init smox_cache_init(void)
+{
+ cachep = kmem_cache_create(SMOX_CACHE_NAME,
+ sizeof(struct smox_thing),
+ 0, 0, smox_object_ctor);
+ if (!cachep)
+ return -1;
+
+ return 0;
+}
+
+static void __exit smox_cache_cleanup(void)
+{
+ struct smox_thing *thing;
+ unsigned long i;
+
+ xa_for_each(&things, i, thing) {
+ kmem_cache_free(cachep, thing);
+ }
+ xa_destroy(&things);
+ kmem_cache_destroy(cachep);
+}
+
+static int __init smox_init(void)
+{
+ int ret;
+
+ ret = smox_cache_init();
+ if (ret) {
+ pr_err("smo_xarray: failed to create cache\n");
+ return ret;
+ }
+ pr_info("smo_xarray: created kmem_cache: %s\n", SMOX_CACHE_NAME);
+
+ kmem_cache_setup_mobility(cachep, NULL, smox_cache_migrate);
+ pr_info("smo_xarray: kmem_cache %s defrag enabled\n", SMOX_CACHE_NAME);
+
+ /*
+ * Running this test consumes memory unless you shrink the
+ * radix_tree_node cache manually with `slabinfo`.
+ */
+ ret = test_smo_xarray();
+ if (ret)
+ pr_warn("test_smo_xarray failed: %d\n", ret);
+
+ pr_info("smo_xarray: module loaded successfully\n");
+ return 0;
+}
+module_init(smox_init);
+
+static void __exit smox_exit(void)
+{
+ smox_cache_cleanup();
+
+ pr_info("smo_xarray: module removed\n");
+}
+module_exit(smox_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Tobin C. Harding");
+MODULE_DESCRIPTION("SMO XArray test module.");
--
2.21.0


2019-05-20 06:48:50

by Tobin C. Harding

Subject: [RFC PATCH v5 02/16] tools/vm/slabinfo: Add support for -C and -M options

-C lists caches that use a ctor.

-M lists caches that support object migration.

Add command line options to show caches with a constructor and caches
that are movable (i.e. have migrate function).
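
The detection itself is a substring match on the contents of the cache's
`ops` sysfs file, as the read_slab_dir() hunk below does with strstr().
A minimal sketch of the same check follows; the function name and the
sample `ops` contents are hypothetical, and real symbol names and
offsets will differ.

```python
def cache_capabilities(ops_text):
    """Return (has_ctor, is_movable) for the text of a cache's `ops` file."""
    has_ctor = "ctor :" in ops_text
    is_movable = "migrate :" in ops_text
    return has_ctor, is_movable


# Hypothetical file contents, for illustration only.
samples = {
    "plain": "",
    "with_ctor": "ctor : some_ctor+0x0/0x10\n",
    "movable": "ctor : some_ctor+0x0/0x10\nmigrate : some_migrate+0x0/0x50\n",
}

for name, text in samples.items():
    print(name, cache_capabilities(text))
```

With -C only caches for which the first value is true are listed, and
with -M only those for which the second is true.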

Co-developed-by: Christoph Lameter <[email protected]>
Signed-off-by: Tobin C. Harding <[email protected]>
---
tools/vm/slabinfo.c | 40 ++++++++++++++++++++++++++++++++++++----
1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index 73818f1b2ef8..cbfc56c44c2f 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -33,6 +33,7 @@ struct slabinfo {
unsigned int hwcache_align, object_size, objs_per_slab;
unsigned int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
+ int movable, ctor;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
unsigned long free_fastpath, free_slowpath;
@@ -67,6 +68,8 @@ int show_report;
int show_alias;
int show_slab;
int skip_zero = 1;
+int show_movable;
+int show_ctor;
int show_numa;
int show_track;
int show_first_alias;
@@ -109,11 +112,13 @@ static void fatal(const char *x, ...)

static void usage(void)
{
- printf("slabinfo 4/15/2011. (c) 2007 sgi/(c) 2011 Linux Foundation.\n\n"
- "slabinfo [-aADefhilnosrStTvz1LXBU] [N=K] [-dafzput] [slab-regexp]\n"
+ printf("slabinfo 4/15/2017. (c) 2007 sgi/(c) 2011 Linux Foundation/(c) 2017 Jump Trading LLC.\n\n"
+ "slabinfo [-aACDefhilMnosrStTvz1LXBU] [N=K] [-dafzput] [slab-regexp]\n"
+
"-a|--aliases Show aliases\n"
"-A|--activity Most active slabs first\n"
"-B|--Bytes Show size in bytes\n"
+ "-C|--ctor Show slabs with ctors\n"
"-D|--display-active Switch line format to activity\n"
"-e|--empty Show empty slabs\n"
"-f|--first-alias Show first alias\n"
@@ -121,6 +126,7 @@ static void usage(void)
"-i|--inverted Inverted list\n"
"-l|--slabs Show slabs\n"
"-L|--Loss Sort by loss\n"
+ "-M|--movable Show caches that support movable objects\n"
"-n|--numa Show NUMA information\n"
"-N|--lines=K Show the first K slabs\n"
"-o|--ops Show kmem_cache_ops\n"
@@ -588,6 +594,12 @@ static void slabcache(struct slabinfo *s)
if (show_empty && s->slabs)
return;

+ if (show_ctor && !s->ctor)
+ return;
+
+ if (show_movable && !s->movable)
+ return;
+
if (sort_loss == 0)
store_size(size_str, slab_size(s));
else
@@ -602,6 +614,10 @@ static void slabcache(struct slabinfo *s)
*p++ = '*';
if (s->cache_dma)
*p++ = 'd';
+ if (s->ctor)
+ *p++ = 'C';
+ if (s->movable)
+ *p++ = 'M';
if (s->hwcache_align)
*p++ = 'A';
if (s->poison)
@@ -636,7 +652,8 @@ static void slabcache(struct slabinfo *s)
printf("%-21s %8ld %7d %15s %14s %4d %1d %3ld %3ld %s\n",
s->name, s->objects, s->object_size, size_str, dist_str,
s->objs_per_slab, s->order,
- s->slabs ? (s->partial * 100) / s->slabs : 100,
+ s->slabs ? (s->partial * 100) /
+ (s->slabs * s->objs_per_slab) : 100,
s->slabs ? (s->objects * s->object_size * 100) /
(s->slabs * (page_size << s->order)) : 100,
flags);
@@ -1256,6 +1273,13 @@ static void read_slab_dir(void)
slab->alloc_node_mismatch = get_obj("alloc_node_mismatch");
slab->deactivate_bypass = get_obj("deactivate_bypass");
chdir("..");
+ if (read_slab_obj(slab, "ops")) {
+ if (strstr(buffer, "ctor :"))
+ slab->ctor = 1;
+ if (strstr(buffer, "migrate :"))
+ slab->movable = 1;
+ }
+
if (slab->name[0] == ':')
alias_targets++;
slab++;
@@ -1332,6 +1356,8 @@ static void xtotals(void)
}

struct option opts[] = {
+ { "ctor", no_argument, NULL, 'C' },
+ { "movable", no_argument, NULL, 'M' },
{ "aliases", no_argument, NULL, 'a' },
{ "activity", no_argument, NULL, 'A' },
{ "debug", optional_argument, NULL, 'd' },
@@ -1367,7 +1393,7 @@ int main(int argc, char *argv[])

page_size = getpagesize();

- while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTSN:LXBU",
+ while ((c = getopt_long(argc, argv, "aACd::Defhil1MnoprstvzTSN:LXBU",
opts, NULL)) != -1)
switch (c) {
case '1':
@@ -1376,6 +1402,9 @@ int main(int argc, char *argv[])
case 'a':
show_alias = 1;
break;
+ case 'C':
+ show_ctor = 1;
+ break;
case 'A':
sort_active = 1;
break;
@@ -1399,6 +1428,9 @@ int main(int argc, char *argv[])
case 'i':
show_inverted = 1;
break;
+ case 'M':
+ show_movable = 1;
+ break;
case 'n':
show_numa = 1;
break;
--
2.21.0


2019-05-20 06:49:07

by Tobin C. Harding

Subject: [RFC PATCH v5 03/16] slub: Sort slab cache list

It is advantageous to have all defragmentable slabs together at the
beginning of the list of slabs so that there is no need to scan the
complete list. Put defragmentable caches first when adding a slab cache
and others last.
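
The effect of the ordering can be modelled with a plain list. This is a
userspace sketch only: Cache, create_cache, setup_mobility and
movable_caches are names invented for the example, and the deque stands
in for the slab_caches list.

```python
from collections import deque


class Cache:
    """Toy stand-in for struct kmem_cache."""

    def __init__(self, name):
        self.name = name
        self.movable = False


slab_caches = deque()


def create_cache(name):
    cache = Cache(name)
    slab_caches.append(cache)        # new caches go to the tail
    return cache


def setup_mobility(cache):
    cache.movable = True
    slab_caches.remove(cache)
    slab_caches.appendleft(cache)    # movable caches are moved to the head


def movable_caches():
    """Walk from the head and stop at the first non-movable cache."""
    found = []
    for cache in slab_caches:
        if not cache.movable:
            break
        found.append(cache.name)
    return found


a = create_cache("a")
b = create_cache("b")
c = create_cache("c")
setup_mobility(b)
first = movable_caches()
setup_mobility(c)
second = movable_caches()
print(first, second)
```

Because every movable cache sits in front of every non-movable one, a
walk for defragmentable caches can end at the first cache without a
migrate function.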

Co-developed-by: Christoph Lameter <[email protected]>
Signed-off-by: Tobin C. Harding <[email protected]>
---
mm/slab_common.c | 2 +-
mm/slub.c | 6 ++++++
2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 58251ba63e4a..db5e9a0b1535 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -393,7 +393,7 @@ static struct kmem_cache *create_cache(const char *name,
goto out_free_cache;

s->refcount = 1;
- list_add(&s->list, &slab_caches);
+ list_add_tail(&s->list, &slab_caches);
memcg_link_cache(s);
out:
if (err)
diff --git a/mm/slub.c b/mm/slub.c
index 1c380a2bc78a..66d474397c0f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4333,6 +4333,8 @@ void kmem_cache_setup_mobility(struct kmem_cache *s,
return;
}

+ mutex_lock(&slab_mutex);
+
s->isolate = isolate;
s->migrate = migrate;

@@ -4341,6 +4343,10 @@ void kmem_cache_setup_mobility(struct kmem_cache *s,
* to disable fast cmpxchg based processing.
*/
s->flags &= ~__CMPXCHG_DOUBLE;
+
+ list_move(&s->list, &slab_caches); /* Move to top */
+
+ mutex_unlock(&slab_mutex);
}
EXPORT_SYMBOL(kmem_cache_setup_mobility);

--
2.21.0


2019-05-20 06:51:24

by Tobin C. Harding

Subject: [RFC PATCH v5 08/16] tools/testing/slab: Add object migration test suite

We just added a module that enables testing the SLUB allocator's ability
to defrag/shrink caches via movable objects. Tests are better when they
are automated.

Add automated testing via a Python script for SLUB movable objects.

Example output:

$ cd path/to/linux/tools/testing/slab
$ ./slub_defrag.py
Please run script as root

$ sudo ./slub_defrag.py
<tests are quiet, no output on success>

$ sudo ./slub_defrag.py --debug
Loading module ...
Slab cache smo_test created
Objects per slab: 20
Running sanity checks ...

Running module stress test (see dmesg for additional test output) ...
Removing module slub_defrag ...
Loading module ...
Slab cache smo_test created

Running test non-movable ...
testing slab 'smo_test' prior to enabling movable objects ...
verified non-movable slabs are NOT shrinkable

Running test movable ...
testing slab 'smo_test' after enabling movable objects ...
verified movable slabs are shrinkable

Removing module slub_defrag ...
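
The counter values the movable and non-movable tests check for can be
worked out by hand with a very small model of the cache. This is a
sketch under simplifying assumptions: a cache is a list of per-slab
object counts, allocation fills the last slab, freeing takes from the
first slab, and empty slabs are not modelled. All function names are
invented for the example; 20 objects per slab is the value shown in the
debug output above.

```python
def alloc(slabs, n, per_slab):
    """Allocate n objects, opening a new slab when the last one is full."""
    for _ in range(n):
        if not slabs or slabs[-1] == per_slab:
            slabs.append(0)
        slabs[-1] += 1


def free_from_head(slabs, n):
    """Free n objects, taking each from the first slab that still holds one."""
    for _ in range(n):
        idx = next(i for i, used in enumerate(slabs) if used > 0)
        slabs[idx] -= 1


def shrink_movable(slabs, per_slab):
    """Pack all live objects into as few slabs as possible."""
    full, rest = divmod(sum(slabs), per_slab)
    slabs[:] = [per_slab] * full + ([rest] if rest else [])


def state(slabs, per_slab):
    """Return (objects_active, objects_total, partial, full, slabs_total)."""
    full = sum(1 for used in slabs if used == per_slab)
    return (sum(slabs), len(slabs) * per_slab, len(slabs) - full, full, len(slabs))


PER_SLAB = 20
cache = []

alloc(cache, PER_SLAB + 1, PER_SLAB)
after_alloc = state(cache, PER_SLAB)

free_from_head(cache, 1)
after_free = state(cache, PER_SLAB)   # a non-movable cache stays here after shrink

shrink_movable(cache, PER_SLAB)
after_shrink = state(cache, PER_SLAB)

print(after_alloc, after_free, after_shrink)
```

The three tuples are in the same argument order as verify_state() in the
script, so they can be compared directly with the values set up in
test_non_movable() and test_movable().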

Signed-off-by: Tobin C. Harding <[email protected]>
---
tools/testing/slab/slub_defrag.c | 1 +
tools/testing/slab/slub_defrag.py | 451 ++++++++++++++++++++++++++++++
2 files changed, 452 insertions(+)
create mode 100755 tools/testing/slab/slub_defrag.py

diff --git a/tools/testing/slab/slub_defrag.c b/tools/testing/slab/slub_defrag.c
index 4a5c24394b96..8332e69ee868 100644
--- a/tools/testing/slab/slub_defrag.c
+++ b/tools/testing/slab/slub_defrag.c
@@ -337,6 +337,7 @@ static int smo_run_module_tests(int nr_objs, int keep)

/*
* struct functions() - Map command to a function pointer.
+ * If you update this please update the documentation in slub_defrag.py
*/
struct functions {
char *fn_name;
diff --git a/tools/testing/slab/slub_defrag.py b/tools/testing/slab/slub_defrag.py
new file mode 100755
index 000000000000..41747c0db39b
--- /dev/null
+++ b/tools/testing/slab/slub_defrag.py
@@ -0,0 +1,451 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import subprocess
+import sys
+from os import path
+
+# SLUB Movable Objects test suite.
+#
+# Requirements:
+# - CONFIG_SLUB=y
+# - CONFIG_SLUB_DEBUG=y
+# - The slub_defrag module in this directory.
+
+# Test SMO using a kernel module that enables triggering arbitrary
+# kernel code from userspace via a debugfs file.
+#
+# Module code is in ./slub_defrag.c, basically the functionality is as
+# follows:
+#
+# - Creates debugfs file /sys/kernel/debug/smo/callfn
+# - Writes to 'callfn' are parsed as a command string and the function
+# associated with command is called.
+# - Defines 4 commands (all commands operate on smo_test cache):
+# - 'test': Runs module stress tests.
+# - 'alloc N': Allocates N slub objects
+# - 'free N POS': Frees N objects starting at POS (see below)
+# - 'enable': Enables SLUB Movable Objects
+#
+# The module maintains a list of allocated objects. Allocation adds
+# objects to the tail of the list. Free'ing frees from the head of the
+# list. This has the effect of creating free slots in the slab. For
+# finer grained control over where in the cache slots are free'd POS
+# (position) argument may be used.
+
+# The main() function is reasonably readable; the test suite does the
+# following:
+#
+# 1. Runs the module stress tests.
+# 2. Tests the cache without movable objects enabled.
+# - Creates multiple partial slabs as explained above.
+# - Verifies that partial slabs are _not_ removed by shrink (see below).
+# 3. Tests the cache with movable objects enabled.
+# - Creates multiple partial slabs as explained above.
+# - Verifies that partial slabs _are_ removed by shrink (see below).
+
+# The sysfs file /sys/kernel/slab/<cache>/shrink enables calling the
+# function kmem_cache_shrink() (see mm/slab_common.c and mm/slub.c).
+# Shrinking a cache attempts to consolidate all partial slabs by moving
+# objects if object migration is enabled for the cache, otherwise
+# shrinking a cache simply re-orders the partial list so that the most
+# densely populated slabs are at the head of the list.
+
+# Enable/disable debugging output (also enabled via -d | --debug).
+debug = False
+
+# Used in debug messages and when running `insmod`.
+MODULE_NAME = "slub_defrag"
+
+# Slab cache created by the test module.
+CACHE_NAME = "smo_test"
+
+# Set by get_slab_config()
+objects_per_slab = 0
+pages_per_slab = 0
+debugfs_mounted = False # Set to true if we mount debugfs.
+
+
+def eprint(*args, **kwargs):
+ print(*args, file=sys.stderr, **kwargs)
+
+
+def dprint(*args, **kwargs):
+ if debug:
+ print(*args, file=sys.stderr, **kwargs)
+
+
+def run_shell(cmd):
+ return subprocess.call([cmd], shell=True)
+
+
+def run_shell_get_stdout(cmd):
+ return subprocess.check_output([cmd], shell=True)
+
+
+def assert_root():
+ user = run_shell_get_stdout('whoami')
+ if user != b'root\n':
+ eprint("Please run script as root")
+ sys.exit(1)
+
+
+def mount_debugfs():
+ mounted = False
+
+ # Check if debugfs is mounted at a known mount point.
+ ret = run_shell('mount -l | grep /sys/kernel/debug > /dev/null 2>&1')
+ if ret != 0:
+ run_shell('mount -t debugfs none /sys/kernel/debug/')
+ mounted = True
+ dprint("Mounted debugfs on /sys/kernel/debug")
+
+ return mounted
+
+
+def umount_debugfs():
+ dprint("Un-mounting debugfs")
+ run_shell('umount /sys/kernel/debug')
+
+
+def load_module():
+ """Loads the test module.
+
+ We need a clean slab state to start with so module must
+ be loaded by the test suite.
+ """
+ ret = run_shell('lsmod | grep %s > /dev/null' % MODULE_NAME)
+ if ret == 0:
+ eprint("Please unload slub_defrag module before running test suite")
+ return -1
+
+ dprint('Loading module ...')
+ ret = run_shell('insmod %s.ko' % MODULE_NAME)
+ if ret != 0: # ret==1 on error
+ return -1
+
+ dprint("Slab cache %s created" % CACHE_NAME)
+ return 0
+
+
+def unload_module():
+ ret = run_shell('lsmod | grep %s > /dev/null' % MODULE_NAME)
+ if ret == 0:
+ dprint('Removing module %s ...' % MODULE_NAME)
+ run_shell('rmmod %s > /dev/null 2>&1' % MODULE_NAME)
+
+
+def get_sysfs_value(filename):
+ """
+ Parse slab sysfs files (single line: '20 N0=20')
+ """
+ path = '/sys/kernel/slab/smo_test/%s' % filename
+ f = open(path, "r")
+ s = f.readline()
+ tokens = s.split(" ")
+
+ return int(tokens[0])
+
+
+def get_nr_objects_active():
+ return get_sysfs_value('objects')
+
+
+def get_nr_objects_total():
+ return get_sysfs_value('total_objects')
+
+
+def get_nr_slabs_total():
+ return get_sysfs_value('slabs')
+
+
+def get_nr_slabs_partial():
+ return get_sysfs_value('partial')
+
+
+def get_nr_slabs_full():
+ return get_nr_slabs_total() - get_nr_slabs_partial()
+
+
+def get_slab_config():
+ """Get relevant information from sysfs."""
+ global objects_per_slab
+
+ objects_per_slab = get_sysfs_value('objs_per_slab')
+ if objects_per_slab < 0:
+ return -1
+
+ dprint("Objects per slab: %d" % objects_per_slab)
+ return 0
+
+
+def verify_state(nr_objects_active, nr_objects_total,
+ nr_slabs_partial, nr_slabs_full, nr_slabs_total, msg=''):
+ err = 0
+ got_nr_objects_active = get_nr_objects_active()
+ got_nr_objects_total = get_nr_objects_total()
+ got_nr_slabs_partial = get_nr_slabs_partial()
+ got_nr_slabs_full = get_nr_slabs_full()
+ got_nr_slabs_total = get_nr_slabs_total()
+
+ if got_nr_objects_active != nr_objects_active:
+ err = -1
+
+ if got_nr_objects_total != nr_objects_total:
+ err = -2
+
+ if got_nr_slabs_partial != nr_slabs_partial:
+ err = -3
+
+ if got_nr_slabs_full != nr_slabs_full:
+ err = -4
+
+ if got_nr_slabs_total != nr_slabs_total:
+ err = -5
+
+ if err != 0:
+ dprint("Verify state: %s" % msg)
+ dprint(" what\t\t\twant\tgot")
+ dprint("-----------------------------------------")
+ dprint(" %s\t%d\t%d" % ('nr_objects_active', nr_objects_active, got_nr_objects_active))
+ dprint(" %s\t%d\t%d" % ('nr_objects_total', nr_objects_total, got_nr_objects_total))
+ dprint(" %s\t%d\t%d" % ('nr_slabs_partial', nr_slabs_partial, got_nr_slabs_partial))
+ dprint(" %s\t\t%d\t%d" % ('nr_slabs_full', nr_slabs_full, got_nr_slabs_full))
+ dprint(" %s\t%d\t%d\n" % ('nr_slabs_total', nr_slabs_total, got_nr_slabs_total))
+
+ return err
+
+
+def exec_via_sysfs(command):
+ ret = run_shell('echo %s > /sys/kernel/debug/smo/callfn' % command)
+ if ret != 0:
+ eprint("Failed to echo command to sysfs: %s" % command)
+
+ return ret
+
+
+def enable_movable_objects():
+ return exec_via_sysfs('enable')
+
+
+def alloc(n):
+ exec_via_sysfs("alloc %d" % n)
+
+
+def free(n, pos = 0):
+ exec_via_sysfs('free %d %d' % (n, pos))
+
+
+def shrink():
+ ret = run_shell('slabinfo smo_test -s')
+ if ret != 0:
+ eprint("Failed to execute slabinfo -s")
+
+
+def sanity_checks():
+ # Verify everything is 0 to start with.
+ return verify_state(0, 0, 0, 0, 0, "sanity check")
+
+
+def test_non_movable():
+ one_over = objects_per_slab + 1
+
+ dprint("testing slab 'smo_test' prior to enabling movable objects ...")
+
+ alloc(one_over)
+
+ objects_active = one_over
+ objects_total = objects_per_slab * 2
+ slabs_partial = 1
+ slabs_full = 1
+ slabs_total = 2
+ ret = verify_state(objects_active, objects_total,
+ slabs_partial, slabs_full, slabs_total,
+ "non-movable: initial allocation")
+ if ret != 0:
+ eprint("test_non_movable: failed to verify initial state")
+ return -1
+
+ # Free object from first slot of first slab.
+ free(1)
+ objects_active = one_over - 1
+ objects_total = objects_per_slab * 2
+ slabs_partial = 2
+ slabs_full = 0
+ slabs_total = 2
+ ret = verify_state(objects_active, objects_total,
+ slabs_partial, slabs_full, slabs_total,
+ "non-movable: after free")
+ if ret != 0:
+ eprint("test_non_movable: failed to verify after free")
+ return -1
+
+ # Non-movable cache, shrink should have no effect.
+ shrink()
+ ret = verify_state(objects_active, objects_total,
+ slabs_partial, slabs_full, slabs_total,
+ "non-movable: after shrink")
+ if ret != 0:
+ eprint("test_non_movable: failed to verify after shrink")
+ return -1
+
+ # Cleanup
+ free(objects_per_slab)
+ shrink()
+
+ dprint("verified non-movable slabs are NOT shrinkable")
+ return 0
+
+
+def test_movable():
+ one_over = objects_per_slab + 1
+
+ dprint("testing slab 'smo_test' after enabling movable objects ...")
+
+ alloc(one_over)
+
+ objects_active = one_over
+ objects_total = objects_per_slab * 2
+ slabs_partial = 1
+ slabs_full = 1
+ slabs_total = 2
+ ret = verify_state(objects_active, objects_total,
+ slabs_partial, slabs_full, slabs_total,
+ "movable: initial allocation")
+ if ret != 0:
+ eprint("test_movable: failed to verify initial state")
+ return -1
+
+ # Free object from first slot of first slab.
+ free(1)
+ objects_active = one_over - 1
+ objects_total = objects_per_slab * 2
+ slabs_partial = 2
+ slabs_full = 0
+ slabs_total = 2
+ ret = verify_state(objects_active, objects_total,
+ slabs_partial, slabs_full, slabs_total,
+ "movable: after free")
+ if ret != 0:
+ eprint("test_movable: failed to verify after free")
+ return -1
+
+ # movable cache, shrink should move objects and free slab.
+ shrink()
+ objects_active = one_over - 1
+ objects_total = objects_per_slab * 1
+ slabs_partial = 0
+ slabs_full = 1
+ slabs_total = 1
+ ret = verify_state(objects_active, objects_total,
+ slabs_partial, slabs_full, slabs_total,
+ "movable: after shrink")
+ if ret != 0:
+ eprint("test_movable: failed to verify after shrink")
+ return -1
+
+ # Cleanup
+ free(objects_per_slab)
+ shrink()
+
+ dprint("verified movable slabs are shrinkable")
+ return 0
+
+
+def dprint_start_test(test):
+ dprint("Running %s ..." % test)
+
+
+def dprint_done():
+ dprint("")
+
+
+def run_test(fn, desc):
+ dprint_start_test(desc)
+ ret = fn()
+ if ret < 0:
+ fail_test(desc)
+ dprint_done()
+
+
+# Load and unload the module for this test to ensure clean state.
+def run_module_stress_test():
+ dprint("Running module stress test (see dmesg for additional test output) ...")
+
+ unload_module()
+ ret = load_module()
+ if ret < 0:
+ cleanup_and_exit(ret)
+
+ exec_via_sysfs("test")
+
+ unload_module()
+
+ dprint("")
+
+
+def fail_test(msg):
+ eprint("\nFAIL: test failed: '%s' ... aborting\n" % msg)
+ cleanup_and_exit(1)
+
+
+def display_help():
+ print("Usage: %s [OPTIONS]\n" % path.basename(sys.argv[0]))
+ print("\tRuns defrag test suite (a.k.a. SLUB Movable Objects)\n")
+ print("OPTIONS:")
+ print("\t-d | --debug Enable verbose debug output")
+ print("\t-h | --help Print this help and exit")
+
+
+def cleanup_and_exit(return_code):
+ global debugfs_mounted
+
+ if debugfs_mounted:
+ umount_debugfs()
+
+ unload_module()
+
+ sys.exit(return_code)
+
+
+def main():
+ global debug, debugfs_mounted
+
+ if len(sys.argv) > 1:
+ if sys.argv[1] == '-h' or sys.argv[1] == '--help':
+ display_help()
+ sys.exit(0)
+
+ if sys.argv[1] == '-d' or sys.argv[1] == '--debug':
+ debug = True
+
+ assert_root()
+
+ # Use cleanup_and_exit() instead of sys.exit() after mounting debugfs.
+ debugfs_mounted = mount_debugfs()
+
+ # Loads and unloads the module.
+ run_module_stress_test()
+
+ ret = load_module()
+ if (ret < 0):
+ cleanup_and_exit(ret)
+
+ ret = get_slab_config()
+ if (ret != 0):
+ fail_test("get slab config details")
+
+ run_test(sanity_checks, "sanity checks")
+
+ run_test(test_non_movable, "test non-movable")
+
+ ret = enable_movable_objects()
+ if (ret != 0):
+ fail_test("enable movable objects")
+
+ run_test(test_movable, "test movable")
+
+ cleanup_and_exit(0)
+
+if __name__ == "__main__":
+ main()
--
2.21.0


2019-05-20 06:52:30

by Tobin C. Harding

[permalink] [raw]
Subject: [RFC PATCH v5 09/16] lib: Separate radix_tree_node and xa_node slab cache

Earlier, Slab Movable Objects (SMO) was implemented. The XArray is now
able to take advantage of SMO in order to make xarray nodes
movable (when using the SLUB allocator).

Currently the radix tree uses the same slab cache as the XArray. Only
XArray nodes are movable _not_ radix tree nodes. We can give the radix
tree its own slab cache to overcome this.

In preparation for implementing XArray object migration (xa_node
objects) via Slab Movable Objects add a slab cache solely for XArray
nodes and make the XArray use this slab cache instead of the
radix_tree_node slab cache.

Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Tobin C. Harding <[email protected]>
---
include/linux/xarray.h | 3 +++
init/main.c | 2 ++
lib/radix-tree.c | 2 +-
lib/xarray.c | 48 ++++++++++++++++++++++++++++++++++--------
4 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 0e01e6129145..773f91f8e1db 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -42,6 +42,9 @@

#define BITS_PER_XA_VALUE (BITS_PER_LONG - 1)

+/* Called from init/main.c */
+void xarray_slabcache_init(void);
+
/**
* xa_mk_value() - Create an XArray entry from an integer.
* @v: Value to store in XArray.
diff --git a/init/main.c b/init/main.c
index 5a2c69b4d7b3..e89915ffbe26 100644
--- a/init/main.c
+++ b/init/main.c
@@ -106,6 +106,7 @@ static int kernel_init(void *);

extern void init_IRQ(void);
extern void radix_tree_init(void);
+extern void xarray_slabcache_init(void);

/*
* Debug helper: via this flag we know that we are in 'early bootup code'
@@ -621,6 +622,7 @@ asmlinkage __visible void __init start_kernel(void)
"Interrupts were enabled *very* early, fixing it\n"))
local_irq_disable();
radix_tree_init();
+ xarray_slabcache_init();

/*
* Set up housekeeping before setting up workqueues to allow the unbound
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 14d51548bea6..edbfb530ba73 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -44,7 +44,7 @@
/*
* Radix tree node cache.
*/
-struct kmem_cache *radix_tree_node_cachep;
+static struct kmem_cache *radix_tree_node_cachep;

/*
* The radix tree is variable-height, so an insert operation not only has
diff --git a/lib/xarray.c b/lib/xarray.c
index 6be3acbb861f..a528a5277c9d 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -27,6 +27,8 @@
* @entry refers to something stored in a slot in the xarray
*/

+static struct kmem_cache *xa_node_cachep;
+
static inline unsigned int xa_lock_type(const struct xarray *xa)
{
return (__force unsigned int)xa->xa_flags & 3;
@@ -244,9 +246,21 @@ void *xas_load(struct xa_state *xas)
}
EXPORT_SYMBOL_GPL(xas_load);

-/* Move the radix tree node cache here */
-extern struct kmem_cache *radix_tree_node_cachep;
-extern void radix_tree_node_rcu_free(struct rcu_head *head);
+void xa_node_rcu_free(struct rcu_head *head)
+{
+ struct xa_node *node = container_of(head, struct xa_node, rcu_head);
+
+ /*
+ * Must only free zeroed nodes into the slab. We can be left with
+ * non-NULL entries by radix_tree_free_nodes, so clear the entries
+ * and tags here.
+ */
+ memset(node->slots, 0, sizeof(node->slots));
+ memset(node->tags, 0, sizeof(node->tags));
+ INIT_LIST_HEAD(&node->private_list);
+
+ kmem_cache_free(xa_node_cachep, node);
+}

#define XA_RCU_FREE ((struct xarray *)1)

@@ -254,7 +268,7 @@ static void xa_node_free(struct xa_node *node)
{
XA_NODE_BUG_ON(node, !list_empty(&node->private_list));
node->array = XA_RCU_FREE;
- call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
+ call_rcu(&node->rcu_head, xa_node_rcu_free);
}

/*
@@ -270,7 +284,7 @@ static void xas_destroy(struct xa_state *xas)
if (!node)
return;
XA_NODE_BUG_ON(node, !list_empty(&node->private_list));
- kmem_cache_free(radix_tree_node_cachep, node);
+ kmem_cache_free(xa_node_cachep, node);
xas->xa_alloc = NULL;
}

@@ -298,7 +312,7 @@ bool xas_nomem(struct xa_state *xas, gfp_t gfp)
xas_destroy(xas);
return false;
}
- xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+ xas->xa_alloc = kmem_cache_alloc(xa_node_cachep, gfp);
if (!xas->xa_alloc)
return false;
XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(&xas->xa_alloc->private_list));
@@ -327,10 +341,10 @@ static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
}
if (gfpflags_allow_blocking(gfp)) {
xas_unlock_type(xas, lock_type);
- xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+ xas->xa_alloc = kmem_cache_alloc(xa_node_cachep, gfp);
xas_lock_type(xas, lock_type);
} else {
- xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+ xas->xa_alloc = kmem_cache_alloc(xa_node_cachep, gfp);
}
if (!xas->xa_alloc)
return false;
@@ -358,7 +372,7 @@ static void *xas_alloc(struct xa_state *xas, unsigned int shift)
if (node) {
xas->xa_alloc = NULL;
} else {
- node = kmem_cache_alloc(radix_tree_node_cachep,
+ node = kmem_cache_alloc(xa_node_cachep,
GFP_NOWAIT | __GFP_NOWARN);
if (!node) {
xas_set_err(xas, -ENOMEM);
@@ -1971,6 +1985,22 @@ void xa_destroy(struct xarray *xa)
}
EXPORT_SYMBOL(xa_destroy);

+static void xa_node_ctor(void *arg)
+{
+ struct xa_node *node = arg;
+
+ memset(node, 0, sizeof(*node));
+ INIT_LIST_HEAD(&node->private_list);
+}
+
+void __init xarray_slabcache_init(void)
+{
+ xa_node_cachep = kmem_cache_create("xarray_node",
+ sizeof(struct xa_node), 0,
+ SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
+ xa_node_ctor);
+}
+
#ifdef XA_DEBUG
void xa_dump_node(const struct xa_node *node)
{
--
2.21.0


2019-05-20 07:50:47

by Tobin C. Harding

[permalink] [raw]
Subject: [RFC PATCH v5 07/16] tools/testing/slab: Add object migration test module

We just implemented slab movable objects for the SLUB allocator. We
should test that code. In order to do so we need to be able to do a
number of things:

- Create a cache
- Enable Slab Movable Objects for the cache
- Allocate objects to the cache
- Free objects from within specific slabs of the cache

We can do all this via a loadable module.

Add a module that defines functions that can be triggered from userspace
via a debugfs entry. From the source:

/*
* SLUB defragmentation a.k.a. Slab Movable Objects (SMO).
*
* This module is used for testing the SLUB allocator. Enables
* userspace to run kernel functions via a debugfs file.
*
* debugfs: /sys/kernel/debug/smo/callfn (write only)
*
* String written to `callfn` is parsed by the module and associated
* function is called. See fn_tab for mapping of strings to functions.
*/

References to allocated objects are kept by the module in a linked list
so that userspace can control which object to free.

We introduce the following four functions via the function table

"enable": Enables object migration for the test cache.
"alloc X": Allocates X objects
"free X [Y]": Frees X objects starting at list position Y (default Y==0)
"test": Runs [stress] tests from within the module (see below).

{"enable", smo_enable_cache_mobility},
{"alloc", smo_alloc_objects},
{"free", smo_free_object},
{"test", smo_run_module_tests},
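
The command format above ('<cmd> [arg1] [arg2]') can be sketched in
userspace Python. This is only a rough model of what the module's parser
does, not the module code itself: parse_cmd() and DEFAULT_ARG are
hypothetical names, and Python's int() is slightly more permissive than
the kernel's kstrtol().

```python
DEFAULT_ARG = -1  # assumption: mirrors the module's default argument value


def parse_cmd(line):
    """Split '<cmd> [arg1] [arg2]' into (cmd, arg1, arg2, found).

    Missing or unparsable arguments keep the default value; 'found' is
    the number of arguments that were converted successfully.
    """
    tokens = line.strip().split(" ")
    cmd = tokens[0]
    args = [DEFAULT_ARG, DEFAULT_ARG]
    found = 0
    for tok in tokens[1:3]:
        try:
            args[found] = int(tok, 10)
        except ValueError:
            break  # stop at the first argument that fails to convert
        found += 1
    return cmd, args[0], args[1], found
```

Returning the count of parsed arguments lets a caller tell an explicit
'-1' apart from an argument that was simply left out.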

Freeing from the start of the list creates a hole in the slab being
freed from (i.e. creates a partial slab). The results of running these
commands can be seen using `slabinfo` (available in tools/vm/):

gcc -o slabinfo tools/vm/slabinfo.c

Stress tests can be run from within the module. These tests are
internal to the module because we verify that object references are
still good after object migration. These are called 'stress' tests
because it is intended that they create/free a lot of objects.
Userspace can control the number of objects to create, default is 1000.
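
To illustrate what the stress test checks, here is a small Python sketch
of the expected survivors. It assumes, as in the module source, that
objects get IDs 0..nr_objs-1 and that the first of every `keep` objects
is kept. surviving_ids() and same_objects() are hypothetical helper
names used only for this illustration.

```python
def surviving_ids(nr_objs=1000, keep=10):
    """IDs expected to remain after freeing all but 1 in every `keep`."""
    return [i for i in range(nr_objs) if i % keep == 0]


def same_objects(before, after):
    """Order-insensitive comparison.

    Migration may reorder the list of references, so only membership
    (and the absence of duplicates) is compared.
    """
    return sorted(before) == sorted(after)
```

With the defaults this leaves 100 objects spread thinly over many slabs,
which is the fragmented state that shrink is then asked to clean up.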

Example test session
--------------------

Relevant /proc/slabinfo column headers:

name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>

# mount -t debugfs none /sys/kernel/debug/
$ cd path/to/linux/tools/testing/slab; make
...

# insmod slub_defrag.ko
# cat /proc/slabinfo | grep smo_test | sed 's/:.*//'
smo_test 0 0 392 20 2

From this we can see that the module created cache 'smo_test' with 20
objects per slab and 2 pages per slab (and cache is currently empty).

We can play with the slab allocator manually:

# cd /sys/kernel/debug/smo
# echo 'alloc 21' > callfn
# cat /proc/slabinfo | grep smo_test | sed 's/:.*//'
smo_test 21 40 392 20 2

We see here that 21 active objects have been allocated creating 2
slabs (40 total objects).
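
The numbers above follow from simple arithmetic. A minimal Python
sketch, assuming the cache starts empty and slabs are filled one after
another (it ignores per-CPU slab effects); expected_state() is a
hypothetical helper, not part of the test suite:

```python
def expected_state(active, objects_per_slab):
    """Expected counters right after allocating `active` objects."""
    full, rest = divmod(active, objects_per_slab)
    partial = 1 if rest else 0
    total = full + partial
    return {
        "objects_active": active,
        "objects_total": total * objects_per_slab,
        "slabs_partial": partial,
        "slabs_full": full,
        "slabs_total": total,
    }
```

For 21 objects at 20 objects per slab this gives one full slab, one
partial slab and 40 total object slots, matching the output shown.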

# slabinfo smo_test --report

Slabcache: smo_test Aliases: 0 Order : 1 Objects: 21

Sizes (bytes) Slabs Debug Memory
------------------------------------------------------------------------
Object : 56 Total : 2 Sanity Checks : On Total: 16384
SlabObj: 392 Full : 1 Redzoning : On Used : 1176
SlabSiz: 8192 Partial: 1 Poisoning : On Loss : 15208
Loss : 336 CpuSlab: 0 Tracking : On Lalig: 7056
Align : 8 Objects: 20 Tracing : Off Lpadd: 704

Now free an object from the first slot of the first slab

# echo 'free 1' > callfn
# cat /proc/slabinfo | grep smo_test | sed 's/:.*//'
smo_test 20 40 392 20 2

# slabinfo smo_test --report

Slabcache: smo_test Aliases: 0 Order : 1 Objects: 20

Sizes (bytes) Slabs Debug Memory
------------------------------------------------------------------------
Object : 56 Total : 2 Sanity Checks : On Total: 16384
SlabObj: 392 Full : 0 Redzoning : On Used : 1120
SlabSiz: 8192 Partial: 2 Poisoning : On Loss : 15264
Loss : 336 CpuSlab: 0 Tracking : On Lalig: 6720
Align : 8 Objects: 20 Tracing : Off Lpadd: 704

Calling shrink now on the cache does nothing because object migration is
not enabled (output omitted). If we enable object migration then shrink
the cache we expect the object from the second slab to be moved to the
first slot in the first slab and the second slab to be removed from the
partial list.
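
The expected effect of shrink can be modelled with a toy Python
simulation. This is a deliberately simplified sketch, not how SLUB
works internally: each slab is a list of slots (an object ID or None),
and shrink() is a hypothetical function that moves objects from the
emptiest slabs into holes in fuller ones when migration is enabled.

```python
def shrink(slabs, movable):
    """Return the slabs left after a shrink; mutates the given lists."""
    def used(slab):
        return sum(obj is not None for obj in slab)

    # Completely empty slabs are discarded whether or not objects can move.
    slabs = [s for s in slabs if used(s) > 0]
    if not movable:
        return slabs

    # Emptiest slabs first: they are the cheapest ones to drain.
    slabs.sort(key=used)
    for src in slabs:
        for i, obj in enumerate(src):
            if obj is None:
                continue
            # Search for a hole, starting from the fullest slab and
            # stopping once we reach the slab being drained.
            for dst in reversed(slabs):
                if dst is src:
                    break
                if None in dst:
                    dst[dst.index(None)] = obj
                    src[i] = None
                    break
    return [s for s in slabs if used(s) > 0]
```

Modelling the session above: the first slab has a hole in slot 0 and
the second slab holds a single object. Without migration both slabs
stay; with migration only one full slab is left.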

# echo 'enable' > callfn
# slabinfo smo_test --shrink
# slabinfo smo_test --report

Slabcache: smo_test Aliases: 0 Order : 1 Objects: 20
** Defragmentation at 30%

Sizes (bytes) Slabs Debug Memory
------------------------------------------------------------------------
Object : 56 Total : 1 Sanity Checks : On Total: 8192
SlabObj: 392 Full : 1 Redzoning : On Used : 1120
SlabSiz: 8192 Partial: 0 Poisoning : On Loss : 7072
Loss : 336 CpuSlab: 0 Tracking : On Lalig: 6720
Align : 8 Objects: 20 Tracing : Off Lpadd: 352

We can run the stress tests (with the default number of objects):

# cd /sys/kernel/debug/smo
# echo 'test' > callfn
[ 3.576617] smo: test using nr_objs: 1000 keep: 10
[ 3.580169] smo: Module tests completed successfully

Signed-off-by: Tobin C. Harding <[email protected]>
---
tools/testing/slab/Makefile | 10 +
tools/testing/slab/slub_defrag.c | 566 +++++++++++++++++++++++++++++++
2 files changed, 576 insertions(+)
create mode 100644 tools/testing/slab/Makefile
create mode 100644 tools/testing/slab/slub_defrag.c

diff --git a/tools/testing/slab/Makefile b/tools/testing/slab/Makefile
new file mode 100644
index 000000000000..440c2e3e356f
--- /dev/null
+++ b/tools/testing/slab/Makefile
@@ -0,0 +1,10 @@
+obj-m += slub_defrag.o
+
+KTREE=../../..
+
+all:
+ make -C ${KTREE} M=$(PWD) modules
+
+clean:
+ make -C ${KTREE} M=$(PWD) clean
+
diff --git a/tools/testing/slab/slub_defrag.c b/tools/testing/slab/slub_defrag.c
new file mode 100644
index 000000000000..4a5c24394b96
--- /dev/null
+++ b/tools/testing/slab/slub_defrag.c
@@ -0,0 +1,566 @@
+// SPDX-License-Identifier: GPL-2.0+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/list.h>
+#include <linux/gfp.h>
+#include <linux/debugfs.h>
+#include <linux/numa.h>
+
+/*
+ * SLUB defragmentation a.k.a. Slab Movable Objects (SMO).
+ *
+ * This module is used for testing the SLUB allocator. Enables
+ * userspace to run kernel functions via a debugfs file.
+ *
+ * debugfs: /sys/kernel/debug/smo/callfn (write only)
+ *
+ * String written to `callfn` is parsed by the module and associated
+ * function is called. See fn_tab for mapping of strings to functions.
+ */
+
+/* debugfs commands accept two optional arguments */
+#define SMO_CMD_DEFAUT_ARG -1
+
+#define SMO_DEBUGFS_DIR "smo"
+struct dentry *smo_debugfs_root;
+
+#define SMO_CACHE_NAME "smo_test"
+static struct kmem_cache *cachep;
+
+struct smo_slub_object {
+ struct list_head list;
+ char buf[32]; /* Unused except to control size of object */
+ long id;
+};
+
+/* Our list of allocated objects */
+LIST_HEAD(objects);
+
+static void list_add_to_objects(struct smo_slub_object *so)
+{
+ /*
+ * We free from the front of the list so store at the
+ * tail in order to put holes in the cache when we free.
+ */
+ list_add_tail(&so->list, &objects);
+}
+
+/**
+ * smo_object_ctor() - SMO object constructor function.
+ * @ptr: Pointer to memory where the object should be constructed.
+ */
+void smo_object_ctor(void *ptr)
+{
+ struct smo_slub_object *so = ptr;
+
+ INIT_LIST_HEAD(&so->list);
+ memset(so->buf, 0, sizeof(so->buf));
+ so->id = -1;
+}
+
+/**
+ * smo_cache_migrate() - kmem_cache migrate function.
+ * @cp: kmem_cache pointer.
+ * @objs: Array of pointers to objects to migrate.
+ * @size: Number of objects in @objs.
+ * @node: NUMA node where the object should be allocated.
+ * @private: Pointer returned by kmem_cache_isolate_func().
+ */
+void smo_cache_migrate(struct kmem_cache *cp, void **objs, int size,
+ int node, void *private)
+{
+ struct smo_slub_object **so_objs = (struct smo_slub_object **)objs;
+ struct smo_slub_object *so_old, *so_new;
+ int i;
+
+ for (i = 0; i < size; i++) {
+ so_old = so_objs[i];
+
+ so_new = kmem_cache_alloc_node(cachep, GFP_KERNEL, node);
+ if (!so_new) {
+ pr_debug("kmem_cache_alloc failed\n");
+ return;
+ }
+
+ /* Copy object */
+ so_new->id = so_old->id;
+
+ /* Update references to old object */
+ list_del(&so_old->list);
+ list_add_to_objects(so_new);
+
+ kmem_cache_free(cachep, so_old);
+ }
+}
+
+static int smo_enable_cache_mobility(int _unused, int __unused)
+{
+ /* Enable movable objects: BOOM! */
+ kmem_cache_setup_mobility(cachep, NULL, smo_cache_migrate);
+ pr_info("smo: kmem_cache %s defrag enabled\n", SMO_CACHE_NAME);
+ return 0;
+}
+
+/*
+ * smo_alloc_objects() - Allocate objects and store reference.
+ * @nr_objs: Number of objects to allocate.
+ * @node: NUMA node to allocate objects on.
+ *
+ * Allocates @nr_objs smo_slub_objects. Stores a reference to them in
+ * the global list of objects (at the tail of the list).
+ *
+ * Return: The number of objects allocated.
+ */
+static int smo_alloc_objects(int nr_objs, int node)
+{
+ struct smo_slub_object *so;
+ int i;
+
+ /* Set sane parameters if no args passed in */
+ if (nr_objs == SMO_CMD_DEFAUT_ARG)
+ nr_objs = 1;
+ if (node == SMO_CMD_DEFAUT_ARG)
+ node = NUMA_NO_NODE;
+
+ for (i = 0; i < nr_objs; i++) {
+ if (node == NUMA_NO_NODE)
+ so = kmem_cache_alloc(cachep, GFP_KERNEL);
+ else
+ so = kmem_cache_alloc_node(cachep, GFP_KERNEL, node);
+ if (!so) {
+ pr_err("smo: Failed to alloc object %d of %d\n", i, nr_objs);
+ return i;
+ }
+ list_add_to_objects(so);
+ }
+ return nr_objs;
+}
+
+/*
+ * smo_free_object() - Frees n objects from position.
+ * @nr_objs: Number of objects to free.
+ * @pos: Position in global list to start freeing.
+ *
+ * Iterates over the global list of objects to position @pos then frees
+ * @nr_objs objects from there (or to the end of the list if fewer remain).
+ *
+ * Calling with @nr_objs==0 frees all objects starting at @pos.
+ *
+ * Return: Number of objects freed.
+ */
+static int smo_free_object(int nr_objs, int pos)
+{
+ struct smo_slub_object *cur, *tmp;
+ int deleted = 0;
+ int i = 0;
+
+ /* Set sane parameters if no args passed in */
+ if (nr_objs == SMO_CMD_DEFAUT_ARG)
+ nr_objs = 1;
+ if (pos == SMO_CMD_DEFAUT_ARG)
+ pos = 0;
+
+ list_for_each_entry_safe(cur, tmp, &objects, list) {
+ if (i < pos) {
+ i++;
+ continue;
+ }
+
+ list_del(&cur->list);
+ kmem_cache_free(cachep, cur);
+ deleted++;
+ if (deleted == nr_objs)
+ break;
+ }
+ return deleted;
+}
+
+static int index_for_expected_id(long *expected, int size, long id)
+{
+ int i;
+
+ /* Array is unsorted, just iterate the whole thing */
+ for (i = 0; i < size; i++) {
+ if (expected[i] == id)
+ return i;
+ }
+ return -1; /* Not found */
+}
+
+static int assert_have_objects(int nr_objs, int keep)
+{
+ struct smo_slub_object *cur;
+ long *expected; /* Array of expected IDs */
+ int nr_ids; /* Length of array */
+ long id;
+ int index, i;
+
+ nr_ids = nr_objs / keep + 1;
+
+ expected = kmalloc_array(nr_ids, sizeof(long), GFP_KERNEL);
+ if (!expected)
+ return -ENOMEM;
+
+ id = 0;
+ for (i = 0; i < nr_ids; i++) {
+ expected[i] = id;
+ id += keep;
+ }
+
+ list_for_each_entry(cur, &objects, list) {
+ index = index_for_expected_id(expected, nr_ids, cur->id);
+ if (index < 0) {
+ pr_err("smo: ID not found: %ld\n", cur->id);
+ return -1;
+ }
+
+ if (expected[index] == -1) {
+ pr_err("smo: ID already encountered: %ld\n", cur->id);
+ return -1;
+ }
+ expected[index] = -1;
+ }
+ return 0;
+}
+
+/*
+ * smo_run_module_tests() - Runs unit tests from within the module
+ * @nr_objs: Number of objects to allocate.
+ * @keep: Free all but 1 in @keep objects.
+ *
+ * Allocates @nr_objs objects then iterates over the allocated objects
+ * freeing all but 1 out of every @keep objects i.e. for @keep==10
+ * keeps the first object then frees the next 9.
+ *
+ * Caller is responsible for ensuring that the cache has at most a
+ * single slab on the partial list without any objects in it. This is
+ * easy enough to ensure, just call this when the module is freshly
+ * loaded.
+ */
+static int smo_run_module_tests(int nr_objs, int keep)
+{
+ struct smo_slub_object *so;
+ struct smo_slub_object *cur, *tmp;
+ long i;
+
+ if (!list_empty(&objects)) {
+ pr_err("smo: test requires clean module state\n");
+ return -1;
+ }
+
+ /* Set sane parameters if no args passed in */
+ if (nr_objs == SMO_CMD_DEFAUT_ARG)
+ nr_objs = 1000;
+ if (keep == SMO_CMD_DEFAUT_ARG)
+ keep = 10;
+
+ pr_info("smo: test using nr_objs: %d keep: %d\n", nr_objs, keep);
+
+ /* Perhaps we got called like this 'test 1000' */
+ if (keep == 0) {
+ pr_err("Usage: test <nr_objs> <keep>\n");
+ return -1;
+ }
+
+ /* Test constructor */
+ so = kmem_cache_alloc(cachep, GFP_KERNEL);
+ if (!so) {
+ pr_err("smo: Failed to alloc object\n");
+ return -1;
+ }
+ if (so->id != -1) {
+ pr_err("smo: Initial state incorrect\n");
+ return -1;
+ }
+ kmem_cache_free(cachep, so);
+
+ /*
+ * Test that object migration is correctly implemented by module
+ *
+ * This gives us confidence that if new code correctly enables
+ * object migration (via correct implementation of migrate and
+ * isolate functions) then the slub allocator code that does
+ * object migration is correct.
+ */
+
+ for (i = 0; i < nr_objs; i++) {
+ so = kmem_cache_alloc(cachep, GFP_KERNEL);
+ if (!so) {
+ pr_err("smo: Failed to alloc object %ld of %d\n",
+ i, nr_objs);
+ return -1;
+ }
+ so->id = (long)i;
+ list_add_to_objects(so);
+ }
+
+ assert_have_objects(nr_objs, 1);
+
+ i = 0;
+ list_for_each_entry_safe(cur, tmp, &objects, list) {
+ if (i++ % keep == 0)
+ continue;
+
+ list_del(&cur->list);
+ kmem_cache_free(cachep, cur);
+ }
+
+ /* Verify shrink does nothing when migration is not enabled */
+ kmem_cache_shrink(cachep);
+ assert_have_objects(nr_objs, 1);
+
+ /* Now test shrink */
+ kmem_cache_setup_mobility(cachep, NULL, smo_cache_migrate);
+ kmem_cache_shrink(cachep);
+ /*
+ * Because of how migrate function deletes and adds objects to
+ * the objects list we have no way of knowing the order. We
+ * want to confirm that we have all the objects after shrink
+ * that we had before we did the shrink.
+ */
+ assert_have_objects(nr_objs, keep);
+
+ /* cleanup */
+ list_for_each_entry_safe(cur, tmp, &objects, list) {
+ list_del(&cur->list);
+ kmem_cache_free(cachep, cur);
+ }
+ kmem_cache_shrink(cachep); /* Remove empty slabs from partial list */
+
+ pr_info("smo: Module tests completed successfully\n");
+ return 0;
+}
+
+/*
+ * struct functions() - Map command to a function pointer.
+ */
+struct functions {
+ char *fn_name;
+ int (*fn_ptr)(int arg0, int arg1);
+} fn_tab[] = {
+ /*
+ * Because of the way we parse the function table no command
+ * may have another command as its prefix.
+ * i.e. this will break: 'foo' and 'foobar'
+ */
+ {"enable", smo_enable_cache_mobility},
+ {"alloc", smo_alloc_objects},
+ {"free", smo_free_object},
+ {"test", smo_run_module_tests},
+};
+
+#define FN_TAB_SIZE (sizeof(fn_tab) / sizeof(struct functions))
+
+/*
+ * parse_cmd_buf() - Gets command and arguments from command string.
+ * @buf: Buffer containing the command string.
+ * @cmd: Out parameter, pointer to the command.
+ * @arg1: Out parameter, stores the first argument.
+ * @arg2: Out parameter, stores the second argument.
+ *
+ * Parses and tokenizes the input command buffer. Stores a pointer to the
+ * command (start of @buf) in @cmd. Stores the converted long values for
+ * argument 1 and 2 in the respective out parameters @arg1 and @arg2.
+ *
+ * Since arguments are optional, if they are not found the default values are
+ * returned. In order for the caller to differentiate defaults from arguments
+ * of the same value the number of arguments parsed is returned.
+ *
+ * Return: Number of arguments found.
+ */
+static int parse_cmd_buf(char *buf, char **cmd, long *arg1, long *arg2)
+{
+ int found;
+ char *ptr;
+ int ret;
+
+ *arg1 = SMO_CMD_DEFAUT_ARG;
+ *arg2 = SMO_CMD_DEFAUT_ARG;
+ found = 0;
+
+ /* Jump over the command, check if there are any args */
+ ptr = strsep(&buf, " ");
+ if (!ptr || !buf)
+ return found;
+
+ ptr = strsep(&buf, " ");
+ ret = kstrtol(ptr, 10, arg1);
+ if (ret < 0) {
+ pr_err("failed to convert arg, defaulting to %d. (%s)\n",
+ SMO_CMD_DEFAUT_ARG, ptr);
+ return found;
+ }
+ found++;
+ if (!buf) /* No second arg */
+ return found;
+
+ ptr = strsep(&buf, " ");
+ ret = kstrtol(ptr, 10, arg2);
+ if (ret < 0) {
+ pr_err("failed to convert arg, defaulting to %d. (%s)\n",
+ SMO_CMD_DEFAUT_ARG, ptr);
+ return found;
+ }
+ found++;
+
+ return found;
+}
+
+/*
+ * call_function() - Calls the function described by str.
+ * @str: '<cmd> [<arg1> [<arg2>]]'
+ *
+ * Does table lookup on <cmd>, calls appropriate function passing the
+ * parsed arguments. Missing arguments default to SMO_CMD_DEFAUT_ARG.
+ */
+static void call_function(char *str)
+{
+ char *cmd;
+ long arg1 = 0;
+ long arg2 = 0;
+ int i;
+
+ if (!str)
+ return;
+
+ (void)parse_cmd_buf(str, &cmd, &arg1, &arg2);
+
+ for (i = 0; i < FN_TAB_SIZE; i++) {
+ char *fn_name = fn_tab[i].fn_name;
+
+ if (strcmp(fn_name, str) == 0) {
+ fn_tab[i].fn_ptr(arg1, arg2);
+ return; /* All done */
+ }
+ }
+
+ pr_err("failed to call function for cmd: %s\n", str);
+}
+
+/*
+ * smo_callfn_debugfs_write() - debugfs write function.
+ * @file: User file
+ * @ubuf: Userspace buffer
+ * @len: Length of the user space buffer
+ * @off: Offset within the file
+ *
+ * Used for triggering functions by writing command to debugfs file.
+ *
+ * echo '<cmd> <arg>' > /sys/kernel/debug/smo/callfn
+ *
+ * Return: Number of bytes copied if request succeeds,
+ * the corresponding error code otherwise.
+ */
+static ssize_t smo_callfn_debugfs_write(struct file *file,
+ const char __user *ubuf,
+ size_t len,
+ loff_t *off)
+{
+ char *kbuf;
+ int nbytes = 0;
+
+ if (*off != 0 || len == 0)
+ return -EINVAL;
+
+ kbuf = kzalloc(len + 1, GFP_KERNEL); /* +1 keeps kbuf NUL terminated */
+ if (!kbuf)
+ return -ENOMEM;
+
+ nbytes = strncpy_from_user(kbuf, ubuf, len);
+ if (nbytes < 0)
+ goto out;
+
+ if (kbuf[nbytes - 1] == '\n')
+ kbuf[nbytes - 1] = '\0';
+
+ call_function(kbuf); /* Tokenizes kbuf */
+out:
+ kfree(kbuf);
+ return nbytes;
+}
+
+const struct file_operations fops_callfn_debugfs = {
+ .owner = THIS_MODULE,
+ .write = smo_callfn_debugfs_write,
+};
+
+static int __init smo_debugfs_init(void)
+{
+ struct dentry *d;
+
+ smo_debugfs_root = debugfs_create_dir(SMO_DEBUGFS_DIR, NULL);
+ d = debugfs_create_file("callfn", 0200, smo_debugfs_root, NULL,
+ &fops_callfn_debugfs);
+ if (IS_ERR(d))
+ return PTR_ERR(d);
+
+ return 0;
+}
+
+static void __exit smo_debugfs_cleanup(void)
+{
+ debugfs_remove_recursive(smo_debugfs_root);
+}
+
+static int __init smo_cache_init(void)
+{
+ cachep = kmem_cache_create(SMO_CACHE_NAME,
+ sizeof(struct smo_slub_object),
+ 0, 0, smo_object_ctor);
+ if (!cachep)
+ return -1;
+
+ return 0;
+}
+
+static void __exit smo_cache_cleanup(void)
+{
+ struct smo_slub_object *cur, *tmp;
+
+ list_for_each_entry_safe(cur, tmp, &objects, list) {
+ list_del(&cur->list);
+ kmem_cache_free(cachep, cur);
+ }
+ kmem_cache_destroy(cachep);
+}
+
+static int __init smo_init(void)
+{
+ int ret;
+
+ ret = smo_cache_init();
+ if (ret) {
+ pr_err("smo: Failed to create cache\n");
+ return ret;
+ }
+ pr_info("smo: Created kmem_cache: %s\n", SMO_CACHE_NAME);
+
+ ret = smo_debugfs_init();
+ if (ret) {
+ pr_err("smo: Failed to init debugfs\n");
+ return ret;
+ }
+ pr_info("smo: Created debugfs directory: /sys/kernel/debug/%s\n",
+ SMO_DEBUGFS_DIR);
+
+ pr_info("smo: Test module loaded\n");
+ return 0;
+}
+module_init(smo_init);
+
+static void __exit smo_exit(void)
+{
+ smo_debugfs_cleanup();
+ smo_cache_cleanup();
+
+ pr_info("smo: Test module removed\n");
+}
+module_exit(smo_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Tobin C. Harding");
+MODULE_DESCRIPTION("SLUB Movable Objects test module.");
--
2.21.0


2019-05-20 07:50:52

by Tobin C. Harding

[permalink] [raw]
Subject: [RFC PATCH v5 13/16] slub: Enable balancing slabs across nodes

We have just implemented Slab Movable Objects (SMO). On NUMA systems
slabs can become unbalanced i.e. many slabs on one node while other
nodes have few slabs. Using SMO we can balance the slabs across all
the nodes.

The algorithm used is as follows:

1. Move all objects to node 0 (this has the effect of defragmenting the
cache).

2. Calculate the desired number of slabs for each node (this is done
using the approximation nr_slabs / nr_nodes).

3. Loop over the nodes moving the desired number of slabs from node 0
to the node.
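
The three steps above can be sketched in Python on plain slab counts.
This is only a model of the arithmetic, under two stated assumptions:
every move succeeds (the kernel code is best effort) and consolidating
onto node 0 does not change the total number of slabs (in practice
defragmentation may reduce it). balance() is a hypothetical name.

```python
def balance(slabs_per_node):
    """slabs_per_node: list of slab counts indexed by node ID."""
    nr_nodes = len(slabs_per_node)

    # Step 1: move everything to node 0.
    result = [0] * nr_nodes
    result[0] = sum(slabs_per_node)

    # Step 2: approximate the desired number of slabs per node.
    desired = result[0] // nr_nodes

    # Step 3: hand `desired` slabs from node 0 to every other node.
    for nid in range(1, nr_nodes):
        result[0] -= desired
        result[nid] = desired
    return result
```

Because of the integer division, node 0 keeps the remainder: 11 slabs
over 4 nodes end up as 5 on node 0 and 2 on each of the other nodes.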

This feature is conditionally built with CONFIG_SMO_NODE because we need
the full list (we enable SLUB_DEBUG to get this). A future version may
separate the full list out of SLUB_DEBUG.

Expose this functionality to userspace via a sysfs entry. Add sysfs
entry:

/sys/kernel/slab/<cache>/balance

Write of '1' to this file triggers balance, no other value accepted.

This feature relies on SMO being enabled for the cache. This is done,
after the isolate/migrate functions have been defined, with a call to:

kmem_cache_setup_mobility(s, isolate, migrate)

Signed-off-by: Tobin C. Harding <[email protected]>
---
mm/slub.c | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 120 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index 9582f2fc97d2..25b6d1e408e3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4574,6 +4574,109 @@ static unsigned long kmem_cache_move_to_node(struct kmem_cache *s, int node)

return left;
}
+
+/*
+ * kmem_cache_move_slabs() - Attempt to move @num slabs to @target_node.
+ * @s: The cache we are working on.
+ * @node: The node to move objects from.
+ * @target_node: The node to move objects to.
+ * @num: The number of slabs to move.
+ *
+ * Attempts to move @num slabs from @node to @target_node. This is done
+ * by migrating objects from slabs on the full_list.
+ *
+ * Return: The number of slabs moved or error code.
+ */
+static long kmem_cache_move_slabs(struct kmem_cache *s,
+ int node, int target_node, long num)
+{
+ struct kmem_cache_node *n = get_node(s, node);
+ LIST_HEAD(move_list);
+ struct page *page, *page2;
+ unsigned long flags;
+ void **scratch;
+ long done = 0;
+
+ if (node == target_node)
+ return -EINVAL;
+
+ scratch = alloc_scratch(s);
+ if (!scratch)
+ return -ENOMEM;
+
+ spin_lock_irqsave(&n->list_lock, flags);
+ list_for_each_entry_safe(page, page2, &n->full, lru) {
+ if (!slab_trylock(page))
+ /* Busy slab. Get out of the way */
+ continue;
+
+ list_move(&page->lru, &move_list);
+ page->frozen = 1;
+ slab_unlock(page);
+
+ if (++done >= num)
+ break;
+ }
+ spin_unlock_irqrestore(&n->list_lock, flags);
+
+ list_for_each_entry(page, &move_list, lru) {
+ if (page->inuse)
+ move_slab_page(page, scratch, target_node);
+ }
+ kfree(scratch);
+
+ /* Inspect results and dispose of pages */
+ spin_lock_irqsave(&n->list_lock, flags);
+ list_for_each_entry_safe(page, page2, &move_list, lru) {
+ list_del(&page->lru);
+ slab_lock(page);
+ page->frozen = 0;
+
+ if (page->inuse) {
+ /*
+ * This is best effort only, if slab still has
+ * objects just put it back on the partial list.
+ */
+ n->nr_partial++;
+ list_add_tail(&page->lru, &n->partial);
+ slab_unlock(page);
+ } else {
+ slab_unlock(page);
+ discard_slab(s, page);
+ }
+ }
+ spin_unlock_irqrestore(&n->list_lock, flags);
+
+ return done;
+}
+
+/*
+ * kmem_cache_balance_nodes() - Balance slabs across nodes.
+ * @s: The cache we are working on.
+ */
+static void kmem_cache_balance_nodes(struct kmem_cache *s)
+{
+ struct kmem_cache_node *n = get_node(s, 0);
+ unsigned long desired_nr_slabs_per_node;
+ unsigned long nr_slabs;
+ int nr_nodes = 0;
+ int nid;
+
+ (void)kmem_cache_move_to_node(s, 0);
+
+ for_each_node_state(nid, N_NORMAL_MEMORY)
+ nr_nodes++;
+
+ nr_slabs = atomic_long_read(&n->nr_slabs);
+ desired_nr_slabs_per_node = nr_slabs / nr_nodes;
+
+ for_each_node_state(nid, N_NORMAL_MEMORY) {
+ if (nid == 0)
+ continue;
+
+ kmem_cache_move_slabs(s, 0, nid, desired_nr_slabs_per_node);
+ }
+}
#endif

/**
@@ -5838,6 +5941,22 @@ static ssize_t move_store(struct kmem_cache *s, const char *buf, size_t length)
return length;
}
SLAB_ATTR(move);
+
+static ssize_t balance_show(struct kmem_cache *s, char *buf)
+{
+ return 0;
+}
+
+static ssize_t balance_store(struct kmem_cache *s,
+ const char *buf, size_t length)
+{
+ if (buf[0] == '1')
+ kmem_cache_balance_nodes(s);
+ else
+ return -EINVAL;
+ return length;
+}
+SLAB_ATTR(balance);
#endif /* CONFIG_SMO_NODE */

#ifdef CONFIG_NUMA
@@ -5966,6 +6085,7 @@ static struct attribute *slab_attrs[] = {
&shrink_attr.attr,
#ifdef CONFIG_SMO_NODE
&move_attr.attr,
+ &balance_attr.attr,
#endif
&slabs_cpu_partial_attr.attr,
#ifdef CONFIG_SLUB_DEBUG
--
2.21.0


2019-05-20 07:50:58

by Tobin C. Harding

Subject: [RFC PATCH v5 14/16] dcache: Provide a dentry constructor

To support object migration on the dentry cache, every object must be
in a well-defined state at all times. Without a constructor, an object
has an undefined state after allocation.

Provide a dentry constructor.
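
A minimal userspace sketch of why the constructor matters, assuming a toy
fixed-size pool. The names toy_pool, toy_obj and the toy_*() helpers are
hypothetical and not part of this patch. The constructor runs once for
every slot when the pool is set up, so even a slot that has never been
handed out has an initialised lock and a "dead" reference count that a
scanner can inspect safely.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_POOL_SIZE 8
#define TOY_DEAD (-128)	/* same sentinel the patch writes in dcache_ctor() */

/* Hypothetical stand-in for struct dentry: a count plus a "lock". */
struct toy_obj {
	int count;
	int lock_initialised;
	int in_use;
};

struct toy_pool {
	struct toy_obj slots[TOY_POOL_SIZE];
};

/* Constructor: puts a slot into a known state before first use. */
static void toy_ctor(struct toy_obj *obj)
{
	obj->count = TOY_DEAD;
	obj->lock_initialised = 1;
	obj->in_use = 0;
}

/* Run the constructor once per slot when the pool is created. */
static void toy_pool_init(struct toy_pool *pool)
{
	for (size_t i = 0; i < TOY_POOL_SIZE; i++)
		toy_ctor(&pool->slots[i]);
}

/* Allocation only sets per-use fields; the lock is not re-initialised. */
static struct toy_obj *toy_alloc(struct toy_pool *pool)
{
	for (size_t i = 0; i < TOY_POOL_SIZE; i++) {
		struct toy_obj *obj = &pool->slots[i];

		if (!obj->in_use) {
			obj->in_use = 1;
			obj->count = 1;
			return obj;
		}
	}
	return NULL;
}

/* The caller hands the object back in its constructed ("dead") state. */
static void toy_free(struct toy_obj *obj)
{
	obj->count = TOY_DEAD;
	obj->in_use = 0;
}

/* A scanner may look at every slot: none has an undefined state. */
static int toy_count_live(const struct toy_pool *pool)
{
	int live = 0;

	for (size_t i = 0; i < TOY_POOL_SIZE; i++) {
		assert(pool->slots[i].lock_initialised);
		if (pool->slots[i].count > 0)
			live++;
	}
	return live;
}
```

Note that toy_alloc() never touches the lock field, which mirrors the
removal of spin_lock_init() from __d_alloc() in the diff below.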

Signed-off-by: Tobin C. Harding <[email protected]>
---
fs/dcache.c | 30 +++++++++++++++++++++---------
1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 8136bda27a1f..b7318615979d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1602,6 +1602,16 @@ void d_invalidate(struct dentry *dentry)
}
EXPORT_SYMBOL(d_invalidate);

+static void dcache_ctor(void *p)
+{
+ struct dentry *dentry = p;
+
+ /* Mimic lockref_mark_dead() */
+ dentry->d_lockref.count = -128;
+
+ spin_lock_init(&dentry->d_lock);
+}
+
/**
* __d_alloc - allocate a dcache entry
* @sb: filesystem it will belong to
@@ -1657,7 +1667,6 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)

dentry->d_lockref.count = 1;
dentry->d_flags = 0;
- spin_lock_init(&dentry->d_lock);
seqcount_init(&dentry->d_seq);
dentry->d_inode = NULL;
dentry->d_parent = dentry;
@@ -3095,14 +3104,17 @@ static void __init dcache_init_early(void)

static void __init dcache_init(void)
{
- /*
- * A constructor could be added for stable state like the lists,
- * but it is probably not worth it because of the cache nature
- * of the dcache.
- */
- dentry_cache = KMEM_CACHE_USERCOPY(dentry,
- SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_ACCOUNT,
- d_iname);
+ slab_flags_t flags =
+ SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | SLAB_MEM_SPREAD | SLAB_ACCOUNT;
+
+ dentry_cache =
+ kmem_cache_create_usercopy("dentry",
+ sizeof(struct dentry),
+ __alignof__(struct dentry),
+ flags,
+ offsetof(struct dentry, d_iname),
+ sizeof_field(struct dentry, d_iname),
+ dcache_ctor);

/* Hash may have been set up in dcache_init_early */
if (!hashdist)
--
2.21.0


2019-05-20 07:51:02

by Tobin C. Harding

Subject: [RFC PATCH v5 15/16] dcache: Implement partial shrink via Slab Movable Objects

The dentry slab cache is susceptible to internal fragmentation. Now
that we have Slab Movable Objects we can attempt to defragment the
dcache. Dentry objects are inherently _not_ relocatable; however, under
some conditions they can be free'd. This is the same as shrinking the
dcache, but instead of shrinking the whole cache we only attempt to free
those objects that are located in partially full slab pages. There is
no guarantee that this will reduce the memory usage of the system. It is
a compromise between fragmented memory and total cache shrinkage, made
in the hope that some memory pressure can be alleviated.

This is implemented using the newly added Slab Movable Objects
infrastructure. The dcache 'migration' function is intentionally _not_
called 'd_migrate' because we only free, we do not migrate. Call it
'd_partial_shrink' to make explicit that no reallocation is done.

Implement isolate and 'migrate' functions for the dentry slab cache.
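
The two-step protocol can be sketched in userspace as follows. This is
only an illustration under simplifying assumptions: toy_dentry,
toy_isolate() and toy_partial_shrink() are hypothetical names, a plain
singly linked list replaces the kernel's shrink list, and there is no
locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a dentry living in one slab page. */
struct toy_dentry {
	int refcount;			/* > 0 means somebody still uses it */
	int on_shrink_list;		/* already queued for disposal */
	int freed;			/* set once the object is released */
	struct toy_dentry *next;	/* link on the dispose list */
};

/* Dispose list handed from the isolate step to the shrink step. */
struct toy_dispose {
	struct toy_dentry *head;
};

/*
 * Step 1: look at every object in the vector and queue those that
 * nobody references. Busy objects are skipped, not waited for.
 */
static struct toy_dispose *toy_isolate(struct toy_dentry **v, int nr)
{
	struct toy_dispose *dispose = malloc(sizeof(*dispose));

	if (!dispose)
		return NULL;
	dispose->head = NULL;

	for (int i = 0; i < nr; i++) {
		struct toy_dentry *d = v[i];

		if (d->refcount > 0 || d->on_shrink_list)
			continue;

		d->on_shrink_list = 1;
		d->next = dispose->head;
		dispose->head = d;
	}
	return dispose;
}

/*
 * Step 2: free what was queued. Nothing is reallocated elsewhere,
 * which is why the patch avoids the name "migrate". Returns the
 * number of objects released.
 */
static int toy_partial_shrink(struct toy_dispose *dispose)
{
	int released = 0;

	if (!dispose)	/* allocation failed during isolation */
		return 0;

	while (dispose->head) {
		struct toy_dentry *d = dispose->head;

		dispose->head = d->next;
		d->next = NULL;
		d->on_shrink_list = 0;
		d->freed = 1;
		released++;
	}
	free(dispose);
	return released;
}
```

The second step tolerates a NULL list, matching the way
d_partial_shrink() handles a kmalloc failure in d_isolate().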

Signed-off-by: Tobin C. Harding <[email protected]>
---
fs/dcache.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 76 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index b7318615979d..0dfe580c2d42 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -31,6 +31,7 @@
#include <linux/bit_spinlock.h>
#include <linux/rculist_bl.h>
#include <linux/list_lru.h>
+#include <linux/backing-dev.h>
#include "internal.h"
#include "mount.h"

@@ -3071,6 +3072,79 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode)
}
EXPORT_SYMBOL(d_tmpfile);

+/*
+ * d_isolate() - Dentry isolation callback function.
+ * @s: The dentry cache.
+ * @v: Vector of pointers to the objects to isolate.
+ * @nr: Number of objects in @v.
+ *
+ * The slab allocator is holding off frees. We can safely examine
+ * the object without the danger of it vanishing from under us.
+ */
+static void *d_isolate(struct kmem_cache *s, void **v, int nr)
+{
+ struct list_head *dispose;
+ struct dentry *dentry;
+ int i;
+
+ dispose = kmalloc(sizeof(*dispose), GFP_KERNEL);
+ if (!dispose)
+ return NULL;
+
+ INIT_LIST_HEAD(dispose);
+
+ for (i = 0; i < nr; i++) {
+ dentry = v[i];
+ spin_lock(&dentry->d_lock);
+
+ if (dentry->d_lockref.count > 0 ||
+ dentry->d_flags & DCACHE_SHRINK_LIST) {
+ spin_unlock(&dentry->d_lock);
+ continue;
+ }
+
+ if (dentry->d_flags & DCACHE_LRU_LIST)
+ d_lru_del(dentry);
+
+ d_shrink_add(dentry, dispose);
+ spin_unlock(&dentry->d_lock);
+ }
+
+ return dispose;
+}
+
+/*
+ * d_partial_shrink() - Dentry migration callback function.
+ * @s: The dentry cache.
+ * @_unused: We do not access the vector.
+ * @__unused: No need for length of vector.
+ * @___unused: We do not do any allocation.
+ * @private: list_head pointer representing the shrink list.
+ *
+ * Dispose of the shrink list created during isolation function.
+ *
+ * Dentry objects can _not_ be relocated and shrinking the whole dcache
+ * can be expensive. This is an effort to free dentry objects that are
+ * stopping slab pages from being free'd without clearing the whole dcache.
+ *
+ * This callback is called from the SLUB allocator object migration
+ * infrastructure in an attempt to free up slab pages by freeing dentry
+ * objects from partially full slabs.
+ */
+static void d_partial_shrink(struct kmem_cache *s, void **_unused, int __unused,
+ int ___unused, void *private)
+{
+ struct list_head *dispose = private;
+
+ if (!private) /* kmalloc error during isolate. */
+ return;
+
+ if (!list_empty(dispose))
+ shrink_dentry_list(dispose);
+
+ kfree(private);
+}
+
static __initdata unsigned long dhash_entries;
static int __init set_dhash_entries(char *str)
{
@@ -3116,6 +3190,8 @@ static void __init dcache_init(void)
sizeof_field(struct dentry, d_iname),
dcache_ctor);

+ kmem_cache_setup_mobility(dentry_cache, d_isolate, d_partial_shrink);
+
/* Hash may have been set up in dcache_init_early */
if (!hashdist)
return;
--
2.21.0


2019-05-21 00:40:52

by Roman Gushchin

Subject: Re: [RFC PATCH v5 03/16] slub: Sort slab cache list

On Mon, May 20, 2019 at 03:40:04PM +1000, Tobin C. Harding wrote:
> It is advantageous to have all defragmentable slabs together at the
> beginning of the list of slabs so that there is no need to scan the
> complete list. Put defragmentable caches first when adding a slab cache
> and others last.
>
> Co-developed-by: Christoph Lameter <[email protected]>
> Signed-off-by: Tobin C. Harding <[email protected]>

Reviewed-by: Roman Gushchin <[email protected]>

2019-05-21 01:06:21

by Roman Gushchin

Subject: Re: [RFC PATCH v5 13/16] slub: Enable balancing slabs across nodes

On Mon, May 20, 2019 at 03:40:14PM +1000, Tobin C. Harding wrote:
> We have just implemented Slab Movable Objects (SMO). On NUMA systems
> slabs can become unbalanced i.e. many slabs on one node while other
> nodes have few slabs. Using SMO we can balance the slabs across all
> the nodes.
>
> The algorithm used is as follows:
>
> 1. Move all objects to node 0 (this has the effect of defragmenting the
> cache).

This already sounds dangerous (or costly). Can't it be done without
cross-node data moves?

>
> 2. Calculate the desired number of slabs for each node (this is done
> using the approximation nr_slabs / nr_nodes).

So that on this step only (actual data size - desired data size) has
to be moved?

Thanks!

2019-05-21 01:48:14

by Tobin C. Harding

Subject: Re: [RFC PATCH v5 13/16] slub: Enable balancing slabs across nodes

On Tue, May 21, 2019 at 01:04:10AM +0000, Roman Gushchin wrote:
> On Mon, May 20, 2019 at 03:40:14PM +1000, Tobin C. Harding wrote:
> > We have just implemented Slab Movable Objects (SMO). On NUMA systems
> > slabs can become unbalanced i.e. many slabs on one node while other
> > nodes have few slabs. Using SMO we can balance the slabs across all
> > the nodes.
> >
> > The algorithm used is as follows:
> >
> > 1. Move all objects to node 0 (this has the effect of defragmenting the
> > cache).
>
> This already sounds dangerous (or costly). Can't it be done without
> cross-node data moves?
>
> >
> > 2. Calculate the desired number of slabs for each node (this is done
> > using the approximation nr_slabs / nr_nodes).
>
> So that on this step only (actual data size - desired data size) has
> to be moved?

This is just the most braindead algorithm I could come up with. Surely
there are a bunch of things that could be improved. Since I don't know
the exact use case it seemed best not to optimize for any one use case.

I'll review, comment on, and test any algorithm you come up with!

thanks,
Tobin.
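
For reference, the arithmetic behind the second question can be sketched
as below. This is not code from the series and not a proposed algorithm:
balance_surplus() and balance_via_node0() are hypothetical helpers, and
the per-node slab counts fed to them are made-up inputs. The sketch only
compares how many slabs sit above a target of total / nr_nodes with how
many slabs the two-step approach touches, ignoring any defragmentation
gained by the first step.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the slabs that sit above the per-node target, where the target
 * is total / nr_nodes rounded down (the same approximation the patch
 * uses).
 */
static long balance_surplus(const long *nr_slabs, size_t nr_nodes)
{
	long total = 0, target, surplus = 0;

	if (nr_nodes == 0)
		return 0;

	for (size_t i = 0; i < nr_nodes; i++)
		total += nr_slabs[i];
	target = total / (long)nr_nodes;

	for (size_t i = 0; i < nr_nodes; i++)
		if (nr_slabs[i] > target)
			surplus += nr_slabs[i] - target;

	return surplus;
}

/*
 * Slabs touched by "move everything to node 0, then hand
 * total / nr_nodes to every other node".
 */
static long balance_via_node0(const long *nr_slabs, size_t nr_nodes)
{
	long total = 0, gathered = 0;

	if (nr_nodes == 0)
		return 0;

	for (size_t i = 0; i < nr_nodes; i++) {
		total += nr_slabs[i];
		if (i != 0)
			gathered += nr_slabs[i];
	}
	/* Step 1 gathers, step 2 redistributes to nr_nodes - 1 nodes. */
	return gathered + (total / (long)nr_nodes) * (long)(nr_nodes - 1);
}
```

With four nodes holding 90, 6, 4 and 0 slabs, the surplus is 65 slabs
while the two-step approach touches 85. With an already even 25 slabs
per node, the surplus is 0 while the two-step approach touches 150.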