Hi,
as discussed at LSF/MM [1] here's an RFC patchset that introduces
kmalloc-reclaimable caches (more details in the first patch) and uses them
for SLAB freelists and dcache external names. The latter allows us to
repurpose the NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
This is how /proc/slabinfo looks after booting in virtme:
...
kmalloc-reclaimable-4194304 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
...
kmalloc-reclaimable-96 17 64 128 32 1 : tunables 120 60 8 : slabdata 2 2 0
kmalloc-reclaimable-64 50 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 6
kmalloc-reclaimable-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
kmalloc-4194304 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
...
kmalloc-64 2888 2944 64 64 1 : tunables 120 60 8 : slabdata 46 46 454
kmalloc-32 4325 4712 32 124 1 : tunables 120 60 8 : slabdata 38 38 563
kmalloc-128 1178 1216 128 32 1 : tunables 120 60 8 : slabdata 38 38 114
...
/proc/vmstat with new/renamed nr_reclaimable counter (patch 4):
...
nr_slab_reclaimable 2817
nr_slab_unreclaimable 1781
...
nr_reclaimable 2817
...
/proc/meminfo with exposed nr_reclaimable counter (patch 5):
...
AnonPages: 8624 kB
Mapped: 3340 kB
Shmem: 564 kB
Reclaimable: 11272 kB
Slab: 18368 kB
SReclaimable: 11272 kB
SUnreclaim: 7096 kB
KernelStack: 1168 kB
PageTables: 448 kB
...
Now for the issues a.k.a. why RFC:
- I haven't found any other obvious users for reclaimable kmalloc (yet)
- the name of caches kmalloc-reclaimable-X is rather long
- the vmstat/meminfo counter name is rather general and might suggest it also
includes reclaimable page caches, which it doesn't
Suggestions welcome for all three points. For the last one, we might also keep
the counter separate from nr_slab_reclaimable, not superset. I did a superset
as IIRC somebody suggested that in the older threads or at LSF.
Thanks,
Vlastimil
[1] https://lwn.net/Articles/753154/
Vlastimil Babka (5):
mm, slab/slub: introduce kmalloc-reclaimable caches
mm, slab: allocate off-slab freelists as reclaimable when appropriate
dcache: allocate external names from reclaimable kmalloc caches
mm: rename and change semantics of nr_indirectly_reclaimable_bytes
mm, proc: add NR_RECLAIMABLE to /proc/meminfo
drivers/base/node.c | 2 +
drivers/staging/android/ion/ion_page_pool.c | 4 +-
fs/dcache.c | 40 ++++-----------
fs/proc/meminfo.c | 3 +-
include/linux/mmzone.h | 2 +-
include/linux/slab.h | 17 +++++--
mm/page_alloc.c | 15 ++----
mm/slab.c | 23 ++++++---
mm/slab_common.c | 56 ++++++++++++++-------
mm/slub.c | 12 ++---
mm/util.c | 16 ++----
mm/vmstat.c | 6 +--
12 files changed, 99 insertions(+), 97 deletions(-)
--
2.17.0
In SLAB, OFF_SLAB caches allocate management structures (currently just the
freelist) from kmalloc caches when placement in a slab page together with
objects would lead to suboptimal memory usage. For SLAB_RECLAIM_ACCOUNT caches,
we can allocate the freelists from the newly introduced reclaimable kmalloc
caches, because shrinking the OFF_SLAB cache will in general result in freeing
of the freelists as well. This should improve accounting and anti-fragmentation
a bit.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/slab.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/slab.c b/mm/slab.c
index 8d7e1f06127b..4dd7d73a1972 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2142,8 +2142,13 @@ int __kmem_cache_create(struct kmem_cache *cachep, slab_flags_t flags)
#endif
if (OFF_SLAB(cachep)) {
+ /*
+ * If this cache is reclaimable, allocate also freelists from
+ * a reclaimable kmalloc cache.
+ */
cachep->freelist_cache =
- kmalloc_slab(cachep->freelist_size, 0u);
+ kmalloc_slab(cachep->freelist_size,
+ cachep->allocflags & __GFP_RECLAIMABLE);
}
err = setup_cpu_cache(cachep, gfp);
--
2.17.0
The vmstat NR_RECLAIMABLE counter is a superset of NR_SLAB_RECLAIMABLE, also
covering non-slab allocations that can be reclaimed via a shrinker. Make it
visible also in /proc/meminfo and /sys/... node info.
Signed-off-by: Vlastimil Babka <[email protected]>
---
drivers/base/node.c | 2 ++
fs/proc/meminfo.c | 3 ++-
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index a5e821d09656..b35e3627eab7 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -118,6 +118,7 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d NFS_Unstable: %8lu kB\n"
"Node %d Bounce: %8lu kB\n"
"Node %d WritebackTmp: %8lu kB\n"
+ "Node %d Reclaimable: %8lu kB\n"
"Node %d Slab: %8lu kB\n"
"Node %d SReclaimable: %8lu kB\n"
"Node %d SUnreclaim: %8lu kB\n"
@@ -138,6 +139,7 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
+ nid, K(node_page_state(pgdat, NR_RECLAIMABLE)),
nid, K(node_page_state(pgdat, NR_SLAB_RECLAIMABLE) +
node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE)),
nid, K(node_page_state(pgdat, NR_SLAB_RECLAIMABLE)),
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 2fb04846ed11..6ca0158a9ebd 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -93,10 +93,11 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
show_val_kb(m, "Mapped: ",
global_node_page_state(NR_FILE_MAPPED));
show_val_kb(m, "Shmem: ", i.sharedram);
+ show_val_kb(m, "Reclaimable: ",
+ global_node_page_state(NR_RECLAIMABLE));
show_val_kb(m, "Slab: ",
global_node_page_state(NR_SLAB_RECLAIMABLE) +
global_node_page_state(NR_SLAB_UNRECLAIMABLE));
-
show_val_kb(m, "SReclaimable: ",
global_node_page_state(NR_SLAB_RECLAIMABLE));
show_val_kb(m, "SUnreclaim: ",
--
2.17.0
Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which indicates
they contain objects which can be reclaimed under memory pressure (typically
through a shrinker). This makes the slab pages accounted as NR_SLAB_RECLAIMABLE
in vmstat, which is also reflected in the MemAvailable meminfo counter and in
overcommit decisions. The slab pages are also allocated with __GFP_RECLAIMABLE,
which is good for anti-fragmentation through grouping pages by mobility.
The generic kmalloc-X caches are created without this flag, but are sometimes
also used for objects that can be reclaimed, which due to their varying size
cannot have a dedicated kmem cache with the SLAB_RECLAIM_ACCOUNT flag. A
prominent example is dcache external names, which prompted the creation of a new, manually
managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES in commit f1782c9bc547
("dcache: account external names as indirectly reclaimable memory").
To better handle this and any other similar cases, this patch introduces
SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named kmalloc-reclaimable-X.
They are used whenever the kmalloc() call passes __GFP_RECLAIMABLE among gfp
flags. The kmalloc_caches[size_idx] array is extended to a two-dimensional
array kmalloc_caches[reclaimable][size_idx] to avoid an extra branch testing
for the flag.
This change only applies to SLAB and SLUB, not SLOB. This is fine, since
SLOB targets tiny systems, and this patch would add some overhead in the form
of extra kmem cache management objects.
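For illustration, a caller with variable-sized objects that a shrinker can
free would then simply pass the flag to kmalloc() instead of maintaining a
dedicated SLAB_RECLAIM_ACCOUNT cache (hypothetical example, names made up):

	/* object is reclaimable via a shrinker; lands in kmalloc-reclaimable-X */
	name = kmalloc(len, GFP_KERNEL | __GFP_RECLAIMABLE);
	...
	kfree(name);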
Signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/slab.h | 17 ++++++++++----
mm/slab.c | 4 ++--
mm/slab_common.c | 56 ++++++++++++++++++++++++++++++--------------
mm/slub.c | 12 +++++-----
4 files changed, 58 insertions(+), 31 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9ebe659bd4a5..5bff0571b360 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -296,11 +296,16 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
(KMALLOC_MIN_SIZE) : 16)
#ifndef CONFIG_SLOB
-extern struct kmem_cache *kmalloc_caches[KMALLOC_SHIFT_HIGH + 1];
+extern struct kmem_cache *kmalloc_caches[2][KMALLOC_SHIFT_HIGH + 1];
#ifdef CONFIG_ZONE_DMA
extern struct kmem_cache *kmalloc_dma_caches[KMALLOC_SHIFT_HIGH + 1];
#endif
+static __always_inline unsigned int kmalloc_reclaimable(gfp_t flags)
+{
+ return !!(flags & __GFP_RECLAIMABLE);
+}
+
/*
* Figure out which kmalloc slab an allocation of a certain size
* belongs to.
@@ -536,12 +541,13 @@ static __always_inline void *kmalloc(size_t size, gfp_t flags)
#ifndef CONFIG_SLOB
if (!(flags & GFP_DMA)) {
unsigned int index = kmalloc_index(size);
+ unsigned int recl = kmalloc_reclaimable(flags);
if (!index)
return ZERO_SIZE_PTR;
- return kmem_cache_alloc_trace(kmalloc_caches[index],
- flags, size);
+ return kmem_cache_alloc_trace(
+ kmalloc_caches[recl][index], flags, size);
}
#endif
}
@@ -588,12 +594,13 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
if (__builtin_constant_p(size) &&
size <= KMALLOC_MAX_CACHE_SIZE && !(flags & GFP_DMA)) {
unsigned int i = kmalloc_index(size);
+ unsigned int recl = kmalloc_reclaimable(flags);
if (!i)
return ZERO_SIZE_PTR;
- return kmem_cache_alloc_node_trace(kmalloc_caches[i],
- flags, node, size);
+ return kmem_cache_alloc_node_trace(
+ kmalloc_caches[recl][i], flags, node, size);
}
#endif
return __kmalloc_node(size, flags, node);
diff --git a/mm/slab.c b/mm/slab.c
index c1fe8099b3cd..8d7e1f06127b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1290,7 +1290,7 @@ void __init kmem_cache_init(void)
* Initialize the caches that provide memory for the kmem_cache_node
* structures first. Without this, further allocations will bug.
*/
- kmalloc_caches[INDEX_NODE] = create_kmalloc_cache(
+ kmalloc_caches[0][INDEX_NODE] = create_kmalloc_cache(
kmalloc_info[INDEX_NODE].name,
kmalloc_size(INDEX_NODE), ARCH_KMALLOC_FLAGS,
0, kmalloc_size(INDEX_NODE));
@@ -1306,7 +1306,7 @@ void __init kmem_cache_init(void)
for_each_online_node(nid) {
init_list(kmem_cache, &init_kmem_cache_node[CACHE_CACHE + nid], nid);
- init_list(kmalloc_caches[INDEX_NODE],
+ init_list(kmalloc_caches[0][INDEX_NODE],
&init_kmem_cache_node[SIZE_NODE + nid], nid);
}
}
diff --git a/mm/slab_common.c b/mm/slab_common.c
index b0dd9db1eb2f..d9a66095de0d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -938,7 +938,11 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name,
return s;
}
-struct kmem_cache *kmalloc_caches[KMALLOC_SHIFT_HIGH + 1] __ro_after_init;
+/*
+ * kmalloc_caches[0][] - kmalloc caches for non-reclaimable allocations
+ * kmalloc_caches[1][] - kmalloc caches for __GFP_RECLAIMABLE allocations
+ */
+struct kmem_cache *kmalloc_caches[2][KMALLOC_SHIFT_HIGH + 1] __ro_after_init;
EXPORT_SYMBOL(kmalloc_caches);
#ifdef CONFIG_ZONE_DMA
@@ -1010,7 +1014,7 @@ struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
return kmalloc_dma_caches[index];
#endif
- return kmalloc_caches[index];
+ return kmalloc_caches[kmalloc_reclaimable(flags)][index];
}
/*
@@ -1082,9 +1086,21 @@ void __init setup_kmalloc_cache_index_table(void)
}
}
-static void __init new_kmalloc_cache(int idx, slab_flags_t flags)
+static void __init
+new_kmalloc_cache(int idx, int reclaimable, slab_flags_t flags)
{
- kmalloc_caches[idx] = create_kmalloc_cache(kmalloc_info[idx].name,
+ const char * name;
+
+ if (reclaimable) {
+ flags |= SLAB_RECLAIM_ACCOUNT;
+ name = kasprintf(GFP_NOWAIT, "kmalloc-reclaimable-%u",
+ kmalloc_info[idx].size);
+ BUG_ON(!name);
+ } else {
+ name = kmalloc_info[idx].name;
+ }
+
+ kmalloc_caches[reclaimable][idx] = create_kmalloc_cache(name,
kmalloc_info[idx].size, flags, 0,
kmalloc_info[idx].size);
}
@@ -1096,21 +1112,25 @@ static void __init new_kmalloc_cache(int idx, slab_flags_t flags)
*/
void __init create_kmalloc_caches(slab_flags_t flags)
{
- int i;
+ int i, reclaimable;
- for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
- if (!kmalloc_caches[i])
- new_kmalloc_cache(i, flags);
+ for (reclaimable = 0; reclaimable <= 1; reclaimable++) {
+ for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
+ if (!kmalloc_caches[reclaimable][i])
+ new_kmalloc_cache(i, reclaimable, flags);
- /*
- * Caches that are not of the two-to-the-power-of size.
- * These have to be created immediately after the
- * earlier power of two caches
- */
- if (KMALLOC_MIN_SIZE <= 32 && !kmalloc_caches[1] && i == 6)
- new_kmalloc_cache(1, flags);
- if (KMALLOC_MIN_SIZE <= 64 && !kmalloc_caches[2] && i == 7)
- new_kmalloc_cache(2, flags);
+ /*
+ * Caches that are not of the two-to-the-power-of size.
+ * These have to be created immediately after the
+ * earlier power of two caches
+ */
+ if (KMALLOC_MIN_SIZE <= 32 && i == 6 &&
+ !kmalloc_caches[reclaimable][1])
+ new_kmalloc_cache(1, reclaimable, flags);
+ if (KMALLOC_MIN_SIZE <= 64 && i == 7 &&
+ !kmalloc_caches[reclaimable][2])
+ new_kmalloc_cache(2, reclaimable, flags);
+ }
}
/* Kmalloc array is now usable */
@@ -1118,7 +1138,7 @@ void __init create_kmalloc_caches(slab_flags_t flags)
#ifdef CONFIG_ZONE_DMA
for (i = 0; i <= KMALLOC_SHIFT_HIGH; i++) {
- struct kmem_cache *s = kmalloc_caches[i];
+ struct kmem_cache *s = kmalloc_caches[0][i];
if (s) {
unsigned int size = kmalloc_size(i);
diff --git a/mm/slub.c b/mm/slub.c
index 48f75872c356..c7d7b83f20c2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4707,7 +4707,7 @@ static void __init resiliency_test(void)
pr_err("\n1. kmalloc-16: Clobber Redzone/next pointer 0x12->0x%p\n\n",
p + 16);
- validate_slab_cache(kmalloc_caches[4]);
+ validate_slab_cache(kmalloc_caches[0][4]);
/* Hmmm... The next two are dangerous */
p = kzalloc(32, GFP_KERNEL);
@@ -4716,33 +4716,33 @@ static void __init resiliency_test(void)
p);
pr_err("If allocated object is overwritten then not detectable\n\n");
- validate_slab_cache(kmalloc_caches[5]);
+ validate_slab_cache(kmalloc_caches[0][5]);
p = kzalloc(64, GFP_KERNEL);
p += 64 + (get_cycles() & 0xff) * sizeof(void *);
*p = 0x56;
pr_err("\n3. kmalloc-64: corrupting random byte 0x56->0x%p\n",
p);
pr_err("If allocated object is overwritten then not detectable\n\n");
- validate_slab_cache(kmalloc_caches[6]);
+ validate_slab_cache(kmalloc_caches[0][6]);
pr_err("\nB. Corruption after free\n");
p = kzalloc(128, GFP_KERNEL);
kfree(p);
*p = 0x78;
pr_err("1. kmalloc-128: Clobber first word 0x78->0x%p\n\n", p);
- validate_slab_cache(kmalloc_caches[7]);
+ validate_slab_cache(kmalloc_caches[0][7]);
p = kzalloc(256, GFP_KERNEL);
kfree(p);
p[50] = 0x9a;
pr_err("\n2. kmalloc-256: Clobber 50th byte 0x9a->0x%p\n\n", p);
- validate_slab_cache(kmalloc_caches[8]);
+ validate_slab_cache(kmalloc_caches[0][8]);
p = kzalloc(512, GFP_KERNEL);
kfree(p);
p[512] = 0xab;
pr_err("\n3. kmalloc-512: Clobber redzone 0xab->0x%p\n\n", p);
- validate_slab_cache(kmalloc_caches[9]);
+ validate_slab_cache(kmalloc_caches[0][9]);
}
#else
#ifdef CONFIG_SYSFS
--
2.17.0
The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by commit
eb59254608bc ("mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES") with the goal of
accounting objects that can be reclaimed, but cannot be allocated via a
SLAB_RECLAIM_ACCOUNT cache. This is now possible via kmalloc() with the
__GFP_RECLAIMABLE flag, and the dcache external names user is converted.
The counter is however still useful for accounting direct page allocations
(i.e. not slab) that are reclaimed via a shrinker, such as the ION page pool. So keep it, and:
- change granularity to pages to be more like other counters; sub-page
allocations should be able to use kmalloc
- rename the counter to NR_RECLAIMABLE
- expose the counter in vmstat as "nr_reclaimable"; this lets us remove the
  check for not printing "hidden" counters
- make the counter also include SLAB_RECLAIM_ACCOUNT pages, so it covers all
shrinker-based (i.e. not page cache) reclaimable pages
Signed-off-by: Vlastimil Babka <[email protected]>
---
drivers/staging/android/ion/ion_page_pool.c | 4 ++--
include/linux/mmzone.h | 2 +-
mm/page_alloc.c | 15 ++++-----------
mm/slab.c | 12 ++++++++----
mm/util.c | 16 +++++-----------
mm/vmstat.c | 6 +-----
6 files changed, 21 insertions(+), 34 deletions(-)
diff --git a/drivers/staging/android/ion/ion_page_pool.c b/drivers/staging/android/ion/ion_page_pool.c
index 9bc56eb48d2a..11e6e694f425 100644
--- a/drivers/staging/android/ion/ion_page_pool.c
+++ b/drivers/staging/android/ion/ion_page_pool.c
@@ -33,8 +33,8 @@ static void ion_page_pool_add(struct ion_page_pool *pool, struct page *page)
pool->low_count++;
}
- mod_node_page_state(page_pgdat(page), NR_INDIRECTLY_RECLAIMABLE_BYTES,
- (1 << (PAGE_SHIFT + pool->order)));
+ mod_node_page_state(page_pgdat(page), NR_RECLAIMABLE,
+ 1 << pool->order);
mutex_unlock(&pool->mutex);
}
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2dc52a..4343948f33e5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -180,7 +180,7 @@ enum node_stat_item {
NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */
NR_DIRTIED, /* page dirtyings since bootup */
NR_WRITTEN, /* page writings since bootup */
- NR_INDIRECTLY_RECLAIMABLE_BYTES, /* measured in bytes */
+ NR_RECLAIMABLE, /* all reclaimable pages, including slab */
NR_VM_NODE_STAT_ITEMS
};
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 249546393bd6..6f22fec0df54 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4708,6 +4708,7 @@ long si_mem_available(void)
unsigned long pagecache;
unsigned long wmark_low = 0;
unsigned long pages[NR_LRU_LISTS];
+ unsigned long reclaimable;
struct zone *zone;
int lru;
@@ -4733,19 +4734,11 @@ long si_mem_available(void)
available += pagecache;
/*
- * Part of the reclaimable slab consists of items that are in use,
+ * Part of the reclaimable pages consists of items that are in use,
* and cannot be freed. Cap this estimate at the low watermark.
*/
- available += global_node_page_state(NR_SLAB_RECLAIMABLE) -
- min(global_node_page_state(NR_SLAB_RECLAIMABLE) / 2,
- wmark_low);
-
- /*
- * Part of the kernel memory, which can be released under memory
- * pressure.
- */
- available += global_node_page_state(NR_INDIRECTLY_RECLAIMABLE_BYTES) >>
- PAGE_SHIFT;
+ reclaimable = global_node_page_state(NR_RECLAIMABLE);
+ available += reclaimable - min(reclaimable / 2, wmark_low);
if (available < 0)
available = 0;
diff --git a/mm/slab.c b/mm/slab.c
index 4dd7d73a1972..a2a8c0802253 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1420,10 +1420,12 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
}
nr_pages = (1 << cachep->gfporder);
- if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
+ if (cachep->flags & SLAB_RECLAIM_ACCOUNT) {
mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, nr_pages);
- else
+ mod_node_page_state(page_pgdat(page), NR_RECLAIMABLE, nr_pages);
+ } else {
mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, nr_pages);
+ }
__SetPageSlab(page);
/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
@@ -1441,10 +1443,12 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
int order = cachep->gfporder;
unsigned long nr_freed = (1 << order);
- if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
+ if (cachep->flags & SLAB_RECLAIM_ACCOUNT) {
mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, -nr_freed);
- else
+ mod_node_page_state(page_pgdat(page), NR_RECLAIMABLE, -nr_freed);
+ } else {
mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, -nr_freed);
+ }
BUG_ON(!PageSlab(page));
__ClearPageSlabPfmemalloc(page);
diff --git a/mm/util.c b/mm/util.c
index 98180a994895..3ffd92a9778a 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -662,19 +662,13 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
free += get_nr_swap_pages();
/*
- * Any slabs which are created with the
+ * Pages accounted as reclaimable.
+ * This includes any slabs which are created with the
* SLAB_RECLAIM_ACCOUNT flag claim to have contents
- * which are reclaimable, under pressure. The dentry
- * cache and most inode caches should fall into this
+ * which are reclaimable, under pressure. The dentry
+ * cache and most inode caches should fall into this.
*/
- free += global_node_page_state(NR_SLAB_RECLAIMABLE);
-
- /*
- * Part of the kernel memory, which can be released
- * under memory pressure.
- */
- free += global_node_page_state(
- NR_INDIRECTLY_RECLAIMABLE_BYTES) >> PAGE_SHIFT;
+ free += global_node_page_state(NR_RECLAIMABLE);
/*
* Leave reserved pages. The pages are not for anonymous pages.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 75eda9c2b260..21d571da9d5a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1161,7 +1161,7 @@ const char * const vmstat_text[] = {
"nr_vmscan_immediate_reclaim",
"nr_dirtied",
"nr_written",
- "", /* nr_indirectly_reclaimable */
+ "nr_reclaimable",
/* enum writeback_stat_item counters */
"nr_dirty_threshold",
@@ -1704,10 +1704,6 @@ static int vmstat_show(struct seq_file *m, void *arg)
unsigned long *l = arg;
unsigned long off = l - (unsigned long *)m->private;
- /* Skip hidden vmstat items. */
- if (*vmstat_text[off] == '\0')
- return 0;
-
seq_puts(m, vmstat_text[off]);
seq_put_decimal_ull(m, " ", *l);
seq_putc(m, '\n');
--
2.17.0
We can use the newly introduced kmalloc-reclaimable-X caches to allocate
external names in dcache, which takes care of the proper accounting
automatically and also improves anti-fragmentation page grouping.
This effectively reverts commit f1782c9bc547 ("dcache: account external names
as indirectly reclaimable memory") and instead passes __GFP_RECLAIMABLE to
kmalloc(). The accounting thus moves from NR_INDIRECTLY_RECLAIMABLE_BYTES to
NR_SLAB_RECLAIMABLE, which is also considered in MemAvailable calculation and
overcommit decisions.
This reverts commit f1782c9bc547754f4bd3043fe8cfda53db85f13f.
---
fs/dcache.c | 40 ++++++++++------------------------------
1 file changed, 10 insertions(+), 30 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index c30a8ae46096..3346034d4520 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -257,25 +257,11 @@ static void __d_free(struct rcu_head *head)
kmem_cache_free(dentry_cache, dentry);
}
-static void __d_free_external_name(struct rcu_head *head)
-{
- struct external_name *name = container_of(head, struct external_name,
- u.head);
-
- mod_node_page_state(page_pgdat(virt_to_page(name)),
- NR_INDIRECTLY_RECLAIMABLE_BYTES,
- -ksize(name));
-
- kfree(name);
-}
-
static void __d_free_external(struct rcu_head *head)
{
struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
-
- __d_free_external_name(&external_name(dentry)->u.head);
-
- kmem_cache_free(dentry_cache, dentry);
+ kfree(external_name(dentry));
+ kmem_cache_free(dentry_cache, dentry);
}
static inline int dname_external(const struct dentry *dentry)
@@ -306,7 +292,7 @@ void release_dentry_name_snapshot(struct name_snapshot *name)
struct external_name *p;
p = container_of(name->name, struct external_name, name[0]);
if (unlikely(atomic_dec_and_test(&p->u.count)))
- call_rcu(&p->u.head, __d_free_external_name);
+ kfree_rcu(p, u.head);
}
}
EXPORT_SYMBOL(release_dentry_name_snapshot);
@@ -1609,7 +1595,6 @@ EXPORT_SYMBOL(d_invalidate);
struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
{
- struct external_name *ext = NULL;
struct dentry *dentry;
char *dname;
int err;
@@ -1630,14 +1615,15 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
dname = dentry->d_iname;
} else if (name->len > DNAME_INLINE_LEN-1) {
size_t size = offsetof(struct external_name, name[1]);
-
- ext = kmalloc(size + name->len, GFP_KERNEL_ACCOUNT);
- if (!ext) {
+ struct external_name *p = kmalloc(size + name->len,
+ GFP_KERNEL_ACCOUNT |
+ __GFP_RECLAIMABLE);
+ if (!p) {
kmem_cache_free(dentry_cache, dentry);
return NULL;
}
- atomic_set(&ext->u.count, 1);
- dname = ext->name;
+ atomic_set(&p->u.count, 1);
+ dname = p->name;
} else {
dname = dentry->d_iname;
}
@@ -1676,12 +1662,6 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
}
}
- if (unlikely(ext)) {
- pg_data_t *pgdat = page_pgdat(virt_to_page(ext));
- mod_node_page_state(pgdat, NR_INDIRECTLY_RECLAIMABLE_BYTES,
- ksize(ext));
- }
-
this_cpu_inc(nr_dentry);
return dentry;
@@ -2762,7 +2742,7 @@ static void copy_name(struct dentry *dentry, struct dentry *target)
dentry->d_name.hash_len = target->d_name.hash_len;
}
if (old_name && likely(atomic_dec_and_test(&old_name->u.count)))
- call_rcu(&old_name->u.head, __d_free_external_name);
+ kfree_rcu(old_name, u.head);
}
/*
--
2.17.0
On Thu, May 24, 2018 at 01:00:06PM +0200, Vlastimil Babka wrote:
> Now for the issues a.k.a. why RFC:
>
> - I haven't found any other obvious users for reclaimable kmalloc (yet)
Is that a problem? This sounds like it's enough to solve Facebook's
problem.
> - the name of caches kmalloc-reclaimable-X is rather long
Yes; Christoph and I were talking about restricting slab names to 16 bytes
just to make /proc/slabinfo easier to read. How about
kmalloc-rec-128k
1234567890123456
Just makes it ;-)
Of course, somebody needs to do the work to use k/M instead of 4194304.
We also need to bikeshed about when to switch; should it be:
kmalloc-rec-512
kmalloc-rec-1024
kmalloc-rec-2048
kmalloc-rec-4096
kmalloc-rec-8192
kmalloc-rec-16k
or should it be
kmalloc-rec-512
kmalloc-rec-1k
kmalloc-rec-2k
kmalloc-rec-4k
kmalloc-rec-8k
kmalloc-rec-16k
I slightly favour the latter as it'll be easier to implement. Something like
static const char suffixes[3] = ' kM';
int idx = 0;
while (size > 1024) {
size /= 1024;
idx++;
}
sprintf("%d%c", size, suffices[idx]);
On Thu, May 24, 2018 at 01:00:06PM +0200, Vlastimil Babka wrote:
> Hi,
>
> as discussed at LSF/MM [1] here's an RFC patchset that introduces
> kmalloc-reclaimable caches (more details in the first patch) and uses them
> for SLAB freelists and dcache external names. The latter allows us to
> repurpose the NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
>
> This is how /proc/slabinfo looks after booting in virtme:
>
> ...
> kmalloc-reclaimable-4194304 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
> ...
> kmalloc-reclaimable-96 17 64 128 32 1 : tunables 120 60 8 : slabdata 2 2 0
> kmalloc-reclaimable-64 50 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 6
> kmalloc-reclaimable-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
> kmalloc-4194304 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
> ...
> kmalloc-64 2888 2944 64 64 1 : tunables 120 60 8 : slabdata 46 46 454
> kmalloc-32 4325 4712 32 124 1 : tunables 120 60 8 : slabdata 38 38 563
> kmalloc-128 1178 1216 128 32 1 : tunables 120 60 8 : slabdata 38 38 114
> ...
>
> /proc/vmstat with new/renamed nr_reclaimable counter (patch 4):
>
> ...
> nr_slab_reclaimable 2817
> nr_slab_unreclaimable 1781
> ...
> nr_reclaimable 2817
> ...
>
> /proc/meminfo with exposed nr_reclaimable counter (patch 5):
>
> ...
> AnonPages: 8624 kB
> Mapped: 3340 kB
> Shmem: 564 kB
> Reclaimable: 11272 kB
> Slab: 18368 kB
> SReclaimable: 11272 kB
> SUnreclaim: 7096 kB
> KernelStack: 1168 kB
> PageTables: 448 kB
> ...
>
> Now for the issues a.k.a. why RFC:
>
> - I haven't found any other obvious users for reclaimable kmalloc (yet)
As I remember, ION memory allocator was discussed related to this theme:
https://lkml.org/lkml/2018/4/24/1288
> I did a superset as IIRC somebody suggested that in the older threads or at LSF.
This looks nice to me!
Thanks!
On Thu, May 24, 2018 at 01:00:06PM +0200, Vlastimil Babka wrote:
> - the vmstat/meminfo counter name is rather general and might suggest it also
> includes reclaimable page caches, which it doesn't
>
> Suggestions welcome for all three points. For the last one, we might also keep
> the counter separate from nr_slab_reclaimable, not superset. I did a superset
> as IIRC somebody suggested that in the older threads or at LSF.
Yeah, the "reclaimable" name is too generic. How about KReclaimable?
The counter being a superset sounds good to me. We use this info for
both load balancing and manual debugging. For load balancing code it's
nice not having to worry about finding all the counters that hold
reclaimable memory depending on kernel version; it's always simply
user cache + user anon + kernel reclaimable. And for debugging, we can
always add more specific subset counters later on if we need them.
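(Roughly, and only as a hypothetical sketch of that estimate, not code from
this series:)

	/* user cache + user anon + kernel reclaimable */
	reclaimable = global_node_page_state(NR_ACTIVE_FILE) +
		      global_node_page_state(NR_INACTIVE_FILE) +
		      global_node_page_state(NR_ACTIVE_ANON) +
		      global_node_page_state(NR_INACTIVE_ANON) +
		      global_node_page_state(NR_RECLAIMABLE);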
On 05/24/2018 02:13 PM, Roman Gushchin wrote:
> On Thu, May 24, 2018 at 01:00:06PM +0200, Vlastimil Babka wrote:
>> Hi,
>>
>> - I haven't found any other obvious users for reclaimable kmalloc (yet)
>
> As I remember, ION memory allocator was discussed related to this theme:
> https://lkml.org/lkml/2018/4/24/1288
+CC Laura
Yeah ION added the NR_INDIRECTLY_RECLAIMABLE_BYTES handling, which is
adjusted to page granularity in patch 4. I'm not sure if it should use
kmalloc as it seems to be allocating order-X pages, where kmalloc/slab
just means extra overhead. But maybe if it doesn't allocate/free too
frequently, it could work?
>> I did a superset as IIRC somebody suggested that in the older threads or at LSF.
>
> This looks nice to me!
>
> Thanks!
>
On 05/24/2018 04:43 AM, Matthew Wilcox wrote:
> On Thu, May 24, 2018 at 01:00:06PM +0200, Vlastimil Babka wrote:
>> Now for the issues a.k.a. why RFC:
>>
>> - I haven't found any other obvious users for reclaimable kmalloc (yet)
>
> Is that a problem? This sounds like it's enough to solve Facebook's
> problem.
>
>> - the name of caches kmalloc-reclaimable-X is rather long
>
> Yes; Christoph and I were talking about restricting slab names to 16 bytes
> just to make /proc/slabinfo easier to read. How about
>
> kmalloc-rec-128k
> 1234567890123456
>
> Just makes it ;-)
>
> Of course, somebody needs to do the work to use k/M instead of 4194304.
> We also need to bikeshed about when to switch; should it be:
>
> kmalloc-rec-512
> kmalloc-rec-1024
> kmalloc-rec-2048
> kmalloc-rec-4096
> kmalloc-rec-8192
> kmalloc-rec-16k
>
> or should it be
>
> kmalloc-rec-512
> kmalloc-rec-1k
> kmalloc-rec-2k
> kmalloc-rec-4k
> kmalloc-rec-8k
> kmalloc-rec-16k
>
> I slightly favour the latter as it'll be easier to implement. Something like
Yes, agree, start using the suffix early.
>
> static const char suffixes[3] = ' kM';
> int idx = 0;
>
> while (size > 1024) {
> size /= 1024;
> idx++;
> }
>
> sprintf("%d%c", size, suffices[idx]);
suffixes
>
> --
--
~Randy
On 05/24/2018 08:52 AM, Vlastimil Babka wrote:
> On 05/24/2018 02:13 PM, Roman Gushchin wrote:
>> On Thu, May 24, 2018 at 01:00:06PM +0200, Vlastimil Babka wrote:
>>> Hi,
>>>
>>> - I haven't found any other obvious users for reclaimable kmalloc (yet)
>>
>> As I remember, ION memory allocator was discussed related to this theme:
>> https://lkml.org/lkml/2018/4/24/1288
>
> +CC Laura
>
> Yeah ION added the NR_INDIRECTLY_RECLAIMABLE_BYTES handling, which is
> adjusted to page granularity in patch 4. I'm not sure if it should use
> kmalloc as it seems to be allocating order-X pages, where kmalloc/slab
> just means extra overhead. But maybe if it doesn't allocate/free too
> frequently, it could work?
>
The page pool allocation is supposed to be a slow path, but it's one where
I'd rather not have too much overhead. It also just looks really odd
to be allocating higher order pages via kmalloc imho.
>>> I did a superset as IIRC somebody suggested that in the older threads or at LSF.
>>
>> This looks nice to me!
>>
>> Thanks!
>>
>
On 05/24/2018 09:18 AM, Randy Dunlap wrote:
> On 05/24/2018 04:43 AM, Matthew Wilcox wrote:
>> On Thu, May 24, 2018 at 01:00:06PM +0200, Vlastimil Babka wrote:
>>> Now for the issues a.k.a. why RFC:
>>>
>>> - I haven't found any other obvious users for reclaimable kmalloc (yet)
>>
>> Is that a problem? This sounds like it's enough to solve Facebook's
>> problem.
>>
>>> - the name of caches kmalloc-reclaimable-X is rather long
>>
>> Yes; Christoph and I were talking about restricting slab names to 16 bytes
>> just to make /proc/slabinfo easier to read. How about
>>
>> kmalloc-rec-128k
>> 1234567890123456
>>
>> Just makes it ;-)
>>
>> Of course, somebody needs to do the work to use k/M instead of 4194304.
>> We also need to bikeshed about when to switch; should it be:
>>
>> kmalloc-rec-512
>> kmalloc-rec-1024
>> kmalloc-rec-2048
>> kmalloc-rec-4096
>> kmalloc-rec-8192
>> kmalloc-rec-16k
>>
>> or should it be
>>
>> kmalloc-rec-512
>> kmalloc-rec-1k
>> kmalloc-rec-2k
>> kmalloc-rec-4k
>> kmalloc-rec-8k
>> kmalloc-rec-16k
>>
>> I slightly favour the latter as it'll be easier to implement. Something like
>
> Yes, agree, start using the suffix early.
>
>>
>> static const char suffixes[3] = ' kM';
>> int idx = 0;
>>
>> while (size > 1024) {
I would use (size >= 1024)
so that 1M is printed instead of 1024K.
>> size /= 1024;
>> idx++;
>> }
>>
>> sprintf("%d%c", size, suffices[idx]);
>
> suffixes
>>
>> --
>
>
--
~Randy
On Thu, May 24, 2018 at 11:40:59AM -0700, Randy Dunlap wrote:
> >> while (size > 1024) {
>
> I would use (size >= 1024)
> so that 1M is printed instead of 1024K.
Yes; that's what I meant to type. Thanks!
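Putting the corrections together, a minimal sketch of such a naming helper
could look like this (hypothetical function, untested, not part of this
series):

	static const char *kmalloc_rec_name(unsigned int size)
	{
		static const char suffixes[] = " kM";
		int idx = 0;

		while (size >= 1024) {
			size /= 1024;
			idx++;
		}

		if (idx)
			return kasprintf(GFP_NOWAIT, "kmalloc-rec-%u%c",
					 size, suffixes[idx]);
		return kasprintf(GFP_NOWAIT, "kmalloc-rec-%u", size);
	}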
On Thu, 24 May 2018, Vlastimil Babka wrote:
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 9ebe659bd4a5..5bff0571b360 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -296,11 +296,16 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
> (KMALLOC_MIN_SIZE) : 16)
>
> #ifndef CONFIG_SLOB
> -extern struct kmem_cache *kmalloc_caches[KMALLOC_SHIFT_HIGH + 1];
> +extern struct kmem_cache *kmalloc_caches[2][KMALLOC_SHIFT_HIGH + 1];
> #ifdef CONFIG_ZONE_DMA
> extern struct kmem_cache *kmalloc_dma_caches[KMALLOC_SHIFT_HIGH + 1];
> #endif
In the existing code we used a different array name for the DMA caches.
This is a similar situation.
I would suggest to use
kmalloc_reclaimable_caches[]
or make it consistent by folding the DMA caches into the array too (but
then note the issues below).
> @@ -536,12 +541,13 @@ static __always_inline void *kmalloc(size_t size, gfp_t flags)
> #ifndef CONFIG_SLOB
> if (!(flags & GFP_DMA)) {
> unsigned int index = kmalloc_index(size);
> + unsigned int recl = kmalloc_reclaimable(flags);
This is a hotpath reserved for regular allocations. The reclaimable slabs
need to be handled like the DMA slabs. So check for GFP_DMA plus the
reclaimable flags.
> @@ -588,12 +594,13 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
> if (__builtin_constant_p(size) &&
> size <= KMALLOC_MAX_CACHE_SIZE && !(flags & GFP_DMA)) {
> unsigned int i = kmalloc_index(size);
> + unsigned int recl = kmalloc_reclaimable(flags);
>
Same situation here and additional times below.
On Thu, 24 May 2018, Vlastimil Babka wrote:
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 32699b2dc52a..4343948f33e5 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -180,7 +180,7 @@ enum node_stat_item {
> NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */
> NR_DIRTIED, /* page dirtyings since bootup */
> NR_WRITTEN, /* page writings since bootup */
> - NR_INDIRECTLY_RECLAIMABLE_BYTES, /* measured in bytes */
> + NR_RECLAIMABLE, /* all reclaimable pages, including slab */
> NR_VM_NODE_STAT_ITEMS
We already have NR_SLAB_RECLAIMABLE and NR_RECLAIMABLE now counts what
NR_SLAB_RECLAIMABLE counts plus something else. This means updating
two counters in parallel.
Could we keep the existing counter and just account for those non-slab
things you mentioned? That avoids counting twice and may provide unique
insight into those non-slab reclaimable objects. I'd like
to see this.
On 05/24/2018 05:32 PM, Johannes Weiner wrote:
> On Thu, May 24, 2018 at 01:00:06PM +0200, Vlastimil Babka wrote:
>> - the vmstat/meminfo counter name is rather general and might suggest it also
>> includes reclaimable page caches, which it doesn't
>>
>> Suggestions welcome for all three points. For the last one, we might also keep
>> the counter separate from nr_slab_reclaimable, not superset. I did a superset
>> as IIRC somebody suggested that in the older threads or at LSF.
>
> Yeah, the "reclaimable" name is too generic. How about KReclaimable?
>
> The counter being a superset sounds good to me. We use this info for
> both load balancing and manual debugging. For load balancing code it's
> nice not having to worry about finding all the counters that hold
> reclaimable memory depending on kernel version; it's always simply
> user cache + user anon + kernel reclaimable. And for debugging, we can
> always add more specific subset counters later on if we need them.
Hm, Christoph in his reply to patch 4/5 expressed a different opinion.
It's true that updating two counters has extra overhead, especially if
there are two separate critical sections:
mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, nr_pages);
mod_node_page_state(page_pgdat(page), NR_RECLAIMABLE, nr_pages);
The first disables irq for CONFIG_MEMCG or defers to
mod_node_page_state() otherwise.
mod_node_page_state() is different depending on CONFIG_SMP and
CONFIG_HAVE_CMPXCHG_LOCAL.
I don't see an easy way to make this optimal. A separate counter would
indeed be simpler. /proc/vmstat would then print separate counters, but we
could have both the separate and the summary counter in /proc/meminfo. Would
that be enough?
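E.g. with a hypothetical separate counter (name made up here) for the
non-slab part, /proc/meminfo could still print a summary line:

	show_val_kb(m, "KReclaimable: ",
		    global_node_page_state(NR_SLAB_RECLAIMABLE) +
		    global_node_page_state(NR_MISC_RECLAIMABLE));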
On 05/25/2018 05:51 PM, Christopher Lameter wrote:
> On Thu, 24 May 2018, Vlastimil Babka wrote:
>
>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>> index 9ebe659bd4a5..5bff0571b360 100644
>> --- a/include/linux/slab.h
>> +++ b/include/linux/slab.h
>> @@ -296,11 +296,16 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
>> (KMALLOC_MIN_SIZE) : 16)
>>
>> #ifndef CONFIG_SLOB
>> -extern struct kmem_cache *kmalloc_caches[KMALLOC_SHIFT_HIGH + 1];
>> +extern struct kmem_cache *kmalloc_caches[2][KMALLOC_SHIFT_HIGH + 1];
>> #ifdef CONFIG_ZONE_DMA
>> extern struct kmem_cache *kmalloc_dma_caches[KMALLOC_SHIFT_HIGH + 1];
>> #endif
>
> In the existing code we used a different array name for the DMA caches.
> This is a similar situation.
>
> I would suggest to use
>
> kmalloc_reclaimable_caches[]
>
> or make it consistent by folding the DMA caches into the array too (but
> then note the issues below).
>
>> @@ -536,12 +541,13 @@ static __always_inline void *kmalloc(size_t size, gfp_t flags)
>> #ifndef CONFIG_SLOB
>> if (!(flags & GFP_DMA)) {
>> unsigned int index = kmalloc_index(size);
>> + unsigned int recl = kmalloc_reclaimable(flags);
>
> This is a hotpath reserved for regular allocations. The reclaimable slabs
> need to be handled like the DMA slabs. So check for GFP_DMA plus the
> reclaimable flags.
Yeah I thought that by doing reclaimable via array index manipulation
and not a branch, there would be no noticeable overhead. And GFP_DMA
should go away eventually. I will see if I can convert GFP_DMA to
another index, and completely remove the branch quoted above.
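A rough sketch of that idea (hypothetical helper, not part of this series):
the lookup would become kmalloc_caches[kmalloc_type(flags)][index], and
GFP_DMA would no longer need the separate !(flags & GFP_DMA) special case:

	static __always_inline unsigned int kmalloc_type(gfp_t flags)
	{
		/* hypothetical mapping: 0 = normal, 1 = reclaimable, 2 = DMA */
		if (IS_ENABLED(CONFIG_ZONE_DMA) && unlikely(flags & GFP_DMA))
			return 2;
		return !!(flags & __GFP_RECLAIMABLE);
	}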
>> @@ -588,12 +594,13 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
>> if (__builtin_constant_p(size) &&
>> size <= KMALLOC_MAX_CACHE_SIZE && !(flags & GFP_DMA)) {
>> unsigned int i = kmalloc_index(size);
>> + unsigned int recl = kmalloc_reclaimable(flags);
>>
>
>
> Same situation here and additional times below.
>
On Mon, May 28, 2018 at 10:15:46AM +0200, Vlastimil Babka wrote:
> On 05/24/2018 05:32 PM, Johannes Weiner wrote:
> > On Thu, May 24, 2018 at 01:00:06PM +0200, Vlastimil Babka wrote:
> >> - the vmstat/meminfo counter name is rather general and might suggest it also
> >> includes reclaimable page caches, which it doesn't
> >>
> >> Suggestions welcome for all three points. For the last one, we might also keep
> >> the counter separate from nr_slab_reclaimable, not superset. I did a superset
> >> as IIRC somebody suggested that in the older threads or at LSF.
> >
> > Yeah, the "reclaimable" name is too generic. How about KReclaimable?
> >
> > The counter being a superset sounds good to me. We use this info for
> > both load balancing and manual debugging. For load balancing code it's
> > nice not having to worry about finding all the counters that hold
> > reclaimable memory depending on kernel version; it's always simply
> > user cache + user anon + kernel reclaimable. And for debugging, we can
> > always add more specific subset counters later on if we need them.
>
> Hm, Christoph in his reply to patch 4/5 expressed a different opinion.
> It's true that updating two counters has extra overhead, especially if
> there are two separate critical sections:
>
> mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, nr_pages);
> mod_node_page_state(page_pgdat(page), NR_RECLAIMABLE, nr_pages);
>
> The first disables irq for CONFIG_MEMCG or defers to
> mod_node_page_state() otherwise.
> mod_node_page_state() is different depending on CONFIG_SMP and
> CONFIG_HAVE_CMPXCHG_LOCAL.
>
> I don't see an easy way to make this optimal? Different counter would be
> indeed simpler. /proc/vmstat would then print separate counters, but we
> could have both separate and summary counter in /proc/meminfo. Would
> that be enough?
Yeah, that works just as well.