Hi Andrew,
these patches address problems we've had in our fleet with excessive
shadow radix tree nodes and proc inodes.
Patch #1 is a fix for the same-named patch already queued up in the
-mm tree. The other three patches are stand-alone.
No need to use the preemption-safe lruvec state function inside the
reclaim region that has irqs disabled.
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/workingset.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index e5c70bc94077..f564aaa6b71d 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -493,7 +493,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
* shadow entries we were tracking ...
*/
xas_store(&xas, NULL);
- inc_lruvec_page_state(virt_to_page(node), WORKINGSET_NODERECLAIM);
+ __inc_lruvec_page_state(virt_to_page(node), WORKINGSET_NODERECLAIM);
out_invalid:
xa_unlock_irq(&mapping->i_pages);
--
2.19.0
Make it easier to catch bugs in the shadow node shrinker by adding a
counter for the shadow nodes in circulation.
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/vmstat.c | 1 +
mm/workingset.c | 12 ++++++++++--
3 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4179e67add3d..d82e80d82aa6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -161,6 +161,7 @@ enum node_stat_item {
NR_SLAB_UNRECLAIMABLE,
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
+ WORKINGSET_NODES,
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
WORKINGSET_RESTORE,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d08ed044759d..6038ce593ce3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1143,6 +1143,7 @@ const char * const vmstat_text[] = {
"nr_slab_unreclaimable",
"nr_isolated_anon",
"nr_isolated_file",
+ "workingset_nodes",
"workingset_refault",
"workingset_activate",
"workingset_restore",
diff --git a/mm/workingset.c b/mm/workingset.c
index f564aaa6b71d..cfdf6adf7e7c 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -378,11 +378,17 @@ void workingset_update_node(struct xa_node *node)
* as node->private_list is protected by the i_pages lock.
*/
if (node->count && node->count == node->nr_values) {
- if (list_empty(&node->private_list))
+ if (list_empty(&node->private_list)) {
list_lru_add(&shadow_nodes, &node->private_list);
+ __inc_lruvec_page_state(virt_to_page(node),
+ WORKINGSET_NODES);
+ }
} else {
- if (!list_empty(&node->private_list))
+ if (!list_empty(&node->private_list)) {
list_lru_del(&shadow_nodes, &node->private_list);
+ __dec_lruvec_page_state(virt_to_page(node),
+ WORKINGSET_NODES);
+ }
}
}
@@ -472,6 +478,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
}
list_lru_isolate(lru, item);
+ __dec_lruvec_page_state(virt_to_page(node), WORKINGSET_NODES);
+
spin_unlock(lru_lock);
/*
--
2.19.0
The page cache and most shrinkable slab caches hold data that has been
read from disk, but there are some caches that only cache CPU work,
such as the dentry and inode caches of procfs and sysfs, as well as
the subset of radix tree nodes that track non-resident page cache.
Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS
for the shrinker's seeks setting tells the reclaim algorithm that for
every two page cache pages scanned it should scan one slab object.
This is a bogus setting. A virtual inode that required no IO to create
is not twice as valuable as a page cache page; shadow cache entries
with eviction distances beyond the size of memory aren't either.
In most cases, the behavior in practice is still fine. Such virtual
caches don't tend to grow and assert themselves aggressively, and
usually get picked up before they cause problems. But there are
scenarios where that's not true.
Our database workloads suffer from two of those. For one, their file
workingset is several times bigger than available memory, which has
the kernel aggressively create shadow page cache entries for the
non-resident parts of it. The workingset code does tell the VM that
most of these are expendable, but the VM ends up balancing them 2:1 to
cache pages as per the seeks setting. This is a huge waste of memory.
These workloads also deal with tens of thousands of open files and use
/proc for introspection, which ends up growing the proc_inode_cache to
absurdly large sizes - again at the cost of valuable cache space,
which isn't a reasonable trade-off, given that proc inodes can be
re-created without involving the disk.
This patch implements a "zero-seek" setting for shrinkers that results
in a target ratio of 0:1 between their objects and IO-backed
caches. This allows such virtual caches to grow when memory is
available (they do cache/avoid CPU work after all), but effectively
disables them as soon as IO-backed objects are under pressure.
It then switches the shrinkers for procfs and sysfs metadata, as well
as excess page cache shadow nodes, to the new zero-seek setting.
Reported-by: Domas Mituzas <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
fs/kernfs/mount.c | 3 +++
fs/proc/root.c | 3 +++
mm/vmscan.c | 15 ++++++++++++---
mm/workingset.c | 2 +-
4 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 1bd43f6947f3..7d56b624e0dc 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -251,6 +251,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
sb->s_export_op = &kernfs_export_ops;
sb->s_time_gran = 1;
+ /* sysfs dentries and inodes don't require IO to create */
+ sb->s_shrink.seeks = 0;
+
/* get root inode, initialize and unlock it */
mutex_lock(&kernfs_mutex);
inode = kernfs_get_inode(sb, info->root->kn);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 8912a8b57ac3..74975ca77b71 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -127,6 +127,9 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
*/
s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
+ /* procfs dentries and inodes don't require IO to create */
+ s->s_shrink.seeks = 0;
+
pde_get(&proc_root);
root_inode = proc_get_inode(s, &proc_root);
if (!root_inode) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a859f64a2166..62ac0c488624 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -474,9 +474,18 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
total_scan = nr;
- delta = freeable >> priority;
- delta *= 4;
- do_div(delta, shrinker->seeks);
+ if (shrinker->seeks) {
+ delta = freeable >> priority;
+ delta *= 4;
+ do_div(delta, shrinker->seeks);
+ } else {
+ /*
+ * These objects don't require any IO to create. Trim
+ * them aggressively under memory pressure to keep
+ * them from causing refetches in the IO caches.
+ */
+ delta = freeable / 2;
+ }
/*
* Make sure we apply some minimal pressure on default priority
diff --git a/mm/workingset.c b/mm/workingset.c
index cfdf6adf7e7c..97523c4d3496 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -523,7 +523,7 @@ static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
static struct shrinker workingset_shadow_shrinker = {
.count_objects = count_shadow_nodes,
.scan_objects = scan_shadow_nodes,
- .seeks = DEFAULT_SEEKS,
+ .seeks = 0, /* ->count reports only fully expendable nodes */
.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
};
--
2.19.0
The shadow shrinker is invoked per NUMA node, but the shadow limit
enforced for cgroups is based on the page counter, which isn't NUMA
aware. Instead of shrinking shadow pages to desired_size, we end up
with desired_size * nr_online_nodes.
Switch to NUMA-aware lru and slab counters to approximate cgroup size.
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/workingset.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index 1d111913929d..e5c70bc94077 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -418,9 +418,15 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
* PAGE_SIZE / xa_nodes / node_entries * 8 / PAGE_SIZE
*/
#ifdef CONFIG_MEMCG
- if (sc->memcg)
- pages = page_counter_read(&sc->memcg->memory);
- else
+ if (sc->memcg) {
+ struct lruvec *lruvec;
+
+ pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
+ LRU_ALL);
+ lruvec = mem_cgroup_lruvec(NODE_DATA(sc->nid), sc->memcg);
+ pages += lruvec_page_state(lruvec, NR_SLAB_RECLAIMABLE);
+ pages += lruvec_page_state(lruvec, NR_SLAB_UNRECLAIMABLE);
+ } else
#endif
pages = node_present_pages(sc->nid);
--
2.19.0
On Tue, 9 Oct 2018 14:47:32 -0400 Johannes Weiner <[email protected]> wrote:
> Make it easier to catch bugs in the shadow node shrinker by adding a
> counter for the shadow nodes in circulation.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/mmzone.h | 1 +
> mm/vmstat.c | 1 +
> mm/workingset.c | 12 ++++++++++--
> 3 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4179e67add3d..d82e80d82aa6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -161,6 +161,7 @@ enum node_stat_item {
> NR_SLAB_UNRECLAIMABLE,
> NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
> NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
> + WORKINGSET_NODES,
Documentation/admin-guide/cgroup-v2.rst, please. And please check for
any other missing items while in there?
On Tue, 9 Oct 2018 14:47:32 -0400 Johannes Weiner <[email protected]> wrote:
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -378,11 +378,17 @@ void workingset_update_node(struct xa_node *node)
> * as node->private_list is protected by the i_pages lock.
> */
> if (node->count && node->count == node->nr_values) {
> - if (list_empty(&node->private_list))
> + if (list_empty(&node->private_list)) {
> list_lru_add(&shadow_nodes, &node->private_list);
> + __inc_lruvec_page_state(virt_to_page(node),
> + WORKINGSET_NODES);
> + }
> } else {
> - if (!list_empty(&node->private_list))
> + if (!list_empty(&node->private_list)) {
> list_lru_del(&shadow_nodes, &node->private_list);
> + __dec_lruvec_page_state(virt_to_page(node),
> + WORKINGSET_NODES);
> + }
> }
> }
A bit worried that we're depending on the caller's caller to have
disabled interrupts to avoid subtle and rare errors.
Can we do this?
--- a/mm/workingset.c~mm-workingset-add-vmstat-counter-for-shadow-nodes-fix
+++ a/mm/workingset.c
@@ -377,6 +377,8 @@ void workingset_update_node(struct radix
* already where they should be. The list_empty() test is safe
* as node->private_list is protected by the i_pages lock.
*/
+ WARN_ON_ONCE(!irqs_disabled()); /* For __inc_lruvec_page_state */
+
if (node->count && node->count == node->exceptional) {
if (list_empty(&node->private_list)) {
list_lru_add(&shadow_nodes, &node->private_list);
_
On Tue, 9 Oct 2018 15:15:56 -0700 Andrew Morton <[email protected]> wrote:
> Seems sane, but I'm somewhat worried about unexpected effects on other
> workloads. So I think I'll hold this over for 4.20. Or shouldn't I?
Meant 4.21. But on reflection this is perhaps excessively cautious.
On Tue, 9 Oct 2018 14:47:33 -0400 Johannes Weiner <[email protected]> wrote:
> The page cache and most shrinkable slab caches hold data that has been
> read from disk, but there are some caches that only cache CPU work,
> such as the dentry and inode caches of procfs and sysfs, as well as
> the subset of radix tree nodes that track non-resident page cache.
>
> Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS
> for the shrinker's seeks setting tells the reclaim algorithm that for
> every two page cache pages scanned it should scan one slab object.
>
> This is a bogus setting. A virtual inode that required no IO to create
> is not twice as valuable as a page cache page; shadow cache entries
> with eviction distances beyond the size of memory aren't either.
>
> In most cases, the behavior in practice is still fine. Such virtual
> caches don't tend to grow and assert themselves aggressively, and
> usually get picked up before they cause problems. But there are
> scenarios where that's not true.
>
> Our database workloads suffer from two of those. For one, their file
> workingset is several times bigger than available memory, which has
> the kernel aggressively create shadow page cache entries for the
> non-resident parts of it. The workingset code does tell the VM that
> most of these are expendable, but the VM ends up balancing them 2:1 to
> cache pages as per the seeks setting. This is a huge waste of memory.
>
> These workloads also deal with tens of thousands of open files and use
> /proc for introspection, which ends up growing the proc_inode_cache to
> absurdly large sizes - again at the cost of valuable cache space,
> which isn't a reasonable trade-off, given that proc inodes can be
> re-created without involving the disk.
>
> This patch implements a "zero-seek" setting for shrinkers that results
> in a target ratio of 0:1 between their objects and IO-backed
> caches. This allows such virtual caches to grow when memory is
> available (they do cache/avoid CPU work after all), but effectively
> disables them as soon as IO-backed objects are under pressure.
>
> It then switches the shrinkers for procfs and sysfs metadata, as well
> as excess page cache shadow nodes, to the new zero-seek setting.
>
Seems sane, but I'm somewhat worried about unexpected effects on other
workloads. So I think I'll hold this over for 4.20. Or shouldn't I?
On Tue, 2018-10-09 at 14:47 -0400, Johannes Weiner wrote:
> The shadow shrinker is invoked per NUMA node, but the shadow limit
> enforced for cgroups is based on the page counter, which isn't NUMA
> aware. Instead of shrinking shadow pages to desired_size, we end up
> with desired_size * nr_online_nodes.
>
> Switch to NUMA-aware lru and slab counters to approximate cgroup
> size.
>
> Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
--
All Rights Reversed.
On Tue, 2018-10-09 at 14:47 -0400, Johannes Weiner wrote:
> No need to use the preemption-safe lruvec state function inside the
> reclaim region that has irqs disabled.
>
> Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
On Tue, 2018-10-09 at 14:47 -0400, Johannes Weiner wrote:
> These workloads also deal with tens of thousands of open files and
> use
> /proc for introspection, which ends up growing the proc_inode_cache
> to
> absurdly large sizes - again at the cost of valuable cache space,
> which isn't a reasonable trade-off, given that proc inodes can be
> re-created without involving the disk.
>
> This patch implements a "zero-seek" setting for shrinkers that
> results
> in a target ratio of 0:1 between their objects and IO-backed
> caches. This allows such virtual caches to grow when memory is
> available (they do cache/avoid CPU work after all), but effectively
> disables them as soon as IO-backed objects are under pressure.
>
> It then switches the shrinkers for procfs and sysfs metadata, as well
> as excess page cache shadow nodes, to the new zero-seek setting.
This patch looks like a great step in the right
direction, though I do not know whether it is
aggressive enough.
Given that internal slab fragmentation will
prevent the slab cache from returning a slab to
the VM if just one object in that slab is still
in use, there may well be workloads where we
should just put a hard cap on the number of
freeable items these slabs, and reclaim them
preemptively.
However, I do not know for sure, and this patch
seems like a big improvement over what we had
before, so ...
> Reported-by: Domas Mituzas <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
On Tue, Oct 09, 2018 at 03:04:01PM -0700, Andrew Morton wrote:
> On Tue, 9 Oct 2018 14:47:32 -0400 Johannes Weiner <[email protected]> wrote:
>
> > Make it easier to catch bugs in the shadow node shrinker by adding a
> > counter for the shadow nodes in circulation.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > include/linux/mmzone.h | 1 +
> > mm/vmstat.c | 1 +
> > mm/workingset.c | 12 ++++++++++--
> > 3 files changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 4179e67add3d..d82e80d82aa6 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -161,6 +161,7 @@ enum node_stat_item {
> > NR_SLAB_UNRECLAIMABLE,
> > NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
> > NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
> > + WORKINGSET_NODES,
>
> Documentation/admin-guide/cgroup-v2.rst, please. And please check for
> any other missing items while in there?
The new counter isn't being added to the per-cgroup memory.stat,
actually, it just shows in /proc/vmstat.
It seemed a bit too low-level for the cgroup interface, and the other
stats in there are in bytes, which isn't straight-forward to calculate
with sl*b packing.
Not that I'm against adding a cgroup breakdown in general, but the
global counter was enough to see if things were working right or not,
so I'd cross that bridge when somebody needs it per cgroup.
But I checked cgroup-v2.rst anyway: all the exported items are
documented. Only the reclaim vs. refault stats were in different
orders: the doc has the refault stats first, the interface leads with
the reclaim stats. The refault stats go better with the page fault
stats, and are probably of more interest (since they have higher
impact on performance) than the LRU shuffling, so maybe this?
---
Subject: [PATCH] mm: memcontrol: fix memory.stat item ordering
The refault stats go better with the page fault stats, and are of
higher interest than the stats on LRU operations. In fact they used to
be grouped together; when the LRU operation stats were added later on,
they were wedged in between.
Move them back together. Documentation/admin-guide/cgroup-v2.rst
already lists them in the right order.
Signed-off-by: Johannes Weiner <[email protected]>
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 81b47d0b14d7..ed15f233d31d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5575,6 +5575,13 @@ static int memory_stat_show(struct seq_file *m, void *v)
seq_printf(m, "pgfault %lu\n", acc.events[PGFAULT]);
seq_printf(m, "pgmajfault %lu\n", acc.events[PGMAJFAULT]);
+ seq_printf(m, "workingset_refault %lu\n",
+ acc.stat[WORKINGSET_REFAULT]);
+ seq_printf(m, "workingset_activate %lu\n",
+ acc.stat[WORKINGSET_ACTIVATE]);
+ seq_printf(m, "workingset_nodereclaim %lu\n",
+ acc.stat[WORKINGSET_NODERECLAIM]);
+
seq_printf(m, "pgrefill %lu\n", acc.events[PGREFILL]);
seq_printf(m, "pgscan %lu\n", acc.events[PGSCAN_KSWAPD] +
acc.events[PGSCAN_DIRECT]);
@@ -5585,13 +5592,6 @@ static int memory_stat_show(struct seq_file *m, void *v)
seq_printf(m, "pglazyfree %lu\n", acc.events[PGLAZYFREE]);
seq_printf(m, "pglazyfreed %lu\n", acc.events[PGLAZYFREED]);
- seq_printf(m, "workingset_refault %lu\n",
- acc.stat[WORKINGSET_REFAULT]);
- seq_printf(m, "workingset_activate %lu\n",
- acc.stat[WORKINGSET_ACTIVATE]);
- seq_printf(m, "workingset_nodereclaim %lu\n",
- acc.stat[WORKINGSET_NODERECLAIM]);
-
return 0;
}
On Tue, Oct 09, 2018 at 03:08:45PM -0700, Andrew Morton wrote:
> On Tue, 9 Oct 2018 14:47:32 -0400 Johannes Weiner <[email protected]> wrote:
>
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -378,11 +378,17 @@ void workingset_update_node(struct xa_node *node)
> > * as node->private_list is protected by the i_pages lock.
> > */
> > if (node->count && node->count == node->nr_values) {
> > - if (list_empty(&node->private_list))
> > + if (list_empty(&node->private_list)) {
> > list_lru_add(&shadow_nodes, &node->private_list);
> > + __inc_lruvec_page_state(virt_to_page(node),
> > + WORKINGSET_NODES);
> > + }
> > } else {
> > - if (!list_empty(&node->private_list))
> > + if (!list_empty(&node->private_list)) {
> > list_lru_del(&shadow_nodes, &node->private_list);
> > + __dec_lruvec_page_state(virt_to_page(node),
> > + WORKINGSET_NODES);
> > + }
> > }
> > }
>
> A bit worried that we're depending on the caller's caller to have
> disabled interrupts to avoid subtle and rare errors.
>
> Can we do this?
I'm not opposed to it, but the i_pages lock is guaranteed to be held
during the tree update, and that lock is also taken from the io
completion irq to maintain the tree's dirty/writeback state. It seems
like a robust assumption that interrupts will be disabled here.
But all that isn't very obvious from the code at hand, so I wouldn't
mind adding the check for documentation purposes.
It's not a super hot path, but maybe VM_WARN_ON_ONCE()?
> --- a/mm/workingset.c~mm-workingset-add-vmstat-counter-for-shadow-nodes-fix
> +++ a/mm/workingset.c
> @@ -377,6 +377,8 @@ void workingset_update_node(struct radix
> * already where they should be. The list_empty() test is safe
> * as node->private_list is protected by the i_pages lock.
> */
> + WARN_ON_ONCE(!irqs_disabled()); /* For __inc_lruvec_page_state */
> +
> if (node->count && node->count == node->exceptional) {
> if (list_empty(&node->private_list)) {
> list_lru_add(&shadow_nodes, &node->private_list);
> _
On Wed, Oct 10, 2018 at 01:03:50AM +0000, Rik van Riel wrote:
> On Tue, 2018-10-09 at 14:47 -0400, Johannes Weiner wrote:
>
> > These workloads also deal with tens of thousands of open files and
> > use
> > /proc for introspection, which ends up growing the proc_inode_cache
> > to
> > absurdly large sizes - again at the cost of valuable cache space,
> > which isn't a reasonable trade-off, given that proc inodes can be
> > re-created without involving the disk.
> >
> > This patch implements a "zero-seek" setting for shrinkers that
> > results
> > in a target ratio of 0:1 between their objects and IO-backed
> > caches. This allows such virtual caches to grow when memory is
> > available (they do cache/avoid CPU work after all), but effectively
> > disables them as soon as IO-backed objects are under pressure.
> >
> > It then switches the shrinkers for procfs and sysfs metadata, as well
> > as excess page cache shadow nodes, to the new zero-seek setting.
>
> This patch looks like a great step in the right
> direction, though I do not know whether it is
> aggressive enough.
>
> Given that internal slab fragmentation will
> prevent the slab cache from returning a slab to
> the VM if just one object in that slab is still
> in use, there may well be workloads where we
> should just put a hard cap on the number of
> freeable items these slabs, and reclaim them
> preemptively.
>
> However, I do not know for sure, and this patch
> seems like a big improvement over what we had
> before, so ...
Fully agreed, fragmentation is still a concern. I'm still working on
that part, but artificial caps and pro-active reclaim are trickier to
get right than prioritization, and since these patches here are useful
on their own I didn't want to hold them back.
> > Reported-by: Domas Mituzas <[email protected]>
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Reviewed-by: Rik van Riel <[email protected]>
Thanks!
On 10/9/18 8:47 PM, Johannes Weiner wrote:
> The page cache and most shrinkable slab caches hold data that has been
> read from disk, but there are some caches that only cache CPU work,
> such as the dentry and inode caches of procfs and sysfs, as well as
> the subset of radix tree nodes that track non-resident page cache.
>
> Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS
> for the shrinker's seeks setting tells the reclaim algorithm that for
> every two page cache pages scanned it should scan one slab object.
>
> This is a bogus setting. A virtual inode that required no IO to create
> is not twice as valuable as a page cache page; shadow cache entries
> with eviction distances beyond the size of memory aren't either.
>
> In most cases, the behavior in practice is still fine. Such virtual
> caches don't tend to grow and assert themselves aggressively, and
> usually get picked up before they cause problems. But there are
> scenarios where that's not true.
>
> Our database workloads suffer from two of those. For one, their file
> workingset is several times bigger than available memory, which has
> the kernel aggressively create shadow page cache entries for the
> non-resident parts of it. The workingset code does tell the VM that
> most of these are expendable, but the VM ends up balancing them 2:1 to
> cache pages as per the seeks setting. This is a huge waste of memory.
>
> These workloads also deal with tens of thousands of open files and use
> /proc for introspection, which ends up growing the proc_inode_cache to
> absurdly large sizes - again at the cost of valuable cache space,
> which isn't a reasonable trade-off, given that proc inodes can be
> re-created without involving the disk.
>
> This patch implements a "zero-seek" setting for shrinkers that results
> in a target ratio of 0:1 between their objects and IO-backed
> caches. This allows such virtual caches to grow when memory is
> available (they do cache/avoid CPU work after all), but effectively
> disables them as soon as IO-backed objects are under pressure.
>
> It then switches the shrinkers for procfs and sysfs metadata, as well
> as excess page cache shadow nodes, to the new zero-seek setting.
AFAIU procfs and sysfs metadata have exclusive slab caches, while the
shadow nodes share 'radix_tree_node' cache with non-shadow ones, right?
To avoid fragmentation, it should be better if they had also separate
cache, since their lifetime becomes different. In case that's feasible
(are non-shadow nodes changing to shadow nodes and vice versa? I guess
they do? That would require reallocation in the other cache.).
Vlastimil
On Tue, Oct 09, 2018 at 03:08:45PM -0700, Andrew Morton wrote:
> On Tue, 9 Oct 2018 14:47:32 -0400 Johannes Weiner <[email protected]> wrote:
>
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -378,11 +378,17 @@ void workingset_update_node(struct xa_node *node)
> > * as node->private_list is protected by the i_pages lock.
> > */
> > if (node->count && node->count == node->nr_values) {
> > - if (list_empty(&node->private_list))
> > + if (list_empty(&node->private_list)) {
> > list_lru_add(&shadow_nodes, &node->private_list);
> > + __inc_lruvec_page_state(virt_to_page(node),
> > + WORKINGSET_NODES);
> > + }
> > } else {
> > - if (!list_empty(&node->private_list))
> > + if (!list_empty(&node->private_list)) {
> > list_lru_del(&shadow_nodes, &node->private_list);
> > + __dec_lruvec_page_state(virt_to_page(node),
> > + WORKINGSET_NODES);
> > + }
> > }
> > }
>
> A bit worried that we're depending on the caller's caller to have
> disabled interrupts to avoid subtle and rare errors.
>
> Can we do this?
>
> --- a/mm/workingset.c~mm-workingset-add-vmstat-counter-for-shadow-nodes-fix
> +++ a/mm/workingset.c
> @@ -377,6 +377,8 @@ void workingset_update_node(struct radix
> * already where they should be. The list_empty() test is safe
> * as node->private_list is protected by the i_pages lock.
> */
> + WARN_ON_ONCE(!irqs_disabled()); /* For __inc_lruvec_page_state */
> +
> if (node->count && node->count == node->exceptional) {
> if (list_empty(&node->private_list)) {
> list_lru_add(&shadow_nodes, &node->private_list);
Note that for whatever reason, I've observed that irqs_disabled() is
actually quite an expensive call. I'm not saying the warning is a bad
idea but it should not be sprinkled around unnecessary and may be more
suitable as a debug option.
--
Mel Gorman
SUSE Labs
On Tue, 16 Oct 2018 09:49:23 +0100 Mel Gorman <[email protected]> wrote:
> > Can we do this?
> >
> > --- a/mm/workingset.c~mm-workingset-add-vmstat-counter-for-shadow-nodes-fix
> > +++ a/mm/workingset.c
> > @@ -377,6 +377,8 @@ void workingset_update_node(struct radix
> > * already where they should be. The list_empty() test is safe
> > * as node->private_list is protected by the i_pages lock.
> > */
> > + WARN_ON_ONCE(!irqs_disabled()); /* For __inc_lruvec_page_state */
> > +
> > if (node->count && node->count == node->exceptional) {
> > if (list_empty(&node->private_list)) {
> > list_lru_add(&shadow_nodes, &node->private_list);
>
> Note that for whatever reason, I've observed that irqs_disabled() is
> actually quite an expensive call. I'm not saying the warning is a bad
> idea but it should not be sprinkled around unnecessary and may be more
> suitable as a debug option.
Yup, it is now VM_WARN_ON_ONCE().