Hello,
I have a workload that creates lots of pagecache pages. Before 4.18.15,
the behavior was very stable: the pagecache grows steadily until it
consumes all the free memory, and then kswapd balances it around the
low watermark. After 4.18.15, khugepaged wakes up once in a while and
reclaims almost all the pages from the pagecache, so around 2G of the
8G is always left unused. THP is enabled only for the madvise case and
is not used.

The exact change that leads to the current behavior is
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.18.y&id=62aad93f09c1952ede86405894df1b22012fd5ab
[add linux-mm mailing list + people]
On 10/20/18 4:41 AM, Spock wrote:
> Hello,
>
> I have a workload that creates lots of pagecache pages. Before 4.18.15,
> the behavior was very stable: the pagecache grows steadily until it
> consumes all the free memory, and then kswapd balances it around the
> low watermark. After 4.18.15, khugepaged wakes up once in a while and
> reclaims almost all the pages from the pagecache, so around 2G of the
> 8G is always left unused. THP is enabled only for the madvise case and
> is not used.
>
> The exact change that leads to the current behavior is
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.18.y&id=62aad93f09c1952ede86405894df1b22012fd5ab
>
--
~Randy
On Sat, Oct 20, 2018 at 08:37:28AM -0700, Randy Dunlap wrote:
> [add linux-mm mailing list + people]
>
>
> On 10/20/18 4:41 AM, Spock wrote:
> > Hello,
> >
> > I have a workload that creates lots of pagecache pages. Before 4.18.15,
> > the behavior was very stable: the pagecache grows steadily until it
> > consumes all the free memory, and then kswapd balances it around the
> > low watermark. After 4.18.15, khugepaged wakes up once in a while and
> > reclaims almost all the pages from the pagecache, so around 2G of the
> > 8G is always left unused. THP is enabled only for the madvise case and
> > is not used.
> >
> > The exact change that leads to the current behavior is
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.18.y&id=62aad93f09c1952ede86405894df1b22012fd5ab
> >
Hello!
Can you please describe your workload in more detail?
Do you use memory cgroups? How many of them? What's the ratio between slabs
and pagecache in the affected cgroup? Is the pagecache mmapped by some process?
Is the majority of the pagecache created by a few cached files, or is
the number of files large?
This is definitely a strange effect. The change shouldn't affect pagecache
reclaim directly, so the only possibility I see is that, because we started
applying some minimal pressure on slabs, we also started reclaiming some
internal fs structures under background memory pressure, which leads to
more aggressive pagecache reclaim.
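For reference, here is a standalone sketch (not kernel code; the simplified
formula and constants follow my reading of do_shrink_slab() and the commit
above, and the example numbers are illustrative assumptions) of why small
slabs went from effectively zero pressure to being scanned in full:

/* sketch: effect of the minimal-pressure floor from 172b06c32b94 */
#include <stdio.h>

#define SHRINK_BATCH 128	/* default shrinker batch size */
#define DEF_PRIORITY 12		/* default reclaim priority */

static unsigned long long scan_delta(unsigned long long freeable,
				     int priority, int seeks, int with_floor)
{
	unsigned long long delta = (freeable >> priority) * 4 / seeks;

	if (with_floor) {
		/* apply some minimal pressure even on small slabs */
		unsigned long long floor =
			freeable < SHRINK_BATCH ? freeable : SHRINK_BATCH;
		if (delta < floor)
			delta = floor;
	}
	return delta;
}

int main(void)
{
	/* a small cache, e.g. a few dozen unused inodes */
	unsigned long long freeable = 50;

	printf("before the change: scan %llu objects\n",
	       scan_delta(freeable, DEF_PRIORITY, 2, 0));
	printf("after the change:  scan %llu objects\n",
	       scan_delta(freeable, DEF_PRIORITY, 2, 1));
	return 0;
}

With 50 freeable objects at default priority, the old formula rounds down
to zero, while the floor asks the shrinker to scan all 50, so even light
background pressure now reaches small caches like the inode LRU.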
Thanks!
Cc some more people.
I am wondering why 172b06c32b94 ("mm: slowly shrink slabs with a
relatively small number of objects") has been backported to the stable
tree when it was not marked that way. That aside, I suspect the
upstream kernel will have the same issue. Roman, could you have a look
please?
On Sat 20-10-18 14:41:40, Spock wrote:
> Hello,
>
> I have a workload that creates lots of pagecache pages. Before 4.18.15,
> the behavior was very stable: the pagecache grows steadily until it
> consumes all the free memory, and then kswapd balances it around the
> low watermark. After 4.18.15, khugepaged wakes up once in a while and
> reclaims almost all the pages from the pagecache, so around 2G of the
> 8G is always left unused. THP is enabled only for the madvise case and
> is not used.
>
> The exact change that leads to the current behavior is
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.18.y&id=62aad93f09c1952ede86405894df1b22012fd5ab
--
Michal Hocko
SUSE Labs
On Mon, Oct 22, 2018 at 10:33:22AM +0200, Michal Hocko wrote:
> Cc some more people.
>
> I am wondering why 172b06c32b94 ("mm: slowly shrink slabs with a
> relatively small number of objects") has been backported to the stable
> tree when it was not marked that way. That aside, I suspect the
> upstream kernel will have the same issue. Roman, could you have a look
> please?
Sure, already looking... Spock provided some useful details, and I think
I know what's happening... Hope to propose a solution soon.
RE backporting: I'm slightly surprised that only one patch of the memcg
reclaim fix series has been backported. Either all or none would make
much more sense to me.
Thanks!
On Mon 22-10-18 15:08:22, Roman Gushchin wrote:
[...]
> RE backporting: I'm slightly surprised that only one patch of the memcg
> reclaim fix series has been backported. Either all or none would make
> much more sense to me.
Yeah, I think this is AUTOSEL trying to be clever again. I thought it had
been agreed that MM is quite good at marking patches for stable and so
it was not considered by the machinery. Sasha?
--
Michal Hocko
SUSE Labs
On Mon, Oct 22, 2018 at 10:33:22AM +0200, Michal Hocko wrote:
> Cc some more people.
>
> I am wondering why 172b06c32b94 ("mm: slowly shrink slabs with a
> relatively small number of objects") has been backported to the stable
> tree when it was not marked that way. That aside, I suspect the
> upstream kernel will have the same issue. Roman, could you have a look
> please?
So, the problem is probably caused by the unused inode eviction code:
inode_lru_isolate() invalidates all pages belonging to an unreferenced
clean inode at once, even if the goal was to scan (and potentially free)
just one inode (or any other slab object).
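For reference, here is the branch in question; a simplified paraphrase of
inode_lru_isolate() in fs/inode.c as of 4.18, with the locking and vmstat
accounting details trimmed (my reading, not a verbatim quote):

	/* an unreferenced inode with attached pagecache or buffers */
	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(lru_lock);
		if (remove_inode_buffers(inode))
			/*
			 * Drops every page of the mapping at once, not
			 * just the one object the shrinker asked for.
			 */
			invalidate_mapping_pages(&inode->i_data, 0, -1);
		iput(inode);
		spin_lock(lru_lock);
		return LRU_RETRY;
	}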
Spock's workload, as described, keeps a few large files in the pagecache,
so the effect becomes noticeable: a small amount of pressure applied to
the inode cache surprisingly results in cleaning up a significant
percentage of memory. This happened before my change too, but it was
probably less noticeable, because it usually required higher memory
pressure to trigger, so the overly aggressive reclaim was less unexpected.
How to fix this?
It seems to me that we shouldn't try to invalidate pagecache pages from
the inode reclaim path at all (except maybe for inodes with only a few
pages). If an inode has a lot of attached pagecache, let those pages be
evicted "naturally", through the file LRU lists.
But I need to do some real-life testing to see how this works in practice.
Thanks!
> On Sat 20-10-18 14:41:40, Spock wrote:
> > Hello,
> >
> > I have a workload that creates lots of pagecache pages. Before 4.18.15,
> > the behavior was very stable: the pagecache grows steadily until it
> > consumes all the free memory, and then kswapd balances it around the
> > low watermark. After 4.18.15, khugepaged wakes up once in a while and
> > reclaims almost all the pages from the pagecache, so around 2G of the
> > 8G is always left unused. THP is enabled only for the madvise case and
> > is not used.
Spock, can you please check whether the following patch solves the
problem for you? It rotates inodes with attached pagecache back onto
the LRU instead of invalidating their mappings.
Thank you!
--
diff --git a/fs/inode.c b/fs/inode.c
index 73432e64f874..63aca301a8bc 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -731,7 +731,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 	}
 
 	/* recently referenced inodes get one more pass */
-	if (inode->i_state & I_REFERENCED) {
+	if (inode->i_state & I_REFERENCED || inode->i_data.nrpages > 1) {
 		inode->i_state &= ~I_REFERENCED;
 		spin_unlock(&inode->i_lock);
 		return LRU_ROTATE;
On Mon, Oct 22, 2018 at 1:01 PM Michal Hocko <[email protected]> wrote:
>
> On Mon 22-10-18 15:08:22, Roman Gushchin wrote:
> [...]
> > RE backporting: I'm slightly surprised that only one patch of the memcg
> > reclaim fix series has been backported. Either all or none would make
> > much more sense to me.
>
> Yeah, I think this is AUTOSEL trying to be clever again. I thought it had
> been agreed that MM is quite good at marking patches for stable and so
> it was not considered by the machinery. Sasha?
I've talked about it briefly with Andrew, and he suggested that I send
him the list of AUTOSEL commits separately to avoid the noise, so we'll
try that and see what happens.
--
Thanks.
Sasha