Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67;
Date:   Wed, 26 Feb 2020 10:05:34 +0100
From:   Michal Hocko <mhocko@kernel.org>
To:     Sultan Alsawaf <sultan@kerneltoast.com>
Cc:     Mel Gorman <mgorman@suse.de>, Dave Hansen <dave.hansen@intel.com>,
        Andrew Morton <akpm@linux-foundation.org>, linux-mm@kvack.org,
        linux-kernel@vger.kernel.org, Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCH] mm: Stop kswapd early when nothing's waiting for it to
 free pages
Message-ID: <20200226090534.GB3771@dhcp22.suse.cz>
References: <20200219200527.GF11847@dhcp22.suse.cz>
 <20200219204220.GA3488@sultan-book.localdomain>
 <20200219214513.GL3420@suse.de>
 <20200219224231.GA5190@sultan-book.localdomain>
 <20200220101945.GN3420@suse.de>
 <20200221042232.GA2197@sultan-book.localdomain>
 <20200221080737.GK20509@dhcp22.suse.cz>
 <20200221210824.GA3605@sultan-book.localdomain>
 <20200225090945.GJ22443@dhcp22.suse.cz>
 <20200225171242.GA496421@sultan-box.localdomain>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20200225171242.GA496421@sultan-box.localdomain>
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

On Tue 25-02-20 09:12:42, Sultan Alsawaf wrote:
> On Tue, Feb 25, 2020 at 10:09:45AM +0100, Michal Hocko wrote:
> > On Fri 21-02-20 13:08:24, Sultan Alsawaf wrote:
> > [...]
> > > Both of these logs are attached in a tarball.
> > 
> > Thanks! First of all
> > $ grep pswp vmstat.1582318979
> > pswpin 0
> > pswpout 0
> > 
> > suggests that you do not have any swap storage, right?
> 
> Correct. I'm not using any swap (and it should not be necessary to make Linux mm
> work of course).

In your particular situation you simply overcommit the memory way too
much AFAICS. Look at the vmstat data
vmstat.1582318758
nr_free_pages 16292
nr_page_table_pages 2822
nr_inactive_anon 8171
nr_active_anon 134168
nr_inactive_file 729
nr_active_file 508
nr_unevictable 51984
nr_slab_reclaimable 10919
nr_slab_unreclaimable 19766

This is roughly memory that we have clear accounting for (well except
for hugetlb pages but you do not seem to be using those). This is
82% of the memory. The rest is used by different kernel subsystems.
Of that only 47MB is reclaimable without invoking the OOM killer (if we
include nr_slab_reclaimable which might be quite hard to reclaim
effectively). I can imagine that could work only if the page cache
footprint was negligible but that doesn't seem to be the case here.
If you add swap storage then the math is completely different so yes the
swap is likely going to make a difference here.

> If I were to divide my RAM in half and use one half as swap,
> do you think the results would be different? IMO they shouldn't be.

That really depends on what is the swap backed memory footprint.

> > The amount of anonymous memory is not really high (~560MB) but file LRU
> > is _really_ low (~3MB), unevictable list is at ~200MB. That gets us to
> > ~760M of memory which is 74% of the memory. Please note that your mem=2G
> > setup gives you only 1G of memory in fact (based on the zone_info you
> > have posted). That is not something unusual but the amount of the page
> > cache is worrying because I would expect a heavy trashing because most
> > of the executables are going to require major faults. Anonymous memory
> > is not swapped out obviously so there is no other option than to refault
> > constantly.
> 
> I noticed that only 1G was available as well.

This is because of your physical memory layout. mem=2G doesn't restrict
the amount of the memory to 2G but rather the top physical address to
2G. If you have holes in the memory layout you are likely to get much
less memory.

> Perhaps direct reclaim wasn't
> attempted due to the zone_reclaimable_pages() check, though I don't think direct
> reclaim would've been particularly helpful in this case (see below).

there are reclaimable pages so zone_reclaimable_pages shouldn't really
stop DR for the zone Normal.

> > kswapd has some feedback mechanism to back off when the zone is hopless
> > from the reclaim point of view AFAIR but it seems it has failed in this
> > particular situation. It should have relied on the direct reclaim and
> > eventually trigger the OOM killer. Your patch has worked around this by
> > bailing out from the kswapd reclaim too early so a part of the page
> > cache required for the code to move on would stay resident and move
> > further.
> > 
> > The proper fix should, however, check the amount of reclaimable pages
> > and back off if they cannot meet the target IMO. We cannot rely on the
> > general reclaimability here because that could really be thrashing.
> 
> Yes, my guess was that thrashing out pages used by the running programs was the
> cause for my freezes, but I didn't think of making kswapd back off a different
> way.
> 
> Right now I don't see any such back-off mechanism in kswapd.

There is pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES check in prepare_kswapd_sleep
but this is not helping really, quite contrary, because kswapd is able
to reclaim page cache. The problem is that we even try to reclaim that
page cache I believe. If we have a signal that all the reclaimed memory
is essentially refaulted then we should backoff.

A simpler and probably much more subtle solution would be to back off
kswapd when zone_reclaimable_pages() < high_wmark_pages. This would push
out the work to the direct reclaim which itself is changing the overall
timing as the reclaim would be more bound to the memory demand.

> Also, if we add
> this into kswapd, we would need to plug it into the direct reclaim path as well,
> no? I don't think direct reclaim would help with the situation I've run into;
> although it wouldn't be as bad as letting kswapd evict pages to the high
> watermark, it would still cause page thrashing that would just be capped to the
> amount of pages a direct reclaimer is looking to steal.

And more importantly it would be bound to the allocating context so the
feedback mechanism would be more bound to the workload.

> Considering that my patch remedies this issue for me without invoking the OOM
> killer, a proper solution should produce the same or better results. I don't
> think the OOM killer should have been triggered in this case.

Your system is likely struggling already, it is just less visible
because kswapd reclaim is essentially faster than paging in so a single
access might trigger several refaults.
-- 
Michal Hocko
SUSE Labs