From:   Yang Shi <shy828301@gmail.com>
To:     vdavydov.dev@gmail.com, ktkhai@virtuozzo.com, guro@fb.com,
        shakeelb@google.com, akpm@linux-foundation.org
Cc:     shy828301@gmail.com, linux-mm@kvack.org,
        linux-kernel@vger.kernel.org
Subject: [PATCH] mm: list_lru: hold nlru lock to avoid reading transient negative nr_items
Date:   Mon, 30 Nov 2020 10:45:14 -0800
Message-Id: <20201130184514.551950-1-shy828301@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

When investigating a slab cache bloat problem, a significant amount of
negative dentry cache was seen, but confusingly it was neither shrunk by
the reclaimer (the host was under very tight memory) nor by dropping
caches.  The vmcore shows there are over 14M negative dentry objects on
the lru, but tracing shows they were not even scanned.  Further
investigation shows the memcg's vfs shrinker_map bit is not set, so
neither the reclaimer nor dropping caches ever calls the vfs shrinker.
We had to reboot the hosts to get the memory back.
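
To illustrate why a cleared bit hides the objects from reclaim, here is a
minimal userspace sketch of a map-gated shrinker walk.  It is a model
only, not the kernel's code: struct model_shrinker and
model_shrink_memcg() are hypothetical names invented for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not the kernel's structures. */
struct model_shrinker {
    unsigned long nr_objects;   /* objects this shrinker could free */
    unsigned long nr_calls;     /* how often the shrinker was invoked */
};

/*
 * Walk the per-memcg map and call only the shrinkers whose bit is set.
 * A shrinker with a cleared bit is skipped no matter how many objects
 * it holds.
 */
static unsigned long model_shrink_memcg(unsigned long map,
                                        struct model_shrinker *s, size_t n)
{
    unsigned long freed = 0;
    size_t i;

    assert(n <= sizeof(map) * CHAR_BIT);
    for (i = 0; i < n; i++) {
        if (!(map & (1UL << i)))
            continue;           /* bit clear: never scanned */
        s[i].nr_calls++;
        freed += s[i].nr_objects;
        s[i].nr_objects = 0;
    }
    return freed;
}
```

In this model, calling model_shrink_memcg() with a map of 0 frees nothing
and never invokes the shrinker, even if it holds millions of objects.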

I didn't manage to come up with a reproducer in a test environment, and
the problem can't be reproduced after rebooting.  But code inspection
suggests there is a race between clearing the shrinker map bit and
reparenting.  The hypothesis is elaborated below.

The memcg hierarchy in our production environment looks like:
                root
               /    \
          system   user

The main workloads run under the user slice's children, which create and
remove memcgs frequently.  So reparenting happens very often under the
user slice, but no task is under the user slice directly.

So with frequent reparenting and tight memory pressure, the hypothetical
race condition below may happen:

    CPU A                            CPU B                         CPU C
reparent
    dst->nr_items == 0
                                 shrinker:
                                     total_objects == 0
    add src->nr_items to dst
    set_bit
                                     return SHRINK_EMPTY
                                     clear_bit
                                                                  list_lru_del()
reparent again
    dst->nr_items may go negative
    due to concurrent list_lru_del()
    on CPU C
                                 The second run of shrinker:
                                     read nr_items without any
                                     synchronization, so it may
                                     see intermediate negative
                                     nr_items then total_objects
                                     may return 0 coincidentally

                                     keep the bit cleared
    dst->nr_items != 0
    skip set_bit
    add src->nr_items to dst

After this point dst->nr_items may never reach zero, so reparenting will
not set the shrinker_map bit anymore.  And since there is no task under
the user slice directly, no new object will be added to its lru to set
the shrinker map bit either.  The bit stays cleared forever.
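
The interleaving above can be replayed deterministically in a small
userspace model.  This is only a sketch of the hypothesis: the model_*
names are invented, the item counts (2, 1, 3, 5) are arbitrary, and it
assumes the shrinker's total is a plain sum over a dentry lru and an
inode lru, so that a transient -1 in one can cancel a +1 in the other.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; names are illustrative, not kernel API. */
struct model_memcg {
    long dentry_items;  /* models dst->nr_items of the dentry lru */
    long inode_items;   /* models dst->nr_items of the inode lru */
    bool shrinker_bit;  /* models the memcg's shrinker_map bit */
};

/* Unsynchronized count as the shrinker sees it: a sum over both lrus. */
static long model_count(const struct model_memcg *m)
{
    return m->dentry_items + m->inode_items;
}

/* Reparenting sets the bit only when dst goes from empty to non-empty. */
static void model_reparent(long *dst_items, long src_items, bool *bit)
{
    bool set = (*dst_items == 0 && src_items != 0);

    *dst_items += src_items;
    if (set)
        *bit = true;
}

/* Replay the hypothesized interleaving and return the final state. */
static struct model_memcg model_replay(void)
{
    struct model_memcg dst = { 0, 0, false };
    long first_count, second_count;

    /* CPU B: the first shrinker run still sees an empty memcg. */
    first_count = model_count(&dst);

    /* CPU A: child 1 (2 dentries, 1 inode) is reparented; bit gets set. */
    model_reparent(&dst.dentry_items, 2, &dst.shrinker_bit);
    model_reparent(&dst.inode_items, 1, &dst.shrinker_bit);

    /* CPU B: acts on the stale count: SHRINK_EMPTY, then clear_bit. */
    if (first_count == 0)
        dst.shrinker_bit = false;

    /*
     * CPU C: three list_lru_del() calls on child 2's dentries, charged
     * to dst because child 2's kmemcg_id already points at the parent.
     */
    dst.dentry_items -= 3;

    /* CPU B: the second run reads -1 + 1 == 0 and keeps the bit clear. */
    second_count = model_count(&dst);
    if (second_count != 0)
        dst.shrinker_bit = true;

    /* CPU A: child 2 (stale count 5) is reparented; dst != 0, no set_bit. */
    model_reparent(&dst.dentry_items, 5, &dst.shrinker_bit);

    return dst;
}
```

In this replay the model ends with four dentries and one inode accounted
to dst while the bit is still clear, which matches the stuck state
described above.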

How does list_lru_del() race with reparenting?  Reparenting replaces the
children's kmemcg_id with the parent's without holding nlru->lock, so
list_lru_del() may see the parent's kmemcg_id while actually deleting
items from the child's lru.  It then decrements the parent's nr_items,
so the parent's nr_items may go negative, as commit
2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
free kmemcg_id on css offline") says.
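
A rough userspace sketch of that accounting mismatch follows.  It only
models the counters, not the lists or the locking, and every name in it
(model_del_vs_reparent(), MODEL_PARENT and so on) is hypothetical.

```c
#include <assert.h>

/* Hypothetical ids used to index the per-memcg counters in this model. */
enum { MODEL_PARENT = 0, MODEL_CHILD = 1, MODEL_NR_IDS = 2 };

struct model_race_result {
    long transient_parent;  /* parent's nr_items in the middle of the race */
    long final_parent;      /* parent's nr_items once reparenting finished */
};

static struct model_race_result model_del_vs_reparent(long child_items,
                                                      long racing_dels)
{
    long nr_items[MODEL_NR_IDS] = { 0, 0 };
    int kmemcg_id = MODEL_CHILD;
    struct model_race_result r;
    long i;

    assert(racing_dels <= child_items);
    nr_items[MODEL_CHILD] = child_items;

    /* Reparenting, step 1: the child's id is switched to the parent's. */
    kmemcg_id = MODEL_PARENT;

    /*
     * list_lru_del(): the items come off the child's list, but the
     * counter looked up through kmemcg_id is now the parent's.
     */
    for (i = 0; i < racing_dels; i++)
        nr_items[kmemcg_id]--;
    r.transient_parent = nr_items[MODEL_PARENT];

    /* Reparenting, step 2: the child's stale count is added to the parent. */
    nr_items[MODEL_PARENT] += nr_items[MODEL_CHILD];
    nr_items[MODEL_CHILD] = 0;
    r.final_parent = nr_items[MODEL_PARENT];

    return r;
}
```

With three items on the child and one racing deletion, the model's
parent counter dips to -1 before settling at 2, so the negative value is
transient and corrects itself once the child's count is added.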

Can we move the kmemcg_id replacement after reparenting?  No, because
the race with list_lru_del() may then leave src->nr_items negative, and
nothing will ever fix it up.  The shrinker may then never return
SHRINK_EMPTY, so the shrinker map bit stays set forever and the shrinker
is always called for nothing.
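
A counter-only sketch of that alternative ordering, again as a
hypothetical userspace model (alt_reparent_then_del() and struct
alt_counts are invented names), shows why the negative value would stick:

```c
#include <assert.h>
#include <stdbool.h>

struct alt_counts {
    long src_items;     /* models src->nr_items (the child) */
    long dst_items;     /* models dst->nr_items (the parent) */
};

/* Hypothetical ordering: move the counts first, switch kmemcg_id later. */
static struct alt_counts alt_reparent_then_del(long src_items,
                                               long racing_dels)
{
    struct alt_counts c = { src_items, 0 };

    assert(racing_dels <= src_items);

    /* Reparenting moves the whole count to dst and zeroes src. */
    c.dst_items += c.src_items;
    c.src_items = 0;

    /* list_lru_del() still sees the child's kmemcg_id and charges src. */
    c.src_items -= racing_dels;

    /* kmemcg_id is switched only now; nothing adds to src ever again. */
    return c;
}

/* The emptiness test a shrinker count would apply to this counter. */
static bool alt_count_is_empty(long nr_items)
{
    return nr_items == 0;
}
```

In this model src ends at -1 and has no later step that could bring it
back to zero, so a count derived from it would never look empty.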

Can we synchronize list_lru_del() and reparenting?  Yes, it could be
done, but it seems we would need to introduce a new lock or use
nlru->lock, and moving the kmemcg_id replacement code under nlru->lock
sounds complicated.  list_lru_del() may also be called quite often, so
this could slow down some hot paths, e.g. dentry kill.

So it sounds acceptable to synchronize the read of nr_items to avoid
seeing an intermediate negative value, given the simplicity, and since
the read is typically only done by shrinkers when counting freeable
objects.

The patch was tested with some shrinker-intensive workloads; no
noticeable regression was spotted.

Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/list_lru.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 5aa6e44bc2ae..5c128a7710ff 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -178,10 +178,17 @@ unsigned long list_lru_count_one(struct list_lru *lru,
 	struct list_lru_one *l;
 	unsigned long count;
 
-	rcu_read_lock();
+	/*
+	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
+	 * we have to use IRQ-safe primitives here to avoid deadlock.
+	 *
+	 * Hold the lock to prevent from seeing transient negative
+	 * nr_items value.
+	 */
+	spin_lock_irq(&nlru->lock);
 	l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
 	count = READ_ONCE(l->nr_items);
-	rcu_read_unlock();
+	spin_unlock_irq(&nlru->lock);
 
 	return count;
 }
-- 
2.26.2