Received: by 2002:a05:6a10:f347:0:0:0:0 with SMTP id d7csp741385pxu; Fri, 11 Dec 2020 13:11:03 -0800 (PST) X-Google-Smtp-Source: ABdhPJzOyvdb0Yp285qsNoFo3KuZxDYi1xHyDPfrOaeKcsIv6Gp4it5WBv2WYYx6gcJvVFKdCkLw X-Received: by 2002:a50:ed17:: with SMTP id j23mr13988271eds.218.1607721063247; Fri, 11 Dec 2020 13:11:03 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1607721063; cv=none; d=google.com; s=arc-20160816; b=Hvx+NnvHQWi90myqMgHVToER9anclRmOP+mLmWM+boqOJ9NSH7WgZSBHiGe1c+6JFu 4ScKs33wTC9w52oan8CA1XnowR9edwiQM4/C9BMjyM/RX/vw1VdcQb3M8ZHCX36Oj6EP G8NAIgOLJ8DpUKqBpdqN8Qyn5oVgzlRYzC6bN2/sQfByzYHqgVVgMB2+SQ8Jj4OK8xpt 2KX7pYSg69dQC3KPMzOetsUQOB+j0rWORlG5oWj0QQ+tc/J4FJW4Ksqgew8xzQnvaWmR IskRcnc9VZ+ZaQTDatRuSXAIrvFgL1DsnRRv2xEaKzZoOz3QrQYIefm/HFtWwb+o5dE2 21EQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :user-agent:references:in-reply-to:message-id:date:subject:cc:to :from; bh=KnfttL9DcI5HcL1PymidCBzLV4Gb1alHPnlVnDpxZ1s=; b=BBh+DaZFZUBIsd8++S6aZyMgMzwmniTXbyDjDogGvJ8brjCEeXdWjYY2z07rDaMmw0 FEv7mzJIT5Y6ZoBUj8SogMm1STjUh47dBhaNqd12fFOsh4E4CrIdmxSykhO1HGKKI0UL HGDERaUe1tbNb7mdj25X7DY+ZxgOcpwv7nV1oeSKJcEVTTj3LGYYholHUVcDXD4/1Ccv lBHsRd1yNTQaaNAYTr75fQhOlN15MF06JozrU+gkapNImf3modGnNte18JPuXwYA5Fk4 2d6sPgRgd7FkrTtrkKYfQdHvF3I1EsmIoJIp9zAa1tUY/vvhUOQgMyU4HdT8FqMvtqUT gv/Q== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=linuxfoundation.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id d24si5695075edu.93.2020.12.11.13.10.40; Fri, 11 Dec 2020 13:11:03 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=linuxfoundation.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2393100AbgLJSe0 (ORCPT + 99 others); Thu, 10 Dec 2020 13:34:26 -0500 Received: from mail.kernel.org ([198.145.29.99]:41790 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2390744AbgLJOdd (ORCPT ); Thu, 10 Dec 2020 09:33:33 -0500 From: Greg Kroah-Hartman Authentication-Results: mail.kernel.org; dkim=permerror (bad message/signature format) To: linux-kernel@vger.kernel.org Cc: Greg Kroah-Hartman , stable@vger.kernel.org, Roman Gushchin , Yang Shi , Andrew Morton , Shakeel Butt , Kirill Tkhai , Vladimir Davydov , Linus Torvalds Subject: [PATCH 4.19 22/39] mm: list_lru: set shrinker map bit when child nr_items is not zero Date: Thu, 10 Dec 2020 15:27:01 +0100 Message-Id: <20201210142603.366919073@linuxfoundation.org> X-Mailer: git-send-email 2.29.2 In-Reply-To: <20201210142602.272595094@linuxfoundation.org> References: <20201210142602.272595094@linuxfoundation.org> User-Agent: quilt/0.66 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Yang Shi commit 8199be001a470209f5c938570cc199abb012fe53 upstream. When investigating a slab cache bloat problem, significant amount of negative dentry cache was seen, but confusingly they neither got shrunk by reclaimer (the host has very tight memory) nor be shrunk by dropping cache. The vmcore shows there are over 14M negative dentry objects on lru, but tracing result shows they were even not scanned at all. Further investigation shows the memcg's vfs shrinker_map bit is not set. So the reclaimer or dropping cache just skip calling vfs shrinker. So we have to reboot the hosts to get the memory back. I didn't manage to come up with a reproducer in test environment, and the problem can't be reproduced after rebooting. But it seems there is race between shrinker map bit clear and reparenting by code inspection. The hypothesis is elaborated as below. The memcg hierarchy on our production environment looks like: root / \ system user The main workloads are running under user slice's children, and it creates and removes memcg frequently. So reparenting happens very often under user slice, but no task is under user slice directly. So with the frequent reparenting and tight memory pressure, the below hypothetical race condition may happen: CPU A CPU B reparent dst->nr_items == 0 shrinker: total_objects == 0 add src->nr_items to dst set_bit return SHRINK_EMPTY clear_bit child memcg offline replace child's kmemcg_id with parent's (in memcg_offline_kmem()) list_lru_del() between shrinker runs see parent's kmemcg_id dec dst->nr_items reparent again dst->nr_items may go negative due to concurrent list_lru_del() The second run of shrinker: read nr_items without any synchronization, so it may see intermediate negative nr_items then total_objects may return 0 coincidently keep the bit cleared dst->nr_items != 0 skip set_bit add scr->nr_item to dst After this point dst->nr_item may never go zero, so reparenting will not set shrinker_map bit anymore. And since there is no task under user slice directly, so no new object will be added to its lru to set the shrinker map bit either. That bit is kept cleared forever. How does list_lru_del() race with reparenting? It is because reparenting replaces children's kmemcg_id to parent's without protecting from nlru->lock, so list_lru_del() may see parent's kmemcg_id but actually deleting items from child's lru, but dec'ing parent's nr_items, so the parent's nr_items may go negative as commit 2788cf0c401c ("memcg: reparent list_lrus and free kmemcg_id on css offline") says. Since it is impossible that dst->nr_items goes negative and src->nr_items goes zero at the same time, so it seems we could set the shrinker map bit iff src->nr_items != 0. We could synchronize list_lru_count_one() and reparenting with nlru->lock, but it seems checking src->nr_items in reparenting is the simplest and avoids lock contention. Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance") Suggested-by: Roman Gushchin Signed-off-by: Yang Shi Signed-off-by: Andrew Morton Reviewed-by: Roman Gushchin Reviewed-by: Shakeel Butt Acked-by: Kirill Tkhai Cc: Vladimir Davydov Cc: [4.19] Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman --- mm/list_lru.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -542,7 +542,6 @@ static void memcg_drain_list_lru_node(st struct list_lru_node *nlru = &lru->node[nid]; int dst_idx = dst_memcg->kmemcg_id; struct list_lru_one *src, *dst; - bool set; /* * Since list_lru_{add,del} may be called under an IRQ-safe lock, @@ -554,11 +553,12 @@ static void memcg_drain_list_lru_node(st dst = list_lru_from_memcg_idx(nlru, dst_idx); list_splice_init(&src->list, &dst->list); - set = (!dst->nr_items && src->nr_items); - dst->nr_items += src->nr_items; - if (set) + + if (src->nr_items) { + dst->nr_items += src->nr_items; memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru)); - src->nr_items = 0; + src->nr_items = 0; + } spin_unlock_irq(&nlru->lock); }