Date: Tue, 21 Nov 2023 10:35:21 -0300
From: Marcelo Tosatti
To: Michal Hocko
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Vlastimil Babka,
    Andrew Morton, David Hildenbrand, Peter Xu
Subject: Re: [patch 0/2] mm: too_many_isolated can stall due to out of sync VM counters
References: <20231113233420.446465795@redhat.com>
On Tue, Nov 14, 2023 at 01:46:41PM +0100, Michal Hocko wrote:
> On Tue 14-11-23 09:26:53, Marcelo Tosatti wrote:
> > Hi Michal,
> > 
> > On Tue, Nov 14, 2023 at 09:20:09AM +0100, Michal Hocko wrote:
> > > On Mon 13-11-23 20:34:20, Marcelo Tosatti wrote:
> > > > A customer reported seeing processes hung at too_many_isolated,
> > > > while analysis indicated that the problem occurred due to out
> > > > of sync per-CPU stats (see below).
> > > > 
> > > > Fix is to use node_page_state_snapshot to avoid the stale values.
> > > > 
> > > > 2136 static unsigned long
> > > > 2137 shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > > > 2138                      struct scan_control *sc, enum lru_list lru)
> > > > 2139 {
> > > >    :
> > > > 2145         bool file = is_file_lru(lru);
> > > >    :
> > > > 2147         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > > >    :
> > > > 2150         while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > > > 2151                 if (stalled)
> > > > 2152                         return 0;
> > > > 2153 
> > > > 2154                 /* wait a bit for the reclaimer. */
> > > > 2155                 msleep(100);  <--- some processes were sleeping here,
> > > >                                        with pending SIGKILL.
> > > > 2156                 stalled = true;
> > > > 2157 
> > > > 2158                 /* We are about to die and free our memory. Return now. */
> > > > 2159                 if (fatal_signal_pending(current))
> > > > 2160                         return SWAP_CLUSTER_MAX;
> > > > 2161         }
> > > > 
> > > > msleep() must be called only when there are too many isolated pages:
> > > 
> > > What do you mean here?
> > 
> > That msleep() must not be called when
> > 
> >     isolated > inactive
> > 
> > is false.

Well, but the code is structured in a way that this is simply true.
too_many_isolated might be a false positive because it is a very loose
interface and the number of isolated pages can fluctuate depending on
the number of direct reclaimers.
OK

> > > > 2019 static int too_many_isolated(struct pglist_data *pgdat, int file,
> > > > 2020                              struct scan_control *sc)
> > > > 2021 {
> > > >    :
> > > > 2030         if (file) {
> > > > 2031                 inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> > > > 2032                 isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
> > > > 2033         } else {
> > > >    :
> > > > 2046         return isolated > inactive;
> > > > 
> > > > The return value was true since:
> > > > 
> > > > crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_INACTIVE_FILE]
> > > > $8 = {
> > > >   counter = 1
> > > > }
> > > > crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_ISOLATED_FILE]
> > > > $9 = {
> > > >   counter = 2
> > > > }
> > > > 
> > > > while per-CPU stats had:
> > > > 
> > > > crash> p ((struct pglist_data *) 0xffff00817fffe580)->per_cpu_nodestats
> > > > $85 = (struct per_cpu_nodestat *) 0xffff8000118832e0
> > > > crash> p/x 0xffff8000118832e0 + __per_cpu_offset[42]
> > > > $86 = 0xffff00917fcc32e0
> > > > crash> p ((struct per_cpu_nodestat *) 0xffff00917fcc32e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
> > > > $87 = -1 '\377'
> > > > 
> > > > crash> p/x 0xffff8000118832e0 + __per_cpu_offset[44]
> > > > $89 = 0xffff00917fe032e0
> > > > crash> p ((struct per_cpu_nodestat *) 0xffff00917fe032e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
> > > > $91 = -1 '\377'
> > > 
> > > This doesn't really tell much. How much out of sync they really are
> > > cumulatively over all cpus?
> > 
> > This is the cumulative value over all CPUs (offsets for other CPUs
> > have been omitted since they are zero).
> 
> OK, so that means the NR_ISOLATED_FILE is 0 while NR_INACTIVE_FILE is 1,
> correct? If that is the case then the value is indeed outdated but it
> also means that the NR_INACTIVE_FILE is so small that all but 1 (resp. 2
> as kswapd is never throttled) reclaimers will be stalled anyway. So does
> the exact snapshot really help?

By looking at the data:

> crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_INACTIVE_FILE]
> $8 = {
>   counter = 1
> }
> crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_ISOLATED_FILE]
> $9 = {
>   counter = 2
> }
> 
> while per-CPU stats had:
> 
> crash> p ((struct pglist_data *) 0xffff00817fffe580)->per_cpu_nodestats
> $85 = (struct per_cpu_nodestat *) 0xffff8000118832e0
> crash> p/x 0xffff8000118832e0 + __per_cpu_offset[42]
> $86 = 0xffff00917fcc32e0
> crash> p ((struct per_cpu_nodestat *) 0xffff00917fcc32e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
> $87 = -1 '\377'
> 
> crash> p/x 0xffff8000118832e0 + __per_cpu_offset[44]
> $89 = 0xffff00917fe032e0
> crash> p ((struct per_cpu_nodestat *) 0xffff00917fe032e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
> $91 = -1 '\377'

the actual value of a counter is the global counter plus the sum of the
per-CPU deltas:

  Actual-Value     = Global-Counter     + CPU0.delta          + ... + CPUn.delta

  Nr-Isolated-File = Nr-Isolated-Global + CPU0.delta-isolated + ... + CPUn.delta-isolated
  Nr-Inactive-File = Nr-Inactive-Global + CPU0.delta-inactive + ... + CPUn.delta-inactive

With outdated values (globals only):
====================================

  Nr-Isolated-File = 2
  Nr-Inactive-File = 1

Therefore isolated > inactive, since 2 > 1.

Without outdated values (snapshot):
===================================

  Nr-Isolated-File = 2 - 1 - 1 = 0
  Nr-Inactive-File = 1

so isolated > inactive is false and the stall is avoided.

> Do you have any means to reproduce this
> behavior and see that the patch actually changed the behavior?

No, because it is not easy to test patches on the system where this was
reproduced. However, the calculations above seem pretty unambiguous,
showing that the snapshot would fix the problem.
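For reference, node_page_state_snapshot() implements exactly the
Actual-Value computation above, folding each CPU's pending delta into
the global counter. Roughly (paraphrased from include/linux/vmstat.h;
treat this as a sketch rather than the exact code in any given tree):

        static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
                                                enum node_stat_item item)
        {
                /* Start from the (possibly stale) global counter... */
                long x = atomic_long_read(&pgdat->vm_stat[item]);
        #ifdef CONFIG_SMP
                int cpu;

                /*
                 * ...and fold in the not-yet-flushed per-CPU deltas,
                 * e.g. the two -1 NR_ISOLATED_FILE entries seen on
                 * CPUs 42 and 44 in the crash dump above.
                 */
                for_each_online_cpu(cpu)
                        x += per_cpu_ptr(pgdat->per_cpu_nodestats,
                                         cpu)->vm_node_stat_diff[item];

                if (x < 0)
                        x = 0;
        #endif
                return x;
        }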
> [...]
> > > With a very low NR_FREE_PAGES and many contending allocations the
> > > system could be easily stuck in reclaim. What are other reclaim
> > > characteristics?
> > 
> > I can ask. What information in particular do you want to know?
> 
> When I am dealing with issues like this I heavily rely on /proc/vmstat
> counters and the pgscan, pgsteal counters to see whether there is any
> progress over time.

I understand your desire for additional data, and I can try to grab it
(or create a synthetic configuration where this problem is
reproducible). However, given the calculations above, it is clear that
one problem is the out-of-sync counters. Don't you agree?

> > > Is the direct reclaim successful?
> > 
> > Processes are stuck in too_many_isolated (unnecessarily). What do you
> > mean when you ask "Is the direct reclaim successful", precisely?
> 
> With such a small LRU list it is quite likely that many processes will
> be competing over the last pages on the list while the rest will be
> throttled because there is nothing to reclaim. It is quite possible
> that all reclaimers will be waiting for a single reclaimer (either
> kswapd or another direct reclaimer).

Sure, but again, the calculations above show that processes are stuck
in too_many_isolated (and the proposed fix will address that
situation).

> I would like to understand whether the system
> is stuck in an unproductive state where everybody just waits until the
> counter is synced, or whether everything just progresses very slowly
> because of the small LRU.

OK.
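To make the proposed fix concrete: the change amounts to reading both
counters via the snapshot variant in too_many_isolated(). A sketch of
the direction, not the literal patch:

        /*
         * mm/vmscan.c:too_many_isolated(), sketch: use the snapshot
         * helper so stale per-CPU deltas cannot yield a false
         * "isolated > inactive".
         */
        if (file) {
                inactive = node_page_state_snapshot(pgdat, NR_INACTIVE_FILE);
                isolated = node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
        } else {
                inactive = node_page_state_snapshot(pgdat, NR_INACTIVE_ANON);
                isolated = node_page_state_snapshot(pgdat, NR_ISOLATED_ANON);
        }

With the numbers from the crash dump this evaluates isolated > inactive
as 0 > 1, i.e. false, so the msleep(100) loop is never entered.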