Date: Thu, 14 Oct 2021 11:47:44 +0100
From: Mel Gorman <mgorman@techsingularity.net>
To: Vlastimil Babka
Cc: Linux-MM, NeilBrown, Theodore Ts'o, Andreas Dilger,
Wong" , Matthew Wilcox , Michal Hocko , Dave Chinner , Rik van Riel , Johannes Weiner , Jonathan Corbet , Linux-fsdevel , LKML Subject: Re: [PATCH 1/8] mm/vmscan: Throttle reclaim until some writeback completes if congested Message-ID: <20211014104744.GY3959@techsingularity.net> References: <20211008135332.19567-1-mgorman@techsingularity.net> <20211008135332.19567-2-mgorman@techsingularity.net> <63898e7a-0846-3105-96b5-76c89635e499@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <63898e7a-0846-3105-96b5-76c89635e499@suse.cz> User-Agent: Mutt/1.10.1 (2018-07-13) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Thanks Vlastimil On Wed, Oct 13, 2021 at 05:39:36PM +0200, Vlastimil Babka wrote: > > +/* > > + * Account for pages written if tasks are throttled waiting on dirty > > + * pages to clean. If enough pages have been cleaned since throttling > > + * started then wakeup the throttled tasks. > > + */ > > +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page, > > + int nr_throttled) > > +{ > > + unsigned long nr_written; > > + > > + __inc_node_page_state(page, NR_THROTTLED_WRITTEN); > > Is this intentionally using the __ version that normally expects irqs to be > disabled (AFAIK they are not in this path)? I think this is rarely used cold > path so it doesn't seem worth to trade off speed for accuracy. > It was intentional because IRQs can be disabled and if it's race-prone, it's not overly problematic but you're right, better to be safe. I changed it to the safe type as it's mostly free on x86, arm64 and s390 and for other architectures, this is a slow path. > > + nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) - > > + READ_ONCE(pgdat->nr_reclaim_start); > > Even if the inc above was safe, node_page_state() will return only the > global counter, so the value we read here will only actually increment when > some cpu's counter overflows, so it will be "bursty". Maybe it's ok, just > worth documenting? > I didn't think the penalty of doing an accurate read while writeback throttled is worth it. I'll add a comment. > > + > > + if (nr_written > SWAP_CLUSTER_MAX * nr_throttled) > > + wake_up_all(&pgdat->reclaim_wait); > > Hm it seems a bit weird that the more tasks are throttled, the more we wait, > and then wake up all. Theoretically this will lead to even more > bursty/staggering herd behavior. Could be better to wake up single task each > SWAP_CLUSTER_MAX, and bump nr_reclaim_start? But maybe it's not a problem in > practice due to HZ/10 timeouts being short enough? > Yes, the more tasks are throttled the longer tasks wait because tasks are allocating faster than writeback can complete so I wanted to reduce the allocation pressure. I considered waking one task at a time but there is no prioritisation of tasks on the waitqueue and it's not clear that the additional complexity is justified. With inaccurate counters, a light allocator could get throttled for the full timeout unnecessarily. Even if we were to wake one task at a time, I would prefer it was done as a potential optimisation on top. 
Diff on top based on review feedback:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bcd22e53795f..735b1f2b5d9e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1048,7 +1048,15 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
 {
 	unsigned long nr_written;
 
-	__inc_node_page_state(page, NR_THROTTLED_WRITTEN);
+	inc_node_page_state(page, NR_THROTTLED_WRITTEN);
+
+	/*
+	 * This is an inaccurate read as the per-cpu deltas may not
+	 * be synchronised. However, given that the system is
+	 * writeback throttled, it is not worth taking the penalty
+	 * of getting an accurate count. At worst, the throttle
+	 * timeout guarantees forward progress.
+	 */
 	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
 		READ_ONCE(pgdat->nr_reclaim_start);
 
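For context, a reconstruction (not quoted from the series) of the waiter side
that wake_up_all() releases is sketched below. reclaim_wait, nr_reclaim_start
and NR_THROTTLED_WRITTEN come from the hunk quoted above; reclaim_throttle()
and nr_reclaim_throttled are assumed names for illustration:

/*
 * Hypothetical sketch of the throttling (waiter) side in mm/vmscan.c.
 * reclaim_throttle() and the nr_reclaim_throttled field are assumptions;
 * the waitqueue, baseline counter and vmstat item are from the patch.
 */
static void reclaim_throttle(pg_data_t *pgdat)
{
	DEFINE_WAIT(wait);

	/* First waiter records the baseline the progress check compares against */
	if (atomic_inc_return(&pgdat->nr_reclaim_throttled) == 1)
		WRITE_ONCE(pgdat->nr_reclaim_start,
			   node_page_state(pgdat, NR_THROTTLED_WRITTEN));

	/*
	 * Sleep until __acct_reclaim_writeback() sees enough cleaned pages
	 * and calls wake_up_all(), or until the short timeout expires. The
	 * timeout bounds how long an unlucky task can stay throttled when
	 * the counter read is inaccurate.
	 */
	prepare_to_wait(&pgdat->reclaim_wait, &wait, TASK_UNINTERRUPTIBLE);
	schedule_timeout(HZ / 10);
	finish_wait(&pgdat->reclaim_wait, &wait);

	atomic_dec(&pgdat->nr_reclaim_throttled);
}

The HZ/10 sleep is what the comment added in the diff relies on: even if the
inaccurate vmstat read never crosses the wakeup threshold, every throttled
task makes forward progress once the timeout expires.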