Received: by 2002:a05:6a10:5bc5:0:0:0:0 with SMTP id os5csp3576620pxb; Wed, 13 Oct 2021 08:42:30 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwZcgUadL97Ebg8KuGpOEK2Qued9yZJQZzc0bJbcx9IPfETjVX8ZRXoWrLuxPOjHq6Gj6CP X-Received: by 2002:a17:906:5ac8:: with SMTP id x8mr5497ejs.132.1634139750004; Wed, 13 Oct 2021 08:42:30 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1634139749; cv=none; d=google.com; s=arc-20160816; b=tbCvuwTRD2OFBg6s5pjBQra5bFMNXh6QpfqS1/f7x+P7GYEt7QU9B+QyWZsA9PIVaV xhYTtxMkPDk5N/7TGrj2WV7CVr9luANjUI3RGX4CBBCE/sOOUPl/oRkye0DZYPtGVN3U xNPACcnkjittBrtdZ+cQgEM81hRHMSw42LRDsPKSD0EHI1A1R/1vkAf3aOoT3j98decZ J59SK9IpcUaYwqq4xPq0VzlmzPHegok8K3c/muIAJ13RXa4CB6IcJbuHiG5f5Pkd2ZbB pZGluAZ5EkiMNmmzDTs/lF9KA1t7CaBuGl1ffW1hAKyax81GlavazQG24Qi2/0Ghciio 1hpw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to:subject :from:references:cc:to:content-language:user-agent:mime-version:date :message-id:dkim-signature:dkim-signature; bh=FNjFlLBCp8eIniFVxC2OTUCUHJDZc+bQGW4gxrxjguU=; b=q0C8PaHJbVxmayAPwoyHdrKiXPIcK+7+EVdd6X0hXzuM9mJsa3HbIsrhoT+ukOIhfY ERmjLZar0U6s09+UMihtDEYWiwRy/MRiOVNdoUAInQI6V+0ZQhxdsFHhIgvzxprVl0gC 5JXynRa9X13ZwIzCdNeNr/qFw367Fdey4nT2LAQ13ucfpByfHT7fSOKl7njaOy6R5DY7 wFsSmMoCuZHNwdBlGzfe02IlP97Q2ZPQOt9aF5B8d+D7CTtVeKIdCm3QM8rWggbPckps L31dx/fGTuCvU3n82GlkF0t3h0gDVvI19NjYHIsPdR82OwOxPgNBhKW64gJH7mh+GIpS Y9sQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@suse.cz header.s=susede2_rsa header.b=2mNBWlrd; dkim=neutral (no key) header.i=@suse.cz; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id w7si21856565ejy.356.2021.10.13.08.42.05; Wed, 13 Oct 2021 08:42:29 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@suse.cz header.s=susede2_rsa header.b=2mNBWlrd; dkim=neutral (no key) header.i=@suse.cz; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233600AbhJMPlm (ORCPT + 99 others); Wed, 13 Oct 2021 11:41:42 -0400 Received: from smtp-out1.suse.de ([195.135.220.28]:53866 "EHLO smtp-out1.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229653AbhJMPll (ORCPT ); Wed, 13 Oct 2021 11:41:41 -0400 Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out1.suse.de (Postfix) with ESMTPS id DB7442199F; Wed, 13 Oct 2021 15:39:36 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_rsa; t=1634139576; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=FNjFlLBCp8eIniFVxC2OTUCUHJDZc+bQGW4gxrxjguU=; b=2mNBWlrdAdRT1IpwE9rQPOuZw6AahAK19rjBykFGnH2K1nyTFnMqyyVT6I19SNAlfLbQL9 gav+K0NvwaPnXEdVUrZ8RWSO1ZA9K3+4NSeMQ0obfx7sq9UL276QvTnTrBHQCtgUFoRihS FAVNP/qetNKkny0rkaZwQBsEaEvsW9w= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_ed25519; t=1634139576; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=FNjFlLBCp8eIniFVxC2OTUCUHJDZc+bQGW4gxrxjguU=; b=KqOOnrNZJ/HPOwlOsOK0o6nRptwG4pCDO5AkIv7nwQlkaAIFWH3/TlZZMxlrHZ/kqYhBw3 g0zeQRk8Q1UqrZBQ== Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id A5B4813D05; Wed, 13 Oct 2021 15:39:36 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id 06aHJ7j9ZmEDfQAAMHmgww (envelope-from ); Wed, 13 Oct 2021 15:39:36 +0000 Message-ID: <63898e7a-0846-3105-96b5-76c89635e499@suse.cz> Date: Wed, 13 Oct 2021 17:39:36 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.2.0 Content-Language: en-US To: Mel Gorman , Linux-MM Cc: NeilBrown , Theodore Ts'o , Andreas Dilger , "Darrick J . Wong" , Matthew Wilcox , Michal Hocko , Dave Chinner , Rik van Riel , Johannes Weiner , Jonathan Corbet , Linux-fsdevel , LKML References: <20211008135332.19567-1-mgorman@techsingularity.net> <20211008135332.19567-2-mgorman@techsingularity.net> From: Vlastimil Babka Subject: Re: [PATCH 1/8] mm/vmscan: Throttle reclaim until some writeback completes if congested In-Reply-To: <20211008135332.19567-2-mgorman@techsingularity.net> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 10/8/21 15:53, Mel Gorman wrote: > Page reclaim throttles on wait_iff_congested under the following conditions > > o kswapd is encountering pages under writeback and marked for immediate > reclaim implying that pages are cycling through the LRU faster than > pages can be cleaned. > > o Direct reclaim will stall if all dirty pages are backed by congested > inodes. > > wait_iff_congested is almost completely broken with few exceptions. This > patch adds a new node-based workqueue and tracks the number of throttled > tasks and pages written back since throttling started. If enough pages > belonging to the node are written back then the throttled tasks will wake > early. If not, the throttled tasks sleeps until the timeout expires. > > [neilb@suse.de: Uninterruptible sleep and simpler wakeups] > [hdanton@sina.com: Avoid race when reclaim starts] > Signed-off-by: Mel Gorman Seems mostly OK, have just some doubts wrt NR_THROTTLED_WRITTEN mechanics, that may ultimately be just a point of comments to add. ... > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1006,6 +1006,56 @@ static void handle_write_error(struct address_space *mapping, > unlock_page(page); > } > > +static void > +reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason, > + long timeout) > +{ > + wait_queue_head_t *wqh = &pgdat->reclaim_wait; > + long ret; > + DEFINE_WAIT(wait); > + > + /* > + * Do not throttle IO workers, kthreads other than kswapd or > + * workqueues. They may be required for reclaim to make > + * forward progress (e.g. journalling workqueues or kthreads). > + */ > + if (!current_is_kswapd() && > + current->flags & (PF_IO_WORKER|PF_KTHREAD)) > + return; > + > + if (atomic_inc_return(&pgdat->nr_reclaim_throttled) == 1) { > + WRITE_ONCE(pgdat->nr_reclaim_start, > + node_page_state(pgdat, NR_THROTTLED_WRITTEN)); > + } > + > + prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); > + ret = schedule_timeout(timeout); > + finish_wait(wqh, &wait); > + atomic_dec(&pgdat->nr_reclaim_throttled); > + > + trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout), > + jiffies_to_usecs(timeout - ret), > + reason); > +} > + > +/* > + * Account for pages written if tasks are throttled waiting on dirty > + * pages to clean. If enough pages have been cleaned since throttling > + * started then wakeup the throttled tasks. > + */ > +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page, > + int nr_throttled) > +{ > + unsigned long nr_written; > + > + __inc_node_page_state(page, NR_THROTTLED_WRITTEN); Is this intentionally using the __ version that normally expects irqs to be disabled (AFAIK they are not in this path)? I think this is rarely used cold path so it doesn't seem worth to trade off speed for accuracy. > + nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) - > + READ_ONCE(pgdat->nr_reclaim_start); Even if the inc above was safe, node_page_state() will return only the global counter, so the value we read here will only actually increment when some cpu's counter overflows, so it will be "bursty". Maybe it's ok, just worth documenting? > + > + if (nr_written > SWAP_CLUSTER_MAX * nr_throttled) > + wake_up_all(&pgdat->reclaim_wait); Hm it seems a bit weird that the more tasks are throttled, the more we wait, and then wake up all. Theoretically this will lead to even more bursty/staggering herd behavior. Could be better to wake up single task each SWAP_CLUSTER_MAX, and bump nr_reclaim_start? But maybe it's not a problem in practice due to HZ/10 timeouts being short enough? > +} > +