Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751917AbdI2AQE (ORCPT ); Thu, 28 Sep 2017 20:16:04 -0400 Received: from mail-wm0-f49.google.com ([74.125.82.49]:50870 "EHLO mail-wm0-f49.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751219AbdI2AQC (ORCPT ); Thu, 28 Sep 2017 20:16:02 -0400 X-Google-Smtp-Source: AOwi7QAFBHvTvfcTPVKVkoX4PkbKuYtGhthgb8N3CMCb1FpchhTsZx257pBnCjODWPep/ZSFYQGtiw== Subject: Re: [PATCH 10/12] writeback: only allow one inflight and pending full flush To: Andrew Morton Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, hannes@cmpxchg.org, jack@suse.cz, torvalds@linux-foundation.org References: <1506543239-31470-1-git-send-email-axboe@kernel.dk> <1506543239-31470-11-git-send-email-axboe@kernel.dk> <20170928144100.e11801ef742521e0e3f4b8df@linux-foundation.org> From: Jens Axboe Message-ID: <05736c6b-f401-2d02-432c-2fd6966abbd4@kernel.dk> Date: Fri, 29 Sep 2017 02:15:58 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0 MIME-Version: 1.0 In-Reply-To: <20170928144100.e11801ef742521e0e3f4b8df@linux-foundation.org> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3112 Lines: 76 On 09/28/2017 11:41 PM, Andrew Morton wrote: > On Wed, 27 Sep 2017 14:13:57 -0600 Jens Axboe wrote: > >> When someone calls wakeup_flusher_threads() or >> wakeup_flusher_threads_bdi(), they schedule writeback of all dirty >> pages in the system (or on that bdi). If we are tight on memory, we >> can get tons of these queued from kswapd/vmscan. This causes (at >> least) two problems: >> >> 1) We consume a ton of memory just allocating writeback work items. >> We've seen as much as 600 million of these writeback work items >> pending. That's a lot of memory to pointlessly hold hostage, >> while the box is under memory pressure. >> >> 2) We spend so much time processing these work items, that we >> introduce a softlockup in writeback processing. This is because >> each of the writeback work items don't end up doing any work (it's >> hard when you have millions of identical ones coming in to the >> flush machinery), so we just sit in a tight loop pulling work >> items and deleting/freeing them. >> >> Fix this by adding a 'start_all' bit to the writeback structure, and >> set that when someone attempts to flush all dirty pages. The bit is >> cleared when we start writeback on that work item. If the bit is >> already set when we attempt to queue !nr_pages writeback, then we >> simply ignore it. >> >> This provides us one full flush in flight, with one pending as well, >> and makes for more efficient handling of this type of writeback. >> >> ... >> >> @@ -953,12 +954,27 @@ static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic, >> return; >> >> /* >> + * All callers of this function want to start writeback of all >> + * dirty pages. Places like vmscan can call this at a very >> + * high frequency, causing pointless allocations of tons of >> + * work items and keeping the flusher threads busy retrieving >> + * that work. Ensure that we only allow one of them pending and >> + * inflight at the time. It doesn't matter if we race a little >> + * bit on this, so use the faster separate test/set bit variants. >> + */ >> + if (test_bit(WB_start_all, &wb->state)) >> + return; >> + >> + set_bit(WB_start_all, &wb->state); > > test_and_set_bit()? Like Linus says, this is done purposely. I've even included a bit about it in the comment above, though maybe it's not clear enough. I've used this trick in blk-mq quite a bit as well, and for high frequency calls, it can make a substantial difference not to redirty that cache line if you can avoid it. If you do care about atomicity, this works really well too: if (test_bit(bit, addr) || test_and_set_bit(bit, addr)) ... just to avoid the locked operation. Also see this commit: commit 7fcbbaf18392f0b17c95e2f033c8ccf87eecde1d Author: Jens Axboe Date: Thu May 22 11:54:16 2014 -0700 mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT where there are some actual numbers on a specific case. For the case at hand, we don't even need to do the test_and_set case, since we don't care about a small race there. -- Jens Axboe