Subject: Re: [PATCH v2] mm: implement write-behind policy for sequential file writes
To: Dave Chinner
Cc: Tejun Heo, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, Jens Axboe, Michal Hocko, Mel Gorman,
    Johannes Weiner, Linus Torvalds
References: <156896493723.4334.13340481207144634918.stgit@buzz>
    <875f3b55-4fe1-e2c3-5bee-ca79e4668e72@yandex-team.ru>
    <20190923145242.GF2233839@devbig004.ftw2.facebook.com>
    <20190924073940.GM6636@dread.disaster.area>
    <20190925071854.GC804@dread.disaster.area>
From: Konstantin Khlebnikov
Date: Wed, 25 Sep 2019 11:15:30 +0300
In-Reply-To: <20190925071854.GC804@dread.disaster.area>

On 25/09/2019 10.18, Dave Chinner wrote:
> On Tue, Sep 24, 2019 at 12:00:17PM +0300, Konstantin Khlebnikov wrote:
>> On 24/09/2019 10.39, Dave Chinner wrote:
>>> On Mon, Sep 23, 2019 at 06:06:46PM +0300, Konstantin Khlebnikov wrote:
>>>> On 23/09/2019 17.52, Tejun Heo wrote:
>>>>> Hello, Konstantin.
>>>>>
>>>>> On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote:
>>>>>> With vm.dirty_write_behind 1 or 2 files are written even faster and
>>>>>
>>>>> Is the faster speed reproducible? I don't quite understand why this
>>>>> would be.
>>>>
>>>> Writing to disk simply starts earlier.
>>>
>>> Stupid question: how is this any different to simply winding down
>>> our dirty writeback and throttling thresholds like so:
>>>
>>> # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes
>>>
>>> to start background writeback when there's 100MB of dirty pages in
>>> memory, and then:
>>>
>>> # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes
>>>
>>> so that writers are directly throttled at 200MB of dirty pages in
>>> memory?
>>>
>>> This effectively gives us global writebehind behaviour with a
>>> 100-200MB cache write burst for initial writes.
>>
>> Global limits affect all dirty pages, including memory-mapped and
>> randomly touched ones. Write-behind aims only at sequential streams.
>
> There are apps that do sequential writes via mmap()d files.
> They should do writebehind too, yes?

I see no reason for that; it is a different scenario. Mmap has no clear
signal about "end of write", only a page fault at the beginning.
Theoretically we could implement a similar sliding window and start
writeback on consecutive page faults. But applications that use
memory-mapped files probably know better what to do with their data.
I prefer to leave them alone for now.

>
>>> And, really, such strict writebehind behaviour is going to cause all
>>> sorts of unintended problems with filesystems because there will be
>>> adverse interactions with delayed allocation. We need a substantial
>>> amount of dirty data to be cached for writeback for fragmentation
>>> minimisation algorithms to be able to do their job....
>>
>> I think most sequentially written files never change after close.
>
> There are lots of apps that write zeros to initialise and allocate
> space, then go write real data to them. Database WAL files are
> commonly initialised like this...

Those zeros are just a bunch of dirty pages which have to be written
out. Sync and memory pressure will do that anyway, so why shouldn't
write-behind do it too?
>
>> Except for knowing the final size of huge files (>16MB in my patch),
>> there should be no difference for delayed allocation.
>
> There is, because you throttle the writes down such that there is
> only 16MB of dirty data in memory. Hence filesystems will only
> typically allocate in 16MB chunks as that's all the delalloc range
> spans.
>
> I'm not so concerned for XFS here, because our speculative
> preallocation will handle this just fine, but for ext4 and btrfs
> it's going to interleave the allocation of concurrent streaming writes
> and fragment the crap out of the files.
>
> In general, the smaller you make the individual file writeback
> window, the worse the fragmentation problem gets....

AFAIR ext4 already preallocates extents beyond EOF too. But this must be
carefully tested on all modern filesystems, for sure.

>
>> Probably write-behind could provide a hint about the streaming pattern:
>> pass something like "MSG_MORE" into the writeback call.
>
> How does that help when we've only got dirty data and block
> reservations up to EOF which is no more than 16MB away?

The block allocator should interpret this flag as "more data is expected"
and preallocate an extent bigger than the data, beyond EOF.

>
> Cheers,
>
> Dave.
>