Date: Sun, 20 Dec 2020 22:04:38 -0500
From: "Theodore Y. Ts'o"
To: Matteo Croce
Cc: linux-ext4@vger.kernel.org
Subject: Re: discard and data=writeback
X-Mailing-List: linux-ext4@vger.kernel.org

On Fri, Dec 18, 2020 at 07:40:09PM +0100, Matteo Croce wrote:
> I noticed a big slowdown on file removal, so I tried to remove the
> discard option, and it helped a lot.
> Obviously discarding blocks will have an overhead, but the strange
> thing is that it only does when using data=writeback:

If the data=ordered mount option is enabled, then when you have allocating buffered writes pending, the data block writes are forced out *before* we write out the journal blocks, followed by a cache flush, followed by the commit block (which is either written with the Force Unit Access (FUA) bit set if the storage device supports it, or followed by another cache flush). After the journal commit block is written out, if the discard mount option is enabled, all blocks that were released during the last journal transaction are then discarded.

If data=writeback is enabled, then we do *not* flush out any dirty pages in the page cache that were allocated during the previous transaction. This means that if you crash, it is possible that freshly created inodes containing freshly allocated blocks may have stale data in those newly allocated blocks. Those blocks might include some other users' e-mails, medical records, cryptographic keys, or other PII. Which is why data=ordered is the default.

So if data=ordered vs. data=writeback makes any difference, the first question I'd have to ask is whether there were any dirty pages in the page cache, or any background writes happening in parallel with the rm -rf command.

> It seems that ext4_issue_discard() is called ~300 times with data=ordered
> and ~50k times with data=writeback.

ext4_issue_discard() gets called for each contiguous set of blocks that were released in a particular jbd2 transaction. So if you are deleting 100 files, all of those files are unlinked in a single transaction, and all of the blocks belonging to those files form a single contiguous block region, then ext4_issue_discard() will be called only once. If you delete a single file, but all of its blocks are heavily fragmented, then ext4_issue_discard() might be called a thousand times. If you delete 100 files, all of which are contiguous, but each file is in a different part of the disk, then ext4_issue_discard() might be called 100 times. (There is a toy sketch further down in this mail that illustrates the point.)

So that implies that your experiment may not be repeatable; did you make sure the file system was freshly reformatted before you wrote out the files in the directory you are deleting? And was the directory written out in exactly the same way? And did you make sure all of the writes were flushed out to disk before you tried timing the "rm -rf" command? And did you make sure that there weren't any other processes running that might be issuing other file system operations (either data or metadata heavy) that might be interfering with the "rm -rf" operation? What kind of storage device were you using? (An SSD; a USB thumb drive; some kind of Cloud emulated block device?)

Note that benchmarking file system operations is *hard*. When I worked with a graduate student on a paper describing a prototype enhancement to optimize ext4 for drive-managed SMR drives[1], the graduate student spent *way* more time getting reliable, repeatable benchmarks than making changes to ext4 for the prototype. (It turns out the SMR GC operations caused variations in write speeds, which meant the writeback throughput measurements would fluctuate wildly, which then influenced the writeback cache ratio, which in turn massively influenced how aggressively the writeback threads would behave, which in turn massively influenced the filebench and postmark numbers.)
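To make the "one call per contiguous run" point concrete, here is a toy user-space sketch -- this is not the actual ext4/jbd2 code, and the block numbers are invented -- that counts how many contiguous runs a set of freed block numbers collapses into, which is roughly how many discard requests a single commit would end up issuing:

/*
 * Toy illustration, NOT actual ext4/jbd2 code: given the block numbers
 * freed in one transaction, count how many contiguous runs they form,
 * i.e. roughly how many discard requests that commit would issue.
 */
#include <stdio.h>
#include <stdlib.h>

static int cmp_blk(const void *a, const void *b)
{
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;

    return (x > y) - (x < y);
}

static unsigned long count_discard_ranges(unsigned long long *blk, size_t n)
{
    unsigned long ranges;
    size_t i;

    if (n == 0)
        return 0;

    qsort(blk, n, sizeof(*blk), cmp_blk);

    ranges = 1;
    for (i = 1; i < n; i++)
        if (blk[i] != blk[i - 1] + 1)   /* gap => new discard range */
            ranges++;
    return ranges;
}

int main(void)
{
    unsigned long long contig[1000], frag[1000];
    size_t i;

    for (i = 0; i < 1000; i++) {
        contig[i] = 100000ULL + i;      /* one contiguous run */
        frag[i] = 100000ULL + i * 8;    /* every 8th block: no runs */
    }

    printf("contiguous blocks: %lu discard range(s)\n",
           count_discard_ranges(contig, 1000));
    printf("fragmented blocks: %lu discard range(s)\n",
           count_discard_ranges(frag, 1000));
    return 0;
}

Same number of freed blocks in both cases, but the contiguous layout collapses into a single range while the fragmented one produces a thousand -- the same kind of spread that separates ~300 calls from ~50k calls.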
[1] https://www.usenix.org/conference/fast17/technical-sessions/presentation/aghayev

So there can be variability caused by how blocks are allocated by the file system; how the SSD assigns blocks to flash erase blocks; how the SSD's GC operation influences its write speed, which can in turn influence the kernel's measured writeback throughput; and different SSDs or Cloud block devices can have very different discard performance that can vary based on past write history, yadda, yadda, yadda.

Cheers,

- Ted