Date: Tue, 8 Dec 2020 11:52:14 -0500
From: "Theodore Y. Ts'o"
To: Michael Walle
Cc: Ulf Hansson, linux-ext4@vger.kernel.org, linux-mmc@vger.kernel.org, linux-block
Subject: Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op
Message-ID: <20201208165214.GD52960@mit.edu>
References: <97c4bb65c8a3e688b191d57e9f06aa5a@walle.cc> <20201207183534.GA52960@mit.edu> <2edcf8e344937b3c5b92a0b87ebd13bd@walle.cc> <20201208024057.GC52960@mit.edu>
X-Mailing-List: linux-ext4@vger.kernel.org

On Tue, Dec 08, 2020 at 12:26:22PM +0100, Michael Walle wrote:
> Do we really need to map these functions?
> What if we don't have an actual discard, but just a slow erase (I'm
> now assuming that erase will likely be slow on sdcards)? Can't we
> just tell the user space there is no discard? Like on a normal HDD?
> I really don't know the implications, seems like mmc_erase() is just
> there for the linux discard feature.

So the potential gotcha here is that "discard" is important for
reducing write amplification, and thus improving the lifespan of
devices.  (See my reference to the Tesla engine controller story
earlier.)  So if a device doesn't have "discard" but has "erase", and
"erase" is fast, then skipping the discard could end up significantly
reducing the lifespan of your product, and we're back to the NHTSA
investigating whether they should stick Tesla for the $1500 engine
controller replacement when cards die early.

I guess the JEDEC spec does specify a way to query the card for how
long an erase takes, but I don't know how the actual real-world
implementations of these specs (and their many variants over the
years) actually behave.  Can the erase times that they advertise
actually be trusted to be accurate?  How many of them actually supply
erase times at all, no matter what the spec says?

> Coming from the user space side. Does mkfs.ext4 assume its
> pre-discard is fast? I'd think so, right? I'd presume it was intended
> to tell the FTL of the block device, "hey these blocks are unused,
> you can do some wear leveling with them".

Yes, the assumption is that discard is fast.  Exactly how fast seems
to vary; this is one of the reasons why there are three different ways
to do discards on a file system after files are deleted.  One way is
to issue them once the deletion definitely won't be unwound (i.e.,
after the ext4 journal commit).  But on some devices, the discard
command, while fast, is slow enough that this will compete with the
I/O completion times of other read commands, thus degrading system
performance.
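
On the question of what user space is told: the block layer exports
each disk's discard limits under /sys/block/<disk>/queue/, and the
kernel's sysfs documentation describes a value of 0 as meaning "no
discard support".  Below is a minimal sketch, assuming those two
attributes are present; the helper names and the "mmcblk0" disk name
are hypothetical and not taken from this thread.  Note that it only
reports whether discard is advertised.  It says nothing about how fast
the device will be, which is exactly the information we are missing.

```python
from pathlib import Path


def read_discard_limits(disk, sysfs_root="/sys/block"):
    """Return the discard limits the block layer advertises for `disk`.

    `disk` is a whole-disk name such as "mmcblk0" (an example name only).
    Attributes that are missing or unparsable are reported as 0.
    """
    queue = Path(sysfs_root) / disk / "queue"
    limits = {}
    for name in ("discard_granularity", "discard_max_bytes"):
        try:
            limits[name] = int((queue / name).read_text().strip())
        except (FileNotFoundError, ValueError):
            limits[name] = 0
    return limits


def supports_discard(limits):
    """True if both limits are non-zero, i.e. discard is advertised."""
    # A value of 0 is documented as "discard not supported".
    return limits["discard_granularity"] > 0 and limits["discard_max_bytes"] > 0
```

Calling supports_discard(read_discard_limits("mmcblk0")) on a real
system would show what mke2fs or fstrim can expect to be accepted.  On
a card where the kernel falls back to a normal erase, the device may
still show up as supporting discard here.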

So you can also execute the trim commands out of cron, using the
fstrim command, which will run the discards in the background, and the
system administrator can schedule fstrim for times when performance
isn't critical (e.g., when the phone is on a charger in the middle of
the night, or at 4am local time, etc.).  Finally, you can configure
e2fsck to run the discards after the file system consistency check is
done.

The reason why we have to leave this up to the system administrators
is that we have essentially no guidance from the device about how slow
the discard command might be, how it interferes with other device
operations, and whether the discard is more likely to be ignored if
the device is busy.  So it might be that the discard is more likely to
improve write endurance when it is issued while the device is idle.

All of the specs (SCSI, SATA, UFS, eMMC, SD) are remarkably unhelpful
because performance considerations are generally considered "out of
scope" by standards committees.  They want to leave that up to market
forces, which is why big companies (handset vendors, hyperscale cloud
providers, systems integrators, etc.) have to spend so much money
doing certification testing before deciding which products to buy;
think of it as a full-employment act for storage engineers.  :-)

But yes, mke2fs assumes that discard is sufficiently fast that doing
it at file system format time is extremely reasonable.  The bigger
concern is that we can't necessarily count on discard zeroing the
inode table, and file system repairs are much more robust (especially
before we had metadata checksums) if the inode table is zeroed ahead
of time.

> I'm just about finding some SD cards and looking how they behave
> timing wise and what they report they support (ie. erase or
> discard). Looks like other cards are doing better.
> But I'd have to find out if they support the discard (mine doesn't)
> and if they are slow too if I force them to use the normal erase.

The challenge is that this sort of thing gets rapidly out of date, and
it's not just SD cards but also eMMC devices which are built into
various embedded devices, high-end SDHC cards, etc., etc.  So doing
this gets very expensive.

That being said, both ext4 and f2fs do pre-discards as part of the
format step, since improving write endurance is important; customers
get cranky when their $1000 smart phones die an early death.  So an SD
card that behaves the way yours does would probably get disqualified
very early in the certification step if it were ever intended to be
used in an Android handset, since pretty much all Android devices, or
embedded devices for that matter, use either f2fs or ext4.

That's one of the reasons why I was a bit surprised that your device
had such an "interesting" performance profile.  Maybe it was intended
for use in digital cameras, and digital cameras don't issue discards?
I don't know....

> > I agree, these are the three levels that make sense to support.
> >
> > Honestly I haven't been paying enough attention to discussions for
> > the generic block layer around discards. However, considering what
> > you just stated above, we seem to be missing one request operation,
> > don't we?

Yes, that's true.  We only have "discard" and "secure discard".  Part
of that is because those are the only levels which are available for
SSD's, for which I have the same general complaint vis-a-vis standards
committees and the general lack of usefulness for file system
engineers.

For example, pretty much everyone in the enterprise and hyperscale
cloud world assumes that low-numbered LBA's have better performance
profiles, and are located physically at the outer diameter of HDD's,
compared to high-numbered LBA's.
But that's nothing which is specified by the standards committees,
because "performance considerations are out of scope".  Yet we still
have to engineer storage systems which assume this to be true, even
though nothing in the formal specs guarantees this.  We just have to
trust that anyone who tries to sell a HDD for which this isn't true,
even if it is "standards compliant", is going to have a bad time, and
trust that this is enough.

(Perhaps this is why when a certain HDD manufacturer tried to sell
HDD's containing drive-managed SMR for the NAS market, without
disclosing this fact to consumers, this generated a massive
backlash....  Simply being standards compliant is not enough.)

					- Ted