Date: Tue, 11 Apr 2017 11:31:29 -0300
From: Henrique de Moraes Holschuh
To: Martin Steigerwald
Cc: Tejun Heo, linux-kernel@vger.kernel.org, linux-scsi@vger.kernel.org,
    linux-ide@vger.kernel.org, Hans de Goede
Subject: Re: Race to power off harming SATA SSDs
Message-ID: <20170411143129.GA28632@khazad-dum.debian.net>
References: <20170410232118.GA4816@khazad-dum.debian.net>
    <20170410235206.GA28603@wtj.duckdns.org> <3231980.BbEtxjAFS5@merkaba>
In-Reply-To: <3231980.BbEtxjAFS5@merkaba>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 11 Apr 2017, Martin Steigerwald wrote:
> I do have a Crucial M500 and I do have an increase of that counter:
>
> martin@merkaba:~[…]/Crucial-M500> grep "^174" smartctl-a-201*
> smartctl-a-2014-03-05.txt:174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 1
> smartctl-a-2014-10-11-nach-prüfsummenfehlern.txt:174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 67
> smartctl-a-2015-05-01.txt:174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 105
> smartctl-a-2016-02-06.txt:174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 148
> smartctl-a-2016-07-08-unreadable-sector.txt:174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 201
> smartctl-a-2017-04-11.txt:174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 272
>
>
> I mostly didn´t notice anything, except for one time where I indeed had a
> BTRFS checksum error, luckily within a BTRFS RAID 1 with an Intel SSD (which
> also has an attribute for unclean shutdown which raises).

The Crucial M500 has something called "RAIN", which it got unmodified
from its Micron datacenter siblings of the time, along with a large
amount of flash overprovisioning. Too bad it lost the overprovisioned
supercapacitor bank present on the Microns.

RAIN does block-level N+1 RAID5-like parity across the flash chips on
top of the usual block-based ECC, and the SSD has a background scrubber
task that repairs any blocks that fail ECC correction using the RAIN
parity information.

On such an SSD, you really need multi-chip flash corruption beyond what
ECC can fix to even get the operating system/filesystem to notice any
damage, unless you are paying attention to its SMART attributes (it
counts the number of blocks that required RAIN recovery -- which implies
ECC failed to correct that block in the first place), etc.

Unfortunately, I do not have correlation data to know whether there is
an increase in RAIN-corrected or ECC-corrected blocks during the 24h
after an unclean poweroff right after STANDBY IMMEDIATE on a Crucial
M500 SSD.

> The write-up Henrique gave me the idea, that maybe it wasn´t an user triggered
> unclean shutdown that caused the issue, but an unclean shutdown triggered by
> the Linux kernel SSD shutdown procedure implementation.

Maybe. But that corruption could easily have been caused by something
else. There is no shortage of possible culprits.

I expect most damage caused by unclean SSD power-offs to be hidden from
the user/operating system/filesystem by the extensive recovery
facilities present on most SSDs.
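
To make the "N+1 RAID5-like parity" description of RAIN above more
concrete, here is a toy sketch of XOR parity over equal-sized blocks.
It is only an illustration of the general technique: the block size,
the three-chip layout and all function names are assumptions made up
for this example, and nothing here reflects the actual Micron/Crucial
firmware.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the N data blocks followed by one parity block (N+1)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe, lost_index):
    """Rebuild the block at lost_index from the N surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


# Hypothetical layout: one block per flash chip, three data chips.
data = [b"chip0###", b"chip1###", b"chip2###"]
stripe = make_stripe(data)

# A single block lost (ECC could not fix it) is rebuilt from the rest.
assert recover(stripe, 1) == data[1]

# With two blocks lost, the survivors only yield the XOR of the two
# missing blocks, so neither can be rebuilt on its own.
two_lost = xor_blocks([stripe[2], stripe[3]])
assert two_lost == xor_blocks([data[0], data[1]])
assert two_lost not in (data[0], data[1])
```

In this model, one unrecoverable block per stripe stays invisible to the
host, which is consistent with the point that it takes damage on more
than one chip before the filesystem notices anything.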
Note that the fact that data was transparently (and successfully)
recovered doesn't mean damage did not happen, or that the unit was not
harmed by it: it likely got some extra flash wear at the very least.

BTW, for the record, Windows 7 also appears to have had (and may still
have) this issue as far as I can tell. Almost every user report of
excessive unclean power off alerts (and also of SSD bricking) to be
found on SSD vendor forums comes from Windows users.

-- 
Henrique Holschuh