Date: Tue, 12 May 2020 18:01:11 -0400
From: "Theodore Y. Ts'o"
To: julio.lajara@alum.rpi.edu
Cc: linux-ext4@vger.kernel.org
Subject: Re: Reducing ext4 fs issues resulting from frequent hard poweroffs
Message-ID: <20200512220111.GD1596452@mit.edu>

On Tue, May 12, 2020 at 05:08:51PM -0400, Julio Lajara wrote:
> Hi all, I currently manage an IOT fleet based on Intel NUCs running
> Ubuntu 18.04 Server on SSDs with ext4, no swap.
> The device usage is more CPU bound than I/O bound and we are having
> some issues keeping a subset of devices running due to them being
> hard powered off in the field in some regions (sometimes as
> frequently as every 12hrs). Due to current difficulties in getting
> devices back from the field I'm looking into tweaking them as best
> as possible to survive these hard power offs barring any physical
> SSD issues.

Hi Julio,

If the hardware devices are behaving appropriately --- that is, after
receiving a CACHE FLUSH command the storage device persists all blocks
written up to the CACHE FLUSH command, such that when the OS receives
the command completion notification of the CACHE FLUSH, everything is
persisted even after a hard power off --- no special configuration
should be necessary. We have regression tests which simulate this,
and ext4 regularly passes them.

If you need to tweak settings, that's an indication that your hardware
is buggy. And unfortunately, there's not much we can do to prevent
failures. A lot is going to depend on *how* crappy the SSDs happen to
be.

Your best bet might be to find a way to make your root filesystem
read-only, so it's not being modified at all, and then set up a
scratch partition with state which can be reformatted at any time if
it gets corrupted --- and then try to get all of your data pushed out
to your remote servers / cloud as often as possible.

And next time, qualify the SSDs ahead of time to make sure they
aren't overly "cost optimized" (read: crap) before you buy your fleet
of devices. :-(

						- Ted
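
For illustration only, here is a minimal Python sketch of how an
application writing state to a scratch partition might request
persistence. It is not part of the original message. It assumes Linux,
where fsync() on ext4 with default mount options is expected to result
in a cache flush being sent to the device. The helper name
durable_write and the file name in the usage line are hypothetical.
On hardware that ignores or mishandles CACHE FLUSH, this cannot
protect the data either.

```python
import os
import tempfile


def durable_write(path, data):
    """Atomically replace `path` with `data` (bytes) and ask the kernel
    to persist both the file contents and the directory entry."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the final
    # rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # drain Python's userspace buffer
            os.fsync(tmp.fileno())  # ask the kernel to persist the data
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Best-effort cleanup of the temporary file on any failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Persist the rename itself by syncing the containing directory.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A call such as durable_write("/scratch/state.json", b"{}") would then
leave either the old or the new file contents after a power cut,
provided the SSD honors the flush.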