From: Martin Steigerwald
To: Jeff Layton
Cc: 焦晓冬, R.E.Wolff@bitwizard.nl, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org
Subject: Re: POSIX violation by writeback error
Date: Wed, 05 Sep 2018 09:37:25 +0200
Message-ID: <1959947.mKHFU3S0Eq@merkaba>
X-Mailing-List: linux-kernel@vger.kernel.org

Jeff Layton - 04.09.18, 17:44:
> > - If the following read() could be served by a page in memory, just
> >   returns the data.
> >   If the following read() could not be served by a page in memory
> >   and the inode/address_space has a writeback error mark, returns
> >   EIO. If there is a writeback error on the file, and the request
> >   data could not be served by a page in memory, it means we are
> >   reading a (partically) corrupted (out-of-data) file. Receiving an
> >   EIO is expected.
>
> No, an error on read is not expected there. Consider this:
>
> Suppose the backend filesystem (maybe an NFSv3 export) is really r/o,
> but was mounted r/w. An application queues up a bunch of writes that
> of course can't be written back (they get EROFS or something when
> they're flushed back to the server), but that application never calls
> fsync.
>
> A completely unrelated application is running as a user that can open
> the file for read, but not r/w. It then goes to open and read the file
> and then gets EIO back or maybe even EROFS.
>
> Why should that application (which did zero writes) have any reason to
> think that the error was due to prior writeback failure by a
> completely separate process? Does EROFS make sense when you're
> attempting to do a read anyway?
>
> Moreover, what is that application's remedy in this case? It just
> wants to read the file, but may not be able to even open it for write
> to issue an fsync to "clear" the error. How do we get things moving
> again so it can do what it wants?
>
> I think your suggestion would open the floodgates for local DoS
> attacks.

I wonder whether a new error for reporting writeback errors like this
could help out of the situation. But from all I have read here so far,
this is a really challenging situation to deal with.

I still remember how AmigaOS dealt with this case, and from a usability
point of view it was close to ideal: if a disk was removed, like a
floppy disk, a network disk provided by Envoy or even a hard disk, it
popped up a dialog "You MUST insert volume again". And if you did, it
continued writing.
That worked even with networked devices. I tested it: I unplugged
the Ethernet cable and replugged it, and it continued writing.

I can imagine that this would be quite challenging to implement within
Linux. I remember that a Google Summer of Code project to implement
this was at least offered for NetBSD, but I never got to know whether it
was taken or even implemented. If so, it might serve as an inspiration.

Anyway, AmigaOS did this even for stationary hard disks. I had the issue
of a flaky connection through an IDE to SCSI and then a SCSI to UWSCSI
adapter. When the hard disk had connection issues, that dialog popped
up, with the name of the operating system volume for example. Every
access to it was blocked then. It simply blocked all processes that
accessed it until it became available again (usually I rebooted in the
case of a stationary device, because I had to open the case or hot plug
was not available or not working).

But AFAIR AmigaOS also did not have a notion of caching writes for
longer than maybe a few seconds or so, and I think just within the
device driver. Writes were (almost) immediate. There have been some
asynchronous I/O libraries, and I would expect a delay in the dialog
popping up in that case.

It would be challenging to implement for Linux even just for removable
devices. You have page dirtying and delayed writeback, which is still a
performance issue: with NFS over 1 GBit, an rsync from local storage
that is faster than 1 GBit, and huge files, reducing the dirty memory
ratio may help to halve the time needed to complete the rsync copy
operation. And you would need to communicate all the way to userspace to
let the user know about the issue.

Still, at least for removable media, this would be almost the most
usability friendly approach.
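
To put rough numbers on the dirty memory ratio point above, here is a
small back-of-the-envelope sketch in Python. The function names, the
16 GiB machine and the choice of 10% versus 5% are assumptions made up
for illustration. It also applies the ratio to total RAM, whereas the
kernel applies vm.dirty_ratio to available (free plus reclaimable)
memory, so the result is only an upper-bound estimate of the backlog,
not a model of the rsync run itself.

```python
# Illustrative arithmetic only; names and figures are assumptions.

def dirty_limit_bytes(mem_bytes: int, ratio_percent: int) -> int:
    """Upper bound on dirty page cache if ratio_percent applied to mem_bytes."""
    return mem_bytes * ratio_percent // 100


def flush_seconds(dirty_bytes: int, link_bits_per_second: float) -> float:
    """Best-case time to push dirty_bytes over a link, ignoring overhead."""
    return dirty_bytes / (link_bits_per_second / 8)


if __name__ == "__main__":
    mem = 16 * 2**30   # hypothetical machine with 16 GiB of RAM
    link = 1e9         # 1 GBit/s link to the NFS server
    for ratio in (10, 5):
        backlog = dirty_limit_bytes(mem, ratio)
        print(f"ratio {ratio}%: up to {backlog / 2**20:.0f} MiB dirty, "
              f"at least {flush_seconds(backlog, link):.1f} s to flush")
```

The point is only that the backlog that has to drain over the slow link
scales linearly with the ratio, so halving the ratio halves the
worst-case amount of unwritten data.
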
With robust filesystems (Amiga Old Filesystem and Fast Filesystem were
not robust in case of sudden write interruption, so the "MUST" was meant
that way) one may even offer "Please insert device again to write out
unwritten data or choose to discard that data" in a dialog. And for
removable media it may even work, as blocking processes that access it
usually would not block the whole system. But for the operating system
disk? I know how Plasma desktop behaves during massive I/O operations.
It usually just completely stalls. It seems to me that its processes do
some I/O almost all of the time … or that the Linux kernel blocks other
syscalls too during heavy I/O load.

I just wanted to mention it as another crazy idea. But I bet it would
practically require rewriting the I/O subsystem in Linux to a great
extent, probably diminishing its performance in situations of write
pressure. Or maybe a genius finds a way to implement both. :)

What I do think, though, is that the dirty page caching of Linux with
its current standard settings is excessive. 5% / 10% of available memory
often is a lot these days. There has been a discussion about reducing
the default, but AFAIK it was never done. Linus suggested in that
discussion limiting it to about what the storage can write out in 3 to 5
seconds. That may even help with error reporting, as reducing the dirty
memory ratio will reduce the memory pressure, and so you may choose to
add some memory allocations for error handling. And the time until you
know it's not working may be less.

Thanks,
--
Martin