Hi,
After reading several writeback error handling articles on LWN, I
have become concerned about how writeback errors are handled.
Jlayton's patch is a simple but wonderful idea towards correct error
reporting. It seems one crucial thing still remains to be fixed. Does
anyone have any ideas?
The crucial thing may be that a read() after a successful
open()-write()-close() may return old data.
That may happen when an async writeback error occurs after close()
and the inode/mapping gets evicted before read().
That violates POSIX, as POSIX requires that a read() that can be proved
to occur after a write() has returned must return the new data.
Regards,
Trol
On Tue, Sep 04, 2018 at 02:32:28PM +0800, 焦晓冬 wrote:
> ...
> The crucial thing may be that a read() after a successful
> open()-write()-close() may return old data.
>
> That may happen where an async writeback error occurs after close()
> and the inode/mapping get evicted before read().
Suppose I have 1Gb of RAM. Suppose I open a file, write 0.5Gb to it
and then close it. Then I repeat this 9 times.
Now, when writing those files to storage fails, there is 5Gb of data
to remember and only 1Gb of RAM.
I can choose any part of that 5Gb and try to read it.
Please make a suggestion about where we should store that data?
In the easy case, where the data easily fits in RAM, you COULD write a
solution. But when the hardware fails, the SYSTEM will not be able to
follow the POSIX rules.
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
The plan was simple, like my brother-in-law Phil. But unlike
Phil, this plan just might work.
On Tue, Sep 4, 2018 at 3:53 PM Rogier Wolff <[email protected]> wrote:
...
> Suppose I have 1Gb of RAM. Suppose I open a file, write 0.5Gb to it
> and then close it. Then I repeat this 9 times.
>
> Now, when writing those files to storage fails, there is 5Gb of data
> to remember and only 1Gb of RAM.
>
> I can choose any part of that 5Gb and try to read it.
>
> Please make a suggestion about where we should store that data?
That is certainly not possible. But at least, shall we report an
error on read()? Silently returning wrong data may cause further damage,
such as removing the wrong files because they were listed as garbage in
the stale file.
As far as I can see, this is all about error reporting.
As for a suggestion: maybe the error flag of the inode/mapping, or the entire
inode, should not be evicted if there was an error. That hopefully won't take
much memory. In extreme conditions, where too many error-marked inodes need
to stay in memory, maybe we should panic rather than spread the error.
>
> In the easy case, where the data easily fits in RAM, you COULD write a
> solution. But when the hardware fails, the SYSTEM will not be able to
> follow the posix rules.
Nope, we are able to follow the rules. The above is one way that follows
the POSIX rules.
>
> Roger.
>
On Tue, Sep 04, 2018 at 04:58:59PM +0800, 焦晓冬 wrote:
> As for suggestion, maybe the error flag of inode/mapping, or the entire inode
> should not be evicted if there was an error. That hopefully won't take much
> memory. On extreme conditions, where too much error inode requires staying
> in memory, maybe we should panic rather than spread the error.
Again you are hoping it will fit in memory. In an extreme case it
won't fit in memory. Trying to come up with heuristics about when to
remember and when to forget such things from the past is very
difficult.
Think of my comments as: "it's harder than you think", not as "can't
be done".
Roger.
On Tue, Sep 4, 2018 at 5:29 PM Rogier Wolff <[email protected]> wrote:
>
> On Tue, Sep 04, 2018 at 04:58:59PM +0800, 焦晓冬 wrote:
>
> > As for suggestion, maybe the error flag of inode/mapping, or the entire inode
> > should not be evicted if there was an error. That hopefully won't take much
> > memory. On extreme conditions, where too much error inode requires staying
> > in memory, maybe we should panic rather than spread the error.
>
> Again you are hoping it will fit in memory. In an extreme case it
> won't fit in memory. Trying to come up with heuristics about when to
> remember and when to forget such things from the past is very
> difficult.
The key point is to report errors, not to hide them from user space to
prevent further errors/damage; that is also what POSIX wants.
And storing an inode/mapping/error flag in memory is quite different
from storing the data itself: the flags are tiny and grow per inode
rather than per errored page.
>
> Think of my comments as: "it's harder than you think", not as "can't
> be done".
>
> Roger.
>
On Tue, 2018-09-04 at 13:42 +0800, 焦晓冬 wrote:
> ...
> The crucial thing may be that a read() after a successful open()-
> write()-close() may return old data.
> That may happen where an async writeback error occurs after close()
> and the inode/mapping get evicted before read().
>
> That violates POSIX, as POSIX requires that a read() that can be proved
> to occur after a write() has returned will return the new data.
That can happen even before a close(), and it varies by filesystem. Most
filesystems just pretend the page is clean after writeback failure. It's
quite possible to do:
    write()
    kernel attempts to write back the page and fails
    page is marked clean and evicted from the cache
    read()
Now your write is gone and there were no calls between the write and
read.
The question we still need to answer is this:
When we attempt to write back some data from the cache and that fails,
what should happen to the dirty pages?
Unfortunately, there are no good answers given the write/fsync/read
model for I/O. I tend to think that in the long run we may need new
interfaces to handle this better.
--
Jeff Layton <[email protected]>
On Tue, 2018-09-04 at 16:58 +0800, 焦晓冬 wrote:
> On Tue, Sep 4, 2018 at 3:53 PM Rogier Wolff <[email protected]> wrote:
>
> ...
>
> That is certainly not possible to be done. But at least, shall we report
> error on read()? Silently returning wrong data may cause further damage,
> such as removing wrong files since it was marked as garbage in the old file.
>
Is the data wrong though? You tried to write and then that failed.
Eventually we want to be able to get at the data that's actually in the
file -- at what point is that?
If I get an error back on a read, why should I think that it has
anything at all to do with writes that previously failed? It may even
have been written by a completely separate process that I had nothing at
all to do with.
> As I can see, that is all about error reporting.
>
> As for suggestion, maybe the error flag of inode/mapping, or the entire inode
> should not be evicted if there was an error. That hopefully won't take much
> memory. On extreme conditions, where too much error inode requires staying
> in memory, maybe we should panic rather than spread the error.
>
> >
> > In the easy case, where the data easily fits in RAM, you COULD write a
> > solution. But when the hardware fails, the SYSTEM will not be able to
> > follow the posix rules.
>
> Nope, we are able to follow the rules. The above is one way that follows the
> POSIX rules.
>
This is something we discussed at LSF this year.
We could attempt to keep dirty data around for a little while, at least
long enough to ensure that reads reflect earlier writes until the errors
can be scraped out by fsync. That would sort of redefine fsync from
being "ensure that my writes are flushed" to "synchronize my cache with
the current state of the file".
The problem of course is that applications are not required to do fsync
at all. At what point do we give up on it, and toss out the pages that
can't be cleaned?
We could allow for a tunable that does a kernel panic if writebacks fail
and the errors are never fetched via fsync, and we run out of memory. I
don't think that is something most users would want though.
Another thought: maybe we could OOM kill any process that has the file
open and then toss out the page data in that situation?
I'm wide open to (good) ideas here.
On Tue, Sep 4, 2018 at 7:09 PM Jeff Layton <[email protected]> wrote:
>
> ...
>
> Is the data wrong though? You tried to write and then that failed.
> Eventually we want to be able to get at the data that's actually in the
> file -- what is that point?
The point is that silent data corruption is dangerous. I would prefer
getting an error back to receiving wrong data.
A practical and concrete example:
A disk cleaner program first searches for garbage files that won't be used
anymore, saves the list in a file (open()-write()-close()), and waits for
the user to confirm the list of files to be removed. A writeback error
occurs and the related page/inode/address_space gets evicted while the
user is taking a long time to think it over. Finally, the user hits enter
and the cleaner begins to open() and read() the list again. But what gets
removed is the old list of files that was generated several months ago...
Another example:
An email editor and a busy mail sender. A well-written mail to my boss is
composed in this email editor and saved in a file (open()-write()-close()).
The mail sender is notified with the path of the mail file so it can queue
the mail and send it later. A writeback error occurs and the related
page/inode/address_space gets evicted while the mail is still waiting in
the sender's queue. Finally, the mail file is open()ed and read() by the
sender, but what is sent is the mail to my girlfriend that was composed
yesterday...
In both cases, the files are not meant to be persisted onto the disk, so
fsync() is not likely to be called.
>
> If I get an error back on a read, why should I think that it has
> anything at all to do with writes that previously failed? It may even
> have been written by a completely separate process that I had nothing at
> all to do with.
>
> ...
>
> This is something we discussed at LSF this year.
>
> We could attempt to keep dirty data around for a little while, at least
> long enough to ensure that reads reflect earlier writes until the errors
> can be scraped out by fsync. That would sort of redefine fsync from
> being "ensure that my writes are flushed" to "synchronize my cache with
> the current state of the file".
>
> The problem of course is that applications are not required to do fsync
> at all. At what point do we give up on it, and toss out the pages that
> can't be cleaned?
>
> We could allow for a tunable that does a kernel panic if writebacks fail
> and the errors are never fetched via fsync, and we run out of memory. I
> don't think that is something most users would want though.
>
> Another thought: maybe we could OOM kill any process that has the file
> open and then toss out the page data in that situation?
>
> I'm wide open to (good) ideas here.
As I said above, silent data corruption is dangerous and maybe we really
should report errors to user space even in desperate cases.
One possible approach may be:
- When a writeback error occurs, mark the page clean and remember the error
in the inode/address_space of the file.
I think that is what the kernel is doing currently.
- If the following read() can be served by a page in memory, just return
the data. If the following read() cannot be served by a page in memory and
the inode/address_space carries a writeback error mark, return EIO.
If there is a writeback error on the file, and the requested data cannot
be served by a page in memory, it means we are reading a (partially)
corrupted (out-of-date) file. Receiving an EIO is expected.
- We refuse to evict inodes/address_spaces that are marked with a writeback
error. If the number of error-marked inodes reaches a limit, we just refuse
to open new files (or refuse to open new files for writing).
That would NOT take as much memory as retaining the pages themselves, as it
is per file/inode rather than per byte of the file. Limiting the number of
error-marked inodes is just like the limit on the number of open files we
already enforce.
- Finally, after the system reboots, programs could see (partially)
corrupted (out-of-date) files. Since user space programs didn't mean to
persist these files (they didn't call fsync()), that is fairly reasonable.
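In pseudocode, the proposed read-side check might look like this (purely illustrative; every helper name here is invented):

```c
/* Purely illustrative pseudocode; none of these helpers exist. */
ssize_t proposed_read(struct inode *inode, char *buf, size_t len, loff_t pos)
{
    if (range_in_page_cache(inode, pos, len))
        return copy_from_page_cache(inode, buf, len, pos);

    if (inode_has_writeback_error(inode))
        return -EIO;   /* on-disk copy is known to be stale */

    return read_from_disk(inode, buf, len, pos);
}
```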
On Tue, 2018-09-04 at 22:56 +0800, 焦晓冬 wrote:
> On Tue, Sep 4, 2018 at 7:09 PM Jeff Layton <[email protected]> wrote:
> >
> > ...
> >
> > Is the data wrong though? You tried to write and then that failed.
> > Eventually we want to be able to get at the data that's actually in the
> > file -- what is that point?
>
> The point is silently data corruption is dangerous. I would prefer getting an
> error back to receive wrong data.
>
Well, _you_ might like that, but there are whole piles of applications
that may fall over completely in this situation. Legacy usage matters
here.
> A practical and concrete example may be,
> A disk cleaner program that first searches for garbage files that won't be used
> anymore and save the list in a file (open()-write()-close()) and wait for the
> user to confirm the list of files to be removed. A writeback error occurs
> and the related page/inode/address_space gets evicted while the user is
> taking a long thought about it. Finally, the user hits enter and the
> cleaner begin
> to open() read() the list again. But what gets removed is the old list
> of files that
> was generated several months ago...
>
> Another example may be,
> An email editor and a busy mail sender. A well written mail to my boss is
> composed by this email editor and is saved in a file (open()-write()-close()).
> The mail sender gets notified with the path of the mail file to queue it and
> send it later. A writeback error occurs and the related
> page/inode/address_space gets evicted while the mail is still waiting in the
> queue of the mail sender. Finally, the mail file is open() read() by the sender,
> but what is sent is the mail to my girlfriend that was composed yesterday...
>
> In both cases, the files are not meant to be persisted onto the disk.
> So, fsync()
> is not likely to be called.
>
So at what point are you going to give up on keeping the data? The
fundamental problem here is an open-ended commitment. We (justifiably)
avoid those in kernel development because it might leave the system
without a way out of a resource crunch.
> ...
>
> As I said above, silently data corruption is dangerous and maybe we really
> should report errors to user space even in desperate cases.
>
> One possible approach may be:
>
> - When a writeback error occurs, mark the page clean and remember the error
> in the inode/address_space of the file.
> I think that is what the kernel is doing currently.
>
Yes.
> - If the following read() could be served by a page in memory, just returns the
> data. If the following read() could not be served by a page in memory and the
> inode/address_space has a writeback error mark, returns EIO.
> If there is a writeback error on the file, and the requested data could
> not be served
> by a page in memory, it means we are reading a (partially) corrupted
> (out-of-date)
> file. Receiving an EIO is expected.
>
No, an error on read is not expected there. Consider this:
Suppose the backend filesystem (maybe an NFSv3 export) is really r/o,
but was mounted r/w. An application queues up a bunch of writes that of
course can't be written back (they get EROFS or something when they're
flushed back to the server), but that application never calls fsync.
A completely unrelated application is running as a user that can open
the file for read, but not r/w. It then goes to open and read the file
and then gets EIO back or maybe even EROFS.
Why should that application (which did zero writes) have any reason to
think that the error was due to prior writeback failure by a completely
separate process? Does EROFS make sense when you're attempting to do a
read anyway?
Moreover, what is that application's remedy in this case? It just wants
to read the file, but may not be able to even open it for write to issue
an fsync to "clear" the error. How do we get things moving again so it
can do what it wants?
I think your suggestion would open the floodgates for local DoS attacks.
> - We refuse to evict inodes/address_spaces that is writeback error marked. If
> the number of writeback error marked inodes reaches a limit, we shall
> just refuse
> to open new files (or refuse to open new files for writing) .
> That would NOT take as much memory as retaining the pages themselves as
> it is per file/inode rather than per byte of the file. Limiting the
> number of writeback
> error marked inodes is just like limiting the number of open files
> we're currently
> doing
>
This was one of the suggestions at LSF this year.
That said, we can't just refuse to evict those inodes, as we may
eventually need the memory. We may have to settle for prioritizing
inodes that can be cleaned for eviction, and only evict the ones that
can't when we have no other choice.
Denying new opens is also potentially helpful to someone wanting to
mount a local DoS attack.
> - Finally, after the system reboots, programs could see (partially)
> corrupted (out-of-data) files. Since user space programs didn't mean to
> persist these files (didn't call fsync()), that is fairly reasonable.
On Tue, Sep 04, 2018 at 11:44:20AM -0400, Jeff Layton wrote:
> On Tue, 2018-09-04 at 22:56 +0800, 焦晓冬 wrote:
> > A practical and concrete example may be,
> > A disk cleaner program that first searches for garbage files that won't be used
> > anymore and save the list in a file (open()-write()-close()) and wait for the
> > user to confirm the list of files to be removed. A writeback error occurs
> > and the related page/inode/address_space gets evicted while the user is
> > taking a long thought about it. Finally, the user hits enter and the
> > cleaner begin
> > to open() read() the list again. But what gets removed is the old list
> > of files that
> > was generated several months ago...
> >
> > Another example may be,
> > An email editor and a busy mail sender. A well written mail to my boss is
> > composed by this email editor and is saved in a file (open()-write()-close()).
> > The mail sender gets notified with the path of the mail file to queue it and
> > send it later. A writeback error occurs and the related
> > page/inode/address_space gets evicted while the mail is still waiting in the
> > queue of the mail sender. Finally, the mail file is open() read() by the sender,
> > but what is sent is the mail to my girlfriend that was composed yesterday...
> >
> > In both cases, the files are not meant to be persisted onto the disk.
> > So, fsync()
> > is not likely to be called.
> >
>
> So at what point are you going to give up on keeping the data? The
> fundamental problem here is an open-ended commitment. We (justifiably)
> avoid those in kernel development because it might leave the system
> without a way out of a resource crunch.
Well, I think the point was that in the above examples you'd prefer that
the read just fail--no need to keep the data. A bit marking the file
(or even the entire filesystem) unreadable would satisfy POSIX, I guess.
Whether that's practical, I don't know.
> > - If the following read() could be served by a page in memory, just returns the
> > data. If the following read() could not be served by a page in memory and the
> > inode/address_space has a writeback error mark, returns EIO.
> > If there is a writeback error on the file, and the requested data could
> > not be served
> > by a page in memory, it means we are reading a (partially) corrupted
> > (out-of-date)
> > file. Receiving an EIO is expected.
> >
>
> No, an error on read is not expected there. Consider this:
>
> Suppose the backend filesystem (maybe an NFSv3 export) is really r/o,
> but was mounted r/w. An application queues up a bunch of writes that of
> course can't be written back (they get EROFS or something when they're
> flushed back to the server), but that application never calls fsync.
>
> A completely unrelated application is running as a user that can open
> the file for read, but not r/w. It then goes to open and read the file
> and then gets EIO back or maybe even EROFS.
>
> Why should that application (which did zero writes) have any reason to
> think that the error was due to prior writeback failure by a completely
> separate process? Does EROFS make sense when you're attempting to do a
> read anyway?
>
> Moreover, what is that application's remedy in this case? It just wants
> to read the file, but may not be able to even open it for write to issue
> an fsync to "clear" the error. How do we get things moving again so it
> can do what it wants?
>
> I think your suggestion would open the floodgates for local DoS attacks.
Do we really care about processes with write permissions (even only
local client-side write permissions) being able to DoS readers? In
general readers kinda have to trust writers.
--b.
On Tue, Sep 04, 2018 at 12:12:03PM -0400, J. Bruce Fields wrote:
>
> Well, I think the point was that in the above examples you'd prefer that
> the read just fail--no need to keep the data. A bit marking the file
> (or even the entire filesystem) unreadable would satisfy posix, I guess.
> Whether that's practical, I don't know.
If you would do it like that (mark the whole filesystem as "in
error"), things would go from bad to worse even faster. The Linux kernel
tries to keep the system up even in the face of errors.
With that suggestion, having one application run into a writeback
error would effectively crash the whole system, because the filesystem
may be the root filesystem, and stuff like "sshd" that you need to
diagnose the problem must be read from the disk....
Roger.
On Tue, Sep 04, 2018 at 06:23:48PM +0200, Rogier Wolff wrote:
> On Tue, Sep 04, 2018 at 12:12:03PM -0400, J. Bruce Fields wrote:
> > Well, I think the point was that in the above examples you'd prefer that
> > the read just fail--no need to keep the data. A bit marking the file
> > (or even the entire filesystem) unreadable would satisfy posix, I guess.
> > Whether that's practical, I don't know.
>
> When you would do it like that (mark the whole filesystem as "in
> error") things go from bad to worse even faster. The Linux kernel
> tries to keep the system up even in the face of errors.
>
> With that suggestion, having one application run into a writeback
> error would effectively crash the whole system because the filesystem
> may be the root filesystem and stuff like "sshd" that you need to
> diagnose the problem needs to be read from the disk....
Well, the absolutist position on posix compliance here would be that a
crash is still preferable to returning the wrong data. And for the
cases 焦晓冬 gives, that sounds right? Maybe it's the wrong balance in
general, I don't know. And we do already have filesystems with
panic-on-error options, so if they aren't used, then maybe users
have already voted against that level of strictness.
--b.
On Tue, 2018-09-04 at 14:54 -0400, J. Bruce Fields wrote:
> ...
>
> Well, the absolutist position on posix compliance here would be that a
> crash is still preferable to returning the wrong data. And for the
> cases 焦晓冬 gives, that sounds right? Maybe it's the wrong balance in
> general, I don't know. And we do already have filesystems with
> panic-on-error options, so if they aren't used maybe then maybe users
> have already voted against that level of strictness.
>
Yeah, idk. The problem here is that this is squarely in the domain of
implementation defined behavior. I do think that the current "policy"
(if you call it that) of what to do after a wb error is weird and wrong.
What we probably ought to do is start considering how we'd like it to
behave.
How about something like this?
Mark the pages as "uncleanable" after a writeback error. We'll satisfy
reads from the cached data until someone calls fsync, at which point
we'd return the error and invalidate the uncleanable pages.
If no one calls fsync and scrapes the error, we'll hold on to it for as
long as we can (or up to some predefined limit) and then after that
we'll invalidate the uncleanable pages and start returning errors on
reads. If someone eventually calls fsync afterward, we can return to
normal operation.
As always though...what about mmap? Would we need to SIGBUS at the point
where we'd start returning errors on read()?
Would that approximate the current behavior enough and make sense?
Implementing it all sounds non-trivial though...
--
Jeff Layton <[email protected]>
On Tue, Sep 04, 2018 at 04:18:18PM -0400, Jeff Layton wrote:
> On Tue, 2018-09-04 at 14:54 -0400, J. Bruce Fields wrote:
> > On Tue, Sep 04, 2018 at 06:23:48PM +0200, Rogier Wolff wrote:
> > > On Tue, Sep 04, 2018 at 12:12:03PM -0400, J. Bruce Fields wrote:
> > > > Well, I think the point was that in the above examples you'd prefer that
> > > > the read just fail--no need to keep the data. A bit marking the file
> > > > (or even the entire filesystem) unreadable would satisfy posix, I guess.
> > > > Whether that's practical, I don't know.
> > >
> > > When you would do it like that (mark the whole filesystem as "in
> > > error") things go from bad to worse even faster. The Linux kernel
> > > tries to keep the system up even in the face of errors.
> > >
> > > With that suggestion, having one application run into a writeback
> > > error would effectively crash the whole system because the filesystem
> > > may be the root filesystem and stuff like "sshd" that you need to
> > > diagnose the problem needs to be read from the disk....
> >
> > Well, the absolutist position on posix compliance here would be that a
> > crash is still preferable to returning the wrong data. And for the
> > cases 焦晓冬 gives, that sounds right? Maybe it's the wrong balance in
> > general, I don't know. And we do already have filesystems with
> > panic-on-error options, so if they aren't used maybe then maybe users
> > have already voted against that level of strictness.
> >
>
> Yeah, idk. The problem here is that this is squarely in the domain of
> implementation defined behavior. I do think that the current "policy"
> (if you call it that) of what to do after a wb error is weird and wrong.
> What we probably ought to do is start considering how we'd like it to
> behave.
>
> How about something like this?
>
> Mark the pages as "uncleanable" after a writeback error. We'll satisfy
> reads from the cached data until someone calls fsync, at which point
> we'd return the error and invalidate the uncleanable pages.
>
> If no one calls fsync and scrapes the error, we'll hold on to it for as
> long as we can (or up to some predefined limit) and then after that
> we'll invalidate the uncleanable pages and start returning errors on
> reads. If someone eventually calls fsync afterward, we can return to
> normal operation.
>
> As always though...what about mmap? Would we need to SIGBUS at the point
> where we'd start returning errors on read()?
>
> Would that approximate the current behavior enough and make sense?
> Implementing it all sounds non-trivial though...
>
Here's a crazy and potentially stupid idea:
Implement a new class of swap space for backing dirty pages which fail
to write back. Pages in this space survive reboots, essentially backing
the implicit commitment POSIX establishes in the face of asynchronous
writeback errors. Rather than evicting these pages as clean, they are
swapped out to the persistent swap.
Administrators then decide if they want to throw some cheap storage at
enabling this coverage, or live with the existing risks.
I think it may be an interesting approach, enabling administrators to
repair primary storage while operating in a degraded mode. Then once
things are corrected, there would be a way to evict pages from
persistent swap for another shot at writeback to primary storage.
Regards,
Vito Caputo
On Tue, Sep 04, 2018 at 01:35:34PM -0700, Vito Caputo wrote:
> Implement a new class of swap space for backing dirty pages which fail
> to write back. Pages in this space survive reboots, essentially backing
> the implicit commitment POSIX establishes in the face of asynchronous
> writeback errors. Rather than evicting these pages as clean, they are
> swapped out to the persistent swap.
You not only need to track which index within a file this swapped page
belongs to but also which file. And that starts to get tricky. It may
or may not have a name; it may or may not have a persistent inode number;
it may or may not have a persistent fhandle. If it's on network storage,
it may have been modified by another machine. If it's on removable
storage, it may have been modified by another machine.
On Tue, Sep 04, 2018 at 01:35:34PM -0700, Vito Caputo wrote:
> On Tue, Sep 04, 2018 at 04:18:18PM -0400, Jeff Layton wrote:
> > On Tue, 2018-09-04 at 14:54 -0400, J. Bruce Fields wrote:
> > > On Tue, Sep 04, 2018 at 06:23:48PM +0200, Rogier Wolff wrote:
> > > > On Tue, Sep 04, 2018 at 12:12:03PM -0400, J. Bruce Fields wrote:
> > > > > Well, I think the point was that in the above examples you'd prefer that
> > > > > the read just fail--no need to keep the data. A bit marking the file
> > > > > (or even the entire filesystem) unreadable would satisfy posix, I guess.
> > > > > Whether that's practical, I don't know.
> > > >
> > > > When you would do it like that (mark the whole filesystem as "in
> > > > error") things go from bad to worse even faster. The Linux kernel
> > > > tries to keep the system up even in the face of errors.
> > > >
> > > > With that suggestion, having one application run into a writeback
> > > > error would effectively crash the whole system because the filesystem
> > > > may be the root filesystem and stuff like "sshd" that you need to
> > > > diagnose the problem needs to be read from the disk....
> > >
> > > Well, the absolutist position on posix compliance here would be that a
> > > crash is still preferable to returning the wrong data. And for the
> > > cases 焦晓冬 gives, that sounds right? Maybe it's the wrong balance in
> > > general, I don't know. And we do already have filesystems with
> > > panic-on-error options, so if they aren't used maybe then maybe users
> > > have already voted against that level of strictness.
> > >
> >
> > Yeah, idk. The problem here is that this is squarely in the domain of
> > implementation defined behavior. I do think that the current "policy"
> > (if you call it that) of what to do after a wb error is weird and wrong.
> > What we probably ought to do is start considering how we'd like it to
> > behave.
> >
> > How about something like this?
> >
> > Mark the pages as "uncleanable" after a writeback error. We'll satisfy
> > reads from the cached data until someone calls fsync, at which point
> > we'd return the error and invalidate the uncleanable pages.
> >
> > If no one calls fsync and scrapes the error, we'll hold on to it for as
> > long as we can (or up to some predefined limit) and then after that
> > we'll invalidate the uncleanable pages and start returning errors on
> > reads. If someone eventually calls fsync afterward, we can return to
> > normal operation.
> >
> > As always though...what about mmap? Would we need to SIGBUS at the point
> > where we'd start returning errors on read()?
> >
> > Would that approximate the current behavior enough and make sense?
> > Implementing it all sounds non-trivial though...
> >
>
> Here's a crazy and potentially stupid idea:
>
> Implement a new class of swap space for backing dirty pages which fail
> to write back. Pages in this space survive reboots, essentially backing
> the implicit commitment POSIX establishes in the face of asynchronous
> writeback errors. Rather than evicting these pages as clean, they are
> swapped out to the persistent swap.
And when that "swap" area gets write errors, too? What then? We're
straight back to the same "what the hell do we do with the error"
problem.
Adding more turtles doesn't help solve this issue.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Sep 04, 2018 at 11:44:20AM -0400, Jeff Layton wrote:
> On Tue, 2018-09-04 at 22:56 +0800, 焦晓冬 wrote:
> > On Tue, Sep 4, 2018 at 7:09 PM Jeff Layton <[email protected]> wrote:
> > >
> > > On Tue, 2018-09-04 at 16:58 +0800, Trol wrote:
> > > > That is certainly not possible to be done. But at least, shall we report
> > > > error on read()? Silently returning wrong data may cause further damage,
> > > > such as removing wrong files since it was marked as garbage in the old file.
> > > >
> > >
> > > Is the data wrong though? You tried to write and then that failed.
> > > Eventually we want to be able to get at the data that's actually in the
> > > file -- what is that point?
> >
> > The point is silently data corruption is dangerous. I would prefer getting an
> > error back to receive wrong data.
> >
>
> Well, _you_ might like that, but there are whole piles of applications
> that may fall over completely in this situation. Legacy usage matters
> here.
Can I make a suggestion here?
First imagine a spherical cow in a vacuum.....
What I mean is: In the absence of boundary conditions (the real world)
what would ideally happen?
I'd say:
* When you've written data to a file, you would want to read that
written data back. Even in the presence of errors on the backing
media.
But already this is controversial: I've seen time-and-time again that
people with RAID-5 setups continue to work until the second drive
fails: They ignored the signals the system was giving: "Please replace
a drive".
So when a mail queuer puts mail in the mailq files and the mail processor
can get them out of there intact, nobody is going to notice. (I know
mail queuers should call fsync and report errors when that fails, but
there are bound to be applications where calling fsync is not
appropriate (*))
So maybe when the write fails, the reads on that file should fail?
That would mean the data required to be kept in memory is much reduced: you
only have to keep the metadata.
In both cases, semantics change when a reboot happens before the
read. Should we care? If we can't fix it when a reboot has happened,
does it make sense to do something different when a reboot has NOT
happened?
Roger.
(*) I have 800GB of data I need to give to a client. The
truck-of-tapes solution of today is a 1TB USB-3 drive. Writing that
data onto the drive runs at 30MB/sec (USB2 speed: USB3 didn't work for
some reason) for 5-10 seconds and then slows down to 200kB/sec for
minutes at a time. One of the reasons might be that fuse-ntfs is
calling fsync on the MFT and directory files to keep stuff consistent
just in case things crash. Well... In this case this means that
copying the data took 3 full days instead of 3 hours. Too much calling
fsync is not good either.
Jeff Layton - 04.09.18, 17:44:
> > - If the following read() could be served by a page in memory, just
> > returns the data. If the following read() could not be served by a
> > page in memory and the inode/address_space has a writeback error
> > mark, returns EIO. If there is a writeback error on the file, and
> > the request data could not be served
> > by a page in memory, it means we are reading a (partically)
> > corrupted
> > (out-of-data)
> > file. Receiving an EIO is expected.
>
> No, an error on read is not expected there. Consider this:
>
> Suppose the backend filesystem (maybe an NFSv3 export) is really r/o,
> but was mounted r/w. An application queues up a bunch of writes that
> of course can't be written back (they get EROFS or something when
> they're flushed back to the server), but that application never calls
> fsync.
>
> A completely unrelated application is running as a user that can open
> the file for read, but not r/w. It then goes to open and read the file
> and then gets EIO back or maybe even EROFS.
>
> Why should that application (which did zero writes) have any reason to
> think that the error was due to prior writeback failure by a
> completely separate process? Does EROFS make sense when you're
> attempting to do a read anyway?
>
> Moreover, what is that application's remedy in this case? It just
> wants to read the file, but may not be able to even open it for write
> to issue an fsync to "clear" the error. How do we get things moving
> again so it can do what it wants?
>
> I think your suggestion would open the floodgates for local DoS
> attacks.
I wonder whether a new error code for reporting writeback errors like this
could help in this situation. But from all I have read here so far, this
is a really challenging situation to deal with.
I still remember how AmigaOS dealt with this case, and from a usability
point of view it was close to ideal: If a disk was removed, like a
floppy disk, a network disk provided by Envoy or even a hard disk, it
pops up a dialog "You MUST insert volume <name of volume> again". And if
you did, it continued writing. That worked even with networked devices.
I tested it. I unplugged the ethernet cable and replugged it and it
continued writing.
I can imagine that this would be quite challenging to implement within
Linux. I remember there has been a Google Summer of Code project for
NetBSD at least been offered to implement this, but I never got to know
whether it was taken or even implemented. If so it might serve as an
inspiration. Anyway AmigaOS did this even for stationary hard disks. I
had the issue of a flaky connection through IDE to SCSI and then SCSI to
UWSCSI adapter. And when the hard disk had connection issues that dialog
popped up, with the name of the operating system volume for example.
Every access to it was blocked then. It simply blocked all processes
that accessed it till it became available again (usually I rebooted in
the case of a stationary device, because I had to open the case, or no
hot plug was available or working).
But AFAIR AmigaOS also did not have a notion of caching writes for
longer than maybe a few seconds or so and I think just within the device
driver. Writes were (almost) immediate. There have been some
asynchronous I/O libraries and I would expect a delay in the dialog
popping up in that case.
It would be challenging to implement for Linux even just for removable
devices. You have page dirtying and delayed writeback – which is still
a performance issue with NFS at 1 GBit when rsyncing huge files from local
storage that is faster than 1 GBit; reducing the dirty memory ratio may
help to halve the time needed to complete the rsync copy operation. And
you would need to communicate all the way to userspace to let the user
know about the issue.
Still, at least for removable media, this would be almost the most
usability-friendly approach. With robust filesystems (the Amiga Old
Filesystem and Fast Filesystem were not robust against sudden write
interruption, so the "MUST" was meant that way) one may even offer
"Please insert device <name of device> again to write out unwritten data
or choose to discard that data" in a dialog. And for removable media it
may even work, as blocking the processes that access it usually would not
block the whole system. But for the operating system disk? I know how
Plasma desktop behaves during massive I/O operations. It usually just
grinds to a halt. It seems to me that its processes do some
I/O almost all of the time … or that the Linux kernel blocks other
syscalls too during heavy I/O load.
I just wanted to mention it as another crazy idea. But I bet it would
practically require rewriting the I/O subsystem in Linux to a great
extent, probably diminishing its performance in situations of write
pressure. Or maybe a genius finds a way to implement both. :)
What I do think, though, is that the dirty page caching of Linux with its
current standard settings is excessive. 5% / 10% of available memory
often is a lot these days. There has been a discussion reducing the
default, but AFAIK it was never done. Linus suggested in that discussion
reducing it to about what the storage can write out in 3 to 5 seconds. That may even
help with error reporting as reducing dirty memory ratio will reduce the
memory pressure and so you may choose to add some memory allocations for
error handling. And the time until you know it's not working may be less.
Thanks,
--
Martin
Rogier Wolff - 05.09.18, 09:08:
> So when a mail queuer puts mail the mailq files and the mail processor
> can get them out of there intact, nobody is going to notice. (I know
> mail queuers should call fsync and report errors when that fails, but
> there are bound to be applications where calling fsync is not
> appropriate (*))
AFAIK at least the Postfix MDA only reports mail as being accepted over SMTP
once fsync() on the mail file has completed successfully. And I'd expect
every sensible MDA to do this. I don't know how the Dovecot MDA, which I
currently use for sieve support, does this though.
--
Martin
On Wed, Sep 05, 2018 at 09:39:58AM +0200, Martin Steigerwald wrote:
> Rogier Wolff - 05.09.18, 09:08:
> > So when a mail queuer puts mail the mailq files and the mail processor
> > can get them out of there intact, nobody is going to notice. (I know
> > mail queuers should call fsync and report errors when that fails, but
> > there are bound to be applications where calling fsync is not
> > appropriate (*))
>
> AFAIK at least Postfix MDA only reports mail as being accepted over SMTP
> once fsync() on the mail file completed successfully. And I'd expect
> every sensible MDA to do this. I don't know how Dovecot MDA which I
> currently use for sieve support does this though.
Yes. That's why I added the remark that mailers will call fsync and know
about it on the write side. I encountered a situation in the last few
days that, had a developer run into it while developing, would have
caused him to write:
/* Calling this fsync causes unacceptable performance */
// fsync (fd);
I know of an application somewhere that does realtime-gathering of
call-records (number X called Y for Z seconds). They come in from a
variety of sources, get de-duplicated, standardized, and written to
files. Then different output modules push the data to the different
consumers within the company. Billing among them.
Now getting old data there would be pretty bad. And calling fsync
all the time might have performance issues....
That's the situation where "old data is really bad".
But when apt-get upgrade replaces your /bin/sh and gets a write error,
returning an error on subsequent reads is really bad.
It is more difficult than you think.
Roger.
On Tue, Sep 4, 2018 at 11:44 PM Jeff Layton <[email protected]> wrote:
>
> On Tue, 2018-09-04 at 22:56 +0800, 焦晓冬 wrote:
> > On Tue, Sep 4, 2018 at 7:09 PM Jeff Layton <[email protected]> wrote:
> > >
> > > On Tue, 2018-09-04 at 16:58 +0800, Trol wrote:
> > > > On Tue, Sep 4, 2018 at 3:53 PM Rogier Wolff <[email protected]> wrote:
> > > >
> > > > ...
> > > > > >
> > > > > > Jlayton's patch is simple but wonderful idea towards correct error
> > > > > > reporting. It seems one crucial thing is still here to be fixed. Does
> > > > > > anyone have some idea?
> > > > > >
> > > > > > The crucial thing may be that a read() after a successful
> > > > > > open()-write()-close() may return old data.
> > > > > >
> > > > > > That may happen where an async writeback error occurs after close()
> > > > > > and the inode/mapping get evicted before read().
> > > > >
> > > > > Suppose I have 1Gb of RAM. Suppose I open a file, write 0.5Gb to it
> > > > > and then close it. Then I repeat this 9 times.
> > > > >
> > > > > Now, when writing those files to storage fails, there is 5Gb of data
> > > > > to remember and only 1Gb of RAM.
> > > > >
> > > > > I can choose any part of that 5Gb and try to read it.
> > > > >
> > > > > Please make a suggestion about where we should store that data?
> > > >
> > > > That is certainly not possible to be done. But at least, shall we report
> > > > error on read()? Silently returning wrong data may cause further damage,
> > > > such as removing wrong files since it was marked as garbage in the old file.
> > > >
> > >
> > > Is the data wrong though? You tried to write and then that failed.
> > > Eventually we want to be able to get at the data that's actually in the
> > > file -- what is that point?
> >
> > The point is silently data corruption is dangerous. I would prefer getting an
> > error back to receive wrong data.
> >
>
> Well, _you_ might like that, but there are whole piles of applications
> that may fall over completely in this situation. Legacy usage matters
> here.
>
> > A practical and concrete example may be,
> > A disk cleaner program that first searches for garbage files that won't be used
> > anymore and save the list in a file (open()-write()-close()) and wait for the
> > user to confirm the list of files to be removed. A writeback error occurs
> > and the related page/inode/address_space gets evicted while the user is
> > taking a long thought about it. Finally, the user hits enter and the
> > cleaner begin
> > to open() read() the list again. But what gets removed is the old list
> > of files that
> > was generated several months ago...
> >
> > Another example may be,
> > An email editor and a busy mail sender. A well written mail to my boss is
> > composed by this email editor and is saved in a file (open()-write()-close()).
> > The mail sender gets notified with the path of the mail file to queue it and
> > send it later. A writeback error occurs and the related
> > page/inode/address_space gets evicted while the mail is still waiting in the
> > queue of the mail sender. Finally, the mail file is open() read() by the sender,
> > but what is sent is the mail to my girlfriend that was composed yesterday...
> >
> > In both cases, the files are not meant to be persisted onto the disk.
> > So, fsync()
> > is not likely to be called.
> >
>
> So at what point are you going to give up on keeping the data? The
> fundamental problem here is an open-ended commitment. We (justifiably)
> avoid those in kernel development because it might leave the system
> without a way out of a resource crunch.
>
> > >
> > > If I get an error back on a read, why should I think that it has
> > > anything at all to do with writes that previously failed? It may even
> > > have been written by a completely separate process that I had nothing at
> > > all to do with.
> > >
> > > > As I can see, that is all about error reporting.
> > > >
> > > > As for suggestion, maybe the error flag of inode/mapping, or the entire inode
> > > > should not be evicted if there was an error. That hopefully won't take much
> > > > memory. On extreme conditions, where too much error inode requires staying
> > > > in memory, maybe we should panic rather then spread the error.
> > > >
> > > > >
> > > > > In the easy case, where the data easily fits in RAM, you COULD write a
> > > > > solution. But when the hardware fails, the SYSTEM will not be able to
> > > > > follow the posix rules.
> > > >
> > > > Nope, we are able to follow the rules. The above is one way that follows the
> > > > POSIX rules.
> > > >
> > >
> > > This is something we discussed at LSF this year.
> > >
> > > We could attempt to keep dirty data around for a little while, at least
> > > long enough to ensure that reads reflect earlier writes until the errors
> > > can be scraped out by fsync. That would sort of redefine fsync from
> > > being "ensure that my writes are flushed" to "synchronize my cache with
> > > the current state of the file".
> > >
> > > The problem of course is that applications are not required to do fsync
> > > at all. At what point do we give up on it, and toss out the pages that
> > > can't be cleaned?
> > >
> > > We could allow for a tunable that does a kernel panic if writebacks fail
> > > and the errors are never fetched via fsync, and we run out of memory. I
> > > don't think that is something most users would want though.
> > >
> > > Another thought: maybe we could OOM kill any process that has the file
> > > open and then toss out the page data in that situation?
> > >
> > > I'm wide open to (good) ideas here.
> >
> > As I said above, silently data corruption is dangerous and maybe we really
> > should report errors to user space even in desperate cases.
> >
> > One possible approach may be:
> >
> > - When a writeback error occurs, mark the page clean and remember the error
> > in the inode/address_space of the file.
> > I think that is what the kernel is doing currently.
> >
>
> Yes.
>
> > - If the following read() could be served by a page in memory, just returns the
> > data. If the following read() could not be served by a page in memory and the
> > inode/address_space has a writeback error mark, returns EIO.
> > If there is a writeback error on the file, and the request data could
> > not be served
> > by a page in memory, it means we are reading a (partically) corrupted
> > (out-of-data)
> > file. Receiving an EIO is expected.
> >
>
> No, an error on read is not expected there. Consider this:
>
> Suppose the backend filesystem (maybe an NFSv3 export) is really r/o,
> but was mounted r/w. An application queues up a bunch of writes that of
> course can't be written back (they get EROFS or something when they're
> flushed back to the server), but that application never calls fsync.
>
> A completely unrelated application is running as a user that can open
> the file for read, but not r/w. It then goes to open and read the file
> and then gets EIO back or maybe even EROFS.
>
> Why should that application (which did zero writes) have any reason to
> think that the error was due to prior writeback failure by a completely
> separate process? Does EROFS make sense when you're attempting to do a
> read anyway?
Well, since the reader application and the writer application are reading
the same file, they are indeed related. The reader here is expecting
to read the latest data the writer offers, not just any data available. The
reader is surely not expecting to read partially new and partially old data.
Right? And the POSIX rule that `read() should return the latest write()`
supports this expectation.
When we can provide the latest data that is expected, we just give
them the data. If we cannot, we give back an error. That is much like
returning an error when the network condition is bad and only part of the
latest data could be successfully fetched.
No, EROFS makes no sense. An EROFS from writeback should be converted
to EIO on read.
>
> Moreover, what is that application's remedy in this case? It just wants
> to read the file, but may not be able to even open it for write to issue
> an fsync to "clear" the error. How do we get things moving again so it
> can do what it wants?
At this point we have lost the latest data. I don't think using fsync() as
a clear_error_flag() is a good idea. The data of the file is now partially
old and partially new. The content of the file is unpredictable. The
application may just want to simply remove it. If the application really
wants to restore what it can from this corrupted data, adding a new open()
flag named O_IGNORE_ERROR_IF_POSSIBLE may be a good way to
support that. The O_DIRECT, O_NOATIME, O_PATH, and O_TMPFILE flags
are all Linux-specific, so adding this flag seems acceptable.
>
> I think your suggestion would open the floodgates for local DoS attacks.
>
I don't think so. After all, the writer already has write permission, which
means truncating the file directly would be an easier way to attack the reader.
> > - We refuse to evict inodes/address_spaces that is writeback error marked. If
> > the number of writeback error marked inodes reaches a limit, we shall
> > just refuse
> > to open new files (or refuse to open new files for writing) .
> > That would NOT take as much memory as retaining the pages themselves as
> > it is per file/inode rather than per byte of the file. Limiting the
> > number of writeback
> > error marked inodes is just like limiting the number of open files
> > we're currently
> > doing
> >
>
> This was one of the suggestions at LSF this year.
>
> That said, we can't just refuse to evict those inodes, as we may
> eventually need the memory. We may have to settle for prioritizing
> inodes that can be cleaned for eviction, and only evict the ones that
> can't when we have no other choice.
>
> Denying new opens is also a potentially helpful for someone wanting to
> do a local DoS attack.
Yes, I think so. I have no good idea about how to avoid it, yet. I'll
think about it.
Maybe someone clever will give us some ideas?
>
> > - Finally, after the system reboots, programs could see (partially)
> > corrupted (out-of-data) files. Since user space programs didn't mean to
> > persist these files (didn't call fsync()), that is fairly reasonable.
>
> --
> Jeff Layton <[email protected]>
>
On Wed, Sep 5, 2018 at 4:18 AM Jeff Layton <[email protected]> wrote:
>
> On Tue, 2018-09-04 at 14:54 -0400, J. Bruce Fields wrote:
> > On Tue, Sep 04, 2018 at 06:23:48PM +0200, Rogier Wolff wrote:
> > > On Tue, Sep 04, 2018 at 12:12:03PM -0400, J. Bruce Fields wrote:
> > > > Well, I think the point was that in the above examples you'd prefer that
> > > > the read just fail--no need to keep the data. A bit marking the file
> > > > (or even the entire filesystem) unreadable would satisfy posix, I guess.
> > > > Whether that's practical, I don't know.
> > >
> > > When you would do it like that (mark the whole filesystem as "in
> > > error") things go from bad to worse even faster. The Linux kernel
> > > tries to keep the system up even in the face of errors.
> > >
> > > With that suggestion, having one application run into a writeback
> > > error would effectively crash the whole system because the filesystem
> > > may be the root filesystem and stuff like "sshd" that you need to
> > > diagnose the problem needs to be read from the disk....
> >
> > Well, the absolutist position on posix compliance here would be that a
> > crash is still preferable to returning the wrong data. And for the
> > cases 焦晓冬 gives, that sounds right? Maybe it's the wrong balance in
> > general, I don't know. And we do already have filesystems with
> > panic-on-error options, so if they aren't used maybe then maybe users
> > have already voted against that level of strictness.
> >
>
> Yeah, idk. The problem here is that this is squarely in the domain of
> implementation defined behavior. I do think that the current "policy"
> (if you call it that) of what to do after a wb error is weird and wrong.
> What we probably ought to do is start considering how we'd like it to
> behave.
>
> How about something like this?
>
> Mark the pages as "uncleanable" after a writeback error. We'll satisfy
> reads from the cached data until someone calls fsync, at which point
> we'd return the error and invalidate the uncleanable pages.
Totally agree with you.
>
> If no one calls fsync and scrapes the error, we'll hold on to it for as
> long as we can (or up to some predefined limit) and then after that
> we'll invalidate the uncleanable pages and start returning errors on
> reads. If someone eventually calls fsync afterward, we can return to
> normal operation.
Agree with you except that using fsync() as `clear_error_mark()` seems
weird and counter-intuitive.
>
> As always though...what about mmap? Would we need to SIGBUS at the point
> where we'd start returning errors on read()?
I think SIGBUS to mmap() is the same thing as EIO to read().
>
> Would that approximate the current behavior enough and make sense?
> Implementing it all sounds non-trivial though...
No.
No problem is reported because nowadays we are relying on the
underlying disk drives. They transparently remap bad sectors and
use S.M.A.R.T. to warn us long before a real EIO would be seen.
As for network filesystems, if I'm not wrong, the close() op calls
fsync() internally, so there is also no problem there.
>
> --
> Jeff Layton <[email protected]>
>
On Wed, Sep 5, 2018 at 4:04 PM Rogier Wolff <[email protected]> wrote:
>
> On Wed, Sep 05, 2018 at 09:39:58AM +0200, Martin Steigerwald wrote:
> > Rogier Wolff - 05.09.18, 09:08:
> > > So when a mail queuer puts mail in the mailq files and the mail processor
> > > can get them out of there intact, nobody is going to notice. (I know
> > > mail queuers should call fsync and report errors when that fails, but
> > > there are bound to be applications where calling fsync is not
> > > appropriate (*))
> >
> > AFAIK at least Postfix MDA only reports mail as being accepted over SMTP
> > once fsync() on the mail file completed successfully. And I'd expect
> > every sensible MDA to do this. I don't know how the Dovecot MDA which I
> > currently use for sieve support does this, though.
>
Is every mail client implementation really going to call fsync()? Why
would they call fsync(), when fsync() is meant to persist the file to
disk, which is apparently unnecessary if the SMTP delivery task won't
restart after a reboot?
> Yes. That's why I added the remark that mailers will call fsync and know
> about it on the write side. In the last few days I encountered a
> situation that, when a developer runs into it while developing, would
> have caused him to write:
> /* Calling this fsync causes unacceptable performance */
> // fsync (fd);
>
> I know of an application somewhere that does realtime-gathering of
> call-records (number X called Y for Z seconds). They come in from a
> variety of sources, get de-duplicated, standardized and written to
> files. Then different output modules push the data to the different
> consumers within the company. Billing among them.
>
> Now getting old data there would be pretty bad. And calling fsync
> all the time might have performance issues....
>
> That's the situation where "old data is really bad".
>
> But when apt-get upgrade replaces your /bin/sh and gets a write error,
> returning an error on subsequent reads is really bad.
At this point, /bin/sh may be partially old and partially new. Executing
such a corrupted binary is also dangerous, though.
>
> It is more difficult than you think.
>
> Roger.
>
Rogier Wolff - 05.09.18, 10:04:
> On Wed, Sep 05, 2018 at 09:39:58AM +0200, Martin Steigerwald wrote:
> > Rogier Wolff - 05.09.18, 09:08:
> > > So when a mail queuer puts mail in the mailq files and the mail
> > > processor can get them out of there intact, nobody is going to
> > > notice. (I know mail queuers should call fsync and report errors
> > > when that fails, but there are bound to be applications where
> > > calling fsync is not appropriate (*))
> >
> > AFAIK at least Postfix MDA only reports mail as being accepted over
> > SMTP once fsync() on the mail file completed successfully. And I'd
> > expect every sensible MDA to do this. I don't know how the Dovecot MDA
> > which I currently use for sieve support does this, though.
>
> Yes. That's why I added the remark that mailers will call fsync and
> know about it on the write side. In the last few days I encountered a
> situation that, when a developer runs into it while developing, would
> have caused him to write:
> /* Calling this fsync causes unacceptable performance */
> // fsync (fd);
Hey, I still have
# KDE Sync
# Re: zero size file after power failure with kernel 2.6.30.5
# http://permalink.gmane.org/gmane.comp.file-systems.xfs.general/30512
export KDE_EXTRA_FSYNC=1
in my ~/.zshrc.
One reason KDE developers did this was that Ext3 was so slow with
fsync(). See also:
Bug 187172 - truncated configuration files on power loss or hard crash
https://bugs.kde.org/187172
> But when apt-get upgrade replaces your /bin/sh and gets a write error,
> returning an error on subsequent reads is really bad.
I sometimes used eatmydata with apt upgrade / dist-upgrade, but yeah,
this asks for trouble on write interruptions.
> It is more difficult than you think.
Heh. :)
Thanks,
--
Martin
On Wed, 2018-09-05 at 16:24 +0800, 焦晓冬 wrote:
> On Wed, Sep 5, 2018 at 4:18 AM Jeff Layton <[email protected]> wrote:
> >
> > On Tue, 2018-09-04 at 14:54 -0400, J. Bruce Fields wrote:
> > > On Tue, Sep 04, 2018 at 06:23:48PM +0200, Rogier Wolff wrote:
> > > > On Tue, Sep 04, 2018 at 12:12:03PM -0400, J. Bruce Fields wrote:
> > > > > Well, I think the point was that in the above examples you'd prefer that
> > > > > the read just fail--no need to keep the data. A bit marking the file
> > > > > (or even the entire filesystem) unreadable would satisfy posix, I guess.
> > > > > Whether that's practical, I don't know.
> > > >
> > > > When you would do it like that (mark the whole filesystem as "in
> > > > error") things go from bad to worse even faster. The Linux kernel
> > > > tries to keep the system up even in the face of errors.
> > > >
> > > > With that suggestion, having one application run into a writeback
> > > > error would effectively crash the whole system because the filesystem
> > > > may be the root filesystem and stuff like "sshd" that you need to
> > > > diagnose the problem needs to be read from the disk....
> > >
> > > Well, the absolutist position on posix compliance here would be that a
> > > crash is still preferable to returning the wrong data. And for the
> > > cases 焦晓冬 gives, that sounds right? Maybe it's the wrong balance in
> > > general, I don't know. And we do already have filesystems with
> > > panic-on-error options, so if they aren't used then maybe users
> > > have already voted against that level of strictness.
> > >
> >
> > Yeah, idk. The problem here is that this is squarely in the domain of
> > implementation defined behavior. I do think that the current "policy"
> > (if you call it that) of what to do after a wb error is weird and wrong.
> > What we probably ought to do is start considering how we'd like it to
> > behave.
> >
> > How about something like this?
> >
> > Mark the pages as "uncleanable" after a writeback error. We'll satisfy
> > reads from the cached data until someone calls fsync, at which point
> > we'd return the error and invalidate the uncleanable pages.
>
> Totally agree with you.
>
> >
> > If no one calls fsync and scrapes the error, we'll hold on to it for as
> > long as we can (or up to some predefined limit) and then after that
> > we'll invalidate the uncleanable pages and start returning errors on
> > reads. If someone eventually calls fsync afterward, we can return to
> > normal operation.
>
> Agree with you except that using fsync() as `clear_error_mark()` seems
> weird and counter-intuitive.
>
That is essentially how fsync (and the errseq_t infrastructure) works.
Once the kernel has hit a wb error, it reports that error to fsync
exactly once per fd. In practice, the errors are not "cleared", but it
appears that way to the fsync caller.
> >
> > As always though...what about mmap? Would we need to SIGBUS at the point
> > where we'd start returning errors on read()?
>
> I think SIGBUS to mmap() is the same thing as EIO to read().
>
> >
> > Would that approximate the current behavior enough and make sense?
> > Implementing it all sounds non-trivial though...
>
> No.
> No problem is reported because nowadays we are relying on the
> underlying disk drives. They transparently remap bad sectors and
> use S.M.A.R.T. to warn us long before a real EIO would be seen.
> As for network filesystems, if I'm not wrong, the close() op calls
> fsync() internally, so there is also no problem there.
There is no requirement for a filesystem to flush data on close(). In
fact, most local filesystems do not. NFS does, but that's because it has
to in order to provide close-to-open cache consistency semantics.
--
Jeff Layton <[email protected]>
On Wed, 2018-09-05 at 09:37 +0200, Martin Steigerwald wrote:
> Jeff Layton - 04.09.18, 17:44:
> > > - If the following read() could be served by a page in memory, just
> > > returns the data. If the following read() could not be served by a
> > > page in memory and the inode/address_space has a writeback error
> > > mark, returns EIO. If there is a writeback error on the file, and
> > > the requested data could not be served
> > > by a page in memory, it means we are reading a (partially)
> > > corrupted (out-of-date)
> > > file. Receiving an EIO is expected.
> >
> > No, an error on read is not expected there. Consider this:
> >
> > Suppose the backend filesystem (maybe an NFSv3 export) is really r/o,
> > but was mounted r/w. An application queues up a bunch of writes that
> > of course can't be written back (they get EROFS or something when
> > they're flushed back to the server), but that application never calls
> > fsync.
> >
> > A completely unrelated application is running as a user that can open
> > the file for read, but not r/w. It then goes to open and read the file
> > and then gets EIO back or maybe even EROFS.
> >
> > Why should that application (which did zero writes) have any reason to
> > think that the error was due to prior writeback failure by a
> > completely separate process? Does EROFS make sense when you're
> > attempting to do a read anyway?
> >
> > Moreover, what is that application's remedy in this case? It just
> > wants to read the file, but may not be able to even open it for write
> > to issue an fsync to "clear" the error. How do we get things moving
> > again so it can do what it wants?
> >
> > I think your suggestion would open the floodgates for local DoS
> > attacks.
>
> I wonder whether a new error for reporting writeback errors like this
> could help out of the situation. But from all I read here so far, this
> is a really challenging situation to deal with.
>
> I still remember how AmigaOS dealt with this case and from a usability
> point of view it was close to ideal: If a disk was removed, like a
> floppy disk, a network disk provided by Envoy or even a hard disk, it
> pops up a dialog "You MUST insert volume <name of volume> again". And if
> you did, it continued writing. That worked even with networked devices.
> I tested it. I unplugged the ethernet cable and replugged it and it
> continued writing.
>
> I can imagine that this would be quite challenging to implement within
> Linux. I remember a Google Summer of Code project for NetBSD was at
> least offered to implement this, but I never got to know
> whether it was taken or even implemented. If so it might serve as an
> inspiration. Anyway AmigaOS did this even for stationary hard disks. I
> had the issue of a flaky connection through IDE to SCSI and then SCSI to
> UWSCSI adapter. And when the hard disk had connection issues that dialog
> popped up, with the name of the operating system volume for example.
>
> Every access to it was blocked then. It simply blocked all processes
> that accessed it till it became available again (usually I rebooted in
> the case of a stationary device because I had to open the case, or no
> hot-plug was available or working).
>
> But AFAIR AmigaOS also did not have a notion of caching writes for
> longer than maybe a few seconds or so and I think just within the device
> driver. Writes were (almost) immediate. There have been some
> asynchronous I/O libraries and I would expect a delay in the dialog
> popping up in that case.
>
> It would be challenging to implement for Linux even just for removable
> devices. You have page dirtying and delayed writeback – which is still
> a performance issue with NFS at 1 GBit, rsync from local storage that
> is faster than 1 GBit and huge files, reducing dirty memory ratio may
> help to halve the time needed to complete the rsync copy operation. And
> you would need to communicate all the way to userspace to let the user
> know about the issue.
>
You may be interested in Project Banbury:
http://www.wil.cx/~willy/banbury.html
> Still, at least for removable media, this would be almost the most
> usability friendly approach. With robust filesystems (Amiga Old
> Filesystem and Fast Filesystem was not robust in case of sudden write
> interruption, so the "MUST" was mean that way) one may even offer
> "Please insert device <name of device> again to write out unwritten data
> or choose to discard that data" in a dialog. And for removable media it
> may even work as blocking processes that access it usually would not
> block the whole system. But for the operating system disk? I know how
> Plasma desktop behaves during massive I/O operations. It usually just
> completely stalls to a halt. It seems to me that its processes do some
> I/O almost all of the time … or that the Linux kernel blocks other
> syscalls too during heavy I/O load.
>
> I just liked to mention it as another crazy idea. But I bet it would
> practically need to rewrite the I/O subsystem in Linux to a great
> extent, probably diminishing its performance in situations of write
> pressure. Or maybe a genius finds a way to implement both. :)
>
> What I do think though is that the dirty page caching of Linux with its
> current standard settings is excessive. 5% / 10% of available memory
> often is a lot these days. There has been a discussion about reducing the
> default, but AFAIK it was never done. Linus suggested in that discussion
> reducing it to about what the storage can write out in 3 to 5 seconds.
> That may even help with error reporting, as reducing the dirty memory
> ratio will reduce memory pressure, and so you may choose to add some
> memory allocations for error handling. And the time till you know it's
> not working may be less.
>
--
Jeff Layton <[email protected]>
On 2018-09-05 04:37, 焦晓冬 wrote:
> On Wed, Sep 5, 2018 at 4:04 PM Rogier Wolff <[email protected]> wrote:
>>
>> On Wed, Sep 05, 2018 at 09:39:58AM +0200, Martin Steigerwald wrote:
>>> Rogier Wolff - 05.09.18, 09:08:
>>>> So when a mail queuer puts mail in the mailq files and the mail processor
>>>> can get them out of there intact, nobody is going to notice. (I know
>>>> mail queuers should call fsync and report errors when that fails, but
>>>> there are bound to be applications where calling fsync is not
>>>> appropriate (*))
>>>
>>> AFAIK at least Postfix MDA only reports mail as being accepted over SMTP
>>> once fsync() on the mail file completed successfully. And I'd expect
>>> every sensible MDA to do this. I don't know how the Dovecot MDA which I
>>> currently use for sieve support does this, though.
>>
>
> Is every mail client implementation really going to call fsync()? Why
> would they call fsync(), when fsync() is meant to persist the file to
> disk, which is apparently unnecessary if the SMTP delivery task won't
> restart after a reboot?
Not mail clients, the actual servers. If they implement the SMTP
standard correctly, they _have_ to call fsync() before they return that
an email was accepted for delivery or relaying, because SMTP requires
that a successful return means that the system can actually attempt
delivery, which is not guaranteed if they haven't verified that it's
actually written out to persistent storage.
>
>> Yes. That's why I added the remark that mailers will call fsync and know
>> about it on the write side. In the last few days I encountered a
>> situation that, when a developer runs into it while developing, would
>> have caused him to write:
>> /* Calling this fsync causes unacceptable performance */
>> // fsync (fd);
>>
>> I know of an application somewhere that does realtime-gathering of
>> call-records (number X called Y for Z seconds). They come in from a
>> variety of sources, get de-duplicated, standardized and written to
>> files. Then different output modules push the data to the different
>> consumers within the company. Billing among them.
>>
>> Now getting old data there would be pretty bad. And calling fsync
>> all the time might have performance issues....
>>
>> That's the situation where "old data is really bad".
>>
>> But when apt-get upgrade replaces your /bin/sh and gets a write error,
>> returning an error on subsequent reads is really bad.
>
> At this point, /bin/sh may be partially old and partially new. Executing
> such a corrupted binary is also dangerous, though.
But the system may still be usable in that state, while returning an
error there guarantees it isn't. This is, in general, not the best
example though, because no sane package manager directly overwrites
_anything_, they all do some variation on replace-by-rename and call
fsync _before_ renaming, so this situation is not realistically going to
happen on any real system.
On Wed, Sep 05, 2018 at 06:55:15AM -0400, Jeff Layton wrote:
> There is no requirement for a filesystem to flush data on close().
And you can't start doing things like that. In some weird cases, you
might have an application open-write-close files at a much higher rate
than what a harddisk can handle. And this has worked for years because
the kernel caches stuff from inodes and data-blocks. If you suddenly
write stuff to harddisk at 10ms for each seek between inode area and
data-area... You end up limited to about 50 of these open-write-close
cycles per second.
My home system is now able to make/write/close about 100000 files per
second.
assurancetourix:~/testfiles> time ../a.out 100000 000
0.103u 0.999s 0:01.10 99.0% 0+0k 0+800000io 0pf+0w
(The test program was accessing arguments beyond the end of its argument
list; an extra argument for this one-time program was easier than
open/fix/recompile.)
Roger.
On Wed, Sep 05, 2018 at 08:07:25AM -0400, Austin S. Hemmelgarn wrote:
> On 2018-09-05 04:37, 焦晓冬 wrote:
> >At this point, /bin/sh may be partially old and partially new. Executing
> >such a corrupted binary is also dangerous, though.
> But the system may still be usable in that state, while returning an
> error there guarantees it isn't. This is, in general, not the best
> example though, because no sane package manager directly overwrites
> _anything_, they all do some variation on replace-by-rename and call
> fsync _before_ renaming, so this situation is not realistically
> going to happen on any real system.
Again, the "returning an error guarantees it isn't" is what's
important here. A lot of scenarios exist where a slightly less
important file than "/bin/sh" would trigger such a complete
system failure. But there are a whole lot of files that can be pretty
critical for a system where "old value" is better than "all programs
get an error now". So when you propose "reads should now return an error",
you really need to think things through.
It is not enough to say "but I encountered a situation where returning
an error was preferable"; you need to think through the counter-cases
that others might run into that would make the new situation worse
than before.
Roger.
On Wed, Sep 05, 2018 at 04:09:42PM +0800, 焦晓冬 wrote:
> Well, since the reader application and the writer application are reading
> the same file, they are indeed related. The reader here is expecting
> to read the latest data the writer offers, not just any data available. The
> reader is surely not expecting to read partially new and partially old data.
> Right? And that `read() should return the latest write()` by POSIX
> supports this expectation.
Unix, and therefore Linux's, core assumption is that the primary
abstraction is the file. So if you say that all applications which
read or write the same file are related, that's equivalent to saying
"all applications are related". Consider that a text editor can read a
config file, or a source file, or any other text file. Consider shell
commands such as "cat", "sort", "uniq". Heck, /bin/cp copies
any type of file. Does that mean that /bin/cp, as a reader
application, is related to all applications on the system?
The real problem here is that we're trying to guess the motivations
and usage of programs that are reading the file, and there's no good
way to do that. It could be that the reader is someone who wants to
be informed that file is in page cache, but was never persisted to
disk. It could be that the user has figured out something has gone
terribly wrong, and is desperately trying to rescue all the data she
can by copying it to another disk. In that case, stopping the reader
from being able to access the contents is exactly the wrong thing to
do if what you care about is preventing data loss.
The other thing which you seem to be assuming is that applications
which care about precious data won't use fsync(2). And in general,
it's been fairly well known for decades that if you care about your
data, you have to use fsync(2) or O_DIRECT writes; and you *must*
check the error return of both the fsync(2) and the close(2) system
calls. Emacs got that right in the mid-1980's --- over 30 years ago.
We mocked GNOME and KDE's toy notepad applications for getting this
wrong a decade ago, and they've since fixed it.
Actually, the GNOME and KDE applications, because they were too lazy
to persist the xattr and ACL's, decided it was better to truncate the
file and then rewrite it. So if you crashed after the
truncate... your data was toast. This was a decade ago, and again, it
was considered spectacularly bad application programming then, and it's
since been fixed. The point here is that there will always be lousy
application programs. And it is a genuine systems design question how
much should we sacrifice performance and efficiency to accommodate
stupid application programs.
For example, we could make close(2) imply an fsync(2), and return the
error in close(2). But *that* assumes that applications actually
check the return value for close(2) --- and there will be those that
don't. This would completely trash performance for builds, since it
would slow down writing generated files such as all the *.o object
files. Which since they are generated files, they aren't precious.
So forcing an fsync(2) after writing all of those files will destroy
your system performance.
- Ted
On Wed, Sep 05, 2018 at 06:55:15AM -0400, Jeff Layton wrote:
> There is no requirement for a filesystem to flush data on close(). In
> fact, most local filesystems do not. NFS does, but that's because it has
> to in order to provide close-to-open cache consistency semantics.
And these days even NFS can delay writeback till after close thanks to
write delegations.
--b.
On Wed, Sep 05, 2018 at 02:07:46PM +0200, Rogier Wolff wrote:
> On Wed, Sep 05, 2018 at 06:55:15AM -0400, Jeff Layton wrote:
> > There is no requirement for a filesystem to flush data on close().
>
> And you can't start doing things like that.
Of course we can. And we do.
We've been doing targetted flush-on-close for years in some
filesystems because applications don't use fsync where they should
and users blame the filesystems for losing their data.
i.e. we do what the applications should have done but don't because
"fsync is slow". Another common phrase I hear is "I don't need
fsync because I don't see any problems in my testing". They don't
see problems because application developers typically don't do power
fail testing of their applications and hence never trigger the
conditions needed to expose the bugs in their applications.
> In some weird cases, you
> might have an application open-write-close files at a much higher rate
> than what a harddisk can handle.
It's not a weird case - the kernel NFSD does this for every write
request it receives.
> And this has worked for years because
> the kernel caches stuff from inodes and data-blocks. If you suddenly
> write stuff to harddisk at 10ms for each seek between inode area and
> data-area..
You're assuming an awful lot about filesystem implementation here.
Neither ext4, btrfs nor XFS issues physical IO like this when flushing
data.
> You end up limited to about 50 of these open-write-close
> cycles per second.
You're also conflating "flushing data" with "synchronous".
fsync() is a synchronous data flush because we have defined it that way
- it has to wait for IO completion *after* flushing. However, we can
use other methods of flushing data that don't need to wait for
completion, or we can issue synchronous IOs concurrently (e.g. via
AIO or threads) so flushes don't block applications from doing
real work while waiting for IO. Examples are below.
> My home system is now able to make/write/close about 100000 files per
> second.
>
> assurancetourix:~/testfiles> time ../a.out 100000 000
> 0.103u 0.999s 0:01.10 99.0% 0+0k 0+800000io 0pf+0w
Now you've written 100k files in a second into the cache, how long
does it take the system to flush them out to stable storage? If the
data is truly so ephemeral it's going to be deleted before it's
written back, then why the hell write it to the filesystem in the
first place?
I use fsmark for open-write-close testing to drive through the
cache phase into sustained IO-at-resource-exhaustion behaviour. I
also use multiple threads to drive the system to being IO bound
before it runs out of CPU.
From one of my test scripts for creating 10 million 4k files to test
background writeback behaviour:
# ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
# Version 3.3, 8 thread(s) starting at Thu Sep 6 09:37:52 2018
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 10000 subdirectories with 1800 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse% Count Size Files/sec App Overhead
0 80000 4096 99066.0 545242
0 160000 4096 100528.7 579256
0 240000 4096 95789.6 600522
0 320000 4096 102129.8 532474
0 400000 4096 89551.3 581729
[skip rest of clean cache phase; dirty cache throttling begins and
we enter the sustained IO performance phase]
0 1360000 4096 32222.3 685659
0 1440000 4096 35473.6 693983
0 1520000 4096 34753.0 693478
0 1600000 4096 35540.4 690363
....
So, I see 100k files/s on an idle page cache, and 35k files/s
(170MB/s at ~2000 IOPS) when background writeback kicks in
and it's essentially in a sustained IO bound state.
Turning that into open-write-fsync-close:
-S Sync Method (0:No Sync, 1:fsyncBeforeClose,
....
Yeah, that sucks - it's about 1000 files/s (~5MB/s and 2000 IOPS),
because it's *synchronous* writeback and I'm only driving 8 threads.
Note that the number of IOPS is almost identical to the "no fsync"
case above - the disk utilisation is almost identical for the two
workloads.
However, let's turn that into an open-write-flush-close operation by
using AIO and not waiting for the fsync to complete before closing
the file. I have this in fsmark, because lots of people with IO
intensive apps have been asking for it over the past 10 years.
-A <use aio_fsync>
.....
FSUse% Count Size Files/sec App Overhead
0 80000 4096 28770.5 1569090
0 160000 4096 31340.7 1356595
0 240000 4096 30803.0 1423583
0 320000 4096 30404.5 1510099
0 400000 4096 30961.2 1500736
Yup, it's pretty much the same throughput as background async
writeback. It's a little slower - about 160MB/s and 2,500 IOPS -
due to the increase in overall journal writes caused by the fsync
calls.
What's clear, however, is that we're retiring 10 userspace fsync
operations for every physical disk IO here, as opposed to 2 IOs per
fsync in the above case. Put simply, the assumption that
applications can't do more flush/fsync operations than disk IOs is
not valid, and that performance of open-write-flush-close workloads
on modern filesystems isn't anywhere near as bad as you think it is.
To mangle a common saying into storage speak:
"Caches are for show, IO is for go"
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Sep 06, 2018 at 12:57:09PM +1000, Dave Chinner wrote:
> On Wed, Sep 05, 2018 at 02:07:46PM +0200, Rogier Wolff wrote:
> > And this has worked for years because
> > the kernel caches stuff from inodes and data-blocks. If you suddenly
> > write stuff to harddisk at 10ms for each seek between inode area and
> > data-area..
>
> You're assuming an awful lot about filesystem implementation here.
> Neither ext4, btrfs or XFS issue physical IO like this when flushing
> data.
My thinking is: When fsync (implicit or explicit) needs to know
the result of the underlying IO, it needs to wait for it to have
happened.
To elaborate: you can either log the data in the logfile or just the
metadata. By default, most people will choose the latter. In the "make sure
it hits storage" case, you have three areas:
* the logfile
* the inode area
* the data area.
When you allow the application to continue past a close, you can gather
up, say, a few megabytes of updates to each area and do say 50 seeks per
second (achieving maybe about 50% of the throughput performance of
your drive).
If you don't store the /data/, you can stay in the inode or logfile
area and get a high throughput on your drive. But when a crash has the
filesystem in a defined state, what use is that if your application is
in a bad state because it is getting bad data?
Of course the application can be rewritten to have multiple threads so
that while one thread is waiting for a close to finish another one can
open/write/close another file. But there are existing applications run
by users who do not have the knowledge or option to delve into the
source and rewrite the application to be multithreaded.
Your 100k files per second is close to mine. In real life we
are not going to see such extreme numbers, but in some cases the
benchmark does predict a part of the performance of an application.
In practice, an application may spend 50% of the time on thinking
about the next file to make, and then 50k times per second actually
making the file.
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
The plan was simple, like my brother-in-law Phil. But unlike
Phil, this plan just might work.
On Wed, Sep 5, 2018 at 4:09 PM 焦晓冬 <[email protected]> wrote:
>
> On Tue, Sep 4, 2018 at 11:44 PM Jeff Layton <[email protected]> wrote:
> >
> > On Tue, 2018-09-04 at 22:56 +0800, 焦晓冬 wrote:
> > > On Tue, Sep 4, 2018 at 7:09 PM Jeff Layton <[email protected]> wrote:
> > > >
> > > > On Tue, 2018-09-04 at 16:58 +0800, Trol wrote:
> > > > > On Tue, Sep 4, 2018 at 3:53 PM Rogier Wolff <[email protected]> wrote:
> > > > >
> > > > > ...
> > > > > > >
> > > > > > > Jlayton's patch is simple but wonderful idea towards correct error
> > > > > > > reporting. It seems one crucial thing is still here to be fixed. Does
> > > > > > > anyone have some idea?
> > > > > > >
> > > > > > > The crucial thing may be that a read() after a successful
> > > > > > > open()-write()-close() may return old data.
> > > > > > >
> > > > > > > That may happen where an async writeback error occurs after close()
> > > > > > > and the inode/mapping get evicted before read().
> > > > > >
> > > > > > Suppose I have 1Gb of RAM. Suppose I open a file, write 0.5Gb to it
> > > > > > and then close it. Then I repeat this 9 times.
> > > > > >
> > > > > > Now, when writing those files to storage fails, there is 5Gb of data
> > > > > > to remember and only 1Gb of RAM.
> > > > > >
> > > > > > I can choose any part of that 5Gb and try to read it.
> > > > > >
> > > > > > Please make a suggestion about where we should store that data?
> > > > >
> > > > > That certainly cannot be done. But at least, shall we report an
> > > > > error on read()? Silently returning wrong data may cause further damage,
> > > > > such as removing the wrong files because they were marked as garbage in the old file.
> > > > >
> > > >
> > > > Is the data wrong though? You tried to write and then that failed.
> > > > Eventually we want to be able to get at the data that's actually in the
> > > > file -- what is that point?
> > >
> > > The point is that silent data corruption is dangerous. I would prefer
> > > getting an error back to receiving wrong data.
> > >
> >
> > Well, _you_ might like that, but there are whole piles of applications
> > that may fall over completely in this situation. Legacy usage matters
> > here.
> >
> > > A practical and concrete example may be:
> > > A disk cleaner program first searches for garbage files that won't be used
> > > anymore, saves the list in a file (open()-write()-close()), and waits for the
> > > user to confirm the list of files to be removed. A writeback error occurs
> > > and the related page/inode/address_space gets evicted while the user is
> > > taking a long thought about it. Finally, the user hits enter and the
> > > cleaner begins to open() and read() the list again. But what gets removed
> > > is the old list of files that was generated several months ago...
> > >
> > > Another example may be,
> > > An email editor and a busy mail sender. A well written mail to my boss is
> > > composed by this email editor and is saved in a file (open()-write()-close()).
> > > The mail sender gets notified with the path of the mail file to queue it and
> > > send it later. A writeback error occurs and the related
> > > page/inode/address_space gets evicted while the mail is still waiting in the
> > > queue of the mail sender. Finally, the mail file is open() read() by the sender,
> > > but what is sent is the mail to my girlfriend that was composed yesterday...
> > >
> > > In both cases, the files are not meant to be persisted onto the disk.
> > > So, fsync()
> > > is not likely to be called.
> > >
> >
> > So at what point are you going to give up on keeping the data? The
> > fundamental problem here is an open-ended commitment. We (justifiably)
> > avoid those in kernel development because it might leave the system
> > without a way out of a resource crunch.
> >
> > > >
> > > > If I get an error back on a read, why should I think that it has
> > > > anything at all to do with writes that previously failed? It may even
> > > > have been written by a completely separate process that I had nothing at
> > > > all to do with.
> > > >
> > > > > As I can see, that is all about error reporting.
> > > > >
> > > > > As a suggestion, maybe the error flag of the inode/mapping, or the entire inode,
> > > > > should not be evicted if there was an error. That hopefully won't take much
> > > > > memory. In extreme conditions, where too many errored inodes require staying
> > > > > in memory, maybe we should panic rather than spread the error.
> > > > >
> > > > > >
> > > > > > In the easy case, where the data easily fits in RAM, you COULD write a
> > > > > > solution. But when the hardware fails, the SYSTEM will not be able to
> > > > > > follow the posix rules.
> > > > >
> > > > > Nope, we are able to follow the rules. The above is one way that follows the
> > > > > POSIX rules.
> > > > >
> > > >
> > > > This is something we discussed at LSF this year.
> > > >
> > > > We could attempt to keep dirty data around for a little while, at least
> > > > long enough to ensure that reads reflect earlier writes until the errors
> > > > can be scraped out by fsync. That would sort of redefine fsync from
> > > > being "ensure that my writes are flushed" to "synchronize my cache with
> > > > the current state of the file".
> > > >
> > > > The problem of course is that applications are not required to do fsync
> > > > at all. At what point do we give up on it, and toss out the pages that
> > > > can't be cleaned?
> > > >
> > > > We could allow for a tunable that does a kernel panic if writebacks fail
> > > > and the errors are never fetched via fsync, and we run out of memory. I
> > > > don't think that is something most users would want though.
> > > >
> > > > Another thought: maybe we could OOM kill any process that has the file
> > > > open and then toss out the page data in that situation?
> > > >
> > > > I'm wide open to (good) ideas here.
> > >
> > > As I said above, silent data corruption is dangerous, and maybe we really
> > > should report errors to user space even in desperate cases.
> > >
> > > One possible approach may be:
> > >
> > > - When a writeback error occurs, mark the page clean and remember the error
> > > in the inode/address_space of the file.
> > > I think that is what the kernel is doing currently.
> > >
> >
> > Yes.
> >
> > > - If the following read() can be served by a page in memory, just return the
> > > data. If it cannot be served by a page in memory and the
> > > inode/address_space has a writeback error mark, return EIO.
> > > If there is a writeback error on the file, and the requested data cannot
> > > be served by a page in memory, it means we are reading a (partially)
> > > corrupted (out-of-date) file. Receiving an EIO is expected.
> > >
> >
> > No, an error on read is not expected there. Consider this:
> >
> > Suppose the backend filesystem (maybe an NFSv3 export) is really r/o,
> > but was mounted r/w. An application queues up a bunch of writes that of
> > course can't be written back (they get EROFS or something when they're
> > flushed back to the server), but that application never calls fsync.
> >
> > A completely unrelated application is running as a user that can open
> > the file for read, but not r/w. It then goes to open and read the file
> > and then gets EIO back or maybe even EROFS.
> >
> > Why should that application (which did zero writes) have any reason to
> > think that the error was due to prior writeback failure by a completely
> > separate process? Does EROFS make sense when you're attempting to do a
> > read anyway?
>
> Well, since the reader application and the writer application are reading
> the same file, they are indeed related. The reader here is expecting
> to read the latest data the writer offers, not just any available data. The
> reader is surely not expecting to read partially new and partially old data.
> Right? And that `read() should return the latest write()` requirement in
> POSIX supports this expectation.
>
> When we can provide the latest data that is expected, we just give
> them the data. If we cannot, we give back an error. That is much like
> returning an error when the network condition is bad and only part of the
> latest data could be successfully fetched.
>
> No, EROFS makes no sense. An EROFS from writeback should be converted
> to EIO on read.
>
> >
> > Moreover, what is that application's remedy in this case? It just wants
> > to read the file, but may not be able to even open it for write to issue
> > an fsync to "clear" the error. How do we get things moving again so it
> > can do what it wants?
>
> At this point we have lost the latest data. I don't think using fsync() as
> a clear_error_flag() is a good idea. The data of the file is now partially
> old and partially new, so the content of the file is unpredictable. The
> application may just want to simply remove it. If the application really
> wants this corrupted data in order to restore what it can, adding a new
> open() flag named O_IGNORE_ERROR_IF_POSSIBLE may be a good way to
> support that. The O_DIRECT, O_NOATIME, O_PATH, and O_TMPFILE flags
> are all Linux-specific, so adding this flag seems acceptable.
>
> >
> > I think your suggestion would open the floodgates for local DoS attacks.
> >
>
> I don't think so. After all, the writer already has write permission, which
> means truncating the file directly would be an easier way to attack the reader.
>
> > > - We refuse to evict inodes/address_spaces that are marked with a writeback
> > > error. If the number of writeback-error-marked inodes reaches a limit, we
> > > just refuse to open new files (or refuse to open new files for writing).
> > > That would NOT take as much memory as retaining the pages themselves, as
> > > it is per file/inode rather than per byte of the file. Limiting the number
> > > of writeback-error-marked inodes is just like the limit on the number of
> > > open files we currently have.
> > >
> >
> > This was one of the suggestions at LSF this year.
> >
> > That said, we can't just refuse to evict those inodes, as we may
> > eventually need the memory. We may have to settle for prioritizing
> > inodes that can be cleaned for eviction, and only evict the ones that
> > can't when we have no other choice.
> >
> > Denying new opens is also potentially helpful for someone wanting to
> > do a local DoS attack.
>
> Yes, I think so. I have no good idea about how to avoid it, yet. I'll
> think about it.
> Maybe someone else clever would give us some idea?
Well, think about how filesystems deal with file metadata. Modern journaling
filesystems first cache metadata modifications in memory, and then write a
journal of the modification operations to disk. Only after these steps
complete successfully do they start performing the real modifications to the
on-disk metadata. Finally, the journal entry is removed. If a power failure
happens, replay starts again from the first operation in the journal. Thus,
metadata is always consistent even after a power failure.
If EIO happens while writing the cached metadata modification operations to
disk, what does the journaling filesystem do? It certainly can't drop the
cached operation that failed to be written and carry on, or the filesystem
will become inconsistent, e.g., an already allocated block may be allocated
again in the future.
The metadata operation cache is limited in memory, just as the
dentry/inode/address_space/error-flag cache is limited in memory.
So, the way journaling filesystems deal with EIO under the limited memory of
the metadata operation cache applies to us dealing with EIO under the limited
memory of the dentry/inode/address_space/error-flag cache, without opening a
further potential DoS backdoor.
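The journaling sequence described above can be sketched in miniature. Everything here is a hypothetical simplification for illustration, not any real filesystem's on-disk format:

```c
#include <stddef.h>

/* Illustrative model of journal replay: operations are logged first,
 * applied to "disk" second, and replay after a crash re-applies any
 * logged-but-unapplied operation. Re-applying is idempotent, which is
 * why replay can safely start from the first journal entry. */
enum op_state { OP_LOGGED, OP_APPLIED };

struct journal_op {
    int block;            /* metadata block to modify */
    int new_val;          /* value the operation writes */
    enum op_state state;  /* logged to the journal, or applied to disk */
};

void journal_replay(struct journal_op *ops, size_t n, int *disk)
{
    for (size_t i = 0; i < n; i++) {
        if (ops[i].state == OP_LOGGED) {
            disk[ops[i].block] = ops[i].new_val;  /* re-apply */
            ops[i].state = OP_APPLIED;
        }
    }
}
```

The key property the argument relies on is that the journal is a bounded cache of pending operations, and nothing is dropped on error before it reaches the disk.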
>
> >
> > > - Finally, after the system reboots, programs could see (partially)
> > > corrupted (out-of-date) files. Since user space programs didn't mean to
> > > persist these files (didn't call fsync()), that is fairly reasonable.
> >
> > --
> > Jeff Layton <[email protected]>
> >
On Thu, 6 Sep 2018 11:17:18 +0200
Rogier Wolff <[email protected]> wrote:
> On Thu, Sep 06, 2018 at 12:57:09PM +1000, Dave Chinner wrote:
> > On Wed, Sep 05, 2018 at 02:07:46PM +0200, Rogier Wolff wrote:
>
> > > And this has worked for years because
> > > the kernel caches stuff from inodes and data-blocks. If you suddenly
> > > write stuff to harddisk at 10ms for each seek between inode area and
> > > data-area..
> >
> > You're assuming an awful lot about filesystem implementation here.
> > Neither ext4, btrfs or XFS issue physical IO like this when flushing
> > data.
>
> My thinking is: When fsync (implicit or explicit) needs to know
> the result of the underlying IO, it needs to wait for it to have
> happened.
Worse than that. In many cases it needs to wait for the I/O command to
have been accepted and confirmed by the drive, then tell the disk to do a
commit to physical media, then see if that blows up. A confirmation the
disk got the data is not a confirmation that it's stable. Your disk can
also reply from its internal cache with data that will fail to hit the
media a few seconds later.
Given that a cache flush on an ATA disk can take 7 seconds, I'm not fond of it
8) Fortunately spinning rust is on the way out.
It's even uglier in truth. Spinning rust rewrites sectors under you
by magic without your knowledge, and in freaky cases data you've not
even touched this month can turn into errors. Flash has some
similar behaviour, although it can at least use a supercap to do real work.
You can also issue things like a single 16K write and have only the last
8K succeed and the drive report an error, which freaks out some supposedly
robust techniques.
Alan
> The other thing which you seem to be assuming is that applications
> which care about precious data won't use fsync(2). And in general,
> it's been fairly well known for decades that if you care about your
> data, you have to use fsync(2) or O_DIRECT writes; and you *must*
> check the error return of both the fsync(2) and the close(2) system
> calls. Emacs got that right in the mid-1980's --- over 30 years ago.
> We mocked GNOME and KDE's toy notepad applications for getting this
> wrong a decade ago, and they've since fixed it.
That's also because our fsync no longer sucks rocks. It used to be
possible for a box under heavy disk I/O to take minutes to fsync a file
because our disk scheduling was so awful (hours if doing a backup to USB
stick).
The problem I think actually is a bit different. There isn't an
int fbarrier(int fd, ...);
call with more relaxed semantics so that you can say 'what I have done so
far must not be consumed by a reader until we are sure it is stable, but
I don't actually need it to hit disk right now'. That's just a flag on
buffers saying 'if we try to read this, make sure we write it out first',
and the flag is cleared as the buffer hits media in writeback.
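As a rough userspace analogy of that "flag on buffers" idea (all names here are hypothetical; no fbarrier() syscall exists today):

```c
#include <stdbool.h>
#include <string.h>

/* A buffer carries a "must be stable before consumption" flag set by
 * the barrier; a reader lazily forces writeback before consuming.
 * This is a toy model of the proposed semantics, not real kernel code. */
struct lazy_buf {
    char data[256];
    bool needs_flush;   /* set by the 'barrier', cleared by writeback */
    int  flush_count;   /* how many times a reader forced writeback */
};

static void writeback(struct lazy_buf *b)
{
    /* stand-in for real IO hitting stable media */
    b->needs_flush = false;
    b->flush_count++;
}

void buf_write(struct lazy_buf *b, const char *s)
{
    strncpy(b->data, s, sizeof(b->data) - 1);
    b->data[sizeof(b->data) - 1] = '\0';
    b->needs_flush = true;          /* barrier: stable before consumption */
}

const char *buf_read(struct lazy_buf *b)
{
    if (b->needs_flush)
        writeback(b);               /* stall this reader until stable */
    return b->data;
}
```

The point of the lazy scheme is that a writer pays nothing; only a reader that actually races the writeback ever waits.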
All of this is still probabilities. I can do all the fsync's I like,
consume the stable data, cause actions over the network like sending
people goods, and then the server is destroyed by a power surge.
Transactions are a higher level concept and the kernel can't fix that.
Alan
> write()
> kernel attempts to write back page and fails
> page is marked clean and evicted from the cache
> read()
>
> Now your write is gone and there were no calls between the write and
> read.
>
> The question we still need to answer is this:
>
> When we attempt to write back some data from the cache and that fails,
> what should happen to the dirty pages?
Why do you care about the content of the pages at that point? The only
options are to use the data (today's model), or to report that you are on
fire.
If you are going to error you don't need to use the data so you could in
fact compress dramatically the amount of stuff you need to save
somewhere. You need the page information so you can realize what page
this is, but you can point the data into oblivion somewhere because you
are no longer going to give it to anyone (assuming you can successfully
force unmap it from everyone once it's not locked by a DMA or similar).
In the real world though it's fairly unusual to just lose a bit of I/O.
Flash devices in particular have a nasty tendency to simply go *poof*, and
the first you know about an I/O error is the last data the drive ever
gives you short of JTAG. NFS is an exception, and NFS soft timeouts are
nasty.
Alan
On Tue, 2018-09-25 at 00:30 +0100, Alan Cox wrote:
> > write()
> > kernel attempts to write back page and fails
> > page is marked clean and evicted from the cache
> > read()
> >
> > Now your write is gone and there were no calls between the write and
> > read.
> >
> > The question we still need to answer is this:
> >
> > When we attempt to write back some data from the cache and that fails,
> > what should happen to the dirty pages?
>
> Why do you care about the content of the pages at that point? The only
> options are to use the data (today's model), or to report that you are on
> fire.
>
The data itself doesn't matter much. What does matter is consistent
behavior in the face of such an error. The issue (IMO) is that
currently, the result of a read that takes place after a write but
before an fsync is indeterminate.
If writeback succeeded (or hasn't been done yet) you'll get back the
data you wrote, but if there was a writeback error you may or may not.
The behavior in that case mostly depends on the whim of the filesystem
developer, and they all behave somewhat differently.
> If you are going to error you don't need to use the data so you could in
> fact compress dramatically the amount of stuff you need to save
> somewhere. You need the page information so you can realize what page
> this is, but you can point the data into oblivion somewhere because you
> are no longer going to give it to anyone (assuming you can successfully
> force unmap it from everyone once it's not locked by a DMA or similar).
>
> In the real world though it's fairly unusual to just lose a bit of I/O.
> Flash devices in particular have a nasty tendency to simply go *poof*, and
> the first you know about an I/O error is the last data the drive ever
> gives you short of JTAG. NFS is an exception, and NFS soft timeouts are
> nasty.
>
Linux has dozens of filesystems and they all behave differently in this
regard. A catastrophic failure (paradoxically) makes things simpler for
the fs developer, but even on local filesystems isolated errors can
occur. It's also not just NFS -- what mostly started me down this road
was working on ENOSPC handling for CephFS.
I think it'd be good to at least establish a "gold standard" for what
filesystems ought to do in this situation. We might not be able to
achieve that in all cases, but we could then document the exceptions.
--
Jeff Layton <[email protected]>
On Tue, Sep 25, 2018 at 07:15:34AM -0400, Jeff Layton wrote:
> Linux has dozens of filesystems and they all behave differently in this
> regard. A catastrophic failure (paradoxically) makes things simpler for
> the fs developer, but even on local filesystems isolated errors can
> occur. It's also not just NFS -- what mostly started me down this road
> was working on ENOSPC handling for CephFS.
>
> I think it'd be good to at least establish a "gold standard" for what
> filesystems ought to do in this situation. We might not be able to
> achieve that in all cases, but we could then document the exceptions.
I'd argue the standard should be the precedent set by AFS and NFS.
AFS verifies space available on close(2) and returns ENOSPC from the
close(2) system call if space is not available. At MIT Project
Athena, where we used AFS extensively in the late 80's and early 90's,
we made and contributed back changes to avoid data loss as a result of
quota errors.
The best practice that should be documented for userspace is: when
writing precious files[1], programs should open foo.new for writing, write
out the data, call fsync() and check the error return, call close()
and check the error return, and then call rename(foo.new, foo) and
check the error return. Writing a library function which does this,
and which also copies the ACLs and xattrs from foo to foo.new before
the rename(), would probably help, but not as much as we might think.
[1] That is, editors writing source files, but not compilers and
similar programs writing object files and other generated files.
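That best practice might be sketched as a helper like the following; the function name and the simplified error handling are illustrative only (a production version would also preserve ACLs/xattrs and retry on EINTR):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write a precious file safely: write foo.new, fsync, close (checking
 * every error return), then atomically rename over the target. */
int write_precious(const char *path, const char *buf, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.new", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t)len ||  /* write out the data */
        fsync(fd) != 0) {                       /* force it to stable storage */
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {          /* close(2) can report deferred errors too */
        unlink(tmp);
        return -1;
    }
    return rename(tmp, path);      /* atomically replace the old file */
}
```

On any failure the old copy of foo is untouched, which is the whole point of the pattern.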
None of this is really all that new. We had the same discussion back
during the O_PONIES controversy, and we came out in the same place.
- Ted
P.S. One thought: it might be cool if there was some way for
userspace applications to mark files with "nuke if not closed" flag,
such that if the system crashes, the file systems would automatically
unlink the file after a reboot or if the process was killed or exits
without an explicit close(2). For networked/remote file systems that
supported this flag, after the client comes back up after a reboot, it
could notify the server that all files created previously from that
client should be unlinked.
Unlike O_TMPFILE, this would require file system changes to support,
so maybe it's not worth having something which automatically cleans up
files that were in the middle of being written at the time of a system
crash. (Especially since you can get most of the functionality by
using some naming convention for files that in the process of being
written, and then teach some program that is regularly scanning the
entire file system, such as updatedb(2) to nuke the files from a cron
job. It won't be as efficient, but it would be much easier to
implement.)
On Tue, Sep 25, 2018 at 11:46:27AM -0400, Theodore Y. Ts'o wrote:
> (Especially since you can get most of the functionality by
> using some naming convention for files that in the process of being
> written, and then teach some program that is regularly scanning the
> entire file system, such as updatedb(2) to nuke the files from a cron
> job. It won't be as efficient, but it would be much easier to
> implement.)
It is MUCH easier to have a per-application cleanup job. You can run
that at boot-time.
#/etc/init.d/myname startup script
rm /var/run/myname/unfinished.*
Simple things should be kept simple.
Roger.
> Unlike O_TMPFILE, this would require file system changes to support,
> so maybe it's not worth having something which automatically cleans up
> files that were in the middle of being written at the time of a system
> crash.
Would it? If you open a file, unlink it, write to it, and then have a
linkf(fd, path), your underlying fs behaviour isn't really changed; it's
just that you are allowed to name the file late?
Alan
On Tue, 2018-09-25 at 11:46 -0400, Theodore Y. Ts'o wrote:
> On Tue, Sep 25, 2018 at 07:15:34AM -0400, Jeff Layton wrote:
> > Linux has dozens of filesystems and they all behave differently in this
> > regard. A catastrophic failure (paradoxically) makes things simpler for
> > the fs developer, but even on local filesystems isolated errors can
> > occur. It's also not just NFS -- what mostly started me down this road
> > was working on ENOSPC handling for CephFS.
> >
> > I think it'd be good to at least establish a "gold standard" for what
> > filesystems ought to do in this situation. We might not be able to
> > achieve that in all cases, but we could then document the exceptions.
>
> I'd argue the standard should be the precedent set by AFS and NFS.
> AFS verifies space available on close(2) and returns ENOSPC from the
> close(2) system call if space is not available. At MIT Project
> Athena, where we used AFS extensively in the late 80's and early 90's,
> we made and contributed back changes to avoid data loss as a result of
> quota errors.
>
> The best practice that should be documented for userspace is when
> writing precious files[1], programs should open for writing foo.new, write
> out the data, call fsync() and check the error return, call close()
> and check the error return, and then call rename(foo.new, foo) and
> check the error return. Writing a library function which does this,
> and which also copies the ACL's and xattr's from foo to foo.new before
> the rename() would probably help, but not as much as we might think.
>
> [1] That is, editors writing source files, but not compilers and
> similar programs writing object files and other generated files.
>
> None of this is really all that new. We had the same discussion back
> during the O_PONIES controversy, and we came out in the same place.
>
> - Ted
>
> P.S. One thought: it might be cool if there was some way for
> userspace applications to mark files with "nuke if not closed" flag,
> such that if the system crashes, the file systems would automatically
> unlink the file after a reboot or if the process was killed or exits
> without an explicit close(2). For networked/remote file systems that
> supported this flag, after the client comes back up after a reboot, it
> could notify the server that all files created previously from that
> client should be unlinked.
>
> Unlike O_TMPFILE, this would require file system changes to support,
> so maybe it's not worth having something which automatically cleans up
> files that were in the middle of being written at the time of a system
> crash. (Especially since you can get most of the functionality by
> using some naming convention for files that in the process of being
> written, and then teach some program that is regularly scanning the
> entire file system, such as updatedb(2) to nuke the files from a cron
> job. It won't be as efficient, but it would be much easier to
> implement.)
That's all well and good, but it still doesn't quite solve the main concern
with all of this. Suppose we have this series of events:
open file r/w
write 1024 bytes to offset 0
<background writeback that fails>
read 1024 bytes from offset 0
Open, write and read are successful, and there was no fsync or close in
between them. Will that read reflect the result of the previous write or
not?
The answer today is "it depends".
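That sequence, expressed as a minimal C sketch (the helper name is made up for illustration):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Open r/w, write 1024 bytes at offset 0, read them back, with no
 * fsync or close in between. On a healthy system the read reflects
 * the write; after a background writeback failure, what it returns is
 * filesystem-dependent -- which is exactly the problem. */
int read_after_write(const char *path)
{
    char wbuf[1024], rbuf[1024];
    memset(wbuf, 'x', sizeof(wbuf));

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (pwrite(fd, wbuf, sizeof(wbuf), 0) != (ssize_t)sizeof(wbuf)) {
        close(fd);
        return -1;
    }
    /* <a background writeback may happen -- and fail -- here> */
    if (pread(fd, rbuf, sizeof(rbuf), 0) != (ssize_t)sizeof(rbuf)) {
        close(fd);
        return -1;
    }
    close(fd);
    return memcmp(wbuf, rbuf, sizeof(rbuf)) == 0 ? 0 : 1;  /* 0: read saw the write */
}
```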
--
Jeff Layton <[email protected]>
On Tue, Sep 25, 2018 at 11:46:27AM -0400, Theodore Y. Ts'o wrote:
> P.S. One thought: it might be cool if there was some way for
> userspace applications to mark files with "nuke if not closed" flag,
> such that if the system crashes, the file systems would automatically
> unlink the file after a reboot or if the process was killed or exits
> without an explicit close(2). For networked/remote file systems that
> supported this flag, after the client comes back up after a reboot, it
> could notify the server that all files created previously from that
> client should be unlinked.
>
> Unlike O_TMPFILE, this would require file system changes to support,
> so maybe it's not worth having something which automatically cleans up
> files that were in the middle of being written at the time of a system
> crash.
Isn't this what the snippet for O_TMPFILE in "man 2 open" does?:
char path[PATH_MAX];
fd = open("/path/to/dir", O_TMPFILE | O_RDWR,
S_IRUSR | S_IWUSR);
/* File I/O on 'fd'... */
snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
linkat(AT_FDCWD, path, AT_FDCWD, "/path/for/file",
AT_SYMLINK_FOLLOW);
Meow!
--
⢀⣴⠾⠻⢶⣦⠀ 10 people enter a bar:
⣾⠁⢰⠒⠀⣿⡁ • 1 who understands binary,
⢿⡄⠘⠷⠚⠋⠀ • 1 who doesn't,
⠈⠳⣄⠀⠀⠀⠀ • and E who prefer to write it as hex.
On Tue, Sep 25, 2018 at 12:41:18PM -0400, Jeff Layton wrote:
> That's all well and good, but it still doesn't quite solve the main concern
> with all of this. Suppose we have this series of events:
>
> open file r/w
> write 1024 bytes to offset 0
> <background writeback that fails>
> read 1024 bytes from offset 0
>
> Open, write and read are successful, and there was no fsync or close in
> between them. Will that read reflect the result of the previous write or
> no?
If the background writeback hasn't happened, Posix requires that the
read returns the result of the write. And the user doesn't know when
or if the background writeback has happened unless the user calls
fsync(2).
Posix in general basically says anything is possible if the system
fails or crashes, or is dropped into molten lava, etc. Do we say that
Linux is not Posix compliant if a cosmic ray flips a few bits in the
page cache? Hardly! The *only* time Posix makes any guarantees is if
fsync(2) returns success. So the subject line is, in my opinion,
incorrect. The moment we are worrying about storage errors, and the
user hasn't used fsync(2), Posix is no longer relevant for the
purposes of the discussion.
> The answer today is "it depends".
And I think that's fine. The only way we can make any guarantees is
if we do what Alan suggested, which is to imply that a read on a dirty
page *block* until the page is successfully written back. This
would destroy performance. I know I wouldn't want to use such a
system, and if someone were to propose it, I'd strongly argue for a
switch to turn it *off*, and I suspect most system administrators would
turn it off once they saw what it did to system performance. (As a
thought experiment, think about what it would do to kernel compiles.
It means that before you link the .o files, you would have to block
and wait for them to be written to disk so you could be sure the
writeback would be successful. **Ugh**.)
Given that many people would turn such a feature off once they saw
what it does to their system performance, applications in general
couldn't rely on it, which means applications that cared would have to
do what they should have done all along. If it's precious data, use
fsync(2). If not, most of the time things are *fine* and it's not
worth sacrificing performance for the corner cases unless it really is
ultra-precious data and you are willing to pay the overhead.
- Ted
On Tue, Sep 25, 2018 at 07:35:11PM +0200, Adam Borowski wrote:
> Isn't this what the snippet for O_TMPFILE in "man 2 open" does?:
>
> char path[PATH_MAX];
> fd = open("/path/to/dir", O_TMPFILE | O_RDWR,
> S_IRUSR | S_IWUSR);
>
> /* File I/O on 'fd'... */
>
> snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
> linkat(AT_FDCWD, path, AT_FDCWD, "/path/for/file",
> AT_SYMLINK_FOLLOW);
Huh. I stand corrected. I had assumed O_TMPFILE worked like any
other file where the link count was zero, and linkat(2) wouldn't allow
this. But obviously, this does work. In fact, from the linkat(2) man
page, using:
linkat(fd, NULL, AT_FDCWD, "/path/for/file", AT_EMPTY_PATH);
is an even simpler way that doesn't require /proc to be mounted.
TIL...
- Ted
> And I think that's fine. The only way we can make any guarantees is
> if we do what Alan suggested, which is to imply that a read on a dirty
> page *block* until the page is successfully written back. This
> would destroy performance.
In almost all cases you don't care so you wouldn't use it. In those cases
where it might matter it's almost always the case that a reader won't
consume it before it hits the media.
That's why I suggested having an fbarrier() so you can explicitly say 'in
the event that case does happen, stall and write it'. It's a kind of
lazy fsync. That can be used with almost no cost by things like mail
daemons. Another way, given that this only really makes sense with locks,
is to add that fbarrier notion as an optional file-locking semantic so you
can 'unlock with barrier' and 'lock with barrier honoured'.
Alan
On Wed, Sep 26, 2018 at 07:10:55PM +0100, Alan Cox wrote:
> In almost all cases you don't care so you wouldn't use it. In those cases
> where it might matter it's almost always the case that a reader won't
> consume it before it hits the media.
>
> That's why I suggested having an fbarrier() so you can explicitly say 'in
> the event that case does happen, then stall and write it'. It's a kind of
> lazy fsync. That can be used at almost no cost by things like mail
> daemons.
How could mail daemons use it? They *have* to do an fsync() before
they send a 2xx SMTP return code.
There are plenty of other dependencies besides read --- for example,
if you write a mp3 file, and then write the playlist.m3u file, maybe
the barrier requirement is "if playlist.m3u survives after the crash,
all of the mp3 files mentioned in it must be completely written out to
disk". So I'm not sure how useful an fbarrier(2) would be in practice.
> Another way given that this only really makes sense with locks
> is to add that fbarrier notion as a file locking optional semantic so you
> can 'unlock with barrier' and 'lock with barrier honoured'
I'm not sure what you're suggesting?
- Ted
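[The mp3/playlist ordering Ted describes can already be expressed today, without any fbarrier(2), at the cost of real fsync calls: flush every track before the playlist that names them is written. A sketch with purely illustrative file names and contents:]

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a NUL-terminated string to `path` and flush it to stable
 * storage; 0 on success, -1 on error. */
static int write_sync(const char *path, const char *buf)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    size_t len = strlen(buf);
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}

/* Persist the tracks first, then the playlist that names them. If
 * the machine crashes, the playlist may be lost, but it can never
 * name a track that was not fully written out. */
int write_playlist_ordered(void)
{
    if (write_sync("/tmp/track1.mp3", "fake mp3 data 1\n") < 0 ||
        write_sync("/tmp/track2.mp3", "fake mp3 data 2\n") < 0)
        return -1;
    return write_sync("/tmp/playlist.m3u",
                      "/tmp/track1.mp3\n/tmp/track2.mp3\n");
}
```

[The cost is exactly what the thread objects to: a synchronous flush per dependency, which is why a lazier fbarrier() was floated in the first place.]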
On Wed, Sep 26, 2018 at 07:10:55PM +0100, Alan Cox wrote:
> > And I think that's fine. The only way we can make any guarantees is
> > if we do what Alan suggested, which is to imply that a read on a dirty
> > page *block* until the page is successfully written back. This
> > would destroy performance.
>
> In almost all cases you don't care so you wouldn't use it. In those cases
> where it might matter it's almost always the case that a reader won't
> consume it before it hits the media.
Wait! Source code builds (*) nowadays are quite fast because
everything happens without hitting the disk. This means my compile has
finished linking the executable by the time the kernel starts thinking
about writing the objects to disk.
Roger.
(*) Of projects smaller than the Linux kernel.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
The plan was simple, like my brother-in-law Phil. But unlike
Phil, this plan just might work.
On Tue, 2018-09-25 at 18:30 -0400, Theodore Y. Ts'o wrote:
> On Tue, Sep 25, 2018 at 12:41:18PM -0400, Jeff Layton wrote:
> > That's all well and good, but still doesn't quite solve the main concern
> > with all of this. Suppose we have this series of events:
> >
> > open file r/w
> > write 1024 bytes to offset 0
> > <background writeback that fails>
> > read 1024 bytes from offset 0
> >
> > Open, write and read are successful, and there was no fsync or close in
> > between them. Will that read reflect the result of the previous write or
> > no?
>
> If the background writeback hasn't happened, Posix requires that the
> read returns the result of the write. And the user doesn't know when
> or if the background writeback has happened unless the user calls
> fsync(2).
>
> Posix in general basically says anything is possible if the system
> fails or crashes, or is dropped into molten lava, etc. Do we say that
> Linux is not Posix compliant if a cosmic ray flips a few bits in the
> page cache? Hardly! The *only* time Posix makes any guarantees is if
> fsync(2) returns success. So the subject line, is in my opinion
> incorrect. The moment we are worrying about storage errors, and the
> user hasn't used fsync(2), Posix is no longer relevant for the
> purposes of the discussion.
>
> > The answer today is "it depends".
>
> And I think that's fine. The only way we can make any guarantees is
> if we do what Alan suggested, which is to imply that a read on a dirty
> page *block* until the page is successfully written back. This
> would destroy performance. I know I wouldn't want to use such a
> system, and if someone were to propose it, I'd strongly argue for a
> switch to turn it *off*, and I suspect most system administrators would
> turn it off once they saw what it did to system performance. (As a
> thought experiment, think about what it would do to kernel compiles.
> It means that before you link the .o files, you would have to block
> and wait for them to be written to disk so you could be sure the
> writeback would be successful. **Ugh**.)
>
> Given that many people would turn such a feature off once they saw
> what it does to their system performance, applications in general
> couldn't rely on it, which means applications that cared would have to
> do what they should have done all along: if it's precious data, use
> fsync(2). If not, most of the time things are *fine* and it's not
> worth sacrificing performance for the corner cases unless it really is
> ultra-precious data and you are willing to pay the overhead.
Basically, the problem (as I see it) is that we can end up evicting
uncleanable data from the cache before you have a chance to call fsync,
and that means that the results of a read after a write are not
completely reliable.
We had some small discussion of this at LSF (mostly over malt beverages)
and wondered: could we offer a guarantee that uncleanable dirty data
will stick around until:
1) someone issues fsync() and scrapes the error
...or...
2) some timeout occurs (or we hit some other threshold? This part is
definitely open for debate)
That would at least allow an application issuing regular fsync calls to
reliably re-fetch write data via reads up until the point where we see
fsync fail. Those that don't issue regular fsyncs should be no worse off
than they are today.
Granted #2 above represents something of an open-ended commitment -- we
could have a bunch of writers that don't call fsync fill up memory with
uncleanable pages, and at that point we're sort of stuck.
That said, all of this is a rather theoretical problem. I've not heard
any reports of problems due to uncleanable data being evicted prior to
fsync, so I've not leapt to start rolling patches for this.
--
Jeff Layton <[email protected]>
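[The application-side discipline Jeff's proposed guarantee would support might look like the sketch below (helper name is mine): checkpoint with fsync at each point that matters, and keep your own copy of the data until the checkpoint succeeds, since a failing fsync means some writeback since the last successful one was lost.]

```c
#include <fcntl.h>
#include <unistd.h>

/* Illustrative helper: append one record and checkpoint it. Returns
 * 0 only when the record is known to be on stable storage; on
 * failure the caller still holds `rec` in memory and can retry,
 * possibly to a different device. */
int append_record(int fd, const char *rec, size_t len)
{
    if (write(fd, rec, len) != (ssize_t)len)
        return -1;

    /* This fsync "scrapes the error": since the errseq_t work in
     * Linux 4.13, each fd is told about a writeback error at most
     * once, so every writer must check its own fsync result. */
    if (fsync(fd) < 0)
        return -1;

    /* Only now is it safe to discard the in-memory copy of `rec`. */
    return 0;
}
```

[Under Jeff's proposal, reads between successful checkpoints would also reliably reflect the writes; today, an evicted uncleanable page can break that.]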
On Thu, Sep 27, 2018 at 08:43:10AM -0400, Jeff Layton wrote:
>
> Basically, the problem (as I see it) is that we can end up evicting
> uncleanable data from the cache before you have a chance to call fsync,
> and that means that the results of a read after a write are not
> completely reliable.
Part of the problem is that people don't agree on what the problem is. :-)
The original posting was from someone who claimed it was a "POSIX
violation" if a subsequent read returns *successfully*, but then the
writeback fails afterwards.
Other people are worried about this problem; yet others are worried
about the system wedging and OOM-killing itself, etc.
The problem is that in the face of I/O errors, it's impossible to keep
everyone happy. (You could make the local storage device completely
reliable, with a multi-million dollar storage array with remote
replication, but then the CFO won't be happy; and other people were
talking about making things work with cheap USB thumb drives and
laptops. This is the very definition of an over-constrained problem.)
- Ted
On Wed, 26 Sep 2018 17:49:09 -0400
"Theodore Y. Ts'o" <[email protected]> wrote:
> On Wed, Sep 26, 2018 at 07:10:55PM +0100, Alan Cox wrote:
> > In almost all cases you don't care so you wouldn't use it. In those cases
> > where it might matter it's almost always the case that a reader won't
> > consume it before it hits the media.
> >
> > That's why I suggested having an fbarrier() so you can explicitly say 'in
> > the event that case does happen, then stall and write it'. It's a kind of
> > lazy fsync. That can be used at almost no cost by things like mail
> > daemons.
>
> How could mail daemons use it? They *have* to do an fsync() before
> they send a 2xx SMTP return code.
Point taken - so actually it would be less useful.
> > Another way given that this only really makes sense with locks
> > is to add that fbarrier notion as a file locking optional semantic so you
> > can 'unlock with barrier' and 'lock with barrier honoured'
>
> I'm not sure what you're suggesting?
If someone has an actual use case, you could in theory constrain it to a
range specified in a file lock, and only between two parties who care.
That said, it seems like a lot of complexity just to keep a case almost
nobody cares about from affecting anyone beyond the people who do.
Alan