2023-02-28 20:49:15

by Darrick J. Wong

[permalink] [raw]
Subject: [LSF TOPIC] online repair of filesystems: what next?

Hello fsdevel people,

Five years ago[0], we started a conversation about cross-filesystem
userspace tooling for online fsck. I think enough time has passed for
us to have another one, since a few things have happened since then:

1. ext4 has gained the ability to send corruption reports to a userspace
monitoring program via fsnotify. Thanks, Collabora!

2. XFS now tracks successful scrubs and corruptions seen during runtime
and during scrubs. Userspace can query this information.

3. Directory parent pointers, which enable online repair of the
directory tree, are nearing completion.

4. Dave and I are working on merging online repair of space metadata for
XFS. Online repair of directory trees is feature complete, but we
still have one or two unresolved questions in the parent pointer
code.

5. I've gotten a bit better[1] at writing systemd service descriptions
for scheduling and performing background online fsck.

Now that fsnotify_sb_error exists as a result of (1), I think we
should figure out how to plumb calls into the readahead and writeback
code so that IO failures can be reported to the fsnotify monitor. I
suspect there may be a few difficulties here since fsnotify (iirc)
allocates memory and takes locks.
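
To make the plumbing concrete, here's a minimal sketch (untested and
purely illustrative -- fsnotify_sb_error() is the real hook, the helper
name and placement are made up) of what a filesystem's writeback
completion could call once it is running in process/workqueue context:

	/* Hypothetical helper; only fsnotify_sb_error() is real. */
	#include <linux/fs.h>
	#include <linux/fsnotify.h>

	static void report_wb_error(struct inode *inode, int error)
	{
		/*
		 * fsnotify may allocate memory and take locks, so this
		 * must not run from bio completion in interrupt context;
		 * punt to a workqueue first if necessary.
		 */
		if (error)
			fsnotify_sb_error(inode->i_sb, inode, error);
	}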

As a result of (2), XFS now retains quite a bit of incore state about
its own health. The structure that fsnotify gives to userspace is very
generic (superblock, inode, errno, errno count). How might XFS export
a greater amount of information via this interface? We can provide
details at finer granularity -- for example, a specific data structure
under an allocation group or an inode, or specific quota records.

With (4) on the way, I can envision wanting a system service that would
watch for these fsnotify events, and transform the error reports into
targeted repair calls in the kernel. This of course would be very
filesystem specific, but I would also like to hear from anyone pondering
other use cases for fsnotify filesystem error monitors.
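
To make that concrete, here's a bare-bones, untested sketch of such a
monitor. FAN_FS_ERROR and the info-record structs come from the fanotify
uapi; everything else (buffer sizes, the printf, the missing FID
decoding) is illustrative only -- a real daemon would map the FID record
to an inode and dispatch fs-specific repair instead of printing:

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/fanotify.h>
	#include <linux/fanotify.h>	/* FAN_FS_ERROR, info_error */

	int main(int argc, char **argv)
	{
		static char buf[4096] __attribute__((aligned(8)));
		const char *mnt = argc > 1 ? argv[1] : "/mnt";
		int fd;

		/* FAN_FS_ERROR requires an fid-reporting group. */
		fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID,
				   O_RDONLY);
		if (fd < 0)
			return 1;

		/* Watch the whole filesystem backing $mnt for errors. */
		if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
				  FAN_FS_ERROR, AT_FDCWD, mnt) < 0)
			return 1;

		for (;;) {
			ssize_t len = read(fd, buf, sizeof(buf));
			struct fanotify_event_metadata *md = (void *)buf;

			if (len <= 0)
				break;
			while (FAN_EVENT_OK(md, len)) {
				char *p = (char *)(md + 1);
				char *end = (char *)md + md->event_len;

				/* Walk the variable-length info records. */
				while (p < end) {
					struct fanotify_event_info_header *h =
						(void *)p;

					if (h->info_type ==
					    FAN_EVENT_INFO_TYPE_ERROR) {
						struct fanotify_event_info_error *e =
							(void *)h;

						printf("fs error %d seen %u times\n",
						       e->error, e->error_count);
						/* ...map the FID record to an
						 * inode and kick off repair... */
					}
					p += h->len;
				}
				md = FAN_EVENT_NEXT(md, len);
			}
		}
		close(fd);
		return 0;
	}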

Once (3) lands, XFS gains the ability to translate a block device IO
error to an inode number and file offset, and then the inode number to a
path. In other words, your file breaks and now we can tell applications
which file it was so they can failover or redownload it or whatever.
Ric Wheeler mentioned this in 2018's session.
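
On the monitoring side, the FID record that fanotify attaches to these
error events carries a file handle, so a rough, untested sketch of
turning that into a path (requires CAP_DAC_READ_SEARCH; mount_fd is an
fd on the mountpoint) could look like:

	#define _GNU_SOURCE	/* struct file_handle, open_by_handle_at() */
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	static int handle_to_path(int mount_fd, struct file_handle *fh,
				  char *path, size_t pathlen)
	{
		char proc[64];
		ssize_t n;
		int fd = open_by_handle_at(mount_fd, fh, O_RDONLY);

		if (fd < 0)
			return -1;	/* ESTALE if the file is gone */
		snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
		n = readlink(proc, path, pathlen - 1);
		close(fd);
		if (n < 0)
			return -1;
		path[n] = '\0';
		return 0;
	}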

The final topic from that 2018 session concerned generic wrappers for
fsscrub. I haven't pushed hard on that topic because XFS hasn't had
much to show for it. Now that I'm better versed in systemd services,
I envision three ways to interact with online fsck:

- A CLI program that can be run by anyone.

- Background systemd services that fire up periodically.

- A dbus service that programs can bind to and request a fsck.

I still think there's an opportunity to standardize the naming to make
it easier to use a variety of filesystems. I propose for the CLI:

/usr/sbin/fsscrub $mnt that calls /usr/sbin/fsscrub.$FSTYP $mnt

For systemd services, I propose "fsscrub@<escaped mountpoint>". I
suspect we want a separate background service that itself runs
periodically and invokes the fsscrub@$mnt services. xfsprogs already
has an xfs_scrub_all service that does that. The services are nifty
because it's really easy to restrict privileges, implement resource
usage controls, and use private name/mount namespaces to isolate the
process from the rest of the system.

dbus is a bit trickier, since there's no precedent at all. I guess
we'd have to define an interface for a filesystem "object". Then we could
write a service that establishes a well-known bus name and maintains
object paths for each mounted filesystem. Each of those objects would
export the filesystem interface, and that's how programs would call
online fsck as a service.
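
A very rough sd-bus sketch of what that could look like follows; every
bus, object, and interface name below is invented for illustration, and
nothing here is a proposal for a final API:

	#include <stdint.h>
	#include <systemd/sd-bus.h>

	/* Hypothetical method: scrub the fs this object represents. */
	static int method_scrub(sd_bus_message *m, void *userdata,
				sd_bus_error *ret_error)
	{
		/* ...run /usr/sbin/fsscrub.$FSTYP $mnt here... */
		return sd_bus_reply_method_return(m, "i", 0);
	}

	static const sd_bus_vtable fs_vtable[] = {
		SD_BUS_VTABLE_START(0),
		SD_BUS_METHOD("Scrub", "", "i", method_scrub, 0),
		SD_BUS_VTABLE_END
	};

	int main(void)
	{
		sd_bus *bus = NULL;

		if (sd_bus_open_system(&bus) < 0)
			return 1;
		/* One object path per mounted filesystem. */
		if (sd_bus_add_object_vtable(bus, NULL,
					     "/org/kernel/fsscrub/root",
					     "org.kernel.Fsscrub.Filesystem",
					     fs_vtable, NULL) < 0)
			return 1;
		if (sd_bus_request_name(bus, "org.kernel.Fsscrub", 0) < 0)
			return 1;

		for (;;) {
			int r = sd_bus_process(bus, NULL);
			if (r < 0)
				break;
			if (r == 0)
				sd_bus_wait(bus, (uint64_t) -1);
		}
		return 0;
	}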

Ok, that's enough for a single session topic. Thoughts? :)

--D

[0] https://lwn.net/Articles/754504/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-optimize-by-default


2023-03-08 17:12:19

by Jan Kara

[permalink] [raw]
Subject: Re: [LSF TOPIC] online repair of filesystems: what next?

Hi!

I'm interested in this topic. Some comments below.

On Tue 28-02-23 12:49:03, Darrick J. Wong wrote:
> Five years ago[0], we started a conversation about cross-filesystem
> userspace tooling for online fsck. I think enough time has passed for
> us to have another one, since a few things have happened since then:
>
> 1. ext4 has gained the ability to send corruption reports to a userspace
> monitoring program via fsnotify. Thanks, Collabora!
>
> 2. XFS now tracks successful scrubs and corruptions seen during runtime
> and during scrubs. Userspace can query this information.
>
> 3. Directory parent pointers, which enable online repair of the
> directory tree, is nearing completion.
>
> 4. Dave and I are working on merging online repair of space metadata for
> XFS. Online repair of directory trees is feature complete, but we
> still have one or two unresolved questions in the parent pointer
> code.
>
> 5. I've gotten a bit better[1] at writing systemd service descriptions
> for scheduling and performing background online fsck.
>
> Now that fsnotify_sb_error exists as a result of (1), I think we
> should figure out how to plumb calls into the readahead and writeback
> code so that IO failures can be reported to the fsnotify monitor. I
> suspect there may be a few difficulties here since fsnotify (iirc)
> allocates memory and takes locks.

Well, if you want to generate fsnotify events from an interrupt handler,
you're going to have a hard time; I don't have a good answer for that. But
offloading error event generation to a workqueue should be doable (and
event delivery is async anyway, so from userspace's POV there's no
difference). Otherwise locking shouldn't be a problem AFAICT. WRT memory
allocation, we currently preallocate the error events to avoid losing
events due to ENOMEM. With the current use cases (catastrophic filesystem
error reporting) we have settled on a mempool with 32 preallocated events
(note that a preallocated event gets used only if the normal kmalloc
fails) for simplicity. If the error reporting mechanism is going to be
used significantly more, we may need to reconsider this, but it should be
doable. And frankly, if you have a storm of fs errors *and* the system is
going ENOMEM at the same time, I have my doubts that losing some error
reports is going to do any more harm ;).
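
Roughly speaking, the preallocation looks like this (illustrative only;
the struct and function names below are made up and the real fanotify
code differs in the details):

	#include <linux/errno.h>
	#include <linux/init.h>
	#include <linux/mempool.h>

	#define FS_ERROR_EVENTS_PREALLOC	32

	struct fs_error_event {		/* stand-in for the real struct */
		int error;
		unsigned int err_count;
	};

	static mempool_t *error_event_pool;

	static int __init error_event_pool_init(void)
	{
		/*
		 * Normal allocations come from kmalloc; the 32
		 * preallocated objects are only dipped into when kmalloc
		 * fails, so an error burst under memory pressure still
		 * gets (up to 32) reports out to userspace.
		 */
		error_event_pool = mempool_create_kmalloc_pool(
					FS_ERROR_EVENTS_PREALLOC,
					sizeof(struct fs_error_event));
		return error_event_pool ? 0 : -ENOMEM;
	}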

> As a result of (2), XFS now retains quite a bit of incore state about
> its own health. The structure that fsnotify gives to userspace is very
> generic (superblock, inode, errno, errno count). How might XFS export
> a greater amount of information via this interface? We can provide
> details at finer granularity -- for example, a specific data structure
> under an allocation group or an inode, or specific quota records.

The fsnotify (fanotify, in fact) interface is fairly flexible in what can
be passed through it. So if you need to pass some (reasonably short)
binary blob to a userspace program that knows how to decode it, fanotify
can handle that (with some wrapping). Obviously there's a tradeoff in how
much of the event is generic (which is easier to process by tools common
to all filesystems) and how much is fs specific (which allows passing
more detailed information). But I guess we need concrete examples of
events to discuss this.

> With (4) on the way, I can envision wanting a system service that would
> watch for these fsnotify events, and transform the error reports into
> targeted repair calls in the kernel. This of course would be very
> filesystem specific, but I would also like to hear from anyone pondering
> other usecases for fsnotify filesystem error monitors.

I think when we do report IO errors (or ENOSPC, EDQUOT errors for that
matter) through fsnotify, there would be some interesting system-health
monitoring use cases. But I don't know of anybody working on this.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-03-08 21:55:03

by Dave Chinner

[permalink] [raw]
Subject: Re: [LSF TOPIC] online repair of filesystems: what next?

On Wed, Mar 08, 2023 at 06:12:06PM +0100, Jan Kara wrote:
> Hi!
>
> I'm interested in this topic. Some comments below.
>
> On Tue 28-02-23 12:49:03, Darrick J. Wong wrote:
> > Five years ago[0], we started a conversation about cross-filesystem
> > userspace tooling for online fsck. I think enough time has passed for
> > us to have another one, since a few things have happened since then:
> >
> > 1. ext4 has gained the ability to send corruption reports to a userspace
> > monitoring program via fsnotify. Thanks, Collabora!
> >
> > 2. XFS now tracks successful scrubs and corruptions seen during runtime
> > and during scrubs. Userspace can query this information.
> >
> > 3. Directory parent pointers, which enable online repair of the
> > directory tree, is nearing completion.
> >
> > 4. Dave and I are working on merging online repair of space metadata for
> > XFS. Online repair of directory trees is feature complete, but we
> > still have one or two unresolved questions in the parent pointer
> > code.
> >
> > 5. I've gotten a bit better[1] at writing systemd service descriptions
> > for scheduling and performing background online fsck.
> >
> > Now that fsnotify_sb_error exists as a result of (1), I think we
> > should figure out how to plumb calls into the readahead and writeback
> > code so that IO failures can be reported to the fsnotify monitor. I
> > suspect there may be a few difficulties here since fsnotify (iirc)
> > allocates memory and takes locks.
>
> Well, if you want to generate fsnotify events from an interrupt handler,
> you're going to have a hard time, I don't have a good answer for that.

I don't think we ever do that, or need to do that. IO completions
that can throw corruption errors are already running in workqueue
contexts in XFS.

Worst case, we throw all bios that have IO errors flagged to the
same IO completion workqueues, and the problem of memory allocation,
locks, etc. in interrupt context goes away entirely.

> But
> offloading of error event generation to a workqueue should be doable (and
> event delivery is async anyway so from userspace POV there's no
> difference).

Unless I'm misunderstanding you (possible!), that requires a memory
allocation to offload the error information to the work queue to
allow the fsnotify error message to be generated in an async manner.
That doesn't seem to solve anything.

> Otherwise locking shouldn't be a problem AFAICT. WRT memory
> allocation, we currently preallocate the error events to avoid the loss of
> event due to ENOMEM. With current usecases (filesystem catastrophical error
> reporting) we have settled on a mempool with 32 preallocated events (note
> that preallocated event gets used only if normal kmalloc fails) for
> simplicity. If the error reporting mechanism is going to be used
> significantly more, we may need to reconsider this but it should be doable.
> And frankly if you have a storm of fs errors *and* the system is going
> ENOMEM at the same time, I have my doubts loosing some error report is
> going to do any more harm ;).

Once the filesystem is shut down, it will need to turn off
individual sickness notifications because everything is sick at this
point.

> > As a result of (2), XFS now retains quite a bit of incore state about
> > its own health. The structure that fsnotify gives to userspace is very
> > generic (superblock, inode, errno, errno count). How might XFS export
> > a greater amount of information via this interface? We can provide
> > details at finer granularity -- for example, a specific data structure
> > under an allocation group or an inode, or specific quota records.
>
> Fsnotify (fanotify in fact) interface is fairly flexible in what can be
> passed through it. So if you need to pass some (reasonably short) binary
> blob to userspace which knows how to decode it, fanotify can handle that
> (with some wrapping). Obviously there's a tradeoff to make how much of the
> event is generic (as that is then easier to process by tools common for all
> filesystems) and how much is fs specific (which allows to pass more
> detailed information). But I guess we need to have concrete examples of
> events to discuss this.

Fine-grained health information will always be filesystem specific -
IMO it's not worth trying to make it generic when there is only one
filesystem tracking and exporting fine-grained health information.
Once (if) we get multiple filesystems tracking fine-grained health
information, then we'll have the information we need to implement a
useful generic set of notifications, but until then I don't think we
should try.

We should just export the notifications the filesystem utilities
need to do their work for the moment. When management applications
(e.g. Stratis) get to the point where they can report/manage
filesystem health and need that information from multiple
filesystems types, then we can work out a useful common subset of
fine grained events across those filesystems that the applications
can listen for.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-03-09 16:00:13

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [LSF TOPIC] online repair of filesystems: what next?

On Thu, Mar 09, 2023 at 08:54:39AM +1100, Dave Chinner wrote:
> On Wed, Mar 08, 2023 at 06:12:06PM +0100, Jan Kara wrote:
> > Hi!
> >
> > I'm interested in this topic. Some comments below.
> >
> > On Tue 28-02-23 12:49:03, Darrick J. Wong wrote:
> > > Five years ago[0], we started a conversation about cross-filesystem
> > > userspace tooling for online fsck. I think enough time has passed for
> > > us to have another one, since a few things have happened since then:
> > >
> > > 1. ext4 has gained the ability to send corruption reports to a userspace
> > > monitoring program via fsnotify. Thanks, Collabora!
> > >
> > > 2. XFS now tracks successful scrubs and corruptions seen during runtime
> > > and during scrubs. Userspace can query this information.
> > >
> > > 3. Directory parent pointers, which enable online repair of the
> > > directory tree, is nearing completion.
> > >
> > > 4. Dave and I are working on merging online repair of space metadata for
> > > XFS. Online repair of directory trees is feature complete, but we
> > > still have one or two unresolved questions in the parent pointer
> > > code.
> > >
> > > 5. I've gotten a bit better[1] at writing systemd service descriptions
> > > for scheduling and performing background online fsck.
> > >
> > > Now that fsnotify_sb_error exists as a result of (1), I think we
> > > should figure out how to plumb calls into the readahead and writeback
> > > code so that IO failures can be reported to the fsnotify monitor. I
> > > suspect there may be a few difficulties here since fsnotify (iirc)
> > > allocates memory and takes locks.
> >
> > Well, if you want to generate fsnotify events from an interrupt handler,
> > you're going to have a hard time, I don't have a good answer for that.
>
> I don't think we ever do that, or need to do that. IO completions
> that can throw corruption errors are already running in workqueue
> contexts in XFS.
>
> Worst case, we throw all bios that have IO errors flagged to the
> same IO completion workqueues, and the problem of memory allocation,
> locks, etc in interrupt context goes away entire.

Indeed. For XFS I think the only time we might need to fsnotify about
errors from interrupt context is writeback completions for a pure
overwrite? We could punt those to a workqueue as Dave says. Or figure
out a way for whoever's initiating writeback to send it for us?

I think this is a general issue for the pagecache, not XFS. I'll
brainstorm with willy the next time I encounter him.

> > But
> > offloading of error event generation to a workqueue should be doable (and
> > event delivery is async anyway so from userspace POV there's no
> > difference).
>
> Unless I'm misunderstanding you (possible!), that requires a memory
> allocation to offload the error information to the work queue to
> allow the fsnotify error message to be generated in an async manner.
> That doesn't seem to solve anything.
>
> > Otherwise locking shouldn't be a problem AFAICT. WRT memory
> > allocation, we currently preallocate the error events to avoid the loss of
> > event due to ENOMEM. With current usecases (filesystem catastrophical error
> > reporting) we have settled on a mempool with 32 preallocated events (note
> > that preallocated event gets used only if normal kmalloc fails) for
> > simplicity. If the error reporting mechanism is going to be used
> > significantly more, we may need to reconsider this but it should be doable.
> > And frankly if you have a storm of fs errors *and* the system is going
> > ENOMEM at the same time, I have my doubts loosing some error report is
> > going to do any more harm ;).
>
> Once the filesystem is shut down, it will need to turn off
> individual sickness notifications because everything is sick at this
> point.

I was thinking that the existing fsnotify error set should adopt a 'YOUR
FS IS DEAD' notification. Then when the fs goes down due to errors or
the shutdown ioctl, we can broadcast that as the final last gasp of the
filesystem.
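
Something as simple as this might be enough for the last gasp (sketch
only; there is no dedicated "dead" mask today, so this just reuses the
existing superblock error report with a NULL inode):

	#include <linux/fs.h>
	#include <linux/fsnotify.h>

	/* Hypothetical hook in the shutdown path. */
	static void fsnotify_sb_dead(struct super_block *sb, int error)
	{
		/*
		 * Broadcast one superblock-wide report, then suppress the
		 * per-object sickness notifications.
		 */
		fsnotify_sb_error(sb, NULL, error);
	}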

> > > As a result of (2), XFS now retains quite a bit of incore state about
> > > its own health. The structure that fsnotify gives to userspace is very
> > > generic (superblock, inode, errno, errno count). How might XFS export
> > > a greater amount of information via this interface? We can provide
> > > details at finer granularity -- for example, a specific data structure
> > > under an allocation group or an inode, or specific quota records.
> >
> > Fsnotify (fanotify in fact) interface is fairly flexible in what can be
> > passed through it. So if you need to pass some (reasonably short) binary
> > blob to userspace which knows how to decode it, fanotify can handle that
> > (with some wrapping). Obviously there's a tradeoff to make how much of the
> > event is generic (as that is then easier to process by tools common for all
> > filesystems) and how much is fs specific (which allows to pass more
> > detailed information). But I guess we need to have concrete examples of
> > events to discuss this.
>
> Fine grained health information will always be filesystem specific -
> IMO it's not worth trying to make it generic when there is only one
> filesystem that tracking and exporting fine-grained health
> information. Once (if) we get multiple filesystems tracking fine
> grained health information, then we'll have the information we need
> to implement a useful generic set of notifications, but until then I
> don't think we should try.

Same here. XFS might want to send the generic notifications and follow
them up with more specific information?

> We should just export the notifications the filesystem utilities
> need to do their work for the moment. When management applications
> (e.g Stratis) get to the point where they can report/manage
> filesystem health and need that information from multiple
> filesystems types, then we can work out a useful common subset of
> fine grained events across those filesystems that the applications
> can listen for.

If someone wants to write xfs_scrubd that listens for events and issues
XFS_IOC_SCRUB_METADATA calls I'd be all ears. :)
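
For reference, the kernel call such a daemon would make is roughly this
(untested sketch assuming the xfsprogs headers; picking the scrub type
and mapping an fsnotify report onto it is the part left as an exercise):

	#include <stdbool.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/types.h>
	#include <xfs/xfs.h>	/* XFS_IOC_SCRUB_METADATA et al. */

	/*
	 * Check one piece of metadata on the fs behind mnt_fd, optionally
	 * repairing it; the OFLAG bits in sm_flags say what was found.
	 */
	static int scrub_one(int mnt_fd, __u32 type, __u64 ino, __u32 gen,
			     bool repair)
	{
		struct xfs_scrub_metadata sm;

		memset(&sm, 0, sizeof(sm));
		sm.sm_type = type;	/* e.g. XFS_SCRUB_TYPE_INODE */
		sm.sm_ino = ino;
		sm.sm_gen = gen;
		if (repair)
			sm.sm_flags |= XFS_SCRUB_IFLAG_REPAIR;

		return ioctl(mnt_fd, XFS_IOC_SCRUB_METADATA, &sm);
	}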

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2023-03-09 17:50:08

by Jan Kara

[permalink] [raw]
Subject: Re: [LSF TOPIC] online repair of filesystems: what next?

On Thu 09-03-23 08:54:39, Dave Chinner wrote:
> On Wed, Mar 08, 2023 at 06:12:06PM +0100, Jan Kara wrote:
> > Hi!
> >
> > I'm interested in this topic. Some comments below.
> >
> > On Tue 28-02-23 12:49:03, Darrick J. Wong wrote:
> > > Five years ago[0], we started a conversation about cross-filesystem
> > > userspace tooling for online fsck. I think enough time has passed for
> > > us to have another one, since a few things have happened since then:
> > >
> > > 1. ext4 has gained the ability to send corruption reports to a userspace
> > > monitoring program via fsnotify. Thanks, Collabora!
> > >
> > > 2. XFS now tracks successful scrubs and corruptions seen during runtime
> > > and during scrubs. Userspace can query this information.
> > >
> > > 3. Directory parent pointers, which enable online repair of the
> > > directory tree, is nearing completion.
> > >
> > > 4. Dave and I are working on merging online repair of space metadata for
> > > XFS. Online repair of directory trees is feature complete, but we
> > > still have one or two unresolved questions in the parent pointer
> > > code.
> > >
> > > 5. I've gotten a bit better[1] at writing systemd service descriptions
> > > for scheduling and performing background online fsck.
> > >
> > > Now that fsnotify_sb_error exists as a result of (1), I think we
> > > should figure out how to plumb calls into the readahead and writeback
> > > code so that IO failures can be reported to the fsnotify monitor. I
> > > suspect there may be a few difficulties here since fsnotify (iirc)
> > > allocates memory and takes locks.
> >
> > Well, if you want to generate fsnotify events from an interrupt handler,
> > you're going to have a hard time, I don't have a good answer for that.
>
> I don't think we ever do that, or need to do that. IO completions
> that can throw corruption errors are already running in workqueue
> contexts in XFS.
>
> Worst case, we throw all bios that have IO errors flagged to the
> same IO completion workqueues, and the problem of memory allocation,
> locks, etc in interrupt context goes away entire.
>
> > But
> > offloading of error event generation to a workqueue should be doable (and
> > event delivery is async anyway so from userspace POV there's no
> > difference).
>
> Unless I'm misunderstanding you (possible!), that requires a memory
> allocation to offload the error information to the work queue to
> allow the fsnotify error message to be generated in an async manner.
> That doesn't seem to solve anything.

I think your "punt bio completions with errors to a workqueue" is perfectly
fine for our purposes and solves all the problems I had in mind.

> > Otherwise locking shouldn't be a problem AFAICT. WRT memory
> > allocation, we currently preallocate the error events to avoid the loss of
> > event due to ENOMEM. With current usecases (filesystem catastrophical error
> > reporting) we have settled on a mempool with 32 preallocated events (note
> > that preallocated event gets used only if normal kmalloc fails) for
> > simplicity. If the error reporting mechanism is going to be used
> > significantly more, we may need to reconsider this but it should be doable.
> > And frankly if you have a storm of fs errors *and* the system is going
> > ENOMEM at the same time, I have my doubts loosing some error report is
> > going to do any more harm ;).
>
> Once the filesystem is shut down, it will need to turn off
> individual sickness notifications because everything is sick at this
> point.

Yup.

> > > As a result of (2), XFS now retains quite a bit of incore state about
> > > its own health. The structure that fsnotify gives to userspace is very
> > > generic (superblock, inode, errno, errno count). How might XFS export
> > > a greater amount of information via this interface? We can provide
> > > details at finer granularity -- for example, a specific data structure
> > > under an allocation group or an inode, or specific quota records.
> >
> > Fsnotify (fanotify in fact) interface is fairly flexible in what can be
> > passed through it. So if you need to pass some (reasonably short) binary
> > blob to userspace which knows how to decode it, fanotify can handle that
> > (with some wrapping). Obviously there's a tradeoff to make how much of the
> > event is generic (as that is then easier to process by tools common for all
> > filesystems) and how much is fs specific (which allows to pass more
> > detailed information). But I guess we need to have concrete examples of
> > events to discuss this.
>
> Fine grained health information will always be filesystem specific -
> IMO it's not worth trying to make it generic when there is only one
> filesystem that tracking and exporting fine-grained health
> information. Once (if) we get multiple filesystems tracking fine
> grained health information, then we'll have the information we need
> to implement a useful generic set of notifications, but until then I
> don't think we should try.

Fine-grained health information is definitely always going to be fs
specific, I agree. I was just thinking aloud about whether the event
should be an entirely fs-specific blob, or whether we should instead have
an event containing stuff like: errno (EIO, EFSCORRUPTED, ...), inode,
offset, length, <and some fs-specific blob here with more details>, so
that e.g. an application monitoring service could listen to such events
and act on them (e.g. by failing over to another node) without needing to
understand fs-specific details.
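
I.e. something like the following layout, just to illustrate what I mean
(completely made up, not a proposal):

	#include <linux/types.h>

	struct fs_health_event {
		__s32 error;	/* EIO, EFSCORRUPTED, ... */
		__u64 ino;	/* affected inode, 0 if none */
		__u64 offset;	/* affected byte range, if known */
		__u64 len;
		__u32 blob_len;	/* length of the fs-specific payload */
		__u8  blob[];	/* e.g. XFS AG/btree id, btrfs subvol, ... */
	};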

> We should just export the notifications the filesystem utilities
> need to do their work for the moment. When management applications
> (e.g Stratis) get to the point where they can report/manage
> filesystem health and need that information from multiple
> filesystems types, then we can work out a useful common subset of
> fine grained events across those filesystems that the applications
> can listen for.

And I guess this is a fair point that we should not try to craft generic
info in events for uncertain use cases, because we'll almost certainly
get it wrong and need to change the info anyway for it to be useful.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-03-09 18:27:23

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [LSF TOPIC] online repair of filesystems: what next?

"Darrick J. Wong" <[email protected]> writes:

> On Thu, Mar 09, 2023 at 08:54:39AM +1100, Dave Chinner wrote:
>> On Wed, Mar 08, 2023 at 06:12:06PM +0100, Jan Kara wrote:
>> > Hi!
>> >
>> > I'm interested in this topic. Some comments below.
>> >
>> > On Tue 28-02-23 12:49:03, Darrick J. Wong wrote:
>> > > Five years ago[0], we started a conversation about cross-filesystem
>> > > userspace tooling for online fsck. I think enough time has passed for
>> > > us to have another one, since a few things have happened since then:
>> > >
>> > > 1. ext4 has gained the ability to send corruption reports to a userspace
>> > > monitoring program via fsnotify. Thanks, Collabora!
>> > >
>> > > 2. XFS now tracks successful scrubs and corruptions seen during runtime
>> > > and during scrubs. Userspace can query this information.
>> > >
>> > > 3. Directory parent pointers, which enable online repair of the
>> > > directory tree, is nearing completion.
>> > >
>> > > 4. Dave and I are working on merging online repair of space metadata for
>> > > XFS. Online repair of directory trees is feature complete, but we
>> > > still have one or two unresolved questions in the parent pointer
>> > > code.
>> > >
>> > > 5. I've gotten a bit better[1] at writing systemd service descriptions
>> > > for scheduling and performing background online fsck.
>> > >
>> > > Now that fsnotify_sb_error exists as a result of (1), I think we
>> > > should figure out how to plumb calls into the readahead and writeback
>> > > code so that IO failures can be reported to the fsnotify monitor. I
>> > > suspect there may be a few difficulties here since fsnotify (iirc)
>> > > allocates memory and takes locks.
>> >
>> > Well, if you want to generate fsnotify events from an interrupt handler,
>> > you're going to have a hard time, I don't have a good answer for that.
>>
>> I don't think we ever do that, or need to do that. IO completions
>> that can throw corruption errors are already running in workqueue
>> contexts in XFS.
>>
>> Worst case, we throw all bios that have IO errors flagged to the
>> same IO completion workqueues, and the problem of memory allocation,
>> locks, etc in interrupt context goes away entire.
>
> Indeed. For XFS I think the only time we might need to fsnotify about
> errors from interrupt context is writeback completions for a pure
> overwrite? We could punt those to a workqueue as Dave says. Or figure
> out a way for whoever's initiating writeback to send it for us?
>
> I think this is a general issue for the pagecache, not XFS. I'll
> brainstorm with willy the next time I encounter him.
>
>> > But
>> > offloading of error event generation to a workqueue should be doable (and
>> > event delivery is async anyway so from userspace POV there's no
>> > difference).
>>
>> Unless I'm misunderstanding you (possible!), that requires a memory
>> allocation to offload the error information to the work queue to
>> allow the fsnotify error message to be generated in an async manner.
>> That doesn't seem to solve anything.
>>
>> > Otherwise locking shouldn't be a problem AFAICT. WRT memory
>> > allocation, we currently preallocate the error events to avoid the loss of
>> > event due to ENOMEM. With current usecases (filesystem catastrophical error
>> > reporting) we have settled on a mempool with 32 preallocated events (note
>> > that preallocated event gets used only if normal kmalloc fails) for
>> > simplicity. If the error reporting mechanism is going to be used
>> > significantly more, we may need to reconsider this but it should be doable.
>> > And frankly if you have a storm of fs errors *and* the system is going
>> > ENOMEM at the same time, I have my doubts loosing some error report is
>> > going to do any more harm ;).
>>
>> Once the filesystem is shut down, it will need to turn off
>> individual sickness notifications because everything is sick at this
>> point.
>
> I was thinking that the existing fsnotify error set should adopt a 'YOUR
> FS IS DEAD' notification. Then when the fs goes down due to errors or
> the shutdown ioctl, we can broadcast that as the final last gasp of the
> filesystem.
>
>> > > As a result of (2), XFS now retains quite a bit of incore state about
>> > > its own health. The structure that fsnotify gives to userspace is very
>> > > generic (superblock, inode, errno, errno count). How might XFS export
>> > > a greater amount of information via this interface? We can provide
>> > > details at finer granularity -- for example, a specific data structure
>> > > under an allocation group or an inode, or specific quota records.
>> >
>> > Fsnotify (fanotify in fact) interface is fairly flexible in what can be
>> > passed through it. So if you need to pass some (reasonably short) binary
>> > blob to userspace which knows how to decode it, fanotify can handle that
>> > (with some wrapping). Obviously there's a tradeoff to make how much of the
>> > event is generic (as that is then easier to process by tools common for all
>> > filesystems) and how much is fs specific (which allows to pass more
>> > detailed information). But I guess we need to have concrete examples of
>> > events to discuss this.
>>
>> Fine grained health information will always be filesystem specific -
>> IMO it's not worth trying to make it generic when there is only one
>> filesystem that tracking and exporting fine-grained health
>> information. Once (if) we get multiple filesystems tracking fine
>> grained health information, then we'll have the information we need
>> to implement a useful generic set of notifications, but until then I
>> don't think we should try.
>
> Same here. XFS might want to send the generic notifications and follow
> them up with more specific information?
>
>> We should just export the notifications the filesystem utilities
>> need to do their work for the moment. When management applications
>> (e.g Stratis) get to the point where they can report/manage
>> filesystem health and need that information from multiple
>> filesystems types, then we can work out a useful common subset of
>> fine grained events across those filesystems that the applications
>> can listen for.
>
> If someone wants to write xfs_scrubd that listens for events and issues
> XFS_IOC_SCRUB_METADATA calls I'd be all ears. :)
>

Does it make sense to have a more generic, FS-specific application daemon
which can listen for such events from fanotify and take admin actions
based on them? For example, if any FS corruption is encountered causing
an FS shutdown and/or read-only remount:

1. Take an xfs metadump which can later be used for analysis of what went
wrong (of course this will need more thinking on how and where to store
it).
2. Initiate xfs_scrub with an XFS_IOC_SCRUB_METADATA call.
3. What else?

Of course, in production workloads the metadump can be collected while
obfuscating file/directory names ;)

-ritesh

2023-03-14 02:14:27

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [LSF TOPIC] online repair of filesystems: what next?

On Thu, Mar 09, 2023 at 11:56:57PM +0530, Ritesh Harjani wrote:
> "Darrick J. Wong" <[email protected]> writes:
>
> > On Thu, Mar 09, 2023 at 08:54:39AM +1100, Dave Chinner wrote:
> >> On Wed, Mar 08, 2023 at 06:12:06PM +0100, Jan Kara wrote:
> >> > Hi!
> >> >
> >> > I'm interested in this topic. Some comments below.
> >> >
> >> > On Tue 28-02-23 12:49:03, Darrick J. Wong wrote:
> >> > > Five years ago[0], we started a conversation about cross-filesystem
> >> > > userspace tooling for online fsck. I think enough time has passed for
> >> > > us to have another one, since a few things have happened since then:
> >> > >
> >> > > 1. ext4 has gained the ability to send corruption reports to a userspace
> >> > > monitoring program via fsnotify. Thanks, Collabora!
> >> > >
> >> > > 2. XFS now tracks successful scrubs and corruptions seen during runtime
> >> > > and during scrubs. Userspace can query this information.
> >> > >
> >> > > 3. Directory parent pointers, which enable online repair of the
> >> > > directory tree, is nearing completion.
> >> > >
> >> > > 4. Dave and I are working on merging online repair of space metadata for
> >> > > XFS. Online repair of directory trees is feature complete, but we
> >> > > still have one or two unresolved questions in the parent pointer
> >> > > code.
> >> > >
> >> > > 5. I've gotten a bit better[1] at writing systemd service descriptions
> >> > > for scheduling and performing background online fsck.
> >> > >
> >> > > Now that fsnotify_sb_error exists as a result of (1), I think we
> >> > > should figure out how to plumb calls into the readahead and writeback
> >> > > code so that IO failures can be reported to the fsnotify monitor. I
> >> > > suspect there may be a few difficulties here since fsnotify (iirc)
> >> > > allocates memory and takes locks.
> >> >
> >> > Well, if you want to generate fsnotify events from an interrupt handler,
> >> > you're going to have a hard time, I don't have a good answer for that.
> >>
> >> I don't think we ever do that, or need to do that. IO completions
> >> that can throw corruption errors are already running in workqueue
> >> contexts in XFS.
> >>
> >> Worst case, we throw all bios that have IO errors flagged to the
> >> same IO completion workqueues, and the problem of memory allocation,
> >> locks, etc in interrupt context goes away entire.
> >
> > Indeed. For XFS I think the only time we might need to fsnotify about
> > errors from interrupt context is writeback completions for a pure
> > overwrite? We could punt those to a workqueue as Dave says. Or figure
> > out a way for whoever's initiating writeback to send it for us?
> >
> > I think this is a general issue for the pagecache, not XFS. I'll
> > brainstorm with willy the next time I encounter him.
> >
> >> > But
> >> > offloading of error event generation to a workqueue should be doable (and
> >> > event delivery is async anyway so from userspace POV there's no
> >> > difference).
> >>
> >> Unless I'm misunderstanding you (possible!), that requires a memory
> >> allocation to offload the error information to the work queue to
> >> allow the fsnotify error message to be generated in an async manner.
> >> That doesn't seem to solve anything.
> >>
> >> > Otherwise locking shouldn't be a problem AFAICT. WRT memory
> >> > allocation, we currently preallocate the error events to avoid the loss of
> >> > event due to ENOMEM. With current usecases (filesystem catastrophical error
> >> > reporting) we have settled on a mempool with 32 preallocated events (note
> >> > that preallocated event gets used only if normal kmalloc fails) for
> >> > simplicity. If the error reporting mechanism is going to be used
> >> > significantly more, we may need to reconsider this but it should be doable.
> >> > And frankly if you have a storm of fs errors *and* the system is going
> >> > ENOMEM at the same time, I have my doubts loosing some error report is
> >> > going to do any more harm ;).
> >>
> >> Once the filesystem is shut down, it will need to turn off
> >> individual sickness notifications because everything is sick at this
> >> point.
> >
> > I was thinking that the existing fsnotify error set should adopt a 'YOUR
> > FS IS DEAD' notification. Then when the fs goes down due to errors or
> > the shutdown ioctl, we can broadcast that as the final last gasp of the
> > filesystem.
> >
> >> > > As a result of (2), XFS now retains quite a bit of incore state about
> >> > > its own health. The structure that fsnotify gives to userspace is very
> >> > > generic (superblock, inode, errno, errno count). How might XFS export
> >> > > a greater amount of information via this interface? We can provide
> >> > > details at finer granularity -- for example, a specific data structure
> >> > > under an allocation group or an inode, or specific quota records.
> >> >
> >> > Fsnotify (fanotify in fact) interface is fairly flexible in what can be
> >> > passed through it. So if you need to pass some (reasonably short) binary
> >> > blob to userspace which knows how to decode it, fanotify can handle that
> >> > (with some wrapping). Obviously there's a tradeoff to make how much of the
> >> > event is generic (as that is then easier to process by tools common for all
> >> > filesystems) and how much is fs specific (which allows to pass more
> >> > detailed information). But I guess we need to have concrete examples of
> >> > events to discuss this.
> >>
> >> Fine grained health information will always be filesystem specific -
> >> IMO it's not worth trying to make it generic when there is only one
> >> filesystem that tracking and exporting fine-grained health
> >> information. Once (if) we get multiple filesystems tracking fine
> >> grained health information, then we'll have the information we need
> >> to implement a useful generic set of notifications, but until then I
> >> don't think we should try.
> >
> > Same here. XFS might want to send the generic notifications and follow
> > them up with more specific information?
> >
> >> We should just export the notifications the filesystem utilities
> >> need to do their work for the moment. When management applications
> >> (e.g Stratis) get to the point where they can report/manage
> >> filesystem health and need that information from multiple
> >> filesystems types, then we can work out a useful common subset of
> >> fine grained events across those filesystems that the applications
> >> can listen for.
> >
> > If someone wants to write xfs_scrubd that listens for events and issues
> > XFS_IOC_SCRUB_METADATA calls I'd be all ears. :)
> >
>
> Does it make sense to have more generic FS specific application daemon
> which can listen on such events from fanotify and take admin actions
> based on that.
> For e.g. If any FS corruption is encountered causing FS shutdown and/or
> ro mount.

If we ever wire up generic notifications for the pagecache and iomap
then I guess we /could/ at least build a generic service to do things
like blast the user's session notifier/sysadmin's monitoring service
when things go wrong.

> 1. then taking a xfs metadump which can later be used for analysis
> of what went wrong (ofcourse this will need more thinking on how and
> where to store it).
> 2. Initiating xfs_scrub with XFS_IOC_SCRUB_METATA call.

These things are all /very/ filesystem specific. For things like
metadump and auto-scrubbing I think we'd need something in xfsprogs, not
a generic tool.

> 3. What else?
>
> Ofcourse in production workloads the metadump can be collected by
> obfuscating file/directory names ;)

That said... it would be pretty useful if there was *some* ability to
automate capture of metadata dumps for ext4 and xfs. Once the fs goes
offline it's probably safe to capture the dump since (presumably) the fs
will not be writing to the block device any more.

The hard part is having a place to dump that much information. Do we
still trust the running system enough to handle it, or would we be
better off deferring that to a kdump payload?

--D

> -ritesh

2023-03-15 03:46:15

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [LSF TOPIC] online repair of filesystems: what next?

"Darrick J. Wong" <[email protected]> writes:

> On Thu, Mar 09, 2023 at 11:56:57PM +0530, Ritesh Harjani wrote:
>> "Darrick J. Wong" <[email protected]> writes:
>>
>> > On Thu, Mar 09, 2023 at 08:54:39AM +1100, Dave Chinner wrote:
>> >> On Wed, Mar 08, 2023 at 06:12:06PM +0100, Jan Kara wrote:
>> >> > Hi!
>> >> >
>> >> > I'm interested in this topic. Some comments below.
>> >> >
>> >> > On Tue 28-02-23 12:49:03, Darrick J. Wong wrote:
>> >> > > Five years ago[0], we started a conversation about cross-filesystem
>> >> > > userspace tooling for online fsck. I think enough time has passed for
>> >> > > us to have another one, since a few things have happened since then:
>> >> > >
>> >> > > 1. ext4 has gained the ability to send corruption reports to a userspace
>> >> > > monitoring program via fsnotify. Thanks, Collabora!
>> >> > >
>> >> > > 2. XFS now tracks successful scrubs and corruptions seen during runtime
>> >> > > and during scrubs. Userspace can query this information.
>> >> > >
>> >> > > 3. Directory parent pointers, which enable online repair of the
>> >> > > directory tree, is nearing completion.
>> >> > >
>> >> > > 4. Dave and I are working on merging online repair of space metadata for
>> >> > > XFS. Online repair of directory trees is feature complete, but we
>> >> > > still have one or two unresolved questions in the parent pointer
>> >> > > code.
>> >> > >
>> >> > > 5. I've gotten a bit better[1] at writing systemd service descriptions
>> >> > > for scheduling and performing background online fsck.
>> >> > >
>> >> > > Now that fsnotify_sb_error exists as a result of (1), I think we
>> >> > > should figure out how to plumb calls into the readahead and writeback
>> >> > > code so that IO failures can be reported to the fsnotify monitor. I
>> >> > > suspect there may be a few difficulties here since fsnotify (iirc)
>> >> > > allocates memory and takes locks.
>> >> >
>> >> > Well, if you want to generate fsnotify events from an interrupt handler,
>> >> > you're going to have a hard time, I don't have a good answer for that.
>> >>
>> >> I don't think we ever do that, or need to do that. IO completions
>> >> that can throw corruption errors are already running in workqueue
>> >> contexts in XFS.
>> >>
>> >> Worst case, we throw all bios that have IO errors flagged to the
>> >> same IO completion workqueues, and the problem of memory allocation,
>> >> locks, etc in interrupt context goes away entire.
>> >
>> > Indeed. For XFS I think the only time we might need to fsnotify about
>> > errors from interrupt context is writeback completions for a pure
>> > overwrite? We could punt those to a workqueue as Dave says. Or figure
>> > out a way for whoever's initiating writeback to send it for us?
>> >
>> > I think this is a general issue for the pagecache, not XFS. I'll
>> > brainstorm with willy the next time I encounter him.
>> >
>> >> > But
>> >> > offloading of error event generation to a workqueue should be doable (and
>> >> > event delivery is async anyway so from userspace POV there's no
>> >> > difference).
>> >>
>> >> Unless I'm misunderstanding you (possible!), that requires a memory
>> >> allocation to offload the error information to the work queue to
>> >> allow the fsnotify error message to be generated in an async manner.
>> >> That doesn't seem to solve anything.
>> >>
>> >> > Otherwise locking shouldn't be a problem AFAICT. WRT memory
>> >> > allocation, we currently preallocate the error events to avoid the loss of
>> >> > event due to ENOMEM. With current usecases (filesystem catastrophical error
>> >> > reporting) we have settled on a mempool with 32 preallocated events (note
>> >> > that preallocated event gets used only if normal kmalloc fails) for
>> >> > simplicity. If the error reporting mechanism is going to be used
>> >> > significantly more, we may need to reconsider this but it should be doable.
>> >> > And frankly if you have a storm of fs errors *and* the system is going
>> >> > ENOMEM at the same time, I have my doubts loosing some error report is
>> >> > going to do any more harm ;).
>> >>
>> >> Once the filesystem is shut down, it will need to turn off
>> >> individual sickness notifications because everything is sick at this
>> >> point.
>> >
>> > I was thinking that the existing fsnotify error set should adopt a 'YOUR
>> > FS IS DEAD' notification. Then when the fs goes down due to errors or
>> > the shutdown ioctl, we can broadcast that as the final last gasp of the
>> > filesystem.
>> >
>> >> > > As a result of (2), XFS now retains quite a bit of incore state about
>> >> > > its own health. The structure that fsnotify gives to userspace is very
>> >> > > generic (superblock, inode, errno, errno count). How might XFS export
>> >> > > a greater amount of information via this interface? We can provide
>> >> > > details at finer granularity -- for example, a specific data structure
>> >> > > under an allocation group or an inode, or specific quota records.
>> >> >
>> >> > Fsnotify (fanotify in fact) interface is fairly flexible in what can be
>> >> > passed through it. So if you need to pass some (reasonably short) binary
>> >> > blob to userspace which knows how to decode it, fanotify can handle that
>> >> > (with some wrapping). Obviously there's a tradeoff to make how much of the
>> >> > event is generic (as that is then easier to process by tools common for all
>> >> > filesystems) and how much is fs specific (which allows to pass more
>> >> > detailed information). But I guess we need to have concrete examples of
>> >> > events to discuss this.
>> >>
>> >> Fine grained health information will always be filesystem specific -
>> >> IMO it's not worth trying to make it generic when there is only one
>> >> filesystem that tracking and exporting fine-grained health
>> >> information. Once (if) we get multiple filesystems tracking fine
>> >> grained health information, then we'll have the information we need
>> >> to implement a useful generic set of notifications, but until then I
>> >> don't think we should try.
>> >
>> > Same here. XFS might want to send the generic notifications and follow
>> > them up with more specific information?
>> >
>> >> We should just export the notifications the filesystem utilities
>> >> need to do their work for the moment. When management applications
>> >> (e.g Stratis) get to the point where they can report/manage
>> >> filesystem health and need that information from multiple
>> >> filesystems types, then we can work out a useful common subset of
>> >> fine grained events across those filesystems that the applications
>> >> can listen for.
>> >
>> > If someone wants to write xfs_scrubd that listens for events and issues
>> > XFS_IOC_SCRUB_METADATA calls I'd be all ears. :)
>> >
>>
>> Does it make sense to have more generic FS specific application daemon
>> which can listen on such events from fanotify and take admin actions
>> based on that.
>> For e.g. If any FS corruption is encountered causing FS shutdown and/or
>> ro mount.
>
> If we ever wire up generic notifications for the pagecache and iomap
> then I guess we /could/ at least build a generic service to do things
> like blast the user's session notifier/sysadmin's monitoring service
> when things go wrong.
>

right.

>> 1. then taking a xfs metadump which can later be used for analysis
>> of what went wrong (ofcourse this will need more thinking on how and
>> where to store it).
>> 2. Initiating xfs_scrub with XFS_IOC_SCRUB_METATA call.
>
> These things are all /very/ filesystem specific. For things like
> metadump and auto-scrubbing I think we'd need something in xfsprogs, not
> a generic tool.
>

I meant a generic fsadmin tool with plugins for each filesystem to take
FS-specific actions when anything gets reported. It could use tools from
xfsprogs to take FS-specific actions, like capturing an xfs metadump.


>> 3. What else?
>>
>> Ofcourse in production workloads the metadump can be collected by
>> obfuscating file/directory names ;)
>
> That said... it would be pretty useful if there was *some* ability to
> automate capture of metadata dumps for ext4 and xfs. Once the fs goes
> offline it's probably safe to capture the dump since (presumably) the fs
> will not be writing to the block device any more.
>
> The hard part is having a place to dump that much information. Do we
> still trust the running system enough to handle it, or would we be
> better off deferring that to a kdump payload?

I agree kdump is a better place. We don't know the state of the system.


>
> --D
>
>> -ritesh

2023-04-15 12:34:50

by Amir Goldstein

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] online repair of filesystems: what next?

On Tue, Feb 28, 2023 at 10:49 PM Darrick J. Wong <[email protected]> wrote:
>
> Hello fsdevel people,
>
> Five years ago[0], we started a conversation about cross-filesystem
> userspace tooling for online fsck. I think enough time has passed for
> us to have another one, since a few things have happened since then:
>
> 1. ext4 has gained the ability to send corruption reports to a userspace
> monitoring program via fsnotify. Thanks, Collabora!
>
> 2. XFS now tracks successful scrubs and corruptions seen during runtime
> and during scrubs. Userspace can query this information.
>
> 3. Directory parent pointers, which enable online repair of the
> directory tree, is nearing completion.
>
> 4. Dave and I are working on merging online repair of space metadata for
> XFS. Online repair of directory trees is feature complete, but we
> still have one or two unresolved questions in the parent pointer
> code.
>
> 5. I've gotten a bit better[1] at writing systemd service descriptions
> for scheduling and performing background online fsck.
>
> Now that fsnotify_sb_error exists as a result of (1), I think we
> should figure out how to plumb calls into the readahead and writeback
> code so that IO failures can be reported to the fsnotify monitor. I
> suspect there may be a few difficulties here since fsnotify (iirc)
> allocates memory and takes locks.
>
> As a result of (2), XFS now retains quite a bit of incore state about
> its own health. The structure that fsnotify gives to userspace is very
> generic (superblock, inode, errno, errno count). How might XFS export
> a greater amount of information via this interface? We can provide
> details at finer granularity -- for example, a specific data structure
> under an allocation group or an inode, or specific quota records.
>
> With (4) on the way, I can envision wanting a system service that would
> watch for these fsnotify events, and transform the error reports into
> targeted repair calls in the kernel. This of course would be very
> filesystem specific, but I would also like to hear from anyone pondering
> other usecases for fsnotify filesystem error monitors.
>
> Once (3) lands, XFS gains the ability to translate a block device IO
> error to an inode number and file offset, and then the inode number to a
> path. In other words, your file breaks and now we can tell applications
> which file it was so they can failover or redownload it or whatever.
> Ric Wheeler mentioned this in 2018's session.
>
> The final topic from that 2018 session concerned generic wrappers for
> fsscrub. I haven't pushed hard on that topic because XFS hasn't had
> much to show for that. Now that I'm better versed in systemd services,
> I envision three ways to interact with online fsck:
>
> - A CLI program that can be run by anyone.
>
> - Background systemd services that fire up periodically.
>
> - A dbus service that programs can bind to and request a fsck.
>
> I still think there's an opportunity to standardize the naming to make
> it easier to use a variety of filesystems. I propose for the CLI:
>
> /usr/sbin/fsscrub $mnt that calls /usr/sbin/fsscrub.$FSTYP $mnt
>
> For systemd services, I propose "fsscrub@<escaped mountpoint>". I
> suspect we want a separate background service that itself runs
> periodically and invokes the fsscrub@$mnt services. xfsprogs already
> has a xfs_scrub_all service that does that. The services are nifty
> because it's really easy to restrict privileges, implement resource
> usage controls, and use private name/mountspaces to isolate the process
> from the rest of the system.
>
> dbus is a bit trickier, since there's no precedent at all. I guess
> we'd have to define an interface for filesystem "object". Then we could
> write a service that establishes a well-known bus name and maintains
> object paths for each mounted filesystem. Each of those objects would
> export the filesystem interface, and that's how programs would call
> online fsck as a service.
>
> Ok, that's enough for a single session topic. Thoughts? :)

Darrick,

Quick question.
You indicated that you would like to discuss the topics:
Atomic file contents exchange
Atomic directio writes

Are those intended to be in a separate session from online fsck?
Both in the same session?

I know you posted patches for FIEXCHANGE_RANGE [1],
but they were hiding inside a huge DELUGE and people
were on New Year's holidays, so nobody commented.

Perhaps you should consider posting an up-to-date
topic suggestion to let people have an opportunity to
start a discussion before LSFMM.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/167243843494.699466.5163281976943635014.stgit@magnolia/

2023-04-16 08:40:16

by Qu Wenruo

[permalink] [raw]
Subject: Re: [LSF TOPIC] online repair of filesystems: what next?



On 2023/3/1 04:49, Darrick J. Wong wrote:
> Hello fsdevel people,
>
> Five years ago[0], we started a conversation about cross-filesystem
> userspace tooling for online fsck. I think enough time has passed for
> us to have another one, since a few things have happened since then:
>
> 1. ext4 has gained the ability to send corruption reports to a userspace
> monitoring program via fsnotify. Thanks, Collabora!

I'm not familiar with the new fsnotify thing; is there any article to
start with?

I really believe we should have a generic interface to report errors;
currently btrfs reports extra details only through dmesg (like the
logical/physical addresses of the corruption, the reason, the involved
inodes, etc.), which is far from ideal.

>
> 2. XFS now tracks successful scrubs and corruptions seen during runtime
> and during scrubs. Userspace can query this information.
>
> 3. Directory parent pointers, which enable online repair of the
> directory tree, is nearing completion.
>
> 4. Dave and I are working on merging online repair of space metadata for
> XFS. Online repair of directory trees is feature complete, but we
> still have one or two unresolved questions in the parent pointer
> code.
>
> 5. I've gotten a bit better[1] at writing systemd service descriptions
> for scheduling and performing background online fsck.
>
> Now that fsnotify_sb_error exists as a result of (1), I think we
> should figure out how to plumb calls into the readahead and writeback
> code so that IO failures can be reported to the fsnotify monitor. I
> suspect there may be a few difficulties here since fsnotify (iirc)
> allocates memory and takes locks.
>
> As a result of (2), XFS now retains quite a bit of incore state about
> its own health. The structure that fsnotify gives to userspace is very
> generic (superblock, inode, errno, errno count). How might XFS export
> a greater amount of information via this interface? We can provide
> details at finer granularity -- for example, a specific data structure
> under an allocation group or an inode, or specific quota records.

The same for btrfs.

Some btrfs-specific info, like the subvolume id, is also needed to locate
the corrupted inode (an inode number is not unique across the whole fs,
only inside one subvolume).

And something like the file path of the corrupted inode is also very
helpful for end users to locate (and usually delete) the offending file.

>
> With (4) on the way, I can envision wanting a system service that would
> watch for these fsnotify events, and transform the error reports into
> targeted repair calls in the kernel.

Btrfs has two ways of doing repair:

- Read-time repair
  This happens automatically for both the involved data and metadata,
  as long as the fs is mounted RW.

- Scrub-time repair
  This repair is also automatic.
  The main difference is that scrub is manually triggered by user
  space; otherwise it can be considered a full read of the fs (both
  metadata and data).

But btrfs repair only involves using the extra copies; it was never
intended to repair things like directories.
(That's still the job of btrfs check, and the complex cross-references
in btrfs are not designed to be repaired at runtime.)

Currently both kinds of repair result in a dmesg-based report, while
scrub has its own interface to report some very basic accounting, like
how many sectors are corrupted and how many were repaired.

A feature-rich and generic interface to report errors is definitely a
good direction to go in.

> This of course would be very
> filesystem specific, but I would also like to hear from anyone pondering
> other usecases for fsnotify filesystem error monitors.

Btrfs also has internal error counters, but those are accumulated
values; sometimes they're not that helpful and can even be confusing.

If we had such an interface, we could more or less get rid of the
internal error counters and rely on user space to do the history
recording.
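
To make it concrete, the user space history recording I have in mind
could be as small as the untested sketch below; it assumes systemd's
sd-journal, and the FS_* field names are just made up for illustration:

/*
 * Untested sketch: store one decoded error report in the journal so
 * the history lives in user space instead of in-kernel counters.
 * Build with -lsystemd.
 */
#include <syslog.h>
#include <systemd/sd-journal.h>

static void record_fs_error(const char *fstype, const char *dev,
                            int error, unsigned int error_count)
{
        sd_journal_send("MESSAGE=%s error %d on %s (seen %u times)",
                        fstype, error, dev, error_count,
                        "PRIORITY=%d", LOG_ERR,
                        "FS_TYPE=%s", fstype,
                        "FS_DEVICE=%s", dev,
                        "FS_ERRNO=%d", error,
                        NULL);
}

Then "journalctl FS_DEVICE=/dev/sdb" (or whatever the device is) gives
the full history without the kernel keeping any counters at all.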

>
> Once (3) lands, XFS gains the ability to translate a block device IO
> error to an inode number and file offset, and then the inode number to a
> path. In other words, your file breaks and now we can tell applications
> which file it was so they can failover or redownload it or whatever.
> Ric Wheeler mentioned this in 2018's session.

Yeah, if a user space daemon could automatically (at least according
to some policy) delete offending files, that would be a great help.

We have hit several reports where corrupted files (with no extra copy
to recover from) prevent btrfs balance from finishing, and users have
to locate the file from dmesg, delete it, and then retry the balance.

Thus such an interface could greatly improve the user experience.
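
Purely as an illustration, the policy could be as dumb as the sketch
below; the path prefix and the function name are invented:

/*
 * Illustration only: once the daemon has resolved an error report to
 * a path, apply a trivial policy before deleting anything.  The
 * "/mnt/scratch/cache/" prefix is an invented example policy.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int maybe_drop_corrupted_file(const char *path)
{
        static const char recreatable[] = "/mnt/scratch/cache/";

        /* Only auto-delete files we know can be re-created. */
        if (strncmp(path, recreatable, sizeof(recreatable) - 1) != 0) {
                fprintf(stderr, "%s is corrupted, needs admin attention\n",
                        path);
                return 0;
        }
        if (unlink(path) != 0) {
                perror("unlink");
                return -1;
        }
        fprintf(stderr, "dropped corrupted file %s\n", path);
        return 1;
}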

Thanks,
Qu

>
> The final topic from that 2018 session concerned generic wrappers for
> fsscrub. I haven't pushed hard on that topic because XFS hasn't had
> much to show for that. Now that I'm better versed in systemd services,
> I envision three ways to interact with online fsck:
>
> - A CLI program that can be run by anyone.
>
> - Background systemd services that fire up periodically.
>
> - A dbus service that programs can bind to and request a fsck.
>
> I still think there's an opportunity to standardize the naming to make
> it easier to use a variety of filesystems. I propose for the CLI:
>
> /usr/sbin/fsscrub $mnt that calls /usr/sbin/fsscrub.$FSTYP $mnt
>
> For systemd services, I propose "fsscrub@<escaped mountpoint>". I
> suspect we want a separate background service that itself runs
> periodically and invokes the fsscrub@$mnt services. xfsprogs already
> has a xfs_scrub_all service that does that. The services are nifty
> because it's really easy to restrict privileges, implement resource
> usage controls, and use private name/mountspaces to isolate the process
> from the rest of the system.
>
> dbus is a bit trickier, since there's no precedent at all. I guess
> we'd have to define an interface for filesystem "object". Then we could
> write a service that establishes a well-known bus name and maintains
> object paths for each mounted filesystem. Each of those objects would
> export the filesystem interface, and that's how programs would call
> online fsck as a service.
>
> Ok, that's enough for a single session topic. Thoughts? :)
>
> --D
>
> [0] https://lwn.net/Articles/754504/
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-optimize-by-default

2023-04-16 08:47:36

by Amir Goldstein

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] online repair of filesystems: what next?

On Sun, Apr 16, 2023 at 11:11 AM Qu Wenruo <[email protected]> wrote:
>
>
>
> On 2023/3/1 04:49, Darrick J. Wong wrote:
> > Hello fsdevel people,
> >
> > Five years ago[0], we started a conversation about cross-filesystem
> > userspace tooling for online fsck. I think enough time has passed for
> > us to have another one, since a few things have happened since then:
> >
> > 1. ext4 has gained the ability to send corruption reports to a userspace
> > monitoring program via fsnotify. Thanks, Collabora!
>
> Not familiar with the new fsnotify thing, any article to start?

https://docs.kernel.org/admin-guide/filesystem-monitoring.html#file-system-error-reporting

A filesystem needs to opt in with fsnotify_sb_error() calls, and
currently only ext4 does that.
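
On the listening side it's plain fanotify; a monitor boils down to
something like this untested sketch (the mount point and buffer size
are arbitrary):

/*
 * Untested sketch of a FAN_FS_ERROR listener.  "/mnt/scratch" is just
 * a placeholder; needs CAP_SYS_ADMIN and a kernel with FAN_FS_ERROR
 * support (v5.16+).
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/fanotify.h>

/* Walk the info records that follow the event metadata. */
static void handle_event(struct fanotify_event_metadata *m)
{
        struct fanotify_event_info_header *h = (void *)(m + 1);
        char *end = (char *)m + m->event_len;

        while ((char *)h < end) {
                if (h->info_type == FAN_EVENT_INFO_TYPE_ERROR) {
                        struct fanotify_event_info_error *e = (void *)h;

                        printf("fs error %d, seen %u times\n",
                               e->error, e->error_count);
                }
                /* Other record types (e.g. the FID) are skipped here. */
                h = (void *)((char *)h + h->len);
        }
}

int main(void)
{
        struct fanotify_event_metadata buf[256], *m;
        int fd;

        fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, 0);
        if (fd < 0 ||
            fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
                          FAN_FS_ERROR, AT_FDCWD, "/mnt/scratch") < 0) {
                perror("fanotify");
                return 1;
        }

        for (;;) {
                ssize_t len = read(fd, buf, sizeof(buf));

                if (len <= 0)
                        break;
                for (m = buf; FAN_EVENT_OK(m, len);
                     m = FAN_EVENT_NEXT(m, len))
                        handle_event(m);
        }
        return 0;
}

A real monitor would obviously do more than printf, but that is the
whole kernel-facing part.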

>
> I really believe we should have a generic interface to report errors,
> currently btrfs reports extra details just through dmesg (like the
> logical/physical of the corruption, reason, involved inodes etc), which
> is far from ideal.
>
> >
> > 2. XFS now tracks successful scrubs and corruptions seen during runtime
> > and during scrubs. Userspace can query this information.
> >
> > 3. Directory parent pointers, which enable online repair of the
> > directory tree, is nearing completion.
> >
> > 4. Dave and I are working on merging online repair of space metadata for
> > XFS. Online repair of directory trees is feature complete, but we
> > still have one or two unresolved questions in the parent pointer
> > code.
> >
> > 5. I've gotten a bit better[1] at writing systemd service descriptions
> > for scheduling and performing background online fsck.
> >
> > Now that fsnotify_sb_error exists as a result of (1), I think we
> > should figure out how to plumb calls into the readahead and writeback
> > code so that IO failures can be reported to the fsnotify monitor. I
> > suspect there may be a few difficulties here since fsnotify (iirc)
> > allocates memory and takes locks.
> >
> > As a result of (2), XFS now retains quite a bit of incore state about
> > its own health. The structure that fsnotify gives to userspace is very
> > generic (superblock, inode, errno, errno count). How might XFS export
> > a greater amount of information via this interface? We can provide
> > details at finer granularity -- for example, a specific data structure
> > under an allocation group or an inode, or specific quota records.
>
> The same for btrfs.
>
> Some btrfs specific info like subvolume id is also needed to locate the
> corrupted inode (ino is not unique among the full fs, but only inside
> one subvolume).
>

The fanotify error event (which btrfs does not currently generate)
contains an "FID record"; the FID is fsid+file_handle.
For btrfs, the file_handle would be of type FILEID_BTRFS_WITHOUT_PARENT,
so it would include the subvolume root ino.
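
Resolving that handle back to a path is the usual open_by_handle_at()
dance; a rough, untested sketch, assuming the monitor holds an fd on
the mount point and runs with CAP_DAC_READ_SEARCH:

/*
 * Rough sketch, untested: turn the file_handle from a
 * FAN_EVENT_INFO_TYPE_FID record back into a path via /proc/self/fd.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/fanotify.h>

static int fid_to_path(int mount_fd, struct fanotify_event_info_fid *fid,
                       char *path, size_t pathsz)
{
        struct file_handle *fh = (struct file_handle *)fid->handle;
        char proc[64];
        ssize_t n;
        int fd;

        fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
        if (fd < 0)
                return -1;      /* e.g. ESTALE if the inode is already gone */

        snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
        n = readlink(proc, path, pathsz - 1);
        close(fd);
        if (n < 0)
                return -1;
        path[n] = '\0';
        return 0;
}

From there the monitor has a path it can hand to whatever policy does
the actual handling.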

> And something like file paths for the corrupted inode is also very
> helpful for end users to locate (and normally delete) the offending inode.
>

This interface was merged without the ability to report an fs-specific
info blob, but it was designed in a way that would allow adding that blob.

Thanks,
Amir.

2023-04-18 05:07:47

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] online repair of filesystems: what next?

On Sat, Apr 15, 2023 at 03:18:05PM +0300, Amir Goldstein wrote:
> On Tue, Feb 28, 2023 at 10:49 PM Darrick J. Wong <[email protected]> wrote:
> >
> > Hello fsdevel people,
> >
> > Five years ago[0], we started a conversation about cross-filesystem
> > userspace tooling for online fsck. I think enough time has passed for
> > us to have another one, since a few things have happened since then:
> >
> > 1. ext4 has gained the ability to send corruption reports to a userspace
> > monitoring program via fsnotify. Thanks, Collabora!
> >
> > 2. XFS now tracks successful scrubs and corruptions seen during runtime
> > and during scrubs. Userspace can query this information.
> >
> > 3. Directory parent pointers, which enable online repair of the
> > directory tree, is nearing completion.
> >
> > 4. Dave and I are working on merging online repair of space metadata for
> > XFS. Online repair of directory trees is feature complete, but we
> > still have one or two unresolved questions in the parent pointer
> > code.
> >
> > 5. I've gotten a bit better[1] at writing systemd service descriptions
> > for scheduling and performing background online fsck.
> >
> > Now that fsnotify_sb_error exists as a result of (1), I think we
> > should figure out how to plumb calls into the readahead and writeback
> > code so that IO failures can be reported to the fsnotify monitor. I
> > suspect there may be a few difficulties here since fsnotify (iirc)
> > allocates memory and takes locks.
> >
> > As a result of (2), XFS now retains quite a bit of incore state about
> > its own health. The structure that fsnotify gives to userspace is very
> > generic (superblock, inode, errno, errno count). How might XFS export
> > a greater amount of information via this interface? We can provide
> > details at finer granularity -- for example, a specific data structure
> > under an allocation group or an inode, or specific quota records.
> >
> > With (4) on the way, I can envision wanting a system service that would
> > watch for these fsnotify events, and transform the error reports into
> > targeted repair calls in the kernel. This of course would be very
> > filesystem specific, but I would also like to hear from anyone pondering
> > other usecases for fsnotify filesystem error monitors.
> >
> > Once (3) lands, XFS gains the ability to translate a block device IO
> > error to an inode number and file offset, and then the inode number to a
> > path. In other words, your file breaks and now we can tell applications
> > which file it was so they can failover or redownload it or whatever.
> > Ric Wheeler mentioned this in 2018's session.
> >
> > The final topic from that 2018 session concerned generic wrappers for
> > fsscrub. I haven't pushed hard on that topic because XFS hasn't had
> > much to show for that. Now that I'm better versed in systemd services,
> > I envision three ways to interact with online fsck:
> >
> > - A CLI program that can be run by anyone.
> >
> > - Background systemd services that fire up periodically.
> >
> > - A dbus service that programs can bind to and request a fsck.
> >
> > I still think there's an opportunity to standardize the naming to make
> > it easier to use a variety of filesystems. I propose for the CLI:
> >
> > /usr/sbin/fsscrub $mnt that calls /usr/sbin/fsscrub.$FSTYP $mnt
> >
> > For systemd services, I propose "fsscrub@<escaped mountpoint>". I
> > suspect we want a separate background service that itself runs
> > periodically and invokes the fsscrub@$mnt services. xfsprogs already
> > has a xfs_scrub_all service that does that. The services are nifty
> > because it's really easy to restrict privileges, implement resource
> > usage controls, and use private name/mountspaces to isolate the process
> > from the rest of the system.
> >
> > dbus is a bit trickier, since there's no precedent at all. I guess
> > we'd have to define an interface for filesystem "object". Then we could
> > write a service that establishes a well-known bus name and maintains
> > object paths for each mounted filesystem. Each of those objects would
> > export the filesystem interface, and that's how programs would call
> > online fsck as a service.
> >
> > Ok, that's enough for a single session topic. Thoughts? :)
>
> Darrick,
>
> Quick question.
> You indicated that you would like to discuss the topics:
> Atomic file contents exchange
> Atomic directio writes

This one ^^^^^^^^ topic should still get its own session, ideally with
Martin Petersen and John Garry running it. A few cloud vendors'
software-defined storage stacks can support multi-LBA atomic writes, and
some database software could take advantage of that to reduce nested WAL
overhead.

> Are those intended to be in a separate session from online fsck?
> Both in the same session?
>
> I know you posted patches for FIEXCHANGE_RANGE [1],
> but they were hiding inside a huge DELUGE and people
> were on New Years holidays, so nobody commented.

After 3 years of sparse review comments, I decided to withdraw
FIEXCHANGE_RANGE from general consideration after realizing that very
few filesystems actually have the infrastructure to support atomic file
contents exchange, hence there's little to be gained from undertaking
fsdevel bikeshedding.

> Perhaps you should consider posting an uptodate
> topic suggestion to let people have an opportunity to
> start a discussion before LSFMM.

TBH, most of my fs complaints these days are more managerial (Are we
spending too much time on LTS? How on earth do we prioritize projects
with all these drive-by bots?? Why can't we support large engineering
efforts better???) than technical.

(I /am/ willing to have an "Online fs metadata reconstruction: How does
it work, and can I have some of what you're smoking?" BOF tho)

--D

> Thanks,
> Amir.
>
> [1] https://lore.kernel.org/linux-fsdevel/167243843494.699466.5163281976943635014.stgit@magnolia/

2023-04-18 07:47:54

by Amir Goldstein

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] online repair of filesystems: what next?

On Tue, Apr 18, 2023 at 7:46 AM Darrick J. Wong <[email protected]> wrote:
>
> On Sat, Apr 15, 2023 at 03:18:05PM +0300, Amir Goldstein wrote:
> > On Tue, Feb 28, 2023 at 10:49 PM Darrick J. Wong <[email protected]> wrote:
...
> > Darrick,
> >
> > Quick question.
> > You indicated that you would like to discuss the topics:
> > Atomic file contents exchange
> > Atomic directio writes
>
> This one ^^^^^^^^ topic should still get its own session, ideally with
> Martin Petersen and John Garry running it. A few cloud vendors'
> software defined storage stacks can support multi-lba atomic writes, and
> some database software could take advantage of that to reduce nested WAL
> overhead.
>

CC Martin.
If you want to lead this session, please schedule it.

> > Are those intended to be in a separate session from online fsck?
> > Both in the same session?
> >
> > I know you posted patches for FIEXCHANGE_RANGE [1],
> > but they were hiding inside a huge DELUGE and people
> > were on New Years holidays, so nobody commented.
>
> After 3 years of sparse review comments, I decided to withdraw
> FIEXCHANGE_RANGE from general consideration after realizing that very
> few filesystems actually have the infrastructure to support atomic file
> contents exchange, hence there's little to be gained from undertaking
> fsdevel bikeshedding.
>
> > Perhaps you should consider posting an uptodate
> > topic suggestion to let people have an opportunity to
> > start a discussion before LSFMM.
>
> TBH, most of my fs complaints these days are managerial problems (Are we
> spending too much time on LTS? How on earth do we prioritize projects
> with all these drive by bots?? Why can't we support large engineering
> efforts better???) than technical.

I penciled in one session for "FS stable backporting (and other LTS woes)".
I made it a cross FS/IO session so we can have it in the big room,
and you are welcome to take the discussion in any direction you want.

>
> (I /am/ willing to have a "Online fs metadata reconstruction: How does
> it work, and can I have some of what you're smoking?" BOF tho)
>

I already penciled in a session for this one.
Maybe it would be interesting for the crowd to hear a bit about the
"behind the scenes" - how hard it was, and still is, to pull off an
engineering project of this scale - lessons learned, things you
might have done differently.

Cheers,
Amir.

2023-04-19 02:14:39

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] online repair of filesystems: what next?

On Tue, Apr 18, 2023 at 10:46:32AM +0300, Amir Goldstein wrote:
> On Tue, Apr 18, 2023 at 7:46 AM Darrick J. Wong <[email protected]> wrote:
> >
> > On Sat, Apr 15, 2023 at 03:18:05PM +0300, Amir Goldstein wrote:
> > > On Tue, Feb 28, 2023 at 10:49 PM Darrick J. Wong <[email protected]> wrote:
> ...
> > > Darrick,
> > >
> > > Quick question.
> > > You indicated that you would like to discuss the topics:
> > > Atomic file contents exchange
> > > Atomic directio writes
> >
> > This one ^^^^^^^^ topic should still get its own session, ideally with
> > Martin Petersen and John Garry running it. A few cloud vendors'
> > software defined storage stacks can support multi-lba atomic writes, and
> > some database software could take advantage of that to reduce nested WAL
> > overhead.
> >
>
> CC Martin.
> If you want to lead this session, please schedule it.
>
> > > Are those intended to be in a separate session from online fsck?
> > > Both in the same session?
> > >
> > > I know you posted patches for FIEXCHANGE_RANGE [1],
> > > but they were hiding inside a huge DELUGE and people
> > > were on New Years holidays, so nobody commented.
> >
> > After 3 years of sparse review comments, I decided to withdraw
> > FIEXCHANGE_RANGE from general consideration after realizing that very
> > few filesystems actually have the infrastructure to support atomic file
> > contents exchange, hence there's little to be gained from undertaking
> > fsdevel bikeshedding.
> >
> > > Perhaps you should consider posting an uptodate
> > > topic suggestion to let people have an opportunity to
> > > start a discussion before LSFMM.
> >
> > TBH, most of my fs complaints these days are managerial problems (Are we
> > spending too much time on LTS? How on earth do we prioritize projects
> > with all these drive by bots?? Why can't we support large engineering
> > efforts better???) than technical.
>
> I penciled one session for "FS stable backporting (and other LTS woes)".
> I made it a cross FS/IO session so we can have this session in the big room
> and you are welcome to pull this discussion to any direction you want.

Ok, thank you. Hopefully we can get all the folks who do backports into
this one. That might be a big ask for Chandan, depending on when you
schedule it.

(Unless it's scheduled for 7pm :P)

> >
> > (I /am/ willing to have a "Online fs metadata reconstruction: How does
> > it work, and can I have some of what you're smoking?" BOF tho)
> >
>
> I penciled a session for this one already.
> Maybe it would be interesting for the crowd to hear some about
> "behind the scenes" - how hard it was and still is to pull off an
> engineering project of this scale - lessons learned, things that
> you might have done differently.

Thanks!

--D

>
> Cheers,
> Amir.

2023-04-19 03:38:32

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] online repair of filesystems: what next?

On Tue, Apr 18, 2023 at 10:46:32AM +0300, Amir Goldstein wrote:
> On Tue, Apr 18, 2023 at 7:46 AM Darrick J. Wong <[email protected]> wrote:
> > TBH, most of my fs complaints these days are managerial problems (Are we
> > spending too much time on LTS? How on earth do we prioritize projects
> > with all these drive by bots?? Why can't we support large engineering
> > efforts better???) than technical.
>
> I penciled one session for "FS stable backporting (and other LTS woes)".
> I made it a cross FS/IO session so we can have this session in the big room
> and you are welcome to pull this discussion to any direction you want.

Would it make sense to include the MM folks as well? Certainly MM
has made the same choice as XFS ("No automatic backports, we will cc:
stable on patches that make sense").

2023-04-19 04:20:42

by Amir Goldstein

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] online repair of filesystems: what next?

On Wed, Apr 19, 2023 at 6:34 AM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Apr 18, 2023 at 10:46:32AM +0300, Amir Goldstein wrote:
> > On Tue, Apr 18, 2023 at 7:46 AM Darrick J. Wong <[email protected]> wrote:
> > > TBH, most of my fs complaints these days are managerial problems (Are we
> > > spending too much time on LTS? How on earth do we prioritize projects
> > > with all these drive by bots?? Why can't we support large engineering
> > > efforts better???) than technical.
> >
> > I penciled one session for "FS stable backporting (and other LTS woes)".
> > I made it a cross FS/IO session so we can have this session in the big room
> > and you are welcome to pull this discussion to any direction you want.
>
> Would this make sense to include the MM folks as well? Certainly MM
> has made the same choice as XFS ("No automatic backports, we will cc:
> stable on patches that make sense").

Yeah, that makes sense.
Added MM to that session.

Thanks,
Amir.

2023-04-19 04:21:07

by Amir Goldstein

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] online repair of filesystems: what next?

On Wed, Apr 19, 2023 at 5:11 AM Darrick J. Wong <[email protected]> wrote:
>
> On Tue, Apr 18, 2023 at 10:46:32AM +0300, Amir Goldstein wrote:
> > On Tue, Apr 18, 2023 at 7:46 AM Darrick J. Wong <[email protected]> wrote:
> > >
> > > On Sat, Apr 15, 2023 at 03:18:05PM +0300, Amir Goldstein wrote:
> > > > On Tue, Feb 28, 2023 at 10:49 PM Darrick J. Wong <[email protected]> wrote:
> > ...
> > > > Darrick,
> > > >
> > > > Quick question.
> > > > You indicated that you would like to discuss the topics:
> > > > Atomic file contents exchange
> > > > Atomic directio writes
> > >
> > > This one ^^^^^^^^ topic should still get its own session, ideally with
> > > Martin Petersen and John Garry running it. A few cloud vendors'
> > > software defined storage stacks can support multi-lba atomic writes, and
> > > some database software could take advantage of that to reduce nested WAL
> > > overhead.
> > >
> >
> > CC Martin.
> > If you want to lead this session, please schedule it.
> >
> > > > Are those intended to be in a separate session from online fsck?
> > > > Both in the same session?
> > > >
> > > > I know you posted patches for FIEXCHANGE_RANGE [1],
> > > > but they were hiding inside a huge DELUGE and people
> > > > were on New Years holidays, so nobody commented.
> > >
> > > After 3 years of sparse review comments, I decided to withdraw
> > > FIEXCHANGE_RANGE from general consideration after realizing that very
> > > few filesystems actually have the infrastructure to support atomic file
> > > contents exchange, hence there's little to be gained from undertaking
> > > fsdevel bikeshedding.
> > >
> > > > Perhaps you should consider posting an uptodate
> > > > topic suggestion to let people have an opportunity to
> > > > start a discussion before LSFMM.
> > >
> > > TBH, most of my fs complaints these days are managerial problems (Are we
> > > spending too much time on LTS? How on earth do we prioritize projects
> > > with all these drive by bots?? Why can't we support large engineering
> > > efforts better???) than technical.
> >
> > I penciled one session for "FS stable backporting (and other LTS woes)".
> > I made it a cross FS/IO session so we can have this session in the big room
> > and you are welcome to pull this discussion to any direction you want.
>
> Ok, thank you. Hopefully we can get all the folks who do backports into
> this one. That might be a big ask for Chandan, depending on when you
> schedule it.
>
> (Unless it's schedule for 7pm :P)
>

Oh thanks for reminding me!
I moved it to Wed 9am, so it is more convenient for Chandan.

Thanks,
Amir.

2023-04-19 13:45:18

by Chandan Babu R

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] online repair of filesystems: what next?

On Wed, Apr 19, 2023 at 07:06:58 AM +0300, Amir Goldstein wrote:
> On Wed, Apr 19, 2023 at 5:11 AM Darrick J. Wong <[email protected]> wrote:
>>
>> On Tue, Apr 18, 2023 at 10:46:32AM +0300, Amir Goldstein wrote:
>> > On Tue, Apr 18, 2023 at 7:46 AM Darrick J. Wong <[email protected]> wrote:
>> > >
>> > > On Sat, Apr 15, 2023 at 03:18:05PM +0300, Amir Goldstein wrote:
>> > > > On Tue, Feb 28, 2023 at 10:49 PM Darrick J. Wong <[email protected]> wrote:
>> > ...
>> > > > Darrick,
>> > > >
>> > > > Quick question.
>> > > > You indicated that you would like to discuss the topics:
>> > > > Atomic file contents exchange
>> > > > Atomic directio writes
>> > >
>> > > This one ^^^^^^^^ topic should still get its own session, ideally with
>> > > Martin Petersen and John Garry running it. A few cloud vendors'
>> > > software defined storage stacks can support multi-lba atomic writes, and
>> > > some database software could take advantage of that to reduce nested WAL
>> > > overhead.
>> > >
>> >
>> > CC Martin.
>> > If you want to lead this session, please schedule it.
>> >
>> > > > Are those intended to be in a separate session from online fsck?
>> > > > Both in the same session?
>> > > >
>> > > > I know you posted patches for FIEXCHANGE_RANGE [1],
>> > > > but they were hiding inside a huge DELUGE and people
>> > > > were on New Years holidays, so nobody commented.
>> > >
>> > > After 3 years of sparse review comments, I decided to withdraw
>> > > FIEXCHANGE_RANGE from general consideration after realizing that very
>> > > few filesystems actually have the infrastructure to support atomic file
>> > > contents exchange, hence there's little to be gained from undertaking
>> > > fsdevel bikeshedding.
>> > >
>> > > > Perhaps you should consider posting an uptodate
>> > > > topic suggestion to let people have an opportunity to
>> > > > start a discussion before LSFMM.
>> > >
>> > > TBH, most of my fs complaints these days are managerial problems (Are we
>> > > spending too much time on LTS? How on earth do we prioritize projects
>> > > with all these drive by bots?? Why can't we support large engineering
>> > > efforts better???) than technical.
>> >
>> > I penciled one session for "FS stable backporting (and other LTS woes)".
>> > I made it a cross FS/IO session so we can have this session in the big room
>> > and you are welcome to pull this discussion to any direction you want.
>>
>> Ok, thank you. Hopefully we can get all the folks who do backports into
>> this one. That might be a big ask for Chandan, depending on when you
>> schedule it.
>>
>> (Unless it's schedule for 7pm :P)
>>
>
> Oh thanks for reminding me!
> I moved it to Wed 9am, so it is more convenient for Chandan.

This maps to 9:30 AM for me. Thanks for selecting a time which is convenient
for me.

--
chandan

2023-04-20 04:41:20

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] online repair of filesystems: what next?

On Wed, Apr 19, 2023 at 04:28:48PM +0530, Chandan Babu R wrote:
> On Wed, Apr 19, 2023 at 07:06:58 AM +0300, Amir Goldstein wrote:
> > On Wed, Apr 19, 2023 at 5:11 AM Darrick J. Wong <[email protected]> wrote:
> >>
> >> On Tue, Apr 18, 2023 at 10:46:32AM +0300, Amir Goldstein wrote:
> >> > On Tue, Apr 18, 2023 at 7:46 AM Darrick J. Wong <[email protected]> wrote:
> >> > >
> >> > > On Sat, Apr 15, 2023 at 03:18:05PM +0300, Amir Goldstein wrote:
> >> > > > On Tue, Feb 28, 2023 at 10:49 PM Darrick J. Wong <[email protected]> wrote:
> >> > ...
> >> > > > Darrick,
> >> > > >
> >> > > > Quick question.
> >> > > > You indicated that you would like to discuss the topics:
> >> > > > Atomic file contents exchange
> >> > > > Atomic directio writes
> >> > >
> >> > > This one ^^^^^^^^ topic should still get its own session, ideally with
> >> > > Martin Petersen and John Garry running it. A few cloud vendors'
> >> > > software defined storage stacks can support multi-lba atomic writes, and
> >> > > some database software could take advantage of that to reduce nested WAL
> >> > > overhead.
> >> > >
> >> >
> >> > CC Martin.
> >> > If you want to lead this session, please schedule it.
> >> >
> >> > > > Are those intended to be in a separate session from online fsck?
> >> > > > Both in the same session?
> >> > > >
> >> > > > I know you posted patches for FIEXCHANGE_RANGE [1],
> >> > > > but they were hiding inside a huge DELUGE and people
> >> > > > were on New Years holidays, so nobody commented.
> >> > >
> >> > > After 3 years of sparse review comments, I decided to withdraw
> >> > > FIEXCHANGE_RANGE from general consideration after realizing that very
> >> > > few filesystems actually have the infrastructure to support atomic file
> >> > > contents exchange, hence there's little to be gained from undertaking
> >> > > fsdevel bikeshedding.
> >> > >
> >> > > > Perhaps you should consider posting an uptodate
> >> > > > topic suggestion to let people have an opportunity to
> >> > > > start a discussion before LSFMM.
> >> > >
> >> > > TBH, most of my fs complaints these days are managerial problems (Are we
> >> > > spending too much time on LTS? How on earth do we prioritize projects
> >> > > with all these drive by bots?? Why can't we support large engineering
> >> > > efforts better???) than technical.
> >> >
> >> > I penciled one session for "FS stable backporting (and other LTS woes)".
> >> > I made it a cross FS/IO session so we can have this session in the big room
> >> > and you are welcome to pull this discussion to any direction you want.
> >>
> >> Ok, thank you. Hopefully we can get all the folks who do backports into
> >> this one. That might be a big ask for Chandan, depending on when you
> >> schedule it.
> >>
> >> (Unless it's schedule for 7pm :P)
> >>
> >
> > Oh thanks for reminding me!
> > I moved it to Wed 9am, so it is more convenient for Chandan.
>
> This maps to 9:30 AM for me. Thanks for selecting a time which is convenient
> for me.

Er... doesn't 9:30am for Chandan map to 9:00*pm* the previous evening
for those of us in Vancouver?

(Or I guess 9:30pm for Chandan if we actually are having a morning
session?)

Chandan: I'll ask Shirley to cancel our staff meeting so you don't have
a crazy(er) meeting schedule during LSF.

--D

> --
> chandan

2023-04-20 04:51:15

by Chandan Babu R

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] online repair of filesystems: what next?

On Wed, Apr 19, 2023 at 09:32:14 PM -0700, Darrick J. Wong wrote:
> On Wed, Apr 19, 2023 at 04:28:48PM +0530, Chandan Babu R wrote:
>> On Wed, Apr 19, 2023 at 07:06:58 AM +0300, Amir Goldstein wrote:
>> > On Wed, Apr 19, 2023 at 5:11 AM Darrick J. Wong <[email protected]> wrote:
>> >>
>> >> On Tue, Apr 18, 2023 at 10:46:32AM +0300, Amir Goldstein wrote:
>> >> > On Tue, Apr 18, 2023 at 7:46 AM Darrick J. Wong <[email protected]> wrote:
>> >> > >
>> >> > > On Sat, Apr 15, 2023 at 03:18:05PM +0300, Amir Goldstein wrote:
>> >> > > > On Tue, Feb 28, 2023 at 10:49 PM Darrick J. Wong <[email protected]> wrote:
>> >> > ...
>> >> > > > Darrick,
>> >> > > >
>> >> > > > Quick question.
>> >> > > > You indicated that you would like to discuss the topics:
>> >> > > > Atomic file contents exchange
>> >> > > > Atomic directio writes
>> >> > >
>> >> > > This one ^^^^^^^^ topic should still get its own session, ideally with
>> >> > > Martin Petersen and John Garry running it. A few cloud vendors'
>> >> > > software defined storage stacks can support multi-lba atomic writes, and
>> >> > > some database software could take advantage of that to reduce nested WAL
>> >> > > overhead.
>> >> > >
>> >> >
>> >> > CC Martin.
>> >> > If you want to lead this session, please schedule it.
>> >> >
>> >> > > > Are those intended to be in a separate session from online fsck?
>> >> > > > Both in the same session?
>> >> > > >
>> >> > > > I know you posted patches for FIEXCHANGE_RANGE [1],
>> >> > > > but they were hiding inside a huge DELUGE and people
>> >> > > > were on New Years holidays, so nobody commented.
>> >> > >
>> >> > > After 3 years of sparse review comments, I decided to withdraw
>> >> > > FIEXCHANGE_RANGE from general consideration after realizing that very
>> >> > > few filesystems actually have the infrastructure to support atomic file
>> >> > > contents exchange, hence there's little to be gained from undertaking
>> >> > > fsdevel bikeshedding.
>> >> > >
>> >> > > > Perhaps you should consider posting an uptodate
>> >> > > > topic suggestion to let people have an opportunity to
>> >> > > > start a discussion before LSFMM.
>> >> > >
>> >> > > TBH, most of my fs complaints these days are managerial problems (Are we
>> >> > > spending too much time on LTS? How on earth do we prioritize projects
>> >> > > with all these drive by bots?? Why can't we support large engineering
>> >> > > efforts better???) than technical.
>> >> >
>> >> > I penciled one session for "FS stable backporting (and other LTS woes)".
>> >> > I made it a cross FS/IO session so we can have this session in the big room
>> >> > and you are welcome to pull this discussion to any direction you want.
>> >>
>> >> Ok, thank you. Hopefully we can get all the folks who do backports into
>> >> this one. That might be a big ask for Chandan, depending on when you
>> >> schedule it.
>> >>
>> >> (Unless it's schedule for 7pm :P)
>> >>
>> >
>> > Oh thanks for reminding me!
>> > I moved it to Wed 9am, so it is more convenient for Chandan.
>>
>> This maps to 9:30 AM for me. Thanks for selecting a time which is convenient
>> for me.
>
> Er... doesn't 9:30am for Chandan map to 9:00*pm* the previous evening
> for those of us in Vancouver?
>
> (Or I guess 9:30pm for Chandan if we actually are having a morning
> session?)

Sorry, you are right. I mixed up AM/PM. It will indeed be 9:30 PM for me,
and I am fine with that time.

>
> Chandan: I'll ask Shirley to cancel our staff meeting so you don't have
> a crazy(er) meeting schedule during LSF.

Sure. Thank you.

--
chandan