2021-09-18 04:50:43

by Eric Blake

Subject: e2fsprogs concurrency questions

TL;DR summary: are there documented best practices for parallel access
to the same inode of an ext2 filesystem from multiple threads?

First, a meta-question: is there a publicly archived mailing list for
questions on e2fsprogs? The README merely mentions Ted's email
address, and http://e2fsprogs.sourceforge.net/ is silent on contact
information, although with some googling, I found at least
https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/
which suggests [email protected] as a worthwhile list to
mention on the web page.

Now, on to my real reason for writing. The nbdkit project is using
the ext2fs library to provide an ext2/3/4 filter on top of any data
being served over NBD (Network Block Device protocol) in userspace:
https://libguestfs.org/nbdkit-ext2-filter.1.html

Searching for the word 'thread' or 'concurrent' in libext2fs.info came
up with no hits, so I'm going off of minimal documentation, and mostly
what I can ascertain from existing examples (of which I'm not seeing
very many).

Right now, the nbdkit filter forces some rather strict serialization
in order to be conservatively safe: for every client that wants to
connect, the nbdkit filter calls ext2fs_open(), then eventually
ext2fs_file_open2(), then exposes the contents of that one extracted
file over NBD one operation at a time, then closes everything back
down before accepting a second client. But we'd LOVE to add some
parallelization; the NBD protocol allows multiple clients, as well as
out-of-order processing of requests from a single client.

Right away, I already know that calling ext2fs_open() more than once
on the same file system is a recipe for disaster (it is equivalent to
simultaneously mounting the same block device from more than one
bare-metal OS, and won't work). So I've got a proposal for how to rework
the nbdkit code to open the file system exactly once and share that
handle among multiple NBD clients:
https://listman.redhat.com/archives/libguestfs/2021-May/msg00028.html

However, in my testing, I quickly found that while it would let me
visit two independent inodes at once through two separate clients, I
was seeing inconsistencies when trying to visit the SAME inode through
two independent clients. That is, with (abbreviated code):

ext2fs_open(..., &fs);
ext2fs_namei(fs, ... "/foo", &ino);
ext2fs_file_open2(fs, ino, NULL, flags, &f1); // hand f1 to thread 1
ext2fs_file_open2(fs, ino, NULL, flags, &f2); // hand f2 to thread 2
// thread 1
ext2fs_file_read(f1, buf...);
// thread 2
ext2fs_file_write(f2, buf...);
ext2fs_file_flush(f2);
// thread 1
ext2fs_file_flush(f1);
ext2fs_file_read(f1, buf...);

the two open file handles carried independent buffering state - even
though thread 2 (tried to) flush everything, the handle f1 STILL
returns the data as it was before thread 2 made any modification.
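
The behavior can be modelled in a few lines of self-contained C
(hypothetical names, nothing below is libext2fs API): each "handle"
keeps a private buffered copy of a block, and flushing one handle
writes its cache back but has no way to invalidate the other handle's
cache:

```c
#include <string.h>

/* Hypothetical model of the failure mode, not libext2fs code: each
 * handle caches one block privately; flushing one handle writes its
 * cache back but cannot invalidate any other handle's cache. */

#define BLKSIZE 16

static char backing[BLKSIZE];          /* the shared "disk" */

struct handle {
    char cache[BLKSIZE];               /* per-handle buffered block */
    int  cached;                       /* cache holds valid data?   */
};

static void h_read(struct handle *h, char *buf)
{
    if (!h->cached) {                  /* fill cache from backing store */
        memcpy(h->cache, backing, BLKSIZE);
        h->cached = 1;
    }
    memcpy(buf, h->cache, BLKSIZE);    /* serve the read from cache */
}

static void h_write(struct handle *h, const char buf[BLKSIZE])
{
    memcpy(h->cache, buf, BLKSIZE);    /* buffered write, cache only */
    h->cached = 1;
}

static void h_flush(struct handle *h)
{
    if (h->cached)                     /* write back this handle's cache */
        memcpy(backing, h->cache, BLKSIZE);
    /* note: no way to invalidate any OTHER handle's cache */
}
```

With two such handles over the same backing store, a write plus flush
through one handle leaves the other handle serving its stale cache,
which matches what I see through f1 and f2 above.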

Is it okay to have two concurrent handles open to the same inode, or
do I need to implement a hash map on my end so that two NBD clients
requesting access to the same file within the ext2 filesystem share a
single inode? If concurrent handles are supported, what mechanism can
I use to ensure that a flush performed on one handle will be visible
for reading from the other handle, as ext2fs_file_flush does not seem
to be strong enough?

Next, when using a single open ext2_ino_t, are there any concurrency
restrictions that I must observe when using that handle from more than
one thread at a time? For example, is it safe to have two threads
both in the middle of a call to ext2_file_read() on that same handle,
or must I add my own mutex locking to ensure that a second thread
doesn't read data until the first thread is complete with its call?
Or put another way, are the ext2fs_* calls re-entrant?

Next, the nbdkit ext2 filter is using a custom io handler for
converting client requests as filtered through ext2fs back into raw
read/write/flush calls to pass to the real underlying NBD storage.
Among others, I implemented the io_flush(io_channel) callback, but in
debugging it, I see it only gets invoked during ext2fs_close(), and
not during ext2fs_file_flush(). Is this a symptom of me not calling
ext2fs_flush2() at points where I want to be sure actions on a single
file within the filesystem are flushed to persistent storage?

Finally, I see with
https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/
that you recently added EXT2_FLAG_THREADS, as well as
CHANNEL_FLAGS_THREADS. I think it should be fairly straightforward to
tweak my nbdkit custom IO manager to advertise CHANNEL_FLAGS_THREADS
(as the NBD protocol really DOES support parallel outstanding IO
requests), and then add EXT2_FLAG_THREADS into the flags I pass to
ext2fs_file_open2(), to try and get ext2fs to take advantage of
parallel access to the underlying storage (regardless of whether the
clients are parallel coming into ext2fs). Are there any concurrency
issues I should be aware of on that front when updating my code?

Obviously, when the kernel accesses an ext2/3/4 file system, it DOES
support full concurrency (separate user space processes can open
independent handles to the same file, and the processes must
coordinate with well-timed fsync() or similar any time there is an
expectation of a happens-before relation where actions from one
process must be observed from another). But nbdkit is all about
accessing the data of an ext2 filesystem from userspace, without any
kernel bio involvement, and is thus reliant on whatever concurrency
guarantees the e2fsprogs library has (or lacks).

--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org


2021-09-19 12:11:35

by Theodore Ts'o

Subject: Re: e2fsprogs concurrency questions

On Fri, Sep 17, 2021 at 04:06:55PM -0500, Eric Blake wrote:
> TL;DR summary: are there documented best practices for parallel access
> to the same inode of an ext2 filesystem from multiple threads?
>
> First, a meta-question: is there a publicly archived mailing list for
> questions on e2fsprogs? The README merely mentions Ted's email
> address, and http://e2fsprogs.sourceforge.net/ is silent on contact
> information, although with some googling, I found at least
> https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/
> which suggests [email protected] as a worthwhile list to
> mention on the web page.

Yes, discussions and patches relating to e2fsprogs take place on
[email protected]. (Just as xfsprogs patches and discussions
are sent to [email protected].)

> Now, on to my real reason for writing. The nbdkit project is using
> the ext2fs library to provide an ext2/3/4 filter on top of any data
> being served over NBD (Network Block Device protocol) in userspace:
> https://libguestfs.org/nbdkit-ext2-filter.1.html
>
> Searching for the word 'thread' or 'concurrent' in libext2fs.info came
> up with no hits, so I'm going off of minimal documentation, and mostly
> what I can ascertain from existing examples (of which I'm not seeing
> very many).

Historically, libext2fs and e2fsprogs had no pthreads or concurrency
support at all. This is because e2fsprogs predates Linux having
pthreads support.

This is _starting_ to change, but more on that in a little.

> Right now, the nbdkit filter forces some rather strict serialization
> in order to be conservatively safe: for every client that wants to
> connect, the nbdkit filter calls ext2fs_open(), then eventually
> ext2fs_file_open2(), then exposes the contents of that one extracted
> file over NBD one operation at a time, then closes everything back
> down before accepting a second client. But we'd LOVE to add some
> parallelization; the NBD protocol allows multiple clients, as well as
> out-of-order processing of requests from a single client.
>
> Right away, I already know that calling ext2fs_open() more than once
> on the same file system is a recipe for disaster (it is equivalent to
> mounting the same block device at once through more than one bare
> metal OS, and won't work). So I've got a proposal for how to rework
> the nbdkit code to open the file system exactly once and share that
> handle among multiple NBD clients:
> https://listman.redhat.com/archives/libguestfs/2021-May/msg00028.html

So you are apparently calling ext2fs_open() before forking, and then
you want to use the ext2fs handle from separate processes. Is that
correct?

That's not going to work if you are going to try to modify the file
system from different processes simultaneously. That's because
libext2fs uses a writeback cache. After the fork, each process has
its own copy of the writeback cache.

If you are using threads, older versions of libext2fs don't do any
locking before modifying data structures internal to the ext2_fs file
handle. So if two threads simultaneously try to use the "ext2_fs fs"
handle, they might try to access the block allocation bitmap (for
example) at the same time, without locking, and so bad things will
happen.

You can do your own locking to make sure only one thread is trying to
use the fs handle at a time, at which point you should be fine. So
you can have multiple clients accessing the file system without having
to open the file system, open a file, and then close the file and
close the file system before accepting the next client. But only one
client can be using the ext2_fs handle at a time, and if you want to
share any libext2fs data structure across multiple threads,
appropriate read/write locking would be needed.
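
The coarse-grained locking I'm describing could be sketched like this
(self-contained C with pthreads; the shared counter is just a stand-in
for the ext2_filsys handle, and the critical section is where the
ext2fs_* calls would go):

```c
#include <pthread.h>

/* Minimal sketch of coarse-grained locking: one mutex guards every
 * use of the shared fs handle, so only one client is ever inside
 * libext2fs at a time.  The "fs handle" is modelled by a plain
 * counter; real code would wrap ext2fs_* calls in the same way. */

static pthread_mutex_t fs_lock = PTHREAD_MUTEX_INITIALIZER;
static long fs_state;                  /* stand-in for the fs handle */

static void *client_thread(void *arg)
{
    int ops = *(int *)arg;
    for (int i = 0; i < ops; i++) {
        pthread_mutex_lock(&fs_lock);  /* one client in libext2fs at a time */
        fs_state++;                    /* e.g. ext2fs_file_read/write here */
        pthread_mutex_unlock(&fs_lock);
    }
    return NULL;
}

/* Run nthreads clients of ops operations each; returns final state.
 * With the mutex in place no updates are lost. */
static long run_clients(int nthreads, int ops)
{
    pthread_t tid[16];
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, client_thread, &ops);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return fs_state;
}
```

Without the mutex the final count would be unpredictable; with it,
every client's operations are serialized and none are lost.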

> Is it okay to have two concurrent handles open to the same inode, or
> do I need to implement a hash map on my end so that two NBD clients
> requesting access to the same file within the ext2 filesystem share a
> single inode? If concurrent handles are supported, what mechanism can
> I use to ensure that a flush performed on one handle will be visible
> for reading from the other handle, as ext2fs_file_flush does not seem
> to be strong enough?

You could have two threads sharing the same file handle, with locking
so that only one thread is using a file handle at a time. Also, since
we don't have an analogue for pread(2) and pwrite(2), each thread
would have to assume that the fseek position may have changed by some
other thread, so after it grabbed the file system lock, and then the
per-file lock, it would need to call ext2fs_file_llseek() to make sure
the file's position is at a known location before calling
ext2fs_file_read() or ext2fs_file_write().
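
That pread(2)-style wrapper could look like this (self-contained C;
the byte array and seek_fn/read_fn are stand-ins for the file and for
ext2fs_file_llseek()/ext2fs_file_read(), not real API):

```c
#include <pthread.h>
#include <string.h>

/* Sketch of a pread-style wrapper: because the handle's seek position
 * is shared, every caller must re-seek under the lock before reading
 * rather than trusting wherever the position currently is. */

#define FILE_SIZE 64
static char file_data[FILE_SIZE];      /* stand-in for the file contents */
static unsigned long file_pos;         /* shared seek position */
static pthread_mutex_t file_lock = PTHREAD_MUTEX_INITIALIZER;

static void seek_fn(unsigned long off) { file_pos = off; }

static unsigned long read_fn(char *buf, unsigned long n)
{
    if (file_pos + n > FILE_SIZE)
        n = FILE_SIZE - file_pos;
    memcpy(buf, file_data + file_pos, n);
    file_pos += n;                     /* reading advances the shared position */
    return n;
}

/* pread-style helper: never trusts the current position. */
static unsigned long emulated_pread(char *buf, unsigned long n,
                                    unsigned long off)
{
    pthread_mutex_lock(&file_lock);
    seek_fn(off);                      /* re-seek; another thread may have moved it */
    unsigned long got = read_fn(buf, n);
    pthread_mutex_unlock(&file_lock);
    return got;
}
```

Even if another thread has moved the position in the meantime, the
re-seek under the lock makes each read land at the offset the caller
asked for.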

(The reason why ext2fs_file_flush() is not strong enough is because
that will force writeback, but it doesn't invalidate any cached
information about the file's inode or extent tree structure. So if
the file inode or extent tree is getting modified by some other thread
out from under it, you're going to have a bad time.)

> Finally, I see with
> https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/
> that you recently added EXT2_FLAG_THREADS, as well as
> CHANNEL_FLAGS_THREADS. I think it should be fairly straightforward to
> tweak my nbdkit custom IO manager to advertise CHANNEL_FLAGS_THREADS
> (as the NBD protocol really DOES support parallel outstanding IO
> requests), and then add EXT2_FLAG_THREADS into the flags I pass to
> ext2fs_file_open2(), to try and get ext2fs to take advantage of
> parallel access to the underlying storage (regardless of whether the
> clients are parallel coming into ext2fs). Are there any concurrency
> issues I should be aware of on that front when updating my code?

So this is the _beginning_ of adding threaded support into libext2fs.
At the moment, we now have locking for the unix_io.c data structures.
This allows multiple threads to safely do read-only operations in
parallel. But this is *all* that it allows.

This was implemented as part of preparatory work to do parallel
e2fsck. The strategy is that we will have several different threads
reading from disjoint parts of the file system. So for example, one
thread might be reading from block groups 0 -- 100. Another thread
might be reading from block groups 101 -- 200. And so on. Each
thread will have its own copy of struct e2fsck_struct, and when they
are done they will merge their data to the global e2fsck_struct. If
there are any inconsistencies that need to be fixed, such that the
file system needs to be modified, this will require waiting until all
of the threads are done, or other specialized locking inside e2fsck.
Of course, in the "happy path", where the file system does not need
any repairs, we won't need to do any special locking or waiting, since
the workload will be read-only.
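
The scan-then-merge strategy could be sketched like this
(self-contained C; the per-thread struct and the toy per-group data
stand in for struct e2fsck_struct and the on-disk block groups):

```c
#include <pthread.h>

/* Sketch of the parallel e2fsck strategy: each thread scans a
 * disjoint range of "block groups" into its own private struct, and
 * the private results are merged only after every thread has joined,
 * so the read-only scan phase needs no locking at all. */

#define NGROUPS 300
static int group_used[NGROUPS];        /* toy per-group data to scan */

struct scan_ctx {
    int first, last;                   /* disjoint range: [first, last] */
    long used;                         /* private tally, no locking needed */
};

static void *scan_range(void *arg)
{
    struct scan_ctx *ctx = arg;
    for (int g = ctx->first; g <= ctx->last; g++)
        ctx->used += group_used[g];    /* read-only, disjoint: safe */
    return NULL;
}

/* Scan NGROUPS groups with nthreads (<= 8), then merge the tallies. */
static long parallel_scan(int nthreads)
{
    pthread_t tid[8];
    struct scan_ctx ctx[8];
    int per = NGROUPS / nthreads;
    for (int i = 0; i < nthreads; i++) {
        ctx[i].first = i * per;
        ctx[i].last = (i == nthreads - 1) ? NGROUPS - 1 : (i + 1) * per - 1;
        ctx[i].used = 0;
        pthread_create(&tid[i], NULL, scan_range, &ctx[i]);
    }
    long total = 0;
    for (int i = 0; i < nthreads; i++) {   /* wait for all, then merge */
        pthread_join(tid[i], NULL);
        total += ctx[i].used;
    }
    return total;
}
```

The merge step only runs after all threads have joined, which is the
"waiting until all of the threads are done" part; any repair work
would have to happen after that point.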

So we do not have any concurrency support for allocating inodes, or
allocating blocks, or assigning blocks to an inode's extent tree, etc.
Nor do we currently have any plans to add more concurrency support to
libext2fs.

To do this would require a huge amount of effort, and it would also
require making a lot of changes to the underlying data structures.
For example, even if we added locking to all of the various data
structures hanging off of the ext2_fs handle, if two threads tried to
open the same inode using ext2fs_file_open(), the two file handles are
completely independent, and there is no way for one thread to do any
kind of cache invalidation of another thread's file handle after it
has modified the inode. The same is true if one thread is using a
directory iterator while another process tries to modify that
directory.

> Obviously, when the kernel accesses an ext2/3/4 file system, it DOES
> support full concurrency (separate user space processes can open
> independent handles to the same file....

Yes, and that's because the kernel was designed with that in mind from
the beginning. The ext2fs library was originally designed to support
programs like e2fsck, mke2fs, and debugfs. None of these tools
required concurrency, and as I've mentioned, at the time when
libext2fs was first implemented, Linux didn't even *have* threads
support. So concurrency wasn't even possible, even if it had been
needed at that time.

> process must be observed from another). But nbdkit is all about
> accessing the data of an ext2 filesystem from userspace, without any
> kernel bio involvement, and is thus reliant on whatever concurrency
> guarantees the e2fsprogs library has (or lacks).

The e2fsprogs library pretty much doesn't have any concurrency
guarantees, sorry. I suspect you could create a layer on top of
libext2fs which actually implemented a global inode cache ala the
kernel, so that when two threads call something like ext2fs_iget()
function, it works like the kernel's iget() function and they get the
same inode structure, which is reference counted. Things like the
directory iterator would have to be changed into something more like
the functions exported by the kernel VFS layer, which would make this
layer useless for e2fsck, but it would be more useful for a threaded
client that wanted concurrent read/write access to the filesystem from
multiple threads.
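
The iget()-style layer I have in mind could be sketched like this
(self-contained C; none of these names are libext2fs API, and a real
version would need its own lock around the cache):

```c
#include <stdlib.h>

/* Sketch of a global, reference-counted inode cache a la the kernel:
 * two callers asking for the same inode number get the SAME cached
 * object rather than two independent copies, so there is only one
 * piece of cached state per inode to keep coherent. */

struct cached_inode {
    unsigned int ino;
    int refcount;
    struct cached_inode *next;
};

static struct cached_inode *icache;    /* global cache, a simple list */
static int inodes_loaded;              /* counts reads from "disk" */

/* Like the kernel's iget(): return the cached inode, loading it on
 * first use and bumping the reference count on every call. */
static struct cached_inode *inode_get(unsigned int ino)
{
    for (struct cached_inode *i = icache; i; i = i->next) {
        if (i->ino == ino) {
            i->refcount++;
            return i;
        }
    }
    struct cached_inode *i = malloc(sizeof *i);
    i->ino = ino;
    i->refcount = 1;
    i->next = icache;
    icache = i;
    inodes_loaded++;                   /* real code would read the inode here */
    return i;
}

/* Like iput(): drop a reference, evicting on the last put. */
static void inode_put(struct cached_inode *i)
{
    if (--i->refcount > 0)
        return;
    struct cached_inode **p = &icache;
    while (*p != i)
        p = &(*p)->next;
    *p = i->next;
    free(i);
}
```

Because both callers hold the same object, a modification made through
one reference is immediately visible through the other, which is
exactly what the independent ext2fs_file_open() handles can't give you.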

Cheers,

- Ted

2021-09-19 13:02:35

by Andreas Dilger

Subject: Re: e2fsprogs concurrency questions

What I don't understand here is why not just use a network filesystem that is explicitly designed for this task (e.g. NFS or Ganesha on top of ext4)?

Cheers, Andreas

> On Sep 18, 2021, at 21:49, Theodore Ts'o <[email protected]> wrote:
>
> On Fri, Sep 17, 2021 at 04:06:55PM -0500, Eric Blake wrote:
>> TL;DR summary: are there documented best practices for parallel access
>> to the same inode of an ext2 filesystem from multiple threads?
>>
>> First, a meta-question: is there a publicly archived mailing list for
>> questions on e2fsprogs? The README merely mentions Ted's email
>> address, and http://e2fsprogs.sourceforge.net/ is silent on contact
>> information, although with some googling, I found at least
>> https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/
>> which suggests [email protected] as a worthwhile list to
>> mention on the web page.
>
> Yes, discussions and patches relating to e2fsprogs take place on
> [email protected]. (Just as xfsprogs patches and discussions
> are sent to [email protected].)
>
>> Now, on to my real reason for writing. The nbdkit project is using
>> the ext2fs library to provide an ext2/3/4 filter on top of any data
>> being served over NBD (Network Block Device protocol) in userspace:
>> https://libguestfs.org/nbdkit-ext2-filter.1.html
>>
>> Searching for the word 'thread' or 'concurrent' in libext2fs.info came
>> up with no hits, so I'm going off of minimal documentation, and mostly
>> what I can ascertain from existing examples (of which I'm not seeing
>> very many).
>
> Historically, libext2fs and e2fsprogs had no pthreads or concurrency
> access at all. This is because e2fsprogs predates Linux having
> pthreads support at all.
>
> This is _starting_ to change, but more on that in a little.
>
>> Right now, the nbdkit filter forces some rather strict serialization
>> in order to be conservatively safe: for every client that wants to
>> connect, the nbdkit filter calls ext2fs_open(), then eventually
>> ext2fs_file_open2(), then exposes the contents of that one extracted
>> file over NBD one operation at a time, then closes everything back
>> down before accepting a second client. But we'd LOVE to add some
>> parallelization; the NBD protocol allows multiple clients, as well as
>> out-of-order processing of requests from a single client.
>>
>> Right away, I already know that calling ext2fs_open() more than once
>> on the same file system is a recipe for disaster (it is equivalent to
>> mounting the same block device at once through more than one bare
>> metal OS, and won't work). So I've got a proposal for how to rework
>> the nbdkit code to open the file system exactly once and share that
>> handle among multiple NBD clients:
>> https://listman.redhat.com/archives/libguestfs/2021-May/msg00028.html
>
> So you are apparently calling ext2fs_open() before forking, and then
> you want to use the ext2fs handle from separate processes. Is that
> correct?
>
> That's not going to work if you are going to try to modify the file
> system from different processes simultaneously. That's because
> libext2fs uses a writeback cache. After the fork, each process has
> its own copy of the writeback cache.
>
> If you are using threads, older versions of libext2fs don't do any
> locking before modifying data structures internal to the ext2_fs file
> handle. So if two threads simultaneously try to use the "ext2_fs fs"
> handle, they might try to access the block allocation bitmap (for
> example) at the same time, without locking, and so bad things will
> happen.
>
> You can do your own locking to make sure only one thread is trying to
> use the fs handle at a time, at which point you should be fine. So
> you can have multiple clients accessing the file system without having
> to open the file system, open a file, and then close the file and
> close the file system before accepting the next client. But only one
> client can be using the ext2_fs handle at a time, and if you want to
> share any libext2fs data structure across multiple threads,
> appropriate read/write locking would be needed.
>
>> Is it okay to have two concurrent handles open to the same inode, or
>> do I need to implement a hash map on my end so that two NBD clients
>> requesting access to the same file within the ext2 filesystem share a
>> single inode? If concurrent handles are supported, what mechanism can
>> I use to ensure that a flush performed on one handle will be visible
>> for reading from the other handle, as ext2fs_file_flush does not seem
>> to be strong enough?
>
> You could have two threads sharing the same file handle, with locking
> so that only one thread is using a file handle at a time. Also, since
> we don't have an analogue for pread(2) and pwrite(2), each thread
> would have to assume that the fseek position may have changed by some
> other thread, so after it grabbed the file system lock, and then the
> per-file lock, it would need to call ext2fs_file_llseek() to make sure
> file's position is at a known location before calling
> ext2fs_file_read() or ext2fs_file_write().
>
> (The reason why ext2fs_file_flush() is not strong enough is because
> that will force writeback, but it doesn't invalidate any cached
> information about the file's inode or extent tree structure. So if
> the file inode or extent tree is getting modified by some other thread
> out from under it, you're going to have a bad time.)
>
>> Finally, I see with
>> https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/
>> that you recently added EXT2_FLAG_THREADS, as well as
>> CHANNEL_FLAGS_THREADS. I think it should be fairly straightforward to
>> tweak my nbdkit custom IO manager to advertise CHANNEL_FLAGS_THREADS
>> (as the NBD protocol really DOES support parallel outstanding IO
>> requests), and then add EXT2_FLAG_THREADS into the flags I pass to
>> ext2fs_file_open2(), to try and get ext2fs to take advantage of
>> parallel access to the underlying storage (regardless of whether the
>> clients are parallel coming into ext2fs). Are there any concurrency
>> issues I should be aware of on that front when updating my code?
>
> So this is the _beginning_ of adding threaded support into libext2fs.
> At the moment, we now have locking for the unix_io.c data structures.
> This allows multiple threads to safely do read-only operations in
> parallel. But this is *all* that it allows.
>
> This was implemented as part of preparatory work to do parallel
> e2fsck. The strategy is that we will have several different threads
> reading from disjoint parts of the file system. So for example, one
> thread might be reading from block groups 0 -- 100. Another thread
> might be reading from block groups 101 -- 200. And so on. Each
> thread will have its own copy of struct e2fsck_struct, and when they
> are done they will merge their data to the global e2fsck_struct. If
> there are any inconsistencies that need to be fixed, such that the
> file system needs to be modified, this will require waiting until all
> of the threads are done, or other specialized locking inside e2fsck.
> Of course, in the "happy path", where the file system does not need
> any repairs, we won't need to do any special locking or waiting, since
> the workload will be read-only.
>
> So we do not have any concurrency support for allocating inodes, or
> allocating blocks, or assigning blocks to an inode's extent tree, etc.
> Nor do we currently have any plans to add more concurrency support to
> libext2fs.
>
> To do this would require a huge amount of effort, and it would also
> require making a lot of changes to the underlying data structures.
> For example, even if we added locking to all of the various data
> structures hanging off of the ext2_fs handle, if two threads tried to
> open the same inode using ext2fs_file_open(), the two file handles are
> completely independent, and there is no way for one thread to do any
> kind of cache invalidation of another thread's file handle after it
> has modified the inode. The same is true if one thread is using a
> directory iterator while another process tries to modify that
> directory.
>
>> Obviously, when the kernel accesses an ext2/3/4 file system, it DOES
>> support full concurrency (separate user space processes can open
>> independent handles to the same file....
>
> Yes, and that's because the kernel was designed with that in mind from
> the beginning. The ext2fs library was originally designed to support
> programs like e2fsck, mke2fs, and debugfs. None of these tools
> required concurrency, and as I've mentioned, at the time when
> libext2fs was first implemented, Linux didn't even *have* threads
> support. So concurrency wasn't even possible, even if it had been
> needed at that time.
>
>> process must be observed from another). But nbdkit is all about
>> accessing the data of an ext2 filesystem from userspace, without any
>> kernel bio involvement, and is thus reliant on whatever concurrency
>> guarantees the e2fsprogs library has (or lacks).
>
> The e2fsprogs library pretty much doesn't have any concurrency
> guarantees, sorry. I suspect you could create a layer on top of
> libext2fs which actually implemented a global inode cache ala the
> kernel, so that when two threads call something like ext2fs_iget()
> function, it works like the kernel's iget() function and they get the
> same inode structure, which is reference counted. Things like the
> directory iterator would have to be changed into something more like
> the functions exported by the kernel VFS layer, which would make this
> layer useless for e2fsck, but it would be more useful for a threaded
> client that wanted concurrent read/write access to the filesystem from
> multiple threads.
>
> Cheers,
>
> - Ted

2021-09-19 17:05:22

by Richard W.M. Jones

Subject: Re: e2fsprogs concurrency questions


On Sat, Sep 18, 2021 at 11:49:22PM -0400, Theodore Ts'o wrote:
> On Fri, Sep 17, 2021 at 04:06:55PM -0500, Eric Blake wrote:
> > TL;DR summary: are there documented best practices for parallel access
> > to the same inode of an ext2 filesystem from multiple threads?
> >
> > First, a meta-question: is there a publicly archived mailing list for
> > questions on e2fsprogs? The README merely mentions Ted's email
> > address, and http://e2fsprogs.sourceforge.net/ is silent on contact
> > information, although with some googling, I found at least
> > https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/
> > which suggests [email protected] as a worthwhile list to
> > mention on the web page.
>
> Yes, discussions and patches relating to e2fsprogs take place on
> [email protected]. (Just as xfsprogs patches and discussions
> are sent to [email protected].)
>
> > Now, on to my real reason for writing. The nbdkit project is using
> > the ext2fs library to provide an ext2/3/4 filter on top of any data
> > being served over NBD (Network Block Device protocol) in userspace:
> > https://libguestfs.org/nbdkit-ext2-filter.1.html
> >
> > Searching for the word 'thread' or 'concurrent' in libext2fs.info came
> > up with no hits, so I'm going off of minimal documentation, and mostly
> > what I can ascertain from existing examples (of which I'm not seeing
> > very many).
>
> Historically, libext2fs and e2fsprogs had no pthreads or concurrency
> access at all. This is because e2fsprogs predates Linux having
> pthreads support at all.
>
> This is _starting_ to change, but more on that in a little.
>
> > Right now, the nbdkit filter forces some rather strict serialization
> > in order to be conservatively safe: for every client that wants to
> > connect, the nbdkit filter calls ext2fs_open(), then eventually
> > ext2fs_file_open2(), then exposes the contents of that one extracted
> > file over NBD one operation at a time, then closes everything back
> > down before accepting a second client. But we'd LOVE to add some
> > parallelization; the NBD protocol allows multiple clients, as well as
> > out-of-order processing of requests from a single client.
> >
> > Right away, I already know that calling ext2fs_open() more than once
> > on the same file system is a recipe for disaster (it is equivalent to
> > mounting the same block device at once through more than one bare
> > metal OS, and won't work). So I've got a proposal for how to rework
> > the nbdkit code to open the file system exactly once and share that
> > handle among multiple NBD clients:
> > https://listman.redhat.com/archives/libguestfs/2021-May/msg00028.html

I didn't see a link to the actual client code, so here it is:

https://gitlab.com/nbdkit/nbdkit/-/tree/master/filters/ext2

Also this page explains the callback lifecycle and thread model of
nbdkit:

https://libguestfs.org/nbdkit-plugin.3.html#Callback-lifecycle
https://libguestfs.org/nbdkit-plugin.3.html#Threads

In the current filter, as Eric said above, we're using the nbdkit
thread model NBDKIT_THREAD_MODEL_SERIALIZE_CONNECTIONS which basically
means we operate in the most conservative possible mode. NBD clients
must queue up to connect one at a time, and requests from that single
client are handled one at a time.

> So you are apparently calling ext2fs_open() before forking, and then
> you want to use the ext2fs handle from separate processes. Is that
> correct?

We call ext2fs_open once per client (when the client connects). This
all happens after nbdkit has forked and is accepting connections.
nbdkit only ever uses a single process, and uses threads to handle the
different clients.

In the code the call happens in ext2_prepare() which isn't shown in
the lifecycle diagram (because nbdkit filters work slightly
differently from plugins), but happens just after the "open" step.

Really the question is whether we can relax the thread model.

> That's not going to work if you are going to try to modify the file
> system from different processes simultaneously. That's because
> libext2fs uses a writeback cache. After the fork, each process has
> its own copy of the writeback cache.
>
> If you are using threads, older versions of libext2fs don't do any
> locking before modifying data structures internal to the ext2_fs file
> handle. So if two threads simultaneously try to use the "ext2_fs fs"
> handle, they might try to access the block allocation bitmap (for
> example) at the same time, without locking, and so bad things will
> happen.
>
> You can do your own locking to make sure only one thread is trying to
> use the fs handle at a time, at which point you should be fine. So
> you can have multiple clients accessing the file system without having
> to open the file system, open a file, and then close the file and
> close the file system before accepting the next client. But only one
> client can be using the ext2_fs handle at a time, and if you want to
> share any libext2fs data structure across multiple threads,
> appropriate read/write locking would be needed.

Are there structures shared between ext2_fs handles? I mean, if we
had two concurrent threads using different ext2_fs handles, but open
on the same file, is that going to be a problem? (It sounds like it
would be, with conflicting access to the block allocation bitmap and
so on.)

> > Is it okay to have two concurrent handles open to the same inode, or
> > do I need to implement a hash map on my end so that two NBD clients
> > requesting access to the same file within the ext2 filesystem share a
> > single inode? If concurrent handles are supported, what mechanism can
> > I use to ensure that a flush performed on one handle will be visible
> > for reading from the other handle, as ext2fs_file_flush does not seem
> > to be strong enough?
>
> You could have two threads sharing the same file handle, with locking
> so that only one thread is using a file handle at a time. Also, since
> we don't have an analogue for pread(2) and pwrite(2), each thread
> would have to assume that the fseek position may have changed by some
> other thread, so after it grabbed the file system lock, and then the
> per-file lock, it would need to call ext2fs_file_llseek() to make sure
> file's position is at a known location before calling
> ext2fs_file_read() or ext2fs_file_write().
>
> (The reason why ext2fs_file_flush() is not strong enough is because
> that will force writeback, but it doesn't invalidate any cached
> information about the file's inode or extent tree structure. So if
> the file inode or extent tree is getting modified by some other thread
> out from under it, you're going to have a bad time.)
>
> > Finally, I see with
> > https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/
> > that you recently added EXT2_FLAG_THREADS, as well as
> > CHANNEL_FLAGS_THREADS. I think it should be fairly straightforward to
> > tweak my nbdkit custom IO manager to advertise CHANNEL_FLAGS_THREADS
> > (as the NBD protocol really DOES support parallel outstanding IO
> > requests), and then add EXT2_FLAG_THREADS into the flags I pass to
> > ext2fs_file_open2(), to try and get ext2fs to take advantage of
> > parallel access to the underlying storage (regardless of whether the
> > clients are parallel coming into ext2fs). Are there any concurrency
> > issues I should be aware of on that front when updating my code?
>
> So this is the _beginning_ of adding threaded support into libext2fs.
> At the moment, we now have locking for the unix_io.c data structures.
> This allows multiple threads to safely do read-only operations in
> parallel. But this is *all* that it allows.
>
> This was implemented as part of preparatory work to do parallel
> e2fsck. The strategy is that we will have several different threads
> reading from disjoint parts of the file system. So for example, one
> thread might be reading from block groups 0 -- 100. Another thread
> might be reading from block groups 101 -- 200. And so on. Each
> thread will have its own copy of struct e2fsck_struct, and when they
> are done they will merge their data to the global e2fsck_struct. If
> there are any inconsistencies that need to be fixed, such that the
> file system needs to be modified, this will require waiting until all
> of the threads are done, or other specialized locking inside e2fsck.
> Of course, in the "happy path", where the file system does not need
> any repairs, we won't need to do any special locking or waiting, since
> the workload will be read-only.
>
> So we do not have any concurrency support for allocating inodes, or
> allocating blocks, or assigning blocks to an inode's extent tree, etc.
> Nor do we currently have any plans to add more concurrency support to
> libext2fs.
>
> To do this would require a huge amount of effort, and it would also
> require making a lot of changes to the underlying data structures.
> For example, even if we added locking to all of the various data
> structures hanging off of the ext2_fs handle, if two threads tried to
> open the same inode using ext2fs_file_open(), the two file handles are
> completely independent, and there is no way for one thread to do any
> kind of cache invalidation of another thread's file handle after it
> has modified the inode. The same is true if one thread is using a
> directory iterator while another process tries to modify that
> directory.
>
> > Obviously, when the kernel accesses an ext2/3/4 file system, it DOES
> > support full concurrency (separate user space processes can open
> > independent handles to the same file....
>
> Yes, and that's because the kernel was designed with that in mind from
> the beginning. The ext2fs library was originally designed to support
> programs like e2fsck, mke2fs, and debugfs. None of these tools
> required concurrency, and as I've mentioned, at the time when
> libext2fs was first implemented, Linux didn't even *have* threads
> support. So concurrency wasn't even possible, even if it had been
> needed at that time.
>
> > process must be observed from another). But nbdkit is all about
> > accessing the data of an ext2 filesystem from userspace, without any
> > kernel bio involvement, and is thus reliant on whatever concurrency
> > guarantees the ext2progs library has (or lacks).
>
> The e2fsprogs library pretty much doesn't have any concurrency
> guarantees, sorry. I suspect you could create a layer on top of
> libext2fs which actually implemented a global inode cache ala the
> kernel, so that when two threads call something like ext2fs_iget()
> function, it works like the kernel's iget() function and they get the
> same inode structure, which is reference counted. Things like the
> directory iterator would have to be changed into something more like
> the functions exported by the kernel VFS layer, which would make this
> layer useless for e2fsck, but it would be more useful for a threaded
> client that wanted concurrent read/write access to the filesystem from
> multiple threads.

Interesting stuff, thanks.

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines. Supports shell scripting,
bindings from many languages. http://libguestfs.org

2021-09-20 07:43:19

by Theodore Ts'o

Subject: Re: e2fsprogs concurrency questions

On Sun, Sep 19, 2021 at 01:35:23PM +0100, Richard W.M. Jones wrote:
> Are there structures shared between ext2_fs handles?

Sadly, no.


> I mean, if we had two concurrent threads using different ext2_fs
> handles, but open on the same file, is that going to be a problem?
> (It sounds like it would be, with conflicting access to the block
> allocation bitmap and so on.)

Yes, there's going to be a problem.

If you have two separate ext2fs_fs handles (each opened via a separate
call to ext2fs_open), they will not share any structures, nor is there
any locking for any of the data structures. So that means you could
> use a single ext2fs_fs handle and share it across threads --- but you
need to make sure that only one thread is using the handle at a time,
> and you can't have two file handles open to the same inode, or read
> an inode twice, modify one copy, write it back, and expect the other
> copy to be magically updated --- because it won't be.

Fundamentally, libext2fs was not designed for concurrent operation.

> I suppose you could use fuse2fs, and then have the clients access
the file system via the FUSE interface. That might be more efficient.

- Ted

2021-09-20 16:59:06

by Richard W.M. Jones

Subject: Re: e2fsprogs concurrency questions

On Sun, Sep 19, 2021 at 04:39:01AM -0600, Andreas Dilger wrote:
> What I don't understand here is why not just use a network
> filesystem that is explicitly designed for this task (eg. NFS or
> Ganesha on top of ext4)?

nbdkit-ext2-filter is very niche, but it's quite different from
anything NFS can do. For example:

$ nbdkit --filter=ext2 --filter=xz \
    curl http://oirase.annexia.org/tmp/disk.img.xz \
    ext2file=/disk/fedora-33.img

$ nbdinfo nbd://localhost
protocol: newstyle-fixed without TLS
export="":
	export-size: 6442450944
	content: DOS/MBR boot sector
	uri: nbd://localhost:10809/
	contexts:
		base:allocation
	is_rotational: false
	is_read_only: true
	can_cache: true
	can_df: true
	can_fast_zero: false
	can_flush: true
	can_fua: false
	can_multi_conn: false
	can_trim: false
	can_zero: false

$ guestfish --ro --format=raw -a nbd://localhost -i
[...]
Operating system: Fedora 33 (Thirty Three)
/dev/sda3 mounted on /
/dev/sda2 mounted on /boot

What we're doing here is exporting a compressed ext4 image over HTTP
and then accessing a VM image inside it.

(This is a contrived example but it's similar to something called the
Containerized Data Importer in Kubernetes.)

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html