(+David Howells in Cc as he knows how that works better than me;
-syzbot lists as it doesn't really concern this bug)
Christian Schoenebeck wrote on Sat, Mar 26, 2022 at 01:36:31PM +0100:
> BTW, another issue that I am seeing for a long time affects the fs-cache: When
> I use cache=mmap then things seem to be harmless, I periodically see messages
> like these, but that's about it:
>
> [90763.435562] FS-Cache: Duplicate cookie detected
> [90763.436514] FS-Cache: O-cookie c=00dcb42f [p=00000003 fl=216 nc=0 na=0]
> [90763.437795] FS-Cache: O-cookie d=0000000000000000{?} n=0000000000000000
> [90763.440096] FS-Cache: O-key=[8] 'a7ab2c0000000000'
> [90763.441656] FS-Cache: N-cookie c=00dcb4a7 [p=00000003 fl=2 nc=0 na=1]
> [90763.446753] FS-Cache: N-cookie d=000000005b583d5a{9p.inode} n=00000000212184fb
> [90763.448196] FS-Cache: N-key=[8] 'a7ab2c0000000000'
hm, fscache code shouldn't be used for cache=mmap, I'm surprised you can
hit this...
> The real trouble starts when I use cache=loose though, in this case I get all
> sorts of misbehaviours from time to time, especially complaining about invalid
> file descriptors.
... but I did encounter these on cache=loose/fscache, although I hadn't
noticed any bad behaviour such as invalid file descriptors.
> Any clues?
Since I hadn't noticed real harm I didn't look too hard into it, I have
a couple of ideas:
- the cookie is just a truncated part of the inode number; it's possible
we get real collisions because there is no guarantee there won't be
identical inode numbers there (see the key-decoding sketch after this email).
In particular, it's trivial to reproduce by exporting submounts:
## on host in export directory
# mount -t tmpfs tmpfs m1
# mount -t tmpfs tmpfs m2
# echo foo > m1/a
# echo bar > m2/a
# ls -li m1 m2
m1:
total 4
2 -rw-r--r-- 1 asmadeus users 4 Mar 26 22:23 a
m2:
total 4
2 -rw-r--r-- 1 asmadeus users 4 Mar 26 22:23 a
## on client
# cat /mnt/t/m*/a
foo
bar
FS-Cache: Duplicate cookie detected
FS-Cache: O-cookie c=0000099a [fl=4000 na=0 nA=0 s=-]
FS-Cache: O-cookie V=00000006 [9p,tmp,]
FS-Cache: O-key=[8] '0200000000000000'
FS-Cache: N-cookie c=0000099b [fl=0 na=0 nA=0 s=-]
FS-Cache: N-cookie V=00000006 [9p,tmp,]
FS-Cache: N-key=[8] '0200000000000000'
But as you can see despite the warning the content is properly
different, and writing also works, so this probably isn't it... Although
the fscache code we're using is totally different -- your dmesg output
is from the "pre-netfs" code, so that might have gotten fixed as a side
effect?
- lifecycle difference between inode and fscache entry.
David pushed a patch a few years back to address this but it looks like
it never got merged:
https://lore.kernel.org/lkml/155231584487.2992.17466330160329385162.stgit@warthog.procyon.org.uk/
the rationale is that we could evict the inode then reallocate it, and
it'd generate a new fscache entry with the same key before the previous
fscache entry had been freed.
I'm not sure if that got fixed otherwise and it might not be possible
anymore, I didn't follow that.
- some other bug...
If you have some kind of reproducer of invalid file descriptor or similar
errors I'd be happy to dig a bit more; I don't particularly like all
aspects of our cache model, but it's not good if it corrupts things.
--
Dominique
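As a side note on the key bytes in the dumps above: the 8-byte O-key/N-key is
just the qid path (here the host inode number) in little-endian byte order,
which is why both files with inode number 2 yield the key '0200000000000000'.
A minimal decoding sketch, assuming that layout:

    /* decode an FS-Cache key dump like '0200000000000000': an 8-byte
     * little-endian qid path (here: inode number 2) */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const uint8_t key[8] = { 0x02, 0, 0, 0, 0, 0, 0, 0 };
        uint64_t path = 0;

        for (int i = 7; i >= 0; i--)
            path = (path << 8) | key[i];
        printf("qid path (inode number): %llu\n", (unsigned long long)path);
        return 0;
    }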
I made some tests & benchmarks regarding the fs-cache issue of 9p, running
different kernel versions and kernel configurations in comparison.
Setup: all tests were run with QEMU v7.0.0-rc0+ [3]. The Linux guest system was
running 9p as its root filesystem, as described in the QEMU 9p root fs HOWTO
[4], and I then installed the build tools required for this test into the
guest OS. In each test run I compiled the same source files as a parallel
build (make -jN). Between runs I deleted the build directory, rebooted the
guest system with a different kernel config, and restarted the build on the
guest. Results:
Case  Linux kernel version           .config  msize    cache  duration  host cpu  errors/warnings
A)    5.17.0+[2] + msize patches[1]  debug    4186112  mmap   20m 40s   ~80%      none
B)    5.17.0+[2] + msize patches[1]  debug    4186112  loose  31m 28s   ~35%      several errors (compilation completed)
C)    5.17.0+[2] + msize patches[1]  debug    507904   mmap   20m 25s   ~84%      none
D)    5.17.0+[2] + msize patches[1]  debug    507904   loose  31m 2s    ~33%      several errors (compilation completed)
E)    5.17.0+[2]                     debug    512000   mmap   23m 45s   ~75%      none
F)    5.17.0+[2]                     debug    512000   loose  32m 6s    ~31%      several errors (compilation completed)
G)    5.17.0+[2]                     release  512000   mmap   23m 18s   ~76%      none
H)    5.17.0+[2]                     release  512000   loose  32m 33s   ~31%      several errors (compilation completed)
I)    5.17.0+[2] + msize patches[1]  release  4186112  mmap   20m 30s   ~83%      none
J)    5.17.0+[2] + msize patches[1]  release  4186112  loose  31m 21s   ~31%      several errors (compilation completed)
K)    5.10.84                        release  512000   mmap   39m 20s   ~80%      none
L)    5.10.84                        release  512000   loose  13m 40s   ~55%      none
[1] 9p msize patches v4 (2021-12-30): https://lore.kernel.org/netdev/[email protected]/
[2] Linux kernel "5.17.0+": SHA-1 710f5d627a98 ("Merge tag 'usb-5.18-rc1'", 2022-03-26 13:08:25)
[3] QEMU "v7.0.0-rc0+": SHA-1 1d60bb4b146 ("Merge tag 'pull-request-2022-03-15v2'", 2022-03-16 10:43:58)
[4] 9p as root filesystem: https://wiki.qemu.org/Documentation/9p_root_fs
As for fs-cache issues:
======================
Disclaimer: I have not looked into the fs-cache sources yet, so I am not sure,
but my first impression is that something probably got broken with the recent
fs-cache changes (see the errors/warnings column, especially in comparison to
case L), which did not generate any errors). Also note the huge differences in
build duration: compared to case L), fs-cache (cache=loose) got significantly
slower, while cache=mmap OTOH became significantly faster?
About the errors: I actually already see errors with cache=loose and a recent
kernel version just when booting the guest OS. For these tests I chose sources
that allowed me to complete the build, so that I could capture benchmark
numbers as well; I got some "soft" errors with those, but the build at least
completed. Other sources OTOH did not complete and aborted with certain
invalid file descriptor errors, so I obviously could not use them for these
benchmarks.
debug/release .config: In the first runs with recent kernel 5.17.0+ I still
had debugging turned on, whereas the older kernel was optimized. So I repeated
the tests of kernel 5.17.0+ with -O2 and debugging options turned off, but the
numbers only slightly improved. So debug vs. release does not seem to have a
significant impact on the results.
host cpu column: these were just very approximate numbers that I additionally
wrote down to compare host CPU saturation during these tests.
As for latest msize patches (v4):
================================
large msize: These tests issue a very large number of rather small-chunk I/O
requests in parallel, where a huge msize (e.g. 4MB) does not really bring
advantages. So this is different from my previous benchmarks, which focused on
large-chunk sequential I/O, where large msize values could shine. You can see
that case A) is probably even a bit slower with msize=4MB; I am assuming that
Treaddir requests still being msize large might hurt with msize=4MB in these
tests. I still need to verify that though.
small msize: The results also suggest though that the msize patches bring
advantages with a smaller msize value in comparison to unpatched kernels. I
assume that's because of the last bunch of patches, which reduce the size of
most 9p requests to what they really need, instead of simply always allocating
'msize' for each 9p request, as is still the case right now on master.
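To illustrate that last point, a rough sketch of my understanding only
(hypothetical helper, not the actual patch code):

    #include <stdint.h>
    #include <stdlib.h>

    #define P9_HDR_SIZE 7 /* size[4] type[1] tag[2] */

    /* size the request buffer from the actual message instead of always
     * allocating 'msize' bytes; msize remains the upper bound on the wire */
    static void *alloc_9p_request(uint32_t msize, uint32_t payload_len)
    {
        uint32_t need = P9_HDR_SIZE + payload_len;

        return malloc(need < msize ? need : msize);
    }

For the many small requests of a parallel build, that difference in allocation
size alone should matter quite a bit.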
...
And finally in response to your previous email, see below ...
On Saturday, March 26, 2022 14:35:14 CEST [email protected] wrote:
> (+David Howells in Cc as he knows how that works better than me;
> -syzbot lists as it doesn't really concern this bug)
+Greg Kurz, for 9p server part
> Christian Schoenebeck wrote on Sat, Mar 26, 2022 at 01:36:31PM +0100:
> > BTW, another issue that I am seeing for a long time affects the fs-cache:
> > When I use cache=mmap then things seem to be harmless, I periodically see
> > messages like these, but that's about it:
> >
> > [90763.435562] FS-Cache: Duplicate cookie detected
> > [90763.436514] FS-Cache: O-cookie c=00dcb42f [p=00000003 fl=216 nc=0 na=0]
> > [90763.437795] FS-Cache: O-cookie d=0000000000000000{?} n=0000000000000000
> > [90763.440096] FS-Cache: O-key=[8] 'a7ab2c0000000000'
> > [90763.441656] FS-Cache: N-cookie c=00dcb4a7 [p=00000003 fl=2 nc=0 na=1]
> > [90763.446753] FS-Cache: N-cookie d=000000005b583d5a{9p.inode} n=00000000212184fb
> > [90763.448196] FS-Cache: N-key=[8] 'a7ab2c0000000000'
>
> hm, fscache code shouldn't be used for cache=mmap, I'm surprised you can
> hit this...
I assume you mean that the 9p driver does not explicitly ask for fs-cache
to be used for mmap. I see that 9p uses the kernel's generalized mmap
implementation:
https://github.com/torvalds/linux/blob/d888c83fcec75194a8a48ccd283953bdba7b2550/fs/9p/vfs_file.c#L481
I haven't dived further into this, but the kernel has to use some kind of
filesystem cache anyway to provide the mmap functionality, so I guess it makes
sense that I got those warning messages from the FS-Cache subsystem?
> > The real trouble starts when I use cache=loose though, in this case I get
> > all sorts of misbehaviours from time to time, especially complaining
> > about invalid file descriptors.
>
> ... but I did encounter these on cache=loose/fscache, although I hadn't
> noticed any bad behaviour such as invalid file descriptors.
>
> > Any clues?
>
> Since I hadn't noticed real harm I didn't look too hard into it, I have
> a couple of ideas:
> - the cookie is just a truncated part of the inode number, it's possible
> we get real collisions because there are no guarantees there won't be
> identical inodes there.
I think by 'truncated' you actually mean what's happening on the 9p server
(QEMU) side, see below ...
> In particular, it's trivial to reproduce by exporting submounts:
>
> ## on host in export directory
> # mount -t tmpfs tmpfs m1
> # mount -t tmpfs tmpfs m2
> # echo foo > m1/a
> # echo bar > m2/a
> # ls -li m1 m2
> m1:
> total 4
> 2 -rw-r--r-- 1 asmadeus users 4 Mar 26 22:23 a
>
> m2:
> total 4
> 2 -rw-r--r-- 1 asmadeus users 4 Mar 26 22:23 a
>
> ## on client
> # cat /mnt/t/m*/a
> foo
> bar
> FS-Cache: Duplicate cookie detected
> FS-Cache: O-cookie c=0000099a [fl=4000 na=0 nA=0 s=-]
> FS-Cache: O-cookie V=00000006 [9p,tmp,]
> FS-Cache: O-key=[8] '0200000000000000'
> FS-Cache: N-cookie c=0000099b [fl=0 na=0 nA=0 s=-]
> FS-Cache: N-cookie V=00000006 [9p,tmp,]
> FS-Cache: N-key=[8] '0200000000000000'
With QEMU >= 5.2 you should see the following QEMU warning with your reproducer:
"
qemu-system-x86_64: warning: 9p: Multiple devices detected in same VirtFS
export, which might lead to file ID collisions and severe misbehaviours on
guest! You should either use a separate export for each device shared from
host or use virtfs option 'multidevs=remap'!
"
And after restarting QEMU with 'multidevs=remap' you won't get such errors
anymore. I just tested this right now: without 'multidevs=remap' I would get
those errors with your reproducer above; with 'multidevs=remap' there were
no errors.
Background: the Linux 9p driver is using the 9p "QID path" as file ID, i.e. as
key for looking up entries in the fs-cache:
https://github.com/torvalds/linux/blob/d888c83fcec75194a8a48ccd283953bdba7b2550/fs/9p/cache.c#L65
By default QEMU just uses the host file's inode number as "QID path". So if
you have multiple filesystems inside the exported tree, this can lead to
collisions. Ideally we "should" place both the device ID and the inode
number into "QID path" to prevent that, but the problem is that "QID path" is
currently only 64 bits large in the 9p protocol, so it is too small to hold
both device ID and inode number:
http://ericvh.github.io/9p-rfc/rfc9p2000.html#anchor32
If 'multidevs=remap' is passed to QEMU though, then guaranteed unique "QID
path" numbers are generated, even if there are multiple filesystems mounted
inside the exported tree, so you won't get collisions in this case. This is
usually cost-free, because we exploit the fact that inode numbers are
generated sequentially by host file systems from 1 upwards. So the upper bits
of inode numbers usually contain plenty of zeros, and we can prefix them with
our own numbers there to prevent collisions while still being able to squeeze
them into 64 bits.
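A rough sketch of that remapping idea (illustrative only, with hypothetical
helper names; this is not QEMU's actual code):

    #include <errno.h>
    #include <stdint.h>

    #define PREFIX_BITS 16
    #define INODE_BITS  (64 - PREFIX_BITS)

    /* assumed helper: maintains a device -> prefix table, assigning a
     * fresh 16-bit prefix the first time a device is encountered */
    extern uint16_t lookup_or_assign_prefix(uint64_t st_dev);

    static int remap_qid_path(uint64_t st_dev, uint64_t st_ino,
                              uint64_t *qid_path)
    {
        if (st_ino >> INODE_BITS)
            return -ENFILE; /* inode too large to prefix safely */
        *qid_path = ((uint64_t)lookup_or_assign_prefix(st_dev) << INODE_BITS)
                    | st_ino;
        return 0;
    }

Because host filesystems hand out inode numbers from 1 upwards, the prefix
bits are almost always free, so the remapping practically never has to fall
back to anything more expensive.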
> But as you can see despite the warning the content is properly
> different, and writing also works, so this probably isn't it... Although
> the fscache code we're using is totally different -- your dmesg output
> is from the "pre-netfs" code, so that might have gotten fixed as a side
> effect?
>
> - lifecycle difference between inode and fscache entry.
> David pushed a patch a few years back to address this but it looks like
> it never got merged:
> https://lore.kernel.org/lkml/155231584487.2992.17466330160329385162.stgit@warthog.procyon.org.uk/
>
> the rationale is that we could evict the inode then reallocate it, and
> it'd generate a new fscache entry with the same key before the previous
> fscache entry had been freed.
> I'm not sure if that got fixed otherwise and it might not be possible
> anymore, I didn't follow that.
I don't know the current fs-cache implementation in the Linux kernel yet, so I
can't comment on this part at this point.
> - some other bug...
>
> If you have some kind of reproducer of invalid file descriptor or similar
> errors I'd be happy to dig a bit more; I don't particularly like all
> aspects of our cache model, but it's not good if it corrupts things.
Maybe you can reproduce this with the root fs setup [4] described above? As I
said, I immediately get errors when guest OS is booting. So I don't have to
run something fancy to get errors with cache=loose & recent kernel.
Best regards,
Christian Schoenebeck
Thanks Christian!
Christian Schoenebeck wrote on Wed, Mar 30, 2022 at 02:21:16PM +0200:
> Case  Linux kernel version           .config  msize    cache  duration  host cpu  errors/warnings
>
> A)    5.17.0+[2] + msize patches[1]  debug    4186112  mmap   20m 40s   ~80%      none
> B)    5.17.0+[2] + msize patches[1]  debug    4186112  loose  31m 28s   ~35%      several errors (compilation completed)
> C)    5.17.0+[2] + msize patches[1]  debug    507904   mmap   20m 25s   ~84%      none
> D)    5.17.0+[2] + msize patches[1]  debug    507904   loose  31m 2s    ~33%      several errors (compilation completed)
> E)    5.17.0+[2]                     debug    512000   mmap   23m 45s   ~75%      none
> F)    5.17.0+[2]                     debug    512000   loose  32m 6s    ~31%      several errors (compilation completed)
> G)    5.17.0+[2]                     release  512000   mmap   23m 18s   ~76%      none
> H)    5.17.0+[2]                     release  512000   loose  32m 33s   ~31%      several errors (compilation completed)
> I)    5.17.0+[2] + msize patches[1]  release  4186112  mmap   20m 30s   ~83%      none
> J)    5.17.0+[2] + msize patches[1]  release  4186112  loose  31m 21s   ~31%      several errors (compilation completed)
> K)    5.10.84                        release  512000   mmap   39m 20s   ~80%      none
> L)    5.10.84                        release  512000   loose  13m 40s   ~55%      none
ow.
> Disclaimer: I have not looked into the fs-cache sources yet, so I am not sure,
> but my first impression is that probably something got broken with recent
> fs-cache changes (see column errors, especially in comparison to case L) which
> did not generate any errors)? And also note the huge build duration
> differences, especially in comparison to case L), so fs-cache (cache=loose)
> also got significantly slower while cache=mmap OTOH became significantly
> faster?
Yes, that's a big regression; I didn't do any performance benchmark with
the new patches as I didn't think it'd matter but I obviously should
have.
There is one thing I must check: I know new kernels will be writing in
4k chunks and that is going to be very slow until the netfs write
helpers are finished, but I thought the old code did the same.
If the old code had bigger writes that performance will probably come
back.
Otherwise there's some other problem, like not reusing cached content we
should be reusing.
> About the errors: I actually already see errors with cache=loose and recent
> kernel version just when booting the guest OS. For these tests I chose some
> sources which allowed me to complete the build to capture some benchmark as
> well, I got some "soft" errors with those, but the build completed at least.
> I had other sources OTOH which did not complete though and aborted with
> certain invalid file descriptor errors, which I obviously could not use for
> those benchmarks here.
That's less surprising, the change was really huge. I'm annoyed because
I did test part of a parallel linux kernel compilation with
cache=fscache without noticing a problem :/
I'll try to reproduce this weekend-ish.
> > Christian Schoenebeck wrote on Sat, Mar 26, 2022 at 01:36:31PM +0100:
> > hm, fscache code shouldn't be used for cache=mmap, I'm surprised you can
> > hit this...
>
> I assume that you mean that 9p driver does not explicitly ask for fs-cache
> being used for mmap. I see that 9p uses the kernel's generalized mmap
> implementation:
>
> https://github.com/torvalds/linux/blob/d888c83fcec75194a8a48ccd283953bdba7b2550/fs/9p/vfs_file.c#L481
>
> I haven't dived further into this, but the kernel has to use some kind of
> filesystem cache anyway to provide the mmap functionality, so I guess it makes
> sense that I got those warning messages from the FS-Cache subsystem?
It uses the generic mmap which has vfs caching, but definitely not
fs-cache.
fs-cache adds more hooks for cachefilesd (writing file contents to disk
for bigger cache) and things like that cache=loose/mmap shouldn't be
caring about. cache=loose actually just disables some key parts so I'm
not surprised it shares bugs with the new code, but cache=mmap is really
independent and I need to trace where these come from...
> With QEMU >= 5.2 you should see the following QEMU warning with your reproducer:
>
> "
> qemu-system-x86_64: warning: 9p: Multiple devices detected in same VirtFS
> export, which might lead to file ID collisions and severe misbehaviours on
> guest! You should either use a separate export for each device shared from
> host or use virtfs option 'multidevs=remap'!
> "
oh, I wasn't aware of the new option. Good job there!
It's the easiest way to reproduce, but there are also harder-to-fix
collisions: file systems only guarantee unicity for (fsid, inode
number, version), which is usually bigger than 128 bits (although version
is often 0), but version isn't exposed to userspace easily...
What we'd want for unicity is a handle from e.g. name_to_handle_at, but
that'd add overhead, wouldn't fit in qid path, and not all filesystems are
capable of providing one... The 9p protocol just doesn't want bigger handles
than qid path.
And, err, looking at the qemu code
qidp->version = stbuf->st_mtime ^ (stbuf->st_size << 8);
the qid version is treated as a "data version",
but on the kernel side we've treated it as an inode version (i_version, see
include/linux/iversion.h)
(v9fs_test_inode_dotl checks the version is the same when comparing two
inodes), so it will incorrectly identify two identical inodes as
different.
That will cause problems...
Since you'll be faster than me could you try keeping it at 0 there?
I see fscache also uses the qid version as 'auxiliary data', but I'm
not sure what this is used for -- if it's a data-version-like thing then
it will also at least invalidate the cache content all the time.
Note there is also a data_version thing in the protocol, in the response
to getattr, which the protocol side of 9p in Linux diligently fills into
st_data_version, but we never use it as far as I can see.
This is probably what 9p meant to fill, and what fscache should rely on to
detect file changes, if that helps.
I'm sorry I didn't see this sooner....
> > If you have some kind of reproducer of invalid filedescriptor or similar
> > errors I'd be happy to dig a bit more, I don't particularly like all
> > aspect of our cache model but it's not good if it corrupts things.
>
> Maybe you can reproduce this with the root fs setup [4] described above? As I
> said, I immediately get errors when guest OS is booting. So I don't have to
> run something fancy to get errors with cache=loose & recent kernel.
Yes, this is much worse than I had first assumed when you first brought
it up, I'll definitely set some time aside to investigate.
--
Dominique
On Saturday, April 2, 2022 01:11:04 CEST [email protected] wrote:
> Christian Schoenebeck wrote on Fri, Apr 01, 2022 at 04:19:20PM +0200:
> > 4. v9fs_qid_iget_dotl [fs/9p/vfs_inode_dotl.c, 133]:
> > v9fs_cache_inode_get_cookie(inode);
> > ^--- Called independent of function argument "new"'s value here
> >
> > https://github.com/torvalds/linux/blob/e8b767f5e04097aaedcd6e06e2270f9fe5282696/fs/9p/vfs_inode_dotl.c#L133
> uh, I'd have assumed either this call or the function to check
> v9ses->cache, but it doesn't look like either do...
> There's a nice compile-time static inline empty definition if FSCACHE is
> not compiled in, but that should -also- be a check at runtime based on
> the session struct.
>
> For your remark vs. the 'new' argument, it does depend on it:
> - new determines which check is used for iget5_locked.
> In the 'new' case, v9fs_test_new_inode_dotl always returns 0 so we
> always get a new inode.
> - if iget5_locked found an existing inode (i_state & I_NEW false) we
> return it.
Yes, I saw that. This part so far is correct.
The only minor issue I see here, from a performance PoV: the test function
passed (always returning zero) unnecessarily causes find_inode() to iterate
over all inodes of the associated hash bucket:
https://github.com/torvalds/linux/blob/88e6c0207623874922712e162e25d9dafd39661e/fs/inode.c#L912
IMO it would make sense to introduce an official "identically zero function"
in shared code space and let 9p use that official function instead. Then
inode.c could simply compare the test function pointer and not bother to
iterate over the entire bucket in such a case.
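Something like the following sketch of the idea (hypothetical names, this is
not an existing kernel API):

    /* a single, shared "never matches" test function that iget5_locked()
     * callers could pass whenever they always want a fresh inode */
    static int inode_test_never_match(struct inode *inode, void *data)
    {
        return 0;
    }

    /* find_inode() could then short-circuit on the pointer itself: */
    if (test == inode_test_never_match)
        return NULL; /* skip walking the hash bucket entirely */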
> - at this point we're allocating a new inode, so we should initialize
> its cookie if it's on a fscache-enabled mount
>
> So that part looks OK to me.
Mmm, but you agree that it also does that for cache=mmap right now, right?
> What isn't correct with qemu setting qid version is the non-new case's
> v9fs_test_inode_dotl, it'll consider the inode to be new if version
> changed so it would recreate new, different inodes with same inode
> number/cookie and I was sure that was the problem, but it looks like
> there's more to it from your reply below :(
Yes, it does not seem to be related, and I mean this part of the code has not
changed for 11 years. So if that was the cause, then old kernels would suffer
from the same issues, which does not seem to be the case.
I would not say though that QEMU is necessarily wrong in filling in mtime for
qid.version. The 9p spec just says:
"The version is a version number for a file; typically, it is incremented
every time the file is modified."
So yes, it does recommend sequential numbering, but OTOH does not require it.
And implementing that would be expensive, because the 9p server would need to
maintain its own version for every single file, whereas using the host
filesystem's mtime is cheap.
> >> Well, at least that one is an easy fix: we just don't need this.
> >> It's the easiest way to reproduce but there are also harder to fix
> >> collisions, file systems only guarantee unicity for (fsid,inode
> >> number,version) which is usually bigger than 128 bits (although version
> >> is often 0), but version isn't exposed to userspace easily...
> >> What we'd want for unicity is handle from e.g. name_to_handle_at but
> >> that'd add overhead, wouldn't fit in qid path and not all fs are capable
> >> of providing one... The 9p protocol just doesn't want bigger handles
> >> than qid path.
> >
> > No bigger qid.path on 9p protocol level in future? Why?
>
> I have no idea about the "9p protocol" as a standard, nor who decides
> how it evolves -- I guess if we wanted to we could probably make a
> 9p2022.L without much coordination, but I have no plan to do
> that... But if we do, I can probably add quite a few things to the
> list of things that might need to change :)
Yes, I agree, 9p protocol changes are a long-term thing which I don't want to
hurry either. But I do think it makes sense to at least collect an informal
list of ideas/features/issues that should be addressed in future, e.g. a wiki
page. For now I am using the QEMU wiki for this:
https://wiki.qemu.org/Documentation/9p#Protocol_Plans
You can use that wiki page as well of course, or if somebody thinks there
would be a better place, no problem for me either.
> That being said, this particular change of qid format is rather
> annoying. 9p2000.L basically just added new message types, so dissectors
> such as wireshark could barge in the middle of the tcp flow and more or
> less understand; modifying a basic type like this would require to
> either catch the TVERSION message or make new message types for
> everything that deals with qids (auth/attach, walk, lopen, lcreate,
> mknod, getattr, readdir, mkdir... that's quite a few)
>
> So I think we're better off with the status quo here.
Well, in a future protocol revision I would try to merge the individual 9p
dialects as much as possible anyway, and I would also release one document
that covers all supported messages, instead of requiring people to read 4+
individual specs, which I don't find helpful. Most changes were just about
data types, which could also be covered in a spec by naming a message once
and listing the differences between individual 9p versions for that
particular message.
But again: no priority for me either ATM. There is still enough to do for
fixing the implementations on server and client side for current 9p protocol
versions.
Best regards,
Christian Schoenebeck
Christian Schoenebeck wrote on Fri, Apr 01, 2022 at 04:19:20PM +0200:
> 4. v9fs_qid_iget_dotl [fs/9p/vfs_inode_dotl.c, 133]:
>
> v9fs_cache_inode_get_cookie(inode);
> ^--- Called independent of function argument "new"'s value here
> https://github.com/torvalds/linux/blob/e8b767f5e04097aaedcd6e06e2270f9fe5282696/fs/9p/vfs_inode_dotl.c#L133
uh, I'd have assumed either this call or the function to check
v9ses->cache, but it doesn't look like either do...
There's a nice compile-time static inline empty definition if FSCACHE is
not compiled in, but that should -also- be a check at runtime based on
the session struct.
For your remark vs. the 'new' argument, it does depend on it:
- new determines which check is used for iget5_locked.
In the 'new' case, v9fs_test_new_inode_dotl always returns 0 so we
always get a new inode.
- if iget5_locked found an existing inode (i_state & I_NEW false) we
return it.
- at this point we're allocating a new inode, so we should initialize
its cookie if it's on a fscache-enabled mount
So that part looks OK to me.
What isn't correct with qemu setting qid version is the non-new case's
v9fs_test_inode_dotl, it'll consider the inode to be new if version
changed so it would recreate new, different inodes with same inode
number/cookie and I was sure that was the problem, but it looks like
there's more to it from your reply below :(
>> Well, at least that one is an easy fix: we just don't need this.
>> It's the easiest way to reproduce but there are also harder to fix
>> collisions, file systems only guarantee unicity for (fsid,inode
>> number,version) which is usually bigger than 128 bits (although version
>> is often 0), but version isn't exposed to userspace easily...
>> What we'd want for unicity is handle from e.g. name_to_handle_at but
>> that'd add overhead, wouldn't fit in qid path and not all fs are capable
>> of providing one... The 9p protocol just doesn't want bigger handles
>> than qid path.
>
> No bigger qid.path on 9p protocol level in future? Why?
I have no idea about the "9p protocol" as a standard, nor who decides
how it evolves -- I guess if we wanted to we could probably make a
9p2022.L without much coordination, but I have no plan to do
that... But if we do, I can probably add quite a few things to the
list of things that might need to change :)
That being said, this particular change of qid format is rather
annoying. 9p2000.L basically just added new message types, so dissectors
such as wireshark could barge in the middle of the tcp flow and more or
less understand; modifying a basic type like this would require to
either catch the TVERSION message or make new message types for
everything that deals with qids (auth/attach, walk, lopen, lcreate,
mknod, getattr, readdir, mkdir... that's quite a few)
So I think we're better off with the status quo here.
>> And, err, looking at the qemu code
>>
>> qidp->version = stbuf->st_mtime ^ (stbuf->st_size << 8);
>>
>> so the qid is treated as "data version",
>> but on kernel side we've treated it as inode version (i_version, see
>> include/linux/iversion.h)
>>
>> (v9fs_test_inode_dotl checks the version is the same when comparing two
>> inodes) so it will incorrectly identify two identical inodes as
>> different.
>> That will cause problems...
>> Since you'll be faster than me could you try keeping it at 0 there?
>
> I tried with your suggested change on QEMU side:
>
> diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
> index a6d6b3f835..5d9be87758 100644
> --- a/hw/9pfs/9p.c
> +++ b/hw/9pfs/9p.c
> @@ -981,7 +981,7 @@ static int stat_to_qid(V9fsPDU *pdu, const struct stat *stbuf, V9fsQID *qidp)
> memcpy(&qidp->path, &stbuf->st_ino, size);
> }
>
> - qidp->version = stbuf->st_mtime ^ (stbuf->st_size << 8);
> + qidp->version = 0;
> qidp->type = 0;
> if (S_ISDIR(stbuf->st_mode)) {
> qidp->type |= P9_QID_TYPE_DIR;
>
> Unfortunately it did not make any difference for these 2 Linux kernel fs-cache
> issues at least; still same errors, and same suboptimal performance.
Thanks, I'll give it a try today or tomorrow, adding some trace when we
create inodes might give a clue...
--
Dominique
On Wednesday, March 30, 2022 23:47:41 CEST [email protected] wrote:
> Thanks Christian!
>
> Christian Schoenebeck wrote on Wed, Mar 30, 2022 at 02:21:16PM +0200:
[...]
> > > Christian Schoenebeck wrote on Sat, Mar 26, 2022 at 01:36:31PM +0100:
> > > hm, fscache code shouldn't be used for cache=mmap, I'm surprised you can
> > > hit this...
> >
> > I assume that you mean that 9p driver does not explicitly ask for fs-cache
> > being used for mmap. I see that 9p uses the kernel's generalized mmap
> > implementation:
> >
> > https://github.com/torvalds/linux/blob/d888c83fcec75194a8a48ccd283953bdba7b2550/fs/9p/vfs_file.c#L481
> >
> > I haven't dived further into this, but the kernel has to use some kind of
> > filesystem cache anyway to provide the mmap functionality, so I guess it
> > makes sense that I got those warning messages from the FS-Cache
> > subsystem?
> It uses the generic mmap which has vfs caching, but definitely not
> fs-cache.
> fs-cache adds more hooks for cachefilesd (writing file contents to disk
> for bigger cache) and things like that cache=loose/mmap shouldn't be
> caring about. cache=loose actually just disables some key parts so I'm
> not surprised it shares bugs with the new code, but cache=mmap is really
> independent and I need to trace where these come from...
From looking at the sources, the call stack for emitting the "FS-Cache:
Duplicate cookie detected" error messages with the 9p "cache=mmap" option
seems to be:
1. v9fs_vfs_lookup [fs/9p/vfs_inode.c, 788]:
inode = v9fs_get_new_inode_from_fid(v9ses, fid, dir->i_sb);
2. v9fs_get_new_inode_from_fid [fs/9p/v9fs.h, 228]:
return v9fs_inode_from_fid_dotl(v9ses, fid, sb, 1);
3. v9fs_inode_from_fid_dotl [fs/9p/vfs_inode_dotl.c, 157]:
inode = v9fs_qid_iget_dotl(sb, &st->qid, fid, st, new);
4. v9fs_qid_iget_dotl [fs/9p/vfs_inode_dotl.c, 133]:
v9fs_cache_inode_get_cookie(inode);
^--- Called independent of function argument "new"'s value here
https://github.com/torvalds/linux/blob/e8b767f5e04097aaedcd6e06e2270f9fe5282696/fs/9p/vfs_inode_dotl.c#L133
5. v9fs_cache_inode_get_cookie [fs/9p/cache.c, 68]:
v9inode->fscache =
fscache_acquire_cookie(v9fs_session_cache(v9ses),
0,
&path, sizeof(path),
&version, sizeof(version),
i_size_read(&v9inode->vfs_inode));
6. fscache_acquire_cookie [include/linux/fscache.h, 251]:
return __fscache_acquire_cookie(volume, advice,
index_key, index_key_len,
aux_data, aux_data_len,
object_size);
7. __fscache_acquire_cookie [fs/fscache/cookie.c, 472]:
if (!fscache_hash_cookie(cookie)) {
fscache_see_cookie(cookie, fscache_cookie_discard);
fscache_free_cookie(cookie);
return NULL;
}
8. fscache_hash_cookie [fs/fscache/cookie.c, 430]:
pr_err("Duplicate cookie detected\n");
> > With QEMU >= 5.2 you should see the following QEMU warning with your
> > reproducer:
> >
> > "
> > qemu-system-x86_64: warning: 9p: Multiple devices detected in same VirtFS
> > export, which might lead to file ID collisions and severe misbehaviours on
> > guest! You should either use a separate export for each device shared from
> > host or use virtfs option 'multidevs=remap'!
> > "
>
> oh, I wasn't aware of the new option. Good job there!
>
> It's the easiest way to reproduce but there are also harder to fix
> collisions, file systems only guarantee unicity for (fsid,inode
> number,version) which is usually bigger than 128 bits (although version
> is often 0), but version isn't exposed to userspace easily...
> What we'd want for unicity is handle from e.g. name_to_handle_at but
> that'd add overhead, wouldn't fit in qid path and not all fs are capable
> of providing one... The 9p protocol just doesn't want bigger handles
> than qid path.
No bigger qid.path on 9p protocol level in future? Why?
> And, err, looking at the qemu code
>
> qidp->version = stbuf->st_mtime ^ (stbuf->st_size << 8);
>
> so the qid is treated as "data version",
> but on kernel side we've treated it as inode version (i_version, see
> include/linux/iversion.h)
>
> (v9fs_test_inode_dotl checks the version is the same when comparing two
> inodes) so it will incorrectly identify two identical inodes as
> different.
> That will cause problems...
> Since you'll be faster than me could you try keeping it at 0 there?
I tried with your suggested change on QEMU side:
diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
index a6d6b3f835..5d9be87758 100644
--- a/hw/9pfs/9p.c
+++ b/hw/9pfs/9p.c
@@ -981,7 +981,7 @@ static int stat_to_qid(V9fsPDU *pdu, const struct stat *stbuf, V9fsQID *qidp)
memcpy(&qidp->path, &stbuf->st_ino, size);
}
- qidp->version = stbuf->st_mtime ^ (stbuf->st_size << 8);
+ qidp->version = 0;
qidp->type = 0;
if (S_ISDIR(stbuf->st_mode)) {
qidp->type |= P9_QID_TYPE_DIR;
Unfortunately it did not make any difference for these 2 Linux kernel fs-cache
issues at least; still same errors, and same suboptimal performance.
Best regards,
Christian Schoenebeck
On Wednesday, March 30, 2022 14:21:16 CEST Christian Schoenebeck wrote:
> I made some tests & benchmarks regarding the fs-cache issue of 9p, running
> different kernel versions and kernel configurations in comparison.
[...]
> Case  Linux kernel version           .config  msize    cache  duration  host cpu  errors/warnings
>
> A)    5.17.0+[2] + msize patches[1]  debug    4186112  mmap   20m 40s   ~80%      none
> B)    5.17.0+[2] + msize patches[1]  debug    4186112  loose  31m 28s   ~35%      several errors (compilation completed)
> C)    5.17.0+[2] + msize patches[1]  debug    507904   mmap   20m 25s   ~84%      none
> D)    5.17.0+[2] + msize patches[1]  debug    507904   loose  31m 2s    ~33%      several errors (compilation completed)
> E)    5.17.0+[2]                     debug    512000   mmap   23m 45s   ~75%      none
> F)    5.17.0+[2]                     debug    512000   loose  32m 6s    ~31%      several errors (compilation completed)
> G)    5.17.0+[2]                     release  512000   mmap   23m 18s   ~76%      none
> H)    5.17.0+[2]                     release  512000   loose  32m 33s   ~31%      several errors (compilation completed)
> I)    5.17.0+[2] + msize patches[1]  release  4186112  mmap   20m 30s   ~83%      none
> J)    5.17.0+[2] + msize patches[1]  release  4186112  loose  31m 21s   ~31%      several errors (compilation completed)
> K)    5.10.84                        release  512000   mmap   39m 20s   ~80%      none
> L)    5.10.84                        release  512000   loose  13m 40s   ~55%      none
[...]
> About the errors: I actually already see errors with cache=loose and recent
> kernel version just when booting the guest OS. For these tests I chose some
> sources which allowed me to complete the build to capture some benchmark as
> well, I got some "soft" errors with those, but the build completed at least.
> I had other sources OTOH which did not complete though and aborted with
> certain invalid file descriptor errors, which I obviously could not use for
> those benchmarks here.
I used git-bisect to identify the commit that broke 9p behaviour, and it is
indeed this one:
commit eb497943fa215897f2f60fd28aa6fe52da27ca6c (HEAD, refs/bisect/bad)
Author: David Howells <[email protected]>
Date: Tue Nov 2 08:29:55 2021 +0000
9p: Convert to using the netfs helper lib to do reads and caching
Convert the 9p filesystem to use the netfs helper lib to handle readpage,
readahead and write_begin, converting those into a common issue_op for the
filesystem itself to handle. The netfs helper lib also handles reading
from fscache if a cache is available, and interleaving reads from both
sources.
This change also switches from the old fscache I/O API to the new one,
meaning that fscache no longer keeps track of netfs pages and instead does
async DIO between the backing files and the 9p file pagecache. As a part
of this change, the handling of PG_fscache changes. It now just means that
the cache has a write I/O operation in progress on a page (PG_locked
is used for a read I/O op).
Note that this is a cut-down version of the fscache rewrite and does not
change any of the cookie and cache coherency handling.
Changes
=======
ver #4:
- Rebase on top of folios.
- Don't use wait_on_page_bit_killable().
ver #3:
- v9fs_req_issue_op() needs to terminate the subrequest.
- v9fs_write_end() needs to call SetPageUptodate() a bit more often.
- It's not CONFIG_{AFS,V9FS}_FSCACHE[1]
- v9fs_init_rreq() should take a ref on the p9_fid and the cleanup should
drop it [from Dominique Martinet].
Signed-off-by: David Howells <[email protected]>
Reviewed-and-tested-by: Dominique Martinet <[email protected]>
cc: [email protected]
cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]/ [1]
Link: https://lore.kernel.org/r/163162772646.438332.16323773205855053535.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/163189109885.2509237.7153668924503399173.stgit@warthog.procyon.org.uk/ # rfc v2
Link: https://lore.kernel.org/r/163363943896.1980952.1226527304649419689.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/163551662876.1877519.14706391695553204156.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/163584179557.4023316.11089762304657644342.stgit@warthog.procyon.org.uk # rebase on folio
Signed-off-by: Dominique Martinet <[email protected]>
So Linux kernel v5.15 is fine, v5.16 is broken.
Best regards,
Christian Schoenebeck
Christian Schoenebeck <[email protected]> wrote:
> From looking at the sources, the call stack for emitting "FS-Cache: Duplicate
> cookie detected" error messages with 9p "cache=mmap" option seems to be:
You might find these tracepoints useful:
echo 1 >/sys/kernel/debug/tracing/events/fscache/fscache_cookie/enable
echo 1 >/sys/kernel/debug/tracing/events/fscache/fscache_acquire/enable
echo 1 >/sys/kernel/debug/tracing/events/fscache/fscache_relinquish/enable
David
On Saturday, April 9, 2022 13:16:11 CEST Christian Schoenebeck wrote:
> On Wednesday, March 30, 2022 14:21:16 CEST Christian Schoenebeck wrote:
> > I made some tests & benchmarks regarding the fs-cache issue of 9p, running
> > different kernel versions and kernel configurations in comparison.
> [...]
> > Case  Linux kernel version           .config  msize    cache  duration  host cpu  errors/warnings
> >
> > A)    5.17.0+[2] + msize patches[1]  debug    4186112  mmap   20m 40s   ~80%      none
> > B)    5.17.0+[2] + msize patches[1]  debug    4186112  loose  31m 28s   ~35%      several errors (compilation completed)
> > C)    5.17.0+[2] + msize patches[1]  debug    507904   mmap   20m 25s   ~84%      none
> > D)    5.17.0+[2] + msize patches[1]  debug    507904   loose  31m 2s    ~33%      several errors (compilation completed)
> > E)    5.17.0+[2]                     debug    512000   mmap   23m 45s   ~75%      none
> > F)    5.17.0+[2]                     debug    512000   loose  32m 6s    ~31%      several errors (compilation completed)
> > G)    5.17.0+[2]                     release  512000   mmap   23m 18s   ~76%      none
> > H)    5.17.0+[2]                     release  512000   loose  32m 33s   ~31%      several errors (compilation completed)
> > I)    5.17.0+[2] + msize patches[1]  release  4186112  mmap   20m 30s   ~83%      none
> > J)    5.17.0+[2] + msize patches[1]  release  4186112  loose  31m 21s   ~31%      several errors (compilation completed)
> > K)    5.10.84                        release  512000   mmap   39m 20s   ~80%      none
> > L)    5.10.84                        release  512000   loose  13m 40s   ~55%      none
> [...]
> > About the errors: I actually already see errors with cache=loose and recent
> > kernel version just when booting the guest OS. For these tests I chose some
> > sources which allowed me to complete the build to capture some benchmark as
> > well, I got some "soft" errors with those, but the build completed at least.
> > I had other sources OTOH which did not complete though and aborted with
> > certain invalid file descriptor errors, which I obviously could not use for
> > those benchmarks here.
>
> I used git-bisect to identify the commit that broke 9p behaviour, and it is
> indeed this one:
>
> commit eb497943fa215897f2f60fd28aa6fe52da27ca6c (HEAD, refs/bisect/bad)
> Author: David Howells <[email protected]>
> Date: Tue Nov 2 08:29:55 2021 +0000
>
> 9p: Convert to using the netfs helper lib to do reads and caching
>
> Convert the 9p filesystem to use the netfs helper lib to handle readpage,
> readahead and write_begin, converting those into a common issue_op for the
> filesystem itself to handle. The netfs helper lib also handles reading
> from fscache if a cache is available, and interleaving reads from both
> sources.
I looked into the errors I get, and as far as I can see, all the misbehaviours
boil down to "Bad file descriptor" (EBADF) errors as the originating cause.
The easiest misbehaviours on the guest system I can look into are errors
with the git client. For instance 'git fetch origin' fails this way:
...
write(3, "d16782889ee07005d1f57eb884f4a06b"..., 40) = 40
write(3, "\n", 1) = 1
close(3) = 0
access(".git/hooks/reference-transaction", X_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, ".git/logs/refs/remotes/origin/master", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
fstat(7, {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
read(7, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\t\0\0\0\t\0\0\0\0"..., 8192) = 2326
lseek(7, -1467, SEEK_CUR) = 859
read(7, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\t\0\0\0\t\0\0\0\0"..., 8192) = 1467
close(7) = 0
write(3, "d8a68c5027ef629d93b9d9519ff4da95"..., 168) = -1 EBADF (Bad file descriptor)
...
error: cannot update the ref 'refs/remotes/origin/master': unable to append to '.git/logs/refs/remotes/origin/master': Bad file descriptor
I tried to manually replicate those file access operations on the
.git/logs/refs/remotes/origin/master file in question, and it worked. But when
I look at the strace output above, I see there is a close(3) call just before
the subsequent openat(".git/logs/refs/remotes/origin/master") call returning 3,
which makes me wonder: is this maybe a concurrency issue in file descriptor
management?
Ideas anyone?
Best regards,
Christian Schoenebeck
[email protected] wrote:
> I see fscache also uses the qid version as 'auxiliary data', but I'm
> not sure what this is used for -- if it's a data version like thing then
> it will also at least invalidate the cache content all the time.
I should really have renamed "auxiliary data" to "coherency data". It's used
by direct comparison when fscache binds a cookie to a backing cache object to
work out if the content of the backing object is still valid.
David
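In other words, the coherency check boils down to a direct byte comparison of
the stored auxiliary data against what the netfs supplies; conceptually
something like this sketch (illustrative only; the real logic lives in the
cache backend):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    static bool backing_object_still_valid(const void *stored_aux,
                                           const void *cookie_aux, size_t len)
    {
        /* any mismatch -- e.g. 9p's qid.version having changed --
         * marks the cached content as stale */
        return memcmp(stored_aux, cookie_aux, len) == 0;
    }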
On Monday, April 11, 2022 00:54:47 CEST [email protected] wrote:
> Thanks for keeping it up!
>
> Christian Schoenebeck wrote on Sun, Apr 10, 2022 at 06:18:38PM +0200:
> > > I used git-bisect to identify the commit that broke 9p behaviour, and it
> > > is
> > > indeed this one:
> > >
> > > commit eb497943fa215897f2f60fd28aa6fe52da27ca6c (HEAD, refs/bisect/bad)
> > > Author: David Howells <[email protected]>
> > > Date: Tue Nov 2 08:29:55 2021 +0000
> > >
> > > 9p: Convert to using the netfs helper lib to do reads and caching
>
> Yes, quite a few things changed with that.
>
> > I looked into the errors I get, and as far as I can see it, all
> > misbehaviours that I see, boil down to "Bad file descriptor" (EBADF)
> > errors being the originating cause.
> >
> > The easiest misbehaviours on the guest system I can look into are errors
> > with the git client. For instance 'git fetch origin' fails this way:
> FWIW I didn't report but did try to reproduce, on my machines (tried a
> couple) booting on a small alpine rootfs over 9p works, and I tried some
> git clone/git fetch of varying sizes of local repo (tmpfs in VM -> 9p
> mount of tmpfs on host) to no avail.
I am getting more convinced that it's a bug on the Linux kernel side. When the
guest is booted I always immediately get a read("/var/log/wtmp") = EBADF error
on the guest. The 9p command sequence sent to the QEMU 9p server was:
...
v9fs_clunk tag 0 id 120 fid 568
v9fs_walk tag 0 id 110 fid 1 newfid 568 nwnames 1
v9fs_rerror tag 0 id 110 err 2
v9fs_walk tag 0 id 110 fid 26 newfid 568 nwnames 1
v9fs_rerror tag 0 id 110 err 2
v9fs_readlink tag 0 id 22 fid 474
v9fs_readlink_return tag 0 id 22 name /run
v9fs_readlink tag 0 id 22 fid 474
v9fs_readlink_return tag 0 id 22 name /run
v9fs_readlink tag 0 id 22 fid 474
v9fs_readlink_return tag 0 id 22 name /run
v9fs_readlink tag 0 id 22 fid 474
v9fs_readlink_return tag 0 id 22 name /run
v9fs_walk tag 0 id 110 fid 633 newfid 568 nwnames 1
v9fs_rerror tag 0 id 110 err 2
v9fs_walk tag 0 id 110 fid 875 newfid 568 nwnames 0
v9fs_walk_return tag 0 id 110 nwnames 0 qids (nil)
v9fs_open tag 0 id 12 fid 568 mode 32769
v9fs_open_return tag 0 id 12 qid={type 0 version 0 path 820297} iounit 507904
v9fs_walk tag 0 id 110 fid 875 newfid 900 nwnames 0
v9fs_walk_return tag 0 id 110 nwnames 0 qids (nil)
v9fs_open tag 0 id 12 fid 900 mode 2
v9fs_open_return tag 0 id 12 qid={type 0 version 0 path 820297} iounit 507904
v9fs_lock tag 0 id 52 fid 568 type 1 start 0 length 0
v9fs_lock_return tag 0 id 52 status 0
v9fs_xattrwalk tag 0 id 30 fid 568 newfid 901 name security.capability
v9fs_rerror tag 0 id 30 err 95
v9fs_read tag 0 id 116 fid 568 off 192512 max_count 256
So the guest opens /var/log/wtmp with fid=568 mode=32769 (0x8001, i.e.
O_WRONLY|O_LARGEFILE), which is write-only mode, and then it tries to read
that fid 568, which eventually causes the read() call on the host to fail with
EBADF. That makes sense, as the file was opened in write-only mode, hence
read() is not possible with that file descriptor.
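The failure mode itself is easy to demonstrate in plain userspace; a minimal
standalone example (independent of 9p) showing the same EBADF:

    /* reading from a write-only descriptor fails with EBADF */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[16];
        int fd = open("/tmp/ebadf_demo", O_WRONLY | O_CREAT, 0644);

        if (fd < 0) { perror("open"); return 1; }
        if (read(fd, buf, sizeof(buf)) < 0)
            perror("read"); /* prints: read: Bad file descriptor */
        close(fd);
        return 0;
    }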
The other things I noticed when looking at the 9p command sequence above:
there is a Twalk to fid 568 earlier, which is not clunked before fid 568 is
reused with Topen later. And before that Twalk there is a Tclunk on fid 568,
even though that fid apparently had not been used before.
> Perhaps backing filesystem dependent? qemu version? virtfs access options?
I tried with different hardware and different file systems (ext4, btrfs), same
misbehaviours.
QEMU is latest git version. I also tried several different QEMU versions, same
thing.
QEMU command line used:
~/git/qemu/build/qemu-system-x86_64 \
-machine pc,accel=kvm,usb=off,dump-guest-core=off -m 16384 \
-smp 8,sockets=8,cores=1,threads=1 -rtc base=utc -boot strict=on \
-kernel ~/vm/bullseye/boot/vmlinuz \
-initrd ~/vm/bullseye/boot/initrd.img \
-append 'root=fsRoot rw rootfstype=9p rootflags=trans=virtio,version=9p2000.L,msize=4186112,cache=loose console=ttyS0' \
-fsdev local,security_model=mapped,multidevs=remap,id=fsdev-fs0,path=$HOME/vm/bullseye/ \
-device virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=fsRoot \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-nographic
Important for reproducing this issue:
* cache=loose
* -smp N (with N>1)
* Guest booted with Linux kernel containing commit eb497943fa21
(uname >= 5.16)
I'm pretty sure that you can reproduce this issue with the QEMU 9p rootfs
setup HOWTO linked before.
> It's all extremely slow though... like the final checkout counting files
> at less than 10/s
It is VERY slow. And the weird thing is that cache=loose got much slower than
cache=mmap. My worst-case expectation would be cache=loose at least not
performing worse than cache=mmap.
Best regards,
Christian Schoenebeck
Thanks for keeping it up!
Christian Schoenebeck wrote on Sun, Apr 10, 2022 at 06:18:38PM +0200:
> > I used git-bisect to identify the commit that broke 9p behaviour, and it is
> > indeed this one:
> >
> > commit eb497943fa215897f2f60fd28aa6fe52da27ca6c (HEAD, refs/bisect/bad)
> > Author: David Howells <[email protected]>
> > Date: Tue Nov 2 08:29:55 2021 +0000
> >
> > 9p: Convert to using the netfs helper lib to do reads and caching
Yes, quite a few things changed with that.
> I looked into the errors I get, and as far as I can see it, all misbehaviours
> that I see, boil down to "Bad file descriptor" (EBADF) errors being the
> originating cause.
>
> The easiest misbehaviours on the guest system I can look into are errors
> with the git client. For instance 'git fetch origin' fails this way:
FWIW I didn't report but did try to reproduce, on my machines (tried a
couple) booting on a small alpine rootfs over 9p works, and I tried some
git clone/git fetch of varying sizes of local repo (tmpfs in VM -> 9p
mount of tmpfs on host) to no avail.
Perhaps backing filesystem dependent? qemu version? virtfs access options?
It's all extremely slow though... like the final checkout counting files
at less than 10/s
> ...
> write(3, "d16782889ee07005d1f57eb884f4a06b"..., 40) = 40
> write(3, "\n", 1) = 1
> close(3) = 0
> access(".git/hooks/reference-transaction", X_OK) = -1 ENOENT (No such file or directory)
> openat(AT_FDCWD, ".git/logs/refs/remotes/origin/master", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
> openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) = 7
> fstat(7, {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
> fstat(7, {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
> read(7, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\t\0\0\0\t\0\0\0\0"..., 8192) = 2326
> lseek(7, -1467, SEEK_CUR) = 859
> read(7, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\t\0\0\0\t\0\0\0\0"..., 8192) = 1467
> close(7) = 0
> write(3, "d8a68c5027ef629d93b9d9519ff4da95"..., 168) = -1 EBADF (Bad file descriptor)
> ...
> error: cannot update the ref 'refs/remotes/origin/master': unable to append to '.git/logs/refs/remotes/origin/master': Bad file descriptor
>
> I tried to manually replicate those file access operations on that
> .git/logs/refs/remotes/origin/master file in question, and it worked. But when
> I look at the strace output above, I see there is a close(3) call just before
> the subsequent openat(".git/logs/refs/remotes/origin/master") call returning 3,
> which makes me wonder, is this maybe a concurrency issue on file descriptor
> management?
hmm, in the cache=loose case writes should just be updating the page cache
for buffers to be flushed later, so this is definitely weird.
If you can reproduce well enough for this, could you first confirm that
the EBADF comes from the client and not qemu? either mounting with debug
or getting traces from qemu at a protocol level would get that.
If it's local there are only so many places EBADF can come from and it
should be possible to trace it back with e.g. perf probe or bpftrace,
but even if we confirm that e.g. the process' fd table is messed up it
won't tell us why it was, so it's going to be annoying... I'd really
like to be able to reproduce this somehow :/
--
Dominique
Christian Schoenebeck wrote on Mon, Apr 11, 2022 at 03:41:45PM +0200:
> I get more convinced that it's a bug on Linux kernel side. When guest is
> booted I always immediately get a read("/var/log/wtmp") = EBADF error on
> guest. And the 9p command sequence sent to QEMU 9p server were:
Yes, I'm not pointing fingers, just trying to understand :)
>
> ...
> v9fs_clunk tag 0 id 120 fid 568
> v9fs_walk tag 0 id 110 fid 1 newfid 568 nwnames 1
> v9fs_rerror tag 0 id 110 err 2
> v9fs_walk tag 0 id 110 fid 26 newfid 568 nwnames 1
> v9fs_rerror tag 0 id 110 err 2
> v9fs_readlink tag 0 id 22 fid 474
> v9fs_readlink_return tag 0 id 22 name /run
> v9fs_readlink tag 0 id 22 fid 474
> v9fs_readlink_return tag 0 id 22 name /run
> v9fs_readlink tag 0 id 22 fid 474
> v9fs_readlink_return tag 0 id 22 name /run
> v9fs_readlink tag 0 id 22 fid 474
> v9fs_readlink_return tag 0 id 22 name /run
> v9fs_walk tag 0 id 110 fid 633 newfid 568 nwnames 1
> v9fs_rerror tag 0 id 110 err 2
> v9fs_walk tag 0 id 110 fid 875 newfid 568 nwnames 0
> v9fs_walk_return tag 0 id 110 nwnames 0 qids (nil)
> v9fs_open tag 0 id 12 fid 568 mode 32769
> v9fs_open_return tag 0 id 12 qid={type 0 version 0 path 820297} iounit 507904
> v9fs_walk tag 0 id 110 fid 875 newfid 900 nwnames 0
> v9fs_walk_return tag 0 id 110 nwnames 0 qids (nil)
> v9fs_open tag 0 id 12 fid 900 mode 2
> v9fs_open_return tag 0 id 12 qid={type 0 version 0 path 820297} iounit 507904
> v9fs_lock tag 0 id 52 fid 568 type 1 start 0 length 0
> v9fs_lock_return tag 0 id 52 status 0
> v9fs_xattrwalk tag 0 id 30 fid 568 newfid 901 name security.capability
> v9fs_rerror tag 0 id 30 err 95
> v9fs_read tag 0 id 116 fid 568 off 192512 max_count 256
>
> So guest opens /var/log/wtmp with fid=568 mode=32769, which is write-only
> mode, and then it tries to read that fid 568, which eventually causes the
> read() call on host to error with EBADF. Which makes sense, as the file was
> opened in write-only mode, hence read() is not possible with that file
> descriptor.
Oh! That's something we can work on. The vfs code has different caches
for read-only and read-write fids; perhaps the new netfs code just used
the wrong one somewhere. I'll have a look.
> The other things I noticed when looking at the 9p command sequence above:
> there is a Twalk on fid 568 before, which is not clunked before reusing fid
> 568 with Topen later. And before that Twalk on fid 568 there is a Tclunk on
> fid 568, but apparently that fid was not used before.
This one though is just weird; I don't see where linux would make up a fid to
clunk like this... Could messages be ordered a bit weirdly through
multithreading?
e.g. thread 1 opens, thread 2 clunks almost immediately afterwards, and
they would be printed the other way around?
It should still be serialized through the virtio ring buffer, so I don't
believe what I'm saying myself... It might be worth digging further as
well.
> > Perhaps backing filesystem dependent? qemu version? virtfs access options?
>
> I tried with different hardware and different file systems (ext4, btrfs), same
> misbehaviours.
>
> QEMU is latest git version. I also tried several different QEMU versions, same
> thing.
>
> QEMU command line used:
>
> ~/git/qemu/build/qemu-system-x86_64 \
> -machine pc,accel=kvm,usb=off,dump-guest-core=off -m 16384 \
> -smp 8,sockets=8,cores=1,threads=1 -rtc base=utc -boot strict=on \
> -kernel ~/vm/bullseye/boot/vmlinuz \
> -initrd ~/vm/bullseye/boot/initrd.img \
> -append 'root=fsRoot rw rootfstype=9p rootflags=trans=virtio,version=9p2000.L,msize=4186112,cache=loose console=ttyS0' \
> -fsdev local,security_model=mapped,multidevs=remap,id=fsdev-fs0,path=$HOME/vm/bullseye/ \
> -device virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=fsRoot \
> -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
> -nographic
>
> Important for reproducing this issue:
>
> * cache=loose
> * -smp N (with N>1)
> * Guest booted with Linux kernel containing commit eb497943fa21
> (uname >= 5.16)
>
> I'm pretty sure that you can reproduce this issue with the QEMU 9p rootfs
> setup HOWTO linked before.
Yes, I'm not sure why I can't reproduce... All my computers are pretty
slow but the conditions should be met.
I'll try again with a command line closer to what you just gave here.
> > It's all extremely slow though... like the final checkout counting files
> > at less than 10/s
>
> It is VERY slow. And the weird thing is that cache=loose got much slower than
> cache=mmap. My worst-case expectation would be cache=loose at least not
> performing worse than cache=mmap.
Yes, some profiling is also in order; it didn't use to be that slow, so
it must not be reusing previously opened fids as it should have, or
something...
--
Dominique
On Wednesday, April 13, 2022 00:38:21 CEST [email protected] wrote:
> Christian Schoenebeck wrote on Mon, Apr 11, 2022 at 03:41:45PM +0200:
> > I get more convinced that it's a bug on Linux kernel side. When guest is
> > booted I always immediately get a read("/var/log/wtmp") = EBADF error on
> > guest. And the 9p command sequence sent to QEMU 9p server were:
>
> Yes, I'm not pointing fingers, just trying to understand :)
Don't worry, that was not my impression, nor was it my intention either. I'm
just trying to interpret what I'm seeing here.
> > ...
> > v9fs_clunk tag 0 id 120 fid 568
> > v9fs_walk tag 0 id 110 fid 1 newfid 568 nwnames 1
> > v9fs_rerror tag 0 id 110 err 2
> > v9fs_walk tag 0 id 110 fid 26 newfid 568 nwnames 1
> > v9fs_rerror tag 0 id 110 err 2
> > v9fs_readlink tag 0 id 22 fid 474
> > v9fs_readlink_return tag 0 id 22 name /run
> > v9fs_readlink tag 0 id 22 fid 474
> > v9fs_readlink_return tag 0 id 22 name /run
> > v9fs_readlink tag 0 id 22 fid 474
> > v9fs_readlink_return tag 0 id 22 name /run
> > v9fs_readlink tag 0 id 22 fid 474
> > v9fs_readlink_return tag 0 id 22 name /run
> > v9fs_walk tag 0 id 110 fid 633 newfid 568 nwnames 1
> > v9fs_rerror tag 0 id 110 err 2
> > v9fs_walk tag 0 id 110 fid 875 newfid 568 nwnames 0
> > v9fs_walk_return tag 0 id 110 nwnames 0 qids (nil)
> > v9fs_open tag 0 id 12 fid 568 mode 32769
> > v9fs_open_return tag 0 id 12 qid={type 0 version 0 path 820297} iounit 507904
> > v9fs_walk tag 0 id 110 fid 875 newfid 900 nwnames 0
> > v9fs_walk_return tag 0 id 110 nwnames 0 qids (nil)
> > v9fs_open tag 0 id 12 fid 900 mode 2
> > v9fs_open_return tag 0 id 12 qid={type 0 version 0 path 820297} iounit 507904
> > v9fs_lock tag 0 id 52 fid 568 type 1 start 0 length 0
> > v9fs_lock_return tag 0 id 52 status 0
> > v9fs_xattrwalk tag 0 id 30 fid 568 newfid 901 name security.capability
> > v9fs_rerror tag 0 id 30 err 95
> > v9fs_read tag 0 id 116 fid 568 off 192512 max_count 256
> >
> > So guest opens /var/log/wtmp with fid=568 mode=32769, which is write-only
> > mode, and then it tries to read that fid 568, which eventually causes the
> > read() call on host to error with EBADF. Which makes sense, as the file was
> > opened in write-only mode, hence read() is not possible with that file
> > descriptor.
>
> Oh! That's something we can work on. The vfs code has different caches
> for read-only and read-write fids; perhaps the new netfs code just used
> the wrong one somewhere. I'll have a look.
>
> > The other things I noticed when looking at the 9p command sequence above:
> > there is a Twalk on fid 568 before, which is not clunked before reusing fid
> > 568 with Topen later. And before that Twalk on fid 568 there is a Tclunk on
> > fid 568, but apparently that fid was not used before.
>
> This one though is just weird, I don't see where Linux would make up a fid
> to clunk like this... Could messages be ordered a bit weird through
> multithreading?
> e.g. thread 1 opens, thread 2 clunks almost immediately afterwards, and
> would be printed the other way around?
Yeah, something like that was also my guess.
> Should still be serialized through the virtio ring buffer so I don't
> believe what I'm saying myself... It might be worth digging further as
> well.
>
> > > Perhaps backing filesystem dependent? QEMU version? virtfs access
> > > options?
> >
> > I tried with different hardware and different file systems (ext4, btrfs),
> > same misbehaviours.
> >
> > QEMU is latest git version. I also tried several different QEMU versions,
> > same thing.
> >
> > QEMU command line used:
> >
> > ~/git/qemu/build/qemu-system-x86_64 \
> > -machine pc,accel=kvm,usb=off,dump-guest-core=off -m 16384 \
> > -smp 8,sockets=8,cores=1,threads=1 -rtc base=utc -boot strict=on \
> > -kernel ~/vm/bullseye/boot/vmlinuz \
> > -initrd ~/vm/bullseye/boot/initrd.img \
> > -append 'root=fsRoot rw rootfstype=9p rootflags=trans=virtio,version=9p2000.L,msize=4186112,cache=loose console=ttyS0' \
> > -fsdev local,security_model=mapped,multidevs=remap,id=fsdev-fs0,path=$HOME/vm/bullseye/ \
> > -device virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=fsRoot \
> > -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
> > -nographic
> >
> > Important for reproducing this issue:
> > * cache=loose
> > * -smp N (with N>1)
> > * Guest booted with Linux kernel containing commit eb497943fa21
> >   (uname >= 5.16)
> >
> > I'm pretty sure that you can reproduce this issue with the QEMU 9p rootfs
> > setup HOWTO linked before.
>
> Yes, I'm not sure why I can't reproduce... All my computers are pretty
> slow but the conditions should be met.
> I'll try again with a command line closer to what you just gave here.
I'm not surprised that you could not reproduce the EBADF errors yet. To make
this more clear, as for the git client errors: I have like 200+ git
repositories checked out on that test VM, and only about 5 of them trigger
EBADF errors on 'git pull'. But those few repositories reproduce the EBADF
errors reliably here.
In other words: these EBADF errors only seem to trigger under certain
circumstances, so it requires quite a bunch of test material to get a
reproducer.
Like I said though, with the Bullseye installation I immediately get EBADF
errors already when booting, whereas with a Buster VM it boots without errors.
> > > It's all extremely slow though... like the final checkout counting files
> > > at less than 10/s
> >
> > It is VERY slow. And the weird thing is that cache=loose got much slower
> > than cache=mmap. My worst-case expectation would be cache=loose at least
> > not performing worse than cache=mmap.
>
> Yes, some profiling is also in order, it didn't use to be that slow so
> it must not be reusing previously open fids as it should have or
> something..
If somebody has some more ideas what I can try/test, let me know. However ATM
I won't be able to review the netfs and vfs code to actually find the cause of
these issues.
Best regards,
Christian Schoenebeck
Christian Schoenebeck wrote on Thu, Apr 14, 2022 at 02:44:53PM +0200:
> > Yes, I'm not sure why I can't reproduce... All my computers are pretty
> > slow but the conditions should be met.
> > I'll try again with a command line closer to what you just gave here.
>
> I'm not surprised that you could not reproduce the EBADF errors yet. To make
> this more clear, as for the git client errors: I have like 200+ git
> repositories checked out on that test VM, and only about 5 of them trigger
> EBADF errors on 'git pull'. But those few repositories reproduce the EBADF
> errors reliably here.
>
> In other words: these EBADF errors only seem to trigger under certain
> circumstances, so it requires quite a bunch of test material to get a
> reproducer.
>
> Like I said though, with the Bullseye installation I immediately get EBADF
> errors already when booting, whereas with a Buster VM it boots without errors.
Okay, I had missed that!
I've managed to reproduce with git:
https://gaia.codewreck.org/local/tmp/c.tar.zst
This archive (~300KB) when decompressed is a ~150MB repo where git reset
produces EBADF reliably for me.
From the looks of it, write fails in v9fs_write_begin, which itself
fails because it tries to read first on a file that was open with
O_WRONLY|O_CREAT|O_APPEND.
Since this is an append, the read is necessary to populate the local page
cache when writing, and we make sure the writeback fid is open for
writing, but not for reading...
Will have to think about how we might want to handle this; perhaps just giving
the writeback fid read rights all the time as well...
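To make the mechanics concrete, here's a rough userspace model of what
the append path has to do when the write starts mid-page (names are made
up for the sketch; this is not the kernel code):

#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ 4096

/* Appending at 'pos' writes into the page containing 'pos'; when 'pos'
 * is not page-aligned, the bytes already in that page have to be
 * fetched from the server first. With a write-only fid, this pread()
 * is exactly what comes back as EBADF. */
static int fill_page_for_append(int server_fd, char *page, off_t pos)
{
    off_t page_start = pos & ~((off_t)PAGE_SZ - 1);
    size_t existing = (size_t)(pos - page_start);

    if (existing == 0)
        return 0;  /* aligned append: nothing to read */
    if (pread(server_fd, page, existing, page_start) < 0)
        return -1; /* fails if server_fd is write-only */
    return 0;
}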
Ran out of time for tonight, but hopefully we can sort it out soonish now!
> If somebody has some more ideas what I can try/test, let me know. However ATM
> I won't be able to review the netfs and vfs code to actually find the cause of
> these issues.
You've been of great help already, thanks!
--
Dominique
On Sunday, 17 April 2022 14:56:22 CEST [email protected] wrote:
> Christian Schoenebeck wrote on Thu, Apr 14, 2022 at 02:44:53PM +0200:
> > > Yes, I'm not sure why I can't reproduce... All my computers are pretty
> > > slow but the conditions should be met.
> > > I'll try again with a command line closer to what you just gave here.
> >
> > I'm not surprised that you could not reproduce the EBADF errors yet. To
> > make this more clear, as for the git client errors: I have like 200+ git
> > repositories checked out on that test VM, and only about 5 of them
> > trigger EBADF errors on 'git pull'. But those few repositories reproduce
> > the EBADF errors reliably here.
> >
> > In other words: these EBADF errors only seem to trigger under certain
> > circumstances, so it requires quite a bunch of test material to get a
> > reproducer.
> >
> > Like I said though, with the Bullseye installation I immediately get EBADF
> > errors already when booting, whereas with a Buster VM it boots without
> > errors.
> Okay, I had missed that!
>
> I've managed to reproduce with git:
> https://gaia.codewreck.org/local/tmp/c.tar.zst
>
> This archive (~300KB) when decompressed is a ~150MB repo where git reset
> produces EBADF reliably for me.
I'm glad you were able to reproduce these EBADF errors!
> From the looks of it, write fails in v9fs_write_begin, which itself
> fails because it tries to read first on a file that was open with
> O_WRONLY|O_CREAT|O_APPEND.
> Since this is an append, the read is necessary to populate the local page
> cache when writing, and we make sure the writeback fid is open for
> writing, but not for reading...
>
> Will have to think about how we might want to handle this; perhaps just giving
> the writeback fid read rights all the time as well...
> Ran out of time for tonight, but hopefully we can sort it out soonish now!
I fear that would just trade symptoms: There are use cases for write-only
permissions, which would then fail after such a simple change.
Independent of this EBADF issue, it would be good to know why 9p performance
got so slow with cache=loose by the netfs changes. Maybe David has an idea?
Best regards,
Christian Schoenebeck
Christian Schoenebeck wrote on Sun, Apr 17, 2022 at 03:52:43PM +0200:
> > From the looks of it, write fails in v9fs_write_begin, which itself
> > fails because it tries to read first on a file that was open with
> > O_WRONLY|O_CREAT|O_APPEND.
> > Since this is an append, the read is necessary to populate the local page
> > cache when writing, and we make sure the writeback fid is open for
> > writing, but not for reading...
> >
> > Will have to think about how we might want to handle this; perhaps just giving
> > the writeback fid read rights all the time as well...
> > Ran out of time for tonight, but hopefully we can sort it out soonish now!
>
> I fear that would just trade symptoms: There are use cases for write-only
> permissions, which would then fail after such a simple change.
The writeback fid is only used by async ops to flush (and apparently
since 5.10ish populate) the cache; I actually wonder how that "populate
the cache" worked before!
Anyway, since it's not used by direct operations I believe we can mess
with its open mode, but that assumes permission checks are properly done
at the vfs level (this is pretty much the line of thinking that allowed
Dirty COW... but in this case if a file is opened read-only the
writeback fid isn't allocated afaik, so it's probably OK?...)
Alternatively we could have the cache issue yet another open for read
when needed, but I think a single RW fid is probably good enough if we
might read from it (no TRUNC)...
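As a userspace model of that single-RW-fid idea (illustrative only; the
real change would be in how fs/9p opens its writeback fid):

#include <fcntl.h>

/* One read-write handle for all cache traffic, independent of the mode
 * the user opened the file with. No O_TRUNC, so taking it can never
 * clobber data; the cache can then read (fills) and write (flushes)
 * through the same handle. */
static int open_cache_handle(const char *path)
{
    return open(path, O_RDWR);
}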
It'd break opening the writeback fid on files with -w- permission if the
open is not done as root, but I don't see how we could make appending to
a write-only file work at something that is not a page boundary either way.
David, netfs doesn't allow caching at byte granularity, correct?
If it does, we could fix the problem by only triggering a read when
really needed.
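The check for "really needed" would look something like this (sketch
only, not the actual netfs interface):

#define PAGE_SZ 4096

/* write_begin only needs to read the page first when the write is
 * partial and lands over data that already exists on the server. */
static int need_read_before_write(long long pos, unsigned int len,
                                  long long file_size)
{
    long long page_start = pos & ~((long long)PAGE_SZ - 1);

    if (len == PAGE_SZ)
        return 0; /* full-page overwrite, old bytes don't matter */
    if (page_start >= file_size)
        return 0; /* page is entirely past EOF, nothing to fetch */
    return 1;     /* partial write over existing data */
}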
> Independent of this EBADF issue, it would be good to know why 9p performance
> got so slow with cache=loose by the netfs changes. Maybe David has an idea?
Yes, I've just compared the behaviour of the old cache and the new one
(with cache=loose), and the main differences I can see are that the time
until flush is longer on the old version and that reads are bigger with
the new version. The rest is all identical as far as I can see: 4k IOs
for write, walk/open/clunk sequences to read a cached file (we could
delay these until reading into cache or a metadata op is required?),
TFSYNC after a series of writes or on directories after a while... so I
don't see a difference that would explain the slowdown.
In particular I don't observe any cache invalidation when the mtime (and
so the qid 'version', e.g. cache ancillary data) changes, but for
cache=loose that's how I'd expect it to work as well.
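For comparison, a validating cache mode would do roughly the check below
on each revalidation, which matches the expectation above that
cache=loose skips it (made-up helper, not the v9fs code):

#include <stdint.h>

/* The 9p qid carries a 'version' the server bumps when file data
 * changes; a stricter cache mode compares it against the cached copy
 * before trusting cached pages. cache=loose never does this. */
static int cached_pages_still_valid(uint32_t cached_qid_version,
                                    uint32_t server_qid_version)
{
    return cached_qid_version == server_qid_version;
}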
Perhaps the performance difference can be explained just by how
aggressively data is flushed out of memory: since it's written to disk
faster, it'd also be easier to forget about, and slow reads would then
have to be re-issued?
hmm... I need to spend more time on that as well...
--
Dominique