(fixed the subject again and promoted David Howells to To; please read
the previous couple of mails when you have time)
[email protected] wrote on Mon, Apr 18, 2022 at 06:22:26AM +0900:
> Christian Schoenebeck wrote on Sun, Apr 17, 2022 at 03:52:43PM +0200:
> > > From the looks of it, write fails in v9fs_write_begin, which itself
> > > fails because it tries to read first on a file that was open with
> > > O_WRONLY|O_CREAT|O_APPEND.
> > > Since this is an append the read is necessary to populate the local page
> > > cache when writing, and we're careful that the writeback fid is open in
> > > write, but not about read...
BTW, now that this is understood, here's a much simpler reproducer:
---append.c----
#include <fcntl.h>
#include <unistd.h>
int main(int argc, char *argv[]) {
	if (argc < 2)
		return 1;
	int fd = open(argv[1], O_WRONLY|O_APPEND);
	if (fd < 0)
		return 1;
	if (write(fd, "test\n", 5) < 0)
		return 1;
	return 0;
}
---
---
echo foo > foo
echo 3 > /proc/sys/vm/drop_caches
strace ./append foo
...
openat(AT_FDCWD, "foo", O_WRONLY|O_APPEND) = 3
write(3, "test\n", 5) = -1 EBADF (Bad file descriptor)
---
At the 9p client level:
----
9pnet: (00000460) >>> TWALK fids 1,2 nwname 1d wname[0] t
9pnet: (00000460) >>> size=20 type: 110 tag: 0
9pnet: (00000460) <<< size=22 type: 111 tag: 0
9pnet: (00000460) <<< RWALK nwqid 1:
9pnet: (00000460) <<< [0] 0.6e672b.6289a895
9pnet: (00000460) >>> TGETATTR fid 2, request_mask 6143
9pnet: (00000460) >>> size=19 type: 24 tag: 0
9pnet: (00000460) <<< size=160 type: 25 tag: 0
9pnet: (00000460) <<< RGETATTR st_result_mask=6143
<<< qid=0.6e672b.6289a895
<<< st_mode=000081ed st_nlink=1
<<< st_uid=1000 st_gid=100
<<< st_rdev=0 st_size=d538 st_blksize=126976 st_blocks=112
<<< st_atime_sec=1650233493 st_atime_nsec=697920121
<<< st_mtime_sec=1650233493 st_mtime_nsec=19911120
<<< st_ctime_sec=1650233493 st_ctime_nsec=19911120
<<< st_btime_sec=0 st_btime_nsec=0
<<< st_gen=0 st_data_version=0
9pnet: (00000460) >>> TWALK fids 2,3 nwname 0d wname[0] (null)
9pnet: (00000460) >>> size=17 type: 110 tag: 0
9pnet: (00000460) <<< size=9 type: 111 tag: 0
9pnet: (00000460) <<< RWALK nwqid 0:
9pnet: (00000460) >>> TLOPEN fid 3 mode 32768
9pnet: (00000460) >>> size=15 type: 12 tag: 0
9pnet: (00000460) <<< size=24 type: 13 tag: 0
9pnet: (00000460) <<< RLOPEN qid 0.6e672b.6289a895 iounit 1f000
9pnet: (00000460) >>> TREAD fid 3 offset 0 8192
9pnet: (00000460) >>> size=23 type: 116 tag: 0
9pnet: (00000460) <<< size=8203 type: 117 tag: 0
9pnet: (00000460) <<< RREAD count 8192
9pnet: (00000460) >>> TREAD fid 3 offset 8192 16384
9pnet: (00000460) >>> size=23 type: 116 tag: 0
9pnet: (00000460) <<< size=16395 type: 117 tag: 0
9pnet: (00000460) <<< RREAD count 16384
9pnet: (00000460) >>> TXATTRWALK file_fid 2, attr_fid 4 name security.capability
9pnet: (00000460) >>> size=36 type: 30 tag: 0
9pnet: (00000460) <<< size=11 type: 7 tag: 0
9pnet: (00000460) <<< RLERROR (-95)
9pnet: (00000460) >>> TREAD fid 3 offset 24576 30008
9pnet: (00000460) >>> size=23 type: 116 tag: 0
9pnet: (00000460) <<< size=30019 type: 117 tag: 0
9pnet: (00000460) <<< RREAD count 30008
9pnet: (00000460) >>> TWALK fids 1,4 nwname 1d wname[0] foo
9pnet: (00000460) >>> size=22 type: 110 tag: 0
9pnet: (00000460) <<< size=22 type: 111 tag: 0
9pnet: (00000460) <<< RWALK nwqid 1:
9pnet: (00000460) <<< [0] 0.6e66f9.625c86a5
9pnet: (00000460) >>> TGETATTR fid 4, request_mask 6143
9pnet: (00000460) >>> size=19 type: 24 tag: 0
9pnet: (00000460) <<< size=160 type: 25 tag: 0
9pnet: (00000460) <<< RGETATTR st_result_mask=6143
<<< qid=0.6e66f9.625c86a5
<<< st_mode=000081a4 st_nlink=1
<<< st_uid=0 st_gid=0
<<< st_rdev=0 st_size=9 st_blksize=126976 st_blocks=8
<<< st_atime_sec=1650233249 st_atime_nsec=226674419
<<< st_mtime_sec=1650233253 st_mtime_nsec=226727529
<<< st_ctime_sec=1650233253 st_ctime_nsec=226727529
<<< st_btime_sec=0 st_btime_nsec=0
<<< st_gen=0 st_data_version=0
9pnet: (00000460) >>> TWALK fids 4,5 nwname 0d wname[0] (null)
9pnet: (00000460) >>> size=17 type: 110 tag: 0
9pnet: (00000460) <<< size=9 type: 111 tag: 0
9pnet: (00000460) <<< RWALK nwqid 0:
9pnet: (00000460) >>> TLOPEN fid 5 mode 33793
9pnet: (00000460) >>> size=15 type: 12 tag: 0
9pnet: (00000460) <<< size=24 type: 13 tag: 0
9pnet: (00000460) <<< RLOPEN qid 0.6e66f9.625c86a5 iounit 1f000
9pnet: (00000460) >>> TWALK fids 4,6 nwname 0d wname[0] (null)
9pnet: (00000460) >>> size=17 type: 110 tag: 0
9pnet: (00000460) <<< size=9 type: 111 tag: 0
9pnet: (00000460) <<< RWALK nwqid 0:
9pnet: (00000460) >>> TLOPEN fid 6 mode 2
9pnet: (00000460) >>> size=15 type: 12 tag: 0
9pnet: (00000460) <<< size=24 type: 13 tag: 0
9pnet: (00000460) <<< RLOPEN qid 0.6e66f9.625c86a5 iounit 1f000
9pnet: (00000460) >>> TXATTRWALK file_fid 4, attr_fid 7 name security.capability
9pnet: (00000460) >>> size=36 type: 30 tag: 0
9pnet: (00000460) <<< size=11 type: 7 tag: 0
9pnet: (00000460) <<< RLERROR (-95)
9pnet: (00000460) >>> TREAD fid 5 offset 0 9
9pnet: (00000460) >>> size=23 type: 116 tag: 0
9pnet: (00000460) <<< size=11 type: 7 tag: 0
9pnet: (00000460) <<< RLERROR (-9)
9pnet: (00000460) >>> TCLUNK fid 5 (try 0)
9pnet: (00000460) >>> size=11 type: 120 tag: 0
9pnet: (00000460) <<< size=7 type: 121 tag: 0
9pnet: (00000460) <<< RCLUNK fid 5
9pnet: (00000460) >>> TCLUNK fid 3 (try 0)
9pnet: (00000460) >>> size=11 type: 120 tag: 0
9pnet: (00000460) <<< size=7 type: 121 tag: 0
9pnet: (00000460) <<< RCLUNK fid 3
-------
--
Dominique
On Thursday, 21 April 2022 12:36:12 CEST David Howells wrote:
> [email protected] wrote:
> > 	int fd = open(argv[1], O_WRONLY|O_APPEND);
> > 	if (fd < 0)
> > 		return 1;
> > 	if (write(fd, "test\n", 5) < 0)
>
> I think I need to implement the ability to store writes in non-uptodate
> pages without needing to read from the server as NFS does. This may fix
> the performance drop also.
>
> David
I hope this does not sound harsh, but wouldn't it make sense to revert
eb497943fa215897f2f60fd28aa6fe52da27ca6c for now, until those issues are sorted
out? My concern is that it might take a long time to address them, and these
are not minor issues.
Best regards,
Christian Schoenebeck
[email protected] wrote:
> 	int fd = open(argv[1], O_WRONLY|O_APPEND);
> 	if (fd < 0)
> 		return 1;
> 	if (write(fd, "test\n", 5) < 0)
I think I need to implement the ability to store writes in non-uptodate pages
without needing to read from the server as NFS does. This may fix the
performance drop also.
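As a rough sketch of that idea (all names below are hypothetical illustrations, not actual netfs or NFS code; NFS itself tracks this with per-page struct nfs_page requests), the write path would remember which bytes of a non-uptodate folio were dirtied instead of reading the rest in first:
---sketch only, not a real patch---
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/slab.h>

/* Remember which byte range of a non-uptodate folio is dirty, so a
 * write does not force a read from the server first; writeback would
 * then send only [start, end) to the server.
 */
struct partial_write_range {
	unsigned int start;	/* first dirty byte within the folio */
	unsigned int end;	/* one past the last dirty byte */
};

static int write_begin_without_read(struct folio *folio, loff_t pos,
				    size_t len)
{
	struct partial_write_range *pwr;

	if (folio_test_uptodate(folio))
		return 0;	/* whole folio already valid */

	pwr = kmalloc(sizeof(*pwr), GFP_KERNEL);
	if (!pwr)
		return -ENOMEM;

	/* Record the range the caller is about to copy in. */
	pwr->start = offset_in_folio(folio, pos);
	pwr->end = pwr->start + len;
	folio_attach_private(folio, pwr);
	return 0;
}
---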
David
Christian Schoenebeck wrote on Thu, Apr 21, 2022 at 01:36:14PM +0200:
> I hope this does not sound harsh, but wouldn't it make sense to revert
> eb497943fa215897f2f60fd28aa6fe52da27ca6c for now until those issues are sorted
> out? My concern is that it might take a long time to address them, and these
> are not minor issues.
I'm not sure that's possible at all; the related old fscache code has
been ripped out since, so just reverting won't work.
I'm also curious why that behavior changed, though; I don't think the
old code had any special handling of partially written pages either...
Understanding that might give us the key to a small, quick fix.
It is quite a bad bug though, and I really wish I could give it the
attention it deserves; early next month has a few holidays here, so
hopefully I'll be able to look at it more closely then :/
--
Dominique
There may be a quick and dirty workaround. I think the problem is that unless
the O_APPEND write starts at the beginning of a page, netfs is going to enforce
a read. Does the attached patch fix the problem? (note that it's untested)
Also, can you get the contents of /proc/fs/fscache/stats from after
reproducing the problem?
David
---
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 501128188343..5f61fdb950b0 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -291,16 +291,25 @@ static int v9fs_write_end(struct file *filp, struct address_space *mapping,
 	struct folio *folio = page_folio(subpage);
 	struct inode *inode = mapping->host;
 	struct v9fs_inode *v9inode = V9FS_I(inode);
+	size_t fsize = folio_size(folio);
+	size_t offset = pos & (fsize - 1);
+	/* With multipage folio support, we may be given len > fsize */
+	size_t copy_size = min_t(size_t, len, fsize - offset);
 
 	p9_debug(P9_DEBUG_VFS, "filp %p, mapping %p\n", filp, mapping);
 
 	if (!folio_test_uptodate(folio)) {
-		if (unlikely(copied < len)) {
+		if (unlikely(copied < copy_size)) {
 			copied = 0;
 			goto out;
 		}
-
-		folio_mark_uptodate(folio);
+		if (offset == 0) {
+			if (copied == fsize)
+				folio_mark_uptodate(folio);
+			/* Could clear to end of page if last_pos == new EOF
+			 * and then mark uptodate
+			 */
+		}
 	}
 
 	/*
diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c
index 281a88a5b8dc..78439f628c23 100644
--- a/fs/netfs/buffered_read.c
+++ b/fs/netfs/buffered_read.c
@@ -364,6 +364,12 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
 	if (folio_test_uptodate(folio))
 		goto have_folio;
 
+	if (!netfs_is_cache_enabled(ctx) &&
+	    (file->f_flags & (O_APPEND | O_ACCMODE)) == (O_APPEND | O_WRONLY)) {
+		netfs_stat(&netfs_n_rh_write_append);
+		goto have_folio_no_wait;
+	}
+
 	/* If the page is beyond the EOF, we want to clear it - unless it's
 	 * within the cache granule containing the EOF, in which case we need
 	 * to preload the granule.
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index b7b0e3d18d9e..a1cd649197dc 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -67,6 +67,7 @@ extern atomic_t netfs_n_rh_read_failed;
 extern atomic_t netfs_n_rh_zero;
 extern atomic_t netfs_n_rh_short_read;
 extern atomic_t netfs_n_rh_write;
+extern atomic_t netfs_n_rh_write_append;
 extern atomic_t netfs_n_rh_write_begin;
 extern atomic_t netfs_n_rh_write_done;
 extern atomic_t netfs_n_rh_write_failed;
diff --git a/fs/netfs/stats.c b/fs/netfs/stats.c
index 5510a7a14a40..fce87f86f950 100644
--- a/fs/netfs/stats.c
+++ b/fs/netfs/stats.c
@@ -23,6 +23,7 @@ atomic_t netfs_n_rh_read_failed;
 atomic_t netfs_n_rh_zero;
 atomic_t netfs_n_rh_short_read;
 atomic_t netfs_n_rh_write;
+atomic_t netfs_n_rh_write_append;
 atomic_t netfs_n_rh_write_begin;
 atomic_t netfs_n_rh_write_done;
 atomic_t netfs_n_rh_write_failed;
@@ -37,10 +38,11 @@ void netfs_stats_show(struct seq_file *m)
 		   atomic_read(&netfs_n_rh_write_zskip),
 		   atomic_read(&netfs_n_rh_rreq),
 		   atomic_read(&netfs_n_rh_sreq));
-	seq_printf(m, "RdHelp : ZR=%u sh=%u sk=%u\n",
+	seq_printf(m, "RdHelp : ZR=%u sh=%u sk=%u wa=%u\n",
 		   atomic_read(&netfs_n_rh_zero),
 		   atomic_read(&netfs_n_rh_short_read),
-		   atomic_read(&netfs_n_rh_write_zskip));
+		   atomic_read(&netfs_n_rh_write_zskip),
+		   atomic_read(&netfs_n_rh_write_append));
 	seq_printf(m, "RdHelp : DL=%u ds=%u df=%u di=%u\n",
 		   atomic_read(&netfs_n_rh_download),
 		   atomic_read(&netfs_n_rh_download_done),
On Monday, 25 April 2022 16:10:16 CEST David Howells wrote:
> There may be a quick and dirty workaround. I think the problem is that
> unless the O_APPEND write starts at the beginning of a page, netfs is going
> to enforce a read. Does the attached patch fix the problem? (note that
> it's untested)
Patch doesn't apply for me on master:
checking file fs/9p/vfs_addr.c
Hunk #1 FAILED at 291.
1 out of 1 hunk FAILED
checking file fs/netfs/buffered_read.c
Hunk #1 FAILED at 364.
1 out of 1 hunk FAILED
checking file fs/netfs/internal.h
checking file fs/netfs/stats.c
Hunk #2 FAILED at 38.
1 out of 2 hunks FAILED
commit d615b5416f8a1afeb82d13b238f8152c572d59c0 (HEAD -> master, origin/master, origin/HEAD)
Merge: 0fc74d820a01 4d8ec9120819
Author: Linus Torvalds <[email protected]>
Date: Mon Apr 25 10:53:56 2022 -0700
What was it based on?
> Also, can you get the contents of /proc/fs/fscache/stats from after
> reproducing the problem?
FS-Cache statistics
Cookies: n=684 v=1 vcol=0 voom=0
Acquire: n=689 ok=689 oom=0
LRU : n=0 exp=0 rmv=0 drp=0 at=0
Invals : n=0
Updates: n=2095 rsz=0 rsn=0
Relinqs: n=5 rtr=0 drop=5
NoSpace: nwr=0 ncr=0 cull=0
IO : rd=0 wr=0
RdHelp : RA=974 RP=0 WB=13323 WBZ=2072 rr=0 sr=0
RdHelp : ZR=13854 sh=0 sk=2072
RdHelp : DL=14297 ds=14297 df=13322 di=0
RdHelp : RD=0 rs=0 rf=0
RdHelp : WR=0 ws=0 wf=0
Best regards,
Christian Schoenebeck
Sorry for the delay.
Christian Schoenebeck wrote on Tue, Apr 26, 2022 at 05:38:30PM +0200:
> On Monday, 25 April 2022 16:10:16 CEST David Howells wrote:
> > There may be a quick and dirty workaround. I think the problem is that
> > unless the O_APPEND write starts at the beginning of a page, netfs is going
> > to enforce a read. Does the attached patch fix the problem? (note that
> > it's untested)
It might work for this particular case (O_APPEND), but what about an
arbitrary pwrite or seek+write in the middle of a file?
e.g.
$ dd if=/dev/zero of=test bs=1M count=1
$ chmod 400 test
# drop cache or remount
$ dd if=/dev/urandom of=test bs=102 seek=2 count=1 conv=notrunc
dd: error writing 'test': Bad file descriptor
Silly question: how does that work on ceph or AFS? Does the readback
callback always work regardless of permission?
Basically I think we really only have two choices there:
- make the readback call work regardless of open mode, e.g. make it use
the writeback fid if the file wasn't open for reading, and make that
writeback fid usable by all.
Now that I'm looking: v9fs_writeback_fid() calls
v9fs_fid_lookup_with_uid(GLOBAL_ROOT_UID) and opens with O_RDWR, so it
should be a root fid we can read regardless of file perms!
The more I think about it, the more I think that's the way to go, and
probably how it used to work; I'll look into why this isn't working
(main fid used, or writeback fid not root)
- add some complex code to track the exact byte range that got updated
in some conditions, e.g. WRONLY or when the read fails?
That'd still be useful depending on how the backend tracks file mode:
qemu running as a user with security_model=mapped-file keeps files 600,
but with passthrough or none qemu wouldn't be able to read the file
regardless of what we do on the client...
Christian, if you still have an old kernel around, did that use to work?
> Patch doesn't apply for me on master:
It applies on fscache-next
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-next
But on that branch with the patch (works fine without) I get another
problem just writing normally:
[ 94.327094] ------------[ cut here ]------------
[ 94.327809] WARNING: CPU: 0 PID: 93 at mm/page-writeback.c:2498 __folio_mark_dirty+0x397/0x510
[ 94.329191] Modules linked in:
[ 94.329491] CPU: 0 PID: 93 Comm: cat Not tainted 5.18.0-rc1+ #56
[ 94.330195] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[ 94.331709] RIP: 0010:__folio_mark_dirty+0x397/0x510
[ 94.332312] Code: 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 14 01 00 00 44 8b 7b 5c 44 89 3c 24 4c 89 fd 49 63 d7 e9 4d fe ff ff <0f> 0b e9 c0 fc ff f0
[ 94.335341] RSP: 0018:ffffc90000257ad0 EFLAGS: 00010046
[ 94.336031] RAX: 4000000000000009 RBX: ffffea0001ffb080 RCX: ffffffff815144cc
[ 94.336937] RDX: 1ffffd40003ff610 RSI: 0000000000000008 RDI: ffffea0001ffb080
[ 94.337749] RBP: ffff8880056c4488 R08: 0000000000000000 R09: ffffea0001ffb087
[ 94.338612] R10: fffff940003ff610 R11: 0000000000000001 R12: 0000000000000246
[ 94.339551] R13: ffff8880056c4490 R14: 0000000000000001 R15: 0000000000000068
[ 94.340487] FS: 00007f18dbc1eb80(0000) GS:ffff88806ca00000(0000) knlGS:0000000000000000
[ 94.341558] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 94.342369] CR2: 00007f18dbbfd000 CR3: 000000000b5b4000 CR4: 00000000000006b0
[ 94.343613] Call Trace:
[ 94.343856] <TASK>
[ 94.344052] filemap_dirty_folio+0x73/0xc0
[ 94.344646] v9fs_write_end+0x18f/0x300
[ 94.345195] generic_perform_write+0x2bd/0x4a0
[ 94.345834] ? __bpf_trace_file_check_and_advance_wb_err+0x10/0x10
[ 94.346807] ? discard_new_inode+0x100/0x100
[ 94.347398] ? generic_write_checks+0x1e8/0x360
[ 94.347926] __generic_file_write_iter+0x247/0x3d0
[ 94.348420] generic_file_write_iter+0xbe/0x1d0
[ 94.348885] new_sync_write+0x2f0/0x540
[ 94.349250] ? new_sync_read+0x530/0x530
[ 94.349634] vfs_write+0x517/0x7b0
[ 94.349939] ksys_write+0xed/0x1c0
[ 94.350318] ? __ia32_sys_read+0xb0/0xb0
[ 94.350817] do_syscall_64+0x43/0x90
[ 94.351257] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 94.351955] RIP: 0033:0x7f18dbe0eea3
[ 94.352438] Code: 54 ff ff 48 83 c4 58 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 75
[ 94.355597] RSP: 002b:00007fffdf4661d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 94.356520] RAX: ffffffffffffffda RBX: 0000000000000068 RCX: 00007f18dbe0eea3
[ 94.357392] RDX: 0000000000000068 RSI: 00007f18dbbfd000 RDI: 0000000000000001
[ 94.358287] RBP: 00007f18dbbfd000 R08: 00007f18dbbfc010 R09: 0000000000000000
[ 94.359318] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000001
[ 94.360349] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000020000
[ 94.361295] </TASK>
[ 94.361462] ---[ end trace 0000000000000000 ]---
I got it with cat, but dd with bs >= 2 also reproduces it; the second
write fails with EBADF:
110 openat(AT_FDCWD, "bar", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
110 dup2(3, 1) = 1
110 close(3) = 0
110 execve("/run/current-system/sw/bin/cat", ["cat"], 0x12e1010 /* 10 vars */) = 0
110 read(0, "[ 94.327094] ------------[ cut"..., 131072) = 52
110 write(1, "[ 94.327094] ------------[ cut"..., 52) = 52
110 read(0, "[ 94.327809] WARNING: CPU: 0 P"..., 131072) = 98
110 write(1, "[ 94.327809] WARNING: CPU: 0 P"..., 98) = -1 EBADF (Bad file descriptor)
I'm sure that could be fixed, but as said above I don't think it's the
right approach.
> > Also, can you get the contents of /proc/fs/fscache/stats from after
> > reproducing the problem?
>
> FS-Cache statistics
(He probably wanted to confirm that the new trace he added got hit with
the workaround pattern; I didn't get that far, as I couldn't compile my
reproducer on that fs...)
--
Dominique
On Tuesday, 3 May 2022 12:21:23 CEST [email protected] wrote:
[...]
> - add some complex code to track the exact byte range that got updated
> in some conditions e.g. WRONLY or read fails?
> That'd still be useful depending on how the backend tracks file mode,
> qemu as user with security_model=mapped-file keeps files 600 but with
> passthrough or none qemu wouldn't be able to read the file regardless of
> what we do on client...
> Christian, if you still have an old kernel around did that use to work?
Sorry, what was the question, i.e. what should I test / look for precisely? :)
[...]
> > > Also, can you get the contents of /proc/fs/fscache/stats from after
> > > reproducing the problem?
> >
> > FS-Cache statistics
>
> (He probably wanted to confirm the new trace he added got hit with the
> workaround pattern, I didn't get that far as I couldn't compile my
> reproducer on that fs...)
Yeah, I got that. But since his patch did not apply, I just dumped what I got
so far in case the existing stats might be useful anyway.
Best regards,
Christian Schoenebeck
On Wednesday, 4 May 2022 23:48:47 CEST [email protected] wrote:
> Christian Schoenebeck wrote on Wed, May 04, 2022 at 08:33:36PM +0200:
> > On Tuesday, 3 May 2022 12:21:23 CEST [email protected] wrote:
> > > - add some complex code to track the exact byte range that got updated
> > > in some conditions e.g. WRONLY or read fails?
> > > That'd still be useful depending on how the backend tracks file mode,
> > > qemu as user with security_model=mapped-file keeps files 600 but with
> > > passthrough or none qemu wouldn't be able to read the file regardless of
> > > what we do on client...
> > > Christian, if you still have an old kernel around did that use to work?
> >
> > Sorry, what was the question, i.e. what should I test / look for
> > precisely? :)
> I was curious whether the older kernel does not issue a read at all, or
> issues the read on a writeback fid correctly opened as root/O_RDWR
>
> You can try either the append.c I pasted a few mails back or the dd
> commands, as regular user.
>
> $ dd if=/dev/zero of=test bs=1M count=1
> $ chmod 400 test
> # drop cache or remount
> $ dd if=/dev/urandom of=test bs=102 seek=2 count=1 conv=notrunc
> dd: error writing 'test': Bad file descriptor
Seems you were right: the old kernel opens the file with O_RDWR.
The following was taken with cache=loose on a pre-netfs kernel version, using
your append code, with the file to be appended already containing 34 bytes;
the relevant file is fid 7:
v9fs_open tag 0 id 12 fid 7 mode 2
v9fs_open_return tag 0 id 12 qid={type 0 version 1651854932 path 3108899} iounit 4096
v9fs_xattrwalk tag 0 id 30 fid 5 newfid 8 name security.capability
v9fs_rerror tag 0 id 30 err 95
v9fs_read tag 0 id 116 fid 7 off 0 max_count 4096
v9fs_read_return tag 0 id 116 count 34 err 45
v9fs_read tag 0 id 116 fid 7 off 34 max_count 4062
v9fs_read_return tag 0 id 116 count 0 err 11
v9fs_clunk tag 0 id 120 fid 6
v9fs_clunk tag 0 id 120 fid 4
[delay]
v9fs_write tag 0 id 118 fid 7 off 0 count 39 cnt 1
v9fs_write_return tag 0 id 118 total 39 err 11
v9fs_fsync tag 0 id 50 fid 7 datasync 0
BTW, to see this protocol debug output with QEMU:
cd qemu/build
../configure --enable-trace-backends=log ...
make -jN
./qemu-system-x86_64 -trace 'v9fs*' ...
Best regards,
Christian Schoenebeck
Christian Schoenebeck wrote on Wed, May 04, 2022 at 08:33:36PM +0200:
> On Tuesday, 3 May 2022 12:21:23 CEST [email protected] wrote:
> > - add some complex code to track the exact byte range that got updated
> > in some conditions e.g. WRONLY or read fails?
> > That'd still be useful depending on how the backend tracks file mode,
> > qemu as user with security_model=mapped-file keeps files 600 but with
> > passthrough or none qemu wouldn't be able to read the file regardless of
> > what we do on client...
> > Christian, if you still have an old kernel around did that use to work?
>
> Sorry, what was the question, i.e. what should I test / look for precisely? :)
I was curious whether the older kernel does not issue a read at all, or
issues the read on a writeback fid correctly opened as root/O_RDWR
You can try either the append.c I pasted a few mails back or the dd
commands, as regular user.
$ dd if=/dev/zero of=test bs=1M count=1
$ chmod 400 test
# drop cache or remount
$ dd if=/dev/urandom of=test bs=102 seek=2 count=1 conv=notrunc
dd: error writing 'test': Bad file descriptor
... But honestly I should just find the time to do it myself; this has
been dragging on for too long...
--
Dominique
On Friday, 6 May 2022 21:14:52 CEST Christian Schoenebeck wrote:
> On Wednesday, 4 May 2022 23:48:47 CEST [email protected] wrote:
> > Christian Schoenebeck wrote on Wed, May 04, 2022 at 08:33:36PM +0200:
> > > On Tuesday, 3 May 2022 12:21:23 CEST [email protected] wrote:
> > > > - add some complex code to track the exact byte range that got updated
> > > > in some conditions e.g. WRONLY or read fails?
> > > > That'd still be useful depending on how the backend tracks file mode,
> > > > qemu as user with security_model=mapped-file keeps files 600 but with
> > > > passthrough or none qemu wouldn't be able to read the file regardless of
> > > > what we do on client...
> > > > Christian, if you still have an old kernel around did that use to work?
> > >
> > > Sorry, what was the question, i.e. what should I test / look for
> > > precisely? :)
> >
> > I was curious whether the older kernel does not issue a read at all, or
> > issues the read on a writeback fid correctly opened as root/O_RDWR
> >
> > You can try either the append.c I pasted a few mails back or the dd
> > commands, as regular user.
> >
> > $ dd if=/dev/zero of=test bs=1M count=1
> > $ chmod 400 test
> > # drop cache or remount
> > $ dd if=/dev/urandom of=test bs=102 seek=2 count=1 conv=notrunc
> > dd: error writing 'test': Bad file descriptor
>
> Seems you were right, the old kernel opens the file with O_RDWR.
>
> The following was taken with cache=loose, pre-netfs kernel version, using
> your append code and file to be appended already containing 34 bytes,
> relevant file is fid 7:
>
> v9fs_open tag 0 id 12 fid 7 mode 2
> v9fs_open_return tag 0 id 12 qid={type 0 version 1651854932 path 3108899} iounit 4096
> v9fs_xattrwalk tag 0 id 30 fid 5 newfid 8 name security.capability
> v9fs_rerror tag 0 id 30 err 95
> v9fs_read tag 0 id 116 fid 7 off 0 max_count 4096
> v9fs_read_return tag 0 id 116 count 34 err 45
> v9fs_read tag 0 id 116 fid 7 off 34 max_count 4062
> v9fs_read_return tag 0 id 116 count 0 err 11
> v9fs_clunk tag 0 id 120 fid 6
> v9fs_clunk tag 0 id 120 fid 4
> [delay]
> v9fs_write tag 0 id 118 fid 7 off 0 count 39 cnt 1
> v9fs_write_return tag 0 id 118 total 39 err 11
> v9fs_fsync tag 0 id 50 fid 7 datasync 0
>
> BTW to see this protocol debug output with QEMU:
>
> cd qemu/build
> ../configure --enable-trace-backends=log ...
> make -jN
> ./qemu-system-x86_64 -trace 'v9fs*' ...
I had another time slice on this issue today. As Dominique pointed out before,
the writeback_fid was and still is opened with O_RDWR [fs/9p/fid.c]:
struct p9_fid *v9fs_writeback_fid(struct dentry *dentry)
{
	int err;
	struct p9_fid *fid, *ofid;

	ofid = v9fs_fid_lookup_with_uid(dentry, GLOBAL_ROOT_UID, 0);
	fid = clone_fid(ofid);
	if (IS_ERR(fid))
		goto error_out;
	p9_client_clunk(ofid);
	/*
	 * writeback fid will only be used to write back the
	 * dirty pages. We always request for the open fid in read-write
	 * mode so that a partial page write which result in page
	 * read can work.
	 */
	err = p9_client_open(fid, O_RDWR);
	if (err < 0) {
		p9_client_clunk(fid);
		fid = ERR_PTR(err);
		goto error_out;
	}
error_out:
	return fid;
}
The problem rather seems to be that the new netfs code does not use the
writeback_fid when doing an implied read before the actual partial writeback.
As I showed in my previous email, the old pre-netfs kernel versions also did a
read before partial writebacks, but apparently used the special writeback_fid
for that.
I added some trap code to a recent netfs kernel version:
diff --git a/net/9p/client.c b/net/9p/client.c
index 8bba0d9cf975..11ff1ee2130e 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1549,12 +1549,21 @@ int p9_client_unlinkat(struct p9_fid *dfid, const char *name, int flags)
 }
 EXPORT_SYMBOL(p9_client_unlinkat);
 
+void p9_bug(void) {
+	BUG_ON(true);
+}
+EXPORT_SYMBOL(p9_bug);
+
 int
 p9_client_read(struct p9_fid *fid, u64 offset, struct iov_iter *to, int *err)
 {
 	int total = 0;
 	*err = 0;
 
+	if ((fid->mode & O_ACCMODE) == O_WRONLY) {
+		p9_bug();
+	}
+
 	while (iov_iter_count(to)) {
 		int count;
 
@@ -1648,6 +1657,10 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
 	p9_debug(P9_DEBUG_9P, ">>> TWRITE fid %d offset %llu count %zd\n",
 		 fid->fid, offset, iov_iter_count(from));
 
+	if ((fid->mode & O_ACCMODE) == O_RDONLY) {
+		p9_bug();
+	}
+
 	while (iov_iter_count(from)) {
 		int count = iov_iter_count(from);
 		int rsize = fid->iounit;
This triggers the trap in p9_client_read() with cache=loose. Here is the
backtrace [based on d615b5416f8a1afeb82d13b238f8152c572d59c0]:
[ 139.365314] p9_client_read (net/9p/client.c:1553 net/9p/client.c:1564) 9pnet
[ 139.148806] v9fs_issue_read (fs/9p/vfs_addr.c:45) 9p
[ 139.149268] netfs_begin_read (fs/netfs/io.c:91 fs/netfs/io.c:579 fs/netfs/io.c:625) netfs
[ 139.149725] ? xas_load (lib/xarray.c:211 lib/xarray.c:242)
[ 139.150057] ? xa_load (lib/xarray.c:1469)
[ 139.150398] netfs_write_begin (fs/netfs/buffered_read.c:407) netfs
[ 139.150883] v9fs_write_begin (fs/9p/vfs_addr.c:279 (discriminator 2)) 9p
[ 139.151293] generic_perform_write (mm/filemap.c:3789)
[ 139.151721] ? generic_update_time (fs/inode.c:1858)
[ 139.152112] ? file_update_time (fs/inode.c:2089)
[ 139.152504] __generic_file_write_iter (mm/filemap.c:3916)
[ 139.152943] generic_file_write_iter (./include/linux/fs.h:753 mm/filemap.c:3948)
[ 139.153348] new_sync_write (fs/read_write.c:505 (discriminator 1))
[ 139.153754] vfs_write (fs/read_write.c:591)
[ 139.154090] ksys_write (fs/read_write.c:644)
[ 139.154417] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 139.154776] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:115)
I still haven't had time to read the netfs code yet, but I assume netfs
falls back to a generic 9p read on the O_WRONLY-opened fid here, instead
of using the special O_RDWR-opened 'writeback_fid'.
Is there already some info available in the netfs API indicating that the
read is actually part of a writeback task, so that at the 9p driver level
we could force the read in this case to use the special writeback_fid
instead?
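For illustration, if netfs told the filesystem that a read request merely backs a write (it does not pass that down to the 9p request-init hook at this point, though later netfs versions grew an rreq->origin field with values such as NETFS_READ_FOR_WRITE), the driver could divert exactly those reads to the O_RDWR writeback fid. A sketch under that assumption:
---sketch, assuming a hypothetical request-origin field---
static int v9fs_init_request(struct netfs_io_request *rreq, struct file *file)
{
	struct p9_fid *fid = file->private_data;

	/* Hypothetical: only reads that back a partial write on a
	 * write-only open go through the root O_RDWR writeback fid.
	 */
	if (rreq->origin == NETFS_READ_FOR_WRITE &&
	    (fid->mode & O_ACCMODE) == O_WRONLY)
		fid = V9FS_I(file_inode(file))->writeback_fid;

	p9_fid_get(fid);
	rreq->netfs_priv = fid;
	return 0;
}
---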
Best regards,
Christian Schoenebeck
Sorry, I had planned on working on this today, but the other patchset
ended up taking all my time... I think I'm bad at priorities; this
is definitely important...
David, I think with the latest comments we made it should be relatively
straightforward to make netfs use the writeback fid? Could you find some
time to have a look? It should be trivial to reproduce; I gave these
commands a few mails ago (needs to be run as a regular user, on a fscache mount):
---
$ dd if=/dev/zero of=test bs=1M count=1
$ chmod 200 test
# drop cache or remount
$ dd if=/dev/urandom of=test bs=102 seek=2 count=1 conv=notrunc
dd: error writing 'test': Bad file descriptor
---
Otherwise I'll try to make some more 9p time again, but it's getting
more and more difficult for me...
Christian Schoenebeck wrote on Fri, Jun 03, 2022 at 06:46:04PM +0200:
> I had another time slice on this issue today. As Dominique pointed out before,
> the writeback_fid was and still is opened with O_RDWR [fs/9p/fid.c]:
>
> struct p9_fid *v9fs_writeback_fid(struct dentry *dentry)
> {
> 	int err;
> 	struct p9_fid *fid, *ofid;
> 
> 	ofid = v9fs_fid_lookup_with_uid(dentry, GLOBAL_ROOT_UID, 0);
> 	fid = clone_fid(ofid);
> 	if (IS_ERR(fid))
> 		goto error_out;
> 	p9_client_clunk(ofid);
> 	/*
> 	 * writeback fid will only be used to write back the
> 	 * dirty pages. We always request for the open fid in read-write
> 	 * mode so that a partial page write which result in page
> 	 * read can work.
> 	 */
> 	err = p9_client_open(fid, O_RDWR);
> 	if (err < 0) {
> 		p9_client_clunk(fid);
> 		fid = ERR_PTR(err);
> 		goto error_out;
> 	}
> error_out:
> 	return fid;
> }
>
> The problem rather seems to be that the new netfs code does not use the
> writeback_fid when doing an implied read before the actual partial writeback.
>
> As I showed in my previous email, the old pre-netfs kernel versions also did a
> read before partial writebacks, but apparently used the special writeback_fid
> for that.
This looks good! Thanks for keeping it up.
>
> I added some trap code to recent netfs kernel version:
>
> diff --git a/net/9p/client.c b/net/9p/client.c
> index 8bba0d9cf975..11ff1ee2130e 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -1549,12 +1549,21 @@ int p9_client_unlinkat(struct p9_fid *dfid, const char *name, int flags)
>  }
>  EXPORT_SYMBOL(p9_client_unlinkat);
>  
> +void p9_bug(void) {
> +	BUG_ON(true);
> +}
> +EXPORT_SYMBOL(p9_bug);
> +
>  int
>  p9_client_read(struct p9_fid *fid, u64 offset, struct iov_iter *to, int *err)
>  {
>  	int total = 0;
>  	*err = 0;
>  
> +	if ((fid->mode & O_ACCMODE) == O_WRONLY) {
> +		p9_bug();
> +	}
> +
>  	while (iov_iter_count(to)) {
>  		int count;
>  
> @@ -1648,6 +1657,10 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
>  	p9_debug(P9_DEBUG_9P, ">>> TWRITE fid %d offset %llu count %zd\n",
>  		 fid->fid, offset, iov_iter_count(from));
>  
> +	if ((fid->mode & O_ACCMODE) == O_RDONLY) {
> +		p9_bug();
> +	}
> +
>  	while (iov_iter_count(from)) {
>  		int count = iov_iter_count(from);
>  		int rsize = fid->iounit;
>
> Which triggers the trap in p9_client_read() with cache=loose. Here is the
> backtrace [based on d615b5416f8a1afeb82d13b238f8152c572d59c0]:
>
> [ 139.365314] p9_client_read (net/9p/client.c:1553 net/9p/client.c:1564) 9pnet
> [ 139.148806] v9fs_issue_read (fs/9p/vfs_addr.c:45) 9p
> [ 139.149268] netfs_begin_read (fs/netfs/io.c:91 fs/netfs/io.c:579 fs/netfs/io.c:625) netfs
> [ 139.149725] ? xas_load (lib/xarray.c:211 lib/xarray.c:242)
> [ 139.150057] ? xa_load (lib/xarray.c:1469)
> [ 139.150398] netfs_write_begin (fs/netfs/buffered_read.c:407) netfs
> [ 139.150883] v9fs_write_begin (fs/9p/vfs_addr.c:279 (discriminator 2)) 9p
> [ 139.151293] generic_perform_write (mm/filemap.c:3789)
> [ 139.151721] ? generic_update_time (fs/inode.c:1858)
> [ 139.152112] ? file_update_time (fs/inode.c:2089)
> [ 139.152504] __generic_file_write_iter (mm/filemap.c:3916)
> [ 139.152943] generic_file_write_iter (./include/linux/fs.h:753 mm/filemap.c:3948)
> [ 139.153348] new_sync_write (fs/read_write.c:505 (discriminator 1))
> [ 139.153754] vfs_write (fs/read_write.c:591)
> [ 139.154090] ksys_write (fs/read_write.c:644)
> [ 139.154417] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> [ 139.154776] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:115)
>
> I still had not time to read the netfs code part yet, but I assume netfs falls
> back to a generic 9p read on the O_WRONLY opened fid here, instead of using
> the special O_RDWR opened 'writeback_fid'.
>
> Is there already some info available in the netfs API that the read is
> actually part of a writeback task, so that we could force on 9p driver level
> to use the special writeback_fid for the read in this case instead?
--
Dominique
Cached operations sometimes need to do invalid operations (e.g. read
on a write-only file).
Historic fscache had added a "writeback fid" for this, but the
conversion to the new fscache somehow lost usage of it: use the
writeback fid instead of the normal one.
Note that the way this works (the writeback fid being linked to the
inode) means we might use an overprivileged fid for some operations,
e.g. write as root when we shouldn't.
Ideally we should keep both fids handy and only use the writeback fid
when really required, e.g. for reads on a write-only file to fill in
the page cache (read-modify-write); but this is the situation we've
always had, and this commit only fixes an issue we've had for too long.
Fixes: eb497943fa21 ("9p: Convert to using the netfs helper lib to do reads and caching")
Cc: [email protected]
Cc: David Howells <[email protected]>
Reported-By: Christian Schoenebeck <[email protected]>
Signed-off-by: Dominique Martinet <[email protected]>
---
OK, so I finally had time to look at this, and it's not a lot, so this is
the most straightforward way to do it: just reverting to how the old
fscache worked.
This appears to work from quick testing; Christian, could you test it?
I think the warnings you added in p9_client_read/write that check
fid->mode might make a lot of sense; if you care to resend them as
WARN_ON((fid->mode & O_ACCMODE) == O_xyz);
instead, I'll queue that for 5.20
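For instance (a sketch of the suggestion only, not the queued patch), the trap in p9_client_read() would become a non-fatal check:
---sketch---
	/* Warn once instead of crashing when a write-only fid is used
	 * for a read; a matching O_RDONLY check would go in
	 * p9_client_write().
	 */
	WARN_ONCE((fid->mode & O_ACCMODE) == O_WRONLY,
		  "9p: read on write-only fid %d\n", fid->fid);
---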
@Stable people, I've checked that it applies to 5.17 and 5.18, so it should be
good to grab once I submit it for inclusion (that commit was included in
5.16, which is no longer stable)
fs/9p/vfs_addr.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 7382c5227e94..262968d02f55 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -58,7 +58,11 @@ static void v9fs_issue_read(struct netfs_io_subrequest *subreq)
  */
 static int v9fs_init_request(struct netfs_io_request *rreq, struct file *file)
 {
-	struct p9_fid *fid = file->private_data;
+	struct inode *inode = file_inode(file);
+	struct v9fs_inode *v9inode = V9FS_I(inode);
+	struct p9_fid *fid = v9inode->writeback_fid;
+
+	BUG_ON(!fid);
 
 	p9_fid_get(fid);
 	rreq->netfs_priv = fid;
--
2.35.1
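For reference, the "keep both fids handy" refinement sketched in the commit message could look roughly like this hypothetical helper (not part of the patch above), preferring the caller's own fid and only falling back to the overprivileged writeback fid when the open is write-only:
---sketch of the possible follow-up---
static struct p9_fid *v9fs_fid_for_read(struct file *file)
{
	struct p9_fid *fid = file->private_data;

	/* The fid the file was opened with can read: keep the caller's
	 * credentials and avoid acting as root.
	 */
	if ((fid->mode & O_ACCMODE) != O_WRONLY)
		return fid;

	/* Write-only open: fall back to the inode's writeback fid,
	 * which v9fs_writeback_fid() opened O_RDWR as root.
	 */
	return V9FS_I(file_inode(file))->writeback_fid;
}
---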