Now that net-next went in... Here's the next big chunk - killing
->aio_read() and ->aio_write(). There'll be one more pile today (direct_IO
changes and generic_write_checks() cleanups/fixes), but I'd prefer to keep
that one separate. Please, pull from
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus-2
Shortlog:
Al Viro (37):
cuse: switch to iov_iter
fuse: switch fuse_direct_io_file_operations to ->{read,write}_iter()
expand __fuse_direct_write() in both callers
lustre: kill unused members of struct vvp_thread_info
ioctx_alloc(): remove pointless check
aio_run_iocb(): kill dead check
acct: check FMODE_CAN_WRITE
nommu: use __vfs_read()
net/9p: switch the guts of p9_client_{read,write}() to iov_iter
9p: switch p9_client_write() to passing it struct iov_iter *
9p: switch ->writepage() to direct use of p9_client_write()
9p: fold v9fs_file_write_internal() into the caller
9p: get rid of v9fs_direct_file_write()
9p: switch p9_client_read() to passing struct iov_iter *
9p: get rid of v9fs_direct_file_read()
9p: switch to ->read_iter/->write_iter
9p: we are leaking glock.client_id in v9fs_file_getlock()
p9_client_attach(): set fid->uid correctly
net/9p: remove (now-)unused helpers
ncpfs: switch to ->read_iter/->write_iter
coda: switch to ->read_iter/->write_iter
switch hugetlbfs to ->read_iter()
new helper: __vfs_write()
autofs: switch to __vfs_write()
export __vfs_read()
ashmem: use __vfs_read()
serial2002: switch to __vfs_read/__vfs_write
switch /dev/loop to vfs_iter_write()
coredump: accept any write method
make new_sync_{read,write}() static
switch drivers/char/mem.c to ->read_iter/->write_iter
fuse: switch to ->read_iter/->write_iter
fuse: use iov_iter_get_pages() for non-splice path
kill do_sync_read/do_sync_write
infinibad: weird APIs switched to ->write_iter()
pcm: another weird API abuse
->aio_read and ->aio_write removed
Diffstat:
Documentation/filesystems/Locking | 2 -
Documentation/filesystems/porting | 12 +
Documentation/filesystems/vfs.txt | 6 -
arch/s390/hypfs/inode.c | 2 -
drivers/block/loop.c | 12 +-
drivers/char/mem.c | 20 +-
drivers/char/raw.c | 2 -
drivers/infiniband/hw/ipath/ipath_file_ops.c | 18 +-
drivers/infiniband/hw/qib/qib_file_ops.c | 20 +-
drivers/net/macvtap.c | 2 -
drivers/net/tun.c | 2 -
drivers/staging/android/ashmem.c | 2 +-
drivers/staging/comedi/drivers/serial2002.c | 18 +-
drivers/staging/lustre/lustre/llite/file.c | 6 -
.../staging/lustre/lustre/llite/llite_internal.h | 2 -
drivers/usb/gadget/function/f_fs.c | 2 -
drivers/usb/gadget/legacy/inode.c | 2 -
fs/9p/v9fs_vfs.h | 4 -
fs/9p/vfs_addr.c | 80 +++---
fs/9p/vfs_dir.c | 15 +-
fs/9p/vfs_file.c | 314 ++++-----------------
fs/9p/xattr.c | 80 ++----
fs/adfs/file.c | 2 -
fs/affs/file.c | 2 -
fs/afs/file.c | 2 -
fs/aio.c | 23 +-
fs/autofs4/autofs_i.h | 2 +-
fs/autofs4/waitq.c | 2 +-
fs/bfs/file.c | 2 -
fs/block_dev.c | 2 -
fs/btrfs/file.c | 2 -
fs/ceph/file.c | 2 -
fs/cifs/cifsfs.c | 12 -
fs/coda/file.c | 38 +--
fs/coredump.c | 2 +-
fs/ecryptfs/file.c | 2 -
fs/exofs/file.c | 2 -
fs/ext2/file.c | 4 -
fs/ext3/file.c | 2 -
fs/ext4/file.c | 4 -
fs/f2fs/file.c | 2 -
fs/fat/file.c | 2 -
fs/file_table.c | 4 +-
fs/fuse/cuse.c | 25 +-
fs/fuse/dev.c | 63 ++---
fs/fuse/file.c | 69 ++---
fs/gfs2/file.c | 4 -
fs/hfs/inode.c | 2 -
fs/hfsplus/inode.c | 2 -
fs/hostfs/hostfs_kern.c | 2 -
fs/hpfs/file.c | 2 -
fs/hugetlbfs/inode.c | 91 +++---
fs/jffs2/file.c | 2 -
fs/jfs/file.c | 2 -
fs/logfs/file.c | 2 -
fs/minix/file.c | 2 -
fs/ncpfs/file.c | 90 ++----
fs/ncpfs/ncplib_kernel.c | 6 +-
fs/ncpfs/ncplib_kernel.h | 2 +-
fs/nfs/file.c | 2 -
fs/nfs/nfs4file.c | 2 -
fs/nilfs2/file.c | 2 -
fs/ntfs/file.c | 2 -
fs/ocfs2/file.c | 4 -
fs/omfs/file.c | 2 -
fs/open.c | 4 +-
fs/pipe.c | 2 -
fs/ramfs/file-mmu.c | 2 -
fs/ramfs/file-nommu.c | 2 -
fs/read_write.c | 109 ++-----
fs/reiserfs/file.c | 2 -
fs/romfs/mmap-nommu.c | 1 -
fs/sysv/file.c | 2 -
fs/ubifs/file.c | 2 -
fs/udf/file.c | 2 -
fs/ufs/file.c | 2 -
fs/xfs/xfs_file.c | 2 -
include/linux/fs.h | 7 +-
include/net/9p/client.h | 8 +-
include/net/9p/transport.h | 2 +-
kernel/acct.c | 2 +-
mm/nommu.c | 4 +-
mm/shmem.c | 2 -
net/9p/client.c | 262 ++++++++---------
net/9p/protocol.c | 24 +-
net/9p/trans_common.c | 42 +--
net/9p/trans_common.h | 2 -
net/9p/trans_virtio.c | 137 +++++----
net/socket.c | 2 -
sound/core/pcm_native.c | 39 +--
90 files changed, 598 insertions(+), 1185 deletions(-)
On 04/15/2015 09:14 PM, Al Viro wrote:
> 9p: switch p9_client_write() to passing it struct iov_iter *
Hi Al,
This change caused the following:
[ 91.637917] ==================================================================
[ 91.639252] BUG: KASan: out of bounds on stack in iov_iter_advance+0x3e4/0x4b0 at addr ffff8800ba1efd20
[ 91.640979] Read of size 8 by task trinity-c15/4746
[ 91.641493] page:ffffea0002e87bc0 count:0 mapcount:0 mapping: (null) index:0x0
[ 91.641787] flags: 0x100000000000000()
[ 91.641787] page dumped because: kasan: bad access detected
[ 91.641787] CPU: 1 PID: 4746 Comm: trinity-c15 Not tainted 4.0.0+ #319
[ 91.641787] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
[ 91.641787] ffff8800ba1efd08 ffff8800ba1ef918 ffffffff81ab94ed 1ffffd40005d0f7f
[ 91.641787] ffff8800ba1ef9b0 ffff8800ba1ef998 ffffffff812bc0f4 ffffffff814901f0
[ 91.641787] ffff8800ba1efa40 0000000000000296 ffff8801f4993490 ffffffff81228fe0
[ 91.641787] Call Trace:
[ 91.641787] dump_stack (lib/dump_stack.c:52)
[ 91.641787] kasan_report_error (mm/kasan/report.c:132 mm/kasan/report.c:193)
[ 91.641787] ? idr_mark_full (lib/idr.c:551)
[ 91.641787] ? clear_exceptional_entry (mm/truncate.c:561)
[ 91.641787] __asan_report_load8_noabort (mm/kasan/report.c:251)
[ 91.641787] ? iov_iter_advance (lib/iov_iter.c:511)
[ 91.641787] iov_iter_advance (lib/iov_iter.c:511)
[ 91.641787] p9_client_write (net/9p/client.c:1656)
[ 91.641787] ? p9_client_readdir (net/9p/client.c:1614)
[ 91.641787] ? kasan_kmalloc (mm/kasan/kasan.c:355)
[ 91.641787] ? __kmalloc (mm/slub.c:3325)
[ 91.641787] ? p9_client_readdir (net/9p/client.c:1614)
[ 91.641787] ? v9fs_file_lock_dotl (fs/9p/vfs_file.c:405)
[ 91.641787] v9fs_file_write_iter (fs/9p/vfs_file.c:421)
[ 91.641787] ? __sb_end_write (fs/super.c:1192)
[ 91.641787] ? v9fs_file_lock_dotl (fs/9p/vfs_file.c:405)
[ 91.641787] ? do_readv_writev (fs/read_write.c:776)
[ 91.641787] ? recalc_sigpending (kernel/signal.c:160)
[ 91.641787] ? __set_task_blocked (kernel/signal.c:2514)
[ 91.641787] __vfs_write (fs/read_write.c:479 fs/read_write.c:490)
[ 91.641787] ? recalc_sigpending (kernel/signal.c:160)
[ 91.641787] ? __vfs_read (fs/read_write.c:486)
[ 91.641787] ? __sb_end_write (fs/super.c:1192)
[ 91.641787] ? signal_setup_done (kernel/signal.c:2556)
[ 91.641787] ? hrtimer_start (kernel/time/hrtimer.c:1043)
[ 91.641787] ? do_setitimer (kernel/time/itimer.c:222)
[ 91.641787] vfs_write (include/linux/fs.h:1984 include/linux/fs.h:2416 fs/read_write.c:543)
[ 91.641787] SyS_write (fs/read_write.c:585 fs/read_write.c:576)
[ 91.641787] ? SyS_read (fs/read_write.c:576)
[ 91.641787] ? init_fpu (arch/x86/kernel/i387.c:231 arch/x86/kernel/i387.c:266)
[ 91.641787] ? math_state_restore (arch/x86/kernel/traps.c:869)
[ 91.641787] system_call_fastpath (arch/x86/kernel/entry_64.S:261)
[ 91.641787] Memory state around the buggy address:
[ 91.641787] ffff8800ba1efc00: f2 00 f4 f4 f4 f2 f2 f2 f2 00 f4 f4 f4 f3 f3 f3
[ 91.641787] ffff8800ba1efc80: f3 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1
[ 91.641787] >ffff8800ba1efd00: f1 00 00 f4 f4 f2 f2 f2 f2 00 00 00 00 00 f4 f4
[ 91.641787] ^
[ 91.641787] ffff8800ba1efd80: f4 f2 f2 f2 f2 00 00 00 00 00 f4 f4 f4 f3 f3 f3
[ 91.641787] ffff8800ba1efe00: f3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 91.641787] ==================================================================
This could happen when p9pdu_readf() changes 'count' to some value > iov_iter_count(from):
p9_client_write():
<...>
int count = iov_iter_count(from);
<...>
*err = p9pdu_readf(req->rc, clnt->proto_version, "d", &count);
<...>
iov_iter_advance(from, count);
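As an illustration of why that over-advance turns into an out-of-bounds read, here is a minimal userspace sketch; it is not the kernel's lib/iov_iter.c, just an approximation of walking an iovec array by 'count' bytes:

#include <stdio.h>
#include <sys/uio.h>

/*
 * Walk an iovec array by 'count' bytes, roughly the way iov_iter_advance()
 * does.  The real kernel helper trusts that 'count' never exceeds what the
 * iovecs hold; here the walk is bounded by 'nr' only so we can report what
 * would have happened past the end of the array.
 */
static void advance(const struct iovec *iov, int nr, size_t count)
{
	int i = 0;

	while (count && i < nr) {
		size_t step = iov[i].iov_len < count ? iov[i].iov_len : count;

		count -= step;
		i++;
	}
	if (count)
		printf("%zu bytes still to skip: next step would read iov[%d], past the array\n",
		       count, i);
}

int main(void)
{
	char buf[27];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

	advance(&iov, 1, 4096);	/* write() passed 27 bytes, server claimed 4096 */
	return 0;
}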
On 04/23/2015 01:16 PM, Andrey Ryabinin wrote:
> On 04/15/2015 09:14 PM, Al Viro wrote:
>> 9p: switch p9_client_write() to passing it struct iov_iter *
>
> Hi Al,
>
> This change caused following:
>
> [KASan report snipped; identical to the one quoted in full above]
>
>
>
> This could happen when p9pdu_readf() changes 'count' to some value > iov_iter_count(from):
>
> p9_client_write():
> <...>
> int count = iov_iter_count(from);
> <...>
> *err = p9pdu_readf(req->rc, clnt->proto_version, "d", &count);
> <...>
> iov_iter_advance(from, count);
>
>
>
Ping.
On Thu, Apr 23, 2015 at 01:16:15PM +0300, Andrey Ryabinin wrote:
> This change caused following:
> This could happen when p9pdu_readf() changes 'count' to some value > iov_iter_count(from):
>
> p9_client_write():
> <...>
> int count = iov_iter_count(from);
> <...>
> *err = p9pdu_readf(req->rc, clnt->proto_version, "d", &count);
> <...>
> iov_iter_advance(from, count);
*blink*
That's a bug, all right, but I would love to see how you trigger it.
It would require server to respond to "write that many bytes" with "OK,
<greater number> bytes written". We certainly need to cope with that
(we can't trust the server to be sane), but if that's what is going on,
you've got a server bug as well.
Could you check if the patch below triggers WARN_ON() in it on your
reproducer? p9_client_read() has a similar issue as well...
diff --git a/net/9p/client.c b/net/9p/client.c
index 6f4c4c8..f99bce7 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1588,6 +1588,10 @@ p9_client_read(struct p9_fid *fid, u64 offset, struct iov_iter *to, int *err)
p9_free_req(clnt, req);
break;
}
+ if (count > rsize) {
+ WARN_ON(1);
+ count = rsize;
+ }
if (non_zc) {
int n = copy_to_iter(dataptr, count, to);
@@ -1650,6 +1654,10 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
}
p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count);
+ if (count > rsize) {
+ WARN_ON(1);
+ count = rsize;
+ }
p9_free_req(clnt, req);
iov_iter_advance(from, count);
On Sun, Jun 21, 2015 at 2:12 PM, Al Viro <[email protected]> wrote:
> + if (count > rsize) {
> + WARN_ON(1);
> + count = rsize;
> + }
So if we'd actually want to merge it with the warning, I'd prefer writing it as
if (WARN_ON_ONCE(count > rsize))
count = rsize;
because it's smaller and cannot spam your logs. WARN_ON_ONCE() will
only _warn_ once, but it always returns the conditional for the
warning, so the above does the right thing.
Linus
On Sun, Jun 21, 2015 at 02:16:15PM -0700, Linus Torvalds wrote:
> On Sun, Jun 21, 2015 at 2:12 PM, Al Viro <[email protected]> wrote:
> > + if (count > rsize) {
> > + WARN_ON(1);
> > + count = rsize;
> > + }
>
> So if we'd actually want to merge it with the warning, I'd prefer writing it as
>
> if (WARN_ON_ONCE(count > rsize))
> count = rsize;
>
> because it's smaller and cannot spam your logs. WARN_ON_ONCE() will
> only _warn_ once, but it always returns the conditional for the
> warning, so the above does the right thing.
Sure, but I would really like to verify that this _is_ what's going on
there. This is just a "please try that on your reproducer to make sure
that it's not something entirely different", thus the lack of S-o-b,
etc. For the final variant (and we definitely should cope with BS from
server) we probably want to use p9_debug() instead of WARN_ON/WARN_ON_ONCE.
On 06/22/2015 12:12 AM, Al Viro wrote:
> On Thu, Apr 23, 2015 at 01:16:15PM +0300, Andrey Ryabinin wrote:
>> This change caused following:
>
>> This could happen when p9pdu_readf() changes 'count' to some value > iov_iter_count(from):
>>
>> p9_client_write():
>> <...>
>> int count = iov_iter_count(from);
>> <...>
>> *err = p9pdu_readf(req->rc, clnt->proto_version, "d", &count);
>> <...>
>> iov_iter_advance(from, count);
>
> *blink*
>
> That's a bug, all right, but I would love to see how you trigger it.
> It would require server to respond to "write that many bytes" with "OK,
> <greater number> bytes written". We certainly need to cope with that
> (we can't trust the server to be sane), but if that's what is going on,
> you've got a server bug as well.
>
> Could you check if the patch below triggers WARN_ON() in it on your
> reproducer? p9_client_read() has a similar issue as well...
>
I've tried something like your patch before to check the read side
and I haven't seen anything before and don't see it right now.
Though, this doesn't mean that there is no problem with read.
I mean that trinity hits this on write and may just not hit this on read.
On Mon, Jun 22, 2015 at 03:02:11PM +0300, Andrey Ryabinin wrote:
> On 06/22/2015 12:12 AM, Al Viro wrote:
> > On Thu, Apr 23, 2015 at 01:16:15PM +0300, Andrey Ryabinin wrote:
> >> This change caused following:
> >
> >> This could happen when p9pdu_readf() changes 'count' to some value > iov_iter_count(from):
> >>
> >> p9_client_write():
> >> <...>
> >> int count = iov_iter_count(from);
> >> <...>
> >> *err = p9pdu_readf(req->rc, clnt->proto_version, "d", &count);
> >> <...>
> >> iov_iter_advance(from, count);
> >
> > *blink*
> >
> > That's a bug, all right, but I would love to see how you trigger it.
> > It would require server to respond to "write that many bytes" with "OK,
> > <greater number> bytes written". We certainly need to cope with that
> > (we can't trust the server to be sane), but if that's what is going on,
> > you've got a server bug as well.
> >
> > Could you check if the patch below triggers WARN_ON() in it on your
> > reproducer? p9_client_read() has a similar issue as well...
> >
>
> I've tried something like your patch before to check the read side
> and I haven't seen anything before and don't see it right now.
> Though, this doesn't mean that there is no problem with read.
> I mean that trinity hits this on write and may just not hit this on read.
"This" being the WARN_ON() in that patch? Could you please run the same
test with the following delta and post its printks? It's one thing if
you are hitting a buggy server, it gets confused and tells you it has
written more bytes than you told it to write. Quite a different story
in case if we are miscalculating the size we are putting into RWRITE
packet and/or advancing the iterator when we shouldn't...
What server are you using, BTW? And which transport (virtio or network -
IOW, is it zero-copy path or not)?
diff --git a/net/9p/client.c b/net/9p/client.c
index 6f4c4c8..80e45a5 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1638,6 +1638,9 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
req = p9_client_rpc(clnt, P9_TWRITE, "dqV", fid->fid,
offset, rsize, from);
}
+ if (iov_iter_count(from) != count)
+ printk(KERN_ERR "fucked: iterator got advanced [%d -> %zd]\n",
+ count, iov_iter_count(from));
if (IS_ERR(req)) {
*err = PTR_ERR(req);
break;
@@ -1649,6 +1652,10 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
p9_free_req(clnt, req);
}
+ if (count > rsize)
+ printk(KERN_ERR "fucked: sent %d, server says it got %d (err = %d)\n",
+ rsize, count, *err);
+
p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count);
p9_free_req(clnt, req);
On 07/01/2015 09:27 AM, Al Viro wrote:
> On Mon, Jun 22, 2015 at 03:02:11PM +0300, Andrey Ryabinin wrote:
>> On 06/22/2015 12:12 AM, Al Viro wrote:
>>> On Thu, Apr 23, 2015 at 01:16:15PM +0300, Andrey Ryabinin wrote:
>>>> This change caused following:
>>>
>>>> This could happen when p9pdu_readf() changes 'count' to some value > iov_iter_count(from):
>>>>
>>>> p9_client_write():
>>>> <...>
>>>> int count = iov_iter_count(from);
>>>> <...>
>>>> *err = p9pdu_readf(req->rc, clnt->proto_version, "d", &count);
>>>> <...>
>>>> iov_iter_advance(from, count);
>>>
>>> *blink*
>>>
>>> That's a bug, all right, but I would love to see how you trigger it.
>>> It would require server to respond to "write that many bytes" with "OK,
>>> <greater number> bytes written". We certainly need to cope with that
>>> (we can't trust the server to be sane), but if that's what is going on,
>>> you've got a server bug as well.
>>>
>>> Could you check if the patch below triggers WARN_ON() in it on your
>>> reproducer? p9_client_read() has a similar issue as well...
>>>
>>
>> I've tried something like your patch before to check the read side
>> and I haven't seen anything before and don't see it right now.
>> Though, this doesn't mean that there is no problem with read.
>> I mean that trinity hits this on write and may just not hit this on read.
>
> "This" being the WARN_ON() in that patch?
Yes.
> Could you please run the same
> test with the following delta and post its printks?
# dmesg | grep fucked
[ 114.732166] fucked: sent 2037, server says it got 2047 (err = 0)
[ 124.937105] fucked: sent 27, server says it got 4096 (err = 0)
[ 154.075400] fucked: sent 19, server says it got 4096 (err = 0)
> It's one thing if
> you are hitting a buggy server, it gets confused and tells you it has
> written more bytes than you told it to write. Quite a different story
> in case if we are miscalculating the size we are putting into RWRITE
> packet and/or advancing the iterator when we shouldn't...
>
> What server are you using, BTW? And which transport (virtio or network -
> IOW, is it zero-copy path or not)?
qemu v2.2.1, virtio transport.
On Wed, Jul 01, 2015 at 10:50:59AM +0300, Andrey Ryabinin wrote:
> # dmesg | grep fucked
>
> [ 114.732166] fucked: sent 2037, server says it got 2047 (err = 0)
> [ 124.937105] fucked: sent 27, server says it got 4096 (err = 0)
> [ 154.075400] fucked: sent 19, server says it got 4096 (err = 0)
>
> > It's one thing if
> > you are hitting a buggy server, it gets confused and tells you it has
> > written more bytes than you told it to write. Quite a different story
> > in case if we are miscalculating the size we are putting into RWRITE
TWRITE,
sorry about the braino.
> > packet and/or advancing the iterator when we shouldn't...
> >
> > What server are you using, BTW? And which transport (virtio or network -
> > IOW, is it zero-copy path or not)?
>
> qemu v2.2.1, virtio transport.
Hmm... The first one should've hit the zero-copy path then, the last two
are too short for that... *grumble*
I wonder if these short writes are getting matched with response from wrong
request - that would explain full-page responses...
Could you check if 3.19 was getting anything similar? I.e. in
p9_client_write() there add
if (count > rsize)
printk(KERN_ERR "bogus RWRITE: %d -> %d\n", rsize, count);
just before
p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count);
and see if that triggers...
On 07/01/2015 11:27 AM, Al Viro wrote:
>
> Could you check if 3.19 was getting anything similar? I.e. in
> p9_client_write() there add
> if (count > rsize)
> printk(KERN_ERR "bogus RWRITE: %d -> %d\n", rsize, count);
> just before
> p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count);
> and see if that triggers...
>
Yeah, the same thing:
[ 125.962374] bogus RWRITE: 27 -> 4096
[ 207.587632] bogus RWRITE: 27 -> 4096
[ 215.055627] bogus RWRITE: 27 -> 4096
[ 235.583138] bogus RWRITE: 27 -> 4096
[ 245.749174] bogus RWRITE: 27 -> 4096
[ 246.759270] bogus RWRITE: 27 -> 4096
[ 248.020787] bogus RWRITE: 27 -> 4096
On Wed, Jul 01, 2015 at 11:41:04AM +0300, Andrey Ryabinin wrote:
> On 07/01/2015 11:27 AM, Al Viro wrote:
> >
> > Could you check if 3.19 was getting anything similar? I.e. in
> > p9_client_write() there add
> > if (count > rsize)
> > printk(KERN_ERR "bogus RWRITE: %d -> %d\n", rsize, count);
> > just before
> > p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count);
> > and see if that triggers...
> >
>
> Yeah, the same thing:
> [ 125.962374] bogus RWRITE: 27 -> 4096
> [ 207.587632] bogus RWRITE: 27 -> 4096
> [ 215.055627] bogus RWRITE: 27 -> 4096
> [ 235.583138] bogus RWRITE: 27 -> 4096
> [ 245.749174] bogus RWRITE: 27 -> 4096
> [ 246.759270] bogus RWRITE: 27 -> 4096
> [ 248.020787] bogus RWRITE: 27 -> 4096
Hrm... Could you add (int)req->rc->id, (int)req->rc->tag and (int)req->tc->tag
to that printk (on either kernel, the problem's apparently not new)?
The question is whether we are mismatching replies, sending bogus requests or
if it's really the server sending bogus replies. Which qemu version are
you using, BTW?
On 07/01/2015 11:55 AM, Al Viro wrote:
> On Wed, Jul 01, 2015 at 11:41:04AM +0300, Andrey Ryabinin wrote:
>> On 07/01/2015 11:27 AM, Al Viro wrote:
>>>
>>> Could you check if 3.19 was getting anything similar? I.e. in
>>> p9_client_write() there add
>>> if (count > rsize)
>>> printk(KERN_ERR "bogus RWRITE: %d -> %d\n", rsize, count);
>>> just before
>>> p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count);
>>> and see if that triggers...
>>>
>>
>> Yeah, the same thing:
>> [ 125.962374] bogus RWRITE: 27 -> 4096
>> [ 207.587632] bogus RWRITE: 27 -> 4096
>> [ 215.055627] bogus RWRITE: 27 -> 4096
>> [ 235.583138] bogus RWRITE: 27 -> 4096
>> [ 245.749174] bogus RWRITE: 27 -> 4096
>> [ 246.759270] bogus RWRITE: 27 -> 4096
>> [ 248.020787] bogus RWRITE: 27 -> 4096
>
> Hrm... Could you add (int)req->rc->id, (int)req->rc->tag and (int)req->tc->tag
> to that printk (on either kernel, the problem's apparently not new)?
>
I've attached gdb instead.
So, after message "bogus RWRITE: 93 -> 4096"
I've got this:
(gdb) p *req->rc
$11 = {size = 11, id = 119 'w', tag = 3, offset = 11, capacity = 8192, sdata = 0xffff8802347b8020 "\v"}
(gdb) p *req->tc
$12 = {size = 116, id = 118 'v', tag = 3, offset = 0, capacity = 8192, sdata = 0xffff88023479c020 "t"}
> The question is whether we are mismatching replies, sending bogus requests or
> if it's really the server sending bogus replies. Which qemu version are
> you using, BTW?
>
As I said before, qemu's version is 2.2.1.
So, I've decided to try kvmtool. It took a bit longer to trigger, but still:
[ 466.552432] bogus RWRITE: 57 -> 8168
[ 969.317058] bogus RWRITE: 27 -> 8168
On Wed, Jul 01, 2015 at 02:25:43PM +0300, Andrey Ryabinin wrote:
> I've attached gdb instead.
> So, after message "bogus RWRITE: 93 -> 4096"
> I've got this:
>
> (gdb) p *req->rc
> $11 = {size = 11, id = 119 'w', tag = 3, offset = 11, capacity = 8192, sdata = 0xffff8802347b8020 "\v"}
> (gdb) p *req->tc
> $12 = {size = 116, id = 118 'v', tag = 3, offset = 0, capacity = 8192, sdata = 0xffff88023479c020 "t"}
*grumble*
Request: id = P9_TWRITE, tag = 3. Response: id = P9_RWRITE, tag = 3.
Size of request is reasonable: it should be 32bit request size + 8bit id
(TWRITE) + 16bit tag + 32bit fid + 64bit offset + 32bit count + payload,
i.e. 23 + payload size. 23 + 93 == 116, which is what we have there.
Success response is 32bit request size + 8bit id (RWRITE) + 16bit tag + 32bit
count, i.e. 11 bytes and that seems to be what we are getting here.
That would appear to exclude the possibility of bogus request - even if we had
somehow ended up with count == 4096 in TWRITE arguments, server wouldn't have
anywhere to get that much data from and either the things are *really* badly
fucked on server, or it should've replied with RERROR.
To exclude it completely we could check 4 bytes at req->tc->sdata + 19
(that's where count is), but I don't believe that this is where the problem
is.
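For reference, the 9P2000 wire layout that arithmetic is based on (sizes in bytes; this is the standard protocol layout, not something pulled out of the packets above):

TWRITE: size[4] id[1] tag[2] fid[4] offset[8] count[4] data[count]
        header = 4+1+2+4+8+4 = 23 bytes, so size = 23 + payload
        (23 + 93 == 116, matching req->tc->size), and count sits at
        byte offset 19 (4+1+2+4+8), hence "sdata + 19" above.
RWRITE: size[4] id[1] tag[2] count[4]
        11 bytes total, matching req->rc->size.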
The thing is, looking through qemu hw/9pfs/virtio-9p.c:v9fs_write() and the
stuff it calls, I don't see any way for that kind of crap to happen there...
Just in case - after the do-while loop in qemu v9fs_write(), just prior
to
offset = 7;
err = pdu_marshal(pdu, offset, "d", total);
if (err < 0) {
goto out;
}
could you slap something like
if (total > count)
*(char *)0 = 0;
and see if it dumps core on your test? Or use some more humane form of
debugging - it's been a while since I last ran qemu under gdb and I'd
need more coffee to bring that memory up...
Mismatched reply could also be a possibility, but only if we end up with
sending more than one request with the same tag without waiting for response
for the first one.
The reason why I don't want to let it go just with the "cap count with rsize
in a couple of places in net/9p/client.c" (which we'll definitely need - we
can't assume that server is sane, not compromised, etc.) is that it smells
like a symptom of something very fishy and I'd prefer to get to the root of
that thing...
Oh, lovely - I do see one bug in there that could've led to bogosities around
the request lifetimes, but it's more recent than 3.19, so we have something
else going on. In any case, the patch below is needed to fix a fuckup
introduced in 4.0 - getting a truncated RWRITE packet from the server (as in
"badly malformed response", not "short write") ends up with a double-free.
Almost certainly not what we are hitting here, though.
diff --git a/net/9p/client.c b/net/9p/client.c
index 6f4c4c8..ca3b342 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1647,6 +1647,7 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
if (*err) {
trace_9p_protocol_dump(clnt, req->rc);
p9_free_req(clnt, req);
+ break;
}
p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count);
On Wed, Jul 01, 2015 at 07:44:08PM +0100, Al Viro wrote:
> Mismatched reply could also be a possibility, but only if we end up with
> sending more than one request with the same tag without waiting for response
> for the first one.
... and I think I see what's going on. Tags are 16bit. Suppose the
server stalls for some reason *and* we keep piling the requests up.
New tags keep being grabbed by this:
tag = P9_NOTAG;
if (type != P9_TVERSION) {
tag = p9_idpool_get(c->tagpool);
if (tag < 0)
return ERR_PTR(-ENOMEM);
}
tag is int here. Then we pass tag to
req = p9_tag_alloc(c, tag, req_size);
and that's what sets req->tc->tag. OK, but... The argument of p9_tag_alloc()
is u16, so after 2^16 pending requests we'll wrap around. p9_idpool_get()
will happily return values greater than 65535 - it's using idr and it's
used (with different pools) for 16bit tags and 32bit FIDs.
Now, p9_tag_alloc(c, 65539, max_size) will return the same req we'd got from
p9_tag_alloc(c, 3, max_size). And we are fucked - as far as the server is
concerned, we'd just sent another request with tag 3. And on the client
there are two threads waiting for responses on the same p9_req_t. Both
happen to be TWRITE. Response to the first request arrives and we happen
to let the second thread go at it first. Voila - the first request had
been for page-sized write() and got successfully handled. The _second_ one
had been short and is very surprised to see confirmation of 4Kb worth of
data having been written.
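A minimal userspace sketch of that truncation (illustrative only; tag_alloc() below merely stands in for p9_tag_alloc() taking a u16 tag):

#include <stdio.h>
#include <stdint.h>

/* stand-in for p9_tag_alloc(c, tag, ...), whose tag argument is u16 */
static void tag_alloc(uint16_t tag)
{
	printf("using request slot %u\n", tag);
}

int main(void)
{
	int tag_a = 3;		/* an ordinary tag from the idpool */
	int tag_b = 65539;	/* what the idr-backed pool can hand back */

	tag_alloc(tag_a);	/* "using request slot 3" */
	tag_alloc(tag_b);	/* also "using request slot 3": 65539 & 0xffff == 3 */
	return 0;
}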
It should be easy to confirm - in p9_client_prepare_req() add
if (WARN_ON_ONCE(tag != (u16)tag)) {
p9_idpool_put(tag, c->tagpool);
return ERR_PTR(-ENOMEM);
}
right after
tag = p9_idpool_get(c->tagpool);
if (tag < 0)
return ERR_PTR(-ENOMEM);
and see if it triggers. I'm not sure if failing with ENOMEM is the
right response (another variant is to sleep there until the pile
gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
not for the real work, but it will do for confirming that this is what
we are hitting.
[9p and sunrpc folks added to Cc]
On Thu, Jul 02, 2015 at 04:20:42AM +0100, Al Viro wrote:
> On Wed, Jul 01, 2015 at 07:44:08PM +0100, Al Viro wrote:
> > Mismatched reply could also be a possibility, but only if we end up with
> > sending more than one request with the same tag without waiting for response
> > for the first one.
>
> ... and I think I see what's going on. Tags are 16bit. Suppose the
> server stalls for some reason *and* we keep piling the requests up.
> New tags keep being grabbed by this:
>
> tag = P9_NOTAG;
> if (type != P9_TVERSION) {
> tag = p9_idpool_get(c->tagpool);
> if (tag < 0)
> return ERR_PTR(-ENOMEM);
> }
> tag is int here. Then we pass tag to
> req = p9_tag_alloc(c, tag, req_size);
> and that's what sets req->tc->tag. OK, but... The argument of p9_tag_alloc()
> is u16, so after 2^16 pending requests we'll wrap around. p9_idpool_get()
> will happily return values greater than 65535 - it's using idr and it's
> used (with different pools) for 16bit tags and 32bit FIDs.
>
> Now, p9_tag_alloc(c, 65539, max_size) will return the same req we'd got from
> p9_tag_alloc(c, 3, max_size). And we are fucked - as far as the server is
> concerned, we'd just sent another request with tag 3. And on the client
> there are two threads waiting for responses on the same p9_req_t. Both
> happen to be TWRITE. Response to the first request arrives and we happen
> to let the second thread go at it first. Voila - the first request had
> been for page-sized write() and got successfully handled. The _second_ one
> had been short and is very surprised to see confirmation of 4Kb worth of
> data having been written.
>
> It should be easy to confirm - in p9_client_prepare_req() add
> if (WARN_ON_ONCE(tag != (u16)tag)) {
> p9_idpool_put(tag, c->tagpool);
> return ERR_PTR(-ENOMEM);
> }
> right after
> tag = p9_idpool_get(c->tagpool);
> if (tag < 0)
> return ERR_PTR(-ENOMEM);
>
> and see if it triggers. I'm not sure if failing with ENOMEM is the
> right response (another variant is to sleep there until the pile
> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
> not for the real work, but it will do for confirming that this is what
> we are hitting.
FWIW, we probably would be better off with throttling rather than ENOMEM
in such situations. I'm not familiar with sunrpc enough to be sure how
to do that right way (note that RPC equivalent of 9P tags is 32bit, so
the throttling there is based on memory shortage rather than running out
of XID space), but the interesting issues should be similar - potential
deadlocks in near-OOM situations. Suggestions?
[repeating, since my previous email didn't reach mailing lists]
2015-07-02 7:10 GMT+03:00 Al Viro <[email protected]>:
>> It should be easy to confirm - in p9_client_prepare_req() add
>> if (WARN_ON_ONCE(tag != (u16)tag)) {
>> p9_idpool_put(tag, c->tagpool);
>> return ERR_PTR(-ENOMEM);
>> }
>> right after
>> tag = p9_idpool_get(c->tagpool);
>> if (tag < 0)
>> return ERR_PTR(-ENOMEM);
>>
>> and see if it triggers. I'm not sure if failing with ENOMEM is the
>> right response (another variant is to sleep there until the pile
>> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
>> not for the real work, but it will do for confirming that this is what
>> we are hitting.
>
Apparently, I'm seeing something else. That WARN_ON_ONCE didn't trigger.
On Thu, Jul 02, 2015 at 10:19:07AM +0300, Andrey Ryabinin wrote:
> On 07/02/2015 07:10 AM, Al Viro wrote:
> >>
> >> It should be easy to confirm - in p9_client_prepare_req() add
> >> if (WARN_ON_ONCE(tag != (u16)tag)) {
> >> p9_idpool_put(tag, c->tagpool);
> >> return ERR_PTR(-ENOMEM);
> >> }
> >> right after
> >> tag = p9_idpool_get(c->tagpool);
> >> if (tag < 0)
> >> return ERR_PTR(-ENOMEM);
> >>
> >> and see if it triggers. I'm not sure if failing with ENOMEM is the
> >> right response (another variant is to sleep there until the pile
> >> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
> >> not for the real work, but it will do for confirming that this is what
> >> we are hitting.
> >
>
> Apparently, I'm seeing something else. That WARN_ON_ONCE didn't trigger.
While the one in p9_client_write() (on rsize < count) did?
On Thu, Jul 02, 2015 at 10:50:05AM +0300, Andrey Ryabinin wrote:
> >> and see if it triggers. I'm not sure if failing with ENOMEM is the
> >> right response (another variant is to sleep there until the pile
> >> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
> >> not for the real work, but it will do for confirming that this is what
> >> we are hitting.
> >
>
> Apparently, I'm seeing something else. That WARN_ON_ONCE didn't trigger.
Summary for those who'd missed the beginning of the thread: what we are
seeing is p9_client_write() issuing TWRITE and getting RWRITE in reply
(tags match, packets look plausible) with count in RWRITE way more than
that in TWRITE.
IOW, we are telling the server to write e.g. 93 bytes and are getting told
that yes, the write had been successful - all 4096 bytes of it.
qemu virtio-9p for server; from my reading of qemu side of things, it can't
be sending reply with count greater than that in request.
The bug I suspected to be the cause of that is in tag allocation in
net/9p/client.c - we could end up wrapping around 2^16 with enough pending
requests and that would have triggered that kind of mess. However, Andrey
doesn't see that test (tag wraparound in p9_client_prepare_req()) trigger.
BTW, was that on the run where debugging printk in p9_client_write() *did*
trigger?
On 07/02/2015 10:59 AM, Al Viro wrote:
> On Thu, Jul 02, 2015 at 10:50:05AM +0300, Andrey Ryabinin wrote:
>
>>>> and see if it triggers. I'm not sure if failing with ENOMEM is the
>>>> right response (another variant is to sleep there until the pile
>>>> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
>>>> not for the real work, but it will do for confirming that this is what
>>>> we are hitting.
>>>
>>
>> Apparently, I'm seeing something else. That WARN_ON_ONCE didn't trigger.
>
> Summary for those who'd missed the beginning of the thread: what we are
> seeing is p9_client_write() issuing TWRITE and getting RWRITE in reply
> (tags match, packets look plausible) with count in RWRITE way more than
> that in TWRITE.
>
> IOW, we are telling the server to write e.g. 93 bytes and are getting told
> that yes, the write had been successful - all 4096 bytes of it.
>
> qemu virtio-9p for server; from my reading of qemu side of things, it can't
> be sending reply with count greater than that in request.
Besides qemu, I've also tried kvmtool with the same result. IOW I'm seeing
this under kvmtool as well. It just takes a bit longer to reproduce
this in kvmtool.
> The bug I suspected to be the cause of that is in tag allocation in
> net/9p/client.c - we could end up wrapping around 2^16 with enough pending
> requests and that would have triggered that kind of mess. However, Andrey
> doesn't see that test (tag wraparound in p9_client_prepare_req()) trigger.
> BTW, was that on the run where debugging printk in p9_client_write() *did*
> trigger?
Yes, WARN_ON_ONCE() in p9_client_prepare_req() didn't trigger,
but debug printk in p9_client_write() *did* trigger.
On Thu, Jul 02, 2015 at 11:19:03AM +0300, Andrey Ryabinin wrote:
> Besides qemu, I've also tried kvmtool with the same result. IOW I'm seeing
> this under kvmtool as well. It just takes a bit longer to reproduce
> this in kvmtool.
>
> > The bug I suspected to be the cause of that is in tag allocation in
> > net/9p/client.c - we could end up wrapping around 2^16 with enough pending
> > requests and that would have triggered that kind of mess. However, Andrey
> > doesn't see that test (tag wraparound in p9_client_prepare_req()) trigger.
> > BTW, was that on the run where debugging printk in p9_client_write() *did*
> > trigger?
>
> Yes, WARN_ON_ONCE() in p9_client_prepare_req() didn't trigger,
> but debug printk in p9_client_write() *did* trigger.
Bloody wonderful... Could you check if v9fs_write() in qemu
hw/9pfs/virtio-9p.c ever gets to
offset = 7;
err = pdu_marshal(pdu, offset, "d", total);
with total > count on your testcase?
On Thu, Jul 02, 2015 at 09:25:30AM +0100, Al Viro wrote:
> On Thu, Jul 02, 2015 at 11:19:03AM +0300, Andrey Ryabinin wrote:
> > Besides qemu, I've also tried kvmtool with the same result. IOW I'm seeing
> > this under kvmtool as well. It just takes a bit longer to reproduce
> > this in kvmtool.
> >
> > > The bug I suspected to be the cause of that is in tag allocation in
> > > net/9p/client.c - we could end up wrapping around 2^16 with enough pending
> > > requests and that would have triggered that kind of mess. However, Andrey
> > > doesn't see that test (tag wraparound in p9_client_prepare_req()) trigger.
> > > BTW, was that on the run where debugging printk in p9_client_write() *did*
> > > trigger?
> >
> > Yes, WARN_ON_ONCE() in p9_client_prepare_req() didn't trigger,
> > but debug printk in p9_client_write() *did* trigger.
>
> Bloody wonderful... Could you check if v9fs_write() in qemu
> hw/9pfs/virtio-9p.c ever gets to
> offset = 7;
> err = pdu_marshal(pdu, offset, "d", total);
> with total > count on your testcase?
Another thing that might be worth checking: in p9_tag_alloc() (net/9p/client.c)
before
req->status = REQ_STATUS_ALLOC;
check that req->status == REQ_STATUS_IDLE and yell if it isn't.
BTW, the loop in there (
/* check again since original check was outside of lock */
while (tag >= c->max_tag) {
) looks fishy. If we get more than P9_ROW_MAXTAG allocations at once,
we'll have trouble, but I doubt that this is what we are hitting. In any
case, adding WARN_ON(c->reqs[row]); right after
row = (tag / P9_ROW_MAXTAG);
wouldn't hurt. I would be very surprised if that one triggered, though.
On Thu, 2 Jul 2015 04:20:42 +0100
Al Viro <[email protected]> wrote:
> On Wed, Jul 01, 2015 at 07:44:08PM +0100, Al Viro wrote:
> > Mismatched reply could also be a possibility, but only if we end up with
> > sending more than one request with the same tag without waiting for response
> > for the first one.
>
> ... and I think I see what's going on. Tags are 16bit. Suppose the
> server stalls for some reason *and* we keep piling the requests up.
> New tags keep being grabbed by this:
>
> tag = P9_NOTAG;
> if (type != P9_TVERSION) {
> tag = p9_idpool_get(c->tagpool);
> if (tag < 0)
> return ERR_PTR(-ENOMEM);
> }
> tag is int here. Then we pass tag to
> req = p9_tag_alloc(c, tag, req_size);
> and that's what sets req->tc->tag. OK, but... The argument of p9_tag_alloc()
> > is u16, so after 2^16 pending requests we'll wrap around. p9_idpool_get()
> will happily return values greater than 65535 - it's using idr and it's
> used (with different pools) for 16bit tags and 32bit FIDs.
>
> Now, p9_tag_alloc(c, 65539, max_size) will return the same req we'd got from
> p9_tag_alloc(c, 3, max_size). And we are fucked - as far as the server is
> concerned, we'd just sent another request with tag 3. And on the client
> there are two threads waiting for responses on the same p9_req_t. Both
> happen to be TWRITE. Response to the first request arrives and we happen
> to let the second thread go at it first. Voila - the first request had
> been for page-sized write() and got successfully handled. The _second_ one
> had been short and is very surprised to see confirmation of 4Kb worth of
> data having been written.
>
> It should be easy to confirm - in p9_client_prepare_req() add
> if (WARN_ON_ONCE(tag != (u16)tag)) {
> p9_idpool_put(tag, c->tagpool);
> return ERR_PTR(-ENOMEM);
> }
> right after
> tag = p9_idpool_get(c->tagpool);
> if (tag < 0)
> return ERR_PTR(-ENOMEM);
>
> and see if it triggers. I'm not sure if failing with ENOMEM is the
> right response (another variant is to sleep there until the pile
> gets cleaned or until we get killed), and WARN_ON_ONCE() is definitely
> not for the real work, but it will do for confirming that this is what
> we are hitting.
ISTM that p9_idpool_get ought to be using idr_alloc_cyclic instead.
That should ensure that it's only allocating values from within the
given range.
--
Jeff Layton <[email protected]>
On Thu, 2 Jul 2015 08:00:26 -0400
Jeff Layton <[email protected]> wrote:
> On Thu, 2 Jul 2015 04:20:42 +0100
> Al Viro <[email protected]> wrote:
>
> > [Al's analysis of the 16-bit tag wraparound snipped; quoted in full above]
>
> ISTM that p9_idpool_get ought to be using idr_alloc_cyclic instead.
> That should ensure that it's only allocating values from within the
> given range.
>
Erm...and why is it passing in '0' to idr_alloc for the end value if it
can't deal with more than 16 bits? That seems like a plain old bug...
The other stuff you've noted should also be fixed of course, but the
IDR usage here could use a little rework.
--
Jeff Layton <[email protected]>
On 07/02/2015 11:42 AM, Al Viro wrote:
> On Thu, Jul 02, 2015 at 09:25:30AM +0100, Al Viro wrote:
>> On Thu, Jul 02, 2015 at 11:19:03AM +0300, Andrey Ryabinin wrote:
>>> Besides qemu, I've also tried kvmtool with the same result. IOW I'm seeing
>>> this under kvmtool as well. It just takes a bit longer to reproduce
>>> this in kvmtool.
>>>
>>>> The bug I suspected to be the cause of that is in tag allocation in
>>>> net/9p/client.c - we could end up wrapping around 2^16 with enough pending
>>>> requests and that would have triggered that kind of mess. However, Andrey
>>>> doesn't see that test (tag wraparound in p9_client_prepare_req()) trigger.
>>>> BTW, was that on the run where debugging printk in p9_client_write() *did*
>>>> trigger?
>>>
>>> Yes, WARN_ON_ONCE() in p9_client_prepare_req() didn't trigger,
>>> but debug printk in p9_client_write() *did* trigger.
>>
>> Bloody wonderful... Could you check if v9fs_write() in qemu
>> hw/9pfs/virtio-9p.c ever gets to
>> offset = 7;
>> err = pdu_marshal(pdu, offset, "d", total);
>> with total > count on your testcase?
Added:
+ if (total > count)
+ *(char *)0 = 0;
and never hit this condition.
>
> Another thing that might be worth checking: in p9_tag_alloc() (net/9p/client.c)
> before
> req->status = REQ_STATUS_ALLOC;
> check that req->status == REQ_STATUS_IDLE and yell if it isn't.
>
diff --git a/net/9p/client.c b/net/9p/client.c
index 6f4c4c8..16a17a0 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -286,6 +286,8 @@ p9_tag_alloc(struct p9_client *c, u16 tag, unsigned int max_size)
p9pdu_reset(req->rc);
req->tc->tag = tag-1;
+ if (WARN_ON(req->status != REQ_STATUS_IDLE))
+ pr_err("req->status: %d\n", req->status);
req->status = REQ_STATUS_ALLOC;
return req;
[ 150.155020] ------------[ cut here ]------------
[ 150.156700] WARNING: CPU: 2 PID: 2304 at ../net/9p/client.c:289 p9_client_prepare_req+0x3b0/0x550()
[ 150.158404] Modules linked in:
[ 150.160177] CPU: 2 PID: 2304 Comm: trinity-c84 Not tainted 4.1.0-rc8+ #409
[ 150.161794] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
[ 150.165540] 0000000000000009 ffff8801ed13f7a8 ffffffff8161434b 0000000000000000
[ 150.170939] 0000000000000000 ffff8801ed13f7f8 ffffffff8107cf99 ffff8801f451d5f0
[ 150.175942] ffffffff815f6760 0000000000000003 ffff8800bbac00e0 ffff8800bbac00f0
[ 150.178393] Call Trace:
[ 150.178883] [<ffffffff8161434b>] dump_stack+0x45/0x57
[ 150.179914] [<ffffffff8107cf99>] warn_slowpath_common+0x99/0xe0
[ 150.181375] [<ffffffff815f6760>] ? p9_client_prepare_req+0x3b0/0x550
[ 150.182597] [<ffffffff8107d145>] warn_slowpath_null+0x15/0x20
[ 150.184067] [<ffffffff815f6760>] p9_client_prepare_req+0x3b0/0x550
[ 150.185043] [<ffffffff815fb1e2>] p9_client_zc_rpc.constprop.5+0xe2/0x730
[ 150.186229] [<ffffffff8161ba46>] ? _raw_spin_unlock+0x16/0x70
[ 150.187049] [<ffffffff815fb100>] ? p9_client_xattrwalk+0x1b0/0x1b0
[ 150.188477] [<ffffffff812b1b5e>] ? idr_remove+0x2ce/0x420
[ 150.189443] [<ffffffff815f5af0>] ? v9fs_unregister_trans+0x70/0x70
[ 150.190456] [<ffffffff812b1890>] ? idr_mark_full+0x80/0x80
[ 150.191489] [<ffffffff815f5af0>] ? v9fs_unregister_trans+0x70/0x70
[ 150.193911] [<ffffffff8161bae7>] ? _raw_spin_unlock_irqrestore+0x47/0xb0
[ 150.195462] [<ffffffff815fcae4>] ? p9_idpool_put+0x54/0x60
[ 150.196729] [<ffffffff812c9906>] ? iov_iter_advance+0xb6/0x240
[ 150.199766] [<ffffffff815fbea3>] p9_client_write+0x333/0x3d0
[ 150.201073] [<ffffffff811a882e>] ? kasan_kmalloc+0x5e/0x70
[ 150.202512] [<ffffffff815fbb70>] ? p9_client_readdir+0x340/0x340
[ 150.204115] [<ffffffff811c80dd>] ? rw_copy_check_uvector+0xed/0x170
[ 150.204960] [<ffffffff812d82d6>] ? __percpu_counter_add+0x26/0xb0
[ 150.206517] [<ffffffff8113574a>] ? generic_write_checks+0xfa/0x1e0
[ 150.208092] [<ffffffff8125c054>] v9fs_file_write_iter+0xc4/0x200
[ 150.209642] [<ffffffff811c9840>] ? __sb_end_write+0x80/0x80
[ 150.211305] [<ffffffff8125bf90>] ? v9fs_file_lock_dotl+0x3d0/0x3d0
[ 150.216908] [<ffffffff81128d14>] ? ctx_sched_in.isra.57+0xe4/0x2f0
[ 150.221069] [<ffffffff811c6d84>] ? rw_verify_area+0x54/0x150
[ 150.222570] [<ffffffff811c7363>] do_readv_writev+0x223/0x450
[ 150.229044] [<ffffffff811290b7>] ? perf_event_context_sched_in.isra.61+0x127/0x180
[ 150.231020] [<ffffffff8125bf90>] ? v9fs_file_lock_dotl+0x3d0/0x3d0
[ 150.231862] [<ffffffff811c7140>] ? vfs_write+0x1e0/0x1e0
[ 150.232583] [<ffffffff81129b7a>] ? __perf_event_task_sched_in+0x5a/0xa0
[ 150.233471] [<ffffffff810aee08>] ? finish_task_switch+0xa8/0x1b0
[ 150.234282] [<ffffffff8161588b>] ? __schedule+0x3db/0xc90
[ 150.235020] [<ffffffff81616197>] ? schedule+0x57/0xd0
[ 150.235709] [<ffffffff8161ba46>] ? _raw_spin_unlock+0x16/0x70
[ 150.236493] [<ffffffff81618a0f>] ? __mutex_lock_slowpath+0x1df/0x200
[ 150.239059] [<ffffffff811c9840>] ? __sb_end_write+0x80/0x80
[ 150.242120] [<ffffffff81618830>] ? __ww_mutex_lock_interruptible+0xe0/0xe0
[ 150.245274] [<ffffffff810ecfd3>] ? hrtimer_start+0x13/0x20
[ 150.247527] [<ffffffff810ee54d>] ? do_setitimer+0x30d/0x400
[ 150.249351] [<ffffffff811c764e>] vfs_writev+0x4e/0x70
[ 150.250378] [<ffffffff811c8348>] SyS_writev+0xa8/0x140
[ 150.251545] [<ffffffff811c82a0>] ? SyS_readv+0x140/0x140
[ 150.253208] [<ffffffff8161c2ae>] system_call_fastpath+0x12/0x71
[ 150.256990] ---[ end trace 4f640ea141ed3d61 ]---
[ 150.259076] 9pnet: req->status: 4
> BTW, the loop in there (
> /* check again since original check was outside of lock */
> while (tag >= c->max_tag) {
> ) looks fishy. If we get more than P9_ROW_MAXTAG allocations at once,
> we'll have trouble, but I doubt that this is what we are hitting. In any
> case, adding WARN_ON(c->reqs[row]); right after
I didn't get this. c->reqs[row] is always non-NULL as it should be, so this warning
will trigger all the time.
> row = (tag / P9_ROW_MAXTAG);
> wouldn't hurt. I would be very surprised if that one triggered, though.
>
On Thu, Jul 02, 2015 at 03:19:57PM +0300, Andrey Ryabinin wrote:
> Added:
> + if (total > count)
> + *(char *)0 = 0
>
> and never hit this condition.
>
OK, so it's definitely a mismatched response.
> req->tc->tag = tag-1;
> + if (WARN_ON(req->status != REQ_STATUS_IDLE))
> + pr_err("req->status: %d\n", req->status);
> req->status = REQ_STATUS_ALLOC;
>
> return req;
> [ 150.259076] 9pnet: req->status: 4
IOW, REQ_STATUS_RCVD. Hmm... Stray tag seen by req_done() after we'd already
freed the tag in question? That, or it really would have had to wrap
around... Note that req_done() does *not* check anything about the req -
not even that p9_tag_lookup() hasn't returned NULL, so a server sending you
any response tagged with number well above anything you'd ever sent will
reliably oops you.
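Roughly the kind of check that is missing - a sketch only, not the actual req_done() from net/9p/trans_virtio.c; the surrounding receive loop, the 'chan' and 'rc' variables and the REQ_STATUS_SENT test are assumptions:

	/* drop replies whose tag doesn't map to a request we actually sent,
	 * instead of blindly dereferencing whatever p9_tag_lookup() returns */
	req = p9_tag_lookup(chan->client, rc->tag);
	if (!req || req->status != REQ_STATUS_SENT) {
		pr_err("9p: discarding reply with unexpected tag %d\n", rc->tag);
		continue;
	}
	p9_client_cb(chan->client, req, REQ_STATUS_RCVD);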
Frankly, the whole thing needs fuzzing from the server side - start throwing
crap at the client and see how badly does it get fucked... Folks, it's
a network protocol, with userland servers, no less. You *can't* assume
them competent and non-malicious...
How much traffic does it take to reproduce that fun, BTW? IOW, is attempting
to log the sequence of tag {allocation,freeing}/tag of packet being {sent,
received} something completely suicidal, or is it more or less feasible?
> I didn't get this. c->reqs[row] is always non-NULL as it should be, so this warning
> will trigger all the time.
????
row = (tag / P9_ROW_MAXTAG);
c->reqs[row] = kcalloc(P9_ROW_MAXTAG,
sizeof(struct p9_req_t), GFP_ATOMIC);
and you are seeing c->reqs[row] != NULL *BEFORE* that kcalloc()? All the time,
no less? Just to make sure we are on the same page - the delta against
mainline I would like tested is this:
diff --git a/net/9p/client.c b/net/9p/client.c
index 6f4c4c8..fa88c9a 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -248,6 +248,9 @@ p9_tag_alloc(struct p9_client *c, u16 tag, unsigned int max_size)
/* check again since original check was outside of lock */
while (tag >= c->max_tag) {
row = (tag / P9_ROW_MAXTAG);
+
+ WARN_ON_ONCE(c->reqs[row]); // are we about to leak?
+
c->reqs[row] = kcalloc(P9_ROW_MAXTAG,
sizeof(struct p9_req_t), GFP_ATOMIC);
@@ -286,6 +289,8 @@ p9_tag_alloc(struct p9_client *c, u16 tag, unsigned int max_size)
p9pdu_reset(req->rc);
req->tc->tag = tag-1;
+ if (req->status != REQ_STATUS_IDLE)
+ pr_err("using tag %d with odd status (%d)", tag, req->status);
req->status = REQ_STATUS_ALLOC;
return req;
@@ -425,6 +430,8 @@ void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status)
* the other thread wakes up will indeed be seen by the waiting side.
*/
smp_wmb();
+ if (req->status == REQ_STATUS_IDLE)
+ pr_err("late delivery, tag %d already freed", req->tc->tag);
req->status = status;
wake_up(req->wq);
@@ -693,6 +700,10 @@ static struct p9_req_t *p9_client_prepare_req(struct p9_client *c,
tag = p9_idpool_get(c->tagpool);
if (tag < 0)
return ERR_PTR(-ENOMEM);
+ if (WARN_ON_ONCE(tag != (u16)tag)) { // wrapped around?
+ p9_idpool_put(tag, c->tagpool);
+ return ERR_PTR(-ENOMEM);
+ }
}
req = p9_tag_alloc(c, tag, req_size);
@@ -1647,7 +1658,10 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
if (*err) {
trace_9p_protocol_dump(clnt, req->rc);
p9_free_req(clnt, req);
+ break;
}
+ if (rsize < count)
+ pr_err("mismatched reply [tag = %d]\n", req->tc->tag);
p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count);
On Thu, Jul 02, 2015 at 08:07:38AM -0400, Jeff Layton wrote:
> Erm...and why is it passing in '0' to idr_alloc for the end value if it
> can't deal with more than 16 bits? That seems like a plain old bug...
Because they are using the same function (with different pool, obviously)
for FID allocation, and those are 32bit...
On Thu, Jul 02, 2015 at 05:43:32PM +0100, Al Viro wrote:
> req->tc->tag = tag-1;
> + if (req->status != REQ_STATUS_IDLE)
> + pr_err("using tag %d with odd status (%d)", tag, req->status);
Should be tag - 1 here, actually.
diff --git a/net/9p/client.c b/net/9p/client.c
index 6f4c4c8..9719886 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -248,6 +248,9 @@ p9_tag_alloc(struct p9_client *c, u16 tag, unsigned int max_size)
/* check again since original check was outside of lock */
while (tag >= c->max_tag) {
row = (tag / P9_ROW_MAXTAG);
+
+ WARN_ON_ONCE(c->reqs[row]); // are we about to leak?
+
c->reqs[row] = kcalloc(P9_ROW_MAXTAG,
sizeof(struct p9_req_t), GFP_ATOMIC);
@@ -286,6 +289,8 @@ p9_tag_alloc(struct p9_client *c, u16 tag, unsigned int max_size)
p9pdu_reset(req->rc);
req->tc->tag = tag-1;
+ if (req->status != REQ_STATUS_IDLE)
+ pr_err("using tag %d with odd status (%d)", tag - 1, req->status);
req->status = REQ_STATUS_ALLOC;
return req;
@@ -425,6 +430,8 @@ void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status)
* the other thread wakes up will indeed be seen by the waiting side.
*/
smp_wmb();
+ if (req->status == REQ_STATUS_IDLE)
+ pr_err("late delivery, tag %d already freed", req->tc->tag);
req->status = status;
wake_up(req->wq);
@@ -693,6 +700,10 @@ static struct p9_req_t *p9_client_prepare_req(struct p9_client *c,
tag = p9_idpool_get(c->tagpool);
if (tag < 0)
return ERR_PTR(-ENOMEM);
+ if (WARN_ON_ONCE(tag != (u16)tag)) { // wrapped around?
+ p9_idpool_put(tag, c->tagpool);
+ return ERR_PTR(-ENOMEM);
+ }
}
req = p9_tag_alloc(c, tag, req_size);
@@ -1647,7 +1658,10 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
if (*err) {
trace_9p_protocol_dump(clnt, req->rc);
p9_free_req(clnt, req);
+ break;
}
+ if (rsize < count)
+ pr_err("mismatched reply [tag = %d]\n", req->tc->tag);
p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count);
On Thu, 2 Jul 2015 17:45:35 +0100
Al Viro <[email protected]> wrote:
> On Thu, Jul 02, 2015 at 08:07:38AM -0400, Jeff Layton wrote:
>
> > Erm...and why is it passing in '0' to idr_alloc for the end value if it
> > can't deal with more than 16 bits? That seems like a plain old bug...
>
> Because they are using the same function (with different pool, obviously)
> for FID allocation, and those are 32bit...
Ahh, right...
So p9_idpool_create should take an argument for the "end" value, and
then store that in a new field in p9_idpool. Then they can pass that in
as the "end" parm in idr_alloc. Or, they could give up using the same
function there and use a different one for tags and FIDs.
In any case...allowing this thing to allocate tag values that can
collide seems fundamentally wrong. Using idr_alloc_cyclic might also
not hurt either, particularly given that these tag values are supposed
to function something like an XID and you probably don't want to be
reusing them too quickly.
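Something along these lines, maybe - completely untested, and the new "end"
field/parameter are just invented names to sketch the idea:

/* sketch against net/9p/util.c: give each pool an upper bound and allocate
 * cyclically so a freshly freed id is not handed out again immediately */
struct p9_idpool {
	spinlock_t lock;
	struct idr pool;
	int end;		/* exclusive bound passed to idr_alloc_cyclic() */
};

struct p9_idpool *p9_idpool_create(int end)
{
	struct p9_idpool *p;

	p = kmalloc(sizeof(*p), GFP_KERNEL);
	if (!p)
		return ERR_PTR(-ENOMEM);

	spin_lock_init(&p->lock);
	idr_init(&p->pool);
	p->end = end;		/* e.g. P9_NOTAG for tags, 0 (no limit) for FIDs */
	return p;
}

int p9_idpool_get(struct p9_idpool *p)
{
	unsigned long flags;
	int i;

	idr_preload(GFP_NOFS);
	spin_lock_irqsave(&p->lock, flags);
	i = idr_alloc_cyclic(&p->pool, p, 0, p->end, GFP_NOWAIT);
	spin_unlock_irqrestore(&p->lock, flags);
	idr_preload_end();

	return i < 0 ? -1 : i;
}

The 9p client would then create the tag pool with p9_idpool_create(P9_NOTAG)
and the FID pool with p9_idpool_create(0).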
--
Jeff Layton <[email protected]>
Jeff Layton wrote on Thu, Jul 02, 2015:
> So p9_idpool_create should take an argument for the "end" value, and
> then store that in a new field in p9_idpool. Then they can pass that in
> as the "end" parm in idr_alloc. Or, they could give up using the same
> function there and use a different one for tags and FIDs.
>
> In any case...allowing this thing to allocate tag values that can
> collide seems fundamentally wrong. Using idr_alloc_cyclic might also
> not hurt either, particularly given that these tag values are supposed
> to function something like an XID and you probably don't want to be
> reusing them too quickly.
Using cache=none here so behaviour is likely different with cache, but
basically you can't get more than one tag per user thread accessing the
9P mount...
And in RDMA there's a credit limit so I can't get past whatever sq option was
given (defaults to 32) -- tbh even with other transports I doubt it's
going to get much higher.
Still definitely needs fixing, but I think the issue is somewhere
else... If Andrey could share the workload he uses I can try with other
servers, would be nice if we can rule a qemu bug out completely :)
--
Dominique
On Thu, Jul 02, 2015 at 01:01:39PM -0400, Jeff Layton wrote:
> So p9_idpool_create should take an argument for the "end" value, and
> then store that in a new field in p9_idpool. Then they can pass that in
> as the "end" parm in idr_alloc. Or, they could give up using the same
> function there and use a different one for tags and FIDs.
>
> In any case...allowing this thing to allocate tag values that can
> collide seems fundamentally wrong. Using idr_alloc_cyclic might also
> not hurt either, particularly given that these tag values are supposed
> to function something like an XID and you probably don't want to be
> reusing them too quickly.
All they are used for is matching response to request. Basically, you
can have up to 65535 pending requests. Reusing it right after getting
the response is fine.
Keep in mind that it's not supposed to be used on top of something like
UDP - *all* retransmits, etc., are the responsibility of the transport. It's
connection-oriented, reliable and order-preserving, with a shitload of state
tied to the connection, starting with FIDs - unlike FHANDLE, a FID's meaning
depends upon connection history. Tags are even more transient.
Basically, the rules are
* request bears a 16bit tag.
* server can process pending requests in any order (with one exception)
and it must put the same tag into responses.
* client can ask to abort a pending request by issuing Tflush[old_tag];
* server must handle Tflush immediately. It must drop any pending
request matching old_tag and send Rflush confirming that. No response to
any request matching old_tag sent before Tflush should be issued after issuing
Rflush.
* if client has not issued Tflush, it must not reuse the tag until
getting a response bearing that tag.
* if client *has* issued Tflush, it must not reuse the tag until
getting Rflush, even if it does get response to the request it has tried to
abort.
BTW, failure to send Tflush means that we should leave the tag in use,
period. As it is, p9_client_rpc()/p9_client_zc_rpc() ignore such
situations - failure from p9_client_flush() is simply not noticed.
I seriously doubt that this is what we are hitting here, but it's a bug
all the same.
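I.e. a failed p9_client_flush() should pin the request instead of letting the
normal path recycle its tag; very roughly (a sketch only, not against any
particular tree):

	/* in p9_client_rpc(), after the wait for a reply was interrupted */
	if (c->trans_mod->cancel(c, req)) {
		if (p9_client_flush(c, req) < 0) {
			/*
			 * Tflush never made it out; the server may still send
			 * a reply for this tag, so deliberately skip
			 * p9_free_req() and leak the request rather than
			 * letting the tag be reused.
			 */
			return ERR_PTR(-ERESTARTSYS);
		}
	}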
We also must _not_ let p9_client_cb() do anything unless req is non-NULL
and req->status is REQ_STATUS_SENT - stray tags from server shouldn't
be acted upon. As it is, the whole thing is trivial to oops - just have
server send _any_ R-message with something like 0xfff0 for tag. End of
story, p9_tag_lookup() returns NULL, p9_client_cb() oopses.
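The missing check in the virtio req_done() loop would be something like this
(illustrative only):

		req = p9_tag_lookup(chan->client, rc->tag);
		if (!req || req->status != REQ_STATUS_SENT) {
			/* a tag we never sent, or one already completed -
			 * don't let a broken or malicious server oops us */
			pr_warn("9pnet: discarding reply with bogus tag %d\n",
				rc->tag);
			continue;
		}
		p9_client_cb(chan->client, req, REQ_STATUS_RCVD);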
On Thu, Jul 02, 2015 at 07:56:29PM +0200, Dominique Martinet wrote:
> Using cache=none here so behaviour is likely different with cache, but
> basically you can't get more than one tag per user thread accessing the
> 9P mount...
Yes, and...? You can get a lot more than one user thread... Andrey is
using trinity(1) on client, and that's *definitely* not single-threaded -
the whole point is stressing the damn thing.
On Thu, 2 Jul 2015 19:56:29 +0200
Dominique Martinet <[email protected]> wrote:
> Jeff Layton wrote on Thu, Jul 02, 2015:
> > So p9_idpool_create should take an argument for the "end" value, and
> > then store that in a new field in p9_idpool. Then they can pass that in
> > as the "end" parm in idr_alloc. Or, they could give up using the same
> > function there and use a different one for tags and FIDs.
> >
> > In any case...allowing this thing to allocate tag values that can
> > collide seems fundamentally wrong. Using idr_alloc_cyclic might also
> > not hurt either, particularly given that these tag values are supposed
> > to function something like an XID and you probably don't want to be
> > reusing them too quickly.
>
> > Using cache=none here so behaviour is likely different with cache, but
> basically you can't get more than one tag per user thread accessing the
> 9P mount...
> And in RDMA there's a credit so I can't get past whatever sq option was
> given (defaults to 32) -- tbh even with other transports I doubt it's
> going to get much higher.
>
> Still definitely needs fixing, but I think the issue is somewhere
> else... If Andrey could share the workload he uses I can try with other
> servers, would be nice if we can rule a qemu bug out completely :)
>
Fair enough...
If you're in there and decide to fix this up, then consider moving this
over to IDA instead of IDR. The pointers stored are not terribly
interesting (always the same as the p9_idpool), so by doing that you'll
save quite a bit of memory as well.
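Roughly (untested, names invented):

/* tag pool on top of IDA - nothing but the id itself is tracked */
struct p9_tagpool {
	struct ida ida;
};

static void p9_tagpool_init(struct p9_tagpool *p)
{
	ida_init(&p->ida);
}

static int p9_tag_get(struct p9_tagpool *p)
{
	/* ids 0 .. P9_NOTAG - 1 */
	return ida_simple_get(&p->ida, 0, P9_NOTAG, GFP_NOFS);
}

static void p9_tag_put(struct p9_tagpool *p, int tag)
{
	ida_simple_remove(&p->ida, tag);
}

(Caveat: IDA always hands back the lowest free id, so if cyclic behaviour is
wanted it would have to stay with idr_alloc_cyclic or grow a cursor on top.)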
--
Jeff Layton <[email protected]>
On Thu, Jul 2, 2015 at 11:40 AM, Al Viro <[email protected]> wrote:
>
> All they are used for is matching response to request. Basically, you
> can have up to 65535 pending requests. Reusing it right after getting
> the response is fine.
Reusing a tag right after getting the completion may be fine in
theory, but it still sounds like a bad idea. Sure, it's used to match
the command with the reply, but using those kinds of things for
matching re-sends and to index into various "current data structures"
is also very common (not having looked at p9 I don't know how much it
does), and basically reusing tags "soon" tends to make those kinds of
things fragile.
Which can easily turn a "this _should_ work" into "it doesn't
_actually_ work" just because it ends up making things like race
conditions and re-ordering of replies trigger worse behavior. For
example, things like "done with previous command X" and "now starting
new command X" - if the tag is the same and those *independent*
messages get re-ordered, the tag just failed in what it was supposed
to do.
So circular allocators are likely a good idea even if there are other
layers that should handle retransmits etc.
So it does sound like it would be better to use a circular tag
allocator rather than a "lowest tag first" allocator.
Linus
2015-07-02 19:43 GMT+03:00 Al Viro <[email protected]>:
> On Thu, Jul 02, 2015 at 03:19:57PM +0300, Andrey Ryabinin wrote:
>
>> Added:
>> + if (total > count)
>> + *(char *)0 = 0
>>
>> and never hit this condition.
>>
>
> OK, so it's definitely a mismatched response.
>
>> req->tc->tag = tag-1;
>> + if (WARN_ON(req->status != REQ_STATUS_IDLE))
>> + pr_err("req->status: %d\n", req->status);
>> req->status = REQ_STATUS_ALLOC;
>>
>> return req;
>
>> [ 150.259076] 9pnet: req->status: 4
>
> IOW, REQ_STATUS_RCVD. Hmm... Stray tag seen by req_done() after we'd already
> freed the tag in question? That, or it really would have had to wrap
> around... Note that req_done() does *not* check anything about the req -
> not even that p9_tag_lookup() hasn't returned NULL, so a server sending you
> any response tagged with number well above anything you'd ever sent will
> reliably oops you.
>
> Frankly, the whole thing needs fuzzing from the server side - start throwing
> crap at the client and see how badly does it get fucked... Folks, it's
> a network protocol, with userland servers, no less. You *can't* assume
> them competent and non-malicious...
>
> How much traffic does it take to reproduce that fun, BTW? IOW, is attempting
> to log the sequence of tag {allocation,freeing}/tag of packet being {sent,
> received} something completely suicidal, or is it more or less feasible?
>
No idea. Usually it takes 1-2 minutes after trinity (100 threads) starts.
>> I didn't get this. c->reqs[row] is always non-NULL as it should be, so this warning
>> will trigger all the time.
>
> ????
> row = (tag / P9_ROW_MAXTAG);
> c->reqs[row] = kcalloc(P9_ROW_MAXTAG,
> sizeof(struct p9_req_t), GFP_ATOMIC);
>
> and you are seeing c->reqs[row] != NULL *BEFORE* that kcalloc()? All the time,
> no less? Just to make sure we are on the same page - the delta against
> mainline I would like tested is this:
>
Ah, I was looking at the second ' row = tag / P9_ROW_MAXTAG;' line
which is after kcalloc().
I'll check tomorrow then.
2015-07-02 20:56 GMT+03:00 Dominique Martinet <[email protected]>:
>
> Still definitely needs fixing, but I think the issue is somewhere
> else... If Andrey could share the workload he uses I can try with other
> servers, would be nice if we can rule a qemu bug out completely :)
>
I simply run trinity from 9p rootfs:
./trinity -qqq --dangerous -xinit_module -C100
qemu guest, virtio transport.
BTW, I've discovered that all these bogus writes come from the trinity
logger (log.c), which just opens one file per thread (trinity-childX.log)
and writes to it.
On Thu, Jul 02, 2015 at 12:16:14PM -0700, Linus Torvalds wrote:
> On Thu, Jul 2, 2015 at 11:40 AM, Al Viro <[email protected]> wrote:
> >
> > All they are used for is matching response to request. Basically, you
> > can have up to 65535 pending requests. Reusing it right after getting
> > the response is fine.
>
> Reusing a tag right after getting the completion may be fine in
> theory, but it still sounds like a bad idea. Sure, it's used to match
> the command with the reply, but using those kinds of things for
> matching re-sends and to index into various "current data structures"
> is also very common (not having looked at p9 I don't know how much it
> does), and basically reusing tags "soon" tends to make those kinds of
> things fragile.
_All_ retransmits are done in transport layer there. It's not NFS - it
really expects reliable ordered connection for transport. No retransmits,
no duplicates, etc. I'm not dead against circular allocation, but I would
really like to figure out what's going on first.
I still wonder if we are seeing wraparound (should've posted a diff instead
of verbal description - mea culpa). If we are not, it smells like response
to request having arrived while the tag had been not in use from the client
POV, _or_ buggered barriers of some kind. Maybe buggered ordering of
replies somewhere, but that's only if Tflush had been involved (as in
-> Twrite tag = 3
-> Tflush tag = 42 old_tag = 3
<- Rwrite tag = 3
<- Rflush tag = 42
mark tag 3 free to be reused
reuse tag 3
... somehow get to seeing Rwrite only now
But I don't see where such ordering violation could've happened at the moment.
The way it's supposed to work is that the sequence
-> Twhatever tag = N
-> Tflush old_tag = N
must either end up with no response to the former arriving at all, or
arriving before the response to the latter. Transport itself does preserve
ordering (TCP certainly would, but virtio queue also does, AFAICS) and
we really need to have p9_client_cb() called in order of arrival.
Hmm... This is a stab in the dark, but... we have vring_interrupt() calling
req_done(), which does
while (1) {
spin_lock_irqsave(&chan->lock, flags);
rc = virtqueue_get_buf(chan->vq, &len);
if (rc == NULL) {
spin_unlock_irqrestore(&chan->lock, flags);
break;
}
chan->ring_bufs_avail = 1;
spin_unlock_irqrestore(&chan->lock, flags);
/* Wakeup if anyone waiting for VirtIO ring space. */
wake_up(chan->vc_wq);
p9_debug(P9_DEBUG_TRANS, ": rc %p\n", rc);
p9_debug(P9_DEBUG_TRANS, ": lookup tag %d\n", rc->tag);
req = p9_tag_lookup(chan->client, rc->tag);
p9_client_cb(chan->client, req, REQ_STATUS_RCVD);
}
What's to prevent *another* vring_interrupt() (called from some kind of
IRQ handler) hitting on another CPU and competing with this one for the
queue? While we are at it, both p9_tag_lookup() and p9_client_cb()
should be fine with being called under spin_lock_irqsave, so why not
hold it outside of the loop?
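IOW, something like this (untested, just to show the shape - with the wake_up
done once after the loop, since doing it per buffer under the lock buys
nothing):

	spin_lock_irqsave(&chan->lock, flags);
	while ((rc = virtqueue_get_buf(chan->vq, &len)) != NULL) {
		chan->ring_bufs_avail = 1;
		p9_debug(P9_DEBUG_TRANS, ": rc %p\n", rc);
		p9_debug(P9_DEBUG_TRANS, ": lookup tag %d\n", rc->tag);
		req = p9_tag_lookup(chan->client, rc->tag);
		p9_client_cb(chan->client, req, REQ_STATUS_RCVD);
	}
	spin_unlock_irqrestore(&chan->lock, flags);
	/* Wakeup if anyone waiting for VirtIO ring space. */
	wake_up(chan->vc_wq);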
Al Viro wrote on Thu, Jul 02, 2015:
> On Thu, Jul 02, 2015 at 07:56:29PM +0200, Dominique Martinet wrote:
> > Using cache=none here so behaviour is likely different with cache, but
> > basically you can't get more than one tag per user thread accessing the
> > 9P mount...
>
> Yes, and...? You can get a lot more than one user thread... Andrey is
> using trinity(1) on client, and that's *definitely* not single-threaded -
> the whole point is stressing the damn thing.
I have run trinity quite a bit and it doesn't fork-bomb as far as I can
recall; with him running it with -C100 we're not quite at 2^16 yet?
I do agree it's a problem, just don't think it's the one we're
hitting -- I'll try again on a recent kernel to see if anything changed
with rdma/tcp as well, but I'm starting to doubt I'll get any luck with
anything other than virtio; which doesn't really help since it's not the
same order of latencies.
FWIW I don't *think* trinity can issue TFlush either without user
interaction, that's a really special call. It can only happen in rpc()
or zc_rpc() if it's interrupted by ERESTARTSYS which I understand as ^C?
(I'll look into making the pools use IDA unless someone else steps up,
sure. Thanks Jeff)
--
Dominique
On 07/02/2015 07:49 PM, Al Viro wrote:
> On Thu, Jul 02, 2015 at 05:43:32PM +0100, Al Viro wrote:
>> req->tc->tag = tag-1;
>> + if (req->status != REQ_STATUS_IDLE)
>> + pr_err("using tag %d with odd status (%d)", tag, req->status);
>
> Should be tag - 1 here, actually.
So, with this change:
[ 81.654695] 9pnet: late delivery, tag 2 already freed
[ 81.655186] 9pnet: mismatched reply [tag = 2]
[ 85.014795] 9pnet: late delivery, tag 4 already freed
[ 85.015753] 9pnet: late delivery, tag 4 already freed
[ 85.016473] 9pnet: late delivery, tag 4 already freed
[ 87.275842] 9pnet: mismatched reply [tag = 5]
[ 88.943223] 9pnet: late delivery, tag 4 already freed
[ 88.943790] 9pnet: late delivery, tag 4 already freed
[ 88.944320] 9pnet: late delivery, tag 4 already freed
[ 88.945050] 9pnet: mismatched reply [tag = 4]
[ 105.551220] 9pnet: late delivery, tag 3 already freed
[ 105.552065] 9pnet: late delivery, tag 3 already freed
[ 114.528467] 9pnet: late delivery, tag 4 already freed
[ 164.791295] 9pnet: late delivery, tag 2 already freed
[ 164.792156] 9pnet: using tag 2 with odd status (4)
[ 164.864398] 9pnet: mismatched reply [tag = 4]
[ 166.419993] 9pnet: mismatched reply [tag = 3]
[ 174.062625] 9pnet: late delivery, tag 3 already freed
[ 174.063121] 9pnet: late delivery, tag 3 already freed
[ 174.063637] 9pnet: late delivery, tag 3 already freed
[ 174.064018] 9pnet: late delivery, tag 3 already freed
[ 174.064547] 9pnet: mismatched reply [tag = 3]
[ 175.418729] 9pnet: late delivery, tag 3 already freed
[ 177.911727] 9pnet: mismatched reply [tag = 1]
[ 235.748520] 9pnet: late delivery, tag 1 already freed
[ 235.749595] 9pnet: using tag 1 with odd status (4)
[ 243.496782] 9pnet: late delivery, tag 6 already freed
[ 243.497697] 9pnet: late delivery, tag 6 already freed
[ 243.499079] 9pnet: mismatched reply [tag = 6]
[ 243.736388] 9pnet: late delivery, tag 1 already freed
[ 243.740019] 9pnet: late delivery, tag 1 already freed
[ 248.337712] 9pnet: late delivery, tag 2 already freed
[ 249.872950] 9pnet: mismatched reply [tag = 9]
[ 252.448858] 9pnet: mismatched reply [tag = 8]
[ 252.668878] 9pnet: mismatched reply [tag = 11]
On Fri, Jul 03, 2015 at 11:19:31AM +0300, Andrey Ryabinin wrote:
> On 07/02/2015 07:49 PM, Al Viro wrote:
> > On Thu, Jul 02, 2015 at 05:43:32PM +0100, Al Viro wrote:
> >> req->tc->tag = tag-1;
> >> + if (req->status != REQ_STATUS_IDLE)
> >> + pr_err("using tag %d with odd status (%d)", tag, req->status);
> >
> > Should be tag - 1 here, actually.
>
> So, with this change:
>
> [ 81.654695] 9pnet: late delivery, tag 2 already freed
> [ 81.655186] 9pnet: mismatched reply [tag = 2]
> [ 85.014795] 9pnet: late delivery, tag 4 already freed
> [ 85.015753] 9pnet: late delivery, tag 4 already freed
> [ 85.016473] 9pnet: late delivery, tag 4 already freed
> [ 87.275842] 9pnet: mismatched reply [tag = 5]
> [ 88.943223] 9pnet: late delivery, tag 4 already freed
> [ 88.943790] 9pnet: late delivery, tag 4 already freed
> [ 88.944320] 9pnet: late delivery, tag 4 already freed
> [ 88.945050] 9pnet: mismatched reply [tag = 4]
> [ 105.551220] 9pnet: late delivery, tag 3 already freed
> [ 105.552065] 9pnet: late delivery, tag 3 already freed
> [ 114.528467] 9pnet: late delivery, tag 4 already freed
> [ 164.791295] 9pnet: late delivery, tag 2 already freed
> [ 164.792156] 9pnet: using tag 2 with odd status (4)
> [ 164.864398] 9pnet: mismatched reply [tag = 4]
> [ 166.419993] 9pnet: mismatched reply [tag = 3]
> [ 174.062625] 9pnet: late delivery, tag 3 already freed
> [ 174.063121] 9pnet: late delivery, tag 3 already freed
> [ 174.063637] 9pnet: late delivery, tag 3 already freed
> [ 174.064018] 9pnet: late delivery, tag 3 already freed
> [ 174.064547] 9pnet: mismatched reply [tag = 3]
> [ 175.418729] 9pnet: late delivery, tag 3 already freed
> [ 177.911727] 9pnet: mismatched reply [tag = 1]
> [ 235.748520] 9pnet: late delivery, tag 1 already freed
> [ 235.749595] 9pnet: using tag 1 with odd status (4)
> [ 243.496782] 9pnet: late delivery, tag 6 already freed
> [ 243.497697] 9pnet: late delivery, tag 6 already freed
> [ 243.499079] 9pnet: mismatched reply [tag = 6]
> [ 243.736388] 9pnet: late delivery, tag 1 already freed
> [ 243.740019] 9pnet: late delivery, tag 1 already freed
> [ 248.337712] 9pnet: late delivery, tag 2 already freed
> [ 249.872950] 9pnet: mismatched reply [tag = 9]
> [ 252.448858] 9pnet: mismatched reply [tag = 8]
> [ 252.668878] 9pnet: mismatched reply [tag = 11]
FWIW, I'd reproduced it here as well. With the addition of checking in
p9_client_cb() whether the tag is already freed in IDR, reporting ->status
when it's not "sent" and reporting Tflush, I've got this:
[ 2754.713015] 9pnet: flushing 1
[ 2755.516123] 9pnet: flush 1 [3]done
[ 2758.443265] 9pnet: flushing 16
[ 2768.655768] 9pnet: flush 16 [8]done
[ 2783.939538] 9pnet: flushing 30
[ 2786.067856] 9pnet: flush 30 [3]done
[ 2809.784119] 9pnet: [ffff880113213000] delivery in 0, tag 1
[ 2809.948681] 9pnet: [ffff880113213000] using tag 1 with odd status (4)
Here we have delivery when that sucker is in p9_free_req(), but hasn't been
removed from the IDR yet.
[ 2809.948681] 9pnet: [ffff880113213000] delivery in 0, tag 1
[ 2816.767861] 9pnet: [ffff880113213000] using tag 1 with odd status (4)
Ditto.
[ 2816.767861] 9pnet: flushing 4
[ 2816.769484] 9pnet: flush 4 [5]done
[ 2846.327753] 9pnet: flushing 3
[ 2854.876131] 9pnet: flush 3 [4]done
[ 2856.492801] 9pnet: flushing 6
[ 2857.675642] 9pnet: flush 6 [1]done
[ 2860.051701] 9pnet: [ffff880113213000] delivery in 0, tag 1
[ 2860.052941] 9pnet: [ffff880113213000] late delivery, tag 1 already freed in IDR
[ 2860.680181] 9pnet: [ffff880113213000] using tag 1 with odd status (4)
This has hit between p9_free_req() and reallocating that request.
[ 2909.911815] 9pnet: [ffff880113213000] delivery in 4, tag 1
[ 2909.913145] 9pnet: [ffff880113213000] late delivery, tag 1 already freed in IDR
... while _this_ one has raced with p9_free_req() in a different way.
[ 2910.852202] 9pnet: flushing 3
[ 2917.985941] 9pnet: flush 3 [4]done
[ 2937.600652] 9pnet: flushing 2
[ 2939.775354] 9pnet: flush 2 [4]done
[ 2961.521600] 9pnet: [ffff880113213000] delivery in 0, tag 1
[ 2962.320383] 9pnet: [ffff880113213000] using tag 1 with odd status (4)
Same as the very first one.
[ 2962.320383] 9pnet: [ffff880113213000] delivery in 4, tag 8
[ 3001.578372] 9pnet: [ffff880113213000] delivery in 4, tag 6
Those had hit before p9_free_req().
AFAICS, we get occasional stray responses from somewhere. And no, it doesn't
seem to be related to flushes or to dropping chan->lock in req_done() (this
run had been with chan->lock taken on the outside of the loop).
What I really don't understand is WTF is it playing with p9_tag_lookup() -
it's stashing req->tc via virtqueue_add_sgs() opaque data argument, fetches
it back in req_done(), then picks ->tag from it and uses p9_tag_lookup() to
find req. Why not simply pass req instead? I had been wrong about that
p9_tag_lookup() being able to return NULL, but why bother with it at all?
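I.e. simply (a sketch - the zero-copy submission path would need the same
change):

	/* p9_virtio_request(): stash the req itself as the cookie */
	err = virtqueue_add_sgs(chan->vq, sgs, out_sgs, in_sgs, req, GFP_ATOMIC);

	/* req_done(): get it straight back, no p9_tag_lookup() round-trip */
	while ((req = virtqueue_get_buf(chan->vq, &len)) != NULL) {
		chan->ring_bufs_avail = 1;
		p9_client_cb(chan->client, req, REQ_STATUS_RCVD);
	}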
On Fri, Jul 03, 2015 at 10:42:10AM +0100, Al Viro wrote:
> AFAICS, we get occasional stray responses from somewhere. And no, it doesn't
> seem to be related to flushes or to dropping chan->lock in req_done() (this
> run had been with chan->lock taken on the outside of the loop).
>
> What I really don't understand is WTF is it playing with p9_tag_lookup() -
> it's stashing req->tc via virtqueue_add_sgs() opaque data argument, fetches
> it back in req_done(), then picks ->tag from it and uses p9_tag_lookup() to
> find req. Why not simply pass req instead? I had been wrong about that
> p9_tag_lookup() being able to return NULL, but why bother with it at all?
Got it. What happens is that on the zero-copy path a signal hitting at the
end of p9_virtio_zc_request() is treated as "it hadn't been sent, got
an error, fuck off and mark the tag ready for reuse". No TFLUSH issued,
etc. As the result, when reply finally *does* arrive (we had actually
sent the request), it plays hell on the entire thing - tag might very
well have been reused by then and an unrelated request sent with the
same tag. Depending on the timing, results can get rather ugly.
There are still other bogosities found in this thread, and at the very
least we need to cope with a genuinely corrupted response from the server, but
the patch below fixes the problem with stray responses and stops the
"what do you mean, you'd written 4K? I've only sent 30 bytes!" problems
here. 10 minutes of trinity running without triggering it, while without
that patch it triggers in 2-3 minutes.
Could you verify that the patch below deals with your setup as well?
If it does, I'm going to put it into tonight's pull request, after I get
some sleep... Right now I'm about to crawl in direction of bed - 25 hours
of uptime is a bit too much... ;-/
diff --git a/net/9p/client.c b/net/9p/client.c
index 6f4c4c8..8c4941d 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -843,7 +843,8 @@ static struct p9_req_t *p9_client_zc_rpc(struct p9_client *c, int8_t type,
if (err < 0) {
if (err == -EIO)
c->status = Disconnected;
- goto reterr;
+ if (err != -ERESTARTSYS)
+ goto reterr;
}
if (req->status == REQ_STATUS_ERROR) {
p9_debug(P9_DEBUG_ERROR, "req_status error %d\n", req->t_err);
@@ -1647,7 +1648,10 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
if (*err) {
trace_9p_protocol_dump(clnt, req->rc);
p9_free_req(clnt, req);
+ break;
}
+ if (rsize < count)
+ pr_err("mismatched reply [tag = %d]\n", req->tc->tag);
p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count);
2015-07-03 18:00 GMT+03:00 Al Viro <[email protected]>:
> Could you verify that the patch below deals with your setup as well?
> If it does, I'm going to put it into tonight's pull request, after I get
> some sleep... Right now I'm about to crawl in direction of bed - 25 hours
> of uptime is a bit too much... ;-/
>
Works for me.
Tested-by: Andrey Ryabinin <[email protected]>