2018-07-16 06:00:10

by syzbot

[permalink] [raw]
Subject: BUG: corrupted list in p9_read_work

Hello,

syzbot found the following crash on:

HEAD commit: 37b5dca2898d Merge tag 'rtc-4.18-2' of git://git.kernel.or..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=12b3e770400000
kernel config: https://syzkaller.appspot.com/x/.config?x=25856fac4e580aa7
dashboard link: https://syzkaller.appspot.com/bug?extid=2222c34dc40b515f30dc
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: [email protected]

list_del corruption, ffff8801931d8e28->next is LIST_POISON1
(dead000000000100)
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:47!
invalid opcode: 0000 [#1] SMP KASAN
CPU: 0 PID: 27 Comm: kworker/0:2 Not tainted 4.18.0-rc4+ #148
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Workqueue: events p9_read_work
RIP: 0010:__list_del_entry_valid.cold.1+0x26/0x58 lib/list_debug.c:45
Code: 02 fe 0f 0b 4c 89 e2 48 89 de 48 c7 c7 60 68 1a 88 e8 ba 69 02 fe 0f
0b 4c 89 ea 48 89 de 48 c7 c7 00 68 1a 88 e8 a6 69 02 fe <0f> 0b 48 89 de
48 c7 c7 20 69 1a 88 e8 95 69 02 fe 0f 0b 48 89 de
RSP: 0018:ffff8801d94575b0 EFLAGS: 00010286
RAX: 000000000000004e RBX: ffff8801931d8e28 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff81631821 RDI: 0000000000000001
RBP: ffff8801d94575c8 R08: ffff8801d9448700 R09: ffffed003b5c4fc0
R10: ffffed003b5c4fc0 R11: ffff8801dae27e07 R12: dead000000000200
R13: dead000000000100 R14: ffff8801931d8e00 R15: ffff8801996f7a10
FS: 0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9e4e7e0000 CR3: 00000001aeeae000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__list_del_entry include/linux/list.h:117 [inline]
list_del include/linux/list.h:125 [inline]
p9_read_work+0xa29/0x1060 net/9p/trans_fd.c:375
process_one_work+0xc73/0x1ba0 kernel/workqueue.c:2153
worker_thread+0x189/0x13c0 kernel/workqueue.c:2296
kthread+0x345/0x410 kernel/kthread.c:246
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
Modules linked in:
Dumping ftrace buffer:
(ftrace buffer empty)
---[ end trace 4e7d7b3b542c6c1e ]---
RIP: 0010:__list_del_entry_valid.cold.1+0x26/0x58 lib/list_debug.c:45
Code: 02 fe 0f 0b 4c 89 e2 48 89 de 48 c7 c7 60 68 1a 88 e8 ba 69 02 fe 0f
0b 4c 89 ea 48 89 de 48 c7 c7 00 68 1a 88 e8 a6 69 02 fe <0f> 0b 48 89 de
48 c7 c7 20 69 1a 88 e8 95 69 02 fe 0f 0b 48 89 de
RSP: 0018:ffff8801d94575b0 EFLAGS: 00010286
RAX: 000000000000004e RBX: ffff8801931d8e28 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff81631821 RDI: 0000000000000001
RBP: ffff8801d94575c8 R08: ffff8801d9448700 R09: ffffed003b5c4fc0
R10: ffffed003b5c4fc0 R11: ffff8801dae27e07 R12: dead000000000200
R13: dead000000000100 R14: ffff8801931d8e00 R15: ffff8801996f7a10
FS: 0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9e4e7e0000 CR3: 00000001aeeae000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
syzbot.


2018-10-09 01:08:59

by syzbot

[permalink] [raw]
Subject: Re: BUG: corrupted list in p9_read_work

syzbot has found a reproducer for the following crash on:

HEAD commit: 0854ba5ff5c9 Merge git://git.kernel.org/pub/scm/linux/kern..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1514ec06400000
kernel config: https://syzkaller.appspot.com/x/.config?x=88e9a8a39dc0be2d
dashboard link: https://syzkaller.appspot.com/bug?extid=2222c34dc40b515f30dc
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10b91685400000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: [email protected]

FS-Cache: N-cookie d=000000000a092700 n=00000000d8ee0022
FS-Cache: N-key=[10] '34323935303034313132'
list_del corruption, ffff88019ae36ee8->next is LIST_POISON1
(dead000000000100)
------------[ cut here ]------------
kobject: '9p-11043': free name
kernel BUG at lib/list_debug.c:47!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 2686 Comm: kworker/1:2 Not tainted 4.19.0-rc7+ #274
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
kobject: '9p-11049' (0000000096206f7a): kobject_add_internal:
parent: 'bdi', set: 'devices'
Workqueue: events p9_read_work
RIP: 0010:__list_del_entry_valid.cold.1+0x26/0x58 lib/list_debug.c:45
Code: d7 fd 0f 0b 4c 89 e2 48 89 de 48 c7 c7 40 92 40 88 e8 7a a2 d7 fd 0f
0b 4c 89 ea 48 89 de 48 c7 c7 e0 91 40 88 e8 66 a2 d7 fd <0f> 0b 48 89 de
48 c7 c7 00 93 40 88 e8 55 a2 d7 fd 0f 0b 48 89 de
RSP: 0018:ffff8801cc5975b8 EFLAGS: 00010282
kobject: '9p-11049' (0000000096206f7a): kobject_uevent_env
RAX: 000000000000004e RBX: ffff88019ae36ee8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff81650405 RDI: 0000000000000005
RBP: ffff8801cc5975d0 R08: ffff8801cc58a4c0 R09: ffffed003b5e4fe8
R10: ffffed003b5e4fe8 R11: ffff8801daf27f47 R12: dead000000000200
R13: dead000000000100 R14: ffff8801c8931050 R15: ffff8801c8931010
FS: 0000000000000000(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fef196c1000 CR3: 00000001ccff7000 CR4: 00000000001406e0
Call Trace:
kobject: '9p-11049' (0000000096206f7a): fill_kobj_path: path
= '/devices/virtual/bdi/9p-11049'
__list_del_entry include/linux/list.h:117 [inline]
list_del include/linux/list.h:125 [inline]
p9_read_work+0xab6/0x10e0 net/9p/trans_fd.c:379
kobject: 'loop4' (00000000513f3e2f): kobject_uevent_env
FS-Cache: Duplicate cookie detected
process_one_work+0xc90/0x1b90 kernel/workqueue.c:2153
FS-Cache: O-cookie c=00000000911358e4 [p=000000006545c95d fl=222 nc=0 na=1]
FS-Cache: O-cookie d=000000000a092700 n=000000007635356b
FS-Cache: O-key=[10] '
34
32
39
35
30
30
34
31
32
worker_thread+0x17f/0x1390 kernel/workqueue.c:2296
36
'
FS-Cache: N-cookie c=00000000abaeee81 [p=000000006545c95d fl=2 nc=0 na=1]
FS-Cache: N-cookie d=000000000a092700 n=00000000ee16a363
FS-Cache: N-key=[10] '
34
32
39
35
30
30
34
kthread+0x35a/0x420 kernel/kthread.c:246
31
32
36
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413
'
Modules linked in:
---[ end trace 41e06641f5c3c814 ]---
kobject: '9p-11050' (000000002a096aa2): kobject_add_internal:
parent: 'bdi', set: 'devices'
RIP: 0010:__list_del_entry_valid.cold.1+0x26/0x58 lib/list_debug.c:45
Code: d7 fd 0f 0b 4c 89 e2 48 89 de 48 c7 c7 40 92 40 88 e8 7a a2 d7 fd 0f
0b 4c 89 ea 48 89 de 48 c7 c7 e0 91 40 88 e8 66 a2 d7 fd <0f> 0b 48 89 de
48 c7 c7 00 93 40 88 e8 55 a2 d7 fd 0f 0b 48 89 de
RSP: 0018:ffff8801cc5975b8 EFLAGS: 00010282
RAX: 000000000000004e RBX: ffff88019ae36ee8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff81650405 RDI: 0000000000000005
RBP: ffff8801cc5975d0 R08: ffff8801cc58a4c0 R09: ffffed003b5e4fe8
R10: ffffed003b5e4fe8 R11: ffff8801daf27f47 R12: dead000000000200
R13: dead000000000100 R14: ffff8801c8931050 R15: ffff8801c8931010
FS: 0000000000000000(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fef196c1000 CR3: 00000001ccff7000 CR4: 00000000001406e0


2018-10-09 02:11:52

by Dominique Martinet

[permalink] [raw]
Subject: Re: BUG: corrupted list in p9_read_work

syzbot wrote on Mon, Oct 08, 2018:
> syzbot has found a reproducer for the following crash on:
>
> HEAD commit: 0854ba5ff5c9 Merge git://git.kernel.org/pub/scm/linux/kern..
> git tree: upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1514ec06400000
> kernel config: https://syzkaller.appspot.com/x/.config?x=88e9a8a39dc0be2d
> dashboard link: https://syzkaller.appspot.com/bug?extid=2222c34dc40b515f30dc
> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10b91685400000
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: [email protected]
>
> list_del corruption, ffff88019ae36ee8->next is LIST_POISON1
> (dead000000000100)
> ------------[ cut here ]------------
> [...]
> list_del include/linux/list.h:125 [inline]
> p9_read_work+0xab6/0x10e0 net/9p/trans_fd.c:379

Hmm this looks very much like the report from
[email protected]
which should have been fixed by Tomas in 9f476d7c540cb
("net/9p/trans_fd.c: fix race by holding the lock")...

It looks like another double list_del, looking at the code again there
actually are other ways this could happen around connection errors.
For example,
- p9_read_work receives something and lookup works... meanwhile
- p9_write_work fails to write and calls p9_conn_cancel, which deletes
from the req_list without waiting for other works to finish (could also
happen in p9_poll_mux)
- p9_read_work finishes processing the read and deletes from list again

For this one the simplest fix would probably be to just not
list_del/call p9_client_cb at all if m->r?req->status isn't
REQ_STATUS_ERROR in p9_read_work after the "got new packet" debug print,
and frankly I think that's saner so I'll send a patch shortly doing
that, but I have zero confidence there aren't similar bugs around, the
tcp code is so messy... Most of the syzbot reports recently have been
around trans_fd which I don't think is used much in real life, and this
is not really motivating (i.e. I think it would probably need a more
extensive rewrite but nobody cares) :/


Dmitry, on that note, do you think syzbot could possibly test other
transports somehow? rdma or virtio cannot be faked as easily as passing
a fd around, but I'd be very interested in seeing these flayed a bit.

(I'm also curious what logic is used to generate the syz tests, the
write$P9_Rxx replies have nothing to do with what the client would
expect so it probably doesn't test very far; this test in particular
does not even get past the initial P9_TVERSION that the client would
expect immediately after mount, so it's basically only testing logic
around packet handling on error... Or if we're accepting a RREADDIR in
reply to TVERSION we have bigger problems, and now I'm looking at it I
think we just might never check that....... I'll look at that for the
next cycle)


Back to the current patch, since as I said I am not confident this is a
good enough fix for the current bug, will I get notified if the bug
happens again once the patch hits linux-next with the Reported-by tag ?
(I don't have the setup necessary to run a syz repro as there is no C
repro, and won't have much time to do that setup sorry)


> FS-Cache: N-cookie d=000000000a092700 n=00000000d8ee0022
> FS-Cache: N-key=[10] '34323935303034313132'
> FS-Cache: Duplicate cookie detected
> FS-Cache: O-cookie c=00000000911358e4 [p=000000006545c95d fl=222 nc=0 na=1]
> FS-Cache: O-cookie d=000000000a092700 n=000000007635356b
> FS-Cache: O-key=[10] '
> [...]

(on an unrelated topic, I got these FS-Cache warnings quite often when
testing with cache enabled and have no idea what they mean. I don't
normally use cache so haven't spent time looking at it, but I find these
rather worrying... If someone having a clue reads this, I'd love to hear
what they could mean and what we should look at)

Thanks,
--
Dominique

2018-10-09 04:06:49

by Dominique Martinet

[permalink] [raw]
Subject: [PATCH 2/2] 9p/trans_fd: put worker reqs on destroy

From: Dominique Martinet <[email protected]>

p9_read_work/p9_write_work might still hold references to a req after
having been cancelled; make sure we put any of these to avoid potential
request leak on disconnect.

Fixes: 728356dedeff8 ("9p: Add refcount to p9_req_t")
Signed-off-by: Dominique Martinet <[email protected]>
Cc: Eric Van Hensbergen <[email protected]>
Cc: Latchesar Ionkov <[email protected]>
Cc: Tomas Bortoli <[email protected]>
---
Noticed we could leak a ref while looking at the syzbot report,
this should be safe enough after the work has been cancelled...
Probably.

net/9p/trans_fd.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index a0317d459cde..f868cf6fba79 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -876,7 +876,15 @@ static void p9_conn_destroy(struct p9_conn *m)

p9_mux_poll_stop(m);
cancel_work_sync(&m->rq);
+ if (m->rreq) {
+ p9_req_put(m->rreq);
+ m->rreq = NULL;
+ }
cancel_work_sync(&m->wq);
+ if (m->wreq) {
+ p9_req_put(m->wreq);
+ m->wreq = NULL;
+ }

p9_conn_cancel(m, -ECONNRESET);

--
2.19.1


2018-10-09 04:07:20

by Dominique Martinet

[permalink] [raw]
Subject: [PATCH 1/2] 9p/trans_fd: abort p9_read_work if req status changed

From: Dominique Martinet <[email protected]>

p9_read_work would try to handle an errored req even if it got put to
error state by another thread between the lookup (that worked) and the
time it had been fully read.
The request itself is safe to use because we hold a ref to it from the
lookup (for m->rreq, so it was safe to read into the request data buffer
until this point), but the req_list has been deleted at the same time
status changed, and client_cb already has been called as well, so we
should not do either.

Signed-off-by: Dominique Martinet <[email protected]>
Reported-by: [email protected]
Cc: Eric Van Hensbergen <[email protected]>
Cc: Latchesar Ionkov <[email protected]>
---

As written in reply to the bug report I'm not sure it's what syzbot
complained about exactly, but it feels like a correct thing to do.
The second patch is unrelated to the syzbot report, but something that
occured to me while looking at this ; I'll take both into linux-next
around the start of next week after getting some proper testing done
unless remarks happen.
(they pass basic tests already)

net/9p/trans_fd.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index 12559c474dde..a0317d459cde 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -292,7 +292,6 @@ static void p9_read_work(struct work_struct *work)
__poll_t n;
int err;
struct p9_conn *m;
- int status = REQ_STATUS_ERROR;

m = container_of(work, struct p9_conn, rq);

@@ -375,11 +374,17 @@ static void p9_read_work(struct work_struct *work)
p9_debug(P9_DEBUG_TRANS, "got new packet\n");
m->rreq->rc.size = m->rc.offset;
spin_lock(&m->client->lock);
- if (m->rreq->status != REQ_STATUS_ERROR)
- status = REQ_STATUS_RCVD;
- list_del(&m->rreq->req_list);
- /* update req->status while holding client->lock */
- p9_client_cb(m->client, m->rreq, status);
+ if (m->rreq->status == REQ_STATUS_SENT) {
+ list_del(&m->rreq->req_list);
+ p9_client_cb(m->client, m->rreq, REQ_STATUS_RCVD);
+ } else {
+ spin_unlock(&m->client->lock);
+ p9_debug(P9_DEBUG_ERROR,
+ "Request tag %d errored out while we were reading the reply\n",
+ m->rc.tag);
+ err = -EIO;
+ goto error;
+ }
spin_unlock(&m->client->lock);
m->rc.sdata = NULL;
m->rc.offset = 0;
--
2.19.1


2018-10-09 13:21:59

by Tomas Bortoli

[permalink] [raw]
Subject: Re: [PATCH 2/2] 9p/trans_fd: put worker reqs on destroy

Il giorno mar 9 ott 2018 alle ore 06:06 Dominique Martinet
<[email protected]> ha scritto:
>
> From: Dominique Martinet <[email protected]>
>
> p9_read_work/p9_write_work might still hold references to a req after
> having been cancelled; make sure we put any of these to avoid potential
> request leak on disconnect.
>
> Fixes: 728356dedeff8 ("9p: Add refcount to p9_req_t")
> Signed-off-by: Dominique Martinet <[email protected]>
> Cc: Eric Van Hensbergen <[email protected]>
> Cc: Latchesar Ionkov <[email protected]>
> Cc: Tomas Bortoli <[email protected]>
Reviewed-by: Tomas Bortoli <[email protected]>
> ---
> Noticed we could leak a ref while looking at the syzbot report,
> this should be safe enough after the work has been cancelled...
> Probably.
>
> net/9p/trans_fd.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
> index a0317d459cde..f868cf6fba79 100644
> --- a/net/9p/trans_fd.c
> +++ b/net/9p/trans_fd.c
> @@ -876,7 +876,15 @@ static void p9_conn_destroy(struct p9_conn *m)
>
> p9_mux_poll_stop(m);
> cancel_work_sync(&m->rq);
> + if (m->rreq) {
> + p9_req_put(m->rreq);
> + m->rreq = NULL;
> + }
> cancel_work_sync(&m->wq);
> + if (m->wreq) {
> + p9_req_put(m->wreq);
> + m->wreq = NULL;
> + }
>
> p9_conn_cancel(m, -ECONNRESET);
>
> --
> 2.19.1
>

LGTM
--
Tomas

2018-10-10 14:05:19

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: BUG: corrupted list in p9_read_work

On Tue, Oct 9, 2018 at 4:09 AM, Dominique Martinet
<[email protected]> wrote:
> syzbot wrote on Mon, Oct 08, 2018:
>> syzbot has found a reproducer for the following crash on:
>>
>> HEAD commit: 0854ba5ff5c9 Merge git://git.kernel.org/pub/scm/linux/kern..
>> git tree: upstream
>> console output: https://syzkaller.appspot.com/x/log.txt?x=1514ec06400000
>> kernel config: https://syzkaller.appspot.com/x/.config?x=88e9a8a39dc0be2d
>> dashboard link: https://syzkaller.appspot.com/bug?extid=2222c34dc40b515f30dc
>> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10b91685400000
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: [email protected]
>>
>> list_del corruption, ffff88019ae36ee8->next is LIST_POISON1
>> (dead000000000100)
>> ------------[ cut here ]------------
>> [...]
>> list_del include/linux/list.h:125 [inline]
>> p9_read_work+0xab6/0x10e0 net/9p/trans_fd.c:379
>
> Hmm this looks very much like the report from
> [email protected]
> which should have been fixed by Tomas in 9f476d7c540cb
> ("net/9p/trans_fd.c: fix race by holding the lock")...
>
> It looks like another double list_del, looking at the code again there
> actually are other ways this could happen around connection errors.
> For example,
> - p9_read_work receives something and lookup works... meanwhile
> - p9_write_work fails to write and calls p9_conn_cancel, which deletes
> from the req_list without waiting for other works to finish (could also
> happen in p9_poll_mux)
> - p9_read_work finishes processing the read and deletes from list again
>
> For this one the simplest fix would probably be to just not
> list_del/call p9_client_cb at all if m->r?req->status isn't
> REQ_STATUS_ERROR in p9_read_work after the "got new packet" debug print,
> and frankly I think that's saner so I'll send a patch shortly doing
> that, but I have zero confidence there aren't similar bugs around, the
> tcp code is so messy... Most of the syzbot reports recently have been
> around trans_fd which I don't think is used much in real life, and this
> is not really motivating (i.e. I think it would probably need a more
> extensive rewrite but nobody cares) :/
>
>
> Dmitry, on that note, do you think syzbot could possibly test other
> transports somehow? rdma or virtio cannot be faked as easily as passing
> a fd around, but I'd be very interested in seeing these flayed a bit.

Hi Dominique,

How can they be faked?
If we could create a private rdma/virtio stub instance per test
process, then we could I think easily use that instance for 9p. But is
it possible?
Testing on real hardware is mostly outside of our priorities at the
moment. I mean syzkaller itself can be run on anything, and one could
extend descriptions to use a known rdma interface and run on a real
hardware. But we can't afford this at the moment.
As far as I understand RDMA maintainers run syzkaller on real
hardware, but I don't know if they are up to including 9p into
testing. +Leon

2018-10-10 14:30:34

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: BUG: corrupted list in p9_read_work

On Tue, Oct 9, 2018 at 4:09 AM, Dominique Martinet
<[email protected]> wrote:
> syzbot wrote on Mon, Oct 08, 2018:
>> syzbot has found a reproducer for the following crash on:
>>
>> HEAD commit: 0854ba5ff5c9 Merge git://git.kernel.org/pub/scm/linux/kern..
>> git tree: upstream
>> console output: https://syzkaller.appspot.com/x/log.txt?x=1514ec06400000
>> kernel config: https://syzkaller.appspot.com/x/.config?x=88e9a8a39dc0be2d
>> dashboard link: https://syzkaller.appspot.com/bug?extid=2222c34dc40b515f30dc
>> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10b91685400000
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: [email protected]
>>
>> list_del corruption, ffff88019ae36ee8->next is LIST_POISON1
>> (dead000000000100)
>> ------------[ cut here ]------------
>> [...]
>> list_del include/linux/list.h:125 [inline]
>> p9_read_work+0xab6/0x10e0 net/9p/trans_fd.c:379
>
> Hmm this looks very much like the report from
> [email protected]
> which should have been fixed by Tomas in 9f476d7c540cb
> ("net/9p/trans_fd.c: fix race by holding the lock")...
>
> It looks like another double list_del, looking at the code again there
> actually are other ways this could happen around connection errors.
> For example,
> - p9_read_work receives something and lookup works... meanwhile
> - p9_write_work fails to write and calls p9_conn_cancel, which deletes
> from the req_list without waiting for other works to finish (could also
> happen in p9_poll_mux)
> - p9_read_work finishes processing the read and deletes from list again
>
> For this one the simplest fix would probably be to just not
> list_del/call p9_client_cb at all if m->r?req->status isn't
> REQ_STATUS_ERROR in p9_read_work after the "got new packet" debug print,
> and frankly I think that's saner so I'll send a patch shortly doing
> that, but I have zero confidence there aren't similar bugs around, the
> tcp code is so messy... Most of the syzbot reports recently have been
> around trans_fd which I don't think is used much in real life, and this
> is not really motivating (i.e. I think it would probably need a more
> extensive rewrite but nobody cares) :/
>
>
> Dmitry, on that note, do you think syzbot could possibly test other
> transports somehow? rdma or virtio cannot be faked as easily as passing
> a fd around, but I'd be very interested in seeing these flayed a bit.
>
> (I'm also curious what logic is used to generate the syz tests, the
> write$P9_Rxx replies have nothing to do with what the client would
> expect so it probably doesn't test very far; this test in particular
> does not even get past the initial P9_TVERSION that the client would
> expect immediately after mount, so it's basically only testing logic
> around packet handling on error... Or if we're accepting a RREADDIR in
> reply to TVERSION we have bigger problems, and now I'm looking at it I
> think we just might never check that....... I'll look at that for the
> next cycle)
>
>
> Back to the current patch, since as I said I am not confident this is a
> good enough fix for the current bug, will I get notified if the bug
> happens again once the patch hits linux-next with the Reported-by tag ?
> (I don't have the setup necessary to run a syz repro as there is no C
> repro, and won't have much time to do that setup sorry)

Yes, the bug will be reported again if it still happens after the
patch is merged (not just into linux-next, but into all tested trees,
but it does not matter much). So marking bugs as fixed tentatively is
fine if that's our best guess.

But note that syzbot can test fixes itself on request. It boils down
to just giving it the patch and the base tree:
https://github.com/google/syzkaller/blob/master/docs/syzbot.md#testing-patches

2018-10-10 14:42:09

by Dominique Martinet

[permalink] [raw]
Subject: Re: BUG: corrupted list in p9_read_work

Dmitry Vyukov wrote on Wed, Oct 10, 2018:
> How can they be faked?
> If we could create a private rdma/virtio stub instance per test
> process, then we could I think easily use that instance for 9p. But is
> it possible?

"RDMA" itself can be faked pretty easily nowadays, there's a "rxe"
driver that is soft RDMA over ethernet and can run over anything.

The problem is that you can't just give the client a file like trans fd;
you'd need to open an ""rdma socket"" (simplifying wording a bit), and
afaik there is no standard tool for it ; or rather, the problem is that
RDMA is packet based so even if there were you can't just write stuff
in a fd and hope it'll work, so you need a server.

If you're interested, 9p is trivial enough that I could provide you with
a trivial server that works like your file (just need to reimplement
something that parses header to packetize it properly; so you could
write to its stdin for example) ; that'd require some setup in the VM
(configure rxe and install that tool), but it would definitely be
possible.
What do you think ?


For virtio, I'm not as familiar with the environment so I do not know if
there are ways to fake it as easily unfortunately.

> Testing on real hardware is mostly outside of our priorities at the
> moment. I mean syzkaller itself can be run on anything, and one could
> extend descriptions to use a known rdma interface and run on a real
> hardware. But we can't afford this at the moment.

Sure, I understand that.

> As far as I understand RDMA maintainers run syzkaller on real
> hardware, but I don't know if they are up to including 9p into
> testing. +Leon

I'd be interested in knowing what kind of tests runs there :)

--
Dominique Martinet

2018-10-10 14:44:57

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: BUG: corrupted list in p9_read_work

On Tue, Oct 9, 2018 at 4:09 AM, Dominique Martinet
<[email protected]> wrote:
> syzbot wrote on Mon, Oct 08, 2018:
>> syzbot has found a reproducer for the following crash on:
>>
>> HEAD commit: 0854ba5ff5c9 Merge git://git.kernel.org/pub/scm/linux/kern..
>> git tree: upstream
>> console output: https://syzkaller.appspot.com/x/log.txt?x=1514ec06400000
>> kernel config: https://syzkaller.appspot.com/x/.config?x=88e9a8a39dc0be2d
>> dashboard link: https://syzkaller.appspot.com/bug?extid=2222c34dc40b515f30dc
>> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10b91685400000
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: [email protected]
>>
>> list_del corruption, ffff88019ae36ee8->next is LIST_POISON1
>> (dead000000000100)
>> ------------[ cut here ]------------
>> [...]
>> list_del include/linux/list.h:125 [inline]
>> p9_read_work+0xab6/0x10e0 net/9p/trans_fd.c:379
>
> Hmm this looks very much like the report from
> [email protected]
> which should have been fixed by Tomas in 9f476d7c540cb
> ("net/9p/trans_fd.c: fix race by holding the lock")...
>
> It looks like another double list_del, looking at the code again there
> actually are other ways this could happen around connection errors.
> For example,
> - p9_read_work receives something and lookup works... meanwhile
> - p9_write_work fails to write and calls p9_conn_cancel, which deletes
> from the req_list without waiting for other works to finish (could also
> happen in p9_poll_mux)
> - p9_read_work finishes processing the read and deletes from list again
>
> For this one the simplest fix would probably be to just not
> list_del/call p9_client_cb at all if m->r?req->status isn't
> REQ_STATUS_ERROR in p9_read_work after the "got new packet" debug print,
> and frankly I think that's saner so I'll send a patch shortly doing
> that, but I have zero confidence there aren't similar bugs around, the
> tcp code is so messy... Most of the syzbot reports recently have been
> around trans_fd which I don't think is used much in real life, and this
> is not really motivating (i.e. I think it would probably need a more
> extensive rewrite but nobody cares) :/
>
>
> Dmitry, on that note, do you think syzbot could possibly test other
> transports somehow? rdma or virtio cannot be faked as easily as passing
> a fd around, but I'd be very interested in seeing these flayed a bit.
>
> (I'm also curious what logic is used to generate the syz tests, the
> write$P9_Rxx replies have nothing to do with what the client would
> expect so it probably doesn't test very far; this test in particular
> does not even get past the initial P9_TVERSION that the client would
> expect immediately after mount, so it's basically only testing logic
> around packet handling on error... Or if we're accepting a RREADDIR in
> reply to TVERSION we have bigger problems, and now I'm looking at it I
> think we just might never check that....... I'll look at that for the
> next cycle)

Good question.

It's a mix of dumb and not-so-dumb.
First we have descriptions of kernel interface, here are 9p ones:
https://github.com/google/syzkaller/blob/master/sys/linux/9p.txt
These descriptions allows to generate primitively meaningful things
(e.g. proper struct layout). They also capture some interrelations
between calls. For example, you can see these "resource rfd9p" and
"resource wfd9p" at the top, these as "fd subtypes", and descriptions
capture what produces these resources as output and what consumes
these resources as input. For example, rfd9p is produced by pipe and
consumed by mount, so we know that these calls need to be called in
that order. But this does not work too well for, for example, 9p
message tags/types, because we don't know what exactly message type we
will read out and these tags expire after reply.

Second, syzkaller uses code coverage as guidance. So as soon as it
learns to do proper handshake, it sees new coverage and memorizes this
program as useful and tries to extend it more in future. Later it
learns how to create a single file, sees new coverage, memorizes, etc.
This allows it to incrementally build more and more complex programs
over time.

You can see current code coverage it achieved here (in cover column):
https://syzkaller.appspot.com/#managers
e.g. (note: 80MB file):
https://storage.googleapis.com/syzkaller/cover/ci-upstream-kasan-gce-root.html

As far as I see it did some non-trivial progress for 9p subsystem. I
don't know if it reached everything reachable or not, though.

2018-10-10 14:51:10

by syzbot

[permalink] [raw]
Subject: Re: Re: BUG: corrupted list in p9_read_work

> Dmitry Vyukov wrote on Wed, Oct 10, 2018:
>> > Back to the current patch, since as I said I am not confident this is a
>> > good enough fix for the current bug, will I get notified if the bug
>> > happens again once the patch hits linux-next with the Reported-by tag ?
>> > (I don't have the setup necessary to run a syz repro as there is no C
>> > repro, and won't have much time to do that setup sorry)

>> Yes, the bug will be reported again if it still happens after the
>> patch is merged (not just into linux-next, but into all tested trees,
>> but it does not matter much). So marking bugs as fixed tentatively is
>> fine if that's our best guess.

> Ok, thanks for confirming...

>> But note that syzbot can test fixes itself on request. It boils down
>> to just giving it the patch and the base tree:
>> https://github.com/google/syzkaller/blob/master/docs/syzbot.md#testing-patches

> .. and for clarifying that bit, let's try that! :)

> #syz test: git://github.com/martinetd/linux
> e4ca13f7d075e551dc158df6af18fb412a1dba0a

"git://github.com/martinetd/linux" does not look like a valid git repo
address.


> --
> Dominique

2018-10-10 14:51:17

by Dominique Martinet

[permalink] [raw]
Subject: Re: BUG: corrupted list in p9_read_work

Dmitry Vyukov wrote on Wed, Oct 10, 2018:
> > Back to the current patch, since as I said I am not confident this is a
> > good enough fix for the current bug, will I get notified if the bug
> > happens again once the patch hits linux-next with the Reported-by tag ?
> > (I don't have the setup necessary to run a syz repro as there is no C
> > repro, and won't have much time to do that setup sorry)
>
> Yes, the bug will be reported again if it still happens after the
> patch is merged (not just into linux-next, but into all tested trees,
> but it does not matter much). So marking bugs as fixed tentatively is
> fine if that's our best guess.

Ok, thanks for confirming...

> But note that syzbot can test fixes itself on request. It boils down
> to just giving it the patch and the base tree:
> https://github.com/google/syzkaller/blob/master/docs/syzbot.md#testing-patches

.. and for clarifying that bit, let's try that! :)

#syz test: git://github.com/martinetd/linux e4ca13f7d075e551dc158df6af18fb412a1dba0a

--
Dominique

2018-10-10 14:54:11

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: BUG: corrupted list in p9_read_work

On Wed, Oct 10, 2018 at 4:40 PM, Dominique Martinet
<[email protected]> wrote:
> Dmitry Vyukov wrote on Wed, Oct 10, 2018:
>> How can they be faked?
>> If we could create a private rdma/virtio stub instance per test
>> process, then we could I think easily use that instance for 9p. But is
>> it possible?
>
> "RDMA" itself can be faked pretty easily nowadays, there's a "rxe"
> driver that is soft RDMA over ethernet and can run over anything.
>
> The problem is that you can't just give the client a file like trans fd;
> you'd need to open an ""rdma socket"" (simplifying wording a bit), and
> afaik there is no standard tool for it ; or rather, the problem is that
> RDMA is packet based so even if there were you can't just write stuff
> in a fd and hope it'll work, so you need a server.
>
> If you're interested, 9p is trivial enough that I could provide you with
> a trivial server that works like your file (just need to reimplement
> something that parses header to packetize it properly; so you could
> write to its stdin for example) ; that'd require some setup in the VM
> (configure rxe and install that tool), but it would definitely be
> possible.
> What do you think ?

I would like to hear more details.
Opening a socket is not a problem. Why do we need a tool for this?
I don't understand the problem with "packet-based" and what does it
mean to have a separate server? Any why?
We definitely don't want to involve a separate third-party server,
that's very problematic for multiple reasons. But we can have a chunk
of custom C code inside of syzkaller.
What exactly setup we need?

I guess it will make things simpler if you provide some kind of "hello
world" C program that mounts 9p/rdma. I don't need exact messages
(they will be same as with pipe transport, right?) nor actual server
implementation, but just the place where to inject these packets.

2018-10-10 16:00:35

by Dominique Martinet

[permalink] [raw]
Subject: Re: BUG: corrupted list in p9_read_work

Dmitry Vyukov wrote on Wed, Oct 10, 2018:
> > The problem is that you can't just give the client a file like trans fd;
> > you'd need to open an ""rdma socket"" (simplifying wording a bit), and
> > afaik there is no standard tool for it ; or rather, the problem is that
> > RDMA is packet based so even if there were you can't just write stuff
> > in a fd and hope it'll work, so you need a server.
> >
> > If you're interested, 9p is trivial enough that I could provide you with
> > a trivial server that works like your file (just need to reimplement
> > something that parses header to packetize it properly; so you could
> > write to its stdin for example) ; that'd require some setup in the VM
> > (configure rxe and install that tool), but it would definitely be
> > possible.
> > What do you think ?
>
> I would like to hear more details.
> Opening a socket is not a problem. Why do we need a tool for this?

Sorry, that's my head thinking unixy and piping things :)

> I don't understand the problem with "packet-based" and what does it
> mean to have a separate server? Any why?

Packet-based means you can't just read/write in a fd and expect the
other side to know where to cut the packets to send it to the client,
but if we do it internally there's no problem. We know where to cut.

> We definitely don't want to involve a separate third-party server,
> that's very problematic for multiple reasons. But we can have a chunk
> of custom C code inside of syzkaller.
> What exactly setup we need?

The setup itself isn't that bad, it's actually pretty much trivial - on
a fedora VM I just had to run 'rxe_cfg start ens3' (virtio interface
name) and then the infiniband tools are happy e.g. ibv_devinfo should
list an interface if you have the userspace library that should have
come with rxe_cfg.
(specifically, my VM uses /etc/libibverbs.d/rxe.driver to point to the
lib, and /usr/lib64/libibverbs/librxe-rdmav16.so the lib itself)

Once tools like ibv_devinfo list the interface, it means syzkaller can
use it, and very probably means the kernel can as well; that's it.


> I guess it will make things simpler if you provide some kind of "hello
> world" C program that mounts 9p/rdma. I don't need exact messages
> (they will be same as with pipe transport, right?) nor actual server
> implementation, but just the place where to inject these packets.

That's still the tricky part, I'm afraid... Making a separate server
would have been easy because I could have reused some of my junk for the
actual connection handling (some rdma helper library I wrote ages
ago[1]), but if you're going to just embed C code you'll probably want
something lower level? I've never seen syzkaller use any library call
but I'm not even sure I would know how to create a qp without
libibverbs, would standard stuff be OK ?

I think the interface improved quite a bit since I last looked at it so
I'll need a bit of time to figure it out again but I'll send you a
simple conection with a few messages soonishâ„¢

[1] https://github.com/cea-hpc/mooshika

--
Dominique

2018-10-10 16:03:06

by Dominique Martinet

[permalink] [raw]
Subject: Re: Re: BUG: corrupted list in p9_read_work

syzbot wrote on Wed, Oct 10, 2018:
>>> But note that syzbot can test fixes itself on request. It boils down
>>> to just giving it the patch and the base tree:
>>> https://github.com/google/syzkaller/blob/master/docs/syzbot.md#testing-patches
>
>> .. and for clarifying that bit, let's try that! :)
>
>> #syz test: git://github.com/martinetd/linux
>> e4ca13f7d075e551dc158df6af18fb412a1dba0a
>
> "git://github.com/martinetd/linux" does not look like a valid git
> repo address.

It works though, is it just picky because I didn't end it in .git? let's
try again, sorry for the noise...

#syz test: git://github.com/martinetd/linux.git e4ca13f7d075e551dc158df6af18fb412a1dba0a

--
Dominique

2018-10-10 16:03:07

by syzbot

[permalink] [raw]
Subject: Re: BUG: corrupted list in p9_read_work

Hello,

syzbot tried to test the proposed patch but build/boot failed:

failed to checkout kernel repo git://github.com/martinetd/linux.git on
commit e4ca13f7d075e551dc158df6af18fb412a1dba0a: failed to run
["git" "checkout" "e4ca13f7d075e551dc158df6af18fb412a1dba0a"]: exit status
128
fatal: reference is not a tree: e4ca13f7d075e551dc158df6af18fb412a1dba0a



Tested on:

commit: [unknown]
git tree: git://github.com/martinetd/linux.git
e4ca13f7d075e551dc158df6af18fb412a1dba0a
compiler: gcc (GCC) 8.0.1 20180413 (experimental)


2018-10-10 16:11:06

by Dominique Martinet

[permalink] [raw]
Subject: Re: Re: BUG: corrupted list in p9_read_work

Dominique Martinet wrote on Wed, Oct 10, 2018:
> It works though, is it just picky because I didn't end it in .git? let's
> try again, sorry for the noise...
>
> #syz test: git://github.com/martinetd/linux.git e4ca13f7d075e551dc158df6af18fb412a1dba0a

And I guess the commit hash needs to be in the default clone branch to
work ?
('git fetch <repo> <hash>' happily fetches the commit in a new clone for
me... But that feels like a github specific behaviour maybe)

Oh, well; made a branch for it, last try for me.

#syz test: git://github.com/martinetd/linux.git for-syzbot

--
Dominique

2018-10-10 16:29:51

by syzbot

[permalink] [raw]
Subject: Re: BUG: corrupted list in p9_read_work

Hello,

syzbot has tested the proposed patch and the reproducer did not trigger
crash:

Reported-and-tested-by:
[email protected]

Tested on:

commit: e4ca13f7d075 9p/trans_fd: abort p9_read_work if req status..
git tree: git://github.com/martinetd/linux.git for-syzbot
kernel config: https://syzkaller.appspot.com/x/.config?x=fada1c387645ed03
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

Note: testing is done by a robot and is best-effort only.

2018-10-10 16:37:41

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: Re: BUG: corrupted list in p9_read_work

On Wed, Oct 10, 2018 at 6:10 PM, Dominique Martinet
<[email protected]> wrote:
> Dominique Martinet wrote on Wed, Oct 10, 2018:
>> It works though, is it just picky because I didn't end it in .git? let's
>> try again, sorry for the noise...
>>
>> #syz test: git://github.com/martinetd/linux.git e4ca13f7d075e551dc158df6af18fb412a1dba0a
>
> And I guess the commit hash needs to be in the default clone branch to
> work ?
> ('git fetch <repo> <hash>' happily fetches the commit in a new clone for
> me... But that feels like a github specific behaviour maybe)

yeeeep, this is bug:
https://github.com/google/syzkaller/issues/728

Turns out git fetch of a named remote and just a tree work
differently. The latter only fetches the main branch.

'git fetch <repo> <hash>' is it a thing? Is it something that requires
special server configuration? I remember something similar that wasn't
able to fetch a random commit hash all the time...

The plan was to make a named remote and then fetch it, this should
fetch everything.


> Oh, well; made a branch for it, last try for me.
>
> #syz test: git://github.com/martinetd/linux.git for-syzbot
>
> --
> Dominique

2018-10-10 22:56:38

by Dominique Martinet

[permalink] [raw]
Subject: Re: Re: BUG: corrupted list in p9_read_work

Dmitry Vyukov wrote on Wed, Oct 10, 2018:
> yeeeep, this is bug:
> https://github.com/google/syzkaller/issues/728

Yeah, it makes sense ; I just had to stumble on it once :)

> Turns out git fetch of a named remote and just a tree work
> differently. The latter only fetches the main branch.
>
> 'git fetch <repo> <hash>' is it a thing? Is it something that requires
> special server configuration? I remember something similar that wasn't
> able to fetch a random commit hash all the time...

With my version of git (2.19.1); this works with a local tree (git fetch
/path/to/repo ref) and with a github remote, but not with a gitolite
remote ; I don't think it's safe to assume it'll always work, but it can
work.

> The plan was to make a named remote and then fetch it, this should
> fetch everything.

Yeah, that's less efficient but that'd fetch all named branches at
least; probably safer than what I tried :)

Anyway, the fix seems to be a hit, so cool!

Thanks!
--
Dominique

2018-10-11 12:58:21

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: BUG: corrupted list in p9_read_work

On Wed, Oct 10, 2018 at 5:58 PM, Dominique Martinet
<[email protected]> wrote:
> Dmitry Vyukov wrote on Wed, Oct 10, 2018:
>> > The problem is that you can't just give the client a file like trans fd;
>> > you'd need to open an ""rdma socket"" (simplifying wording a bit), and
>> > afaik there is no standard tool for it ; or rather, the problem is that
>> > RDMA is packet based so even if there were you can't just write stuff
>> > in a fd and hope it'll work, so you need a server.
>> >
>> > If you're interested, 9p is trivial enough that I could provide you with
>> > a trivial server that works like your file (just need to reimplement
>> > something that parses header to packetize it properly; so you could
>> > write to its stdin for example) ; that'd require some setup in the VM
>> > (configure rxe and install that tool), but it would definitely be
>> > possible.
>> > What do you think ?
>>
>> I would like to hear more details.
>> Opening a socket is not a problem. Why do we need a tool for this?
>
> Sorry, that's my head thinking unixy and piping things :)
>
>> I don't understand the problem with "packet-based" and what does it
>> mean to have a separate server? Any why?
>
> Packet-based means you can't just read/write in a fd and expect the
> other side to know where to cut the packets to send it to the client,
> but if we do it internally there's no problem. We know where to cut.

Ah, OK. This should not be a problem because the descriptions:
https://github.com/google/syzkaller/blob/master/sys/linux/9p.txt#L70
know what a packet is. So we always give write a single packet.


>> We definitely don't want to involve a separate third-party server,
>> that's very problematic for multiple reasons. But we can have a chunk
>> of custom C code inside of syzkaller.
>> What exactly setup we need?
>
> The setup itself isn't that bad, it's actually pretty much trivial - on
> a fedora VM I just had to run 'rxe_cfg start ens3' (virtio interface
> name) and then the infiniband tools are happy e.g. ibv_devinfo should
> list an interface if you have the userspace library that should have
> come with rxe_cfg.
> (specifically, my VM uses /etc/libibverbs.d/rxe.driver to point to the
> lib, and /usr/lib64/libibverbs/librxe-rdmav16.so the lib itself)
>
> Once tools like ibv_devinfo list the interface, it means syzkaller can
> use it, and very probably means the kernel can as well; that's it.
>
>
>> I guess it will make things simpler if you provide some kind of "hello
>> world" C program that mounts 9p/rdma. I don't need exact messages
>> (they will be same as with pipe transport, right?) nor actual server
>> implementation, but just the place where to inject these packets.
>
> That's still the tricky part, I'm afraid... Making a separate server
> would have been easy because I could have reused some of my junk for the
> actual connection handling (some rdma helper library I wrote ages
> ago[1]), but if you're going to just embed C code you'll probably want
> something lower level? I've never seen syzkaller use any library call
> but I'm not even sure I would know how to create a qp without
> libibverbs, would standard stuff be OK ?

Raw syscalls preferably.
What does 'rxe_cfg start ens3' do on syscall level? Some netlink?

Any libraries and utilities are hell pain in linux world. Will it work
in Android userspace? gVisor? Who will explain all syzkaller users
where they get this for their who-knows-what distro, which is 10 years
old because of corp policies, and debug how their version of the
library has a slightly incompatible version?
For example, after figuring out that rxe_cfg actually comes from
rdma-core (which is a separate delight on linux), my debian
destribution failed to install it because of some conflicts around
/etc/modprobe.d/mlx4.conf, and my ubuntu distro does not know about
such package. And we've just started :)

Syscalls tend to be simpler and more reliable. If it gives ENOSUPP,
ok, that's it. If it works, great, we can use it.

2018-10-11 13:13:30

by Dominique Martinet

[permalink] [raw]
Subject: Re: BUG: corrupted list in p9_read_work

Dmitry Vyukov wrote on Thu, Oct 11, 2018:
> > That's still the tricky part, I'm afraid... Making a separate server
> > would have been easy because I could have reused some of my junk for the
> > actual connection handling (some rdma helper library I wrote ages
> > ago[1]), but if you're going to just embed C code you'll probably want
> > something lower level? I've never seen syzkaller use any library call
> > but I'm not even sure I would know how to create a qp without
> > libibverbs, would standard stuff be OK ?
>
> Raw syscalls preferably.
> What does 'rxe_cfg start ens3' do on syscall level? Some netlink?

modprobe rdma_rxe (and a bunch of other rdma modules before that) then
writes the interface name in /sys/module/rdma_rxe/parameters/add
apparently; then checks it worked.
this part could be done in C directly without too much trouble, but as
long as the proper kernel configuration/modules are available

> Any libraries and utilities are hell pain in linux world. Will it work
> in Android userspace? gVisor? Who will explain all syzkaller users
> where they get this for their who-knows-what distro, which is 10 years
> old because of corp policies, and debug how their version of the
> library has a slightly incompatible version?
> For example, after figuring out that rxe_cfg actually comes from
> rdma-core (which is a separate delight on linux), my debian
> destribution failed to install it because of some conflicts around
> /etc/modprobe.d/mlx4.conf, and my ubuntu distro does not know about
> such package. And we've just started :)

The rdma ecosystem is a pain, I'll easily agree with that...

> Syscalls tend to be simpler and more reliable. If it gives ENOSUPP,
> ok, that's it. If it works, great, we can use it.

I'll have to look into it a bit more; libibverbs abstracts a lot of
stuff into per-nic userspace drivers (the files I cited in a previous
mail) and basically with the mellanox cards I'm familiar with the whole
user session looks like this:
* common libibverbs/rdmacm code opens /dev/infiniband/rdma_cm and
/dev/infiniband/uverbs0 (plus a bunch of files to figure out abi
version, what user driver to load etc)
* it and the userspace driver issue "commands" over these two files' fd
to setup the connection ; some commands are standard but some are
specific to the interface and defined in the driver.
There are many facets to a connection in RDMA: a protection domain used
to register memory with the nic, a queue pair that is the actual tx/rx
connection, optionally a completion channel that will be another fd to
listen on for events that tell you something happened and finally some
memory regions to directly communicate with the nic from userspace
depending on the specific driver.
* then there's the actual usage, more commands through the uverbs0 char
device to register the memory you'll use, and once that's done it's
entierly up to the driver - for example the mellanox lib can do
everything in userspace playing with the memory regions it registered,
but I'd wager the rxe driver does more calls through the uverbs0 fd...

Honestly I'm not keen on reimplementing all of this; the interface
itself pretty much depends on your version of the kernel (there is a
common ABI defined, but as far as specific nics are concerned if your
kernel module doesn't match the user library version you can get some
nasty surprises), and it's far from the black or white of a good ol'
ENOSUPP error.


I'll look if I can figure out if there is a common subset of verbs
commands that are standard and sufficient to setup a listening
connection and exchange data that should be supported for all devices
and would let us reimplement just that, but while I hear your point
about android and ten years in the future I think it's more likely than
ten years in the future the verb abi will have changed but libibverbs
will just have the new version implemented and hide the change :P

--
Dominique

2018-10-11 13:45:49

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: BUG: corrupted list in p9_read_work

On Thu, Oct 11, 2018 at 3:10 PM, Dominique Martinet
<[email protected]> wrote:
> Dmitry Vyukov wrote on Thu, Oct 11, 2018:
>> > That's still the tricky part, I'm afraid... Making a separate server
>> > would have been easy because I could have reused some of my junk for the
>> > actual connection handling (some rdma helper library I wrote ages
>> > ago[1]), but if you're going to just embed C code you'll probably want
>> > something lower level? I've never seen syzkaller use any library call
>> > but I'm not even sure I would know how to create a qp without
>> > libibverbs, would standard stuff be OK ?
>>
>> Raw syscalls preferably.
>> What does 'rxe_cfg start ens3' do on syscall level? Some netlink?
>
> modprobe rdma_rxe (and a bunch of other rdma modules before that) then
> writes the interface name in /sys/module/rdma_rxe/parameters/add
> apparently; then checks it worked.
> this part could be done in C directly without too much trouble, but as
> long as the proper kernel configuration/modules are available

Now we are talking!
We generally assume that all modules are simply compiled into kernel.
At least that's we have on syzbot. If somebody can't compile them in,
we can suggest to add modprobe into init.
So this boils down to just writing to /sys/module/rdma_rxe/parameters/add.



>> Any libraries and utilities are hell pain in linux world. Will it work
>> in Android userspace? gVisor? Who will explain all syzkaller users
>> where they get this for their who-knows-what distro, which is 10 years
>> old because of corp policies, and debug how their version of the
>> library has a slightly incompatible version?
>> For example, after figuring out that rxe_cfg actually comes from
>> rdma-core (which is a separate delight on linux), my debian
>> destribution failed to install it because of some conflicts around
>> /etc/modprobe.d/mlx4.conf, and my ubuntu distro does not know about
>> such package. And we've just started :)
>
> The rdma ecosystem is a pain, I'll easily agree with that...
>
>> Syscalls tend to be simpler and more reliable. If it gives ENOSUPP,
>> ok, that's it. If it works, great, we can use it.
>
> I'll have to look into it a bit more; libibverbs abstracts a lot of
> stuff into per-nic userspace drivers (the files I cited in a previous
> mail) and basically with the mellanox cards I'm familiar with the whole
> user session looks like this:
> * common libibverbs/rdmacm code opens /dev/infiniband/rdma_cm and
> /dev/infiniband/uverbs0 (plus a bunch of files to figure out abi
> version, what user driver to load etc)
> * it and the userspace driver issue "commands" over these two files' fd
> to setup the connection ; some commands are standard but some are
> specific to the interface and defined in the driver.

But we will use some kind of virtual/stub driver, right? We don't have
real hardware. So all these commands should be fixed and known for the
virtual/stub driver.

> There are many facets to a connection in RDMA: a protection domain used
> to register memory with the nic, a queue pair that is the actual tx/rx
> connection, optionally a completion channel that will be another fd to
> listen on for events that tell you something happened and finally some
> memory regions to directly communicate with the nic from userspace
> depending on the specific driver.
> * then there's the actual usage, more commands through the uverbs0 char
> device to register the memory you'll use, and once that's done it's
> entierly up to the driver - for example the mellanox lib can do
> everything in userspace playing with the memory regions it registered,
> but I'd wager the rxe driver does more calls through the uverbs0 fd...
>
> Honestly I'm not keen on reimplementing all of this; the interface
> itself pretty much depends on your version of the kernel (there is a
> common ABI defined, but as far as specific nics are concerned if your
> kernel module doesn't match the user library version you can get some
> nasty surprises), and it's far from the black or white of a good ol'
> ENOSUPP error.
>
>
> I'll look if I can figure out if there is a common subset of verbs
> commands that are standard and sufficient to setup a listening
> connection and exchange data that should be supported for all devices
> and would let us reimplement just that, but while I hear your point
> about android and ten years in the future I think it's more likely than
> ten years in the future the verb abi will have changed but libibverbs
> will just have the new version implemented and hide the change :P

But again we don't need to support all of the available hardware.
For example, we are testing net stack from external side using tun.
tun is a very simple, virtual abstraction of a network card. It allows
us to test all of generic net stack starting from L2 without messing
with any real drivers and their differences entirely. I had impression
that we are talking about something similar here too. Or not?

Also I am a bit missing context about rdma<->9p interface. Do we need
to setup all these ring buffers to satisfy the parts that 9p needs? Is
it that 9p actually reads data directly from these ring buffers? Or
there is some higher-level rdma interface that 9p uses?

2018-10-11 14:29:06

by Dominique Martinet

[permalink] [raw]
Subject: 9p/RDMA for syzkaller (Was: BUG: corrupted list in p9_read_work)

Dmitry Vyukov wrote on Thu, Oct 11, 2018:
> > Now we are talking!
> > We generally assume that all modules are simply compiled into kernel.
> > At least that's we have on syzbot. If somebody can't compile them in,
> > we can suggest to add modprobe into init.
> > So this boils down to just writing to /sys/module/rdma_rxe/parameters/add.
>
> This fails for me:
>
> root@syzkaller:~# echo -n syz1 > /sys/module/rdma_rxe/parameters/add
> [20992.905406] rdma_rxe: interface syz1 not found
> bash: echo: write error: Invalid argument

Works here, I just did:

[root@f2 ~]# modprobe rdma_rxe
[root@f2 ~]# echo -n ens3 > /sys/module/rdma_rxe/parameters/add

dmesg says:
[ 35.595534] rdma_rxe: set rxe0 active
[ 35.595541] rdma_rxe: added rxe0 to ens3


Actually for a dummy interface if I try the echo directly the echo
works, and a verb device is created, and I just confirmed I can use
it... so not sure why rxe_cfg said EINVAL earlier...

[root@f2 ~]# ip link add dummy0 type dummy
[root@f2 ~]# ip link set dummy0 up
[root@f2 ~]# ip addr add 10.1.1.1/24 dev dummy0
[root@f2 ~]# modprobe rdma_rxe
[root@f2 ~]# echo -n dummy0 > /sys/module/rdma_rxe/parameters/add


(then using my test client:
[root@f2 src]# ./rcat -s
INFO: trans_rdma.c (879), msk_cma_event_handler: CONNECT_REQUEST
INFO: trans_rdma.c (862), msk_cma_event_handler: ESTABLISHED
INFO: trans_rdma.c (917), msk_cma_event_handler: DISCONNECT EVENT...

[root@f2 src]# ./rcat -c 10.1.1.1
INFO: trans_rdma.c (862), msk_cma_event_handler: ESTABLISHED
^C
)


I assume your syz1 interface is a tap device as you were saying earlier?
Got anything in dmesg?

--
Dominique

2018-10-11 14:57:35

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: BUG: corrupted list in p9_read_work

On Thu, Oct 11, 2018 at 3:27 PM, Dmitry Vyukov <[email protected]> wrote:
> On Thu, Oct 11, 2018 at 3:10 PM, Dominique Martinet
> <[email protected]> wrote:
>> Dmitry Vyukov wrote on Thu, Oct 11, 2018:
>>> > That's still the tricky part, I'm afraid... Making a separate server
>>> > would have been easy because I could have reused some of my junk for the
>>> > actual connection handling (some rdma helper library I wrote ages
>>> > ago[1]), but if you're going to just embed C code you'll probably want
>>> > something lower level? I've never seen syzkaller use any library call
>>> > but I'm not even sure I would know how to create a qp without
>>> > libibverbs, would standard stuff be OK ?
>>>
>>> Raw syscalls preferably.
>>> What does 'rxe_cfg start ens3' do on syscall level? Some netlink?
>>
>> modprobe rdma_rxe (and a bunch of other rdma modules before that) then
>> writes the interface name in /sys/module/rdma_rxe/parameters/add
>> apparently; then checks it worked.
>> this part could be done in C directly without too much trouble, but as
>> long as the proper kernel configuration/modules are available
>
> Now we are talking!
> We generally assume that all modules are simply compiled into kernel.
> At least that's we have on syzbot. If somebody can't compile them in,
> we can suggest to add modprobe into init.
> So this boils down to just writing to /sys/module/rdma_rxe/parameters/add.

This fails for me:

root@syzkaller:~# echo -n syz1 > /sys/module/rdma_rxe/parameters/add
[20992.905406] rdma_rxe: interface syz1 not found
bash: echo: write error: Invalid argument



>>> Any libraries and utilities are hell pain in linux world. Will it work
>>> in Android userspace? gVisor? Who will explain all syzkaller users
>>> where they get this for their who-knows-what distro, which is 10 years
>>> old because of corp policies, and debug how their version of the
>>> library has a slightly incompatible version?
>>> For example, after figuring out that rxe_cfg actually comes from
>>> rdma-core (which is a separate delight on linux), my debian
>>> destribution failed to install it because of some conflicts around
>>> /etc/modprobe.d/mlx4.conf, and my ubuntu distro does not know about
>>> such package. And we've just started :)
>>
>> The rdma ecosystem is a pain, I'll easily agree with that...
>>
>>> Syscalls tend to be simpler and more reliable. If it gives ENOSUPP,
>>> ok, that's it. If it works, great, we can use it.
>>
>> I'll have to look into it a bit more; libibverbs abstracts a lot of
>> stuff into per-nic userspace drivers (the files I cited in a previous
>> mail) and basically with the mellanox cards I'm familiar with the whole
>> user session looks like this:
>> * common libibverbs/rdmacm code opens /dev/infiniband/rdma_cm and
>> /dev/infiniband/uverbs0 (plus a bunch of files to figure out abi
>> version, what user driver to load etc)
>> * it and the userspace driver issue "commands" over these two files' fd
>> to setup the connection ; some commands are standard but some are
>> specific to the interface and defined in the driver.
>
> But we will use some kind of virtual/stub driver, right? We don't have
> real hardware. So all these commands should be fixed and known for the
> virtual/stub driver.
>
>> There are many facets to a connection in RDMA: a protection domain used
>> to register memory with the nic, a queue pair that is the actual tx/rx
>> connection, optionally a completion channel that will be another fd to
>> listen on for events that tell you something happened and finally some
>> memory regions to directly communicate with the nic from userspace
>> depending on the specific driver.
>> * then there's the actual usage, more commands through the uverbs0 char
>> device to register the memory you'll use, and once that's done it's
>> entierly up to the driver - for example the mellanox lib can do
>> everything in userspace playing with the memory regions it registered,
>> but I'd wager the rxe driver does more calls through the uverbs0 fd...
>>
>> Honestly I'm not keen on reimplementing all of this; the interface
>> itself pretty much depends on your version of the kernel (there is a
>> common ABI defined, but as far as specific nics are concerned if your
>> kernel module doesn't match the user library version you can get some
>> nasty surprises), and it's far from the black or white of a good ol'
>> ENOSUPP error.
>>
>>
>> I'll look if I can figure out if there is a common subset of verbs
>> commands that are standard and sufficient to setup a listening
>> connection and exchange data that should be supported for all devices
>> and would let us reimplement just that, but while I hear your point
>> about android and ten years in the future I think it's more likely than
>> ten years in the future the verb abi will have changed but libibverbs
>> will just have the new version implemented and hide the change :P
>
> But again we don't need to support all of the available hardware.
> For example, we are testing net stack from external side using tun.
> tun is a very simple, virtual abstraction of a network card. It allows
> us to test all of generic net stack starting from L2 without messing
> with any real drivers and their differences entirely. I had impression
> that we are talking about something similar here too. Or not?
>
> Also I am a bit missing context about rdma<->9p interface. Do we need
> to setup all these ring buffers to satisfy the parts that 9p needs? Is
> it that 9p actually reads data directly from these ring buffers? Or
> there is some higher-level rdma interface that 9p uses?

2018-10-11 14:59:50

by Dominique Martinet

[permalink] [raw]
Subject: 9p/RDMA for syzkaller (Was: BUG: corrupted list in p9_read_work)

Dmitry Vyukov wrote on Thu, Oct 11, 2018:
> But again we don't need to support all of the available hardware.

I agree with that, I just have no idea what the "librxe-rdmav16.so" lib
could be doing and described something I am slightly more familiar with
(e.g. libmlx5)
I talked about a common subset of the verb abi because I didn't want to
look into what it's doing, but if it's not enough there's always that
possibility.


> For example, we are testing net stack from external side using tun.
> tun is a very simple, virtual abstraction of a network card. It allows
> us to test all of generic net stack starting from L2 without messing
> with any real drivers and their differences entirely. I had impression
> that we are talking about something similar here too. Or not?

That sounds about right, rxe is a software implementation that should
work on most network interfaces ; at least from what I tried it worked
on a VM's virtio net down to my laptop's wifi interface so it's a good
start... I'm not saying all because I just tried a dummy interface and
that returned EINVAL.
The only point I disagree is the 'very simple', even getting that to
work will be a far cry from a socket() call... :)


> Also I am a bit missing context about rdma<->9p interface. Do we need
> to setup all these ring buffers to satisfy the parts that 9p needs? Is
> it that 9p actually reads data directly from these ring buffers? Or
> there is some higher-level rdma interface that 9p uses?

It needs an "RDMA_PS_TCP" connection established, that requires
everything I described unfortunately...
Once that's established we need to register some memory to the driver
and post some recv buffers (even if we won't read it, the client would
get errors if we aren't ready to receive anything - at least it does
with real hardware), and also use some registered memory to send data.

Thinking back though I think that my server implementation isn't very
far from the raw level in what I'm doing, I recall libibverbs fallback
implementation (e.g. if the driver lib doesn't implement it otherwise)
of the functions I looked at like ibv_post_send to mostly be just
serializing the arguments, slapping the command from an enum in front of
it and sending it to the kernel, so it might be enough to just
reimplement that shim in or figure a way to generate the binary commands
once and then use these values; now I'm comparing two runs of strace of
my test server I definitely see a pattern.

I'll give it a try but don't expect something fast, and it's probably
not going to be very pretty either...

To give a concrete example, here are all the read/write/fcntl calls
looking just at /dev/infiniband in a hello world program that just
establishes connection (server side), receive and send two messages and
quits:


This part apparently sets up the listening connection of the server:

1430 1539262699.126025 openat(AT_FDCWD, "/dev/infiniband/rdma_cm", O_RDWR|O_CLOEXEC) = 3
1430 1539262699.126155 write(3, "\0\0\0\0\30\0\4\0@m'\1\0\0\0\0\344\327\375\271\374\177\0\0?\1\2\0\0\0\0\0", 32) = 32
1430 1539262699.126192 write(3, "\24\0\0\0\210\0\0\0\0\0\0\0000\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\6\0\0\377\377\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 144) = 144
1430 1539262699.126223 write(3, "\23\0\0\0\20\0\20\1 \326\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
1430 1539262699.126250 write(3, "\23\0\0\0\20\0\20\1 \326\375\271\374\177\0\0\0\0\0\0\2\0\0\0", 24) = 24
1430 1539262699.126274 write(3, "\1\0\0\0\20\0\4\0\324\327\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
1430 1539262699.126303 close(3) = 0
1430 1539262699.126360 openat(AT_FDCWD, "/dev/infiniband/rdma_cm", O_RDWR|O_CLOEXEC) = 3
1430 1539262699.126429 write(3, "\0\0\0\0\30\0\4\0\240\217'\1\0\0\0\0t\330\375\271\374\177\0\0\6\1\2\0\0\0\0\0", 32) = 32
1430 1539262699.126472 write(3, "\24\0\0\0\210\0\0\0\0\0\0\0\34\0\0\0\n\0\4\323\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 144) = 144
1430 1539262699.126501 write(3, "\23\0\0\0\20\0\20\1p\326\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
1430 1539262699.126534 write(3, "\23\0\0\0\20\0\20\1p\326\375\271\374\177\0\0\0\0\0\0\2\0\0\0", 24) = 24
1430 1539262699.127119 write(3, "\7\0\0\0\10\0\0\0\0\0\0\0@\0\0\0", 16) = 16
1430 1539262699.127149 write(3, "\23\0\0\0\20\0\20\1`\327\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
1430 1539262699.127319 fcntl(3, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
1430 1539262699.127348 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK|O_LARGEFILE <unfinished ...>

Then the client connects (had some epoll on read on fd 3, but no read?!)

1446 1539262706.268685 write(3, "\f\0\0\0\10\0H\1\200\307\211\302G\177\0\0", 16) = 16
1446 1539262706.268718 write(3, "\23\0\0\0\20\0\20\1\240\304\211\302G\177\0\0\2\0\0\0\0\0\0\0", 24) = 24
1446 1539262706.269440 openat(AT_FDCWD, "/dev/infiniband/uverbs0", O_RDWR|O_CLOEXEC) = 5
1446 1539262706.269474 write(5, "\0\0\0\0\4\0\2\0H\302\211\302G\177\0\0", 16) = 16
1446 1539262706.269503 write(5, "\1\0\0\0\4\0,\0\220\301\211\302G\177\0\0", 16) = 16
1446 1539262706.269545 write(5, "\2\0\0\0\6\0\n\0\20\302\211\302G\177\0\0\1\0\0\0\0\0\0\0", 24) = 24
1446 1539262706.269571 write(5, "\3\0\0\0\4\0\1\0\314\303\211\302G\177\0\0", 16) = 16
1446 1539262706.269596 write(3, "\23\0\0\0\20\0\20\1\240\304\211\302G\177\0\0\2\0\0\0\2\0\0\0", 24) = 24
1446 1539262706.269618 write(3, "\23\0\0\0\20\0\270\1\200\303\211\302G\177\0\0\2\0\0\0\1\0\0\0", 24) = 24
1430 1539262706.269801 write(5, "\3\0\0\0\4\0\1\0\354\330\375\271\374\177\0\0", 16) = 16
1430 1539262706.269944 write(5, "\21\0\0\0\4\0\1\0T\330\375\271\374\177\0\0", 16) = 16
1430 1539262706.270000 write(5, "\22\0\0\0\n\0\6\0 \330\375\271\374\177\0\0`\232'\1\0\0\0\0006\0\0\0\0\0\0\0\7\0\0\0\0\0\0\0", 40) = 40
1430 1539262706.270203 write(5, "\27\0\0\0\4\0\0\0\2\0\0\0\0\0\0\0", 16) = 16
1430 1539262706.270262 write(5, "\30\0\0\0\20\0\20\0000\327\375\271\374\177\0\0\20\233'\1\0\0\0\0\1\0\0\0\2\0\0\0\2\0\0\0\0\0\0\0002\0\0\0\4\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\1\2\0\0", 64) = 64
1430 1539262706.270482 write(3, "\v\0\0\0\20\0\220\0p\326\375\271\374\177\0\0\2\0\0\0\1\0\0\0", 24) = 24
1430 1539262706.270546 write(5, "\32\0\0\0\36\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\09\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\16\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0", 120) = 120
1430 1539262706.270677 write(5, "\t\0\0\0\f\0\3\0\224\330\375\271\374\177\0\0\20p)\302G\177\0\0\0\0@\0\0\0\0\0\20p)\302G\177\0\0\1\0\0\0\1\0\0\0", 48) = 48
1430 1539262706.271973 write(5, "\t\0\0\0\f\0\3\0D\330\375\271\374\177\0\0\210\362&\1\0\0\0\0\1\0\0\0\0\0\0\0\210\362&\1\0\0\0\0\1\0\0\0\1\0\0\0", 48) = 48
1430 1539262706.272060 write(3, "\v\0\0\0\20\0\220\0000\325\375\271\374\177\0\0\2\0\0\0\1\0\0\0", 24) = 24
1430 1539262706.272110 write(5, "\32\0\0\0\36\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\09\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\16\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0", 120) = 120
1430 1539262706.272159 write(3, "\v\0\0\0\20\0\220\0000\325\375\271\374\177\0\0\2\0\0\0\2\0\0\0", 24) = 24
1430 1539262706.272205 write(5, "\32\0\0\0\36\0\0\0\0\0\0\0\0\0\0\0\0\0\377\377\n*\21f\0\0\0\0\0\0\0\0\1@\0\0\0\7\1\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\201\221\22\0\0\0\0\0\340\t\351\0\0\0\0\0\23\0\0\0\0\0\0\0\0\0\0\0\2\0\3\0\0\0\1\0\0\0\0\0\0\0\0\0", 120) = 120
1430 1539262706.272439 write(3, "\v\0\0\0\20\0\220\0000\325\375\271\374\177\0\0\2\0\0\0\3\0\0\0", 24) = 24
1430 1539262706.272496 write(5, "\32\0\0\0\36\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\1.\1\0\0\0\0\0\0\0\0\0\364((\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\0\1\0\0\0\23\7\7\0\0\0\0", 120) = 120
1430 1539262706.272565 write(3, "\10\0\0\0 \1\0\0\220\f\0\274G\177\0\0\24\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\1\0\0\n\1\2\0\0\0\0\0\0\0", 296) = 296
1446 1539262706.272962 write(3, "\f\0\0\0\10\0H\1\200\307\211\302G\177\0\0", 16) = 16
1430 1539262706.274144 write(5, "\t\0\0\0\f\0\3\0D\330\375\271\374\177\0\0`\0\351\301G\177\0\0\0\0 \0\0\0\0\0`\0\351\301G\177\0\0\1\0\0\0\1\0\0\0", 48) = 48


Some data is exchanged (we don't see the data as it's in buffers whose
address was given earlier):

1464 1539262714.529679 write(5, "\27\0\0\0\4\0\0\0\2\0\0\0\0\0\0\0", 16) = 16
1464 1539262714.530059 write(5, "\34\0\0\0\10\0\1\0lT)\302G\177\0\0\3\0\0\0\0\0\0\0\0\0\0\0\200\0\0\0", 32) = 32
1464 1539262714.530634 write(5, "\27\0\0\0\4\0\0\0\2\0\0\0\0\0\0\0", 16) = 16
1430 1539262719.331307 write(5, "\34\0\0\0\10\0\1\0\374\327\375\271\374\177\0\0\3\0\0\0\0\0\0\0\0\0\0\0\200\0\0\0", 32) = 32
1464 1539262719.332113 write(5, "\27\0\0\0\4\0\0\0\2\0\0\0\0\0\0\0", 16) = 16

And disconnect:

1430 1539262721.192844 write(5, "\r\0\0\0\3\0\0\0\6\0\0\0", 12) = 12
1430 1539262721.193186 write(5, "\r\0\0\0\3\0\0\0\5\0\0\0", 12) = 12
1430 1539262721.193324 write(5, "\32\0\0\0\36\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 120) = 120
1430 1539262721.193567 write(3, "\n\0\0\0\4\0\0\0\2\0\0\0", 12) = 12
1446 1539262721.256556 write(3, "\f\0\0\0\10\0H\1\200\307\211\302G\177\0\0", 16) = 16
1430 1539262721.257618 write(3, "\1\0\0\0\20\0\4\0\204\327\375\271\374\177\0\0\2\0\0\0\0\0\0\0", 24) = 24
1430 1539262721.257769 write(5, "\4\0\0\0\3\0\0\0\0\0\0\0", 12) = 12
1430 1539262721.258369 write(5, "\27\0\0\0\4\0\0\0\2\0\0\0\0\0\0\0", 16) = 16
1430 1539262721.258667 write(5, "\33\0\0\0\6\0\1\0T\327\375\271\374\177\0\0\3\0\0\0\0\0\0\0", 24) = 24
1430 1539262721.259223 write(5, "\24\0\0\0\6\0\2\08\327\375\271\374\177\0\0\2\0\0\0\0\0\0\0", 24) = 24
1430 1539262721.260476 write(3, "\1\0\0\0\20\0\4\0D\330\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
1430 1539262721.260726 close(3) = 0
1430 1539262721.261082 write(5, "\4\0\0\0\3\0\0\0\1\0\0\0", 12) = -1 EBUSY (Device or resource busy)
1430 1539262721.358728 write(5, "\r\0\0\0\3\0\0\0\4\0\0\0", 12) = 12


I don't see any read on these fd despite epoll being set to wait for
read events on these so I'm not quite sure where ibverbs knows if the
commands worked or not, but hopefully that illustrats that it's slightly
more complex than just socket/bind/listen/accept/write/close! :)

--
Dominique

2018-10-12 14:42:58

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: 9p/RDMA for syzkaller (Was: BUG: corrupted list in p9_read_work)

On Thu, Oct 11, 2018 at 4:28 PM, Dominique Martinet
<[email protected]> wrote:
> Dmitry Vyukov wrote on Thu, Oct 11, 2018:
>> > Now we are talking!
>> > We generally assume that all modules are simply compiled into kernel.
>> > At least that's we have on syzbot. If somebody can't compile them in,
>> > we can suggest to add modprobe into init.
>> > So this boils down to just writing to /sys/module/rdma_rxe/parameters/add.
>>
>> This fails for me:
>>
>> root@syzkaller:~# echo -n syz1 > /sys/module/rdma_rxe/parameters/add
>> [20992.905406] rdma_rxe: interface syz1 not found
>> bash: echo: write error: Invalid argument
>
> Works here, I just did:
>
> [root@f2 ~]# modprobe rdma_rxe
> [root@f2 ~]# echo -n ens3 > /sys/module/rdma_rxe/parameters/add
>
> dmesg says:
> [ 35.595534] rdma_rxe: set rxe0 active
> [ 35.595541] rdma_rxe: added rxe0 to ens3
>
>
> Actually for a dummy interface if I try the echo directly the echo
> works, and a verb device is created, and I just confirmed I can use
> it... so not sure why rxe_cfg said EINVAL earlier...
>
> [root@f2 ~]# ip link add dummy0 type dummy
> [root@f2 ~]# ip link set dummy0 up
> [root@f2 ~]# ip addr add 10.1.1.1/24 dev dummy0
> [root@f2 ~]# modprobe rdma_rxe
> [root@f2 ~]# echo -n dummy0 > /sys/module/rdma_rxe/parameters/add
>
>
> (then using my test client:
> [root@f2 src]# ./rcat -s
> INFO: trans_rdma.c (879), msk_cma_event_handler: CONNECT_REQUEST
> INFO: trans_rdma.c (862), msk_cma_event_handler: ESTABLISHED
> INFO: trans_rdma.c (917), msk_cma_event_handler: DISCONNECT EVENT...
>
> [root@f2 src]# ./rcat -c 10.1.1.1
> INFO: trans_rdma.c (862), msk_cma_event_handler: ESTABLISHED
> ^C
> )
>
>
> I assume your syz1 interface is a tap device as you were saying earlier?
> Got anything in dmesg?

Umm... nope, just a random string.
Don't assume any prior knowledge on my side :)

2018-10-12 14:52:18

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: 9p/RDMA for syzkaller (Was: BUG: corrupted list in p9_read_work)

On Thu, Oct 11, 2018 at 4:19 PM, Dominique Martinet
<[email protected]> wrote:
> Dmitry Vyukov wrote on Thu, Oct 11, 2018:
>> But again we don't need to support all of the available hardware.
>
> I agree with that, I just have no idea what the "librxe-rdmav16.so" lib
> could be doing and described something I am slightly more familiar with
> (e.g. libmlx5)
> I talked about a common subset of the verb abi because I didn't want to
> look into what it's doing, but if it's not enough there's always that
> possibility.
>
>
>> For example, we are testing net stack from external side using tun.
>> tun is a very simple, virtual abstraction of a network card. It allows
>> us to test all of generic net stack starting from L2 without messing
>> with any real drivers and their differences entirely. I had impression
>> that we are talking about something similar here too. Or not?
>
> That sounds about right, rxe is a software implementation that should
> work on most network interfaces ; at least from what I tried it worked
> on a VM's virtio net down to my laptop's wifi interface so it's a good
> start... I'm not saying all because I just tried a dummy interface and
> that returned EINVAL.
> The only point I disagree is the 'very simple', even getting that to
> work will be a far cry from a socket() call... :)
>
>
>> Also I am a bit missing context about rdma<->9p interface. Do we need
>> to setup all these ring buffers to satisfy the parts that 9p needs? Is
>> it that 9p actually reads data directly from these ring buffers? Or
>> there is some higher-level rdma interface that 9p uses?
>
> It needs an "RDMA_PS_TCP" connection established, that requires
> everything I described unfortunately...
> Once that's established we need to register some memory to the driver
> and post some recv buffers (even if we won't read it, the client would
> get errors if we aren't ready to receive anything - at least it does
> with real hardware), and also use some registered memory to send data.
>
> Thinking back though I think that my server implementation isn't very
> far from the raw level in what I'm doing, I recall libibverbs fallback
> implementation (e.g. if the driver lib doesn't implement it otherwise)
> of the functions I looked at like ibv_post_send to mostly be just
> serializing the arguments, slapping the command from an enum in front of
> it and sending it to the kernel, so it might be enough to just
> reimplement that shim in or figure a way to generate the binary commands
> once and then use these values; now I'm comparing two runs of strace of
> my test server I definitely see a pattern.
>
> I'll give it a try but don't expect something fast, and it's probably
> not going to be very pretty either...
>
> To give a concrete example, here are all the read/write/fcntl calls
> looking just at /dev/infiniband in a hello world program that just
> establishes connection (server side), receive and send two messages and
> quits:
>
>
> This part apparently sets up the listening connection of the server:
>
> 1430 1539262699.126025 openat(AT_FDCWD, "/dev/infiniband/rdma_cm", O_RDWR|O_CLOEXEC) = 3
> 1430 1539262699.126155 write(3, "\0\0\0\0\30\0\4\0@m'\1\0\0\0\0\344\327\375\271\374\177\0\0?\1\2\0\0\0\0\0", 32) = 32
> 1430 1539262699.126192 write(3, "\24\0\0\0\210\0\0\0\0\0\0\0000\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\6\0\0\377\377\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 144) = 144
> 1430 1539262699.126223 write(3, "\23\0\0\0\20\0\20\1 \326\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
> 1430 1539262699.126250 write(3, "\23\0\0\0\20\0\20\1 \326\375\271\374\177\0\0\0\0\0\0\2\0\0\0", 24) = 24
> 1430 1539262699.126274 write(3, "\1\0\0\0\20\0\4\0\324\327\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
> 1430 1539262699.126303 close(3) = 0
> 1430 1539262699.126360 openat(AT_FDCWD, "/dev/infiniband/rdma_cm", O_RDWR|O_CLOEXEC) = 3
> 1430 1539262699.126429 write(3, "\0\0\0\0\30\0\4\0\240\217'\1\0\0\0\0t\330\375\271\374\177\0\0\6\1\2\0\0\0\0\0", 32) = 32
> 1430 1539262699.126472 write(3, "\24\0\0\0\210\0\0\0\0\0\0\0\34\0\0\0\n\0\4\323\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 144) = 144
> 1430 1539262699.126501 write(3, "\23\0\0\0\20\0\20\1p\326\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
> 1430 1539262699.126534 write(3, "\23\0\0\0\20\0\20\1p\326\375\271\374\177\0\0\0\0\0\0\2\0\0\0", 24) = 24
> 1430 1539262699.127119 write(3, "\7\0\0\0\10\0\0\0\0\0\0\0@\0\0\0", 16) = 16
> 1430 1539262699.127149 write(3, "\23\0\0\0\20\0\20\1`\327\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
> 1430 1539262699.127319 fcntl(3, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
> 1430 1539262699.127348 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK|O_LARGEFILE <unfinished ...>
>
> Then the client connects (had some epoll on read on fd 3, but no read?!)
>
> 1446 1539262706.268685 write(3, "\f\0\0\0\10\0H\1\200\307\211\302G\177\0\0", 16) = 16
> 1446 1539262706.268718 write(3, "\23\0\0\0\20\0\20\1\240\304\211\302G\177\0\0\2\0\0\0\0\0\0\0", 24) = 24
> 1446 1539262706.269440 openat(AT_FDCWD, "/dev/infiniband/uverbs0", O_RDWR|O_CLOEXEC) = 5
> 1446 1539262706.269474 write(5, "\0\0\0\0\4\0\2\0H\302\211\302G\177\0\0", 16) = 16
> 1446 1539262706.269503 write(5, "\1\0\0\0\4\0,\0\220\301\211\302G\177\0\0", 16) = 16
> 1446 1539262706.269545 write(5, "\2\0\0\0\6\0\n\0\20\302\211\302G\177\0\0\1\0\0\0\0\0\0\0", 24) = 24
> 1446 1539262706.269571 write(5, "\3\0\0\0\4\0\1\0\314\303\211\302G\177\0\0", 16) = 16
> 1446 1539262706.269596 write(3, "\23\0\0\0\20\0\20\1\240\304\211\302G\177\0\0\2\0\0\0\2\0\0\0", 24) = 24
> 1446 1539262706.269618 write(3, "\23\0\0\0\20\0\270\1\200\303\211\302G\177\0\0\2\0\0\0\1\0\0\0", 24) = 24
> 1430 1539262706.269801 write(5, "\3\0\0\0\4\0\1\0\354\330\375\271\374\177\0\0", 16) = 16
> 1430 1539262706.269944 write(5, "\21\0\0\0\4\0\1\0T\330\375\271\374\177\0\0", 16) = 16
> 1430 1539262706.270000 write(5, "\22\0\0\0\n\0\6\0 \330\375\271\374\177\0\0`\232'\1\0\0\0\0006\0\0\0\0\0\0\0\7\0\0\0\0\0\0\0", 40) = 40
> 1430 1539262706.270203 write(5, "\27\0\0\0\4\0\0\0\2\0\0\0\0\0\0\0", 16) = 16
> 1430 1539262706.270262 write(5, "\30\0\0\0\20\0\20\0000\327\375\271\374\177\0\0\20\233'\1\0\0\0\0\1\0\0\0\2\0\0\0\2\0\0\0\0\0\0\0002\0\0\0\4\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\1\2\0\0", 64) = 64
> 1430 1539262706.270482 write(3, "\v\0\0\0\20\0\220\0p\326\375\271\374\177\0\0\2\0\0\0\1\0\0\0", 24) = 24
> 1430 1539262706.270546 write(5, "\32\0\0\0\36\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\09\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\16\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0", 120) = 120
> 1430 1539262706.270677 write(5, "\t\0\0\0\f\0\3\0\224\330\375\271\374\177\0\0\20p)\302G\177\0\0\0\0@\0\0\0\0\0\20p)\302G\177\0\0\1\0\0\0\1\0\0\0", 48) = 48
> 1430 1539262706.271973 write(5, "\t\0\0\0\f\0\3\0D\330\375\271\374\177\0\0\210\362&\1\0\0\0\0\1\0\0\0\0\0\0\0\210\362&\1\0\0\0\0\1\0\0\0\1\0\0\0", 48) = 48
> 1430 1539262706.272060 write(3, "\v\0\0\0\20\0\220\0000\325\375\271\374\177\0\0\2\0\0\0\1\0\0\0", 24) = 24
> 1430 1539262706.272110 write(5, "\32\0\0\0\36\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\09\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\16\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0", 120) = 120
> 1430 1539262706.272159 write(3, "\v\0\0\0\20\0\220\0000\325\375\271\374\177\0\0\2\0\0\0\2\0\0\0", 24) = 24
> 1430 1539262706.272205 write(5, "\32\0\0\0\36\0\0\0\0\0\0\0\0\0\0\0\0\0\377\377\n*\21f\0\0\0\0\0\0\0\0\1@\0\0\0\7\1\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\201\221\22\0\0\0\0\0\340\t\351\0\0\0\0\0\23\0\0\0\0\0\0\0\0\0\0\0\2\0\3\0\0\0\1\0\0\0\0\0\0\0\0\0", 120) = 120
> 1430 1539262706.272439 write(3, "\v\0\0\0\20\0\220\0000\325\375\271\374\177\0\0\2\0\0\0\3\0\0\0", 24) = 24
> 1430 1539262706.272496 write(5, "\32\0\0\0\36\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\1.\1\0\0\0\0\0\0\0\0\0\364((\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\0\1\0\0\0\23\7\7\0\0\0\0", 120) = 120
> 1430 1539262706.272565 write(3, "\10\0\0\0 \1\0\0\220\f\0\274G\177\0\0\24\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\1\0\0\n\1\2\0\0\0\0\0\0\0", 296) = 296
> 1446 1539262706.272962 write(3, "\f\0\0\0\10\0H\1\200\307\211\302G\177\0\0", 16) = 16
> 1430 1539262706.274144 write(5, "\t\0\0\0\f\0\3\0D\330\375\271\374\177\0\0`\0\351\301G\177\0\0\0\0 \0\0\0\0\0`\0\351\301G\177\0\0\1\0\0\0\1\0\0\0", 48) = 48
>
>
> Some data is exchanged (we don't see the data as it's in buffers whose
> address was given earlier):
>
> 1464 1539262714.529679 write(5, "\27\0\0\0\4\0\0\0\2\0\0\0\0\0\0\0", 16) = 16
> 1464 1539262714.530059 write(5, "\34\0\0\0\10\0\1\0lT)\302G\177\0\0\3\0\0\0\0\0\0\0\0\0\0\0\200\0\0\0", 32) = 32
> 1464 1539262714.530634 write(5, "\27\0\0\0\4\0\0\0\2\0\0\0\0\0\0\0", 16) = 16
> 1430 1539262719.331307 write(5, "\34\0\0\0\10\0\1\0\374\327\375\271\374\177\0\0\3\0\0\0\0\0\0\0\0\0\0\0\200\0\0\0", 32) = 32
> 1464 1539262719.332113 write(5, "\27\0\0\0\4\0\0\0\2\0\0\0\0\0\0\0", 16) = 16
>
> And disconnect:
>
> 1430 1539262721.192844 write(5, "\r\0\0\0\3\0\0\0\6\0\0\0", 12) = 12
> 1430 1539262721.193186 write(5, "\r\0\0\0\3\0\0\0\5\0\0\0", 12) = 12
> 1430 1539262721.193324 write(5, "\32\0\0\0\36\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 120) = 120
> 1430 1539262721.193567 write(3, "\n\0\0\0\4\0\0\0\2\0\0\0", 12) = 12
> 1446 1539262721.256556 write(3, "\f\0\0\0\10\0H\1\200\307\211\302G\177\0\0", 16) = 16
> 1430 1539262721.257618 write(3, "\1\0\0\0\20\0\4\0\204\327\375\271\374\177\0\0\2\0\0\0\0\0\0\0", 24) = 24
> 1430 1539262721.257769 write(5, "\4\0\0\0\3\0\0\0\0\0\0\0", 12) = 12
> 1430 1539262721.258369 write(5, "\27\0\0\0\4\0\0\0\2\0\0\0\0\0\0\0", 16) = 16
> 1430 1539262721.258667 write(5, "\33\0\0\0\6\0\1\0T\327\375\271\374\177\0\0\3\0\0\0\0\0\0\0", 24) = 24
> 1430 1539262721.259223 write(5, "\24\0\0\0\6\0\2\08\327\375\271\374\177\0\0\2\0\0\0\0\0\0\0", 24) = 24
> 1430 1539262721.260476 write(3, "\1\0\0\0\20\0\4\0D\330\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
> 1430 1539262721.260726 close(3) = 0
> 1430 1539262721.261082 write(5, "\4\0\0\0\3\0\0\0\1\0\0\0", 12) = -1 EBUSY (Device or resource busy)
> 1430 1539262721.358728 write(5, "\r\0\0\0\3\0\0\0\4\0\0\0", 12) = 12
>
>
> I don't see any read on these fd despite epoll being set to wait for
> read events on these so I'm not quite sure where ibverbs knows if the
> commands worked or not, but hopefully that illustrats that it's slightly
> more complex than just socket/bind/listen/accept/write/close! :)

Yes, it seems so.

I guess I am still missing the big picture somewhat.
If we do "echo -n FOO > /sys/module/rdma_rxe/parameters/add" and let's
say FOO is a tun device. Does it mean that we will send/receive
packets from the tun? If yes, that would make things simpler. And do
we still need ring buffers in that case? If not and we still send/recv
via in-memory ring buffers, then why do we need tun at all?

Leon, maybe you know how to setup a stub rdma that we could use as 9p
transport? If we do this, I guess it will also expose lots of
interesting rdma code paths for testing.

2018-10-12 15:11:34

by Dominique Martinet

[permalink] [raw]
Subject: Re: 9p/RDMA for syzkaller (Was: BUG: corrupted list in p9_read_work)

Dmitry Vyukov wrote on Fri, Oct 12, 2018:
> > I don't see any read on these fd despite epoll being set to wait for
> > read events on these so I'm not quite sure where ibverbs knows if the
> > commands worked or not, but hopefully that illustrats that it's slightly
> > more complex than just socket/bind/listen/accept/write/close! :)
>
> Yes, it seems so.
>
> I guess I am still missing the big picture somewhat.
> If we do "echo -n FOO > /sys/module/rdma_rxe/parameters/add" and let's
> say FOO is a tun device. Does it mean that we will send/receive
> packets from the tun? If yes, that would make things simpler. And do
> we still need ring buffers in that case? If not and we still send/recv
> via in-memory ring buffers, then why do we need tun at all?

Hmm, good point; I hadn't looked at the network level how this is
emulated.
When I use a single VM I do not see anything with tcpdump on any
interface, so I assume the kernel short-cuts the interface in this case.
When communicating between two machines there obviously is traffic; it
appears to be transported over udp - I see the messages I sent in plain
text in the dump and there is only a handful of packets for the whole
connecting and teardown so it's definitely much simpler.

This might have some knob I am not aware of to force the driver to send
udp in the local setup, if we can it's going to be much easier to
reimplement the rxe emulation protocol with raw syscalls than what I was
describing earlier...

> Leon, maybe you know how to setup a stub rdma that we could use as 9p
> transport? If we do this, I guess it will also expose lots of
> interesting rdma code paths for testing.

I'm doing this on my free time atm so I can't invest too much, would
love some help if you're aware of anything :)

--
Dominique

2018-10-15 10:47:28

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH 2/2] 9p/trans_fd: put worker reqs on destroy

Tomas Bortoli wrote on Tue, Oct 09, 2018:
> Il giorno mar 9 ott 2018 alle ore 06:06 Dominique Martinet
> > Fixes: 728356dedeff8 ("9p: Add refcount to p9_req_t")
> > Signed-off-by: Dominique Martinet <[email protected]>
> > Cc: Eric Van Hensbergen <[email protected]>
> > Cc: Latchesar Ionkov <[email protected]>
> > Cc: Tomas Bortoli <[email protected]>
>
> Reviewed-by: Tomas Bortoli <[email protected]>

Thanks Tomas.

Tests seem ok, I've push both patches to my next branch, and will submit
them with the rest in a couple of weeks.
Quite a few of my patches haven't had reviews this cycle, if you read
this and have a bit of time, please pull my branch and have a read, or
each commit has a link to the lkml post to reply somewhat easily.


Reminder gir url:
git://github.com/martinetd/linux 9p-next

--
Dominique

2018-11-17 08:48:10

by Dominique Martinet

[permalink] [raw]
Subject: Re: 9p/RDMA for syzkaller (Was: BUG: corrupted list in p9_read_work)

Dominique Martinet wrote on Fri, Oct 12, 2018:
> Hmm, good point; I hadn't looked at the network level how this is
> emulated.
> When I use a single VM I do not see anything with tcpdump on any
> interface, so I assume the kernel short-cuts the interface in this case.
> When communicating between two machines there obviously is traffic; it
> appears to be transported over udp - I see the messages I sent in plain
> text in the dump and there is only a handful of packets for the whole
> connecting and teardown so it's definitely much simpler.
>
> This might have some knob I am not aware of to force the driver to send
> udp in the local setup, if we can it's going to be much easier to
> reimplement the rxe emulation protocol with raw syscalls than what I was
> describing earlier...

I've taken a little bit of time to look at this.
Unfortunately, I haven't been able to find any knob to make a
single-machine setup send UDP over an interface; would have hoped that
playing with multiple interfaces might be a way around this (e.g. use
the rxe driver on one interface and listen on the udp port on the other
one), but I couldn't get this to work and don't see anything in the code
that hints it might be configurable.
It seems to realize that the remote IP is assigned locally anyway and
while I cannot "bind" to an address that wasn't added to the rxe driver
at the rdma_cm level, trying to connect to it never sends any packet to
a ncat listening on 4791/udp on that interface...
If anyone has an idea, it'd be great!

Protocole-wise, once we get that to work it doesn't look very difficult
to implement some basic "accept connection and send acks/replies" small
server, but we need the first part to work...


--
Dominique

2018-11-20 11:31:53

by Dominique Martinet

[permalink] [raw]
Subject: Re: 9p/RDMA for syzkaller (Was: BUG: corrupted list in p9_read_work)

Dmitry Vyukov wrote on Tue, Nov 20, 2018:
> I don't understand most of the words here, so I probably won't be of help.
>
> Leon told me that there is some kind of software emulation RDMA
> interface (CONFIG_RDMA_RXE?). Maybe it can be used to create an
> emulated RDMA interface per test process and use it to communicate
> with 9p client?

Yes, rxe is what you made me look into and is what I was talking about
here - the problem is I can't seem to figure how to use it *as udp*
within a single machine.

With two VMs or a remote outside of the VM when I try to open a RDMA
connection with that software emulation I get nicely understandable udp
packets ; but within a single VM there is nothing visible at the udp
layer, even if I explicitely configure multiple interfaces and try to
use one as rxe and another without it that would act as "outside" -- the
kernel doesn't send anything to the normal interface like it normally
would.

There might be other ways around it (I'm starting to think very ugly
things like really pretend to connect to an IP outside of the VM, but
capture the outgoing traffic with pcap and forge the replies), but I'd
rather not have to go to this extent...

--
Dominique

2018-11-20 14:50:22

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: 9p/RDMA for syzkaller (Was: BUG: corrupted list in p9_read_work)

On Sat, Nov 17, 2018 at 9:46 AM, Dominique Martinet
<[email protected]> wrote:
> Dominique Martinet wrote on Fri, Oct 12, 2018:
>> Hmm, good point; I hadn't looked at the network level how this is
>> emulated.
>> When I use a single VM I do not see anything with tcpdump on any
>> interface, so I assume the kernel short-cuts the interface in this case.
>> When communicating between two machines there obviously is traffic; it
>> appears to be transported over udp - I see the messages I sent in plain
>> text in the dump and there is only a handful of packets for the whole
>> connecting and teardown so it's definitely much simpler.
>>
>> This might have some knob I am not aware of to force the driver to send
>> udp in the local setup, if we can it's going to be much easier to
>> reimplement the rxe emulation protocol with raw syscalls than what I was
>> describing earlier...
>
> I've taken a little bit of time to look at this.
> Unfortunately, I haven't been able to find any knob to make a
> single-machine setup send UDP over an interface; would have hoped that
> playing with multiple interfaces might be a way around this (e.g. use
> the rxe driver on one interface and listen on the udp port on the other
> one), but I couldn't get this to work and don't see anything in the code
> that hints it might be configurable.
> It seems to realize that the remote IP is assigned locally anyway and
> while I cannot "bind" to an address that wasn't added to the rxe driver
> at the rdma_cm level, trying to connect to it never sends any packet to
> a ncat listening on 4791/udp on that interface...
> If anyone has an idea, it'd be great!
>
> Protocole-wise, once we get that to work it doesn't look very difficult
> to implement some basic "accept connection and send acks/replies" small
> server, but we need the first part to work...

I don't understand most of the words here, so I probably won't be of help.

Leon told me that there is some kind of software emulation RDMA
interface (CONFIG_RDMA_RXE?). Maybe it can be used to create an
emulated RDMA interface per test process and use it to communicate
with 9p client?