2017-07-28 17:38:25

by Logan Gunthorpe

[permalink] [raw]
Subject: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20

Hi,

My system has been failing with recent kernels (4.12.x and 4.13-rc2)
with a NULL pointer dereference at the stack trace given at the end of
this email. This happens when simply running 'ib_write_bw -R <server>'
with a Chelsio T6 (cxgb4). I've bisected (log attached) to find the
offending commit to be:

commit 1e7710f3f6563940bb6bbc94aa8eadfd344a86af
Author: Matan Barak <[email protected]>
IB/core: Change completion channel to use the reworked objects schema

Reverting this commit (and the dependent commits db1b5ddd53365 and
e0fcc61113c that also fix other bugs with this commit) from v4.12.3
fixes the issue.

I did the bisect with the userspace libraries in Debian Stretch but I
also had this bug with rdma-core v14. I was pretty sure v4.12 kernels
worked for me in the past but likely only before I upgraded from Jessie
to Stretch.

Thanks,

Logan


PS. As a side rant, this bug was found after a very *frustrating* day of
what was supposed to be the 20 minute task of getting my RDMA cards
plugged in again. I tried with both CX4s and the T6s (and I'm still not
sure if my CX4s work yet). Instead, it turns out there's a whole mess of
bugs in the kernel I had to go up against. I went back and forth between
different versions of the userspace libraries because I was sure 4.11
worked -- but it turned out 4.11.10+, 4.12.x and who knows what other
stable kernels are currently broken by the bug fixed in [1]. And there
was a whole other bug that broke things that was fixed in the 4.12-rc
series that I had to carefully bisect around to find the one reported
above. So frustrating!!

[1] 5a7a88f1b488e4ee49eb3d5b82612d4d9ffdf2c3

--

[ 53.320439] iwpm_register_pid: Unable to send a nlmsg (client = 2)
[ 54.738579] BUG: unable to handle kernel NULL pointer dereference at
0000000000000058
[ 54.747439] IP: _raw_spin_lock_irqsave+0x10/0x30
[ 54.752719] PGD 0
[ 54.752721] P4D 0
[ 54.755049]
[ 54.759109] Oops: 0002 [#1] SMP
[ 54.762699] Modules linked in:
[ 54.766195] CPU: 0 PID: 5 Comm: kworker/u16:0 Not tainted
4.13.0-rc2.direct #708
[ 54.774536] Hardware name: Supermicro SYS-7047GR-TRF/X9DRG-QF, BIOS
3.0a 12/05/2013
[ 54.783182] Workqueue: iw_cxgb4 process_work
[ 54.788036] task: ffff880276a5ee80 task.stack: ffffc900000c4000
[ 54.794728] RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30
[ 54.800552] RSP: 0018:ffffc900000c7c70 EFLAGS: 00010046
[ 54.806473] RAX: 0000000000000000 RBX: 0000000000000002 RCX:
0000000000000000
[ 54.814524] RDX: 0000000000000001 RSI: 0000000000000058 RDI:
0000000000000058
[ 54.822583] RBP: ffff880470484600 R08: 0000000000000001 R09:
0000000000000001
[ 54.830663] R10: 0000000000000040 R11: ffff88047420b400 R12:
0000000000000282
[ 54.838744] R13: ffffc900000c7dc0 R14: 0000000000000001 R15:
ffff880470484600
[ 54.846825] FS: 0000000000000000(0000) GS:ffff880277c00000(0000)
knlGS:0000000000000000
[ 54.855997] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 54.862522] CR2: 0000000000000058 CR3: 0000000001e0a000 CR4:
00000000000406f0
[ 54.870602] Call Trace:
[ 54.873442] ? ib_uverbs_comp_handler+0x20/0xe0
[ 54.878610] ? flush_qp+0x6e/0x2b0
[ 54.882514] ? c4iw_modify_qp+0x11c2/0x1870
[ 54.887295] ? close_con_rpl+0xe7/0x170
[ 54.891686] ? kfree_skb+0x33/0x90
[ 54.895592] ? skb_dequeue+0x52/0x60
[ 54.899690] ? process_work+0x4a/0x60
[ 54.903887] ? process_one_work+0x1c2/0x3e0
[ 54.908664] ? worker_thread+0x47/0x3d0
[ 54.913056] ? kthread+0xfc/0x130
[ 54.916864] ? create_worker+0x180/0x180
[ 54.921353] ? kthread_create_on_node+0x40/0x40
[ 54.926521] ? ret_from_fork+0x22/0x30
[ 54.930811] Code: c0 74 05 e8 b3 1c 73 ff 48 89 d8 5b c3 0f 1f 40 00
66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 53 9c 5b fa 31 c0 ba 01 00
00 00 <f0> 0f b1 17 85 c0 75 05 48 89 d8 5b c3 89 c6 e8 9c 09 73 ff 48
[ 54.952099] RIP: _raw_spin_lock_irqsave+0x10/0x30 RSP: ffffc900000c7c70
[ 54.959598] CR2: 0000000000000058
[ 54.963405] ---[ end trace 896cfe0234c949d2 ]---
[ 102.633421] random: crng init done


Attachments:
bisect.log (2.73 kB)

2017-08-01 11:08:12

by Matan Barak

[permalink] [raw]
Subject: Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20

On Fri, Jul 28, 2017 at 8:38 PM, Logan Gunthorpe <[email protected]> wrote:
> Hi,
>
> My system has been failing with recent kernels (4.12.x and 4.13-rc2)
> with a NULL pointer dereference at the stack trace given at the end of
> this email. This happens when simply running 'ib_write_bw -R <server>'
> with a Chelsio T6 (cxgb4). I've bisected (log attached) to find the
> offending commit to be:
>
> commit 1e7710f3f6563940bb6bbc94aa8eadfd344a86af
> Author: Matan Barak <[email protected]>
> IB/core: Change completion channel to use the reworked objects schema
>
> Reverting this commit (and the dependent commits db1b5ddd53365 and
> e0fcc61113c that also fix other bugs with this commit) from v4.12.3
> fixes the issue.
>
> I did the bisect with the userspace libraries in Debian Stretch but I
> also had this bug with rdma-core v14. I was pretty sure v4.12 kernels
> worked for me in the past but likely only before I upgraded from Jessie
> to Stretch.
>
> Thanks,
>
> Logan
>

Hi Logan,

I've tried to reproduce this in my setup (ConnectX 4, RoCE mode) using
1e7710f3f6563940bb6bbc94aa8eadfd344a86af as the kernel's head.
I've used d779dd9a9e8f as rdma-core user-space and the latest perftest bits.
I couldn't reproduce this problem.
I'll try to review this commit again, but please provide more information.
For example, do you see the iwpm_register_pid error when these commits
are reverted?
Does this also happen when using the plain rdma-cm examples (ucmatose,
rping)? Does it happen in a plain verbs application (ibv_rc_pingpong)?
I assume you use iWarp, right? Did you test other modes?
Did you reproduce this issue with your ConnectX 4 as well?
Could you please reproduce it with KASAN as well?

PS, e0fcc61113c isn't a bug fix, it's just a simple refactor.

Regards,
Matan

>
> PS. As a side rant, this bug was found after a very *frustrating* day of
> what was supposed to be the 20 minute task of getting my RDMA cards
> plugged in again. I tried with both CX4s and the T6s (and I'm still not
> sure if my CX4s work yet). Instead, it turns out there's a whole mess of
> bugs in the kernel I had to go up against. I went back and forth between
> different versions of the userspace libraries because I was sure 4.11
> worked -- but it turned out 4.11.10+, 4.12.x and who knows what other
> stable kernels are currently broken by the bug fixed in [1]. And there
> was a whole other bug that broke things that was fixed in the 4.12-rc
> series that I had to carefully bisect around to find the one reported
> above. So frustrating!!
>
> [1] 5a7a88f1b488e4ee49eb3d5b82612d4d9ffdf2c3
>
> --
>
> [ 53.320439] iwpm_register_pid: Unable to send a nlmsg (client = 2)
> [ 54.738579] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000058
> [ 54.747439] IP: _raw_spin_lock_irqsave+0x10/0x30
> [ 54.752719] PGD 0
> [ 54.752721] P4D 0
> [ 54.755049]
> [ 54.759109] Oops: 0002 [#1] SMP
> [ 54.762699] Modules linked in:
> [ 54.766195] CPU: 0 PID: 5 Comm: kworker/u16:0 Not tainted
> 4.13.0-rc2.direct #708
> [ 54.774536] Hardware name: Supermicro SYS-7047GR-TRF/X9DRG-QF, BIOS
> 3.0a 12/05/2013
> [ 54.783182] Workqueue: iw_cxgb4 process_work
> [ 54.788036] task: ffff880276a5ee80 task.stack: ffffc900000c4000
> [ 54.794728] RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30
> [ 54.800552] RSP: 0018:ffffc900000c7c70 EFLAGS: 00010046
> [ 54.806473] RAX: 0000000000000000 RBX: 0000000000000002 RCX:
> 0000000000000000
> [ 54.814524] RDX: 0000000000000001 RSI: 0000000000000058 RDI:
> 0000000000000058
> [ 54.822583] RBP: ffff880470484600 R08: 0000000000000001 R09:
> 0000000000000001
> [ 54.830663] R10: 0000000000000040 R11: ffff88047420b400 R12:
> 0000000000000282
> [ 54.838744] R13: ffffc900000c7dc0 R14: 0000000000000001 R15:
> ffff880470484600
> [ 54.846825] FS: 0000000000000000(0000) GS:ffff880277c00000(0000)
> knlGS:0000000000000000
> [ 54.855997] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 54.862522] CR2: 0000000000000058 CR3: 0000000001e0a000 CR4:
> 00000000000406f0
> [ 54.870602] Call Trace:
> [ 54.873442] ? ib_uverbs_comp_handler+0x20/0xe0
> [ 54.878610] ? flush_qp+0x6e/0x2b0
> [ 54.882514] ? c4iw_modify_qp+0x11c2/0x1870
> [ 54.887295] ? close_con_rpl+0xe7/0x170
> [ 54.891686] ? kfree_skb+0x33/0x90
> [ 54.895592] ? skb_dequeue+0x52/0x60
> [ 54.899690] ? process_work+0x4a/0x60
> [ 54.903887] ? process_one_work+0x1c2/0x3e0
> [ 54.908664] ? worker_thread+0x47/0x3d0
> [ 54.913056] ? kthread+0xfc/0x130
> [ 54.916864] ? create_worker+0x180/0x180
> [ 54.921353] ? kthread_create_on_node+0x40/0x40
> [ 54.926521] ? ret_from_fork+0x22/0x30
> [ 54.930811] Code: c0 74 05 e8 b3 1c 73 ff 48 89 d8 5b c3 0f 1f 40 00
> 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 53 9c 5b fa 31 c0 ba 01 00
> 00 00 <f0> 0f b1 17 85 c0 75 05 48 89 d8 5b c3 89 c6 e8 9c 09 73 ff 48
> [ 54.952099] RIP: _raw_spin_lock_irqsave+0x10/0x30 RSP: ffffc900000c7c70
> [ 54.959598] CR2: 0000000000000058
> [ 54.963405] ---[ end trace 896cfe0234c949d2 ]---
> [ 102.633421] random: crng init done
>

2017-08-01 12:30:49

by Potnuri Bharat Teja

[permalink] [raw]
Subject: Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20

On Tuesday, August 08/01/17, 2017 at 16:38:08 +0530, Matan Barak wrote:
> On Fri, Jul 28, 2017 at 8:38 PM, Logan Gunthorpe <[email protected]> wrote:
> > Hi,
> >
> > My system has been failing with recent kernels (4.12.x and 4.13-rc2)
> > with a NULL pointer dereference at the stack trace given at the end of
> > this email. This happens when simply running 'ib_write_bw -R <server>'
> > with a Chelsio T6 (cxgb4). I've bisected (log attached) to find the
> > offending commit to be:
> >
> > commit 1e7710f3f6563940bb6bbc94aa8eadfd344a86af
> > Author: Matan Barak <[email protected]>
> > IB/core: Change completion channel to use the reworked objects schema
> >
> > Reverting this commit (and the dependent commits db1b5ddd53365 and
> > e0fcc61113c that also fix other bugs with this commit) from v4.12.3
> > fixes the issue.
> >
> > I did the bisect with the userspace libraries in Debian Stretch but I
> > also had this bug with rdma-core v14. I was pretty sure v4.12 kernels
> > worked for me in the past but likely only before I upgraded from Jessie
> > to Stretch.
> >
> > Thanks,
> >
> > Logan
> >
Hi Logan,
Today I sent out a patch to address the issue. Please try it.
"[PATCH 1/1] RDMA/uverbs: Initialize cq_context appropriately"

>
> Hi Logan,
>
> I've tried to reproduce this in my setup (ConnectX 4, RoCE mode) using
> 1e7710f3f6563940bb6bbc94aa8eadfd344a86af as the kernel's head.
> I've used d779dd9a9e8f as rdma-core user-space and the latest perftest bits.
> I couldn't reproduce this problem.
> I'll try to review this commit again, but please provide more information.
> For example, do you see the iwpm_register_pid error when these commits
> are reverted?
Hi Matan,
Issue is seen with applications not creating a completion channel.
It is not seen with rping or similar applications which do create completion channel.

Today I sent out a patch to address the issue. Please review it.
"[PATCH 1/1] RDMA/uverbs: Initialize cq_context appropriately"

Thanks,
Bharat.
> Does this also happen when using the plain rdma-cm examples (ucmatose,
> rping)? Does it happen in a plain verbs application (ibv_rc_pingpong)?
> I assume you use iWarp, right? Did you test other modes?
> Did you reproduce this issue with your ConnectX 4 as well?
> Could you please reproduce it with KASAN as well?

>
> PS, e0fcc61113c isn't a bug fix, it's just a simple refactor.
>
> Regards,
> Matan
>
> >
> > PS. As a side rant, this bug was found after a very *frustrating* day of
> > what was supposed to be the 20 minute task of getting my RDMA cards
> > plugged in again. I tried with both CX4s and the T6s (and I'm still not
> > sure if my CX4s work yet). Instead, it turns out there's a whole mess of
> > bugs in the kernel I had to go up against. I went back and forth between
> > different versions of the userspace libraries because I was sure 4.11
> > worked -- but it turned out 4.11.10+, 4.12.x and who knows what other
> > stable kernels are currently broken by the bug fixed in [1]. And there
> > was a whole other bug that broke things that was fixed in the 4.12-rc
> > series that I had to carefully bisect around to find the one reported
> > above. So frustrating!!
> >
> > [1] 5a7a88f1b488e4ee49eb3d5b82612d4d9ffdf2c3
> >
> > --
> >
> > [ 53.320439] iwpm_register_pid: Unable to send a nlmsg (client = 2)
> > [ 54.738579] BUG: unable to handle kernel NULL pointer dereference at
> > 0000000000000058
> > [ 54.747439] IP: _raw_spin_lock_irqsave+0x10/0x30
> > [ 54.752719] PGD 0
> > [ 54.752721] P4D 0
> > [ 54.755049]
> > [ 54.759109] Oops: 0002 [#1] SMP
> > [ 54.762699] Modules linked in:
> > [ 54.766195] CPU: 0 PID: 5 Comm: kworker/u16:0 Not tainted
> > 4.13.0-rc2.direct #708
> > [ 54.774536] Hardware name: Supermicro SYS-7047GR-TRF/X9DRG-QF, BIOS
> > 3.0a 12/05/2013
> > [ 54.783182] Workqueue: iw_cxgb4 process_work
> > [ 54.788036] task: ffff880276a5ee80 task.stack: ffffc900000c4000
> > [ 54.794728] RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30
> > [ 54.800552] RSP: 0018:ffffc900000c7c70 EFLAGS: 00010046
> > [ 54.806473] RAX: 0000000000000000 RBX: 0000000000000002 RCX:
> > 0000000000000000
> > [ 54.814524] RDX: 0000000000000001 RSI: 0000000000000058 RDI:
> > 0000000000000058
> > [ 54.822583] RBP: ffff880470484600 R08: 0000000000000001 R09:
> > 0000000000000001
> > [ 54.830663] R10: 0000000000000040 R11: ffff88047420b400 R12:
> > 0000000000000282
> > [ 54.838744] R13: ffffc900000c7dc0 R14: 0000000000000001 R15:
> > ffff880470484600
> > [ 54.846825] FS: 0000000000000000(0000) GS:ffff880277c00000(0000)
> > knlGS:0000000000000000
> > [ 54.855997] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 54.862522] CR2: 0000000000000058 CR3: 0000000001e0a000 CR4:
> > 00000000000406f0
> > [ 54.870602] Call Trace:
> > [ 54.873442] ? ib_uverbs_comp_handler+0x20/0xe0
> > [ 54.878610] ? flush_qp+0x6e/0x2b0
> > [ 54.882514] ? c4iw_modify_qp+0x11c2/0x1870
> > [ 54.887295] ? close_con_rpl+0xe7/0x170
> > [ 54.891686] ? kfree_skb+0x33/0x90
> > [ 54.895592] ? skb_dequeue+0x52/0x60
> > [ 54.899690] ? process_work+0x4a/0x60
> > [ 54.903887] ? process_one_work+0x1c2/0x3e0
> > [ 54.908664] ? worker_thread+0x47/0x3d0
> > [ 54.913056] ? kthread+0xfc/0x130
> > [ 54.916864] ? create_worker+0x180/0x180
> > [ 54.921353] ? kthread_create_on_node+0x40/0x40
> > [ 54.926521] ? ret_from_fork+0x22/0x30
> > [ 54.930811] Code: c0 74 05 e8 b3 1c 73 ff 48 89 d8 5b c3 0f 1f 40 00
> > 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 53 9c 5b fa 31 c0 ba 01 00
> > 00 00 <f0> 0f b1 17 85 c0 75 05 48 89 d8 5b c3 89 c6 e8 9c 09 73 ff 48
> > [ 54.952099] RIP: _raw_spin_lock_irqsave+0x10/0x30 RSP: ffffc900000c7c70
> > [ 54.959598] CR2: 0000000000000058
> > [ 54.963405] ---[ end trace 896cfe0234c949d2 ]---
> > [ 102.633421] random: crng init done
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2017-08-01 18:33:13

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20

Hey,

The patch Bharat provided fixes the kernel panic but RDMA in userspace
still does not work at all. Reverting the commits I mentioned still
fixes everything.

To answer your questions:

* I see the iwpm_register_pid message even when things are working so I
don't think it's related.

* All clients I've tried fail. I've attached a log of all the error
messages I see with various clients. (This was with Bharat's patch so
there was no kernel panic and I saw no dmesgs during these runs). The
same runs with the commits I mentioned reverted work fine.

* I retested everything with the CX4 cards as well and they have a
similar problem but produce different error messages. I've attached a
log of client runs as well. The CX4 also works once I revert those
patches. However, by memory, I don't think the CX4s ever suffered from
the kernel panic, and I guess it was just luck that the patches I
reverted caused all these problems.


On 01/08/17 05:08 AM, Matan Barak wrote:
> PS, e0fcc61113c isn't a bug fix, it's just a simple refactor.

If it's not a bug fix I don't think it should have a fixes tag. It
probably didn't mater in this case but you don't want refactor commits
to accidentally reach a stable kernel.

Thanks,

Logan







Attachments:
cxgb4-client-errors.txt (1.47 kB)
mlx5-client-errors.txt (1.34 kB)
Download all attachments

2017-08-01 18:35:14

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20

Hey,

On 01/08/17 06:30 AM, Potnuri Bharat Teja wrote:
> Hi Logan,
> Today I sent out a patch to address the issue. Please try it.
> "[PATCH 1/1] RDMA/uverbs: Initialize cq_context appropriately"

Thanks, as I mentioned in my other email this fixes the kernel panic on
the T6 but doesn't solve all my problems. You can add a

Tested-by: Logan Gunthorpe <[email protected]>

Logan

2017-08-01 19:29:25

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20

On Tue, Aug 01, 2017 at 12:32:57PM -0600, Logan Gunthorpe wrote:
> Couldn't create rdma QP - Invalid argument
> Unable to create QP.
> Failed to create QP.

Failing to create a QP makes me wonder if you have have this patch?

Subject: [PATCH v2 1/2] RDMA/uverbs: Fix the check for port number

The port number is only valid if IB_QP_PORT is set in the mask.
So only check port number if it is valid to prevent modify_qp from
failing due to an invalid port number.

Fixes: 5ecce4c9b17b("Check port number supplied by user verbs cmds")
Cc: <[email protected]> # v2.6.14+
Reviewed-by: Steve Wise <[email protected]>
Signed-off-by: Mustafa Ismail <[email protected]>

Jason

2017-08-01 19:40:04

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20


On 01/08/17 01:29 PM, Jason Gunthorpe wrote:
> On Tue, Aug 01, 2017 at 12:32:57PM -0600, Logan Gunthorpe wrote:
>> Couldn't create rdma QP - Invalid argument
>> Unable to create QP.
>> Failed to create QP.
>
> Failing to create a QP makes me wonder if you have have this patch?
>
> Subject: [PATCH v2 1/2] RDMA/uverbs: Fix the check for port number
>
> The port number is only valid if IB_QP_PORT is set in the mask.
> So only check port number if it is valid to prevent modify_qp from
> failing due to an invalid port number.
>
> Fixes: 5ecce4c9b17b("Check port number supplied by user verbs cmds")
> Cc: <[email protected]> # v2.6.14+
> Reviewed-by: Steve Wise <[email protected]>
> Signed-off-by: Mustafa Ismail <[email protected]>

Oh, oops, I forgot about that. I mentioned the fix for that in my
original email and it seems I wasn't testing apples to apples for my
testing today. During my testing today, the branch with the reverted
commits had the fix for that commit while the branch with Bharat's patch
didn't.

I just did a test with both Bharat's patch and 5a7a88f1b4, and
everything is working correctly again.

So that's great, we just need these patches to be picked up by the
stable kernels.

Thanks,

Logan