2020-04-24 15:30:27

by Divya Indi

[permalink] [raw]
Subject: Request for feedback : Possible use-after-free in routing SA query via netlink

Hi All,

I wanted some feedback on a crash caused due to use-after-free in the
ibacm code path [while routing SA query via netlink].

Commit 3ebd2fd IB/sa: Put netlink request into the request list before sending

Above commit moved adding the query to the request list before ib_nl_snd_msg
and moved ib_nl_snd_msg out of the spinlock (request_lock).

However, if there is a delay in sending out the request (For
eg: Delay due to low memory situation) the timer to handle request timeout
might kick in before the request is sent out to ibacm via netlink.
ib_nl_request_timeout may result in release of the query (by call to send_handler)
while ib_nl_snd_msg is still accessing query.


We get the following stacktrace for the crash -

[<ffffffffa02f43cb>] ? ib_pack+0x17b/0x240 [ib_core]
[<ffffffffa032aef1>] ib_sa_path_rec_get+0x181/0x200 [ib_sa]
[<ffffffffa0379db0>] rdma_resolve_route+0x3c0/0x8d0 [rdma_cm]
[<ffffffffa0374450>] ? cma_bind_port+0xa0/0xa0 [rdma_cm]
[<ffffffffa040f850>] ? rds_rdma_cm_event_handler_cmn+0x850/0x850
[rds_rdma]
[<ffffffffa040f22c>] rds_rdma_cm_event_handler_cmn+0x22c/0x850
[rds_rdma]
[<ffffffffa040f860>] rds_rdma_cm_event_handler+0x10/0x20 [rds_rdma]
[<ffffffffa037778e>] addr_handler+0x9e/0x140 [rdma_cm]
[<ffffffffa026cdb4>] process_req+0x134/0x190 [ib_addr]
[<ffffffff810a02f9>] process_one_work+0x169/0x4a0
[<ffffffff810a0b2b>] worker_thread+0x5b/0x560
[<ffffffff810a0ad0>] ? flush_delayed_work+0x50/0x50
[<ffffffff810a68fb>] kthread+0xcb/0xf0
[<ffffffff816ec49a>] ? __schedule+0x24a/0x810
[<ffffffff816ec49a>] ? __schedule+0x24a/0x810
[<ffffffff810a6830>] ? kthread_create_on_node+0x180/0x180
[<ffffffff816f25a7>] ret_from_fork+0x47/0x90
[<ffffffff810a6830>] ? kthread_create_on_node+0x180/0x180
....
RIP [<ffffffffa03296cd>] send_mad+0x33d/0x5d0 [ib_sa]

On analysis of the vmcore, we see crash happens at -

ib_sa_path_rec_get
send_mad
ib_nl_make_request()
ib_nl_send_msg
1 static void ib_nl_set_path_rec_attrs(struct sk_buff *skb,
2 struct ib_sa_query *query)
3 {
4 struct ib_sa_path_rec *sa_rec = query->mad_buf->context[1];
5 struct ib_sa_mad *mad = query->mad_buf->mad;
6 ib_sa_comp_mask comp_mask = mad->sa_hdr.comp_mask;

Page fault occurs at line 5 while trying to access query->mad_buf->mad;

If we look at the query, it does not appear to be a valid ib_sa_query. Instead
looks like a pid struct for a process -> Use-after-free situation.

We could simulate the crash by explicitly introducing a delay in ib_nl_snd_msg with
a sleep. The timer kicks in before ib_nl_send_msg has even sent out the request
and releases the query. We could reproduce the crash with a similar stack trace.

To summarize - We have a use-after-free possibility here when the timer(ib_nl_request_timeout)
kicks in before ib_nl_snd_msg has completed sending the query out to ibacm via netlink. The
timeout handler ie ib_nl_request_timeout may result in releasing the query while ib_nl_snd_msg
is still accessing query.

Appreciate your thoughts on the above issue.


Thanks,
Divya


2020-04-24 18:24:33

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Request for feedback : Possible use-after-free in routing SA query via netlink

On Fri, Apr 24, 2020 at 08:28:09AM -0700, Divya Indi wrote:

> If we look at the query, it does not appear to be a valid ib_sa_query. Instead
> looks like a pid struct for a process -> Use-after-free situation.
>
> We could simulate the crash by explicitly introducing a delay in ib_nl_snd_msg with
> a sleep. The timer kicks in before ib_nl_send_msg has even sent out the request
> and releases the query. We could reproduce the crash with a similar stack trace.
>
> To summarize - We have a use-after-free possibility here when the timer(ib_nl_request_timeout)
> kicks in before ib_nl_snd_msg has completed sending the query out to ibacm via netlink. The
> timeout handler ie ib_nl_request_timeout may result in releasing the query while ib_nl_snd_msg
> is still accessing query.
>
> Appreciate your thoughts on the above issue.

Your assesment looks right to me.

Fixing it will require being able to clearly explain what the lifetime
cycle is for ib_sa_query - and what is there today looks like a mess,
so I'm not sure what it should be changed into.

There is lots of other stuff there that looks really weird, like
ib_sa_cancel_query() keeps going even though there are still timers
running..

Jason

2020-04-30 15:25:35

by Divya Indi

[permalink] [raw]
Subject: Re: Request for feedback : Possible use-after-free in routing SA query via netlink

Hi Jason,

Thanks for taking the time to review.

On 4/24/20 11:22 AM, Jason Gunthorpe wrote:
> On Fri, Apr 24, 2020 at 08:28:09AM -0700, Divya Indi wrote:
>
>> If we look at the query, it does not appear to be a valid ib_sa_query. Instead
>> looks like a pid struct for a process -> Use-after-free situation.
>>
>> We could simulate the crash by explicitly introducing a delay in ib_nl_snd_msg with
>> a sleep. The timer kicks in before ib_nl_send_msg has even sent out the request
>> and releases the query. We could reproduce the crash with a similar stack trace.
>>
>> To summarize - We have a use-after-free possibility here when the timer(ib_nl_request_timeout)
>> kicks in before ib_nl_snd_msg has completed sending the query out to ibacm via netlink. The
>> timeout handler ie ib_nl_request_timeout may result in releasing the query while ib_nl_snd_msg
>> is still accessing query.
>>
>> Appreciate your thoughts on the above issue.
> Your assesment looks right to me.
>
> Fixing it will require being able to clearly explain what the lifetime
> cycle is for ib_sa_query - and what is there today looks like a mess,
> so I'm not sure what it should be changed into.

The issue reported here appears to be restricted to the ibacm code path.
Hence, we tried to resolve it for the life cycle of sa_query in the ibacm code path.
I will follow up this email with a proposed fix(patch), it would be very helpful if
you can provide your feedback on the same.

Thanks,
Divya

>
> The
> re is lots of other stuff there that looks really weird, like
> ib_sa_cancel_query() keeps going even though there are still timers
> running..
>
> Jason