Commit 3ebd2fd0d011 ("IB/sa: Put netlink request into the request list before sending") -

1. Adds the query to the request list before ib_nl_send_msg.
2. Moves ib_nl_send_msg out of the spinlock, hence it is safe to use gfp_mask as is.

However, if there is a delay in sending out the request (for example, a delay
due to a low memory situation), the timer that handles the request timeout
might kick in before the request is sent out to ibacm via netlink.
ib_nl_request_timeout may then release the query, causing a use-after-free
when ib_nl_send_msg subsequently accesses the query.
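A simplified sketch of the racy ordering prior to this patch (condensed from
the code removed in the diff below; not the verbatim source):

        /* ib_nl_make_request(), before this patch (simplified) */
        spin_lock_irqsave(&ib_nl_request_lock, flags);
        query->timeout = delay + jiffies;
        list_add_tail(&query->list, &ib_nl_request_list);
        /* query is now visible to ib_nl_request_timeout() */
        queue_delayed_work(ib_nl_wq, &ib_nl_timed_work, delay);
        spin_unlock_irqrestore(&ib_nl_request_lock, flags);

        ret = ib_nl_send_msg(query, gfp_mask);  /* may stall on allocation;
                                                 * the timeout handler can free
                                                 * query in the meantime (UAF) */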
Call Trace for the above race:
[<ffffffffa02f43cb>] ? ib_pack+0x17b/0x240 [ib_core]
[<ffffffffa032aef1>] ib_sa_path_rec_get+0x181/0x200 [ib_sa]
[<ffffffffa0379db0>] rdma_resolve_route+0x3c0/0x8d0 [rdma_cm]
[<ffffffffa0374450>] ? cma_bind_port+0xa0/0xa0 [rdma_cm]
[<ffffffffa040f850>] ? rds_rdma_cm_event_handler_cmn+0x850/0x850
[rds_rdma]
[<ffffffffa040f22c>] rds_rdma_cm_event_handler_cmn+0x22c/0x850
[rds_rdma]
[<ffffffffa040f860>] rds_rdma_cm_event_handler+0x10/0x20 [rds_rdma]
[<ffffffffa037778e>] addr_handler+0x9e/0x140 [rdma_cm]
[<ffffffffa026cdb4>] process_req+0x134/0x190 [ib_addr]
[<ffffffff810a02f9>] process_one_work+0x169/0x4a0
[<ffffffff810a0b2b>] worker_thread+0x5b/0x560
[<ffffffff810a0ad0>] ? flush_delayed_work+0x50/0x50
[<ffffffff810a68fb>] kthread+0xcb/0xf0
[<ffffffff816ec49a>] ? __schedule+0x24a/0x810
[<ffffffff816ec49a>] ? __schedule+0x24a/0x810
[<ffffffff810a6830>] ? kthread_create_on_node+0x180/0x180
[<ffffffff816f25a7>] ret_from_fork+0x47/0x90
[<ffffffff810a6830>] ? kthread_create_on_node+0x180/0x180
....
RIP [<ffffffffa03296cd>] send_mad+0x33d/0x5d0 [ib_sa]
To resolve the above issue (new ordering sketched below) -
1. Add the request to the request list only after it has been sent out.
2. To handle the race where a response comes in before the request is added
   to the request list, send (rdma_nl_multicast) and add to the list while
   holding the spinlock - request_lock.
3. Use non-blocking memory allocation flags for rdma_nl_multicast since it is
   now called while holding a spinlock.
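A minimal sketch of the new ordering (simplified; the actual change is in the
diff below):

        gfp_flag = ((gfp_mask & GFP_ATOMIC) == GFP_ATOMIC) ? GFP_ATOMIC :
                                                             GFP_NOWAIT;

        spin_lock_irqsave(&ib_nl_request_lock, flags);
        ret = rdma_nl_multicast(&init_net, skb, RDMA_NL_GROUP_LS, gfp_flag);
        if (!ret) {
                /* Only a sent request is visible to the timeout handler. */
                list_add_tail(&query->list, &ib_nl_request_list);
                /* ... set query->timeout and arm ib_nl_timed_work ... */
        }
        spin_unlock_irqrestore(&ib_nl_request_lock, flags);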
Fixes: 3ebd2fd0d011 ("IB/sa: Put netlink request into the request list before sending")
Signed-off-by: Divya Indi <[email protected]>
---
v1:
- Use flag IB_SA_NL_QUERY_SENT to prevent the use-after-free.
v2:
- Use atomic bit ops for setting and testing IB_SA_NL_QUERY_SENT.
- Rewording and adding comments.
v3:
- Change approach and remove usage of IB_SA_NL_QUERY_SENT.
- Add req to request list only after the request has been sent out.
- Send and add to list while holding the spinlock (request_lock).
- Override gfp_mask and use GFP_NOWAIT for rdma_nl_multicast since we
need non-blocking memory allocation while holding a spinlock.
v4:
- Formatting changes.
- Use GFP_NOWAIT conditionally - only when GFP_ATOMIC is not provided by the caller.
---
drivers/infiniband/core/sa_query.c | 41 ++++++++++++++++++++++----------------
1 file changed, 24 insertions(+), 17 deletions(-)
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 74e0058..9066d48 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -836,6 +836,10 @@ static int ib_nl_send_msg(struct ib_sa_query *query, gfp_t gfp_mask)
void *data;
struct ib_sa_mad *mad;
int len;
+ unsigned long flags;
+ unsigned long delay;
+ gfp_t gfp_flag;
+ int ret;
mad = query->mad_buf->mad;
len = ib_nl_get_path_rec_attrs_len(mad->sa_hdr.comp_mask);
@@ -860,36 +864,39 @@ static int ib_nl_send_msg(struct ib_sa_query *query, gfp_t gfp_mask)
/* Repair the nlmsg header length */
nlmsg_end(skb, nlh);
- return rdma_nl_multicast(&init_net, skb, RDMA_NL_GROUP_LS, gfp_mask);
-}
+ gfp_flag = ((gfp_mask & GFP_ATOMIC) == GFP_ATOMIC) ? GFP_ATOMIC :
+ GFP_NOWAIT;
-static int ib_nl_make_request(struct ib_sa_query *query, gfp_t gfp_mask)
-{
- unsigned long flags;
- unsigned long delay;
- int ret;
+ spin_lock_irqsave(&ib_nl_request_lock, flags);
+ ret = rdma_nl_multicast(&init_net, skb, RDMA_NL_GROUP_LS, gfp_flag);
- INIT_LIST_HEAD(&query->list);
- query->seq = (u32)atomic_inc_return(&ib_nl_sa_request_seq);
+ if (ret)
+ goto out;
- /* Put the request on the list first.*/
- spin_lock_irqsave(&ib_nl_request_lock, flags);
+ /* Put the request on the list.*/
delay = msecs_to_jiffies(sa_local_svc_timeout_ms);
query->timeout = delay + jiffies;
list_add_tail(&query->list, &ib_nl_request_list);
/* Start the timeout if this is the only request */
if (ib_nl_request_list.next == &query->list)
queue_delayed_work(ib_nl_wq, &ib_nl_timed_work, delay);
+
+out:
spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+ return ret;
+}
+
+static int ib_nl_make_request(struct ib_sa_query *query, gfp_t gfp_mask)
+{
+ int ret;
+
+ INIT_LIST_HEAD(&query->list);
+ query->seq = (u32)atomic_inc_return(&ib_nl_sa_request_seq);
+
ret = ib_nl_send_msg(query, gfp_mask);
- if (ret) {
+ if (ret)
ret = -EIO;
- /* Remove the request */
- spin_lock_irqsave(&ib_nl_request_lock, flags);
- list_del(&query->list);
- spin_unlock_irqrestore(&ib_nl_request_lock, flags);
- }
return ret;
}
--
1.8.3.1
On Tue, Jun 23, 2020 at 07:13:09PM -0700, Divya Indi wrote:
> [...]
>
> @@ -860,36 +864,39 @@ static int ib_nl_send_msg(struct ib_sa_query *query, gfp_t gfp_mask)
> /* Repair the nlmsg header length */
> nlmsg_end(skb, nlh);
>
> - return rdma_nl_multicast(&init_net, skb, RDMA_NL_GROUP_LS, gfp_mask);
> -}
> + gfp_flag = ((gfp_mask & GFP_ATOMIC) == GFP_ATOMIC) ? GFP_ATOMIC :
> + GFP_NOWAIT;
I would say that the better way will be to write something like this:
gfp_flag |= GFP_NOWAIT;
Thanks
Hi Leon,
Please find my comments inline -
On 6/25/20 3:09 AM, Leon Romanovsky wrote:
> On Tue, Jun 23, 2020 at 07:13:09PM -0700, Divya Indi wrote:
>> [...]
>>
>> - return rdma_nl_multicast(&init_net, skb, RDMA_NL_GROUP_LS, gfp_mask);
>> -}
>> + gfp_flag = ((gfp_mask & GFP_ATOMIC) == GFP_ATOMIC) ? GFP_ATOMIC :
>> + GFP_NOWAIT;
> I would say that the better way will be to write something like this:
> gfp_flag |= GFP_NOWAIT;
You mean gfp_flag = gfp_mask|GFP_NOWAIT? [We don't want to modify the gfp_mask sent by the caller.]
#define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
If a caller passes GFP_KERNEL, "gfp_mask|GFP_NOWAIT" will still have __GFP_RECLAIM,
__GFP_IO and __GFP_FS set, which is not suitable for use under a spinlock.
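To illustrate (hypothetical snippet, not part of the patch):

        gfp_t mask = GFP_KERNEL;        /* __GFP_RECLAIM | __GFP_IO | __GFP_FS */

        /* OR-ing only adds bits; the blocking ones stay set: */
        gfp_t ored = mask | GFP_NOWAIT; /* still has __GFP_RECLAIM etc. */

        /* Hence the conditional used in the patch: */
        gfp_t flag = ((mask & GFP_ATOMIC) == GFP_ATOMIC) ? GFP_ATOMIC :
                                                           GFP_NOWAIT;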
Thanks,
Divya
>
> Thanks
On Thu, Jun 25, 2020 at 10:11:07AM -0700, Divya Indi wrote:
> Hi Leon,
>
> Please find my comments inline -
>
> On 6/25/20 3:09 AM, Leon Romanovsky wrote:
> > On Tue, Jun 23, 2020 at 07:13:09PM -0700, Divya Indi wrote:
> >> [...]
> >>
> >> + gfp_flag = ((gfp_mask & GFP_ATOMIC) == GFP_ATOMIC) ? GFP_ATOMIC :
> >> + GFP_NOWAIT;
> > I would say that the better way will be to write something like this:
> > gfp_flag |= GFP_NOWAIT;
>
> You mean gfp_flag = gfp_mask|GFP_NOWAIT? [We don't want to modify the gfp_mask sent by the caller.]
>
> #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
> #define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
> #define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
>
> If a caller passes GFP_KERNEL, "gfp_mask|GFP_NOWAIT" will still have __GFP_RECLAIM,
> __GFP_IO and __GFP_FS set, which is not suitable for use under a spinlock.
Ahh, sorry, I completely forgot about the spinlock part.
Thanks
>
> Thanks,
> Divya
>
> >
> > Thanks
On Tue, Jun 23, 2020 at 07:13:09PM -0700, Divya Indi wrote:
> [...]
>
I made a few edits, and applied to for-rc
Thanks,
Jason
Thanks Jason.
Appreciate your help and feedback for fixing this issue.
Would it be possible to access the edited version of the patch?
If yes, please share a pointer to the same.
Thanks,
Divya
On 7/2/20 12:07 PM, Jason Gunthorpe wrote:
> On Tue, Jun 23, 2020 at 07:13:09PM -0700, Divya Indi wrote:
>> [...]
>>
> I made a few edits, and applied to for-rc
>
> Thanks,
> Jason
On Tue, Jul 07, 2020 at 06:05:02PM -0700, Divya Indi wrote:
> Thanks Jason.
>
> Appreciate your help and feedback for fixing this issue.
>
> Would it be possible to access the edited version of the patch?
> If yes, please share a pointer to the same.
https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=for-rc&id=f427f4d6214c183c474eeb46212d38e6c7223d6a
Jason
> On 8 Jul 2020, at 03:12, Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Jul 07, 2020 at 06:05:02PM -0700, Divya Indi wrote:
>> Thanks Jason.
>>
>> Appreciate your help and feedback for fixing this issue.
>>
>> Would it be possible to access the edited version of the patch?
>> If yes, please share a pointer to the same.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=for-rc&id=f427f4d6214c183c474eeb46212d38e6c7223d6a
Hi Jason,
At first glance, this commit calls rdma_nl_multicast() whilst holding a
spinlock. Since rdma_nl_multicast() is called with a gfp_flag parameter, one
could assume it supports an atomic context. rdma_nl_multicast() ends up in
netlink_broadcast_filtered(). This function calls netlink_lock_table(), which
calls read_unlock_irqrestore(), which ends up calling
_raw_read_unlock_irqrestore(). And here preempt_enable() is called :-(
Now, this could be fixed by calling rdma_nl_multicast() outside the spinlock and instead inserting the request into the timeout list in a sorted fashion.
But the main problem here is that ib_nl_make_request() can be called from an atomic context, for example from:
neigh_refresh_path() (takes a spin lock) ==>
path_rec_start() ==>
ib_sa_path_rec_get() ==>
send_mad() ==>
ib_nl_make_request() ==>
Here's the stack trace (not the newest upstream, but I'm pretty sure the same problem is there):
<IRQ>
queued_spin_lock_slowpath+0xb/0xf
_raw_spin_lock_irqsave+0x46/0x48
send_mad+0x3d2/0x590 [ib_core]
? ipoib_start_xmit+0x6a0/0x6a0 [ib_ipoib]
ib_sa_path_rec_get+0x223/0x4d0 [ib_core]
? ipoib_start_xmit+0x6a0/0x6a0 [ib_ipoib]
? do_IRQ+0x59/0xe3
path_rec_start+0xa3/0x140 [ib_ipoib]
? ipoib_start_xmit+0x6a0/0x6a0 [ib_ipoib]
ipoib_start_xmit+0x2b0/0x6a0 [ib_ipoib]
dev_hard_start_xmit+0xb2/0x237
sch_direct_xmit+0x114/0x1bf
__dev_queue_xmit+0x592/0x818
? __alloc_skb+0xa1/0x289
dev_queue_xmit+0x10/0x12
arp_xmit+0x38/0xa6
arp_send_dst.part.16+0x61/0x84
arp_process+0x825/0x889
? try_to_wake_up+0x59/0x4f1
arp_rcv+0x140/0x1c9
? wake_up_worker+0x28/0x2b
? __slab_free+0x9b/0x2ba
__netif_receive_skb_core+0x401/0xb39
? dma_get_required_mask+0x28/0x31
? iommu_should_identity_map+0x52/0xdb
? iommu_no_mapping+0x4a/0xd1
__netif_receive_skb+0x18/0x59
netif_receive_skb_internal+0x45/0x119
napi_gro_receive+0xd8/0xf6
ipoib_ib_handle_rx_wc+0x1ca/0x520 [ib_ipoib]
ipoib_poll+0xcd/0x150 [ib_ipoib]
net_rx_action+0x289/0x3f4
__do_softirq+0xe1/0x2b5
do_softirq_own_stack+0x2a/0x35
</IRQ>
do_softirq+0x4d/0x6a
__local_bh_enable_ip+0x57/0x59
_raw_spin_unlock_bh+0x23/0x25
peernet2id+0x51/0x73
netlink_broadcast_filtered+0x223/0x41b
netlink_broadcast+0x1d/0x1f
rdma_nl_multicast+0x22/0x30 [ib_core]
send_mad+0x3e5/0x590 [ib_core]
? cma_bind_port+0x90/0x90 [rdma_cm]
ib_sa_path_rec_get+0x223/0x4d0 [ib_core]
? cma_bind_port+0x90/0x90 [rdma_cm]
? ring_buffer_lock_reserve+0x120/0x34d
? kmem_cache_alloc_trace+0x16f/0x1cd
rdma_resolve_route+0x287/0x810 [rdma_cm]
? cma_bind_port+0x90/0x90 [rdma_cm]
rds_rdma_cm_event_handler_cmn+0x311/0x7d0 [rds_rdma]
rds_rdma_cm_event_handler_worker+0x22/0x30 [rds_rdma]
process_one_work+0x169/0x3a6
worker_thread+0x4d/0x3e5
kthread+0x105/0x138
How shall this be attacked?
Thxs, Håkon
On Mon, Aug 23, 2021 at 04:54:16PM +0000, Haakon Bugge wrote:
>
>
> > [...]
>
> Hi Jason,
>
>
> At first glance, this commit calls rdma_nl_multicast() whilst
> holding a spinlock. Since rdma_nl_multicast() is called with a
> gfp_flag parameter, one could assume it supports an atomic
> context. rdma_nl_multicast() ends up in
> netlink_broadcast_filtered(). This function calls
> netlink_lock_table(), which calls read_unlock_irqrestore(), which
> ends up calling _raw_read_unlock_irqrestore(). And here
> preempt_enable() is called :-(
I don't understand. This:
unsigned long flags;
read_lock_irqsave(&nl_table_lock, flags);
atomic_inc(&nl_table_users);
read_unlock_irqrestore(&nl_table_lock, flags);
Is perfectly fine in an atomic context.
preempt_enable is implemented as a nesting counter, so it is fine to
call it from inside an atomic region so long as it is balanced.
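For illustration, a contrived sketch (not code from sa_query.c; "some_lock" is
made up) of why the balance keeps things safe:

        spin_lock_irqsave(&some_lock, flags);           /* preempt_count: N -> N+1 */

        read_lock_irqsave(&nl_table_lock, flags2);      /* N+1 -> N+2 */
        atomic_inc(&nl_table_users);
        read_unlock_irqrestore(&nl_table_lock, flags2); /* N+2 -> N+1 */
        /*
         * The preempt_enable() inside read_unlock_irqrestore() only drops
         * the count back to N+1; it never reaches zero while some_lock is
         * held, so no preemption point is created here.
         */

        spin_unlock_irqrestore(&some_lock, flags);      /* N+1 -> N */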
Jason
> On 25 Aug 2021, at 19:26, Jason Gunthorpe <[email protected]> wrote:
>
> On Mon, Aug 23, 2021 at 04:54:16PM +0000, Haakon Bugge wrote:
>> [...]
>
> I don't understand. This:
>
> unsigned long flags;
>
> read_lock_irqsave(&nl_table_lock, flags);
> atomic_inc(&nl_table_users);
> read_unlock_irqrestore(&nl_table_lock, flags);
>
> Is perfectly fine in an atomic context.
>
> preempt_enable is implemented as a nesting counter, so it is fine to
> call it from inside an atomic region so long as it is balanced.
You are right. As I said, the stack trace was from a UEK kernel. It turns out
I overlooked commit 2dce224f469f ("netns: protect netns ID lookups with RCU"),
which replaces spin_{lock,unlock}_bh with rcu_read_{lock,unlock} in peernet2id().
That commit fixed this bug unintentionally, I would say! So, this bug was
present in kernels up to v5.5-rc7.
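Roughly the kind of change that commit makes (simplified from memory, not the
exact diff; helper and lock names as I recall them):

        /* read path before 2dce224f469f (simplified) */
        spin_lock_bh(&net->nsid_lock);
        id = __peernet2id(net, peer);
        spin_unlock_bh(&net->nsid_lock);        /* may run pending softirqs,
                                                 * as seen in the trace above */

        /* read path after (simplified) */
        rcu_read_lock();
        id = __peernet2id(net, peer);
        rcu_read_unlock();                      /* fine in any atomic context */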
Sorry for the noise!
Håkon