2018-08-01 21:54:22

by Ben Greear

[permalink] [raw]
Subject: use-after free bug in hacked 4.16 kernel, related to fq_flow_dequeue

This is from my hacked kernel, could be my fault. I thought the fq guys might
want to know however...

==================================================================
BUG: KASAN: use-after-free in fq_flow_dequeue+0x353/0x3c0 [mac80211]
Read of size 4 at addr ffff88013d92a700 by task rmmod/813

audit: type=1130 audit(1533153605.287:233): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=sysstat-collect comm="systemd" exe="/usr/lib/systemd/s'
CPU: 0 PID: 813 Comm: rmmod Tainted: G W 4.16.18+ #24
Hardware name: _ _/, BIOS 5.11 08/26/2016
Call Trace:
dump_stack+0x7c/0xbf
print_address_description+0x70/0x280
audit: type=1131 audit(1533153605.287:234): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=sysstat-collect comm="systemd" exe="/usr/lib/systemd/s'
? fq_flow_dequeue+0x353/0x3c0 [mac80211]
kasan_report+0x25c/0x350
fq_flow_dequeue+0x353/0x3c0 [mac80211]
fq_flow_reset.constprop.56+0x2b/0x2d0 [mac80211]
fq_reset.constprop.53+0x79/0x110 [mac80211]
ieee80211_txq_teardown_flows+0xc2/0x100 [mac80211]
ieee80211_unregister_hw+0x17b/0x260 [mac80211]
ath10k_mac_unregister+0x35/0x1a0 [ath10k_core]
ath10k_core_unregister+0x60/0x160 [ath10k_core]
ath10k_pci_remove+0x53/0x100 [ath10k_pci]
pci_device_remove+0x97/0x1d0
device_release_driver_internal+0x26f/0x520
driver_detach+0x9d/0x140
bus_remove_driver+0xde/0x2c0
pci_unregister_driver+0x28/0x1a0
ath10k_pci_exit+0xc/0x14 [ath10k_pci]
SyS_delete_module+0x39a/0x4a0
? free_module+0x7d0/0x7d0
? exit_to_usermode_loop+0x75/0xf0
? free_module+0x7d0/0x7d0
do_syscall_64+0x193/0x5e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x7f65a31ac5e7
RSP: 002b:00007ffd0781e9a8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 00007ffd0781e9f8 RCX: 00007f65a31ac5e7
RDX: 000000000000000a RSI: 0000000000000800 RDI: 000055e08a426248
RBP: 000055e08a4261e0 R08: 000000000000000a R09: 1999999999999999
R10: 00007f65a321c1a0 R11: 0000000000000206 R12: 00007ffd0781ebc0
R13: 00007ffd07820643 R14: 0000000000000000 R15: 000055e08a4261e0

The buggy address belongs to the page:
page:ffffea0004f64a80 count:0 mapcount:0 mapping:0000000000000000 index:0xffff88013d92a640
flags: 0x5fff8000000000()
raw: 005fff8000000000 0000000000000000 ffff88013d92a640 00000000ffffffff
raw: 0000000000000000 dead000000000200 ffff88014c02a600 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff88013d92a600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88013d92a680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88013d92a700: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88013d92a780: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88013d92a800: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================


Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2018-08-02 21:47:15

by Ben Greear

[permalink] [raw]
Subject: Re: use-after free bug in hacked 4.16 kernel, related to fq_flow_dequeue

On 08/02/2018 12:45 PM, Toke H?iland-J?rgensen wrote:
> Ben Greear <[email protected]> writes:
>
>> This is from my hacked kernel, could be my fault. I thought the fq
>> guys might want to know however...
>
> Hmm, nothing obvious comes to mind; fq_flow_dequeue() just dequeues a
> packet from the queue; it only has two memory derefs, to fq->lock and
> flow->queue. Don't see why either of those should be freed at this
> point.
>
> Unless fq_adjust_removal() is being inlined, perhaps? Then I suppose the
> flow->tin reference could be the problem, if the txq_info struct was
> already freed; did you change anything around the handling of TXQs?

I have worked on some stuff to fix other leaks and corruptions in ath10k related
to txqs, maybe that is part of this problem. My full tree is here:

https://github.com/greearb/linux-ct-4.16

This bug in question is fairly repeatable on my current setup, which is high speed
tx + rx on a 9984 NIC, with buggy firmware that crashes often in the tx
path. I think the crash only happens when I rmmod the driver under
load, but possibly some of the fw crash cleanup logic that ran previously
is also involved.

I'll get the FW fixed sooner or later and quite reloading modules, and then this
problem will probably go away.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2018-08-13 14:49:26

by Toke Høiland-Jørgensen

[permalink] [raw]
Subject: Re: use-after free bug in hacked 4.16 kernel, related to fq_flow_dequeue

Ben Greear <[email protected]> writes:

> On 08/02/2018 01:20 PM, Toke H=C3=B8iland-J=C3=B8rgensen wrote:
>> Ben Greear <[email protected]> writes:
>>
>>> On 08/02/2018 12:45 PM, Toke H=C3=B8iland-J=C3=B8rgensen wrote:
>>>> Ben Greear <[email protected]> writes:
>>>>
>>>>> This is from my hacked kernel, could be my fault. I thought the fq
>>>>> guys might want to know however...
>>>>
>>>> Hmm, nothing obvious comes to mind; fq_flow_dequeue() just dequeues a
>>>> packet from the queue; it only has two memory derefs, to fq->lock and
>>>> flow->queue. Don't see why either of those should be freed at this
>>>> point.
>>>>
>>>> Unless fq_adjust_removal() is being inlined, perhaps? Then I suppose t=
he
>>>> flow->tin reference could be the problem, if the txq_info struct was
>>>> already freed; did you change anything around the handling of TXQs?
>>>
>>> I have worked on some stuff to fix other leaks and corruptions in ath10=
k related
>>> to txqs, maybe that is part of this problem. My full tree is here:
>>>
>>> https://github.com/greearb/linux-ct-4.16
>>>
>>> This bug in question is fairly repeatable on my current setup, which
>>> is high speed tx + rx on a 9984 NIC, with buggy firmware that crashes
>>> often in the tx path. I think the crash only happens when I rmmod the
>>> driver under load, but possibly some of the fw crash cleanup logic
>>> that ran previously is also involved.
>>
>> Yeah, if it happens under load that is consistent with packets being
>> queued.
>>
>> It seems that mac80211 frees the netdevs of an interface before flushing
>> the TXQs, which may be the cause of the bug you are seeing. Could you
>> try the patch below and see if that fixes the issue?
>
> I've run with this for a few days, and it seems to at least not cause
> any extra problems. I mostly fixed the firmware crashing I was seeing
> before, so not certain it fixes the root cause of the crashes I
> saw before. I'm going to roll this into my 4.16 ct kernel for wider
> testing.

Right, thanks for testing. I'll send a proper patch :)

-Toke

2018-08-02 22:13:29

by Toke Høiland-Jørgensen

[permalink] [raw]
Subject: Re: use-after free bug in hacked 4.16 kernel, related to fq_flow_dequeue

Ben Greear <[email protected]> writes:

> On 08/02/2018 12:45 PM, Toke H=C3=B8iland-J=C3=B8rgensen wrote:
>> Ben Greear <[email protected]> writes:
>>
>>> This is from my hacked kernel, could be my fault. I thought the fq
>>> guys might want to know however...
>>
>> Hmm, nothing obvious comes to mind; fq_flow_dequeue() just dequeues a
>> packet from the queue; it only has two memory derefs, to fq->lock and
>> flow->queue. Don't see why either of those should be freed at this
>> point.
>>
>> Unless fq_adjust_removal() is being inlined, perhaps? Then I suppose the
>> flow->tin reference could be the problem, if the txq_info struct was
>> already freed; did you change anything around the handling of TXQs?
>
> I have worked on some stuff to fix other leaks and corruptions in ath10k =
related
> to txqs, maybe that is part of this problem. My full tree is here:
>
> https://github.com/greearb/linux-ct-4.16
>
> This bug in question is fairly repeatable on my current setup, which
> is high speed tx + rx on a 9984 NIC, with buggy firmware that crashes
> often in the tx path. I think the crash only happens when I rmmod the
> driver under load, but possibly some of the fw crash cleanup logic
> that ran previously is also involved.

Yeah, if it happens under load that is consistent with packets being
queued.

It seems that mac80211 frees the netdevs of an interface before flushing
the TXQs, which may be the cause of the bug you are seeing. Could you
try the patch below and see if that fixes the issue?

-Toke


diff --git a/net/mac80211/main.c b/net/mac80211/main.c
index e65c2abb2a54..d21ef14d327d 100644
--- a/net/mac80211/main.c
+++ b/net/mac80211/main.c
@@ -1213,6 +1213,7 @@ void ieee80211_unregister_hw(struct ieee80211_hw *hw)
#if IS_ENABLED(CONFIG_IPV6)
unregister_inet6addr_notifier(&local->ifa6_notifier);
#endif
+ ieee80211_txq_teardown_flows(local);
=20
rtnl_lock();
=20
@@ -1241,7 +1242,6 @@ void ieee80211_unregister_hw(struct ieee80211_hw *hw)
skb_queue_purge(&local->skb_queue);
skb_queue_purge(&local->skb_queue_unreliable);
skb_queue_purge(&local->skb_queue_tdls_chsw);
- ieee80211_txq_teardown_flows(local);
=20
destroy_workqueue(local->workqueue);
wiphy_unregister(local->hw.wiphy);

2018-08-02 21:38:30

by Toke Høiland-Jørgensen

[permalink] [raw]
Subject: Re: use-after free bug in hacked 4.16 kernel, related to fq_flow_dequeue

Ben Greear <[email protected]> writes:

> This is from my hacked kernel, could be my fault. I thought the fq
> guys might want to know however...

Hmm, nothing obvious comes to mind; fq_flow_dequeue() just dequeues a
packet from the queue; it only has two memory derefs, to fq->lock and
flow->queue. Don't see why either of those should be freed at this
point.

Unless fq_adjust_removal() is being inlined, perhaps? Then I suppose the
flow->tin reference could be the problem, if the txq_info struct was
already freed; did you change anything around the handling of TXQs?

-Toke

2018-08-12 23:44:11

by Ben Greear

[permalink] [raw]
Subject: Re: use-after free bug in hacked 4.16 kernel, related to fq_flow_dequeue

On 08/02/2018 01:20 PM, Toke Høiland-Jørgensen wrote:
> Ben Greear <[email protected]> writes:
>
>> On 08/02/2018 12:45 PM, Toke Høiland-Jørgensen wrote:
>>> Ben Greear <[email protected]> writes:
>>>
>>>> This is from my hacked kernel, could be my fault. I thought the fq
>>>> guys might want to know however...
>>>
>>> Hmm, nothing obvious comes to mind; fq_flow_dequeue() just dequeues a
>>> packet from the queue; it only has two memory derefs, to fq->lock and
>>> flow->queue. Don't see why either of those should be freed at this
>>> point.
>>>
>>> Unless fq_adjust_removal() is being inlined, perhaps? Then I suppose the
>>> flow->tin reference could be the problem, if the txq_info struct was
>>> already freed; did you change anything around the handling of TXQs?
>>
>> I have worked on some stuff to fix other leaks and corruptions in ath10k related
>> to txqs, maybe that is part of this problem. My full tree is here:
>>
>> https://github.com/greearb/linux-ct-4.16
>>
>> This bug in question is fairly repeatable on my current setup, which
>> is high speed tx + rx on a 9984 NIC, with buggy firmware that crashes
>> often in the tx path. I think the crash only happens when I rmmod the
>> driver under load, but possibly some of the fw crash cleanup logic
>> that ran previously is also involved.
>
> Yeah, if it happens under load that is consistent with packets being
> queued.
>
> It seems that mac80211 frees the netdevs of an interface before flushing
> the TXQs, which may be the cause of the bug you are seeing. Could you
> try the patch below and see if that fixes the issue?

I've run with this for a few days, and it seems to at least not cause
any extra problems. I mostly fixed the firmware crashing I was seeing
before, so not certain it fixes the root cause of the crashes I
saw before. I'm going to roll this into my 4.16 ct kernel for wider
testing.

Thanks,
Ben

>
> -Toke
>
>
> diff --git a/net/mac80211/main.c b/net/mac80211/main.c
> index e65c2abb2a54..d21ef14d327d 100644
> --- a/net/mac80211/main.c
> +++ b/net/mac80211/main.c
> @@ -1213,6 +1213,7 @@ void ieee80211_unregister_hw(struct ieee80211_hw *hw)
> #if IS_ENABLED(CONFIG_IPV6)
> unregister_inet6addr_notifier(&local->ifa6_notifier);
> #endif
> + ieee80211_txq_teardown_flows(local);
>
> rtnl_lock();
>
> @@ -1241,7 +1242,6 @@ void ieee80211_unregister_hw(struct ieee80211_hw *hw)
> skb_queue_purge(&local->skb_queue);
> skb_queue_purge(&local->skb_queue_unreliable);
> skb_queue_purge(&local->skb_queue_tdls_chsw);
> - ieee80211_txq_teardown_flows(local);
>
> destroy_workqueue(local->workqueue);
> wiphy_unregister(local->hw.wiphy);
>


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com