Hello Luiz,
Thanks for your review.
On 05/09/2023 22:42, Luiz Augusto von Dentz wrote:
> Hi Olivier,
>
> On Mon, Sep 4, 2023 at 3:12 PM Olivier L'Heureux
> <[email protected]> wrote:
>>
>> Request for Comments
[...]
>>
>> The "ble-memleak-repro" program reproduces the memory leak, if the
>> kernel is not patched. Its source is in
>> "package/ble-memleak-repro/ble-memleak-repro.c" [18].
>
> We should probably create a test in l2cap-tester using SO_SNDTIMEO
> first, so we can make sure CI test such case and we are able to detect
> if the problem is reintroduced later:
>
> https://github.com/bluez/bluez/blob/master/tools/l2cap-tester.c
I didn't know about "l2cap-tester.c". Agree, it would be great to
integrate my test app into it. I could try, but I don't know the test
framework yet.
>> Memory Leak Fix
>> ===============
>>
>> We have fixed the memory leak, see the patch series in
>> "patches/linux/":
>>
>> 1. The first patch
>> "0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch" [11]
>> enables Bluetooth on the DK2.
>
> This above should probably be sent separately.
>
>> 2. The second patch
>> "0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch" [12]
>> adds many dynamic debug traces that help understand the kernel
>> behaviour and freeing problems.
>
> I'm fine with this change, but we better use the likes of bt_dev_dbg
> instead of BT_DBG.
This commit intended to increase the visibility during debugging, and
I was not intending it for a permanent presence in the kernel.
But if you find it useful, I can submit a patch RFC v2 with
"bt_dev_dbg()" instead of "BT_DBG()". Note that there is currently no
"bt_dev_dbg()" in "l2cap_core.c" yet.
>> "0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch" [13]
>> and
>> "0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch" [14]
>> fix the memory leak.
>> 4. Patches
>> "0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch" [15],
>> "0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch" [16]
>> and
>> "0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch" [17]
>> fixes the use-after-free that appears when the "struct l2cap_conn"
>> [5] and "struct hci_conn" [6] objects are freed.
>
> These I'm not very comfortable applying as they are, I'm afraid there
> could be regressions if they are not accompanied with proper tests,
> besides I think there are less intrusive ways to cleanup l2cap_conn,
> see below.
>
>> The commit messages explain why the commit is needed, which problem
>> the commit solves, and how.
>>
>> The first and second patches are present for the memory leak
>> reproduction only, they should not be part of a final fix.
>>
>> Patch Status
>> ============
>>
>> As of writing, the memory leak is fixed. The fix opened the door to
>> other problems, especially use-after-free, sometimes followed by
>> crashes due to NULL dereferences. We think there are weak references
>> (i.e. pointers that don't increment the refcounter) previously
>> pointing to memory that was never freed, but now is freed. On kernels
>> v5.13 and v5.15, patches 0005, 0006 and 0007 seem to fix the
>> use-after-free and NULL dereferences, but not on kernel v6.5 yet.
>
> I think the problem is that the lifetime of l2cap_conn shall be hooked
> with hci_chan, but the likes of hci_chan_list_flush -> hci_chan_del
> don't actually trigger l2cap_conn_del, which imo is the culprit here,
> because connect_cfm/disconnect_cfm is not called when the connection
> procedure has been aborted prematurely, so perhaps we shall get rid of
> the likes of l2cap_connect_cfm/l2cap_disconn_cfm and instead do the
> cleanup with in the following order:
> hci_conn_cleanup->hci_chan_list_flush->hci_chan_del->l2cap_conn_del,
> that way we avoid having multiple code paths attempting to cleanup
> objects associated with hci_conn/hci_chan.
I fully agree with your analysis, which correspond to mine: we should
call "l2cap_conn_del()", it would properly clean the allocations in
"l2cap_conn_add()".
I have tried but it was not obvious to find the right place to call
"l2cap_conn_del()" with the proper locking.
As you write, connect_cfm/disconnect_cfm is not called when the
connection is aborted, and that is the root cause of the memory leak.
Your proposal is most probably the best way to go.
> I'd change hci_chan_create to take a del callback to avoid having
> circular dependencies on one another though.
Interesting, could you explain how you would do it? Perhaps point on
a callback example?
>> Reproducing with Buildroot
>> ==========================
>>
>> We have reproduced and fixed the memory leak with Buildroot [7] and a
>> "ble-memleak-repro" test application on an ST's Discovery Kit DK2 [4].
>>
>> The "ble-memleak-repro" repository [1] contains the sources of a
>> complete external Buildroot customisation [8], with the patches, a
>> README, and more explanations to reproduce the problem with Buildroot
>> on a DK2.
>>
>> References:
>> ===========
>>
>> - [1]: <https://gitlab.com/essensium-mind/ble-memleak-repro.git>
>> "ble-memleak-repro repository"
[...]
--
Olivier L'Heureux
Hi Olivier,
On Wed, Sep 13, 2023 at 3:25 PM Olivier L'Heureux
<[email protected]> wrote:
>
> Hello Luiz,
>
> Thanks for your review.
>
> On 05/09/2023 22:42, Luiz Augusto von Dentz wrote:
> > Hi Olivier,
> >
> > On Mon, Sep 4, 2023 at 3:12 PM Olivier L'Heureux
> > <[email protected]> wrote:
> >>
> >> Request for Comments
> [...]
> >>
> >> The "ble-memleak-repro" program reproduces the memory leak, if the
> >> kernel is not patched. Its source is in
> >> "package/ble-memleak-repro/ble-memleak-repro.c" [18].
> >
> > We should probably create a test in l2cap-tester using SO_SNDTIMEO
> > first, so we can make sure CI test such case and we are able to detect
> > if the problem is reintroduced later:
> >
> > https://github.com/bluez/bluez/blob/master/tools/l2cap-tester.c
>
> I didn't know about "l2cap-tester.c". Agree, it would be great to
> integrate my test app into it. I could try, but I don't know the test
> framework yet.
>
> >> Memory Leak Fix
> >> ===============
> >>
> >> We have fixed the memory leak, see the patch series in
> >> "patches/linux/":
> >>
> >> 1. The first patch
> >> "0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch" [11]
> >> enables Bluetooth on the DK2.
> >
> > This above should probably be sent separately.
> >
> >> 2. The second patch
> >> "0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch" [12]
> >> adds many dynamic debug traces that help understand the kernel
> >> behaviour and freeing problems.
> >
> > I'm fine with this change, but we better use the likes of bt_dev_dbg
> > instead of BT_DBG.
>
> This commit intended to increase the visibility during debugging, and
> I was not intending it for a permanent presence in the kernel.
> But if you find it useful, I can submit a patch RFC v2 with
> "bt_dev_dbg()" instead of "BT_DBG()". Note that there is currently no
> "bt_dev_dbg()" in "l2cap_core.c" yet.
>
> >> "0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch" [13]
> >> and
> >> "0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch" [14]
> >> fix the memory leak.
> >> 4. Patches
> >> "0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch" [15],
> >> "0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch" [16]
> >> and
> >> "0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch" [17]
> >> fixes the use-after-free that appears when the "struct l2cap_conn"
> >> [5] and "struct hci_conn" [6] objects are freed.
> >
> > These I'm not very comfortable applying as they are, I'm afraid there
> > could be regressions if they are not accompanied with proper tests,
> > besides I think there are less intrusive ways to cleanup l2cap_conn,
> > see below.
> >
> >> The commit messages explain why the commit is needed, which problem
> >> the commit solves, and how.
> >>
> >> The first and second patches are present for the memory leak
> >> reproduction only, they should not be part of a final fix.
> >>
> >> Patch Status
> >> ============
> >>
> >> As of writing, the memory leak is fixed. The fix opened the door to
> >> other problems, especially use-after-free, sometimes followed by
> >> crashes due to NULL dereferences. We think there are weak references
> >> (i.e. pointers that don't increment the refcounter) previously
> >> pointing to memory that was never freed, but now is freed. On kernels
> >> v5.13 and v5.15, patches 0005, 0006 and 0007 seem to fix the
> >> use-after-free and NULL dereferences, but not on kernel v6.5 yet.
> >
> > I think the problem is that the lifetime of l2cap_conn shall be hooked
> > with hci_chan, but the likes of hci_chan_list_flush -> hci_chan_del
> > don't actually trigger l2cap_conn_del, which imo is the culprit here,
> > because connect_cfm/disconnect_cfm is not called when the connection
> > procedure has been aborted prematurely, so perhaps we shall get rid of
> > the likes of l2cap_connect_cfm/l2cap_disconn_cfm and instead do the
> > cleanup with in the following order:
> > hci_conn_cleanup->hci_chan_list_flush->hci_chan_del->l2cap_conn_del,
> > that way we avoid having multiple code paths attempting to cleanup
> > objects associated with hci_conn/hci_chan.
>
> I fully agree with your analysis, which correspond to mine: we should
> call "l2cap_conn_del()", it would properly clean the allocations in
> "l2cap_conn_add()".
> I have tried but it was not obvious to find the right place to call
> "l2cap_conn_del()" with the proper locking.
> As you write, connect_cfm/disconnect_cfm is not called when the
> connection is aborted, and that is the root cause of the memory leak.
Btw, I was trying to reproduce it with the following test set, but at
least kmemleak was not able to detect leaks of l2cap_conn:
https://patchwork.kernel.org/project/bluetooth/list/?series=784343
> Your proposal is most probably the best way to go.
>
> > I'd change hci_chan_create to take a del callback to avoid having
> > circular dependencies on one another though.
>
> Interesting, could you explain how you would do it? Perhaps point on
> a callback example?
>
> >> Reproducing with Buildroot
> >> ==========================
> >>
> >> We have reproduced and fixed the memory leak with Buildroot [7] and a
> >> "ble-memleak-repro" test application on an ST's Discovery Kit DK2 [4].
> >>
> >> The "ble-memleak-repro" repository [1] contains the sources of a
> >> complete external Buildroot customisation [8], with the patches, a
> >> README, and more explanations to reproduce the problem with Buildroot
> >> on a DK2.
> >>
> >> References:
> >> ===========
> >>
> >> - [1]: <https://gitlab.com/essensium-mind/ble-memleak-repro.git>
> >> "ble-memleak-repro repository"
> [...]
>
> --
> Olivier L'Heureux
--
Luiz Augusto von Dentz
Hi Luiz,
Because Olivier has been and still is being swamped with other work, I'm
kind of trying to take over these patches. So far I can reproduce the
memleak on a recent kernel without the patches.
Olivier told me you added tests to check for the memleak in
l2cap-tester. Can you point me towards more info on how exactly you run
these tests, as the info I find on the web is rather minimal.
From what I read in the thread, it looks like your tests don't catch
the memleak?
Kind regards,
Marleen
On 14/09/2023 23:17, Luiz Augusto von Dentz wrote:
> Hi Olivier,
>
> On Wed, Sep 13, 2023 at 3:25 PM Olivier L'Heureux
> <[email protected]> wrote:
>> Hello Luiz,
>>
>> Thanks for your review.
>>
>> On 05/09/2023 22:42, Luiz Augusto von Dentz wrote:
>>> Hi Olivier,
>>>
>>> On Mon, Sep 4, 2023 at 3:12 PM Olivier L'Heureux
>>> <[email protected]> wrote:
>>>> Request for Comments
>> [...]
>>>> The "ble-memleak-repro" program reproduces the memory leak, if the
>>>> kernel is not patched. Its source is in
>>>> "package/ble-memleak-repro/ble-memleak-repro.c" [18].
>>> We should probably create a test in l2cap-tester using SO_SNDTIMEO
>>> first, so we can make sure CI test such case and we are able to detect
>>> if the problem is reintroduced later:
>>>
>>> https://github.com/bluez/bluez/blob/master/tools/l2cap-tester.c
>> I didn't know about "l2cap-tester.c". Agree, it would be great to
>> integrate my test app into it. I could try, but I don't know the test
>> framework yet.
>>
>>>> Memory Leak Fix
>>>> ===============
>>>>
>>>> We have fixed the memory leak, see the patch series in
>>>> "patches/linux/":
>>>>
>>>> 1. The first patch
>>>> "0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch" [11]
>>>> enables Bluetooth on the DK2.
>>> This above should probably be sent separately.
>>>
>>>> 2. The second patch
>>>> "0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch" [12]
>>>> adds many dynamic debug traces that help understand the kernel
>>>> behaviour and freeing problems.
>>> I'm fine with this change, but we better use the likes of bt_dev_dbg
>>> instead of BT_DBG.
>> This commit intended to increase the visibility during debugging, and
>> I was not intending it for a permanent presence in the kernel.
>> But if you find it useful, I can submit a patch RFC v2 with
>> "bt_dev_dbg()" instead of "BT_DBG()". Note that there is currently no
>> "bt_dev_dbg()" in "l2cap_core.c" yet.
>>
>>>> "0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch" [13]
>>>> and
>>>> "0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch" [14]
>>>> fix the memory leak.
>>>> 4. Patches
>>>> "0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch" [15],
>>>> "0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch" [16]
>>>> and
>>>> "0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch" [17]
>>>> fixes the use-after-free that appears when the "struct l2cap_conn"
>>>> [5] and "struct hci_conn" [6] objects are freed.
>>> These I'm not very comfortable applying as they are, I'm afraid there
>>> could be regressions if they are not accompanied with proper tests,
>>> besides I think there are less intrusive ways to cleanup l2cap_conn,
>>> see below.
>>>
>>>> The commit messages explain why the commit is needed, which problem
>>>> the commit solves, and how.
>>>>
>>>> The first and second patches are present for the memory leak
>>>> reproduction only, they should not be part of a final fix.
>>>>
>>>> Patch Status
>>>> ============
>>>>
>>>> As of writing, the memory leak is fixed. The fix opened the door to
>>>> other problems, especially use-after-free, sometimes followed by
>>>> crashes due to NULL dereferences. We think there are weak references
>>>> (i.e. pointers that don't increment the refcounter) previously
>>>> pointing to memory that was never freed, but now is freed. On kernels
>>>> v5.13 and v5.15, patches 0005, 0006 and 0007 seem to fix the
>>>> use-after-free and NULL dereferences, but not on kernel v6.5 yet.
>>> I think the problem is that the lifetime of l2cap_conn shall be hooked
>>> with hci_chan, but the likes of hci_chan_list_flush -> hci_chan_del
>>> don't actually trigger l2cap_conn_del, which imo is the culprit here,
>>> because connect_cfm/disconnect_cfm is not called when the connection
>>> procedure has been aborted prematurely, so perhaps we shall get rid of
>>> the likes of l2cap_connect_cfm/l2cap_disconn_cfm and instead do the
>>> cleanup with in the following order:
>>> hci_conn_cleanup->hci_chan_list_flush->hci_chan_del->l2cap_conn_del,
>>> that way we avoid having multiple code paths attempting to cleanup
>>> objects associated with hci_conn/hci_chan.
>> I fully agree with your analysis, which correspond to mine: we should
>> call "l2cap_conn_del()", it would properly clean the allocations in
>> "l2cap_conn_add()".
>> I have tried but it was not obvious to find the right place to call
>> "l2cap_conn_del()" with the proper locking.
>> As you write, connect_cfm/disconnect_cfm is not called when the
>> connection is aborted, and that is the root cause of the memory leak.
> Btw, I was trying to reproduce it with the following test set, but at
> least kmemleak was not able to detect leaks of l2cap_conn:
>
> https://patchwork.kernel.org/project/bluetooth/list/?series=784343
>
>> Your proposal is most probably the best way to go.
>>
>>> I'd change hci_chan_create to take a del callback to avoid having
>>> circular dependencies on one another though.
>> Interesting, could you explain how you would do it? Perhaps point on
>> a callback example?
>>
>>>> Reproducing with Buildroot
>>>> ==========================
>>>>
>>>> We have reproduced and fixed the memory leak with Buildroot [7] and a
>>>> "ble-memleak-repro" test application on an ST's Discovery Kit DK2 [4].
>>>>
>>>> The "ble-memleak-repro" repository [1] contains the sources of a
>>>> complete external Buildroot customisation [8], with the patches, a
>>>> README, and more explanations to reproduce the problem with Buildroot
>>>> on a DK2.
>>>>
>>>> References:
>>>> ===========
>>>>
>>>> - [1]:<https://gitlab.com/essensium-mind/ble-memleak-repro.git>
>>>> "ble-memleak-repro repository"
>> [...]
>>
>> --
>> Olivier L'Heureux