2021-02-25 09:58:33

by Allen

[permalink] [raw]
Subject: [PATCH v2 1/2] optee: fix tee out of memory failure seen during kexec reboot

From: Allen Pais <[email protected]>

The following out of memory errors are seen on kexec reboot
from the optee core.

[ 0.368428] tee_bnxt_fw optee-clnt0: tee_shm_alloc failed
[ 0.368461] tee_bnxt_fw: probe of optee-clnt0 failed with error -22

tee_shm_release() is not invoked on dma shm buffer.

Implement .shutdown() method to handle the release of the buffers
correctly.

More info:
https://github.com/OP-TEE/optee_os/issues/3637

Signed-off-by: Allen Pais <[email protected]>
---
drivers/tee/optee/core.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/drivers/tee/optee/core.c b/drivers/tee/optee/core.c
index cf4718c6d35d..80e2774b5e2a 100644
--- a/drivers/tee/optee/core.c
+++ b/drivers/tee/optee/core.c
@@ -582,6 +582,13 @@ static optee_invoke_fn *get_invoke_func(struct device *dev)
return ERR_PTR(-EINVAL);
}

+/* optee_remove - Device Removal Routine
+ * @pdev: platform device information struct
+ *
+ * optee_remove is called by platform subsystem to alter the driver
+ * that it should release the device
+ */
+
static int optee_remove(struct platform_device *pdev)
{
struct optee *optee = platform_get_drvdata(pdev);
@@ -612,6 +619,18 @@ static int optee_remove(struct platform_device *pdev)
return 0;
}

+/* optee_shutdown - Device Removal Routine
+ * @pdev: platform device information struct
+ *
+ * platform_shutdown is called by the platform subsystem to alter
+ * the driver that a shutdown/reboot(or kexec) is happening and
+ * device must be disabled.
+ */
+static void optee_shutdown(struct platform_device *pdev)
+{
+ optee_disable_shm_cache(platform_get_drvdata(pdev));
+}
+
static int optee_probe(struct platform_device *pdev)
{
optee_invoke_fn *invoke_fn;
@@ -738,6 +757,7 @@ MODULE_DEVICE_TABLE(of, optee_dt_match);
static struct platform_driver optee_driver = {
.probe = optee_probe,
.remove = optee_remove,
+ .shutdown = optee_shutdown,
.driver = {
.name = "optee",
.of_match_table = optee_dt_match,
--
2.25.1


2021-03-03 04:25:19

by Jens Wiklander

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] optee: fix tee out of memory failure seen during kexec reboot

On Thu, Feb 25, 2021 at 10:06 AM Allen Pais <[email protected]> wrote:
>
> From: Allen Pais <[email protected]>
>
> The following out of memory errors are seen on kexec reboot
> from the optee core.
>
> [ 0.368428] tee_bnxt_fw optee-clnt0: tee_shm_alloc failed
> [ 0.368461] tee_bnxt_fw: probe of optee-clnt0 failed with error -22
>
> tee_shm_release() is not invoked on dma shm buffer.
>
> Implement .shutdown() method to handle the release of the buffers
> correctly.
>
> More info:
> https://github.com/OP-TEE/optee_os/issues/3637
>
> Signed-off-by: Allen Pais <[email protected]>
> ---
> drivers/tee/optee/core.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)

This looks good to me. Do you have a practical way of testing this on
QEMU for instance?

Thanks,
Jens

>
> diff --git a/drivers/tee/optee/core.c b/drivers/tee/optee/core.c
> index cf4718c6d35d..80e2774b5e2a 100644
> --- a/drivers/tee/optee/core.c
> +++ b/drivers/tee/optee/core.c
> @@ -582,6 +582,13 @@ static optee_invoke_fn *get_invoke_func(struct device *dev)
> return ERR_PTR(-EINVAL);
> }
>
> +/* optee_remove - Device Removal Routine
> + * @pdev: platform device information struct
> + *
> + * optee_remove is called by platform subsystem to alter the driver
> + * that it should release the device
> + */
> +
> static int optee_remove(struct platform_device *pdev)
> {
> struct optee *optee = platform_get_drvdata(pdev);
> @@ -612,6 +619,18 @@ static int optee_remove(struct platform_device *pdev)
> return 0;
> }
>
> +/* optee_shutdown - Device Removal Routine
> + * @pdev: platform device information struct
> + *
> + * platform_shutdown is called by the platform subsystem to alter
> + * the driver that a shutdown/reboot(or kexec) is happening and
> + * device must be disabled.
> + */
> +static void optee_shutdown(struct platform_device *pdev)
> +{
> + optee_disable_shm_cache(platform_get_drvdata(pdev));
> +}
> +
> static int optee_probe(struct platform_device *pdev)
> {
> optee_invoke_fn *invoke_fn;
> @@ -738,6 +757,7 @@ MODULE_DEVICE_TABLE(of, optee_dt_match);
> static struct platform_driver optee_driver = {
> .probe = optee_probe,
> .remove = optee_remove,
> + .shutdown = optee_shutdown,
> .driver = {
> .name = "optee",
> .of_match_table = optee_dt_match,
> --
> 2.25.1
>

2021-03-04 05:50:34

by Allen Pais

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] optee: fix tee out of memory failure seen during kexec reboot

>>
>> From: Allen Pais <[email protected]>
>>
>> The following out of memory errors are seen on kexec reboot
>> from the optee core.
>>
>> [ 0.368428] tee_bnxt_fw optee-clnt0: tee_shm_alloc failed
>> [ 0.368461] tee_bnxt_fw: probe of optee-clnt0 failed with error -22
>>
>> tee_shm_release() is not invoked on dma shm buffer.
>>
>> Implement .shutdown() method to handle the release of the buffers
>> correctly.
>>
>> More info:
>> https://github.com/OP-TEE/optee_os/issues/3637
>>
>> Signed-off-by: Allen Pais <[email protected]>
>> ---
>> drivers/tee/optee/core.c | 20 ++++++++++++++++++++
>> 1 file changed, 20 insertions(+)
>
> This looks good to me. Do you have a practical way of testing this on
> QEMU for instance?

I have not tried this on QEMU. I will give it a go today.

Thanks.

>
> Thanks,
> Jens
>
>>
>> diff --git a/drivers/tee/optee/core.c b/drivers/tee/optee/core.c
>> index cf4718c6d35d..80e2774b5e2a 100644
>> --- a/drivers/tee/optee/core.c
>> +++ b/drivers/tee/optee/core.c
>> @@ -582,6 +582,13 @@ static optee_invoke_fn *get_invoke_func(struct device *dev)
>> return ERR_PTR(-EINVAL);
>> }
>>
>> +/* optee_remove - Device Removal Routine
>> + * @pdev: platform device information struct
>> + *
>> + * optee_remove is called by platform subsystem to alter the driver
>> + * that it should release the device
>> + */
>> +
>> static int optee_remove(struct platform_device *pdev)
>> {
>> struct optee *optee = platform_get_drvdata(pdev);
>> @@ -612,6 +619,18 @@ static int optee_remove(struct platform_device *pdev)
>> return 0;
>> }
>>
>> +/* optee_shutdown - Device Removal Routine
>> + * @pdev: platform device information struct
>> + *
>> + * platform_shutdown is called by the platform subsystem to alter
>> + * the driver that a shutdown/reboot(or kexec) is happening and
>> + * device must be disabled.
>> + */
>> +static void optee_shutdown(struct platform_device *pdev)
>> +{
>> + optee_disable_shm_cache(platform_get_drvdata(pdev));
>> +}
>> +
>> static int optee_probe(struct platform_device *pdev)
>> {
>> optee_invoke_fn *invoke_fn;
>> @@ -738,6 +757,7 @@ MODULE_DEVICE_TABLE(of, optee_dt_match);
>> static struct platform_driver optee_driver = {
>> .probe = optee_probe,
>> .remove = optee_remove,
>> + .shutdown = optee_shutdown,
>> .driver = {
>> .name = "optee",
>> .of_match_table = optee_dt_match,
>> --
>> 2.25.1
>>

2021-03-16 20:25:34

by Allen Pais

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] optee: fix tee out of memory failure seen during kexec reboot



>>
>> [ 0.368428] tee_bnxt_fw optee-clnt0: tee_shm_alloc failed
>> [ 0.368461] tee_bnxt_fw: probe of optee-clnt0 failed with error -22
>>
>> tee_shm_release() is not invoked on dma shm buffer.
>>
>> Implement .shutdown() method to handle the release of the buffers
>> correctly.
>>
>> More info:
>> https://github.com/OP-TEE/optee_os/issues/3637
>>
>> Signed-off-by: Allen Pais <[email protected]>
>> ---
>> drivers/tee/optee/core.c | 20 ++++++++++++++++++++
>> 1 file changed, 20 insertions(+)
>
> This looks good to me. Do you have a practical way of testing this on
> QEMU for instance?
>

Jens,

I could not reproduce nor create a setup using QEMU, I could only
do it on a real h/w.

I have extensively tested the fix and I don't see any issues.

Thanks.

2021-03-18 20:55:52

by Tyler Hicks

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] optee: fix tee out of memory failure seen during kexec reboot

On 2021-02-25 14:36:09, Allen Pais wrote:
> From: Allen Pais <[email protected]>
>
> The following out of memory errors are seen on kexec reboot
> from the optee core.
>
> [ 0.368428] tee_bnxt_fw optee-clnt0: tee_shm_alloc failed
> [ 0.368461] tee_bnxt_fw: probe of optee-clnt0 failed with error -22
>
> tee_shm_release() is not invoked on dma shm buffer.
>
> Implement .shutdown() method to handle the release of the buffers
> correctly.
>
> More info:
> https://github.com/OP-TEE/optee_os/issues/3637
>
> Signed-off-by: Allen Pais <[email protected]>
> ---
> drivers/tee/optee/core.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/drivers/tee/optee/core.c b/drivers/tee/optee/core.c
> index cf4718c6d35d..80e2774b5e2a 100644
> --- a/drivers/tee/optee/core.c
> +++ b/drivers/tee/optee/core.c
> @@ -582,6 +582,13 @@ static optee_invoke_fn *get_invoke_func(struct device *dev)
> return ERR_PTR(-EINVAL);
> }
>
> +/* optee_remove - Device Removal Routine
> + * @pdev: platform device information struct
> + *
> + * optee_remove is called by platform subsystem to alter the driver
^ alert?

> + * that it should release the device
> + */
> +
> static int optee_remove(struct platform_device *pdev)
> {
> struct optee *optee = platform_get_drvdata(pdev);
> @@ -612,6 +619,18 @@ static int optee_remove(struct platform_device *pdev)
> return 0;
> }
>
> +/* optee_shutdown - Device Removal Routine
> + * @pdev: platform device information struct
> + *
> + * platform_shutdown is called by the platform subsystem to alter
^ alert

With those two changes,

Reviewed-by: Tyler Hicks <[email protected]>

Tyler

> + * the driver that a shutdown/reboot(or kexec) is happening and
> + * device must be disabled.
> + */
> +static void optee_shutdown(struct platform_device *pdev)
> +{
> + optee_disable_shm_cache(platform_get_drvdata(pdev));
> +}
> +
> static int optee_probe(struct platform_device *pdev)
> {
> optee_invoke_fn *invoke_fn;
> @@ -738,6 +757,7 @@ MODULE_DEVICE_TABLE(of, optee_dt_match);
> static struct platform_driver optee_driver = {
> .probe = optee_probe,
> .remove = optee_remove,
> + .shutdown = optee_shutdown,
> .driver = {
> .name = "optee",
> .of_match_table = optee_dt_match,
> --
> 2.25.1
>

2021-03-19 07:07:03

by Jens Wiklander

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] optee: fix tee out of memory failure seen during kexec reboot

On Tue, Mar 16, 2021 at 2:21 PM Allen Pais <[email protected]> wrote:
>
>
>
> >>
> >> [ 0.368428] tee_bnxt_fw optee-clnt0: tee_shm_alloc failed
> >> [ 0.368461] tee_bnxt_fw: probe of optee-clnt0 failed with error -22
> >>
> >> tee_shm_release() is not invoked on dma shm buffer.
> >>
> >> Implement .shutdown() method to handle the release of the buffers
> >> correctly.
> >>
> >> More info:
> >> https://github.com/OP-TEE/optee_os/issues/3637
> >>
> >> Signed-off-by: Allen Pais <[email protected]>
> >> ---
> >> drivers/tee/optee/core.c | 20 ++++++++++++++++++++
> >> 1 file changed, 20 insertions(+)
> >
> > This looks good to me. Do you have a practical way of testing this on
> > QEMU for instance?
> >
>
> Jens,
>
> I could not reproduce nor create a setup using QEMU, I could only
> do it on a real h/w.
>
> I have extensively tested the fix and I don't see any issues.

I did a few test runs too, seems OK.

Thanks,
Jens

2021-03-22 08:01:17

by Allen Pais

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] optee: fix tee out of memory failure seen during kexec reboot


>>>>
>>>> [ 0.368428] tee_bnxt_fw optee-clnt0: tee_shm_alloc failed
>>>> [ 0.368461] tee_bnxt_fw: probe of optee-clnt0 failed with error -22
>>>>
>>>> tee_shm_release() is not invoked on dma shm buffer.
>>>>
>>>> Implement .shutdown() method to handle the release of the buffers
>>>> correctly.
>>>>
>>>> More info:
>>>> https://github.com/OP-TEE/optee_os/issues/3637
>>>>
>>>> Signed-off-by: Allen Pais <[email protected]>
>>>> ---
>>>> drivers/tee/optee/core.c | 20 ++++++++++++++++++++
>>>> 1 file changed, 20 insertions(+)
>>>
>>> This looks good to me. Do you have a practical way of testing this on
>>> QEMU for instance?
>>>
>>
>> Jens,
>>
>> I could not reproduce nor create a setup using QEMU, I could only
>> do it on a real h/w.
>>
>> I have extensively tested the fix and I don't see any issues.
>
> I did a few test runs too, seems OK.

Thank you very much.

2021-05-05 13:47:25

by Allen Pais

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] optee: fix tee out of memory failure seen during kexec reboot

Jens,

>>>> [ 0.368428] tee_bnxt_fw optee-clnt0: tee_shm_alloc failed
>>>> [ 0.368461] tee_bnxt_fw: probe of optee-clnt0 failed with error -22
>>>>
>>>> tee_shm_release() is not invoked on dma shm buffer.
>>>>
>>>> Implement .shutdown() method to handle the release of the buffers
>>>> correctly.
>>>>
>>>> More info:
>>>> https://github.com/OP-TEE/optee_os/issues/3637
>>>>
>>>> Signed-off-by: Allen Pais <[email protected]>
>>>> ---
>>>> drivers/tee/optee/core.c | 20 ++++++++++++++++++++
>>>> 1 file changed, 20 insertions(+)
>>>
>>> This looks good to me. Do you have a practical way of testing this on
>>> QEMU for instance?
>>>
>>
>> Jens,
>>
>> I could not reproduce nor create a setup using QEMU, I could only
>> do it on a real h/w.
>>
>> I have extensively tested the fix and I don't see any issues.
>
> I did a few test runs too, seems OK.

I carried these changes and have not run into any issues with Kexec so far.
Last week, while trying out kdump, we ran into a crash(this is when the
Kdump kernel reboots).

$echo c > /proc/sysrq-trigger

Leads to:

[ 18.004831] Unable to handle kernel paging request at virtual address ffff0008dcef6758
[ 18.013002] Mem abort info:
[ 18.015885] ESR = 0x96000005
[ 18.019034] EC = 0x25: DABT (current EL), IL = 32 bits
[ 18.024516] SET = 0, FnV = 0
[ 18.027667] EA = 0, S1PTW = 0
[ 18.030905] Data abort info:
[ 18.033877] ISV = 0, ISS = 0x00000005
[ 18.037835] CM = 0, WnR = 0
[ 18.040896] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000970a78000
[ 18.047811] [ffff0008dcef6758] pgd=000000097fbf9003, pud=0000000000000000
[ 18.054819] Internal error: Oops: 96000005 [#1] SMP
[ 18.059850] Modules linked in: bnxt_en pcie_iproc_platform pcie_iproc diagbe(O)
[ 18.067395] CPU: 3 PID: 1 Comm: systemd-shutdow Tainted: G O 5.4.83-microsoft-standard #1
[ 18.077174] Hardware name: Overlake (DT)
[ 18.081219] pstate: 80400005 (Nzcv daif +PAN -UAO)
[ 18.086170] pc : tee_shm_free+0x18/0x48
[ 18.090126] lr : optee_disable_shm_cache+0xa4/0xf0
[ 18.095066] sp : ffff80001005bb90
[ 18.098484] x29: ffff80001005bb90 x28: ffff000037e20000
[ 18.103962] x27: 0000000000000000 x26: ffff00003ed10490
[ 18.109440] x25: ffffca760e975f90 x24: 0000000000000000
[ 18.114918] x23: ffffca760ed79808 x22: ffff00003ec66e18
[ 18.120396] x21: ffff80001005bc08 x20: 00000000b200000a
[ 18.125874] x19: ffff0008dcef6700 x18: 0000000000000010
[ 18.131352] x17: 0000000000000000 x16: 0000000000000000
[ 18.136829] x15: ffffffffffffffff x14: ffffca760ed79808
[ 18.142307] x13: ffff80009005b897 x12: ffff80001005b89f
[ 18.147786] x11: ffffca760eda4000 x10: ffff80001005b820
[ 18.153264] x9 : 00000000ffffffd0 x8 : ffffca760e59b2c0
[ 18.158742] x7 : 0000000000000000 x6 : 0000000000000000
[ 18.164220] x5 : 0000000000000000 x4 : 0000000000000000
[ 18.169698] x3 : 0000000000000000 x2 : ffff0008dcef6700
[ 18.175175] x1 : 00000000ffff0008 x0 : ffffca760e59ca04
[ 18.180654] Call trace:
[ 18.183176] tee_shm_free+0x18/0x48
[ 18.186773] optee_disable_shm_cache+0xa4/0xf0
[ 18.191356] optee_shutdown+0x20/0x30
[ 18.195135] platform_drv_shutdown+0x2c/0x38
[ 18.199538] device_shutdown+0x180/0x298
[ 18.203586] kernel_restart_prepare+0x44/0x50
[ 18.208078] kernel_restart+0x20/0x68
[ 18.211853] __do_sys_reboot+0x104/0x258
[ 18.215899] __arm64_sys_reboot+0x2c/0x38
[ 18.220035] el0_svc_handler+0x90/0x138
[ 18.223991] el0_svc+0x8/0x208
[ 18.227143] Code: f9000bf3 aa0003f3 aa1e03e0 d503201f (b9405a60)
[ 18.233435] ---[ end trace 835d756cd66aa959 ]---
[ 18.238621] Kernel panic - not syncing: Fatal exception
[ 18.244014] Kernel Offset: 0x4a75fde00000 from 0xffff800010000000
[ 18.250299] PHYS_OFFSET: 0xffff99c680000000
[ 18.254613] CPU features: 0x0002,21806008
[ 18.258747] Memory Limit: none
[ 18.262310] ---[ end Kernel panic - not syncing: Fatal exception ]—

I see that before secure world returns OPTEE_SMC_RETURN_ENOTAVAIL(which
Should disable and clear all the cache) we run into the crash trying to free shm.

Thoughts?

Thanks.

2021-05-06 07:04:06

by Jens Wiklander

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] optee: fix tee out of memory failure seen during kexec reboot

On Wed, May 5, 2021 at 3:45 PM Allen Pais <[email protected]> wrote:
>
> Jens,
>
> >>>> [ 0.368428] tee_bnxt_fw optee-clnt0: tee_shm_alloc failed
> >>>> [ 0.368461] tee_bnxt_fw: probe of optee-clnt0 failed with error -22
> >>>>
> >>>> tee_shm_release() is not invoked on dma shm buffer.
> >>>>
> >>>> Implement .shutdown() method to handle the release of the buffers
> >>>> correctly.
> >>>>
> >>>> More info:
> >>>> https://github.com/OP-TEE/optee_os/issues/3637
> >>>>
> >>>> Signed-off-by: Allen Pais <[email protected]>
> >>>> ---
> >>>> drivers/tee/optee/core.c | 20 ++++++++++++++++++++
> >>>> 1 file changed, 20 insertions(+)
> >>>
> >>> This looks good to me. Do you have a practical way of testing this on
> >>> QEMU for instance?
> >>>
> >>
> >> Jens,
> >>
> >> I could not reproduce nor create a setup using QEMU, I could only
> >> do it on a real h/w.
> >>
> >> I have extensively tested the fix and I don't see any issues.
> >
> > I did a few test runs too, seems OK.
>
> I carried these changes and have not run into any issues with Kexec so far.
> Last week, while trying out kdump, we ran into a crash(this is when the
> Kdump kernel reboots).
>
> $echo c > /proc/sysrq-trigger
>
> Leads to:
>
> [ 18.004831] Unable to handle kernel paging request at virtual address ffff0008dcef6758
> [ 18.013002] Mem abort info:
> [ 18.015885] ESR = 0x96000005
> [ 18.019034] EC = 0x25: DABT (current EL), IL = 32 bits
> [ 18.024516] SET = 0, FnV = 0
> [ 18.027667] EA = 0, S1PTW = 0
> [ 18.030905] Data abort info:
> [ 18.033877] ISV = 0, ISS = 0x00000005
> [ 18.037835] CM = 0, WnR = 0
> [ 18.040896] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000970a78000
> [ 18.047811] [ffff0008dcef6758] pgd=000000097fbf9003, pud=0000000000000000
> [ 18.054819] Internal error: Oops: 96000005 [#1] SMP
> [ 18.059850] Modules linked in: bnxt_en pcie_iproc_platform pcie_iproc diagbe(O)
> [ 18.067395] CPU: 3 PID: 1 Comm: systemd-shutdow Tainted: G O 5.4.83-microsoft-standard #1
> [ 18.077174] Hardware name: Overlake (DT)
> [ 18.081219] pstate: 80400005 (Nzcv daif +PAN -UAO)
> [ 18.086170] pc : tee_shm_free+0x18/0x48
> [ 18.090126] lr : optee_disable_shm_cache+0xa4/0xf0
> [ 18.095066] sp : ffff80001005bb90
> [ 18.098484] x29: ffff80001005bb90 x28: ffff000037e20000
> [ 18.103962] x27: 0000000000000000 x26: ffff00003ed10490
> [ 18.109440] x25: ffffca760e975f90 x24: 0000000000000000
> [ 18.114918] x23: ffffca760ed79808 x22: ffff00003ec66e18
> [ 18.120396] x21: ffff80001005bc08 x20: 00000000b200000a
> [ 18.125874] x19: ffff0008dcef6700 x18: 0000000000000010
> [ 18.131352] x17: 0000000000000000 x16: 0000000000000000
> [ 18.136829] x15: ffffffffffffffff x14: ffffca760ed79808
> [ 18.142307] x13: ffff80009005b897 x12: ffff80001005b89f
> [ 18.147786] x11: ffffca760eda4000 x10: ffff80001005b820
> [ 18.153264] x9 : 00000000ffffffd0 x8 : ffffca760e59b2c0
> [ 18.158742] x7 : 0000000000000000 x6 : 0000000000000000
> [ 18.164220] x5 : 0000000000000000 x4 : 0000000000000000
> [ 18.169698] x3 : 0000000000000000 x2 : ffff0008dcef6700
> [ 18.175175] x1 : 00000000ffff0008 x0 : ffffca760e59ca04
> [ 18.180654] Call trace:
> [ 18.183176] tee_shm_free+0x18/0x48
> [ 18.186773] optee_disable_shm_cache+0xa4/0xf0
> [ 18.191356] optee_shutdown+0x20/0x30
> [ 18.195135] platform_drv_shutdown+0x2c/0x38
> [ 18.199538] device_shutdown+0x180/0x298
> [ 18.203586] kernel_restart_prepare+0x44/0x50
> [ 18.208078] kernel_restart+0x20/0x68
> [ 18.211853] __do_sys_reboot+0x104/0x258
> [ 18.215899] __arm64_sys_reboot+0x2c/0x38
> [ 18.220035] el0_svc_handler+0x90/0x138
> [ 18.223991] el0_svc+0x8/0x208
> [ 18.227143] Code: f9000bf3 aa0003f3 aa1e03e0 d503201f (b9405a60)
> [ 18.233435] ---[ end trace 835d756cd66aa959 ]---
> [ 18.238621] Kernel panic - not syncing: Fatal exception
> [ 18.244014] Kernel Offset: 0x4a75fde00000 from 0xffff800010000000
> [ 18.250299] PHYS_OFFSET: 0xffff99c680000000
> [ 18.254613] CPU features: 0x0002,21806008
> [ 18.258747] Memory Limit: none
> [ 18.262310] ---[ end Kernel panic - not syncing: Fatal exception ]—
>
> I see that before secure world returns OPTEE_SMC_RETURN_ENOTAVAIL(which
> Should disable and clear all the cache) we run into the crash trying to free shm.
>
> Thoughts?

It seems that the pointer is invalid, but the pointer doesn't look
like garbage. Could the kernel have unmapped the memory area covering
that address?

Cheers,
Jens

2021-05-06 07:11:35

by Allen Pais

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] optee: fix tee out of memory failure seen during kexec reboot


>>>>>> [ 0.368428] tee_bnxt_fw optee-clnt0: tee_shm_alloc failed
>>>>>> [ 0.368461] tee_bnxt_fw: probe of optee-clnt0 failed with error -22
>>>>>>
>>>>>> tee_shm_release() is not invoked on dma shm buffer.
>>>>>>
>>>>>> Implement .shutdown() method to handle the release of the buffers
>>>>>> correctly.
>>>>>>
>>>>>> More info:
>>>>>> https://github.com/OP-TEE/optee_os/issues/3637
>>>>>>
>>>>>> Signed-off-by: Allen Pais <[email protected]>
>>>>>> ---
>>>>>> drivers/tee/optee/core.c | 20 ++++++++++++++++++++
>>>>>> 1 file changed, 20 insertions(+)
>>>>>
>>>>> This looks good to me. Do you have a practical way of testing this on
>>>>> QEMU for instance?
>>>>>
>>>>
>>>> Jens,
>>>>
>>>> I could not reproduce nor create a setup using QEMU, I could only
>>>> do it on a real h/w.
>>>>
>>>> I have extensively tested the fix and I don't see any issues.
>>>
>>> I did a few test runs too, seems OK.
>>
>> I carried these changes and have not run into any issues with Kexec so far.
>> Last week, while trying out kdump, we ran into a crash(this is when the
>> Kdump kernel reboots).
>>
>> $echo c > /proc/sysrq-trigger
>>
>> Leads to:
>>
>> [ 18.004831] Unable to handle kernel paging request at virtual address ffff0008dcef6758
>> [ 18.013002] Mem abort info:
>> [ 18.015885] ESR = 0x96000005
>> [ 18.019034] EC = 0x25: DABT (current EL), IL = 32 bits
>> [ 18.024516] SET = 0, FnV = 0
>> [ 18.027667] EA = 0, S1PTW = 0
>> [ 18.030905] Data abort info:
>> [ 18.033877] ISV = 0, ISS = 0x00000005
>> [ 18.037835] CM = 0, WnR = 0
>> [ 18.040896] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000970a78000
>> [ 18.047811] [ffff0008dcef6758] pgd=000000097fbf9003, pud=0000000000000000
>> [ 18.054819] Internal error: Oops: 96000005 [#1] SMP
>> [ 18.059850] Modules linked in: bnxt_en pcie_iproc_platform pcie_iproc diagbe(O)
>> [ 18.067395] CPU: 3 PID: 1 Comm: systemd-shutdow Tainted: G O 5.4.83-microsoft-standard #1
>> [ 18.077174] Hardware name: Overlake (DT)
>> [ 18.081219] pstate: 80400005 (Nzcv daif +PAN -UAO)
>> [ 18.086170] pc : tee_shm_free+0x18/0x48
>> [ 18.090126] lr : optee_disable_shm_cache+0xa4/0xf0
>> [ 18.095066] sp : ffff80001005bb90
>> [ 18.098484] x29: ffff80001005bb90 x28: ffff000037e20000
>> [ 18.103962] x27: 0000000000000000 x26: ffff00003ed10490
>> [ 18.109440] x25: ffffca760e975f90 x24: 0000000000000000
>> [ 18.114918] x23: ffffca760ed79808 x22: ffff00003ec66e18
>> [ 18.120396] x21: ffff80001005bc08 x20: 00000000b200000a
>> [ 18.125874] x19: ffff0008dcef6700 x18: 0000000000000010
>> [ 18.131352] x17: 0000000000000000 x16: 0000000000000000
>> [ 18.136829] x15: ffffffffffffffff x14: ffffca760ed79808
>> [ 18.142307] x13: ffff80009005b897 x12: ffff80001005b89f
>> [ 18.147786] x11: ffffca760eda4000 x10: ffff80001005b820
>> [ 18.153264] x9 : 00000000ffffffd0 x8 : ffffca760e59b2c0
>> [ 18.158742] x7 : 0000000000000000 x6 : 0000000000000000
>> [ 18.164220] x5 : 0000000000000000 x4 : 0000000000000000
>> [ 18.169698] x3 : 0000000000000000 x2 : ffff0008dcef6700
>> [ 18.175175] x1 : 00000000ffff0008 x0 : ffffca760e59ca04
>> [ 18.180654] Call trace:
>> [ 18.183176] tee_shm_free+0x18/0x48
>> [ 18.186773] optee_disable_shm_cache+0xa4/0xf0
>> [ 18.191356] optee_shutdown+0x20/0x30
>> [ 18.195135] platform_drv_shutdown+0x2c/0x38
>> [ 18.199538] device_shutdown+0x180/0x298
>> [ 18.203586] kernel_restart_prepare+0x44/0x50
>> [ 18.208078] kernel_restart+0x20/0x68
>> [ 18.211853] __do_sys_reboot+0x104/0x258
>> [ 18.215899] __arm64_sys_reboot+0x2c/0x38
>> [ 18.220035] el0_svc_handler+0x90/0x138
>> [ 18.223991] el0_svc+0x8/0x208
>> [ 18.227143] Code: f9000bf3 aa0003f3 aa1e03e0 d503201f (b9405a60)
>> [ 18.233435] ---[ end trace 835d756cd66aa959 ]---
>> [ 18.238621] Kernel panic - not syncing: Fatal exception
>> [ 18.244014] Kernel Offset: 0x4a75fde00000 from 0xffff800010000000
>> [ 18.250299] PHYS_OFFSET: 0xffff99c680000000
>> [ 18.254613] CPU features: 0x0002,21806008
>> [ 18.258747] Memory Limit: none
>> [ 18.262310] ---[ end Kernel panic - not syncing: Fatal exception ]—
>>
>> I see that before secure world returns OPTEE_SMC_RETURN_ENOTAVAIL(which
>> Should disable and clear all the cache) we run into the crash trying to free shm.
>>
>> Thoughts?
>
> It seems that the pointer is invalid, but the pointer doesn't look
> like garbage. Could the kernel have unmapped the memory area covering
> that address?
>

Yes, I am not entirely sure if the kernel had the time to unmap the memory.
Right after triggering the crash the kdump kernel is booted and I see the following

[ 2.050145] optee: probing for conduit method.
[ 2.054743] optee: revision 3.6 (f84427aa)
[ 2.054821] optee: dynamic shared memory is enabled
[ 2.066186] optee: initialized driver

Could this be previous un-released maps causing corruption?

Thanks.

2021-05-06 07:31:42

by Allen Pais

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] optee: fix tee out of memory failure seen during kexec reboot

>>
>>>>>>>> [ 0.368428] tee_bnxt_fw optee-clnt0: tee_shm_alloc failed
>>>>>>>> [ 0.368461] tee_bnxt_fw: probe of optee-clnt0 failed with error -22
>>>>>>>>
>>>>>>>> tee_shm_release() is not invoked on dma shm buffer.
>>>>>>>>
>>>>>>>> Implement .shutdown() method to handle the release of the buffers
>>>>>>>> correctly.
>>>>>>>>
>>>>>>>> More info:
>>>>>>>> https://github.com/OP-TEE/optee_os/issues/3637
>>>>>>>>
>>>>>>>> Signed-off-by: Allen Pais <[email protected]>
>>>>>>>> ---
>>>>>>>> drivers/tee/optee/core.c | 20 ++++++++++++++++++++
>>>>>>>> 1 file changed, 20 insertions(+)
>>>>>>>
>>>>>>> This looks good to me. Do you have a practical way of testing this on
>>>>>>> QEMU for instance?
>>>>>>>
>>>>>>
>>>>>> Jens,
>>>>>>
>>>>>> I could not reproduce nor create a setup using QEMU, I could only
>>>>>> do it on a real h/w.
>>>>>>
>>>>>> I have extensively tested the fix and I don't see any issues.
>>>>>
>>>>> I did a few test runs too, seems OK.
>>>>
>>>> I carried these changes and have not run into any issues with Kexec so far.
>>>> Last week, while trying out kdump, we ran into a crash(this is when the
>>>> Kdump kernel reboots).
>>>>
>>>> $echo c > /proc/sysrq-trigger
>>>>
>>>> Leads to:
>>>>
>>>> [ 18.004831] Unable to handle kernel paging request at virtual address ffff0008dcef6758
>>>> [ 18.013002] Mem abort info:
>>>> [ 18.015885] ESR = 0x96000005
>>>> [ 18.019034] EC = 0x25: DABT (current EL), IL = 32 bits
>>>> [ 18.024516] SET = 0, FnV = 0
>>>> [ 18.027667] EA = 0, S1PTW = 0
>>>> [ 18.030905] Data abort info:
>>>> [ 18.033877] ISV = 0, ISS = 0x00000005
>>>> [ 18.037835] CM = 0, WnR = 0
>>>> [ 18.040896] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000970a78000
>>>> [ 18.047811] [ffff0008dcef6758] pgd=000000097fbf9003, pud=0000000000000000
>>>> [ 18.054819] Internal error: Oops: 96000005 [#1] SMP
>>>> [ 18.059850] Modules linked in: bnxt_en pcie_iproc_platform pcie_iproc diagbe(O)
>>>> [ 18.067395] CPU: 3 PID: 1 Comm: systemd-shutdow Tainted: G O 5.4.83-microsoft-standard #1
>>>> [ 18.077174] Hardware name: Overlake (DT)
>>>> [ 18.081219] pstate: 80400005 (Nzcv daif +PAN -UAO)
>>>> [ 18.086170] pc : tee_shm_free+0x18/0x48
>>>> [ 18.090126] lr : optee_disable_shm_cache+0xa4/0xf0
>>>> [ 18.095066] sp : ffff80001005bb90
>>>> [ 18.098484] x29: ffff80001005bb90 x28: ffff000037e20000
>>>> [ 18.103962] x27: 0000000000000000 x26: ffff00003ed10490
>>>> [ 18.109440] x25: ffffca760e975f90 x24: 0000000000000000
>>>> [ 18.114918] x23: ffffca760ed79808 x22: ffff00003ec66e18
>>>> [ 18.120396] x21: ffff80001005bc08 x20: 00000000b200000a
>>>> [ 18.125874] x19: ffff0008dcef6700 x18: 0000000000000010
>>>> [ 18.131352] x17: 0000000000000000 x16: 0000000000000000
>>>> [ 18.136829] x15: ffffffffffffffff x14: ffffca760ed79808
>>>> [ 18.142307] x13: ffff80009005b897 x12: ffff80001005b89f
>>>> [ 18.147786] x11: ffffca760eda4000 x10: ffff80001005b820
>>>> [ 18.153264] x9 : 00000000ffffffd0 x8 : ffffca760e59b2c0
>>>> [ 18.158742] x7 : 0000000000000000 x6 : 0000000000000000
>>>> [ 18.164220] x5 : 0000000000000000 x4 : 0000000000000000
>>>> [ 18.169698] x3 : 0000000000000000 x2 : ffff0008dcef6700
>>>> [ 18.175175] x1 : 00000000ffff0008 x0 : ffffca760e59ca04
>>>> [ 18.180654] Call trace:
>>>> [ 18.183176] tee_shm_free+0x18/0x48
>>>> [ 18.186773] optee_disable_shm_cache+0xa4/0xf0
>>>> [ 18.191356] optee_shutdown+0x20/0x30
>>>> [ 18.195135] platform_drv_shutdown+0x2c/0x38
>>>> [ 18.199538] device_shutdown+0x180/0x298
>>>> [ 18.203586] kernel_restart_prepare+0x44/0x50
>>>> [ 18.208078] kernel_restart+0x20/0x68
>>>> [ 18.211853] __do_sys_reboot+0x104/0x258
>>>> [ 18.215899] __arm64_sys_reboot+0x2c/0x38
>>>> [ 18.220035] el0_svc_handler+0x90/0x138
>>>> [ 18.223991] el0_svc+0x8/0x208
>>>> [ 18.227143] Code: f9000bf3 aa0003f3 aa1e03e0 d503201f (b9405a60)
>>>> [ 18.233435] ---[ end trace 835d756cd66aa959 ]---
>>>> [ 18.238621] Kernel panic - not syncing: Fatal exception
>>>> [ 18.244014] Kernel Offset: 0x4a75fde00000 from 0xffff800010000000
>>>> [ 18.250299] PHYS_OFFSET: 0xffff99c680000000
>>>> [ 18.254613] CPU features: 0x0002,21806008
>>>> [ 18.258747] Memory Limit: none
>>>> [ 18.262310] ---[ end Kernel panic - not syncing: Fatal exception ]—
>>>>
>>>> I see that before secure world returns OPTEE_SMC_RETURN_ENOTAVAIL(which
>>>> Should disable and clear all the cache) we run into the crash trying to free shm.
>>>>
>>>> Thoughts?
>>>
>>> It seems that the pointer is invalid, but the pointer doesn't look
>>> like garbage. Could the kernel have unmapped the memory area covering
>>> that address?
>>>
>>
>> Yes, I am not entirely sure if the kernel had the time to unmap the memory.
>> Right after triggering the crash the kdump kernel is booted and I see the following
>>
>> [ 2.050145] optee: probing for conduit method.
>> [ 2.054743] optee: revision 3.6 (f84427aa)
>> [ 2.054821] optee: dynamic shared memory is enabled
>> [ 2.066186] optee: initialized driver
>>
>> Could this be previous un-released maps causing corruption?
>
> Aha, yes, that could be it.
>

How about checking for the ptr?

diff --git a/drivers/tee/optee/call.c b/drivers/tee/optee/call.c
index aadedec3bfe7..8dc4fe9a1588 100644
--- a/drivers/tee/optee/call.c
+++ b/drivers/tee/optee/call.c
@@ -426,10 +426,12 @@ void optee_disable_shm_cache(struct optee *optee)
if (res.result.status == OPTEE_SMC_RETURN_ENOTAVAIL)
break; /* All shm's freed */
if (res.result.status == OPTEE_SMC_RETURN_OK) {
- struct tee_shm *shm;
+ struct tee_shm *shm = NULL;

shm = reg_pair_to_ptr(res.result.shm_upper32,
res.result.shm_lower32);
+ if (IS_ERR(shm))
+ return PTR_ERR(shm);
tee_shm_free(shm);

Thanks.

2021-05-06 08:28:04

by Jens Wiklander

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] optee: fix tee out of memory failure seen during kexec reboot

On Thu, May 6, 2021 at 9:29 AM Allen Pais <[email protected]> wrote:
>
> >>
> >>>>>>>> [ 0.368428] tee_bnxt_fw optee-clnt0: tee_shm_alloc failed
> >>>>>>>> [ 0.368461] tee_bnxt_fw: probe of optee-clnt0 failed with error -22
> >>>>>>>>
> >>>>>>>> tee_shm_release() is not invoked on dma shm buffer.
> >>>>>>>>
> >>>>>>>> Implement .shutdown() method to handle the release of the buffers
> >>>>>>>> correctly.
> >>>>>>>>
> >>>>>>>> More info:
> >>>>>>>> https://github.com/OP-TEE/optee_os/issues/3637
> >>>>>>>>
> >>>>>>>> Signed-off-by: Allen Pais <[email protected]>
> >>>>>>>> ---
> >>>>>>>> drivers/tee/optee/core.c | 20 ++++++++++++++++++++
> >>>>>>>> 1 file changed, 20 insertions(+)
> >>>>>>>
> >>>>>>> This looks good to me. Do you have a practical way of testing this on
> >>>>>>> QEMU for instance?
> >>>>>>>
> >>>>>>
> >>>>>> Jens,
> >>>>>>
> >>>>>> I could not reproduce nor create a setup using QEMU, I could only
> >>>>>> do it on a real h/w.
> >>>>>>
> >>>>>> I have extensively tested the fix and I don't see any issues.
> >>>>>
> >>>>> I did a few test runs too, seems OK.
> >>>>
> >>>> I carried these changes and have not run into any issues with Kexec so far.
> >>>> Last week, while trying out kdump, we ran into a crash(this is when the
> >>>> Kdump kernel reboots).
> >>>>
> >>>> $echo c > /proc/sysrq-trigger
> >>>>
> >>>> Leads to:
> >>>>
> >>>> [ 18.004831] Unable to handle kernel paging request at virtual address ffff0008dcef6758
> >>>> [ 18.013002] Mem abort info:
> >>>> [ 18.015885] ESR = 0x96000005
> >>>> [ 18.019034] EC = 0x25: DABT (current EL), IL = 32 bits
> >>>> [ 18.024516] SET = 0, FnV = 0
> >>>> [ 18.027667] EA = 0, S1PTW = 0
> >>>> [ 18.030905] Data abort info:
> >>>> [ 18.033877] ISV = 0, ISS = 0x00000005
> >>>> [ 18.037835] CM = 0, WnR = 0
> >>>> [ 18.040896] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000970a78000
> >>>> [ 18.047811] [ffff0008dcef6758] pgd=000000097fbf9003, pud=0000000000000000
> >>>> [ 18.054819] Internal error: Oops: 96000005 [#1] SMP
> >>>> [ 18.059850] Modules linked in: bnxt_en pcie_iproc_platform pcie_iproc diagbe(O)
> >>>> [ 18.067395] CPU: 3 PID: 1 Comm: systemd-shutdow Tainted: G O 5.4.83-microsoft-standard #1
> >>>> [ 18.077174] Hardware name: Overlake (DT)
> >>>> [ 18.081219] pstate: 80400005 (Nzcv daif +PAN -UAO)
> >>>> [ 18.086170] pc : tee_shm_free+0x18/0x48
> >>>> [ 18.090126] lr : optee_disable_shm_cache+0xa4/0xf0
> >>>> [ 18.095066] sp : ffff80001005bb90
> >>>> [ 18.098484] x29: ffff80001005bb90 x28: ffff000037e20000
> >>>> [ 18.103962] x27: 0000000000000000 x26: ffff00003ed10490
> >>>> [ 18.109440] x25: ffffca760e975f90 x24: 0000000000000000
> >>>> [ 18.114918] x23: ffffca760ed79808 x22: ffff00003ec66e18
> >>>> [ 18.120396] x21: ffff80001005bc08 x20: 00000000b200000a
> >>>> [ 18.125874] x19: ffff0008dcef6700 x18: 0000000000000010
> >>>> [ 18.131352] x17: 0000000000000000 x16: 0000000000000000
> >>>> [ 18.136829] x15: ffffffffffffffff x14: ffffca760ed79808
> >>>> [ 18.142307] x13: ffff80009005b897 x12: ffff80001005b89f
> >>>> [ 18.147786] x11: ffffca760eda4000 x10: ffff80001005b820
> >>>> [ 18.153264] x9 : 00000000ffffffd0 x8 : ffffca760e59b2c0
> >>>> [ 18.158742] x7 : 0000000000000000 x6 : 0000000000000000
> >>>> [ 18.164220] x5 : 0000000000000000 x4 : 0000000000000000
> >>>> [ 18.169698] x3 : 0000000000000000 x2 : ffff0008dcef6700
> >>>> [ 18.175175] x1 : 00000000ffff0008 x0 : ffffca760e59ca04
> >>>> [ 18.180654] Call trace:
> >>>> [ 18.183176] tee_shm_free+0x18/0x48
> >>>> [ 18.186773] optee_disable_shm_cache+0xa4/0xf0
> >>>> [ 18.191356] optee_shutdown+0x20/0x30
> >>>> [ 18.195135] platform_drv_shutdown+0x2c/0x38
> >>>> [ 18.199538] device_shutdown+0x180/0x298
> >>>> [ 18.203586] kernel_restart_prepare+0x44/0x50
> >>>> [ 18.208078] kernel_restart+0x20/0x68
> >>>> [ 18.211853] __do_sys_reboot+0x104/0x258
> >>>> [ 18.215899] __arm64_sys_reboot+0x2c/0x38
> >>>> [ 18.220035] el0_svc_handler+0x90/0x138
> >>>> [ 18.223991] el0_svc+0x8/0x208
> >>>> [ 18.227143] Code: f9000bf3 aa0003f3 aa1e03e0 d503201f (b9405a60)
> >>>> [ 18.233435] ---[ end trace 835d756cd66aa959 ]---
> >>>> [ 18.238621] Kernel panic - not syncing: Fatal exception
> >>>> [ 18.244014] Kernel Offset: 0x4a75fde00000 from 0xffff800010000000
> >>>> [ 18.250299] PHYS_OFFSET: 0xffff99c680000000
> >>>> [ 18.254613] CPU features: 0x0002,21806008
> >>>> [ 18.258747] Memory Limit: none
> >>>> [ 18.262310] ---[ end Kernel panic - not syncing: Fatal exception ]—
> >>>>
> >>>> I see that before secure world returns OPTEE_SMC_RETURN_ENOTAVAIL(which
> >>>> Should disable and clear all the cache) we run into the crash trying to free shm.
> >>>>
> >>>> Thoughts?
> >>>
> >>> It seems that the pointer is invalid, but the pointer doesn't look
> >>> like garbage. Could the kernel have unmapped the memory area covering
> >>> that address?
> >>>
> >>
> >> Yes, I am not entirely sure if the kernel had the time to unmap the memory.
> >> Right after triggering the crash the kdump kernel is booted and I see the following
> >>
> >> [ 2.050145] optee: probing for conduit method.
> >> [ 2.054743] optee: revision 3.6 (f84427aa)
> >> [ 2.054821] optee: dynamic shared memory is enabled
> >> [ 2.066186] optee: initialized driver
> >>
> >> Could this be previous un-released maps causing corruption?
> >
> > Aha, yes, that could be it.
> >
>
> How about checking for the ptr?
>
> diff --git a/drivers/tee/optee/call.c b/drivers/tee/optee/call.c
> index aadedec3bfe7..8dc4fe9a1588 100644
> --- a/drivers/tee/optee/call.c
> +++ b/drivers/tee/optee/call.c
> @@ -426,10 +426,12 @@ void optee_disable_shm_cache(struct optee *optee)
> if (res.result.status == OPTEE_SMC_RETURN_ENOTAVAIL)
> break; /* All shm's freed */
> if (res.result.status == OPTEE_SMC_RETURN_OK) {
> - struct tee_shm *shm;
> + struct tee_shm *shm = NULL;
>
> shm = reg_pair_to_ptr(res.result.shm_upper32,
> res.result.shm_lower32);
> + if (IS_ERR(shm))
> + return PTR_ERR(shm);
> tee_shm_free(shm);

I don't think that will help. If your theory is correct then that
pointer is from an older incarnation of the kernel. It could be worth
trying calling this function just before the call to
optee_enable_shm_cache() in optee_probe() but skipping the calls to
`tee_shm_free()` in that case. Since the kernel has restarted these
returned pointers are not valid any more and there's nothing to free,
we just need to make sure that secure world stops using those too.

Cheers,
Jens

2021-05-06 08:36:44

by Allen Pais

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] optee: fix tee out of memory failure seen during kexec reboot



>>
>>>>
>>>>>>>>>> [ 0.368428] tee_bnxt_fw optee-clnt0: tee_shm_alloc failed
>>>>>>>>>> [ 0.368461] tee_bnxt_fw: probe of optee-clnt0 failed with error -22
>>>>>>>>>>
>>>>>>>>>> tee_shm_release() is not invoked on dma shm buffer.
>>>>>>>>>>
>>>>>>>>>> Implement .shutdown() method to handle the release of the buffers
>>>>>>>>>> correctly.
>>>>>>>>>>
>>>>>>>>>> More info:
>>>>>>>>>> https://github.com/OP-TEE/optee_os/issues/3637
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Allen Pais <[email protected]>
>>>>>>>>>> ---
>>>>>>>>>> drivers/tee/optee/core.c | 20 ++++++++++++++++++++
>>>>>>>>>> 1 file changed, 20 insertions(+)
>>>>>>>>>
>>>>>>>>> This looks good to me. Do you have a practical way of testing this on
>>>>>>>>> QEMU for instance?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Jens,
>>>>>>>>
>>>>>>>> I could not reproduce nor create a setup using QEMU, I could only
>>>>>>>> do it on a real h/w.
>>>>>>>>
>>>>>>>> I have extensively tested the fix and I don't see any issues.
>>>>>>>
>>>>>>> I did a few test runs too, seems OK.
>>>>>>
>>>>>> I carried these changes and have not run into any issues with Kexec so far.
>>>>>> Last week, while trying out kdump, we ran into a crash(this is when the
>>>>>> Kdump kernel reboots).
>>>>>>
>>>>>> $echo c > /proc/sysrq-trigger
>>>>>>
>>>>>> Leads to:
>>>>>>
>>>>>> [ 18.004831] Unable to handle kernel paging request at virtual address ffff0008dcef6758
>>>>>> [ 18.013002] Mem abort info:
>>>>>> [ 18.015885] ESR = 0x96000005
>>>>>> [ 18.019034] EC = 0x25: DABT (current EL), IL = 32 bits
>>>>>> [ 18.024516] SET = 0, FnV = 0
>>>>>> [ 18.027667] EA = 0, S1PTW = 0
>>>>>> [ 18.030905] Data abort info:
>>>>>> [ 18.033877] ISV = 0, ISS = 0x00000005
>>>>>> [ 18.037835] CM = 0, WnR = 0
>>>>>> [ 18.040896] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000970a78000
>>>>>> [ 18.047811] [ffff0008dcef6758] pgd=000000097fbf9003, pud=0000000000000000
>>>>>> [ 18.054819] Internal error: Oops: 96000005 [#1] SMP
>>>>>> [ 18.059850] Modules linked in: bnxt_en pcie_iproc_platform pcie_iproc diagbe(O)
>>>>>> [ 18.067395] CPU: 3 PID: 1 Comm: systemd-shutdow Tainted: G O 5.4.83-microsoft-standard #1
>>>>>> [ 18.077174] Hardware name: Overlake (DT)
>>>>>> [ 18.081219] pstate: 80400005 (Nzcv daif +PAN -UAO)
>>>>>> [ 18.086170] pc : tee_shm_free+0x18/0x48
>>>>>> [ 18.090126] lr : optee_disable_shm_cache+0xa4/0xf0
>>>>>> [ 18.095066] sp : ffff80001005bb90
>>>>>> [ 18.098484] x29: ffff80001005bb90 x28: ffff000037e20000
>>>>>> [ 18.103962] x27: 0000000000000000 x26: ffff00003ed10490
>>>>>> [ 18.109440] x25: ffffca760e975f90 x24: 0000000000000000
>>>>>> [ 18.114918] x23: ffffca760ed79808 x22: ffff00003ec66e18
>>>>>> [ 18.120396] x21: ffff80001005bc08 x20: 00000000b200000a
>>>>>> [ 18.125874] x19: ffff0008dcef6700 x18: 0000000000000010
>>>>>> [ 18.131352] x17: 0000000000000000 x16: 0000000000000000
>>>>>> [ 18.136829] x15: ffffffffffffffff x14: ffffca760ed79808
>>>>>> [ 18.142307] x13: ffff80009005b897 x12: ffff80001005b89f
>>>>>> [ 18.147786] x11: ffffca760eda4000 x10: ffff80001005b820
>>>>>> [ 18.153264] x9 : 00000000ffffffd0 x8 : ffffca760e59b2c0
>>>>>> [ 18.158742] x7 : 0000000000000000 x6 : 0000000000000000
>>>>>> [ 18.164220] x5 : 0000000000000000 x4 : 0000000000000000
>>>>>> [ 18.169698] x3 : 0000000000000000 x2 : ffff0008dcef6700
>>>>>> [ 18.175175] x1 : 00000000ffff0008 x0 : ffffca760e59ca04
>>>>>> [ 18.180654] Call trace:
>>>>>> [ 18.183176] tee_shm_free+0x18/0x48
>>>>>> [ 18.186773] optee_disable_shm_cache+0xa4/0xf0
>>>>>> [ 18.191356] optee_shutdown+0x20/0x30
>>>>>> [ 18.195135] platform_drv_shutdown+0x2c/0x38
>>>>>> [ 18.199538] device_shutdown+0x180/0x298
>>>>>> [ 18.203586] kernel_restart_prepare+0x44/0x50
>>>>>> [ 18.208078] kernel_restart+0x20/0x68
>>>>>> [ 18.211853] __do_sys_reboot+0x104/0x258
>>>>>> [ 18.215899] __arm64_sys_reboot+0x2c/0x38
>>>>>> [ 18.220035] el0_svc_handler+0x90/0x138
>>>>>> [ 18.223991] el0_svc+0x8/0x208
>>>>>> [ 18.227143] Code: f9000bf3 aa0003f3 aa1e03e0 d503201f (b9405a60)
>>>>>> [ 18.233435] ---[ end trace 835d756cd66aa959 ]---
>>>>>> [ 18.238621] Kernel panic - not syncing: Fatal exception
>>>>>> [ 18.244014] Kernel Offset: 0x4a75fde00000 from 0xffff800010000000
>>>>>> [ 18.250299] PHYS_OFFSET: 0xffff99c680000000
>>>>>> [ 18.254613] CPU features: 0x0002,21806008
>>>>>> [ 18.258747] Memory Limit: none
>>>>>> [ 18.262310] ---[ end Kernel panic - not syncing: Fatal exception ]—
>>>>>>
>>>>>> I see that before secure world returns OPTEE_SMC_RETURN_ENOTAVAIL(which
>>>>>> Should disable and clear all the cache) we run into the crash trying to free shm.
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> It seems that the pointer is invalid, but the pointer doesn't look
>>>>> like garbage. Could the kernel have unmapped the memory area covering
>>>>> that address?
>>>>>
>>>>
>>>> Yes, I am not entirely sure if the kernel had the time to unmap the memory.
>>>> Right after triggering the crash the kdump kernel is booted and I see the following
>>>>
>>>> [ 2.050145] optee: probing for conduit method.
>>>> [ 2.054743] optee: revision 3.6 (f84427aa)
>>>> [ 2.054821] optee: dynamic shared memory is enabled
>>>> [ 2.066186] optee: initialized driver
>>>>
>>>> Could this be previous un-released maps causing corruption?
>>>
>>> Aha, yes, that could be it.
>>>
>>
>> How about checking for the ptr?
>>
>> diff --git a/drivers/tee/optee/call.c b/drivers/tee/optee/call.c
>> index aadedec3bfe7..8dc4fe9a1588 100644
>> --- a/drivers/tee/optee/call.c
>> +++ b/drivers/tee/optee/call.c
>> @@ -426,10 +426,12 @@ void optee_disable_shm_cache(struct optee *optee)
>> if (res.result.status == OPTEE_SMC_RETURN_ENOTAVAIL)
>> break; /* All shm's freed */
>> if (res.result.status == OPTEE_SMC_RETURN_OK) {
>> - struct tee_shm *shm;
>> + struct tee_shm *shm = NULL;
>>
>> shm = reg_pair_to_ptr(res.result.shm_upper32,
>> res.result.shm_lower32);
>> + if (IS_ERR(shm))
>> + return PTR_ERR(shm);
>> tee_shm_free(shm);
>
> I don't think that will help. If your theory is correct then that
> pointer is from an older incarnation of the kernel. It could be worth
> trying calling this function just before the call to
> optee_enable_shm_cache() in optee_probe() but skipping the calls to
> `tee_shm_free()` in that case. Since the kernel has restarted these
> returned pointers are not valid any more and there's nothing to free,
> we just need to make sure that secure world stops using those too.
>

I thought about it too, but was not very sure.

Calling optee_disable_shm_cache() before the enable call to ensure
That we have dropped all references to the secure world and looking
To start of fresh. Lemme try that out.

Thanks.

2021-05-06 09:19:51

by Jens Wiklander

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] optee: fix tee out of memory failure seen during kexec reboot

On Thu, May 6, 2021 at 9:10 AM Allen Pais <[email protected]> wrote:
>
>
> >>>>>> [ 0.368428] tee_bnxt_fw optee-clnt0: tee_shm_alloc failed
> >>>>>> [ 0.368461] tee_bnxt_fw: probe of optee-clnt0 failed with error -22
> >>>>>>
> >>>>>> tee_shm_release() is not invoked on dma shm buffer.
> >>>>>>
> >>>>>> Implement .shutdown() method to handle the release of the buffers
> >>>>>> correctly.
> >>>>>>
> >>>>>> More info:
> >>>>>> https://github.com/OP-TEE/optee_os/issues/3637
> >>>>>>
> >>>>>> Signed-off-by: Allen Pais <[email protected]>
> >>>>>> ---
> >>>>>> drivers/tee/optee/core.c | 20 ++++++++++++++++++++
> >>>>>> 1 file changed, 20 insertions(+)
> >>>>>
> >>>>> This looks good to me. Do you have a practical way of testing this on
> >>>>> QEMU for instance?
> >>>>>
> >>>>
> >>>> Jens,
> >>>>
> >>>> I could not reproduce nor create a setup using QEMU, I could only
> >>>> do it on a real h/w.
> >>>>
> >>>> I have extensively tested the fix and I don't see any issues.
> >>>
> >>> I did a few test runs too, seems OK.
> >>
> >> I carried these changes and have not run into any issues with Kexec so far.
> >> Last week, while trying out kdump, we ran into a crash(this is when the
> >> Kdump kernel reboots).
> >>
> >> $echo c > /proc/sysrq-trigger
> >>
> >> Leads to:
> >>
> >> [ 18.004831] Unable to handle kernel paging request at virtual address ffff0008dcef6758
> >> [ 18.013002] Mem abort info:
> >> [ 18.015885] ESR = 0x96000005
> >> [ 18.019034] EC = 0x25: DABT (current EL), IL = 32 bits
> >> [ 18.024516] SET = 0, FnV = 0
> >> [ 18.027667] EA = 0, S1PTW = 0
> >> [ 18.030905] Data abort info:
> >> [ 18.033877] ISV = 0, ISS = 0x00000005
> >> [ 18.037835] CM = 0, WnR = 0
> >> [ 18.040896] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000970a78000
> >> [ 18.047811] [ffff0008dcef6758] pgd=000000097fbf9003, pud=0000000000000000
> >> [ 18.054819] Internal error: Oops: 96000005 [#1] SMP
> >> [ 18.059850] Modules linked in: bnxt_en pcie_iproc_platform pcie_iproc diagbe(O)
> >> [ 18.067395] CPU: 3 PID: 1 Comm: systemd-shutdow Tainted: G O 5.4.83-microsoft-standard #1
> >> [ 18.077174] Hardware name: Overlake (DT)
> >> [ 18.081219] pstate: 80400005 (Nzcv daif +PAN -UAO)
> >> [ 18.086170] pc : tee_shm_free+0x18/0x48
> >> [ 18.090126] lr : optee_disable_shm_cache+0xa4/0xf0
> >> [ 18.095066] sp : ffff80001005bb90
> >> [ 18.098484] x29: ffff80001005bb90 x28: ffff000037e20000
> >> [ 18.103962] x27: 0000000000000000 x26: ffff00003ed10490
> >> [ 18.109440] x25: ffffca760e975f90 x24: 0000000000000000
> >> [ 18.114918] x23: ffffca760ed79808 x22: ffff00003ec66e18
> >> [ 18.120396] x21: ffff80001005bc08 x20: 00000000b200000a
> >> [ 18.125874] x19: ffff0008dcef6700 x18: 0000000000000010
> >> [ 18.131352] x17: 0000000000000000 x16: 0000000000000000
> >> [ 18.136829] x15: ffffffffffffffff x14: ffffca760ed79808
> >> [ 18.142307] x13: ffff80009005b897 x12: ffff80001005b89f
> >> [ 18.147786] x11: ffffca760eda4000 x10: ffff80001005b820
> >> [ 18.153264] x9 : 00000000ffffffd0 x8 : ffffca760e59b2c0
> >> [ 18.158742] x7 : 0000000000000000 x6 : 0000000000000000
> >> [ 18.164220] x5 : 0000000000000000 x4 : 0000000000000000
> >> [ 18.169698] x3 : 0000000000000000 x2 : ffff0008dcef6700
> >> [ 18.175175] x1 : 00000000ffff0008 x0 : ffffca760e59ca04
> >> [ 18.180654] Call trace:
> >> [ 18.183176] tee_shm_free+0x18/0x48
> >> [ 18.186773] optee_disable_shm_cache+0xa4/0xf0
> >> [ 18.191356] optee_shutdown+0x20/0x30
> >> [ 18.195135] platform_drv_shutdown+0x2c/0x38
> >> [ 18.199538] device_shutdown+0x180/0x298
> >> [ 18.203586] kernel_restart_prepare+0x44/0x50
> >> [ 18.208078] kernel_restart+0x20/0x68
> >> [ 18.211853] __do_sys_reboot+0x104/0x258
> >> [ 18.215899] __arm64_sys_reboot+0x2c/0x38
> >> [ 18.220035] el0_svc_handler+0x90/0x138
> >> [ 18.223991] el0_svc+0x8/0x208
> >> [ 18.227143] Code: f9000bf3 aa0003f3 aa1e03e0 d503201f (b9405a60)
> >> [ 18.233435] ---[ end trace 835d756cd66aa959 ]---
> >> [ 18.238621] Kernel panic - not syncing: Fatal exception
> >> [ 18.244014] Kernel Offset: 0x4a75fde00000 from 0xffff800010000000
> >> [ 18.250299] PHYS_OFFSET: 0xffff99c680000000
> >> [ 18.254613] CPU features: 0x0002,21806008
> >> [ 18.258747] Memory Limit: none
> >> [ 18.262310] ---[ end Kernel panic - not syncing: Fatal exception ]—
> >>
> >> I see that before secure world returns OPTEE_SMC_RETURN_ENOTAVAIL(which
> >> Should disable and clear all the cache) we run into the crash trying to free shm.
> >>
> >> Thoughts?
> >
> > It seems that the pointer is invalid, but the pointer doesn't look
> > like garbage. Could the kernel have unmapped the memory area covering
> > that address?
> >
>
> Yes, I am not entirely sure if the kernel had the time to unmap the memory.
> Right after triggering the crash the kdump kernel is booted and I see the following
>
> [ 2.050145] optee: probing for conduit method.
> [ 2.054743] optee: revision 3.6 (f84427aa)
> [ 2.054821] optee: dynamic shared memory is enabled
> [ 2.066186] optee: initialized driver
>
> Could this be previous un-released maps causing corruption?

Aha, yes, that could be it.

Cheers,
Jens

2021-05-07 07:54:54

by Allen Pais

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] optee: fix tee out of memory failure seen during kexec reboot

>>>>>>>>
>>>>>>>> I could not reproduce nor create a setup using QEMU, I could only
>>>>>>>> do it on a real h/w.
>>>>>>>>
>>>>>>>> I have extensively tested the fix and I don't see any issues.
>>>>>>>
>>>>>>> I did a few test runs too, seems OK.
>>>>>>
>>>>>> I carried these changes and have not run into any issues with Kexec so far.
>>>>>> Last week, while trying out kdump, we ran into a crash(this is when the
>>>>>> Kdump kernel reboots).
>>>>>>
>>>>>> $echo c > /proc/sysrq-trigger
>>>>>>
>>>>>> Leads to:
>>>>>>
>>>>>> [ 18.004831] Unable to handle kernel paging request at virtual address ffff0008dcef6758
>>>>>> [ 18.013002] Mem abort info:
>>>>>> [ 18.015885] ESR = 0x96000005
>>>>>> [ 18.019034] EC = 0x25: DABT (current EL), IL = 32 bits
>>>>>> [ 18.024516] SET = 0, FnV = 0
>>>>>> [ 18.027667] EA = 0, S1PTW = 0
>>>>>> [ 18.030905] Data abort info:
>>>>>> [ 18.033877] ISV = 0, ISS = 0x00000005
>>>>>> [ 18.037835] CM = 0, WnR = 0
>>>>>> [ 18.040896] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000970a78000
>>>>>> [ 18.047811] [ffff0008dcef6758] pgd=000000097fbf9003, pud=0000000000000000
>>>>>> [ 18.054819] Internal error: Oops: 96000005 [#1] SMP
>>>>>> [ 18.059850] Modules linked in: bnxt_en pcie_iproc_platform pcie_iproc diagbe(O)
>>>>>> [ 18.067395] CPU: 3 PID: 1 Comm: systemd-shutdow Tainted: G O 5.4.83-microsoft-standard #1
>>>>>> [ 18.077174] Hardware name: Overlake (DT)
>>>>>> [ 18.081219] pstate: 80400005 (Nzcv daif +PAN -UAO)
>>>>>> [ 18.086170] pc : tee_shm_free+0x18/0x48
>>>>>> [ 18.090126] lr : optee_disable_shm_cache+0xa4/0xf0
>>>>>> [ 18.095066] sp : ffff80001005bb90
>>>>>> [ 18.098484] x29: ffff80001005bb90 x28: ffff000037e20000
>>>>>> [ 18.103962] x27: 0000000000000000 x26: ffff00003ed10490
>>>>>> [ 18.109440] x25: ffffca760e975f90 x24: 0000000000000000
>>>>>> [ 18.114918] x23: ffffca760ed79808 x22: ffff00003ec66e18
>>>>>> [ 18.120396] x21: ffff80001005bc08 x20: 00000000b200000a
>>>>>> [ 18.125874] x19: ffff0008dcef6700 x18: 0000000000000010
>>>>>> [ 18.131352] x17: 0000000000000000 x16: 0000000000000000
>>>>>> [ 18.136829] x15: ffffffffffffffff x14: ffffca760ed79808
>>>>>> [ 18.142307] x13: ffff80009005b897 x12: ffff80001005b89f
>>>>>> [ 18.147786] x11: ffffca760eda4000 x10: ffff80001005b820
>>>>>> [ 18.153264] x9 : 00000000ffffffd0 x8 : ffffca760e59b2c0
>>>>>> [ 18.158742] x7 : 0000000000000000 x6 : 0000000000000000
>>>>>> [ 18.164220] x5 : 0000000000000000 x4 : 0000000000000000
>>>>>> [ 18.169698] x3 : 0000000000000000 x2 : ffff0008dcef6700
>>>>>> [ 18.175175] x1 : 00000000ffff0008 x0 : ffffca760e59ca04
>>>>>> [ 18.180654] Call trace:
>>>>>> [ 18.183176] tee_shm_free+0x18/0x48
>>>>>> [ 18.186773] optee_disable_shm_cache+0xa4/0xf0
>>>>>> [ 18.191356] optee_shutdown+0x20/0x30
>>>>>> [ 18.195135] platform_drv_shutdown+0x2c/0x38
>>>>>> [ 18.199538] device_shutdown+0x180/0x298
>>>>>> [ 18.203586] kernel_restart_prepare+0x44/0x50
>>>>>> [ 18.208078] kernel_restart+0x20/0x68
>>>>>> [ 18.211853] __do_sys_reboot+0x104/0x258
>>>>>> [ 18.215899] __arm64_sys_reboot+0x2c/0x38
>>>>>> [ 18.220035] el0_svc_handler+0x90/0x138
>>>>>> [ 18.223991] el0_svc+0x8/0x208
>>>>>> [ 18.227143] Code: f9000bf3 aa0003f3 aa1e03e0 d503201f (b9405a60)
>>>>>> [ 18.233435] ---[ end trace 835d756cd66aa959 ]---
>>>>>> [ 18.238621] Kernel panic - not syncing: Fatal exception
>>>>>> [ 18.244014] Kernel Offset: 0x4a75fde00000 from 0xffff800010000000
>>>>>> [ 18.250299] PHYS_OFFSET: 0xffff99c680000000
>>>>>> [ 18.254613] CPU features: 0x0002,21806008
>>>>>> [ 18.258747] Memory Limit: none
>>>>>> [ 18.262310] ---[ end Kernel panic - not syncing: Fatal exception ]—
>>>>>>
>>>>>> I see that before secure world returns OPTEE_SMC_RETURN_ENOTAVAIL(which
>>>>>> Should disable and clear all the cache) we run into the crash trying to free shm.
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> It seems that the pointer is invalid, but the pointer doesn't look
>>>>> like garbage. Could the kernel have unmapped the memory area covering
>>>>> that address?
>>>>>
>>>>
>>>> Yes, I am not entirely sure if the kernel had the time to unmap the memory.
>>>> Right after triggering the crash the kdump kernel is booted and I see the following
>>>>
>>>> [ 2.050145] optee: probing for conduit method.
>>>> [ 2.054743] optee: revision 3.6 (f84427aa)
>>>> [ 2.054821] optee: dynamic shared memory is enabled
>>>> [ 2.066186] optee: initialized driver
>>>>
>>>> Could this be previous un-released maps causing corruption?
>>>
>>> Aha, yes, that could be it.
>>>
>>
>> How about checking for the ptr?
>>
>> diff --git a/drivers/tee/optee/call.c b/drivers/tee/optee/call.c
>> index aadedec3bfe7..8dc4fe9a1588 100644
>> --- a/drivers/tee/optee/call.c
>> +++ b/drivers/tee/optee/call.c
>> @@ -426,10 +426,12 @@ void optee_disable_shm_cache(struct optee *optee)
>> if (res.result.status == OPTEE_SMC_RETURN_ENOTAVAIL)
>> break; /* All shm's freed */
>> if (res.result.status == OPTEE_SMC_RETURN_OK) {
>> - struct tee_shm *shm;
>> + struct tee_shm *shm = NULL;
>>
>> shm = reg_pair_to_ptr(res.result.shm_upper32,
>> res.result.shm_lower32);
>> + if (IS_ERR(shm))
>> + return PTR_ERR(shm);
>> tee_shm_free(shm);
>
> I don't think that will help. If your theory is correct then that
> pointer is from an older incarnation of the kernel. It could be worth
> trying calling this function just before the call to
> optee_enable_shm_cache() in optee_probe() but skipping the calls to
> `tee_shm_free()` in that case. Since the kernel has restarted these
> returned pointers are not valid any more and there's nothing to free,
> we just need to make sure that secure world stops using those too.
>

Jens,

I suppose you saw the email from @Tyler, we have it fixed but ran
Into many arm-smmu 64000000.mmu: xxx logs being printed out
And system becoming unstable and stops responding.

Am debugging this further, any input would be really helpful.

Thanks.