2022-05-04 17:35:03

by Nicolas Frattaroli

[permalink] [raw]
Subject: [BUG] New arm scmi check in linux-next causing rk3568 not to boot due to firmware bug

Good day,

a user on the #linux-rockchip channel on the Libera.chat IRC network
reported that their RK3568 was no longer getting a CPU and GPU clock
from scmi and consequently not booting when using linux-next. This
was bisected down to the following commit:

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/firmware/arm_scmi/base.c?h=next-20220503&id=3b0041f6e10e5bdbb646d98172be43e88734ed62

The error message in the log is as follows:

arm-scmi firmware:scmi: Malformed reply - real_sz:8 calc_sz:4, t->rx.len is 12, sizeof(u32) is 4, loop_num_ret is 3

The rockchip firmware (bl31) being used was v1.32, from here:

https://github.com/JeffyCN/rockchip_mirrors/blob/rkbin/bin/rk35/rk3568_bl31_v1.32.elf

This seems like a non-fatal firmware bug, for which a kernel workaround is
certainly possible, but it would be good if rockchip could fix this in their
firmware.

The user going by "amazingfate" reported that commenting out the
ret = -EPROTO; break;
fixes the issue for them.

I'm writing here to get the discussion started on how we can resolve this
before the Linux 5.19 release.

Sudeep Holla has already told me they'll gladly add a workaround before
the 5.19 release, but would rather see this fixed in the vendor firmware
first. Would rockchip be able and willing to fix it and publish a new
bl31 for rk3568?

Regards,
Nicolas Frattaroli




2022-05-04 18:06:21

by Sudeep Holla

[permalink] [raw]
Subject: Re: [BUG] New arm scmi check in linux-next causing rk3568 not to boot due to firmware bug

+ Cristian

Hi Nicolas,

Thanks for the formal report.

On Wed, May 04, 2022 at 02:49:07PM +0200, Nicolas Frattaroli wrote:
> Good day,
>
> a user on the #linux-rockchip channel on the Libera.chat IRC network
> reported that their RK3568 was no longer getting a CPU and GPU clock
> from scmi and consequently not booting when using linux-next. This
> was bisected down to the following commit:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/firmware/arm_scmi/base.c?h=next-20220503&id=3b0041f6e10e5bdbb646d98172be43e88734ed62
>
> The error message in the log is as follows:
>
> arm-scmi firmware:scmi: Malformed reply - real_sz:8 calc_sz:4, t->rx.len is 12, sizeof(u32) is 4, loop_num_ret is 3
>
> The rockchip firmware (bl31) being used was v1.32, from here:
>
> https://github.com/JeffyCN/rockchip_mirrors/blob/rkbin/bin/rk35/rk3568_bl31_v1.32.elf
>

So this platform is not supported in upstream TF-A like its predecessors ?

> This seems like a non-fatal firmware bug, for which a kernel workaround is
> certainly possible, but it would be good if rockchip could fix this in their
> firmware.
>

Indeed, we added this check finding issue in one of our tests. Luckily
it helped to unearth the same issue on this platform, but due to the
nature of its f/w release, it is bit unfortunate that it can't be fixed
easily and quickly. But I really wish this gets fixed in the firmware.
Are there any other f/w bugs reported so far ? If so how are they fixed
as I don't expect all such bugs can be worked around in the kernel though
this might be. I would like to hear details there if possible.

> The user going by "amazingfate" reported that commenting out the
> ret = -EPROTO; break;
> fixes the issue for them.
>

Sure, or we could relax the check as calc_sz <= real_sz or something so
that the reverse is still caught and handled as OS might read junk data in
the later case.

> I'm writing here to get the discussion started on how we can resolve this
> before the Linux 5.19 release.
>

Agreed, I just sent by pull request for this literally few hours ago.

> Sudeep Holla has already told me they'll gladly add a workaround before
> the 5.19 release, but would rather see this fixed in the vendor firmware
> first. Would rockchip be able and willing to fix it and publish a new
> bl31 for rk3568?
>

Indeed and as mentioned above details on how other such f/w bugs are dealt
in general esp that the firmware is blob release and one can't fix it easily.
Do we have a bugzilla kind of setup to report and get the bugs fixed ?

--
Regards,
Sudeep

2022-05-06 04:23:46

by Cristian Marussi

[permalink] [raw]
Subject: Re: [BUG] New arm scmi check in linux-next causing rk3568 not to boot due to firmware bug

On Wed, May 04, 2022 at 07:51:45PM +0200, Nicolas Frattaroli wrote:
> On Mittwoch, 4. Mai 2022 15:21:30 CEST Sudeep Holla wrote:
> > + Cristian

+Etieenne

Hi Nicolas,

> >
> > Hi Nicolas,
> >
> > Thanks for the formal report.
> >
> > On Wed, May 04, 2022 at 02:49:07PM +0200, Nicolas Frattaroli wrote:
> > > Good day,
> > >
> > > a user on the #linux-rockchip channel on the Libera.chat IRC network
> > > reported that their RK3568 was no longer getting a CPU and GPU clock
> > > from scmi and consequently not booting when using linux-next. This
> > > was bisected down to the following commit:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/firmware/arm_scmi/base.c?h=next-20220503&id=3b0041f6e10e5bdbb646d98172be43e88734ed62
> > >
> > > The error message in the log is as follows:
> > >
> > > arm-scmi firmware:scmi: Malformed reply - real_sz:8 calc_sz:4, t->rx.len is 12, sizeof(u32) is 4, loop_num_ret is 3
> > >
> > > The rockchip firmware (bl31) being used was v1.32, from here:
> > >
> > > https://github.com/JeffyCN/rockchip_mirrors/blob/rkbin/bin/rk35/rk3568_bl31_v1.32.elf
> > >
> >
> > So this platform is not supported in upstream TF-A like its predecessors ?
>
> Hello,
>
> it is not yet supported by upstream. Rockchip plans to release the sources
> for it at some point if I recall correctly, but I believe their software
> team has been very busy due to new hardware releases, so it hasn't happened
> yet. I hope we'll see an open source release of the TF-A sources eventually,
> so that for bugs like this we can always fix them without the vendor needing
> to do it for us.
>
> >
> > > This seems like a non-fatal firmware bug, for which a kernel workaround is
> > > certainly possible, but it would be good if rockchip could fix this in their
> > > firmware.
> > >
> >
> > Indeed, we added this check finding issue in one of our tests. Luckily
> > it helped to unearth the same issue on this platform, but due to the
> > nature of its f/w release, it is bit unfortunate that it can't be fixed
> > easily and quickly. But I really wish this gets fixed in the firmware.
> > Are there any other f/w bugs reported so far ? If so how are they fixed
> > as I don't expect all such bugs can be worked around in the kernel though
> > this might be. I would like to hear details there if possible.
>
> I'm not aware of how the rockchip bug report workflow works. They seemingly
> did update the firmware multiple times, last in October of 2021.
>
> The official rockchip repository at [1] hasn't been kept as up to date as
> the mirror by a rockchip employee at [2], so most people seem to have been
> using the latter. Speaking of which, I'll add the owner of that repo to
> the CC of this thread to make sure this doesn't get lost.
>
> Rockchip lists an e-mail at [3] for reporting issues at, but this seems to
> relate to their open-source documentation. The official github repository
> of "rkbin" on the "rockchip-linux" organisation does not have issues
> enabled, so submitting a bug report through that is unfortunately not
> possible.

Having a quick look at TF-A SCMI code in charge of this message (at least in
the upstream):

https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/drivers/scmi-msg/base.c#n136

it seems to me that the bug lies in the fact that the BASE_DISCOVER_PROTOCOLS
response message built by the function above is not properly sized: the trailing
message payload carrying the list of protocols (after returned_protos field) returns
always a fixed arbitrarily sized payload, possibly zeroed if fewer protocols are
carried.

IOW, even though the answer in this case carries 3 items/protocols, the payload
is anyway 8 bytes (with the last 5 bytes zeroed), while by the spec it should have
been just 4 bytes.

(in fact testing the kernel fix on a JUNO with last SCP fw release did NOT expose
any issue...)

I think a fix FW side could be something along these lines (UNTESTED NOR
BUILT ! ... I Cc'ed Etienne that seems the author of this bit)

This basically mirrors the same checks introduced in kernel...if someone
is fancy/able to test it....

---

diff --git a/drivers/scmi-msg/base.c b/drivers/scmi-msg/base.c
index 2d7203451..35c99e308 100644
--- a/drivers/scmi-msg/base.c
+++ b/drivers/scmi-msg/base.c
@@ -142,6 +142,7 @@ static void discover_list_protocols(struct scmi_msg *msg)
uint8_t outargs[sizeof(p2a) + MAX_PROTOCOL_IN_LIST] = { 0U };
const uint8_t *list = NULL;
unsigned int count = 0U;
+ size_t list_sz = 0U;

if (msg->in_size != sizeof(*a2p)) {
scmi_status_response(msg, SCMI_PROTOCOL_ERROR);
@@ -163,9 +164,12 @@ static void discover_list_protocols(struct scmi_msg *msg)
p2a.num_protocols = count;

memcpy(outargs, &p2a, sizeof(p2a));
- memcpy(outargs + sizeof(p2a), list + a2p->skip, count);
-
- scmi_write_response(msg, outargs, sizeof(outargs));
+ if (count) {
+ memcpy(outargs + sizeof(p2a), list + a2p->skip, count);
+ list_sz = (1 + (count - 1) / sizeof(uint32_t)) *
+ sizeof(uint32_t);
+ }
+ scmi_write_response(msg, outargs, sizeof(p2a) + list_sz);
}


>
> >
> > > The user going by "amazingfate" reported that commenting out the
> > > ret = -EPROTO; break;
> > > fixes the issue for them.
> > >
> >
> > Sure, or we could relax the check as calc_sz <= real_sz or something so
> > that the reverse is still caught and handled as OS might read junk data in
> > the later case.
>
> This seems like a good solution, that way we're unlikely to ever run into a
> situation where the kernel does the wrong thing here even if we're less
> strict about the check. In either case, it should print a dev_err though,
> it's still an error even if we can tolerate it in some cases.
>
> >

Beside fixing the FW, adding a workaround like the one Sudeep mentioned to
avoid killing old fw plaform seems reasonable: we can just _err() as long as
Kernel is not put into peril (i.e. calc_sz <= real_sz ) without
necessarily bail out if an out of spec, but harmless, message is
spotted.

> > > I'm writing here to get the discussion started on how we can resolve this
> > > before the Linux 5.19 release.
> > >
> >
> > Agreed, I just sent by pull request for this literally few hours ago.
> >
> > > Sudeep Holla has already told me they'll gladly add a workaround before
> > > the 5.19 release, but would rather see this fixed in the vendor firmware
> > > first. Would rockchip be able and willing to fix it and publish a new
> > > bl31 for rk3568?
> > >
> >
> > Indeed and as mentioned above details on how other such f/w bugs are dealt
> > in general esp that the firmware is blob release and one can't fix it easily.
> > Do we have a bugzilla kind of setup to report and get the bugs fixed ?
>
> It's worth mentioning that I think even if we get Rockchip to fix the bug in
> the firmware, I believe Linux should still add a workaround, as otherwise
> people running older firmware who are upgrading their kernels could suddenly
> have unbootable systems and don't know why that happened.
>

Agree.

Thanks,
Cristian

2022-05-06 10:23:57

by Etienne Carriere

[permalink] [raw]
Subject: Re: [BUG] New arm scmi check in linux-next causing rk3568 not to boot due to firmware bug

Hello Nicolas, Cristian,

On Thu, 5 May 2022 at 10:03, Cristian Marussi <[email protected]> wrote:
>
> On Wed, May 04, 2022 at 07:51:45PM +0200, Nicolas Frattaroli wrote:
> > On Mittwoch, 4. Mai 2022 15:21:30 CEST Sudeep Holla wrote:
> > > + Cristian
>
> +Etieenne
>
> Hi Nicolas,
>
> > >
> > > Hi Nicolas,
> > >
> > > Thanks for the formal report.
> > >
> > > On Wed, May 04, 2022 at 02:49:07PM +0200, Nicolas Frattaroli wrote:
> > > > Good day,
> > > >
> > > > a user on the #linux-rockchip channel on the Libera.chat IRC network
> > > > reported that their RK3568 was no longer getting a CPU and GPU clock
> > > > from scmi and consequently not booting when using linux-next. This
> > > > was bisected down to the following commit:
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/firmware/arm_scmi/base.c?h=next-20220503&id=3b0041f6e10e5bdbb646d98172be43e88734ed62
> > > >
> > > > The error message in the log is as follows:
> > > >
> > > > arm-scmi firmware:scmi: Malformed reply - real_sz:8 calc_sz:4, t->rx.len is 12, sizeof(u32) is 4, loop_num_ret is 3
> > > >
> > > > The rockchip firmware (bl31) being used was v1.32, from here:
> > > >
> > > > https://github.com/JeffyCN/rockchip_mirrors/blob/rkbin/bin/rk35/rk3568_bl31_v1.32.elf
> > > >
> > >
> > > So this platform is not supported in upstream TF-A like its predecessors ?
> >
> > Hello,
> >
> > it is not yet supported by upstream. Rockchip plans to release the sources
> > for it at some point if I recall correctly, but I believe their software
> > team has been very busy due to new hardware releases, so it hasn't happened
> > yet. I hope we'll see an open source release of the TF-A sources eventually,
> > so that for bugs like this we can always fix them without the vendor needing
> > to do it for us.
> >
> > >
> > > > This seems like a non-fatal firmware bug, for which a kernel workaround is
> > > > certainly possible, but it would be good if rockchip could fix this in their
> > > > firmware.
> > > >
> > >
> > > Indeed, we added this check finding issue in one of our tests. Luckily
> > > it helped to unearth the same issue on this platform, but due to the
> > > nature of its f/w release, it is bit unfortunate that it can't be fixed
> > > easily and quickly. But I really wish this gets fixed in the firmware.
> > > Are there any other f/w bugs reported so far ? If so how are they fixed
> > > as I don't expect all such bugs can be worked around in the kernel though
> > > this might be. I would like to hear details there if possible.
> >
> > I'm not aware of how the rockchip bug report workflow works. They seemingly
> > did update the firmware multiple times, last in October of 2021.
> >
> > The official rockchip repository at [1] hasn't been kept as up to date as
> > the mirror by a rockchip employee at [2], so most people seem to have been
> > using the latter. Speaking of which, I'll add the owner of that repo to
> > the CC of this thread to make sure this doesn't get lost.
> >
> > Rockchip lists an e-mail at [3] for reporting issues at, but this seems to
> > relate to their open-source documentation. The official github repository
> > of "rkbin" on the "rockchip-linux" organisation does not have issues
> > enabled, so submitting a bug report through that is unfortunately not
> > possible.
>
> Having a quick look at TF-A SCMI code in charge of this message (at least in
> the upstream):
>
> https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/drivers/scmi-msg/base.c#n136
>
> it seems to me that the bug lies in the fact that the BASE_DISCOVER_PROTOCOLS
> response message built by the function above is not properly sized: the trailing
> message payload carrying the list of protocols (after returned_protos field) returns
> always a fixed arbitrarily sized payload, possibly zeroed if fewer protocols are
> carried.
>
> IOW, even though the answer in this case carries 3 items/protocols, the payload
> is anyway 8 bytes (with the last 5 bytes zeroed), while by the spec it should have
> been just 4 bytes.
>
> (in fact testing the kernel fix on a JUNO with last SCP fw release did NOT expose
> any issue...)
>
> I think a fix FW side could be something along these lines (UNTESTED NOR
> BUILT ! ... I Cc'ed Etienne that seems the author of this bit)
>
> This basically mirrors the same checks introduced in kernel...if someone
> is fancy/able to test it....

Indeed the firmware implementation is wrong in TF-A.
And also in OP-TEE by the way:
https://github.com/OP-TEE/optee_os/blob/3.17.0/core/drivers/scmi-msg/base.c#L163-L166

@Nicoals, do you want to send a patch to TF-A, or do you want me to do it?

I can fix the optee_os implementation. I'll tell you when I'll have
created a P-R.
The fix is the same for TF-A and OP-TEE.
Proposal from Cristian looks good to me, maybe simplified:

```patch
memcpy(outargs, &p2a, sizeof(p2a));
memcpy(outargs + sizeof(p2a), list + a2p->skip, count);

- scmi_write_response(msg, outargs, sizeof(outargs));
+ list_sz = (1 + (count - 1) / sizeof(uint32_t)) * sizeof(uint32_t);
+ scmi_write_response(msg, outargs, sizeof(p2a) + list_sz);
```


>
> diff --git a/drivers/scmi-msg/base.c b/drivers/scmi-msg/base.c
> index 2d7203451..35c99e308 100644
> --- a/drivers/scmi-msg/base.c
> +++ b/drivers/scmi-msg/base.c
> @@ -142,6 +142,7 @@ static void discover_list_protocols(struct scmi_msg *msg)
> uint8_t outargs[sizeof(p2a) + MAX_PROTOCOL_IN_LIST] = { 0U };
> const uint8_t *list = NULL;
> unsigned int count = 0U;
> + size_t list_sz = 0U;
>
> if (msg->in_size != sizeof(*a2p)) {
> scmi_status_response(msg, SCMI_PROTOCOL_ERROR);
> @@ -163,9 +164,12 @@ static void discover_list_protocols(struct scmi_msg *msg)
> p2a.num_protocols = count;
>
> memcpy(outargs, &p2a, sizeof(p2a));
> - memcpy(outargs + sizeof(p2a), list + a2p->skip, count);
> -
> - scmi_write_response(msg, outargs, sizeof(outargs));
> + if (count) {
> + memcpy(outargs + sizeof(p2a), list + a2p->skip, count);
> + list_sz = (1 + (count - 1) / sizeof(uint32_t)) *
> + sizeof(uint32_t);
> + }
> + scmi_write_response(msg, outargs, sizeof(p2a) + list_sz);
> }
>
>
> >
> > >
> > > > The user going by "amazingfate" reported that commenting out the
> > > > ret = -EPROTO; break;
> > > > fixes the issue for them.
> > > >
> > >
> > > Sure, or we could relax the check as calc_sz <= real_sz or something so
> > > that the reverse is still caught and handled as OS might read junk data in
> > > the later case.
> >
> > This seems like a good solution, that way we're unlikely to ever run into a
> > situation where the kernel does the wrong thing here even if we're less
> > strict about the check. In either case, it should print a dev_err though,
> > it's still an error even if we can tolerate it in some cases.
> >
> > >
>
> Beside fixing the FW, adding a workaround like the one Sudeep mentioned to
> avoid killing old fw plaform seems reasonable: we can just _err() as long as
> Kernel is not put into peril (i.e. calc_sz <= real_sz ) without
> necessarily bail out if an out of spec, but harmless, message is
> spotted.
>
> > > > I'm writing here to get the discussion started on how we can resolve this
> > > > before the Linux 5.19 release.
> > > >
> > >
> > > Agreed, I just sent by pull request for this literally few hours ago.
> > >
> > > > Sudeep Holla has already told me they'll gladly add a workaround before
> > > > the 5.19 release, but would rather see this fixed in the vendor firmware
> > > > first. Would rockchip be able and willing to fix it and publish a new
> > > > bl31 for rk3568?
> > > >
> > >
> > > Indeed and as mentioned above details on how other such f/w bugs are dealt
> > > in general esp that the firmware is blob release and one can't fix it easily.
> > > Do we have a bugzilla kind of setup to report and get the bugs fixed ?
> >
> > It's worth mentioning that I think even if we get Rockchip to fix the bug in
> > the firmware, I believe Linux should still add a workaround, as otherwise
> > people running older firmware who are upgrading their kernels could suddenly
> > have unbootable systems and don't know why that happened.
> >
>
> Agree.

Agree also. Sorry for that bug in TF-A/OP-TEE.

BR,
Etienne

>
> Thanks,
> Cristian

2022-05-08 15:26:57

by Sudeep Holla

[permalink] [raw]
Subject: Re: [BUG] New arm scmi check in linux-next causing rk3568 not to boot due to firmware bug

> > > > On Wed, May 04, 2022 at 02:49:07PM +0200, Nicolas Frattaroli wrote:
> > > > > Good day,
> > > > >
> > > > > a user on the #linux-rockchip channel on the Libera.chat IRC network
> > > > > reported that their RK3568 was no longer getting a CPU and GPU clock
> > > > > from scmi and consequently not booting when using linux-next. This
> > > > > was bisected down to the following commit:

OK I missed to read the above properly earlier. If scmi probe failure is
resulting in Linux boot failure, then that is another bug that needs fixing.
Why does not getting CPU clock block the boot. I would like to see the boot
logs. I considered this issue to be non-fatal and must be just ending up
disabling all SCMI communication. But the reported issue is boot failure
which sounds like another/different bug and I would like that to be fixed
first before we push the workaround for the reported issue so that it is
not ignored.

Has anyone analysed why the absence of CPU clock results in boot failure ?
Are you running the upstream kernel itself ?

--
Regards,
Sudeep

2022-05-09 02:00:14

by Cristian Marussi

[permalink] [raw]
Subject: Re: [BUG] New arm scmi check in linux-next causing rk3568 not to boot due to firmware bug

On Thu, May 05, 2022 at 11:47:41AM +0100, Cristian Marussi wrote:
> On Thu, May 05, 2022 at 11:40:09AM +0200, Etienne Carriere wrote:
> > Hello Nicolas, Cristian,
> >
> > On Thu, 5 May 2022 at 10:03, Cristian Marussi <[email protected]> wrote:
> > >
> > > On Wed, May 04, 2022 at 07:51:45PM +0200, Nicolas Frattaroli wrote:
> > > > On Mittwoch, 4. Mai 2022 15:21:30 CEST Sudeep Holla wrote:
> > > > > + Cristian
> > >
> > > +Etieenne
> > >
> > > Hi Nicolas,
> > >
>
> Hi Etienne,
>
> [snip]
>
> > > Having a quick look at TF-A SCMI code in charge of this message (at least in
> > > the upstream):
> > >
> > > https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/drivers/scmi-msg/base.c#n136
> > >
> > > it seems to me that the bug lies in the fact that the BASE_DISCOVER_PROTOCOLS
> > > response message built by the function above is not properly sized: the trailing
> > > message payload carrying the list of protocols (after returned_protos field) returns
> > > always a fixed arbitrarily sized payload, possibly zeroed if fewer protocols are
> > > carried.
> > >
> > > IOW, even though the answer in this case carries 3 items/protocols, the payload
> > > is anyway 8 bytes (with the last 5 bytes zeroed), while by the spec it should have
> > > been just 4 bytes.
> > >
> > > (in fact testing the kernel fix on a JUNO with last SCP fw release did NOT expose
> > > any issue...)
> > >
> > > I think a fix FW side could be something along these lines (UNTESTED NOR
> > > BUILT ! ... I Cc'ed Etienne that seems the author of this bit)
> > >
> > > This basically mirrors the same checks introduced in kernel...if someone
> > > is fancy/able to test it....
> >
> > Indeed the firmware implementation is wrong in TF-A.
> > And also in OP-TEE by the way:
> > https://github.com/OP-TEE/optee_os/blob/3.17.0/core/drivers/scmi-msg/base.c#L163-L166
> >
> > @Nicoals, do you want to send a patch to TF-A, or do you want me to do it?
> >
> > I can fix the optee_os implementation. I'll tell you when I'll have
> > created a P-R.
> > The fix is the same for TF-A and OP-TEE.
> > Proposal from Cristian looks good to me, maybe simplified:
> >
> > ```patch
> > memcpy(outargs, &p2a, sizeof(p2a));
> > memcpy(outargs + sizeof(p2a), list + a2p->skip, count);
> >
> > - scmi_write_response(msg, outargs, sizeof(outargs));
> > + list_sz = (1 + (count - 1) / sizeof(uint32_t)) * sizeof(uint32_t);
> > + scmi_write_response(msg, outargs, sizeof(p2a) + list_sz);
> > ```
> >
>
> I don't think list_sz is properly calculated if you don't rule out the
> count = 0 case (did the same mistake in Kernel at first :D)...if count is
> zero list_sz ends up being 4 [(1 + (0 - 1) / 4 ) * 4] while it should be
> zero. (...and 'if (count)' also avoid an unneeded memcpy of zero bytes)
>
> Moreover reviewing my own proposal code below it's probably easier to use
> some sort of macro like ALIGN(count, 4) if there's one in TF-A/OP-TEE.
> (...checking anyway that can handle correctly the zero case..)
>
> Thanks,
> Etienne
>
Sorry this was meant to be

Thanks,
Cristian

:P but I messed up the snipping

Cristian


2022-05-09 04:14:09

by Nicolas Frattaroli

[permalink] [raw]
Subject: Re: [BUG] New arm scmi check in linux-next causing rk3568 not to boot due to firmware bug

On Donnerstag, 5. Mai 2022 12:10:32 CEST Sudeep Holla wrote:
> > > > > On Wed, May 04, 2022 at 02:49:07PM +0200, Nicolas Frattaroli wrote:
> > > > > > Good day,
> > > > > >
> > > > > > a user on the #linux-rockchip channel on the Libera.chat IRC network
> > > > > > reported that their RK3568 was no longer getting a CPU and GPU clock
> > > > > > from scmi and consequently not booting when using linux-next. This
> > > > > > was bisected down to the following commit:
>
> OK I missed to read the above properly earlier. If scmi probe failure is
> resulting in Linux boot failure, then that is another bug that needs fixing.
> Why does not getting CPU clock block the boot. I would like to see the boot
> logs. I considered this issue to be non-fatal and must be just ending up
> disabling all SCMI communication. But the reported issue is boot failure
> which sounds like another/different bug and I would like that to be fixed
> first before we push the workaround for the reported issue so that it is
> not ignored.
>
> Has anyone analysed why the absence of CPU clock results in boot failure ?
> Are you running the upstream kernel itself ?
>
>

Hello,

I'm sorry, I seem to have misinterpreted the original user's messages
as having been a boot failure. Upon re-reading the logs, this doesn't
seem to have been explicitly mentioned. I therefore assume this wasn't
causing a failure to boot.

Sadly the user isn't in the IRC channel at this moment so I cannot ask
them further questions.

I have tested this out on my own RK3566 based platform, and found that
we get the following:

$ sudo dmesg | grep arm-scmi
[ 0.247134] arm-scmi firmware:scmi: Enabled polling mode TX channel - prot_id:16
[ 0.247526] arm-scmi firmware:scmi: SCMI Notifications - Core Enabled.
[ 0.247760] arm-scmi firmware:scmi: Malformed reply - real_sz:8 calc_sz:4
[ 0.247773] arm-scmi firmware:scmi: SCMI Protocol v2.0 'rockchip:' Firmware version 0x0
[ 0.247920] arm-scmi firmware:scmi: SCMI protocol 20 not implemented
[ 1.545441] arm-scmi firmware:scmi: Failed. SCMI protocol 20 not active.
[ 1.562958] arm-scmi firmware:scmi: Failed. SCMI protocol 23 not active.
[ 1.565676] arm-scmi firmware:scmi: Failed. SCMI protocol 22 not active.
[ 2.094446] arm-scmi firmware:scmi: Failed. SCMI protocol 21 not active.
[ 2.103474] arm-scmi firmware:scmi: Failed. SCMI protocol 19 not active.
[ 5.586871] arm-scmi firmware:scmi: Failed. SCMI protocol 17 not active.
[ 5.593178] arm-scmi firmware:scmi: Failed. SCMI protocol 21 not active.

$ sudo dmesg | grep clk
[ 18.255901] panfrost fde60000.gpu: clk init failed -517
[ 18.686720] panfrost fde60000.gpu: clk init failed -517

The system does boot, it's just awfully slow. This is not a boot failure,
but arguably still a pretty bad failure mode to find oneself in.

Sorry for the confusion this caused.

Regards,
Nicolas Frattaroli



2022-05-09 06:08:50

by Nicolas Frattaroli

[permalink] [raw]
Subject: Re: [BUG] New arm scmi check in linux-next causing rk3568 not to boot due to firmware bug

On Mittwoch, 4. Mai 2022 15:21:30 CEST Sudeep Holla wrote:
> + Cristian
>
> Hi Nicolas,
>
> Thanks for the formal report.
>
> On Wed, May 04, 2022 at 02:49:07PM +0200, Nicolas Frattaroli wrote:
> > Good day,
> >
> > a user on the #linux-rockchip channel on the Libera.chat IRC network
> > reported that their RK3568 was no longer getting a CPU and GPU clock
> > from scmi and consequently not booting when using linux-next. This
> > was bisected down to the following commit:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/firmware/arm_scmi/base.c?h=next-20220503&id=3b0041f6e10e5bdbb646d98172be43e88734ed62
> >
> > The error message in the log is as follows:
> >
> > arm-scmi firmware:scmi: Malformed reply - real_sz:8 calc_sz:4, t->rx.len is 12, sizeof(u32) is 4, loop_num_ret is 3
> >
> > The rockchip firmware (bl31) being used was v1.32, from here:
> >
> > https://github.com/JeffyCN/rockchip_mirrors/blob/rkbin/bin/rk35/rk3568_bl31_v1.32.elf
> >
>
> So this platform is not supported in upstream TF-A like its predecessors ?

Hello,

it is not yet supported by upstream. Rockchip plans to release the sources
for it at some point if I recall correctly, but I believe their software
team has been very busy due to new hardware releases, so it hasn't happened
yet. I hope we'll see an open source release of the TF-A sources eventually,
so that for bugs like this we can always fix them without the vendor needing
to do it for us.

>
> > This seems like a non-fatal firmware bug, for which a kernel workaround is
> > certainly possible, but it would be good if rockchip could fix this in their
> > firmware.
> >
>
> Indeed, we added this check finding issue in one of our tests. Luckily
> it helped to unearth the same issue on this platform, but due to the
> nature of its f/w release, it is bit unfortunate that it can't be fixed
> easily and quickly. But I really wish this gets fixed in the firmware.
> Are there any other f/w bugs reported so far ? If so how are they fixed
> as I don't expect all such bugs can be worked around in the kernel though
> this might be. I would like to hear details there if possible.

I'm not aware of how the rockchip bug report workflow works. They seemingly
did update the firmware multiple times, last in October of 2021.

The official rockchip repository at [1] hasn't been kept as up to date as
the mirror by a rockchip employee at [2], so most people seem to have been
using the latter. Speaking of which, I'll add the owner of that repo to
the CC of this thread to make sure this doesn't get lost.

Rockchip lists an e-mail at [3] for reporting issues at, but this seems to
relate to their open-source documentation. The official github repository
of "rkbin" on the "rockchip-linux" organisation does not have issues
enabled, so submitting a bug report through that is unfortunately not
possible.

>
> > The user going by "amazingfate" reported that commenting out the
> > ret = -EPROTO; break;
> > fixes the issue for them.
> >
>
> Sure, or we could relax the check as calc_sz <= real_sz or something so
> that the reverse is still caught and handled as OS might read junk data in
> the later case.

This seems like a good solution, that way we're unlikely to ever run into a
situation where the kernel does the wrong thing here even if we're less
strict about the check. In either case, it should print a dev_err though,
it's still an error even if we can tolerate it in some cases.

>
> > I'm writing here to get the discussion started on how we can resolve this
> > before the Linux 5.19 release.
> >
>
> Agreed, I just sent by pull request for this literally few hours ago.
>
> > Sudeep Holla has already told me they'll gladly add a workaround before
> > the 5.19 release, but would rather see this fixed in the vendor firmware
> > first. Would rockchip be able and willing to fix it and publish a new
> > bl31 for rk3568?
> >
>
> Indeed and as mentioned above details on how other such f/w bugs are dealt
> in general esp that the firmware is blob release and one can't fix it easily.
> Do we have a bugzilla kind of setup to report and get the bugs fixed ?

It's worth mentioning that I think even if we get Rockchip to fix the bug in
the firmware, I believe Linux should still add a workaround, as otherwise
people running older firmware who are upgrading their kernels could suddenly
have unbootable systems and don't know why that happened.

Regards,
Nicolas Frattaroli

PS: I've also CC'd Peter Geis as he has worked on the RK356x support in
mainline a bunch and I believe he has been in contact with Rockchip
about releasing the TF-A sources before.

[1]: https://github.com/rockchip-linux/rkbin
[2]: https://github.com/JeffyCN/rockchip_mirrors/tree/rkbin
[3]: http://opensource.rock-chips.com/wiki_Main_Page




2022-05-09 07:05:50

by Cristian Marussi

[permalink] [raw]
Subject: Re: [BUG] New arm scmi check in linux-next causing rk3568 not to boot due to firmware bug

On Thu, May 05, 2022 at 11:40:09AM +0200, Etienne Carriere wrote:
> Hello Nicolas, Cristian,
>
> On Thu, 5 May 2022 at 10:03, Cristian Marussi <[email protected]> wrote:
> >
> > On Wed, May 04, 2022 at 07:51:45PM +0200, Nicolas Frattaroli wrote:
> > > On Mittwoch, 4. Mai 2022 15:21:30 CEST Sudeep Holla wrote:
> > > > + Cristian
> >
> > +Etieenne
> >
> > Hi Nicolas,
> >

Hi Etienne,

[snip]

> > Having a quick look at TF-A SCMI code in charge of this message (at least in
> > the upstream):
> >
> > https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/drivers/scmi-msg/base.c#n136
> >
> > it seems to me that the bug lies in the fact that the BASE_DISCOVER_PROTOCOLS
> > response message built by the function above is not properly sized: the trailing
> > message payload carrying the list of protocols (after returned_protos field) returns
> > always a fixed arbitrarily sized payload, possibly zeroed if fewer protocols are
> > carried.
> >
> > IOW, even though the answer in this case carries 3 items/protocols, the payload
> > is anyway 8 bytes (with the last 5 bytes zeroed), while by the spec it should have
> > been just 4 bytes.
> >
> > (in fact testing the kernel fix on a JUNO with last SCP fw release did NOT expose
> > any issue...)
> >
> > I think a fix FW side could be something along these lines (UNTESTED NOR
> > BUILT ! ... I Cc'ed Etienne that seems the author of this bit)
> >
> > This basically mirrors the same checks introduced in kernel...if someone
> > is fancy/able to test it....
>
> Indeed the firmware implementation is wrong in TF-A.
> And also in OP-TEE by the way:
> https://github.com/OP-TEE/optee_os/blob/3.17.0/core/drivers/scmi-msg/base.c#L163-L166
>
> @Nicoals, do you want to send a patch to TF-A, or do you want me to do it?
>
> I can fix the optee_os implementation. I'll tell you when I'll have
> created a P-R.
> The fix is the same for TF-A and OP-TEE.
> Proposal from Cristian looks good to me, maybe simplified:
>
> ```patch
> memcpy(outargs, &p2a, sizeof(p2a));
> memcpy(outargs + sizeof(p2a), list + a2p->skip, count);
>
> - scmi_write_response(msg, outargs, sizeof(outargs));
> + list_sz = (1 + (count - 1) / sizeof(uint32_t)) * sizeof(uint32_t);
> + scmi_write_response(msg, outargs, sizeof(p2a) + list_sz);
> ```
>

I don't think list_sz is properly calculated if you don't rule out the
count = 0 case (did the same mistake in Kernel at first :D)...if count is
zero list_sz ends up being 4 [(1 + (0 - 1) / 4 ) * 4] while it should be
zero. (...and 'if (count)' also avoid an unneeded memcpy of zero bytes)

Moreover reviewing my own proposal code below it's probably easier to use
some sort of macro like ALIGN(count, 4) if there's one in TF-A/OP-TEE.
(...checking anyway that can handle correctly the zero case..)

Thanks,
Etienne

2022-05-14 00:12:31

by Nicolas Frattaroli

[permalink] [raw]
Subject: Re: [BUG] New arm scmi check in linux-next causing rk3568 not to boot due to firmware bug

Hello,

sorry for the late reply, completely missed that there was a question
for me in this mail.

On Donnerstag, 5. Mai 2022 11:40:09 CEST Etienne Carriere wrote:
> Hello Nicolas, Cristian,
> [...]
>
> Indeed the firmware implementation is wrong in TF-A.
> And also in OP-TEE by the way:
> https://github.com/OP-TEE/optee_os/blob/3.17.0/core/drivers/scmi-msg/base.c#L163-L166
>
> @Nicoals, do you want to send a patch to TF-A, or do you want me to do it?

I have no experience with TF-A, so I'd prefer if you could do it.

In good news, Rockchip has confirmed they're preparing to release RK356x
TF-A sources, so I'll be able to port the patch over to their sources once
they are released, if they don't already apply it themselves.

>
> I can fix the optee_os implementation. I'll tell you when I'll have
> created a P-R.
> The fix is the same for TF-A and OP-TEE.
> Proposal from Cristian looks good to me, maybe simplified:
>
> ```patch
> memcpy(outargs, &p2a, sizeof(p2a));
> memcpy(outargs + sizeof(p2a), list + a2p->skip, count);
>
> - scmi_write_response(msg, outargs, sizeof(outargs));
> + list_sz = (1 + (count - 1) / sizeof(uint32_t)) * sizeof(uint32_t);
> + scmi_write_response(msg, outargs, sizeof(p2a) + list_sz);
> ```
> [...]
>
> BR,
> Etienne

Regards,
Nicolas Frattaroli




2022-05-14 00:44:12

by Kever Yang

[permalink] [raw]
Subject: Re: [BUG] New arm scmi check in linux-next causing rk3568 not to boot due to firmware bug

Hi Sudeep,

    Rockchip has notice this issue, we will check how to fix this issue
in our ATF release.


Thanks,
- Kever
On 2022/5/12 22:34, Sudeep Holla wrote:
> On Thu, May 12, 2022 at 01:11:22PM +0200, Nicolas Frattaroli wrote:
>> Hello,
>>
>> sorry for the late reply, completely missed that there was a question
>> for me in this mail.
>>
>> On Donnerstag, 5. Mai 2022 11:40:09 CEST Etienne Carriere wrote:
>>> Hello Nicolas, Cristian,
>>> [...]
>>>
>>> Indeed the firmware implementation is wrong in TF-A.
>>> And also in OP-TEE by the way:
>>> https://github.com/OP-TEE/optee_os/blob/3.17.0/core/drivers/scmi-msg/base.c#L163-L166
>>>
>>> @Nicoals, do you want to send a patch to TF-A, or do you want me to do it?
>> I have no experience with TF-A, so I'd prefer if you could do it.
>>
>> In good news, Rockchip has confirmed they're preparing to release RK356x
>> TF-A sources, so I'll be able to port the patch over to their sources once
>> they are released, if they don't already apply it themselves.
>>
> So, there is no way to get a blob release with the patch applied ?
> We know it is a bug in TF-A and if Rockchip is using that codebase, it
> will be the same bug there too causing this issue. Waiting until the
> code is released and the proper TF-A port is done may not be acceptable
> for many developers. So someone from the rockchip doing these tf-a blob
> release must get involved here, understand the situation and get the fixed.
>
> We can workaround, but I want to hear that it will be fixed instead of
> getting ignored until proper port is available.
>
> Etienne, are you not the author for the related TF-A code ? How can we
> get that fixed in the TF-A code base. If you are not, then I will try to
> get my repo login credentials sorted so that I can only push TF-A change.
>

2022-05-14 01:10:12

by Sudeep Holla

[permalink] [raw]
Subject: Re: [BUG] New arm scmi check in linux-next causing rk3568 not to boot due to firmware bug

On Thu, May 12, 2022 at 01:11:22PM +0200, Nicolas Frattaroli wrote:
> Hello,
>
> sorry for the late reply, completely missed that there was a question
> for me in this mail.
>
> On Donnerstag, 5. Mai 2022 11:40:09 CEST Etienne Carriere wrote:
> > Hello Nicolas, Cristian,
> > [...]
> >
> > Indeed the firmware implementation is wrong in TF-A.
> > And also in OP-TEE by the way:
> > https://github.com/OP-TEE/optee_os/blob/3.17.0/core/drivers/scmi-msg/base.c#L163-L166
> >
> > @Nicoals, do you want to send a patch to TF-A, or do you want me to do it?
>
> I have no experience with TF-A, so I'd prefer if you could do it.
>
> In good news, Rockchip has confirmed they're preparing to release RK356x
> TF-A sources, so I'll be able to port the patch over to their sources once
> they are released, if they don't already apply it themselves.
>

So, there is no way to get a blob release with the patch applied ?
We know it is a bug in TF-A and if Rockchip is using that codebase, it
will be the same bug there too causing this issue. Waiting until the
code is released and the proper TF-A port is done may not be acceptable
for many developers. So someone from the rockchip doing these tf-a blob
release must get involved here, understand the situation and get the fixed.

We can workaround, but I want to hear that it will be fixed instead of
getting ignored until proper port is available.

Etienne, are you not the author for the related TF-A code ? How can we
get that fixed in the TF-A code base. If you are not, then I will try to
get my repo login credentials sorted so that I can only push TF-A change.

--
Regards,
Sudeep

2022-05-14 03:34:41

by Etienne Carriere

[permalink] [raw]
Subject: Re: [BUG] New arm scmi check in linux-next causing rk3568 not to boot due to firmware bug

Hello all,

On Thu, 12 May 2022 at 16:34, Sudeep Holla <[email protected]> wrote:
>
> On Thu, May 12, 2022 at 01:11:22PM +0200, Nicolas Frattaroli wrote:
> > Hello,
> >
> > sorry for the late reply, completely missed that there was a question
> > for me in this mail.
> >
> > On Donnerstag, 5. Mai 2022 11:40:09 CEST Etienne Carriere wrote:
> > > Hello Nicolas, Cristian,
> > > [...]
> > >
> > > Indeed the firmware implementation is wrong in TF-A.
> > > And also in OP-TEE by the way:
> > > https://github.com/OP-TEE/optee_os/blob/3.17.0/core/drivers/scmi-msg/base.c#L163-L166
> > >
> > > @Nicoals, do you want to send a patch to TF-A, or do you want me to do it?
> >
> > I have no experience with TF-A, so I'd prefer if you could do it.
> >
> > In good news, Rockchip has confirmed they're preparing to release RK356x
> > TF-A sources, so I'll be able to port the patch over to their sources once
> > they are released, if they don't already apply it themselves.
> >
>
> So, there is no way to get a blob release with the patch applied ?
> We know it is a bug in TF-A and if Rockchip is using that codebase, it
> will be the same bug there too causing this issue. Waiting until the
> code is released and the proper TF-A port is done may not be acceptable
> for many developers. So someone from the rockchip doing these tf-a blob
> release must get involved here, understand the situation and get the fixed.
>
> We can workaround, but I want to hear that it will be fixed instead of
> getting ignored until proper port is available.
>
> Etienne, are you not the author for the related TF-A code ? How can we
> get that fixed in the TF-A code base. If you are not, then I will try to
> get my repo login credentials sorted so that I can only push TF-A change.

Yes, I'm the author of the code.
I have created a gerrit change in TF-A to fix the issue:
https://review.trustedfirmware.org/c/TF-A/trusted-firmware-a/+/15196

OP-TEE implementation also embeds the same scmi-msg drivers and have
the same issue.
I should be fixed by P-R: https://github.com/OP-TEE/optee_os/pull/5334
I will ask TF-A if this specific change can hit TF-A release v2.7 tag,
if posisble.

Feedbacks are welcome on these 2 forums.

Regards,
etienne


>
> --
> Regards,
> Sudeep