LinuxLists.cc - [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

On 05/08/13 14:47, Stephen Boyd wrote:
> From: Brian Swetland <[email protected]>
>
> Currently v7 CPUs with an MIDR that has no bits set in the range
> [16:12] will be detected as old ARM CPUs with no caches and so

I mean [15:12], sorry.

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

2013-05-15 19:38:14

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

On 05/08/13 14:47, Stephen Boyd wrote:
> From: Brian Swetland <[email protected]>
>
> Currently v7 CPUs with an MIDR that has no bits set in the range
> [16:12] will be detected as old ARM CPUs with no caches and so
> the cache will never be turned on during decompression. ARM's
> Cortex chips have an 0xC in the range [16:12] so they never match
> this entry, but Qualcomm's Scorpion and Krait processors never
> set these bits to anything besides 0 so they always match.
>
> Skip this entry if we've compiled in support for v7 CPUs. This
> allows kernel decompression to happen nearly instantly instead of
> taking over 20 seconds.
>
> Signed-off-by: Brian Swetland <[email protected]>
> [sboyd: Clarified and extended commit text]
> Signed-off-by: Stephen Boyd <[email protected]>
> ---

Ping?

> arch/arm/boot/compressed/head.S | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/arm/boot/compressed/head.S b/arch/arm/boot/compressed/head.S
> index fe4d9c3..a236190 100644
> --- a/arch/arm/boot/compressed/head.S
> +++ b/arch/arm/boot/compressed/head.S
> @@ -805,6 +805,8 @@ call_cache_fn: adr r12, proc_types
> .align 2
> .type proc_types,#object
> proc_types:
> +#if !defined(CONFIG_CPU_V7)
> + /* This collides with some V7 IDs, preventing correct detection */
> .word 0x00000000 @ old ARM ID
> .word 0x0000f000
> mov pc, lr
> @@ -813,6 +815,7 @@ proc_types:
> THUMB( nop )
> mov pc, lr
> THUMB( nop )
> +#endif
>
> .word 0x41007000 @ ARM7/710
> .word 0xfff8fe00

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

2013-05-23 17:54:30

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

On 05/15/13 12:38, Stephen Boyd wrote:
> On 05/08/13 14:47, Stephen Boyd wrote:
>> From: Brian Swetland <[email protected]>
>>
>> Currently v7 CPUs with an MIDR that has no bits set in the range
>> [16:12] will be detected as old ARM CPUs with no caches and so
>> the cache will never be turned on during decompression. ARM's
>> Cortex chips have an 0xC in the range [16:12] so they never match
>> this entry, but Qualcomm's Scorpion and Krait processors never
>> set these bits to anything besides 0 so they always match.
>>
>> Skip this entry if we've compiled in support for v7 CPUs. This
>> allows kernel decompression to happen nearly instantly instead of
>> taking over 20 seconds.
>>
>> Signed-off-by: Brian Swetland <[email protected]>
>> [sboyd: Clarified and extended commit text]
>> Signed-off-by: Stephen Boyd <[email protected]>
>> ---
> Ping?

Russell, shall I add this to the patch tracker?

>
>> arch/arm/boot/compressed/head.S | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/arm/boot/compressed/head.S b/arch/arm/boot/compressed/head.S
>> index fe4d9c3..a236190 100644
>> --- a/arch/arm/boot/compressed/head.S
>> +++ b/arch/arm/boot/compressed/head.S
>> @@ -805,6 +805,8 @@ call_cache_fn: adr r12, proc_types
>> .align 2
>> .type proc_types,#object
>> proc_types:
>> +#if !defined(CONFIG_CPU_V7)
>> + /* This collides with some V7 IDs, preventing correct detection */
>> .word 0x00000000 @ old ARM ID
>> .word 0x0000f000
>> mov pc, lr
>> @@ -813,6 +815,7 @@ proc_types:
>> THUMB( nop )
>> mov pc, lr
>> THUMB( nop )
>> +#endif
>>
>> .word 0x41007000 @ ARM7/710
>> .word 0xfff8fe00
>

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

2013-05-23 23:15:42

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

On Thu, May 23, 2013 at 10:54:26AM -0700, Stephen Boyd wrote:
> On 05/15/13 12:38, Stephen Boyd wrote:
> > On 05/08/13 14:47, Stephen Boyd wrote:
> >> From: Brian Swetland <[email protected]>
> >>
> >> Currently v7 CPUs with an MIDR that has no bits set in the range
> >> [16:12] will be detected as old ARM CPUs with no caches and so
> >> the cache will never be turned on during decompression. ARM's
> >> Cortex chips have an 0xC in the range [16:12] so they never match
> >> this entry, but Qualcomm's Scorpion and Krait processors never
> >> set these bits to anything besides 0 so they always match.
> >>
> >> Skip this entry if we've compiled in support for v7 CPUs. This
> >> allows kernel decompression to happen nearly instantly instead of
> >> taking over 20 seconds.
> >>
> >> Signed-off-by: Brian Swetland <[email protected]>
> >> [sboyd: Clarified and extended commit text]
> >> Signed-off-by: Stephen Boyd <[email protected]>
> >> ---
> > Ping?
>
> Russell, shall I add this to the patch tracker?

Yes please.

2013-05-24 22:05:54

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

On 05/24, Russell King - ARM Linux wrote:
> On Thu, May 23, 2013 at 10:54:26AM -0700, Stephen Boyd wrote:
> > On 05/15/13 12:38, Stephen Boyd wrote:
> > > On 05/08/13 14:47, Stephen Boyd wrote:
> > >> From: Brian Swetland <[email protected]>
> > >>
> > >> Currently v7 CPUs with an MIDR that has no bits set in the range
> > >> [16:12] will be detected as old ARM CPUs with no caches and so
> > >> the cache will never be turned on during decompression. ARM's
> > >> Cortex chips have an 0xC in the range [16:12] so they never match
> > >> this entry, but Qualcomm's Scorpion and Krait processors never
> > >> set these bits to anything besides 0 so they always match.
> > >>
> > >> Skip this entry if we've compiled in support for v7 CPUs. This
> > >> allows kernel decompression to happen nearly instantly instead of
> > >> taking over 20 seconds.
> > >>
> > >> Signed-off-by: Brian Swetland <[email protected]>
> > >> [sboyd: Clarified and extended commit text]
> > >> Signed-off-by: Stephen Boyd <[email protected]>
> > >> ---
> > > Ping?
> >
> > Russell, shall I add this to the patch tracker?
>
> Yes please.
>

Ok, thanks.

I've noticed another problem now that our caches are used. On MSM
we have TEXT_OFFSET set to at least 0x208000 if we've built-in
support for MSM8x60/8960. If I boot a kernel with the MSM code
built-in that requires the higher text offset, but I load my
compressed kernel below that address (such as 0x0) the
decompression fails.

This happens because the page tables are written into the
compressed data region before we relocate ourself to a higher
location.

Here's some ascii art to show the problem

We start off at 0x0

0x000000 +---------+
| |
| zImage |
0x208000 |---------| <- r4 (zreladdr)
| zImage |
0x300000 +---------+ <- _edata

Then we run far enough to call cache_on which goes ahead and
calls __setup_mmu and sets up our page tables.

0x008000 +---------+
| |
| zImage |
| |
0x204000 |---------|
| pgdir |
0x208000 |---------| <- r4 (zreladdr)
| |
| zImage |
| |
0x300000 +---------+ <- _edata

This is bad because we just wrote our page tables into the
compressed data. Nobody notices though and we finish relocating
ourselves and then we call decompress_kernel() which fails
randomly. (BTW, why does error() sit in a while loop forever? We
can't get any information about why the decompression failed if
we have debug_ll enabled. I had to patch the error() routine to
not while loop forever to get that print after do_decompress to
be useful.)

I see a few solutions.

1) Relocate with caches off and then turn on caches after we're
running in a location where we won't overwrite ourselves.

2) Have temporary page tables for the relocation phase that live
just below the location we're going to relocate to.

3) Force bootloaders loading these types of images to load the
zImage at least as high as the TEXT_OFFSET is compiled to.

I don't think we can convince everyone that #3 is ok to do. I'm
leaning towards #2 since we get all the benefits of the cache
during the relocation phase but #1 is the obviously simple fix.

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

2013-06-03 21:13:42

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

Resending due to rmk's vacation.

On 05/24/13 15:05, Stephen Boyd wrote:
> I've noticed another problem now that our caches are used. On MSM
> we have TEXT_OFFSET set to at least 0x208000 if we've built-in
> support for MSM8x60/8960. If I boot a kernel with the MSM code
> built-in that requires the higher text offset, but I load my
> compressed kernel below that address (such as 0x0) the
> decompression fails.
>
> This happens because the page tables are written into the
> compressed data region before we relocate ourself to a higher
> location.
>
> Here's some ascii art to show the problem
>
> We start off at 0x0
>
> 0x000000 +---------+
> | |
> | zImage |
> 0x208000 |---------| <- r4 (zreladdr)
> | zImage |
> 0x300000 +---------+ <- _edata
>
> Then we run far enough to call cache_on which goes ahead and
> calls __setup_mmu and sets up our page tables.
>
> 0x008000 +---------+
> | |
> | zImage |
> | |
> 0x204000 |---------|
> | pgdir |
> 0x208000 |---------| <- r4 (zreladdr)
> | |
> | zImage |
> | |
> 0x300000 +---------+ <- _edata
>
> This is bad because we just wrote our page tables into the
> compressed data. Nobody notices though and we finish relocating
> ourselves and then we call decompress_kernel() which fails
> randomly. (BTW, why does error() sit in a while loop forever? We
> can't get any information about why the decompression failed if
> we have debug_ll enabled. I had to patch the error() routine to
> not while loop forever to get that print after do_decompress to
> be useful.)
>
> I see a few solutions.
>
> 1) Relocate with caches off and then turn on caches after we're
> running in a location where we won't overwrite ourselves.
>
> 2) Have temporary page tables for the relocation phase that live
> just below the location we're going to relocate to.
>
> 3) Force bootloaders loading these types of images to load the
> zImage at least as high as the TEXT_OFFSET is compiled to.
>
> I don't think we can convince everyone that #3 is ok to do. I'm
> leaning towards #2 since we get all the benefits of the cache
> during the relocation phase but #1 is the obviously simple fix.
>
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

2013-06-03 21:33:52

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

> On 05/24/13 15:05, Stephen Boyd wrote:
> > I see a few solutions.
> >
> > 1) Relocate with caches off and then turn on caches after we're
> > running in a location where we won't overwrite ourselves.

Due to the cost of doing memory copy with the cache off, thisoption
should be conditionally used and only when there is an actual conflict.

> > 2) Have temporary page tables for the relocation phase that live
> > just below the location we're going to relocate to.
> >
> > 3) Force bootloaders loading these types of images to load the
> > zImage at least as high as the TEXT_OFFSET is compiled to.
> >
> > I don't think we can convince everyone that #3 is ok to do. I'm
> > leaning towards #2 since we get all the benefits of the cache
> > during the relocation phase but #1 is the obviously simple fix.

I'd consider #2 too.

Nicolas

2013-06-03 22:23:45

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

On Mon, Jun 03, 2013 at 02:13:39PM -0700, Stephen Boyd wrote:
> Resending due to rmk's vacation.
>
> On 05/24/13 15:05, Stephen Boyd wrote:
> > I've noticed another problem now that our caches are used. On MSM
> > we have TEXT_OFFSET set to at least 0x208000 if we've built-in
> > support for MSM8x60/8960. If I boot a kernel with the MSM code
> > built-in that requires the higher text offset, but I load my
> > compressed kernel below that address (such as 0x0) the
> > decompression fails.
> >
> > This happens because the page tables are written into the
> > compressed data region before we relocate ourself to a higher
> > location.

We've always required kernel images to be loaded above RAM+32K for
exactly this issue.

> > This is bad because we just wrote our page tables into the
> > compressed data. Nobody notices though and we finish relocating
> > ourselves and then we call decompress_kernel() which fails
> > randomly. (BTW, why does error() sit in a while loop forever?

It loops forever because there is _nothing_ else to be done. It's
already printed a message explaining why stuff has failed:

void error(char *x)
{
arch_error(x);

putstr("\n\n");
putstr(x);
putstr("\n\n -- System halted");

while(1); /* Halt */
}

and the while loop is to prevent us trying to do something stupid
after failure. Basically, error() never returns.

I've no idea why you say the following:

> > We
> > can't get any information about why the decompression failed if
> > we have debug_ll enabled. I had to patch the error() routine to
> > not while loop forever to get that print after do_decompress to
> > be useful.)

Maybe your implementation of puts() for the decompressor is faulty then?
Because it works for me - when something goes wrong with the decompression,
I get a message such as:

Decompressing kernel...

CRC error

-- System halted

> > I see a few solutions.
> >
> > 1) Relocate with caches off and then turn on caches after we're
> > running in a location where we won't overwrite ourselves.
> >
> > 2) Have temporary page tables for the relocation phase that live
> > just below the location we're going to relocate to.
> >
> > 3) Force bootloaders loading these types of images to load the
> > zImage at least as high as the TEXT_OFFSET is compiled to.
> >
> > I don't think we can convince everyone that #3 is ok to do. I'm
> > leaning towards #2 since we get all the benefits of the cache
> > during the relocation phase but #1 is the obviously simple fix.

(3) is what we've always required in the past. We already have code
to relocate the compressed image, so we _might_ be able to do (1).

The easy solution is to continue saying "minimum of RAM start + 32K"
as we've always had in the past though.

2013-06-03 22:37:46

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

On 06/03/13 15:23, Russell King - ARM Linux wrote:
> On Mon, Jun 03, 2013 at 02:13:39PM -0700, Stephen Boyd wrote:
>> We
>> can't get any information about why the decompression failed if
>> we have debug_ll enabled. I had to patch the error() routine to
>> not while loop forever to get that print after do_decompress to
>> be useful.)
> Maybe your implementation of puts() for the decompressor is faulty then?
> Because it works for me - when something goes wrong with the decompression,
> I get a message such as:
>
> Decompressing kernel...
>
> CRC error
>
> -- System halted
>

I was expecting to see

Decompressing kernel...

CRC error

decompressor returned an error

but since we loop forever this code in arch/arm/boot/compressed/misc.c
doesn't do anything:

if (ret)
error("decompressor returned an error");

I guess that is desired though because you say we shouldn't do something
stupid.

>>> I see a few solutions.
>>>
>>> 1) Relocate with caches off and then turn on caches after we're
>>> running in a location where we won't overwrite ourselves.
>>>
>>> 2) Have temporary page tables for the relocation phase that live
>>> just below the location we're going to relocate to.
>>>
>>> 3) Force bootloaders loading these types of images to load the
>>> zImage at least as high as the TEXT_OFFSET is compiled to.
>>>
>>> I don't think we can convince everyone that #3 is ok to do. I'm
>>> leaning towards #2 since we get all the benefits of the cache
>>> during the relocation phase but #1 is the obviously simple fix.
> (3) is what we've always required in the past. We already have code
> to relocate the compressed image, so we _might_ be able to do (1).
>
> The easy solution is to continue saying "minimum of RAM start + 32K"
> as we've always had in the past though.

In my case I'm booting a kernel with textoffset = 0x208000 but RAM
starts at 0x0. Does "minimum of RAM start" mean 0x0 or 0x200000?

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

2013-06-03 22:38:46

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

On Mon, Jun 03, 2013 at 05:33:45PM -0400, Nicolas Pitre wrote:
> > On 05/24/13 15:05, Stephen Boyd wrote:
> > > I see a few solutions.
> > >
> > > 1) Relocate with caches off and then turn on caches after we're
> > > running in a location where we won't overwrite ourselves.
>
> Due to the cost of doing memory copy with the cache off, thisoption
> should be conditionally used and only when there is an actual conflict.
>
> > > 2) Have temporary page tables for the relocation phase that live
> > > just below the location we're going to relocate to.
> > >
> > > 3) Force bootloaders loading these types of images to load the
> > > zImage at least as high as the TEXT_OFFSET is compiled to.
> > >
> > > I don't think we can convince everyone that #3 is ok to do. I'm
> > > leaning towards #2 since we get all the benefits of the cache
> > > during the relocation phase but #1 is the obviously simple fix.
>
> I'd consider #2 too.

The problem with #2 is the added complexity it brings. The _whole_
point of loading the kernel at RAM+32K is so that we know that the
32K below the image is available for our use cheaply without playing
all sorts of stupid games with turning caches on and off multiple
times, or changing page tables and such like.

The initialization is already complicated enough, it doesn't need to
become any more complicated.

An even simpler solution to this would be to pad the decompressor
with a branch, and 32K-4 of zeros. That removes the whole problem
without adding much more code, but at the expense of 32K of bloat.
32K is nothing compared to the >1.5MB zImage size we have today.

2013-06-03 22:46:11

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

On Mon, Jun 03, 2013 at 03:37:39PM -0700, Stephen Boyd wrote:
> On 06/03/13 15:23, Russell King - ARM Linux wrote:
> > On Mon, Jun 03, 2013 at 02:13:39PM -0700, Stephen Boyd wrote:
> >> We
> >> can't get any information about why the decompression failed if
> >> we have debug_ll enabled. I had to patch the error() routine to
> >> not while loop forever to get that print after do_decompress to
> >> be useful.)
> > Maybe your implementation of puts() for the decompressor is faulty then?
> > Because it works for me - when something goes wrong with the decompression,
> > I get a message such as:
> >
> > Decompressing kernel...
> >
> > CRC error
> >
> > -- System halted
> >
>
> I was expecting to see
>
> Decompressing kernel...
>
> CRC error
>
> decompressor returned an error
>
>
> but since we loop forever this code in arch/arm/boot/compressed/misc.c
> doesn't do anything:
>
> if (ret)
> error("decompressor returned an error");

I think that's just a final catch-all in case one of the decompressors
doesn't use the error() function pointer it was passed.

> In my case I'm booting a kernel with textoffset = 0x208000 but RAM
> starts at 0x0. Does "minimum of RAM start" mean 0x0 or 0x200000?

The basic requirement for zImage's is no less than the start of RAM
plus 32K. Or let me put it another way - start of writable memory
plus 32K.

Whether you need an offset of 0x200000 or not is not for the
decompressor to know. If you're having to avoid the first 0x200000
bytes of memory for some reason (eg, secure firmware or DSP needs
it left free) then there's no way for the decompressor to know that,
so it's irrelevant.

So, lets say that your platform has a DSP which needs the first 0x200000
bytes left free. So the boot loader _already_ needs to know to load
the image not at zero, but above 0x200000. The additional 32K
requirement is really nothing new and so should be treated in just the
same way.

Leave at least 32K of usable memory below the zImage at all times.

2013-06-03 23:00:02

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

On 06/03/13 15:45, Russell King - ARM Linux wrote:
> On Mon, Jun 03, 2013 at 03:37:39PM -0700, Stephen Boyd wrote:
>> In my case I'm booting a kernel with textoffset = 0x208000 but RAM
>> starts at 0x0. Does "minimum of RAM start" mean 0x0 or 0x200000?
> The basic requirement for zImage's is no less than the start of RAM
> plus 32K. Or let me put it another way - start of writable memory
> plus 32K.
>
> Whether you need an offset of 0x200000 or not is not for the
> decompressor to know. If you're having to avoid the first 0x200000
> bytes of memory for some reason (eg, secure firmware or DSP needs
> it left free) then there's no way for the decompressor to know that,
> so it's irrelevant.
>
> So, lets say that your platform has a DSP which needs the first 0x200000
> bytes left free. So the boot loader _already_ needs to know to load
> the image not at zero, but above 0x200000. The additional 32K
> requirement is really nothing new and so should be treated in just the
> same way.
>
> Leave at least 32K of usable memory below the zImage at all times.

Understood. On my device writeable RAM actually starts at 0x0 but I have
compiled in support for devices which don't have writeable memory at
0x0, instead they have writeable memory starting at 0x200000. Because I
have a kernel supporting more than one device with differing memory
layouts I run into this problem. The same problem will occur to any
devices in the multi-platform kernel when a device with unwriteable
memory near the bottom (such as MSM8960) joins the multi-platform defconfig.

Let me try to word it in your example. I have compiled in support for a
platform that has a DSP which needs the first 0x200000 bytes left free.
I have also compiled in support for a platform that doesn't have this
requirement. I plan to run the zImage on the second platform (the one
without the DSP requirement). The bootloader I'm running this zImage on
has no idea that I've compiled in support for the other platform with
the DSP requirement so it assumes it can load the zImage at the start of
RAM (0x0) plus 32K. This is bad because then the page tables get written
into my compressed data and it fails to decompress.

I believe you think the bootloader should know about this, but I can't
tell how it could know.

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

2013-06-04 05:28:06

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

On Mon, 3 Jun 2013, Stephen Boyd wrote:

> On 06/03/13 15:45, Russell King - ARM Linux wrote:
> > On Mon, Jun 03, 2013 at 03:37:39PM -0700, Stephen Boyd wrote:
> >> In my case I'm booting a kernel with textoffset = 0x208000 but RAM
> >> starts at 0x0. Does "minimum of RAM start" mean 0x0 or 0x200000?
> > The basic requirement for zImage's is no less than the start of RAM
> > plus 32K. Or let me put it another way - start of writable memory
> > plus 32K.
> >
> > Whether you need an offset of 0x200000 or not is not for the
> > decompressor to know. If you're having to avoid the first 0x200000
> > bytes of memory for some reason (eg, secure firmware or DSP needs
> > it left free) then there's no way for the decompressor to know that,
> > so it's irrelevant.
> >
> > So, lets say that your platform has a DSP which needs the first 0x200000
> > bytes left free. So the boot loader _already_ needs to know to load
> > the image not at zero, but above 0x200000. The additional 32K
> > requirement is really nothing new and so should be treated in just the
> > same way.
> >
> > Leave at least 32K of usable memory below the zImage at all times.
>
> Understood. On my device writeable RAM actually starts at 0x0 but I have
> compiled in support for devices which don't have writeable memory at
> 0x0, instead they have writeable memory starting at 0x200000. Because I
> have a kernel supporting more than one device with differing memory
> layouts I run into this problem. The same problem will occur to any
> devices in the multi-platform kernel when a device with unwriteable
> memory near the bottom (such as MSM8960) joins the multi-platform defconfig.
>
> Let me try to word it in your example. I have compiled in support for a
> platform that has a DSP which needs the first 0x200000 bytes left free.
> I have also compiled in support for a platform that doesn't have this
> requirement. I plan to run the zImage on the second platform (the one
> without the DSP requirement). The bootloader I'm running this zImage on
> has no idea that I've compiled in support for the other platform with
> the DSP requirement so it assumes it can load the zImage at the start of
> RAM (0x0) plus 32K. This is bad because then the page tables get written
> into my compressed data and it fails to decompress.

I've looked at the code and I think that #1 in your initial options is
probably best here. I agree with Russell about #2 being way too complex
for only this case.

So, right before calling into cache_on, you could test if r4 - 16K >= pc
and r4 < pc + (_end - .) then skip cache_on.

Something like this untested patch:

diff --git a/arch/arm/boot/compressed/head.S b/arch/arm/boot/compressed/head.S
index 9a94f344df..9e0dbbccdd 100644
--- a/arch/arm/boot/compressed/head.S
+++ b/arch/arm/boot/compressed/head.S
@@ -182,7 +182,16 @@ not_angel:
ldr r4, =zreladdr
#endif

- bl cache_on
+ /* Set up a page table only if we don't overwrite ourself */
+ ldr r0, 1f
+ add r0, r0, pc
+ cmp r4, r0
+ mov r0, pc
+ cmpcc r0, r4
+ blcs cache_on
+ b restart
+ .align 2
+1: .word _end - . + 0x4000

restart: adr r0, LC0
ldmia r0, {r1, r2, r3, r6, r10, r11, r12}

2013-06-04 19:47:13

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

On 06/04, Nicolas Pitre wrote:
> On Mon, 3 Jun 2013, Stephen Boyd wrote:
>
> > On 06/03/13 15:45, Russell King - ARM Linux wrote:
> > > On Mon, Jun 03, 2013 at 03:37:39PM -0700, Stephen Boyd wrote:
> > >> In my case I'm booting a kernel with textoffset = 0x208000 but RAM
> > >> starts at 0x0. Does "minimum of RAM start" mean 0x0 or 0x200000?
> > > The basic requirement for zImage's is no less than the start of RAM
> > > plus 32K. Or let me put it another way - start of writable memory
> > > plus 32K.
> > >
> > > Whether you need an offset of 0x200000 or not is not for the
> > > decompressor to know. If you're having to avoid the first 0x200000
> > > bytes of memory for some reason (eg, secure firmware or DSP needs
> > > it left free) then there's no way for the decompressor to know that,
> > > so it's irrelevant.
> > >
> > > So, lets say that your platform has a DSP which needs the first 0x200000
> > > bytes left free. So the boot loader _already_ needs to know to load
> > > the image not at zero, but above 0x200000. The additional 32K
> > > requirement is really nothing new and so should be treated in just the
> > > same way.
> > >
> > > Leave at least 32K of usable memory below the zImage at all times.
> >
> > Understood. On my device writeable RAM actually starts at 0x0 but I have
> > compiled in support for devices which don't have writeable memory at
> > 0x0, instead they have writeable memory starting at 0x200000. Because I
> > have a kernel supporting more than one device with differing memory
> > layouts I run into this problem. The same problem will occur to any
> > devices in the multi-platform kernel when a device with unwriteable
> > memory near the bottom (such as MSM8960) joins the multi-platform defconfig.
> >
> > Let me try to word it in your example. I have compiled in support for a
> > platform that has a DSP which needs the first 0x200000 bytes left free.
> > I have also compiled in support for a platform that doesn't have this
> > requirement. I plan to run the zImage on the second platform (the one
> > without the DSP requirement). The bootloader I'm running this zImage on
> > has no idea that I've compiled in support for the other platform with
> > the DSP requirement so it assumes it can load the zImage at the start of
> > RAM (0x0) plus 32K. This is bad because then the page tables get written
> > into my compressed data and it fails to decompress.
>
> I've looked at the code and I think that #1 in your initial options is
> probably best here. I agree with Russell about #2 being way too complex
> for only this case.
>
> So, right before calling into cache_on, you could test if r4 - 16K >= pc
> and r4 < pc + (_end - .) then skip cache_on.
>
> Something like this untested patch:

So this would cause the decompression to run without the cache on
if we have to relocate the decompression code to avoid
overwriting ourselves? It seems that the memcpy is fairly quick
on my hardware in comparison to the decompression so moving the
cache_on() call to right before we run decompression keeps things
pretty fast. It's very possible different hardware will have
different results. This is what I meant by option #1. I suppose
we can make it smarter and conditionalize it on if we relocated
or not?

----8<---
diff --git a/arch/arm/boot/compressed/head.S b/arch/arm/boot/compressed/head.S
index fe4d9c3..fcf3ff3 100644
--- a/arch/arm/boot/compressed/head.S
+++ b/arch/arm/boot/compressed/head.S
@@ -182,8 +182,6 @@ not_angel:
ldr r4, =zreladdr
#endif

- bl cache_on
-
restart: adr r0, LC0
ldmia r0, {r1, r2, r3, r6, r10, r11, r12}
ldr sp, [r0, #28]
@@ -464,6 +462,7 @@ not_relocated: mov r0, #0
cmp r2, r3
blo 1b

+ bl cache_on
/*
* The C runtime environment should now be setup sufficiently.
* Set up some pointers, and start decompressing.

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

2013-06-04 21:19:04

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

On Tue, 4 Jun 2013, Stephen Boyd wrote:

> On 06/04, Nicolas Pitre wrote:
> > On Mon, 3 Jun 2013, Stephen Boyd wrote:
> >
> > > On 06/03/13 15:45, Russell King - ARM Linux wrote:
> > > > On Mon, Jun 03, 2013 at 03:37:39PM -0700, Stephen Boyd wrote:
> > > >> In my case I'm booting a kernel with textoffset = 0x208000 but RAM
> > > >> starts at 0x0. Does "minimum of RAM start" mean 0x0 or 0x200000?
> > > > The basic requirement for zImage's is no less than the start of RAM
> > > > plus 32K. Or let me put it another way - start of writable memory
> > > > plus 32K.
> > > >
> > > > Whether you need an offset of 0x200000 or not is not for the
> > > > decompressor to know. If you're having to avoid the first 0x200000
> > > > bytes of memory for some reason (eg, secure firmware or DSP needs
> > > > it left free) then there's no way for the decompressor to know that,
> > > > so it's irrelevant.
> > > >
> > > > So, lets say that your platform has a DSP which needs the first 0x200000
> > > > bytes left free. So the boot loader _already_ needs to know to load
> > > > the image not at zero, but above 0x200000. The additional 32K
> > > > requirement is really nothing new and so should be treated in just the
> > > > same way.
> > > >
> > > > Leave at least 32K of usable memory below the zImage at all times.
> > >
> > > Understood. On my device writeable RAM actually starts at 0x0 but I have
> > > compiled in support for devices which don't have writeable memory at
> > > 0x0, instead they have writeable memory starting at 0x200000. Because I
> > > have a kernel supporting more than one device with differing memory
> > > layouts I run into this problem. The same problem will occur to any
> > > devices in the multi-platform kernel when a device with unwriteable
> > > memory near the bottom (such as MSM8960) joins the multi-platform defconfig.
> > >
> > > Let me try to word it in your example. I have compiled in support for a
> > > platform that has a DSP which needs the first 0x200000 bytes left free.
> > > I have also compiled in support for a platform that doesn't have this
> > > requirement. I plan to run the zImage on the second platform (the one
> > > without the DSP requirement). The bootloader I'm running this zImage on
> > > has no idea that I've compiled in support for the other platform with
> > > the DSP requirement so it assumes it can load the zImage at the start of
> > > RAM (0x0) plus 32K. This is bad because then the page tables get written
> > > into my compressed data and it fails to decompress.
> >
> > I've looked at the code and I think that #1 in your initial options is
> > probably best here. I agree with Russell about #2 being way too complex
> > for only this case.
> >
> > So, right before calling into cache_on, you could test if r4 - 16K >= pc
> > and r4 < pc + (_end - .) then skip cache_on.
> >
> > Something like this untested patch:
>
> So this would cause the decompression to run without the cache on
> if we have to relocate the decompression code to avoid
> overwriting ourselves?

No. What my example patch does is to simply skip setting up a page
table and turning on the cache if the page table ends up in the
code/data to be relocated.

> It seems that the memcpy is fairly quick on my hardware in comparison
> to the decompression so moving the cache_on() call to right before we
> run decompression keeps things pretty fast. It's very possible
> different hardware will have different results.

"Fairly quick" is still not optimal.

> This is what I meant by option #1. I suppose
> we can make it smarter and conditionalize it on if we relocated
> or not?

Here's what I,m suggesting:

>From f1ec55e8189c06b89d2f196929f01fe3fc40f908 Mon Sep 17 00:00:00 2001
From: Nicolas Pitre <[email protected]>
Date: Tue, 4 Jun 2013 17:01:30 -0400
Subject: [PATCH] ARM: zImage: don't overwrite ourself with a page table

When zImage is loaded into RAM at a low address but TEXT_OFFSET
is set higher, we risk overwriting ourself with the page table
needed to turn on the cache as it is located relative to the relocation
address. Let's defer the cache setup after relocation in that case.

Signed-off-by: Nicolas Pitre <[email protected]>

diff --git a/arch/arm/boot/compressed/head.S b/arch/arm/boot/compressed/head.S
index 9a94f344df..773bc35f92 100644
--- a/arch/arm/boot/compressed/head.S
+++ b/arch/arm/boot/compressed/head.S
@@ -178,11 +178,23 @@ not_angel:
mov r4, pc
and r4, r4, #0xf8000000
add r4, r4, #TEXT_OFFSET
+ bl cache_on
#else
ldr r4, =zreladdr
-#endif

- bl cache_on
+ /*
+ * Set up a page table only if we don't overwrite ourself.
+ * That means r4 < pc && r4 - 4K > &_end.
+ * Given that r4 > &_en is most unfrequent, we add a rough
+ * additional 1MB of room for a possible appended DTB.
+ */
+ mov r0, pc
+ cmp r0, r4
+ ldrcc r0, LC0+32
+ addcc r0, r0, pc
+ cmpcc r4, r0
+ blcs cache_on
+#endif

restart: adr r0, LC0
ldmia r0, {r1, r2, r3, r6, r10, r11, r12}
@@ -464,6 +476,16 @@ not_relocated: mov r0, #0
cmp r2, r3
blo 1b

+#if defined(CONFIG_AUTO_ZRELADDR) && defined(CONFIG_CPU_CP15)
+ /*
+ * Did we skip the cache setup earlier?
+ * Do it now if so.
+ */
+ mrc p15, 0, r0, c1, c0, 0 @ read control register
+ tst r0, #1 @ MMU bit set?
+ bleq cache_on @ no: set it up
+#endif
+
/*
* The C runtime environment should now be setup sufficiently.
* Set up some pointers, and start decompressing.
@@ -512,6 +534,7 @@ LC0: .word LC0 @ r1
.word _got_start @ r11
.word _got_end @ ip
.word .L_user_stack_end @ sp
+ .word _end - restart + 16384 + 1024*1024
.size LC0, . - LC0

#ifdef CONFIG_ARCH_RPC

2013-06-04 21:45:06

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

On 06/04, Nicolas Pitre wrote:
> On Tue, 4 Jun 2013, Stephen Boyd wrote:
>
> > On 06/04, Nicolas Pitre wrote:
> > >
> > > I've looked at the code and I think that #1 in your initial options is
> > > probably best here. I agree with Russell about #2 being way too complex
> > > for only this case.
> > >
> > > So, right before calling into cache_on, you could test if r4 - 16K >= pc
> > > and r4 < pc + (_end - .) then skip cache_on.
> > >
> > > Something like this untested patch:
> >
> > So this would cause the decompression to run without the cache on
> > if we have to relocate the decompression code to avoid
> > overwriting ourselves?
>
> No. What my example patch does is to simply skip setting up a page
> table and turning on the cache if the page table ends up in the
> code/data to be relocated.
>
> > It seems that the memcpy is fairly quick on my hardware in comparison
> > to the decompression so moving the cache_on() call to right before we
> > run decompression keeps things pretty fast. It's very possible
> > different hardware will have different results.
>
> "Fairly quick" is still not optimal.
>
> > This is what I meant by option #1. I suppose
> > we can make it smarter and conditionalize it on if we relocated
> > or not?
>
> Here's what I,m suggesting:

Yes this looks closer to what I'm saying.

>
> From f1ec55e8189c06b89d2f196929f01fe3fc40f908 Mon Sep 17 00:00:00 2001
> From: Nicolas Pitre <[email protected]>
> Date: Tue, 4 Jun 2013 17:01:30 -0400
> Subject: [PATCH] ARM: zImage: don't overwrite ourself with a page table
>
> When zImage is loaded into RAM at a low address but TEXT_OFFSET
> is set higher, we risk overwriting ourself with the page table
> needed to turn on the cache as it is located relative to the relocation
> address. Let's defer the cache setup after relocation in that case.
>
> Signed-off-by: Nicolas Pitre <[email protected]>
>
> diff --git a/arch/arm/boot/compressed/head.S b/arch/arm/boot/compressed/head.S
> index 9a94f344df..773bc35f92 100644
> --- a/arch/arm/boot/compressed/head.S
> +++ b/arch/arm/boot/compressed/head.S
> @@ -178,11 +178,23 @@ not_angel:
> mov r4, pc
> and r4, r4, #0xf8000000
> add r4, r4, #TEXT_OFFSET
> + bl cache_on
> #else
> ldr r4, =zreladdr
> -#endif
>
> - bl cache_on
> + /*
> + * Set up a page table only if we don't overwrite ourself.
> + * That means r4 < pc && r4 - 4K > &_end.
> + * Given that r4 > &_en is most unfrequent, we add a rough
> + * additional 1MB of room for a possible appended DTB.
> + */
> + mov r0, pc
> + cmp r0, r4
> + ldrcc r0, LC0+32
> + addcc r0, r0, pc
> + cmpcc r4, r0
> + blcs cache_on
> +#endif

But this looks backwards? Shouldn't we put it in the
CONFIG_AUTO_ZRELADDR case?

>
> restart: adr r0, LC0
> ldmia r0, {r1, r2, r3, r6, r10, r11, r12}
> @@ -464,6 +476,16 @@ not_relocated: mov r0, #0
> cmp r2, r3
> blo 1b
>
> +#if defined(CONFIG_AUTO_ZRELADDR) && defined(CONFIG_CPU_CP15)
> + /*
> + * Did we skip the cache setup earlier?
> + * Do it now if so.
> + */
> + mrc p15, 0, r0, c1, c0, 0 @ read control register
> + tst r0, #1 @ MMU bit set?
> + bleq cache_on @ no: set it up
> +#endif

Too bad we can't store one more variable into LC0 that says we
turned the caches on. Then we could read that here and detect it
without CP15 accessors required.

> +
> /*
> * The C runtime environment should now be setup sufficiently.
> * Set up some pointers, and start decompressing.
> @@ -512,6 +534,7 @@ LC0: .word LC0 @ r1
> .word _got_start @ r11
> .word _got_end @ ip
> .word .L_user_stack_end @ sp
> + .word _end - restart + 16384 + 1024*1024
> .size LC0, . - LC0
>
> #ifdef CONFIG_ARCH_RPC

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

2013-06-05 02:23:09

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

On Tue, 4 Jun 2013, Stephen Boyd wrote:

> On 06/04, Nicolas Pitre wrote:
> > diff --git a/arch/arm/boot/compressed/head.S b/arch/arm/boot/compressed/head.S
> > index 9a94f344df..773bc35f92 100644
> > --- a/arch/arm/boot/compressed/head.S
> > +++ b/arch/arm/boot/compressed/head.S
> > @@ -178,11 +178,23 @@ not_angel:
> > mov r4, pc
> > and r4, r4, #0xf8000000
> > add r4, r4, #TEXT_OFFSET
> > + bl cache_on
> > #else
> > ldr r4, =zreladdr
> > -#endif
> >
> > - bl cache_on
> > + /*
> > + * Set up a page table only if we don't overwrite ourself.
> > + * That means r4 < pc && r4 - 4K > &_end.
> > + * Given that r4 > &_en is most unfrequent, we add a rough
> > + * additional 1MB of room for a possible appended DTB.
> > + */
> > + mov r0, pc
> > + cmp r0, r4
> > + ldrcc r0, LC0+32
> > + addcc r0, r0, pc
> > + cmpcc r4, r0
> > + blcs cache_on
> > +#endif
>
> But this looks backwards? Shouldn't we put it in the
> CONFIG_AUTO_ZRELADDR case?

Obviously. I was looking for zreladdr only. In fact this should be done
in both cases.

> > restart: adr r0, LC0
> > ldmia r0, {r1, r2, r3, r6, r10, r11, r12}
> > @@ -464,6 +476,16 @@ not_relocated: mov r0, #0
> > cmp r2, r3
> > blo 1b
> >
> > +#if defined(CONFIG_AUTO_ZRELADDR) && defined(CONFIG_CPU_CP15)
> > + /*
> > + * Did we skip the cache setup earlier?
> > + * Do it now if so.
> > + */
> > + mrc p15, 0, r0, c1, c0, 0 @ read control register
> > + tst r0, #1 @ MMU bit set?
> > + bleq cache_on @ no: set it up
> > +#endif
>
> Too bad we can't store one more variable into LC0 that says we
> turned the caches on. Then we could read that here and detect it
> without CP15 accessors required.

The LC0 area should be considered read-only as it may be located in
flash.

Here's what I came with instead:

From: Nicolas Pitre <[email protected]>
Date: Tue, 4 Jun 2013 17:01:30 -0400
Subject: [PATCH] ARM: zImage: don't overwrite ourself with a page table

When zImage is loaded into RAM at a low address but TEXT_OFFSET
is set higher, we risk overwriting ourself with the page table
needed to turn on the cache as it is located relative to the relocation
address. Let's defer the cache setup after relocation in that case.

Signed-off-by: Nicolas Pitre <[email protected]>

diff --git a/arch/arm/boot/compressed/head.S b/arch/arm/boot/compressed/head.S
index 9a94f344df..aa909393f2 100644
--- a/arch/arm/boot/compressed/head.S
+++ b/arch/arm/boot/compressed/head.S
@@ -182,7 +182,19 @@ not_angel:
ldr r4, =zreladdr
#endif

- bl cache_on
+ /*
+ * Set up a page table only if it won't overwrite ourself.
+ * That means r4 < pc && r4 - 16k page directory > &_end.
+ * Given that r4 > &_en is most unfrequent, we add a rough
+ * additional 1MB of room for a possible appended DTB.
+ */
+ mov r0, pc
+ cmp r0, r4
+ ldrcc r0, LC0+32
+ addcc r0, r0, pc
+ cmpcc r4, r0
+ orrcc r4, r4, #1 @ remember we skipped cache_on
+ blcs cache_on

restart: adr r0, LC0
ldmia r0, {r1, r2, r3, r6, r10, r11, r12}
@@ -228,7 +240,7 @@ restart: adr r0, LC0
* r0 = delta
* r2 = BSS start
* r3 = BSS end
- * r4 = final kernel address
+ * r4 = final kernel address (possibly with LSB set)
* r5 = appended dtb size (still unknown)
* r6 = _edata
* r7 = architecture ID
@@ -276,6 +288,7 @@ restart: adr r0, LC0
*/
cmp r0, #1
sub r0, r4, #TEXT_OFFSET
+ bic r0, r0, #1
add r0, r0, #0x100
mov r1, r6
sub r2, sp, r6
@@ -322,12 +335,13 @@ dtb_check_done:

/*
* Check to see if we will overwrite ourselves.
- * r4 = final kernel address
+ * r4 = final kernel address (possibly with LSB set)
* r9 = size of decompressed image
* r10 = end of this image, including bss/stack/malloc space if non XIP
* We basically want:
* r4 - 16k page directory >= r10 -> OK
* r4 + image length <= address of wont_overwrite -> OK
+ * Note: the possible LSB in r4 is harmless here.
*/
add r10, r10, #16384
cmp r4, r10
@@ -389,7 +403,8 @@ dtb_check_done:
add sp, sp, r6
#endif

- bl cache_clean_flush
+ tst r4, #1
+ bleq cache_clean_flush

adr r0, BSYM(restart)
add r0, r0, r6
@@ -401,7 +416,7 @@ wont_overwrite:
* r0 = delta
* r2 = BSS start
* r3 = BSS end
- * r4 = kernel execution address
+ * r4 = kernel execution address (possibly with LSB set)
* r5 = appended dtb size (0 if not present)
* r7 = architecture ID
* r8 = atags pointer
@@ -464,6 +479,15 @@ not_relocated: mov r0, #0
cmp r2, r3
blo 1b

+ /*
+ * Did we skip the cache setup earlier?
+ * That is indicated by the LSB in r4.
+ * Do it now if so.
+ */
+ tst r4, #1
+ bic r4, r4, #1
+ blne cache_on
+
/*
* The C runtime environment should now be setup sufficiently.
* Set up some pointers, and start decompressing.
@@ -512,6 +536,7 @@ LC0: .word LC0 @ r1
.word _got_start @ r11
.word _got_end @ ip
.word .L_user_stack_end @ sp
+ .word _end - restart + 16384 + 1024*1024
.size LC0, . - LC0

#ifdef CONFIG_ARCH_RPC

2013-06-06 01:29:24

[permalink] [raw]

Subject: Re: [PATCH] ARM: avoid mis-detecting some V7 cores in the decompressor

On 06/04, Nicolas Pitre wrote:
>
> The LC0 area should be considered read-only as it may be located in
> flash.
>
> Here's what I came with instead:
>
> From: Nicolas Pitre <[email protected]>
> Date: Tue, 4 Jun 2013 17:01:30 -0400
> Subject: [PATCH] ARM: zImage: don't overwrite ourself with a page table
>
> When zImage is loaded into RAM at a low address but TEXT_OFFSET
> is set higher, we risk overwriting ourself with the page table
> needed to turn on the cache as it is located relative to the relocation
> address. Let's defer the cache setup after relocation in that case.
>
> Signed-off-by: Nicolas Pitre <[email protected]>

Reported-by: Stephen Boyd <[email protected]>
Tested-by: Stephen Boyd <[email protected]>

This one passes testing on my two platforms with and without the
2Mb reservation at the beginning of ram. Seems like a good enough
compromise for me.

>
> diff --git a/arch/arm/boot/compressed/head.S b/arch/arm/boot/compressed/head.S
> index 9a94f344df..aa909393f2 100644
> --- a/arch/arm/boot/compressed/head.S
> +++ b/arch/arm/boot/compressed/head.S
> @@ -182,7 +182,19 @@ not_angel:
> ldr r4, =zreladdr
> #endif
>
> - bl cache_on
> + /*
> + * Set up a page table only if it won't overwrite ourself.
> + * That means r4 < pc && r4 - 16k page directory > &_end.
> + * Given that r4 > &_en is most unfrequent, we add a rough

/s/_en/_end/

> + * additional 1MB of room for a possible appended DTB.
> + */
> + mov r0, pc
> + cmp r0, r4
> + ldrcc r0, LC0+32
> + addcc r0, r0, pc
> + cmpcc r4, r0
> + orrcc r4, r4, #1 @ remember we skipped cache_on
> + blcs cache_on
>

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

2013-06-06 04:21:48