LinuxLists.cc - Re: [PATCH 01/22] ARM: add mechanism for late code patching

2012-08-06 11:12:42

by Russell King - ARM Linux

Subject: Re: [PATCH 01/22] ARM: add mechanism for late code patching

On Tue, Jul 31, 2012 at 07:04:37PM -0400, Cyril Chemparathy wrote:
> +static void __init init_patch_kernel(void)
> +{
> +	const void *start = &__patch_table_begin;
> +	const void *end = &__patch_table_end;
> +
> +	BUG_ON(patch_kernel(start, end - start));
> +	flush_icache_range(init_mm.start_code, init_mm.end_code);

Err. You are asking the kernel to flush every single cache line
manually throughout the kernel code. That's a flush every 32 bytes
over maybe a few megabytes of address space.

This is one of the reasons we do the patching in assembly code before
the caches are enabled - so we don't have to worry about the interaction
with the CPU caches, which for this kind of application would be very
expensive.

2012-08-06 13:19:27

by Cyril Chemparathy

Subject: Re: [PATCH 01/22] ARM: add mechanism for late code patching

On 8/6/2012 7:12 AM, Russell King - ARM Linux wrote:
> On Tue, Jul 31, 2012 at 07:04:37PM -0400, Cyril Chemparathy wrote:
>> +static void __init init_patch_kernel(void)
>> +{
>> +	const void *start = &__patch_table_begin;
>> +	const void *end = &__patch_table_end;
>> +
>> +	BUG_ON(patch_kernel(start, end - start));
>> +	flush_icache_range(init_mm.start_code, init_mm.end_code);
>
> Err. You are asking the kernel to flush every single cache line
> manually throughout the kernel code. That's a flush every 32 bytes
> over maybe a few megabytes of address space.
>

With a flush_cache_all(), we could avoid having to operate a cacheline
at a time, but that clobbers way more than necessary.

Maybe the better answer is to flush only the patched cachelines.

> This is one of the reasons we do the patching in assembly code before
> the caches are enabled - so we don't have to worry about the interaction
> with the CPU caches, which for this kind of application would be very
> expensive.
>

Sure, flushing caches is expensive. But then, so is running the
patching code with caches disabled. I guess memory access latencies
drive the performance trade off here.

--
Thanks
- Cyril
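
The idea of flushing only the patched cache lines can be illustrated with a small user-space sketch. It is not the code from the patch series: `flush_line_stub()`, `patch_and_flush()` and the offset-based site table are hypothetical names made up for this illustration, the stub only counts calls instead of doing real cache maintenance, and the 32-byte line size is taken from the figure quoted in the thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE_BYTES 32u	/* assumed line size, as quoted in the thread */

static unsigned long flush_count;	/* lines "flushed" by the stub below */

/* Hypothetical stand-in for a per-line cache maintenance operation. */
static void flush_line_stub(const void *addr)
{
	(void)addr;
	flush_count++;
}

/*
 * Write `insn` at each byte offset listed in `sites`, then flush only the
 * cache line(s) that the write touched, instead of the whole text range.
 */
static void patch_and_flush(uint8_t *text, const size_t *sites,
			    size_t nsites, uint32_t insn)
{
	for (size_t i = 0; i < nsites; i++) {
		size_t first = sites[i] / CACHE_LINE_BYTES;
		size_t last = (sites[i] + sizeof(insn) - 1) / CACHE_LINE_BYTES;

		memcpy(text + sites[i], &insn, sizeof(insn));
		for (size_t line = first; line <= last; line++)
			flush_line_stub(text + line * CACHE_LINE_BYTES);
	}
}
```

With three word-aligned sites in a 256-byte buffer, the stub is called three times, once per patched line. In kernel code the stub's role would presumably be played by a ranged call such as `flush_icache_range()` over just the patched word, but that is an assumption about how the series could be reworked, not something stated in this thread.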

2012-08-06 13:26:55

by Russell King - ARM Linux

Subject: Re: [PATCH 01/22] ARM: add mechanism for late code patching

On Mon, Aug 06, 2012 at 09:19:10AM -0400, Cyril Chemparathy wrote:
> With a flush_cache_all(), we could avoid having to operate a cacheline
> at a time, but that clobbers way more than necessary.

You can't do that, because flush_cache_all() on some CPUs requires the
proper MMU mappings to be in place, and you can't get those mappings
in place because you don't have the V:P offsets fixed up in the kernel.
Welcome to the chicken and egg problem.

> Sure, flushing caches is expensive. But then, so is running the
> patching code with caches disabled. I guess memory access latencies
> drive the performance trade off here.

There we disagree on a few orders of magnitude. There are relatively
few places that need updating. According to the kernel I have here:

   text    data     bss     dec     hex filename
7644346  454320  212984 8311650  7ed362 vmlinux

Idx Name            Size      VMA       LMA       File off  Algn
  1 .text           004cd170  c00081c0  c00081c0  000081c0  2**5
 16 .init.pv_table  00000300  c0753a24  c0753a24  00753a24  2**0

That's about 7MB of text, and only 192 points in that code which need
patching. Even if we did this with caches on, that's still 192 places,
and only 192 places we'd need to flush a cache line.

Alternatively, with your approach and 7MB of text, you need to flush
238885 cache lines to cover the entire kernel.

It would be far _cheaper_ with your approach to flush the individual
cache lines as you go.
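
The two figures in the message above can be reproduced from the quoted section sizes. The sketch below is only a worked version of that arithmetic. The 32-byte cache line comes from the earlier message in the thread; the 4-byte table entry is an assumption inferred from 0x300 bytes mapping to 192 sites, and the helper names are made up for this illustration.

```c
/* Values quoted in the message above. */
#define KERNEL_TEXT_BYTES	7644346UL	/* "text" column of the size output */
#define PV_TABLE_BYTES		0x300UL		/* size of .init.pv_table */

/* Assumptions, not stated outright in the message. */
#define CACHE_LINE_BYTES	32UL	/* from "a flush every 32 bytes" */
#define PV_ENTRY_BYTES		4UL	/* assumed: one 32-bit word per patch site */

/* Number of patch sites recorded in a table of the given size. */
static unsigned long patch_sites(unsigned long table_bytes,
				 unsigned long entry_bytes)
{
	return table_bytes / entry_bytes;
}

/* Number of whole cache lines in a range (rounds down, as in the thread). */
static unsigned long whole_cache_lines(unsigned long range_bytes,
				       unsigned long line_bytes)
{
	return range_bytes / line_bytes;
}
```

`patch_sites(PV_TABLE_BYTES, PV_ENTRY_BYTES)` gives 192 and `whole_cache_lines(KERNEL_TEXT_BYTES, CACHE_LINE_BYTES)` gives 238885, so flushing the whole text range touches roughly 1244 times as many lines as flushing one line per patch site.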

2012-08-06 13:38:32

by Cyril Chemparathy

Subject: Re: [PATCH 01/22] ARM: add mechanism for late code patching

On 8/6/2012 9:26 AM, Russell King - ARM Linux wrote:
> On Mon, Aug 06, 2012 at 09:19:10AM -0400, Cyril Chemparathy wrote:
>> With a flush_cache_all(), we could avoid having to operate a cacheline
>> at a time, but that clobbers way more than necessary.
>
> You can't do that, because flush_cache_all() on some CPUs requires the
> proper MMU mappings to be in place, and you can't get those mappings
> in place because you don't have the V:P offsets fixed up in the kernel.
> Welcome to the chicken and egg problem.
>
>> Sure, flushing caches is expensive. But then, so is running the
>> patching code with caches disabled. I guess memory access latencies
>> drive the performance trade off here.
>
> There we disagree on a few orders of magnitude. There are relatively
> few places that need updating. According to the kernel I have here:
>
>    text    data     bss     dec     hex filename
> 7644346  454320  212984 8311650  7ed362 vmlinux
>
> Idx Name            Size      VMA       LMA       File off  Algn
>   1 .text           004cd170  c00081c0  c00081c0  000081c0  2**5
>  16 .init.pv_table  00000300  c0753a24  c0753a24  00753a24  2**0
>
> That's about 7MB of text, and only 192 points in that code which need
> patching. Even if we did this with caches on, that's still 192 places,
> and only 192 places we'd need to flush a cache line.
>
> Alternatively, with your approach and 7MB of text, you need to flush
> 238885 cache lines to cover the entire kernel.
>
> It would be far _cheaper_ with your approach to flush the individual
> cache lines as you go.
>

Agreed. Thanks.

--
Thanks
- Cyril

2012-08-06 18:02:24

by Nicolas Pitre

Subject: Re: [PATCH 01/22] ARM: add mechanism for late code patching

On Mon, 6 Aug 2012, Russell King - ARM Linux wrote:

> On Mon, Aug 06, 2012 at 09:19:10AM -0400, Cyril Chemparathy wrote:
> > With a flush_cache_all(), we could avoid having to operate a cacheline
> > at a time, but that clobbers way more than necessary.
>
> You can't do that, because flush_cache_all() on some CPUs requires the
> proper MMU mappings to be in place, and you can't get those mappings
> in place because you don't have the V:P offsets fixed up in the kernel.
> Welcome to the chicken and egg problem.

This problem is fixed in this case by having the p2v and v2p code sites
use an out-of-line, non-optimized computation until those sites are
runtime patched with the inlined, optimized computation we have today.

Nicolas
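
As a rough user-space analogy for the scheme described above: address translation goes through a slow out-of-line path that reads the offset from memory until the sites are "patched", after which a faster variant takes over and returns the same results. Real code patching rewrites instructions in place; here a function pointer swap stands in for it. Every name below (`phys_offset`, `v2p_out_of_line()`, `patch_v2p_sites()` and so on) and the `PAGE_OFFSET` value are assumptions for illustration, not taken from the patch series.

```c
#define PAGE_OFFSET 0xc0000000UL	/* assumed kernel virtual base */

static unsigned long phys_offset;	/* hypothetical: discovered at boot */

/* Out-of-line fallback: computes the translation from variables each call. */
static unsigned long v2p_out_of_line(unsigned long virt)
{
	return virt - PAGE_OFFSET + phys_offset;
}

/* Stand-in for a patched site: the delta is folded in once, at patch time. */
static unsigned long patched_delta;

static unsigned long v2p_patched(unsigned long virt)
{
	return virt + patched_delta;
}

/* Translation works before patching because it starts on the fallback. */
static unsigned long (*v2p)(unsigned long) = v2p_out_of_line;

/* Analogue of runtime patching: switch all callers to the fast variant. */
static void patch_v2p_sites(void)
{
	/* Unsigned wraparound is well-defined and cancels out in v2p_patched(). */
	patched_delta = phys_offset - PAGE_OFFSET;
	v2p = v2p_patched;
}
```

The point of the analogy is ordering: because `v2p` is usable before `patch_v2p_sites()` runs, code that needs translations early (setting up mappings, for instance) does not have to wait for the patching step.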