2014-12-03 09:52:40

by Mason

Subject: Re: Creating 16 MB super-sections for MMIO

[Top-posting to address topicality]

Is LKML a more appropriate list for discussing MMU setup?
(I've been studying arch-specific code in arch/arm/mm)

Regards.

On 02/12/2014 11:42, Mason wrote:

> [Branched from "Code generation involving __raw_readl and __raw_writel" thread]
> http://thread.gmane.org/gmane.linux.ports.arm.kernel/376055
>
> On 27/11/2014 14:12, Arnd Bergmann wrote:
>
>> On Thursday 27 November 2014 14:01:41 Mason wrote:
>>
>>> I'm asking because I have an idea in mind: on the bus, the first
>>> 16 MB contains only memory-mapped registers, so I've been thinking
>>> I can map this region at init, and keep it for the lifetime of the
>>> system. It would use only one entry in the TLB, since the CPU
>>> supports 16 MB super-sections.
>>>
>>> I could even lock that entry in the TLB so that these accesses
>>> are guaranteed to never TLB-miss, right?
>>
>> The map_io callback will set up a mapping like that, and when
>> a driver calls ioremap on the same physical address, you will
>> get the correct pointer using that TLB; you just don't communicate
>> the address through a pointer any more.
>
> [NOTE: Initially, the focus of this message was on TLB lockdown,
> but then it changed to creating super-sections]
>
> According to the ARM architecture manual:
>
> The architecture has a concept of an entry locked down in the TLB.
> The method by which lockdown is achieved is IMPLEMENTATION DEFINED,
> and an implementation might not support lockdown.
>
> Does Linux support locking down an entry in the TLB?
> Where are CPU-specific implementations stored in the source tree?
> (I'm using a Cortex A9.)
>
> I glanced at
>
> arch/arm/mm/tlb-v7.S
> arch/arm/include/asm/tlbflush.h
>
> but nothing jumped out at me.
>
> arch/arm/mach-tegra/cortex-a9.S (an obsolete file?) did mention
> lockdown (albeit in a comment only).
>
> https://chromium.googlesource.com/chromiumos/third_party/kernel-next/+/0.11.257.B/arch/arm/mach-tegra/cortex-a9.S
>
> [some time passes]
>
> After giving the issue more thought, I think trying to lock the TLB entry
> might be a case of premature optimization. However, it seems worthwhile to
> make sure that Linux correctly sets up the 16 MB mapping, using a single
> TLB entry (instead of 16 section entries).
>
> I traced create_mapping -> alloc_init_pud -> alloc_init_pmd -> __map_init_section
>
> (I think I'm in the right place...)
> However, I was expecting to find PMD_SECT_SUPER somewhere in there, yet
> I don't see it anywhere, so I'm not confident that a super-section is
> being created.
>
> The only two relevant functions appear to be
>
> create_36bit_mapping()
> remap_area_supersections()
>
> The first is only called in this case:
> #ifndef CONFIG_ARM_LPAE
>         /*
>          * Catch 36-bit addresses
>          */
>         if (md->pfn >= 0x100000) {
>                 create_36bit_mapping(md, type);
>                 return;
>         }
> #endif
>
> Since I want to map PA 0, I could lie and pretend it is PA 2^32,
> pray for a wrap-around back to 0, and get the super-section mapping.
> That sounds like an ugly hack...
>
> The other function is only called in this case:
> #if !defined(CONFIG_SMP) && !defined(CONFIG_ARM_LPAE)
>         if (DOMAIN_IO == 0 &&
>             (((cpu_architecture() >= CPU_ARCH_ARMv6) && (get_cr() & CR_XP)) ||
>               cpu_is_xsc3()) && pfn >= 0x100000 &&
>             !((paddr | size | addr) & ~SUPERSECTION_MASK)) {
>                 area->flags |= VM_ARM_SECTION_MAPPING;
>                 err = remap_area_supersections(addr, pfn, size, type);
>         } else if (!((paddr | size | addr) & ~PMD_MASK)) {
>                 area->flags |= VM_ARM_SECTION_MAPPING;
>                 err = remap_area_sections(addr, pfn, size, type);
>         } else
> #endif
>
> But we do define CONFIG_SMP (dual-core CPU).
> So no super-sections for me, IIUC?
>
> Regards.


2014-12-03 10:32:22

by Catalin Marinas

Subject: Re: Creating 16 MB super-sections for MMIO

On Wed, Dec 03, 2014 at 09:52:33AM +0000, Mason wrote:
> [Top-posting to address topicality]
>
> Is LKML a more appropriate list for discussing MMU setup?
> (I've been studying arch-specific code in arch/arm/mm)

The list is fine. The reasons behind this proposal aren't clear. Are you
trying to optimise mmio register accesses by avoiding TLB misses?

--
Catalin

2014-12-03 14:20:14

by Mason

Subject: Re: Creating 16 MB super-sections for MMIO

Hello Catalin,

On 03/12/2014 11:32, Catalin Marinas wrote:

> The reasons behind this proposal aren't clear. Are you trying
> to optimise mmio register accesses by avoiding TLB misses?

I am trying to minimize TLB "pollution" by using a "huge" page.

The ARM manual states: "Support for Supersections, Sections and
Large pages enables a large region of memory to be mapped using
only a single entry in the TLB."

Is it correct that, if I create a virtual-to-physical mapping of
an 8 MB memory region, then, in the best-case scenario, the kernel
will create 8 Sections? Thus my mapping would use up to 8 entries
in the TLB at any given time.

Instead, if I create a V2P mapping of a 16 MB region, and if the
kernel supports so-called Super-sections, then the mapping would
use only a single entry in the TLB. Right?

Thus creating a Super-section leaves more TLB entries available
for user-space processes and kernel threads, which can only
improve system performance, AFAIU.

On my SoC, physical addresses 0 to 2^24 are reserved for device
memory-mapped registers. There are holes in the region (meaning
addresses that don't map to any register) but the bus is defined
in such a way that
- writing to a hole is a NOP,
- reading from a hole returns 0.

So I'd like to map
physical addresses 0-2^24
to
virtual addresses 0xf000_0000 - 0xf100_0000
using a Super-section.

And I was hoping that calling iotable_init with a struct map_desc
entry where length = SZ_16M would create such a super-section.
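For reference, the kind of entry I mean would look roughly like this (a
sketch in machine-descriptor style; the array/function names and the choice
of MT_DEVICE are my assumptions, not tested code):

```c
#include <linux/sizes.h>
#include <asm/mach/map.h>

/* Hypothetical iotable entry: map PA 0x00000000..0x00ffffff (the SoC's
 * MMIO register window) to VA 0xf0000000, hopefully as one supersection. */
static struct map_desc soc_io_desc[] __initdata = {
	{
		.virtual = 0xf0000000,
		.pfn     = __phys_to_pfn(0x00000000),
		.length  = SZ_16M,
		.type    = MT_DEVICE,
	},
};

/* Called from the machine descriptor's .map_io callback: */
static void __init soc_map_io(void)
{
	iotable_init(soc_io_desc, ARRAY_SIZE(soc_io_desc));
}
```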

Did that make sense?

As far as I could tell, Linux does not create a super-section in the
case outlined above. Perhaps I misread the source code?

Regards.

2014-12-03 17:06:18

by Arnd Bergmann

Subject: Re: Creating 16 MB super-sections for MMIO

On Wednesday 03 December 2014 15:20:01 Mason wrote:
>
> As far as I could tell, Linux does not create a super-section in the
> case outlined above. Perhaps I misread the source code?

I believe you are right, and I also agree that in theory implementing
what you say (both 64k and 16M mappings) can only help, but it's not
obvious if this makes a measurable difference in the end.

MMIO register accesses are usually slow for other reasons, and
they tend to be rare, so it's possible that you won't be able
to ever tell a difference because the MMIO TLB often gets evicted
by user mappings between accesses to different 1MB sections,
and the timing difference between a TLB-hot and cold MMIO access
might not be that great (depending on the latency of a particular
register).

I don't think there would be any objections to doing superpage
or supersection mappings for early page tables if you can show
any benefit whatsoever, but it may be hard to come up with a
scenario where it's actually measurable.

Arnd

2014-12-03 17:48:02

by Mason

Subject: Re: Creating 16 MB super-sections for MMIO

On 03/12/2014 18:06, Arnd Bergmann wrote:

> Mason wrote:
>
>> As far as I could tell, Linux does not create a super-section in the
>> case outlined above. Perhaps I misread the source code?
>
> I believe you are right, and I also agree that in theory implementing
> what you say (both 64k and 16M mappings) can only help, but it's not
> obvious if this makes a measurable difference in the end.

It will be an interesting thought experiment to come up with
a relevant benchmark. TODO.

> MMIO register accesses are usually slow for other reasons, and
> they tend to be rare,

Reading e.g. the system tick counter on this SoC takes ~65 ns
(so ~65 cycles from the CPU's PoV) which is roughly twice as
fast as accessing uncached RAM.

I don't think we can say that MMIO register accesses are slow
when they are faster than RAM, right?

> so it's possible that you won't be able
> to ever tell a difference because the MMIO TLB often gets evicted
> by user mappings between accesses to different 1MB sections,
> and the timing difference between a TLB-hot and cold MMIO access
> might not be that great (depending on the latency of a particular
> register).

I don't know if other SoCs are built differently, but on this one,
most drivers are hammering the same 16MB memory region where the
MMIO registers live. I don't think the entry would ever get evicted
if there's some kind of LRU-policy in action.

[Seems it might be worthwhile to investigate TLB entry lockdown
(on Cortex-A9) after all.]

> I don't think there would be any objections to doing superpage
> or supersection mappings for early page tables if you can show
> any benefit whatsoever, but it may be hard to come up with a
> scenario where it's actually measurable.

I'll have to think about it.

Thanks.