Hi Linus,
These are the main arm64 updates for 4.4. There are a few more conflicts
than usual caused by some reworking under arch/arm64 (the CPU feature
detection, EFI_STUB, relaxed atomics) and an arm64 fix that went in
during 4.3-rc7 and which I haven't pulled into the upstream branch for
arm64. The more complicated fix-up is in arch/arm64/kernel/cpufeature.c
where irqchip/GICv3 changes merged via tip are conflicting with the
arm64 CPU feature detection.
I included the conflict resolution (only the conflicting hunks) at the
end of this email. Please let me know if there are any issues.
The diffstat contains some kernel/irq/ changes that have been pulled
from tip/irq/for-arm as explained in the tag text.
Thanks.
The following changes since commit 049e6dde7e57f0054fdc49102e7ef4830c698b46:
Linux 4.3-rc4 (2015-10-04 16:57:17 +0100)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux tags/arm64-upstream
for you to fetch changes up to f8f8bdc48851da979c6e0e4808b6031122e4af47:
arm64/efi: fix libstub build under CONFIG_MODVERSIONS (2015-11-02 13:50:17 +0000)
----------------------------------------------------------------
arm64 updates for 4.4:
- "genirq: Introduce generic irq migration for cpu hotunplugged" patch
merged from tip/irq/for-arm to allow the arm64-specific part to be
upstreamed via the arm64 tree
- CPU feature detection reworked to cope with heterogeneous systems
where CPUs may not have exactly the same features. The features
reported by the kernel via internal data structures or ELF_HWCAP are
delayed until all the CPUs are up (and before user space starts)
- Support for 16KB pages, with the additional bonus of a 36-bit VA
space, though the latter only depending on EXPERT
- Implement native {relaxed, acquire, release} atomics for arm64
- New ASID allocation algorithm which avoids IPI on roll-over, together
with TLB invalidation optimisations (using local vs global where
feasible)
- KASan support for arm64
- EFI_STUB clean-up and isolation for the kernel proper (required by
KASan)
- copy_{to,from,in}_user optimisations (sharing the memcpy template)
- perf: moving arm64 to the arm32/64 shared PMU framework
- L1_CACHE_BYTES increased to 128 to accommodate Cavium hardware
- Support for the contiguous PTE hint on kernel mapping (16 consecutive
entries may be able to use a single TLB entry)
- Generic CONFIG_HZ now used on arm64
- defconfig updates
----------------------------------------------------------------
Alexander Kuleshov (1):
arm64/mm: use PAGE_ALIGNED instead of IS_ALIGNED
Alim Akhtar (1):
arm64: defconfig: Enable samsung serial and mmc
Andrey Ryabinin (4):
arm64: introduce VA_START macro - the first kernel virtual address.
arm64: move PGD_SIZE definition to pgalloc.h
arm64: add KASAN support
Documentation/features/KASAN: arm64 supports KASAN now
Ard Biesheuvel (7):
arm64/efi: remove /chosen/linux, uefi-stub-kern-ver DT property
arm64: use ENDPIPROC() to annotate position independent assembler routines
arm64/efi: isolate EFI stub from the kernel proper
arm64: Add page size to the kernel image header
arm64: remove bogus TASK_SIZE_64 check
arm64/efi: move arm64 specific stub C code to libstub
arm64/efi: fix libstub build under CONFIG_MODVERSIONS
Catalin Marinas (6):
Merge branch 'irq/for-arm' of git://git.kernel.org/.../tip/tip
arm64: Fix missing #include in hw_breakpoint.c
Revert "arm64: ioremap: add ioremap_cache macro"
arm64: Minor coding style fixes for kc_offset_to_vaddr and kc_vaddr_to_offset
arm64: Make 36-bit VA depend on EXPERT
Merge branch 'irq/for-arm' of git://git.kernel.org/.../tip/tip
Dave Martin (1):
arm64: Constify hwcap name string arrays
Dietmar Eggemann (1):
ARM64: Enable multi-core scheduler support by default
Feng Kan (2):
arm64: Change memcpy in kernel to use the copy template file
arm64: copy_to-from-in_user optimization using copy template
Jeremy Linton (6):
arm64: Add contiguous page flag shifts and constants
arm64: PTE/PMD contiguous bit definition
arm64: Macros to check/set/unset the contiguous bit
arm64: Default kernel pages should be contiguous
arm64: Make the kernel page dump utility aware of the CONT bit
arm64: Mark kernel page ranges contiguous
Jisheng Zhang (1):
arm64: add cpu_idle tracepoints to arch_cpu_idle
Jungseok Lee (1):
arm64: Synchronise dump_backtrace() with perf callchain
Kefeng Wang (1):
arm64: make Timer Interrupt Frequency selectable
Linus Walleij (1):
ARM64: kasan: print memory assignment
Mark Rutland (8):
arm64: perf: move to shared arm_pmu framework
arm64: perf: add Cortex-A53 support
arm64: perf: add Cortex-A57 support
arm64: dts: juno: describe PMUs separately
MAINTAINERS: update ARM PMU profiling and debugging for arm64
MAINTAINERS: add myself as arm perf reviewer
arm64: Simplify NR_FIX_BTMAPS calculation
arm64: page-align sections for DEBUG_RODATA
Mark Salyzyn (1):
arm64: AArch32 user space PC alignment exception
Robin Murphy (2):
arm64: Fix compat register mappings
arm64: Fix build with CONFIG_ZONE_DMA=n
Suzuki K. Poulose (28):
arm64: Move swapper pagetable definitions
arm64: Handle section maps for swapper/idmap
arm64: Introduce helpers for page table levels
arm64: Calculate size for idmap_pg_dir at compile time
arm64: Handle 4 level page table for swapper
arm64: Clean config usages for page size
arm64: Kconfig: Fix help text about AArch32 support with 64K pages
arm64: Check for selected granule support
arm64: Add 16K page size support
arm64: 36 bit VA
arm64: Make the CPU information more clear
arm64: Delay ELF HWCAP initialisation until all CPUs are up
arm64: Delay cpuinfo_store_boot_cpu
arm64: Move cpu feature detection code
arm64: Move mixed endian support detection
arm64: Move /proc/cpuinfo handling code
arm64: Handle width of a cpuid feature
arm64: Keep track of CPU feature registers
arm64: Consolidate CPU Sanity check to CPU Feature infrastructure
arm64: Read system wide CPUID value
arm64: Cleanup mixed endian support detection
arm64: Refactor check_cpu_capabilities
arm64: Delay cpu feature capability checks
arm64/capabilities: Make use of system wide safe value
arm64/HWCAP: Use system wide safe values
arm64: Move FP/ASIMD hwcap handling to common code
arm64/debug: Make use of the system wide safe value
arm64/kvm: Make use of the system wide safe values
Thomas Gleixner (1):
genirq: Make the cpuhotplug migration code less noisy
Tirumalesh Chalamarla (1):
arm64: Increase the max granular size
Will Deacon (15):
arm64: mm: remove unused cpu_set_idmap_tcr_t0sz function
arm64: proc: de-scope TLBI operation during cold boot
arm64: flush: use local TLB and I-cache invalidation
arm64: mm: rewrite ASID allocator and MM context-switching code
arm64: tlbflush: remove redundant ASID casts to (unsigned long)
arm64: tlbflush: avoid flushing when fullmm == 1
arm64: switch_mm: simplify mm and CPU checks
arm64: mm: kill mm_cpumask usage
arm64: tlb: remove redundant barrier from __flush_tlb_pgtable
arm64: mm: remove dsb from update_mmu_cache
arm64: hw_breakpoint: use target state to determine ABI behaviour
arm64: atomics: implement native {relaxed, acquire, release} atomics
arm64: kasan: fix issues reported by sparse
arm64: cpufeature: declare enable_cpu_capabilities as static
arm64: cachetype: fix definitions of ICACHEF_* flags
Yang Shi (1):
arm64: debug: Fix typo in debug-monitors.c
Yang Yingliang (1):
arm64: fix a migrating irq bug when hotplug cpu
yalin wang (2):
arm64: ioremap: add ioremap_cache macro
arm64: add kc_offset_to_vaddr and kc_vaddr_to_offset macro
Documentation/arm/uefi.txt | 2 -
Documentation/arm64/booting.txt | 7 +-
Documentation/devicetree/bindings/arm/pmu.txt | 2 +
.../features/debug/KASAN/arch-support.txt | 2 +-
MAINTAINERS | 9 +-
arch/arm64/Kconfig | 69 +-
arch/arm64/Kconfig.debug | 2 +-
arch/arm64/Makefile | 7 +
arch/arm64/boot/dts/arm/juno-r1.dts | 18 +-
arch/arm64/boot/dts/arm/juno.dts | 18 +-
arch/arm64/configs/defconfig | 9 +
arch/arm64/include/asm/assembler.h | 11 +
arch/arm64/include/asm/atomic.h | 63 +-
arch/arm64/include/asm/atomic_ll_sc.h | 98 +-
arch/arm64/include/asm/atomic_lse.h | 193 ++--
arch/arm64/include/asm/cache.h | 2 +-
arch/arm64/include/asm/cacheflush.h | 7 +
arch/arm64/include/asm/cachetype.h | 4 +-
arch/arm64/include/asm/cmpxchg.h | 279 +++--
arch/arm64/include/asm/cpu.h | 4 +
arch/arm64/include/asm/cpufeature.h | 91 +-
arch/arm64/include/asm/cputype.h | 15 -
arch/arm64/include/asm/fixmap.h | 7 +-
arch/arm64/include/asm/hw_breakpoint.h | 9 +-
arch/arm64/include/asm/hwcap.h | 8 +
arch/arm64/include/asm/irq.h | 1 -
arch/arm64/include/asm/kasan.h | 38 +
arch/arm64/include/asm/kernel-pgtable.h | 83 ++
arch/arm64/include/asm/memory.h | 6 +-
arch/arm64/include/asm/mmu.h | 15 +-
arch/arm64/include/asm/mmu_context.h | 113 +--
arch/arm64/include/asm/page.h | 27 +-
arch/arm64/include/asm/pgalloc.h | 1 +
arch/arm64/include/asm/pgtable-hwdef.h | 48 +-
arch/arm64/include/asm/pgtable.h | 30 +-
arch/arm64/include/asm/pmu.h | 83 --
arch/arm64/include/asm/processor.h | 2 +-
arch/arm64/include/asm/ptrace.h | 16 +-
arch/arm64/include/asm/string.h | 16 +
arch/arm64/include/asm/sysreg.h | 157 ++-
arch/arm64/include/asm/thread_info.h | 5 +-
arch/arm64/include/asm/tlb.h | 26 +-
arch/arm64/include/asm/tlbflush.h | 18 +-
arch/arm64/kernel/Makefile | 11 +-
arch/arm64/kernel/arm64ksyms.c | 3 +
arch/arm64/kernel/asm-offsets.c | 2 +-
arch/arm64/kernel/cpu_errata.c | 2 +-
arch/arm64/kernel/cpufeature.c | 851 +++++++++++++++-
arch/arm64/kernel/cpuinfo.c | 254 +++--
arch/arm64/kernel/debug-monitors.c | 8 +-
arch/arm64/kernel/efi-entry.S | 10 +-
arch/arm64/kernel/efi.c | 5 +-
arch/arm64/kernel/entry.S | 2 +
arch/arm64/kernel/fpsimd.c | 16 +-
arch/arm64/kernel/head.S | 76 +-
arch/arm64/kernel/hw_breakpoint.c | 19 +-
arch/arm64/kernel/image.h | 38 +-
arch/arm64/kernel/irq.c | 62 --
arch/arm64/kernel/module.c | 16 +-
arch/arm64/kernel/perf_event.c | 1066 ++++----------------
arch/arm64/kernel/process.c | 3 +
arch/arm64/kernel/setup.c | 245 +----
arch/arm64/kernel/smp.c | 22 +-
arch/arm64/kernel/suspend.c | 2 +-
arch/arm64/kernel/traps.c | 15 +-
arch/arm64/kernel/vmlinux.lds.S | 6 +-
arch/arm64/kvm/Kconfig | 3 +
arch/arm64/kvm/reset.c | 2 +-
arch/arm64/kvm/sys_regs.c | 12 +-
arch/arm64/lib/copy_from_user.S | 78 +-
arch/arm64/lib/copy_in_user.S | 67 +-
arch/arm64/lib/copy_template.S | 193 ++++
arch/arm64/lib/copy_to_user.S | 67 +-
arch/arm64/lib/memchr.S | 2 +-
arch/arm64/lib/memcmp.S | 2 +-
arch/arm64/lib/memcpy.S | 184 +---
arch/arm64/lib/memmove.S | 9 +-
arch/arm64/lib/memset.S | 5 +-
arch/arm64/lib/strcmp.S | 2 +-
arch/arm64/lib/strlen.S | 2 +-
arch/arm64/lib/strncmp.S | 2 +-
arch/arm64/mm/Makefile | 3 +
arch/arm64/mm/cache.S | 10 +-
arch/arm64/mm/context.c | 236 +++--
arch/arm64/mm/dump.c | 18 +-
arch/arm64/mm/fault.c | 2 +-
arch/arm64/mm/init.c | 19 +-
arch/arm64/mm/kasan_init.c | 165 +++
arch/arm64/mm/mmu.c | 145 ++-
arch/arm64/mm/pageattr.c | 2 +-
arch/arm64/mm/pgd.c | 2 -
arch/arm64/mm/proc.S | 10 +-
drivers/firmware/efi/Makefile | 8 +
drivers/firmware/efi/libstub/Makefile | 42 +-
.../firmware/efi/libstub/arm64-stub.c | 0
drivers/firmware/efi/libstub/fdt.c | 9 -
drivers/firmware/efi/libstub/string.c | 57 ++
drivers/perf/Kconfig | 2 +-
include/linux/irq.h | 2 +
kernel/irq/Kconfig | 4 +
kernel/irq/Makefile | 1 +
kernel/irq/cpuhotplug.c | 82 ++
scripts/Makefile.kasan | 4 +-
103 files changed, 3386 insertions(+), 2422 deletions(-)
create mode 100644 arch/arm64/include/asm/kasan.h
create mode 100644 arch/arm64/include/asm/kernel-pgtable.h
delete mode 100644 arch/arm64/include/asm/pmu.h
create mode 100644 arch/arm64/lib/copy_template.S
create mode 100644 arch/arm64/mm/kasan_init.c
rename arch/arm64/kernel/efi-stub.c => drivers/firmware/efi/libstub/arm64-stub.c (100%)
create mode 100644 drivers/firmware/efi/libstub/string.c
create mode 100644 kernel/irq/cpuhotplug.c
Conflict resolution with 66ef3493d4bb387f5a83915e33dc893102fd1b43:
-------------8<-------------------------
diff --cc Documentation/arm/uefi.txt
index 7b3fdfe0f7ba,7f1bed8872f3..000000000000
--- a/Documentation/arm/uefi.txt
+++ b/Documentation/arm/uefi.txt
@@@ -58,5 -58,5 +58,3 @@@ linux,uefi-mmap-desc-size | 32-bit | Si
--------------------------------------------------------------------------------
linux,uefi-mmap-desc-ver | 32-bit | Version of the mmap descriptor format.
--------------------------------------------------------------------------------
- linux,uefi-stub-kern-ver | string | Copy of linux_banner from build.
- --------------------------------------------------------------------------------
-
-For verbose debug messages, specify 'uefi_debug' on the kernel command line.
diff --cc arch/arm64/include/asm/atomic.h
index 1e247ac2601a,5e13ad76a249..000000000000
--- a/arch/arm64/include/asm/atomic.h
+++ b/arch/arm64/include/asm/atomic.h
@@@ -54,8 -54,39 +54,39 @@@
#define ATOMIC_INIT(i) { (i) }
#define atomic_read(v) READ_ONCE((v)->counter)
-#define atomic_set(v, i) (((v)->counter) = (i))
+#define atomic_set(v, i) WRITE_ONCE(((v)->counter), (i))
+
+ #define atomic_add_return_relaxed atomic_add_return_relaxed
+ #define atomic_add_return_acquire atomic_add_return_acquire
+ #define atomic_add_return_release atomic_add_return_release
+ #define atomic_add_return atomic_add_return
+
+ #define atomic_inc_return_relaxed(v) atomic_add_return_relaxed(1, (v))
+ #define atomic_inc_return_acquire(v) atomic_add_return_acquire(1, (v))
+ #define atomic_inc_return_release(v) atomic_add_return_release(1, (v))
+ #define atomic_inc_return(v) atomic_add_return(1, (v))
+
+ #define atomic_sub_return_relaxed atomic_sub_return_relaxed
+ #define atomic_sub_return_acquire atomic_sub_return_acquire
+ #define atomic_sub_return_release atomic_sub_return_release
+ #define atomic_sub_return atomic_sub_return
+
+ #define atomic_dec_return_relaxed(v) atomic_sub_return_relaxed(1, (v))
+ #define atomic_dec_return_acquire(v) atomic_sub_return_acquire(1, (v))
+ #define atomic_dec_return_release(v) atomic_sub_return_release(1, (v))
+ #define atomic_dec_return(v) atomic_sub_return(1, (v))
+
+ #define atomic_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
+ #define atomic_xchg_acquire(v, new) xchg_acquire(&((v)->counter), (new))
+ #define atomic_xchg_release(v, new) xchg_release(&((v)->counter), (new))
#define atomic_xchg(v, new) xchg(&((v)->counter), (new))
+
+ #define atomic_cmpxchg_relaxed(v, old, new) \
+ cmpxchg_relaxed(&((v)->counter), (old), (new))
+ #define atomic_cmpxchg_acquire(v, old, new) \
+ cmpxchg_acquire(&((v)->counter), (old), (new))
+ #define atomic_cmpxchg_release(v, old, new) \
+ cmpxchg_release(&((v)->counter), (old), (new))
#define atomic_cmpxchg(v, old, new) cmpxchg(&((v)->counter), (old), (new))
#define atomic_inc(v) atomic_add(1, (v))
diff --cc arch/arm64/kernel/cpufeature.c
index 305f30dc9e63,504526fa8129..000000000000
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@@ -33,41 -586,22 +588,37 @@@
return val >= entry->min_field_value;
}
- #define __ID_FEAT_CHK(reg) \
- static bool __maybe_unused \
- has_##reg##_feature(const struct arm64_cpu_capabilities *entry) \
- { \
- u64 val; \
- \
- val = read_cpuid(reg##_el1); \
- return feature_matches(val, entry); \
- }
+ static bool
+ has_cpuid_feature(const struct arm64_cpu_capabilities *entry)
+ {
+ u64 val;
- __ID_FEAT_CHK(id_aa64pfr0);
- __ID_FEAT_CHK(id_aa64mmfr1);
- __ID_FEAT_CHK(id_aa64isar0);
+ val = read_system_reg(entry->sys_reg);
+ return feature_matches(val, entry);
+ }
+static bool has_useable_gicv3_cpuif(const struct arm64_cpu_capabilities *entry)
+{
+ bool has_sre;
+
- if (!has_id_aa64pfr0_feature(entry))
++ if (!has_cpuid_feature(entry))
+ return false;
+
+ has_sre = gic_enable_sre();
+ if (!has_sre)
+ pr_warn_once("%s present but disabled by higher exception level\n",
+ entry->desc);
+
+ return has_sre;
+}
+
static const struct arm64_cpu_capabilities arm64_features[] = {
{
.desc = "GIC system register CPU interface",
.capability = ARM64_HAS_SYSREG_GIC_CPUIF,
- .matches = has_cpuid_feature,
+ .matches = has_useable_gicv3_cpuif,
- .field_pos = 24,
+ .sys_reg = SYS_ID_AA64PFR0_EL1,
+ .field_pos = ID_AA64PFR0_GIC_SHIFT,
.min_field_value = 1,
},
#ifdef CONFIG_ARM64_PAN
diff --cc arch/arm64/kernel/suspend.c
index 44ca4143b013,3c5e4e6dcf68..000000000000
--- a/arch/arm64/kernel/suspend.c
+++ b/arch/arm64/kernel/suspend.c
@@@ -80,21 -80,17 +80,21 @@@ int cpu_suspend(unsigned long arg, int
if (ret == 0) {
/*
* We are resuming from reset with TTBR0_EL1 set to the
- * idmap to enable the MMU; restore the active_mm mappings in
- * TTBR0_EL1 unless the active_mm == &init_mm, in which case
- * the thread entered cpu_suspend with TTBR0_EL1 set to
- * reserved TTBR0 page tables and should be restored as such.
+ * idmap to enable the MMU; set the TTBR0 to the reserved
+ * page tables to prevent speculative TLB allocations, flush
+ * the local tlb and set the default tcr_el1.t0sz so that
+ * the TTBR0 address space set-up is properly restored.
+ * If the current active_mm != &init_mm we entered cpu_suspend
+ * with mappings in TTBR0 that must be restored, so we switch
+ * them back to complete the address space configuration
+ * restoration before returning.
*/
- if (mm == &init_mm)
- cpu_set_reserved_ttbr0();
- else
- cpu_switch_mm(mm->pgd, mm);
-
+ cpu_set_reserved_ttbr0();
- flush_tlb_all();
+ local_flush_tlb_all();
+ cpu_set_default_tcr_t0sz();
+
+ if (mm != &init_mm)
+ cpu_switch_mm(mm->pgd, mm);
/*
* Restore per-cpu offset before any kernel
-------------8<-------------------------
--
Catalin
On Wed, Nov 4, 2015 at 10:25 AM, Catalin Marinas
<[email protected]> wrote:
>
> - Support for 16KB pages, with the additional bonus of a 36-bit VA
> space, though the latter only depending on EXPERT
So I told the ppc people this many years ago, and I guess I'll tell
you guys too: 16kB pages are not actually useful, and anybody who
thinks they are has not actually done the math.
It ends up being a horrible waste of memory for things like the page
cache, to the point where all the arguments for it ("it allows us to
manage lots of memory more cheaply") are pure and utter BS, because
you effectively lose half of that memory to fragmentation in pretty
much all normal loads.
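To put rough numbers on that claim (purely illustrative, assuming the page
cache is dominated by small files whose tail page is on average half empty):

    /* expected page cache waste from internal fragmentation; the
     * numbers below are illustrative only
     */
    static unsigned long tail_waste_bytes(unsigned long nr_files,
                                          unsigned long page_size)
    {
            /* each file's last page is, on average, half unused */
            return nr_files * page_size / 2;
    }

    /* For 100,000 cached files:
     *    4KB pages -> ~200MB wasted
     *   16KB pages -> ~800MB wasted
     *   64KB pages -> ~3.2GB wasted
     */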
It's good for single-process loads - if you do a lot of big fortran
jobs, or a lot of big database loads, and nothing else, you're fine.
Or if you are an embedded OS and only have one particular load you
worry about.
But it is really really nasty for any general-purpose stuff, and when
your hardware people tell you that it's a great way to make your TLB's
more effective, tell them back that they are incompetent morons, and
that they should just make their TLB's better.
Because they are.
To make them understand the problem, compare it to having a 256-byte
cacheline. They might understand it then, because you're talking about
things that they almost certainly *also* wanted to do, but did the
numbers on, and realized it was bad.
And on the other hand, if they go "Hmm. 256-byte cache lines? We
should do that too", then you know they are not worth your time, and
you can quietly tell your bosses that they should up the medication in
the watercooler in the hw lab.
Linus
On Wed, Nov 04, 2015 at 02:55:01PM -0800, Linus Torvalds wrote:
> On Wed, Nov 4, 2015 at 10:25 AM, Catalin Marinas
> <[email protected]> wrote:
> >
> > - Support for 16KB pages, with the additional bonus of a 36-bit VA
> > space, though the latter only depending on EXPERT
>
> So I told the ppc people this many years ago, and I guess I'll tell
> you guys too: 16kB pages are not actually useful, and anybody who
> thinks they are has not actually done the math.
Without doing any benchmarks (not just the maths but taking TLB misses
into account), I agree with you. As a note, I don't actually expect this
feature to be used in practice, firstly because it is an optional
architecture feature and secondly because people wanting a bigger page
size (like Red Hat) went to the extreme 64KB size already. But adding
this option to the kernel doesn't cost us much (some macro clean-up) and
it's something the CPU validation people would most likely use.
Who knows, maybe those people who went for 64KB pages get burnt and go
for 16KB as an intermediate step before moving back to 4KB.
> It's good for single-process loads - if you do a lot of big fortran
> jobs, or a lot of big database loads, and nothing else, you're fine.
These are some of the arguments from the server camp: specific
workloads.
> Or if you are an embedded OS and only have one particular load you
> worry about.
It's unlikely for embedded/mobile because of the memory usage, though
I've seen it done on 32-bit ARMv7 (Cortex-A9). The WD My Cloud NAS at
some point upgraded the firmware to use 64KB pages in Linux (not
something supported by mainline). I have no idea what led to their
decision but the workloads are very specific, I guess there was some
gain for them.
> But it is really really nasty for any general-purpose stuff, and when
> your hardware people tell you that it's a great way to make your TLB's
> more effective, tell them back that they are incompetent morons, and
> that they should just make their TLB's better.
Virtualisation, nested pages is an area where you can always squeeze a
bit more performance even if your TLBs are fast (for example, 4 levels
guest + 4 levels host page tables would need 24 memory accesses for a
completely cold TLB miss). But this would normally only be an option for
the host kernel, not aimed at general purpose guest.
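To spell out where the 24 comes from (a minimal worked example, ignoring the
walk caches that real implementations have):

    /* Cold two-stage walk: each of the G guest table levels is a
     * guest-physical access that first needs a full H-level host walk
     * plus the descriptor read itself, and the final guest PA needs
     * one more host walk before the data access can be issued.
     */
    static unsigned int nested_walk_accesses(unsigned int g, unsigned int h)
    {
            return g * (h + 1) + h; /* == (g + 1) * (h + 1) - 1 */
    }

    /* nested_walk_accesses(4, 4) == 24 */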
> To make them understand the problem, compare it to having a 256-byte
> cacheline. They might understand it then, because you're talking about
> things that they almost certainly *also* wanted to do, but did the
> numbers on, and realized it was bad.
The difference is that a 256-byte cacheline is hard-wired and the cache
size fixed when you build the silicon. OTOH, the page size is
configurable and I would be very worried if 4KB pages are ever
deprecated. The counter argument from the HW camp is usually that the
architecture is not designed just for the current RAM limits and not
even for the current Linux implementation. It's more like "in 10 years
time we may afford to waste a lot more memory *or* Linux may find a way
to merge/compress partially filled page cache pages (well, those not
mapped to user) *or* some other workloads emerge, so we better have the
option in early".
I don't see the 4KB page configuration ever going away from the ARM
cores and the mobile camp is pretty much tied to it. We'll have to wait
until we see some real workloads on servers and what the larger page
impact is. Hopefully the ecosystem (software, silicon vendors) will
eventually converge to the best solution (which could simply be smaller
pages and better TLBs). In the meantime, I'm giving them enough Kconfig
rope to use it as they see appropriate. The architecture specification
does a similar thing.
--
Catalin
On Thursday 05 November 2015 18:27:18 Catalin Marinas wrote:
> On Wed, Nov 04, 2015 at 02:55:01PM -0800, Linus Torvalds wrote:
> > On Wed, Nov 4, 2015 at 10:25 AM, Catalin Marinas <[email protected]> wrote:
> > It's good for single-process loads - if you do a lot of big fortran
> > jobs, or a lot of big database loads, and nothing else, you're fine.
>
> These are some of the arguments from the server camp: specific
> workloads.
I think (a little overgeneralized), you want 4KB pages for any file
based mappings, but larger (in some cases much larger) for anonymous
memory. The last time this came up, I theorized about a way to change
do_anonymous_page() to always operate on 64KB units on a normal
4KB page based kernel, and use the ARM64 contiguous page hint
to get 64KB TLBs for those pages.
This could be done compile-time, system-wide, or per-process if
desired, and should provide performance as good as the current
64KB page kernels for almost any server workloads, and in
some cases much better than that, as long as the hints are
actually interpreted by the CPU implementation.
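Roughly along these lines, as a hand-wavy sketch only (CONT_PTES and
pte_mkcont() are assumed from the contiguous-bit series in this pull request;
this is nowhere near the real do_anonymous_page() surgery):

    #include <linux/mm.h>       /* set_pte_at(), mk_pte() */
    #include <asm/pgtable.h>    /* pte_mkcont(), CONT_PTES (assumed) */

    /* Fault in a naturally aligned block of CONT_PTES base pages (16
     * with a 4KB granule) and set the contiguous hint on every PTE so
     * the TLB can cover the whole 64KB range with a single entry.
     */
    static void map_anon_cont_block(struct mm_struct *mm, unsigned long addr,
                                    pte_t *ptep, struct page **pages,
                                    pgprot_t prot)
    {
            int i;

            /* addr and ptep are assumed aligned to the 64KB block */
            for (i = 0; i < CONT_PTES; i++)
                    set_pte_at(mm, addr + i * PAGE_SIZE, ptep + i,
                               pte_mkcont(mk_pte(pages[i], prot)));
    }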
> > Or if you are an embedded OS and only have one particular load you
> > worry about.
>
> It's unlikely for embedded/mobile because of the memory usage, though
> I've seen it done on 32-bit ARMv7 (Cortex-A9). The WD My Cloud NAS at
> some point upgraded the firmware to use 64KB pages in Linux (not
> something supported by mainline). I have no idea what led to their
> decision but the workloads are very specific, I guess there was some
> gain for them.
Very interesting.
I can think of one particular use case where it makes sense: If your
storage device uses larger than 4KB sectors, making the page size
in the kernel the same as the sector size will speed up I/O.
An example for this would be low-end flash devices (USB sticks,
SD cards, not SSD) that in this year's generation tend to write
a 64KB block faster than any smaller unit on average (in absolute
terms, so doing 4KB writes is at least 16 times slower per MB
than doing 64KB writes). For an embedded system, it may hence
end up being more economical to put in four times the RAM compared
to replacing the storage with something that can handle small I/Os
efficiently.
Hard drive vendors have been talking about larger than 4K
sectors for a while. I didn't think anyone built them, but
as WD makes both the NAS and the hard drive in it, it is
theoretically possible that they did this here.
> > But it is really really nasty for any general-purpose stuff, and when
> > your hardware people tell you that it's a great way to make your TLB's
> > more effective, tell them back that they are incompetent morons, and
> > that they should just make their TLB's better.
>
> Virtualisation, nested pages is an area where you can always squeeze a
> bit more performance even if your TLBs are fast (for example, 4 levels
> guest + 4 levels host page tables would need 24 memory accesses for a
> completely cold TLB miss). But this would normally only be an option for
> the host kernel, not aimed at general purpose guest.
Virtualization of course is what has been driving the improvements
for huge page handling, and using huge pages helps much more here
than a slight increase in page size. Then again, using 16KB pages
also increases the hugepage size from 2MB to 32MB, which can also
help.
Arnd
On Fri, Nov 06, 2015 at 10:57:58AM +0100, Arnd Bergmann wrote:
> On Thursday 05 November 2015 18:27:18 Catalin Marinas wrote:
> > On Wed, Nov 04, 2015 at 02:55:01PM -0800, Linus Torvalds wrote:
> > > On Wed, Nov 4, 2015 at 10:25 AM, Catalin Marinas <[email protected]> wrote:
> > > It's good for single-process loads - if you do a lot of big fortran
> > > jobs, or a lot of big database loads, and nothing else, you're fine.
> >
> > These are some of the arguments from the server camp: specific
> > workloads.
>
> I think (a little overgeneralized), you want 4KB pages for any file
> based mappings,
In general, yes, but if the main/only workload on your server is mapping
large db files, the memory usage cost may be amortised. For general
purpose stuff like compiling a Linux kernel, I did some tests
(kernbench) and the page cache usage went from ~2.5GB with 4KB pages to
~6.6GB with 64KB pages, so clearly not suitable. Unfortunately I
couldn't get any meaningful performance numbers as the test was done
over slow NFS.
I'm not recommending 64KB pages but I'm closely following how it's used
and any performance figures. In terms of TLB, there are two aspects that
larger pages try to address (to the detriment of memory usage):
1. A reduction in TLB misses
2. A reduction in the cost of a TLB miss by having fewer page table
levels (42-bit VA with 2 levels vs 3 or even 4 with 4KB; quick arithmetic below).
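Assuming the usual 8-byte descriptors, each table level resolves
page_shift - 3 bits of VA on top of the page offset:

    static unsigned int va_bits(unsigned int page_shift, unsigned int levels)
    {
            return page_shift + levels * (page_shift - 3);
    }

    /* va_bits(16, 2) == 42   (64KB granule, 2 levels, as above)
     * va_bits(14, 2) == 36   (16KB granule, the 36-bit VA option)
     * va_bits(12, 3) == 39   and va_bits(12, 4) == 48   (4KB granule)
     */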
Of course, Linus' point about making TLBs faster is always a good idea, but
even on x86 people are looking to improve things (otherwise we may not
have had THP/hugetlb support on this architecture).
> but larger (in some cases much larger) for anonymous
> memory. The last time this came up, I theorized about a way to change
> do_anonymous_page() to always operate on 64KB units on a normal
> 4KB page based kernel, and use the ARM64 contiguous page hint
> to get 64KB TLBs for those pages.
We have transparent huge pages for this, though the much higher 2MB
size. This would also improve the cost of a TLB miss by walking one
fewer level (point 2 above). I've seen patches for THP on file maps but
I'm not sure what the status is.
As a test, we could fake a 64KB THP by using a dummy PMD that contains
16 PTE entries, just to see how the performance goes. But this would
only address point 1 above.
> This could be done compile-time, system-wide, or per-process if
> desired, and should provide performance as good as the current
> 64KB page kernels for almost any server workloads, and in
> some cases much better than that, as long as the hints are
> actually interpreted by the CPU implementation.
Apart from anonymous mappings, could the file page cache be optimised?
Not all file accesses use mmap() (e.g. gcc compilation seems to do
sequential accesses for the C files the compiler reads), so you don't
always need a full page cache page for a file.
We could have a feature to allow sharing of partially filled page cache
pages and only break them up if mmap'ed to user. A less optimal
implementation based on the current kernel infrastructure could be
something like a cleancache driver able to store partially filled page
cache pages more efficiently (together with a more aggressive eviction
of such pages from the page cache into the cleancache).
--
Catalin
On Friday 06 November 2015 16:04:08 Catalin Marinas wrote:
> On Fri, Nov 06, 2015 at 10:57:58AM +0100, Arnd Bergmann wrote:
> > On Thursday 05 November 2015 18:27:18 Catalin Marinas wrote:
> > > On Wed, Nov 04, 2015 at 02:55:01PM -0800, Linus Torvalds wrote:
> > > > On Wed, Nov 4, 2015 at 10:25 AM, Catalin Marinas <[email protected]> wrote:
> > > > It's good for single-process loads - if you do a lot of big fortran
> > > > jobs, or a lot of big database loads, and nothing else, you're fine.
> > >
> > > These are some of the arguments from the server camp: specific
> > > workloads.
> >
> > I think (a little overgeneralized), you want 4KB pages for any file
> > based mappings,
>
> In general, yes, but if the main/only workload on your server is mapping
> large db files, the memory usage cost may be amortised.
This will still only do you good for a database that is read into memory
once and not written much, and at that point you can as well use hugepages.
The problems for using 64kb page cache on file mappings are
- while you normally want some readahead, the larger pages also result
in read-behind, so you have to actually transfer data from disk into
RAM without ever accessing it.
- When you write the data, you have to write the full 64K page because
that is the granularity of your dirty bit tracking.
So even if you don't care at all about memory consumption, you are
still transferring several times more data to and from your drives.
As mentioned that can be a win on some storage devices, but usually
it's a loss.
> > but larger (in some cases much larger) for anonymous
> > memory. The last time this came up, I theorized about a way to change
> > do_anonymous_page() to always operate on 64KB units on a normal
> > 4KB page based kernel, and use the ARM64 contiguous page hint
> > to get 64KB TLBs for those pages.
>
> We have transparent huge pages for this, though the much higher 2MB
> size. This would also improve the cost of a TLB miss by walking one
> fewer level (point 2 above). I've seen patches for THP on file maps but
> I'm not sure what the status is.
>
> As a test, we could fake a 64KB THP by using a dummy PMD that contains
> 16 PTE entries, just to see how the performance goes. But this would
> only address point 1 above.
Right.
> > This could be done compile-time, system-wide, or per-process if
> > desired, and should provide performance as good as the current
> > 64KB page kernels for almost any server workloads, and in
> > some cases much better than that, as long as the hints are
> > actually interpreted by the CPU implementation.
>
> Apart from anonymous mappings, could the file page cache be optimised?
> Not all file accesses use mmap() (e.g. gcc compilation seems to do
> sequential accesses for the C files the compiler reads), so you don't
> always need a full page cache page for a file.
>
> We could have a feature to allow sharing of partially filled page cache
> pages and only break them up if mmap'ed to user.
I would think that adds way too much complexity for the gains.
> A less optimal
> implementation based on the current kernel infrastructure could be
> something like a cleancache driver able to store partially filled page
> cache pages more efficiently (together with a more aggressive eviction
> of such pages from the page cache into the cleancache).
Not sure, it could work but may still require changing too much of
the way we handle files today.
Arnd
Hi
On Fri, 6 Nov 2015, Arnd Bergmann wrote:
> On Friday 06 November 2015 16:04:08 Catalin Marinas wrote:
> > On Fri, Nov 06, 2015 at 10:57:58AM +0100, Arnd Bergmann wrote:
> > > On Thursday 05 November 2015 18:27:18 Catalin Marinas wrote:
> > > > On Wed, Nov 04, 2015 at 02:55:01PM -0800, Linus Torvalds wrote:
> > > > > On Wed, Nov 4, 2015 at 10:25 AM, Catalin Marinas <[email protected]> wrote:
> > > > > It's good for single-process loads - if you do a lot of big fortran
> > > > > jobs, or a lot of big database loads, and nothing else, you're fine.
> > > >
> > > > These are some of the arguments from the server camp: specific
> > > > workloads.
> > >
> > > I think (a little overgeneralized), you want 4KB pages for any file
> > > based mappings,
> >
> > In general, yes, but if the main/only workload on your server is mapping
> > large db files, the memory usage cost may be amortised.
>
> This will still only do you good for a database that is read into memory
> once and not written much, and at that point you can as well use hugepages.
>
> The problems for using 64kb page cache on file mappings are
>
> - while you normally want some readahead, the larger pages also result
> in read-behind, so you have to actually transfer data from disk into
> RAM without ever accessing it.
>
> - When you write the data, you have to write the full 64K page because
> that is the granularity of your dirty bit tracking.
>
> So even if you don't care at all about memory consumption, you are
> still transferring several times more data to and from your drives.
> As mentioned that can be a win on some storage devices, but usually
> it's a loss.
>
there is also maybe a bigger problem.
I know this from my Zyxel NAS540; this thing is built around the Mindspeed
Comcerto 2000 SoC.
Zyxel is currently rolling back to 4KB page sizes in the upcoming
5.10 firmware release, because Mindspeed did something stupid:
It's not possible to use a standard ARMv7 toolchain to build your
own/needed userspace tools.
And this is the change which causes the pain:
diff --git a/arch/arm/include/asm/elf.h b/arch/arm/include/asm/elf.h
-#define ELF_EXEC_PAGESIZE 4096
+#define ELF_EXEC_PAGESIZE (PAGE_SIZE)
The SoC is mostly built from off-the-shelf IPs:
SATA, NAND, SPI and so on.
The only thing which is completely braindead is the MAC.
It's using some kind of VLAN tagging to support three ports,
with only one descriptor chain for all three interfaces.
Hans Ulli
On Saturday 07 November 2015 11:56:44 Hans Ulli Kroll wrote:
> On Fri, 6 Nov 2015, Arnd Bergmann wrote:
> > On Friday 06 November 2015 16:04:08 Catalin Marinas wrote:
> > > On Fri, Nov 06, 2015 at 10:57:58AM +0100, Arnd Bergmann wrote:
> > > > On Thursday 05 November 2015 18:27:18 Catalin Marinas wrote:
> > > > > On Wed, Nov 04, 2015 at 02:55:01PM -0800, Linus Torvalds wrote:
> > > > > > On Wed, Nov 4, 2015 at 10:25 AM, Catalin Marinas <[email protected]> wrote:
> > > > > > It's good for single-process loads - if you do a lot of big fortran
> > > > > > jobs, or a lot of big database loads, and nothing else, you're fine.
> > > > >
> > > > > These are some of the arguments from the server camp: specific
> > > > > workloads.
> > > >
> > > > I think (a little overgeneralized), you want 4KB pages for any file
> > > > based mappings,
> > >
> > > In general, yes, but if the main/only workload on your server is mapping
> > > large db files, the memory usage cost may be amortised.
> >
> > This will still only do you good for a database that is read into memory
> > once and not written much, and at that point you can as well use hugepages.
> >
> > The problems for using 64kb page cache on file mappings are
> >
> > - while you normally want some readahead, the larger pages also result
> > in read-behind, so you have to actually transfer data from disk into
> > RAM without ever accessing it.
> >
> > - When you write the data, you have to write the full 64K page because
> > that is the granularity of your dirty bit tracking.
> >
> > So even if you don't care at all about memory consumption, you are
> > still transferring several times more data to and from your drives.
> > As mentioned that can be a win on some storage devices, but usually
> > it's a loss.
> >
>
> there is also maybe a bigger problem.
> I know this from my Zyxel NAS540; this thing is built around the Mindspeed
> Comcerto 2000 SoC.
>
> Zyxel is currently rolling back to 4KB page sizes in the upcoming
> 5.10 firmware release, because Mindspeed did something stupid:
>
> It's not possible to use a standard ARMv7 toolchain to build your
> own/needed userspace tools.
>
> And this is the change which causes the pain:
>
> diff --git a/arch/arm/include/asm/elf.h b/arch/arm/include/asm/elf.h
> -#define ELF_EXEC_PAGESIZE 4096
> +#define ELF_EXEC_PAGESIZE (PAGE_SIZE)
In ARM32 binutils, ELF_MAXPAGESIZE was changed last year to 64KB, so
binutils-2.25 or higher should support this by default, as long as you
recompile all user binaries.
> The SoC is mostly built from off-the-shelf IPs:
> SATA, NAND, SPI and so on.
> The only thing which is completely braindead is the MAC.
> It's using some kind of VLAN tagging to support three ports,
> with only one descriptor chain for all three interfaces.
You mean they used 64KB logical page sizes to work around a broken
ethernet MAC?
Arnd
On Sat, 7 Nov 2015, Arnd Bergmann wrote:
> On Saturday 07 November 2015 11:56:44 Hans Ulli Kroll wrote:
> > On Fri, 6 Nov 2015, Arnd Bergmann wrote:
> > > On Friday 06 November 2015 16:04:08 Catalin Marinas wrote:
> > > > On Fri, Nov 06, 2015 at 10:57:58AM +0100, Arnd Bergmann wrote:
> > > > > On Thursday 05 November 2015 18:27:18 Catalin Marinas wrote:
> > > > > > On Wed, Nov 04, 2015 at 02:55:01PM -0800, Linus Torvalds wrote:
> > > > > > > On Wed, Nov 4, 2015 at 10:25 AM, Catalin Marinas <[email protected]> wrote:
> > > > > > > It's good for single-process loads - if you do a lot of big fortran
> > > > > > > jobs, or a lot of big database loads, and nothing else, you're fine.
> > > > > >
> > > > > > These are some of the arguments from the server camp: specific
> > > > > > workloads.
> > > > >
> > > > > I think (a little overgeneralized), you want 4KB pages for any file
> > > > > based mappings,
> > > >
> > > > In general, yes, but if the main/only workload on your server is mapping
> > > > large db files, the memory usage cost may be amortised.
> > >
> > > This will still only do you good for a database that is read into memory
> > > once and not written much, and at that point you can as well use hugepages.
> > >
> > > The problems for using 64kb page cache on file mappings are
> > >
> > > - while you normally want some readahead, the larger pages also result
> > > in read-behind, so you have to actually transfer data from disk into
> > > RAM without ever accessing it.
> > >
> > > - When you write the data, you have to write the full 64K page because
> > > that is the granularity of your dirty bit tracking.
> > >
> > > So even if you don't care at all about memory consumption, you are
> > > still transferring several times more data to and from your drives.
> > > As mentioned that can be a win on some storage devices, but usually
> > > it's a loss.
> > >
> >
> > there is also maybe a bigger problem.
> > I know this from my Zyxel NAS540; this thing is built around the Mindspeed
> > Comcerto 2000 SoC.
> >
> > Zyxel is currently rolling back to 4KB page sizes in the upcoming
> > 5.10 firmware release, because Mindspeed did something stupid:
> >
> > It's not possible to use a standard ARMv7 toolchain to build your
> > own/needed userspace tools.
> >
> > And this is the change which causes the pain:
> >
> > diff --git a/arch/arm/include/asm/elf.h b/arch/arm/include/asm/elf.h
> > -#define ELF_EXEC_PAGESIZE 4096
> > +#define ELF_EXEC_PAGESIZE (PAGE_SIZE)
>
> In ARM32 binutils, ELF_MAXPAGESIZE was changed last year to 64KB, so
> binutils-2.25 or higher should support this by default, as long as you
> recompile all user binaries.
>
Thanks for the hint ...
> > The SoC is mostly built from off-the-shelf IPs:
> > SATA, NAND, SPI and so on.
> > The only thing which is completely braindead is the MAC.
> > It's using some kind of VLAN tagging to support three ports,
> > with only one descriptor chain for all three interfaces.
>
> You mean they used 64KB logical page sizes to work around a broken
> ethernet MAC?
>
> Arnd
>
No.
The MAC is some other issue:
I think the main purpose of this design is to use this SoC for Deep
Packet Inspection in HW.
They also use this MAC (or its queue) for en/decrypting traffic
for the WIFI devices; in the sources it's called vwd
-> virtual wireless device.
IP-Stack -> VWD -> WIFI-DEV
FYI, here is the datasheet (only 4 pages):
http://downloads.codico.com/MISC/Newsletter/2013/2013_02/862xx-BRF-001-M_C2K.pdf
But this is too much to wrap my head around.
Hans Ulli
On 11/6/15, 11:04 AM, Catalin Marinas wrote:
> On Fri, Nov 06, 2015 at 10:57:58AM +0100, Arnd Bergmann wrote:
>> On Thursday 05 November 2015 18:27:18 Catalin Marinas wrote:
>>> On Wed, Nov 04, 2015 at 02:55:01PM -0800, Linus Torvalds wrote:
>>>> On Wed, Nov 4, 2015 at 10:25 AM, Catalin Marinas <[email protected]> wrote:
>>>> It's good for single-process loads - if you do a lot of big fortran
>>>> jobs, or a lot of big database loads, and nothing else, you're fine.
>>>
>>> These are some of the arguments from the server camp: specific
>>> workloads.
On our end, I asked our performance folks (and many others) about 3 or 4
years ago what they thought would make sense. The numbers suggested that
16KB might have been ideal (for specific targeted workloads), but since
that was optional in the architecture (as a later addition) that meant
"does not exist" as far as server/general purpose goes. Which lead to
more conversation, followed ultimately by the 64KB choice. The decision
to go to 64KB was in part based upon various discussion that suggested
this size was appropriate for workloads, but it is something that is
under evaluation. And obviously the number of threads on the topic is
not something that is ignored. 4KB with contiguous hint + huge pages
might well end up being the sweet spot in the longer term.
One of the purposes of Red Hat Enterprise Linux Server for ARM (RHELSA)
Development Preview (which I know just rolls off the tongue) is to test
the water with various decisions and see what works out, and what does
not. If 64KB does indeed turn out to be a poor decision then the page
size will be reverted to 4KB at some future time. But it is only once we
have some of the higher end mainstream systems running RHELSA (like we
do now) that we can start to actually look at real data and decide.
In addition to the TLB/hardware walker (micro)cache impact of page size
in terms of levels of walk through the tables (but we have cont. hint
and aggressive microcaches of interim levels to help us with this),
there is also the potential impact upon cache design. True we mostly
claim to be PIPT but underneath implementations might well be able to
optimize the (parallel) indexing stage given a larger page size. In many
conversations over the past few years with the architects building the
impending tsunami of high end v8 server cores, no objections have been
raised against the choice of 64KB in the first go around.
Anyway. We'll all watch and see :)
Jon.