2015-11-25 21:51:12

by Stephen Boyd

Subject: [PATCH v2 0/2] ARM: Use udiv/sdiv for __aeabi_{u}idiv library functions

This is a respin of a patch series from about a year ago[1]. I realized
that we already had most of the code in recordmcount to figure out
where we make calls to particular functions, so recording where
we make calls to the integer division functions using the same design
should be easy enough. Looking back on the thread, it seems Måns was
thinking along the same lines, although that wasn't obvious to me back
then, or even over the last few days while I was writing this.

This series copies and reworks recordmcount to record the locations of the
calls to the library integer division functions on ARM builds, and puts those
locations into a table that we use to patch instructions at boot. The first
patch adds the recorduidiv program, and the second patch implements the
runtime patching for modules and kernel code. The module part hooks into
the relocation patching code we already have.
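The patching idea can be sketched in plain C: the build step leaves behind a table of pointers to 32-bit call sites, and at boot each entry is overwritten with the division instruction itself. This is a minimal host-side model, not the kernel code; the function name is mine, and the instruction encoding is the one used later in this series for ARM-mode "udiv r0, r0, r1":

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Minimal model of the boot-time fixup: each table entry points at a
 * 32-bit call site (originally a "bl __aeabi_uidiv"), and patching
 * overwrites that word with the division instruction itself.
 */
static void patch_sites(uint32_t **table, size_t nr, uint32_t insn)
{
	size_t i;

	for (i = 0; i < nr; i++)
		*table[i] = insn;
}
```

The real kernel code additionally has to pick the right encoding (ARM vs. Thumb-2, PJ4) and deal with instruction-cache coherency, which this sketch ignores.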

The discussion about adding an ARMv7VE build target, which would make
this patchset less necessary, is still ongoing, so I'm just sending out
these patches for now. Once we come to a conclusion on how to implement
ARMv7VE I'll finish up those patches.

Comments/feedback appreciated.

Changes from v1:
* Add support for PJ4 udiv/sdiv mrc opcodes
* Create a new program for recording uidiv instead of modifying recordmcount

[1] http://lkml.kernel.org/r/[email protected]

Cc: Nicolas Pitre <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Måns Rullgård <[email protected]>
Cc: Thomas Petazzoni <[email protected]>
Cc: Michal Marek <[email protected]>
Cc: Russell King - ARM Linux <[email protected]>

Stephen Boyd (2):
scripts: Add a recorduidiv program
ARM: Replace calls to __aeabi_{u}idiv with udiv/sdiv instructions

arch/arm/Kconfig | 14 +
arch/arm/include/asm/cputype.h | 8 +-
arch/arm/include/asm/setup.h | 3 +
arch/arm/kernel/module.c | 40 +++
arch/arm/kernel/setup.c | 68 ++++
arch/arm/kernel/vmlinux.lds.S | 13 +
scripts/Makefile | 1 +
scripts/Makefile.build | 17 +-
scripts/recorduidiv.c | 745 +++++++++++++++++++++++++++++++++++++++++
9 files changed, 906 insertions(+), 3 deletions(-)
create mode 100644 scripts/recorduidiv.c

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


2015-11-25 21:51:37

by Stephen Boyd

Subject: [PATCH v2 1/2] scripts: Add a recorduidiv program

The ARM compiler inserts calls to __aeabi_uidiv() and
__aeabi_idiv() when it needs to perform division on signed and
unsigned integers. If a processor supports the udiv and sdiv
division instructions, the calls to these support routines can
be replaced with those instructions. Therefore, record the
location of calls to these library functions in two sections
(one for udiv and one for sdiv), similar to how we trace calls to
mcount. When the kernel boots up it will check to see if the
processor supports the instructions and then patch the call sites
with the appropriate instruction.
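The sifting step this patch performs can be illustrated with a simplified, host-endian model: walk a section's relocation entries and keep the offsets of branch-type relocations whose symbol is __aeabi_uidiv or __aeabi_idiv. The struct and helper names below are mine; the real code in recorduidiv.c also handles cross-endian files and the REL vs. RELA distinction:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified host-endian view of an Elf32_Rel entry. */
struct rel32 {
	uint32_t r_offset;
	uint32_t r_info;	/* sym << 8 | type */
};

#define REL_SYM(info)	((info) >> 8)
#define REL_TYPE(info)	((info) & 0xff)
#define R_ARM_PC24	1
#define R_ARM_CALL	28
#define R_ARM_JUMP24	29

/*
 * Collect the section offsets of branch relocations against a given
 * symbol index; this mirrors what sift_relocations() does for the
 * __aeabi_uidiv/__aeabi_idiv symbols.
 */
static size_t collect_call_sites(const struct rel32 *rels, size_t nrel,
				 uint32_t target_sym, uint32_t *out)
{
	size_t i, n = 0;

	for (i = 0; i < nrel; i++) {
		uint32_t type = REL_TYPE(rels[i].r_info);

		if (type != R_ARM_PC24 && type != R_ARM_CALL &&
		    type != R_ARM_JUMP24)
			continue;
		if (REL_SYM(rels[i].r_info) == target_sym)
			out[n++] = rels[i].r_offset;
	}
	return n;
}
```

The collected offsets are what end up in the __udiv_loc/__idiv_loc sections, each entry relocated against a non-weak symbol in the text section so the table stays valid after final linking.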

Cc: Nicolas Pitre <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Måns Rullgård <[email protected]>
Cc: Michal Marek <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>
---
scripts/Makefile | 1 +
scripts/Makefile.build | 17 +-
scripts/recorduidiv.c | 745 +++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 761 insertions(+), 2 deletions(-)
create mode 100644 scripts/recorduidiv.c

diff --git a/scripts/Makefile b/scripts/Makefile
index fd0d53d4a234..26b9ebd8be31 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -15,6 +15,7 @@ hostprogs-$(CONFIG_KALLSYMS) += kallsyms
hostprogs-$(CONFIG_LOGO) += pnmtologo
hostprogs-$(CONFIG_VT) += conmakehash
hostprogs-$(BUILD_C_RECORDMCOUNT) += recordmcount
+hostprogs-$(CONFIG_ARM_PATCH_UIDIV) += recorduidiv
hostprogs-$(CONFIG_BUILDTIME_EXTABLE_SORT) += sortextable
hostprogs-$(CONFIG_ASN1) += asn1_compiler
hostprogs-$(CONFIG_MODULE_SIG) += sign-file
diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 01df30af4d4a..142e5db29d3b 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -241,12 +241,25 @@ cmd_record_mcount = \
fi;
endif

+ifdef CONFIG_ARM_PATCH_UIDIV
+# Due to recursion, we must skip empty.o.
+# The empty.o file is created in the make process in order to determine
+# the target endianness and word size. It is made before all other C
+# files, including recorduidiv.
+cmd_record_uidiv = \
+ if [ $(@) != "scripts/mod/empty.o" ]; then \
+ $(objtree)/scripts/recorduidiv "$(@)"; \
+ fi;
+recorduidiv_source := $(srctree)/scripts/recorduidiv.c
+endif
+
define rule_cc_o_c
$(call echo-cmd,checksrc) $(cmd_checksrc) \
$(call echo-cmd,cc_o_c) $(cmd_cc_o_c); \
$(cmd_modversions) \
$(call echo-cmd,record_mcount) \
$(cmd_record_mcount) \
+ $(call echo-cmd,record_uidiv) $(cmd_record_uidiv) \
scripts/basic/fixdep $(depfile) $@ '$(call make-cmd,cc_o_c)' > \
$(dot-target).tmp; \
rm -f $(depfile); \
@@ -254,13 +267,13 @@ define rule_cc_o_c
endef

# Built-in and composite module parts
-$(obj)/%.o: $(src)/%.c $(recordmcount_source) FORCE
+$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(recorduidiv_source) FORCE
$(call cmd,force_checksrc)
$(call if_changed_rule,cc_o_c)

# Single-part modules are special since we need to mark them in $(MODVERDIR)

-$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) FORCE
+$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(recorduidiv_source) FORCE
$(call cmd,force_checksrc)
$(call if_changed_rule,cc_o_c)
@{ echo $(@:.o=.ko); echo $@; } > $(MODVERDIR)/$(@F:.o=.mod)
diff --git a/scripts/recorduidiv.c b/scripts/recorduidiv.c
new file mode 100644
index 000000000000..a2af5c46cb6d
--- /dev/null
+++ b/scripts/recorduidiv.c
@@ -0,0 +1,745 @@
+/*
+ * recorduidiv.c: construct a table of the locations of calls to '__aeabi_uidiv'
+ * and '__aeabi_idiv' so that the kernel can replace them with udiv and sdiv
+ * instructions.
+ *
+ * Copyright 2009 John F. Reiser <[email protected]>. All rights reserved.
+ * Licensed under the GNU General Public License, version 2 (GPLv2).
+ *
+ * Restructured to fit Linux format, as well as other updates:
+ * Copyright 2010 Steven Rostedt <[email protected]>, Red Hat Inc.
+ *
+ * Copyright (c) 2015 The Linux Foundation. All rights reserved.
+ */
+
+/*
+ * Strategy: alter the .o file in-place.
+ *
+ * Append a new STRTAB that has the new section names, followed by a new array
+ * ElfXX_Shdr[] that has the new section headers, followed by the section
+ * contents for __udiv_loc and __idiv_loc and their relocations. The old
+ * shstrtab strings, and the old ElfXX_Shdr[] array, remain as "garbage"
+ * (commonly, a couple kilobytes.) Subsequent processing by /bin/ld (or the
+ * kernel module loader) will ignore the garbage regions, because they are not
+ * designated by the new .e_shoff nor the new ElfXX_Shdr[]. [To remove
+ * the garbage, use "ld -r" to create a new file that omits it.]
+ */
+
+#include <sys/types.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <elf.h>
+#include <fcntl.h>
+#include <setjmp.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+# define _align 3u
+# define _size 4
+
+#ifndef R_ARM_CALL
+#define R_ARM_CALL 28
+#endif
+
+#ifndef R_ARM_JUMP24
+#define R_ARM_JUMP24 29
+#endif
+
+static int fd_map; /* File descriptor for file being modified. */
+static int mmap_failed; /* Boolean flag. */
+static void *ehdr_curr; /* current ElfXX_Ehdr * for resource cleanup */
+static struct stat sb; /* Remember .st_size, etc. */
+static jmp_buf jmpenv; /* setjmp/longjmp per-file error escape */
+
+/* setjmp() return values */
+enum {
+ SJ_SETJMP = 0, /* hardwired first return */
+ SJ_FAIL,
+ SJ_SUCCEED
+};
+
+/* Per-file resource cleanup when multiple files. */
+static void
+cleanup(void)
+{
+ if (!mmap_failed)
+ munmap(ehdr_curr, sb.st_size);
+ else
+ free(ehdr_curr);
+ close(fd_map);
+}
+
+static void __attribute__((noreturn))
+fail_file(void)
+{
+ cleanup();
+ longjmp(jmpenv, SJ_FAIL);
+}
+
+static void __attribute__((noreturn))
+succeed_file(void)
+{
+ cleanup();
+ longjmp(jmpenv, SJ_SUCCEED);
+}
+
+/* ulseek, uread, ...: Check return value for errors. */
+
+static off_t
+ulseek(int const fd, off_t const offset, int const whence)
+{
+ off_t const w = lseek(fd, offset, whence);
+ if (w == (off_t)-1) {
+ perror("lseek");
+ fail_file();
+ }
+ return w;
+}
+
+static size_t
+uread(int const fd, void *const buf, size_t const count)
+{
+ size_t const n = read(fd, buf, count);
+ if (n != count) {
+ perror("read");
+ fail_file();
+ }
+ return n;
+}
+
+static size_t
+uwrite(int const fd, void const *const buf, size_t const count)
+{
+ size_t const n = write(fd, buf, count);
+ if (n != count) {
+ perror("write");
+ fail_file();
+ }
+ return n;
+}
+
+static void *
+umalloc(size_t size)
+{
+ void *const addr = malloc(size);
+ if (addr == 0) {
+ fprintf(stderr, "malloc failed: %zu bytes\n", size);
+ fail_file();
+ }
+ return addr;
+}
+
+/*
+ * Get the whole file as a programming convenience in order to avoid
+ * malloc+lseek+read+free of many pieces. If successful, then mmap
+ * avoids copying unused pieces; else just read the whole file.
+ * Open for both read and write; new info will be appended to the file.
+ * Use MAP_PRIVATE so that a few changes to the in-memory ElfXX_Ehdr
+ * do not propagate to the file until an explicit overwrite at the last.
+ * This preserves most aspects of consistency (all except .st_size)
+ * for simultaneous readers of the file while we are appending to it.
+ * However, multiple writers still are bad. We choose not to use
+ * locking because it is expensive and the use case of kernel build
+ * makes multiple writers unlikely.
+ */
+static void *mmap_file(char const *fname)
+{
+ void *addr;
+
+ fd_map = open(fname, O_RDWR);
+ if (fd_map < 0 || fstat(fd_map, &sb) < 0) {
+ perror(fname);
+ fail_file();
+ }
+ if (!S_ISREG(sb.st_mode)) {
+ fprintf(stderr, "not a regular file: %s\n", fname);
+ fail_file();
+ }
+ addr = mmap(0, sb.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE,
+ fd_map, 0);
+ mmap_failed = 0;
+ if (addr == MAP_FAILED) {
+ mmap_failed = 1;
+ addr = umalloc(sb.st_size);
+ uread(fd_map, addr, sb.st_size);
+ }
+ return addr;
+}
+
+/* w8rev, w8nat, ...: Handle endianness. */
+
+static uint64_t w8rev(uint64_t const x)
+{
+ return ((0xff & (x >> (0 * 8))) << (7 * 8))
+ | ((0xff & (x >> (1 * 8))) << (6 * 8))
+ | ((0xff & (x >> (2 * 8))) << (5 * 8))
+ | ((0xff & (x >> (3 * 8))) << (4 * 8))
+ | ((0xff & (x >> (4 * 8))) << (3 * 8))
+ | ((0xff & (x >> (5 * 8))) << (2 * 8))
+ | ((0xff & (x >> (6 * 8))) << (1 * 8))
+ | ((0xff & (x >> (7 * 8))) << (0 * 8));
+}
+
+static uint32_t w4rev(uint32_t const x)
+{
+ return ((0xff & (x >> (0 * 8))) << (3 * 8))
+ | ((0xff & (x >> (1 * 8))) << (2 * 8))
+ | ((0xff & (x >> (2 * 8))) << (1 * 8))
+ | ((0xff & (x >> (3 * 8))) << (0 * 8));
+}
+
+static uint32_t w2rev(uint16_t const x)
+{
+ return ((0xff & (x >> (0 * 8))) << (1 * 8))
+ | ((0xff & (x >> (1 * 8))) << (0 * 8));
+}
+
+static uint64_t w8nat(uint64_t const x)
+{
+ return x;
+}
+
+static uint32_t w4nat(uint32_t const x)
+{
+ return x;
+}
+
+static uint32_t w2nat(uint16_t const x)
+{
+ return x;
+}
+
+static uint64_t (*w8)(uint64_t);
+static uint32_t (*w)(uint32_t);
+static uint32_t (*w2)(uint16_t);
+
+/* Names of the sections that could contain calls to __aeabi_{u}idiv() */
+static int is_valid_section_name(char const *const txtname)
+{
+ return strcmp(".text", txtname) == 0 ||
+ strcmp(".ref.text", txtname) == 0 ||
+ strcmp(".sched.text", txtname) == 0 ||
+ strcmp(".spinlock.text", txtname) == 0 ||
+ strcmp(".irqentry.text", txtname) == 0 ||
+ strcmp(".kprobes.text", txtname) == 0 ||
+ strcmp(".text.unlikely", txtname) == 0 ||
+ strcmp(".init.text", txtname) == 0;
+}
+
+static uint32_t Elf32_r_sym(Elf32_Rel const *rp)
+{
+ return ELF32_R_SYM(w(rp->r_info));
+}
+
+static void append_section(uint32_t const *const mloc0,
+ uint32_t const *const mlocp,
+ Elf32_Rel const *const mrel0,
+ Elf32_Rel const *const mrelp,
+ char const *name,
+ unsigned int const rel_entsize,
+ unsigned int const symsec_sh_link,
+ uint32_t *name_offp,
+ uint32_t *tp,
+ unsigned *shnump
+ )
+{
+ Elf32_Shdr mcsec;
+ uint32_t name_off = *name_offp;
+ uint32_t t = *tp;
+ uint32_t loc_diff = (void *)mlocp - (void *)mloc0;
+ uint32_t rel_diff = (void *)mrelp - (void *)mrel0;
+ unsigned shnum = *shnump;
+
+ mcsec.sh_name = w((sizeof(Elf32_Rela) == rel_entsize) + strlen(".rel")
+ + name_off);
+ mcsec.sh_type = w(SHT_PROGBITS);
+ mcsec.sh_flags = w(SHF_ALLOC);
+ mcsec.sh_addr = 0;
+ mcsec.sh_offset = w(t);
+ mcsec.sh_size = w(loc_diff);
+ mcsec.sh_link = 0;
+ mcsec.sh_info = 0;
+ mcsec.sh_addralign = w(_size);
+ mcsec.sh_entsize = w(_size);
+ uwrite(fd_map, &mcsec, sizeof(mcsec));
+ t += loc_diff;
+
+ mcsec.sh_name = w(name_off);
+ mcsec.sh_type = (sizeof(Elf32_Rela) == rel_entsize)
+ ? w(SHT_RELA)
+ : w(SHT_REL);
+ mcsec.sh_flags = 0;
+ mcsec.sh_addr = 0;
+ mcsec.sh_offset = w(t);
+ mcsec.sh_size = w(rel_diff);
+ mcsec.sh_link = w(symsec_sh_link);
+ mcsec.sh_info = w(shnum);
+ mcsec.sh_addralign = w(_size);
+ mcsec.sh_entsize = w(rel_entsize);
+ uwrite(fd_map, &mcsec, sizeof(mcsec));
+ t += rel_diff;
+
+ shnum += 2;
+ name_off += strlen(name) + 1;
+
+ *tp = t;
+ *shnump = shnum;
+ *name_offp = name_off;
+}
+
+/*
+ * Append the new shstrtab, Elf32_Shdr[], __{udiv,idiv}_loc and their
+ * relocations.
+ */
+static void append_func(Elf32_Ehdr *const ehdr,
+ Elf32_Shdr *const shstr,
+ uint32_t const *const mloc0_u,
+ uint32_t const *const mlocp_u,
+ Elf32_Rel const *const mrel0_u,
+ Elf32_Rel const *const mrelp_u,
+ uint32_t const *const mloc0_i,
+ uint32_t const *const mlocp_i,
+ Elf32_Rel const *const mrel0_i,
+ Elf32_Rel const *const mrelp_i,
+ unsigned int const rel_entsize,
+ unsigned int const symsec_sh_link)
+{
+ /* Begin constructing output file */
+ char const *udiv_name = (sizeof(Elf32_Rela) == rel_entsize)
+ ? ".rela__udiv_loc"
+ : ".rel__udiv_loc";
+ char const *idiv_name = (sizeof(Elf32_Rela) == rel_entsize)
+ ? ".rela__idiv_loc"
+ : ".rel__idiv_loc";
+ unsigned old_shnum = w2(ehdr->e_shnum);
+ uint32_t const old_shoff = w(ehdr->e_shoff);
+ uint32_t const old_shstr_sh_size = w(shstr->sh_size);
+ uint32_t const old_shstr_sh_offset = w(shstr->sh_offset);
+ uint32_t new_e_shoff;
+ uint32_t t = w(shstr->sh_size);
+ uint32_t name_off = old_shstr_sh_size;
+ int udiv = 0, idiv = 0;
+ int num_sections;
+
+ if (mlocp_u != mloc0_u) {
+ t += 1 + strlen(udiv_name);
+ udiv = 1;
+ }
+ if (mlocp_i != mloc0_i) {
+ t += 1 + strlen(idiv_name);
+ idiv = 1;
+ }
+ num_sections = (udiv * 2) + (idiv * 2);
+
+ shstr->sh_size = w(t);
+ shstr->sh_offset = w(sb.st_size);
+ t += sb.st_size;
+ t += (_align & -t); /* word-byte align */
+ new_e_shoff = t;
+
+ /* body for new shstrtab */
+ ulseek(fd_map, sb.st_size, SEEK_SET);
+ uwrite(fd_map, old_shstr_sh_offset + (void *)ehdr, name_off);
+ if (udiv)
+ uwrite(fd_map, udiv_name, 1 + strlen(udiv_name));
+ if (idiv)
+ uwrite(fd_map, idiv_name, 1 + strlen(idiv_name));
+
+ /* old(modified) Elf32_Shdr table, word-byte aligned */
+ ulseek(fd_map, t, SEEK_SET);
+ t += sizeof(Elf32_Shdr) * old_shnum;
+ uwrite(fd_map, old_shoff + (void *)ehdr,
+ sizeof(Elf32_Shdr) * old_shnum);
+
+ t += num_sections * sizeof(Elf32_Shdr);
+
+ /* new sections __udiv_loc and .rel__udiv_loc */
+ if (udiv)
+ append_section(mloc0_u, mlocp_u, mrel0_u, mrelp_u, udiv_name,
+ rel_entsize, symsec_sh_link, &name_off, &t,
+ &old_shnum);
+
+ /* new sections __idiv_loc and .rel__idiv_loc */
+ if (idiv)
+ append_section(mloc0_i, mlocp_i, mrel0_i, mrelp_i, idiv_name,
+ rel_entsize, symsec_sh_link, &name_off, &t,
+ &old_shnum);
+
+ if (udiv) {
+ uwrite(fd_map, mloc0_u, (void *)mlocp_u - (void *)mloc0_u);
+ uwrite(fd_map, mrel0_u, (void *)mrelp_u - (void *)mrel0_u);
+ }
+ if (idiv) {
+ uwrite(fd_map, mloc0_i, (void *)mlocp_i - (void *)mloc0_i);
+ uwrite(fd_map, mrel0_i, (void *)mrelp_i - (void *)mrel0_i);
+ }
+
+ ehdr->e_shoff = w(new_e_shoff);
+ ehdr->e_shnum = w2(num_sections + w2(ehdr->e_shnum));
+ ulseek(fd_map, 0, SEEK_SET);
+ uwrite(fd_map, ehdr, sizeof(*ehdr));
+}
+
+static unsigned get_sym(Elf32_Sym const *const sym0,
+ Elf32_Rel const *relp,
+ char const *const str0, const char *find)
+{
+ unsigned sym = 0;
+ Elf32_Sym const *const symp = &sym0[Elf32_r_sym(relp)];
+ char const *symname = &str0[w(symp->st_name)];
+
+ if (strcmp(find, symname) == 0)
+ sym = Elf32_r_sym(relp);
+
+ return sym;
+}
+
+static void get_sym_str_and_relp(Elf32_Shdr const *const relhdr,
+ Elf32_Ehdr const *const ehdr,
+ Elf32_Sym const **sym0,
+ char const **str0,
+ Elf32_Rel const **relp)
+{
+ Elf32_Shdr *const shdr0 = (Elf32_Shdr *)(w(ehdr->e_shoff)
+ + (void *)ehdr);
+ unsigned const symsec_sh_link = w(relhdr->sh_link);
+ Elf32_Shdr const *const symsec = &shdr0[symsec_sh_link];
+ Elf32_Shdr const *const strsec = &shdr0[w(symsec->sh_link)];
+ Elf32_Rel const *const rel0 = (Elf32_Rel const *)(w(relhdr->sh_offset)
+ + (void *)ehdr);
+
+ *sym0 = (Elf32_Sym const *)(w(symsec->sh_offset)
+ + (void *)ehdr);
+
+ *str0 = (char const *)(w(strsec->sh_offset)
+ + (void *)ehdr);
+
+ *relp = rel0;
+}
+
+static void add_relocation(Elf32_Rel const *relp, uint32_t *mloc0, uint32_t **mlocpp,
+ uint32_t const recval, unsigned const recsym,
+ Elf32_Rel **const mrelpp, unsigned offbase,
+ unsigned rel_entsize)
+{
+ uint32_t *mlocp = *mlocpp;
+ Elf32_Rel *mrelp = *mrelpp;
+ uint32_t const addend = w(w(relp->r_offset) - recval);
+ mrelp->r_offset = w(offbase + ((void *)mlocp - (void *)mloc0));
+ mrelp->r_info = w(ELF32_R_INFO(recsym, R_ARM_ABS32));
+ if (rel_entsize == sizeof(Elf32_Rela)) {
+ ((Elf32_Rela *)mrelp)->r_addend = addend;
+ *mlocp++ = 0;
+ } else
+ *mlocp++ = addend;
+
+ *mlocpp = mlocp;
+ *mrelpp = (Elf32_Rel *)(rel_entsize + (void *)mrelp);
+}
+
+/*
+ * Look at the relocations in order to find the calls to __aeabi_{u}idiv.
+ * Accumulate the section offsets that are found, and their relocation info,
+ * onto the end of the existing arrays.
+ */
+static void sift_relocations(uint32_t **mlocpp_u, uint32_t *mloc_base_u,
+ Elf32_Rel **const mrelpp_u,
+ uint32_t **mlocpp_i,
+ uint32_t *mloc_base_i,
+ Elf32_Rel **const mrelpp_i,
+ Elf32_Shdr const *const relhdr,
+ Elf32_Ehdr const *const ehdr,
+ unsigned const recsym, uint32_t const recval)
+{
+ uint32_t *mlocp_u = *mlocpp_u;
+ unsigned const offbase_u = (void *)mlocp_u - (void *)mloc_base_u;
+ uint32_t *const mloc0_u = mlocp_u;
+ Elf32_Rel *mrelp_u = *mrelpp_u;
+ uint32_t *mlocp_i = *mlocpp_i;
+ unsigned const offbase_i = (void *)mlocp_i - (void *)mloc_base_i;
+ uint32_t *const mloc0_i = mlocp_i;
+ Elf32_Rel *mrelp_i = *mrelpp_i;
+ Elf32_Sym const *sym0;
+ char const *str0;
+ Elf32_Rel const *relp;
+ unsigned rel_entsize = w(relhdr->sh_entsize);
+ unsigned const nrel = w(relhdr->sh_size) / rel_entsize;
+ unsigned udiv_sym = 0, idiv_sym = 0;
+ unsigned t;
+
+ get_sym_str_and_relp(relhdr, ehdr, &sym0, &str0, &relp);
+
+ for (t = nrel; t; --t) {
+ if (!udiv_sym)
+ udiv_sym = get_sym(sym0, relp, str0,
+ "__aeabi_uidiv");
+ if (!idiv_sym)
+ idiv_sym = get_sym(sym0, relp, str0,
+ "__aeabi_idiv");
+
+ switch (w(relp->r_info) & 0xff) {
+ case R_ARM_PC24:
+ case R_ARM_CALL:
+ case R_ARM_JUMP24:
+ if (udiv_sym == Elf32_r_sym(relp)) {
+ add_relocation(relp, mloc0_u, &mlocp_u, recval,
+ recsym, &mrelp_u, offbase_u,
+ rel_entsize);
+ } else if (idiv_sym == Elf32_r_sym(relp)) {
+ add_relocation(relp, mloc0_i, &mlocp_i, recval,
+ recsym, &mrelp_i, offbase_i,
+ rel_entsize);
+ }
+ }
+
+ relp = (Elf32_Rel const *)(rel_entsize + (void *)relp);
+ }
+ *mrelpp_i = mrelp_i;
+ *mrelpp_u = mrelp_u;
+ *mlocpp_i = mlocp_i;
+ *mlocpp_u = mlocp_u;
+}
+
+/*
+ * Find a symbol in the given section, to be used as the base for relocating
+ * the table of offsets of calls. A local or global symbol suffices,
+ * but avoid a Weak symbol because it may be overridden; the change in value
+ * would invalidate the relocations of the offsets of the calls.
+ * Often the found symbol will be the unnamed local symbol generated by
+ * GNU 'as' for the start of each section. For example:
+ * Num: Value Size Type Bind Vis Ndx Name
+ * 2: 00000000 0 SECTION LOCAL DEFAULT 1
+ */
+static unsigned find_secsym_ndx(unsigned const txtndx,
+ char const *const txtname,
+ uint32_t *const recvalp,
+ Elf32_Shdr const *const symhdr,
+ Elf32_Ehdr const *const ehdr)
+{
+ Elf32_Sym const *const sym0 = (Elf32_Sym const *)(w(symhdr->sh_offset)
+ + (void *)ehdr);
+ unsigned const nsym = w(symhdr->sh_size) / w(symhdr->sh_entsize);
+ Elf32_Sym const *symp;
+ unsigned t;
+
+ for (symp = sym0, t = nsym; t; --t, ++symp) {
+ unsigned int const st_bind = ELF32_ST_BIND(symp->st_info);
+
+ if (txtndx == w2(symp->st_shndx)
+ /* avoid STB_WEAK */
+ && (STB_LOCAL == st_bind || STB_GLOBAL == st_bind)) {
+ /* function symbols on ARM have quirks, avoid them */
+ if (ELF32_ST_TYPE(symp->st_info) == STT_FUNC)
+ continue;
+
+ *recvalp = w(symp->st_value);
+ return symp - sym0;
+ }
+ }
+ fprintf(stderr, "Cannot find symbol for section %d: %s.\n",
+ txtndx, txtname);
+ fail_file();
+}
+
+
+/* Evade ISO C restriction: no declaration after statement in has_rel. */
+static char const *
+__has_rel(Elf32_Shdr const *const relhdr, /* is SHT_REL or SHT_RELA */
+ Elf32_Shdr const *const shdr0,
+ char const *const shstrtab,
+ char const *const fname)
+{
+ /* .sh_info depends on .sh_type == SHT_REL[,A] */
+ Elf32_Shdr const *const txthdr = &shdr0[w(relhdr->sh_info)];
+ char const *const txtname = &shstrtab[w(txthdr->sh_name)];
+
+ if (strcmp("__idiv_loc", txtname) == 0 ||
+ strcmp("__udiv_loc", txtname) == 0)
+ succeed_file();
+ if (w(txthdr->sh_type) != SHT_PROGBITS ||
+ !(w(txthdr->sh_flags) & SHF_EXECINSTR))
+ return NULL;
+ return txtname;
+}
+
+static char const *has_rel(Elf32_Shdr const *const relhdr,
+ Elf32_Shdr const *const shdr0,
+ char const *const shstrtab,
+ char const *const fname)
+{
+ if (w(relhdr->sh_type) != SHT_REL && w(relhdr->sh_type) != SHT_RELA)
+ return NULL;
+ return __has_rel(relhdr, shdr0, shstrtab, fname);
+}
+
+
+static unsigned tot_relsize(Elf32_Shdr const *const shdr0,
+ unsigned nhdr,
+ const char *const shstrtab,
+ const char *const fname)
+{
+ unsigned totrelsz = 0;
+ Elf32_Shdr const *shdrp = shdr0;
+ char const *txtname;
+
+ for (; nhdr; --nhdr, ++shdrp) {
+ txtname = has_rel(shdrp, shdr0, shstrtab, fname);
+ if (txtname && is_valid_section_name(txtname))
+ totrelsz += w(shdrp->sh_size);
+ }
+ return totrelsz;
+}
+
+
+/* Overall supervision for Elf32 ET_REL file. */
+static void
+do_func(Elf32_Ehdr *const ehdr, char const *const fname)
+{
+ Elf32_Shdr *const shdr0 = (Elf32_Shdr *)(w(ehdr->e_shoff)
+ + (void *)ehdr);
+ unsigned const nhdr = w2(ehdr->e_shnum);
+ Elf32_Shdr *const shstr = &shdr0[w2(ehdr->e_shstrndx)];
+ char const *const shstrtab = (char const *)(w(shstr->sh_offset)
+ + (void *)ehdr);
+
+ Elf32_Shdr const *relhdr;
+ unsigned k;
+
+ /* Upper bound on space: assume all relevant relocs are valid. */
+ unsigned const totrelsz = tot_relsize(shdr0, nhdr, shstrtab, fname);
+ Elf32_Rel *const mrel0_u = umalloc(totrelsz);
+ Elf32_Rel * mrelp_u = mrel0_u;
+
+ /* 2*sizeof(address) <= sizeof(Elf32_Rel) */
+ uint32_t *const mloc0_u = umalloc(totrelsz>>1);
+ uint32_t * mlocp_u = mloc0_u;
+
+ Elf32_Rel *const mrel0_i = umalloc(totrelsz);
+ Elf32_Rel * mrelp_i = mrel0_i;
+
+ /* 2*sizeof(address) <= sizeof(Elf32_Rel) */
+ uint32_t *const mloc0_i = umalloc(totrelsz>>1);
+ uint32_t * mlocp_i = mloc0_i;
+
+ unsigned rel_entsize = 0;
+ unsigned symsec_sh_link = 0;
+
+ for (relhdr = shdr0, k = nhdr; k; --k, ++relhdr) {
+ char const *const txtname = has_rel(relhdr, shdr0, shstrtab,
+ fname);
+ if (txtname && is_valid_section_name(txtname)) {
+ uint32_t recval = 0;
+ unsigned const recsym = find_secsym_ndx(
+ w(relhdr->sh_info), txtname, &recval,
+ &shdr0[symsec_sh_link = w(relhdr->sh_link)],
+ ehdr);
+
+ rel_entsize = w(relhdr->sh_entsize);
+ sift_relocations(&mlocp_u, mloc0_u, &mrelp_u,
+ &mlocp_i, mloc0_i, &mrelp_i,
+ relhdr, ehdr, recsym, recval);
+ }
+ }
+ if (mloc0_u != mlocp_u || mloc0_i != mlocp_i) {
+ append_func(ehdr, shstr, mloc0_u, mlocp_u, mrel0_u, mrelp_u,
+ mloc0_i, mlocp_i, mrel0_i, mrelp_i,
+ rel_entsize, symsec_sh_link);
+ }
+ free(mrel0_u);
+ free(mloc0_u);
+ free(mrel0_i);
+ free(mloc0_i);
+}
+
+static void
+do_file(char const *const fname)
+{
+ Elf32_Ehdr *const ehdr = mmap_file(fname);
+
+ ehdr_curr = ehdr;
+ w = w4nat;
+ w2 = w2nat;
+ w8 = w8nat;
+ switch (ehdr->e_ident[EI_DATA]) {
+ static unsigned int const endian = 1;
+ default:
+ fprintf(stderr, "unrecognized ELF data encoding %d: %s\n",
+ ehdr->e_ident[EI_DATA], fname);
+ fail_file();
+ break;
+ case ELFDATA2LSB:
+ if (*(unsigned char const *)&endian != 1) {
+ /* main() is big endian, file.o is little endian. */
+ w = w4rev;
+ w2 = w2rev;
+ w8 = w8rev;
+ }
+ break;
+ case ELFDATA2MSB:
+ if (*(unsigned char const *)&endian != 0) {
+ /* main() is little endian, file.o is big endian. */
+ w = w4rev;
+ w2 = w2rev;
+ w8 = w8rev;
+ }
+ break;
+ } /* end switch */
+ if (memcmp(ELFMAG, ehdr->e_ident, SELFMAG) != 0
+ || w2(ehdr->e_type) != ET_REL
+ || ehdr->e_ident[EI_VERSION] != EV_CURRENT) {
+ fprintf(stderr, "unrecognized ET_REL file %s\n", fname);
+ fail_file();
+ }
+
+ if (w2(ehdr->e_ehsize) != sizeof(Elf32_Ehdr)
+ || w2(ehdr->e_shentsize) != sizeof(Elf32_Shdr)) {
+ fprintf(stderr,
+ "unrecognized ET_REL file: %s\n", fname);
+ fail_file();
+ }
+ do_func(ehdr, fname);
+
+ cleanup();
+}
+
+int
+main(int argc, char *argv[])
+{
+ int n_error = 0; /* gcc-4.3.0 false positive complaint */
+ int i;
+
+ if (argc < 2) {
+ fprintf(stderr, "usage: recorduidiv file.o...\n");
+ return 0;
+ }
+
+ /* Process each file in turn, allowing deep failure. */
+ for (i = 1; i < argc; i++) {
+ char *file = argv[i];
+ int const sjval = setjmp(jmpenv);
+
+ switch (sjval) {
+ default:
+ fprintf(stderr, "internal error: %s\n", file);
+ exit(1);
+ break;
+ case SJ_SETJMP: /* normal sequence */
+ /* Avoid problems if early cleanup() */
+ fd_map = -1;
+ ehdr_curr = NULL;
+ mmap_failed = 1;
+ do_file(file);
+ break;
+ case SJ_FAIL: /* error in do_file or below */
+ ++n_error;
+ break;
+ case SJ_SUCCEED: /* premature success */
+ /* do nothing */
+ break;
+ } /* end switch */
+ }
+ return !!n_error;
+}
--

2015-11-25 21:51:30

by Stephen Boyd

Subject: [PATCH v2 2/2] ARM: Replace calls to __aeabi_{u}idiv with udiv/sdiv instructions

The ARM compiler inserts calls to __aeabi_uidiv() and
__aeabi_idiv() when it needs to perform division on signed and
unsigned integers. If a processor supports the udiv and sdiv
division instructions, the calls to these support routines can
be replaced with those instructions. Now that the recorduidiv
program records the locations of calls to these library functions
in two sections (one for udiv and one for sdiv), iterate over
these sections early at boot and patch the call sites with the
appropriate division instruction when we determine that the
processor supports the division instructions. Using the division
instructions should be faster and less power intensive than
running the support code.
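For reference, the instruction selection can be modeled as a small pure function. The encodings below are the ones this patch uses in udiv_instruction(); the helper and its flag arguments are just an illustration, and the real code additionally byte-swaps via __opcode_to_mem_arm()/__opcode_to_mem_thumb32() and gates all patching on HWCAP_IDIVA (ARM) or HWCAP_IDIVT (Thumb):

```c
#include <stdint.h>

/*
 * Pick the raw encoding for an unsigned divide of r0 by r1, depending
 * on whether the kernel is Thumb-2 and whether we're on a PJ4 without
 * the MP extensions (which exposes division via a cp6 mrc instead).
 */
static uint32_t udiv_insn(int thumb2, int pj4_nomp)
{
	if (pj4_nomp)
		return 0xee300611u;	/* mrc p6, 1, r0, CR0, CR1, 0 */
	return thumb2 ? 0xfbb0f0f1u	/* Thumb-2 udiv r0, r0, r1 */
		      : 0xe730f110u;	/* ARM udiv r0, r0, r1 */
}
```

Because __aeabi_uidiv takes its arguments in r0/r1 and returns in r0, a single such instruction at the call site is a drop-in replacement for the whole library call.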

Cc: Nicolas Pitre <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Måns Rullgård <[email protected]>
Cc: Thomas Petazzoni <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>
---
arch/arm/Kconfig | 14 +++++++++
arch/arm/include/asm/cputype.h | 8 ++++-
arch/arm/include/asm/setup.h | 3 ++
arch/arm/kernel/module.c | 40 +++++++++++++++++++++++++
arch/arm/kernel/setup.c | 68 ++++++++++++++++++++++++++++++++++++++++++
arch/arm/kernel/vmlinux.lds.S | 13 ++++++++
6 files changed, 145 insertions(+), 1 deletion(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 0365cbbc9179..aa8bc7da6331 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1617,6 +1617,20 @@ config AEABI

To use this you need GCC version 4.0.0 or later.

+config ARM_PATCH_UIDIV
+ bool "Runtime patch calls to __aeabi_{u}idiv() with udiv/sdiv"
+ depends on CPU_32v7 && !XIP_KERNEL && AEABI
+ help
+ Some v7 CPUs have support for the udiv and sdiv instructions
+ that can be used in place of calls to __aeabi_uidiv and __aeabi_idiv
+ functions provided by the ARM runtime ABI.
+
+ Enabling this option allows the kernel to modify itself to replace
+ branches to these library functions with the udiv and sdiv
+ instructions themselves. Typically this will be faster and less
+ power intensive than running the library support code to do
+ integer division.
+
config OABI_COMPAT
bool "Allow old ABI binaries to run with this kernel (EXPERIMENTAL)"
depends on AEABI && !THUMB2_KERNEL
diff --git a/arch/arm/include/asm/cputype.h b/arch/arm/include/asm/cputype.h
index 85e374f873ac..48c77d422a0d 100644
--- a/arch/arm/include/asm/cputype.h
+++ b/arch/arm/include/asm/cputype.h
@@ -250,8 +250,14 @@ static inline int cpu_is_pj4(void)

return 0;
}
+
+static inline bool __attribute_const__ cpu_is_pj4_nomp(void)
+{
+ return read_cpuid_part() == 0x56005810;
+}
#else
-#define cpu_is_pj4() 0
+#define cpu_is_pj4() 0
+#define cpu_is_pj4_nomp() 0
#endif

static inline int __attribute_const__ cpuid_feature_extract_field(u32 features,
diff --git a/arch/arm/include/asm/setup.h b/arch/arm/include/asm/setup.h
index e0adb9f1bf94..a514552d5cbd 100644
--- a/arch/arm/include/asm/setup.h
+++ b/arch/arm/include/asm/setup.h
@@ -25,4 +25,7 @@ extern int arm_add_memory(u64 start, u64 size);
extern void early_print(const char *str, ...);
extern void dump_machine_table(void);

+extern void patch_udiv(void *addr, size_t size);
+extern void patch_sdiv(void *addr, size_t size);
+
#endif
diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index efdddcb97dd1..aa59e5cfe6a0 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -20,7 +20,9 @@
#include <linux/string.h>
#include <linux/gfp.h>

+#include <asm/hwcap.h>
#include <asm/pgtable.h>
+#include <asm/setup.h>
#include <asm/sections.h>
#include <asm/smp_plat.h>
#include <asm/unwind.h>
@@ -51,6 +53,38 @@ void *module_alloc(unsigned long size)
}
#endif

+#ifdef CONFIG_ARM_PATCH_UIDIV
+static int module_patch_aeabi_uidiv(unsigned long loc, const Elf32_Sym *sym)
+{
+ extern char __aeabi_uidiv[], __aeabi_idiv[];
+ unsigned long udiv_addr = (unsigned long)__aeabi_uidiv;
+ unsigned long sdiv_addr = (unsigned long)__aeabi_idiv;
+ unsigned int mask;
+
+ if (IS_ENABLED(CONFIG_THUMB2_KERNEL))
+ mask = HWCAP_IDIVT;
+ else
+ mask = HWCAP_IDIVA;
+
+ if (elf_hwcap & mask) {
+ if (sym->st_value == udiv_addr) {
+ patch_udiv(&loc, sizeof(loc));
+ return 1;
+ } else if (sym->st_value == sdiv_addr) {
+ patch_sdiv(&loc, sizeof(loc));
+ return 1;
+ }
+ }
+
+ return 0;
+}
+#else
+static int module_patch_aeabi_uidiv(unsigned long loc, const Elf32_Sym *sym)
+{
+ return 0;
+}
+#endif
+
int
apply_relocate(Elf32_Shdr *sechdrs, const char *strtab, unsigned int symindex,
unsigned int relindex, struct module *module)
@@ -109,6 +143,9 @@ apply_relocate(Elf32_Shdr *sechdrs, const char *strtab, unsigned int symindex,
return -ENOEXEC;
}

+ if (module_patch_aeabi_uidiv(loc, sym))
+ break;
+
offset = __mem_to_opcode_arm(*(u32 *)loc);
offset = (offset & 0x00ffffff) << 2;
if (offset & 0x02000000)
@@ -195,6 +232,9 @@ apply_relocate(Elf32_Shdr *sechdrs, const char *strtab, unsigned int symindex,
return -ENOEXEC;
}

+ if (module_patch_aeabi_uidiv(loc, sym))
+ break;
+
upper = __mem_to_opcode_thumb16(*(u16 *)loc);
lower = __mem_to_opcode_thumb16(*(u16 *)(loc + 2));

diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index 20edd349d379..39a46059d5e8 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -32,6 +32,7 @@
#include <linux/compiler.h>
#include <linux/sort.h>
#include <linux/psci.h>
+#include <linux/module.h>

#include <asm/unified.h>
#include <asm/cp15.h>
@@ -375,6 +376,72 @@ void __init early_print(const char *str, ...)
printk("%s", buf);
}

+#ifdef CONFIG_ARM_PATCH_UIDIV
+/* "sdiv r0, r0, r1" or "mrc p6, 1, r0, CR0, CR1, 4" if we're on pj4 w/o MP */
+static u32 __attribute_const__ sdiv_instruction(void)
+{
+ if (IS_ENABLED(CONFIG_THUMB2_KERNEL)) {
+ if (cpu_is_pj4_nomp())
+ return __opcode_to_mem_thumb32(0xee300691);
+ return __opcode_to_mem_thumb32(0xfb90f0f1);
+ }
+
+ if (cpu_is_pj4_nomp())
+ return __opcode_to_mem_arm(0xee300691);
+ return __opcode_to_mem_arm(0xe710f110);
+}
+
+/* "udiv r0, r0, r1" or "mrc p6, 1, r0, CR0, CR1, 0" if we're on pj4 w/o MP */
+static u32 __attribute_const__ udiv_instruction(void)
+{
+ if (IS_ENABLED(CONFIG_THUMB2_KERNEL)) {
+ if (cpu_is_pj4_nomp())
+ return __opcode_to_mem_thumb32(0xee300611);
+ return __opcode_to_mem_thumb32(0xfbb0f0f1);
+ }
+
+ if (cpu_is_pj4_nomp())
+ return __opcode_to_mem_arm(0xee300611);
+ return __opcode_to_mem_arm(0xe730f110);
+}
+
+static void __init_or_module patch(u32 **addr, size_t count, u32 insn)
+{
+ for (; count != 0; count -= 4)
+ **addr++ = insn;
+}
+
+void __init_or_module patch_udiv(void *addr, size_t size)
+{
+ patch(addr, size, udiv_instruction());
+}
+
+void __init_or_module patch_sdiv(void *addr, size_t size)
+{
+ return patch(addr, size, sdiv_instruction());
+}
+
+static void __init patch_aeabi_uidiv(void)
+{
+ extern char __start_udiv_loc[], __stop_udiv_loc[];
+ extern char __start_idiv_loc[], __stop_idiv_loc[];
+ unsigned int mask;
+
+ if (IS_ENABLED(CONFIG_THUMB2_KERNEL))
+ mask = HWCAP_IDIVT;
+ else
+ mask = HWCAP_IDIVA;
+
+ if (!(elf_hwcap & mask))
+ return;
+
+ patch_udiv(__start_udiv_loc, __stop_udiv_loc - __start_udiv_loc);
+ patch_sdiv(__start_idiv_loc, __stop_idiv_loc - __start_idiv_loc);
+}
+#else
+static void __init patch_aeabi_uidiv(void) { }
+#endif
+
static void __init cpuid_init_hwcaps(void)
{
int block;
@@ -642,6 +709,7 @@ static void __init setup_processor(void)
elf_hwcap = list->elf_hwcap;

cpuid_init_hwcaps();
+ patch_aeabi_uidiv();

#ifndef CONFIG_ARM_THUMB
elf_hwcap &= ~(HWCAP_THUMB | HWCAP_IDIVT);
diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
index 8b60fde5ce48..bc87a2e04e6f 100644
--- a/arch/arm/kernel/vmlinux.lds.S
+++ b/arch/arm/kernel/vmlinux.lds.S
@@ -28,6 +28,18 @@
*(.hyp.idmap.text) \
VMLINUX_SYMBOL(__hyp_idmap_text_end) = .;

+#ifdef CONFIG_ARM_PATCH_UIDIV
+#define UIDIV_REC . = ALIGN(8); \
+ VMLINUX_SYMBOL(__start_udiv_loc) = .; \
+ *(__udiv_loc) \
+ VMLINUX_SYMBOL(__stop_udiv_loc) = .; \
+ VMLINUX_SYMBOL(__start_idiv_loc) = .; \
+ *(__idiv_loc) \
+ VMLINUX_SYMBOL(__stop_idiv_loc) = .;
+#else
+#define UIDIV_REC
+#endif
+
#ifdef CONFIG_HOTPLUG_CPU
#define ARM_CPU_DISCARD(x)
#define ARM_CPU_KEEP(x) x
@@ -210,6 +222,7 @@ SECTIONS
.init.data : {
#ifndef CONFIG_XIP_KERNEL
INIT_DATA
+ UIDIV_REC
#endif
INIT_SETUP(16)
INIT_CALLS
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

2015-11-25 23:24:18

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ARM: Replace calls to __aeabi_{u}idiv with udiv/sdiv instructions

On Wed, 25 Nov 2015, Stephen Boyd wrote:

> The ARM compiler inserts calls to __aeabi_uidiv() and
> __aeabi_idiv() when it needs to perform division on signed and
> unsigned integers. If a processor has support for the udiv and
> sdiv division instructions the calls to these support routines
> can be replaced with those instructions. Now that recordmcount
> records the locations of calls to these library functions in
> two sections (one for udiv and one for sdiv), iterate over these
> sections early at boot and patch the call sites with the
> appropriate division instruction when we determine that the
> processor supports the division instructions. Using the division
> instructions should be faster and less power intensive than
> running the support code.

A few remarks:

1) The code assumes unconditional branches to __aeabi_idiv and
__aeabi_uidiv. What if there are conditional branches? Also, tail
call optimizations will generate a straight b opcode rather than a bl
and patching those will obviously have catastrophic results. I think
you should validate the presence of a bl before patching over it.

2) For those cases where a call to __aeabi_uidiv and __aeabi_idiv is not
patched due to (1), you could patch a uidiv/idiv plus "bx lr"
at those function entry points too.

3) In fact I was wondering if the overhead of the branch and back is
really significant compared to the non trivial cost of a idiv
instruction and all the complex infrastructure required to patch
those branches directly, and consequently if the performance
difference is actually worth it versus simply doing (2) alone.

4) Please add some printing to the boot log (debug level should be fine)
about the fact that you did modify n branch instances with a div
insn. That _could_ turn out to be a useful clue when debugging kernel
   "strangeness".
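Remark (1) amounts to validating the opcode before writing over it. A minimal user-space sketch of such a check (illustrative only, not code from the patch; the helper name is invented) for the ARM-state encoding:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * An ARM-state BL is cond(4 bits) | 1011 | imm24, so an unconditional
 * bl always has 0xeb in the top byte.  A plain b (tail call, 0xea) or
 * a conditional bl (any cond other than AL) must not be patched.
 * Illustrative sketch only.
 */
static bool is_unconditional_bl(uint32_t insn)
{
	return (insn & 0xff000000) == 0xeb000000;
}
```

A patcher following this remark would skip (or warn about) any recorded location where this predicate fails.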

> Cc: Nicolas Pitre <[email protected]>
> Cc: Arnd Bergmann <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Måns Rullgård <[email protected]>
> Cc: Thomas Petazzoni <[email protected]>
> Signed-off-by: Stephen Boyd <[email protected]>
> ---
> arch/arm/Kconfig | 14 +++++++++
> arch/arm/include/asm/cputype.h | 8 ++++-
> arch/arm/include/asm/setup.h | 3 ++
> arch/arm/kernel/module.c | 40 +++++++++++++++++++++++++
> arch/arm/kernel/setup.c | 68 ++++++++++++++++++++++++++++++++++++++++++
> arch/arm/kernel/vmlinux.lds.S | 13 ++++++++
> 6 files changed, 145 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 0365cbbc9179..aa8bc7da6331 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -1617,6 +1617,20 @@ config AEABI
>
> To use this you need GCC version 4.0.0 or later.
>
> +config ARM_PATCH_UIDIV
> + bool "Runtime patch calls to __aeabi_{u}idiv() with udiv/sdiv"
> + depends on CPU_32v7 && !XIP_KERNEL && AEABI
> + help
> + Some v7 CPUs have support for the udiv and sdiv instructions
> + that can be used in place of calls to __aeabi_uidiv and __aeabi_idiv
> + functions provided by the ARM runtime ABI.
> +
> + Enabling this option allows the kernel to modify itself to replace
> + branches to these library functions with the udiv and sdiv
> + instructions themselves. Typically this will be faster and less
> + power intensive than running the library support code to do
> + integer division.
> +
> config OABI_COMPAT
> bool "Allow old ABI binaries to run with this kernel (EXPERIMENTAL)"
> depends on AEABI && !THUMB2_KERNEL
> diff --git a/arch/arm/include/asm/cputype.h b/arch/arm/include/asm/cputype.h
> index 85e374f873ac..48c77d422a0d 100644
> --- a/arch/arm/include/asm/cputype.h
> +++ b/arch/arm/include/asm/cputype.h
> @@ -250,8 +250,14 @@ static inline int cpu_is_pj4(void)
>
> return 0;
> }
> +
> +static inline bool __attribute_const__ cpu_is_pj4_nomp(void)
> +{
> + return read_cpuid_part() == 0x56005810;
> +}
> #else
> -#define cpu_is_pj4() 0
> +#define cpu_is_pj4() 0
> +#define cpu_is_pj4_nomp() 0
> #endif
>
> static inline int __attribute_const__ cpuid_feature_extract_field(u32 features,
> diff --git a/arch/arm/include/asm/setup.h b/arch/arm/include/asm/setup.h
> index e0adb9f1bf94..a514552d5cbd 100644
> --- a/arch/arm/include/asm/setup.h
> +++ b/arch/arm/include/asm/setup.h
> @@ -25,4 +25,7 @@ extern int arm_add_memory(u64 start, u64 size);
> extern void early_print(const char *str, ...);
> extern void dump_machine_table(void);
>
> +extern void patch_udiv(void *addr, size_t size);
> +extern void patch_sdiv(void *addr, size_t size);
> +
> #endif
> diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
> index efdddcb97dd1..aa59e5cfe6a0 100644
> --- a/arch/arm/kernel/module.c
> +++ b/arch/arm/kernel/module.c
> @@ -20,7 +20,9 @@
> #include <linux/string.h>
> #include <linux/gfp.h>
>
> +#include <asm/hwcap.h>
> #include <asm/pgtable.h>
> +#include <asm/setup.h>
> #include <asm/sections.h>
> #include <asm/smp_plat.h>
> #include <asm/unwind.h>
> @@ -51,6 +53,38 @@ void *module_alloc(unsigned long size)
> }
> #endif
>
> +#ifdef CONFIG_ARM_PATCH_UIDIV
> +static int module_patch_aeabi_uidiv(unsigned long loc, const Elf32_Sym *sym)
> +{
> + extern char __aeabi_uidiv[], __aeabi_idiv[];
> + unsigned long udiv_addr = (unsigned long)__aeabi_uidiv;
> + unsigned long sdiv_addr = (unsigned long)__aeabi_idiv;
> + unsigned int mask;
> +
> + if (IS_ENABLED(CONFIG_THUMB2_KERNEL))
> + mask = HWCAP_IDIVT;
> + else
> + mask = HWCAP_IDIVA;
> +
> + if (elf_hwcap & mask) {
> + if (sym->st_value == udiv_addr) {
> + patch_udiv(&loc, sizeof(loc));
> + return 1;
> + } else if (sym->st_value == sdiv_addr) {
> + patch_sdiv(&loc, sizeof(loc));
> + return 1;
> + }
> + }
> +
> + return 0;
> +}
> +#else
> +static int module_patch_aeabi_uidiv(unsigned long loc, const Elf32_Sym *sym)
> +{
> + return 0;
> +}
> +#endif
> +
> int
> apply_relocate(Elf32_Shdr *sechdrs, const char *strtab, unsigned int symindex,
> unsigned int relindex, struct module *module)
> @@ -109,6 +143,9 @@ apply_relocate(Elf32_Shdr *sechdrs, const char *strtab, unsigned int symindex,
> return -ENOEXEC;
> }
>
> + if (module_patch_aeabi_uidiv(loc, sym))
> + break;
> +
> offset = __mem_to_opcode_arm(*(u32 *)loc);
> offset = (offset & 0x00ffffff) << 2;
> if (offset & 0x02000000)
> @@ -195,6 +232,9 @@ apply_relocate(Elf32_Shdr *sechdrs, const char *strtab, unsigned int symindex,
> return -ENOEXEC;
> }
>
> + if (module_patch_aeabi_uidiv(loc, sym))
> + break;
> +
> upper = __mem_to_opcode_thumb16(*(u16 *)loc);
> lower = __mem_to_opcode_thumb16(*(u16 *)(loc + 2));
>
> diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
> index 20edd349d379..39a46059d5e8 100644
> --- a/arch/arm/kernel/setup.c
> +++ b/arch/arm/kernel/setup.c
> @@ -32,6 +32,7 @@
> #include <linux/compiler.h>
> #include <linux/sort.h>
> #include <linux/psci.h>
> +#include <linux/module.h>
>
> #include <asm/unified.h>
> #include <asm/cp15.h>
> @@ -375,6 +376,72 @@ void __init early_print(const char *str, ...)
> printk("%s", buf);
> }
>
> +#ifdef CONFIG_ARM_PATCH_UIDIV
> +/* "sdiv r0, r0, r1" or "mrc p6, 1, r0, CR0, CR1, 4" if we're on pj4 w/o MP */
> +static u32 __attribute_const__ sdiv_instruction(void)
> +{
> + if (IS_ENABLED(CONFIG_THUMB2_KERNEL)) {
> + if (cpu_is_pj4_nomp())
> + return __opcode_to_mem_thumb32(0xee300691);
> + return __opcode_to_mem_thumb32(0xfb90f0f1);
> + }
> +
> + if (cpu_is_pj4_nomp())
> + return __opcode_to_mem_arm(0xee300691);
> + return __opcode_to_mem_arm(0xe710f110);
> +}
> +
> +/* "udiv r0, r0, r1" or "mrc p6, 1, r0, CR0, CR1, 0" if we're on pj4 w/o MP */
> +static u32 __attribute_const__ udiv_instruction(void)
> +{
> + if (IS_ENABLED(CONFIG_THUMB2_KERNEL)) {
> + if (cpu_is_pj4_nomp())
> + return __opcode_to_mem_thumb32(0xee300611);
> + return __opcode_to_mem_thumb32(0xfbb0f0f1);
> + }
> +
> + if (cpu_is_pj4_nomp())
> + return __opcode_to_mem_arm(0xee300611);
> + return __opcode_to_mem_arm(0xe730f110);
> +}
> +
> +static void __init_or_module patch(u32 **addr, size_t count, u32 insn)
> +{
> + for (; count != 0; count -= 4)
> + **addr++ = insn;
> +}
> +
> +void __init_or_module patch_udiv(void *addr, size_t size)
> +{
> + patch(addr, size, udiv_instruction());
> +}
> +
> +void __init_or_module patch_sdiv(void *addr, size_t size)
> +{
> + return patch(addr, size, sdiv_instruction());
> +}
> +
> +static void __init patch_aeabi_uidiv(void)
> +{
> + extern char __start_udiv_loc[], __stop_udiv_loc[];
> + extern char __start_idiv_loc[], __stop_idiv_loc[];
> + unsigned int mask;
> +
> + if (IS_ENABLED(CONFIG_THUMB2_KERNEL))
> + mask = HWCAP_IDIVT;
> + else
> + mask = HWCAP_IDIVA;
> +
> + if (!(elf_hwcap & mask))
> + return;
> +
> + patch_udiv(__start_udiv_loc, __stop_udiv_loc - __start_udiv_loc);
> + patch_sdiv(__start_idiv_loc, __stop_idiv_loc - __start_idiv_loc);
> +}
> +#else
> +static void __init patch_aeabi_uidiv(void) { }
> +#endif
> +
> static void __init cpuid_init_hwcaps(void)
> {
> int block;
> @@ -642,6 +709,7 @@ static void __init setup_processor(void)
> elf_hwcap = list->elf_hwcap;
>
> cpuid_init_hwcaps();
> + patch_aeabi_uidiv();
>
> #ifndef CONFIG_ARM_THUMB
> elf_hwcap &= ~(HWCAP_THUMB | HWCAP_IDIVT);
> diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
> index 8b60fde5ce48..bc87a2e04e6f 100644
> --- a/arch/arm/kernel/vmlinux.lds.S
> +++ b/arch/arm/kernel/vmlinux.lds.S
> @@ -28,6 +28,18 @@
> *(.hyp.idmap.text) \
> VMLINUX_SYMBOL(__hyp_idmap_text_end) = .;
>
> +#ifdef CONFIG_ARM_PATCH_UIDIV
> +#define UIDIV_REC . = ALIGN(8); \
> + VMLINUX_SYMBOL(__start_udiv_loc) = .; \
> + *(__udiv_loc) \
> + VMLINUX_SYMBOL(__stop_udiv_loc) = .; \
> + VMLINUX_SYMBOL(__start_idiv_loc) = .; \
> + *(__idiv_loc) \
> + VMLINUX_SYMBOL(__stop_idiv_loc) = .;
> +#else
> +#define UIDIV_REC
> +#endif
> +
> #ifdef CONFIG_HOTPLUG_CPU
> #define ARM_CPU_DISCARD(x)
> #define ARM_CPU_KEEP(x) x
> @@ -210,6 +222,7 @@ SECTIONS
> .init.data : {
> #ifndef CONFIG_XIP_KERNEL
> INIT_DATA
> + UIDIV_REC
> #endif
> INIT_SETUP(16)
> INIT_CALLS
> --
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
>
>

2015-11-25 23:47:29

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] scripts: Add a recorduidiv program

On Wed, Nov 25, 2015 at 01:51:03PM -0800, Stephen Boyd wrote:
> The ARM compiler inserts calls to __aeabi_uidiv() and
> __aeabi_idiv() when it needs to perform division on signed and
> unsigned integers. If a processor has support for the udiv and
> sdiv division instructions the calls to these support routines
> can be replaced with those instructions. Therefore, record the
> location of calls to these library functions into two sections
> (one for udiv and one for sdiv) similar to how we trace calls to
> mcount. When the kernel boots up it will check to see if the
> processor supports the instructions and then patch the call sites
> with the instruction.

Do we have any resolution on these programs which modify the object
files in-place, rather than breaking any hard-links which may be
present (e.g., as a result of using ccache in hard-link mode)?

--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-11-26 00:06:07

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ARM: Replace calls to __aeabi_{u}idiv with udiv/sdiv instructions

On Wed, Nov 25, 2015 at 06:09:13PM -0500, Nicolas Pitre wrote:
> 3) In fact I was wondering if the overhead of the branch and back is
> really significant compared to the non trivial cost of a idiv
> instruction and all the complex infrastructure required to patch
> those branches directly, and consequently if the performance
> difference is actually worth it versus simply doing (2) alone.

I definitely agree with you on this, given that the CPUs which are
going to benefit from idiv are modern CPUs with a branch
predictor (and if it's not predicting such unconditional calls and
returns it's not much use as a branch predictor!)

I think what we need to see is the performance of existing kernels,
vs patching the idiv instructions at every callsite, vs patching
the called function itself.

> > +#ifdef CONFIG_ARM_PATCH_UIDIV
> > +/* "sdiv r0, r0, r1" or "mrc p6, 1, r0, CR0, CR1, 4" if we're on pj4 w/o MP */
> > +static u32 __attribute_const__ sdiv_instruction(void)
> > +{
> > + if (IS_ENABLED(CONFIG_THUMB2_KERNEL)) {
> > + if (cpu_is_pj4_nomp())
> > + return __opcode_to_mem_thumb32(0xee300691);
> > + return __opcode_to_mem_thumb32(0xfb90f0f1);
> > + }
> > +
> > + if (cpu_is_pj4_nomp())
> > + return __opcode_to_mem_arm(0xee300691);
> > + return __opcode_to_mem_arm(0xe710f110);
> > +}
> > +
> > +/* "udiv r0, r0, r1" or "mrc p6, 1, r0, CR0, CR1, 0" if we're on pj4 w/o MP */
> > +static u32 __attribute_const__ udiv_instruction(void)
> > +{
> > + if (IS_ENABLED(CONFIG_THUMB2_KERNEL)) {
> > + if (cpu_is_pj4_nomp())
> > + return __opcode_to_mem_thumb32(0xee300611);
> > + return __opcode_to_mem_thumb32(0xfbb0f0f1);
> > + }
> > +
> > + if (cpu_is_pj4_nomp())
> > + return __opcode_to_mem_arm(0xee300611);
> > + return __opcode_to_mem_arm(0xe730f110);
> > +}

Any reason the above aren't marked with __init_or_module as well, as
the compiler can choose not to inline them?

> > +
> > +static void __init_or_module patch(u32 **addr, size_t count, u32 insn)
> > +{
> > + for (; count != 0; count -= 4)
> > + **addr++ = insn;
> > +}
> > +
> > +void __init_or_module patch_udiv(void *addr, size_t size)
> > +{
> > + patch(addr, size, udiv_instruction());
> > +}
> > +
> > +void __init_or_module patch_sdiv(void *addr, size_t size)
> > +{
> > + return patch(addr, size, sdiv_instruction());
> > +}
> > +
> > +static void __init patch_aeabi_uidiv(void)
> > +{
> > + extern char __start_udiv_loc[], __stop_udiv_loc[];
> > + extern char __start_idiv_loc[], __stop_idiv_loc[];
> > + unsigned int mask;
> > +
> > + if (IS_ENABLED(CONFIG_THUMB2_KERNEL))
> > + mask = HWCAP_IDIVT;
> > + else
> > + mask = HWCAP_IDIVA;
> > +
> > + if (!(elf_hwcap & mask))
> > + return;
> > +
> > + patch_udiv(__start_udiv_loc, __stop_udiv_loc - __start_udiv_loc);
> > + patch_sdiv(__start_idiv_loc, __stop_idiv_loc - __start_idiv_loc);

I'm left really concerned about this. We're modifying code with all
the caches on, and the above is not only missing any coherency of the
I/D paths, it's also missing any branch predictor maintenance. So, if
we've executed any divisions at this point, the predictor could already
predicted one of these branches that's being modified.
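In a user-space analogue, the maintenance Russell describes would look roughly like the sketch below. The kernel has its own helpers for this (the in-tree text-patching code performs the cache and predictor maintenance); this sketch uses the GCC builtin purely for illustration and is not the patch's code:

```c
#include <stdint.h>

/*
 * Illustrative sketch: after rewriting an instruction word, the
 * modified range must be made coherent between the D-side and the
 * I-side (and the branch predictor invalidated) before it may be
 * executed.  GCC's builtin maps to the appropriate maintenance
 * operations on ARM and is a no-op on x86.
 */
static void patch_word(uint32_t *loc, uint32_t insn)
{
	*loc = insn;
	__builtin___clear_cache((char *)loc, (char *)(loc + 1));
}
```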

--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-11-26 00:07:59

by Måns Rullgård

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ARM: Replace calls to __aeabi_{u}idiv with udiv/sdiv instructions

Nicolas Pitre <[email protected]> writes:

> On Wed, 25 Nov 2015, Stephen Boyd wrote:
>
>> The ARM compiler inserts calls to __aeabi_uidiv() and
>> __aeabi_idiv() when it needs to perform division on signed and
>> unsigned integers. If a processor has support for the udiv and
>> sdiv division instructions the calls to these support routines
>> can be replaced with those instructions. Now that recordmcount
>> records the locations of calls to these library functions in
>> two sections (one for udiv and one for sdiv), iterate over these
>> sections early at boot and patch the call sites with the
>> appropriate division instruction when we determine that the
>> processor supports the division instructions. Using the division
>> instructions should be faster and less power intensive than
>> running the support code.
>
> A few remarks:
>
> 1) The code assumes unconditional branches to __aeabi_idiv and
> __aeabi_uidiv. What if there are conditional branches? Also, tail
> call optimizations will generate a straight b opcode rather than a bl
> and patching those will obviously have catastrophic results. I think
> you should validate the presence of a bl before patching over it.

I did a quick check on a compiled kernel I had nearby, and there are no
conditional or tail calls to those functions, so although they should
obviously be checked for correctness, performance is unlikely to matter
for those.

However, there are almost half as many calls to __aeabi_{u}idivmod as to
the plain div functions, 129 vs 228 for signed and unsigned combined.
For best results, these functions should also be patched with the
hardware instructions. Obviously the call sites for these can't be
patched.

> 2) For those cases where a call to __aeabi_uidiv and __aeabi_idiv is not
> patched due to (1), you could patch a uidiv/idiv plus "bx lr"
> at those function entry points too.
>
> 3) In fact I was wondering if the overhead of the branch and back is
> really significant compared to the non trivial cost of a idiv
> instruction and all the complex infrastructure required to patch
> those branches directly, and consequently if the performance
> difference is actually worth it versus simply doing (2) alone.

Depending on the operands, the div instruction can take as few as 3
cycles on a Cortex-A7.

--
Måns Rullgård
[email protected]

2015-11-26 00:08:51

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ARM: Replace calls to __aeabi_{u}idiv with udiv/sdiv instructions

On Wed, Nov 25, 2015 at 01:51:04PM -0800, Stephen Boyd wrote:
> diff --git a/arch/arm/include/asm/cputype.h b/arch/arm/include/asm/cputype.h
> index 85e374f873ac..48c77d422a0d 100644
> --- a/arch/arm/include/asm/cputype.h
> +++ b/arch/arm/include/asm/cputype.h
> @@ -250,8 +250,14 @@ static inline int cpu_is_pj4(void)
>
> return 0;
> }
> +
> +static inline bool __attribute_const__ cpu_is_pj4_nomp(void)
> +{
> + return read_cpuid_part() == 0x56005810;

One other thing here, we really ought to avoid adding more magic numbers
to this file. We have the ARM processors nicely #defined, but not a lot
else. Maybe its time that we continued with the nice definitions for
these part numbers?
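Continuing with named definitions would look something like the sketch below. The macro name is invented for illustration (only the 0x56005810 value comes from the patch), and read_cpuid_part() is stubbed with a plain variable so the sketch is self-contained:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical named part number replacing the magic constant. */
#define ARM_CPU_PART_PJ4_NOMP	0x56005810u

/* Stand-in for the CP15 Main ID register read; settable for testing. */
static uint32_t cpuid_part = ARM_CPU_PART_PJ4_NOMP;

static uint32_t read_cpuid_part(void)
{
	return cpuid_part;
}

static bool cpu_is_pj4_nomp(void)
{
	return read_cpuid_part() == ARM_CPU_PART_PJ4_NOMP;
}
```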

--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-11-26 00:44:54

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ARM: Replace calls to __aeabi_{u}idiv with udiv/sdiv instructions

On Thu, 26 Nov 2015, Måns Rullgård wrote:

> Nicolas Pitre <[email protected]> writes:
>
> > On Wed, 25 Nov 2015, Stephen Boyd wrote:
> >
> >> The ARM compiler inserts calls to __aeabi_uidiv() and
> >> __aeabi_idiv() when it needs to perform division on signed and
> >> unsigned integers. If a processor has support for the udiv and
> >> sdiv division instructions the calls to these support routines
> >> can be replaced with those instructions. Now that recordmcount
> >> records the locations of calls to these library functions in
> >> two sections (one for udiv and one for sdiv), iterate over these
> >> sections early at boot and patch the call sites with the
> >> appropriate division instruction when we determine that the
> >> processor supports the division instructions. Using the division
> >> instructions should be faster and less power intensive than
> >> running the support code.
> >
> > A few remarks:
> >
> > 1) The code assumes unconditional branches to __aeabi_idiv and
> > __aeabi_uidiv. What if there are conditional branches? Also, tail
> > call optimizations will generate a straight b opcode rather than a bl
> > and patching those will obviously have catastrophic results. I think
> > you should validate the presence of a bl before patching over it.
>
> I did a quick check on a compiled kernel I had nearby, and there are no
> conditional or tail calls to those functions, so although they should
> obviously be checked for correctness, performance is unlikely to matter
> for those.

I'm more worried about correctness than performance. This kind of
patching should be done defensively.

> However, there are almost half as many calls to __aeabi_{u}idivmod as to
> the plain div functions, 129 vs 228 for signed and unsigned combined.
> For best results, these functions should also be patched with the
> hardware instructions. Obviously the call sites for these can't be
> patched.

Current __aeabi_{u}idivmod implementations are simple wrappers around a
call to __aeabi_{u}idiv so they'd get the benefit automatically
regardless of the chosen approach.
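The wrapper relationship Nicolas describes can be pictured in C roughly as follows (the real entry points are assembly returning quotient and remainder in r0/r1; the struct and names here are an illustrative analogue only):

```c
#include <stdint.h>

/* C analogue of the AEABI divmod wrapper: __aeabi_uidivmod simply
 * invokes the plain division routine and derives the remainder, so
 * any speed-up applied to __aeabi_uidiv is inherited automatically. */
typedef struct { uint32_t quot, rem; } uidivmod_t;

static uint32_t aeabi_uidiv(uint32_t n, uint32_t d)
{
	return n / d; /* the routine that would be patched to udiv */
}

static uidivmod_t aeabi_uidivmod(uint32_t n, uint32_t d)
{
	uidivmod_t r;

	r.quot = aeabi_uidiv(n, d);
	r.rem = n - r.quot * d;
	return r;
}
```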

> > 2) For those cases where a call to __aeabi_uidiv and __aeabi_idiv is not
> > patched due to (1), you could patch a uidiv/idiv plus "bx lr"
> > at those function entry points too.
> >
> > 3) In fact I was wondering if the overhead of the branch and back is
> > really significant compared to the non trivial cost of a idiv
> > instruction and all the complex infrastructure required to patch
> > those branches directly, and consequently if the performance
> > difference is actually worth it versus simply doing (2) alone.
>
> Depending on the operands, the div instruction can take as few as 3
> cycles on a Cortex-A7.

Even the current software based implementation can produce a result with
about 5 simple ALU instructions depending on the operands.

The average cycle count is more important than the easy-way-out case.
And then how significant the two branches around it are compared to idiv
alone from direct patching of every call to it.


Nicolas

2015-11-26 00:50:15

by Måns Rullgård

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ARM: Replace calls to __aeabi_{u}idiv with udiv/sdiv instructions

Nicolas Pitre <[email protected]> writes:

> On Thu, 26 Nov 2015, Måns Rullgård wrote:
>
>> Nicolas Pitre <[email protected]> writes:
>>
>> > 3) In fact I was wondering if the overhead of the branch and back is
>> > really significant compared to the non trivial cost of a idiv
>> > instruction and all the complex infrastructure required to patch
>> > those branches directly, and consequently if the performance
>> > difference is actually worth it versus simply doing (2) alone.
>>
>> Depending on the operands, the div instruction can take as few as 3
>> cycles on a Cortex-A7.
>
> Even the current software based implementation can produce a result with
> about 5 simple ALU instructions depending on the operands.
>
> The average cycle count is more important than the easy-way-out case.
> And then how significant the two branches around it are compared to idiv
> alone from direct patching of every call to it.

If not calling the function saves an I-cache miss, the benefit can be
substantial. No, I have no proof of this being a problem, but it's
something that could happen.

Of course, none of this is going to be as good as letting the compiler
generate div instructions directly.

--
Måns Rullgård
[email protected]

2015-11-26 01:29:23

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ARM: Replace calls to __aeabi_{u}idiv with udiv/sdiv instructions

On Thu, Nov 26, 2015 at 12:50:08AM +0000, Måns Rullgård wrote:
> If not calling the function saves an I-cache miss, the benefit can be
> substantial. No, I have no proof of this being a problem, but it's
> something that could happen.

That's a simplistic view of modern CPUs.

As I've already said, modern CPUs not only have branch prediction, but
they also have speculative instruction fetching and speculative data
prefetching - which the CPUs that have idiv support will have.

With such features, the branch predictor is able to learn that the
branch will be taken, and because of the speculative instruction
fetching, it can bring the cache line in so that it has the
instructions it needs with minimal or, if working correctly,
without stalling the CPU pipeline.

--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-11-26 02:20:00

by Måns Rullgård

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ARM: Replace calls to __aeabi_{u}idiv with udiv/sdiv instructions

Russell King - ARM Linux <[email protected]> writes:

> On Thu, Nov 26, 2015 at 12:50:08AM +0000, Måns Rullgård wrote:
>> If not calling the function saves an I-cache miss, the benefit can be
>> substantial. No, I have no proof of this being a problem, but it's
>> something that could happen.
>
> That's a simplistic view of modern CPUs.
>
> As I've already said, modern CPUs not only have branch prediction, but
> they also have speculative instruction fetching and speculative data
> prefetching - which the CPUs that have idiv support will have.
>
> With such features, the branch predictor is able to learn that the
> branch will be taken, and because of the speculative instruction
> fetching, it can bring the cache line in so that it has the
> instructions it needs with minimal or, if working correctly,
> without stalling the CPU pipeline.

It doesn't matter how many fancy features the CPU has. Executing more
branches and using more cache lines puts additional pressure on those
resources, reducing overall performance. Besides, the performance
counters readily show that the prediction is nothing near as perfect as
you seem to believe.

--
Måns Rullgård
[email protected]

2015-11-26 05:32:50

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ARM: Replace calls to __aeabi_{u}idiv with udiv/sdiv instructions

On Thu, 26 Nov 2015, Måns Rullgård wrote:

> Russell King - ARM Linux <[email protected]> writes:
>
> > On Thu, Nov 26, 2015 at 12:50:08AM +0000, Måns Rullgård wrote:
> >> If not calling the function saves an I-cache miss, the benefit can be
> >> substantial. No, I have no proof of this being a problem, but it's
> >> something that could happen.
> >
> > That's a simplistic view of modern CPUs.
> >
> > As I've already said, modern CPUs not only have branch prediction, but
> > they also have speculative instruction fetching and speculative data
> > prefetching - which the CPUs that have idiv support will have.
> >
> > With such features, the branch predictor is able to learn that the
> > branch will be taken, and because of the speculative instruction
> > fetching, it can bring the cache line in so that it has the
> > instructions it needs with minimal or, if working correctly,
> > without stalling the CPU pipeline.
>
> It doesn't matter how many fancy features the CPU has. Executing more
> branches and using more cache lines puts additional pressure on those
> resources, reducing overall performance. Besides, the performance
> counters readily show that the prediction is nothing near as perfect as
> you seem to believe.

OK... Let's try to come up with actual numbers.

We know that letting gcc emit idiv by itself is the ultimate solution.
And it is free of maintenance on our side besides passing the
appropriate argument to gcc of course. So this is worth doing.

For the case where you have a set of target machines in your kernel that
may or may not have idiv, then the first step should be to patch
__aeabi_uidiv and __aeabi_idiv. This is a pretty small and simple
change that might turn out to be more than good enough. It is necessary
anyway as the full patching solution does not cover all cases.

Then, IMHO, it would be a good idea to get performance numbers to
compare that first step and the full patching solution. Of course the
full patching will yield better performance. It has to. But if the
difference is not significant enough, then it might not be worth
introducing the implied complexity into mainline. And it is not because
the approach is bad. In fact I think this is a very cool hack. But it
comes with a cost in maintenance and that cost has to be justified.

Just to have an idea, I produced the attached micro benchmark. I tested
on a TC2 forced to a single Cortex-A15 core and I got those results:

Testing INLINE_DIV ...

real 0m7.182s
user 0m7.170s
sys 0m0.000s

Testing PATCHED_DIV ...

real 0m7.181s
user 0m7.170s
sys 0m0.000s

Testing OUTOFLINE_DIV ...

real 0m7.181s
user 0m7.170s
sys 0m0.005s

Testing LIBGCC_DIV ...

real 0m18.659s
user 0m18.635s
sys 0m0.000s

As you can see, whether the div is inline or out-of-line, whether
arguments are moved into r0-r1 or not, makes no difference at all on a
Cortex-A15.

Now forcing it onto a Cortex-A7 core:

Testing INLINE_DIV ...

real 0m8.917s
user 0m8.895s
sys 0m0.005s

Testing PATCHED_DIV ...

real 0m11.666s
user 0m11.645s
sys 0m0.000s

Testing OUTOFLINE_DIV ...

real 0m13.065s
user 0m13.025s
sys 0m0.000s

Testing LIBGCC_DIV ...

real 0m51.815s
user 0m51.750s
sys 0m0.005s

So on a Cortex-A7 the various overheads become visible. How significant
is it in practice with normal kernel usage? I don't know.
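The attached divtest.S is not reproduced in the archive; a rough C analogue of the comparison (hypothetical structure, arbitrary operands and iteration count) is sketched below. Built with -march=armv7ve the compiler emits udiv inline; built without it, each '/' becomes a bl to __aeabi_uidiv, which is essentially the difference the numbers above measure:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Tight loop of 32-bit unsigned divisions; the dividend varies per
 * iteration so the division cannot be hoisted out of the loop. */
static uint32_t div_loop(uint32_t n, uint32_t d, long iters)
{
	volatile uint32_t sink = 0;

	for (long i = 0; i < iters; i++)
		sink = (n + (uint32_t)i) / d;
	return sink;
}

/* Wall-clock the loop the way the benchmark's "real" figures do. */
static double time_div_loop(long iters)
{
	clock_t t0 = clock();
	uint32_t q = div_loop(1000000007u, 3u, iters);
	double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

	printf("q=%u in %.3f s\n", q, secs);
	return secs;
}
```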


Nicolas


Attachments:
go (207.00 B)
divtest.S (582.00 B)

2015-11-26 12:42:07

by Måns Rullgård

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ARM: Replace calls to __aeabi_{u}idiv with udiv/sdiv instructions

Nicolas Pitre <[email protected]> writes:

> On Thu, 26 Nov 2015, Måns Rullgård wrote:
>
>> Russell King - ARM Linux <[email protected]> writes:
>>
>> > On Thu, Nov 26, 2015 at 12:50:08AM +0000, Måns Rullgård wrote:
>> >> If not calling the function saves an I-cache miss, the benefit can be
>> >> substantial. No, I have no proof of this being a problem, but it's
>> >> something that could happen.
>> >
>> > That's a simplistic view of modern CPUs.
>> >
>> > As I've already said, modern CPUs which have branch prediction, but
>> > they also have speculative instruction fetching and speculative data
>> > prefetching - which the CPUs which have idiv support will have.
>> >
>> > With such features, the branch predictor is able to learn that the
>> > branch will be taken, and because of the speculative instruction
>> > fetching, it can bring the cache line in so that it has the
>> > instructions it needs with minimal or, if working correctly,
>> > without stalling the CPU pipeline.
>>
>> It doesn't matter how many fancy features the CPU has. Executing more
>> branches and using more cache lines puts additional pressure on those
>> resources, reducing overall performance. Besides, the performance
>> counters readily show that the prediction is nothing near as perfect as
>> you seem to believe.
>
> OK... Let's try to come up with actual numbers.
>
> We know that letting gcc emit idiv by itself is the ultimate solution.
> And it is free of maintenance on our side besides passing the
> appropriate argument to gcc of course. So this is worth doing.
>
> For the case where you have a set of target machines in your kernel that
> may or may not have idiv, then the first step should be to patch
> __aeabi_uidiv and __aeabi_idiv. This is a pretty small and simple
> change that might turn out to be more than good enough. It is necessary
> anyway as the full patching solution does not cover all cases.
>
> Then, IMHO, it would be a good idea to get performance numbers to
> compare that first step and the full patching solution. Of course the
> full patching will yield better performance. It has to. But if the
> difference is not significant enough, then it might not be worth
> introducing the implied complexity into mainline. And it is not because
> the approach is bad. In fact I think this is a very cool hack. But it
> comes with a cost in maintenance and that cost has to be justified.
>
> Just to have an idea, I produced the attached micro benchmark. I tested
> on a TC2 forced to a single Cortex-A15 core and I got those results:
>
> Testing INLINE_DIV ...
>
> real 0m7.182s
> user 0m7.170s
> sys 0m0.000s
>
> Testing PATCHED_DIV ...
>
> real 0m7.181s
> user 0m7.170s
> sys 0m0.000s
>
> Testing OUTOFLINE_DIV ...
>
> real 0m7.181s
> user 0m7.170s
> sys 0m0.005s
>
> Testing LIBGCC_DIV ...
>
> real 0m18.659s
> user 0m18.635s
> sys 0m0.000s
>
> As you can see, whether the div is inline or out-of-line, whether
> arguments are moved into r0-r1 or not, makes no difference at all on a
> Cortex-A15.
>
> Now forcing it onto a Cortex-A7 core:
>
> Testing INLINE_DIV ...
>
> real 0m8.917s
> user 0m8.895s
> sys 0m0.005s
>
> Testing PATCHED_DIV ...
>
> real 0m11.666s
> user 0m11.645s
> sys 0m0.000s
>
> Testing OUTOFLINE_DIV ...
>
> real 0m13.065s
> user 0m13.025s
> sys 0m0.000s
>
> Testing LIBGCC_DIV ...
>
> real 0m51.815s
> user 0m51.750s
> sys 0m0.005s
>
> So on a Cortex-A7 the various overheads become visible. How significant
> is it in practice with normal kernel usage? I don't know.

Bear in mind that in a trivial test like this, everything fits in L1
caches and branch prediction works perfectly. It would be more
informative to measure the effect on a load that already has some cache
and branch prediction misses.

--
Måns Rullgård
[email protected]

2015-11-30 15:11:21

by Michal Marek

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] scripts: Add a recorduidiv program

On 2015-11-26 00:47, Russell King - ARM Linux wrote:
> On Wed, Nov 25, 2015 at 01:51:03PM -0800, Stephen Boyd wrote:
>> The ARM compiler inserts calls to __aeabi_uidiv() and
>> __aeabi_idiv() when it needs to perform division on signed and
>> unsigned integers. If a processor has support for the udiv and
>> sdiv division instructions the calls to these support routines
>> can be replaced with those instructions. Therefore, record the
>> location of calls to these library functions into two sections
>> (one for udiv and one for sdiv) similar to how we trace calls to
>> mcount. When the kernel boots up it will check to see if the
>> processor supports the instructions and then patch the call sites
>> with the instruction.
>
> Do we have any resolution on these programs which modify the object
> files in-place, rather than breaking any hard-links which may be
> present (eg, as a result of using ccache in hard-link mode) ?

Good point, but I do not think anybody is using CCACHE_HARDLINK with the
kernel. As the manpage says, it is going to confuse make, so the time
saved by ccache would be offset by make trying to recompile all *.c
files each time.

Michal

2015-11-30 15:32:26

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] scripts: Add a recorduidiv program

On Mon, Nov 30, 2015 at 04:11:16PM +0100, Michal Marek wrote:
> On 2015-11-26 00:47, Russell King - ARM Linux wrote:
> > On Wed, Nov 25, 2015 at 01:51:03PM -0800, Stephen Boyd wrote:
> >> The ARM compiler inserts calls to __aeabi_uidiv() and
> >> __aeabi_idiv() when it needs to perform division on signed and
> >> unsigned integers. If a processor has support for the udiv and
> >> sdiv division instructions the calls to these support routines
> >> can be replaced with those instructions. Therefore, record the
> >> location of calls to these library functions into two sections
> >> (one for udiv and one for sdiv) similar to how we trace calls to
> >> mcount. When the kernel boots up it will check to see if the
> >> processor supports the instructions and then patch the call sites
> >> with the instruction.
> >
> > Do we have any resolution on these programs which modify the object
> > files in-place, rather than breaking any hard-links which may be
> > present (eg, as a result of using ccache in hard-link mode) ?
>
> Good point, but I do not think anybody is using CCACHE_HARDLINK with the
> kernel.

That's wrong then, because I've been using it for a very long time with
my nightly builds. :) Therefore, there is somebody!

> As the manpage says, it is going to confuse make, so the time
> saved by ccache would be offset by make trying to recompile all *.c
> files each time.

From what I've noticed, it makes a big difference when running nightly
builds. My nightly builds use O= and always build into an empty target
tree, so there are no old objects back-dated to confuse make.

Even if there were, make would spot that the object is older than the
source, and try to re-make the target again, at which point ccache
would re-hardlink the object after looking up the hashed preprocessed
source.

--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-11-30 15:40:30

by Michal Marek

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] scripts: Add a recorduidiv program

On 2015-11-30 16:32, Russell King - ARM Linux wrote:
> On Mon, Nov 30, 2015 at 04:11:16PM +0100, Michal Marek wrote:
>> On 2015-11-26 00:47, Russell King - ARM Linux wrote:
>>> On Wed, Nov 25, 2015 at 01:51:03PM -0800, Stephen Boyd wrote:
>>>> The ARM compiler inserts calls to __aeabi_uidiv() and
>>>> __aeabi_idiv() when it needs to perform division on signed and
>>>> unsigned integers. If a processor has support for the udiv and
>>>> sdiv division instructions the calls to these support routines
>>>> can be replaced with those instructions. Therefore, record the
>>>> location of calls to these library functions into two sections
>>>> (one for udiv and one for sdiv) similar to how we trace calls to
>>>> mcount. When the kernel boots up it will check to see if the
>>>> processor supports the instructions and then patch the call sites
>>>> with the instruction.
>>>
>>> Do we have any resolution on these programs which modify the object
>>> files in-place, rather than breaking any hard-links which may be
>>> present (eg, as a result of using ccache in hard-link mode) ?
>>
>> Good point, but I do not think anybody is using CCACHE_HARDLINK with the
>> kernel.
>
> That's wrong then, because I've been using it for a very long time with
> my nightly builds. :) Therefore, there is somebody!

OK.


>> As the manpage says, it is going to confuse make, so the time
>> saved by ccache would be offset by make trying to recompile all *.c
>> files each time.
>
> From what I've noticed, it makes a big difference when running nightly
> builds. My nightly builds use O= and always build into an empty target
> tree, so there are no old objects back-dated to confuse make.
>
> Even if there were, make would spot that the object is older than the
> source, and try to re-make the target again, at which point ccache
> would re-hardlink the object after looking up the hashed preprocessed
> source.

That's what I meant. If you do a second make in a freshly built tree, it
will recompile all the files again. It will all be cache hits, but each
of them will be a lot slower than comparing two timestamps. But in
throwaway build trees you do not care; I hadn't considered that...

Michal

2015-12-01 16:07:12

by Michal Marek

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] scripts: Add a recorduidiv program

On 2015-11-30 16:40, Michal Marek wrote:
> On 2015-11-30 16:32, Russell King - ARM Linux wrote:
>> On Mon, Nov 30, 2015 at 04:11:16PM +0100, Michal Marek wrote:
>>> On 2015-11-26 00:47, Russell King - ARM Linux wrote:
>>>> Do we have any resolution on these programs which modify the object
>>>> files in-place, rather than breaking any hard-links which may be
>>>> present (eg, as a result of using ccache in hard-link mode) ?
>>>
>>> Good point, but I do not think anybody is using CCACHE_HARDLINK with the
>>> kernel.
>>
>> That's wrong then, because I've been using it for a very long time with
>> my nightly builds. :) Therefore, there is somebody!
>
> OK.

So, both recordmcount and the new recorduidiv program are idempotent.
They check whether the to-be-added section is already present and, if so, do nothing.
So the result is correct even with CCACHE_HARDLINK, just the
intermediate file might be incorrect. If this still is considered an
issue, I suggest clearing CCACHE_HARDLINK when using any of these
postprocessors, so as not to penalize other use cases.

The perl recordmcount does not modify the file in place.

Michal

2015-12-01 16:20:00

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] scripts: Add a recorduidiv program

On Tue, Dec 01, 2015 at 05:07:05PM +0100, Michal Marek wrote:
> On 2015-11-30 16:40, Michal Marek wrote:
> > On 2015-11-30 16:32, Russell King - ARM Linux wrote:
> >> On Mon, Nov 30, 2015 at 04:11:16PM +0100, Michal Marek wrote:
> >>> On 2015-11-26 00:47, Russell King - ARM Linux wrote:
> >>>> Do we have any resolution on these programs which modify the object
> >>>> files in-place, rather than breaking any hard-links which may be
> >>>> present (eg, as a result of using ccache in hard-link mode) ?
> >>>
> >>> Good point, but I do not think anybody is using CCACHE_HARDLINK with the
> >>> kernel.
> >>
> >> That's wrong then, because I've been using it for a very long time with
> >> my nightly builds. :) Therefore, there is somebody!
> >
> > OK.
>
> So, both recordmcount and the new recorduidiv program are idempotent.
> They check whether the to-be-added section is already present and, if so, do nothing.

They hardly "do nothing": recordmcount, for example, plasters the build
log with warnings. A solution to that would be to make recordmcount
silent if the section is already present.

> So the result is correct even with CCACHE_HARDLINK, just the
> intermediate file might be incorrect. If this still is considered an
> issue, I suggest clearing CCACHE_HARDLINK when using any of these
> postprocessors, so as not to penalize other use cases.

Another solution would be to have the top level make file unset the
CCACHE_HARDLINK environment variable if any of the options which enable
in-place editing of object files is enabled. Looking at the ccache
code, the environment variable has to be deleted from the environment
to turn off the option - and I'm not sure whether make can delete
environment variables. It certainly can override them, but I see
nothing in the info pages which suggests that environment variables
can be deleted by a makefile.

However, doing it outside of the kernel build system is likely
error-prone, especially as the kernel configuration options change
and/or their effect changes.

--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-12-01 16:43:52

by Michal Marek

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] scripts: Add a recorduidiv program

On 2015-12-01 17:19, Russell King - ARM Linux wrote:
> On Tue, Dec 01, 2015 at 05:07:05PM +0100, Michal Marek wrote:
>> So, both recordmcount and the new recorduidiv program are idempotent.
>> They check whether the to-be-added section is already present and, if so, do nothing.
>
> They hardly "do nothing", as the (eg) recordmcount plasters the build
> log with warnings. A solution to that would be to make recordmcount
> silent if the section is already present.

Right, there is a warning. The recorduidiv program exits silently.


>> So the result is correct even with CCACHE_HARDLINK, just the
>> intermediate file might be incorrect. If this still is considered an
>> issue, I suggest clearing CCACHE_HARDLINK when using any of these
>> postprocessors, so as not to penalize other use cases.
>
> Another solution would be to have the top level make file unset the
> CCACHE_HARDLINK environment variable if any of the options which enable
> in-place editing of object files is enabled.

This is what I meant, sorry for not being clear.


> Looking at the ccache
> code, the environment variable has to be deleted from the environment
> to turn off the option - and I'm not sure whether make can delete
> environment variables. It certainly can override them, but I see
> nothing in the info pages which suggests that environment variables
> can be deleted by a makefile.

unexport CCACHE_HARDLINK

will do the trick.
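For what it's worth, the effect is easy to check from a scratch directory. This is just a sketch: the temporary Makefile and the "check" target are made up for illustration, only CCACHE_HARDLINK is the real variable:

```shell
set -e
dir=$(mktemp -d)
# The recipe reports whether CCACHE_HARDLINK reached its environment.
printf 'unexport CCACHE_HARDLINK\ncheck:\n\t@env | grep CCACHE_HARDLINK || echo CCACHE_HARDLINK-is-unset\n' > "$dir/Makefile"
CCACHE_HARDLINK=1 make -s --no-print-directory -C "$dir" check
# prints: CCACHE_HARDLINK-is-unset
```

The `unexport` directive stops make from passing the variable down even when it was inherited from the environment, which is exactly what ccache needs in order to fall back to copying.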

Michal

2015-12-01 16:49:35

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] scripts: Add a recorduidiv program

On Tue, 1 Dec 2015 16:19:44 +0000
Russell King - ARM Linux <[email protected]> wrote:

> They hardly "do nothing", as the (eg) recordmcount plasters the build
> log with warnings. A solution to that would be to make recordmcount
> silent if the section is already present.

Note, that warning found plenty of bugs when modifications to the build
system were being made and broke recordmcount.c. I really don't want to
silence it.

But for some reason, your build is causing lots of warnings when others'
are not. Perhaps we can add a "SILENT_RECORDMCOUNT" environment variable
and have it set when something like CCACHE_HARDLINK or whatever is
causing it to trigger when we don't care.

-- Steve

2015-12-01 17:10:28

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] scripts: Add a recorduidiv program

On Tue, Dec 01, 2015 at 11:49:29AM -0500, Steven Rostedt wrote:
> On Tue, 1 Dec 2015 16:19:44 +0000
> Russell King - ARM Linux <[email protected]> wrote:
>
> > They hardly "do nothing", as the (eg) recordmcount plasters the build
> > log with warnings. A solution to that would be to make recordmcount
> > silent if the section is already present.
>
> Note, that warning found plenty of bugs when modifications of the build
> system was being done and broke recordmcount.c. I really don't want to
> silent it.
>
> But for some reason, your build is causing lots of warnings and not for
> others. Perhaps we can add a "SILENT_RECORDMCOUNT" environment variable
> and have it set when something like CCACHE_HARDLINK or whatever is
> causing it to trigger when we don't care.

The case is:

Build 1 runs with CCACHE_HARDLINK enabled.
- Each object ccache creates will be stored in ccache, and hard linked
into the throw-away object tree.
- recordmcount modifies in-place the object in the object tree, which
also modifies the object in the ccache repository.

The throw-away object tree is thrown away, and a new tree is created,
and the build re-run. It doesn't matter what CCACHE options are used,
the effect will now be the same:
- Each "hit" ccache object from the previous build will be linked or
copied to the new throw-away object tree.
- recordmcount will be re-run on the object, which now contains the
results of the previous recordmcount in-place modification. This
causes recordmcount to issue a warning.

There's two solutions to this: one is to disable CCACHE_HARDLINK for
all kernel builds which use in-place object modification. The other
solution is to avoid in-place object modification, instead doing a
read-write-rename.
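The failure mode, and why breaking the hard link fixes it, can be sketched in a few lines of shell (the file names are hypothetical stand-ins, not the real build layout):

```shell
set -e
dir=$(mktemp -d); cd "$dir"

# Stand-ins: cache.o is the object stored in ccache,
# build.o is its hard link in the throw-away build tree.
printf 'original object\n' > cache.o
ln cache.o build.o

# In-place modification writes through the shared inode,
# so the cached copy is corrupted as well.
printf 'patched object\n' > build.o
grep -q 'patched object' cache.o && echo 'cached copy corrupted'

# Breaking the link first (unlink, then recreate) keeps the cache intact.
printf 'original object\n' > cache.o   # reset; build.o still shares the inode
rm build.o                             # break the hard link
printf 'patched object\n' > build.o    # new inode; cache.o is untouched
grep -q 'original object' cache.o && echo 'cached copy intact'
```

The second half is the same read-contents, unlink, recreate, write-back sequence discussed for recordmcount, just expressed with shell redirections instead of open()/uwrite().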

I think I ought to ask another question though, before we decide what
to do. With recordmcount doing in-place object modification, what
happens if a SIGINT or similar is received halfway through the
modification of an object? I would hope that make would delete the
object and not leave it around.

Another suggestion - maybe recordmcount, which fstat()s the file,
should check the st_nlink before modifying the file, and error out
with a helpful error message telling people not to use hardlinks,
which would stop nasty surprises (and make it a rule that this should
be implemented as a general principle for good build behaviour) - iow,
something like this (untested):

scripts/recordmcount.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/scripts/recordmcount.c b/scripts/recordmcount.c
index 698768bdc581..bb7589fd7392 100644
--- a/scripts/recordmcount.c
+++ b/scripts/recordmcount.c
@@ -203,6 +203,10 @@ static void *mmap_file(char const *fname)
 		fprintf(stderr, "not a regular file: %s\n", fname);
 		fail_file();
 	}
+	if (sb.st_nlink != 1) {
+		fprintf(stderr, "file is hard linked: %s\n", fname);
+		fail_file();
+	}
 	addr = mmap(0, sb.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE,
 		    fd_map, 0);
 	mmap_failed = 0;


--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-12-01 17:22:19

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] scripts: Add a recorduidiv program

On Tue, 1 Dec 2015 17:10:14 +0000
Russell King - ARM Linux <[email protected]> wrote:

> Another suggestion - maybe recordmcount, which fstat()s the file,
> should check the st_nlink before modifying the file, and error out
> with a helpful error message telling people not to use hardlinks,
> which would stop nasty surprises (and make it a rule that this should
> be implemented as a general principle for good build behaviour) - iow,

Actually I like this solution the best.

> something like this (untested):

Can you test it to see if it gives you the error, otherwise I need to
set up a CCACHE_HARDLINK environment :-)

I guess another solution is to do a copy instead of modifying in place
if it detects multiple hard links?

-- Steve


>
> scripts/recordmcount.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/scripts/recordmcount.c b/scripts/recordmcount.c
> index 698768bdc581..bb7589fd7392 100644
> --- a/scripts/recordmcount.c
> +++ b/scripts/recordmcount.c
> @@ -203,6 +203,10 @@ static void *mmap_file(char const *fname)
> fprintf(stderr, "not a regular file: %s\n", fname);
> fail_file();
> }
> + if (sb.st_nlink != 1) {
> + fprintf(stderr, "file is hard linked: %s\n", fname);
> + fail_file();
> + }
> addr = mmap(0, sb.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE,
> fd_map, 0);
> mmap_failed = 0;
>
>

2015-12-01 18:17:01

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] scripts: Add a recorduidiv program

On Tue, Dec 01, 2015 at 12:22:12PM -0500, Steven Rostedt wrote:
> On Tue, 1 Dec 2015 17:10:14 +0000
> Russell King - ARM Linux <[email protected]> wrote:
>
> > Another suggestion - maybe recordmcount, which fstat()s the file,
> > should check the st_nlink before modifying the file, and error out
> > with a helpful error message telling people not to use hardlinks,
> > which would stop nasty surprises (and make it a rule that this should
> > be implemented as a general principle for good build behaviour) - iow,
>
> Actually I like this solution the best.
>
> > something like this (untested):
>
> Can you test it to see if it gives you the error, otherwise I need to
> set up a CCACHE_HARDLINK environment :-)

It does indeed:

CC init/do_mounts_initrd.o
file is hard linked: init/do_mounts_initrd.o
/home/rmk/git/linux-rmk/scripts/Makefile.build:258: recipe for target 'init/do_mounts_initrd.o' failed
make[2]: *** [init/do_mounts_initrd.o] Error 1

> I guess another solution is to do a copy instead of modifying in place
> if it detects the multiple hard link?

That would be the "transparent" solution. If you think it's worth
pursuing, I'll have a go at fixing recordmcount to do that.

--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-12-01 21:39:14

by Michal Marek

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] scripts: Add a recorduidiv program

On 1.12.2015 at 19:16, Russell King - ARM Linux wrote:
> On Tue, Dec 01, 2015 at 12:22:12PM -0500, Steven Rostedt wrote:
>> I guess another solution is to do a copy instead of modifying in place
>> if it detects the multiple hard link?
>
> That would be the "transparent" solution. If you think it's worth
> pursuing, I'll have a go at fixing recordmcount to do that.

But in terms of the number of file copies, it would be on par with
disabling CCACHE_HARDLINK from within the Makefile.

Michal

2015-12-02 10:24:03

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] scripts: Add a recorduidiv program

On Tue, Dec 01, 2015 at 06:16:43PM +0000, Russell King - ARM Linux wrote:
> On Tue, Dec 01, 2015 at 12:22:12PM -0500, Steven Rostedt wrote:
> > I guess another solution is to do a copy instead of modifying in place
> > if it detects the multiple hard link?
>
> That would be the "transparent" solution. If you think it's worth
> pursuing, I'll have a go at fixing recordmcount to do that.

Well, copying the file is easy - I've tested this and the linker
appears happy with the result:

scripts/recordmcount.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/scripts/recordmcount.c b/scripts/recordmcount.c
index 698768bdc581..91705ef30402 100644
--- a/scripts/recordmcount.c
+++ b/scripts/recordmcount.c
@@ -211,6 +211,20 @@ static void *mmap_file(char const *fname)
 		addr = umalloc(sb.st_size);
 		uread(fd_map, addr, sb.st_size);
 	}
+	if (sb.st_nlink != 1) {
+		/* file is hard-linked, break the hard link */
+		close(fd_map);
+		if (unlink(fname) < 0) {
+			perror(fname);
+			fail_file();
+		}
+		fd_map = open(fname, O_RDWR | O_CREAT, sb.st_mode);
+		if (fd_map < 0) {
+			perror(fname);
+			fail_file();
+		}
+		uwrite(fd_map, addr, sb.st_size);
+	}
 	return addr;
 }


--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2015-12-02 14:05:25

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] scripts: Add a recorduidiv program

On Wed, 2 Dec 2015 10:23:39 +0000
Russell King - ARM Linux <[email protected]> wrote:

> On Tue, Dec 01, 2015 at 06:16:43PM +0000, Russell King - ARM Linux wrote:
> > On Tue, Dec 01, 2015 at 12:22:12PM -0500, Steven Rostedt wrote:
> > > I guess another solution is to do a copy instead of modifying in place
> > > if it detects the multiple hard link?
> >
> > That would be the "transparent" solution. If you think it's worth
> > pursuing, I'll have a go at fixing recordmcount to do that.
>
> Well, copying the file is easy - I've tested this and the linker
> appears happy with the result:

Want to make this into a proper patch and I'll start running it through
my tests?

-- Steve

>
> scripts/recordmcount.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/scripts/recordmcount.c b/scripts/recordmcount.c
> index 698768bdc581..91705ef30402 100644
> --- a/scripts/recordmcount.c
> +++ b/scripts/recordmcount.c
> @@ -211,6 +211,20 @@ static void *mmap_file(char const *fname)
> addr = umalloc(sb.st_size);
> uread(fd_map, addr, sb.st_size);
> }
> + if (sb.st_nlink != 1) {
> + /* file is hard-linked, break the hard link */
> + close(fd_map);
> + if (unlink(fname) < 0) {
> + perror(fname);
> + fail_file();
> + }
> + fd_map = open(fname, O_RDWR | O_CREAT, sb.st_mode);
> + if (fd_map < 0) {
> + perror(fname);
> + fail_file();
> + }
> + uwrite(fd_map, addr, sb.st_size);
> + }
> return addr;
> }
>
>