2015-12-15 22:16:46

by Luis Chamberlain

[permalink] [raw]
Subject: [RFC v1 0/8] x86/init: Linux linker tables

From: "Luis R. Rodriguez" <[email protected]>

A long time ago in a galaxy far,
far away...

Konrad Rzeszutek Wilk posted patches which eventually got merged to help
with modularizing the IOMMUs we have on x86 [0]. This work was done due to
the complex relationship that exists on IOMMUs and the requirements on
careful execution. The solution also provided a mechanism which jettisoned
unused IOMMUs during run-time.

During review, even though the code was merged, hpa did note that we tend
to encounter this type of problem "often enough that we should implement a
generic facility for it" [1], hpa acknowledged that it obviously has to be
based on sections and even noted that perhaps we might be able in the future to
automate its creation. He noted that the gPXE folks had done just this with
linker tables and suggested that "presumably we'd need a few different flavors
for init tables and so on, but this would make it a generic mechanism."

The IOMMU code got merged and this was left on someone's mental backburner.
I've had an itch to scratch recently to try to avoid issues which are possible
if one does not jettison other code carefully due to the large complexity of
implicit dependencies of certain code on x86 in particular with possible dead
code on x86 due to paravirtualization, and the IOMMU jettison strategy turned
out to be my favorite solution so far. I've taken on hpa's suggestions from
back in the day to review gPXE's solution to see if we could embrace it on
Linux for a generic section solution to help jettison code carefully.

What this patch set does exactly:

This RFC patch set attempts to add support for such a generic solution.
In the end, it turns out that the best solution possible was the best of
both worlds: a combination of what Konrad had implemented in addition to
what Michael Brown had implemented on the gXPE front. The IOMMU solution
enables simple semantic annotations for dependency relationships, this
however requires a run time sort. The gPXE solution grants the option to
simply sort at build time. One of gPXE's solution primary goals however was
also to help avoid bit-rot on code that's possible from #ifdef'ery. The Linux
linker table solution enables developers to pick and choose what they
need, with linker tables being the simplest solution. Contrary to gPXE
which strives to force compilation of all linker table solutions we
let developers pick *when* they want this as part of their solution.
As can be seen from the suggested x86 init specific use of linker tables
proposed you can also take advantage of both, linker sorting, optional
compilation when needed (at developer's discretion), and even careful
semantics annotation for dependency / relationship annotations. Although
the x86 init solution here is heavily inspired by the IOMMU solution it
diverges with strong semantics, and a new optional subarchitecture
annotation. Sorting of init sequences is structure specific, as such
each subsystem must defing their own solution unless semantics could
be shared. I considered sharing semantics but in the end this proved
pointless so this keeps things separate. A series of changes were made
to the x86 init sequence in contrast to the IOMMU solution to be *extremely
pedantic* on semantics, review of this changes can be studied on the
table-init tree [2].

Quick review of gPXE's solution and prospects on further changes:

In my review from gPXE's solution it was not clear what hpa meant by
gXPE folks having automated this process, they actually use linker tables
all around, forcing compilation of *everything* and just do linking of
enabled features at link time. You still need to build linker tables on
your own. What I do see more potential for in the future is enabling to
evolve stronger semantics over time, and this would also be subsystem specific.
This will be evident in this patch set on the x86 init use of linker tables.
I also see potential in strenghtening semantics for linker sorting, any of
these types of features however would impose requiring newer binutils. For
instance, gPXE's linker solution currently relies on SORT(), that defaults to
SORT_BY_NAME(). This sorts lexicographically, gPXE's solution uses two digits
to enable SORT_BY_NAME()'s lexicographical sort to sort orde by numeric
priority. Since one is in control of order-level numbers one can provide
guarantee that this sort should work as intended, however binutils also now has
a SORT_BY_INIT_PRIORITY() which sorts specifically based on digits.
SORT_BY_INIT_PRIORITY() was designed specifically for init_array sections
though. Refer to the userspace mockup solution table-init git tree [2] commit
6deba47ee1ad461e90 for more details on this. One thing I can envision to help
here further are prospects for future linker enhancements, these however must
be carefully considered in light of possible requirements of newer binutils
and also of compatibility with other toolchains. For now we resort to the good
'ol SORT() which even Linux has made use of for ages already. In that regards
here, nothing new here. gPXE folks however did find some fun ICC compatiblity
issues, and have also developed fixes for them, these were obviously carried
over but should be reviewed carefully. Lastly, I should point out that
essentially what we're developing are different forms loose and strong
semantics -- in the most complex form what we really are after are feature
graphs. Mauro explained to me that for media driver they needed to build a
feature graphs, IMHO that could be a next level of strong semantics we might
want to consider. For now though the combination of gPXE's linker-table
solution based on linker order-level sort, and our old IOMMU init solution
and its simple dependency map (and possible extensions, see subarchitecture)
should likely suffice as a light weight solution for where we need semantics
for at least on the x86 init path.

Motivation and possible enhancements:

To understand what made me tick to work on this feel free to read the dead
code concerns with paravirtualization posts I've made, which also go into
and detail Xen's alternative entry point [4] [5]. That's not the end of
possibilities to help address to possible "dead code" on Linux, I have other
ideas but I have a bit more work to do before publishing anything about it.
If anything -- later work should likely supplement this solution further.

Currently the earliest I was able to make use of boot_params on x86-64 for
linker tables was after load_idt(), so I've decided to stick with the earliest
init call for x86 init sequences starting on x86_64_start_reservation().
However, if we're able to make use of boot_params earlier (help appreciated),
in particular just boot_params.hdr.hardware_subarch we should be able to make
clearer and careful annotations on early x86-64 init code. Using the
boot_params.hdr.hardware_subarch and requiring clear subarchitecture support
annotations on early init code should provide a *proactive* means to avoid issues
such as the cr4 shadow oversight [6] which later caused crashes on Xen. The
latest similar type of issue is when KAsan was introduced, after it was
integrated I suspected KAsan would probably break if enabled on Xen, I reported
this in March 2015 [7] and Andrey confirmed this might be the case but since he
wanted it enabled for allyesconfig and allmodconfig he could not think of a
clean way to disable it or address support for it then [8]. During the
development of this patch set I confirmed KAsan crashes Xen dom0 and therefore
should obviously crash Xen PV guests as well. Using the same kernel KAsan
still worked on KVM and bare-metal. Provided we had early access to
boot_params.hdr.hardware_subarch, in x86_64_start_kernel() we could annotate
kasan_early_init() and friends not supported on Xen, however since we don't
have a way to disable KAsan at run time at this time we wouldn't have any other
option but to crash early. Because of this though we should perhaps just consider
disabling KAsan for Xen configs and KAsan folks might just have to bite the
bullet. If KAsan folks are gung-ho about not disabling KAsan when Xen is
enabled and if its easier to add support to disable KAsan at run time than
it is to add KAsan support for Xen one alternative might be to use linker
tables to annotate this and disable KAsan at run time for Xen.

One of the points of the use of the x86 linker table solution and specific
semantics there in is to *force* developers to think about requirements on x86
carefully. So for instance by requiring itemizing supported subarchitectures
*early* on in development we should be able to avoid proactively issues such as
the crash due to the cr4 shadow changes not added on the Xen entry path, and
also missing the requirement to develop a solution for Kasan for Xen. The
proposed x86 linker table solution is not the first to use
boot_params.hdr.hardware_subarch, its first use was actually on i386, and for
lguest. We take advantage of it to avoid further extending pv_ops, and
*actually* to see if we could simplify pv_ops over time further. This patch set
also includes a renaming of paravirt_enabled() to paravirt_legacy()

An unexpected long term side goal which *might* be possible due to linker
tables on x86 is to help unify the different C entry points for Linux x86-64.
I've taken a stab at it after these patches [9] but it fails on Xen, likely
because the stack is not set up right due to the different calls / argument
requirements. If this is desirable folks could take a look and perhaps help
on that front.

Where to get code alternatively and testing:

All of these patches are RFCs, if anything the only patches worth consdering
merging sooner rather than later might be "paravirt: rename paravirt_enabled
to paravirt_legacy" and "x86/boot: add BIT() to boot/bitops.h". After review
I can send those separately in patch form if agreeable. Since these are all
just RFCs I've based these patches on top of Linus' v4.4-rc5. If you want all
patches in one file you can get them here [10], and a respective linux-next
version which applies on top of next-20151215 here [11]. I also have two
git trees set up with branches for this code, one based on Linus' tree [12]
and another based on linux-next next-20151215 [13]. I'll see what issues
zero-day bot testing finds out. I've just run time tested this on x86-64
bare metal, KVM, and Xen dom0 so far.

[0] https://marc.info/?l=linux-kernel&m=128284562303565&w=2
[1] https://marc.info/?l=linux-kernel&m=128285216913266&w=2
[2] https://github.com/mcgrof/table-init/
[3] https://github.com/mcgrof/table-init/commit/6deba47ee1ad461e90e0fbba226a535cfc1c58f3
[4] http://www.do-not-panic.com/2015/12/avoiding-dead-code-pvops-not-silver-bullet.html
[5] http://www.do-not-panic.com/2015/12/xen-and-x86-linux-zero-page.html
[6] http://lists.xenproject.org/archives/html/xen-devel/2015-02/msg02742.html
[7] http://lkml.kernel.org/r/CAB=NE6Xs5fepzNtymzT4CueeJZ0KMPETpda114DpL4eMtDswtw@mail.gmail.com
[8] http://lkml.kernel.org/r/[email protected]
[9] http://drvbp1.linux-foundation.org/~mcgrof/patches/2015/12/15/x86-merge-x86-init-v1.patch
[10] http://drvbp1.linux-foundation.org/~mcgrof/patches/2015/12/15/pend-20151215-rfc-v1-linker-tables.patch
[11] http://drvbp1.linux-foundation.org/~mcgrof/patches/2015/12/15/pend-next-20151215-rfc-v1-linker-tables.patch
[12] https://git.kernel.org/cgit/linux/kernel/git/mcgrof/linux.git/log/?h=20151215-rfc-v1-linker-tables
[13] https://git.kernel.org/cgit/linux/kernel/git/mcgrof/linux-next.git/log/?h=20151215-rfc-v1-linker-tables

Luis R. Rodriguez (8):
paravirt: rename paravirt_enabled to paravirt_legacy
tables.h: add linker table support
x86/boot: add BIT() to boot/bitops.h
x86/init: add linker table support
x86/init: move ebda reservations into linker table
x86/init: use linker table for i386 early setup
x86/init: user linker table for ce4100 early setup
x86/init: use linker table for mid early setup

Documentation/kbuild/makefiles.txt | 19 ++
arch/x86/Kconfig.debug | 47 ++++
arch/x86/boot/bitops.h | 2 +
arch/x86/boot/boot.h | 2 +-
arch/x86/include/asm/bios_ebda.h | 2 -
arch/x86/include/asm/paravirt.h | 4 +-
arch/x86/include/asm/paravirt_types.h | 11 +-
arch/x86/include/asm/processor.h | 2 +-
arch/x86/include/asm/setup.h | 12 -
arch/x86/include/asm/x86_init.h | 1 +
arch/x86/include/asm/x86_init_fn.h | 267 ++++++++++++++++++++
arch/x86/kernel/Makefile | 4 +-
arch/x86/kernel/apm_32.c | 2 +-
arch/x86/kernel/asm-offsets.c | 2 +-
arch/x86/kernel/cpu/intel.c | 2 +-
arch/x86/kernel/cpu/microcode/core.c | 2 +-
arch/x86/kernel/dbg-tables/Makefile | 18 ++
arch/x86/kernel/dbg-tables/alpha.c | 10 +
arch/x86/kernel/dbg-tables/beta.c | 18 ++
arch/x86/kernel/dbg-tables/delta.c | 10 +
arch/x86/kernel/dbg-tables/gamma.c | 18 ++
arch/x86/kernel/dbg-tables/gamma.h | 3 +
arch/x86/kernel/{head.c => ebda.c} | 6 +-
arch/x86/kernel/head32.c | 22 +-
arch/x86/kernel/head64.c | 4 +-
arch/x86/kernel/init.c | 55 +++++
arch/x86/kernel/kvm.c | 9 +-
arch/x86/kernel/paravirt.c | 2 +-
arch/x86/kernel/sort-init.c | 114 +++++++++
arch/x86/kernel/tboot.c | 2 +-
arch/x86/kernel/vmlinux.lds.S | 25 ++
arch/x86/lguest/boot.c | 2 +-
arch/x86/mm/dump_pagetables.c | 2 +-
arch/x86/platform/ce4100/ce4100.c | 4 +-
arch/x86/platform/intel-mid/intel-mid.c | 4 +-
arch/x86/tools/relocs.c | 1 +
arch/x86/xen/enlighten.c | 2 +-
drivers/pnp/pnpbios/core.c | 2 +-
include/linux/tables.h | 421 ++++++++++++++++++++++++++++++++
scripts/Makefile.build | 4 +-
scripts/Makefile.clean | 1 +
scripts/Makefile.lib | 12 +
42 files changed, 1091 insertions(+), 61 deletions(-)
create mode 100644 arch/x86/include/asm/x86_init_fn.h
create mode 100644 arch/x86/kernel/dbg-tables/Makefile
create mode 100644 arch/x86/kernel/dbg-tables/alpha.c
create mode 100644 arch/x86/kernel/dbg-tables/beta.c
create mode 100644 arch/x86/kernel/dbg-tables/delta.c
create mode 100644 arch/x86/kernel/dbg-tables/gamma.c
create mode 100644 arch/x86/kernel/dbg-tables/gamma.h
rename arch/x86/kernel/{head.c => ebda.c} (94%)
create mode 100644 arch/x86/kernel/init.c
create mode 100644 arch/x86/kernel/sort-init.c
create mode 100644 include/linux/tables.h

--
2.6.2


2015-12-15 22:16:52

by Luis Chamberlain

[permalink] [raw]
Subject: [RFC v1 1/8] paravirt: rename paravirt_enabled to paravirt_legacy

From: "Luis R. Rodriguez" <[email protected]>

paravirt_enabled conveys the idea that if this is set or if
paravirt_enabled() returns true you are in a paravirtualized
environment. This is not true, this is really only useful to
determine if you have a paravirtualization hypervisor that uses
supports legacy paravirtualized guests. At run time, this tells
us if we've booted into a Linux guest with support for legacy
paravirtualized guests.

Make emphasis about this by renaming the member and helper.
While at it, make the type bool, and document it therefore
avoiding special cased comments trying to explain it.

The rename is done using the following Coccinelle SmPL patch:

@ rename_paravirt_enabled @
@@

-paravirt_enabled()
+paravirt_legacy()

@ rename_pv_info_pv_enabled @
@@
-pv_info.paravirt_enabled
+pv_info.paravirt_legacy

@ is_pv @
identifier pv;
@@
struct pv_info pv = {
};

@ rename_struct_pv_enabled depends on is_pv @
identifier is_pv.pv;
expression val;
@@

struct pv_info pv = {
- .paravirt_enabled
+ .paravirt_legacy
= val,
};

Generated-by: Coccinelle SmPL
Suggested-by: Konrad Rzeszutek Wilk <[email protected]>
Cc: [email protected]
cc: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Luis R. Rodriguez <[email protected]>
---
arch/x86/include/asm/paravirt.h | 4 ++--
arch/x86/include/asm/paravirt_types.h | 11 ++++++++++-
arch/x86/include/asm/processor.h | 2 +-
arch/x86/kernel/apm_32.c | 2 +-
arch/x86/kernel/asm-offsets.c | 2 +-
arch/x86/kernel/cpu/intel.c | 2 +-
arch/x86/kernel/cpu/microcode/core.c | 2 +-
arch/x86/kernel/head.c | 2 +-
arch/x86/kernel/kvm.c | 9 +--------
arch/x86/kernel/paravirt.c | 2 +-
arch/x86/kernel/tboot.c | 2 +-
arch/x86/lguest/boot.c | 2 +-
arch/x86/mm/dump_pagetables.c | 2 +-
arch/x86/xen/enlighten.c | 2 +-
drivers/pnp/pnpbios/core.c | 2 +-
15 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 10d0596433f8..38e698ff8cae 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -14,9 +14,9 @@
#include <linux/types.h>
#include <linux/cpumask.h>

-static inline int paravirt_enabled(void)
+static inline bool paravirt_legacy(void)
{
- return pv_info.paravirt_enabled;
+ return pv_info.paravirt_legacy;
}

static inline void load_sp0(struct tss_struct *tss,
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 31247b5bff7c..3d89e8e878bc 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -61,6 +61,15 @@ struct paravirt_callee_save {
};

/* general info */
+
+/**
+ * struct pv_info - hypervisor information
+ *
+ * @paravirt_legacy: true if this hypervisor supports legacy
+ * paravirtualized guests. Examples of features that
+ * characterize legacy paravirtualized guests are
+ * things such as the need for APM, PNP BIOS.
+ */
struct pv_info {
unsigned int kernel_rpl;
int shared_kernel_pmd;
@@ -69,7 +78,7 @@ struct pv_info {
u16 extra_user_64bit_cs; /* __USER_CS if none */
#endif

- int paravirt_enabled;
+ bool paravirt_legacy;
const char *name;
};

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 67522256c7ff..354a0d010c59 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -471,7 +471,7 @@ static inline unsigned long current_top_of_stack(void)
#include <asm/paravirt.h>
#else
#define __cpuid native_cpuid
-#define paravirt_enabled() 0
+#define paravirt_legacy() false

static inline void load_sp0(struct tss_struct *tss,
struct thread_struct *thread)
diff --git a/arch/x86/kernel/apm_32.c b/arch/x86/kernel/apm_32.c
index 052c9c3026cc..74a3e8ca2c6d 100644
--- a/arch/x86/kernel/apm_32.c
+++ b/arch/x86/kernel/apm_32.c
@@ -2267,7 +2267,7 @@ static int __init apm_init(void)

dmi_check_system(apm_dmi_table);

- if (apm_info.bios.version == 0 || paravirt_enabled() || machine_is_olpc()) {
+ if (apm_info.bios.version == 0 || paravirt_legacy() || machine_is_olpc()) {
printk(KERN_INFO "apm: BIOS not found.\n");
return -ENODEV;
}
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 439df975bc7a..20a7a0df2c65 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -59,7 +59,7 @@ void common(void) {

#ifdef CONFIG_PARAVIRT
BLANK();
- OFFSET(PARAVIRT_enabled, pv_info, paravirt_enabled);
+ OFFSET(PARAVIRT_legacy, pv_info, paravirt_legacy);
OFFSET(PARAVIRT_PATCH_pv_cpu_ops, paravirt_patch_template, pv_cpu_ops);
OFFSET(PARAVIRT_PATCH_pv_irq_ops, paravirt_patch_template, pv_irq_ops);
OFFSET(PV_IRQ_irq_disable, pv_irq_ops, irq_disable);
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 209ac1e7d1f0..134b5d5d9c74 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -220,7 +220,7 @@ static void intel_workarounds(struct cpuinfo_x86 *c)
* The Quark is also family 5, but does not have the same bug.
*/
clear_cpu_bug(c, X86_BUG_F00F);
- if (!paravirt_enabled() && c->x86 == 5 && c->x86_model < 9) {
+ if (!paravirt_legacy() && c->x86 == 5 && c->x86_model < 9) {
static int f00f_workaround_enabled;

set_cpu_bug(c, X86_BUG_F00F);
diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
index b3e94ef461fd..022cc8dad97f 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -630,7 +630,7 @@ int __init microcode_init(void)
struct cpuinfo_x86 *c = &boot_cpu_data;
int error;

- if (paravirt_enabled() || dis_ucode_ldr)
+ if (paravirt_legacy() || dis_ucode_ldr)
return -EINVAL;

if (c->x86_vendor == X86_VENDOR_INTEL)
diff --git a/arch/x86/kernel/head.c b/arch/x86/kernel/head.c
index 992f442ca155..279fad7288f8 100644
--- a/arch/x86/kernel/head.c
+++ b/arch/x86/kernel/head.c
@@ -38,7 +38,7 @@ void __init reserve_ebda_region(void)
* that the paravirt case can handle memory setup
* correctly, without our help.
*/
- if (paravirt_enabled())
+ if (paravirt_legacy())
return;

/* end of low (conventional) memory */
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 47190bd399e7..e2368f73bc14 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -283,14 +283,7 @@ NOKPROBE_SYMBOL(do_async_page_fault);
static void __init paravirt_ops_setup(void)
{
pv_info.name = "KVM";
-
- /*
- * KVM isn't paravirt in the sense of paravirt_enabled. A KVM
- * guest kernel works like a bare metal kernel with additional
- * features, and paravirt_enabled is about features that are
- * missing.
- */
- pv_info.paravirt_enabled = 0;
+ pv_info.paravirt_legacy = false;

if (kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY))
pv_cpu_ops.io_delay = kvm_io_delay;
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c2130aef3f9d..326df1ce2a11 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -312,7 +312,7 @@ enum paravirt_lazy_mode paravirt_get_lazy_mode(void)

struct pv_info pv_info = {
.name = "bare hardware",
- .paravirt_enabled = 0,
+ .paravirt_legacy = false,
.kernel_rpl = 0,
.shared_kernel_pmd = 1, /* Only used when CONFIG_X86_PAE is set */

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index 91a4496db434..87a5027ff242 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -75,7 +75,7 @@ void __init tboot_probe(void)
}

/* only a natively booted kernel should be using TXT */
- if (paravirt_enabled()) {
+ if (paravirt_legacy()) {
pr_warning("non-0 tboot_addr but pv_ops is enabled\n");
return;
}
diff --git a/arch/x86/lguest/boot.c b/arch/x86/lguest/boot.c
index a0d09f6c6533..eb0a42bbf41b 100644
--- a/arch/x86/lguest/boot.c
+++ b/arch/x86/lguest/boot.c
@@ -1409,7 +1409,7 @@ __init void lguest_init(void)
/* We're under lguest. */
pv_info.name = "lguest";
/* Paravirt is enabled. */
- pv_info.paravirt_enabled = 1;
+ pv_info.paravirt_legacy = true;
/* We're running at privilege level 1, not 0 as normal. */
pv_info.kernel_rpl = 1;
/* Everyone except Xen runs with this set. */
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index a035c2aa7801..daf87f12a0af 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -365,7 +365,7 @@ static inline bool is_hypervisor_range(int idx)
* ffff800000000000 - ffff87ffffffffff is reserved for
* the hypervisor.
*/
- return paravirt_enabled() &&
+ return paravirt_legacy() &&
(idx >= pgd_index(__PAGE_OFFSET) - 16) &&
(idx < pgd_index(__PAGE_OFFSET));
}
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 5774800ff583..95060b31e49c 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1186,7 +1186,7 @@ static unsigned xen_patch(u8 type, u16 clobbers, void *insnbuf,
}

static const struct pv_info xen_info __initconst = {
- .paravirt_enabled = 1,
+ .paravirt_legacy = true,
.shared_kernel_pmd = 0,

#ifdef CONFIG_X86_64
diff --git a/drivers/pnp/pnpbios/core.c b/drivers/pnp/pnpbios/core.c
index facd43b8516c..40557bc16005 100644
--- a/drivers/pnp/pnpbios/core.c
+++ b/drivers/pnp/pnpbios/core.c
@@ -521,7 +521,7 @@ static int __init pnpbios_init(void)
int ret;

if (pnpbios_disabled || dmi_check_system(pnpbios_dmi_table) ||
- paravirt_enabled()) {
+ paravirt_legacy()) {
printk(KERN_INFO "PnPBIOS: Disabled\n");
return -ENODEV;
}
--
2.6.2

2015-12-15 22:18:48

by Luis Chamberlain

[permalink] [raw]
Subject: [RFC v1 2/8] tables.h: add linker table support

From: "Luis R. Rodriguez" <[email protected]>

This adds Linux table support to Linux, based on Michael Brown's
gPXE's linker table solution. Linker tables enable an extremely
light weight linker build time solution for feature ordering and
selection, this can help to both simplify init sequences in a generic
fashion and helps avoiding code bit-rotting when desirable.

Bit rotting avoidance is enabled by *granting the option* to force
compilation of code and only enable linking object code in when
specific features have been enabled. It accomplishes this by using
linker sections, the lightweight feature ordering of code is enabled
by taking advantage of the old binutils ld SORT() on features
specific sections.

Contrary to gPXE's solution, which strives to force compilation
of all features, Linux' solution strives to encourage usage of linker
tables to force compilation of code *only on designated code*. Linker
tables can be used without requiring one to force code compilation
of features that use it though, to take advantage of the simplification
of init sequences.

Linux code that uses linker tables *and* wishes to always require
compilation but selectively linking must use new kbuild target object
that helps accomplishes this, table-y. Linux code that uses linker
tables but does not want to require forcing compilation can use the
good 'ol obj-y targets.

Used commit c23508d796 from gPXE upstream [0] as the starting point
for development and evaluation of Linux's integration, further changes
made and evaluation of different linker table options for Linux are
available on the table-init userspace mockup solution [1].

[0] git://git.etherboot.org/scm/gpxe.git
[1] https://github.com/mcgrof/table-init.git

Cc: Michael Brown <[email protected]>
Signed-off-by: Luis R. Rodriguez <[email protected]>
---
Documentation/kbuild/makefiles.txt | 19 ++
include/linux/tables.h | 421 +++++++++++++++++++++++++++++++++++++
scripts/Makefile.build | 4 +-
scripts/Makefile.clean | 1 +
scripts/Makefile.lib | 12 ++
5 files changed, 455 insertions(+), 2 deletions(-)
create mode 100644 include/linux/tables.h

diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt
index 13f888a02a3d..0a21b8db2f90 100644
--- a/Documentation/kbuild/makefiles.txt
+++ b/Documentation/kbuild/makefiles.txt
@@ -1088,6 +1088,25 @@ When kbuild executes, the following steps are followed (roughly):
In this example, extra-y is used to list object files that
shall be built, but shall not be linked as part of built-in.o.

+ table-y table-m and table-
+
+ To avoid code bit-rot you may wish to avoid using #ifdefs on your
+ code. Bit-rot is possible if certain features may not be enabled
+ on certain build environments. The table-y, table-m specifies
+ targets which you always wish to force compilation on, but wish
+ to only enable linking in if the feature is enabled, these are
+ features taking advantage of Linux's linker tables. For details
+ on how to implement a feature using linker tablesrefer to
+ include/linux/tables.h.
+
+ Example:
+ table-$(CONFIG-FEATURE_FOO) += foo.o
+
+ An alternative to using table-y, table-m is to use extra-y followed
+ by the respective obj-y, obj-m:
+
+ extra-y += foo.o
+ obj-$(CONFIG-FEATURE_FOO) += foo.o

--- 6.7 Commands useful for building a boot image

diff --git a/include/linux/tables.h b/include/linux/tables.h
new file mode 100644
index 000000000000..380b8220bed8
--- /dev/null
+++ b/include/linux/tables.h
@@ -0,0 +1,421 @@
+#ifndef _LINUX_LINKER_TABLES_H
+#define _LINUX_LINKER_TABLES_H
+/*
+ * Linux linker tables
+ *
+ * Copyright (C) 2005-2009 Michael Brown <[email protected]>
+ * Copyright (C) 2015 Luis R. Rodriguez <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * Due to this file being licensed under the GPL there is controversy over
+ * whether this permits you to write a module that #includes this file
+ * without placing your module under the GPL. Please consult a lawyer for
+ * advice before doing this.
+ */
+
+/**
+ * DOC: linker tables - simplifying inits and for when #ifdefs are harmful
+ *
+ * Linker tables help you simplify init sequences by using ELF sections, linker
+ * build time selective sorting (disabled options get ignored), and can also be
+ * used to help you avoid bit-rot code due to #ifdery collateral.
+ *
+ * The code bit-rot prolem:
+ *
+ * Overuse of C #ifdefs can be problematic for certain types of code. Linux
+ * provides a rich array of features, but all these features take up valuable
+ * space in a kernel image. The traditional solution to this problem has been
+ * for each feature to have its own Kconfig entry and for the respective code
+ * to be wrapped around #ifdefs, allowing the feature to be compiled in only
+ * if desired.
+ *
+ * The problem with this is that over time it becomes very difficult and time
+ * consuming to compile, let alone test, all possible versions of Linux. Code
+ * that is not typically used tends to suffer from bit-rot over time. It can
+ * become difficult to predict which combinations of compile-time options will
+ * result in code that can compile and link correctly.
+ *
+ * To solve this problem linker tables can be used on Linux, it enables you to
+ * always force compiling of select features that one wishes to avoid bit-rot
+ * while still enabling you to disable linking feature code into the final
+ * kernel image or building certain modules if the features have been disabled
+ * via Kconfig. The code is derivative of gPXE linker table's solution. For
+ * more details on history and to see how this code evolved refer to the
+ * userspace mockup solution on github [0]. This repository can also be used to
+ * for ease of testing of extensions and sampling of changes prior to inclusion
+ * into Linux, it is intended to be kept up to date to match Linux's solution.
+ * Contrary to gPXE's solution, which stives to force compilation of
+ * *everything*, Linux's solution allows for developers to be selective over
+ * where one should use linker tables.
+ *
+ * [0] https://github.com/mcgrof/table-init/
+ *
+ * To use linker tables, features should be implemented in separate C files,
+ * and should always be compiled -- they should not be guarded with a C code
+ * #ifdef CONFIG_FOO statements. By making code always compile, you avoid
+ * the problem of bit-rot in rarely-used code. The trick to linker tables
+ * is by forcing compilation but only linking if the feature has been enabled.
+ * This is accomplished with a new target for Makefiles documented below.
+ *
+ * Let's assume we want to always force compilation of feature FOO in the
+ * kernel but avoid linking it. When you enable the FOO feature via Kconfig
+ * you'd end up with:
+ *
+ * #define CONFIG_FOO 1
+ *
+ * You typically would then just use this on your Makefile to selectively
+ * compile and link the feature:
+ *
+ * obj-$(CONFIG_FOO) += foo.o
+ *
+ * You'd instead use the new linker table object:
+ *
+ * table-$(CONFIG_FOO) += foo.o
+ *
+ * Alternatively this would be the equivalent of listing:
+ *
+ * extra += foo.o
+ * obj-$(CONFIG_FOO) += foo.o
+ *
+ * The reason why this works cleanly is due to how linker tables work.
+ * Traditionally, we would implement features in C code as follows:
+ *
+ * foo_init();
+ *
+ * You'd then have a foo.h which would have:
+ *
+ * #ifdef CONFIG_FOO
+ * void foo(void);
+ * #else
+ * static inline void foo(void) { }
+ * #endif
+ *
+ * Simplifying inits:
+ *
+ * With linker tables this is no longer necessary as your init routines would
+ * be implicit, you'd instead call:
+ *
+ * call_init_fns();
+ *
+ * call_init_fns() would call all functions present in your init table and if
+ * and only if foo.o gets linked in, then its initialisation function will be
+ * called.
+ *
+ * The linker script takes care of assembling the tables for us. All of our
+ * table sections have names of the format .tbl.NAME.NN. NAME designates the
+ * data structure stored in the table. For instance if you had defined a struct
+ * init_fn as the data type for your init sequences you would have a respective
+ * init_fns table name declared as reference for these init sequences. NN is a
+ * two-digit decimal number used to impose an "order level" upon the tables if
+ * required. NN=00 is reserved for the symbol indicating "table start", and
+ * NN=99 is reserved for the symbol indicating "table end". In order for the
+ * call_init_fns() to work behind the scenes the custom linker script would
+ * need to define the beginning of the table, the end of the table, and in
+ * between it should use SORT() to give order-level effect. To account for all
+ * of your struct init_fn init sequences, in your linker script you would have:
+ *
+ * .tbl : {
+ * __tbl_start_init_fns = .;
+ * *(SORT(.tbl.init_fns.*))
+ * __tbl_end_init_fns = .;
+ *
+ * Other init tables can go here as well for different
+ * structures.
+ * }
+ *
+ * The order-level is really only a helper, if only one order level is
+ * used, the next contributing factor to order is the order of the code
+ * in the C file, and the order of the objects in the Makefile. Using
+ * an order level then should not really be needed in most cases, its
+ * use however enables to compartamentalize code into tables where ordering
+ * through C file or through the Makefile would otherwise be very difficult
+ * or if one wanted to enable very specific initialization semantics.
+ *
+ * As an example, suppose that we want to create a "frobnicator"
+ * feature framework, and allow for several independent modules to
+ * provide frobnicating services. Then we would create a frob.h
+ * header file containing e.g.
+ *
+ * struct frobnicator {
+ * const char *name;
+ * void (*frob) (void);
+ * };
+ *
+ * #define FROBNICATORS __table(struct frobnicator, "frobnicators")
+ * #define __frobnicator __table_entry(FROBNICATORS, 01)
+ *
+ * Any module providing frobnicating services would look something
+ * like
+ *
+ * #include "frob.h"
+ *
+ * static void my_frob(void) {
+ * ... Do my frobnicating
+ * }
+ *
+ * struct frob my_frobnicator __frobnicator = {
+ * .name = "my_frob",
+ * .frob = my_frob,
+ * };
+ *
+ * The central frobnicator code (frob.c) would use the frobnicating
+ * modules as follows
+ *
+ * #include "frob.h"
+ *
+ * void frob_all(void) {
+ * struct frob *frob;
+ *
+ * for_each_table(frob, FROBNICATORS) {
+ * pr_info("Calling frobnicator \"%s\"\n", frob->name);
+ * frob->frob();
+ * }
+ * }
+ */
+
+/*
+ * __table - Declares a linker table
+ *
+ * @type: data type
+ * @name: table name
+ *
+ * Declares a linker table.
+ */
+#define __table(type, name) (type, name)
+
+/*
+ * __table_type() - get linker table data type
+ *
+ * @table: linker table
+ *
+ * Gives you the linker table data type.
+ */
+#define __table_type(table) __table_extract_type table
+#define __table_extract_type(type, name) type
+
+/*
+ * __table_name - get linker table name
+ *
+ * @table: linker table
+ *
+ * Gives you the table name.
+ */
+#define __table_name(table) __table_extract_name table
+#define __table_extract_name(type, name) name
+
+/*
+ * __table_section - get linker table section name
+ *
+ * @table: linker table
+ * @idx: order level, this is the sub-table index
+ *
+ * Declares the section name.
+ */
+#define __table_section(table, idx) \
+ ".tbl." __table_name (table) "." __table_str (idx)
+#define __table_str(x) #x
+
+/*
+ * __table_alignment - get linker table alignment
+ *
+ * @table: linker table
+ *
+ * Gives you the linker table alignment.
+ */
+#define __table_alignment( table ) __alignof__ (__table_type(table))
+
+/*
+ * __table_entry - declare a linker table entry
+ *
+ * @table: linker table
+ * @idx: order level, the sub-table index
+ *
+ * Example usage:
+ *
+ * #define FROBNICATORS __table (struct frobnicator, "frobnicators")
+ * #define __frobnicator __table_entry(FROBNICATORS, 01)
+ *
+ * struct frobnicator my_frob __frobnicator = {
+ * ...
+ * };
+ */
+#define __table_entry(table, idx) \
+ __attribute__ ((__section__(__table_section(table, idx)), \
+ __aligned__(__table_alignment(table))))
+
+/*
+ * __table_entries - get start of linker table entries
+ *
+ * @table: linker table
+ * @idx: order level, the sub-table index
+ *
+ * This gives you the start of the respective linker table entries
+ */
+#define __table_entries(table, idx) ( { \
+ static __table_type(table) __table_entries[0] \
+ __table_entry(table, idx); \
+ __table_entries; } )
+
+/*
+ * table_start - get start of linker table
+ *
+ * @table: linker table
+ *
+ * This gives you the start of the respective linker table.
+ *
+ * Example usage:
+ *
+ * #define FROBNICATORS __table (struct frobnicator, "frobnicators")
+ *
+ * struct frobnicator *frobs = table_start(FROBNICATORS);
+ */
+#define table_start(table) __table_entries(table, 00)
+
+/*
+ * table_end - get end of linker table
+ *
+ * @table: linker table
+ *
+ * This gives you the end of the respective linker table.
+ *
+ * Example usage:
+ *
+ * #define FROBNICATORS __table (struct frobnicator, "frobnicators")
+ *
+ * struct frobnicator *frobs_end = table_end(FROBNICATORS);
+ */
+#define table_end(table) __table_entries(table, 99)
+
+/*
+ * table_num_entries - get number of entries in linker table
+ *
+ * @table: linker table
+ *
+ * This gives you the number of entries in linker table.
+ *
+ * Example usage:
+ *
+ * #define FROBNICATORS __table(struct frobnicator, "frobnicators")
+ *
+ * unsigned int num_frobs = table_num_entries(FROBNICATORS);
+ */
+#define table_num_entries(table) \
+ ((unsigned int) (table_end(table) - \
+ table_start(table)))
+
+/*
+ * for_each_table_entry - iterate through all entries within a linker table
+ *
+ * @pointer: entry pointer
+ * @table: linker table
+ *
+ * Example usage:
+ *
+ * #define FROBNICATORS __table(struct frobnicator, "frobnicators")
+ *
+ * struct frobnicator *frob;
+ *
+ * for_each_table_entry(frob, FROBNICATORS) {
+ * ...
+ * }
+ */
+#define for_each_table_entry(pointer, table) \
+ for (pointer = table_start(table); \
+ pointer < table_end(table); \
+ pointer++)
+
+/**
+ * for_each_table_entry_reverse - iterate through linker table in reverse order
+ *
+ * @pointer: entry pointer
+ * @table: linker table
+ *
+ * Example usage:
+ *
+ * #define FROBNICATORS __table(struct frobnicator, "frobnicators")
+ *
+ * struct frobnicator *frob;
+ *
+ * for_each_table_entry_reverse(frob, FROBNICATORS) {
+ * ...
+ * }
+ */
+#define for_each_table_entry_reverse(pointer, table) \
+ for (pointer = (table_end(table) - 1 ); \
+ pointer >= table_start(table); \
+ pointer--)
+
+/*
+ *
+ * Intel's C compiler chokes on several of the constructs used in this
+ * file. The workarounds are ugly, so we use them only for an icc
+ * build.
+ */
+#define ICC_ALIGN_HACK_FACTOR 128
+#ifdef __ICC
+
+/*
+ * icc miscompiles zero-length arrays by inserting padding to a length
+ * of two array elements. We therefore have to generate the
+ * __table_entries() symbols by hand in asm.
+ */
+#undef __table_entries
+#define __table_entries(table, idx) ( { \
+ extern __table_type(table) \
+ __table_temp_sym(idx, __LINE__) [] \
+ __table_entry(table, idx) \
+ asm (__table_entries_sym(table, idx)); \
+ __asm__( ".ifndef %c0\n\t" \
+ ".section " __table_section(table, idx) "\n\t" \
+ ".align %c1\n\t" \
+ "\n%c0:\n\t" \
+ ".previous\n\t" \
+ ".endif\n\t" \
+ : : "i" (__table_temp_sym(idx, __LINE__ )), \
+ "i" (__table_alignment(table))); \
+ __table_temp_sym ( idx, __LINE__ ); } )
+#define __table_entries_sym(table, idx) \
+ "__tbl_" __table_name(table) "_" #idx
+#define __table_temp_sym(a, b) \
+ ___table_temp_sym(__table_, a, _, b)
+#define ___table_temp_sym(a, b, c, d) a ## b ## c ## d
+
+/*
+ * icc ignores __attribute__ (( aligned (x) )) when it is used to
+ * decrease the compiler's default choice of alignment (which may be
+ * higher than the alignment actually required by the structure). We
+ * work around this by forcing the alignment to a large multiple of
+ * the required value (so that we are never attempting to decrease the
+ * default alignment) and then postprocessing the object file to
+ * reduce the alignment back down to the "real" value.
+ *
+ */
+#undef __table_alignment
+#define __table_alignment(table) \
+ (ICC_ALIGN_HACK_FACTOR * __alignof__(__table_type(table)))
+
+/*
+ * Because of the alignment hack, we must ensure that the compiler
+ * never tries to place multiple objects within the same section,
+ * otherwise the assembler will insert padding to the (incorrect)
+ * alignment boundary. Do this by appending the line number to table
+ * section names.
+ *
+ * Note that we don't need to worry about padding between array
+ * elements, since the alignment is declared on the variable (i.e. the
+ * whole array) rather than on the type (i.e. on all individual array
+ * elements).
+ */
+#undef __table_section
+#define __table_section(table, idx) \
+ ".tbl." __table_name(table) "." __table_str(idx) \
+ "." __table_xstr(__LINE__)
+#define __table_xstr(x) __table_str(x)
+
+#endif /* __ICC */
+
+#endif /* _LINUX_LINKER_TABLES_H */
diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 01df30af4d4a..4210c6493a17 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -91,7 +91,7 @@ modorder-target := $(obj)/modules.order

# We keep a list of all modules in $(MODVERDIR)

-__build: $(if $(KBUILD_BUILTIN),$(builtin-target) $(lib-target) $(extra-y)) \
+__build: $(if $(KBUILD_BUILTIN),$(builtin-target) $(lib-target) $(extra-y) $(table-y)) \
$(if $(KBUILD_MODULES),$(obj-m) $(modorder-target)) \
$(subdir-ym) $(always)
@:
@@ -294,7 +294,7 @@ $(obj)/%.o: $(src)/%.S FORCE
$(call if_changed_dep,as_o_S)

targets += $(real-objs-y) $(real-objs-m) $(lib-y)
-targets += $(extra-y) $(MAKECMDGOALS) $(always)
+targets += $(extra-y) $(table-y) $(MAKECMDGOALS) $(always)

# Linker scripts preprocessor (.lds.S -> .lds)
# ---------------------------------------------------------------------------
diff --git a/scripts/Makefile.clean b/scripts/Makefile.clean
index 55c96cb8070f..0af1e12c9749 100644
--- a/scripts/Makefile.clean
+++ b/scripts/Makefile.clean
@@ -36,6 +36,7 @@ subdir-ymn := $(addprefix $(obj)/,$(subdir-ymn))
# directory

__clean-files := $(extra-y) $(extra-m) $(extra-) \
+ $(table-y) $(table-m) $(table-) \
$(always) $(targets) $(clean-files) \
$(host-progs) \
$(hostprogs-y) $(hostprogs-m) $(hostprogs-)
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index 79e86613712f..3d826e5fff9d 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -12,6 +12,16 @@ export KBUILD_SUBDIR_CCFLAGS := $(KBUILD_SUBDIR_CCFLAGS) $(subdir-ccflags-y)
# Figure out what we need to build from the various variables
# ===========================================================================

+# Linker tables objects always wish to be built to avoid bit-rot in
+# code, but only linked in *iff* they were enabled. We accomplish this
+# using pegging linker table objects into extra-y, which forces
+# compilation and then using the respective table-y and table-m as
+# as hints for things we do want enabled. Objects which we want to
+# avoid linking in will be in table-, not table-y and table-m.
+extra-y += $(table-)
+obj-m += $(table-m)
+obj-y += $(table-y)
+
# When an object is listed to be built compiled-in and modular,
# only build the compiled-in version

@@ -72,6 +82,8 @@ real-objs-m := $(foreach m, $(obj-m), $(if $(strip $($(m:.o=-objs)) $($(m:.o=-y)
# Add subdir path

extra-y := $(addprefix $(obj)/,$(extra-y))
+table-y := $(addprefix $(obj)/,$(table-y))
+table-m := $(addprefix $(obj)/,$(table-m))
always := $(addprefix $(obj)/,$(always))
targets := $(addprefix $(obj)/,$(targets))
modorder := $(addprefix $(obj)/,$(modorder))
--
2.6.2

2015-12-15 22:16:56

by Luis Chamberlain

[permalink] [raw]
Subject: [RFC v1 3/8] x86/boot: add BIT() to boot/bitops.h

From: "Luis R. Rodriguez" <[email protected]>

The boot/bitops.h guards against included the regular bitops,
it only implements what we need at early boot. We'll be making
use of BIT() later so add it.

Users of boot/boot.h must include it prior to asm/setup.h
otherwise the guard protection devise against the regular
linux/bitops.h will not take effect.

Signed-off-by: Luis R. Rodriguez <[email protected]>
---
arch/x86/boot/bitops.h | 2 ++
arch/x86/boot/boot.h | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/boot/bitops.h b/arch/x86/boot/bitops.h
index 878e4b9940d9..232cff0ff4e3 100644
--- a/arch/x86/boot/bitops.h
+++ b/arch/x86/boot/bitops.h
@@ -40,4 +40,6 @@ static inline void set_bit(int nr, void *addr)
asm("btsl %1,%0" : "+m" (*(u32 *)addr) : "Ir" (nr));
}

+#define BIT(x) (1 << x)
+
#endif /* BOOT_BITOPS_H */
diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index 9011a88353de..4fb53da1f48a 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -23,8 +23,8 @@
#include <stdarg.h>
#include <linux/types.h>
#include <linux/edd.h>
-#include <asm/setup.h>
#include "bitops.h"
+#include <asm/setup.h>
#include "ctype.h"
#include "cpuflags.h"

--
2.6.2

2015-12-15 22:17:03

by Luis Chamberlain

[permalink] [raw]
Subject: [RFC v1 4/8] x86/init: add linker table support

From: "Luis R. Rodriguez" <[email protected]>

Any failure on the x86 init path can be catastrophic.
A simple shift of a call from one place to another can
easily break things. Likewise adding a new call to
one path without considering all x86 requirements
can make certain x86 run time environments crash.
We currently account for these requirements through
peer code review and run time testing. We could do
much better if we had a clean and simple way to annotate
strong semantics for run time requirements, init sequence
dependencies, and detection mechanisms for additions of
new x86 init sequences.

This adds support for linker tables for x86 in order
to help define strong semantics for x86 init sequences.
The defined struct x86_init_fn is inspired by both
gPXE's sample definitions but is also also heavily
inspired by Linux's own IOMMU initialiation solution
which enables to extend initialization semantics to
support init routine custom detection routines, and
enables adding annotations for init routine dependency
mapping. A newly featured solution in this design
is to build up on the hardware subarchitecture
added to the x86 boot protocol 2.07 in 2007 by Rusty
but never really taken advantage of except for lguest,
using it as a stop-gap for new x86 features which
have failed to take into consideration the dual x86-64
entry points possible due to paravirtualization yielding
requirements. The hardware subarchitecture could also
potentially be used in the future to help unify Linux x86
entry points. The current disjoint entry points for x86-64 can
be summarized as follows.

Bare metal, KVM, Xen HVM Xen PV / dom0
startup_64() startup_xen()
\ /
x86_64_start_kernel() xen_start_kernel()
\ /
x86_64_start_reservations()
|
start_kernel()
[ ... ]
[ setup_arch() ]
[ ... ]
init

Inherited by the nature of using linker tables we also
gain the ability to sort init sequences using the linker
through the use of order-level and linker SORT() and only
optionally enable use specific init sequence sorting when
init seqeuence dependencies semantics are needed. By using
linker tables we can also avoid #ifdefer'y on select code
when and if desirable, this is also completley optional but
it is a feature we inherit by using linker tables.

For *new x86 features* this enables strong semantics to be
considered. For *existing x86 feature code* this enables the
opportunity to clarify requirements and dependencies,
which ultimately also provides a new way to ensure that code
that should not run never runs, that is, it provides one
mechanism to help prevent dead code. For a more elaborate
description on what dead code is exactly and how it can relate
to init sequences in particular to Xen due to its init sequence
refer to [0] and [1].

Four debug x86 features are included to help test x86 linker
table support. You'll want CONFIG_X86_DEBUG_LINKER_TABLES to
enable those. The features defined with table-$(CONFIG-x)
will *always* be compiled but only linked in when enabled.
The features defined with the good 'ol obj-y will only be
compiled as we're used to. If CONFIG_X86_DEBUG_LINKER_TABLES is
disabled nothing currently runs as there are no init sequences
yet using any of this.

The goal is to evolve the semantics carefully as needed. We start
with one basic callback, an early_init() called at the beginning
of x86_64_start_reservations().

[0] http://www.do-not-panic.com/2015/12/avoiding-dead-code-pvops-not-silver-bullet.html
[1] http://www.do-not-panic.com/2015/12/xen-and-x86-linux-zero-page.html

Signed-off-by: Luis R. Rodriguez <[email protected]>
---
arch/x86/Kconfig.debug | 47 +++++++
arch/x86/include/asm/x86_init.h | 1 +
arch/x86/include/asm/x86_init_fn.h | 267 ++++++++++++++++++++++++++++++++++++
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/dbg-tables/Makefile | 18 +++
arch/x86/kernel/dbg-tables/alpha.c | 10 ++
arch/x86/kernel/dbg-tables/beta.c | 18 +++
arch/x86/kernel/dbg-tables/delta.c | 10 ++
arch/x86/kernel/dbg-tables/gamma.c | 18 +++
arch/x86/kernel/dbg-tables/gamma.h | 3 +
arch/x86/kernel/head32.c | 4 +
arch/x86/kernel/head64.c | 4 +
arch/x86/kernel/init.c | 55 ++++++++
arch/x86/kernel/sort-init.c | 114 +++++++++++++++
arch/x86/kernel/vmlinux.lds.S | 25 ++++
arch/x86/tools/relocs.c | 1 +
16 files changed, 597 insertions(+)
create mode 100644 arch/x86/include/asm/x86_init_fn.h
create mode 100644 arch/x86/kernel/dbg-tables/Makefile
create mode 100644 arch/x86/kernel/dbg-tables/alpha.c
create mode 100644 arch/x86/kernel/dbg-tables/beta.c
create mode 100644 arch/x86/kernel/dbg-tables/delta.c
create mode 100644 arch/x86/kernel/dbg-tables/gamma.c
create mode 100644 arch/x86/kernel/dbg-tables/gamma.h
create mode 100644 arch/x86/kernel/init.c
create mode 100644 arch/x86/kernel/sort-init.c

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 137dfa96aa14..e66b030e82bc 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -400,4 +400,51 @@ config PUNIT_ATOM_DEBUG
The current power state can be read from
/sys/kernel/debug/punit_atom/dev_power_state

+config X86_DEBUG_LINKER_TABLES
+ bool "x86 linker table debug"
+ depends on DEBUG_KERNEL
+ default n
+ ---help---
+ Enabling this should enable debugging linker tables on x86
+ with its own solution. To help debug it enables the "x86 beta"
+ and "x86 gamma" built-in feature. Both "x86 alpha" and "x86 delta"
+ built-in features are left disabled. You should see all tables
+ compiled, except delta, only the beta and gamma feature will be
+ linked in. The delta feature will only be compiled and linked in
+ if and only if its enabled the old way.
+
+ For more details on these linker table refer to:
+
+ include/linux/tables.h
+
+ If unsure, say N.
+
+config X86_DEBUG_LINKER_TABLE_ALPHA
+ bool "x86 linker table alpha"
+ default n
+ depends on X86_DEBUG_LINKER_TABLES
+ ---help---
+ Enabling this should enable the linker table "x86 alpha" feature.
+
+config X86_DEBUG_LINKER_TABLE_BETA
+ bool "x86 linker table beta"
+ default y
+ depends on X86_DEBUG_LINKER_TABLES
+ ---help---
+ Enabling this should enable the linker table "x86 beta" feature.
+
+config X86_DEBUG_LINKER_TABLE_GAMMA
+ bool "x86 linker table gamma"
+ default y
+ depends on X86_DEBUG_LINKER_TABLES
+ ---help---
+ Enabling this should enable the linker table "x86 gamma" feature.
+
+config X86_DEBUG_LINKER_TABLE_DELTA
+ bool "x86 linker table delta"
+ default n
+ depends on X86_DEBUG_LINKER_TABLES
+ ---help---
+ Enabling this should enable the linker table "x86 delta" feature.
+
endmenu
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index cd0fc0cc78bc..b1b941e9ccd7 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -2,6 +2,7 @@
#define _ASM_X86_PLATFORM_H

#include <asm/bootparam.h>
+#include <asm/x86_init_fn.h>

struct mpc_bus;
struct mpc_cpu;
diff --git a/arch/x86/include/asm/x86_init_fn.h b/arch/x86/include/asm/x86_init_fn.h
new file mode 100644
index 000000000000..2c9e7a5f31c6
--- /dev/null
+++ b/arch/x86/include/asm/x86_init_fn.h
@@ -0,0 +1,267 @@
+#ifndef __X86_INIT_TABLES_H
+#define __X86_INIT_TABLES_H
+
+#include <linux/types.h>
+#include <linux/tables.h>
+#include <linux/init.h>
+#include <linux/bitops.h>
+
+/**
+ * struct x86_init_fn - x86 generic kernel init call
+ *
+ * Linux x86 features vary in complexity, features may require work done at
+ * different levels of the full x86 init sequence. Today there are also two
+ * different possible entry points for Linux on x86, one for bare metal, KVM
+ * and Xen HVM, and another for Xen PV guests / dom0. Assuming a bootloader
+ * has set up 64-bit mode, roughly the x86 init sequence follows this path:
+ *
+ * Bare metal, KVM, Xen HVM Xen PV / dom0
+ * startup_64() startup_xen()
+ * \ /
+ * x86_64_start_kernel() xen_start_kernel()
+ * \ /
+ * x86_64_start_reservations()
+ * |
+ * start_kernel()
+ * [ ... ]
+ * [ setup_arch() ]
+ * [ ... ]
+ * init
+ *
+ * x86_64_start_kernel() and xen_start_kernel() are the respective first C code
+ * entry starting points. The different entry points exist to enable Xen to
+ * skip a lot of hardware setup already done and managed on behalf of the
+ * hypervisor, we refer to this as "paravirtualization yielding". The different
+ * levels of init calls on the x86 init sequence exist to account for these
+ * slight differences and requirements. These different entry points also share
+ * a common entry x86 specific path, x86_64_start_reservations().
+ *
+ * A generic x86 feature can have different initialization calls, one on each
+ * of the different main x86 init sequences, but must also address both entry
+ * points in order to work properly across the board on all supported x86
+ * subarchitectures. Since x86 features can also have dependencies on other
+ * setup code or features, x86 features can at times be subordinate to other
+ * x86 features, or conditions. struct x86_init_fn enables feature developers
+ * to annotate dependency relationships to ensure subsequent init calls only
+ * run once a subordinate's dependencies have run. When needed custom
+ * dependency requirements can also be spelled out through a custom dependency
+ * checker. In order to account for the dual entry point nature of x86-64 Linux
+ * for "paravirtualization yielding" and to make annotations for support for
+ * these explicit each struct x86_init_fn must specify supported
+ * subarchitectures. The earliest x86-64 code can read the subarchitecture
+ * though is after load_idt(), as such the earliest we can currently rely on
+ * subarchitecture for semantics and a common init sequences is on the shared
+ * common x86_64_start_reservations(). Each struct x86_init_fn must also
+ * declare a two-digit decimal number to impose an ordering relative to other
+ * features when required.
+ *
+ * x86_init_fn enables strong semantics and dependencies to be defined and
+ * implemented on the full x86 initialization sequence.
+ *
+ * @order_level: must be set, linker order level, this corresponds to the table
+ * section sub-table index, we record this only for semantic validation
+ * purposes. Order-level is always required however you typically would
+ * only use X86_INIT_NORMAL*() and leave ordering to be done by placement
+ * of code in a C file and the order of objects through a Makefile. Custom
+ * order-levels can be used when order on C file and order of objects on
+ * Makfiles does not suffice or much further refinements are needed.
+ * @supp_hardware_subarch: must be set, it represents the bitmask of supported
+ * subarchitectures. We require each struct x86_init_fn to have this set
+ * to require developer considerations for each supported x86
+ * subarchitecture and to build strong annotations of different possible
+ * run time states particularly in consideration for the two main
+ * different entry points for x86 Linux, to account for paravirtualization
+ * yielding.
+ *
+ * The subarchitecture is read by the kernel at early boot from the
+ * struct boot_params hardware_subarch. Support for the subarchitecture
+ * exists as of x86 boot protocol 2.07. The bootloader would have set up
+ * the respective hardware_subarch on the boot sector as per
+ * Documentation/x86/boot.txt.
+ *
+ * What x86 entry point is used is determined at run time by the
+ * bootloader. Linux pv_ops was designed to help enable to build one Linux
+ * binary to support bare metal and different hypervisors. pv_ops setup
+ * code however is limited in that all pv_ops setup code is run late in
+ * the x86 init sequence, during setup_arch(). In fact cpu_has_hypervisor
+ * only works after early_cpu_init() during setup_arch(). If an x86
+ * feature requires an earlier determination of what hypervisor was used,
+ * or if it needs to annotate only support for certain hypervisors, the
+ * x86 hardware_subarch should be set by the bootloader and
+ * @supp_hardware_subarch set by the x86 feature. Using hardware_subarch
+ * enables x86 features to fill the semantic gap between the Linux x86
+ * entry point used and what pv_ops has to offer through a hypervisor
+ * agnostic mechanism.
+ *
+ * Each supported subarchitecture is set using the respective
+ * X86_SUBARCH_* as a bit in the bitmask. For instance if a feature
+ * is supported on PC and Xen subarchitectures only you would set this
+ * bitmask to:
+ *
+ * BIT(X86_SUBARCH_PC) |
+ * BIT(X86_SUBARCH_XEN);
+ *
+ * @detect: optional, if set returns true if the feature has been detected to
+ * be required, it returns false if the feature has been detected to not
+ * be required.
+ * @depend: optional, if set this set of init routines must be called prior to
+ * the init routine who's respective detect routine we have set this
+ * depends callback to. This is only used for sorting purposes given
+ * all current init callbacks have a void return type. Sorting is
+ * implemented via x86_init_fn_sort(), it must be called only once,
+ * however you can delay sorting until you need it if you can ensure
+ * only @order_level and @supp_hardware_subarch can account for proper
+ * ordering and dependency requirements for all init sequences prior.
+ * If you do not have a depend callback set its assumed the order level
+ * (__x86_init_fn(level)) set by the init routine suffices to set the
+ * order for when the feature's respective callbacks are called with
+ * respect to other calls. Sorting of init calls with the same order level
+ * is determined by linker order, determined by order placement on C code
+ * and order listed on a Makefile. A routine that depends on another is
+ * known as being subordinate to the init routine it depends on. Routines
+ * that are subordinate must have an order-level of lower priority or
+ * equal priority than the order-level of the init sequence it depends on.
+ * @early_init: required, routine which will run in x86_64_start_reservations()
+ * after we ensure boot_params.hdr.hardware_subarch is accessible and
+ * properly set. Memory is not yet available. This the earliest we can
+ * currently define a common shared callback since all callbacks need to
+ * check for boot_params.hdr.hardware_subarch and this becomes accessible
+ * on x86-64 until after load_idt().
+ * @flags: optional, bitmask of enum x86_init_fn_flags
+ */
+struct x86_init_fn {
+ __u32 order_level;
+ __u32 supp_hardware_subarch;
+ bool (*detect)(void);
+ bool (*depend)(void);
+ void (*early_init)(void);
+ __u32 flags;
+};
+
+/**
+ * enum x86_init_fn_flags: flags for init sequences
+ *
+ * X86_INIT_FINISH_IF_DETECTED: tells the core that once this init sequence
+ * has completed it can break out of the loop for init sequences on
+ * its own level.
+ * X86_INIT_DETECTED: private flag. Used by the x86 core to annotate that this
+ * init sequence has been detected and it all of its callbacks
+ * must be run during initialization.
+ */
+enum x86_init_fn_flags {
+ X86_INIT_FINISH_IF_DETECTED = BIT(0),
+ X86_INIT_DETECTED = BIT(1),
+};
+
+/* The x86 initialisation function table */
+#define X86_INIT_FNS __table(struct x86_init_fn, "x86_init_fns")
+
+/* Used to declares an x86 initialization table */
+#define __x86_init_fn(order_level) __table_entry(X86_INIT_FNS, order_level)
+
+/* Init order levels, we can start at 01 but reserve 01-09 for now */
+#define X86_INIT_ORDER_EARLY 10
+#define X86_INIT_ORDER_NORMAL 30
+#define X86_INIT_ORDER_LATE 50
+
+/*
+ * Use LTO_REFERENCE_INITCALL just in case of issues with old versions of gcc.
+ * This might not be needed for linker tables due to how we compartamentalize
+ * sections and then order them at linker time, but just in case.
+ */
+
+#define x86_init(__level, \
+ __supp_hardware_subarch, \
+ __detect, \
+ __depend, \
+ __early_init) \
+ static struct x86_init_fn __x86_init_fn_##__early_init __used \
+ __x86_init_fn(__level) = { \
+ .order_level = __level, \
+ .supp_hardware_subarch = __supp_hardware_subarch, \
+ .detect = __detect, \
+ .depend = __depend, \
+ .early_init = __early_init, \
+ }; \
+ LTO_REFERENCE_INITCALL(__x86_init_fn_##__early_init);
+
+#define x86_init_early(__supp_hardware_subarch, \
+ __detect, \
+ __depend, \
+ __early_init) \
+ x86_init(X86_INIT_ORDER_EARLY, __supp_hardware_subarch, \
+ __detect, __depend, \
+ __early_init);
+
+#define x86_init_normal(__supp_hardware_subarch, \
+ __detect, \
+ __depend, \
+ __early_init) \
+ x86_init(__name, X86_INIT_ORDER_NORMAL, __supp_hardware_subarch,\
+ __detect, __depend, \
+ __early_init);
+
+#define x86_init_early_all(__detect, \
+ __depend, \
+ __early_init) \
+ x86_init_early(__name, X86_SUBARCH_ALL_SUBARCHS, \
+ __detect, __depend, \
+ __early_init);
+
+#define x86_init_early_pc(__detect, \
+ __depend, \
+ __early_init) \
+ x86_init_early(BIT(X86_SUBARCH_PC), \
+ __detect, __depend, \
+ __early_init);
+
+#define x86_init_early_pc_simple(__early_init) \
+ x86_init_early((BIT(X86_SUBARCH_PC)), NULL, NULL, \
+ __early_init);
+
+#define x86_init_normal_all(__detect, \
+ __depend, \
+ __early_init) \
+ x86_init_normal(X86_SUBARCH_ALL_SUBARCHS, \
+ __detect, __depend, \
+ __early_init);
+
+#define x86_init_normal_pc(__detect, \
+ __depend, \
+ __early_init) \
+ x86_init_normal((BIT(X86_SUBARCH_PC)), \
+ __detect, __depend, \
+ __early_init);
+
+
+#define x86_init_normal_xen(__detect, \
+ __depend, \
+ __early_init) \
+ x86_init_normal((BIT(X86_SUBARCH_XEN)), \
+ __detect, __depend, \
+ __early_init);
+
+/**
+ * x86_init_fn_early_init: call all early_init() callbacks
+ *
+ * This calls all early_init() callbacks on the x86_init_fns linker table.
+ */
+void x86_init_fn_early_init(void);
+
+/**
+ * x86_init_fn_init_tables - sort and check x86 linker table
+ *
+ * This sorts struct x86_init_fn init sequences in the x86_init_fns linker
+ * table by ensuring that init sequences that depend on other init sequences
+ * are placed later in the linker table. Init sequences that do not have
+ * dependencies are left in place. Circular dependencies are not allowed.
+ * The order-level of subordinate init sequences, that is of init sequences
+ * that depend on other init sequences, must have an order-level of lower
+ * or equal priority to the init sequence it depends on.
+ *
+ * This also validates semantics of all struct x86_init_fn init sequences
+ * on the x86_init_fns linker table.
+ */
+void x86_init_fn_init_tables(void);
+
+#endif /* __X86_INIT_TABLES_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index b1b78ffe01d0..be167a0a5e2c 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -23,6 +23,8 @@ KASAN_SANITIZE_dumpstack_$(BITS).o := n
CFLAGS_irq.o := -I$(src)/../include/asm/trace

obj-y := process_$(BITS).o signal.o
+obj-y += init.o sort-init.o
+obj-$(CONFIG_X86_DEBUG_LINKER_TABLES) += dbg-tables/
obj-$(CONFIG_COMPAT) += signal_compat.o
obj-y += traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
obj-y += time.o ioport.o dumpstack.o nmi.o
diff --git a/arch/x86/kernel/dbg-tables/Makefile b/arch/x86/kernel/dbg-tables/Makefile
new file mode 100644
index 000000000000..02d12c502ad0
--- /dev/null
+++ b/arch/x86/kernel/dbg-tables/Makefile
@@ -0,0 +1,18 @@
+# You should see all these compile but at run time you'd
+# only see the ones that were linked in.
+
+# This ensures we always compile but only link
+# in alpha if it was enabled. This is typically what
+# you are expected to use to avoid code bit-rot.
+table-$(CONFIG_X86_DEBUG_LINKER_TABLE_ALPHA) += alpha.o
+
+# This accomplishes the same, but requires 2 lines.
+extra-y += beta.o
+obj-$(CONFIG_X86_DEBUG_LINKER_TABLE_BETA) += beta.o
+
+table-$(CONFIG_X86_DEBUG_LINKER_TABLE_GAMMA) += gamma.o
+
+# If you *know* you only want to enable compilation of
+# a feature when its selected you can just use the good
+# ol' obj
+obj-$(CONFIG_X86_DEBUG_LINKER_TABLE_DELTA) += delta.o
diff --git a/arch/x86/kernel/dbg-tables/alpha.c b/arch/x86/kernel/dbg-tables/alpha.c
new file mode 100644
index 000000000000..54754f893a08
--- /dev/null
+++ b/arch/x86/kernel/dbg-tables/alpha.c
@@ -0,0 +1,10 @@
+#define pr_fmt(fmt) "debug-alpha: " fmt
+
+#include <linux/kernel.h>
+#include <asm/x86_init.h>
+
+static void early_init_dbg_alpha(void) {
+ pr_info("early_init triggered\n");
+}
+
+x86_init_early_pc_simple(early_init_dbg_alpha);
diff --git a/arch/x86/kernel/dbg-tables/beta.c b/arch/x86/kernel/dbg-tables/beta.c
new file mode 100644
index 000000000000..7384a57fc386
--- /dev/null
+++ b/arch/x86/kernel/dbg-tables/beta.c
@@ -0,0 +1,18 @@
+#define pr_fmt(fmt) "debug-beta: " fmt
+
+#include <linux/kernel.h>
+#include <asm/x86_init.h>
+
+#include "gamma.h"
+
+static bool x86_dbg_detect_beta(void)
+{
+ return true;
+}
+
+static void early_init_dbg_beta(void) {
+ pr_info("early_init triggered\n");
+}
+x86_init_early_pc(x86_dbg_detect_beta,
+ x86_dbg_detect_gamma,
+ early_init_dbg_beta);
diff --git a/arch/x86/kernel/dbg-tables/delta.c b/arch/x86/kernel/dbg-tables/delta.c
new file mode 100644
index 000000000000..9d38c68e602a
--- /dev/null
+++ b/arch/x86/kernel/dbg-tables/delta.c
@@ -0,0 +1,10 @@
+#define pr_fmt(fmt) "debug-delta: " fmt
+
+#include <linux/kernel.h>
+#include <asm/x86_init.h>
+
+static void early_init_dbg_delta(void) {
+ pr_info("early_init triggered\n");
+}
+
+x86_init_early_pc_simple(early_init_dbg_delta);
diff --git a/arch/x86/kernel/dbg-tables/gamma.c b/arch/x86/kernel/dbg-tables/gamma.c
new file mode 100644
index 000000000000..7b663c1f08f4
--- /dev/null
+++ b/arch/x86/kernel/dbg-tables/gamma.c
@@ -0,0 +1,18 @@
+#define pr_fmt(fmt) "debug-gamma: " fmt
+
+#include <linux/kernel.h>
+#include <asm/x86_init.h>
+
+bool x86_dbg_detect_gamma(void)
+{
+ return true;
+}
+
+static void early_init_dbg_gamma(void)
+{
+ pr_info("early_init triggered\n");
+}
+
+x86_init_early_pc(x86_dbg_detect_gamma,
+ NULL,
+ early_init_dbg_gamma);
diff --git a/arch/x86/kernel/dbg-tables/gamma.h b/arch/x86/kernel/dbg-tables/gamma.h
new file mode 100644
index 000000000000..810d450ddd14
--- /dev/null
+++ b/arch/x86/kernel/dbg-tables/gamma.h
@@ -0,0 +1,3 @@
+#include <asm/x86_init.h>
+
+bool x86_dbg_detect_gamma(void);
diff --git a/arch/x86/kernel/head32.c b/arch/x86/kernel/head32.c
index 2911ef3a9f1c..d93f3e42e61b 100644
--- a/arch/x86/kernel/head32.c
+++ b/arch/x86/kernel/head32.c
@@ -19,6 +19,7 @@
#include <asm/bios_ebda.h>
#include <asm/tlbflush.h>
#include <asm/bootparam_utils.h>
+#include <asm/x86_init.h>

static void __init i386_default_early_setup(void)
{
@@ -47,5 +48,8 @@ asmlinkage __visible void __init i386_start_kernel(void)
break;
}

+ x86_init_fn_init_tables();
+ x86_init_fn_early_init();
+
start_kernel();
}
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index f129a9af6357..f83263a8d0ed 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -28,6 +28,7 @@
#include <asm/bootparam_utils.h>
#include <asm/microcode.h>
#include <asm/kasan.h>
+#include <asm/x86_init.h>

/*
* Manage page tables very early on.
@@ -190,6 +191,9 @@ void __init x86_64_start_reservations(char *real_mode_data)
if (!boot_params.hdr.version)
copy_bootdata(__va(real_mode_data));

+ x86_init_fn_init_tables();
+ x86_init_fn_early_init();
+
reserve_ebda_region();

start_kernel();
diff --git a/arch/x86/kernel/init.c b/arch/x86/kernel/init.c
new file mode 100644
index 000000000000..5f9b662e2b34
--- /dev/null
+++ b/arch/x86/kernel/init.c
@@ -0,0 +1,55 @@
+#define pr_fmt(fmt) "x86-init: " fmt
+
+#include <linux/bug.h>
+#include <linux/kernel.h>
+
+#include <asm/x86_init_fn.h>
+#include <asm/bootparam.h>
+#include <asm/boot.h>
+#include <asm/setup.h>
+
+extern struct x86_init_fn __tbl_x86_start_init_fns[], __tbl_x86_end_init_fns[];
+
+static bool x86_init_fn_supports_subarch(struct x86_init_fn *fn)
+{
+ if (!fn->supp_hardware_subarch) {
+ pr_err("Init sequence fails to declares any supported subarchs: %pF\n", fn->early_init);
+ WARN_ON(1);
+ }
+ if (BIT(boot_params.hdr.hardware_subarch) & fn->supp_hardware_subarch)
+ return true;
+ return false;
+}
+
+void x86_init_fn_early_init(void)
+{
+ int ret;
+ struct x86_init_fn *init_fn;
+ unsigned int num_inits = table_num_entries(X86_INIT_FNS);
+
+ if (!num_inits)
+ return;
+
+ pr_debug("Number of init entries: %d\n", num_inits);
+
+ for_each_table_entry(init_fn, X86_INIT_FNS) {
+ if (!x86_init_fn_supports_subarch(init_fn))
+ continue;
+ if (!init_fn->detect)
+ init_fn->flags |= X86_INIT_DETECTED;
+ else {
+ ret = init_fn->detect();
+ if (ret > 0)
+ init_fn->flags |= X86_INIT_DETECTED;
+ }
+
+ if (init_fn->flags & X86_INIT_DETECTED) {
+ init_fn->flags |= X86_INIT_DETECTED;
+ pr_debug("Running early init %pF ...\n", init_fn->early_init);
+ init_fn->early_init();
+ pr_debug("Completed early init %pF\n", init_fn->early_init);
+ if (init_fn->flags & X86_INIT_FINISH_IF_DETECTED)
+ break;
+ }
+ }
+}
diff --git a/arch/x86/kernel/sort-init.c b/arch/x86/kernel/sort-init.c
new file mode 100644
index 000000000000..f230f3ddb0aa
--- /dev/null
+++ b/arch/x86/kernel/sort-init.c
@@ -0,0 +1,114 @@
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <asm/x86_init_fn.h>
+
+extern struct x86_init_fn __tbl_x86_start_init_fns[], __tbl_x86_end_init_fns[];
+
+static struct x86_init_fn *x86_init_fn_find_dep(struct x86_init_fn *start,
+ struct x86_init_fn *finish,
+ struct x86_init_fn *q)
+{
+ struct x86_init_fn *p;
+
+ if (!q)
+ return NULL;
+
+ for (p = start; p < finish; p++)
+ if (p->detect == q->depend)
+ return p;
+
+ return NULL;
+}
+
+static void x86_init_fn_sort(struct x86_init_fn *start,
+ struct x86_init_fn *finish)
+{
+
+ struct x86_init_fn *p, *q, tmp;
+
+ for (p = start; p < finish; p++) {
+again:
+ q = x86_init_fn_find_dep(start, finish, p);
+ /*
+ * We are bit sneaky here. We use the memory address to figure
+ * out if the node we depend on is past our point, if so, swap.
+ */
+ if (q > p) {
+ tmp = *p;
+ memmove(p, q, sizeof(*p));
+ *q = tmp;
+ goto again;
+ }
+ }
+
+}
+
+static void x86_init_fn_check(struct x86_init_fn *start,
+ struct x86_init_fn *finish)
+{
+ struct x86_init_fn *p, *q, *x;
+
+ /* Simple cyclic dependency checker. */
+ for (p = start; p < finish; p++) {
+ if (!p->depend)
+ continue;
+ q = x86_init_fn_find_dep(start, finish, p);
+ x = x86_init_fn_find_dep(start, finish, q);
+ if (p == x) {
+ pr_info("CYCLIC DEPENDENCY FOUND! %pF depends on %pF and vice-versa. BREAKING IT.\n",
+ p->early_init, q->early_init);
+ /* Heavy handed way..*/
+ x->depend = 0;
+ }
+ }
+
+ /*
+ * Validate sorting semantics.
+ *
+ * p depends on q so:
+ * - q must run first, so q < p. If q > p that's an issue
+ * as its saying p must run prior to q. We already sorted
+ * this table, this is a problem.
+ *
+ * - q's order level must be <= than p's as it should run first
+ */
+ for (p = start; p < finish; p++) {
+ if (!p->depend)
+ continue;
+ /*
+ * Be pedantic and do a full search on the entire table,
+ * if we need further validation, after this is called
+ * one could use an optimized version which just searches
+ * on x86_init_fn_find_dep(p, finish, p), as we would have
+ * guarantee on proper ordering both at the dependency level
+ * and by order level.
+ */
+ q = x86_init_fn_find_dep(start, finish, p);
+ if (q && q > p) {
+ pr_info("EXECUTION ORDER INVALID! %pF should be called before %pF!\n",
+ p->early_init, q->early_init);
+ }
+
+ /*
+ * Technically this would still work as the memmove() would
+ * have forced the dependency to run first, however we want
+ * strong semantics, so lets avoid these.
+ */
+ if (q && q->order_level > p->order_level) {
+ pr_info("INVALID ORDER LEVEL! %pF should have an order level <= be called before %pF!\n",
+ p->early_init, q->early_init);
+ }
+ }
+}
+
+void x86_init_fn_init_tables(void)
+{
+ unsigned int num_inits = table_num_entries(X86_INIT_FNS);
+
+ if (!num_inits)
+ return;
+
+ x86_init_fn_sort(__tbl_x86_start_init_fns, __tbl_x86_end_init_fns);
+ x86_init_fn_check(__tbl_x86_start_init_fns, __tbl_x86_end_init_fns);
+}
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 74e4bf11f562..75c8f3f80e73 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -265,6 +265,31 @@ SECTIONS
__apicdrivers_end = .;
}

+ /*
+ * x86 linker tables: this will sort all order levels
+ * lexicographically. Something like SORT_BY_INIT_PRIORITY()
+ * which sorts specifically on numeric order is desirable but
+ * that would impose a change on binutils to generalize
+ * SORT_BY_INIT_PRIORITY() for any section, SORT_BY_INIT_PRIORITY()
+ * is specific to the init sections, and that would also mean having
+ * a bump on requirments for building Linux. Using SORT() should
+ * suffice, * its what gpxe uses. For things in the same order
+ * level we rely on order in the C file, and also order on the
+ * Makefile. Since this solution also embraces the IOMMU init solution
+ * it has further sorting semantics implemented in C and its semantics
+ * are defined in x86_init_fn.h and implemented in sort-init.c.
+ */
+ .tbl : {
+ __tbl_x86_start_init_fns = .;
+ *(SORT(.tbl.x86_init_fns.*))
+ __tbl_x86_end_init_fns = .;
+
+ /* ... etc ... */
+
+ /* You would add other tables here as needed */
+ }
+ . = ALIGN(8);
+
. = ALIGN(8);
/*
* .exit.text is discard at runtime, not link time, to deal with
diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index 0c2fae8d929d..22b1616de37c 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -56,6 +56,7 @@ static const char * const sym_regex_kernel[S_NSYMTYPES] = {
"__x86_cpu_dev_(start|end)|"
"(__parainstructions|__alt_instructions)(|_end)|"
"(__iommu_table|__apicdrivers|__smp_locks)(|_end)|"
+ "(__tbl_x86_(start|end)_init_fns|"
"__(start|end)_pci_.*|"
"__(start|end)_builtin_fw|"
"__(start|stop)___ksymtab(|_gpl|_unused|_unused_gpl|_gpl_future)|"
--
2.6.2

2015-12-15 22:18:13

by Luis Chamberlain

[permalink] [raw]
Subject: [RFC v1 5/8] x86/init: move ebda reservations into linker table

From: "Luis R. Rodriguez" <[email protected]>

This lets us annotate its requirements specifically for
PC and lguest subarchitectures. While at it since head.c
just has ebda data rename it. Since we're using linker tables
and both x86 32-bit and 64-bit have them we don't need
to declare a call for this separately. This lets us
also keep this declared static now.

Since we're using linker tables and have support to annotate
subarchitecture do that instead. pv_legacy() is incorrect as
its really only for legacy PV guests. There's no need for
pv_legacy() check anymore now.

Signed-off-by: Luis R. Rodriguez <[email protected]>
---
arch/x86/include/asm/bios_ebda.h | 2 --
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/{head.c => ebda.c} | 6 +++---
arch/x86/kernel/head32.c | 2 --
arch/x86/kernel/head64.c | 2 --
5 files changed, 4 insertions(+), 10 deletions(-)
rename arch/x86/kernel/{head.c => ebda.c} (94%)

diff --git a/arch/x86/include/asm/bios_ebda.h b/arch/x86/include/asm/bios_ebda.h
index aa6a3170ab5a..e63347eb3804 100644
--- a/arch/x86/include/asm/bios_ebda.h
+++ b/arch/x86/include/asm/bios_ebda.h
@@ -38,8 +38,6 @@ static inline unsigned int get_bios_ebda_length(void)
return length;
}

-void reserve_ebda_region(void);
-
#ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION
/*
* This is obviously not a great place for this, but we want to be
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index be167a0a5e2c..e7a43f08bccc 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -2,7 +2,7 @@
# Makefile for the linux kernel.
#

-extra-y := head_$(BITS).o head$(BITS).o head.o vmlinux.lds
+extra-y := head_$(BITS).o head$(BITS).o ebda.o vmlinux.lds

CPPFLAGS_vmlinux.lds += -U$(UTS_MACHINE)

diff --git a/arch/x86/kernel/head.c b/arch/x86/kernel/ebda.c
similarity index 94%
rename from arch/x86/kernel/head.c
rename to arch/x86/kernel/ebda.c
index 279fad7288f8..defc2f07a80b 100644
--- a/arch/x86/kernel/head.c
+++ b/arch/x86/kernel/ebda.c
@@ -4,6 +4,7 @@

#include <asm/setup.h>
#include <asm/bios_ebda.h>
+#include <asm/x86_init.h>

/*
* The BIOS places the EBDA/XBDA at the top of conventional
@@ -26,7 +27,7 @@
#define LOWMEM_CAP 0x9f000U /* Absolute maximum */
#define INSANE_CUTOFF 0x20000U /* Less than this = insane */

-void __init reserve_ebda_region(void)
+static void __init reserve_ebda_region(void)
{
unsigned int lowmem, ebda_addr;

@@ -38,8 +39,6 @@ void __init reserve_ebda_region(void)
* that the paravirt case can handle memory setup
* correctly, without our help.
*/
- if (paravirt_legacy())
- return;

/* end of low (conventional) memory */
lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
@@ -69,3 +68,4 @@ void __init reserve_ebda_region(void)
/* reserve all memory between lowmem and the 1MB mark */
memblock_reserve(lowmem, 0x100000 - lowmem);
}
+x86_init_early_pc_simple(reserve_ebda_region);
diff --git a/arch/x86/kernel/head32.c b/arch/x86/kernel/head32.c
index d93f3e42e61b..768fa3888066 100644
--- a/arch/x86/kernel/head32.c
+++ b/arch/x86/kernel/head32.c
@@ -26,8 +26,6 @@ static void __init i386_default_early_setup(void)
/* Initialize 32bit specific setup functions */
x86_init.resources.reserve_resources = i386_reserve_resources;
x86_init.mpparse.setup_ioapic_ids = setup_ioapic_ids_from_mpc;
-
- reserve_ebda_region();
}

asmlinkage __visible void __init i386_start_kernel(void)
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index f83263a8d0ed..c913b7eb5056 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -194,7 +194,5 @@ void __init x86_64_start_reservations(char *real_mode_data)
x86_init_fn_init_tables();
x86_init_fn_early_init();

- reserve_ebda_region();
-
start_kernel();
}
--
2.6.2

2015-12-15 22:17:09

by Luis Chamberlain

[permalink] [raw]
Subject: [RFC v1 6/8] x86/init: use linker table for i386 early setup

From: "Luis R. Rodriguez" <[email protected]>

This also annotates this is only used for PC and
lguest hardware subarchitectures.

Signed-off-by: Luis R. Rodriguez <[email protected]>
---
arch/x86/kernel/head32.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/head32.c b/arch/x86/kernel/head32.c
index 768fa3888066..1db8eec5b0e2 100644
--- a/arch/x86/kernel/head32.c
+++ b/arch/x86/kernel/head32.c
@@ -21,12 +21,15 @@
#include <asm/bootparam_utils.h>
#include <asm/x86_init.h>

-static void __init i386_default_early_setup(void)
+static void __init i386_set_setup_funcs(void)
{
/* Initialize 32bit specific setup functions */
x86_init.resources.reserve_resources = i386_reserve_resources;
x86_init.mpparse.setup_ioapic_ids = setup_ioapic_ids_from_mpc;
}
+x86_init_early(BIT(X86_SUBARCH_PC) |
+ BIT(X86_SUBARCH_LGUEST),
+ NULL, NULL, i386_set_setup_funcs);

asmlinkage __visible void __init i386_start_kernel(void)
{
@@ -41,9 +44,6 @@ asmlinkage __visible void __init i386_start_kernel(void)
case X86_SUBARCH_CE4100:
x86_ce4100_early_setup();
break;
- default:
- i386_default_early_setup();
- break;
}

x86_init_fn_init_tables();
--
2.6.2

2015-12-15 22:17:18

by Luis Chamberlain

[permalink] [raw]
Subject: [RFC v1 7/8] x86/init: user linker table for ce4100 early setup

From: "Luis R. Rodriguez" <[email protected]>

Using the linker table removes the need for the #ifdef'ery
and clutter on head32.c.

Signed-off-by: Luis R. Rodriguez <[email protected]>
---
arch/x86/include/asm/setup.h | 6 ------
arch/x86/kernel/head32.c | 3 ---
arch/x86/platform/ce4100/ce4100.c | 4 +++-
3 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 11af24e09c8a..f1e111a9d558 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -51,12 +51,6 @@ extern void x86_intel_mid_early_setup(void);
static inline void x86_intel_mid_early_setup(void) { }
#endif

-#ifdef CONFIG_X86_INTEL_CE
-extern void x86_ce4100_early_setup(void);
-#else
-static inline void x86_ce4100_early_setup(void) { }
-#endif
-
#ifndef _SETUP

#include <asm/espfix.h>
diff --git a/arch/x86/kernel/head32.c b/arch/x86/kernel/head32.c
index 1db8eec5b0e2..ec6912873395 100644
--- a/arch/x86/kernel/head32.c
+++ b/arch/x86/kernel/head32.c
@@ -41,9 +41,6 @@ asmlinkage __visible void __init i386_start_kernel(void)
case X86_SUBARCH_INTEL_MID:
x86_intel_mid_early_setup();
break;
- case X86_SUBARCH_CE4100:
- x86_ce4100_early_setup();
- break;
}

x86_init_fn_init_tables();
diff --git a/arch/x86/platform/ce4100/ce4100.c b/arch/x86/platform/ce4100/ce4100.c
index 701fd5843c87..db2a709402e2 100644
--- a/arch/x86/platform/ce4100/ce4100.c
+++ b/arch/x86/platform/ce4100/ce4100.c
@@ -24,6 +24,7 @@
#include <asm/io.h>
#include <asm/io_apic.h>
#include <asm/emergency-restart.h>
+#include <asm/x86_init.h>

static int ce4100_i8042_detect(void)
{
@@ -144,7 +145,7 @@ static void sdv_pci_init(void)
* CE4100 specific x86_init function overrides and early setup
* calls.
*/
-void __init x86_ce4100_early_setup(void)
+static void __init x86_ce4100_early_setup(void)
{
x86_init.oem.arch_setup = sdv_arch_setup;
x86_platform.i8042_detect = ce4100_i8042_detect;
@@ -166,3 +167,4 @@ void __init x86_ce4100_early_setup(void)

pm_power_off = ce4100_power_off;
}
+x86_init_early(BIT(X86_SUBARCH_CE4100), NULL, NULL, x86_ce4100_early_setup);
--
2.6.2

2015-12-15 22:17:16

by Luis Chamberlain

[permalink] [raw]
Subject: [RFC v1 8/8] x86/init: use linker table for mid early setup

From: "Luis R. Rodriguez" <[email protected]>

Using the linker table removes the need for the #ifdef'ery
and clutter on head32.c.

Signed-off-by: Luis R. Rodriguez <[email protected]>
---
arch/x86/include/asm/setup.h | 6 ------
arch/x86/kernel/head32.c | 7 -------
arch/x86/platform/intel-mid/intel-mid.c | 4 +++-
3 files changed, 3 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index f1e111a9d558..1345c1de7f99 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -45,12 +45,6 @@ extern void reserve_standard_io_resources(void);
extern void i386_reserve_resources(void);
extern void setup_default_timer_irq(void);

-#ifdef CONFIG_X86_INTEL_MID
-extern void x86_intel_mid_early_setup(void);
-#else
-static inline void x86_intel_mid_early_setup(void) { }
-#endif
-
#ifndef _SETUP

#include <asm/espfix.h>
diff --git a/arch/x86/kernel/head32.c b/arch/x86/kernel/head32.c
index ec6912873395..b23d16a2a5d5 100644
--- a/arch/x86/kernel/head32.c
+++ b/arch/x86/kernel/head32.c
@@ -36,13 +36,6 @@ asmlinkage __visible void __init i386_start_kernel(void)
cr4_init_shadow();
sanitize_boot_params(&boot_params);

- /* Call the subarch specific early setup function */
- switch (boot_params.hdr.hardware_subarch) {
- case X86_SUBARCH_INTEL_MID:
- x86_intel_mid_early_setup();
- break;
- }
-
x86_init_fn_init_tables();
x86_init_fn_early_init();

diff --git a/arch/x86/platform/intel-mid/intel-mid.c b/arch/x86/platform/intel-mid/intel-mid.c
index 1bbc21e2e4ae..27bc33193a06 100644
--- a/arch/x86/platform/intel-mid/intel-mid.c
+++ b/arch/x86/platform/intel-mid/intel-mid.c
@@ -167,7 +167,7 @@ static unsigned char intel_mid_get_nmi_reason(void)
* Moorestown specific x86_init function overrides and early setup
* calls.
*/
-void __init x86_intel_mid_early_setup(void)
+static void __init x86_intel_mid_early_setup(void)
{
x86_init.resources.probe_roms = x86_init_noop;
x86_init.resources.reserve_resources = x86_init_noop;
@@ -199,6 +199,8 @@ void __init x86_intel_mid_early_setup(void)
x86_init.mpparse.get_smp_config = x86_init_uint_noop;
set_bit(MP_BUS_ISA, mp_bus_not_pci);
}
+x86_init_early(BIT(X86_SUBARCH_INTEL_MID), NULL, NULL,
+ x86_intel_mid_early_setup);

/*
* if user does not want to use per CPU apb timer, just give it a lower rating
--
2.6.2

2015-12-15 23:00:17

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC v1 0/8] x86/init: Linux linker tables

On December 15, 2015 2:16:29 PM PST, "Luis R. Rodriguez" <[email protected]> wrote:
>From: "Luis R. Rodriguez" <[email protected]>
>
> A long time ago in a galaxy far,
> far away...
>
>Konrad Rzeszutek Wilk posted patches which eventually got merged to
>help
>with modularizing the IOMMUs we have on x86 [0]. This work was done due
>to
>the complex relationship that exists on IOMMUs and the requirements on
>careful execution. The solution also provided a mechanism which
>jettisoned
>unused IOMMUs during run-time.
>
>During review, even though the code was merged, hpa did note that we
>tend
>to encounter this type of problem "often enough that we should
>implement a
>generic facility for it" [1], hpa acknowledged that it obviously has to
>be
>based on sections and even noted that perhaps we might be able in the
>future to
>automate its creation. He noted that the gPXE folks had done just this
>with
>linker tables and suggested that "presumably we'd need a few different
>flavors
>for init tables and so on, but this would make it a generic mechanism."
>
>The IOMMU code got merged and this was left on someone's mental
>backburner.
>I've had an itch to scratch recently to try to avoid issues which are
>possible
>if one does not jettison other code carefully due to the large
>complexity of
>implicit dependencies of certain code on x86 in particular with
>possible dead
>code on x86 due to paravirtualization, and the IOMMU jettison strategy
>turned
>out to be my favorite solution so far. I've taken on hpa's suggestions
>from
>back in the day to review gPXE's solution to see if we could embrace it
>on
>Linux for a generic section solution to help jettison code carefully.
>
>What this patch set does exactly:
>
>This RFC patch set attempts to add support for such a generic solution.
>In the end, it turns out that the best solution possible was the best
>of
>both worlds: a combination of what Konrad had implemented in addition
>to
>what Michael Brown had implemented on the gXPE front. The IOMMU
>solution
>enables simple semantic annotations for dependency relationships, this
>however requires a run time sort. The gPXE solution grants the option
>to
>simply sort at build time. One of gPXE's solution primary goals however
>was
>also to help avoid bit-rot on code that's possible from #ifdef'ery. The
>Linux
>linker table solution enables developers to pick and choose what they
>need, with linker tables being the simplest solution. Contrary to gPXE
>which strives to force compilation of all linker table solutions we
>let developers pick *when* they want this as part of their solution.
>As can be seen from the suggested x86 init specific use of linker
>tables
>proposed you can also take advantage of both, linker sorting, optional
>compilation when needed (at developer's discretion), and even careful
>semantics annotation for dependency / relationship annotations.
>Although
>the x86 init solution here is heavily inspired by the IOMMU solution it
>diverges with strong semantics, and a new optional subarchitecture
>annotation. Sorting of init sequences is structure specific, as such
>each subsystem must defing their own solution unless semantics could
>be shared. I considered sharing semantics but in the end this proved
>pointless so this keeps things separate. A series of changes were made
>to the x86 init sequence in contrast to the IOMMU solution to be
>*extremely
>pedantic* on semantics, review of this changes can be studied on the
>table-init tree [2].
>
>Quick review of gPXE's solution and prospects on further changes:
>
>In my review from gPXE's solution it was not clear what hpa meant by
>gXPE folks having automated this process, they actually use linker
>tables
>all around, forcing compilation of *everything* and just do linking of
>enabled features at link time. You still need to build linker tables on
>your own. What I do see more potential for in the future is enabling to
>evolve stronger semantics over time, and this would also be subsystem
>specific.
>This will be evident in this patch set on the x86 init use of linker
>tables.
>I also see potential in strenghtening semantics for linker sorting, any
>of
>these types of features however would impose requiring newer binutils.
>For
>instance, gPXE's linker solution currently relies on SORT(), that
>defaults to
>SORT_BY_NAME(). This sorts lexicographically, gPXE's solution uses two
>digits
>to enable SORT_BY_NAME()'s lexicographical sort to sort orde by numeric
>priority. Since one is in control of order-level numbers one can
>provide
>guarantee that this sort should work as intended, however binutils also
>now has
>a SORT_BY_INIT_PRIORITY() which sorts specifically based on digits.
>SORT_BY_INIT_PRIORITY() was designed specifically for init_array
>sections
>though. Refer to the userspace mockup solution table-init git tree [2]
>commit
>6deba47ee1ad461e90 for more details on this. One thing I can envision
>to help
>here further are prospects for future linker enhancements, these
>however must
>be carefully considered in light of possible requirements of newer
>binutils
>and also of compatibility with other toolchains. For now we resort to
>the good
>'ol SORT() which even Linux has made use of for ages already. In that
>regards
>here, nothing new here. gPXE folks however did find some fun ICC
>compatiblity
>issues, and have also developed fixes for them, these were obviously
>carried
>over but should be reviewed carefully. Lastly, I should point out that
>essentially what we're developing are different forms loose and strong
>semantics -- in the most complex form what we really are after are
>feature
>graphs. Mauro explained to me that for media driver they needed to
>build a
>feature graphs, IMHO that could be a next level of strong semantics we
>might
>want to consider. For now though the combination of gPXE's linker-table
>solution based on linker order-level sort, and our old IOMMU init
>solution
>and its simple dependency map (and possible extensions, see
>subarchitecture)
>should likely suffice as a light weight solution for where we need
>semantics
>for at least on the x86 init path.
>
>Motivation and possible enhancements:
>
>To understand what made me tick to work on this feel free to read the
>dead
>code concerns with paravirtualization posts I've made, which also go
>into
>and detail Xen's alternative entry point [4] [5]. That's not the end of
>possibilities to help address to possible "dead code" on Linux, I have
>other
>ideas but I have a bit more work to do before publishing anything about
>it.
>If anything -- later work should likely supplement this solution
>further.
>
>Currently the earliest I was able to make use of boot_params on x86-64
>for
>linker tables was after load_idt(), so I've decided to stick with the
>earliest
>init call for x86 init sequences starting on
>x86_64_start_reservation().
>However, if we're able to make use of boot_params earlier (help
>appreciated),
>in particular just boot_params.hdr.hardware_subarch we should be able
>to make
>clearer and careful annotations on early x86-64 init code. Using the
>boot_params.hdr.hardware_subarch and requiring clear subarchitecture
>support
>annotations on early init code should provide a *proactive* means to
>avoid issues
>such as the cr4 shadow oversight [6] which later caused crashes on Xen.
>The
>latest similar type of issue is when KAsan was introduced, after it was
>integrated I suspected KAsan would probably break if enabled on Xen, I
>reported
>this in March 2015 [7] and Andrey confirmed this might be the case but
>since he
>wanted it enabled for allyesconfig and allmodconfig he could not think
>of a
>clean way to disable it or address support for it then [8]. During the
>development of this patch set I confirmed KAsan crashes Xen dom0 and
>therefore
>should obviously crash Xen PV guests as well. Using the same kernel
>KAsan
>still worked on KVM and bare-metal. Provided we had early access to
>boot_params.hdr.hardware_subarch, in x86_64_start_kernel() we could
>annotate
>kasan_early_init() and friends not supported on Xen, however since we
>don't
>have a way to disable KAsan at run time at this time we wouldn't have
>any other
>option but to crash early. Because of this though we should perhaps
>just consider
>disabling KAsan for Xen configs and KAsan folks might just have to bite
>the
>bullet. If KAsan folks are gung-ho about not disabling KAsan when Xen
>is
>enabled and if its easier to add support to disable KAsan at run time
>than
>it is to add KAsan support for Xen one alternative might be to use
>linker
>tables to annotate this and disable KAsan at run time for Xen.
>
>One of the points of the use of the x86 linker table solution and
>specific
>semantics there in is to *force* developers to think about requirements
>on x86
>carefully. So for instance by requiring itemizing supported
>subarchitectures
>*early* on in development we should be able to avoid proactively issues
>such as
>the crash due to the cr4 shadow changes not added on the Xen entry
>path, and
>also missing the requirement to develop a solution for Kasan for Xen.
>The
>proposed x86 linker table solution is not the first to use
>boot_params.hdr.hardware_subarch, its first use was actually on i386,
>and for
>lguest. We take advantage of it to avoid further extending pv_ops, and
>*actually* to see if we could simplify pv_ops over time further. This
>patch set
>also includes a renaming of paravirt_enabled() to paravirt_legacy()
>
>An unexpected long term side goal which *might* be possible due to
>linker
>tables on x86 is to help unify the different C entry points for Linux
>x86-64.
>I've taken a stab at it after these patches [9] but it fails on Xen,
>likely
>because the stack is not set up right due to the different calls /
>argument
>requirements. If this is desirable folks could take a look and perhaps
>help
>on that front.
>
>Where to get code alternatively and testing:
>
>All of these patches are RFCs, if anything the only patches worth
>consdering
>merging sooner rather than later might be "paravirt: rename
>paravirt_enabled
>to paravirt_legacy" and "x86/boot: add BIT() to boot/bitops.h". After
>review
>I can send those separately in patch form if agreeable. Since these are
>all
>just RFCs I've based these patches on top of Linus' v4.4-rc5. If you
>want all
>patches in one file you can get them here [10], and a respective
>linux-next
>version which applies on top of next-20151215 here [11]. I also have
>two
>git trees set up with branches for this code, one based on Linus' tree
>[12]
>and another based on linux-next next-20151215 [13]. I'll see what
>issues
>zero-day bot testing finds out. I've just run time tested this on
>x86-64
>bare metal, KVM, and Xen dom0 so far.
>
>[0] https://marc.info/?l=linux-kernel&m=128284562303565&w=2
>[1] https://marc.info/?l=linux-kernel&m=128285216913266&w=2
>[2] https://github.com/mcgrof/table-init/
>[3]
>https://github.com/mcgrof/table-init/commit/6deba47ee1ad461e90e0fbba226a535cfc1c58f3
>[4]
>http://www.do-not-panic.com/2015/12/avoiding-dead-code-pvops-not-silver-bullet.html
>[5]
>http://www.do-not-panic.com/2015/12/xen-and-x86-linux-zero-page.html
>[6]
>http://lists.xenproject.org/archives/html/xen-devel/2015-02/msg02742.html
>[7]
>http://lkml.kernel.org/r/CAB=NE6Xs5fepzNtymzT4CueeJZ0KMPETpda114DpL4eMtDswtw@mail.gmail.com
>[8] http://lkml.kernel.org/r/[email protected]
>[9]
>http://drvbp1.linux-foundation.org/~mcgrof/patches/2015/12/15/x86-merge-x86-init-v1.patch
>[10]
>http://drvbp1.linux-foundation.org/~mcgrof/patches/2015/12/15/pend-20151215-rfc-v1-linker-tables.patch
>[11]
>http://drvbp1.linux-foundation.org/~mcgrof/patches/2015/12/15/pend-next-20151215-rfc-v1-linker-tables.patch
>[12]
>https://git.kernel.org/cgit/linux/kernel/git/mcgrof/linux.git/log/?h=20151215-rfc-v1-linker-tables
>[13]
>https://git.kernel.org/cgit/linux/kernel/git/mcgrof/linux-next.git/log/?h=20151215-rfc-v1-linker-tables
>
>Luis R. Rodriguez (8):
> paravirt: rename paravirt_enabled to paravirt_legacy
> tables.h: add linker table support
> x86/boot: add BIT() to boot/bitops.h
> x86/init: add linker table support
> x86/init: move ebda reservations into linker table
> x86/init: use linker table for i386 early setup
> x86/init: user linker table for ce4100 early setup
> x86/init: use linker table for mid early setup
>
> Documentation/kbuild/makefiles.txt | 19 ++
> arch/x86/Kconfig.debug | 47 ++++
> arch/x86/boot/bitops.h | 2 +
> arch/x86/boot/boot.h | 2 +-
> arch/x86/include/asm/bios_ebda.h | 2 -
> arch/x86/include/asm/paravirt.h | 4 +-
> arch/x86/include/asm/paravirt_types.h | 11 +-
> arch/x86/include/asm/processor.h | 2 +-
> arch/x86/include/asm/setup.h | 12 -
> arch/x86/include/asm/x86_init.h | 1 +
> arch/x86/include/asm/x86_init_fn.h | 267 ++++++++++++++++++++
> arch/x86/kernel/Makefile | 4 +-
> arch/x86/kernel/apm_32.c | 2 +-
> arch/x86/kernel/asm-offsets.c | 2 +-
> arch/x86/kernel/cpu/intel.c | 2 +-
> arch/x86/kernel/cpu/microcode/core.c | 2 +-
> arch/x86/kernel/dbg-tables/Makefile | 18 ++
> arch/x86/kernel/dbg-tables/alpha.c | 10 +
> arch/x86/kernel/dbg-tables/beta.c | 18 ++
> arch/x86/kernel/dbg-tables/delta.c | 10 +
> arch/x86/kernel/dbg-tables/gamma.c | 18 ++
> arch/x86/kernel/dbg-tables/gamma.h | 3 +
> arch/x86/kernel/{head.c => ebda.c} | 6 +-
> arch/x86/kernel/head32.c | 22 +-
> arch/x86/kernel/head64.c | 4 +-
> arch/x86/kernel/init.c | 55 +++++
> arch/x86/kernel/kvm.c | 9 +-
> arch/x86/kernel/paravirt.c | 2 +-
> arch/x86/kernel/sort-init.c | 114 +++++++++
> arch/x86/kernel/tboot.c | 2 +-
> arch/x86/kernel/vmlinux.lds.S | 25 ++
> arch/x86/lguest/boot.c | 2 +-
> arch/x86/mm/dump_pagetables.c | 2 +-
> arch/x86/platform/ce4100/ce4100.c | 4 +-
> arch/x86/platform/intel-mid/intel-mid.c | 4 +-
> arch/x86/tools/relocs.c | 1 +
> arch/x86/xen/enlighten.c | 2 +-
> drivers/pnp/pnpbios/core.c | 2 +-
>include/linux/tables.h | 421
>++++++++++++++++++++++++++++++++
> scripts/Makefile.build | 4 +-
> scripts/Makefile.clean | 1 +
> scripts/Makefile.lib | 12 +
> 42 files changed, 1091 insertions(+), 61 deletions(-)
> create mode 100644 arch/x86/include/asm/x86_init_fn.h
> create mode 100644 arch/x86/kernel/dbg-tables/Makefile
> create mode 100644 arch/x86/kernel/dbg-tables/alpha.c
> create mode 100644 arch/x86/kernel/dbg-tables/beta.c
> create mode 100644 arch/x86/kernel/dbg-tables/delta.c
> create mode 100644 arch/x86/kernel/dbg-tables/gamma.c
> create mode 100644 arch/x86/kernel/dbg-tables/gamma.h
> rename arch/x86/kernel/{head.c => ebda.c} (94%)
> create mode 100644 arch/x86/kernel/init.c
> create mode 100644 arch/x86/kernel/sort-init.c
> create mode 100644 include/linux/tables.h

By the way, please refer to iPXE as gPXE is a dead project.
--
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

2015-12-17 20:39:20

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC v1 0/8] x86/init: Linux linker tables

I think we can make this even more generic. In particular, I would love
to see a solution for link tables that:

a) can be used for any kind of data structures, not just function
pointers (the latter is a specialization of the former);
b) doesn't need any changes to the linker scripts once the initial
enabling is done for any one architecture.

Key to this is to be able to define tables by name only, which is really
why SORT_BY_NAME() is used: the name sorts before the priority simply by
putting the name before the class.

Instead of .tbl.* naming of sections I think we should have the first
component be the type of section (.rodata, .data, .init_rodata,
.read_mostly etc.) which makes it easier to write a linker script that
properly sorts it into the right section. The other thing is to take a
clue from the implementation in iPXE, which uses priority levels 00 and
99 (or we could use non-integers which sort appropriately instead of
using "real" levels) to contain the start and end symbols, which
eliminates any need for linker script modifications to add new tables.

Making this a generic facility we could eventually eliminate a bunch of
ad hoc hacks we currently have.

Oh, and the link table feature should NOT be x86-specific.

-hpa

2015-12-17 20:49:07

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC v1 5/8] x86/init: move ebda reservations into linker table

On Tue, Dec 15, 2015 at 2:16 PM, Luis R. Rodriguez
<[email protected]> wrote:
> From: "Luis R. Rodriguez" <[email protected]>
>
> This lets us annotate its requirements specifically for
> PC and lguest subarchitectures. While at it since head.c
> just has ebda data rename it. Since we're using linker tables
> and both x86 32-bit and 64-bit have them we don't need
> to declare a call for this separately. This lets us
> also keep this declared static now.
>
> Since we're using linker tables and have support to annotate
> subarchitecture do that instead. pv_legacy() is incorrect as
> its really only for legacy PV guests. There's no need for
> pv_legacy() check anymore now.

I'm entirely ignorant of anything going on in gPXE/iPXE.

Can you explain what a linker table *does*? It looks like all you've
done in this patch is to move code around. What actually happens?

--Andy

2015-12-17 20:58:09

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC v1 5/8] x86/init: move ebda reservations into linker table

On 12/17/15 12:48, Andy Lutomirski wrote:
>
> I'm entirely ignorant of anything going on in gPXE/iPXE.
>
> Can you explain what a linker table *does*? It looks like all you've
> done in this patch is to move code around. What actually happens?
>

A linker table is a data structure that is stitched together from items
in multiple object files.

We already have a *bunch* of linker tables in Linux, mostly the init
tables, but they are all built in an ad hoc manner which requires linker
script modifications, which are of course per architecture.

My desire would be to make a general linker table facility so that a new
linker table can be implemented by changing C code only.

-hpa

2015-12-17 20:58:12

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC v1 5/8] x86/init: move ebda reservations into linker table

On Thu, Dec 17, 2015 at 12:55 PM, H. Peter Anvin <[email protected]> wrote:
> On 12/17/15 12:48, Andy Lutomirski wrote:
>>
>> I'm entirely ignorant of anything going on in gPXE/iPXE.
>>
>> Can you explain what a linker table *does*? It looks like all you've
>> done in this patch is to move code around. What actually happens?
>>
>
> A linker table is a data structure that is stitched together from items
> in multiple object files.
>
> We already have a *bunch* of linker tables in Linux, mostly the init
> tables, but they are all built in an ad hoc manner which requires linker
> script modifications, which are of course per architecture.
>
> My desire would be to make a general linker table facility so that a new
> linker table can be implemented by changing C code only.

Sounds good to me.

--Andy

2015-12-17 23:40:14

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC v1 5/8] x86/init: move ebda reservations into linker table

On Thu, Dec 17, 2015 at 12:57:49PM -0800, Andy Lutomirski wrote:
> On Thu, Dec 17, 2015 at 12:55 PM, H. Peter Anvin <[email protected]> wrote:
> > On 12/17/15 12:48, Andy Lutomirski wrote:
> >>
> >> I'm entirely ignorant of anything going on in gPXE/iPXE.
> >>
> >> Can you explain what a linker table *does*? It looks like all you've
> >> done in this patch is to move code around. What actually happens?
> >>
> >
> > A linker table is a data structure that is stitched together from items
> > in multiple object files.
> >
> > We already have a *bunch* of linker tables in Linux, mostly the init
> > tables, but they are all built in an ad hoc manner which requires linker
> > script modifications, which are of course per architecture.
> >
> > My desire would be to make a general linker table facility so that a new
> > linker table can be implemented by changing C code only.
>
> Sounds good to me.

That's what this actually accomplishes, there are just a few caveats to consider,
more on this shortly on the other thread in this patch series.

Luis

2015-12-17 23:46:29

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC v1 0/8] x86/init: Linux linker tables

On Thu, Dec 17, 2015 at 12:38:10PM -0800, H. Peter Anvin wrote:
> I think we can make this even more generic. In particular, I would love
> to see a solution for link tables that:
>
> a) can be used for any kind of data structures, not just function
> pointers (the latter is a specialization of the former);

Sure, I can give this a shot. I'll sync back to the userspace
table-init.git tree what I had in this series and then build
first there a solution to make it agnostic to the use case.

> b) doesn't need any changes to the linker scripts once the initial
> enabling is done for any one architecture.

I'm glad you brought this up. I considered this as well, and sadly I
forgot to document the resolution. So here it is:

iPXE uses *one* entry on their linker script, that's it. All else is
just C code. That works well as all code in iPXE is happy to work
with one linker script, and the goal and use case for the linker
tables are *all* for initialization. iPXE also does *not* make use
of any further semantics for dependency annotations.

For Linux we have different requirements and goals, and we also wanted to take
advantage of other possible semantics when desirable. Consider the proposed use
case of a linker table with struct x86_init_fn, for its init sequences. We
need at the very least ***one linker table entry* to catch and peg these
sections, just as iPXE has. If we wanted to we could live with that but for
Linux on x86 that means vmlinux.lds.S, that's already architecture specific.
Do we want a more generic place ? If so can anyone suggest one ?

So to be clear, the solution in place suggested here already enables this, but
its current form does use vmlinux.lds.S. Although the current patch made this
change:

.tbl : {
__tbl_x86_start_init_fns = .;
*(SORT(.tbl.x86_init_fns.*))
__tbl_x86_end_init_fns = .;

/* ... etc ... */

/* You would add other tables here as needed */
}

This was on purpose, a generic form could have simply been used as well:


__tbl = .;
*(SORT(.tbl.*))
__tbl_end = .;


I made this change on the mockup table-init through two commits here:

https://github.com/mcgrof/table-init/commit/1a330c1dadbf7fa255eadcbaf214452f7e9ea29c
https://github.com/mcgrof/table-init/commit/ad11b4010ab2a5ffae7be39cce0ed4d052fe18fd

I explain why I do that there but the gist of it is that on Linux we may also
want stronger semantics for specific linker table solutions, and solutions such
as those devised on the IOMMU init stuff do memmove() for sorting depending on
semantics defined (in the simplest case here so far dependency between init
sequences), this makes each set of sequences very subsystem specific. An issue
with *one* subsystem could make things really bad for others. I thought about
this quite a bit and figured its best left to the subsystem maintainers to
decide.

So we have the freedom to do we wish still -- its just we need to understand
the implications of using something like a run time sort, and also when we want
to have the option to use different sized set of routines for different sections.
Linker sorting should fortunately give us the freedom to sort without concern
or regard to size of the structs / data used, but *iff* different sized data
is used in that section then one has to be careful that the sort routine be
very subsystem specific. So as I see it we'd need linker script annotations
for start and end on subsystems that want to use run time sort. This would
compartmentalize their sorting and bugs. If run time sorting is not needed
a catch-it-all script section as the above simple one should catch it.

To try to avoid issues between these two types of cases we could just document
the requirement. Now, it'd nice if we had a way to provide the section start/end
through C code, and not modify anything on linker scripts even if you had run
time sort requirements, I just can't think of a way to do that right now. Ideas
welcomed. Maybe some Kconfig magic to custom linker script section ? Then the
main linker script can just include this generated linker script section ?

> Key to this is to be able to define tables by name only, which is really
> why SORT_BY_NAME() is used: the name sorts before the priority simply by
> putting the name before the class.

Right, tbl.init_fns.01 for example would be one section if your scruct
was .init_fns and 01 for the first "priority". That still puts a requirement
if you want to use a run time sort so you can identify the start/end of
the range.

> Instead of .tbl.* naming of sections I think we should have the first
> component be the type of section (.rodata, .data, .init_rodata,
> .read_mostly etc.) which makes it easier to write a linker script that
> properly sorts it into the right section.

Sure, we can peg that as an argument to the __table(), so perhaps:

Perhaps a new sections.h file (you tell me) which documents the different
section components:

/* document this *really* well */
#define SECTION_RODATA ".rodata"
#define SECTION_INIT ".init"
#define SECTION_INIT_RODATA ".init_rodata"
#define SECTION_READ_MOSTLY ".read_mostly"

Then on tables.h we add the section components support:

#define __table(component, type, name) (component, type, name)

#define __table_component(table) __table_extract_component table
#define __table_extract_component(component, type, name) component

#define __table_type(table) __table_extract_type table
#define __table_extract_type(component, type, name) type

#define __table_name(table) __table_extract_name table
#define __table_extract_name(component, type, name) name

#define __table_str(x) #x

#define __table_section(table, idx) \
"." __table_component (table) ".tbl." __table_name (table) "." __table_str (idx)

#define __table_entry(table, idx) \
__attribute__ ((__section__(__table_section(table, idx)), \
__aligned__(__table_alignment(table))))

A user could then be something as follows:

#define X86_INIT_FNS __table(SECTION_INIT, struct x86_init_fn, "x86_init_fns")
#define __x86_init_fn(order_level) __table_entry(X86_INIT_FNS, order_level)

If that's what you mean?

I'm a bit wary about having the linker sort any of the above SECTION_*'s, but
if we're happy to do that perhaps a simple first step might be to see if 0-day
but would be happy with just the sort without any consequences to any
architecture. Thoughts?

> The other thing is to take a
> clue from the implementation in iPXE, which uses priority levels 00 and
> 99 (or we could use non-integers which sort appropriately instead of
> using "real" levels) to contain the start and end symbols, which
> eliminates any need for linker script modifications to add new tables.

This solution uses that as well. The only need for adding custom sections
is when they have a requirement for a custom run time sort, and also to
ensure they don't cause regressions on other subsystems if they have a buggy
sort. The run time sorting is all subsystem specific and up to their own
semantics.

> Making this a generic facility we could eventually eliminate a bunch of
> ad hoc hacks we currently have.

Towards the end of my development I did look into seeing if I could use this
sort of stuff also for the other section sorting hacks we have in place, I ran
into an issue with the type stuff, if we make it agnostic to data structure,
then perhaps indeed it might be possible to generalize this even to those
scary cob-web corner infested places in the kernel.

> Oh, and the link table feature should NOT be x86-specific.

As far as I can tell tables.h is not architecture specific at this point, the
way arch/x86/include/asm/x86_init_fn.h uses it is, but that's fine. Even with
the above changes it would still be architecture agnostic. If there's some
discrepancies on how some architecture linkers work -- now that's something
to consider now. Its also why I've been mocking this up in userspace, we
should be able to test that much easier than a full blown kernel.

Now would be a good time to get linker folks involved and eyeballing all this.

Luis

2015-12-17 23:59:03

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC v1 0/8] x86/init: Linux linker tables

On Thu, Dec 17, 2015 at 3:46 PM, Luis R. Rodriguez <[email protected]> wrote:
> solutions such
> as those devised on the IOMMU init stuff do memmove() for sorting depending on
> semantics defined (in the simplest case here so far dependency between init
> sequences), this makes each set of sequences very subsystem specific

I should also note -- since the sort uses memmove() we're constrained
to using run time sort only once we can ensure the kernel can do that
without issues. I'd hope x86_64_start_reservations() is a safe place,
if not please let me know. Likewise, since one prospect here is to
bring further strong subarchitecture semantics even earlier (help
appreciated), say as early as possible in x86_64_start_kernel(), it'd
be nice to know the earliest memmove() is safe there.

In this case the solution doesn't yet make use of calls in between
x86_64_start_kernel() and x86_64_start_reservations() but if it could
(we'd need a way for x86_init_fn_early_init() to access the
subarchitecture, help appreciated) we *might* in the future want run
time sort this early. It might still be possible to *not* do a run
time sort until later so long as the priority level linker sort
mechanism suffices for all early init routines. That's another
possibility, should memmove() be a problem that early.

Luis

2015-12-18 04:26:33

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC v1 0/8] x86/init: Linux linker tables

On 12/17/15 15:46, Luis R. Rodriguez wrote:
>
> I explain why I do that there but the gist of it is that on Linux we may also
> want stronger semantics for specific linker table solutions, and solutions such
> as those devised on the IOMMU init stuff do memmove() for sorting depending on
> semantics defined (in the simplest case here so far dependency between init
> sequences), this makes each set of sequences very subsystem specific. An issue
> with *one* subsystem could make things really bad for others. I thought about
> this quite a bit and figured its best left to the subsystem maintainers to
> decide.
>

A table that needs sorting or other runtime handling is just a
read-write table for the purpose of the linker table construct. It
presents to C as an array of initialized data.

> Perhaps a new sections.h file (you tell me) which documents the different
> section components:
>
> /* document this *really* well */
> #define SECTION_RODATA ".rodata"
> #define SECTION_INIT ".init"
> #define SECTION_INIT_RODATA ".init_rodata"
> #define SECTION_READ_MOSTLY ".read_mostly"
>
> Then on tables.h we add the section components support:

Yes, something like that. How to macroize it cleanly is another matter;
we may want to use slightly different conventions that iPXE to match our
own codebase.

> #define __table(component, type, name) (component, type, name)
>
> #define __table_component(table) __table_extract_component table
> #define __table_extract_component(component, type, name) component
>
> #define __table_type(table) __table_extract_type table
> #define __table_extract_type(component, type, name) type
>
> #define __table_name(table) __table_extract_name table
> #define __table_extract_name(component, type, name) name
>
> #define __table_str(x) #x
>
> #define __table_section(table, idx) \
> "." __table_component (table) ".tbl." __table_name (table) "." __table_str (idx)
>
> #define __table_entry(table, idx) \
> __attribute__ ((__section__(__table_section(table, idx)), \
> __aligned__(__table_alignment(table))))
>
> A user could then be something as follows:
>
> #define X86_INIT_FNS __table(SECTION_INIT, struct x86_init_fn, "x86_init_fns")
> #define __x86_init_fn(order_level) __table_entry(X86_INIT_FNS, order_level)

Yes, but in particular the common case of function initialization tables
should be generic.

I'm kind of thinking a syntax like this:

DECLARE_LINKTABLE_RO(struct foo, tablename);
DEFINE_LINKTABLE_RO(struct foo, tablename);
LINKTABLE_RO(tablename,level) = /* contents */;
LINKTABLE_SIZE(tablename)

... which would turn into something like this once it goes through all
the preprocessing phases

/* DECLARE_LINKTABLE_RO */
extern const struct foo tablename[], tablename__end[];

/* DEFINE_LINKTABLE_RO */
DECLARE_LINKTABLE_RO(struct foo, tablename);

const struct
foo__attribute__((used,section(".rodata.tbl.tablename.0"))) tablename[0];

const struct
foo__attribute__((used,section(".rodata.tbl.tablename.999")))
tablename__end[0];

/* LINKTABLE_RO */
static const __typeof__(tablename)
__attribute__((used,section(".rodata.tbl.tablename.50")))
__tbl_tablename_12345

/* LINKTABLE_SIZE */
((tablename__end) - (tablename))

... and so on for all the possible sections where we may want tables.

Note: I used 0 and 999 above since they sort before and after all
possible 2-digit decimal numbers, but that's just cosmetic.

> If that's what you mean?
>
> I'm a bit wary about having the linker sort any of the above SECTION_*'s, but
> if we're happy to do that perhaps a simple first step might be to see if 0-day
> but would be happy with just the sort without any consequences to any
> architecture. Thoughts?

I don't see what is dangerous about it. The section names are such that
a lexographical sort will do the right thing, and we can simply use
SORT(.rodata.tbl.*) in the linker script, for example.

>> The other thing is to take a
>> clue from the implementation in iPXE, which uses priority levels 00 and
>> 99 (or we could use non-integers which sort appropriately instead of
>> using "real" levels) to contain the start and end symbols, which
>> eliminates any need for linker script modifications to add new tables.
>
> This solution uses that as well. The only need for adding custom sections
> is when they have a requirement for a custom run time sort, and also to
> ensure they don't cause regressions on other subsystems if they have a buggy
> sort. The run time sorting is all subsystem specific and up to their own
> semantics.

Again, from a linker table POV this is nothing other than a read-write
table; there is a runtime function that then operates on that read-write
table.

-hpa

2015-12-18 04:41:00

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC v1 0/8] x86/init: Linux linker tables

On 12/17/15 20:25, H. Peter Anvin wrote:
>
> /* DECLARE_LINKTABLE_RO */
> extern const struct foo tablename[], tablename__end[];
>
> /* DEFINE_LINKTABLE_RO */
> DECLARE_LINKTABLE_RO(struct foo, tablename);
>
> const struct
> foo__attribute__((used,section(".rodata.tbl.tablename.0"))) tablename[0];
>
> const struct
> foo__attribute__((used,section(".rodata.tbl.tablename.999")))
> tablename__end[0];
>
> /* LINKTABLE_RO */
> static const __typeof__(tablename)
> __attribute__((used,section(".rodata.tbl.tablename.50")))
> __tbl_tablename_12345
>
> /* LINKTABLE_SIZE */
> ((tablename__end) - (tablename))
>
> ... and so on for all the possible sections where we may want tables.
>

Come to think of it, we could even eliminate the need for a DEFINE
entirely if we made the start and end symbols static. However, this
would generate an awful lot of identical-but-local symbols which
probably would make the linker slower and definitely would bloat the
debug data.

-hpa

2015-12-18 18:50:46

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC v1 0/8] x86/init: Linux linker tables

On Thu, Dec 17, 2015 at 08:25:19PM -0800, H. Peter Anvin wrote:
> On 12/17/15 15:46, Luis R. Rodriguez wrote:
> >
> > I explain why I do that there but the gist of it is that on Linux we may also
> > want stronger semantics for specific linker table solutions, and solutions such
> > as those devised on the IOMMU init stuff do memmove() for sorting depending on
> > semantics defined (in the simplest case here so far dependency between init
> > sequences), this makes each set of sequences very subsystem specific. An issue
> > with *one* subsystem could make things really bad for others. I thought about
> > this quite a bit and figured its best left to the subsystem maintainers to
> > decide.
> >
>
> A table that needs sorting or other runtime handling is just a
> read-write table for the purpose of the linker table construct. It
> presents to C as an array of initialized data.

Sure but what I was getting at was that since some run time changes to the
table *might* be desirable, depending on the subsystem, the subsystem table needs
to be able to know the start and end address of its table, and that's a
linker script change. But come to think of it, that was me just being pedantic
and careful, I'll try even a run time sort with a few tables of different
structure size to ensure its both possible to just declare them in C and also
sort them without a linker script change.

> > Perhaps a new sections.h file (you tell me) which documents the different
> > section components:
> >
> > /* document this *really* well */
> > #define SECTION_RODATA ".rodata"
> > #define SECTION_INIT ".init"
> > #define SECTION_INIT_RODATA ".init_rodata"
> > #define SECTION_READ_MOSTLY ".read_mostly"
> >
> > Then on tables.h we add the section components support:
>
> Yes, something like that. How to macroize it cleanly is another matter;
> we may want to use slightly different conventions that iPXE to match our
> own codebase.

Sure, the style below is from iPXE, the one in the patch set matches the
Linux coding style. Other than that I'd welcome feedback on any issues
or recommendations with style on our proposed version of tables.h
Seems you provided some pointers below, so I'll go try to incorporate
those suggestions.

> > #define __table(component, type, name) (component, type, name)
> >
> > #define __table_component(table) __table_extract_component table
> > #define __table_extract_component(component, type, name) component
> >
> > #define __table_type(table) __table_extract_type table
> > #define __table_extract_type(component, type, name) type
> >
> > #define __table_name(table) __table_extract_name table
> > #define __table_extract_name(component, type, name) name
> >
> > #define __table_str(x) #x
> >
> > #define __table_section(table, idx) \
> > "." __table_component (table) ".tbl." __table_name (table) "." __table_str (idx)
> >
> > #define __table_entry(table, idx) \
> > __attribute__ ((__section__(__table_section(table, idx)), \
> > __aligned__(__table_alignment(table))))
> >
> > A user could then be something as follows:
> >
> > #define X86_INIT_FNS __table(SECTION_INIT, struct x86_init_fn, "x86_init_fns")
> > #define __x86_init_fn(order_level) __table_entry(X86_INIT_FNS, order_level)
>
> Yes, but in particular the common case of function initialization tables
> should be generic.
>
> I'm kind of thinking a syntax like this:
>
> DECLARE_LINKTABLE_RO(struct foo, tablename);
> DEFINE_LINKTABLE_RO(struct foo, tablename);
> LINKTABLE_RO(tablename,level) = /* contents */;
> LINKTABLE_SIZE(tablename)
>
> ... which would turn into something like this once it goes through all
> the preprocessing phases
>
> /* DECLARE_LINKTABLE_RO */
> extern const struct foo tablename[], tablename__end[];
>
> /* DEFINE_LINKTABLE_RO */
> DECLARE_LINKTABLE_RO(struct foo, tablename);
>
> const struct
> foo__attribute__((used,section(".rodata.tbl.tablename.0"))) tablename[0];
>
> const struct
> foo__attribute__((used,section(".rodata.tbl.tablename.999")))
> tablename__end[0];
>
> /* LINKTABLE_RO */
> static const __typeof__(tablename)
> __attribute__((used,section(".rodata.tbl.tablename.50")))
> __tbl_tablename_12345
>
> /* LINKTABLE_SIZE */
> ((tablename__end) - (tablename))
>
> ... and so on for all the possible sections where we may want tables.

OK thanks so much! Will try working with that.

> Note: I used 0 and 999 above since they sort before and after all
> possible 2-digit decimal numbers, but that's just cosmetic.

Ah neat, so we could still use the two digits 99 and 00 for order
level if we wanted then.

> > If that's what you mean?
> >
> > I'm a bit wary about having the linker sort any of the above SECTION_*'s, but
> > if we're happy to do that perhaps a simple first step might be to see if 0-day
> > but would be happy with just the sort without any consequences to any
> > architecture. Thoughts?
>
> I don't see what is dangerous about it. The section names are such that
> a lexographical sort will do the right thing, and we can simply use
> SORT(.rodata.tbl.*) in the linker script, for example.

OK I'll leave it up to you, I'll go test sorting the sections broadly first.

> >> The other thing is to take a
> >> clue from the implementation in iPXE, which uses priority levels 00 and
> >> 99 (or we could use non-integers which sort appropriately instead of
> >> using "real" levels) to contain the start and end symbols, which
> >> eliminates any need for linker script modifications to add new tables.
> >
> > This solution uses that as well. The only need for adding custom sections
> > is when they have a requirement for a custom run time sort, and also to
> > ensure they don't cause regressions on other subsystems if they have a buggy
> > sort. The run time sorting is all subsystem specific and up to their own
> > semantics.
>
> Again, from a linker table POV this is nothing other than a read-write
> table; there is a runtime function that then operates on that read-write
> table.

OK, I'm convinced, I'll go try these changes now.

Luis

2015-12-18 18:59:51

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC v1 0/8] x86/init: Linux linker tables

On December 18, 2015 10:50:40 AM PST, "Luis R. Rodriguez" <[email protected]> wrote:
>On Thu, Dec 17, 2015 at 08:25:19PM -0800, H. Peter Anvin wrote:
>> On 12/17/15 15:46, Luis R. Rodriguez wrote:
>> >
>> > I explain why I do that there but the gist of it is that on Linux
>we may also
>> > want stronger semantics for specific linker table solutions, and
>solutions such
>> > as those devised on the IOMMU init stuff do memmove() for sorting
>depending on
>> > semantics defined (in the simplest case here so far dependency
>between init
>> > sequences), this makes each set of sequences very subsystem
>specific. An issue
>> > with *one* subsystem could make things really bad for others. I
>thought about
>> > this quite a bit and figured its best left to the subsystem
>maintainers to
>> > decide.
>> >
>>
>> A table that needs sorting or other runtime handling is just a
>> read-write table for the purpose of the linker table construct. It
>> presents to C as an array of initialized data.
>
>Sure but what I was getting at was that since some run time changes to
>the
>table *might* be desirable, depending on the subsystem, the subsystem
>table needs
>to be able to know the start and end address of its table, and that's a
>linker script change. But come to think of it, that was me just being
>pedantic
>and careful, I'll try even a run time sort with a few tables of
>different
>structure size to ensure its both possible to just declare them in C
>and also
>sort them without a linker script change.
>
>> > Perhaps a new sections.h file (you tell me) which documents the
>different
>> > section components:
>> >
>> > /* document this *really* well */
>> > #define SECTION_RODATA ".rodata"
>> > #define SECTION_INIT ".init"
>> > #define SECTION_INIT_RODATA ".init_rodata"
>> > #define SECTION_READ_MOSTLY ".read_mostly"
>> >
>> > Then on tables.h we add the section components support:
>>
>> Yes, something like that. How to macroize it cleanly is another
>matter;
>> we may want to use slightly different conventions that iPXE to match
>our
>> own codebase.
>
>Sure, the style below is from iPXE, the one in the patch set matches
>the
>Linux coding style. Other than that I'd welcome feedback on any issues
>or recommendations with style on our proposed version of tables.h
>Seems you provided some pointers below, so I'll go try to incorporate
>those suggestions.
>
>> > #define __table(component, type, name) (component, type, name)
>> >
>> > #define __table_component(table) __table_extract_component table
>
>> > #define __table_extract_component(component, type, name) component
>> >
>> > #define __table_type(table) __table_extract_type table
>
>> > #define __table_extract_type(component, type, name) type
>> >
>> > #define __table_name(table) __table_extract_name table
>
>> > #define __table_extract_name(component, type, name) name
>> >
>> > #define __table_str(x) #x
>> >
>> > #define __table_section(table, idx) \
>
>> > "." __table_component (table) ".tbl." __table_name (table)
>"." __table_str (idx)
>> >
>> > #define __table_entry(table, idx)
> \
>> > __attribute__ ((__section__(__table_section(table, idx)),
> \
>> > __aligned__(__table_alignment(table))))
>> >
>> > A user could then be something as follows:
>> >
>> > #define X86_INIT_FNS __table(SECTION_INIT, struct x86_init_fn,
>"x86_init_fns")
>> > #define __x86_init_fn(order_level) __table_entry(X86_INIT_FNS,
>order_level)
>>
>> Yes, but in particular the common case of function initialization
>tables
>> should be generic.
>>
>> I'm kind of thinking a syntax like this:
>>
>> DECLARE_LINKTABLE_RO(struct foo, tablename);
>> DEFINE_LINKTABLE_RO(struct foo, tablename);
>> LINKTABLE_RO(tablename,level) = /* contents */;
>> LINKTABLE_SIZE(tablename)
>>
>> ... which would turn into something like this once it goes through
>all
>> the preprocessing phases
>>
>> /* DECLARE_LINKTABLE_RO */
>> extern const struct foo tablename[], tablename__end[];
>>
>> /* DEFINE_LINKTABLE_RO */
>> DECLARE_LINKTABLE_RO(struct foo, tablename);
>>
>> const struct
>> foo__attribute__((used,section(".rodata.tbl.tablename.0")))
>tablename[0];
>>
>> const struct
>> foo__attribute__((used,section(".rodata.tbl.tablename.999")))
>> tablename__end[0];
>>
>> /* LINKTABLE_RO */
>> static const __typeof__(tablename)
>> __attribute__((used,section(".rodata.tbl.tablename.50")))
>> __tbl_tablename_12345
>>
>> /* LINKTABLE_SIZE */
>> ((tablename__end) - (tablename))
>>
>> ... and so on for all the possible sections where we may want tables.
>
>OK thanks so much! Will try working with that.
>
>> Note: I used 0 and 999 above since they sort before and after all
>> possible 2-digit decimal numbers, but that's just cosmetic.
>
>Ah neat, so we could still use the two digits 99 and 00 for order
>level if we wanted then.
>
>> > If that's what you mean?
>> >
>> > I'm a bit wary about having the linker sort any of the above
>SECTION_*'s, but
>> > if we're happy to do that perhaps a simple first step might be to
>see if 0-day
>> > but would be happy with just the sort without any consequences to
>any
>> > architecture. Thoughts?
>>
>> I don't see what is dangerous about it. The section names are such
>that
>> a lexographical sort will do the right thing, and we can simply use
>> SORT(.rodata.tbl.*) in the linker script, for example.
>
>OK I'll leave it up to you, I'll go test sorting the sections broadly
>first.
>
>> >> The other thing is to take a
>> >> clue from the implementation in iPXE, which uses priority levels
>00 and
>> >> 99 (or we could use non-integers which sort appropriately instead
>of
>> >> using "real" levels) to contain the start and end symbols, which
>> >> eliminates any need for linker script modifications to add new
>tables.
>> >
>> > This solution uses that as well. The only need for adding custom
>sections
>> > is when they have a requirement for a custom run time sort, and
>also to
>> > ensure they don't cause regressions on other subsystems if they
>have a buggy
>> > sort. The run time sorting is all subsystem specific and up to
>their own
>> > semantics.
>>
>> Again, from a linker table POV this is nothing other than a
>read-write
>> table; there is a runtime function that then operates on that
>read-write
>> table.
>
>OK, I'm convinced, I'll go try these changes now.
>
> Luis

No, start and end symbols are provided by having zero-length sections which sort at the very beginning and very end.
--
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.