2009-01-04 23:35:53

by Andi Kleen

Subject: [PATCH] [0/5] Couple of x86 patches for 2.6.29


Most of this has been posted earlier.

I believe all are 2.6.29 candidates, except possibly for the always inline
patch.

-Andi


2009-01-04 23:35:37

by Andi Kleen

Subject: [PATCH] [2/5] Allow HPET force enable on ICH10 HPET


Intel "Smackover" x58 BIOS don't have HPET enabled in the BIOS, so allow
to force enable it at least. The register layout is the same
as in other recent ICHs, so all the code can be reused.

Using numerical PCI-ID because it's unlikely the PCI-ID will be used
anywhere else.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/kernel/quirks.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6.28-rc4-test/arch/x86/kernel/quirks.c
===================================================================
--- linux-2.6.28-rc4-test.orig/arch/x86/kernel/quirks.c 2008-10-24 13:34:40.000000000 +0200
+++ linux-2.6.28-rc4-test/arch/x86/kernel/quirks.c 2008-12-15 17:31:24.000000000 +0100
@@ -170,7 +170,8 @@
ich_force_enable_hpet);
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH9_7,
ich_force_enable_hpet);
-
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x3a16, /* ICH10 */
+ ich_force_enable_hpet);

static struct pci_dev *cached_dev;

2009-01-04 23:36:13

by Andi Kleen

Subject: [PATCH] [3/5] Mark complex bitops.h inlines as __always_inline


Hugh Dickins noticed that older gcc versions don't inline some of the
bitops when the kernel is built for code size.

Mark all complex x86 bitops that have more than a single asm statement
or two as __always_inline to avoid this problem.

This should probably be done for other architectures too.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/include/asm/bitops.h | 15 +++++++++++----
include/asm-generic/bitops/__ffs.h | 2 +-
include/asm-generic/bitops/__fls.h | 2 +-
include/asm-generic/bitops/fls.h | 2 +-
include/asm-generic/bitops/fls64.h | 4 ++--
5 files changed, 16 insertions(+), 9 deletions(-)

Index: linux-2.6.28-test/arch/x86/include/asm/bitops.h
===================================================================
--- linux-2.6.28-test.orig/arch/x86/include/asm/bitops.h 2008-10-24 13:34:40.000000000 +0200
+++ linux-2.6.28-test/arch/x86/include/asm/bitops.h 2009-01-02 16:01:00.000000000 +0100
@@ -3,6 +3,9 @@

/*
* Copyright 1992, Linus Torvalds.
+ *
+ * Note: inlines with more than a single statement should be marked
+ * __always_inline to avoid problems with older gcc's inlining heuristics.
*/

#ifndef _LINUX_BITOPS_H
@@ -53,7 +56,8 @@
* Note that @nr may be almost arbitrarily large; this function is not
* restricted to acting on a single-word quantity.
*/
-static inline void set_bit(unsigned int nr, volatile unsigned long *addr)
+static __always_inline void
+set_bit(unsigned int nr, volatile unsigned long *addr)
{
if (IS_IMMEDIATE(nr)) {
asm volatile(LOCK_PREFIX "orb %1,%0"
@@ -90,7 +94,8 @@
* you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
* in order to ensure changes are visible on other processors.
*/
-static inline void clear_bit(int nr, volatile unsigned long *addr)
+static __always_inline void
+clear_bit(int nr, volatile unsigned long *addr)
{
if (IS_IMMEDIATE(nr)) {
asm volatile(LOCK_PREFIX "andb %1,%0"
@@ -196,7 +201,8 @@
*
* This is the same as test_and_set_bit on x86.
*/
-static inline int test_and_set_bit_lock(int nr, volatile unsigned long *addr)
+static __always_inline int
+test_and_set_bit_lock(int nr, volatile unsigned long *addr)
{
return test_and_set_bit(nr, addr);
}
@@ -292,7 +298,8 @@
return oldbit;
}

-static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
+static __always_inline int
+constant_test_bit(int nr, const volatile unsigned long *addr)
{
return ((1UL << (nr % BITS_PER_LONG)) &
(((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
Index: linux-2.6.28-test/include/asm-generic/bitops/__ffs.h
===================================================================
--- linux-2.6.28-test.orig/include/asm-generic/bitops/__ffs.h 2006-04-03 16:06:13.000000000 +0200
+++ linux-2.6.28-test/include/asm-generic/bitops/__ffs.h 2009-01-02 16:01:00.000000000 +0100
@@ -9,7 +9,7 @@
*
* Undefined if no bit exists, so code should check against 0 first.
*/
-static inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long __ffs(unsigned long word)
{
int num = 0;

Index: linux-2.6.28-test/include/asm-generic/bitops/__fls.h
===================================================================
--- linux-2.6.28-test.orig/include/asm-generic/bitops/__fls.h 2008-05-08 12:56:05.000000000 +0200
+++ linux-2.6.28-test/include/asm-generic/bitops/__fls.h 2009-01-02 16:01:00.000000000 +0100
@@ -9,7 +9,7 @@
*
* Undefined if no set bit exists, so code should check against 0 first.
*/
-static inline unsigned long __fls(unsigned long word)
+static __always_inline unsigned long __fls(unsigned long word)
{
int num = BITS_PER_LONG - 1;

Index: linux-2.6.28-test/include/asm-generic/bitops/fls.h
===================================================================
--- linux-2.6.28-test.orig/include/asm-generic/bitops/fls.h 2006-04-03 16:06:13.000000000 +0200
+++ linux-2.6.28-test/include/asm-generic/bitops/fls.h 2009-01-02 16:01:00.000000000 +0100
@@ -9,7 +9,7 @@
* Note fls(0) = 0, fls(1) = 1, fls(0x80000000) = 32.
*/

-static inline int fls(int x)
+static __always_inline int fls(int x)
{
int r = 32;

Index: linux-2.6.28-test/include/asm-generic/bitops/fls64.h
===================================================================
--- linux-2.6.28-test.orig/include/asm-generic/bitops/fls64.h 2008-05-08 12:56:05.000000000 +0200
+++ linux-2.6.28-test/include/asm-generic/bitops/fls64.h 2009-01-02 16:01:00.000000000 +0100
@@ -15,7 +15,7 @@
* at position 64.
*/
#if BITS_PER_LONG == 32
-static inline int fls64(__u64 x)
+static __always_inline int fls64(__u64 x)
{
__u32 h = x >> 32;
if (h)
@@ -23,7 +23,7 @@
return fls(x);
}
#elif BITS_PER_LONG == 64
-static inline int fls64(__u64 x)
+static __always_inline int fls64(__u64 x)
{
if (x == 0)
return 0;

2009-01-04 23:36:31

by Andi Kleen

Subject: [PATCH] [1/5] Only scan the root bus in early PCI quirks.


We found a situation on Linus' machine where the Nvidia timer quirk hit
on an Intel chipset system. The problem is that the system has a fancy
Nvidia card with its own PCI bridge, and the early-quirks code, which
looks for any Nvidia bridge, incorrectly triggered on it. By luck this
didn't lead to a boot failure, only to the timer routing code selecting
the wrong timer first and some ugly messages, but it might lead to real
problems on other systems.

I checked all the devices which are currently matched by early_quirks
and it turns out they are all located on root bus zero.

So change the early-quirks loop to only scan bus 0. This incidentally
also saves quite a bit of unnecessary scanning work, because early_quirks
no longer walks all the non-root buses.

The graphics card is not on bus 0, so it is no longer matched.
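
As a rough illustration, here is a minimal standalone sketch of the address
arithmetic behind this "poor man's" scan (just the classic type-1
CONFIG_ADDRESS encoding, no port I/O; the helper name is made up).
Restricting the outer loop to bus 0 simply pins the bus field of this
address to zero:

#include <stdio.h>
#include <stdint.h>

/* PCI configuration mechanism #1 address: enable bit, bus, device,
 * function, dword-aligned register offset. */
static uint32_t pci_conf1_addr(unsigned bus, unsigned slot, unsigned func,
                               unsigned off)
{
        return 0x80000000u | (bus << 16) | (slot << 11) | (func << 8) |
               (off & 0xfc);
}

int main(void)
{
        /* bus 0, device 31, function 0, vendor/device ID register (offset 0) */
        printf("CONFIG_ADDRESS = 0x%08x\n", pci_conf1_addr(0, 31, 0, 0));
        return 0;
}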

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/kernel/early-quirks.c | 22 ++++++++++++++--------
1 file changed, 14 insertions(+), 8 deletions(-)

Index: linux-2.6.28-test/arch/x86/kernel/early-quirks.c
===================================================================
--- linux-2.6.28-test.orig/arch/x86/kernel/early-quirks.c 2008-11-24 16:42:46.000000000 +0100
+++ linux-2.6.28-test/arch/x86/kernel/early-quirks.c 2008-12-29 05:25:09.000000000 +0100
@@ -200,6 +200,12 @@
void (*f)(int num, int slot, int func);
};

+/*
+ * Only works for devices on the root bus. If you add any devices
+ * not on bus 0 readd another loop level in early_quirks(). But
+ * be careful because at least the Nvidia quirk here relies on
+ * only matching on bus 0.
+ */
static struct chipset early_qrk[] __initdata = {
{ PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, QFLAG_APPLY_ONCE, nvidia_bugs },
@@ -266,17 +272,17 @@

void __init early_quirks(void)
{
- int num, slot, func;
+ int slot, func;

if (!early_pci_allowed())
return;

/* Poor man's PCI discovery */
- for (num = 0; num < 32; num++)
- for (slot = 0; slot < 32; slot++)
- for (func = 0; func < 8; func++) {
- /* Only probe function 0 on single fn devices */
- if (check_dev_quirk(num, slot, func))
- break;
- }
+ /* Only scan the root bus */
+ for (slot = 0; slot < 32; slot++)
+ for (func = 0; func < 8; func++) {
+ /* Only probe function 0 on single fn devices */
+ if (check_dev_quirk(0, slot, func))
+ break;
+ }
}

2009-01-04 23:36:46

by Andi Kleen

Subject: [PATCH] [4/5] Use asm stubs for 32bit sigreturn codes


The 32-bit sigreturn/rt_sigreturn currently uses some magic casting, with
assumptions about the stack layout, to simulate call by reference for the
pt_regs structure on the stack. This is fragile and non-standard, and
according to reports it breaks at least with LLVM. I suppose it could also
break with future gcc versions if their stack layout changes.

So instead of doing this magic in C, do it in small assembler stubs,
similar to what x86-64 does (which always passes a pointer to pt_regs
to such functions).

This also leads to cleaner code.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/kernel/entry_32.S | 10 ++++++++++
arch/x86/kernel/signal_32.c | 13 ++-----------
arch/x86/kernel/syscall_table_32.S | 4 ++--
3 files changed, 14 insertions(+), 13 deletions(-)

Index: linux-2.6.28-test/arch/x86/kernel/entry_32.S
===================================================================
--- linux-2.6.28-test.orig/arch/x86/kernel/entry_32.S 2008-11-06 20:26:33.000000000 +0100
+++ linux-2.6.28-test/arch/x86/kernel/entry_32.S 2009-01-02 16:01:06.000000000 +0100
@@ -1206,6 +1206,16 @@
#endif /* CONFIG_DYNAMIC_FTRACE */
#endif /* CONFIG_FUNCTION_TRACER */

+ENTRY(stub_sigreturn)
+ movl %esp,%eax
+ jmp sys_sigreturn
+END(stub_sigreturn)
+
+ENTRY(stub_rt_sigreturn)
+ movl %esp,%eax
+ jmp sys_rt_sigreturn
+END(stub_rt_sigreturn)
+
.section .rodata,"a"
#include "syscall_table_32.S"

Index: linux-2.6.28-test/arch/x86/kernel/signal_32.c
===================================================================
--- linux-2.6.28-test.orig/arch/x86/kernel/signal_32.c 2008-10-24 13:34:40.000000000 +0200
+++ linux-2.6.28-test/arch/x86/kernel/signal_32.c 2009-01-02 16:01:06.000000000 +0100
@@ -169,14 +169,12 @@
return err;
}

-asmlinkage unsigned long sys_sigreturn(unsigned long __unused)
+asmlinkage unsigned long sys_sigreturn(struct pt_regs *regs)
{
struct sigframe __user *frame;
- struct pt_regs *regs;
unsigned long ax;
sigset_t set;

- regs = (struct pt_regs *) &__unused;
frame = (struct sigframe __user *)(regs->sp - 8);

if (!access_ok(VERIFY_READ, frame, sizeof(*frame)))
@@ -212,7 +210,7 @@
return 0;
}

-static long do_rt_sigreturn(struct pt_regs *regs)
+asmlinkage long sys_rt_sigreturn(struct pt_regs *regs)
{
struct rt_sigframe __user *frame;
unsigned long ax;
@@ -243,13 +241,6 @@
return 0;
}

-asmlinkage int sys_rt_sigreturn(unsigned long __unused)
-{
- struct pt_regs *regs = (struct pt_regs *)&__unused;
-
- return do_rt_sigreturn(regs);
-}
-
/*
* Set up a signal frame.
*/
Index: linux-2.6.28-test/arch/x86/kernel/syscall_table_32.S
===================================================================
--- linux-2.6.28-test.orig/arch/x86/kernel/syscall_table_32.S 2008-08-15 17:13:40.000000000 +0200
+++ linux-2.6.28-test/arch/x86/kernel/syscall_table_32.S 2009-01-02 16:01:06.000000000 +0100
@@ -118,7 +118,7 @@
.long sys_sysinfo
.long sys_ipc
.long sys_fsync
- .long sys_sigreturn
+ .long stub_sigreturn
.long sys_clone /* 120 */
.long sys_setdomainname
.long sys_newuname
@@ -172,7 +172,7 @@
.long sys_setresgid16 /* 170 */
.long sys_getresgid16
.long sys_prctl
- .long sys_rt_sigreturn
+ .long stub_rt_sigreturn
.long sys_rt_sigaction
.long sys_rt_sigprocmask /* 175 */
.long sys_rt_sigpending

2009-01-04 23:37:11

by Andi Kleen

Subject: [PATCH] [5/5] Avoid theoretical vmalloc fault loop


Ajith Kumar noticed:

I was going through the vmalloc fault handling for x86_64 and am unclear
about the following lines in the vmalloc_fault() function.

pgd = pgd_offset(current->mm ?: &init_mm, address);
pgd_ref = pgd_offset_k(address);

Here the intention is to get the pgd corresponding to the current
process and sync it up with the pgd in init_mm (obtained from
pgd_offset_k). However, for kernel threads current->mm is NULL and hence
pgd = pgd_offset(init_mm, address) = pgd_ref which means the fault
handler returns without setting the pgd entry in the MM structure in
the context of which the kernel thread has faulted. This could lead to
never-ending faults and busy looping of kernel threads like pdflush.
So, shouldn't the pgd = pgd_offset(current->mm ?: &init_mm, address);
be
pgd = pgd_offset(current->active_mm ?: &init_mm, address);

AK: We can use active_mm unconditionally because it should always be
set. Do that.
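
To make the degenerate case concrete, here is a standalone sketch (plain C
with made-up minimal types, not kernel code) of what the old expression
does for a kernel thread: because current->mm is NULL it falls back to
init_mm, so pgd aliases pgd_ref and the sync can never copy anything into
the page tables the thread is actually running on:

#include <stdio.h>

struct mm { unsigned long pgd; };

static struct mm init_mm = { .pgd = 0x1111 };   /* kernel mapping present */
static struct mm borrowed_mm = { .pgd = 0 };    /* mm the kernel thread runs on */

int main(void)
{
        struct mm *mm = NULL;                   /* kernel thread: current->mm is NULL */
        struct mm *active_mm = &borrowed_mm;    /* current->active_mm is always set */

        unsigned long *pgd_ref = &init_mm.pgd;
        unsigned long *pgd_old = &(mm ? mm : &init_mm)->pgd;   /* old logic */
        unsigned long *pgd_new = &active_mm->pgd;              /* new logic */

        printf("old: pgd == pgd_ref? %d, borrowed entry stays %#lx\n",
               pgd_old == pgd_ref, borrowed_mm.pgd);

        if (*pgd_new == 0)
                *pgd_new = *pgd_ref;    /* the sync that never happened before */
        printf("new: borrowed entry now %#lx\n", *pgd_new);
        return 0;
}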

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/fault.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.28-test/arch/x86/mm/fault.c
===================================================================
--- linux-2.6.28-test.orig/arch/x86/mm/fault.c 2008-10-24 13:34:41.000000000 +0200
+++ linux-2.6.28-test/arch/x86/mm/fault.c 2009-01-02 16:01:11.000000000 +0100
@@ -533,7 +533,7 @@
happen within a race in page table update. In the later
case just flush. */

- pgd = pgd_offset(current->mm ?: &init_mm, address);
+ pgd = pgd_offset(current->active_mm, address);
pgd_ref = pgd_offset_k(address);
if (pgd_none(*pgd_ref))
return -1;

2009-01-04 23:51:19

by Yinghai Lu

Subject: Re: [PATCH] [1/5] Only scan the root bus in early PCI quirks.

On Sun, Jan 4, 2009 at 3:36 PM, Andi Kleen <[email protected]> wrote:
>
> We found a situation on Linus' machine that the Nvidia timer
> quirk hit on a Intel chipset system. The problem is that the
> system has a fancy Nvidia card with an own PCI bridge, and
> the early-quirks code looking for any NVidia bridge triggered
> on it incorrectly. This didn't lead a boot failure by luck,
> but the timer routing code selecting the wrong timer first and
> some ugly messages. It might lead to real problems on
> other systems.
>
> I checked all the devices which are currently checked for
> by early_quirks and it turns out they are all located in
> the root bus zero.
>
> So change the early-quirks loop to only scan bus 0. This
> incidently also saves quite some unnecessary scanning work,
> because early_quirks doesn't go through all the non root
> busses.
>
> The graphics card is not on bus 0, so it is not matched
> anymore.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> arch/x86/kernel/early-quirks.c | 22 ++++++++++++++--------
> 1 file changed, 14 insertions(+), 8 deletions(-)
>
> Index: linux-2.6.28-test/arch/x86/kernel/early-quirks.c
> ===================================================================
> --- linux-2.6.28-test.orig/arch/x86/kernel/early-quirks.c 2008-11-24 16:42:46.000000000 +0100
> +++ linux-2.6.28-test/arch/x86/kernel/early-quirks.c 2008-12-29 05:25:09.000000000 +0100
> @@ -200,6 +200,12 @@
> void (*f)(int num, int slot, int func);
> };
>
> +/*
> + * Only works for devices on the root bus. If you add any devices
> + * not on bus 0 readd another loop level in early_quirks(). But
> + * be careful because at least the Nvidia quirk here relies on
> + * only matching on bus 0.
> + */
> static struct chipset early_qrk[] __initdata = {
> { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, QFLAG_APPLY_ONCE, nvidia_bugs },
> @@ -266,17 +272,17 @@
>
> void __init early_quirks(void)
> {
> - int num, slot, func;
> + int slot, func;
>
> if (!early_pci_allowed())
> return;
>
> /* Poor man's PCI discovery */
> - for (num = 0; num < 32; num++)
> - for (slot = 0; slot < 32; slot++)
> - for (func = 0; func < 8; func++) {
> - /* Only probe function 0 on single fn devices */
> - if (check_dev_quirk(num, slot, func))
> - break;
> - }
> + /* Only scan the root bus */
> + for (slot = 0; slot < 32; slot++)
> + for (func = 0; func < 8; func++) {
> + /* Only probe function 0 on single fn devices */
> + if (check_dev_quirk(0, slot, func))
> + break;
> + }
> }
> --

Two cases:
1. What about systems with two or more HT chains, where the second chain
is at 0x40, 0x80, or 0xc0?
2. Some systems with LinuxBIOS could put the first chain on bus 1.

YH

2009-01-05 00:08:31

by H. Peter Anvin

Subject: Re: [PATCH] [4/5] Use asm stubs for 32bit sigreturn codes

Andi Kleen wrote:
> 32bit sigreturn/rt_sigreturn currently uses some magic casting
> with assumptions about the stack layout
> to simulate call by reference for the pt_regs structure on the
> stack. This is fragile, non standard and according to reports
> breaks with LLVM at least. I suppose it could break in
> future gcc versions too when their stack layout changes.
>
> So instead of having this magic in C do it in small
> assembler stubs similar to what x86-64 does (which
> always passes a pointer to pt_regs for such functions)
>
> This also leads to cleaner code.
>

It's wrong, however:

#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))

> +ENTRY(stub_sigreturn)
> + movl %esp,%eax
> + jmp sys_sigreturn
> +END(stub_sigreturn)

This assumes regparm=3, i.e. !asmlinkage.
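
For reference, a small userspace sketch of the two conventions (assuming a
32-bit x86 compile, e.g. gcc -m32 -O1; the names are made up). The point is
only where each compiled function fetches its argument: the
regparm(0)/asmlinkage form reads it from 4(%esp), while the regparm(3) form
reads it from %eax, so a hand-written stub that loads the pointer into %eax
only matches the latter:

#include <stdio.h>

struct fake_regs { long ax; };

/* asmlinkage-style: regparm(0), argument is fetched from the stack */
static __attribute__((regparm(0), noinline)) long arg_from_stack(struct fake_regs *r)
{
        return r->ax;
}

/* kernel default on i386 (-mregparm=3): argument arrives in %eax */
static __attribute__((regparm(3), noinline)) long arg_from_eax(struct fake_regs *r)
{
        return r->ax;
}

int main(void)
{
        struct fake_regs r = { .ax = 42 };

        /* Calling from C works either way because the compiler emits the
         * matching call sequence; a fixed asm stub does not adapt. */
        printf("%ld %ld\n", arg_from_stack(&r), arg_from_eax(&r));
        return 0;
}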

> -asmlinkage unsigned long sys_sigreturn(unsigned long __unused)
> +asmlinkage unsigned long sys_sigreturn(struct pt_regs *regs)
> {
> struct sigframe __user *frame;
> - struct pt_regs *regs;
> unsigned long ax;
> sigset_t set;
>
> - regs = (struct pt_regs *) &__unused;
> frame = (struct sigframe __user *)(regs->sp - 8);
>
> if (!access_ok(VERIFY_READ, frame, sizeof(*frame)))
> @@ -212,7 +210,7 @@
> return 0;
> }

I think we already have a patch in the tree that changes this to:

asmlinkage unsigned long sys_sigreturn(struct pt_regs regs)

... although doing it as a pointer is better, but that requires a change to
the asmlinkage definition.

Getting rid of __attribute__((regparm(0))) for asmlinkage on i386 would
definitely be good, but it is a bigger job than just these couple of
functions.

-hpa

2009-01-05 01:33:18

by Andi Kleen

Subject: Re: [PATCH] [1/5] Only scan the root bus in early PCI quirks.

> two cases:
> 1. how about system with two or more HT chains, and second will be
> 0x40, x0x80, or 0xc0 ?

The early-quirks code only cares about the first southbridge; it's
essentially just a chipset detect. So secondary chains do not
matter.

> 2. some system with LinuxBIOS, could put first chain on bus 1

Do they or do they not? Please be specific.

Anyway, LinuxBIOS doesn't have ACPI, and early-quirks.c is 80%+ ACPI
workarounds, so those don't apply. The only exceptions are the VIA
detect and the HT broadcast workaround; I suppose neither applies
to these LinuxBIOS systems.

-Andi
--
[email protected]

2009-01-05 11:48:24

by Nick Piggin

Subject: Re: [PATCH] [5/5] Avoid theoretical vmalloc fault loop

This should be a -stable patch too I think.

On Monday 05 January 2009 10:36:42 Andi Kleen wrote:
> Ajith Kumar noticed:
>
> I was going through the vmalloc fault handling for x86_64 and am unclear
> about the following lines in the vmalloc_fault() function.
>
> pgd = pgd_offset(current->mm ?: &init_mm, address);
> pgd_ref = pgd_offset_k(address);
>
> Here the intention is to get the pgd corresponding to the current
> process and sync it up with the pgd in init_mm(obtained from
> pgd_offset_k). However, for kernel threads current->mm is NULL and hence
> pgd = pgd_offset(init_mm, address) = pgd_ref which means the fault
> handler returns without setting the pgd entry in the MM structure in
> the context of which the kernel thread has faulted. This could lead to
> never-ending faults and busy looping of kernel threads like pdflush.
> So, shouldn't the pgd = pgd_offset(current->mm ?: &init_mm, address);
> be
> pgd = pgd_offset(current->active_mm ?: &init_mm, address);
>
> AK: We can use active_mm unconditionally because it should
> be always set. Do that.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> arch/x86/mm/fault.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6.28-test/arch/x86/mm/fault.c
> ===================================================================
> --- linux-2.6.28-test.orig/arch/x86/mm/fault.c 2008-10-24 13:34:41.000000000 +0200
> +++ linux-2.6.28-test/arch/x86/mm/fault.c 2009-01-02 16:01:11.000000000 +0100
> @@ -533,7 +533,7 @@
> happen within a race in page table update. In the later
> case just flush. */
>
> - pgd = pgd_offset(current->mm ?: &init_mm, address);
> + pgd = pgd_offset(current->active_mm, address);
> pgd_ref = pgd_offset_k(address);
> if (pgd_none(*pgd_ref))
> return -1;
> --

2009-01-06 00:03:30

by Andrew Morton

Subject: Re: [PATCH] [3/5] Mark complex bitops.h inlines as __always_inline

On Mon, 5 Jan 2009 00:36:40 +0100 (CET)
Andi Kleen <[email protected]> wrote:

> Hugh Dickins noticed that older gcc versions when the kernel
> is built for code size didn't inline some of the bitops.
>
> Mark all complex x86 bitops that have more than a single
> asm statement or two as always inline to avoid this problem.
>
> Probably should be done for other architectures too.
>

Well yes. This is why we did

#if !defined(CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING) || \
!defined(CONFIG_OPTIMIZE_INLINING) || (__GNUC__ < 4)
# define inline inline __attribute__((always_inline))
# define __inline__ __inline__ __attribute__((always_inline))
# define __inline __inline __attribute__((always_inline))
#endif

but people had to go and futz with it, make it complicated and break
stuff. Sigh.

2009-01-06 10:17:33

by Ingo Molnar

Subject: Re: [PATCH] [3/5] Mark complex bitops.h inlines as __always_inline


* Andi Kleen <[email protected]> wrote:

> Hugh Dickins noticed that older gcc versions when the kernel is built
> for code size didn't inline some of the bitops.

Older GCCs have a whole lot more problems than suboptimal inlining
decisions.

Your patch is wrong because it prevents a good compiler from doing the
right inlining decisions.

So why should i apply this?

Ingo

2009-01-06 14:18:29

by Andi Kleen

Subject: Re: [PATCH] [3/5] Mark complex bitops.h inlines as __always_inline

> Your patch is wrong because it prevents a good compiler from doing the
> right inlining decisions.

One- or two-instruction bitops should always be inlined; not inlining
them generates much worse code than inlining them. That's easy to prove
just based on code size: the call overhead alone is larger than the
inlined code.

This patch just makes sure they always get inlined by marking them
explicitly. There is never a reason not to inline them, so giving the
compiler a choice doesn't make sense.

Even with a perfect inlining algorithm (which no compiler has) that
remains true, so stating it explicitly is correct.

Also, to correctly handle inlines whose code collapses at compile time
(e.g. __builtin_constant_p tests and similar), the compiler needs the
new "early inlining + optimization" pass that was only added in gcc 4.4.
But 4.4 is not even out yet, so obviously most people don't use it.

That is why *_user(), for example, always needs to be marked
__always_inline. This patch just extends that to a few more functions
with the same problem.
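
For illustration, a minimal userspace sketch of that collapsing effect
(plain gcc, made-up names, not the kernel headers): when the helper really
is inlined at a call site with a constant argument, the
__builtin_constant_p() test folds at compile time and the fast path reduces
to a constant shift; emitted out of line, the test is always false and
every caller takes the generic path plus a call:

#include <stdio.h>

static __attribute__((__always_inline__)) inline int demo_test_bit(int nr,
                                                        unsigned long word)
{
        if (__builtin_constant_p(nr))
                return (word >> nr) & 1;        /* folds when nr is a literal */
        return (word >> (nr & 31)) & 1;         /* generic path */
}

int main(void)
{
        unsigned long w = 0x6;
        volatile int n = 2;                     /* not a compile-time constant */

        printf("%d %d\n", demo_test_bit(1, w), demo_test_bit(n, w));
        return 0;
}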

-Andi

--
[email protected]

2009-01-06 19:18:29

by Hugh Dickins

Subject: Re: [PATCH] [3/5] Mark complex bitops.h inlines as __always_inline

On Mon, 5 Jan 2009, Andi Kleen wrote:
>
> Hugh Dickins noticed that older gcc versions when the kernel
> is built for code size didn't inline some of the bitops.
>
> Mark all complex x86 bitops that have more than a single
> asm statement or two as always inline to avoid this problem.
>
> Probably should be done for other architectures too.
>
> Signed-off-by: Andi Kleen <[email protected]>

Thanks for pursuing this again, Andi. Yes, this patch certainly
helped in my case. But could we please change the first sentence
of that comment? Though I did first notice the problem while
CONFIG_CC_OPTIMIZE_FOR_SIZE=y, it persisted to a greater or
lesser extent without that, so long as CONFIG_OPTIMIZE_INLINING=y:
it's a problem with that option, not with building for code size.

I haven't actually been using your patch since testing it, I just
switched CONFIG_OPTIMIZE_INLINING off: I'm surprised it remained
=y in the x86 defconfigs (though default N in the Kconfig, phew),
and still worry what other silliness that option may be doing, while
it seems that there isn't yet an actual gcc release that behaves well
with it (I see from later mail that you do expect 4.4 to fix it).

So far as I've been able to tell, the right wording for the first
sentence would be:

Hugh Dickins noticed that released gcc versions building the kernel
with CONFIG_OPTIMIZE_INLINING=y don't inline some of the bitops -
sometimes generating very inefficient pageflag tests, and many
instances of constant_test_bit().

Hugh

>
> ---
> arch/x86/include/asm/bitops.h | 15 +++++++++++----
> include/asm-generic/bitops/__ffs.h | 2 +-
> include/asm-generic/bitops/__fls.h | 2 +-
> include/asm-generic/bitops/fls.h | 2 +-
> include/asm-generic/bitops/fls64.h | 4 ++--
> 5 files changed, 16 insertions(+), 9 deletions(-)
>
> Index: linux-2.6.28-test/arch/x86/include/asm/bitops.h
> ===================================================================
> --- linux-2.6.28-test.orig/arch/x86/include/asm/bitops.h 2008-10-24 13:34:40.000000000 +0200
> +++ linux-2.6.28-test/arch/x86/include/asm/bitops.h 2009-01-02 16:01:00.000000000 +0100
> @@ -3,6 +3,9 @@
>
> /*
> * Copyright 1992, Linus Torvalds.
> + *
> + * Note: inlines with more than a single statement should be marked
> + * __always_inline to avoid problems with older gcc's inlining heuristics.
> */
>
> #ifndef _LINUX_BITOPS_H
> @@ -53,7 +56,8 @@
> * Note that @nr may be almost arbitrarily large; this function is not
> * restricted to acting on a single-word quantity.
> */
> -static inline void set_bit(unsigned int nr, volatile unsigned long *addr)
> +static __always_inline void
> +set_bit(unsigned int nr, volatile unsigned long *addr)
> {
> if (IS_IMMEDIATE(nr)) {
> asm volatile(LOCK_PREFIX "orb %1,%0"
> @@ -90,7 +94,8 @@
> * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
> * in order to ensure changes are visible on other processors.
> */
> -static inline void clear_bit(int nr, volatile unsigned long *addr)
> +static __always_inline void
> +clear_bit(int nr, volatile unsigned long *addr)
> {
> if (IS_IMMEDIATE(nr)) {
> asm volatile(LOCK_PREFIX "andb %1,%0"
> @@ -196,7 +201,8 @@
> *
> * This is the same as test_and_set_bit on x86.
> */
> -static inline int test_and_set_bit_lock(int nr, volatile unsigned long *addr)
> +static __always_inline int
> +test_and_set_bit_lock(int nr, volatile unsigned long *addr)
> {
> return test_and_set_bit(nr, addr);
> }
> @@ -292,7 +298,8 @@
> return oldbit;
> }
>
> -static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
> +static __always_inline int
> +constant_test_bit(int nr, const volatile unsigned long *addr)
> {
> return ((1UL << (nr % BITS_PER_LONG)) &
> (((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
> Index: linux-2.6.28-test/include/asm-generic/bitops/__ffs.h
> ===================================================================
> --- linux-2.6.28-test.orig/include/asm-generic/bitops/__ffs.h 2006-04-03 16:06:13.000000000 +0200
> +++ linux-2.6.28-test/include/asm-generic/bitops/__ffs.h 2009-01-02 16:01:00.000000000 +0100
> @@ -9,7 +9,7 @@
> *
> * Undefined if no bit exists, so code should check against 0 first.
> */
> -static inline unsigned long __ffs(unsigned long word)
> +static __always_inline unsigned long __ffs(unsigned long word)
> {
> int num = 0;
>
> Index: linux-2.6.28-test/include/asm-generic/bitops/__fls.h
> ===================================================================
> --- linux-2.6.28-test.orig/include/asm-generic/bitops/__fls.h 2008-05-08 12:56:05.000000000 +0200
> +++ linux-2.6.28-test/include/asm-generic/bitops/__fls.h 2009-01-02 16:01:00.000000000 +0100
> @@ -9,7 +9,7 @@
> *
> * Undefined if no set bit exists, so code should check against 0 first.
> */
> -static inline unsigned long __fls(unsigned long word)
> +static __always_inline unsigned long __fls(unsigned long word)
> {
> int num = BITS_PER_LONG - 1;
>
> Index: linux-2.6.28-test/include/asm-generic/bitops/fls.h
> ===================================================================
> --- linux-2.6.28-test.orig/include/asm-generic/bitops/fls.h 2006-04-03 16:06:13.000000000 +0200
> +++ linux-2.6.28-test/include/asm-generic/bitops/fls.h 2009-01-02 16:01:00.000000000 +0100
> @@ -9,7 +9,7 @@
> * Note fls(0) = 0, fls(1) = 1, fls(0x80000000) = 32.
> */
>
> -static inline int fls(int x)
> +static __always_inline int fls(int x)
> {
> int r = 32;
>
> Index: linux-2.6.28-test/include/asm-generic/bitops/fls64.h
> ===================================================================
> --- linux-2.6.28-test.orig/include/asm-generic/bitops/fls64.h 2008-05-08 12:56:05.000000000 +0200
> +++ linux-2.6.28-test/include/asm-generic/bitops/fls64.h 2009-01-02 16:01:00.000000000 +0100
> @@ -15,7 +15,7 @@
> * at position 64.
> */
> #if BITS_PER_LONG == 32
> -static inline int fls64(__u64 x)
> +static __always_inline int fls64(__u64 x)
> {
> __u32 h = x >> 32;
> if (h)
> @@ -23,7 +23,7 @@
> return fls(x);
> }
> #elif BITS_PER_LONG == 64
> -static inline int fls64(__u64 x)
> +static __always_inline int fls64(__u64 x)
> {
> if (x == 0)
> return 0;
> --
>

2009-01-07 13:19:21

by Ingo Molnar

Subject: Re: [PATCH] [3/5] Mark complex bitops.h inlines as __always_inline


* Hugh Dickins <[email protected]> wrote:

> Hugh Dickins noticed that released gcc versions building the kernel with
> CONFIG_OPTIMIZE_INLINING=y don't inline some of the bitops - sometimes
> generating very inefficient pageflag tests, and many instances of
> constant_test_bit().

Could you quantify that please?

We really dont want to reintroduce __always_inline just for performance /
code size reasons. If GCC messes up and makes a larger / more inefficient
kernel, GCC will be fixed. CONFIG_OPTIMIZE_INLINING is default-off, so
enable it only if it improves your kernel.

Ingo

2009-01-07 13:27:41

by Ingo Molnar

Subject: Re: [PATCH] [3/5] Mark complex bitops.h inlines as __always_inline


* Andi Kleen <[email protected]> wrote:

> > Your patch is wrong because it prevents a good compiler from doing the
> > right inlining decisions.
>
> One or two instruction bitops should be always inlined, not inlining
> them always generates much worse code than inlining them. That's easy to
> prove just based on code size: the call overhead is larger than the
> inlined code.

You are arguing this backwards. Firstly, CONFIG_OPTIMIZE_INLINING is
disabled by default. If you enable it, and if CONFIG_OPTIMIZE_INLINING=y
creates a larger kernel for you, then either do not enable it - or send
numbers so that we can see the true impact of this change.

Yes, simple functions should always be inlined, nobody will argue about
that. But GCC's inliner will never be fixed (for the kernel domain at
least) if we keep working it around and if we keep micro-managing it.

We work around GCC bugs only if it hurts visibly (and generally if it
affects default-enabled features/optimizations): if it can be quantified
either via stability/robustness problems or via excessive size /
efficiency problems.

You simply have not made that argument via any numbers here. So i have no
reason to take your patch just yet, unless you provide some clear,
documented, measurable benefit.

Ingo

2009-01-07 19:32:22

by Andi Kleen

Subject: Re: [PATCH] [3/5] Mark complex bitops.h inlines as __always_inline

Hi Ingo,

> Yes, simple functions should always be inlined, nobody will argue about
> that. But GCC's inliner will never be fixed (for the kernel domain at
> least) if we keep working it around and if we keep micro-managing it.

To my knowledge the inliner is already fixed. We're just talking about
old gccs.

Also to be honest I never quite believed in the basic idea of
"hurt users to get someone else to do something" strategies like
you're advocating here.

>
> We work around GCC bugs only if it hurts visibly (and generally if it
> affects default-enabled features/optimizations): if it can be quantified
> either via stability/robustness problems or there's excesssive size /
> efficiency problems.

Not inlining bitops qualifies as the latter, I believe.

I agree with you if we're talking about random static functions
in .c files, but widely used functions in .h files that are
not expected to change are a little different. The main
reason not to use __always_inline normally is that when people
later change a function to contain more code they typically don't
drop the marker, and that eventually leads to bloat.
But the bitops-style functions I changed here are not really
expected to change that way.

Also I don't think adding a few __always_inline markers for functions which
really need them will hurt maintainability or anything else. In fact
we already have them in a couple of places (e.g. uaccess.h).

>
> You simply have not made that argument via any numbers here. So i have no

That's true; I was just reacting to Hugh's report. I don't use old
compilers myself.

-Andi

--
[email protected]

2009-01-08 02:17:00

by Hugh Dickins

Subject: Re: [PATCH] [3/5] Mark complex bitops.h inlines as __always_inline

On Wed, 7 Jan 2009, Ingo Molnar wrote:
>
> > Hugh Dickins noticed that released gcc versions building the kernel with
> > CONFIG_OPTIMIZE_INLINING=y don't inline some of the bitops - sometimes
> > generating very inefficient pageflag tests, and many instances of
> > constant_test_bit().
>
> Could you quantify that please?
>
> We really dont want to reintroduce __always_inline just for performance /
> code size reasons. If GCC messes up and makes a larger / more inefficient
> kernel, GCC will be fixed. CONFIG_OPTIMIZE_INLINING is default-off, so
> enable it only if it improves your kernel.

I'm happy with the config option defaulting off, but the Kconfig help
text seems too optimistic, and I doubt defconfigs should say Y yet.

http://lkml.org/lkml/2008/11/14/203 ("CONFIG_OPTIMIZE_INLINING fun")
was my original mail to you on this: somewhat rambling, but worth
looking at just for the laughable 4.2.1 Page disassemblies.

Quantification: you may have a better idea of what numbers to look
at, but here are some.

I've tried "make defconfig" on 2.6.28 x86_32 kernel trees, building
first with the openSUSE 10.3 version of gcc 4.2.1, then the openSUSE
11.1 version of gcc 4.3.2 (Fedora 10 and Ubuntu 8.10 versions gave
similar results, though I didn't go very far with those), then the
latest gcc 4.4 snapshot. I was building x86_64 a few weeks ago,
and that gave similar results.

I've built five configs: vmlinux.nn is the defconfig, except for
CONFIG_CC_OPTIMIZE_FOR_SIZE and CONFIG_OPTIMIZE_INLINING both off;
vmlinux.yn has SIZE=y, vmlinux.ny has INLINING=y, vmlinux.yy is the
actual defconfig with both y, vmlinux.yya is that with Andi's bitops
patch applied.

For background I'm showing "size" output, and the number of lines in
System.map (supposing that to give some idea of how much inlining).

Then how many "Page" lines appear in System.map: ideally none, a
few are okay, but lots mean it's being stupid about PageLocked etc -
4.2.1 SIZE=y INLINING=y is afflicted by that, the rest are okay.

Then how many "constant_test_bit"s (each sized 27 bytes): clearly
Andi's patch eliminates those. Contrary to expectations, 4.4 does
little better than 4.3.2 with them; and though 4.2.1's yy case is
worst, its ny case is actually much better than with 4.3 and 4.4
(or maybe there's a different interpretation than numbers suggest).

I think these numbers show that even gcc 4.4 is not yet ready
for CONFIG_OPTIMIZE_INLINING=y, but that Andi's patch helps in one
important area. grep for something else and maybe a different story.

gcc version 4.2.1 (SUSE Linux):

text data bss dec hex filename
6847905 1151180 622592 8621677 838e6d vmlinux.nn
45069 wc -l <System.map.nn
0 grep -c Page System.map.nn
0 grep -c constant_test_bit System.map.nn

text data bss dec hex filename
6848053 1151180 622592 8621825 838f01 vmlinux.ny
45130 wc -l <System.map.ny
0 grep -c Page System.map.ny
5 grep -c constant_test_bit System.map.ny

text data bss dec hex filename
5921243 1124336 618496 7664075 74f1cb vmlinux.yn
44949 wc -l <System.map.yn
0 grep -c Page System.map.yn
0 grep -c constant_test_bit System.map.yn

text data bss dec hex filename
5906470 1124336 618496 7649302 74b816 vmlinux.yy
49223 wc -l <System.map.yy
98 grep -c Page System.map.yy
228 grep -c constant_test_bit System.map.yy

text data bss dec hex filename
5877907 1124336 618496 7620739 744883 vmlinux.yya
48786 wc -l <System.map.yya
98 grep -c Page System.map.yya
0 grep -c constant_test_bit System.map.yya

gcc version 4.3.2 [gcc-4_3-branch revision 141291] (SUSE Linux):

text data bss dec hex filename
7017366 1151180 622592 8791138 862462 vmlinux.nn
43755 wc -l <System.map.nn
0 grep -c Page System.map.nn
0 grep -c constant_test_bit System.map.nn

text data bss dec hex filename
7014542 1151180 622592 8788314 86195a vmlinux.ny
44165 wc -l <System.map.ny
2 grep -c Page System.map.ny
79 grep -c constant_test_bit System.map.ny

text data bss dec hex filename
5965795 1124336 618496 7708627 759fd3 vmlinux.yn
43790 wc -l <System.map.yn
0 grep -c Page System.map.yn
0 grep -c constant_test_bit System.map.yn

text data bss dec hex filename
5917632 1124336 618496 7660464 74e3b0 vmlinux.yy
44816 wc -l <System.map.yy
4 grep -c Page System.map.yy
151 grep -c constant_test_bit System.map.yy

text data bss dec hex filename
5898313 1124336 618496 7641145 749839 vmlinux.yya
44602 wc -l <System.map.yya
0 grep -c Page System.map.yya
0 grep -c constant_test_bit System.map.yya

gcc version 4.4.0 20090102 (experimental) (GCC)

text data bss dec hex filename
6960329 1151180 622592 8734101 854595 vmlinux.nn
44121 wc -l <System.map.nn
0 grep -c Page System.map.nn
0 grep -c constant_test_bit System.map.nn

text data bss dec hex filename
6952421 1151180 622592 8726193 8526b1 vmlinux.ny
44564 wc -l <System.map.ny
1 grep -c Page System.map.ny
60 grep -c constant_test_bit System.map.ny

text data bss dec hex filename
5897459 1124336 618496 7640291 7494e3 vmlinux.yn
44149 wc -l <System.map.yn
0 grep -c Page System.map.yn
0 grep -c constant_test_bit System.map.yn

text data bss dec hex filename
5843498 1124336 618496 7586330 73c21a vmlinux.yy
45655 wc -l <System.map.yy
4 grep -c Page System.map.yy
148 grep -c constant_test_bit System.map.yy

text data bss dec hex filename
5816020 1124336 618496 7558852 7356c4 vmlinux.yya
45452 wc -l <System.map.yya
0 grep -c Page System.map.yya
0 grep -c constant_test_bit System.map.yya

Hugh

2009-01-08 05:07:09

by Andi Kleen

Subject: Re: [PATCH] [3/5] Mark complex bitops.h inlines as __always_inline

Hugh Dickins <[email protected]> writes:
>
> I think these numbers show that that even gcc 4.4 is not yet ready
> for CONFIG_OPTIMIZE_INLINING=y, but that Andi's patch helps in one
> important area. grep for something else and maybe a different story.

Thanks, Hugh, for the excellent numbers. So it looks like the patch is a good
idea even on newer gccs.

We should probably file test cases in the gcc bugzilla if 4.4 still
doesn't get it right. I'll put that on my todo list if no one beats
me to it.

-Andi

--
[email protected]

2009-01-08 08:04:58

by H. Peter Anvin

Subject: Re: [PATCH] [3/5] Mark complex bitops.h inlines as __always_inline

Ingo Molnar wrote:
> * Hugh Dickins <[email protected]> wrote:
>
>> Hugh Dickins noticed that released gcc versions building the kernel with
>> CONFIG_OPTIMIZE_INLINING=y don't inline some of the bitops - sometimes
>> generating very inefficient pageflag tests, and many instances of
>> constant_test_bit().
>
> Could you quantify that please?
>
> We really dont want to reintroduce __always_inline just for performance /
> code size reasons. If GCC messes up and makes a larger / more inefficient
> kernel, GCC will be fixed. CONFIG_OPTIMIZE_INLINING is default-off, so
> enable it only if it improves your kernel.
>

There is one condition under which gcc simply won't know, and that is
when an inline is composed primarily of asm code. gcc, I believe,
creates a worst-case estimate based on the number of semicolons or
newlines (something that works semi-okayish on RISC), and thus tends to
vastly overestimate the size of an asm() on x86, where statements are
of highly variable length. Hence it is probably always going to need
hints, unless the whole handling of inline assembly is revamped (which
would be good for scheduling, but I doubt it will happen).

-hpa