2015-05-06 17:07:45

by Denys Vlasenko

Subject: [PATCH] x86: Deinline cpuid_eax and friends

cpuid_e{a,b,c,d}x() functions compile to 44 bytes of machine code each.
On x86 allyesconfig build they have 48 callsites.
Deinlining all four of them shrinks kernel by about 1k:

    text     data      bss       dec     hex filename
82434909 22255384 20627456 125317749 7783275 vmlinux.before
82433898 22255384 20627456 125316738 7782e82 vmlinux

Speed impact: CPUID instruction takes from 50 to 350+ cycles,
call overhead is negligible in comparison.

Signed-off-by: Denys Vlasenko <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Borislav Petkov <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Frederic Weisbecker <[email protected]>
CC: Alexei Starovoitov <[email protected]>
CC: Will Drewry <[email protected]>
CC: Kees Cook <[email protected]>
CC: [email protected]
CC: [email protected]
---
arch/x86/include/asm/processor.h | 39 ++++--------------------------------
arch/x86/kernel/cpu/common.c | 43 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 47 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index ec1c935..67e1974 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -616,41 +616,10 @@ static inline void cpuid_count(unsigned int op, int count,
/*
* CPUID functions returning a single datum
*/
-static inline unsigned int cpuid_eax(unsigned int op)
-{
- unsigned int eax, ebx, ecx, edx;
-
- cpuid(op, &eax, &ebx, &ecx, &edx);
-
- return eax;
-}
-
-static inline unsigned int cpuid_ebx(unsigned int op)
-{
- unsigned int eax, ebx, ecx, edx;
-
- cpuid(op, &eax, &ebx, &ecx, &edx);
-
- return ebx;
-}
-
-static inline unsigned int cpuid_ecx(unsigned int op)
-{
- unsigned int eax, ebx, ecx, edx;
-
- cpuid(op, &eax, &ebx, &ecx, &edx);
-
- return ecx;
-}
-
-static inline unsigned int cpuid_edx(unsigned int op)
-{
- unsigned int eax, ebx, ecx, edx;
-
- cpuid(op, &eax, &ebx, &ecx, &edx);
-
- return edx;
-}
+unsigned int cpuid_eax(unsigned int op);
+unsigned int cpuid_ebx(unsigned int op);
+unsigned int cpuid_ecx(unsigned int op);
+unsigned int cpuid_edx(unsigned int op);

/* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
static inline void rep_nop(void)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 2346c95..1d2e270 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -307,6 +307,49 @@ static __always_inline void setup_smap(struct cpuinfo_x86 *c)
}

/*
+ * CPUID functions returning a single datum
+ */
+unsigned int cpuid_eax(unsigned int op)
+{
+ unsigned int eax, ebx, ecx, edx;
+
+ cpuid(op, &eax, &ebx, &ecx, &edx);
+
+ return eax;
+}
+EXPORT_SYMBOL(cpuid_eax);
+
+unsigned int cpuid_ebx(unsigned int op)
+{
+ unsigned int eax, ebx, ecx, edx;
+
+ cpuid(op, &eax, &ebx, &ecx, &edx);
+
+ return ebx;
+}
+EXPORT_SYMBOL(cpuid_ebx);
+
+unsigned int cpuid_ecx(unsigned int op)
+{
+ unsigned int eax, ebx, ecx, edx;
+
+ cpuid(op, &eax, &ebx, &ecx, &edx);
+
+ return ecx;
+}
+EXPORT_SYMBOL(cpuid_ecx);
+
+unsigned int cpuid_edx(unsigned int op)
+{
+ unsigned int eax, ebx, ecx, edx;
+
+ cpuid(op, &eax, &ebx, &ecx, &edx);
+
+ return edx;
+}
+EXPORT_SYMBOL(cpuid_edx);
+
+/*
* Some CPU features depend on higher CPUID levels, which may not always
* be available due to CPUID level capping or broken virtualization
* software. Add those features to this table to auto-disable them.
--
1.8.1.4


2015-05-06 19:00:30

by H. Peter Anvin

Subject: Re: [PATCH] x86: Deinline cpuid_eax and friends

On 05/06/2015 10:07 AM, Denys Vlasenko wrote:
> cpuid_e{a,b,c,d}x() functions compile to 44 bytes of machine code each.
> On x86 allyesconfig build they have 48 callsites.
> Deinlining all four of them shrinks kernel by about 1k:
>
>     text     data      bss       dec     hex filename
> 82434909 22255384 20627456 125317749 7783275 vmlinux.before
> 82433898 22255384 20627456 125316738 7782e82 vmlinux
>
> Speed impact: CPUID instruction takes from 50 to 350+ cycles,
> call overhead is negligible in comparison.

How on Earth does it make 44 bytes? Is this due to paravirt_fail?

-hpa

2015-05-06 19:10:33

by Denys Vlasenko

Subject: Re: [PATCH] x86: Deinline cpuid_eax and friends

On 05/06/2015 08:59 PM, H. Peter Anvin wrote:
> On 05/06/2015 10:07 AM, Denys Vlasenko wrote:
>> cpuid_e{a,b,c,d}x() functions compile to 44 bytes of machine code each.
>> On x86 allyesconfig build they have 48 callsites.
>> Deinlining all four of them shrinks kernel by about 1k:
>>
>>     text     data      bss       dec     hex filename
>> 82434909 22255384 20627456 125317749 7783275 vmlinux.before
>> 82433898 22255384 20627456 125316738 7782e82 vmlinux
>>
>> Speed impact: CPUID instruction takes from 50 to 350+ cycles,
>> call overhead is negligible in comparison.
>
> How on Earth does it make 44 bytes? Is this due to paravirt_fail?

No, just this construct

unsigned int eax, ebx, ecx, edx;
cpuid(op, &eax, &ebx, &ecx, &edx);

is not really that cheap to set up. You need to allocate
variables on stack and take address of each:

ffffffff81063668 <cpuid_eax>:
ffffffff81063668: 55 push %rbp
ffffffff81063669: 48 89 e5 mov %rsp,%rbp
ffffffff8106366c: 48 83 ec 10 sub $0x10,%rsp
ffffffff81063670: 48 8d 4d fc lea -0x4(%rbp),%rcx
ffffffff81063674: 89 7d f0 mov %edi,-0x10(%rbp)
ffffffff81063677: 48 8d 55 f8 lea -0x8(%rbp),%rdx
ffffffff8106367b: 48 8d 75 f4 lea -0xc(%rbp),%rsi
ffffffff8106367f: 48 8d 7d f0 lea -0x10(%rbp),%rdi
ffffffff81063683: c7 45 f8 00 00 00 00 movl $0x0,-0x8(%rbp)
ffffffff8106368a: e8 3c ff ff ff callq ffffffff810635cb <__cpuid>
ffffffff8106368f: 8b 45 f0 mov -0x10(%rbp),%eax
ffffffff81063692: c9 leaveq
ffffffff81063693: c3 retq

--
vda

2015-05-06 20:42:23

by H. Peter Anvin

Subject: Re: [PATCH] x86: Deinline cpuid_eax and friends

On 05/06/2015 12:09 PM, Denys Vlasenko wrote:
>>
>> How on Earth does it make 44 bytes? Is this due to paravirt_fail?
>
> No, just this construct
>
> unsigned int eax, ebx, ecx, edx;
> cpuid(op, &eax, &ebx, &ecx, &edx);
>
> is not really that cheap to set up. You need to allocate
> variables on stack and take address of each:
>
> ffffffff81063668 <cpuid_eax>:
> ffffffff81063668: 55 push %rbp
> ffffffff81063669: 48 89 e5 mov %rsp,%rbp
> ffffffff8106366c: 48 83 ec 10 sub $0x10,%rsp
> ffffffff81063670: 48 8d 4d fc lea -0x4(%rbp),%rcx
> ffffffff81063674: 89 7d f0 mov %edi,-0x10(%rbp)
> ffffffff81063677: 48 8d 55 f8 lea -0x8(%rbp),%rdx
> ffffffff8106367b: 48 8d 75 f4 lea -0xc(%rbp),%rsi
> ffffffff8106367f: 48 8d 7d f0 lea -0x10(%rbp),%rdi
> ffffffff81063683: c7 45 f8 00 00 00 00 movl $0x0,-0x8(%rbp)
> ffffffff8106368a: e8 3c ff ff ff callq ffffffff810635cb <__cpuid>
> ffffffff8106368f: 8b 45 f0 mov -0x10(%rbp),%eax
> ffffffff81063692: c9 leaveq
> ffffffff81063693: c3 retq
>

That almost certainly is due to paravirt_fail, because otherwise cpuid
would be inline, and gcc actually knows how to optimize around the cpuid
instruction to the point of eliminating the temporaries.

That being said, it would have been better to use a structure.

-hpa
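
[Editorial sketch, not part of the original message.] The structure-based
interface suggested above might look roughly like the following. The names
cpuid_result, cpuid_available(), cpuid_read() and cpuid_result_eax() are
hypothetical and do not come from the patch. The inline asm is modelled on
the usual GCC-style CPUID constraints and is meant for a userspace build; on
a non-x86 target the asm is compiled out and the struct is returned with
only its input values set.

```c
/* Sketch only: all names here are hypothetical, not kernel API. */
struct cpuid_result {
	unsigned int eax, ebx, ecx, edx;
};

/* 1 when built for x86, where the CPUID instruction exists; 0 otherwise. */
static inline int cpuid_available(void)
{
#if defined(__i386__) || defined(__x86_64__)
	return 1;
#else
	return 0;
#endif
}

/*
 * Execute CPUID for leaf 'op' and subleaf 'count' and return all four
 * registers by value, so callers never take the address of a temporary.
 */
static inline struct cpuid_result cpuid_read(unsigned int op, unsigned int count)
{
	struct cpuid_result r = { .eax = op, .ebx = 0, .ecx = count, .edx = 0 };

#if defined(__i386__) || defined(__x86_64__)
	__asm__ __volatile__("cpuid"
			     : "+a" (r.eax), "=b" (r.ebx), "+c" (r.ecx), "=d" (r.edx)
			     :
			     : "memory");
#endif
	/* On non-x86 builds r still holds only the inputs (op, 0, count, 0). */
	return r;
}

/* Single-register helper in the style of cpuid_eax(). */
static inline unsigned int cpuid_result_eax(unsigned int op)
{
	return cpuid_read(op, 0).eax;
}
```

Returning the struct by value avoids handing four pointers to the callee.
Whether that actually produces smaller code when the underlying CPUID
primitive is an out-of-line paravirt call would have to be measured.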

2015-05-07 08:57:56

by Denys Vlasenko

Subject: Re: [PATCH] x86: Deinline cpuid_eax and friends

On 05/06/2015 10:41 PM, H. Peter Anvin wrote:
> On 05/06/2015 12:09 PM, Denys Vlasenko wrote:
>>>
>>> How on Earth does it make 44 bytes? Is this due to paravirt_fail?
>>
>> No, just this construct
>>
>> unsigned int eax, ebx, ecx, edx;
>> cpuid(op, &eax, &ebx, &ecx, &edx);
>>
>> is not really that cheap to set up. You need to allocate
>> variables on stack and take address of each:
>>
>> ffffffff81063668 <cpuid_eax>:
>> ffffffff81063668: 55 push %rbp
>> ffffffff81063669: 48 89 e5 mov %rsp,%rbp
>> ffffffff8106366c: 48 83 ec 10 sub $0x10,%rsp
>> ffffffff81063670: 48 8d 4d fc lea -0x4(%rbp),%rcx
>> ffffffff81063674: 89 7d f0 mov %edi,-0x10(%rbp)
>> ffffffff81063677: 48 8d 55 f8 lea -0x8(%rbp),%rdx
>> ffffffff8106367b: 48 8d 75 f4 lea -0xc(%rbp),%rsi
>> ffffffff8106367f: 48 8d 7d f0 lea -0x10(%rbp),%rdi
>> ffffffff81063683: c7 45 f8 00 00 00 00 movl $0x0,-0x8(%rbp)
>> ffffffff8106368a: e8 3c ff ff ff callq ffffffff810635cb <__cpuid>
>> ffffffff8106368f: 8b 45 f0 mov -0x10(%rbp),%eax
>> ffffffff81063692: c9 leaveq
>> ffffffff81063693: c3 retq
>>
>
> That almost certainly is due to paravirt_fail, because otherwise cpuid
> would be inline, and gcc actually knows how to optimize around the cpuid
> instruction to the point of eliminating the temporaries.

Yes, with HYPERVISOR_GUEST off cpuid_eax() is smaller:

ffffffff81055a66 <cpuid_eax>:
ffffffff81055a66: 55 push %rbp
ffffffff81055a67: 89 f8 mov %edi,%eax
ffffffff81055a69: 31 c9 xor %ecx,%ecx
ffffffff81055a6b: 48 89 e5 mov %rsp,%rbp
ffffffff81055a6e: 53 push %rbx
ffffffff81055a6f: 0f a2 cpuid
ffffffff81055a71: 5b pop %rbx
ffffffff81055a72: 5d pop %rbp
ffffffff81055a73: c3 retq

However, even this smaller version is not small enough for deinlining
to make vmlinux grow:

    text     data      bss       dec     hex filename
81746530 13978160 20066304 115790994 6e6d492 vmlinux.before
81746509 13978160 20066304 115790973 6e6d47d vmlinux

To recap: with this patch
Code is smaller with and without HYPERVISOR_GUEST.
Slowdown per cpuid_REG() call is at worst 4%.
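
[Editorial sketch, not part of the original message.] The size figures in
this thread can be checked with a few lines of C. The helper names
text_saved() and overhead_percent() are hypothetical. The thread does not
say how the 4% figure was derived; the 2-cycle call/return cost used in the
comment below is purely an assumption that happens to reproduce it against
the 50-cycle lower bound quoted for CPUID.

```c
/* Bytes of .text saved by the patch, from two `size` readings. */
static long text_saved(long before, long after)
{
	return before - after;
}

/*
 * Relative cost of an extra call/return, in percent of one CPUID.
 * Both cycle counts are inputs; the thread only gives the CPUID range
 * (50 to 350+ cycles), not the call overhead.
 */
static double overhead_percent(double call_cycles, double cpuid_cycles)
{
	return 100.0 * call_cycles / cpuid_cycles;
}

/*
 * With the numbers from the thread:
 *   text_saved(82434909, 82433898) is 1011 bytes (HYPERVISOR_GUEST on)
 *   text_saved(81746530, 81746509) is   21 bytes (HYPERVISOR_GUEST off)
 * Under the ASSUMED 2-cycle call cost:
 *   overhead_percent(2.0, 50.0) is 4.0
 */
```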