Date: Fri, 17 Jul 2015 09:45:55 +0200
From: Ingo Molnar <mingo@kernel.org>
To: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-kernel@vger.kernel.org, Andy Lutomirski <luto@amacapital.net>,
        Borislav Petkov <bp@alien8.de>, Fenghua Yu <fenghua.yu@intel.com>,
        "H. Peter Anvin" <hpa@zytor.com>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Oleg Nesterov <oleg@redhat.com>, Thomas Gleixner <tglx@linutronix.de>,
        Ross Zwisler <ross.zwisler@linux.intel.com>
Subject: Re: [REGRESSION] 4.2-rc2: early boot memory corruption from FPU
 rework
Message-ID: <20150717074555.GA31873@gmail.com>
References: <1430848300-27877-1-git-send-email-mingo@kernel.org>
 <1430848300-27877-19-git-send-email-mingo@kernel.org>
 <55A56709.6020201@linux.intel.com>
 <20150715110726.GA26611@gmail.com>
 <55A6FC31.5010102@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <55A6FC31.5010102@linux.intel.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 6769
Lines: 156


* Dave Hansen <dave.hansen@linux.intel.com> wrote:

> On 07/15/2015 04:07 AM, Ingo Molnar wrote:
> > * Dave Hansen <dave.hansen@linux.intel.com> wrote:
> >>>  	/*
> >>> -	 * Setup init_xstate_buf to represent the init state of
> >>> +	 * Setup init_xstate_ctx to represent the init state of
> >>>  	 * all the features managed by the xsave
> >>>  	 */
> >>> -	init_xstate_buf = alloc_bootmem_align(xstate_size,
> >>> -					      __alignof__(struct xsave_struct));
> >>> -	fx_finit(&init_xstate_buf->i387);
> >>> +	fx_finit(&init_xstate_ctx.i387);
> >>
> >> This is causing memory corruption in 4.2-rc2.
> ...
> >> This patch works around the problem, btw:
> >>
> >> 	https://www.sr71.net/~dave/intel/bloat-xsave-gunk-2.patch
> > 
> > Yeah, so I got this prototype hardware boot crash reported in private mail and 
> > decoded it and after some debugging I suggested the +PAGE_SIZE hack - possibly you 
> > got that hack from the same person?
> 
> Nope, I came up with that gem of a patch all on my own.

:)

> I also wouldn't characterize this as prototype hardware.  There are obviously 
> plenty of folks depending on mainline to boot and function on hardware that has 
> AVX-512 support.  That's why two different Intel folks came to you 
> independently.

Yeah, so I treat it as a regression even if it's unreleased hw, what matters to 
regressions is number of people affected, plus that the kernel should work for a 
reasonable set of future hardware as well, without much trouble.

Just curious: does any released hardware have AVX-512? I went by Wikipedia, which 
seems to list pre-release hw:

  https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

    Intel
        Xeon Phi Knights Landing: AVX-512 F, CDI, PFI and ERI[1] in 2015[6]
        Xeon Skylake: AVX-512 F, CDI, VL, BW, and DQ[7] in 2015[8]
        Cannonlake (speculation)

> > My suggestion was to solve this properly: if we list xstate features as 
> > supported then we should size their max size correctly. The AVX bits are 
> > currently not properly enumerated and sized - and I refuse to add feature 
> > support to the kernel where per task CPU state fields that the kernel 
> > saves/restores are opaque...
> 
> We might know the size and composition of the individual components, but we do 
> not know the size of the buffer.  Different implementations of a given feature 
> are quite free to have different data stored in the buffer, or even to rearrange 
> or pad it.  That's why the sizes are not explicitly called out by the 
> architecture and why we enumerated them before your patch that caused this 
> regression.

But we _have_ to know their structure and layout of the XSAVE context for any 
reasonable ptrace and signal frame support. Can you set/get AVX-512 registers via 
ptrace? MPX state?

That's one of the reasons why I absolutely hate how this 'opaque per task CPU 
context blob' concept snuck into the x86 code via the XSAVE patches without proper 
enumeration of the data structures, sorry...

It makes it way too easy to 'support' CPU features without actually doing a good 
job of it - and in fact it makes certain reasonable things impossible or very, 
very hard, which makes me nervous.

But we'll fix the boot regression, no argument about that!

> The component itself may not be opaque, but the size of the *buffer* is not a 
> simple sum of the component sizes.  Here's a real-world example:
> 
> [    0.000000] x86/fpu: xstate_offset[2]: 0240, xstate_sizes[2]: 0100
> [    0.000000] x86/fpu: xstate_offset[3]: 03c0, xstate_sizes[3]: 0040
> 
> Notice that component 3 is not at 0x240+0x100.  This means our existing 
> init_xstate_size(), and why any attempt to staticlly-size the buffer is broken.
> 
> I understand why you were misled by it, but the old "xsave_hdr_struct" was 
> wrong.  Fenghua even posted patches to remove it before the FPU rework (you were 
> cc'd):
> 
> 	https://lkml.org/lkml/2015/4/18/164

Yeah, so I thought the worst bugs were fixed and that these would re-emerge on top 
of the new code.

Whether we have a static limit or not is orthogonal to the issue of sizing it 
properly - and the plan was to have a dynamic context area in any case.

> > So please add proper AVX512 support structures to fpu/types.h and size 
> > XSTATE_RESERVE correctly - or alternatively we can remove the current 
> > incomplete AVX512 bits.
> 
> The old code sized the buffer in a fully architectural way and it worked.  The 
> CPU *tells* you how much memory the 'xsave' instruction is going to scribble on.  
> The new code just merrily calls it and let it scribble away.  This is as 
> clear-cut a regression as I've ever seen.

This is a regression which we'll fix, but the 'old' dynamic code clearly did not 
work for a long time, I'm sure you still remember my attempt at addressing the 
worst fallout in:

  e88221c50cad ("x86/fpu: Disable XSAVES* support for now")

Those kinds of totally non-working aspects were what made me nervous about the 
opaque data structure aspect.

Because we can have dynamic sizing of the context area and non-opaque data 
structures.

> The least we can do is detect that the kernel undersized the buffer and disable 
> support for the features that do not fit.  A very lightly tested patch to do 
> that is attached.  I'm not super eager to put that in to an -rc2 kernel though.

Ok, this approach looks good to me as an interim fix. I'll give it a whirl on 
older hardware. I agree with you that it needs to be sized dynamically.

> This came out a lot more complicated than I would have liked.
> 
> Instead of simply enabling all of the XSAVE features that we both know about and 
> the CPU supports, we have to be careful to not overflow our buffer in 
> 'init_fpstate.xsave'.

Yeah, and this can be fixed separately and on top of your fix: my plan during the 
FPU rework was to move the context area to the end of task_struct and size it 
dynamically.

This needs some (very minor) changes to kernel/fork.c to allow an architecture to 
determine the full task_struct size dynamically - but looks very doable and clean. 
Wanna try this, or should I?

> To do this, we enable each XSAVE feature and then ask the CPU how large that 
> makes the buffer.  If we would overflow the buffer we allocated, we turn off the 
> feature.
> 
> This means that no matter what the CPU does, we will not corrupt random memory 
> like we do before this patch.  It also means that we can fall back in a way 
> which cripples the system the least.

Yes, agreed.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/