Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1422640AbbGQHqF (ORCPT ); Fri, 17 Jul 2015 03:46:05 -0400 Received: from mail-wg0-f47.google.com ([74.125.82.47]:34608 "EHLO mail-wg0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752113AbbGQHqD (ORCPT ); Fri, 17 Jul 2015 03:46:03 -0400 Date: Fri, 17 Jul 2015 09:45:55 +0200 From: Ingo Molnar To: Dave Hansen Cc: linux-kernel@vger.kernel.org, Andy Lutomirski , Borislav Petkov , Fenghua Yu , "H. Peter Anvin" , Linus Torvalds , Oleg Nesterov , Thomas Gleixner , Ross Zwisler Subject: Re: [REGRESSION] 4.2-rc2: early boot memory corruption from FPU rework Message-ID: <20150717074555.GA31873@gmail.com> References: <1430848300-27877-1-git-send-email-mingo@kernel.org> <1430848300-27877-19-git-send-email-mingo@kernel.org> <55A56709.6020201@linux.intel.com> <20150715110726.GA26611@gmail.com> <55A6FC31.5010102@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <55A6FC31.5010102@linux.intel.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6769 Lines: 156 * Dave Hansen wrote: > On 07/15/2015 04:07 AM, Ingo Molnar wrote: > > * Dave Hansen wrote: > >>> /* > >>> - * Setup init_xstate_buf to represent the init state of > >>> + * Setup init_xstate_ctx to represent the init state of > >>> * all the features managed by the xsave > >>> */ > >>> - init_xstate_buf = alloc_bootmem_align(xstate_size, > >>> - __alignof__(struct xsave_struct)); > >>> - fx_finit(&init_xstate_buf->i387); > >>> + fx_finit(&init_xstate_ctx.i387); > >> > >> This is causing memory corruption in 4.2-rc2. > ... > >> This patch works around the problem, btw: > >> > >> https://www.sr71.net/~dave/intel/bloat-xsave-gunk-2.patch > > > > Yeah, so I got this prototype hardware boot crash reported in private mail and > > decoded it and after some debugging I suggested the +PAGE_SIZE hack - possibly you > > got that hack from the same person? > > Nope, I came up with that gem of a patch all on my own. :) > I also wouldn't characterize this as prototype hardware. There are obviously > plenty of folks depending on mainline to boot and function on hardware that has > AVX-512 support. That's why two different Intel folks came to you > independently. Yeah, so I treat it as a regression even if it's unreleased hw, what matters to regressions is number of people affected, plus that the kernel should work for a reasonable set of future hardware as well, without much trouble. Just curious: does any released hardware have AVX-512? I went by Wikipedia, which seems to list pre-release hw: https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 Intel Xeon Phi Knights Landing: AVX-512 F, CDI, PFI and ERI[1] in 2015[6] Xeon Skylake: AVX-512 F, CDI, VL, BW, and DQ[7] in 2015[8] Cannonlake (speculation) > > My suggestion was to solve this properly: if we list xstate features as > > supported then we should size their max size correctly. The AVX bits are > > currently not properly enumerated and sized - and I refuse to add feature > > support to the kernel where per task CPU state fields that the kernel > > saves/restores are opaque... > > We might know the size and composition of the individual components, but we do > not know the size of the buffer. Different implementations of a given feature > are quite free to have different data stored in the buffer, or even to rearrange > or pad it. That's why the sizes are not explicitly called out by the > architecture and why we enumerated them before your patch that caused this > regression. But we _have_ to know their structure and layout of the XSAVE context for any reasonable ptrace and signal frame support. Can you set/get AVX-512 registers via ptrace? MPX state? That's one of the reasons why I absolutely hate how this 'opaque per task CPU context blob' concept snuck into the x86 code via the XSAVE patches without proper enumeration of the data structures, sorry... It makes it way too easy to 'support' CPU features without actually doing a good job of it - and in fact it makes certain reasonable things impossible or very, very hard, which makes me nervous. But we'll fix the boot regression, no argument about that! > The component itself may not be opaque, but the size of the *buffer* is not a > simple sum of the component sizes. Here's a real-world example: > > [ 0.000000] x86/fpu: xstate_offset[2]: 0240, xstate_sizes[2]: 0100 > [ 0.000000] x86/fpu: xstate_offset[3]: 03c0, xstate_sizes[3]: 0040 > > Notice that component 3 is not at 0x240+0x100. This means our existing > init_xstate_size(), and why any attempt to staticlly-size the buffer is broken. > > I understand why you were misled by it, but the old "xsave_hdr_struct" was > wrong. Fenghua even posted patches to remove it before the FPU rework (you were > cc'd): > > https://lkml.org/lkml/2015/4/18/164 Yeah, so I thought the worst bugs were fixed and that these would re-emerge on top of the new code. Whether we have a static limit or not is orthogonal to the issue of sizing it properly - and the plan was to have a dynamic context area in any case. > > So please add proper AVX512 support structures to fpu/types.h and size > > XSTATE_RESERVE correctly - or alternatively we can remove the current > > incomplete AVX512 bits. > > The old code sized the buffer in a fully architectural way and it worked. The > CPU *tells* you how much memory the 'xsave' instruction is going to scribble on. > The new code just merrily calls it and let it scribble away. This is as > clear-cut a regression as I've ever seen. This is a regression which we'll fix, but the 'old' dynamic code clearly did not work for a long time, I'm sure you still remember my attempt at addressing the worst fallout in: e88221c50cad ("x86/fpu: Disable XSAVES* support for now") Those kinds of totally non-working aspects were what made me nervous about the opaque data structure aspect. Because we can have dynamic sizing of the context area and non-opaque data structures. > The least we can do is detect that the kernel undersized the buffer and disable > support for the features that do not fit. A very lightly tested patch to do > that is attached. I'm not super eager to put that in to an -rc2 kernel though. Ok, this approach looks good to me as an interim fix. I'll give it a whirl on older hardware. I agree with you that it needs to be sized dynamically. > This came out a lot more complicated than I would have liked. > > Instead of simply enabling all of the XSAVE features that we both know about and > the CPU supports, we have to be careful to not overflow our buffer in > 'init_fpstate.xsave'. Yeah, and this can be fixed separately and on top of your fix: my plan during the FPU rework was to move the context area to the end of task_struct and size it dynamically. This needs some (very minor) changes to kernel/fork.c to allow an architecture to determine the full task_struct size dynamically - but looks very doable and clean. Wanna try this, or should I? > To do this, we enable each XSAVE feature and then ask the CPU how large that > makes the buffer. If we would overflow the buffer we allocated, we turn off the > feature. > > This means that no matter what the CPU does, we will not corrupt random memory > like we do before this patch. It also means that we can fall back in a way > which cripples the system the least. Yes, agreed. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/