From: Andrew Lutomirski
Date: Tue, 22 Apr 2014 09:03:55 -0700
Subject: Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
To: Borislav Petkov
Cc: H. Peter Anvin, Linux Kernel Mailing List, Linus Torvalds,
 Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
 Boris Ostrovsky, Arjan van de Ven, Brian Gerst, Alexandre Julliard,
 Andi Kleen, Thomas Gleixner
In-Reply-To: <20140422144659.GF15882@pd.tnic>
References: <5355A9E9.9070102@zytor.com>
 <1dbe8155-58da-45c2-9dc0-d9f4b5a6e643@email.android.com>
 <20140422112312.GB15882@pd.tnic>
 <20140422144659.GF15882@pd.tnic>

On Tue, Apr 22, 2014 at 7:46 AM, Borislav Petkov wrote:
> On Tue, Apr 22, 2014 at 01:23:12PM +0200, Borislav Petkov wrote:
>> I wonder if it would be workable to use a bit in the espfix PGD to
>> denote that it has been initialized already... I hear, near NX there's
>> some room :-)
>
> Ok, I realized this won't work when I hit send... Oh well.
>
> Anyway, another dumb idea: have we considered making this lazy? I.e.,
> preallocate pages to fit the stack of NR_CPUS after smp init is done but
> not setup the percpu espfix stack. Only do that in espfix_fix_stack the
> first time we land there and haven't been setup yet on this cpu.
>
> This should cover the 1% out there who still use 16-bit segments and the
> rest simply doesn't use it and get to save themselves the PT-walk in
> start_secondary().
>
> Hmmm...

I'm going to try to do the math to see what's actually going on.

Each 4G slice contains 64kB of ministacks, which corresponds to 1024
ministacks.

Virtual addresses are divided up as:

  12 bits (0..11):  offset within the page
   9 bits (12..20): identifies the PTE within the level 1 directory
                    (the page table)
   9 bits (21..29): identifies the level 1 directory within the level 2
                    directory (the pmd)
   9 bits (30..38): identifies the level 2 directory within the level 3
                    directory (the pud)

Critically, each group of 1024 CPUs can share the same level 1 directory --
it just contains a bunch of copies of the same thing.  Similarly, they can
share the same level 2 directory, and each slot in that directory will
point to the same level 1 directory.  For the level 3 directory, there is
only one globally; it needs 8 entries per 1024 CPUs.

I imagine there's a scalability problem here, too: it's okay if each of a
very large number of CPUs waits while the shared structures are allocated,
but owners of big systems won't like it if they all serialize on the way
out.

So maybe it would make sense to refactor this into two separate functions.
First, before we start the first non-boot CPU, declare:

  static pte_t *slice_pte_tables[NR_CPUS / 1024];

and allocate and initialize them all.  It might even make sense to do this
at build time instead of run time.  I can't imagine that parallelizing this
would provide any benefit unless it were done *very* carefully and there
were hundreds of thousands of CPUs.  At worst, we're wasting 4 bytes per
CPU not present.
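For that first half, something along these lines is roughly what I have in
mind (just a sketch, not code from the prototype: the function name and the
ESPFIX_CPUS_PER_SLICE constant are made up, and I'm leaving out the part
that wires the tables into the shared pmd/pud):

#include <linux/bug.h>
#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/threads.h>
#include <asm/pgtable.h>

/* One level 1 (pte) table per 4G slice, i.e. per 1024 possible CPUs. */
#define ESPFIX_CPUS_PER_SLICE	1024

static pte_t *slice_pte_tables[DIV_ROUND_UP(NR_CPUS, ESPFIX_CPUS_PER_SLICE)];

/* Runs once on the boot CPU, before any secondary CPU is brought up. */
static void __init espfix_preallocate_slices(void)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(slice_pte_tables); i++) {
		/* Zeroed page; the ministack PTEs get filled in lazily. */
		slice_pte_tables[i] =
			(pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
		BUG_ON(!slice_pte_tables[i]);
	}
}

That loop is tiny even for silly values of NR_CPUS, which is why I don't
think parallelizing it buys anything.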
Then, for the per-CPU part, have one init-once structure (please tell me
the kernel has one of these) per 64 possible CPUs.  Each CPU will make sure
that its group of 64 CPUs is initialized, using the init-once mechanism,
and then it will set its percpu variable accordingly.  There are only 64
CPUs per group, so mutexes may not be so bad here.

--Andy
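P.S.  For the init-once part, this is roughly the shape I'm picturing
(again just a sketch: I'm open-coding "init once" with a mutex and a flag,
the espfix_map_group()/espfix_stack_for() helpers are invented
placeholders, and each group's mutex would still need a mutex_init() at
boot):

#include <linux/kernel.h>
#include <linux/mutex.h>
#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/threads.h>
#include <linux/types.h>

#define ESPFIX_CPUS_PER_GROUP	64

struct espfix_group {
	bool		initialized;
	struct mutex	lock;
};

static struct espfix_group espfix_groups[DIV_ROUND_UP(NR_CPUS,
						       ESPFIX_CPUS_PER_GROUP)];
static DEFINE_PER_CPU(void *, espfix_stack);

static void espfix_map_group(int group);	/* invented: maps 64 ministacks */
static void *espfix_stack_for(int cpu);		/* invented: this CPU's pointer */

void espfix_init_cpu(void)
{
	int cpu = smp_processor_id();
	struct espfix_group *g = &espfix_groups[cpu / ESPFIX_CPUS_PER_GROUP];

	mutex_lock(&g->lock);
	if (!g->initialized) {
		/* First CPU in the group to get here does the shared setup. */
		espfix_map_group(cpu / ESPFIX_CPUS_PER_GROUP);
		g->initialized = true;
	}
	mutex_unlock(&g->lock);

	/* Each CPU still points its own percpu variable at its ministack. */
	this_cpu_write(espfix_stack, espfix_stack_for(cpu));
}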