From: Paul Turner
Date: Mon, 8 Jan 2018 03:25:25 -0800
Subject: Re: [PATCH v6 00/10] Retpoline: Avoid speculative indirect calls in kernel
To: Andrew Cooper
Cc: David Woodhouse, Andi Kleen, LKML, Linus Torvalds, Greg Kroah-Hartman, Tim Chen, Dave Hansen, Thomas Gleixner, Kees Cook, Rik van Riel, Peter Zijlstra, Andy Lutomirski, Jiri Kosina, One Thousand Gnomes
In-Reply-To: <37ca6fc5-78d5-5750-051f-a712343d4a8f@citrix.com>
References: <1515363085-4219-1-git-send-email-dwmw@amazon.co.uk> <37ca6fc5-78d5-5750-051f-a712343d4a8f@citrix.com>

For Intel, the manuals state that the RSB has 16 entries -- 2.5.2.1.
Agner also reports 16 (presumably measured experimentally), e.g.
http://www.agner.org/optimize/microarchitecture.pdf [3.8]

For AMD it can be larger: for example, 32 entries on Fam17h (but 16
entries on Fam16h).

To future-proof a binary, or to cover a new AMD processor, 32 calls are
required. I would suggest tuning the count to the current CPU, which
saves cycles now and still covers the future case.
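As a rough sketch of that per-CPU tuning (illustrative only, not the kernel's code): the entry counts below are just the figures quoted in this thread, and every name here (cpu_vendor, rsb_entries, rsb_fill_loops, rsb_stack_bytes) is hypothetical. It assumes the two-calls-per-iteration loop from the quoted sequence and 8-byte return addresses on x86-64.

```c
/* Hypothetical vendor tags for this sketch; not kernel identifiers. */
enum cpu_vendor { VENDOR_INTEL, VENDOR_AMD, VENDOR_OTHER };

/*
 * RSB depth to fill, using only the figures quoted in this thread:
 * 16 on Intel, 16 on AMD Fam16h, 32 on AMD Fam17h.
 * Anything unrecognised gets the conservative 32.
 */
static unsigned int rsb_entries(enum cpu_vendor vendor, unsigned int family)
{
	if (vendor == VENDOR_INTEL)
		return 16;
	if (vendor == VENDOR_AMD && family == 0x16)
		return 16;
	return 32;	/* Fam17h, newer AMD parts, unknown CPUs */
}

/* Loop count for a fill sequence that issues two calls per iteration. */
static unsigned int rsb_fill_loops(unsigned int entries)
{
	return entries / 2;
}

/* Each call pushes an 8-byte return address that must be popped afterwards. */
static unsigned int rsb_stack_bytes(unsigned int entries)
{
	return entries * 8;
}
```

For the 16-entry case this gives 8 loops and a 128-byte stack adjustment, matching the "mov $8, %rax" and "add $(16*8), %rsp" in the quoted sequence; the 32-entry case would need 16 loops and 256 bytes.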

On Mon, Jan 8, 2018 at 3:16 AM, Andrew Cooper wrote:
> On 08/01/18 10:42, Paul Turner wrote:
>> A sequence for efficiently refilling the RSB is:
>>         mov $8, %rax;
>>         .align 16;
>>      3: call 4f;
>>     3p: pause; call 3p;
>>         .align 16;
>>      4: call 5f;
>>     4p: pause; call 4p;
>>         .align 16;
>>      5: dec %rax;
>>         jnz 3b;
>>         add $(16*8), %rsp;
>> This implementation uses 8 loops, with 2 calls per iteration. This is
>> marginally faster than a single call per iteration. We did not
>> observe useful benefit (particularly relative to text size) from
>> further unrolling. This may also be usefully split into smaller (e.g.
>> 4 or 8 call) segments where we can usefully pipeline/intermix with
>> other operations. It includes retpoline type traps so that if an
>> entry is consumed, it cannot lead to controlled speculation. On my
>> test system it took ~43 cycles on average. Note that non-zero
>> displacement calls should be used as these may be optimized to not
>> interact with the RSB due to their use in fetching RIP for 32-bit
>> relocations.
>
> Guidance from both Intel and AMD still states that 32 calls are required
> in general. Is your above code optimised for a specific processor which
> you know the RSB to be smaller on?
>
> ~Andrew