Date: Wed, 3 Apr 2024 20:35:29 -0500
From: Eric Biggers
To: David Laight
"linux-crypto@vger.kernel.org" , "x86@kernel.org" , "linux-kernel@vger.kernel.org" , Andy Lutomirski , "Chang S . Bae" Subject: Re: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs Message-ID: <20240404013529.GB24248@quark.localdomain> References: <20240326080305.402382-1-ebiggers@kernel.org> <20240326164755.GB1524@sol.localdomain> <6629b8120807458ab76e1968056f5e10@AcuMS.aculab.com> Precedence: bulk X-Mailing-List: linux-crypto@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <6629b8120807458ab76e1968056f5e10@AcuMS.aculab.com> Hi David, On Wed, Apr 03, 2024 at 08:12:09AM +0000, David Laight wrote: > From: Eric Biggers > > Sent: 26 March 2024 16:48 > .... > > Consider Intel Ice Lake for example, these are the AES-256-XTS encryption speeds > > on 4096-byte messages in MB/s I'm seeing: > > > > xts-aes-aesni 5136 > > xts-aes-aesni-avx 5366 > > xts-aes-vaes-avx2 9337 > > xts-aes-vaes-avx10_256 9876 > > xts-aes-vaes-avx10_512 10215 > > > > So yes, on that CPU the biggest boost comes just from VAES, staying on AVX2. > > But taking advantage of AVX512 does help a bit more, first from the parts other > > than 512-bit registers, then a bit more from 512-bit registers. > > How much does the kernel_fpu_begin() cost on real workloads? > (ie when the registers are live and it forces an extra save/restore) x86 Linux does lazy restore of the FPU state. The first kernel_fpu_begin() can have a significant cost, as it issues an XSAVE (or equivalent) instruction and causes an XRSTOR (or equivalent) instruction to be issued when returning to userspace when it otherwise might not be needed. Additional kernel_fpu_begin() / kernel_fpu_end() pairs without returning to userspace have only a small cost, as they don't cause any more saves or restores of the FPU state to be done. My new xts(aes) implementations have one kernel_fpu_begin() / kernel_fpu_end() pair per message (if the message doesn't span any page boundaries, which is almost always the case). That's exactly the same as the current xts-aes-aesni. I think what you may really be asking is how much the overhead of the XSAVE / XRSTOR pair associated with kernel-mode use of the FPU *increases* if the kernel clobbers AVX or AVX512 state, instead of just SSE state as xts-aes-aesni does. That's much more relevant to this patchset. I think the answer is that there is no additional overhead. This is because the XSAVE / XRSTOR pair happens regardless of the type of state the kernel clobbers, and it operates on the userspace state, not the kernel's. Some of the newer variants of XSAVE (XSAVEOPT and XSAVES) do have a "modified" optimization where they don't save parts of the state that are unmodified since the last XRSTOR; however, that is unimportant here because the kernel's FPU state is never saved. (This would change if x86 Linux were to support preemption of kernel-mode FPU code. In that case, we may need to take more care to minimize use of AVX and AVX512 state. That being said, AES-XTS tends to be used for bulk data anyway.) This is based on theory, though. I'll do a test to confirm that there's indeed no additional overhead. And also, even if there's no additional overhead, what the existing overhead actually is. > I've not looked at the code but I often see what looks like > excessive inlining in crypto code. 
I think what you may really be asking is how much the overhead of the
XSAVE / XRSTOR pair associated with kernel-mode use of the FPU *increases*
if the kernel clobbers AVX or AVX512 state, instead of just SSE state as
xts-aes-aesni does.  That's much more relevant to this patchset.

I think the answer is that there is no additional overhead, because the
XSAVE / XRSTOR pair happens regardless of the type of state the kernel
clobbers, and it operates on the userspace state, not the kernel's.  Some
of the newer variants of XSAVE (XSAVEOPT and XSAVES) do have a "modified"
optimization where they don't save parts of the state that are unmodified
since the last XRSTOR; however, that is unimportant here because the
kernel's FPU state is never saved.

(This would change if x86 Linux were to support preemption of kernel-mode
FPU code.  In that case, we might need to take more care to minimize use of
AVX and AVX512 state.  That being said, AES-XTS tends to be used for bulk
data anyway.)

This is based on theory, though, so I'll run a test to confirm that there's
indeed no additional overhead -- and to measure what the existing overhead
actually is.

> I've not looked at the code but I often see what looks like
> excessive inlining in crypto code.
> This will speed up benchmarks but can have a negative effect
> on real code, both because of the time taken to load the
> code and the effect of displacing other code.
>
> It might be that this code is a simple loop....

This is a different topic.  By "inlining" I assume you also mean things
like loop unrolling.  I totally agree that some of the crypto assembly code
goes way overboard on this, resulting in an unreasonably large machine code
size.  The AVX implementation of AES-GCM (aesni-intel_avx-x86_64.S), which
was written by Intel, is the worst offender by far, generating 256011 bytes
of machine code.  In OpenSSL, Intel has taken that even further, with their
VAES-optimized implementation of AES-GCM generating 696040 bytes of machine
code.

For my AES-XTS code I've limited the code size to a much more reasonable
level by focusing on the things that make the most difference.  My assembly
file compiles to 14386 bytes of machine code (less than 6% of the AES-GCM
figure).  It consists of encryption and decryption functions for each of
the four included implementations, plus the short function
aes_xts_encrypt_iv().  On any particular CPU model, only one implementation
is actually used, so at most 3500-4000 bytes are live at runtime.  Roughly
half of that handles messages whose length isn't a multiple of 256 bytes,
which rarely occur in practice; I've placed that code out of line to keep
it from polluting the CPU's instruction cache.

On the C side, in aesni-intel_glue.c, there are roughly 600 bytes of code
per implementation for the inlined fast path (half for encryption, half for
decryption), plus roughly 600 additional bytes, shared by all
implementations, for the rarely executed slow path of page-spanning
messages.

So in practice, just over 2 KB of AES-XTS code gets executed at runtime,
half for encryption and half for decryption.  That seems reasonable for
something as performance-critical as disk and file encryption.

There are changes that could make the code smaller, for example rolling up
the AES rounds, making encryption and decryption share more code, and
processing 1x wide instead of 4x wide.  We could also skip the AVX512
implementations and top out at VAES + AVX2.  These changes have problems,
though: either they straight up hurt performance on the CPUs I tested, or
they demand much more of the CPU (e.g. by relying far more heavily on the
branch predictor), and I was concerned about issues on untested or future
CPUs.

So I think my current proposal is at a reasonable place regarding compiled
code size, especially compared to the monstrosity that is some of the
existing crypto assembly code.  But let me know if there are any specific
choices I've made that you have a different opinion on.

- Eric
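P.S. Since the out-of-line handling of irregular messages came up above,
here is a rough sketch of that fast-path / slow-path split.  All of the
names below are illustrative stand-ins, not the actual aesni-intel_glue.c
code:

	#include <linux/compiler.h>	/* likely(), noinline */
	#include <linux/types.h>	/* u8, bool */

	/* Hypothetical helpers, for illustration only. */
	bool xts_msg_spans_page(const u8 *src, unsigned int len);
	void xts_crypt_fastpath(u8 *dst, const u8 *src, unsigned int len);

	/*
	 * Rare slow path (e.g. page-spanning messages); marked noinline so
	 * it stays out of the hot code and doesn't pollute the I-cache.
	 */
	static noinline void xts_crypt_slowpath(u8 *dst, const u8 *src,
						unsigned int len)
	{
		/* ... process the message in page-sized pieces ... */
	}

	static inline void xts_crypt(u8 *dst, const u8 *src, unsigned int len)
	{
		if (likely(!xts_msg_spans_page(src, len)))
			xts_crypt_fastpath(dst, src, len);	/* common case */
		else
			xts_crypt_slowpath(dst, src, len);
	}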