Date: Sat, 06 Jan 2024 17:09:09 -0800
From: "H. Peter Anvin"
To: David Laight, 'Linus Torvalds'
CC: Noah Goldstein, x86@kernel.org, oe-kbuild-all@lists.linux.dev, linux-kernel@vger.kernel.org, edumazet@google.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
Subject: RE: x86/csum: Remove unnecessary odd handling
In-Reply-To: <124b21857fe44e499e29800cbf4f63f8@AcuMS.aculab.com>
References: <20230920192300.3772199-1-goldstein.w.n@gmail.com> <202309231130.ZI5MdlDc-lkp@intel.com> <5354eeec562345f6a1de84f0b2081b75@AcuMS.aculab.com> <124b21857fe44e499e29800cbf4f63f8@AcuMS.aculab.com>
Message-ID: <4313F9BB-DE2E-448F-A366-A68CAEA2BFE0@zytor.com>

On January 6, 2024 2:08:48 PM PST, David Laight wrote:
>From: Linus Torvalds
>> Sent: 05 January 2024 18:06
>>
>> On Fri, 5 Jan 2024 at 02:41, David Laight wrote:
>> >
>> > Interesting, I'm pretty sure trying to get two blocks of
>> > 'adc' scheduled in parallel like that doesn't work.
>>
>> You should check out the benchmark at
>>
>> https://github.com/fenrus75/csum_partial
>>
>> and see if you can improve on it. I'm including the patch (on top of
>> that code by Arjan) to implement the actual current kernel version as
>> "New version".
>
>Annoyingly (for me) you are partially right...
>
>I found where my IP checksum perf code was hiding and revisited it.
>Although I found comments elsewhere that the 'jecxz, adc, adc, lea, jmp'
>loop did an adc every clock, it isn't happening for me now.
>
>I'm only measuring the inner loop for multiples of 64 bytes.
>The code for less than 8 bytes and for partial final words is a
>separate problem.
>The less unrolled the main loop, the less overhead there'll
>be for 'normal' sizes.
>So I've changed your '80 byte' block to 64 bytes for consistency.
>
>I'm ignoring pre-Sandy Bridge CPUs (no split flags) and pre-Broadwell
>(adc takes two clocks - although adc to alternate regs is one clock
>on Sandy Bridge).
>My test system is an i7-7700; I think anything from Broadwell (gen 5)
>will be at least as good.
>I don't have a modern AMD CPU.
>
>The best loop for 256+ bytes is an adcx/adox one.
>However, that requires the run-time patching.
>That is followed by the new kernel version (two blocks of 4 adc).
>The surprising one is:
>
>     xor   sum, sum
> 1:  adc   (buff), sum
>     adc   8(buff), sum
>     lea   16(buff), buff
>     dec   count
>     jnz   1b
>     adc   $0, sum
>
>For 256 bytes it is only a couple of clocks slower.
>Maybe 10% slower for 512+ bytes.
>But it needs almost no extra code for 'normal' buffer sizes.
>By comparison the adcx/adox one is 20% faster.
>
>The code is doing:
>
>     old = rdpmc
>     mfence
>     csum = do_csum(buf, len);
>     mfence
>     clocks = rdpmc - old
>
>(That is directly reading the PMC register.)
>With a 'no-op' function it takes 160 clocks (I-cache resident).
>Without the mfence it takes 40, but pretty much everything can execute
>after the 2nd rdpmc.
>
>I've attached my (horrid) test program.
>
> David

Rather than runtime patching, perhaps separate paths...
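A portable C model of the two loop shapes compared in the thread may make the dependency argument easier to follow. This is a sketch under stated assumptions, not the kernel's checksum code and not David's test program: the names add_eac(), load64(), csum_one_chain(), csum_two_chains() and csum_fold64() are hypothetical, lengths are assumed to be multiples of 16, and nothing guarantees a compiler turns add_eac() into an actual adc chain. add_eac() folds each carry back in immediately, whereas the asm loop defers it into the next adc and flushes the last one with "adc $0, sum"; both yield sums that are congruent modulo 2^64 - 1.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-bit ones'-complement add: the carry out of bit 63 is added back in.
 * (Hypothetical helper; models what adc + a final "adc $0" achieve.) */
static uint64_t add_eac(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;
	return s + (s < a);	/* s <= 2^64 - 2 after a wrap, so no 2nd carry */
}

/* Unaligned-safe native-endian 64-bit load. */
static uint64_t load64(const unsigned char *p)
{
	uint64_t w;
	memcpy(&w, p, sizeof w);
	return w;
}

/* One dependency chain, two words per iteration, like the
 * "adc (buff); adc 8(buff); lea; dec; jnz" loop.  Assumes len % 16 == 0. */
uint64_t csum_one_chain(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0;
	for (size_t i = 0; i < len; i += 16) {
		sum = add_eac(sum, load64(buf + i));
		sum = add_eac(sum, load64(buf + i + 8));
	}
	return sum;
}

/* Two independent chains ("two blocks of adc"): each accumulator only
 * depends on itself, so an out-of-order core may overlap the adds.
 * Assumes len % 16 == 0. */
uint64_t csum_two_chains(const unsigned char *buf, size_t len)
{
	uint64_t a = 0, b = 0;
	for (size_t i = 0; i < len; i += 16) {
		a = add_eac(a, load64(buf + i));
		b = add_eac(b, load64(buf + i + 8));
	}
	return add_eac(a, b);	/* merge the chains at the end */
}

/* Fold a 64-bit ones'-complement sum to 16 bits, end-around carry each step. */
uint16_t csum_fold64(uint64_t s)
{
	uint32_t lo32 = (uint32_t)s, hi32 = (uint32_t)(s >> 32);
	uint32_t t = lo32 + hi32;
	t += (t < lo32);			/* 64 -> 32 */
	uint32_t u = (t & 0xFFFF) + (t >> 16);
	u = (u & 0xFFFF) + (u >> 16);		/* 32 -> 16 */
	return (uint16_t)u;
}
```

The single-chain version serializes every add on one accumulator (and, in the asm, on CF), while the two-chain version exposes two independent chains at the cost of a merge step; which one wins on real hardware is exactly what the measurements quoted above are about.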