From: David Laight
To: 'David Howells', Linus Torvalds
CC: Borislav Petkov, kernel test robot, "oe-lkp@lists.linux.dev", "lkp@intel.com",
	"linux-kernel@vger.kernel.org", Christian Brauner, Alexander Viro,
	Jens Axboe, Christoph Hellwig, Matthew Wilcox, "ying.huang@intel.com",
	"feng.tang@intel.com", "fengwei.yin@intel.com"
Subject: RE: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression
Date: Mon, 20 Nov 2023 16:09:02 +0000
In-Reply-To: <2284219.1700487177@warthog.procyon.org.uk>

From: David Howells
> Sent: 20 November 2023 13:33
>
> Linus Torvalds wrote:
> >
> > So I don't think we should use either of these benchmarks as a "we
> > need to optimize for *this*", but it is another example of how much
> > memcpy() does matter. Even if the end result is then "but different
> > microarchitectures react so differently that we can't please
> > everybody".
>
> So what, if anything, should I change? Should I make it directly call
> __memcpy? Or should we just leave it to the compiler? I would prefer to
> leave memcpy_from_iter() and memcpy_to_iter() as __always_inline to
> eliminate the function pointer call we otherwise end up with and to
> eliminate the return value (which is always 0 in this case).

I'd have thought you'd just want to call memcpy() (or xxxx_memcpy()).
Anything that matters here is likely to make more difference elsewhere.

I wonder if the kernel ever uses the return value from memcpy().
I suspect it only exists for very historic reasons.
The wrapper:

#define memcpy(d, s, l) ({ \
	const void *dd = d; \
	memcpy_void(dd, s, l); \
	dd; \
})

would save all the asm implementations from saving the result.

I did some more measurements over the weekend.
A quick summary - I've not quite finished (and need to find some more
test systems - newer and amd).

I'm now thinking that the 5k clocks is a TLB miss. In any case it is a
feature of my test, not the instruction.
I'm also subtracting off a baseline that has 'nop; nop' rather than
'rep movsb'.

I'm not entirely certain about the fractional clocks!
I'm counting 10 operations and getting pretty consistent counts.
I suspect they are end effects.
These measurements are also for 4k aligned src and dest.

An ivy bridge i7-3xxx seems to do:
	0	41.4 clocks
	1-64	31.5 clocks
	65-128	44.3
	129-191	55.1
	192	47.4
	193-255	58.8
then an extra 3 clocks every 64 bytes.

Whereas kaby lake i7-7xxx does:
	0	51.5 clocks
	1-64	22.9
	65-95	25.3
	96	30.5
	97-127	34.1
	128	31.5
then an extra clock every 32 bytes (if dest aligned).
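A minimal user-space sketch of that wrapper idea (memcpy_void and
memcpy_ret are hypothetical names here, and the ({ ... }) statement
expression is a GCC/clang extension, not standard C):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical void-returning copy, standing in for an asm memcpy
 * implementation that no longer has to save/restore the return value. */
static void memcpy_void(void *d, const void *s, size_t l)
{
	char *dc = d;
	const char *sc = s;

	while (l--)
		*dc++ = *sc++;
}

/* Statement-expression wrapper: evaluates 'd' once, does the copy,
 * and yields the destination pointer, as memcpy() callers expect. */
#define memcpy_ret(d, s, l) ({		\
	void *dd = (d);			\
	memcpy_void(dd, (s), (l));	\
	dd;				\
})
```

So a caller that does use the return value, e.g.
`char *p = memcpy_ret(dst, src, n);`, still gets dst back, while the
underlying copy routine never has to produce it.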
Note that this system is somewhat slower if the destination is less
than (iirc) 48 bytes before the source (mod 4k).
There are several different slow speeds; the worst is about half the
speed.

I might be able to find a newer system with fsrm.
I was going to measure orig_memcpy() and also see what I can write.

Both those cpus can do a read and a write every clock, so a 64bit copy
loop can execute at 8 bytes/clock.
It should be possible to get a 2 clock loop copying 16 bytes, but that
will need a few instructions of setup.
You need to use negative offsets from the end so that only one register
is changed and the 'add' sets Z for the jump.
It can be written in C - but gcc will pessimise it for you.

You also need a conditional branch for short copies (< 16 bytes) that
could easily be mispredicted pretty much 50% of the time.
(IIRC there is no static prediction on recent x86 cpus.)
And probably a separate test for 0.
It is hard generating a sane clock count for short copies because the
mispredicted branches kill you.
Trouble is, any benchmark measurement is likely to train the branch
predictor.
It might actually be hard to reliably beat the ~20 clocks for
'rep movsb' on kaby lake.

This graph is from the fsrm patch:

[ASCII graph: time (cycles) for memmove() sizes 1..31 with neither
source nor destination in cache; 'memmove-orig' sits at roughly
1200-1500 cycles across the range while 'memmove-fsrm' stays at
roughly 450-600 cycles.]

I don't know what that was measured on.
600 clocks seems a lot - could be dominated by loading the cache.
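The negative-offset loop described above can be sketched in C roughly
as follows (copy16 is a hypothetical name; the short-copy and
zero-length branches are omitted, so len is assumed to be a non-zero
multiple of 16 - and, as noted, gcc may well pessimise the generated
loop):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy 'len' bytes, 16 per iteration, indexing from the end with a
 * negative offset: only 'i' changes each pass, and the add that
 * advances it is what sets the Z flag the loop branch tests.
 * Assumes len is a non-zero multiple of 16. */
static void copy16(void *dst, const void *src, size_t len)
{
	char *d = (char *)dst + len;
	const char *s = (const char *)src + len;
	ptrdiff_t i = -(ptrdiff_t)len;

	do {
		uint64_t a, b;

		memcpy(&a, s + i, 8);		/* two 8-byte reads... */
		memcpy(&b, s + i + 8, 8);
		memcpy(d + i, &a, 8);		/* ...two 8-byte writes */
		memcpy(d + i + 8, &b, 8);
		i += 16;			/* add; Z set when i reaches 0 */
	} while (i);
}
```

With a read and a write port each clock, the four 8-byte accesses per
iteration are what would let this run at two clocks per 16 bytes.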
I'd have thought short buffers are actually likely to be in the cache
and/or wanted in it.

There is also the lack of 'rep movsb' (erms) on various cpus.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)