From: Andy Lutomirski
Subject: Re: [PATCH] memcpy_flushcache: use cache flusing for larger lengths
Date: Tue, 7 Apr 2020 09:09:30 -0700
Message-Id: <583AD128-5B10-4414-A35B-FEACF30B7C5A@amacapital.net>
To: Mikulas Patocka
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, "H. Peter Anvin",
 Peter Zijlstra, x86@kernel.org, Dan Williams,
 Linux Kernel Mailing List, dm-devel@redhat.com
List-ID: linux-kernel@vger.kernel.org

> On Apr 7, 2020, at 8:01 AM, Mikulas Patocka wrote:
>
> [ resending this to x86 maintainers ]
>
> Hi
>
> I tested the performance of various methods of writing to Optane-based
> persistent memory, and found that non-temporal stores achieve a
> throughput of 1.3 GB/s, while 8 cached stores immediately followed by
> clflushopt or clwb achieve a throughput of 1.6 GB/s.
>
> memcpy_flushcache uses non-temporal stores; I modified it to use cached
> stores + clflushopt, and it improved the performance of the dm-writecache
> target significantly:
>
> dm-writecache throughput:
> (dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct)
> writecache block size      512       1024      2048      4096
> movnti                     496 MB/s  642 MB/s  725 MB/s  744 MB/s
> clflushopt                 373 MB/s  688 MB/s  1.1 GB/s  1.2 GB/s
>
> For block size 512, movnti works better; for larger block sizes,
> clflushopt is better.
>
> I was also testing the novafs filesystem. It is not upstream, but it
> benefited from a similar change in __memcpy_flushcache and
> __copy_user_nocache:
> write throughput on big files   - movnti: 662 MB/s, clwb: 1323 MB/s
> write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s
>
>
> I submit this patch for __memcpy_flushcache that improves dm-writecache
> performance.
>
> Other ideas - should we introduce memcpy_to_pmem instead of modifying
> memcpy_flushcache and move this logic there? Or should I modify the
> dm-writecache target directly to use clflushopt with no change to the
> architecture-specific code?
>
> Mikulas
>
>
>
>
> From: Mikulas Patocka
>
> I tested dm-writecache performance on a machine with an Optane nvdimm and
> it turned out that for larger writes, cached stores + cache flushing
> perform better than non-temporal stores. This is the throughput of
> dm-writecache measured with this command:
> dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct
>
> block size    512       1024      2048      4096
> movnti        496 MB/s  642 MB/s  725 MB/s  744 MB/s
> clflushopt    373 MB/s  688 MB/s  1.1 GB/s  1.2 GB/s
>
> We can see that for smaller blocks, movnti performs better, but for larger
> blocks, clflushopt has better performance.
>
> This patch changes the function __memcpy_flushcache accordingly, so that
> with size >= 768 it performs cached stores and cache flushing.
> Note that
> we must not use the new branch if the CPU doesn't have clflushopt - in
> that case, the kernel would use the inefficient "clflush" instruction that
> has very bad performance.
>
> Signed-off-by: Mikulas Patocka
>
> ---
> arch/x86/lib/usercopy_64.c | 36 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
>
> Index: linux-2.6/arch/x86/lib/usercopy_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/lib/usercopy_64.c	2020-03-24 15:15:36.644945091 -0400
> +++ linux-2.6/arch/x86/lib/usercopy_64.c	2020-03-30 07:17:51.450290007 -0400
> @@ -152,6 +152,42 @@ void __memcpy_flushcache(void *_dst, con
>  		return;
>  	}
>
> +	if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && size >= 768 && likely(boot_cpu_data.x86_clflush_size == 64)) {
> +		while (!IS_ALIGNED(dest, 64)) {
> +			asm("movq (%0), %%r8\n"
> +			    "movnti %%r8, (%1)\n"
> +			    :: "r" (source), "r" (dest)
> +			    : "memory", "r8");
> +			dest += 8;
> +			source += 8;
> +			size -= 8;
> +		}
> +		do {
> +			asm("movq (%0), %%r8\n"
> +			    "movq 8(%0), %%r9\n"
> +			    "movq 16(%0), %%r10\n"
> +			    "movq 24(%0), %%r11\n"
> +			    "movq %%r8, (%1)\n"
> +			    "movq %%r9, 8(%1)\n"
> +			    "movq %%r10, 16(%1)\n"
> +			    "movq %%r11, 24(%1)\n"
> +			    "movq 32(%0), %%r8\n"
> +			    "movq 40(%0), %%r9\n"
> +			    "movq 48(%0), %%r10\n"
> +			    "movq 56(%0), %%r11\n"
> +			    "movq %%r8, 32(%1)\n"
> +			    "movq %%r9, 40(%1)\n"
> +			    "movq %%r10, 48(%1)\n"
> +			    "movq %%r11, 56(%1)\n"
> +			    :: "r" (source), "r" (dest)
> +			    : "memory", "r8", "r9", "r10", "r11");

Does this actually work better than the corresponding C code?

Also, that memory clobber probably isn't doing your code generation any
favors. Experimentally, you have the constraints wrong. An "r" constraint
doesn't tell GCC that you are dereferencing the pointer.
You need to use "m" with a correctly-sized type. But I bet plain C is at
least as good.

> +			clflushopt((void *)dest);
> +			dest += 64;
> +			source += 64;
> +			size -= 64;
> +		} while (size >= 64);
> +	}
> +
>  	/* 4x8 movnti loop */
>  	while (size >= 32) {
>  		asm("movq (%0), %%r8\n"
>