Date: Tue, 7 Apr 2020 12:33:32 -0400 (EDT)
From: Mikulas Patocka
To: Andy Lutomirski
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, "H. Peter Anvin",
    Peter Zijlstra, x86@kernel.org, Dan Williams,
    Linux Kernel Mailing List, dm-devel@redhat.com
Subject: Re: [PATCH] memcpy_flushcache: use cache flushing for larger lengths
In-Reply-To: <583AD128-5B10-4414-A35B-FEACF30B7C5A@amacapital.net>
References: <583AD128-5B10-4414-A35B-FEACF30B7C5A@amacapital.net>
On Tue, 7 Apr 2020, Andy Lutomirski wrote:

> 
> > On Apr 7, 2020, at 8:01 AM, Mikulas Patocka wrote:
> > 
> > [ resending this to x86 maintainers ]
> > 
> > Hi
> > 
> > I tested the performance of various methods of writing to Optane-based
> > persistent memory and found that non-temporal stores achieve a
> > throughput of 1.3 GB/s, while 8 cached stores immediately followed by
> > clflushopt or clwb achieve a throughput of 1.6 GB/s.
> > 
> > memcpy_flushcache uses non-temporal stores; I modified it to use cached
> > stores + clflushopt, and that improved the performance of the
> > dm-writecache target significantly:
> > 
> > dm-writecache throughput:
> > (dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct)
> > writecache block size      512        1024       2048       4096
> > movnti                     496 MB/s   642 MB/s   725 MB/s   744 MB/s
> > clflushopt                 373 MB/s   688 MB/s   1.1 GB/s   1.2 GB/s
> > 
> > For block size 512, movnti works better; for larger block sizes,
> > clflushopt is better.
> > 
> > I was also testing the novafs filesystem; it is not upstream, but it
> > benefited from a similar change in __memcpy_flushcache and
> > __copy_user_nocache:
> > write throughput on big files   - movnti: 662 MB/s, clwb: 1323 MB/s
> > write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s
> > 
> > 
> > I submit this patch for __memcpy_flushcache that improves dm-writecache
> > performance.
> > 
> > Other ideas - should we introduce memcpy_to_pmem instead of modifying
> > memcpy_flushcache, and move this logic there? Or should I modify the
> > dm-writecache target directly to use clflushopt with no change to the
> > architecture-specific code?
> > 
> > Mikulas
> > 
> > 
> > From: Mikulas Patocka
> > 
> > I tested dm-writecache performance on a machine with an Optane nvdimm,
> > and it turned out that for larger writes, cached stores + cache
> > flushing perform better than non-temporal stores. This is the
> > throughput of dm-writecache measured with this command:
> > dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct
> > 
> > block size       512        1024       2048       4096
> > movnti           496 MB/s   642 MB/s   725 MB/s   744 MB/s
> > clflushopt       373 MB/s   688 MB/s   1.1 GB/s   1.2 GB/s
> > 
> > We can see that for smaller blocks movnti performs better, but for
> > larger blocks clflushopt has better performance.
> > 
> > This patch changes the function __memcpy_flushcache accordingly, so
> > that for size >= 768 it performs cached stores and cache flushing.
> > Note that we must not take the new branch if the CPU doesn't have
> > clflushopt - in that case, the kernel would use the inefficient
> > "clflush" instruction, which has very bad performance.
> > 
> > Signed-off-by: Mikulas Patocka
> > 
> > ---
> >  arch/x86/lib/usercopy_64.c |   36 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 36 insertions(+)
> > 
> > Index: linux-2.6/arch/x86/lib/usercopy_64.c
> > ===================================================================
> > --- linux-2.6.orig/arch/x86/lib/usercopy_64.c	2020-03-24 15:15:36.644945091 -0400
> > +++ linux-2.6/arch/x86/lib/usercopy_64.c	2020-03-30 07:17:51.450290007 -0400
> > @@ -152,6 +152,42 @@ void __memcpy_flushcache(void *_dst, con
> >  		return;
> >  	}
> > 
> > +	if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && size >= 768 && likely(boot_cpu_data.x86_clflush_size == 64)) {
> > +		while (!IS_ALIGNED(dest, 64)) {
> > +			asm("movq (%0), %%r8\n"
> > +			    "movnti %%r8, (%1)\n"
> > +			    :: "r" (source), "r" (dest)
> > +			    : "memory", "r8");
> > +			dest += 8;
> > +			source += 8;
> > +			size -= 8;
> > +		}
> > +		do {
> > +			asm("movq (%0), %%r8\n"
> > +			    "movq 8(%0), %%r9\n"
> > +			    "movq 16(%0), %%r10\n"
> > +			    "movq 24(%0), %%r11\n"
> > +			    "movq %%r8, (%1)\n"
> > +			    "movq %%r9, 8(%1)\n"
> > +			    "movq %%r10, 16(%1)\n"
> > +			    "movq %%r11, 24(%1)\n"
> > +			    "movq 32(%0), %%r8\n"
> > +			    "movq 40(%0), %%r9\n"
> > +			    "movq 48(%0), %%r10\n"
> > +			    "movq 56(%0), %%r11\n"
> > +			    "movq %%r8, 32(%1)\n"
> > +			    "movq %%r9, 40(%1)\n"
> > +			    "movq %%r10, 48(%1)\n"
> > +			    "movq %%r11, 56(%1)\n"
> > +			    :: "r" (source), "r" (dest)
> > +			    : "memory", "r8", "r9", "r10", "r11");
> 
> Does this actually work better than the corresponding C code?
> 
> Also, that memory clobber probably isn't doing your code generation any
> favors. Experimentally, you have the constraints wrong. An "r"

The existing "movnti" loop uses exactly the same constraints (and the
"memory" clobber).
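[ Editor's note: the control flow of the new branch - copy the unaligned
head with 8-byte stores, then copy whole 64-byte cache lines and flush
each one - can be modelled in plain userspace C. In this sketch,
`flush_line` is a hypothetical stub standing in for clflushopt; it only
counts the flushes the real code would issue. ]

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>
#include <assert.h>

static unsigned long flushes;	/* cache-line flushes the real code would issue */

/* Hypothetical stand-in for the clflushopt instruction. */
static void flush_line(void *p)
{
	(void)p;
	flushes++;
}

/*
 * Simplified model of the proposed fast path: 8-byte copies until the
 * destination is 64-byte aligned, then 64-byte chunks with one flush
 * per cache line, then a byte-wise tail. Assumes size is large, as the
 * kernel's "size >= 768" guard ensures.
 */
static void copy_flush(char *dst, const char *src, size_t size)
{
	while ((uintptr_t)dst & 63) {	/* align destination to a cache line */
		memcpy(dst, src, 8);
		dst += 8; src += 8; size -= 8;
	}
	while (size >= 64) {		/* aligned 64-byte chunks */
		memcpy(dst, src, 64);
		flush_line(dst);
		dst += 64; src += 64; size -= 64;
	}
	while (size) {			/* remaining tail */
		*dst++ = *src++;
		size--;
	}
}
```

A 200-byte copy starting 8 bytes past a cache-line boundary would thus do
7 head stores, 2 flushed 64-byte chunks, and a 16-byte tail.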
> constraint doesn't tell GCC that you are dereferencing the pointer.
> You need to use "m" with a correctly-sized type.

But you would use

	"=m"(*(char *)dest), "=m"(*((char *)dest + 8)), "=m"(*((char *)dest + 16))...

and so on, until you run out of argument numbers.

> But I bet plain C is at least as good.

I tried to replace it with

	memcpy((void *)dest, (void *)src, 64);

The compiler inlined the memcpy function into 8 loads and 8 stores.
However, the whole function __memcpy_flushcache consumed one more saved
register and the machine code was a few bytes longer.

Mikulas
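[ Editor's note: for illustration, a single movnti store written with a
properly sized "m" output operand, as suggested above, might look like
the sketch below. This is a userspace approximation, not kernel code;
the non-x86 fallback is a plain store so the snippet stays portable. ]

```c
#include <stdint.h>

/*
 * Store one 64-bit value with a non-temporal hint. The "=m" output
 * operand tells the compiler exactly which memory the asm writes, so
 * no blanket "memory" clobber is needed for this single location.
 */
static void movnti_store(uint64_t *dst, uint64_t val)
{
#if defined(__x86_64__)
	__asm__ __volatile__("movnti %1, %0"
			     : "=m" (*dst)	/* correctly-sized memory operand */
			     : "r" (val));
#else
	*dst = val;	/* portable fallback for non-x86 builds */
#endif
}
```

Extending this to a full 64-byte block is where the approach becomes
unwieldy: each of the eight stores needs its own "=m" operand, which is
the operand-number explosion described above.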