Date: Tue, 7 Apr 2020 11:01:33 -0400 (EDT)
From: Mikulas Patocka
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, "H. Peter Anvin",
    Peter Zijlstra
Cc: x86@kernel.org, Dan Williams, Linux Kernel Mailing List,
    dm-devel@redhat.com
Subject: [PATCH] memcpy_flushcache: use cache flushing for larger lengths

[ resending this to x86 maintainers ]

Hi

I tested the performance of various ways of writing to Optane-based
persistent memory and found that plain non-temporal stores achieve a
throughput of 1.3 GB/s, whereas 8 cached stores immediately followed by
clflushopt or clwb achieve a throughput of 1.6 GB/s.
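As a rough illustration of the two approaches, here is a small user-space
sketch of mine (it is not the kernel code; the helper names, buffer sizes
and the -mclflushopt build flag are my assumptions, and it assumes
64-byte-aligned buffers whose length is a multiple of 64):

/* illustration only - not the kernel implementation; assumes dst/src are
   64-byte aligned, n is a multiple of 64 and the CPU has CLFLUSHOPT;
   build with: gcc -O2 -mclflushopt sketch.c */
#include <immintrin.h>
#include <stdlib.h>
#include <string.h>

/* strategy 1: non-temporal stores (what __memcpy_flushcache does today) */
static void copy_movnti(void *dst, const void *src, size_t n)
{
	long long *d = dst;
	const long long *s = src;
	size_t i;

	for (i = 0; i < n / 8; i++)
		_mm_stream_si64(&d[i], s[i]);
	_mm_sfence();		/* make the non-temporal stores visible */
}

/* strategy 2: ordinary cached stores, then flush each 64-byte line */
static void copy_clflushopt(void *dst, const void *src, size_t n)
{
	size_t off;

	for (off = 0; off < n; off += 64) {
		memcpy((char *)dst + off, (const char *)src + off, 64);
		_mm_clflushopt((char *)dst + off);
	}
	_mm_sfence();		/* order the flushes before later stores */
}

int main(void)
{
	size_t n = 1 << 20;		/* 1 MiB test buffers */
	void *src = aligned_alloc(64, n);
	void *dst = aligned_alloc(64, n);

	memset(src, 0xab, n);
	copy_movnti(dst, src, n);
	copy_clflushopt(dst, src, n);
	free(src);
	free(dst);
	return 0;
}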
memcpy_flushcache uses non-temporal stores; I modified it to use cached
stores + clflushopt, and it improved the performance of the dm-writecache
target significantly:

dm-writecache throughput (dd if=/dev/zero of=/dev/mapper/wc bs=64k
oflag=direct):

writecache block size   512         1024        2048        4096
movnti                  496 MB/s    642 MB/s    725 MB/s    744 MB/s
clflushopt              373 MB/s    688 MB/s    1.1 GB/s    1.2 GB/s

For block size 512, movnti works better; for larger block sizes,
clflushopt is better.

I was also testing the novafs filesystem; it is not upstream, but it
benefited from a similar change in __memcpy_flushcache and
__copy_user_nocache:

write throughput on big files   - movnti: 662 MB/s, clwb: 1323 MB/s
write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s

I submit this patch for __memcpy_flushcache that improves dm-writecache
performance. Other ideas - should we introduce memcpy_to_pmem instead of
modifying memcpy_flushcache and move this logic there? Or should I modify
the dm-writecache target directly to use clflushopt with no change to the
architecture-specific code?

Mikulas


From: Mikulas Patocka

I tested dm-writecache performance on a machine with Optane nvdimm and it
turned out that for larger writes, cached stores + cache flushing perform
better than non-temporal stores. This is the throughput of dm-writecache
measured with this command:
dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct

block size              512         1024        2048        4096
movnti                  496 MB/s    642 MB/s    725 MB/s    744 MB/s
clflushopt              373 MB/s    688 MB/s    1.1 GB/s    1.2 GB/s

We can see that for smaller blocks, movnti performs better, but for
larger blocks, clflushopt has better performance.

This patch changes the function __memcpy_flushcache accordingly, so that
with size >= 768 it performs cached stores and cache flushing. Note that
we must not use the new branch if the CPU doesn't have clflushopt - in
that case, the kernel would use the inefficient "clflush" instruction,
which has very bad performance.

Signed-off-by: Mikulas Patocka

---
 arch/x86/lib/usercopy_64.c |   36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c	2020-03-24 15:15:36.644945091 -0400
+++ linux-2.6/arch/x86/lib/usercopy_64.c	2020-03-30 07:17:51.450290007 -0400
@@ -152,6 +152,42 @@ void __memcpy_flushcache(void *_dst, con
 		return;
 	}
 
+	if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && size >= 768 && likely(boot_cpu_data.x86_clflush_size == 64)) {
+		while (!IS_ALIGNED(dest, 64)) {
+			asm("movq (%0), %%r8\n"
+			    "movnti %%r8, (%1)\n"
+				:: "r" (source), "r" (dest)
+				: "memory", "r8");
+			dest += 8;
+			source += 8;
+			size -= 8;
+		}
+		do {
+			asm("movq (%0), %%r8\n"
+			    "movq 8(%0), %%r9\n"
+			    "movq 16(%0), %%r10\n"
+			    "movq 24(%0), %%r11\n"
+			    "movq %%r8, (%1)\n"
+			    "movq %%r9, 8(%1)\n"
+			    "movq %%r10, 16(%1)\n"
+			    "movq %%r11, 24(%1)\n"
+			    "movq 32(%0), %%r8\n"
+			    "movq 40(%0), %%r9\n"
+			    "movq 48(%0), %%r10\n"
+			    "movq 56(%0), %%r11\n"
+			    "movq %%r8, 32(%1)\n"
+			    "movq %%r9, 40(%1)\n"
+			    "movq %%r10, 48(%1)\n"
+			    "movq %%r11, 56(%1)\n"
+				:: "r" (source), "r" (dest)
+				: "memory", "r8", "r9", "r10", "r11");
+			clflushopt((void *)dest);
+			dest += 64;
+			source += 64;
+			size -= 64;
+		} while (size >= 64);
+	}
+
 	/* 4x8 movnti loop */
 	while (size >= 32) {
 		asm("movq (%0), %%r8\n"