From: Jinhua Wu <wujinhua@linux.alibaba.com>
To: x86@kernel.org
Cc: zelin.deng@linux.alibaba.com, jiayu.ni@linux.alibaba.com, wujinhua@linux.alibaba.com, ak@linux.intel.com, luming.yu@intel.com, fan.du@intel.com, artie.ding@linux.alibaba.com, tony.luck@intel.com, tglx@linutronix.de, bp@alien8.de, linux-kernel@vger.kernel.org, pawan.kumar.gupta@linux.intel.com, fenghua.yu@intel.com, hpa@zytor.com, ricardo.neri-calderon@linux.intel.com, peterz@infradead.org
Subject: [PATCH] perf: optimize clear_page on specific Intel models with the movq instruction
Date: Thu, 9 Sep 2021 16:45:51 +0800
Message-Id: <1631177151-53723-1-git-send-email-wujinhua@linux.alibaba.com>
Clearing the page is the most time-consuming procedure in page fault
handling, and the kernel uses fast-string instructions to do it. We found
that on specific Intel models, such as CPX and ICX, the movq instruction
performs much better than the fast-string instructions when the page being
cleared is not in the cache; when the page is in the cache, fast-string
performs better. Test results:

machine: Intel CPX

Allocated memory size    Page fault latency per 4K byte
                         rep stosb           movq
--------------------     ----------------    ------------------
8MB                      2057.13ns           1338.38ns
64MB                     1850.71ns           1200.20ns
512MB                    1918.40ns           1196.91ns
4096MB                   1931.24ns           1189.41ns

This is roughly a 40% performance improvement. So we add a blacklist of
the affected Intel models; on those CPUs, clear_page() uses the movq-based
clear_page_orig() instead.

Signed-off-by: Jinhua Wu <wujinhua@linux.alibaba.com>
Signed-off-by: Jiayu Ni <jiayu.ni@linux.alibaba.com>
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
---
 arch/x86/include/asm/page_64.h | 18 ++++++++++++------
 arch/x86/kernel/cpu/intel.c    | 23 +++++++++++++++++++++++
 arch/x86/mm/init.c             |  9 +++++++++
 3 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 4bde0dc..1fedfbe 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -7,6 +7,8 @@
 #ifndef __ASSEMBLY__
 #include <asm/alternative.h>
+#include <linux/jump_label.h>
+
 
 /* duplicated to the one in bootmem.h */
 extern unsigned long max_pfn;
 extern unsigned long phys_base;
@@ -43,15 +45,19 @@ static inline unsigned long __phys_addr_nodebug(unsigned long x)
 void clear_page_orig(void *page);
 void clear_page_rep(void *page);
 void clear_page_erms(void *page);
+extern struct static_key_false clear_page_movq_key;
 
 static inline void clear_page(void *page)
 {
-	alternative_call_2(clear_page_orig,
-			   clear_page_rep, X86_FEATURE_REP_GOOD,
-			   clear_page_erms, X86_FEATURE_ERMS,
-			   "=D" (page),
-			   "0" (page)
-			   : "cc", "memory", "rax", "rcx");
+	if (static_branch_unlikely(&clear_page_movq_key))
+		clear_page_orig(page);
+	else
+		alternative_call_2(clear_page_orig,
+				   clear_page_rep, X86_FEATURE_REP_GOOD,
+				   clear_page_erms, X86_FEATURE_ERMS,
+				   "=D" (page),
+				   "0" (page)
+				   : "cc", "memory", "rax", "rcx");
 }
 
 void copy_page(void *to, void *from);
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 8321c43..3366da0 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -38,6 +38,29 @@
 #include
 #endif
 
+/* Optimize clear page with movq in specific Intel CPU */
+#include <asm/cpu_device_id.h>
+#include <linux/jump_label.h>
+
+DEFINE_STATIC_KEY_FALSE(clear_page_movq_key);
+EXPORT_SYMBOL_GPL(clear_page_movq_key);
+
+extern const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match);
+
+const struct x86_cpu_id faststring_blacklist_match[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_XEON_D, 0),
+	{}	/* table terminator required by x86_match_cpu() */
+};
+
+void enable_clear_page_movq(void)
+{
+	if (x86_match_cpu(faststring_blacklist_match))
+		static_branch_enable(&clear_page_movq_key);
+}
+
+
 enum split_lock_detect_state {
 	sld_off = 0,
 	sld_warn,
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 23a14d8..480c189 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -28,6 +28,12 @@
 #include
 
 /*
+ * Optimize clear page with movq in specific Intel CPU
+ * Definition in intel.c
+ */
+extern void enable_clear_page_movq(void);
+
+/*
  * We need to define the tracepoints somewhere, and tlb.c
  * is only compiled when SMP=y.
  */
@@ -775,6 +781,9 @@ void __init init_mem_mapping(void)
 
 	x86_init.hyper.init_mem_mapping();
 
+	/* Optimize clear page with movq in specific Intel CPU */
+	enable_clear_page_movq();
+
 	early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
 }
-- 
1.8.3.1