From: Kemi Wang
To: Andrew Morton, Michal Hocko, Mel Gorman, Johannes Weiner, Christopher Lameter
Cc: Dave, Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Tim Chen, Linux MM, Linux Kernel, Kemi Wang
Subject: [PATCH v2 0/3] Separate NUMA statistics from zone statistics
Date: Thu, 24 Aug 2017 17:59:58 +0800
Message-Id: <1503568801-21305-1-git-send-email-kemi.wang@intel.com>

Each page allocation updates a set of per-zone statistics with a call to
zone_statistics(). As discussed at the 2017 MM summit, these are a
substantial source of overhead in the page allocator and are very rarely
consumed. The overhead comes largely from cache bouncing when the zone
counters (the NUMA-associated counters) are updated in parallel during
multi-threaded page allocation (pointed out by Dave Hansen).

A link to the MM summit slides:
http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf

To mitigate this overhead, this patchset separates the NUMA statistics
from the zone statistics framework and raises the NUMA counter threshold
to a fixed size of MAX_U16 - 2, because a small threshold greatly
increases how often the local per-cpu counters are folded into the global
counter (suggested by Ying Huang). The rationale is that these statistics
counters, unlike other VM counters, don't need to be read often, so it's
not a problem to use a large threshold and make readers more expensive.

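The threshold idea described above can be sketched in plain userspace C.
This is only an illustration, not the code from the patches: the names
numa_counter, numa_counter_inc(), numa_counter_read_fast(),
numa_counter_read_exact() and NR_CPUS_SKETCH are hypothetical, the "CPUs"
are just array slots, and nothing here is atomic or truly per-cpu as it
would have to be in the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_SKETCH      4                  /* hypothetical CPU count */
#define NUMA_STAT_THRESHOLD (UINT16_MAX - 2)   /* 65533, i.e. MAX_U16 - 2 */

struct numa_counter {
	uint64_t global;                   /* shared counter, costly to touch */
	uint16_t local[NR_CPUS_SKETCH];    /* cheap per-CPU deltas */
};

/* Bump the local delta; fold it into the global counter only once it
 * exceeds the threshold, so the shared cache line is rarely written. */
static void numa_counter_inc(struct numa_counter *c, unsigned int cpu)
{
	uint16_t v;

	assert(cpu < NR_CPUS_SKETCH);
	v = ++c->local[cpu];
	if (v > NUMA_STAT_THRESHOLD) {
		c->global += v;
		c->local[cpu] = 0;
	}
}

/* Cheap read: ignores deltas still pending in the local counters. */
static uint64_t numa_counter_read_fast(const struct numa_counter *c)
{
	return c->global;
}

/* Expensive read: also walks every local delta, so the result is exact. */
static uint64_t numa_counter_read_exact(const struct numa_counter *c)
{
	uint64_t sum = c->global;
	unsigned int cpu;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		sum += c->local[cpu];
	return sum;
}
```

With a threshold this large the fast read can lag by up to 65533 events
per CPU, which is why a reader that wants an accurate number has to pay
for the walk over the local counters.
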
With this patchset, we see a 31.3% drop in CPU cycles (537 --> 369, see
below) per single page allocation and reclaim on Jesper's page_bench03
benchmark. Meanwhile, this patchset keeps the same style of virtual memory
statistics with little end-user-visible effect (the only change is that
the NUMA stats are now shown after the zone page stats, see the first
patch for details).

I ran an experiment of concurrent single page allocation and reclaim using
Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based server (88
processors with 126G memory) with different pcp counter threshold sizes.

Benchmark provided by Jesper D Brouer (loop count increased to 10000000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

   Threshold   CPU cycles    Throughput(88 threads)
      32          799         241760478
      64          640         301628829
      125         537         358906028 <==> system by default
      256         468         412397590
      512         428         450550704
      4096        399         482520943
      20000       394         489009617
      30000       395         488017817
      65533       369(-31.3%) 521661345(+45.3%) <==> with this patchset
      N/A         342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics

Kemi Wang (3):
  mm: Change the call sites of numa statistics items
  mm: Update NUMA counter threshold size
  mm: Consider the number in local CPUs when *reads* NUMA stats

 drivers/base/node.c    |  22 ++++---
 include/linux/mmzone.h |  24 +++++---
 include/linux/vmstat.h |  33 +++++++++++
 mm/page_alloc.c        |  10 ++--
 mm/vmstat.c            | 152 +++++++++++++++++++++++++++++++++++++++++++++++--
 5 files changed, 217 insertions(+), 24 deletions(-)

-- 
2.7.4
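
For reference, the percentages in the table are relative to the default
threshold of 125 (537 cycles, 358906028 throughput). A minimal sketch of
that arithmetic follows; percent_change() is a hypothetical helper written
for this note and is not part of the patchset or the benchmark.

```c
#include <assert.h>

/* Relative change of `value` against `baseline`, in percent.
 * Negative means a drop, positive means a gain. */
static double percent_change(double baseline, double value)
{
	assert(baseline != 0.0);
	return (value - baseline) / baseline * 100.0;
}
```

For example, percent_change(537.0, 369.0) gives the cycle reduction for
the 65533 row, and percent_change(358906028.0, 521661345.0) gives the
matching throughput gain.
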