Received: by 2002:a05:6a10:9848:0:0:0:0 with SMTP id x8csp918172pxf; Wed, 7 Apr 2021 15:03:00 -0700 (PDT) X-Google-Smtp-Source: ABdhPJx9m1/vRFwHAztaL3PeMStWJYYNTG9KK1j6mgp6mV6JuIzDilsdnRwdizlwuAvxZ8art2v9 X-Received: by 2002:a17:90a:a613:: with SMTP id c19mr5335763pjq.141.1617832980735; Wed, 07 Apr 2021 15:03:00 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1617832980; cv=none; d=google.com; s=arc-20160816; b=0MRQke8KTM7agdgy9ou8Wy2T44EJFO4+xiNoY8sU5MmS5I24DNBYVblOPBgpwEhLzq zWHk2LozPnWP8bdk/gP7169UOTx0XAYQtYHJWspxcQ8RtmQMXGOKGeDh8RU8h2YeS9Zd nR/KpBSrgqI6/Wts9d02ZWdIZL6dazAEhkXnSu8bwQIDyyHuAxHfZgXzuTUpkIBnWoPX Obx36oq7wrO7/briPahtBKbOUvPnIfQAeTWfiaVJhx9piZ6/IGRYieII4mIfnW5/8wcR Ee9IYgsuAfVoJQV+3tufdW5BInMtJHoZqWWMLoD9cKLVQL+i31mtbekmENZAMgETV536 M2BQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from; bh=pxZCynY/H03FgUqNLYIoK66qdv2Fb1XWE3nl1wuuTPg=; b=LBXRE40KI1mG/q17T+YlQjcisvUbIHVVFR9Y6xm/08TulHRJx01K24DTJNSNo3J8H/ ih6lhTjfgBNCVpzz7AkqGrUrd2WwKufEYtUHJLeaPkbaZFL6wZBG0mnoVkNwcIjpEUB2 X4vLXyq4+oH9gFsPqUl5pznV6Ipa7ABS7AX8CsKwbwRlHBNNzGi9qLGUQggK0o9BHpIr u0qOO/Wme+hqKZ9ssBtTobBJiAJ18ct/4vVrDLtGBrXJS8DqlgXUrVs5gq+JecCzudHq NTr+OL7m1mb2YtaUZ1QLkpnMUgKg9WFjUm69LwlP/j8qS/O3Oa0PXW+ix66I4UW/Xj8S sS0g== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id q15si20465208pfu.37.2021.04.07.15.02.46; Wed, 07 Apr 2021 15:03:00 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1356144AbhDGUZB (ORCPT + 99 others); Wed, 7 Apr 2021 16:25:01 -0400 Received: from outbound-smtp49.blacknight.com ([46.22.136.233]:35039 "EHLO outbound-smtp49.blacknight.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1356260AbhDGUYq (ORCPT ); Wed, 7 Apr 2021 16:24:46 -0400 Received: from mail.blacknight.com (pemlinmail01.blacknight.ie [81.17.254.10]) by outbound-smtp49.blacknight.com (Postfix) with ESMTPS id 5A73CFB3DB for ; Wed, 7 Apr 2021 21:24:34 +0100 (IST) Received: (qmail 13602 invoked from network); 7 Apr 2021 20:24:34 -0000 Received: from unknown (HELO stampy.112glenside.lan) (mgorman@techsingularity.net@[84.203.22.4]) by 81.17.254.9 with ESMTPA; 7 Apr 2021 20:24:34 -0000 From: Mel Gorman To: Linux-MM , Linux-RT-Users Cc: LKML , Chuck Lever , Jesper Dangaard Brouer , Matthew Wilcox , Thomas Gleixner , Peter Zijlstra , Ingo Molnar , Michal Hocko , Oscar Salvador , Mel Gorman Subject: [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Date: Wed, 7 Apr 2021 21:24:12 +0100 Message-Id: <20210407202423.16022-1-mgorman@techsingularity.net> X-Mailer: git-send-email 2.26.2 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org For MM people, the whole series is relevant but patch 3 needs particular attention for memory hotremove as I had problems testing it because full zone removal always failed for me. For RT people, the most interesting patches are 2, 9 and 10 with 2 being the most important. This series requires patches in Andrew's tree so for convenience, it's also available at git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v2r10 The PCP (per-cpu page allocator in page_alloc.c) shares locking requirements with vmstat and the zone lock which is inconvenient and causes some issues. For example, the PCP list and vmstat share the same per-cpu space meaning that it's possible that vmstat updates dirty cache lines holding per-cpu lists across CPUs unless padding is used. Second, PREEMPT_RT does not want IRQs disabled in the page allocator because it's too long for IRQs to be disabled unnecesarily. This series splits the locking requirements and uses locks types more suitable for PREEMPT_RT, reduces the time when special locking is required for stats and reduces the time when IRQs need to be disabled on !PREEMPT_RT kernels. Why local_lock? PREEMPT_RT considers the following sequence to be unsafe as documented in Documentation/locking/locktypes.rst local_irq_disable(); raw_spin_lock(&lock); The page allocator does not use raw_spin_lock but using local_irq_safe is undesirable on PREEMPT_RT as it leaves IRQs disabled for an excessive length of time. By converting to local_lock which disables migration on PREEMPT_RT, the locking requirements can be separated and start moving the protections for PCP, stats and the zone lock to PREEMPT_RT-safe equivalent locking. As a bonus, local_lock also means that PROVE_LOCKING does something useful. After that, it was very obvious that zone_statistics in particular has way too much overhead and leaves IRQs disabled for longer than necessary on !PREEMPT_RT kernels. zone_statistics uses perfectly accurate counters requiring IRQs be disabled for parallel RMW sequences when inaccurate ones like vm_events would do. The series makes the NUMA statistics (NUMA_HIT and friends) inaccurate counters that then require no special protection on !PREEMPT_RT. The bulk page allocator can then do stat updates in bulk with IRQs enabled which should improve the efficiency. Technically, this could have been done without the local_lock and vmstat conversion work and the order simply reflects the timing of when different series were implemented. Finally, there are places where we conflate IRQs being disabled for the PCP with the IRQ-safe zone spinlock. The remainder of the series reduces the scope of what is protected by disabled IRQs on !PREEMPT_RT kernels. By the end of the series, page_alloc.c does not call local_irq_save so the locking scope is a bit clearer. The one exception is that modifying NR_FREE_PAGES still happens in places where it's known the IRQs are disabled as it's harmless for PREEMPT_RT and would be expensive to split the locking there. No performance data is included because despite the overhead of the stats, it's within the noise for most workloads on !PREEMPT_RT. However, Jesper Dangaard Brouer ran a page allocation microbenchmark on a E5-1650 v4 @ 3.60GHz CPU on the first version of this series. Focusing on the array variant of the bulk page allocator reveals the following. (CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz) ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size Baseline Patched 1 56.383 54.225 (+3.83%) 2 40.047 35.492 (+11.38%) 3 37.339 32.643 (+12.58%) 4 35.578 30.992 (+12.89%) 8 33.592 29.606 (+11.87%) 16 32.362 28.532 (+11.85%) 32 31.476 27.728 (+11.91%) 64 30.633 27.252 (+11.04%) 128 30.596 27.090 (+11.46%) While this is a positive outcome, the series is more likely to be interesting to the RT people in terms of getting parts of the PREEMPT_RT tree into mainline. drivers/base/node.c | 18 +-- include/linux/mmzone.h | 29 ++-- include/linux/vmstat.h | 65 +++++---- mm/internal.h | 2 +- mm/memory_hotplug.c | 10 +- mm/mempolicy.c | 2 +- mm/page_alloc.c | 297 ++++++++++++++++++++++++----------------- mm/vmstat.c | 250 ++++++++++++---------------------- 8 files changed, 339 insertions(+), 334 deletions(-) -- 2.26.2