Date: Thu, 14 May 2020 10:50:55 +0100
From: Mel Gorman <mgorman@techsingularity.net>
To: Jirka Hladky
Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Valentin Schneider,
	Hillf Danton, LKML, Douglas Shakshober, Waiman Long, Joe Mario,
	Bill Gray
Subject: Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
Message-ID: <20200514095055.GG3758@techsingularity.net>
References: <20200320163843.GD3818@techsingularity.net>
	<20200507155422.GD3758@techsingularity.net>
	<20200508092212.GE3758@techsingularity.net>
	<20200513153023.GF3758@techsingularity.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, May 13, 2020 at 06:20:53PM +0200, Jirka Hladky wrote:
> Thank you, Mel!
>
> I think I have to make sure we cover the scenario you have targeted
> when developing adjust_numa_imbalance:
>
> =======================================================================
> https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8910
>
> /*
>  * Allow a small imbalance based on a simple pair of communicating
>  * tasks that remain local when the source domain is almost idle.
>  */
> =======================================================================
>
> Could you point me to a benchmark for this scenario? I have checked
> https://github.com/gormanm/mmtests
> and we use lots of the same benchmarks, but I'm not sure if we cover
> this particular scenario.
>

The NUMA imbalance part showed up as part of the general effort to
reconcile NUMA balancing with load balancing. It has been known for years
that the two balancers disagreed, to the extent that NUMA balancing
retried migrations multiple times just to keep tasks local, leading to
excessive migrations.
The full battery of tests used when I was trying to reconcile the
balancers, and later when working on Vincent's version, was as follows:

scheduler-unbound
scheduler-forkintensive
scheduler-perfpipe
scheduler-perfpipe-cpufreq
scheduler-schbench
db-pgbench-timed-ro-small-xfs
hpc-nas-c-class-mpi-full-xfs
hpc-nas-c-class-mpi-half-xfs
hpc-nas-c-class-omp-full
hpc-nas-c-class-omp-half
hpc-nas-d-class-mpi-full-xfs
hpc-nas-d-class-mpi-half-xfs
hpc-nas-d-class-omp-full
hpc-nas-d-class-omp-half
io-dbench4-async-ext4
io-dbench4-async-xfs
jvm-specjbb2005-multi
jvm-specjbb2005-single
network-netperf-cstate
network-netperf-rr-cstate
network-netperf-rr-unbound
network-netperf-unbound
network-tbench
numa-autonumabench
workload-kerndevel-xfs
workload-shellscripts-xfs

Where there is -ext4 or -xfs, just remove the filesystem suffix to get
the base configuration, i.e. the base configuration of
io-dbench4-async-ext4 is io-dbench4-async. Both filesystems are
sometimes tested because they interact differently with the scheduler:
ext4 uses a journal thread while xfs uses workqueues.

The imbalance is most obvious with network-netperf-unbound running on
localhost. If the client/server end up on separate nodes, it is obvious
from mpstat that two nodes are busy and the pair is migrating quite a
bit. The second effect is that NUMA balancing becomes active, trapping
hinting faults and migrating pages.

The biggest problem I have right now is that the wakeup path between
tasks that are local is slower than doing a remote wakeup via wake_list
that potentially sends an IPI, which is ridiculous. The slower wakeup
manifests as a loss of throughput for netperf even though all the
accesses are local. At least that's what I'm looking at whenever I get
the chance.

-- 
Mel Gorman
SUSE Labs
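[For reference, a sketch of how the network-netperf-unbound scenario
described above might be reproduced with mmtests and observed with
mpstat; the run name is arbitrary and the exact config path may differ
between mmtests versions.]

```shell
# Fetch mmtests and run the unbound netperf configuration on localhost.
# Treat the flags and config path as assumptions against a current tree.
git clone https://github.com/gormanm/mmtests.git
cd mmtests
./run-mmtests.sh --config configs/config-network-netperf-unbound netperf-baseline

# In a second terminal, watch per-CPU utilisation while it runs. If the
# client/server pair has been split across NUMA nodes, CPUs on two nodes
# show load instead of one.
mpstat -P ALL 1
```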