Date: Tue, 2 Oct 2018 14:54:59 +0100
From: Mel Gorman <mgorman@techsingularity.net>
To: Srikar Dronamraju
Cc: Peter Zijlstra, Ingo Molnar, Jirka Hladky, Rik van Riel, LKML, Linux-MM
Subject: Re: [PATCH 2/2] mm, numa: Migrate pages to local nodes quicker
 early in the lifetime of a task
Message-ID: <20181002135459.GA7003@techsingularity.net>
References: <20181001100525.29789-1-mgorman@techsingularity.net>
 <20181001100525.29789-3-mgorman@techsingularity.net>
 <20181002124149.GB4593@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <20181002124149.GB4593@linux.vnet.ibm.com>

On Tue, Oct 02, 2018 at 06:11:49PM +0530, Srikar Dronamraju wrote:
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 25c7c7e09cbd..7fc4a371bdd2 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1392,6 +1392,17 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
> >  	int last_cpupid, this_cpupid;
> >
> >  	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
> > +	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
> > +
> > +	/*
> > +	 * Allow first faults or private faults to migrate immediately early in
> > +	 * the lifetime of a task. The magic number 4 is based on waiting for
> > +	 * two full passes of the "multi-stage node selection" test that is
> > +	 * executed below.
> > +	 */
> > +	if ((p->numa_preferred_nid == -1 || p->numa_scan_seq <= 4) &&
> > +	    (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid)))
> > +		return true;
>
> This does have issues when used with workloads that take more shared
> faults than private faults.
>

Not as such. It can have issues on workloads where memory is initialised
by one thread, then additional threads are created and access the same
memory. Those pages are not necessarily shared once the buffers are
handed over, and in that case migrating quickly is the right thing to
do. If the pages are truly shared then there may be some unnecessary
migrations early in the lifetime of the task, but it'll settle down
quickly enough.

> In such workloads, this change would spread the memory, causing a
> regression in behaviour.
>
> 5 runs on a 2-socket / 4-node POWER8 box:
>
> Without this patch
> ./numa01.sh Real:    382.82   454.29   422.31    29.72
> ./numa01.sh Sys:      40.12    74.53    58.50    13.37
> ./numa01.sh User:  34230.22 46398.84 40292.62  4915.93
>
> With this patch
> ./numa01.sh Real:    415.56   555.04   473.45    51.17  -10.8016%
> ./numa01.sh Sys:      43.42    94.22    73.59    17.31  -20.5055%
> ./numa01.sh User:  35271.95 56644.19 45615.72  7165.01  -11.6694%
>
> Since we are looking at time, smaller numbers are better.
>

Is it just numa01 that was affected for you? I ask because that
particular workload is an adverse one on any machine with more than two
nodes, and your machine description says it has 4 nodes. What it is
testing is quite specific to 2-node machines.

> SPECjbb did show some small losses and gains.
>

That almost always shows small gains and losses so that's not too
surprising.

> Our NUMA grouping is not fast enough. It can sometimes take several
> iterations before all the tasks belonging to the same group end up
> being part of the group. With the current check we end up spreading
> memory faster than we should, hurting the chance of early
> consolidation.
>
> Can we restrict it to something like this?
>
> 	if (p->numa_scan_seq >= MIN && p->numa_scan_seq <= MIN + 4 &&
> 	    cpupid_match_pid(p, last_cpupid))
> 		return true;
>
> meaning we have run at least MIN scans, and we find this task to be
> the most likely user of the page.
>

What's MIN? Assuming it's any type of delay, note that this will regress
STREAM again because it's very sensitive to the starting state.

-- 
Mel Gorman
SUSE Labs
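
[ For readers without the tree in front of them: the "multi-stage node
  selection" test that the patch comment refers to is the two-stage
  filter executed later in should_numa_migrate_memory(). Roughly, as a
  paraphrase of kernel/sched/fair.c from that era rather than a quote
  from this thread:

	/*
	 * Two-stage filter: only migrate towards dst_nid if the
	 * previous fault on this page also came from that node.
	 * Sampling the same task<->page relation twice in a row
	 * squashes short or unlikely relations, at the cost of
	 * delaying migration by full scan passes - hence the magic
	 * number 4 above, which waits out two passes before the new
	 * check bypasses this filter.
	 */
	if (!cpupid_pid_unset(last_cpupid) &&
	    cpupid_to_nid(last_cpupid) != dst_nid)
		return false;

	/* Always allow migrate on private faults */
	if (cpupid_match_pid(p, last_cpupid))
		return true;
]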
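
[ As a concrete illustration of the workload shape Mel describes - one
  thread initialises memory, then newly created threads take over the
  buffers - here is a minimal userspace sketch. It is illustrative
  only; the thread count, chunk size and all names are made up, not
  taken from this thread:

	#include <pthread.h>
	#include <stdlib.h>
	#include <string.h>

	#define NTHREADS	4
	#define CHUNK		(64UL << 20)	/* 64MB per worker */

	static char *buf;

	static void *worker(void *arg)
	{
		char *chunk = buf + (long)arg * CHUNK;
		size_t i;
		int pass;

		/* Private access: each worker touches only its chunk. */
		for (pass = 0; pass < 100; pass++)
			for (i = 0; i < CHUNK; i += 4096)
				chunk[i]++;
		return NULL;
	}

	int main(void)
	{
		pthread_t th[NTHREADS];
		long i;

		buf = malloc(NTHREADS * CHUNK);

		/*
		 * First touch by the main thread places every page on
		 * the main thread's node.
		 */
		memset(buf, 0, NTHREADS * CHUNK);

		for (i = 0; i < NTHREADS; i++)
			pthread_create(&th[i], NULL, worker, (void *)i);
		for (i = 0; i < NTHREADS; i++)
			pthread_join(th[i], NULL);
		free(buf);
		return 0;
	}

  After the memset the chunks are handed over and never shared, so
  migrating each chunk to its worker's node on the first fault is the
  right outcome. Build with something like "gcc -O2 -pthread" and watch
  per-node placement with numastat to observe the migrations. ]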