Date: Tue, 19 Jun 2018 16:18:03 +0100
From: Mel Gorman <mgorman@techsingularity.net>
To: Jirka Hladky
Cc: Jakub Racek, linux-kernel, "Rafael J. Wysocki", Len Brown,
 linux-acpi@vger.kernel.org, kkolakow@redhat.com
Subject: Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
Message-ID: <20180619151803.bu6pehdu6wbd6l5x@techsingularity.net>
References: <20180611141113.pfuttg7npch3jtg6@techsingularity.net>
 <20180614083640.dekqhsopoefnfhb4@techsingularity.net>
 <20180615112522.3wujbq7bajof57qx@techsingularity.net>
 <20180615135212.wq45co7ootvdeo2f@techsingularity.net>

On Tue, Jun 19, 2018 at 03:36:53PM +0200, Jirka Hladky wrote:
> Hi Mel,
>
> we have tested the following variants:
>
> var1: 4.16 + 2c83362734dad8e48ccc0710b5cd2436a0323893
> fix1: var1 + ratelimit_pages increased by a factor of 4:
> -static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
> +static unsigned int ratelimit_pages __read_mostly = 512 << (20 - PAGE_SHIFT);
> fix2: var1 + ratelimit_pages increased by a factor of 8:
> -static unsigned int ratelimit_pages __read_mostly = 512 << (20 - PAGE_SHIFT);
> +static unsigned int ratelimit_pages __read_mostly = 1024 << (20 - PAGE_SHIFT);
> fix3: var1 + ratelimit_pages increased by a factor of 16:
> -static unsigned int ratelimit_pages __read_mostly = 1024 << (20 - PAGE_SHIFT);
> +static unsigned int ratelimit_pages __read_mostly = 2048 << (20 - PAGE_SHIFT);
>
> Results for the stream benchmark (standalone processes) have gradually
> improved. For fix3, the stream benchmark with a runtime of 60 seconds no
> longer shows a performance drop compared to the 4.16 kernel.
>

Ok, so at least one option is to remove the rate limiting. It'll be ok
as long as cross-node migrations are not all of the following: a) a
regular event, b) sustained, with each migrated task remaining on the
new socket long enough for the migrations to actually occur, and c)
heavy enough that the bandwidth used for cross-node migration interferes
badly with tasks accessing local memory. It'll vary depending on
workload and machine, unfortunately, but the rate limiting never
accounted for the real capabilities of the hardware and cannot detect
bandwidth used for regular accesses.

> For the OpenMP NAS, results are still worse than with the vanilla 4.16
> kernel. Increasing the ratelimit has helped, but even the best results
> of {fix1, fix2, fix3} are still some 5-10% slower than with the vanilla
> 4.16 kernel. If I had to pick the best value for ratelimit_pages, it
> would be fix2:
> +static unsigned int ratelimit_pages __read_mostly = 1024 << (20 - PAGE_SHIFT);
>
> I have also used the Intel LINPACK (OpenMP) benchmark on a 4-node NUMA
> server - it gives similar results to the NAS test.
>
> I think that patch 2c83362734dad8e48ccc0710b5cd2436a0323893 needs a
> review for the OpenMP and standalone-process workloads.
>

I did get some results, although testing of the different potential
patches (revert, numabalance series, faster scanning in isolation etc.)
is still in progress. However, I did find that rate limiting was not a
factor for NAS, at least on the machine I used (STREAM was too
short-lived in the configuration I used). That does not prevent the
rate limiting being removed, but it highlights that the impact is
workload and machine specific.
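As an aside, since ratelimit_pages does the work in all three fixes
above: the mechanism it tunes is a simple per-window budget - each node
may migrate at most ratelimit_pages pages per fixed time window, and
anything beyond that is skipped until the window rolls over. Below is a
minimal userspace sketch of that scheme. It is an illustration under
assumptions (4K base pages, a 100ms window as in 4.16-era mm/migrate.c),
not the kernel code itself.

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12	/* assumption: 4K base pages */
#define WINDOW_MS  100	/* assumption: 100ms rate-limit window */

static unsigned int ratelimit_pages = 128 << (20 - PAGE_SHIFT); /* 128MB/window */

struct node_state {
	unsigned long window_end_ms;	 /* when the current window expires */
	unsigned long pages_this_window; /* pages already migrated in it */
};

/* Return true if migrating nr_pages now would exceed the window budget. */
static bool migrate_ratelimited(struct node_state *ns, unsigned long now_ms,
				unsigned long nr_pages)
{
	if (now_ms >= ns->window_end_ms) {
		/* Window rolled over: start a new one with a fresh budget. */
		ns->window_end_ms = now_ms + WINDOW_MS;
		ns->pages_this_window = 0;
	}
	if (ns->pages_this_window + nr_pages > ratelimit_pages)
		return true;	/* over budget: skip the migration */
	ns->pages_this_window += nr_pages;
	return false;
}

int main(void)
{
	struct node_state node = { 0, 0 };
	unsigned long now, throttled = 0;

	/* Attempt one 2MB huge page (512 base pages) every millisecond
	 * for a simulated second. */
	for (now = 0; now < 1000; now++)
		if (migrate_ratelimited(&node, now, 512))
			throttled++;

	printf("budget %u pages (%uMB) per window, throttled %lu of 1000 attempts\n",
	       ratelimit_pages, ratelimit_pages >> (20 - PAGE_SHIFT), throttled);
	return 0;
}

With those assumptions the default works out to 128MB per 100ms, a
ceiling of roughly 1.3GB/s of migration traffic per node; fix2's value
of 1024 raises that ceiling to roughly 10GB/s.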
Second, I did manage to see the regression and the fix from the revert,
*but* it required both one third of the CPUs being used and the OpenMP
parallelisation method. Using all CPUs shows no regression, and using a
third of the CPUs with MPI shows no regression. In other words, the
impact is specific to the workload, the configuration and the machine.
I don't have a LINPACK configuration, but you say that the behaviour is
similar, so I'll stick with NAS.

On the topic of STREAM, it's meant to be a memory bandwidth benchmark,
and there is no knowledge within the scheduler for automatically moving
tasks to a memory controller. It really should be tuned to run as one
instance bound to one controller for the figures to make sense. For
automatic NUMA balancing to fix it up, it needs to run long enough, and
even then it's not guaranteed to be optimally located. I think it's
less relevant as a workload in this instance, and it'll never be
optimal, as even spreading early does not mean it'll spread to each
memory controller.

Given the specific requirements on the number of CPUs used and the
parallelisation method, I think a plain revert is not the answer
because it'll fix one particular workload and reintroduce regressions
on others (as laid out in the original changelog). There are other
observations we can make about the NAS-OpenMP workload though:

1. The locality sucks

Parallelising with MPI indicates that locality as measured by the NUMA
hinting faults achieves 94% local hits with minimal migration. With
OpenMP, locality is 66% with large amounts of migration. Many of the
updates are huge PMDs, so this may be an instance of false sharing, or
it might be the timing of when migrations start. (A sketch of how such
percentages can be pulled from the hinting-fault counters is at the end
of this mail.)

2. Migrations with the revert are lower

There are fewer migrations when the patch is reverted, and this may be
an indication that it simply benefits by spreading early, before any
memory is allocated, so that migrations are avoided. Unfortunately,
this is not a universal win for every workload.

I'll experiment a bit with faster migrations on cross-node accesses,
but I think no matter which way we jump on this one it'll be a case of
"it helps one workload and hurts another".
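As a footnote on the locality percentages: with automatic NUMA
balancing enabled, the hinting-fault activity is exported through the
numa_hint_faults and numa_hint_faults_local counters in /proc/vmstat,
and the local-hit percentage is the ratio of the two. A rough sketch
follows (sample before and after a run and diff the raw counters for a
per-run figure; this prints the cumulative system-wide ratio). In the
same spirit, a single STREAM instance can be pinned to one controller
with something like "numactl --cpunodebind=0 --membind=0 ./stream".

#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char name[64];
	unsigned long long val, faults = 0, local = 0;

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	/* Each /proc/vmstat line is "name value". */
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (!strcmp(name, "numa_hint_faults"))
			faults = val;
		else if (!strcmp(name, "numa_hint_faults_local"))
			local = val;
	}
	fclose(f);

	if (faults)
		printf("local hinting faults: %llu/%llu (%.1f%%)\n",
		       local, faults, 100.0 * local / faults);
	else
		printf("no NUMA hinting faults recorded\n");
	return 0;
}

-- 
Mel Gorman
SUSE Labs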