Message-ID: <7d20a9543f69523cfda280e3f5ab17d68db037ab.camel@intel.com>
Subject: Re: [mm/page_alloc] f26b3fa046: netperf.Throughput_Mbps -18.0% regression
From: "ying.huang@intel.com"
To: Aaron Lu
Cc: Mel Gorman, kernel test robot, Linus Torvalds, Vlastimil Babka,
 Dave Hansen, Jesper Dangaard Brouer, Michal Hocko, Andrew Morton,
 LKML, lkp@lists.01.org, lkp@intel.com, feng.tang@intel.com,
 zhengjun.xing@linux.intel.com, fengwei.yin@intel.com
Date: Sat, 07 May 2022 08:54:35 +0800
In-Reply-To: 
References: <20220420013526.GB14333@xsang-OptiPlex-9020>

On Fri, 2022-05-06 at 20:17 +0800, Aaron Lu wrote:
> On Fri, May 06, 2022 at 04:40:45PM +0800, ying.huang@intel.com wrote:
> > On Fri, 2022-04-29 at 19:29 +0800, Aaron Lu wrote:
> > > Hi Mel,
> > > 
> > > On Wed, Apr 20, 2022 at 09:35:26AM +0800, kernel test robot wrote:
> > > > 
> > > > (please note: we reported
> > > > "[mm/page_alloc] 39907a939a: netperf.Throughput_Mbps -18.1% regression"
> > > > on
> > > > https://lore.kernel.org/all/20220228155733.GF1643@xsang-OptiPlex-9020/
> > > > while the commit was on a branch.
> > > > now we still observe a similar regression when it's on mainline, and
> > > > we also observe a 13.2% improvement on another netperf subtest.
> > > > so we report again for information)
> > > > 
> > > > Greetings,
> > > > 
> > > > FYI, we noticed a -18.0% regression of netperf.Throughput_Mbps due to commit:
> > > > 
> > > > commit: f26b3fa046116a7dedcaafe30083402113941451 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > 
> > > 
> > > So what this commit does is: if a CPU is always doing free
> > > (pcp->free_factor > 0)
> > 
> > IMHO, this means the consumer and producer are running on different
> > CPUs.
> > 
> 
> Right.
> 
> > > and if the high-order page being freed has order <= PAGE_ALLOC_COSTLY_ORDER,
> > > then do not use the PCP list but free the page directly to the buddy
> > > allocator.
> > > 
> > > The rationale as explained in the commit's changelog is:
> > > "
> > > Netperf running on localhost exhibits this pattern and while it does not
> > > matter for some machines, it does matter for others with smaller caches
> > > where cache misses cause problems due to reduced page reuse. Pages
> > > freed directly to the buddy list may be reused quickly while still cache
> > > hot whereas storing on the PCP lists may be cold by the time
> > > free_pcppages_bulk() is called.
> > > "
> > > 
> > > This regression occurred on a machine that has large caches, so this
> > > optimization brings it no value, only overhead (the skipped PCP); I
> > > guess this is the reason for the regression.
> > 
> > Per my understanding, not only is the cache larger, but the L2 cache
> > (1MB) is also per-core on this machine. So if the consumer and
> > producer are running on different cores, the cache-hot page may cause
> > more core-to-core cache transfer. This may hurt performance too.
> > 
> 
> The client side allocates the skb (page) and the server side recvfrom()s
> it. recvfrom() copies the page data to the server's own buffer and then
> releases the page associated with the skb. The client does all the
> allocation and the server does all the freeing; page reuse happens on
> the client side.
> So I think core-to-core cache transfer due to page reuse can occur when
> the client task migrates.

The core-to-core cache transfer can be cross-socket, or cross-L2 within
one socket. I mean the latter.

> I have modified the job to have the client and server bound to specific
> CPUs on different cores of the same node, and testing it on the same
> Icelake 2-socket server, the result is
> 
>   kernel        throughput
>   8b10b465d0e1  125168
>   f26b3fa04611  102039     -18%
> 
> It's also an 18% drop. I think this means c2c is not a factor?

Can you test with the client and server bound to two hardware threads
(hyper-threads) of one core? The two hardware threads of one core share
the L2 cache.
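
For example, something like this (a sketch only; the CPU ids are
illustrative, please check the sibling list on the test machine first):

  # find the HT siblings of CPU 0 (output like "0,64")
  cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

  # pin the server and the client to the two siblings of one core
  taskset -c 0 netserver
  taskset -c 64 netperf -H 127.0.0.1 -t TCP_STREAM -l 60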
> > > I have also tested this case on a small machine: a Skylake desktop,
> > > and this commit shows an improvement:
> > > 
> > >   8b10b465d0e1: "netperf.Throughput_Mbps": 72288.76,
> > >   f26b3fa04611: "netperf.Throughput_Mbps": 90784.4,   +25.6%
> > > 
> > > So this means those directly freed pages get reused by the allocator
> > > side and that brings a performance improvement for machines with
> > > smaller caches.
> > 
> > Per my understanding, the L2 cache on this desktop machine is shared
> > among cores.
> > 
> 
> The said CPU is an i7-6700 and according to this Wikipedia page,
> L2 is per core:
> https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Mainstream_desktop_processors

Sorry, my memory was wrong. Skylake and later server CPUs have a much
larger private L2 cache (1MB vs. 256KB on the client parts); this may
increase the possibility of core-to-core transfer.

> > > I wonder if we should still use the PCP list a little bit under the
> > > above-said condition, for the purpose of:
> > > 1. reduced overhead in the free path for machines with large caches;
> > > 2. still keeping the benefit of reused pages for machines with
> > >    smaller caches.
> > > 
> > > For this reason, I tested increasing nr_pcp_high() from returning 0
> > > to either returning pcp->batch or (pcp->batch << 2):
> > > 
> > >   machine \ nr_pcp_high() ret:  pcp->high   0       pcp->batch  (pcp->batch << 2)
> > >   skylake desktop:              72288       90784   92219       91528
> > >   icelake 2-socket:             120956      99177   98251       116108
> > > 
> > > (note: nr_pcp_high() returning pcp->high is the behaviour of this
> > > commit's parent; returning 0 is the behaviour of this commit.)
> > > 
> > > The results show that if we effectively use a PCP high of
> > > (pcp->batch << 2) for the described condition, then this workload's
> > > performance on the small machine remains while the regression on the
> > > large machine is greatly reduced (from -18% to -4%).
> > > 
> > 
> > Can we use cache size and topology information directly?
> 
> It can be complicated by the fact that the system can have multiple
> producers (CPUs that are doing the freeing) running at the same time,
> and getting the perfect number can be a difficult job.

We can discuss this after verifying whether it's related to core-to-core
transfer.

Best Regards,
Huang, Ying
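
P.S. To make the experiment above concrete, here is a minimal sketch
(paraphrased from the thread's description of nr_pcp_high() after
f26b3fa04611; not the exact mainline code):

static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
		       bool free_high)
{
	int high = READ_ONCE(pcp->high);

	if (unlikely(!high))
		return 0;

	/*
	 * free_high is set when this CPU is mostly freeing high-order
	 * pages (pcp->free_factor > 0 and order <= PAGE_ALLOC_COSTLY_ORDER).
	 * f26b3fa04611 returns 0 here, bypassing the PCP list entirely;
	 * the experiment instead keeps a small number of pages on the
	 * list so the allocating side can still reuse them cache-hot.
	 */
	if (free_high)
		return READ_ONCE(pcp->batch) << 2;	/* experiment; the commit returns 0 */

	return high;
}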