Date: Fri, 24 Mar 2017 20:37:10 +0800
From: Aaron Lu
To: Dave Hansen
Cc: Minchan Kim, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Tim Chen, Andrew Morton, Ying Huang
Subject: Re: [PATCH v2 3/5] mm: use a dedicated workqueue for the free workers
Message-ID: <20170324123710.GA10672@aaronlu.sh.intel.com>
References: <1489568404-7817-1-git-send-email-aaron.lu@intel.com> <1489568404-7817-4-git-send-email-aaron.lu@intel.com> <20170322063335.GF30149@bbox> <20170322084103.GC2360@aaronlu.sh.intel.com> <4549498a-befc-133d-b204-dd69b191e579@intel.com>
In-Reply-To: <4549498a-befc-133d-b204-dd69b191e579@intel.com>
User-Agent: Mutt/1.8.0 (2017-02-23)
List-ID: linux-kernel@vger.kernel.org

On Thu, Mar 23, 2017 at 08:38:43AM -0700, Dave Hansen wrote:
> On 03/22/2017 01:41 AM, Aaron Lu wrote:
> > On Wed, Mar 22, 2017 at 03:33:35PM +0900, Minchan Kim wrote:
> >> On Wed, Mar 15, 2017 at 05:00:02PM +0800, Aaron Lu wrote:
> >>> Introduce a workqueue for all the free workers so that users can
> >>> fine-tune how many workers can be active through the sysfs interface
> >>> max_active. More workers will normally lead to better performance,
> >>> but too many can cause severe lock contention.
> >>
> >> Let me ask a question.
> >>
> >> How well can the workqueue distribute the jobs across multiple CPUs?
> >
> > I would say it's good enough for my needs.
> > After all, it doesn't need many kworkers to achieve the 50% time
> > decrease: 2-4 kworkers for EP and 4-8 kworkers for EX are enough,
> > per the previously attached data.
>
> It's also worth noting that we'd *also* like to look into increasing
> how scalable freeing pages to a given zone is.

Still on EX, I restricted the allocation to be only on node 1, with 120G
of memory allocated there:

max_active          time            compared to base  lock from perf
base (no parallel)  3.81s ±3.3%     N/A               <1%
 1                  3.10s ±7.7%     ↓18.6%            14.76%
 2                  2.44s ±13.6%    ↓35.9%            36.95%
 4                  2.07s ±13.6%    ↓45.6%            59.67%
 8                  1.98s ±0.4%     ↓48.0%            62.59%
16                  2.01s ±2.4%     ↓47.2%            79.62%

If we can improve the scalability of freeing pages to a given zone, then
parallel free will be able to achieve more.

BTW, the lock is basically pgdat->lru_lock in release_pages and
zone->lock in free_pcppages_bulk:

62.59%  62.59%  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
  37.17% native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;free_pcppages_bulk;free_hot_cold_page;free_hot_cold_page_list;release_pages;free_pages_and_swap_cache;tlb_flush_mmu_free_batches;batch_free_work;process_one_work;worker_thread;kthread;ret_from_fork
  25.27% native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;release_pages;free_pages_and_swap_cache;tlb_flush_mmu_free_batches;batch_free_work;process_one_work;worker_thread;kthread;ret_from_fork
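
[Editor's note: the mechanism under discussion is a dedicated workqueue
whose max_active limit is runtime-tunable through sysfs. The following
is a minimal sketch of that setup, not the actual patch; the thread only
confirms the worker function name batch_free_work, so the workqueue name
and init function here are hypothetical. Passing WQ_SYSFS makes the
queue visible under /sys/devices/virtual/workqueue/<name>/, where
max_active can be adjusted at runtime.]

#include <linux/workqueue.h>

static struct workqueue_struct *batch_free_wq;   /* hypothetical name */

static int __init batch_free_wq_init(void)
{
	/*
	 * WQ_UNBOUND: workers are not pinned to the submitting CPU, so
	 * free jobs can spread across the machine.
	 * WQ_SYSFS: expose the queue in sysfs so max_active can be
	 * tuned at runtime, e.g.:
	 *   echo 4 > /sys/devices/virtual/workqueue/batch_free_wq/max_active
	 * max_active = 0 requests the default concurrency limit.
	 */
	batch_free_wq = alloc_workqueue("batch_free_wq",
					WQ_UNBOUND | WQ_SYSFS, 0);
	if (!batch_free_wq)
		return -ENOMEM;
	return 0;
}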