From: Daniel J Blueman
Date: Sat, 23 May 2015 11:49:33 +0800
Subject: Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup
To: Waiman Long, Mel Gorman
Cc: Andrew Morton, nzimmer, Dave Hansen, Scott Norton, Linux-MM, LKML, Steffen Persvold
Message-Id: <1432352973.6359.2@cpanel21.proisp.no>
In-Reply-To: <555F6404.4010905@hp.com>

--
Daniel J Blueman
Principal Software Engineer, Numascale

On Sat, May 23, 2015 at 1:14 AM, Waiman Long wrote:
> On 05/22/2015 05:33 AM, Mel Gorman wrote:
>> On Fri, May 22, 2015 at 02:30:01PM +0800, Daniel J Blueman wrote:
>>> On Thu, May 14, 2015 at 6:03 PM, Daniel J Blueman wrote:
>>>> On Thu, May 14, 2015 at 12:31 AM, Mel Gorman wrote:
>>>>> On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
>>>>>> I just noticed a hang on my largest box.
>>>>>> I can only reproduce it with large core counts; if I turn down
>>>>>> the number of CPUs, it doesn't have the issue.
>>>>>>
>>>>> Odd.
>>>>> The number of cores should make little difference, as only one
>>>>> CPU per node should be in use. Does sysrq+t give any indication
>>>>> of how or where it is hanging?
>>>> I was seeing the same behaviour, with 1000ms increasing to 5500ms
>>>> [1]; this suggests either lock contention or O(n) behaviour.
>>>>
>>>> Nathan, can you check with this ordering of patches from Andrew's
>>>> cache [2]? I was getting hangs until I found them all.
>>>>
>>>> I'll follow up with timing data.
>>> 7TB over 216 NUMA nodes, 1728 cores, from kernel 4.0.4 load to
>>> login:
>>>
>>> 1. 2086s with patches 01-19 [1]
>>>
>>> 2. 2026s adding "Take into account that large system caches scale
>>> linearly with memory", which has:
>>> min(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));
>>>
>>> 3. 2442s fixing to:
>>> max(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));
>>>
>>> 4. 2064s adjusting the minimum and shift to:
>>> max(512UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));
>>>
>>> 5. 1934s adjusting the minimum and shift to:
>>> max(128UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));
>>>
>>> 6. 930s with #5 plus the non-temporal PMD init patch I had earlier
>>> proposed (I'll pursue that separately)
>>>
>>> The scaling patch isn't in -mm.
>> That patch was superseded by "mm: meminit: finish initialisation of
>> struct pages before basic setup" and
>> "mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix",
>> so that's OK.
>>
>> FWIW, I think you should still go ahead with the non-temporal
>> patches, because there is potential benefit there beyond
>> initialisation. If there were an arch-optional implementation of a
>> non-temporal clear, it would also be worth considering whether
>> __GFP_ZERO should use non-temporal stores.
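For concreteness, the min()-to-max() fix in the timing list above can be sanity-checked in userspace. This is only a sketch: PAGE_SHIFT, the helper names, and the example node size are assumptions; just the two expressions come from the patches discussed.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumed: 4 KiB pages, as on x86-64 */

/* Hypothetical helpers mirroring the two expressions in the timing
 * list: how many struct pages to initialise serially per node before
 * deferring the remainder. */
static uint64_t batch_min(uint64_t node_spanned_pages)
{
	uint64_t bound = 2ULL << (30 - PAGE_SHIFT); /* 2 GiB of pages */
	uint64_t frac = node_spanned_pages >> 3;    /* 1/8 of the node */
	return bound < frac ? bound : frac;         /* the original min() */
}

static uint64_t batch_max(uint64_t node_spanned_pages)
{
	uint64_t bound = 2ULL << (30 - PAGE_SHIFT);
	uint64_t frac = node_spanned_pages >> 3;
	return bound > frac ? bound : frac;         /* the fixed max() */
}
```

On an assumed 32 GiB node (8388608 4 KiB pages), min() caps the serial batch at 524288 pages while max() grows it to 1048576, i.e. the fix initialises strictly more memory up front on large nodes, consistent with the 2026s-to-2442s change between variants 2 and 3 above.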
>> At a greater stretch, it would be worth considering whether kswapd
>> should zero pages as it frees them, avoiding a zero on the
>> allocation side in the general case; that would be more generally
>> useful and a stepping stone towards what the "Sanitizing freed
>> pages" series attempts.

Good tip, Mel; I'll take a look when time allows and get some data,
though I expect it will only be a win where the clearing happens on a
different node than the allocation.

> I think the non-temporal patch mainly benefits AMD systems. I tried
> the patch on DragonHawk and it actually made boot a little slower. I
> think the Intel-optimised "rep stosb" instruction (used in memset)
> is performing well. I had done a similar test on the zero-page code
> and the performance gain was inconclusive.

I suspect 'rep stosb' on modern Intel hardware can write whole
cachelines atomically, avoiding the read-modify-write, or that the
read part of the RMW is optimally prefetched; open-coded stores just
can't reach the level of pipeline saturation that the microcode can.

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/