Date: Tue, 13 Aug 2013 10:51:37 -0700
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator
From: Linus Torvalds
To: Mike Travis
Cc: Nathan Zimmer, Peter Anvin, Ingo Molnar, Linux Kernel Mailing List, linux-mm, Robin Holt, Rob Landley, Daniel J Blueman, Andrew Morton, Greg Kroah-Hartman, Yinghai Lu, Mel Gorman

On Tue, Aug 13, 2013 at 10:33 AM, Mike Travis wrote:
>
> Initially this patch set consisted of diverting a major portion of the
> memory to an "absent" list during e820 processing. A very late initcall
> was then used to dispatch a cpu per node to add that node's absent
> memory. By nature these ran in parallel so Nathan did the work to
> "parallelize" various global resource locks to become per node locks.

So quite frankly, I'm not sure how worthwhile it even is to parallelize
the thing. I realize that some environments may care about getting up
to full memory population very quickly, but I think it would be very
rare and specialized, and shouldn't necessarily be part of the initial
patches.

And it really doesn't have to be an initcall at all - at least not a
synchronous one.
A late initcall to get the process *started*, but the process
itself could easily be done with a separate thread asynchronously, and
let the machine boot up while that thread is going.

And in fact, I'd argue that instead of trying to make it fast and
parallelize things excessively, you might want to make the memory
initialization *slow*, and make all the rest of the bootup have higher
priority.

At that point, who cares if it takes 400 seconds to get all memory
initialized? In fact, who cares if it takes twice that? Let's assume
that the rest of the boot takes 30s (which is pretty aggressive for
some big server with terabytes of memory), even if the memory
initialization was running in the background and only during idle time
for probing, I'm sure you'd have a few hundred gigs of RAM initialized
by the time you can log in. And if it then takes another ten minutes
until you have the full 16TB initialized, and some things might be a
tad slower early on, does anybody really care? The machine will be up
and running with plenty of memory, even if it may not be *all* the
memory yet.

I realize that benchmarking cares, and yes, I also realize that some
benchmarks actually want to reboot the machine between some runs just
to get repeatability, but if you're benchmarking a 16TB machine I'm
guessing any serious benchmark that actually uses that much memory is
going to take many hours to a few days to run anyway? Having some way
to wait until the memory is all done (which might even be just a silly
shell script that does "ps" and waits for the kernel threads to all go
away) isn't going to kill the benchmark - and the benchmark itself
will then not have to worry about hitting the "oops, I need to
initialize 2GB of RAM now because I hit an uninitialized page".

Ok, so I don't know all the issues, and in many ways I don't even
really care. You could do it other ways, I don't think this is a big
deal.
The part I hate is the runtime hook into the core MM page allocation
code, so I'm just throwing out any random thing that comes to my mind
that could be used to avoid that part.

             Linus
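
For illustration only, the "wait until the memory is all done" idea mentioned above (poll "ps" until the background initialization threads are gone) could look roughly like the sketch below. It is written in Python rather than shell, and it rests on assumptions that are not in the thread: the kernel threads are taken to have names starting with a common prefix, and the prefix "meminit" plus the helper names list_comm_names and wait_for_threads are hypothetical.

```python
import subprocess
import time
from typing import Callable, List


def list_comm_names() -> List[str]:
    """Return the command name of every process, as reported by `ps`."""
    out = subprocess.run(
        ["ps", "-A", "-o", "comm="],  # all processes, name column only, no header
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def wait_for_threads(
    prefix: str,
    list_names: Callable[[], List[str]] = list_comm_names,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until no process name starts with `prefix`.

    `prefix` is a hypothetical name for the background init threads.
    Returns how many polls still saw a matching thread.
    """
    polls = 0
    while any(name.startswith(prefix) for name in list_names()):
        polls += 1
        sleep(poll_seconds)
    return polls
```

A benchmark harness would call wait_for_threads("meminit") once before starting its timed run. The process lister and the sleep function are passed in as parameters only so the polling loop can be exercised without a real machine in that state.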