From: Erich Focht
To: "Martin J. Bligh", linux-kernel
Cc: LSE, Ingo Molnar, Michael Hohnbaum
Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler
Date: Tue, 24 Sep 2002 23:04:44 +0200
Message-Id: <200209242304.44799.efocht@ess.nec.de>
In-Reply-To: <170330281.1032781640@[10.10.2.3]>
References: <200209232038.15039.efocht@ess.nec.de> <170330281.1032781640@[10.10.2.3]>

On Monday 23 September 2002 20:47, Martin J. Bligh wrote:
> > I have two problems with this approach:
> > 1: Freeing memory is quite expensive, as it currently involves finding
> > the maximum of the array node_mem[].
>
> Bleh ... why? This needs to be calculated much more lazily than this,
> or you're going to kick the hell out of any cache affinity. Can you
> recalc this in the rebalance code or something instead?

You're right, that would be too slow. I'm thinking of marking the tasks
that need recalculation and updating their homenode when their runqueue
is scanned for a task to be stolen.

> > 2: I have no idea how tasks sharing the mm structure will behave. I'd
> > like them to run on different nodes (that's why node_mem is not in mm),
> > but they could (legally) free pages which they did not allocate and
> > have wrong values in node_mem[].
>
> Yes, that really ought to be per-process, not per task. Which means
> locking or atomics ... and overhead. Ick.

Hmm, I think it is sometimes OK to have it per task. Take, for example,
OpenMP parallel jobs working on huge arrays: the "first touch" of these
arrays leads to page faults generated by the different tasks and thus a
different node_mem[] array for each task. As long as they just allocate
memory, all is well, and the same holds if they only release it at the
end of the job. This probably goes wrong if we have a long-running task
that spawns short-lived clones: they inherit node_mem from the parent,
but pages they add to the common mm are not reflected in the parent's
node_mem after they die.

> For the first cut of the NUMA sched, maybe you could just leave page
> allocation alone, and do that separately? Or is that what the second
> patch was meant to be?

The first patch needs a correction: in load_balance(), add

	if (!busiest)
		goto out;

after the call to find_busiest_queue(). With that fix the first patch
works on its own. On top of this pooling NUMA scheduler we can put
whichever node affinity approach fits best, with or without memory
allocation.

I'll update the patches and their setup code (thanks for the comments!)
and resend them.

Regards,
Erich
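
For context, a minimal sketch of where that check sits, assuming the
shape of load_balance() in the stock 2.5-era O(1) scheduler
(kernel/sched.c); the patched function may differ, and the task-stealing
body is omitted here:

	static void load_balance(runqueue_t *this_rq, int idle)
	{
		int imbalance, this_cpu = smp_processor_id();
		runqueue_t *busiest;

		busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
		/* find_busiest_queue() returns NULL when no runqueue is
		 * loaded enough to be worth stealing from; without the
		 * check below that NULL would be dereferenced. */
		if (!busiest)
			goto out;

		/* ... pick a task from busiest and pull it over ... */
	out:
		;
	}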
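
And a rough sketch of the lazy homenode recalculation idea discussed
above, assuming a per-task node_mem[] page counter per node as in the
patch; the helper name and the homenode/need_recalc fields are
hypothetical illustrations, not code from the patch:

	/* Hypothetical helper: recompute the homenode as the node
	 * holding most of the task's pages.  Meant to be called from
	 * the load balancer while it is already scanning the task's
	 * runqueue, so the O(numnodes) search stays off the page
	 * allocation and freeing paths. */
	static void update_homenode(task_t *p)
	{
		int nid, best = 0;

		for (nid = 1; nid < numnodes; nid++)
			if (p->node_mem[nid] > p->node_mem[best])
				best = nid;
		p->homenode = best;
		p->need_recalc = 0;
	}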