Date: Wed, 27 Sep 2017 13:11:28 +0200
From: Borislav Petkov
To: Suravee Suthikulpanit
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org
Subject: Re: [PATCH v3] sched/topology: Introduce NUMA identity node sched domain
Message-ID: <20170927111128.rh4hmlymqroulp4c@pd.tnic>
References: <1504768805-46716-1-git-send-email-suravee.suthikulpanit@amd.com>
In-Reply-To: <1504768805-46716-1-git-send-email-suravee.suthikulpanit@amd.com>

On Thu, Sep 07, 2017 at 02:20:05AM -0500, Suravee Suthikulpanit wrote:
> On AMD Family17h-based (EPYC) system, a logical NUMA node can contain

Let's simply spell it F17h like we did for the older families.

> upto 8 cores (16 threads) with the following topology.
>
>            ----------------------------
>     C0  | T0 T1 |    ||    | T0 T1 | C4
>         --------|    ||    |--------
>     C1  | T0 T1 | L3 || L3 | T0 T1 | C5
>         --------|    ||    |--------
>     C2  | T0 T1 | #0 || #1 | T0 T1 | C6
>         --------|    ||    |--------
>     C3  | T0 T1 |    ||    | T0 T1 | C7
>            ----------------------------
>
> Here, there are 2 last-level (L3) caches per logical NUMA node.
> A socket can contain upto 4 NUMA nodes, and a system can support
> upto 2 sockets. With full system configuration, current scheduler
> creates 4 sched domains:
>
>   domain0 SMT  (span a core)
>   domain1 MC   (span a last-level-cache)
>   domain2 NUMA (span a socket: 4 nodes)
>   domain3 NUMA (span a system: 8 nodes)
>
> Note that there is no domain to represent cpus spaning a logical

s/cpus/CPUs/

s/spaning/spanning/

Please introduce a spellchecker into your patch creation workflow.

> NUMA node. With this hierarchy of sched domains, the scheduler does
> not balance properly in the following cases:
>
> Case1:
>     When running 8 tasks, a properly balanced system should
>     schedule a task per logical NUMA node. This is not the case for
>     the current scheduler.

I'd like to have a sentence or two here saying what the problem is,
i.e., how do the 8 tasks get placed...

>
> Case2:
>     In some cases, threads are scheduled on the same cpu, while other

s/cpu/CPU/

> cpus are idle.

... like this sentence, for example, explaining what happens without
that patch.

> This results in run-to-run inconsistency. For example:
>
>   taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
>                           --cpu-max-prime=100000 run
>
> Total execution time ranges from 25.1s to 33.5s depending on threads
> placement, where 25.1s is when all 8 threads are balanced properly
> on 8 cpus.

s/cpus/CPUs/

Please check the whole patch (comments, etc).

> Introducing NUMA identity node sched domain, which is based on how

"Introduce... "

no ing form but plain procedural, do this, do that.

> SRAT/SLIT table define a logical NUMA node. This results in the following
> hierarchy of sched domains on the same system described above.
>
>   domain0 SMT  (span a core)
>   domain1 MC   (span a last-level-cache)
>   domain2 NODE (span a logical NUMA node)
>   domain3 NUMA (span a socket: 4 nodes)
>   domain4 NUMA (span a system: 8 nodes)
>
> This fixes the improper load balancing cases mentioned above.
>
> Note that in case cpumask of the last-level-cache and NODE domains
> are the same (e.g. on AMD family10h/15h servers), the NODE domain

As above.

> will be excluded. Therefore, this change will not affect those systems.

Right, and this is running on *all* machines, not only AMD or x86.

Why doesn't it affect others? The degenerate code maybe?

...

> @@ -1445,9 +1448,26 @@ void sched_init_numa(void)
>  		tl[i] = sched_domain_topology[i];
>
>  	/*
> +	 * Do not setup NUMA node level if it has the same cpumask
> +	 * as sched domain at previous level. This is the case for
> +	 * system with:
> +	 *   LLC == NODE : LLC (MC) sched domain span a NUMA node.
> +	 *   DIE == NODE : DIE sched domain span a NUMA node.
> +	 *
> +	 * Assume all NUMA nodes are identical, so only check node 0.
> +	 */
> +	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0))) {
> +		tl[i++] = (struct sched_domain_topology_level){
> +			.mask = sd_numa_mask,
> +			.numa_level = 0,
> +			SD_INIT_NAME(NODE)
> +		};
> +	}

Right, I think the issue wrt the degenerate code is not fully discussed
yet judging by:

https://lkml.kernel.org/r/f85d6d5d-64b7-7e08-939f-b321e5f05949@amd.com
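To be clear, what I mean with "the degenerate code" is the logic which
collapses a sched domain level whose CPU span is identical to that of
its child and which adds nothing on top. Roughly something like this
(a simplified sketch, not the actual implementation - the helper name
is made up for illustration):

  /*
   * Sketch only: a parent domain spanning the same CPUs as its child
   * and setting no additional flags contributes nothing to load
   * balancing and can be collapsed.
   */
  static bool parent_is_redundant(struct sched_domain *sd)
  {
          struct sched_domain *parent = sd->parent;

          if (!parent)
                  return false;

          return cpumask_equal(sched_domain_span(sd),
                               sched_domain_span(parent)) &&
                 !(parent->flags & ~sd->flags);
  }

If it is rather the cpumask_equal() check in sched_init_numa() above
that you're relying on, please spell that out in the commit message.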
-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 