Subject: Re: [RFC PATCH] sched: Improve scalability of select_idle_sibling using SMT balance
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com
References: <20171208200754.4738-1-subhra.mazumdar@oracle.com> <20171219193648.55oxgpbosruavlby@hirez.programming.kicks-ass.net>
In-Reply-To: <20171219193648.55oxgpbosruavlby@hirez.programming.kicks-ass.net>
From: Subhra Mazumdar
Message-ID: <68d44ac6-854d-050c-d41d-e48847c4e1e8@oracle.com>
Date: Wed, 20 Dec 2017 14:19:36 -0800

On 12/19/2017 11:36 AM, Peter Zijlstra wrote:
> On Fri, Dec 08, 2017 at 12:07:54PM -0800, subhra mazumdar wrote:
>> +static inline void
>> +sd_context_switch(struct sched_domain *sd, struct rq *rq, int util)
>> +{
>> +	struct sched_group *sg_cpu;
>> +
>> +	/* atomically add/subtract the util */
>> +	sg_cpu = sd->sg_cpu;
>> +	if (util > 0)
>> +		atomic_inc(
>> +			(atomic_t *)(&(sg_cpu->utilization)));
>> +	else
>> +		atomic_dec(
>> +			(atomic_t *)(&(sg_cpu->utilization)));
> Whahah, lol, no!
This patch addresses the scalability problem in the current logic, which may end up searching all cores or CPUs of the LLC domain when scheduling a thread. It does so with a randomized approach: we determine a better core to schedule on by comparing the SMT utilization (i.e. how many CPUs are busy in a core) of two cores in the LLC domain, namely the core of the target CPU passed to select_idle_sibling() and a randomly chosen core. With accurate per-core accounting this comparison is O(1).

The SMT utilization is accounted during context switch, and the atomic operation is needed there for correctness. The cost of the atomic increment/decrement is minimal since it only happens for levels below the LLC, i.e. only for cores. My context_switch() timing data (included with the patch) shows the extra cost is minimal and mostly within the noise margin. I am open to suggestions on a better way of doing it.
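To make the intent concrete, here is a rough user-space sketch of the idea (hypothetical names, C11 atomics standing in for the kernel's atomic_t; this is not the patch code itself):

/*
 * Each core keeps a count of its busy SMT threads.  The count is bumped
 * when a CPU in the core switches from idle to non-idle and dropped on
 * the reverse transition, so choosing between two cores is a single
 * O(1) compare of two counters.
 */
#include <stdatomic.h>
#include <stdlib.h>

struct core_stat {
	atomic_int busy_smt;	/* number of non-idle CPUs in this core */
};

/* context-switch accounting for a CPU belonging to @core */
static void core_switch(struct core_stat *core, int prev_idle, int next_idle)
{
	if (prev_idle && !next_idle)
		atomic_fetch_add(&core->busy_smt, 1);
	else if (!prev_idle && next_idle)
		atomic_fetch_sub(&core->busy_smt, 1);
}

/* SMT balance: prefer the less loaded of the target core and one random core */
static struct core_stat *pick_core(struct core_stat *cores, int nr_cores,
				   struct core_stat *target)
{
	struct core_stat *candidate = &cores[rand() % nr_cores];

	if (atomic_load(&candidate->busy_smt) < atomic_load(&target->busy_smt))
		return candidate;
	return target;
}

The same comparison works for any pair of sched groups as long as the per-group busy count stays accurate across context switches.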
>
>> +}
>> +
>>  /*
>>   * context_switch - switch to the new MM and the new thread's register state.
>>   */
>> @@ -2751,6 +2766,51 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>  	       struct task_struct *next, struct rq_flags *rf)
>>  {
>>  	struct mm_struct *mm, *oldmm;
>> +	int this_cpu = rq->cpu;
>> +	struct sched_domain *sd;
>> +	unsigned int cond;
>> +
>> +	cond = ((prev != rq->idle) << 1) | (next != rq->idle);
>> +	sd = rcu_dereference(per_cpu(sd_llc, this_cpu));
> That one is RCU, not RCU-sched protected..
Ok
>
>> +	/*
>> +	 * From sd_llc downward update the SMT utilization.
>> +	 * Skip the lowest level 0.
>> +	 */
>> +	for_each_lower_domain(sd) {
>> +		if (sd->level == 0)
>> +			break;
>> +		if (rq->initial_util == UTIL_UNINITIALIZED) {
>> +			switch (cond) {
>> +			case PREV_IDLE_NEXT_NIDLE:
>> +			case PREV_NIDLE_NEXT_NIDLE:
>> +				sd_context_switch(sd, rq, SMT_THREAD_UTIL);
>> +				break;
>> +			case PREV_NIDLE_NEXT_IDLE:
>> +			case PREV_IDLE_NEXT_IDLE:
>> +				break;
>> +			}
>> +		} else {
>> +			switch (cond) {
>> +			case PREV_IDLE_NEXT_NIDLE:
>> +				sd_context_switch(sd, rq, SMT_THREAD_UTIL);
>> +				break;
>> +			case PREV_NIDLE_NEXT_IDLE:
>> +				sd_context_switch(sd, rq, -SMT_THREAD_UTIL);
>> +				break;
>> +			case PREV_IDLE_NEXT_IDLE:
>> +			case PREV_NIDLE_NEXT_NIDLE:
>> +				break;
>> +			}
>> +		}
>> +	}
>> +
>> +	if (sd) {
>> +		if (next == rq->idle)
>> +			rq->initial_util = UTIL_IDLE;
>> +		else
>> +			rq->initial_util = UTIL_BUSY;
>> +	}
> WTH do you even think this is reasonable?
I will clean this part up.
>
>>  	prepare_task_switch(rq, prev, next);
>>
> And I still have no idea what the patch does, but I can't be bothered to
> reverse engineer it just now.
I will improve the changelog to explain the logic better in v2.

Thanks,
Subhra