From: Yasuaki Ishimatsu
To: Wanpeng Li
Subject: Re: [PATCH] sched/fair: Care divide error in update_task_scan_period()
Date: Thu, 9 Oct 2014 14:34:54 +0900
Message-ID: <54361E7E.5090006@jp.fujitsu.com>
In-Reply-To: <54352545.8060605@gmail.com>
References: <5434DCFF.1040208@jp.fujitsu.com> <54352545.8060605@gmail.com>

(2014/10/08 20:51), Wanpeng Li wrote:
>
> On 10/8/14, 2:43 PM, Yasuaki Ishimatsu wrote:
>> While offlining a node by hot removing memory, the following divide
>> error occurs:
>>
>> divide error: 0000 [#1] SMP
>> [...]
>> Call Trace:
>>  [...] handle_mm_fault
>>  [...] ? try_to_wake_up
>>  [...] ? wake_up_state
>>  [...] __do_page_fault
>>  [...] ? do_futex
>>  [...] ? put_prev_entity
>>  [...] ? __switch_to
>>  [...] do_page_fault
>>  [...] page_fault
>> [...]
>> RIP  [] task_numa_fault
>>  RSP
>>
>> The issue occurs as follows:
>>
>> 1. When a page fault occurs and the page is allocated from node 1,
>>    task_struct->numa_faults_buffer_memory[] of node 1 is incremented
>>    and p->numa_faults_locality[] is also incremented, as follows:
>>
>>    o numa_faults_buffer_memory[]        o numa_faults_locality[]
>>              NR_NUMA_HINT_FAULT_TYPES
>>             |     0      |     1      |
>>    ----------------------------------   ----------------------
>>     node 0  |     0      |     0      |  remote |      0     |
>>     node 1  |     0      |     1      |  local  |      1     |
>>    ----------------------------------   ----------------------
>>
>> 2. node 1 is offlined by hot removing memory.
>>
>> 3. When a page fault occurs, fault_types[] is calculated from
>>    p->numa_faults_buffer_memory[] of all online nodes in
>>    task_numa_placement(). But node 1 was offlined by step 2, so
>>    fault_types[] is calculated using only p->numa_faults_buffer_memory[]
>>    of node 0, and both entries of fault_types[] end up 0.
>>
>> 4. These values (0) of fault_types[] are passed to
>>    update_task_scan_period().
>>
>> 5. numa_faults_locality[1] is set to 1, so the following division is
>>    calculated:
>>
>>    static void update_task_scan_period(struct task_struct *p,
>>                 unsigned long shared, unsigned long private)
>>    {
>>            ...
>>            ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
>>    }
>>
>> 6. But both private and shared are 0, so a divide error occurs here.
>>
>> The divide error is a rare case because the trigger is node offline.
>> With this patch, when both private and shared are 0, diff is simply
>> set to 0 and the division is not calculated.
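For reference, the failing expression in step 6 can be reproduced outside
the kernel. A minimal user-space sketch of the arithmetic (DIV_ROUND_UP
mirrors the macro in include/linux/kernel.h; NUMA_PERIOD_SLOTS is assumed
to be 10 as in kernel/sched/fair.c; the zero fault counts are the values
from the table above):

    #include <stdio.h>

    /* Same definition as DIV_ROUND_UP() in include/linux/kernel.h. */
    #define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))

    /* Assumed to match NUMA_PERIOD_SLOTS in kernel/sched/fair.c. */
    #define NUMA_PERIOD_SLOTS   10

    int main(void)
    {
            /* After node 1 goes offline, both counters are 0 (step 6). */
            unsigned long private = 0, shared = 0;
            unsigned long ratio;

            /* (private + shared) == 0: integer division by zero (SIGFPE). */
            ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));

            printf("ratio = %lu\n", ratio);
            return 0;
    }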
>>
>> Signed-off-by: Yasuaki Ishimatsu
>> ---
>>  kernel/sched/fair.c | 30 +++++++++++++++++++-----------
>>  1 file changed, 19 insertions(+), 11 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index bfa3c86..fb7dc3f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1496,18 +1496,26 @@ static void update_task_scan_period(struct task_struct *p,
>>                  slot = 1;
>>                  diff = slot * period_slot;
>>          } else {
>> -                diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
>> +                if (unlikely((private + shared) == 0))
>> +                        /*
>> +                         * This is a rare case. The trigger is node offline.
>> +                         */
>> +                        diff = 0;
>> +                else {
>> +                        diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
>>
>> -                /*
>> -                 * Scale scan rate increases based on sharing. There is an
>> -                 * inverse relationship between the degree of sharing and
>> -                 * the adjustment made to the scanning period. Broadly
>> -                 * speaking the intent is that there is little point
>> -                 * scanning faster if shared accesses dominate as it may
>> -                 * simply bounce migrations uselessly
>> -                 */
>> -                ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
>> -                diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
>> +                        /*
>> +                         * Scale scan rate increases based on sharing. There is
>> +                         * an inverse relationship between the degree of sharing
>> +                         * and the adjustment made to the scanning period.
>> +                         * Broadly speaking the intent is that there is little
>> +                         * point scanning faster if shared accesses dominate as
>> +                         * it may simply bounce migrations uselessly
>> +                         */
>> +                        ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS,
>> +                                             (private + shared));
>> +                        diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
>> +                }
>>          }
>
> How about just
>
>         ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared + 1));

Thank you for providing the sample code. Rik also provided another idea,
so I am confirming which one is better.

Thanks,
Yasuaki Ishimatsu

>
> Regards,
> Wanpeng Li
>
>>          p->numa_scan_period = clamp(p->numa_scan_period + diff,
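For comparison, the DIV_ROUND_UP(..., (private + shared + 1)) variant
suggested above sidesteps the zero denominator by biasing the divisor, at
the cost of slightly understating the ratio when the counts are non-zero.
A quick user-space sketch of that trade-off (ratio_biased() is a
hypothetical helper; NUMA_PERIOD_SLOTS is assumed to be 10 as in
kernel/sched/fair.c):

    #include <stdio.h>

    #define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))
    #define NUMA_PERIOD_SLOTS   10  /* assumed, as in kernel/sched/fair.c */

    /* The suggested variant: keep the divisor from ever being zero. */
    static unsigned long ratio_biased(unsigned long private, unsigned long shared)
    {
            return DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared + 1));
    }

    int main(void)
    {
            /* The crash case: no fault recorded on any online node. */
            printf("private=0 shared=0 -> ratio=%lu\n", ratio_biased(0, 0));

            /* Hypothetical non-zero counts: the +1 skews the result
             * (DIV_ROUND_UP(30, 4) would be 8, the biased form gives 6). */
            printf("private=3 shared=1 -> ratio=%lu\n", ratio_biased(3, 1));

            return 0;
    }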