2014-10-08 06:44:12

by Yasuaki Ishimatsu

Subject: [PATCH] sched/fair: Care divide error in update_task_scan_period()

While offlining a node by hot removing memory, the following divide
error occurs:

divide error: 0000 [#1] SMP
[...]
Call Trace:
[...] handle_mm_fault
[...] ? try_to_wake_up
[...] ? wake_up_state
[...] __do_page_fault
[...] ? do_futex
[...] ? put_prev_entity
[...] ? __switch_to
[...] do_page_fault
[...] page_fault
[...]
RIP [<ffffffff810a7081>] task_numa_fault
RSP <ffff88084eb2bcb0>

The issue occurs as follows:
1. When a page fault occurs and the page is allocated from node 1,
   task_struct->numa_faults_buffer_memory[] of node 1 is
   incremented and p->numa_faults_locality[] is also incremented,
   as follows:

   o numa_faults_buffer_memory[]           o numa_faults_locality[]
        NR_NUMA_HINT_FAULT_TYPES
          |     0     |     1     |
   --------------------------------        ----------------------
   node 0 |     0     |     0     |        remote |      0     |
   node 1 |     0     |     1     |        local  |      1     |
   --------------------------------        ----------------------

2. node 1 is offlined by hot removing memory.

3. When a page fault occurs, fault_types[] is calculated in
   task_numa_placement() from p->numa_faults_buffer_memory[] of all
   online nodes. But node 1 was offlined in step 2, so fault_types[]
   is calculated from p->numa_faults_buffer_memory[] of node 0 only,
   and both entries of fault_types[] end up 0 (a simplified sketch
   of this loop follows the list).

4. The values (0) of fault_types[] are passed to
   update_task_scan_period().

5. numa_faults_locality[1] is set to 1, so the following division
   is calculated:

   static void update_task_scan_period(struct task_struct *p,
                           unsigned long shared, unsigned long private)
   {
           ...
           ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
   }

6. But both private and shared are 0, so the divisor
   (private + shared) is 0 and the divide error occurs here.
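
A simplified sketch of the accumulation loop referenced in step 3
(simplified: the decay and group/CPU-fault bookkeeping that
task_numa_placement() also does are elided; names follow
kernel/sched/fair.c of this era):

        unsigned long fault_types[NR_NUMA_HINT_FAULT_TYPES] = { 0, 0 };
        int nid, priv;

        for_each_online_node(nid) {     /* node 1 is skipped after step 2 */
                for (priv = 0; priv < NR_NUMA_HINT_FAULT_TYPES; priv++) {
                        int i = task_faults_idx(nid, priv);

                        fault_types[priv] += p->numa_faults_buffer_memory[i];
                        p->numa_faults_buffer_memory[i] = 0;
                }
        }

        /*
         * fault_types[0] and fault_types[1] are then handed to
         * update_task_scan_period() as shared and private; in the
         * scenario above both are 0.
         */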

The divide error is a rare case because the trigger is node offlining.
With this patch, when both private and shared are 0, diff is simply set
to 0 and the division is not performed. (A minimal user-space sketch
reproducing the fault follows.)
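
The fault itself is easy to reproduce in user space. A minimal sketch,
assuming only the DIV_ROUND_UP definition from include/linux/kernel.h
and NUMA_PERIOD_SLOTS == 10 as in kernel/sched/fair.c; built without
optimization it dies with SIGFPE, the user-space delivery of #DE:

        #include <stdio.h>

        #define NUMA_PERIOD_SLOTS 10
        #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

        int main(void)
        {
                unsigned long shared = 0, private = 0;

                /* Divisor is 0, so this integer division faults. */
                unsigned long ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS,
                                                   private + shared);

                printf("ratio = %lu\n", ratio);
                return 0;
        }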

Signed-off-by: Yasuaki Ishimatsu <[email protected]>
---
kernel/sched/fair.c | 30 +++++++++++++++++++-----------
1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bfa3c86..fb7dc3f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1496,18 +1496,26 @@ static void update_task_scan_period(struct task_struct *p,
                         slot = 1;
                 diff = slot * period_slot;
         } else {
-                diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
+                if (unlikely((private + shared) == 0))
+                        /*
+                         * This is a rare case. The trigger is node offline.
+                         */
+                        diff = 0;
+                else {
+                        diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;

-                /*
-                 * Scale scan rate increases based on sharing. There is an
-                 * inverse relationship between the degree of sharing and
-                 * the adjustment made to the scanning period. Broadly
-                 * speaking the intent is that there is little point
-                 * scanning faster if shared accesses dominate as it may
-                 * simply bounce migrations uselessly
-                 */
-                ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
-                diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
+                        /*
+                         * Scale scan rate increases based on sharing. There is
+                         * an inverse relationship between the degree of sharing
+                         * and the adjustment made to the scanning period.
+                         * Broadly speaking the intent is that there is little
+                         * point scanning faster if shared accesses dominate as
+                         * it may simply bounce migrations uselessly
+                         */
+                        ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS,
+                                             (private + shared));
+                        diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
+                }
         }

         p->numa_scan_period = clamp(p->numa_scan_period + diff,
--
1.8.3.1


2014-10-08 08:31:46

by Peter Zijlstra

Subject: Re: [PATCH] sched/fair: Care divide error in update_task_scan_period()

On Wed, Oct 08, 2014 at 03:43:11PM +0900, Yasuaki Ishimatsu wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bfa3c86..fb7dc3f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1496,18 +1496,26 @@ static void update_task_scan_period(struct task_struct *p,
>                          slot = 1;
>                  diff = slot * period_slot;
>          } else {
> -                diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
> +                if (unlikely((private + shared) == 0))
> +                        /*
> +                         * This is a rare case. The trigger is node offline.
> +                         */
> +                        diff = 0;
> +                else {
> +                        diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
>
> -                /*
> -                 * Scale scan rate increases based on sharing. There is an
> -                 * inverse relationship between the degree of sharing and
> -                 * the adjustment made to the scanning period. Broadly
> -                 * speaking the intent is that there is little point
> -                 * scanning faster if shared accesses dominate as it may
> -                 * simply bounce migrations uselessly
> -                 */
> -                ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
> -                diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
> +                        /*
> +                         * Scale scan rate increases based on sharing. There is
> +                         * an inverse relationship between the degree of sharing
> +                         * and the adjustment made to the scanning period.
> +                         * Broadly speaking the intent is that there is little
> +                         * point scanning faster if shared accesses dominate as
> +                         * it may simply bounce migrations uselessly
> +                         */
> +                        ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS,
> +                                             (private + shared));
> +                        diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
> +                }
>          }
>
>          p->numa_scan_period = clamp(p->numa_scan_period + diff,

Yeah, so I don't like the patch nor do I really like the function as it
stands -- which I suppose is part of why I don't like the patch.

The problem I have with the function is that it's very inconsistent in
behaviour. In the early return path it sets numa_scan_period and
numa_next_scan; in the later return path it sets numa_scan_period and
numa_faults_locality.

I feel both return paths should affect the same set of variables;
especially the non-clearing of numa_faults_locality[] in the early path
seems weird.

The thing I suppose I don't like about the patch is its added
indentation and the fact that the simple +1 thing wasn't considered.
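
For reference, the two paths, condensed (abridged from the function as
it stood around this kernel; the intermediate ratio/diff math is
elided):

        static void update_task_scan_period(struct task_struct *p,
                                unsigned long shared, unsigned long private)
        {
                unsigned long local = p->numa_faults_locality[1];
                int diff;

                /*
                 * Early return path: no recorded hinting faults, scan
                 * slower. Sets numa_scan_period and numa_next_scan but
                 * leaves numa_faults_locality[] uncleared.
                 */
                if (local + shared == 0) {
                        p->numa_scan_period = min(p->numa_scan_period_max,
                                p->numa_scan_period << 1);
                        p->mm->numa_next_scan = jiffies +
                                msecs_to_jiffies(p->numa_scan_period);
                        return;
                }

                /* ... ratio/diff computation, incl. the faulting division ... */

                /*
                 * Late return path: sets numa_scan_period and clears
                 * numa_faults_locality[], but never touches numa_next_scan.
                 */
                p->numa_scan_period = clamp(p->numa_scan_period + diff,
                                task_scan_min(p), task_scan_max(p));
                memset(p->numa_faults_locality, 0,
                       sizeof(p->numa_faults_locality));
        }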

2014-10-08 11:52:05

by Wanpeng Li

Subject: Re: [PATCH] sched/fair: Care divide error in update_task_scan_period()


On 10/8/14, 2:43 PM, Yasuaki Ishimatsu wrote:
> While offlining a node by hot removing memory, the following divide
> error occurs:
>
> divide error: 0000 [#1] SMP
> [...]
> Call Trace:
> [...] handle_mm_fault
> [...] ? try_to_wake_up
> [...] ? wake_up_state
> [...] __do_page_fault
> [...] ? do_futex
> [...] ? put_prev_entity
> [...] ? __switch_to
> [...] do_page_fault
> [...] page_fault
> [...]
> RIP [<ffffffff810a7081>] task_numa_fault
> RSP <ffff88084eb2bcb0>
>
> The issue occurs as follows:
> 1. When a page fault occurs and the page is allocated from node 1,
>    task_struct->numa_faults_buffer_memory[] of node 1 is
>    incremented and p->numa_faults_locality[] is also incremented,
>    as follows:
>
>    o numa_faults_buffer_memory[]           o numa_faults_locality[]
>         NR_NUMA_HINT_FAULT_TYPES
>           |     0     |     1     |
>    --------------------------------        ----------------------
>    node 0 |     0     |     0     |        remote |      0     |
>    node 1 |     0     |     1     |        local  |      1     |
>    --------------------------------        ----------------------
>
> 2. node 1 is offlined by hot removing memory.
>
> 3. When a page fault occurs, fault_types[] is calculated in
>    task_numa_placement() from p->numa_faults_buffer_memory[] of all
>    online nodes. But node 1 was offlined in step 2, so fault_types[]
>    is calculated from p->numa_faults_buffer_memory[] of node 0 only,
>    and both entries of fault_types[] end up 0.
>
> 4. The values (0) of fault_types[] are passed to
>    update_task_scan_period().
>
> 5. numa_faults_locality[1] is set to 1, so the following division
>    is calculated:
>
>    static void update_task_scan_period(struct task_struct *p,
>                            unsigned long shared, unsigned long private)
>    {
>            ...
>            ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
>    }
>
> 6. But both private and shared are 0, so the divisor
>    (private + shared) is 0 and the divide error occurs here.
>
> The divide error is a rare case because the trigger is node offlining.
> With this patch, when both private and shared are 0, diff is simply set
> to 0 and the division is not performed.
>
> Signed-off-by: Yasuaki Ishimatsu <[email protected]>
> ---
> kernel/sched/fair.c | 30 +++++++++++++++++++-----------
> 1 file changed, 19 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bfa3c86..fb7dc3f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1496,18 +1496,26 @@ static void update_task_scan_period(struct task_struct *p,
>                          slot = 1;
>                  diff = slot * period_slot;
>          } else {
> -                diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
> +                if (unlikely((private + shared) == 0))
> +                        /*
> +                         * This is a rare case. The trigger is node offline.
> +                         */
> +                        diff = 0;
> +                else {
> +                        diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
>
> -                /*
> -                 * Scale scan rate increases based on sharing. There is an
> -                 * inverse relationship between the degree of sharing and
> -                 * the adjustment made to the scanning period. Broadly
> -                 * speaking the intent is that there is little point
> -                 * scanning faster if shared accesses dominate as it may
> -                 * simply bounce migrations uselessly
> -                 */
> -                ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
> -                diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
> +                        /*
> +                         * Scale scan rate increases based on sharing. There is
> +                         * an inverse relationship between the degree of sharing
> +                         * and the adjustment made to the scanning period.
> +                         * Broadly speaking the intent is that there is little
> +                         * point scanning faster if shared accesses dominate as
> +                         * it may simply bounce migrations uselessly
> +                         */
> +                        ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS,
> +                                             (private + shared));
> +                        diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
> +                }
>          }

How about just

ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared + 1));
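
(Worked through for the failing case: with private == 0 and shared == 0
the divisor becomes 1, so ratio = DIV_ROUND_UP(0, 1) = 0, and the
subsequent diff = (diff * ratio) / NUMA_PERIOD_SLOTS collapses to 0;
that is the same net effect as the patch above, without the extra
branch. For non-zero totals the +1 only nudges the rounded ratio.)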


Regards,
Wanpeng Li

> p->numa_scan_period = clamp(p->numa_scan_period + diff,

2014-10-08 16:45:05

by Rik van Riel

Subject: Re: [PATCH] sched/fair: Care divide error in update_task_scan_period()

On 10/08/2014 02:43 AM, Yasuaki Ishimatsu wrote:

> The divide error is a rare case because the trigger is node offlining.
> With this patch, when both private and shared are 0, diff is simply set
> to 0 and the division is not performed.

How about a simple

        if ((private + shared) == 0)
                return;

higher up in the function, to avoid adding an extra
layer of indentation and confusion to the main part
of the function?
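
For placement, a sketch (hypothetical; taking "higher up" to mean after
the existing no-hinting-faults early return, before any ratio math):

        if (local + shared == 0) {
                /* existing early return: back off the scan rate */
                ...
                return;
        }

        /* proposed: no faults to derive a private/shared ratio from,
         * e.g. right after the only faulting node went offline */
        if ((private + shared) == 0)
                return;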

2014-10-08 16:54:09

by Peter Zijlstra

Subject: Re: [PATCH] sched/fair: Care divide error in update_task_scan_period()

On Wed, Oct 08, 2014 at 12:42:24PM -0400, Rik van Riel wrote:
> On 10/08/2014 02:43 AM, Yasuaki Ishimatsu wrote:
>
> > The divide error is a rare case because the trigger is node offlining.
> > With this patch, when both private and shared are 0, diff is simply set
> > to 0 and the division is not performed.
>
> How about a simple
>
>         if ((private + shared) == 0)
>                 return;
>
> higher up in the function, to avoid adding an extra
> layer of indentation and confusion to the main part
> of the function?

At which point we'll have 3 different return semantics. Should we not
clear numa_faults_locality[], even in this case?

2014-10-09 05:20:04

by Yasuaki Ishimatsu

Subject: Re: [PATCH] sched/fair: Care divide error in update_task_scan_period()

(2014/10/09 1:54), Peter Zijlstra wrote:
> On Wed, Oct 08, 2014 at 12:42:24PM -0400, Rik van Riel wrote:
>> On 10/08/2014 02:43 AM, Yasuaki Ishimatsu wrote:
>>
>>> The divide error is a rare case because the trigger is node offlining.
>>> With this patch, when both private and shared are 0, diff is simply set
>>> to 0 and the division is not performed.
>>
>> How about a simple
>>
>>         if ((private + shared) == 0)
>>                 return;
>>
>> higher up in the function, to avoid adding an extra
>> layer of indentation and confusion to the main part
>> of the function?
>
> At which point we'll have 3 different return semantics. Should we not
> clear numa_faults_locality[], even in this case?
>

I'm not familiar with the NUMA balancing feature, so I would like to
know this too. If it's not necessary to clear numa_faults_locality[],
I'll apply the idea.

Thanks,
Yasuaki Ishimatsu

2014-10-09 05:35:51

by Yasuaki Ishimatsu

Subject: Re: [PATCH] sched/fair: Care divide error in update_task_scan_period()

(2014/10/08 20:51), Wanpeng Li wrote:
>
> On 10/8/14, 2:43 PM, Yasuaki Ishimatsu wrote:
>> While offlining a node by hot removing memory, the following divide
>> error occurs:
>>
>> divide error: 0000 [#1] SMP
>> [...]
>> Call Trace:
>> [...] handle_mm_fault
>> [...] ? try_to_wake_up
>> [...] ? wake_up_state
>> [...] __do_page_fault
>> [...] ? do_futex
>> [...] ? put_prev_entity
>> [...] ? __switch_to
>> [...] do_page_fault
>> [...] page_fault
>> [...]
>> RIP [<ffffffff810a7081>] task_numa_fault
>> RSP <ffff88084eb2bcb0>
>>
>> The issue occurs as follows:
>> 1. When a page fault occurs and the page is allocated from node 1,
>>    task_struct->numa_faults_buffer_memory[] of node 1 is
>>    incremented and p->numa_faults_locality[] is also incremented,
>>    as follows:
>>
>>    o numa_faults_buffer_memory[]           o numa_faults_locality[]
>>         NR_NUMA_HINT_FAULT_TYPES
>>           |     0     |     1     |
>>    --------------------------------        ----------------------
>>    node 0 |     0     |     0     |        remote |      0     |
>>    node 1 |     0     |     1     |        local  |      1     |
>>    --------------------------------        ----------------------
>>
>> 2. node 1 is offlined by hot removing memory.
>>
>> 3. When a page fault occurs, fault_types[] is calculated in
>>    task_numa_placement() from p->numa_faults_buffer_memory[] of all
>>    online nodes. But node 1 was offlined in step 2, so fault_types[]
>>    is calculated from p->numa_faults_buffer_memory[] of node 0 only,
>>    and both entries of fault_types[] end up 0.
>>
>> 4. The values (0) of fault_types[] are passed to
>>    update_task_scan_period().
>>
>> 5. numa_faults_locality[1] is set to 1, so the following division
>>    is calculated:
>>
>>    static void update_task_scan_period(struct task_struct *p,
>>                            unsigned long shared, unsigned long private)
>>    {
>>            ...
>>            ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
>>    }
>>
>> 6. But both private and shared are 0, so the divisor
>>    (private + shared) is 0 and the divide error occurs here.
>>
>> The divide error is a rare case because the trigger is node offlining.
>> With this patch, when both private and shared are 0, diff is simply set
>> to 0 and the division is not performed.
>>
>> Signed-off-by: Yasuaki Ishimatsu <[email protected]>
>> ---
>> kernel/sched/fair.c | 30 +++++++++++++++++++-----------
>> 1 file changed, 19 insertions(+), 11 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index bfa3c86..fb7dc3f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1496,18 +1496,26 @@ static void update_task_scan_period(struct task_struct *p,
>>                          slot = 1;
>>                  diff = slot * period_slot;
>>          } else {
>> -                diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
>> +                if (unlikely((private + shared) == 0))
>> +                        /*
>> +                         * This is a rare case. The trigger is node offline.
>> +                         */
>> +                        diff = 0;
>> +                else {
>> +                        diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
>>
>> -                /*
>> -                 * Scale scan rate increases based on sharing. There is an
>> -                 * inverse relationship between the degree of sharing and
>> -                 * the adjustment made to the scanning period. Broadly
>> -                 * speaking the intent is that there is little point
>> -                 * scanning faster if shared accesses dominate as it may
>> -                 * simply bounce migrations uselessly
>> -                 */
>> -                ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
>> -                diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
>> +                        /*
>> +                         * Scale scan rate increases based on sharing. There is
>> +                         * an inverse relationship between the degree of sharing
>> +                         * and the adjustment made to the scanning period.
>> +                         * Broadly speaking the intent is that there is little
>> +                         * point scanning faster if shared accesses dominate as
>> +                         * it may simply bounce migrations uselessly
>> +                         */
>> +                        ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS,
>> +                                             (private + shared));
>> +                        diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
>> +                }
>>          }
>

> How about just
>
> ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared + 1));

Thank you for providing sample code. Rik also suggested another
approach, so I am working out which is the better idea.

Thanks,
Yasuaki Ishimatsu

>
>
> Regards,
> Wanpeng Li
>
>> p->numa_scan_period = clamp(p->numa_scan_period + diff,
>