2017-12-14 11:14:37

by Anshuman Khandual

Subject: [PATCH V2] mm/mprotect: Add a cond_resched() inside change_pmd_range()

While testing on a large CPU system, the following RCU stall was
detected many times over the span of the workload. Solve this by
adding a cond_resched() inside the change_pmd_range() loop.

[ 850.962530] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 850.962584] 154-....: (670 ticks this GP) idle=022/140000000000000/0 softirq=2825/2825 fqs=612
[ 850.962605] (detected by 955, t=6002 jiffies, g=4486, c=4485, q=90864)
[ 850.962895] Sending NMI from CPU 955 to CPUs 154:
[ 850.992667] NMI backtrace for cpu 154
[ 850.993069] CPU: 154 PID: 147071 Comm: workload Not tainted 4.15.0-rc3+ #3
[ 850.993258] NIP: c0000000000b3f64 LR: c0000000000b33d4 CTR: 000000000000aa18
[ 850.993503] REGS: 00000000a4b0fb44 TRAP: 0501 Not tainted (4.15.0-rc3+)
[ 850.993707] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22422082 XER: 00000000
[ 850.994386] CFAR: 00000000006cf8f0 SOFTE: 1
GPR00: 0010000000000000 c00003ef9b1cb8c0 c0000000010cc600 0000000000000000
GPR04: 8e0000018c32b200 40017b3858fd6e00 8e0000018c32b208 40017b3858fd6e00
GPR08: 8e0000018c32b210 40017b3858fd6e00 8e0000018c32b218 40017b3858fd6e00
GPR12: ffffffffffffffff c00000000fb25100
[ 850.995976] NIP [c0000000000b3f64] plpar_hcall9+0x44/0x7c
[ 850.996174] LR [c0000000000b33d4] pSeries_lpar_flush_hash_range+0x384/0x420
[ 850.996401] Call Trace:
[ 850.996600] [c00003ef9b1cb8c0] [c00003fa8fff7d40] 0xc00003fa8fff7d40 (unreliable)
[ 850.996959] [c00003ef9b1cba40] [c0000000000688a8] flush_hash_range+0x48/0x100
[ 850.997261] [c00003ef9b1cba90] [c000000000071b14] __flush_tlb_pending+0x44/0xd0
[ 850.997600] [c00003ef9b1cbac0] [c000000000071fa8] hpte_need_flush+0x408/0x470
[ 850.997958] [c00003ef9b1cbb30] [c0000000002c646c] change_protection_range+0xaac/0xf10
[ 850.998180] [c00003ef9b1cbcb0] [c0000000002f2510] change_prot_numa+0x30/0xb0
[ 850.998502] [c00003ef9b1cbce0] [c00000000013a950] task_numa_work+0x2d0/0x3e0
[ 850.998816] [c00003ef9b1cbda0] [c00000000011ea30] task_work_run+0x130/0x190
[ 850.999121] [c00003ef9b1cbe00] [c00000000001bcd8] do_notify_resume+0x118/0x120
[ 850.999421] [c00003ef9b1cbe30] [c00000000000b744] ret_from_except_lite+0x70/0x74
[ 850.999716] Instruction dump:
[ 850.999959] 60000000 f8810028 7ca42b78 7cc53378 7ce63b78 7d074378 7d284b78 7d495378
[ 851.000575] e9410060 e9610068 e9810070 44000022 <7d806378> e9810028 f88c0000 f8ac0008

Suggested-by: Nicholas Piggin <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
---
Changes in V2:

- Moved cond_resched() to change_pmd_range() as per Michal Hocko
- Fixed commit message as appropriate

Changes in V1: (https://patchwork.kernel.org/patch/10111445/)

mm/mprotect.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index ec39f73..43c29fa 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -196,6 +196,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
this_pages = change_pte_range(vma, pmd, addr, next, newprot,
dirty_accountable, prot_numa);
pages += this_pages;
+ cond_resched();
} while (pmd++, addr = next, addr != end);

if (mni_start)
--
2.9.3


2017-12-14 11:29:31

by Michal Hocko

Subject: Re: [PATCH V2] mm/mprotect: Add a cond_resched() inside change_pmd_range()

On Thu 14-12-17 16:44:26, Anshuman Khandual wrote:
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index ec39f73..43c29fa 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -196,6 +196,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> this_pages = change_pte_range(vma, pmd, addr, next, newprot,
> dirty_accountable, prot_numa);
> pages += this_pages;
> + cond_resched();
> } while (pmd++, addr = next, addr != end);
>
> if (mni_start)

this is not exactly what I meant. See how change_huge_pmd does continue.
That's why I mentioned zap_pmd_range which does goto next...
--
Michal Hocko
SUSE Labs

2017-12-14 12:56:04

by Anshuman Khandual

Subject: Re: [PATCH V2] mm/mprotect: Add a cond_resched() inside change_pmd_range()

On 12/14/2017 04:59 PM, Michal Hocko wrote:
> On Thu 14-12-17 16:44:26, Anshuman Khandual wrote:
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index ec39f73..43c29fa 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -196,6 +196,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>> this_pages = change_pte_range(vma, pmd, addr, next, newprot,
>> dirty_accountable, prot_numa);
>> pages += this_pages;
>> + cond_resched();
>> } while (pmd++, addr = next, addr != end);
>>
>> if (mni_start)
>
> this is not exactly what I meant. See how change_huge_pmd does continue.
> That's why I mentioned zap_pmd_range which does goto next...

I might still be missing something, but is this what you meant?
Here we will give cond_resched() coverage to THP-backed pages
as well.

diff --git a/mm/mprotect.c b/mm/mprotect.c
index ec39f73..3d445ee 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -188,7 +188,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
}

/* huge pmd was handled */
- continue;
+ goto next;
}
}
/* fall through, the trans huge pmd just split */
@@ -196,6 +196,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
this_pages = change_pte_range(vma, pmd, addr, next, newprot,
dirty_accountable, prot_numa);
pages += this_pages;
+next:
+ cond_resched();
} while (pmd++, addr = next, addr != end);

if (mni_start)

2017-12-14 13:05:17

by Michal Hocko

Subject: Re: [PATCH V2] mm/mprotect: Add a cond_resched() inside change_pmd_range()

On Thu 14-12-17 18:25:54, Anshuman Khandual wrote:
> On 12/14/2017 04:59 PM, Michal Hocko wrote:
> > On Thu 14-12-17 16:44:26, Anshuman Khandual wrote:
> >> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >> index ec39f73..43c29fa 100644
> >> --- a/mm/mprotect.c
> >> +++ b/mm/mprotect.c
> >> @@ -196,6 +196,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >> this_pages = change_pte_range(vma, pmd, addr, next, newprot,
> >> dirty_accountable, prot_numa);
> >> pages += this_pages;
> >> + cond_resched();
> >> } while (pmd++, addr = next, addr != end);
> >>
> >> if (mni_start)
> >
> > this is not exactly what I meant. See how change_huge_pmd does continue.
> > That's why I mentioned zap_pmd_range which does goto next...
>
> I might still be missing something, but is this what you meant?

yes, except

> Here we will give cond_resched() coverage to THP-backed pages
> as well.

but there is still
if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
&& pmd_none_or_clear_bad(pmd))
continue;

so we won't have a scheduling point on pmd holes. Maybe this doesn't
matter, I haven't checked, but why should we handle those differently?

> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index ec39f73..3d445ee 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -188,7 +188,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> }
>
> /* huge pmd was handled */
> - continue;
> + goto next;
> }
> }
> /* fall through, the trans huge pmd just split */
> @@ -196,6 +196,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> this_pages = change_pte_range(vma, pmd, addr, next, newprot,
> dirty_accountable, prot_numa);
> pages += this_pages;
> +next:
> + cond_resched();
> } while (pmd++, addr = next, addr != end);
>
> if (mni_start)

--
Michal Hocko
SUSE Labs

2017-12-14 13:20:49

by Anshuman Khandual

Subject: Re: [PATCH V2] mm/mprotect: Add a cond_resched() inside change_pmd_range()

On 12/14/2017 06:34 PM, Michal Hocko wrote:
> On Thu 14-12-17 18:25:54, Anshuman Khandual wrote:
>> On 12/14/2017 04:59 PM, Michal Hocko wrote:
>>> On Thu 14-12-17 16:44:26, Anshuman Khandual wrote:
>>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>>> index ec39f73..43c29fa 100644
>>>> --- a/mm/mprotect.c
>>>> +++ b/mm/mprotect.c
>>>> @@ -196,6 +196,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>>>> this_pages = change_pte_range(vma, pmd, addr, next, newprot,
>>>> dirty_accountable, prot_numa);
>>>> pages += this_pages;
>>>> + cond_resched();
>>>> } while (pmd++, addr = next, addr != end);
>>>>
>>>> if (mni_start)
>>> this is not exactly what I meant. See how change_huge_pmd does continue.
>>> That's why I mentioned zap_pmd_range which does goto next...
>> I might still be missing something, but is this what you meant?
> yes, except
>
>> Here we will give cond_resched() coverage to THP-backed pages
>> as well.
> but there is still
> if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> && pmd_none_or_clear_bad(pmd))
> continue;
>
> so we won't have a scheduling point on pmd holes. Maybe this doesn't
> matter, I haven't checked, but why should we handle those differently?

Maybe because it is not spending much time on those entries which
can really trigger stalls, hence they don't need scheduling points.
In case of zap_pmd_range(), it was spending time either in
__split_huge_pmd() or zap_huge_pmd(), hence deserved a scheduling point.

2017-12-14 13:27:55

by Michal Hocko

Subject: Re: [PATCH V2] mm/mprotect: Add a cond_resched() inside change_pmd_range()

On Thu 14-12-17 18:50:41, Anshuman Khandual wrote:
> On 12/14/2017 06:34 PM, Michal Hocko wrote:
> > On Thu 14-12-17 18:25:54, Anshuman Khandual wrote:
> >> On 12/14/2017 04:59 PM, Michal Hocko wrote:
> >>> On Thu 14-12-17 16:44:26, Anshuman Khandual wrote:
> >>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >>>> index ec39f73..43c29fa 100644
> >>>> --- a/mm/mprotect.c
> >>>> +++ b/mm/mprotect.c
> >>>> @@ -196,6 +196,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >>>> this_pages = change_pte_range(vma, pmd, addr, next, newprot,
> >>>> dirty_accountable, prot_numa);
> >>>> pages += this_pages;
> >>>> + cond_resched();
> >>>> } while (pmd++, addr = next, addr != end);
> >>>>
> >>>> if (mni_start)
> >>> this is not exactly what I meant. See how change_huge_pmd does continue.
> >>> That's why I mentioned zap_pmd_range which does goto next...
> >> I might still be missing something, but is this what you meant?
> > yes, except
> >
> >> Here we will give cond_resched() coverage to THP-backed pages
> >> as well.
> > but there is still
> > if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> > && pmd_none_or_clear_bad(pmd))
> > continue;
> >
> > so we won't have a scheduling point on pmd holes. Maybe this doesn't
> > matter, I haven't checked, but why should we handle those differently?
>
> Maybe because it is not spending much time on those entries which
> can really trigger stalls, hence they don't need scheduling points.
> In case of zap_pmd_range(), it was spending time either in
> __split_huge_pmd() or zap_huge_pmd(), hence deserved a scheduling point.

As I've said, I haven't thought much about that, but the discrepancy just
caught my eye. So unless there is a really good reason, I would rather use
goto next consistently.

--
Michal Hocko
SUSE Labs

2017-12-14 13:30:46

by Anshuman Khandual

Subject: Re: [PATCH V2] mm/mprotect: Add a cond_resched() inside change_pmd_range()

On 12/14/2017 06:57 PM, Michal Hocko wrote:
> On Thu 14-12-17 18:50:41, Anshuman Khandual wrote:
>> On 12/14/2017 06:34 PM, Michal Hocko wrote:
>>> On Thu 14-12-17 18:25:54, Anshuman Khandual wrote:
>>>> On 12/14/2017 04:59 PM, Michal Hocko wrote:
>>>>> On Thu 14-12-17 16:44:26, Anshuman Khandual wrote:
>>>>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>>>>> index ec39f73..43c29fa 100644
>>>>>> --- a/mm/mprotect.c
>>>>>> +++ b/mm/mprotect.c
>>>>>> @@ -196,6 +196,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>>>>>> this_pages = change_pte_range(vma, pmd, addr, next, newprot,
>>>>>> dirty_accountable, prot_numa);
>>>>>> pages += this_pages;
>>>>>> + cond_resched();
>>>>>> } while (pmd++, addr = next, addr != end);
>>>>>>
>>>>>> if (mni_start)
>>>>> this is not exactly what I meant. See how change_huge_pmd does continue.
>>>>> That's why I mentioned zap_pmd_range which does goto next...
>>>> I might still be missing something, but is this what you meant?
>>> yes, except
>>>
>>>> Here we will give cond_resched() coverage to THP-backed pages
>>>> as well.
>>> but there is still
>>> if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
>>> && pmd_none_or_clear_bad(pmd))
>>> continue;
>>>
>>> so we won't have a scheduling point on pmd holes. Maybe this doesn't
>>> matter, I haven't checked, but why should we handle those differently?
>>
>> Maybe because it is not spending much time on those entries which
>> can really trigger stalls, hence they don't need scheduling points.
>> In case of zap_pmd_range(), it was spending time either in
>> __split_huge_pmd() or zap_huge_pmd(), hence deserved a scheduling point.
>
> As I've said, I haven't thought much about that, but the discrepancy just
> caught my eye. So unless there is a really good reason, I would rather use
> goto next consistently.

Sure, will respin with the changes.