In our testing (camera recording), Miguel and Wei found that unmap_page_range
easily takes more than 6ms with preemption disabled. Looking into it, the
reason is that it holds the page table spinlock for the entire 512-page
operation within a PMD. 6.2ms is far from trivial for the user experience if
an RT task cannot run during that time, since it can cause frame drops or
audio glitches.
This patch adds a preemption point, like copy_pte_range.
Reported-by: Miguel de Dios <[email protected]>
Reported-by: Wei Wang <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/memory.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 2e796372927fd..bc3e0c5e4f89b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1007,6 +1007,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct zap_details *details)
{
struct mm_struct *mm = tlb->mm;
+ int progress = 0;
int force_flush = 0;
int rss[NR_MM_COUNTERS];
spinlock_t *ptl;
@@ -1022,7 +1023,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
do {
- pte_t ptent = *pte;
+ pte_t ptent;
+
+ if (progress >= 32) {
+ progress = 0;
+ if (need_resched())
+ break;
+ }
+ progress += 8;
+
+ ptent = *pte;
if (pte_none(ptent))
continue;
@@ -1123,8 +1133,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (force_flush) {
force_flush = 0;
tlb_flush_mmu(tlb);
- if (addr != end)
- goto again;
+ }
+
+ if (addr != end) {
+ progress = 0;
+ goto again;
}
return addr;
--
2.22.0.709.g102302147b-goog
On Mon, Jul 29, 2019 at 09:45:23AM +0200, Michal Hocko wrote:
> On Mon 29-07-19 16:10:37, Minchan Kim wrote:
> > In our testing (camera recording), Miguel and Wei found that unmap_page_range
> > easily takes more than 6ms with preemption disabled. Looking into it, the
> > reason is that it holds the page table spinlock for the entire 512-page
> > operation within a PMD. 6.2ms is far from trivial for the user experience if
> > an RT task cannot run during that time, since it can cause frame drops or
> > audio glitches.
>
> Where is the time spent during the tear down? 512 pages doesn't sound
> like a lot to tear down. Is it the TLB flushing?
Miguel confirmed there is no such big latency without mark_page_accessed
in zap_pte_range, so I guess it's contention on the LRU lock as well as the
heavy activate_page overhead, which is not trivial either.
>
> > This patch adds a preemption point, like copy_pte_range.
> >
> > Reported-by: Miguel de Dios <[email protected]>
> > Reported-by: Wei Wang <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Signed-off-by: Minchan Kim <[email protected]>
> > ---
> > mm/memory.c | 19 ++++++++++++++++---
> > 1 file changed, 16 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2e796372927fd..bc3e0c5e4f89b 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1007,6 +1007,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > struct zap_details *details)
> > {
> > struct mm_struct *mm = tlb->mm;
> > + int progress = 0;
> > int force_flush = 0;
> > int rss[NR_MM_COUNTERS];
> > spinlock_t *ptl;
> > @@ -1022,7 +1023,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > flush_tlb_batched_pending(mm);
> > arch_enter_lazy_mmu_mode();
> > do {
> > - pte_t ptent = *pte;
> > + pte_t ptent;
> > +
> > + if (progress >= 32) {
> > + progress = 0;
> > + if (need_resched())
> > + break;
> > + }
> > + progress += 8;
>
> Why 8?
Just copied from copy_pte_range.
On Mon 29-07-19 16:10:37, Minchan Kim wrote:
> In our testing (camera recording), Miguel and Wei found that unmap_page_range
> easily takes more than 6ms with preemption disabled. Looking into it, the
> reason is that it holds the page table spinlock for the entire 512-page
> operation within a PMD. 6.2ms is far from trivial for the user experience if
> an RT task cannot run during that time, since it can cause frame drops or
> audio glitches.
Where is the time spent during the tear down? 512 pages doesn't sound
like a lot to tear down. Is it the TLB flushing?
> This patch adds a preemption point, like copy_pte_range.
>
> Reported-by: Miguel de Dios <[email protected]>
> Reported-by: Wei Wang <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> mm/memory.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 2e796372927fd..bc3e0c5e4f89b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1007,6 +1007,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> struct zap_details *details)
> {
> struct mm_struct *mm = tlb->mm;
> + int progress = 0;
> int force_flush = 0;
> int rss[NR_MM_COUNTERS];
> spinlock_t *ptl;
> @@ -1022,7 +1023,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> flush_tlb_batched_pending(mm);
> arch_enter_lazy_mmu_mode();
> do {
> - pte_t ptent = *pte;
> + pte_t ptent;
> +
> + if (progress >= 32) {
> + progress = 0;
> + if (need_resched())
> + break;
> + }
> + progress += 8;
Why 8?
> +
> + ptent = *pte;
> if (pte_none(ptent))
> continue;
>
> @@ -1123,8 +1133,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> if (force_flush) {
> force_flush = 0;
> tlb_flush_mmu(tlb);
> - if (addr != end)
> - goto again;
> + }
> +
> + if (addr != end) {
> + progress = 0;
> + goto again;
> }
>
> return addr;
> --
> 2.22.0.709.g102302147b-goog
--
Michal Hocko
SUSE Labs
On Mon 29-07-19 17:20:52, Minchan Kim wrote:
> On Mon, Jul 29, 2019 at 09:45:23AM +0200, Michal Hocko wrote:
> > On Mon 29-07-19 16:10:37, Minchan Kim wrote:
> > > In our testing (camera recording), Miguel and Wei found that unmap_page_range
> > > easily takes more than 6ms with preemption disabled. Looking into it, the
> > > reason is that it holds the page table spinlock for the entire 512-page
> > > operation within a PMD. 6.2ms is far from trivial for the user experience if
> > > an RT task cannot run during that time, since it can cause frame drops or
> > > audio glitches.
> >
> > Where is the time spent during the tear down? 512 pages doesn't sound
> > like a lot to tear down. Is it the TLB flushing?
>
> Miguel confirmed there is no such big latency without mark_page_accessed
> in zap_pte_range, so I guess it's contention on the LRU lock as well as the
> heavy activate_page overhead, which is not trivial either.
Please give us more details ideally with some numbers.
--
Michal Hocko
SUSE Labs
On Mon, Jul 29, 2019 at 10:35:15AM +0200, Michal Hocko wrote:
> On Mon 29-07-19 17:20:52, Minchan Kim wrote:
> > On Mon, Jul 29, 2019 at 09:45:23AM +0200, Michal Hocko wrote:
> > > On Mon 29-07-19 16:10:37, Minchan Kim wrote:
> > > > In our testing (camera recording), Miguel and Wei found that unmap_page_range
> > > > easily takes more than 6ms with preemption disabled. Looking into it, the
> > > > reason is that it holds the page table spinlock for the entire 512-page
> > > > operation within a PMD. 6.2ms is far from trivial for the user experience if
> > > > an RT task cannot run during that time, since it can cause frame drops or
> > > > audio glitches.
> > >
> > > Where is the time spent during the tear down? 512 pages doesn't sound
> > > like a lot to tear down. Is it the TLB flushing?
> >
> > Miguel confirmed there is no such big latency without mark_page_accessed
> > in zap_pte_range, so I guess it's contention on the LRU lock as well as the
> > heavy activate_page overhead, which is not trivial either.
>
> Please give us more details ideally with some numbers.
I had time to benchmark it by adding some trace_printk hooks between
pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The test
device is a 2018 premium mobile device.
I can quite easily get a 2ms delay to release 2MB (i.e., 512 pages) when the
task runs on a little core, even though there is no IPI or LRU
lock contention. That is already too heavy.
If I remove activate_page, 35-40% of the zap_pte_range overhead is gone,
so most of the overhead (about 0.7ms) comes from activate_page via
mark_page_accessed. Thus, if there is LRU contention, that 0.7ms could
accumulate up to several ms.
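For reference, the exact hooks used are not shown in the thread; a minimal
sketch of that kind of instrumentation could look like the following, where
the use of ktime_get_ns() and the printed format are assumptions rather than
the actual measurement patch:

	/* Sketch: time how long zap_pte_range() holds the pte lock. */
	u64 t_lock, t_unlock;

	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	t_lock = ktime_get_ns();

	/* ... the existing per-PTE loop runs here with ptl held ... */

	t_unlock = ktime_get_ns();
	pte_unmap_unlock(start_pte, ptl);
	trace_printk("zap_pte_range: ptl held for %llu ns\n",
		     t_unlock - t_lock);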
On Tue 30-07-19 21:11:10, Minchan Kim wrote:
> On Mon, Jul 29, 2019 at 10:35:15AM +0200, Michal Hocko wrote:
> > On Mon 29-07-19 17:20:52, Minchan Kim wrote:
> > > On Mon, Jul 29, 2019 at 09:45:23AM +0200, Michal Hocko wrote:
> > > > On Mon 29-07-19 16:10:37, Minchan Kim wrote:
> > > > > In our testing (camera recording), Miguel and Wei found that unmap_page_range
> > > > > easily takes more than 6ms with preemption disabled. Looking into it, the
> > > > > reason is that it holds the page table spinlock for the entire 512-page
> > > > > operation within a PMD. 6.2ms is far from trivial for the user experience if
> > > > > an RT task cannot run during that time, since it can cause frame drops or
> > > > > audio glitches.
> > > >
> > > > Where is the time spent during the tear down? 512 pages doesn't sound
> > > > like a lot to tear down. Is it the TLB flushing?
> > >
> > > Miguel confirmed there is no such big latency without mark_page_accessed
> > > in zap_pte_range, so I guess it's contention on the LRU lock as well as the
> > > heavy activate_page overhead, which is not trivial either.
> >
> > Please give us more details ideally with some numbers.
>
> I had time to benchmark it by adding some trace_printk hooks between
> pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The test
> device is a 2018 premium mobile device.
>
> I can quite easily get a 2ms delay to release 2MB (i.e., 512 pages) when the
> task runs on a little core, even though there is no IPI or LRU
> lock contention. That is already too heavy.
>
> If I remove activate_page, 35-40% of the zap_pte_range overhead is gone,
> so most of the overhead (about 0.7ms) comes from activate_page via
> mark_page_accessed. Thus, if there is LRU contention, that 0.7ms could
> accumulate up to several ms.
Thanks for this information. This is something that should be part of
the changelog. I am sorry to still poke into this, because I still do not
have a full understanding of what is going on, and while I do not object
to dropping the spinlock I still suspect this is papering over a deeper
problem.
If mark_page_accessed is really expensive then why do we even bother to
do it in the tear down path in the first place? Why don't we simply set
a referenced bit on the page to reflect the young pte bit? I might be
missing something here of course.
--
Michal Hocko
SUSE Labs
On Tue, Jul 30, 2019 at 02:32:37PM +0200, Michal Hocko wrote:
> On Tue 30-07-19 21:11:10, Minchan Kim wrote:
> > On Mon, Jul 29, 2019 at 10:35:15AM +0200, Michal Hocko wrote:
> > > On Mon 29-07-19 17:20:52, Minchan Kim wrote:
> > > > On Mon, Jul 29, 2019 at 09:45:23AM +0200, Michal Hocko wrote:
> > > > > On Mon 29-07-19 16:10:37, Minchan Kim wrote:
> > > > > > In our testing (camera recording), Miguel and Wei found that unmap_page_range
> > > > > > easily takes more than 6ms with preemption disabled. Looking into it, the
> > > > > > reason is that it holds the page table spinlock for the entire 512-page
> > > > > > operation within a PMD. 6.2ms is far from trivial for the user experience if
> > > > > > an RT task cannot run during that time, since it can cause frame drops or
> > > > > > audio glitches.
> > > > >
> > > > > Where is the time spent during the tear down? 512 pages doesn't sound
> > > > > like a lot to tear down. Is it the TLB flushing?
> > > >
> > > > Miguel confirmed there is no such big latency without mark_page_accessed
> > > > in zap_pte_range, so I guess it's contention on the LRU lock as well as the
> > > > heavy activate_page overhead, which is not trivial either.
> > >
> > > Please give us more details ideally with some numbers.
> >
> > I had time to benchmark it by adding some trace_printk hooks between
> > pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The test
> > device is a 2018 premium mobile device.
> >
> > I can quite easily get a 2ms delay to release 2MB (i.e., 512 pages) when the
> > task runs on a little core, even though there is no IPI or LRU
> > lock contention. That is already too heavy.
> >
> > If I remove activate_page, 35-40% of the zap_pte_range overhead is gone,
> > so most of the overhead (about 0.7ms) comes from activate_page via
> > mark_page_accessed. Thus, if there is LRU contention, that 0.7ms could
> > accumulate up to several ms.
>
> Thanks for this information. This is something that should be a part of
> the changelog. I am sorry to still poke into this because I still do not
I will include it.
> have a full understanding of what is going on and while I do not object
> to drop the spinlock I still suspect this is papering over a deeper
> problem.
I couldn't come up with a better solution. Feel free to suggest one.
>
> If mark_page_accessed is really expensive then why do we even bother to
> do it in the tear down path in the first place? Why don't we simply set
> a referenced bit on the page to reflect the young pte bit? I might be
> missing something here of course.
commit bf3f3bc5e73
Author: Nick Piggin <[email protected]>
Date: Tue Jan 6 14:38:55 2009 -0800
mm: don't mark_page_accessed in fault path
Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at
unmap-time if the pte is young has a number of problems.
mark_page_accessed is supposed to be roughly the equivalent of a young pte
for unmapped references. Unfortunately it doesn't come with any context:
after being called, reclaim doesn't know who or why the page was touched.
So calling mark_page_accessed not only adds extra lru or PG_referenced
manipulations for pages that are already going to have pte_young ptes anyway,
but it also adds these references which are difficult to work with from the
context of vma specific references (eg. MADV_SEQUENTIAL pte_young may not
wish to contribute to the page being referenced).
Then, simply doing SetPageReferenced when zapping a pte and finding it is
young, is not a really good solution either. SetPageReferenced does not
correctly promote the page to the active list for example. So after removing
mark_page_accessed from the fault path, several mmap()+touch+munmap() would
have a very different result from several read(2) calls for example, which
is not really desirable.
Signed-off-by: Nick Piggin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
[Cc Nick - the email thread starts http://lkml.kernel.org/r/[email protected]
A very brief summary is that mark_page_accessed seems to be quite
expensive and the question is whether we still need it and why
SetPageReferenced cannot be used instead. More below.]
On Tue 30-07-19 21:39:35, Minchan Kim wrote:
> On Tue, Jul 30, 2019 at 02:32:37PM +0200, Michal Hocko wrote:
> > On Tue 30-07-19 21:11:10, Minchan Kim wrote:
> > > On Mon, Jul 29, 2019 at 10:35:15AM +0200, Michal Hocko wrote:
> > > > On Mon 29-07-19 17:20:52, Minchan Kim wrote:
> > > > > On Mon, Jul 29, 2019 at 09:45:23AM +0200, Michal Hocko wrote:
> > > > > > On Mon 29-07-19 16:10:37, Minchan Kim wrote:
> > > > > > > In our testing (camera recording), Miguel and Wei found that unmap_page_range
> > > > > > > easily takes more than 6ms with preemption disabled. Looking into it, the
> > > > > > > reason is that it holds the page table spinlock for the entire 512-page
> > > > > > > operation within a PMD. 6.2ms is far from trivial for the user experience if
> > > > > > > an RT task cannot run during that time, since it can cause frame drops or
> > > > > > > audio glitches.
> > > > > >
> > > > > > Where is the time spent during the tear down? 512 pages doesn't sound
> > > > > > like a lot to tear down. Is it the TLB flushing?
> > > > >
> > > > > Miguel confirmed there is no such big latency without mark_page_accessed
> > > > > in zap_pte_range, so I guess it's contention on the LRU lock as well as the
> > > > > heavy activate_page overhead, which is not trivial either.
> > > >
> > > > Please give us more details ideally with some numbers.
> > >
> > > I had time to benchmark it by adding some trace_printk hooks between
> > > pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The test
> > > device is a 2018 premium mobile device.
> > >
> > > I can quite easily get a 2ms delay to release 2MB (i.e., 512 pages) when the
> > > task runs on a little core, even though there is no IPI or LRU
> > > lock contention. That is already too heavy.
> > >
> > > If I remove activate_page, 35-40% of the zap_pte_range overhead is gone,
> > > so most of the overhead (about 0.7ms) comes from activate_page via
> > > mark_page_accessed. Thus, if there is LRU contention, that 0.7ms could
> > > accumulate up to several ms.
> >
> > Thanks for this information. This is something that should be a part of
> > the changelog. I am sorry to still poke into this because I still do not
>
> I will include it.
>
> > have a full understanding of what is going on and while I do not object
> > to drop the spinlock I still suspect this is papering over a deeper
> > problem.
>
> I couldn't come up with a better solution. Feel free to suggest one.
>
> >
> > If mark_page_accessed is really expensive then why do we even bother to
> > do it in the tear down path in the first place? Why don't we simply set
> > a referenced bit on the page to reflect the young pte bit? I might be
> > missing something here of course.
>
> commit bf3f3bc5e73
> Author: Nick Piggin <[email protected]>
> Date: Tue Jan 6 14:38:55 2009 -0800
>
> mm: don't mark_page_accessed in fault path
>
> Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at
> unmap-time if the pte is young has a number of problems.
>
> mark_page_accessed is supposed to be roughly the equivalent of a young pte
> for unmapped references. Unfortunately it doesn't come with any context:
> after being called, reclaim doesn't know who or why the page was touched.
>
> So calling mark_page_accessed not only adds extra lru or PG_referenced
> manipulations for pages that are already going to have pte_young ptes anyway,
> but it also adds these references which are difficult to work with from the
> context of vma specific references (eg. MADV_SEQUENTIAL pte_young may not
> wish to contribute to the page being referenced).
>
> Then, simply doing SetPageReferenced when zapping a pte and finding it is
> young, is not a really good solution either. SetPageReferenced does not
> correctly promote the page to the active list for example. So after removing
> mark_page_accessed from the fault path, several mmap()+touch+munmap() would
> have a very different result from several read(2) calls for example, which
> is not really desirable.
Well, I have to say that this is rather vague to me. Nick, could you be
more specific about which workloads benefit from this change? Let's
say that the zapped pte is the only referenced one and then reclaim
finds the page on the inactive list. We would go and reclaim it. But does
that matter so much? Hot pages would very likely be referenced from
multiple ptes, no?
That being said, considering that mark_page_accessed is not free, do we
have strong reasons to keep it?
Or am I missing something?
> Signed-off-by: Nick Piggin <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
--
Michal Hocko
SUSE Labs
On Mon, 29 Jul 2019 17:20:52 +0900 Minchan Kim <[email protected]> wrote:
> > > @@ -1022,7 +1023,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > flush_tlb_batched_pending(mm);
> > > arch_enter_lazy_mmu_mode();
> > > do {
> > > - pte_t ptent = *pte;
> > > + pte_t ptent;
> > > +
> > > + if (progress >= 32) {
> > > + progress = 0;
> > > + if (need_resched())
> > > + break;
> > > + }
> > > + progress += 8;
> >
> > Why 8?
>
> Just copied from copy_pte_range.
copy_pte_range() does
if (pte_none(*src_pte)) {
progress++;
continue;
}
entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
vma, addr, rss);
if (entry.val)
break;
progress += 8;
which appears to be an attempt to balance the cost of copy_one_pte()
against the cost of not calling copy_one_pte().
Your code doesn't do this balancing and hence can be simpler.
It all seems a bit overdesigned. need_resched() is cheap. It's
possibly a mistake to check need_resched() on *every* loop because some
crazy scheduling load might livelock us. But surely it would be enough
to do something like
if (progress++ && need_resched()) {
<reschedule>
progress = 0;
}
and leave it at that?
On Tue, Jul 30, 2019 at 02:57:51PM +0200, Michal Hocko wrote:
> [Cc Nick - the email thread starts http://lkml.kernel.org/r/[email protected]
> A very brief summary is that mark_page_accessed seems to be quite
> expensive and the question is whether we still need it and why
> SetPageReferenced cannot be used instead. More below.]
>
> On Tue 30-07-19 21:39:35, Minchan Kim wrote:
> > On Tue, Jul 30, 2019 at 02:32:37PM +0200, Michal Hocko wrote:
> > > On Tue 30-07-19 21:11:10, Minchan Kim wrote:
> > > > On Mon, Jul 29, 2019 at 10:35:15AM +0200, Michal Hocko wrote:
> > > > > On Mon 29-07-19 17:20:52, Minchan Kim wrote:
> > > > > > On Mon, Jul 29, 2019 at 09:45:23AM +0200, Michal Hocko wrote:
> > > > > > > On Mon 29-07-19 16:10:37, Minchan Kim wrote:
> > > > > > > > In our testing (camera recording), Miguel and Wei found that unmap_page_range
> > > > > > > > easily takes more than 6ms with preemption disabled. Looking into it, the
> > > > > > > > reason is that it holds the page table spinlock for the entire 512-page
> > > > > > > > operation within a PMD. 6.2ms is far from trivial for the user experience if
> > > > > > > > an RT task cannot run during that time, since it can cause frame drops or
> > > > > > > > audio glitches.
> > > > > > >
> > > > > > > Where is the time spent during the tear down? 512 pages doesn't sound
> > > > > > > like a lot to tear down. Is it the TLB flushing?
> > > > > >
> > > > > > Miguel confirmed there is no such big latency without mark_page_accessed
> > > > > > in zap_pte_range, so I guess it's contention on the LRU lock as well as the
> > > > > > heavy activate_page overhead, which is not trivial either.
> > > > >
> > > > > Please give us more details ideally with some numbers.
> > > >
> > > > I had time to benchmark it by adding some trace_printk hooks between
> > > > pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The test
> > > > device is a 2018 premium mobile device.
> > > >
> > > > I can quite easily get a 2ms delay to release 2MB (i.e., 512 pages) when the
> > > > task runs on a little core, even though there is no IPI or LRU
> > > > lock contention. That is already too heavy.
> > > >
> > > > If I remove activate_page, 35-40% of the zap_pte_range overhead is gone,
> > > > so most of the overhead (about 0.7ms) comes from activate_page via
> > > > mark_page_accessed. Thus, if there is LRU contention, that 0.7ms could
> > > > accumulate up to several ms.
> > >
> > > Thanks for this information. This is something that should be a part of
> > > the changelog. I am sorry to still poke into this because I still do not
> >
> > I will include it.
> >
> > > have a full understanding of what is going on and while I do not object
> > > to drop the spinlock I still suspect this is papering over a deeper
> > > problem.
> >
> > I couldn't come up with a better solution. Feel free to suggest one.
> >
> > >
> > > If mark_page_accessed is really expensive then why do we even bother to
> > > do it in the tear down path in the first place? Why don't we simply set
> > > a referenced bit on the page to reflect the young pte bit? I might be
> > > missing something here of course.
> >
> > commit bf3f3bc5e73
> > Author: Nick Piggin <[email protected]>
> > Date: Tue Jan 6 14:38:55 2009 -0800
> >
> > mm: don't mark_page_accessed in fault path
> >
> > Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at
> > unmap-time if the pte is young has a number of problems.
> >
> > mark_page_accessed is supposed to be roughly the equivalent of a young pte
> > for unmapped references. Unfortunately it doesn't come with any context:
> > after being called, reclaim doesn't know who or why the page was touched.
> >
> > So calling mark_page_accessed not only adds extra lru or PG_referenced
> > manipulations for pages that are already going to have pte_young ptes anyway,
> > but it also adds these references which are difficult to work with from the
> > context of vma specific references (eg. MADV_SEQUENTIAL pte_young may not
> > wish to contribute to the page being referenced).
> >
> > Then, simply doing SetPageReferenced when zapping a pte and finding it is
> > young, is not a really good solution either. SetPageReferenced does not
> > correctly promote the page to the active list for example. So after removing
> > mark_page_accessed from the fault path, several mmap()+touch+munmap() would
> > have a very different result from several read(2) calls for example, which
> > is not really desirable.
>
> Well, I have to say that this is rather vague to me. Nick, could you be
> more specific about which workloads do benefit from this change? Let's
> say that the zapped pte is the only referenced one and then reclaim
> finds the page on inactive list. We would go and reclaim it. But does
> that matter so much? Hot pages would be referenced from multiple ptes
> very likely, no?
As Nick mentioned in the description, without mark_page_accessed in the
zapping path, repeated mmap + touch + munmap never activates the page,
while several read(2) calls easily promote it.
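To make the comparison concrete, the two access patterns being discussed look
roughly like the following userspace sketch; the file name, mapping size and
iteration count are arbitrary assumptions for illustration only, and the file
is assumed to exist and be at least one page long:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_LEN (2UL * 1024 * 1024)

int main(void)
{
	char buf[4096];
	int fd = open("datafile", O_RDONLY);	/* hypothetical file */

	if (fd < 0)
		return 1;

	for (int i = 0; i < 8; i++) {
		/*
		 * mmap + touch + munmap: promotion depends on the young
		 * pte being noticed when the mapping is zapped.
		 */
		char *p = mmap(NULL, MAP_LEN, PROT_READ, MAP_PRIVATE, fd, 0);

		if (p != MAP_FAILED) {
			memcpy(buf, p, sizeof(buf));
			munmap(p, MAP_LEN);
		}

		/*
		 * read(2): each call goes through mark_page_accessed, so
		 * repeated reads promote the page cache pages easily.
		 */
		pread(fd, buf, sizeof(buf), 0);
	}
	close(fd);
	return 0;
}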
On Tue, Jul 30, 2019 at 12:42:07PM -0700, Andrew Morton wrote:
> On Mon, 29 Jul 2019 17:20:52 +0900 Minchan Kim <[email protected]> wrote:
>
> > > > @@ -1022,7 +1023,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > flush_tlb_batched_pending(mm);
> > > > arch_enter_lazy_mmu_mode();
> > > > do {
> > > > - pte_t ptent = *pte;
> > > > + pte_t ptent;
> > > > +
> > > > + if (progress >= 32) {
> > > > + progress = 0;
> > > > + if (need_resched())
> > > > + break;
> > > > + }
> > > > + progress += 8;
> > >
> > > Why 8?
> >
> > Just copied from copy_pte_range.
>
> copy_pte_range() does
>
> if (pte_none(*src_pte)) {
> progress++;
> continue;
> }
> entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
> vma, addr, rss);
> if (entry.val)
> break;
> progress += 8;
>
> which appears to be an attempt to balance the cost of copy_one_pte()
> against the cost of not calling copy_one_pte().
>
Indeed.
> Your code doesn't do this balancing and hence can be simpler.
Based on the balancing around copy_one_pte, it seems we should balance
the cost of mark_page_accessed against the cost of not calling
mark_page_accessed; IOW, add 8 only when mark_page_accessed is called.
However, not every mark_page_accessed is heavy, since it uses a pagevec
and the caller can't know whether the target page will be activated or
just get PG_referenced, which is cheap. Thus, I agree, let's not make it
complicated.
>
> It all seems a bit overdesigned. need_resched() is cheap. It's
> possibly a mistake to check need_resched() on *every* loop because some
> crazy scheduling load might livelock us. But surely it would be enough
> to do something like
>
> if (progress++ && need_resched()) {
> <reschedule>
> progress = 0;
> }
>
> and leave it at that?
Seems like this?
From bb1d7aaf520e98a6f9d988c25121602c28e12e67 Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Mon, 29 Jul 2019 15:28:48 +0900
Subject: [PATCH] mm: release the spinlock on zap_pte_range
In our testing (camera recording), Miguel and Wei found that unmap_page_range
easily takes more than 6ms with preemption disabled. Looking into it, the
reason is that it holds the page table spinlock for the entire 512-page
operation within a PMD. 6.2ms is far from trivial for the user experience if
an RT task cannot run during that time, since it can cause frame drops or
audio glitches.
I had time to benchmark it by adding some trace_printk hooks between
pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The test
device is a 2018 premium mobile device.
I can quite easily get a 2ms delay to release 2MB (i.e., 512 pages) when the
task runs on a little core, even though there is no IPI or LRU
lock contention. That is already too heavy.
If I remove activate_page, 35-40% of the zap_pte_range overhead is gone,
so most of the overhead (about 0.7ms) comes from activate_page via
mark_page_accessed. Thus, if there is LRU contention, that 0.7ms could
accumulate up to several ms.
Thus, this patch adds a preemption point once every 32 iterations of the
loop.
Reported-by: Miguel de Dios <[email protected]>
Reported-by: Wei Wang <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/memory.c | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 2e796372927fd..8bfcef09da674 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1007,6 +1007,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct zap_details *details)
{
struct mm_struct *mm = tlb->mm;
+ int progress = 0;
int force_flush = 0;
int rss[NR_MM_COUNTERS];
spinlock_t *ptl;
@@ -1022,7 +1023,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
do {
- pte_t ptent = *pte;
+ pte_t ptent;
+
+ if (progress++ >= 32) {
+ progress = 0;
+ if (need_resched())
+ break;
+ }
+
+ ptent = *pte;
if (pte_none(ptent))
continue;
@@ -1123,8 +1132,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (force_flush) {
force_flush = 0;
tlb_flush_mmu(tlb);
- if (addr != end)
- goto again;
+ }
+
+ if (addr != end) {
+ progress = 0;
+ goto again;
}
return addr;
--
2.22.0.709.g102302147b-goog
On Wed 31-07-19 14:44:47, Minchan Kim wrote:
> On Tue, Jul 30, 2019 at 02:57:51PM +0200, Michal Hocko wrote:
> > [Cc Nick - the email thread starts http://lkml.kernel.org/r/[email protected]
> > A very brief summary is that mark_page_accessed seems to be quite
> > expensive and the question is whether we still need it and why
> > SetPageReferenced cannot be used instead. More below.]
> >
> > On Tue 30-07-19 21:39:35, Minchan Kim wrote:
[...]
> > > commit bf3f3bc5e73
> > > Author: Nick Piggin <[email protected]>
> > > Date: Tue Jan 6 14:38:55 2009 -0800
> > >
> > > mm: don't mark_page_accessed in fault path
> > >
> > > Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at
> > > unmap-time if the pte is young has a number of problems.
> > >
> > > mark_page_accessed is supposed to be roughly the equivalent of a young pte
> > > for unmapped references. Unfortunately it doesn't come with any context:
> > > after being called, reclaim doesn't know who or why the page was touched.
> > >
> > > So calling mark_page_accessed not only adds extra lru or PG_referenced
> > > manipulations for pages that are already going to have pte_young ptes anyway,
> > > but it also adds these references which are difficult to work with from the
> > > context of vma specific references (eg. MADV_SEQUENTIAL pte_young may not
> > > wish to contribute to the page being referenced).
> > >
> > > Then, simply doing SetPageReferenced when zapping a pte and finding it is
> > > young, is not a really good solution either. SetPageReferenced does not
> > > correctly promote the page to the active list for example. So after removing
> > > mark_page_accessed from the fault path, several mmap()+touch+munmap() would
> > > have a very different result from several read(2) calls for example, which
> > > is not really desirable.
> >
> > Well, I have to say that this is rather vague to me. Nick, could you be
> > more specific about which workloads do benefit from this change? Let's
> > say that the zapped pte is the only referenced one and then reclaim
> > finds the page on inactive list. We would go and reclaim it. But does
> > that matter so much? Hot pages would be referenced from multiple ptes
> > very likely, no?
>
> As Nick mentioned in the description, without mark_page_accessed in the
> zapping path, repeated mmap + touch + munmap never activates the page,
> while several read(2) calls easily promote it.
And is this really a problem? If we refault the same page then the
refaults detection should catch it, no? In other words, is the above still
a problem these days?
--
Michal Hocko
SUSE Labs
Greeting,
FYI, we noticed a -4.1% regression of will-it-scale.per_process_ops due to commit:
commit: 755d6edc1aee4489c90975ec093d724d5492cecd ("[PATCH] mm: release the spinlock on zap_pte_range")
url: https://github.com/0day-ci/linux/commits/Minchan-Kim/mm-release-the-spinlock-on-zap_pte_range/20190730-010638
in testcase: will-it-scale
on test machine: 8 threads Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz with 16G memory
with following parameters:
nr_task: 100%
mode: process
test: malloc1
cpufreq_governor: performance
ucode: 0x21
test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale
Details are as below:
-------------------------------------------------------------------------------------------------->
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-7/performance/x86_64-rhel-7.6/process/100%/debian-x86_64-2019-05-14.cgz/lkp-ivb-d01/malloc1/will-it-scale/0x21
commit:
v5.3-rc2
755d6edc1a ("mm: release the spinlock on zap_pte_range")
v5.3-rc2 755d6edc1aee4489c90975ec093
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
1:5 -20% :4 dmesg.RIP:__d_lookup_rcu
1:5 -20% :4 dmesg.RIP:mnt_drop_write
:5 20% 1:4 kmsg.ab33a8>]usb_hcd_irq
:5 20% 1:4 kmsg.b445f28>]usb_hcd_irq
:5 20% 1:4 kmsg.cdf63ef>]usb_hcd_irq
1:5 -20% :4 kmsg.d4af11>]usb_hcd_irq
1:5 -20% :4 kmsg.d9>]usb_hcd_irq
:5 20% 1:4 kmsg.f805d78>]usb_hcd_irq
5:5 -7% 4:4 perf-profile.calltrace.cycles-pp.error_entry
7:5 -39% 5:4 perf-profile.children.cycles-pp.error_entry
0:5 -1% 0:4 perf-profile.children.cycles-pp.error_exit
5:5 -30% 4:4 perf-profile.self.cycles-pp.error_entry
%stddev %change %stddev
\ | \
119757 -4.1% 114839 will-it-scale.per_process_ops
958059 -4.1% 918718 will-it-scale.workload
2429 ? 16% -34.5% 1591 ? 32% cpuidle.C1.usage
0.97 ? 88% -0.7 0.26 mpstat.cpu.all.idle%
78.40 +2.0% 80.00 vmstat.cpu.sy
45.42 +2.1% 46.38 turbostat.CorWatt
50.46 +2.0% 51.45 turbostat.PkgWatt
6641 ? 4% +8.6% 7215 ? 8% slabinfo.anon_vma_chain.num_objs
1327 ? 3% +23.0% 1632 ? 5% slabinfo.kmalloc-96.active_objs
1327 ? 3% +23.0% 1632 ? 5% slabinfo.kmalloc-96.num_objs
1235 ? 30% +37.7% 1700 ? 18% interrupts.29:PCI-MSI.409600-edge.eth0
4361 ? 81% +149.4% 10876 ? 32% interrupts.CPU0.NMI:Non-maskable_interrupts
4361 ? 81% +149.4% 10876 ? 32% interrupts.CPU0.PMI:Performance_monitoring_interrupts
1235 ? 30% +37.7% 1700 ? 18% interrupts.CPU7.29:PCI-MSI.409600-edge.eth0
93196 +9.1% 101723 ? 6% sched_debug.cfs_rq:/.load.min
15.37 ? 11% +13.6% 17.46 ? 3% sched_debug.cfs_rq:/.nr_spread_over.max
5.01 ? 11% +14.5% 5.74 ? 4% sched_debug.cfs_rq:/.nr_spread_over.stddev
53.80 ? 15% +41.6% 76.21 ? 7% sched_debug.cfs_rq:/.util_avg.stddev
60098 +1.6% 61056 proc-vmstat.nr_active_anon
6867 -1.2% 6781 proc-vmstat.nr_slab_unreclaimable
60098 +1.6% 61056 proc-vmstat.nr_zone_active_anon
5.757e+08 -4.2% 5.517e+08 proc-vmstat.numa_hit
5.757e+08 -4.2% 5.517e+08 proc-vmstat.numa_local
5.758e+08 -4.1% 5.52e+08 proc-vmstat.pgalloc_normal
2.881e+08 -4.1% 2.762e+08 proc-vmstat.pgfault
5.758e+08 -4.1% 5.52e+08 proc-vmstat.pgfree
2.861e+09 ? 41% +41.1% 4.038e+09 perf-stat.i.branch-instructions
41921318 ? 38% +34.9% 56552695 ? 2% perf-stat.i.cache-references
2.173e+10 ? 41% +34.9% 2.931e+10 perf-stat.i.cpu-cycles
2.26e+09 ? 41% +41.3% 3.194e+09 perf-stat.i.dTLB-stores
57813 ? 26% +66.7% 96370 ? 6% perf-stat.i.iTLB-loads
1.365e+10 ? 41% +37.9% 1.882e+10 perf-stat.i.instructions
661.20 ? 40% +45.4% 961.52 perf-stat.i.instructions-per-iTLB-miss
0.47 ? 41% +37.3% 0.64 perf-stat.i.ipc
948620 -3.5% 915067 perf-stat.i.minor-faults
948620 -3.5% 915067 perf-stat.i.page-faults
0.51 ? 7% -0.1 0.45 perf-stat.overall.branch-miss-rate%
1.59 -2.4% 1.56 perf-stat.overall.cpi
0.38 -0.0 0.35 ? 2% perf-stat.overall.dTLB-store-miss-rate%
875.11 +8.7% 950.89 perf-stat.overall.instructions-per-iTLB-miss
0.63 +2.4% 0.64 perf-stat.overall.ipc
4337585 ? 41% +42.3% 6173557 perf-stat.overall.path-length
2.855e+09 ? 41% +41.0% 4.028e+09 perf-stat.ps.branch-instructions
41833739 ? 38% +34.8% 56408902 ? 2% perf-stat.ps.cache-references
2.255e+09 ? 41% +41.2% 3.186e+09 perf-stat.ps.dTLB-stores
57677 ? 26% +66.7% 96124 ? 6% perf-stat.ps.iTLB-loads
1.362e+10 ? 41% +37.8% 1.877e+10 perf-stat.ps.instructions
946368 -3.6% 912714 perf-stat.ps.minor-faults
946368 -3.6% 912714 perf-stat.ps.page-faults
4.155e+12 ? 41% +36.5% 5.672e+12 perf-stat.total.instructions
20.10 -0.7 19.42 perf-profile.calltrace.cycles-pp.__do_page_fault.do_page_fault.page_fault
17.83 -0.7 17.17 perf-profile.calltrace.cycles-pp.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
5.47 ? 2% -0.5 4.92 ? 4% perf-profile.calltrace.cycles-pp.pagevec_lru_move_fn.lru_add_drain_cpu.lru_add_drain.unmap_region.__do_munmap
5.75 ? 2% -0.5 5.20 ? 4% perf-profile.calltrace.cycles-pp.lru_add_drain.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap
5.69 ? 2% -0.5 5.17 ? 4% perf-profile.calltrace.cycles-pp.lru_add_drain_cpu.lru_add_drain.unmap_region.__do_munmap.__vm_munmap
2.61 -0.5 2.16 ? 12% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.pagevec_lru_move_fn.lru_add_drain_cpu.lru_add_drain.unmap_region
2.09 ? 2% -0.4 1.67 ? 15% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.pagevec_lru_move_fn.lru_add_drain_cpu.lru_add_drain
2.81 ? 2% -0.2 2.56 ? 2% perf-profile.calltrace.cycles-pp.__anon_vma_prepare.do_anonymous_page.__handle_mm_fault.handle_mm_fault.__do_page_fault
2.62 ? 2% -0.2 2.45 ? 2% perf-profile.calltrace.cycles-pp.flush_tlb_func_common.flush_tlb_mm_range.tlb_flush_mmu.tlb_finish_mmu.unmap_region
1.89 ? 2% -0.2 1.73 perf-profile.calltrace.cycles-pp.unlink_anon_vmas.free_pgtables.unmap_region.__do_munmap.__vm_munmap
3.05 ? 2% -0.1 2.91 perf-profile.calltrace.cycles-pp.flush_tlb_mm_range.tlb_flush_mmu.tlb_finish_mmu.unmap_region.__do_munmap
1.07 ? 3% -0.1 0.95 ? 2% perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64
0.91 ? 3% -0.1 0.84 ? 4% perf-profile.calltrace.cycles-pp.native_flush_tlb.flush_tlb_func_common.flush_tlb_mm_range.tlb_flush_mmu.tlb_finish_mmu
1.94 ? 3% +0.1 2.06 perf-profile.calltrace.cycles-pp.get_unmapped_area.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
1.31 ? 8% +0.1 1.45 perf-profile.calltrace.cycles-pp.arch_get_unmapped_area_topdown.get_unmapped_area.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff
0.31 ? 81% +0.2 0.54 ? 3% perf-profile.calltrace.cycles-pp.mem_cgroup_commit_charge.do_anonymous_page.__handle_mm_fault.handle_mm_fault.__do_page_fault
2.27 ? 50% +0.7 2.97 ? 3% perf-profile.calltrace.cycles-pp.syscall_return_via_sysret
43.67 +2.4 46.10 perf-profile.calltrace.cycles-pp.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
39.41 ? 2% +2.7 42.07 perf-profile.calltrace.cycles-pp.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64
18.28 ? 2% +3.7 21.95 perf-profile.calltrace.cycles-pp.unmap_vmas.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap
17.43 ? 2% +3.7 21.12 perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.unmap_region.__do_munmap.__vm_munmap
35.89 ? 50% +11.0 46.92 perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
36.13 ? 50% +11.1 47.22 perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
51.68 ? 50% +14.5 66.17 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
51.90 ? 50% +14.5 66.42 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
17.89 -0.7 17.20 perf-profile.children.cycles-pp.handle_mm_fault
20.13 -0.7 19.45 perf-profile.children.cycles-pp.__do_page_fault
5.25 ? 2% -0.6 4.62 ? 8% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
5.50 ? 2% -0.6 4.95 ? 4% perf-profile.children.cycles-pp.pagevec_lru_move_fn
5.93 ? 2% -0.5 5.39 ? 4% perf-profile.children.cycles-pp.lru_add_drain
5.86 ? 2% -0.5 5.33 ? 4% perf-profile.children.cycles-pp.lru_add_drain_cpu
2.80 ? 2% -0.3 2.55 ? 3% perf-profile.children.cycles-pp.entry_SYSCALL_64
2.86 ? 2% -0.3 2.60 ? 2% perf-profile.children.cycles-pp.__anon_vma_prepare
1.92 ? 3% -0.2 1.75 perf-profile.children.cycles-pp.unlink_anon_vmas
1.88 ? 4% -0.2 1.72 ? 2% perf-profile.children.cycles-pp.percpu_counter_add_batch
2.03 ? 3% -0.2 1.88 perf-profile.children.cycles-pp.free_pgtables
3.06 ? 2% -0.1 2.92 perf-profile.children.cycles-pp.flush_tlb_mm_range
0.89 ? 5% -0.1 0.76 ? 6% perf-profile.children.cycles-pp.__might_sleep
1.58 ? 2% -0.1 1.45 perf-profile.children.cycles-pp.native_flush_tlb
1.97 -0.1 1.85 ? 2% perf-profile.children.cycles-pp.flush_tlb_func_common
0.41 ? 8% -0.1 0.32 ? 8% perf-profile.children.cycles-pp.___pte_free_tlb
0.10 ? 14% -0.1 0.03 ?100% perf-profile.children.cycles-pp.should_fail_alloc_page
0.55 ? 3% -0.1 0.49 ? 4% perf-profile.children.cycles-pp.down_write
0.10 ? 19% -0.1 0.05 ? 58% perf-profile.children.cycles-pp.should_failslab
0.28 ? 10% -0.0 0.23 perf-profile.children.cycles-pp.anon_vma_interval_tree_remove
0.11 ? 19% -0.0 0.07 ? 7% perf-profile.children.cycles-pp.policy_nodemask
0.10 ? 11% -0.0 0.06 ? 14% perf-profile.children.cycles-pp.__vma_link_file
0.11 ? 9% -0.0 0.08 ? 6% perf-profile.children.cycles-pp.anon_vma_chain_link
0.13 ? 8% -0.0 0.11 ? 4% perf-profile.children.cycles-pp.try_charge
0.18 ? 6% -0.0 0.16 ? 5% perf-profile.children.cycles-pp.inc_zone_page_state
0.14 ? 2% -0.0 0.12 ? 3% perf-profile.children.cycles-pp.anon_vma_interval_tree_insert
0.10 ? 17% +0.0 0.14 ? 7% perf-profile.children.cycles-pp.strlen
0.52 ? 2% +0.0 0.56 ? 3% perf-profile.children.cycles-pp.mem_cgroup_commit_charge
0.17 ? 16% +0.0 0.21 ? 6% perf-profile.children.cycles-pp.uncharge_page
0.08 ? 16% +0.0 0.13 ? 7% perf-profile.children.cycles-pp.__vma_link_list
0.26 ? 6% +0.1 0.31 ? 6% perf-profile.children.cycles-pp.mem_cgroup_charge_statistics
0.00 +0.1 0.06 ? 22% perf-profile.children.cycles-pp.__get_vma_policy
0.13 ? 9% +0.1 0.19 ? 9% perf-profile.children.cycles-pp.vma_merge
0.02 ?122% +0.1 0.09 ? 11% perf-profile.children.cycles-pp.kthread_blkcg
0.25 ? 11% +0.1 0.33 ? 6% perf-profile.children.cycles-pp.get_task_policy
0.00 +0.1 0.08 ? 5% perf-profile.children.cycles-pp.memcpy
0.25 ? 9% +0.1 0.35 ? 2% perf-profile.children.cycles-pp.memcpy_erms
1.97 ? 2% +0.1 2.09 perf-profile.children.cycles-pp.get_unmapped_area
1.34 ? 7% +0.1 1.47 ? 2% perf-profile.children.cycles-pp.arch_get_unmapped_area_topdown
0.38 ? 5% +0.1 0.52 ? 5% perf-profile.children.cycles-pp.alloc_pages_current
3.08 ? 2% +0.2 3.24 ? 2% perf-profile.children.cycles-pp.syscall_return_via_sysret
64.46 +2.0 66.45 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
64.19 +2.0 66.19 perf-profile.children.cycles-pp.do_syscall_64
43.77 +2.4 46.18 perf-profile.children.cycles-pp.__do_munmap
44.49 +2.5 46.95 perf-profile.children.cycles-pp.__vm_munmap
44.77 +2.5 47.24 perf-profile.children.cycles-pp.__x64_sys_munmap
39.43 ? 2% +2.7 42.10 perf-profile.children.cycles-pp.unmap_region
18.07 ? 2% +3.7 21.73 perf-profile.children.cycles-pp.unmap_page_range
18.29 ? 2% +3.7 21.97 perf-profile.children.cycles-pp.unmap_vmas
6.02 ? 3% -0.5 5.57 ? 3% perf-profile.self.cycles-pp.do_syscall_64
1.73 -0.1 1.59 perf-profile.self.cycles-pp._raw_spin_lock_irqsave
1.56 ? 2% -0.1 1.44 perf-profile.self.cycles-pp.native_flush_tlb
0.34 ? 11% -0.1 0.24 ? 7% perf-profile.self.cycles-pp.strlcpy
0.57 ? 5% -0.1 0.49 ? 6% perf-profile.self.cycles-pp.unlink_anon_vmas
0.68 ? 4% -0.1 0.60 ? 8% perf-profile.self.cycles-pp._raw_spin_lock
0.37 ? 5% -0.1 0.31 ? 6% perf-profile.self.cycles-pp.cpumask_any_but
0.42 ? 7% -0.1 0.36 ? 6% perf-profile.self.cycles-pp.handle_mm_fault
0.23 ? 7% -0.1 0.18 ? 4% perf-profile.self.cycles-pp.__perf_sw_event
0.10 ? 23% -0.0 0.06 ? 9% perf-profile.self.cycles-pp.policy_nodemask
0.09 ? 11% -0.0 0.04 ? 59% perf-profile.self.cycles-pp.__vma_link_file
0.13 ? 6% -0.0 0.10 ? 8% perf-profile.self.cycles-pp.try_charge
0.14 ? 2% -0.0 0.12 ? 3% perf-profile.self.cycles-pp.anon_vma_interval_tree_insert
0.10 ? 15% +0.0 0.11 ? 4% perf-profile.self.cycles-pp.strlen
0.09 ? 17% +0.0 0.12 ? 5% perf-profile.self.cycles-pp.memcg_check_events
0.07 ? 19% +0.0 0.10 ? 7% perf-profile.self.cycles-pp.__vma_link_list
0.16 ? 16% +0.0 0.20 ? 5% perf-profile.self.cycles-pp.uncharge_page
0.24 ? 7% +0.0 0.28 ? 2% perf-profile.self.cycles-pp.memcpy_erms
0.04 ? 53% +0.0 0.09 ? 8% perf-profile.self.cycles-pp.do_page_fault
0.42 ? 9% +0.1 0.48 ? 7% perf-profile.self.cycles-pp.find_next_bit
0.13 ? 10% +0.1 0.19 ? 8% perf-profile.self.cycles-pp.vma_merge
0.02 ?122% +0.1 0.09 ? 11% perf-profile.self.cycles-pp.kthread_blkcg
0.25 ? 10% +0.1 0.32 ? 7% perf-profile.self.cycles-pp.get_task_policy
0.00 +0.1 0.08 ? 6% perf-profile.self.cycles-pp.memcpy
0.14 ? 5% +0.1 0.25 ? 15% perf-profile.self.cycles-pp.alloc_pages_current
3.08 ? 2% +0.2 3.23 ? 2% perf-profile.self.cycles-pp.syscall_return_via_sysret
0.43 ? 10% +0.2 0.58 ? 6% perf-profile.self.cycles-pp.arch_get_unmapped_area_topdown
11.00 ? 2% +3.6 14.56 perf-profile.self.cycles-pp.unmap_page_range
will-it-scale.per_process_ops
120000 +-+----------------------------------------------------------------+
| +. +. +.. .. + +. +. + |
119000 +-+ +.+..+..+ |
118000 +-+ |
| |
117000 +-+ |
| O |
116000 O-+O O O O |
| O O O O O |
115000 +-+ O O O O O O O O
114000 +-+ |
| O O O O O |
113000 +-+ O |
| |
112000 +-+----------------------------------------------------------------+
[*] bisect-good sample
[O] bisect-bad sample
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
Thanks,
Oliver Sang
On Tue 06-08-19 15:05:47, kernel test robot wrote:
> Greeting,
>
> FYI, we noticed a -4.1% regression of will-it-scale.per_process_ops due to commit:
I have to confess I cannot make much sense of the numbers because they
seem too volatile and the main contributor doesn't stand out to me.
Anyway, regressions in microbenchmarks like this are not all that
surprising when the locking is slightly changed and the critical section
made shorter. I have seen that in the past already.
That being said, I would still love to get to the bottom of this bug rather
than playing with the lock duration by magic. In other words
http://lkml.kernel.org/r/[email protected]
--
Michal Hocko
SUSE Labs
On Wed, Jul 31, 2019 at 09:21:01AM +0200, Michal Hocko wrote:
> On Wed 31-07-19 14:44:47, Minchan Kim wrote:
> > On Tue, Jul 30, 2019 at 02:57:51PM +0200, Michal Hocko wrote:
> > > [Cc Nick - the email thread starts http://lkml.kernel.org/r/[email protected]
> > > A very brief summary is that mark_page_accessed seems to be quite
> > > expensive and the question is whether we still need it and why
> > > SetPageReferenced cannot be used instead. More below.]
> > >
> > > On Tue 30-07-19 21:39:35, Minchan Kim wrote:
> [...]
> > > > commit bf3f3bc5e73
> > > > Author: Nick Piggin <[email protected]>
> > > > Date: Tue Jan 6 14:38:55 2009 -0800
> > > >
> > > > mm: don't mark_page_accessed in fault path
> > > >
> > > > Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at
> > > > unmap-time if the pte is young has a number of problems.
> > > >
> > > > mark_page_accessed is supposed to be roughly the equivalent of a young pte
> > > > for unmapped references. Unfortunately it doesn't come with any context:
> > > > after being called, reclaim doesn't know who or why the page was touched.
> > > >
> > > > So calling mark_page_accessed not only adds extra lru or PG_referenced
> > > > manipulations for pages that are already going to have pte_young ptes anyway,
> > > > but it also adds these references which are difficult to work with from the
> > > > context of vma specific references (eg. MADV_SEQUENTIAL pte_young may not
> > > > wish to contribute to the page being referenced).
> > > >
> > > > Then, simply doing SetPageReferenced when zapping a pte and finding it is
> > > > young, is not a really good solution either. SetPageReferenced does not
> > > > correctly promote the page to the active list for example. So after removing
> > > > mark_page_accessed from the fault path, several mmap()+touch+munmap() would
> > > > have a very different result from several read(2) calls for example, which
> > > > is not really desirable.
> > >
> > > Well, I have to say that this is rather vague to me. Nick, could you be
> > > more specific about which workloads do benefit from this change? Let's
> > > say that the zapped pte is the only referenced one and then reclaim
> > > finds the page on inactive list. We would go and reclaim it. But does
> > > that matter so much? Hot pages would be referenced from multiple ptes
> > > very likely, no?
> >
> > As Nick mentioned in the description, without mark_page_accessed in the
> > zapping path, repeated mmap + touch + munmap never activates the page,
> > while several read(2) calls easily promote it.
>
> And is this really a problem? If we refault the same page then the
> refaults detection should catch it no? In other words is the above still
> a problem these days?
I admit we have not been fair to them, because read(2) syscall pages are
easily promoted regardless of zap timing, unlike mmap-based pages.
However, if we remove the mark_page_accessed in zap_pte_range, it
would make things even more unfair in that read(2)-accessed pages are easily
promoted while mmap-based pages have to go through a refault to be promoted.
I also want to remove the costly overhead from the hot path but couldn't
come up with a nicer solution.
On Tue, Aug 06, 2019 at 10:04:15AM +0200, Michal Hocko wrote:
> On Tue 06-08-19 15:05:47, kernel test robot wrote:
> > Greeting,
> >
> > FYI, we noticed a -4.1% regression of will-it-scale.per_process_ops due to commit:
>
> I have to confess I cannot make much sense of the numbers because they
> seem too volatile and the main contributor doesn't stand out to me.
> Anyway, regressions in microbenchmarks like this are not all that
> surprising when the locking is slightly changed and the critical section
> made shorter. I have seen that in the past already.
I guess that's because it's a multi-process workload. The patch gives more
chances to be scheduled out, so the TLB miss ratio would be higher than before.
I see it as a natural latency vs. performance trade-off, so the only thing
I can think of is to increase the threshold from 32 to 64 or 128?
>
> That being said I would still love to get to bottom of this bug rather
> than play with the lock duration by a magic. In other words
> http://lkml.kernel.org/r/[email protected]
Yes, if we could remove mark_page_accessed there, it would be best.
I added a comment in the thread.
On Tue 06-08-19 20:00:24, Minchan Kim wrote:
> On Tue, Aug 06, 2019 at 10:04:15AM +0200, Michal Hocko wrote:
> > On Tue 06-08-19 15:05:47, kernel test robot wrote:
> > > Greeting,
> > >
> > > FYI, we noticed a -4.1% regression of will-it-scale.per_process_ops due to commit:
> >
> > I have to confess I cannot make much sense of the numbers because they
> > seem too volatile and the main contributor doesn't stand out to me.
> > Anyway, regressions in microbenchmarks like this are not all that
> > surprising when the locking is slightly changed and the critical section
> > made shorter. I have seen that in the past already.
>
> I guess that's because it's a multi-process workload. The patch gives more
> chances to be scheduled out, so the TLB miss ratio would be higher than before.
> I see it as a natural latency vs. performance trade-off, so the only thing
> I can think of is to increase the threshold from 32 to 64 or 128?
This still feels like magic number tuning, doesn't it?
--
Michal Hocko
SUSE Labs
On Tue 06-08-19 19:55:09, Minchan Kim wrote:
> On Wed, Jul 31, 2019 at 09:21:01AM +0200, Michal Hocko wrote:
> > On Wed 31-07-19 14:44:47, Minchan Kim wrote:
[...]
> > > As Nick mentioned in the description, without mark_page_accessed in the
> > > zapping path, repeated mmap + touch + munmap never activates the page,
> > > while several read(2) calls easily promote it.
> >
> > And is this really a problem? If we refault the same page then the
> > refaults detection should catch it no? In other words is the above still
> > a problem these days?
>
> I admit we have not been fair to them, because read(2) syscall pages are
> easily promoted regardless of zap timing, unlike mmap-based pages.
>
> However, if we remove the mark_page_accessed in zap_pte_range, it
> would make things even more unfair in that read(2)-accessed pages are easily
> promoted while mmap-based pages have to go through a refault to be promoted.
I have a really hard time following why special handling in the unmap path
makes the overall state more reasonable.
Anyway, let me throw the patch out for further discussion. Nick, Mel,
Johannes, what do you think?
From 3821c2e66347a2141358cabdc6224d9990276fec Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Fri, 9 Aug 2019 14:29:59 +0200
Subject: [PATCH] mm: drop mark_page_access from the unmap path
Minchan has noticed that mark_page_accessed can take quite some time
during unmap:
: I had time to benchmark it by adding some trace_printk hooks between
: pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The test
: device is a 2018 premium mobile device.
:
: I can quite easily get a 2ms delay to release 2MB (i.e., 512 pages) when the
: task runs on a little core, even though there is no IPI or LRU
: lock contention. That is already too heavy.
:
: If I remove activate_page, 35-40% of the zap_pte_range overhead is gone,
: so most of the overhead (about 0.7ms) comes from activate_page via
: mark_page_accessed. Thus, if there is LRU contention, that 0.7ms could
: accumulate up to several ms.
bf3f3bc5e734 ("mm: don't mark_page_accessed in fault path") has replaced
SetPageReferenced by mark_page_accessed arguing that the former is not
sufficient when mark_page_accessed is removed from the fault path
because it doesn't promote the page to the active list. It is true that a
page that is mapped by a single process might not get promoted even when
referenced if the reclaim checks it after the unmap, but does that matter
that much? Can we consider the page hot if there are no other
users? Moreover, we have had workingset detection in place since then, and
so a subsequent refault would activate the page if it was a really hot one.
Drop the expensive mark_page_accessed and restore SetPageReferenced to
transfer the reference information into the struct page for now to
reduce the unmap overhead. Should we find workloads that noticeably
depend on this behavior we should find a way to make mark_page_accessed
less expensive.
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index e2bb51b6242e..ced521df8ee7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1053,7 +1053,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
}
if (pte_young(ptent) &&
likely(!(vma->vm_flags & VM_SEQ_READ)))
- mark_page_accessed(page);
+ SetPageReferenced(page);
}
rss[mm_counter(page)]--;
page_remove_rmap(page, false);
--
2.20.1
--
Michal Hocko
SUSE Labs
On Fri, Aug 09, 2019 at 02:43:24PM +0200, Michal Hocko wrote:
> On Tue 06-08-19 19:55:09, Minchan Kim wrote:
> > On Wed, Jul 31, 2019 at 09:21:01AM +0200, Michal Hocko wrote:
> > > On Wed 31-07-19 14:44:47, Minchan Kim wrote:
> [...]
> > > > As Nick mentioned in the description, without mark_page_accessed in the
> > > > zapping path, repeated mmap + touch + munmap never activates the page,
> > > > while several read(2) calls easily promote it.
> > >
> > > And is this really a problem? If we refault the same page then the
> > > refaults detection should catch it no? In other words is the above still
> > > a problem these days?
> >
> > I admit we have not been fair to them, because read(2) syscall pages are
> > easily promoted regardless of zap timing, unlike mmap-based pages.
> >
> > However, if we remove the mark_page_accessed in zap_pte_range, it
> > would make things even more unfair in that read(2)-accessed pages are easily
> > promoted while mmap-based pages have to go through a refault to be promoted.
>
> I have really hard time to follow why an unmap special handling is
> making the overall state more reasonable.
>
> Anyway, let me throw the patch for further discussion. Nick, Mel,
> Johannes what do you think?
>
I won't be able to answer follow-ups to this for a while but here is some
superficial thinking.
Minimally, you should test PageReferenced before setting it, like
mark_page_accessed does, to avoid unnecessary atomics. I know it wasn't
done that way before, but there is no harm in addressing it now.
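To make that concrete, a rough sketch against the hunk in the patch above
(untested; the added PageReferenced test is the only change, and ptent,
vma and page are the locals already in scope in zap_pte_range):

	/*
	 * Only touch the flag when it is not already set: SetPageReferenced
	 * is an atomic set_bit, so testing first avoids the unnecessary
	 * atomic when the same page keeps getting zapped while referenced.
	 */
	if (pte_young(ptent) &&
	    likely(!(vma->vm_flags & VM_SEQ_READ)) &&
	    !PageReferenced(page))
		SetPageReferenced(page);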
workingset_activation is necessarily expensive. It could speculatively
look up the memcg outside the RCU read lock and only take the lock if
there is something interesting to look up. Probably not much help though.
Note that losing the potential workingset_activation from the patch
may have consequences if we are relying on refaults to fix this up. I'm
undecided as to what degree it matters.
That said, I do agree that the mark_page_accessed on page zapping may
be overkill, given that it can be a very expensive call if the page
gets activated and it is potentially being called in the zap path at a
high frequency. It's also not a function that is particularly easy to
optimise if you want to cover all the cases that matter. It really would
be preferable to have knowledge of a workload that really cares about
the activations from mmap/touch/munmap.
mark_page_accessed is a hint; it's known that there are gaps with
it, so we shouldn't pay too much of a cost for information that only
might be useful. If the system is under no memory pressure because the
workloads are tuned to fit in memory (e.g. a database using direct IO),
then mark_page_accessed is pure cost. We could avoid marking the page
accessed entirely under PF_EXITING, given that a reference found while a
task is exiting is not a strong indication that the page is of any
interest. Even if the page is heavily shared and one user exits, the
other users will keep it referenced and prevent reclaim anyway. The
benefit would be too marginal though.
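For illustration only, that check would be something like the below
(an untested sketch in the same zap_pte_range context, not something
I'm pushing given the marginal benefit):

	/*
	 * Untested sketch: skip the reference hint entirely when the
	 * task is exiting, as a young pte found during exit teardown
	 * says little about future use of the page.
	 */
	if (pte_young(ptent) &&
	    likely(!(vma->vm_flags & VM_SEQ_READ)) &&
	    likely(!(current->flags & PF_EXITING)))
		mark_page_accessed(page);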
Given the definite cost of mark_page_accessed in this path and the main
corner case being tasks that access pages via mmap/touch/munmap (which is
insanely expensive if done at high frequency), I think it's reasonable to
rely on SetPageReferenced giving the page another lap of the LRU in most
cases (the obvious exception being CMA forcing reclaim). That opinion
might change if there is a known example of a realistic workload that
would suffer from the lack of explicit activations from teardown context.
--
Mel Gorman
SUSE Labs
On Fri, Aug 09, 2019 at 02:43:24PM +0200, Michal Hocko wrote:
> On Tue 06-08-19 19:55:09, Minchan Kim wrote:
> > On Wed, Jul 31, 2019 at 09:21:01AM +0200, Michal Hocko wrote:
> > > On Wed 31-07-19 14:44:47, Minchan Kim wrote:
> [...]
> > > > As Nick mentioned in the description, without mark_page_accessed in
> > > > zapping part, repeated mmap + touch + munmap never acticated the page
> > > > while several read(2) calls easily promote it.
> > >
> > > And is this really a problem? If we refault the same page then the
> > > refaults detection should catch it no? In other words is the above still
> > > a problem these days?
> >
> > I admit we have been not fair for them because read(2) syscall pages are
> > easily promoted regardless of zap timing unlike mmap-based pages.
> >
> > However, if we remove the mark_page_accessed in the zap_pte_range, it
> > would make them more unfair in that read(2)-accessed pages are easily
> > promoted while mmap-based page should go through refault to be promoted.
>
> I have really hard time to follow why an unmap special handling is
> making the overall state more reasonable.
>
> Anyway, let me throw the patch for further discussion. Nick, Mel,
> Johannes what do you think?
>
> From 3821c2e66347a2141358cabdc6224d9990276fec Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Fri, 9 Aug 2019 14:29:59 +0200
> Subject: [PATCH] mm: drop mark_page_access from the unmap path
>
> Minchan has noticed that mark_page_access can take quite some time
> during unmap:
> : I had a time to benchmark it via adding some trace_printk hooks between
> : pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing
> : device is 2018 premium mobile device.
> :
> : I can get 2ms delay rather easily to release 2M(ie, 512 pages) when the
> : task runs on little core even though it doesn't have any IPI and LRU
> : lock contention. It's already too heavy.
> :
> : If I remove activate_page, 35-40% overhead of zap_pte_range is gone
> : so most of overhead(about 0.7ms) comes from activate_page via
> : mark_page_accessed. Thus, if there are LRU contention, that 0.7ms could
> : accumulate up to several ms.
>
> bf3f3bc5e734 ("mm: don't mark_page_accessed in fault path") has replaced
> SetPageReferenced by mark_page_accessed arguing that the former is not
> sufficient when mark_page_accessed is removed from the fault path
> because it doesn't promote page to the active list. It is true that a
> page that is mapped by a single process might not get promoted even when
> referenced if the reclaim checks it after the unmap but does that matter
> that much? Can we cosider the page hot if there are no other
> users? Moreover we do have workingset detection in place since then and
> so a next refault would activate the page if it was really hot one.
I do think the pages can be very hot. Think of short-lived executables
and their libraries. Like shell commands. When they run a few times or
periodically, they should be promoted to the active list and not have
to compete with streaming IO on the inactive list - the PG_referenced
doesn't really help them there, see page_check_references().
Maybe the refaults will be fine - but latency expectations around
mapped page cache certainly are a lot higher than unmapped cache.
So I'm a bit reluctant about this patch. If Minchan can be happy with
the lock batching, I'd prefer that.
On Fri 09-08-19 14:34:24, Johannes Weiner wrote:
> On Fri, Aug 09, 2019 at 02:43:24PM +0200, Michal Hocko wrote:
> > On Tue 06-08-19 19:55:09, Minchan Kim wrote:
> > > On Wed, Jul 31, 2019 at 09:21:01AM +0200, Michal Hocko wrote:
> > > > On Wed 31-07-19 14:44:47, Minchan Kim wrote:
> > [...]
> > > > > As Nick mentioned in the description, without mark_page_accessed in
> > > > > zapping part, repeated mmap + touch + munmap never acticated the page
> > > > > while several read(2) calls easily promote it.
> > > >
> > > > And is this really a problem? If we refault the same page then the
> > > > refaults detection should catch it no? In other words is the above still
> > > > a problem these days?
> > >
> > > I admit we have been not fair for them because read(2) syscall pages are
> > > easily promoted regardless of zap timing unlike mmap-based pages.
> > >
> > > However, if we remove the mark_page_accessed in the zap_pte_range, it
> > > would make them more unfair in that read(2)-accessed pages are easily
> > > promoted while mmap-based page should go through refault to be promoted.
> >
> > I have really hard time to follow why an unmap special handling is
> > making the overall state more reasonable.
> >
> > Anyway, let me throw the patch for further discussion. Nick, Mel,
> > Johannes what do you think?
> >
> > From 3821c2e66347a2141358cabdc6224d9990276fec Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Fri, 9 Aug 2019 14:29:59 +0200
> > Subject: [PATCH] mm: drop mark_page_access from the unmap path
> >
> > Minchan has noticed that mark_page_access can take quite some time
> > during unmap:
> > : I had a time to benchmark it via adding some trace_printk hooks between
> > : pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing
> > : device is 2018 premium mobile device.
> > :
> > : I can get 2ms delay rather easily to release 2M(ie, 512 pages) when the
> > : task runs on little core even though it doesn't have any IPI and LRU
> > : lock contention. It's already too heavy.
> > :
> > : If I remove activate_page, 35-40% overhead of zap_pte_range is gone
> > : so most of overhead(about 0.7ms) comes from activate_page via
> > : mark_page_accessed. Thus, if there are LRU contention, that 0.7ms could
> > : accumulate up to several ms.
> >
> > bf3f3bc5e734 ("mm: don't mark_page_accessed in fault path") has replaced
> > SetPageReferenced by mark_page_accessed arguing that the former is not
> > sufficient when mark_page_accessed is removed from the fault path
> > because it doesn't promote page to the active list. It is true that a
> > page that is mapped by a single process might not get promoted even when
> > referenced if the reclaim checks it after the unmap but does that matter
> > that much? Can we cosider the page hot if there are no other
> > users? Moreover we do have workingset detection in place since then and
> > so a next refault would activate the page if it was really hot one.
>
> I do think the pages can be very hot. Think of short-lived executables
> and their libraries. Like shell commands. When they run a few times or
> periodically, they should be promoted to the active list and not have
> to compete with streaming IO on the inactive list - the PG_referenced
> doesn't really help them there, see page_check_references().
Yeah, I am aware of that. We do rely on more processes mapping the
page, which I've tried to explain in the changelog.
Btw. can we promote PageReferenced pages with zero mapcount? I am
throwing that out more as an idea because I haven't really thought it
through yet.
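Purely to make the idea concrete, something along these lines in the
reclaim path is what I have in mind (an untested sketch, using
page_mapped and the PAGEREF_* returns of page_check_references):

	/*
	 * Untested idea only: a PG_referenced page that is no longer
	 * mapped by anybody (e.g. it was young when it got unmapped)
	 * could conceivably be activated rather than just getting
	 * another LRU lap.
	 */
	if (PageReferenced(page) && !page_mapped(page))
		return PAGEREF_ACTIVATE;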
> Maybe the refaults will be fine - but latency expectations around
> mapped page cache certainly are a lot higher than unmapped cache.
>
> So I'm a bit reluctant about this patch. If Minchan can be happy with
> the lock batching, I'd prefer that.
Yes, it seems that the regular lock drop & relock helps in Minchan's case,
but this is the kind of change that might have other subtle side effects.
E.g. will-it-scale has noticed a regression [1], likely because the
critical section is shorter and the overall throughput of the operation
decreases. Now, w-i-s is an artificial benchmark, so I wouldn't lose
much sleep over it normally, but we have already seen real regressions
when the locking pattern changed in the past, so I would be a bit
cautious.
As I've said, this RFC is mostly to open a discussion. I would really
like to weigh the overhead of mark_page_accessed against the potential
scenarios where refaults would be visible in practice. I can imagine that
short-lived, statically linked applications have a higher chance of being
the only user, unlike libraries, which are often mapped via several
ptes. But the main problem in evaluating this is that there are many
other external factors that can trigger the worst case.
[1] http://lkml.kernel.org/r/20190806070547.GA10123@xsang-OptiPlex-9020
--
Michal Hocko
SUSE Labs
On Mon, Aug 12, 2019 at 10:09:47AM +0200, Michal Hocko wrote:
> On Fri 09-08-19 14:34:24, Johannes Weiner wrote:
> > On Fri, Aug 09, 2019 at 02:43:24PM +0200, Michal Hocko wrote:
> > > On Tue 06-08-19 19:55:09, Minchan Kim wrote:
> > > > On Wed, Jul 31, 2019 at 09:21:01AM +0200, Michal Hocko wrote:
> > > > > On Wed 31-07-19 14:44:47, Minchan Kim wrote:
> > > [...]
> > > > > > As Nick mentioned in the description, without mark_page_accessed in
> > > > > > zapping part, repeated mmap + touch + munmap never acticated the page
> > > > > > while several read(2) calls easily promote it.
> > > > >
> > > > > And is this really a problem? If we refault the same page then the
> > > > > refaults detection should catch it no? In other words is the above still
> > > > > a problem these days?
> > > >
> > > > I admit we have been not fair for them because read(2) syscall pages are
> > > > easily promoted regardless of zap timing unlike mmap-based pages.
> > > >
> > > > However, if we remove the mark_page_accessed in the zap_pte_range, it
> > > > would make them more unfair in that read(2)-accessed pages are easily
> > > > promoted while mmap-based page should go through refault to be promoted.
> > >
> > > I have really hard time to follow why an unmap special handling is
> > > making the overall state more reasonable.
> > >
> > > Anyway, let me throw the patch for further discussion. Nick, Mel,
> > > Johannes what do you think?
> > >
> > > From 3821c2e66347a2141358cabdc6224d9990276fec Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <[email protected]>
> > > Date: Fri, 9 Aug 2019 14:29:59 +0200
> > > Subject: [PATCH] mm: drop mark_page_access from the unmap path
> > >
> > > Minchan has noticed that mark_page_access can take quite some time
> > > during unmap:
> > > : I had a time to benchmark it via adding some trace_printk hooks between
> > > : pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing
> > > : device is 2018 premium mobile device.
> > > :
> > > : I can get 2ms delay rather easily to release 2M(ie, 512 pages) when the
> > > : task runs on little core even though it doesn't have any IPI and LRU
> > > : lock contention. It's already too heavy.
> > > :
> > > : If I remove activate_page, 35-40% overhead of zap_pte_range is gone
> > > : so most of overhead(about 0.7ms) comes from activate_page via
> > > : mark_page_accessed. Thus, if there are LRU contention, that 0.7ms could
> > > : accumulate up to several ms.
> > >
> > > bf3f3bc5e734 ("mm: don't mark_page_accessed in fault path") has replaced
> > > SetPageReferenced by mark_page_accessed arguing that the former is not
> > > sufficient when mark_page_accessed is removed from the fault path
> > > because it doesn't promote page to the active list. It is true that a
> > > page that is mapped by a single process might not get promoted even when
> > > referenced if the reclaim checks it after the unmap but does that matter
> > > that much? Can we cosider the page hot if there are no other
> > > users? Moreover we do have workingset detection in place since then and
> > > so a next refault would activate the page if it was really hot one.
> >
> > I do think the pages can be very hot. Think of short-lived executables
> > and their libraries. Like shell commands. When they run a few times or
> > periodically, they should be promoted to the active list and not have
> > to compete with streaming IO on the inactive list - the PG_referenced
> > doesn't really help them there, see page_check_references().
>
> Yeah, I am aware of that. We do rely on more processes to map the page
> which I've tried to explain in the changelog.
>
> Btw. can we promote PageReferenced pages with zero mapcount? I am
> throwing that more as an idea because I haven't really thought that
> through yet.
That flag implements a second-chance policy, see this commit:
commit 645747462435d84c6c6a64269ed49cc3015f753d
Author: Johannes Weiner <[email protected]>
Date: Fri Mar 5 13:42:22 2010 -0800
vmscan: detect mapped file pages used only once
We had an application that would checksum large files using mmapped IO
to avoid double buffering. The VM used to activate mapped cache
directly, and it trashed the actual workingset.
In response I added support for use-once mapped pages using this flag.
SetPageReferenced signals the VM that we're not sure about the page
yet and gives it another round trip on the LRU.
If you activate on this flag, it would restore the initial problem of
use-once pages trashing the workingset.
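For reference, the relevant part of page_check_references() looks
roughly like this (simplified from memory, leaving out the swap-backed
and VM_EXEC special cases; referenced_ptes comes from page_referenced()
and referenced_page from TestClearPageReferenced()):

	if (referenced_ptes) {
		/* spare it for another trip around the inactive list */
		SetPageReferenced(page);
		/* seen by reclaim before, or mapped more than once: hot */
		if (referenced_page || referenced_ptes > 1)
			return PAGEREF_ACTIVATE;
		return PAGEREF_KEEP;
	}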
> > Maybe the refaults will be fine - but latency expectations around
> > mapped page cache certainly are a lot higher than unmapped cache.
> >
> > So I'm a bit reluctant about this patch. If Minchan can be happy with
> > the lock batching, I'd prefer that.
>
> Yes, it seems that the regular lock drop&relock helps in Minchan's case
> but this is a kind of change that might have other subtle side effects.
> E.g. will-it-scale has noticed a regression [1], likely because the
> critical section is shorter and the overal throughput of the operation
> decreases. Now, the w-i-s is an artificial benchmark so I wouldn't lose
> much sleep over it normally but we have already seen real regressions
> when the locking pattern has changed in the past so I would by a bit
> cautious.
I'm much more concerned about fundamentally changing the aging policy
of mapped page cache then about the lock breaking scheme. With locking
we worry about CPU effects; with aging we worry about additional IO.
> As I've said, this RFC is mostly to open a discussion. I would really
> like to weigh the overhead of mark_page_accessed and potential scenario
> when refaults would be visible in practice. I can imagine that a short
> lived statically linked applications have higher chance of being the
> only user unlike libraries which are often being mapped via several
> ptes. But the main problem to evaluate this is that there are many other
> external factors to trigger the worst case.
We can discuss the pros and cons, but ultimately we simply need to
test it against real workloads to see if changing the promotion rules
regresses the amount of paging we do in practice.
On Mon 12-08-19 11:07:25, Johannes Weiner wrote:
> On Mon, Aug 12, 2019 at 10:09:47AM +0200, Michal Hocko wrote:
[...]
> > Btw. can we promote PageReferenced pages with zero mapcount? I am
> > throwing that more as an idea because I haven't really thought that
> > through yet.
>
> That flag implements a second-chance policy, see this commit:
>
> commit 645747462435d84c6c6a64269ed49cc3015f753d
> Author: Johannes Weiner <[email protected]>
> Date: Fri Mar 5 13:42:22 2010 -0800
>
> vmscan: detect mapped file pages used only once
>
> We had an application that would checksum large files using mmapped IO
> to avoid double buffering. The VM used to activate mapped cache
> directly, and it trashed the actual workingset.
>
> In response I added support for use-once mapped pages using this flag.
> SetPageReferenced signals the VM that we're not sure about the page
> yet and give it another round trip on the LRU.
>
> If you activate on this flag, it would restore the initial problem of
> use-once pages trashing the workingset.
You are right of course. I should have realized that! We really need
another piece of information, stored in the struct page or maybe in the
xarray, to reflect that.
> > > Maybe the refaults will be fine - but latency expectations around
> > > mapped page cache certainly are a lot higher than unmapped cache.
> > >
> > > So I'm a bit reluctant about this patch. If Minchan can be happy with
> > > the lock batching, I'd prefer that.
> >
> > Yes, it seems that the regular lock drop&relock helps in Minchan's case
> > but this is a kind of change that might have other subtle side effects.
> > E.g. will-it-scale has noticed a regression [1], likely because the
> > critical section is shorter and the overal throughput of the operation
> > decreases. Now, the w-i-s is an artificial benchmark so I wouldn't lose
> > much sleep over it normally but we have already seen real regressions
> > when the locking pattern has changed in the past so I would by a bit
> > cautious.
>
> I'm much more concerned about fundamentally changing the aging policy
> of mapped page cache then about the lock breaking scheme. With locking
> we worry about CPU effects; with aging we worry about additional IO.
But the latter is a little bit easier to observe and debug IMHO.
In my experience people are quite used to watching for major faults,
as that is an easy metric to compare.
> > As I've said, this RFC is mostly to open a discussion. I would really
> > like to weigh the overhead of mark_page_accessed and potential scenario
> > when refaults would be visible in practice. I can imagine that a short
> > lived statically linked applications have higher chance of being the
> > only user unlike libraries which are often being mapped via several
> > ptes. But the main problem to evaluate this is that there are many other
> > external factors to trigger the worst case.
>
> We can discuss the pros and cons, but ultimately we simply need to
> test it against real workloads to see if changing the promotion rules
> regresses the amount of paging we do in practice.
Agreed. Do you see any option other than trying it out and reverting if
we see regressions? We would get a workload description, which would be
helpful for future regression testing when touching this area. We can
start slowly and keep it in linux-next for a release cycle to catch any
fallout early.
Thoughts?
--
Michal Hocko
SUSE Labs
On Tue 13-08-19 12:51:43, Michal Hocko wrote:
> On Mon 12-08-19 11:07:25, Johannes Weiner wrote:
> > On Mon, Aug 12, 2019 at 10:09:47AM +0200, Michal Hocko wrote:
[...]
> > > > Maybe the refaults will be fine - but latency expectations around
> > > > mapped page cache certainly are a lot higher than unmapped cache.
> > > >
> > > > So I'm a bit reluctant about this patch. If Minchan can be happy with
> > > > the lock batching, I'd prefer that.
> > >
> > > Yes, it seems that the regular lock drop&relock helps in Minchan's case
> > > but this is a kind of change that might have other subtle side effects.
> > > E.g. will-it-scale has noticed a regression [1], likely because the
> > > critical section is shorter and the overal throughput of the operation
> > > decreases. Now, the w-i-s is an artificial benchmark so I wouldn't lose
> > > much sleep over it normally but we have already seen real regressions
> > > when the locking pattern has changed in the past so I would by a bit
> > > cautious.
> >
> > I'm much more concerned about fundamentally changing the aging policy
> > of mapped page cache then about the lock breaking scheme. With locking
> > we worry about CPU effects; with aging we worry about additional IO.
>
> But the later is observable and debuggable little bit easier IMHO.
> People are quite used to watch for major faults from my experience
> as that is an easy metric to compare.
>
> > > As I've said, this RFC is mostly to open a discussion. I would really
> > > like to weigh the overhead of mark_page_accessed and potential scenario
> > > when refaults would be visible in practice. I can imagine that a short
> > > lived statically linked applications have higher chance of being the
> > > only user unlike libraries which are often being mapped via several
> > > ptes. But the main problem to evaluate this is that there are many other
> > > external factors to trigger the worst case.
> >
> > We can discuss the pros and cons, but ultimately we simply need to
> > test it against real workloads to see if changing the promotion rules
> > regresses the amount of paging we do in practice.
>
> Agreed. Do you see other option than to try it out and revert if we see
> regressions? We would get a workload description which would be helpful
> for future regression testing when touching this area. We can start
> slower and keep it in linux-next for a release cycle to catch any
> fallouts early.
>
> Thoughts?
ping...
--
Michal Hocko
SUSE Labs
On Mon, Aug 26, 2019 at 02:06:30PM +0200, Michal Hocko wrote:
> On Tue 13-08-19 12:51:43, Michal Hocko wrote:
> > On Mon 12-08-19 11:07:25, Johannes Weiner wrote:
> > > On Mon, Aug 12, 2019 at 10:09:47AM +0200, Michal Hocko wrote:
> [...]
> > > > > Maybe the refaults will be fine - but latency expectations around
> > > > > mapped page cache certainly are a lot higher than unmapped cache.
> > > > >
> > > > > So I'm a bit reluctant about this patch. If Minchan can be happy with
> > > > > the lock batching, I'd prefer that.
> > > >
> > > > Yes, it seems that the regular lock drop&relock helps in Minchan's case
> > > > but this is a kind of change that might have other subtle side effects.
> > > > E.g. will-it-scale has noticed a regression [1], likely because the
> > > > critical section is shorter and the overal throughput of the operation
> > > > decreases. Now, the w-i-s is an artificial benchmark so I wouldn't lose
> > > > much sleep over it normally but we have already seen real regressions
> > > > when the locking pattern has changed in the past so I would by a bit
> > > > cautious.
> > >
> > > I'm much more concerned about fundamentally changing the aging policy
> > > of mapped page cache then about the lock breaking scheme. With locking
> > > we worry about CPU effects; with aging we worry about additional IO.
> >
> > But the later is observable and debuggable little bit easier IMHO.
> > People are quite used to watch for major faults from my experience
> > as that is an easy metric to compare.
Root-causing additional (re)faults is really difficult. We're talking
about a slight trend change in caching behavior in a sea of millions
of pages. There could be so many factors causing this, and for most of
them you have to patch debugging code into the kernel to rule them out.
A CPU regression you can figure out with perf.
> > > > As I've said, this RFC is mostly to open a discussion. I would really
> > > > like to weigh the overhead of mark_page_accessed and potential scenario
> > > > when refaults would be visible in practice. I can imagine that a short
> > > > lived statically linked applications have higher chance of being the
> > > > only user unlike libraries which are often being mapped via several
> > > > ptes. But the main problem to evaluate this is that there are many other
> > > > external factors to trigger the worst case.
> > >
> > > We can discuss the pros and cons, but ultimately we simply need to
> > > test it against real workloads to see if changing the promotion rules
> > > regresses the amount of paging we do in practice.
> >
> > Agreed. Do you see other option than to try it out and revert if we see
> > regressions? We would get a workload description which would be helpful
> > for future regression testing when touching this area. We can start
> > slower and keep it in linux-next for a release cycle to catch any
> > fallouts early.
> >
> > Thoughts?
>
> ping...
Personally, I'm not convinced by this patch. I think it's a pretty
drastic change in aging heuristics just to address a CPU overhead
problem that has simpler, easier-to-verify alternative solutions.
It WOULD be great to clarify and improve the aging model for mapped
cache, to make it a bit easier to reason about. But this patch does
not really get there either. Instead of taking a serious look at
mapped cache lifetime and usage scenarios, the changelog is more in
"let's see what breaks if we take out this screw here" territory.
So I'm afraid I don't think the patch & changelog in its current shape
should go upstream.
On Tue 27-08-19 12:00:26, Johannes Weiner wrote:
> On Mon, Aug 26, 2019 at 02:06:30PM +0200, Michal Hocko wrote:
> > On Tue 13-08-19 12:51:43, Michal Hocko wrote:
> > > On Mon 12-08-19 11:07:25, Johannes Weiner wrote:
> > > > On Mon, Aug 12, 2019 at 10:09:47AM +0200, Michal Hocko wrote:
> > [...]
> > > > > > Maybe the refaults will be fine - but latency expectations around
> > > > > > mapped page cache certainly are a lot higher than unmapped cache.
> > > > > >
> > > > > > So I'm a bit reluctant about this patch. If Minchan can be happy with
> > > > > > the lock batching, I'd prefer that.
> > > > >
> > > > > Yes, it seems that the regular lock drop&relock helps in Minchan's case
> > > > > but this is a kind of change that might have other subtle side effects.
> > > > > E.g. will-it-scale has noticed a regression [1], likely because the
> > > > > critical section is shorter and the overal throughput of the operation
> > > > > decreases. Now, the w-i-s is an artificial benchmark so I wouldn't lose
> > > > > much sleep over it normally but we have already seen real regressions
> > > > > when the locking pattern has changed in the past so I would by a bit
> > > > > cautious.
> > > >
> > > > I'm much more concerned about fundamentally changing the aging policy
> > > > of mapped page cache then about the lock breaking scheme. With locking
> > > > we worry about CPU effects; with aging we worry about additional IO.
> > >
> > > But the later is observable and debuggable little bit easier IMHO.
> > > People are quite used to watch for major faults from my experience
> > > as that is an easy metric to compare.
>
> Rootcausing additional (re)faults is really difficult. We're talking
> about a slight trend change in caching behavior in a sea of millions
> of pages. There could be so many factors causing this, and for most
> you have to patch debugging stuff into the kernel to rule them out.
>
> A CPU regression you can figure out with perf.
>
> > > > > As I've said, this RFC is mostly to open a discussion. I would really
> > > > > like to weigh the overhead of mark_page_accessed and potential scenario
> > > > > when refaults would be visible in practice. I can imagine that a short
> > > > > lived statically linked applications have higher chance of being the
> > > > > only user unlike libraries which are often being mapped via several
> > > > > ptes. But the main problem to evaluate this is that there are many other
> > > > > external factors to trigger the worst case.
> > > >
> > > > We can discuss the pros and cons, but ultimately we simply need to
> > > > test it against real workloads to see if changing the promotion rules
> > > > regresses the amount of paging we do in practice.
> > >
> > > Agreed. Do you see other option than to try it out and revert if we see
> > > regressions? We would get a workload description which would be helpful
> > > for future regression testing when touching this area. We can start
> > > slower and keep it in linux-next for a release cycle to catch any
> > > fallouts early.
> > >
> > > Thoughts?
> >
> > ping...
>
> Personally, I'm not convinced by this patch. I think it's a pretty
> drastic change in aging heuristics just to address a CPU overhead
> problem that has simpler, easier to verify, alternative solutions.
>
> It WOULD be great to clarify and improve the aging model for mapped
> cache, to make it a bit easier to reason about.
I fully agree with this! Do you have any specific ideas? I am afraid I
am unlikely to find time for the larger project this sounds like it
would be, but maybe others will find it a good fit.
> But this patch does
> not really get there either. Instead of taking a serious look at
> mapped cache lifetime and usage scenarios, the changelog is more in
> "let's see what breaks if we take out this screw here" territory.
You know that I tend to be quite conservative. In this case I can see
the cost, which is not negligible and likely to hit many workloads
because this is a common path. The immediate benefit is not really clear,
though, at least to me. We can speculate, and I would really love to hear
from Nick what exactly led him to this change.
> So I'm afraid I don't think the patch & changelog in its current shape
> should go upstream.
I will not insist, of course, but it would be really great to know and
_document_ why we are doing this. I really hate how often we keep
different heuristics and build more complex solutions on top just
because nobody dares to change them.
Our code is really hard to reason about.
--
Michal Hocko
SUSE Labs