From: Huang Ying <[email protected]>
Huge pages help to reduce the TLB miss rate, but they have a higher
cache footprint, which can sometimes cause problems. For example, when
clearing a huge page on the x86_64 platform, the cache footprint is 2M.
But a Xeon E5 v3 2699 CPU has 18 cores, 36 threads, and only 45M of
LLC (last level cache). That is, on average, there is 2.5M of LLC for
each core and 1.25M of LLC for each thread. If the cache pressure is
heavy when clearing the huge page, and we clear the huge page from
beginning to end, it is possible that the beginning of the huge page
is evicted from the cache by the time we finish clearing the end of
the huge page. And it is possible for the application to access the
beginning of the huge page after the huge page is cleared.

To help with the above situation, this patch changes the order in
which sub-pages are cleared when we clear a huge page. In quite a few
situations, we can get the address that the application will access
after we clear the huge page, for example, in a page fault handler.
Instead of clearing the huge page from beginning to end, we clear the
sub-pages farthest from the sub-page to be accessed first, and clear
the sub-page to be accessed last. This makes the sub-page to be
accessed the most cache-hot and the sub-pages around it more cache-hot
too. If we cannot know the address the application will access, the
beginning of the huge page is assumed to be the address the
application will access.

With this patch, the throughput increases ~28.3% in the vm-scalability
anon-w-seq test case with 72 processes on a 2-socket Xeon E5 v3 2699
system (36 cores, 72 threads). The test case creates 72 processes;
each process mmaps a big anonymous memory area and writes to it from
beginning to end. For each process, the other processes can be seen
as another workload that generates heavy cache pressure. At the same
time, the cache miss rate is reduced from ~33.4% to ~31.7%, the
IPC (instructions per cycle) increases from 0.56 to 0.74, and the time
spent in user space is reduced by ~7.9%.

Thanks to Andi Kleen for proposing to use the address to be accessed
to determine the order in which sub-pages are cleared.

The hugetlbfs access address could be improved; I will do that in
another patch.
[Use address to access information]
Suggested-by: Andi Kleen <[email protected]>
Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Nadia Yvette Chambers <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Shaohua Li <[email protected]>
---
fs/hugetlbfs/inode.c | 2 +-
include/linux/mm.h | 3 ++-
mm/huge_memory.c | 10 ++++++----
mm/hugetlb.c | 2 +-
mm/memory.c | 32 +++++++++++++++++++++++++++-----
5 files changed, 37 insertions(+), 12 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 33961b35007b..1bbb38fcaa11 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -627,7 +627,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
error = PTR_ERR(page);
goto out;
}
- clear_huge_page(page, addr, pages_per_huge_page(h));
+ clear_huge_page(page, addr, pages_per_huge_page(h), addr);
__SetPageUptodate(page);
error = huge_add_to_page_cache(page, mapping, index);
if (unlikely(error)) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9fee3213a75e..a954f63a13c9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2509,7 +2509,8 @@ enum mf_action_page_type {
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
extern void clear_huge_page(struct page *page,
unsigned long addr,
- unsigned int pages_per_huge_page);
+ unsigned int pages_per_huge_page,
+ unsigned long addr_hint);
extern void copy_user_huge_page(struct page *dst, struct page *src,
unsigned long addr, struct vm_area_struct *vma,
unsigned int pages_per_huge_page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fd3ad6c88c8a..b1e66df38661 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -549,7 +549,8 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
struct vm_area_struct *vma = vmf->vma;
struct mem_cgroup *memcg;
pgtable_t pgtable;
- unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+ unsigned long address = vmf->address;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
VM_BUG_ON_PAGE(!PageCompound(page), page);
@@ -566,7 +567,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
return VM_FAULT_OOM;
}
- clear_huge_page(page, haddr, HPAGE_PMD_NR);
+ clear_huge_page(page, haddr, HPAGE_PMD_NR, address);
/*
* The memory barrier inside __SetPageUptodate makes sure that
* clear_huge_page writes become visible before the set_pmd_at()
@@ -1225,7 +1226,8 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
struct vm_area_struct *vma = vmf->vma;
struct page *page = NULL, *new_page;
struct mem_cgroup *memcg;
- unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+ unsigned long address = vmf->address;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
gfp_t huge_gfp; /* for allocation and charge */
@@ -1310,7 +1312,7 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
count_vm_event(THP_FAULT_ALLOC);
if (!page)
- clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
+ clear_huge_page(new_page, haddr, HPAGE_PMD_NR, address);
else
copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
__SetPageUptodate(new_page);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5dae4fff368d..fb2ff230236a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3707,7 +3707,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
ret = VM_FAULT_SIGBUS;
goto out;
}
- clear_huge_page(page, address, pages_per_huge_page(h));
+ clear_huge_page(page, address, pages_per_huge_page(h), address);
__SetPageUptodate(page);
set_page_huge_active(page);
diff --git a/mm/memory.c b/mm/memory.c
index edabf6f03447..d5bd7633a443 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4363,10 +4363,10 @@ static void clear_gigantic_page(struct page *page,
clear_user_highpage(p, addr + i * PAGE_SIZE);
}
}
-void clear_huge_page(struct page *page,
- unsigned long addr, unsigned int pages_per_huge_page)
+void clear_huge_page(struct page *page, unsigned long addr,
+ unsigned int pages_per_huge_page, unsigned long addr_hint)
{
- int i;
+ int i, n, base, l;
if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
clear_gigantic_page(page, addr, pages_per_huge_page);
@@ -4374,9 +4374,31 @@ void clear_huge_page(struct page *page,
}
might_sleep();
- for (i = 0; i < pages_per_huge_page; i++) {
+ VM_BUG_ON(clamp(addr_hint, addr, addr +
+ (pages_per_huge_page << PAGE_SHIFT)) != addr_hint);
+ n = (addr_hint - addr) / PAGE_SIZE;
+ if (2 * n <= pages_per_huge_page) {
+ base = 0;
+ l = n;
+ for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
+ cond_resched();
+ clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+ }
+ } else {
+ base = 2 * n - pages_per_huge_page;
+ l = pages_per_huge_page - n;
+ for (i = 0; i < base; i++) {
+ cond_resched();
+ clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+ }
+ }
+ for (i = 0; i < l; i++) {
+ cond_resched();
+ clear_user_highpage(page + base + i,
+ addr + (base + i) * PAGE_SIZE);
cond_resched();
- clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+ clear_user_highpage(page + base + 2 * l - 1 - i,
+ addr + (base + 2 * l - 1 - i) * PAGE_SIZE);
}
}
--
2.11.0
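The clearing order produced by the clear_huge_page() hunk above can be
modelled in user space. The following is only a sketch for reasoning
about the loop bounds, not kernel code: clear_order() and its
parameters are hypothetical names, and it records sub-page indices
instead of calling clear_user_highpage().

```c
#include <assert.h>

/*
 * User-space model of the sub-page ordering in the patched
 * clear_huge_page(). Instead of clearing memory, it records the index
 * of each sub-page in the order the kernel loops would clear it.
 *
 * nr_pages: number of sub-pages in the huge page
 * n:        index of the sub-page the application is expected to touch
 * order:    output array with room for nr_pages entries
 *
 * Returns the number of entries written (always nr_pages).
 */
static int clear_order(int nr_pages, int n, int *order)
{
	int i, base, l, cnt = 0;

	/* Mirrors the range check done with VM_BUG_ON() in the patch. */
	assert(n >= 0 && n <= nr_pages);

	if (2 * n <= nr_pages) {
		/* Target in the first half: clear the far tail, back to front. */
		base = 0;
		l = n;
		for (i = nr_pages - 1; i >= 2 * n; i--)
			order[cnt++] = i;
	} else {
		/* Target in the second half: clear the far head, front to back. */
		base = 2 * n - nr_pages;
		l = nr_pages - n;
		for (i = 0; i < base; i++)
			order[cnt++] = i;
	}

	/* Converge on the target from both sides of the remaining window. */
	for (i = 0; i < l; i++) {
		order[cnt++] = base + i;
		order[cnt++] = base + 2 * l - 1 - i;
	}
	return cnt;
}
```

For example, with 8 sub-pages and n = 2 the recorded order is
7, 6, 5, 4, 0, 3, 1, 2, so the sub-page to be accessed is cleared last
and its neighbours shortly before it.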
On Mon 07-08-17 15:21:31, Huang, Ying wrote:
> With this patch, the throughput increases ~28.3% in the vm-scalability
> anon-w-seq test case with 72 processes on a 2-socket Xeon E5 v3 2699
> system (36 cores, 72 threads). The test case creates 72 processes;
> each process mmaps a big anonymous memory area and writes to it from
> beginning to end. For each process, the other processes can be seen
> as another workload that generates heavy cache pressure. At the same
> time, the cache miss rate is reduced from ~33.4% to ~31.7%, the
> IPC (instructions per cycle) increases from 0.56 to 0.74, and the time
> spent in user space is reduced by ~7.9%.
Hum, the improvement looks impressive enough that it is probably worth the
bother. But please add at least a brief explanation why you do stuff in
this more complicated way to a comment in clear_huge_page() so that people
don't have to look it up in the changelog. Otherwise the patch looks good
to me so feel free to add:
Acked-by: Jan Kara <[email protected]>
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Jan Kara <[email protected]> writes:
> Hum, the improvement looks impressive enough that it is probably worth the
> bother. But please add at least a brief explanation why you do stuff in
> this more complicated way to a comment in clear_huge_page() so that people
> don't have to look it up in the changelog.
Good suggestion! I will do that in the next version.
> Otherwise the patch looks good
> to me so feel free to add:
>
> Acked-by: Jan Kara <[email protected]>
Thanks!
Best Regards,
Huang, Ying
On Mon, Aug 07, 2017 at 03:21:31PM +0800, Huang, Ying wrote:
> With this patch, the throughput increases ~28.3% in the vm-scalability
> anon-w-seq test case with 72 processes on a 2-socket Xeon E5 v3 2699
> system (36 cores, 72 threads). The test case creates 72 processes;
> each process mmaps a big anonymous memory area and writes to it from
> beginning to end. For each process, the other processes can be seen
> as another workload that generates heavy cache pressure. At the same
> time, the cache miss rate is reduced from ~33.4% to ~31.7%, the
> IPC (instructions per cycle) increases from 0.56 to 0.74, and the time
> spent in user space is reduced by ~7.9%.
That's impressive.
But what about the case when we are not bound that much by the size of
LLC? What about running the same test on the same hardware, but with 4
processes instead of 72?
I just want to make sure we don't regress on a more realistic test case.
--
Kirill A. Shutemov
On Mon, 7 Aug 2017, Huang, Ying wrote:
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4374,9 +4374,31 @@ void clear_huge_page(struct page *page,
> }
>
> might_sleep();
> - for (i = 0; i < pages_per_huge_page; i++) {
> + VM_BUG_ON(clamp(addr_hint, addr, addr +
> + (pages_per_huge_page << PAGE_SHIFT)) != addr_hint);
> + n = (addr_hint - addr) / PAGE_SIZE;
> + if (2 * n <= pages_per_huge_page) {
> + base = 0;
> + l = n;
> + for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
> + cond_resched();
> + clear_user_highpage(page + i, addr + i * PAGE_SIZE);
> + }
I really like the idea behind the patch but this is not clearing from last
to first byte of the huge page.
What seems to be happening here is clearing from the last page to the
first page and I would think that within each page the clearing is from
first byte to last byte. Maybe more gains can be had by really clearing
from last to first byte of the huge page instead of this jumping over 4k
addresses?
"Kirill A. Shutemov" <[email protected]> writes:
> That's impressive.
>
> But what about the case when we are not bound that much by the size of
> LLC? What about running the same test on the same hardware, but with 4
> processes instead of 72?
>
> I just want to make sure we don't regress on a more realistic test case.
Sure. I will test it.
Best Regards,
Huang, Ying
Christopher Lameter <[email protected]> writes:
> I really like the idea behind the patch but this is not clearing from last
> to first byte of the huge page.
>
> What seems to be happening here is clearing from the last page to the
> first page and I would think that within each page the clearing is from
> first byte to last byte. Maybe more gains can be had by really clearing
> from last to first byte of the huge page instead of this jumping over 4k
> addresses?
Yes. That is a good idea. I will experiment with it by changing the
direction of clearing in clear_user_highpage().
Best Regards,
Huang, Ying
On 08/07/2017 12:21 AM, Huang, Ying wrote:
> The hugetlbfs access address could be improved, will do that in
> another patch.
hugetlb_fault masks off the actual faulting address with,
address &= huge_page_mask(h);
before calling hugetlb_no_page.
But we could pass down the actual (unmasked) address to take advantage
of this optimization for hugetlb faults as well. hugetlb_fault is the
only caller of hugetlb_no_page, so this should be pretty straightforward.
Were you thinking of additional improvements?
--
Mike Kravetz
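The effect of the masking described above can be shown with a small
user-space sketch. It assumes 4K base pages and 2M huge pages, as in
the x86_64 example from the changelog; the SKETCH_* macros and both
helper functions are hypothetical names used only for illustration and
are not kernel interfaces.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT   12UL   /* 4K base page (assumed) */
#define SKETCH_HPAGE_SHIFT  21UL   /* 2M huge page (assumed) */
#define SKETCH_PAGE_SIZE    (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_HPAGE_SIZE   (1UL << SKETCH_HPAGE_SHIFT)
#define SKETCH_HPAGE_MASK   (~(SKETCH_HPAGE_SIZE - 1))

/* Huge-page-aligned base address: what is left after the masking. */
static unsigned long hpage_base(unsigned long fault_addr)
{
	return fault_addr & SKETCH_HPAGE_MASK;
}

/* Index of the sub-page to clear last, derived from the unmasked address. */
static unsigned long hint_subpage(unsigned long fault_addr)
{
	unsigned long idx = (fault_addr - hpage_base(fault_addr)) / SKETCH_PAGE_SIZE;

	/* The index always falls inside the huge page. */
	assert(idx < SKETCH_HPAGE_SIZE / SKETCH_PAGE_SIZE);
	return idx;
}
```

Once the address has been masked, the derived index is always 0, so
the hint degenerates to "start of the huge page"; passing the unmasked
address preserves the sub-page that actually faulted.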
Mike Kravetz <[email protected]> writes:
> hugetlb_fault masks off the actual faulting address with,
> address &= huge_page_mask(h);
> before calling hugetlb_no_page.
>
> But we could pass down the actual (unmasked) address to take advantage
> of this optimization for hugetlb faults as well. hugetlb_fault is the
> only caller of hugetlb_no_page, so this should be pretty straightforward.
>
> Were you thinking of additional improvements?
No. I am thinking of something like this. If the basic idea is
accepted, I plan to add better support like this for hugetlbfs in
another patch.
Best Regards,
Huang, Ying
Christopher Lameter <[email protected]> writes:
> I really like the idea behind the patch but this is not clearing from last
> to first byte of the huge page.
>
> What seems to be happening here is clearing from the last page to the
> first page and I would think that within each page the clearing is from
> first byte to last byte. Maybe more gains can be had by really clearing
> from last to first byte of the huge page instead of this jumping over 4k
> addresses?
I changed the code to use clear_page_orig() and made it clear each page
from end to start. The patch is below.

With that, there are no visible changes in the benchmark result. But
the cache miss rate dropped a little, from 27.64% to 26.70%. The cache
miss rate differs from the earlier numbers because a different
clear_page() implementation is used.

I think this is because the size of a page is relatively small compared
with the cache size, so the effect is almost invisible.
Best Regards,
Huang, Ying
--------------->8----------------
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index b4a0d43248cf..01d201afde92 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -42,8 +42,8 @@ void clear_page_erms(void *page);
static inline void clear_page(void *page)
{
alternative_call_2(clear_page_orig,
- clear_page_rep, X86_FEATURE_REP_GOOD,
- clear_page_erms, X86_FEATURE_ERMS,
+ clear_page_orig, X86_FEATURE_REP_GOOD,
+ clear_page_orig, X86_FEATURE_ERMS,
"=D" (page),
"0" (page)
: "memory", "rax", "rcx");
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 81b1635d67de..23e6238e625d 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -25,19 +25,20 @@ EXPORT_SYMBOL_GPL(clear_page_rep)
ENTRY(clear_page_orig)
xorl %eax,%eax
movl $4096/64,%ecx
+ addq $4096-64,%rdi
.p2align 4
.Lloop:
decl %ecx
#define PUT(x) movq %rax,x*8(%rdi)
- movq %rax,(%rdi)
- PUT(1)
- PUT(2)
- PUT(3)
- PUT(4)
- PUT(5)
- PUT(6)
PUT(7)
- leaq 64(%rdi),%rdi
+ PUT(6)
+ PUT(5)
+ PUT(4)
+ PUT(3)
+ PUT(2)
+ PUT(1)
+ movq %rax,(%rdi)
+ leaq -64(%rdi),%rdi
jnz .Lloop
nop
ret
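For readers less familiar with x86 assembly, the back-to-front loop in
the experimental hunk above corresponds roughly to the following C
sketch. It is an illustration only: clear_page_backward() and the
SKETCH_* constants are hypothetical names, the store order inside each
64-byte chunk is left to memset(), and a compiler is free to
transform the loop, so it should not be used to reproduce the
measurements.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096
#define SKETCH_CHUNK     64	/* bytes zeroed per loop iteration, as in the hunk above */

/* Zero one page starting with its last 64-byte chunk and ending with its first. */
static void clear_page_backward(void *page)
{
	unsigned char *base = page;
	size_t off;

	assert(page != NULL);

	/* off is the end offset of the chunk being cleared: 4096, 4032, ..., 64. */
	for (off = SKETCH_PAGE_SIZE; off >= SKETCH_CHUNK; off -= SKETCH_CHUNK)
		memset(base + off - SKETCH_CHUNK, 0, SKETCH_CHUNK);
}
```

The loop counts down on the end offset of each chunk so that the
unsigned offset never wraps below zero.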
"Huang, Ying" <[email protected]> writes:
> "Kirill A. Shutemov" <[email protected]> writes:
>
>> On Mon, Aug 07, 2017 at 03:21:31PM +0800, Huang, Ying wrote:
>>> From: Huang Ying <[email protected]>
>>>
>>> Huge page helps to reduce TLB miss rate, but it has higher cache
>>> footprint, sometimes this may cause some issue. For example, when
>>> clearing huge page on x86_64 platform, the cache footprint is 2M. But
>>> on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M
>>> LLC (last level cache). That is, in average, there are 2.5M LLC for
>>> each core and 1.25M LLC for each thread. If the cache pressure is
>>> heavy when clearing the huge page, and we clear the huge page from the
>>> begin to the end, it is possible that the begin of huge page is
>>> evicted from the cache after we finishing clearing the end of the huge
>>> page. And it is possible for the application to access the begin of
>>> the huge page after clearing the huge page.
>>>
>>> To help the above situation, in this patch, when we clear a huge page,
>>> the order to clear sub-pages is changed. In quite some situation, we
>>> can get the address that the application will access after we clear
>>> the huge page, for example, in a page fault handler. Instead of
>>> clearing the huge page from begin to end, we will clear the sub-pages
>>> farthest from the the sub-page to access firstly, and clear the
>>> sub-page to access last. This will make the sub-page to access most
>>> cache-hot and sub-pages around it more cache-hot too. If we cannot
>>> know the address the application will access, the begin of the huge
>>> page is assumed to be the address the application will access.
>>>
>>> With this patch, the throughput increases ~28.3% in vm-scalability
>>> anon-w-seq test case with 72 processes on a 2 socket Xeon E5 v3 2699
>>> system (36 cores, 72 threads). The test case creates 72 processes,
>>> each process mmap a big anonymous memory area and writes to it from
>>> the begin to the end. For each process, other processes could be seen
>>> as other workload which generates heavy cache pressure. At the same
>>> time, the cache miss rate reduced from ~33.4% to ~31.7%, the
>>> IPC (instruction per cycle) increased from 0.56 to 0.74, and the time
>>> spent in user space is reduced ~7.9%
>>
>> That's impressive.
>>
>> But what about the case when we are not bounded that much by the size of
>> LLC? What about running the same test on the same hardware, but with 4
>> processes instead of 72.
>>
>> I just want to make sure we don't regress on a more realistic test case.
>
> Sure. I will test it.
Tested with 4 processes; there is no visible change in the benchmark result.
Best Regards,
Huang, Ying
On Mon, Aug 07, 2017 at 03:21:31PM +0800, Huang, Ying wrote:
> @@ -2509,7 +2509,8 @@ enum mf_action_page_type {
> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> extern void clear_huge_page(struct page *page,
> unsigned long addr,
> - unsigned int pages_per_huge_page);
> + unsigned int pages_per_huge_page,
> + unsigned long addr_hint);
I don't really like adding the extra argument to this function ...
> +++ b/mm/huge_memory.c
> @@ -549,7 +549,8 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
> struct vm_area_struct *vma = vmf->vma;
> struct mem_cgroup *memcg;
> pgtable_t pgtable;
> - unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> + unsigned long address = vmf->address;
> + unsigned long haddr = address & HPAGE_PMD_MASK;
>
> VM_BUG_ON_PAGE(!PageCompound(page), page);
>
> @@ -566,7 +567,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
> return VM_FAULT_OOM;
> }
>
> - clear_huge_page(page, haddr, HPAGE_PMD_NR);
> + clear_huge_page(page, haddr, HPAGE_PMD_NR, address);
> /*
> * The memory barrier inside __SetPageUptodate makes sure that
> * clear_huge_page writes become visible before the set_pmd_at()
How about calling:
- clear_huge_page(page, haddr, HPAGE_PMD_NR);
+ clear_huge_page(page, address, HPAGE_PMD_NR);
> +++ b/mm/memory.c
> @@ -4363,10 +4363,10 @@ static void clear_gigantic_page(struct page *page,
> clear_user_highpage(p, addr + i * PAGE_SIZE);
> }
> }
> -void clear_huge_page(struct page *page,
> - unsigned long addr, unsigned int pages_per_huge_page)
> +void clear_huge_page(struct page *page, unsigned long addr,
> + unsigned int pages_per_huge_page, unsigned long addr_hint)
> {
> - int i;
> + int i, n, base, l;
>
> if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
> clear_gigantic_page(page, addr, pages_per_huge_page);
... and doing this:
void clear_huge_page(struct page *page,
- unsigned long addr, unsigned int pages_per_huge_page)
+ unsigned long addr_hint, unsigned int pages_per_huge_page)
{
- int i;
+ int i, n, base, l;
+ unsigned long addr = addr_hint &
+ ~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
> @@ -4374,9 +4374,31 @@ void clear_huge_page(struct page *page,
> }
>
> might_sleep();
> - for (i = 0; i < pages_per_huge_page; i++) {
> + VM_BUG_ON(clamp(addr_hint, addr, addr +
> + (pages_per_huge_page << PAGE_SHIFT)) != addr_hint);
... then you can ditch this check
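The idea of passing only the hint and recovering the huge page base inside the function can be sketched in user space as follows. This is a sketch under stated assumptions: huge_page_base() and SKETCH_PAGE_SHIFT are hypothetical names, 4 KiB base pages are assumed, and pages_per_huge_page must be a power of two for the mask to be valid.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12    /* assumption: 4 KiB base pages */

/* Round addr_hint down to the start of the huge page that contains it. */
static unsigned long huge_page_base(unsigned long addr_hint,
                                    unsigned int pages_per_huge_page)
{
    unsigned long size =
        (unsigned long)pages_per_huge_page << SKETCH_PAGE_SHIFT;

    /* Masking off the low bits only works for power-of-two sizes. */
    assert(pages_per_huge_page != 0 &&
           (pages_per_huge_page & (pages_per_huge_page - 1)) == 0);
    return addr_hint & ~(size - 1);
}
```

For a 2M huge page (512 sub-pages), huge_page_base(0x40345678UL, 512) gives 0x40200000UL. Because the base is derived from the hint, the hint is always inside the huge page by construction.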
Matthew Wilcox <[email protected]> writes:
> On Mon, Aug 07, 2017 at 03:21:31PM +0800, Huang, Ying wrote:
>> @@ -2509,7 +2509,8 @@ enum mf_action_page_type {
>> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
>> extern void clear_huge_page(struct page *page,
>> unsigned long addr,
>> - unsigned int pages_per_huge_page);
>> + unsigned int pages_per_huge_page,
>> + unsigned long addr_hint);
>
> I don't really like adding the extra argument to this function ...
>
>> +++ b/mm/huge_memory.c
>> @@ -549,7 +549,8 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
>> struct vm_area_struct *vma = vmf->vma;
>> struct mem_cgroup *memcg;
>> pgtable_t pgtable;
>> - unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>> + unsigned long address = vmf->address;
>> + unsigned long haddr = address & HPAGE_PMD_MASK;
>>
>> VM_BUG_ON_PAGE(!PageCompound(page), page);
>>
>> @@ -566,7 +567,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
>> return VM_FAULT_OOM;
>> }
>>
>> - clear_huge_page(page, haddr, HPAGE_PMD_NR);
>> + clear_huge_page(page, haddr, HPAGE_PMD_NR, address);
>> /*
>> * The memory barrier inside __SetPageUptodate makes sure that
>> * clear_huge_page writes become visible before the set_pmd_at()
>
> How about calling:
>
> - clear_huge_page(page, haddr, HPAGE_PMD_NR);
> + clear_huge_page(page, address, HPAGE_PMD_NR);
>
>> +++ b/mm/memory.c
>> @@ -4363,10 +4363,10 @@ static void clear_gigantic_page(struct page *page,
>> clear_user_highpage(p, addr + i * PAGE_SIZE);
>> }
>> }
>> -void clear_huge_page(struct page *page,
>> - unsigned long addr, unsigned int pages_per_huge_page)
>> +void clear_huge_page(struct page *page, unsigned long addr,
>> + unsigned int pages_per_huge_page, unsigned long addr_hint)
>> {
>> - int i;
>> + int i, n, base, l;
>>
>> if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
>> clear_gigantic_page(page, addr, pages_per_huge_page);
>
> ... and doing this:
>
> void clear_huge_page(struct page *page,
> - unsigned long addr, unsigned int pages_per_huge_page)
> + unsigned long addr_hint, unsigned int pages_per_huge_page)
> {
> - int i;
> + int i, n, base, l;
> + unsigned long addr = addr_hint &
> + ~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
>
>> @@ -4374,9 +4374,31 @@ void clear_huge_page(struct page *page,
>> }
>>
>> might_sleep();
>> - for (i = 0; i < pages_per_huge_page; i++) {
>> + VM_BUG_ON(clamp(addr_hint, addr, addr +
>> + (pages_per_huge_page << PAGE_SHIFT)) != addr_hint);
>
> ... then you can ditch this check
Yes. This looks good to me. If there are no objections, I will go this
way in the next version.
Best Regards,
Huang, Ying
On Mon, 7 Aug 2017 15:21:31 +0800 "Huang, Ying" <[email protected]> wrote:
> From: Huang Ying <[email protected]>
>
> Huge page helps to reduce TLB miss rate, but it has higher cache
> footprint, sometimes this may cause some issue. For example, when
> clearing huge page on x86_64 platform, the cache footprint is 2M. But
> on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M
> LLC (last level cache). That is, in average, there are 2.5M LLC for
> each core and 1.25M LLC for each thread. If the cache pressure is
> heavy when clearing the huge page, and we clear the huge page from the
> begin to the end, it is possible that the begin of huge page is
> evicted from the cache after we finishing clearing the end of the huge
> page. And it is possible for the application to access the begin of
> the huge page after clearing the huge page.
>
> To help the above situation, in this patch, when we clear a huge page,
> the order to clear sub-pages is changed. In quite some situation, we
> can get the address that the application will access after we clear
> the huge page, for example, in a page fault handler. Instead of
> clearing the huge page from begin to end, we will clear the sub-pages
> farthest from the sub-page to access first, and clear the
> sub-page to access last. This will make the sub-page to access most
> cache-hot and sub-pages around it more cache-hot too. If we cannot
> know the address the application will access, the begin of the huge
> page is assumed to be the address the application will access.
>
> With this patch, the throughput increases ~28.3% in vm-scalability
> anon-w-seq test case with 72 processes on a 2 socket Xeon E5 v3 2699
> system (36 cores, 72 threads). The test case creates 72 processes,
> each process mmap a big anonymous memory area and writes to it from
> the begin to the end. For each process, other processes could be seen
> as other workload which generates heavy cache pressure. At the same
> time, the cache miss rate reduced from ~33.4% to ~31.7%, the
> IPC (instruction per cycle) increased from 0.56 to 0.74, and the time
> spent in user space is reduced ~7.9%
>
> Thanks to Andi Kleen for proposing to use the address to be accessed to
> determine the order of sub-pages to clear.
>
> The hugetlbfs access address could be improved, will do that in
> another patch.
I agree with what others said, plus...
> @@ -4374,9 +4374,31 @@ void clear_huge_page(struct page *page,
> }
>
> might_sleep();
> - for (i = 0; i < pages_per_huge_page; i++) {
> + VM_BUG_ON(clamp(addr_hint, addr, addr +
> + (pages_per_huge_page << PAGE_SHIFT)) != addr_hint);
> + n = (addr_hint - addr) / PAGE_SIZE;
> + if (2 * n <= pages_per_huge_page) {
> + base = 0;
> + l = n;
> + for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
> + cond_resched();
> + clear_user_highpage(page + i, addr + i * PAGE_SIZE);
> + }
> + } else {
> + base = 2 * n - pages_per_huge_page;
> + l = pages_per_huge_page - n;
> + for (i = 0; i < base; i++) {
> + cond_resched();
> + clear_user_highpage(page + i, addr + i * PAGE_SIZE);
> + }
> + }
> + for (i = 0; i < l; i++) {
> + cond_resched();
> + clear_user_highpage(page + base + i,
> + addr + (base + i) * PAGE_SIZE);
> cond_resched();
> - clear_user_highpage(page + i, addr + i * PAGE_SIZE);
> + clear_user_highpage(page + base + 2 * l - 1 - i,
> + addr + (base + 2 * l - 1 - i) * PAGE_SIZE);
Please document this design with a carefully written code comment.
For example, why was "2 * n" chosen? What is it trying to achieve?
Also, the final clearing loop "for (i = 0; i < l; i++)" might cause
eviction of data which was cached in the previous loop. Perhaps some
additional gains will be made by clearing the hugepage in a
left-right-left-right "start from the ends and work inwards" manner, if
you see what I mean. So the 4k pages immediately surrounding addr_hint
are the most-recently-cleared. Although accesses to the data at lower
addresses than addr_hint are probably somewhat rare (and may be
nonexistent in your synthetic test case).
Hi, Andrew,
Andrew Morton <[email protected]> writes:
> On Mon, 7 Aug 2017 15:21:31 +0800 "Huang, Ying" <[email protected]> wrote:
>
>> From: Huang Ying <[email protected]>
>>
>> Huge page helps to reduce TLB miss rate, but it has higher cache
>> footprint, sometimes this may cause some issue. For example, when
>> clearing huge page on x86_64 platform, the cache footprint is 2M. But
>> on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M
>> LLC (last level cache). That is, in average, there are 2.5M LLC for
>> each core and 1.25M LLC for each thread. If the cache pressure is
>> heavy when clearing the huge page, and we clear the huge page from the
>> begin to the end, it is possible that the begin of huge page is
>> evicted from the cache after we finishing clearing the end of the huge
>> page. And it is possible for the application to access the begin of
>> the huge page after clearing the huge page.
>>
>> To help the above situation, in this patch, when we clear a huge page,
>> the order to clear sub-pages is changed. In quite some situation, we
>> can get the address that the application will access after we clear
>> the huge page, for example, in a page fault handler. Instead of
>> clearing the huge page from begin to end, we will clear the sub-pages
>> farthest from the sub-page to access first, and clear the
>> sub-page to access last. This will make the sub-page to access most
>> cache-hot and sub-pages around it more cache-hot too. If we cannot
>> know the address the application will access, the begin of the huge
>> page is assumed to be the address the application will access.
>>
>> With this patch, the throughput increases ~28.3% in vm-scalability
>> anon-w-seq test case with 72 processes on a 2 socket Xeon E5 v3 2699
>> system (36 cores, 72 threads). The test case creates 72 processes,
>> each process mmap a big anonymous memory area and writes to it from
>> the begin to the end. For each process, other processes could be seen
>> as other workload which generates heavy cache pressure. At the same
>> time, the cache miss rate reduced from ~33.4% to ~31.7%, the
>> IPC (instruction per cycle) increased from 0.56 to 0.74, and the time
>> spent in user space is reduced ~7.9%
>>
>> Thanks to Andi Kleen for proposing to use the address to be accessed to
>> determine the order of sub-pages to clear.
>>
>> The hugetlbfs access address could be improved, will do that in
>> another patch.
>
> I agree with what others said, plus...
>
>> @@ -4374,9 +4374,31 @@ void clear_huge_page(struct page *page,
>> }
>>
>> might_sleep();
>> - for (i = 0; i < pages_per_huge_page; i++) {
>> + VM_BUG_ON(clamp(addr_hint, addr, addr +
>> + (pages_per_huge_page << PAGE_SHIFT)) != addr_hint);
>> + n = (addr_hint - addr) / PAGE_SIZE;
>> + if (2 * n <= pages_per_huge_page) {
>> + base = 0;
>> + l = n;
>> + for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
>> + cond_resched();
>> + clear_user_highpage(page + i, addr + i * PAGE_SIZE);
>> + }
>> + } else {
>> + base = 2 * n - pages_per_huge_page;
>> + l = pages_per_huge_page - n;
>> + for (i = 0; i < base; i++) {
>> + cond_resched();
>> + clear_user_highpage(page + i, addr + i * PAGE_SIZE);
>> + }
>> + }
>> + for (i = 0; i < l; i++) {
>> + cond_resched();
>> + clear_user_highpage(page + base + i,
>> + addr + (base + i) * PAGE_SIZE);
>> cond_resched();
>> - clear_user_highpage(page + i, addr + i * PAGE_SIZE);
>> + clear_user_highpage(page + base + 2 * l - 1 - i,
>> + addr + (base + 2 * l - 1 - i) * PAGE_SIZE);
>
> Please document this design with a carefully written code comment.
> For example, why was "2 * n" chosen? What is it trying to achieve?
Sure.
The "2 * n" comparison determines whether addr_hint is in the first half
(2 * n <= pages_per_huge_page) or the second half (2 * n >
pages_per_huge_page) of the huge page.
> Also, the final clearing loop "for (i = 0; i < l; i++)" might cause
> eviction of data which was cached in the previous loop. Perhaps some
> additional gains will be made by clearing the hugepage in a
> left-right-left-right "start from the ends and work inwards" manner, if
> you see what I mean. So the 4k pages immediately surrounding addr_hint
> are the most-recently-cleared. Although accesses to the data at lower
> addresses than addr_hint are probably somewhat rare (and may be
> nonexistent in your synthetic test case).
Yes. I think the patch already does exactly this. Each iteration of the
final loop clears two sub-pages, base + i and base + 2 * l - 1 - i, that
is, one from the left side and one from the right side of the window
around the fault sub-page. The loop works inwards, so the fault sub-page
is the last sub-page to be cleared.
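The clearing order can be checked with a small user-space simulation. This is a sketch only: subpage_clear_order() is a hypothetical helper that mirrors the index arithmetic of the quoted hunk and records sub-page indices instead of calling clear_user_highpage().

```c
#include <assert.h>

/*
 * Record the order in which sub-pages would be cleared. n is the index
 * of the sub-page containing addr_hint; order[] must hold `pages` ints.
 */
static void subpage_clear_order(int pages, int n, int *order)
{
    int i, base, l, k = 0;

    assert(pages > 0 && n >= 0 && n < pages);
    if (2 * n <= pages) {
        /* Hint in the first half: clear the far tail first, high to low. */
        base = 0;
        l = n;
        for (i = pages - 1; i >= 2 * n; i--)
            order[k++] = i;
    } else {
        /* Hint in the second half: clear the far head first, low to high. */
        base = 2 * n - pages;
        l = pages - n;
        for (i = 0; i < base; i++)
            order[k++] = i;
    }
    /* Close in on the hint from both ends of [base, base + 2 * l). */
    for (i = 0; i < l; i++) {
        order[k++] = base + i;
        order[k++] = base + 2 * l - 1 - i;
    }
    assert(k == pages);    /* every sub-page is visited exactly once */
}
```

For a toy huge page of 8 sub-pages with the hint in sub-page 2, the recorded order is 7 6 5 4 0 3 1 2. With the hint in sub-page 6 it is 0 1 2 3 4 7 5 6. In both cases the hinted sub-page comes last.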
Best Regards,
Huang, Ying