2014-06-17 22:38:23

by Waiman Long

Subject: [PATCH v2 0/2] mm, thp: two THP splitting performance fixes

v1->v2:
- Add a second patch to replace smp_mb() by smp_mb__after_atomic().
- Add performance data to the first patch

This mini-series contains 2 minor changes to the transparent huge
page splitting code to improve its performance, particularly for the
x86 architecture.

Waiman Long (2):
  mm, thp: move invariant bug check out of loop in
    __split_huge_page_map
  mm, thp: replace smp_mb after atomic_add by smp_mb__after_atomic

 mm/huge_memory.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)


2014-06-17 22:38:25

by Waiman Long

Subject: [PATCH v2 2/2] mm, thp: replace smp_mb after atomic_add by smp_mb__after_atomic

On some architectures, such as x86, atomic_add() is a full memory
barrier. In that case, an additional smp_mb() is just a waste of time.
This patch replaces that smp_mb() with smp_mb__after_atomic(), which
avoids the redundant memory barrier on those architectures.

With a 3.16-rc1 based kernel, this patch reduced the execution time
of breaking 1000 transparent huge pages from 38,245us to 30,964us, a
sizeable reduction of 19%. It also reduces the %cpu time of the
__split_huge_page_refcount function in the perf profile from 2.18%
to 1.15%.

Signed-off-by: Waiman Long <[email protected]>
---
mm/huge_memory.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index be84c71..e2ee131 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1650,7 +1650,7 @@ static void __split_huge_page_refcount(struct page *page,
 			   &page_tail->_count);

 		/* after clearing PageTail the gup refcount can be released */
-		smp_mb();
+		smp_mb__after_atomic();

 		/*
 		 * retain hwpoison flag of the poisoned tail page:
--
1.7.1
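
For readers unfamiliar with the barrier pair, here is a minimal userspace
sketch of the idea behind the change above. It is not kernel code: the
sketch_* names, the struct, and the macro definitions are all hypothetical
stand-ins built on C11 <stdatomic.h>. The assumption being illustrated is
that on x86 an atomic read-modify-write is a locked instruction that already
orders memory, so the "after atomic" barrier only needs to stop compiler
reordering, while other architectures still need a real fence. This mirrors
hardware-level reasoning and is not a guarantee made by the C11 memory model.

```c
#include <stdatomic.h>

/* Hypothetical stand-ins for illustration; NOT the kernel's definitions. */
#define sketch_mb() atomic_thread_fence(memory_order_seq_cst)

#if defined(__x86_64__) || defined(__i386__)
/* Assumption: the locked RMW already orders memory, so only the
 * compiler must be kept from reordering across this point. */
#define sketch_mb__after_atomic() atomic_signal_fence(memory_order_seq_cst)
#else
/* Elsewhere, fall back to a full fence. */
#define sketch_mb__after_atomic() sketch_mb()
#endif

struct sketch_page {
	atomic_int count;	/* stand-in for a page reference count */
	int flags;		/* stand-in for state updated afterwards */
};

/*
 * Add refs to the count, then order that update before the store to
 * flags. Returns the resulting count.
 */
static int sketch_release_tail(struct sketch_page *p, int refs)
{
	atomic_fetch_add_explicit(&p->count, refs, memory_order_relaxed);
	sketch_mb__after_atomic();	/* was: sketch_mb() */
	p->flags = 1;
	return atomic_load_explicit(&p->count, memory_order_relaxed);
}
```

The functional result is the same with either macro; the only difference is
whether a second fence instruction is issued after the atomic add.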

2014-06-17 22:38:53

by Waiman Long

Subject: [PATCH v2 1/2] mm, thp: move invariant bug check out of loop in __split_huge_page_map

In the __split_huge_page_map() function, the check for
page_mapcount(page) is invariant within the for loop. Because
page_mapcount() is implemented using atomic_read(), the compiler
cannot optimize the redundant check away, which leads to unnecessary
reads of the page structure.

This patch moves the invariant bug check out of the loop so that it
is done only once. On a 3.16-rc1 based kernel, a microbenchmark that
broke up 1000 transparent huge pages using munmap() took 38,245us
with the patch and 38,548us without it, a performance gain of about
1%.

Signed-off-by: Waiman Long <[email protected]>
---
mm/huge_memory.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e60837d..be84c71 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1744,6 +1744,8 @@ static int __split_huge_page_map(struct page *page,
 	if (pmd) {
 		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 		pmd_populate(mm, &_pmd, pgtable);
+		if (pmd_write(*pmd))
+			BUG_ON(page_mapcount(page) != 1);

 		haddr = address;
 		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
@@ -1753,8 +1755,6 @@ static int __split_huge_page_map(struct page *page,
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 			if (!pmd_write(*pmd))
 				entry = pte_wrprotect(entry);
-			else
-				BUG_ON(page_mapcount(page) != 1);
 			if (!pmd_young(*pmd))
 				entry = pte_mkold(entry);
 			if (pmd_numa(*pmd))
--
1.7.1
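
The effect of hoisting the check can be sketched in plain userspace C. This
is only an illustration under stated assumptions: the sketch_* names and
SKETCH_BUG_ON are hypothetical, a volatile read stands in for atomic_read()
(which the compiler must repeat on every iteration), and 512 is the number
of 4K pages in a 2M huge page on x86-64. Each function returns how many
times the counter was read.

```c
#include <stdlib.h>

#define SKETCH_HPAGE_PMD_NR 512	/* 2M huge page / 4K base page */

/* Hypothetical stand-in for BUG_ON(): abort if the condition holds. */
#define SKETCH_BUG_ON(cond) do { if (cond) abort(); } while (0)

/* Before the patch: the invariant check sits inside the loop. */
static unsigned long sketch_split_before(const volatile int *mapcount,
					 int writable)
{
	unsigned long reads = 0;

	for (int i = 0; i < SKETCH_HPAGE_PMD_NR; i++) {
		if (writable) {
			reads++;	/* one volatile read per iteration */
			SKETCH_BUG_ON(*mapcount != 1);
		}
		/* ... per-PTE setup would go here ... */
	}
	return reads;
}

/* After the patch: the check runs once, ahead of the loop. */
static unsigned long sketch_split_after(const volatile int *mapcount,
					int writable)
{
	unsigned long reads = 0;

	if (writable) {
		reads++;	/* single volatile read */
		SKETCH_BUG_ON(*mapcount != 1);
	}
	for (int i = 0; i < SKETCH_HPAGE_PMD_NR; i++) {
		/* ... per-PTE setup would go here ... */
	}
	return reads;
}
```

For a writable mapping the first version reads the counter 512 times and
the second reads it once; for a read-only mapping neither reads it at all.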

2014-06-18 12:20:28

by Kirill A. Shutemov

Subject: Re: [PATCH v2 2/2] mm, thp: replace smp_mb after atomic_add by smp_mb__after_atomic

On Tue, Jun 17, 2014 at 06:37:59PM -0400, Waiman Long wrote:
> On some architectures, such as x86, atomic_add() is a full memory
> barrier. In that case, an additional smp_mb() is just a waste of time.
> This patch replaces that smp_mb() with smp_mb__after_atomic(), which
> avoids the redundant memory barrier on those architectures.
>
> With a 3.16-rc1 based kernel, this patch reduced the execution time
> of breaking 1000 transparent huge pages from 38,245us to 30,964us, a
> sizeable reduction of 19%. It also reduces the %cpu time of the
> __split_huge_page_refcount function in the perf profile from 2.18%
> to 1.15%.
>
> Signed-off-by: Waiman Long <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2014-06-18 12:25:22

by Kirill A. Shutemov

Subject: Re: [PATCH v2 1/2] mm, thp: move invariant bug check out of loop in __split_huge_page_map

On Tue, Jun 17, 2014 at 06:37:58PM -0400, Waiman Long wrote:
> In the __split_huge_page_map() function, the check for
> page_mapcount(page) is invariant within the for loop. Because
> page_mapcount() is implemented using atomic_read(), the compiler
> cannot optimize the redundant check away, which leads to unnecessary
> reads of the page structure.
>
> This patch moves the invariant bug check out of the loop so that it
> is done only once. On a 3.16-rc1 based kernel, a microbenchmark that
> broke up 1000 transparent huge pages using munmap() took 38,245us
> with the patch and 38,548us without it, a performance gain of about
> 1%.

For such a small difference it would be nice to see an average over a
few runs plus the stddev. It could easily be noise.

> Signed-off-by: Waiman Long <[email protected]>

But okay:

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2014-06-18 15:31:10

by Waiman Long

Subject: Re: [PATCH v2 1/2] mm, thp: move invariant bug check out of loop in __split_huge_page_map

On 06/18/2014 08:24 AM, Kirill A. Shutemov wrote:
> On Tue, Jun 17, 2014 at 06:37:58PM -0400, Waiman Long wrote:
>> In the __split_huge_page_map() function, the check for
>> page_mapcount(page) is invariant within the for loop. Because
>> page_mapcount() is implemented using atomic_read(), the compiler
>> cannot optimize the redundant check away, which leads to unnecessary
>> reads of the page structure.
>>
>> This patch moves the invariant bug check out of the loop so that it
>> is done only once. On a 3.16-rc1 based kernel, a microbenchmark that
>> broke up 1000 transparent huge pages using munmap() took 38,245us
>> with the patch and 38,548us without it, a performance gain of about
>> 1%.
> For such a small difference it would be nice to see an average over a
> few runs plus the stddev. It could easily be noise.

The timing data was the average of 5 runs with an SD of 100-200us.

>> Signed-off-by: Waiman Long<[email protected]>
> But okay:
>
> Acked-by: Kirill A. Shutemov<[email protected]>
>

Thanks for the review.

-Longman
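
The question raised in this exchange, whether a small difference between two
averaged timings stands out from run-to-run noise, can be checked with a few
lines of C. The sketch below is only one possible approach and is not what
was used for the numbers in this thread: the sketch_* helpers are
hypothetical, both sample sets are assumed to have the same number of runs,
and the comparison is a simple "more than k standard errors apart" rule. It
compares squared quantities so that no sqrt() call is needed.

```c
#include <stddef.h>

/* Arithmetic mean of n samples. */
static double sketch_mean(const double *x, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += x[i];
	return sum / (double)n;
}

/* Sample variance (n - 1 in the denominator); requires n >= 2. */
static double sketch_variance(const double *x, size_t n)
{
	double m = sketch_mean(x, n);
	double ss = 0.0;

	for (size_t i = 0; i < n; i++)
		ss += (x[i] - m) * (x[i] - m);
	return ss / (double)(n - 1);
}

/*
 * Return 1 if the means of a and b (n runs each) differ by more than
 * k standard errors of the difference, 0 otherwise.
 */
static int sketch_differs(const double *a, const double *b, size_t n,
			  double k)
{
	double d = sketch_mean(a, n) - sketch_mean(b, n);
	double se2 = (sketch_variance(a, n) + sketch_variance(b, n)) /
		     (double)n;

	return d * d > k * k * se2;
}
```

The standard deviation quoted alongside a mean is the square root of the
value returned by sketch_variance(). Feeding in the per-run timings with and
without a patch, with k = 2 for example, gives a rough indication of whether
the gap is larger than the measurement noise.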