2011-04-13 14:07:18

by raz

Subject: Re: 2.6.38 sbrk regression

OK, I managed to build 38-rc3 (I keep forgetting make mrproper).
The problem is still there:

------------------------------------------------------------------------------------------------------------
 Test        Test        Elapsed      Iteration   Iteration Rate    Operation Rate
 Number      Name        Time (sec)   Count       (loops/sec)       (ops/sec)
------------------------------------------------------------------------------------------------------------
 1           page_test   5.02         361         71.91235          122251.00   System Allocations & Pages/second
------------------------------------------------------------------------------------------------------------


In 2.6.37 the rate is 171000.
ftrace is no good here, as it skews the results too greatly.
Any ideas?


On Wed, 2011-04-13 at 14:51 +0200, Andrea Arcangeli wrote:
> Hi,
>
> CC'ing Mel.
>
> On Wed, Apr 13, 2011 at 02:03:58PM +0300, raz ben yehuda wrote:
> > Andrea Hello
> >
> > I am running the AIM benchmark suite looking for regressions in 2.6.38.
> > The test page_test tests sbrk(). 2.6.38 is 13% less than 2.6.27 and 11%
> > less than 2.6.37.
> > The regression starts somewhere in the THP patch:
> > SHA1 4e9f64c42d0ba5eb0c78569435ada4c224332ce4 to
> > SHA1 152c9ccb75548c027fa3103efa4fa4e19a345449 .
> >
> > I cannot git bisect it any more as the kernel won't compile.
> > Even if i disable THP in the kernel I still get a regression.
> > I performed the benchmark on Xeon blade 3GHZ. But it also happens in
> > other types of machines.
> >
> > 2.6.38
> > Apr 13 13:31:16 2011 AIM Independent Resource Benchmark - Suite IX 5120
> > page_test 30010 85.0716 144621.79 System Allocations &Pages/second
> >
> > 2.6.37
> > Apr 13 13:44:39 2011 AIM Independent Resource Benchmark - Suite IX 5120
> > page_test 5020 100.797 171354.58 System Allocations & Pages/second
> >
> > I am going to profile it hopefully I will have more information.
>
> The compaction kswapd caused regressions in fs benchs like specsfs,
> it's fixed in 2.6.39-rc. Obviously it will go away if you set
> CONFIG_COMPACTION=n, but 2.6.39-rc should work best with COMPACTION=y
> too. Could you try latest 2.6.39 git?
>
> Also please make sure CONFIG_SLUB=n and CONFIG_SLAB=y. We must fix
> SLUB so it stops allocating with GFP_KERNEL in the higher order alloc,
> I think it should only do a quick check of the buddy with
> GFP_ATOMIC. The pageblock types are the thing that prevents
> fragmentation, no need of special logic for that in slub. It's
> unlikely that for short lived allocations succeeding the high order
> allocation provides enough speedup to the code using the memory, in
> order to at least break even with the cost of compacting the memory to
> succeed the allocation.


2011-04-13 18:39:15

by Andrea Arcangeli

Subject: Re: 2.6.38 sbrk regression

On Wed, Apr 13, 2011 at 05:06:19PM +0300, raz ben yehuda wrote:
> ok, managed to build 38-rc3 ( keep forgetting make mrproper ).
> problem is still there:

Oops, I thought it was a typo in your previous email, so I didn't
answer, assuming you would test 39-rc as I suggested; I really meant
2.6.39-rc3.

The fix I refer to is d527caf22e48480b102c7c6ee5b9ba12170148f7, and
it's included in >= 2.6.39-rc1. 38-rc3 has THP but not this
changeset. We tried to push it for 38 final, but it was a bit late, so
it got merged immediately in 39-rc1 to stay on the safe side (it's
only a performance issue, and the other longstanding irq/sched latency
issues in compaction that got fixed in 39-rc1 were actually more
relevant than the compaction-kswapd removal).

2011-04-14 08:14:19

by raz

Subject: Re: 2.6.38 sbrk regression

Hello Andrea,

I performed the second benchmark on 2.6.39-rc3, not 38. I am sorry for
the typo (the Dell keyboard is too small).

1. The first benchmark was performed on 2.6.38, with no compaction and
with SLAB=y: a regression of 13% compared to 2.6.27.

2. The second benchmark was performed on 2.6.39-rc3, with compaction
and SLAB. It gave the same result as 2.6.38, a 13% regression compared
to 2.6.27, so the 2.6.39-rc1 fix is probably related to something else.

3. It does not matter whether I disable THP or not; the regression
still exists.

I will update you with more in the following few days.

Thank you for your help.



On Wed, 2011-04-13 at 19:21 +0200, Andrea Arcangeli wrote:
> On Wed, Apr 13, 2011 at 05:06:19PM +0300, raz ben yehuda wrote:
> > ok, managed to build 38-rc3 ( keep forgetting make mrproper ).
> > problem is still there:
>
> Oops I thought it was typo on your prev email so I didn't answer
> assuming you would test 39-rc like I suggested in my prev email, I
> really meant 2.6.39-rc3.
>
> The fix I refer to is d527caf22e48480b102c7c6ee5b9ba12170148f7 and
> it's included in >= 2.6.39-rc1. 38-rc3 has THP but not this
> changeset. We tried to push it for 38 final but it was a bit late so
> it got merged immediately in 39-rc1 to stay on the safe side (it's
> only a performance issue and the other longstanding irq/sched latency
> issues in compaction that got fixed in 39-rc1 where actually more
> relevant than the compaction-kswapd removal).

2011-04-14 11:50:15

by raz

Subject: Re: 2.6.38 sbrk regression

Hey Andrea,
Me again. I managed to ftrace (function graph) the two kernels, 2.6.37
and 2.6.39-rc3. Below is an example of sys_brk call traces from each
kernel. As you can see, there is no single smoking gun here.

The durations of the VM functions increased as a whole.
I repeated the tests from sha1 4e9f64c42d0ba5eb0c78569435ada4c224332ce4
compared to sha1 152c9ccb75548c027fa3103efa4fa4e19a345449, and it is
consistent: a ~13% performance decrease.

Can you see any relation to THP that might cause this degradation?

Thank you,
raz

2.6.39-rc3

 0)               |  sys_brk() {
 0)   0.205 us    |    down_write();
 0)   0.256 us    |    find_vma();
 0)               |    do_brk() {
 0)   0.549 us    |      cap_file_mmap();
 0)               |      get_unmapped_area() {
 0)   0.130 us    |        arch_get_unmapped_area_topdown();
 0)   1.112 us    |      }
 0)               |      cap_vm_enough_memory() {
 0)   0.163 us    |        __vm_enough_memory();
 0)   1.408 us    |      }
 0)               |      vma_merge() {
 0)   0.313 us    |        vma_adjust();
 0)   2.315 us    |      }
 0)   9.692 us    |    }
 0)   0.168 us    |    up_write();
 0) + 14.473 us   |  }

2.6.37
 0)               |  sys_brk() {
 0)   0.261 us    |    down_write();
 0)   0.416 us    |    find_vma();
 0)               |    do_brk() {
 0)   0.411 us    |      cap_file_mmap();
 0)               |      get_unmapped_area() {
 0)   0.443 us    |        arch_get_unmapped_area_topdown();
 0)   0.982 us    |      }
 0)               |      cap_vm_enough_memory() {
 0)   0.481 us    |        __vm_enough_memory();
 0)   1.145 us    |      }
 0)               |      vma_merge() {
 0)   0.459 us    |        vma_adjust();
 0)   1.004 us    |      }
 0)   5.423 us    |    }
 0)   0.273 us    |    up_write();
 0)   8.020 us    |  }
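For reproducibility, a trace like the above can be captured roughly as follows (a sketch of the tracer configuration; paths assume debugfs is mounted at /sys/kernel/debug and root privileges):

```shell
# configure the function_graph tracer to trace only sys_brk
cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo function_graph > current_tracer
echo sys_brk > set_graph_function
echo 1 > tracing_on
# ...run the benchmark (e.g. the AIM9 page_test) here, then:
echo 0 > tracing_on
cat trace > /tmp/sys_brk.trace
```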




On Wed, 2011-04-13 at 19:21 +0200, Andrea Arcangeli wrote:
> On Wed, Apr 13, 2011 at 05:06:19PM +0300, raz ben yehuda wrote:
> > ok, managed to build 38-rc3 ( keep forgetting make mrproper ).
> > problem is still there:
>
> Oops I thought it was typo on your prev email so I didn't answer
> assuming you would test 39-rc like I suggested in my prev email, I
> really meant 2.6.39-rc3.
>
> The fix I refer to is d527caf22e48480b102c7c6ee5b9ba12170148f7 and
> it's included in >= 2.6.39-rc1. 38-rc3 has THP but not this
> changeset. We tried to push it for 38 final but it was a bit late so
> it got merged immediately in 39-rc1 to stay on the safe side (it's
> only a performance issue and the other longstanding irq/sched latency
> issues in compaction that got fixed in 39-rc1 where actually more
> relevant than the compaction-kswapd removal).

2011-04-14 15:10:24

by Andrea Arcangeli

Subject: Re: 2.6.38 sbrk regression

On Thu, Apr 14, 2011 at 02:49:14PM +0300, raz ben yehuda wrote:
> Hey Andrea
> Me again. I managed to ftrace ( function graph ) the two kernels. I used
> 2.6.37 and 2.6.39-rc3. The bellow is example for sys_brk calls traces
> from each kernel. As you can see, there is no "single smoking gun"
> here.
>
> The vm functions durations increased as a whole.
> I repeated the tests from sha1 4e9f64c42d0ba5eb0c78569435ada4c224332ce4
> compared to sha1 152c9ccb75548c027fa3103efa4fa4e19a345449 and it is
> consistent. ~13% performance decrease.
>

> Can you see any relation to thp that might causes this degradation ?

With compaction and THP off, I don't see how it could change
anything.

But can you try the THP-33 tag of my aa.git tree? It was based on
2.6.37-rc5, so it'll rule out the whole THP patchset if it doesn't
regress compared to 2.6.37-rc5 vanilla.

git clone --reference linux-2.6 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
git checkout THP-33

Thanks,
Andrea

2011-04-14 20:08:21

by raz

Subject: Re: 2.6.38 page_test regression

Bah. Mel is correct. I did mean page_test (in my defense, it is in the
msg).
Here is some more information:
1. I managed to narrow the regression down to two sha1s:
32dba98e085f8b2b4345887df9abf5e0e93bfc12 to
71e3aac0724ffe8918992d76acfe3aad7d8724a5,
though I had to comment out wait_split_huge_page for the sake of
compilation. Up to 32dba98e085f8b2b4345887df9abf5e0e93bfc12 there is no
regression.

2. I booted the 2.6.37-rc5 you gave me. The same regression is there.

raz


On Thu, 2011-04-14 at 17:09 +0200, Andrea Arcangeli wrote:
> On Thu, Apr 14, 2011 at 02:49:14PM +0300, raz ben yehuda wrote:
> > Hey Andrea
> > Me again. I managed to ftrace ( function graph ) the two kernels. I used
> > 2.6.37 and 2.6.39-rc3. The bellow is example for sys_brk calls traces
> > from each kernel. As you can see, there is no "single smoking gun"
> > here.
> >
> > The vm functions durations increased as a whole.
> > I repeated the tests from sha1 4e9f64c42d0ba5eb0c78569435ada4c224332ce4
> > compared to sha1 152c9ccb75548c027fa3103efa4fa4e19a345449 and it is
> > consistent. ~13% performance decrease.
> >
>
> > Can you see any relation to thp that might causes this degradation ?
>
> With compaction and THP off I don't see how it could change
> anything.
>
> But can you try the THP-33 tag of my aa.git tree, that was based on
> 2.3.37-rc5 so it'll rule out the whole THP patchset if it doesn't
> regress compared to 2.6.37-rc5 vanilla.
>
> git clone --reference linux-2.6 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> git checkout THP-33
>
> Thanks,
> Andrea

2011-04-14 21:53:58

by Mel Gorman

Subject: Re: 2.6.38 page_test regression

On Thu, Apr 14, 2011 at 11:07:23PM +0300, raz ben yehuda wrote:
> bah. Mel is correct. I did mean page_test ( in my defense it is in the
> msg ).
> Here some more information:
> 1. I manage to lower the regression to 2 sha1's:
> 32dba98e085f8b2b4345887df9abf5e0e93bfc12 to
> 71e3aac0724ffe8918992d76acfe3aad7d8724a5.
> though I had to remark wait_split_huge_page for the sake of
> compilation. up to 32dba98e085f8b2b4345887df9abf5e0e93bfc12 there is no
> regression.
>
> 2. I booted 2.6.37-rc5 you gave me. same regression is there.

Extremely long shot - try this patch.

diff --git a/mm/memory.c b/mm/memory.c
index c50a195..a39baaf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3317,7 +3317,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* run pte_offset_map on the pmd, if an huge pmd could
* materialize from under us from a different thread.
*/
- if (unlikely(__pte_alloc(mm, vma, pmd, address)))
+ if (unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, vma, pmd, address))
return VM_FAULT_OOM;
/* if an huge pmd materialized from under us just retry later */
if (unlikely(pmd_trans_huge(*pmd)))

2011-04-14 23:21:46

by Mel Gorman

Subject: Re: 2.6.38 page_test regression

On Thu, Apr 14, 2011 at 10:53:27PM +0100, Mel Gorman wrote:
> On Thu, Apr 14, 2011 at 11:07:23PM +0300, raz ben yehuda wrote:
> > bah. Mel is correct. I did mean page_test ( in my defense it is in the
> > msg ).
> > Here some more information:
> > 1. I manage to lower the regression to 2 sha1's:
> > 32dba98e085f8b2b4345887df9abf5e0e93bfc12 to
> > 71e3aac0724ffe8918992d76acfe3aad7d8724a5.
> > though I had to remark wait_split_huge_page for the sake of
> > compilation. up to 32dba98e085f8b2b4345887df9abf5e0e93bfc12 there is no
> > regression.
> >
> > 2. I booted 2.6.37-rc5 you gave me. same regression is there.
>
> Extremely long shot - try this patch.
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c50a195..a39baaf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3317,7 +3317,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> * run pte_offset_map on the pmd, if an huge pmd could
> * materialize from under us from a different thread.
> */
> - if (unlikely(__pte_alloc(mm, vma, pmd, address)))
> + if (unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, vma, pmd, address))
> return VM_FAULT_OOM;
> /* if an huge pmd materialized from under us just retry later */
> if (unlikely(pmd_trans_huge(*pmd)))

The results for this patch on my own tests at least are

AIM9
                  vmr-aim9          vmr-aim9          vmr-aim9          vmr-aim9          vmr-aim9          aim9-2.6.39       aim9-2.6.39
                  2.6.32-vanilla    2.6.36-vanilla    2.6.37-vanilla    2.6.38-vanilla    2.6.38-noway      rc3-vanilla       rc3-noway
creat-clo         365.47 ( 0.00%)   385.25 ( 5.13%)   411.82 (11.25%)   446.10 (18.07%)   427.78 (14.57%)   383.50 ( 4.70%)   377.63 ( 3.22%)
page_test          43.21 ( 0.00%)    41.44 (-4.26%)    43.71 ( 1.15%)    38.10 (-13.40%)   41.87 (-3.20%)    36.08 (-19.75%)   44.25 ( 2.36%)
brk_test           45.19 ( 0.00%)    46.38 ( 2.57%)    51.17 (11.68%)    52.45 (13.84%)    51.61 (12.43%)    51.52 (12.29%)    54.24 (16.68%)
exec_test         387.20 ( 0.00%)   458.92 (15.63%)   450.60 (14.07%)   382.00 (-1.36%)   457.64 (15.39%)   378.82 (-2.21%)   458.70 (15.59%)
fork_test          61.59 ( 0.00%)    67.87 ( 9.26%)    66.65 ( 7.59%)    60.11 (-2.47%)    67.44 ( 8.67%)    59.14 (-4.14%)    66.24 ( 7.03%)
MMTests Statistics: duration
Total Elapsed Time (seconds)   613.03   611.99   611.85   611.90   612.36   612.62   612.26

The "noway" kernel is the one with the patch applied, which might
summarise how I feel about it.

The change is minor but emulates what pte_alloc_map() was doing
with the pmd_present check. I don't know why it makes such a big
difference. The disassembly is very similar, except that registers are
used differently, but it's a minor enough difference that I wouldn't
expect this big a performance gap. However, profiles indicate
that we go from spending 10.6382% of the time in clear_page_c to 9.54%,
though I admit the profiles are noisy because they are taken over all
tests, not just page_test.

Theories better than slightly-different-register-use are welcome.

--
Mel Gorman
SUSE Labs

2011-04-14 23:33:52

by Andrea Arcangeli

Subject: Re: 2.6.38 page_test regression

On Thu, Apr 14, 2011 at 10:53:27PM +0100, Mel Gorman wrote:
> On Thu, Apr 14, 2011 at 11:07:23PM +0300, raz ben yehuda wrote:
> > bah. Mel is correct. I did mean page_test ( in my defense it is in the
> > msg ).
> > Here some more information:
> > 1. I manage to lower the regression to 2 sha1's:
> > 32dba98e085f8b2b4345887df9abf5e0e93bfc12 to
> > 71e3aac0724ffe8918992d76acfe3aad7d8724a5.
> > though I had to remark wait_split_huge_page for the sake of
> > compilation. up to 32dba98e085f8b2b4345887df9abf5e0e93bfc12 there is no
> > regression.
> >
> > 2. I booted 2.6.37-rc5 you gave me. same regression is there.
>
> Extremely long shot - try this patch.
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c50a195..a39baaf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3317,7 +3317,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> * run pte_offset_map on the pmd, if an huge pmd could
> * materialize from under us from a different thread.
> */
> - if (unlikely(__pte_alloc(mm, vma, pmd, address)))
> + if (unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, vma, pmd, address))
> return VM_FAULT_OOM;
> /* if an huge pmd materialized from under us just retry later */
> if (unlikely(pmd_trans_huge(*pmd)))

This was fast...

This definitely fixes a regression: the previous pte_alloc_map would
have checked pte_none (pte_none is not safe anymore, but pte_present is
safe) before taking the PT lock in __pte_alloc_map.

It's also obviously safe: the only chance a huge pmd can materialize
from under us is if it wasn't present, and this is an exact conversion
of the old pte_alloc_one. So we need it.

I'm quite optimistic it'll solve the problem.

Thanks a lot,
Andrea

2011-04-14 23:39:38

by Andrea Arcangeli

Subject: Re: 2.6.38 page_test regression

On Fri, Apr 15, 2011 at 12:16:48AM +0100, Mel Gorman wrote:
> Theories better than slightly-different-register-use are welcome.

__pte_alloc_one is now skipped 511 times out of every 512 page faults,
so we call alloc_pages 512 times less often and take page_table_lock
512 times less often; that explains it, I think. We need it...

2011-04-14 23:45:43

by Andrea Arcangeli

Subject: Re: 2.6.38 page_test regression

On Fri, Apr 15, 2011 at 01:32:26AM +0200, Andrea Arcangeli wrote:
> On Thu, Apr 14, 2011 at 10:53:27PM +0100, Mel Gorman wrote:
> > On Thu, Apr 14, 2011 at 11:07:23PM +0300, raz ben yehuda wrote:
> > > bah. Mel is correct. I did mean page_test ( in my defense it is in the
> > > msg ).
> > > Here some more information:
> > > 1. I manage to lower the regression to 2 sha1's:
> > > 32dba98e085f8b2b4345887df9abf5e0e93bfc12 to
> > > 71e3aac0724ffe8918992d76acfe3aad7d8724a5.
> > > though I had to remark wait_split_huge_page for the sake of
> > > compilation. up to 32dba98e085f8b2b4345887df9abf5e0e93bfc12 there is no
> > > regression.
> > >
> > > 2. I booted 2.6.37-rc5 you gave me. same regression is there.
> >
> > Extremely long shot - try this patch.
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index c50a195..a39baaf 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3317,7 +3317,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > * run pte_offset_map on the pmd, if an huge pmd could
> > * materialize from under us from a different thread.
> > */
> > - if (unlikely(__pte_alloc(mm, vma, pmd, address)))
> > + if (unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, vma, pmd, address))

Actually, reading this again, this should be pmd_none. This is one
change I had to make across the board: for a tiny amount of time during
the pmd teardown I have to mark a pmd_trans_splitting pte not present,
in order to flush it away from the 2M TLB before the 4k TLB can be
loaded (for an erratum), but when it's being split it's definitely not
null. Now, it's not buggy, because __pte_alloc would then check it
against pmd_none, but it's not worth calling into __pte_alloc if it's
not pmd_none. It just makes things confusing when everything else has
been updated to check pmd_none.

Just a nitpick of course (not even a bug).
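Incorporating the nitpick, the hunk would read something like this (a sketch of the suggested change, not a merged patch):

```diff
-	if (unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, vma, pmd, address))
+	if (unlikely(pmd_none(*pmd)) && __pte_alloc(mm, vma, pmd, address))
 		return VM_FAULT_OOM;
```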

2011-04-15 09:12:11

by Mel Gorman

Subject: Re: 2.6.38 page_test regression

On Fri, Apr 15, 2011 at 01:44:44AM +0200, Andrea Arcangeli wrote:
> On Fri, Apr 15, 2011 at 01:32:26AM +0200, Andrea Arcangeli wrote:
> > On Thu, Apr 14, 2011 at 10:53:27PM +0100, Mel Gorman wrote:
> > > On Thu, Apr 14, 2011 at 11:07:23PM +0300, raz ben yehuda wrote:
> > > > bah. Mel is correct. I did mean page_test ( in my defense it is in the
> > > > msg ).
> > > > Here some more information:
> > > > 1. I manage to lower the regression to 2 sha1's:
> > > > 32dba98e085f8b2b4345887df9abf5e0e93bfc12 to
> > > > 71e3aac0724ffe8918992d76acfe3aad7d8724a5.
> > > > though I had to remark wait_split_huge_page for the sake of
> > > > compilation. up to 32dba98e085f8b2b4345887df9abf5e0e93bfc12 there is no
> > > > regression.
> > > >
> > > > 2. I booted 2.6.37-rc5 you gave me. same regression is there.
> > >
> > > Extremely long shot - try this patch.
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index c50a195..a39baaf 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3317,7 +3317,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > > * run pte_offset_map on the pmd, if an huge pmd could
> > > * materialize from under us from a different thread.
> > > */
> > > - if (unlikely(__pte_alloc(mm, vma, pmd, address)))
> > > + if (unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, vma, pmd, address))
>
> Actually reading this again this should be pmd_none.

You're right.

> This is one
> change I had to make across the board because for an tiny amount of
> time during the pmd teardown I have to mark an pmd_trans_splitting pte
> not present in order to flush it away from the 2m tlb before the 4k
> tlb can be loaded (for errata), but when it's being splitted it's
> definitely not null. Now it's not buggy because then __pte_alloc would
> then check it against pmd_none, but it's not worth calling into
> __pte_alloc if it's not pmd_none. It just makes it confusing when
> everything has been updated to check pmd_none.
>
> Just a nitpick of course (not even a bug).
>

A fairly important nitpick. Otherwise the inconsistency is confusing.

--
Mel Gorman
SUSE Labs