2014-04-28 09:13:25

by Madhavan Srinivasan

Subject: [PATCH V3 0/2] mm: FAULT_AROUND_ORDER patchset performance data for powerpc

Commit 8c6e50b029 from Kirill A. Shutemov introduced
vm_ops->map_pages() for mapping easily accessible pages around
the fault address, in the hope of reducing the number of minor
page faults.

This patch set creates the infrastructure to modify the
FAULT_AROUND_ORDER value via mm/Kconfig. This enables architecture
maintainers to pick a suitable FAULT_AROUND_ORDER value based on
performance data for that architecture. The first patch also
defaults the FAULT_AROUND_ORDER Kconfig option to 4. The second
patch lists the performance numbers for powerpc (pseries platform)
and initializes the fault-around order variable for the pseries
platform of powerpc.
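
For context, the fault-around window is derived from the order as follows
(simplified from the mm/memory.c code this series touches; 'nr_pages' and
'mask' mirror fault_around_pages() and fault_around_mask()):

	/* number of pages mapped around the faulting address */
	nr_pages = 1UL << fault_around_order;
	/* mask used to align the start of the fault-around window */
	mask = ~((1UL << (PAGE_SHIFT + fault_around_order)) - 1);

so the default order of 4 maps up to 16 pages per fault.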

V3 Changes:
Replaced the FAULT_AROUND_ORDER macro with a variable, to support
architectures that have sub-platforms.
Updated the commit messages.

V2 Changes:
Created a Kconfig parameter for FAULT_AROUND_ORDER.
Added a check in do_read_fault() to handle a FAULT_AROUND_ORDER value of 0.
Updated the commit messages.

Madhavan Srinivasan (2):
mm: move FAULT_AROUND_ORDER to arch/
powerpc/pseries: init fault_around_order for pseries

arch/powerpc/platforms/pseries/setup.c | 5 +++++
mm/Kconfig | 8 ++++++++
mm/memory.c | 11 ++++-------
3 files changed, 17 insertions(+), 7 deletions(-)

--
1.7.10.4


2014-04-28 09:04:22

by Madhavan Srinivasan

Subject: [PATCH V3 2/2] powerpc/pseries: init fault_around_order for pseries

Performance data for different FAULT_AROUND_ORDER values, from a 4-socket
Power7 system (128 threads, 128GB memory). perf stat with a repeat count of
5 was used to obtain the stddev values. Tests were run on a v3.14 kernel
(baseline) and on v3.15-rc1 with different fault-around order values.
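
The numbers below were collected with an invocation along these lines
(illustrative; -r 5 repeats the measurement five times so perf can report
the stddev):

	perf stat -r 5 -e minor-faults make -j64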

FAULT_AROUND_ORDER Baseline 1 3 4 5 8

Linux build (make -j64)
minor-faults 47,437,359 35,279,286 25,425,347 23,461,275 22,002,189 21,435,836
times in seconds 347.302528420 344.061588460 340.974022391 348.193508116 348.673900158 350.986543618
stddev for time ( +- 1.50% ) ( +- 0.73% ) ( +- 1.13% ) ( +- 1.01% ) ( +- 1.89% ) ( +- 1.55% )
%chg time to baseline -0.9% -1.8% 0.2% 0.39% 1.06%

Linux rebuild (make -j64)
minor-faults 941,552 718,319 486,625 440,124 410,510 397,416
times in seconds 30.569834718 31.219637539 31.319370649 31.434285472 31.972367174 31.443043580
stddev for time ( +- 1.07% ) ( +- 0.13% ) ( +- 0.43% ) ( +- 0.18% ) ( +- 0.95% ) ( +- 0.58% )
%chg time to baseline 2.1% 2.4% 2.8% 4.58% 2.85%

Binutils build (make all -j64 )
minor-faults 474,821 371,380 269,463 247,715 235,255 228,337
times in seconds 53.882492432 53.584289348 53.882773216 53.755816431 53.607824348 53.423759642
stddev for time ( +- 0.08% ) ( +- 0.56% ) ( +- 0.17% ) ( +- 0.11% ) ( +- 0.60% ) ( +- 0.69% )
%chg time to baseline -0.55% 0.0% -0.23% -0.51% -0.85%

Two synthetic tests: access every word in a file, in sequential or random order.
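
A sketch of the kind of harness behind these numbers (illustrative only;
the actual test program is not part of this posting). Each newly touched
page triggers a minor fault unless fault-around has already mapped it:

	#include <fcntl.h>
	#include <stddef.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		struct stat st;
		volatile long sum = 0;
		long *p;
		size_t i;
		int fd;

		if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
			return 1;
		if (fstat(fd, &st) < 0)
			return 1;
		p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
		if (p == MAP_FAILED)
			return 1;
		/* sequential variant: touch every word in order */
		for (i = 0; i < st.st_size / sizeof(long); i++)
			sum += p[i];
		munmap(p, st.st_size);
		close(fd);
		return !!sum;
	}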

Sequential access 16GiB file
FAULT_AROUND_ORDER Baseline 1 3 4 5 8
1 thread
minor-faults 263,148 131,166 32,908 16,514 8,260 1,093
times in seconds 53.091138345 53.113191672 53.188776177 53.233017218 53.206841347 53.429979442
stddev for time ( +- 0.06% ) ( +- 0.07% ) ( +- 0.08% ) ( +- 0.09% ) ( +- 0.03% ) ( +- 0.03% )
%chg time to baseline 0.04% 0.18% 0.26% 0.21% 0.63%
8 threads
minor-faults 2,097,267 1,048,753 262,237 131,397 65,621 8,274
times in seconds 55.173790028 54.591880790 54.824623287 54.802162211 54.969680503 54.790387715
stddev for time ( +- 0.78% ) ( +- 0.09% ) ( +- 0.08% ) ( +- 0.07% ) ( +- 0.28% ) ( +- 0.05% )
%chg time to baseline -1.05% -0.63% -0.67% -0.36% -0.69%
32 threads
minor-faults 8,388,751 4,195,621 1,049,664 525,461 262,535 32,924
times in seconds 60.431573046 60.669110744 60.485336388 60.697789706 60.077959564 60.588855032
stddev for time ( +- 0.44% ) ( +- 0.27% ) ( +- 0.46% ) ( +- 0.67% ) ( +- 0.31% ) ( +- 0.49% )
%chg time to baseline 0.39% 0.08% 0.44% -0.58% 0.25%
64 threads
minor-faults 16,777,409 8,607,527 2,289,766 1,202,264 598,405 67,587
times in seconds 96.932617720 100.675418760 102.109880836 103.881733383 102.580199555 105.751194041
stddev for time ( +- 1.39% ) ( +- 1.06% ) ( +- 0.99% ) ( +- 0.76% ) ( +- 1.65% ) ( +- 1.60% )
%chg time to baseline 3.86% 5.34% 7.16% 5.82% 9.09%
128 threads
minor-faults 33,554,705 17,375,375 4,682,462 2,337,245 1,179,007 134,819
times in seconds 128.766704495 115.659225437 120.353046307 115.291871270 115.450886036 113.991902150
stddev for time ( +- 2.93% ) ( +- 0.30% ) ( +- 2.93% ) ( +- 1.24% ) ( +- 1.03% ) ( +- 0.70% )
%chg time to baseline -10.17% -6.53% -10.46% -10.34% -11.47%

Random access 1GiB file
FAULT_AROUND_ORDER Baseline 1 3 4 5 8
1 thread
minor-faults 17,155 8,678 2,126 1,097 581 134
times in seconds 51.904430523 51.658017987 51.919270792 51.560531738 52.354431597 51.976469502
stddev for time ( +- 3.19% ) ( +- 1.35% ) ( +- 1.56% ) ( +- 0.91% ) ( +- 1.70% ) ( +- 2.02% )
%chg time to baseline -0.47% 0.02% -0.66% 0.86% 0.13%
8 threads
minor-faults 131,844 70,705 17,457 8,505 4,251 598
times in seconds 58.162813956 54.991706305 54.952675791 55.323057492 54.755587379 53.376722828
stddev for time ( +- 1.44% ) ( +- 0.69% ) ( +- 1.23% ) ( +- 2.78% ) ( +- 1.90% ) ( +- 2.91% )
%chg time to baseline -5.45% -5.52% -4.88% -5.86% -8.22%
32 threads
minor-faults 524,437 270,760 67,069 33,414 16,641 2,204
times in seconds 69.981777072 76.539570015 79.753578505 76.245943618 77.254258344 79.072596831
stddev for time ( +- 2.81% ) ( +- 1.95% ) ( +- 2.66% ) ( +- 0.99% ) ( +- 2.35% ) ( +- 3.22% )
%chg time to baseline 9.37% 13.96% 8.95% 10.39% 12.98%
64 threads
minor-faults 1,049,117 527,451 134,016 66,638 33,391 4,559
times in seconds 108.024517536 117.575067996 115.322659914 111.943998437 115.049450815 119.218450840
stddev for time ( +- 2.40% ) ( +- 1.77% ) ( +- 1.19% ) ( +- 3.29% ) ( +- 2.32% ) ( +- 1.42% )
%chg time to baseline 8.84% 6.75% 3.62% 6.5% 10.3%
128 threads
minor-faults 2,097,440 1,054,360 267,042 133,328 66,532 8,652
times in seconds 155.055861167 153.059625968 152.449492156 151.024005282 150.844647770 155.954366718
stddev for time ( +- 1.32% ) ( +- 1.14% ) ( +- 1.32% ) ( +- 0.81% ) ( +- 0.75% ) ( +- 0.72% )
%chg time to baseline -1.28% -1.68% -2.59% -2.71% 0.57%

In case of kernel compilation, fault around order (fao) values of 1 and 3
provide faster compilation times compared to a value of 4. On closer look,
fao of 3 has higher gains. In the sequential access synthetic tests, fao of
1 has higher gains, and in the random access test, fao of 3 has marginal
gains. Going by compilation time, an fao value of 3 is suggested in this
patch for the pseries platform.

Worst case scenario: we touch one page every 16M to demonstrate overhead.
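
(In terms of the mmap sketch shown after the synthetic test description
above: same harness, but stepping the index by (16 << 20) / sizeof(long)
so that only one page in each 16M region is touched.)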

Touch only one page in page table in 16GiB file
FAULT_AROUND_ORDER Baseline 1 3 4 5 8
1 thread
minor-faults 1,104 1,090 1,071 1,068 1,065 1,063
times in seconds 0.006583298 0.008531502 0.019733795 0.036033763 0.062300553 0.406857086
stddev for time ( +- 2.79% ) ( +- 2.42% ) ( +- 3.47% ) ( +- 2.81% ) ( +- 2.01% ) ( +- 1.33% )
8 threads
minor-faults 8,279 8,264 8,245 8,243 8,239 8,240
times in seconds 0.044572398 0.057211811 0.107606306 0.205626815 0.381679120 2.647979955
stddev for time ( +- 1.95% ) ( +- 2.98% ) ( +- 1.74% ) ( +- 2.80% ) ( +- 2.01% ) ( +- 1.86% )
32 threads
minor-faults 32,879 32,864 32,849 32,845 32,839 32,843
times in seconds 0.197659343 0.218486087 0.445116407 0.694235883 1.296894038 9.127517045
stddev for time ( +- 3.05% ) ( +- 3.05% ) ( +- 4.33% ) ( +- 3.08% ) ( +- 3.75% ) ( +- 0.56% )
64 threads
minor-faults 65,680 65,664 65,646 65,645 65,640 65,647
times in seconds 0.455537304 0.489688780 0.866490093 1.427393118 2.379628982 17.059295051
stddev for time ( +- 4.01% ) ( +- 4.13% ) ( +- 2.92% ) ( +- 1.68% ) ( +- 1.79% ) ( +- 0.48% )
128 threads
minor-faults 131,279 131,265 131,250 131,245 131,241 131,254
times in seconds 1.026880651 1.095327536 1.721728274 2.808233068 4.662729948 31.732848290
stddev for time ( +- 6.85% ) ( +- 4.09% ) ( +- 1.71% ) ( +- 3.45% ) ( +- 2.40% ) ( +- 0.68% )

Signed-off-by: Madhavan Srinivasan <[email protected]>
---
arch/powerpc/platforms/pseries/setup.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 2db8cc6..c87e6b6 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -74,6 +74,8 @@ int CMO_SecPSP = -1;
unsigned long CMO_PageSize = (ASM_CONST(1) << IOMMU_PAGE_SHIFT_4K);
EXPORT_SYMBOL(CMO_PageSize);

+extern unsigned int fault_around_order;
+
int fwnmi_active; /* TRUE if an FWNMI handler is present */

static struct device_node *pSeries_mpic_node;
@@ -465,6 +467,9 @@ static void __init pSeries_setup_arch(void)
{
set_arch_panic_timeout(10, ARCH_PANIC_TIMEOUT);

+ /* Measured on a 4 socket Power7 system (128 Threads and 128GB memory) */
+ fault_around_order = 3;
+
/* Discover PIC type and setup ppc_md accordingly */
pseries_discover_pic();

--
1.7.10.4

2014-04-28 09:04:37

by Madhavan Srinivasan

Subject: [PATCH V3 1/2] mm: move FAULT_AROUND_ORDER to arch/

Commit 8c6e50b029 from Kirill A. Shutemov introduced
vm_ops->map_pages() for mapping easily accessible pages around
the fault address, in the hope of reducing the number of minor
page faults.

This patch creates the infrastructure to modify the
FAULT_AROUND_ORDER value via mm/Kconfig. This enables architecture
maintainers to pick a suitable FAULT_AROUND_ORDER value based on
performance data for that architecture. The patch also defaults
the FAULT_AROUND_ORDER Kconfig option to 4.

Signed-off-by: Madhavan Srinivasan <[email protected]>
---
mm/Kconfig | 8 ++++++++
mm/memory.c | 11 ++++-------
2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index ebe5880..c7fc4f1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -176,6 +176,14 @@ config MOVABLE_NODE
config HAVE_BOOTMEM_INFO_NODE
def_bool n

+#
+# Fault around order is a control knob to decide the fault around pages.
+# Default value is set to 4 , but the arch can override it as desired.
+#
+config FAULT_AROUND_ORDER
+ int
+ default 4
+
# eventually, we can have this option just 'select SPARSEMEM'
config MEMORY_HOTPLUG
bool "Allow for memory hot-add"
diff --git a/mm/memory.c b/mm/memory.c
index d0f0bef..457436d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3382,11 +3382,9 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
update_mmu_cache(vma, address, pte);
}

-#define FAULT_AROUND_ORDER 4
+unsigned int fault_around_order = CONFIG_FAULT_AROUND_ORDER;

#ifdef CONFIG_DEBUG_FS
-static unsigned int fault_around_order = FAULT_AROUND_ORDER;
-
static int fault_around_order_get(void *data, u64 *val)
{
*val = fault_around_order;
@@ -3395,7 +3393,6 @@ static int fault_around_order_get(void *data, u64 *val)

static int fault_around_order_set(void *data, u64 val)
{
- BUILD_BUG_ON((1UL << FAULT_AROUND_ORDER) > PTRS_PER_PTE);
if (1UL << val > PTRS_PER_PTE)
return -EINVAL;
fault_around_order = val;
@@ -3430,14 +3427,14 @@ static inline unsigned long fault_around_pages(void)
{
unsigned long nr_pages;

- nr_pages = 1UL << FAULT_AROUND_ORDER;
+ nr_pages = 1UL << fault_around_order;
BUILD_BUG_ON(nr_pages > PTRS_PER_PTE);
return nr_pages;
}

static inline unsigned long fault_around_mask(void)
{
- return ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1);
+ return ~((1UL << (PAGE_SHIFT + fault_around_order)) - 1);
}
#endif

@@ -3495,7 +3492,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* if page by the offset is not ready to be mapped (cold cache or
* something).
*/
- if (vma->vm_ops->map_pages) {
+ if ((vma->vm_ops->map_pages) && (fault_around_order)) {
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
do_fault_around(vma, address, pte, pgoff, flags);
if (!pte_same(*pte, orig_pte))
--
1.7.10.4

2014-04-28 09:07:30

by Peter Zijlstra

Subject: Re: [PATCH V3 1/2] mm: move FAULT_AROUND_ORDER to arch/

On Mon, Apr 28, 2014 at 02:31:29PM +0530, Madhavan Srinivasan wrote:
> +unsigned int fault_around_order = CONFIG_FAULT_AROUND_ORDER;

__read_mostly?
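
I.e., something like this (untested):

	unsigned int fault_around_order __read_mostly = CONFIG_FAULT_AROUND_ORDER;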

2014-04-28 09:36:30

by Kirill A. Shutemov

Subject: RE: [PATCH V3 1/2] mm: move FAULT_AROUND_ORDER to arch/

Madhavan Srinivasan wrote:
> Commit 8c6e50b029 from Kirill A. Shutemov introduced
> vm_ops->map_pages() for mapping easily accessible pages around
> the fault address, in the hope of reducing the number of minor
> page faults.
>
> This patch creates the infrastructure to modify the
> FAULT_AROUND_ORDER value via mm/Kconfig. This enables architecture
> maintainers to pick a suitable FAULT_AROUND_ORDER value based on
> performance data for that architecture. The patch also defaults
> the FAULT_AROUND_ORDER Kconfig option to 4.
>
> Signed-off-by: Madhavan Srinivasan <[email protected]>
> ---
> mm/Kconfig | 8 ++++++++
> mm/memory.c | 11 ++++-------
> 2 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ebe5880..c7fc4f1 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -176,6 +176,14 @@ config MOVABLE_NODE
> config HAVE_BOOTMEM_INFO_NODE
> def_bool n
>
> +#
> +# Fault around order is a control knob to decide the fault around pages.
> +# Default value is set to 4 , but the arch can override it as desired.
> +#
> +config FAULT_AROUND_ORDER
> + int
> + default 4
> +
> # eventually, we can have this option just 'select SPARSEMEM'
> config MEMORY_HOTPLUG
> bool "Allow for memory hot-add"
> diff --git a/mm/memory.c b/mm/memory.c
> index d0f0bef..457436d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3382,11 +3382,9 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
> update_mmu_cache(vma, address, pte);
> }
>
> -#define FAULT_AROUND_ORDER 4
> +unsigned int fault_around_order = CONFIG_FAULT_AROUND_ORDER;
>
> #ifdef CONFIG_DEBUG_FS
> -static unsigned int fault_around_order = FAULT_AROUND_ORDER;
> -
> static int fault_around_order_get(void *data, u64 *val)
> {
> *val = fault_around_order;
> @@ -3395,7 +3393,6 @@ static int fault_around_order_get(void *data, u64 *val)
>
> static int fault_around_order_set(void *data, u64 val)
> {
> - BUILD_BUG_ON((1UL << FAULT_AROUND_ORDER) > PTRS_PER_PTE);
> if (1UL << val > PTRS_PER_PTE)
> return -EINVAL;
> fault_around_order = val;
> @@ -3430,14 +3427,14 @@ static inline unsigned long fault_around_pages(void)
> {
> unsigned long nr_pages;
>
> - nr_pages = 1UL << FAULT_AROUND_ORDER;
> + nr_pages = 1UL << fault_around_order;
> BUILD_BUG_ON(nr_pages > PTRS_PER_PTE);

This BUILD_BUG_ON() doesn't make sense any more, since the compiler
usually can't prove anything about an extern variable.

Drop it or change to VM_BUG_ON() or something.

> return nr_pages;
> }
>
> static inline unsigned long fault_around_mask(void)
> {
> - return ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1);
> + return ~((1UL << (PAGE_SHIFT + fault_around_order)) - 1);

fault_around_pages() and fault_around_mask() should be moved outside of
the #ifdef CONFIG_DEBUG_FS and consolidated, since after the changes they
are practically identical in both branches.
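
Roughly like this, perhaps (untested sketch, also folding in the
VM_BUG_ON() suggestion above):

	static inline unsigned long fault_around_pages(void)
	{
		/* runtime check, now that the order is a variable */
		VM_BUG_ON((1UL << fault_around_order) > PTRS_PER_PTE);
		return 1UL << fault_around_order;
	}

	static inline unsigned long fault_around_mask(void)
	{
		return ~((1UL << (PAGE_SHIFT + fault_around_order)) - 1);
	}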

> }
> #endif
>
> @@ -3495,7 +3492,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> * if page by the offset is not ready to be mapped (cold cache or
> * something).
> */
> - if (vma->vm_ops->map_pages) {
> + if ((vma->vm_ops->map_pages) && (fault_around_order)) {

Way too many parentheses.

> pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> do_fault_around(vma, address, pte, pgoff, flags);
> if (!pte_same(*pte, orig_pte))
--
Kirill A. Shutemov

2014-04-29 02:36:47

by Rusty Russell

Subject: Re: [PATCH V3 2/2] powerpc/pseries: init fault_around_order for pseries

Madhavan Srinivasan <[email protected]> writes:
> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> index 2db8cc6..c87e6b6 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -74,6 +74,8 @@ int CMO_SecPSP = -1;
> unsigned long CMO_PageSize = (ASM_CONST(1) << IOMMU_PAGE_SHIFT_4K);
> EXPORT_SYMBOL(CMO_PageSize);
>
> +extern unsigned int fault_around_order;
> +

It's considered bad form to do this. Put the declaration in linux/mm.h.
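
I.e. something like this (sketch):

	/* in include/linux/mm.h */
	extern unsigned int fault_around_order;

and drop the local extern here.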

Thanks,
Rusty.
PS. But we're getting there! :)

2014-04-29 07:06:40

by Ingo Molnar

Subject: Re: [PATCH V3 2/2] powerpc/pseries: init fault_around_order for pseries


* Madhavan Srinivasan <[email protected]> wrote:

> Performance data for different FAULT_AROUND_ORDER values, from a 4-socket
> Power7 system (128 threads, 128GB memory). perf stat with a repeat count of
> 5 was used to obtain the stddev values. Tests were run on a v3.14 kernel
> (baseline) and on v3.15-rc1 with different fault-around order values.
>
> FAULT_AROUND_ORDER Baseline 1 3 4 5 8
>
> Linux build (make -j64)
> minor-faults 47,437,359 35,279,286 25,425,347 23,461,275 22,002,189 21,435,836
> times in seconds 347.302528420 344.061588460 340.974022391 348.193508116 348.673900158 350.986543618
> stddev for time ( +- 1.50% ) ( +- 0.73% ) ( +- 1.13% ) ( +- 1.01% ) ( +- 1.89% ) ( +- 1.55% )
> %chg time to baseline -0.9% -1.8% 0.2% 0.39% 1.06%

Probably too noisy.

> Linux rebuild (make -j64)
> minor-faults 941,552 718,319 486,625 440,124 410,510 397,416
> times in seconds 30.569834718 31.219637539 31.319370649 31.434285472 31.972367174 31.443043580
> stddev for time ( +- 1.07% ) ( +- 0.13% ) ( +- 0.43% ) ( +- 0.18% ) ( +- 0.95% ) ( +- 0.58% )
> %chg time to baseline 2.1% 2.4% 2.8% 4.58% 2.85%

Here it looks like a speedup. Optimal value: 5+.

> Binutils build (make all -j64 )
> minor-faults 474,821 371,380 269,463 247,715 235,255 228,337
> times in seconds 53.882492432 53.584289348 53.882773216 53.755816431 53.607824348 53.423759642
> stddev for time ( +- 0.08% ) ( +- 0.56% ) ( +- 0.17% ) ( +- 0.11% ) ( +- 0.60% ) ( +- 0.69% )
> %chg time to baseline -0.55% 0.0% -0.23% -0.51% -0.85%

Probably too noisy, but looks like a potential slowdown?

> Two synthetic tests: access every word in a file, in sequential or random order.
>
> Sequential access 16GiB file
> FAULT_AROUND_ORDER Baseline 1 3 4 5 8
> 1 thread
> minor-faults 263,148 131,166 32,908 16,514 8,260 1,093
> times in seconds 53.091138345 53.113191672 53.188776177 53.233017218 53.206841347 53.429979442
> stddev for time ( +- 0.06% ) ( +- 0.07% ) ( +- 0.08% ) ( +- 0.09% ) ( +- 0.03% ) ( +- 0.03% )
> %chg time to baseline 0.04% 0.18% 0.26% 0.21% 0.63%

Speedup, optimal value: 8+.

> 8 threads
> minor-faults 2,097,267 1,048,753 262,237 131,397 65,621 8,274
> times in seconds 55.173790028 54.591880790 54.824623287 54.802162211 54.969680503 54.790387715
> stddev for time ( +- 0.78% ) ( +- 0.09% ) ( +- 0.08% ) ( +- 0.07% ) ( +- 0.28% ) ( +- 0.05% )
> %chg time to baseline -1.05% -0.63% -0.67% -0.36% -0.69%

Looks like a regression?

> 32 threads
> minor-faults 8,388,751 4,195,621 1,049,664 525,461 262,535 32,924
> times in seconds 60.431573046 60.669110744 60.485336388 60.697789706 60.077959564 60.588855032
> stddev for time ( +- 0.44% ) ( +- 0.27% ) ( +- 0.46% ) ( +- 0.67% ) ( +- 0.31% ) ( +- 0.49% )
> %chg time to baseline 0.39% 0.08% 0.44% -0.58% 0.25%

Probably too noisy.

> 64 threads
> minor-faults 16,777,409 8,607,527 2,289,766 1,202,264 598,405 67,587
> times in seconds 96.932617720 100.675418760 102.109880836 103.881733383 102.580199555 105.751194041
> stddev for time ( +- 1.39% ) ( +- 1.06% ) ( +- 0.99% ) ( +- 0.76% ) ( +- 1.65% ) ( +- 1.60% )
> %chg time to baseline 3.86% 5.34% 7.16% 5.82% 9.09%

Speedup, optimal value: 4+

> 128 threads
> minor-faults 33,554,705 17,375,375 4,682,462 2,337,245 1,179,007 134,819
> times in seconds 128.766704495 115.659225437 120.353046307 115.291871270 115.450886036 113.991902150
> stddev for time ( +- 2.93% ) ( +- 0.30% ) ( +- 2.93% ) ( +- 1.24% ) ( +- 1.03% ) ( +- 0.70% )
> %chg time to baseline -10.17% -6.53% -10.46% -10.34% -11.47%

Rather significant regression at order 1 already.

> Random access 1GiB file
> FAULT_AROUND_ORDER Baseline 1 3 4 5 8
> 1 thread
> minor-faults 17,155 8,678 2,126 1,097 581 134
> times in seconds 51.904430523 51.658017987 51.919270792 51.560531738 52.354431597 51.976469502
> stddev for time ( +- 3.19% ) ( +- 1.35% ) ( +- 1.56% ) ( +- 0.91% ) ( +- 1.70% ) ( +- 2.02% )
> %chg time to baseline -0.47% 0.02% -0.66% 0.86% 0.13%

Probably too noisy.

> 8 threads
> minor-faults 131,844 70,705 17,457 8,505 4,251 598
> times in seconds 58.162813956 54.991706305 54.952675791 55.323057492 54.755587379 53.376722828
> stddev for time ( +- 1.44% ) ( +- 0.69% ) ( +- 1.23% ) ( +- 2.78% ) ( +- 1.90% ) ( +- 2.91% )
> %chg time to baseline -5.45% -5.52% -4.88% -5.86% -8.22%

Regression.

> 32 threads
> minor-faults 524,437 270,760 67,069 33,414 16,641 2,204
> times in seconds 69.981777072 76.539570015 79.753578505 76.245943618 77.254258344 79.072596831
> stddev for time ( +- 2.81% ) ( +- 1.95% ) ( +- 2.66% ) ( +- 0.99% ) ( +- 2.35% ) ( +- 3.22% )
> %chg time to baseline 9.37% 13.96% 8.95% 10.39% 12.98%

Speedup, optimal value hard to tell due to noise - 3+ or 8+.

> 64 threads
> minor-faults 1,049,117 527,451 134,016 66,638 33,391 4,559
> times in seconds 108.024517536 117.575067996 115.322659914 111.943998437 115.049450815 119.218450840
> stddev for time ( +- 2.40% ) ( +- 1.77% ) ( +- 1.19% ) ( +- 3.29% ) ( +- 2.32% ) ( +- 1.42% )
> %chg time to baseline 8.84% 6.75% 3.62% 6.5% 10.3%

Speedup, optimal value again hard to tell due to noise.

> 128 threads
> minor-faults 2,097,440 1,054,360 267,042 133,328 66,532 8,652
> times in seconds 155.055861167 153.059625968 152.449492156 151.024005282 150.844647770 155.954366718
> stddev for time ( +- 1.32% ) ( +- 1.14% ) ( +- 1.32% ) ( +- 0.81% ) ( +- 0.75% ) ( +- 0.72% )
> %chg time to baseline -1.28% -1.68% -2.59% -2.71% 0.57%

Slowdown for most orders.

> In case of kernel compilation, fault around order (fao) values of 1 and 3
> provide faster compilation times compared to a value of 4. On closer look,
> fao of 3 has higher gains. In the sequential access synthetic tests, fao
> of 1 has higher gains, and in the random access test, fao of 3 has
> marginal gains. Going by compilation time, an fao value of 3 is suggested
> in this patch for the pseries platform.

So I'm really at a loss to understand where you get the optimal value of
'3' from. The data does not seem to match your claim that values of 1 and
3 provide faster compilation times compared to a value of 4:

> FAULT_AROUND_ORDER Baseline 1 3 4 5 8
>
> Linux rebuild (make -j64)
> minor-faults 941,552 718,319 486,625 440,124 410,510 397,416
> times in seconds 30.569834718 31.219637539 31.319370649 31.434285472 31.972367174 31.443043580
> stddev for time ( +- 1.07% ) ( +- 0.13% ) ( +- 0.43% ) ( +- 0.18% ) ( +- 0.95% ) ( +- 0.58% )
> %chg time to baseline 2.1% 2.4% 2.8% 4.58% 2.85%

5 and 8, and probably 6, 7 are better than 4.

3 is probably _slower_ than the current default - but it's hard to
tell due to inherent noise.

But the other two build tests were too noisy, and if then they showed
a slowdown.

> Worst case scenario: we touch one page every 16M to demonstrate overhead.
>
> Touch only one page in page table in 16GiB file
> FAULT_AROUND_ORDER Baseline 1 3 4 5 8
> 1 thread
> minor-faults 1,104 1,090 1,071 1,068 1,065 1,063
> times in seconds 0.006583298 0.008531502 0.019733795 0.036033763 0.062300553 0.406857086
> stddev for time ( +- 2.79% ) ( +- 2.42% ) ( +- 3.47% ) ( +- 2.81% ) ( +- 2.01% ) ( +- 1.33% )
> 8 threads
> minor-faults 8,279 8,264 8,245 8,243 8,239 8,240
> times in seconds 0.044572398 0.057211811 0.107606306 0.205626815 0.381679120 2.647979955
> stddev for time ( +- 1.95% ) ( +- 2.98% ) ( +- 1.74% ) ( +- 2.80% ) ( +- 2.01% ) ( +- 1.86% )
> 32 threads
> minor-faults 32,879 32,864 32,849 32,845 32,839 32,843
> times in seconds 0.197659343 0.218486087 0.445116407 0.694235883 1.296894038 9.127517045
> stddev for time ( +- 3.05% ) ( +- 3.05% ) ( +- 4.33% ) ( +- 3.08% ) ( +- 3.75% ) ( +- 0.56% )
> 64 threads
> minor-faults 65,680 65,664 65,646 65,645 65,640 65,647
> times in seconds 0.455537304 0.489688780 0.866490093 1.427393118 2.379628982 17.059295051
> stddev for time ( +- 4.01% ) ( +- 4.13% ) ( +- 2.92% ) ( +- 1.68% ) ( +- 1.79% ) ( +- 0.48% )
> 128 threads
> minor-faults 131,279 131,265 131,250 131,245 131,241 131,254
> times in seconds 1.026880651 1.095327536 1.721728274 2.808233068 4.662729948 31.732848290
> stddev for time ( +- 6.85% ) ( +- 4.09% ) ( +- 1.71% ) ( +- 3.45% ) ( +- 2.40% ) ( +- 0.68% )

There are no '%change' values shown, but the slowdown looks significant,
and it's the worst case: for example, with 1 thread, order 3 looks about
3x slower than order 0.
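(0.019733795s vs. 0.006583298s, i.e. roughly a factor of 3.)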

All in all, looking at your latest data, I don't think the conclusion
from the first version of this optimization patch from a month ago holds
anymore:

> + /* Measured on a 4 socket Power7 system (128 Threads and 128GB memory) */
> + fault_around_order = 3;

The data is rather conflicting and inconclusive, and if it shows a sweet
spot, it's not at order 3. New data should in general trigger a reanalysis
of your earlier optimization value.

I'm starting to suspect that maybe workloads ought to be given a
choice in this matter, via madvise() or such.
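
Purely as an illustration of the idea (MADV_FAULT_AROUND is hypothetical;
no such madvise() flag exists today):

	/* hypothetical: let a workload opt in to aggressive fault-around */
	madvise(addr, length, MADV_FAULT_AROUND);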

Thanks,

Ingo

2014-04-29 09:35:58

by Madhavan Srinivasan

Subject: Re: [PATCH V3 1/2] mm: move FAULT_AROUND_ORDER to arch/

On Monday 28 April 2014 02:36 PM, Peter Zijlstra wrote:
> On Mon, Apr 28, 2014 at 02:31:29PM +0530, Madhavan Srinivasan wrote:
>> +unsigned int fault_around_order = CONFIG_FAULT_AROUND_ORDER;
>
> __read_mostly?
>
Agreed. Will add it.

Thanks for review.
With regards
Maddy

2014-04-29 09:36:03

by Madhavan Srinivasan

Subject: Re: [PATCH V3 1/2] mm: move FAULT_AROUND_ORDER to arch/

On Monday 28 April 2014 03:06 PM, Kirill A. Shutemov wrote:
> Madhavan Srinivasan wrote:
>> Commit 8c6e50b029 from Kirill A. Shutemov introduced
>> vm_ops->map_pages() for mapping easily accessible pages around
>> the fault address, in the hope of reducing the number of minor
>> page faults.
>>
>> This patch creates the infrastructure to modify the
>> FAULT_AROUND_ORDER value via mm/Kconfig. This enables architecture
>> maintainers to pick a suitable FAULT_AROUND_ORDER value based on
>> performance data for that architecture. The patch also defaults
>> the FAULT_AROUND_ORDER Kconfig option to 4.
>>
>> Signed-off-by: Madhavan Srinivasan <[email protected]>
>> ---
>> mm/Kconfig | 8 ++++++++
>> mm/memory.c | 11 ++++-------
>> 2 files changed, 12 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index ebe5880..c7fc4f1 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -176,6 +176,14 @@ config MOVABLE_NODE
>> config HAVE_BOOTMEM_INFO_NODE
>> def_bool n
>>
>> +#
>> +# Fault around order is a control knob to decide the fault around pages.
>> +# Default value is set to 4 , but the arch can override it as desired.
>> +#
>> +config FAULT_AROUND_ORDER
>> + int
>> + default 4
>> +
>> # eventually, we can have this option just 'select SPARSEMEM'
>> config MEMORY_HOTPLUG
>> bool "Allow for memory hot-add"
>> diff --git a/mm/memory.c b/mm/memory.c
>> index d0f0bef..457436d 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3382,11 +3382,9 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
>> update_mmu_cache(vma, address, pte);
>> }
>>
>> -#define FAULT_AROUND_ORDER 4
>> +unsigned int fault_around_order = CONFIG_FAULT_AROUND_ORDER;
>>
>> #ifdef CONFIG_DEBUG_FS
>> -static unsigned int fault_around_order = FAULT_AROUND_ORDER;
>> -
>> static int fault_around_order_get(void *data, u64 *val)
>> {
>> *val = fault_around_order;
>> @@ -3395,7 +3393,6 @@ static int fault_around_order_get(void *data, u64 *val)
>>
>> static int fault_around_order_set(void *data, u64 val)
>> {
>> - BUILD_BUG_ON((1UL << FAULT_AROUND_ORDER) > PTRS_PER_PTE);
>> if (1UL << val > PTRS_PER_PTE)
>> return -EINVAL;
>> fault_around_order = val;
>> @@ -3430,14 +3427,14 @@ static inline unsigned long fault_around_pages(void)
>> {
>> unsigned long nr_pages;
>>
>> - nr_pages = 1UL << FAULT_AROUND_ORDER;
>> + nr_pages = 1UL << fault_around_order;
>> BUILD_BUG_ON(nr_pages > PTRS_PER_PTE);
>
> This BUILD_BUG_ON() doesn't make sense any more, since the compiler
> usually can't prove anything about an extern variable.
>
> Drop it or change to VM_BUG_ON() or something.
>

Ok.

>> return nr_pages;
>> }
>>
>> static inline unsigned long fault_around_mask(void)
>> {
>> - return ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1);
>> + return ~((1UL << (PAGE_SHIFT + fault_around_order)) - 1);
>
> fault_around_pages() and fault_around_mask() should be moved outside of
> the #ifdef CONFIG_DEBUG_FS and consolidated, since after the changes they
> are practically identical in both branches.
>

Agreed. Will change it.

>> }
>> #endif
>>
>> @@ -3495,7 +3492,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>> * if page by the offset is not ready to be mapped (cold cache or
>> * something).
>> */
>> - if (vma->vm_ops->map_pages) {
>> + if ((vma->vm_ops->map_pages) && (fault_around_order)) {
>
> Way too many parentheses.
>

Sure. Will modify it.

Thanks for review
with regards
Maddy
>> pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>> do_fault_around(vma, address, pte, pgoff, flags);
>> if (!pte_same(*pte, orig_pte))

2014-04-29 09:37:07

by Madhavan Srinivasan

Subject: Re: [PATCH V3 2/2] powerpc/pseries: init fault_around_order for pseries

On Tuesday 29 April 2014 07:48 AM, Rusty Russell wrote:
> Madhavan Srinivasan <[email protected]> writes:
>> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
>> index 2db8cc6..c87e6b6 100644
>> --- a/arch/powerpc/platforms/pseries/setup.c
>> +++ b/arch/powerpc/platforms/pseries/setup.c
>> @@ -74,6 +74,8 @@ int CMO_SecPSP = -1;
>> unsigned long CMO_PageSize = (ASM_CONST(1) << IOMMU_PAGE_SHIFT_4K);
>> EXPORT_SYMBOL(CMO_PageSize);
>>
>> +extern unsigned int fault_around_order;
>> +
>
> It's considered bad form to do this. Put the declaration in linux/mm.h.
>

ok. Will change it.

Thanks for review
With regards
Maddy

> Thanks,
> Rusty.
> PS. But we're getting there! :)
>

2014-04-29 10:37:00

by Madhavan Srinivasan

Subject: Re: [PATCH V3 2/2] powerpc/pseries: init fault_around_order for pseries

On Tuesday 29 April 2014 12:36 PM, Ingo Molnar wrote:
>
> * Madhavan Srinivasan <[email protected]> wrote:
>
>> Performance data for different FAULT_AROUND_ORDER values, from a 4-socket
>> Power7 system (128 threads, 128GB memory). perf stat with a repeat count of
>> 5 was used to obtain the stddev values. Tests were run on a v3.14 kernel
>> (baseline) and on v3.15-rc1 with different fault-around order values.
>>
>> FAULT_AROUND_ORDER Baseline 1 3 4 5 8
>>
>> Linux build (make -j64)
>> minor-faults 47,437,359 35,279,286 25,425,347 23,461,275 22,002,189 21,435,836
>> times in seconds 347.302528420 344.061588460 340.974022391 348.193508116 348.673900158 350.986543618
>> stddev for time ( +- 1.50% ) ( +- 0.73% ) ( +- 1.13% ) ( +- 1.01% ) ( +- 1.89% ) ( +- 1.55% )
>> %chg time to baseline -0.9% -1.8% 0.2% 0.39% 1.06%
>
> Probably too noisy.

Ok. I should have added the formula used for %change to clarify the data
presented. My bad.

Just to clarify, %change here is calculated with this formula, expressed
as a percentage:

((new value - baseline)/baseline)

In this case, a negative %change means a drop in time, and a positive
value means an increase in time, compared to baseline.
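
For example, for the Linux build at fault around order 3:
(340.974022391 - 347.302528420)/347.302528420 = -0.018, i.e. the -1.8%
shown in the table.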

With regards
Maddy

>
>> Linux rebuild (make -j64)
>> minor-faults 941,552 718,319 486,625 440,124 410,510 397,416
>> times in seconds 30.569834718 31.219637539 31.319370649 31.434285472 31.972367174 31.443043580
>> stddev for time ( +- 1.07% ) ( +- 0.13% ) ( +- 0.43% ) ( +- 0.18% ) ( +- 0.95% ) ( +- 0.58% )
>> %chg time to baseline 2.1% 2.4% 2.8% 4.58% 2.85%
>
> Here it looks like a speedup. Optimal value: 5+.
>
>> Binutils build (make all -j64 )
>> minor-faults 474,821 371,380 269,463 247,715 235,255 228,337
>> times in seconds 53.882492432 53.584289348 53.882773216 53.755816431 53.607824348 53.423759642
>> stddev for time ( +- 0.08% ) ( +- 0.56% ) ( +- 0.17% ) ( +- 0.11% ) ( +- 0.60% ) ( +- 0.69% )
>> %chg time to baseline -0.55% 0.0% -0.23% -0.51% -0.85%
>
> Probably too noisy, but looks like a potential slowdown?
>
>> Two synthetic tests: access every word in a file, in sequential or random order.
>>
>> Sequential access 16GiB file
>> FAULT_AROUND_ORDER Baseline 1 3 4 5 8
>> 1 thread
>> minor-faults 263,148 131,166 32,908 16,514 8,260 1,093
>> times in seconds 53.091138345 53.113191672 53.188776177 53.233017218 53.206841347 53.429979442
>> stddev for time ( +- 0.06% ) ( +- 0.07% ) ( +- 0.08% ) ( +- 0.09% ) ( +- 0.03% ) ( +- 0.03% )
>> %chg time to baseline 0.04% 0.18% 0.26% 0.21% 0.63%
>
> Speedup, optimal value: 8+.
>
>> 8 threads
>> minor-faults 2,097,267 1,048,753 262,237 131,397 65,621 8,274
>> times in seconds 55.173790028 54.591880790 54.824623287 54.802162211 54.969680503 54.790387715
>> stddev for time ( +- 0.78% ) ( +- 0.09% ) ( +- 0.08% ) ( +- 0.07% ) ( +- 0.28% ) ( +- 0.05% )
>> %chg time to baseline -1.05% -0.63% -0.67% -0.36% -0.69%
>
> Looks like a regression?
>
>> 32 threads
>> minor-faults 8,388,751 4,195,621 1,049,664 525,461 262,535 32,924
>> times in seconds 60.431573046 60.669110744 60.485336388 60.697789706 60.077959564 60.588855032
>> stddev for time ( +- 0.44% ) ( +- 0.27% ) ( +- 0.46% ) ( +- 0.67% ) ( +- 0.31% ) ( +- 0.49% )
>> %chg time to baseline 0.39% 0.08% 0.44% -0.58% 0.25%
>
> Probably too noisy.
>
>> 64 threads
>> minor-faults 16,777,409 8,607,527 2,289,766 1,202,264 598,405 67,587
>> times in seconds 96.932617720 100.675418760 102.109880836 103.881733383 102.580199555 105.751194041
>> stddev for time ( +- 1.39% ) ( +- 1.06% ) ( +- 0.99% ) ( +- 0.76% ) ( +- 1.65% ) ( +- 1.60% )
>> %chg time to baseline 3.86% 5.34% 7.16% 5.82% 9.09%
>
> Speedup, optimal value: 4+
>
>> 128 threads
>> minor-faults 33,554,705 17,375,375 4,682,462 2,337,245 1,179,007 134,819
>> times in seconds 128.766704495 115.659225437 120.353046307 115.291871270 115.450886036 113.991902150
>> stddev for time ( +- 2.93% ) ( +- 0.30% ) ( +- 2.93% ) ( +- 1.24% ) ( +- 1.03% ) ( +- 0.70% )
>> %chg time to baseline -10.17% -6.53% -10.46% -10.34% -11.47%
>
> Rather significant regression at order 1 already.
>
>> Random access 1GiB file
>> FAULT_AROUND_ORDER Baseline 1 3 4 5 8
>> 1 thread
>> minor-faults 17,155 8,678 2,126 1,097 581 134
>> times in seconds 51.904430523 51.658017987 51.919270792 51.560531738 52.354431597 51.976469502
>> stddev for time ( +- 3.19% ) ( +- 1.35% ) ( +- 1.56% ) ( +- 0.91% ) ( +- 1.70% ) ( +- 2.02% )
>> %chg time to baseline -0.47% 0.02% -0.66% 0.86% 0.13%
>
> Probably too noisy.
>
>> 8 threads
>> minor-faults 131,844 70,705 17,457 8,505 4,251 598
>> times in seconds 58.162813956 54.991706305 54.952675791 55.323057492 54.755587379 53.376722828
>> stddev for time ( +- 1.44% ) ( +- 0.69% ) ( +- 1.23% ) ( +- 2.78% ) ( +- 1.90% ) ( +- 2.91% )
>> %chg time to baseline -5.45% -5.52% -4.88% -5.86% -8.22%
>
> Regression.
>
>> 32 threads
>> minor-faults 524,437 270,760 67,069 33,414 16,641 2,204
>> times in seconds 69.981777072 76.539570015 79.753578505 76.245943618 77.254258344 79.072596831
>> stddev for time ( +- 2.81% ) ( +- 1.95% ) ( +- 2.66% ) ( +- 0.99% ) ( +- 2.35% ) ( +- 3.22% )
>> %chg time to baseline 9.37% 13.96% 8.95% 10.39% 12.98%
>
> Speedup, optimal value hard to tell due to noise - 3+ or 8+.
>
>> 64 threads
>> minor-faults 1,049,117 527,451 134,016 66,638 33,391 4,559
>> times in seconds 108.024517536 117.575067996 115.322659914 111.943998437 115.049450815 119.218450840
>> stddev for time ( +- 2.40% ) ( +- 1.77% ) ( +- 1.19% ) ( +- 3.29% ) ( +- 2.32% ) ( +- 1.42% )
>> %chg time to baseline 8.84% 6.75% 3.62% 6.5% 10.3%
>
> Speedup, optimal value again hard to tell due to noise.
>
>> 128 threads
>> minor-faults 2,097,440 1,054,360 267,042 133,328 66,532 8,652
>> times in seconds 155.055861167 153.059625968 152.449492156 151.024005282 150.844647770 155.954366718
>> stddev for time ( +- 1.32% ) ( +- 1.14% ) ( +- 1.32% ) ( +- 0.81% ) ( +- 0.75% ) ( +- 0.72% )
>> %chg time to baseline -1.28% -1.68% -2.59% -2.71% 0.57%
>
> Slowdown for most orders.
>
>> In case of kernel compilation, fault around order (fao) values of 1 and 3
>> provide faster compilation times compared to a value of 4. On closer look,
>> fao of 3 has higher gains. In the sequential access synthetic tests, fao
>> of 1 has higher gains, and in the random access test, fao of 3 has
>> marginal gains. Going by compilation time, an fao value of 3 is suggested
>> in this patch for the pseries platform.
>
> So I'm really at a loss to understand where you get the optimal value of
> '3' from. The data does not seem to match your claim that values of 1 and
> 3 provide faster compilation times compared to a value of 4:
>



>> FAULT_AROUND_ORDER Baseline 1 3 4 5 8
>>
>> Linux rebuild (make -j64)
>> minor-faults 941,552 718,319 486,625 440,124 410,510 397,416
>> times in seconds 30.569834718 31.219637539 31.319370649 31.434285472 31.972367174 31.443043580
>> stddev for time ( +- 1.07% ) ( +- 0.13% ) ( +- 0.43% ) ( +- 0.18% ) ( +- 0.95% ) ( +- 0.58% )
>> %chg time to baseline 2.1% 2.4% 2.8% 4.58% 2.85%
>
> 5 and 8, and probably 6, 7 are better than 4.
>
> 3 is probably _slower_ than the current default - but it's hard to
> tell due to inherent noise.
>
> But the other two build tests were too noisy, and if then they showed
> a slowdown.
>
>> Worst case scenario: we touch one page every 16M to demonstrate overhead.
>>
>> Touch only one page in page table in 16GiB file
>> FAULT_AROUND_ORDER Baseline 1 3 4 5 8
>> 1 thread
>> minor-faults 1,104 1,090 1,071 1,068 1,065 1,063
>> times in seconds 0.006583298 0.008531502 0.019733795 0.036033763 0.062300553 0.406857086
>> stddev for time ( +- 2.79% ) ( +- 2.42% ) ( +- 3.47% ) ( +- 2.81% ) ( +- 2.01% ) ( +- 1.33% )
>> 8 threads
>> minor-faults 8,279 8,264 8,245 8,243 8,239 8,240
>> times in seconds 0.044572398 0.057211811 0.107606306 0.205626815 0.381679120 2.647979955
>> stddev for time ( +- 1.95% ) ( +- 2.98% ) ( +- 1.74% ) ( +- 2.80% ) ( +- 2.01% ) ( +- 1.86% )
>> 32 threads
>> minor-faults 32,879 32,864 32,849 32,845 32,839 32,843
>> times in seconds 0.197659343 0.218486087 0.445116407 0.694235883 1.296894038 9.127517045
>> stddev for time ( +- 3.05% ) ( +- 3.05% ) ( +- 4.33% ) ( +- 3.08% ) ( +- 3.75% ) ( +- 0.56% )
>> 64 threads
>> minor-faults 65,680 65,664 65,646 65,645 65,640 65,647
>> times in seconds 0.455537304 0.489688780 0.866490093 1.427393118 2.379628982 17.059295051
>> stddev for time ( +- 4.01% ) ( +- 4.13% ) ( +- 2.92% ) ( +- 1.68% ) ( +- 1.79% ) ( +- 0.48% )
>> 128 threads
>> minor-faults 131,279 131,265 131,250 131,245 131,241 131,254
>> times in seconds 1.026880651 1.095327536 1.721728274 2.808233068 4.662729948 31.732848290
>> stddev for time ( +- 6.85% ) ( +- 4.09% ) ( +- 1.71% ) ( +- 3.45% ) ( +- 2.40% ) ( +- 0.68% )
>
> There are no '%change' values shown, but the slowdown looks significant,
> and it's the worst case: for example, with 1 thread, order 3 looks about
> 3x slower than order 0.
>
> All in all, looking at your latest data, I don't think the conclusion
> from the first version of this optimization patch from a month ago holds
> anymore:
>
>> + /* Measured on a 4 socket Power7 system (128 Threads and 128GB memory) */
>> + fault_around_order = 3;
>
> The data is rather conflicting and inconclusive, and if it shows a sweet
> spot, it's not at order 3. New data should in general trigger a reanalysis
> of your earlier optimization value.
>
> I'm starting to suspect that maybe workloads ought to be given a
> choice in this matter, via madvise() or such.
>
> Thanks,
>
> Ingo
>

2014-04-30 07:11:29

by Rusty Russell

Subject: Re: [PATCH V3 2/2] powerpc/pseries: init fault_around_order for pseries

Ingo Molnar <[email protected]> writes:
> * Madhavan Srinivasan <[email protected]> wrote:
>
>> Performance data for different FAULT_AROUND_ORDER values, from a 4-socket
>> Power7 system (128 threads, 128GB memory). perf stat with a repeat count of
>> 5 was used to obtain the stddev values. Tests were run on a v3.14 kernel
>> (baseline) and on v3.15-rc1 with different fault-around order values.
>>
>> FAULT_AROUND_ORDER Baseline 1 3 4 5 8
>>
>> Linux build (make -j64)
>> minor-faults 47,437,359 35,279,286 25,425,347 23,461,275 22,002,189 21,435,836
>> times in seconds 347.302528420 344.061588460 340.974022391 348.193508116 348.673900158 350.986543618
>> stddev for time ( +- 1.50% ) ( +- 0.73% ) ( +- 1.13% ) ( +- 1.01% ) ( +- 1.89% ) ( +- 1.55% )
>> %chg time to baseline -0.9% -1.8% 0.2% 0.39% 1.06%
>
> Probably too noisy.

A little, but 3 still looks like the winner.

>> Linux rebuild (make -j64)
>> minor-faults 941,552 718,319 486,625 440,124 410,510 397,416
>> times in seconds 30.569834718 31.219637539 31.319370649 31.434285472 31.972367174 31.443043580
>> stddev for time ( +- 1.07% ) ( +- 0.13% ) ( +- 0.43% ) ( +- 0.18% ) ( +- 0.95% ) ( +- 0.58% )
>> %chg time to baseline 2.1% 2.4% 2.8% 4.58% 2.85%
>
> Here it looks like a speedup. Optimal value: 5+.

No, lower time is better. Baseline (no faultaround) wins.


etc.

It's not a huge surprise that a 64k page arch wants a smaller value than
a 4k system. But I agree: I don't see much upside for FAO > 0, but I do
see downside.

Most extreme results:
Order 1: 2% loss on recompile. 10% win 4% loss on seq. 9% loss random.
Order 3: 2% loss on recompile. 6% win 5% loss on seq. 14% loss on random.
Order 4: 2.8% loss on recompile. 10% win 7% loss on seq. 9% loss on random.

> I'm starting to suspect that maybe workloads ought to be given a
> choice in this matter, via madvise() or such.

I really don't think they'll be able to use it; it'll change far too
much with machine and kernel updates. I think we should apply patch #1
(with fixes) to make it a variable, then set it to 0 for PPC.

Cheers,
Rusty.

2014-04-30 08:15:43

by Madhavan Srinivasan

Subject: Re: [PATCH V3 2/2] powerpc/pseries: init fault_around_order for pseries

On Wednesday 30 April 2014 12:34 PM, Rusty Russell wrote:
> Ingo Molnar <[email protected]> writes:
>> * Madhavan Srinivasan <[email protected]> wrote:
>>
>>> Performance data for different FAULT_AROUND_ORDER values, from a 4-socket
>>> Power7 system (128 threads, 128GB memory). perf stat with a repeat count of
>>> 5 was used to obtain the stddev values. Tests were run on a v3.14 kernel
>>> (baseline) and on v3.15-rc1 with different fault-around order values.
>>>
>>> FAULT_AROUND_ORDER Baseline 1 3 4 5 8
>>>
>>> Linux build (make -j64)
>>> minor-faults 47,437,359 35,279,286 25,425,347 23,461,275 22,002,189 21,435,836
>>> times in seconds 347.302528420 344.061588460 340.974022391 348.193508116 348.673900158 350.986543618
>>> stddev for time ( +- 1.50% ) ( +- 0.73% ) ( +- 1.13% ) ( +- 1.01% ) ( +- 1.89% ) ( +- 1.55% )
>>> %chg time to baseline -0.9% -1.8% 0.2% 0.39% 1.06%
>>
>> Probably too noisy.
>
> A little, but 3 still looks like the winner.
>
>>> Linux rebuild (make -j64)
>>> minor-faults 941,552 718,319 486,625 440,124 410,510 397,416
>>> times in seconds 30.569834718 31.219637539 31.319370649 31.434285472 31.972367174 31.443043580
>>> stddev for time ( +- 1.07% ) ( +- 0.13% ) ( +- 0.43% ) ( +- 0.18% ) ( +- 0.95% ) ( +- 0.58% )
>>> %chg time to baseline 2.1% 2.4% 2.8% 4.58% 2.85%
>>
>> Here it looks like a speedup. Optimal value: 5+.
>
> No, lower time is better. Baseline (no faultaround) wins.
>
>
> etc.
>
> It's not a huge surprise that a 64k page arch wants a smaller value than
> a 4k system. But I agree: I don't see much upside for FAO > 0, but I do
> see downside.
>
> Most extreme results:
> Order 1: 2% loss on recompile. 10% win 4% loss on seq. 9% loss random.
> Order 3: 2% loss on recompile. 6% win 5% loss on seq. 14% loss on random.
> Order 4: 2.8% loss on recompile. 10% win 7% loss on seq. 9% loss on random.
>
>> I'm starting to suspect that maybe workloads ought to be given a
>> choice in this matter, via madvise() or such.
>
> I really don't think they'll be able to use it; it'll change far too
> much with machine and kernel updates. I think we should apply patch #1
> (with fixes) to make it a variable, then set it to 0 for PPC.
>

Ok. Will do.

Thanks for review
With regards
Maddy


> Cheers,
> Rusty.
>