2015-05-24 16:02:33

by Jungseok Lee

[permalink] [raw]
Subject: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator

Fork-routine sometimes fails to get a physically contiguous region for
thread_info on 4KB page system although free memory is enough. That is,
a physically contiguous region, which is currently 16KB, is not available
since system memory is fragmented.

This patch tries to solve the problem as allocating thread_info memory
from vmalloc space, not 1:1 mapping one. The downside is one additional
page allocation in case of vmalloc. However, vmalloc space is large enough,
around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
not a big tradeoff for fork-routine service.

Suggested-by: Sungjinn Chung <[email protected]>
Signed-off-by: Jungseok Lee <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/arm64/Kconfig | 12 ++++++++++++
arch/arm64/include/asm/thread_info.h | 9 +++++++++
arch/arm64/kernel/process.c | 7 +++++++
3 files changed, 28 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 99930cf..93c236a 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -536,6 +536,18 @@ config ARCH_SELECT_MEMORY_MODEL
config HAVE_ARCH_PFN_VALID
def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM

+config ARCH_THREAD_INFO_ALLOCATOR
+ bool "Enable vmalloc based thread_info allocator (EXPERIMENTAL)"
+ depends on ARM64_4K_PAGES
+ default n
+ help
+ This feature enables vmalloc based thread_info allocator. It
+ prevents fork-routine from begin failed to obtain physically
+ contiguour region due to memory fragmentation on low system
+ memory platforms.
+
+ If unsure, say N
+
config HW_PERF_EVENTS
bool "Enable hardware performance counter support for perf events"
depends on PERF_EVENTS
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index dcd06d1..e753e59 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -61,6 +61,15 @@ struct thread_info {
#define init_thread_info (init_thread_union.thread_info)
#define init_stack (init_thread_union.stack)

+#ifdef CONFIG_ARCH_THREAD_INFO_ALLOCATOR
+#define alloc_thread_info_node(tsk, node) \
+({ \
+ __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE, VMALLOC_START, \
+ VMALLOC_END, GFP_KERNEL, PAGE_KERNEL, 0, \
+ NUMA_NO_NODE, __builtin_return_address(0)); \
+})
+#define free_thread_info(ti) vfree(ti)
+#endif
/*
* how to get the current stack pointer from C
*/
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index c506bee..c4b6aae 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -238,6 +238,13 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
return 0;
}

+#ifdef CONFIG_ARCH_THREAD_INFO_ALLOCATOR
+struct page *arch_thread_info_to_page(struct thread_info *ti)
+{
+ return vmalloc_to_page(ti);
+}
+#endif
+
asmlinkage void ret_from_fork(void) asm("ret_from_fork");

int copy_thread(unsigned long clone_flags, unsigned long stack_start,
--
1.9.1


2015-05-24 17:49:11

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator

On Monday 25 May 2015 01:02:20 Jungseok Lee wrote:
> Fork-routine sometimes fails to get a physically contiguous region for
> thread_info on 4KB page system although free memory is enough. That is,
> a physically contiguous region, which is currently 16KB, is not available
> since system memory is fragmented.
>
> This patch tries to solve the problem as allocating thread_info memory
> from vmalloc space, not 1:1 mapping one. The downside is one additional
> page allocation in case of vmalloc. However, vmalloc space is large enough,
> around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
> not a big tradeoff for fork-routine service.

vmalloc has a rather large runtime cost. I'd argue that failing to allocate
thread_info structures means something has gone very wrong.

Can you describe the scenario that leads to fragmentation this bad?

Could the stack size be reduced to 8KB perhaps?

Arnd

2015-05-25 10:01:43

by Jungseok Lee

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator

On May 25, 2015, at 2:49 AM, Arnd Bergmann wrote:
> On Monday 25 May 2015 01:02:20 Jungseok Lee wrote:
>> Fork-routine sometimes fails to get a physically contiguous region for
>> thread_info on 4KB page system although free memory is enough. That is,
>> a physically contiguous region, which is currently 16KB, is not available
>> since system memory is fragmented.
>>
>> This patch tries to solve the problem as allocating thread_info memory
>> from vmalloc space, not 1:1 mapping one. The downside is one additional
>> page allocation in case of vmalloc. However, vmalloc space is large enough,
>> around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
>> not a big tradeoff for fork-routine service.
>
> vmalloc has a rather large runtime cost. I'd argue that failing to allocate
> thread_info structures means something has gone very wrong.

That is why the feature is marked "N" by default.
I focused on fork-routine stability rather than performance.

Could you give me an idea how to evaluate performance degradation?
Running some benchmarks would be helpful, but I would like to try to
gather data based on meaningful methodology.

> Can you describe the scenario that leads to fragmentation this bad?

Android, but I could not describe an exact reproduction procedure step
by step since it's behaved and reproduced randomly. As reading the following
thread from mm mailing list, a similar symptom is observed on other systems.

https://lkml.org/lkml/2015/4/28/59

Although I do not know the details of a system mentioned in the thread,
even order-2 page allocation is not smoothly operated due to fragmentation on
low memory system.

I think the point is *low memory system*. 64-bit kernel is usually a feasible
option when system memory is enough, but 64-bit kernel and low memory system
combo is not unusual in case of ARM64.

> Could the stack size be reduced to 8KB perhaps?

I guess probably not.

A commit, 845ad05e, says that 8KB is not enough to cover SpecWeb benchmark.
The stack size is 16KB on x86_64. I am not sure whether all applications,
which work fine on x86_64 machine, run very well on ARM64 with 8KB stack size.

Best Regards
Jungseok Lee-

2015-05-25 14:40:54

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator

Hello Jungseok,

On Mon, May 25, 2015 at 01:02:20AM +0900, Jungseok Lee wrote:
> Fork-routine sometimes fails to get a physically contiguous region for
> thread_info on 4KB page system although free memory is enough. That is,
> a physically contiguous region, which is currently 16KB, is not available
> since system memory is fragmented.

Order less than PAGE_ALLOC_COSTLY_ORDER should not fail in current
mm implementation. If you saw the order-2,3 high-order allocation fail
maybe your application received SIGKILL by someone. LMK?

>
> This patch tries to solve the problem as allocating thread_info memory
> from vmalloc space, not 1:1 mapping one. The downside is one additional
> page allocation in case of vmalloc. However, vmalloc space is large enough,

The size you want to allocate is 16KB in here but additional 4K?
It increases 25% memory footprint, which is huge downside.

> around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
> not a big tradeoff for fork-routine service.
>
> Suggested-by: Sungjinn Chung <[email protected]>
> Signed-off-by: Jungseok Lee <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> arch/arm64/Kconfig | 12 ++++++++++++
> arch/arm64/include/asm/thread_info.h | 9 +++++++++
> arch/arm64/kernel/process.c | 7 +++++++
> 3 files changed, 28 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 99930cf..93c236a 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -536,6 +536,18 @@ config ARCH_SELECT_MEMORY_MODEL
> config HAVE_ARCH_PFN_VALID
> def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM
>
> +config ARCH_THREAD_INFO_ALLOCATOR
> + bool "Enable vmalloc based thread_info allocator (EXPERIMENTAL)"
> + depends on ARM64_4K_PAGES
> + default n
> + help
> + This feature enables vmalloc based thread_info allocator. It
> + prevents fork-routine from begin failed to obtain physically
> + contiguour region due to memory fragmentation on low system
> + memory platforms.
> +
> + If unsure, say N
> +
> config HW_PERF_EVENTS
> bool "Enable hardware performance counter support for perf events"
> depends on PERF_EVENTS
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index dcd06d1..e753e59 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -61,6 +61,15 @@ struct thread_info {
> #define init_thread_info (init_thread_union.thread_info)
> #define init_stack (init_thread_union.stack)
>
> +#ifdef CONFIG_ARCH_THREAD_INFO_ALLOCATOR
> +#define alloc_thread_info_node(tsk, node) \
> +({ \
> + __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE, VMALLOC_START, \
> + VMALLOC_END, GFP_KERNEL, PAGE_KERNEL, 0, \
> + NUMA_NO_NODE, __builtin_return_address(0)); \
> +})
> +#define free_thread_info(ti) vfree(ti)
> +#endif
> /*
> * how to get the current stack pointer from C
> */
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index c506bee..c4b6aae 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -238,6 +238,13 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
> return 0;
> }
>
> +#ifdef CONFIG_ARCH_THREAD_INFO_ALLOCATOR
> +struct page *arch_thread_info_to_page(struct thread_info *ti)
> +{
> + return vmalloc_to_page(ti);
> +}
> +#endif
> +
> asmlinkage void ret_from_fork(void) asm("ret_from_fork");
>
> int copy_thread(unsigned long clone_flags, unsigned long stack_start,
> --
> 1.9.1
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Kind regards,
Minchan Kim

2015-05-25 14:59:09

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator

On Mon, May 25, 2015 at 07:01:33PM +0900, Jungseok Lee wrote:
> On May 25, 2015, at 2:49 AM, Arnd Bergmann wrote:
> > On Monday 25 May 2015 01:02:20 Jungseok Lee wrote:
> >> Fork-routine sometimes fails to get a physically contiguous region for
> >> thread_info on 4KB page system although free memory is enough. That is,
> >> a physically contiguous region, which is currently 16KB, is not available
> >> since system memory is fragmented.
> >>
> >> This patch tries to solve the problem as allocating thread_info memory
> >> from vmalloc space, not 1:1 mapping one. The downside is one additional
> >> page allocation in case of vmalloc. However, vmalloc space is large enough,
> >> around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
> >> not a big tradeoff for fork-routine service.
> >
> > vmalloc has a rather large runtime cost. I'd argue that failing to allocate
> > thread_info structures means something has gone very wrong.
>
> That is why the feature is marked "N" by default.
> I focused on fork-routine stability rather than performance.

If VM has trouble with order-2 allocation, your system would be
trouble soon although fork at the moment manages to be successful
because such small high-order(ex, order <= PAGE_ALLOC_COSTLY_ORDER)
allocation is common in the kernel so VM should handle it smoothly.
If VM didn't, it means we should fix VM itself, not a specific
allocation site. Fork is one of victim by that.

>
> Could you give me an idea how to evaluate performance degradation?
> Running some benchmarks would be helpful, but I would like to try to
> gather data based on meaningful methodology.
>
> > Can you describe the scenario that leads to fragmentation this bad?
>
> Android, but I could not describe an exact reproduction procedure step
> by step since it's behaved and reproduced randomly. As reading the following
> thread from mm mailing list, a similar symptom is observed on other systems.
>
> https://lkml.org/lkml/2015/4/28/59
>
> Although I do not know the details of a system mentioned in the thread,
> even order-2 page allocation is not smoothly operated due to fragmentation on
> low memory system.

What Joonsoo have tackle is generic fragmentation problem, not *a* fork fail,
which is more right approach to handle small high-order allocation problem.

>
> I think the point is *low memory system*. 64-bit kernel is usually a feasible
> option when system memory is enough, but 64-bit kernel and low memory system
> combo is not unusual in case of ARM64.
>
> > Could the stack size be reduced to 8KB perhaps?
>
> I guess probably not.
>
> A commit, 845ad05e, says that 8KB is not enough to cover SpecWeb benchmark.
> The stack size is 16KB on x86_64. I am not sure whether all applications,
> which work fine on x86_64 machine, run very well on ARM64 with 8KB stack size.
>
> Best Regards
> Jungseok Lee
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Kind regards,
Minchan Kim

2015-05-25 16:48:23

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator

On 25 May 2015, at 13:01, Jungseok Lee <[email protected]> wrote:

>> Could the stack size be reduced to 8KB perhaps?
>
> I guess probably not.
>
> A commit, 845ad05e, says that 8KB is not enough to cover SpecWeb benchmark.

We could go back to 8KB stacks if we implement support for separate IRQ
stack on arm64. It's not too complicated, we would have to use SP0 for (kernel) threads
and SP1 for IRQ handlers.

Catalin-

2015-05-25 20:30:18

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator

On Monday 25 May 2015 19:47:15 Catalin Marinas wrote:
> On 25 May 2015, at 13:01, Jungseok Lee <[email protected]> wrote:
>
> >> Could the stack size be reduced to 8KB perhaps?
> >
> > I guess probably not.
> >
> > A commit, 845ad05e, says that 8KB is not enough to cover SpecWeb benchmark.
>
> We could go back to 8KB stacks if we implement support for separate IRQ
> stack on arm64. It's not too complicated, we would have to use SP0 for (kernel) threads
> and SP1 for IRQ handlers.

I think most architectures that see a lot of benchmarks have moved to
irqstacks at some point, that definitely sounds like a useful idea,
even if the implementation turns out to be a bit more tricky than
what you describe.

There are a lot of workloads that would benefit from having lower
per-thread memory cost.

Arnd

2015-05-25 21:20:57

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator

On Monday 25 May 2015 19:01:33 Jungseok Lee wrote:
> On May 25, 2015, at 2:49 AM, Arnd Bergmann wrote:

> > Could the stack size be reduced to 8KB perhaps?
>
> I guess probably not.
>
> A commit, 845ad05e, says that 8KB is not enough to cover SpecWeb benchmark.
> The stack size is 16KB on x86_64. I am not sure whether all applications,
> which work fine on x86_64 machine, run very well on ARM64 with 8KB stack size.

A more interesting explanation is probably the one that led to x86
adopting 16kb pages, see https://lwn.net/Articles/600644 ,
https://lwn.net/Articles/600649/ and http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6538b8ea88

Based on that, I have to agree that lowering the size back to 8kb seems
unrealistic at the moment, but at the same time, we need to actively
keep working on reducing the stack size. IRQ stacks are one important
point here, but in the trace that Minchan Kim has in the commit for x86-64
there is not even an interrupt.

I've looked at the code that is generated by aarch64-linux-gcc-4.8 now
and found that it does not try very hard to reduce the stack size,
the first function I fed it end up up using 48 bytes of stack where
half of that would suffice, so there is probably room for optimization
within the boundaries of the ABI, possibly more if we allow kernel
specific calling conventions as some other architectures have implemented
in the past.

Arnd

2015-05-25 22:36:34

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator

On 25 May 2015, at 23:29, Arnd Bergmann <[email protected]> wrote:
> On Monday 25 May 2015 19:47:15 Catalin Marinas wrote:
>> On 25 May 2015, at 13:01, Jungseok Lee <[email protected]> wrote:
>>>> Could the stack size be reduced to 8KB perhaps?
>>>
>>> I guess probably not.
>>>
>>> A commit, 845ad05e, says that 8KB is not enough to cover SpecWeb benchmark.
>>
>> We could go back to 8KB stacks if we implement support for separate IRQ
>> stack on arm64. It's not too complicated, we would have to use SP0 for (kernel) threads
>> and SP1 for IRQ handlers.
>
> I think most architectures that see a lot of benchmarks have moved to
> irqstacks at some point, that definitely sounds like a useful idea,
> even if the implementation turns out to be a bit more tricky than
> what you describe.

Of course, it's more complicated than just setting up two stacks (but I'm away for a
week and writing from a phone). We would need to deal with the initial per-CPU setup,
rescheduling following an IRQ, CPU on following power management and maybe
other issues. However, the architecture helps us a bit by allowing both SP0 and SP1 to be
used at EL1.

> There are a lot of workloads that would benefit from having lower
> per-thread memory cost.

If we keep the 16KB stack, is there any advantage in a separate IRQ one (assuming
that we won't overflow 16KB)?

Catalin-

2015-05-26 02:52:46

by yalin wang

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator

2015-05-25 0:02 GMT+08:00 Jungseok Lee <[email protected]>:
> Fork-routine sometimes fails to get a physically contiguous region for
> thread_info on 4KB page system although free memory is enough. That is,
> a physically contiguous region, which is currently 16KB, is not available
> since system memory is fragmented.
>
> This patch tries to solve the problem as allocating thread_info memory
> from vmalloc space, not 1:1 mapping one. The downside is one additional
> page allocation in case of vmalloc. However, vmalloc space is large enough,
> around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
> not a big tradeoff for fork-routine service.
>
> Suggested-by: Sungjinn Chung <[email protected]>
> Signed-off-by: Jungseok Lee <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> arch/arm64/Kconfig | 12 ++++++++++++
> arch/arm64/include/asm/thread_info.h | 9 +++++++++
> arch/arm64/kernel/process.c | 7 +++++++
> 3 files changed, 28 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 99930cf..93c236a 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -536,6 +536,18 @@ config ARCH_SELECT_MEMORY_MODEL
> config HAVE_ARCH_PFN_VALID
> def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM
>
> +config ARCH_THREAD_INFO_ALLOCATOR
> + bool "Enable vmalloc based thread_info allocator (EXPERIMENTAL)"
> + depends on ARM64_4K_PAGES
> + default n
> + help
> + This feature enables vmalloc based thread_info allocator. It
> + prevents fork-routine from begin failed to obtain physically
> + contiguour region due to memory fragmentation on low system
> + memory platforms.
> +
> + If unsure, say N
> +
> config HW_PERF_EVENTS
> bool "Enable hardware performance counter support for perf events"
> depends on PERF_EVENTS
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index dcd06d1..e753e59 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -61,6 +61,15 @@ struct thread_info {
> #define init_thread_info (init_thread_union.thread_info)
> #define init_stack (init_thread_union.stack)
>
> +#ifdef CONFIG_ARCH_THREAD_INFO_ALLOCATOR
> +#define alloc_thread_info_node(tsk, node) \
> +({ \
> + __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE, VMALLOC_START, \
> + VMALLOC_END, GFP_KERNEL, PAGE_KERNEL, 0, \
> + NUMA_NO_NODE, __builtin_return_address(0)); \
> +})
why not add __GFP_HIGHMEM, if you decided to use vmalloc() alloc stack pages?

Thanks

2015-05-26 14:37:25

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator

On Tuesday 26 May 2015 01:36:29 Catalin Marinas wrote:
>
> > There are a lot of workloads that would benefit from having lower
> > per-thread memory cost.
>
> If we keep the 16KB stack, is there any advantage in a separate IRQ one (assuming
> that we won't overflow 16KB)?

It makes possible errors more reproducible: we already know that we need over
8kb for normal stacks based on Minchan's findings, and the chance that an interrupt
happens at a time when the stack is the highest is very low, but that just makes
the bug much harder to find if you ever run into it.

If we overflow the stack (independent of its size) with a process stack by itself,
it will always happen in the same call chain, not a combination of a call chain
an a particularly bad interrupt.

Arnd

2015-05-26 12:28:13

by Jungseok Lee

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator

On May 25, 2015, at 11:40 PM, Minchan Kim wrote:
> Hello Jungseok,

Hi, Minchan,

> On Mon, May 25, 2015 at 01:02:20AM +0900, Jungseok Lee wrote:
>> Fork-routine sometimes fails to get a physically contiguous region for
>> thread_info on 4KB page system although free memory is enough. That is,
>> a physically contiguous region, which is currently 16KB, is not available
>> since system memory is fragmented.
>
> Order less than PAGE_ALLOC_COSTLY_ORDER should not fail in current
> mm implementation. If you saw the order-2,3 high-order allocation fail
> maybe your application received SIGKILL by someone. LMK?

Exactly right. The allocation is failed via the following path.

if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
goto nopage;

IMHO, a reclaim operation would be not needed in this context if memory is
allocated from vmalloc space. It means there is no need to traverse shrinker list.

>> This patch tries to solve the problem as allocating thread_info memory
>> from vmalloc space, not 1:1 mapping one. The downside is one additional
>> page allocation in case of vmalloc. However, vmalloc space is large enough,
>
> The size you want to allocate is 16KB in here but additional 4K?
> It increases 25% memory footprint, which is huge downside.

I agree with the point, and most people who try to use vmalloc might know the number.
However, an interoperation on the number depends on a point of view.

Vmalloc is large enough and not fully utilized in case of ARM64.
With the considerations, there is a room to do math as follows.

4KB / 240GB = 1.5e-8 (4KB page + 3 level combo)

It would be not a huge downside if fork-routine is not damaged due to fragmentation.

However, this is one of reasons to add "RFC" prefix in the patch set. How is the
additional 4KB interpreted and considered?

Best Regards
Jungseok Lee-

2015-05-26 12:44:00

by Jungseok Lee

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator

On May 25, 2015, at 11:58 PM, Minchan Kim wrote:
> On Mon, May 25, 2015 at 07:01:33PM +0900, Jungseok Lee wrote:
>> On May 25, 2015, at 2:49 AM, Arnd Bergmann wrote:
>>> On Monday 25 May 2015 01:02:20 Jungseok Lee wrote:
>>>> Fork-routine sometimes fails to get a physically contiguous region for
>>>> thread_info on 4KB page system although free memory is enough. That is,
>>>> a physically contiguous region, which is currently 16KB, is not available
>>>> since system memory is fragmented.
>>>>
>>>> This patch tries to solve the problem as allocating thread_info memory
>>>> from vmalloc space, not 1:1 mapping one. The downside is one additional
>>>> page allocation in case of vmalloc. However, vmalloc space is large enough,
>>>> around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
>>>> not a big tradeoff for fork-routine service.
>>>
>>> vmalloc has a rather large runtime cost. I'd argue that failing to allocate
>>> thread_info structures means something has gone very wrong.
>>
>> That is why the feature is marked "N" by default.
>> I focused on fork-routine stability rather than performance.
>
> If VM has trouble with order-2 allocation, your system would be
> trouble soon although fork at the moment manages to be successful
> because such small high-order(ex, order <= PAGE_ALLOC_COSTLY_ORDER)
> allocation is common in the kernel so VM should handle it smoothly.
> If VM didn't, it means we should fix VM itself, not a specific
> allocation site. Fork is one of victim by that.

A problem I observed is an user space, not a kernel side. As user applications
fail to create threads in order to distribute their jobs properly, they are getting
in trouble slowly and then gone.

Yes, fork is one of victim, but damages user applications seriously.
At this snapshot, free memory is enough.

>> Could you give me an idea how to evaluate performance degradation?
>> Running some benchmarks would be helpful, but I would like to try to
>> gather data based on meaningful methodology.
>>
>>> Can you describe the scenario that leads to fragmentation this bad?
>>
>> Android, but I could not describe an exact reproduction procedure step
>> by step since it's behaved and reproduced randomly. As reading the following
>> thread from mm mailing list, a similar symptom is observed on other systems.
>>
>> https://lkml.org/lkml/2015/4/28/59
>>
>> Although I do not know the details of a system mentioned in the thread,
>> even order-2 page allocation is not smoothly operated due to fragmentation on
>> low memory system.
>
> What Joonsoo have tackle is generic fragmentation problem, not *a* fork fail,
> which is more right approach to handle small high-order allocation problem.

I totally agree with that point. One of the best ways is to figure out a generic
anti-fragmentation with VM system improvement. Reducing the stack size to 8KB is also
a really great approach. My intention is not to overlook them or figure out a workaround.

IMHO, vmalloc would be a different option in case of ARM64 on low memory systems since
*fork failure from fragmentation* is a nontrivial issue.

Do you think the patch set doesn't need to be considered?

Best Regards
Jungseok Lee-

2015-05-26 12:30:22

by Jungseok Lee

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator

On May 26, 2015, at 11:52 AM, yalin wang wrote:
> 2015-05-25 0:02 GMT+08:00 Jungseok Lee <[email protected]>:
>> Fork-routine sometimes fails to get a physically contiguous region for
>> thread_info on 4KB page system although free memory is enough. That is,
>> a physically contiguous region, which is currently 16KB, is not available
>> since system memory is fragmented.
>>
>> This patch tries to solve the problem as allocating thread_info memory
>> from vmalloc space, not 1:1 mapping one. The downside is one additional
>> page allocation in case of vmalloc. However, vmalloc space is large enough,
>> around 240GB, under a combination of 39-bit VA and 4KB page. Thus, it is
>> not a big tradeoff for fork-routine service.
>>
>> Suggested-by: Sungjinn Chung <[email protected]>
>> Signed-off-by: Jungseok Lee <[email protected]>
>> Cc: Catalin Marinas <[email protected]>
>> Cc: Will Deacon <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> ---
>> arch/arm64/Kconfig | 12 ++++++++++++
>> arch/arm64/include/asm/thread_info.h | 9 +++++++++
>> arch/arm64/kernel/process.c | 7 +++++++
>> 3 files changed, 28 insertions(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 99930cf..93c236a 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -536,6 +536,18 @@ config ARCH_SELECT_MEMORY_MODEL
>> config HAVE_ARCH_PFN_VALID
>> def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM
>>
>> +config ARCH_THREAD_INFO_ALLOCATOR
>> + bool "Enable vmalloc based thread_info allocator (EXPERIMENTAL)"
>> + depends on ARM64_4K_PAGES
>> + default n
>> + help
>> + This feature enables vmalloc based thread_info allocator. It
>> + prevents fork-routine from begin failed to obtain physically
>> + contiguour region due to memory fragmentation on low system
>> + memory platforms.
>> +
>> + If unsure, say N
>> +
>> config HW_PERF_EVENTS
>> bool "Enable hardware performance counter support for perf events"
>> depends on PERF_EVENTS
>> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
>> index dcd06d1..e753e59 100644
>> --- a/arch/arm64/include/asm/thread_info.h
>> +++ b/arch/arm64/include/asm/thread_info.h
>> @@ -61,6 +61,15 @@ struct thread_info {
>> #define init_thread_info (init_thread_union.thread_info)
>> #define init_stack (init_thread_union.stack)
>>
>> +#ifdef CONFIG_ARCH_THREAD_INFO_ALLOCATOR
>> +#define alloc_thread_info_node(tsk, node) \
>> +({ \
>> + __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE, VMALLOC_START, \
>> + VMALLOC_END, GFP_KERNEL, PAGE_KERNEL, 0, \
>> + NUMA_NO_NODE, __builtin_return_address(0)); \
>> +})
> why not add __GFP_HIGHMEM, if you decided to use vmalloc() alloc stack pages?

I do not add the flag since there is no high memory on a current ARM64 kernel.

It would be helpful to review include/linux/gfp.h and the following code snippet
from arch/arm64/kernel/module.c.

void *module_alloc(unsigned long size)
{
return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
NUMA_NO_NODE, __builtin_return_address(0));
}

Best Regards
Jungseok Lee

2015-05-26 13:04:53

by Jungseok Lee

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] arm64: Implement vmalloc based thread_info allocator

On May 26, 2015, at 1:47 AM, Catalin Marinas wrote:
> On 25 May 2015, at 13:01, Jungseok Lee <[email protected]> wrote:
>
>>> Could the stack size be reduced to 8KB perhaps?
>>
>> I guess probably not.
>>
>> A commit, 845ad05e, says that 8KB is not enough to cover SpecWeb benchmark.
>
> We could go back to 8KB stacks if we implement support for separate IRQ
> stack on arm64. It's not too complicated, we would have to use SP0 for (kernel) threads
> and SP1 for IRQ handlers.

Definitely interesting.

It looks like there are two options based on discussion.
1) Reduce the stack size with separate IRQ stack scheme
2) Figure out a generic anti-fragmentation solution

Do I miss anything?

I am still not sure about the first scheme as reviewing Minchan's findings repeatedly,
but I agree that the item should be worked actively.

Best Regards
Jungseok Lee-