On Mon, Mar 7, 2022 at 3:08 PM Jisheng Zhang <[email protected]> wrote:
>
> Currently, IRQs are still handled on the kernel stack of the current
> task on riscv platforms. If the task has a deep call stack at the time
> of interrupt, and handling the interrupt also requires a deep stack,
> it's possible to see stack overflow.
>
> Before this patch, the stack_max_size of a v5.17-rc1 kernel running on
> a lichee RV board gave:
> ~ # cat /sys/kernel/debug/tracing/stack_max_size
> 3736
>
> After this patch,
> ~ # cat /sys/kernel/debug/tracing/stack_max_size
> 3176
>
> We reduce the max kernel stack usage by 560 bytes!
>
> From another side, after this patch, it's possible to reduce the
> THREAD_SIZE to 8KB for RV64 platforms. This is especially useful for
> those systems with small memory size, e.g the Allwinner D1S platform
> which is RV64 but only has 64MB DDR.
>
> Signed-off-by: Jisheng Zhang <[email protected]>
Very nice!
> diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> index ed29e9c8f660..57c9b64e16a5 100644
> --- a/arch/riscv/kernel/entry.S
> +++ b/arch/riscv/kernel/entry.S
> @@ -126,12 +126,39 @@ skip_context_tracking:
> */
> bge s4, zero, 1f
>
> - la ra, ret_from_exception
> + /* preserve the sp */
> + move s0, sp
>
> - /* Handle interrupts */
> move a0, sp /* pt_regs */
> +
> + /*
> + * Compare sp with the base of the task stack.
> + * If the top ~(THREAD_SIZE - 1) bits match, we are on a task stack,
> + * and should switch to the irq stack.
> + */
> + REG_L t0, TASK_STACK(tp)
> + xor t0, t0, s0
> + li t1, ~(THREAD_SIZE - 1)
> + and t0, t0, t1
> + bnez t0, 2f
> +
> + la t1, irq_stack
> + REG_L t2, TASK_TI_CPU(tp)
> + slli t2, t2, RISCV_LGPTR
> + add t1, t1, t2
> + REG_L t2, 0(t1)
> + li t1, IRQ_STACK_SIZE
> + /* switch to the irq stack */
> + add sp, t2, t1
> +
> +2:
What is the benefit of doing this in assembler? Is it measurably faster?
I see that arm64 does the same thing in C code, and it would be best to
have a common implementation for doing this, in terms of maintainability.
> +
> + for_each_possible_cpu(cpu) {
> +#ifdef CONFIG_VMAP_STACK
> + void *s = __vmalloc_node(IRQ_STACK_SIZE, THREAD_ALIGN,
> + THREADINFO_GFP, cpu_to_node(cpu),
> + __builtin_return_address(0));
> +#else
> + void *s = (void *)__get_free_pages(GFP_KERNEL, get_order(IRQ_STACK_SIZE));
> +#endif
On a related topic: is there a reason to still keep the non-VMAP_STACK
code path around? I see that it currently is optional for 64-bit with MMU,
but not available otherwise. The benefits should still outweigh the downside
(virtual address space usage mainly) on 32-bit, especially when this allows
a common implementation. Not sure about NOMMU, but I would guess
that it's not a big issue to use the same code there as well, since nommu
vmalloc just turns into a kmalloc as well.
Arnd
On Mon, Mar 07, 2022 at 08:19:35PM +0100, Arnd Bergmann wrote:
> On Mon, Mar 7, 2022 at 3:08 PM Jisheng Zhang <[email protected]> wrote:
> >
> > Currently, IRQs are still handled on the kernel stack of the current
> > task on riscv platforms. If the task has a deep call stack at the time
> > of interrupt, and handling the interrupt also requires a deep stack,
> > it's possible to see stack overflow.
> >
> > Before this patch, the stack_max_size of a v5.17-rc1 kernel running on
> > a lichee RV board gave:
> > ~ # cat /sys/kernel/debug/tracing/stack_max_size
> > 3736
> >
> > After this patch,
> > ~ # cat /sys/kernel/debug/tracing/stack_max_size
> > 3176
> >
> > We reduce the max kernel stack usage by 560 bytes!
> >
> > From another side, after this patch, it's possible to reduce the
> > THREAD_SIZE to 8KB for RV64 platforms. This is especially useful for
> > those systems with small memory size, e.g the Allwinner D1S platform
> > which is RV64 but only has 64MB DDR.
> >
> > Signed-off-by: Jisheng Zhang <[email protected]>
>
> Very nice!
>
> > diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> > index ed29e9c8f660..57c9b64e16a5 100644
> > --- a/arch/riscv/kernel/entry.S
> > +++ b/arch/riscv/kernel/entry.S
> > @@ -126,12 +126,39 @@ skip_context_tracking:
> > */
> > bge s4, zero, 1f
> >
> > - la ra, ret_from_exception
> > + /* preserve the sp */
> > + move s0, sp
> >
> > - /* Handle interrupts */
> > move a0, sp /* pt_regs */
> > +
> > + /*
> > + * Compare sp with the base of the task stack.
> > + * If the top ~(THREAD_SIZE - 1) bits match, we are on a task stack,
> > + * and should switch to the irq stack.
> > + */
> > + REG_L t0, TASK_STACK(tp)
> > + xor t0, t0, s0
> > + li t1, ~(THREAD_SIZE - 1)
> > + and t0, t0, t1
> > + bnez t0, 2f
> > +
> > + la t1, irq_stack
> > + REG_L t2, TASK_TI_CPU(tp)
> > + slli t2, t2, RISCV_LGPTR
> > + add t1, t1, t2
> > + REG_L t2, 0(t1)
> > + li t1, IRQ_STACK_SIZE
> > + /* switch to the irq stack */
> > + add sp, t2, t1
> > +
> > +2:
>
> What is the benefit of doing this in assembler? Is it measurably faster?
>
> I see that arm64 does the same thing in C code, and it would be best to
> have a common implementation for doing this, in terms of maintainability.
>
Hi Arnd,
Sorry for delay. The assembler code is mainly to cal the stack ptr then
change the SP to use the stack, which equals to arm64 call_on_irq_stack()
which is implemented in assembler too.
> > +
> > + for_each_possible_cpu(cpu) {
> > +#ifdef CONFIG_VMAP_STACK
> > + void *s = __vmalloc_node(IRQ_STACK_SIZE, THREAD_ALIGN,
> > + THREADINFO_GFP, cpu_to_node(cpu),
> > + __builtin_return_address(0));
> > +#else
> > + void *s = (void *)__get_free_pages(GFP_KERNEL, get_order(IRQ_STACK_SIZE));
> > +#endif
>
> On a related topic: is there a reason to still keep the non-VMAP_STACK
irq stack is 16KB on RV64 now, vmalloc doesn't gurantee physical
continuous pages, I want to keep the stack physical continuous
characteristic for !VMAP_STACK case.
Thanks
> code path around? I see that it currently is optional for 64-bit with MMU,
> but not available otherwise. The benefits should still outweigh the downside
> (virtual address space usage mainly) on 32-bit, especially when this allows
> a common implementation. Not sure about NOMMU, but I would guess
> that it's not a big issue to use the same code there as well, since nommu
> vmalloc just turns into a kmalloc as well.
>
On Sun, May 15, 2022 at 7:14 AM Jisheng Zhang <[email protected]> wrote:
> On Mon, Mar 07, 2022 at 08:19:35PM +0100, Arnd Bergmann wrote:
> > On Mon, Mar 7, 2022 at 3:08 PM Jisheng Zhang <[email protected]> wrote:
> > > +2:
> >
> > What is the benefit of doing this in assembler? Is it measurably faster?
> >
> > I see that arm64 does the same thing in C code, and it would be best to
> > have a common implementation for doing this, in terms of maintainability.
> >
>
> Sorry for delay. The assembler code is mainly to cal the stack ptr then
> change the SP to use the stack, which equals to arm64 call_on_irq_stack()
> which is implemented in assembler too.
I understand that you need to be in asm code to switch the stack, it
just felt that the arm64 method is a bit easier to debug here.
I suppose being able to keep using generic_handle_arch_irq() is also
beneficial, so it doesn't make much difference either way.
> > > + for_each_possible_cpu(cpu) {
> > > +#ifdef CONFIG_VMAP_STACK
> > > + void *s = __vmalloc_node(IRQ_STACK_SIZE, THREAD_ALIGN,
> > > + THREADINFO_GFP, cpu_to_node(cpu),
> > > + __builtin_return_address(0));
> > > +#else
> > > + void *s = (void *)__get_free_pages(GFP_KERNEL, get_order(IRQ_STACK_SIZE));
> > > +#endif
> >
> > On a related topic: is there a reason to still keep the non-VMAP_STACK
>
> irq stack is 16KB on RV64 now, vmalloc doesn't gurantee physical
> continuous pages, I want to keep the stack physical continuous
> characteristic for !VMAP_STACK case.
I don't understand. What is the benefit of having a physically continuous
stack? If this is required for something, you could still get that with a VMAP
stack by using alloc_pages() to allocate the stack and them using vmap() to
put it into the vmalloc range with appropriate guard pages.
I think we really want to avoid the case of missing guard pages around
the stack, and eliminate the part where the stack is in the linear map.
Arnd
On Mon, May 23, 2022 at 10:16 AM Arnd Bergmann <[email protected]> wrote:
> > > > + for_each_possible_cpu(cpu) {
> > > > +#ifdef CONFIG_VMAP_STACK
> > > > + void *s = __vmalloc_node(IRQ_STACK_SIZE, THREAD_ALIGN,
> > > > + THREADINFO_GFP, cpu_to_node(cpu),
> > > > + __builtin_return_address(0));
> > > > +#else
> > > > + void *s = (void *)__get_free_pages(GFP_KERNEL, get_order(IRQ_STACK_SIZE));
> > > > +#endif
> > >
> > > On a related topic: is there a reason to still keep the non-VMAP_STACK
> >
> > irq stack is 16KB on RV64 now, vmalloc doesn't gurantee physical
> > continuous pages, I want to keep the stack physical continuous
> > characteristic for !VMAP_STACK case.
>
> I don't understand. What is the benefit of having a physically continuous
> stack? If this is required for something, you could still get that with a VMAP
> stack by using alloc_pages() to allocate the stack and them using vmap() to
> put it into the vmalloc range with appropriate guard pages.
>
> I think we really want to avoid the case of missing guard pages around
> the stack, and eliminate the part where the stack is in the linear map.
I was actually confused here and mixed up a few things: I thought this
was about whether to use vmap stacks unconditionally, and this is in
fact not even an architecture specific decision, it's a global option as you
are probably aware.
Since one can still turn off VMAP_STACK for normal thread stacks,
it doesn't make much of a difference whether one can do the same for
IRQ stacks. Please just ignore what I said above. I see you already
sent a modified v3, and I think either way is fine, feel free to revert back
to this method if it makes life easier.
Arnd