2024-03-13 19:12:01

by Samuel Holland

[permalink] [raw]
Subject: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

TASK_SIZE_MAX should be set to the largest userspace address under any
runtime configuration. This optimizes the check in __access_ok(), which
no longer needs to compute the current value of TASK_SIZE. It is still
safe because addresses between TASK_SIZE and TASK_SIZE_MAX are invalid
at the hardware level.

This removes about half of the references to pgtable_l[45]_enabled.

Signed-off-by: Samuel Holland <[email protected]>
---

arch/riscv/include/asm/pgtable-64.h | 1 +
arch/riscv/include/asm/pgtable.h | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index b99bd66107a6..a677ef3c0fe2 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -17,6 +17,7 @@ extern bool pgtable_l5_enabled;
#define PGDIR_SHIFT_L4 39
#define PGDIR_SHIFT_L5 48
#define PGDIR_SIZE_L3 (_AC(1, UL) << PGDIR_SHIFT_L3)
+#define PGDIR_SIZE_L5 (_AC(1, UL) << PGDIR_SHIFT_L5)

#define PGDIR_SHIFT (pgtable_l5_enabled ? PGDIR_SHIFT_L5 : \
(pgtable_l4_enabled ? PGDIR_SHIFT_L4 : PGDIR_SHIFT_L3))
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 6066822e7396..2032f8ac5fc5 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -867,6 +867,7 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
#ifdef CONFIG_64BIT
#define TASK_SIZE_64 (PGDIR_SIZE * PTRS_PER_PGD / 2)
#define TASK_SIZE_MIN (PGDIR_SIZE_L3 * PTRS_PER_PGD / 2)
+#define TASK_SIZE_MAX (PGDIR_SIZE_L5 * PTRS_PER_PGD / 2)

#ifdef CONFIG_COMPAT
#define TASK_SIZE_32 (_AC(0x80000000, UL))
--
2.43.1



2024-03-18 20:50:35

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

Hi Samuel,

On Wed, Mar 13, 2024 at 7:00 PM Samuel Holland
<[email protected]> wrote:
>
> TASK_SIZE_MAX should be set to the largest userspace address under any
> runtime configuration. This optimizes the check in __access_ok(), which
> no longer needs to compute the current value of TASK_SIZE. It is still
> safe because addresses between TASK_SIZE and TASK_SIZE_MAX are invalid
> at the hardware level.
>
> This removes about half of the references to pgtable_l[45]_enabled.
>
> Signed-off-by: Samuel Holland <[email protected]>
> ---
>
> arch/riscv/include/asm/pgtable-64.h | 1 +
> arch/riscv/include/asm/pgtable.h | 1 +
> 2 files changed, 2 insertions(+)
>
> diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
> index b99bd66107a6..a677ef3c0fe2 100644
> --- a/arch/riscv/include/asm/pgtable-64.h
> +++ b/arch/riscv/include/asm/pgtable-64.h
> @@ -17,6 +17,7 @@ extern bool pgtable_l5_enabled;
> #define PGDIR_SHIFT_L4 39
> #define PGDIR_SHIFT_L5 48
> #define PGDIR_SIZE_L3 (_AC(1, UL) << PGDIR_SHIFT_L3)
> +#define PGDIR_SIZE_L5 (_AC(1, UL) << PGDIR_SHIFT_L5)
>
> #define PGDIR_SHIFT (pgtable_l5_enabled ? PGDIR_SHIFT_L5 : \
> (pgtable_l4_enabled ? PGDIR_SHIFT_L4 : PGDIR_SHIFT_L3))
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 6066822e7396..2032f8ac5fc5 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -867,6 +867,7 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
> #ifdef CONFIG_64BIT
> #define TASK_SIZE_64 (PGDIR_SIZE * PTRS_PER_PGD / 2)
> #define TASK_SIZE_MIN (PGDIR_SIZE_L3 * PTRS_PER_PGD / 2)
> +#define TASK_SIZE_MAX (PGDIR_SIZE_L5 * PTRS_PER_PGD / 2)
>
> #ifdef CONFIG_COMPAT
> #define TASK_SIZE_32 (_AC(0x80000000, UL))
> --
> 2.43.1
>

I think you also need to change the check in handle_page_fault() by
using TASK_SIZE_MAX instead of TASK_SIZE, otherwise the fixup can't
happen (https://elixir.bootlin.com/linux/latest/source/arch/riscv/mm/fault.c#L273).

Or I was wondering if it would not be better to do like x86 and use an
alternative, it would be more correct (even though I believe your
solution works)
https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/page_64h#L82.

Thanks,

Alex

2024-03-18 21:29:45

by Samuel Holland

[permalink] [raw]
Subject: Re: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

Hi Alex,

On 2024-03-18 3:50 PM, Alexandre Ghiti wrote:
> On Wed, Mar 13, 2024 at 7:00 PM Samuel Holland
> <[email protected]> wrote:
>>
>> TASK_SIZE_MAX should be set to the largest userspace address under any
>> runtime configuration. This optimizes the check in __access_ok(), which
>> no longer needs to compute the current value of TASK_SIZE. It is still
>> safe because addresses between TASK_SIZE and TASK_SIZE_MAX are invalid
>> at the hardware level.
>>
>> This removes about half of the references to pgtable_l[45]_enabled.
>>
>> Signed-off-by: Samuel Holland <[email protected]>
>> ---
>>
>> arch/riscv/include/asm/pgtable-64.h | 1 +
>> arch/riscv/include/asm/pgtable.h | 1 +
>> 2 files changed, 2 insertions(+)
>>
>> diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
>> index b99bd66107a6..a677ef3c0fe2 100644
>> --- a/arch/riscv/include/asm/pgtable-64.h
>> +++ b/arch/riscv/include/asm/pgtable-64.h
>> @@ -17,6 +17,7 @@ extern bool pgtable_l5_enabled;
>> #define PGDIR_SHIFT_L4 39
>> #define PGDIR_SHIFT_L5 48
>> #define PGDIR_SIZE_L3 (_AC(1, UL) << PGDIR_SHIFT_L3)
>> +#define PGDIR_SIZE_L5 (_AC(1, UL) << PGDIR_SHIFT_L5)
>>
>> #define PGDIR_SHIFT (pgtable_l5_enabled ? PGDIR_SHIFT_L5 : \
>> (pgtable_l4_enabled ? PGDIR_SHIFT_L4 : PGDIR_SHIFT_L3))
>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>> index 6066822e7396..2032f8ac5fc5 100644
>> --- a/arch/riscv/include/asm/pgtable.h
>> +++ b/arch/riscv/include/asm/pgtable.h
>> @@ -867,6 +867,7 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
>> #ifdef CONFIG_64BIT
>> #define TASK_SIZE_64 (PGDIR_SIZE * PTRS_PER_PGD / 2)
>> #define TASK_SIZE_MIN (PGDIR_SIZE_L3 * PTRS_PER_PGD / 2)
>> +#define TASK_SIZE_MAX (PGDIR_SIZE_L5 * PTRS_PER_PGD / 2)
>>
>> #ifdef CONFIG_COMPAT
>> #define TASK_SIZE_32 (_AC(0x80000000, UL))
>> --
>> 2.43.1
>>
>
> I think you also need to change the check in handle_page_fault() by
> using TASK_SIZE_MAX instead of TASK_SIZE, otherwise the fixup can't
> happen (https://elixir.bootlin.com/linux/latest/source/arch/riscv/mm/fault.c#L273).

It is not necessary to change that check in fault.c unless we expect to handle
exceptions (outside of userspace access routines) for addresses between
TASK_SIZE and TASK_SIZE_MAX. It looks like the call to fixup_exception() [added
in 416721ff05fd ("riscv, mm: Perform BPF exhandler fixup on page fault")] is
only intended to catch null pointer dereferences. So making the change wouldn't
have any functional impact, but it would still be a valid optimization.

> Or I was wondering if it would not be better to do like x86 and use an
> alternative, it would be more correct (even though I believe your
> solution works)
> https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/page_64.h#L82.

What would be the benefit of using an alternative? Any access to an address
between TASK_SIZE and TASK_SIZE_MAX is guaranteed to generate a page fault, so
the only benefit I see is returning -EFAULT slightly faster at the cost of
applying a few hundred alternatives at boot. But it's possible I'm missing
something.

Regards,
Samuel


2024-03-19 16:52:53

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

Hi  Samuel,

On 18/03/2024 22:29, Samuel Holland wrote:
> Hi Alex,
>
> On 2024-03-18 3:50 PM, Alexandre Ghiti wrote:
>> On Wed, Mar 13, 2024 at 7:00 PM Samuel Holland
>> <[email protected]> wrote:
>>> TASK_SIZE_MAX should be set to the largest userspace address under any
>>> runtime configuration. This optimizes the check in __access_ok(), which
>>> no longer needs to compute the current value of TASK_SIZE. It is still
>>> safe because addresses between TASK_SIZE and TASK_SIZE_MAX are invalid
>>> at the hardware level.
>>>
>>> This removes about half of the references to pgtable_l[45]_enabled.
>>>
>>> Signed-off-by: Samuel Holland <[email protected]>
>>> ---
>>>
>>> arch/riscv/include/asm/pgtable-64.h | 1 +
>>> arch/riscv/include/asm/pgtable.h | 1 +
>>> 2 files changed, 2 insertions(+)
>>>
>>> diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
>>> index b99bd66107a6..a677ef3c0fe2 100644
>>> --- a/arch/riscv/include/asm/pgtable-64.h
>>> +++ b/arch/riscv/include/asm/pgtable-64.h
>>> @@ -17,6 +17,7 @@ extern bool pgtable_l5_enabled;
>>> #define PGDIR_SHIFT_L4 39
>>> #define PGDIR_SHIFT_L5 48
>>> #define PGDIR_SIZE_L3 (_AC(1, UL) << PGDIR_SHIFT_L3)
>>> +#define PGDIR_SIZE_L5 (_AC(1, UL) << PGDIR_SHIFT_L5)
>>>
>>> #define PGDIR_SHIFT (pgtable_l5_enabled ? PGDIR_SHIFT_L5 : \
>>> (pgtable_l4_enabled ? PGDIR_SHIFT_L4 : PGDIR_SHIFT_L3))
>>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>>> index 6066822e7396..2032f8ac5fc5 100644
>>> --- a/arch/riscv/include/asm/pgtable.h
>>> +++ b/arch/riscv/include/asm/pgtable.h
>>> @@ -867,6 +867,7 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
>>> #ifdef CONFIG_64BIT
>>> #define TASK_SIZE_64 (PGDIR_SIZE * PTRS_PER_PGD / 2)
>>> #define TASK_SIZE_MIN (PGDIR_SIZE_L3 * PTRS_PER_PGD / 2)
>>> +#define TASK_SIZE_MAX (PGDIR_SIZE_L5 * PTRS_PER_PGD / 2)
>>>
>>> #ifdef CONFIG_COMPAT
>>> #define TASK_SIZE_32 (_AC(0x80000000, UL))
>>> --
>>> 2.43.1
>>>
>> I think you also need to change the check in handle_page_fault() by
>> using TASK_SIZE_MAX instead of TASK_SIZE, otherwise the fixup can't
>> happen (https://elixir.bootlin.com/linux/latest/source/arch/riscv/mm/fault.c#L273).
> It is not necessary to change that check in fault.c unless we expect to handle
> exceptions (outside of userspace access routines) for addresses between
> TASK_SIZE and TASK_SIZE_MAX.


Which I think could be the case if some code is only using the "new"
access_ok() without the uaccess routines (which is wrong yes, but such
code is at the origin of this check).


> It looks like the call to fixup_exception() [added
> in 416721ff05fd ("riscv, mm: Perform BPF exhandler fixup on page fault")] is
> only intended to catch null pointer dereferences. So making the change wouldn't
> have any functional impact, but it would still be a valid optimization.
>
>> Or I was wondering if it would not be better to do like x86 and use an
>> alternative, it would be more correct (even though I believe your
>> solution works)
>> https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/page_64.h#L82.
> What would be the benefit of using an alternative? Any access to an address
> between TASK_SIZE and TASK_SIZE_MAX is guaranteed to generate a page fault, so
> the only benefit I see is returning -EFAULT slightly faster at the cost of
> applying a few hundred alternatives at boot. But it's possible I'm missing
> something.


The use of alternatives allows to return right away if the buffer is
beyond the usable user address space, and it's not just "slightly
faster" for some cases (a very large buffer with only a few bytes being
beyond the limit or someone could fault-in all the user pages and fail
very late...etc). access_ok() is here to guarantee that such situations
don't happen, so actually it makes more sense to use an alternative to
avoid that.


Alex


> Regards,
> Samuel
>
>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv

2024-03-24 19:42:53

by David Laight

[permalink] [raw]
Subject: RE: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

...
> The use of alternatives allows to return right away if the buffer is
> beyond the usable user address space, and it's not just "slightly
> faster" for some cases (a very large buffer with only a few bytes being
> beyond the limit or someone could fault-in all the user pages and fail
> very late...etc). access_ok() is here to guarantee that such situations
> don't happen, so actually it makes more sense to use an alternative to
> avoid that.

Is it really worth doing ANY optimisations for the -EFAULT path?
They really don't happen.

The only fault path that matters is the one that has to page in
data from somewhere.

Provided there is a gap between the highest valid user address and the
lowest valid kernel address (may not be true on some 32bit systems)
and copy_to/from_user() do 'increasing address' copies then the
access_ok() check they do can almost certainly ignore the length.

This may be true for pretty much all access_ok() tests?
It would certainly simplify the test.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2024-03-24 22:06:27

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

On Tue, Mar 19, 2024, at 17:51, Alexandre Ghiti wrote:
> On 18/03/2024 22:29, Samuel Holland wrote:
>> On 2024-03-18 3:50 PM, Alexandre Ghiti wrote:
>>> On Wed, Mar 13, 2024 at 7:00 PM Samuel Holland
>> It looks like the call to fixup_exception() [added
>> in 416721ff05fd ("riscv, mm: Perform BPF exhandler fixup on page fault")] is
>> only intended to catch null pointer dereferences. So making the change wouldn't
>> have any functional impact, but it would still be a valid optimization.
>>
>>> Or I was wondering if it would not be better to do like x86 and use an
>>> alternative, it would be more correct (even though I believe your
>>> solution works)
>>> https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/page_64.h#L82.
>> What would be the benefit of using an alternative? Any access to an address
>> between TASK_SIZE and TASK_SIZE_MAX is guaranteed to generate a page fault, so
>> the only benefit I see is returning -EFAULT slightly faster at the cost of
>> applying a few hundred alternatives at boot. But it's possible I'm missing
>> something.
>
>
> The use of alternatives allows to return right away if the buffer is
> beyond the usable user address space, and it's not just "slightly
> faster" for some cases (a very large buffer with only a few bytes being
> beyond the limit or someone could fault-in all the user pages and fail
> very late...etc). access_ok() is here to guarantee that such situations
> don't happen, so actually it makes more sense to use an alternative to
> avoid that.

The access_ok() function really wants a compile-time constant
value for TASK_SIZE_MAX so it can do constant folding for
repeated calls inside of one function, so for configurations
with a boot-time selected TASK_SIZE_64 it's already not ideal,
with or without alternatives.

If I read the current code correctly, riscv doesn't even
have a way to build with a compile-time selected
VA_BITS/PGDIR_SIZE, which is probably a better place to
start optimizing, since this rarely needs to be selected
dynamically.

Arnd

2024-03-25 13:14:36

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

Hi Arnd,

On 24/03/2024 23:05, Arnd Bergmann wrote:
> On Tue, Mar 19, 2024, at 17:51, Alexandre Ghiti wrote:
>> On 18/03/2024 22:29, Samuel Holland wrote:
>>> On 2024-03-18 3:50 PM, Alexandre Ghiti wrote:
>>>> On Wed, Mar 13, 2024 at 7:00 PM Samuel Holland
>>> It looks like the call to fixup_exception() [added
>>> in 416721ff05fd ("riscv, mm: Perform BPF exhandler fixup on page fault")] is
>>> only intended to catch null pointer dereferences. So making the change wouldn't
>>> have any functional impact, but it would still be a valid optimization.
>>>
>>>> Or I was wondering if it would not be better to do like x86 and use an
>>>> alternative, it would be more correct (even though I believe your
>>>> solution works)
>>>> https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/page_64.h#L82.
>>> What would be the benefit of using an alternative? Any access to an address
>>> between TASK_SIZE and TASK_SIZE_MAX is guaranteed to generate a page fault, so
>>> the only benefit I see is returning -EFAULT slightly faster at the cost of
>>> applying a few hundred alternatives at boot. But it's possible I'm missing
>>> something.
>>
>> The use of alternatives allows to return right away if the buffer is
>> beyond the usable user address space, and it's not just "slightly
>> faster" for some cases (a very large buffer with only a few bytes being
>> beyond the limit or someone could fault-in all the user pages and fail
>> very late...etc). access_ok() is here to guarantee that such situations
>> don't happen, so actually it makes more sense to use an alternative to
>> avoid that.
> The access_ok() function really wants a compile-time constant
> value for TASK_SIZE_MAX so it can do constant folding for
> repeated calls inside of one function, so for configurations
> with a boot-time selected TASK_SIZE_64 it's already not ideal,
> with or without alternatives.
>
> If I read the current code correctly, riscv doesn't even
> have a way to build with a compile-time selected
> VA_BITS/PGDIR_SIZE, which is probably a better place to
> start optimizing, since this rarely needs to be selected
> dynamically.


Indeed, we do not support compile-time fixed VA_BITS! We could, but that
would only be used for custom kernels. I don't think distro kernels will
ever (?) propose 3 different kernels for sv39, sv48 and sv57 because the
cost of dynamically choosing the address space width is not big enough
to me (and the burden of maintaining 3 different kernels is).

Let me know if I'm wrong, I'd be happy to work on that.

Thanks,

Alex


>
> Arnd

2024-03-25 13:16:50

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

Hi David,

On 24/03/2024 20:42, David Laight wrote:
> ...
>> The use of alternatives allows to return right away if the buffer is
>> beyond the usable user address space, and it's not just "slightly
>> faster" for some cases (a very large buffer with only a few bytes being
>> beyond the limit or someone could fault-in all the user pages and fail
>> very late...etc). access_ok() is here to guarantee that such situations
>> don't happen, so actually it makes more sense to use an alternative to
>> avoid that.
> Is it really worth doing ANY optimisations for the -EFAULT path?
> They really don't happen.
>
> The only fault path that matters is the one that has to page in
> data from somewhere.


Which is completely avoided with a strict definition of access_ok(). I
see access_ok() as an already existing optimization of fault paths by
avoiding them entirely when they are bound to happen.

Thanks,

Alex


>
> Provided there is a gap between the highest valid user address and the
> lowest valid kernel address (may not be true on some 32bit systems)
> and copy_to/from_user() do 'increasing address' copies then the
> access_ok() check they do can almost certainly ignore the length.
>
> This may be true for pretty much all access_ok() tests?
> It would certainly simplify the test.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv

2024-03-25 14:54:03

by David Laight

[permalink] [raw]
Subject: RE: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

From: Alexandre Ghiti
> Sent: 25 March 2024 07:31
>
> Hi David,
>
> On 24/03/2024 20:42, David Laight wrote:
> > ...
> >> The use of alternatives allows to return right away if the buffer is
> >> beyond the usable user address space, and it's not just "slightly
> >> faster" for some cases (a very large buffer with only a few bytes being
> >> beyond the limit or someone could fault-in all the user pages and fail
> >> very late...etc). access_ok() is here to guarantee that such situations
> >> don't happen, so actually it makes more sense to use an alternative to
> >> avoid that.
> >
> > Is it really worth doing ANY optimisations for the -EFAULT path?
> > They really don't happen.
> >
> > The only fault path that matters is the one that has to page in
> > data from somewhere.
>
>
> Which is completely avoided with a strict definition of access_ok(). I
> see access_ok() as an already existing optimization of fault paths by
> avoiding them entirely when they are bound to happen.

No, access_ok() exists because accesses to kernel addresses don't fault.
Possibly in linux 0.01 it tried to ensure that the access was valid
(by checking the process's page tables (etc) but that that hasn't been
true for a long time.

You don't want to add a single instruction (never mind a conditional)
to access_ok() to avoid a page fault on an address that will fault.

Basically real programs don't pass bad addresses into the kernel
or access them in userspace. EFAULT and SIGSEGV are pretty fatal.
(Nothing call sbrk() from its SIGSEGV handler any more!)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2024-03-25 16:10:49

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

On Mon, Mar 25, 2024, at 08:25, Alexandre Ghiti wrote:
> On 24/03/2024 23:05, Arnd Bergmann wrote:
>> On Tue, Mar 19, 2024, at 17:51, Alexandre Ghiti wrote:
>>>
>>> The use of alternatives allows to return right away if the buffer is
>>> beyond the usable user address space, and it's not just "slightly
>>> faster" for some cases (a very large buffer with only a few bytes being
>>> beyond the limit or someone could fault-in all the user pages and fail
>>> very late...etc). access_ok() is here to guarantee that such situations
>>> don't happen, so actually it makes more sense to use an alternative to
>>> avoid that.
>> The access_ok() function really wants a compile-time constant
>> value for TASK_SIZE_MAX so it can do constant folding for
>> repeated calls inside of one function, so for configurations
>> with a boot-time selected TASK_SIZE_64 it's already not ideal,
>> with or without alternatives.
>>
>> If I read the current code correctly, riscv doesn't even
>> have a way to build with a compile-time selected
>> VA_BITS/PGDIR_SIZE, which is probably a better place to
>> start optimizing, since this rarely needs to be selected
>> dynamically.
>
>
> Indeed, we do not support compile-time fixed VA_BITS! We could, but that
> would only be used for custom kernels. I don't think distro kernels will
> ever (?) propose 3 different kernels for sv39, sv48 and sv57 because the
> cost of dynamically choosing the address space width is not big enough
> to me (and the burden of maintaining 3 different kernels is).
>
> Let me know if I'm wrong, I'd be happy to work on that.

My feeling is that in most cases, users are better off with a
compile-time default, given that the addressable memory has
a factor of 512 between each step. With sv39, I think you are
limited to having all RAM in the first 128GB of physical
address space, and each process is limited to 256GB virtual
addressing, but this is already covers pretty much anything
you want to do on small systems that run a custom kernel.

On most desktop/server/cloud distros, hardwiring sv48 is
probably sufficient if all general purpose machines support
this, and it should be enough even for commercial databases
that micro-optimize 100TB datasets through a permanent mmap(),
as well as most NUMA systems with discontiguous memory.
This adds a little cost over hardcoded sv39, but is still
faster than a boot-time sv39/sv48 config that most users
will not be aware of.

Once enterprise distros certify systems beyond a few dozen
TB of RAM, they probably need to enable the boot time
detection, until then I think the few users with gigantic
systems will probably be fine running a custom sv57
kernel. At that point, that distro can start shipping a
kernel with boot-time detected page table sizes.

Arnd

2024-03-25 17:46:25

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

On Mon, Mar 25, 2024 at 08:30:37AM +0100, Alexandre Ghiti wrote:
> Hi David,
>
> On 24/03/2024 20:42, David Laight wrote:
> > ...
> > > The use of alternatives allows to return right away if the buffer is
> > > beyond the usable user address space, and it's not just "slightly
> > > faster" for some cases (a very large buffer with only a few bytes being
> > > beyond the limit or someone could fault-in all the user pages and fail
> > > very late...etc). access_ok() is here to guarantee that such situations
> > > don't happen, so actually it makes more sense to use an alternative to
> > > avoid that.
> > Is it really worth doing ANY optimisations for the -EFAULT path?
> > They really don't happen.
> >
> > The only fault path that matters is the one that has to page in
> > data from somewhere.
>
> Which is completely avoided with a strict definition of access_ok(). I see
> access_ok() as an already existing optimization of fault paths by avoiding
> them entirely when they are bound to happen.

I think the point that David is making is that address+size pairs that'd fail
access_ok() *should* be rare, and hence it's a better trade-off to occasionally
handle faults for those if it makes the common case of successful access_ok()
smaller or faster. For any well-behaved userspace applications, access_ok()
should practically never fail, since userspace should be passing good
address+size pairs as arguments to syscalls.

Using a compile-time constant TASK_SIZE_MAX allows the compiler to generate
much better code for access_ok(), and on arm64 we use a compile-time constant
even when our page table depth can change at runtime (and when native/compat
task sizes differ). The only abosolute boundary that needs to be maintained is
that access_ok() fails for kernel addresses.

Mark.

2024-03-25 18:30:41

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

On Mon, Mar 25, 2024, at 17:39, Mark Rutland wrote:

> Using a compile-time constant TASK_SIZE_MAX allows the compiler to generate
> much better code for access_ok(), and on arm64 we use a compile-time constant
> even when our page table depth can change at runtime (and when native/compat
> task sizes differ). The only abosolute boundary that needs to be maintained is
> that access_ok() fails for kernel addresses.

As I understand, this works on arm64 and x86 because the kernel
mapping starts on negative 64-bit addresses, so the highest user
address (TASK_SIZE = 0x000fffffffffffff) is still smaller than the
lowest kernel address (PAGE_OFFSET = 0xfff0000000000000).

If an architecture ignores all the top bits of a virtual address,
the largest TASK_SIZE would be higher than the smallest (positive,
unsigned) PAGE_OFFSET, so you need TASK_SIZE_MAX to be dynamic.
It doesn't look like this is the case on riscv, but I'm not sure
about this part.

Arnd

2024-03-25 18:52:52

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

On Mon, Mar 25, 2024 at 07:02:13PM +0100, Arnd Bergmann wrote:
> On Mon, Mar 25, 2024, at 17:39, Mark Rutland wrote:
>
> > Using a compile-time constant TASK_SIZE_MAX allows the compiler to generate
> > much better code for access_ok(), and on arm64 we use a compile-time constant
> > even when our page table depth can change at runtime (and when native/compat
> > task sizes differ). The only abosolute boundary that needs to be maintained is
> > that access_ok() fails for kernel addresses.
>
> As I understand, this works on arm64 and x86 because the kernel
> mapping starts on negative 64-bit addresses, so the highest user
> address (TASK_SIZE = 0x000fffffffffffff) is still smaller than the
> lowest kernel address (PAGE_OFFSET = 0xfff0000000000000).

Yep; the highest posible user address is always below the lowest possible
kernel address, and any "non-canonical" address between the two ranges faults.
There's some fun with TBI (Top Byte Ignore) and MTE, but that only affects how
to mangle the pointer before the check, and doesn't affect the definition of
the VA boundary.

In general, so long as TASK_SIZE_MAX is <= the lowest possible kernel address
and TASK_SIZE_MAX > the highest possible user address, it all works out.

> If an architecture ignores all the top bits of a virtual address,
> the largest TASK_SIZE would be higher than the smallest (positive,
> unsigned) PAGE_OFFSET, so you need TASK_SIZE_MAX to be dynamic.

Agreed, but do we even support such architectures within Linux?

> It doesn't look like this is the case on riscv, but I'm not sure
> about this part.

It looks like riscv is in the same bucket as arm64 and x86 per:

https://www.kernel.org/doc/html/next/riscv/vm-layout.html

.. which says:

| The RISC-V privileged architecture document states that the 64bit addresses
| "must have bits 63-48 all equal to bit 47, or else a page-fault exception
| will occur.": that splits the virtual address space into 2 halves separated
| by a very big hole, the lower half is where the userspace resides, the upper
| half is where the RISC-V Linux Kernel resides.

Mark.

2024-03-25 19:20:58

by Samuel Holland

[permalink] [raw]
Subject: Re: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

On 2024-03-25 1:30 PM, Mark Rutland wrote:
> On Mon, Mar 25, 2024 at 07:02:13PM +0100, Arnd Bergmann wrote:
>> On Mon, Mar 25, 2024, at 17:39, Mark Rutland wrote:
>>
>>> Using a compile-time constant TASK_SIZE_MAX allows the compiler to generate
>>> much better code for access_ok(), and on arm64 we use a compile-time constant
>>> even when our page table depth can change at runtime (and when native/compat
>>> task sizes differ). The only abosolute boundary that needs to be maintained is
>>> that access_ok() fails for kernel addresses.
>>
>> As I understand, this works on arm64 and x86 because the kernel
>> mapping starts on negative 64-bit addresses, so the highest user
>> address (TASK_SIZE = 0x000fffffffffffff) is still smaller than the
>> lowest kernel address (PAGE_OFFSET = 0xfff0000000000000).
>
> Yep; the highest posible user address is always below the lowest possible
> kernel address, and any "non-canonical" address between the two ranges faults.
> There's some fun with TBI (Top Byte Ignore) and MTE, but that only affects how
> to mangle the pointer before the check, and doesn't affect the definition of
> the VA boundary.
>
> In general, so long as TASK_SIZE_MAX is <= the lowest possible kernel address
> and TASK_SIZE_MAX > the highest possible user address, it all works out.
>
>> If an architecture ignores all the top bits of a virtual address,
>> the largest TASK_SIZE would be higher than the smallest (positive,
>> unsigned) PAGE_OFFSET, so you need TASK_SIZE_MAX to be dynamic.
>
> Agreed, but do we even support such architectures within Linux?
>
>> It doesn't look like this is the case on riscv, but I'm not sure
>> about this part.
>
> It looks like riscv is in the same bucket as arm64 and x86 per:
>
> https://www.kernel.org/doc/html/next/riscv/vm-layout.html
>
> ... which says:
>
> | The RISC-V privileged architecture document states that the 64bit addresses
> | "must have bits 63-48 all equal to bit 47, or else a page-fault exception
> | will occur.": that splits the virtual address space into 2 halves separated
> | by a very big hole, the lower half is where the userspace resides, the upper
> | half is where the RISC-V Linux Kernel resides.

Right, and while RISC-V has a pointer masking extension[1] similar to arm64's
TBI, it will be handled[2] the same way: by sign extending the address prior to
checking against TASK_SIZE_MAX. So we maintain the property that userspace
addresses are always "positive" and kernel addresses are always "negative".

Regards,
Samuel

[1]: https://github.com/riscv/riscv-j-extension/raw/a1e68469c60/zjpm-spec.pdf
[2]:
https://lore.kernel.org/linux-riscv/[email protected]/

2024-03-25 20:15:46

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

Hi David, Mark,

On 25/03/2024 17:39, Mark Rutland wrote:
> On Mon, Mar 25, 2024 at 08:30:37AM +0100, Alexandre Ghiti wrote:
>> Hi David,
>>
>> On 24/03/2024 20:42, David Laight wrote:
>>> ...
>>>> The use of alternatives allows to return right away if the buffer is
>>>> beyond the usable user address space, and it's not just "slightly
>>>> faster" for some cases (a very large buffer with only a few bytes being
>>>> beyond the limit or someone could fault-in all the user pages and fail
>>>> very late...etc). access_ok() is here to guarantee that such situations
>>>> don't happen, so actually it makes more sense to use an alternative to
>>>> avoid that.
>>> Is it really worth doing ANY optimisations for the -EFAULT path?
>>> They really don't happen.
>>>
>>> The only fault path that matters is the one that has to page in
>>> data from somewhere.
>> Which is completely avoided with a strict definition of access_ok(). I see
>> access_ok() as an already existing optimization of fault paths by avoiding
>> them entirely when they are bound to happen.
> I think the point that David is making is that address+size pairs that'd fail
> access_ok() *should* be rare, and hence it's a better trade-off to occasionally
> handle faults for those if it makes the common case of successful access_ok()
> smaller or faster. For any well-behaved userspace applications, access_ok()
> should practically never fail, since userspace should be passing good
> address+size pairs as arguments to syscalls.
>
> Using a compile-time constant TASK_SIZE_MAX allows the compiler to generate
> much better code for access_ok(), and on arm64 we use a compile-time constant
> even when our page table depth can change at runtime (and when native/compat
> task sizes differ). The only abosolute boundary that needs to be maintained is
> that access_ok() fails for kernel addresses.


Hmm indeed I had completely misunderstood David's point, but actually
not really since I disagreed with what he actually meant :)

But I had not realized access_ok() was so performance-sensitive and also
missed the point that it was to protect the kernel more than making sure
the userspace address is correct, so I guess we are good to go with
Samuel's patch.

Thanks David and Mark,

Alex


>
> Mark.
>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv

2024-03-25 20:49:11

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

On Wed, Mar 13, 2024, at 18:59, Samuel Holland wrote:
> TASK_SIZE_MAX should be set to the largest userspace address under any
> runtime configuration. This optimizes the check in __access_ok(), which
> no longer needs to compute the current value of TASK_SIZE. It is still
> safe because addresses between TASK_SIZE and TASK_SIZE_MAX are invalid
> at the hardware level.
>
> This removes about half of the references to pgtable_l[45]_enabled.
>
> Signed-off-by: Samuel Holland <[email protected]>

Reviewed-by: Arnd Bergmann <[email protected]>

> #ifdef CONFIG_64BIT
> #define TASK_SIZE_64 (PGDIR_SIZE * PTRS_PER_PGD / 2)
> #define TASK_SIZE_MIN (PGDIR_SIZE_L3 * PTRS_PER_PGD / 2)
> +#define TASK_SIZE_MAX (PGDIR_SIZE_L5 * PTRS_PER_PGD / 2)

I see that TASK_SIZE_MIN is unused since 085e2ff9aeb0 ("efi:
libstub: Drop randomization of runtime memory map") and could
be dropped now, but it doesn't really hurt either.

Arnd

2024-03-25 23:54:46

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

On Mon, Mar 25, 2024, at 19:30, Mark Rutland wrote:
> On Mon, Mar 25, 2024 at 07:02:13PM +0100, Arnd Bergmann wrote:
>> On Mon, Mar 25, 2024, at 17:39, Mark Rutland wrote:
>
>> If an architecture ignores all the top bits of a virtual address,
>> the largest TASK_SIZE would be higher than the smallest (positive,
>> unsigned) PAGE_OFFSET, so you need TASK_SIZE_MAX to be dynamic.
>
> Agreed, but do we even support such architectures within Linux?

Apparently not.

On 32-bit architectures, you often have TASK_SIZE==PAGE_OFFSET,
but not on 64-bit -- either the top few bits in PAGE_OFFSET are
always ones, or the user and kernel page tables are completely
separate.

>> It doesn't look like this is the case on riscv, but I'm not sure
>> about this part.
>
> It looks like riscv is in the same bucket as arm64 and x86 per:
>
> https://www.kernel.org/doc/html/next/riscv/vm-layout.html
>
> ... which says:
>
> | The RISC-V privileged architecture document states that the 64bit addresses
> | "must have bits 63-48 all equal to bit 47, or else a page-fault exception
> | will occur.": that splits the virtual address space into 2 halves separated
> | by a very big hole, the lower half is where the userspace resides, the upper
> | half is where the RISC-V Linux Kernel resides.

Rihgt. I had even looked in that directory but somehow missed
the vm-layout.rst file.

Arnd

2024-03-26 10:21:19

by David Laight

[permalink] [raw]
Subject: RE: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

From: Arnd Bergmann
> Sent: 25 March 2024 20:38
>
> On Mon, Mar 25, 2024, at 19:30, Mark Rutland wrote:
> > On Mon, Mar 25, 2024 at 07:02:13PM +0100, Arnd Bergmann wrote:
> >> On Mon, Mar 25, 2024, at 17:39, Mark Rutland wrote:
> >
> >> If an architecture ignores all the top bits of a virtual address,
> >> the largest TASK_SIZE would be higher than the smallest (positive,
> >> unsigned) PAGE_OFFSET, so you need TASK_SIZE_MAX to be dynamic.
> >
> > Agreed, but do we even support such architectures within Linux?
>
> Apparently not.
>
> On 32-bit architectures, you often have TASK_SIZE==PAGE_OFFSET,
> but not on 64-bit -- either the top few bits in PAGE_OFFSET are
> always ones, or the user and kernel page tables are completely
> separate.

ISTR that arm64 uses (something like) bit 56 to select kernel
with the annoying 'feature' that the high bits can be ignored
just to complicate things.

But I also recall the people that want 'address masking' for x86-64
have been persuaded that addresses with the top bit set are invalid.
I has to be said that I'm not sure that aliasing user addresses
like that is a good idea.
If the TLB/PTE verified the masked bits it might be more use.

If bit63 selects kernel addresses (as in x86-64) there is a massive
(non-canonical address) gap before the first valid kernel address
that is larger than the user address space (and hence buffer size).
I think that lets access_ok() check ((address | size) >> 60) != 0.
Although it probably means that you don't need to test 'size' at all
(unless some code probes the last byte of the buffer).

For 32bit the user/kernel boundary is usually 0x80000000 or 0xc0000000
and there may not even be an invalid page between the two.
This does require access_ok() check the length (even for get_user()).

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2024-03-26 14:50:58

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

On Tue, Mar 26, 2024 at 10:19:28AM +0000, David Laight wrote:
> From: Arnd Bergmann
> > Sent: 25 March 2024 20:38
> >
> > On Mon, Mar 25, 2024, at 19:30, Mark Rutland wrote:
> > > On Mon, Mar 25, 2024 at 07:02:13PM +0100, Arnd Bergmann wrote:
> > >> On Mon, Mar 25, 2024, at 17:39, Mark Rutland wrote:
> > >
> > >> If an architecture ignores all the top bits of a virtual address,
> > >> the largest TASK_SIZE would be higher than the smallest (positive,
> > >> unsigned) PAGE_OFFSET, so you need TASK_SIZE_MAX to be dynamic.
> > >
> > > Agreed, but do we even support such architectures within Linux?
> >
> > Apparently not.
> >
> > On 32-bit architectures, you often have TASK_SIZE==PAGE_OFFSET,
> > but not on 64-bit -- either the top few bits in PAGE_OFFSET are
> > always ones, or the user and kernel page tables are completely
> > separate.
>
> ISTR that arm64 uses (something like) bit 56 to select kernel
> with the annoying 'feature' that the high bits can be ignored
> just to complicate things.

Yes, bit 55.

We choose our TASK_SIZE_MAX to be below 2^55, so no kernel address will pass
access_ok(), and we pre-mangle the TBI bits for userspace so they can't affect
the check and fail unexpectedly.

So it doesn't actually matter -- leave that aspect to arch code.

Mark.