Joey recently introduced a Memory-Deny-Write-Executable (MDWE) prctl which tags
current with a flag that prevents pages that were previously not executable from
becoming executable.
This tag always gets inherited by children tasks. (it's in MMF_INIT_MASK)
At Google, we've been using a somewhat similar downstream patch for a few years
now. To make the adoption of this feature easier, we've had it support a mode in
which the W^X flag does not propagate to children. For example, this is handy if
a C process which wants W^X protection suspects it could start children
processes that would use a JIT.
I'd like to align our features with the upstream prctl. This series proposes a
new NO_INHERIT flag to the MDWE prctl to make this kind of adoption easier. It
sets a different flag in current that is not in MMF_INIT_MASK and which does not
propagate.
As part of looking into MDWE, I also fixed a couple of things in the MDWE test.
This series applies on the mm-everything-2023-05-16-23-30 tag of the mm tree:
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/
Diff since v1:
- MMF_HAS_MDWE_NO_INHERIT clears MMF_HAS_MDWE in the fork path as part of a
MMF_INIT_FLAGS macro (suggested by Catalin)
- PR_MDWE_* are defined as unsigned long rather than int (suggested by Andrey)
Florent Revest (5):
kselftest: vm: Fix tabs/spaces inconsistency in the mdwe test
kselftest: vm: Fix mdwe's mmap_FIXED test case
mm: Make PR_MDWE_REFUSE_EXEC_GAIN an unsigned long
mm: Add a NO_INHERIT flag to the PR_SET_MDWE prctl
kselftest: vm: Add tests for no-inherit memory-deny-write-execute
include/linux/sched/coredump.h | 10 +++
include/uapi/linux/prctl.h | 3 +-
kernel/fork.c | 2 +-
kernel/sys.c | 24 +++++-
tools/include/uapi/linux/prctl.h | 3 +-
tools/testing/selftests/mm/mdwe_test.c | 110 +++++++++++++++++++++----
6 files changed, 131 insertions(+), 21 deletions(-)
--
2.40.1.606.ga4b1b128d6-goog
Signed-off-by: Florent Revest <[email protected]>
---
tools/testing/selftests/mm/mdwe_test.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/mm/mdwe_test.c b/tools/testing/selftests/mm/mdwe_test.c
index bc91bef5d254..d0954c657feb 100644
--- a/tools/testing/selftests/mm/mdwe_test.c
+++ b/tools/testing/selftests/mm/mdwe_test.c
@@ -49,19 +49,19 @@ FIXTURE_VARIANT(mdwe)
FIXTURE_VARIANT_ADD(mdwe, stock)
{
- .enabled = false,
+ .enabled = false,
.forked = false,
};
FIXTURE_VARIANT_ADD(mdwe, enabled)
{
- .enabled = true,
+ .enabled = true,
.forked = false,
};
FIXTURE_VARIANT_ADD(mdwe, forked)
{
- .enabled = true,
+ .enabled = true,
.forked = true,
};
--
2.40.1.606.ga4b1b128d6-goog
Alexey pointed out that defining a prctl flag as an int is a footgun
because, under some circumstances, when used as a flag to prctl, it can
be casted to long with garbage upper bits which would result in
unexpected behaviors.
This patch changes the constant to a UL to eliminate these
possibilities.
Signed-off-by: Florent Revest <[email protected]>
Suggested-by: Alexey Izbyshev <[email protected]>
---
include/uapi/linux/prctl.h | 2 +-
tools/include/uapi/linux/prctl.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index f23d9a16507f..6e9af6cbc950 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -283,7 +283,7 @@ struct prctl_mm_map {
/* Memory deny write / execute */
#define PR_SET_MDWE 65
-# define PR_MDWE_REFUSE_EXEC_GAIN 1
+# define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
#define PR_GET_MDWE 66
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 759b3f53e53f..6e6563e97fef 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -283,7 +283,7 @@ struct prctl_mm_map {
/* Memory deny write / execute */
#define PR_SET_MDWE 65
-# define PR_MDWE_REFUSE_EXEC_GAIN 1
+# define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
#define PR_GET_MDWE 66
--
2.40.1.606.ga4b1b128d6-goog
Add some tests to cover the new PR_MDWE_NO_INHERIT flag of the
PR_SET_MDWE prctl.
Signed-off-by: Florent Revest <[email protected]>
---
tools/testing/selftests/mm/mdwe_test.c | 95 ++++++++++++++++++++++++--
1 file changed, 89 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/mm/mdwe_test.c b/tools/testing/selftests/mm/mdwe_test.c
index 91aa9c3099e7..9f08ed1b99ae 100644
--- a/tools/testing/selftests/mm/mdwe_test.c
+++ b/tools/testing/selftests/mm/mdwe_test.c
@@ -22,6 +22,8 @@
TEST(prctl_flags)
{
+ EXPECT_LT(prctl(PR_SET_MDWE, PR_MDWE_NO_INHERIT, 0L, 0L, 7L), 0);
+
EXPECT_LT(prctl(PR_SET_MDWE, 7L, 0L, 0L, 0L), 0);
EXPECT_LT(prctl(PR_SET_MDWE, 0L, 7L, 0L, 0L), 0);
EXPECT_LT(prctl(PR_SET_MDWE, 0L, 0L, 7L, 0L), 0);
@@ -33,6 +35,66 @@ TEST(prctl_flags)
EXPECT_LT(prctl(PR_GET_MDWE, 0L, 0L, 0L, 7L), 0);
}
+FIXTURE(consecutive_prctl_flags) {};
+FIXTURE_SETUP(consecutive_prctl_flags) {}
+FIXTURE_TEARDOWN(consecutive_prctl_flags) {}
+
+FIXTURE_VARIANT(consecutive_prctl_flags)
+{
+ unsigned long first_flags;
+ unsigned long second_flags;
+ bool should_work;
+};
+
+FIXTURE_VARIANT_ADD(consecutive_prctl_flags, same)
+{
+ .first_flags = PR_MDWE_REFUSE_EXEC_GAIN,
+ .second_flags = PR_MDWE_REFUSE_EXEC_GAIN,
+ .should_work = true,
+};
+
+FIXTURE_VARIANT_ADD(consecutive_prctl_flags, cant_disable_mdwe)
+{
+ .first_flags = PR_MDWE_REFUSE_EXEC_GAIN,
+ .second_flags = 0,
+ .should_work = false,
+};
+
+FIXTURE_VARIANT_ADD(consecutive_prctl_flags, cant_disable_mdwe_no_inherit)
+{
+ .first_flags = PR_MDWE_REFUSE_EXEC_GAIN | PR_MDWE_NO_INHERIT,
+ .second_flags = 0,
+ .should_work = false,
+};
+
+FIXTURE_VARIANT_ADD(consecutive_prctl_flags, can_lower_privileges)
+{
+ .first_flags = PR_MDWE_REFUSE_EXEC_GAIN | PR_MDWE_NO_INHERIT,
+ .second_flags = PR_MDWE_REFUSE_EXEC_GAIN,
+ .should_work = true,
+};
+
+FIXTURE_VARIANT_ADD(consecutive_prctl_flags, cant_gain_privileges)
+{
+ .first_flags = PR_MDWE_REFUSE_EXEC_GAIN,
+ .second_flags = PR_MDWE_REFUSE_EXEC_GAIN | PR_MDWE_NO_INHERIT,
+ .should_work = false,
+};
+
+TEST_F(consecutive_prctl_flags, two_prctls)
+{
+ int ret;
+
+ EXPECT_EQ(prctl(PR_SET_MDWE, variant->first_flags, 0L, 0L, 0L), 0);
+
+ ret = prctl(PR_SET_MDWE, variant->second_flags, 0L, 0L, 0L);
+ if (variant->should_work) {
+ EXPECT_EQ(ret, 0);
+ } else {
+ EXPECT_NE(ret, 0);
+ }
+}
+
FIXTURE(mdwe)
{
void *p;
@@ -45,28 +107,45 @@ FIXTURE_VARIANT(mdwe)
{
bool enabled;
bool forked;
+ bool inherit;
};
FIXTURE_VARIANT_ADD(mdwe, stock)
{
.enabled = false,
.forked = false,
+ .inherit = false,
};
FIXTURE_VARIANT_ADD(mdwe, enabled)
{
.enabled = true,
.forked = false,
+ .inherit = true,
};
-FIXTURE_VARIANT_ADD(mdwe, forked)
+FIXTURE_VARIANT_ADD(mdwe, inherited)
{
.enabled = true,
.forked = true,
+ .inherit = true,
};
+FIXTURE_VARIANT_ADD(mdwe, not_inherited)
+{
+ .enabled = true,
+ .forked = true,
+ .inherit = false,
+};
+
+static bool executable_map_should_fail(const FIXTURE_VARIANT(mdwe) *variant)
+{
+ return variant->enabled && (!variant->forked || variant->inherit);
+}
+
FIXTURE_SETUP(mdwe)
{
+ unsigned long mdwe_flags;
int ret, status;
self->p = NULL;
@@ -76,13 +155,17 @@ FIXTURE_SETUP(mdwe)
if (!variant->enabled)
return;
- ret = prctl(PR_SET_MDWE, PR_MDWE_REFUSE_EXEC_GAIN, 0L, 0L, 0L);
+ mdwe_flags = PR_MDWE_REFUSE_EXEC_GAIN;
+ if (!variant->inherit)
+ mdwe_flags |= PR_MDWE_NO_INHERIT;
+
+ ret = prctl(PR_SET_MDWE, mdwe_flags, 0L, 0L, 0L);
ASSERT_EQ(ret, 0) {
TH_LOG("PR_SET_MDWE failed or unsupported");
}
ret = prctl(PR_GET_MDWE, 0L, 0L, 0L, 0L);
- ASSERT_EQ(ret, 1);
+ ASSERT_EQ(ret, mdwe_flags);
if (variant->forked) {
self->pid = fork();
@@ -113,7 +196,7 @@ TEST_F(mdwe, mmap_READ_EXEC)
TEST_F(mdwe, mmap_WRITE_EXEC)
{
self->p = mmap(NULL, self->size, PROT_WRITE | PROT_EXEC, self->flags, 0, 0);
- if (variant->enabled) {
+ if (executable_map_should_fail(variant)) {
EXPECT_EQ(self->p, MAP_FAILED);
} else {
EXPECT_NE(self->p, MAP_FAILED);
@@ -139,7 +222,7 @@ TEST_F(mdwe, mprotect_add_EXEC)
ASSERT_NE(self->p, MAP_FAILED);
ret = mprotect(self->p, self->size, PROT_READ | PROT_EXEC);
- if (variant->enabled) {
+ if (executable_map_should_fail(variant)) {
EXPECT_LT(ret, 0);
} else {
EXPECT_EQ(ret, 0);
@@ -154,7 +237,7 @@ TEST_F(mdwe, mprotect_WRITE_EXEC)
ASSERT_NE(self->p, MAP_FAILED);
ret = mprotect(self->p, self->size, PROT_WRITE | PROT_EXEC);
- if (variant->enabled) {
+ if (executable_map_should_fail(variant)) {
EXPECT_LT(ret, 0);
} else {
EXPECT_EQ(ret, 0);
--
2.40.1.606.ga4b1b128d6-goog
On 17.05.23 17:03, Florent Revest wrote:
> Alexey pointed out that defining a prctl flag as an int is a footgun
> because, under some circumstances, when used as a flag to prctl, it can
> be casted to long with garbage upper bits which would result in
> unexpected behaviors.
>
> This patch changes the constant to a UL to eliminate these
> possibilities.
>
> Signed-off-by: Florent Revest <[email protected]>
> Suggested-by: Alexey Izbyshev <[email protected]>
> ---
> include/uapi/linux/prctl.h | 2 +-
> tools/include/uapi/linux/prctl.h | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index f23d9a16507f..6e9af6cbc950 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -283,7 +283,7 @@ struct prctl_mm_map {
>
> /* Memory deny write / execute */
> #define PR_SET_MDWE 65
> -# define PR_MDWE_REFUSE_EXEC_GAIN 1
> +# define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
>
> #define PR_GET_MDWE 66
>
> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
> index 759b3f53e53f..6e6563e97fef 100644
> --- a/tools/include/uapi/linux/prctl.h
> +++ b/tools/include/uapi/linux/prctl.h
> @@ -283,7 +283,7 @@ struct prctl_mm_map {
>
> /* Memory deny write / execute */
> #define PR_SET_MDWE 65
> -# define PR_MDWE_REFUSE_EXEC_GAIN 1
> +# define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
>
> #define PR_GET_MDWE 66
>
Both are changing existing uapi, so you'll already have existing user
space using the old values, that your kernel code has to deal with no?
--
Thanks,
David / dhildenb
On 17.05.23 17:03, Florent Revest wrote:
> Signed-off-by: Florent Revest <[email protected]>
> ---
> tools/testing/selftests/mm/mdwe_test.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/tools/testing/selftests/mm/mdwe_test.c b/tools/testing/selftests/mm/mdwe_test.c
> index bc91bef5d254..d0954c657feb 100644
> --- a/tools/testing/selftests/mm/mdwe_test.c
> +++ b/tools/testing/selftests/mm/mdwe_test.c
> @@ -49,19 +49,19 @@ FIXTURE_VARIANT(mdwe)
>
> FIXTURE_VARIANT_ADD(mdwe, stock)
> {
> - .enabled = false,
> + .enabled = false,
> .forked = false,
> };
>
> FIXTURE_VARIANT_ADD(mdwe, enabled)
> {
> - .enabled = true,
> + .enabled = true,
> .forked = false,
> };
>
> FIXTURE_VARIANT_ADD(mdwe, forked)
> {
> - .enabled = true,
> + .enabled = true,
> .forked = true,
> };
>
Reviewed-by: David Hildenbrand <[email protected]>
--
Thanks,
David / dhildenb
On 2023-05-22 11:55, David Hildenbrand wrote:
> On 17.05.23 17:03, Florent Revest wrote:
>> Alexey pointed out that defining a prctl flag as an int is a footgun
>> because, under some circumstances, when used as a flag to prctl, it
>> can
>> be casted to long with garbage upper bits which would result in
>> unexpected behaviors.
>>
>> This patch changes the constant to a UL to eliminate these
>> possibilities.
>>
>> Signed-off-by: Florent Revest <[email protected]>
>> Suggested-by: Alexey Izbyshev <[email protected]>
>> ---
>> include/uapi/linux/prctl.h | 2 +-
>> tools/include/uapi/linux/prctl.h | 2 +-
>> 2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>> index f23d9a16507f..6e9af6cbc950 100644
>> --- a/include/uapi/linux/prctl.h
>> +++ b/include/uapi/linux/prctl.h
>> @@ -283,7 +283,7 @@ struct prctl_mm_map {
>> /* Memory deny write / execute */
>> #define PR_SET_MDWE 65
>> -# define PR_MDWE_REFUSE_EXEC_GAIN 1
>> +# define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
>> #define PR_GET_MDWE 66
>> diff --git a/tools/include/uapi/linux/prctl.h
>> b/tools/include/uapi/linux/prctl.h
>> index 759b3f53e53f..6e6563e97fef 100644
>> --- a/tools/include/uapi/linux/prctl.h
>> +++ b/tools/include/uapi/linux/prctl.h
>> @@ -283,7 +283,7 @@ struct prctl_mm_map {
>> /* Memory deny write / execute */
>> #define PR_SET_MDWE 65
>> -# define PR_MDWE_REFUSE_EXEC_GAIN 1
>> +# define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
>> #define PR_GET_MDWE 66
>>
>
> Both are changing existing uapi, so you'll already have existing user
> space using the old values, that your kernel code has to deal with no?
I'm the one who suggested this change, so I feel the need to clarify.
For any existing 64-bit user space code using the kernel and the uapi
headers before this patch and doing the wrong prctl(PR_SET_MDWE,
PR_MDWE_REFUSE_EXEC_GAIN) call instead of the correct prctl(PR_SET_MDWE,
(unsigned long)PR_MDWE_REFUSE_EXEC_GAIN), there are two possibilities
when prctl() implementation extracts the second argument via va_arg(op,
unsigned long):
* It gets lucky, and the upper 32 bits of the argument are zero. The
call does what is expected by the user.
* The upper 32 bits are non-zero junk. The flags argument is rejected by
the kernel, and the call fails with EINVAL (unexpectedly for the user).
This change is intended to affect only the second case, and only after
the program is recompiled with the new uapi headers. The currently
wrong, but naturally-looking prctl(PR_SET_MDWE,
PR_MDWE_REFUSE_EXEC_GAIN) call becomes correct.
The kernel ABI is unaffected by this change, since it has been defined
in terms of unsigned long from the start.
Thanks,
Alexey
On 22.05.23 12:35, Alexey Izbyshev wrote:
> On 2023-05-22 11:55, David Hildenbrand wrote:
>> On 17.05.23 17:03, Florent Revest wrote:
>>> Alexey pointed out that defining a prctl flag as an int is a footgun
>>> because, under some circumstances, when used as a flag to prctl, it
>>> can
>>> be casted to long with garbage upper bits which would result in
>>> unexpected behaviors.
>>>
>>> This patch changes the constant to a UL to eliminate these
>>> possibilities.
>>>
>>> Signed-off-by: Florent Revest <[email protected]>
>>> Suggested-by: Alexey Izbyshev <[email protected]>
>>> ---
>>> include/uapi/linux/prctl.h | 2 +-
>>> tools/include/uapi/linux/prctl.h | 2 +-
>>> 2 files changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>>> index f23d9a16507f..6e9af6cbc950 100644
>>> --- a/include/uapi/linux/prctl.h
>>> +++ b/include/uapi/linux/prctl.h
>>> @@ -283,7 +283,7 @@ struct prctl_mm_map {
>>> /* Memory deny write / execute */
>>> #define PR_SET_MDWE 65
>>> -# define PR_MDWE_REFUSE_EXEC_GAIN 1
>>> +# define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
>>> #define PR_GET_MDWE 66
>>> diff --git a/tools/include/uapi/linux/prctl.h
>>> b/tools/include/uapi/linux/prctl.h
>>> index 759b3f53e53f..6e6563e97fef 100644
>>> --- a/tools/include/uapi/linux/prctl.h
>>> +++ b/tools/include/uapi/linux/prctl.h
>>> @@ -283,7 +283,7 @@ struct prctl_mm_map {
>>> /* Memory deny write / execute */
>>> #define PR_SET_MDWE 65
>>> -# define PR_MDWE_REFUSE_EXEC_GAIN 1
>>> +# define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
>>> #define PR_GET_MDWE 66
>>>
>>
>> Both are changing existing uapi, so you'll already have existing user
>> space using the old values, that your kernel code has to deal with no?
>
> I'm the one who suggested this change, so I feel the need to clarify.
>
> For any existing 64-bit user space code using the kernel and the uapi
> headers before this patch and doing the wrong prctl(PR_SET_MDWE,
> PR_MDWE_REFUSE_EXEC_GAIN) call instead of the correct prctl(PR_SET_MDWE,
> (unsigned long)PR_MDWE_REFUSE_EXEC_GAIN), there are two possibilities
> when prctl() implementation extracts the second argument via va_arg(op,
> unsigned long):
>
> * It gets lucky, and the upper 32 bits of the argument are zero. The
> call does what is expected by the user.
>
> * The upper 32 bits are non-zero junk. The flags argument is rejected by
> the kernel, and the call fails with EINVAL (unexpectedly for the user).
>
> This change is intended to affect only the second case, and only after
> the program is recompiled with the new uapi headers. The currently
> wrong, but naturally-looking prctl(PR_SET_MDWE,
> PR_MDWE_REFUSE_EXEC_GAIN) call becomes correct.
>
> The kernel ABI is unaffected by this change, since it has been defined
> in terms of unsigned long from the start.
The thing I'm concerned about is the following: old user space (that
would fail) on new kernel where we define some upper 32bit to actually
have a meaning (where it would succeed with wrong semantics).
IOW, can we ever really "use" these upper 32bit, or should we instead
only consume the lower 32bit in the kernel and effectively ignore the
upper 32bit?
I guess the feature is not that old, so having many existing user space
applications is unlikely.
Which raises the question if we want to tag this here with a "Fixes" and
eventually cc stable (hmm ...)?
--
Thanks,
David / dhildenb
On 2023-05-22 19:22, David Hildenbrand wrote:
> On 22.05.23 12:35, Alexey Izbyshev wrote:
>> On 2023-05-22 11:55, David Hildenbrand wrote:
>>> On 17.05.23 17:03, Florent Revest wrote:
>>>> Alexey pointed out that defining a prctl flag as an int is a footgun
>>>> because, under some circumstances, when used as a flag to prctl, it
>>>> can
>>>> be casted to long with garbage upper bits which would result in
>>>> unexpected behaviors.
>>>>
>>>> This patch changes the constant to a UL to eliminate these
>>>> possibilities.
>>>>
>>>> Signed-off-by: Florent Revest <[email protected]>
>>>> Suggested-by: Alexey Izbyshev <[email protected]>
>>>> ---
>>>> include/uapi/linux/prctl.h | 2 +-
>>>> tools/include/uapi/linux/prctl.h | 2 +-
>>>> 2 files changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>>>> index f23d9a16507f..6e9af6cbc950 100644
>>>> --- a/include/uapi/linux/prctl.h
>>>> +++ b/include/uapi/linux/prctl.h
>>>> @@ -283,7 +283,7 @@ struct prctl_mm_map {
>>>> /* Memory deny write / execute */
>>>> #define PR_SET_MDWE 65
>>>> -# define PR_MDWE_REFUSE_EXEC_GAIN 1
>>>> +# define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
>>>> #define PR_GET_MDWE 66
>>>> diff --git a/tools/include/uapi/linux/prctl.h
>>>> b/tools/include/uapi/linux/prctl.h
>>>> index 759b3f53e53f..6e6563e97fef 100644
>>>> --- a/tools/include/uapi/linux/prctl.h
>>>> +++ b/tools/include/uapi/linux/prctl.h
>>>> @@ -283,7 +283,7 @@ struct prctl_mm_map {
>>>> /* Memory deny write / execute */
>>>> #define PR_SET_MDWE 65
>>>> -# define PR_MDWE_REFUSE_EXEC_GAIN 1
>>>> +# define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
>>>> #define PR_GET_MDWE 66
>>>>
>>>
>>> Both are changing existing uapi, so you'll already have existing user
>>> space using the old values, that your kernel code has to deal with
>>> no?
>>
>> I'm the one who suggested this change, so I feel the need to clarify.
>>
>> For any existing 64-bit user space code using the kernel and the uapi
>> headers before this patch and doing the wrong prctl(PR_SET_MDWE,
>> PR_MDWE_REFUSE_EXEC_GAIN) call instead of the correct
>> prctl(PR_SET_MDWE,
>> (unsigned long)PR_MDWE_REFUSE_EXEC_GAIN), there are two possibilities
>> when prctl() implementation extracts the second argument via
>> va_arg(op,
>> unsigned long):
>>
>> * It gets lucky, and the upper 32 bits of the argument are zero. The
>> call does what is expected by the user.
>>
>> * The upper 32 bits are non-zero junk. The flags argument is rejected
>> by
>> the kernel, and the call fails with EINVAL (unexpectedly for the
>> user).
>>
>> This change is intended to affect only the second case, and only after
>> the program is recompiled with the new uapi headers. The currently
>> wrong, but naturally-looking prctl(PR_SET_MDWE,
>> PR_MDWE_REFUSE_EXEC_GAIN) call becomes correct.
>>
>> The kernel ABI is unaffected by this change, since it has been defined
>> in terms of unsigned long from the start.
>
> The thing I'm concerned about is the following: old user space (that
> would fail) on new kernel where we define some upper 32bit to actually
> have a meaning (where it would succeed with wrong semantics).
>
> IOW, can we ever really "use" these upper 32bit, or should we instead
> only consume the lower 32bit in the kernel and effectively ignore the
> upper 32bit?
>
I see, thanks. But I think this question is mostly independent from this
patch. The patch removes a footgun, but it doesn't change the flags
check in the kernel, and nothing stops the user from doing
int flags = PR_MDWE_REFUSE_EXEC_GAIN;
prctl(PR_SET_MDWE, flags);
So we have to decide whether to ignore the upper 32 bits or not even if
this patch is not applied (actually *had to* when PR_SET_MDWE op was
being added).
Possible arguments for ignoring them:
* Upper 32 bits can't be passed on 32-bit targets via the current
prctl() interface, so a change that adds meaning to them would have to
be both 64-bit-specific and unable to use another prctl() argument
instead. That seems unlikely.
* It's not hard to accidentally pass int to prctl() even after this
patch, so making technically wrong user code work as intended could be a
good thing.
* A similar footgun exists for ILP32 ABIs (e.g. x32) on a lower level:
while prctl(PR_SET_MDWE, 1) is fine there because long is 32-bit, the
syscall interface is still 64-bit, so e.g. syscall(SYS_prctl,
PR_SET_MDWE, -1) could, depending on syscall() implementation,
sign-extend -1 to 64 bits and pass 64 set bits instead of 32 to the
kernel.
Possible arguments for checking them:
* Code like "prctl(PR_SET_MDWE, 1)" is UB on 64-bit platforms. If the
compiler notices that (e.g. if somebody ever manages to build a program
and a libc together with LTO), it's allowed to make things much worse
than just passing junk. Allowing the user to detect at least some of
such calls now by checking for junk could be better.
* I have the impression that the kernel security community prefers
strict argument validation.
* PR_SET_MDWE is a new op added in 6.3, so we don't have lots of legacy
code that is known to pass junk in the upper 32 bits and must be kept
working (i.e. failing) in the same way in a potential future kernel that
assigns meaning to those bits.
My preference would be to keep checking the upper 32 bits. Florent, what
do you think?
> I guess the feature is not that old, so having many existing user
> space applications is unlikely.
>
> Which raises the question if we want to tag this here with a "Fixes"
> and eventually cc stable (hmm ...)?
Yes, IMO the faster we propagate this change, the better.
Thanks,
Alexey
On 22.05.23 20:58, Alexey Izbyshev wrote:
> On 2023-05-22 19:22, David Hildenbrand wrote:
>> On 22.05.23 12:35, Alexey Izbyshev wrote:
>>> On 2023-05-22 11:55, David Hildenbrand wrote:
>>>> On 17.05.23 17:03, Florent Revest wrote:
>>>>> Alexey pointed out that defining a prctl flag as an int is a footgun
>>>>> because, under some circumstances, when used as a flag to prctl, it
>>>>> can
>>>>> be casted to long with garbage upper bits which would result in
>>>>> unexpected behaviors.
>>>>>
>>>>> This patch changes the constant to a UL to eliminate these
>>>>> possibilities.
>>>>>
>>>>> Signed-off-by: Florent Revest <[email protected]>
>>>>> Suggested-by: Alexey Izbyshev <[email protected]>
>>>>> ---
>>>>> include/uapi/linux/prctl.h | 2 +-
>>>>> tools/include/uapi/linux/prctl.h | 2 +-
>>>>> 2 files changed, 2 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>>>>> index f23d9a16507f..6e9af6cbc950 100644
>>>>> --- a/include/uapi/linux/prctl.h
>>>>> +++ b/include/uapi/linux/prctl.h
>>>>> @@ -283,7 +283,7 @@ struct prctl_mm_map {
>>>>> /* Memory deny write / execute */
>>>>> #define PR_SET_MDWE 65
>>>>> -# define PR_MDWE_REFUSE_EXEC_GAIN 1
>>>>> +# define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
>>>>> #define PR_GET_MDWE 66
>>>>> diff --git a/tools/include/uapi/linux/prctl.h
>>>>> b/tools/include/uapi/linux/prctl.h
>>>>> index 759b3f53e53f..6e6563e97fef 100644
>>>>> --- a/tools/include/uapi/linux/prctl.h
>>>>> +++ b/tools/include/uapi/linux/prctl.h
>>>>> @@ -283,7 +283,7 @@ struct prctl_mm_map {
>>>>> /* Memory deny write / execute */
>>>>> #define PR_SET_MDWE 65
>>>>> -# define PR_MDWE_REFUSE_EXEC_GAIN 1
>>>>> +# define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
>>>>> #define PR_GET_MDWE 66
>>>>>
>>>>
>>>> Both are changing existing uapi, so you'll already have existing user
>>>> space using the old values, that your kernel code has to deal with
>>>> no?
>>>
>>> I'm the one who suggested this change, so I feel the need to clarify.
>>>
>>> For any existing 64-bit user space code using the kernel and the uapi
>>> headers before this patch and doing the wrong prctl(PR_SET_MDWE,
>>> PR_MDWE_REFUSE_EXEC_GAIN) call instead of the correct
>>> prctl(PR_SET_MDWE,
>>> (unsigned long)PR_MDWE_REFUSE_EXEC_GAIN), there are two possibilities
>>> when prctl() implementation extracts the second argument via
>>> va_arg(op,
>>> unsigned long):
>>>
>>> * It gets lucky, and the upper 32 bits of the argument are zero. The
>>> call does what is expected by the user.
>>>
>>> * The upper 32 bits are non-zero junk. The flags argument is rejected
>>> by
>>> the kernel, and the call fails with EINVAL (unexpectedly for the
>>> user).
>>>
>>> This change is intended to affect only the second case, and only after
>>> the program is recompiled with the new uapi headers. The currently
>>> wrong, but naturally-looking prctl(PR_SET_MDWE,
>>> PR_MDWE_REFUSE_EXEC_GAIN) call becomes correct.
>>>
>>> The kernel ABI is unaffected by this change, since it has been defined
>>> in terms of unsigned long from the start.
>>
>> The thing I'm concerned about is the following: old user space (that
>> would fail) on new kernel where we define some upper 32bit to actually
>> have a meaning (where it would succeed with wrong semantics).
>>
>> IOW, can we ever really "use" these upper 32bit, or should we instead
>> only consume the lower 32bit in the kernel and effectively ignore the
>> upper 32bit?
>>
> I see, thanks. But I think this question is mostly independent from this
> patch. The patch removes a footgun, but it doesn't change the flags
> check in the kernel, and nothing stops the user from doing
>
> int flags = PR_MDWE_REFUSE_EXEC_GAIN;
> prctl(PR_SET_MDWE, flags);
>
> So we have to decide whether to ignore the upper 32 bits or not even if
> this patch is not applied (actually *had to* when PR_SET_MDWE op was
> being added).
Well, an alternative to this patch would be to say "well, for this prctl
we ignore any upper 32bit. Then, this change would not be needed. Yes, I
also don't like that :)
Bu unrelated, I looked at some other random prctl.
#define SUID_DUMP_USER 1
And in kernel/sys.c:
case PR_SET_DUMPABLE:
if (arg2 != SUID_DUMP_DISABLE && arg2 != SUID_DUMP_USER)
...
Wouldn't that also suffer from the same issue, or how is this different?
Also, how is passing "0"s to e.g., PR_GET_THP_DISABLE reliable? We need
arg2 -> arg5 to be 0. But wouldn't the following also just pass a 0 "int" ?
prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0)
I'm easily confused by such (va_args) things, so sorry for the dummy
questions.
--
Thanks,
David / dhildenb
On 2023-05-23 12:12, David Hildenbrand wrote:
> On 22.05.23 20:58, Alexey Izbyshev wrote:
>> On 2023-05-22 19:22, David Hildenbrand wrote:
>>> On 22.05.23 12:35, Alexey Izbyshev wrote:
>>>> On 2023-05-22 11:55, David Hildenbrand wrote:
>>>>> On 17.05.23 17:03, Florent Revest wrote:
>>>>>> Alexey pointed out that defining a prctl flag as an int is a
>>>>>> footgun
>>>>>> because, under some circumstances, when used as a flag to prctl,
>>>>>> it
>>>>>> can
>>>>>> be casted to long with garbage upper bits which would result in
>>>>>> unexpected behaviors.
>>>>>>
>>>>>> This patch changes the constant to a UL to eliminate these
>>>>>> possibilities.
>>>>>>
>>>>>> Signed-off-by: Florent Revest <[email protected]>
>>>>>> Suggested-by: Alexey Izbyshev <[email protected]>
>>>>>> ---
>>>>>> include/uapi/linux/prctl.h | 2 +-
>>>>>> tools/include/uapi/linux/prctl.h | 2 +-
>>>>>> 2 files changed, 2 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/include/uapi/linux/prctl.h
>>>>>> b/include/uapi/linux/prctl.h
>>>>>> index f23d9a16507f..6e9af6cbc950 100644
>>>>>> --- a/include/uapi/linux/prctl.h
>>>>>> +++ b/include/uapi/linux/prctl.h
>>>>>> @@ -283,7 +283,7 @@ struct prctl_mm_map {
>>>>>> /* Memory deny write / execute */
>>>>>> #define PR_SET_MDWE 65
>>>>>> -# define PR_MDWE_REFUSE_EXEC_GAIN 1
>>>>>> +# define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
>>>>>> #define PR_GET_MDWE 66
>>>>>> diff --git a/tools/include/uapi/linux/prctl.h
>>>>>> b/tools/include/uapi/linux/prctl.h
>>>>>> index 759b3f53e53f..6e6563e97fef 100644
>>>>>> --- a/tools/include/uapi/linux/prctl.h
>>>>>> +++ b/tools/include/uapi/linux/prctl.h
>>>>>> @@ -283,7 +283,7 @@ struct prctl_mm_map {
>>>>>> /* Memory deny write / execute */
>>>>>> #define PR_SET_MDWE 65
>>>>>> -# define PR_MDWE_REFUSE_EXEC_GAIN 1
>>>>>> +# define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
>>>>>> #define PR_GET_MDWE 66
>>>>>>
>>>>>
>>>>> Both are changing existing uapi, so you'll already have existing
>>>>> user
>>>>> space using the old values, that your kernel code has to deal with
>>>>> no?
>>>>
>>>> I'm the one who suggested this change, so I feel the need to
>>>> clarify.
>>>>
>>>> For any existing 64-bit user space code using the kernel and the
>>>> uapi
>>>> headers before this patch and doing the wrong prctl(PR_SET_MDWE,
>>>> PR_MDWE_REFUSE_EXEC_GAIN) call instead of the correct
>>>> prctl(PR_SET_MDWE,
>>>> (unsigned long)PR_MDWE_REFUSE_EXEC_GAIN), there are two
>>>> possibilities
>>>> when prctl() implementation extracts the second argument via
>>>> va_arg(op,
>>>> unsigned long):
>>>>
>>>> * It gets lucky, and the upper 32 bits of the argument are zero. The
>>>> call does what is expected by the user.
>>>>
>>>> * The upper 32 bits are non-zero junk. The flags argument is
>>>> rejected
>>>> by
>>>> the kernel, and the call fails with EINVAL (unexpectedly for the
>>>> user).
>>>>
>>>> This change is intended to affect only the second case, and only
>>>> after
>>>> the program is recompiled with the new uapi headers. The currently
>>>> wrong, but naturally-looking prctl(PR_SET_MDWE,
>>>> PR_MDWE_REFUSE_EXEC_GAIN) call becomes correct.
>>>>
>>>> The kernel ABI is unaffected by this change, since it has been
>>>> defined
>>>> in terms of unsigned long from the start.
>>>
>>> The thing I'm concerned about is the following: old user space (that
>>> would fail) on new kernel where we define some upper 32bit to
>>> actually
>>> have a meaning (where it would succeed with wrong semantics).
>>>
>>> IOW, can we ever really "use" these upper 32bit, or should we instead
>>> only consume the lower 32bit in the kernel and effectively ignore the
>>> upper 32bit?
>>>
>> I see, thanks. But I think this question is mostly independent from
>> this
>> patch. The patch removes a footgun, but it doesn't change the flags
>> check in the kernel, and nothing stops the user from doing
>>
>> int flags = PR_MDWE_REFUSE_EXEC_GAIN;
>> prctl(PR_SET_MDWE, flags);
>>
>> So we have to decide whether to ignore the upper 32 bits or not even
>> if
>> this patch is not applied (actually *had to* when PR_SET_MDWE op was
>> being added).
>
> Well, an alternative to this patch would be to say "well, for this
> prctl we ignore any upper 32bit. Then, this change would not be
> needed. Yes, I also don't like that :)
>
> Bu unrelated, I looked at some other random prctl.
>
> #define SUID_DUMP_USER 1
>
> And in kernel/sys.c:
>
> case PR_SET_DUMPABLE:
> if (arg2 != SUID_DUMP_DISABLE && arg2 != SUID_DUMP_USER)
> ...
>
> Wouldn't that also suffer from the same issue, or how is this
> different?
>
Yes, it is the same issue, so e.g. prctl(PR_SET_DUMPABLE,
SUID_DUMP_DISABLE ) may wrongly fail with EINVAL on 64-bit targets.
> Also, how is passing "0"s to e.g., PR_GET_THP_DISABLE reliable? We
> need arg2 -> arg5 to be 0. But wouldn't the following also just pass a
> 0 "int" ?
>
> prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0)
>
Yes, this is not reliable on 64-bit targets too. The simplest fix is to
use "0L", as done in MDWE self-tests (but many other tests get this
wrong).
Florent also expressed surprise[1] that we don't see a lot of failures
due to such issues, and I tried to provide some reasons. To elaborate on
the x86-64 thing, for prctl(PR_SET_DUMPABLE, 0) the compiler will likely
generate "xorl %esi, %esi" to pass zero, but this instruction will also
clear the upper 32 bits of %rsi, so the problem is masked (and I believe
CPU vendors are motivated to do such zeroing to reduce false
dependencies). But this zeroing is not required by the ABI, so in a more
complex situation junk might get through.
Real-world examples of very similar breakage in variadic functions
involving NULL sentinels are mentioned in [2] (the musl bug report is
[3]). In short, musl defined NULL as plain 0 for C++, so when people do
e.g. execl("/bin/true", "true", NULL), junk might prevent detection of
the sentinel in execl() impl. (Though if the sentinel is passed via
stack because there are a lot of preceding arguments, the breakage
becomes more apparent because auto-zeroing of registers doesn't come
into play anymore.)
>
> I'm easily confused by such (va_args) things, so sorry for the dummy
> questions.
This stuff *is* confusing, and note that Linux man pages don't even tell
that prctl() is actually declared as a variadic function (and for
ptrace() this is mentioned only in the notes, but not in its signature).
Thanks,
Alexey
[1]
https://lore.kernel.org/lkml/[email protected]
[2] https://ewontfix.com/11
[3] https://www.openwall.com/lists/musl/2013/01/09/1
On Tue, May 23, 2023 at 11:12:37AM +0200, David Hildenbrand wrote:
> Also, how is passing "0"s to e.g., PR_GET_THP_DISABLE reliable? We need arg2
> -> arg5 to be 0. But wouldn't the following also just pass a 0 "int" ?
>
> prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0)
>
> I'm easily confused by such (va_args) things, so sorry for the dummy
> questions.
Isn't the prctl() prototype in the user headers defined with the first
argument as int while the rest as unsigned long? At least from the man
page:
int prctl(int option, unsigned long arg2, unsigned long arg3,
unsigned long arg4, unsigned long arg5);
So there are no va_args tricks (which confuse me as well).
Any int passed to arg[2-5] would be converted by the compiler to an
unsigned long before being passed to the kernel. So I think the change
in this patch is harmless as the conversion is happening anyway.
(well, unless I completely missed what the problem is)
--
Catalin
On 2023-05-23 16:07, Catalin Marinas wrote:
> On Tue, May 23, 2023 at 11:12:37AM +0200, David Hildenbrand wrote:
>> Also, how is passing "0"s to e.g., PR_GET_THP_DISABLE reliable? We
>> need arg2
>> -> arg5 to be 0. But wouldn't the following also just pass a 0 "int" ?
>>
>> prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0)
>>
>> I'm easily confused by such (va_args) things, so sorry for the dummy
>> questions.
>
> Isn't the prctl() prototype in the user headers defined with the first
> argument as int while the rest as unsigned long? At least from the man
> page:
>
> int prctl(int option, unsigned long arg2, unsigned long arg3,
> unsigned long arg4, unsigned long arg5);
>
> So there are no va_args tricks (which confuse me as well).
>
I have explicitly mentioned the problem with man pages in my response to
David[1]. Quoting myself:
> This stuff *is* confusing, and note that Linux man pages don't even
> tell
that prctl() is actually declared as a variadic function (and for
ptrace() this is mentioned only in the notes, but not in its signature).
The reality:
* glibc:
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/sys/prctl.h;h=821aeefc1339b35210e8918ecfe9833ed2792626;hb=glibc-2.37#l42
* musl:
https://git.musl-libc.org/cgit/musl/tree/include/sys/prctl.h?h=v1.2.4#n180
Though there is a test in the kernel that does define its own prototype,
avoiding the issue:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/sched/cs_prctl_test.c?h=v6.3#n77
Thanks,
Alexey
[1]
https://lore.kernel.org/lkml/[email protected]//lkml/[email protected]
> Any int passed to arg[2-5] would be converted by the compiler to an
> unsigned long before being passed to the kernel. So I think the change
> in this patch is harmless as the conversion is happening anyway.
>
> (well, unless I completely missed what the problem is)
On Tue, May 23, 2023 at 04:25:45PM +0300, Alexey Izbyshev wrote:
> On 2023-05-23 16:07, Catalin Marinas wrote:
> > On Tue, May 23, 2023 at 11:12:37AM +0200, David Hildenbrand wrote:
> > > Also, how is passing "0"s to e.g., PR_GET_THP_DISABLE reliable? We
> > > need arg2
> > > -> arg5 to be 0. But wouldn't the following also just pass a 0 "int" ?
> > >
> > > prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0)
> > >
> > > I'm easily confused by such (va_args) things, so sorry for the dummy
> > > questions.
> >
> > Isn't the prctl() prototype in the user headers defined with the first
> > argument as int while the rest as unsigned long? At least from the man
> > page:
> >
> > int prctl(int option, unsigned long arg2, unsigned long arg3,
> > unsigned long arg4, unsigned long arg5);
> >
> > So there are no va_args tricks (which confuse me as well).
> >
> I have explicitly mentioned the problem with man pages in my response to
> David[1]. Quoting myself:
>
> > This stuff *is* confusing, and note that Linux man pages don't even tell
> that prctl() is actually declared as a variadic function (and for
> ptrace() this is mentioned only in the notes, but not in its signature).
Ah, thanks for the clarification (I somehow missed your reply).
> The reality:
>
> * glibc: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/sys/prctl.h;h=821aeefc1339b35210e8918ecfe9833ed2792626;hb=glibc-2.37#l42
>
> * musl:
> https://git.musl-libc.org/cgit/musl/tree/include/sys/prctl.h?h=v1.2.4#n180
>
> Though there is a test in the kernel that does define its own prototype,
> avoiding the issue: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/sched/cs_prctl_test.c?h=v6.3#n77
At least for glibc, it seems that there is a conversion to unsigned
long:
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/prctl.c#l28
unsigned long int arg2 = va_arg (arg, unsigned long int);
(does va_arg expand to an actual cast?)
If the libc passes a 32-bit to a kernel ABI that expects 64-bit, I think
it's a user-space bug and not a kernel ABI issue.
--
Catalin
>> Wouldn't that also suffer from the same issue, or how is this
>> different?
>>
> Yes, it is the same issue, so e.g. prctl(PR_SET_DUMPABLE,
> SUID_DUMP_DISABLE ) may wrongly fail with EINVAL on 64-bit targets.
>
>> Also, how is passing "0"s to e.g., PR_GET_THP_DISABLE reliable? We
>> need arg2 -> arg5 to be 0. But wouldn't the following also just pass a
>> 0 "int" ?
>>
>> prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0)
>>
> Yes, this is not reliable on 64-bit targets too. The simplest fix is to
> use "0L", as done in MDWE self-tests (but many other tests get this
> wrong).
Oh, it's even worse than I thought, then. :)
Even in our selftest most of
$ git grep prctl tools/testing/selftests/ | grep "0"
gets it wrong.
>
> Florent also expressed surprise[1] that we don't see a lot of failures
> due to such issues, and I tried to provide some reasons. To elaborate on
Yes, I'm also surprised!
> the x86-64 thing, for prctl(PR_SET_DUMPABLE, 0) the compiler will likely
> generate "xorl %esi, %esi" to pass zero, but this instruction will also
> clear the upper 32 bits of %rsi, so the problem is masked (and I believe
> CPU vendors are motivated to do such zeroing to reduce false
> dependencies). But this zeroing is not required by the ABI, so in a more
> complex situation junk might get through.
:/
>
> Real-world examples of very similar breakage in variadic functions
> involving NULL sentinels are mentioned in [2] (the musl bug report is
> [3]). In short, musl defined NULL as plain 0 for C++, so when people do
> e.g. execl("/bin/true", "true", NULL), junk might prevent detection of
> the sentinel in execl() impl. (Though if the sentinel is passed via
> stack because there are a lot of preceding arguments, the breakage
> becomes more apparent because auto-zeroing of registers doesn't come
> into play anymore.)
Yes, I heard about the "fun" with NULL already. Thanks for the musl
pointer. And thanks for the confirmation/explanation.
>
>>
>> I'm easily confused by such (va_args) things, so sorry for the dummy
>> questions.
>
> This stuff *is* confusing, and note that Linux man pages don't even tell
> that prctl() is actually declared as a variadic function (and for
> ptrace() this is mentioned only in the notes, but not in its signature).
Agreed, that's easy to miss (and probably many people missed it).
Anyhow, for this patch as is (although it feels like drops in the ocean
after our discussion)
Reviewed-by: David Hildenbrand <[email protected]>
--
Thanks,
David / dhildenb
On Wed, May 17, 2023 at 05:03:19PM +0200, Florent Revest wrote:
> Alexey pointed out that defining a prctl flag as an int is a footgun
> because, under some circumstances, when used as a flag to prctl, it can
> be casted to long with garbage upper bits which would result in
> unexpected behaviors.
>
> This patch changes the constant to a UL to eliminate these
> possibilities.
>
> Signed-off-by: Florent Revest <[email protected]>
> Suggested-by: Alexey Izbyshev <[email protected]>
FWIW, I'm fine with this patch and I don't think it introduces an ABI
change.
Acked-by: Catalin Marinas <[email protected]>
On 2023-05-23 17:09, Catalin Marinas wrote:
> On Tue, May 23, 2023 at 04:25:45PM +0300, Alexey Izbyshev wrote:
>> On 2023-05-23 16:07, Catalin Marinas wrote:
>> > On Tue, May 23, 2023 at 11:12:37AM +0200, David Hildenbrand wrote:
>> > > Also, how is passing "0"s to e.g., PR_GET_THP_DISABLE reliable? We
>> > > need arg2
>> > > -> arg5 to be 0. But wouldn't the following also just pass a 0 "int" ?
>> > >
>> > > prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0)
>> > >
>> > > I'm easily confused by such (va_args) things, so sorry for the dummy
>> > > questions.
>> >
>> > Isn't the prctl() prototype in the user headers defined with the first
>> > argument as int while the rest as unsigned long? At least from the man
>> > page:
>> >
>> > int prctl(int option, unsigned long arg2, unsigned long arg3,
>> > unsigned long arg4, unsigned long arg5);
>> >
>> > So there are no va_args tricks (which confuse me as well).
>> >
>> I have explicitly mentioned the problem with man pages in my response
>> to
>> David[1]. Quoting myself:
>>
>> > This stuff *is* confusing, and note that Linux man pages don't even tell
>> that prctl() is actually declared as a variadic function (and for
>> ptrace() this is mentioned only in the notes, but not in its
>> signature).
>
> Ah, thanks for the clarification (I somehow missed your reply).
>
>> The reality:
>>
>> * glibc:
>> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/sys/prctl.h;h=821aeefc1339b35210e8918ecfe9833ed2792626;hb=glibc-2.37#l42
>>
>> * musl:
>> https://git.musl-libc.org/cgit/musl/tree/include/sys/prctl.h?h=v1.2.4#n180
>>
>> Though there is a test in the kernel that does define its own
>> prototype,
>> avoiding the issue:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/sched/cs_prctl_test.c?h=v6.3#n77
>
> At least for glibc, it seems that there is a conversion to unsigned
> long:
>
> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/prctl.c#l28
>
> unsigned long int arg2 = va_arg (arg, unsigned long int);
>
> (does va_arg expand to an actual cast?)
>
No, this not a conversion or a cast in the sense that I think you mean
it. What happens in the situation discussed in this thread is the
following (assuming the argument is passed via a register, which is
typical for initial variadic arguments on 64-bit targets):
* User calls prctl(op, 0) on a 64-bit target.
* The second argument is an int.
* The compiler generates code to pass an int (32 bits) via a 64-bit
register. The compiler is NOT required to clear the upper 32 bits of the
register, so they might contain arbitrary junk in a general case.
* The prctl() implementation calls va_arg(arg, unsigned long) (as in
your quote).
* The compiler extracts the full 64-bit value of the same register
(which in our case might contain junk in the upper 32 bits).
* This extracted 64-bit value is then passed to the system call.
So...
> If the libc passes a 32-bit to a kernel ABI that expects 64-bit, I
> think
> it's a user-space bug and not a kernel ABI issue.
... the problem happens not at the user/kernel boundary, but in prctl()
call/implementation in user space. But yes, it's still a user-space bug
and not a kernel ABI issue. The David's question, as I understand it,
was whether we want to keep such buggy code that happens to pass junk
failing with EINVAL in future kernels or not. If we do want to keep it
failing, we can never assign any meaning to the upper 32 bits of the
second prctl() argument for PR_SET_MDWE op.
Thanks,
Alexey
The 05/23/2023 15:09, Catalin Marinas wrote:
> At least for glibc, it seems that there is a conversion to unsigned
> long:
>
> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/prctl.c#l28
>
> unsigned long int arg2 = va_arg (arg, unsigned long int);
>
> (does va_arg expand to an actual cast?)
this is not a conversion.
at this point the argument is already corrupt since
an int was passed instead of unsigned long, usually
it's just wrong top 32bit, but in theory arg passing
for int and long could be completely different.
>
> If the libc passes a 32-bit to a kernel ABI that expects 64-bit, I think
> it's a user-space bug and not a kernel ABI issue.
this is point 3 in an earlier mail i wrote about varargs
(we ran into vararg issues during morello porting but it
affects all 64bit targets):
https://lkml.org/lkml/2022/10/31/386
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
On Mon, May 22, 2023 at 8:58 PM Alexey Izbyshev <[email protected]> wrote:
>
> My preference would be to keep checking the upper 32 bits. Florent, what
> do you think?
I don't mind. I can do this as part of v3. :)
> > Which raises the question if we want to tag this here with a "Fixes"
> > and eventually cc stable (hmm ...)?
>
> Yes, IMO the faster we propagate this change, the better.
Okay, will do
On Tue, May 23, 2023 at 4:10 PM David Hildenbrand <[email protected]> wrote:
>
> >> I'm easily confused by such (va_args) things, so sorry for the dummy
> >> questions.
> >
> > This stuff *is* confusing, and note that Linux man pages don't even tell
> > that prctl() is actually declared as a variadic function (and for
> > ptrace() this is mentioned only in the notes, but not in its signature).
>
> Agreed, that's easy to miss (and probably many people missed it).
>
>
> Anyhow, for this patch as is (although it feels like drops in the ocean
> after our discussion)
>
> Reviewed-by: David Hildenbrand <[email protected]>
Thanks everyone for the review and the exchange ! :)