Josh / PeterZ,
it turns out that clang seems to now have fixed the last known
nagging details with "asm goto" with outputs, so I'm looking at
actually trying to merge the support for that in the kernel.
The main annoyance isn't actually using "asm goto" at all, the main
annoyance is just that it will all have to be conditional on whether
the compiler supports it or not. We have the config option for that
already, but it will just end up with two copies of the code depending
on that option.
It's not a huge deal: the recent cleanups wrt the x86 uaccess code
have made the code _much_ more straightforward and legible, and I'm
not so worried about it all.
Except that when I looked at this, I realized that I really had picked
the wrong model for how exceptions are handled wrt stac/clac. In
particular user access exceptions return with stac set, so we end up
having a code pattern where the error case will also have to do the
user_access_end() to finish the STAC region.
Adding a user_access_end() to the user exception fault handler is trivial.
But the thing I'm asking you for is how nasty it would be to change
objtool to have those rules?
IOW, right now we have
if (!user_access_begin(...))
goto efault;
unsafe_get/put_user(ptr, val, label);
user_access_end();
return 0;
label:
user_access_end();
efault:
return -EFAULT;
and I'd like to make the "label" case just go to "efault", with
objtool knowing that the exception handling already did the
user_access_end().
That would end up cleaning up the flow for a number of cases.
Nasty? Trivial?
Linus
On Wed, Jul 1, 2020 at 11:22 AM Linus Torvalds
<[email protected]> wrote:
>
> Josh / PeterZ,
> it turns out that clang seems to now have fixed the last known
> nagging details with "asm goto" with outputs, so I'm looking at
> actually trying to merge the support for that in the kernel.
>
> The main annoyance isn't actually using "asm goto" at all, the main
> annoyance is just that it will all have to be conditional on whether
> the compiler supports it or not. We have the config option for that
> already, but it will just end up with two copies of the code depending
> on that option.
>
> It's not a huge deal: the recent cleanups wrt the x86 uaccess code
> have made the code _much_ more straightforward and legible, and I'm
> not so worried about it all.
>
> Except that when I looked at this, I realized that I really had picked
> the wrong model for how exceptions are handled wrt stac/clac. In
> particular user access exceptions return with stac set, so we end up
> having a code pattern where the error case will also have to do the
> user_access_end() to finish the STAC region.
>
> Adding a user_access_end() to the user exception fault handler is trivial.
>
> But the thing I'm asking you for is how nasty it would be to change
> objtool to have those rules?
>
> IOW, right now we have
>
> if (!user_access_begin(...))
> goto efault;
> unsafe_get/put_user(ptr, val, label);
> user_access_end();
> return 0;
>
> label:
> user_access_end();
> efault:
> return -EFAULT;
>
> and I'd like to make the "label" case just go to "efault", with
> objtool knowing that the exception handling already did the
> user_access_end().
Do we really want the exception handling to do the CLAC? Having
unsafe_get_user() do CLAC seems surprising to me, and it will break
use cases like:
if (!user_access_begin(...))
goto out;
ret = unsafe_get_user(...);
user_access_end();
check ret;
I have no problem with a special ex_handler_uaccess_clac, but I don't
think it should be the default.
--Andy
On Wed, Jul 01, 2020 at 11:22:01AM -0700, Linus Torvalds wrote:
> Josh / PeterZ,
> it turns out that clang seems to now have fixed the last known
> nagging details with "asm goto" with outputs, so I'm looking at
> actually trying to merge the support for that in the kernel.
>
> The main annoyance isn't actually using "asm goto" at all, the main
> annoyance is just that it will all have to be conditional on whether
> the compiler supports it or not. We have the config option for that
> already, but it will just end up with two copies of the code depending
> on that option.
>
> It's not a huge deal: the recent cleanups wrt the x86 uaccess code
> have made the code _much_ more straightforward and legible, and I'm
> not so worried about it all.
>
> Except that when I looked at this, I realized that I really had picked
> the wrong model for how exceptions are handled wrt stac/clac. In
> particular user access exceptions return with stac set, so we end up
> having a code pattern where the error case will also have to do the
> user_access_end() to finish the STAC region.
>
> Adding a user_access_end() to the user exception fault handler is trivial.
>
> But the thing I'm asking you for is how nasty it would be to change
> objtool to have those rules?
>
> IOW, right now we have
>
> if (!user_access_begin(...))
> goto efault;
> unsafe_get/put_user(ptr, val, label);
> user_access_end();
> return 0;
>
> label:
> user_access_end();
> efault:
> return -EFAULT;
>
> and I'd like to make the "label" case just go to "efault", with
> objtool knowing that the exception handling already did the
> user_access_end().
>
> That would end up cleaning up the flow for a number of cases.
>
> Nasty? Trivial?
Rather nasty for ppc; they have separate user_read_access_end() and
user_write_access_end().
On Wed, Jul 1, 2020 at 11:41 AM Al Viro <[email protected]> wrote:
>
> Rather nasty for ppc; they have separate user_read_access_end() and
> user_write_access_end().
That's actually for the access granting. Shutting the access down ends
up always doing the same thing anyway..
Linus
On Wed, Jul 1, 2020 at 11:29 AM Andy Lutomirski <[email protected]> wrote:
>
> Do we really want the exception handling to do the CLAC? Having
> unsafe_get_user() do CLAC seems surprising to me, and it will break
> use cases like:
>
> if (!user_access_begin(...))
> goto out;
>
> ret = unsafe_get_user(...);
>
> user_access_end();
>
> check ret;
That's not how unsafe_get_user() works.
unsafe_get_user() always jumps to the error label, it never returns a
value. So the code is actually not what you claim above, but
if (!user_access_begin(...))
goto out;
unsafe_get_user(..., out_fault);
user_access_end();
.. this is good, use the value we got..
out_fault:
user_access_end();
out:
return -EFAULT;
And that's _exactly_ why I'd like to make the change: because that
means that the error label for a user access failure is exactly the
same as the error label for the "user_access_begin()" failure.
So with my suggestion, that "out_fault" label and extra
user_access_end() would go away, and only "out" would remain.
So my suggestion actually simplifies the use cases, and your example
was literally an argument _for_ the change, not against it.
That's not why I started doing it, though. The real simplification is
inside the low-level implementation.
Right now __get_user_nocheck() does this:
__uaccess_begin(); \
__get_user_size(__gu_val, __gu_ptr, __gu_size, __gu_err); \
__uaccess_end(); \
(x) = (__force __typeof__(*(ptr)))__gu_val; \
__builtin_expect(__gu_err, 0); \
because __get_user_nocheck() internally does *not* use the jumping
version (yet), because gcc can't do "asm goto" with outputs.
And it's actually _important_ that the assignment to "x" is done
outside the user access region (because "x" can be a complex
expression).
But look at what happens if we change things to use an exception jump
instead of that __gu_err value.
Then we want the code to become this:
__uaccess_begin_nospec(); \
__get_user_size(__gu_val, __gu_ptr, __gu_size, __gu_label); \
__uaccess_end(); \
(x) = (__force __typeof__(*(ptr)))__gu_val; \
__gu_err = 0; \
__gu_label: \
__builtin_expect(__gu_err, 0); \
and that actually looks nice and understandable, and the compiler will
also have a really easy time turning any subsequent test of the return
value into a trivial "we know the fallthrough case returned zero",
because instead of setting "__gu_err" much deeper and dynamically (and
with other code in between), it gets set right before the return from
that statement expression macro.
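As a userspace illustration of that shape (the names, and the convention that a NULL pointer stands in for a faulting user access, are invented for this sketch; they are not the kernel's macros), the error code really can be materialized as the last expression of the statement expression, right where a caller would test it:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Userspace mock of the pattern above (hypothetical names, not the kernel
 * macros): the "access" jumps to a local error label on failure, and the
 * error code is materialized as the very last expression of the statement
 * expression, where the compiler can trivially fold a subsequent test.
 * A NULL pointer stands in for a faulting user access.
 */
#define mock_get_user(x, ptr)				\
({							\
	__label__ __gu_label;				\
	long __gu_err = -EFAULT;			\
	if ((ptr) == NULL)				\
		goto __gu_label;			\
	(x) = *(ptr);					\
	__gu_err = 0;					\
__gu_label:	;					\
	__builtin_expect(__gu_err, 0);			\
})
```

The `__label__` declaration (a GNU local label) keeps the label private to each expansion, so the macro can be used more than once per function.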
Btw, all the "it makes it easier to implement" is doubly true in all
the low-level asm code too. Go look into arch/x86/lib/{get,put}user.S
right now, and see how the error cases all have two different
entrypoints, and we have code like
SYM_CODE_START_LOCAL(.Lbad_get_user_clac)
ASM_CLAC
bad_get_user:
xor %edx,%edx
mov $(-EFAULT),%_ASM_AX
ret
SYM_CODE_END(.Lbad_get_user_clac)
where that "Lbad_get_user_clac" label is used for the exception case,
and the "bad_get_user" label is used for the "bad address" case.
So it
(a) makes it easier for users
(b) makes it easier to implement
(c) will actually make it easier for compilers to optimize too
Now, if the exception does *not* do the "stop user accesses", we have
to continue doing the two different failure targets (both in C code
and in asm code), and for the __get_user_nocheck() case where we use
"asm goto" with outputs, we can do things like this:
__uaccess_begin_nospec(); \
__get_user_size(__gu_val, __gu_ptr, __gu_size, __gu_label); \
__uaccess_end(); \
(x) = (__force __typeof__(*(ptr)))__gu_val; \
__gu_err = 0; \
if (0) { \
__gu_label: \
__uaccess_end(); \
__gu_err = -EFAULT; \
} \
__builtin_expect(__gu_err, 0); \
which certainly works, but I think you'll admit it's ugly (note that
nice "if (0)" trick to get the separate error case that does that
__uaccess_end() that can't be done in the common end part because that
complex access has to be done after closing the user access region).
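The ugliness of that variant can be mocked in userspace too (again, all names and the NULL-as-fault convention are invented for the sketch; a counter stands in for the STAC/CLAC state). Note how the error path must close the access region itself, hidden behind the "if (0)":

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Mock of the "if (0)" variant above (hypothetical names): the error path
 * has to close the access region itself, so it needs its own block, hidden
 * behind "if (0)" so the success path never executes it.  access_depth
 * stands in for the STAC/CLAC state; NULL stands in for a faulting access. */
static int access_depth;

#define mock_get_user_noclac(x, ptr)			\
({							\
	__label__ __gu_label;				\
	long __gu_err;					\
	__typeof__(*(ptr)) __gu_val;			\
	access_depth++;		/* __uaccess_begin() */	\
	if ((ptr) == NULL)				\
		goto __gu_label;			\
	__gu_val = *(ptr);				\
	access_depth--;		/* __uaccess_end() */	\
	(x) = __gu_val;					\
	__gu_err = 0;					\
	if (0) {					\
__gu_label:						\
		access_depth--;	/* must close it here too */ \
		__gu_err = -EFAULT;			\
	}						\
	__builtin_expect(__gu_err, 0);			\
})
```

Both exits leave access_depth balanced, but only because the cleanup is duplicated on the error path, which is exactly the duplication the proposed rule would remove.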
So this is why I'd like to change the rules. A user access fault would
close the user access window.
Linus
On Wed, Jul 01, 2020 at 12:04:36PM -0700, Linus Torvalds wrote:
> On Wed, Jul 1, 2020 at 11:41 AM Al Viro <[email protected]> wrote:
> >
> > Rather nasty for ppc; they have separate user_read_access_end() and
> > user_write_access_end().
>
> That's actually for the access granting. Shutting the access down ends
> up always doing the same thing anyway..
#define user_read_access_end prevent_current_read_from_user
#define user_write_access_end prevent_current_write_to_user
static inline void prevent_current_read_from_user(void)
{
prevent_user_access(NULL, NULL, ~0UL, KUAP_CURRENT_READ);
}
static inline void prevent_current_write_to_user(void)
{
prevent_user_access(NULL, NULL, ~0UL, KUAP_CURRENT_WRITE);
}
and prevent_user_access() has instances that do care about the direction...
On Wed, Jul 1, 2020 at 12:59 PM Al Viro <[email protected]> wrote:
>
> On Wed, Jul 01, 2020 at 12:04:36PM -0700, Linus Torvalds wrote:
> >
> > That's actually for the access granting. Shutting the access down ends
> > up always doing the same thing anyway..
>
> #define user_read_access_end prevent_current_read_from_user
> #define user_write_access_end prevent_current_write_to_user
> static inline void prevent_current_read_from_user(void)
> {
> prevent_user_access(NULL, NULL, ~0UL, KUAP_CURRENT_READ);
> }
>
> static inline void prevent_current_write_to_user(void)
> {
> prevent_user_access(NULL, NULL, ~0UL, KUAP_CURRENT_WRITE);
> }
>
> and prevent_user_access() has instances that do care about the direction...
Go and look closer.
There are three cases:
(a) the 32-bit book3s case. It looks like it cares, but when you look
closer, it ends up not caring about the read side, and saving the
"which addresses I allow user writes to" in current->thread.kuap
(b) the nohash 32-bit case - doesn't care
(c) the 64-bit book3s case - doesn't care
So yes, in the (a) case it does make a difference between reads and
writes, but at least as far as I can tell, it ignores the read case,
and has code to avoid the unnecessary "disable user writes" case when
there was only a read enable done.
Now, it's possible that I'm wrong, but the upshot is that even on
powerpc, I think making the rule be that "taking a user exception
should automatically do the 'user_access_end()' for us" would be
trivial.
But I'll add the powerpc people to the list too. And the arm64 people
too, although it looks like they still haven't actually made the
uaccess_disable() logic visible as user_access_begin/end and the
unsafe_xyz code, so they'd not be impacted.
Christophe/Michael: the discussion is that I'd actually want to change
the "exception on user access" case to do the user_access_end()
automatically, so that you can write code like
if (!user_access_begin(...))
goto out;
unsafe_get_user(..., out);
unsafe_get_user(..., out);
user_access_end();
.. all is good, use the value we got..
return 0;
out:
return -EFAULT;
and use the same error label for both the "user_access_begin() failed"
_and_ for the "oops, the access faulted".
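That single-label flow can be sketched in userspace (hypothetical names; a flag mocks the STAC state and NULL mocks a faulting address, none of this is kernel code). The key point is that the "exception handler" ends the access region itself, so one label serves both failure cases:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace sketch of the proposed rule (hypothetical names): on a fault,
 * the "exception handler" itself ends the access region, so the caller can
 * use one error label for both the begin failure and the access fault.
 * access_open mocks the STAC state; NULL mocks a faulting address. */
static int access_open;

#define unsafe_get_user_mock(x, ptr, label)			\
do {								\
	if ((ptr) == NULL) {					\
		access_open = 0; /* handler did user_access_end() */ \
		goto label;					\
	}							\
	(x) = *(ptr);						\
} while (0)

static int get_pair(const int *a, const int *b, int *sum)
{
	int va, vb;

	access_open = 1;		/* user_access_begin() */
	unsafe_get_user_mock(va, a, out);
	unsafe_get_user_mock(vb, b, out);
	access_open = 0;		/* user_access_end() */
	*sum = va + vb;
	return 0;
out:
	return -EFAULT;			/* no second cleanup label needed */
}
```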
Right now the code needs to explicitly do the user_access_end()
handling manually if one of the accesses fault.
See for example fs/readdir.c, which has that
efault_end:
user_write_access_end();
efault:
buf->result = -EFAULT;
return -EFAULT;
pattern of two different error targets several times. I'd like to
avoid that user_{read_,write_,}access_end() case for the error
handling entirely. It's extra complexity.
I checked every single non-arch user, and for all of them it was just
extra work (e.g. the i915 driver, readdir, select, etc).
The only case it wasn't an extra bother was the
lib/strn{cpy,len}_from_user() cases, but that was because I literally
organized the code so that a helper function is called in such a way
that it always does the right thing.
Linus
> On Jul 1, 2020, at 12:35 PM, Linus Torvalds <[email protected]> wrote:
>
> On Wed, Jul 1, 2020 at 11:29 AM Andy Lutomirski <[email protected]> wrote:
>>
>> Do we really want the exception handling to do the CLAC? Having
>> unsafe_get_user() do CLAC seems surprising to me, and it will break
>> use cases like:
>>
>> if (!user_access_begin(...))
>> goto out;
>>
>> ret = unsafe_get_user(...);
>>
>> user_access_end();
>>
>> check ret;
>
> That's not how unsafe_get_user() works.
>
> unsafe_get_user() always jumps to the error label, it never returns a
> value. So the code is actually not what you claim above, but
>
> if (!user_access_begin(...))
> goto out;
>
> unsafe_get_user(..., out_fault);
> user_access_end();
> .. this is good, use the value we got..
>
> out_fault:
> user_access_end();
> out:
> return -EFAULT;
Ugh, right. But maybe, with the asm goto magic, we can get rid of this. I’ve always disliked the pattern where we enable user access, do a bunch of accesses that branch on error, and finish up. We ought to be able to do it the way I described and get decent code generation too.
If we do this extable change, we end up with a different mess: some exception handlers will clear AC and some won’t. I’m sure objtool can deal with this with some effort, but I’m not convinced it’s worth it.
On Wed, Jul 01, 2020 at 01:36:22PM -0700, Andy Lutomirski wrote:
> If we do this extable change, we end up with a different mess: some
> exception handlers will clear AC and some won’t. I’m sure objtool can
> deal with this with some effort, but I’m not convinced it’s worth it.
Yeah. Peter's more of the expert here, but I think we'd at least need
to annotate the code which expects an implicit CLAC so objtool knows
what to expect. It's not trivial, but it might be doable.
--
Josh
On Wed, Jul 1, 2020 at 1:36 PM Andy Lutomirski <[email protected]> wrote:
>
> We ought to be able to do it the way I described and get decent code generation too.
No, we really can't.
Each access really needs to jump to an exception label. Otherwise any
time you have multiple operations (think "strncpy()" and friends) you
have to test in between each access.
That is why *fundamentally* the interface to "unsafe_get/put_user()"
takes a label for the error case. There is absolutely no way to make
any other interface work efficiently.
(Unless, of course, you make the exception handling something that the
compiler does entirely on its own. But that has never been a good idea
for the kernel, and I wouldn't trust a compiler to do what the kernel
needs).
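The strncpy()-style argument can be made concrete with a userspace mock (all names here are invented for the sketch; a bound pointer stands in for the end of mapped "user" memory). N accesses, zero in-between tests, one shared fault label:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Mock (hypothetical names) of why the label interface matters for loops:
 * many accesses, no per-access error tests, one shared fault label.
 * valid_end stands in for the end of mapped "user" memory. */
static const int *valid_end;

#define unsafe_get_user_mock(x, p, label)		\
do {							\
	if ((p) >= valid_end)	/* mock fault */	\
		goto label;				\
	(x) = *(p);					\
} while (0)

static long copy_ints(int *dst, const int *src, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		unsafe_get_user_mock(dst[i], &src[i], efault);
	return n;
efault:
	return -EFAULT;
}
```

With a boolean-returning interface, the loop body would need an explicit test after every element; with the label, the fault path simply never appears in the hot loop.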
Side note: the labels can be hidden. I did (long ago) send out
something that did a
uaccess_try {
val1 = unsafe_get_user(addr);
val2 = unsafe_get_user(addr2);
} uaccess_catch {
error handling here
};
kind of thing, but that was just syntactic wrapper around that label
model. And honestly, it doesn't really change anything fundamental, it
really ends up with exactly the same issues just with a slightly
different syntax.
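A userspace sketch of how such sugar could be built over a single function-scoped label (the macro names here are hypothetical, not the old kernel interface; NULL mocks a fault, and only one such region can appear per function because the label name is fixed):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Sketch of that syntactic wrapper (hypothetical macro names): the
 * try/catch pair is pure sugar over one function-scoped fault label,
 * so only one region is possible per function.  NULL mocks a fault. */
#define uaccess_try	do {
#define uaccess_catch	} while (0); if (0) { __uaccess_fault:
#define uaccess_end_try	}

#define mock_get(x, p)					\
do {							\
	if ((p) == NULL)				\
		goto __uaccess_fault;			\
	(x) = *(p);					\
} while (0)

static int read_pair(const int *a, const int *b, int *sum)
{
	int va, vb;

	uaccess_try {
		mock_get(va, a);
		mock_get(vb, b);
		*sum = va + vb;
	} uaccess_catch {
		return -EFAULT;
	} uaccess_end_try;
	return 0;
}
```

Expanding the macros gives exactly the explicit-label code, which is the point: the syntax changes, the model doesn't.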
(I did that because we had the nasty "put_user_ex()" interfaces, which
were horrible horrible crap, and if one access took an exception, then
all the other ones did too).
The "label for error case" is actually simpler to follow both for the
user and for a compiler. Yes, it's a bit odd, but once you get used to
it, it's really quite regular. But having a different error handler
for the "user_access_begin()" failure and the actual access failure
really does end up generating duplicate code and confusion.
Linus
On Wed, Jul 1, 2020 at 1:51 PM Josh Poimboeuf <[email protected]> wrote:
>
> Yeah. Peter's more of the expert here, but I think we'd at least need
> to annotate the code which expects an implicit CLAC so objtool knows
> what to expect. It's not trivial, but it might be doable.
In both C and asm code, it's the "_ASM_EXTABLE_UA" cases that would do
this ("UA" being for "User Access").
In fact, it should be quite easy to see: the thing that distinguishes
those things is that the exception handler is "ex_handler_uaccess". So
objtool should be able to see that quite easily as it follows the
exception tables.
It's a special case for entirely unrelated reasons (reasons objtool
doesn't care about): a user access exception can be either due to a
page fault (normal) or due to a malformed non-canonical address, and
we warn about the latter case.
That said, I wouldn't necessarily object to making the rule be that
*any* exception handler invocation will always do the
user_access_end().
It sounds dangerous/wrong to me to do anything that can fault (other
than the user access itself, of course) within a STAC/CLAC region.
So the objtool rule might be:
- in a STAC region, no exception handlers at all except for that
ex_handler_uaccess case
- and that case will clear AC if it triggers.
and maybe such an objtool check would show some case where I'm wrong,
and we do some MSR read or other fault thing within a STAC region.
That _sounds_ wrong to me, but maybe we have a reason to do so that I
just can't think of right now?
Linus
On Wed, Jul 01, 2020 at 02:02:42PM -0700, Linus Torvalds wrote:
> So the objtool rule might be:
>
> - in a STAC region, no exception handlers at all except for that
> ex_handler_uaccess case
>
> - and that case will clear AC if it triggers.
>
> and maybe such an objtool check would show some case where I'm wrong,
> and we do some MSR read or other fault thing within a STAC region.
> That _sounds_ wrong to me, but maybe we have a reason to do so that I
> just can't think of right now?
Here's an attempt at implementing this, in case anybody wants to play
with it. Usual disclaimers apply...
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 5e0d70a89fb8..470447225314 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -24,7 +24,7 @@
struct alternative {
struct list_head list;
struct instruction *insn;
- bool skip_orig;
+ bool skip_orig, exception, uaccess;
};
const char *objname;
@@ -1023,8 +1023,13 @@ static int add_special_section_alts(struct objtool_file *file)
}
alt->insn = new_insn;
+
alt->skip_orig = special_alt->skip_orig;
+ alt->exception = special_alt->exception;
+ alt->uaccess = special_alt->uaccess;
+
orig_insn->ignore_alts |= special_alt->skip_alt;
+
list_add_tail(&alt->list, &orig_insn->alts);
list_del(&special_alt->list);
@@ -2335,6 +2340,35 @@ static void fill_alternative_cfi(struct objtool_file *file, struct instruction *
}
}
+static int handle_stac(struct symbol *func, struct instruction *insn,
+ struct insn_state *state)
+{
+ if (state->uaccess) {
+ WARN_FUNC("recursive UACCESS enable", insn->sec, insn->offset);
+ return -1;
+ }
+
+ state->uaccess = true;
+ return 0;
+}
+
+static int handle_clac(struct symbol *func, struct instruction *insn,
+ struct insn_state *state)
+{
+ if (!state->uaccess && func) {
+ WARN_FUNC("redundant UACCESS disable", insn->sec, insn->offset);
+ return -1;
+ }
+
+ if (func_uaccess_safe(func) && !state->uaccess_stack) {
+ WARN_FUNC("UACCESS-safe disables UACCESS", insn->sec, insn->offset);
+ return -1;
+ }
+
+ state->uaccess = false;
+ return 0;
+}
+
/*
* Follow the branch starting at the given instruction, and recursively follow
* any other branches (jumps). Meanwhile, track the frame pointer state at
@@ -2393,6 +2427,17 @@ static int validate_branch(struct objtool_file *file, struct symbol *func,
if (alt->skip_orig)
skip_orig = true;
+ if (alt->exception) {
+ if (!alt->uaccess && state.uaccess) {
+ WARN_FUNC("non-user-access exception with uaccess enabled",
+ sec, insn->offset);
+ return 1;
+ }
+
+ if (alt->uaccess && handle_clac(func, insn, &state))
+ return 1;
+ }
+
ret = validate_branch(file, func, alt->insn, state);
if (ret) {
if (backtrace)
@@ -2478,26 +2523,13 @@ static int validate_branch(struct objtool_file *file, struct symbol *func,
return 0;
case INSN_STAC:
- if (state.uaccess) {
- WARN_FUNC("recursive UACCESS enable", sec, insn->offset);
+ if (handle_stac(func, insn, &state))
return 1;
- }
-
- state.uaccess = true;
break;
case INSN_CLAC:
- if (!state.uaccess && func) {
- WARN_FUNC("redundant UACCESS disable", sec, insn->offset);
+ if (handle_clac(func, insn, &state))
return 1;
- }
-
- if (func_uaccess_safe(func) && !state.uaccess_stack) {
- WARN_FUNC("UACCESS-safe disables UACCESS", sec, insn->offset);
- return 1;
- }
-
- state.uaccess = false;
break;
case INSN_STD:
diff --git a/tools/objtool/special.c b/tools/objtool/special.c
index e74e0189de22..41f27199cae6 100644
--- a/tools/objtool/special.c
+++ b/tools/objtool/special.c
@@ -18,6 +18,7 @@
#define EX_ENTRY_SIZE 12
#define EX_ORIG_OFFSET 0
#define EX_NEW_OFFSET 4
+#define EX_HANDLER_OFFSET 8
#define JUMP_ENTRY_SIZE 16
#define JUMP_ORIG_OFFSET 0
@@ -35,8 +36,9 @@
struct special_entry {
const char *sec;
- bool group, jump_or_nop;
+ bool group, jump_or_nop, exception;
unsigned char size, orig, new;
+ unsigned char handler; /* __ex_table only */
unsigned char orig_len, new_len; /* group only */
unsigned char feature; /* ALTERNATIVE macro CPU feature */
};
@@ -61,9 +63,11 @@ struct special_entry entries[] = {
},
{
.sec = "__ex_table",
+ .exception = true,
.size = EX_ENTRY_SIZE,
.orig = EX_ORIG_OFFSET,
.new = EX_NEW_OFFSET,
+ .handler = EX_HANDLER_OFFSET,
},
{},
};
@@ -118,6 +122,20 @@ static int get_alt_entry(struct elf *elf, struct special_entry *entry,
}
}
+ if (entry->exception) {
+ struct rela *handler_rela;
+
+ alt->exception = true;
+
+ handler_rela = find_rela_by_dest(elf, sec, offset + entry->handler);
+ if (!handler_rela) {
+ WARN_FUNC("can't find handler rela", sec, offset + entry->handler);
+ return -1;
+ }
+ if (!strcmp(handler_rela->sym->name, "ex_handler_uaccess"))
+ alt->uaccess = true;
+ }
+
orig_rela = find_rela_by_dest(elf, sec, offset + entry->orig);
if (!orig_rela) {
WARN_FUNC("can't find orig rela", sec, offset + entry->orig);
diff --git a/tools/objtool/special.h b/tools/objtool/special.h
index 35061530e46e..3a62daef14b3 100644
--- a/tools/objtool/special.h
+++ b/tools/objtool/special.h
@@ -16,6 +16,7 @@ struct special_alt {
bool skip_orig;
bool skip_alt;
bool jump_or_nop;
+ bool exception, uaccess;
struct section *orig_sec;
unsigned long orig_off;
On Wed, Jul 1, 2020 at 1:51 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, Jul 1, 2020 at 1:36 PM Andy Lutomirski <[email protected]> wrote:
> >
> > We ought to be able to do it the way I described and get decent code generation too.
>
> No, we really can't.
>
> Each access really needs to jump to an exception label. Otherwise any
> time you have multiple operations (think "strncpy()" and friends) you
> have to test in between each access.
>
> That is why *fundamentally* the interface to "unsafe_get/put_user()"
> takes a label for the error case. There is absolutely no way to make
> any other interface work efficiently.
You inspired me to mock it up. I don't think I did anything special
here, except that I mocked up unsafe_put_user() and fudged it a
little bit because I'm using gcc 9.3.1, which doesn't support asm goto
outputs. Code like this:
if (unsafe_put_user(&a, user_a))
goto error;
if (unsafe_put_user(&b, user_b))
goto error;
if (unsafe_put_user(&c, user_c))
goto error;
if (unsafe_put_user(&d, user_d))
goto error;
generates a series of movs. The conditions are entirely omitted from
the generated assembly output because gcc is smart enough to figure
out that the return value of unsafe_put_user() indicates which way the
asm goto went. I don't think I could generate better output by hand
than gcc generated from my test.
So I stand by my claim. :) Each access does need to jump, but that
jump can be entirely within the exception entry, and we don't need to
generate any actual jump instructions.
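The shape Andy describes can be mocked without real asm goto (all names and the NULL-as-fault convention are invented for this sketch): the access is a statement expression whose value records which way the would-be exception branch went, so callers write an ordinary "if ... goto", and a compiler that sees through the statement expression can fold the test into the branch itself:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Mock of Andy's shape (hypothetical names, no real asm goto here): the
 * access is a statement expression whose value says which way the
 * (would-be) exception branch went, so callers can write a plain
 * "if (...) goto error;".  NULL mocks a faulting destination. */
#define unsafe_put_user_mock(val, ptr)			\
({							\
	__label__ __pu_fail;				\
	int __pu_err = 0;				\
	if ((ptr) == NULL)				\
		goto __pu_fail;				\
	*(ptr) = (val);					\
	if (0) {					\
__pu_fail:						\
		__pu_err = -EFAULT;			\
	}						\
	__pu_err;					\
})

static int put_four(int *a, int *b, int *c, int *d)
{
	if (unsafe_put_user_mock(1, a))
		goto error;
	if (unsafe_put_user_mock(2, b))
		goto error;
	if (unsafe_put_user_mock(3, c))
		goto error;
	if (unsafe_put_user_mock(4, d))
		goto error;
	return 0;
error:
	return -EFAULT;
}
```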
--Andy
On Wed, Jul 1, 2020 at 5:48 PM Andy Lutomirski <[email protected]> wrote:
>
> You inspired me to mock it up.
Ahh, you want to just use the jump folding of gcc to avoid the problem.
I guess we could do that. Are there cases where this actually helps?
Linus
On Wed, Jul 1, 2020 at 7:30 PM Linus Torvalds
<[email protected]> wrote:
>
> I guess we could do that. Are there cases where this actually helps?
Hmm. Thinking about it, using that approach might make the
"CONFIG_CC_HAS_ASM_GOTO_OUTPUT" choices simpler to handle.
With ASM_GOTO_OUTPUT it generates the perfect non-jump code. And
without it, it generates the (annoying, but inevitable) actual
conditionals.
Linus
> On Jul 1, 2020, at 7:30 PM, Linus Torvalds <[email protected]> wrote:
>
> On Wed, Jul 1, 2020 at 5:48 PM Andy Lutomirski <[email protected]> wrote:
>>
>> You inspired me to mock it up.
>
> Ahh, you want to just use the jump folding of gcc to avoid the problem.
>
> I guess we could do that. Are there cases where this actually helps?
>
I was thinking it would help avoid brain melt. For better or for worse, the kernel is written in C, and readers don’t really expect call_some_function(arg, other arg) to actually teleport elsewhere in the function. I’m all for goto err; but at least that’s spelled “goto” and it’s really obvious what it does.
On Wed, Jul 01, 2020 at 07:00:41PM -0500, Josh Poimboeuf wrote:
> On Wed, Jul 01, 2020 at 02:02:42PM -0700, Linus Torvalds wrote:
> > So the objtool rule might be:
> >
> > - in a STAC region, no exception handlers at all except for that
> > ex_handler_uaccess case
> >
> > - and that case will clear AC if it triggers.
> >
> > and maybe such an objtool check would show some case where I'm wrong,
> > and we do some MSR read or other fault thing within a STAC region.
> > That _sounds_ wrong to me, but maybe we have a reason to do so that I
> > just can't think of right now?
>
> Here's an attempt at implementing this, in case anybody wants to play
> with it. Usual disclaimers apply...
Looks about right, two niggles below.
> @@ -2335,6 +2340,35 @@ static void fill_alternative_cfi(struct objtool_file *file, struct instruction *
> }
> }
>
> +static int handle_stac(struct symbol *func, struct instruction *insn,
> + struct insn_state *state)
> +{
> + if (state->uaccess) {
> + WARN_FUNC("recursive UACCESS enable", insn->sec, insn->offset);
> + return -1;
> + }
> +
> + state->uaccess = true;
> + return 0;
> +}
> +
> +static int handle_clac(struct symbol *func, struct instruction *insn,
> + struct insn_state *state)
> +{
> + if (!state->uaccess && func) {
> + WARN_FUNC("redundant UACCESS disable", insn->sec, insn->offset);
> + return -1;
> + }
> +
> + if (func_uaccess_safe(func) && !state->uaccess_stack) {
> + WARN_FUNC("UACCESS-safe disables UACCESS", insn->sec, insn->offset);
> + return -1;
> + }
> +
> + state->uaccess = false;
> + return 0;
> +}
For both these we return -1 on error and then all callers convert it to
1. So why not have this return 1 and pass any !0 value through?
> /*
> * Follow the branch starting at the given instruction, and recursively follow
> * any other branches (jumps). Meanwhile, track the frame pointer state at
> @@ -2393,6 +2427,17 @@ static int validate_branch(struct objtool_file *file, struct symbol *func,
> if (alt->skip_orig)
> skip_orig = true;
>
> + if (alt->exception) {
> + if (!alt->uaccess && state.uaccess) {
> + WARN_FUNC("non-user-access exception with uaccess enabled",
> + sec, insn->offset);
> + return 1;
> + }
This is Linus' new rule that AC code should not get any exceptions
except ex_handler_uaccess.
> +
> + if (alt->uaccess && handle_clac(func, insn, &state))
> + return 1;
And this is ex_handler_uaccess() mucking with regs->flags, right? Might
want a comment.
> + }
> +
> ret = validate_branch(file, func, alt->insn, state);
> if (ret) {
> if (backtrace)
Linus Torvalds <[email protected]> writes:
> On Wed, Jul 1, 2020 at 12:59 PM Al Viro <[email protected]> wrote:
>> On Wed, Jul 01, 2020 at 12:04:36PM -0700, Linus Torvalds wrote:
>> >
>> > That's actually for the access granting. Shutting the access down ends
>> > up always doing the same thing anyway..
>>
>> #define user_read_access_end prevent_current_read_from_user
>> #define user_write_access_end prevent_current_write_to_user
>> static inline void prevent_current_read_from_user(void)
>> {
>> prevent_user_access(NULL, NULL, ~0UL, KUAP_CURRENT_READ);
>> }
>>
>> static inline void prevent_current_write_to_user(void)
>> {
>> prevent_user_access(NULL, NULL, ~0UL, KUAP_CURRENT_WRITE);
>> }
>>
>> and prevent_user_access() has instances that do care about the direction...
>
> Go and look closer.
>
> There are three cases:
>
> (a) the 32-bit book3s case. It looks like it cares, but when you look
> closer, it ends up not caring about the read side, and saving the
> > "which addresses I allow user writes to" in current->thread.kuap
>
> (b) the nohash 32-bit case - doesn't care
>
> > (c) the 64-bit book3s case - doesn't care
>
> So yes, in the (a) case it does make a difference between reads and
> writes, but at least as far as I can tell, it ignores the read case,
> and has code to avoid the unnecessary "disable user writes" case when
> there was only a read enable done.
Yeah that's my understanding too.
Christophe is the expert on that code so I'll defer to him if I'm wrong.
> Now, it's possible that I'm wrong, but the upshot is that even on
> powerpc, I think making the rule be that "taking a user exception
> should automatically do the 'user_access_end()' for us" would be
> trivial.
I think we can do something to make it work.
We don't have an equivalent of x86's ex_handler_uaccess(), so it's not
quite as easy as whacking a user_access_end() in there.
Probably the simplest option for us is to just handle it in our
unsafe_op_wrap(). I'll try and come up with something tomorrow.
cheers
On Thu, Jul 02, 2020 at 11:34:31PM +1000, Michael Ellerman wrote:
> I think we can do something to make it work.
>
> We don't have an equivalent of x86's ex_handler_uaccess(), so it's not
> quite as easy as whacking a user_access_end() in there.
>
> Probably the simplest option for us is to just handle it in our
> unsafe_op_wrap(). I'll try and come up with something tomorrow.
The goal is to avoid using unsafe_op_wrap()...
On Thu, Jul 02, 2020 at 03:01:59PM +0100, Al Viro wrote:
> On Thu, Jul 02, 2020 at 11:34:31PM +1000, Michael Ellerman wrote:
>
> > I think we can do something to make it work.
> >
> > We don't have an equivalent of x86's ex_handler_uaccess(), so it's not
> > quite as easy as whacking a user_access_end() in there.
> >
> > Probably the simplest option for us is to just handle it in our
> > unsafe_op_wrap(). I'll try and come up with something tomorrow.
>
> The goal is to avoid using unsafe_op_wrap()...
Incidentally, the change Linus proposes would affect unsafe_put_user()
as well. And you are not using unsafe_op_wrap() anywhere on that
path...
On 02/07/2020 at 15:34, Michael Ellerman wrote:
> Linus Torvalds <[email protected]> writes:
>> On Wed, Jul 1, 2020 at 12:59 PM Al Viro <[email protected]> wrote:
>>> On Wed, Jul 01, 2020 at 12:04:36PM -0700, Linus Torvalds wrote:
>>>>
>>>> That's actually for the access granting. Shutting the access down ends
>>>> up always doing the same thing anyway..
>>>
>>> #define user_read_access_end prevent_current_read_from_user
>>> #define user_write_access_end prevent_current_write_to_user
>>> static inline void prevent_current_read_from_user(void)
>>> {
>>> prevent_user_access(NULL, NULL, ~0UL, KUAP_CURRENT_READ);
>>> }
>>>
>>> static inline void prevent_current_write_to_user(void)
>>> {
>>> prevent_user_access(NULL, NULL, ~0UL, KUAP_CURRENT_WRITE);
>>> }
>>>
>>> and prevent_user_access() has instances that do care about the direction...
>>
>> Go and look closer.
>>
>> There are three cases:
>>
>> (a) the 32-bit book3s case. It looks like it cares, but when you look
>> closer, it ends up not caring about the read side, and saving the
>> "which address do I allow user writes to" in current->thread.kuap
>>
>> (b) the nohash 32-bit case - doesn't care
>>
>> (c) the 64-bit book3s case - doesn't care
>>
>> So yes, in the (a) case it does make a difference between reads and
>> writes, but at least as far as I can tell, it ignores the read case,
>> and has code to avoid the unnecessary "disable user writes" case when
>> there was only a read enable done.
>
> Yeah that's my understanding too.
>
> Christophe is the expert on that code so I'll defer to him if I'm wrong.
>
>> Now, it's possible that I'm wrong, but the upshot of that is that even
>> on powerpc, I think that if we just made the rule be that "taking a
>> user exception should automatically do the 'user_access_end()' for us"
>> is trivial.
>
> I think we can do something to make it work.
>
> We don't have an equivalent of x86's ex_handler_uaccess(), so it's not
> quite as easy as whacking a user_access_end() in there.
Isn't it something easy to do in bad_page_fault()?
Not exactly a call to user_access_end() but altering regs->kuap so that
user access is not restored on exception exit.
>
> Probably the simplest option for us is to just handle it in our
> unsafe_op_wrap(). I'll try and come up with something tomorrow.
unsafe_op_wrap() is not used anymore for unsafe_put_user() as we are now
using asm goto.
Christophe
On Thu, Jul 2, 2020 at 6:32 AM Michael Ellerman <[email protected]> wrote:
>
> Probably the simplest option for us is to just handle it in our
> unsafe_op_wrap(). I'll try and come up with something tomorrow.
My suggestion was to basically just always handle it in all exception cases.
And note that I don't mean the fault handler: obviously page faults
(or unaligned faults or whatever) can happen while in a user access
region.
But I mean any time fixup_exception() triggers.
For x86, this is in fact particularly natural: it involves just always
clearing the AC bit in the "struct pt_regs" that fixup_exception()
gets anyway. We can do it without even bothering with checking for
CLAC/STAC support, since without it, AC is meaningless in kernel mode
anyway, but also because doing "user_access_end()" in the exception
would be pointless: AC is restored by the exception routine, so on x86
you *have* to do it by just modifying the return state.
Linus
On Thu, Jul 2, 2020 at 8:13 AM Christophe Leroy
<[email protected]> wrote:
>
> Isn't it something easy to do in bad_page_fault()?
Can't the user access functions take any other faults on ppc?
On x86-64, we have the "address is non-canonical" case which doesn't
take a page fault at all, but takes a general protection fault
instead.
But note that depending on how you nest and save/restore the state,
things can be very subtle.
For example, what can happen is:
(a) user_access_begin()..
(b) we take a normal interrupt
(c) the interrupt code does something that has an exception handling
case entirely unrelated to the user access (on x86, it might be the
"unsafe_msr" logic, for example).
(d) we take that exception, do "fixup_exception()" for whatever that
interrupt did.
(e) we return from that exception to the fixed up state
(d) we return from the interrupt
(e) we should still have user accesses enabled.
NOTE! on x86, we can have "all fixup_exceptions() will clear AC in the
exception pt_regs", because AC is part of rflags which is saved on
(and cleared for the duration of) all interrupt and exceptions.
So what happens is that on x86 all of (b)-(d) will run with AC clear
and no user accesses allowed, and (e) will have user accesses enabled
again, because the "fixup_exception()" at (d) only affected the state
of the interrupt handler (which already had AC clear anyway).
But I don't think exceptions and interrupts save/restore the user
access state on powerpc, do they?
So on powerpc you do need to be more careful. You would only need to
disable user access on exceptions that happen _on_ user accesses.
The easiest way to do that is to do what x86 does: different
exceptions have different handlers. It's not what we did originally,
but it's been useful.
Hmm.
And again, on x86, this all works fine because of how exceptions
save/restore the user_access state and it all nests fine. But I'm
starting to wonder how the nesting works AT ALL for powerpc?
Because that nesting happens even without the exception fixup issue.
IOW, even aside from this whole thing, what happens on PPC, when you have
(a) user_access_begin();
- profile NMI or interrupt happens
- it wants to do user stack tracing so does
pagefault_disable();
(b) get_user();
pagefault_enable();
- profile NMI/interrupt returns
(c) user access should work here!
even if the "get_user()" in (b) would have done a
"user_access_begin/end" pair, and regardless of whether (b) might have
triggered a "fixup_exception()", and whether that fixup_exception()
then did the user_access_end().
On x86, this is all ok exactly because of how we only have the AC bit,
and it nests very naturally with any exception handling.
Is the ppc code nesting-safe? Particularly since it has that whole
range-handling?
Linus
On Thu, Jul 02, 2020 at 12:52:27PM -0700, Linus Torvalds wrote:
> On Thu, Jul 2, 2020 at 6:32 AM Michael Ellerman <[email protected]> wrote:
> >
> > Probably the simplest option for us is to just handle it in our
> > unsafe_op_wrap(). I'll try and come up with something tomorrow.
>
> My suggestion was to basically just always handle it in all exception cases.
>
> And note that I don't mean the fault handler: obviously page faults
> (or unaligned faults or whatever) can happen while in a user access
> region.
>
> But I mean any time fixup_exception() triggers.
>
> For x86, this is in fact particularly natural: it involves just always
> clearing the AC bit in the "struct pt_regs" that fixup_exception()
> gets anyway. We can do it without even bothering with checking for
> CLAC/STAC support, since without it, AC is meaningless in kernel mode
> anyway, but also because doing "user_access_end()" in the exception
> would be pointless: AC is restored by the exception routine, so on x86
> you *have* to do it by just modifying the return state.
What about
static inline int copy_xregs_to_user(struct xregs_state __user *buf)
{
	[...]
	stac();
	XSTATE_OP(XSAVE, buf, -1, -1, err);
	clac();
where XSTATE_OP() is
#define XSTATE_OP(op, st, lmask, hmask, err)				\
	asm volatile("1:" op "\n\t"					\
		     "xor %[err], %[err]\n"				\
		     "2:\n\t"						\
		     ".pushsection .fixup,\"ax\"\n\t"			\
		     "3: movl $-2,%[err]\n\t"				\
		     "jmp 2b\n\t"					\
		     ".popsection\n\t"					\
		     _ASM_EXTABLE(1b, 3b)				\
		     : [err] "=r" (err)					\
		     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)	\
		     : "memory")
Rely upon objtool not noticing that we have, in effect, clac() in a state
where AC is already cleared? We could massage that thing to take a label,
but it wouldn't be pretty...
On Thu, Jul 2, 2020 at 1:18 PM Al Viro <[email protected]> wrote:
>
> stac();
> XSTATE_OP(XSAVE, buf, -1, -1, err);
> clac();
>
> Rely upon objtool not noticing that we have, in effect, clac() in a state
> where AC is already cleared? We could massage that thing to take a label,
> but it wouldn't be pretty...
Ugh, the above is bad anyway.
It doesn't use _ASM_EXTABLE_UA, so it won't warn about the noncanonical cases.
Yeah, it would need to be turned into a "jump out" instead of just "jump over".
Which it damn well should do anyway.
That code should be taken behind a shed and shot. It does so many
things wrong that it's not even funny. It shouldn't do stac/clac on
its own.
At least it could use the "user_insn()" helper, which does it inside
the asm itself, has the right might_fault() marking (but not the
address check), and which can be trivially changed to have the fixup
jump be to after the "ASM_CLAC".
Or maybe it could use "asm goto" for that exception handling, because
it doesn't have any outputs (except for the error, which is exactly
what the goto case is for).
But no, at no point should we "rely on objtool not noticing". In fact,
I think Josh's patch would have made objtool notice, and it would have
been a good thing, because that code is unbelievably ugly and needs to
be cleaned up anyway.
Am I surprised that we have hacky stuff in our magic FPU/MMX/AVX state
handling? No I'm not.
I'm getting to the point where I'm thinking maybe all the horrendous
mis-steps from Intel over the last years are a good thing, and ARM
will take over, and we won't have to deal with this a decade from now.
Linus
On Thu, Jul 02, 2020 at 01:32:34PM -0700, Linus Torvalds wrote:
> Ugh, the above is bad anyway.
>
> It doesn't use _ASM_EXTABLE_UA, so it won't warn about the noncanonical cases.
FWIW, the address is inside a sigframe we decided to build, so noncanonical
addresses shouldn't occur in practice.
> Yeah, it would need to be turned into a "jump out" instead of just "jump over".
>
> Which it damn well should do anyway.
>
> That code should be taken behind a shed and shot. It does so many
> things wrong that it's not even funny. It shouldn't do stac/clac on
> its own.
>
> At least it could use the "user_insn()" helper, which does it inside
> the asm itself, has the right might_fault() marking (but not the
> address check), and which can be trivially changed to have the fixup
> jump be to after the "ASM_CLAC".
I'm not sure it's the right solution in this case. Look at the call chain
and the stuff done nearby (that __clear_user(), for example)...
I'm not saying that this code is not awful - it certainly is. But it's
not that simple, unfortunately ;-/
On Thu, Jul 2, 2020 at 1:59 PM Al Viro <[email protected]> wrote:
>
> I'm not sure it's the right solution in this case. Look at the call chain
> and the stuff done nearby (that __clear_user(), for example)...
>
> I'm not saying that this code is not awful - it certainly is. But it's
> not that simple, unfortunately ;-/
Well, the minimal thing to do would probably be to just change that
particular place to use asm_volatile_goto() for the error handling.
Even aside from anything else, it would improve code generation and
avoid that extra "err" variable.
So something like this (ENTIRELY UNTESTED!!).
It removes lines of code, and it must improve code generation too.
And while XSTATE_OP() is still disgusting, it's
(a) slightly less disgusting than it used to be
(b) now easily fixable if we do the "exceptions clear AC" thing.
so it's an improvement all around.
If it works, that is. As mentioned: IT HAS NO TESTING.
Linus
On Thu, Jul 02, 2020 at 02:55:19PM -0700, Linus Torvalds wrote:
> And while XSTATE_OP() is still disgusting, it's
>
> (a) slightly less disgusting than it used to be
>
> (b) now easily fixable if we do the "exceptions clear AC" thing.
>
> so it's an improvement all around.
>
> If it works, that is. As mentioned: IT HAS NO TESTING.
What about load_unaligned_zeropad()? Normally the caller doesn't
want to know about the exception on crossing into an unmapped
page. Blanket "clear #AC on fixup, don't go through user_access_end()
in case of exception" would complicate the code that calls that sucker.
Christophe Leroy <[email protected]> writes:
> On 02/07/2020 at 15:34, Michael Ellerman wrote:
>> Linus Torvalds <[email protected]> writes:
>>> On Wed, Jul 1, 2020 at 12:59 PM Al Viro <[email protected]> wrote:
>>>> On Wed, Jul 01, 2020 at 12:04:36PM -0700, Linus Torvalds wrote:
>>>>>
>>>>> That's actually for the access granting. Shutting the access down ends
>>>>> up always doing the same thing anyway..
>>>>
>>>> #define user_read_access_end prevent_current_read_from_user
>>>> #define user_write_access_end prevent_current_write_to_user
>>>> static inline void prevent_current_read_from_user(void)
>>>> {
>>>> prevent_user_access(NULL, NULL, ~0UL, KUAP_CURRENT_READ);
>>>> }
>>>>
>>>> static inline void prevent_current_write_to_user(void)
>>>> {
>>>> prevent_user_access(NULL, NULL, ~0UL, KUAP_CURRENT_WRITE);
>>>> }
>>>>
>>>> and prevent_user_access() has instances that do care about the direction...
>>>
>>> Go and look closer.
>>>
>>> There are three cases:
>>>
>>> (a) the 32-bit book3s case. It looks like it cares, but when you look
>>> closer, it ends up not caring about the read side, and saving the
>>> "which address do I allow user writes to" in current->thread.kuap
>>>
>>> (b) the nohash 32-bit case - doesn't care
>>>
>>> (c) the 64-bit book3s case - doesn't care
>>>
>>> So yes, in the (a) case it does make a difference between reads and
>>> writes, but at least as far as I can tell, it ignores the read case,
>>> and has code to avoid the unnecessary "disable user writes" case when
>>> there was only a read enable done.
>>
>> Yeah that's my understanding too.
>>
>> Christophe is the expert on that code so I'll defer to him if I'm wrong.
>>
>>> Now, it's possible that I'm wrong, but the upshot of that is that even
>>> on powerpc, I think that if we just made the rule be that "taking a
>>> user exception should automatically do the 'user_access_end()' for us"
>>> is trivial.
>>
>> I think we can do something to make it work.
>>
>> We don't have an equivalent of x86's ex_handler_uaccess(), so it's not
>> quite as easy as whacking a user_access_end() in there.
>
> Isn't it something easy to do in bad_page_fault()?
We'd need to do it there at least.
But I'm not convinced that's the only place we'd need to do it. We could
theoretically take a machine check on a user access, and those are
handled differently on each sub-(sub-sub)-platform, and I think all or
most of them don't call bad_page_fault().
> Not exactly a call to user_access_end() but altering regs->kuap so that
> user access is not restored on exception exit.
Yes.
>> Probably the simplest option for us is to just handle it in our
>> unsafe_op_wrap(). I'll try and come up with something tomorrow.
>
> unsafe_op_wrap() is not used anymore for unsafe_put_user() as we are now
> using asm goto.
Sure, but we could change it back to use unsafe_op_wrap().
I did a quick hack to do that and see no difference in the generated
code, but your commit adding put_user_goto() did show better code
generation, so possibly it depends on compiler version, or my example
wasn't complicated enough (filldir()).
cheers
On Thu, Jul 2, 2020 at 6:33 PM Al Viro <[email protected]> wrote:
>
> What about load_unaligned_zeropad()?
What about it? It doesn't care. It's for kernel addresses, clearing AC
on exception does nothing.
There's no user_access_begin/end anywhere around that thing that I can see.
Linus
Linus Torvalds <[email protected]> writes:
> On Thu, Jul 2, 2020 at 8:13 AM Christophe Leroy
> <[email protected]> wrote:
>>
>> Isn't it something easy to do in bad_page_fault()?
>
> Can't the user access functions take any other faults on ppc?
Yes they definitely can.
I think I can enumerate all the possibilities on 64-bit, but I don't
know all the possibilities on all the 32-bit variants.
> On x86-64, we have the "address is non-canonical" case which doesn't
> take a page fault at all, but takes a general protection fault
> instead.
Right. On P9 radix we have an address-out-of-page-table-range exception
which I guess is similar, though that does end up at bad_page_fault() in
our case.
> But note that depending on how you nest and save/restore the state,
> things can be very subtle.
>
> For example, what can happen is:
>
> (a) user_access_begin()..
>
> (b) we take a normal interrupt
>
> (c) the interrupt code does something that has an exception handling
> case entirely unrelated to the user access (on x86, it might be the
> "unsafe_msr" logic, for example).
>
> (d) we take that exception, do "fixup_exception()" for whatever that
> interrupt did.
>
> (e) we return from that exception to the fixed up state
>
> (d) we return from the interrupt
>
> (e) we should still have user accesses enabled.
Yes.
We broke that a few times when developing the KUAP support, which is why
I added bad_kuap_fault() to report the case where we are in a uaccess
region but are being blocked unexpectedly by KUAP.
> NOTE! on x86, we can have "all fixup_exceptions() will clear AC in the
> exception pt_regs", because AC is part of rflags which is saved on
> (and cleared for the duration of) all interrupt and exceptions.
>
> So what happens is that on x86 all of (b)-(d) will run with AC clear
> and no user accesses allowed, and (e) will have user accesses enabled
> again, because the "fixup_exception()" at (d) only affected the state
> of the interrupt handler (which already had AC clear anyway).
>
> But I don't think exceptions and interrupts save/restore the user
> access state on powerpc, do they?
Not implicitly.
We manually save it into pt_regs on the stack in the exception entry. On
64-bit it's done in kuap_save_amr_and_lock. 32-bit does it in
kuap_save_and_lock.
And then on the return path it's kuap_restore_amr() on 64-bit, and
kuap_restore on 32-bit.
> So on powerpc you do need to be more careful. You would only need to
> disable user access on exceptions that happen _on_ user accesses.
>
> The easiest way to do that is to do what x86 does: different
> exceptions have different handlers. It's not what we did originally,
> but it's been useful.
>
> Hmm.
>
> And again, on x86, this all works fine because of how exceptions
> save/restore the user_access state and it all nests fine. But I'm
> starting to wonder how the nesting works AT ALL for powerpc?
>
> Because that nesting happens even without the exception fixup issue.
>
> IOW, even aside from this whole thing, what happens on PPC, when you have
I'll annotate what happens for the 64-bit case as it's the one I know
best:
> (a) user_access_begin();
- mtspr(SPRN_AMR, 0) // 0 means loads & stores permitted
> - profile NMI or interrupt happens
- pt_regs->kuap = mfspr(SPRN_AMR)
- mtspr(SPRN_AMR, AMR_KUAP_BLOCKED)
> - it wants to do user stack tracing so does
> pagefault_disable();
> (b) get_user();
mtspr(SPRN_AMR, 0)
ld rN, <user pointer>
mtspr(SPRN_AMR, AMR_KUAP_BLOCKED)
> pagefault_enable();
> - profile NMI/interrupt returns
- mtspr(SPRN_AMR, pt_regs->kuap)
- return from interrupt
> (c) user access should work here!
>
> even if the "get_user()" in (b) would have done a
> "user_access_begin/end" pair, and regardless of whether (b) might have
> triggered a "fixup_exception()", and whether that fixup_exception()
> then did the user_access_end().
>
> On x86, this is all ok exactly because of how we only have the AC bit,
> and it nests very naturally with any exception handling.
>
> Is the ppc code nesting-safe? Particularly since it has that whole
> range-handling?
Yeah I think it is.
The range handling on 32-bit book3s follows the same pattern as above,
except that on exception entry we don't save the content of an SPR to
pt_regs, instead we save current->thread.kuap. (Because there isn't a
single SPR that contains the KUAP state).
cheers
On 03/07/2020 at 05:17, Michael Ellerman wrote:
> Christophe Leroy <[email protected]> writes:
>> On 02/07/2020 at 15:34, Michael Ellerman wrote:
>>> Linus Torvalds <[email protected]> writes:
>>>> On Wed, Jul 1, 2020 at 12:59 PM Al Viro <[email protected]> wrote:
>>>>> On Wed, Jul 01, 2020 at 12:04:36PM -0700, Linus Torvalds wrote:
>>>>>>
>>>>>> That's actually for the access granting. Shutting the access down ends
>>>>>> up always doing the same thing anyway..
>>>>>
>>>>> #define user_read_access_end prevent_current_read_from_user
>>>>> #define user_write_access_end prevent_current_write_to_user
>>>>> static inline void prevent_current_read_from_user(void)
>>>>> {
>>>>> prevent_user_access(NULL, NULL, ~0UL, KUAP_CURRENT_READ);
>>>>> }
>>>>>
>>>>> static inline void prevent_current_write_to_user(void)
>>>>> {
>>>>> prevent_user_access(NULL, NULL, ~0UL, KUAP_CURRENT_WRITE);
>>>>> }
>>>>>
>>>>> and prevent_user_access() has instances that do care about the direction...
>>>>
>>>> Go and look closer.
>>>>
>>>> There are three cases:
>>>>
>>>> (a) the 32-bit book3s case. It looks like it cares, but when you look
>>>> closer, it ends up not caring about the read side, and saving the
>>>> "which address do I allow user writes to" in current->thread.kuap
>>>>
>>>> (b) the nohash 32-bit case - doesn't care
>>>>
>>>> (c) the 64-bit book3s case - doesn't care
>>>>
>>>> So yes, in the (a) case it does make a difference between reads and
>>>> writes, but at least as far as I can tell, it ignores the read case,
>>>> and has code to avoid the unnecessary "disable user writes" case when
>>>> there was only a read enable done.
>>>
>>> Yeah that's my understanding too.
>>>
>>> Christophe is the expert on that code so I'll defer to him if I'm wrong.
>>>
>>>> Now, it's possible that I'm wrong, but the upshot of that is that even
>>>> on powerpc, I think that if we just made the rule be that "taking a
>>>> user exception should automatically do the 'user_access_end()' for us"
>>>> is trivial.
>>>
>>> I think we can do something to make it work.
>>>
>>> We don't have an equivalent of x86's ex_handler_uaccess(), so it's not
>>> quite as easy as whacking a user_access_end() in there.
>>
>> Isn't it something easy to do in bad_page_fault()?
>
> We'd need to do it there at least.
>
> But I'm not convinced that's the only place we'd need to do it. We could
> theoretically take a machine check on a user access, and those are
> handled differently on each sub-(sub-sub)-platform, and I think all or
> most of them don't call bad_page_fault().
Indeed, it needs to be done everywhere we do
regs->nip = extable_fixup(entry)
There are half a dozen places that do that; in addition to
bad_page_fault() it's mainly machine checks, and also kprobes.
I think we can create a fixup_exception() function which takes regs and
entry as parameters and does the nip fixup and kuap closure.
>
>> Not exactly a call to user_access_end() but altering regs->kuap so that
>> user access is not restored on exception exit.
>
> Yes.
>
>>> Probably the simplest option for us is to just handle it in our
>>> unsafe_op_wrap(). I'll try and come up with something tomorrow.
>>
>> unsafe_op_wrap() is not used anymore for unsafe_put_user() as we are now
>> using asm goto.
>
> Sure, but we could change it back to use unsafe_op_wrap().
But the whole purpose of using goto in unsafe_???_user() is to allow the
use of asm goto. See explanations in commit
https://github.com/linuxppc/linux/commit/1bd4403d86a1c06cb6cc9ac87664a0c9d3413d51#diff-eba084de047bb8a9087dac10c06f44bc
>
> I did a quick hack to do that and see no difference in the generated
> code, but your commit adding put_user_goto() did show better code
> generation, so possibly it depends on compiler version, or my example
> wasn't complicated enough (filldir()).
Yes, as explained above it should remove the error checking in the
caller, so your example was most likely too trivial.
Christophe
On Fri, Jul 03, 2020 at 02:33:28AM +0100, Al Viro wrote:
> On Thu, Jul 02, 2020 at 02:55:19PM -0700, Linus Torvalds wrote:
>
> > And while XSTATE_OP() is still disgusting, it's
> >
> > (a) slightly less disgusting than it used to be
> >
> > (b) now easily fixable if we do the "exceptions clear AC" thing.
> >
> > so it's an improvement all around.
> >
> > If it works, that is. As mentioned: IT HAS NO TESTING.
>
> What about load_unaligned_zeropad()? Normally the caller doesn't
> want to know about the exception on crossing into an unmapped
> page. Blanket "clear #AC on fixup, don't go through user_access_end()
> in case of exception" would complicate the code that calls that sucker.
Actually, for a more serious problem, consider arch/x86/lib/copy_user_64.S
In case of an unhandled fault on attempt to read an (unaligned) word,
the damn thing falls back to this:
SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail)
	movl %edx,%ecx
1:	rep movsb
2:	mov %ecx,%eax
	ASM_CLAC
	ret

	_ASM_EXTABLE_UA(1b, 2b)
SYM_CODE_END(.Lcopy_user_handle_tail)
We could do what alpha, sparc et.al. are doing - have both reads and
writes aligned, with every output word being a mix of two input ones.
But I would expect that to be considerably slower than the current
variants. Sure, we can set AC in .Lcopy_user_handle_tail, but that
doesn't look right.
And while squeezing every byte on a short copy is not a hard requirement,
in situation when the source is one byte before the end of page and
destination is aligned, raw_copy_from_user() really must copy at least
one byte if it's readable.
So I suspect that we need a variant of extable entry that does not
clear AC, at least for these fallbacks.
PS: I'm still going through the _ASM_EXTABLE... users on x86, so there
might be more fun. Will post when I'm done...
On Fri, Jul 3, 2020 at 2:02 PM Al Viro <[email protected]> wrote:
>
> Actually, for more serious problem consider arch/x86/lib/copy_user_64.S
What? No.
> In case of an unhandled fault on attempt to read an (unaligned) word,
> the damn thing falls back to this:
> SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail)
> movl %edx,%ecx
> 1: rep movsb
> 2: mov %ecx,%eax
> ASM_CLAC
> ret
>
> _ASM_EXTABLE_UA(1b, 2b)
> SYM_CODE_END(.Lcopy_user_handle_tail)
In the case of "we did an unaligned word at the end of a page, we took
a fault, and now we have to start all over", the _least_ of our
problems is that part of "starting over" would now imply doing a
"stac" again.
Yeah, the "stac" instruction isn't hugely fast, and serializes the
pipeline, so it's a nasty 20 cycles or something.
But for chrissake, this
(a) happens approximately never
(b) is after a fault that took a thousand cycles
so the trivial thing to do is to just say "yeah, you need to add the
STAC when your optimistic thing failed and you have to fall back to
the byte-at-a-time tail case".
It's particularly trivial since objtool would statically find all
these places, since it would warn about the ASM_CLAC without a STAC
(that's assuming Josh's patch, of course).
Linus
On Fri, Jul 3, 2020 at 2:10 PM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, Jul 3, 2020 at 2:02 PM Al Viro <[email protected]> wrote:
> >
> > Actually, for more serious problem consider arch/x86/lib/copy_user_64.S
>
> What? No.
>
> > In case of an unhandled fault on attempt to read an (unaligned) word,
> > the damn thing falls back to this:
> > SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail)
> > movl %edx,%ecx
> > 1: rep movsb
> > 2: mov %ecx,%eax
> > ASM_CLAC
> > ret
> >
> > _ASM_EXTABLE_UA(1b, 2b)
> > SYM_CODE_END(.Lcopy_user_handle_tail)
>
> In the case of "we did an unaligned word at the end of a page, we took
> a fault, and now we have to start all over", the _least_ of our
> problems is that part of "starting over" would now imply doing a
> "stac" again.
>
> Yeah, the "stac" instruction isn't hugely fast, and serializes the
> pipeline, so it's a nasty 20 cycles or something.
>
> But for chissake, this
> (a) happens approximately never
> (b) is after a fault that took a thousand cycles
>
> so the trivial thing to do is to just say "yeah, you need to add the
> STAC when your optimistic thing failed and you have to fall back to
> the byte-at-a-time tail case".
>
> It's particularly trivial since objtool would statically find all
> these places, since it would warn about the ASM_CLAC without a STAC
> (that's assuming Josh's patch, of course).
>
I still feel like the ex_handler-automatically-does-CLAC thing is an
optimization that isn't worth it. Once we pull our heads out of the
giant pile of macros and inlined functions, we're talking about
changing:
	stac
1:	mov something, somewhere
2:	clac
	... exception table entry pointing to 2 ...
to
	stac
1:	mov something, somewhere
	clac
2:
	... exception table entry pointing to 2 ...
Now maybe the pattern:
if (get_user(...))
	goto bad;
gets a bit improved because we don't need to emit a fixup that does
clac; jmp. But on the flip side, the jump folding pattern looks
better like this:
unsafe_uaccess_begin();
if (unsafe_get_user(...))
	goto fail;
if (unsafe_get_user(...))
	goto fail;
unsafe_uaccess_end();

fail:
	unsafe_uaccess_end();
than like:
unsafe_uaccess_begin();
if (unsafe_get_user(...))
	goto fail;
if (unsafe_get_user(...))
	goto fail;
unsafe_uaccess_end();

fail:
	/* not unsafe_uaccess_end(); because unsafe_get_user() has
	   conditional-CLAC semantics */
And we really really do not want to ever see this:
int err;

unsafe_uaccess_begin();
err = unsafe_get_user(...);
if (err == 0)
	unsafe_uaccess_end(); /* WTF?!? */
So I feel like this is all putting the cart before the horse. Can we
make everything work sanely without the automatic CLAC and then, if it
actually improves something, add an optional ex_handler_uaccess_clac
for the cases that benefit? I'm totally fine with burying:
unsafe_get_user_jump(..., error_label);

error_label:
	/* we know unsafe_uaccess_end() already happened here */
in an inline function here if it meaningfully improves code
generation, but this is counter-intuitive code and I think we should
treat it as such.
--Andy
On Fri, Jul 03, 2020 at 02:10:08PM -0700, Linus Torvalds wrote:
> On Fri, Jul 3, 2020 at 2:02 PM Al Viro <[email protected]> wrote:
> >
> > Actually, for more serious problem consider arch/x86/lib/copy_user_64.S
>
> What? No.
>
> > In case of an unhandled fault on attempt to read an (unaligned) word,
> > the damn thing falls back to this:
> > SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail)
> > movl %edx,%ecx
> > 1: rep movsb
> > 2: mov %ecx,%eax
> > ASM_CLAC
> > ret
> >
> > _ASM_EXTABLE_UA(1b, 2b)
> > SYM_CODE_END(.Lcopy_user_handle_tail)
>
> In the case of "we did an unaligned word at the end of a page, we took
> a fault, and now we have to start all over", the _least_ of our
> problems is that part of "starting over" would now imply doing a
> "stac" again.
What do you mean, start over? It's picking a few remaining bytes out
of that word, *not* redoing the entire thing.
On Fri, Jul 03, 2020 at 10:59:22PM +0100, Al Viro wrote:
> On Fri, Jul 03, 2020 at 02:10:08PM -0700, Linus Torvalds wrote:
> > On Fri, Jul 3, 2020 at 2:02 PM Al Viro <[email protected]> wrote:
> > >
> > > Actually, for more serious problem consider arch/x86/lib/copy_user_64.S
> >
> > What? No.
> >
> > > In case of an unhandled fault on attempt to read an (unaligned) word,
> > > the damn thing falls back to this:
> > > SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail)
> > > movl %edx,%ecx
> > > 1: rep movsb
> > > 2: mov %ecx,%eax
> > > ASM_CLAC
> > > ret
> > >
> > > _ASM_EXTABLE_UA(1b, 2b)
> > > SYM_CODE_END(.Lcopy_user_handle_tail)
> >
> > In the case of "we did an unaligned word at the end of a page, we took
> > a fault, and now we have to start all over", the _least_ of our
> > problems is that part of "starting over" would now imply doing a
> > "stac" again.
>
> What do you mean, start over? It's picking a few remaining bytes out
> of that word, *not* redoing the entire thing.
I'm _not_ saying that it's a performance-critical place, in case that's not
obvious - just trying to head off potential confusion re what that code is
doing.
On Fri, Jul 03, 2020 at 02:10:08PM -0700, Linus Torvalds wrote:
> Yeah, the "stac" instruction isn't hugely fast, and serializes the
> pipeline, so it's a nasty 20 cycles or something.
>
> But for chissake, this
> (a) happens approximately never
> (b) is after a fault that took a thousand cycles
>
> so the trivial thing to do is to just say "yeah, you need to add the
> STAC when your optimistic thing failed and you have to fall back to
> the byte-at-a-time tail case".
Not the problem I'm concerned about, really. However, I would really
like to lift stac/clac into the *callers* of raw_copy_from_user()
et.al. and fold them into user_access_begin/user_access_end there.
And that's where the rules become very interesting - raw_copy_from_user()
is not "succeed or fail" thing, it's "tell me how much has been left
to copy" one. Put it that way - here we really do have outputs on
fault.
PS: I hope to kill __copy_from_user()/__copy_to_user() outside of
arch/* this cycle; not much is left by now. So I'm not talking about
lifting stac/clac out into the wild - it will merge with access_ok
into user_access_begin/end.
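The "tell me how much has been left to copy" convention, with the stac/clac bracket lifted into the caller, can be sketched as a userspace model. Everything here is a stand-in: user_access_begin()/user_access_end() are no-ops rather than the real STAC/CLAC-emitting kernel primitives, and fault_at is an invented knob to simulate a fault partway through the copy.

```c
#include <stddef.h>
#include <string.h>

/* Userspace stand-ins for the kernel primitives under discussion; in the
 * kernel these would check access_ok() and emit STAC/CLAC. */
static int user_access_begin(const void *ptr, size_t len)
{
	(void)ptr; (void)len;
	return 1;			/* pretend access_ok() passed */
}
static void user_access_end(void) { }

/* Model of the raw_copy_from_user() calling convention: not succeed-or-fail,
 * but "how many bytes were left uncopied" (0 on full success).  'fault_at'
 * simulates a fault that many bytes into the copy. */
static size_t raw_copy_from_user(void *dst, const void *src, size_t len,
				 size_t fault_at)
{
	size_t done = len < fault_at ? len : fault_at;

	memcpy(dst, src, done);
	return len - done;		/* bytes left to copy */
}

/* The caller owns the user_access_begin/end bracket, as proposed, and the
 * error path still has a meaningful "bytes left" output. */
static size_t copy_from_user_model(void *dst, const void *src, size_t len,
				   size_t fault_at)
{
	size_t left = len;

	if (user_access_begin(src, len)) {
		left = raw_copy_from_user(dst, src, len, fault_at);
		user_access_end();
	}
	if (left)			/* kernel zeroes the uncopied tail */
		memset((char *)dst + (len - left), 0, left);
	return left;
}
```

The point of the model is only the shape: the fault path produces an output (bytes remaining), which is exactly why "asm goto" without outputs doesn't fit these helpers.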
On Fri, Jul 03, 2020 at 02:41:43PM -0700, Andy Lutomirski wrote:
> I still feel like the ex_handler-automatically-does-CLAC thing is an
> optimization that isn't worth it. Once we pull our heads out of the
> giant pile of macros and inlined functions, we're talking about
> changing:
> clac; jmp. But on the flip side, the jump folding pattern looks
> better like this:
>
> unsafe_uaccess_begin();
> if (unsafe_get_user(...))
> goto fail;
> if (unsafe_get_user(...))
> goto fail;
> unsafe_uaccess_end();
>
> fail:
> unsafe_uaccess_end();
>
> than like:
>
> unsafe_uaccess_begin();
> if (unsafe_get_user(...))
> goto fail;
> if (unsafe_get_user(...))
> goto fail;
> unsafe_uaccess_end();
>
> fail:
> /* not unsafe_uaccess_end(); because unsafe_get_user() has
> conditional-CLAC semantics */
First of all, user_access_begin() itself can bloody well fail. So you need
to handle that as well. And then it becomes nowhere near as pretty.
We can pretend that it's normal C; however, that's not true at all - there
are shitloads of things you can't do in such areas, starting with "call anything
other than a very small list of functions". It's not a normal C environment
at all.
My problem is not with having AC turned off in exception handler - it leads
to saner patterns, no arguments here. I'm not happy with doing that
on *every* exception, with no way to specify whether it should or should not
be done. It's not like it would've cost us anything to be able to specify
that - we have the third argument of _ASM_EXTABLE_HANDLE(), after all.
On Fri, Jul 03, 2020 at 10:02:37PM +0100, Al Viro wrote:
> PS: I'm still going through the _ASM_EXTABLE... users on x86, so there
> might be more fun. Will post when I'm done...
Lovely... Not directly related to that, but... WTF?
arch/x86/lib/csum-copy_64.S:
/*
* No _ASM_EXTABLE_UA; this is used for intentional prefetch on a
* potentially unmapped kernel address.
*/
.macro ignore L=.Lignore
30:
_ASM_EXTABLE(30b, \L)
.endm
...
ignore 2f
prefetcht0 5*64(%rdi)
2:
(and no other users of 'ignore' anywhere). How could prefetcht0 possibly
raise an exception? Intel manual says that the only exception is #UD if
LOCK PREFETCHT0 is encountered; not here, obviously. AMD manual simply
says "no exceptions". Confused...
Incidentally, in the same file:
SYM_FUNC_START(csum_partial_copy_generic)
cmpl $3*64, %edx
jle .Lignore
.Lignore:
....
And it had been that way since "[PATCH] Intel x86-64 support merge" back
in 2004, where we had
@@ -59,15 +59,6 @@ csum_partial_copy_generic:
cmpl $3*64,%edx
jle .Lignore
- ignore
- prefetch (%rdi)
- ignore
- prefetch 1*64(%rdi)
- ignore
- prefetchw (%rsi)
- ignore
- prefetchw 1*64(%rsi)
-
.Lignore:
....
@@ -115,7 +106,7 @@ csum_partial_copy_generic:
movq 56(%rdi),%r13
ignore 2f
- prefetch 5*64(%rdi)
+ prefetcht0 5*64(%rdi)
2:
adcq %rbx,%rax
adcq %r8,%rax
What's going on in there? According to AMD manual, prefetch and prefetchw
can raise an exception (#UD), if
PREFETCH/PREFETCHW are not supported, as
indicated by ECX bit 8 of CPUID function
8000_0001h
Long Mode is not supported, as indicated by EDX
bit 29 of CPUID function 8000_0001h
The 3DNow! instructions are not supported, as
indicated by EDX bit 31 of CPUID function
8000_0001h.
so these at least used to make some sense, but why leave that thing at
the place where old prefetch became prefetcht0 and what is that comment
in front of 'ignore' definition about? Exceptions there had never
been about unmapped addresses - that would make no sense for prefetch.
What am I missing here?
On Fri, Jul 3, 2020 at 5:50 PM Al Viro <[email protected]> wrote:
>
> How could prefetcht0 possibly
> raise an exception? Intel manual says that the only exception is #UD if
> LOCK PREFETCHT0 is encountered; not here, obviously. AMD manual simply
> says "no exceptions". Confused...
Several CPU bugs in this area. I think they may all have been AMD.
But we've definitely had "prefetch causes page faults" errata.
Google for it. One pdf (AMD errata) I found had this:
"Software Prefetches May Report A Page Fault
Description Software prefetch instructions are defined to ignore
page faults. Under highly specific and detailed internal
circumstances, a prefetch instruction may report a page fault if both
of the following conditions are true:
• The target address of the prefetch would cause a page fault if
the address was accessed by an actual memory load or store instruction
under the current privilege mode;
• The prefetch instruction is followed in execution-order by an
actual or speculative byte-sized memory access of the same
modify-intent to the same address. PREFETCH and PREFETCHNTA/0/1/2 have
the same modify-intent as a memory load access.
PREFETCHW has the same modify-intent as a memory store access. The
page fault exception error code bits for the faulting prefetch will be
identical to that for a bytesized memory access of the same-modify
intent to the same address. Note that some misaligned accesses can be
broken up by the processor into multiple accesses where at least one
of the accesses is a byte-sized access. If the target address of the
subsequent memory access of the same modify-intent is aligned and not
byte-sized, this errata does not occur and no workaround is needed.
Potential Effect on System An unexpected page fault may occur
infrequently on a prefetch instruction."
So sadly the architecture manuals do not reflect reality.
That said, software prefetch instructions very seldom actually work.
They are only useful if you have one _very_ specific load and run on
one _very_ specific microarchitecture.
It's almost always a mistake to have them in the first place.
Linus
On Sat, Jul 04, 2020 at 01:49:59AM +0100, Al Viro wrote:
> On Fri, Jul 03, 2020 at 10:02:37PM +0100, Al Viro wrote:
>
> > PS: I'm still going through the _ASM_EXTABLE... users on x86, so there
> > might be more fun. Will post when I'm done...
>
> Lovely... Not directly related to that, but... WTF?
>
> arch/x86/lib/csum-copy_64.S:
>
> /*
> * No _ASM_EXTABLE_UA; this is used for intentional prefetch on a
> * potentially unmapped kernel address.
> */
> .macro ignore L=.Lignore
> 30:
> _ASM_EXTABLE(30b, \L)
> .endm
>
> ...
> ignore 2f
> prefetcht0 5*64(%rdi)
> 2:
>
> (and no other users of 'ignore' anywhere). How could prefetcht0 possibly
> raise an exception? Intel manual says that the only exception is #UD if
> LOCK PREFETCHT0 is encountered; not here, obviously. AMD manual simply
> says "no exceptions". Confused...
>
> Incidentally, in the same file:
> SYM_FUNC_START(csum_partial_copy_generic)
> cmpl $3*64, %edx
> jle .Lignore
>
> .Lignore:
> ....
>
> And it had been that way since "[PATCH] Intel x86-64 support merge" back
> in 2004, where we had
> @@ -59,15 +59,6 @@ csum_partial_copy_generic:
> cmpl $3*64,%edx
> jle .Lignore
>
> - ignore
> - prefetch (%rdi)
> - ignore
> - prefetch 1*64(%rdi)
> - ignore
> - prefetchw (%rsi)
> - ignore
> - prefetchw 1*64(%rsi)
> -
> .Lignore:
> ....
> @@ -115,7 +106,7 @@ csum_partial_copy_generic:
> movq 56(%rdi),%r13
>
> ignore 2f
> - prefetch 5*64(%rdi)
> + prefetcht0 5*64(%rdi)
> 2:
> adcq %rbx,%rax
> adcq %r8,%rax
>
> What's going on in there? According to AMD manual, prefetch and prefetchw
> can raise an exception (#UD), if
> PREFETCH/PREFETCHW are not supported, as
> indicated by ECX bit 8 of CPUID function
> 8000_0001h
> Long Mode is not supported, as indicated by EDX
> bit 29 of CPUID function 8000_0001h
> The 3DNow! instructions are not supported, as
> indicated by EDX bit 31 of CPUID function
> 8000_0001h.
> so these at least used to make some sense, but why leave that thing at
> the place where old prefetch became prefetcht0 and what is that comment
> in front of 'ignore' definition about? Exceptions there had never
> been about unmapped addresses - that would make no sense for prefetch.
>
> What am I missing here?
BTW, looking at csum_and_copy_{to,from}_user() callers (all 3 of them,
all in lib/iov_iter.c) we have this:
1) len is never 0
2) sum (initial value of csum) is always 0
3) failure (reported via *err_ptr) is always treated as "discard
the entire iovec segment (and possibly the entire iovec)". Exact value
put into *err_ptr doesn't matter (it's only compared to 0) and in case of
error the return value is ignored.
Now, using ~0U instead of 0 for initial sum would yield an equivalent csum
(comparable modulo 2^16-1) *AND* never yield 0 (recall how csum addition works).
IOW, we could simply return 0 to indicate an error. Which gives much saner
calling conventions:
__wsum csum_and_copy_from_user(const void __user *src, void *dst, int len)
copying the damn thing and returning 0 on error or a non-zero value comparable
to csum of the data modulo 2^16-1 on success. Same for csum_and_copy_to_user()
(modulo const and __user being on the other argument).
For x86 it simplifies the instances (both the inline wrappers and asm parts);
I hadn't checked the other architectures yet, but it looks like that should
be doable for all architectures. And it does simplify the callers...
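The ~0U trick rests on a property of end-around-carry addition: the running sum can reach zero only if both operands were zero, and 0xffffffff is congruent to 0 modulo 2^32-1, so the result stays comparable to the ordinary csum. A small model (hypothetical helper names, not the kernel's actual csum implementation):

```c
#include <stdint.h>
#include <stddef.h>

/* One's-complement 32-bit addition with end-around carry: the operation
 * the csum helpers are built on.  Values compare modulo 2^32-1, and the
 * final folded checksum compares modulo 2^16-1. */
static uint32_t csum_add(uint32_t a, uint32_t b)
{
	uint64_t s = (uint64_t)a + b;

	return (uint32_t)(s + (s >> 32));	/* fold the carry back in */
}

static uint32_t csum_words(const uint32_t *w, size_t n, uint32_t initial)
{
	uint32_t sum = initial;

	for (size_t i = 0; i < n; i++)
		sum = csum_add(sum, w[i]);
	return sum;
}

/* Fold a 32-bit one's-complement sum down to 16 bits. */
static uint16_t csum_fold(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

Starting from ~0U instead of 0, the result is equivalent modulo 2^16-1 but can never be 0, even for all-zero data, which is what frees up 0 as the error return.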
On Fri, Jul 03, 2020 at 06:54:15PM -0700, Linus Torvalds wrote:
> "Software Prefetches May Report A Page Fault
>
> Description Software prefetch instructions are defined to ignore
> page faults. Under highly specific and detailed internal
> circumstances, a prefetch instruction may report a page fault if both
> of the following conditions are true:
>
> • The target address of the prefetch would cause a page fault if
> the address was accessed by an actual memory load or store instruction
> under the current privilege mode;
>
> • The prefetch instruction is followed in execution-order by an
> actual or speculative byte-sized memory access of the same
> modify-intent to the same address. PREFETCH and PREFETCHNTA/0/1/2 have
> the same modify-intent as a memory load access.
>
> PREFETCHW has the same modify-intent as a memory store access. The
> page fault exception error code bits for the faulting prefetch will be
> identical to that for a bytesized memory access of the same-modify
> intent to the same address. Note that some misaligned accesses can be
> broken up by the processor into multiple accesses where at least one
> of the accesses is a byte-sized access. If the target address of the
> subsequent memory access of the same modify-intent is aligned and not
> byte-sized, this errata does not occur and no workaround is needed.
>
> Potential Effect on System An unexpected page fault may occur
> infrequently on a prefetch instruction."
Lovely... So basically this is the rare place where we might use those
insns on userland addresses?
On Fri, Jul 3, 2020 at 7:30 PM Al Viro <[email protected]> wrote:
>
> Lovely... So basically this is the rare place where we might use those
> insns on userland addresses?
Honestly, I think the code you quote is just confused.
First off, we have special "is this page fault due to a prefetch"
logic in the x86 page fault handler anyway.
Second, we probably shouldn't have those prefetches there in the first place.
So I think the nasty code is likely just pointless legacy. It may
exist simply because that case was the first time somebody noticed
the prefetch errata and it triggered in kernel mode. Who knows..
I'd be inclined to remove all the prefetching code from that csum
thing entirely. Most good CPU's do better prefetch pattern detection
in hardware than we can do in software.
Linus
From: Al Viro
> Sent: 04 July 2020 03:12
...
> BTW, looking at csum_and_copy_{to,from}_user() callers (all 3 of them,
> all in lib/iov_iter.c) we have this:
> 1) len is never 0
> 2) sum (initial value of csum) is always 0
> 3) failure (reported via *err_ptr) is always treated as "discard
> the entire iovec segment (and possibly the entire iovec)". Exact value
> put into *err_ptr doesn't matter (it's only compared to 0) and in case of
> error the return value is ignored.
>
> Now, using ~0U instead of 0 for initial sum would yield an equivalent csum
> (comparable modulo 2^16-1) *AND* never yield 0 (recall how csum addition works).
>
> IOW, we could simply return 0 to indicate an error. Which gives much saner
> calling conventions:
> __wsum csum_and_copy_from_user(const void __user *src, void *dst, int len)
> copying the damn thing and returning 0 on error or a non-zero value comparable
> to csum of the data modulo 2^16-1 on success. Same for csum_and_copy_to_user()
> (modulo const and __user being on the other argument).
>
> For x86 it simplifies the instances (both the inline wrappers and asm parts);
> I hadn't checked the other architectures yet, but it looks like that should
> be doable for all architectures. And it does simplify the callers...
All the csum functions should let the caller pass in a small value
to be added in, since that is normally 'free' in the algorithm -
certainly better than adding it at the end, which is what the
current x86 code does.
(For 64bit systems the 'small' value can exceed 2^32.)
I also wonder if the csum_and_copy() functions are actually worthwhile on x86.
The csum code can run at 8 bytes/clock on all Intel cpu since ivy bridge.
(It doesn't, it only does 4 bytes/clock until (IIRC) Haswell [1].)
On cpu that support ADCX/ADOX you may do better - probably 12 bytes/clock,
I think 16 bytes/clock is wishful thinking.
But there is no leeway for any extra uops in either case.
However trying to get a memory read, memory write, adc and bits of loop
control scheduled in one clock is probably impossible - even though
it might not exceed the number of uops the execution pipelines can process.
ISTR that just separating the memory read from the adc slows
things down too much - probably issues with retiring instructions.
So I don't think it can get near 8 bytes/clock.
OTOH a copy operation trivially does 8 bytes/clock.
I even think 'rep movsq' is faster - never mind the fast 'rep movsb'.
So separate copy and checksum passes should easily exceed 4 bytes/clock,
but I suspect that doing them together never does.
(Unless the buffer is too big for the L1 cache.)
[1] The underlying problem is that a uop can only have 2 inputs.
ADC needs three (two values and the carry flag).
So the ADC instruction takes two clocks.
From ivy bridge (sandy?) the carry flag is available early,
so adding to alternate registers lets you do 1 per clock.
So the existing csum function is rather slower than adding
32bit values to a 64bit register on most older cpus.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
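David's footnote - ADC's three inputs (two values plus the carry flag) serializing the chain, versus plain 64-bit adds of 32-bit values - can be modeled in scalar C. Function names are invented; these are stand-ins for the asm loops, and the two reductions are congruent modulo 2^32-1:

```c
#include <stdint.h>
#include <stddef.h>

/* ADC-chain style: one end-around-carry add per word.  Each add consumes
 * the previous add's carry, which is the loop-carried dependence that
 * costs two clocks per ADC on older CPUs. */
static uint32_t csum_adc_style(const uint32_t *w, size_t n)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t s = (uint64_t)sum + w[i];

		sum = (uint32_t)(s + (s >> 32));	/* end-around carry */
	}
	return sum;
}

/* Wide-accumulator style: add 32-bit values into a 64-bit register with no
 * flag in the loop-carried chain, then fold the deferred carries once at
 * the end.  Congruent modulo 2^32-1 with the chain above. */
static uint32_t csum_wide(const uint32_t *w, size_t n)
{
	uint64_t acc = 0;

	for (size_t i = 0; i < n; i++)
		acc += w[i];			/* no carry flag needed */
	acc = (acc & 0xffffffffu) + (acc >> 32);
	acc = (acc & 0xffffffffu) + (acc >> 32);
	return (uint32_t)acc;
}
```

The second loop is one independent add per word, so it pipelines at one iteration per clock where the ADC chain cannot.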
On Tue, Jul 7, 2020 at 5:35 AM David Laight <[email protected]> wrote:
>
>
> So separate copy and checksum passes should easily exceed 4 bytes/clock,
> but I suspect that doing them together never does.
> (Unless the buffer is too big for the L1 cache.)
It's the "touch the caches twice" that is the problem.
And it's not the "buffer is too big for L1", it's "the source, the
destination and any incidentals are too big for L1" with the
additional noise from replacement policies etc.
That said, I agree it's likely less of an issue these days when L1
sizes are bigger, and thankfully direct-mapped caches are no more. It
_used_ to be that touching the location twice was very very noticeable
in some situations, it may not be so much any more.
Linus
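For the combined-vs-separate question: both strategies compute the same checksum, and the trade-off is purely how many times the data crosses the cache. A userspace sketch (invented names; a 16-bit-word one's-complement sum stands in for the real csum, and even lengths are assumed to keep it short):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Internet-checksum-style sum over 16-bit little-endian words. */
static uint32_t csum16(const uint8_t *p, size_t len)
{
	uint64_t acc = 0;

	for (size_t i = 0; i + 1 < len; i += 2)
		acc += (uint32_t)p[i] | ((uint32_t)p[i + 1] << 8);
	acc = (acc & 0xffffffffu) + (acc >> 32);
	acc = (acc & 0xffffffffu) + (acc >> 32);
	return (uint32_t)acc;
}

/* One-pass copy+checksum: each byte is read once and written once. */
static uint32_t copy_and_csum(uint8_t *dst, const uint8_t *src, size_t len)
{
	uint64_t acc = 0;

	for (size_t i = 0; i + 1 < len; i += 2) {
		dst[i] = src[i];
		dst[i + 1] = src[i + 1];
		acc += (uint32_t)src[i] | ((uint32_t)src[i + 1] << 8);
	}
	acc = (acc & 0xffffffffu) + (acc >> 32);
	acc = (acc & 0xffffffffu) + (acc >> 32);
	return (uint32_t)acc;
}

/* Two-pass alternative: rep-movs-style copy, then a checksum pass over
 * the destination - same result, but the data is touched twice (the
 * second pass hopefully hitting L1). */
static uint32_t copy_then_csum(uint8_t *dst, const uint8_t *src, size_t len)
{
	memcpy(dst, src, len);
	return csum16(dst, len);
}
```

Which one wins is exactly the cache-footprint question above: the separate passes each run faster per byte, but only if the second pass still finds the data in L1.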
From: Linus Torvalds
> Sent: 10 July 2020 23:37
> On Tue, Jul 7, 2020 at 5:35 AM David Laight <[email protected]> wrote:
> >
> >
> > So separate copy and checksum passes should easily exceed 4 bytes/clock,
> > but I suspect that doing them together never does.
> > (Unless the buffer is too big for the L1 cache.)
>
> It's the "touch the caches twice" that is the problem.
>
> And it's not the "buffer is too big for L1", it's "the source, the
> destination and any incidentals are too big for L1" with the
> additional noise from replacement policies etc.
That's really what I meant.
L1D is actually (probably) only 32kB.
I guess that gives you 8k for the buffer.
It is a shame you can't use the AVX instructions in kernel.
(Although saving them probably costs more than the gain.)
Then you could use something based on:
10: load ymm,src+idx // 32 bytes
store ymm,tgt+idx
addq sum0,ymm // eight 32bit adds
rotate ymm,16 // Pretty sure there is an instruction for this!
addq sum1,ymm
add idx,32
jnz 10b
It is then possible to determine the correct result from sum0/sum1.
On very recent Intel cpu that might even run at 1 iteration/clock!
(Probably needs an unroll and explicit interleave.)
At one iteration every 2 clocks it matches the ADDX[OC] loop
but includes the write.
David