2012-11-09 18:29:07

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH] arch_check_bp_in_kernelspace: fix the range check

arch_check_bp_in_kernelspace() tries to avoid overflow and therefore does
two TASK_SIZE checks, but they need to be combined with OR, not AND.
Consider the case va = TASK_SIZE - 1 and len = 2.

Note: TASK_SIZE doesn't look right at least on x86, I think it should
be replaced by TASK_SIZE_MAX.

Signed-off-by: Oleg Nesterov <[email protected]>

--- x/arch/arm64/kernel/hw_breakpoint.c
+++ x/arch/arm64/kernel/hw_breakpoint.c
@@ -293,7 +293,7 @@ int arch_check_bp_in_kernelspace(struct
 	va = info->address;
 	len = get_hbp_len(info->ctrl.len);
 
-	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
+	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
 }
 
 /*
--- x/arch/arm/kernel/hw_breakpoint.c
+++ x/arch/arm/kernel/hw_breakpoint.c
@@ -464,7 +464,7 @@ int arch_check_bp_in_kernelspace(struct
 	va = info->address;
 	len = get_hbp_len(info->ctrl.len);
 
-	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
+	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
 }
 
 /*
--- x/arch/sh/kernel/hw_breakpoint.c
+++ x/arch/sh/kernel/hw_breakpoint.c
@@ -132,7 +132,7 @@ int arch_check_bp_in_kernelspace(struct
 	va = info->address;
 	len = get_hbp_len(info->len);
 
-	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
+	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
 }
 
 int arch_bp_generic_fields(int sh_len, int sh_type,
--- x/arch/x86/kernel/hw_breakpoint.c
+++ x/arch/x86/kernel/hw_breakpoint.c
@@ -200,7 +200,7 @@ int arch_check_bp_in_kernelspace(struct
 	va = info->address;
 	len = get_hbp_len(info->len);
 
-	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
+	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
 }
 
 int arch_bp_generic_fields(int x86_len, int x86_type,
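
To illustrate the case from the changelog, a stand-alone user-space sketch
(not part of the patch; TASK_SIZE is faked just to keep the example
self-contained):

#include <stdio.h>

#define TASK_SIZE 0x1000UL	/* fake value, for illustration only */

static int in_kernelspace_and(unsigned long va, unsigned long len)
{
	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
}

static int in_kernelspace_or(unsigned long va, unsigned long len)
{
	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
}

int main(void)
{
	unsigned long va = TASK_SIZE - 1, len = 2;

	/* AND: 0 (wrongly treated as pure user space),
	   OR:  1 (correctly rejected, the last byte is in kernel space) */
	printf("AND=%d OR=%d\n",
	       in_kernelspace_and(va, len), in_kernelspace_or(va, len));
	return 0;
}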


2012-11-09 18:29:44

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check

On 11/09, Oleg Nesterov wrote:
>
> Note: TASK_SIZE doesn't look right at least on x86, I think it should
> be replaced by TASK_SIZE_MAX.
> ...
> --- x/arch/x86/kernel/hw_breakpoint.c
> +++ x/arch/x86/kernel/hw_breakpoint.c
> @@ -200,7 +200,7 @@ int arch_check_bp_in_kernelspace(struct
> va = info->address;
> len = get_hbp_len(info->len);
>
> - return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
> + return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);

But actually I'd like to ask another question.

Can't we add the additional !in_gate_area_no_mm() check to allow
the hw breakpoints in vsyscall?

Quoting Amnon:

My service needs to catch the system-calls of its traced son.
Almost all system-calls are caught with PTRACE_SYSCALL, but not those
using the vsyscall page - especially "gettimeofday()" and "time()".

...

However, the code in "arch/x86/kernel/ptrace.c" forbids catching kernel
addresses.

I tend to agree with Amnon...

What do you think?

Oleg.

2012-11-19 17:47:10

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check

On 11/09, Oleg Nesterov wrote:
>
> On 11/09, Oleg Nesterov wrote:
> >
> > Note: TASK_SIZE doesn't look right at least on x86, I think it should
> > be replaced by TASK_SIZE_MAX.
> > ...
> > --- x/arch/x86/kernel/hw_breakpoint.c
> > +++ x/arch/x86/kernel/hw_breakpoint.c
> > @@ -200,7 +200,7 @@ int arch_check_bp_in_kernelspace(struct
> > va = info->address;
> > len = get_hbp_len(info->len);
> >
> > - return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
> > + return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
>
> But actully I'd like to ask another question.
>
> Can't we add the additional !in_gate_area_no_mm() check to allow
> the hw breakpoints in vsyscall?
>
> Quoting Amnon:
>
> My service needs to catch the system-calls of its traced son.
> Almost all system-calls are caught with PTRACE_SYSCALL, but not those
> using the vsyscall page - especially "gettimeofday()" and "time()".
>
> ...
>
> However, the code in "arch/x86/kernel/ptrace.c" forbids catching kernel
> addresses.
>
> I tend to agree with Amnon...
>
> What do you think?

ping ;)

I agree the patch I sent is very minor (although it looks like a trivial
bugfix to me).

It's just that I think Amnon has a point: shouldn't we change this check as
below (on top of this fixlet)?

Oleg.

--- x/arch/x86/kernel/hw_breakpoint.c
+++ x/arch/x86/kernel/hw_breakpoint.c
@@ -188,6 +188,11 @@ static int get_hbp_len(u8 hbp_len)
 	return len_in_bytes;
 }
 
+static inline bool bp_in_gate(unsigned long start, unsigned long end)
+{
+	return in_gate_area_no_mm(start) && in_gate_area_no_mm(end);
+}
+
 /*
  * Check for virtual address in kernel space.
  */
@@ -200,7 +205,8 @@ int arch_check_bp_in_kernelspace(struct
 	va = info->address;
 	len = get_hbp_len(info->len);
 
-	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
+	return ((va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE)) &&
+		!bp_in_gate(va, va + len - 1);
 }
 
 int arch_bp_generic_fields(int x86_len, int x86_type,

2012-11-19 18:25:05

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check

On Mon, 2012-11-19 at 18:47 +0100, Oleg Nesterov wrote:
> On 11/09, Oleg Nesterov wrote:
> >
> > On 11/09, Oleg Nesterov wrote:
> > >
> > > Note: TASK_SIZE doesn't look right at least on x86, I think it should
> > > be replaced by TASK_SIZE_MAX.
> > > ...
> > > --- x/arch/x86/kernel/hw_breakpoint.c
> > > +++ x/arch/x86/kernel/hw_breakpoint.c
> > > @@ -200,7 +200,7 @@ int arch_check_bp_in_kernelspace(struct
> > > va = info->address;
> > > len = get_hbp_len(info->len);
> > >
> > > - return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
> > > + return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
> >
> > But actully I'd like to ask another question.
> >
> > Can't we add the additional !in_gate_area_no_mm() check to allow
> > the hw breakpoints in vsyscall?
> >
> > Quoting Amnon:
> >
> > My service needs to catch the system-calls of its traced son.
> > Almost all system-calls are caught with PTRACE_SYSCALL, but not those
> > using the vsyscall page - especially "gettimeofday()" and "time()".
> >
> > ...
> >
> > However, the code in "arch/x86/kernel/ptrace.c" forbids catching kernel
> > addresses.
> >
> > I tend to agree with Amnon...
> >
> > What do you think?
>
> ping ;)
>
> I agree the patch I sent is very minor (although it looks like the trivial
> bugfix to me).
>
> Just I think Amnon has a point, shouldn't we change this change like below?
> (on top of this fixlet).
>
> Oleg.
>
> --- x/arch/x86/kernel/hw_breakpoint.c
> +++ x/arch/x86/kernel/hw_breakpoint.c
> @@ -188,6 +188,11 @@ static int get_hbp_len(u8 hbp_len)
> return len_in_bytes;
> }
>
> +static inline bool bp_in_gate(unsigned long start, unsigned long end)
> +{
> + return in_gate_area_no_mm(start) && in_gate_area_no_mm(end);
> +}
> +
> /*
> * Check for virtual address in kernel space.
> */
> @@ -200,7 +205,8 @@ int arch_check_bp_in_kernelspace(struct
> va = info->address;
> len = get_hbp_len(info->len);
>
> - return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
> + return ((va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE)) &&
> + !bp_in_gate(va, va + len - 1);

So we want to allow non-privileged users to add a breakpoint in a vsyscall
area even if it happens to lie in kernel space?

Is this really what we want? I mean, isn't syscall tracing in the kernel
done by a flag set in the task's structure? If the task has the flag set,
then it takes the slow (ptrace) path, which handles the breakpoint for the
user.

But here, there's no prejudice between tasks. All tasks will now hit the
breakpoint regardless of whether they are being traced or not.

-- Steve


> }
>
> int arch_bp_generic_fields(int x86_len, int x86_type,

2012-11-20 10:33:46

by u3557

[permalink] [raw]
Subject: Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check

Dear Steve,

> But here, there's no prejudice between tasks. All tasks will now hit the
> breakpoint regardless of if it is being traced or not.

Just to clarify, there is no intention to allow conventional breakpoints
in the vsyscall page - that would indeed be a disaster affecting all other
processes.

The aim of this patch is to allow ptracers to set the x86 debug-registers
of their traced process, so that it stops when it reaches the entry points
of those (few) system-calls that are in the vsyscall page. Note that those
debug registers are swapped at context switch anyway, so no other processes
are affected.
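
For illustration only, a minimal sketch of the ptracer side (assuming
x86_64, a tracee already stopped under ptrace, and that the kernel-space
check is relaxed as proposed; using 0xffffffffff600000 as the vsyscall
entry is an assumption about the kernel at hand):

#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

static int arm_vsyscall_bp(pid_t pid, unsigned long addr)
{
	/* DR7: bit 0 enables DR0 locally; R/W0=00 (execute), LEN0=00 (1 byte) */
	unsigned long dr7 = 0x1;

	if (ptrace(PTRACE_POKEUSER, pid,
		   (void *)offsetof(struct user, u_debugreg[0]),
		   (void *)addr) == -1)
		return -1;
	return ptrace(PTRACE_POKEUSER, pid,
		      (void *)offsetof(struct user, u_debugreg[7]),
		      (void *)dr7);
}

The ptracer would then call e.g. arm_vsyscall_bp(pid, 0xffffffffff600000UL)
and wait for the resulting SIGTRAP, adjusting the tracee's program-counter
and emulating the call.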

> Is this really what we want? I mean, isn't syscall tracing in the kernel
> done by a flag set to the task's structure. If the task has a flag set,
> then it does the slow (ptrace) path which handles the breakpoint for the
> user.

Most system-calls can be trapped by the PTRACE_SYSCALL option (and the
later PTRACE_SYSEMU), but those few system-calls on the vsyscall page
defy the intention to trap ALL system-calls. They also cannot be
trapped by inserting a trap instruction (as that, if allowed, would
break all other processes), hence the only alternative left is this
patch, which fixes the oversight in the design of
PTRACE_SYSCALL/PTRACE_SYSEMU in the presence of a vsyscall page.

Best Regards,
Amnon.


2012-11-20 15:48:07

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check

On 11/19, Steven Rostedt wrote:
>
> On Mon, 2012-11-19 at 18:47 +0100, Oleg Nesterov wrote:
> >
> > Just I think Amnon has a point, shouldn't we change this change like below?
> > (on top of this fixlet).
> >
> > Oleg.
> >
> > --- x/arch/x86/kernel/hw_breakpoint.c
> > +++ x/arch/x86/kernel/hw_breakpoint.c
> > @@ -188,6 +188,11 @@ static int get_hbp_len(u8 hbp_len)
> > return len_in_bytes;
> > }
> >
> > +static inline bool bp_in_gate(unsigned long start, unsigned long end)
> > +{
> > + return in_gate_area_no_mm(start) && in_gate_area_no_mm(end);
> > +}
> > +
> > /*
> > * Check for virtual address in kernel space.
> > */
> > @@ -200,7 +205,8 @@ int arch_check_bp_in_kernelspace(struct
> > va = info->address;
> > len = get_hbp_len(info->len);
> >
> > - return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
> > + return ((va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE)) &&
> > + !bp_in_gate(va, va + len - 1);
>
> So we want to allow non privileged to add a breakpoint in a vsyscall
> area even if it happens to lay in kernel space?

Yes,

> Is this really what we want?

I am not sure, that is why I am asking.

Hmm... I have to admit, I didn't even know this code was completely
rewritten, and long ago... and in the default mode it works via
emulate_vsyscall() from the page fault, IOW not in user_mode().

> I mean, isn't syscall tracing in the kernel
> done by a flag set to the task's structure. If the task has a flag set,
> then it does the slow (ptrace) path which handles the breakpoint for the
> user.

Well yes, with the new implementation __vsyscall_page simply does a
syscall, so I guess (say) strace will "just work". But, afaics, only
if vsyscall_mode == NATIVE.
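
(For reference, and if I read the new code correctly, the mode is chosen
at boot time, EMULATE being the default; something like

	vsyscall=native

on the kernel command line is needed for the above to apply.)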

So perhaps bp in vsyscall_page still makes sense...

> But here, there's no prejudice between tasks. All tasks will now hit the
> breakpoint regardless of if it is being traced or not.
^^^^^^^^^^

This is hardware breakpoint, it is per-task.

Oleg.

2012-11-20 15:55:59

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check

On Tue, 2012-11-20 at 16:48 +0100, Oleg Nesterov wrote:

> > But here, there's no prejudice between tasks. All tasks will now hit the
> > breakpoint regardless of if it is being traced or not.
> ^^^^^^^^^^
>
> This is hardware breakpoint, it is per-task.

I forgot that hw breakpoints are swapped out at context switch (as Amnon
informed me).

OK, that makes a big difference.

Thanks,

-- Steve

2012-11-20 18:32:25

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check

On 11/20, Oleg Nesterov wrote:
>
> On 11/19, Steven Rostedt wrote:
> >
> > Is this really what we want?
>
> I am not sure, that is why I am asking.

Yes.

And,

> Well yes, with the new implementation __vsyscall_page simply does
> syscall, so I guess (say) strace will "just work". But, afaics, only
> if vsyscall_mode == NATIVE.
>
> So perhaps bp in vsyscall_page still makes sense...

Or. Perhaps we can define TRAP_VSYSCALL and change emulate_vsyscall() to do


	if (current->ptrace && test_thread_flag(TIF_SYSCALL_TRACE))
		send_sigtrap(TRAP_VSYSCALL, ...);

if it returns true?

This way the debugger doesn't need to play with the debug registers,
it can use ptrace(PTRACE_SYSCALL). The tracee will report SIGTRAP after
the vsyscall, and the debugger can detect this case by looking at
info->si_code.
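
Roughly, the debugger side could then look like this (just a sketch;
TRAP_VSYSCALL is a made-up value here, it does not exist yet):

#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

#define TRAP_VSYSCALL	6	/* made-up value, for this sketch only */

static void trace_loop(pid_t pid)
{
	siginfo_t si;
	int status;

	for (;;) {
		/* resume until the next syscall stop (tracee already attached/stopped) */
		ptrace(PTRACE_SYSCALL, pid, NULL, NULL);
		if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
			break;
		if (WSTOPSIG(status) == SIGTRAP &&
		    ptrace(PTRACE_GETSIGINFO, pid, NULL, &si) == 0 &&
		    si.si_code == TRAP_VSYSCALL) {
			/* the tracee went through emulate_vsyscall();
			   report/emulate gettimeofday()/time()/getcpu() here */
		}
	}
}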

Yes, the tracee won't report _before_ it calls the function, but perhaps
this is enough? Amnon?

Oleg.

2012-11-20 23:16:52

by u3557

[permalink] [raw]
Subject: Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check

Hi Oleg,

> Or. Perhaps we can define TRAP_VSYSCALL and change emulate_vsyscall() to
> do
>
>
> if (current->ptrace && test_thread_flag(TIF_SYSCALL_TRACE))
> send_sigtrap(TRAP_VSYSCALL, ...);
>
> if it returns true?
>

I wish it were possible, but the vsyscall page is entered in user-mode,
where it is not possible to read either "current" or the per-thread flags.

One nice alternative, however, is to take away the execute permission from
the vsyscall page of the traced process, so it will incur a SIGSEGV (which
the ptracer can easily handle and cancel).

The drawbacks of this nice-to-have alternative are:
1) It may require a bigger patch.
2) It needs to use a separate ptrace option, because current applications
that use PTRACE_SYSCALL (or PTRACE_SYSEMU), including "strace", will not
be aware of this new feature, so they can be broken.

> Yes, the tracee won't report _before_ it calls the function, but perhaps
> this is enough? Amnon?

The vsyscall page was designed to avoid user/kernel context switches,
which is possible when a system-call only reads global kernel
information without modifying any. In most cases therefore (with
a few exceptions, such as when the hardware clock is broken or
misconfigured), system-calls in the vsyscall page never invoke the kernel
at all - so no trap would be generated either before or after.

Best Regards,
Amnon.


2012-11-21 14:16:11

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check

Hi Amnon,

Please read my previous email ;)
http://marc.info/?l=linux-kernel&m=135342649119153

On 11/21, [email protected] wrote:
>
> Hi Oleg,
>
> > Or. Perhaps we can define TRAP_VSYSCALL and change emulate_vsyscall() to
> > do
> >
> >
> > if (current->ptrace && test_thread_flag(TIF_SYSCALL_TRACE))
> > send_sigtrap(TRAP_VSYSCALL, ...);
> >
> > if it returns true?
> >
>
> I wish it were possible, but the vsyscall page is entered in user-mode,

Only in NATIVE mode. emulate_vsyscall() runs in kernel mode.

And in the NATIVE mode PTRACE_SYSCALL should work just fine, because:

> The vsyscall page was designed in order to avoid user/kernel context
> switches,

True, it was. But not today. Please look at __vsyscall_page:

__vsyscall_page:

	mov $__NR_gettimeofday, %rax
	syscall
	ret

If you want the "fast" sys_time() without entering the kernel, you can
use __vdso_time(). And since vdso has the user-space mapping you can
insert "int3" or use hw breakpoints.

At least this is my understanding after I glanced at the new implementation.


However, it is not that I think TRAP_VSYSCALL is a really good idea.
At least it would need another option...

Oleg.

2012-11-21 17:30:47

by u3557

[permalink] [raw]
Subject: Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check

Hi Oleg,

Yes, I can see that "arch/x86/kernel/vsyscall_64.c"
has changed dramatically since I last looked at it.

Since this is the case, I no longer need to trap the vsyscall page.

Now, however, that "vsyscall" has effectively been replaced by the vdso,
a new problem arises for me and probably for anyone else who uses
some form of checkpoint/restore:

Suppose a process is checkpointed because the system needs to reboot
for a kernel-upgrade, then restored on the new and different kernel.
The restored VDSO page may no longer match the new kernel - it could for
example fetch data from addresses in the vsyscall page that now
contain different things; or, in case the hardware was also changed,
it may use machine instructions that are now illegal.

As I don't mind foregoing the "fast" sys_time(), my obvious solution
is to disable the vdso for traced processes that may be checkpointed.

One way to do it would be by brute-force: straight after "execve"
unmap the tracee's vdso page, then manipulate the ELF tables in
its memory so the VDSO entry is gone and the library will not go
looking for it. Alternately, the function-table within the VDSO
page can be erased.

I just wonder whether you know of an easier and more standard way
to disable the vdso in user-mode - ideally on a per-process basis,
or otherwise, if it's too hard, on the whole computer. I searched
the web and found references to "/proc/sys/vm/vdso_enable", but I
have no such file or "sysctl" option on my system.

Best Regards,
Amnon.



2012-11-22 18:29:07

by Oleg Nesterov

[permalink] [raw]
Subject: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)

On 11/22, Amnon Shiloh wrote:
>
> Now however, that "vsyscall" was effectively replaced by vdso, it
> creates a new problem for me and probably for anyone else who uses
> some form of checkpoint/restore:

Oh, sorry, I can't help here. I can only add Cyrill and Pavel, they
seem to enjoy trying to solve the c/r problems.

> Suppose a process is checkpointed because the system needs to reboot
> for a kernel-upgrade, then restored on the new and different kernel.
> The new VDSO page may no longer match the new kernel - it could for
> example fetch data from addresses in the vsyscall page that now
> contain different things; or in case the hardware also was changed,
> it may use machine-instructions that are now illegal.

Sure. You shouldn't try to save/restore these page(s) directly. But
I do not really understand why you need to. IOW, I don't really
understand the problem, it depends on what c/r actually does.

> As I don't mind to forego the "fast" sys_time(), my obvious solution
> is to disable the vdso for traced processes that may be checkpointed.
>
> One way to do it would be by brute-force: straight after "execve"
> unmap the tracee's vdso page,

Not sure this will always be possible. For example, my (old) glibc
assumes that vsyscall() must work; I won't be surprised if some time
later it won't work without the vdso either. But again, I do not know.

> then manipulate the ELF tables in
> its memory so the VDSO entry is gone and the library will not go
> looking for it.

Probably it would be enough to simply erase the AT_SYSINFO_EHDR note,
but again, I can easily be wrong.
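
For example, userspace finds the vdso through that note via the auxiliary
vector; a quick check (getauxval() needs glibc >= 2.16, afaics):

#include <stdio.h>
#include <sys/auxv.h>

int main(void)
{
	/* AT_SYSINFO_EHDR is what the dynamic linker uses to locate the
	   vdso ELF header; 0 would mean "no vdso visible to this process" */
	printf("vdso ELF header at %#lx\n", getauxval(AT_SYSINFO_EHDR));
	return 0;
}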

> I just wonder whether you know of an easier and more standard way
> to disable the vdso in user-mode

Only the kernel parameter, afaics. vdso=0

> - ideally on a per-process basis,
> or otherwise, if it's too hard, on the whole computer. I searched
> the web and found references to "/proc/sys/vm/vdso_enable", but I
> have no such file or "sysctl" option on my system.

sys/vm/vdso_enabled, but only if CONFIG_X86_32 for some reason. See
kernel/sysctl.c

Oleg.

2012-11-22 20:57:25

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)

On 11/22/2012 08:12 PM, Oleg Nesterov wrote:
> On 11/22, Amnon Shiloh wrote:
>>
>> Now however, that "vsyscall" was effectively replaced by vdso, it
>> creates a new problem for me and probably for anyone else who uses
>> some form of checkpoint/restore:
>
> Oh, sorry, I can't help here. I can only add Cyrill and Pavel, they
> seem to enjoy trying to solve the c/r problems.

Thank you :)

>> Suppose a process is checkpointed because the system needs to reboot
>> for a kernel-upgrade, then restored on the new and different kernel.
>> The new VDSO page may no longer match the new kernel - it could for
>> example fetch data from addresses in the vsyscall page that now
>> contain different things; or in case the hardware also was changed,
>> it may use machine-instructions that are now illegal.

If we could make the VDSO entry points not move across kernels (IOW, make
them look like yet another syscall table), this would help, I suppose.

> Sure. You shouldn't try to save/restore this page(s) directly. But
> I do not really understand why do you need. IOW, I don't really
> understand the problem, it depends on what c/r actually does.

Think about it like this -- you stop a process, then change the kern^w VDSO
page. Everything should work as it used to be :)

>> As I don't mind to forego the "fast" sys_time(), my obvious solution
>> is to disable the vdso for traced processes that may be checkpointed.

This is a very poor solution from my POV. Nobody wants to have their
applications work fast only until they're checkpointed.

Thanks,
Pavel

2012-11-23 00:20:25

by u3557

[permalink] [raw]
Subject: Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)

Hi Pavel,

> >>
> >> Now however, that "vsyscall" was effectively replaced by vdso, it
> >> creates a new problem for me and probably for anyone else who uses
> >> some form of checkpoint/restore:
> >
> > Oh, sorry, I can't help here. I can only add Cyrill and Pavel, they
> > seem to enjoy trying to solve the c/r problems.
>
> Thank you :)

Thank YOU for joining!

> >> Suppose a process is checkpointed because the system needs to reboot
> >> for a kernel-upgrade, then restored on the new and different kernel.
> >> The new VDSO page may no longer match the new kernel - it could for
> >> example fetch data from addresses in the vsyscall page that now
> >> contain different things; or in case the hardware also was changed,
> >> it may use machine-instructions that are now illegal.
>
> If we could make VDSO entry points not move across the kernels (iow, make
> them looks as yet another syscall table) this would help, I suppose.

It will indeed solve PART of the problem, but there is one more issue:

One obviously cannot c/r a process while it runs in the VDSO page
without c/r'ing that page itself, but this can probably be handled
by single-stepping the process until it is out of that page (assuming
there are no sleeps, pauses or extremely long loops on that page) -
but suppose a caught signal interrupts the VDSO code and the process
needs to be checkpointed within that signal-handler code - eventually it
will return ("sigreturn") to the VDSO page... a different page...
and probably land on the wrong machine instruction (or even between
machine instructions), with all registers scrambled anyway.

The solution could be to hold all caught signals while in the VDSO page.
This is not something the application (or library) can reasonably do,
due to the prohibitive cost of "sigprocmask()" before and after, which
would defeat the whole purpose of the VDSO page, but it could be achieved
by some new 'prctl' option (or perhaps even be the default).

In my specific case, because the checkpointed process is ptraced,
and assuming the VDSO entry points are fixed, the ptracer can postpone
all caught signals that occur within the VDSO page, but for others
who write/maintain a c/r package, that's probably not an option.
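
Roughly, the ptracer side of that postponement would be (a sketch,
x86_64 only; vdso_start/vdso_end are assumed to come from
/proc/<pid>/maps, and only one deferred signal is remembered here):

#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

static int pending_sig;	/* naive: remembers a single deferred signal */

static void resume(pid_t pid, int sig,
		   unsigned long vdso_start, unsigned long vdso_end)
{
	struct user_regs_struct regs;

	ptrace(PTRACE_GETREGS, pid, NULL, &regs);
	if (sig && regs.rip >= vdso_start && regs.rip < vdso_end) {
		/* deliver it only after the tracee has left the vdso */
		pending_sig = sig;
		sig = 0;
	}
	ptrace(PTRACE_CONT, pid, NULL, (void *)(long)sig);
}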

>
> > Sure. You shouldn't try to save/restore this page(s) directly. But
> > I do not really understand why do you need. IOW, I don't really
> > understand the problem, it depends on what c/r actually does.
>
> Think about it like this -- you stop a process, then change the kern^w VDSO
> page. Everything should work as it used to be :)

There are two reasons one may need to save/restore this page:
1) Entry points are not fixed (yet).
2) In case the process needs to return back to it from an interrupt.

>
> >> As I don't mind to forego the "fast" sys_time(), my obvious solution
> >> is to disable the vdso for traced processes that may be checkpointed.
>
> This is very poor solution from my POV. Nobody wants to have their applications
> work fast only until it's checkpointed.

I know, but it's a price I must and am willing to pay until a solution
is found that prevents catching signals within the VDSO page.

I made a small experiment and just zeroed out the whole VDSO page
straight after "execve" (brute force, easier than having to study
the internal format of the VDSO page). The program worked, using
the glibc version of "gettimeofday()" instead (which used "vsyscall",
but probably for not much longer).
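
In case it helps anyone, the experiment was essentially this (a sketch;
it assumes the tracee is stopped under ptrace right after execve and
that the [vdso] range was taken from /proc/<pid>/maps):

#include <sys/ptrace.h>
#include <sys/types.h>

static void zero_vdso(pid_t pid, unsigned long start, unsigned long end)
{
	unsigned long addr;

	/* overwrite the tracee's vdso mapping word by word with zeroes */
	for (addr = start; addr < end; addr += sizeof(long))
		ptrace(PTRACE_POKEDATA, pid, (void *)addr, (void *)0UL);
}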

So consider my immediate personal problem solved - what I'll do next
is to compile a special temporary kernel with all vdso functions
(__vdso_gettimeofday, __vdso_time, __vdso_clock_gettime, __vdso_getcpu)
reduced to system-calls, so they become kernel/hardware-independent,
then I'll save and set aside the resulting VDSO page and always replace
original VDSO pages with "my-vdso" after "execve".

However, this doesn't solve the problem for other c/r packages that do
not ptrace their processes all the time, and are therefore unable to
replace the VDSO page immediately after each "execve".
For them you will need to either:

1) fix the VDSO entry points + introduce a kernel feature to prevent
catching signals within the VDSO page (probably a new prctl,
or make it the default); or
2) introduce a kernel feature (probably a new prctl, so long as
it is not reset across fork/clone/exec) for those programs that
request it to load a "slow-but-sure", kernel/hardware-independent
version of the VDSO page.



Thank you and Best Regards,
Amnon.

2012-11-23 09:14:58

by u3557

[permalink] [raw]
Subject: Re: arch_check_bp_in_kernelspace: fix the range check

Dear Oleg,

After the detour into the VDSO page, which is really a separate issue,
but one which I was able to work around in user-mode (for my particular
needs, not yet for all others who write checkpoint/restore packages),
I am sad to report that despite my previous post, the "vsyscall" problem
was not solved after all.

I made some further tests, without the fix that allows a hardware
breakpoint in the kernel, hoping that fixing the VDSO issue would solve
it all, but alas, current libraries still call the vsyscall page even
when the VDSO page is present.

My dynamic glibc library calls the VDSO version of "gettimeofday()",
but still uses the vsyscall version of "time()", while the static library
uses vsyscall for both.

What I discovered now, is that PTRACE_SYSCALL (also PTRACE_SINGLESTEP)
does not work within the vsyscall page, so I cannot trap the kernel-calls
there (this is very simple to verify using "gdb" or "strace").

I suspect that statically-linked executables will be around for a while,
long after the rest of the glibc library moves to VDSO, hence I still need
to place hardware breakpoints in the vsyscall page.

The necessary patch was already discussed and is very simple.

Or, there is an alternative: if only I (the ptracer or the traced process)
were allowed to munmap the vsyscall page, just get rid of that page
altogether, or at least make it non-executable, that would be fine
for me too, because then the process would get a SIGSEGV signal, which
the ptracer can easily handle.

Best Regards,
Amnon.


2012-11-23 16:33:08

by Oleg Nesterov

[permalink] [raw]
Subject: Re: arch_check_bp_in_kernelspace: fix the range check

Hello Amnon,

I am a bit confused,

On 11/23, Amnon Shiloh wrote:
>
> What I discovered now, is that PTRACE_SYSCALL (also PTRACE_SINGLESTEP)
> does not work within the vsyscall page, so I cannot trap the kernel-calls
> there (this is very simple to verify using "gdb" or "strace").

Sure, but we already discussed this?

Once again, PTRACE_SYSCALL should work in the NATIVE mode. Obviously it
won't work in EMULATE mode but we can change emulate_vsyscall() to report
TRAP_VSYSCALL or even introduce PTRACE_EVENT_VSYSCALL.

> The necessary patch was already discussed and is very simple.

Do you mean TRAP_VSYSCALL/PTRACE_EVENT_VSYSCALL above or additional
in_gate_area_no_mm() check to allow the hw bp?

> Or, there is an alternative: if only I (the ptracer or the traced process)
> was allowed to munmap the vsyscall page,

It is not possible to unmap it. The kernel (swapper_pg_dir) has this
mapping, not the process. Unlike vdso. IOW, you can only "unmap" it
globally and obviously you can't do this from userspace.

Oleg.

2012-11-23 17:05:41

by Oleg Nesterov

[permalink] [raw]
Subject: Re: arch_check_bp_in_kernelspace: fix the range check

forgot to mention...

On 11/23, Oleg Nesterov wrote:
>
> On 11/23, Amnon Shiloh wrote:
> >
> > Or, there is an alternative: if only I (the ptracer or the traced process)
> > was allowed to munmap the vsyscall page,
>
> It is not possible to unmap it. The kernel (swapper_pg_dir) has this
> mapping, not the process. Unlike vdso. IOW, you can only "unmap" it
> globally and obviously you can't do this from the userspace.

And even if this were possible, this can't help. Please look at
__bad_area_nosemaphore()->emulate_vsyscall(), the process won't get
SIGSEGV. IOW, in fact EMULATE already "unmaps" this page (sets _NX)
to trigger the fault.

Sure, we can do something like below, but it doesn't look very nice
either.

Oleg.

--- x/arch/x86/mm/fault.c
+++ x/arch/x86/mm/fault.c
@@ -744,7 +744,8 @@ __bad_area_nosemaphore(struct pt_regs *r
 	 */
 	if (unlikely((error_code & PF_INSTR) &&
 		     ((address & ~0xfff) == VSYSCALL_START))) {
-		if (emulate_vsyscall(regs, address))
+		if (!(tsk->ptrace & PTRACE_O_DONTEMULATE) &&
+		    emulate_vsyscall(regs, address))
 			return;
 	}
 #endif

2012-11-23 17:42:41

by Oleg Nesterov

[permalink] [raw]
Subject: Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)

On 11/23, Pavel Emelyanov wrote:
>
> On 11/22/2012 08:12 PM, Oleg Nesterov wrote:
>
> > Sure. You shouldn't try to save/restore this page(s) directly. But
> > I do not really understand why do you need. IOW, I don't really
> > understand the problem, it depends on what c/r actually does.
>
> Think about it like this -- you stop a process, then change the kern^w VDSO
> page. Everything should work as it used to be :)

I understand. But again, this depends on how you restore. For example,
if you do not "transfer" libc/ld code, then probably vdso is not a
problem. You just need to ensure the task was not stopped "inside" the vdso.
Yes, yes, I understand that this is too limited, you probably want to
transfer "everything".

Oleg.

2012-11-23 17:45:24

by Oleg Nesterov

[permalink] [raw]
Subject: Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)

Amnon,

I am going to "ignore" this thread because this is not my area and
I can't help anyway. Just one note:

On 11/23, Amnon Shiloh wrote:
>
> The solution can be to hold all catched signals while in the VDSO page.
> ...
>
> 1) + introduce a kernel feature to prevent
> catching signals within the VDSO page (probably a new prctl,
> or make it the default)

Sorry, never ;)

Oleg.

2012-11-24 12:47:39

by u3557

[permalink] [raw]
Subject: Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)

Hi Oleg,

> Amnon,
>
> I am going to "ignore" this thread because this is not my area and
> I can't help anyway. Just one note:
>
> On 11/23, Amnon Shiloh wrote:
> >
> > The solution can be to hold all catched signals while in the VDSO page.
> > ...
> >
> > 1) + introduce a kernel feature to prevent
> > catching signals within the VDSO page (probably a new prctl,
> > or make it the default)
>
> Sorry, never ;)
>
> Oleg.

It's OK with me because I have already found a workaround that
works for me, but I suspect that other people who write checkpoint/restore
packages may not be able to use my solution and so they will have a problem
with interrupts occurring within the VDSO page.

I therefore suggested an alternative solution for all such systems where
applications can be checkpointed on one kernel and restarted on another:
allow the user to ask for an ultra-compatible VDSO version, which would
be exactly the same on all kernels (from a given point in time) and all
kernel configurations, even if it means a loss of performance.

It could even be a kernel configuration option: CONFIG_ULTRA_COMPAT_VDSO,
but ideally it should be the user's choice.

Best Regards,
Amnon.

2012-11-24 13:45:15

by u3557

[permalink] [raw]
Subject: Re: arch_check_bp_in_kernelspace: fix the range check

Hi Oleg,

> Hello Amnon,
>
> I am a bit confused,

So let's get things in order.

1) I asked for the ability to set hardware breakpoints on the vsyscall
page (x86 debug registers), so that the ptracer can stop the process
whenever it attempts to jump there, then the ptracer can emulate those
system calls instead (gettimeofday, time, getcpu).

That would solve all my problems, because the traced process will
never even enter the vsyscall page (the ptracer will adjust its
program-counter).

2) I was then told (in my own words): "oh, don't worry, the vsyscall page
has now been minimized, all it contains now is *real* system calls,
and it always calls them".

[as a side-issue I was introduced to the new VDSO, had some issues there
and solved them separately, so we are back on the original topic]

3) I was thinking to myself - well, that's fine, if the vsyscall now
always invokes a *real* system-call (and nothing else), then the
ptracer can catch it just like any other system-call using
PTRACE_SYSCALL (or PTRACE_SYSEMU), and emulate it as usual,
vsyscall-or-no-vsyscall.

4) I made some tests and found that I was wrong in my assumption:
PTRACE_SYSCALL does NOT work within the vsyscall page (nor does
PTRACE_SINGLESTEP). Ptracers are not even aware that their tracee
ever issued a system call there (despite using PTRACE_SYSCALL or
PTRACE_SYSEMU), so they are unable to emulate it (or even to report
it, in the case of "strace").

5) Therefore, I still need the original feature - to relax
"arch_check_bp_in_kernelspace()", or whatever else will allow me
to set the x86 debug-registers to trap all attempts to enter the
vsyscall page.

6) I just suggested an alternative: to have the whole vsyscall page
removed on a per-process basis. I accept your reply that this is
not possible.

7) I suggested a third alternative: to have the vsyscall page be
unexecutable on a per-process basis, so attempts to use it will
incur SIGSEGV. I understand that this option is still under
discussion.

8) Any solution that allows a ptracer to prevent its traced process
from entering the vsyscall page and executing system-calls there
unchecked (thus in effect escaping its jailer) would do for me.

Best Regards,
Amnon.



2012-11-24 14:15:01

by u3557

[permalink] [raw]
Subject: Re: arch_check_bp_in_kernelspace: fix the range check

Hi Oleg,

This patch may look ugly, but it is one way to solve my problem.

This way, "strace" too, which has been broken since the introduction of
the vsyscall page, will again be able to report when the program
calls "time()" or "gettimeofday()" - currently it cannot!

I think that allowing to set the x86 debug-registers to the
vsyscall page is more elegant - but do whatever you prefer.

Best Regards,
Amnon.




2012-11-25 22:55:36

by Oleg Nesterov

[permalink] [raw]
Subject: Re: arch_check_bp_in_kernelspace: fix the range check

On 11/25, Amnon Shiloh wrote:
>
> 2) I was then told (in my own words): "oh, don't worry, the vsyscall page
> has now been minimized, all it contains now is *real* system calls,
> and it always calls them".

Not sure where you got this idea ;) From the very beginning you were
told that EMULATE mode doesn't do this.

The NATIVE mode should be fine, yes.

> 6) I just suggested an alternative: to have the whole vsyscall page
> removed on a per-process basis. I accept your reply that this is
> not possible.

Yes, this is not possible.

> 7) I suggested a third alternative: to have the vsyscall page be
> unexecutable on a per-process basis,

As above, this is simply not possible. And at the same time, the
vsyscall page is already non-executable in EMULATE mode, but globally.

> 8) Any solution that allows a ptracer to prevent its traced process
> from entering the vsyscall page and execute there system-calls
> unchecked (thus in effect escape its jailer), would do for me.

Well. I am even more confused... probably this was already discussed
and I missed this, but.

Why do you need to _prevent_, say, sys_gettimeofday()? Why can't we
change emulate_vsyscall() to respect PTRACE_SYSCALL and report
TRAP_VSYSCALL or PTRACE_EVENT_VSYSCALL as I tried to suggest in
http://marc.info/?l=linux-kernel&m=135343635523715 ?

You previously replied that this can not work. Now that you see that
this _can_ work, could you please explain why this is not enough?

Oleg.

2012-11-25 23:48:39

by u3557

[permalink] [raw]
Subject: Re: arch_check_bp_in_kernelspace: fix the range check

Hi Oleg,

> > 2) I was then told (in my own words): "oh, don't worry, the vsyscall page
> > has now been minimized, all it contains now is *real* system calls,
> > and it always calls them".
>
> Not sure where did you get this idea ;) From the very beginning you were
> told that EMULATE mode doesn't do this.

Sorry, I was not aware of the existence of "EMULATE" at the time,
or that it was the default, so I lived in a "NATIVE" world... and
was content that yesterday's problem was solved... I just looked
at the vsyscall page itself, found the system-calls there and was
"happy" that I could now catch them like anywhere else.

> > 8) Any solution that allows a ptracer to prevent its traced process
> > from entering the vsyscall page and execute there system-calls
> > unchecked (thus in effect escape its jailer), would do for me.
>
> Well. I am even more confused... probably this was already discussed
> and I missed this, but.
>
> Why do you need to _prevent_, say, sys_gettimeofday()? Why we can't
> change emulate_vsyscall() to respect PTRACE_SYSCALL and report
> TRAP_VSYSCALL or PTRACE_EVENT_VSYSCALL as I tried to suggest in
> http://marc.info/?l=linux-kernel&m=135343635523715 ?
>
> Oleg.
>

For my own application, I would be happy with this.

But I suspect it might break current versions of "strace",
or similar programs that expect to find the program-counter
pointing at a "syscall" instruction.

At present "strace" fails to report "gettimeofday()", but at
least it does not crash. Surely "strace" can and should be
enhanced to handle this, but existing versions may suffer.

>
> You previously replied that this can not work. Now that you see that
> this _can_ work, could you please explain why this is not enough?

I think it COULD work, but not based on PTRACE_SYSCALL
(or PTRACE_SYSEMU) alone. A new ptrace option will be needed, saying:
"Yes, I am aware of TRAP_VSYSCALL and I know how to handle it."

While for my own application, just fixing the range-check in
arch_check_bp_in_kernelspace will do, requiring a smaller patch,
I agree that fixing this properly by adding a new ptrace option
can help other programmers, so they need not bother with the x86
debug-registers (or perhaps they may need them for other purposes).

Best Regards,
Amnon.

2012-11-26 09:44:09

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)

On Thu, Nov 22, 2012 at 05:12:38PM +0100, Oleg Nesterov wrote:
> On 11/22, Amnon Shiloh wrote:
> >
> > Now however, that "vsyscall" was effectively replaced by vdso, it
> > creates a new problem for me and probably for anyone else who uses
> > some form of checkpoint/restore:
>
> Oh, sorry, I can't help here. I can only add Cyrill and Pavel, they
> seem to enjoy trying to solve the c/r problems.

Hi Oleg, Amnon, sorry for the delay on this message. First of all, I'm
not a vDSO specialist, but as to the c/r aspect of it I had in mind the
following scenario: we dump the vDSO area either completely, or simply
remember the function addresses it provides; then on restore we replace
the function prologues with jmp instructions which would point to the
vDSO entries provided to us by the kernel. But again, I didn't spend much
time on vDSO yet.

Cyrill

2012-11-26 12:27:05

by Andrei Vagin

[permalink] [raw]
Subject: Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)

2012/11/26 Cyrill Gorcunov <[email protected]>:
> On Thu, Nov 22, 2012 at 05:12:38PM +0100, Oleg Nesterov wrote:
>> On 11/22, Amnon Shiloh wrote:
>> >
>> > Now however, that "vsyscall" was effectively replaced by vdso, it
>> > creates a new problem for me and probably for anyone else who uses
>> > some form of checkpoint/restore:
>>
>> Oh, sorry, I can't help here. I can only add Cyrill and Pavel, they
>> seem to enjoy trying to solve the c/r problems.
>
> Hi Oleg, Amnon, sorry for delay on this message. First of all, I'm
> not vDSO specialist but as to c/r aspect of it I've had in mind the
> the following scenario: we dump vDSO area either complete euther simply
> remember the functions addresses it provides, then on restore we replace
> the functions prologue with jmp instruction which would point to the
> vDSO entries provided us by a kernel. But again, I didn't spend much time
> for vDSO yet.

I think this approach should work. Here are a bit more details:
1. A task will not be dumped while in vdso code. If a task is in vdso code,
we will try to catch it again.
2. Only the addresses of vdso symbols will be dumped.
3. On restore we will compare the addresses of the symbols. If the addresses
have not changed, a new vdso will be mapped instead of the old one. If they
have changed, a new vdso will be mapped in a free space and the old vdso
will be replaced by a proxy library, where all functions are replaced
by jumps to the suitable functions in the new vdso.

Looks simple and does not require hacking the kernel.
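
For step 3, the proxy could be as simple as overwriting each old prologue
with a jump to the new vdso (an x86_64-only sketch; addresses are
placeholders and the old vdso must first have been replaced by a writable
copy during restore):

#include <stdint.h>
#include <string.h>

static void write_trampoline(void *old_func, void *new_func)
{
	/* movabs $new_func, %rax ; jmp *%rax  -- 12 bytes */
	uint8_t code[12] = { 0x48, 0xb8, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xe0 };
	uint64_t target = (uint64_t)new_func;

	memcpy(&code[2], &target, sizeof(target));
	memcpy(old_func, code, sizeof(code));
}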


2012-11-26 12:55:06

by u3557

[permalink] [raw]
Subject: Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)

Hi Andrey,

So what do you do if a caught signal occurred inside vdso code?

You can easily detect that the task is currently in vdso code,
then delay its dump, but how do you detect whether a sigreturn()
or even a series of sigreturn()s won't bring it back to the middle
of some old vdso code?

You could of course keep that old code and modify only the very
first instruction of each routine into a jump instruction, but then
the code to which the process returns may not be compatible with
the new kernel and/or hardware configuration.

You cannot even rely on a vdso enter/exit counter, because signal-handler
code can use "longjmp()".

My idea was to introduce a kernel option saying "don't send me any
caught signals while the program-counter is in this range" (e.g. the
vdso page), but I was told that it is not feasible to implement it.

Regards,
Amnon.


2012-11-26 14:18:48

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)

On Mon, Nov 26, 2012 at 11:55:01PM +1100, Amnon Shiloh wrote:
>
> You could of course keep that old code and modify only the very
> first instruction of each routine into a jump instruction, but then
> the code to which the process returns may not be compatible with
> the new kernel and/or hardware configuration.

For sure there will be some limitations, but I fear we can't do
much about that. I don't expect a regular program to use
sigreturn to jump into vdso code, but I could be wrong.

Cyrill

2012-11-26 14:26:58

by u3557

[permalink] [raw]
Subject: Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)

Hi Cyrill,

Programmers don't (and the manual-page says they shouldn't even try)
call "sigreturn" directly.

If an interrupt happens, by bad luck, to occur while the process
is running vdso code, then eventually, once signal-processing is
complete, "sigreturn" (issued by glibc) will take the process back
to where it was before the interrupt happened, inside the vdso page.

Best Regards,
Amnon.


2012-11-26 14:42:00

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: vdso && cr

On Tue, Nov 27, 2012 at 01:26:55AM +1100, Amnon Shiloh wrote:
> Programmers don't (and the manual-page says they shouldn't even try)
> call "sigreturn" directly.
>
> If an interrupt happens, by bad luck, to occur while the process
> is running vdso code, then eventually, once signal-processing is
> complete, "sigreturn" (issued by glibc) will take the process back
> to where it was before the interrupt happend, inside the vdso page.

Hi Amnon, good point, need to think. Frankly, I would like to avoid
kernel patching as long as possible ;)