2014-12-08 03:03:54

by Shaohua Li

Subject: [PATCH 1/3] X86: make VDSO data support multiple pages

Currently the vdso data occupies a single page. Subsequent patches will add
per-cpu data to the vdso, which requires several pages when the CPU count is
large. This patch makes the VDSO data area support multiple pages.

Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
---
arch/x86/include/asm/vvar.h | 6 +++++-
arch/x86/kernel/asm-offsets.c | 5 +++++
arch/x86/kernel/vmlinux.lds.S | 4 +---
arch/x86/vdso/vdso-layout.lds.S | 5 +++--
arch/x86/vdso/vma.c | 3 ++-
5 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
index 5d2b9ad..fcbe621 100644
--- a/arch/x86/include/asm/vvar.h
+++ b/arch/x86/include/asm/vvar.h
@@ -47,7 +47,11 @@ extern char __vvar_page;
DECLARE_VVAR(0, volatile unsigned long, jiffies)
DECLARE_VVAR(16, int, vgetcpu_mode)
DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)
-
+/*
+ * you must update VVAR_TOTAL_SIZE to reflect all of the variables we're
+ * stuffing into the vvar area. Don't change any of the above without
+ * also changing this math of VVAR_TOTAL_SIZE
+ */
#undef DECLARE_VVAR

#endif
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 9f6b934..0ab31a9 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -16,6 +16,7 @@
#include <asm/sigframe.h>
#include <asm/bootparam.h>
#include <asm/suspend.h>
+#include <asm/vgtod.h>

#ifdef CONFIG_XEN
#include <xen/interface/xen.h>
@@ -71,4 +72,8 @@ void common(void) {

BLANK();
DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
+
+ BLANK();
+ DEFINE(VVAR_TOTAL_SIZE,
+ ALIGN(128 + sizeof(struct vsyscall_gtod_data), PAGE_SIZE));
}
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 49edf2d..8b11307 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -168,11 +168,9 @@ SECTIONS
* Pad the rest of the page with zeros. Otherwise the loader
* can leave garbage here.
*/
- . = __vvar_beginning_hack + PAGE_SIZE;
+ . = __vvar_beginning_hack + VVAR_TOTAL_SIZE;
} :data

- . = ALIGN(__vvar_page + PAGE_SIZE, PAGE_SIZE);
-
/* Init code and data - will be freed after init */
. = ALIGN(PAGE_SIZE);
.init.begin : AT(ADDR(.init.begin) - LOAD_OFFSET) {
diff --git a/arch/x86/vdso/vdso-layout.lds.S b/arch/x86/vdso/vdso-layout.lds.S
index de2c921..acaf8ce 100644
--- a/arch/x86/vdso/vdso-layout.lds.S
+++ b/arch/x86/vdso/vdso-layout.lds.S
@@ -1,4 +1,5 @@
#include <asm/vdso.h>
+#include <asm/asm-offsets.h>

/*
* Linker script for vDSO. This is an ELF shared object prelinked to
@@ -25,7 +26,7 @@ SECTIONS
* segment.
*/

- vvar_start = . - 2 * PAGE_SIZE;
+ vvar_start = . - (VVAR_TOTAL_SIZE + PAGE_SIZE);
vvar_page = vvar_start;

/* Place all vvars at the offsets in asm/vvar.h. */
@@ -35,7 +36,7 @@ SECTIONS
#undef __VVAR_KERNEL_LDS
#undef EMIT_VVAR

- hpet_page = vvar_start + PAGE_SIZE;
+ hpet_page = vvar_start + VVAR_TOTAL_SIZE;

. = SIZEOF_HEADERS;

diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 970463b..fc37067 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -16,6 +16,7 @@
#include <asm/vdso.h>
#include <asm/page.h>
#include <asm/hpet.h>
+#include <asm/asm-offsets.h>

#if defined(CONFIG_X86_64)
unsigned int __read_mostly vdso64_enabled = 1;
@@ -150,7 +151,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
ret = remap_pfn_range(vma,
text_start + image->sym_vvar_page,
__pa_symbol(&__vvar_page) >> PAGE_SHIFT,
- PAGE_SIZE,
+ VVAR_TOTAL_SIZE,
PAGE_READONLY);

if (ret)
--
1.8.1


2014-12-08 03:03:56

by Shaohua Li

Subject: [PATCH 2/3] X86: add a generic API to let vdso code detect context switch

vdso code can't disable preemption, so it can be preempted at any time.
This makes it challenging to implement certain features. This patch adds
a generic API that lets vdso code detect a context switch.

With this patch, every cpu maintains a context switch count. The upper
bits of the count hold the logical cpu id, so the count can never be
identical for two different cpus. The low bits are incremented on each
context switch. On an x86_64 system with up to 4096 cpus, the counter
wraps only after 2^(64 - 12) context switches, which takes long enough to
be ignored. A change in the count over a given interval can therefore be
used to detect whether a context switch occurred.
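
As an illustration (not part of the patch), a vdso-side consumer of this
API would snapshot the count around a racy read and retry; a minimal
sketch using __vdso_get_context_switch_count() added below:

        unsigned long cs;

        do {
                /* snapshot the per-cpu context switch count */
                cs = __vdso_get_context_switch_count();

                /* read whatever per-cpu data is needed here */

                /* retry if we were preempted or migrated meanwhile */
        } while (cs != __vdso_get_context_switch_count());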

Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
---
arch/x86/Kconfig | 4 ++++
arch/x86/include/asm/vdso.h | 34 ++++++++++++++++++++++++++++++++++
arch/x86/include/asm/vvar.h | 6 ++++++
arch/x86/kernel/asm-offsets.c | 6 ++++++
arch/x86/vdso/vclock_gettime.c | 12 ++++++++++++
arch/x86/vdso/vma.c | 6 ++++++
kernel/sched/core.c | 5 +++++
7 files changed, 73 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 41a503c..4978e31 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1920,6 +1920,10 @@ config COMPAT_VDSO
If unsure, say N: if you are compiling your own kernel, you
are unlikely to be using a buggy version of glibc.

+config VDSO_CS_DETECT
+ def_bool y
+ depends on X86_64
+
config CMDLINE_BOOL
bool "Built-in kernel command line"
---help---
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 8021bd2..42d6d2c 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -4,6 +4,7 @@
#include <asm/page_types.h>
#include <linux/linkage.h>
#include <linux/init.h>
+#include <generated/bounds.h>

#ifndef __ASSEMBLER__

@@ -49,6 +50,39 @@ extern const struct vdso_image *selected_vdso32;

extern void __init init_vdso_image(const struct vdso_image *image);

+#ifdef CONFIG_VDSO_CS_DETECT
+struct vdso_percpu_data {
+ /* layout: | cpu ID | context switch count | */
+ unsigned long cpu_cs_count;
+} ____cacheline_aligned;
+
+struct vdso_data {
+ int dummy;
+ struct vdso_percpu_data vpercpu[0];
+};
+extern struct vdso_data vdso_data;
+
+#ifdef CONFIG_SMP
+#define VDSO_CS_COUNT_BITS \
+ (sizeof(unsigned long) * 8 - NR_CPUS_BITS)
+static inline void vdso_inc_cpu_cs_count(int cpu)
+{
+ unsigned long cs = vdso_data.vpercpu[cpu].cpu_cs_count;
+ cs++;
+ cs &= (1L << VDSO_CS_COUNT_BITS) - 1;
+ vdso_data.vpercpu[cpu].cpu_cs_count = cs |
+ (((unsigned long)cpu) << VDSO_CS_COUNT_BITS);
+ smp_wmb();
+}
+#else
+static inline void vdso_inc_cpu_cs_count(int cpu)
+{
+ vdso_data.vpercpu[cpu].cpu_cs_count++;
+ smp_wmb();
+}
+#endif
+#endif
+
#endif /* __ASSEMBLER__ */

#endif /* _ASM_X86_VDSO_H */
diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
index fcbe621..394ab65 100644
--- a/arch/x86/include/asm/vvar.h
+++ b/arch/x86/include/asm/vvar.h
@@ -47,6 +47,12 @@ extern char __vvar_page;
DECLARE_VVAR(0, volatile unsigned long, jiffies)
DECLARE_VVAR(16, int, vgetcpu_mode)
DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)
+#if defined(CONFIG_VDSO_CS_DETECT) && defined(CONFIG_X86_64)
+/*
+ * this one needs to be last because it ends with a per-cpu array.
+ */
+DECLARE_VVAR(320, struct vdso_data, vdso_data)
+#endif
/*
* you must update VVAR_TOTAL_SIZE to reflect all of the variables we're
* stuffing into the vvar area. Don't change any of the above without
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 0ab31a9..7321cdc 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -17,6 +17,7 @@
#include <asm/bootparam.h>
#include <asm/suspend.h>
#include <asm/vgtod.h>
+#include <asm/vdso.h>

#ifdef CONFIG_XEN
#include <xen/interface/xen.h>
@@ -74,6 +75,11 @@ void common(void) {
DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));

BLANK();
+#ifdef CONFIG_VDSO_CS_DETECT
+ DEFINE(VVAR_TOTAL_SIZE, ALIGN(320 + sizeof(struct vdso_data)
+ + sizeof(struct vdso_percpu_data) * CONFIG_NR_CPUS, PAGE_SIZE));
+#else
DEFINE(VVAR_TOTAL_SIZE,
ALIGN(128 + sizeof(struct vsyscall_gtod_data), PAGE_SIZE));
+#endif
}
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 9793322..438b3be 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -17,6 +17,8 @@
#include <asm/vvar.h>
#include <asm/unistd.h>
#include <asm/msr.h>
+#include <asm/vdso.h>
+#include <asm/vsyscall.h>
#include <linux/math64.h>
#include <linux/time.h>

@@ -289,6 +291,16 @@ notrace static void do_monotonic_coarse(struct timespec *ts)
} while (unlikely(gtod_read_retry(gtod, seq)));
}

+#if defined(CONFIG_VDSO_CS_DETECT) && defined(CONFIG_X86_64)
+notrace unsigned long __vdso_get_context_switch_count(void)
+{
+ int cpu = __getcpu() & VGETCPU_CPU_MASK;
+
+ smp_rmb();
+ return VVAR(vdso_data).vpercpu[cpu].cpu_cs_count;
+}
+#endif
+
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
switch (clock) {
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index fc37067..400e454 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -22,6 +22,8 @@
unsigned int __read_mostly vdso64_enabled = 1;

extern unsigned short vdso_sync_cpuid;
+
+DEFINE_VVAR(struct vdso_data, vdso_data);
#endif

void __init init_vdso_image(const struct vdso_image *image)
@@ -42,6 +44,10 @@ void __init init_vdso_image(const struct vdso_image *image)
#if defined(CONFIG_X86_64)
static int __init init_vdso(void)
{
+ int cpu;
+ for (cpu =0; cpu < CONFIG_NR_CPUS; cpu++)
+ vdso_inc_cpu_cs_count(cpu);
+
init_vdso_image(&vdso_image_64);

#ifdef CONFIG_X86_X32_ABI
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 89e7283..5e4100a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2238,6 +2238,11 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
{
struct mm_struct *mm = rq->prev_mm;
long prev_state;
+#ifdef CONFIG_VDSO_CS_DETECT
+ int cpu = smp_processor_id();
+
+ vdso_inc_cpu_cs_count(cpu);
+#endif

rq->prev_mm = NULL;

--
1.8.1

2014-12-08 03:04:28

by Shaohua Li

Subject: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO

This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
use the following method to compute the thread cpu time:

t0 = process start
t1 = most recent context switch time
t2 = time at which the vsyscall is invoked

thread_cpu_time = sum(time slices between t0 to t1) + (t2 - t1)
= current->se.sum_exec_runtime + now - sched_clock()

At context switch time we stash away

adj_sched_time = sum_exec_runtime - sched_clock()

in a per-cpu struct in the VVAR page and then compute

thread_cpu_time = adj_sched_time + now

All computations are done in nanoseconds on systems where the TSC is
stable. If the TSC is unstable, we fall back to a regular syscall.

Benchmark data:

        for (i = 0; i < 100000000; i++) {
                clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
                sum += ts.tv_sec * NSECS_PER_SEC + ts.tv_nsec;
        }

Baseline:
real 1m3.428s
user 0m5.190s
sys 0m58.220s

patched:
real 0m4.912s
user 0m4.910s
sys 0m0.000s

This should speed up profilers that need to query thread cpu time
frequently in order to produce fine-grained timestamps.
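
For reference, a self-contained version of the benchmark loop above might
look like the following (the iteration count matches the loop above;
NSECS_PER_SEC is assumed to be 10^9, and the original test program is not
part of this patch):

        #include <stdio.h>
        #include <time.h>

        #define NSECS_PER_SEC 1000000000ULL

        int main(void)
        {
                struct timespec ts;
                unsigned long long sum = 0;
                long i;

                /* repeatedly query per-thread cpu time, as in the numbers above */
                for (i = 0; i < 100000000; i++) {
                        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
                        sum += ts.tv_sec * NSECS_PER_SEC + ts.tv_nsec;
                }
                printf("%llu\n", sum);
                return 0;
        }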

No statistically significant regression was detected on x86_64 context
switch code. Most archs that don't support vsyscalls will have this code
disabled via jump labels.

Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Kumar Sundararajan <[email protected]>
Signed-off-by: Arun Sharma <[email protected]>
Signed-off-by: Chris Mason <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
---
arch/x86/include/asm/vdso.h | 9 +++++++-
arch/x86/kernel/tsc.c | 14 ++++++++++++
arch/x86/vdso/vclock_gettime.c | 49 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/vdso/vma.c | 4 ++++
kernel/sched/core.c | 3 +++
5 files changed, 78 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 42d6d2c..cbbab3a 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -54,14 +54,21 @@ extern void __init init_vdso_image(const struct vdso_image *image);
struct vdso_percpu_data {
/* layout: | cpu ID | context switch count | */
unsigned long cpu_cs_count;
+
+ unsigned long long adj_sched_time;
+ unsigned long long cyc2ns_offset;
+ unsigned long cyc2ns;
} ____cacheline_aligned;

struct vdso_data {
- int dummy;
+ unsigned int thread_cputime_disabled;
struct vdso_percpu_data vpercpu[0];
};
extern struct vdso_data vdso_data;

+struct static_key;
+extern struct static_key vcpu_thread_cputime_enabled;
+
#ifdef CONFIG_SMP
#define VDSO_CS_COUNT_BITS \
(sizeof(unsigned long) * 8 - NR_CPUS_BITS)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index b7e50bb..83f8091 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -21,6 +21,7 @@
#include <asm/hypervisor.h>
#include <asm/nmi.h>
#include <asm/x86_init.h>
+#include <asm/vdso.h>

unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
EXPORT_SYMBOL(cpu_khz);
@@ -263,6 +264,11 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
data->cyc2ns_offset = ns_now -
mul_u64_u32_shr(tsc_now, data->cyc2ns_mul, CYC2NS_SCALE_FACTOR);

+#ifdef CONFIG_VDSO_CS_DETECT
+ vdso_data.vpercpu[cpu].cyc2ns = data->cyc2ns_mul;
+ vdso_data.vpercpu[cpu].cyc2ns_offset = data->cyc2ns_offset;
+#endif
+
cyc2ns_write_end(cpu, data);

done:
@@ -989,6 +995,10 @@ void mark_tsc_unstable(char *reason)
tsc_unstable = 1;
clear_sched_clock_stable();
disable_sched_clock_irqtime();
+#ifdef CONFIG_VDSO_CS_DETECT
+ vdso_data.thread_cputime_disabled = 1;
+ static_key_slow_dec(&vcpu_thread_cputime_enabled);
+#endif
pr_info("Marking TSC unstable due to %s\n", reason);
/* Change only the rating, when not registered */
if (clocksource_tsc.mult)
@@ -1202,6 +1212,10 @@ void __init tsc_init(void)

tsc_disabled = 0;
static_key_slow_inc(&__use_tsc);
+#ifdef CONFIG_VDSO_CS_DETECT
+ vdso_data.thread_cputime_disabled = !cpu_has(&boot_cpu_data,
+ X86_FEATURE_RDTSCP);
+#endif

if (!no_sched_irq_time)
enable_sched_clock_irqtime();
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 438b3be..46c03af2 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -299,6 +299,48 @@ notrace unsigned long __vdso_get_context_switch_count(void)
smp_rmb();
return VVAR(vdso_data).vpercpu[cpu].cpu_cs_count;
}
+
+#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+notrace static inline unsigned long long __cycles_2_ns(unsigned long long cyc,
+ unsigned long long scale,
+ unsigned long long offset)
+{
+ unsigned long long ns = offset;
+ ns += mul_u64_u32_shr(cyc, scale, CYC2NS_SCALE_FACTOR);
+ return ns;
+}
+
+notrace static unsigned long do_thread_cpu_time(void)
+{
+ unsigned int p;
+ u_int64_t tscval;
+ unsigned long long adj_sched_time;
+ unsigned long long scale;
+ unsigned long long offset;
+ const struct vdso_data *vp = &VVAR(vdso_data);
+ int cpu;
+ unsigned long cs;
+
+ do {
+ cs = __vdso_get_context_switch_count();
+
+ rdtscpll(tscval, p);
+ cpu = p & VGETCPU_CPU_MASK;
+ adj_sched_time = vp->vpercpu[cpu].adj_sched_time;
+ scale = vp->vpercpu[cpu].cyc2ns;
+ offset = vp->vpercpu[cpu].cyc2ns_offset;
+
+ } while (unlikely(cs != __vdso_get_context_switch_count()));
+
+ return __cycles_2_ns(tscval, scale, offset) + adj_sched_time;
+}
+
+notrace static inline void ns_to_ts(u64 ns, struct timespec *ts)
+{
+ u32 rem;
+ ts->tv_sec = div_u64_rem(ns, NSEC_PER_SEC, &rem);
+ ts->tv_nsec = rem;
+}
#endif

notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
@@ -318,6 +360,13 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
case CLOCK_MONOTONIC_COARSE:
do_monotonic_coarse(ts);
break;
+#if defined(CONFIG_VDSO_CS_DETECT) && defined(CONFIG_X86_64)
+ case CLOCK_THREAD_CPUTIME_ID:
+ if (VVAR(vdso_data).thread_cputime_disabled)
+ goto fallback;
+ ns_to_ts(do_thread_cpu_time(), ts);
+ break;
+#endif
default:
goto fallback;
}
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 400e454..aefdd93 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -10,6 +10,7 @@
#include <linux/init.h>
#include <linux/random.h>
#include <linux/elf.h>
+#include <linux/jump_label.h>
#include <asm/vsyscall.h>
#include <asm/vgtod.h>
#include <asm/proto.h>
@@ -24,6 +25,7 @@ unsigned int __read_mostly vdso64_enabled = 1;
extern unsigned short vdso_sync_cpuid;

DEFINE_VVAR(struct vdso_data, vdso_data);
+struct static_key vcpu_thread_cputime_enabled;
#endif

void __init init_vdso_image(const struct vdso_image *image)
@@ -54,6 +56,8 @@ static int __init init_vdso(void)
init_vdso_image(&vdso_image_x32);
#endif

+ if (!vdso_data.thread_cputime_disabled)
+ static_key_slow_inc(&vcpu_thread_cputime_enabled);
return 0;
}
subsys_initcall(init_vdso);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e4100a..e0882fb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2242,6 +2242,9 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
int cpu = smp_processor_id();

vdso_inc_cpu_cs_count(cpu);
+ if (static_key_false(&vcpu_thread_cputime_enabled))
+ vdso_data.vpercpu[cpu].adj_sched_time =
+ current->se.sum_exec_runtime - sched_clock();
#endif

rq->prev_mm = NULL;
--
1.8.1

2014-12-10 18:37:04

by Andy Lutomirski

Subject: Re: [PATCH 1/3] X86: make VDSO data support multiple pages

On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
> Currently vdso data is one page. Next patches will add per-cpu data to
> vdso, which requires several pages if CPU number is big. This makes VDSO
> data support multiple pages.

Can you rename __vvar_page to __vvar_pages?

>
> Cc: Andy Lutomirski <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Signed-off-by: Shaohua Li <[email protected]>
> ---
> arch/x86/include/asm/vvar.h | 6 +++++-
> arch/x86/kernel/asm-offsets.c | 5 +++++
> arch/x86/kernel/vmlinux.lds.S | 4 +---
> arch/x86/vdso/vdso-layout.lds.S | 5 +++--
> arch/x86/vdso/vma.c | 3 ++-
> 5 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
> index 5d2b9ad..fcbe621 100644
> --- a/arch/x86/include/asm/vvar.h
> +++ b/arch/x86/include/asm/vvar.h
> @@ -47,7 +47,11 @@ extern char __vvar_page;
> DECLARE_VVAR(0, volatile unsigned long, jiffies)
> DECLARE_VVAR(16, int, vgetcpu_mode)
> DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)
> -
> +/*
> + * you must update VVAR_TOTAL_SIZE to reflect all of the variables we're
> + * stuffing into the vvar area. Don't change any of the above without
> + * also changing this math of VVAR_TOTAL_SIZE
> + */
> #undef DECLARE_VVAR
>
> #endif
> diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
> index 9f6b934..0ab31a9 100644
> --- a/arch/x86/kernel/asm-offsets.c
> +++ b/arch/x86/kernel/asm-offsets.c
> @@ -16,6 +16,7 @@
> #include <asm/sigframe.h>
> #include <asm/bootparam.h>
> #include <asm/suspend.h>
> +#include <asm/vgtod.h>
>
> #ifdef CONFIG_XEN
> #include <xen/interface/xen.h>
> @@ -71,4 +72,8 @@ void common(void) {
>
> BLANK();
> DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
> +
> + BLANK();
> + DEFINE(VVAR_TOTAL_SIZE,
> + ALIGN(128 + sizeof(struct vsyscall_gtod_data), PAGE_SIZE));

Perhaps add:

BUILD_BUG_ON(VVAR_TOTAL_SIZE % PAGE_SIZE != 0);

or just keep the alignment stuff that you removed.

Although, TBH, this is still rather ugly IMO. Maybe we should just
have struct vvar somewhere and make everything use it. We couldn't do
that before because of jiffies and such, but those are all gone now.

--Andy

> }
> diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
> index 49edf2d..8b11307 100644
> --- a/arch/x86/kernel/vmlinux.lds.S
> +++ b/arch/x86/kernel/vmlinux.lds.S
> @@ -168,11 +168,9 @@ SECTIONS
> * Pad the rest of the page with zeros. Otherwise the loader
> * can leave garbage here.
> */
> - . = __vvar_beginning_hack + PAGE_SIZE;
> + . = __vvar_beginning_hack + VVAR_TOTAL_SIZE;
> } :data
>
> - . = ALIGN(__vvar_page + PAGE_SIZE, PAGE_SIZE);
> -
> /* Init code and data - will be freed after init */
> . = ALIGN(PAGE_SIZE);
> .init.begin : AT(ADDR(.init.begin) - LOAD_OFFSET) {
> diff --git a/arch/x86/vdso/vdso-layout.lds.S b/arch/x86/vdso/vdso-layout.lds.S
> index de2c921..acaf8ce 100644
> --- a/arch/x86/vdso/vdso-layout.lds.S
> +++ b/arch/x86/vdso/vdso-layout.lds.S
> @@ -1,4 +1,5 @@
> #include <asm/vdso.h>
> +#include <asm/asm-offsets.h>
>
> /*
> * Linker script for vDSO. This is an ELF shared object prelinked to
> @@ -25,7 +26,7 @@ SECTIONS
> * segment.
> */
>
> - vvar_start = . - 2 * PAGE_SIZE;
> + vvar_start = . - (VVAR_TOTAL_SIZE + PAGE_SIZE);
> vvar_page = vvar_start;
>
> /* Place all vvars at the offsets in asm/vvar.h. */
> @@ -35,7 +36,7 @@ SECTIONS
> #undef __VVAR_KERNEL_LDS
> #undef EMIT_VVAR
>
> - hpet_page = vvar_start + PAGE_SIZE;
> + hpet_page = vvar_start + VVAR_TOTAL_SIZE;
>
> . = SIZEOF_HEADERS;
>
> diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
> index 970463b..fc37067 100644
> --- a/arch/x86/vdso/vma.c
> +++ b/arch/x86/vdso/vma.c
> @@ -16,6 +16,7 @@
> #include <asm/vdso.h>
> #include <asm/page.h>
> #include <asm/hpet.h>
> +#include <asm/asm-offsets.h>
>
> #if defined(CONFIG_X86_64)
> unsigned int __read_mostly vdso64_enabled = 1;
> @@ -150,7 +151,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
> ret = remap_pfn_range(vma,
> text_start + image->sym_vvar_page,
> __pa_symbol(&__vvar_page) >> PAGE_SHIFT,
> - PAGE_SIZE,
> + VVAR_TOTAL_SIZE,
> PAGE_READONLY);
>
> if (ret)
> --
> 1.8.1
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-10 18:39:06

by Andy Lutomirski

Subject: Re: [PATCH 2/3] X86: add a generic API to let vdso code detect context switch

On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
> vdso code can't disable preempt, so it can be preempted at any time.
> This makes a challenge to implement specific features. This patch adds a
> generic API to let vdso code detect context switch.
>
> With this patch, every cpu maintains a context switch count. The upper
> bits of the count is the logical cpu id, so the count can't be identical
> for any cpu. The low bits of the count will be increased for each
> context switch. For a x86_64 cpu with 4096 cpus, the context switch will
> be overflowed for 2^(64 - 12) context switch, which is a long time and can be
> ignored. The change of the count in giving time can be used to detect if
> context switch occurs.

Why do you need those high bits? I don't understand how you could
possibly confuse one cpu's count with another's unless you fail to
make sure that you're reading the same address both times.

That being said, I don't like this patch. I'm not sure I have a much
better idea, though. More thoughts in the 0/0 email to follow.

--Andy

>
> Cc: Andy Lutomirski <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Signed-off-by: Shaohua Li <[email protected]>
> ---
> arch/x86/Kconfig | 4 ++++
> arch/x86/include/asm/vdso.h | 34 ++++++++++++++++++++++++++++++++++
> arch/x86/include/asm/vvar.h | 6 ++++++
> arch/x86/kernel/asm-offsets.c | 6 ++++++
> arch/x86/vdso/vclock_gettime.c | 12 ++++++++++++
> arch/x86/vdso/vma.c | 6 ++++++
> kernel/sched/core.c | 5 +++++
> 7 files changed, 73 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 41a503c..4978e31 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1920,6 +1920,10 @@ config COMPAT_VDSO
> If unsure, say N: if you are compiling your own kernel, you
> are unlikely to be using a buggy version of glibc.
>
> +config VDSO_CS_DETECT
> + def_bool y
> + depends on X86_64
> +
> config CMDLINE_BOOL
> bool "Built-in kernel command line"
> ---help---
> diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
> index 8021bd2..42d6d2c 100644
> --- a/arch/x86/include/asm/vdso.h
> +++ b/arch/x86/include/asm/vdso.h
> @@ -4,6 +4,7 @@
> #include <asm/page_types.h>
> #include <linux/linkage.h>
> #include <linux/init.h>
> +#include <generated/bounds.h>
>
> #ifndef __ASSEMBLER__
>
> @@ -49,6 +50,39 @@ extern const struct vdso_image *selected_vdso32;
>
> extern void __init init_vdso_image(const struct vdso_image *image);
>
> +#ifdef CONFIG_VDSO_CS_DETECT
> +struct vdso_percpu_data {
> + /* layout: | cpu ID | context switch count | */
> + unsigned long cpu_cs_count;
> +} ____cacheline_aligned;
> +
> +struct vdso_data {
> + int dummy;
> + struct vdso_percpu_data vpercpu[0];
> +};
> +extern struct vdso_data vdso_data;
> +
> +#ifdef CONFIG_SMP
> +#define VDSO_CS_COUNT_BITS \
> + (sizeof(unsigned long) * 8 - NR_CPUS_BITS)
> +static inline void vdso_inc_cpu_cs_count(int cpu)
> +{
> + unsigned long cs = vdso_data.vpercpu[cpu].cpu_cs_count;
> + cs++;
> + cs &= (1L << VDSO_CS_COUNT_BITS) - 1;
> + vdso_data.vpercpu[cpu].cpu_cs_count = cs |
> + (((unsigned long)cpu) << VDSO_CS_COUNT_BITS);
> + smp_wmb();
> +}
> +#else
> +static inline void vdso_inc_cpu_cs_count(int cpu)
> +{
> + vdso_data.vpercpu[cpu].cpu_cs_count++;
> + smp_wmb();
> +}
> +#endif
> +#endif
> +
> #endif /* __ASSEMBLER__ */
>
> #endif /* _ASM_X86_VDSO_H */
> diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
> index fcbe621..394ab65 100644
> --- a/arch/x86/include/asm/vvar.h
> +++ b/arch/x86/include/asm/vvar.h
> @@ -47,6 +47,12 @@ extern char __vvar_page;
> DECLARE_VVAR(0, volatile unsigned long, jiffies)
> DECLARE_VVAR(16, int, vgetcpu_mode)
> DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)
> +#if defined(CONFIG_VDSO_CS_DETECT) && defined(CONFIG_X86_64)
> +/*
> + * this one needs to be last because it ends with a per-cpu array.
> + */
> +DECLARE_VVAR(320, struct vdso_data, vdso_data)
> +#endif
> /*
> * you must update VVAR_TOTAL_SIZE to reflect all of the variables we're
> * stuffing into the vvar area. Don't change any of the above without
> diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
> index 0ab31a9..7321cdc 100644
> --- a/arch/x86/kernel/asm-offsets.c
> +++ b/arch/x86/kernel/asm-offsets.c
> @@ -17,6 +17,7 @@
> #include <asm/bootparam.h>
> #include <asm/suspend.h>
> #include <asm/vgtod.h>
> +#include <asm/vdso.h>
>
> #ifdef CONFIG_XEN
> #include <xen/interface/xen.h>
> @@ -74,6 +75,11 @@ void common(void) {
> DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
>
> BLANK();
> +#ifdef CONFIG_VDSO_CS_DETECT
> + DEFINE(VVAR_TOTAL_SIZE, ALIGN(320 + sizeof(struct vdso_data)
> + + sizeof(struct vdso_percpu_data) * CONFIG_NR_CPUS, PAGE_SIZE));
> +#else
> DEFINE(VVAR_TOTAL_SIZE,
> ALIGN(128 + sizeof(struct vsyscall_gtod_data), PAGE_SIZE));
> +#endif
> }
> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> index 9793322..438b3be 100644
> --- a/arch/x86/vdso/vclock_gettime.c
> +++ b/arch/x86/vdso/vclock_gettime.c
> @@ -17,6 +17,8 @@
> #include <asm/vvar.h>
> #include <asm/unistd.h>
> #include <asm/msr.h>
> +#include <asm/vdso.h>
> +#include <asm/vsyscall.h>
> #include <linux/math64.h>
> #include <linux/time.h>
>
> @@ -289,6 +291,16 @@ notrace static void do_monotonic_coarse(struct timespec *ts)
> } while (unlikely(gtod_read_retry(gtod, seq)));
> }
>
> +#if defined(CONFIG_VDSO_CS_DETECT) && defined(CONFIG_X86_64)
> +notrace unsigned long __vdso_get_context_switch_count(void)
> +{
> + int cpu = __getcpu() & VGETCPU_CPU_MASK;
> +
> + smp_rmb();
> + return VVAR(vdso_data).vpercpu[cpu].cpu_cs_count;
> +}
> +#endif
> +
> notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
> {
> switch (clock) {
> diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
> index fc37067..400e454 100644
> --- a/arch/x86/vdso/vma.c
> +++ b/arch/x86/vdso/vma.c
> @@ -22,6 +22,8 @@
> unsigned int __read_mostly vdso64_enabled = 1;
>
> extern unsigned short vdso_sync_cpuid;
> +
> +DEFINE_VVAR(struct vdso_data, vdso_data);
> #endif
>
> void __init init_vdso_image(const struct vdso_image *image)
> @@ -42,6 +44,10 @@ void __init init_vdso_image(const struct vdso_image *image)
> #if defined(CONFIG_X86_64)
> static int __init init_vdso(void)
> {
> + int cpu;
> + for (cpu =0; cpu < CONFIG_NR_CPUS; cpu++)
> + vdso_inc_cpu_cs_count(cpu);
> +
> init_vdso_image(&vdso_image_64);
>
> #ifdef CONFIG_X86_X32_ABI
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 89e7283..5e4100a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2238,6 +2238,11 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> {
> struct mm_struct *mm = rq->prev_mm;
> long prev_state;
> +#ifdef CONFIG_VDSO_CS_DETECT
> + int cpu = smp_processor_id();
> +
> + vdso_inc_cpu_cs_count(cpu);
> +#endif
>
> rq->prev_mm = NULL;
>
> --
> 1.8.1
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-10 18:51:33

by Shaohua Li

Subject: Re: [PATCH 2/3] X86: add a generic API to let vdso code detect context switch

On Wed, Dec 10, 2014 at 10:38:41AM -0800, Andy Lutomirski wrote:
> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
> > vdso code can't disable preempt, so it can be preempted at any time.
> > This makes a challenge to implement specific features. This patch adds a
> > generic API to let vdso code detect context switch.
> >
> > With this patch, every cpu maintains a context switch count. The upper
> > bits of the count is the logical cpu id, so the count can't be identical
> > for any cpu. The low bits of the count will be increased for each
> > context switch. For a x86_64 cpu with 4096 cpus, the context switch will
> > be overflowed for 2^(64 - 12) context switch, which is a long time and can be
> > ignored. The change of the count in giving time can be used to detect if
> > context switch occurs.
>
> Why do you need those high bits? I don't understand how you could
> possibly confuse one cpu's count with another's unless you fail to
> make sure that you're reading the same address both times.
>
> That being said, I don't like this patch. I'm not sure I have a much
> better idea, though. More thoughts in the 0/0 email to follow.
the vdso code doesn't disable preemption, so it can be migrated between
cpus at any time. The usage is:

get_context_switch_count (on cpu A)
do_something (on cpu B)
get_context_switch_count (on cpu C)

The cpus A, B and C could be completely different. We want to make sure
there is no preemption here, and we use the context switch count to judge
this. If the high bits are ignored, the context switch count could be
identical even when A != C, and then our judgement based on the switch
count would be wrong.

Thanks,
Shaohua

2014-12-10 19:11:18

by Andy Lutomirski

Subject: Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO

On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
> This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
> use the following method to compute the thread cpu time:

I like the idea, and I like making this type of profiling fast. I
don't love the implementation because it's an information leak (maybe
we don't care) and it's ugly.

The info leak could be fixed completely by having a per-process array
instead of a global array. That's currently tricky without wasting
memory, but it could be created on demand if we wanted to do that,
once my vvar .fault patches go in (assuming they do -- I need to ping
the linux-mm people).

As for ugliness, it seems to me that there really ought to be a better
way to do this. What we really want is a per-thread vvar-like thing.
Any code that can use TLS can do this trivially, but the vdso,
unfortunately, has no straightforward way to use TLS.

If we could rely on something like TSX (ha!), then this becomes straightforward:

xbegin()
rdtscp
read params
xcommit()

But we can't do that for the next decade or so, obviously.

If we were willing to limit ourselves to systems with rdwrgsbase, then
we could do:

save gs and gsbase
mov $__VDSO_CPU,%gs
mov %gs:(base our array),%whatever
restore gs and gsbase

(We actually could do this if we make arch_prctl disable this feature
the first time a process tries to set the gs base.)

If we were willing to limit ourselves to at most 2^20 threads per
process, then we could assign each thread a 20-bit index within its
process with some LSL trickery. Then we could have the vdso look up
the thread in a per-process array with one element per thread. This
would IMO be the winning approach if we think we'd ever extend the set
of per-thread vvar things. (For the 32-bit VDSO, this is trivial --
just use a far pointer.)

If we had per-cpu fixmap entries, we'd use those. Unfortunately, that
would be a giant mess to implement, and it would be a slowdown for
some workloads (and a speedup for others - hmm).

Grumble. Thoughts?

--Andy

>
> t0 = process start
> t1 = most recent context switch time
> t2 = time at which the vsyscall is invoked
>
> thread_cpu_time = sum(time slices between t0 to t1) + (t2 - t1)
> = current->se.sum_exec_runtime + now - sched_clock()
>
> At context switch time We stash away
>
> adj_sched_time = sum_exec_runtime - sched_clock()
>
> in a per-cpu struct in the VVAR page and then compute
>
> thread_cpu_time = adj_sched_time + now
>
> All computations are done in nanosecs on systems where TSC is stable. If
> TSC is unstable, we fallback to a regular syscall.
> Benchmark data:
>
> for (i = 0; i < 100000000; i++) {
> clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
> sum += ts.tv_sec * NSECS_PER_SEC + ts.tv_nsec;
> }
>
> Baseline:
> real 1m3.428s
> user 0m5.190s
> sys 0m58.220s
>
> patched:
> real 0m4.912s
> user 0m4.910s
> sys 0m0.000s
>
> This should speed up profilers that need to query thread cpu time a lot
> to do fine-grained timestamps.
>
> No statistically significant regression was detected on x86_64 context
> switch code. Most archs that don't support vsyscalls will have this code
> disabled via jump labels.
>
> Cc: Andy Lutomirski <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Signed-off-by: Kumar Sundararajan <[email protected]>
> Signed-off-by: Arun Sharma <[email protected]>
> Signed-off-by: Chris Mason <[email protected]>
> Signed-off-by: Shaohua Li <[email protected]>
> ---
> arch/x86/include/asm/vdso.h | 9 +++++++-
> arch/x86/kernel/tsc.c | 14 ++++++++++++
> arch/x86/vdso/vclock_gettime.c | 49 ++++++++++++++++++++++++++++++++++++++++++
> arch/x86/vdso/vma.c | 4 ++++
> kernel/sched/core.c | 3 +++
> 5 files changed, 78 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
> index 42d6d2c..cbbab3a 100644
> --- a/arch/x86/include/asm/vdso.h
> +++ b/arch/x86/include/asm/vdso.h
> @@ -54,14 +54,21 @@ extern void __init init_vdso_image(const struct vdso_image *image);
> struct vdso_percpu_data {
> /* layout: | cpu ID | context switch count | */
> unsigned long cpu_cs_count;
> +
> + unsigned long long adj_sched_time;
> + unsigned long long cyc2ns_offset;
> + unsigned long cyc2ns;

Having two offsets seems unnecessary.

> } ____cacheline_aligned;
>
> struct vdso_data {
> - int dummy;
> + unsigned int thread_cputime_disabled;
> struct vdso_percpu_data vpercpu[0];
> };
> extern struct vdso_data vdso_data;
>
> +struct static_key;
> +extern struct static_key vcpu_thread_cputime_enabled;
> +
> #ifdef CONFIG_SMP
> #define VDSO_CS_COUNT_BITS \
> (sizeof(unsigned long) * 8 - NR_CPUS_BITS)
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index b7e50bb..83f8091 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -21,6 +21,7 @@
> #include <asm/hypervisor.h>
> #include <asm/nmi.h>
> #include <asm/x86_init.h>
> +#include <asm/vdso.h>
>
> unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
> EXPORT_SYMBOL(cpu_khz);
> @@ -263,6 +264,11 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
> data->cyc2ns_offset = ns_now -
> mul_u64_u32_shr(tsc_now, data->cyc2ns_mul, CYC2NS_SCALE_FACTOR);
>
> +#ifdef CONFIG_VDSO_CS_DETECT
> + vdso_data.vpercpu[cpu].cyc2ns = data->cyc2ns_mul;
> + vdso_data.vpercpu[cpu].cyc2ns_offset = data->cyc2ns_offset;
> +#endif
> +
> cyc2ns_write_end(cpu, data);
>
> done:
> @@ -989,6 +995,10 @@ void mark_tsc_unstable(char *reason)
> tsc_unstable = 1;
> clear_sched_clock_stable();
> disable_sched_clock_irqtime();
> +#ifdef CONFIG_VDSO_CS_DETECT
> + vdso_data.thread_cputime_disabled = 1;
> + static_key_slow_dec(&vcpu_thread_cputime_enabled);
> +#endif
> pr_info("Marking TSC unstable due to %s\n", reason);
> /* Change only the rating, when not registered */
> if (clocksource_tsc.mult)
> @@ -1202,6 +1212,10 @@ void __init tsc_init(void)
>
> tsc_disabled = 0;
> static_key_slow_inc(&__use_tsc);
> +#ifdef CONFIG_VDSO_CS_DETECT
> + vdso_data.thread_cputime_disabled = !cpu_has(&boot_cpu_data,
> + X86_FEATURE_RDTSCP);
> +#endif

This is backwards IMO. Even if this function never runs, the flag
should be set to disable the feature. Then you can enable it here.

>
> if (!no_sched_irq_time)
> enable_sched_clock_irqtime();
> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> index 438b3be..46c03af2 100644
> --- a/arch/x86/vdso/vclock_gettime.c
> +++ b/arch/x86/vdso/vclock_gettime.c
> @@ -299,6 +299,48 @@ notrace unsigned long __vdso_get_context_switch_count(void)
> smp_rmb();
> return VVAR(vdso_data).vpercpu[cpu].cpu_cs_count;
> }
> +
> +#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
> +notrace static inline unsigned long long __cycles_2_ns(unsigned long long cyc,
> + unsigned long long scale,
> + unsigned long long offset)
> +{
> + unsigned long long ns = offset;
> + ns += mul_u64_u32_shr(cyc, scale, CYC2NS_SCALE_FACTOR);
> + return ns;
> +}
> +
> +notrace static unsigned long do_thread_cpu_time(void)
> +{
> + unsigned int p;
> + u_int64_t tscval;
> + unsigned long long adj_sched_time;
> + unsigned long long scale;
> + unsigned long long offset;
> + const struct vdso_data *vp = &VVAR(vdso_data);
> + int cpu;
> + unsigned long cs;
> +
> + do {
> + cs = __vdso_get_context_switch_count();
> +
> + rdtscpll(tscval, p);
> + cpu = p & VGETCPU_CPU_MASK;
> + adj_sched_time = vp->vpercpu[cpu].adj_sched_time;
> + scale = vp->vpercpu[cpu].cyc2ns;
> + offset = vp->vpercpu[cpu].cyc2ns_offset;
> +
> + } while (unlikely(cs != __vdso_get_context_switch_count()));
> +
> + return __cycles_2_ns(tscval, scale, offset) + adj_sched_time;

You're literally adding two offsets together here, although it's
obscured by the abstraction of __cycles_2_ns.

Also, if you actually expand this function out, it's not optimized
well. You're doing, roughly:

getcpu
read cs count
rdtscp
read clock params
getcpu
read cs count
repeat if necessary

You should need at most two operations that read the cpu number. You could do:

getcpu
read cs count
rdtscp
read clock params
barrier()
check that both cpu numbers agree and that the cs count hasn't changed
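
A minimal sketch of that suggested structure, reusing identifiers from the
patch (untested, purely illustrative):

        do {
                cpu = __getcpu() & VGETCPU_CPU_MASK;
                cs = vp->vpercpu[cpu].cpu_cs_count;

                rdtscpll(tscval, p);
                cpu1 = p & VGETCPU_CPU_MASK;
                adj_sched_time = vp->vpercpu[cpu1].adj_sched_time;
                scale = vp->vpercpu[cpu1].cyc2ns;
                offset = vp->vpercpu[cpu1].cyc2ns_offset;

                barrier();
                /* accept only if we stayed on one cpu and no switch happened */
        } while (cpu != cpu1 ||
                 cs != vp->vpercpu[cpu].cpu_cs_count);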

All those smp barriers should be unnecessary, too. There's no SMP here.

Also, if you do this, then the high bits of the cs count can be removed.

It may also make sense to bail and use the fallback after a couple
failures. Otherwise, in some debuggers, you will literally never
finish the loop.

I am curious, though: why are you using sched clock instead of
CLOCK_MONOTONIC? Is it because you can't read CLOCK_MONOTONIC in the
scheduler?

--Andy

> +}
> +
> +notrace static inline void ns_to_ts(u64 ns, struct timespec *ts)
> +{
> + u32 rem;
> + ts->tv_sec = div_u64_rem(ns, NSEC_PER_SEC, &rem);
> + ts->tv_nsec = rem;
> +}
> #endif
>
> notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
> @@ -318,6 +360,13 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
> case CLOCK_MONOTONIC_COARSE:
> do_monotonic_coarse(ts);
> break;
> +#if defined(CONFIG_VDSO_CS_DETECT) && defined(CONFIG_X86_64)
> + case CLOCK_THREAD_CPUTIME_ID:
> + if (VVAR(vdso_data).thread_cputime_disabled)
> + goto fallback;
> + ns_to_ts(do_thread_cpu_time(), ts);
> + break;
> +#endif
> default:
> goto fallback;
> }
> diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
> index 400e454..aefdd93 100644
> --- a/arch/x86/vdso/vma.c
> +++ b/arch/x86/vdso/vma.c
> @@ -10,6 +10,7 @@
> #include <linux/init.h>
> #include <linux/random.h>
> #include <linux/elf.h>
> +#include <linux/jump_label.h>
> #include <asm/vsyscall.h>
> #include <asm/vgtod.h>
> #include <asm/proto.h>
> @@ -24,6 +25,7 @@ unsigned int __read_mostly vdso64_enabled = 1;
> extern unsigned short vdso_sync_cpuid;
>
> DEFINE_VVAR(struct vdso_data, vdso_data);
> +struct static_key vcpu_thread_cputime_enabled;
> #endif
>
> void __init init_vdso_image(const struct vdso_image *image)
> @@ -54,6 +56,8 @@ static int __init init_vdso(void)
> init_vdso_image(&vdso_image_x32);
> #endif
>
> + if (!vdso_data.thread_cputime_disabled)
> + static_key_slow_inc(&vcpu_thread_cputime_enabled);
> return 0;
> }
> subsys_initcall(init_vdso);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5e4100a..e0882fb 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2242,6 +2242,9 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> int cpu = smp_processor_id();
>
> vdso_inc_cpu_cs_count(cpu);
> + if (static_key_false(&vcpu_thread_cputime_enabled))
> + vdso_data.vpercpu[cpu].adj_sched_time =
> + current->se.sum_exec_runtime - sched_clock();
> #endif
>
> rq->prev_mm = NULL;
> --
> 1.8.1
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-10 19:11:51

by Andy Lutomirski

Subject: Re: [PATCH 2/3] X86: add a generic API to let vdso code detect context switch

On Wed, Dec 10, 2014 at 10:51 AM, Shaohua Li <[email protected]> wrote:
> On Wed, Dec 10, 2014 at 10:38:41AM -0800, Andy Lutomirski wrote:
>> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
>> > vdso code can't disable preempt, so it can be preempted at any time.
>> > This makes a challenge to implement specific features. This patch adds a
>> > generic API to let vdso code detect context switch.
>> >
>> > With this patch, every cpu maintains a context switch count. The upper
>> > bits of the count is the logical cpu id, so the count can't be identical
>> > for any cpu. The low bits of the count will be increased for each
>> > context switch. For a x86_64 cpu with 4096 cpus, the context switch will
>> > be overflowed for 2^(64 - 12) context switch, which is a long time and can be
>> > ignored. The change of the count in giving time can be used to detect if
>> > context switch occurs.
>>
>> Why do you need those high bits? I don't understand how you could
>> possibly confuse one cpu's count with another's unless you fail to
>> make sure that you're reading the same address both times.
>>
>> That being said, I don't like this patch. I'm not sure I have a much
>> better idea, though. More thoughts in the 0/0 email to follow.
> the vdso code doesn't disable preemption, so it can be migrated between
> cpus at any time, the usage is:
>
> get_countext_switch_count (in cpu A)
> do_something (in cpu B)
> get_countext_switch_count (in cpu C)
>
> The cpu A, B, C could be completely different. We want to make sure
> there is no preemption here and we use the context switch count to judge
> this. If the high bits is ignored, the context switch count could be
> identical even A != C, then our judgement using the switch count is
> wrong.

Sure, but you could compare the cpu numbers, too.

--Andy

>
> Thanks,
> Shaohua



--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-10 19:42:16

by Shaohua Li

Subject: Re: [PATCH 2/3] X86: add a generic API to let vdso code detect context switch

On Wed, Dec 10, 2014 at 11:11:27AM -0800, Andy Lutomirski wrote:
> On Wed, Dec 10, 2014 at 10:51 AM, Shaohua Li <[email protected]> wrote:
> > On Wed, Dec 10, 2014 at 10:38:41AM -0800, Andy Lutomirski wrote:
> >> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
> >> > vdso code can't disable preempt, so it can be preempted at any time.
> >> > This makes a challenge to implement specific features. This patch adds a
> >> > generic API to let vdso code detect context switch.
> >> >
> >> > With this patch, every cpu maintains a context switch count. The upper
> >> > bits of the count is the logical cpu id, so the count can't be identical
> >> > for any cpu. The low bits of the count will be increased for each
> >> > context switch. For a x86_64 cpu with 4096 cpus, the context switch will
> >> > be overflowed for 2^(64 - 12) context switch, which is a long time and can be
> >> > ignored. The change of the count in giving time can be used to detect if
> >> > context switch occurs.
> >>
> >> Why do you need those high bits? I don't understand how you could
> >> possibly confuse one cpu's count with another's unless you fail to
> >> make sure that you're reading the same address both times.
> >>
> >> That being said, I don't like this patch. I'm not sure I have a much
> >> better idea, though. More thoughts in the 0/0 email to follow.
> > the vdso code doesn't disable preemption, so it can be migrated between
> > cpus at any time, the usage is:
> >
> > get_countext_switch_count (in cpu A)
> > do_something (in cpu B)
> > get_countext_switch_count (in cpu C)
> >
> > The cpu A, B, C could be completely different. We want to make sure
> > there is no preemption here and we use the context switch count to judge
> > this. If the high bits is ignored, the context switch count could be
> > identical even A != C, then our judgement using the switch count is
> > wrong.
>
> Sure, but you could compare the cpu numbers, too.

Aha, makes sense.

Thanks,
Shaohua

2014-12-10 21:57:30

by Shaohua Li

Subject: Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO

On Wed, Dec 10, 2014 at 11:10:52AM -0800, Andy Lutomirski wrote:
> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
> > This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
> > use the following method to compute the thread cpu time:
>
> I like the idea, and I like making this type of profiling fast. I
> don't love the implementation because it's an information leak (maybe
> we don't care) and it's ugly.
>
> The info leak could be fixed completely by having a per-process array
> instead of a global array. That's currently tricky without wasting
> memory, but it could be created on demand if we wanted to do that,
> once my vvar .fault patches go in (assuming they do -- I need to ping
> the linux-mm people).

That info leak really doesn't matter. But we need the global array
anyway. The context switch detection data has to be per-cpu and has to
be accessible from remote cpus.

> As for ugliness, it seems to me that there really ought to be a better
> way to do this. What we really want is a per-thread vvar-like thing.
> Any code that can use TLS can do this trivially, but the vdso,
> unfortunately, has no straightforward way to use TLS.
>
> If we could rely on something like TSX (ha!), then this becomes straightforward:
>
> xbegin()
> rdtscp
> read params
> xcommit()
>
> But we can't do that for the next decade or so, obviously.
>
> If we were willing to limit ourselves to systems with rdwrgsbase, then
> we could do:
>
> save gs and gsbase
> mov $__VDSO_CPU,%gs
> mov %gs:(base our array),%whatever
> restore gs and gsbase
>
> (We actually could do this if we make arch_prctl disable this feature
> the first time a process tries to set the gs base.)
>
> If we were willing to limit ourselves to at most 2^20 threads per
> process, then we could assign each thread a 20-bit index within its
> process with some LSL trickery. Then we could have the vdso look up
> the thread in a per-process array with one element per thread. This
> would IMO be the winning approach if we think we'd ever extend the set
> of per-thread vvar things. (For the 32-bit VDSO, this is trivial --
> just use a far pointer.)
>
> If we has per-cpu fixmap entries, we'd use that. Unfortunately, that
> would be a giant mess to implement, and it would be a slowdown for
> some workloads (and a speedup for others - hmm).
>
> Grumble. Thoughts?
>
> --Andy
>
> >
> > t0 = process start
> > t1 = most recent context switch time
> > t2 = time at which the vsyscall is invoked
> >
> > thread_cpu_time = sum(time slices between t0 to t1) + (t2 - t1)
> > = current->se.sum_exec_runtime + now - sched_clock()
> >
> > At context switch time We stash away
> >
> > adj_sched_time = sum_exec_runtime - sched_clock()
> >
> > in a per-cpu struct in the VVAR page and then compute
> >
> > thread_cpu_time = adj_sched_time + now
> >
> > All computations are done in nanosecs on systems where TSC is stable. If
> > TSC is unstable, we fallback to a regular syscall.
> > Benchmark data:
> >
> > for (i = 0; i < 100000000; i++) {
> > clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
> > sum += ts.tv_sec * NSECS_PER_SEC + ts.tv_nsec;
> > }
> >
> > Baseline:
> > real 1m3.428s
> > user 0m5.190s
> > sys 0m58.220s
> >
> > patched:
> > real 0m4.912s
> > user 0m4.910s
> > sys 0m0.000s
> >
> > This should speed up profilers that need to query thread cpu time a lot
> > to do fine-grained timestamps.
> >
> > No statistically significant regression was detected on x86_64 context
> > switch code. Most archs that don't support vsyscalls will have this code
> > disabled via jump labels.
> >
> > Cc: Andy Lutomirski <[email protected]>
> > Cc: H. Peter Anvin <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Signed-off-by: Kumar Sundararajan <[email protected]>
> > Signed-off-by: Arun Sharma <[email protected]>
> > Signed-off-by: Chris Mason <[email protected]>
> > Signed-off-by: Shaohua Li <[email protected]>
> > ---
> > arch/x86/include/asm/vdso.h | 9 +++++++-
> > arch/x86/kernel/tsc.c | 14 ++++++++++++
> > arch/x86/vdso/vclock_gettime.c | 49 ++++++++++++++++++++++++++++++++++++++++++
> > arch/x86/vdso/vma.c | 4 ++++
> > kernel/sched/core.c | 3 +++
> > 5 files changed, 78 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
> > index 42d6d2c..cbbab3a 100644
> > --- a/arch/x86/include/asm/vdso.h
> > +++ b/arch/x86/include/asm/vdso.h
> > @@ -54,14 +54,21 @@ extern void __init init_vdso_image(const struct vdso_image *image);
> > struct vdso_percpu_data {
> > /* layout: | cpu ID | context switch count | */
> > unsigned long cpu_cs_count;
> > +
> > + unsigned long long adj_sched_time;
> > + unsigned long long cyc2ns_offset;
> > + unsigned long cyc2ns;
>
> Having two offsets seems unnecessary.
>
> > } ____cacheline_aligned;
> >
> > struct vdso_data {
> > - int dummy;
> > + unsigned int thread_cputime_disabled;
> > struct vdso_percpu_data vpercpu[0];
> > };
> > extern struct vdso_data vdso_data;
> >
> > +struct static_key;
> > +extern struct static_key vcpu_thread_cputime_enabled;
> > +
> > #ifdef CONFIG_SMP
> > #define VDSO_CS_COUNT_BITS \
> > (sizeof(unsigned long) * 8 - NR_CPUS_BITS)
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index b7e50bb..83f8091 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -21,6 +21,7 @@
> > #include <asm/hypervisor.h>
> > #include <asm/nmi.h>
> > #include <asm/x86_init.h>
> > +#include <asm/vdso.h>
> >
> > unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
> > EXPORT_SYMBOL(cpu_khz);
> > @@ -263,6 +264,11 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
> > data->cyc2ns_offset = ns_now -
> > mul_u64_u32_shr(tsc_now, data->cyc2ns_mul, CYC2NS_SCALE_FACTOR);
> >
> > +#ifdef CONFIG_VDSO_CS_DETECT
> > + vdso_data.vpercpu[cpu].cyc2ns = data->cyc2ns_mul;
> > + vdso_data.vpercpu[cpu].cyc2ns_offset = data->cyc2ns_offset;
> > +#endif
> > +
> > cyc2ns_write_end(cpu, data);
> >
> > done:
> > @@ -989,6 +995,10 @@ void mark_tsc_unstable(char *reason)
> > tsc_unstable = 1;
> > clear_sched_clock_stable();
> > disable_sched_clock_irqtime();
> > +#ifdef CONFIG_VDSO_CS_DETECT
> > + vdso_data.thread_cputime_disabled = 1;
> > + static_key_slow_dec(&vcpu_thread_cputime_enabled);
> > +#endif
> > pr_info("Marking TSC unstable due to %s\n", reason);
> > /* Change only the rating, when not registered */
> > if (clocksource_tsc.mult)
> > @@ -1202,6 +1212,10 @@ void __init tsc_init(void)
> >
> > tsc_disabled = 0;
> > static_key_slow_inc(&__use_tsc);
> > +#ifdef CONFIG_VDSO_CS_DETECT
> > + vdso_data.thread_cputime_disabled = !cpu_has(&boot_cpu_data,
> > + X86_FEATURE_RDTSCP);
> > +#endif
>
> This is backwards IMO. Even if this function never runs, the flag
> should be set to disable the feature. Then you can enable it here.
>
> >
> > if (!no_sched_irq_time)
> > enable_sched_clock_irqtime();
> > diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> > index 438b3be..46c03af2 100644
> > --- a/arch/x86/vdso/vclock_gettime.c
> > +++ b/arch/x86/vdso/vclock_gettime.c
> > @@ -299,6 +299,48 @@ notrace unsigned long __vdso_get_context_switch_count(void)
> > smp_rmb();
> > return VVAR(vdso_data).vpercpu[cpu].cpu_cs_count;
> > }
> > +
> > +#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
> > +notrace static inline unsigned long long __cycles_2_ns(unsigned long long cyc,
> > + unsigned long long scale,
> > + unsigned long long offset)
> > +{
> > + unsigned long long ns = offset;
> > + ns += mul_u64_u32_shr(cyc, scale, CYC2NS_SCALE_FACTOR);
> > + return ns;
> > +}
> > +
> > +notrace static unsigned long do_thread_cpu_time(void)
> > +{
> > + unsigned int p;
> > + u_int64_t tscval;
> > + unsigned long long adj_sched_time;
> > + unsigned long long scale;
> > + unsigned long long offset;
> > + const struct vdso_data *vp = &VVAR(vdso_data);
> > + int cpu;
> > + unsigned long cs;
> > +
> > + do {
> > + cs = __vdso_get_context_switch_count();
> > +
> > + rdtscpll(tscval, p);
> > + cpu = p & VGETCPU_CPU_MASK;
> > + adj_sched_time = vp->vpercpu[cpu].adj_sched_time;
> > + scale = vp->vpercpu[cpu].cyc2ns;
> > + offset = vp->vpercpu[cpu].cyc2ns_offset;
> > +
> > + } while (unlikely(cs != __vdso_get_context_switch_count()));
> > +
> > + return __cycles_2_ns(tscval, scale, offset) + adj_sched_time;
>
> You're literally adding two offsets together here, although it's
> obscured by the abstraction of __cycles_2_ns.

I don't understand the 'two offsets' point; could you please give more
details?

> Also, if you actually expand this function out, it's not optimized
> well. You're doing, roughly:
>
> getcpu
> read cs count
> rdtscp
> read clock params
> getcpu
> read cs count
> repeat if necessary
>
> You should need at most two operations that read the cpu number. You could do:
>
> getcpu
> read cs count
> rdtscp
> read clock params
> barrier()
> check that both cpu numbers agree and that the cs count hasn't changed
>
> All those smp barriers should be unnecessary, too. There's no SMP here.
>
> Also, if you do this, then the high bits of the cs count can be removed.
> It may also make sense to bail and use the fallback after a couple
> failures. Otherwise, in some debuggers, you will literally never
> finish the loop.

ok, makes sense.

> I am curious, though: why are you using sched clock instead of
> CLOCK_MONOTONIC? Is it because you can't read CLOCK_MONOTONIC in the
> scheduler?

The sched clock is faster, I guess.

Thanks,
Shaohua

2014-12-10 22:13:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO

On Wed, Dec 10, 2014 at 1:57 PM, Shaohua Li <[email protected]> wrote:
> On Wed, Dec 10, 2014 at 11:10:52AM -0800, Andy Lutomirski wrote:
>> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
>> > This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
>> > use the following method to compute the thread cpu time:
>>
>> I like the idea, and I like making this type of profiling fast. I
>> don't love the implementation because it's an information leak (maybe
>> we don't care) and it's ugly.
>>
>> The info leak could be fixed completely by having a per-process array
>> instead of a global array. That's currently tricky without wasting
>> memory, but it could be created on demand if we wanted to do that,
>> once my vvar .fault patches go in (assuming they do -- I need to ping
>> the linux-mm people).
>
> those info leak really doesn't matter.

Why not?

> But we need the global array
> anyway. The context switch detection should be per-cpu data and should
> be able to access in remote cpus.

Right, but the whole array could be per process instead of global.

I'm not saying I'm sure that would be better, but I think it's worth
considering.

>
>> As for ugliness, it seems to me that there really ought to be a better
>> way to do this. What we really want is a per-thread vvar-like thing.
>> Any code that can use TLS can do this trivially, but the vdso,
>> unfortunately, has no straightforward way to use TLS.
>>
>> If we could rely on something like TSX (ha!), then this becomes straightforward:
>>
>> xbegin()
>> rdtscp
>> read params
>> xcommit()
>>
>> But we can't do that for the next decade or so, obviously.
>>
>> If we were willing to limit ourselves to systems with rdwrgsbase, then
>> we could do:
>>
>> save gs and gsbase
>> mov $__VDSO_CPU,%gs
>> mov %gs:(base our array),%whatever
>> restore gs and gsbase
>>
>> (We actually could do this if we make arch_prctl disable this feature
>> the first time a process tries to set the gs base.)
>>
>> If we were willing to limit ourselves to at most 2^20 threads per
>> process, then we could assign each thread a 20-bit index within its
>> process with some LSL trickery. Then we could have the vdso look up
>> the thread in a per-process array with one element per thread. This
>> would IMO be the winning approach if we think we'd ever extend the set
>> of per-thread vvar things. (For the 32-bit VDSO, this is trivial --
>> just use a far pointer.)
>>
>> If we has per-cpu fixmap entries, we'd use that. Unfortunately, that
>> would be a giant mess to implement, and it would be a slowdown for
>> some workloads (and a speedup for others - hmm).
>>
>> Grumble. Thoughts?
>>
>> --Andy
>>
>> >
>> > t0 = process start
>> > t1 = most recent context switch time
>> > t2 = time at which the vsyscall is invoked
>> >
>> > thread_cpu_time = sum(time slices between t0 to t1) + (t2 - t1)
>> > = current->se.sum_exec_runtime + now - sched_clock()
>> >
>> > At context switch time We stash away
>> >
>> > adj_sched_time = sum_exec_runtime - sched_clock()
>> >
>> > in a per-cpu struct in the VVAR page and then compute
>> >
>> > thread_cpu_time = adj_sched_time + now
>> >
>> > All computations are done in nanosecs on systems where TSC is stable. If
>> > TSC is unstable, we fallback to a regular syscall.
>> > Benchmark data:
>> >
>> > for (i = 0; i < 100000000; i++) {
>> > clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
>> > sum += ts.tv_sec * NSECS_PER_SEC + ts.tv_nsec;
>> > }
>> >
>> > Baseline:
>> > real 1m3.428s
>> > user 0m5.190s
>> > sys 0m58.220s
>> >
>> > patched:
>> > real 0m4.912s
>> > user 0m4.910s
>> > sys 0m0.000s
>> >
>> > This should speed up profilers that need to query thread cpu time a lot
>> > to do fine-grained timestamps.
>> >
>> > No statistically significant regression was detected on x86_64 context
>> > switch code. Most archs that don't support vsyscalls will have this code
>> > disabled via jump labels.
>> >
>> > Cc: Andy Lutomirski <[email protected]>
>> > Cc: H. Peter Anvin <[email protected]>
>> > Cc: Ingo Molnar <[email protected]>
>> > Signed-off-by: Kumar Sundararajan <[email protected]>
>> > Signed-off-by: Arun Sharma <[email protected]>
>> > Signed-off-by: Chris Mason <[email protected]>
>> > Signed-off-by: Shaohua Li <[email protected]>
>> > ---
>> > arch/x86/include/asm/vdso.h | 9 +++++++-
>> > arch/x86/kernel/tsc.c | 14 ++++++++++++
>> > arch/x86/vdso/vclock_gettime.c | 49 ++++++++++++++++++++++++++++++++++++++++++
>> > arch/x86/vdso/vma.c | 4 ++++
>> > kernel/sched/core.c | 3 +++
>> > 5 files changed, 78 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
>> > index 42d6d2c..cbbab3a 100644
>> > --- a/arch/x86/include/asm/vdso.h
>> > +++ b/arch/x86/include/asm/vdso.h
>> > @@ -54,14 +54,21 @@ extern void __init init_vdso_image(const struct vdso_image *image);
>> > struct vdso_percpu_data {
>> > /* layout: | cpu ID | context switch count | */
>> > unsigned long cpu_cs_count;
>> > +
>> > + unsigned long long adj_sched_time;
>> > + unsigned long long cyc2ns_offset;
>> > + unsigned long cyc2ns;
>>
>> Having two offsets seems unnecessary.
>>
>> > } ____cacheline_aligned;
>> >
>> > struct vdso_data {
>> > - int dummy;
>> > + unsigned int thread_cputime_disabled;
>> > struct vdso_percpu_data vpercpu[0];
>> > };
>> > extern struct vdso_data vdso_data;
>> >
>> > +struct static_key;
>> > +extern struct static_key vcpu_thread_cputime_enabled;
>> > +
>> > #ifdef CONFIG_SMP
>> > #define VDSO_CS_COUNT_BITS \
>> > (sizeof(unsigned long) * 8 - NR_CPUS_BITS)
>> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
>> > index b7e50bb..83f8091 100644
>> > --- a/arch/x86/kernel/tsc.c
>> > +++ b/arch/x86/kernel/tsc.c
>> > @@ -21,6 +21,7 @@
>> > #include <asm/hypervisor.h>
>> > #include <asm/nmi.h>
>> > #include <asm/x86_init.h>
>> > +#include <asm/vdso.h>
>> >
>> > unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
>> > EXPORT_SYMBOL(cpu_khz);
>> > @@ -263,6 +264,11 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
>> > data->cyc2ns_offset = ns_now -
>> > mul_u64_u32_shr(tsc_now, data->cyc2ns_mul, CYC2NS_SCALE_FACTOR);
>> >
>> > +#ifdef CONFIG_VDSO_CS_DETECT
>> > + vdso_data.vpercpu[cpu].cyc2ns = data->cyc2ns_mul;
>> > + vdso_data.vpercpu[cpu].cyc2ns_offset = data->cyc2ns_offset;
>> > +#endif
>> > +
>> > cyc2ns_write_end(cpu, data);
>> >
>> > done:
>> > @@ -989,6 +995,10 @@ void mark_tsc_unstable(char *reason)
>> > tsc_unstable = 1;
>> > clear_sched_clock_stable();
>> > disable_sched_clock_irqtime();
>> > +#ifdef CONFIG_VDSO_CS_DETECT
>> > + vdso_data.thread_cputime_disabled = 1;
>> > + static_key_slow_dec(&vcpu_thread_cputime_enabled);
>> > +#endif
>> > pr_info("Marking TSC unstable due to %s\n", reason);
>> > /* Change only the rating, when not registered */
>> > if (clocksource_tsc.mult)
>> > @@ -1202,6 +1212,10 @@ void __init tsc_init(void)
>> >
>> > tsc_disabled = 0;
>> > static_key_slow_inc(&__use_tsc);
>> > +#ifdef CONFIG_VDSO_CS_DETECT
>> > + vdso_data.thread_cputime_disabled = !cpu_has(&boot_cpu_data,
>> > + X86_FEATURE_RDTSCP);
>> > +#endif
>>
>> This is backwards IMO. Even if this function never runs, the flag
>> should be set to disable the feature. Then you can enable it here.
>>
>> >
>> > if (!no_sched_irq_time)
>> > enable_sched_clock_irqtime();
>> > diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
>> > index 438b3be..46c03af2 100644
>> > --- a/arch/x86/vdso/vclock_gettime.c
>> > +++ b/arch/x86/vdso/vclock_gettime.c
>> > @@ -299,6 +299,48 @@ notrace unsigned long __vdso_get_context_switch_count(void)
>> > smp_rmb();
>> > return VVAR(vdso_data).vpercpu[cpu].cpu_cs_count;
>> > }
>> > +
>> > +#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
>> > +notrace static inline unsigned long long __cycles_2_ns(unsigned long long cyc,
>> > + unsigned long long scale,
>> > + unsigned long long offset)
>> > +{
>> > + unsigned long long ns = offset;
>> > + ns += mul_u64_u32_shr(cyc, scale, CYC2NS_SCALE_FACTOR);
>> > + return ns;
>> > +}
>> > +
>> > +notrace static unsigned long do_thread_cpu_time(void)
>> > +{
>> > + unsigned int p;
>> > + u_int64_t tscval;
>> > + unsigned long long adj_sched_time;
>> > + unsigned long long scale;
>> > + unsigned long long offset;
>> > + const struct vdso_data *vp = &VVAR(vdso_data);
>> > + int cpu;
>> > + unsigned long cs;
>> > +
>> > + do {
>> > + cs = __vdso_get_context_switch_count();
>> > +
>> > + rdtscpll(tscval, p);
>> > + cpu = p & VGETCPU_CPU_MASK;
>> > + adj_sched_time = vp->vpercpu[cpu].adj_sched_time;
>> > + scale = vp->vpercpu[cpu].cyc2ns;
>> > + offset = vp->vpercpu[cpu].cyc2ns_offset;
>> > +
>> > + } while (unlikely(cs != __vdso_get_context_switch_count()));
>> > +
>> > + return __cycles_2_ns(tscval, scale, offset) + adj_sched_time;
>>
>> You're literally adding two offsets together here, although it's
>> obscured by the abstraction of __cycles_2_ns.
>
> I don't understand 'two offsets' stuff, could you please give more
> details?

I think you're only really using cyc2ns_offset + adj_sched_time, so it
seems unnecessary to store both.
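
Put differently, a sketch of folding the two into one value (the
cyc2ns_plus_sched_offset field is made up for illustration; the kernel
would have to refresh it both when the cyc2ns parameters change and at
context switch):

/* kernel side */
static void vdso_update_combined_offset(int cpu, unsigned long mul,
                                        unsigned long long cyc2ns_offset,
                                        unsigned long long adj_sched_time)
{
        vdso_data.vpercpu[cpu].cyc2ns = mul;
        vdso_data.vpercpu[cpu].cyc2ns_plus_sched_offset =
                cyc2ns_offset + adj_sched_time;
}

/* vdso side: one multiply, one shift, one add */
return mul_u64_u32_shr(tscval, vp->vpercpu[cpu].cyc2ns,
                       CYC2NS_SCALE_FACTOR) +
       vp->vpercpu[cpu].cyc2ns_plus_sched_offset;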

--Andy

>
>> Also, if you actually expand this function out, it's not optimized
>> well. You're doing, roughly:
>>
>> getcpu
>> read cs count
>> rdtscp
>> read clock params
>> getcpu
>> read cs count
>> repeat if necessary
>>
>> You should need at most two operations that read the cpu number. You could do:
>>
>> getcpu
>> read cs count
>> rdtscp
>> read clock params
>> barrier()
>> check that both cpu numbers agree and that the cs count hasn't changed
>>
>> All those smp barriers should be unnecessary, too. There's no SMP here.
>>
>> Also, if you do this, then the high bits of the cs count can be removed.
>> It may also make sense to bail and use the fallback after a couple
>> failures. Otherwise, in some debuggers, you will literally never
>> finish the loop.
>
> ok, makes sense.
>
>> I am curious, though: why are you using sched clock instead of
>> CLOCK_MONOTONIC? Is it because you can't read CLOCK_MONOTONIC in the
>> scheduler?
>
> The sched clock is faster, I guess.
>
> Thanks,
> Shaohua



--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-10 22:56:42

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO

On Wed, Dec 10, 2014 at 02:13:23PM -0800, Andy Lutomirski wrote:
> On Wed, Dec 10, 2014 at 1:57 PM, Shaohua Li <[email protected]> wrote:
> > On Wed, Dec 10, 2014 at 11:10:52AM -0800, Andy Lutomirski wrote:
> >> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
> >> > This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
> >> > use the following method to compute the thread cpu time:
> >>
> >> I like the idea, and I like making this type of profiling fast. I
> >> don't love the implementation because it's an information leak (maybe
> >> we don't care) and it's ugly.
> >>
> >> The info leak could be fixed completely by having a per-process array
> >> instead of a global array. That's currently tricky without wasting
> >> memory, but it could be created on demand if we wanted to do that,
> >> once my vvar .fault patches go in (assuming they do -- I need to ping
> >> the linux-mm people).
> >
> > those info leak really doesn't matter.
>
> Why not?

Of course I can't be completely sure, but how could this info be used
in an attack?

> > But we need the global array
> > anyway. The context switch detection should be per-cpu data and should
> > be able to access in remote cpus.
>
> Right, but the whole array could be per process instead of global.
>
> I'm not saying I'm sure that would be better, but I think it's worth
> considering.

Right, it's possible to make it per process. As you said, this would
waste a lot of memory. And you can't even do it on demand, as the
context switch path will write the count to the per-process/per-thread
vvar. Or you can maintain the count in the kernel and let the .fault
handler copy the count to the vvar page (if the vvar page is absent).
But this still wastes memory if applications use the vdso. I'm also
wondering how you handle a page fault in the context switch path if you
don't pin the vdso pages.

Thanks,
Shaohua

2014-12-10 23:06:55

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO

On Wed, Dec 10, 2014 at 2:56 PM, Shaohua Li <[email protected]> wrote:
> On Wed, Dec 10, 2014 at 02:13:23PM -0800, Andy Lutomirski wrote:
>> On Wed, Dec 10, 2014 at 1:57 PM, Shaohua Li <[email protected]> wrote:
>> > On Wed, Dec 10, 2014 at 11:10:52AM -0800, Andy Lutomirski wrote:
>> >> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
>> >> > This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
>> >> > use the following method to compute the thread cpu time:
>> >>
>> >> I like the idea, and I like making this type of profiling fast. I
>> >> don't love the implementation because it's an information leak (maybe
>> >> we don't care) and it's ugly.
>> >>
>> >> The info leak could be fixed completely by having a per-process array
>> >> instead of a global array. That's currently tricky without wasting
>> >> memory, but it could be created on demand if we wanted to do that,
>> >> once my vvar .fault patches go in (assuming they do -- I need to ping
>> >> the linux-mm people).
>> >
>> > those info leak really doesn't matter.
>>
>> Why not?
>
> Ofcourse I can't make sure completely, but how could this info be used
> as attack?

It may leak interesting timing info, even from cpus that are outside
your affinity mask / cpuset. I don't know how much anyone actually
cares.

>
>> > But we need the global array
>> > anyway. The context switch detection should be per-cpu data and should
>> > be able to access in remote cpus.
>>
>> Right, but the whole array could be per process instead of global.
>>
>> I'm not saying I'm sure that would be better, but I think it's worth
>> considering.
>
> right, it's possible to be per process. As you said, this will waster a
> lot of memory. and you can't even do on-demand, as the context switch
> path will write the count to the per-process/per-thread vvar. Or you can
> maintain the count in kernel and let the .fault copy the count to vvar
> page (if the vvar page is absent). But this still wastes memory if
> applications use the vdso. I'm wondering how you handle page fault in
> context switch too if you don't pin the vdso pages.
>

You need to pin them, but at least you don't need to create them at
all until they're needed the first time.

The totally per-thread approach has all kinds of nice properties,
including allowing the whole thing to work without a loop, at least on
64-bit machines (if you detect that you had a context switch, just
return the most recent sum_exec_runtime).
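
A sketch of what the per-thread variant could look like, assuming an
entirely hypothetical per-thread slot that the kernel writes only while
the thread is off the CPU (so the 64-bit loads below cannot race with an
update for the thread doing the read):

struct vdso_thread_slot {
        u64 sum_exec_runtime;   /* ns accumulated up to the last switch-in */
        u64 tsc_at_switch_in;   /* TSC sampled at the last switch-in */
        u32 cyc2ns_mul;         /* cyc2ns multiplier of the cpu we woke on */
};

notrace static u64 per_thread_cpu_time(const struct vdso_thread_slot *slot)
{
        u64 t_in, runtime, tsc;
        unsigned int p;

        t_in = slot->tsc_at_switch_in;
        runtime = slot->sum_exec_runtime;
        barrier();
        rdtscpll(tsc, p);
        (void)p;                /* cpu number is not needed here */
        barrier();

        /*
         * If we were scheduled out and back in meanwhile, the slot has
         * been rewritten and sum_exec_runtime is fresh enough to return
         * directly -- no loop, as noted above.
         */
        if (slot->tsc_at_switch_in != t_in)
                return slot->sum_exec_runtime;

        return runtime + mul_u64_u32_shr(tsc - t_in, slot->cyc2ns_mul,
                                         CYC2NS_SCALE_FACTOR);
}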

Anyway, there's no need to achieve perfection here -- we can always
reimplement this if whatever implementation happens first turns out to
be problematic.

--Andy

> Thanks,
> Shaohua



--
Andy Lutomirski
AMA Capital Management, LLC

2014-12-11 06:36:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO


* Andy Lutomirski <[email protected]> wrote:

> On Wed, Dec 10, 2014 at 2:56 PM, Shaohua Li <[email protected]> wrote:
> > On Wed, Dec 10, 2014 at 02:13:23PM -0800, Andy Lutomirski wrote:
> >> On Wed, Dec 10, 2014 at 1:57 PM, Shaohua Li <[email protected]> wrote:
> >> > On Wed, Dec 10, 2014 at 11:10:52AM -0800, Andy Lutomirski wrote:
> >> >> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
> >> >> > This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
> >> >> > use the following method to compute the thread cpu time:
> >> >>
> >> >> I like the idea, and I like making this type of profiling fast. I
> >> >> don't love the implementation because it's an information leak (maybe
> >> >> we don't care) and it's ugly.
> >> >>
> >> >> The info leak could be fixed completely by having a per-process array
> >> >> instead of a global array. That's currently tricky without wasting
> >> >> memory, but it could be created on demand if we wanted to do that,
> >> >> once my vvar .fault patches go in (assuming they do -- I need to ping
> >> >> the linux-mm people).
> >> >
> >> > those info leak really doesn't matter.
> >>
> >> Why not?
> >
> > Ofcourse I can't make sure completely, but how could this
> > info be used as attack?
>
> It may leak interesting timing info, even from cpus that are
> outside your affinity mask / cpuset. I don't know how much
> anyone actually cares.

Fine-grained timing information has been successfully used to
recover secret keys (and sometimes even coarse timing
information), so it can be a security issue in certain setups.

Thanks,

Ingo

2014-12-15 18:37:13

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO



On Thu, Dec 11, 2014 at 1:36 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> On Wed, Dec 10, 2014 at 2:56 PM, Shaohua Li <[email protected]> wrote:
>> > On Wed, Dec 10, 2014 at 02:13:23PM -0800, Andy Lutomirski wrote:
>> >> On Wed, Dec 10, 2014 at 1:57 PM, Shaohua Li <[email protected]> wrote:
>> >> > On Wed, Dec 10, 2014 at 11:10:52AM -0800, Andy Lutomirski
>> wrote:
>> >> >> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]>
>> wrote:
>> >> >> > This primarily speeds up
>> clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
>> >> >> > use the following method to compute the thread cpu time:
>> >> >>
>> >> >> I like the idea, and I like making this type of profiling
>> fast. I
>> >> >> don't love the implementation because it's an information
>> leak (maybe
>> >> >> we don't care) and it's ugly.
>> >> >>
>> >> >> The info leak could be fixed completely by having a
>> per-process array
>> >> >> instead of a global array. That's currently tricky without
>> wasting
>> >> >> memory, but it could be created on demand if we wanted to do
>> that,
>> >> >> once my vvar .fault patches go in (assuming they do -- I need
>> to ping
>> >> >> the linux-mm people).
>> >> >
>> >> > those info leak really doesn't matter.
>> >>
>> >> Why not?
>> >
>> > Ofcourse I can't make sure completely, but how could this
>> > info be used as attack?
>>
>> It may leak interesting timing info, even from cpus that are
>> outside your affinity mask / cpuset. I don't know how much
>> anyone actually cares.
>
> Finegraned timing information has been successfully used to
> recover secret keys (and sometimes even coarse timing
> information), so it can be a security issue in certain setups.

Trying to nail this down a little more clearly. Are you worried about
the context switch count being exported or the clock_gettime data?

-chris


2014-12-15 18:56:15

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO

On Mon, Dec 15, 2014 at 10:36 AM, Chris Mason <[email protected]> wrote:
>
>
> On Thu, Dec 11, 2014 at 1:36 AM, Ingo Molnar <[email protected]> wrote:
>>
>>
>> * Andy Lutomirski <[email protected]> wrote:
>>
>>> On Wed, Dec 10, 2014 at 2:56 PM, Shaohua Li <[email protected]> wrote:
>>> > On Wed, Dec 10, 2014 at 02:13:23PM -0800, Andy Lutomirski wrote:
>>> >> On Wed, Dec 10, 2014 at 1:57 PM, Shaohua Li <[email protected]> wrote:
>>> >> > On Wed, Dec 10, 2014 at 11:10:52AM -0800, Andy Lutomirski wrote:
>>> >> >> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <[email protected]> wrote:
>>> >> >> > This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID,
>>> ..). We
>>> >> >> > use the following method to compute the thread cpu time:
>>> >> >>
>>> >> >> I like the idea, and I like making this type of profiling fast. I
>>> >> >> don't love the implementation because it's an information leak
>>> (maybe
>>> >> >> we don't care) and it's ugly.
>>> >> >>
>>> >> >> The info leak could be fixed completely by having a per-process
>>> array
>>> >> >> instead of a global array. That's currently tricky without
>>> wasting
>>> >> >> memory, but it could be created on demand if we wanted to do that,
>>> >> >> once my vvar .fault patches go in (assuming they do -- I need to
>>> ping
>>> >> >> the linux-mm people).
>>> >> >
>>> >> > those info leak really doesn't matter.
>>> >>
>>> >> Why not?
>>> >
>>> > Ofcourse I can't make sure completely, but how could this
>>> > info be used as attack?
>>>
>>> It may leak interesting timing info, even from cpus that are
>>> outside your affinity mask / cpuset. I don't know how much
>>> anyone actually cares.
>>
>>
>> Finegraned timing information has been successfully used to
>> recover secret keys (and sometimes even coarse timing
>> information), so it can be a security issue in certain setups.
>
>
> Trying to nail this down a little more clearly. Are you worried about the
> context switch count being exported or the clock_gettime data?
>

The context switch count is unnecessary. Here's an IMO better
algorithm that relies on the context switch code storing the TSC at
last context switch for each CPU in a user-accessible location:

clock_gettime does

cpu, tsc = rdtscp; /* NB: HW orders rdtscp as a load */
barrier();
read scale_factor[cpu], sum_exec_runtime[cpu], etc;
barrier();
if (last_context_switch_tsc[cpu] >= tsc)
repeat (or fall back to a syscall)
return tsc * whatever + whatever_else;

This should be faster, as it avoids two uses of LSL, which take 7-10
cycles each, IIRC.
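
A concrete (purely illustrative) rendering of that pseudocode in the
style of the patch; last_switch_tsc is a hypothetical per-cpu field, and
returning 0 so the caller can fall back to the syscall after a few
failures is likewise an assumption:

notrace static u64 thread_cpu_time_sketch(void)
{
        const struct vdso_data *vp = &VVAR(vdso_data);
        u64 tsc, last_switch, scale, offset, adj;
        unsigned int p, tries = 0;
        int cpu;

        do {
                if (tries++ > 2)
                        return 0;       /* use the fallback syscall */

                rdtscpll(tsc, p);       /* TSC and cpu number in one go */
                cpu = p & VGETCPU_CPU_MASK;

                barrier();
                scale = vp->vpercpu[cpu].cyc2ns;
                offset = vp->vpercpu[cpu].cyc2ns_offset;
                adj = vp->vpercpu[cpu].adj_sched_time;
                last_switch = vp->vpercpu[cpu].last_switch_tsc;
                barrier();

                /* any switch on this cpu after our rdtscp invalidates the
                 * sample, so go around again */
        } while (unlikely(last_switch >= tsc));

        return mul_u64_u32_shr(tsc, scale, CYC2NS_SCALE_FACTOR) +
               offset + adj;
}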

With that improvement, the context switch count becomes unavailable.
That leaves sum_exec_runtime (which ought to be totally uninteresting
for attackers) and the time of the last context switch.

The time of the last context switch *on the current cpu* is mostly
available to attackers anyway -- just call clock_gettime in a loop and
watch for jumps. Some of those are interrupts, but the long ones are
likely to be context switches.
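
As an aside, the userspace probe being described here is trivial to
write; a minimal, standalone example (the 50us threshold is arbitrary)
that flags long gaps between consecutive CLOCK_MONOTONIC reads might
look like:

#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
        uint64_t prev = now_ns();
        long i;

        for (i = 0; i < 100000000; i++) {
                uint64_t cur = now_ns();

                /* long gaps are likely context switches; short ones are
                 * usually interrupts or cache misses */
                if (cur - prev > 50000)
                        printf("gap of %llu ns ending at %llu\n",
                               (unsigned long long)(cur - prev),
                               (unsigned long long)cur);
                prev = cur;
        }
        return 0;
}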

The time of the last context switch on other CPUs is potentially
important. We could fix that by having a per-process array of per-cpu
values instead of a global array of per-cpu values. We could
alternatively have a per-process array of per-thread values.

If we did that, we'd probably want to create the array on demand,
which will be a mess without my special_mapping fault rework that no
one has commented on. I'll ping people after the merge window.

We could also shove all the timing parameters into magic segment
limits in the GDT. The problem with that is that we only get 20 bits
per descriptor, which isn't enough to make it palatable.

For 32-bit programs, we could do evil things with segmentation to make
this all work cleanly and without leaks, but for 64-bit programs this
is a non-starter.

I *really really* want per-cpu memory mappings for this and other
things. Pretty please, Intel or AMD? (We can have them today by
using per-cpu pgds, giving a speedup for system calls (no more swapgs)
and faults but a possibly unacceptable slowdown for switch_mm. The
slowdown would only be there for programs that either use enormous
amounts of virtual memory or use their address space very sparsely,
though. It could be done for free on an EPT guest, I think, as long
as the hypervisor were willing to play along and give us a per-vcpu
guest physical "page".)

--Andy