2015-02-05 00:50:37

by Stephen Boyd

Subject: Re: [PATCH v3 0/4] sched_clock: Optimize and avoid deadlock during read from NMI

On 01/30, Daniel Thompson wrote:
> This patchset optimizes the generic sched_clock implementation to
> significantly reduce the data cache profile. It also makes it safe to call
> sched_clock() from NMI (or FIQ on ARM).
>
> The data cache profile of sched_clock() in both the original code and
> my previous patch was somewhere between 2 and 3 (64-byte) cache lines,
> depending on alignment of struct clock_data. After patching, the cache
> profile for the normal case should be a single cacheline.
>
> NMI safety was tested on i.MX6 with perf drowning the system in FIQs and
> using the perf handler to check that sched_clock() returned monotonic
> values. At the same time I artificially reduced kt_wrap so that
> update_sched_clock() was being called at >1000Hz.
>
> Without the patches the above system is grossly unstable, surviving
> [9K,115K,25K] perf event cycles during three separate runs. With the
> patch I ran for over 9M perf event cycles before getting bored.

I wanted to see if there was any speedup from these changes so I
made a tight loop around sched_clock() that ran for 10 seconds
and I ran it 10 times before and after this patch series:

unsigned long long clock, start_clock;
int count = 0;

clock = start_clock = sched_clock();
while ((clock - start_clock) < 10ULL * NSEC_PER_SEC) {
        clock = sched_clock();
        count++;
}

pr_info("Made %d calls in %llu ns\n", count, clock - start_clock);

Before
------
Made 19218953 calls in 10000000439 ns
Made 19212790 calls in 10000000438 ns
Made 19217121 calls in 10000000142 ns
Made 19227304 calls in 10000000142 ns
Made 19217559 calls in 10000000142 ns
Made 19230193 calls in 10000000290 ns
Made 19212715 calls in 10000000290 ns
Made 19234446 calls in 10000000438 ns
Made 19226274 calls in 10000000439 ns
Made 19236118 calls in 10000000143 ns

After
-----
Made 19434797 calls in 10000000438 ns
Made 19435733 calls in 10000000439 ns
Made 19434499 calls in 10000000438 ns
Made 19438482 calls in 10000000438 ns
Made 19435604 calls in 10000000142 ns
Made 19438551 calls in 10000000438 ns
Made 19444550 calls in 10000000290 ns
Made 19437580 calls in 10000000290 ns
Made 19439429 calls in 10000048142 ns
Made 19439493 calls in 10000000438 ns

So it seems to be a small improvement.

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


2015-02-05 09:05:44

by Daniel Thompson

Subject: Re: [PATCH v3 0/4] sched_clock: Optimize and avoid deadlock during read from NMI

On 05/02/15 00:50, Stephen Boyd wrote:
> On 01/30, Daniel Thompson wrote:
>> This patchset optimizes the generic sched_clock implementation to
>> significantly reduce the data cache profile. It also makes it safe to call
>> sched_clock() from NMI (or FIQ on ARM).
>>
>> The data cache profile of sched_clock() in both the original code and
>> my previous patch was somewhere between 2 and 3 (64-byte) cache lines,
>> depending on alignment of struct clock_data. After patching, the cache
>> profile for the normal case should be a single cacheline.
>>
>> NMI safety was tested on i.MX6 with perf drowning the system in FIQs and
>> using the perf handler to check that sched_clock() returned monotonic
>> values. At the same time I artificially reduced kt_wrap so that
>> update_sched_clock() was being called at >1000Hz.
>>
>> Without the patches the above system is grossly unstable, surviving
>> [9K,115K,25K] perf event cycles during three separate runs. With the
>> patch I ran for over 9M perf event cycles before getting bored.
>
> I wanted to see if there was any speedup from these changes so I
> made a tight loop around sched_clock() that ran for 10 seconds
> and I ran it 10 times before and after this patch series:
>
> unsigned long long clock, start_clock;
> int count = 0;
>
> clock = start_clock = sched_clock();
> while ((clock - start_clock) < 10ULL * NSEC_PER_SEC) {
>         clock = sched_clock();
>         count++;
> }
>
> pr_info("Made %d calls in %llu ns\n", count, clock - start_clock);
>
> Before
> ------
> Made 19218953 calls in 10000000439 ns
> Made 19212790 calls in 10000000438 ns
> Made 19217121 calls in 10000000142 ns
> Made 19227304 calls in 10000000142 ns
> Made 19217559 calls in 10000000142 ns
> Made 19230193 calls in 10000000290 ns
> Made 19212715 calls in 10000000290 ns
> Made 19234446 calls in 10000000438 ns
> Made 19226274 calls in 10000000439 ns
> Made 19236118 calls in 10000000143 ns
>
> After
> -----
> Made 19434797 calls in 10000000438 ns
> Made 19435733 calls in 10000000439 ns
> Made 19434499 calls in 10000000438 ns
> Made 19438482 calls in 10000000438 ns
> Made 19435604 calls in 10000000142 ns
> Made 19438551 calls in 10000000438 ns
> Made 19444550 calls in 10000000290 ns
> Made 19437580 calls in 10000000290 ns
> Made 19439429 calls in 10000048142 ns
> Made 19439493 calls in 10000000438 ns
>
> So it seems to be a small improvement.
>

Awesome!

I guess this is mostly the effect of simplifying the suspend logic since
the changes to the cache profile probably wouldn't reveal much in such a
tight loop.
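
To spell out what I mean by simplifying the suspend logic: before the
series, sched_clock() tested a suspended flag on every call; afterwards
suspend just swaps in a read function that returns the epoch frozen by
a final update. Roughly (a sketch, with names approximated from
kernel/time/sched_clock.c rather than copied from the series; cd is the
file-local struct clock_data):

static u64 notrace suspended_sched_clock_read(void)
{
        /* Cycle count frozen by the final update on suspend, so
         * sched_clock() keeps returning the suspend-time epoch. */
        return cd.epoch_cyc;
}

static int sched_clock_suspend(void)
{
        update_sched_clock();              /* capture a final epoch */
        hrtimer_cancel(&sched_clock_timer);
        cd.read_sched_clock = suspended_sched_clock_read;
        return 0;
}

Dropping that per-call flag test is exactly the sort of thing a tight
loop like yours would notice.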

I will re-run this after acting on your other review comments. BTW what
device did you run on?


Daniel.

2015-02-08 12:09:37

by Daniel Thompson

Subject: Re: [PATCH v3 0/4] sched_clock: Optimize and avoid deadlock during read from NMI

On 05/02/15 17:05, Daniel Thompson wrote:
> On 05/02/15 00:50, Stephen Boyd wrote:
>> On 01/30, Daniel Thompson wrote:
>>> This patchset optimizes the generic sched_clock implementation to
>>> significantly reduce the data cache profile. It also makes it safe to call
>>> sched_clock() from NMI (or FIQ on ARM).
>>>
>>> The data cache profile of sched_clock() in both the original code and
>>> my previous patch was somewhere between 2 and 3 (64-byte) cache lines,
>>> depending on alignment of struct clock_data. After patching, the cache
>>> profile for the normal case should be a single cacheline.
>>>
>>> NMI safety was tested on i.MX6 with perf drowning the system in FIQs and
>>> using the perf handler to check that sched_clock() returned monotonic
>>> values. At the same time I artificially reduced kt_wrap so that
>>> update_sched_clock() was being called at >1000Hz.
>>>
>>> Without the patches the above system is grossly unstable, surviving
>>> [9K,115K,25K] perf event cycles during three separate runs. With the
>>> patch I ran for over 9M perf event cycles before getting bored.
>>
>> I wanted to see if there was any speedup from these changes so I
>> made a tight loop around sched_clock() that ran for 10 seconds
>> and I ran it 10 times before and after this patch series:
>>
>> unsigned long long clock, start_clock;
>> int count = 0;
>>
>> clock = start_clock = sched_clock();
>> while ((clock - start_clock) < 10ULL * NSEC_PER_SEC) {
>>         clock = sched_clock();
>>         count++;
>> }
>>
>> pr_info("Made %d calls in %llu ns\n", count, clock - start_clock);
>>
>> Before
>> ------
>> Made 19218953 calls in 10000000439 ns
>> Made 19212790 calls in 10000000438 ns
>> Made 19217121 calls in 10000000142 ns
>> Made 19227304 calls in 10000000142 ns
>> Made 19217559 calls in 10000000142 ns
>> Made 19230193 calls in 10000000290 ns
>> Made 19212715 calls in 10000000290 ns
>> Made 19234446 calls in 10000000438 ns
>> Made 19226274 calls in 10000000439 ns
>> Made 19236118 calls in 10000000143 ns
>>
>> After
>> -----
>> Made 19434797 calls in 10000000438 ns
>> Made 19435733 calls in 10000000439 ns
>> Made 19434499 calls in 10000000438 ns
>> Made 19438482 calls in 10000000438 ns
>> Made 19435604 calls in 10000000142 ns
>> Made 19438551 calls in 10000000438 ns
>> Made 19444550 calls in 10000000290 ns
>> Made 19437580 calls in 10000000290 ns
>> Made 19439429 calls in 10000048142 ns
>> Made 19439493 calls in 10000000438 ns
>>
>> So it seems to be a small improvement.
>>
>
> Awesome!
>
> I guess this is mostly the effect of simplifying the suspend logic since
> the changes to the cache profile probably wouldn't reveal much in such a
> tight loop.
>
> I will re-run this after acting on your other review comments. BTW what
> device did you run on?

I ran the same test on my Snapdragon 600 board. The results are a little
odd: there is a quantization effect (counts cluster around a few discrete
values, e.g. ~18.7M, ~19.9M and ~21.7M in the first set) that I cannot
easily explain, and the results of the v4 patch seem almost too good to
be true.

My results are below but I'd be very interested to see what results you
get with the v4 patch!
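
For reference, "branchless" here means the reader picks one of two banks
of clock data using the low bit of the sequence counter, so it never
spins waiting for an in-progress update. The read path looks roughly
like this (a sketch of the v4 approach; cd is the file-local struct
clock_data and cyc_to_ns() is the usual (cyc * mult) >> shift helper):

struct clock_read_data {
        u64 epoch_ns;
        u64 epoch_cyc;
        u64 sched_clock_mask;
        u64 (*read_sched_clock)(void);
        u32 mult;
        u32 shift;
};

unsigned long long notrace sched_clock(void)
{
        u64 cyc, res;
        unsigned long seq;
        struct clock_read_data *rd;

        do {
                seq = raw_read_seqcount(&cd.seq);
                rd = cd.read_data + (seq & 1);  /* bank select */

                cyc = (rd->read_sched_clock() - rd->epoch_cyc) &
                      rd->sched_clock_mask;
                res = rd->epoch_ns + cyc_to_ns(cyc, rd->mult, rd->shift);
        } while (read_seqcount_retry(&cd.seq, seq));

        return res;
}

A bank is well under 64 bytes, which is also where the single-cacheline
claim at the top of the thread comes from.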

Latest (branchless approach):

Made 18736519 calls in 10000000439 ns
Made 19958774 calls in 10000000439 ns
Made 18736500 calls in 10000000587 ns
Made 21703993 calls in 10000000439 ns
Made 18734458 calls in 10000000142 ns
Made 18736175 calls in 10000000439 ns
Made 19961406 calls in 10000000291 ns
Made 19953920 calls in 10000000143 ns
Made 21709619 calls in 10000000290 ns
Made 18734077 calls in 10000000142 ns

v3:

Made 15971788 calls in 10000000438 ns
Made 14594357 calls in 10000000734 ns
Made 14590951 calls in 10000000735 ns
Made 14595048 calls in 10000000290 ns
Made 14595157 calls in 10000000143 ns
Made 14594117 calls in 10000000142 ns
Made 14597277 calls in 10000000142 ns
Made 14594472 calls in 10000000586 ns
Made 14601292 calls in 10000000587 ns
Made 15968630 calls in 10000000587 ns

Current:

Made 14274871 calls in 10000000587 ns
Made 15634136 calls in 10000000587 ns
Made 16453528 calls in 10000000142 ns
Made 14275854 calls in 10000000586 ns
Made 15634128 calls in 10000000438 ns
Made 14277672 calls in 10000000143 ns
Made 14282904 calls in 10000000290 ns
Made 14278673 calls in 10000000142 ns
Made 14276096 calls in 10000000290 ns
Made 14275336 calls in 10000000143 ns

2015-02-09 22:18:38

by Stephen Boyd

Subject: Re: [PATCH v3 0/4] sched_clock: Optimize and avoid deadlock during read from NMI

On 02/08/15 04:09, Daniel Thompson wrote:
> On 05/02/15 17:05, Daniel Thompson wrote:
>> On 05/02/15 00:50, Stephen Boyd wrote:
>>> On 01/30, Daniel Thompson wrote:
>>>> This patchset optimizes the generic sched_clock implementation to
>>>> significantly reduce the data cache profile. It also makes it safe to call
>>>> sched_clock() from NMI (or FIQ on ARM).
>>>>
>>>> The data cache profile of sched_clock() in both the original code and
>>>> my previous patch was somewhere between 2 and 3 (64-byte) cache lines,
>>>> depending on alignment of struct clock_data. After patching, the cache
>>>> profile for the normal case should be a single cacheline.
>>>>
>>>> NMI safety was tested on i.MX6 with perf drowning the system in FIQs and
>>>> using the perf handler to check that sched_clock() returned monotonic
>>>> values. At the same time I artificially reduced kt_wrap so that
>>>> update_sched_clock() was being called at >1000Hz.
>>>>
>>>> Without the patches the above system is grossly unstable, surviving
>>>> [9K,115K,25K] perf event cycles during three separate runs. With the
>>>> patch I ran for over 9M perf event cycles before getting bored.
>>> I wanted to see if there was any speedup from these changes so I
>>> made a tight loop around sched_clock() that ran for 10 seconds
>>> and I ran it 10 times before and after this patch series:
>>>
>>> unsigned long long clock, start_clock;
>>> int count = 0;
>>>
>>> clock = start_clock = sched_clock();
>>> while ((clock - start_clock) < 10ULL * NSEC_PER_SEC) {
>>>         clock = sched_clock();
>>>         count++;
>>> }
>>>
>>> pr_info("Made %d calls in %llu ns\n", count, clock - start_clock);
>>>
>>> Before
>>> ------
>>> Made 19218953 calls in 10000000439 ns
>>> Made 19212790 calls in 10000000438 ns
>>> Made 19217121 calls in 10000000142 ns
>>> Made 19227304 calls in 10000000142 ns
>>> Made 19217559 calls in 10000000142 ns
>>> Made 19230193 calls in 10000000290 ns
>>> Made 19212715 calls in 10000000290 ns
>>> Made 19234446 calls in 10000000438 ns
>>> Made 19226274 calls in 10000000439 ns
>>> Made 19236118 calls in 10000000143 ns
>>>
>>> After
>>> -----
>>> Made 19434797 calls in 10000000438 ns
>>> Made 19435733 calls in 10000000439 ns
>>> Made 19434499 calls in 10000000438 ns
>>> Made 19438482 calls in 10000000438 ns
>>> Made 19435604 calls in 10000000142 ns
>>> Made 19438551 calls in 10000000438 ns
>>> Made 19444550 calls in 10000000290 ns
>>> Made 19437580 calls in 10000000290 ns
>>> Made 19439429 calls in 10000048142 ns
>>> Made 19439493 calls in 10000000438 ns
>>>
>>> So it seems to be a small improvement.
>>>
>> Awesome!
>>
>> I guess this is mostly the effect of simplifying the suspend logic since
>> the changes to the cache profile probably wouldn't reveal much in such a
>> tight loop.
>>
>> I will re-run this after acting on your other review comments. BTW what
>> device did you run on?
> I ran the same test on my Snapdragon 600 board. The results are a little
> odd: there is a quantization effect (counts cluster around a few discrete
> values, e.g. ~18.7M, ~19.9M and ~21.7M in the first set) that I cannot
> easily explain, and the results of the v4 patch seem almost too good to
> be true.
>
> My results are below but I'd be very interested to see what results you
> get with the v4 patch!

I ran my test on an msm8960 board I have lying around. In marketing
terms (which I honestly don't know off the top of my head) I think it's
called a Snapdragon S4 Play. I had to look up the Snapdragon 600 to
figure out what it is; it's slightly newer but should be an APQ8064 of
some kind. Maybe you can try it out on an ARM Ltd. CPU if you have one?

>
> Latest (branchless approach):
>
> Made 18736519 calls in 10000000439 ns
> Made 19958774 calls in 10000000439 ns
> Made 18736500 calls in 10000000587 ns
> Made 21703993 calls in 10000000439 ns
> Made 18734458 calls in 10000000142 ns
> Made 18736175 calls in 10000000439 ns
> Made 19961406 calls in 10000000291 ns
> Made 19953920 calls in 10000000143 ns
> Made 21709619 calls in 10000000290 ns
> Made 18734077 calls in 10000000142 ns

It's certainly more, but it also seems more variable. I'll reply with
raw data from my msm8960 and msm8974 on v4.

> v3:
>
> Made 15971788 calls in 10000000438 ns
> Made 14594357 calls in 10000000734 ns
> Made 14590951 calls in 10000000735 ns
> Made 14595048 calls in 10000000290 ns
> Made 14595157 calls in 10000000143 ns
> Made 14594117 calls in 10000000142 ns
> Made 14597277 calls in 10000000142 ns
> Made 14594472 calls in 10000000586 ns
> Made 14601292 calls in 10000000587 ns
> Made 15968630 calls in 10000000587 ns
>
> Current:
>
> Made 14274871 calls in 10000000587 ns
> Made 15634136 calls in 10000000587 ns
> Made 16453528 calls in 10000000142 ns
> Made 14275854 calls in 10000000586 ns
> Made 15634128 calls in 10000000438 ns
> Made 14277672 calls in 10000000143 ns
> Made 14282904 calls in 10000000290 ns
> Made 14278673 calls in 10000000142 ns
> Made 14276096 calls in 10000000290 ns
> Made 14275336 calls in 10000000143 ns


--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project