Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67;
Subject: Re: [PATCH v4 4/4] perf/core,x86: synchronize PMU task contexts on
 optimized context switches
To:     Peter Zijlstra <peterz@infradead.org>
Cc:     Arnaldo Carvalho de Melo <acme@kernel.org>,
        Ingo Molnar <mingo@redhat.com>,
        Alexander Shishkin <alexander.shishkin@linux.intel.com>,
        Jiri Olsa <jolsa@redhat.com>,
        Namhyung Kim <namhyung@kernel.org>,
        Andi Kleen <ak@linux.intel.com>,
        Kan Liang <kan.liang@linux.intel.com>,
        Stephane Eranian <eranian@google.com>,
        Ian Rogers <irogers@google.com>,
        Song Liu <songliubraving@fb.com>,
        linux-kernel <linux-kernel@vger.kernel.org>
References: <f4662ac9-e72e-d141-bead-da07e29f81e8@linux.intel.com>
 <4d6320bb-0d15-0028-aefb-a176c986b8db@linux.intel.com>
 <20191022094300.GL1817@hirez.programming.kicks-ass.net>
From:   Alexey Budankov <alexey.budankov@linux.intel.com>
Organization: Intel Corp.
Message-ID: <fa9ffa1f-21d0-b918-d66f-b0a20af00eab@linux.intel.com>
Date:   Tue, 22 Oct 2019 15:49:09 +0300
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101
 Thunderbird/60.9.0
MIME-Version: 1.0
In-Reply-To: <20191022094300.GL1817@hirez.programming.kicks-ass.net>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk


On 22.10.2019 12:43, Peter Zijlstra wrote:
> On Tue, Oct 22, 2019 at 09:01:11AM +0300, Alexey Budankov wrote:
> 
>>  			swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
>>  
>> +			/*
>> +			 * PMU specific parts of task perf context can require
>> +			 * additional synchronization which makes sense only if
>> +			 * both next_ctx->task_ctx_data and ctx->task_ctx_data
>> +			 * pointers are allocated. As an example of such
>> +			 * synchronization see implementation details of Intel
>> +			 * LBR call stack data profiling;
>> +			 */
>> +			if (ctx->task_ctx_data && next_ctx->task_ctx_data)
>> +				pmu->sync_task_ctx(next_ctx->task_ctx_data,
>> +						   ctx->task_ctx_data);
> 
> This still does not check if pmu->sync_task_ctx is set. If any other
> arch ever uses task_ctx_data without then also supplying this method
> things will go *bang*.
> 
> Also, I think I prefer the variant I gave you yesterday:
> 
>   https://lkml.kernel.org/r/20191021103745.GF1800@hirez.programming.kicks-ass.net
> 
> 	if (pmu->swap_task_ctx)
> 		pmu->swap_task_ctx(ctx, next_ctx);
> 	else
> 		swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
> 
> That also unconfuses the argument order in your above patch (where you
> have to undo thw swap).
> 
> Alternatively, since there currently is no other arch using
> task_ctx_data, we can make the pmu::swap_task_ctx() thing mandatory when
> having it and completely replace the swap(), write it like so:
> 
> 
> -	swap(ctx->task_ctx_data, next_ctx->task_ctx_data);

It still has to be swapped unconditionally. Thus, it will be a part of 
architecture specific implementation:

void intel_pmu_lbr_sync_task_ctx(struct x86_perf_task_context **prev,
				 struct x86_perf_task_context **next)
{
	if (*prev && *next) {
		swap(*prev->lbr_callstack_users, *next->lbr_callstack_users);
                ...
        }
	swap(prev, next);
}


> +	if (pmu->swap_task_ctx)
> +		pmu->swap_task_ctx(ctx, next_ctx);
> 
> Hmm?

This option above looks attractive because it pushes complexity down 
towards architecture specific implementation.

However, in order to keep the existing performance at the same level
if (ctx->task_ctx_data && next_ctx->task_ctx_data) check has to be 
preserved as closer to the top layer as possible. So the fastest version
appears to look like this:

swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
if (ctx->task_ctx_data && next_ctx->task_ctx_data && pmu->sync_task_ctx)
	pmu->sync_task_ctx(next_ctx->task_ctx_data, ctx->task_ctx_data);

If some architecture needs specific synchronization then it is enough 
to implement sync_task_ctx() without changing the core.

~Alexey