Received-SPF: pass (google.com: domain of linux-kernel+bounces-204813-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45d1:ec00::1 as permitted sender) client-ip=2604:1380:45d1:ec00::1;
Message-ID: <97f4f35b-e6b2-4a2e-bbea-3f6f127cb9c8@linux.intel.com>
Date: Thu, 6 Jun 2024 13:33:22 -0400
Precedence: bulk
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [RFC] perf_events: exclude_guest impact on
 time_enabled/time_running
To: Stephane Eranian <eranian@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>, LKML
 <linux-kernel@vger.kernel.org>, Ian Rogers <irogers@google.com>,
 "Liang, Kan" <kan.liang@intel.com>, Andi Kleen <ak@linux.intel.com>,
 Ingo Molnar <mingo@redhat.com>, "Narayan, Ananth" <ananth.narayan@amd.com>,
 "Bangoria, Ravikumar" <ravi.bangoria@amd.com>,
 Namhyung Kim <namhyung@google.com>, Mingwei Zhang <mizhang@google.com>,
 Dapeng Mi <dapeng1.mi@linux.intel.com>, Zhang Xiong <xiong.y.zhang@intel.com>
References: <CABPqkBQ3LQ_dXQSQVSrriinvSSXm2fHx4yOms=jRsa2WaXSsog@mail.gmail.com>
 <4622060b-b758-4629-9aa4-cc8334111be0@linux.intel.com>
 <CABPqkBR4CGx9EY4nMmnRT1Uuw0p5iefnFZa0hphh+HBBYqQ71A@mail.gmail.com>
Content-Language: en-US
From: "Liang, Kan" <kan.liang@linux.intel.com>
In-Reply-To: <CABPqkBR4CGx9EY4nMmnRT1Uuw0p5iefnFZa0hphh+HBBYqQ71A@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit


On 2024-06-06 1:08 p.m., Stephane Eranian wrote:
> On Thu, Jun 6, 2024 at 8:48 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>
>>
>>
>> On 2024-06-06 3:57 a.m., Stephane Eranian wrote:
>>> Hi Peter,
>>>
>>> In the context of the new vPMU passthru patch series, we have to look
>>> closer at the definition and implementation of the exclude_guest
>>> filter in the perf_event_attr structure. This filter has been in the
>>> kernel for many years. See patch:
>>> https://lore.kernel.org/all/20240506053020.3911940-8-mizhang@google.com/
>>>
>>> The presumed  definition of the filter is that the user does not want
>>> the event to count while the processor is running in guest mode (i.e.,
>>> inside the virtual machine guest OS or guest user code).
>>>
>>> The perf tool sets is by default on all core PMU events:
>>> $ perf stat -vv -e cycles sleep 0
>>> ------------------------------------------------------------
>>> perf_event_attr:
>>>   size                             112
>>>   sample_type                      IDENTIFIER
>>>   read_format                      TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
>>>   disabled                         1
>>>   inherit                          1
>>>   enable_on_exec                   1
>>>   exclude_guest                    1
>>> ------------------------------------------------------------
>>>
>>> In the kernel, the way this is treated differs between AMD and Intel because AMD
>>> does provide a hardware filter for guest vs. host in the PMU counters
>>> whereas Intel
>>> does not. For the latter, the  kernel simply disables the event in the
>>> hardware counters,
>>> i.e., the event is not descheduled.  Both approaches produce pretty
>>> much the same
>>> desired effect, the event is not counted while in guest mode.
>>>
>>> The issue I would like to raise has to do with the effects on
>>> time_enabled and time_running
>>> for exclude_guest=1 events.
>>>
>>> Given the event is not scheduled out while in guest mode, even though
>>> it is stopped, both time_enabled and time_running continue ticking
>>> while in guest mode.  If a measurement is 10s
>>> long but only 5s are in non-guest mode, then time_enabled=10s,
>>> time_running=10s. The count
>>> represents 10s worth of non guest mode, of which only 5s were really
>>> actively monitoring, but
>>> the user has no way of determining this.
>>
>>
>> For the latest design/implementation, only the exclude_guest=1 host
>> event can be successfully created for the case.
>> The end user should not expect that anything is collected in the guest
>> mode. So both the time_enabled and the time_running will be 5s.
>>
>>>
>>> If we look at vPMU passthru, the host event must have exclude_guest=1
>>> to avoid going into
>>> an error state on context switch to the vCPU thread (with vPMU
>>> enabled).
>>
>> That's the design in V1. There is no error state anymore in V2 and later
>> as suggested by Sean.
>> https://lore.kernel.org/all/ZhmIrQQVgblrhCZs@google.com/
>>
>> The VM cannot be created if there are host events with exclude_guest=0.
>> If a VM has been created, user cannot create an event with
>> exclude_guest=0. So nothing will be moved to an error state on context
>> switch to the vCPU thread.
>>
> Ok, that's new.
> 
>>> But this time,
>>> the event is scheduled out, that means that time_enabled keeps
>>> counting, but time_running
>>> stops. On context switch back in, the host event is scheduled again
>>> and time_running restarts
>>> ticking. For a 10s measurement, where 5s here in the guest, the event
>>> will come out with
>>> time_enabled=10s, time_running=5s, and the tool will scale it up
>>> because it thinks the event
>>> was multiplexed, when in fact it was not. This is not the intended
>>> outcome here. The tool should
>>> not scale the count, it was not multiplexed, it was descheduled
>>> because the filter forced it out.
>>> Note that if the event had been multiplexed while running on the host,
>>> then the scaling would be
>>> appropriate.
>>
>> The scaling will not happen, since both time_enabled and time_running
>> should be the same when there are enough counter resources.
>>
> If an event with exclude_guest=1 is sched out (event_sched_out), time_enabled
> will keep running but time_running will stop. Because the code assumes
> it is stopped
> because of multiplexing. However, here this is not the case. The event
> is stopped because
> the CPU is entering an execution domain that is excluded for the event.
> Unless you've modified the event_sched_out() code or added code to
> patch up time_running
> I don't see how they could be equal. The alternative, as you suggest,
> is to stop time_enabled.
> But usually time_enabled is controlled by ENABLE/DISABLE which are
> different from
> event_sched_out() and event_sched_in().

The entire context are scheduled, ctx_sched_{in,out}.
https://lore.kernel.org/all/20240507085807.GS40213@noisy.programming.kicks-ass.net/

The time in the guest mode doesn't be accumulated in the ctx->time,
which will be used to calculate the event time later.

The event_sched_in() never be failed (assuming enough PMU resources in
the host after be back from the guest). The time_enabled and
time_running should be equal.

Thanks,
Kan
> 
> 
>>>
>>> In that case, I argue, time_running should be updated to cover the
>>> time the event was not running. That would bring us back to the case I
>>> was describing earlier.
>>>
>>> It boils down to the exact definition of exclude_guest and expected
>>> impact on time_enabled
>>> and time_running. Then, with or without vPMU passthru, we can fix the
>>> kernel to ensure a
>>> uniform behavior.
>>
>> I think the time_enabled should be the one that has a controversial
>> definition.
>> Should the time in the guest mode count as the enabled time for an host
>> event that explicitly sets the exclude_guest=1?
>>
>> I think the answer is NO. So I implemented it in the code.
>>
>>>
>>> What are your thoughts on this problem?
>>>
>>
>> Peter, please share your thoughts. We want to make sure the design is on
>> the right track.
>>
>> Thanks,
>> Kan