Received-SPF: pass (google.com: domain of linux-kernel+bounces-159427-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) client-ip=147.75.80.249;
Message-ID: <bebdafb8-387c-4984-885a-8b22f2d9b9f5@linux.intel.com>
Date: Fri, 26 Apr 2024 09:50:50 +0800
Precedence: bulk
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU
 state for Intel CPU
To: "Liang, Kan" <kan.liang@linux.intel.com>,
 Mingwei Zhang <mizhang@google.com>
Cc: Sean Christopherson <seanjc@google.com>, maobibo <maobibo@loongson.cn>,
 Xiong Zhang <xiong.y.zhang@linux.intel.com>, pbonzini@redhat.com,
 peterz@infradead.org, kan.liang@intel.com, zhenyuw@linux.intel.com,
 jmattson@google.com, kvm@vger.kernel.org, linux-perf-users@vger.kernel.org,
 linux-kernel@vger.kernel.org, zhiyuan.lv@intel.com, eranian@google.com,
 irogers@google.com, samantha.alt@intel.com, like.xu.linux@gmail.com,
 chao.gao@intel.com
References: <ZiaX3H3YfrVh50cs@google.com>
 <d8f3497b-9f63-e30e-0c63-253908d40ac2@loongson.cn>
 <d980dd10-e4c4-4774-b107-77b320cec9f9@linux.intel.com>
 <b5e97aa1-7683-4eff-e1e3-58ac98a8d719@loongson.cn>
 <1ec7a21c-71d0-4f3e-9fa3-3de8ca0f7315@linux.intel.com>
 <5279eabc-ca46-ee1b-b80d-9a511ba90a36@loongson.cn>
 <CAL715WJK893gQd1m9CCAjz5OkxsRc5C4ZR7yJWJXbaGvCeZxQA@mail.gmail.com>
 <b3868bf5-4e16-3435-c807-f484821fccc6@loongson.cn>
 <CAL715W++maAt2Ujfvmu1pZKS4R5EmAPebTU_h9AB8aFbdLFrTQ@mail.gmail.com>
 <f843298c-db08-4fde-9887-13de18d960ac@linux.intel.com>
 <Zikeh2eGjwzDbytu@google.com>
 <7834a811-4764-42aa-8198-55c4556d947b@linux.intel.com>
 <CAL715WKh8VBJ-O50oqSnCqKPQo4Bor_aMnRZeS_TzJP3ja8-YQ@mail.gmail.com>
 <6af2da05-cb47-46f7-b129-08463bc9469b@linux.intel.com>
 <CAL715W+zeqKenPLP2Fm9u_BkGRKAk-mncsOxrg=EKs74qK5f1Q@mail.gmail.com>
 <42acf1fc-1603-4ac5-8a09-edae2d85963d@linux.intel.com>
Content-Language: en-US
From: "Mi, Dapeng" <dapeng1.mi@linux.intel.com>
In-Reply-To: <42acf1fc-1603-4ac5-8a09-edae2d85963d@linux.intel.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit


On 4/26/2024 4:43 AM, Liang, Kan wrote:
>
> On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
>> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>
>>>
>>> On 2024-04-25 12:24 a.m., Mingwei Zhang wrote:
>>>> On Wed, Apr 24, 2024 at 8:56 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>>>>
>>>>> On 4/24/2024 11:00 PM, Sean Christopherson wrote:
>>>>>> On Wed, Apr 24, 2024, Dapeng Mi wrote:
>>>>>>> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
>>>>>>>>>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
>>>>>>>>>> boundary normally, but doing it at VM Enter/Exit boundary when host is
>>>>>>>>>> profiling KVM kernel module. So, dynamically adjusting PMU context
>>>>>>>>>> switch location could be an option.
>>>>>>>>> If there are two VMs with pmu enabled both, however host PMU is not
>>>>>>>>> enabled. PMU context switch should be done in vcpu thread sched-out path.
>>>>>>>>>
>>>>>>>>> If host pmu is used also, we can choose whether PMU switch should be
>>>>>>>>> done in vm exit path or vcpu thread sched-out path.
>>>>>>>>>
>>>>>>>> host PMU is always enabled, ie., Linux currently does not support KVM
>>>>>>>> PMU running standalone. I guess what you mean is there are no active
>>>>>>>> perf_events on the host side. Allowing a PMU context switch drifting
>>>>>>>> from vm-enter/exit boundary to vcpu loop boundary by checking host
>>>>>>>> side events might be a good option. We can keep the discussion, but I
>>>>>>>> won't propose that in v2.
>>>>>>> I suspect if it's really doable to do this deferring. This still makes host
>>>>>>> lose the most of capability to profile KVM. Per my understanding, most of
>>>>>>> KVM overhead happens in the vcpu loop, exactly speaking in VM-exit handling.
>>>>>>> We have no idea when host want to create perf event to profile KVM, it could
>>>>>>> be at any time.
>>>>>> No, the idea is that KVM will load host PMU state asap, but only when host PMU
>>>>>> state actually needs to be loaded, i.e. only when there are relevant host events.
>>>>>>
>>>>>> If there are no host perf events, KVM keeps guest PMU state loaded for the entire
>>>>>> KVM_RUN loop, i.e. provides optimal behavior for the guest.  But if a host perf
>>>>>> events exists (or comes along), the KVM context switches PMU at VM-Enter/VM-Exit,
>>>>>> i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
>>>>>> for the guest while host perf events are active.
>>>>> I see. So KVM needs to provide a callback which needs to be called in
>>>>> the IPI handler. The KVM callback needs to be called to switch PMU state
>>>>> before perf really enabling host event and touching PMU MSRs. And only
>>>>> the perf event with exclude_guest attribute is allowed to create on
>>>>> host. Thanks.
>>>> Do we really need a KVM callback? I think that is one option.
>>>>
>>>> Immediately after VMEXIT, KVM will check whether there are "host perf
>>>> events". If so, do the PMU context switch immediately. Otherwise, keep
>>>> deferring the context switch to the end of vPMU loop.
>>>>
>>>> Detecting if there are "host perf events" would be interesting. The
>>>> "host perf events" refer to the perf_events on the host that are
>>>> active and assigned with HW counters and that are saved when context
>>>> switching to the guest PMU. I think getting those events could be done
>>>> by fetching the bitmaps in cpuc.
>>> The cpuc is ARCH specific structure. I don't think it can be get in the
>>> generic code. You probably have to implement ARCH specific functions to
>>> fetch the bitmaps. It probably won't worth it.
>>>
>>> You may check the pinned_groups and flexible_groups to understand if
>>> there are host perf events which may be scheduled when VM-exit. But it
>>> will not tell the idx of the counters which can only be got when the
>>> host event is really scheduled.
>>>
>>>> I have to look into the details. But
>>>> at the time of VMEXIT, kvm should already have that information, so it
>>>> can immediately decide whether to do the PMU context switch or not.
>>>>
>>>> oh, but when the control is executing within the run loop, a
>>>> host-level profiling starts, say 'perf record -a ...', it will
>>>> generate an IPI to all CPUs. Maybe that's when we need a callback so
>>>> the KVM guest PMU context gets preempted for the host-level profiling.
>>>> Gah..
>>>>
>>>> hmm, not a fan of that. That means the host can poke the guest PMU
>>>> context at any time and cause higher overhead. But I admit it is much
>>>> better than the current approach.
>>>>
>>>> The only thing is that: any command like 'perf record/stat -a' shot in
>>>> dark corners of the host can preempt guest PMUs of _all_ running VMs.
>>>> So, to alleviate that, maybe a module parameter that disables this
>>>> "preemption" is possible? This should fit scenarios where we don't
>>>> want guest PMU to be preempted outside of the vCPU loop?
>>>>
>>> It should not happen. For the current implementation, perf rejects all
>>> the !exclude_guest system-wide event creation if a guest with the vPMU
>>> is running.
>>> However, it's possible to create an exclude_guest system-wide event at
>>> any time. KVM cannot use the information from the VM-entry to decide if
>>> there will be active perf events in the VM-exit.
>> Hmm, why not? If there is any exclude_guest system-wide event,
>> perf_guest_enter() can return something to tell KVM "hey, some active
>> host events are swapped out. they are originally in counter #2 and
>> #3". If so, at the time when perf_guest_enter() returns, KVM will ack
>> that and keep it in its pmu data structure.
> I think it's possible that someone creates !exclude_guest event after
> the perf_guest_enter(). The stale information is saved in the KVM. Perf
> will schedule the event in the next perf_guest_exit(). KVM will not know it.
>
>> Now, when doing context switching back to host at just VMEXIT, KVM
>> will check this data and see if host perf context has something active
>> (of course, they are all exclude_guest events). If not, deferring the
>> context switch to vcpu boundary. Otherwise, do the proper PMU context
>> switching by respecting the occupied counter positions on the host
>> side, i.e., avoid doubling the work on the KVM side.
>>
>> Kan, any suggestion on the above approach?
> I think we can only know the accurate event list at perf_guest_exit().
> You may check the pinned_groups and flexible_groups, which tell if there
> are candidate events.
>
>> Totally understand that
>> there might be some difficulty, since perf subsystem works in several
>> layers and obviously fetching low-level mapping is arch specific work.
>> If that is difficult, we can split the work in two phases: 1) phase
>> #1, just ask perf to tell kvm if there are active exclude_guest events
>> swapped out; 2) phase #2, ask perf to tell their (low-level) counter
>> indices.
>>
> If you want an accurate counter mask, the changes in the arch specific
> code is required. Two phases sound good to me.
>
> Besides perf changes, I think the KVM should also track which counters
> need to be saved/restored. The information can be get from the EventSel
> interception.

Yes, that's another optimization from guest point view. It's in our 
to-do list.


>
> Thanks,
> Kan
>>> The perf_guest_exit() will reload the host state. It's impossible to
>>> save the guest state after that. We may need a KVM callback. So perf can
>>> tell KVM whether to save the guest state before perf reloads the host state.
>>>
>>> Thanks,
>>> Kan
>>>>>
>>>>>> My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@googlecom