Subject: Re: [RFC]Add new mdev interface for QoS
To: Kirti Wankhede <kwankhede@nvidia.com>,
        Alex Williamson <alex.williamson@redhat.com>
References: <9951f9cf-89dd-afa4-a9f7-9a795e4c01af@intel.com>
 <20170726104343.5bfa51d5@w520.home>
 <9607b33d-7b3a-1bcf-1ad9-4b554100e68a@intel.com>
 <f2032dbc-6d4e-a354-9fb6-aef0ec3283a9@intel.com>
 <20170801162625.6264dbd6@w520.home>
 <0f637a9b-8b74-8b50-6611-2eb2557a80d6@nvidia.com>
 <461872b1-1086-5151-1473-734223b050d0@intel.com>
 <e333d103-2321-304a-ff3f-2e0281575990@nvidia.com>
 <20170802105845.717ecf5f@w520.home>
 <ebebd457-cae1-61e2-7a84-20d07029a78f@intel.com>
 <20170803151155.35c650cb@w520.home>
 <09229dca-1083-4970-a27d-ec82d06f0b28@intel.com>
 <e7dfa278-36a0-a6b8-289b-12c7b40316c6@nvidia.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
        "Tian, Kevin" <kevin.tian@intel.com>,
        Zhenyu Wang <zhenyuw@linux.intel.com>, Jike Song <jike.song@intel.com>,
        libvir-list@redhat.com, zhi.a.wang@intel.com
From: "Gao, Ping A" <ping.a.gao@intel.com>
Message-ID: <9a57d996-6d0c-daec-98b8-b31ab6acb989@intel.com>
Date: Tue, 8 Aug 2017 20:48:45 +0800
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.1.0
MIME-Version: 1.0
In-Reply-To: <e7dfa278-36a0-a6b8-289b-12c7b40316c6@nvidia.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 10307
Lines: 181


On 2017/8/8 14:42, Kirti Wankhede wrote:
>
> On 8/7/2017 1:11 PM, Gao, Ping A wrote:
>> On 2017/8/4 5:11, Alex Williamson wrote:
>>> On Thu, 3 Aug 2017 20:26:14 +0800
>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>
>>>> On 2017/8/3 0:58, Alex Williamson wrote:
>>>>> On Wed, 2 Aug 2017 21:16:28 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>  
>>>>>> On 8/2/2017 6:29 PM, Gao, Ping A wrote:  
>>>>>>> On 2017/8/2 18:19, Kirti Wankhede wrote:    
>>>>>>>> On 8/2/2017 3:56 AM, Alex Williamson wrote:    
>>>>>>>>> On Tue, 1 Aug 2017 13:54:27 +0800
>>>>>>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>>>>>>>    
>>>>>>>>>> On 2017/7/28 0:00, Gao, Ping A wrote:    
>>>>>>>>>>> On 2017/7/27 0:43, Alex Williamson wrote:      
>>>>>>>>>>>> [cc +libvir-list]
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 26 Jul 2017 21:16:59 +0800
>>>>>>>>>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>>>>>>>>>>      
>>>>>>>>>>>>> vfio-mdev lets different guests share the same physical device through
>>>>>>>>>>>>> mediated sharing. As a result, we need a way to control that sharing:
>>>>>>>>>>>>> a QoS-related interface for mdev to manage virtual device resources.
>>>>>>>>>>>>>
>>>>>>>>>>>>> E.g. in practical use, vGPUs assigned to different guests almost always
>>>>>>>>>>>>> have different performance requirements. Some guests may need higher
>>>>>>>>>>>>> priority for real-time usage, while others may need a larger portion of
>>>>>>>>>>>>> the GPU resource to get higher 3D performance. Correspondingly, we can
>>>>>>>>>>>>> define interfaces like weight/cap for overall budget control and
>>>>>>>>>>>>> priority for single-submission control.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So I suggest adding some common, vendor-agnostic attributes to the
>>>>>>>>>>>>> mdev core sysfs for QoS purposes.
>>>>>>>>>>>> I think what you're asking for is just some standardization of a QoS
>>>>>>>>>>>> attribute_group which a vendor can optionally include within the
>>>>>>>>>>>> existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
>>>>>>>>>>>> transparently enable this, but it really only provides the standard,
>>>>>>>>>>>> all of the support code is left for the vendor.  I'm fine with that,
>>>>>>>>>>>> but of course the trouble with any sort of standardization is arriving
>>>>>>>>>>>> at an agreed upon standard.  Are there QoS knobs that are generic
>>>>>>>>>>>> across any mdev device type?  Are there others that are more specific
>>>>>>>>>>>> to vGPU?  Are there existing examples of this whose specification we
>>>>>>>>>>>> can steal?
>>>>>>>>>>> Yes, you are right, standardized QoS knobs are exactly what I want.
>>>>>>>>>>> Only once they become part of the mdev framework and libvirt can a
>>>>>>>>>>> critical feature like QoS be leveraged for cloud usage. HW vendors then
>>>>>>>>>>> only need to focus on implementing the corresponding QoS algorithm in
>>>>>>>>>>> their back-end drivers.
>>>>>>>>>>>
>>>>>>>>>>> The vfio-mdev framework makes it possible to share devices that lack
>>>>>>>>>>> HW virtualization support with guests. Whatever the device type,
>>>>>>>>>>> mediated sharing is in effect time-sharing multiplexing. From this
>>>>>>>>>>> point of view, QoS can be treated as a generic way to control how much
>>>>>>>>>>> time each virtual mdev device gets on the HW. As a result, we can
>>>>>>>>>>> define QoS knobs that are generic across any device type. Even if the HW
>>>>>>>>>>> has some kind of built-in QoS support, I think it's not a problem
>>>>>>>>>>> for the back-end driver to convert the standard mdev QoS definition to
>>>>>>>>>>> its own specification to reach the same performance expectation. There
>>>>>>>>>>> seem to be no examples for us to follow, so we need to define it from
>>>>>>>>>>> scratch.
>>>>>>>>>>>
>>>>>>>>>>> I propose universal QoS control interfaces like the ones below:
>>>>>>>>>>>
>>>>>>>>>>> Cap: The cap limits the maximum percentage of time an mdev device can own
>>>>>>>>>>> the physical device. E.g. cap=60 means the mdev device cannot take more
>>>>>>>>>>> than 60% of the total physical resource.
>>>>>>>>>>>
>>>>>>>>>>> Weight: The weight defines proportional control of the mdev device
>>>>>>>>>>> resource between guests. It is orthogonal to Cap and targets load
>>>>>>>>>>> balancing. E.g. if guest 1 should get twice the mdev device resource
>>>>>>>>>>> of guest 2, set the weight ratio to 2:1.
>>>>>>>>>>>
>>>>>>>>>>> Priority: The guest with the higher priority gets executed first. This
>>>>>>>>>>> targets real-time usage and faster interactive response.
>>>>>>>>>>>
>>>>>>>>>>> The QoS interfaces above cover both overall budget control and
>>>>>>>>>>> single-submission control. I will send out a detailed design later once
>>>>>>>>>>> we are aligned.
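
As a rough illustration of the "priority" knob described above, here is a
minimal Python sketch of priority-ordered dispatch. It is only a model, not
driver code. The class and method names are made up, and two details are
assumptions the proposal does not settle: a larger number means higher
priority, and submissions of equal priority run in arrival order.

```python
import heapq
from itertools import count


class SubmissionQueue:
    """Toy model: dispatch the pending submission with the highest priority."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # arrival counter, used as a FIFO tie-breaker

    def submit(self, mdev, priority, workload):
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), mdev, workload))

    def next_submission(self):
        _, _, mdev, workload = heapq.heappop(self._heap)
        return mdev, workload


queue = SubmissionQueue()
queue.submit("vGPU_2", 1, "render")
queue.submit("vGPU_1", 5, "ui")
queue.submit("vGPU_3", 1, "compute")
first = queue.next_submission()   # the priority-5 submission from vGPU_1
```

Cap and weight are deliberately left out here; they control the overall
budget rather than the order of individual submissions.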
>>>>>>>>>> Hi Alex,
>>>>>>>>>> Any comments about the interface mentioned above?    
>>>>>>>>> Not really.
>>>>>>>>>
>>>>>>>>> Kirti, are there any QoS knobs that would be interesting
>>>>>>>>> for NVIDIA devices?
>>>>>>>>>    
>>>>>>>> We have different types of vGPU for different QoS factors.
>>>>>>>>
>>>>>>>> When an mdev device is created, its resources are allocated irrespective
>>>>>>>> of which VM/userspace app is going to use that mdev device. Any
>>>>>>>> parameter we add here should be tied to a particular mdev device and not
>>>>>>>> to the guest/app that is going to use it. 'Cap' and 'Priority' are
>>>>>>>> along that line. Not all mdev devices may need/use these parameters, so
>>>>>>>> these can be made optional interfaces.
>>>>>>> We also define some QoS parameters in the Intel vGPU types, but they
>>>>>>> only provide a simple, foolproof default. We still need a flexible
>>>>>>> approach that gives users the ability to change QoS parameters freely
>>>>>>> and dynamically according to their requirements, not restricted to the
>>>>>>> current limited and static vGPU types.
>>>>>>>     
>>>>>>>> In the above proposal, I'm not sure how 'Weight' would work for mdev
>>>>>>>> devices on the same physical device.
>>>>>>>>
>>>>>>>> In the above example, "if guest 1 should take double mdev device
>>>>>>>> resource compare with guest 2" but what if guest 2 never booted, how
>>>>>>>> will you calculate resources?    
>>>>>>> Cap limits the maximum physical GPU resource for a vGPU; it's a
>>>>>>> vertical limitation. Weight is a horizontal limitation that defines
>>>>>>> the GPU resource consumption ratio between vGPUs. Cap is easy to
>>>>>>> understand as it's just a percentage. For weight, for example, if we
>>>>>>> define the max weight as 16, vGPU_1 with weight 8 should be assigned
>>>>>>> double the GPU resources of vGPU_2, whose weight is 4. We can
>>>>>>> translate it to this formula:  resource_of_vGPU_1 = 8 / (8+4) *
>>>>>>> total_physical_GPU_resource.
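
To make the formula above concrete, here is a minimal Python sketch that
turns per-vGPU weights into fractions of the physical GPU. The function name
is hypothetical and this only models the arithmetic, not any driver
behaviour. Exact fractions are used so the result matches the formula term
by term.

```python
from fractions import Fraction


def weighted_shares(weights):
    """Map each vGPU name to weight / sum(all weights)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one vGPU needs a positive weight")
    return {name: Fraction(w, total) for name, w in weights.items()}


# The example from the text: vGPU_1 has weight 8, vGPU_2 has weight 4.
shares = weighted_shares({"vGPU_1": 8, "vGPU_2": 4})
# shares["vGPU_1"] is 8 / (8 + 4) of the physical GPU, twice vGPU_2's share.
```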
>>>>>>>     
>>>>>> How will the vendor driver provide the max weight to the userspace
>>>>>> application/libvirt? Max weight will be per physical device, right?
>>>>>>
>>>>>> How would such resource allocation be reflected in 'available_instances'?
>>>>>> Suppose in the above example, vGPU_1 has 1G FB with weight 8, vGPU_2 has
>>>>>> 1G FB with weight 4 and vGPU_3 has 1G FB with weight 4. Now you have 1G
>>>>>> FB free but you have reached max weight, so will you make
>>>>>> available_instances = 0 for all types on that physical GPU?
>>>>> No, per the algorithm above, the available scheduling for the remaining
>>>>> mdev device is N / (8 + 4 + 4 + N), where N is 1-16 (or maybe 0-16,
>>>>> we'd need to define or make the range discoverable, 16 seems rather
>>>>> arbitrary).  We can always add new scheduling participants.  AIUI,
>>>>> Intel uses round-robin scheduling now, where you could consider all
>>>>> mdev devices to have the same weight.  Whether we consider that to be a
>>>>> weight of 16 or zero or 8 doesn't really matter.  
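
A small Python sketch of the N / (8 + 4 + 4 + N) calculation above, to show
that a newly created mdev device always receives a non-zero scheduling share
whatever the existing weights are. The function name is hypothetical, and
the 1-16 range is only the example range from this thread, not a defined
limit.

```python
from fractions import Fraction


def share_of_new_device(existing_weights, n):
    """Scheduling share of a new device with weight n: n / (sum(existing) + n)."""
    if n <= 0:
        raise ValueError("the new device needs a positive weight")
    return Fraction(n, sum(existing_weights) + n)


existing = [8, 4, 4]                            # vGPU_1, vGPU_2, vGPU_3
lowest = share_of_new_device(existing, 1)       # 1 / (16 + 1)
highest = share_of_new_device(existing, 16)     # 16 / (16 + 16)

# Round-robin is the special case where every device has the same weight;
# which value is used for that common weight does not change the shares.
round_robin = share_of_new_device([8, 8, 8], 8)
```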
>>>> QoS controls the device's processing capability, like GPU
>>>> rendering/computing, that can be time-multiplexed. It is not used to
>>>> control dedicated partitioned resources like FB, so there is no impact on
>>>> 'available_instances'.
>>>>
>>>> if vGPU_1 weight=8, vGPU_2 weight=4;
>>>> then vGPU_1_res = 8 / (8 + 4) * total,  vGPU_2_res = 4 / (8 + 4) * total;
>>>> if vGPU_3 created with weight 2;
>>>> then vGPU_1_res = 8 / (8 + 4 + 2) * total,
>>>>      vGPU_2_res = 4 / (8 + 4 + 2) * total,
>>>>      vGPU_3_res = 2 / (8 + 4 + 2) * total.
>>>>
>>>> The resource allocation of vGPU_1 and vGPU_2 changes dynamically once
>>>> vGPU_3 is created. That is what weight does, since it defines the
>>>> relationship between all the vGPUs, so the performance degradation is
>>>> expected. The end-user should know about such behavior.
>>>>
>>>> However, the discussion about weight made me reflect: does the end-user
>>>> really need weight? Is there an actual application requirement for
>>>> weight?  Maybe cap and priority are enough?
>>> What sort of SLAs do you want to be able to offer?  For instance if I
>>> want to be able to offer a GPU in 1/4 increments, how does that work?
>>> I might sell customers A & B 1/4 increment each and customer C a 1/2
>>> increment.  If weight is removed, can we do better than capping A & B
>>> at 25% each and C at 50%?  That has the downside that nobody gets to
>>> use the unused capacity of the other clients.  The SLA is some sort of
>>> "up to X% (and no more)" model.  With weighting it's as simple as making
>>> sure customer C's vGPU has twice the weight of that given to A or B.
>>> Then you get an "at least X%" SLA model and any customer can use up to
>>> 100% if the others are idle.  Combining weight and cap, we can do "at
>>> least X%, but no more than Y%".
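
A minimal Python sketch of the "at least X%, but no more than Y%" model
described above, combining weight and cap. Several things here are
assumptions made for illustration only: the function name, the tuple layout,
and in particular the "demand" value, which models how much of the GPU a
client currently wants (0 means idle). Unused capacity is handed to the
remaining clients in proportion to their weights.

```python
from fractions import Fraction


def allocate(devices):
    """devices: name -> (weight, cap, demand); cap and demand are fractions
    of the physical GPU. Returns name -> allocated fraction."""
    # A device can never use more than its cap or more than it asks for.
    limit = {name: min(cap, demand) for name, (_w, cap, demand) in devices.items()}
    alloc = {name: Fraction(0) for name in devices}
    active = {name for name, (w, _c, _d) in devices.items()
              if w > 0 and limit[name] > 0}
    remaining = Fraction(1)
    while active and remaining > 0:
        total_weight = sum(devices[name][0] for name in active)
        # Tentatively split what is left in proportion to weight.
        offer = {name: remaining * devices[name][0] / total_weight
                 for name in active}
        saturated = {name for name in active if offer[name] >= limit[name]}
        if not saturated:
            alloc.update(offer)
            break
        # Clamp saturated devices to their limit, then share out the rest.
        for name in saturated:
            alloc[name] = limit[name]
            remaining -= limit[name]
        active -= saturated
    return alloc


ONE = Fraction(1)
# Customers A and B bought 1/4 each and C bought 1/2: weights 1, 1 and 2.
all_busy = allocate({"A": (1, ONE, ONE), "B": (1, ONE, ONE), "C": (2, ONE, ONE)})
c_idle = allocate({"A": (1, ONE, ONE), "B": (1, ONE, ONE),
                   "C": (2, ONE, Fraction(0))})
a_capped = allocate({"A": (1, Fraction(2, 5), ONE), "B": (1, ONE, ONE),
                     "C": (2, ONE, Fraction(0))})
```

With everyone busy each client gets its weighted share; when C is idle, A
and B split the whole GPU; with A additionally capped at 40%, B picks up
what A is not allowed to use.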
>>>
>>> All of this feels really similar to how cpusets must work since we're
>>> just dealing with QoS relative to scheduling and we should not try to
>>> reinvent scheduling QoS.  Thanks,
>>>
>> Yeah, that was also my original thinking.
>> Since we are aligned on the basic QoS definition, I'm going to
>> prepare the code on the kernel side. How about the corresponding part in
>> libvirt? Should it be implemented separately after the kernel interface
>> is finalized?
>>
> Ok. These interfaces should be optional since not all mdev vendor
> drivers may support such QoS.
>

Sure, all of them are optional; vendors are free to choose any of them, or
none at all.
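
For userspace, "optional" would mean treating a missing attribute as "not
supported by this vendor driver". A minimal Python sketch of that idea
follows. The attribute file names (cap, weight, priority) and their location
are purely hypothetical, since the interface is not defined yet; the
directory is passed in by the caller for the same reason.

```python
from pathlib import Path

# Hypothetical attribute names; the real interface is still to be defined.
QOS_ATTRS = ("cap", "weight", "priority")


def read_qos(mdev_dir):
    """Return the QoS attributes found under mdev_dir, skipping absent ones."""
    found = {}
    for name in QOS_ATTRS:
        try:
            found[name] = int((Path(mdev_dir) / name).read_text().strip())
        except FileNotFoundError:
            # The vendor driver does not expose this knob; that is allowed.
            continue
    return found
```

A caller such as libvirt could then only offer the knobs that are actually
present in the returned dictionary.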

Thanks,
Ping