> I think we're then optimizing for different scenarios. Our compute
> driver will use mostly external objects only, and if shared, I don't
> foresee them bound to many VMs. What saves us currently here is that in
> compute mode we only really traverse the extobj list after a preempt
> fence wait, or when a vm is using a new context for the first time. So
> vm's extobj list is pretty large. Each bo's vma list will typically be
> pretty small.
Can I ask why we are optimising for this userspace? This seems
incredibly broken.
We've had this sort of problem in the past with Intel letting the tail
wag the dog; does anyone remember optimising relocations for a
userspace that didn't actually need to use relocations?
We need to ask why this userspace is doing this. Can we get some
pointers to it? A compute driver should have no reason to use mostly
external objects; the OpenCL and Level Zero APIs should be good enough
to figure this out.
Dave.
Am 10.10.23 um 22:23 schrieb Dave Airlie:
>> I think we're then optimizing for different scenarios. Our compute
>> driver will use mostly external objects only, and if shared, I don't
>> foresee them bound to many VMs. What saves us currently here is that in
>> compute mode we only really traverse the extobj list after a preempt
>> fence wait, or when a vm is using a new context for the first time. So
>> vm's extobj list is pretty large. Each bo's vma list will typically be
>> pretty small.
> Can I ask why we are optimising for this userspace? This seems
> incredibly broken.
>
> We've had this sort of problem in the past with Intel letting the tail
> wag the dog; does anyone remember optimising relocations for a
> userspace that didn't actually need to use relocations?
>
> We need to ask why this userspace is doing this. Can we get some
> pointers to it? A compute driver should have no reason to use mostly
> external objects; the OpenCL and Level Zero APIs should be good enough
> to figure this out.
Well that is a pretty normal use case; AMD works the same way.
In a multi-GPU compute stack you have mostly all the data shared between
different hardware devices.
As I said before, looking at just the Vulkan use case is not a good idea
at all.
Christian.
>
> Dave.
On Wed, 2023-10-11 at 06:23 +1000, Dave Airlie wrote:
> > I think we're then optimizing for different scenarios. Our compute
> > driver will use mostly external objects only, and if shared, I
> > don't
> > foresee them bound to many VMs. What saves us currently here is that
> > in
> > compute mode we only really traverse the extobj list after a
> > preempt
> > fence wait, or when a vm is using a new context for the first time.
> > So
> > vm's extobj list is pretty large. Each bo's vma list will typically
> > be
> > pretty small.
>
> Can I ask why we are optimising for this userspace? This seems
> incredibly broken.
First, judging from the discussion with Christian, this is not really
uncommon. There *are* tricks of assorted cleverness we could play in the
KMD to reduce the extobj list size, but doing that in the KMD wouldn't
be much different from accepting a large extobj list and doing what we
can to reduce the overhead of iterating over it.
Second, the discussion here was really about whether we should be using
a lower-level lock to allow for async state updates, with a rather
complex mechanism involving weak reference counting and a requirement to
drop the locks within the loop to avoid lock inversion. If that were a
simplification with little or no overhead, all fine, but IMO it's not a
simplification.
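For readers following the thread, the iteration scheme being debated can be sketched roughly like this. It is a toy, single-threaded userspace model under stated assumptions (the names `VM`, `ExtObj` and `for_each_extobj` are made up for illustration and are not the actual drm_gpuvm API), meant only to show why the list lock gets dropped inside the loop before each object's own lock is taken:

```python
# Toy model of the locking scheme under discussion; illustrative names only.
import threading

class ExtObj:
    """Stand-in for an external (shared) buffer object."""
    def __init__(self, handle):
        self.handle = handle
        self.refcount = 1                # stand-in for (weak) refcounting
        self.lock = threading.Lock()     # per-object lock (dma_resv-like)

class VM:
    def __init__(self):
        self.lock = threading.Lock()     # lower-level lock protecting the list
        self.extobjs = []

def for_each_extobj(vm, fn):
    # The per-object lock must not be taken while the list lock is held
    # (other paths hold the object lock and then want the list lock, so
    # nesting them here would invert the lock order).  Hence: grab a
    # reference so the object can't vanish, drop the list lock, lock the
    # object, then re-take the list lock to advance.
    visited = []
    vm.lock.acquire()
    i = 0
    while i < len(vm.extobjs):
        obj = vm.extobjs[i]
        obj.refcount += 1          # keep obj alive across the unlock
        vm.lock.release()          # avoid list-lock -> object-lock inversion
        with obj.lock:
            visited.append(fn(obj))
        vm.lock.acquire()          # re-lock the list to advance
        obj.refcount -= 1
        i += 1
    vm.lock.release()
    return visited

vm = VM()
vm.extobjs = [ExtObj(h) for h in range(3)]
print(for_each_extobj(vm, lambda o: o.handle))   # -> [0, 1, 2]
```

The cost being argued about is exactly this reference/unlock/relock dance per element, which only pays off if the list is traversed frequently.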
>
> We've had this sort of problem in the past with Intel letting the
> tail wag the dog; does anyone remember optimising relocations for a
> userspace that didn't actually need to use relocations?
>
> We need to ask why this userspace is doing this. Can we get some
> pointers to it? A compute driver should have no reason to use mostly
> external objects; the OpenCL and Level Zero APIs should be good
> enough to figure this out.
TBH for the compute UMD case, I'd be prepared to drop the *performance*
argument for fine-grained locking of the extobj list, since it's really
only traversed on new contexts and after preemption. But as Christian
mentions there might be other cases. We should perhaps figure those out
and document them?
/Thomas
>
> Dave.
On Wed, 11 Oct 2023 at 17:07, Christian König <[email protected]> wrote:
>
> Am 10.10.23 um 22:23 schrieb Dave Airlie:
> >> I think we're then optimizing for different scenarios. Our compute
> >> driver will use mostly external objects only, and if shared, I don't
> >> foresee them bound to many VMs. What saves us currently here is that in
> >> compute mode we only really traverse the extobj list after a preempt
> >> fence wait, or when a vm is using a new context for the first time. So
> >> vm's extobj list is pretty large. Each bo's vma list will typically be
> >> pretty small.
> > Can I ask why we are optimising for this userspace? This seems
> > incredibly broken.
> >
> > We've had this sort of problem in the past with Intel letting the tail
> > wag the dog; does anyone remember optimising relocations for a
> > userspace that didn't actually need to use relocations?
> >
> > We need to ask why this userspace is doing this. Can we get some
> > pointers to it? A compute driver should have no reason to use mostly
> > external objects; the OpenCL and Level Zero APIs should be good enough
> > to figure this out.
>
> Well that is a pretty normal use case; AMD works the same way.
>
> In a multi-GPU compute stack you have mostly all the data shared between
> different hardware devices.
>
> As I said before, looking at just the Vulkan use case is not a good idea
> at all.
>
It's okay, I don't think anyone is doing that; some of these use-cases
are buried in server land and you guys don't communicate them very
well.
Multi-GPU compute would, I'd hope, be moving towards HMM/SVM-type
solutions though?
I'm also not into looking at use-cases that used to be important but
might not be as important going forward.
Dave.
> Christian.
>
> >
> > Dave.
>
Am 12.10.23 um 12:33 schrieb Dave Airlie:
> On Wed, 11 Oct 2023 at 17:07, Christian König <[email protected]> wrote:
>> Am 10.10.23 um 22:23 schrieb Dave Airlie:
>>>> I think we're then optimizing for different scenarios. Our compute
>>>> driver will use mostly external objects only, and if shared, I don't
>>>> foresee them bound to many VMs. What saves us currently here is that in
>>>> compute mode we only really traverse the extobj list after a preempt
>>>> fence wait, or when a vm is using a new context for the first time. So
>>>> vm's extobj list is pretty large. Each bo's vma list will typically be
>>>> pretty small.
>>> Can I ask why we are optimising for this userspace? This seems
>>> incredibly broken.
>>>
>>> We've had this sort of problem in the past with Intel letting the tail
>>> wag the dog; does anyone remember optimising relocations for a
>>> userspace that didn't actually need to use relocations?
>>>
>>> We need to ask why this userspace is doing this. Can we get some
>>> pointers to it? A compute driver should have no reason to use mostly
>>> external objects; the OpenCL and Level Zero APIs should be good enough
>>> to figure this out.
>> Well that is a pretty normal use case; AMD works the same way.
>>
>> In a multi-GPU compute stack you have mostly all the data shared between
>> different hardware devices.
>>
>> As I said before, looking at just the Vulkan use case is not a good idea
>> at all.
>>
> It's okay, I don't think anyone is doing that; some of these
> use-cases are buried in server land and you guys don't communicate
> them very well.
Yeah, well everybody is trying very hard to get away from those
approaches :)
But so far there hasn't been any breakthrough.
>
> Multi-GPU compute would, I'd hope, be moving towards HMM/SVM-type
> solutions though?
Unfortunately not in the foreseeable future. HMM seems more and more
like a dead end, at least for AMD.
AMD still has hardware support in all of their MI* products, but for
Navi the features necessary for implementing HMM have been dropped. And
it looks more and more like they are not going to come back.
In addition to that, from the software side, Felix summarized it quite
well in the HMM peer2peer discussion thread recently: a buffer object
based approach is not only simpler to handle, but also performance-wise
multiple orders of magnitude faster.
> I'm also not into looking at use-cases that used to be important but
> might not be as important going forward.
Well, multimedia applications and OpenGL are still around, but they're
not the main focus any more.
Christian.
>
> Dave.
>
>
>> Christian.
>>
>>> Dave.
On Thu, Oct 12, 2023 at 02:35:15PM +0200, Christian König wrote:
> Am 12.10.23 um 12:33 schrieb Dave Airlie:
> > On Wed, 11 Oct 2023 at 17:07, Christian König <[email protected]> wrote:
> > > Am 10.10.23 um 22:23 schrieb Dave Airlie:
> > > > > I think we're then optimizing for different scenarios. Our compute
> > > > > driver will use mostly external objects only, and if shared, I don't
> > > > > foresee them bound to many VMs. What saves us currently here is that in
> > > > > compute mode we only really traverse the extobj list after a preempt
> > > > > fence wait, or when a vm is using a new context for the first time. So
> > > > > vm's extobj list is pretty large. Each bo's vma list will typically be
> > > > > pretty small.
> > > > Can I ask why we are optimising for this userspace? This seems
> > > > incredibly broken.
> > > >
> > > > We've had this sort of problem in the past with Intel letting the tail
> > > > wag the dog; does anyone remember optimising relocations for a
> > > > userspace that didn't actually need to use relocations?
> > > >
> > > > We need to ask why this userspace is doing this. Can we get some
> > > > pointers to it? A compute driver should have no reason to use mostly
> > > > external objects; the OpenCL and Level Zero APIs should be good enough
> > > > to figure this out.
> > > Well that is a pretty normal use case; AMD works the same way.
> > >
> > > In a multi-GPU compute stack you have mostly all the data shared between
> > > different hardware devices.
> > >
> > > As I said before, looking at just the Vulkan use case is not a good idea
> > > at all.
> > >
> > It's okay, I don't think anyone is doing that; some of these
> > use-cases are buried in server land and you guys don't communicate
> > them very well.
>
> Yeah, well everybody is trying very hard to get away from those approaches
> :)
>
> But so far there hasn't been any breakthrough.
>
> >
> > Multi-GPU compute would, I'd hope, be moving towards HMM/SVM-type
> > solutions though?
>
> Unfortunately not in the foreseeable future. HMM seems more and more like a
> dead end, at least for AMD.
>
> AMD still has hardware support in all of their MI* products, but for Navi
> the features necessary for implementing HMM have been dropped. And it looks
> more and more like they are not going to come back.
>
> In addition to that, from the software side, Felix summarized it quite well
> in the HMM peer2peer discussion thread recently: a buffer object based
> approach is not only simpler to handle, but also performance-wise multiple
> orders of magnitude faster.
This matches what I'm hearing from all over. Turns out that handling page
faults in full generality in a compute/accel device (not just a GPU) is
just too damn hard, at least for anyone who isn't nvidia. Usually
time-bound preemption guarantees are the first to go, followed right after
by a long list of more fixed-function hardware blocks that outright can't
cope with page faults.
There are so many corner cases where it breaks down that I feel like
device-driver-allocated memory of one flavor or another will stick
around for a very long time.
This isn't even counting the software challenges.
-Sima
> > I'm also not into looking at use-cases that used to be important but
> > might not be as important going forward.
>
> Well, multimedia applications and OpenGL are still around, but they're not
> the main focus any more.
>
> Christian.
>
> >
> > Dave.
> >
> >
> > > Christian.
> > >
> > > > Dave.
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Thu, Oct 12, 2023 at 02:35:15PM +0200, Christian König wrote:
> In addition to that, from the software side, Felix summarized it quite
> well in the HMM peer2peer discussion thread recently.
Do you have a pointer to that discussion?