2019-05-14 07:22:05

by Erik Skultety

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

On Tue, May 14, 2019 at 02:12:35AM -0400, Yan Zhao wrote:
> On Mon, May 13, 2019 at 09:28:04PM +0800, Erik Skultety wrote:
> > On Fri, May 10, 2019 at 11:48:38AM +0200, Cornelia Huck wrote:
> > > On Fri, 10 May 2019 10:36:09 +0100
> > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > >
> > > > * Cornelia Huck ([email protected]) wrote:
> > > > > On Thu, 9 May 2019 17:48:26 +0100
> > > > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > > > >
> > > > > > * Cornelia Huck ([email protected]) wrote:
> > > > > > > On Thu, 9 May 2019 16:48:57 +0100
> > > > > > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > > > > > >
> > > > > > > > * Cornelia Huck ([email protected]) wrote:
> > > > > > > > > On Tue, 7 May 2019 15:18:26 -0600
> > > > > > > > > Alex Williamson <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > On Sun, 5 May 2019 21:49:04 -0400
> > > > > > > > > > Yan Zhao <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > > + Errno:
> > > > > > > > > > > + If vendor driver wants to claim a mdev device incompatible to all other mdev
> > > > > > > > > > > + devices, it should not register version attribute for this mdev device. But if
> > > > > > > > > > > + a vendor driver has already registered version attribute and it wants to claim
> > > > > > > > > > > + a mdev device incompatible to all other mdev devices, it needs to return
> > > > > > > > > > > + -ENODEV on access to this mdev device's version attribute.
> > > > > > > > > > > + If a mdev device is only incompatible to certain mdev devices, write of
> > > > > > > > > > > + incompatible mdev devices's version strings to its version attribute should
> > > > > > > > > > > + return -EINVAL;
> > > > > > > > > >
> > > > > > > > > > I think it's best not to define the specific errno returned for a
> > > > > > > > > > specific situation, let the vendor driver decide, userspace simply
> > > > > > > > > > needs to know that an errno on read indicates the device does not
> > > > > > > > > > support migration version comparison and that an errno on write
> > > > > > > > > > indicates the devices are incompatible or the target doesn't support
> > > > > > > > > > migration versions.
> > > > > > > > >
> > > > > > > > > I think I have to disagree here: It's probably valuable to have an
> > > > > > > > > agreed error for 'cannot migrate at all' vs 'cannot migrate between
> > > > > > > > > those two particular devices'. Userspace might want to do different
> > > > > > > > > things (e.g. trying with different device pairs).
> > > > > > > >
> > > > > > > > Trying to stuff these things down an errno seems a bad idea; we can't
> > > > > > > > get much information that way.
> > > > > > >
> > > > > > > So, what would be a reasonable approach? Userspace should first read
> > > > > > > the version attributes on both devices (to find out whether migration
> > > > > > > is supported at all), and only then figure out via writing whether they
> > > > > > > are compatible?
> > > > > > >
> > > > > > > (Or just go ahead and try, if it does not care about the reason.)
> > > > > >
> > > > > > Well, I'm OK with something like writing to test whether it's
> > > > > > compatible, it's just we need a better way of saying 'no'.
> > > > > > I'm not sure if that involves reading back from somewhere after
> > > > > > the write or what.
> > > > >
> > > > > Hm, so I basically see two ways of doing that:
> > > > > - standardize on some error codes... problem: error codes can be hard
> > > > > to fit to reasons
> > > > > - make the error available in some attribute that can be read
> > > > >
> > > > > I'm not sure how we can serialize the readback with the last write,
> > > > > though (this looks inherently racy).
> > > > >
> > > > > How important is detailed error reporting here?
> > > >
> > > > I think we need something, otherwise we're just going to get vague
> > > > user reports of 'but my VM doesn't migrate'; I'd like the error to be
> > > > good enough to point most users to something they can understand
> > > > (e.g. wrong card family/too old a driver etc).
> > >
> > > Ok, that sounds like a reasonable point. Not that I have a better idea
> > > how to achieve that, though... we could also log a more verbose error
> > > message to the kernel log, but that's not necessarily where a user will
> > > look first.
> >
> > In case of libvirt checking the compatibility, it won't matter how good the
> > error message in the kernel log is and regardless of how many error states you
> > want to handle, libvirt's only limited to errno here, since we're going to do
> > plain read/write, so our internal error message returned to the user is only
> > going to contain what the errno says - okay, of course we can (and we DO)
> > provide libvirt specific string, further specifying the error but like I
> > mentioned, depending on how many error cases we want to distinguish this may be
> > hard for anyone to figure out solely on the error code, as apps will most
> > probably not parse the
> > logs.
> >
> > Regards,
> > Erik
> hi Erik
> do you mean you are agreeing on defining common errors and only returning errno?

In a sense, yes. While it is highly desirable to have logs with descriptive
messages which will help in troubleshooting tremendously, I wanted to point out
that spending time with error logs may not be that worthwhile especially since
most apps (like libvirt) will solely rely on using read(3)/write(3) to sysfs.
That means that we're limited by the errnos available, so apart from
reporting the generic system message we can't any more magic in terms of the
error messages, so the driver needs to assure that a proper message is
propagated to the journal and at best libvirt can direct the user (consumer) to
look through the system logs for more info. I also agree with the point
mentioned above that defining a specific errno is IMO not the way to go, as
these would be just too specific for the read(3)/write(3) use case.

That said, from libvirt POV as a consumer, I'd expect there to be truly only 2
errors (I believe Alex has mentioned something similar in one of his responses
in one of the threads):
a) read error indicating that an mdev type doesn't support migration
- I assume if one type doesn't support migration, none of the other
types exposed on the parent device do, is that a fair assumption?
b) write error indicating that the mdev types are incompatible for
migration

Regards,
Erik


2019-05-14 07:40:37

by Yan Zhao

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

On Tue, May 14, 2019 at 03:20:40PM +0800, Erik Skultety wrote:
> On Tue, May 14, 2019 at 02:12:35AM -0400, Yan Zhao wrote:
> > On Mon, May 13, 2019 at 09:28:04PM +0800, Erik Skultety wrote:
> > > On Fri, May 10, 2019 at 11:48:38AM +0200, Cornelia Huck wrote:
> > > > On Fri, 10 May 2019 10:36:09 +0100
> > > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > > >
> > > > > * Cornelia Huck ([email protected]) wrote:
> > > > > > On Thu, 9 May 2019 17:48:26 +0100
> > > > > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > > > > >
> > > > > > > * Cornelia Huck ([email protected]) wrote:
> > > > > > > > On Thu, 9 May 2019 16:48:57 +0100
> > > > > > > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > * Cornelia Huck ([email protected]) wrote:
> > > > > > > > > > On Tue, 7 May 2019 15:18:26 -0600
> > > > > > > > > > Alex Williamson <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > On Sun, 5 May 2019 21:49:04 -0400
> > > > > > > > > > > Yan Zhao <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > > + Errno:
> > > > > > > > > > > > + If vendor driver wants to claim a mdev device incompatible to all other mdev
> > > > > > > > > > > > + devices, it should not register version attribute for this mdev device. But if
> > > > > > > > > > > > + a vendor driver has already registered version attribute and it wants to claim
> > > > > > > > > > > > + a mdev device incompatible to all other mdev devices, it needs to return
> > > > > > > > > > > > + -ENODEV on access to this mdev device's version attribute.
> > > > > > > > > > > > + If a mdev device is only incompatible to certain mdev devices, write of
> > > > > > > > > > > > + incompatible mdev devices's version strings to its version attribute should
> > > > > > > > > > > > + return -EINVAL;
> > > > > > > > > > >
> > > > > > > > > > > I think it's best not to define the specific errno returned for a
> > > > > > > > > > > specific situation, let the vendor driver decide, userspace simply
> > > > > > > > > > > needs to know that an errno on read indicates the device does not
> > > > > > > > > > > support migration version comparison and that an errno on write
> > > > > > > > > > > indicates the devices are incompatible or the target doesn't support
> > > > > > > > > > > migration versions.
> > > > > > > > > >
> > > > > > > > > > I think I have to disagree here: It's probably valuable to have an
> > > > > > > > > > agreed error for 'cannot migrate at all' vs 'cannot migrate between
> > > > > > > > > > those two particular devices'. Userspace might want to do different
> > > > > > > > > > things (e.g. trying with different device pairs).
> > > > > > > > >
> > > > > > > > > Trying to stuff these things down an errno seems a bad idea; we can't
> > > > > > > > > get much information that way.
> > > > > > > >
> > > > > > > > So, what would be a reasonable approach? Userspace should first read
> > > > > > > > the version attributes on both devices (to find out whether migration
> > > > > > > > is supported at all), and only then figure out via writing whether they
> > > > > > > > are compatible?
> > > > > > > >
> > > > > > > > (Or just go ahead and try, if it does not care about the reason.)
> > > > > > >
> > > > > > > Well, I'm OK with something like writing to test whether it's
> > > > > > > compatible, it's just we need a better way of saying 'no'.
> > > > > > > I'm not sure if that involves reading back from somewhere after
> > > > > > > the write or what.
> > > > > >
> > > > > > Hm, so I basically see two ways of doing that:
> > > > > > - standardize on some error codes... problem: error codes can be hard
> > > > > > to fit to reasons
> > > > > > - make the error available in some attribute that can be read
> > > > > >
> > > > > > I'm not sure how we can serialize the readback with the last write,
> > > > > > though (this looks inherently racy).
> > > > > >
> > > > > > How important is detailed error reporting here?
> > > > >
> > > > > I think we need something, otherwise we're just going to get vague
> > > > > user reports of 'but my VM doesn't migrate'; I'd like the error to be
> > > > > good enough to point most users to something they can understand
> > > > > (e.g. wrong card family/too old a driver etc).
> > > >
> > > > Ok, that sounds like a reasonable point. Not that I have a better idea
> > > > how to achieve that, though... we could also log a more verbose error
> > > > message to the kernel log, but that's not necessarily where a user will
> > > > look first.
> > >
> > > In case of libvirt checking the compatibility, it won't matter how good the
> > > error message in the kernel log is and regardless of how many error states you
> > > want to handle, libvirt's only limited to errno here, since we're going to do
> > > plain read/write, so our internal error message returned to the user is only
> > > going to contain what the errno says - okay, of course we can (and we DO)
> > > provide libvirt specific string, further specifying the error but like I
> > > mentioned, depending on how many error cases we want to distinguish this may be
> > > hard for anyone to figure out solely on the error code, as apps will most
> > > probably not parse the
> > > logs.
> > >
> > > Regards,
> > > Erik
> > hi Erik
> > do you mean you are agreeing on defining common errors and only returning errno?
>
> In a sense, yes. While it is highly desirable to have logs with descriptive
> messages which will help in troubleshooting tremendously, I wanted to point out
> that spending time with error logs may not be that worthwhile especially since
> most apps (like libvirt) will solely rely on using read(3)/write(3) to sysfs.
> That means that we're limited by the errnos available, so apart from
> reporting the generic system message we can't any more magic in terms of the
> error messages, so the driver needs to assure that a proper message is
> propagated to the journal and at best libvirt can direct the user (consumer) to
> look through the system logs for more info. I also agree with the point
> mentioned above that defining a specific errno is IMO not the way to go, as
> these would be just too specific for the read(3)/write(3) use case.
>
> That said, from libvirt POV as a consumer, I'd expect there to be truly only 2
> errors (I believe Alex has mentioned something similar in one of his responses
> in one of the threads):
> a) read error indicating that an mdev type doesn't support migration
> - I assume if one type doesn't support migration, none of the other
> types exposed on the parent device do, is that a fair assumption?
> b) write error indicating that the mdev types are incompatible for
> migration
>
> Regards,
> Erik
Thanks for this explanation.
so, can we arrive at below agreements?

1. "not to define the specific errno returned for a specific situation,
let the vendor driver decide, userspace simply needs to know that an errno on
read indicates the device does not support migration version comparison and
that an errno on write indicates the devices are incompatible or the target
doesn't support migration versions. "
2. vendor driver should log detailed error reasons in kernel log.

Thanks
Yan

> _______________________________________________
> intel-gvt-dev mailing list
> [email protected]
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev

2019-05-14 07:45:06

by Erik Skultety

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

On Tue, May 14, 2019 at 03:32:19AM -0400, Yan Zhao wrote:
> On Tue, May 14, 2019 at 03:20:40PM +0800, Erik Skultety wrote:
> > On Tue, May 14, 2019 at 02:12:35AM -0400, Yan Zhao wrote:
> > > On Mon, May 13, 2019 at 09:28:04PM +0800, Erik Skultety wrote:
> > > > On Fri, May 10, 2019 at 11:48:38AM +0200, Cornelia Huck wrote:
> > > > > On Fri, 10 May 2019 10:36:09 +0100
> > > > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > > > >
> > > > > > * Cornelia Huck ([email protected]) wrote:
> > > > > > > On Thu, 9 May 2019 17:48:26 +0100
> > > > > > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > > > > > >
> > > > > > > > * Cornelia Huck ([email protected]) wrote:
> > > > > > > > > On Thu, 9 May 2019 16:48:57 +0100
> > > > > > > > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > * Cornelia Huck ([email protected]) wrote:
> > > > > > > > > > > On Tue, 7 May 2019 15:18:26 -0600
> > > > > > > > > > > Alex Williamson <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Sun, 5 May 2019 21:49:04 -0400
> > > > > > > > > > > > Yan Zhao <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > > + Errno:
> > > > > > > > > > > > > + If vendor driver wants to claim a mdev device incompatible to all other mdev
> > > > > > > > > > > > > + devices, it should not register version attribute for this mdev device. But if
> > > > > > > > > > > > > + a vendor driver has already registered version attribute and it wants to claim
> > > > > > > > > > > > > + a mdev device incompatible to all other mdev devices, it needs to return
> > > > > > > > > > > > > + -ENODEV on access to this mdev device's version attribute.
> > > > > > > > > > > > > + If a mdev device is only incompatible to certain mdev devices, write of
> > > > > > > > > > > > > + incompatible mdev devices's version strings to its version attribute should
> > > > > > > > > > > > > + return -EINVAL;
> > > > > > > > > > > >
> > > > > > > > > > > > I think it's best not to define the specific errno returned for a
> > > > > > > > > > > > specific situation, let the vendor driver decide, userspace simply
> > > > > > > > > > > > needs to know that an errno on read indicates the device does not
> > > > > > > > > > > > support migration version comparison and that an errno on write
> > > > > > > > > > > > indicates the devices are incompatible or the target doesn't support
> > > > > > > > > > > > migration versions.
> > > > > > > > > > >
> > > > > > > > > > > I think I have to disagree here: It's probably valuable to have an
> > > > > > > > > > > agreed error for 'cannot migrate at all' vs 'cannot migrate between
> > > > > > > > > > > those two particular devices'. Userspace might want to do different
> > > > > > > > > > > things (e.g. trying with different device pairs).
> > > > > > > > > >
> > > > > > > > > > Trying to stuff these things down an errno seems a bad idea; we can't
> > > > > > > > > > get much information that way.
> > > > > > > > >
> > > > > > > > > So, what would be a reasonable approach? Userspace should first read
> > > > > > > > > the version attributes on both devices (to find out whether migration
> > > > > > > > > is supported at all), and only then figure out via writing whether they
> > > > > > > > > are compatible?
> > > > > > > > >
> > > > > > > > > (Or just go ahead and try, if it does not care about the reason.)
> > > > > > > >
> > > > > > > > Well, I'm OK with something like writing to test whether it's
> > > > > > > > compatible, it's just we need a better way of saying 'no'.
> > > > > > > > I'm not sure if that involves reading back from somewhere after
> > > > > > > > the write or what.
> > > > > > >
> > > > > > > Hm, so I basically see two ways of doing that:
> > > > > > > - standardize on some error codes... problem: error codes can be hard
> > > > > > > to fit to reasons
> > > > > > > - make the error available in some attribute that can be read
> > > > > > >
> > > > > > > I'm not sure how we can serialize the readback with the last write,
> > > > > > > though (this looks inherently racy).
> > > > > > >
> > > > > > > How important is detailed error reporting here?
> > > > > >
> > > > > > I think we need something, otherwise we're just going to get vague
> > > > > > user reports of 'but my VM doesn't migrate'; I'd like the error to be
> > > > > > good enough to point most users to something they can understand
> > > > > > (e.g. wrong card family/too old a driver etc).
> > > > >
> > > > > Ok, that sounds like a reasonable point. Not that I have a better idea
> > > > > how to achieve that, though... we could also log a more verbose error
> > > > > message to the kernel log, but that's not necessarily where a user will
> > > > > look first.
> > > >
> > > > In case of libvirt checking the compatibility, it won't matter how good the
> > > > error message in the kernel log is and regardless of how many error states you
> > > > want to handle, libvirt's only limited to errno here, since we're going to do
> > > > plain read/write, so our internal error message returned to the user is only
> > > > going to contain what the errno says - okay, of course we can (and we DO)
> > > > provide libvirt specific string, further specifying the error but like I
> > > > mentioned, depending on how many error cases we want to distinguish this may be
> > > > hard for anyone to figure out solely on the error code, as apps will most
> > > > probably not parse the
> > > > logs.
> > > >
> > > > Regards,
> > > > Erik
> > > hi Erik
> > > do you mean you are agreeing on defining common errors and only returning errno?
> >
> > In a sense, yes. While it is highly desirable to have logs with descriptive
> > messages which will help in troubleshooting tremendously, I wanted to point out
> > that spending time with error logs may not be that worthwhile especially since
> > most apps (like libvirt) will solely rely on using read(3)/write(3) to sysfs.
> > That means that we're limited by the errnos available, so apart from
> > reporting the generic system message we can't any more magic in terms of the
> > error messages, so the driver needs to assure that a proper message is
> > propagated to the journal and at best libvirt can direct the user (consumer) to
> > look through the system logs for more info. I also agree with the point
> > mentioned above that defining a specific errno is IMO not the way to go, as
> > these would be just too specific for the read(3)/write(3) use case.
> >
> > That said, from libvirt POV as a consumer, I'd expect there to be truly only 2
> > errors (I believe Alex has mentioned something similar in one of his responses
> > in one of the threads):
> > a) read error indicating that an mdev type doesn't support migration
> > - I assume if one type doesn't support migration, none of the other
> > types exposed on the parent device do, is that a fair assumption?
> > b) write error indicating that the mdev types are incompatible for
> > migration
> >
> > Regards,
> > Erik
> Thanks for this explanation.
> so, can we arrive at below agreements?
>
> 1. "not to define the specific errno returned for a specific situation,
> let the vendor driver decide, userspace simply needs to know that an errno on
> read indicates the device does not support migration version comparison and
> that an errno on write indicates the devices are incompatible or the target
> doesn't support migration versions. "
> 2. vendor driver should log detailed error reasons in kernel log.

That would be my take on this, yes, but I open to hear any other suggestions and
ideas I couldn't think of as well.

Erik

2019-05-14 07:54:27

by Yan Zhao

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

On Tue, May 14, 2019 at 03:43:44PM +0800, Erik Skultety wrote:
> On Tue, May 14, 2019 at 03:32:19AM -0400, Yan Zhao wrote:
> > On Tue, May 14, 2019 at 03:20:40PM +0800, Erik Skultety wrote:
> > > On Tue, May 14, 2019 at 02:12:35AM -0400, Yan Zhao wrote:
> > > > On Mon, May 13, 2019 at 09:28:04PM +0800, Erik Skultety wrote:
> > > > > On Fri, May 10, 2019 at 11:48:38AM +0200, Cornelia Huck wrote:
> > > > > > On Fri, 10 May 2019 10:36:09 +0100
> > > > > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > > > > >
> > > > > > > * Cornelia Huck ([email protected]) wrote:
> > > > > > > > On Thu, 9 May 2019 17:48:26 +0100
> > > > > > > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > * Cornelia Huck ([email protected]) wrote:
> > > > > > > > > > On Thu, 9 May 2019 16:48:57 +0100
> > > > > > > > > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > * Cornelia Huck ([email protected]) wrote:
> > > > > > > > > > > > On Tue, 7 May 2019 15:18:26 -0600
> > > > > > > > > > > > Alex Williamson <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > On Sun, 5 May 2019 21:49:04 -0400
> > > > > > > > > > > > > Yan Zhao <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > > + Errno:
> > > > > > > > > > > > > > + If vendor driver wants to claim a mdev device incompatible to all other mdev
> > > > > > > > > > > > > > + devices, it should not register version attribute for this mdev device. But if
> > > > > > > > > > > > > > + a vendor driver has already registered version attribute and it wants to claim
> > > > > > > > > > > > > > + a mdev device incompatible to all other mdev devices, it needs to return
> > > > > > > > > > > > > > + -ENODEV on access to this mdev device's version attribute.
> > > > > > > > > > > > > > + If a mdev device is only incompatible to certain mdev devices, write of
> > > > > > > > > > > > > > + incompatible mdev devices's version strings to its version attribute should
> > > > > > > > > > > > > > + return -EINVAL;
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think it's best not to define the specific errno returned for a
> > > > > > > > > > > > > specific situation, let the vendor driver decide, userspace simply
> > > > > > > > > > > > > needs to know that an errno on read indicates the device does not
> > > > > > > > > > > > > support migration version comparison and that an errno on write
> > > > > > > > > > > > > indicates the devices are incompatible or the target doesn't support
> > > > > > > > > > > > > migration versions.
> > > > > > > > > > > >
> > > > > > > > > > > > I think I have to disagree here: It's probably valuable to have an
> > > > > > > > > > > > agreed error for 'cannot migrate at all' vs 'cannot migrate between
> > > > > > > > > > > > those two particular devices'. Userspace might want to do different
> > > > > > > > > > > > things (e.g. trying with different device pairs).
> > > > > > > > > > >
> > > > > > > > > > > Trying to stuff these things down an errno seems a bad idea; we can't
> > > > > > > > > > > get much information that way.
> > > > > > > > > >
> > > > > > > > > > So, what would be a reasonable approach? Userspace should first read
> > > > > > > > > > the version attributes on both devices (to find out whether migration
> > > > > > > > > > is supported at all), and only then figure out via writing whether they
> > > > > > > > > > are compatible?
> > > > > > > > > >
> > > > > > > > > > (Or just go ahead and try, if it does not care about the reason.)
> > > > > > > > >
> > > > > > > > > Well, I'm OK with something like writing to test whether it's
> > > > > > > > > compatible, it's just we need a better way of saying 'no'.
> > > > > > > > > I'm not sure if that involves reading back from somewhere after
> > > > > > > > > the write or what.
> > > > > > > >
> > > > > > > > Hm, so I basically see two ways of doing that:
> > > > > > > > - standardize on some error codes... problem: error codes can be hard
> > > > > > > > to fit to reasons
> > > > > > > > - make the error available in some attribute that can be read
> > > > > > > >
> > > > > > > > I'm not sure how we can serialize the readback with the last write,
> > > > > > > > though (this looks inherently racy).
> > > > > > > >
> > > > > > > > How important is detailed error reporting here?
> > > > > > >
> > > > > > > I think we need something, otherwise we're just going to get vague
> > > > > > > user reports of 'but my VM doesn't migrate'; I'd like the error to be
> > > > > > > good enough to point most users to something they can understand
> > > > > > > (e.g. wrong card family/too old a driver etc).
> > > > > >
> > > > > > Ok, that sounds like a reasonable point. Not that I have a better idea
> > > > > > how to achieve that, though... we could also log a more verbose error
> > > > > > message to the kernel log, but that's not necessarily where a user will
> > > > > > look first.
> > > > >
> > > > > In case of libvirt checking the compatibility, it won't matter how good the
> > > > > error message in the kernel log is and regardless of how many error states you
> > > > > want to handle, libvirt's only limited to errno here, since we're going to do
> > > > > plain read/write, so our internal error message returned to the user is only
> > > > > going to contain what the errno says - okay, of course we can (and we DO)
> > > > > provide libvirt specific string, further specifying the error but like I
> > > > > mentioned, depending on how many error cases we want to distinguish this may be
> > > > > hard for anyone to figure out solely on the error code, as apps will most
> > > > > probably not parse the
> > > > > logs.
> > > > >
> > > > > Regards,
> > > > > Erik
> > > > hi Erik
> > > > do you mean you are agreeing on defining common errors and only returning errno?
> > >
> > > In a sense, yes. While it is highly desirable to have logs with descriptive
> > > messages which will help in troubleshooting tremendously, I wanted to point out
> > > that spending time with error logs may not be that worthwhile especially since
> > > most apps (like libvirt) will solely rely on using read(3)/write(3) to sysfs.
> > > That means that we're limited by the errnos available, so apart from
> > > reporting the generic system message we can't any more magic in terms of the
> > > error messages, so the driver needs to assure that a proper message is
> > > propagated to the journal and at best libvirt can direct the user (consumer) to
> > > look through the system logs for more info. I also agree with the point
> > > mentioned above that defining a specific errno is IMO not the way to go, as
> > > these would be just too specific for the read(3)/write(3) use case.
> > >
> > > That said, from libvirt POV as a consumer, I'd expect there to be truly only 2
> > > errors (I believe Alex has mentioned something similar in one of his responses
> > > in one of the threads):
> > > a) read error indicating that an mdev type doesn't support migration
> > > - I assume if one type doesn't support migration, none of the other
> > > types exposed on the parent device do, is that a fair assumption?
> > > b) write error indicating that the mdev types are incompatible for
> > > migration
> > >
> > > Regards,
> > > Erik
> > Thanks for this explanation.
> > so, can we arrive at below agreements?
> >
> > 1. "not to define the specific errno returned for a specific situation,
> > let the vendor driver decide, userspace simply needs to know that an errno on
> > read indicates the device does not support migration version comparison and
> > that an errno on write indicates the devices are incompatible or the target
> > doesn't support migration versions. "
> > 2. vendor driver should log detailed error reasons in kernel log.
>
> That would be my take on this, yes, but I open to hear any other suggestions and
> ideas I couldn't think of as well.
>
> Erik
got it. thanks a lot!

hi Cornelia and Dave,
do you also agree on:
1. "not to define the specific errno returned for a specific situation,
let the vendor driver decide, userspace simply needs to know that an errno on
read indicates the device does not support migration version comparison and
that an errno on write indicates the devices are incompatible or the target
doesn't support migration versions. "
2. vendor driver should log detailed error reasons in kernel log.

Thanks
Yan

2019-05-14 09:53:27

by Cornelia Huck

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

On Tue, 14 May 2019 03:47:36 -0400
Yan Zhao <[email protected]> wrote:

> On Tue, May 14, 2019 at 03:43:44PM +0800, Erik Skultety wrote:
> > On Tue, May 14, 2019 at 03:32:19AM -0400, Yan Zhao wrote:
> > > On Tue, May 14, 2019 at 03:20:40PM +0800, Erik Skultety wrote:

> > > > That said, from libvirt POV as a consumer, I'd expect there to be truly only 2
> > > > errors (I believe Alex has mentioned something similar in one of his responses
> > > > in one of the threads):
> > > > a) read error indicating that an mdev type doesn't support migration
> > > > - I assume if one type doesn't support migration, none of the other
> > > > types exposed on the parent device do, is that a fair assumption?

Probably; but there might be cases where the migratability depends not
on the device type, but how the partitioning has been done... or is
that too contrived?

> > > > b) write error indicating that the mdev types are incompatible for
> > > > migration
> > > >
> > > > Regards,
> > > > Erik
> > > Thanks for this explanation.
> > > so, can we arrive at below agreements?
> > >
> > > 1. "not to define the specific errno returned for a specific situation,
> > > let the vendor driver decide, userspace simply needs to know that an errno on
> > > read indicates the device does not support migration version comparison and
> > > that an errno on write indicates the devices are incompatible or the target
> > > doesn't support migration versions. "
> > > 2. vendor driver should log detailed error reasons in kernel log.
> >
> > That would be my take on this, yes, but I open to hear any other suggestions and
> > ideas I couldn't think of as well.

So, read to find out whether migration is supported at all, write to
find out whether it is supported for that concrete pairing is
reasonable for libvirt?

> >
> > Erik
> got it. thanks a lot!
>
> hi Cornelia and Dave,
> do you also agree on:
> 1. "not to define the specific errno returned for a specific situation,
> let the vendor driver decide, userspace simply needs to know that an errno on
> read indicates the device does not support migration version comparison and
> that an errno on write indicates the devices are incompatible or the target
> doesn't support migration versions. "
> 2. vendor driver should log detailed error reasons in kernel log.

Two questions:
- How reasonable is it to refer to the system log in order to find out
what exactly went wrong?
- If detailed error reporting is basically done to the syslog, do
different error codes still provide useful information? Or should the
vendor driver decide what it wants to do?

2019-05-14 11:00:08

by Erik Skultety

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

On Tue, May 14, 2019 at 11:51:35AM +0200, Cornelia Huck wrote:
> On Tue, 14 May 2019 03:47:36 -0400
> Yan Zhao <[email protected]> wrote:
>
> > On Tue, May 14, 2019 at 03:43:44PM +0800, Erik Skultety wrote:
> > > On Tue, May 14, 2019 at 03:32:19AM -0400, Yan Zhao wrote:
> > > > On Tue, May 14, 2019 at 03:20:40PM +0800, Erik Skultety wrote:
>
> > > > > That said, from libvirt POV as a consumer, I'd expect there to be truly only 2
> > > > > errors (I believe Alex has mentioned something similar in one of his responses
> > > > > in one of the threads):
> > > > > a) read error indicating that an mdev type doesn't support migration
> > > > > - I assume if one type doesn't support migration, none of the other
> > > > > types exposed on the parent device do, is that a fair assumption?
>
> Probably; but there might be cases where the migratability depends not
> on the device type, but how the partitioning has been done... or is
> that too contrived?

No, you have a point - once again I let my thoughts be carried away by the idea
of heterogeneous setups, which is a discussion for another time anyway, I was
just thinking out loud.

>
> > > > > b) write error indicating that the mdev types are incompatible for
> > > > > migration
> > > > >
> > > > > Regards,
> > > > > Erik
> > > > Thanks for this explanation.
> > > > so, can we arrive at below agreements?
> > > >
> > > > 1. "not to define the specific errno returned for a specific situation,
> > > > let the vendor driver decide, userspace simply needs to know that an errno on
> > > > read indicates the device does not support migration version comparison and
> > > > that an errno on write indicates the devices are incompatible or the target
> > > > doesn't support migration versions. "
> > > > 2. vendor driver should log detailed error reasons in kernel log.
> > >
> > > That would be my take on this, yes, but I open to hear any other suggestions and
> > > ideas I couldn't think of as well.
>
> So, read to find out whether migration is supported at all, write to
> find out whether it is supported for that concrete pairing is
> reasonable for libvirt?

Yes, more specifically, in the prepare phase of migration, we'd retrieve the
string (potentially reporting an error like: "Failed to query migration
support: <errno translation>"), put the string into the migration cookie and
do the check with write on destination. The only thing is that if the error is
on the destination, the error message in kernel log lives only on the
destination, which doesn't help libvirt users, so it would require setting up
remote logging, but for layered products, this is not a problem since those
already utilize central logging nodes.

Then there are the libvirt-specific bits out of scope of this discussion,
whether we should only assume identical mdev type pairs, or whether we should
employ best effort approach and iterate over all the available types exposed by
the vendor and check whether any of the types would support this migration
(back to your note Connie, partitioning would come into the picture here).


>
> > >
> > > Erik
> > got it. thanks a lot!
> >
> > hi Cornelia and Dave,
> > do you also agree on:
> > 1. "not to define the specific errno returned for a specific situation,
> > let the vendor driver decide, userspace simply needs to know that an errno on
> > read indicates the device does not support migration version comparison and
> > that an errno on write indicates the devices are incompatible or the target
> > doesn't support migration versions. "
> > 2. vendor driver should log detailed error reasons in kernel log.
>
> Two questions:
> - How reasonable is it to refer to the system log in order to find out
> what exactly went wrong?
> - If detailed error reporting is basically done to the syslog, do
> different error codes still provide useful information? Or should the
> vendor driver decide what it wants to do?

I'd leave anything beyond returning -1 on read/write from/to the sysfs to the
vendor driver, as user space has no control over it, even if there was a
facility to interpret different return codes for us, I'm not sure (in this
migration-related case) how much would userspace be able to recover or
fallback anyway, you either can or cannot migrate smoothely.

Regards,
Erik

2019-05-14 11:04:30

by Dr. David Alan Gilbert

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

* Cornelia Huck ([email protected]) wrote:
> On Tue, 14 May 2019 03:47:36 -0400
> Yan Zhao <[email protected]> wrote:
>
> > On Tue, May 14, 2019 at 03:43:44PM +0800, Erik Skultety wrote:
> > > On Tue, May 14, 2019 at 03:32:19AM -0400, Yan Zhao wrote:
> > > > On Tue, May 14, 2019 at 03:20:40PM +0800, Erik Skultety wrote:
>
> > > > > That said, from libvirt POV as a consumer, I'd expect there to be truly only 2
> > > > > errors (I believe Alex has mentioned something similar in one of his responses
> > > > > in one of the threads):
> > > > > a) read error indicating that an mdev type doesn't support migration
> > > > > - I assume if one type doesn't support migration, none of the other
> > > > > types exposed on the parent device do, is that a fair assumption?
>
> Probably; but there might be cases where the migratability depends not
> on the device type, but how the partitioning has been done... or is
> that too contrived?
>
> > > > > b) write error indicating that the mdev types are incompatible for
> > > > > migration
> > > > >
> > > > > Regards,
> > > > > Erik
> > > > Thanks for this explanation.
> > > > so, can we arrive at below agreements?
> > > >
> > > > 1. "not to define the specific errno returned for a specific situation,
> > > > let the vendor driver decide, userspace simply needs to know that an errno on
> > > > read indicates the device does not support migration version comparison and
> > > > that an errno on write indicates the devices are incompatible or the target
> > > > doesn't support migration versions. "
> > > > 2. vendor driver should log detailed error reasons in kernel log.
> > >
> > > That would be my take on this, yes, but I open to hear any other suggestions and
> > > ideas I couldn't think of as well.
>
> So, read to find out whether migration is supported at all, write to
> find out whether it is supported for that concrete pairing is
> reasonable for libvirt?
>
> > >
> > > Erik
> > got it. thanks a lot!
> >
> > hi Cornelia and Dave,
> > do you also agree on:
> > 1. "not to define the specific errno returned for a specific situation,
> > let the vendor driver decide, userspace simply needs to know that an errno on
> > read indicates the device does not support migration version comparison and
> > that an errno on write indicates the devices are incompatible or the target
> > doesn't support migration versions. "
> > 2. vendor driver should log detailed error reasons in kernel log.
>
> Two questions:
> - How reasonable is it to refer to the system log in order to find out
> what exactly went wrong?
> - If detailed error reporting is basically done to the syslog, do
> different error codes still provide useful information? Or should the
> vendor driver decide what it wants to do?

I don't see error codes as being that helpful; if we can't actually get
an error message back up the stack (which was my preference), then I guess
syslog is as good as it will get.

Dave

--
Dr. David Alan Gilbert / [email protected] / Manchester, UK

2019-05-14 11:32:03

by Cornelia Huck

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

On Tue, 14 May 2019 12:01:45 +0100
"Dr. David Alan Gilbert" <[email protected]> wrote:

> * Cornelia Huck ([email protected]) wrote:
> > On Tue, 14 May 2019 03:47:36 -0400
> > Yan Zhao <[email protected]> wrote:

> > > hi Cornelia and Dave,
> > > do you also agree on:
> > > 1. "not to define the specific errno returned for a specific situation,
> > > let the vendor driver decide, userspace simply needs to know that an errno on
> > > read indicates the device does not support migration version comparison and
> > > that an errno on write indicates the devices are incompatible or the target
> > > doesn't support migration versions. "
> > > 2. vendor driver should log detailed error reasons in kernel log.
> >
> > Two questions:
> > - How reasonable is it to refer to the system log in order to find out
> > what exactly went wrong?
> > - If detailed error reporting is basically done to the syslog, do
> > different error codes still provide useful information? Or should the
> > vendor driver decide what it wants to do?
>
> I don't see error codes as being that helpful; if we can't actually get
> an error message back up the stack (which was my preference), then I guess
> syslog is as good as it will get.

Ok, so letting the vendor driver simply return an(y) error and possibly
dumping an error message into the syslog seems to be the most
reasonable approach.

2019-05-14 15:04:01

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

On Tue, 14 May 2019 09:43:44 +0200
Erik Skultety <[email protected]> wrote:

> On Tue, May 14, 2019 at 03:32:19AM -0400, Yan Zhao wrote:
> > On Tue, May 14, 2019 at 03:20:40PM +0800, Erik Skultety wrote:
> > > On Tue, May 14, 2019 at 02:12:35AM -0400, Yan Zhao wrote:
> > > > On Mon, May 13, 2019 at 09:28:04PM +0800, Erik Skultety wrote:
> > > > > On Fri, May 10, 2019 at 11:48:38AM +0200, Cornelia Huck wrote:
> > > > > > On Fri, 10 May 2019 10:36:09 +0100
> > > > > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > > > > >
> > > > > > > * Cornelia Huck ([email protected]) wrote:
> > > > > > > > On Thu, 9 May 2019 17:48:26 +0100
> > > > > > > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > * Cornelia Huck ([email protected]) wrote:
> > > > > > > > > > On Thu, 9 May 2019 16:48:57 +0100
> > > > > > > > > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > * Cornelia Huck ([email protected]) wrote:
> > > > > > > > > > > > On Tue, 7 May 2019 15:18:26 -0600
> > > > > > > > > > > > Alex Williamson <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > On Sun, 5 May 2019 21:49:04 -0400
> > > > > > > > > > > > > Yan Zhao <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > > + Errno:
> > > > > > > > > > > > > > + If vendor driver wants to claim a mdev device incompatible to all other mdev
> > > > > > > > > > > > > > + devices, it should not register version attribute for this mdev device. But if
> > > > > > > > > > > > > > + a vendor driver has already registered version attribute and it wants to claim
> > > > > > > > > > > > > > + a mdev device incompatible to all other mdev devices, it needs to return
> > > > > > > > > > > > > > + -ENODEV on access to this mdev device's version attribute.
> > > > > > > > > > > > > > + If a mdev device is only incompatible to certain mdev devices, write of
> > > > > > > > > > > > > > + incompatible mdev devices's version strings to its version attribute should
> > > > > > > > > > > > > > + return -EINVAL;
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think it's best not to define the specific errno returned for a
> > > > > > > > > > > > > specific situation, let the vendor driver decide, userspace simply
> > > > > > > > > > > > > needs to know that an errno on read indicates the device does not
> > > > > > > > > > > > > support migration version comparison and that an errno on write
> > > > > > > > > > > > > indicates the devices are incompatible or the target doesn't support
> > > > > > > > > > > > > migration versions.
> > > > > > > > > > > >
> > > > > > > > > > > > I think I have to disagree here: It's probably valuable to have an
> > > > > > > > > > > > agreed error for 'cannot migrate at all' vs 'cannot migrate between
> > > > > > > > > > > > those two particular devices'. Userspace might want to do different
> > > > > > > > > > > > things (e.g. trying with different device pairs).
> > > > > > > > > > >
> > > > > > > > > > > Trying to stuff these things down an errno seems a bad idea; we can't
> > > > > > > > > > > get much information that way.
> > > > > > > > > >
> > > > > > > > > > So, what would be a reasonable approach? Userspace should first read
> > > > > > > > > > the version attributes on both devices (to find out whether migration
> > > > > > > > > > is supported at all), and only then figure out via writing whether they
> > > > > > > > > > are compatible?
> > > > > > > > > >
> > > > > > > > > > (Or just go ahead and try, if it does not care about the reason.)
> > > > > > > > >
> > > > > > > > > Well, I'm OK with something like writing to test whether it's
> > > > > > > > > compatible, it's just we need a better way of saying 'no'.
> > > > > > > > > I'm not sure if that involves reading back from somewhere after
> > > > > > > > > the write or what.
> > > > > > > >
> > > > > > > > Hm, so I basically see two ways of doing that:
> > > > > > > > - standardize on some error codes... problem: error codes can be hard
> > > > > > > > to fit to reasons
> > > > > > > > - make the error available in some attribute that can be read
> > > > > > > >
> > > > > > > > I'm not sure how we can serialize the readback with the last write,
> > > > > > > > though (this looks inherently racy).
> > > > > > > >
> > > > > > > > How important is detailed error reporting here?
> > > > > > >
> > > > > > > I think we need something, otherwise we're just going to get vague
> > > > > > > user reports of 'but my VM doesn't migrate'; I'd like the error to be
> > > > > > > good enough to point most users to something they can understand
> > > > > > > (e.g. wrong card family/too old a driver etc).
> > > > > >
> > > > > > Ok, that sounds like a reasonable point. Not that I have a better idea
> > > > > > how to achieve that, though... we could also log a more verbose error
> > > > > > message to the kernel log, but that's not necessarily where a user will
> > > > > > look first.
> > > > >
> > > > > In case of libvirt checking the compatibility, it won't matter how good the
> > > > > error message in the kernel log is and regardless of how many error states you
> > > > > want to handle, libvirt's only limited to errno here, since we're going to do
> > > > > plain read/write, so our internal error message returned to the user is only
> > > > > going to contain what the errno says - okay, of course we can (and we DO)
> > > > > provide libvirt specific string, further specifying the error but like I
> > > > > mentioned, depending on how many error cases we want to distinguish this may be
> > > > > hard for anyone to figure out solely on the error code, as apps will most
> > > > > probably not parse the
> > > > > logs.
> > > > >
> > > > > Regards,
> > > > > Erik
> > > > hi Erik
> > > > do you mean you are agreeing on defining common errors and only returning errno?
> > >
> > > In a sense, yes. While it is highly desirable to have logs with descriptive
> > > messages which will help in troubleshooting tremendously, I wanted to point out
> > > that spending time with error logs may not be that worthwhile especially since
> > > most apps (like libvirt) will solely rely on using read(3)/write(3) to sysfs.
> > > That means that we're limited by the errnos available, so apart from
> > > reporting the generic system message we can't any more magic in terms of the
> > > error messages, so the driver needs to assure that a proper message is
> > > propagated to the journal and at best libvirt can direct the user (consumer) to
> > > look through the system logs for more info. I also agree with the point
> > > mentioned above that defining a specific errno is IMO not the way to go, as
> > > these would be just too specific for the read(3)/write(3) use case.
> > >
> > > That said, from libvirt POV as a consumer, I'd expect there to be truly only 2
> > > errors (I believe Alex has mentioned something similar in one of his responses
> > > in one of the threads):
> > > a) read error indicating that an mdev type doesn't support migration
> > > - I assume if one type doesn't support migration, none of the other
> > > types exposed on the parent device do, is that a fair assumption?

I'd prefer not to make this assumption. Let's leave open the
possibility that (for whatever reason) a vendor may choose to support
migration on some types, but not others.

> > > b) write error indicating that the mdev types are incompatible for
> > > migration
> > >
> > > Regards,
> > > Erik
> > Thanks for this explanation.
> > so, can we arrive at below agreements?
> >
> > 1. "not to define the specific errno returned for a specific situation,
> > let the vendor driver decide, userspace simply needs to know that an errno on
> > read indicates the device does not support migration version comparison and
> > that an errno on write indicates the devices are incompatible or the target
> > doesn't support migration versions. "
> > 2. vendor driver should log detailed error reasons in kernel log.
>
> That would be my take on this, yes, but I open to hear any other suggestions and
> ideas I couldn't think of as well.

Kernel logging tends to be rather ineffective, it's surprisingly
difficult to get users to look in dmesg and it's not really a good
choice for scraping diagnostic information either. I'd probably leave
this to vendor driver's discretion at this point. Thanks,

Alex

2019-05-16 01:50:16

by Yan Zhao

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

On Tue, May 14, 2019 at 11:01:42PM +0800, Alex Williamson wrote:
> On Tue, 14 May 2019 09:43:44 +0200
> Erik Skultety <[email protected]> wrote:
>
> > On Tue, May 14, 2019 at 03:32:19AM -0400, Yan Zhao wrote:
> > > On Tue, May 14, 2019 at 03:20:40PM +0800, Erik Skultety wrote:
> > > > On Tue, May 14, 2019 at 02:12:35AM -0400, Yan Zhao wrote:
> > > > > On Mon, May 13, 2019 at 09:28:04PM +0800, Erik Skultety wrote:
> > > > > > On Fri, May 10, 2019 at 11:48:38AM +0200, Cornelia Huck wrote:
> > > > > > > On Fri, 10 May 2019 10:36:09 +0100
> > > > > > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > > > > > >
> > > > > > > > * Cornelia Huck ([email protected]) wrote:
> > > > > > > > > On Thu, 9 May 2019 17:48:26 +0100
> > > > > > > > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > * Cornelia Huck ([email protected]) wrote:
> > > > > > > > > > > On Thu, 9 May 2019 16:48:57 +0100
> > > > > > > > > > > "Dr. David Alan Gilbert" <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > * Cornelia Huck ([email protected]) wrote:
> > > > > > > > > > > > > On Tue, 7 May 2019 15:18:26 -0600
> > > > > > > > > > > > > Alex Williamson <[email protected]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Sun, 5 May 2019 21:49:04 -0400
> > > > > > > > > > > > > > Yan Zhao <[email protected]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > + Errno:
> > > > > > > > > > > > > > > + If vendor driver wants to claim a mdev device incompatible to all other mdev
> > > > > > > > > > > > > > > + devices, it should not register version attribute for this mdev device. But if
> > > > > > > > > > > > > > > + a vendor driver has already registered version attribute and it wants to claim
> > > > > > > > > > > > > > > + a mdev device incompatible to all other mdev devices, it needs to return
> > > > > > > > > > > > > > > + -ENODEV on access to this mdev device's version attribute.
> > > > > > > > > > > > > > > + If a mdev device is only incompatible to certain mdev devices, write of
> > > > > > > > > > > > > > > + incompatible mdev devices's version strings to its version attribute should
> > > > > > > > > > > > > > > + return -EINVAL;
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think it's best not to define the specific errno returned for a
> > > > > > > > > > > > > > specific situation, let the vendor driver decide, userspace simply
> > > > > > > > > > > > > > needs to know that an errno on read indicates the device does not
> > > > > > > > > > > > > > support migration version comparison and that an errno on write
> > > > > > > > > > > > > > indicates the devices are incompatible or the target doesn't support
> > > > > > > > > > > > > > migration versions.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think I have to disagree here: It's probably valuable to have an
> > > > > > > > > > > > > agreed error for 'cannot migrate at all' vs 'cannot migrate between
> > > > > > > > > > > > > those two particular devices'. Userspace might want to do different
> > > > > > > > > > > > > things (e.g. trying with different device pairs).
> > > > > > > > > > > >
> > > > > > > > > > > > Trying to stuff these things down an errno seems a bad idea; we can't
> > > > > > > > > > > > get much information that way.
> > > > > > > > > > >
> > > > > > > > > > > So, what would be a reasonable approach? Userspace should first read
> > > > > > > > > > > the version attributes on both devices (to find out whether migration
> > > > > > > > > > > is supported at all), and only then figure out via writing whether they
> > > > > > > > > > > are compatible?
> > > > > > > > > > >
> > > > > > > > > > > (Or just go ahead and try, if it does not care about the reason.)
> > > > > > > > > >
> > > > > > > > > > Well, I'm OK with something like writing to test whether it's
> > > > > > > > > > compatible, it's just we need a better way of saying 'no'.
> > > > > > > > > > I'm not sure if that involves reading back from somewhere after
> > > > > > > > > > the write or what.
> > > > > > > > >
> > > > > > > > > Hm, so I basically see two ways of doing that:
> > > > > > > > > - standardize on some error codes... problem: error codes can be hard
> > > > > > > > > to fit to reasons
> > > > > > > > > - make the error available in some attribute that can be read
> > > > > > > > >
> > > > > > > > > I'm not sure how we can serialize the readback with the last write,
> > > > > > > > > though (this looks inherently racy).
> > > > > > > > >
> > > > > > > > > How important is detailed error reporting here?
> > > > > > > >
> > > > > > > > I think we need something, otherwise we're just going to get vague
> > > > > > > > user reports of 'but my VM doesn't migrate'; I'd like the error to be
> > > > > > > > good enough to point most users to something they can understand
> > > > > > > > (e.g. wrong card family/too old a driver etc).
> > > > > > >
> > > > > > > Ok, that sounds like a reasonable point. Not that I have a better idea
> > > > > > > how to achieve that, though... we could also log a more verbose error
> > > > > > > message to the kernel log, but that's not necessarily where a user will
> > > > > > > look first.
> > > > > >
> > > > > > In case of libvirt checking the compatibility, it won't matter how good the
> > > > > > error message in the kernel log is and regardless of how many error states you
> > > > > > want to handle, libvirt's only limited to errno here, since we're going to do
> > > > > > plain read/write, so our internal error message returned to the user is only
> > > > > > going to contain what the errno says - okay, of course we can (and we DO)
> > > > > > provide libvirt specific string, further specifying the error but like I
> > > > > > mentioned, depending on how many error cases we want to distinguish this may be
> > > > > > hard for anyone to figure out solely on the error code, as apps will most
> > > > > > probably not parse the
> > > > > > logs.
> > > > > >
> > > > > > Regards,
> > > > > > Erik
> > > > > hi Erik
> > > > > do you mean you are agreeing on defining common errors and only returning errno?
> > > >
> > > > In a sense, yes. While it is highly desirable to have logs with descriptive
> > > > messages which will help in troubleshooting tremendously, I wanted to point out
> > > > that spending time with error logs may not be that worthwhile especially since
> > > > most apps (like libvirt) will solely rely on using read(3)/write(3) to sysfs.
> > > > That means that we're limited by the errnos available, so apart from
> > > > reporting the generic system message we can't any more magic in terms of the
> > > > error messages, so the driver needs to assure that a proper message is
> > > > propagated to the journal and at best libvirt can direct the user (consumer) to
> > > > look through the system logs for more info. I also agree with the point
> > > > mentioned above that defining a specific errno is IMO not the way to go, as
> > > > these would be just too specific for the read(3)/write(3) use case.
> > > >
> > > > That said, from libvirt POV as a consumer, I'd expect there to be truly only 2
> > > > errors (I believe Alex has mentioned something similar in one of his responses
> > > > in one of the threads):
> > > > a) read error indicating that an mdev type doesn't support migration
> > > > - I assume if one type doesn't support migration, none of the other
> > > > types exposed on the parent device do, is that a fair assumption?
>
> I'd prefer not to make this assumption. Let's leave open the
> possibility that (for whatever reason) a vendor may choose to support
> migration on some types, but not others.
>
> > > > b) write error indicating that the mdev types are incompatible for
> > > > migration
> > > >
> > > > Regards,
> > > > Erik
> > > Thanks for this explanation.
> > > so, can we arrive at below agreements?
> > >
> > > 1. "not to define the specific errno returned for a specific situation,
> > > let the vendor driver decide, userspace simply needs to know that an errno on
> > > read indicates the device does not support migration version comparison and
> > > that an errno on write indicates the devices are incompatible or the target
> > > doesn't support migration versions. "
> > > 2. vendor driver should log detailed error reasons in kernel log.
> >
> > That would be my take on this, yes, but I open to hear any other suggestions and
> > ideas I couldn't think of as well.
>
> Kernel logging tends to be rather ineffective, it's surprisingly
> difficult to get users to look in dmesg and it's not really a good
> choice for scraping diagnostic information either. I'd probably leave
> this to vendor driver's discretion at this point. Thanks,
>
> Alex

got it.
Thank you all!
I'll follow it to prepare the next revision.

Thanks
Yan

> _______________________________________________
> intel-gvt-dev mailing list
> [email protected]
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev