Xiangliang reported that a platform hangs after "udevadm info
--attribute-walk --path=/sys/devices/pci0000:00/<...>/block/sda" is ran.
He was then able to isolate the failure further to accesses involving a
Marvell 9125 device's I/O BARs, or more specifically, accesses to I/O Port
space backing the device's I/O Port BARs. Putting these two pieces of
information together he was able to reproduce the hang by targeting the
device's pci-sysfs 'resource<N>' entries, where N represents an I/O Port
Bar, via "cat /sys/devices/<...>/resource<N>" [1].
There are basically two issues at play here: pci-sysfs entries backing I/O
Port BARs, which directly map to the device's control and status
registers, being able to be accessed by userspace - which commit 8633328
introduced as part of adding virtualization based device assignment
functionality - and, abnormally strict access restrictions with respect to
the Marvell 9125 device's I/O Port region(s).
I proposed a patch to circumvent "udevadm info ..." from accessing these
specific pci-sysfs entries which started an interesting conversation [2].
While there were numerous issues brought up they basically filtered down to
whether or not it is possible to safely have userspace access device
registers and, if it is, how do we ensure that two or more entities can't
access a device simultaneously (see Don Dutile's quick summary as it most
succulently expressed the complexity of the situation we currently find
ourselves in).
There are plans for VFIO to replace the sysfs based accessing method,
moving device register accesses behind ioctls, but for now legacy KVM
device assignment relies on these files.
This patch series is a possible compromise to the immediate issue; a
quirking mechanism that can be used to detect accesses that do no meet the
device's restrictions, letting a device specific method intervene and
decide how to progress.
If consensus is reached that it is just too unsafe to allow userspace
access to device registers, then the series may serve as a stop gap
solution, addressing the symptom and not the underlying issue, until VFIO
materializes at which time commit 8633328, along with the second patch in
*this* series, would need to be reverted.
If on the other hand, consensus is that we need userspace device register
access capabilities - say for UIO drivers or such - then, depending on the
tact taken, we'll need this solution, or something like it, as part of
that overall strategy.
Reference(s):
[1] https://lkml.org/lkml/2013/3/7/242
[2] https://lkml.org/lkml/2013/3/16/168
---
Myron Stowe (2):
PCI, scsi, ahci: Unify usages of 0x1b4b vendor ID to use PCI_VENDOR_ID_MARVELL_EXT
PCI: Handle device quirks when accessing sysfs resource<N> entries
Xiangliang Yu (1):
PCI: Define macro for Marvell vendor ID
drivers/ata/ahci.c | 10 +++---
drivers/pci/pci-sysfs.c | 11 +++----
drivers/pci/pci.h | 13 ++++++++
drivers/pci/quirks.c | 70 ++++++++++++++++++++++++++++++++++++++++++
drivers/scsi/mvsas/mv_init.c | 6 ++--
drivers/scsi/mvumi.c | 4 +-
drivers/scsi/mvumi.h | 1 -
include/linux/pci_ids.h | 1 +
8 files changed, 99 insertions(+), 17 deletions(-)
--
Sysfs includes entries to memory regions that back a PCI device's BARs.
The pci-sysfs entries backing I/O Port BARs can be accessed by userspace,
providing direct access to the device's registers. File permissions
prevent random users from accessing the device's registers through these
files, but don't stop a privileged app that chooses to ignore the purpose
of these files from doing so.
There are devices with abnormally strict restrictions with respect to
accessing their registers; aspects that are typically handled by the
device's driver. When these access restrictions are not followed - as
when a userspace app such as "udevadm info --attribute-walk
--path=/sys/..." parses though reading all the device's sysfs entries - it
can cause such devices to fail.
This patch introduces a quirking mechanism that can be used to detect
accesses that do no meet the device's restrictions, letting a device
specific method intervene and decide how to progress.
Reported-by: Xiangliang Yu <[email protected]>
Signed-off-by: Myron Stowe <[email protected]>
---
drivers/pci/pci-sysfs.c | 11 +++----
drivers/pci/pci.h | 13 +++++++++
drivers/pci/quirks.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 88 insertions(+), 6 deletions(-)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 9c6e9bb..8e80c33 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -974,10 +974,9 @@ pci_mmap_resource_wc(struct file *filp, struct kobject *kobj,
return pci_mmap_resource(kobj, attr, vma, 1);
}
-static ssize_t
-pci_resource_io(struct file *filp, struct kobject *kobj,
- struct bin_attribute *attr, char *buf,
- loff_t off, size_t count, bool write)
+ssize_t pci_resource_io(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *attr, char *buf,
+ loff_t off, size_t count, bool write)
{
struct pci_dev *pdev = to_pci_dev(container_of(kobj,
struct device, kobj));
@@ -1027,7 +1026,7 @@ pci_read_resource_io(struct file *filp, struct kobject *kobj,
struct bin_attribute *attr, char *buf,
loff_t off, size_t count)
{
- return pci_resource_io(filp, kobj, attr, buf, off, count, false);
+ return pci_resource_io_filter(filp, kobj, attr, buf, off, count, false);
}
static ssize_t
@@ -1035,7 +1034,7 @@ pci_write_resource_io(struct file *filp, struct kobject *kobj,
struct bin_attribute *attr, char *buf,
loff_t off, size_t count)
{
- return pci_resource_io(filp, kobj, attr, buf, off, count, true);
+ return pci_resource_io_filter(filp, kobj, attr, buf, off, count, true);
}
/**
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 7346ee6..a47611d 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -29,6 +29,9 @@ extern int pci_mmap_fits(struct pci_dev *pdev, int resno,
struct vm_area_struct *vmai,
enum pci_mmap_api mmap_api);
#endif
+extern ssize_t pci_resource_io(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *attr, char *buf,
+ loff_t off, size_t count, bool write);
int pci_probe_reset_function(struct pci_dev *dev);
/**
@@ -308,11 +311,21 @@ struct pci_dev_reset_methods {
#ifdef CONFIG_PCI_QUIRKS
extern int pci_dev_specific_reset(struct pci_dev *dev, int probe);
+ssize_t pci_resource_io_filter(struct file *fp, struct kobject *kobj,
+ struct bin_attribute *attr, char *buf,
+ loff_t offset, size_t size, bool write);
#else
static inline int pci_dev_specific_reset(struct pci_dev *dev, int probe)
{
return -ENOTTY;
}
+static inline ssize_t
+pci_resource_io_filter(struct file *fp, struct kobject *kobj,
+ struct bin_attribute *attr, char *buf,
+ loff_t offset, size_t size, bool write)
+{
+ return pci_resource_io(fp, kobj, attr, buf, offset, size, write);
+}
#endif
#endif /* DRIVERS_PCI_H */
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 0369fb6..8b68cd2 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3324,3 +3324,73 @@ int pci_dev_specific_acs_enabled(struct pci_dev *dev, u16 acs_flags)
return -ENOTTY;
}
+
+static ssize_t marvell9125_quirk(struct file *fp, struct kobject *kobj,
+ struct bin_attribute *attr, char *buf,
+ loff_t offset, size_t size, bool write)
+{
+ /*
+ * Short circuit out of trying to access this device's I/O port
+ * region.
+ */
+ return -EINVAL;
+}
+
+static const struct pci_resource_io_quirk {
+ u16 vendor;
+ u16 device;
+ int bar;
+ size_t size;
+ loff_t offset;
+ ssize_t (*iores_access)(struct file *fp, struct kobject *kobj,
+ struct bin_attribute *attr, char *buf,
+ loff_t offset, size_t size, bool write);
+} pci_resource_io_quirks[] = {
+ /*
+ * Some devices have abnormally strict restrictions when accessing
+ * their I/O port regions.
+ * https://lkml.org/lkml/2013/3/16/168
+ */
+ { PCI_VENDOR_ID_MARVELL_EXT, 0x9125, 1, 4, 0, marvell9125_quirk },
+ { PCI_VENDOR_ID_MARVELL_EXT, 0x9125, 1, 2, 2, marvell9125_quirk },
+ { PCI_VENDOR_ID_MARVELL_EXT, 0x9125, 1, 1, 0, marvell9125_quirk },
+ { PCI_VENDOR_ID_MARVELL_EXT, 0x9125, 1, 1, 1, marvell9125_quirk },
+ { PCI_VENDOR_ID_MARVELL_EXT, 0x9125, 1, 1, 3, marvell9125_quirk },
+ { PCI_VENDOR_ID_MARVELL_EXT, 0x9125, 3, 4, 0, marvell9125_quirk },
+ { PCI_VENDOR_ID_MARVELL_EXT, 0x9125, 3, 2, 2, marvell9125_quirk },
+ { PCI_VENDOR_ID_MARVELL_EXT, 0x9125, 3, 1, 0, marvell9125_quirk },
+ { PCI_VENDOR_ID_MARVELL_EXT, 0x9125, 3, 1, 1, marvell9125_quirk },
+ { PCI_VENDOR_ID_MARVELL_EXT, 0x9125, 3, 1, 3, marvell9125_quirk },
+ { 0 }
+};
+
+ssize_t pci_resource_io_filter(struct file *fp, struct kobject *kobj,
+ struct bin_attribute *attr, char *buf,
+ loff_t offset, size_t size, bool write)
+{
+ int bar;
+ struct resource *res = attr->private;
+ const struct pci_resource_io_quirk *i;
+ struct pci_dev *dev = to_pci_dev(container_of(kobj,
+ struct device, kobj));
+
+ for (bar = 0; bar < PCI_ROM_RESOURCE; bar++)
+ if (res == &dev->resource[bar])
+ break;
+ if (bar >= PCI_ROM_RESOURCE)
+ return -ENODEV;
+
+ for (i = pci_resource_io_quirks; i->iores_access; i++) {
+ if ((i->vendor == dev->vendor ||
+ i->vendor == (u16)PCI_ANY_ID) &&
+ (i->device == dev->device ||
+ i->device == (u16)PCI_ANY_ID) &&
+ i->bar == bar &&
+ i->size == size &&
+ i->offset == offset)
+ return i->iores_access(fp, kobj, attr, buf, offset,
+ size, write);
+ }
+
+ return pci_resource_io(fp, kobj, attr, buf, offset, size, write);
+}
With the 0x1b4b vendor ID #define in place, convert hard-coded ID values.
Also, unify the previous SCSI specific use of this vendor ID to the new
macro.
Signed-off-by: Myron Stowe <[email protected]>
---
drivers/ata/ahci.c | 10 +++++-----
drivers/scsi/mvsas/mv_init.c | 6 +++---
drivers/scsi/mvumi.c | 4 ++--
drivers/scsi/mvumi.h | 1 -
4 files changed, 10 insertions(+), 11 deletions(-)
diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index a99112c..616952f 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -413,17 +413,17 @@ static const struct pci_device_id ahci_pci_tbl[] = {
/* Marvell */
{ PCI_VDEVICE(MARVELL, 0x6145), board_ahci_mv }, /* 6145 */
{ PCI_VDEVICE(MARVELL, 0x6121), board_ahci_mv }, /* 6121 */
- { PCI_DEVICE(0x1b4b, 0x9123),
+ { PCI_DEVICE(PCI_VENDOR_ID_MARVELL_EXT, 0x9123),
.class = PCI_CLASS_STORAGE_SATA_AHCI,
.class_mask = 0xffffff,
.driver_data = board_ahci_yes_fbs }, /* 88se9128 */
- { PCI_DEVICE(0x1b4b, 0x9125),
+ { PCI_DEVICE(PCI_VENDOR_ID_MARVELL_EXT, 0x9125),
.driver_data = board_ahci_yes_fbs }, /* 88se9125 */
- { PCI_DEVICE(0x1b4b, 0x917a),
+ { PCI_DEVICE(PCI_VENDOR_ID_MARVELL_EXT, 0x917a),
.driver_data = board_ahci_yes_fbs }, /* 88se9172 */
- { PCI_DEVICE(0x1b4b, 0x9192),
+ { PCI_DEVICE(PCI_VENDOR_ID_MARVELL_EXT, 0x9192),
.driver_data = board_ahci_yes_fbs }, /* 88se9172 on some Gigabyte */
- { PCI_DEVICE(0x1b4b, 0x91a3),
+ { PCI_DEVICE(PCI_VENDOR_ID_MARVELL_EXT, 0x91a3),
.driver_data = board_ahci_yes_fbs },
/* Promise */
diff --git a/drivers/scsi/mvsas/mv_init.c b/drivers/scsi/mvsas/mv_init.c
index ce90d05..7455092 100644
--- a/drivers/scsi/mvsas/mv_init.c
+++ b/drivers/scsi/mvsas/mv_init.c
@@ -703,7 +703,7 @@ static struct pci_device_id mvs_pci_table[] = {
{ PCI_VDEVICE(TTI, 0x2744), chip_9480 },
{ PCI_VDEVICE(TTI, 0x2760), chip_9480 },
{
- .vendor = 0x1b4b,
+ .vendor = PCI_VENDOR_ID_MARVELL_EXT,
.device = 0x9480,
.subvendor = PCI_ANY_ID,
.subdevice = 0x9480,
@@ -712,7 +712,7 @@ static struct pci_device_id mvs_pci_table[] = {
.driver_data = chip_9480,
},
{
- .vendor = 0x1b4b,
+ .vendor = PCI_VENDOR_ID_MARVELL_EXT,
.device = 0x9445,
.subvendor = PCI_ANY_ID,
.subdevice = 0x9480,
@@ -721,7 +721,7 @@ static struct pci_device_id mvs_pci_table[] = {
.driver_data = chip_9445,
},
{
- .vendor = 0x1b4b,
+ .vendor = PCI_VENDOR_ID_MARVELL_EXT,
.device = 0x9485,
.subvendor = PCI_ANY_ID,
.subdevice = 0x9480,
diff --git a/drivers/scsi/mvumi.c b/drivers/scsi/mvumi.c
index 4594cca..c3601b5 100644
--- a/drivers/scsi/mvumi.c
+++ b/drivers/scsi/mvumi.c
@@ -49,8 +49,8 @@ MODULE_AUTHOR("[email protected]");
MODULE_DESCRIPTION("Marvell UMI Driver");
static DEFINE_PCI_DEVICE_TABLE(mvumi_pci_table) = {
- { PCI_DEVICE(PCI_VENDOR_ID_MARVELL_2, PCI_DEVICE_ID_MARVELL_MV9143) },
- { PCI_DEVICE(PCI_VENDOR_ID_MARVELL_2, PCI_DEVICE_ID_MARVELL_MV9580) },
+ { PCI_DEVICE(PCI_VENDOR_ID_MARVELL_EXT, PCI_DEVICE_ID_MARVELL_MV9143) },
+ { PCI_DEVICE(PCI_VENDOR_ID_MARVELL_EXT, PCI_DEVICE_ID_MARVELL_MV9580) },
{ 0 }
};
diff --git a/drivers/scsi/mvumi.h b/drivers/scsi/mvumi.h
index e360135..41f1687 100644
--- a/drivers/scsi/mvumi.h
+++ b/drivers/scsi/mvumi.h
@@ -32,7 +32,6 @@
#define VER_BUILD 1500
#define MV_DRIVER_NAME "mvumi"
-#define PCI_VENDOR_ID_MARVELL_2 0x1b4b
#define PCI_DEVICE_ID_MARVELL_MV9143 0x9143
#define PCI_DEVICE_ID_MARVELL_MV9580 0x9580
From: Xiangliang Yu <[email protected]>
Define PCI_VENDOR_ID_MARVELL_EXT macro for 0x1b4b vendor ID
Signed-off-by: Xiangliang Yu <[email protected]>
Signed-off-by: Myron Stowe <[email protected]>
---
include/linux/pci_ids.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index f11c1c2..a80f9e6 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -1604,6 +1604,7 @@
#define PCI_SUBDEVICE_ID_KEYSPAN_SX2 0x5334
#define PCI_VENDOR_ID_MARVELL 0x11ab
+#define PCI_VENDOR_ID_MARVELL_EXT 0x1b4b
#define PCI_DEVICE_ID_MARVELL_GT64111 0x4146
#define PCI_DEVICE_ID_MARVELL_GT64260 0x6430
#define PCI_DEVICE_ID_MARVELL_MV64360 0x6460
On 03/20/2013 10:35 PM, Myron Stowe wrote:
> Sysfs includes entries to memory regions that back a PCI device's BARs.
> The pci-sysfs entries backing I/O Port BARs can be accessed by userspace,
> providing direct access to the device's registers. File permissions
> prevent random users from accessing the device's registers through these
> files, but don't stop a privileged app that chooses to ignore the purpose
> of these files from doing so.
>
> There are devices with abnormally strict restrictions with respect to
> accessing their registers; aspects that are typically handled by the
> device's driver. When these access restrictions are not followed - as
> when a userspace app such as "udevadm info --attribute-walk
> --path=/sys/..." parses though reading all the device's sysfs entries - it
> can cause such devices to fail.
>
> This patch introduces a quirking mechanism that can be used to detect
> accesses that do no meet the device's restrictions, letting a device
> specific method intervene and decide how to progress.
>
> Reported-by: Xiangliang Yu <[email protected]>
> Signed-off-by: Myron Stowe <[email protected]>
I honestly don't think there's much point in even attempting this
strategy. This list of devices in the quirk can't possibly be complete.
It would likely be easier to enumerate a white-list of devices that can
deal with their IO ports being read willy-nilly than a blacklist of
those that don't, as there's likely countless devices that fall into
this category. Even if they don't choke as badly as these ones do, it's
quite likely that bad behavior will result.
I think there's a few things that need to be done:
-Fix the bug in udevadm that caused it to trawl through these files
willy-nilly,
-Fix the kernel so that access through these files complies with the
kernel's mechanisms for claiming IO/memory regions to prevent access
conflicts (i.e. opening these files should claim the resource region
they refer to, and should fail with EBUSY or something if another
process or a kernel driver is using it).
-Reconsider whether supporting read/write on the resource files for IO
port regions like these makes any sense. Obviously mmap isn't very
practical for IO port access on x86 but you could even do something like
an ioctl for this purpose. Not very many pieces of software would need
to access these files so it's likely OK if the API is a bit ugly. That
would prevent something like grepping through sysfs from generating port
accesses to random devices.
On Thu, Mar 21, 2013 at 06:51:31PM -0600, Robert Hancock wrote:
> On 03/20/2013 10:35 PM, Myron Stowe wrote:
> >Sysfs includes entries to memory regions that back a PCI device's BARs.
> >The pci-sysfs entries backing I/O Port BARs can be accessed by userspace,
> >providing direct access to the device's registers. File permissions
> >prevent random users from accessing the device's registers through these
> >files, but don't stop a privileged app that chooses to ignore the purpose
> >of these files from doing so.
> >
> >There are devices with abnormally strict restrictions with respect to
> >accessing their registers; aspects that are typically handled by the
> >device's driver. When these access restrictions are not followed - as
> >when a userspace app such as "udevadm info --attribute-walk
> >--path=/sys/..." parses though reading all the device's sysfs entries - it
> >can cause such devices to fail.
> >
> >This patch introduces a quirking mechanism that can be used to detect
> >accesses that do no meet the device's restrictions, letting a device
> >specific method intervene and decide how to progress.
> >
> >Reported-by: Xiangliang Yu <[email protected]>
> >Signed-off-by: Myron Stowe <[email protected]>
>
> I honestly don't think there's much point in even attempting this
> strategy. This list of devices in the quirk can't possibly be
> complete. It would likely be easier to enumerate a white-list of
> devices that can deal with their IO ports being read willy-nilly
> than a blacklist of those that don't, as there's likely countless
> devices that fall into this category. Even if they don't choke as
> badly as these ones do, it's quite likely that bad behavior will
> result.
>
> I think there's a few things that need to be done:
>
> -Fix the bug in udevadm that caused it to trawl through these files
> willy-nilly,
There's no "bug" in udevadm, the user explicitly asked for it to read
all of those files. Just like grep or bash could be used to ask to read
those files.
If the kernel is going to provide files to userspace, the kernel can't
suddenly get upset if userspace actually reads those files.
Fix the kernel here please.
> -Fix the kernel so that access through these files complies with the
> kernel's mechanisms for claiming IO/memory regions to prevent access
> conflicts (i.e. opening these files should claim the resource region
> they refer to, and should fail with EBUSY or something if another
> process or a kernel driver is using it).
Yes, this is a good solution.
> -Reconsider whether supporting read/write on the resource files for
> IO port regions like these makes any sense. Obviously mmap isn't
> very practical for IO port access on x86 but you could even do
> something like an ioctl for this purpose. Not very many pieces of
> software would need to access these files so it's likely OK if the
> API is a bit ugly. That would prevent something like grepping
> through sysfs from generating port accesses to random devices.
Also a good solution.
greg k-h
On Thu, Mar 21, 2013 at 6:51 PM, Robert Hancock <[email protected]> wrote:
> On 03/20/2013 10:35 PM, Myron Stowe wrote:
>>
>> Sysfs includes entries to memory regions that back a PCI device's BARs.
>> The pci-sysfs entries backing I/O Port BARs can be accessed by userspace,
>> providing direct access to the device's registers. File permissions
>> prevent random users from accessing the device's registers through these
>> files, but don't stop a privileged app that chooses to ignore the purpose
>> of these files from doing so.
>>
>> There are devices with abnormally strict restrictions with respect to
>> accessing their registers; aspects that are typically handled by the
>> device's driver. When these access restrictions are not followed - as
>> when a userspace app such as "udevadm info --attribute-walk
>> --path=/sys/..." parses though reading all the device's sysfs entries - it
>> can cause such devices to fail.
>>
>> This patch introduces a quirking mechanism that can be used to detect
>> accesses that do no meet the device's restrictions, letting a device
>> specific method intervene and decide how to progress.
>>
>> Reported-by: Xiangliang Yu <[email protected]>
>> Signed-off-by: Myron Stowe <[email protected]>
>
>
> I honestly don't think there's much point in even attempting this strategy.
> This list of devices in the quirk can't possibly be complete. It would
> likely be easier to enumerate a white-list of devices that can deal with
> their IO ports being read willy-nilly than a blacklist of those that don't,
> as there's likely countless devices that fall into this category. Even if
> they don't choke as badly as these ones do, it's quite likely that bad
> behavior will result.
For the device in question, it seems abnormally restrictive in its
access limitations to BAR1 and BAR3. The device reserves 4 Bytes of
I/O Port space for these BARs, which is likely based on PCI's DWord
based protocol, but "chokes (we still have not received any specifics
on this yet)" on any access other than a single Byte access at offset
0x2 (x86 supports 1, 2, and 4 Byte I/O Port accesses). This seems to
imply that the device did not back the other three reserved Bytes it
claims in any way, which again, seems peculiar to this particular
device as other similar devices tend to back the reserved bytes they
claim and return 0's when accessed.
So in the case where two entities such as the devices driver and an
app like 'udevadm' are *not* simultaneously accessing it, so in effect
the device is idle with no device driver attached and a user app like
'udevadm' accesses it: do you still contend that there are countless
devices that will not deal with their IO ports being read willy-nilly?
The reason I ask is related to what I stated in the cover [PATCH 0/3]
- "If on the other hand, consensus is that we need userspace device
register access capabilities - say for UIO drivers or such - then,
depending on the tact taken, we'll need this solution, or something
like it, as part of that overall strategy." So, if it isn't too
uncommon for devices to choke with normal I/O Port accesses then I
believe that this solution would be part of a larger strategy to
support userspace direct access to devices and also works as a
compromise for the immediate issue today.
You make some good suggestions below on possible tactics related to
allowing userspace access to device registers but I question whether
or not we really want, or need, to. Are we, by de-facto, just
assuming we need to? Is there consensus that Linux needs to support
such functionality? Yes, we are already there (to some extent) with
commit 8633328 but I think we are seeing the implications of such a
direction and we also know that the tact that this commit took should
be going away when VFIO matures.
Thanks for participating in this conversation - you helped in
understanding the real root cause in the originating thread and have
brought up some good points to consider,
Myron
>
> I think there's a few things that need to be done:
>
> -Fix the bug in udevadm that caused it to trawl through these files
> willy-nilly,
>
> -Fix the kernel so that access through these files complies with the
> kernel's mechanisms for claiming IO/memory regions to prevent access
> conflicts (i.e. opening these files should claim the resource region they
> refer to, and should fail with EBUSY or something if another process or a
> kernel driver is using it).
>
> -Reconsider whether supporting read/write on the resource files for IO port
> regions like these makes any sense. Obviously mmap isn't very practical for
> IO port access on x86 but you could even do something like an ioctl for this
> purpose. Not very many pieces of software would need to access these files
> so it's likely OK if the API is a bit ugly. That would prevent something
> like grepping through sysfs from generating port accesses to random devices.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Mar 22, 2013 at 9:39 AM, Myron Stowe <[email protected]> wrote:
> On Thu, Mar 21, 2013 at 6:51 PM, Robert Hancock <[email protected]> wrote:
>> On 03/20/2013 10:35 PM, Myron Stowe wrote:
>>>
>>> Sysfs includes entries to memory regions that back a PCI device's BARs.
>>> The pci-sysfs entries backing I/O Port BARs can be accessed by userspace,
>>> providing direct access to the device's registers. File permissions
>>> prevent random users from accessing the device's registers through these
>>> files, but don't stop a privileged app that chooses to ignore the purpose
>>> of these files from doing so.
>>>
>>> There are devices with abnormally strict restrictions with respect to
>>> accessing their registers; aspects that are typically handled by the
>>> device's driver. When these access restrictions are not followed - as
>>> when a userspace app such as "udevadm info --attribute-walk
>>> --path=/sys/..." parses though reading all the device's sysfs entries - it
>>> can cause such devices to fail.
>>>
>>> This patch introduces a quirking mechanism that can be used to detect
>>> accesses that do no meet the device's restrictions, letting a device
>>> specific method intervene and decide how to progress.
>>>
>>> Reported-by: Xiangliang Yu <[email protected]>
>>> Signed-off-by: Myron Stowe <[email protected]>
>>
>>
>> I honestly don't think there's much point in even attempting this strategy.
>> This list of devices in the quirk can't possibly be complete. It would
>> likely be easier to enumerate a white-list of devices that can deal with
>> their IO ports being read willy-nilly than a blacklist of those that don't,
>> as there's likely countless devices that fall into this category. Even if
>> they don't choke as badly as these ones do, it's quite likely that bad
>> behavior will result.
>
> For the device in question, it seems abnormally restrictive in its
> access limitations to BAR1 and BAR3. The device reserves 4 Bytes of
> I/O Port space for these BARs, which is likely based on PCI's DWord
> based protocol, but "chokes (we still have not received any specifics
> on this yet)" on any access other than a single Byte access at offset
> 0x2 (x86 supports 1, 2, and 4 Byte I/O Port accesses). This seems to
> imply that the device did not back the other three reserved Bytes it
> claims in any way, which again, seems peculiar to this particular
> device as other similar devices tend to back the reserved bytes they
> claim and return 0's when accessed.
I don't think it's really that unusual. IO ports are weird, they're
not necessarily emulating any kind of normal memory - doing a 32-bit
access on IO port 0 isn't necessarily equivalent at all to doing
separate 8-bit accesses on IO ports 0, 1, 2 and 3 for example. They
can do completely different things. For legacy SFF ATA controllers,
IIRC most of the IO ports only expect 8-bit accesses other than the
PIO data register which can do 16 or 32-bit accesses (which just gives
you a different amount of data when doing different size reads from
the same location - it doesn't give you the data for the IO ports 2 or
3 bytes ahead of that when doing a 32-bit read, like you might expect
if you were thinking of it like memory). So it's not too shocking that
the designer of these Marvell devices didn't pay too much attention to
what happens if you access the ports in an unexpected way.
Most devices tend not to use IO ports any more in favor of MMIO,
except for legacy devices or devices implementing a legacy interface
(like the SFF ATA portion of these Marvell controllers). In those old
setups, doing funny things with IO port space is more common. Also,
the fact that IO port space tends to be somewhat precious leads to
this sort of thing too.
>
> So in the case where two entities such as the devices driver and an
> app like 'udevadm' are *not* simultaneously accessing it, so in effect
> the device is idle with no device driver attached and a user app like
> 'udevadm' accesses it: do you still contend that there are countless
> devices that will not deal with their IO ports being read willy-nilly?
Perhaps not "countless" but more "uncountable" in that there's no real
way to tell which devices may have issues, only to know that it's
quite possible that many devices may.
>
> The reason I ask is related to what I stated in the cover [PATCH 0/3]
> - "If on the other hand, consensus is that we need userspace device
> register access capabilities - say for UIO drivers or such - then,
> depending on the tact taken, we'll need this solution, or something
> like it, as part of that overall strategy." So, if it isn't too
> uncommon for devices to choke with normal I/O Port accesses then I
> believe that this solution would be part of a larger strategy to
> support userspace direct access to devices and also works as a
> compromise for the immediate issue today.
>
>
> You make some good suggestions below on possible tactics related to
> allowing userspace access to device registers but I question whether
> or not we really want, or need, to. Are we, by de-facto, just
> assuming we need to? Is there consensus that Linux needs to support
> such functionality? Yes, we are already there (to some extent) with
> commit 8633328 but I think we are seeing the implications of such a
> direction and we also know that the tact that this commit took should
> be going away when VFIO matures.
>
> Thanks for participating in this conversation - you helped in
> understanding the real root cause in the originating thread and have
> brought up some good points to consider,
> Myron
>
>>
>> I think there's a few things that need to be done:
>>
>> -Fix the bug in udevadm that caused it to trawl through these files
>> willy-nilly,
>>
>> -Fix the kernel so that access through these files complies with the
>> kernel's mechanisms for claiming IO/memory regions to prevent access
>> conflicts (i.e. opening these files should claim the resource region they
>> refer to, and should fail with EBUSY or something if another process or a
>> kernel driver is using it).
>>
>> -Reconsider whether supporting read/write on the resource files for IO port
>> regions like these makes any sense. Obviously mmap isn't very practical for
>> IO port access on x86 but you could even do something like an ioctl for this
>> purpose. Not very many pieces of software would need to access these files
>> so it's likely OK if the API is a bit ugly. That would prevent something
>> like grepping through sysfs from generating port accesses to random devices.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>>
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Mar 22, 2013 at 9:55 AM, Robert Hancock <[email protected]> wrote:
> On Fri, Mar 22, 2013 at 9:39 AM, Myron Stowe <[email protected]> wrote:
>> On Thu, Mar 21, 2013 at 6:51 PM, Robert Hancock <[email protected]> wrote:
>>> On 03/20/2013 10:35 PM, Myron Stowe wrote:
>>>>
>>>> Sysfs includes entries to memory regions that back a PCI device's BARs.
>>>> The pci-sysfs entries backing I/O Port BARs can be accessed by userspace,
>>>> providing direct access to the device's registers. File permissions
>>>> prevent random users from accessing the device's registers through these
>>>> files, but don't stop a privileged app that chooses to ignore the purpose
>>>> of these files from doing so.
>>>>
>>>> There are devices with abnormally strict restrictions with respect to
>>>> accessing their registers; aspects that are typically handled by the
>>>> device's driver. When these access restrictions are not followed - as
>>>> when a userspace app such as "udevadm info --attribute-walk
>>>> --path=/sys/..." parses though reading all the device's sysfs entries - it
>>>> can cause such devices to fail.
>>>>
>>>> This patch introduces a quirking mechanism that can be used to detect
>>>> accesses that do no meet the device's restrictions, letting a device
>>>> specific method intervene and decide how to progress.
>>>>
>>>> Reported-by: Xiangliang Yu <[email protected]>
>>>> Signed-off-by: Myron Stowe <[email protected]>
>>>
>>>
>>> I honestly don't think there's much point in even attempting this strategy.
>>> This list of devices in the quirk can't possibly be complete. It would
>>> likely be easier to enumerate a white-list of devices that can deal with
>>> their IO ports being read willy-nilly than a blacklist of those that don't,
>>> as there's likely countless devices that fall into this category. Even if
>>> they don't choke as badly as these ones do, it's quite likely that bad
>>> behavior will result.
>>
>> For the device in question, it seems abnormally restrictive in its
>> access limitations to BAR1 and BAR3. The device reserves 4 Bytes of
>> I/O Port space for these BARs, which is likely based on PCI's DWord
>> based protocol, but "chokes (we still have not received any specifics
>> on this yet)" on any access other than a single Byte access at offset
>> 0x2 (x86 supports 1, 2, and 4 Byte I/O Port accesses). This seems to
>> imply that the device did not back the other three reserved Bytes it
>> claims in any way, which again, seems peculiar to this particular
>> device as other similar devices tend to back the reserved bytes they
>> claim and return 0's when accessed.
>
> I don't think it's really that unusual. IO ports are weird, they're
> not necessarily emulating any kind of normal memory - doing a 32-bit
> access on IO port 0 isn't necessarily equivalent at all to doing
> separate 8-bit accesses on IO ports 0, 1, 2 and 3 for example. They
> can do completely different things. For legacy SFF ATA controllers,
> IIRC most of the IO ports only expect 8-bit accesses other than the
> PIO data register which can do 16 or 32-bit accesses (which just gives
> you a different amount of data when doing different size reads from
> the same location - it doesn't give you the data for the IO ports 2 or
> 3 bytes ahead of that when doing a 32-bit read, like you might expect
> if you were thinking of it like memory).
Very good point; I did have a more memory based mindset.
> So it's not too shocking that
> the designer of these Marvell devices didn't pay too much attention to
> what happens if you access the ports in an unexpected way.
>
> Most devices tend not to use IO ports any more in favor of MMIO,
> except for legacy devices or devices implementing a legacy interface
> (like the SFF ATA portion of these Marvell controllers). In those old
> setups, doing funny things with IO port space is more common. Also,
> the fact that IO port space tends to be somewhat precious leads to
> this sort of thing too.
>
>>
>> So in the case where two entities such as the devices driver and an
>> app like 'udevadm' are *not* simultaneously accessing it, so in effect
>> the device is idle with no device driver attached and a user app like
>> 'udevadm' accesses it: do you still contend that there are countless
>> devices that will not deal with their IO ports being read willy-nilly?
>
> Perhaps not "countless" but more "uncountable" in that there's no real
> way to tell which devices may have issues, only to know that it's
> quite possible that many devices may.
So this suggests that we will never be able to deal with a userspace
app accessing device registers in a unknowledgable, random, fashion -
the exact situation we are currently in today. We could get to the
point of supporting UIO drivers, which would be more like device
drivers and have intimate detailed knowledge of what they are driving
and which would also introduce some of the mutual-exclusive ownership
issues we have talked about, but not unknowing userspace apps.
As such, your first suggestion below to enforce the mutual-exclusion
aspects that are necessary is still not enough to solve the current
issue being encountered. An idle device would allow 'udevadm' to
acquire ownership and the sysfs based access to BAR1 would end in
catastrophe due to the device's strict access restrictions to this
BARs I/O Port region. I also agree that the whitelist/blacklist
tactic seems like a complete slippery slope. So where does that leave
us in trying to solve this issue?
Xianglinag: could you elaborate on the exact way the device is failing
due to the DWord I/O Port read of BAR1 and how that leads to the
platform hanging as this is still not understood and a lot of people
are confused by this? We understand why this access is invalid for
this device (i.e. the BE [Byte Enable] discussion) but we don't
understand how such an access attempt ends up so catastrophic, causing
the platform to hang.
>
>>
>> The reason I ask is related to what I stated in the cover [PATCH 0/3]
>> - "If on the other hand, consensus is that we need userspace device
>> register access capabilities - say for UIO drivers or such - then,
>> depending on the tact taken, we'll need this solution, or something
>> like it, as part of that overall strategy." So, if it isn't too
>> uncommon for devices to choke with normal I/O Port accesses then I
>> believe that this solution would be part of a larger strategy to
>> support userspace direct access to devices and also works as a
>> compromise for the immediate issue today.
>>
>>
>> You make some good suggestions below on possible tactics related to
>> allowing userspace access to device registers but I question whether
>> or not we really want, or need, to. Are we, by de-facto, just
>> assuming we need to? Is there consensus that Linux needs to support
>> such functionality? Yes, we are already there (to some extent) with
>> commit 8633328 but I think we are seeing the implications of such a
>> direction and we also know that the tact that this commit took should
>> be going away when VFIO matures.
>>
>> Thanks for participating in this conversation - you helped in
>> understanding the real root cause in the originating thread and have
>> brought up some good points to consider,
>> Myron
>>
>>>
>>> I think there's a few things that need to be done:
>>>
>>> -Fix the bug in udevadm that caused it to trawl through these files
>>> willy-nilly,
>>>
>>> -Fix the kernel so that access through these files complies with the
>>> kernel's mechanisms for claiming IO/memory regions to prevent access
>>> conflicts (i.e. opening these files should claim the resource region they
>>> refer to, and should fail with EBUSY or something if another process or a
>>> kernel driver is using it).
>>>
>>> -Reconsider whether supporting read/write on the resource files for IO port
>>> regions like these makes any sense. Obviously mmap isn't very practical for
>>> IO port access on x86 but you could even do something like an ioctl for this
>>> purpose. Not very many pieces of software would need to access these files
>>> so it's likely OK if the API is a bit ugly. That would prevent something
>>> like grepping through sysfs from generating port accesses to random devices.
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>>>
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> On Fri, Mar 22, 2013 at 9:55 AM, Robert Hancock <[email protected]>
> wrote:
> > On Fri, Mar 22, 2013 at 9:39 AM, Myron Stowe <[email protected]>
> wrote:
> >> On Thu, Mar 21, 2013 at 6:51 PM, Robert Hancock
> <[email protected]> wrote:
> >>> On 03/20/2013 10:35 PM, Myron Stowe wrote:
> >>>>
> >>>> Sysfs includes entries to memory regions that back a PCI device's BARs.
> >>>> The pci-sysfs entries backing I/O Port BARs can be accessed by userspace,
> >>>> providing direct access to the device's registers. File permissions
> >>>> prevent random users from accessing the device's registers through these
> >>>> files, but don't stop a privileged app that chooses to ignore the purpose
> >>>> of these files from doing so.
> >>>>
> >>>> There are devices with abnormally strict restrictions with respect to
> >>>> accessing their registers; aspects that are typically handled by the
> >>>> device's driver. When these access restrictions are not followed - as
> >>>> when a userspace app such as "udevadm info --attribute-walk
> >>>> --path=/sys/..." parses though reading all the device's sysfs entries - it
> >>>> can cause such devices to fail.
> >>>>
> >>>> This patch introduces a quirking mechanism that can be used to detect
> >>>> accesses that do no meet the device's restrictions, letting a device
> >>>> specific method intervene and decide how to progress.
> >>>>
> >>>> Reported-by: Xiangliang Yu <[email protected]>
> >>>> Signed-off-by: Myron Stowe <[email protected]>
> >>>
> >>>
> >>> I honestly don't think there's much point in even attempting this strategy.
> >>> This list of devices in the quirk can't possibly be complete. It would
> >>> likely be easier to enumerate a white-list of devices that can deal with
> >>> their IO ports being read willy-nilly than a blacklist of those that don't,
> >>> as there's likely countless devices that fall into this category. Even if
> >>> they don't choke as badly as these ones do, it's quite likely that bad
> >>> behavior will result.
> >>
> >> For the device in question, it seems abnormally restrictive in its
> >> access limitations to BAR1 and BAR3. The device reserves 4 Bytes of
> >> I/O Port space for these BARs, which is likely based on PCI's DWord
> >> based protocol, but "chokes (we still have not received any specifics
> >> on this yet)" on any access other than a single Byte access at offset
> >> 0x2 (x86 supports 1, 2, and 4 Byte I/O Port accesses). This seems to
> >> imply that the device did not back the other three reserved Bytes it
> >> claims in any way, which again, seems peculiar to this particular
> >> device as other similar devices tend to back the reserved bytes they
> >> claim and return 0's when accessed.
> >
> > I don't think it's really that unusual. IO ports are weird, they're
> > not necessarily emulating any kind of normal memory - doing a 32-bit
> > access on IO port 0 isn't necessarily equivalent at all to doing
> > separate 8-bit accesses on IO ports 0, 1, 2 and 3 for example. They
> > can do completely different things. For legacy SFF ATA controllers,
> > IIRC most of the IO ports only expect 8-bit accesses other than the
> > PIO data register which can do 16 or 32-bit accesses (which just gives
> > you a different amount of data when doing different size reads from
> > the same location - it doesn't give you the data for the IO ports 2 or
> > 3 bytes ahead of that when doing a 32-bit read, like you might expect
> > if you were thinking of it like memory).
>
> Very good point; I did have a more memory based mindset.
>
> > So it's not too shocking that
> > the designer of these Marvell devices didn't pay too much attention to
> > what happens if you access the ports in an unexpected way.
> >
> > ...
> >>
> >> ...
>
> So this suggests that we will never be able to deal with a userspace
> app accessing device registers in a unknowledgable, random, fashion -
> the exact situation we are currently in today. We could get to the
> point of supporting UIO drivers, which would be more like device
> drivers and have intimate detailed knowledge of what they are driving
> and which would also introduce some of the mutual-exclusive ownership
> issues we have talked about, but not unknowing userspace apps.
>
The PCI Express specification allows devices to choose what types of accesses they support to the
IO or memory address spaces provided by their BARs:
Excerpt from PCI Express Base Specification 3.0 page 109:
" When a device's programming model restricts (vs. what is otherwise permitted in PCI Express) the
characteristics of a Request, that device is permitted to "Completer Abort" any Requests which
violate the programming model. Examples include unaligned or wrong-size access to a register
block and unsupported size of request to a Memory Space.
Generally, devices are able to assume a restricted programming model when all communication will
be between the device's driver software and the device itself. Devices which may be accessed
directly by operating system software or by applications which may not comprehend the restricted
programming model of the device (typically devices which implement legacy capabilities) should be
designed to support all types of Requests which are possible in the existing usage model for the
device. If this is not done, the device may fail to operate with existing software."
How the system handles errors like Completer Aborts will vary - NMIs, hangs, or even processing
bogus data are all possible.
There are some baseline rules on configuration space accesses (so generic PCI topology software
can work with all PCI cards), but not IO and memory space. It's prudent to not generate IO
or memory accesses you don't know are supposed to work.
---
Rob Elliott HP Server Storage
On Thu, Mar 21, 2013 at 6:51 PM, Robert Hancock <[email protected]> wrote:
> On 03/20/2013 10:35 PM, Myron Stowe wrote:
>>
>> Sysfs includes entries to memory regions that back a PCI device's BARs.
>> The pci-sysfs entries backing I/O Port BARs can be accessed by userspace,
>> providing direct access to the device's registers. File permissions
>> prevent random users from accessing the device's registers through these
>> files, but don't stop a privileged app that chooses to ignore the purpose
>> of these files from doing so.
>>
>> There are devices with abnormally strict restrictions with respect to
>> accessing their registers; aspects that are typically handled by the
>> device's driver. When these access restrictions are not followed - as
>> when a userspace app such as "udevadm info --attribute-walk
>> --path=/sys/..." parses though reading all the device's sysfs entries - it
>> can cause such devices to fail.
>>
>> This patch introduces a quirking mechanism that can be used to detect
>> accesses that do no meet the device's restrictions, letting a device
>> specific method intervene and decide how to progress.
>>
>> Reported-by: Xiangliang Yu <[email protected]>
>> Signed-off-by: Myron Stowe <[email protected]>
>
>
> I honestly don't think there's much point in even attempting this strategy.
> This list of devices in the quirk can't possibly be complete. It would
> likely be easier to enumerate a white-list of devices that can deal with
> their IO ports being read willy-nilly than a blacklist of those that don't,
> as there's likely countless devices that fall into this category. Even if
> they don't choke as badly as these ones do, it's quite likely that bad
> behavior will result.
I agree, I'm not inclined to add a bunch of code that only fixes one
device when there are likely many others with similar problems.
> I think there's a few things that need to be done:
>
> -Fix the bug in udevadm that caused it to trawl through these files
> willy-nilly,
We can argue about whether this is actually a bug in udevadm, but
either way, my opinion is that the data that udevadm prints out from
these files is uninteresting in the same way the data from "uevent,"
"dev," "modalias," "resource," etc. is uninteresting. So to me, a
udevadm change seems justified for that reason.
> -Fix the kernel so that access through these files complies with the
> kernel's mechanisms for claiming IO/memory regions to prevent access
> conflicts (i.e. opening these files should claim the resource region they
> refer to, and should fail with EBUSY or something if another process or a
> kernel driver is using it).
>
> -Reconsider whether supporting read/write on the resource files for IO port
> regions like these makes any sense. Obviously mmap isn't very practical for
> IO port access on x86 but you could even do something like an ioctl for this
> purpose. Not very many pieces of software would need to access these files
> so it's likely OK if the API is a bit ugly. That would prevent something
> like grepping through sysfs from generating port accesses to random devices.
This is the approach I'd like to push on for a kernel fix. I'm not a
VM person, but if it were possible to support .mmap() in such a way
that a handler would be called for every access to the region, we
could easily support I/O port access that way.
Along that line, I'm concerned that we may have a hole in the MEM BAR
.mmap() path. What happens if we have BARs from two different devices
in the same page, and we mmap() one of them? Does the user also get
access to the second device's BAR on the same page? If so, that could
also be fixed with an .mmap() trap approach. Performance would suck
for these small BARs, of course.
Bjorn
On Thu, 2013-04-04 at 12:06 -0600, Bjorn Helgaas wrote:
> On Thu, Mar 21, 2013 at 6:51 PM, Robert Hancock <[email protected]> wrote:
> > On 03/20/2013 10:35 PM, Myron Stowe wrote:
> >>
> >> Sysfs includes entries to memory regions that back a PCI device's BARs.
> >> The pci-sysfs entries backing I/O Port BARs can be accessed by userspace,
> >> providing direct access to the device's registers. File permissions
> >> prevent random users from accessing the device's registers through these
> >> files, but don't stop a privileged app that chooses to ignore the purpose
> >> of these files from doing so.
> >>
> >> There are devices with abnormally strict restrictions with respect to
> >> accessing their registers; aspects that are typically handled by the
> >> device's driver. When these access restrictions are not followed - as
> >> when a userspace app such as "udevadm info --attribute-walk
> >> --path=/sys/..." parses though reading all the device's sysfs entries - it
> >> can cause such devices to fail.
> >>
> >> This patch introduces a quirking mechanism that can be used to detect
> >> accesses that do no meet the device's restrictions, letting a device
> >> specific method intervene and decide how to progress.
> >>
> >> Reported-by: Xiangliang Yu <[email protected]>
> >> Signed-off-by: Myron Stowe <[email protected]>
> >
> >
> > I honestly don't think there's much point in even attempting this strategy.
> > This list of devices in the quirk can't possibly be complete. It would
> > likely be easier to enumerate a white-list of devices that can deal with
> > their IO ports being read willy-nilly than a blacklist of those that don't,
> > as there's likely countless devices that fall into this category. Even if
> > they don't choke as badly as these ones do, it's quite likely that bad
> > behavior will result.
>
> I agree, I'm not inclined to add a bunch of code that only fixes one
> device when there are likely many others with similar problems.
>
> > I think there's a few things that need to be done:
> >
> > -Fix the bug in udevadm that caused it to trawl through these files
> > willy-nilly,
>
> We can argue about whether this is actually a bug in udevadm, but
> either way, my opinion is that the data that udevadm prints out from
> these files is uninteresting in the same way the data from "uevent,"
> "dev," "modalias," "resource," etc. is uninteresting. So to me, a
> udevadm change seems justified for that reason.
>
> > -Fix the kernel so that access through these files complies with the
> > kernel's mechanisms for claiming IO/memory regions to prevent access
> > conflicts (i.e. opening these files should claim the resource region they
> > refer to, and should fail with EBUSY or something if another process or a
> > kernel driver is using it).
> >
> > -Reconsider whether supporting read/write on the resource files for IO port
> > regions like these makes any sense. Obviously mmap isn't very practical for
> > IO port access on x86 but you could even do something like an ioctl for this
> > purpose. Not very many pieces of software would need to access these files
> > so it's likely OK if the API is a bit ugly. That would prevent something
> > like grepping through sysfs from generating port accesses to random devices.
>
> This is the approach I'd like to push on for a kernel fix.
Me too. I think the quirks approach is unsupportable. Worst case I
think we should have an ability for the *driver* to mark the region as
having strange access rules.
> I'm not a
> VM person, but if it were possible to support .mmap() in such a way
> that a handler would be called for every access to the region, we
> could easily support I/O port access that way.
We could ... the OS could trigger a page fault on every access using the
COW mechanism ... however, that mechanism wasn't intended for write
combining, so although it's theoretically possible to add it, I'd be a
bit wary.
> Along that line, I'm concerned that we may have a hole in the MEM BAR
> .mmap() path. What happens if we have BARs from two different devices
> in the same page, and we mmap() one of them? Does the user also get
> access to the second device's BAR on the same page? If so, that could
> also be fixed with an .mmap() trap approach. Performance would suck
> for these small BARs, of course.
Pure protections are done at the page level. That said, you can emulate
offset protections by marking the page inaccessible in the CPU, which
causes a trap on every access. Then you can mediate the write based on
its address. That said, this is a bit complex and we don't really have
the mechanism for it.
James
On Sat, Apr 6, 2013 at 2:49 AM, James Bottomley
<[email protected]> wrote:
> On Thu, 2013-04-04 at 12:06 -0600, Bjorn Helgaas wrote:
>> On Thu, Mar 21, 2013 at 6:51 PM, Robert Hancock <[email protected]> wrote:
>> > -Reconsider whether supporting read/write on the resource files for IO port
>> > regions like these makes any sense. Obviously mmap isn't very practical for
>> > IO port access on x86 but you could even do something like an ioctl for this
>> > purpose. Not very many pieces of software would need to access these files
>> > so it's likely OK if the API is a bit ugly. That would prevent something
>> > like grepping through sysfs from generating port accesses to random devices.
>>
>> This is the approach I'd like to push on for a kernel fix.
>
> Me too. I think the quirks approach is unsupportable. Worst case I
> think we should have an ability for the *driver* to mark the region as
> having strange access rules.
>
>> I'm not a
>> VM person, but if it were possible to support .mmap() in such a way
>> that a handler would be called for every access to the region, we
>> could easily support I/O port access that way.
>
> We could ... the OS could trigger a page fault on every access using the
> COW mechanism ... however, that mechanism wasn't intended for write
> combining, so although it's theoretically possible to add it, I'd be a
> bit wary.
The current problem case is only I/O port accesses, so we only have to
support 1/2/4-byte accesses, and I don't think write combining is an
issue. But it still sounds like COW might be trying to force a square
peg into a round hole. Maybe there are other approaches. I just
thought of .mmap() because that's already used for MEM space accesses.
Bjorn