Hi all,
Here's a series that contains two fixes to PCI bridge window sizing
algorithm. Together, they should enable remove & rescan cycle to work
for a PCI bus that has PCI devices with optional resources and/or
disparity in BAR sizes.
For the second fix, I chose to expose find_empty_resource_slot() from
kernel/resource.c because it should increase accuracy of the cannot-fit
decision (currently that function is called find_resource()). In order
to do that sensibly, a few improvements seemed in order to make its
interface and name of the function sane before exposing it. Thus, the
few extra patches on resource side.
Unfortunately I don't have a reason to suspect these would help with
the issues related to the currently ongoing resource regression
thread [1].
[1] https://lore.kernel.org/linux-pci/[email protected]/
v2:
- Add "typedef" to kerneldoc to get correct formatting
- Use RESOURCE_SIZE_MAX instead of literal
- Remove unnecessary checks for io{port/mem}_resource
- Apply a few style tweaks from Andy
Ilpo Järvinen (7):
PCI: Fix resource double counting on remove & rescan
resource: Rename find_resource() to find_empty_resource_slot()
resource: Document find_empty_resource_slot() and resource_constraint
resource: Use typedef for alignf callback
resource: Handle simple alignment inside __find_empty_resource_slot()
resource: Export find_empty_resource_slot()
PCI: Relax bridge window tail sizing rules
drivers/pci/bus.c | 10 ++----
drivers/pci/setup-bus.c | 80 +++++++++++++++++++++++++++++++++++++----
include/linux/ioport.h | 44 ++++++++++++++++++++---
include/linux/pci.h | 5 +--
kernel/resource.c | 68 ++++++++++++++++-------------------
5 files changed, 148 insertions(+), 59 deletions(-)
--
2.30.2
pbus_size_mem() keeps the size of the optional resources in
children_add_size. When calculating the PCI bridge window size,
calculate_memsize() lower bounds size by old_size before adding
children_add_size and performing the window size alignment. This
results in double counting for the resources in children_add_size
because old_size may be based on the previous size of the bridge
window after it has already included children_add_size (that is,
size1 in pbus_size_mem() from an earlier invocation of that
function).
As a result, on repeated remove of the bus & rescan cycles the resource
size keeps increasing when children_add_size is non-zero as can be seen
from this extract:
iomem0: 23fffd00000-23fffdfffff : PCI Bus 0000:03
iomem1: 20000000000-200001fffff : PCI Bus 0000:03
iomem2: 20000000000-200002fffff : PCI Bus 0000:03
iomem3: 20000000000-200003fffff : PCI Bus 0000:03
iomem4: 20000000000-200004fffff : PCI Bus 0000:03
Solve the double counting by moving old_size check later in
calculate_memsize() so that children_add_size is already accounted for.
After the patch, the bridge window retains its size as expected:
iomem0: 23fffd00000-23fffdfffff : PCI Bus 0000:03
iomem1: 20000000000-200000fffff : PCI Bus 0000:03
iomem2: 20000000000-200000fffff : PCI Bus 0000:03
Fixes: a4ac9fea016f ("PCI : Calculate right add_size")
Signed-off-by: Ilpo Järvinen <[email protected]>
---
drivers/pci/setup-bus.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index fd74f1c99dba..e3e6ff8854a7 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -820,11 +820,9 @@ static resource_size_t calculate_memsize(resource_size_t size,
size = min_size;
if (old_size == 1)
old_size = 0;
- if (size < old_size)
- size = old_size;
- size = ALIGN(max(size, add_size) + children_add_size, align);
- return size;
+ size = max(size, add_size) + children_add_size;
+ return ALIGN(max(size, old_size), align);
}
resource_size_t __weak pcibios_window_alignment(struct pci_bus *bus,
--
2.30.2
Rename find_resource() to find_empty_resource_slot() to better describe
what the functions does. This is a preparation for exposing it beyond
resource.c which is needed by PCI core. Also rename the __ variant to
match the names.
Signed-off-by: Ilpo Järvinen <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
---
kernel/resource.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/kernel/resource.c b/kernel/resource.c
index 866ef3663a0b..94f67005e1e2 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -574,10 +574,9 @@ static void resource_clip(struct resource *res, resource_size_t min,
* Find empty slot in the resource tree with the given range and
* alignment constraints
*/
-static int __find_resource(struct resource *root, struct resource *old,
- struct resource *new,
- resource_size_t size,
- struct resource_constraint *constraint)
+static int __find_empty_resource_slot(struct resource *root, struct resource *old,
+ struct resource *new, resource_size_t size,
+ struct resource_constraint *constraint)
{
struct resource *this = root->child;
struct resource tmp = *new, avail, alloc;
@@ -633,11 +632,11 @@ next: if (!this || this->end == root->end)
/*
* Find empty slot in the resource tree given range and alignment.
*/
-static int find_resource(struct resource *root, struct resource *new,
- resource_size_t size,
- struct resource_constraint *constraint)
+static int find_empty_resource_slot(struct resource *root, struct resource *new,
+ resource_size_t size,
+ struct resource_constraint *constraint)
{
- return __find_resource(root, NULL, new, size, constraint);
+ return __find_empty_resource_slot(root, NULL, new, size, constraint);
}
/**
@@ -660,7 +659,7 @@ static int reallocate_resource(struct resource *root, struct resource *old,
write_lock(&resource_lock);
- if ((err = __find_resource(root, old, &new, newsize, constraint)))
+ if ((err = __find_empty_resource_slot(root, old, &new, newsize, constraint)))
goto out;
if (resource_contains(&new, old)) {
@@ -729,7 +728,7 @@ int allocate_resource(struct resource *root, struct resource *new,
}
write_lock(&resource_lock);
- err = find_resource(root, new, size, &constraint);
+ err = find_empty_resource_slot(root, new, size, &constraint);
if (err >= 0 && __request_resource(root, new))
err = -EBUSY;
write_unlock(&resource_lock);
--
2.30.2
Document find_empty_resource_slot() and the struct resource_constraint
as they are going to be exposed outside of resource.c.
Signed-off-by: Ilpo Järvinen <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
---
kernel/resource.c | 28 +++++++++++++++++++++++++---
1 file changed, 25 insertions(+), 3 deletions(-)
diff --git a/kernel/resource.c b/kernel/resource.c
index 94f67005e1e2..ed4bb8ad701a 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -46,7 +46,18 @@ struct resource iomem_resource = {
};
EXPORT_SYMBOL(iomem_resource);
-/* constraints to be met while allocating resources */
+/**
+ * resource_constraint - constraints to be met while searching empty resource slots
+ * @min: The minimum address for the memory range
+ * @max: The maximum address for the memory range
+ * @align: Alignment for the start address of the empty slot
+ * @alignf: Additional alignment constraints callback
+ * @alignf_data: Data provided for @alignf callback
+ *
+ * Contains the range and alignment constraints that have to be met during
+ * find_empty_resource_slot(). @alignf can be NULL indicating no alignment
+ * beyond @align is necessary.
+ */
struct resource_constraint {
resource_size_t min, max, align;
resource_size_t (*alignf)(void *, const struct resource *,
@@ -629,8 +640,19 @@ next: if (!this || this->end == root->end)
return -EBUSY;
}
-/*
- * Find empty slot in the resource tree given range and alignment.
+/**
+ * find_empty_resource_slot - Find empty slot in the resource tree
+ * @root: Root resource descriptor
+ * @new: Resource descriptor awaiting an empty resource slot
+ * @size: The minimum size of the empty slot
+ * @constraint: The range and alignment constraints to be met
+ *
+ * Finds an empty slot under @root in the resource tree satisfying range and
+ * alignment @constraints.
+ *
+ * Return:
+ * * %0 - if successful, @new members start, end, and flags are altered.
+ * * %-EBUSY - if no empty slot was found.
*/
static int find_empty_resource_slot(struct resource *root, struct resource *new,
resource_size_t size,
--
2.30.2
To make it simpler to declare resource constraint alignf callbacks, add
typedef for it and document it.
Suggested-by: Andy Shevchenko <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
---
drivers/pci/bus.c | 10 ++--------
include/linux/ioport.h | 22 ++++++++++++++++++----
include/linux/pci.h | 5 +----
kernel/resource.c | 8 ++------
4 files changed, 23 insertions(+), 22 deletions(-)
diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 9c2137dae429..be5a4bb4d2fb 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -176,10 +176,7 @@ static void pci_clip_resource_to_region(struct pci_bus *bus,
static int pci_bus_alloc_from_region(struct pci_bus *bus, struct resource *res,
resource_size_t size, resource_size_t align,
resource_size_t min, unsigned long type_mask,
- resource_size_t (*alignf)(void *,
- const struct resource *,
- resource_size_t,
- resource_size_t),
+ resource_alignf alignf,
void *alignf_data,
struct pci_bus_region *region)
{
@@ -250,10 +247,7 @@ static int pci_bus_alloc_from_region(struct pci_bus *bus, struct resource *res,
int pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res,
resource_size_t size, resource_size_t align,
resource_size_t min, unsigned long type_mask,
- resource_size_t (*alignf)(void *,
- const struct resource *,
- resource_size_t,
- resource_size_t),
+ resource_alignf alignf,
void *alignf_data)
{
#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 14f5cfabbbc8..26bfd9515653 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -188,6 +188,23 @@ enum {
#define DEFINE_RES_DMA(_dma) \
DEFINE_RES_DMA_NAMED((_dma), NULL)
+/**
+ * typedef resource_alignf - Resource alignment callback
+ * @data: Private data used by the callback
+ * @res: Resource candidate range (an empty resource slot)
+ * @size: The minimum size of the empty slot
+ * @align: Alignment from the constraints
+ *
+ * Callback allows calculating resource placement and alignment beyond min,
+ * max, and align fields in the struct resource_constraint.
+ *
+ * Return: Start address for the resource.
+ */
+typedef resource_size_t (*resource_alignf)(void *data,
+ const struct resource *res,
+ resource_size_t size,
+ resource_size_t align);
+
/* PC/ISA/whatever - the normal PC address spaces: IO and memory */
extern struct resource ioport_resource;
extern struct resource iomem_resource;
@@ -207,10 +224,7 @@ extern void arch_remove_reservations(struct resource *avail);
extern int allocate_resource(struct resource *root, struct resource *new,
resource_size_t size, resource_size_t min,
resource_size_t max, resource_size_t align,
- resource_size_t (*alignf)(void *,
- const struct resource *,
- resource_size_t,
- resource_size_t),
+ resource_alignf alignf,
void *alignf_data);
struct resource *lookup_resource(struct resource *root, resource_size_t start);
int adjust_resource(struct resource *res, resource_size_t start,
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 60ca768bc867..d635e64debd9 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1546,10 +1546,7 @@ int __must_check pci_bus_alloc_resource(struct pci_bus *bus,
struct resource *res, resource_size_t size,
resource_size_t align, resource_size_t min,
unsigned long type_mask,
- resource_size_t (*alignf)(void *,
- const struct resource *,
- resource_size_t,
- resource_size_t),
+ resource_alignf alignf,
void *alignf_data);
diff --git a/kernel/resource.c b/kernel/resource.c
index ed4bb8ad701a..9d7920104120 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -60,8 +60,7 @@ EXPORT_SYMBOL(iomem_resource);
*/
struct resource_constraint {
resource_size_t min, max, align;
- resource_size_t (*alignf)(void *, const struct resource *,
- resource_size_t, resource_size_t);
+ resource_alignf alignf;
void *alignf_data;
};
@@ -725,10 +724,7 @@ static int reallocate_resource(struct resource *root, struct resource *old,
int allocate_resource(struct resource *root, struct resource *new,
resource_size_t size, resource_size_t min,
resource_size_t max, resource_size_t align,
- resource_size_t (*alignf)(void *,
- const struct resource *,
- resource_size_t,
- resource_size_t),
+ resource_alignf alignf,
void *alignf_data)
{
int err;
--
2.30.2
allocate_resource() accepts ->alignf() callback to perform custom
alignment beyond constraint->align. If alignf is NULL,
simple_align_resource() is used which only returns avail->start (no
change).
Using avail->start directly is natural and can be done with a
conditional in __find_empty_resource_slot() instead which avoids
unnecessarily using callback. It makes the code inside
__find_empty_resource_slot() more obvious and removes the need for the
caller to provide constraint->alignf unnecessarily.
This is preparation for exporting find_empty_resource_slot().
Signed-off-by: Ilpo Järvinen <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
---
kernel/resource.c | 20 +++++++-------------
1 file changed, 7 insertions(+), 13 deletions(-)
diff --git a/kernel/resource.c b/kernel/resource.c
index 9d7920104120..f0e4d57d0c35 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -563,14 +563,6 @@ void __weak arch_remove_reservations(struct resource *avail)
{
}
-static resource_size_t simple_align_resource(void *data,
- const struct resource *avail,
- resource_size_t size,
- resource_size_t align)
-{
- return avail->start;
-}
-
static void resource_clip(struct resource *res, resource_size_t min,
resource_size_t max)
{
@@ -590,6 +582,7 @@ static int __find_empty_resource_slot(struct resource *root, struct resource *ol
{
struct resource *this = root->child;
struct resource tmp = *new, avail, alloc;
+ resource_alignf alignf = constraint->alignf;
tmp.start = root->start;
/*
@@ -618,8 +611,12 @@ static int __find_empty_resource_slot(struct resource *root, struct resource *ol
avail.flags = new->flags & ~IORESOURCE_UNSET;
if (avail.start >= tmp.start) {
alloc.flags = avail.flags;
- alloc.start = constraint->alignf(constraint->alignf_data, &avail,
- size, constraint->align);
+ if (alignf) {
+ alloc.start = alignf(constraint->alignf_data,
+ &avail, size, constraint->align);
+ } else {
+ alloc.start = avail.start;
+ }
alloc.end = alloc.start + size - 1;
if (alloc.start <= alloc.end &&
resource_contains(&avail, &alloc)) {
@@ -730,9 +727,6 @@ int allocate_resource(struct resource *root, struct resource *new,
int err;
struct resource_constraint constraint;
- if (!alignf)
- alignf = simple_align_resource;
-
constraint.min = min;
constraint.max = max;
constraint.align = align;
--
2.30.2
PCI bridge window logic needs to find out in advance to the actual
allocation if there is an empty slot big enough to fit the window.
Export find_empty_resource_slot() for the purpose. Also move the struct
resource_constraint into generic header to be able to use the new
interface.
Signed-off-by: Ilpo Järvinen <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
---
include/linux/ioport.h | 22 ++++++++++++++++++++++
kernel/resource.c | 25 ++++---------------------
2 files changed, 26 insertions(+), 21 deletions(-)
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 26bfd9515653..b6ed244bc4b9 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -205,6 +205,24 @@ typedef resource_size_t (*resource_alignf)(void *data,
resource_size_t size,
resource_size_t align);
+/**
+ * resource_constraint - constraints to be met while searching empty resource slots
+ * @min: The minimum address for the memory range
+ * @max: The maximum address for the memory range
+ * @align: Alignment for the start address of the empty slot
+ * @alignf: Additional alignment constraints callback
+ * @alignf_data: Data provided for @alignf callback
+ *
+ * Contains the range and alignment constraints that have to be met during
+ * find_empty_resource_slot(). @alignf can be NULL indicating no alignment
+ * beyond @align is necessary.
+ */
+struct resource_constraint {
+ resource_size_t min, max, align;
+ resource_alignf alignf;
+ void *alignf_data;
+};
+
/* PC/ISA/whatever - the normal PC address spaces: IO and memory */
extern struct resource ioport_resource;
extern struct resource iomem_resource;
@@ -278,6 +296,10 @@ static inline bool resource_union(const struct resource *r1, const struct resour
return true;
}
+int find_empty_resource_slot(struct resource *root, struct resource *new,
+ resource_size_t size,
+ struct resource_constraint *constraint);
+
/* Convenience shorthand with allocation */
#define request_region(start,n,name) __request_region(&ioport_resource, (start), (n), (name), 0)
#define request_muxed_region(start,n,name) __request_region(&ioport_resource, (start), (n), (name), IORESOURCE_MUXED)
diff --git a/kernel/resource.c b/kernel/resource.c
index f0e4d57d0c35..c1bbb0943d9c 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -46,24 +46,6 @@ struct resource iomem_resource = {
};
EXPORT_SYMBOL(iomem_resource);
-/**
- * resource_constraint - constraints to be met while searching empty resource slots
- * @min: The minimum address for the memory range
- * @max: The maximum address for the memory range
- * @align: Alignment for the start address of the empty slot
- * @alignf: Additional alignment constraints callback
- * @alignf_data: Data provided for @alignf callback
- *
- * Contains the range and alignment constraints that have to be met during
- * find_empty_resource_slot(). @alignf can be NULL indicating no alignment
- * beyond @align is necessary.
- */
-struct resource_constraint {
- resource_size_t min, max, align;
- resource_alignf alignf;
- void *alignf_data;
-};
-
static DEFINE_RWLOCK(resource_lock);
static struct resource *next_resource(struct resource *p, bool skip_children)
@@ -650,12 +632,13 @@ next: if (!this || this->end == root->end)
* * %0 - if successful, @new members start, end, and flags are altered.
* * %-EBUSY - if no empty slot was found.
*/
-static int find_empty_resource_slot(struct resource *root, struct resource *new,
- resource_size_t size,
- struct resource_constraint *constraint)
+int find_empty_resource_slot(struct resource *root, struct resource *new,
+ resource_size_t size,
+ struct resource_constraint *constraint)
{
return __find_empty_resource_slot(root, NULL, new, size, constraint);
}
+EXPORT_SYMBOL_GPL(find_empty_resource_slot);
/**
* reallocate_resource - allocate a slot in the resource tree given range & alignment.
--
2.30.2
During remove & rescan cycle, PCI subsystem will recalculate and adjust
the bridge window sizing that was initially done by "BIOS". The size
calculation is based on the required alignment of the largest resource
among the downstream resources as per pbus_size_mem() (unimportant or
zero parameters marked with "..."):
min_align = calculate_mem_align(aligns, max_order);
size0 = calculate_memsize(size, ..., min_align);
and then in calculate_memsize():
size = ALIGN(max(size, ...) + ..., align);
If the original bridge window sizing tried to conserve space, this will
lead to massive increase of the required bridge window size when the
downstream has a large disparity in BAR sizes. E.g., with 16MiB and
16GiB BARs this results in 32GiB bridge window size even if 16MiB BAR
does not require gigabytes of space to fit.
When doing remove & rescan for a bus that contains such a PCI device, a
larger bridge window is suddenly required on rescan but when there is a
bridge window upstream that is already assigned based on the original
size, it cannot be enlarged to the new requirement. This causes the
allocation of the bridge window to fail (0x600000000 > 0x400ffffff):
pci 0000:02:01.0: PCI bridge to [bus 03]
pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
pci 0000:01:00.0: PCI bridge to [bus 02-04]
pci 0000:01:00.0: bridge window [mem 0x40400000-0x406fffff]
pci 0000:01:00.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
...
pci_bus 0000:03: busn_res: [bus 03] is released
pci 0000:03:00.0: reg 0x10: [mem 0x6400000000-0x6400ffffff 64bit pref]
pci 0000:03:00.0: reg 0x18: [mem 0x6000000000-0x63ffffffff 64bit pref]
pci 0000:03:00.0: reg 0x30: [mem 0x40400000-0x405fffff pref]
pci 0000:02:01.0: PCI bridge to [bus 03]
pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
pci 0000:02:01.0: BAR 9: no space for [mem size 0x600000000 64bit pref]
pci 0000:02:01.0: BAR 9: failed to assign [mem size 0x600000000 64bit pref]
pci 0000:02:01.0: BAR 8: assigned [mem 0x40400000-0x405fffff]
pci 0000:03:00.0: BAR 2: no space for [mem size 0x400000000 64bit pref]
pci 0000:03:00.0: BAR 2: failed to assign [mem size 0x400000000 64bit pref]
pci 0000:03:00.0: BAR 0: no space for [mem size 0x01000000 64bit pref]
pci 0000:03:00.0: BAR 0: failed to assign [mem size 0x01000000 64bit pref]
pci 0000:03:00.0: BAR 6: assigned [mem 0x40400000-0x405fffff pref]
pci 0000:02:01.0: PCI bridge to [bus 03]
pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
This is a major surprise for users who are suddenly left with a PCIe
device that was working fine with the original bridge window sizing.
Even if the already assigned bridge window could be enlarged by
reallocation in some cases (something the current code does not attempt
to do), it is not possible in general case and the large amount of
wasted space at the tail of the bridge window may lead to other
resource exhaustion problems on Root Complex level (think of multiple
PCIe cards with VFs and BAR size disparity in a single system).
PCI specifications only expect natural alignment for BARs (PCI Express
Base Specification, rev. 6.1 sect. 7.5.1.2.1) and minimum of 1MiB
alignment for the bridge window (PCI Express Base Specification,
rev 6.1 sect. 7.5.1.3). The current bridge window tail alignment rule
was introduced in the commit 5d0a8965aea9 ("[PATCH] 2.5.14: New PCI
allocation code (alpha, arm, parisc) [2/2]") that only states:
"pbus_size_mem: core stuff; tested with randomly generated sets of
resources". It does not explain the motivation for the extra tail space
allocated that is not truly needed by the downstream resources. As
such, it is far from clear if it ever has been required by any HW.
To prevent PCIe cards with BAR size disparity from becoming unusable
after remove & rescan cycle, attempt to do a truly minimal allocation
for memory resources if needed. First check if the normally calculated
bridge window will not fit into an already assigned upstream resource.
In such case, try with relaxed bridge window tail sizing rules instead
where no extra tail space is requested beyond what the downstream
resources require. Only enforce the alignment requirement of the bridge
window itself (normally 1MiB).
With this patch, the resources are successfully allocated:
pci 0000:02:01.0: PCI bridge to [bus 03]
pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref] to [bus 03] requires relaxed alignment rules
pci 0000:02:01.0: BAR 9: assigned [mem 0x6000000000-0x6400ffffff 64bit pref]
pci 0000:02:01.0: BAR 8: assigned [mem 0x40400000-0x405fffff]
pci 0000:03:00.0: BAR 2: assigned [mem 0x6000000000-0x63ffffffff 64bit pref]
pci 0000:03:00.0: BAR 0: assigned [mem 0x6400000000-0x6400ffffff 64bit pref]
pci 0000:03:00.0: BAR 6: assigned [mem 0x40400000-0x405fffff pref]
pci 0000:02:01.0: PCI bridge to [bus 03]
pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
This patch draws inspiration from the initial investigations and work
by Mika Westerberg.
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216795
Link: https://lore.kernel.org/linux-pci/[email protected]/
Fixes: 5d0a8965aea9 ("[PATCH] 2.5.14: New PCI allocation code (alpha, arm, parisc) [2/2]")
Signed-off-by: Ilpo Järvinen <[email protected]>
Cc: Mika Westerberg <[email protected]>
---
drivers/pci/setup-bus.c | 74 +++++++++++++++++++++++++++++++++++++++--
1 file changed, 72 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index e3e6ff8854a7..7a32283262c2 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -21,6 +21,7 @@
#include <linux/errno.h>
#include <linux/ioport.h>
#include <linux/cache.h>
+#include <linux/limits.h>
#include <linux/slab.h>
#include <linux/acpi.h>
#include "pci.h"
@@ -960,6 +961,62 @@ static inline resource_size_t calculate_mem_align(resource_size_t *aligns,
return min_align;
}
+/**
+ * pbus_upstream_assigned_limit - Check no upstream resource limits allocation
+ * @bus: The bus
+ * @mask: Mask the resource flag, then compare it with type
+ * @type: The type of resource from bridge
+ * @size: The size required from the bridge window
+ * @align: Required alignment for the resource
+ *
+ * Checks that @size can fit inside the upstream bridge resources that are
+ * already assigned.
+ *
+ * Return: -ENOSPC if @size cannot fit into an already assigned resource
+ * upstream resource.
+ */
+static int pbus_upstream_assigned_limit(struct pci_bus *bus, unsigned long mask,
+ unsigned long type, resource_size_t size,
+ resource_size_t align)
+{
+ struct resource_constraint constraint = {
+ .max = RESOURCE_SIZE_MAX,
+ .align = align,
+ };
+ struct pci_bus *downstream = bus;
+ struct resource *r;
+
+ while ((bus = bus->parent)) {
+ if (pci_is_root_bus(bus))
+ break;
+
+ pci_bus_for_each_resource(bus, r) {
+ if (!r || !r->parent || (r->flags & mask) != type)
+ continue;
+
+ if (resource_size(r) >= size) {
+ struct resource gap = {};
+
+ if (!find_empty_resource_slot(r, &gap, size, &constraint))
+ return 0;
+ }
+
+ if (bus->self) {
+ pci_dbg(bus->self,
+ "Assigned bridge window %pR to %pR cannot fit 0x%llx required for %s bridging to %pR\n",
+ r, &bus->busn_res,
+ (unsigned long long)size,
+ pci_name(downstream->self),
+ &downstream->busn_res);
+ }
+
+ return -ENOSPC;
+ }
+ }
+
+ return 0;
+}
+
/**
* pbus_size_mem() - Size the memory window of a given bus
*
@@ -986,7 +1043,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
struct list_head *realloc_head)
{
struct pci_dev *dev;
- resource_size_t min_align, align, size, size0, size1;
+ resource_size_t min_align, win_align, align, size, size0, size1;
resource_size_t aligns[24]; /* Alignments from 1MB to 8TB */
int order, max_order;
struct resource *b_res = find_bus_resource_of_type(bus,
@@ -1064,10 +1121,23 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
}
}
+ win_align = window_alignment(bus, b_res->flags);
min_align = calculate_mem_align(aligns, max_order);
- min_align = max(min_align, window_alignment(bus, b_res->flags));
+ min_align = max(min_align, win_align);
size0 = calculate_memsize(size, min_size, 0, 0, resource_size(b_res), min_align);
add_align = max(min_align, add_align);
+
+ if (bus->self && size0 &&
+ pbus_upstream_assigned_limit(bus, mask | IORESOURCE_PREFETCH, type,
+ size0, add_align)) {
+ min_align = 1ULL << (max_order + 20);
+ min_align = max(min_align, win_align);
+ size0 = calculate_memsize(size, min_size, 0, 0, resource_size(b_res), win_align);
+ add_align = win_align;
+ pci_info(bus->self, "bridge window %pR to %pR requires relaxed alignment rules\n",
+ b_res, &bus->busn_res);
+ }
+
size1 = (!realloc_head || (realloc_head && !add_size && !children_add_size)) ? size0 :
calculate_memsize(size, min_size, add_size, children_add_size,
resource_size(b_res), add_align);
--
2.30.2
Hi Ilpo,
On Thu, Dec 28, 2023 at 06:57:00PM +0200, Ilpo Järvinen wrote:
> Hi all,
>
> Here's a series that contains two fixes to PCI bridge window sizing
> algorithm. Together, they should enable remove & rescan cycle to work
> for a PCI bus that has PCI devices with optional resources and/or
> disparity in BAR sizes.
>
> For the second fix, I chose to expose find_empty_resource_slot() from
> kernel/resource.c because it should increase accuracy of the cannot-fit
> decision (currently that function is called find_resource()). In order
> to do that sensibly, a few improvements seemed in order to make its
> interface and name of the function sane before exposing it. Thus, the
> few extra patches on resource side.
>
> Unfortunately I don't have a reason to suspect these would help with
> the issues related to the currently ongoing resource regression
> thread [1].
>
> [1] https://lore.kernel.org/linux-pci/[email protected]/
>
> v2:
> - Add "typedef" to kerneldoc to get correct formatting
> - Use RESOURCE_SIZE_MAX instead of literal
> - Remove unnecessary checks for io{port/mem}_resource
> - Apply a few style tweaks from Andy
>
> Ilpo Järvinen (7):
> PCI: Fix resource double counting on remove & rescan
> resource: Rename find_resource() to find_empty_resource_slot()
> resource: Document find_empty_resource_slot() and resource_constraint
> resource: Use typedef for alignf callback
> resource: Handle simple alignment inside __find_empty_resource_slot()
> resource: Export find_empty_resource_slot()
> PCI: Relax bridge window tail sizing rules
Thanks for doing this! :)
All look good to me,
Reviewed-by: Mika Westerberg <[email protected]>
On Thu, 28 Dec 2023 18:57:00 +0200
Ilpo Järvinen <[email protected]> wrote:
> Hi all,
>
> Here's a series that contains two fixes to PCI bridge window sizing
> algorithm. Together, they should enable remove & rescan cycle to work
> for a PCI bus that has PCI devices with optional resources and/or
> disparity in BAR sizes.
>
> For the second fix, I chose to expose find_empty_resource_slot() from
> kernel/resource.c because it should increase accuracy of the cannot-fit
> decision (currently that function is called find_resource()). In order
> to do that sensibly, a few improvements seemed in order to make its
> interface and name of the function sane before exposing it. Thus, the
> few extra patches on resource side.
>
> Unfortunately I don't have a reason to suspect these would help with
> the issues related to the currently ongoing resource regression
> thread [1].
Jonathan,
can you test this series on affected machine with broken kernel to see if
it's of any help in your case?
>
> [1] https://lore.kernel.org/linux-pci/[email protected]/
>
> v2:
> - Add "typedef" to kerneldoc to get correct formatting
> - Use RESOURCE_SIZE_MAX instead of literal
> - Remove unnecessary checks for io{port/mem}_resource
> - Apply a few style tweaks from Andy
>
> Ilpo Järvinen (7):
> PCI: Fix resource double counting on remove & rescan
> resource: Rename find_resource() to find_empty_resource_slot()
> resource: Document find_empty_resource_slot() and resource_constraint
> resource: Use typedef for alignf callback
> resource: Handle simple alignment inside __find_empty_resource_slot()
> resource: Export find_empty_resource_slot()
> PCI: Relax bridge window tail sizing rules
>
> drivers/pci/bus.c | 10 ++----
> drivers/pci/setup-bus.c | 80 +++++++++++++++++++++++++++++++++++++----
> include/linux/ioport.h | 44 ++++++++++++++++++++---
> include/linux/pci.h | 5 +--
> kernel/resource.c | 68 ++++++++++++++++-------------------
> 5 files changed, 148 insertions(+), 59 deletions(-)
>
On Thu, Jan 04, 2024 at 01:12:10PM +0100, Igor Mammedov wrote:
> On Thu, 28 Dec 2023 18:57:00 +0200
> Ilpo J?rvinen <[email protected]> wrote:
>
> > Hi all,
> >
> > Here's a series that contains two fixes to PCI bridge window sizing
> > algorithm. Together, they should enable remove & rescan cycle to work
> > for a PCI bus that has PCI devices with optional resources and/or
> > disparity in BAR sizes.
> >
> > For the second fix, I chose to expose find_empty_resource_slot() from
> > kernel/resource.c because it should increase accuracy of the cannot-fit
> > decision (currently that function is called find_resource()). In order
> > to do that sensibly, a few improvements seemed in order to make its
> > interface and name of the function sane before exposing it. Thus, the
> > few extra patches on resource side.
> >
> > Unfortunately I don't have a reason to suspect these would help with
> > the issues related to the currently ongoing resource regression
> > thread [1].
>
> Jonathan,
> can you test this series on affected machine with broken kernel to see if
> it's of any help in your case?
Certainly, but it will have to wait until next Thursday (11 Jan 2024). I'm
still on leave this week, and when at work I only have physical access to
the machine concerned on Thursdays at present.
Which kernel would you prefer I apply the series to?
Regards
jonathan
On Thu, Jan 04, 2024 at 10:48:53PM +1030, Jonathan Woithe wrote:
> On Thu, Jan 04, 2024 at 01:12:10PM +0100, Igor Mammedov wrote:
> > On Thu, 28 Dec 2023 18:57:00 +0200
> > Ilpo J?rvinen <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > Here's a series that contains two fixes to PCI bridge window sizing
> > > algorithm. Together, they should enable remove & rescan cycle to work
> > > for a PCI bus that has PCI devices with optional resources and/or
> > > disparity in BAR sizes.
> > >
> > > For the second fix, I chose to expose find_empty_resource_slot() from
> > > kernel/resource.c because it should increase accuracy of the cannot-fit
> > > decision (currently that function is called find_resource()). In order
> > > to do that sensibly, a few improvements seemed in order to make its
> > > interface and name of the function sane before exposing it. Thus, the
> > > few extra patches on resource side.
> > >
> > > Unfortunately I don't have a reason to suspect these would help with
> > > the issues related to the currently ongoing resource regression
> > > thread [1].
> >
> > Jonathan,
> > can you test this series on affected machine with broken kernel to see if
> > it's of any help in your case?
>
> Certainly, but it will have to wait until next Thursday (11 Jan 2024). I'm
> still on leave this week, and when at work I only have physical access to
> the machine concerned on Thursdays at present.
>
> Which kernel would you prefer I apply the series to?
I was very short of time today but I did apply the above series to the
5.15.y branch (since I had this source available), resulting in version
5.15.141+. Unfortunately, in the rush I forgot to do a clean after the
bisect reset, so the resulting kernel was not correctly built. It booted
but thought it was a different version and therefore none of the modules
could be found. As a result, the test is invalid.
I will try again in a week when I next have physical access to the system.
Apologies for the delay. In the meantime, if there's a specific kernel I
should apply the patch series against please let me know. As I understand
it, you want it applied to one of the kernels which failed, making 5.15.y
(for y < 145) a reasonable choice.
Regards
jonathan
On Thu, Jan 11, 2024 at 06:30:22PM +1030, Jonathan Woithe wrote:
> On Thu, Jan 04, 2024 at 10:48:53PM +1030, Jonathan Woithe wrote:
> > On Thu, Jan 04, 2024 at 01:12:10PM +0100, Igor Mammedov wrote:
> > > On Thu, 28 Dec 2023 18:57:00 +0200
> > > Ilpo Järvinen <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Here's a series that contains two fixes to PCI bridge window sizing
> > > > algorithm. Together, they should enable remove & rescan cycle to work
> > > > for a PCI bus that has PCI devices with optional resources and/or
> > > > disparity in BAR sizes.
> > > >
> > > > For the second fix, I chose to expose find_empty_resource_slot() from
> > > > kernel/resource.c because it should increase accuracy of the cannot-fit
> > > > decision (currently that function is called find_resource()). In order
> > > > to do that sensibly, a few improvements seemed in order to make its
> > > > interface and name of the function sane before exposing it. Thus, the
> > > > few extra patches on resource side.
> > > >
> > > > Unfortunately I don't have a reason to suspect these would help with
> > > > the issues related to the currently ongoing resource regression
> > > > thread [1].
> > >
> > > Jonathan,
> > > can you test this series on affected machine with broken kernel to see if
> > > it's of any help in your case?
> >
> > Certainly, but it will have to wait until next Thursday (11 Jan 2024). I'm
> > still on leave this week, and when at work I only have physical access to
> > the machine concerned on Thursdays at present.
> >
> > Which kernel would you prefer I apply the series to?
>
> I was very short of time today but I did apply the above series to the
> 5.15.y branch (since I had this source available), resulting in version
> 5.15.141+. Unfortunately, in the rush I forgot to do a clean after the
> bisect reset, so the resulting kernel was not correctly built. It booted
> but thought it was a different version and therefore none of the modules
> could be found. As a result, the test is invalid.
>
> I will try again in a week when I next have physical access to the system.
> Apologies for the delay. In the meantime, if there's a specific kernel I
> should apply the patch series against please let me know. As I understand
> it, you want it applied to one of the kernels which failed, making 5.15.y
> (for y < 145) a reasonable choice.
I did a "make clean" to reset the source tree and recompiled. However, it
errored out:
drivers/pci/setup-bus.c:988:24: error: ‘RESOURCE_SIZE_MAX’ undeclared
drivers/pci/setup-bus.c:998:17: error: ‘pci_bus_for_each_resource’ undeclared
This was with the patch series applied against 5.15.141. It seems the patch
targets a kernel that's too far removed from 5.15.x.
Which kernel would you like me to apply the patch series to and test?
Regards
jonathan
On Thu, 18 Jan 2024, Jonathan Woithe wrote:
> On Thu, Jan 11, 2024 at 06:30:22PM +1030, Jonathan Woithe wrote:
> > On Thu, Jan 04, 2024 at 10:48:53PM +1030, Jonathan Woithe wrote:
> > > On Thu, Jan 04, 2024 at 01:12:10PM +0100, Igor Mammedov wrote:
> > > > On Thu, 28 Dec 2023 18:57:00 +0200
> > > > Ilpo Järvinen <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Here's a series that contains two fixes to PCI bridge window sizing
> > > > > algorithm. Together, they should enable remove & rescan cycle to work
> > > > > for a PCI bus that has PCI devices with optional resources and/or
> > > > > disparity in BAR sizes.
> > > > >
> > > > > For the second fix, I chose to expose find_empty_resource_slot() from
> > > > > kernel/resource.c because it should increase accuracy of the cannot-fit
> > > > > decision (currently that function is called find_resource()). In order
> > > > > to do that sensibly, a few improvements seemed in order to make its
> > > > > interface and name of the function sane before exposing it. Thus, the
> > > > > few extra patches on resource side.
> > > > >
> > > > > Unfortunately I don't have a reason to suspect these would help with
> > > > > the issues related to the currently ongoing resource regression
> > > > > thread [1].
> > > >
> > > > Jonathan,
> > > > can you test this series on affected machine with broken kernel to see if
> > > > it's of any help in your case?
> > >
> > > Certainly, but it will have to wait until next Thursday (11 Jan 2024). I'm
> > > still on leave this week, and when at work I only have physical access to
> > > the machine concerned on Thursdays at present.
> > >
> > > Which kernel would you prefer I apply the series to?
> >
> > I was very short of time today but I did apply the above series to the
> > 5.15.y branch (since I had this source available), resulting in version
> > 5.15.141+. Unfortunately, in the rush I forgot to do a clean after the
> > bisect reset, so the resulting kernel was not correctly built. It booted
> > but thought it was a different version and therefore none of the modules
> > could be found. As a result, the test is invalid.
> >
> > I will try again in a week when I next have physical access to the system.
> > Apologies for the delay. In the meantime, if there's a specific kernel I
> > should apply the patch series against please let me know. As I understand
> > it, you want it applied to one of the kernels which failed, making 5.15.y
> > (for y < 145) a reasonable choice.
>
> I did a "make clean" to reset the source tree and recompiled. However, it
> errored out:
>
> drivers/pci/setup-bus.c:988:24: error: ‘RESOURCE_SIZE_MAX’ undeclared
> drivers/pci/setup-bus.c:998:17: error: ‘pci_bus_for_each_resource’ undeclared
>
> This was with the patch series applied against 5.15.141. It seems the patch
> targets a kernel that's too far removed from 5.15.x.
>
> Which kernel would you like me to apply the patch series to and test?
Two argument version of pci_bus_for_each_resource() is quite new (so
either 6.6 or 6.7). If want to attempt to compile in 5.15.x, you need
this:
include/linux/limits.h:#define RESOURCE_SIZE_MAX ((resource_size_t)~0)
And to add one extra argument into pci_bus_for_each_resource(bus, r) in
pbus_upstream_assigned_limit():
...
while ((bus = bus->parent)) {
+ unsigned int i;
if (pci_is_root_bus(bus))
break;
- pci_bus_for_each_resource(bus, r) {
+ pci_bus_for_each_resource(bus, r, i) {
Note I've written this "patch" by hand inline so patch command cannot
apply it but you need to edit those in.
--
i.
On Thu, Jan 18, 2024 at 05:18:45PM +1030, Jonathan Woithe wrote:
> On Thu, Jan 11, 2024 at 06:30:22PM +1030, Jonathan Woithe wrote:
> > On Thu, Jan 04, 2024 at 10:48:53PM +1030, Jonathan Woithe wrote:
> > > On Thu, Jan 04, 2024 at 01:12:10PM +0100, Igor Mammedov wrote:
> > > > On Thu, 28 Dec 2023 18:57:00 +0200
> > > > Ilpo Järvinen <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Here's a series that contains two fixes to PCI bridge window sizing
> > > > > algorithm. Together, they should enable remove & rescan cycle to work
> > > > > for a PCI bus that has PCI devices with optional resources and/or
> > > > > disparity in BAR sizes.
> > > > >
> > > > > For the second fix, I chose to expose find_empty_resource_slot() from
> > > > > kernel/resource.c because it should increase accuracy of the cannot-fit
> > > > > decision (currently that function is called find_resource()). In order
> > > > > to do that sensibly, a few improvements seemed in order to make its
> > > > > interface and name of the function sane before exposing it. Thus, the
> > > > > few extra patches on resource side.
> > > > >
> > > > > Unfortunately I don't have a reason to suspect these would help with
> > > > > the issues related to the currently ongoing resource regression
> > > > > thread [1].
> > > >
> > > > Jonathan,
> > > > can you test this series on affected machine with broken kernel to see if
> > > > it's of any help in your case?
> > >
> > > Certainly, but it will have to wait until next Thursday (11 Jan 2024). I'm
> > > still on leave this week, and when at work I only have physical access to
> > > the machine concerned on Thursdays at present.
> > >
> > > Which kernel would you prefer I apply the series to?
> >
> > I was very short of time today but I did apply the above series to the
> > 5.15.y branch (since I had this source available), resulting in version
> > 5.15.141+. Unfortunately, in the rush I forgot to do a clean after the
> > bisect reset, so the resulting kernel was not correctly built. It booted
> > but thought it was a different version and therefore none of the modules
> > could be found. As a result, the test is invalid.
> >
> > I will try again in a week when I next have physical access to the system.
> > Apologies for the delay. In the meantime, if there's a specific kernel I
> > should apply the patch series against please let me know. As I understand
> > it, you want it applied to one of the kernels which failed, making 5.15.y
> > (for y < 145) a reasonable choice.
>
> I did a "make clean" to reset the source tree and recompiled. However, it
> errored out:
>
> drivers/pci/setup-bus.c:988:24: error: ‘RESOURCE_SIZE_MAX’ undeclared
> drivers/pci/setup-bus.c:998:17: error: ‘pci_bus_for_each_resource’ undeclared
>
> This was with the patch series applied against 5.15.141. It seems the patch
> targets a kernel that's too far removed from 5.15.x.
>
> Which kernel would you like me to apply the patch series to and test?
The rule of thumb is to test against latest vanilla (as of today v6.7).
Also makes sense to test against Linux Next. The v5.15 is way too old for
a new code.
--
With Best Regards,
Andy Shevchenko
On Sun, Jan 21, 2024 at 02:54:22PM +0200, Andy Shevchenko wrote:
> On Thu, Jan 18, 2024 at 05:18:45PM +1030, Jonathan Woithe wrote:
> > On Thu, Jan 11, 2024 at 06:30:22PM +1030, Jonathan Woithe wrote:
> > > On Thu, Jan 04, 2024 at 10:48:53PM +1030, Jonathan Woithe wrote:
> > > > On Thu, Jan 04, 2024 at 01:12:10PM +0100, Igor Mammedov wrote:
> > > > > On Thu, 28 Dec 2023 18:57:00 +0200
> > > > > Ilpo Järvinen <[email protected]> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Here's a series that contains two fixes to PCI bridge window sizing
> > > > > > algorithm. Together, they should enable remove & rescan cycle to work
> > > > > > for a PCI bus that has PCI devices with optional resources and/or
> > > > > > disparity in BAR sizes.
> > > > > >
> > > > > > For the second fix, I chose to expose find_empty_resource_slot() from
> > > > > > kernel/resource.c because it should increase accuracy of the cannot-fit
> > > > > > decision (currently that function is called find_resource()). In order
> > > > > > to do that sensibly, a few improvements seemed in order to make its
> > > > > > interface and name of the function sane before exposing it. Thus, the
> > > > > > few extra patches on resource side.
> > > > > >
> > > > > > Unfortunately I don't have a reason to suspect these would help with
> > > > > > the issues related to the currently ongoing resource regression
> > > > > > thread [1].
> > > > >
> > > > > Jonathan,
> > > > > can you test this series on affected machine with broken kernel to see if
> > > > > it's of any help in your case?
> > > >
> > > > Certainly, but it will have to wait until next Thursday (11 Jan 2024). I'm
> > > > still on leave this week, and when at work I only have physical access to
> > > > the machine concerned on Thursdays at present.
> > > >
> > > > Which kernel would you prefer I apply the series to?
> > >
> > > I was very short of time today but I did apply the above series to the
> > > 5.15.y branch (since I had this source available), resulting in version
> > > 5.15.141+. Unfortunately, in the rush I forgot to do a clean after the
> > > bisect reset, so the resulting kernel was not correctly built. It booted
> > > but thought it was a different version and therefore none of the modules
> > > could be found. As a result, the test is invalid.
> > >
> > > I will try again in a week when I next have physical access to the system.
> > > Apologies for the delay. In the meantime, if there's a specific kernel I
> > > should apply the patch series against please let me know. As I understand
> > > it, you want it applied to one of the kernels which failed, making 5.15.y
> > > (for y < 145) a reasonable choice.
> >
> > I did a "make clean" to reset the source tree and recompiled. However, it
> > errored out:
> >
> > drivers/pci/setup-bus.c:988:24: error: ‘RESOURCE_SIZE_MAX’ undeclared
> > drivers/pci/setup-bus.c:998:17: error: ‘pci_bus_for_each_resource’ undeclared
> >
> > This was with the patch series applied against 5.15.141. It seems the patch
> > targets a kernel that's too far removed from 5.15.x.
> >
> > Which kernel would you like me to apply the patch series to and test?
>
> The rule of thumb is to test against latest vanilla (as of today v6.7).
> Also makes sense to test against Linux Next. The v5.15 is way too old for
> a new code.
Thanks, and understood. In this case the request from Igor was
can you test this series on affected machine with broken kernel to see if
it's of any help in your case?
The latest vanilla kernel (6.7) has (AFAIK) had the offending commit
reverted, so it's not a "broken" kernel in this respect. Therefore, if I've
understood the request correctly, working with that kernel won't produce the
desired test.
Regards
jonathan
On Mon, 22 Jan 2024, Jonathan Woithe wrote:
> On Sun, Jan 21, 2024 at 02:54:22PM +0200, Andy Shevchenko wrote:
> > On Thu, Jan 18, 2024 at 05:18:45PM +1030, Jonathan Woithe wrote:
> > > On Thu, Jan 11, 2024 at 06:30:22PM +1030, Jonathan Woithe wrote:
> > > > On Thu, Jan 04, 2024 at 10:48:53PM +1030, Jonathan Woithe wrote:
> > > > > On Thu, Jan 04, 2024 at 01:12:10PM +0100, Igor Mammedov wrote:
> > > > > > On Thu, 28 Dec 2023 18:57:00 +0200
> > > > > > Ilpo Järvinen <[email protected]> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Here's a series that contains two fixes to PCI bridge window sizing
> > > > > > > algorithm. Together, they should enable remove & rescan cycle to work
> > > > > > > for a PCI bus that has PCI devices with optional resources and/or
> > > > > > > disparity in BAR sizes.
> > > > > > >
> > > > > > > For the second fix, I chose to expose find_empty_resource_slot() from
> > > > > > > kernel/resource.c because it should increase accuracy of the cannot-fit
> > > > > > > decision (currently that function is called find_resource()). In order
> > > > > > > to do that sensibly, a few improvements seemed in order to make its
> > > > > > > interface and name of the function sane before exposing it. Thus, the
> > > > > > > few extra patches on resource side.
> > > > > > >
> > > > > > > Unfortunately I don't have a reason to suspect these would help with
> > > > > > > the issues related to the currently ongoing resource regression
> > > > > > > thread [1].
> > > > > >
> > > > > > Jonathan,
> > > > > > can you test this series on affected machine with broken kernel to see if
> > > > > > it's of any help in your case?
> > > > >
> > > > > Certainly, but it will have to wait until next Thursday (11 Jan 2024). I'm
> > > > > still on leave this week, and when at work I only have physical access to
> > > > > the machine concerned on Thursdays at present.
> > > > >
> > > > > Which kernel would you prefer I apply the series to?
> > > >
> > > > I was very short of time today but I did apply the above series to the
> > > > 5.15.y branch (since I had this source available), resulting in version
> > > > 5.15.141+. Unfortunately, in the rush I forgot to do a clean after the
> > > > bisect reset, so the resulting kernel was not correctly built. It booted
> > > > but thought it was a different version and therefore none of the modules
> > > > could be found. As a result, the test is invalid.
> > > >
> > > > I will try again in a week when I next have physical access to the system.
> > > > Apologies for the delay. In the meantime, if there's a specific kernel I
> > > > should apply the patch series against please let me know. As I understand
> > > > it, you want it applied to one of the kernels which failed, making 5.15.y
> > > > (for y < 145) a reasonable choice.
> > >
> > > I did a "make clean" to reset the source tree and recompiled. However, it
> > > errored out:
> > >
> > > drivers/pci/setup-bus.c:988:24: error: ‘RESOURCE_SIZE_MAX’ undeclared
> > > drivers/pci/setup-bus.c:998:17: error: ‘pci_bus_for_each_resource’ undeclared
> > >
> > > This was with the patch series applied against 5.15.141. It seems the patch
> > > targets a kernel that's too far removed from 5.15.x.
> > >
> > > Which kernel would you like me to apply the patch series to and test?
> >
> > The rule of thumb is to test against latest vanilla (as of today v6.7).
> > Also makes sense to test against Linux Next. The v5.15 is way too old for
> > a new code.
>
> Thanks, and understood. In this case the request from Igor was
>
> can you test this series on affected machine with broken kernel to see if
> it's of any help in your case?
>
> The latest vanilla kernel (6.7) has (AFAIK) had the offending commit
> reverted, so it's not a "broken" kernel in this respect. Therefore, if I've
> understood the request correctly, working with that kernel won't produce the
> desired test.
Well, you can revert the revert again to get back to the broken state.
--
i.
On Mon, 22 Jan 2024 14:37:32 +0200 (EET)
Ilpo Järvinen <[email protected]> wrote:
> On Mon, 22 Jan 2024, Jonathan Woithe wrote:
>
> > On Sun, Jan 21, 2024 at 02:54:22PM +0200, Andy Shevchenko wrote:
> > > On Thu, Jan 18, 2024 at 05:18:45PM +1030, Jonathan Woithe wrote:
> > > > On Thu, Jan 11, 2024 at 06:30:22PM +1030, Jonathan Woithe wrote:
> > > > > On Thu, Jan 04, 2024 at 10:48:53PM +1030, Jonathan Woithe wrote:
> > > > > > On Thu, Jan 04, 2024 at 01:12:10PM +0100, Igor Mammedov wrote:
> > > > > > > On Thu, 28 Dec 2023 18:57:00 +0200
> > > > > > > Ilpo Järvinen <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > Here's a series that contains two fixes to PCI bridge window sizing
> > > > > > > > algorithm. Together, they should enable remove & rescan cycle to work
> > > > > > > > for a PCI bus that has PCI devices with optional resources and/or
> > > > > > > > disparity in BAR sizes.
> > > > > > > >
> > > > > > > > For the second fix, I chose to expose find_empty_resource_slot() from
> > > > > > > > kernel/resource.c because it should increase accuracy of the cannot-fit
> > > > > > > > decision (currently that function is called find_resource()). In order
> > > > > > > > to do that sensibly, a few improvements seemed in order to make its
> > > > > > > > interface and name of the function sane before exposing it. Thus, the
> > > > > > > > few extra patches on resource side.
> > > > > > > >
> > > > > > > > Unfortunately I don't have a reason to suspect these would help with
> > > > > > > > the issues related to the currently ongoing resource regression
> > > > > > > > thread [1].
> > > > > > >
> > > > > > > Jonathan,
> > > > > > > can you test this series on affected machine with broken kernel to see if
> > > > > > > it's of any help in your case?
> > > > > >
> > > > > > Certainly, but it will have to wait until next Thursday (11 Jan 2024). I'm
> > > > > > still on leave this week, and when at work I only have physical access to
> > > > > > the machine concerned on Thursdays at present.
> > > > > >
> > > > > > Which kernel would you prefer I apply the series to?
> > > > >
> > > > > I was very short of time today but I did apply the above series to the
> > > > > 5.15.y branch (since I had this source available), resulting in version
> > > > > 5.15.141+. Unfortunately, in the rush I forgot to do a clean after the
> > > > > bisect reset, so the resulting kernel was not correctly built. It booted
> > > > > but thought it was a different version and therefore none of the modules
> > > > > could be found. As a result, the test is invalid.
> > > > >
> > > > > I will try again in a week when I next have physical access to the system.
> > > > > Apologies for the delay. In the meantime, if there's a specific kernel I
> > > > > should apply the patch series against please let me know. As I understand
> > > > > it, you want it applied to one of the kernels which failed, making 5.15.y
> > > > > (for y < 145) a reasonable choice.
> > > >
> > > > I did a "make clean" to reset the source tree and recompiled. However, it
> > > > errored out:
> > > >
> > > > drivers/pci/setup-bus.c:988:24: error: ‘RESOURCE_SIZE_MAX’ undeclared
> > > > drivers/pci/setup-bus.c:998:17: error: ‘pci_bus_for_each_resource’ undeclared
> > > >
> > > > This was with the patch series applied against 5.15.141. It seems the patch
> > > > targets a kernel that's too far removed from 5.15.x.
> > > >
> > > > Which kernel would you like me to apply the patch series to and test?
> > >
> > > The rule of thumb is to test against latest vanilla (as of today v6.7).
> > > Also makes sense to test against Linux Next. The v5.15 is way too old for
> > > a new code.
> >
> > Thanks, and understood. In this case the request from Igor was
> >
> > can you test this series on affected machine with broken kernel to see if
> > it's of any help in your case?
> >
> > The latest vanilla kernel (6.7) has (AFAIK) had the offending commit
> > reverted, so it's not a "broken" kernel in this respect. Therefore, if I've
> > understood the request correctly, working with that kernel won't produce the
> > desired test.
>
> Well, you can revert the revert again to get back to the broken state.
either this or just a hand patching as Ilpo has suggested earlier
would do.
There is non zero chance that this series might fix issues
Jonathan is facing. i.e. failed resource reallocation which
offending patches trigger. There are 2 different issues here,
* 1st unwanted reallocation - it should happen but well that how current code works
* 2nd failed reallocation - seemingly matches what this series is trying to fix
and if it doesn't help we would need to dig some more in this direction
as well to figure out why it fails.
On Mon, Jan 22, 2024 at 02:45:20PM +0100, Igor Mammedov wrote:
> On Mon, 22 Jan 2024 14:37:32 +0200 (EET)
> Ilpo Järvinen <[email protected]> wrote:
>
> > On Mon, 22 Jan 2024, Jonathan Woithe wrote:
> >
> > > On Sun, Jan 21, 2024 at 02:54:22PM +0200, Andy Shevchenko wrote:
> > > > On Thu, Jan 18, 2024 at 05:18:45PM +1030, Jonathan Woithe wrote:
> > > > > On Thu, Jan 11, 2024 at 06:30:22PM +1030, Jonathan Woithe wrote:
> > > > > > On Thu, Jan 04, 2024 at 10:48:53PM +1030, Jonathan Woithe wrote:
> > > > > > > On Thu, Jan 04, 2024 at 01:12:10PM +0100, Igor Mammedov wrote:
> > > > > > > > On Thu, 28 Dec 2023 18:57:00 +0200
> > > > > > > > Ilpo Järvinen <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > Here's a series that contains two fixes to PCI bridge window sizing
> > > > > > > > > algorithm. Together, they should enable remove & rescan cycle to work
> > > > > > > > > for a PCI bus that has PCI devices with optional resources and/or
> > > > > > > > > disparity in BAR sizes.
> > > > > > > > >
> > > > > > > > > For the second fix, I chose to expose find_empty_resource_slot() from
> > > > > > > > > kernel/resource.c because it should increase accuracy of the cannot-fit
> > > > > > > > > decision (currently that function is called find_resource()). In order
> > > > > > > > > to do that sensibly, a few improvements seemed in order to make its
> > > > > > > > > interface and name of the function sane before exposing it. Thus, the
> > > > > > > > > few extra patches on resource side.
> > > > > > > > >
> > > > > > > > > Unfortunately I don't have a reason to suspect these would help with
> > > > > > > > > the issues related to the currently ongoing resource regression
> > > > > > > > > thread [1].
> > > > > > > >
> > > > > > > > Jonathan,
> > > > > > > > can you test this series on affected machine with broken kernel to see if
> > > > > > > > it's of any help in your case?
> > > > > > >
> > > > > > > Certainly, but it will have to wait until next Thursday (11 Jan 2024). I'm
> > > > > > > still on leave this week, and when at work I only have physical access to
> > > > > > > the machine concerned on Thursdays at present.
> > > > > > >
> > > > > > > Which kernel would you prefer I apply the series to?
> > > > > >
> > > > > > I was very short of time today but I did apply the above series to the
> > > > > > 5.15.y branch (since I had this source available), resulting in version
> > > > > > 5.15.141+. Unfortunately, in the rush I forgot to do a clean after the
> > > > > > bisect reset, so the resulting kernel was not correctly built. It booted
> > > > > > but thought it was a different version and therefore none of the modules
> > > > > > could be found. As a result, the test is invalid.
> > > > > >
> > > > > > I will try again in a week when I next have physical access to the system.
> > > > > > Apologies for the delay. In the meantime, if there's a specific kernel I
> > > > > > should apply the patch series against please let me know. As I understand
> > > > > > it, you want it applied to one of the kernels which failed, making 5.15.y
> > > > > > (for y < 145) a reasonable choice.
> > > > >
> > > > > I did a "make clean" to reset the source tree and recompiled. However, it
> > > > > errored out:
> > > > >
> > > > > drivers/pci/setup-bus.c:988:24: error: ‘RESOURCE_SIZE_MAX’ undeclared
> > > > > drivers/pci/setup-bus.c:998:17: error: ‘pci_bus_for_each_resource’ undeclared
> > > > >
> > > > > This was with the patch series applied against 5.15.141. It seems the patch
> > > > > targets a kernel that's too far removed from 5.15.x.
> > > > >
> > > > > Which kernel would you like me to apply the patch series to and test?
> > > >
> > > > The rule of thumb is to test against latest vanilla (as of today v6.7).
> > > > Also makes sense to test against Linux Next. The v5.15 is way too old for
> > > > a new code.
> > >
> > > Thanks, and understood. In this case the request from Igor was
> > >
> > > can you test this series on affected machine with broken kernel to see if
> > > it's of any help in your case?
> > >
> > > The latest vanilla kernel (6.7) has (AFAIK) had the offending commit
> > > reverted, so it's not a "broken" kernel in this respect. Therefore, if I've
> > > understood the request correctly, working with that kernel won't produce the
> > > desired test.
> >
> > Well, you can revert the revert again to get back to the broken state.
>
> either this or just a hand patching as Ilpo has suggested earlier
> would do.
No problem. This was the easiest approach for me and I have now done this.
Apologies for the delay in getting to this: I ran out of time last Thursday.
> There is non zero chance that this series might fix issues
> Jonathan is facing. i.e. failed resource reallocation which
> offending patches trigger.
I can confirm that as expected, this patch series has had no effect on the
system which experiences the failed resource reallocation. From syslog,
running a 5.15.141+ kernel[1]:
kernel: radeon 0000:4b:00.0: Fatal error during GPU init
kernel: radeon: probe of 0000:4b:00.0 failed with error -12
This is unchanged from what is seen with the unaltered 5.15.141 kernel.
In case it's important, can also confirm that the errors related to the
thunderbolt device are are also still present in the patched 5.15.141+
kernel:
thunderbolt 0000:04:00.0: interrupt for TX ring 0 is already enabled
:
thunderbolt 0000:04:00.0: interrupt for RX ring 0 is already enabled
:
Like the GPU failure, they do not appear in the working kernels on this
system.
Let me know if you would like to me to run further tests.
Regards
jonathan
[1] This is 5.15.141, patched with the series of interest here and the hand
patch from Ilpo.
On Thu, 1 Feb 2024, Jonathan Woithe wrote:
> On Mon, Jan 22, 2024 at 02:45:20PM +0100, Igor Mammedov wrote:
> > On Mon, 22 Jan 2024 14:37:32 +0200 (EET)
> > Ilpo Järvinen <[email protected]> wrote:
> >
> > > On Mon, 22 Jan 2024, Jonathan Woithe wrote:
> > >
> > > > On Sun, Jan 21, 2024 at 02:54:22PM +0200, Andy Shevchenko wrote:
> > > > > On Thu, Jan 18, 2024 at 05:18:45PM +1030, Jonathan Woithe wrote:
> > > > > > On Thu, Jan 11, 2024 at 06:30:22PM +1030, Jonathan Woithe wrote:
> > > > > > > On Thu, Jan 04, 2024 at 10:48:53PM +1030, Jonathan Woithe wrote:
> > > > > > > > On Thu, Jan 04, 2024 at 01:12:10PM +0100, Igor Mammedov wrote:
> > > > > > > > > On Thu, 28 Dec 2023 18:57:00 +0200
> > > > > > > > > Ilpo Järvinen <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > Here's a series that contains two fixes to PCI bridge window sizing
> > > > > > > > > > algorithm. Together, they should enable remove & rescan cycle to work
> > > > > > > > > > for a PCI bus that has PCI devices with optional resources and/or
> > > > > > > > > > disparity in BAR sizes.
> > > > > > > > > >
> > > > > > > > > > For the second fix, I chose to expose find_empty_resource_slot() from
> > > > > > > > > > kernel/resource.c because it should increase accuracy of the cannot-fit
> > > > > > > > > > decision (currently that function is called find_resource()). In order
> > > > > > > > > > to do that sensibly, a few improvements seemed in order to make its
> > > > > > > > > > interface and name of the function sane before exposing it. Thus, the
> > > > > > > > > > few extra patches on resource side.
> > > > > > > > > >
> > > > > > > > > > Unfortunately I don't have a reason to suspect these would help with
> > > > > > > > > > the issues related to the currently ongoing resource regression
> > > > > > > > > > thread [1].
> > > > Thanks, and understood. In this case the request from Igor was
> > > >
> > > > can you test this series on affected machine with broken kernel to see if
> > > > it's of any help in your case?
> > > >
> > > > The latest vanilla kernel (6.7) has (AFAIK) had the offending commit
> > > > reverted, so it's not a "broken" kernel in this respect. Therefore, if I've
> > > > understood the request correctly, working with that kernel won't produce the
> > > > desired test.
> > >
> > > Well, you can revert the revert again to get back to the broken state.
> >
> > either this or just a hand patching as Ilpo has suggested earlier
> > would do.
>
> No problem. This was the easiest approach for me and I have now done this.
> Apologies for the delay in getting to this: I ran out of time last Thursday.
>
> > There is non zero chance that this series might fix issues
> > Jonathan is facing. i.e. failed resource reallocation which
> > offending patches trigger.
>
> I can confirm that as expected, this patch series has had no effect on the
> system which experiences the failed resource reallocation. From syslog,
> running a 5.15.141+ kernel[1]:
>
> kernel: radeon 0000:4b:00.0: Fatal error during GPU init
> kernel: radeon: probe of 0000:4b:00.0 failed with error -12
>
> This is unchanged from what is seen with the unaltered 5.15.141 kernel.
>
> In case it's important, can also confirm that the errors related to the
> thunderbolt device are are also still present in the patched 5.15.141+
> kernel:
>
> thunderbolt 0000:04:00.0: interrupt for TX ring 0 is already enabled
> :
> thunderbolt 0000:04:00.0: interrupt for RX ring 0 is already enabled
> :
>
> Like the GPU failure, they do not appear in the working kernels on this
> system.
>
> Let me know if you would like to me to run further tests.
>
> Regards
> jonathan
>
> [1] This is 5.15.141, patched with the series of interest here and the hand
> patch from Ilpo.
Hi Jonathan,
Thanks a lot for testing it regardless. The end result was not a big
surprise given how it looked like based on the logs but was certainly
worth a test like Igor mentioned. The resource allocation code isn't among
the easiest to track.
--
i.
On Thu, 28 Dec 2023, Ilpo J?rvinen wrote:
> Here's a series that contains two fixes to PCI bridge window sizing
> algorithm. Together, they should enable remove & rescan cycle to work
> for a PCI bus that has PCI devices with optional resources and/or
> disparity in BAR sizes.
>
> For the second fix, I chose to expose find_empty_resource_slot() from
> kernel/resource.c because it should increase accuracy of the cannot-fit
> decision (currently that function is called find_resource()). In order
> to do that sensibly, a few improvements seemed in order to make its
> interface and name of the function sane before exposing it. Thus, the
> few extra patches on resource side.
>
> Unfortunately I don't have a reason to suspect these would help with
> the issues related to the currently ongoing resource regression
> thread [1].
>
> [1] https://lore.kernel.org/linux-pci/[email protected]/
>
> v2:
> - Add "typedef" to kerneldoc to get correct formatting
> - Use RESOURCE_SIZE_MAX instead of literal
> - Remove unnecessary checks for io{port/mem}_resource
> - Apply a few style tweaks from Andy
>
> Ilpo J?rvinen (7):
> PCI: Fix resource double counting on remove & rescan
> resource: Rename find_resource() to find_empty_resource_slot()
> resource: Document find_empty_resource_slot() and resource_constraint
> resource: Use typedef for alignf callback
> resource: Handle simple alignment inside __find_empty_resource_slot()
> resource: Export find_empty_resource_slot()
> PCI: Relax bridge window tail sizing rules
I finally managed to get the group of people who reported this initially
here to go and test to confirm these did solve the issues they're seeing,
so for all the patches:
Tested-by: Lidong Wang <[email protected]>
(If needed, I can send v3 with that tag).
--
i.
ps. Bjorn, I realized I pointed you earlier to v1 of this patchset, not
this v2 one. I'm sorry about that confusion (it was too far back I didn't
immediately even remember I did v2).
On Fri, Mar 15, 2024 at 12:33:43PM +0200, Ilpo J?rvinen wrote:
> On Thu, 28 Dec 2023, Ilpo J?rvinen wrote:
..
> (If needed, I can send v3 with that tag).
Dunno what's Bjorn's workflow, but `b4 am` has parameter to accept tags given
against a cover letter and propagate them to all patches in the series. I.o.w.
no need to send a new version in such cases.
--
With Best Regards,
Andy Shevchenko
On Thu, 28 Dec 2023 18:57:00 +0200
Ilpo J?rvinen <[email protected]> wrote:
> Hi all,
>
> Here's a series that contains two fixes to PCI bridge window sizing
> algorithm. Together, they should enable remove & rescan cycle to work
> for a PCI bus that has PCI devices with optional resources and/or
> disparity in BAR sizes.
>
> For the second fix, I chose to expose find_empty_resource_slot() from
> kernel/resource.c because it should increase accuracy of the cannot-fit
> decision (currently that function is called find_resource()). In order
> to do that sensibly, a few improvements seemed in order to make its
> interface and name of the function sane before exposing it. Thus, the
> few extra patches on resource side.
>
> Unfortunately I don't have a reason to suspect these would help with
> the issues related to the currently ongoing resource regression
> thread [1].
>
> [1] https://lore.kernel.org/linux-pci/[email protected]/
>
Hi Ilpo
I've hit what looks to be a similar issue to this (not fixed by this series :()
QEMU setup with a CXL PCI Expander bridge a single RP and a 4 port
port switch with CXL Type 3 devices below ports 0 and 1. (2 and 3 are empty but
each has a single BAR). RP and DSP on the switch all hotplug capable bridges.
Note that if I touch almost anything about the configuration it all works, but
this particular combination doesn't. I can add a 3rd empty port or remove 1
or add or remove an EP below the switch and it all succeeds.
arm64 setup and I'd rather not set the DSM to not reenumerate (because
there should be no problem doing so.
Note that arm64 support in general isn't upstream in qemu yet (at least partly
because we haven't figure out how to do PXB bus enumeration on DT) but can
be found at gitlab.com/jic23/qemu (not including the vsec additions to expand
the available CRS entries for the root bridge)
I've added EDK2 support and the vsec structures to expand the windows massively
so there 'should' be no issue with tight fits. However, the large CRS
entries for the root bridge don't seem to get picked up.
I did some digging and the 0c bus has those windows, but not the 0c:00.0
(the root port). I can't work out how extra space is supposed to get distributed
to root ports.
Problem chunk with the kernel enumeration is that the first CXL type3 device
has 3 bars, but the range has been shrunken to the point where only one of them
fits.
pci 0000:0f:00.0: BAR 2 [mem 0x20000000-0x2003ffff 64bit]: assigned
pci 0000:0f:00.0: BAR 0 [mem size 0x00010000 64bit]: can't assign; no space
pci 0000:0f:00.0: BAR 0 [mem size 0x00010000 64bit]: failed to assign
pci 0000:0f:00.0: BAR 4 [mem size 0x00001000]: can't assign; no space
pci 0000:0f:00.0: BAR 4 [mem size 0x00001000]: failed to assign
pci 0000:0e:00.0: PCI bridge to [bus 0f]
pci 0000:0e:00.0: bridge window [mem 0x20000000-0x2003ffff]
pci 0000:0e:00.0: bridge window [mem 0x20c00000-0x20cfffff 64bit pref]
pci 0000:10:00.0: BAR 2 [mem 0x20100000-0x2013ffff 64bit]: assigned
pci 0000:10:00.0: BAR 0 [mem 0x20140000-0x2014ffff 64bit]: assigned
pci 0000:10:00.0: BAR 4 [mem 0x20150000-0x20150fff]: assigned
pci 0000:0e:01.0: PCI bridge to [bus 10]
pci 0000:0e:01.0: bridge window [mem 0x20100000-0x2017ffff]
pci 0000:0e:01.0: bridge window [mem 0x20d00000-0x20dfffff 64bit pref]
AS you can see the window for that first device is simply too small.
EDK2 ends up with a /proc/iomap of
(kernel hacked as if the DSM was there to prevent reenumeration.
0b000080-0b0000ff : PRP0001:00
10000000-1fffffff : PCI Bus 0000:00
10000000-101fffff : PCI Bus 0000:0p1
10200000-1023ffff : 0000:00:03.0
10240000-10240fff : 0000:00:03.0
10241000-10241fff : 0000:00:02.0
10242000-10242fff : 0000:00:01.0
20000000-381fffff : PCI Bus 0000:0c
20000000-2fffffff : PCI Bus 0000:0d
20000000-2fffffff : PCI Bus 0000:0e
20000000-23ffffff : PCI Bus 0000:12
24000000-27ffffff : PCI Bus 0000:11
28000000-2bffffff : PCI Bus 0000:10
2c000000-2fffffff : PCI Bus 0000:0f
30000000-381fffff : PCI Bus 0000:0d
30000000-380fffff : PCI Bus 0000:0e
30000000-31ffffff : PCI Bus 0000:12
32000000-33ffffff : PCI Bus 0000:11
34000000-35ffffff : PCI Bus 0000:10
34000000-3403ffff : 0000:10:00.0
34000080-34000087 : 0000:10:00.0
34000088-340008a7 : 0000:10:00.0
340008a8-340008af : 0000:10:00.0
34010000-34010dff : 0000:10:00.0
34020000-34020dff : 0000:10:00.0
34040000-3404ffff : 0000:10:00.0
34041080-340410d7 : 0000:10:00.0
34041128-340411b7 : endpoint4
34050000-34050fff : 0000:10:00.0
36000000-37ffffff : PCI Bus 0000:0f
36000000-3603ffff : 0000:0f:00.0
36000080-36000087 : 0000:0f:00.0
36000088-360008a7 : 0000:0f:00.0
360008a8-360008af : 0000:0f:00.0
36010000-36010dff : 0000:0f:00.0
36020000-36020dff : 0000:0f:00.0
36040000-3604ffff : 0000:0f:00.0
36041080-360410d7 : 0000:0f:00.0
36041128-360411b7 : endpoint3
36050000-36050fff : 0000:0f:00.0
38000000-3801ffff : 0000:0e:00.0
38001080-380010d7 : mem0
38020000-3803ffff : 0000:0e:01.0
38021080-380210d7 : mem1
38040000-3805ffff : 0000:0e:02.0
38060000-3807ffff : 0000:0e:03.0
38100000-3811ffff : 0000:0d:00.0
38101128-381011b7 : port2
38200000-3efeffff : PCI Bus 0000:00
40000000-b9d7ffff : System RAM
With the kernel rerunning.
10000000-1fffffff : PCI Bus 0000:00
10000000-101fffff : PCI Bus 0000:01
10200000-1023ffff : 0000:00:03.0
10240000-10240fff : 0000:00:01.0
10241000-10241fff : 0000:00:02.0
10242000-10242fff : 0000:00:03.0
20000000-381fffff : PCI Bus 0000:0c
20000000-20bfffff : PCI Bus 0000:0d
20000000-20afffff : PCI Bus 0000:0e
20000000-2003ffff : PCI Bus 0000:0f
20000000-2003ffff : 0000:0f:00.0
20000080-20000087 : 0000:0f:00.0
20000088-200008a7 : 0000:0f:00.0
200008a8-200008af : 0000:0f:00.0
20010000-20010dff : 0000:0f:00.0
20020000-20020dff : 0000:0f:00.0
20040000-2005ffff : 0000:0e:00.0
20060000-2007ffff : 0000:0e:01.0
20061080-200610d7 : mem1
20080000-2009ffff : 0000:0e:02.0
200a0000-200bffff : 0000:0e:03.0
20100000-2017ffff : PCI Bus 0000:10
20100000-2013ffff : 0000:10:00.0
20100080-20100087 : 0000:10:00.0
20100088-201008a7 : 0000:10:00.0
201008a8-201008af : 0000:10:00.0
20110000-20110dff : 0000:10:00.0
20120000-20120dff : 0000:10:00.0
20140000-2014ffff : 0000:10:00.0
20141080-201410d7 : 0000:10:00.0
20141128-201411b7 : endpoint3
20150000-20150fff : 0000:10:00.0
20200000-203fffff : PCI Bus 0000:11
20400000-205fffff : PCI Bus 0000:12
20b00000-20b1ffff : 0000:0d:00.0
20b01128-20b011b7 : port2
20c00000-217fffff : PCI Bus 0000:0d
20c00000-217fffff : PCI Bus 0000:0e
20c00000-20cfffff : PCI Bus 0000:0f
20d00000-20dfffff : PCI Bus 0000:10
20e00000-20efffff : PCI Bus 0000:11
20f00000-20ffffff : PCI Bus 0000:12
All suggestions welcome. I've tried to figure out what is going on but
beyond thinking that the
20000000-381fffff : PCI Bus 0000:0c
entry isn't being distributed, I'm drawing a blank.
Simpler case (no extra padding from QEMU / EDK2)
EDK2 gives:
10000000-103fffff : PCI Bus 0000:00
10000000-101fffff : PCI Bus 0000:01
10200000-1023ffff : 0000:00:03.0
10240000-10240fff : 0000:00:03.0
10241000-10241fff : 0000:00:02.0
10242000-10242fff : 0000:00:01.0
10400000-10dfffff : PCI Bus 0000:0c
10400000-10dfffff : PCI Bus 0000:0d
10400000-10cfffff : PCI Bus 0000:0e
10400000-105fffff : PCI Bus 0000:12
10600000-107fffff : PCI Bus 0000:11
10800000-109fffff : PCI Bus 0000:10
10800000-1083ffff : 0000:10:00.0
10800080-10800087 : 0000:10:00.0
10800088-108008a7 : 0000:10:00.0
108008a8-108008af : 0000:10:00.0
10810000-10810dff : 0000:10:00.0
10820000-10820dff : 0000:10:00.0
10840000-1084ffff : 0000:10:00.0
10841080-108410d7 : 0000:10:00.0
10841128-108411b7 : endpoint4
10850000-10850fff : 0000:10:00.0
10a00000-10bfffff : PCI Bus 0000:0f
10a00000-10a3ffff : 0000:0f:00.0
10a00080-10a00087 : 0000:0f:00.0
10a00088-10a008a7 : 0000:0f:00.0
10a008a8-10a008af : 0000:0f:00.0
10a10000-10a10dff : 0000:0f:00.0
10a20000-10a20dff : 0000:0f:00.0
10a40000-10a4ffff : 0000:0f:00.0
10a41080-10a410d7 : 0000:0f:00.0
10a41128-10a411b7 : endpoint3
10a50000-10a50fff : 0000:0f:00.0
10c00000-10c1ffff : 0000:0e:00.0
10c01080-10c010d7 : mem1
10c20000-10c3ffff : 0000:0e:01.0
10c21080-10c210d7 : mem0
10c40000-10c5ffff : 0000:0e:02.0
10c60000-10c7ffff : 0000:0e:03.0
10d00000-10d1ffff : 0000:0d:00.0
10d01128-10d011b7 : port2
10e00000-3efeffff : PCI Bus 0000:00
And the kernel enumeration (resulting in missing BARS on 0f:00.0)
10000000-103fffff : PCI Bus 0000:00
10000000-101fffff : PCI Bus 0000:01
10200000-1023ffff : 0000:00:03.0
10240000-10240fff : 0000:00:01.0
10241000-10241fff : 0000:00:02.0
10242000-10242fff : 0000:00:03.0
10400000-10dfffff : PCI Bus 0000:0c
10400000-109fffff : PCI Bus 0000:0d
10400000-108fffff : PCI Bus 0000:0e
10400000-1043ffff : PCI Bus 0000:0f
10400000-1043ffff : 0000:0f:00.0
10400080-10400087 : 0000:0f:00.0
10400088-104008a7 : 0000:0f:00.0
104008a8-104008af : 0000:0f:00.0
10410000-10410dff : 0000:0f:00.0
10420000-10420dff : 0000:0f:00.0
10440000-1045ffff : 0000:0e:00.0
10460000-1047ffff : 0000:0e:01.0
10461080-104610d7 : mem1
10480000-1049ffff : 0000:0e:02.0
104a0000-104bffff : 0000:0e:03.0
10500000-1057ffff : PCI Bus 0000:10
10500000-1053ffff : 0000:10:00.0
10500080-10500087 : 0000:10:00.0
10500088-105008a7 : 0000:10:00.0
105008a8-105008af : 0000:10:00.0
10510000-10510dff : 0000:10:00.0
10520000-10520dff : 0000:10:00.0
10540000-1054ffff : 0000:10:00.0
10541080-105410d7 : 0000:10:00.0
10541128-105411b7 : endpoint3
10550000-10550fff : 0000:10:00.0
10600000-107fffff : PCI Bus 0000:12
10900000-1091ffff : 0000:0d:00.0
10901128-109011b7 : port2
10a00000-10dfffff : PCI Bus 0000:0d
10a00000-10dfffff : PCI Bus 0000:0e
10a00000-10afffff : PCI Bus 0000:0f
10b00000-10bfffff : PCI Bus 0000:10
10c00000-10cfffff : PCI Bus 0000:11
10d00000-10dfffff : PCI Bus 0000:12
10e00000-3efeffff : PCI Bus 0000:00
Thanks,
Jonathan
> v2:
> - Add "typedef" to kerneldoc to get correct formatting
> - Use RESOURCE_SIZE_MAX instead of literal
> - Remove unnecessary checks for io{port/mem}_resource
> - Apply a few style tweaks from Andy
>
> Ilpo J?rvinen (7):
> PCI: Fix resource double counting on remove & rescan
> resource: Rename find_resource() to find_empty_resource_slot()
> resource: Document find_empty_resource_slot() and resource_constraint
> resource: Use typedef for alignf callback
> resource: Handle simple alignment inside __find_empty_resource_slot()
> resource: Export find_empty_resource_slot()
> PCI: Relax bridge window tail sizing rules
>
> drivers/pci/bus.c | 10 ++----
> drivers/pci/setup-bus.c | 80 +++++++++++++++++++++++++++++++++++++----
> include/linux/ioport.h | 44 ++++++++++++++++++++---
> include/linux/pci.h | 5 +--
> kernel/resource.c | 68 ++++++++++++++++-------------------
> 5 files changed, 148 insertions(+), 59 deletions(-)
>
On Thu, Apr 11, 2024 at 01:41:10PM +0300, Ilpo J?rvinen wrote:
> On Tue, 9 Apr 2024, Jonathan Cameron wrote:
> > On Thu, 28 Dec 2023 18:57:00 +0200
> > Ilpo J?rvinen <[email protected]> wrote:
..
> E.g., there are
> zero pci_dbg()s in pci_bus_distribute_available_resources(). So unless the
> window is adjusted, we have zero information on what's going on so no
> surprise why everyone is "drawing a blank". :-(
Perhaps it's a good time to start trace events / points for PCI (if not yet)?
Just my 2c.
--
With Best Regards,
Andy Shevchenko
On Tue, 9 Apr 2024, Jonathan Cameron wrote:
> On Thu, 28 Dec 2023 18:57:00 +0200
> Ilpo J?rvinen <[email protected]> wrote:
>
> > Here's a series that contains two fixes to PCI bridge window sizing
> > algorithm. Together, they should enable remove & rescan cycle to work
> > for a PCI bus that has PCI devices with optional resources and/or
> > disparity in BAR sizes.
> >
> > For the second fix, I chose to expose find_empty_resource_slot() from
> > kernel/resource.c because it should increase accuracy of the cannot-fit
> > decision (currently that function is called find_resource()). In order
> > to do that sensibly, a few improvements seemed in order to make its
> > interface and name of the function sane before exposing it. Thus, the
> > few extra patches on resource side.
> >
> > Unfortunately I don't have a reason to suspect these would help with
> > the issues related to the currently ongoing resource regression
> > thread [1].
> >
> > [1] https://lore.kernel.org/linux-pci/[email protected]/
> >
>
> Hi Ilpo
>
> I've hit what looks to be a similar issue to this (not fixed by this series :()
>
> QEMU setup with a CXL PCI Expander bridge a single RP and a 4 port
> port switch with CXL Type 3 devices below ports 0 and 1. (2 and 3 are empty but
> each has a single BAR). RP and DSP on the switch all hotplug capable bridges.
>
> Note that if I touch almost anything about the configuration it all works, but
> this particular combination doesn't. I can add a 3rd empty port or remove 1
> or add or remove an EP below the switch and it all succeeds.
>
> arm64 setup and I'd rather not set the DSM to not reenumerate (because
> there should be no problem doing so.
> Note that arm64 support in general isn't upstream in qemu yet (at least partly
> because we haven't figure out how to do PXB bus enumeration on DT) but can
> be found at gitlab.com/jic23/qemu (not including the vsec additions to expand
> the available CRS entries for the root bridge)
>
> I've added EDK2 support and the vsec structures to expand the windows massively
> so there 'should' be no issue with tight fits. However, the large CRS
> entries for the root bridge don't seem to get picked up.
> I did some digging and the 0c bus has those windows, but not the 0c:00.0
> (the root port). I can't work out how extra space is supposed to get distributed
> to root ports.
>
> Problem chunk with the kernel enumeration is that the first CXL type3 device
> has 3 bars, but the range has been shrunken to the point where only one of them
> fits.
Hi Jonathan,
I'm not entirely sure about the scenario here, because you didn't exactly
explain which steps you used before you the no space for BAR problem
triggers.
The main goal of this patchset is to help in cases where something is
first working as expected, then removed and rescanned, it is not expected
to help on initial enumeration/cases where there is no "first working"
condition (it might help even there, under some scenarios but that's just
"bonus", not the original goal I had in mind).
That being said, there is still caveats even when rescanning. In
multiple children/multilevel cases, the allocation still greedily follows
the current policy and only tries to use the minimal strategy when the
resource can no longer fit. Because the greedy strategy is the default,
the first allocations can consume the space that would be needed later and
all allocations should have been done with minimal strategy but due to
online nature of the resource allocation algorithm it's not able to
correct its mistake at that point.
(The above by no means implies I wouldn't have interest in looking into
and addressing other problems besides what this patchset tries to solve.)
> pci 0000:0f:00.0: BAR 2 [mem 0x20000000-0x2003ffff 64bit]: assigned
> pci 0000:0f:00.0: BAR 0 [mem size 0x00010000 64bit]: can't assign; no space
> pci 0000:0f:00.0: BAR 0 [mem size 0x00010000 64bit]: failed to assign
> pci 0000:0f:00.0: BAR 4 [mem size 0x00001000]: can't assign; no space
> pci 0000:0f:00.0: BAR 4 [mem size 0x00001000]: failed to assign
> pci 0000:0e:00.0: PCI bridge to [bus 0f]
> pci 0000:0e:00.0: bridge window [mem 0x20000000-0x2003ffff]
> pci 0000:0e:00.0: bridge window [mem 0x20c00000-0x20cfffff 64bit pref]
> pci 0000:10:00.0: BAR 2 [mem 0x20100000-0x2013ffff 64bit]: assigned
> pci 0000:10:00.0: BAR 0 [mem 0x20140000-0x2014ffff 64bit]: assigned
> pci 0000:10:00.0: BAR 4 [mem 0x20150000-0x20150fff]: assigned
> pci 0000:0e:01.0: PCI bridge to [bus 10]
> pci 0000:0e:01.0: bridge window [mem 0x20100000-0x2017ffff]
> pci 0000:0e:01.0: bridge window [mem 0x20d00000-0x20dfffff 64bit pref]
>
> AS you can see the window for that first device is simply too small.
The challenge in tracking the allocations is that the sizes are calculated
long before the allocations occurs so it's pretty hard to track things
back into the root cause. And there's also the intermediate step too which
tries to fit the optional (add) sizes.
> EDK2 ends up with a /proc/iomap of
> (kernel hacked as if the DSM was there to prevent reenumeration.
>
> 0b000080-0b0000ff : PRP0001:00
> 10000000-1fffffff : PCI Bus 0000:00
> 10000000-101fffff : PCI Bus 0000:0p1
> 10200000-1023ffff : 0000:00:03.0
> 10240000-10240fff : 0000:00:03.0
> 10241000-10241fff : 0000:00:02.0
> 10242000-10242fff : 0000:00:01.0
> 20000000-381fffff : PCI Bus 0000:0c
> 20000000-2fffffff : PCI Bus 0000:0d
> 20000000-2fffffff : PCI Bus 0000:0e
> 20000000-23ffffff : PCI Bus 0000:12
> 24000000-27ffffff : PCI Bus 0000:11
> 28000000-2bffffff : PCI Bus 0000:10
> 2c000000-2fffffff : PCI Bus 0000:0f
> 30000000-381fffff : PCI Bus 0000:0d
> 30000000-380fffff : PCI Bus 0000:0e
> 30000000-31ffffff : PCI Bus 0000:12
> 32000000-33ffffff : PCI Bus 0000:11
> 34000000-35ffffff : PCI Bus 0000:10
> 34000000-3403ffff : 0000:10:00.0
> 34000080-34000087 : 0000:10:00.0
> 34000088-340008a7 : 0000:10:00.0
> 340008a8-340008af : 0000:10:00.0
> 34010000-34010dff : 0000:10:00.0
> 34020000-34020dff : 0000:10:00.0
> 34040000-3404ffff : 0000:10:00.0
> 34041080-340410d7 : 0000:10:00.0
> 34041128-340411b7 : endpoint4
> 34050000-34050fff : 0000:10:00.0
> 36000000-37ffffff : PCI Bus 0000:0f
> 36000000-3603ffff : 0000:0f:00.0
> 36000080-36000087 : 0000:0f:00.0
> 36000088-360008a7 : 0000:0f:00.0
> 360008a8-360008af : 0000:0f:00.0
> 36010000-36010dff : 0000:0f:00.0
> 36020000-36020dff : 0000:0f:00.0
> 36040000-3604ffff : 0000:0f:00.0
> 36041080-360410d7 : 0000:0f:00.0
> 36041128-360411b7 : endpoint3
> 36050000-36050fff : 0000:0f:00.0
> 38000000-3801ffff : 0000:0e:00.0
> 38001080-380010d7 : mem0
> 38020000-3803ffff : 0000:0e:01.0
> 38021080-380210d7 : mem1
> 38040000-3805ffff : 0000:0e:02.0
> 38060000-3807ffff : 0000:0e:03.0
> 38100000-3811ffff : 0000:0d:00.0
> 38101128-381011b7 : port2
> 38200000-3efeffff : PCI Bus 0000:00
> 40000000-b9d7ffff : System RAM
>
> With the kernel rerunning.
>
> 10000000-1fffffff : PCI Bus 0000:00
> 10000000-101fffff : PCI Bus 0000:01
> 10200000-1023ffff : 0000:00:03.0
> 10240000-10240fff : 0000:00:01.0
> 10241000-10241fff : 0000:00:02.0
> 10242000-10242fff : 0000:00:03.0
> 20000000-381fffff : PCI Bus 0000:0c
> 20000000-20bfffff : PCI Bus 0000:0d
> 20000000-20afffff : PCI Bus 0000:0e
> 20000000-2003ffff : PCI Bus 0000:0f
> 20000000-2003ffff : 0000:0f:00.0
> 20000080-20000087 : 0000:0f:00.0
> 20000088-200008a7 : 0000:0f:00.0
> 200008a8-200008af : 0000:0f:00.0
> 20010000-20010dff : 0000:0f:00.0
> 20020000-20020dff : 0000:0f:00.0
> 20040000-2005ffff : 0000:0e:00.0
> 20060000-2007ffff : 0000:0e:01.0
> 20061080-200610d7 : mem1
> 20080000-2009ffff : 0000:0e:02.0
> 200a0000-200bffff : 0000:0e:03.0
> 20100000-2017ffff : PCI Bus 0000:10
> 20100000-2013ffff : 0000:10:00.0
> 20100080-20100087 : 0000:10:00.0
> 20100088-201008a7 : 0000:10:00.0
> 201008a8-201008af : 0000:10:00.0
> 20110000-20110dff : 0000:10:00.0
> 20120000-20120dff : 0000:10:00.0
> 20140000-2014ffff : 0000:10:00.0
> 20141080-201410d7 : 0000:10:00.0
> 20141128-201411b7 : endpoint3
> 20150000-20150fff : 0000:10:00.0
> 20200000-203fffff : PCI Bus 0000:11
> 20400000-205fffff : PCI Bus 0000:12
> 20b00000-20b1ffff : 0000:0d:00.0
> 20b01128-20b011b7 : port2
> 20c00000-217fffff : PCI Bus 0000:0d
> 20c00000-217fffff : PCI Bus 0000:0e
> 20c00000-20cfffff : PCI Bus 0000:0f
> 20d00000-20dfffff : PCI Bus 0000:10
> 20e00000-20efffff : PCI Bus 0000:11
> 20f00000-20ffffff : PCI Bus 0000:12
>
> All suggestions welcome. I've tried to figure out what is going on but
> beyond thinking that the
> 20000000-381fffff : PCI Bus 0000:0c
> entry isn't being distributed, I'm drawing a blank.
I think it would be generally useful in tracking these problems to have
something actually logged about these resource decisions. E.g., there are
zero pci_dbg()s in pci_bus_distribute_available_resources(). So unless the
window is adjusted, we have zero information on what's going on so no
surprise why everyone is "drawing a blank". :-(
What often comes into play in some scenarios is that once a bridge
window is assigned (res->parent is non-zero), it won't be changed. This
leads to some code not working as per what likely was the original intent.
Also adjust_bridge_window() seems to check for res->parent as the first
thing, maybe check if that prevents distributing the window?
Below is a patch which adds a bit of logging into
pci_bus_distribute_available_resources().
--
i.
From: Ilpo J?rvinen <[email protected]>
Subject: [PATCH] PCI: Log bridge window distribute value
pci_bus_distribute_available_resources() is currently very silent, add
debugging print about the distribute decision.
Signed-off-by: Ilpo J?rvinen <[email protected]>
---
drivers/pci/setup-bus.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 909e6a7c3cc3..7a455d75e657 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1917,6 +1917,11 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
normal_bridges);
}
+ pci_dbg(bridge, "distribute to bridges: %u (hp) + %u, io: %llx mem: %llx mem pref: %llx\n",
+ hotplug_bridges, normal_bridges,
+ (unsigned long long)io_per_b, (unsigned long long)mmio_per_b,
+ (unsigned long long)mmio_pref_per_b);
+
for_each_pci_bridge(dev, bus) {
struct resource *res;
struct pci_bus *b;
--
tg: (4cece7649650..) log/distribute (depends on: main)
On Thu, Dec 28, 2023 at 06:57:07PM +0200, Ilpo Järvinen wrote:
> During remove & rescan cycle, PCI subsystem will recalculate and adjust
> the bridge window sizing that was initially done by "BIOS". The size
> calculation is based on the required alignment of the largest resource
> among the downstream resources as per pbus_size_mem() (unimportant or
> zero parameters marked with "..."):
>
> min_align = calculate_mem_align(aligns, max_order);
> size0 = calculate_memsize(size, ..., min_align);
>
> and then in calculate_memsize():
> size = ALIGN(max(size, ...) + ..., align);
>
> If the original bridge window sizing tried to conserve space, this will
> lead to massive increase of the required bridge window size when the
> downstream has a large disparity in BAR sizes. E.g., with 16MiB and
> 16GiB BARs this results in 32GiB bridge window size even if 16MiB BAR
> does not require gigabytes of space to fit.
Trying to understand exactly what "relaxed window tail sizing rules"
means.
Previous calculation "based on the required alignment of the largest
resource among downstream resources." If a 16GiB BAR and a 16MiB BAR
leads to a 32GiB window, obviously we need a 16GiB-aligned area. Are
you saying we also require the *size* to be 16GiB aligned? So we add
16GiB + 16MiB to get the required size, then round it up to a 16GiB
multiple?
And "relaxed window tail" means what? I guess we don't require as
much alignment of the size?
Obviously we need at least 1MiB alignment because that's how bridge
windows work. If downstream resources could be packed, e.g., sort
them into descending size, align window to size of largest downstream
resource, assign smaller ones adjacent to preceding larger ones, it
should be a minimal size.
But I assume we have to do this individually for each bridge level,
including any BARs at that level along with bridge windows to the next
level downstream.
Does "relaxed window tail sizing" mean we don't allocate any more
space than is strictly required for the BARs currently present?
> When doing remove & rescan for a bus that contains such a PCI device, a
> larger bridge window is suddenly required on rescan but when there is a
> bridge window upstream that is already assigned based on the original
> size, it cannot be enlarged to the new requirement. This causes the
> allocation of the bridge window to fail (0x600000000 > 0x400ffffff):
>
> pci 0000:02:01.0: PCI bridge to [bus 03]
> pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
> pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
> pci 0000:01:00.0: PCI bridge to [bus 02-04]
> pci 0000:01:00.0: bridge window [mem 0x40400000-0x406fffff]
> pci 0000:01:00.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
> ...
> pci_bus 0000:03: busn_res: [bus 03] is released
> pci 0000:03:00.0: reg 0x10: [mem 0x6400000000-0x6400ffffff 64bit pref]
> pci 0000:03:00.0: reg 0x18: [mem 0x6000000000-0x63ffffffff 64bit pref]
> pci 0000:03:00.0: reg 0x30: [mem 0x40400000-0x405fffff pref]
> pci 0000:02:01.0: PCI bridge to [bus 03]
> pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
> pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
> pci 0000:02:01.0: BAR 9: no space for [mem size 0x600000000 64bit pref]
> pci 0000:02:01.0: BAR 9: failed to assign [mem size 0x600000000 64bit pref]
> pci 0000:02:01.0: BAR 8: assigned [mem 0x40400000-0x405fffff]
> pci 0000:03:00.0: BAR 2: no space for [mem size 0x400000000 64bit pref]
> pci 0000:03:00.0: BAR 2: failed to assign [mem size 0x400000000 64bit pref]
> pci 0000:03:00.0: BAR 0: no space for [mem size 0x01000000 64bit pref]
> pci 0000:03:00.0: BAR 0: failed to assign [mem size 0x01000000 64bit pref]
> pci 0000:03:00.0: BAR 6: assigned [mem 0x40400000-0x405fffff pref]
> pci 0000:02:01.0: PCI bridge to [bus 03]
> pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
>
> This is a major surprise for users who are suddenly left with a PCIe
> device that was working fine with the original bridge window sizing.
>
> Even if the already assigned bridge window could be enlarged by
> reallocation in some cases (something the current code does not attempt
> to do), it is not possible in general case and the large amount of
> wasted space at the tail of the bridge window may lead to other
> resource exhaustion problems on Root Complex level (think of multiple
> PCIe cards with VFs and BAR size disparity in a single system).
>
> PCI specifications only expect natural alignment for BARs (PCI Express
> Base Specification, rev. 6.1 sect. 7.5.1.2.1) and minimum of 1MiB
> alignment for the bridge window (PCI Express Base Specification,
> rev 6.1 sect. 7.5.1.3). The current bridge window tail alignment rule
> was introduced in the commit 5d0a8965aea9 ("[PATCH] 2.5.14: New PCI
> allocation code (alpha, arm, parisc) [2/2]") that only states:
> "pbus_size_mem: core stuff; tested with randomly generated sets of
> resources". It does not explain the motivation for the extra tail space
> allocated that is not truly needed by the downstream resources. As
> such, it is far from clear if it ever has been required by any HW.
>
> To prevent PCIe cards with BAR size disparity from becoming unusable
> after remove & rescan cycle, attempt to do a truly minimal allocation
> for memory resources if needed. First check if the normally calculated
> bridge window will not fit into an already assigned upstream resource.
> In such case, try with relaxed bridge window tail sizing rules instead
> where no extra tail space is requested beyond what the downstream
> resources require. Only enforce the alignment requirement of the bridge
> window itself (normally 1MiB).
>
> With this patch, the resources are successfully allocated:
>
> pci 0000:02:01.0: PCI bridge to [bus 03]
> pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
> pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
> pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref] to [bus 03] requires relaxed alignment rules
I guess the relaxed rules are required because this window doesn't fit
inside some upstream window? Maybe include the upstream window in the
example here. In the code, do we know the upstream window at this
point? If so, maybe the message could include it?
But I'm missing something because it looks like we assigned the same
window (0x6000000000-0x6400ffffff) below. What exactly is different?
It looks like the 02:01.0 windows below are the same as the ones where
it failed above.
> pci 0000:02:01.0: BAR 9: assigned [mem 0x6000000000-0x6400ffffff 64bit pref]
> pci 0000:02:01.0: BAR 8: assigned [mem 0x40400000-0x405fffff]
> pci 0000:03:00.0: BAR 2: assigned [mem 0x6000000000-0x63ffffffff 64bit pref]
> pci 0000:03:00.0: BAR 0: assigned [mem 0x6400000000-0x6400ffffff 64bit pref]
> pci 0000:03:00.0: BAR 6: assigned [mem 0x40400000-0x405fffff pref]
> pci 0000:02:01.0: PCI bridge to [bus 03]
> pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
> pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
Nit, I think if these examples were collected from a newer kernel,
"reg 0x10" would be replaced with "BAR 0" so the messages would be
slightly easier to read.
> This patch draws inspiration from the initial investigations and work
> by Mika Westerberg.
>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216795
> Link: https://lore.kernel.org/linux-pci/[email protected]/
> Fixes: 5d0a8965aea9 ("[PATCH] 2.5.14: New PCI allocation code (alpha, arm, parisc) [2/2]")
> Signed-off-by: Ilpo Järvinen <[email protected]>
> Cc: Mika Westerberg <[email protected]>
> ---
> drivers/pci/setup-bus.c | 74 +++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 72 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index e3e6ff8854a7..7a32283262c2 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -21,6 +21,7 @@
> #include <linux/errno.h>
> #include <linux/ioport.h>
> #include <linux/cache.h>
> +#include <linux/limits.h>
> #include <linux/slab.h>
> #include <linux/acpi.h>
> #include "pci.h"
> @@ -960,6 +961,62 @@ static inline resource_size_t calculate_mem_align(resource_size_t *aligns,
> return min_align;
> }
>
> +/**
> + * pbus_upstream_assigned_limit - Check no upstream resource limits allocation
> + * @bus: The bus
> + * @mask: Mask the resource flag, then compare it with type
> + * @type: The type of resource from bridge
> + * @size: The size required from the bridge window
> + * @align: Required alignment for the resource
> + *
> + * Checks that @size can fit inside the upstream bridge resources that are
> + * already assigned.
> + *
> + * Return: -ENOSPC if @size cannot fit into an already assigned resource
> + * upstream resource.
I guess this returns 0 on success, right? But more comments below.
> + */
> +static int pbus_upstream_assigned_limit(struct pci_bus *bus, unsigned long mask,
> + unsigned long type, resource_size_t size,
> + resource_size_t align)
> +{
> + struct resource_constraint constraint = {
> + .max = RESOURCE_SIZE_MAX,
> + .align = align,
> + };
> + struct pci_bus *downstream = bus;
> + struct resource *r;
> +
> + while ((bus = bus->parent)) {
> + if (pci_is_root_bus(bus))
> + break;
> +
> + pci_bus_for_each_resource(bus, r) {
> + if (!r || !r->parent || (r->flags & mask) != type)
> + continue;
> +
> + if (resource_size(r) >= size) {
> + struct resource gap = {};
> +
> + if (!find_empty_resource_slot(r, &gap, size, &constraint))
I would test "find_empty_resource_slot(...) == 0" since it doesn't
return a boolean.
IIUC, when find_empty_resource_slot() returns 0, it fills in
gap.start/end/flags. Is there anything useful we can do with that
information, e.g., a dbg log message?
I think find_empty_resource_slot() doesn't actually *allocate* the
space, right? So there would be a race between this and the
allocation in pci_claim_resource(), except that there's nobody else
allocating in this hierarchy because ...?
> + return 0;
> + }
> +
> + if (bus->self) {
> + pci_dbg(bus->self,
> + "Assigned bridge window %pR to %pR cannot fit 0x%llx required for %s bridging to %pR\n",
> + r, &bus->busn_res,
> + (unsigned long long)size,
> + pci_name(downstream->self),
> + &downstream->busn_res);
> + }
> +
> + return -ENOSPC;
> + }
> + }
> +
> + return 0;
> +}
> +
> /**
> * pbus_size_mem() - Size the memory window of a given bus
> *
> @@ -986,7 +1043,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
> struct list_head *realloc_head)
> {
> struct pci_dev *dev;
> - resource_size_t min_align, align, size, size0, size1;
> + resource_size_t min_align, win_align, align, size, size0, size1;
> resource_size_t aligns[24]; /* Alignments from 1MB to 8TB */
> int order, max_order;
> struct resource *b_res = find_bus_resource_of_type(bus,
> @@ -1064,10 +1121,23 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
> }
> }
>
> + win_align = window_alignment(bus, b_res->flags);
> min_align = calculate_mem_align(aligns, max_order);
> - min_align = max(min_align, window_alignment(bus, b_res->flags));
> + min_align = max(min_align, win_align);
> size0 = calculate_memsize(size, min_size, 0, 0, resource_size(b_res), min_align);
> add_align = max(min_align, add_align);
> +
> + if (bus->self && size0 &&
> + pbus_upstream_assigned_limit(bus, mask | IORESOURCE_PREFETCH, type,
> + size0, add_align)) {
The IORESOURCE_PREFETCH here -- does that just mean that "type" must
match exactly, including whether IORESOURCE_PREFETCH is set or not?
I guess pbus_upstream_assigned_limit() being true (-ENOSPC) basically
means there's no available space for "size0, add_align" in the
upstream window?
Since it's used as a predicate, I wonder if it could return true/false
and be named something like "pbus_upstream_space_available()" and test
it with "!pbus_upstream_space_available()" here?
> + min_align = 1ULL << (max_order + 20);
We never had any comments about what "max_order" means. I vaguely
remember it's basically in 1MB units because of these:
resource_size_t aligns[24]; /* Alignments from 1MB to 8TB */
* aligns[0] is for 1MB (since bridge memory
* windows are always at least 1MB aligned), so
* keep "order" from being negative for smaller
* resources.
> + min_align = max(min_align, win_align);
> + size0 = calculate_memsize(size, min_size, 0, 0, resource_size(b_res), win_align);
> + add_align = win_align;
> + pci_info(bus->self, "bridge window %pR to %pR requires relaxed alignment rules\n",
> + b_res, &bus->busn_res);
Is there any hint we can give in the message about what "relaxed
alignment rules" means, or how we decided we needed them?
> + }
> +
> size1 = (!realloc_head || (realloc_head && !add_size && !children_add_size)) ? size0 :
> calculate_memsize(size, min_size, add_size, children_add_size,
> resource_size(b_res), add_align);
> --
> 2.30.2
>
On Thu, Dec 28, 2023 at 06:57:02PM +0200, Ilpo Järvinen wrote:
> Rename find_resource() to find_empty_resource_slot() to better describe
> what the functions does. This is a preparation for exposing it beyond
> resource.c which is needed by PCI core. Also rename the __ variant to
> match the names.
I wonder if "find_resource_space()" or "find_available_resource()"
would be better than "_slot"?
"Slot" *is* already used a few times in kernel/resource.c, but in most
cases I think it refers to a "resource", and find_resource() basically
returns a filled-in struct resource.
And of course "slot" suggests something entirely different in the PCI
context.
> Signed-off-by: Ilpo Järvinen <[email protected]>
> Reviewed-by: Andy Shevchenko <[email protected]>
> ---
> kernel/resource.c | 19 +++++++++----------
> 1 file changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 866ef3663a0b..94f67005e1e2 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -574,10 +574,9 @@ static void resource_clip(struct resource *res, resource_size_t min,
> * Find empty slot in the resource tree with the given range and
> * alignment constraints
> */
> -static int __find_resource(struct resource *root, struct resource *old,
> - struct resource *new,
> - resource_size_t size,
> - struct resource_constraint *constraint)
> +static int __find_empty_resource_slot(struct resource *root, struct resource *old,
> + struct resource *new, resource_size_t size,
> + struct resource_constraint *constraint)
> {
> struct resource *this = root->child;
> struct resource tmp = *new, avail, alloc;
> @@ -633,11 +632,11 @@ next: if (!this || this->end == root->end)
> /*
> * Find empty slot in the resource tree given range and alignment.
> */
> -static int find_resource(struct resource *root, struct resource *new,
> - resource_size_t size,
> - struct resource_constraint *constraint)
> +static int find_empty_resource_slot(struct resource *root, struct resource *new,
> + resource_size_t size,
> + struct resource_constraint *constraint)
> {
> - return __find_resource(root, NULL, new, size, constraint);
> + return __find_empty_resource_slot(root, NULL, new, size, constraint);
> }
>
> /**
> @@ -660,7 +659,7 @@ static int reallocate_resource(struct resource *root, struct resource *old,
>
> write_lock(&resource_lock);
>
> - if ((err = __find_resource(root, old, &new, newsize, constraint)))
> + if ((err = __find_empty_resource_slot(root, old, &new, newsize, constraint)))
> goto out;
>
> if (resource_contains(&new, old)) {
> @@ -729,7 +728,7 @@ int allocate_resource(struct resource *root, struct resource *new,
> }
>
> write_lock(&resource_lock);
> - err = find_resource(root, new, size, &constraint);
> + err = find_empty_resource_slot(root, new, size, &constraint);
> if (err >= 0 && __request_resource(root, new))
> err = -EBUSY;
> write_unlock(&resource_lock);
> --
> 2.30.2
>
On Thu, Dec 28, 2023 at 06:57:03PM +0200, Ilpo Järvinen wrote:
> Document find_empty_resource_slot() and the struct resource_constraint
> as they are going to be exposed outside of resource.c.
>
> Signed-off-by: Ilpo Järvinen <[email protected]>
> Reviewed-by: Andy Shevchenko <[email protected]>
> ---
> kernel/resource.c | 28 +++++++++++++++++++++++++---
> 1 file changed, 25 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 94f67005e1e2..ed4bb8ad701a 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -46,7 +46,18 @@ struct resource iomem_resource = {
> };
> EXPORT_SYMBOL(iomem_resource);
>
> -/* constraints to be met while allocating resources */
> +/**
> + * resource_constraint - constraints to be met while searching empty resource slots
Needs "struct resource_constraint", I think, to satisfy
scripts/kernel-doc.
> + * @min: The minimum address for the memory range
> + * @max: The maximum address for the memory range
> + * @align: Alignment for the start address of the empty slot
> + * @alignf: Additional alignment constraints callback
> + * @alignf_data: Data provided for @alignf callback
> + *
> + * Contains the range and alignment constraints that have to be met during
> + * find_empty_resource_slot(). @alignf can be NULL indicating no alignment
> + * beyond @align is necessary.
> + */
> struct resource_constraint {
> resource_size_t min, max, align;
> resource_size_t (*alignf)(void *, const struct resource *,
> @@ -629,8 +640,19 @@ next: if (!this || this->end == root->end)
> return -EBUSY;
> }
On Fri, 3 May 2024, Bjorn Helgaas wrote:
> On Thu, Dec 28, 2023 at 06:57:07PM +0200, Ilpo J?rvinen wrote:
> > During remove & rescan cycle, PCI subsystem will recalculate and adjust
> > the bridge window sizing that was initially done by "BIOS". The size
> > calculation is based on the required alignment of the largest resource
> > among the downstream resources as per pbus_size_mem() (unimportant or
> > zero parameters marked with "..."):
> >
> > min_align = calculate_mem_align(aligns, max_order);
> > size0 = calculate_memsize(size, ..., min_align);
> >
> > and then in calculate_memsize():
> > size = ALIGN(max(size, ...) + ..., align);
> >
> > If the original bridge window sizing tried to conserve space, this will
> > lead to massive increase of the required bridge window size when the
> > downstream has a large disparity in BAR sizes. E.g., with 16MiB and
> > 16GiB BARs this results in 32GiB bridge window size even if 16MiB BAR
> > does not require gigabytes of space to fit.
>
> Trying to understand exactly what "relaxed window tail sizing rules"
> means.
>
> Previous calculation "based on the required alignment of the largest
> resource among downstream resources." If a 16GiB BAR and a 16MiB BAR
> leads to a 32GiB window, obviously we need a 16GiB-aligned area. Are
> you saying we also require the *size* to be 16GiB aligned? So we add
> 16GiB + 16MiB to get the required size, then round it up to a 16GiB
> multiple?
Yes. That's what I meant above but now after rechecking because not all
numbers seemed match (I started to wonder why it's 0x600000000, not
0x800000000), I realize there's small twist in this I've missed in
this description:
calculate_mem_align() does min_align = align1 >> 1 so it's actually
aligning the size by half of the largest BAR so 8GiB, not by 16GiB. Thus,
without this series 16GiB + 8GiB = 24GiB is needed, not 32GiB as I
errornously write above. Nonetheless, it's still excessive and we hit the
same issue.
(I'll correct this part of the description in the next version of the
patch.)
> And "relaxed window tail" means what? I guess we don't require as
> much alignment of the size?
Yes. There are two ways to align, one is for the start address, whereas
the "tail" refers to the size (or space), i.e., whether to leave extra
space or not beyond the minimum 1MiB alignment.
> Obviously we need at least 1MiB alignment because that's how bridge
> windows work. If downstream resources could be packed, e.g., sort
> them into descending size, align window to size of largest downstream
> resource, assign smaller ones adjacent to preceding larger ones, it
> should be a minimal size.
>
> But I assume we have to do this individually for each bridge level,
> including any BARs at that level along with bridge windows to the next
> level downstream.
>
> Does "relaxed window tail sizing" mean we don't allocate any more
> space than is strictly required for the BARs currently present?
Yes. One important thing to consider in addition is that the reason we
have to do this is because of an upstream bridge window that is already
assigned, i.e., fixed in place and cannot be changed in general case by
releasing and moving it elsewhere. This condition happens, e.g., during
remove/rescan cycle of a device, the resources of the Root Port are
already assigned. If original resource allocation done by the BIOS did not
allocate for the extra space, the rescan ends up failing despite
everything working perfectly before the remove/rescan cycle.
I'm open to better wording/terminology, if you have a better suggestion
than "relaxed window tail sizing".
> > When doing remove & rescan for a bus that contains such a PCI device, a
> > larger bridge window is suddenly required on rescan but when there is a
> > bridge window upstream that is already assigned based on the original
> > size, it cannot be enlarged to the new requirement. This causes the
> > allocation of the bridge window to fail (0x600000000 > 0x400ffffff):
> >
> > pci 0000:02:01.0: PCI bridge to [bus 03]
> > pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
> > pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
> > pci 0000:01:00.0: PCI bridge to [bus 02-04]
> > pci 0000:01:00.0: bridge window [mem 0x40400000-0x406fffff]
> > pci 0000:01:00.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
> > ...
> > pci_bus 0000:03: busn_res: [bus 03] is released
> > pci 0000:03:00.0: reg 0x10: [mem 0x6400000000-0x6400ffffff 64bit pref]
> > pci 0000:03:00.0: reg 0x18: [mem 0x6000000000-0x63ffffffff 64bit pref]
> > pci 0000:03:00.0: reg 0x30: [mem 0x40400000-0x405fffff pref]
> > pci 0000:02:01.0: PCI bridge to [bus 03]
> > pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
> > pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
> > pci 0000:02:01.0: BAR 9: no space for [mem size 0x600000000 64bit pref]
> > pci 0000:02:01.0: BAR 9: failed to assign [mem size 0x600000000 64bit pref]
> > pci 0000:02:01.0: BAR 8: assigned [mem 0x40400000-0x405fffff]
> > pci 0000:03:00.0: BAR 2: no space for [mem size 0x400000000 64bit pref]
> > pci 0000:03:00.0: BAR 2: failed to assign [mem size 0x400000000 64bit pref]
> > pci 0000:03:00.0: BAR 0: no space for [mem size 0x01000000 64bit pref]
> > pci 0000:03:00.0: BAR 0: failed to assign [mem size 0x01000000 64bit pref]
> > pci 0000:03:00.0: BAR 6: assigned [mem 0x40400000-0x405fffff pref]
> > pci 0000:02:01.0: PCI bridge to [bus 03]
> > pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
> >
> > This is a major surprise for users who are suddenly left with a PCIe
> > device that was working fine with the original bridge window sizing.
> >
> > Even if the already assigned bridge window could be enlarged by
> > reallocation in some cases (something the current code does not attempt
> > to do), it is not possible in general case and the large amount of
> > wasted space at the tail of the bridge window may lead to other
> > resource exhaustion problems on Root Complex level (think of multiple
> > PCIe cards with VFs and BAR size disparity in a single system).
> >
> > PCI specifications only expect natural alignment for BARs (PCI Express
> > Base Specification, rev. 6.1 sect. 7.5.1.2.1) and minimum of 1MiB
> > alignment for the bridge window (PCI Express Base Specification,
> > rev 6.1 sect. 7.5.1.3). The current bridge window tail alignment rule
> > was introduced in the commit 5d0a8965aea9 ("[PATCH] 2.5.14: New PCI
> > allocation code (alpha, arm, parisc) [2/2]") that only states:
> > "pbus_size_mem: core stuff; tested with randomly generated sets of
> > resources". It does not explain the motivation for the extra tail space
> > allocated that is not truly needed by the downstream resources. As
> > such, it is far from clear if it ever has been required by any HW.
> >
> > To prevent PCIe cards with BAR size disparity from becoming unusable
> > after remove & rescan cycle, attempt to do a truly minimal allocation
> > for memory resources if needed. First check if the normally calculated
> > bridge window will not fit into an already assigned upstream resource.
> > In such case, try with relaxed bridge window tail sizing rules instead
> > where no extra tail space is requested beyond what the downstream
> > resources require. Only enforce the alignment requirement of the bridge
> > window itself (normally 1MiB).
> >
> > With this patch, the resources are successfully allocated:
> >
> > pci 0000:02:01.0: PCI bridge to [bus 03]
> > pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
> > pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
> > pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref] to [bus 03] requires relaxed alignment rules
>
> I guess the relaxed rules are required because this window doesn't fit
> inside some upstream window? Maybe include the upstream window in the
> example here. In the code, do we know the upstream window at this
> point? If so, maybe the message could include it?
Okay, I changed the severity of the pci_dbg() inside
pbus_upstream_assigned_limit()? It reveals this very information:
pcieport 0000:01:00.0: Assigned bridge window [mem 0x6000000000-0x6400ffffff 64bit pref] to [bus 02-04] cannot fit 0x600000000 required for 0000:02:01.0 bridging to [bus 03]
> But I'm missing something because it looks like we assigned the same
> window (0x6000000000-0x6400ffffff) below. What exactly is different?
> It looks like the 02:01.0 windows below are the same as the ones where
> it failed above.
Without this series, the tail aligned is taken from the largest resource
so as per 03:00.0 BAR 2's size so it would need 0x6000000000-0x65ffffffff
(min_align = align1 >> 1 inside calculate_mem_align()).
It would, of course, leave 0x6400ffffff-0x65ffffffff unused but since it
won't fit into this (from above):
> > pci 0000:01:00.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
it ends up failing without this series.
So effectively, this series just makes things work with the same sizes as
they did before the remove/rescan cycle, matching what you noticed!
> > pci 0000:02:01.0: BAR 9: assigned [mem 0x6000000000-0x6400ffffff 64bit pref]
> > pci 0000:02:01.0: BAR 8: assigned [mem 0x40400000-0x405fffff]
> > pci 0000:03:00.0: BAR 2: assigned [mem 0x6000000000-0x63ffffffff 64bit pref]
> > pci 0000:03:00.0: BAR 0: assigned [mem 0x6400000000-0x6400ffffff 64bit pref]
> > pci 0000:03:00.0: BAR 6: assigned [mem 0x40400000-0x405fffff pref]
> > pci 0000:02:01.0: PCI bridge to [bus 03]
> > pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
> > pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
>
> Nit, I think if these examples were collected from a newer kernel,
> "reg 0x10" would be replaced with "BAR 0" so the messages would be
> slightly easier to read.
Okay, I'll take the logs from a later kernel.
> > This patch draws inspiration from the initial investigations and work
> > by Mika Westerberg.
> >
> > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216795
> > Link: https://lore.kernel.org/linux-pci/[email protected]/
> > Fixes: 5d0a8965aea9 ("[PATCH] 2.5.14: New PCI allocation code (alpha, arm, parisc) [2/2]")
> > Signed-off-by: Ilpo J?rvinen <[email protected]>
> > Cc: Mika Westerberg <[email protected]>
> > ---
> > drivers/pci/setup-bus.c | 74 +++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 72 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> > index e3e6ff8854a7..7a32283262c2 100644
> > --- a/drivers/pci/setup-bus.c
> > +++ b/drivers/pci/setup-bus.c
> > @@ -21,6 +21,7 @@
> > #include <linux/errno.h>
> > #include <linux/ioport.h>
> > #include <linux/cache.h>
> > +#include <linux/limits.h>
> > #include <linux/slab.h>
> > #include <linux/acpi.h>
> > #include "pci.h"
> > @@ -960,6 +961,62 @@ static inline resource_size_t calculate_mem_align(resource_size_t *aligns,
> > return min_align;
> > }
> >
> > +/**
> > + * pbus_upstream_assigned_limit - Check no upstream resource limits allocation
> > + * @bus: The bus
> > + * @mask: Mask the resource flag, then compare it with type
> > + * @type: The type of resource from bridge
> > + * @size: The size required from the bridge window
> > + * @align: Required alignment for the resource
> > + *
> > + * Checks that @size can fit inside the upstream bridge resources that are
> > + * already assigned.
> > + *
> > + * Return: -ENOSPC if @size cannot fit into an already assigned resource
> > + * upstream resource.
>
> I guess this returns 0 on success, right? But more comments below.
Yes, 0 when size fits to upstream bridge windows.
> > + */
> > +static int pbus_upstream_assigned_limit(struct pci_bus *bus, unsigned long mask,
> > + unsigned long type, resource_size_t size,
> > + resource_size_t align)
> > +{
> > + struct resource_constraint constraint = {
> > + .max = RESOURCE_SIZE_MAX,
> > + .align = align,
> > + };
> > + struct pci_bus *downstream = bus;
> > + struct resource *r;
> > +
> > + while ((bus = bus->parent)) {
> > + if (pci_is_root_bus(bus))
> > + break;
> > +
> > + pci_bus_for_each_resource(bus, r) {
> > + if (!r || !r->parent || (r->flags & mask) != type)
> > + continue;
> > +
> > + if (resource_size(r) >= size) {
> > + struct resource gap = {};
> > +
> > + if (!find_empty_resource_slot(r, &gap, size, &constraint))
>
> I would test "find_empty_resource_slot(...) == 0" since it doesn't
> return a boolean.
Okay.
> IIUC, when find_empty_resource_slot() returns 0, it fills in
> gap.start/end/flags. Is there anything useful we can do with that
> information, e.g., a dbg log message?
Log something for success? I suppose it could be done but I don't expect
it to be that valuable since the expected behavior is the alloc will then
succeed too.
For the case where relaxed tail sizing is required, I made the pci_dbg()
into pci_info()
> I think find_empty_resource_slot() doesn't actually *allocate* the
> space, right?
Yes, the calculation and allocation for these resources is like this.
Things are first calculated and only allocated by the later stage.
> So there would be a race between this and the
> allocation in pci_claim_resource(), except that there's nobody else
> allocating in this hierarchy because ...?
Rescanning is protected by pci_rescan_remove_lock, no?
> > + return 0;
> > + }
> > +
> > + if (bus->self) {
> > + pci_dbg(bus->self,
> > + "Assigned bridge window %pR to %pR cannot fit 0x%llx required for %s bridging to %pR\n",
> > + r, &bus->busn_res,
> > + (unsigned long long)size,
> > + pci_name(downstream->self),
> > + &downstream->busn_res);
> > + }
> > +
> > + return -ENOSPC;
> > + }
> > + }
> > +
> > + return 0;
> > +}
> > +
> > /**
> > * pbus_size_mem() - Size the memory window of a given bus
> > *
> > @@ -986,7 +1043,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
> > struct list_head *realloc_head)
> > {
> > struct pci_dev *dev;
> > - resource_size_t min_align, align, size, size0, size1;
> > + resource_size_t min_align, win_align, align, size, size0, size1;
> > resource_size_t aligns[24]; /* Alignments from 1MB to 8TB */
> > int order, max_order;
> > struct resource *b_res = find_bus_resource_of_type(bus,
> > @@ -1064,10 +1121,23 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
> > }
> > }
> >
> > + win_align = window_alignment(bus, b_res->flags);
> > min_align = calculate_mem_align(aligns, max_order);
> > - min_align = max(min_align, window_alignment(bus, b_res->flags));
> > + min_align = max(min_align, win_align);
> > size0 = calculate_memsize(size, min_size, 0, 0, resource_size(b_res), min_align);
> > add_align = max(min_align, add_align);
> > +
> > + if (bus->self && size0 &&
> > + pbus_upstream_assigned_limit(bus, mask | IORESOURCE_PREFETCH, type,
> > + size0, add_align)) {
>
> The IORESOURCE_PREFETCH here -- does that just mean that "type" must
> match exactly, including whether IORESOURCE_PREFETCH is set or not?
This just matches what the other calculation in pbus_size_mem() does:
struct resource *b_res = find_bus_resource_of_type(bus,
mask | IORESOURCE_PREFETCH, type);
Do you think it's wrong?
> I guess pbus_upstream_assigned_limit() being true (-ENOSPC) basically
> means there's no available space for "size0, add_align" in the
> upstream window?
>
> Since it's used as a predicate, I wonder if it could return true/false
> and be named something like "pbus_upstream_space_available()" and test
> it with "!pbus_upstream_space_available()" here?
I'll make this change.
> > + min_align = 1ULL << (max_order + 20);
>
> We never had any comments about what "max_order" means. I vaguely
> remember it's basically in 1MB units because of these:
>
> resource_size_t aligns[24]; /* Alignments from 1MB to 8TB */
>
> * aligns[0] is for 1MB (since bridge memory
> * windows are always at least 1MB aligned), so
> * keep "order" from being negative for smaller
> * resources.
I was considering replacing those literal 20s with __ffs(SZ_1M) to make
it less magic but I don't now recall why I ended up not pursuing that.
I'm sure I even looked if there were changes on assembly level (which
unfortunately happened even if it seemed something that should not produce
binary changes at all) but now I cannot even seem to find where that
patch has gone and even less to remember why I put it into backburner...
I'll look into it again.
> > + min_align = max(min_align, win_align);
> > + size0 = calculate_memsize(size, min_size, 0, 0, resource_size(b_res), win_align);
> > + add_align = win_align;
> > + pci_info(bus->self, "bridge window %pR to %pR requires relaxed alignment rules\n",
> > + b_res, &bus->busn_res);
>
> Is there any hint we can give in the message about what "relaxed
> alignment rules" means, or how we decided we needed them?
The extra information would be given by the pci_dbg() whose severity now
upped from dbg to info level.
--
i.
On Fri, 3 May 2024, Bjorn Helgaas wrote:
> On Thu, Dec 28, 2023 at 06:57:02PM +0200, Ilpo Järvinen wrote:
> > Rename find_resource() to find_empty_resource_slot() to better describe
> > what the functions does. This is a preparation for exposing it beyond
> > resource.c which is needed by PCI core. Also rename the __ variant to
> > match the names.
>
> I wonder if "find_resource_space()" or "find_available_resource()"
> would be better than "_slot"?
>
> "Slot" *is* already used a few times in kernel/resource.c, but in most
> cases I think it refers to a "resource", and find_resource() basically
> returns a filled-in struct resource.
>
> And of course "slot" suggests something entirely different in the PCI
> context.
I picked up it from the existing usage but I've no strong opinion on
this so I'll just rework the series to not add more "slot" wording into
there.
--
i.