2023-03-10 09:27:47

by Andrzej Hajda

Subject: [PATCH v6 0/2] drm/i915: add guard page to ggtt->error_capture

This patchset tries to reduce the plague of DMAR read errors seen
in CI on ADL*, RPL* and DG2 platforms; see for example [1] (grep DMAR).
CI usually tolerates these errors, so the scale of the problem
is not really visible.
To show it, I have counted lines containing DMAR read errors in the dmesgs
produced by CI for all three versions of the patch, but in contrast to v2
I have grepped only for lines containing "PTE Read access".
Below are the stats for the kernel without the patchset vs. the patched one.
v1: 210 vs 0
v2: 201 vs 0
v3: 214 vs 0
Apparently the patchset fixes all common PTE read errors.

Changelog:
v2:
- modified commit message (I hope the diagnosis is correct),
- added bug checks to ensure scratch is initialized on gen3 platforms.
CI produces strange stacktrace for it suggesting scratch[0] is NULL,
to be removed after resolving the issue with gen3 platforms.
v3:
- removed bug checks, replaced with gen check.
v4:
- change code for scratch page insertion to support all platforms,
- add info in commit message there could be more similar issues
v5:
- changed to patchset adding nop_clear_range related code,
- re-insert scratch PTEs on resume
v6:
- use scratch_range

To: Jani Nikula <[email protected]>
To: Joonas Lahtinen <[email protected]>
To: Rodrigo Vivi <[email protected]>
To: Tvrtko Ursulin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Andi Shyti <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: Nirmoy Das <[email protected]>

Signed-off-by: Andrzej Hajda <[email protected]>

---
- Link to v5: https://lore.kernel.org/r/[email protected]

---
Andrzej Hajda (2):
drm/i915/gt: introduce vm->scratch_range callback
drm/i915: add guard page to ggtt->error_capture

drivers/gpu/drm/i915/gt/intel_ggtt.c | 43 ++++++++++++++++++++++++++++---
drivers/gpu/drm/i915/gt/intel_ggtt_gmch.c | 1 +
drivers/gpu/drm/i915/gt/intel_gtt.h | 2 ++
3 files changed, 42 insertions(+), 4 deletions(-)
---
base-commit: 3cd6c251f39c14df9ab711e3eb56e703b359ff54
change-id: 20230308-guard_error_capture-f3f334eec85f

Best regards,
--
Andrzej Hajda <[email protected]>


2023-03-10 09:27:52

by Andrzej Hajda

Subject: [PATCH v6 1/2] drm/i915/gt: introduce vm->scratch_range callback

The callback is responsible for setting scratch page PTEs for the
specified range. In contrast to clear_range, it cannot be optimized to a nop.
It will be used by code adding guard pages.

Signed-off-by: Andrzej Hajda <[email protected]>
---
drivers/gpu/drm/i915/gt/intel_ggtt.c | 23 +++++++++++++++++++++++
drivers/gpu/drm/i915/gt/intel_ggtt_gmch.c | 1 +
drivers/gpu/drm/i915/gt/intel_gtt.h | 2 ++
3 files changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index 842e69c7b21e49..38e6f0b207fe0c 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -291,6 +291,27 @@ static void gen8_ggtt_insert_entries(struct i915_address_space *vm,
ggtt->invalidate(ggtt);
}

+static void gen8_ggtt_clear_range(struct i915_address_space *vm,
+				  u64 start, u64 length)
+{
+	struct i915_ggtt *ggtt = i915_vm_to_ggtt(vm);
+	unsigned int first_entry = start / I915_GTT_PAGE_SIZE;
+	unsigned int num_entries = length / I915_GTT_PAGE_SIZE;
+	const gen8_pte_t scratch_pte = vm->scratch[0]->encode;
+	gen8_pte_t __iomem *gtt_base =
+		(gen8_pte_t __iomem *)ggtt->gsm + first_entry;
+	const int max_entries = ggtt_total_entries(ggtt) - first_entry;
+	int i;
+
+	if (WARN(num_entries > max_entries,
+		 "First entry = %d; Num entries = %d (max=%d)\n",
+		 first_entry, num_entries, max_entries))
+		num_entries = max_entries;
+
+	for (i = 0; i < num_entries; i++)
+		gen8_set_pte(&gtt_base[i], scratch_pte);
+}
+
static void gen6_ggtt_insert_page(struct i915_address_space *vm,
dma_addr_t addr,
u64 offset,
@@ -919,6 +940,7 @@ static int gen8_gmch_probe(struct i915_ggtt *ggtt)
ggtt->vm.cleanup = gen6_gmch_remove;
ggtt->vm.insert_page = gen8_ggtt_insert_page;
ggtt->vm.clear_range = nop_clear_range;
+ ggtt->vm.scratch_range = gen8_ggtt_clear_range;

ggtt->vm.insert_entries = gen8_ggtt_insert_entries;

@@ -1082,6 +1104,7 @@ static int gen6_gmch_probe(struct i915_ggtt *ggtt)
ggtt->vm.clear_range = nop_clear_range;
if (!HAS_FULL_PPGTT(i915))
ggtt->vm.clear_range = gen6_ggtt_clear_range;
+ ggtt->vm.scratch_range = gen6_ggtt_clear_range;
ggtt->vm.insert_page = gen6_ggtt_insert_page;
ggtt->vm.insert_entries = gen6_ggtt_insert_entries;
ggtt->vm.cleanup = gen6_gmch_remove;
diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt_gmch.c b/drivers/gpu/drm/i915/gt/intel_ggtt_gmch.c
index 77c793812eb46a..d6a74ae2527bd9 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt_gmch.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt_gmch.c
@@ -102,6 +102,7 @@ int intel_ggtt_gmch_probe(struct i915_ggtt *ggtt)
ggtt->vm.insert_page = gmch_ggtt_insert_page;
ggtt->vm.insert_entries = gmch_ggtt_insert_entries;
ggtt->vm.clear_range = gmch_ggtt_clear_range;
+ ggtt->vm.scratch_range = gmch_ggtt_clear_range;
ggtt->vm.cleanup = gmch_ggtt_remove;

ggtt->invalidate = gmch_ggtt_invalidate;
diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
index 5a775310d3fcb5..69ce55f517f567 100644
--- a/drivers/gpu/drm/i915/gt/intel_gtt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
@@ -298,6 +298,8 @@ struct i915_address_space {
u64 start, u64 length);
void (*clear_range)(struct i915_address_space *vm,
u64 start, u64 length);
+ void (*scratch_range)(struct i915_address_space *vm,
+ u64 start, u64 length);
void (*insert_page)(struct i915_address_space *vm,
dma_addr_t addr,
u64 offset,

--
2.34.1
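
For readers following the thread, the difference between clear_range and
scratch_range described in the commit message can be sketched as a small
userspace model. This is only an illustration, not the driver code: the
model_* names, the 16-entry table and the plain uint64_t PTE array are
assumptions made up for the example, standing in for i915_address_space,
ggtt->gsm and vm->scratch[0]->encode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE     4096u /* stand-in for I915_GTT_PAGE_SIZE */
#define MODEL_TOTAL_ENTRIES 16u   /* tiny table, for illustration only */

/* Hypothetical model of an address space with the two callbacks. */
struct model_vm {
	uint64_t pte[MODEL_TOTAL_ENTRIES]; /* stand-in for the PTE array */
	uint64_t scratch_pte;              /* stand-in for the scratch PTE */
	void (*clear_range)(struct model_vm *vm,
			    uint64_t start, uint64_t length);
	void (*scratch_range)(struct model_vm *vm,
			      uint64_t start, uint64_t length);
};

/* clear_range may legitimately be a nop: PTEs are left untouched. */
static void model_nop_clear_range(struct model_vm *vm,
				  uint64_t start, uint64_t length)
{
	(void)vm;
	(void)start;
	(void)length;
}

/* scratch_range always rewrites the PTEs, clamped to the table size. */
static void model_scratch_range(struct model_vm *vm,
				uint64_t start, uint64_t length)
{
	uint64_t first = start / MODEL_PAGE_SIZE;
	uint64_t count = length / MODEL_PAGE_SIZE;
	uint64_t max, i;

	if (first >= MODEL_TOTAL_ENTRIES)
		return;

	max = MODEL_TOTAL_ENTRIES - first;
	if (count > max)
		count = max;

	for (i = 0; i < count; i++)
		vm->pte[first + i] = vm->scratch_pte;
}

/* Zero the table and wire up the callbacks. */
static void model_vm_init(struct model_vm *vm, uint64_t scratch_pte)
{
	size_t i;

	assert(vm != NULL);
	for (i = 0; i < MODEL_TOTAL_ENTRIES; i++)
		vm->pte[i] = 0;
	vm->scratch_pte = scratch_pte;
	vm->clear_range = model_nop_clear_range;
	vm->scratch_range = model_scratch_range;
}
```

In this model, calling vm.clear_range() on a range leaves stale entries in
place, while vm.scratch_range() on the same range overwrites them with the
scratch PTE, which is the property guard pages rely on.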

2023-03-10 09:27:56

by Andrzej Hajda

Subject: [PATCH v6 2/2] drm/i915: add guard page to ggtt->error_capture

Write-combining memory allows speculative reads by the CPU.
ggtt->error_capture is WC-mapped to the CPU, so the CPU/MMU can try
to prefetch memory beyond error_capture, i.e. it tries
to read memory pointed to by the next PTE in the GGTT.
If this PTE points to an invalid address, DMAR errors will occur.
This behaviour was observed on ADL and RPL platforms.
To avoid it, a guard scratch page should be added after error_capture.
The patch fixes the most annoying issue with error capture, but
since WC reads are also used in other places, there is a risk that a similar
problem can affect them as well.

v2:
- modified commit message (I hope the diagnosis is correct),
- added bug checks to ensure scratch is initialized on gen3 platforms.
CI produces strange stacktrace for it suggesting scratch[0] is NULL,
to be removed after resolving the issue with gen3 platforms.
v3:
- removed bug checks, replaced with gen check.
v4:
- change code for scratch page insertion to support all platforms,
- add info in commit message there could be more similar issues
v5:
- check for nop_clear_range instead of gen8 (Tvrtko),
- re-insert scratch pages on resume (Tvrtko)
v6:
- use scratch_range callback to set scratch pages (Chris)

Signed-off-by: Andrzej Hajda <[email protected]>
Reviewed-by: Andi Shyti <[email protected]>
---
drivers/gpu/drm/i915/gt/intel_ggtt.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index 38e6f0b207fe0c..5ef7e03b11c8e6 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -572,8 +572,12 @@ static int init_ggtt(struct i915_ggtt *ggtt)
* paths, and we trust that 0 will remain reserved. However,
* the only likely reason for failure to insert is a driver
* bug, which we expect to cause other failures...
+ *
+ * Since CPU can perform speculative reads on error capture
+ * (write-combining allows it) add scratch page after error
+ * capture to avoid DMAR errors.
*/
- ggtt->error_capture.size = I915_GTT_PAGE_SIZE;
+ ggtt->error_capture.size = 2 * I915_GTT_PAGE_SIZE;
ggtt->error_capture.color = I915_COLOR_UNEVICTABLE;
if (drm_mm_reserve_node(&ggtt->vm.mm, &ggtt->error_capture))
drm_mm_insert_node_in_range(&ggtt->vm.mm,
@@ -583,11 +587,15 @@ static int init_ggtt(struct i915_ggtt *ggtt)
0, ggtt->mappable_end,
DRM_MM_INSERT_LOW);
}
- if (drm_mm_node_allocated(&ggtt->error_capture))
+ if (drm_mm_node_allocated(&ggtt->error_capture)) {
+ u64 start = ggtt->error_capture.start;
+ u64 size = ggtt->error_capture.size;
+
+ ggtt->vm.scratch_range(&ggtt->vm, start, size);
drm_dbg(&ggtt->vm.i915->drm,
"Reserved GGTT:[%llx, %llx] for use by error capture\n",
- ggtt->error_capture.start,
- ggtt->error_capture.start + ggtt->error_capture.size);
+ start, start + size);
+ }

/*
* The upper portion of the GuC address space has a sizeable hole
@@ -1280,6 +1288,10 @@ void i915_ggtt_resume(struct i915_ggtt *ggtt)

flush = i915_ggtt_resume_vm(&ggtt->vm);

+ if (drm_mm_node_allocated(&ggtt->error_capture))
+ ggtt->vm.scratch_range(&ggtt->vm, ggtt->error_capture.start,
+ ggtt->error_capture.size);
+
ggtt->invalidate(ggtt);

if (flush)

--
2.34.1
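
The effect of the patch above can be sketched in a small userspace model:
the error_capture node grows to two pages, both PTEs are pointed at scratch
when the node is reserved, and the same range is re-scratched on resume.
This is a sketch under stated assumptions, not the driver code: the guard_*
names, the 8-entry table and the boolean "allocated" flag are hypothetical
stand-ins for the drm_mm node and the GGTT PTE array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GUARD_PAGE_SIZE     4096u /* stand-in for I915_GTT_PAGE_SIZE */
#define GUARD_TOTAL_ENTRIES 8u    /* tiny table, for illustration only */

/* Hypothetical stand-in for the drm_mm node backing error_capture. */
struct guard_node {
	uint64_t start;
	uint64_t size;
	bool allocated;
};

struct guard_ggtt {
	uint64_t pte[GUARD_TOTAL_ENTRIES];
	uint64_t scratch_pte;
	struct guard_node error_capture;
};

/* Point every PTE in [start, start + length) at the scratch page. */
static void guard_scratch_range(struct guard_ggtt *ggtt,
				uint64_t start, uint64_t length)
{
	uint64_t first = start / GUARD_PAGE_SIZE;
	uint64_t count = length / GUARD_PAGE_SIZE;
	uint64_t i;

	for (i = 0; i < count && first + i < GUARD_TOTAL_ENTRIES; i++)
		ggtt->pte[first + i] = ggtt->scratch_pte;
}

/* Reserve two pages: one for error capture, one guard page after it. */
static void guard_reserve_error_capture(struct guard_ggtt *ggtt,
					uint64_t start)
{
	assert(start % GUARD_PAGE_SIZE == 0);

	ggtt->error_capture.start = start;
	ggtt->error_capture.size = 2 * GUARD_PAGE_SIZE;
	ggtt->error_capture.allocated = true;
	guard_scratch_range(ggtt, start, ggtt->error_capture.size);
}

/* PTEs may be lost across suspend, so scratch the range again. */
static void guard_resume(struct guard_ggtt *ggtt)
{
	if (ggtt->error_capture.allocated)
		guard_scratch_range(ggtt, ggtt->error_capture.start,
				    ggtt->error_capture.size);
}
```

With the node at offset 0, entry 0 is the page error capture later maps real
memory into, and entry 1 is the guard: a speculative read running past the
first page then lands on scratch instead of whatever the next PTE held.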

2023-03-13 12:59:09

by Nirmoy Das

Subject: Re: [PATCH v6 1/2] drm/i915/gt: introduce vm->scratch_range callback


On 3/10/2023 10:23 AM, Andrzej Hajda wrote:
> The callback is responsible for setting scratch page PTEs for the
> specified range. In contrast to clear_range, it cannot be optimized to a nop.
> It will be used by code adding guard pages.
>
> Signed-off-by: Andrzej Hajda <[email protected]>
Reviewed-by: Nirmoy Das <[email protected]>
> ---
> drivers/gpu/drm/i915/gt/intel_ggtt.c | 23 +++++++++++++++++++++++
> drivers/gpu/drm/i915/gt/intel_ggtt_gmch.c | 1 +
> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 ++
> 3 files changed, 26 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c
> index 842e69c7b21e49..38e6f0b207fe0c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
> +++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
> @@ -291,6 +291,27 @@ static void gen8_ggtt_insert_entries(struct i915_address_space *vm,
> ggtt->invalidate(ggtt);
> }
>
> +static void gen8_ggtt_clear_range(struct i915_address_space *vm,
> + u64 start, u64 length)
> +{
> + struct i915_ggtt *ggtt = i915_vm_to_ggtt(vm);
> + unsigned int first_entry = start / I915_GTT_PAGE_SIZE;
> + unsigned int num_entries = length / I915_GTT_PAGE_SIZE;
> + const gen8_pte_t scratch_pte = vm->scratch[0]->encode;
> + gen8_pte_t __iomem *gtt_base =
> + (gen8_pte_t __iomem *)ggtt->gsm + first_entry;
> + const int max_entries = ggtt_total_entries(ggtt) - first_entry;
> + int i;
> +
> + if (WARN(num_entries > max_entries,
> + "First entry = %d; Num entries = %d (max=%d)\n",
> + first_entry, num_entries, max_entries))
> + num_entries = max_entries;
> +
> + for (i = 0; i < num_entries; i++)
> + gen8_set_pte(&gtt_base[i], scratch_pte);
> +}
> +
> static void gen6_ggtt_insert_page(struct i915_address_space *vm,
> dma_addr_t addr,
> u64 offset,
> @@ -919,6 +940,7 @@ static int gen8_gmch_probe(struct i915_ggtt *ggtt)
> ggtt->vm.cleanup = gen6_gmch_remove;
> ggtt->vm.insert_page = gen8_ggtt_insert_page;
> ggtt->vm.clear_range = nop_clear_range;
> + ggtt->vm.scratch_range = gen8_ggtt_clear_range;
>
> ggtt->vm.insert_entries = gen8_ggtt_insert_entries;
>
> @@ -1082,6 +1104,7 @@ static int gen6_gmch_probe(struct i915_ggtt *ggtt)
> ggtt->vm.clear_range = nop_clear_range;
> if (!HAS_FULL_PPGTT(i915))
> ggtt->vm.clear_range = gen6_ggtt_clear_range;
> + ggtt->vm.scratch_range = gen6_ggtt_clear_range;
> ggtt->vm.insert_page = gen6_ggtt_insert_page;
> ggtt->vm.insert_entries = gen6_ggtt_insert_entries;
> ggtt->vm.cleanup = gen6_gmch_remove;
> diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt_gmch.c b/drivers/gpu/drm/i915/gt/intel_ggtt_gmch.c
> index 77c793812eb46a..d6a74ae2527bd9 100644
> --- a/drivers/gpu/drm/i915/gt/intel_ggtt_gmch.c
> +++ b/drivers/gpu/drm/i915/gt/intel_ggtt_gmch.c
> @@ -102,6 +102,7 @@ int intel_ggtt_gmch_probe(struct i915_ggtt *ggtt)
> ggtt->vm.insert_page = gmch_ggtt_insert_page;
> ggtt->vm.insert_entries = gmch_ggtt_insert_entries;
> ggtt->vm.clear_range = gmch_ggtt_clear_range;
> + ggtt->vm.scratch_range = gmch_ggtt_clear_range;
> ggtt->vm.cleanup = gmch_ggtt_remove;
>
> ggtt->invalidate = gmch_ggtt_invalidate;
> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
> index 5a775310d3fcb5..69ce55f517f567 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> @@ -298,6 +298,8 @@ struct i915_address_space {
> u64 start, u64 length);
> void (*clear_range)(struct i915_address_space *vm,
> u64 start, u64 length);
> + void (*scratch_range)(struct i915_address_space *vm,
> + u64 start, u64 length);
> void (*insert_page)(struct i915_address_space *vm,
> dma_addr_t addr,
> u64 offset,
>

2023-03-13 12:59:52

by Nirmoy Das

Subject: Re: [Intel-gfx] [PATCH v6 2/2] drm/i915: add guard page to ggtt->error_capture


On 3/10/2023 10:23 AM, Andrzej Hajda wrote:
> Write-combining memory allows speculative reads by the CPU.
> ggtt->error_capture is WC-mapped to the CPU, so the CPU/MMU can try
> to prefetch memory beyond error_capture, i.e. it tries
> to read memory pointed to by the next PTE in the GGTT.
> If this PTE points to an invalid address, DMAR errors will occur.
> This behaviour was observed on ADL and RPL platforms.
> To avoid it, a guard scratch page should be added after error_capture.
> The patch fixes the most annoying issue with error capture, but
> since WC reads are also used in other places, there is a risk that a similar
> problem can affect them as well.
>
> v2:
> - modified commit message (I hope the diagnosis is correct),
> - added bug checks to ensure scratch is initialized on gen3 platforms.
> CI produces strange stacktrace for it suggesting scratch[0] is NULL,
> to be removed after resolving the issue with gen3 platforms.
> v3:
> - removed bug checks, replaced with gen check.
> v4:
> - change code for scratch page insertion to support all platforms,
> - add info in commit message there could be more similar issues
> v5:
> - check for nop_clear_range instead of gen8 (Tvrtko),
> - re-insert scratch pages on resume (Tvrtko)
> v6:
> - use scratch_range callback to set scratch pages (Chris)
>
> Signed-off-by: Andrzej Hajda <[email protected]>
> Reviewed-by: Andi Shyti <[email protected]>
Acked-by: Nirmoy Das <[email protected]>
> ---
> drivers/gpu/drm/i915/gt/intel_ggtt.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c
> index 38e6f0b207fe0c..5ef7e03b11c8e6 100644
> --- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
> +++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
> @@ -572,8 +572,12 @@ static int init_ggtt(struct i915_ggtt *ggtt)
> * paths, and we trust that 0 will remain reserved. However,
> * the only likely reason for failure to insert is a driver
> * bug, which we expect to cause other failures...
> + *
> + * Since CPU can perform speculative reads on error capture
> + * (write-combining allows it) add scratch page after error
> + * capture to avoid DMAR errors.
> */
> - ggtt->error_capture.size = I915_GTT_PAGE_SIZE;
> + ggtt->error_capture.size = 2 * I915_GTT_PAGE_SIZE;
> ggtt->error_capture.color = I915_COLOR_UNEVICTABLE;
> if (drm_mm_reserve_node(&ggtt->vm.mm, &ggtt->error_capture))
> drm_mm_insert_node_in_range(&ggtt->vm.mm,
> @@ -583,11 +587,15 @@ static int init_ggtt(struct i915_ggtt *ggtt)
> 0, ggtt->mappable_end,
> DRM_MM_INSERT_LOW);
> }
> - if (drm_mm_node_allocated(&ggtt->error_capture))
> + if (drm_mm_node_allocated(&ggtt->error_capture)) {
> + u64 start = ggtt->error_capture.start;
> + u64 size = ggtt->error_capture.size;
> +
> + ggtt->vm.scratch_range(&ggtt->vm, start, size);
> drm_dbg(&ggtt->vm.i915->drm,
> "Reserved GGTT:[%llx, %llx] for use by error capture\n",
> - ggtt->error_capture.start,
> - ggtt->error_capture.start + ggtt->error_capture.size);
> + start, start + size);
> + }
>
> /*
> * The upper portion of the GuC address space has a sizeable hole
> @@ -1280,6 +1288,10 @@ void i915_ggtt_resume(struct i915_ggtt *ggtt)
>
> flush = i915_ggtt_resume_vm(&ggtt->vm);
>
> + if (drm_mm_node_allocated(&ggtt->error_capture))
> + ggtt->vm.scratch_range(&ggtt->vm, ggtt->error_capture.start,
> + ggtt->error_capture.size);
> +
> ggtt->invalidate(ggtt);
>
> if (flush)
>

2023-03-14 17:15:07

by Andi Shyti

Subject: Re: [PATCH v6 1/2] drm/i915/gt: introduce vm->scratch_range callback

Hi Andrzej,

On Fri, Mar 10, 2023 at 10:23:49AM +0100, Andrzej Hajda wrote:
> The callback is responsible for setting scratch page PTEs for the
> specified range. In contrast to clear_range, it cannot be optimized to a nop.
> It will be used by code adding guard pages.
>
> Signed-off-by: Andrzej Hajda <[email protected]>

Reviewed-by: Andi Shyti <[email protected]>

Thanks,
Andi

2023-03-16 18:20:18

by Andrzej Hajda

Subject: Re: [Intel-gfx] [PATCH v6 0/2] drm/i915: add guard page to ggtt->error_capture

On 10.03.2023 10:23, Andrzej Hajda wrote:
> This patchset tries to reduce the plague of DMAR read errors seen
> in CI on ADL*, RPL* and DG2 platforms; see for example [1] (grep DMAR).
> CI usually tolerates these errors, so the scale of the problem
> is not really visible.
> To show it, I have counted lines containing DMAR read errors in the dmesgs
> produced by CI for all three versions of the patch, but in contrast to v2
> I have grepped only for lines containing "PTE Read access".
> Below are the stats for the kernel without the patchset vs. the patched one.
> v1: 210 vs 0
> v2: 201 vs 0
> v3: 214 vs 0
> Apparently the patchset fixes all common PTE read errors.
>
> Changelog:
> v2:
> - modified commit message (I hope the diagnosis is correct),
> - added bug checks to ensure scratch is initialized on gen3 platforms.
> CI produces strange stacktrace for it suggesting scratch[0] is NULL,
> to be removed after resolving the issue with gen3 platforms.
> v3:
> - removed bug checks, replaced with gen check.
> v4:
> - change code for scratch page insertion to support all platforms,
> - add info in commit message there could be more similar issues
> v5:
> - changed to patchset adding nop_clear_range related code,
> - re-insert scratch PTEs on resume
> v6:
> - use scratch_range
>
> To: Jani Nikula <[email protected]>
> To: Joonas Lahtinen <[email protected]>
> To: Rodrigo Vivi <[email protected]>
> To: Tvrtko Ursulin <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Andi Shyti <[email protected]>
> Cc: Chris Wilson <[email protected]>
> Cc: Nirmoy Das <[email protected]>
>
> Signed-off-by: Andrzej Hajda <[email protected]>
>

Queued to drm-intel-gt-next

Regards
Andrzej

> ---
> - Link to v5: https://lore.kernel.org/r/[email protected]
>
> ---
> Andrzej Hajda (2):
> drm/i915/gt: introduce vm->scratch_range callback
> drm/i915: add guard page to ggtt->error_capture
>
> drivers/gpu/drm/i915/gt/intel_ggtt.c | 43 ++++++++++++++++++++++++++++---
> drivers/gpu/drm/i915/gt/intel_ggtt_gmch.c | 1 +
> drivers/gpu/drm/i915/gt/intel_gtt.h | 2 ++
> 3 files changed, 42 insertions(+), 4 deletions(-)
> ---
> base-commit: 3cd6c251f39c14df9ab711e3eb56e703b359ff54
> change-id: 20230308-guard_error_capture-f3f334eec85f
>
> Best regards,