2024-02-29 23:22:25

by Samuel Holland

Subject: [PATCH v5 00/13] riscv: ASID-related and UP-related TLB flush enhancements

While reviewing Alexandre Ghiti's "riscv: tlb flush improvements"
series[1], I noticed that most TLB flush functions end up as a call to
local_flush_tlb_all() when SMP is disabled. This series resolves that,
and also optimizes the scenario where SMP is enabled but only one CPU is
present or online. Along the way, I realized that we should be using
single-ASID flushes wherever possible, so I implemented that as well.
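
For illustration, here is roughly the uniprocessor fallback this series
removes (a simplified sketch of the #else branch of asm/tlbflush.h that
patch 6 deletes): every ranged or per-mm flush degraded to a full local
flush, discarding both the address range and the ASID.

  /* Pre-series UP behavior (simplified sketch, not the exact code) */
  #define flush_tlb_mm(mm)                  flush_tlb_all()
  #define flush_tlb_page(vma, addr)         local_flush_tlb_page(addr)

  static inline void flush_tlb_range(struct vm_area_struct *vma,
                                     unsigned long start, unsigned long end)
  {
          local_flush_tlb_all();  /* full sfence.vma: no range, no ASID */
  }

After this series, those calls funnel through __flush_tlb_range() and issue
per-page, single-ASID flushes below a threshold instead.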

Here are some numbers from D1 (with SMP disabled) which show the
performance impact:

v6.8-rc6 + riscv/fixes + riscv/for-next:
System Benchmarks Partial Index               BASELINE       RESULT    INDEX
Execl Throughput                                  43.0        223.5     52.0
File Copy 1024 bufsize 2000 maxblocks           3960.0      62563.4    158.0
File Copy 256 bufsize 500 maxblocks             1655.0      17869.2    108.0
File Copy 4096 bufsize 8000 maxblocks           5800.0     164915.1    284.3
Pipe Throughput                                12440.0     161368.5    129.7
Pipe-based Context Switching                    4000.0      22247.1     55.6
Process Creation                                 126.0        546.9     43.4
Shell Scripts (1 concurrent)                      42.4        599.6    141.4
Shell Scripts (16 concurrent)                      ---         39.3      ---
Shell Scripts (8 concurrent)                       6.0         79.1    131.9
System Call Overhead                           15000.0     246019.0    164.0
                                                                     ========
System Benchmarks Index Score (Partial Only)                            109.2

v6.8-rc6 + riscv/fixes + riscv/for-next + this patch series:
System Benchmarks Partial Index               BASELINE       RESULT    INDEX
Execl Throughput                                  43.0        223.1     51.9
File Copy 1024 bufsize 2000 maxblocks           3960.0      71982.9    181.8
File Copy 256 bufsize 500 maxblocks             1655.0      18436.9    111.4
File Copy 4096 bufsize 8000 maxblocks           5800.0     184955.2    318.9
Pipe Throughput                                12440.0     162622.9    130.7
Pipe-based Context Switching                    4000.0      22082.5     55.2
Process Creation                                 126.0        546.4     43.4
Shell Scripts (1 concurrent)                      42.4        598.1    141.1
Shell Scripts (16 concurrent)                      ---         38.8      ---
Shell Scripts (8 concurrent)                       6.0         78.6    131.0
System Call Overhead                           15000.0     258529.3    172.4
                                                                     ========
System Benchmarks Index Score (Partial Only)                            112.8

[1]: https://lore.kernel.org/linux-riscv/[email protected]/

Changes in v5:
- Rebase on v6.8-rc1 + riscv/for-next (for the fast GUP implementation)
- Add patch for minor refactoring in asm/pgalloc.h
- Also switch to riscv_use_sbi_for_rfence() in asm/pgalloc.h
- Leave use_asid_allocator declared in asm/mmu_context.h

Changes in v4:
- Fix a possible race between flush_icache_*() and SMP bringup
- Refactor riscv_use_ipi_for_rfence() to make later changes cleaner
- Optimize kernel TLB flushes with only one CPU online
- Optimize global cache/TLB flushes with only one CPU online
- Merge the two copies of __flush_tlb_range() and rely on the compiler
to optimize out the broadcast path (both clang and gcc do this)
- Merge the two copies of flush_tlb_all() and rely on constant folding
- Only set tlb_flush_all_threshold when CONFIG_MMU=y.

Changes in v3:
- Fixed a performance regression caused by executing sfence.vma in a
loop on implementations affected by SiFive CIP-1200
- Rebased on v6.7-rc1

Changes in v2:
- Move the SMP/UP merge earlier in the series to avoid build issues
- Make a copy of __flush_tlb_range() instead of adding ifdefs inside
- local_flush_tlb_all() is the only function used on !MMU (smpboot.c)

Samuel Holland (13):
riscv: Flush the instruction cache during SMP bringup
riscv: Factor out page table TLB synchronization
riscv: Use IPIs for remote cache/TLB flushes by default
riscv: mm: Broadcast kernel TLB flushes only when needed
riscv: Only send remote fences when some other CPU is online
riscv: mm: Combine the SMP and UP TLB flush code
riscv: Apply SiFive CIP-1200 workaround to single-ASID sfence.vma
riscv: Avoid TLB flush loops when affected by SiFive CIP-1200
riscv: mm: Introduce cntx2asid/cntx2version helper macros
riscv: mm: Use a fixed layout for the MM context ID
riscv: mm: Make asid_bits a local variable
riscv: mm: Preserve global TLB entries when switching contexts
riscv: mm: Always use an ASID to flush mm contexts

arch/riscv/Kconfig | 2 +-
arch/riscv/errata/sifive/errata.c | 5 ++
arch/riscv/include/asm/errata_list.h | 12 ++++-
arch/riscv/include/asm/mmu.h | 3 ++
arch/riscv/include/asm/pgalloc.h | 32 ++++++------
arch/riscv/include/asm/sbi.h | 4 ++
arch/riscv/include/asm/smp.h | 15 +-----
arch/riscv/include/asm/tlbflush.h | 51 ++++++++-----------
arch/riscv/kernel/sbi-ipi.c | 11 +++-
arch/riscv/kernel/smp.c | 11 +---
arch/riscv/kernel/smpboot.c | 7 +--
arch/riscv/mm/Makefile | 5 +-
arch/riscv/mm/cacheflush.c | 7 +--
arch/riscv/mm/context.c | 23 ++++-----
arch/riscv/mm/tlbflush.c | 75 ++++++++--------------------
drivers/clocksource/timer-clint.c | 2 +-
16 files changed, 114 insertions(+), 151 deletions(-)

--
2.43.1



2024-02-29 23:22:28

by Samuel Holland

Subject: [PATCH v5 01/13] riscv: Flush the instruction cache during SMP bringup

Instruction cache flush IPIs are sent only to CPUs in cpu_online_mask,
so they will not target a CPU until it calls set_cpu_online() earlier in
smp_callin(). As a result, if instruction memory is modified between the
CPU coming out of reset and that point, then its instruction cache may
contain stale data. Therefore, the instruction cache must be flushed
after the set_cpu_online() synchronization point.
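
An illustrative timeline of the window being closed (a sketch, not part of
the patch itself):

  /*
   * CPU0 (online)                     CPU1 (coming online)
   * -------------                     --------------------
   * modifies kernel text              out of reset; I$ may hold stale
   * flush_icache_all()                instructions for that text
   *   -> fence.i IPIs go only to
   *      cpu_online_mask, which
   *      does not yet include CPU1
   *                                   set_cpu_online(true)
   *                                   local_flush_icache_all()  <- added here
   *                                   now guaranteed to see the new text
   */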

Fixes: 08f051eda33b ("RISC-V: Flush I$ when making a dirty page executable")
Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Samuel Holland <[email protected]>
---

(no changes since v4)

Changes in v4:
- New patch for v4

arch/riscv/kernel/smpboot.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
index cfbe4b840d42..e1b612f37dd9 100644
--- a/arch/riscv/kernel/smpboot.c
+++ b/arch/riscv/kernel/smpboot.c
@@ -26,7 +26,7 @@
#include <linux/sched/task_stack.h>
#include <linux/sched/mm.h>

-#include <asm/cpufeature.h>
+#include <asm/cacheflush.h>
#include <asm/cpu_ops.h>
#include <asm/irq.h>
#include <asm/mmu_context.h>
@@ -238,9 +238,10 @@ asmlinkage __visible void smp_callin(void)
riscv_user_isa_enable();

/*
- * Remote TLB flushes are ignored while the CPU is offline, so emit
- * a local TLB flush right now just in case.
+ * Remote cache and TLB flushes are ignored while the CPU is offline,
+ * so flush them both right now just in case.
*/
+ local_flush_icache_all();
local_flush_tlb_all();
complete(&cpu_running);
/*
--
2.43.1


2024-02-29 23:22:36

by Samuel Holland

Subject: [PATCH v5 02/13] riscv: Factor out page table TLB synchronization

The logic is the same for all page table levels. See commit 69be3fb111e7
("riscv: enable MMU_GATHER_RCU_TABLE_FREE for SMP && MMU").

Signed-off-by: Samuel Holland <[email protected]>
---

Changes in v5:
- New patch for v5

arch/riscv/include/asm/pgalloc.h | 31 +++++++++++++------------------
1 file changed, 13 insertions(+), 18 deletions(-)

diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
index deaf971253a2..87468f67951a 100644
--- a/arch/riscv/include/asm/pgalloc.h
+++ b/arch/riscv/include/asm/pgalloc.h
@@ -88,6 +88,14 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
return NULL;
}

+static inline void riscv_tlb_remove_ptdesc(struct mmu_gather *tlb, void *pt)
+{
+ if (riscv_use_ipi_for_rfence())
+ tlb_remove_page_ptdesc(tlb, pt);
+ else
+ tlb_remove_ptdesc(tlb, pt);
+}
+
#define pud_free pud_free
static inline void pud_free(struct mm_struct *mm, pud_t *pud)
{
@@ -102,10 +110,7 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pud,
struct ptdesc *ptdesc = virt_to_ptdesc(pud);

pagetable_pud_dtor(ptdesc);
- if (riscv_use_ipi_for_rfence())
- tlb_remove_page_ptdesc(tlb, ptdesc);
- else
- tlb_remove_ptdesc(tlb, ptdesc);
+ riscv_tlb_remove_ptdesc(tlb, ptdesc);
}
}

@@ -139,12 +144,8 @@ static inline void p4d_free(struct mm_struct *mm, p4d_t *p4d)
static inline void __p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d,
unsigned long addr)
{
- if (pgtable_l5_enabled) {
- if (riscv_use_ipi_for_rfence())
- tlb_remove_page_ptdesc(tlb, virt_to_ptdesc(p4d));
- else
- tlb_remove_ptdesc(tlb, virt_to_ptdesc(p4d));
- }
+ if (pgtable_l5_enabled)
+ riscv_tlb_remove_ptdesc(tlb, virt_to_ptdesc(p4d));
}
#endif /* __PAGETABLE_PMD_FOLDED */

@@ -176,10 +177,7 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd,
struct ptdesc *ptdesc = virt_to_ptdesc(pmd);

pagetable_pmd_dtor(ptdesc);
- if (riscv_use_ipi_for_rfence())
- tlb_remove_page_ptdesc(tlb, ptdesc);
- else
- tlb_remove_ptdesc(tlb, ptdesc);
+ riscv_tlb_remove_ptdesc(tlb, ptdesc);
}

#endif /* __PAGETABLE_PMD_FOLDED */
@@ -190,10 +188,7 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
struct ptdesc *ptdesc = page_ptdesc(pte);

pagetable_pte_dtor(ptdesc);
- if (riscv_use_ipi_for_rfence())
- tlb_remove_page_ptdesc(tlb, ptdesc);
- else
- tlb_remove_ptdesc(tlb, ptdesc);
+ riscv_tlb_remove_ptdesc(tlb, ptdesc);
}
#endif /* CONFIG_MMU */

--
2.43.1


2024-02-29 23:23:11

by Samuel Holland

Subject: [PATCH v5 03/13] riscv: Use IPIs for remote cache/TLB flushes by default

An IPI backend is always required in an SMP configuration, but an SBI
implementation is not. For example, SBI will be unavailable when the
kernel runs in M mode.

Generally, IPIs are assumed to be faster than SBI calls due to the SBI
context switch overhead. However, when SBI is used as the IPI backend,
then the context switch cost must be paid anyway, and performing the
cache/TLB flush directly in the SBI implementation is more efficient
than injecting an interrupt into the kernel. This is the only scenario
where riscv_ipi_set_virq_range()'s use_for_rfence parameter is false.

Thus, it makes sense for remote fences to use IPIs by default, and make
the SBI remote fence extension the special case.

sbi_ipi_init() already checks riscv_ipi_have_virq_range(), so it only
calls riscv_ipi_set_virq_range() when no other IPI device is available.
This lets us move the static key into sbi-ipi.c and drop the
use_for_rfence parameter.

Furthermore, the static branch only makes sense when CONFIG_RISCV_SBI is
enabled. Otherwise, IPIs must be used. Add a fallback definition of
riscv_use_sbi_for_rfence() which handles this case and removes the need
to check CONFIG_RISCV_SBI elsewhere, such as in cacheflush.c.
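
The effect on call sites, using flush_icache_all() as an example (a sketch
of the before/after from the cacheflush.c hunk below):

  /* Before: callers had to guard the SBI path themselves. */
  if (IS_ENABLED(CONFIG_RISCV_SBI) && !riscv_use_ipi_for_rfence())
          sbi_remote_fence_i(NULL);
  else
          on_each_cpu(ipi_remote_fence_i, NULL, 1);

  /* After: with CONFIG_RISCV_SBI=n, riscv_use_sbi_for_rfence() is a
   * static inline returning false, so the SBI branch is compiled out. */
  if (riscv_use_sbi_for_rfence())
          sbi_remote_fence_i(NULL);
  else
          on_each_cpu(ipi_remote_fence_i, NULL, 1);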

Signed-off-by: Samuel Holland <[email protected]>
---

Changes in v5:
- Also switch to riscv_use_sbi_for_rfence() in asm/pgalloc.h

Changes in v4:
- New patch for v4

arch/riscv/include/asm/pgalloc.h | 7 ++++---
arch/riscv/include/asm/sbi.h | 4 ++++
arch/riscv/include/asm/smp.h | 15 ++-------------
arch/riscv/kernel/sbi-ipi.c | 11 ++++++++++-
arch/riscv/kernel/smp.c | 11 +----------
arch/riscv/mm/cacheflush.c | 5 ++---
arch/riscv/mm/tlbflush.c | 31 ++++++++++++++-----------------
drivers/clocksource/timer-clint.c | 2 +-
8 files changed, 38 insertions(+), 48 deletions(-)

diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
index 87468f67951a..6578054977ef 100644
--- a/arch/riscv/include/asm/pgalloc.h
+++ b/arch/riscv/include/asm/pgalloc.h
@@ -8,6 +8,7 @@
#define _ASM_RISCV_PGALLOC_H

#include <linux/mm.h>
+#include <asm/sbi.h>
#include <asm/tlb.h>

#ifdef CONFIG_MMU
@@ -90,10 +91,10 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)

static inline void riscv_tlb_remove_ptdesc(struct mmu_gather *tlb, void *pt)
{
- if (riscv_use_ipi_for_rfence())
- tlb_remove_page_ptdesc(tlb, pt);
- else
+ if (riscv_use_sbi_for_rfence())
tlb_remove_ptdesc(tlb, pt);
+ else
+ tlb_remove_page_ptdesc(tlb, pt);
}

#define pud_free pud_free
diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
index 6e68f8dff76b..ea84392ca9d7 100644
--- a/arch/riscv/include/asm/sbi.h
+++ b/arch/riscv/include/asm/sbi.h
@@ -375,8 +375,12 @@ unsigned long riscv_cached_marchid(unsigned int cpu_id);
unsigned long riscv_cached_mimpid(unsigned int cpu_id);

#if IS_ENABLED(CONFIG_SMP) && IS_ENABLED(CONFIG_RISCV_SBI)
+DECLARE_STATIC_KEY_FALSE(riscv_sbi_for_rfence);
+#define riscv_use_sbi_for_rfence() \
+ static_branch_unlikely(&riscv_sbi_for_rfence)
void sbi_ipi_init(void);
#else
+static inline bool riscv_use_sbi_for_rfence(void) { return false; }
static inline void sbi_ipi_init(void) { }
#endif

diff --git a/arch/riscv/include/asm/smp.h b/arch/riscv/include/asm/smp.h
index 0d555847cde6..7ac80e9f2288 100644
--- a/arch/riscv/include/asm/smp.h
+++ b/arch/riscv/include/asm/smp.h
@@ -49,12 +49,7 @@ void riscv_ipi_disable(void);
bool riscv_ipi_have_virq_range(void);

/* Set the IPI interrupt numbers for arch (called by irqchip drivers) */
-void riscv_ipi_set_virq_range(int virq, int nr, bool use_for_rfence);
-
-/* Check if we can use IPIs for remote FENCEs */
-DECLARE_STATIC_KEY_FALSE(riscv_ipi_for_rfence);
-#define riscv_use_ipi_for_rfence() \
- static_branch_unlikely(&riscv_ipi_for_rfence)
+void riscv_ipi_set_virq_range(int virq, int nr);

/* Check other CPUs stop or not */
bool smp_crash_stop_failed(void);
@@ -104,16 +99,10 @@ static inline bool riscv_ipi_have_virq_range(void)
return false;
}

-static inline void riscv_ipi_set_virq_range(int virq, int nr,
- bool use_for_rfence)
+static inline void riscv_ipi_set_virq_range(int virq, int nr)
{
}

-static inline bool riscv_use_ipi_for_rfence(void)
-{
- return false;
-}
-
#endif /* CONFIG_SMP */

#if defined(CONFIG_HOTPLUG_CPU) && (CONFIG_SMP)
diff --git a/arch/riscv/kernel/sbi-ipi.c b/arch/riscv/kernel/sbi-ipi.c
index a4559695ce62..1026e22955cc 100644
--- a/arch/riscv/kernel/sbi-ipi.c
+++ b/arch/riscv/kernel/sbi-ipi.c
@@ -13,6 +13,9 @@
#include <linux/irqdomain.h>
#include <asm/sbi.h>

+DEFINE_STATIC_KEY_FALSE(riscv_sbi_for_rfence);
+EXPORT_SYMBOL_GPL(riscv_sbi_for_rfence);
+
static int sbi_ipi_virq;

static void sbi_ipi_handle(struct irq_desc *desc)
@@ -72,6 +75,12 @@ void __init sbi_ipi_init(void)
"irqchip/sbi-ipi:starting",
sbi_ipi_starting_cpu, NULL);

- riscv_ipi_set_virq_range(virq, BITS_PER_BYTE, false);
+ riscv_ipi_set_virq_range(virq, BITS_PER_BYTE);
pr_info("providing IPIs using SBI IPI extension\n");
+
+ /*
+ * Use the SBI remote fence extension to avoid
+ * the extra context switch needed to handle IPIs.
+ */
+ static_branch_enable(&riscv_sbi_for_rfence);
}
diff --git a/arch/riscv/kernel/smp.c b/arch/riscv/kernel/smp.c
index 45dd4035416e..8e6eb64459af 100644
--- a/arch/riscv/kernel/smp.c
+++ b/arch/riscv/kernel/smp.c
@@ -171,10 +171,7 @@ bool riscv_ipi_have_virq_range(void)
return (ipi_virq_base) ? true : false;
}

-DEFINE_STATIC_KEY_FALSE(riscv_ipi_for_rfence);
-EXPORT_SYMBOL_GPL(riscv_ipi_for_rfence);
-
-void riscv_ipi_set_virq_range(int virq, int nr, bool use_for_rfence)
+void riscv_ipi_set_virq_range(int virq, int nr)
{
int i, err;

@@ -197,12 +194,6 @@ void riscv_ipi_set_virq_range(int virq, int nr, bool use_for_rfence)

/* Enabled IPIs for boot CPU immediately */
riscv_ipi_enable();
-
- /* Update RFENCE static key */
- if (use_for_rfence)
- static_branch_enable(&riscv_ipi_for_rfence);
- else
- static_branch_disable(&riscv_ipi_for_rfence);
}

static const char * const ipi_names[] = {
diff --git a/arch/riscv/mm/cacheflush.c b/arch/riscv/mm/cacheflush.c
index 55a34f2020a8..47c485bc7df0 100644
--- a/arch/riscv/mm/cacheflush.c
+++ b/arch/riscv/mm/cacheflush.c
@@ -21,7 +21,7 @@ void flush_icache_all(void)
{
local_flush_icache_all();

- if (IS_ENABLED(CONFIG_RISCV_SBI) && !riscv_use_ipi_for_rfence())
+ if (riscv_use_sbi_for_rfence())
sbi_remote_fence_i(NULL);
else
on_each_cpu(ipi_remote_fence_i, NULL, 1);
@@ -69,8 +69,7 @@ void flush_icache_mm(struct mm_struct *mm, bool local)
* with flush_icache_deferred().
*/
smp_mb();
- } else if (IS_ENABLED(CONFIG_RISCV_SBI) &&
- !riscv_use_ipi_for_rfence()) {
+ } else if (riscv_use_sbi_for_rfence()) {
sbi_remote_fence_i(&others);
} else {
on_each_cpu_mask(&others, ipi_remote_fence_i, NULL, 1);
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 8d12b26f5ac3..0373661bd1c4 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -78,10 +78,10 @@ static void __ipi_flush_tlb_all(void *info)

void flush_tlb_all(void)
{
- if (riscv_use_ipi_for_rfence())
- on_each_cpu(__ipi_flush_tlb_all, NULL, 1);
- else
+ if (riscv_use_sbi_for_rfence())
sbi_remote_sfence_vma_asid(NULL, 0, FLUSH_TLB_MAX_SIZE, FLUSH_TLB_NO_ASID);
+ else
+ on_each_cpu(__ipi_flush_tlb_all, NULL, 1);
}

struct flush_tlb_range_data {
@@ -102,7 +102,6 @@ static void __flush_tlb_range(struct cpumask *cmask, unsigned long asid,
unsigned long start, unsigned long size,
unsigned long stride)
{
- struct flush_tlb_range_data ftd;
bool broadcast;

if (cpumask_empty(cmask))
@@ -118,20 +117,18 @@ static void __flush_tlb_range(struct cpumask *cmask, unsigned long asid,
broadcast = true;
}

- if (broadcast) {
- if (riscv_use_ipi_for_rfence()) {
- ftd.asid = asid;
- ftd.start = start;
- ftd.size = size;
- ftd.stride = stride;
- on_each_cpu_mask(cmask,
- __ipi_flush_tlb_range_asid,
- &ftd, 1);
- } else
- sbi_remote_sfence_vma_asid(cmask,
- start, size, asid);
- } else {
+ if (!broadcast) {
local_flush_tlb_range_asid(start, size, stride, asid);
+ } else if (riscv_use_sbi_for_rfence()) {
+ sbi_remote_sfence_vma_asid(cmask, start, size, asid);
+ } else {
+ struct flush_tlb_range_data ftd;
+
+ ftd.asid = asid;
+ ftd.start = start;
+ ftd.size = size;
+ ftd.stride = stride;
+ on_each_cpu_mask(cmask, __ipi_flush_tlb_range_asid, &ftd, 1);
}

if (cmask != cpu_online_mask)
diff --git a/drivers/clocksource/timer-clint.c b/drivers/clocksource/timer-clint.c
index 09fd292eb83d..0bdd9d7ec545 100644
--- a/drivers/clocksource/timer-clint.c
+++ b/drivers/clocksource/timer-clint.c
@@ -251,7 +251,7 @@ static int __init clint_timer_init_dt(struct device_node *np)
}

irq_set_chained_handler(clint_ipi_irq, clint_ipi_interrupt);
- riscv_ipi_set_virq_range(rc, BITS_PER_BYTE, true);
+ riscv_ipi_set_virq_range(rc, BITS_PER_BYTE);
clint_clear_ipi();
#endif

--
2.43.1


2024-02-29 23:23:15

by Samuel Holland

Subject: [PATCH v5 04/13] riscv: mm: Broadcast kernel TLB flushes only when needed

__flush_tlb_range() avoids broadcasting TLB flushes when an mm context
is only active on the local CPU. Apply this same optimization to TLB
flushes of kernel memory when only one CPU is online. This check can be
constant-folded when SMP is disabled.
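
To illustrate the constant folding (a reading of the generic cpumask code,
not spelled out in the patch): with CONFIG_SMP=n, nr_cpu_ids is 1, so
cpumask_any_but() can never name another CPU and the compiler keeps only
the local branch of the new check in __flush_tlb_range():

  cpu = get_cpu();

  /* Check if the TLB flush needs to be sent to other CPUs. */
  if (cpumask_any_but(cmask, cpu) >= nr_cpu_ids) {   /* always true on UP */
          local_flush_tlb_range_asid(start, size, stride, asid);
  } else {
          /* SBI or IPI broadcast: dead code on UP, discarded by the
           * compiler. */
  }

  put_cpu();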

Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Samuel Holland <[email protected]>
---

(no changes since v4)

Changes in v4:
- New patch for v4

arch/riscv/mm/tlbflush.c | 18 +++++-------------
1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 0373661bd1c4..8cdb082f00ca 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -102,22 +102,15 @@ static void __flush_tlb_range(struct cpumask *cmask, unsigned long asid,
unsigned long start, unsigned long size,
unsigned long stride)
{
- bool broadcast;
+ unsigned int cpu;

if (cpumask_empty(cmask))
return;

- if (cmask != cpu_online_mask) {
- unsigned int cpuid;
+ cpu = get_cpu();

- cpuid = get_cpu();
- /* check if the tlbflush needs to be sent to other CPUs */
- broadcast = cpumask_any_but(cmask, cpuid) < nr_cpu_ids;
- } else {
- broadcast = true;
- }
-
- if (!broadcast) {
+ /* Check if the TLB flush needs to be sent to other CPUs. */
+ if (cpumask_any_but(cmask, cpu) >= nr_cpu_ids) {
local_flush_tlb_range_asid(start, size, stride, asid);
} else if (riscv_use_sbi_for_rfence()) {
sbi_remote_sfence_vma_asid(cmask, start, size, asid);
@@ -131,8 +124,7 @@ static void __flush_tlb_range(struct cpumask *cmask, unsigned long asid,
on_each_cpu_mask(cmask, __ipi_flush_tlb_range_asid, &ftd, 1);
}

- if (cmask != cpu_online_mask)
- put_cpu();
+ put_cpu();
}

static inline unsigned long get_mm_asid(struct mm_struct *mm)
--
2.43.1


2024-02-29 23:23:29

by Samuel Holland

Subject: [PATCH v5 05/13] riscv: Only send remote fences when some other CPU is online

If no other CPU is online, a local cache or TLB flush is sufficient.
These checks can be constant-folded when SMP is disabled.
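
A note on the constant folding: with CONFIG_SMP=n, <linux/cpumask.h>
defines num_online_cpus() as 1U, so the new early check is compile-time
true and the remote paths become dead code, e.g. in flush_tlb_all():

  void flush_tlb_all(void)
  {
          if (num_online_cpus() < 2)      /* always true on UP */
                  local_flush_tlb_all();
          else if (riscv_use_sbi_for_rfence())
                  sbi_remote_sfence_vma_asid(NULL, 0, FLUSH_TLB_MAX_SIZE,
                                             FLUSH_TLB_NO_ASID);
          else
                  on_each_cpu(__ipi_flush_tlb_all, NULL, 1);
  }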

Signed-off-by: Samuel Holland <[email protected]>
---

(no changes since v4)

Changes in v4:
- New patch for v4

arch/riscv/mm/cacheflush.c | 4 +++-
arch/riscv/mm/tlbflush.c | 4 +++-
2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/mm/cacheflush.c b/arch/riscv/mm/cacheflush.c
index 47c485bc7df0..f7933ae88a55 100644
--- a/arch/riscv/mm/cacheflush.c
+++ b/arch/riscv/mm/cacheflush.c
@@ -21,7 +21,9 @@ void flush_icache_all(void)
{
local_flush_icache_all();

- if (riscv_use_sbi_for_rfence())
+ if (num_online_cpus() < 2)
+ return;
+ else if (riscv_use_sbi_for_rfence())
sbi_remote_fence_i(NULL);
else
on_each_cpu(ipi_remote_fence_i, NULL, 1);
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 8cdb082f00ca..69402c260a89 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -78,7 +78,9 @@ static void __ipi_flush_tlb_all(void *info)

void flush_tlb_all(void)
{
- if (riscv_use_sbi_for_rfence())
+ if (num_online_cpus() < 2)
+ local_flush_tlb_all();
+ else if (riscv_use_sbi_for_rfence())
sbi_remote_sfence_vma_asid(NULL, 0, FLUSH_TLB_MAX_SIZE, FLUSH_TLB_NO_ASID);
else
on_each_cpu(__ipi_flush_tlb_all, NULL, 1);
--
2.43.1


2024-02-29 23:23:44

by Samuel Holland

Subject: [PATCH v5 06/13] riscv: mm: Combine the SMP and UP TLB flush code

In SMP configurations, all TLB flushing narrower than flush_tlb_all()
goes through __flush_tlb_range(). Do the same in UP configurations.

This allows UP configurations to take advantage of recent improvements
to the code in tlbflush.c, such as support for huge pages and flushing
multiple-page ranges.

Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Samuel Holland <[email protected]>
---

(no changes since v4)

Changes in v4:
- Merge the two copies of __flush_tlb_range() and rely on the compiler
to optimize out the broadcast path (both clang and gcc do this)
- Merge the two copies of flush_tlb_all() and rely on constant folding

Changes in v2:
- Move the SMP/UP merge earlier in the series to avoid build issues
- Make a copy of __flush_tlb_range() instead of adding ifdefs inside
- local_flush_tlb_all() is the only function used on !MMU (smpboot.c)

arch/riscv/Kconfig | 2 +-
arch/riscv/include/asm/tlbflush.h | 30 +++---------------------------
arch/riscv/mm/Makefile | 5 +----
3 files changed, 5 insertions(+), 32 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 0bfcfec67ed5..de9b6f2279ff 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -60,7 +60,7 @@ config RISCV
select ARCH_USE_MEMTEST
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USES_CFI_TRAPS if CFI_CLANG
- select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if SMP && MMU
+ select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if MMU
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_GENERAL_HUGETLB if !RISCV_ISA_SVNAPOT
diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index 928f096dca21..4f86424b1ba5 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -27,12 +27,7 @@ static inline void local_flush_tlb_page(unsigned long addr)
{
ALT_FLUSH_TLB_PAGE(__asm__ __volatile__ ("sfence.vma %0" : : "r" (addr) : "memory"));
}
-#else /* CONFIG_MMU */
-#define local_flush_tlb_all() do { } while (0)
-#define local_flush_tlb_page(addr) do { } while (0)
-#endif /* CONFIG_MMU */

-#if defined(CONFIG_SMP) && defined(CONFIG_MMU)
void flush_tlb_all(void);
void flush_tlb_mm(struct mm_struct *mm);
void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
@@ -54,27 +49,8 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
unsigned long uaddr);
void arch_flush_tlb_batched_pending(struct mm_struct *mm);
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
-
-#else /* CONFIG_SMP && CONFIG_MMU */
-
-#define flush_tlb_all() local_flush_tlb_all()
-#define flush_tlb_page(vma, addr) local_flush_tlb_page(addr)
-
-static inline void flush_tlb_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
-{
- local_flush_tlb_all();
-}
-
-/* Flush a range of kernel pages */
-static inline void flush_tlb_kernel_range(unsigned long start,
- unsigned long end)
-{
- local_flush_tlb_all();
-}
-
-#define flush_tlb_mm(mm) flush_tlb_all()
-#define flush_tlb_mm_range(mm, start, end, page_size) flush_tlb_all()
-#endif /* !CONFIG_SMP || !CONFIG_MMU */
+#else /* CONFIG_MMU */
+#define local_flush_tlb_all() do { } while (0)
+#endif /* CONFIG_MMU */

#endif /* _ASM_RISCV_TLBFLUSH_H */
diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
index 2c869f8026a8..cbe4d775ef56 100644
--- a/arch/riscv/mm/Makefile
+++ b/arch/riscv/mm/Makefile
@@ -13,14 +13,11 @@ endif
KCOV_INSTRUMENT_init.o := n

obj-y += init.o
-obj-$(CONFIG_MMU) += extable.o fault.o pageattr.o pgtable.o
+obj-$(CONFIG_MMU) += extable.o fault.o pageattr.o pgtable.o tlbflush.o
obj-y += cacheflush.o
obj-y += context.o
obj-y += pmem.o

-ifeq ($(CONFIG_MMU),y)
-obj-$(CONFIG_SMP) += tlbflush.o
-endif
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
obj-$(CONFIG_KASAN) += kasan_init.o
--
2.43.1


2024-02-29 23:24:07

by Samuel Holland

Subject: [PATCH v5 07/13] riscv: Apply SiFive CIP-1200 workaround to single-ASID sfence.vma

commit 3f1e782998cd ("riscv: add ASID-based tlbflushing methods") added
calls to the sfence.vma instruction with rs2 != x0. These single-ASID
instruction variants are also affected by SiFive errata CIP-1200.

Until now, the errata workaround was not needed for the single-ASID
sfence.vma variants, because they were only used when the ASID allocator
was enabled, and the affected SiFive platforms do not support multiple
ASIDs. However, we are going to start using those sfence.vma variants
regardless of ASID support, so now we need alternatives covering them.
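
For reference, the runtime effect of the new alternatives (a sketch; the
macros themselves are in the diff below):

  /*
   * local_flush_tlb_page_asid(addr, asid) emits, via ALTERNATIVE():
   *
   *   unaffected hardware:  sfence.vma addr, asid   (one page, one ASID)
   *   CIP-1200 hardware:    sfence.vma              (patched to a full,
   *                                                   global flush)
   *
   * so the single-ASID helpers remain safe to call unconditionally.
   */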

Signed-off-by: Samuel Holland <[email protected]>
---

(no changes since v2)

Changes in v2:
- Rebase on Alexandre's "riscv: tlb flush improvements" series v5

arch/riscv/include/asm/errata_list.h | 12 +++++++++++-
arch/riscv/include/asm/tlbflush.h | 19 ++++++++++++++++++-
arch/riscv/mm/tlbflush.c | 23 -----------------------
3 files changed, 29 insertions(+), 25 deletions(-)

diff --git a/arch/riscv/include/asm/errata_list.h b/arch/riscv/include/asm/errata_list.h
index ea33288f8a25..6e2762101968 100644
--- a/arch/riscv/include/asm/errata_list.h
+++ b/arch/riscv/include/asm/errata_list.h
@@ -43,11 +43,21 @@ ALTERNATIVE(__stringify(RISCV_PTR do_page_fault), \
CONFIG_ERRATA_SIFIVE_CIP_453)
#else /* !__ASSEMBLY__ */

-#define ALT_FLUSH_TLB_PAGE(x) \
+#define ALT_SFENCE_VMA_ASID(asid) \
+asm(ALTERNATIVE("sfence.vma x0, %0", "sfence.vma", SIFIVE_VENDOR_ID, \
+ ERRATA_SIFIVE_CIP_1200, CONFIG_ERRATA_SIFIVE_CIP_1200) \
+ : : "r" (asid) : "memory")
+
+#define ALT_SFENCE_VMA_ADDR(addr) \
asm(ALTERNATIVE("sfence.vma %0", "sfence.vma", SIFIVE_VENDOR_ID, \
ERRATA_SIFIVE_CIP_1200, CONFIG_ERRATA_SIFIVE_CIP_1200) \
: : "r" (addr) : "memory")

+#define ALT_SFENCE_VMA_ADDR_ASID(addr, asid) \
+asm(ALTERNATIVE("sfence.vma %0, %1", "sfence.vma", SIFIVE_VENDOR_ID, \
+ ERRATA_SIFIVE_CIP_1200, CONFIG_ERRATA_SIFIVE_CIP_1200) \
+ : : "r" (addr), "r" (asid) : "memory")
+
/*
* _val is marked as "will be overwritten", so need to set it to 0
* in the default case.
diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index 4f86424b1ba5..463b615d7728 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -22,10 +22,27 @@ static inline void local_flush_tlb_all(void)
__asm__ __volatile__ ("sfence.vma" : : : "memory");
}

+static inline void local_flush_tlb_all_asid(unsigned long asid)
+{
+ if (asid != FLUSH_TLB_NO_ASID)
+ ALT_SFENCE_VMA_ASID(asid);
+ else
+ local_flush_tlb_all();
+}
+
/* Flush one page from local TLB */
static inline void local_flush_tlb_page(unsigned long addr)
{
- ALT_FLUSH_TLB_PAGE(__asm__ __volatile__ ("sfence.vma %0" : : "r" (addr) : "memory"));
+ ALT_SFENCE_VMA_ADDR(addr);
+}
+
+static inline void local_flush_tlb_page_asid(unsigned long addr,
+ unsigned long asid)
+{
+ if (asid != FLUSH_TLB_NO_ASID)
+ ALT_SFENCE_VMA_ADDR_ASID(addr, asid);
+ else
+ local_flush_tlb_page(addr);
}

void flush_tlb_all(void);
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 69402c260a89..365e0a0e4725 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -7,29 +7,6 @@
#include <asm/sbi.h>
#include <asm/mmu_context.h>

-static inline void local_flush_tlb_all_asid(unsigned long asid)
-{
- if (asid != FLUSH_TLB_NO_ASID)
- __asm__ __volatile__ ("sfence.vma x0, %0"
- :
- : "r" (asid)
- : "memory");
- else
- local_flush_tlb_all();
-}
-
-static inline void local_flush_tlb_page_asid(unsigned long addr,
- unsigned long asid)
-{
- if (asid != FLUSH_TLB_NO_ASID)
- __asm__ __volatile__ ("sfence.vma %0, %1"
- :
- : "r" (addr), "r" (asid)
- : "memory");
- else
- local_flush_tlb_page(addr);
-}
-
/*
* Flush entire TLB if number of entries to be flushed is greater
* than the threshold below.
--
2.43.1


2024-02-29 23:24:14

by Samuel Holland

Subject: [PATCH v5 08/13] riscv: Avoid TLB flush loops when affected by SiFive CIP-1200

Since implementations affected by SiFive errata CIP-1200 always use the
global variant of the sfence.vma instruction, they only need to execute
the instruction once. The range-based loop only hurts performance.
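
For context, a sketch of the loop in question (simplified from
local_flush_tlb_range_threshold_asid() in tlbflush.c). Setting
tlb_flush_all_threshold to 0 on affected parts makes the full-flush branch
unconditional, so the per-page loop, each iteration of which is already
patched into a full flush by the CIP-1200 alternatives, never runs:

  unsigned long nr_ptes_in_range = DIV_ROUND_UP(size, stride);

  if (nr_ptes_in_range > tlb_flush_all_threshold) {  /* always taken when
                                                        the threshold is 0 */
          local_flush_tlb_all_asid(asid);
          return;
  }

  for (i = 0; i < nr_ptes_in_range; ++i) {
          local_flush_tlb_page_asid(start, asid);    /* a full flush on
                                                        CIP-1200 parts */
          start += stride;
  }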

Signed-off-by: Samuel Holland <[email protected]>
---

(no changes since v4)

Changes in v4:
- Only set tlb_flush_all_threshold when CONFIG_MMU=y.

Changes in v3:
- New patch for v3

arch/riscv/errata/sifive/errata.c | 5 +++++
arch/riscv/include/asm/tlbflush.h | 2 ++
arch/riscv/mm/tlbflush.c | 2 +-
3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/errata/sifive/errata.c b/arch/riscv/errata/sifive/errata.c
index 3d9a32d791f7..716cfedad3a2 100644
--- a/arch/riscv/errata/sifive/errata.c
+++ b/arch/riscv/errata/sifive/errata.c
@@ -42,6 +42,11 @@ static bool errata_cip_1200_check_func(unsigned long arch_id, unsigned long imp
return false;
if ((impid & 0xffffff) > 0x200630 || impid == 0x1200626)
return false;
+
+#ifdef CONFIG_MMU
+ tlb_flush_all_threshold = 0;
+#endif
+
return true;
}

diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index 463b615d7728..8e329721375b 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -66,6 +66,8 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
unsigned long uaddr);
void arch_flush_tlb_batched_pending(struct mm_struct *mm);
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+
+extern unsigned long tlb_flush_all_threshold;
#else /* CONFIG_MMU */
#define local_flush_tlb_all() do { } while (0)
#endif /* CONFIG_MMU */
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 365e0a0e4725..22870f213188 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -11,7 +11,7 @@
* Flush entire TLB if number of entries to be flushed is greater
* than the threshold below.
*/
-static unsigned long tlb_flush_all_threshold __read_mostly = 64;
+unsigned long tlb_flush_all_threshold __read_mostly = 64;

static void local_flush_tlb_range_threshold_asid(unsigned long start,
unsigned long size,
--
2.43.1


2024-02-29 23:24:30

by Samuel Holland

Subject: [PATCH v5 09/13] riscv: mm: Introduce cntx2asid/cntx2version helper macros

When using the ASID allocator, the MM context ID contains two values:
the ASID in the lower bits, and the allocator version number in the
remaining bits. Use macros to make this separation more obvious.
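
For reference, the layout the macros describe (sketch):

  /*
   * mm->context.id, when the ASID allocator is in use:
   *
   *   63 (or 31)           asid_bits | asid_bits-1            0
   *  +-------------------------------+------------------------+
   *  |       allocator version       |          ASID          |
   *  +-------------------------------+------------------------+
   *
   *   cntx2asid(cntx)    = cntx &  asid_mask
   *   cntx2version(cntx) = cntx & ~asid_mask
   */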

Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Samuel Holland <[email protected]>
---

(no changes since v1)

arch/riscv/include/asm/mmu.h | 3 +++
arch/riscv/mm/context.c | 12 ++++++------
arch/riscv/mm/tlbflush.c | 2 +-
3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/riscv/include/asm/mmu.h b/arch/riscv/include/asm/mmu.h
index 355504b37f8e..a550fbf770be 100644
--- a/arch/riscv/include/asm/mmu.h
+++ b/arch/riscv/include/asm/mmu.h
@@ -26,6 +26,9 @@ typedef struct {
#endif
} mm_context_t;

+#define cntx2asid(cntx) ((cntx) & asid_mask)
+#define cntx2version(cntx) ((cntx) & ~asid_mask)
+
void __init create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa,
phys_addr_t sz, pgprot_t prot);
#endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
index ba8eb3944687..b562b3c44487 100644
--- a/arch/riscv/mm/context.c
+++ b/arch/riscv/mm/context.c
@@ -81,7 +81,7 @@ static void __flush_context(void)
if (cntx == 0)
cntx = per_cpu(reserved_context, i);

- __set_bit(cntx & asid_mask, context_asid_map);
+ __set_bit(cntx2asid(cntx), context_asid_map);
per_cpu(reserved_context, i) = cntx;
}

@@ -102,7 +102,7 @@ static unsigned long __new_context(struct mm_struct *mm)
lockdep_assert_held(&context_lock);

if (cntx != 0) {
- unsigned long newcntx = ver | (cntx & asid_mask);
+ unsigned long newcntx = ver | cntx2asid(cntx);

/*
* If our current CONTEXT was active during a rollover, we
@@ -115,7 +115,7 @@ static unsigned long __new_context(struct mm_struct *mm)
* We had a valid CONTEXT in a previous life, so try to
* re-use it if possible.
*/
- if (!__test_and_set_bit(cntx & asid_mask, context_asid_map))
+ if (!__test_and_set_bit(cntx2asid(cntx), context_asid_map))
return newcntx;
}

@@ -168,7 +168,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
*/
old_active_cntx = atomic_long_read(&per_cpu(active_context, cpu));
if (old_active_cntx &&
- ((cntx & ~asid_mask) == atomic_long_read(&current_version)) &&
+ (cntx2version(cntx) == atomic_long_read(&current_version)) &&
atomic_long_cmpxchg_relaxed(&per_cpu(active_context, cpu),
old_active_cntx, cntx))
goto switch_mm_fast;
@@ -177,7 +177,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)

/* Check that our ASID belongs to the current_version. */
cntx = atomic_long_read(&mm->context.id);
- if ((cntx & ~asid_mask) != atomic_long_read(&current_version)) {
+ if (cntx2version(cntx) != atomic_long_read(&current_version)) {
cntx = __new_context(mm);
atomic_long_set(&mm->context.id, cntx);
}
@@ -191,7 +191,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)

switch_mm_fast:
csr_write(CSR_SATP, virt_to_pfn(mm->pgd) |
- ((cntx & asid_mask) << SATP_ASID_SHIFT) |
+ (cntx2asid(cntx) << SATP_ASID_SHIFT) |
satp_mode);

if (need_flush_tlb)
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 22870f213188..e194e14e5b2b 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -109,7 +109,7 @@ static void __flush_tlb_range(struct cpumask *cmask, unsigned long asid,
static inline unsigned long get_mm_asid(struct mm_struct *mm)
{
return static_branch_unlikely(&use_asid_allocator) ?
- atomic_long_read(&mm->context.id) & asid_mask : FLUSH_TLB_NO_ASID;
+ cntx2asid(atomic_long_read(&mm->context.id)) : FLUSH_TLB_NO_ASID;
}

void flush_tlb_mm(struct mm_struct *mm)
--
2.43.1


2024-02-29 23:24:55

by Samuel Holland

Subject: [PATCH v5 10/13] riscv: mm: Use a fixed layout for the MM context ID

Currently, the size of the ASID field in the MM context ID dynamically
depends on the number of hardware-supported ASID bits. This requires
reading a global variable to extract either field from the context ID.
Instead, allocate the maximum possible number of bits to the ASID field,
so the layout of the context ID is known at compile-time.
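
Concretely: the ASID field is now always SATP_ASID_BITS wide (16 bits on
rv64, 9 on rv32), regardless of how many ASID bits the hardware implements,
so the split is a compile-time constant:

  #define cntx2asid(cntx)    ((cntx) &  SATP_ASID_MASK)  /* low 16 (rv64) or
                                                            9 (rv32) bits */
  #define cntx2version(cntx) ((cntx) & ~SATP_ASID_MASK)  /* everything above */

  /* ...and the version now advances in steps of BIT(SATP_ASID_BITS)
   * instead of the runtime num_asids. */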

Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Samuel Holland <[email protected]>
---

(no changes since v1)

arch/riscv/include/asm/mmu.h | 4 ++--
arch/riscv/include/asm/tlbflush.h | 2 --
arch/riscv/mm/context.c | 6 ++----
3 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/riscv/include/asm/mmu.h b/arch/riscv/include/asm/mmu.h
index a550fbf770be..dc0273f7905f 100644
--- a/arch/riscv/include/asm/mmu.h
+++ b/arch/riscv/include/asm/mmu.h
@@ -26,8 +26,8 @@ typedef struct {
#endif
} mm_context_t;

-#define cntx2asid(cntx) ((cntx) & asid_mask)
-#define cntx2version(cntx) ((cntx) & ~asid_mask)
+#define cntx2asid(cntx) ((cntx) & SATP_ASID_MASK)
+#define cntx2version(cntx) ((cntx) & ~SATP_ASID_MASK)

void __init create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa,
phys_addr_t sz, pgprot_t prot);
diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index 8e329721375b..72e559934952 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -15,8 +15,6 @@
#define FLUSH_TLB_NO_ASID ((unsigned long)-1)

#ifdef CONFIG_MMU
-extern unsigned long asid_mask;
-
static inline void local_flush_tlb_all(void)
{
__asm__ __volatile__ ("sfence.vma" : : : "memory");
diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
index b562b3c44487..5315af06cd4d 100644
--- a/arch/riscv/mm/context.c
+++ b/arch/riscv/mm/context.c
@@ -22,7 +22,6 @@ DEFINE_STATIC_KEY_FALSE(use_asid_allocator);

static unsigned long asid_bits;
static unsigned long num_asids;
-unsigned long asid_mask;

static atomic_long_t current_version;

@@ -128,7 +127,7 @@ static unsigned long __new_context(struct mm_struct *mm)
goto set_asid;

/* We're out of ASIDs, so increment current_version */
- ver = atomic_long_add_return_relaxed(num_asids, &current_version);
+ ver = atomic_long_add_return_relaxed(BIT(SATP_ASID_BITS), &current_version);

/* Flush everything */
__flush_context();
@@ -247,7 +246,6 @@ static int __init asids_init(void)
/* Pre-compute ASID details */
if (asid_bits) {
num_asids = 1 << asid_bits;
- asid_mask = num_asids - 1;
}

/*
@@ -255,7 +253,7 @@ static int __init asids_init(void)
* at-least twice more than CPUs
*/
if (num_asids > (2 * num_possible_cpus())) {
- atomic_long_set(&current_version, num_asids);
+ atomic_long_set(&current_version, BIT(SATP_ASID_BITS));

context_asid_map = bitmap_zalloc(num_asids, GFP_KERNEL);
if (!context_asid_map)
--
2.43.1


2024-02-29 23:25:07

by Samuel Holland

Subject: [PATCH v5 12/13] riscv: mm: Preserve global TLB entries when switching contexts

If the CPU does not support multiple ASIDs, all MM contexts use ASID 0.
In this case, it is still beneficial to flush the TLB by ASID, as the
single-ASID variant of the sfence.vma instruction preserves TLB entries
for global (kernel) pages.

This optimization is recommended by the RISC-V privileged specification:

  If the implementation does not provide ASIDs, or software chooses
  to always use ASID 0, then after every satp write, software should
  execute SFENCE.VMA with rs1=x0. In the common case that no global
  translations have been modified, rs2 should be set to a register
  other than x0 but which contains the value zero, so that global
  translations are not flushed.

It is not possible to apply this optimization when using the ASID
allocator, because that code must flush the TLB for all ASIDs at once
when incrementing the version number.
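
The resulting sequence for the no-allocator case is then (a sketch,
combining this patch with the ALT_SFENCE_VMA_ASID helper added earlier in
the series):

  csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | satp_mode);
  local_flush_tlb_all_asid(0);    /* emits "sfence.vma x0, <reg holding 0>":
                                     flushes ASID 0 but preserves global
                                     (G-bit) kernel mappings in the TLB */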

Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Samuel Holland <[email protected]>
---

(no changes since v1)

arch/riscv/mm/context.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
index 0bf6d0070a14..60cb0b82240e 100644
--- a/arch/riscv/mm/context.c
+++ b/arch/riscv/mm/context.c
@@ -200,7 +200,7 @@ static void set_mm_noasid(struct mm_struct *mm)
{
/* Switch the page table and blindly nuke entire local TLB */
csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | satp_mode);
- local_flush_tlb_all();
+ local_flush_tlb_all_asid(0);
}

static inline void set_mm(struct mm_struct *prev,
--
2.43.1


2024-02-29 23:25:20

by Samuel Holland

Subject: [PATCH v5 13/13] riscv: mm: Always use an ASID to flush mm contexts

Even if multiple ASIDs are not supported, using the single-ASID variant
of the sfence.vma instruction preserves TLB entries for global (kernel)
pages. So it is always more efficient to use the single-ASID code path.
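
For context (a reading of the surrounding code, not spelled out in the
patch): when the ASID allocator is disabled, mm->context.id stays at its
initial value of 0, so this returns ASID 0, matching the ASID that
set_mm_noasid() runs under, and the flush still skips global entries:

  static inline unsigned long get_mm_asid(struct mm_struct *mm)
  {
          /* With the allocator: the currently allocated ASID.
           * Without it: context.id == 0, so this is always ASID 0. */
          return cntx2asid(atomic_long_read(&mm->context.id));
  }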

Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Samuel Holland <[email protected]>
---

Changes in v5:
- Leave use_asid_allocator declared in asm/mmu_context.h

Changes in v4:
- There is now only one copy of __flush_tlb_range()

Changes in v2:
- Update both copies of __flush_tlb_range()

arch/riscv/mm/tlbflush.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index e194e14e5b2b..5b473588a985 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -108,8 +108,7 @@ static void __flush_tlb_range(struct cpumask *cmask, unsigned long asid,

static inline unsigned long get_mm_asid(struct mm_struct *mm)
{
- return static_branch_unlikely(&use_asid_allocator) ?
- cntx2asid(atomic_long_read(&mm->context.id)) : FLUSH_TLB_NO_ASID;
+ return cntx2asid(atomic_long_read(&mm->context.id));
}

void flush_tlb_mm(struct mm_struct *mm)
--
2.43.1


2024-02-29 23:25:33

by Samuel Holland

Subject: [PATCH v5 11/13] riscv: mm: Make asid_bits a local variable

This variable is only used inside asids_init().

Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Samuel Holland <[email protected]>
---

(no changes since v1)

arch/riscv/mm/context.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
index 5315af06cd4d..0bf6d0070a14 100644
--- a/arch/riscv/mm/context.c
+++ b/arch/riscv/mm/context.c
@@ -20,7 +20,6 @@

DEFINE_STATIC_KEY_FALSE(use_asid_allocator);

-static unsigned long asid_bits;
static unsigned long num_asids;

static atomic_long_t current_version;
@@ -226,7 +225,7 @@ static inline void set_mm(struct mm_struct *prev,

static int __init asids_init(void)
{
- unsigned long old;
+ unsigned long asid_bits, old;

/* Figure-out number of ASID bits in HW */
old = csr_read(CSR_SATP);
--
2.43.1


2024-03-01 02:12:50

by yunhui cui

Subject: Re: [External] [PATCH v5 06/13] riscv: mm: Combine the SMP and UP TLB flush code

Hi Samuel,

On Fri, Mar 1, 2024 at 7:22 AM Samuel Holland <[email protected]> wrote:
>
> In SMP configurations, all TLB flushing narrower than flush_tlb_all()
> goes through __flush_tlb_range(). Do the same in UP configurations.
>
> This allows UP configurations to take advantage of recent improvements
> to the code in tlbflush.c, such as support for huge pages and flushing
> multiple-page ranges.
>
> Reviewed-by: Alexandre Ghiti <[email protected]>
> Signed-off-by: Samuel Holland <[email protected]>
> ---
>
> (no changes since v4)
>
> Changes in v4:
> - Merge the two copies of __flush_tlb_range() and rely on the compiler
> to optimize out the broadcast path (both clang and gcc do this)
> - Merge the two copies of flush_tlb_all() and rely on constant folding
>
> Changes in v2:
> - Move the SMP/UP merge earlier in the series to avoid build issues
> - Make a copy of __flush_tlb_range() instead of adding ifdefs inside
> - local_flush_tlb_all() is the only function used on !MMU (smpboot.c)
>
> arch/riscv/Kconfig | 2 +-
> arch/riscv/include/asm/tlbflush.h | 30 +++---------------------------
> arch/riscv/mm/Makefile | 5 +----
> 3 files changed, 5 insertions(+), 32 deletions(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 0bfcfec67ed5..de9b6f2279ff 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -60,7 +60,7 @@ config RISCV
> select ARCH_USE_MEMTEST
> select ARCH_USE_QUEUED_RWLOCKS
> select ARCH_USES_CFI_TRAPS if CFI_CLANG
> - select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if SMP && MMU
> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if MMU
> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
> select ARCH_WANT_FRAME_POINTERS
> select ARCH_WANT_GENERAL_HUGETLB if !RISCV_ISA_SVNAPOT
> diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
> index 928f096dca21..4f86424b1ba5 100644
> --- a/arch/riscv/include/asm/tlbflush.h
> +++ b/arch/riscv/include/asm/tlbflush.h
> @@ -27,12 +27,7 @@ static inline void local_flush_tlb_page(unsigned long addr)
> {
> ALT_FLUSH_TLB_PAGE(__asm__ __volatile__ ("sfence.vma %0" : : "r" (addr) : "memory"));
> }
> -#else /* CONFIG_MMU */
> -#define local_flush_tlb_all() do { } while (0)
> -#define local_flush_tlb_page(addr) do { } while (0)
> -#endif /* CONFIG_MMU */
>
> -#if defined(CONFIG_SMP) && defined(CONFIG_MMU)
> void flush_tlb_all(void);
> void flush_tlb_mm(struct mm_struct *mm);
> void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> @@ -54,27 +49,8 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> unsigned long uaddr);
> void arch_flush_tlb_batched_pending(struct mm_struct *mm);
> void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
> -
> -#else /* CONFIG_SMP && CONFIG_MMU */
> -
> -#define flush_tlb_all() local_flush_tlb_all()
> -#define flush_tlb_page(vma, addr) local_flush_tlb_page(addr)
> -
> -static inline void flush_tlb_range(struct vm_area_struct *vma,
> - unsigned long start, unsigned long end)
> -{
> - local_flush_tlb_all();
> -}
> -
> -/* Flush a range of kernel pages */
> -static inline void flush_tlb_kernel_range(unsigned long start,
> - unsigned long end)
> -{
> - local_flush_tlb_all();
> -}
> -
> -#define flush_tlb_mm(mm) flush_tlb_all()
> -#define flush_tlb_mm_range(mm, start, end, page_size) flush_tlb_all()
> -#endif /* !CONFIG_SMP || !CONFIG_MMU */
> +#else /* CONFIG_MMU */
> +#define local_flush_tlb_all() do { } while (0)
> +#endif /* CONFIG_MMU */
>
> #endif /* _ASM_RISCV_TLBFLUSH_H */
> diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
> index 2c869f8026a8..cbe4d775ef56 100644
> --- a/arch/riscv/mm/Makefile
> +++ b/arch/riscv/mm/Makefile
> @@ -13,14 +13,11 @@ endif
> KCOV_INSTRUMENT_init.o := n
>
> obj-y += init.o
> -obj-$(CONFIG_MMU) += extable.o fault.o pageattr.o pgtable.o
> +obj-$(CONFIG_MMU) += extable.o fault.o pageattr.o pgtable.o tlbflush.o
> obj-y += cacheflush.o
> obj-y += context.o
> obj-y += pmem.o
>
> -ifeq ($(CONFIG_MMU),y)
> -obj-$(CONFIG_SMP) += tlbflush.o
> -endif
> obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
> obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
> obj-$(CONFIG_KASAN) += kasan_init.o
> --
> 2.43.1
>

Applying the patch with git am failed. Was it based on the top commit of linux-next?

Thanks,
Yunhui

2024-03-01 02:34:42

by Samuel Holland

Subject: Re: [External] [PATCH v5 06/13] riscv: mm: Combine the SMP and UP TLB flush code

Hi Yunhui,

On 2024-02-29 8:12 PM, yunhui cui wrote:
> On Fri, Mar 1, 2024 at 7:22 AM Samuel Holland <[email protected]> wrote:
>>
>> In SMP configurations, all TLB flushing narrower than flush_tlb_all()
>> goes through __flush_tlb_range(). Do the same in UP configurations.
>>
>> This allows UP configurations to take advantage of recent improvements
>> to the code in tlbflush.c, such as support for huge pages and flushing
>> multiple-page ranges.
>>
>> Reviewed-by: Alexandre Ghiti <[email protected]>
>> Signed-off-by: Samuel Holland <[email protected]>
>> ---
>>
>> (no changes since v4)
>>
>> Changes in v4:
>> - Merge the two copies of __flush_tlb_range() and rely on the compiler
>> to optimize out the broadcast path (both clang and gcc do this)
>> - Merge the two copies of flush_tlb_all() and rely on constant folding
>>
>> Changes in v2:
>> - Move the SMP/UP merge earlier in the series to avoid build issues
>> - Make a copy of __flush_tlb_range() instead of adding ifdefs inside
>> - local_flush_tlb_all() is the only function used on !MMU (smpboot.c)
>>
>> arch/riscv/Kconfig | 2 +-
>> arch/riscv/include/asm/tlbflush.h | 30 +++---------------------------
>> arch/riscv/mm/Makefile | 5 +----
>> 3 files changed, 5 insertions(+), 32 deletions(-)
>>
>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>> index 0bfcfec67ed5..de9b6f2279ff 100644
>> --- a/arch/riscv/Kconfig
>> +++ b/arch/riscv/Kconfig
>> @@ -60,7 +60,7 @@ config RISCV
>> select ARCH_USE_MEMTEST
>> select ARCH_USE_QUEUED_RWLOCKS
>> select ARCH_USES_CFI_TRAPS if CFI_CLANG
>> - select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if SMP && MMU
>> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if MMU
>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
>> select ARCH_WANT_FRAME_POINTERS
>> select ARCH_WANT_GENERAL_HUGETLB if !RISCV_ISA_SVNAPOT
>> diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
>> index 928f096dca21..4f86424b1ba5 100644
>> --- a/arch/riscv/include/asm/tlbflush.h
>> +++ b/arch/riscv/include/asm/tlbflush.h
>> @@ -27,12 +27,7 @@ static inline void local_flush_tlb_page(unsigned long addr)
>> {
>> ALT_FLUSH_TLB_PAGE(__asm__ __volatile__ ("sfence.vma %0" : : "r" (addr) : "memory"));
>> }
>> -#else /* CONFIG_MMU */
>> -#define local_flush_tlb_all() do { } while (0)
>> -#define local_flush_tlb_page(addr) do { } while (0)
>> -#endif /* CONFIG_MMU */
>>
>> -#if defined(CONFIG_SMP) && defined(CONFIG_MMU)
>> void flush_tlb_all(void);
>> void flush_tlb_mm(struct mm_struct *mm);
>> void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>> @@ -54,27 +49,8 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
>> unsigned long uaddr);
>> void arch_flush_tlb_batched_pending(struct mm_struct *mm);
>> void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
>> -
>> -#else /* CONFIG_SMP && CONFIG_MMU */
>> -
>> -#define flush_tlb_all() local_flush_tlb_all()
>> -#define flush_tlb_page(vma, addr) local_flush_tlb_page(addr)
>> -
>> -static inline void flush_tlb_range(struct vm_area_struct *vma,
>> - unsigned long start, unsigned long end)
>> -{
>> - local_flush_tlb_all();
>> -}
>> -
>> -/* Flush a range of kernel pages */
>> -static inline void flush_tlb_kernel_range(unsigned long start,
>> - unsigned long end)
>> -{
>> - local_flush_tlb_all();
>> -}
>> -
>> -#define flush_tlb_mm(mm) flush_tlb_all()
>> -#define flush_tlb_mm_range(mm, start, end, page_size) flush_tlb_all()
>> -#endif /* !CONFIG_SMP || !CONFIG_MMU */
>> +#else /* CONFIG_MMU */
>> +#define local_flush_tlb_all() do { } while (0)
>> +#endif /* CONFIG_MMU */
>>
>> #endif /* _ASM_RISCV_TLBFLUSH_H */
>> diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
>> index 2c869f8026a8..cbe4d775ef56 100644
>> --- a/arch/riscv/mm/Makefile
>> +++ b/arch/riscv/mm/Makefile
>> @@ -13,14 +13,11 @@ endif
>> KCOV_INSTRUMENT_init.o := n
>>
>> obj-y += init.o
>> -obj-$(CONFIG_MMU) += extable.o fault.o pageattr.o pgtable.o
>> +obj-$(CONFIG_MMU) += extable.o fault.o pageattr.o pgtable.o tlbflush.o
>> obj-y += cacheflush.o
>> obj-y += context.o
>> obj-y += pmem.o
>>
>> -ifeq ($(CONFIG_MMU),y)
>> -obj-$(CONFIG_SMP) += tlbflush.o
>> -endif
>> obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
>> obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
>> obj-$(CONFIG_KASAN) += kasan_init.o
>> --
>> 2.43.1
>>
>
> git am the patch failed. Was it a patch based on the top commit of linux-next ?

This series is based on the for-next branch of riscv.git, which is where it
would be applied. There is a conflict with commit d9807d60c145 ("riscv: mm:
execute local TLB flush after populating vmemmap") in the riscv.git fixes
branch, which added a uniprocessor-specific local_flush_tlb_kernel_range()
definition. The appropriate merge conflict resolution is to remove that new
macro, i.e. take the version of the file from this patch series.

Regards,
Samuel


2024-03-01 02:49:12

by yunhui cui

Subject: Re: [External] [PATCH v5 08/13] riscv: Avoid TLB flush loops when affected by SiFive CIP-1200

Hi Samuel,

On Fri, Mar 1, 2024 at 7:22 AM Samuel Holland <[email protected]> wrote:
>
> Since implementations affected by SiFive errata CIP-1200 always use the
> global variant of the sfence.vma instruction, they only need to execute
> the instruction once. The range-based loop only hurts performance.
>
> Signed-off-by: Samuel Holland <[email protected]>
> ---
>
> (no changes since v4)
>
> Changes in v4:
> - Only set tlb_flush_all_threshold when CONFIG_MMU=y.
>
> Changes in v3:
> - New patch for v3
>
> arch/riscv/errata/sifive/errata.c | 5 +++++
> arch/riscv/include/asm/tlbflush.h | 2 ++
> arch/riscv/mm/tlbflush.c | 2 +-
> 3 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/riscv/errata/sifive/errata.c b/arch/riscv/errata/sifive/errata.c
> index 3d9a32d791f7..716cfedad3a2 100644
> --- a/arch/riscv/errata/sifive/errata.c
> +++ b/arch/riscv/errata/sifive/errata.c
> @@ -42,6 +42,11 @@ static bool errata_cip_1200_check_func(unsigned long arch_id, unsigned long imp
> return false;
> if ((impid & 0xffffff) > 0x200630 || impid == 0x1200626)
> return false;
> +
> +#ifdef CONFIG_MMU
> + tlb_flush_all_threshold = 0;
> +#endif
> +
> return true;
> }
>
> diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
> index 463b615d7728..8e329721375b 100644
> --- a/arch/riscv/include/asm/tlbflush.h
> +++ b/arch/riscv/include/asm/tlbflush.h
> @@ -66,6 +66,8 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> unsigned long uaddr);
> void arch_flush_tlb_batched_pending(struct mm_struct *mm);
> void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
> +
> +extern unsigned long tlb_flush_all_threshold;
> #else /* CONFIG_MMU */
> #define local_flush_tlb_all() do { } while (0)
> #endif /* CONFIG_MMU */
> diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
> index 365e0a0e4725..22870f213188 100644
> --- a/arch/riscv/mm/tlbflush.c
> +++ b/arch/riscv/mm/tlbflush.c
> @@ -11,7 +11,7 @@
> * Flush entire TLB if number of entries to be flushed is greater
> * than the threshold below.
> */
> -static unsigned long tlb_flush_all_threshold __read_mostly = 64;
> +unsigned long tlb_flush_all_threshold __read_mostly = 64;
>
> static void local_flush_tlb_range_threshold_asid(unsigned long start,
> unsigned long size,
> --
> 2.43.1
>

If local_flush_tlb_all_asid() is used every time, more page table walks
(PTWs) will be generated. Will this change definitely improve overall
performance?

Hi Alex, Samuel,
The cost of flush_xx_range_asid() grows roughly linearly with nr_ptes
(y = kx + b), while flush_all_asid() is independent of nr_ptes (y = c).
Some TLBs may optimize this: the flush-all operation itself takes very
few cycles, but there is a certain delay between consecutive flush-alls.
The intersection of the two lines is the optimal value of
tlb_flush_all_threshold. In practice, back-to-back flush_all_asid()
calls will not occur. The problem flush_all_asid() does cause is that
the flushed entries must all be refilled via PTWs, which adds latency.
Therefore, the value of tlb_flush_all_threshold needs to be considered
or quantified. Maybe doing local_flush_tlb_page_asid() based on the
actual nr_ptes_in_range would give better overall performance.
What do you think?

Thanks,
Yunhui

2024-03-01 09:32:23

by Conor Dooley

Subject: Re: [PATCH v5 00/13] riscv: ASID-related and UP-related TLB flush enhancements

On Thu, Feb 29, 2024 at 03:21:41PM -0800, Samuel Holland wrote:
> Samuel Holland (13):
> riscv: Flush the instruction cache during SMP bringup
> riscv: Factor out page table TLB synchronization

From here onwards, the build fails on 32-bit with a bunch of
implicit-function-declaration errors.

> riscv: Use IPIs for remote cache/TLB flushes by default
> riscv: mm: Broadcast kernel TLB flushes only when needed
> riscv: Only send remote fences when some other CPU is online
> riscv: mm: Combine the SMP and UP TLB flush code
> riscv: Apply SiFive CIP-1200 workaround to single-ASID sfence.vma
> riscv: Avoid TLB flush loops when affected by SiFive CIP-1200
> riscv: mm: Introduce cntx2asid/cntx2version helper macros
> riscv: mm: Use a fixed layout for the MM context ID
> riscv: mm: Make asid_bits a local variable
> riscv: mm: Preserve global TLB entries when switching contexts
> riscv: mm: Always use an ASID to flush mm contexts



2024-03-11 03:07:14

by Stefan O'Rear

Subject: Re: [PATCH v5 03/13] riscv: Use IPIs for remote cache/TLB flushes by default

On Thu, Feb 29, 2024, at 6:21 PM, Samuel Holland wrote:
> An IPI backend is always required in an SMP configuration, but an SBI
> implementation is not. For example, SBI will be unavailable when the
> kernel runs in M mode.
>
> Generally, IPIs are assumed to be faster than SBI calls due to the SBI
> context switch overhead. However, when SBI is used as the IPI backend,
> then the context switch cost must be paid anyway, and performing the
> cache/TLB flush directly in the SBI implementation is more efficient
> than injecting an interrupt into the kernel. This is the only scenario
> where riscv_ipi_set_virq_range()'s use_for_rfence parameter is false.
>
> Thus, it makes sense for remote fences to use IPIs by default, and make
> the SBI remote fence extension the special case.

The historical intention of providing SBI calls for remote fences is that
it abstracts over hardware that allows for performing remote fences
_without an IPI at all_. The TH1520 is an example of such hardware, since
it can (at least according to the documentation) perform broadcast fence,
fence.i, and sfence.vma operations using bits in the mhint register.

T-Head's public opensbi repository doesn't actually use this feature, and
in general SBI remote fences come from a much more optimistic time about
how much we can successfully hide from supervisor software. But I don't
think we can generalize that an IPI will always do less work than a SBI
remote fence.

-s

> sbi_ipi_init() already checks riscv_ipi_have_virq_range(), so it only
> calls riscv_ipi_set_virq_range() when no other IPI device is available.
> So we can move the static key and drop the use_for_rfence parameter.
>
> Furthermore, the static branch only makes sense when CONFIG_RISCV_SBI is
> enabled. Otherwise, IPIs must be used. Add a fallback definition of
> riscv_use_sbi_for_rfence() which handles this case and removes the need
> to check CONFIG_RISCV_SBI elsewhere, such as in cacheflush.c.
>
> Signed-off-by: Samuel Holland <[email protected]>
> ---
>
> Changes in v5:
> - Also switch to riscv_use_sbi_for_rfence() in asm/pgalloc.h
>
> Changes in v4:
> - New patch for v4
>
> arch/riscv/include/asm/pgalloc.h | 7 ++++---
> arch/riscv/include/asm/sbi.h | 4 ++++
> arch/riscv/include/asm/smp.h | 15 ++-------------
> arch/riscv/kernel/sbi-ipi.c | 11 ++++++++++-
> arch/riscv/kernel/smp.c | 11 +----------
> arch/riscv/mm/cacheflush.c | 5 ++---
> arch/riscv/mm/tlbflush.c | 31 ++++++++++++++-----------------
> drivers/clocksource/timer-clint.c | 2 +-
> 8 files changed, 38 insertions(+), 48 deletions(-)
>
> diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
> index 87468f67951a..6578054977ef 100644
> --- a/arch/riscv/include/asm/pgalloc.h
> +++ b/arch/riscv/include/asm/pgalloc.h
> @@ -8,6 +8,7 @@
> #define _ASM_RISCV_PGALLOC_H
>
> #include <linux/mm.h>
> +#include <asm/sbi.h>
> #include <asm/tlb.h>
>
> #ifdef CONFIG_MMU
> @@ -90,10 +91,10 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
>
> static inline void riscv_tlb_remove_ptdesc(struct mmu_gather *tlb, void *pt)
> {
> - if (riscv_use_ipi_for_rfence())
> - tlb_remove_page_ptdesc(tlb, pt);
> - else
> + if (riscv_use_sbi_for_rfence())
> tlb_remove_ptdesc(tlb, pt);
> + else
> + tlb_remove_page_ptdesc(tlb, pt);
> }
>
> #define pud_free pud_free
> diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
> index 6e68f8dff76b..ea84392ca9d7 100644
> --- a/arch/riscv/include/asm/sbi.h
> +++ b/arch/riscv/include/asm/sbi.h
> @@ -375,8 +375,12 @@ unsigned long riscv_cached_marchid(unsigned int cpu_id);
> unsigned long riscv_cached_mimpid(unsigned int cpu_id);
>
> #if IS_ENABLED(CONFIG_SMP) && IS_ENABLED(CONFIG_RISCV_SBI)
> +DECLARE_STATIC_KEY_FALSE(riscv_sbi_for_rfence);
> +#define riscv_use_sbi_for_rfence() \
> + static_branch_unlikely(&riscv_sbi_for_rfence)
> void sbi_ipi_init(void);
> #else
> +static inline bool riscv_use_sbi_for_rfence(void) { return false; }
> static inline void sbi_ipi_init(void) { }
> #endif
>
> diff --git a/arch/riscv/include/asm/smp.h b/arch/riscv/include/asm/smp.h
> index 0d555847cde6..7ac80e9f2288 100644
> --- a/arch/riscv/include/asm/smp.h
> +++ b/arch/riscv/include/asm/smp.h
> @@ -49,12 +49,7 @@ void riscv_ipi_disable(void);
> bool riscv_ipi_have_virq_range(void);
>
> /* Set the IPI interrupt numbers for arch (called by irqchip drivers) */
> -void riscv_ipi_set_virq_range(int virq, int nr, bool use_for_rfence);
> -
> -/* Check if we can use IPIs for remote FENCEs */
> -DECLARE_STATIC_KEY_FALSE(riscv_ipi_for_rfence);
> -#define riscv_use_ipi_for_rfence() \
> - static_branch_unlikely(&riscv_ipi_for_rfence)
> +void riscv_ipi_set_virq_range(int virq, int nr);
>
> /* Check other CPUs stop or not */
> bool smp_crash_stop_failed(void);
> @@ -104,16 +99,10 @@ static inline bool riscv_ipi_have_virq_range(void)
> return false;
> }
>
> -static inline void riscv_ipi_set_virq_range(int virq, int nr,
> - bool use_for_rfence)
> +static inline void riscv_ipi_set_virq_range(int virq, int nr)
> {
> }
>
> -static inline bool riscv_use_ipi_for_rfence(void)
> -{
> - return false;
> -}
> -
> #endif /* CONFIG_SMP */
>
> #if defined(CONFIG_HOTPLUG_CPU) && (CONFIG_SMP)
> diff --git a/arch/riscv/kernel/sbi-ipi.c b/arch/riscv/kernel/sbi-ipi.c
> index a4559695ce62..1026e22955cc 100644
> --- a/arch/riscv/kernel/sbi-ipi.c
> +++ b/arch/riscv/kernel/sbi-ipi.c
> @@ -13,6 +13,9 @@
> #include <linux/irqdomain.h>
> #include <asm/sbi.h>
>
> +DEFINE_STATIC_KEY_FALSE(riscv_sbi_for_rfence);
> +EXPORT_SYMBOL_GPL(riscv_sbi_for_rfence);
> +
> static int sbi_ipi_virq;
>
> static void sbi_ipi_handle(struct irq_desc *desc)
> @@ -72,6 +75,12 @@ void __init sbi_ipi_init(void)
> "irqchip/sbi-ipi:starting",
> sbi_ipi_starting_cpu, NULL);
>
> - riscv_ipi_set_virq_range(virq, BITS_PER_BYTE, false);
> + riscv_ipi_set_virq_range(virq, BITS_PER_BYTE);
> pr_info("providing IPIs using SBI IPI extension\n");
> +
> + /*
> + * Use the SBI remote fence extension to avoid
> + * the extra context switch needed to handle IPIs.
> + */
> + static_branch_enable(&riscv_sbi_for_rfence);
> }
> diff --git a/arch/riscv/kernel/smp.c b/arch/riscv/kernel/smp.c
> index 45dd4035416e..8e6eb64459af 100644
> --- a/arch/riscv/kernel/smp.c
> +++ b/arch/riscv/kernel/smp.c
> @@ -171,10 +171,7 @@ bool riscv_ipi_have_virq_range(void)
> return (ipi_virq_base) ? true : false;
> }
>
> -DEFINE_STATIC_KEY_FALSE(riscv_ipi_for_rfence);
> -EXPORT_SYMBOL_GPL(riscv_ipi_for_rfence);
> -
> -void riscv_ipi_set_virq_range(int virq, int nr, bool use_for_rfence)
> +void riscv_ipi_set_virq_range(int virq, int nr)
> {
> int i, err;
>
> @@ -197,12 +194,6 @@ void riscv_ipi_set_virq_range(int virq, int nr, bool use_for_rfence)
>
> /* Enabled IPIs for boot CPU immediately */
> riscv_ipi_enable();
> -
> - /* Update RFENCE static key */
> - if (use_for_rfence)
> - static_branch_enable(&riscv_ipi_for_rfence);
> - else
> - static_branch_disable(&riscv_ipi_for_rfence);
> }
>
> static const char * const ipi_names[] = {
> diff --git a/arch/riscv/mm/cacheflush.c b/arch/riscv/mm/cacheflush.c
> index 55a34f2020a8..47c485bc7df0 100644
> --- a/arch/riscv/mm/cacheflush.c
> +++ b/arch/riscv/mm/cacheflush.c
> @@ -21,7 +21,7 @@ void flush_icache_all(void)
> {
> local_flush_icache_all();
>
> - if (IS_ENABLED(CONFIG_RISCV_SBI) && !riscv_use_ipi_for_rfence())
> + if (riscv_use_sbi_for_rfence())
> sbi_remote_fence_i(NULL);
> else
> on_each_cpu(ipi_remote_fence_i, NULL, 1);
> @@ -69,8 +69,7 @@ void flush_icache_mm(struct mm_struct *mm, bool local)
> * with flush_icache_deferred().
> */
> smp_mb();
> - } else if (IS_ENABLED(CONFIG_RISCV_SBI) &&
> - !riscv_use_ipi_for_rfence()) {
> + } else if (riscv_use_sbi_for_rfence()) {
> sbi_remote_fence_i(&others);
> } else {
> on_each_cpu_mask(&others, ipi_remote_fence_i, NULL, 1);
> diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
> index 8d12b26f5ac3..0373661bd1c4 100644
> --- a/arch/riscv/mm/tlbflush.c
> +++ b/arch/riscv/mm/tlbflush.c
> @@ -78,10 +78,10 @@ static void __ipi_flush_tlb_all(void *info)
>
> void flush_tlb_all(void)
> {
> - if (riscv_use_ipi_for_rfence())
> - on_each_cpu(__ipi_flush_tlb_all, NULL, 1);
> - else
> + if (riscv_use_sbi_for_rfence())
> sbi_remote_sfence_vma_asid(NULL, 0, FLUSH_TLB_MAX_SIZE, FLUSH_TLB_NO_ASID);
> + else
> + on_each_cpu(__ipi_flush_tlb_all, NULL, 1);
> }
>
> struct flush_tlb_range_data {
> @@ -102,7 +102,6 @@ static void __flush_tlb_range(struct cpumask *cmask, unsigned long asid,
> unsigned long start, unsigned long size,
> unsigned long stride)
> {
> - struct flush_tlb_range_data ftd;
> bool broadcast;
>
> if (cpumask_empty(cmask))
> @@ -118,20 +117,18 @@ static void __flush_tlb_range(struct cpumask *cmask, unsigned long asid,
> broadcast = true;
> }
>
> - if (broadcast) {
> - if (riscv_use_ipi_for_rfence()) {
> - ftd.asid = asid;
> - ftd.start = start;
> - ftd.size = size;
> - ftd.stride = stride;
> - on_each_cpu_mask(cmask,
> - __ipi_flush_tlb_range_asid,
> - &ftd, 1);
> - } else
> - sbi_remote_sfence_vma_asid(cmask,
> - start, size, asid);
> - } else {
> + if (!broadcast) {
> local_flush_tlb_range_asid(start, size, stride, asid);
> + } else if (riscv_use_sbi_for_rfence()) {
> + sbi_remote_sfence_vma_asid(cmask, start, size, asid);
> + } else {
> + struct flush_tlb_range_data ftd;
> +
> + ftd.asid = asid;
> + ftd.start = start;
> + ftd.size = size;
> + ftd.stride = stride;
> + on_each_cpu_mask(cmask, __ipi_flush_tlb_range_asid, &ftd, 1);
> }
>
> if (cmask != cpu_online_mask)
> diff --git a/drivers/clocksource/timer-clint.c b/drivers/clocksource/timer-clint.c
> index 09fd292eb83d..0bdd9d7ec545 100644
> --- a/drivers/clocksource/timer-clint.c
> +++ b/drivers/clocksource/timer-clint.c
> @@ -251,7 +251,7 @@ static int __init clint_timer_init_dt(struct device_node *np)
> }
>
> irq_set_chained_handler(clint_ipi_irq, clint_ipi_interrupt);
> - riscv_ipi_set_virq_range(rc, BITS_PER_BYTE, true);
> + riscv_ipi_set_virq_range(rc, BITS_PER_BYTE);
> clint_clear_ipi();
> #endif
>
> --
> 2.43.1
>

2024-03-11 04:04:39

by Anup Patel

[permalink] [raw]
Subject: Re: [PATCH v5 03/13] riscv: Use IPIs for remote cache/TLB flushes by default

On Mon, Mar 11, 2024 at 8:37 AM Stefan O'Rear <[email protected]> wrote:
>
> On Thu, Feb 29, 2024, at 6:21 PM, Samuel Holland wrote:
> > An IPI backend is always required in an SMP configuration, but an SBI
> > implementation is not. For example, SBI will be unavailable when the
> > kernel runs in M mode.
> >
> > Generally, IPIs are assumed to be faster than SBI calls due to the SBI
> > context switch overhead. However, when SBI is used as the IPI backend,
> > then the context switch cost must be paid anyway, and performing the
> > cache/TLB flush directly in the SBI implementation is more efficient
> > than inserting an interrupt to the kernel. This is the only scenario
> > where riscv_ipi_set_virq_range()'s use_for_rfence parameter is false.
> >
> > Thus, it makes sense for remote fences to use IPIs by default, and make
> > the SBI remote fence extension the special case.
>
> The historical intention of providing SBI calls for remote fences is that
> it abstracts over hardware that allows for performing remote fences
> _without an IPI at all_. The TH1520 is an example of such hardware, since
> it can (at least according to the documentation) perform broadcast fence,
> fence.i, and sfence.vma operations using bits in the mhint register.
>
> T-Head's public opensbi repository doesn't actually use this feature, and
> in general SBI remote fences come from a much more optimistic time about
> how much we can successfully hide from supervisor software. But I don't
> think we can generalize that an IPI will always do less work than a SBI
> remote fence.

On platforms where SBI is the only way of injecting IPIs in S-mode,
SBI-based remote fences are actually much faster. This is because on
such platforms, injecting an IPI from one HART to a set of HARTs
requires multiple SBI calls, which can be directly replaced by one (or
a few) SBI remote fence calls.

Most of the current platforms still have an M-mode CLINT and depend on
SBI IPIs for S-mode IPI injection, so we should not slow down remote
fences on these platforms.

Until all platforms have moved to RISC-V AIA (which supports S-mode
IPIs), we should keep this boot-time choice of SBI RFENCEs versus
direct S-mode IPIs.

IMO, this patch is ahead of its time.

Regards,
Anup


2024-03-11 04:12:57

by Anup Patel

[permalink] [raw]
Subject: Re: [PATCH v5 03/13] riscv: Use IPIs for remote cache/TLB flushes by default

On Mon, Mar 11, 2024 at 9:34 AM Anup Patel <[email protected]> wrote:
>
> On Mon, Mar 11, 2024 at 8:37 AM Stefan O'Rear <[email protected]> wrote:
> >
> > On Thu, Feb 29, 2024, at 6:21 PM, Samuel Holland wrote:
> > > An IPI backend is always required in an SMP configuration, but an SBI
> > > implementation is not. For example, SBI will be unavailable when the
> > > kernel runs in M mode.
> > >
> > > Generally, IPIs are assumed to be faster than SBI calls due to the SBI
> > > context switch overhead. However, when SBI is used as the IPI backend,
> > > then the context switch cost must be paid anyway, and performing the
> > > cache/TLB flush directly in the SBI implementation is more efficient
> > > than inserting an interrupt to the kernel. This is the only scenario
> > > where riscv_ipi_set_virq_range()'s use_for_rfence parameter is false.
> > >
> > > Thus, it makes sense for remote fences to use IPIs by default, and make
> > > the SBI remote fence extension the special case.
> >
> > The historical intention of providing SBI calls for remote fences is that
> > it abstracts over hardware that allows for performing remote fences
> > _without an IPI at all_. The TH1520 is an example of such hardware, since
> > it can (at least according to the documentation) perform broadcast fence,
> > fence.i, and sfence.vma operations using bits in the mhint register.
> >
> > T-Head's public opensbi repository doesn't actually use this feature, and
> > in general SBI remote fences come from a much more optimistic time about
> > how much we can successfully hide from supervisor software. But I don't
> > think we can generalize that an IPI will always do less work than a SBI
> > remote fence.
>
> On platforms where SBI is the only way of injecting IPIs in S-mode,
> SBI-based remote fences are actually much faster. This is because on
> such platforms, injecting an IPI from one HART to a set of HARTs
> requires multiple SBI calls, which can be directly replaced by one (or
> a few) SBI remote fence calls.
>
> Most of the current platforms still have an M-mode CLINT and depend on
> SBI IPIs for S-mode IPI injection, so we should not slow down remote
> fences on these platforms.
>
> Until all platforms have moved to RISC-V AIA (which supports S-mode
> IPIs), we should keep this boot-time choice of SBI RFENCEs versus
> direct S-mode IPIs.
>
> IMO, this patch is ahead of its time.

I think this patch is fine since it replaces the static key
riscv_ipi_for_rfence with riscv_sbi_for_rfence.

I got confused by the removal of riscv_ipi_for_rfence.

Reviewed-by: Anup Patel <[email protected]>

Regards,
Anup


2024-03-11 04:42:45

by Samuel Holland

[permalink] [raw]
Subject: Re: [PATCH v5 03/13] riscv: Use IPIs for remote cache/TLB flushes by default

Hi Stefan, Anup,

On 2024-03-10 11:12 PM, Anup Patel wrote:
> On Mon, Mar 11, 2024 at 9:34 AM Anup Patel <[email protected]> wrote:
>>
>> On Mon, Mar 11, 2024 at 8:37 AM Stefan O'Rear <[email protected]> wrote:
>>>
>>> On Thu, Feb 29, 2024, at 6:21 PM, Samuel Holland wrote:
>>>> An IPI backend is always required in an SMP configuration, but an SBI
>>>> implementation is not. For example, SBI will be unavailable when the
>>>> kernel runs in M mode.
>>>>
>>>> Generally, IPIs are assumed to be faster than SBI calls due to the SBI
>>>> context switch overhead. However, when SBI is used as the IPI backend,
>>>> then the context switch cost must be paid anyway, and performing the
>>>> cache/TLB flush directly in the SBI implementation is more efficient
>>>> than inserting an interrupt to the kernel. This is the only scenario
>>>> where riscv_ipi_set_virq_range()'s use_for_rfence parameter is false.
>>>>
>>>> Thus, it makes sense for remote fences to use IPIs by default, and make
>>>> the SBI remote fence extension the special case.
>>>
>>> The historical intention of providing SBI calls for remote fences is that
>>> it abstracts over hardware that allows for performing remote fences
>>> _without an IPI at all_. The TH1520 is an example of such hardware, since
>>> it can (at least according to the documentation) perform broadcast fence,
>>> fence.i, and sfence.vma operations using bits in the mhint register.

Platform-specific code can call static_branch_enable(&riscv_sbi_for_rfence) if
it somehow knows SBI remote fences are faster. That option is still available.
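
A minimal sketch of what that could look like is below; the detection
helper and initcall name are made up for illustration, and only
static_branch_enable() and the riscv_sbi_for_rfence key added by this
patch are real:

    #include <linux/init.h>
    #include <linux/jump_label.h>
    #include <asm/sbi.h>

    /* Hypothetical detection helper; a real one might check mvendorid/marchid. */
    static bool __init soc_has_fast_broadcast_fences(void)
    {
            return false;
    }

    static int __init prefer_sbi_rfence_init(void)
    {
            /* Flip the remote fence path back to SBI on platforms that want it. */
            if (soc_has_fast_broadcast_fences())
                    static_branch_enable(&riscv_sbi_for_rfence);
            return 0;
    }
    arch_initcall(prefer_sbi_rfence_init);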

>>> T-Head's public opensbi repository doesn't actually use this feature, and
>>> in general SBI remote fences come from a much more optimistic time about
>>> how much we can successfully hide from supervisor software. But I don't
>>> think we can generalize that an IPI will always do less work than a SBI
>>> remote fence.

Agreed, and as Anup pointed out below, the case where IPIs go through SBI is an
obvious case where IPIs will do more work than SBI remote fences. However, there
is not currently any way to detect platforms where SBI remote fences have
special optimizations. So, in the absence of extra information, the kernel
assumes SBI remote fences have the performance characteristics implied by using
only standard RISC-V privileged ISA features.

The status quo is that in all cases where IPIs can be delivered to the kernel's
privilege mode without firmware assistance, remote fences are sent via IPI. This
patch is just recognizing that.

>> On platforms where SBI is the only way of injecting IPIs in S-mode, the
>> SBI based remote fences are actually much faster. This is because on
>> such platforms injecting an IPI from a HART to a set of HARTs will
>> require multiple SBI calls which can be directly replaced by one (or
>> few) SBI remote fence SBI calls.
>>
>> Most of the current platforms still have M-mode CLINT and depend on
>> SBI IPI for S-mode IPI injection so we should not slow down remote
>> fences for these platforms.

Just to be clear, this patch does not change the behavior on any existing
platform. All it does is simplify the logic the kernel uses to choose that
behavior at boot time.

In fact, it makes using something like the T-HEAD instructions easier, because
it decouples the remote fence static branch from irqchip driver registration.
And it also paves the way for supporting an SBI implementation that omits the
remote fence extension, if needed.

Like I mentioned in the commit message, the goal was to make IPIs the "else"
case, since SMP simply won't work without IPIs, so they must exist. And any
optimizations can be added on top of that.

>> Until all platforms have moved to RISC-V AIA (which supports S-mode
>> IPIs), we should keep this boot-time choice of SBI RFENCEs versus
>> direct S-mode IPIs.
>>
>> IMO, this patch is ahead of its time.
>
> I think this patch is fine since it replaces the static key
> riscv_ipi_for_rfence with riscv_sbi_for_rfence.
>
> I got confused by the removal of riscv_ipi_for_rfence.

Thanks for taking the time to re-review.

Regards,
Samuel


2024-03-12 00:36:02

by Samuel Holland

[permalink] [raw]
Subject: Re: [External] [PATCH v5 08/13] riscv: Avoid TLB flush loops when affected by SiFive CIP-1200

Hi Yunhui,

On 2024-02-29 8:48 PM, yunhui cui wrote:
> Hi Samuel,
>
> On Fri, Mar 1, 2024 at 7:22 AM Samuel Holland <[email protected]> wrote:
>>
>> Since implementations affected by SiFive errata CIP-1200 always use the
>> global variant of the sfence.vma instruction, they only need to execute
>> the instruction once. The range-based loop only hurts performance.
>>
>> Signed-off-by: Samuel Holland <[email protected]>
>> ---
>>
>> (no changes since v4)
>>
>> Changes in v4:
>> - Only set tlb_flush_all_threshold when CONFIG_MMU=y.
>>
>> Changes in v3:
>> - New patch for v3
>>
>> arch/riscv/errata/sifive/errata.c | 5 +++++
>> arch/riscv/include/asm/tlbflush.h | 2 ++
>> arch/riscv/mm/tlbflush.c | 2 +-
>> 3 files changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/riscv/errata/sifive/errata.c b/arch/riscv/errata/sifive/errata.c
>> index 3d9a32d791f7..716cfedad3a2 100644
>> --- a/arch/riscv/errata/sifive/errata.c
>> +++ b/arch/riscv/errata/sifive/errata.c
>> @@ -42,6 +42,11 @@ static bool errata_cip_1200_check_func(unsigned long arch_id, unsigned long imp
>> return false;
>> if ((impid & 0xffffff) > 0x200630 || impid == 0x1200626)
>> return false;
>> +
>> +#ifdef CONFIG_MMU
>> + tlb_flush_all_threshold = 0;
>> +#endif
>> +
>> return true;
>> }
>>
>> diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
>> index 463b615d7728..8e329721375b 100644
>> --- a/arch/riscv/include/asm/tlbflush.h
>> +++ b/arch/riscv/include/asm/tlbflush.h
>> @@ -66,6 +66,8 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
>> unsigned long uaddr);
>> void arch_flush_tlb_batched_pending(struct mm_struct *mm);
>> void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
>> +
>> +extern unsigned long tlb_flush_all_threshold;
>> #else /* CONFIG_MMU */
>> #define local_flush_tlb_all() do { } while (0)
>> #endif /* CONFIG_MMU */
>> diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
>> index 365e0a0e4725..22870f213188 100644
>> --- a/arch/riscv/mm/tlbflush.c
>> +++ b/arch/riscv/mm/tlbflush.c
>> @@ -11,7 +11,7 @@
>> * Flush entire TLB if number of entries to be flushed is greater
>> * than the threshold below.
>> */
>> -static unsigned long tlb_flush_all_threshold __read_mostly = 64;
>> +unsigned long tlb_flush_all_threshold __read_mostly = 64;
>>
>> static void local_flush_tlb_range_threshold_asid(unsigned long start,
>> unsigned long size,
>> --
>> 2.43.1
>>
>
> If local_flush_tlb_all_asid() is used every time, more PTWs will be
> generated. Will such modifications definitely improve the overall
> performance?

This change in this commit specifically applies to older SiFive SoCs with a bug
making single-page sfence.vma instructions unsafe to use. In this case, a single
call to local_flush_tlb_all_asid() is optimal, yes.

> Hi Alex, Samuel,
> The relationship between flush_xx_range_asid() and nr_ptes is
> basically linear growth (y=kx +b), while flush_all_asid() has nothing
> to do with nr_ptes (y=c).
> Some TLBs may do some optimization. The operation of flush all itself
> requires very few cycles, but there is a certain delay between
> consecutive flush all.
> The intersection of the two straight lines is the optimal solution of
> tlb_flush_all_threshold. In actual situations, continuous
> flush_all_asid will not occur. One problem caused by flush_all_asid()
> is that multiple flush entries require PTW, which causes greater
> latency.
> Therefore, the value of tlb_flush_all_threshold needs to be considered
> or quantified. Maybe doing local_flush_tlb_page_asid() based on the
> actual nr_ptes_in_range would give better overall performance.
> What do you think?

Yes, this was something Alex brought up when adding this threshold, that it
should be tuned for various scenarios. That still needs to be done. This patch
just covers one specific case where we know the optimal answer due to an erratum.
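
For experimentation, now that the threshold is a regular (non-static)
variable, one could imagine exposing it as a parameter to make tuning
runs easier (just a sketch, not something this series adds):

    /* Illustrative only: make the threshold adjustable for benchmarking. */
    #include <linux/moduleparam.h>

    unsigned long tlb_flush_all_threshold __read_mostly = 64;
    module_param(tlb_flush_all_threshold, ulong, 0644);
    MODULE_PARM_DESC(tlb_flush_all_threshold,
                     "Flush the whole TLB once a range covers more than this many pages");

The CIP-1200 case then just corresponds to forcing the value to 0 at
boot, which is what this patch does.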

Regards,
Samuel


2024-03-12 01:52:17

by yunhui cui

[permalink] [raw]
Subject: Re: [External] [PATCH v5 08/13] riscv: Avoid TLB flush loops when affected by SiFive CIP-1200

Hi Samuel,

On Tue, Mar 12, 2024 at 8:36 AM Samuel Holland
<[email protected]> wrote:
>
> Hi Yunhui,
>
> On 2024-02-29 8:48 PM, yunhui cui wrote:
> > Hi Samuel,
> >
> > On Fri, Mar 1, 2024 at 7:22 AM Samuel Holland <[email protected]> wrote:
> >
> > If local_flush_tlb_all_asid() is used every time, more PTWs will be
> > generated. Will such modifications definitely improve the overall
> > performance?
>
> This change in this commit specifically applies to older SiFive SoCs with a bug
> making single-page sfence.vma instructions unsafe to use. In this case, a single
> call to local_flush_tlb_all_asid() is optimal, yes.
Would it be clearer to add this explanation to the commit message?

>
> > Hi Alex, Samuel,
> > The relationship between flush_xx_range_asid() and nr_ptes is
> > basically linear growth (y=kx +b), while flush_all_asid() has nothing
> > to do with nr_ptes (y=c).
> > Some TLBs may do some optimization. The operation of flush all itself
> > requires very few cycles, but there is a certain delay between
> > consecutive flush all.
> > The intersection of the two straight lines is the optimal solution of
> > tlb_flush_all_threshold. In actual situations, continuous
> > flush_all_asid will not occur. One problem caused by flush_all_asid()
> > is that multiple flush entries require PTW, which causes greater
> > latency.
> > Therefore, the value of tlb_flush_all_threshold needs to be considered
> > or quantified. Maybe doing local_flush_tlb_page_asid() based on the
> > actual nr_ptes_in_range would give better overall performance.
> > What do you think?
>
> Yes, this was something Alex brought up when adding this threshold, that it
> should be tuned for various scenarios. That still needs to be done. This patch
> just covers one specific case where we know the optimal answer due to an erratum.
>
> Regards,
> Samuel
>

Thanks,
Yunhui