2023-01-06 08:00:52

by Sandeep Dhavale

Subject: [PATCH v4] erofs: replace erofs_unzipd workqueue with per-cpu threads

Using a per-cpu thread pool, we can reduce the scheduling latency compared
to the workqueue implementation. With this patch, scheduling latency and its
variation are reduced, as the per-cpu threads are high-priority kthread_workers.

The results were evaluated on arm64 Android devices running 5.10 kernel.

The table below shows the resulting improvements in total scheduling latency
for the same app launch benchmark, run for 50 iterations. Scheduling
latency is the time between when the task (workqueue kworker vs
kthread_worker) becomes eligible to run and when it actually starts
running.
+-------------------------+-----------+----------------+---------+
| | workqueue | kthread_worker | diff |
+-------------------------+-----------+----------------+---------+
| Average (us) | 15253 | 2914 | -80.89% |
| Median (us) | 14001 | 2912 | -79.20% |
| Minimum (us) | 3117 | 1027 | -67.05% |
| Maximum (us) | 30170 | 3805 | -87.39% |
| Standard deviation (us) | 7166 | 359 | |
+-------------------------+-----------+----------------+---------+

Background: Boot times and cold app launch benchmarks are very
important to the Android ecosystem, as they directly translate to
responsiveness from the user's point of view. While erofs provides
a lot of important features like space savings, we saw some
performance penalty in cold app launch benchmarks in a few scenarios.
Analysis showed that the significant variance was coming from the
scheduling cost, while the decompression cost was more or less the same.

With a per-cpu thread pool, the above table shows that this
variation is reduced by ~80% on average. This problem was discussed
at LPC 2022; the slides and talk are linked at [1].

[1] https://lpc.events/event/16/contributions/1338/

Signed-off-by: Sandeep Dhavale <[email protected]>
---
V3 -> V4
* Updated commit message with background information
V2 -> V3
* Fix a warning Reported-by: kernel test robot <[email protected]>
V1 -> V2
* Changed name of kthread_workers from z_erofs to erofs_worker
* Added kernel configuration to run kthread_workers at normal or
high priority
* Added cpu hotplug support
* Wrapped the kthread_workers in a worker_pool
* Added one unbound worker to the pool to handle contexts where
the per-cpu kthread worker has already been stopped
* Updated commit message
---
fs/erofs/Kconfig | 11 +++
fs/erofs/zdata.c | 201 +++++++++++++++++++++++++++++++++++++++++------
fs/erofs/zdata.h | 4 +-
3 files changed, 192 insertions(+), 24 deletions(-)

diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
index 85490370e0ca..879f493c6641 100644
--- a/fs/erofs/Kconfig
+++ b/fs/erofs/Kconfig
@@ -108,3 +108,14 @@ config EROFS_FS_ONDEMAND
read support.

If unsure, say N.
+
+config EROFS_FS_KTHREAD_HIPRI
+ bool "EROFS high priority percpu kthread workers"
+ depends on EROFS_FS
+ default n
+ help
+ EROFS uses per cpu kthread workers pool to carry out async work.
+ This permits EROFS to configure per cpu kthread workers to run
+ at higher priority.
+
+ If unsure, say N.
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index ccf7c55d477f..b4bf2da72d1a 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -8,6 +8,8 @@
#include "compress.h"
#include <linux/prefetch.h>
#include <linux/psi.h>
+#include <linux/slab.h>
+#include <linux/cpuhotplug.h>

#include <trace/events/erofs.h>

@@ -184,26 +186,152 @@ typedef tagptr1_t compressed_page_t;
#define tag_compressed_page_justfound(page) \
tagptr_fold(compressed_page_t, page, 1)

-static struct workqueue_struct *z_erofs_workqueue __read_mostly;
+struct erofs_kthread_worker_pool {
+ struct kthread_worker __rcu **workers;
+ struct kthread_worker *unbound_worker;
+};

-void z_erofs_exit_zip_subsystem(void)
+static struct erofs_kthread_worker_pool worker_pool;
+DEFINE_SPINLOCK(worker_pool_lock);
+
+static void erofs_destroy_worker_pool(void)
{
- destroy_workqueue(z_erofs_workqueue);
- z_erofs_destroy_pcluster_pool();
+ unsigned int cpu;
+ struct kthread_worker *worker;
+
+ for_each_possible_cpu(cpu) {
+ worker = rcu_dereference_protected(
+ worker_pool.workers[cpu],
+ 1);
+ rcu_assign_pointer(worker_pool.workers[cpu], NULL);
+
+ if (worker)
+ kthread_destroy_worker(worker);
+ }
+
+ if (worker_pool.unbound_worker)
+ kthread_destroy_worker(worker_pool.unbound_worker);
+
+ kfree(worker_pool.workers);
}

-static inline int z_erofs_init_workqueue(void)
+static inline void erofs_set_worker_priority(struct kthread_worker *worker)
{
- const unsigned int onlinecpus = num_possible_cpus();
+#ifdef CONFIG_EROFS_FS_KTHREAD_HIPRI
+ sched_set_fifo_low(worker->task);
+#else
+ sched_set_normal(worker->task, 0);
+#endif
+}

- /*
- * no need to spawn too many threads, limiting threads could minimum
- * scheduling overhead, perhaps per-CPU threads should be better?
- */
- z_erofs_workqueue = alloc_workqueue("erofs_unzipd",
- WQ_UNBOUND | WQ_HIGHPRI,
- onlinecpus + onlinecpus / 4);
- return z_erofs_workqueue ? 0 : -ENOMEM;
+static int erofs_create_kthread_workers(void)
+{
+ unsigned int cpu;
+ struct kthread_worker *worker;
+
+ for_each_online_cpu(cpu) {
+ worker = kthread_create_worker_on_cpu(cpu, 0, "erofs_worker/%u", cpu);
+ if (IS_ERR(worker)) {
+ erofs_destroy_worker_pool();
+ return -ENOMEM;
+ }
+ erofs_set_worker_priority(worker);
+ rcu_assign_pointer(worker_pool.workers[cpu], worker);
+ }
+
+ worker = kthread_create_worker(0, "erofs_unbound_worker");
+ if (IS_ERR(worker)) {
+ erofs_destroy_worker_pool();
+ return PTR_ERR(worker);
+ }
+ erofs_set_worker_priority(worker);
+ worker_pool.unbound_worker = worker;
+
+ return 0;
+}
+
+static int erofs_init_worker_pool(void)
+{
+ int err;
+
+ worker_pool.workers = kcalloc(num_possible_cpus(),
+ sizeof(struct kthread_worker *), GFP_ATOMIC);
+ if (!worker_pool.workers)
+ return -ENOMEM;
+ err = erofs_create_kthread_workers();
+
+ return err;
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static enum cpuhp_state erofs_cpuhp_state;
+static int erofs_cpu_online(unsigned int cpu)
+{
+ struct kthread_worker *worker;
+
+ worker = kthread_create_worker_on_cpu(cpu, 0, "erofs_worker/%u", cpu);
+ if (IS_ERR(worker))
+ return -ENOMEM;
+
+ erofs_set_worker_priority(worker);
+
+ spin_lock(&worker_pool_lock);
+ rcu_assign_pointer(worker_pool.workers[cpu], worker);
+ spin_unlock(&worker_pool_lock);
+
+ synchronize_rcu();
+
+ return 0;
+}
+
+static int erofs_cpu_offline(unsigned int cpu)
+{
+ struct kthread_worker *worker;
+
+ spin_lock(&worker_pool_lock);
+ worker = rcu_dereference_protected(worker_pool.workers[cpu],
+ lockdep_is_held(&worker_pool_lock));
+ rcu_assign_pointer(worker_pool.workers[cpu], NULL);
+ spin_unlock(&worker_pool_lock);
+
+ synchronize_rcu();
+
+ if (worker)
+ kthread_destroy_worker(worker);
+
+ return 0;
+}
+
+static int erofs_cpu_hotplug_init(void)
+{
+ int state;
+
+ state = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
+ "fs/erofs:online",
+ erofs_cpu_online, erofs_cpu_offline);
+ if (state < 0)
+ return state;
+
+ erofs_cpuhp_state = state;
+
+ return 0;
+}
+
+static void erofs_cpu_hotplug_destroy(void)
+{
+ if (erofs_cpuhp_state)
+ cpuhp_remove_state_nocalls(erofs_cpuhp_state);
+}
+#else /* !CONFIG_HOTPLUG_CPU */
+static inline int erofs_cpu_hotplug_init(void) { return 0; }
+static inline void erofs_cpu_hotplug_destroy(void) {}
+#endif
+
+void z_erofs_exit_zip_subsystem(void)
+{
+ erofs_cpu_hotplug_destroy();
+ erofs_destroy_worker_pool();
+ z_erofs_destroy_pcluster_pool();
}

int __init z_erofs_init_zip_subsystem(void)
@@ -211,10 +339,23 @@ int __init z_erofs_init_zip_subsystem(void)
int err = z_erofs_create_pcluster_pool();

if (err)
- return err;
- err = z_erofs_init_workqueue();
+ goto out_error_pcluster_pool;
+
+ err = erofs_init_worker_pool();
if (err)
- z_erofs_destroy_pcluster_pool();
+ goto out_error_worker_pool;
+
+ err = erofs_cpu_hotplug_init();
+ if (err < 0)
+ goto out_error_cpuhp_init;
+
+ return err;
+
+out_error_cpuhp_init:
+ erofs_destroy_worker_pool();
+out_error_worker_pool:
+ z_erofs_destroy_pcluster_pool();
+out_error_pcluster_pool:
return err;
}

@@ -1143,7 +1284,7 @@ static void z_erofs_decompress_queue(const struct z_erofs_decompressqueue *io,
}
}

-static void z_erofs_decompressqueue_work(struct work_struct *work)
+static void z_erofs_decompressqueue_kthread_work(struct kthread_work *work)
{
struct z_erofs_decompressqueue *bgq =
container_of(work, struct z_erofs_decompressqueue, u.work);
@@ -1156,6 +1297,20 @@ static void z_erofs_decompressqueue_work(struct work_struct *work)
kvfree(bgq);
}

+static void erofs_schedule_kthread_work(struct kthread_work *work)
+{
+ struct kthread_worker *worker;
+ unsigned int cpu = raw_smp_processor_id();
+
+ rcu_read_lock();
+ worker = rcu_dereference(worker_pool.workers[cpu]);
+ if (!worker)
+ worker = worker_pool.unbound_worker;
+
+ kthread_queue_work(worker, work);
+ rcu_read_unlock();
+}
+
static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
bool sync, int bios)
{
@@ -1170,15 +1325,15 @@ static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,

if (atomic_add_return(bios, &io->pending_bios))
return;
- /* Use workqueue and sync decompression for atomic contexts only */
+ /* Use kthread_workers and sync decompression for atomic contexts only */
if (in_atomic() || irqs_disabled()) {
- queue_work(z_erofs_workqueue, &io->u.work);
+ erofs_schedule_kthread_work(&io->u.work);
/* enable sync decompression for readahead */
if (sbi->opt.sync_decompress == EROFS_SYNC_DECOMPRESS_AUTO)
sbi->opt.sync_decompress = EROFS_SYNC_DECOMPRESS_FORCE_ON;
return;
}
- z_erofs_decompressqueue_work(&io->u.work);
+ z_erofs_decompressqueue_kthread_work(&io->u.work);
}

static struct page *pickup_page_for_submission(struct z_erofs_pcluster *pcl,
@@ -1306,7 +1461,7 @@ jobqueue_init(struct super_block *sb,
*fg = true;
goto fg_out;
}
- INIT_WORK(&q->u.work, z_erofs_decompressqueue_work);
+ kthread_init_work(&q->u.work, z_erofs_decompressqueue_kthread_work);
} else {
fg_out:
q = fgq;
@@ -1500,7 +1655,7 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,

/*
* although background is preferred, no one is pending for submission.
- * don't issue workqueue for decompression but drop it directly instead.
+ * don't issue kthread_work for decompression but drop it directly instead.
*/
if (!*force_fg && !nr_bios) {
kvfree(q[JQ_SUBMIT]);
diff --git a/fs/erofs/zdata.h b/fs/erofs/zdata.h
index d98c95212985..808bbbf71b7b 100644
--- a/fs/erofs/zdata.h
+++ b/fs/erofs/zdata.h
@@ -6,6 +6,8 @@
#ifndef __EROFS_FS_ZDATA_H
#define __EROFS_FS_ZDATA_H

+#include <linux/kthread.h>
+
#include "internal.h"
#include "tagptr.h"

@@ -107,7 +109,7 @@ struct z_erofs_decompressqueue {

union {
struct completion done;
- struct work_struct work;
+ struct kthread_work work;
} u;

bool eio;
--
2.39.0.314.g84b9a713c41-goog
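
Note: the thread does not show how the wakeup-to-run latency in the table
above was instrumented. Below is a minimal userspace sketch of one possible
approach, reading the cumulative run_delay field of /proc/<pid>/schedstat
(needs CONFIG_SCHED_INFO/schedstats). read_run_delay() is a hypothetical
helper for illustration only, not part of the patch and not necessarily how
the numbers above were collected.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * /proc/<pid>/schedstat holds three numbers: sum_exec_runtime,
 * run_delay (time spent waiting on a runqueue, in ns) and pcount.
 * Return the cumulative run_delay, or -1 on error.
 */
static long long read_run_delay(pid_t pid)
{
	char path[64];
	unsigned long long exec_ns, delay_ns, pcount;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/schedstat", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%llu %llu %llu", &exec_ns, &delay_ns, &pcount) != 3) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return (long long)delay_ns;
}

int main(int argc, char **argv)
{
	pid_t pid = argc > 1 ? (pid_t)atoi(argv[1]) : getpid();
	long long before = read_run_delay(pid);
	long long after;

	/* ... run the workload of interest here ... */

	after = read_run_delay(pid);
	if (before >= 0 && after >= 0)
		printf("run_delay grew by %lld ns\n", after - before);
	return 0;
}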


2023-02-06 10:02:27

by Gao Xiang

Subject: Re: [PATCH v4] erofs: replace erofs_unzipd workqueue with per-cpu threads

Hi Sandeep,

On Fri, Jan 06, 2023 at 07:35:01AM +0000, Sandeep Dhavale wrote:
> Using per-cpu thread pool we can reduce the scheduling latency compared
> to workqueue implementation. With this patch scheduling latency and
> variation is reduced as per-cpu threads are high priority kthread_workers.
>
> The results were evaluated on arm64 Android devices running 5.10 kernel.
>
> The table below shows resulting improvements of total scheduling latency
> for the same app launch benchmark runs with 50 iterations. Scheduling
> latency is the latency between when the task (workqueue kworker vs
> kthread_worker) became eligible to run to when it actually started
> running.
> +-------------------------+-----------+----------------+---------+
> | | workqueue | kthread_worker | diff |
> +-------------------------+-----------+----------------+---------+
> | Average (us) | 15253 | 2914 | -80.89% |
> | Median (us) | 14001 | 2912 | -79.20% |
> | Minimum (us) | 3117 | 1027 | -67.05% |
> | Maximum (us) | 30170 | 3805 | -87.39% |
> | Standard deviation (us) | 7166 | 359 | |
> +-------------------------+-----------+----------------+---------+
>
> Background: Boot times and cold app launch benchmarks are very
> important to the android ecosystem as they directly translate to
> responsiveness from user point of view. While erofs provides
> a lot of important features like space savings, we saw some
> performance penalty in cold app launch benchmarks in few scenarios.
> Analysis showed that the significant variance was coming from the
> scheduling cost while decompression cost was more or less the same.
>
> Having per-cpu thread pool we can see from the above table that this
> variation is reduced by ~80% on average. This problem was discussed
> at LPC 2022. Link to LPC 2022 slides and
> talk at [1]
>
> [1] https://lpc.events/event/16/contributions/1338/
>
> Signed-off-by: Sandeep Dhavale <[email protected]>
> ---
> V3 -> V4
> * Updated commit message with background information
> V2 -> V3
> * Fix a warning Reported-by: kernel test robot <[email protected]>
> V1 -> V2
> * Changed name of kthread_workers from z_erofs to erofs_worker
> * Added kernel configuration to run kthread_workers at normal or
> high priority
> * Added cpu hotplug support
> * Added wrapped kthread_workers under worker_pool
> * Added one unbound thread in a pool to handle a context where
> we already stopped per-cpu kthread worker
> * Updated commit message

I've just modified your v4 patch on top of the erofs -dev branch with
my previous suggestion [1], but I haven't tested it.

Could you check whether the updated patch looks good to you and
test it on your side? If there are any unexpected behaviors, please
help update it as well, thanks!

[1] https://lore.kernel.org/r/[email protected]/

Thanks,
Gao Xiang

From 2e87235abc745c0fef8e32abcd3a51546b4378ad Mon Sep 17 00:00:00 2001
From: Sandeep Dhavale <[email protected]>
Date: Mon, 6 Feb 2023 17:53:39 +0800
Subject: [PATCH] erofs: add per-cpu threads for decompression

Using a per-cpu thread pool, we can reduce the scheduling latency compared
to the workqueue implementation. With this patch, scheduling latency and its
variation are reduced, as the per-cpu threads are high-priority kthread_workers.

The results were evaluated on arm64 Android devices running 5.10 kernel.

The table below shows the resulting improvements in total scheduling latency
for the same app launch benchmark, run for 50 iterations. Scheduling
latency is the time between when the task (workqueue kworker vs
kthread_worker) becomes eligible to run and when it actually starts
running.
+-------------------------+-----------+----------------+---------+
| | workqueue | kthread_worker | diff |
+-------------------------+-----------+----------------+---------+
| Average (us) | 15253 | 2914 | -80.89% |
| Median (us) | 14001 | 2912 | -79.20% |
| Minimum (us) | 3117 | 1027 | -67.05% |
| Maximum (us) | 30170 | 3805 | -87.39% |
| Standard deviation (us) | 7166 | 359 | |
+-------------------------+-----------+----------------+---------+

Background: Boot times and cold app launch benchmarks are very
important to the Android ecosystem, as they directly translate to
responsiveness from the user's point of view. While erofs provides
a lot of important features like space savings, we saw some
performance penalty in cold app launch benchmarks in a few scenarios.
Analysis showed that the significant variance was coming from the
scheduling cost, while the decompression cost was more or less the same.

With a per-cpu thread pool, the above table shows that this
variation is reduced by ~80% on average. This problem was discussed
at LPC 2022; the slides and talk are linked at [1].

[1] https://lpc.events/event/16/contributions/1338/

Signed-off-by: Sandeep Dhavale <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
---
fs/erofs/Kconfig | 18 +++++
fs/erofs/zdata.c | 190 ++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 189 insertions(+), 19 deletions(-)

diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
index 85490370e0ca..704fb59577e0 100644
--- a/fs/erofs/Kconfig
+++ b/fs/erofs/Kconfig
@@ -108,3 +108,21 @@ config EROFS_FS_ONDEMAND
read support.

If unsure, say N.
+
+config EROFS_FS_PCPU_KTHREAD
+ bool "EROFS per-cpu decompression kthread workers"
+ depends on EROFS_FS_ZIP
+ help
+ Saying Y here enables per-CPU kthread workers pool to carry out
+ async decompression for low latencies on some architectures.
+
+ If unsure, say N.
+
+config EROFS_FS_PCPU_KTHREAD_HIPRI
+ bool "EROFS high priority per-CPU kthread workers"
+ depends on EROFS_FS_ZIP && EROFS_FS_PCPU_KTHREAD
+ help
+ This permits EROFS to configure per-CPU kthread workers to run
+ at higher priority.
+
+ If unsure, say N.
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 384f64292f73..73198f494a6a 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -7,6 +7,8 @@
#include "compress.h"
#include <linux/prefetch.h>
#include <linux/psi.h>
+#include <linux/slab.h>
+#include <linux/cpuhotplug.h>

#include <trace/events/erofs.h>

@@ -109,6 +111,7 @@ struct z_erofs_decompressqueue {
union {
struct completion done;
struct work_struct work;
+ struct kthread_work kthread_work;
} u;
bool eio, sync;
};
@@ -341,24 +344,128 @@ static void z_erofs_free_pcluster(struct z_erofs_pcluster *pcl)

static struct workqueue_struct *z_erofs_workqueue __read_mostly;

-void z_erofs_exit_zip_subsystem(void)
+#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
+static struct kthread_worker __rcu **z_erofs_pcpu_workers;
+
+static void erofs_destroy_percpu_workers(void)
{
- destroy_workqueue(z_erofs_workqueue);
- z_erofs_destroy_pcluster_pool();
+ struct kthread_worker *worker;
+ unsigned int cpu;
+
+ for_each_possible_cpu(cpu) {
+ worker = rcu_dereference_protected(
+ z_erofs_pcpu_workers[cpu], 1);
+ rcu_assign_pointer(z_erofs_pcpu_workers[cpu], NULL);
+ if (worker)
+ kthread_destroy_worker(worker);
+ }
+ kfree(z_erofs_pcpu_workers);
}

-static inline int z_erofs_init_workqueue(void)
+static struct kthread_worker *erofs_init_percpu_worker(int cpu)
{
- const unsigned int onlinecpus = num_possible_cpus();
+ struct kthread_worker *worker =
+ kthread_create_worker_on_cpu(cpu, 0, "erofs_worker/%u", cpu);

- /*
- * no need to spawn too many threads, limiting threads could minimum
- * scheduling overhead, perhaps per-CPU threads should be better?
- */
- z_erofs_workqueue = alloc_workqueue("erofs_unzipd",
- WQ_UNBOUND | WQ_HIGHPRI,
- onlinecpus + onlinecpus / 4);
- return z_erofs_workqueue ? 0 : -ENOMEM;
+ if (IS_ERR(worker))
+ return worker;
+ if (IS_ENABLED(CONFIG_EROFS_FS_PCPU_KTHREAD_HIPRI))
+ sched_set_fifo_low(worker->task);
+ else
+ sched_set_normal(worker->task, 0);
+ return worker;
+}
+
+static int erofs_init_percpu_workers(void)
+{
+ struct kthread_worker *worker;
+ unsigned int cpu;
+
+ z_erofs_pcpu_workers = kcalloc(num_possible_cpus(),
+ sizeof(struct kthread_worker *), GFP_ATOMIC);
+ if (!z_erofs_pcpu_workers)
+ return -ENOMEM;
+
+ for_each_online_cpu(cpu) { /* could miss cpu{off,on}line? */
+ worker = erofs_init_percpu_worker(cpu);
+ if (!IS_ERR(worker))
+ rcu_assign_pointer(z_erofs_pcpu_workers[cpu], worker);
+ }
+ return 0;
+}
+#else
+static inline void erofs_destroy_percpu_workers(void) {}
+static inline int erofs_init_percpu_workers(void) { return 0; }
+#endif
+
+#if defined(CONFIG_HOTPLUG_CPU) && defined(EROFS_FS_PCPU_KTHREAD)
+static DEFINE_SPINLOCK(z_erofs_pcpu_worker_lock);
+static enum cpuhp_state erofs_cpuhp_state;
+
+static int erofs_cpu_online(unsigned int cpu)
+{
+ struct kthread_worker *worker, *old;
+
+ worker = erofs_init_percpu_worker(cpu);
+ if (IS_ERR(worker))
+ return ERR_PTR(worker);
+
+ spin_lock(&z_erofs_pcpu_worker_lock);
+ old = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
+ lockdep_is_held(&z_erofs_pcpu_worker_lock));
+ if (!old)
+ rcu_assign_pointer(z_erofs_pcpu_workers[cpu], worker);
+ spin_unlock(&z_erofs_pcpu_worker_lock);
+ if (old)
+ kthread_destroy_worker(worker);
+ return 0;
+}
+
+static int erofs_cpu_offline(unsigned int cpu)
+{
+ struct kthread_worker *worker;
+
+ spin_lock(&z_erofs_pcpu_worker_lock);
+ worker = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
+ lockdep_is_held(&z_erofs_pcpu_worker_lock));
+ rcu_assign_pointer(worker_pool.workers[cpu], NULL);
+ spin_unlock(&z_erofs_pcpu_worker_lock);
+
+ synchronize_rcu();
+ if (worker)
+ kthread_destroy_worker(worker);
+ return 0;
+}
+
+static int erofs_cpu_hotplug_init(void)
+{
+ int state;
+
+ state = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
+ "fs/erofs:online", erofs_cpu_online, erofs_cpu_offline);
+ if (state < 0)
+ return state;
+
+ erofs_cpuhp_state = state;
+ return 0;
+}
+
+static void erofs_cpu_hotplug_destroy(void)
+{
+ if (erofs_cpuhp_state)
+ cpuhp_remove_state_nocalls(erofs_cpuhp_state);
+}
+#else /* !CONFIG_HOTPLUG_CPU || !CONFIG_EROFS_FS_PCPU_KTHREAD */
+static inline int erofs_cpu_hotplug_init(void) { return 0; }
+static inline void erofs_cpu_hotplug_destroy(void) {}
+#endif
+
+void z_erofs_exit_zip_subsystem(void)
+{
+ erofs_cpu_hotplug_destroy();
+ erofs_destroy_percpu_workers();
+ destroy_workqueue(z_erofs_workqueue);
+ z_erofs_destroy_pcluster_pool();
}

int __init z_erofs_init_zip_subsystem(void)
@@ -366,10 +473,29 @@ int __init z_erofs_init_zip_subsystem(void)
int err = z_erofs_create_pcluster_pool();

if (err)
- return err;
- err = z_erofs_init_workqueue();
+ goto out_error_pcluster_pool;
+
+ z_erofs_workqueue = alloc_workqueue("erofs_worker",
+ WQ_UNBOUND | WQ_HIGHPRI, num_possible_cpus());
+ if (!z_erofs_workqueue)
+ goto out_error_workqueue_init;
+
+ err = erofs_init_percpu_workers();
if (err)
- z_erofs_destroy_pcluster_pool();
+ goto out_error_pcpu_worker;
+
+ err = erofs_cpu_hotplug_init();
+ if (err < 0)
+ goto out_error_cpuhp_init;
+ return err;
+
+out_error_cpuhp_init:
+ erofs_destroy_percpu_workers();
+out_error_pcpu_worker:
+ destroy_workqueue(z_erofs_workqueue);
+out_error_workqueue_init:
+ z_erofs_destroy_pcluster_pool();
+out_error_pcluster_pool:
return err;
}

@@ -1305,11 +1431,17 @@ static void z_erofs_decompressqueue_work(struct work_struct *work)

DBG_BUGON(bgq->head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
z_erofs_decompress_queue(bgq, &pagepool);
-
erofs_release_pages(&pagepool);
kvfree(bgq);
}

+#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
+static void z_erofs_decompressqueue_kthread_work(struct kthread_work *work)
+{
+ z_erofs_decompressqueue_work((struct work_struct *)work);
+}
+#endif
+
static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
int bios)
{
@@ -1324,9 +1456,24 @@ static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,

if (atomic_add_return(bios, &io->pending_bios))
return;
- /* Use workqueue and sync decompression for atomic contexts only */
+ /* Use (kthread_)work and sync decompression for atomic contexts only */
if (in_atomic() || irqs_disabled()) {
+#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
+ struct kthread_worker *worker;
+
+ rcu_read_lock();
+ worker = rcu_dereference(
+ z_erofs_pcpu_workers[raw_smp_processor_id()]);
+ if (!worker) {
+ INIT_WORK(&io->u.work, z_erofs_decompressqueue_work);
+ queue_work(z_erofs_workqueue, &io->u.work);
+ } else {
+ kthread_queue_work(worker, &io->u.kthread_work);
+ }
+ rcu_read_unlock();
+#else
queue_work(z_erofs_workqueue, &io->u.work);
+#endif
/* enable sync decompression for readahead */
if (sbi->opt.sync_decompress == EROFS_SYNC_DECOMPRESS_AUTO)
sbi->opt.sync_decompress = EROFS_SYNC_DECOMPRESS_FORCE_ON;
@@ -1455,7 +1602,12 @@ static struct z_erofs_decompressqueue *jobqueue_init(struct super_block *sb,
*fg = true;
goto fg_out;
}
+#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
+ kthread_init_work(&q->u.kthread_work,
+ z_erofs_decompressqueue_kthread_work);
+#else
INIT_WORK(&q->u.work, z_erofs_decompressqueue_work);
+#endif
} else {
fg_out:
q = fgq;
@@ -1640,7 +1792,7 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,

/*
* although background is preferred, no one is pending for submission.
- * don't issue workqueue for decompression but drop it directly instead.
+ * don't issue decompression but drop it directly instead.
*/
if (!*force_fg && !nr_bios) {
kvfree(q[JQ_SUBMIT]);
--
2.30.2
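
Note on the "/* could miss cpu{off,on}line? */" comment in the patch above:
one conventional way to avoid that init-time window, shown here only as a
hedged sketch and not something this patch does, is to register with
cpuhp_setup_state() instead of cpuhp_setup_state_nocalls(), so the cpuhp
core itself calls erofs_cpu_online() for every already-online CPU and the
separate for_each_online_cpu() walk is no longer needed.

static int erofs_cpu_hotplug_init(void)
{
	int state;

	/*
	 * Unlike the _nocalls() variant, cpuhp_setup_state() invokes
	 * erofs_cpu_online() for each CPU that is already online while
	 * the hotplug lock is held, closing the init-time window.
	 */
	state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "fs/erofs:online",
				  erofs_cpu_online, erofs_cpu_offline);
	if (state < 0)
		return state;
	erofs_cpuhp_state = state;
	return 0;
}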


2023-02-06 19:41:31

by Sandeep Dhavale

Subject: Re: [PATCH v4] erofs: replace erofs_unzipd workqueue with per-cpu threads

On Mon, Feb 6, 2023 at 2:01 AM Gao Xiang <[email protected]> wrote:
>
> Hi Sandeep,
>
> On Fri, Jan 06, 2023 at 07:35:01AM +0000, Sandeep Dhavale wrote:
> > Using per-cpu thread pool we can reduce the scheduling latency compared
> > to workqueue implementation. With this patch scheduling latency and
> > variation is reduced as per-cpu threads are high priority kthread_workers.
> >
> > The results were evaluated on arm64 Android devices running 5.10 kernel.
> >
> > The table below shows resulting improvements of total scheduling latency
> > for the same app launch benchmark runs with 50 iterations. Scheduling
> > latency is the latency between when the task (workqueue kworker vs
> > kthread_worker) became eligible to run to when it actually started
> > running.
> > +-------------------------+-----------+----------------+---------+
> > | | workqueue | kthread_worker | diff |
> > +-------------------------+-----------+----------------+---------+
> > | Average (us) | 15253 | 2914 | -80.89% |
> > | Median (us) | 14001 | 2912 | -79.20% |
> > | Minimum (us) | 3117 | 1027 | -67.05% |
> > | Maximum (us) | 30170 | 3805 | -87.39% |
> > | Standard deviation (us) | 7166 | 359 | |
> > +-------------------------+-----------+----------------+---------+
> >
> > Background: Boot times and cold app launch benchmarks are very
> > important to the android ecosystem as they directly translate to
> > responsiveness from user point of view. While erofs provides
> > a lot of important features like space savings, we saw some
> > performance penalty in cold app launch benchmarks in few scenarios.
> > Analysis showed that the significant variance was coming from the
> > scheduling cost while decompression cost was more or less the same.
> >
> > Having per-cpu thread pool we can see from the above table that this
> > variation is reduced by ~80% on average. This problem was discussed
> > at LPC 2022. Link to LPC 2022 slides and
> > talk at [1]
> >
> > [1] https://lpc.events/event/16/contributions/1338/
> >
> > Signed-off-by: Sandeep Dhavale <[email protected]>
> > ---
> > V3 -> V4
> > * Updated commit message with background information
> > V2 -> V3
> > * Fix a warning Reported-by: kernel test robot <[email protected]>
> > V1 -> V2
> > * Changed name of kthread_workers from z_erofs to erofs_worker
> > * Added kernel configuration to run kthread_workers at normal or
> > high priority
> > * Added cpu hotplug support
> > * Added wrapped kthread_workers under worker_pool
> > * Added one unbound thread in a pool to handle a context where
> > we already stopped per-cpu kthread worker
> > * Updated commit message
>
> I've just modified your v4 patch based on erofs -dev branch with
> my previous suggestion [1], but I haven't tested it.
>
> Could you help check if the updated patch looks good to you and
> test it on your side? If there are unexpected behaviors, please
> help update as well, thanks!
Thanks Xiang, I was working on the same thing. I see that you have cleaned it up.
I will test it and report/fix any problems.

Thanks,
Sandeep.

>
> [1] https://lore.kernel.org/r/[email protected]/
>
> Thanks,
> Gao Xiang
>
> From 2e87235abc745c0fef8e32abcd3a51546b4378ad Mon Sep 17 00:00:00 2001
> From: Sandeep Dhavale <[email protected]>
> Date: Mon, 6 Feb 2023 17:53:39 +0800
> Subject: [PATCH] erofs: add per-cpu threads for decompression
>
> Using per-cpu thread pool we can reduce the scheduling latency compared
> to workqueue implementation. With this patch scheduling latency and
> variation is reduced as per-cpu threads are high priority kthread_workers.
>
> The results were evaluated on arm64 Android devices running 5.10 kernel.
>
> The table below shows resulting improvements of total scheduling latency
> for the same app launch benchmark runs with 50 iterations. Scheduling
> latency is the latency between when the task (workqueue kworker vs
> kthread_worker) became eligible to run to when it actually started
> running.
> +-------------------------+-----------+----------------+---------+
> | | workqueue | kthread_worker | diff |
> +-------------------------+-----------+----------------+---------+
> | Average (us) | 15253 | 2914 | -80.89% |
> | Median (us) | 14001 | 2912 | -79.20% |
> | Minimum (us) | 3117 | 1027 | -67.05% |
> | Maximum (us) | 30170 | 3805 | -87.39% |
> | Standard deviation (us) | 7166 | 359 | |
> +-------------------------+-----------+----------------+---------+
>
> Background: Boot times and cold app launch benchmarks are very
> important to the android ecosystem as they directly translate to
> responsiveness from user point of view. While erofs provides
> a lot of important features like space savings, we saw some
> performance penalty in cold app launch benchmarks in few scenarios.
> Analysis showed that the significant variance was coming from the
> scheduling cost while decompression cost was more or less the same.
>
> Having per-cpu thread pool we can see from the above table that this
> variation is reduced by ~80% on average. This problem was discussed
> at LPC 2022. Link to LPC 2022 slides and
> talk at [1]
>
> [1] https://lpc.events/event/16/contributions/1338/
>
> Signed-off-by: Sandeep Dhavale <[email protected]>
> Signed-off-by: Gao Xiang <[email protected]>
> ---
> fs/erofs/Kconfig | 18 +++++
> fs/erofs/zdata.c | 190 ++++++++++++++++++++++++++++++++++++++++++-----
> 2 files changed, 189 insertions(+), 19 deletions(-)
>
> diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
> index 85490370e0ca..704fb59577e0 100644
> --- a/fs/erofs/Kconfig
> +++ b/fs/erofs/Kconfig
> @@ -108,3 +108,21 @@ config EROFS_FS_ONDEMAND
> read support.
>
> If unsure, say N.
> +
> +config EROFS_FS_PCPU_KTHREAD
> + bool "EROFS per-cpu decompression kthread workers"
> + depends on EROFS_FS_ZIP
> + help
> + Saying Y here enables per-CPU kthread workers pool to carry out
> + async decompression for low latencies on some architectures.
> +
> + If unsure, say N.
> +
> +config EROFS_FS_PCPU_KTHREAD_HIPRI
> + bool "EROFS high priority per-CPU kthread workers"
> + depends on EROFS_FS_ZIP && EROFS_FS_PCPU_KTHREAD
> + help
> + This permits EROFS to configure per-CPU kthread workers to run
> + at higher priority.
> +
> + If unsure, say N.
> diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
> index 384f64292f73..73198f494a6a 100644
> --- a/fs/erofs/zdata.c
> +++ b/fs/erofs/zdata.c
> @@ -7,6 +7,8 @@
> #include "compress.h"
> #include <linux/prefetch.h>
> #include <linux/psi.h>
> +#include <linux/slab.h>
> +#include <linux/cpuhotplug.h>
>
> #include <trace/events/erofs.h>
>
> @@ -109,6 +111,7 @@ struct z_erofs_decompressqueue {
> union {
> struct completion done;
> struct work_struct work;
> + struct kthread_work kthread_work;
> } u;
> bool eio, sync;
> };
> @@ -341,24 +344,128 @@ static void z_erofs_free_pcluster(struct z_erofs_pcluster *pcl)
>
> static struct workqueue_struct *z_erofs_workqueue __read_mostly;
>
> -void z_erofs_exit_zip_subsystem(void)
> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
> +static struct kthread_worker __rcu **z_erofs_pcpu_workers;
> +
> +static void erofs_destroy_percpu_workers(void)
> {
> - destroy_workqueue(z_erofs_workqueue);
> - z_erofs_destroy_pcluster_pool();
> + struct kthread_worker *worker;
> + unsigned int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + worker = rcu_dereference_protected(
> + z_erofs_pcpu_workers[cpu], 1);
> + rcu_assign_pointer(z_erofs_pcpu_workers[cpu], NULL);
> + if (worker)
> + kthread_destroy_worker(worker);
> + }
> + kfree(z_erofs_pcpu_workers);
> }
>
> -static inline int z_erofs_init_workqueue(void)
> +static struct kthread_worker *erofs_init_percpu_worker(int cpu)
> {
> - const unsigned int onlinecpus = num_possible_cpus();
> + struct kthread_worker *worker =
> + kthread_create_worker_on_cpu(cpu, 0, "erofs_worker/%u", cpu);
>
> - /*
> - * no need to spawn too many threads, limiting threads could minimum
> - * scheduling overhead, perhaps per-CPU threads should be better?
> - */
> - z_erofs_workqueue = alloc_workqueue("erofs_unzipd",
> - WQ_UNBOUND | WQ_HIGHPRI,
> - onlinecpus + onlinecpus / 4);
> - return z_erofs_workqueue ? 0 : -ENOMEM;
> + if (IS_ERR(worker))
> + return worker;
> + if (IS_ENABLED(CONFIG_EROFS_FS_PCPU_KTHREAD_HIPRI))
> + sched_set_fifo_low(worker->task);
> + else
> + sched_set_normal(worker->task, 0);
> + return worker;
> +}
> +
> +static int erofs_init_percpu_workers(void)
> +{
> + struct kthread_worker *worker;
> + unsigned int cpu;
> +
> + z_erofs_pcpu_workers = kcalloc(num_possible_cpus(),
> + sizeof(struct kthread_worker *), GFP_ATOMIC);
> + if (!z_erofs_pcpu_workers)
> + return -ENOMEM;
> +
> + for_each_online_cpu(cpu) { /* could miss cpu{off,on}line? */
> + worker = erofs_init_percpu_worker(cpu);
> + if (!IS_ERR(worker))
> + rcu_assign_pointer(z_erofs_pcpu_workers[cpu], worker);
> + }
> + return 0;
> +}
> +#else
> +static inline void erofs_destroy_percpu_workers(void) {}
> +static inline int erofs_init_percpu_workers(void) { return 0; }
> +#endif
> +
> +#if defined(CONFIG_HOTPLUG_CPU) && defined(EROFS_FS_PCPU_KTHREAD)
> +static DEFINE_SPINLOCK(z_erofs_pcpu_worker_lock);
> +static enum cpuhp_state erofs_cpuhp_state;
> +
> +static int erofs_cpu_online(unsigned int cpu)
> +{
> + struct kthread_worker *worker, *old;
> +
> + worker = erofs_init_percpu_worker(cpu);
> + if (IS_ERR(worker))
> + return ERR_PTR(worker);
> +
> + spin_lock(&z_erofs_pcpu_worker_lock);
> + old = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
> + lockdep_is_held(&z_erofs_pcpu_worker_lock));
> + if (!old)
> + rcu_assign_pointer(z_erofs_pcpu_workers[cpu], worker);
> + spin_unlock(&z_erofs_pcpu_worker_lock);
> + if (old)
> + kthread_destroy_worker(worker);
> + return 0;
> +}
> +
> +static int erofs_cpu_offline(unsigned int cpu)
> +{
> + struct kthread_worker *worker;
> +
> + spin_lock(&z_erofs_pcpu_worker_lock);
> + worker = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
> + lockdep_is_held(&z_erofs_pcpu_worker_lock));
> + rcu_assign_pointer(worker_pool.workers[cpu], NULL);
> + spin_unlock(&z_erofs_pcpu_worker_lock);
> +
> + synchronize_rcu();
> + if (worker)
> + kthread_destroy_worker(worker);
> + return 0;
> +}
> +
> +static int erofs_cpu_hotplug_init(void)
> +{
> + int state;
> +
> + state = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> + "fs/erofs:online", erofs_cpu_online, erofs_cpu_offline);
> + if (state < 0)
> + return state;
> +
> + erofs_cpuhp_state = state;
> + return 0;
> +}
> +
> +static void erofs_cpu_hotplug_destroy(void)
> +{
> + if (erofs_cpuhp_state)
> + cpuhp_remove_state_nocalls(erofs_cpuhp_state);
> +}
> +#else /* !CONFIG_HOTPLUG_CPU || !CONFIG_EROFS_FS_PCPU_KTHREAD */
> +static inline int erofs_cpu_hotplug_init(void) { return 0; }
> +static inline void erofs_cpu_hotplug_destroy(void) {}
> +#endif
> +
> +void z_erofs_exit_zip_subsystem(void)
> +{
> + erofs_cpu_hotplug_destroy();
> + erofs_destroy_percpu_workers();
> + destroy_workqueue(z_erofs_workqueue);
> + z_erofs_destroy_pcluster_pool();
> }
>
> int __init z_erofs_init_zip_subsystem(void)
> @@ -366,10 +473,29 @@ int __init z_erofs_init_zip_subsystem(void)
> int err = z_erofs_create_pcluster_pool();
>
> if (err)
> - return err;
> - err = z_erofs_init_workqueue();
> + goto out_error_pcluster_pool;
> +
> + z_erofs_workqueue = alloc_workqueue("erofs_worker",
> + WQ_UNBOUND | WQ_HIGHPRI, num_possible_cpus());
> + if (!z_erofs_workqueue)
> + goto out_error_workqueue_init;
> +
> + err = erofs_init_percpu_workers();
> if (err)
> - z_erofs_destroy_pcluster_pool();
> + goto out_error_pcpu_worker;
> +
> + err = erofs_cpu_hotplug_init();
> + if (err < 0)
> + goto out_error_cpuhp_init;
> + return err;
> +
> +out_error_cpuhp_init:
> + erofs_destroy_percpu_workers();
> +out_error_pcpu_worker:
> + destroy_workqueue(z_erofs_workqueue);
> +out_error_workqueue_init:
> + z_erofs_destroy_pcluster_pool();
> +out_error_pcluster_pool:
> return err;
> }
>
> @@ -1305,11 +1431,17 @@ static void z_erofs_decompressqueue_work(struct work_struct *work)
>
> DBG_BUGON(bgq->head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
> z_erofs_decompress_queue(bgq, &pagepool);
> -
> erofs_release_pages(&pagepool);
> kvfree(bgq);
> }
>
> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
> +static void z_erofs_decompressqueue_kthread_work(struct kthread_work *work)
> +{
> + z_erofs_decompressqueue_work((struct work_struct *)work);
> +}
> +#endif
> +
> static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
> int bios)
> {
> @@ -1324,9 +1456,24 @@ static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
>
> if (atomic_add_return(bios, &io->pending_bios))
> return;
> - /* Use workqueue and sync decompression for atomic contexts only */
> + /* Use (kthread_)work and sync decompression for atomic contexts only */
> if (in_atomic() || irqs_disabled()) {
> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
> + struct kthread_worker *worker;
> +
> + rcu_read_lock();
> + worker = rcu_dereference(
> + z_erofs_pcpu_workers[raw_smp_processor_id()]);
> + if (!worker) {
> + INIT_WORK(&io->u.work, z_erofs_decompressqueue_work);
> + queue_work(z_erofs_workqueue, &io->u.work);
> + } else {
> + kthread_queue_work(worker, &io->u.kthread_work);
> + }
> + rcu_read_unlock();
> +#else
> queue_work(z_erofs_workqueue, &io->u.work);
> +#endif
> /* enable sync decompression for readahead */
> if (sbi->opt.sync_decompress == EROFS_SYNC_DECOMPRESS_AUTO)
> sbi->opt.sync_decompress = EROFS_SYNC_DECOMPRESS_FORCE_ON;
> @@ -1455,7 +1602,12 @@ static struct z_erofs_decompressqueue *jobqueue_init(struct super_block *sb,
> *fg = true;
> goto fg_out;
> }
> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
> + kthread_init_work(&q->u.kthread_work,
> + z_erofs_decompressqueue_kthread_work);
> +#else
> INIT_WORK(&q->u.work, z_erofs_decompressqueue_work);
> +#endif
> } else {
> fg_out:
> q = fgq;
> @@ -1640,7 +1792,7 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
>
> /*
> * although background is preferred, no one is pending for submission.
> - * don't issue workqueue for decompression but drop it directly instead.
> + * don't issue decompression but drop it directly instead.
> */
> if (!*force_fg && !nr_bios) {
> kvfree(q[JQ_SUBMIT]);
> --
> 2.30.2
>

2023-02-07 02:55:51

by Gao Xiang

Subject: Re: [PATCH v4] erofs: replace erofs_unzipd workqueue with per-cpu threads



On 2023/2/7 03:41, Sandeep Dhavale wrote:
> On Mon, Feb 6, 2023 at 2:01 AM Gao Xiang <[email protected]> wrote:
>>
>> Hi Sandeep,
>>
>> On Fri, Jan 06, 2023 at 07:35:01AM +0000, Sandeep Dhavale wrote:
>>> Using per-cpu thread pool we can reduce the scheduling latency compared
>>> to workqueue implementation. With this patch scheduling latency and
>>> variation is reduced as per-cpu threads are high priority kthread_workers.
>>>
>>> The results were evaluated on arm64 Android devices running 5.10 kernel.
>>>
>>> The table below shows resulting improvements of total scheduling latency
>>> for the same app launch benchmark runs with 50 iterations. Scheduling
>>> latency is the latency between when the task (workqueue kworker vs
>>> kthread_worker) became eligible to run to when it actually started
>>> running.
>>> +-------------------------+-----------+----------------+---------+
>>> | | workqueue | kthread_worker | diff |
>>> +-------------------------+-----------+----------------+---------+
>>> | Average (us) | 15253 | 2914 | -80.89% |
>>> | Median (us) | 14001 | 2912 | -79.20% |
>>> | Minimum (us) | 3117 | 1027 | -67.05% |
>>> | Maximum (us) | 30170 | 3805 | -87.39% |
>>> | Standard deviation (us) | 7166 | 359 | |
>>> +-------------------------+-----------+----------------+---------+
>>>
>>> Background: Boot times and cold app launch benchmarks are very
>>> important to the android ecosystem as they directly translate to
>>> responsiveness from user point of view. While erofs provides
>>> a lot of important features like space savings, we saw some
>>> performance penalty in cold app launch benchmarks in few scenarios.
>>> Analysis showed that the significant variance was coming from the
>>> scheduling cost while decompression cost was more or less the same.
>>>
>>> Having per-cpu thread pool we can see from the above table that this
>>> variation is reduced by ~80% on average. This problem was discussed
>>> at LPC 2022. Link to LPC 2022 slides and
>>> talk at [1]
>>>
>>> [1] https://lpc.events/event/16/contributions/1338/
>>>
>>> Signed-off-by: Sandeep Dhavale <[email protected]>
>>> ---
>>> V3 -> V4
>>> * Updated commit message with background information
>>> V2 -> V3
>>> * Fix a warning Reported-by: kernel test robot <[email protected]>
>>> V1 -> V2
>>> * Changed name of kthread_workers from z_erofs to erofs_worker
>>> * Added kernel configuration to run kthread_workers at normal or
>>> high priority
>>> * Added cpu hotplug support
>>> * Added wrapped kthread_workers under worker_pool
>>> * Added one unbound thread in a pool to handle a context where
>>> we already stopped per-cpu kthread worker
>>> * Updated commit message
>>
>> I've just modified your v4 patch based on erofs -dev branch with
>> my previous suggestion [1], but I haven't tested it.
>>
>> Could you help check if the updated patch looks good to you and
>> test it on your side? If there are unexpected behaviors, please
>> help update as well, thanks!
> Thanks Xiang, I was working on the same. I see that you have cleaned it up.
> I will test it and report/fix any problems.
>
> Thanks,
> Sandeep.

Thanks! I look forward to your test. BTW, we have < 2 weeks for 6.3, so I'd
like to fix this up this week so that we can catch the 6.3 merge window.

I've fixed some cpu hotplug errors as below and added the result to a branch
for 0day CI testing.

Thanks,
Gao Xiang

diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 73198f494a6a..92a9e20948b0 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -398,7 +398,7 @@ static inline void erofs_destroy_percpu_workers(void) {}
static inline int erofs_init_percpu_workers(void) { return 0; }
#endif

-#if defined(CONFIG_HOTPLUG_CPU) && defined(EROFS_FS_PCPU_KTHREAD)
+#if defined(CONFIG_HOTPLUG_CPU) && defined(CONFIG_EROFS_FS_PCPU_KTHREAD)
static DEFINE_SPINLOCK(z_erofs_pcpu_worker_lock);
static enum cpuhp_state erofs_cpuhp_state;

@@ -408,7 +408,7 @@ static int erofs_cpu_online(unsigned int cpu)

worker = erofs_init_percpu_worker(cpu);
if (IS_ERR(worker))
- return ERR_PTR(worker);
+ return PTR_ERR(worker);

spin_lock(&z_erofs_pcpu_worker_lock);
old = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
@@ -428,7 +428,7 @@ static int erofs_cpu_offline(unsigned int cpu)
spin_lock(&z_erofs_pcpu_worker_lock);
worker = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
lockdep_is_held(&z_erofs_pcpu_worker_lock));
- rcu_assign_pointer(worker_pool.workers[cpu], NULL);
+ rcu_assign_pointer(z_erofs_pcpu_workers[cpu], NULL);
spin_unlock(&z_erofs_pcpu_worker_lock);

synchronize_rcu();

>
>>
>> [1] https://lore.kernel.org/r/[email protected]/
>>
>> Thanks,
>> Gao Xiang
>>
>> From 2e87235abc745c0fef8e32abcd3a51546b4378ad Mon Sep 17 00:00:00 2001
>> From: Sandeep Dhavale <[email protected]>
>> Date: Mon, 6 Feb 2023 17:53:39 +0800
>> Subject: [PATCH] erofs: add per-cpu threads for decompression
>>
>> Using per-cpu thread pool we can reduce the scheduling latency compared
>> to workqueue implementation. With this patch scheduling latency and
>> variation is reduced as per-cpu threads are high priority kthread_workers.
>>
>> The results were evaluated on arm64 Android devices running 5.10 kernel.
>>
>> The table below shows resulting improvements of total scheduling latency
>> for the same app launch benchmark runs with 50 iterations. Scheduling
>> latency is the latency between when the task (workqueue kworker vs
>> kthread_worker) became eligible to run to when it actually started
>> running.
>> +-------------------------+-----------+----------------+---------+
>> | | workqueue | kthread_worker | diff |
>> +-------------------------+-----------+----------------+---------+
>> | Average (us) | 15253 | 2914 | -80.89% |
>> | Median (us) | 14001 | 2912 | -79.20% |
>> | Minimum (us) | 3117 | 1027 | -67.05% |
>> | Maximum (us) | 30170 | 3805 | -87.39% |
>> | Standard deviation (us) | 7166 | 359 | |
>> +-------------------------+-----------+----------------+---------+
>>
>> Background: Boot times and cold app launch benchmarks are very
>> important to the android ecosystem as they directly translate to
>> responsiveness from user point of view. While erofs provides
>> a lot of important features like space savings, we saw some
>> performance penalty in cold app launch benchmarks in few scenarios.
>> Analysis showed that the significant variance was coming from the
>> scheduling cost while decompression cost was more or less the same.
>>
>> Having per-cpu thread pool we can see from the above table that this
>> variation is reduced by ~80% on average. This problem was discussed
>> at LPC 2022. Link to LPC 2022 slides and
>> talk at [1]
>>
>> [1] https://lpc.events/event/16/contributions/1338/
>>
>> Signed-off-by: Sandeep Dhavale <[email protected]>
>> Signed-off-by: Gao Xiang <[email protected]>
>> ---
>> fs/erofs/Kconfig | 18 +++++
>> fs/erofs/zdata.c | 190 ++++++++++++++++++++++++++++++++++++++++++-----
>> 2 files changed, 189 insertions(+), 19 deletions(-)
>>
>> diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
>> index 85490370e0ca..704fb59577e0 100644
>> --- a/fs/erofs/Kconfig
>> +++ b/fs/erofs/Kconfig
>> @@ -108,3 +108,21 @@ config EROFS_FS_ONDEMAND
>> read support.
>>
>> If unsure, say N.
>> +
>> +config EROFS_FS_PCPU_KTHREAD
>> + bool "EROFS per-cpu decompression kthread workers"
>> + depends on EROFS_FS_ZIP
>> + help
>> + Saying Y here enables per-CPU kthread workers pool to carry out
>> + async decompression for low latencies on some architectures.
>> +
>> + If unsure, say N.
>> +
>> +config EROFS_FS_PCPU_KTHREAD_HIPRI
>> + bool "EROFS high priority per-CPU kthread workers"
>> + depends on EROFS_FS_ZIP && EROFS_FS_PCPU_KTHREAD
>> + help
>> + This permits EROFS to configure per-CPU kthread workers to run
>> + at higher priority.
>> +
>> + If unsure, say N.
>> diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
>> index 384f64292f73..73198f494a6a 100644
>> --- a/fs/erofs/zdata.c
>> +++ b/fs/erofs/zdata.c
>> @@ -7,6 +7,8 @@
>> #include "compress.h"
>> #include <linux/prefetch.h>
>> #include <linux/psi.h>
>> +#include <linux/slab.h>
>> +#include <linux/cpuhotplug.h>
>>
>> #include <trace/events/erofs.h>
>>
>> @@ -109,6 +111,7 @@ struct z_erofs_decompressqueue {
>> union {
>> struct completion done;
>> struct work_struct work;
>> + struct kthread_work kthread_work;
>> } u;
>> bool eio, sync;
>> };
>> @@ -341,24 +344,128 @@ static void z_erofs_free_pcluster(struct z_erofs_pcluster *pcl)
>>
>> static struct workqueue_struct *z_erofs_workqueue __read_mostly;
>>
>> -void z_erofs_exit_zip_subsystem(void)
>> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
>> +static struct kthread_worker __rcu **z_erofs_pcpu_workers;
>> +
>> +static void erofs_destroy_percpu_workers(void)
>> {
>> - destroy_workqueue(z_erofs_workqueue);
>> - z_erofs_destroy_pcluster_pool();
>> + struct kthread_worker *worker;
>> + unsigned int cpu;
>> +
>> + for_each_possible_cpu(cpu) {
>> + worker = rcu_dereference_protected(
>> + z_erofs_pcpu_workers[cpu], 1);
>> + rcu_assign_pointer(z_erofs_pcpu_workers[cpu], NULL);
>> + if (worker)
>> + kthread_destroy_worker(worker);
>> + }
>> + kfree(z_erofs_pcpu_workers);
>> }
>>
>> -static inline int z_erofs_init_workqueue(void)
>> +static struct kthread_worker *erofs_init_percpu_worker(int cpu)
>> {
>> - const unsigned int onlinecpus = num_possible_cpus();
>> + struct kthread_worker *worker =
>> + kthread_create_worker_on_cpu(cpu, 0, "erofs_worker/%u", cpu);
>>
>> - /*
>> - * no need to spawn too many threads, limiting threads could minimum
>> - * scheduling overhead, perhaps per-CPU threads should be better?
>> - */
>> - z_erofs_workqueue = alloc_workqueue("erofs_unzipd",
>> - WQ_UNBOUND | WQ_HIGHPRI,
>> - onlinecpus + onlinecpus / 4);
>> - return z_erofs_workqueue ? 0 : -ENOMEM;
>> + if (IS_ERR(worker))
>> + return worker;
>> + if (IS_ENABLED(CONFIG_EROFS_FS_PCPU_KTHREAD_HIPRI))
>> + sched_set_fifo_low(worker->task);
>> + else
>> + sched_set_normal(worker->task, 0);
>> + return worker;
>> +}
>> +
>> +static int erofs_init_percpu_workers(void)
>> +{
>> + struct kthread_worker *worker;
>> + unsigned int cpu;
>> +
>> + z_erofs_pcpu_workers = kcalloc(num_possible_cpus(),
>> + sizeof(struct kthread_worker *), GFP_ATOMIC);
>> + if (!z_erofs_pcpu_workers)
>> + return -ENOMEM;
>> +
>> + for_each_online_cpu(cpu) { /* could miss cpu{off,on}line? */
>> + worker = erofs_init_percpu_worker(cpu);
>> + if (!IS_ERR(worker))
>> + rcu_assign_pointer(z_erofs_pcpu_workers[cpu], worker);
>> + }
>> + return 0;
>> +}
>> +#else
>> +static inline void erofs_destroy_percpu_workers(void) {}
>> +static inline int erofs_init_percpu_workers(void) { return 0; }
>> +#endif
>> +
>> +#if defined(CONFIG_HOTPLUG_CPU) && defined(EROFS_FS_PCPU_KTHREAD)
>> +static DEFINE_SPINLOCK(z_erofs_pcpu_worker_lock);
>> +static enum cpuhp_state erofs_cpuhp_state;
>> +
>> +static int erofs_cpu_online(unsigned int cpu)
>> +{
>> + struct kthread_worker *worker, *old;
>> +
>> + worker = erofs_init_percpu_worker(cpu);
>> + if (IS_ERR(worker))
>> + return ERR_PTR(worker);
>> +
>> + spin_lock(&z_erofs_pcpu_worker_lock);
>> + old = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
>> + lockdep_is_held(&z_erofs_pcpu_worker_lock));
>> + if (!old)
>> + rcu_assign_pointer(z_erofs_pcpu_workers[cpu], worker);
>> + spin_unlock(&z_erofs_pcpu_worker_lock);
>> + if (old)
>> + kthread_destroy_worker(worker);
>> + return 0;
>> +}
>> +
>> +static int erofs_cpu_offline(unsigned int cpu)
>> +{
>> + struct kthread_worker *worker;
>> +
>> + spin_lock(&z_erofs_pcpu_worker_lock);
>> + worker = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
>> + lockdep_is_held(&z_erofs_pcpu_worker_lock));
>> + rcu_assign_pointer(worker_pool.workers[cpu], NULL);
>> + spin_unlock(&z_erofs_pcpu_worker_lock);
>> +
>> + synchronize_rcu();
>> + if (worker)
>> + kthread_destroy_worker(worker);
>> + return 0;
>> +}
>> +
>> +static int erofs_cpu_hotplug_init(void)
>> +{
>> + int state;
>> +
>> + state = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
>> + "fs/erofs:online", erofs_cpu_online, erofs_cpu_offline);
>> + if (state < 0)
>> + return state;
>> +
>> + erofs_cpuhp_state = state;
>> + return 0;
>> +}
>> +
>> +static void erofs_cpu_hotplug_destroy(void)
>> +{
>> + if (erofs_cpuhp_state)
>> + cpuhp_remove_state_nocalls(erofs_cpuhp_state);
>> +}
>> +#else /* !CONFIG_HOTPLUG_CPU || !CONFIG_EROFS_FS_PCPU_KTHREAD */
>> +static inline int erofs_cpu_hotplug_init(void) { return 0; }
>> +static inline void erofs_cpu_hotplug_destroy(void) {}
>> +#endif
>> +
>> +void z_erofs_exit_zip_subsystem(void)
>> +{
>> + erofs_cpu_hotplug_destroy();
>> + erofs_destroy_percpu_workers();
>> + destroy_workqueue(z_erofs_workqueue);
>> + z_erofs_destroy_pcluster_pool();
>> }
>>
>> int __init z_erofs_init_zip_subsystem(void)
>> @@ -366,10 +473,29 @@ int __init z_erofs_init_zip_subsystem(void)
>> int err = z_erofs_create_pcluster_pool();
>>
>> if (err)
>> - return err;
>> - err = z_erofs_init_workqueue();
>> + goto out_error_pcluster_pool;
>> +
>> + z_erofs_workqueue = alloc_workqueue("erofs_worker",
>> + WQ_UNBOUND | WQ_HIGHPRI, num_possible_cpus());
>> + if (!z_erofs_workqueue)
>> + goto out_error_workqueue_init;
>> +
>> + err = erofs_init_percpu_workers();
>> if (err)
>> - z_erofs_destroy_pcluster_pool();
>> + goto out_error_pcpu_worker;
>> +
>> + err = erofs_cpu_hotplug_init();
>> + if (err < 0)
>> + goto out_error_cpuhp_init;
>> + return err;
>> +
>> +out_error_cpuhp_init:
>> + erofs_destroy_percpu_workers();
>> +out_error_pcpu_worker:
>> + destroy_workqueue(z_erofs_workqueue);
>> +out_error_workqueue_init:
>> + z_erofs_destroy_pcluster_pool();
>> +out_error_pcluster_pool:
>> return err;
>> }
>>
>> @@ -1305,11 +1431,17 @@ static void z_erofs_decompressqueue_work(struct work_struct *work)
>>
>> DBG_BUGON(bgq->head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
>> z_erofs_decompress_queue(bgq, &pagepool);
>> -
>> erofs_release_pages(&pagepool);
>> kvfree(bgq);
>> }
>>
>> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
>> +static void z_erofs_decompressqueue_kthread_work(struct kthread_work *work)
>> +{
>> + z_erofs_decompressqueue_work((struct work_struct *)work);
>> +}
>> +#endif
>> +
>> static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
>> int bios)
>> {
>> @@ -1324,9 +1456,24 @@ static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
>>
>> if (atomic_add_return(bios, &io->pending_bios))
>> return;
>> - /* Use workqueue and sync decompression for atomic contexts only */
>> + /* Use (kthread_)work and sync decompression for atomic contexts only */
>> if (in_atomic() || irqs_disabled()) {
>> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
>> + struct kthread_worker *worker;
>> +
>> + rcu_read_lock();
>> + worker = rcu_dereference(
>> + z_erofs_pcpu_workers[raw_smp_processor_id()]);
>> + if (!worker) {
>> + INIT_WORK(&io->u.work, z_erofs_decompressqueue_work);
>> + queue_work(z_erofs_workqueue, &io->u.work);
>> + } else {
>> + kthread_queue_work(worker, &io->u.kthread_work);
>> + }
>> + rcu_read_unlock();
>> +#else
>> queue_work(z_erofs_workqueue, &io->u.work);
>> +#endif
>> /* enable sync decompression for readahead */
>> if (sbi->opt.sync_decompress == EROFS_SYNC_DECOMPRESS_AUTO)
>> sbi->opt.sync_decompress = EROFS_SYNC_DECOMPRESS_FORCE_ON;
>> @@ -1455,7 +1602,12 @@ static struct z_erofs_decompressqueue *jobqueue_init(struct super_block *sb,
>> *fg = true;
>> goto fg_out;
>> }
>> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
>> + kthread_init_work(&q->u.kthread_work,
>> + z_erofs_decompressqueue_kthread_work);
>> +#else
>> INIT_WORK(&q->u.work, z_erofs_decompressqueue_work);
>> +#endif
>> } else {
>> fg_out:
>> q = fgq;
>> @@ -1640,7 +1792,7 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
>>
>> /*
>> * although background is preferred, no one is pending for submission.
>> - * don't issue workqueue for decompression but drop it directly instead.
>> + * don't issue decompression but drop it directly instead.
>> */
>> if (!*force_fg && !nr_bios) {
>> kvfree(q[JQ_SUBMIT]);
>> --
>> 2.30.2
>>

2023-02-08 07:00:33

by Sandeep Dhavale

Subject: Re: [PATCH v4] erofs: replace erofs_unzipd workqueue with per-cpu threads

On Mon, Feb 6, 2023 at 6:55 PM Gao Xiang <[email protected]> wrote:
>
>
>
> On 2023/2/7 03:41, Sandeep Dhavale wrote:
> > On Mon, Feb 6, 2023 at 2:01 AM Gao Xiang <[email protected]> wrote:
> >>
> >> Hi Sandeep,
> >>
> >> On Fri, Jan 06, 2023 at 07:35:01AM +0000, Sandeep Dhavale wrote:
> >>> Using per-cpu thread pool we can reduce the scheduling latency compared
> >>> to workqueue implementation. With this patch scheduling latency and
> >>> variation is reduced as per-cpu threads are high priority kthread_workers.
> >>>
> >>> The results were evaluated on arm64 Android devices running 5.10 kernel.
> >>>
> >>> The table below shows resulting improvements of total scheduling latency
> >>> for the same app launch benchmark runs with 50 iterations. Scheduling
> >>> latency is the latency between when the task (workqueue kworker vs
> >>> kthread_worker) became eligible to run to when it actually started
> >>> running.
> >>> +-------------------------+-----------+----------------+---------+
> >>> | | workqueue | kthread_worker | diff |
> >>> +-------------------------+-----------+----------------+---------+
> >>> | Average (us) | 15253 | 2914 | -80.89% |
> >>> | Median (us) | 14001 | 2912 | -79.20% |
> >>> | Minimum (us) | 3117 | 1027 | -67.05% |
> >>> | Maximum (us) | 30170 | 3805 | -87.39% |
> >>> | Standard deviation (us) | 7166 | 359 | |
> >>> +-------------------------+-----------+----------------+---------+
> >>>
> >>> Background: Boot times and cold app launch benchmarks are very
> >>> important to the android ecosystem as they directly translate to
> >>> responsiveness from user point of view. While erofs provides
> >>> a lot of important features like space savings, we saw some
> >>> performance penalty in cold app launch benchmarks in few scenarios.
> >>> Analysis showed that the significant variance was coming from the
> >>> scheduling cost while decompression cost was more or less the same.
> >>>
> >>> Having per-cpu thread pool we can see from the above table that this
> >>> variation is reduced by ~80% on average. This problem was discussed
> >>> at LPC 2022. Link to LPC 2022 slides and
> >>> talk at [1]
> >>>
> >>> [1] https://lpc.events/event/16/contributions/1338/
> >>>
> >>> Signed-off-by: Sandeep Dhavale <[email protected]>
> >>> ---
> >>> V3 -> V4
> >>> * Updated commit message with background information
> >>> V2 -> V3
> >>> * Fix a warning Reported-by: kernel test robot <[email protected]>
> >>> V1 -> V2
> >>> * Changed name of kthread_workers from z_erofs to erofs_worker
> >>> * Added kernel configuration to run kthread_workers at normal or
> >>> high priority
> >>> * Added cpu hotplug support
> >>> * Added wrapped kthread_workers under worker_pool
> >>> * Added one unbound thread in a pool to handle a context where
> >>> we already stopped per-cpu kthread worker
> >>> * Updated commit message
> >>
> >> I've just modified your v4 patch based on erofs -dev branch with
> >> my previous suggestion [1], but I haven't tested it.
> >>
> >> Could you help check if the updated patch looks good to you and
> >> test it on your side? If there are unexpected behaviors, please
> >> help update as well, thanks!
> > Thanks Xiang, I was working on the same. I see that you have cleaned it up.
> > I will test it and report/fix any problems.
> >
> > Thanks,
> > Sandeep.
>
> Thanks! Look forward to your test. BTW, we have < 2 weeks for 6.3, so I'd
> like to fix it this week so that we could catch 6.3 merge window.
>
>
> I've fixed some cpu hotplug errors as below and added to a branch for 0day CI
> testing.
>
Hi Xiang,
With this version of the patch I have tested:
- Multiple device reboot test
- Cold App launch tests
- Cold App launch tests with cpu offline/online

All tests ran successfully and no issues were observed.

Thanks,
Sandeep.
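
For reference, a minimal userspace sketch (illustrative only, not the actual
test harness used above, and it needs root) of one way to exercise the cpu
offline/online path via the standard sysfs hotplug interface while a
decompression workload runs in parallel; bouncing a secondary CPU like this
should hit erofs_cpu_offline()/erofs_cpu_online() as well as the workqueue
fallback in z_erofs_decompress_kickoff():

/*
 * Illustrative only: repeatedly take a secondary CPU offline and bring it
 * back online through sysfs so the per-cpu kthread worker is torn down and
 * re-created while decompression is in flight.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int set_cpu_online(int cpu, int online)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/online", cpu);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", online);
	fclose(f);
	return 0;
}

int main(int argc, char **argv)
{
	int cpu = argc > 1 ? atoi(argv[1]) : 1;	/* cpu0 is often not hot-removable */
	int i;

	for (i = 0; i < 20; i++) {
		if (set_cpu_online(cpu, 0))	/* per-cpu worker torn down */
			return 1;
		sleep(1);
		if (set_cpu_online(cpu, 1))	/* per-cpu worker re-created */
			return 1;
		sleep(1);
	}
	return 0;
}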

> Thanks,
> Gao Xiang
>
> diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
> index 73198f494a6a..92a9e20948b0 100644
> --- a/fs/erofs/zdata.c
> +++ b/fs/erofs/zdata.c
> @@ -398,7 +398,7 @@ static inline void erofs_destroy_percpu_workers(void) {}
> static inline int erofs_init_percpu_workers(void) { return 0; }
> #endif
>
> -#if defined(CONFIG_HOTPLUG_CPU) && defined(EROFS_FS_PCPU_KTHREAD)
> +#if defined(CONFIG_HOTPLUG_CPU) && defined(CONFIG_EROFS_FS_PCPU_KTHREAD)
> static DEFINE_SPINLOCK(z_erofs_pcpu_worker_lock);
> static enum cpuhp_state erofs_cpuhp_state;
>
> @@ -408,7 +408,7 @@ static int erofs_cpu_online(unsigned int cpu)
>
> worker = erofs_init_percpu_worker(cpu);
> if (IS_ERR(worker))
> - return ERR_PTR(worker);
> + return PTR_ERR(worker);
>
> spin_lock(&z_erofs_pcpu_worker_lock);
> old = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
> @@ -428,7 +428,7 @@ static int erofs_cpu_offline(unsigned int cpu)
> spin_lock(&z_erofs_pcpu_worker_lock);
> worker = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
> lockdep_is_held(&z_erofs_pcpu_worker_lock));
> - rcu_assign_pointer(worker_pool.workers[cpu], NULL);
> + rcu_assign_pointer(z_erofs_pcpu_workers[cpu], NULL);
> spin_unlock(&z_erofs_pcpu_worker_lock);
>
> synchronize_rcu();
>
> >
> >>
> >> [1] https://lore.kernel.org/r/[email protected]/
> >>
> >> Thanks,
> >> Gao Xiang
> >>
> >> From 2e87235abc745c0fef8e32abcd3a51546b4378ad Mon Sep 17 00:00:00 2001
> >> From: Sandeep Dhavale <[email protected]>
> >> Date: Mon, 6 Feb 2023 17:53:39 +0800
> >> Subject: [PATCH] erofs: add per-cpu threads for decompression
> >>
> >> Using per-cpu thread pool we can reduce the scheduling latency compared
> >> to workqueue implementation. With this patch scheduling latency and
> >> variation is reduced as per-cpu threads are high priority kthread_workers.
> >>
> >> The results were evaluated on arm64 Android devices running 5.10 kernel.
> >>
> >> The table below shows resulting improvements of total scheduling latency
> >> for the same app launch benchmark runs with 50 iterations. Scheduling
> >> latency is the latency between when the task (workqueue kworker vs
> >> kthread_worker) became eligible to run to when it actually started
> >> running.
> >> +-------------------------+-----------+----------------+---------+
> >> | | workqueue | kthread_worker | diff |
> >> +-------------------------+-----------+----------------+---------+
> >> | Average (us) | 15253 | 2914 | -80.89% |
> >> | Median (us) | 14001 | 2912 | -79.20% |
> >> | Minimum (us) | 3117 | 1027 | -67.05% |
> >> | Maximum (us) | 30170 | 3805 | -87.39% |
> >> | Standard deviation (us) | 7166 | 359 | |
> >> +-------------------------+-----------+----------------+---------+
> >>
> >> Background: Boot times and cold app launch benchmarks are very
> >> important to the android ecosystem as they directly translate to
> >> responsiveness from user point of view. While erofs provides
> >> a lot of important features like space savings, we saw some
> >> performance penalty in cold app launch benchmarks in few scenarios.
> >> Analysis showed that the significant variance was coming from the
> >> scheduling cost while decompression cost was more or less the same.
> >>
> >> Having per-cpu thread pool we can see from the above table that this
> >> variation is reduced by ~80% on average. This problem was discussed
> >> at LPC 2022. Link to LPC 2022 slides and
> >> talk at [1]
> >>
> >> [1] https://lpc.events/event/16/contributions/1338/
> >>
> >> Signed-off-by: Sandeep Dhavale <[email protected]>
> >> Signed-off-by: Gao Xiang <[email protected]>
> >> ---
> >> fs/erofs/Kconfig | 18 +++++
> >> fs/erofs/zdata.c | 190 ++++++++++++++++++++++++++++++++++++++++++-----
> >> 2 files changed, 189 insertions(+), 19 deletions(-)
> >>
> >> diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
> >> index 85490370e0ca..704fb59577e0 100644
> >> --- a/fs/erofs/Kconfig
> >> +++ b/fs/erofs/Kconfig
> >> @@ -108,3 +108,21 @@ config EROFS_FS_ONDEMAND
> >> read support.
> >>
> >> If unsure, say N.
> >> +
> >> +config EROFS_FS_PCPU_KTHREAD
> >> + bool "EROFS per-cpu decompression kthread workers"
> >> + depends on EROFS_FS_ZIP
> >> + help
> >> + Saying Y here enables per-CPU kthread workers pool to carry out
> >> + async decompression for low latencies on some architectures.
> >> +
> >> + If unsure, say N.
> >> +
> >> +config EROFS_FS_PCPU_KTHREAD_HIPRI
> >> + bool "EROFS high priority per-CPU kthread workers"
> >> + depends on EROFS_FS_ZIP && EROFS_FS_PCPU_KTHREAD
> >> + help
> >> + This permits EROFS to configure per-CPU kthread workers to run
> >> + at higher priority.
> >> +
> >> + If unsure, say N.
> >> diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
> >> index 384f64292f73..73198f494a6a 100644
> >> --- a/fs/erofs/zdata.c
> >> +++ b/fs/erofs/zdata.c
> >> @@ -7,6 +7,8 @@
> >> #include "compress.h"
> >> #include <linux/prefetch.h>
> >> #include <linux/psi.h>
> >> +#include <linux/slab.h>
> >> +#include <linux/cpuhotplug.h>
> >>
> >> #include <trace/events/erofs.h>
> >>
> >> @@ -109,6 +111,7 @@ struct z_erofs_decompressqueue {
> >> union {
> >> struct completion done;
> >> struct work_struct work;
> >> + struct kthread_work kthread_work;
> >> } u;
> >> bool eio, sync;
> >> };
> >> @@ -341,24 +344,128 @@ static void z_erofs_free_pcluster(struct z_erofs_pcluster *pcl)
> >>
> >> static struct workqueue_struct *z_erofs_workqueue __read_mostly;
> >>
> >> -void z_erofs_exit_zip_subsystem(void)
> >> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
> >> +static struct kthread_worker __rcu **z_erofs_pcpu_workers;
> >> +
> >> +static void erofs_destroy_percpu_workers(void)
> >> {
> >> - destroy_workqueue(z_erofs_workqueue);
> >> - z_erofs_destroy_pcluster_pool();
> >> + struct kthread_worker *worker;
> >> + unsigned int cpu;
> >> +
> >> + for_each_possible_cpu(cpu) {
> >> + worker = rcu_dereference_protected(
> >> + z_erofs_pcpu_workers[cpu], 1);
> >> + rcu_assign_pointer(z_erofs_pcpu_workers[cpu], NULL);
> >> + if (worker)
> >> + kthread_destroy_worker(worker);
> >> + }
> >> + kfree(z_erofs_pcpu_workers);
> >> }
> >>
> >> -static inline int z_erofs_init_workqueue(void)
> >> +static struct kthread_worker *erofs_init_percpu_worker(int cpu)
> >> {
> >> - const unsigned int onlinecpus = num_possible_cpus();
> >> + struct kthread_worker *worker =
> >> + kthread_create_worker_on_cpu(cpu, 0, "erofs_worker/%u", cpu);
> >>
> >> - /*
> >> - * no need to spawn too many threads, limiting threads could minimum
> >> - * scheduling overhead, perhaps per-CPU threads should be better?
> >> - */
> >> - z_erofs_workqueue = alloc_workqueue("erofs_unzipd",
> >> - WQ_UNBOUND | WQ_HIGHPRI,
> >> - onlinecpus + onlinecpus / 4);
> >> - return z_erofs_workqueue ? 0 : -ENOMEM;
> >> + if (IS_ERR(worker))
> >> + return worker;
> >> + if (IS_ENABLED(CONFIG_EROFS_FS_PCPU_KTHREAD_HIPRI))
> >> + sched_set_fifo_low(worker->task);
> >> + else
> >> + sched_set_normal(worker->task, 0);
> >> + return worker;
> >> +}
> >> +
> >> +static int erofs_init_percpu_workers(void)
> >> +{
> >> + struct kthread_worker *worker;
> >> + unsigned int cpu;
> >> +
> >> + z_erofs_pcpu_workers = kcalloc(num_possible_cpus(),
> >> + sizeof(struct kthread_worker *), GFP_ATOMIC);
> >> + if (!z_erofs_pcpu_workers)
> >> + return -ENOMEM;
> >> +
> >> + for_each_online_cpu(cpu) { /* could miss cpu{off,on}line? */
> >> + worker = erofs_init_percpu_worker(cpu);
> >> + if (!IS_ERR(worker))
> >> + rcu_assign_pointer(z_erofs_pcpu_workers[cpu], worker);
> >> + }
> >> + return 0;
> >> +}
> >> +#else
> >> +static inline void erofs_destroy_percpu_workers(void) {}
> >> +static inline int erofs_init_percpu_workers(void) { return 0; }
> >> +#endif
> >> +
> >> +#if defined(CONFIG_HOTPLUG_CPU) && defined(EROFS_FS_PCPU_KTHREAD)
> >> +static DEFINE_SPINLOCK(z_erofs_pcpu_worker_lock);
> >> +static enum cpuhp_state erofs_cpuhp_state;
> >> +
> >> +static int erofs_cpu_online(unsigned int cpu)
> >> +{
> >> + struct kthread_worker *worker, *old;
> >> +
> >> + worker = erofs_init_percpu_worker(cpu);
> >> + if (IS_ERR(worker))
> >> + return ERR_PTR(worker);
> >> +
> >> + spin_lock(&z_erofs_pcpu_worker_lock);
> >> + old = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
> >> + lockdep_is_held(&z_erofs_pcpu_worker_lock));
> >> + if (!old)
> >> + rcu_assign_pointer(z_erofs_pcpu_workers[cpu], worker);
> >> + spin_unlock(&z_erofs_pcpu_worker_lock);
> >> + if (old)
> >> + kthread_destroy_worker(worker);
> >> + return 0;
> >> +}
> >> +
> >> +static int erofs_cpu_offline(unsigned int cpu)
> >> +{
> >> + struct kthread_worker *worker;
> >> +
> >> + spin_lock(&z_erofs_pcpu_worker_lock);
> >> + worker = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
> >> + lockdep_is_held(&z_erofs_pcpu_worker_lock));
> >> + rcu_assign_pointer(worker_pool.workers[cpu], NULL);
> >> + spin_unlock(&z_erofs_pcpu_worker_lock);
> >> +
> >> + synchronize_rcu();
> >> + if (worker)
> >> + kthread_destroy_worker(worker);
> >> + return 0;
> >> +}
> >> +
> >> +static int erofs_cpu_hotplug_init(void)
> >> +{
> >> + int state;
> >> +
> >> + state = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> >> + "fs/erofs:online", erofs_cpu_online, erofs_cpu_offline);
> >> + if (state < 0)
> >> + return state;
> >> +
> >> + erofs_cpuhp_state = state;
> >> + return 0;
> >> +}
> >> +
> >> +static void erofs_cpu_hotplug_destroy(void)
> >> +{
> >> + if (erofs_cpuhp_state)
> >> + cpuhp_remove_state_nocalls(erofs_cpuhp_state);
> >> +}
> >> +#else /* !CONFIG_HOTPLUG_CPU || !CONFIG_EROFS_FS_PCPU_KTHREAD */
> >> +static inline int erofs_cpu_hotplug_init(void) { return 0; }
> >> +static inline void erofs_cpu_hotplug_destroy(void) {}
> >> +#endif
> >> +
> >> +void z_erofs_exit_zip_subsystem(void)
> >> +{
> >> + erofs_cpu_hotplug_destroy();
> >> + erofs_destroy_percpu_workers();
> >> + destroy_workqueue(z_erofs_workqueue);
> >> + z_erofs_destroy_pcluster_pool();
> >> }
> >>
> >> int __init z_erofs_init_zip_subsystem(void)
> >> @@ -366,10 +473,29 @@ int __init z_erofs_init_zip_subsystem(void)
> >> int err = z_erofs_create_pcluster_pool();
> >>
> >> if (err)
> >> - return err;
> >> - err = z_erofs_init_workqueue();
> >> + goto out_error_pcluster_pool;
> >> +
> >> + z_erofs_workqueue = alloc_workqueue("erofs_worker",
> >> + WQ_UNBOUND | WQ_HIGHPRI, num_possible_cpus());
> >> + if (!z_erofs_workqueue)
> >> + goto out_error_workqueue_init;
> >> +
> >> + err = erofs_init_percpu_workers();
> >> if (err)
> >> - z_erofs_destroy_pcluster_pool();
> >> + goto out_error_pcpu_worker;
> >> +
> >> + err = erofs_cpu_hotplug_init();
> >> + if (err < 0)
> >> + goto out_error_cpuhp_init;
> >> + return err;
> >> +
> >> +out_error_cpuhp_init:
> >> + erofs_destroy_percpu_workers();
> >> +out_error_pcpu_worker:
> >> + destroy_workqueue(z_erofs_workqueue);
> >> +out_error_workqueue_init:
> >> + z_erofs_destroy_pcluster_pool();
> >> +out_error_pcluster_pool:
> >> return err;
> >> }
> >>
> >> @@ -1305,11 +1431,17 @@ static void z_erofs_decompressqueue_work(struct work_struct *work)
> >>
> >> DBG_BUGON(bgq->head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
> >> z_erofs_decompress_queue(bgq, &pagepool);
> >> -
> >> erofs_release_pages(&pagepool);
> >> kvfree(bgq);
> >> }
> >>
> >> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
> >> +static void z_erofs_decompressqueue_kthread_work(struct kthread_work *work)
> >> +{
> >> + z_erofs_decompressqueue_work((struct work_struct *)work);
> >> +}
> >> +#endif
> >> +
> >> static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
> >> int bios)
> >> {
> >> @@ -1324,9 +1456,24 @@ static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
> >>
> >> if (atomic_add_return(bios, &io->pending_bios))
> >> return;
> >> - /* Use workqueue and sync decompression for atomic contexts only */
> >> + /* Use (kthread_)work and sync decompression for atomic contexts only */
> >> if (in_atomic() || irqs_disabled()) {
> >> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
> >> + struct kthread_worker *worker;
> >> +
> >> + rcu_read_lock();
> >> + worker = rcu_dereference(
> >> + z_erofs_pcpu_workers[raw_smp_processor_id()]);
> >> + if (!worker) {
> >> + INIT_WORK(&io->u.work, z_erofs_decompressqueue_work);
> >> + queue_work(z_erofs_workqueue, &io->u.work);
> >> + } else {
> >> + kthread_queue_work(worker, &io->u.kthread_work);
> >> + }
> >> + rcu_read_unlock();
> >> +#else
> >> queue_work(z_erofs_workqueue, &io->u.work);
> >> +#endif
> >> /* enable sync decompression for readahead */
> >> if (sbi->opt.sync_decompress == EROFS_SYNC_DECOMPRESS_AUTO)
> >> sbi->opt.sync_decompress = EROFS_SYNC_DECOMPRESS_FORCE_ON;
> >> @@ -1455,7 +1602,12 @@ static struct z_erofs_decompressqueue *jobqueue_init(struct super_block *sb,
> >> *fg = true;
> >> goto fg_out;
> >> }
> >> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
> >> + kthread_init_work(&q->u.kthread_work,
> >> + z_erofs_decompressqueue_kthread_work);
> >> +#else
> >> INIT_WORK(&q->u.work, z_erofs_decompressqueue_work);
> >> +#endif
> >> } else {
> >> fg_out:
> >> q = fgq;
> >> @@ -1640,7 +1792,7 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
> >>
> >> /*
> >> * although background is preferred, no one is pending for submission.
> >> - * don't issue workqueue for decompression but drop it directly instead.
> >> + * don't issue decompression but drop it directly instead.
> >> */
> >> if (!*force_fg && !nr_bios) {
> >> kvfree(q[JQ_SUBMIT]);
> >> --
> >> 2.30.2
> >>

2023-02-08 08:13:06

by Gao Xiang

[permalink] [raw]
Subject: Re: [PATCH v4] erofs: replace erofs_unzipd workqueue with per-cpu threads



On 2023/2/8 14:58, Sandeep Dhavale wrote:
> On Mon, Feb 6, 2023 at 6:55 PM Gao Xiang <[email protected]> wrote:
>>
>>
>>
>> On 2023/2/7 03:41, Sandeep Dhavale wrote:
>>> On Mon, Feb 6, 2023 at 2:01 AM Gao Xiang <[email protected]> wrote:
>>>>
>>>> Hi Sandeep,
>>>>
>>>> On Fri, Jan 06, 2023 at 07:35:01AM +0000, Sandeep Dhavale wrote:
>>>>> Using per-cpu thread pool we can reduce the scheduling latency compared
>>>>> to workqueue implementation. With this patch scheduling latency and
>>>>> variation is reduced as per-cpu threads are high priority kthread_workers.
>>>>>
>>>>> The results were evaluated on arm64 Android devices running 5.10 kernel.
>>>>>
>>>>> The table below shows resulting improvements of total scheduling latency
>>>>> for the same app launch benchmark runs with 50 iterations. Scheduling
>>>>> latency is the latency between when the task (workqueue kworker vs
>>>>> kthread_worker) became eligible to run to when it actually started
>>>>> running.
>>>>> +-------------------------+-----------+----------------+---------+
>>>>> | | workqueue | kthread_worker | diff |
>>>>> +-------------------------+-----------+----------------+---------+
>>>>> | Average (us) | 15253 | 2914 | -80.89% |
>>>>> | Median (us) | 14001 | 2912 | -79.20% |
>>>>> | Minimum (us) | 3117 | 1027 | -67.05% |
>>>>> | Maximum (us) | 30170 | 3805 | -87.39% |
>>>>> | Standard deviation (us) | 7166 | 359 | |
>>>>> +-------------------------+-----------+----------------+---------+
>>>>>
>>>>> Background: Boot times and cold app launch benchmarks are very
>>>>> important to the android ecosystem as they directly translate to
>>>>> responsiveness from user point of view. While erofs provides
>>>>> a lot of important features like space savings, we saw some
>>>>> performance penalty in cold app launch benchmarks in few scenarios.
>>>>> Analysis showed that the significant variance was coming from the
>>>>> scheduling cost while decompression cost was more or less the same.
>>>>>
>>>>> Having per-cpu thread pool we can see from the above table that this
>>>>> variation is reduced by ~80% on average. This problem was discussed
>>>>> at LPC 2022. Link to LPC 2022 slides and
>>>>> talk at [1]
>>>>>
>>>>> [1] https://lpc.events/event/16/contributions/1338/
>>>>>
>>>>> Signed-off-by: Sandeep Dhavale <[email protected]>
>>>>> ---
>>>>> V3 -> V4
>>>>> * Updated commit message with background information
>>>>> V2 -> V3
>>>>> * Fix a warning Reported-by: kernel test robot <[email protected]>
>>>>> V1 -> V2
>>>>> * Changed name of kthread_workers from z_erofs to erofs_worker
>>>>> * Added kernel configuration to run kthread_workers at normal or
>>>>> high priority
>>>>> * Added cpu hotplug support
>>>>> * Added wrapped kthread_workers under worker_pool
>>>>> * Added one unbound thread in a pool to handle a context where
>>>>> we already stopped per-cpu kthread worker
>>>>> * Updated commit message
>>>>
>>>> I've just modified your v4 patch based on erofs -dev branch with
>>>> my previous suggestion [1], but I haven't tested it.
>>>>
>>>> Could you help check if the updated patch looks good to you and
>>>> test it on your side? If there are unexpected behaviors, please
>>>> help update as well, thanks!
>>> Thanks Xiang, I was working on the same. I see that you have cleaned it up.
>>> I will test it and report/fix any problems.
>>>
>>> Thanks,
>>> Sandeep.
>>
>> Thanks! Look forward to your test. BTW, we have < 2 weeks for 6.3, so I'd
>> like to fix it this week so that we could catch 6.3 merge window.
>>
>>
>> I've fixed some cpu hotplug errors as below and added to a branch for 0day CI
>> testing.
>>
> Hi Xiang,
> With this version of the patch I have tested:
> - Multiple device reboot test
> - Cold App launch tests
> - Cold App launch tests with cpu offline/online
>
> All tests ran successfully and no issues were observed.

Okay, thanks! I will resend & submit this version for -next now
and test it on my side if there are no other concerns.

Thanks,
Gao Xiang