2023-02-08 09:34:28

by Gao Xiang

Subject: [PATCH v5] erofs: add per-cpu threads for decompression as an option

From: Sandeep Dhavale <[email protected]>

Using a per-cpu thread pool, we can reduce scheduling latency compared
to the workqueue implementation. With this patch, scheduling latency and
its variation are reduced, as per-cpu threads are high-priority kthread_workers.

The results were evaluated on arm64 Android devices running 5.10 kernel.

The table below shows the resulting improvements in total scheduling
latency for the same app-launch benchmark with 50 iterations. Scheduling
latency is the time between when the task (workqueue kworker vs.
kthread_worker) becomes eligible to run and when it actually starts
running.
+-------------------------+-----------+----------------+---------+
|                         | workqueue | kthread_worker |  diff   |
+-------------------------+-----------+----------------+---------+
| Average (us)            |     15253 |           2914 | -80.89% |
| Median (us)             |     14001 |           2912 | -79.20% |
| Minimum (us)            |      3117 |           1027 | -67.05% |
| Maximum (us)            |     30170 |           3805 | -87.39% |
| Standard deviation (us) |      7166 |            359 |         |
+-------------------------+-----------+----------------+---------+

Background: Boot times and cold app-launch benchmarks are very
important to the Android ecosystem, as they directly translate to
responsiveness from the user's point of view. While EROFS provides
a lot of important features, like space savings, we saw some
performance penalty in cold app-launch benchmarks in a few scenarios.
Analysis showed that the significant variance was coming from the
scheduling cost, while the decompression cost was more or less the same.

With a per-cpu thread pool, we can see from the table above that this
variation is reduced by ~80% on average. This problem was discussed
at LPC 2022; the slides and talk are linked at [1].

[1] https://lpc.events/event/16/contributions/1338/

Signed-off-by: Sandeep Dhavale <[email protected]>
[ Gao Xiang: At least, we have to add this until the WQ_UNBOUND
workqueue issue [2] on many arm64 devices is resolved. ]
[2] https://lore.kernel.org/r/CAJkfWY490-m6wNubkxiTPsW59sfsQs37Wey279LmiRxKt7aQYg@mail.gmail.com
Signed-off-by: Gao Xiang <[email protected]>
---
v4: https://lore.kernel.org/r/[email protected]/

changes since v4:
- fall back to the workqueue rather than a global kthread, and
  make this feature an option for now.
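
As a quick aside for reviewers less familiar with the kthread_worker API,
the mechanism used below boils down to the following minimal sketch (a
rough illustration with hypothetical demo_* names, not code from this
patch): a worker thread pinned to one CPU, optionally promoted to low
SCHED_FIFO priority, with work items queued directly to it.

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>

static struct kthread_worker *demo_worker;
static struct kthread_work demo_work;

static void demo_work_fn(struct kthread_work *w)
{
	/* decompression would run here, on the CPU the worker is bound to */
}

static int demo_setup(unsigned int cpu, bool hipri)
{
	demo_worker = kthread_create_worker_on_cpu(cpu, 0, "demo/%u", cpu);
	if (IS_ERR(demo_worker))
		return PTR_ERR(demo_worker);
	if (hipri)
		sched_set_fifo_low(demo_worker->task);	/* RT, above CFS tasks */
	kthread_init_work(&demo_work, demo_work_fn);
	kthread_queue_work(demo_worker, &demo_work);
	return 0;
}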

fs/erofs/Kconfig | 18 +++++
fs/erofs/zdata.c | 190 ++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 188 insertions(+), 20 deletions(-)

diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
index 85490370e0ca..704fb59577e0 100644
--- a/fs/erofs/Kconfig
+++ b/fs/erofs/Kconfig
@@ -108,3 +108,21 @@ config EROFS_FS_ONDEMAND
 	  read support.
 
 	  If unsure, say N.
+
+config EROFS_FS_PCPU_KTHREAD
+	bool "EROFS per-cpu decompression kthread workers"
+	depends on EROFS_FS_ZIP
+	help
+	  Saying Y here enables per-CPU kthread workers pool to carry out
+	  async decompression for low latencies on some architectures.
+
+	  If unsure, say N.
+
+config EROFS_FS_PCPU_KTHREAD_HIPRI
+	bool "EROFS high priority per-CPU kthread workers"
+	depends on EROFS_FS_ZIP && EROFS_FS_PCPU_KTHREAD
+	help
+	  This permits EROFS to configure per-CPU kthread workers to run
+	  at higher priority.
+
+	  If unsure, say N.
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 384f64292f73..8ea3f5fe985e 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -7,7 +7,7 @@
 #include "compress.h"
 #include <linux/prefetch.h>
 #include <linux/psi.h>
-
+#include <linux/cpuhotplug.h>
 #include <trace/events/erofs.h>
 
 #define Z_EROFS_PCLUSTER_MAX_PAGES	(Z_EROFS_PCLUSTER_MAX_SIZE / PAGE_SIZE)
@@ -109,6 +109,7 @@ struct z_erofs_decompressqueue {
 	union {
 		struct completion done;
 		struct work_struct work;
+		struct kthread_work kthread_work;
 	} u;
 	bool eio, sync;
 };
@@ -341,24 +342,128 @@ static void z_erofs_free_pcluster(struct z_erofs_pcluster *pcl)
 
 static struct workqueue_struct *z_erofs_workqueue __read_mostly;
 
-void z_erofs_exit_zip_subsystem(void)
+#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
+static struct kthread_worker __rcu **z_erofs_pcpu_workers;
+
+static void erofs_destroy_percpu_workers(void)
 {
-	destroy_workqueue(z_erofs_workqueue);
-	z_erofs_destroy_pcluster_pool();
+	struct kthread_worker *worker;
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu) {
+		worker = rcu_dereference_protected(
+					z_erofs_pcpu_workers[cpu], 1);
+		rcu_assign_pointer(z_erofs_pcpu_workers[cpu], NULL);
+		if (worker)
+			kthread_destroy_worker(worker);
+	}
+	kfree(z_erofs_pcpu_workers);
 }
 
-static inline int z_erofs_init_workqueue(void)
+static struct kthread_worker *erofs_init_percpu_worker(int cpu)
 {
-	const unsigned int onlinecpus = num_possible_cpus();
+	struct kthread_worker *worker =
+		kthread_create_worker_on_cpu(cpu, 0, "erofs_worker/%u", cpu);
 
-	/*
-	 * no need to spawn too many threads, limiting threads could minimum
-	 * scheduling overhead, perhaps per-CPU threads should be better?
-	 */
-	z_erofs_workqueue = alloc_workqueue("erofs_unzipd",
-					    WQ_UNBOUND | WQ_HIGHPRI,
-					    onlinecpus + onlinecpus / 4);
-	return z_erofs_workqueue ? 0 : -ENOMEM;
+	if (IS_ERR(worker))
+		return worker;
+	if (IS_ENABLED(CONFIG_EROFS_FS_PCPU_KTHREAD_HIPRI))
+		sched_set_fifo_low(worker->task);
+	else
+		sched_set_normal(worker->task, 0);
+	return worker;
+}
+
+static int erofs_init_percpu_workers(void)
+{
+	struct kthread_worker *worker;
+	unsigned int cpu;
+
+	z_erofs_pcpu_workers = kcalloc(num_possible_cpus(),
+			sizeof(struct kthread_worker *), GFP_ATOMIC);
+	if (!z_erofs_pcpu_workers)
+		return -ENOMEM;
+
+	for_each_online_cpu(cpu) {	/* could miss cpu{off,on}line? */
+		worker = erofs_init_percpu_worker(cpu);
+		if (!IS_ERR(worker))
+			rcu_assign_pointer(z_erofs_pcpu_workers[cpu], worker);
+	}
+	return 0;
+}
+#else
+static inline void erofs_destroy_percpu_workers(void) {}
+static inline int erofs_init_percpu_workers(void) { return 0; }
+#endif
+
+#if defined(CONFIG_HOTPLUG_CPU) && defined(CONFIG_EROFS_FS_PCPU_KTHREAD)
+static DEFINE_SPINLOCK(z_erofs_pcpu_worker_lock);
+static enum cpuhp_state erofs_cpuhp_state;
+
+static int erofs_cpu_online(unsigned int cpu)
+{
+	struct kthread_worker *worker, *old;
+
+	worker = erofs_init_percpu_worker(cpu);
+	if (IS_ERR(worker))
+		return PTR_ERR(worker);
+
+	spin_lock(&z_erofs_pcpu_worker_lock);
+	old = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
+			lockdep_is_held(&z_erofs_pcpu_worker_lock));
+	if (!old)
+		rcu_assign_pointer(z_erofs_pcpu_workers[cpu], worker);
+	spin_unlock(&z_erofs_pcpu_worker_lock);
+	if (old)
+		kthread_destroy_worker(worker);
+	return 0;
+}
+
+static int erofs_cpu_offline(unsigned int cpu)
+{
+	struct kthread_worker *worker;
+
+	spin_lock(&z_erofs_pcpu_worker_lock);
+	worker = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
+			lockdep_is_held(&z_erofs_pcpu_worker_lock));
+	rcu_assign_pointer(z_erofs_pcpu_workers[cpu], NULL);
+	spin_unlock(&z_erofs_pcpu_worker_lock);
+
+	synchronize_rcu();
+	if (worker)
+		kthread_destroy_worker(worker);
+	return 0;
+}
+
+static int erofs_cpu_hotplug_init(void)
+{
+	int state;
+
+	state = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
+			"fs/erofs:online", erofs_cpu_online, erofs_cpu_offline);
+	if (state < 0)
+		return state;
+
+	erofs_cpuhp_state = state;
+	return 0;
+}
+
+static void erofs_cpu_hotplug_destroy(void)
+{
+	if (erofs_cpuhp_state)
+		cpuhp_remove_state_nocalls(erofs_cpuhp_state);
+}
+#else /* !CONFIG_HOTPLUG_CPU || !CONFIG_EROFS_FS_PCPU_KTHREAD */
+static inline int erofs_cpu_hotplug_init(void) { return 0; }
+static inline void erofs_cpu_hotplug_destroy(void) {}
+#endif
+
+void z_erofs_exit_zip_subsystem(void)
+{
+	erofs_cpu_hotplug_destroy();
+	erofs_destroy_percpu_workers();
+	destroy_workqueue(z_erofs_workqueue);
+	z_erofs_destroy_pcluster_pool();
 }
 
 int __init z_erofs_init_zip_subsystem(void)
@@ -366,10 +471,29 @@ int __init z_erofs_init_zip_subsystem(void)
 	int err = z_erofs_create_pcluster_pool();
 
 	if (err)
-		return err;
-	err = z_erofs_init_workqueue();
+		goto out_error_pcluster_pool;
+
+	z_erofs_workqueue = alloc_workqueue("erofs_worker",
+			WQ_UNBOUND | WQ_HIGHPRI, num_possible_cpus());
+	if (!z_erofs_workqueue)
+		goto out_error_workqueue_init;
+
+	err = erofs_init_percpu_workers();
 	if (err)
-		z_erofs_destroy_pcluster_pool();
+		goto out_error_pcpu_worker;
+
+	err = erofs_cpu_hotplug_init();
+	if (err < 0)
+		goto out_error_cpuhp_init;
+	return err;
+
+out_error_cpuhp_init:
+	erofs_destroy_percpu_workers();
+out_error_pcpu_worker:
+	destroy_workqueue(z_erofs_workqueue);
+out_error_workqueue_init:
+	z_erofs_destroy_pcluster_pool();
+out_error_pcluster_pool:
 	return err;
 }
 
@@ -1305,11 +1429,17 @@ static void z_erofs_decompressqueue_work(struct work_struct *work)
 
 	DBG_BUGON(bgq->head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
 	z_erofs_decompress_queue(bgq, &pagepool);
-
 	erofs_release_pages(&pagepool);
 	kvfree(bgq);
 }
 
+#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
+static void z_erofs_decompressqueue_kthread_work(struct kthread_work *work)
+{
+	z_erofs_decompressqueue_work((struct work_struct *)work);
+}
+#endif
+
 static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
 				       int bios)
 {
@@ -1324,9 +1454,24 @@ static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
 
 	if (atomic_add_return(bios, &io->pending_bios))
 		return;
-	/* Use workqueue and sync decompression for atomic contexts only */
+	/* Use (kthread_)work and sync decompression for atomic contexts only */
 	if (in_atomic() || irqs_disabled()) {
+#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
+		struct kthread_worker *worker;
+
+		rcu_read_lock();
+		worker = rcu_dereference(
+				z_erofs_pcpu_workers[raw_smp_processor_id()]);
+		if (!worker) {
+			INIT_WORK(&io->u.work, z_erofs_decompressqueue_work);
+			queue_work(z_erofs_workqueue, &io->u.work);
+		} else {
+			kthread_queue_work(worker, &io->u.kthread_work);
+		}
+		rcu_read_unlock();
+#else
 		queue_work(z_erofs_workqueue, &io->u.work);
+#endif
 		/* enable sync decompression for readahead */
 		if (sbi->opt.sync_decompress == EROFS_SYNC_DECOMPRESS_AUTO)
 			sbi->opt.sync_decompress = EROFS_SYNC_DECOMPRESS_FORCE_ON;
@@ -1455,7 +1600,12 @@ static struct z_erofs_decompressqueue *jobqueue_init(struct super_block *sb,
 			*fg = true;
 			goto fg_out;
 		}
+#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
+		kthread_init_work(&q->u.kthread_work,
+				  z_erofs_decompressqueue_kthread_work);
+#else
 		INIT_WORK(&q->u.work, z_erofs_decompressqueue_work);
+#endif
 	} else {
 fg_out:
 		q = fgq;
@@ -1640,7 +1790,7 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
 
 	/*
 	 * although background is preferred, no one is pending for submission.
-	 * don't issue workqueue for decompression but drop it directly instead.
+	 * don't issue decompression but drop it directly instead.
 	 */
 	if (!*force_fg && !nr_bios) {
 		kvfree(q[JQ_SUBMIT]);
--
2.24.4
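
For anyone trying this out: with the patch applied, the new behavior is
opt-in at build time. A kernel config fragment along the following lines
(assuming a build with EROFS compression support already enabled) turns
on both new options:

CONFIG_EROFS_FS=y
CONFIG_EROFS_FS_ZIP=y
CONFIG_EROFS_FS_PCPU_KTHREAD=y
CONFIG_EROFS_FS_PCPU_KTHREAD_HIPRI=y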



2023-02-23 18:30:02

by Eric Biggers

Subject: Re: [PATCH v5] erofs: add per-cpu threads for decompression as an option

Hi,

On Wed, Feb 08, 2023 at 05:33:22PM +0800, Gao Xiang wrote:
> From: Sandeep Dhavale <[email protected]>
>
> Using a per-cpu thread pool, we can reduce scheduling latency compared
> to the workqueue implementation. With this patch, scheduling latency and
> its variation are reduced, as per-cpu threads are high-priority kthread_workers.
>
> The results were evaluated on arm64 Android devices running 5.10 kernel.

I see that this patch was upstreamed. Meanwhile, commit c25da5b7baf1d
("dm verity: stop using WQ_UNBOUND for verify_wq") was also upstreamed.

Why is this more complex solution better than simply removing WQ_UNBOUND?

- Eric
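
For context, the dm-verity change referenced above is only a workqueue
allocation flag, whereas the EROFS patch replaces the workqueue path with
pinned RT kthreads. Roughly (an illustrative sketch with a hypothetical
example_wq, not a line taken from either patch):

#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;

static int example_init(bool unbound)
{
	/*
	 * WQ_UNBOUND: work items are not bound to the submitting CPU; the
	 * scheduler places the unbound kworkers, which is what is suspected
	 * of adding latency on some arm64 devices.  Without the flag, each
	 * work item runs on a per-CPU kworker of the CPU that queued it.
	 */
	example_wq = alloc_workqueue("example_wq",
			(unbound ? WQ_UNBOUND : 0) | WQ_HIGHPRI, 0);
	return example_wq ? 0 : -ENOMEM;
}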

2023-02-23 18:53:05

by Gao Xiang

Subject: Re: [PATCH v5] erofs: add per-cpu threads for decompression as an option

Hi Eric,

On 2023/2/24 02:29, Eric Biggers wrote:
> Hi,
>
> On Wed, Feb 08, 2023 at 05:33:22PM +0800, Gao Xiang wrote:
>> From: Sandeep Dhavale <[email protected]>
>>
>> Using a per-cpu thread pool, we can reduce scheduling latency compared
>> to the workqueue implementation. With this patch, scheduling latency and
>> its variation are reduced, as per-cpu threads are high-priority kthread_workers.
>>
>> The results were evaluated on arm64 Android devices running 5.10 kernel.
>
> I see that this patch was upstreamed. Meanwhile, commit c25da5b7baf1d
> ("dm verity: stop using WQ_UNBOUND for verify_wq") was also upstreamed.
>
> Why is this more complex solution better than simply removing WQ_UNBOUND?

I do think it's a specific issue on certain arm64 hardware (Qualcomm, I
assume, but I don't know) since WQ_UNBOUND decompression once worked well
on the hardware I used at the time (I mean HiSilicon, and most x86_64
CPUs I tested) compared with a per-cpu workqueue.

Also, RT threads can match the softirq approach. In addition, many
configurations work without dm-verity.

I don't have more time to dig into it for now, but it's important to
resolve this problem on some arm64 hardware first. Also, this is an
optional feature; if the root cause of the workqueue issue can be
resolved, we could consider dropping it then.

Thanks,
Gao Xiang

>
> - Eric

2023-02-23 19:08:24

by Gao Xiang

Subject: Re: [PATCH v5] erofs: add per-cpu threads for decompression as an option



On 2023/2/24 02:52, Gao Xiang wrote:
> Hi Eric,
>
> On 2023/2/24 02:29, Eric Biggers wrote:
>> Hi,
>>
>> On Wed, Feb 08, 2023 at 05:33:22PM +0800, Gao Xiang wrote:
>>> From: Sandeep Dhavale <[email protected]>
>>>
>>> Using a per-cpu thread pool, we can reduce scheduling latency compared
>>> to the workqueue implementation. With this patch, scheduling latency and
>>> its variation are reduced, as per-cpu threads are high-priority kthread_workers.
>>>
>>> The results were evaluated on arm64 Android devices running 5.10 kernel.
>>
>> I see that this patch was upstreamed.  Meanwhile, commit c25da5b7baf1d
>> ("dm verity: stop using WQ_UNBOUND for verify_wq") was also upstreamed.
>>
>> Why is this more complex solution better than simply removing WQ_UNBOUND?
>
> I do think it's a specific issue on certain arm64 hardware (Qualcomm, I
> assume, but I don't know) since WQ_UNBOUND decompression once worked well
> on the hardware I used at the time (I mean HiSilicon, and most x86_64
> CPUs I tested) compared with a per-cpu workqueue.
>
> Also, RT threads can match the softirq approach.  In addition, many
> configurations work without dm-verity.

Also, for dm-verity use cases, EROFS will reuse the dm-verity context
directly rather than kicking off a new context. Yet I'm not sure whether
there are still users running EROFS without dm-verity, as I said above.

Anyway, the original scheduling issue sounded strange to me (given my
own deployment experience) in the beginning, and I have no way to
confirm these cases. Hopefully it can be resolved with the developers'
input and ultimately benefit end users.

I've already done my own stress test with this new configuration as
well, without any explicit regression.

>
> I don't have more time to dig into it for now, but it's important to
> resolve this problem on some arm64 hardware first.  Also, this is an
> optional feature; if the root cause of the workqueue issue can be
> resolved, we could consider dropping it then.
>
> Thanks,
> Gao Xiang
>
>>
>> - Eric

2023-02-24 00:14:05

by Sandeep Dhavale

Subject: Re: [PATCH v5] erofs: add per-cpu threads for decompression as an option

On Thu, Feb 23, 2023 at 11:08 AM Gao Xiang <[email protected]> wrote:
>
>
>
> On 2023/2/24 02:52, Gao Xiang wrote:
> > Hi Eric,
> >
> > On 2023/2/24 02:29, Eric Biggers wrote:
> >> Hi,
> >>
> >> On Wed, Feb 08, 2023 at 05:33:22PM +0800, Gao Xiang wrote:
> >>> From: Sandeep Dhavale <[email protected]>
> >>>
> >>> Using a per-cpu thread pool, we can reduce scheduling latency compared
> >>> to the workqueue implementation. With this patch, scheduling latency and
> >>> its variation are reduced, as per-cpu threads are high-priority kthread_workers.
> >>>
> >>> The results were evaluated on arm64 Android devices running 5.10 kernel.
> >>
> >> I see that this patch was upstreamed. Meanwhile, commit c25da5b7baf1d
> >> ("dm verity: stop using WQ_UNBOUND for verify_wq") was also upstreamed.
> >>
> >> Why is this more complex solution better than simply removing WQ_UNBOUND?
> >
> > I do think it's a specific issue on certain arm64 hardware (Qualcomm, I
> > assume, but I don't know) since WQ_UNBOUND decompression once worked well
> > on the hardware I used at the time (I mean HiSilicon, and most x86_64
> > CPUs I tested) compared with a per-cpu workqueue.
> >
> > Also, RT threads can match the softirq approach. In addition, many
> > configurations work without dm-verity.
>
> Also, for dm-verity use cases, EROFS will reuse the dm-verity context
> directly rather than kicking off a new context. Yet I'm not sure whether
> there are still users running EROFS without dm-verity, as I said above.
>
> Anyway, the original scheduling issue sounded strange to me (given my
> own deployment experience) in the beginning, and I have no way to
> confirm these cases. Hopefully it can be resolved with the developers'
> input and ultimately benefit end users.
>
> I've already done my own stress test with this new configuration as
> well, without any explicit regression.
>
Hi Eric,
From the description of the dm-verity patch removing WQ_UNBOUND, it seems
Nathan saw EROFS wait time reduced by 51%, whereas high-priority per-cpu
threads showed me scheduling latency reduced by ~80% on average.

So, from the descriptions at least, it does not look like the two patches
have equal benefits. I can't argue with the smaller size and complexity of
removing WQ_UNBOUND; if it gave the same benefits, that would have been great.

I will do the app launch tests again to compare these and share.

Thanks,
Sandeep.


> >
> > I don't have more time to dig into it for now, but it's important to
> > resolve this problem on some arm64 hardware first. Also, this is an
> > optional feature; if the root cause of the workqueue issue can be
> > resolved, we could consider dropping it then.
> >
> > Thanks,
> > Gao Xiang
> >
> >>
> >> - Eric

2023-02-28 04:48:06

by Sandeep Dhavale

Subject: Re: [PATCH v5] erofs: add per-cpu threads for decompression as an option

Hi all,
I completed the tests, and the results are consistent with
our previous observations. We can see that removing WQ_UNBOUND
helps, but the scheduling latency with high-priority per-cpu
kthreads is even lower. Below is the table.

|---------------------+-------+-------+------+-------|
| Table               |  avg  |  med  | min  |  max  |
|---------------------+-------+-------+------+-------|
| Default erofs       | 19323 | 19758 | 3986 | 35051 |
|---------------------+-------+-------+------+-------|
| !WQ_UNBOUND         | 11202 | 10798 | 3493 | 19822 |
|---------------------+-------+-------+------+-------|
| hipri pcpu kthreads |  7182 |  7017 | 2463 | 12300 |
|---------------------+-------+-------+------+-------|


Thanks,
Sandeep.

2023-02-28 05:01:53

by Gao Xiang

Subject: Re: [PATCH v5] erofs: add per-cpu threads for decompression as an option

Hi Sandeep,

On 2023/2/28 12:47, Sandeep Dhavale via Linux-erofs wrote:
> Hi all,
> I completed the tests, and the results are consistent with
> our previous observations. We can see that removing WQ_UNBOUND
> helps, but the scheduling latency with high-priority per-cpu
> kthreads is even lower. Below is the table.
>
> |---------------------+-------+-------+------+-------|
> | Table               |  avg  |  med  | min  |  max  |
> |---------------------+-------+-------+------+-------|
> | Default erofs       | 19323 | 19758 | 3986 | 35051 |
> |---------------------+-------+-------+------+-------|
> | !WQ_UNBOUND         | 11202 | 10798 | 3493 | 19822 |
> |---------------------+-------+-------+------+-------|
> | hipri pcpu kthreads |  7182 |  7017 | 2463 | 12300 |
> |---------------------+-------+-------+------+-------|

May I ask whether this was tested with a different setup, since the
test results in the original commit message were:

+-------------------------+-----------+----------------+---------+
|                         | workqueue | kthread_worker |  diff   |
+-------------------------+-----------+----------------+---------+
| Average (us)            |     15253 |           2914 | -80.89% |
| Median (us)             |     14001 |           2912 | -79.20% |
| Minimum (us)            |      3117 |           1027 | -67.05% |
| Maximum (us)            |     30170 |           3805 | -87.39% |
| Standard deviation (us) |      7166 |            359 |         |
+-------------------------+-----------+----------------+---------+

Otherwise it looks good to me for now; hopefully it will be helpful to
Android users.

Thanks,
Gao Xiang

>
>
> Thanks,
> Sandeep.

2023-02-28 05:51:24

by Sandeep Dhavale

Subject: Re: [PATCH v5] erofs: add per-cpu threads for decompression as an option

On Mon, Feb 27, 2023 at 9:01 PM Gao Xiang <[email protected]> wrote:
>
> Hi Sandeep,
>
> On 2023/2/28 12:47, Sandeep Dhavale via Linux-erofs wrote:
> > Hi all,
> > I completed the tests, and the results are consistent with
> > our previous observations. We can see that removing WQ_UNBOUND
> > helps, but the scheduling latency with high-priority per-cpu
> > kthreads is even lower. Below is the table.
> >
> > |---------------------+-------+-------+------+-------|
> > | Table               |  avg  |  med  | min  |  max  |
> > |---------------------+-------+-------+------+-------|
> > | Default erofs       | 19323 | 19758 | 3986 | 35051 |
> > |---------------------+-------+-------+------+-------|
> > | !WQ_UNBOUND         | 11202 | 10798 | 3493 | 19822 |
> > |---------------------+-------+-------+------+-------|
> > | hipri pcpu kthreads |  7182 |  7017 | 2463 | 12300 |
> > |---------------------+-------+-------+------+-------|
>
> May I ask whether this was tested with a different setup, since the
> test results in the original commit message were:
>
Hi Gao,
Yes, I ran the test on different (older) hardware than my original testing
(the same hardware Nathan had used) to remove that as a variable.

Thanks,
Sandeep.

> +-------------------------+-----------+----------------+---------+
> |                         | workqueue | kthread_worker |  diff   |
> +-------------------------+-----------+----------------+---------+
> | Average (us)            |     15253 |           2914 | -80.89% |
> | Median (us)             |     14001 |           2912 | -79.20% |
> | Minimum (us)            |      3117 |           1027 | -67.05% |
> | Maximum (us)            |     30170 |           3805 | -87.39% |
> | Standard deviation (us) |      7166 |            359 |         |
> +-------------------------+-----------+----------------+---------+
>
> Otherwise it looks good to me for now; hopefully it will be helpful to
> Android users.
>
> Thanks,
> Gao Xiang
>
> >
> >
> > Thanks,
> > Sandeep.

2023-02-28 05:59:05

by Gao Xiang

Subject: Re: [PATCH v5] erofs: add per-cpu threads for decompression as an option



On 2023/2/28 13:51, Sandeep Dhavale wrote:
> On Mon, Feb 27, 2023 at 9:01 PM Gao Xiang <[email protected]> wrote:
>>
>> Hi Sandeep,
>>
>> On 2023/2/28 12:47, Sandeep Dhavale via Linux-erofs wrote:
>>> Hi all,
>>> I completed the tests, and the results are consistent with
>>> our previous observations. We can see that removing WQ_UNBOUND
>>> helps, but the scheduling latency with high-priority per-cpu
>>> kthreads is even lower. Below is the table.
>>>
>>> |---------------------+-------+-------+------+-------|
>>> | Table               |  avg  |  med  | min  |  max  |
>>> |---------------------+-------+-------+------+-------|
>>> | Default erofs       | 19323 | 19758 | 3986 | 35051 |
>>> |---------------------+-------+-------+------+-------|
>>> | !WQ_UNBOUND         | 11202 | 10798 | 3493 | 19822 |
>>> |---------------------+-------+-------+------+-------|
>>> | hipri pcpu kthreads |  7182 |  7017 | 2463 | 12300 |
>>> |---------------------+-------+-------+------+-------|
>>
>> May I ask whether this was tested with a different setup, since the
>> test results in the original commit message were:
>>
> Hi Gao,
> Yes, I ran the test on different (older) hardware than my original testing
> (the same hardware Nathan had used) to remove that as a variable.

Ok, good to know that.

>
> Thanks,
> Sandeep.
>
>> +-------------------------+-----------+----------------+---------+
>> |                         | workqueue | kthread_worker |  diff   |
>> +-------------------------+-----------+----------------+---------+
>> | Average (us)            |     15253 |           2914 | -80.89% |
>> | Median (us)             |     14001 |           2912 | -79.20% |
>> | Minimum (us)            |      3117 |           1027 | -67.05% |
>> | Maximum (us)            |     30170 |           3805 | -87.39% |
>> | Standard deviation (us) |      7166 |            359 |         |
>> +-------------------------+-----------+----------------+---------+
>>
>> Otherwise it looks good to me for now; hopefully it will be helpful to
>> Android users.
>>
>> Thanks,
>> Gao Xiang
>>
>>>
>>>
>>> Thanks,
>>> Sandeep.