Subject: Re: [PATCH v3] erofs: replace erofs_unzipd workqueue with per-cpu threads
Date: Fri, 6 Jan 2023 14:02:22 +0800
From: Gao Xiang
To: Sandeep Dhavale, Chao Yu, Yue Hu, Jeffle Xu
Cc: kernel-team@android.com, linux-erofs@lists.ozlabs.org, linux-kernel@vger.kernel.org, Gao Xiang
In-Reply-To: <20230106054105.3998588-1-dhavale@google.com>
References: <20230106054105.3998588-1-dhavale@google.com>

Hi Sandeep,

On 2023/1/6 13:41, Sandeep Dhavale wrote:
> Using a per-cpu thread pool we can reduce the scheduling latency compared
> to the workqueue implementation. With this patch, scheduling latency and
> variation are reduced, as the per-cpu threads are high-priority
> kthread_workers.
> 
> The results were evaluated on arm64 Android devices running the 5.10
> kernel.
> 
> The table below shows the resulting improvements in total scheduling
> latency for the same app launch benchmark runs with 50 iterations.
> Scheduling latency is the latency between when the task (workqueue
> kworker vs kthread_worker) became eligible to run and when it actually
> started running.
> 
> +-------------------------+-----------+----------------+---------+
> |                         | workqueue | kthread_worker | diff    |
> +-------------------------+-----------+----------------+---------+
> | Average (us)            |     15253 |           2914 | -80.89% |
> | Median (us)             |     14001 |           2912 | -79.20% |
> | Minimum (us)            |      3117 |           1027 | -67.05% |
> | Maximum (us)            |     30170 |           3805 | -87.39% |
> | Standard deviation (us) |      7166 |            359 |         |
> +-------------------------+-----------+----------------+---------+
> 
> Signed-off-by: Sandeep Dhavale
> ---
> V2 -> V3
> Fix a warning Reported-by: kernel test robot
> 
> V1 -> V2
> * Changed name of kthread_workers from z_erofs to erofs_worker
> * Added kernel configuration to run kthread_workers at normal or
>   high priority
> * Added cpu hotplug support
> * Added wrapped kthread_workers under worker_pool
> * Added one unbound thread in a pool to handle a context where
>   we already stopped per-cpu kthread worker
> * Updated commit message
> 
> Background: Boot times and cold app launch benchmarks are very
> important to the android ecosystem as they directly translate to
> responsiveness from the user's point of view. While erofs provides
> a lot of important features like space savings, we saw some
> performance penalty in cold app launch benchmarks in a few scenarios.
> Analysis showed that the significant variance was coming from the
> scheduling cost, while the decompression cost was more or less the
> same.
> 
> With the per-cpu thread pool, we can see from the table above that
> this variation is reduced by ~80% on average. Link to LPC 2022 slides
> and talk at [1].
> 
> [1] https://lpc.events/event/16/contributions/1338/

Let's move these words into the commit message as well for others to
refer to. I will look into (and try) this version later this week.
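As an aside for anyone on the list who hasn't used the kthread_worker
API this builds on: the basic pattern is roughly the sketch below. It is
simplified and uses made-up "example_*" names rather than the patch
code; the patch does the same thing once per CPU, keeps the worker
pointers in an RCU-protected array, and falls back to a single unbound
worker when a CPU's worker has been torn down by hotplug.

#include <linux/module.h>
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/smp.h>

static struct kthread_worker *example_worker;
static struct kthread_work example_work;

static void example_work_fn(struct kthread_work *work)
{
        pr_info("example work ran on CPU %u\n", raw_smp_processor_id());
}

static int __init example_init(void)
{
        /* spawn a worker thread bound to CPU 0, named "example_worker/0" */
        example_worker = kthread_create_worker_on_cpu(0, 0,
                                                      "example_worker/%u", 0);
        if (IS_ERR(example_worker))
                return PTR_ERR(example_worker);

        /* low SCHED_FIFO priority, as CONFIG_EROFS_FS_KTHREAD_HIPRI=y does */
        sched_set_fifo_low(example_worker->task);

        /* queue a work item; it runs in the bound worker thread */
        kthread_init_work(&example_work, example_work_fn);
        kthread_queue_work(example_worker, &example_work);
        return 0;
}

static void __exit example_exit(void)
{
        /* flushes any pending work and stops the worker thread */
        kthread_destroy_worker(example_worker);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");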
Thanks,
Gao Xiang

> ---
>  fs/erofs/Kconfig |  11 +++
>  fs/erofs/zdata.c | 201 +++++++++++++++++++++++++++++++++++++++++------
>  fs/erofs/zdata.h |   4 +-
>  3 files changed, 192 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
> index 85490370e0ca..879f493c6641 100644
> --- a/fs/erofs/Kconfig
> +++ b/fs/erofs/Kconfig
> @@ -108,3 +108,14 @@ config EROFS_FS_ONDEMAND
>           read support.
>  
>           If unsure, say N.
> +
> +config EROFS_FS_KTHREAD_HIPRI
> +       bool "EROFS high priority percpu kthread workers"
> +       depends on EROFS_FS
> +       default n
> +       help
> +         EROFS uses per cpu kthread workers pool to carry out async work.
> +         This permits EROFS to configure per cpu kthread workers to run
> +         at higher priority.
> +
> +         If unsure, say N.
> diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
> index ccf7c55d477f..b4bf2da72d1a 100644
> --- a/fs/erofs/zdata.c
> +++ b/fs/erofs/zdata.c
> @@ -8,6 +8,8 @@
>  #include "compress.h"
>  #include <linux/prefetch.h>
>  #include <linux/psi.h>
> +#include <linux/kthread.h>
> +#include <linux/cpuhotplug.h>
>  
>  #include <trace/events/erofs.h>
>  
> @@ -184,26 +186,152 @@ typedef tagptr1_t compressed_page_t;
>  #define tag_compressed_page_justfound(page) \
>          tagptr_fold(compressed_page_t, page, 1)
>  
> -static struct workqueue_struct *z_erofs_workqueue __read_mostly;
> +struct erofs_kthread_worker_pool {
> +        struct kthread_worker __rcu **workers;
> +        struct kthread_worker *unbound_worker;
> +};
>  
> -void z_erofs_exit_zip_subsystem(void)
> +static struct erofs_kthread_worker_pool worker_pool;
> +DEFINE_SPINLOCK(worker_pool_lock);
> +
> +static void erofs_destroy_worker_pool(void)
>  {
> -        destroy_workqueue(z_erofs_workqueue);
> -        z_erofs_destroy_pcluster_pool();
> +        unsigned int cpu;
> +        struct kthread_worker *worker;
> +
> +        for_each_possible_cpu(cpu) {
> +                worker = rcu_dereference_protected(
> +                                worker_pool.workers[cpu],
> +                                1);
> +                rcu_assign_pointer(worker_pool.workers[cpu], NULL);
> +
> +                if (worker)
> +                        kthread_destroy_worker(worker);
> +        }
> +
> +        if (worker_pool.unbound_worker)
> +                kthread_destroy_worker(worker_pool.unbound_worker);
> +
> +        kfree(worker_pool.workers);
>  }
>  
> -static inline int z_erofs_init_workqueue(void)
> +static inline void erofs_set_worker_priority(struct kthread_worker *worker)
>  {
> -        const unsigned int onlinecpus = num_possible_cpus();
> +#ifdef CONFIG_EROFS_FS_KTHREAD_HIPRI
> +        sched_set_fifo_low(worker->task);
> +#else
> +        sched_set_normal(worker->task, 0);
> +#endif
> +}
>  
> -        /*
> -         * no need to spawn too many threads, limiting threads could minimum
> -         * scheduling overhead, perhaps per-CPU threads should be better?
> -         */
> -        z_erofs_workqueue = alloc_workqueue("erofs_unzipd",
> -                                            WQ_UNBOUND | WQ_HIGHPRI,
> -                                            onlinecpus + onlinecpus / 4);
> -        return z_erofs_workqueue ? 0 : -ENOMEM;
> +static int erofs_create_kthread_workers(void)
> +{
> +        unsigned int cpu;
> +        struct kthread_worker *worker;
> +
> +        for_each_online_cpu(cpu) {
> +                worker = kthread_create_worker_on_cpu(cpu, 0, "erofs_worker/%u", cpu);
> +                if (IS_ERR(worker)) {
> +                        erofs_destroy_worker_pool();
> +                        return -ENOMEM;
> +                }
> +                erofs_set_worker_priority(worker);
> +                rcu_assign_pointer(worker_pool.workers[cpu], worker);
> +        }
> +
> +        worker = kthread_create_worker(0, "erofs_unbound_worker");
> +        if (IS_ERR(worker)) {
> +                erofs_destroy_worker_pool();
> +                return PTR_ERR(worker);
> +        }
> +        erofs_set_worker_priority(worker);
> +        worker_pool.unbound_worker = worker;
> +
> +        return 0;
> +}
> +
> +static int erofs_init_worker_pool(void)
> +{
> +        int err;
> +
> +        worker_pool.workers = kcalloc(num_possible_cpus(),
> +                        sizeof(struct kthread_worker *), GFP_ATOMIC);
> +        if (!worker_pool.workers)
> +                return -ENOMEM;
> +        err = erofs_create_kthread_workers();
> +
> +        return err;
> +}
> +
> +#ifdef CONFIG_HOTPLUG_CPU
> +static enum cpuhp_state erofs_cpuhp_state;
> +static int erofs_cpu_online(unsigned int cpu)
> +{
> +        struct kthread_worker *worker;
> +
> +        worker = kthread_create_worker_on_cpu(cpu, 0, "erofs_worker/%u", cpu);
> +        if (IS_ERR(worker))
> +                return -ENOMEM;
> +
> +        erofs_set_worker_priority(worker);
> +
> +        spin_lock(&worker_pool_lock);
> +        rcu_assign_pointer(worker_pool.workers[cpu], worker);
> +        spin_unlock(&worker_pool_lock);
> +
> +        synchronize_rcu();
> +
> +        return 0;
> +}
> +
> +static int erofs_cpu_offline(unsigned int cpu)
> +{
> +        struct kthread_worker *worker;
> +
> +        spin_lock(&worker_pool_lock);
> +        worker = rcu_dereference_protected(worker_pool.workers[cpu],
> +                        lockdep_is_held(&worker_pool_lock));
> +        rcu_assign_pointer(worker_pool.workers[cpu], NULL);
> +        spin_unlock(&worker_pool_lock);
> +
> +        synchronize_rcu();
> +
> +        if (worker)
> +                kthread_destroy_worker(worker);
> +
> +        return 0;
> +}
> +
> +static int erofs_cpu_hotplug_init(void)
> +{
> +        int state;
> +
> +        state = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> +                        "fs/erofs:online",
> +                        erofs_cpu_online, erofs_cpu_offline);
> +        if (state < 0)
> +                return state;
> +
> +        erofs_cpuhp_state = state;
> +
> +        return 0;
> +}
> +
> +static void erofs_cpu_hotplug_destroy(void)
> +{
> +        if (erofs_cpuhp_state)
> +                cpuhp_remove_state_nocalls(erofs_cpuhp_state);
> +}
> +#else /* !CONFIG_HOTPLUG_CPU */
> +static inline int erofs_cpu_hotplug_init(void) { return 0; }
> +static inline void erofs_cpu_hotplug_destroy(void) {}
> +#endif
> +
> +void z_erofs_exit_zip_subsystem(void)
> +{
> +        erofs_cpu_hotplug_destroy();
> +        erofs_destroy_worker_pool();
> +        z_erofs_destroy_pcluster_pool();
>  }
>  
>  int __init z_erofs_init_zip_subsystem(void)
> @@ -211,10 +339,23 @@ int __init z_erofs_init_zip_subsystem(void)
>          int err = z_erofs_create_pcluster_pool();
>  
>          if (err)
> -                return err;
> -        err = z_erofs_init_workqueue();
> +                goto out_error_pcluster_pool;
> +
> +        err = erofs_init_worker_pool();
>          if (err)
> -                z_erofs_destroy_pcluster_pool();
> +                goto out_error_worker_pool;
> +
> +        err = erofs_cpu_hotplug_init();
> +        if (err < 0)
> +                goto out_error_cpuhp_init;
> +
> +        return err;
> +
> +out_error_cpuhp_init:
> +        erofs_destroy_worker_pool();
> +out_error_worker_pool:
> +        z_erofs_destroy_pcluster_pool();
> +out_error_pcluster_pool:
>          return err;
>  }
>  
> @@ -1143,7 +1284,7 @@ static void z_erofs_decompress_queue(const struct z_erofs_decompressqueue *io,
>          }
>  }
>  
> -static void z_erofs_decompressqueue_work(struct work_struct *work)
> +static void z_erofs_decompressqueue_kthread_work(struct kthread_work *work)
>  {
>          struct z_erofs_decompressqueue *bgq =
>                  container_of(work, struct z_erofs_decompressqueue, u.work);
> @@ -1156,6 +1297,20 @@ static void z_erofs_decompressqueue_work(struct work_struct *work)
>          kvfree(bgq);
>  }
>  
> +static void erofs_schedule_kthread_work(struct kthread_work *work)
> +{
> +        struct kthread_worker *worker;
> +        unsigned int cpu = raw_smp_processor_id();
> +
> +        rcu_read_lock();
> +        worker = rcu_dereference(worker_pool.workers[cpu]);
> +        if (!worker)
> +                worker = worker_pool.unbound_worker;
> +
> +        kthread_queue_work(worker, work);
> +        rcu_read_unlock();
> +}
> +
>  static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
>                                         bool sync, int bios)
>  {
> @@ -1170,15 +1325,15 @@ static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
>  
>          if (atomic_add_return(bios, &io->pending_bios))
>                  return;
> -        /* Use workqueue and sync decompression for atomic contexts only */
> +        /* Use kthread_workers and sync decompression for atomic contexts only */
>          if (in_atomic() || irqs_disabled()) {
> -                queue_work(z_erofs_workqueue, &io->u.work);
> +                erofs_schedule_kthread_work(&io->u.work);
>                  /* enable sync decompression for readahead */
>                  if (sbi->opt.sync_decompress == EROFS_SYNC_DECOMPRESS_AUTO)
>                          sbi->opt.sync_decompress = EROFS_SYNC_DECOMPRESS_FORCE_ON;
>                  return;
>          }
> -        z_erofs_decompressqueue_work(&io->u.work);
> +        z_erofs_decompressqueue_kthread_work(&io->u.work);
>  }
>  
>  static struct page *pickup_page_for_submission(struct z_erofs_pcluster *pcl,
> @@ -1306,7 +1461,7 @@ jobqueue_init(struct super_block *sb,
>                          *fg = true;
>                          goto fg_out;
>                  }
> -                INIT_WORK(&q->u.work, z_erofs_decompressqueue_work);
> +                kthread_init_work(&q->u.work, z_erofs_decompressqueue_kthread_work);
>          } else {
>  fg_out:
>                  q = fgq;
> @@ -1500,7 +1655,7 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
>  
>          /*
>           * although background is preferred, no one is pending for submission.
> -         * don't issue workqueue for decompression but drop it directly instead.
> +         * don't issue kthread_work for decompression but drop it directly instead.
>           */
>          if (!*force_fg && !nr_bios) {
>                  kvfree(q[JQ_SUBMIT]);
> diff --git a/fs/erofs/zdata.h b/fs/erofs/zdata.h
> index d98c95212985..808bbbf71b7b 100644
> --- a/fs/erofs/zdata.h
> +++ b/fs/erofs/zdata.h
> @@ -6,6 +6,8 @@
>  #ifndef __EROFS_FS_ZDATA_H
>  #define __EROFS_FS_ZDATA_H
>  
> +#include <linux/kthread.h>
> +
>  #include "internal.h"
>  #include "tagptr.h"
>  
> @@ -107,7 +109,7 @@ struct z_erofs_decompressqueue {
>  
>          union {
>                  struct completion done;
> -                struct work_struct work;
> +                struct kthread_work work;
>          } u;
>  
>          bool eio;
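
One more note for anyone who wants to try this out: I assume the
relevant bits of a test config would look roughly like the fragment
below (CONFIG_EROFS_FS_ZIP and CONFIG_HOTPLUG_CPU are my guess at the
other options involved; only CONFIG_EROFS_FS_KTHREAD_HIPRI is
introduced by this patch):

CONFIG_EROFS_FS=y
# zdata.c (where the worker pool lives) is only built with compression support
CONFIG_EROFS_FS_ZIP=y
# run the per-CPU workers as low-priority SCHED_FIFO instead of SCHED_NORMAL
CONFIG_EROFS_FS_KTHREAD_HIPRI=y
# enables the erofs_cpu_online()/erofs_cpu_offline() callbacks
CONFIG_HOTPLUG_CPU=y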