2019-09-06 10:14:42

by Daniel Jordan

[permalink] [raw]
Subject: [PATCH v3 0/9] padata: use unbound workqueues for parallel jobs

v2 -> v3:
- Rebase onto cryptodev.

v1 -> v2:
- Updated patch 8 to avoid queueing the reorder work if the next object
by sequence number isn't ready yet (Herbert)
- Added Steffen's ack to all but patch 8 since that one changed.

RFC -> v1:
- Included Tejun's acks.
- Added testing section to cover letter.

Padata binds the parallel part of a job to a single CPU and round-robins
over all CPUs in the system for each successive job. Though the serial
parts rely on per-CPU queues for correct ordering, they're not necessary
for parallel work, and it improves performance to run the job locally on
NUMA machines and let the scheduler pick the CPU within a node on a busy
system.

This series makes parallel padata jobs run on unbound workqueues.

Patch Description
----- -----------

1 Make a padata instance allocate its workqueue internally.

2 Unconfine some recently-confined workqueue interfaces.

3-6 Address recursive CPU hotplug locking issue.

padata_alloc* requires its callers to hold this lock, but allocating
an unbound workqueue and calling apply_workqueue_attrs also take it.
Fix by removing the requirement for callers of padata_alloc*.

7-8 Add a second workqueue for each padata instance that's dedicated to
parallel jobs.

9 Small cleanup.

Performance
-----------

Measurements are from a 2-socket, 20-core, 40-CPU Xeon server.

For repeatability, modprobe was bound to a CPU and the serial cpumasks
for both pencrypt and pdecrypt were also restricted to a CPU different
from modprobe's.

# modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
# modprobe tcrypt mode=211 sec=1
# modprobe tcrypt mode=215 sec=1

Busy system (tcrypt run while 10 stress-ng tasks were burning 100% CPU)

base test
---------------- ---------------
speedup key_sz blk_sz ops/sec stdev ops/sec stdev

(pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=211)

117.2x 160 16 960 30 112555 24775
135.1x 160 64 845 246 114145 25124
113.2x 160 256 993 17 112395 24714
111.3x 160 512 1000 0 111252 23755
110.0x 160 1024 983 16 108153 22374
104.2x 160 2048 985 22 102563 20530
98.5x 160 4096 998 3 98346 18777
86.2x 160 8192 1000 0 86173 14480

(pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=211)

127.2x 160 16 997 5 126834 24244
128.4x 160 64 1000 0 128438 23261
127.6x 160 256 992 7 126627 23493
124.0x 160 512 1000 0 123958 22746
122.8x 160 1024 989 20 121372 22632
112.8x 160 2048 998 3 112602 18287
106.9x 160 4096 994 10 106255 16111
91.7x 160 8192 1000 0 91742 11670

multibuffer (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=215)

242.2x 160 16 2363 141 572189 16846
242.1x 160 64 2397 151 580424 11923
231.1x 160 256 2472 21 571387 16364
237.6x 160 512 2429 24 577264 8692
238.3x 160 1024 2384 97 568155 6621
216.3x 160 2048 2453 74 530627 3480
209.2x 160 4096 2381 206 498192 19177
176.5x 160 8192 2323 157 410013 9903

multibuffer (pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=215)

220.3x 160 16 2341 228 515733 91317
216.6x 160 64 2467 33 534381 101262
217.7x 160 256 2451 45 533443 85418
213.8x 160 512 2485 26 531293 83767
211.0x 160 1024 2472 28 521677 80339
200.8x 160 2048 2459 67 493808 63587
188.8x 160 4096 2491 9 470325 58055
159.9x 160 8192 2459 51 393147 25756

Idle system (tcrypt run by itself)

base test
---------------- ---------------
speedup key_sz blk_sz ops/sec stdev ops/sec stdev

(pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=211)

2.5x 160 16 63412 43075 161615 1034
4.1x 160 64 39554 24006 161653 981
6.0x 160 256 26504 1436 160110 1158
6.2x 160 512 25500 40 157018 951
5.9x 160 1024 25777 1094 151852 915
5.8x 160 2048 24653 218 143756 508
5.6x 160 4096 24333 20 136752 548
5.0x 160 8192 23310 15 117660 481

(pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=211)

2.4x 160 16 53471 48279 128047 31328
3.4x 160 64 37712 20855 128187 31074
4.5x 160 256 27911 4378 126430 31084
4.9x 160 512 25346 175 123870 29099
3.1x 160 1024 38452 23118 120817 26846
4.7x 160 2048 24612 187 115036 23942
4.5x 160 4096 24217 114 109583 21559
4.2x 160 8192 23144 108 96850 16686

multibuffer (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=215)

1.0x 160 16 412157 3855 426973 1591
1.0x 160 64 412600 4410 431920 4224
1.1x 160 256 410352 3254 453691 17831
1.2x 160 512 406293 4948 473491 39818
1.2x 160 1024 395123 7804 478539 27660
1.2x 160 2048 385144 7601 453720 17579
1.2x 160 4096 371989 3631 449923 15331
1.2x 160 8192 346723 1617 399824 18559

multibuffer (pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=215)

1.1x 160 16 407317 1487 452619 14404
1.1x 160 64 411821 4261 464059 23541
1.2x 160 256 408941 4945 477483 36576
1.2x 160 512 406451 611 472661 11038
1.2x 160 1024 394813 2667 456357 11452
1.2x 160 2048 390291 4175 448928 8957
1.2x 160 4096 371904 1068 449344 14225
1.2x 160 8192 344227 1973 404397 19540

Testing
-------

In addition to the bare metal performance runs above, this series was
tested in a kvm guest with the tcrypt module (mode=215). All
combinations of CPUs among parallel_cpumask, serial_cpumask, and CPU
hotplug online/offline were run with 3 possible CPUs, and over 2000
random combinations of these were run with 8 possible CPUs. Workqueue
events were used throughout to verify that all parallel and serial
workers executed on only the CPUs allowed by the cpumask sysfs files.

Finally, tcrypt mode=215 was run at each patch in the series when built
with and without CONFIG_PADATA/CONFIG_CRYPTO_PCRYPT.

v2: https://lore.kernel.org/linux-crypto/[email protected]/
v1: https://lore.kernel.org/linux-crypto/[email protected]/
RFC: https://lore.kernel.org/lkml/[email protected]/

Daniel Jordan (9):
padata: allocate workqueue internally
workqueue: unconfine alloc/apply/free_workqueue_attrs()
workqueue: require CPU hotplug read exclusion for
apply_workqueue_attrs
padata: make padata_do_parallel find alternate callback CPU
pcrypt: remove padata cpumask notifier
padata, pcrypt: take CPU hotplug lock internally in
padata_alloc_possible
padata: use separate workqueues for parallel and serial work
padata: unbind parallel jobs from specific CPUs
padata: remove cpu_index from the parallel_queue

Documentation/padata.txt | 12 +--
crypto/pcrypt.c | 167 ++++---------------------------
include/linux/padata.h | 16 +--
include/linux/workqueue.h | 4 +
kernel/padata.c | 201 ++++++++++++++++++++++----------------
kernel/workqueue.c | 25 +++--
6 files changed, 170 insertions(+), 255 deletions(-)

--
2.23.0


2019-09-06 10:15:18

by Daniel Jordan

[permalink] [raw]
Subject: [PATCH v3 6/9] padata, pcrypt: take CPU hotplug lock internally in padata_alloc_possible

With pcrypt's cpumask no longer used, take the CPU hotplug lock inside
padata_alloc_possible.

Useful later in the series for avoiding nested acquisition of the CPU
hotplug lock in padata when padata_alloc_possible is allocating an
unbound workqueue.

Without this patch, this nested acquisition would happen later in the
series:

pcrypt_init_padata
get_online_cpus
alloc_padata_possible
alloc_padata
alloc_workqueue(WQ_UNBOUND) // later in the series
alloc_and_link_pwqs
apply_wqattrs_lock
get_online_cpus // recursive rwsem acquisition

Signed-off-by: Daniel Jordan <[email protected]>
Acked-by: Steffen Klassert <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
crypto/pcrypt.c | 4 ----
kernel/padata.c | 17 +++++++++--------
2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index 2ec36e6a132f..543792e0ebf0 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -308,8 +308,6 @@ static int pcrypt_init_padata(struct padata_instance **pinst, const char *name)
{
int ret = -ENOMEM;

- get_online_cpus();
-
*pinst = padata_alloc_possible(name);
if (!*pinst)
return ret;
@@ -318,8 +316,6 @@ static int pcrypt_init_padata(struct padata_instance **pinst, const char *name)
if (ret)
padata_free(*pinst);

- put_online_cpus();
-
return ret;
}

diff --git a/kernel/padata.c b/kernel/padata.c
index 9a17922ec436..8a362923c488 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -955,8 +955,6 @@ static struct kobj_type padata_attr_type = {
* @name: used to identify the instance
* @pcpumask: cpumask that will be used for padata parallelization
* @cbcpumask: cpumask that will be used for padata serialization
- *
- * Must be called from a cpus_read_lock() protected region
*/
static struct padata_instance *padata_alloc(const char *name,
const struct cpumask *pcpumask,
@@ -974,11 +972,13 @@ static struct padata_instance *padata_alloc(const char *name,
if (!pinst->wq)
goto err_free_inst;

+ get_online_cpus();
+
if (!alloc_cpumask_var(&pinst->cpumask.pcpu, GFP_KERNEL))
- goto err_free_wq;
+ goto err_put_cpus;
if (!alloc_cpumask_var(&pinst->cpumask.cbcpu, GFP_KERNEL)) {
free_cpumask_var(pinst->cpumask.pcpu);
- goto err_free_wq;
+ goto err_put_cpus;
}
if (!padata_validate_cpumask(pinst, pcpumask) ||
!padata_validate_cpumask(pinst, cbcpumask))
@@ -1002,12 +1002,16 @@ static struct padata_instance *padata_alloc(const char *name,
#ifdef CONFIG_HOTPLUG_CPU
cpuhp_state_add_instance_nocalls_cpuslocked(hp_online, &pinst->node);
#endif
+
+ put_online_cpus();
+
return pinst;

err_free_masks:
free_cpumask_var(pinst->cpumask.pcpu);
free_cpumask_var(pinst->cpumask.cbcpu);
-err_free_wq:
+err_put_cpus:
+ put_online_cpus();
destroy_workqueue(pinst->wq);
err_free_inst:
kfree(pinst);
@@ -1021,12 +1025,9 @@ static struct padata_instance *padata_alloc(const char *name,
* parallel workers.
*
* @name: used to identify the instance
- *
- * Must be called from a cpus_read_lock() protected region
*/
struct padata_instance *padata_alloc_possible(const char *name)
{
- lockdep_assert_cpus_held();
return padata_alloc(name, cpu_possible_mask, cpu_possible_mask);
}
EXPORT_SYMBOL(padata_alloc_possible);
--
2.23.0

2019-09-06 10:15:26

by Daniel Jordan

[permalink] [raw]
Subject: [PATCH v3 8/9] padata: unbind parallel jobs from specific CPUs

Padata binds the parallel part of a job to a single CPU and round-robins
over all CPUs in the system for each successive job. Though the serial
parts rely on per-CPU queues for correct ordering, they're not necessary
for parallel work, and it improves performance to run the job locally on
NUMA machines and let the scheduler pick the CPU within a node on a busy
system.

So, make the parallel workqueue unbound.

Update the parallel workqueue's cpumask when the instance's parallel
cpumask changes.

Now that parallel jobs no longer run on max_active=1 workqueues, two or
more parallel works that hash to the same CPU may run simultaneously,
finish out of order, and so be serialized out of order. Prevent this by
keeping the works sorted on the reorder list by sequence number and
checking that in the reordering logic.

padata_get_next becomes padata_find_next so it can be reused for the end
of padata_reorder, where it's used to avoid uselessly queueing work when
the next job by sequence number isn't finished yet but a later job that
hashed to the same CPU has.

The ENODATA case in padata_find_next no longer makes sense because
parallel jobs aren't bound to specific CPUs. The EINPROGRESS case takes
care of the scenario where a parallel job is potentially running on the
same CPU as padata_find_next, and with only one error code left, just
use NULL instead.

Signed-off-by: Daniel Jordan <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steffen Klassert <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/padata.h | 3 ++
kernel/padata.c | 118 +++++++++++++++++++++++------------------
2 files changed, 68 insertions(+), 53 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index e7978f8942ca..43d3fd9d17fc 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -35,6 +35,7 @@ struct padata_priv {
struct parallel_data *pd;
int cb_cpu;
int cpu;
+ unsigned int seq_nr;
int info;
void (*parallel)(struct padata_priv *padata);
void (*serial)(struct padata_priv *padata);
@@ -105,6 +106,7 @@ struct padata_cpumask {
* @reorder_objects: Number of objects waiting in the reorder queues.
* @refcnt: Number of objects holding a reference on this parallel_data.
* @max_seq_nr: Maximal used sequence number.
+ * @processed: Number of already processed objects.
* @cpu: Next CPU to be processed.
* @cpumask: The cpumasks in use for parallel and serial workers.
* @reorder_work: work struct for reordering.
@@ -117,6 +119,7 @@ struct parallel_data {
atomic_t reorder_objects;
atomic_t refcnt;
atomic_t seq_nr;
+ unsigned int processed;
int cpu;
struct padata_cpumask cpumask;
struct work_struct reorder_work;
diff --git a/kernel/padata.c b/kernel/padata.c
index 669f5d53d357..832224dcf2e1 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -46,18 +46,13 @@ static int padata_index_to_cpu(struct parallel_data *pd, int cpu_index)
return target_cpu;
}

-static int padata_cpu_hash(struct parallel_data *pd)
+static int padata_cpu_hash(struct parallel_data *pd, unsigned int seq_nr)
{
- unsigned int seq_nr;
- int cpu_index;
-
/*
* Hash the sequence numbers to the cpus by taking
* seq_nr mod. number of cpus in use.
*/
-
- seq_nr = atomic_inc_return(&pd->seq_nr);
- cpu_index = seq_nr % cpumask_weight(pd->cpumask.pcpu);
+ int cpu_index = seq_nr % cpumask_weight(pd->cpumask.pcpu);

return padata_index_to_cpu(pd, cpu_index);
}
@@ -144,7 +139,8 @@ int padata_do_parallel(struct padata_instance *pinst,
padata->pd = pd;
padata->cb_cpu = *cb_cpu;

- target_cpu = padata_cpu_hash(pd);
+ padata->seq_nr = atomic_inc_return(&pd->seq_nr);
+ target_cpu = padata_cpu_hash(pd, padata->seq_nr);
padata->cpu = target_cpu;
queue = per_cpu_ptr(pd->pqueue, target_cpu);

@@ -152,7 +148,7 @@ int padata_do_parallel(struct padata_instance *pinst,
list_add_tail(&padata->list, &queue->parallel.list);
spin_unlock(&queue->parallel.lock);

- queue_work_on(target_cpu, pinst->parallel_wq, &queue->work);
+ queue_work(pinst->parallel_wq, &queue->work);

out:
rcu_read_unlock_bh();
@@ -162,21 +158,19 @@ int padata_do_parallel(struct padata_instance *pinst,
EXPORT_SYMBOL(padata_do_parallel);

/*
- * padata_get_next - Get the next object that needs serialization.
+ * padata_find_next - Find the next object that needs serialization.
*
* Return values are:
*
* A pointer to the control struct of the next object that needs
* serialization, if present in one of the percpu reorder queues.
*
- * -EINPROGRESS, if the next object that needs serialization will
+ * NULL, if the next object that needs serialization will
* be parallel processed by another cpu and is not yet present in
* the cpu's reorder queue.
- *
- * -ENODATA, if this cpu has to do the parallel processing for
- * the next object.
*/
-static struct padata_priv *padata_get_next(struct parallel_data *pd)
+static struct padata_priv *padata_find_next(struct parallel_data *pd,
+ bool remove_object)
{
struct padata_parallel_queue *next_queue;
struct padata_priv *padata;
@@ -187,28 +181,30 @@ static struct padata_priv *padata_get_next(struct parallel_data *pd)
reorder = &next_queue->reorder;

spin_lock(&reorder->lock);
- if (!list_empty(&reorder->list)) {
- padata = list_entry(reorder->list.next,
- struct padata_priv, list);
-
- list_del_init(&padata->list);
- atomic_dec(&pd->reorder_objects);
+ if (list_empty(&reorder->list)) {
+ spin_unlock(&reorder->lock);
+ return NULL;
+ }

- pd->cpu = cpumask_next_wrap(cpu, pd->cpumask.pcpu, -1,
- false);
+ padata = list_entry(reorder->list.next, struct padata_priv, list);

+ /*
+ * Checks the rare case where two or more parallel jobs have hashed to
+ * the same CPU and one of the later ones finishes first.
+ */
+ if (padata->seq_nr != pd->processed) {
spin_unlock(&reorder->lock);
- goto out;
+ return NULL;
}
- spin_unlock(&reorder->lock);

- if (__this_cpu_read(pd->pqueue->cpu_index) == next_queue->cpu_index) {
- padata = ERR_PTR(-ENODATA);
- goto out;
+ if (remove_object) {
+ list_del_init(&padata->list);
+ atomic_dec(&pd->reorder_objects);
+ ++pd->processed;
+ pd->cpu = cpumask_next_wrap(cpu, pd->cpumask.pcpu, -1, false);
}

- padata = ERR_PTR(-EINPROGRESS);
-out:
+ spin_unlock(&reorder->lock);
return padata;
}

@@ -234,26 +230,16 @@ static void padata_reorder(struct parallel_data *pd)
return;

while (1) {
- padata = padata_get_next(pd);
+ padata = padata_find_next(pd, true);

/*
* If the next object that needs serialization is parallel
* processed by another cpu and is still on it's way to the
* cpu's reorder queue, nothing to do for now.
*/
- if (PTR_ERR(padata) == -EINPROGRESS)
+ if (!padata)
break;

- /*
- * This cpu has to do the parallel processing of the next
- * object. It's waiting in the cpu's parallelization queue,
- * so exit immediately.
- */
- if (PTR_ERR(padata) == -ENODATA) {
- spin_unlock_bh(&pd->lock);
- return;
- }
-
cb_cpu = padata->cb_cpu;
squeue = per_cpu_ptr(pd->squeue, cb_cpu);

@@ -277,7 +263,8 @@ static void padata_reorder(struct parallel_data *pd)
smp_mb();

next_queue = per_cpu_ptr(pd->pqueue, pd->cpu);
- if (!list_empty(&next_queue->reorder.list))
+ if (!list_empty(&next_queue->reorder.list) &&
+ padata_find_next(pd, false))
queue_work(pinst->serial_wq, &pd->reorder_work);
}

@@ -332,9 +319,14 @@ void padata_do_serial(struct padata_priv *padata)
struct parallel_data *pd = padata->pd;
struct padata_parallel_queue *pqueue = per_cpu_ptr(pd->pqueue,
padata->cpu);
+ struct padata_priv *cur;

spin_lock(&pqueue->reorder.lock);
- list_add_tail(&padata->list, &pqueue->reorder.list);
+ /* Sort in ascending order of sequence number. */
+ list_for_each_entry_reverse(cur, &pqueue->reorder.list, list)
+ if (cur->seq_nr < padata->seq_nr)
+ break;
+ list_add(&padata->list, &cur->list);
atomic_inc(&pd->reorder_objects);
spin_unlock(&pqueue->reorder.lock);

@@ -353,17 +345,36 @@ static int padata_setup_cpumasks(struct parallel_data *pd,
const struct cpumask *pcpumask,
const struct cpumask *cbcpumask)
{
- if (!alloc_cpumask_var(&pd->cpumask.pcpu, GFP_KERNEL))
- return -ENOMEM;
+ struct workqueue_attrs *attrs;
+ int err = -ENOMEM;

+ if (!alloc_cpumask_var(&pd->cpumask.pcpu, GFP_KERNEL))
+ goto out;
cpumask_and(pd->cpumask.pcpu, pcpumask, cpu_online_mask);
- if (!alloc_cpumask_var(&pd->cpumask.cbcpu, GFP_KERNEL)) {
- free_cpumask_var(pd->cpumask.pcpu);
- return -ENOMEM;
- }

+ if (!alloc_cpumask_var(&pd->cpumask.cbcpu, GFP_KERNEL))
+ goto free_pcpu_mask;
cpumask_and(pd->cpumask.cbcpu, cbcpumask, cpu_online_mask);
+
+ attrs = alloc_workqueue_attrs();
+ if (!attrs)
+ goto free_cbcpu_mask;
+
+ /* Restrict parallel_wq workers to pd->cpumask.pcpu. */
+ cpumask_copy(attrs->cpumask, pd->cpumask.pcpu);
+ err = apply_workqueue_attrs(pd->pinst->parallel_wq, attrs);
+ free_workqueue_attrs(attrs);
+ if (err < 0)
+ goto free_cbcpu_mask;
+
return 0;
+
+free_cbcpu_mask:
+ free_cpumask_var(pd->cpumask.cbcpu);
+free_pcpu_mask:
+ free_cpumask_var(pd->cpumask.pcpu);
+out:
+ return err;
}

static void __padata_list_init(struct padata_list *pd_list)
@@ -429,6 +440,8 @@ static struct parallel_data *padata_alloc_pd(struct padata_instance *pinst,
pd->squeue = alloc_percpu(struct padata_serial_queue);
if (!pd->squeue)
goto err_free_pqueue;
+
+ pd->pinst = pinst;
if (padata_setup_cpumasks(pd, pcpumask, cbcpumask) < 0)
goto err_free_squeue;

@@ -437,7 +450,6 @@ static struct parallel_data *padata_alloc_pd(struct padata_instance *pinst,
atomic_set(&pd->seq_nr, -1);
atomic_set(&pd->reorder_objects, 0);
atomic_set(&pd->refcnt, 0);
- pd->pinst = pinst;
spin_lock_init(&pd->lock);
pd->cpu = cpumask_first(pd->cpumask.pcpu);
INIT_WORK(&pd->reorder_work, invoke_padata_reorder);
@@ -968,8 +980,8 @@ static struct padata_instance *padata_alloc(const char *name,
if (!pinst)
goto err;

- pinst->parallel_wq = alloc_workqueue("%s_parallel", WQ_MEM_RECLAIM |
- WQ_CPU_INTENSIVE, 1, name);
+ pinst->parallel_wq = alloc_workqueue("%s_parallel", WQ_UNBOUND, 0,
+ name);
if (!pinst->parallel_wq)
goto err_free_inst;

--
2.23.0

2019-09-06 10:15:54

by Daniel Jordan

[permalink] [raw]
Subject: [PATCH v3 1/9] padata: allocate workqueue internally

Move workqueue allocation inside of padata to prepare for further
changes to how padata uses workqueues.

Guarantees the workqueue is created with max_active=1, which padata
relies on to work correctly. No functional change.

Signed-off-by: Daniel Jordan <[email protected]>
Acked-by: Steffen Klassert <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Documentation/padata.txt | 12 ++++++------
crypto/pcrypt.c | 13 ++-----------
include/linux/padata.h | 3 +--
kernel/padata.c | 24 +++++++++++++++---------
4 files changed, 24 insertions(+), 28 deletions(-)

diff --git a/Documentation/padata.txt b/Documentation/padata.txt
index b103d0c82000..b37ba1eaace3 100644
--- a/Documentation/padata.txt
+++ b/Documentation/padata.txt
@@ -16,10 +16,12 @@ overall control of how tasks are to be run::

#include <linux/padata.h>

- struct padata_instance *padata_alloc(struct workqueue_struct *wq,
+ struct padata_instance *padata_alloc(const char *name,
const struct cpumask *pcpumask,
const struct cpumask *cbcpumask);

+'name' simply identifies the instance.
+
The pcpumask describes which processors will be used to execute work
submitted to this instance in parallel. The cbcpumask defines which
processors are allowed to be used as the serialization callback processor.
@@ -128,8 +130,7 @@ in that CPU mask or about a not running instance.

Each task submitted to padata_do_parallel() will, in turn, be passed to
exactly one call to the above-mentioned parallel() function, on one CPU, so
-true parallelism is achieved by submitting multiple tasks. Despite the
-fact that the workqueue is used to make these calls, parallel() is run with
+true parallelism is achieved by submitting multiple tasks. parallel() runs with
software interrupts disabled and thus cannot sleep. The parallel()
function gets the padata_priv structure pointer as its lone parameter;
information about the actual work to be done is probably obtained by using
@@ -148,7 +149,7 @@ fact with a call to::
At some point in the future, padata_do_serial() will trigger a call to the
serial() function in the padata_priv structure. That call will happen on
the CPU requested in the initial call to padata_do_parallel(); it, too, is
-done through the workqueue, but with local software interrupts disabled.
+run with local software interrupts disabled.
Note that this call may be deferred for a while since the padata code takes
pains to ensure that tasks are completed in the order in which they were
submitted.
@@ -159,5 +160,4 @@ when a padata instance is no longer needed::
void padata_free(struct padata_instance *pinst);

This function will busy-wait while any remaining tasks are completed, so it
-might be best not to call it while there is work outstanding. Shutting
-down the workqueue, if necessary, should be done separately.
+might be best not to call it while there is work outstanding.
diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index 0edf5b54fc77..d67293063c7f 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -20,7 +20,6 @@

struct padata_pcrypt {
struct padata_instance *pinst;
- struct workqueue_struct *wq;

/*
* Cpumask for callback CPUs. It should be
@@ -397,14 +396,9 @@ static int pcrypt_init_padata(struct padata_pcrypt *pcrypt,

get_online_cpus();

- pcrypt->wq = alloc_workqueue("%s", WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE,
- 1, name);
- if (!pcrypt->wq)
- goto err;
-
- pcrypt->pinst = padata_alloc_possible(pcrypt->wq);
+ pcrypt->pinst = padata_alloc_possible(name);
if (!pcrypt->pinst)
- goto err_destroy_workqueue;
+ goto err;

mask = kmalloc(sizeof(*mask), GFP_KERNEL);
if (!mask)
@@ -437,8 +431,6 @@ static int pcrypt_init_padata(struct padata_pcrypt *pcrypt,
kfree(mask);
err_free_padata:
padata_free(pcrypt->pinst);
-err_destroy_workqueue:
- destroy_workqueue(pcrypt->wq);
err:
put_online_cpus();

@@ -452,7 +444,6 @@ static void pcrypt_fini_padata(struct padata_pcrypt *pcrypt)

padata_stop(pcrypt->pinst);
padata_unregister_cpumask_notifier(pcrypt->pinst, &pcrypt->nblock);
- destroy_workqueue(pcrypt->wq);
padata_free(pcrypt->pinst);
}

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 8da673861d99..839d9319920a 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -151,8 +151,7 @@ struct padata_instance {
#define PADATA_INVALID 4
};

-extern struct padata_instance *padata_alloc_possible(
- struct workqueue_struct *wq);
+extern struct padata_instance *padata_alloc_possible(const char *name);
extern void padata_free(struct padata_instance *pinst);
extern int padata_do_parallel(struct padata_instance *pinst,
struct padata_priv *padata, int cb_cpu);
diff --git a/kernel/padata.c b/kernel/padata.c
index b60cc3dcee58..58728cd7f40c 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -805,6 +805,7 @@ static void __padata_free(struct padata_instance *pinst)
padata_free_pd(pinst->pd);
free_cpumask_var(pinst->cpumask.pcpu);
free_cpumask_var(pinst->cpumask.cbcpu);
+ destroy_workqueue(pinst->wq);
kfree(pinst);
}

@@ -938,13 +939,13 @@ static struct kobj_type padata_attr_type = {
* padata_alloc - allocate and initialize a padata instance and specify
* cpumasks for serial and parallel workers.
*
- * @wq: workqueue to use for the allocated padata instance
+ * @name: used to identify the instance
* @pcpumask: cpumask that will be used for padata parallelization
* @cbcpumask: cpumask that will be used for padata serialization
*
* Must be called from a cpus_read_lock() protected region
*/
-static struct padata_instance *padata_alloc(struct workqueue_struct *wq,
+static struct padata_instance *padata_alloc(const char *name,
const struct cpumask *pcpumask,
const struct cpumask *cbcpumask)
{
@@ -955,11 +956,16 @@ static struct padata_instance *padata_alloc(struct workqueue_struct *wq,
if (!pinst)
goto err;

- if (!alloc_cpumask_var(&pinst->cpumask.pcpu, GFP_KERNEL))
+ pinst->wq = alloc_workqueue("%s", WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE,
+ 1, name);
+ if (!pinst->wq)
goto err_free_inst;
+
+ if (!alloc_cpumask_var(&pinst->cpumask.pcpu, GFP_KERNEL))
+ goto err_free_wq;
if (!alloc_cpumask_var(&pinst->cpumask.cbcpu, GFP_KERNEL)) {
free_cpumask_var(pinst->cpumask.pcpu);
- goto err_free_inst;
+ goto err_free_wq;
}
if (!padata_validate_cpumask(pinst, pcpumask) ||
!padata_validate_cpumask(pinst, cbcpumask))
@@ -971,8 +977,6 @@ static struct padata_instance *padata_alloc(struct workqueue_struct *wq,

rcu_assign_pointer(pinst->pd, pd);

- pinst->wq = wq;
-
cpumask_copy(pinst->cpumask.pcpu, pcpumask);
cpumask_copy(pinst->cpumask.cbcpu, cbcpumask);

@@ -990,6 +994,8 @@ static struct padata_instance *padata_alloc(struct workqueue_struct *wq,
err_free_masks:
free_cpumask_var(pinst->cpumask.pcpu);
free_cpumask_var(pinst->cpumask.cbcpu);
+err_free_wq:
+ destroy_workqueue(pinst->wq);
err_free_inst:
kfree(pinst);
err:
@@ -1001,14 +1007,14 @@ static struct padata_instance *padata_alloc(struct workqueue_struct *wq,
* Use the cpu_possible_mask for serial and
* parallel workers.
*
- * @wq: workqueue to use for the allocated padata instance
+ * @name: used to identify the instance
*
* Must be called from a cpus_read_lock() protected region
*/
-struct padata_instance *padata_alloc_possible(struct workqueue_struct *wq)
+struct padata_instance *padata_alloc_possible(const char *name)
{
lockdep_assert_cpus_held();
- return padata_alloc(wq, cpu_possible_mask, cpu_possible_mask);
+ return padata_alloc(name, cpu_possible_mask, cpu_possible_mask);
}
EXPORT_SYMBOL(padata_alloc_possible);

--
2.23.0

2019-09-06 10:17:52

by Daniel Jordan

[permalink] [raw]
Subject: [PATCH v3 5/9] pcrypt: remove padata cpumask notifier

Now that padata_do_parallel takes care of finding an alternate callback
CPU, there's no need for pcrypt's callback cpumask, so remove it and the
notifier callback that keeps it in sync.

Signed-off-by: Daniel Jordan <[email protected]>
Acked-by: Steffen Klassert <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
crypto/pcrypt.c | 125 +++++++-----------------------------------------
1 file changed, 18 insertions(+), 107 deletions(-)

diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index efca962ab12a..2ec36e6a132f 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -18,33 +18,8 @@
#include <linux/cpu.h>
#include <crypto/pcrypt.h>

-struct padata_pcrypt {
- struct padata_instance *pinst;
-
- /*
- * Cpumask for callback CPUs. It should be
- * equal to serial cpumask of corresponding padata instance,
- * so it is updated when padata notifies us about serial
- * cpumask change.
- *
- * cb_cpumask is protected by RCU. This fact prevents us from
- * using cpumask_var_t directly because the actual type of
- * cpumsak_var_t depends on kernel configuration(particularly on
- * CONFIG_CPUMASK_OFFSTACK macro). Depending on the configuration
- * cpumask_var_t may be either a pointer to the struct cpumask
- * or a variable allocated on the stack. Thus we can not safely use
- * cpumask_var_t with RCU operations such as rcu_assign_pointer or
- * rcu_dereference. So cpumask_var_t is wrapped with struct
- * pcrypt_cpumask which makes possible to use it with RCU.
- */
- struct pcrypt_cpumask {
- cpumask_var_t mask;
- } *cb_cpumask;
- struct notifier_block nblock;
-};
-
-static struct padata_pcrypt pencrypt;
-static struct padata_pcrypt pdecrypt;
+static struct padata_instance *pencrypt;
+static struct padata_instance *pdecrypt;
static struct kset *pcrypt_kset;

struct pcrypt_instance_ctx {
@@ -128,7 +103,7 @@ static int pcrypt_aead_encrypt(struct aead_request *req)
req->cryptlen, req->iv);
aead_request_set_ad(creq, req->assoclen);

- err = padata_do_parallel(pencrypt.pinst, padata, &ctx->cb_cpu);
+ err = padata_do_parallel(pencrypt, padata, &ctx->cb_cpu);
if (!err)
return -EINPROGRESS;

@@ -170,7 +145,7 @@ static int pcrypt_aead_decrypt(struct aead_request *req)
req->cryptlen, req->iv);
aead_request_set_ad(creq, req->assoclen);

- err = padata_do_parallel(pdecrypt.pinst, padata, &ctx->cb_cpu);
+ err = padata_do_parallel(pdecrypt, padata, &ctx->cb_cpu);
if (!err)
return -EINPROGRESS;

@@ -317,36 +292,6 @@ static int pcrypt_create(struct crypto_template *tmpl, struct rtattr **tb)
return -EINVAL;
}

-static int pcrypt_cpumask_change_notify(struct notifier_block *self,
- unsigned long val, void *data)
-{
- struct padata_pcrypt *pcrypt;
- struct pcrypt_cpumask *new_mask, *old_mask;
- struct padata_cpumask *cpumask = (struct padata_cpumask *)data;
-
- if (!(val & PADATA_CPU_SERIAL))
- return 0;
-
- pcrypt = container_of(self, struct padata_pcrypt, nblock);
- new_mask = kmalloc(sizeof(*new_mask), GFP_KERNEL);
- if (!new_mask)
- return -ENOMEM;
- if (!alloc_cpumask_var(&new_mask->mask, GFP_KERNEL)) {
- kfree(new_mask);
- return -ENOMEM;
- }
-
- old_mask = pcrypt->cb_cpumask;
-
- cpumask_copy(new_mask->mask, cpumask->cbcpu);
- rcu_assign_pointer(pcrypt->cb_cpumask, new_mask);
- synchronize_rcu();
-
- free_cpumask_var(old_mask->mask);
- kfree(old_mask);
- return 0;
-}
-
static int pcrypt_sysfs_add(struct padata_instance *pinst, const char *name)
{
int ret;
@@ -359,63 +304,29 @@ static int pcrypt_sysfs_add(struct padata_instance *pinst, const char *name)
return ret;
}

-static int pcrypt_init_padata(struct padata_pcrypt *pcrypt,
- const char *name)
+static int pcrypt_init_padata(struct padata_instance **pinst, const char *name)
{
int ret = -ENOMEM;
- struct pcrypt_cpumask *mask;

get_online_cpus();

- pcrypt->pinst = padata_alloc_possible(name);
- if (!pcrypt->pinst)
- goto err;
-
- mask = kmalloc(sizeof(*mask), GFP_KERNEL);
- if (!mask)
- goto err_free_padata;
- if (!alloc_cpumask_var(&mask->mask, GFP_KERNEL)) {
- kfree(mask);
- goto err_free_padata;
- }
-
- cpumask_and(mask->mask, cpu_possible_mask, cpu_online_mask);
- rcu_assign_pointer(pcrypt->cb_cpumask, mask);
-
- pcrypt->nblock.notifier_call = pcrypt_cpumask_change_notify;
- ret = padata_register_cpumask_notifier(pcrypt->pinst, &pcrypt->nblock);
- if (ret)
- goto err_free_cpumask;
+ *pinst = padata_alloc_possible(name);
+ if (!*pinst)
+ return ret;

- ret = pcrypt_sysfs_add(pcrypt->pinst, name);
+ ret = pcrypt_sysfs_add(*pinst, name);
if (ret)
- goto err_unregister_notifier;
+ padata_free(*pinst);

put_online_cpus();

- return ret;
-
-err_unregister_notifier:
- padata_unregister_cpumask_notifier(pcrypt->pinst, &pcrypt->nblock);
-err_free_cpumask:
- free_cpumask_var(mask->mask);
- kfree(mask);
-err_free_padata:
- padata_free(pcrypt->pinst);
-err:
- put_online_cpus();
-
return ret;
}

-static void pcrypt_fini_padata(struct padata_pcrypt *pcrypt)
+static void pcrypt_fini_padata(struct padata_instance *pinst)
{
- free_cpumask_var(pcrypt->cb_cpumask->mask);
- kfree(pcrypt->cb_cpumask);
-
- padata_stop(pcrypt->pinst);
- padata_unregister_cpumask_notifier(pcrypt->pinst, &pcrypt->nblock);
- padata_free(pcrypt->pinst);
+ padata_stop(pinst);
+ padata_free(pinst);
}

static struct crypto_template pcrypt_tmpl = {
@@ -440,13 +351,13 @@ static int __init pcrypt_init(void)
if (err)
goto err_deinit_pencrypt;

- padata_start(pencrypt.pinst);
- padata_start(pdecrypt.pinst);
+ padata_start(pencrypt);
+ padata_start(pdecrypt);

return crypto_register_template(&pcrypt_tmpl);

err_deinit_pencrypt:
- pcrypt_fini_padata(&pencrypt);
+ pcrypt_fini_padata(pencrypt);
err_unreg_kset:
kset_unregister(pcrypt_kset);
err:
@@ -455,8 +366,8 @@ static int __init pcrypt_init(void)

static void __exit pcrypt_exit(void)
{
- pcrypt_fini_padata(&pencrypt);
- pcrypt_fini_padata(&pdecrypt);
+ pcrypt_fini_padata(pencrypt);
+ pcrypt_fini_padata(pdecrypt);

kset_unregister(pcrypt_kset);
crypto_unregister_template(&pcrypt_tmpl);
--
2.23.0

2019-09-06 10:33:39

by Daniel Jordan

[permalink] [raw]
Subject: [PATCH v3 9/9] padata: remove cpu_index from the parallel_queue

With the removal of the ENODATA case from padata_get_next, the cpu_index
field is no longer useful, so it can go away.

Signed-off-by: Daniel Jordan <[email protected]>
Acked-by: Steffen Klassert <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/padata.h | 2 --
kernel/padata.c | 13 ++-----------
2 files changed, 2 insertions(+), 13 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 43d3fd9d17fc..23717eeaad23 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -75,14 +75,12 @@ struct padata_serial_queue {
* @swork: work struct for serialization.
* @work: work struct for parallelization.
* @num_obj: Number of objects that are processed by this cpu.
- * @cpu_index: Index of the cpu.
*/
struct padata_parallel_queue {
struct padata_list parallel;
struct padata_list reorder;
struct work_struct work;
atomic_t num_obj;
- int cpu_index;
};

/**
diff --git a/kernel/padata.c b/kernel/padata.c
index 832224dcf2e1..c3fec1413295 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -400,21 +400,12 @@ static void padata_init_squeues(struct parallel_data *pd)
/* Initialize all percpu queues used by parallel workers */
static void padata_init_pqueues(struct parallel_data *pd)
{
- int cpu_index, cpu;
+ int cpu;
struct padata_parallel_queue *pqueue;

- cpu_index = 0;
- for_each_possible_cpu(cpu) {
+ for_each_cpu(cpu, pd->cpumask.pcpu) {
pqueue = per_cpu_ptr(pd->pqueue, cpu);

- if (!cpumask_test_cpu(cpu, pd->cpumask.pcpu)) {
- pqueue->cpu_index = -1;
- continue;
- }
-
- pqueue->cpu_index = cpu_index;
- cpu_index++;
-
__padata_list_init(&pqueue->reorder);
__padata_list_init(&pqueue->parallel);
INIT_WORK(&pqueue->work, padata_parallel_worker);
--
2.23.0

2019-09-13 11:32:35

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH v3 0/9] padata: use unbound workqueues for parallel jobs

On Thu, Sep 05, 2019 at 09:40:20PM -0400, Daniel Jordan wrote:
> v2 -> v3:
> - Rebase onto cryptodev.
>
> v1 -> v2:
> - Updated patch 8 to avoid queueing the reorder work if the next object
> by sequence number isn't ready yet (Herbert)
> - Added Steffen's ack to all but patch 8 since that one changed.
>
> RFC -> v1:
> - Included Tejun's acks.
> - Added testing section to cover letter.
>
> Padata binds the parallel part of a job to a single CPU and round-robins
> over all CPUs in the system for each successive job. Though the serial
> parts rely on per-CPU queues for correct ordering, they're not necessary
> for parallel work, and it improves performance to run the job locally on
> NUMA machines and let the scheduler pick the CPU within a node on a busy
> system.
>
> This series makes parallel padata jobs run on unbound workqueues.
>
> Patch Description
> ----- -----------
>
> 1 Make a padata instance allocate its workqueue internally.
>
> 2 Unconfine some recently-confined workqueue interfaces.
>
> 3-6 Address recursive CPU hotplug locking issue.
>
> padata_alloc* requires its callers to hold this lock, but allocating
> an unbound workqueue and calling apply_workqueue_attrs also take it.
> Fix by removing the requirement for callers of padata_alloc*.
>
> 7-8 Add a second workqueue for each padata instance that's dedicated to
> parallel jobs.
>
> 9 Small cleanup.
>
> Performance
> -----------
>
> Measurements are from a 2-socket, 20-core, 40-CPU Xeon server.
>
> For repeatability, modprobe was bound to a CPU and the serial cpumasks
> for both pencrypt and pdecrypt were also restricted to a CPU different
> from modprobe's.
>
> # modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
> # modprobe tcrypt mode=211 sec=1
> # modprobe tcrypt mode=215 sec=1
>
> Busy system (tcrypt run while 10 stress-ng tasks were burning 100% CPU)
>
> base test
> ---------------- ---------------
> speedup key_sz blk_sz ops/sec stdev ops/sec stdev
>
> (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=211)
>
> 117.2x 160 16 960 30 112555 24775
> 135.1x 160 64 845 246 114145 25124
> 113.2x 160 256 993 17 112395 24714
> 111.3x 160 512 1000 0 111252 23755
> 110.0x 160 1024 983 16 108153 22374
> 104.2x 160 2048 985 22 102563 20530
> 98.5x 160 4096 998 3 98346 18777
> 86.2x 160 8192 1000 0 86173 14480
>
> (pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=211)
>
> 127.2x 160 16 997 5 126834 24244
> 128.4x 160 64 1000 0 128438 23261
> 127.6x 160 256 992 7 126627 23493
> 124.0x 160 512 1000 0 123958 22746
> 122.8x 160 1024 989 20 121372 22632
> 112.8x 160 2048 998 3 112602 18287
> 106.9x 160 4096 994 10 106255 16111
> 91.7x 160 8192 1000 0 91742 11670
>
> multibuffer (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=215)
>
> 242.2x 160 16 2363 141 572189 16846
> 242.1x 160 64 2397 151 580424 11923
> 231.1x 160 256 2472 21 571387 16364
> 237.6x 160 512 2429 24 577264 8692
> 238.3x 160 1024 2384 97 568155 6621
> 216.3x 160 2048 2453 74 530627 3480
> 209.2x 160 4096 2381 206 498192 19177
> 176.5x 160 8192 2323 157 410013 9903
>
> multibuffer (pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=215)
>
> 220.3x 160 16 2341 228 515733 91317
> 216.6x 160 64 2467 33 534381 101262
> 217.7x 160 256 2451 45 533443 85418
> 213.8x 160 512 2485 26 531293 83767
> 211.0x 160 1024 2472 28 521677 80339
> 200.8x 160 2048 2459 67 493808 63587
> 188.8x 160 4096 2491 9 470325 58055
> 159.9x 160 8192 2459 51 393147 25756
>
> Idle system (tcrypt run by itself)
>
> base test
> ---------------- ---------------
> speedup key_sz blk_sz ops/sec stdev ops/sec stdev
>
> (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=211)
>
> 2.5x 160 16 63412 43075 161615 1034
> 4.1x 160 64 39554 24006 161653 981
> 6.0x 160 256 26504 1436 160110 1158
> 6.2x 160 512 25500 40 157018 951
> 5.9x 160 1024 25777 1094 151852 915
> 5.8x 160 2048 24653 218 143756 508
> 5.6x 160 4096 24333 20 136752 548
> 5.0x 160 8192 23310 15 117660 481
>
> (pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=211)
>
> 2.4x 160 16 53471 48279 128047 31328
> 3.4x 160 64 37712 20855 128187 31074
> 4.5x 160 256 27911 4378 126430 31084
> 4.9x 160 512 25346 175 123870 29099
> 3.1x 160 1024 38452 23118 120817 26846
> 4.7x 160 2048 24612 187 115036 23942
> 4.5x 160 4096 24217 114 109583 21559
> 4.2x 160 8192 23144 108 96850 16686
>
> multibuffer (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=215)
>
> 1.0x 160 16 412157 3855 426973 1591
> 1.0x 160 64 412600 4410 431920 4224
> 1.1x 160 256 410352 3254 453691 17831
> 1.2x 160 512 406293 4948 473491 39818
> 1.2x 160 1024 395123 7804 478539 27660
> 1.2x 160 2048 385144 7601 453720 17579
> 1.2x 160 4096 371989 3631 449923 15331
> 1.2x 160 8192 346723 1617 399824 18559
>
> multibuffer (pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=215)
>
> 1.1x 160 16 407317 1487 452619 14404
> 1.1x 160 64 411821 4261 464059 23541
> 1.2x 160 256 408941 4945 477483 36576
> 1.2x 160 512 406451 611 472661 11038
> 1.2x 160 1024 394813 2667 456357 11452
> 1.2x 160 2048 390291 4175 448928 8957
> 1.2x 160 4096 371904 1068 449344 14225
> 1.2x 160 8192 344227 1973 404397 19540
>
> Testing
> -------
>
> In addition to the bare metal performance runs above, this series was
> tested in a kvm guest with the tcrypt module (mode=215). All
> combinations of CPUs among parallel_cpumask, serial_cpumask, and CPU
> hotplug online/offline were run with 3 possible CPUs, and over 2000
> random combinations of these were run with 8 possible CPUs. Workqueue
> events were used throughout to verify that all parallel and serial
> workers executed on only the CPUs allowed by the cpumask sysfs files.
>
> Finally, tcrypt mode=215 was run at each patch in the series when built
> with and without CONFIG_PADATA/CONFIG_CRYPTO_PCRYPT.
>
> v2: https://lore.kernel.org/linux-crypto/[email protected]/
> v1: https://lore.kernel.org/linux-crypto/[email protected]/
> RFC: https://lore.kernel.org/lkml/[email protected]/
>
> Daniel Jordan (9):
> padata: allocate workqueue internally
> workqueue: unconfine alloc/apply/free_workqueue_attrs()
> workqueue: require CPU hotplug read exclusion for
> apply_workqueue_attrs
> padata: make padata_do_parallel find alternate callback CPU
> pcrypt: remove padata cpumask notifier
> padata, pcrypt: take CPU hotplug lock internally in
> padata_alloc_possible
> padata: use separate workqueues for parallel and serial work
> padata: unbind parallel jobs from specific CPUs
> padata: remove cpu_index from the parallel_queue
>
> Documentation/padata.txt | 12 +--
> crypto/pcrypt.c | 167 ++++---------------------------
> include/linux/padata.h | 16 +--
> include/linux/workqueue.h | 4 +
> kernel/padata.c | 201 ++++++++++++++++++++++----------------
> kernel/workqueue.c | 25 +++--
> 6 files changed, 170 insertions(+), 255 deletions(-)

All applied. Thanks.
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt