2009-03-18 19:20:33

by Dan Williams

Subject: [PATCH 00/13] Asynchronous raid6 acceleration (part 1 of 2)

This series constitutes the pieces of the raid6 acceleration work that are
aimed at the next merge window. It implements:
1/ An api for asynchronous raid6 parity generation and recovery routines
2/ Extensions to the dmaengine framework and dmatest
3/ RAID6 support for iop13xx
4/ Removal of the BUILD_BUG_ON in async_xor to support platforms with
sizeof(dma_addr_t) > sizeof(void *). This increases the stack
utilization in the asynchronous path.

This series is available via git at:

git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx.git raid6

...it is based on git://neil.brown.name/md md-scratch

Part 2, which needs more testing, is the conversion of md/raid6. It is
available at:

git://git.kernel.org/pub/scm/linux/kernel/git/djbw/md.git raid6-for-neil

---

Dan Williams (12):
dmatest: add pq support
dmatest: add dma interrupts and callbacks
dmatest: add xor test
dmaengine: allow dma support for async_tx to be toggled
iop-adma: P+Q self test
iop-adma: P+Q support for iop13xx adma engines
async_tx: add support for asynchronous RAID6 recovery operations
async_tx: add support for asynchronous GF multiplication
async_tx: add sum check flags
async_tx: kill needless module_{init|exit}
async_tx: provide __async_inline for HAS_DMA=n archs
md/raid6: move raid6 data processing to raid6_pq.ko

Yuri Tikhonov (1):
async_tx: don't use src_list argument of async_xor() for dma addresses

arch/arm/include/asm/hardware/iop3xx-adma.h | 68 +++-
arch/arm/include/asm/hardware/iop_adma.h | 1 +
arch/arm/mach-iop13xx/include/mach/adma.h | 107 +++++-
arch/arm/mach-iop13xx/setup.c | 2 +-
crypto/async_tx/Kconfig | 9 +
crypto/async_tx/Makefile | 2 +
crypto/async_tx/async_memcpy.c | 13 -
crypto/async_tx/async_memset.c | 13 -
crypto/async_tx/async_pq.c | 590 +++++++++++++++++++++++++
crypto/async_tx/async_r6recov.c | 272 ++++++++++++
crypto/async_tx/async_tx.c | 23 +-
crypto/async_tx/async_xor.c | 40 +--
drivers/dma/Kconfig | 11 +
drivers/dma/dmatest.c | 333 ++++++++++----
drivers/dma/iop-adma.c | 285 ++++++++++++-
drivers/md/Kconfig | 4 +
drivers/md/Makefile | 4 +-
drivers/md/mktables.c | 14 +-
drivers/md/raid5.c | 14 +-
drivers/md/raid5.h | 7 +-
drivers/md/raid6algos.c | 21 +-
drivers/md/raid6altivec.uc | 2 +-
drivers/md/raid6int.uc | 2 +-
drivers/md/raid6mmx.c | 2 +-
drivers/md/raid6recov.c | 8 +-
drivers/md/raid6sse1.c | 2 +-
drivers/md/raid6sse2.c | 2 +-
drivers/md/raid6test/Makefile | 2 +-
drivers/md/raid6test/test.c | 2 +-
include/linux/async_tx.h | 56 +++-
include/linux/dmaengine.h | 97 ++++-
drivers/md/raid6.h => include/linux/raid/pq.h | 3 -
32 files changed, 1792 insertions(+), 219 deletions(-)
create mode 100644 crypto/async_tx/async_pq.c
create mode 100644 crypto/async_tx/async_r6recov.c
rename drivers/md/raid6.h => include/linux/raid/pq.h (97%)


2009-03-18 19:20:53

by Dan Williams

Subject: [PATCH 01/13] md/raid6: move raid6 data processing to raid6_pq.ko

Move the raid6 data processing routines into a standalone module
(raid6_pq) to prepare them to be called from async_tx wrappers and other
non-md drivers/modules. This avoids a circular dependency where raid456
would need the async modules for data processing while those modules in
turn depend on raid456 for the base-level synchronous raid6 routines.

To support this move:
1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h
2/ The raid6_call, recovery calls, and table symbols are exported
3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to
compile
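
For illustration only (not part of this patch), a non-md module that
links against raid6_pq can now call the exported synchronous routines
directly; the helper name below is hypothetical:

    #include <linux/raid/pq.h>

    /* 'disks' counts data strips plus the P and Q destinations, which
     * sit at ptrs[disks-2] and ptrs[disks-1]; raid6_call is exported by
     * raid6_pq and holds the algorithm picked by raid6_select_algo()
     */
    static void example_gen_pq(int disks, size_t bytes, void **ptrs)
    {
            raid6_call.gen_syndrome(disks, bytes, ptrs);
    }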

Signed-off-by: Dan Williams <[email protected]>
---
drivers/md/Kconfig | 4 ++++
drivers/md/Makefile | 4 +++-
drivers/md/mktables.c | 14 +++++++++++++-
drivers/md/raid5.c | 12 +-----------
drivers/md/raid5.h | 2 ++
drivers/md/raid6algos.c | 21 ++++++++++++++++++++-
drivers/md/raid6altivec.uc | 2 +-
drivers/md/raid6int.uc | 2 +-
drivers/md/raid6mmx.c | 2 +-
drivers/md/raid6recov.c | 8 +++++---
drivers/md/raid6sse1.c | 2 +-
drivers/md/raid6sse2.c | 2 +-
drivers/md/raid6test/Makefile | 2 +-
drivers/md/raid6test/test.c | 2 +-
include/linux/raid/pq.h | 3 ---
15 files changed, 55 insertions(+), 27 deletions(-)
rename drivers/md/raid6.h => include/linux/raid/pq.h (97%)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 2281b50..449d0b9 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -121,6 +121,7 @@ config MD_RAID10
config MD_RAID456
tristate "RAID-4/RAID-5/RAID-6 mode"
depends on BLK_DEV_MD
+ select MD_RAID6_PQ
select ASYNC_MEMCPY
select ASYNC_XOR
---help---
@@ -180,6 +181,9 @@ config MD_RAID5_RESHAPE

If unsure, say Y.

+config MD_RAID6_PQ
+ tristate
+
config MD_MULTIPATH
tristate "Multipath I/O support"
depends on BLK_DEV_MD
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 3b118da..45cc595 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -9,7 +9,8 @@ dm-snapshot-y += dm-snap.o dm-exception-store.o dm-snap-transient.o \
dm-snap-persistent.o
dm-mirror-y += dm-raid1.o
md-mod-y += md.o bitmap.o
-raid456-y += raid5.o raid6algos.o raid6recov.o raid6tables.o \
+raid456-y += raid5.o
+raid6_pq-y += raid6algos.o raid6recov.o raid6tables.o \
raid6int1.o raid6int2.o raid6int4.o \
raid6int8.o raid6int16.o raid6int32.o \
raid6altivec1.o raid6altivec2.o raid6altivec4.o \
@@ -26,6 +27,7 @@ obj-$(CONFIG_MD_LINEAR) += linear.o
obj-$(CONFIG_MD_RAID0) += raid0.o
obj-$(CONFIG_MD_RAID1) += raid1.o
obj-$(CONFIG_MD_RAID10) += raid10.o
+obj-$(CONFIG_MD_RAID6_PQ) += raid6_pq.o
obj-$(CONFIG_MD_RAID456) += raid456.o
obj-$(CONFIG_MD_MULTIPATH) += multipath.o
obj-$(CONFIG_MD_FAULTY) += faulty.o
diff --git a/drivers/md/mktables.c b/drivers/md/mktables.c
index b61d576..3b15008 100644
--- a/drivers/md/mktables.c
+++ b/drivers/md/mktables.c
@@ -59,7 +59,7 @@ int main(int argc, char *argv[])
uint8_t v;
uint8_t exptbl[256], invtbl[256];

- printf("#include \"raid6.h\"\n");
+ printf("#include <linux/raid/pq.h>\n");

/* Compute multiplication table */
printf("\nconst u8 __attribute__((aligned(256)))\n"
@@ -76,6 +76,9 @@ int main(int argc, char *argv[])
printf("\t},\n");
}
printf("};\n");
+ printf("#ifdef __KERNEL__\n");
+ printf("EXPORT_SYMBOL(raid6_gfmul);\n");
+ printf("#endif\n");

/* Compute power-of-2 table (exponent) */
v = 1;
@@ -92,6 +95,9 @@ int main(int argc, char *argv[])
}
}
printf("};\n");
+ printf("#ifdef __KERNEL__\n");
+ printf("EXPORT_SYMBOL(raid6_gfexp);\n");
+ printf("#endif\n");

/* Compute inverse table x^-1 == x^254 */
printf("\nconst u8 __attribute__((aligned(256)))\n"
@@ -104,6 +110,9 @@ int main(int argc, char *argv[])
}
}
printf("};\n");
+ printf("#ifdef __KERNEL__\n");
+ printf("EXPORT_SYMBOL(raid6_gfinv);\n");
+ printf("#endif\n");

/* Compute inv(2^x + 1) (exponent-xor-inverse) table */
printf("\nconst u8 __attribute__((aligned(256)))\n"
@@ -115,6 +124,9 @@ int main(int argc, char *argv[])
(j == 7) ? '\n' : ' ');
}
printf("};\n");
+ printf("#ifdef __KERNEL__\n");
+ printf("EXPORT_SYMBOL(raid6_gfexi);\n");
+ printf("#endif\n");

return 0;
}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 70b50af..21626cc 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -45,11 +45,11 @@

#include <linux/blkdev.h>
#include <linux/kthread.h>
+#include <linux/raid/pq.h>
#include <linux/async_tx.h>
#include <linux/seq_file.h>
#include "md.h"
#include "raid5.h"
-#include "raid6.h"
#include "bitmap.h"

/*
@@ -94,11 +94,6 @@

#define printk_rl(args...) ((void) (printk_ratelimit() && printk(args)))

-#if !RAID6_USE_EMPTY_ZERO_PAGE
-/* In .bss so it's zeroed */
-const char raid6_empty_zero_page[PAGE_SIZE] __attribute__((aligned(256)));
-#endif
-
/*
* We maintain a biased count of active stripes in the bottom 16 bits of
* bi_phys_segments, and a count of processed stripes in the upper 16 bits
@@ -5131,11 +5126,6 @@ static struct mdk_personality raid4_personality =

static int __init raid5_init(void)
{
- int e;
-
- e = raid6_select_algo();
- if ( e )
- return e;
register_md_personality(&raid6_personality);
register_md_personality(&raid5_personality);
register_md_personality(&raid4_personality);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index c172371..2934ee0 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -269,6 +269,8 @@ struct r6_state {
#define READ_MODIFY_WRITE 2
/* not a write method, but a compute_parity mode */
#define CHECK_PARITY 3
+/* Additional compute_parity mode -- updates the parity w/o LOCKING */
+#define UPDATE_PARITY 4

/*
* Stripe state
diff --git a/drivers/md/raid6algos.c b/drivers/md/raid6algos.c
index 1f6a3c8..34e57ff 100644
--- a/drivers/md/raid6algos.c
+++ b/drivers/md/raid6algos.c
@@ -16,10 +16,16 @@
* Algorithm list and algorithm selection for RAID-6
*/

-#include "raid6.h"
+#include <linux/raid/pq.h>
#ifndef __KERNEL__
#include <sys/mman.h>
#include <stdio.h>
+#else
+#if !RAID6_USE_EMPTY_ZERO_PAGE
+/* In .bss so it's zeroed */
+const char raid6_empty_zero_page[PAGE_SIZE] __attribute__((aligned(256)));
+EXPORT_SYMBOL(raid6_empty_zero_page);
+#endif
#endif

struct raid6_calls raid6_call;
@@ -79,6 +85,7 @@ const struct raid6_calls * const raid6_algos[] = {
#else
/* Need more time to be stable in userspace */
#define RAID6_TIME_JIFFIES_LG2 9
+#define time_before(x, y) ((x) < (y))
#endif

/* Try to pick the best algorithm */
@@ -152,3 +159,15 @@ int __init raid6_select_algo(void)

return best ? 0 : -EINVAL;
}
+
+static void raid6_exit(void)
+{
+ do { } while (0);
+}
+
+#ifdef __KERNEL__
+EXPORT_SYMBOL_GPL(raid6_call);
+subsys_initcall(raid6_select_algo);
+module_exit(raid6_exit);
+MODULE_LICENSE("GPL");
+#endif
diff --git a/drivers/md/raid6altivec.uc b/drivers/md/raid6altivec.uc
index 2175806..699dfee 100644
--- a/drivers/md/raid6altivec.uc
+++ b/drivers/md/raid6altivec.uc
@@ -22,7 +22,7 @@
* bracked this with preempt_disable/enable or in a lock)
*/

-#include "raid6.h"
+#include <linux/raid/pq.h>

#ifdef CONFIG_ALTIVEC

diff --git a/drivers/md/raid6int.uc b/drivers/md/raid6int.uc
index 32a0bac..f9bf9cb 100644
--- a/drivers/md/raid6int.uc
+++ b/drivers/md/raid6int.uc
@@ -18,7 +18,7 @@
* This file is postprocessed using unroll.pl
*/

-#include "raid6.h"
+#include <linux/raid/pq.h>

/*
* This is the C data type to use
diff --git a/drivers/md/raid6mmx.c b/drivers/md/raid6mmx.c
index 804cb50..e7f6c13 100644
--- a/drivers/md/raid6mmx.c
+++ b/drivers/md/raid6mmx.c
@@ -18,7 +18,7 @@

#if defined(__i386__) && !defined(__arch_um__)

-#include "raid6.h"
+#include <linux/raid/pq.h>
#include "raid6x86.h"

/* Shared with raid6sse1.c */
diff --git a/drivers/md/raid6recov.c b/drivers/md/raid6recov.c
index 7a98b86..844d27d 100644
--- a/drivers/md/raid6recov.c
+++ b/drivers/md/raid6recov.c
@@ -18,7 +18,7 @@
* the syndrome.)
*/

-#include "raid6.h"
+#include <linux/raid/pq.h>

/* Recover two failed data blocks. */
void raid6_2data_recov(int disks, size_t bytes, int faila, int failb,
@@ -98,8 +98,10 @@ void raid6_datap_recov(int disks, size_t bytes, int faila, void **ptrs)
}
}

-
-#ifndef __KERNEL__ /* Testing only */
+#ifdef __KERNEL__
+EXPORT_SYMBOL_GPL(raid6_2data_recov);
+EXPORT_SYMBOL_GPL(raid6_datap_recov);
+#else /* Testing only */

/* Recover two failed blocks. */
void raid6_dual_recov(int disks, size_t bytes, int faila, int failb, void **ptrs)
diff --git a/drivers/md/raid6sse1.c b/drivers/md/raid6sse1.c
index 15c5889..b274dd5 100644
--- a/drivers/md/raid6sse1.c
+++ b/drivers/md/raid6sse1.c
@@ -23,7 +23,7 @@

#if defined(__i386__) && !defined(__arch_um__)

-#include "raid6.h"
+#include <linux/raid/pq.h>
#include "raid6x86.h"

/* Defined in raid6mmx.c */
diff --git a/drivers/md/raid6sse2.c b/drivers/md/raid6sse2.c
index 2e92e96..6ed6c6c 100644
--- a/drivers/md/raid6sse2.c
+++ b/drivers/md/raid6sse2.c
@@ -19,7 +19,7 @@

#if (defined(__i386__) || defined(__x86_64__)) && !defined(__arch_um__)

-#include "raid6.h"
+#include <linux/raid/pq.h>
#include "raid6x86.h"

static const struct raid6_sse_constants {
diff --git a/drivers/md/raid6test/Makefile b/drivers/md/raid6test/Makefile
index 78e0396..58ffdf4 100644
--- a/drivers/md/raid6test/Makefile
+++ b/drivers/md/raid6test/Makefile
@@ -5,7 +5,7 @@

CC = gcc
OPTFLAGS = -O2 # Adjust as desired
-CFLAGS = -I.. -g $(OPTFLAGS)
+CFLAGS = -I.. -I ../../../include -g $(OPTFLAGS)
LD = ld
PERL = perl
AR = ar
diff --git a/drivers/md/raid6test/test.c b/drivers/md/raid6test/test.c
index 559cc41..7a93031 100644
--- a/drivers/md/raid6test/test.c
+++ b/drivers/md/raid6test/test.c
@@ -17,7 +17,7 @@
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
-#include "raid6.h"
+#include <linux/raid/pq.h>

#define NDISKS 16 /* Including P and Q */

diff --git a/drivers/md/raid6.h b/include/linux/raid/pq.h
similarity index 97%
rename from drivers/md/raid6.h
rename to include/linux/raid/pq.h
index 8a9c823..a019670 100644
--- a/drivers/md/raid6.h
+++ b/include/linux/raid/pq.h
@@ -19,9 +19,6 @@
#define RAID6_USE_EMPTY_ZERO_PAGE 0
#include <linux/blkdev.h>

-/* Additional compute_parity mode -- updates the parity w/o LOCKING */
-#define UPDATE_PARITY 4
-
/* We need a pre-zeroed page... if we don't want to use the kernel-provided
one define it here */
#if RAID6_USE_EMPTY_ZERO_PAGE

2009-03-18 19:21:32

by Dan Williams

Subject: [PATCH 02/13] async_tx: don't use src_list argument of async_xor() for dma addresses

From: Yuri Tikhonov <[email protected]>

Using the src_list argument of async_xor() as storage for dma addresses
implies the restriction sizeof(dma_addr_t) <= sizeof(struct page *),
which does not always hold (e.g. ppc440spe).
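
Sketch of the pattern change made in do_async_xor() and
async_xor_zero_sum() (comments added here for illustration):

    /* before: reuse the caller's page array as dma address storage;
     * only safe when sizeof(dma_addr_t) <= sizeof(struct page *)
     */
    dma_addr_t *dma_src = (dma_addr_t *) src_list;

    /* after: a separate on-stack array sized by the source count */
    dma_addr_t dma_src[src_cnt];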

Signed-off-by: Ilya Yanok <[email protected]>
Signed-off-by: Yuri Tikhonov <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
crypto/async_tx/async_xor.c | 14 ++------------
1 files changed, 2 insertions(+), 12 deletions(-)

diff --git a/crypto/async_tx/async_xor.c b/crypto/async_tx/async_xor.c
index 595b786..e1f1f28 100644
--- a/crypto/async_tx/async_xor.c
+++ b/crypto/async_tx/async_xor.c
@@ -42,7 +42,7 @@ do_async_xor(struct dma_chan *chan, struct page *dest, struct page **src_list,
dma_async_tx_callback cb_fn, void *cb_param)
{
struct dma_device *dma = chan->device;
- dma_addr_t *dma_src = (dma_addr_t *) src_list;
+ dma_addr_t dma_src[src_cnt];
struct dma_async_tx_descriptor *tx = NULL;
int src_off = 0;
int i;
@@ -254,7 +254,7 @@ async_xor_zero_sum(struct page *dest, struct page **src_list,
BUG_ON(src_cnt <= 1);

if (device && src_cnt <= device->max_xor) {
- dma_addr_t *dma_src = (dma_addr_t *) src_list;
+ dma_addr_t dma_src[src_cnt];
unsigned long dma_prep_flags = cb_fn ? DMA_PREP_INTERRUPT : 0;
int i;

@@ -303,16 +303,6 @@ EXPORT_SYMBOL_GPL(async_xor_zero_sum);

static int __init async_xor_init(void)
{
- #ifdef CONFIG_DMA_ENGINE
- /* To conserve stack space the input src_list (array of page pointers)
- * is reused to hold the array of dma addresses passed to the driver.
- * This conversion is only possible when dma_addr_t is less than the
- * the size of a pointer. HIGHMEM64G is known to violate this
- * assumption.
- */
- BUILD_BUG_ON(sizeof(dma_addr_t) > sizeof(struct page *));
- #endif
-
return 0;
}

2009-03-18 19:22:14

by Dan Williams

Subject: [PATCH 05/13] async_tx: add sum check flags

Replace the flat zero_sum_result with a collection of flags to contain
the P (xor) zero-sum result and the soon-to-be-used Q (raid6
Reed-Solomon syndrome) zero-sum result. Use the SUM_CHECK_ namespace
instead of DMA_ since these flags will also be used on platforms without
dma zero-sum support.
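
A minimal sketch of how a caller consumes the new flags (the
surrounding setup is omitted and assumed):

    enum sum_check_flags result = 0;

    /* ... async_xor_zero_sum()/async_pq_zero_sum() fill 'result' ... */

    if (result & SUM_CHECK_P_RESULT)
            pr_debug("P (xor) parity mismatch\n");
    if (result & SUM_CHECK_Q_RESULT)
            pr_debug("Q (Reed-Solomon) parity mismatch\n");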

Signed-off-by: Dan Williams <[email protected]>
---
arch/arm/include/asm/hardware/iop3xx-adma.h | 5 +++--
arch/arm/mach-iop13xx/include/mach/adma.h | 12 +++++++-----
crypto/async_tx/async_xor.c | 4 ++--
drivers/md/raid5.c | 2 +-
drivers/md/raid5.h | 5 +++--
include/linux/async_tx.h | 2 +-
include/linux/dmaengine.h | 21 ++++++++++++++++++++-
7 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/arch/arm/include/asm/hardware/iop3xx-adma.h b/arch/arm/include/asm/hardware/iop3xx-adma.h
index 83e6ba3..26eefea 100644
--- a/arch/arm/include/asm/hardware/iop3xx-adma.h
+++ b/arch/arm/include/asm/hardware/iop3xx-adma.h
@@ -756,13 +756,14 @@ static inline void iop_desc_set_block_fill_val(struct iop_adma_desc_slot *desc,
hw_desc->src[0] = val;
}

-static inline int iop_desc_get_zero_result(struct iop_adma_desc_slot *desc)
+static inline enum sum_check_flags
+iop_desc_get_zero_result(struct iop_adma_desc_slot *desc)
{
struct iop3xx_desc_aau *hw_desc = desc->hw_desc;
struct iop3xx_aau_desc_ctrl desc_ctrl = hw_desc->desc_ctrl_field;

iop_paranoia(!(desc_ctrl.tx_complete && desc_ctrl.zero_result_en));
- return desc_ctrl.zero_result_err;
+ return desc_ctrl.zero_result_err << SUM_CHECK_P;
}

static inline void iop_chan_append(struct iop_adma_chan *chan)
diff --git a/arch/arm/mach-iop13xx/include/mach/adma.h b/arch/arm/mach-iop13xx/include/mach/adma.h
index 5722e86..1cd31df 100644
--- a/arch/arm/mach-iop13xx/include/mach/adma.h
+++ b/arch/arm/mach-iop13xx/include/mach/adma.h
@@ -428,18 +428,20 @@ static inline void iop_desc_set_block_fill_val(struct iop_adma_desc_slot *desc,
hw_desc->block_fill_data = val;
}

-static inline int iop_desc_get_zero_result(struct iop_adma_desc_slot *desc)
+static inline enum sum_check_flags
+iop_desc_get_zero_result(struct iop_adma_desc_slot *desc)
{
struct iop13xx_adma_desc_hw *hw_desc = desc->hw_desc;
struct iop13xx_adma_desc_ctrl desc_ctrl = hw_desc->desc_ctrl_field;
struct iop13xx_adma_byte_count byte_count = hw_desc->byte_count_field;
+ enum sum_check_flags flags;

BUG_ON(!(byte_count.tx_complete && desc_ctrl.zero_result));

- if (desc_ctrl.pq_xfer_en)
- return byte_count.zero_result_err_q;
- else
- return byte_count.zero_result_err;
+ flags = byte_count.zero_result_err_q << SUM_CHECK_Q;
+ flags |= byte_count.zero_result_err << SUM_CHECK_P;
+
+ return flags;
}

static inline void iop_chan_append(struct iop_adma_chan *chan)
diff --git a/crypto/async_tx/async_xor.c b/crypto/async_tx/async_xor.c
index 0a56fae..d1a084e 100644
--- a/crypto/async_tx/async_xor.c
+++ b/crypto/async_tx/async_xor.c
@@ -238,7 +238,7 @@ static int page_is_zero(struct page *p, unsigned int offset, size_t len)
struct dma_async_tx_descriptor *
async_xor_zero_sum(struct page *dest, struct page **src_list,
unsigned int offset, int src_cnt, size_t len,
- u32 *result, enum async_tx_flags flags,
+ enum sum_check_flags *result, enum async_tx_flags flags,
struct dma_async_tx_descriptor *depend_tx,
dma_async_tx_callback cb_fn, void *cb_param)
{
@@ -289,7 +289,7 @@ async_xor_zero_sum(struct page *dest, struct page **src_list,

async_tx_quiesce(&tx);

- *result = page_is_zero(dest, offset, len) ? 0 : 1;
+ *result = !page_is_zero(dest, offset, len) << SUM_CHECK_P;

async_tx_sync_epilog(cb_fn, cb_param);
}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 21626cc..a102337 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2521,7 +2521,7 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
* we are done. Otherwise update the mismatch count and repair
* parity if !MD_RECOVERY_CHECK
*/
- if (sh->ops.zero_sum_result == 0)
+ if ((sh->ops.zero_sum_result & SUM_CHECK_P_RESULT) == 0)
/* parity is correct (on disc,
* not in buffer any more)
*/
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 2934ee0..cfb5006 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -2,6 +2,7 @@
#define _RAID5_H

#include <linux/raid/xor.h>
+#include <linux/dmaengine.h>

/*
*
@@ -213,8 +214,8 @@ struct stripe_head {
* @target - STRIPE_OP_COMPUTE_BLK target
*/
struct stripe_operations {
- int target;
- u32 zero_sum_result;
+ int target;
+ enum sum_check_flags zero_sum_result;
} ops;
struct r5dev {
struct bio req;
diff --git a/include/linux/async_tx.h b/include/linux/async_tx.h
index 5fc2ef8..6370f32 100644
--- a/include/linux/async_tx.h
+++ b/include/linux/async_tx.h
@@ -119,7 +119,7 @@ async_xor(struct page *dest, struct page **src_list, unsigned int offset,
struct dma_async_tx_descriptor *
async_xor_zero_sum(struct page *dest, struct page **src_list,
unsigned int offset, int src_cnt, size_t len,
- u32 *result, enum async_tx_flags flags,
+ enum sum_check_flags *result, enum async_tx_flags flags,
struct dma_async_tx_descriptor *depend_tx,
dma_async_tx_callback cb_fn, void *cb_fn_param);

diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index 1956c8d..cd17392 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -90,6 +90,25 @@ enum dma_ctrl_flags {
};

/**
+ * enum sum_check_bits - bit position of pq_check_flags
+ */
+enum sum_check_bits {
+ SUM_CHECK_P = 0,
+ SUM_CHECK_Q = 1,
+};
+
+/**
+ * enum pq_check_flags - result of async_{xor,pq}_zero_sum operations
+ * @SUM_CHECK_P_RESULT - 1 if xor zero sum error, 0 otherwise
+ * @SUM_CHECK_Q_RESULT - 1 if reed-solomon zero sum error, 0 otherwise
+ */
+enum sum_check_flags {
+ SUM_CHECK_P_RESULT = (1 << SUM_CHECK_P),
+ SUM_CHECK_Q_RESULT = (1 << SUM_CHECK_Q),
+};
+
+
+/**
* dma_cap_mask_t - capabilities bitmap modeled after cpumask_t.
* See linux/cpumask.h
*/
@@ -246,7 +265,7 @@ struct dma_device {
unsigned int src_cnt, size_t len, unsigned long flags);
struct dma_async_tx_descriptor *(*device_prep_dma_zero_sum)(
struct dma_chan *chan, dma_addr_t *src, unsigned int src_cnt,
- size_t len, u32 *result, unsigned long flags);
+ size_t len, enum sum_check_flags *result, unsigned long flags);
struct dma_async_tx_descriptor *(*device_prep_dma_memset)(
struct dma_chan *chan, dma_addr_t dest, int value, size_t len,
unsigned long flags);

2009-03-18 19:21:49

by Dan Williams

Subject: [PATCH 03/13] async_tx: provide __async_inline for HAS_DMA=n archs

To allow an async_tx routine to be compiled away on a HAS_DMA=n arch it
needs to be declared __always_inline; otherwise the compiler may emit
code and cause a link error.
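
Sketched with a hypothetical helper (not from this patch): the only
caller guards the call with a channel check that is compile-time NULL
when CONFIG_DMA_ENGINE=n, so forcing the helper inline lets the dead
branch, along with its unresolved dma-mapping calls, be discarded:

    static __async_inline struct dma_async_tx_descriptor *
    do_async_op(struct dma_chan *chan, struct page *pg, size_t len)
    {
            /* dma_map_page() is unavailable on HAS_DMA=n archs, so this
             * body must never be emitted as a standalone function there
             */
            dma_map_page(chan->device->dev, pg, 0, len, DMA_TO_DEVICE);
            return NULL;
    }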

Signed-off-by: Dan Williams <[email protected]>
---
crypto/async_tx/async_xor.c | 7 ++-----
include/linux/async_tx.h | 9 +++++++++
2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/crypto/async_tx/async_xor.c b/crypto/async_tx/async_xor.c
index e1f1f28..11978a8 100644
--- a/crypto/async_tx/async_xor.c
+++ b/crypto/async_tx/async_xor.c
@@ -30,11 +30,8 @@
#include <linux/raid/xor.h>
#include <linux/async_tx.h>

-/* do_async_xor - dma map the pages and perform the xor with an engine.
- * This routine is marked __always_inline so it can be compiled away
- * when CONFIG_DMA_ENGINE=n
- */
-static __always_inline struct dma_async_tx_descriptor *
+/* do_async_xor - dma map the pages and perform the xor with an engine */
+static __async_inline struct dma_async_tx_descriptor *
do_async_xor(struct dma_chan *chan, struct page *dest, struct page **src_list,
unsigned int offset, int src_cnt, size_t len,
enum async_tx_flags flags,
diff --git a/include/linux/async_tx.h b/include/linux/async_tx.h
index 45f6297..5fc2ef8 100644
--- a/include/linux/async_tx.h
+++ b/include/linux/async_tx.h
@@ -21,6 +21,15 @@
#include <linux/spinlock.h>
#include <linux/interrupt.h>

+/* on architectures without dma-mapping capabilities we need to ensure
+ * that the asynchronous path compiles away
+ */
+#ifdef CONFIG_HAS_DMA
+#define __async_inline
+#else
+#define __async_inline __always_inline
+#endif
+
/**
* dma_chan_ref - object used to manage dma channels received from the
* dmaengine core.

2009-03-18 19:22:49

by Dan Williams

Subject: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

[ Based on an original patch by Yuri Tikhonov ]

This adds support for doing asynchronous GF multiplication by adding
four additional functions to the async_tx API:

async_pq() simultaneously computes the XOR of the sources and the XOR
of the sources GF-multiplied by the given coefficients.

async_pq_zero_sum() checks whether the results of the P/Q calculations
match the given ones.

async_gen_syndrome() simultaneously computes the XOR (P) and
Reed-Solomon syndrome (Q) of the sources.

async_syndrome_zero_sum() checks whether the results of the
XOR/syndrome calculation match the given ones.

The latter two functions simply use async_pq() with the appropriate
coefficients in the asynchronous case, but have significant
optimizations in the synchronous case.
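
For orientation, a hedged usage sketch of the generation interface
('disks', 'blocks', 'callback' and 'ctx' are assumptions, not taken
from this patch):

    struct dma_async_tx_descriptor *tx;

    /* blocks[0..disks-3] are data, P lives at blocks[disks-2] and
     * Q at blocks[disks-1]; src_cnt excludes the two destinations
     */
    tx = async_gen_syndrome(blocks, 0, disks - 2, PAGE_SIZE,
                            ASYNC_TX_ACK, NULL, callback, ctx);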

When a request is made to run async_pq against more sources than the
hardware maximum, we need to reuse the previously generated P and Q
values as sources for the next operation. Care must be taken to remove
Q from P' and P from Q'. For example, to perform a 5-source pq op with
hardware that only supports 4 sources at a time, the following approach
is taken:

p, q = PQ(src0, src1, src2, src3, COEF({01}, {02}, {04}, {08}))
p', q' = PQ(p, q, q, src4, COEF({00}, {01}, {00}, {10}))

p' = p + q + q + src4 = p + src4
q' = {00}*p + {01}*q + {00}*q + {10}*src4 = q + {10}*src4
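
(These identities hold because addition in GF(2^8) is XOR and {00} is
the zero coefficient: q + q = 0 and {00}*p = 0, so only p + src4 and
q + {10}*src4 survive.)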

Note: 4 is the minimum acceptable maxpq; otherwise we punt to the
synchronous software path.

The DMA_PREP_CONTINUE flag tells the driver to reuse p and q as sources
(in the above manner) and to fill the remaining slots up to maxpq with
the new sources/coefficients.

In the zero_sum case decrement max_pq by two to account for the implied
additional p and q sources.

Note: Some devices can natively support P+Q continuation and can skip
this extra work. Devices with this capability can advertise it with
dma_set_maxpq. It is up to each driver how the DMA_PREP_CONTINUE flag
is honored.
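
Driver-side sketch (the device pointer and source counts are
illustrative):

    /* engine handles 16 pq sources and continues P+Q natively */
    dma_set_maxpq(dma_dev, 16, 1);

    /* without native continuation async_pq re-presents p and q, so
     * dma_maxpq(dma_dev, DMA_PREP_CONTINUE) drops to 16 - 3 = 13
     */
    dma_set_maxpq(dma_dev, 16, 0);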

Signed-off-by: Yuri Tikhonov <[email protected]>
Signed-off-by: Ilya Yanok <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/arm/mach-iop13xx/setup.c | 2
crypto/async_tx/Kconfig | 4
crypto/async_tx/Makefile | 1
crypto/async_tx/async_pq.c | 590 +++++++++++++++++++++++++++++++++++++++++
crypto/async_tx/async_xor.c | 2
drivers/dma/iop-adma.c | 2
include/linux/async_tx.h | 33 ++
include/linux/dmaengine.h | 58 ++++
8 files changed, 681 insertions(+), 11 deletions(-)
create mode 100644 crypto/async_tx/async_pq.c

diff --git a/arch/arm/mach-iop13xx/setup.c b/arch/arm/mach-iop13xx/setup.c
index cfd4d2e..3846482 100644
--- a/arch/arm/mach-iop13xx/setup.c
+++ b/arch/arm/mach-iop13xx/setup.c
@@ -506,7 +506,7 @@ void __init iop13xx_platform_init(void)
dma_cap_set(DMA_MEMSET, plat_data->cap_mask);
dma_cap_set(DMA_MEMCPY_CRC32C, plat_data->cap_mask);
dma_cap_set(DMA_INTERRUPT, plat_data->cap_mask);
- dma_cap_set(DMA_PQ_XOR, plat_data->cap_mask);
+ dma_cap_set(DMA_PQ, plat_data->cap_mask);
dma_cap_set(DMA_PQ_UPDATE, plat_data->cap_mask);
dma_cap_set(DMA_PQ_ZERO_SUM, plat_data->cap_mask);
break;
diff --git a/crypto/async_tx/Kconfig b/crypto/async_tx/Kconfig
index d8fb391..cb6d731 100644
--- a/crypto/async_tx/Kconfig
+++ b/crypto/async_tx/Kconfig
@@ -14,3 +14,7 @@ config ASYNC_MEMSET
tristate
select ASYNC_CORE

+config ASYNC_PQ
+ tristate
+ select ASYNC_CORE
+
diff --git a/crypto/async_tx/Makefile b/crypto/async_tx/Makefile
index 27baa7d..1b99265 100644
--- a/crypto/async_tx/Makefile
+++ b/crypto/async_tx/Makefile
@@ -2,3 +2,4 @@ obj-$(CONFIG_ASYNC_CORE) += async_tx.o
obj-$(CONFIG_ASYNC_MEMCPY) += async_memcpy.o
obj-$(CONFIG_ASYNC_MEMSET) += async_memset.o
obj-$(CONFIG_ASYNC_XOR) += async_xor.o
+obj-$(CONFIG_ASYNC_PQ) += async_pq.o
diff --git a/crypto/async_tx/async_pq.c b/crypto/async_tx/async_pq.c
new file mode 100644
index 0000000..da47a29
--- /dev/null
+++ b/crypto/async_tx/async_pq.c
@@ -0,0 +1,590 @@
+/*
+ * Copyright(c) 2007 Yuri Tikhonov <[email protected]>
+ * Copyright(c) 2009 Intel Corporation
+ *
+ * Developed for DENX Software Engineering GmbH
+ *
+ * Asynchronous GF-XOR calculations ASYNC_TX API.
+ *
+ * based on async_xor.c code written by:
+ * Dan Williams <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+#include <linux/kernel.h>
+#include <linux/interrupt.h>
+#include <linux/dma-mapping.h>
+#include <linux/raid/pq.h>
+#include <linux/async_tx.h>
+
+/**
+ * spare_pages - synchronous zero sum result buffers
+ *
+ * Protected by spare_lock
+ */
+static struct page *spare_pages[2];
+static spinlock_t spare_lock;
+
+/* scribble - space to hold throwaway P buffer for synchronous gen_syndrome */
+static struct page *scribble;
+
+static bool is_raid6_zero_block(void *p)
+{
+ return p == (void *) raid6_empty_zero_page;
+}
+
+/**
+ * do_async_pq - asynchronously calculate P and/or Q
+ */
+static __async_inline struct dma_async_tx_descriptor *
+do_async_pq(struct dma_chan *chan, struct page **blocks, unsigned char *scfs,
+ unsigned int offset, int src_cnt, size_t len,
+ enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *depend_tx,
+ dma_async_tx_callback cb_fn, void *cb_param)
+{
+ struct dma_device *dma = chan->device;
+ dma_addr_t dma_dest[2], dma_src[src_cnt];
+ struct dma_async_tx_descriptor *tx = NULL;
+ dma_async_tx_callback _cb_fn;
+ void *_cb_param;
+ unsigned char *scf = NULL;
+ int i, src_off = 0;
+ unsigned short pq_src_cnt;
+ enum async_tx_flags async_flags;
+ enum dma_ctrl_flags dma_flags = 0;
+ int idx;
+ u8 coefs[src_cnt];
+
+ /* DMAs use destinations as sources, so use BIDIRECTIONAL mapping */
+ if (blocks[src_cnt])
+ dma_dest[0] = dma_map_page(dma->dev, blocks[src_cnt],
+ offset, len, DMA_BIDIRECTIONAL);
+ else
+ dma_flags |= DMA_PREP_PQ_DISABLE_P;
+ if (blocks[src_cnt+1])
+ dma_dest[1] = dma_map_page(dma->dev, blocks[src_cnt+1],
+ offset, len, DMA_BIDIRECTIONAL);
+ else
+ dma_flags |= DMA_PREP_PQ_DISABLE_Q;
+
+ /* convert source addresses being careful to collapse 'zero'
+ * sources and update the coefficients accordingly
+ */
+ for (i = 0, idx = 0; i < src_cnt; i++) {
+ if (is_raid6_zero_block(blocks[i]))
+ continue;
+ dma_src[idx] = dma_map_page(dma->dev, blocks[i],
+ offset, len, DMA_TO_DEVICE);
+ coefs[idx] = scfs[i];
+ idx++;
+ }
+ src_cnt = idx;
+
+ while (src_cnt > 0) {
+ async_flags = flags;
+ pq_src_cnt = min(src_cnt, dma_maxpq(dma, flags));
+ /* if we are submitting additional pqs, leave the chain open,
+ * clear the callback parameters, and leave the destination
+ * buffers mapped
+ */
+ if (src_cnt > pq_src_cnt) {
+ async_flags &= ~ASYNC_TX_ACK;
+ dma_flags |= DMA_COMPL_SKIP_DEST_UNMAP;
+ _cb_fn = NULL;
+ _cb_param = NULL;
+ } else {
+ _cb_fn = cb_fn;
+ _cb_param = cb_param;
+ }
+ if (_cb_fn)
+ dma_flags |= DMA_PREP_INTERRUPT;
+ if (scfs)
+ scf = &scfs[src_off];
+
+ /* Since we have clobbered the src_list we are committed
+ * to doing this asynchronously. Drivers force forward
+ * progress in case they can not provide a descriptor
+ */
+ tx = dma->device_prep_dma_pq(chan, dma_dest,
+ &dma_src[src_off], pq_src_cnt,
+ scf, len, dma_flags);
+ if (unlikely(!tx))
+ async_tx_quiesce(&depend_tx);
+
+ /* spin wait for the preceeding transactions to complete */
+ while (unlikely(!tx)) {
+ dma_async_issue_pending(chan);
+ tx = dma->device_prep_dma_pq(chan, dma_dest,
+ &dma_src[src_off], pq_src_cnt,
+ scf, len, dma_flags);
+ }
+
+ async_tx_submit(chan, tx, async_flags, depend_tx,
+ _cb_fn, _cb_param);
+
+ depend_tx = tx;
+ flags |= ASYNC_TX_DEP_ACK;
+
+ /* drop completed sources */
+ src_cnt -= pq_src_cnt;
+ src_off += pq_src_cnt;
+
+ dma_flags |= DMA_PREP_CONTINUE;
+ }
+
+ return tx;
+}
+
+/**
+ * do_sync_pq - synchronously calculate P and Q
+ */
+static void
+do_sync_pq(struct page **blocks, unsigned char *scfs, unsigned int offset,
+ int src_cnt, size_t len, enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *depend_tx,
+ dma_async_tx_callback cb_fn, void *cb_param)
+{
+ u8 *p = NULL;
+ u8 *q = NULL;
+ u8 *ptrs[src_cnt];
+ int d, z;
+ u8 wd, wq, wp;
+
+ /* address convert inputs */
+ if (blocks[src_cnt])
+ p = (u8 *)(page_address(blocks[src_cnt]) + offset);
+ if (blocks[src_cnt+1])
+ q = (u8 *)(page_address(blocks[src_cnt+1]) + offset);
+ for (z = 0; z < src_cnt; z++) {
+ if (is_raid6_zero_block(blocks[z]))
+ ptrs[z] = (void *) blocks[z];
+ else
+ ptrs[z] = (u8 *)(page_address(blocks[z]) + offset);
+ }
+
+ for (d = 0; d < len; d++) {
+ wq = wp = ptrs[0][d];
+ for (z = 1; z < src_cnt; z++) {
+ wd = ptrs[z][d];
+ wp ^= wd;
+ wq ^= raid6_gfmul[scfs[z]][wd];
+ }
+ if (p)
+ p[d] = wp;
+ if (q)
+ q[d] = wq;
+ }
+
+ async_tx_sync_epilog(cb_fn, cb_param);
+}
+
+/**
+ * async_pq - attempt to do XOR and Galois calculations in parallel using
+ * a dma engine.
+ * @blocks: source block array from 0 to (src_cnt-1) with the p destination
+ * at blocks[src_cnt] and q at blocks[src_cnt + 1]. Only one of two
+ * destinations may be present (another then has to be set to NULL).
+ * NOTE: client code must assume the contents of this array are destroyed
+ * @offset: offset in pages to start transaction
+ * @src_cnt: number of source pages
+ * @scfs: array of source coefficients used in GF-multiplication
+ * @len: length in bytes
+ * @flags: ASYNC_TX_ACK, ASYNC_TX_DEP_ACK, ASYNC_TX_ASYNC_ONLY
+ * @depend_tx: depends on the result of this transaction.
+ * @cb_fn: function to call when the operation completes
+ * @cb_param: parameter to pass to the callback routine
+ */
+struct dma_async_tx_descriptor *
+async_pq(struct page **blocks, unsigned int offset, int src_cnt,
+ unsigned char *scfs, size_t len, enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *depend_tx,
+ dma_async_tx_callback cb_fn, void *cb_param)
+{
+ struct dma_chan *chan = async_tx_find_channel(depend_tx, DMA_PQ,
+ &blocks[src_cnt], 2,
+ blocks, src_cnt, len);
+ struct dma_device *device = chan ? chan->device : NULL;
+ struct dma_async_tx_descriptor *tx = NULL;
+ bool do_async = false;
+
+ if (device && (src_cnt <= dma_maxpq(device, 0) ||
+ dma_maxpq(device, DMA_PREP_CONTINUE) > 0))
+ do_async = true;
+
+ if (!device && (flags & ASYNC_TX_ASYNC_ONLY))
+ return NULL;
+
+ if (do_async) {
+ /* run pq asynchronously */
+ tx = do_async_pq(chan, blocks, scfs, offset, src_cnt, len,
+ flags, depend_tx, cb_fn, cb_param);
+ } else {
+ /* run pq synchronously */
+ if (!blocks[src_cnt+1]) { /* only p requested, just xor */
+ flags |= ASYNC_TX_XOR_ZERO_DST;
+ return async_xor(blocks[src_cnt], blocks, offset,
+ src_cnt, len, flags, depend_tx,
+ cb_fn, cb_param);
+ }
+
+ /* wait for any prerequisite operations */
+ async_tx_quiesce(&depend_tx);
+
+ do_sync_pq(blocks, scfs, offset, src_cnt, len, flags,
+ depend_tx, cb_fn, cb_param);
+ }
+
+ return tx;
+}
+EXPORT_SYMBOL_GPL(async_pq);
+
+/**
+ * do_sync_gen_syndrome - synchronously calculate P (xor) and Q (Reed-Solomon
+ * code)
+ */
+static void
+do_sync_gen_syndrome(struct page **blocks, unsigned int offset, int src_cnt,
+ size_t len, enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *depend_tx,
+ dma_async_tx_callback cb_fn, void *cb_param)
+{
+ int i;
+ void *tsrc[src_cnt+2];
+
+ for (i = 0; i < src_cnt + 2; i++) {
+ if (is_raid6_zero_block(blocks[i]))
+ tsrc[i] = (void *) blocks[i];
+ else
+ tsrc[i] = page_address(blocks[i]) + offset;
+ }
+
+ raid6_call.gen_syndrome(i, len, tsrc);
+
+ async_tx_sync_epilog(cb_fn, cb_param);
+}
+
+/**
+ * async_gen_syndrome - attempt to generate P (xor) and Q (Reed-Solomon code)
+ * with a dma engine for a given set of blocks. This routine assumes a
+ * field of GF(2^8) with a primitive polynomial of 0x11d and a generator
+ * of {02}.
+ * @blocks: source block array ordered from 0..src_cnt-1 with the P destination
+ * at blocks[src_cnt] and Q at blocks[src_cnt + 1]. Only one of two
+ * destinations may be present (another then has to be set to NULL). Some
+ * raid6 schemes calculate the syndrome over all disks with P and Q set to
+ * zero. In this case we catch 'zero' blocks with is_raid6_zero_block()
+ * so we can drop them in the async case, or skip the page_address()
+ * conversion in the sync case.
+ * NOTE: client code must assume the contents of this array are destroyed
+ * @offset: offset in pages to start transaction
+ * @src_cnt: number of source pages: 2 < src_cnt <= 255
+ * @len: length of blocks in bytes
+ * @flags: ASYNC_TX_ACK, ASYNC_TX_DEP_ACK, ASYNC_TX_ASYNC_ONLY
+ * @depend_tx: P+Q operation depends on the result of this transaction.
+ * @cb_fn: function to call when P+Q generation completes
+ * @cb_param: parameter to pass to the callback routine
+ */
+struct dma_async_tx_descriptor *
+async_gen_syndrome(struct page **blocks, unsigned int offset, int src_cnt,
+ size_t len, enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *depend_tx,
+ dma_async_tx_callback cb_fn, void *cb_param)
+{
+ struct dma_chan *chan = async_tx_find_channel(depend_tx, DMA_PQ,
+ &blocks[src_cnt], 2,
+ blocks, src_cnt, len);
+ struct dma_device *device = chan ? chan->device : NULL;
+ struct dma_async_tx_descriptor *tx = NULL;
+ bool do_async = false;
+
+ BUG_ON(src_cnt > 255 || (!blocks[src_cnt] && !blocks[src_cnt+1]));
+
+ if (device && (src_cnt <= dma_maxpq(device, 0) ||
+ dma_maxpq(device, DMA_PREP_CONTINUE) > 0))
+ do_async = true;
+
+ if (!do_async && (flags & ASYNC_TX_ASYNC_ONLY))
+ return NULL;
+
+ if (do_async) {
+ /* run the p+q asynchronously */
+ tx = do_async_pq(chan, blocks, (uint8_t *)raid6_gfexp,
+ offset, src_cnt, len, flags, depend_tx,
+ cb_fn, cb_param);
+ } else {
+ /* run the pq synchronously */
+ /* wait for any prerequisite operations */
+ async_tx_quiesce(&depend_tx);
+
+ if (!blocks[src_cnt])
+ blocks[src_cnt] = scribble;
+ if (!blocks[src_cnt+1])
+ blocks[src_cnt+1] = scribble;
+ do_sync_gen_syndrome(blocks, offset, src_cnt, len, flags,
+ depend_tx, cb_fn, cb_param);
+ }
+
+ return tx;
+}
+EXPORT_SYMBOL_GPL(async_gen_syndrome);
+
+static __async_inline enum dma_ctrl_flags
+__pq_zero_sum_map_pages(dma_addr_t *dma, int src_cnt, struct device *dev,
+ struct page **blocks, unsigned int offset, size_t len)
+{
+ enum dma_ctrl_flags flags = 0;
+ int i;
+
+ if (!blocks[src_cnt])
+ flags |= DMA_PREP_PQ_DISABLE_P;
+ if (!blocks[src_cnt+1])
+ flags |= DMA_PREP_PQ_DISABLE_Q;
+ for (i = 0; i < src_cnt + 2; i++)
+ if (likely(blocks[i])) {
+ dma[i] = dma_map_page(dev, blocks[i], offset, len,
+ DMA_TO_DEVICE);
+ BUG_ON(is_raid6_zero_block(blocks[i]));
+ }
+ return flags;
+}
+
+/**
+ * async_pq_zero_sum - attempt a PQ parities check with a dma engine.
+ * @blocks: array of source pages. The 0..src_cnt-1 are the sources, the
+ * src_cnt and src_cnt+1 are the P and Q destinations to check, resp.
+ * Only one of two destinations may be present.
+ * NOTE: client code must assume the contents of this array are destroyed
+ * @offset: offset in pages to start transaction
+ * @src_cnt: number of source pages
+ * @scfs: coefficients to use in GF-multiplications
+ * @len: length in bytes
+ * @pqres: SUM_CHECK_P_RESULT and/or SUM_CHECK_Q_RESULT are set on zero sum fail
+ * @flags: ASYNC_TX_ACK, ASYNC_TX_DEP_ACK
+ * @depend_tx: depends on the result of this transaction.
+ * @cb_fn: function to call when the xor completes
+ * @cb_param: parameter to pass to the callback routine
+ */
+struct dma_async_tx_descriptor *
+async_pq_zero_sum(struct page **blocks, unsigned int offset, int src_cnt,
+ unsigned char *scfs, size_t len, enum sum_check_flags *pqres,
+ enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *depend_tx,
+ dma_async_tx_callback cb_fn, void *cb_param)
+{
+ struct dma_chan *chan = async_tx_find_channel(depend_tx,
+ DMA_PQ_ZERO_SUM,
+ &blocks[src_cnt], 2,
+ blocks, src_cnt, len);
+ struct dma_device *device = chan ? chan->device : NULL;
+ struct dma_async_tx_descriptor *tx = NULL;
+ enum dma_ctrl_flags dma_flags = cb_fn ? DMA_PREP_INTERRUPT : 0;
+
+ BUG_ON(src_cnt < 2);
+
+ if (device && src_cnt <= dma_maxpq(device, 0) - 2) {
+ dma_addr_t dma_src[src_cnt + 2];
+
+ dma_flags |= __pq_zero_sum_map_pages(dma_src, src_cnt,
+ device->dev, blocks,
+ offset, len);
+ tx = device->device_prep_dma_pqzero_sum(chan, dma_src, src_cnt,
+ scfs, len, pqres,
+ dma_flags);
+
+ if (unlikely(!tx)) {
+ async_tx_quiesce(&depend_tx);
+
+ while (unlikely(!tx)) {
+ dma_async_issue_pending(chan);
+ tx = device->device_prep_dma_pqzero_sum(chan,
+ dma_src, src_cnt, scfs, len,
+ pqres, dma_flags);
+ }
+ }
+
+ async_tx_submit(chan, tx, flags, depend_tx, cb_fn, cb_param);
+ } else {
+ struct page *pdest = blocks[src_cnt];
+ struct page *qdest = blocks[src_cnt + 1];
+ void *p, *q, *s;
+
+ flags &= ~ASYNC_TX_ACK;
+
+ spin_lock(&spare_lock);
+ blocks[src_cnt] = spare_pages[0];
+ blocks[src_cnt + 1] = spare_pages[1];
+ tx = async_pq(blocks, offset, src_cnt, scfs, len, flags,
+ depend_tx, NULL, NULL);
+ async_tx_quiesce(&tx);
+
+ *pqres = 0;
+ if (pdest) {
+ p = page_address(pdest) + offset;
+ s = page_address(spare_pages[0]) + offset;
+ *pqres |= !!memcmp(p, s, len) << SUM_CHECK_P;
+ }
+
+ if (qdest) {
+ q = page_address(qdest) + offset;
+ s = page_address(spare_pages[1]) + offset;
+ *pqres |= !!memcmp(q, s, len) << SUM_CHECK_Q;
+ }
+ spin_unlock(&spare_lock);
+
+ async_tx_sync_epilog(cb_fn, cb_param);
+ }
+
+ return tx;
+}
+EXPORT_SYMBOL_GPL(async_pq_zero_sum);
+
+/**
+ * async_syndrome_zero_sum - attempt a P (xor) and Q (Reed-Solomon code)
+ * parities check with a dma engine. This routine assumes a field of
+ * GF(2^8) with a primitive polynomial of 0x11d and a generator of {02}.
+ * @blocks: array of source pages. The 0..src_cnt-1 are the sources, the
+ * src_cnt and src_cnt+1 are the P and Q destinations to check, resp.
+ * Only one of two destinations may be present.
+ * NOTE: client code must assume the contents of this array are destroyed
+ * @offset: offset in pages to start transaction
+ * @src_cnt: number of source pages
+ * @len: length in bytes
+ * @pqres: SUM_CHECK_P_RESULT and/or SUM_CHECK_Q_RESULT are set on zero sum fail
+ * @flags: ASYNC_TX_ACK, ASYNC_TX_DEP_ACK
+ * @depend_tx: depends on the result of this transaction.
+ * @cb_fn: function to call when the xor completes
+ * @cb_param: parameter to pass to the callback routine
+ */
+struct dma_async_tx_descriptor *
+async_syndrome_zero_sum(struct page **blocks, unsigned int offset, int src_cnt,
+ size_t len, enum sum_check_flags *pqres,
+ enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *depend_tx,
+ dma_async_tx_callback cb_fn, void *cb_param)
+{
+ struct dma_chan *chan = async_tx_find_channel(depend_tx,
+ DMA_PQ_ZERO_SUM,
+ &blocks[src_cnt], 2,
+ blocks, src_cnt, len);
+ struct dma_device *device = chan ? chan->device : NULL;
+ struct dma_async_tx_descriptor *tx = NULL;
+ enum dma_ctrl_flags dma_flags = cb_fn ? DMA_PREP_INTERRUPT : 0;
+
+ BUG_ON(src_cnt < 2);
+
+ if (device && src_cnt <= dma_maxpq(device, 0) - 2) {
+ dma_addr_t dma_src[src_cnt + 2];
+
+ dma_flags |= __pq_zero_sum_map_pages(dma_src, src_cnt,
+ device->dev, blocks,
+ offset, len);
+ tx = device->device_prep_dma_pqzero_sum(chan, dma_src, src_cnt,
+ (uint8_t *)raid6_gfexp,
+ len, pqres, dma_flags);
+
+ if (unlikely(!tx)) {
+ async_tx_quiesce(&depend_tx);
+ while (unlikely(!tx)) {
+ dma_async_issue_pending(chan);
+ tx = device->device_prep_dma_pqzero_sum(chan,
+ dma_src, src_cnt,
+ (uint8_t *)raid6_gfexp, len,
+ pqres, dma_flags);
+ }
+ }
+
+ async_tx_submit(chan, tx, flags, depend_tx, cb_fn, cb_param);
+ } else {
+ struct page *pdest = blocks[src_cnt];
+ struct page *qdest = blocks[src_cnt + 1];
+ enum async_tx_flags lflags = flags;
+ void *p, *q, *s;
+
+ lflags &= ~ASYNC_TX_ACK;
+
+ spin_lock(&spare_lock);
+ blocks[src_cnt] = spare_pages[0];
+ blocks[src_cnt + 1] = spare_pages[1];
+ tx = async_gen_syndrome(blocks, offset,
+ src_cnt, len, lflags,
+ depend_tx, NULL, NULL);
+ async_tx_quiesce(&tx);
+
+ *pqres = 0;
+ if (pdest) {
+ p = page_address(pdest) + offset;
+ s = page_address(spare_pages[0]) + offset;
+ *pqres |= !!memcmp(p, s, len) << SUM_CHECK_P;
+ }
+
+ if (qdest) {
+ q = page_address(qdest) + offset;
+ s = page_address(spare_pages[1]) + offset;
+ *pqres |= !!memcmp(q, s, len) << SUM_CHECK_Q;
+ }
+ spin_unlock(&spare_lock);
+
+ async_tx_sync_epilog(cb_fn, cb_param);
+ }
+
+ return tx;
+}
+EXPORT_SYMBOL_GPL(async_syndrome_zero_sum);
+
+static void safe_put_page(struct page *p)
+{
+ if (p)
+ put_page(p);
+}
+
+static int __init async_pq_init(void)
+{
+ spin_lock_init(&spare_lock);
+
+ spare_pages[0] = alloc_page(GFP_KERNEL);
+ if (!spare_pages[0])
+ goto abort;
+ spare_pages[1] = alloc_page(GFP_KERNEL);
+ if (!spare_pages[1])
+ goto abort;
+ scribble = alloc_page(GFP_KERNEL);
+ if (!scribble)
+ goto abort;
+ return 0;
+abort:
+ safe_put_page(scribble);
+ safe_put_page(spare_pages[1]);
+ safe_put_page(spare_pages[0]);
+ printk(KERN_ERR "%s: cannot allocate spare!\n", __func__);
+ return -ENOMEM;
+}
+
+static void __exit async_pq_exit(void)
+{
+ safe_put_page(scribble);
+ safe_put_page(spare_pages[1]);
+ safe_put_page(spare_pages[0]);
+}
+
+module_init(async_pq_init);
+module_exit(async_pq_exit);
+
+MODULE_AUTHOR("Yuri Tikhonov <[email protected]>, Dan Williams <[email protected]>");
+MODULE_DESCRIPTION("asynchronous pq/pq-zero-sum api");
+MODULE_LICENSE("GPL");
diff --git a/crypto/async_tx/async_xor.c b/crypto/async_tx/async_xor.c
index d1a084e..fcd6be5 100644
--- a/crypto/async_tx/async_xor.c
+++ b/crypto/async_tx/async_xor.c
@@ -65,7 +65,7 @@ do_async_xor(struct dma_chan *chan, struct page *dest, struct page **src_list,
while (src_cnt) {
async_flags = flags;
dma_flags = 0;
- xor_src_cnt = min(src_cnt, dma->max_xor);
+ xor_src_cnt = min(src_cnt, (int)dma->max_xor);
/* if we are submitting additional xors, leave the chain open,
* clear the callback parameters, and leave the destination
* buffer mapped
diff --git a/drivers/dma/iop-adma.c b/drivers/dma/iop-adma.c
index 16adbe6..8bccf85 100644
--- a/drivers/dma/iop-adma.c
+++ b/drivers/dma/iop-adma.c
@@ -1258,7 +1258,7 @@ static int __devinit iop_adma_probe(struct platform_device *pdev)

dev_printk(KERN_INFO, &pdev->dev, "Intel(R) IOP: "
"( %s%s%s%s%s%s%s%s%s%s)\n",
- dma_has_cap(DMA_PQ_XOR, dma_dev->cap_mask) ? "pq_xor " : "",
+ dma_has_cap(DMA_PQ, dma_dev->cap_mask) ? "pq " : "",
dma_has_cap(DMA_PQ_UPDATE, dma_dev->cap_mask) ? "pq_update " : "",
dma_has_cap(DMA_PQ_ZERO_SUM, dma_dev->cap_mask) ? "pq_zero_sum " : "",
dma_has_cap(DMA_XOR, dma_dev->cap_mask) ? "xor " : "",
diff --git a/include/linux/async_tx.h b/include/linux/async_tx.h
index 6370f32..1f10141 100644
--- a/include/linux/async_tx.h
+++ b/include/linux/async_tx.h
@@ -59,12 +59,15 @@ struct dma_chan_ref {
* @ASYNC_TX_ACK: immediately ack the descriptor, precludes setting up a
* dependency chain
* @ASYNC_TX_DEP_ACK: ack the dependency descriptor. Useful for chaining.
+ * @ASYNC_TX_ASYNC_ONLY: if set then try to perform operation requested only in
+ * the asynchronous mode. Useful for R6 recovery.
*/
enum async_tx_flags {
ASYNC_TX_XOR_ZERO_DST = (1 << 0),
ASYNC_TX_XOR_DROP_DST = (1 << 1),
- ASYNC_TX_ACK = (1 << 3),
- ASYNC_TX_DEP_ACK = (1 << 4),
+ ASYNC_TX_ACK = (1 << 2),
+ ASYNC_TX_DEP_ACK = (1 << 3),
+ ASYNC_TX_ASYNC_ONLY = (1 << 4),
};

#ifdef CONFIG_DMA_ENGINE
@@ -140,5 +143,31 @@ async_trigger_callback(enum async_tx_flags flags,
struct dma_async_tx_descriptor *depend_tx,
dma_async_tx_callback cb_fn, void *cb_fn_param);

+struct dma_async_tx_descriptor *
+async_pq(struct page **blocks, unsigned int offset, int src_cnt,
+ unsigned char *scfs, size_t len, enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *depend_tx,
+ dma_async_tx_callback cb_fn, void *cb_param);
+
+struct dma_async_tx_descriptor *
+async_gen_syndrome(struct page **blocks, unsigned int offset, int src_cnt,
+ size_t len, enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *depend_tx,
+ dma_async_tx_callback cb_fn, void *cb_param);
+
+struct dma_async_tx_descriptor *
+async_pq_zero_sum(struct page **blocks, unsigned int offset, int src_cnt,
+ unsigned char *scfs, size_t len, enum sum_check_flags *pqres,
+ enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *depend_tx,
+ dma_async_tx_callback cb_fn, void *cb_param);
+
+struct dma_async_tx_descriptor *
+async_syndrome_zero_sum(struct page **blocks, unsigned int offset, int src_cnt,
+ size_t len, enum sum_check_flags *pqres,
+ enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *depend_tx,
+ dma_async_tx_callback cb_fn, void *cb_param);
+
void async_tx_quiesce(struct dma_async_tx_descriptor **tx);
#endif /* _ASYNC_TX_H_ */
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index cd17392..a7fa966 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -55,7 +55,7 @@ enum dma_status {
enum dma_transaction_type {
DMA_MEMCPY,
DMA_XOR,
- DMA_PQ_XOR,
+ DMA_PQ,
DMA_DUAL_XOR,
DMA_PQ_UPDATE,
DMA_ZERO_SUM,
@@ -73,20 +73,28 @@ enum dma_transaction_type {

/**
* enum dma_ctrl_flags - DMA flags to augment operation preparation,
- * control completion, and communicate status.
+ * control completion, and communicate status.
* @DMA_PREP_INTERRUPT - trigger an interrupt (callback) upon completion of
- * this transaction
+ * this transaction
* @DMA_CTRL_ACK - the descriptor cannot be reused until the client
- * acknowledges receipt, i.e. has has a chance to establish any
- * dependency chains
+ * acknowledges receipt, i.e. has has a chance to establish any dependency
+ * chains
* @DMA_COMPL_SKIP_SRC_UNMAP - set to disable dma-unmapping the source buffer(s)
* @DMA_COMPL_SKIP_DEST_UNMAP - set to disable dma-unmapping the destination(s)
+ * @DMA_PREP_PQ_DISABLE_P - prevent generation of P while generating Q
+ * @DMA_PREP_PQ_DISABLE_Q - prevent generation of Q while generating P
+ * @DMA_PREP_CONTINUE - indicate to a driver that it is reusing buffers as
+ * sources that were the result of a previous operation, in the case of a PQ
+ * operation it continues the calculation with new sources
*/
enum dma_ctrl_flags {
DMA_PREP_INTERRUPT = (1 << 0),
DMA_CTRL_ACK = (1 << 1),
DMA_COMPL_SKIP_SRC_UNMAP = (1 << 2),
DMA_COMPL_SKIP_DEST_UNMAP = (1 << 3),
+ DMA_PREP_PQ_DISABLE_P = (1 << 4),
+ DMA_PREP_PQ_DISABLE_Q = (1 << 5),
+ DMA_PREP_CONTINUE = (1 << 6),
};

/**
@@ -228,6 +236,7 @@ struct dma_async_tx_descriptor {
* @global_node: list_head for global dma_device_list
* @cap_mask: one or more dma_capability flags
* @max_xor: maximum number of xor sources, 0 if no capability
+ * @max_pq: maximum number of PQ sources and PQ-continue capability
* @dev_id: unique device ID
* @dev: struct device reference for dma mapping api
* @device_alloc_chan_resources: allocate resources and return the
@@ -235,7 +244,9 @@ struct dma_async_tx_descriptor {
* @device_free_chan_resources: release DMA channel's resources
* @device_prep_dma_memcpy: prepares a memcpy operation
* @device_prep_dma_xor: prepares a xor operation
+ * @device_prep_dma_pq: prepares a pq operation
* @device_prep_dma_zero_sum: prepares a zero_sum operation
+ * @device_prep_dma_pqzero_sum: prepares a pqzero_sum operation
* @device_prep_dma_memset: prepares a memset operation
* @device_prep_dma_interrupt: prepares an end of chain interrupt operation
* @device_prep_slave_sg: prepares a slave dma operation
@@ -249,7 +260,9 @@ struct dma_device {
struct list_head channels;
struct list_head global_node;
dma_cap_mask_t cap_mask;
- int max_xor;
+ unsigned short max_xor;
+ unsigned short max_pq;
+ #define DMA_HAS_PQ_CONTINUE (1 << 15)

int dev_id;
struct device *dev;
@@ -263,9 +276,17 @@ struct dma_device {
struct dma_async_tx_descriptor *(*device_prep_dma_xor)(
struct dma_chan *chan, dma_addr_t dest, dma_addr_t *src,
unsigned int src_cnt, size_t len, unsigned long flags);
+ struct dma_async_tx_descriptor *(*device_prep_dma_pq)(
+ struct dma_chan *chan, dma_addr_t *dst, dma_addr_t *src,
+ unsigned int src_cnt, unsigned char *scf,
+ size_t len, unsigned long flags);
struct dma_async_tx_descriptor *(*device_prep_dma_zero_sum)(
struct dma_chan *chan, dma_addr_t *src, unsigned int src_cnt,
size_t len, enum sum_check_flags *result, unsigned long flags);
+ struct dma_async_tx_descriptor *(*device_prep_dma_pqzero_sum)(
+ struct dma_chan *chan, dma_addr_t *src, unsigned int src_cnt,
+ unsigned char *scf, size_t len, enum sum_check_flags *pqres,
+ unsigned long flags);
struct dma_async_tx_descriptor *(*device_prep_dma_memset)(
struct dma_chan *chan, dma_addr_t dest, int value, size_t len,
unsigned long flags);
@@ -284,6 +305,31 @@ struct dma_device {
void (*device_issue_pending)(struct dma_chan *chan);
};

+static inline void dma_set_maxpq(struct dma_device *dma, int maxpq, int has_pq_continue)
+{
+ dma->max_pq = maxpq;
+ if (has_pq_continue)
+ dma->max_pq |= DMA_HAS_PQ_CONTINUE;
+}
+
+/* dma_maxpq - reduce maxpq in the face of continued operations
+ * @dma - dma device with PQ capability
+ * @flags - to determine if DMA_PREP_CONTINUE is set
+ *
+ * When an engine does not support native continuation we need 3 extra
+ * source slots to reuse P and Q with the following coefficients:
+ * 1/ {00} * P : remove P from Q', but use it as a source for P'
+ * 2/ {01} * Q : use Q to continue Q' calculation
+ * 3/ {00} * Q : subtract Q from P' to cancel (2)
+ */
+static inline int dma_maxpq(struct dma_device *dma, enum dma_ctrl_flags flags)
+{
+ if ((flags & DMA_PREP_CONTINUE) &&
+ (dma->max_pq & DMA_HAS_PQ_CONTINUE) == 0)
+ return dma->max_pq - 3;
+ return dma->max_pq & ~DMA_HAS_PQ_CONTINUE;
+}
+
/* --- public DMA engine API --- */

#ifdef CONFIG_DMA_ENGINE

2009-03-18 19:22:33

by Dan Williams

Subject: [PATCH 04/13] async_tx: kill needless module_{init|exit}

From: Dan Williams <[email protected]>

If module_init and module_exit are nops then neither needs to be defined.

Signed-off-by: Dan Williams <[email protected]>
---
crypto/async_tx/async_memcpy.c | 13 -------------
crypto/async_tx/async_memset.c | 13 -------------
crypto/async_tx/async_tx.c | 17 +++--------------
crypto/async_tx/async_xor.c | 13 -------------
4 files changed, 3 insertions(+), 53 deletions(-)

diff --git a/crypto/async_tx/async_memcpy.c b/crypto/async_tx/async_memcpy.c
index ddccfb0..1a902b9 100644
--- a/crypto/async_tx/async_memcpy.c
+++ b/crypto/async_tx/async_memcpy.c
@@ -90,19 +90,6 @@ async_memcpy(struct page *dest, struct page *src, unsigned int dest_offset,
}
EXPORT_SYMBOL_GPL(async_memcpy);

-static int __init async_memcpy_init(void)
-{
- return 0;
-}
-
-static void __exit async_memcpy_exit(void)
-{
- do { } while (0);
-}
-
-module_init(async_memcpy_init);
-module_exit(async_memcpy_exit);
-
MODULE_AUTHOR("Intel Corporation");
MODULE_DESCRIPTION("asynchronous memcpy api");
MODULE_LICENSE("GPL");
diff --git a/crypto/async_tx/async_memset.c b/crypto/async_tx/async_memset.c
index 5b5eb99..631c3d0 100644
--- a/crypto/async_tx/async_memset.c
+++ b/crypto/async_tx/async_memset.c
@@ -83,19 +83,6 @@ async_memset(struct page *dest, int val, unsigned int offset,
}
EXPORT_SYMBOL_GPL(async_memset);

-static int __init async_memset_init(void)
-{
- return 0;
-}
-
-static void __exit async_memset_exit(void)
-{
- do { } while (0);
-}
-
-module_init(async_memset_init);
-module_exit(async_memset_exit);
-
MODULE_AUTHOR("Intel Corporation");
MODULE_DESCRIPTION("asynchronous memset api");
MODULE_LICENSE("GPL");
diff --git a/crypto/async_tx/async_tx.c b/crypto/async_tx/async_tx.c
index f21147f..c3b07cf 100644
--- a/crypto/async_tx/async_tx.c
+++ b/crypto/async_tx/async_tx.c
@@ -42,6 +42,9 @@ static void __exit async_tx_exit(void)
dmaengine_put();
}

+module_init(async_tx_init);
+module_exit(async_tx_exit);
+
/**
* __async_tx_find_channel - find a channel to carry out the operation or let
* the transaction execute synchronously
@@ -59,17 +62,6 @@ __async_tx_find_channel(struct dma_async_tx_descriptor *depend_tx,
return dma_find_channel(tx_type);
}
EXPORT_SYMBOL_GPL(__async_tx_find_channel);
-#else
-static int __init async_tx_init(void)
-{
- printk(KERN_INFO "async_tx: api initialized (sync-only)\n");
- return 0;
-}
-
-static void __exit async_tx_exit(void)
-{
- do { } while (0);
-}
#endif


@@ -295,9 +287,6 @@ void async_tx_quiesce(struct dma_async_tx_descriptor **tx)
}
EXPORT_SYMBOL_GPL(async_tx_quiesce);

-module_init(async_tx_init);
-module_exit(async_tx_exit);
-
MODULE_AUTHOR("Intel Corporation");
MODULE_DESCRIPTION("Asynchronous Bulk Memory Transactions API");
MODULE_LICENSE("GPL");
diff --git a/crypto/async_tx/async_xor.c b/crypto/async_tx/async_xor.c
index 11978a8..0a56fae 100644
--- a/crypto/async_tx/async_xor.c
+++ b/crypto/async_tx/async_xor.c
@@ -298,19 +298,6 @@ async_xor_zero_sum(struct page *dest, struct page **src_list,
}
EXPORT_SYMBOL_GPL(async_xor_zero_sum);

-static int __init async_xor_init(void)
-{
- return 0;
-}
-
-static void __exit async_xor_exit(void)
-{
- do { } while (0);
-}
-
-module_init(async_xor_init);
-module_exit(async_xor_exit);
-
MODULE_AUTHOR("Intel Corporation");
MODULE_DESCRIPTION("asynchronous xor/xor-zero-sum api");
MODULE_LICENSE("GPL");

2009-03-18 19:23:14

by Dan Williams

[permalink] [raw]
Subject: [PATCH 07/13] async_tx: add support for asynchronous RAID6 recovery operations

[ Based on an original patch by Yuri Tikhonov ]

This patch extends the async_tx API with two routines for RAID6 recovery:

async_r6_dd_recov() recovers after double data disk failure

async_r6_dp_recov() recovers after D+P failure

These routines make use of async_pq(), which is fast in the asynchronous
case but much slower than raid6_2data_recov() and raid6_datap_recov()
when it runs synchronously. The ASYNC_TX_ASYNC_ONLY flag is used to test
early for the presence of a raid6 offload engine before committing to
the asynchronous path.
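
As a rough illustration (not code from this series), a caller that wants to
keep the fallback decision local can probe with ASYNC_TX_ASYNC_ONLY and run
the optimized synchronous routine itself. The helper name and the stack
conversion array below are assumptions for the sketch:

/* Sketch only: probe for a raid6 offload engine; ASYNC_TX_ASYNC_ONLY makes
 * async_r6_dd_recov() return NULL instead of falling back internally.
 */
static void recover_two_data_disks(int disks, size_t bytes, int faila,
				   int failb, struct page **ptrs)
{
	struct dma_async_tx_descriptor *tx;

	tx = async_r6_dd_recov(disks, bytes, faila, failb, ptrs,
			       ASYNC_TX_ASYNC_ONLY, NULL, NULL, NULL);
	if (!tx) {
		/* no offload engine: use the optimized synchronous routine */
		void *srcs[disks];
		int i;

		for (i = 0; i < disks; i++)
			srcs[i] = page_address(ptrs[i]);
		raid6_2data_recov(disks, bytes, faila, failb, srcs);
	}
	/* tx != NULL: recovery proceeds asynchronously; chain or ack it
	 * according to the usual async_tx rules
	 */
}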

Signed-off-by: Yuri Tikhonov <[email protected]>
Signed-off-by: Ilya Yanok <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
crypto/async_tx/Kconfig | 5 +
crypto/async_tx/Makefile | 1
crypto/async_tx/async_r6recov.c | 272 +++++++++++++++++++++++++++++++++++++++
include/linux/async_tx.h | 12 ++
4 files changed, 290 insertions(+), 0 deletions(-)
create mode 100644 crypto/async_tx/async_r6recov.c

diff --git a/crypto/async_tx/Kconfig b/crypto/async_tx/Kconfig
index cb6d731..0b56224 100644
--- a/crypto/async_tx/Kconfig
+++ b/crypto/async_tx/Kconfig
@@ -18,3 +18,8 @@ config ASYNC_PQ
tristate
select ASYNC_CORE

+config ASYNC_R6RECOV
+ tristate
+ select ASYNC_CORE
+ select ASYNC_PQ
+
diff --git a/crypto/async_tx/Makefile b/crypto/async_tx/Makefile
index 1b99265..0ed8f13 100644
--- a/crypto/async_tx/Makefile
+++ b/crypto/async_tx/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_ASYNC_MEMCPY) += async_memcpy.o
obj-$(CONFIG_ASYNC_MEMSET) += async_memset.o
obj-$(CONFIG_ASYNC_XOR) += async_xor.o
obj-$(CONFIG_ASYNC_PQ) += async_pq.o
+obj-$(CONFIG_ASYNC_R6RECOV) += async_r6recov.o
diff --git a/crypto/async_tx/async_r6recov.c b/crypto/async_tx/async_r6recov.c
new file mode 100644
index 0000000..90cdec6
--- /dev/null
+++ b/crypto/async_tx/async_r6recov.c
@@ -0,0 +1,272 @@
+/*
+ * Copyright(c) 2007 Yuri Tikhonov <[email protected]>
+ * Copyright(c) 2009 Intel Corporation
+ *
+ * Developed for DENX Software Engineering GmbH
+ *
+ * Asynchronous RAID-6 recovery calculations ASYNC_TX API.
+ *
+ * based on async_xor.c code written by:
+ * Dan Williams <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 51
+ * Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/interrupt.h>
+#include <linux/dma-mapping.h>
+#include <linux/raid/pq.h>
+#include <linux/async_tx.h>
+
+/**
+ * async_r6_dd_recov - attempt to calculate two data misses using dma engines.
+ * @disks: number of disks in the RAID-6 array
+ * @bytes: size of strip
+ * @faila: first failed drive index
+ * @failb: second failed drive index
+ * @ptrs: array of pointers to strips (last two must be p and q, respectively)
+ * @flags: ASYNC_TX_ACK, ASYNC_TX_DEP_ACK
+ * @depend_tx: depends on the result of this transaction.
+ * @cb: function to call when the operation completes
+ * @cb_param: parameter to pass to the callback routine
+ */
+struct dma_async_tx_descriptor *
+async_r6_dd_recov(int disks, size_t bytes, int faila, int failb,
+ struct page **ptrs, enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *depend_tx,
+ dma_async_tx_callback cb, void *cb_param)
+{
+ struct dma_async_tx_descriptor *tx = NULL;
+ struct page *lptrs[disks];
+ unsigned char lcoef[disks-4];
+ int i = 0, k = 0;
+ uint8_t bc[2];
+ dma_async_tx_callback lcb = NULL;
+ void *lcb_param = NULL;
+
+ /* Assume that failb > faila */
+ if (faila > failb)
+ swap(faila, failb);
+
+ /* Try to compute missed data asynchronously. */
+ if (disks == 4) {
+ /*
+ * Pxy and Qxy are zero in this case so we already have
+ * P+Pxy and Q+Qxy in P and Q strips respectively.
+ */
+ tx = depend_tx;
+ lcb = cb;
+ lcb_param = cb_param;
+ goto do_mult;
+ }
+
+ /*
+ * (1) Calculate Qxy and Pxy:
+ * Qxy = A(0)*D(0) + ... + A(n-1)*D(n-1) + A(n+1)*D(n+1) + ... +
+ * A(m-1)*D(m-1) + A(m+1)*D(m+1) + ... + A(disks-1)*D(disks-1),
+ * where n = faila, m = failb.
+ */
+ for (i = 0, k = 0; i < disks - 2; i++) {
+ if (i != faila && i != failb) {
+ lptrs[k] = ptrs[i];
+ lcoef[k] = raid6_gfexp[i];
+ k++;
+ }
+ }
+
+ lptrs[k] = ptrs[faila];
+ lptrs[k+1] = ptrs[failb];
+ tx = async_pq(lptrs, 0, k, lcoef, bytes,
+ ASYNC_TX_ASYNC_ONLY|(flags & ASYNC_TX_DEP_ACK),
+ depend_tx, NULL, NULL);
+ if (!tx) {
+ /* jump to optimized synchronous path */
+ if (flags & ASYNC_TX_ASYNC_ONLY)
+ return NULL;
+ goto ddr_sync;
+ }
+
+ /*
+	 * The following operations will 'damage' the P/Q strips, so from
+	 * here on we are committed to the asynchronous path.
+ */
+
+ /* (2) Calculate Q+Qxy */
+ lptrs[0] = ptrs[disks-1];
+ lptrs[1] = ptrs[failb];
+ tx = async_xor(lptrs[0], lptrs, 0, 2, bytes,
+ ASYNC_TX_XOR_DROP_DST|ASYNC_TX_DEP_ACK, tx, NULL, NULL);
+
+ /* (3) Calculate P+Pxy */
+ lptrs[0] = ptrs[disks-2];
+ lptrs[1] = ptrs[faila];
+ tx = async_xor(lptrs[0], lptrs, 0, 2, bytes,
+ ASYNC_TX_XOR_DROP_DST|ASYNC_TX_DEP_ACK, tx, NULL, NULL);
+
+do_mult:
+ /*
+ * (4) Compute (P+Pxy) * Bxy. Compute (Q+Qxy) * Cxy. XOR them and get
+ * faila.
+ * B = (2^(y-x))*((2^(y-x) + {01})^(-1))
+ * C = (2^(-x))*((2^(y-x) + {01})^(-1))
+ * B * [p] + C * [q] -> [failb]
+ */
+ bc[0] = raid6_gfexi[failb-faila];
+ bc[1] = raid6_gfinv[raid6_gfexp[faila]^raid6_gfexp[failb]];
+
+ lptrs[0] = ptrs[disks - 2];
+ lptrs[1] = ptrs[disks - 1];
+ lptrs[2] = NULL;
+ lptrs[3] = ptrs[failb];
+ tx = async_pq(lptrs, 0, 2, bc, bytes, ASYNC_TX_DEP_ACK, tx, NULL, NULL);
+
+ /* (5) Compute failed Dy using recovered [failb] and P+Pnm in [p] */
+ lptrs[0] = ptrs[disks-2];
+ lptrs[1] = ptrs[failb];
+ lptrs[2] = ptrs[faila];
+ tx = async_xor(lptrs[2], lptrs, 0, 2, bytes,
+ ASYNC_TX_XOR_ZERO_DST|ASYNC_TX_DEP_ACK, tx,
+ lcb, lcb_param);
+
+ if (disks == 4) {
+ if (flags & ASYNC_TX_ACK)
+ async_tx_ack(tx);
+ return tx;
+ }
+
+ /* (6) Restore the parities back */
+ memcpy(lptrs, ptrs, (disks - 2) * sizeof(struct page *));
+ lptrs[disks - 2] = ptrs[disks-2];
+ lptrs[disks - 1] = ptrs[disks-1];
+ return async_gen_syndrome(lptrs, 0, disks - 2, bytes,
+ ASYNC_TX_DEP_ACK|(flags & ASYNC_TX_ACK),
+ tx, cb, cb_param);
+
+ddr_sync:
+ {
+ void **sptrs = (void **)lptrs;
+ /*
+ * Failed to compute asynchronously, do it in
+ * synchronous manner
+ */
+
+ /* wait for any prerequisite operations */
+ async_tx_quiesce(&depend_tx);
+ if (flags & ASYNC_TX_DEP_ACK)
+ async_tx_ack(depend_tx);
+
+ i = disks;
+ while (i--)
+ sptrs[i] = page_address(ptrs[i]);
+ raid6_2data_recov(disks, bytes, faila, failb, sptrs);
+
+ async_tx_sync_epilog(cb, cb_param);
+ }
+
+ return tx;
+}
+EXPORT_SYMBOL_GPL(async_r6_dd_recov);
+
+/**
+ * async_r6_dp_recov - attempt to calculate one data miss using dma engines.
+ * @disks: number of disks in the RAID-6 array
+ * @bytes: size of strip
+ * @faila: failed drive index
+ * @ptrs: array of pointers to strips (last two must be p and q, respectively)
+ * @flags: ASYNC_TX_ACK, ASYNC_TX_DEP_ACK
+ * @depend_tx: depends on the result of this transaction.
+ * @cb: function to call when the operation completes
+ * @cb_param: parameter to pass to the callback routine
+ */
+struct dma_async_tx_descriptor *
+async_r6_dp_recov(int disks, size_t bytes, int faila, struct page **ptrs,
+ enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *depend_tx,
+ dma_async_tx_callback cb, void *cb_param)
+{
+ struct dma_async_tx_descriptor *tx = NULL;
+ struct page *lptrs[disks];
+ unsigned char lcoef[disks-2];
+ int i = 0, k = 0;
+
+ /* Try compute missed data asynchronously. */
+
+ /*
+ * (1) Calculate Qn + Q:
+ * Qn = A(0)*D(0) + .. + A(n-1)*D(n-1) + A(n+1)*D(n+1) + ..,
+ * where n = faila;
+ * then subtract Qn from Q and place result to Pn.
+ */
+ for (i = 0; i < disks - 2; i++) {
+ if (i != faila) {
+ lptrs[k] = ptrs[i];
+ lcoef[k++] = raid6_gfexp[i];
+ }
+ }
+ lptrs[k] = ptrs[disks-1]; /* Q-parity */
+ lcoef[k++] = 1;
+
+ lptrs[k] = NULL;
+ lptrs[k+1] = ptrs[disks-2];
+
+ tx = async_pq(lptrs, 0, k, lcoef, bytes,
+ ASYNC_TX_ASYNC_ONLY|(flags & ASYNC_TX_DEP_ACK),
+ depend_tx, NULL, NULL);
+ if (!tx) {
+ /* jump to optimized synchronous path */
+ if (flags & ASYNC_TX_ASYNC_ONLY)
+ return NULL;
+ goto dpr_sync;
+ }
+
+ /*
+ * (2) Compute missed Dn:
+ * Dn = (Q + Qn) * [A(n)^(-1)]
+ */
+ lptrs[0] = ptrs[disks-2];
+ lptrs[1] = NULL;
+ lptrs[2] = ptrs[faila];
+ return async_pq(lptrs, 0, 1, (u8 *)&raid6_gfexp[faila ? 255-faila : 0],
+ bytes, ASYNC_TX_DEP_ACK|(flags & ASYNC_TX_ACK),
+ tx, cb, cb_param);
+
+dpr_sync:
+ {
+ void **sptrs = (void **) lptrs;
+ /*
+ * Failed to compute asynchronously, do it in
+ * synchronous manner
+ */
+
+ /* wait for any prerequisite operations */
+ async_tx_quiesce(&depend_tx);
+ if (flags & ASYNC_TX_DEP_ACK)
+ async_tx_ack(depend_tx);
+
+ i = disks;
+ while (i--)
+ sptrs[i] = page_address(ptrs[i]);
+ raid6_datap_recov(disks, bytes, faila, (void *)sptrs);
+
+ async_tx_sync_epilog(cb, cb_param);
+ }
+
+ return tx;
+}
+EXPORT_SYMBOL_GPL(async_r6_dp_recov);
+
+MODULE_AUTHOR("Yuri Tikhonov <[email protected]>, Dan Williams <[email protected]>");
+MODULE_DESCRIPTION("asynchronous RAID-6 recovery api");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/async_tx.h b/include/linux/async_tx.h
index 1f10141..3febff9 100644
--- a/include/linux/async_tx.h
+++ b/include/linux/async_tx.h
@@ -169,5 +169,17 @@ async_syndrome_zero_sum(struct page **blocks, unsigned int offset, int src_cnt,
struct dma_async_tx_descriptor *depend_tx,
dma_async_tx_callback cb_fn, void *cb_param);

+struct dma_async_tx_descriptor *
+async_r6_dd_recov(int src_num, size_t bytes, int faila, int failb,
+ struct page **ptrs, enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *depend_tx,
+ dma_async_tx_callback callback, void *callback_param);
+
+struct dma_async_tx_descriptor *
+async_r6_dp_recov(int src_num, size_t bytes, int faila, struct page **ptrs,
+ enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *depend_tx,
+ dma_async_tx_callback callback, void *callback_param);
+
void async_tx_quiesce(struct dma_async_tx_descriptor **tx);
#endif /* _ASYNC_TX_H_ */

2009-03-18 19:24:07

by Dan Williams

[permalink] [raw]
Subject: [PATCH 09/13] iop-adma: P+Q self test

Even though the intent is to extend dmatest with P+Q tests, there is
still value in having an always-on sanity check to prevent an
unintentionally broken driver from registering.

This depends on raid6_pq.ko for verification; the side effect is that
PQ-capable channels will fail to register when raid6 is disabled.

Signed-off-by: Dan Williams <[email protected]>
---
drivers/dma/iop-adma.c | 182 ++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 181 insertions(+), 1 deletions(-)

diff --git a/drivers/dma/iop-adma.c b/drivers/dma/iop-adma.c
index 0cdb333..2b014b6 100644
--- a/drivers/dma/iop-adma.c
+++ b/drivers/dma/iop-adma.c
@@ -31,6 +31,7 @@
#include <linux/platform_device.h>
#include <linux/memory.h>
#include <linux/ioport.h>
+#include <linux/raid/raid6.h>

#include <mach/adma.h>

@@ -1200,6 +1201,177 @@ out:
return err;
}

+static int __devinit
+iop_adma_pq_zero_sum_self_test(struct iop_adma_device *device)
+{
+ #ifdef CONFIG_MD_RAID6_PQ
+ /* combined sources, software pq results, and extra hw pq results */
+ struct page *pq[IOP_ADMA_NUM_SRC_TEST+2+2];
+ /* ptr to the extra hw pq buffers defined above */
+ struct page **pq_hw = &pq[IOP_ADMA_NUM_SRC_TEST+2];
+ /* address conversion buffers (dma_map / page_address) */
+ void *pq_sw[IOP_ADMA_NUM_SRC_TEST+2];
+ dma_addr_t pq_src[IOP_ADMA_NUM_SRC_TEST];
+ dma_addr_t pq_dest[2];
+
+ int i;
+ struct dma_async_tx_descriptor *tx;
+ struct dma_chan *dma_chan;
+ dma_cookie_t cookie;
+ u32 zero_sum_result;
+ int err = 0;
+ struct device *dev;
+
+ dev_dbg(device->common.dev, "%s\n", __func__);
+
+ for (i = 0; i < ARRAY_SIZE(pq); i++) {
+ pq[i] = alloc_page(GFP_KERNEL);
+ if (!pq[i]) {
+ while (i--)
+ __free_page(pq[i]);
+ return -ENOMEM;
+ }
+ }
+
+ /* Fill in src buffers */
+ for (i = 0; i < IOP_ADMA_NUM_SRC_TEST; i++) {
+ pq_sw[i] = page_address(pq[i]);
+ memset(pq_sw[i], 0x11111111 * (1<<i), PAGE_SIZE);
+ }
+ pq_sw[i] = page_address(pq[i]);
+ pq_sw[i+1] = page_address(pq[i+1]);
+
+ dma_chan = container_of(device->common.channels.next,
+ struct dma_chan,
+ device_node);
+ if (iop_adma_alloc_chan_resources(dma_chan) < 1) {
+ err = -ENODEV;
+ goto out;
+ }
+
+ dev = dma_chan->device->dev;
+
+ /* initialize the dests */
+ memset(page_address(pq_hw[0]), 0 , PAGE_SIZE);
+ memset(page_address(pq_hw[1]), 0 , PAGE_SIZE);
+
+ /* test pq */
+ pq_dest[0] = dma_map_page(dev, pq_hw[0], 0, PAGE_SIZE, DMA_FROM_DEVICE);
+ pq_dest[1] = dma_map_page(dev, pq_hw[1], 0, PAGE_SIZE, DMA_FROM_DEVICE);
+ for (i = 0; i < IOP_ADMA_NUM_SRC_TEST; i++)
+ pq_src[i] = dma_map_page(dev, pq[i], 0, PAGE_SIZE,
+ DMA_TO_DEVICE);
+
+ tx = iop_adma_prep_dma_pq(dma_chan, pq_dest, pq_src,
+ IOP_ADMA_NUM_SRC_TEST, (u8 *)raid6_gfexp,
+ PAGE_SIZE,
+ DMA_PREP_INTERRUPT |
+ DMA_CTRL_ACK);
+
+ cookie = iop_adma_tx_submit(tx);
+ iop_adma_issue_pending(dma_chan);
+ msleep(8);
+
+ if (iop_adma_is_complete(dma_chan, cookie, NULL, NULL) !=
+ DMA_SUCCESS) {
+ dev_err(dev, "Self-test pq timed out, disabling\n");
+ err = -ENODEV;
+ goto free_resources;
+ }
+
+ raid6_call.gen_syndrome(IOP_ADMA_NUM_SRC_TEST+2, PAGE_SIZE, pq_sw);
+
+ if (memcmp(pq_sw[IOP_ADMA_NUM_SRC_TEST],
+ page_address(pq_hw[0]), PAGE_SIZE) != 0) {
+ dev_err(dev, "Self-test p failed compare, disabling\n");
+ err = -ENODEV;
+ goto free_resources;
+ }
+ if (memcmp(pq_sw[IOP_ADMA_NUM_SRC_TEST+1],
+ page_address(pq_hw[1]), PAGE_SIZE) != 0) {
+ dev_err(dev, "Self-test q failed compare, disabling\n");
+ err = -ENODEV;
+ goto free_resources;
+ }
+
+ /* test correct zero sum using the software generated pq values */
+ for (i = 0; i < IOP_ADMA_NUM_SRC_TEST + 2; i++)
+ pq_src[i] = dma_map_page(dev, pq[i], 0, PAGE_SIZE,
+ DMA_TO_DEVICE);
+
+ zero_sum_result = ~0;
+ tx = iop_adma_prep_dma_pq_zero_sum(dma_chan, pq_src,
+ IOP_ADMA_NUM_SRC_TEST + 2,
+ (u8 *) raid6_gfexp, PAGE_SIZE,
+ &zero_sum_result,
+ DMA_PREP_INTERRUPT |
+ DMA_CTRL_ACK);
+
+ cookie = iop_adma_tx_submit(tx);
+ iop_adma_issue_pending(dma_chan);
+ msleep(8);
+
+ if (iop_adma_is_complete(dma_chan, cookie, NULL, NULL) !=
+ DMA_SUCCESS) {
+ dev_err(dev, "Self-test pq-zero-sum timed out, disabling\n");
+ err = -ENODEV;
+ goto free_resources;
+ }
+
+ if (zero_sum_result != 0) {
+ dev_err(dev, "Self-test pq-zero-sum failed to validate: %x\n",
+ zero_sum_result);
+ err = -ENODEV;
+ goto free_resources;
+ }
+
+ /* test incorrect zero sum */
+ i = IOP_ADMA_NUM_SRC_TEST;
+ memset(pq_sw[i] + 100, 0, 100);
+ memset(pq_sw[i+1] + 200, 0, 200);
+ for (i = 0; i < IOP_ADMA_NUM_SRC_TEST + 2; i++)
+ pq_src[i] = dma_map_page(dev, pq[i], 0, PAGE_SIZE,
+ DMA_TO_DEVICE);
+
+ zero_sum_result = 0;
+ tx = iop_adma_prep_dma_pq_zero_sum(dma_chan, pq_src,
+ IOP_ADMA_NUM_SRC_TEST + 2,
+ (u8 *) raid6_gfexp, PAGE_SIZE,
+ &zero_sum_result,
+ DMA_PREP_INTERRUPT |
+ DMA_CTRL_ACK);
+
+ cookie = iop_adma_tx_submit(tx);
+ iop_adma_issue_pending(dma_chan);
+ msleep(8);
+
+ if (iop_adma_is_complete(dma_chan, cookie, NULL, NULL) !=
+ DMA_SUCCESS) {
+ dev_err(dev, "Self-test !pq-zero-sum timed out, disabling\n");
+ err = -ENODEV;
+ goto free_resources;
+ }
+
+ if (zero_sum_result != (SUM_CHECK_P_RESULT | SUM_CHECK_Q_RESULT)) {
+ dev_err(dev, "Self-test !pq-zero-sum failed to validate: %x\n",
+ zero_sum_result);
+ err = -ENODEV;
+ goto free_resources;
+ }
+
+free_resources:
+ iop_adma_free_chan_resources(dma_chan);
+out:
+ i = ARRAY_SIZE(pq);
+ while (i--)
+ __free_page(pq[i]);
+ return err;
+ #else
+ return -ENODEV; /* no raid6 support */
+ #endif
+}
+
+
static int __devexit iop_adma_remove(struct platform_device *dev)
{
struct iop_adma_device *device = platform_get_drvdata(dev);
@@ -1350,13 +1522,21 @@ static int __devinit iop_adma_probe(struct platform_device *pdev)
}

if (dma_has_cap(DMA_XOR, dma_dev->cap_mask) ||
- dma_has_cap(DMA_MEMSET, dma_dev->cap_mask)) {
+ dma_has_cap(DMA_MEMSET, dma_dev->cap_mask)) {
ret = iop_adma_xor_zero_sum_self_test(adev);
dev_dbg(&pdev->dev, "xor self test returned %d\n", ret);
if (ret)
goto err_free_iop_chan;
}

+ if (dma_has_cap(DMA_PQ, dma_dev->cap_mask) &&
+ dma_has_cap(DMA_PQ_ZERO_SUM, dma_dev->cap_mask)) {
+ ret = iop_adma_pq_zero_sum_self_test(adev);
+ dev_dbg(&pdev->dev, "pq self test returned %d\n", ret);
+ if (ret)
+ goto err_free_iop_chan;
+ }
+
dev_printk(KERN_INFO, &pdev->dev, "Intel(R) IOP: "
"( %s%s%s%s%s%s%s%s%s%s)\n",
dma_has_cap(DMA_PQ, dma_dev->cap_mask) ? "pq " : "",

2009-03-18 19:24:47

by Dan Williams

[permalink] [raw]
Subject: [PATCH 13/13] dmatest: add pq support

Test raid6 p+q operations with a simple "always multiply by 1" q
calculation to fit into dmatest's current destination verification
scheme.
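
The reason the "multiply by 1" trick works, sketched below (illustrative
code, not part of the patch): with every coefficient equal to {01} the GF
multiply is the identity, so Q collapses to the plain XOR of the sources and
matches P, which lets dmatest reuse its existing pattern-based destination
check for both destinations.

/* Sketch: with all coefficients == 1, Q equals P (plain XOR). */
static void pq_identity_example(u8 **src, u8 *p, u8 *q, int src_cnt,
				size_t len)
{
	size_t d;
	int z;

	for (d = 0; d < len; d++) {
		u8 w = 0;

		for (z = 0; z < src_cnt; z++)
			w ^= src[z][d];		/* gfmul(1, x) == x */
		p[d] = w;
		q[d] = w;			/* identical to P */
	}
}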

Signed-off-by: Dan Williams <[email protected]>
---
drivers/dma/dmatest.c | 26 ++++++++++++++++++++++++++
1 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/drivers/dma/dmatest.c b/drivers/dma/dmatest.c
index a27c0fb..a5ee541 100644
--- a/drivers/dma/dmatest.c
+++ b/drivers/dma/dmatest.c
@@ -43,6 +43,11 @@ module_param(xor_sources, uint, S_IRUGO);
MODULE_PARM_DESC(xor_sources,
"Number of xor source buffers (default: 3)");

+static unsigned int pq_sources = 3;
+module_param(pq_sources, uint, S_IRUGO);
+MODULE_PARM_DESC(pq_sources,
+ "Number of p+q source buffers (default: 3)");
+
/*
* Initialization patterns. All bytes in the source buffer has bit 7
* set, all bytes in the destination buffer has bit 7 cleared.
@@ -227,6 +232,7 @@ static int dmatest_func(void *data)
dma_cookie_t cookie;
enum dma_status status;
enum dma_ctrl_flags flags;
+ u8 pq_coefs[pq_sources];
int ret;
int src_cnt;
int dst_cnt;
@@ -243,6 +249,11 @@ static int dmatest_func(void *data)
else if (thread->type == DMA_XOR) {
src_cnt = xor_sources | 1; /* force odd to ensure dst = src */
dst_cnt = 1;
+ } else if (thread->type == DMA_PQ) {
+ src_cnt = pq_sources | 1; /* force odd to ensure dst = src */
+ dst_cnt = 2;
+ for (i = 0; i < pq_sources; i++)
+ pq_coefs[i] = 1;
} else
goto err_srcs;

@@ -310,6 +321,15 @@ static int dmatest_func(void *data)
dma_dsts[0] + dst_off,
dma_srcs, xor_sources,
len, flags);
+ else if (thread->type == DMA_PQ) {
+ dma_addr_t dma_pq[dst_cnt];
+
+ for (i = 0; i < dst_cnt; i++)
+ dma_pq[i] = dma_dsts[i] + dst_off;
+ tx = dev->device_prep_dma_pq(chan, dma_pq, dma_srcs,
+ pq_sources, pq_coefs,
+ len, flags);
+ }

if (!tx) {
for (i = 0; i < src_cnt; i++)
@@ -446,6 +466,8 @@ static int dmatest_add_threads(struct dmatest_chan *dtc, enum dma_transaction_ty
op = "copy";
else if (type == DMA_XOR)
op = "xor";
+ else if (type == DMA_PQ)
+ op = "pq";
else
return -EINVAL;

@@ -501,6 +523,10 @@ static int dmatest_add_channel(struct dma_chan *chan)
cnt = dmatest_add_threads(dtc, DMA_XOR);
thread_count += cnt > 0 ?: 0;
}
+ if (dma_has_cap(DMA_PQ, dma_dev->cap_mask)) {
+ cnt = dmatest_add_threads(dtc, DMA_PQ);
+ thread_count += cnt > 0 ?: 0;
+ }

pr_info("dmatest: Started %u threads using %s\n",
thread_count, dma_chan_name(chan));

2009-03-18 19:25:12

by Dan Williams

[permalink] [raw]
Subject: [PATCH 11/13] dmatest: add xor test

Extend dmatest to launch a thread per supported operation type and add
an xor test.
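
One detail worth calling out (the sketch below is illustrative, not dmatest
code): the xor thread forces an odd source count because the XOR of an odd
number of identically patterned buffers equals the pattern itself, so the
destination can be verified with the same source-pattern check used for the
memcpy test.

/* Sketch: XOR of an odd number of identical bytes is that byte, so the
 * destination still looks like a source buffer after the xor operation.
 */
static u8 xor_of_identical(u8 pattern, int nsrcs)
{
	u8 result = 0;
	int i;

	for (i = 0; i < nsrcs; i++)	/* nsrcs is forced odd in dmatest */
		result ^= pattern;
	return result;			/* == pattern when nsrcs is odd */
}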

Signed-off-by: Dan Williams <[email protected]>
---
drivers/dma/dmatest.c | 274 ++++++++++++++++++++++++++++++++++---------------
1 files changed, 189 insertions(+), 85 deletions(-)

diff --git a/drivers/dma/dmatest.c b/drivers/dma/dmatest.c
index e190d8b..224acf4 100644
--- a/drivers/dma/dmatest.c
+++ b/drivers/dma/dmatest.c
@@ -38,6 +38,11 @@ module_param(max_channels, uint, S_IRUGO);
MODULE_PARM_DESC(max_channels,
"Maximum number of channels to use (default: all)");

+static unsigned int xor_sources = 3;
+module_param(xor_sources, uint, S_IRUGO);
+MODULE_PARM_DESC(xor_sources,
+ "Number of xor source buffers (default: 3)");
+
/*
* Initialization patterns. All bytes in the source buffer has bit 7
* set, all bytes in the destination buffer has bit 7 cleared.
@@ -59,8 +64,9 @@ struct dmatest_thread {
struct list_head node;
struct task_struct *task;
struct dma_chan *chan;
- u8 *srcbuf;
- u8 *dstbuf;
+ u8 **srcs;
+ u8 **dsts;
+ enum dma_transaction_type type;
};

struct dmatest_chan {
@@ -98,30 +104,37 @@ static unsigned long dmatest_random(void)
return buf;
}

-static void dmatest_init_srcbuf(u8 *buf, unsigned int start, unsigned int len)
+static void dmatest_init_srcs(u8 **bufs, unsigned int start, unsigned int len)
{
unsigned int i;
-
- for (i = 0; i < start; i++)
- buf[i] = PATTERN_SRC | (~i & PATTERN_COUNT_MASK);
- for ( ; i < start + len; i++)
- buf[i] = PATTERN_SRC | PATTERN_COPY
- | (~i & PATTERN_COUNT_MASK);;
- for ( ; i < test_buf_size; i++)
- buf[i] = PATTERN_SRC | (~i & PATTERN_COUNT_MASK);
+ u8 *buf;
+
+ for (; (buf = *bufs); bufs++) {
+ for (i = 0; i < start; i++)
+ buf[i] = PATTERN_SRC | (~i & PATTERN_COUNT_MASK);
+ for ( ; i < start + len; i++)
+ buf[i] = PATTERN_SRC | PATTERN_COPY
+ | (~i & PATTERN_COUNT_MASK);;
+ for ( ; i < test_buf_size; i++)
+ buf[i] = PATTERN_SRC | (~i & PATTERN_COUNT_MASK);
+ buf++;
+ }
}

-static void dmatest_init_dstbuf(u8 *buf, unsigned int start, unsigned int len)
+static void dmatest_init_dsts(u8 **bufs, unsigned int start, unsigned int len)
{
unsigned int i;
-
- for (i = 0; i < start; i++)
- buf[i] = PATTERN_DST | (~i & PATTERN_COUNT_MASK);
- for ( ; i < start + len; i++)
- buf[i] = PATTERN_DST | PATTERN_OVERWRITE
- | (~i & PATTERN_COUNT_MASK);
- for ( ; i < test_buf_size; i++)
- buf[i] = PATTERN_DST | (~i & PATTERN_COUNT_MASK);
+ u8 *buf;
+
+ for (; (buf = *bufs); bufs++) {
+ for (i = 0; i < start; i++)
+ buf[i] = PATTERN_DST | (~i & PATTERN_COUNT_MASK);
+ for ( ; i < start + len; i++)
+ buf[i] = PATTERN_DST | PATTERN_OVERWRITE
+ | (~i & PATTERN_COUNT_MASK);
+ for ( ; i < test_buf_size; i++)
+ buf[i] = PATTERN_DST | (~i & PATTERN_COUNT_MASK);
+ }
}

static void dmatest_mismatch(u8 actual, u8 pattern, unsigned int index,
@@ -150,23 +163,30 @@ static void dmatest_mismatch(u8 actual, u8 pattern, unsigned int index,
thread_name, index, expected, actual);
}

-static unsigned int dmatest_verify(u8 *buf, unsigned int start,
+static unsigned int dmatest_verify(u8 **bufs, unsigned int start,
unsigned int end, unsigned int counter, u8 pattern,
bool is_srcbuf)
{
unsigned int i;
unsigned int error_count = 0;
u8 actual;
-
- for (i = start; i < end; i++) {
- actual = buf[i];
- if (actual != (pattern | (~counter & PATTERN_COUNT_MASK))) {
- if (error_count < 32)
- dmatest_mismatch(actual, pattern, i, counter,
- is_srcbuf);
- error_count++;
+ u8 expected;
+ u8 *buf;
+ unsigned int counter_orig = counter;
+
+ for (; (buf = *bufs); bufs++) {
+ counter = counter_orig;
+ for (i = start; i < end; i++) {
+ actual = buf[i];
+ expected = pattern | (~counter & PATTERN_COUNT_MASK);
+ if (actual != expected) {
+ if (error_count < 32)
+ dmatest_mismatch(actual, pattern, i,
+ counter, is_srcbuf);
+ error_count++;
+ }
+ counter++;
}
- counter++;
}

if (error_count > 32)
@@ -178,10 +198,10 @@ static unsigned int dmatest_verify(u8 *buf, unsigned int start,

/*
* This function repeatedly tests DMA transfers of various lengths and
- * offsets until it is told to exit by kthread_stop(). There may be
- * multiple threads running this function in parallel for a single
- * channel, and there may be multiple channels being tested in
- * parallel.
+ * offsets for a given operation type until it is told to exit by
+ * kthread_stop(). There may be multiple threads running this function
+ * in parallel for a single channel, and there may be multiple channels
+ * being tested in parallel.
*
* Before each test, the source and destination buffer is initialized
* with a known pattern. This pattern is different depending on
@@ -201,25 +221,53 @@ static int dmatest_func(void *data)
unsigned int total_tests = 0;
dma_cookie_t cookie;
enum dma_status status;
+ enum dma_ctrl_flags flags;
int ret;
+ int src_cnt;
+ int dst_cnt;
+ int i;

thread_name = current->comm;

ret = -ENOMEM;
- thread->srcbuf = kmalloc(test_buf_size, GFP_KERNEL);
- if (!thread->srcbuf)
- goto err_srcbuf;
- thread->dstbuf = kmalloc(test_buf_size, GFP_KERNEL);
- if (!thread->dstbuf)
- goto err_dstbuf;

smp_rmb();
chan = thread->chan;
+ if (thread->type == DMA_MEMCPY)
+ src_cnt = dst_cnt = 1;
+ else if (thread->type == DMA_XOR) {
+ src_cnt = xor_sources | 1; /* force odd to ensure dst = src */
+ dst_cnt = 1;
+ } else
+ goto err_srcs;
+
+ thread->srcs = kcalloc(src_cnt+1, sizeof(u8 *), GFP_KERNEL);
+ if (!thread->srcs)
+ goto err_srcs;
+ for (i = 0; i < src_cnt; i++) {
+ thread->srcs[i] = kmalloc(test_buf_size, GFP_KERNEL);
+ if (!thread->srcs[i])
+ goto err_srcbuf;
+ }
+ thread->srcs[i] = NULL;
+
+ thread->dsts = kcalloc(dst_cnt+1, sizeof(u8 *), GFP_KERNEL);
+ if (!thread->dsts)
+ goto err_dsts;
+ for (i = 0; i < dst_cnt; i++) {
+ thread->dsts[i] = kmalloc(test_buf_size, GFP_KERNEL);
+ if (!thread->dsts[i])
+ goto err_dstbuf;
+ }
+ thread->dsts[i] = NULL;
+
+ flags = DMA_CTRL_ACK | DMA_COMPL_SKIP_DEST_UNMAP;

while (!kthread_should_stop()) {
struct dma_device *dev = chan->device;
- struct dma_async_tx_descriptor *tx;
- dma_addr_t dma_src, dma_dest;
+ struct dma_async_tx_descriptor *tx = NULL;
+ dma_addr_t dma_srcs[src_cnt];
+ dma_addr_t dma_dsts[dst_cnt];

total_tests++;

@@ -227,22 +275,41 @@ static int dmatest_func(void *data)
src_off = dmatest_random() % (test_buf_size - len + 1);
dst_off = dmatest_random() % (test_buf_size - len + 1);

- dmatest_init_srcbuf(thread->srcbuf, src_off, len);
- dmatest_init_dstbuf(thread->dstbuf, dst_off, len);
+ dmatest_init_srcs(thread->srcs, src_off, len);
+ dmatest_init_dsts(thread->dsts, dst_off, len);
+
+ for (i = 0; i < src_cnt; i++) {
+ u8 *buf = thread->srcs[i] + src_off;

- dma_src = dma_map_single(dev->dev, thread->srcbuf + src_off,
- len, DMA_TO_DEVICE);
+ dma_srcs[i] = dma_map_single(dev->dev, buf, len,
+ DMA_TO_DEVICE);
+ }
/* map with DMA_BIDIRECTIONAL to force writeback/invalidate */
- dma_dest = dma_map_single(dev->dev, thread->dstbuf,
- test_buf_size, DMA_BIDIRECTIONAL);
+ for (i = 0; i < dst_cnt; i++) {
+ dma_dsts[i] = dma_map_single(dev->dev, thread->dsts[i],
+ test_buf_size,
+ DMA_BIDIRECTIONAL);
+ }
+
+ if (thread->type == DMA_MEMCPY)
+ tx = dev->device_prep_dma_memcpy(chan,
+ dma_dsts[0] + dst_off,
+ dma_srcs[0], len,
+ flags);
+ else if (thread->type == DMA_XOR)
+ tx = dev->device_prep_dma_xor(chan,
+ dma_dsts[0] + dst_off,
+ dma_srcs, xor_sources,
+ len, flags);

- tx = dev->device_prep_dma_memcpy(chan, dma_dest + dst_off,
- dma_src, len,
- DMA_CTRL_ACK | DMA_COMPL_SKIP_DEST_UNMAP);
if (!tx) {
- dma_unmap_single(dev->dev, dma_src, len, DMA_TO_DEVICE);
- dma_unmap_single(dev->dev, dma_dest,
- test_buf_size, DMA_BIDIRECTIONAL);
+ for (i = 0; i < src_cnt; i++)
+ dma_unmap_single(dev->dev, dma_srcs[i], len,
+ DMA_TO_DEVICE);
+ for (i = 0; i < dst_cnt; i++)
+ dma_unmap_single(dev->dev, dma_dsts[i],
+ test_buf_size,
+ DMA_BIDIRECTIONAL);
pr_warning("%s: #%u: prep error with src_off=0x%x "
"dst_off=0x%x len=0x%x\n",
thread_name, total_tests - 1,
@@ -263,11 +330,11 @@ static int dmatest_func(void *data)
failed_tests++;
continue;
}
- dma_async_memcpy_issue_pending(chan);
+ dma_async_issue_pending(chan);

do {
msleep(1);
- status = dma_async_memcpy_complete(
+ status = dma_async_is_tx_complete(
chan, cookie, NULL, NULL);
} while (status == DMA_IN_PROGRESS);

@@ -278,29 +345,30 @@ static int dmatest_func(void *data)
continue;
}
/* Unmap by myself (see DMA_COMPL_SKIP_DEST_UNMAP above) */
- dma_unmap_single(dev->dev, dma_dest,
- test_buf_size, DMA_BIDIRECTIONAL);
+ for (i = 0; i < dst_cnt; i++)
+ dma_unmap_single(dev->dev, dma_dsts[i], test_buf_size,
+ DMA_BIDIRECTIONAL);

error_count = 0;

pr_debug("%s: verifying source buffer...\n", thread_name);
- error_count += dmatest_verify(thread->srcbuf, 0, src_off,
+ error_count += dmatest_verify(thread->srcs, 0, src_off,
0, PATTERN_SRC, true);
- error_count += dmatest_verify(thread->srcbuf, src_off,
+ error_count += dmatest_verify(thread->srcs, src_off,
src_off + len, src_off,
PATTERN_SRC | PATTERN_COPY, true);
- error_count += dmatest_verify(thread->srcbuf, src_off + len,
+ error_count += dmatest_verify(thread->srcs, src_off + len,
test_buf_size, src_off + len,
PATTERN_SRC, true);

pr_debug("%s: verifying dest buffer...\n",
thread->task->comm);
- error_count += dmatest_verify(thread->dstbuf, 0, dst_off,
+ error_count += dmatest_verify(thread->dsts, 0, dst_off,
0, PATTERN_DST, false);
- error_count += dmatest_verify(thread->dstbuf, dst_off,
+ error_count += dmatest_verify(thread->dsts, dst_off,
dst_off + len, src_off,
PATTERN_SRC | PATTERN_COPY, false);
- error_count += dmatest_verify(thread->dstbuf, dst_off + len,
+ error_count += dmatest_verify(thread->dsts, dst_off + len,
test_buf_size, dst_off + len,
PATTERN_DST, false);

@@ -319,10 +387,16 @@ static int dmatest_func(void *data)
}

ret = 0;
- kfree(thread->dstbuf);
+ for (i = 0; thread->dsts[i]; i++)
+ kfree(thread->dsts[i]);
err_dstbuf:
- kfree(thread->srcbuf);
+ kfree(thread->dsts);
+err_dsts:
+ for (i = 0; thread->srcs[i]; i++)
+ kfree(thread->srcs[i]);
err_srcbuf:
+ kfree(thread->srcs);
+err_srcs:
pr_notice("%s: terminating after %u tests, %u failures (status %d)\n",
thread_name, total_tests, failed_tests, ret);
return ret;
@@ -344,35 +418,36 @@ static void dmatest_cleanup_channel(struct dmatest_chan *dtc)
kfree(dtc);
}

-static int dmatest_add_channel(struct dma_chan *chan)
+static int dmatest_add_threads(struct dmatest_chan *dtc, enum dma_transaction_type type)
{
- struct dmatest_chan *dtc;
- struct dmatest_thread *thread;
- unsigned int i;
-
- dtc = kmalloc(sizeof(struct dmatest_chan), GFP_KERNEL);
- if (!dtc) {
- pr_warning("dmatest: No memory for %s\n", dma_chan_name(chan));
- return -ENOMEM;
- }
+ struct dmatest_thread *thread;
+ struct dma_chan *chan = dtc->chan;
+ char *op;
+ unsigned int i;

- dtc->chan = chan;
- INIT_LIST_HEAD(&dtc->threads);
+ if (type == DMA_MEMCPY)
+ op = "copy";
+ else if (type == DMA_XOR)
+ op = "xor";
+ else
+ return -EINVAL;

for (i = 0; i < threads_per_chan; i++) {
thread = kzalloc(sizeof(struct dmatest_thread), GFP_KERNEL);
if (!thread) {
- pr_warning("dmatest: No memory for %s-test%u\n",
- dma_chan_name(chan), i);
+ pr_warning("dmatest: No memory for %s-%s%u\n",
+ dma_chan_name(chan), op, i);
+
break;
}
thread->chan = dtc->chan;
+ thread->type = type;
smp_wmb();
- thread->task = kthread_run(dmatest_func, thread, "%s-test%u",
- dma_chan_name(chan), i);
+ thread->task = kthread_run(dmatest_func, thread, "%s-%s%u",
+ dma_chan_name(chan), op, i);
if (IS_ERR(thread->task)) {
- pr_warning("dmatest: Failed to run thread %s-test%u\n",
- dma_chan_name(chan), i);
+ pr_warning("dmatest: Failed to run thread %s-%s%u\n",
+ dma_chan_name(chan), op, i);
kfree(thread);
break;
}
@@ -382,7 +457,36 @@ static int dmatest_add_channel(struct dma_chan *chan)
list_add_tail(&thread->node, &dtc->threads);
}

- pr_info("dmatest: Started %u threads using %s\n", i, dma_chan_name(chan));
+ return i;
+}
+
+static int dmatest_add_channel(struct dma_chan *chan)
+{
+ struct dmatest_chan *dtc;
+ struct dma_device *dma_dev = chan->device;
+ unsigned int thread_count = 0;
+ unsigned int cnt;
+
+ dtc = kmalloc(sizeof(struct dmatest_chan), GFP_KERNEL);
+ if (!dtc) {
+ pr_warning("dmatest: No memory for %s\n", dma_chan_name(chan));
+ return -ENOMEM;
+ }
+
+ dtc->chan = chan;
+ INIT_LIST_HEAD(&dtc->threads);
+
+ if (dma_has_cap(DMA_MEMCPY, dma_dev->cap_mask)) {
+ cnt = dmatest_add_threads(dtc, DMA_MEMCPY);
+ thread_count += cnt > 0 ?: 0;
+ }
+ if (dma_has_cap(DMA_XOR, dma_dev->cap_mask)) {
+ cnt = dmatest_add_threads(dtc, DMA_XOR);
+ thread_count += cnt > 0 ?: 0;
+ }
+
+ pr_info("dmatest: Started %u threads using %s\n",
+ thread_count, dma_chan_name(chan));

list_add_tail(&dtc->node, &dmatest_channels);
nr_channels++;

2009-03-18 19:24:29

by Dan Williams

[permalink] [raw]
Subject: [PATCH 12/13] dmatest: add dma interrupts and callbacks

Use the callback infrastructure to report driver/hardware hangs or
missed interrupts. Since this makes the test threads much more
aggressive (from an explicit 1ms sleep to wait_for_completion), we set the
nice value to 10 so as not to swamp legitimate tasks.
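
A condensed sketch of the wait pattern this patch introduces (tx, chan and
cookie come from the surrounding dmatest_func() context; error handling is
elided):

	struct completion cmp;

	init_completion(&cmp);
	tx->callback = dmatest_callback;	/* just calls complete(&cmp) */
	tx->callback_param = &cmp;
	cookie = tx->tx_submit(tx);
	dma_async_issue_pending(chan);

	/* a 3 second timeout catches hangs and missed interrupts */
	if (!wait_for_completion_timeout(&cmp, msecs_to_jiffies(3000)))
		pr_warning("dmatest: test timed out\n");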

Signed-off-by: Dan Williams <[email protected]>
---
drivers/dma/dmatest.c | 37 +++++++++++++++++++++++++++----------
1 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/drivers/dma/dmatest.c b/drivers/dma/dmatest.c
index 224acf4..a27c0fb 100644
--- a/drivers/dma/dmatest.c
+++ b/drivers/dma/dmatest.c
@@ -196,6 +196,11 @@ static unsigned int dmatest_verify(u8 **bufs, unsigned int start,
return error_count;
}

+static void dmatest_callback(void *completion)
+{
+ complete(completion);
+}
+
/*
* This function repeatedly tests DMA transfers of various lengths and
* offsets for a given operation type until it is told to exit by
@@ -261,13 +266,17 @@ static int dmatest_func(void *data)
}
thread->dsts[i] = NULL;

- flags = DMA_CTRL_ACK | DMA_COMPL_SKIP_DEST_UNMAP;
+ set_user_nice(current, 10);
+
+ flags = DMA_CTRL_ACK | DMA_COMPL_SKIP_DEST_UNMAP | DMA_PREP_INTERRUPT;

while (!kthread_should_stop()) {
struct dma_device *dev = chan->device;
struct dma_async_tx_descriptor *tx = NULL;
dma_addr_t dma_srcs[src_cnt];
dma_addr_t dma_dsts[dst_cnt];
+ struct completion cmp;
+ unsigned long tmo = msecs_to_jiffies(3000);

total_tests++;

@@ -318,7 +327,10 @@ static int dmatest_func(void *data)
failed_tests++;
continue;
}
- tx->callback = NULL;
+
+ init_completion(&cmp);
+ tx->callback = dmatest_callback;
+ tx->callback_param = &cmp;
cookie = tx->tx_submit(tx);

if (dma_submit_error(cookie)) {
@@ -332,18 +344,23 @@ static int dmatest_func(void *data)
}
dma_async_issue_pending(chan);

- do {
- msleep(1);
- status = dma_async_is_tx_complete(
- chan, cookie, NULL, NULL);
- } while (status == DMA_IN_PROGRESS);
+ tmo = wait_for_completion_timeout(&cmp, tmo);
+ status = dma_async_is_tx_complete(chan, cookie, NULL, NULL);

- if (status == DMA_ERROR) {
- pr_warning("%s: #%u: error during copy\n",
- thread_name, total_tests - 1);
+ if (tmo == 0) {
+ pr_warning("%s: #%u: test timed out\n",
+ thread_name, total_tests - 1);
+ failed_tests++;
+ continue;
+ } else if (status != DMA_SUCCESS) {
+ pr_warning("%s: #%u: got completion callback,"
+ " but status is \'%s\'\n",
+ thread_name, total_tests - 1,
+ status == DMA_ERROR ? "error" : "in progress");
failed_tests++;
continue;
}
+
/* Unmap by myself (see DMA_COMPL_SKIP_DEST_UNMAP above) */
for (i = 0; i < dst_cnt; i++)
dma_unmap_single(dev->dev, dma_dsts[i], test_buf_size,

2009-03-18 19:23:32

by Dan Williams

[permalink] [raw]
Subject: [PATCH 08/13] iop-adma: P+Q support for iop13xx adma engines

iop33x support is not included because that engine is a bit more awkward
to handle: it can be in either xor mode or pq mode, and the
dmaengine/async_tx layers currently only comprehend static capabilities.

Note that iop13xx does not support hardware PQ continuation, so the driver
must handle the DMA_PREP_CONTINUE flag for operations across > 16
sources. From the comment for dma_maxpq:

/* When an engine does not support native continuation we need 3 extra
* source slots to reuse P and Q with the following coefficients:
* 1/ {00} * P : remove P from Q', but use it as a source for P'
* 2/ {01} * Q : use Q to continue Q' calculation
* 3/ {00} * Q : subtract Q from P' to cancel (2)
*/
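
As an illustration only (condensed from this patch together with the
dma_maxpq() change, not new behaviour): when DMA_PREP_CONTINUE is set on an
engine without native continuation, three of the engine's source slots are
reserved and the old P and Q are fed back in with the coefficients listed
above.

	/* Sketch: iop13xx advertises 16 PQ sources, so dma_maxpq() yields
	 * 16 - 3 = 13 usable data sources for a continuation descriptor.
	 */
	if (flags & DMA_PREP_CONTINUE) {
		iop_desc_set_pq_src_addr(g, src_cnt,     dst[0], 0); /* {00} * P */
		iop_desc_set_pq_src_addr(g, src_cnt + 1, dst[1], 1); /* {01} * Q */
		iop_desc_set_pq_src_addr(g, src_cnt + 2, dst[1], 0); /* {00} * Q */
	}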

Signed-off-by: Dan Williams <[email protected]>
---
arch/arm/include/asm/hardware/iop3xx-adma.h | 63 +++++++++++++++++
arch/arm/include/asm/hardware/iop_adma.h | 1
arch/arm/mach-iop13xx/include/mach/adma.h | 95 +++++++++++++++++++++++++
drivers/dma/iop-adma.c | 101 +++++++++++++++++++++++++++
4 files changed, 260 insertions(+), 0 deletions(-)

diff --git a/arch/arm/include/asm/hardware/iop3xx-adma.h b/arch/arm/include/asm/hardware/iop3xx-adma.h
index 26eefea..32402b7 100644
--- a/arch/arm/include/asm/hardware/iop3xx-adma.h
+++ b/arch/arm/include/asm/hardware/iop3xx-adma.h
@@ -187,11 +187,74 @@ union iop3xx_desc {
void *ptr;
};

+/* No support for p+q operations */
+static inline int
+iop_chan_pq_slot_count(size_t len, int src_cnt, int *slots_per_op)
+{
+ BUG();
+ return 0;
+}
+
+static inline void
+iop_desc_init_pq(struct iop_adma_desc_slot *desc, int src_cnt,
+ unsigned long flags)
+{
+ BUG();
+}
+
+static inline void
+iop_desc_set_pq_addr(struct iop_adma_desc_slot *desc, dma_addr_t *addr)
+{
+ BUG();
+}
+
+static inline void
+iop_desc_set_pq_src_addr(struct iop_adma_desc_slot *desc, int src_idx,
+ dma_addr_t addr, unsigned char coef)
+{
+ BUG();
+}
+
+static inline int
+iop_chan_pq_zero_sum_slot_count(size_t len, int src_cnt, int *slots_per_op)
+{
+ BUG();
+ return 0;
+}
+
+static inline void
+iop_desc_init_pq_zero_sum(struct iop_adma_desc_slot *desc, int src_cnt,
+ unsigned long flags)
+{
+ BUG();
+}
+
+static inline void
+iop_desc_set_pq_zero_sum_byte_count(struct iop_adma_desc_slot *desc, u32 len)
+{
+ BUG();
+}
+
+#define iop_desc_set_pq_zero_sum_src_addr iop_desc_set_pq_src_addr
+
+static inline void
+iop_desc_set_pq_zero_sum_addr(struct iop_adma_desc_slot *desc, int pq_idx,
+ dma_addr_t *src)
+{
+ BUG();
+}
+
static inline int iop_adma_get_max_xor(void)
{
return 32;
}

+static inline int iop_adma_get_max_pq(void)
+{
+ BUG();
+ return 0;
+}
+
static inline u32 iop_chan_get_current_descriptor(struct iop_adma_chan *chan)
{
int id = chan->device->id;
diff --git a/arch/arm/include/asm/hardware/iop_adma.h b/arch/arm/include/asm/hardware/iop_adma.h
index 385c6e8..bbe8a04 100644
--- a/arch/arm/include/asm/hardware/iop_adma.h
+++ b/arch/arm/include/asm/hardware/iop_adma.h
@@ -106,6 +106,7 @@ struct iop_adma_desc_slot {
union {
u32 *xor_check_result;
u32 *crc32_result;
+ u32 *pq_check_result;
};
};

diff --git a/arch/arm/mach-iop13xx/include/mach/adma.h b/arch/arm/mach-iop13xx/include/mach/adma.h
index 1cd31df..4ebe65b 100644
--- a/arch/arm/mach-iop13xx/include/mach/adma.h
+++ b/arch/arm/mach-iop13xx/include/mach/adma.h
@@ -150,6 +150,8 @@ static inline int iop_adma_get_max_xor(void)
return 16;
}

+#define iop_adma_get_max_pq iop_adma_get_max_xor
+
static inline u32 iop_chan_get_current_descriptor(struct iop_adma_chan *chan)
{
return __raw_readl(ADMA_ADAR(chan));
@@ -211,7 +213,10 @@ iop_chan_xor_slot_count(size_t len, int src_cnt, int *slots_per_op)
#define IOP_ADMA_MAX_BYTE_COUNT ADMA_MAX_BYTE_COUNT
#define IOP_ADMA_ZERO_SUM_MAX_BYTE_COUNT ADMA_MAX_BYTE_COUNT
#define IOP_ADMA_XOR_MAX_BYTE_COUNT ADMA_MAX_BYTE_COUNT
+#define IOP_ADMA_PQ_MAX_BYTE_COUNT ADMA_MAX_BYTE_COUNT
#define iop_chan_zero_sum_slot_count(l, s, o) iop_chan_xor_slot_count(l, s, o)
+#define iop_chan_pq_slot_count iop_chan_xor_slot_count
+#define iop_chan_pq_zero_sum_slot_count iop_chan_xor_slot_count

static inline u32 iop_desc_get_dest_addr(struct iop_adma_desc_slot *desc,
struct iop_adma_chan *chan)
@@ -220,6 +225,13 @@ static inline u32 iop_desc_get_dest_addr(struct iop_adma_desc_slot *desc,
return hw_desc->dest_addr;
}

+static inline u32 iop_desc_get_qdest_addr(struct iop_adma_desc_slot *desc,
+ struct iop_adma_chan *chan)
+{
+ struct iop13xx_adma_desc_hw *hw_desc = desc->hw_desc;
+ return hw_desc->q_dest_addr;
+}
+
static inline u32 iop_desc_get_byte_count(struct iop_adma_desc_slot *desc,
struct iop_adma_chan *chan)
{
@@ -319,6 +331,46 @@ iop_desc_init_zero_sum(struct iop_adma_desc_slot *desc, int src_cnt,
return 1;
}

+static inline void
+iop_desc_init_pq(struct iop_adma_desc_slot *desc, int src_cnt,
+ unsigned long flags)
+{
+ struct iop13xx_adma_desc_hw *hw_desc = desc->hw_desc;
+ union {
+ u32 value;
+ struct iop13xx_adma_desc_ctrl field;
+ } u_desc_ctrl;
+
+ u_desc_ctrl.value = 0;
+ u_desc_ctrl.field.src_select = src_cnt - 1;
+ u_desc_ctrl.field.xfer_dir = 3; /* local to internal bus */
+ u_desc_ctrl.field.pq_xfer_en = 1;
+ u_desc_ctrl.field.p_xfer_dis = !!(flags & DMA_PREP_PQ_DISABLE_P);
+ u_desc_ctrl.field.int_en = flags & DMA_PREP_INTERRUPT;
+ hw_desc->desc_ctrl = u_desc_ctrl.value;
+}
+
+static inline void
+iop_desc_init_pq_zero_sum(struct iop_adma_desc_slot *desc, int src_cnt,
+ unsigned long flags)
+{
+ struct iop13xx_adma_desc_hw *hw_desc = desc->hw_desc;
+ union {
+ u32 value;
+ struct iop13xx_adma_desc_ctrl field;
+ } u_desc_ctrl;
+
+ u_desc_ctrl.value = 0;
+ u_desc_ctrl.field.src_select = src_cnt - 1;
+ u_desc_ctrl.field.xfer_dir = 3; /* local to internal bus */
+ u_desc_ctrl.field.zero_result = 1;
+ u_desc_ctrl.field.status_write_back_en = 1;
+ u_desc_ctrl.field.pq_xfer_en = 1;
+ u_desc_ctrl.field.p_xfer_dis = !!(flags & DMA_PREP_PQ_DISABLE_P);
+ u_desc_ctrl.field.int_en = flags & DMA_PREP_INTERRUPT;
+ hw_desc->desc_ctrl = u_desc_ctrl.value;
+}
+
static inline void iop_desc_set_byte_count(struct iop_adma_desc_slot *desc,
struct iop_adma_chan *chan,
u32 byte_count)
@@ -351,6 +403,7 @@ iop_desc_set_zero_sum_byte_count(struct iop_adma_desc_slot *desc, u32 len)
}
}

+#define iop_desc_set_pq_zero_sum_byte_count iop_desc_set_zero_sum_byte_count

static inline void iop_desc_set_dest_addr(struct iop_adma_desc_slot *desc,
struct iop_adma_chan *chan,
@@ -361,6 +414,16 @@ static inline void iop_desc_set_dest_addr(struct iop_adma_desc_slot *desc,
hw_desc->upper_dest_addr = 0;
}

+static inline void
+iop_desc_set_pq_addr(struct iop_adma_desc_slot *desc, dma_addr_t *addr)
+{
+ struct iop13xx_adma_desc_hw *hw_desc = desc->hw_desc;
+
+ hw_desc->dest_addr = addr[0];
+ hw_desc->q_dest_addr = addr[1];
+ hw_desc->upper_dest_addr = 0;
+}
+
static inline void iop_desc_set_memcpy_src_addr(struct iop_adma_desc_slot *desc,
dma_addr_t addr)
{
@@ -389,6 +452,29 @@ static inline void iop_desc_set_xor_src_addr(struct iop_adma_desc_slot *desc,
}

static inline void
+iop_desc_set_pq_src_addr(struct iop_adma_desc_slot *desc, int src_idx,
+ dma_addr_t addr, unsigned char coef)
+{
+ int slot_cnt = desc->slot_cnt, slots_per_op = desc->slots_per_op;
+ struct iop13xx_adma_desc_hw *hw_desc = desc->hw_desc, *iter;
+ struct iop13xx_adma_src *src;
+ int i = 0;
+
+ do {
+ iter = iop_hw_desc_slot_idx(hw_desc, i);
+ src = &iter->src[src_idx];
+ src->src_addr = addr;
+ src->pq_upper_src_addr = 0;
+ src->pq_dmlt = coef;
+ slot_cnt -= slots_per_op;
+ if (slot_cnt) {
+ i += slots_per_op;
+ addr += IOP_ADMA_PQ_MAX_BYTE_COUNT;
+ }
+ } while (slot_cnt);
+}
+
+static inline void
iop_desc_init_interrupt(struct iop_adma_desc_slot *desc,
struct iop_adma_chan *chan)
{
@@ -399,6 +485,15 @@ iop_desc_init_interrupt(struct iop_adma_desc_slot *desc,
}

#define iop_desc_set_zero_sum_src_addr iop_desc_set_xor_src_addr
+#define iop_desc_set_pq_zero_sum_src_addr iop_desc_set_pq_src_addr
+
+static inline void
+iop_desc_set_pq_zero_sum_addr(struct iop_adma_desc_slot *desc, int pq_idx,
+ dma_addr_t *src)
+{
+ iop_desc_set_xor_src_addr(desc, pq_idx, src[pq_idx]);
+ iop_desc_set_xor_src_addr(desc, pq_idx+1, src[pq_idx+1]);
+}

static inline void iop_desc_set_next_desc(struct iop_adma_desc_slot *desc,
u32 next_desc_addr)
diff --git a/drivers/dma/iop-adma.c b/drivers/dma/iop-adma.c
index 8bccf85..0cdb333 100644
--- a/drivers/dma/iop-adma.c
+++ b/drivers/dma/iop-adma.c
@@ -697,6 +697,100 @@ iop_adma_prep_dma_zero_sum(struct dma_chan *chan, dma_addr_t *dma_src,
return sw_desc ? &sw_desc->async_tx : NULL;
}

+static struct dma_async_tx_descriptor *
+iop_adma_prep_dma_pq(struct dma_chan *chan, dma_addr_t *dst, dma_addr_t *src,
+ unsigned int src_cnt, unsigned char *scf, size_t len,
+ unsigned long flags)
+{
+ struct iop_adma_chan *iop_chan = to_iop_adma_chan(chan);
+ struct iop_adma_desc_slot *sw_desc, *g;
+ int slot_cnt, slots_per_op;
+
+ if (unlikely(!len))
+ return NULL;
+ BUG_ON(unlikely(len > IOP_ADMA_XOR_MAX_BYTE_COUNT));
+
+ dev_dbg(iop_chan->device->common.dev,
+ "%s src_cnt: %d len: %u flags: %lx\n",
+ __func__, src_cnt, len, flags);
+
+ spin_lock_bh(&iop_chan->lock);
+ slot_cnt = iop_chan_pq_slot_count(len, src_cnt, &slots_per_op);
+ sw_desc = iop_adma_alloc_slots(iop_chan, slot_cnt, slots_per_op);
+ if (sw_desc) {
+ int src_cont; /* for continuations */
+
+ g = sw_desc->group_head;
+ iop_desc_init_pq(g, src_cnt, flags);
+ iop_desc_set_byte_count(g, iop_chan, len);
+ iop_desc_set_pq_addr(g, dst);
+ if (flags & DMA_PREP_CONTINUE)
+ sw_desc->unmap_src_cnt = src_cnt + 3;
+ else
+ sw_desc->unmap_src_cnt = src_cnt;
+ sw_desc->unmap_len = len;
+ sw_desc->async_tx.flags = flags;
+ src_cont = src_cnt;
+ while (src_cnt--)
+ iop_desc_set_pq_src_addr(g, src_cnt, src[src_cnt],
+ scf[src_cnt]);
+ if (flags & DMA_PREP_CONTINUE) {
+ /* we are continuing a previous operation so factor in
+ * the old p and q values, see the comment for
+ * dma_maxpq
+ */
+ iop_desc_set_pq_src_addr(g, src_cont, dst[0], 0);
+ iop_desc_set_pq_src_addr(g, src_cont+1, dst[1], 1);
+ iop_desc_set_pq_src_addr(g, src_cont+2, dst[1], 0);
+ }
+ }
+ spin_unlock_bh(&iop_chan->lock);
+
+ return sw_desc ? &sw_desc->async_tx : NULL;
+}
+
+static struct dma_async_tx_descriptor *
+iop_adma_prep_dma_pq_zero_sum(struct dma_chan *chan, dma_addr_t *src,
+ unsigned int src_cnt, unsigned char *scf,
+ size_t len, enum sum_check_flags *pqres,
+ unsigned long flags)
+{
+ struct iop_adma_chan *iop_chan = to_iop_adma_chan(chan);
+ struct iop_adma_desc_slot *sw_desc, *g;
+ int slot_cnt, slots_per_op;
+
+ if (unlikely(!len))
+ return NULL;
+
+ dev_dbg(iop_chan->device->common.dev, "%s src_cnt: %d len: %u\n",
+ __func__, src_cnt, len);
+
+ spin_lock_bh(&iop_chan->lock);
+ slot_cnt = iop_chan_pq_zero_sum_slot_count(len, src_cnt + 2, &slots_per_op);
+ sw_desc = iop_adma_alloc_slots(iop_chan, slot_cnt, slots_per_op);
+ if (sw_desc) {
+ int pq_idx = src_cnt;
+
+ g = sw_desc->group_head;
+ iop_desc_init_pq_zero_sum(g, src_cnt, flags);
+ iop_desc_set_pq_zero_sum_byte_count(g, len);
+ g->pq_check_result = pqres;
+ pr_debug("\t%s: g->pq_check_result: %p\n",
+ __func__, g->pq_check_result);
+ sw_desc->unmap_src_cnt = src_cnt;
+ sw_desc->unmap_len = len;
+ sw_desc->async_tx.flags = flags;
+ while (src_cnt--)
+ iop_desc_set_pq_zero_sum_src_addr(g, src_cnt,
+ src[src_cnt],
+ scf[src_cnt]);
+ iop_desc_set_pq_zero_sum_addr(g, pq_idx, src);
+ }
+ spin_unlock_bh(&iop_chan->lock);
+
+ return sw_desc ? &sw_desc->async_tx : NULL;
+}
+
static void iop_adma_free_chan_resources(struct dma_chan *chan)
{
struct iop_adma_chan *iop_chan = to_iop_adma_chan(chan);
@@ -1196,6 +1290,13 @@ static int __devinit iop_adma_probe(struct platform_device *pdev)
if (dma_has_cap(DMA_ZERO_SUM, dma_dev->cap_mask))
dma_dev->device_prep_dma_zero_sum =
iop_adma_prep_dma_zero_sum;
+ if (dma_has_cap(DMA_PQ, dma_dev->cap_mask)) {
+ dma_set_maxpq(dma_dev, iop_adma_get_max_pq(), 0);
+ dma_dev->device_prep_dma_pq = iop_adma_prep_dma_pq;
+ }
+ if (dma_has_cap(DMA_PQ_ZERO_SUM, dma_dev->cap_mask))
+ dma_dev->device_prep_dma_pqzero_sum =
+ iop_adma_prep_dma_pq_zero_sum;
if (dma_has_cap(DMA_INTERRUPT, dma_dev->cap_mask))
dma_dev->device_prep_dma_interrupt =
iop_adma_prep_dma_interrupt;

2009-03-18 19:23:46

by Dan Williams

[permalink] [raw]
Subject: [PATCH 10/13] dmaengine: allow dma support for async_tx to be toggled

Provide a config option for blocking the allocation of dma channels to
the async_tx api.

Signed-off-by: Dan Williams <[email protected]>
---
crypto/async_tx/async_tx.c | 6 +++---
drivers/dma/Kconfig | 11 +++++++++++
include/linux/dmaengine.h | 18 ++++++++++++++++++
3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/crypto/async_tx/async_tx.c b/crypto/async_tx/async_tx.c
index c3b07cf..2b1513e 100644
--- a/crypto/async_tx/async_tx.c
+++ b/crypto/async_tx/async_tx.c
@@ -30,7 +30,7 @@
#ifdef CONFIG_DMA_ENGINE
static int __init async_tx_init(void)
{
- dmaengine_get();
+ async_dmaengine_get();

printk(KERN_INFO "async_tx: api initialized (async)\n");

@@ -39,7 +39,7 @@ static int __init async_tx_init(void)

static void __exit async_tx_exit(void)
{
- dmaengine_put();
+ async_dmaengine_put();
}

module_init(async_tx_init);
@@ -59,7 +59,7 @@ __async_tx_find_channel(struct dma_async_tx_descriptor *depend_tx,
if (depend_tx &&
dma_has_cap(tx_type, depend_tx->chan->device->cap_mask))
return depend_tx->chan;
- return dma_find_channel(tx_type);
+ return async_dma_find_channel(tx_type);
}
EXPORT_SYMBOL_GPL(__async_tx_find_channel);
#endif
diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 48ea59e..3b3c01b 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -98,6 +98,17 @@ config NET_DMA
Say Y here if you enabled INTEL_IOATDMA or FSL_DMA, otherwise
say N.

+config ASYNC_TX_DMA
+ bool "Async_tx: Offload support for the async_tx api"
+ depends on DMA_ENGINE
+ help
+ This allows the async_tx api to take advantage of offload engines for
+ memcpy, memset, xor, and raid6 p+q operations. If your platform has
+ a dma engine that can perform raid operations and you have enabled
+ MD_RAID456 say Y.
+
+ If unsure, say N.
+
config DMATEST
tristate "DMA Test client"
depends on DMA_ENGINE
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index a7fa966..7eedf45 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -356,6 +356,24 @@ static inline void net_dmaengine_put(void)
}
#endif

+#ifdef CONFIG_ASYNC_TX_DMA
+#define async_dmaengine_get() dmaengine_get()
+#define async_dmaengine_put() dmaengine_put()
+#define async_dma_find_channel(type) dma_find_channel(type)
+#else
+static inline void async_dmaengine_get(void)
+{
+}
+static inline void async_dmaengine_put(void)
+{
+}
+static inline struct dma_chan *
+async_dma_find_channel(enum dma_transaction_type type)
+{
+ return NULL;
+}
+#endif
+
dma_cookie_t dma_async_memcpy_buf_to_buf(struct dma_chan *chan,
void *dest, void *src, size_t len);
dma_cookie_t dma_async_memcpy_buf_to_pg(struct dma_chan *chan,

2009-03-18 23:46:35

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 00/13] Asynchronous raid6 acceleration (part 1 of 2)

Dan Williams <[email protected]> writes:

> This series constitutes the pieces of the raid6 acceleration work that are
> aimed at the next merge window. It implements:
> 1/ An api for asynchronous raid6 parity generation and recovery routines

Could you please comment a bit on how well the default load balancing works? If
I write a single stream from a single CPU, will it use multiple CPU
cores in the system to do the RAID6 work?

Thanks,

-Andi

--
[email protected] -- Speaking for myself only.

2009-03-19 16:07:32

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

Dan Williams wrote:
> + * @scfs: array of source coefficients used in GF-multiplication

Array of source coefficients? Are you doing a vector-vector
multiplication here?

Given this code:

> for (d = 0; d < len; d++) {
> + wq = wp = ptrs[0][d];
> + for (z = 1; z < src_cnt; z++) {
> + wd = ptrs[z][d];
> + wp ^= wd;
> + wq ^= raid6_gfmul[scfs[z]][wd];
> + }

... it kind of looks like that.

This is really quite expensive! The whole point of the restore code
that exists is that we never do a two-dimensional lookup, instead
caching a pointer to the multiplication table that we intend to use,
because the RAID-6 code only ever contains scalar-vector multiplications.
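
(For reference, the cached-table form looks roughly like this; a sketch
only, with coef, src, q, and len as placeholder names rather than
identifiers from the patch:)

	const u8 *qmul = raid6_gfmul[coef];	/* one table row, looked up once per source */
	size_t d;

	for (d = 0; d < len; d++)
		q[d] ^= qmul[src[d]];	/* scalar-vector multiply; no 2-D lookup per byte */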

I really don't get this, and I think it's broken.

-hpa

2009-03-19 17:09:23

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 00/13] Asynchronous raid6 acceleration (part 1 of 2)

On Wed, Mar 18, 2009 at 4:43 PM, Andi Kleen <[email protected]> wrote:
> Dan Williams <[email protected]> writes:
>
>> This series constitutes the pieces of the raid6 acceleration work that are
>> aimed at the next merge window. It implements:
>> 1/ An api for asynchronous raid6 parity generation and recovery routines
>
> Could you please comment a bit on how well the default load balancing works? If
> I write a single stream from a single CPU, will it use multiple CPU
> cores in the system to do the RAID6 work?

No, that is an item for the todo list. This implementation is only
asynchronous when a raid6 hardware offload resource is available; when
that is not the case it runs synchronously in a single thread. In
general the api is meant to inherit the load balancing of the caller.
For md/raid6 this is currently always single threaded for writes and
degraded reads, regardless of the number of requesting threads.

This work does make some progress towards multithreaded raid6 in that
calculations can now be preempted (i.e. no longer performed under the
stripe spin_lock).

--
Dan

2009-03-19 17:20:35

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

On Thu, Mar 19, 2009 at 9:06 AM, H. Peter Anvin <[email protected]> wrote:
> Dan Williams wrote:
>> + * @scfs: array of source coefficients used in GF-multiplication
>
> Array of source coefficients? Are you doing a vector-vector
> multiplication here?
>
> Given this code:
>
>> 	for (d = 0; d < len; d++) {
>> +		wq = wp = ptrs[0][d];
>> +		for (z = 1; z < src_cnt; z++) {
>> +			wd = ptrs[z][d];
>> +			wp ^= wd;
>> +			wq ^= raid6_gfmul[scfs[z]][wd];
>> +		}
>
> ... it kinds of looks like that.
>
> This is really quite expensive! The whole point of the restore code
> that exists is that we never do a two-dimensional lookup, instead
> caching a pointer to the multiplication table that we intend to use,
> because the RAID-6 code only ever contains scalar-vector multiplications.
>
> I really don't get this, and I think it's broken.

Something is broken if we take this path. This routine could stand to
have a WARN_ONCE(), because if it is ever called something is wrong
with the raid6 offload driver. The intent is that the asynchronous
recovery routines perform an early check to see if the recovery can be
offloaded (with a series of calls to async_xor/async_pq). If not, we
fall back to the optimized synchronous recovery routines
(raid6_2data_recov, raid6_datap_recov). The only time this path will
be taken is if we have decided to perform asynchronous recovery but a
later call to async_pq finds the offload driver has run out of
descriptors.
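
(A sketch of the guard being suggested, not part of the posted series:)

	/* in do_sync_pq(): reaching this path means asynchronous recovery
	 * was chosen but the offload driver later ran out of descriptors
	 */
	WARN_ONCE(1, "async_pq: unexpected synchronous fallback\n");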

--
Dan

2009-03-19 20:41:21

by Andre Noll

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

On 12:20, Dan Williams wrote:
> +static __async_inline struct dma_async_tx_descriptor *
> +do_async_pq(struct dma_chan *chan, struct page **blocks, unsigned char *scfs,
> + unsigned int offset, int src_cnt, size_t len,
> + enum async_tx_flags flags,
> + struct dma_async_tx_descriptor *depend_tx,
> + dma_async_tx_callback cb_fn, void *cb_param)
> +{

Wow, this function takes 10 arguments..

> + struct dma_device *dma = chan->device;
> + dma_addr_t dma_dest[2], dma_src[src_cnt];
> + struct dma_async_tx_descriptor *tx = NULL;
> + dma_async_tx_callback _cb_fn;
> + void *_cb_param;
> + unsigned char *scf = NULL;
> + int i, src_off = 0;
> + unsigned short pq_src_cnt;
> + enum async_tx_flags async_flags;
> + enum dma_ctrl_flags dma_flags = 0;
> + int idx;
> + u8 coefs[src_cnt];
> +
> + /* DMAs use destinations as sources, so use BIDIRECTIONAL mapping */
> + if (blocks[src_cnt])
> + dma_dest[0] = dma_map_page(dma->dev, blocks[src_cnt],
> + offset, len, DMA_BIDIRECTIONAL);
> + else
> + dma_flags |= DMA_PREP_PQ_DISABLE_P;
> + if (blocks[src_cnt+1])
> + dma_dest[1] = dma_map_page(dma->dev, blocks[src_cnt+1],
> + offset, len, DMA_BIDIRECTIONAL);
> + else
> + dma_flags |= DMA_PREP_PQ_DISABLE_Q;
> +
> + /* convert source addresses being careful to collapse 'zero'
> + * sources and update the coefficients accordingly
> + */
> + for (i = 0, idx = 0; i < src_cnt; i++) {
> + if (is_raid6_zero_block(blocks[i]))
> + continue;
> + dma_src[idx] = dma_map_page(dma->dev, blocks[i],
> + offset, len, DMA_TO_DEVICE);
> + coefs[idx] = scfs[i];

coefs[] seems not to be used anywhere. BTW: coefs is an array of u8
while unsigned char is used elsewhere for the coefficients throughout
the file. Any chance to unify this?

> + while (src_cnt > 0) {
> + async_flags = flags;
> + pq_src_cnt = min(src_cnt, dma_maxpq(dma, flags));
> + /* if we are submitting additional pqs, leave the chain open,
> + * clear the callback parameters, and leave the destination
> + * buffers mapped
> + */
> + if (src_cnt > pq_src_cnt) {
> + async_flags &= ~ASYNC_TX_ACK;
> + dma_flags |= DMA_COMPL_SKIP_DEST_UNMAP;
> + _cb_fn = NULL;
> + _cb_param = NULL;
> + } else {
> + _cb_fn = cb_fn;
> + _cb_param = cb_param;
> + }
> + if (_cb_fn)
> + dma_flags |= DMA_PREP_INTERRUPT;
> + if (scfs)
> + scf = &scfs[src_off];

If scfs is NULL, we have a NULL pointer dereference earlier.

> +
> + /* Since we have clobbered the src_list we are committed
> + * to doing this asynchronously. Drivers force forward
> + * progress in case they can not provide a descriptor
> + */
> + tx = dma->device_prep_dma_pq(chan, dma_dest,
> + &dma_src[src_off], pq_src_cnt,
> + scf, len, dma_flags);
> + if (unlikely(!tx))
> + async_tx_quiesce(&depend_tx);
> +
> + /* spin wait for the preceeding transactions to complete */
> + while (unlikely(!tx)) {
> + dma_async_issue_pending(chan);
> + tx = dma->device_prep_dma_pq(chan, dma_dest,
> + &dma_src[src_off], pq_src_cnt,
> + scf, len, dma_flags);
> + }

This can be simplified to

	for (;;) {
		tx = dma->device_prep_dma_pq(...);
		if (likely(tx))
			break;
		async_tx_quiesce(&depend_tx);
		dma_async_issue_pending(chan);
	}

which avoids duplicating the device_prep_dma_pq() call. The additional
calls to async_tx_quiesce() will be no-ops and we're spinning anyway.

> +/**
> + * do_sync_pq - synchronously calculate P and Q
> + */
> +static void
> +do_sync_pq(struct page **blocks, unsigned char *scfs, unsigned int offset,
> + int src_cnt, size_t len, enum async_tx_flags flags,
> + struct dma_async_tx_descriptor *depend_tx,
> + dma_async_tx_callback cb_fn, void *cb_param)
> +{
> + u8 *p = NULL;
> + u8 *q = NULL;
> + u8 *ptrs[src_cnt];
> + int d, z;
> + u8 wd, wq, wp;

The scope of wd, wq and wp can be reduced to the for() loop in which they
are used.

> +
> + /* address convert inputs */
> + if (blocks[src_cnt])
> + p = (u8 *)(page_address(blocks[src_cnt]) + offset);
> + if (blocks[src_cnt+1])
> + q = (u8 *)(page_address(blocks[src_cnt+1]) + offset);

Unnecessary casts.

> + ptrs[z] = (u8 *)(page_address(blocks[z]) + offset);

again

> +/**
> + * async_pq - attempt to do XOR and Galois calculations in parallel using
> + * a dma engine.
> + * @blocks: source block array from 0 to (src_cnt-1) with the p destination
> + * at blocks[src_cnt] and q at blocks[src_cnt + 1]. Only one of two
> + * destinations may be present (another then has to be set to NULL).
> + * NOTE: client code must assume the contents of this array are destroyed
> + * @offset: offset in pages to start transaction
> + * @src_cnt: number of source pages
> + * @scfs: array of source coefficients used in GF-multiplication
> + * @len: length in bytes
> + * @flags: ASYNC_TX_ACK, ASYNC_TX_DEP_ACK, ASYNC_TX_ASYNC_ONLY
> + * @depend_tx: depends on the result of this transaction.
> + * @cb_fn: function to call when the operation completes
> + * @cb_param: parameter to pass to the callback routine
> + */

General remark: Many of the public functions in this file use a
similar set of parameters. As these functions may grow more users
in the future, it might ease maintainability in the long run if the
common parameter set were combined into a structure and each of the
public functions took a pointer to such a structure. Changing
the interface would then be a matter of altering only that structure
in one place while all function declarations could be left unchanged.
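
Something along these lines, purely illustrative (the structure and
field names are made up for the example):

	struct async_submit {
		enum async_tx_flags flags;
		struct dma_async_tx_descriptor *depend_tx;
		dma_async_tx_callback cb_fn;
		void *cb_param;
	};

	struct dma_async_tx_descriptor *
	async_pq(struct page **blocks, unsigned int offset, int src_cnt,
		 unsigned char *scfs, size_t len,
		 struct async_submit *submit);

Growing the submit structure later (e.g. by a scratch buffer) would
then not touch any of the prototypes.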

> +struct dma_async_tx_descriptor *
> +async_pq(struct page **blocks, unsigned int offset, int src_cnt,
> + unsigned char *scfs, size_t len, enum async_tx_flags flags,
> + struct dma_async_tx_descriptor *depend_tx,
> + dma_async_tx_callback cb_fn, void *cb_param)
> +{
> + struct dma_chan *chan = async_tx_find_channel(depend_tx, DMA_PQ,
> + &blocks[src_cnt], 2,
> + blocks, src_cnt, len);
> + struct dma_device *device = chan ? chan->device : NULL;
> + struct dma_async_tx_descriptor *tx = NULL;
> + bool do_async = false;
> +
> + if (device && (src_cnt <= dma_maxpq(device, 0) ||
> + dma_maxpq(device, DMA_PREP_CONTINUE) > 0))
> + do_async = true;
> +
> + if (!device && (flags & ASYNC_TX_ASYNC_ONLY))
> + return NULL;
> +
> + if (do_async) {
> + /* run pq asynchronously */
> + tx = do_async_pq(chan, blocks, scfs, offset, src_cnt, len,
> + flags, depend_tx, cb_fn, cb_param);
> + } else {
> + /* run pq synchronously */
> + if (!blocks[src_cnt+1]) { /* only p requested, just xor */
> + flags |= ASYNC_TX_XOR_ZERO_DST;
> + return async_xor(blocks[src_cnt], blocks, offset,
> + src_cnt, len, flags, depend_tx,
> + cb_fn, cb_param);
> + }
> +
> + /* wait for any prerequisite operations */
> + async_tx_quiesce(&depend_tx);
> +
> + do_sync_pq(blocks, scfs, offset, src_cnt, len, flags,
> + depend_tx, cb_fn, cb_param);

Is it really OK to return NULL here?

> + }
> +
> + return tx;
> +}
> +EXPORT_SYMBOL_GPL(async_pq);

This function could be simplified a bit as follows:

	if (!device && (flags & ASYNC_TX_ASYNC_ONLY))
		return NULL;

	if (device && (src_cnt <= dma_maxpq(device, 0) ||
		       dma_maxpq(device, DMA_PREP_CONTINUE) > 0))
		/* run pq asynchronously */
		return do_async_pq(chan, blocks, scfs, offset, src_cnt, len,
				   flags, depend_tx, cb_fn, cb_param);

	/* run pq synchronously */
	if (!blocks[src_cnt+1]) {
		flags |= ASYNC_TX_XOR_ZERO_DST;
		return async_xor(blocks[src_cnt], blocks, offset,
				 src_cnt, len, flags, depend_tx, cb_fn, cb_param);
	}

	/* wait for any prerequisite operations */
	async_tx_quiesce(&depend_tx);
	do_sync_pq(blocks, scfs, offset, src_cnt, len, flags,
		   depend_tx, cb_fn, cb_param);
	return NULL;

This decreases the indent level and gets rid of the local variables
tx and do_async.

> +/**
> + * async_gen_syndrome - attempt to generate P (xor) and Q (Reed-Solomon code)
> + * with a dma engine for a given set of blocks. This routine assumes a
> + * field of GF(2^8) with a primitive polynomial of 0x11d and a generator
> + * of {02}.
> + * @blocks: source block array ordered from 0..src_cnt-1 with the P destination
> + * at blocks[src_cnt] and Q at blocks[src_cnt + 1]. Only one of two
> + * destinations may be present (another then has to be set to NULL). Some

s/another/the other

> + * @src_cnt: number of source pages: 2 < src_cnt <= 255

This should be 1 < src_cnt, no? Tangent: Can this restriction be
relaxed to 0 < src_cnt (see current discussion on linux-raid)?

> +struct dma_async_tx_descriptor *
> +async_gen_syndrome(struct page **blocks, unsigned int offset, int src_cnt,
> + size_t len, enum async_tx_flags flags,
> + struct dma_async_tx_descriptor *depend_tx,
> + dma_async_tx_callback cb_fn, void *cb_param)
> +{
> + struct dma_chan *chan = async_tx_find_channel(depend_tx, DMA_PQ,
> + &blocks[src_cnt], 2,
> + blocks, src_cnt, len);
> + struct dma_device *device = chan ? chan->device : NULL;
> + struct dma_async_tx_descriptor *tx = NULL;
> + bool do_async = false;
> +
> + BUG_ON(src_cnt > 255 || (!blocks[src_cnt] && !blocks[src_cnt+1]));

src_cnt < 2 is also a bug.

> +
> + if (device && (src_cnt <= dma_maxpq(device, 0) ||
> + dma_maxpq(device, DMA_PREP_CONTINUE) > 0))
> + do_async = true;
> +
> + if (!do_async && (flags & ASYNC_TX_ASYNC_ONLY))
> + return NULL;
> +
> + if (do_async) {
> + /* run the p+q asynchronously */
> + tx = do_async_pq(chan, blocks, (uint8_t *)raid6_gfexp,
> + offset, src_cnt, len, flags, depend_tx,
> + cb_fn, cb_param);
> + } else {
> + /* run the pq synchronously */
> + /* wait for any prerequisite operations */
> + async_tx_quiesce(&depend_tx);
> +
> + if (!blocks[src_cnt])
> + blocks[src_cnt] = scribble;
> + if (!blocks[src_cnt+1])
> + blocks[src_cnt+1] = scribble;
> + do_sync_gen_syndrome(blocks, offset, src_cnt, len, flags,
> + depend_tx, cb_fn, cb_param);
> + }
> +
> + return tx;
> +}
> +EXPORT_SYMBOL_GPL(async_gen_syndrome);

Roughly the same comments as to async_pq() apply.

> +/**
> + * async_syndrome_zero_sum - attempt a P (xor) and Q (Reed-Solomon code)
> + * parities check with a dma engine. This routine assumes a field of
> + * GF(2^8) with a primitive polynomial of 0x11d and a generator of {02}.

DRY. Honestly, that function is almost identical to async_pq_zero_sum(),
the only difference being that async_pq_zero_sum() is more general
as it works for arbitrary coefficient vectors due to its additional
scfs parameter. So async_syndrome_zero_sum() should really call
async_pq_zero_sum() rather than duplicating its logic.
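
In other words, something like this sketch (assuming raid6_gfexp
provides the standard {02}^i coefficients, as async_gen_syndrome()
already does):

	struct dma_async_tx_descriptor *
	async_syndrome_zero_sum(struct page **blocks, unsigned int offset,
				int src_cnt, size_t len,
				enum sum_check_flags *pqres,
				enum async_tx_flags flags,
				struct dma_async_tx_descriptor *depend_tx,
				dma_async_tx_callback cb_fn, void *cb_param)
	{
		return async_pq_zero_sum(blocks, offset, src_cnt,
					 (unsigned char *)raid6_gfexp, len,
					 pqres, flags, depend_tx,
					 cb_fn, cb_param);
	}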

Regards
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe



2009-03-19 20:41:35

by Andre Noll

[permalink] [raw]
Subject: Re: [PATCH 01/13] md/raid6: move raid6 data processing to raid6_pq.ko

On 12:20, Dan Williams wrote:
> Move the raid6 data processing routines into a standalone module
> (raid6_pq) to prepare them to be called from async_tx wrappers and other
> non-md drivers/modules. This precludes a circular dependency of raid456
> needing the async modules for data processing while those modules in
> turn depend on raid456 for the base level synchronous raid6 routines.
>
> To support this move:
> 1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h

As raid*.h and md.h have been moved to drivers/md recently, pq.h
should probably live there as well.

> +#define time_before(x, y) ((x) < (y))

This macro seems to have no users (and a strange name).

Regards
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe



2009-03-19 20:41:51

by Andre Noll

[permalink] [raw]
Subject: Re: [PATCH 02/13] async_tx: don't use src_list argument of async_xor() for dma addresses

On 12:20, Dan Williams wrote:

> Using src_list argument of async_xor() as a storage for dma addresses
> implies sizeof(dma_addr_t) <= sizeof(struct page *) restriction which is
> not always true (e.g. ppc440spe).

This message does not tell what the patch is going to do about that.
It would also be nice to mention (either in the log message or as a
comment) why allocating the dma_src arrays on the stack is not going
to be a problem.

Andre
--
The only person who always got his work done by Friday was Robinson Crusoe



2009-03-20 00:15:16

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 09/13] iop-adma: P+Q self test

On Wednesday March 18, [email protected] wrote:
> Even though the intent is to extend dmatest with P+Q tests there is
> still value in having an always-on sanity check to prevent an
> unintentionally broken driver from registering.
>
> This depends on raid6_pq.ko for verification, the side effect being that
> PQ capable channels will fail to register when raid6 is disabled.
>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> drivers/dma/iop-adma.c | 182 ++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 181 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/dma/iop-adma.c b/drivers/dma/iop-adma.c
> index 0cdb333..2b014b6 100644
> --- a/drivers/dma/iop-adma.c
> +++ b/drivers/dma/iop-adma.c
> @@ -31,6 +31,7 @@
> #include <linux/platform_device.h>
> #include <linux/memory.h>
> #include <linux/ioport.h>
> +#include <linux/raid/raid6.h>

???
<linux/raid/pq.h> Maybe?

NeilBrown

2009-03-20 00:19:42

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 09/13] iop-adma: P+Q self test

On Thu, Mar 19, 2009 at 5:14 PM, Neil Brown <[email protected]> wrote:
>> +#include <linux/raid/raid6.h>
>
> ???
> <linux/raid/pq.h> Maybe?
>

Yes, I neglected the cross compile before pushing this out, will fix.

Thanks,
Dan

2009-03-20 00:30:30

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 00/13] Asynchronous raid6 acceleration (part 1 of 2)

On Wednesday March 18, [email protected] wrote:
> This series constitutes the pieces of the raid6 acceleration work that are
> aimed at the next merge window. It implements:
> 1/ An api for asynchronous raid6 parity generation and recovery routines
> 2/ Extensions to the dmaengine framework and dmatest
> 3/ RAID6 support for iop13xx
> 4/ Removal of the BUILD_BUG_ON in async_xor to support platforms with
> sizeof(dma_addr_t) > sizeof(void *). This increases the stack
> utilization in the asynchronous path.
>
> This series is available via git at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx.git raid6
>
> ...it is based on git://neil.brown.name/md md-scratch

Thanks.

I'll take the first patch (which is the only one that affects md
directly) and feed it to -next and eventually -linus. I'll leave the
rest for you.

>
> Part 2, which needs more testing, is the conversion of md/raid6. It is
> available at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/djbw/md.git raid6-for-neil

Thanks. I might have a look if I get a moment.

NeilBrown

2009-03-20 22:44:23

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

Dan Williams wrote:
>
> Something is broken if we take this path. This routine could stand to
> have a WARN_ONCE(), because if it is ever called there is something
> wrong with the raid6 offload driver. The intent is that the
> asynchronous recovery routines will perform an early check to see if
> the recovery can be offloaded (with a series of calls to
> async_xor/async_pq). If not, we fall back to the optimized synchronous
> recovery routines (raid6_2data_recov, raid6_datap_recov). The only
> time this path will be taken is if we have decided to perform
> asynchronous recovery but at a later call to async_pq the offload
> driver reports it has run out of descriptors.
>

That doesn't explain what you expect the code to do. It still doesn't
make sense to me.

-hpa

2009-03-20 23:01:20

by Ilya Yanok

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

Hi H. Peter,

H. Peter Anvin wrote:
>> Something is broken if we take this path. This routine could stand to
>> have a WARN_ONCE(), because if it is ever called there is something
>> wrong with the raid6 offload driver. The intent is that the
>> asynchronous recovery routines will perform an early check to see if
>> the recovery can be offloaded (with a series of calls to
>> async_xor/async_pq). If not, we fall back to the optimized synchronous
>> recovery routines (raid6_2data_recov, raid6_datap_recov). The only
>> time this path will be taken is if we have decided to perform
>> asynchronous recovery but at a later call to async_pq the offload
>> driver reports it has run out of descriptors.
>>
>
> That doesn't explain what you expect the code to do. It still doesn't
> make sense to me.

We defined the async_pq() function because hardware can do GF
multiplication of arbitrary elements and we want to take advantage of
this fact. But by the design of the ASYNC_TX API we have to provide a
synchronous version of that function too.

Regards, Ilya.

2009-03-20 23:41:34

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

Ilya Yanok wrote:
>>>
>> That doesn't explain what you expect the code to do. It still doesn't
>> make sense to me.
>
> We defined the async_pq() function because hardware can do GF
> multiplication of arbitrary elements and we want to take advantage of
> this fact. But by the design of the ASYNC_TX API we have to provide a
> synchronous version of that function too.
>

I don't think that makes sense since the algorithm doesn't use
vector-vector multiplications and there is no need for them. As such,
you're actively excluding support for hardware which only provides
scalar-vector multiplication, even though that is all we need.

In fact, vectorizing the constant doesn't make much sense.

-hpa

2009-03-21 00:07:20

by Ilya Yanok

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

Hi H. Peter,

H. Peter Anvin wrote:
>> We defined the async_pq() function because hardware can do GF
>> multiplication of arbitrary elements and we want to take advantage of
>> this fact. But by the design of the ASYNC_TX API we have to provide a
>> synchronous version of that function too.
>>
>
> I don't think that makes sense since the algorithm doesn't use
> vector-vector multiplications and there is no need for them. As such,
> you're actively excluding support for hardware which only provides
> scalar-vector multiplication, even though that is all we need.
>
> In fact, vectorizing the constant doesn't make much sense.

Hm... I have to admit I can't understand your
vector-vector/scalar-vector terminology... What vector space are you
talking about?

async_gen_syndrome() takes element of GF^n(256) and returns result of
scalar multiplication of it with constant ({01}, {02}, {02}^2, ...,
{02}^n) vector.
async_pq() takes two vectors from GF^n(256) and returns their scalar
multiplication.

We need async_pq() function if we want to offload D_x = A * (P + P_{xy})
+ B * (Q + Q_{xy}) part of DD recovery and D_x = (Q + Q_x) * g^{-x} part
of DP recovery.

Regards, Ilya.

2009-03-21 02:31:39

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

Ilya Yanok wrote:
>
> async_gen_syndrome() takes element of GF^n(256) and returns result of
> scalar multiplication of it with constant ({01}, {02}, {02}^2, ...,
> {02}^n) vector.

For any n (which would mean any GF field)? In that case, that is
generic scalar-vector multiplication...

> async_pq() takes two vectors from GF^n(256) and returns their scalar
> multiplication.
>
> We need async_pq() function if we want to offload D_x = A * (P + P_{xy})
> + B * (Q + Q_{xy}) part of DD recovery and D_x = (Q + Q_x) * g^{-x} part
> of DP recovery.


No, you don't. A and B (and g^{-x}) are scalars, meaning they're the
same for every element. This is simpler to do.
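
Concretely, with scalar coefficients the per-byte work looks roughly
like this (names illustrative; dp[] holding P + P_xy, dq[] holding
Q + Q_xy, dx[] the recovered data):

	const u8 *amul = raid6_gfmul[A];	/* A, B are single bytes */
	const u8 *bmul = raid6_gfmul[B];
	size_t d;

	for (d = 0; d < bytes; d++)
		dx[d] = amul[dp[d]] ^ bmul[dq[d]];

i.e. two multiplication-table rows cached once for the whole buffer,
never a two-dimensional lookup in the inner loop.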

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-03-21 10:19:25

by Ilya Yanok

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

Hi H. Peter,

H. Peter Anvin wrote:
> Ilya Yanok wrote:
>
>> async_gen_syndrome() takes element of GF^n(256) and returns result of
>> scalar multiplication of it with constant ({01}, {02}, {02}^2, ...,
>> {02}^n) vector.
>>
>
> For any n (which would mean any GF field)? In that case, that is
> generic scalar-vector multiplication...
>

Ok. I think I got it. Vectors are things that depend on position inside
buffer and scalars are things that don't, am I right? In that sense we
don't have any vector-vector multiplication. Both async_gen_syndrome()
and async_pq() use constant coefficients. So I don't really understand
the problem here.

Regards, Ilya.

2009-03-21 15:19:55

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

On Fri, Mar 20, 2009 at 7:30 PM, H. Peter Anvin <[email protected]> wrote:
> Ilya Yanok wrote:
>>
>> async_gen_syndrome() takes element of GF^n(256) and returns result of
>> scalar multiplication of it with constant ({01}, {02}, {02}^2, ...,
>> {02}^n) vector.
>
> For any n (which would mean any GF field)? In that case, that is
> generic scalar-vector multiplication...
>
>> async_pq() takes two vectors from GF^n(256) and returns their scalar
>> multiplication.
>>
>> We need async_pq() function if we want to offload D_x = A * (P + P_{xy})
>> + B * (Q + Q_{xy}) part of DD recovery and D_x = (Q + Q_x) * g^{-x} part
>> of DP recovery.
>
>
> No, you don't. A and B (and g^{-x}) are scalars, meaning they're the
> same for every element. This is simpler to do.

Understood. However this routine also needs to cover the non-generic
and non-constant case where we have a separate coefficient per
element. I suppose it could scan the coefficient list to see if it
can bypass the 2-dimensional lookup multiply. At the very least we
need something like the following, because async_pq is really only a
helper routine for async_r6recov.c which knows how to avoid the
synchronous path.

diff --git a/crypto/async_tx/async_pq.c b/crypto/async_tx/async_pq.c
index da47a29..c1087a2 100644
--- a/crypto/async_tx/async_pq.c
+++ b/crypto/async_tx/async_pq.c
@@ -236,6 +236,7 @@ async_pq(struct page **blocks, unsigned int offset, int src_cnt,
 			flags, depend_tx, cb_fn, cb_param);
 	} else {
 		/* run pq synchronously */
+		WARN_ONCE(1, "INFO: async_pq entered synchronous path\n");
 		if (!blocks[src_cnt+1]) { /* only p requested, just xor */
 			flags |= ASYNC_TX_XOR_ZERO_DST;
 			return async_xor(blocks[src_cnt], blocks, offset,
@@ -252,7 +253,6 @@ async_pq(struct page **blocks, unsigned int offset, int src_cnt,
 
 	return tx;
 }
-EXPORT_SYMBOL_GPL(async_pq);

Looking closer, the only other caller, async_pq_zero_sum, can be
deleted because it has no users. So async_pq can become a static
routine in async_r6recov.

Thanks,
Dan

2009-03-21 19:15:45

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

Dan Williams wrote:
>>
>> No, you don't. A and B (and g^{-x}) are scalars, meaning they're the
>> same for every element. This is simpler to do.
>
> Understood. However this routine also needs to cover the non-generic
> and non-constant case where we have a separate coefficient per
> element. I suppose it could scan the coefficient list to see if it
> can bypass the 2-dimensional lookup multiply. At the very least we
> need something like the following, because async_pq is really only a
> helper routine for async_r6recov.c which knows how to avoid the
> synchronous path.
>

Why does it? I don't see why you'd need to cover the vector-vector case
at all, since it doesn't appear anywhere in the algorithms. Certainly
going backwards from a vector-vector set to derive if you can do a
scalar-vector multiply when you should have known that in the first
place is not really useful.

>
> Looking closer, the only other caller, async_pq_zero_sum, can be
> deleted because it has no users. So async_pq can become a static
> routine in async_r6recov.
>

That's another issue entirely, of course :)

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-03-21 19:16:45

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

Ilya Yanok wrote:
>>>
>> For any n (which would mean any GF field)? In that case, that is
>> generic scalar-vector multiplication...
>>
>
> Ok. I think I got it. Vectors are things that depend on position inside
> buffer and scalars are things that don't, am I right? In that sense we
> don't have any vector-vector multiplication. Both async_gen_syndrome()
> and async_pq() use constant coefficients. So I don't really understand
> the problem here.
>

Perhaps the interface is just too confusing (in which case it needs to
at least be better documented), but at least the synchronous simulation
code looked like it was doing a vector-vector multiply.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-03-21 20:05:28

by Ilya Yanok

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

Hi H.Peter,

H. Peter Anvin wrote:
>> + * @scfs: array of source coefficients used in GF-multiplication
>>
>
> Array of source coefficients? Are you doing a vector-vector
> multiplication here?
>
> Given this code:
>
>
>> for (d = 0; d < len; d++) {
>> + wq = wp = ptrs[0][d];
>> + for (z = 1; z < src_cnt; z++) {
>> + wd = ptrs[z][d];
>> + wp ^= wd;
>> + wq ^= raid6_gfmul[scfs[z]][wd];
>> + }
>>
>
> ... it kinds of looks like that.
>
> This is really quite expensive! The whole point of the restore code
> that exists is that we never do a two-dimensional lookup, instead
> caching a pointer to the multiplication table that we intend to use,
> because the RAID-6 code only ever contains scalar-vector multiplications.
>
> I really don't get this, and I think it's broken.
>
Well, that code IS broken. Dan, I think you over-optimized this part.
Have you ever tested this code path? Besides doing a two-dimensional
lookup at every step, it ignores scfs[0] completely, so if it is not
equal to one the result is wrong.

Regards, Ilya.

2009-03-21 22:00:30

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

On Sat, Mar 21, 2009 at 1:05 PM, Ilya Yanok <[email protected]> wrote:
> Have you ever tested this code path?

Not this path, not yet. Which is exactly why I am holding back the md
pieces. We can collaborate on testing and stabilizing the api for a
kernel cycle before turning on the raid6 conversion.

--
Dan

2009-03-21 22:14:39

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

On Sat, Mar 21, 2009 at 12:15 PM, H. Peter Anvin <[email protected]> wrote:
> Why does it? I don't see why you'd need to cover the vector-vector case
> at all, since it doesn't appear anywhere in the algorithms. Certainly
> going backwards from a vector-vector set to derive if you can do a
> scalar-vector multiply when you should have known that in the first
> place is not really useful.
>
>>
>> Looking closer, the only other caller, async_pq_zero_sum, can be
>> deleted because it has no users. So async_pq can become a static
>> routine in async_r6recov.
>>
>
> That's another issue entirely, of course :)
>

...ah, but now I see your point and the root cause of the problem.
The caller needs to be modified to only require scalar multiplication
even in this corner case.

Thanks,
Dan

2009-03-21 22:26:59

by Ilya Yanok

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

Hi Dan,

Dan Williams wrote:
>> Why does it? I don't see why you'd need to cover the vector-vector case
>> at all, since it doesn't appear anywhere in the algorithms. Certainly
>> going backwards from a vector-vector set to derive if you can do a
>> scalar-vector multiply when you should have known that in the first
>> place is not really useful.
>>
>>
>>> Looking closer, the only other caller, async_pq_zero_sum, can be
>>> deleted because it has no users. So async_pq can become a static
>>> routine in async_r6recov.
>>>
>>>
>> That's another issue entirely, of course :)
>>
>>
>
> ...ah, but now I see your point and the root cause of the problem.
> The caller needs to be modified to only require scalar multiplication
> even in this corner case.
>

Argh... There is no vector-vector multiplication at all! You just need
to swap the 'for' loops back (as they were in the original patch by me
and Yuri) to get rid of the two-dimensional lookup at each step.

Regards, Ilya.

2009-03-21 22:43:21

by Ilya Yanok

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

Hi Dan,

Dan Williams wrote:
>> Have you ever tested this code path?
>>
>
> Not this path, not yet.

You can disable the fallback to optimized synchronous functions and test
it with raid with dmaengine disabled. Or you can add support for
arbitrary coefficients in dmatest...

> Which is exactly why I am holding back the md
> pieces. We can collaborate on testing and stabilizing the api for a
> kernel cycle before turning on the raid6 conversion.
>

Unfortunately we are not getting paid for this project any more so I
can't spend much time on it.

Regards, Ilya.

2009-03-21 22:46:27

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

On Sat, Mar 21, 2009 at 3:26 PM, Ilya Yanok <[email protected]> wrote:
>> The caller needs to be modified to only require scalar multiplication
>> even in this corner case.
>>
>
> Argh... There is no vector-vector multiplication at all! You just need
> to swap the 'for' loops back (as they were in the original patch by me
> and Yuri) to get rid of the two-dimensional lookup at each step.

Yes, I swapped the loops as part of the removal of the
ASYNC_TX_ZERO_{P,Q} flags and wanted to avoid an explicit memset.
I'll rework this path...
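
Roughly this shape, I think (a sketch of the rework, not the final
patch): sources in the outer loop so each source uses one cached
raid6_gfmul[] row, with an initial pass over ptrs[0] (applying scfs[0])
standing in for the memset:

	for (d = 0; d < len; d++) {
		p[d] = ptrs[0][d];
		q[d] = raid6_gfmul[scfs[0]][ptrs[0][d]];
	}
	for (z = 1; z < src_cnt; z++) {
		const u8 *qmul = raid6_gfmul[scfs[z]];

		for (d = 0; d < len; d++) {
			p[d] ^= ptrs[z][d];
			q[d] ^= qmul[ptrs[z][d]];
		}
	}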

Regards,
Dan

2009-03-21 22:53:53

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

On Sat, Mar 21, 2009 at 3:43 PM, Ilya Yanok <[email protected]> wrote:
> Hi Dan,
>
> Dan Williams wrote:
>>> Have you ever tested this code path?
>>>
>>
>> Not this path, not yet.
>
> You can disable the fallback to optimized synchronous functions and test
> it with raid with dmaengine disabled. Or you can add support for
> arbitrary coefficients in dmatest...

There are also corner cases with channel switching that can only be
tested with an async_tx specific test module and coordination from the
driver.

>> Which is exactly why I am holding back the md
>> pieces. We can collaborate on testing and stabilizing the api for a
>> kernel cycle before turning on the raid6 conversion.
>>
>
> Unfortunately we are not getting paid for this project any more so I
> can't spend much time on it.

Hopefully I can occasionally call on you to try out some tests with
the ppc440spe driver and hardware? As we have seen in the past, that
platform hits corner cases that iop13xx does not.

Thanks,
Dan

2009-03-22 17:22:22

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 01/13] md/raid6: move raid6 data processing to raid6_pq.ko

On Thu, Mar 19, 2009 at 1:09 PM, Andre Noll <[email protected]> wrote:
> On 12:20, Dan Williams wrote:
>> Move the raid6 data processing routines into a standalone module
>> (raid6_pq) to prepare them to be called from async_tx wrappers and other
>> non-md drivers/modules. This precludes a circular dependency of raid456
>> needing the async modules for data processing while those modules in
>> turn depend on raid456 for the base level synchronous raid6 routines.
>>
>> To support this move:
>> 1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h
>
> As raid*.h and md.h have been moved to drivers/md recently, pq.h
> should probably live there as well.

Not in this case. pq.h is analogous to xor.h currently in
include/linux/raid/, i.e. it carries just the data processing symbols
that are used by modules outside of drivers/md/.

>> +#define time_before(x, y) ((x) < (y))
>
> This macro seems to have no users (and a strange name).
>

It's used a few lines down in the same file. It allows raid6algos.c
to be compiled for the userspace test (drivers/md/raid6test/) where we
do not have jiffies.h.
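
The usage is a timing loop along these lines (paraphrased from memory,
not a verbatim quote of raid6algos.c):

	j0 = jiffies;
	while ((j1 = jiffies) == j0)
		cpu_relax();
	while (time_before(jiffies, j1 + (1 << RAID6_TIME_JIFFIES_LG2))) {
		(*algo)->gen_syndrome(disks, PAGE_SIZE, dptrs);
		perf++;
	}

where RAID6_TIME_JIFFIES_LG2 bounds how long each algorithm is
benchmarked.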

Regards,
Dan

2009-03-22 21:37:33

by Ilya Yanok

[permalink] [raw]
Subject: Re: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

Hi Dan,

Dan Williams wrote:
>> Unfortunately we are not getting paid for this project any more so I
>> can't spend much time on it.
>>
>
> Hopefully I can occasionally call on you to try out some tests with
> the ppc440spe driver and hardware? As we have seen in the past, that
> platform hits corner cases that iop13xx does not.
>

Well, I think I'll be able to run some tests... but I can't do much
coding or debugging here or my manager won't be happy...

Regards, Ilya.

2009-03-23 10:15:18

by Andre Noll

[permalink] [raw]
Subject: Re: [PATCH 07/13] async_tx: add support for asynchronous RAID6 recovery operations

On 12:20, Dan Williams wrote:
> +struct dma_async_tx_descriptor *
> +async_r6_dd_recov(int disks, size_t bytes, int faila, int failb,
> + struct page **ptrs, enum async_tx_flags flags,
> + struct dma_async_tx_descriptor *depend_tx,
> + dma_async_tx_callback cb, void *cb_param)
> +{
> + struct dma_async_tx_descriptor *tx = NULL;
> + struct page *lptrs[disks];
> + unsigned char lcoef[disks-4];

This probably needs a BUG_ON(disks < 4).

> + * B = (2^(y-x))*((2^(y-x) + {01})^(-1))

Minor optimization suggestion: As B depends only on y-x, there are
255 possible values for B, so a lookup table for all these values
would only occupy 255 bytes.
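
E.g. a one-time init along these lines (a sketch; assumes the exported
raid6_gfexp/raid6_gfinv/raid6_gfmul tables from <linux/raid/pq.h>):

	static u8 b_coef[256];		/* indexed by (y - x) mod 255 */
	int d;

	for (d = 1; d < 255; d++) {
		u8 t = raid6_gfexp[d];			/* {02}^(y-x) */

		/* B = {02}^(y-x) * ({02}^(y-x) + {01})^(-1) */
		b_coef[d] = raid6_gfmul[t][raid6_gfinv[t ^ 1]];
	}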

> +ddr_sync:
> + {
> + void **sptrs = (void **)lptrs;

unnecessary cast

> +struct dma_async_tx_descriptor *
> +async_r6_dp_recov(int disks, size_t bytes, int faila, struct page **ptrs,
> + enum async_tx_flags flags,
> + struct dma_async_tx_descriptor *depend_tx,
> + dma_async_tx_callback cb, void *cb_param)
> +{
> + struct dma_async_tx_descriptor *tx = NULL;

unnecessary initialization.

> + struct page *lptrs[disks];
> + unsigned char lcoef[disks-2];
> + int i = 0, k = 0;

Again, unnecessary initialization; I'd suggest initializing i and k in the for() loop.

Regards
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe



2009-03-25 17:11:22

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 02/13] async_tx: don't use src_list argument of async_xor() for dma addresses

On Thu, Mar 19, 2009 at 1:10 PM, Andre Noll <[email protected]> wrote:
> On 12:20, Dan Williams wrote:
>
>> Using src_list argument of async_xor() as a storage for dma addresses
>> implies sizeof(dma_addr_t) <= sizeof(struct page *) restriction which is
>> not always true (e.g. ppc440spe).
>
> This message does not tell what the patch is going to do about that.
> It would also be nice to mention (either in the log message or as a
> comment) why allocating the dma_src arrays on the stack is not going
> to be a problem.
>

True, and taking into account your other comment about the number of
parameters I think this is a good opportunity to solve this problem
rather than kick it down the road. So I will prepare a rework to make
this memory caller provided. This gives us the flexibility to use a
kmalloc'd buffer hanging off the stripe_head as Neil suggested in
another thread.
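
The rough shape I have in mind (illustrative only; the field and the
extra parameter are invented for the example):

	/* at stripe_head allocation time */
	sh->scribble = kmalloc(sizeof(dma_addr_t) * (disks + 2), GFP_KERNEL);

	/* ...and the async interfaces take a caller-provided buffer */
	tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE,
		       sh->scribble, flags, tx, callback, sh);

That keeps arrays of dma_addr_t off the stack regardless of
sizeof(dma_addr_t).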

Regards,
Dan

2009-03-26 10:43:22

by Andre Noll

[permalink] [raw]
Subject: Re: [PATCH 02/13] async_tx: don't use src_list argument of async_xor() for dma addresses

On 10:11, Dan Williams wrote:
> On Thu, Mar 19, 2009 at 1:10 PM, Andre Noll <[email protected]> wrote:
> > On 12:20, Dan Williams wrote:
> >
> >> Using src_list argument of async_xor() as a storage for dma addresses
> >> implies sizeof(dma_addr_t) <= sizeof(struct page *) restriction which is
> >> not always true (e.g. ppc440spe).
> >
> > This message does not tell what the patch is going to do about that.
> > It would also be nice to mention (either in the log message or as a
> > comment) why allocating the dma_src arrays on the stack is not going
> > to be a problem.
> >
>
> True, and taking into account your other comment about the number of
> parameters I think this is a good opportunity to solve this problem
> rather than kick it down the road. So I will prepare a rework to make
> this memory caller provided. This gives us the flexibility to use a
> kmalloc'd buffer hanging off the stripe_head as Neil suggested in
> another thread.

Glad you agree. I'm looking forward to seeing the reworked patch set.

Thanks
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe



2009-03-30 14:31:17

by Sosnowski, Maciej

[permalink] [raw]
Subject: RE: [PATCH 06/13] async_tx: add support for asynchronous GF multiplication

Williams, Dan J wrote:
> + if (unlikely(!tx))
> + async_tx_quiesce(&depend_tx);
> +
> + /* spin wait for the preceeding transactions to complete */
> + while (unlikely(!tx)) {
> + dma_async_issue_pending(chan);
> + tx = dma->device_prep_dma_pq(chan, dma_dest,
> + &dma_src[src_off], pq_src_cnt,
> + scf, len, dma_flags);
> + }

I guess the while loop here should be a part of the "if (unlikely(!tx))" section
(just like it is in async_pq_zero_sum() and async_syndrome_zero_sum()).

BTW, how long can we wait for a successful device_prep_dma_pq?
Shouldn't there be a timeout breaking the loop if we wait too long?
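
Something like this could bound the spin (a sketch; the 100 ms bound
and the handling after the break are illustrative):

	unsigned long timeout = jiffies + msecs_to_jiffies(100);

	if (unlikely(!tx)) {
		async_tx_quiesce(&depend_tx);
		while (unlikely(!tx)) {
			dma_async_issue_pending(chan);
			tx = dma->device_prep_dma_pq(chan, dma_dest,
						     &dma_src[src_off],
						     pq_src_cnt, scf, len,
						     dma_flags);
			if (!tx && time_after(jiffies, timeout)) {
				WARN_ONCE(1, "async_pq: no pq descriptors\n");
				break;	/* caller must handle tx == NULL */
			}
		}
	}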

> +struct dma_async_tx_descriptor *
> +async_pq_zero_sum(struct page **blocks, unsigned int offset, int src_cnt,
> + unsigned char *scfs, size_t len, enum sum_check_flags *pqres,
> + enum async_tx_flags flags,
> + struct dma_async_tx_descriptor *depend_tx,
> + dma_async_tx_callback cb_fn, void *cb_param)
> +{
> + struct dma_chan *chan = async_tx_find_channel(depend_tx,
> + DMA_PQ_ZERO_SUM,
> + &blocks[src_cnt], 2,
> + blocks, src_cnt, len);
> + struct dma_device *device = chan ? chan->device : NULL;
> + struct dma_async_tx_descriptor *tx = NULL;
> + enum dma_ctrl_flags dma_flags = cb_fn ? DMA_PREP_INTERRUPT : 0;
> +
> + BUG_ON(src_cnt < 2);
> +
> + if (device && src_cnt <= dma_maxpq(device, 0) - 2) {
> + dma_addr_t dma_src[src_cnt + 2];
> +
> + dma_flags |= __pq_zero_sum_map_pages(dma_src, src_cnt,
> + device->dev, blocks,
> + offset, len);
> + tx = device->device_prep_dma_pqzero_sum(chan, dma_src, src_cnt,
> + scfs, len, pqres,
> + dma_flags);
> +
> + if (unlikely(!tx)) {
> + async_tx_quiesce(&depend_tx);
> +
> + while (unlikely(!tx)) {
> + dma_async_issue_pending(chan);
> + tx = device->device_prep_dma_pqzero_sum(chan,
> + dma_src, src_cnt, scfs, len,
> + pqres, dma_flags);
> + }
> + }
> +
> + async_tx_submit(chan, tx, flags, depend_tx, cb_fn, cb_param);
> + } else {
> + struct page *pdest = blocks[src_cnt];
> + struct page *qdest = blocks[src_cnt + 1];
> + void *p, *q, *s;
> +
> + flags &= ~ASYNC_TX_ACK;
> +
> + spin_lock(&spare_lock);
> + blocks[src_cnt] = spare_pages[0];
> + blocks[src_cnt + 1] = spare_pages[1];
> + tx = async_pq(blocks, offset, src_cnt, scfs, len, flags,
> + depend_tx, NULL, NULL);
> + async_tx_quiesce(&tx);
> +
> + *pqres = 0;
> + if (pdest) {
> + p = page_address(pdest) + offset;
> + s = page_address(spare_pages[0]) + offset;
> + *pqres |= !!memcmp(p, s, len) << SUM_CHECK_P;
> + }
> +
> + if (qdest) {
> + q = page_address(qdest) + offset;
> + s = page_address(spare_pages[1]) + offset;
> + *pqres |= !!memcmp(q, s, len) << SUM_CHECK_Q;
> + }
> + spin_unlock(&spare_lock);
> +
> + async_tx_sync_epilog(cb_fn, cb_param);
> + }

Some comments describing the division between the asynchronous and
synchronous parts would be good (as in the other functions in this file).

> +struct dma_async_tx_descriptor *
> +async_syndrome_zero_sum(struct page **blocks, unsigned int offset, int src_cnt,
> + size_t len, enum sum_check_flags *pqres,
> + enum async_tx_flags flags,
> + struct dma_async_tx_descriptor *depend_tx,
> + dma_async_tx_callback cb_fn, void *cb_param)

Most of the code in async_syndrome_zero_sum() is the same as in async_pq_zero_sum().
What about putting it in one common function?

Regards,
Maciej-

2009-03-30 14:31:31

by Sosnowski, Maciej

[permalink] [raw]
Subject: RE: [PATCH 07/13] async_tx: add support for asynchronous RAID6 recovery operations

Williams, Dan J wrote:
> +ddr_sync:
> + {
> + void **sptrs = (void **)lptrs;
> + /*
> + * Failed to compute asynchronously, do it in
> + * synchronous manner
> + */
> +
> + /* wait for any prerequisite operations */
> + async_tx_quiesce(&depend_tx);
> + if (flags & ASYNC_TX_DEP_ACK)
> + async_tx_ack(depend_tx);
> +
> + i = disks;
> + while (i--)
> + sptrs[i] = page_address(ptrs[i]);
> + raid6_2data_recov(disks, bytes, faila, failb, sptrs);
> +
> + async_tx_sync_epilog(cb, cb_param);
> + }
(...)
> +dpr_sync:
> + {
> + void **sptrs = (void **) lptrs;
> + /*
> + * Failed to compute asynchronously, do it in
> + * synchronous manner
> + */
> +
> + /* wait for any prerequisite operations */
> + async_tx_quiesce(&depend_tx);
> + if (flags & ASYNC_TX_DEP_ACK)
> + async_tx_ack(depend_tx);
> +
> + i = disks;
> + while (i--)
> + sptrs[i] = page_address(ptrs[i]);
> + raid6_datap_recov(disks, bytes, faila, (void *)sptrs);
> +
> + async_tx_sync_epilog(cb, cb_param);
> + }

These synchronous sections in async_r6_dd_recov() and async_r6_dp_recov()
look almost the same. What about placing the code in a common function?

Regards,
Maciej-

2009-03-30 14:32:13

by Sosnowski, Maciej

[permalink] [raw]
Subject: RE: [PATCH 09/13] iop-adma: P+Q self test

Williams, Dan J wrote:
> + cookie = iop_adma_tx_submit(tx);
> + iop_adma_issue_pending(dma_chan);
> + msleep(8);

Why sleep exactly 8 ms?
Has this value been obtained experimentally somehow?
How about introducing wait_for_completion, as is done in ioatdma now?
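
I.e. roughly what ioatdma does in its self test (a sketch with adapted
names; the descriptor needs DMA_PREP_INTERRUPT set for the callback to
fire):

	static void iop_adma_pq_test_callback(void *param)
	{
		complete(param);
	}

	/* ...in the self test, instead of msleep(8): */
	struct completion cmp;

	init_completion(&cmp);
	tx->callback = iop_adma_pq_test_callback;
	tx->callback_param = &cmp;
	cookie = iop_adma_tx_submit(tx);
	iop_adma_issue_pending(dma_chan);

	if (!wait_for_completion_timeout(&cmp, msecs_to_jiffies(1000)))
		dev_err(dma_chan->device->dev, "pq self-test timed out\n");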

General remark:
Both iop_adma_pq_zero_sum_self_test() and iop_adma_xor_zero_sum_self_test()
are quite long and consist of a series of similar tests.
I guess the code would be easier to read if every test were a separate function
called from iop_adma_XX_zero_sum_self_test().

Regards,
Maciej