I'm proud to present RAID5 support for the ORE, which enables raid5 for both
the exofs and pnfs-objects-layout drivers.
With raid0/1/5 and soon raid6 support, the ORE has become a compact and
abstract library that, with not a lot of effort, can support not only OSD but
any type of device. For example BTRFS: does it have RAID5 support yet? If not,
it could use the ORE. The ORE gets a bunch of pages at the top and produces
bios for each device at the bottom. The libosd API can easily be abstracted
and used for block devices just the same. The RAID layout supported by the
ORE is very rich: multi-layered striping/mirroring/raid, even richer than
stacked MD. You can read about this layout here:
http://git.open-osd.org/gitweb.cgi?p=ietf-rfc5664.git;a=blob_plain;f=draft-ietf-nfsv4-rfc5664bis.html;hb=boaz2
Start at: 5.3. "Data Mapping Schemes" up to:
5.4.5. "RAID Usage and Implementation Notes"
There were some problems with the previous patchset; here are the
differences. I squashed it into a new patchset at osd/linux-next.
[PATCH 1/1] SQUASHME: into: ore: Only IO one group at a time (API change)
And without further ado, here is the RAID5 support. This is highly complicated
stuff, for humble me at least, and I would appreciate any review and/or comments
you guys can give it. Thanks in advance. (Pretty please? :-))
[PATCH 1/6] ore: Make ore_calc_stripe_info EXPORT_SYMBOL
[PATCH 2/6] ore: RAID5 read
[PATCH 3/6] ore: RAID5 Write
[PATCH 4/6] exofs: Support for RAID5 read-4-write interface.
[PATCH 5/6] pnfs-obj: Support for RAID5 read-4-write interface.
[PATCH 6/6] ore: Enable RAID5 mounts
A tree with the above plus all prerequisites is at:
$ git clone git://open-osd.org/linux-open-osd linux-next
[http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/merge_and_compile]
The code passes my tests, the highlight of which is a git-clone of linux
compared to an identical clone on ext4. But months of further testing must be
done on this to get it better. For example, stripe-aligned IO write-back
should be improved all the way down to the VFS layer.
Thanks
Boaz
This is finally the RAID5 write support.
The bigger part of this patch is not the XOR engine itself but the
read4write logic, which is a complete mini prepare_for_striping
reading engine that can read scattered pages of a stripe into cache
so they can be used for XOR calculation; that is, if the write was
not stripe aligned.
The main data structure behind the XOR engine is the two-dimensional array
struct __stripe_pages_2d.
A drawing might save 1000 words:
---
__stripe_pages_2d
|
n = pages_in_stripe_unit;
w = group_width - parity;
| pages array presented to the XOR lib
| |
V |
__1_page_stripe[0].pages --> [c0][c1]..[cw][c_par] <---|
| |
__1_page_stripe[1].pages --> [c0][c1]..[cw][c_par] <---
|
... | ...
|
__1_page_stripe[n].pages --> [c0][c1]..[cw][c_par]
^
|
data added columns first then row
---
The pages are put on this array columns first, i.e.:
  p0-of-c0, p1-of-c0, ... pn-of-c0, p0-of-c1, ...
So we are doing a corner turn of the pages. Note that pages zigzag
down and left, but are put sequentially in growing order. So when the
time comes to XOR the stripe, only the beginning and end of the array
need be checked. We scan the array, and any NULL spot will be filled
by pages-to-be-read. (A toy sketch of the corner turn follows.)
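
To make the corner turn concrete, here is a minimal userspace sketch (plain
C, not the kernel code; the sizes are made up) that fills the array
column-first and then XORs each row into its parity slot, the same shape
__stripe_pages_2d hands to the xor library:

#include <stdio.h>
#include <string.h>

#define PAGES_IN_UNIT	4	/* n in the drawing above */
#define DATA_DEVS	3	/* w = group_width - parity */
#define TOY_PAGE	8	/* tiny stand-in for PAGE_SIZE */

static unsigned char pages[PAGES_IN_UNIT][DATA_DEVS + 1][TOY_PAGE];

int main(void)
{
	unsigned p, c, i, serial = 0;

	/* Corner turn: sequential file pages fill a column top to
	 * bottom, then move to the next column (p0-of-c0, p1-of-c0, ...)
	 */
	for (c = 0; c < DATA_DEVS; c++)
		for (p = 0; p < PAGES_IN_UNIT; p++)
			memset(pages[p][c], serial++, TOY_PAGE);

	/* Each row now holds one page-depth of the stripe; XOR the
	 * data pages into the last column, which plays the parity page.
	 */
	for (p = 0; p < PAGES_IN_UNIT; p++)
		for (c = 0; c < DATA_DEVS; c++)
			for (i = 0; i < TOY_PAGE; i++)
				pages[p][DATA_DEVS][i] ^= pages[p][c][i];

	for (p = 0; p < PAGES_IN_UNIT; p++)
		printf("row %u parity byte = 0x%02x\n",
		       p, pages[p][DATA_DEVS][0]);
	return 0;
}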
An FS that wants to support RAID5 needs to supply an
operations vector that searches for a given page in the cache and
specifies whether the page is uptodate or needs reading. All these
pages-to-be-read are put on a slave ore_io_state and synchronously
read. All the pages of a stripe are read in one IO, using the
scatter-gather mechanism.
In writes we constrain our IO to only be incomplete on a single
stripe. Meaning either the complete IO is within a single stripe, so
we might have pages to read at both the beginning and the end of the
stripe; or we have some reading to do at the beginning but end at a
stripe boundary. The leftover pages are pushed to the next IO by the
API already established by previous work, where an IO offset/length
combination presented to the ORE might get the length truncated and
the user must re-submit the leftover pages. (Both exofs and NFS
support this.)
But any ORE user should make its best effort to align its IO
beforehand and avoid complications. A cached ore_layout->stripe_size
member can be used for that calculation, as sketched below. (NOTE: the
ORE demands that stripe_size not be bigger than 32 bits.)
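
As an illustration of that caller-side contract, here is a standalone sketch
(userspace C; STRIPE_SIZE and ore_like_truncate() are made-up names, the
rounding mirrors _ore_post_alloc_raid_stuff() and the geometry is
hypothetical) of the truncate-and-resubmit loop:

#include <stdint.h>
#include <stdio.h>

#define STRIPE_SIZE (3u * 65536u)	/* e.g. 3 data devs * 64K unit */

/* A multi-stripe IO is cut at the last full stripe boundary; the
 * remainder is pushed to the next submission.
 */
static uint64_t ore_like_truncate(uint64_t offset, uint64_t length)
{
	uint64_t first_stripe = offset / STRIPE_SIZE;
	uint64_t last_stripe = (offset + length) / STRIPE_SIZE;

	if (last_stripe != first_stripe)
		return last_stripe * STRIPE_SIZE - offset;
	return length;
}

int main(void)
{
	uint64_t offset = 4096, length = 1000000; /* unaligned on purpose */

	while (length) {
		uint64_t done = ore_like_truncate(offset, length);

		printf("submit offset=%llu length=%llu\n",
		       (unsigned long long)offset, (unsigned long long)done);
		offset += done;
		length -= done;
	}
	return 0;
}

Only the last submission can be stripe-incomplete, which is exactly the
single-stripe constraint described above.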
What else? Well, read it and tell me.
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/Kconfig | 9 +-
fs/exofs/ore.c | 36 +++-
fs/exofs/ore_raid.c | 534 +++++++++++++++++++++++++++++++++++++++++++++++-
fs/exofs/ore_raid.h | 15 ++
include/scsi/osd_ore.h | 9 +
5 files changed, 587 insertions(+), 16 deletions(-)
diff --git a/fs/exofs/Kconfig b/fs/exofs/Kconfig
index 70bae41..d58e888 100644
--- a/fs/exofs/Kconfig
+++ b/fs/exofs/Kconfig
@@ -1,10 +1,17 @@
+# Note: ORE needs to "select ASYNC_XOR". So as not to force multiple selects
+# for every ORE user, we do it like this. Any user should add itself here
+# at the "depends on EXOFS_FS || ..." with an ||. The dependencies are
+# selected here, and we default to "ON". So in effect it is as if it were
+# selected by any of the users.
config ORE
tristate
+ depends on EXOFS_FS || PNFS_OBJLAYOUT
+ select ASYNC_XOR
+ default m
config EXOFS_FS
tristate "exofs: OSD based file system support"
depends on SCSI_OSD_ULD
- select ORE
help
EXOFS is a file system that uses an OSD storage device,
as its backing storage.
diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index fd6090d..08ee454 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -95,6 +95,14 @@ int ore_verify_layout(unsigned total_comps, struct ore_layout *layout)
layout->max_io_length =
(BIO_MAX_PAGES_KMALLOC * PAGE_SIZE - layout->stripe_unit) *
layout->group_width;
+ if (layout->parity) {
+ unsigned stripe_length =
+ (layout->group_width - layout->parity) *
+ layout->stripe_unit;
+
+ layout->max_io_length /= stripe_length;
+ layout->max_io_length *= stripe_length;
+ }
return 0;
}
EXPORT_SYMBOL(ore_verify_layout);
@@ -118,7 +126,7 @@ static struct osd_dev *_ios_od(struct ore_io_state *ios, unsigned index)
return ore_comp_dev(ios->oc, index);
}
-static int _ore_get_io_state(struct ore_layout *layout,
+int _ore_get_io_state(struct ore_layout *layout,
struct ore_components *oc, unsigned numdevs,
unsigned sgs_per_dev, unsigned num_par_pages,
struct ore_io_state **pios)
@@ -334,7 +342,7 @@ static void _done_io(struct osd_request *or, void *p)
kref_put(&ios->kref, _last_io);
}
-static int ore_io_execute(struct ore_io_state *ios)
+int ore_io_execute(struct ore_io_state *ios)
{
DECLARE_COMPLETION_ONSTACK(wait);
bool sync = (ios->done == NULL);
@@ -597,6 +605,8 @@ int _ore_add_stripe_unit(struct ore_io_state *ios, unsigned *cur_pg,
ret = -ENOMEM;
goto out;
}
+ _add_stripe_page(ios->sp2d, &ios->si, pages[pg]);
+
pgbase = 0;
++pg;
}
@@ -636,6 +646,7 @@ static int _prepare_for_striping(struct ore_io_state *ios)
dev_order = _dev_order(devs_in_group, mirrors_p1, si->par_dev, dev);
si->cur_comp = dev_order;
+ si->cur_pg = si->unit_off / PAGE_SIZE;
while (length) {
unsigned comp = dev - first_dev;
@@ -677,14 +688,14 @@ static int _prepare_for_striping(struct ore_io_state *ios)
length -= cur_len;
si->cur_comp = (si->cur_comp + 1) % group_width;
- if (unlikely((dev == si->par_dev) ||
- (!length && ios->parity_pages))) {
- if (!length)
+ if (unlikely((dev == si->par_dev) || (!length && ios->sp2d))) {
+ if (!length && ios->sp2d) {
/* If we are writing and this is the very last
* stripe, then operate on parity dev.
*/
dev = si->par_dev;
- if (ios->reading)
+ }
+ if (ios->sp2d)
/* In writes cur_len just means if it's the
* last one. See _ore_add_parity_unit.
*/
@@ -709,6 +720,7 @@ static int _prepare_for_striping(struct ore_io_state *ios)
devs_in_group + first_dev;
/* Next stripe, start fresh */
si->cur_comp = 0;
+ si->cur_pg = 0;
}
}
out:
@@ -873,6 +885,14 @@ int ore_write(struct ore_io_state *ios)
int i;
int ret;
+ if (unlikely(ios->sp2d && !ios->r4w)) {
+ /* A library is attempting a RAID-write without providing
+ * a pages lock interface.
+ */
+ WARN_ON_ONCE(1);
+ return -ENOTSUPP;
+ }
+
ret = _prepare_for_striping(ios);
if (unlikely(ret))
return ret;
@@ -888,7 +908,7 @@ int ore_write(struct ore_io_state *ios)
}
EXPORT_SYMBOL(ore_write);
-static int _read_mirror(struct ore_io_state *ios, unsigned cur_comp)
+int _ore_read_mirror(struct ore_io_state *ios, unsigned cur_comp)
{
struct osd_request *or;
struct ore_per_dev_state *per_dev = &ios->per_dev[cur_comp];
@@ -952,7 +972,7 @@ int ore_read(struct ore_io_state *ios)
return ret;
for (i = 0; i < ios->numdevs; i += ios->layout->mirrors_p1) {
- ret = _read_mirror(ios, i);
+ ret = _ore_read_mirror(ios, i);
if (unlikely(ret))
return ret;
}
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index 8d4b93a..29c47e5 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -14,9 +14,13 @@
*/
#include <linux/gfp.h>
+#include <linux/async_tx.h>
#include "ore_raid.h"
+#undef ORE_DBGMSG2
+#define ORE_DBGMSG2 ORE_DBGMSG
+
struct page *_raid_page_alloc(void)
{
return alloc_page(GFP_KERNEL);
@@ -27,6 +31,236 @@ void _raid_page_free(struct page *p)
__free_page(p);
}
+/* This struct is forward declared in ore_io_state, but is private to here.
+ * It is put on ios->sp2d for RAID5/6 writes only. See _gen_xor_unit.
+ *
+ * __stripe_pages_2d is a 2d array of pages, and it is also a corner turn.
+ * Ascending page index access is sp2d(p-minor, c-major). But storage is
+ * sp2d[p-minor][c-major], so it can be properly presented to the async-xor
+ * API.
+ */
+struct __stripe_pages_2d {
+ /* Cache some hot path repeated calculations */
+ unsigned parity;
+ unsigned data_devs;
+ unsigned pages_in_unit;
+
+	bool needed;
+
+ /* Array size is pages_in_unit (layout->stripe_unit / PAGE_SIZE) */
+ struct __1_page_stripe {
+ bool alloc;
+ unsigned write_count;
+ struct async_submit_ctl submit;
+ struct dma_async_tx_descriptor *tx;
+
+ /* The size of this array is data_devs + parity */
+ struct page **pages;
+ struct page **scribble;
+ /* bool array, size of this array is data_devs */
+ char *page_is_read;
+ } _1p_stripes[];
+};
+
+/* This can get bigger than a page, so support multiple page allocations.
+ * _sp2d_free should be called even if _sp2d_alloc fails (by returning
+ * non-zero).
+ */
+static int _sp2d_alloc(unsigned pages_in_unit, unsigned group_width,
+ unsigned parity, struct __stripe_pages_2d **psp2d)
+{
+ struct __stripe_pages_2d *sp2d;
+ unsigned data_devs = group_width - parity;
+ struct _alloc_all_bytes {
+ struct __alloc_stripe_pages_2d {
+ struct __stripe_pages_2d sp2d;
+ struct __1_page_stripe _1p_stripes[pages_in_unit];
+ } __asp2d;
+ struct __alloc_1p_arrays {
+ struct page *pages[group_width];
+ struct page *scribble[group_width];
+ char page_is_read[data_devs];
+ } __a1pa[pages_in_unit];
+ } *_aab;
+ struct __alloc_1p_arrays *__a1pa;
+ struct __alloc_1p_arrays *__a1pa_end;
+ const unsigned sizeof__a1pa = sizeof(_aab->__a1pa[0]);
+ unsigned num_a1pa, alloc_size, i;
+
+ /* FIXME: check these numbers in ore_verify_layout */
+ BUG_ON(sizeof(_aab->__asp2d) > PAGE_SIZE);
+ BUG_ON(sizeof__a1pa > PAGE_SIZE);
+
+ if (sizeof(*_aab) > PAGE_SIZE) {
+ num_a1pa = (PAGE_SIZE - sizeof(_aab->__asp2d)) / sizeof__a1pa;
+ alloc_size = sizeof(_aab->__asp2d) + sizeof__a1pa * num_a1pa;
+ } else {
+ num_a1pa = pages_in_unit;
+ alloc_size = sizeof(*_aab);
+ }
+
+ _aab = kzalloc(alloc_size, GFP_KERNEL);
+ if (unlikely(!_aab)) {
+ ORE_DBGMSG("!! Failed to alloc sp2d size=%d\n", alloc_size);
+ return -ENOMEM;
+ }
+
+ sp2d = &_aab->__asp2d.sp2d;
+ *psp2d = sp2d; /* From here Just call _sp2d_free */
+
+ __a1pa = _aab->__a1pa;
+ __a1pa_end = __a1pa + num_a1pa;
+
+ for (i = 0; i < pages_in_unit; ++i) {
+ if (unlikely(__a1pa >= __a1pa_end)) {
+ num_a1pa = min_t(unsigned, PAGE_SIZE / sizeof__a1pa,
+ pages_in_unit - i);
+
+ __a1pa = kzalloc(num_a1pa * sizeof__a1pa, GFP_KERNEL);
+ if (unlikely(!__a1pa)) {
+ ORE_DBGMSG("!! Failed to _alloc_1p_arrays=%d\n",
+ num_a1pa);
+ return -ENOMEM;
+ }
+ __a1pa_end = __a1pa + num_a1pa;
+ /* First *pages is marked for kfree of the buffer */
+ sp2d->_1p_stripes[i].alloc = true;
+ }
+
+ sp2d->_1p_stripes[i].pages = __a1pa->pages;
+		sp2d->_1p_stripes[i].scribble = __a1pa->scribble;
+ sp2d->_1p_stripes[i].page_is_read = __a1pa->page_is_read;
+ ++__a1pa;
+ }
+
+ sp2d->parity = parity;
+ sp2d->data_devs = data_devs;
+ sp2d->pages_in_unit = pages_in_unit;
+ return 0;
+}
+
+static void _sp2d_reset(struct __stripe_pages_2d *sp2d,
+ const struct _ore_r4w_op *r4w, void *priv)
+{
+ unsigned data_devs = sp2d->data_devs;
+ unsigned group_width = data_devs + sp2d->parity;
+ unsigned p;
+
+ if (!sp2d->needed)
+ return;
+
+ for (p = 0; p < sp2d->pages_in_unit; p++) {
+ struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
+
+ if (_1ps->write_count < group_width) {
+ unsigned c;
+
+ for (c = 0; c < data_devs; c++)
+ if (_1ps->page_is_read[c]) {
+ struct page *page = _1ps->pages[c];
+
+ r4w->put_page(priv, page);
+ _1ps->page_is_read[c] = false;
+ }
+ }
+
+ memset(_1ps->pages, 0, group_width * sizeof(*_1ps->pages));
+ _1ps->write_count = 0;
+ _1ps->tx = NULL;
+ }
+
+ sp2d->needed = false;
+}
+
+static void _sp2d_free(struct __stripe_pages_2d *sp2d)
+{
+ unsigned i;
+
+ if (!sp2d)
+ return;
+
+ for (i = 0; i < sp2d->pages_in_unit; ++i) {
+ if (sp2d->_1p_stripes[i].alloc)
+ kfree(sp2d->_1p_stripes[i].pages);
+ }
+
+ kfree(sp2d);
+}
+
+static unsigned _sp2d_min_pg(struct __stripe_pages_2d *sp2d)
+{
+ unsigned p;
+
+ for (p = 0; p < sp2d->pages_in_unit; p++) {
+ struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
+
+ if (_1ps->write_count)
+ return p;
+ }
+
+ return ~0;
+}
+
+static unsigned _sp2d_max_pg(struct __stripe_pages_2d *sp2d)
+{
+	int p; /* signed, so the countdown below can terminate */
+
+ for (p = sp2d->pages_in_unit - 1; p >= 0; --p) {
+ struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
+
+ if (_1ps->write_count)
+ return p;
+ }
+
+ return ~0;
+}
+
+static void _gen_xor_unit(struct __stripe_pages_2d *sp2d)
+{
+ unsigned p;
+ for (p = 0; p < sp2d->pages_in_unit; p++) {
+ struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
+
+ if (!_1ps->write_count)
+ continue;
+
+ init_async_submit(&_1ps->submit,
+ ASYNC_TX_XOR_ZERO_DST | ASYNC_TX_ACK,
+ NULL,
+ NULL, NULL,
+ (addr_conv_t *)_1ps->scribble);
+
+ /* TODO: raid6 */
+ _1ps->tx = async_xor(_1ps->pages[sp2d->data_devs], _1ps->pages,
+ 0, sp2d->data_devs, PAGE_SIZE,
+ &_1ps->submit);
+ }
+
+ for (p = 0; p < sp2d->pages_in_unit; p++) {
+ struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
+ /* NOTE: We wait for HW synchronously (I don't have such HW
+ * to test with.) Is parallelism needed with today's multi
+ * cores?
+ */
+ async_tx_issue_pending(_1ps->tx);
+ }
+}
+
+void _ore_add_stripe_page(struct __stripe_pages_2d *sp2d,
+ struct ore_striping_info *si, struct page *page)
+{
+ struct __1_page_stripe *_1ps;
+
+ sp2d->needed = true;
+
+ _1ps = &sp2d->_1p_stripes[si->cur_pg];
+ _1ps->pages[si->cur_comp] = page;
+ ++_1ps->write_count;
+
+ si->cur_pg = (si->cur_pg + 1) % sp2d->pages_in_unit;
+ /* si->cur_comp is advanced outside at main loop */
+}
+
void _ore_add_sg_seg(struct ore_per_dev_state *per_dev, unsigned cur_len,
bool not_last)
{
@@ -76,6 +310,240 @@ void _ore_add_sg_seg(struct ore_per_dev_state *per_dev, unsigned cur_len,
}
}
+static int _alloc_read_4_write(struct ore_io_state *ios)
+{
+ struct ore_layout *layout = ios->layout;
+ int ret;
+ /* We want to only read those pages not in cache so worst case
+ * is a stripe populated with every other page
+ */
+ unsigned sgs_per_dev = ios->sp2d->pages_in_unit + 2;
+
+ ret = _ore_get_io_state(layout, ios->oc,
+ layout->group_width * layout->mirrors_p1,
+ sgs_per_dev, 0, &ios->ios_read_4_write);
+ return ret;
+}
+
+/* @si contains info of the to-be-inserted page. Update of @si should be
+ * maintained by the caller. Specifically si->dev, si->obj_offset, ...
+ */
+static int _add_to_read_4_write(struct ore_io_state *ios,
+ struct ore_striping_info *si, struct page *page)
+{
+ struct request_queue *q;
+ struct ore_per_dev_state *per_dev;
+ struct ore_io_state *read_ios;
+ unsigned first_dev = si->dev - (si->dev %
+ (ios->layout->group_width * ios->layout->mirrors_p1));
+ unsigned comp = si->dev - first_dev;
+ unsigned added_len;
+
+ if (!ios->ios_read_4_write) {
+ int ret = _alloc_read_4_write(ios);
+
+ if (unlikely(ret))
+ return ret;
+ }
+
+ read_ios = ios->ios_read_4_write;
+ read_ios->numdevs = ios->layout->group_width * ios->layout->mirrors_p1;
+
+ per_dev = &read_ios->per_dev[comp];
+ if (!per_dev->length) {
+ per_dev->bio = bio_kmalloc(GFP_KERNEL,
+ ios->sp2d->pages_in_unit);
+ if (unlikely(!per_dev->bio)) {
+ ORE_DBGMSG("Failed to allocate BIO size=%u\n",
+ ios->sp2d->pages_in_unit);
+ return -ENOMEM;
+ }
+ per_dev->offset = si->obj_offset;
+ per_dev->dev = si->dev;
+ } else if (si->obj_offset != (per_dev->offset + per_dev->length)) {
+ u64 gap = si->obj_offset - (per_dev->offset + per_dev->length);
+
+ _ore_add_sg_seg(per_dev, gap, true);
+ }
+ q = osd_request_queue(ore_comp_dev(read_ios->oc, per_dev->dev));
+ added_len = bio_add_pc_page(q, per_dev->bio, page, PAGE_SIZE, 0);
+ if (unlikely(added_len != PAGE_SIZE)) {
+ ORE_DBGMSG("Failed to bio_add_pc_page bi_vcnt=%d\n",
+ per_dev->bio->bi_vcnt);
+ return -ENOMEM;
+ }
+
+ per_dev->length += PAGE_SIZE;
+ return 0;
+}
+
+static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
+{
+ struct bio_vec *bv;
+ unsigned i, d;
+
+ /* loop on all devices all pages */
+ for (d = 0; d < ios->numdevs; d++) {
+ struct bio *bio = ios->per_dev[d].bio;
+
+ if (!bio)
+ continue;
+
+ __bio_for_each_segment(bv, bio, i, 0) {
+ struct page *page = bv->bv_page;
+
+ SetPageUptodate(page);
+ if (PageError(page))
+ ClearPageError(page);
+ }
+ }
+}
+
+/* read_4_write is hacked to read the start of the first stripe and/or
+ * the end of the last stripe, if needed, with an sg-gap at each device/page.
+ * It is assumed to be called after the to_be_written pages of the first
+ * stripe are populating ios->sp2d[][].
+ *
+ * NOTE: We call ios->r4w->get_page for all pages needed for parity
+ * calculations. These pages are held at sp2d[p].pages[c] but with
+ * sp2d[p].page_is_read[c] = true. At _sp2d_reset these pages are
+ * returned via ios->r4w->put_page. The ios->r4w->get_page might signal
+ * that the page is @uptodate=true, so we don't need to read it, only
+ * release it after IO.
+ *
+ * TODO: The read_4_write should calc a need_to_read_pages_count; if bigger
+ * than the to-be-written count, we should consider the xor-in-place mode.
+ * need_to_read_pages_count is the actual number of pages not present in cache.
+ * maybe "devs_in_group - ios->sp2d[p].write_count" is a good enough
+ * approximation? In this mode the read pages are put in the empty places of
+ * ios->sp2d[p][*], xor is calculated the same way. These pages are
+ * allocated/freed and don't go through cache
+ */
+static int _read_4_write(struct ore_io_state *ios)
+{
+ struct ore_io_state *ios_read;
+ struct ore_striping_info read_si;
+ struct __stripe_pages_2d *sp2d = ios->sp2d;
+ u64 offset = ios->si.first_stripe_start;
+ u64 last_stripe_end;
+ unsigned bytes_in_stripe = ios->si.bytes_in_stripe;
+ unsigned i, c, p, min_p = sp2d->pages_in_unit, max_p = -1;
+ int ret;
+
+ if (offset == ios->offset) /* Go to start collect $200 */
+ goto read_last_stripe;
+
+ min_p = _sp2d_min_pg(sp2d);
+ max_p = _sp2d_max_pg(sp2d);
+
+ for (c = 0; ; c++) {
+ ore_calc_stripe_info(ios->layout, offset, 0, &read_si);
+ read_si.obj_offset += min_p * PAGE_SIZE;
+ offset += min_p * PAGE_SIZE;
+ for (p = min_p; p <= max_p; p++) {
+ struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
+ struct page **pp = &_1ps->pages[c];
+ bool uptodate;
+
+ if (*pp)
+ /* to-be-written pages start here */
+ goto read_last_stripe;
+
+ *pp = ios->r4w->get_page(ios->private, offset,
+ &uptodate);
+ if (unlikely(!*pp))
+ return -ENOMEM;
+
+ if (!uptodate)
+ _add_to_read_4_write(ios, &read_si, *pp);
+
+ /* Mark read-pages to be cache_released */
+ _1ps->page_is_read[c] = true;
+ read_si.obj_offset += PAGE_SIZE;
+ offset += PAGE_SIZE;
+ }
+ offset += (sp2d->pages_in_unit - p) * PAGE_SIZE;
+ }
+
+read_last_stripe:
+ offset = ios->offset + (ios->length + PAGE_SIZE - 1) /
+ PAGE_SIZE * PAGE_SIZE;
+ last_stripe_end = div_u64(offset + bytes_in_stripe - 1, bytes_in_stripe)
+ * bytes_in_stripe;
+ if (offset == last_stripe_end) /* Optimize for the aligned case */
+ goto read_it;
+
+ ore_calc_stripe_info(ios->layout, offset, 0, &read_si);
+ p = read_si.unit_off / PAGE_SIZE;
+ c = _dev_order(ios->layout->group_width * ios->layout->mirrors_p1,
+ ios->layout->mirrors_p1, read_si.par_dev, read_si.dev);
+
+ BUG_ON(ios->si.first_stripe_start + bytes_in_stripe != last_stripe_end);
+ /* unaligned IO must be within a single stripe */
+
+ if (min_p == sp2d->pages_in_unit) {
+ /* Didn't do it yet */
+ min_p = _sp2d_min_pg(sp2d);
+ max_p = _sp2d_max_pg(sp2d);
+ }
+
+ while (offset < last_stripe_end) {
+ struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
+
+ if ((min_p <= p) && (p <= max_p)) {
+ struct page *page;
+ bool uptodate;
+
+ BUG_ON(_1ps->pages[c]);
+ page = ios->r4w->get_page(ios->private, offset,
+ &uptodate);
+ if (unlikely(!page))
+ return -ENOMEM;
+
+ _1ps->pages[c] = page;
+ /* Mark read-pages to be cache_released */
+ _1ps->page_is_read[c] = true;
+ if (!uptodate)
+ _add_to_read_4_write(ios, &read_si, page);
+ }
+
+ offset += PAGE_SIZE;
+ if (p == (sp2d->pages_in_unit - 1)) {
+ ++c;
+ p = 0;
+ ore_calc_stripe_info(ios->layout, offset, 0, &read_si);
+ } else {
+ read_si.obj_offset += PAGE_SIZE;
+ ++p;
+ }
+ }
+
+read_it:
+ ios_read = ios->ios_read_4_write;
+ if (!ios_read)
+ return 0;
+
+ /* FIXME: Ugly to signal _sbi_read_mirror that we have bio(s). Change
+ * to check for per_dev->bio
+ */
+ ios_read->pages = ios->pages;
+
+ /* Now read these devices */
+ for (i = 0; i < ios_read->numdevs; i += ios_read->layout->mirrors_p1) {
+ ret = _ore_read_mirror(ios_read, i);
+ if (unlikely(ret))
+ return ret;
+ }
+
+	ret = ore_io_execute(ios_read); /* Synchronous execution */
+ if (unlikely(ret)) {
+ ORE_DBGMSG("!! ore_io_execute => %d\n", ret);
+ return ret;
+ }
+
+ _mark_read4write_pages_uptodate(ios_read, ret);
+ return 0;
+}
+
/* In writes @cur_len means length left, i.e. cur_len==0 is the last parity U */
int _ore_add_parity_unit(struct ore_io_state *ios,
struct ore_striping_info *si,
@@ -86,42 +554,89 @@ int _ore_add_parity_unit(struct ore_io_state *ios,
BUG_ON(per_dev->cur_sg >= ios->sgs_per_dev);
_ore_add_sg_seg(per_dev, cur_len, true);
} else {
+ struct __stripe_pages_2d *sp2d = ios->sp2d;
struct page **pages = ios->parity_pages + ios->cur_par_page;
- unsigned num_pages = ios->layout->stripe_unit / PAGE_SIZE;
+ unsigned num_pages;
unsigned array_start = 0;
unsigned i;
int ret;
+ si->cur_pg = _sp2d_min_pg(sp2d);
+ num_pages = _sp2d_max_pg(sp2d) + 1 - si->cur_pg;
+
+ if (!cur_len) /* If last stripe operate on parity comp */
+ si->cur_comp = sp2d->data_devs;
+
+ if (!per_dev->length) {
+ per_dev->offset += si->cur_pg * PAGE_SIZE;
+ /* If first stripe, Read in all read4write pages
+ * (if needed) before we calculate the first parity.
+ */
+ _read_4_write(ios);
+ }
+
for (i = 0; i < num_pages; i++) {
pages[i] = _raid_page_alloc();
if (unlikely(!pages[i]))
return -ENOMEM;
++(ios->cur_par_page);
- /* TODO: only read support for now */
- clear_highpage(pages[i]);
}
- ORE_DBGMSG("writing dev=%d num_pages=%d cur_par_page=%d",
- per_dev->dev, num_pages, ios->cur_par_page);
+ BUG_ON(si->cur_comp != sp2d->data_devs);
+ BUG_ON(si->cur_pg + num_pages > sp2d->pages_in_unit);
ret = _ore_add_stripe_unit(ios, &array_start, 0, pages,
per_dev, num_pages * PAGE_SIZE);
if (unlikely(ret))
return ret;
+
+ /* TODO: raid6 if (last_parity_dev) */
+ _gen_xor_unit(sp2d);
+ _sp2d_reset(sp2d, ios->r4w, ios->private);
}
return 0;
}
int _ore_post_alloc_raid_stuff(struct ore_io_state *ios)
{
- /*TODO: Only raid writes has stuff to add here */
+ struct ore_layout *layout = ios->layout;
+
+ if (ios->parity_pages) {
+ unsigned pages_in_unit = layout->stripe_unit / PAGE_SIZE;
+ unsigned stripe_size = ios->si.bytes_in_stripe;
+ u64 last_stripe, first_stripe;
+
+ if (_sp2d_alloc(pages_in_unit, layout->group_width,
+ layout->parity, &ios->sp2d)) {
+ return -ENOMEM;
+ }
+
+ BUG_ON(ios->offset % PAGE_SIZE);
+
+		/* Round the IO down to the last full stripe */
+ first_stripe = div_u64(ios->offset, stripe_size);
+ last_stripe = div_u64(ios->offset + ios->length, stripe_size);
+
+		/* If an IO spans more than a single stripe it must end at
+		 * a stripe boundary. The remainder at the end is pushed into
+		 * the next IO.
+ */
+ if (last_stripe != first_stripe) {
+ ios->length = last_stripe * stripe_size - ios->offset;
+
+ BUG_ON(!ios->length);
+ ios->nr_pages = (ios->length + PAGE_SIZE - 1) /
+ PAGE_SIZE;
+ ios->si.length = ios->length; /*make it consistent */
+ }
+ }
return 0;
}
void _ore_free_raid_stuff(struct ore_io_state *ios)
{
- if (ios->parity_pages) { /* writing and raid */
+ if (ios->sp2d) { /* writing and raid */
unsigned i;
for (i = 0; i < ios->cur_par_page; i++) {
@@ -132,9 +647,14 @@ void _ore_free_raid_stuff(struct ore_io_state *ios)
}
if (ios->extra_part_alloc)
kfree(ios->parity_pages);
+ /* If IO returned an error pages might need unlocking */
+ _sp2d_reset(ios->sp2d, ios->r4w, ios->private);
+ _sp2d_free(ios->sp2d);
} else {
/* Will only be set if raid reading && sglist is big */
if (ios->extra_part_alloc)
kfree(ios->per_dev[0].sglist);
}
+ if (ios->ios_read_4_write)
+ ore_put_io_state(ios->ios_read_4_write);
}
diff --git a/fs/exofs/ore_raid.h b/fs/exofs/ore_raid.h
index c21080b..2ffd2c3 100644
--- a/fs/exofs/ore_raid.h
+++ b/fs/exofs/ore_raid.h
@@ -57,8 +57,23 @@ void _ore_add_sg_seg(struct ore_per_dev_state *per_dev, unsigned cur_len,
bool not_last);
int _ore_add_parity_unit(struct ore_io_state *ios, struct ore_striping_info *si,
struct ore_per_dev_state *per_dev, unsigned cur_len);
+void _ore_add_stripe_page(struct __stripe_pages_2d *sp2d,
+ struct ore_striping_info *si, struct page *page);
+static inline void _add_stripe_page(struct __stripe_pages_2d *sp2d,
+ struct ore_striping_info *si, struct page *page)
+{
+ if (!sp2d) /* Inline the fast path */
+		return;		/* No raid stuff here */
+ _ore_add_stripe_page(sp2d, si, page);
+}
/* ios.c stuff needed by ios_raid.c */
+int _ore_get_io_state(struct ore_layout *layout,
+ struct ore_components *oc, unsigned numdevs,
+ unsigned sgs_per_dev, unsigned num_par_pages,
+ struct ore_io_state **pios);
int _ore_add_stripe_unit(struct ore_io_state *ios, unsigned *cur_pg,
unsigned pgbase, struct page **pages,
struct ore_per_dev_state *per_dev, int cur_len);
+int _ore_read_mirror(struct ore_io_state *ios, unsigned cur_comp);
+int ore_io_execute(struct ore_io_state *ios);
diff --git a/include/scsi/osd_ore.h b/include/scsi/osd_ore.h
index 43821c1..f05fa82 100644
--- a/include/scsi/osd_ore.h
+++ b/include/scsi/osd_ore.h
@@ -99,11 +99,17 @@ struct ore_striping_info {
unsigned dev;
unsigned par_dev;
unsigned unit_off;
+ unsigned cur_pg;
unsigned cur_comp;
};
struct ore_io_state;
typedef void (*ore_io_done_fn)(struct ore_io_state *ios, void *private);
+struct _ore_r4w_op {
+	/* The @priv given here is passed in as ios->private */
+	struct page * (*get_page)(void *priv, u64 offset, bool *uptodate);
+ void (*put_page)(void *priv, struct page *page);
+};
struct ore_io_state {
struct kref kref;
@@ -139,6 +145,9 @@ struct ore_io_state {
unsigned max_par_pages;
unsigned cur_par_page;
unsigned sgs_per_dev;
+ struct __stripe_pages_2d *sp2d;
+ struct ore_io_state *ios_read_4_write;
+ const struct _ore_r4w_op *r4w;
/* Variable array of size numdevs */
unsigned numdevs;
--
1.7.2.3
Thanks, Benny.
Is it OK if we merge the pnfs tree and open-osd on Monday in BAT, first thing?
Tonight it's too late, and tomorrow I'm already flying. So Monday.
Have a safe trip
Boaz
On 10/14/2011 07:50 PM, Benny Halevy wrote:
> Awesome!
>
> Benny
The ORE needs to be supplied an r4w_get_page/r4w_put_page API
by the filesystem so it can get cache pages to read into when
writing partial stripes. (A toy sketch of that contract follows.)
Also I commented out and NULLed the .writepage (singular)
vector, because it gives a terrible write pattern to raid
and is apparently not needed. Even in OOM conditions the
system copes (even better) without it.
TODO: How do we specify to write_cache_pages() to start at,
or include, a certain page?
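
To show the shape of that interface, here is a toy userspace mock (the toy_*
names are invented; this is not the exofs implementation in the diff below):
get_page() returns a held page and reports whether it is uptodate, and
put_page() releases it once the parity IO is done:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_page { uint64_t index; };

struct toy_r4w_ops {
	struct toy_page *(*get_page)(void *priv, uint64_t offset,
				     bool *uptodate);
	void (*put_page)(void *priv, struct toy_page *page);
};

static struct toy_page *toy_get_page(void *priv, uint64_t offset,
				     bool *uptodate)
{
	struct toy_page *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->index = offset / 4096;
	*uptodate = (p->index % 2) == 0; /* pretend even pages are cached */
	return p;
}

static void toy_put_page(void *priv, struct toy_page *page)
{
	free(page);
}

static const struct toy_r4w_ops r4w = { toy_get_page, toy_put_page };

int main(void)
{
	uint64_t off;

	for (off = 0; off < 4 * 4096; off += 4096) {
		bool uptodate;
		struct toy_page *p = r4w.get_page(NULL, off, &uptodate);

		if (!p)
			return 1;
		printf("page %llu: %s\n", (unsigned long long)p->index,
		       uptodate ? "cached, skip the read"
				: "queue for read-4-write");
		r4w.put_page(NULL, p);
	}
	return 0;
}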
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/inode.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 59 insertions(+), 2 deletions(-)
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index 86c0ac8..3e5f3a6 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -63,6 +63,7 @@ struct page_collect {
bool read_4_write; /* This means two things: that the read is sync
* And the pages should not be unlocked.
*/
+ struct page *that_locked_page;
};
static void _pcol_init(struct page_collect *pcol, unsigned expected_pages,
@@ -81,6 +82,7 @@ static void _pcol_init(struct page_collect *pcol, unsigned expected_pages,
pcol->length = 0;
pcol->pg_first = -1;
pcol->read_4_write = false;
+ pcol->that_locked_page = NULL;
}
static void _pcol_reset(struct page_collect *pcol)
@@ -93,6 +95,7 @@ static void _pcol_reset(struct page_collect *pcol)
pcol->length = 0;
pcol->pg_first = -1;
pcol->ios = NULL;
+ pcol->that_locked_page = NULL;
/* this is probably the end of the loop but in writes
* it might not end here. don't be left with nothing
@@ -391,6 +394,8 @@ static int readpage_strip(void *data, struct page *page)
EXOFS_ERR("PageUptodate(0x%lx, 0x%lx)\n", pcol->inode->i_ino,
page->index);
+ pcol->that_locked_page = page;
+
if (page->index < end_index)
len = PAGE_CACHE_SIZE;
else if (page->index == end_index)
@@ -560,6 +565,56 @@ static void writepages_done(struct ore_io_state *ios, void *p)
EXOFS_DBGMSG2("writepages_done END\n");
}
+static struct page *__r4w_get_page(void *priv, u64 offset, bool *uptodate)
+{
+ struct page_collect *pcol = priv;
+ pgoff_t index = offset / PAGE_SIZE;
+
+ if (!pcol->that_locked_page ||
+ (pcol->that_locked_page->index != index)) {
+ struct page *page = find_get_page(pcol->inode->i_mapping, index);
+
+ if (!page) {
+ page = find_or_create_page(pcol->inode->i_mapping,
+ index, GFP_NOFS);
+ if (unlikely(!page)) {
+ EXOFS_DBGMSG("grab_cache_page Failed "
+ "index=0x%llx\n", _LLU(index));
+ return NULL;
+ }
+ unlock_page(page);
+ }
+ if (PageDirty(page) || PageWriteback(page))
+ *uptodate = true;
+ else
+ *uptodate = PageUptodate(page);
+ EXOFS_DBGMSG("index=0x%lx uptodate=%d\n", index, *uptodate);
+ return page;
+ } else {
+ EXOFS_DBGMSG("YES that_locked_page index=0x%lx\n",
+ pcol->that_locked_page->index);
+ *uptodate = true;
+ return pcol->that_locked_page;
+ }
+}
+
+static void __r4w_put_page(void *priv, struct page *page)
+{
+ struct page_collect *pcol = priv;
+
+ if (pcol->that_locked_page != page) {
+ EXOFS_DBGMSG("index=0x%lx\n", page->index);
+ page_cache_release(page);
+ return;
+ }
+ EXOFS_DBGMSG("that_locked_page index=0x%lx\n", page->index);
+}
+
+static const struct _ore_r4w_op _r4w_op = {
+ .get_page = &__r4w_get_page,
+ .put_page = &__r4w_put_page,
+};
+
static int write_exec(struct page_collect *pcol)
{
struct exofs_i_info *oi = exofs_i(pcol->inode);
@@ -589,6 +644,7 @@ static int write_exec(struct page_collect *pcol)
ios = pcol->ios;
ios->pages = pcol_copy->pages;
ios->done = writepages_done;
+ ios->r4w = &_r4w_op;
ios->private = pcol_copy;
/* pages ownership was passed to pcol_copy */
@@ -773,6 +829,7 @@ static int exofs_writepages(struct address_space *mapping,
return 0;
}
+/*
static int exofs_writepage(struct page *page, struct writeback_control *wbc)
{
struct page_collect pcol;
@@ -788,7 +845,7 @@ static int exofs_writepage(struct page *page, struct writeback_control *wbc)
return write_exec(&pcol);
}
-
+*/
/* i_mutex held using inode->i_size directly */
static void _write_failed(struct inode *inode, loff_t to)
{
@@ -894,7 +951,7 @@ static void exofs_invalidatepage(struct page *page, unsigned long offset)
const struct address_space_operations exofs_aops = {
.readpage = exofs_readpage,
.readpages = exofs_readpages,
- .writepage = exofs_writepage,
+ .writepage = NULL,
.writepages = exofs_writepages,
.write_begin = exofs_write_begin_export,
.write_end = exofs_write_end,
--
1.7.2.3
Now that we support raid5, enable it at mount. Raid6 will come next;
raid4 is not in demand, so it will probably not be enabled
(until someone wants it).
NOTE: mkfs.exofs has had support for raid5/6 for a long time.
(Making an empty raidX FS is just as easy as raid0 ;-})
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/ore.c | 14 +++++++++++---
1 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index 08ee454..fcfa86a 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -49,9 +49,17 @@ int ore_verify_layout(unsigned total_comps, struct ore_layout *layout)
{
u64 stripe_length;
-/* FIXME: Only raid0 is supported for now. */
- if (layout->raid_algorithm != PNFS_OSD_RAID_0) {
- ORE_ERR("Only RAID_0 for now\n");
+ switch (layout->raid_algorithm) {
+ case PNFS_OSD_RAID_0:
+ layout->parity = 0;
+ break;
+ case PNFS_OSD_RAID_5:
+ layout->parity = 1;
+ break;
+ case PNFS_OSD_RAID_PQ:
+ case PNFS_OSD_RAID_4:
+ default:
+ ORE_ERR("Only RAID_0/5 for now\n");
return -EINVAL;
}
if (0 != (layout->stripe_unit & ~PAGE_MASK)) {
--
1.7.2.3
This patch introduces the first stage of RAID5 support,
mainly the skip-over-raid-units when reading. For
writes it inserts BLANK units into which XOR blocks
should be calculated and written.
It introduces the new "general raid maths", and the main
additional parameters and components needed for raid5.
(A worked example of the rotation maths follows.)
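
As a worked example of those maths (a standalone userspace sketch; the
5-wide RAID5 group with parity = 1 and no mirrors is hypothetical), this
prints the backwards parity rotation per minor stripe N, following the
LCMdP/RxP formulas in the ore_calc_stripe_info() comment below:

#include <stdio.h>

static unsigned gcd(unsigned a, unsigned b)
{
	while (b) {
		unsigned t = a % b;

		a = b;
		b = t;
	}
	return a;
}

int main(void)
{
	unsigned group_width = 5, parity = 1;
	unsigned lcm = group_width / gcd(group_width, parity) * parity;
	unsigned LCMdP = lcm / parity;	/* parity cycle, in stripes */
	unsigned N, C;

	for (N = 0; N < LCMdP; N++) {
		unsigned RxP = (N % LCMdP) * parity;
		unsigned par_dev = (2 * group_width - parity - RxP) %
								group_width;

		printf("N=%u par_dev=%u data devs:", N, par_dev);
		/* C is the logical component; print where it lands */
		for (C = 0; C < group_width - parity; C++)
			printf(" %u", (group_width + C - RxP) % group_width);
		printf("\n");
	}
	return 0;
}

For N=0 parity sits on device 4 with data on 0-3; on each following stripe
the parity device steps back by one, wrapping at the group boundary.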
Since at this stage it could corrupt future versions that
actually do support raid5, the enablement of raid5
mounting and the setting of parity-count > 0 are disabled, so
the raid5 code will never be used. Mounting of raid5 is
only enabled later, once the basic XOR write is also in.
But with the patch "enable RAID5" applied, this code has
been tested to properly read raid5 volumes
and conforms to the standard.
It has also been tested that the new maths still properly
support the RAID0 and grouping code just as before.
(BTW: I have found more bugs in the pnfs-obj RAID math,
fixed here.)
The ore.c file is getting too big, so new ore_raid.[hc]
files are added that will include the special raid stuff
that is not used in striping and mirrors. With future write
support these will get bigger.
When adding ore_raid.c to the Kbuild file I was forced to
rename ore.ko to libore.ko. Is it possible to keep the source
file, say ore.c, and the module file ore.ko the same even if
there are multiple files inside ore.ko?
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/Kbuild | 3 +-
fs/exofs/ore.c | 326 ++++++++++++++++++++++++++++++++++++-----------
fs/exofs/ore_raid.c | 140 +++++++++++++++++++++
fs/exofs/ore_raid.h | 64 ++++++++++
include/scsi/osd_ore.h | 21 +++-
5 files changed, 473 insertions(+), 81 deletions(-)
create mode 100644 fs/exofs/ore_raid.c
create mode 100644 fs/exofs/ore_raid.h
diff --git a/fs/exofs/Kbuild b/fs/exofs/Kbuild
index c5a5855..352ba14 100644
--- a/fs/exofs/Kbuild
+++ b/fs/exofs/Kbuild
@@ -13,7 +13,8 @@
#
# ore module library
-obj-$(CONFIG_ORE) += ore.o
+libore-y := ore.o ore_raid.o
+obj-$(CONFIG_ORE) += libore.o
exofs-y := inode.o file.o symlink.o namei.o dir.o super.o
obj-$(CONFIG_EXOFS_FS) += exofs.o
diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index d92998d..fd6090d 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -24,24 +24,9 @@
#include <linux/slab.h>
#include <asm/div64.h>
+#include <linux/lcm.h>
-#include <scsi/osd_ore.h>
-
-#define ORE_ERR(fmt, a...) printk(KERN_ERR "ore: " fmt, ##a)
-
-#ifdef CONFIG_EXOFS_DEBUG
-#define ORE_DBGMSG(fmt, a...) \
- printk(KERN_NOTICE "ore @%s:%d: " fmt, __func__, __LINE__, ##a)
-#else
-#define ORE_DBGMSG(fmt, a...) \
- do { if (0) printk(fmt, ##a); } while (0)
-#endif
-
-/* u64 has problems with printk this will cast it to unsigned long long */
-#define _LLU(x) (unsigned long long)(x)
-
-#define ORE_DBGMSG2(M...) do {} while (0)
-/* #define ORE_DBGMSG2 ORE_DBGMSG */
+#include "ore_raid.h"
MODULE_AUTHOR("Boaz Harrosh <[email protected]>");
MODULE_DESCRIPTION("Objects Raid Engine ore.ko");
@@ -133,21 +118,81 @@ static struct osd_dev *_ios_od(struct ore_io_state *ios, unsigned index)
return ore_comp_dev(ios->oc, index);
}
-static int _get_io_state(struct ore_layout *layout,
- struct ore_components *oc, unsigned numdevs,
- struct ore_io_state **pios)
+static int _ore_get_io_state(struct ore_layout *layout,
+ struct ore_components *oc, unsigned numdevs,
+ unsigned sgs_per_dev, unsigned num_par_pages,
+ struct ore_io_state **pios)
{
struct ore_io_state *ios;
+ struct page **pages;
+ struct osd_sg_entry *sgilist;
+ struct __alloc_all_io_state {
+ struct ore_io_state ios;
+ struct ore_per_dev_state per_dev[numdevs];
+ union {
+ struct osd_sg_entry sglist[sgs_per_dev * numdevs];
+ struct page *pages[num_par_pages];
+ };
+ } *_aios;
+
+ if (likely(sizeof(*_aios) <= PAGE_SIZE)) {
+ _aios = kzalloc(sizeof(*_aios), GFP_KERNEL);
+ if (unlikely(!_aios)) {
+ ORE_DBGMSG("Failed kzalloc bytes=%zd\n",
+ sizeof(*_aios));
+ *pios = NULL;
+ return -ENOMEM;
+ }
+ pages = num_par_pages ? _aios->pages : NULL;
+ sgilist = sgs_per_dev ? _aios->sglist : NULL;
+ ios = &_aios->ios;
+ } else {
+ struct __alloc_small_io_state {
+ struct ore_io_state ios;
+ struct ore_per_dev_state per_dev[numdevs];
+ } *_aio_small;
+ union __extra_part {
+ struct osd_sg_entry sglist[sgs_per_dev * numdevs];
+ struct page *pages[num_par_pages];
+ } *extra_part;
+
+ _aio_small = kzalloc(sizeof(*_aio_small), GFP_KERNEL);
+ if (unlikely(!_aio_small)) {
+ ORE_DBGMSG("Failed alloc first part bytes=%zd\n",
+ sizeof(*_aio_small));
+ *pios = NULL;
+ return -ENOMEM;
+ }
+ extra_part = kzalloc(sizeof(*extra_part), GFP_KERNEL);
+ if (unlikely(!extra_part)) {
+ ORE_DBGMSG("Failed alloc second part bytes=%zd\n",
+ sizeof(*extra_part));
+ kfree(_aio_small);
+ *pios = NULL;
+ return -ENOMEM;
+ }
- /*TODO: Maybe use kmem_cach per sbi of size
- * exofs_io_state_size(layout->s_numdevs)
- */
- ios = kzalloc(ore_io_state_size(numdevs), GFP_KERNEL);
- if (unlikely(!ios)) {
- ORE_DBGMSG("Failed kzalloc bytes=%d\n",
- ore_io_state_size(numdevs));
- *pios = NULL;
- return -ENOMEM;
+ pages = num_par_pages ? extra_part->pages : NULL;
+ sgilist = sgs_per_dev ? extra_part->sglist : NULL;
+		/* In this case the per_dev[0].sglist holds the pointer to
+ * be freed
+ */
+ ios = &_aio_small->ios;
+ ios->extra_part_alloc = true;
+ }
+
+ if (pages) {
+ ios->parity_pages = pages;
+ ios->max_par_pages = num_par_pages;
+ }
+ if (sgilist) {
+ unsigned d;
+
+ for (d = 0; d < numdevs; ++d) {
+ ios->per_dev[d].sglist = sgilist;
+ sgilist += sgs_per_dev;
+ }
+ ios->sgs_per_dev = sgs_per_dev;
}
ios->layout = layout;
@@ -178,9 +223,42 @@ int ore_get_rw_state(struct ore_layout *layout, struct ore_components *oc,
{
struct ore_io_state *ios;
unsigned numdevs = layout->group_width * layout->mirrors_p1;
+ unsigned sgs_per_dev = 0, max_par_pages = 0;
int ret;
- ret = _get_io_state(layout, oc, numdevs, pios);
+ if (layout->parity && length) {
+ unsigned data_devs = layout->group_width - layout->parity;
+ unsigned stripe_size = layout->stripe_unit * data_devs;
+ unsigned pages_in_unit = layout->stripe_unit / PAGE_SIZE;
+ u32 remainder;
+ u64 num_stripes;
+ u64 num_raid_units;
+
+ num_stripes = div_u64_rem(length, stripe_size, &remainder);
+ if (remainder)
+ ++num_stripes;
+
+ num_raid_units = num_stripes * layout->parity;
+
+ if (is_reading) {
+ /* For reads add per_dev sglist array */
+ /* TODO: Raid 6 we need twice more. Actually:
+ * num_stripes / LCMdP(W,P);
+ * if (W%P != 0) num_stripes *= parity;
+ */
+
+ /* first/last seg is split */
+ num_raid_units += layout->group_width;
+ sgs_per_dev = div_u64(num_raid_units, data_devs);
+ } else {
+ /* For Writes add parity pages array. */
+ max_par_pages = num_raid_units * pages_in_unit *
+ sizeof(struct page *);
+ }
+ }
+
+ ret = _ore_get_io_state(layout, oc, numdevs, sgs_per_dev, max_par_pages,
+ pios);
if (unlikely(ret))
return ret;
@@ -189,10 +267,11 @@ int ore_get_rw_state(struct ore_layout *layout, struct ore_components *oc,
ios->offset = offset;
if (length) {
- ore_calc_stripe_info(layout, offset, &ios->si);
- ios->length = (length <= ios->si.group_length) ? length :
- ios->si.group_length;
+ ore_calc_stripe_info(layout, offset, length, &ios->si);
+ ios->length = ios->si.length;
ios->nr_pages = (ios->length + PAGE_SIZE - 1) / PAGE_SIZE;
+ if (layout->parity)
+ _ore_post_alloc_raid_stuff(ios);
}
return 0;
@@ -209,7 +288,7 @@ EXPORT_SYMBOL(ore_get_rw_state);
int ore_get_io_state(struct ore_layout *layout, struct ore_components *oc,
struct ore_io_state **pios)
{
- return _get_io_state(layout, oc, oc->numdevs, pios);
+ return _ore_get_io_state(layout, oc, oc->numdevs, 0, 0, pios);
}
EXPORT_SYMBOL(ore_get_io_state);
@@ -227,6 +306,7 @@ void ore_put_io_state(struct ore_io_state *ios)
bio_put(per_dev->bio);
}
+ _ore_free_raid_stuff(ios);
kfree(ios);
}
}
@@ -367,53 +447,65 @@ EXPORT_SYMBOL(ore_check_io);
/*
* L - logical offset into the file
*
- * U - The number of bytes in a stripe within a group
+ * D - number of Data devices
+ * D = group_width - parity
*
- * U = stripe_unit * group_width
+ * U - The number of bytes in a stripe within a group
+ * U = stripe_unit * D
*
* T - The number of bytes striped within a group of component objects
* (before advancing to the next group)
- *
- * T = stripe_unit * group_width * group_depth
+ * T = U * group_depth
*
* S - The number of bytes striped across all component objects
* before the pattern repeats
+ * S = T * group_count
*
- * S = stripe_unit * group_width * group_depth * group_count
- *
- * M - The "major" (i.e., across all components) stripe number
- *
+ * M - The "major" (i.e., across all components) cycle number
* M = L / S
*
- * G - Counts the groups from the beginning of the major stripe
- *
+ * G - Counts the groups from the beginning of the major cycle
* G = (L - (M * S)) / T [or (L % S) / T]
*
* H - The byte offset within the group
- *
* H = (L - (M * S)) % T [or (L % S) % T]
*
* N - The "minor" (i.e., across the group) stripe number
- *
* N = H / U
*
* C - The component index coresponding to L
*
- * C = (H - (N * U)) / stripe_unit + G * group_width
- * [or (L % U) / stripe_unit + G * group_width]
+ * C = (H - (N * U)) / stripe_unit + G * D
+ * [or (L % U) / stripe_unit + G * D]
*
* O - The component offset coresponding to L
- *
* O = L % stripe_unit + N * stripe_unit + M * group_depth * stripe_unit
+ *
+ * LCMdP – Parity cycle: Lowest Common Multiple of group_width, parity
+ * divide by parity
+ * LCMdP = lcm(group_width, parity) / parity
+ *
+ * R - The parity Rotation stripe
+ * (Note parity cycle always starts at a group's boundary)
+ * R = N % LCMdP
+ *
+ * I = the first parity device index
+ * I = (group_width + group_width - R*parity - parity) % group_width
+ *
+ * Craid - The component index Rotated
+ * Craid = (group_width + C - R*parity) % group_width
+ * (We add the group_width to avoid negative numbers modulo math)
*/
void ore_calc_stripe_info(struct ore_layout *layout, u64 file_offset,
- struct ore_striping_info *si)
+ u64 length, struct ore_striping_info *si)
{
u32 stripe_unit = layout->stripe_unit;
u32 group_width = layout->group_width;
u64 group_depth = layout->group_depth;
+ u32 parity = layout->parity;
- u32 U = stripe_unit * group_width;
+ u32 D = group_width - parity;
+ u32 U = D * stripe_unit;
u64 T = U * group_depth;
u64 S = T * layout->group_count;
u64 M = div64_u64(file_offset, S);
@@ -429,22 +521,43 @@ void ore_calc_stripe_info(struct ore_layout *layout, u64 file_offset,
u32 N = div_u64(H, U);
/* "H - (N * U)" is just "H % U" so it's bound to u32 */
- si->dev = (u32)(H - (N * U)) / stripe_unit + G * group_width;
- si->dev *= layout->mirrors_p1;
+ u32 C = (u32)(H - (N * U)) / stripe_unit + G * group_width;
div_u64_rem(file_offset, stripe_unit, &si->unit_off);
si->obj_offset = si->unit_off + (N * stripe_unit) +
(M * group_depth * stripe_unit);
- si->group_length = T - H;
+ if (parity) {
+ u32 LCMdP = lcm(group_width, parity) / parity;
+ /* R = N % LCMdP; */
+ u32 RxP = (N % LCMdP) * parity;
+ u32 first_dev = C - C % group_width;
+
+ si->par_dev = (group_width + group_width - parity - RxP) %
+ group_width + first_dev;
+ si->dev = (group_width + C - RxP) % group_width + first_dev;
+ si->bytes_in_stripe = U;
+ si->first_stripe_start = M * S + G * T + N * U;
+ } else {
+		/* Make the math correct; see _prepare_one_group */
+ si->par_dev = group_width;
+ si->dev = C;
+ }
+
+ si->dev *= layout->mirrors_p1;
+ si->par_dev *= layout->mirrors_p1;
+ si->offset = file_offset;
+ si->length = T - H;
+ if (si->length > length)
+ si->length = length;
si->M = M;
}
EXPORT_SYMBOL(ore_calc_stripe_info);
-static int _add_stripe_unit(struct ore_io_state *ios, unsigned *cur_pg,
- unsigned pgbase, struct ore_per_dev_state *per_dev,
- int cur_len)
+int _ore_add_stripe_unit(struct ore_io_state *ios, unsigned *cur_pg,
+ unsigned pgbase, struct page **pages,
+ struct ore_per_dev_state *per_dev, int cur_len)
{
unsigned pg = *cur_pg;
struct request_queue *q =
@@ -455,8 +568,11 @@ static int _add_stripe_unit(struct ore_io_state *ios, unsigned *cur_pg,
if (per_dev->bio == NULL) {
unsigned pages_in_stripe = ios->layout->group_width *
(ios->layout->stripe_unit / PAGE_SIZE);
- unsigned bio_size = (ios->nr_pages + pages_in_stripe) /
- ios->layout->group_width;
+ unsigned nr_pages = ios->nr_pages * ios->layout->group_width /
+ (ios->layout->group_width -
+ ios->layout->parity);
+ unsigned bio_size = (nr_pages + pages_in_stripe) /
+ ios->layout->group_width;
per_dev->bio = bio_kmalloc(GFP_KERNEL, bio_size);
if (unlikely(!per_dev->bio)) {
@@ -471,12 +587,13 @@ static int _add_stripe_unit(struct ore_io_state *ios, unsigned *cur_pg,
unsigned pglen = min_t(unsigned, PAGE_SIZE - pgbase, cur_len);
unsigned added_len;
- BUG_ON(ios->nr_pages <= pg);
cur_len -= pglen;
- added_len = bio_add_pc_page(q, per_dev->bio, ios->pages[pg],
+ added_len = bio_add_pc_page(q, per_dev->bio, pages[pg],
pglen, pgbase);
if (unlikely(pglen != added_len)) {
+ ORE_DBGMSG("Failed bio_add_pc_page bi_vcnt=%u\n",
+ per_dev->bio->bi_vcnt);
ret = -ENOMEM;
goto out;
}
@@ -501,9 +618,11 @@ static int _prepare_for_striping(struct ore_io_state *ios)
struct ore_striping_info *si = &ios->si;
unsigned stripe_unit = ios->layout->stripe_unit;
unsigned mirrors_p1 = ios->layout->mirrors_p1;
- unsigned devs_in_group = ios->layout->group_width * mirrors_p1;
+ unsigned group_width = ios->layout->group_width;
+ unsigned devs_in_group = group_width * mirrors_p1;
unsigned dev = si->dev;
unsigned first_dev = dev - (dev % devs_in_group);
+ unsigned dev_order;
unsigned cur_pg = ios->pages_consumed;
u64 length = ios->length;
int ret = 0;
@@ -513,7 +632,10 @@ static int _prepare_for_striping(struct ore_io_state *ios)
return 0;
}
- BUG_ON(length > si->group_length);
+ BUG_ON(length > si->length);
+
+ dev_order = _dev_order(devs_in_group, mirrors_p1, si->par_dev, dev);
+ si->cur_comp = dev_order;
while (length) {
unsigned comp = dev - first_dev;
@@ -522,17 +644,20 @@ static int _prepare_for_striping(struct ore_io_state *ios)
if (!per_dev->length) {
per_dev->dev = dev;
- if (dev < si->dev) {
- per_dev->offset = si->obj_offset + stripe_unit -
- si->unit_off;
- cur_len = stripe_unit;
- } else if (dev == si->dev) {
+ if (dev == si->dev) {
+ WARN_ON(dev == si->par_dev);
per_dev->offset = si->obj_offset;
cur_len = stripe_unit - si->unit_off;
page_off = si->unit_off & ~PAGE_MASK;
BUG_ON(page_off && (page_off != ios->pgbase));
- } else { /* dev > si->dev */
- per_dev->offset = si->obj_offset - si->unit_off;
+ } else {
+ if (si->cur_comp > dev_order)
+ per_dev->offset =
+ si->obj_offset - si->unit_off;
+ else /* si->cur_comp < dev_order */
+ per_dev->offset =
+ si->obj_offset + stripe_unit -
+ si->unit_off;
cur_len = stripe_unit;
}
} else {
@@ -541,8 +666,8 @@ static int _prepare_for_striping(struct ore_io_state *ios)
if (cur_len >= length)
cur_len = length;
- ret = _add_stripe_unit(ios, &cur_pg, page_off , per_dev,
- cur_len);
+ ret = _ore_add_stripe_unit(ios, &cur_pg, page_off, ios->pages,
+ per_dev, cur_len);
if (unlikely(ret))
goto out;
@@ -550,6 +675,41 @@ static int _prepare_for_striping(struct ore_io_state *ios)
dev = (dev % devs_in_group) + first_dev;
length -= cur_len;
+
+ si->cur_comp = (si->cur_comp + 1) % group_width;
+ if (unlikely((dev == si->par_dev) ||
+ (!length && ios->parity_pages))) {
+ if (!length)
+ /* If we are writing and this is the very last
+				 * stripe, then operate on parity dev.
+ */
+ dev = si->par_dev;
+ if (ios->reading)
+ /* In writes cur_len just means if it's the
+ * last one. See _ore_add_parity_unit.
+ */
+ cur_len = length;
+ per_dev = &ios->per_dev[dev - first_dev];
+ if (!per_dev->length) {
+ /* Only/always the parity unit of the first
+ * stripe will be empty. So this is a chance to
+ * initialize the per_dev info.
+ */
+ per_dev->dev = dev;
+ per_dev->offset = si->obj_offset - si->unit_off;
+ }
+
+ ret = _ore_add_parity_unit(ios, si, per_dev, cur_len);
+ if (unlikely(ret))
+ goto out;
+
+			/* Rotate next par_dev backwards with wrapping */
+ si->par_dev = (devs_in_group + si->par_dev -
+ ios->layout->parity * mirrors_p1) %
+ devs_in_group + first_dev;
+ /* Next stripe, start fresh */
+ si->cur_comp = 0;
+ }
}
out:
ios->numdevs = devs_in_group;
@@ -747,12 +907,24 @@ static int _read_mirror(struct ore_io_state *ios, unsigned cur_comp)
per_dev->or = or;
if (ios->pages) {
- osd_req_read(or, obj, per_dev->offset,
- per_dev->bio, per_dev->length);
+ if (per_dev->cur_sg) {
+ /* finalize the last sg_entry */
+ _ore_add_sg_seg(per_dev, 0, false);
+ if (unlikely(!per_dev->cur_sg))
+ return 0; /* Skip parity only device */
+
+ osd_req_read_sg(or, obj, per_dev->bio,
+ per_dev->sglist, per_dev->cur_sg);
+ } else {
+ /* The no raid case */
+ osd_req_read(or, obj, per_dev->offset,
+ per_dev->bio, per_dev->length);
+ }
+
ORE_DBGMSG("read(0x%llx) offset=0x%llx length=0x%llx"
- " dev=%d\n", _LLU(obj->id),
+ " dev=%d sg_len=%d\n", _LLU(obj->id),
_LLU(per_dev->offset), _LLU(per_dev->length),
- first_dev);
+ first_dev, per_dev->cur_sg);
} else {
BUG_ON(ios->kern_buff);
@@ -849,7 +1021,7 @@ static void _calc_trunk_info(struct ore_layout *layout, u64 file_offset,
{
unsigned stripe_unit = layout->stripe_unit;
- ore_calc_stripe_info(layout, file_offset, &ti->si);
+ ore_calc_stripe_info(layout, file_offset, 0, &ti->si);
ti->prev_group_obj_off = ti->si.M * stripe_unit;
ti->next_group_obj_off = ti->si.M ? (ti->si.M - 1) * stripe_unit : 0;
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
new file mode 100644
index 0000000..8d4b93a
--- /dev/null
+++ b/fs/exofs/ore_raid.c
@@ -0,0 +1,140 @@
+/*
+ * Copyright (C) 2011
+ * Boaz Harrosh <[email protected]>
+ *
+ * This file is part of the objects raid engine (ore).
+ *
+ * It is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with "ore". If not, write to the Free Software Foundation, Inc:
+ * "Free Software Foundation <[email protected]>"
+ */
+
+#include <linux/gfp.h>
+
+#include "ore_raid.h"
+
+struct page *_raid_page_alloc(void)
+{
+ return alloc_page(GFP_KERNEL);
+}
+
+void _raid_page_free(struct page *p)
+{
+ __free_page(p);
+}
+
+void _ore_add_sg_seg(struct ore_per_dev_state *per_dev, unsigned cur_len,
+ bool not_last)
+{
+ struct osd_sg_entry *sge;
+
+ ORE_DBGMSG("dev=%d cur_len=0x%x not_last=%d cur_sg=%d "
+ "offset=0x%llx length=0x%x last_sgs_total=0x%x\n",
+ per_dev->dev, cur_len, not_last, per_dev->cur_sg,
+ _LLU(per_dev->offset), per_dev->length,
+ per_dev->last_sgs_total);
+
+ if (!per_dev->cur_sg) {
+ sge = per_dev->sglist;
+
+ /* First time we prepare two entries */
+ if (per_dev->length) {
+ ++per_dev->cur_sg;
+ sge->offset = per_dev->offset;
+ sge->len = per_dev->length;
+ } else {
+ /* Here the parity is the first unit of this object.
+ * This happens every time we reach a parity device on
+ * the same stripe as the per_dev->offset. We need to
+ * just skip this unit.
+ */
+ per_dev->offset += cur_len;
+ return;
+ }
+ } else {
+ /* finalize the last one */
+ sge = &per_dev->sglist[per_dev->cur_sg - 1];
+ sge->len = per_dev->length - per_dev->last_sgs_total;
+ }
+
+ if (not_last) {
+ /* Partly prepare the next one */
+ struct osd_sg_entry *next_sge = sge + 1;
+
+ ++per_dev->cur_sg;
+ next_sge->offset = sge->offset + sge->len + cur_len;
+		/* Save cur len so we know how much was added next time */
+ per_dev->last_sgs_total = per_dev->length;
+ next_sge->len = 0;
+ } else if (!sge->len) {
+ /* Optimize for when the last unit is a parity */
+ --per_dev->cur_sg;
+ }
+}
+
+/* In writes @cur_len means length left, i.e. cur_len==0 is the last parity U */
+int _ore_add_parity_unit(struct ore_io_state *ios,
+ struct ore_striping_info *si,
+ struct ore_per_dev_state *per_dev,
+ unsigned cur_len)
+{
+ if (ios->reading) {
+ BUG_ON(per_dev->cur_sg >= ios->sgs_per_dev);
+ _ore_add_sg_seg(per_dev, cur_len, true);
+ } else {
+ struct page **pages = ios->parity_pages + ios->cur_par_page;
+ unsigned num_pages = ios->layout->stripe_unit / PAGE_SIZE;
+ unsigned array_start = 0;
+ unsigned i;
+ int ret;
+
+ for (i = 0; i < num_pages; i++) {
+ pages[i] = _raid_page_alloc();
+ if (unlikely(!pages[i]))
+ return -ENOMEM;
+
+ ++(ios->cur_par_page);
+ /* TODO: only read support for now */
+ clear_highpage(pages[i]);
+ }
+
+ ORE_DBGMSG("writing dev=%d num_pages=%d cur_par_page=%d",
+ per_dev->dev, num_pages, ios->cur_par_page);
+
+ ret = _ore_add_stripe_unit(ios, &array_start, 0, pages,
+ per_dev, num_pages * PAGE_SIZE);
+ if (unlikely(ret))
+ return ret;
+ }
+ return 0;
+}
+
+int _ore_post_alloc_raid_stuff(struct ore_io_state *ios)
+{
+ /*TODO: Only raid writes has stuff to add here */
+ return 0;
+}
+
+void _ore_free_raid_stuff(struct ore_io_state *ios)
+{
+ if (ios->parity_pages) { /* writing and raid */
+ unsigned i;
+
+ for (i = 0; i < ios->cur_par_page; i++) {
+ struct page *page = ios->parity_pages[i];
+
+ if (page)
+ _raid_page_free(page);
+ }
+ if (ios->extra_part_alloc)
+ kfree(ios->parity_pages);
+ } else {
+ /* Will only be set if raid reading && sglist is big */
+ if (ios->extra_part_alloc)
+ kfree(ios->per_dev[0].sglist);
+ }
+}
diff --git a/fs/exofs/ore_raid.h b/fs/exofs/ore_raid.h
new file mode 100644
index 0000000..c21080b
--- /dev/null
+++ b/fs/exofs/ore_raid.h
@@ -0,0 +1,64 @@
+/*
+ * Copyright (C) 2011
+ * Boaz Harrosh <[email protected]>
+ *
+ * This file is part of the objects raid engine (ore).
+ *
+ * It is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with "ore". If not, write to the Free Software Foundation, Inc:
+ * "Free Software Foundation <[email protected]>"
+ */
+
+#include <scsi/osd_ore.h>
+
+#define ORE_ERR(fmt, a...) printk(KERN_ERR "ore: " fmt, ##a)
+
+#ifdef CONFIG_EXOFS_DEBUG
+#define ORE_DBGMSG(fmt, a...) \
+ printk(KERN_NOTICE "ore @%s:%d: " fmt, __func__, __LINE__, ##a)
+#else
+#define ORE_DBGMSG(fmt, a...) \
+ do { if (0) printk(fmt, ##a); } while (0)
+#endif
+
+/* u64 has problems with printk; this will cast it to unsigned long long */
+#define _LLU(x) (unsigned long long)(x)
+
+#define ORE_DBGMSG2(M...) do {} while (0)
+/* #define ORE_DBGMSG2 ORE_DBGMSG */
+
+/* Calculate the component order in a stripe, e.g. the logical data unit
+ * address within the stripe of @dev given the @par_dev of this stripe.
+ */
+static inline unsigned _dev_order(unsigned devs_in_group, unsigned mirrors_p1,
+ unsigned par_dev, unsigned dev)
+{
+ unsigned first_dev = dev - dev % devs_in_group;
+
+ dev -= first_dev;
+ par_dev -= first_dev;
+
+ if (devs_in_group == par_dev) /* The raid 0 case */
+ return dev / mirrors_p1;
+ /* raid4/5/6 case */
+ return ((devs_in_group + dev - par_dev - mirrors_p1) % devs_in_group) /
+ mirrors_p1;
+}
+
+/* ios_raid.c stuff needed by ios.c */
+int _ore_post_alloc_raid_stuff(struct ore_io_state *ios);
+void _ore_free_raid_stuff(struct ore_io_state *ios);
+
+void _ore_add_sg_seg(struct ore_per_dev_state *per_dev, unsigned cur_len,
+ bool not_last);
+int _ore_add_parity_unit(struct ore_io_state *ios, struct ore_striping_info *si,
+ struct ore_per_dev_state *per_dev, unsigned cur_len);
+
+/* ios.c stuff needed by ios_raid.c */
+int _ore_add_stripe_unit(struct ore_io_state *ios, unsigned *cur_pg,
+ unsigned pgbase, struct page **pages,
+ struct ore_per_dev_state *per_dev, int cur_len);
diff --git a/include/scsi/osd_ore.h b/include/scsi/osd_ore.h
index a8e39d1..43821c1 100644
--- a/include/scsi/osd_ore.h
+++ b/include/scsi/osd_ore.h
@@ -40,6 +40,7 @@ struct ore_layout {
unsigned mirrors_p1;
unsigned group_width;
+ unsigned parity;
u64 group_depth;
unsigned group_count;
@@ -89,11 +90,16 @@ static inline void ore_comp_set_dev(
}
struct ore_striping_info {
+ u64 offset;
u64 obj_offset;
- u64 group_length;
+ u64 length;
+ u64 first_stripe_start; /* only used in raid writes */
u64 M; /* for truncate */
+ unsigned bytes_in_stripe;
unsigned dev;
+ unsigned par_dev;
unsigned unit_off;
+ unsigned cur_comp;
};
struct ore_io_state;
@@ -127,6 +133,13 @@ struct ore_io_state {
bool reading;
+ /* House keeping of Parity pages */
+ bool extra_part_alloc;
+ struct page **parity_pages;
+ unsigned max_par_pages;
+ unsigned cur_par_page;
+ unsigned sgs_per_dev;
+
/* Variable array of size numdevs */
unsigned numdevs;
struct ore_per_dev_state {
@@ -134,7 +147,10 @@ struct ore_io_state {
struct bio *bio;
loff_t offset;
unsigned length;
+ unsigned last_sgs_total;
unsigned dev;
+ struct osd_sg_entry *sglist;
+ unsigned cur_sg;
} per_dev[];
};
@@ -147,8 +163,7 @@ static inline unsigned ore_io_state_size(unsigned numdevs)
/* ore.c */
int ore_verify_layout(unsigned total_comps, struct ore_layout *layout);
void ore_calc_stripe_info(struct ore_layout *layout, u64 file_offset,
- struct ore_striping_info *si);
-
+ u64 length, struct ore_striping_info *si);
int ore_get_rw_state(struct ore_layout *layout, struct ore_components *comps,
bool is_reading, u64 offset, u64 length,
struct ore_io_state **ios);
--
1.7.2.3
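To make the _dev_order() helper from ore_raid.h above concrete, here is a
small user-space sketch (mine, not part of the patch) that replays its
arithmetic for a hypothetical group of 4 devices with one parity unit per
stripe and no mirroring (mirrors_p1 == 1):
---
#include <stdio.h>

static unsigned _dev_order(unsigned devs_in_group, unsigned mirrors_p1,
			   unsigned par_dev, unsigned dev)
{
	unsigned first_dev = dev - dev % devs_in_group;

	dev -= first_dev;
	par_dev -= first_dev;

	if (devs_in_group == par_dev) /* The raid 0 case */
		return dev / mirrors_p1;
	/* raid4/5/6 case */
	return ((devs_in_group + dev - par_dev - mirrors_p1) % devs_in_group) /
		mirrors_p1;
}

int main(void)
{
	unsigned par_dev, dev;

	/* For every position of the parity device, print the logical
	 * data-unit order of each device in the group. */
	for (par_dev = 0; par_dev < 4; par_dev++) {
		printf("parity on dev %u:", par_dev);
		for (dev = 0; dev < 4; dev++) {
			if (dev == par_dev)
				printf("  [P]");
			else
				printf("  %u", _dev_order(4, 1, par_dev, dev));
		}
		printf("\n");
	}
	return 0;
}
---
Running it shows that the logical data units start on the device just after
the parity and wrap around the group, which is the rotation the RAID5 read
and write paths depend on.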
To resubmit a page that went through write_cache_pages()
(received via the writepage_t callback) we need to take it
out of writeback and into set_page_dirty(). Then it will be
resubmitted just fine. Checked!
And one more fix for the ios->numdevs accounting.
Signed-off-by: Boaz Harrosh <[email protected]>
---
git diff --stat -p -M origin/linux-next 201e3d7b
fs/exofs/inode.c | 15 ++++++++++-----
fs/exofs/ore.c | 6 +-----
2 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index 69dc236..0c522c6 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -295,7 +295,7 @@ static int _maybe_not_all_in_one_io(struct ore_io_state *ios,
for (i = 0; i < pages_less; ++i)
pcol->pages[i] = *src_page++;
- EXOFS_DBGMSG("Length was adjusted nr_pages=0x%x pages_less=%d "
+ EXOFS_DBGMSG("Length was adjusted nr_pages=0x%x pages_less=0x%x "
"expected_pages=0x%x next_offset=0x%llx "
"next_len=0x%lx\n",
pcol_src->nr_pages, pages_less, pcol->expected_pages,
@@ -758,14 +758,19 @@ static int exofs_writepages(struct address_space *mapping,
if (wbc->sync_mode == WB_SYNC_ALL) {
 return write_exec(&pcol); /* pump the last remainder */
- } else {/* not SYNC let the reminder join the next writeout */
+ } else if (pcol.nr_pages) {
+ /* not SYNC, let the remainder join the next writeout */
unsigned i;
- for (i = 0; i < pcol.nr_pages; i++)
- unlock_page(pcol.pages[i]);
+ for (i = 0; i < pcol.nr_pages; i++) {
+ struct page *page = pcol.pages[i];
- return 0;
+ end_page_writeback(page);
+ set_page_dirty(page);
+ unlock_page(page);
+ }
}
+ return 0;
}
static int exofs_writepage(struct page *page, struct writeback_control *wbc)
diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index 0dafd50..3b1cc3a 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -506,7 +506,6 @@ static int _prepare_for_striping(struct ore_io_state *ios)
unsigned devs_in_group = ios->layout->group_width * mirrors_p1;
unsigned dev = si->dev;
unsigned first_dev = dev - (dev % devs_in_group);
- unsigned max_comp = ios->numdevs ? ios->numdevs - mirrors_p1 : 0;
unsigned cur_pg = ios->pages_consumed;
u64 length = ios->length;
int ret = 0;
@@ -538,9 +537,6 @@ static int _prepare_for_striping(struct ore_io_state *ios)
per_dev->offset = si->obj_offset - si->unit_off;
cur_len = stripe_unit;
}
-
- if (max_comp < comp)
- max_comp = comp;
} else {
cur_len = stripe_unit;
}
@@ -558,7 +554,7 @@ static int _prepare_for_striping(struct ore_io_state *ios)
length -= cur_len;
}
out:
- ios->numdevs = max_comp + mirrors_p1;
+ ios->numdevs = devs_in_group;
ios->pages_consumed = cur_pg;
if (unlikely(ret)) {
if (length == ios->length)
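For reference, the re-queue dance in the exofs_writepages() hunk above
reduces to three page-state calls. A minimal sketch of the pattern (the
helper name is mine, not the patch's):
---
#include <linux/mm.h>		/* set_page_dirty() */
#include <linux/pagemap.h>	/* end_page_writeback(), unlock_page() */

/* Re-queue a page that write_cache_pages() handed to our writepage_t
 * callback but that we could not submit this round: end its writeback
 * and re-dirty it so the next writeout pass picks it up again. */
static void defer_page_to_next_writeout(struct page *page)
{
	end_page_writeback(page);	/* page is no longer under IO */
	set_page_dirty(page);		/* queue it for the next pass */
	unlock_page(page);		/* drop the page lock we hold */
}
---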
Sure. Monday it is.
Benny
On Fri, Oct 14, 2011 at 3:34 PM, Boaz Harrosh <[email protected]> wrote:
> Thanks Benny.
>
> Is it OK if we merge, pnfs tree and open-osd, on Monday in BAT first thing?
> Tonight it's too late, and tomorrow I'm already flying. So Monday.
>
> Have a safe trip
> Boaz
>
> On 10/14/2011 07:50 PM, Benny Halevy wrote:
>> Awesome!
>>
>> Benny
>> -----Original Message-----
>> From: Boaz Harrosh <[email protected]>
>> Sender: [email protected]
>> Date: Fri, 14 Oct 2011 19:24:14
>> To: Welch, Brent<[email protected]>; open-osd<[email protected]>; NFS list<[email protected]>; linux-fsdevel<[email protected]>
>> Subject: [PATCHSET 0/1 0/6] ore: RAID5 Support
>>
>>
>> [cover letter quoted in full; trimmed — identical to the patchset announcement at the top of this post]
ore_calc_stripe_info is needed by exofs::export.c
for the layout calculations. Make it exportable.
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/ore.c | 8 +++-----
include/scsi/osd_ore.h | 3 +++
2 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index 3b1cc3a..d92998d 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -57,9 +57,6 @@ MODULE_LICENSE("GPL");
 * 3. Cache some heavily used calculations that will be needed by users.
*/
-static void ore_calc_stripe_info(struct ore_layout *layout, u64 file_offset,
- struct ore_striping_info *si);
-
enum { BIO_MAX_PAGES_KMALLOC =
(PAGE_SIZE - sizeof(struct bio)) / sizeof(struct bio_vec),};
@@ -409,8 +406,8 @@ EXPORT_SYMBOL(ore_check_io);
*
* O = L % stripe_unit + N * stripe_unit + M * group_depth * stripe_unit
*/
-static void ore_calc_stripe_info(struct ore_layout *layout, u64 file_offset,
- struct ore_striping_info *si)
+void ore_calc_stripe_info(struct ore_layout *layout, u64 file_offset,
+ struct ore_striping_info *si)
{
u32 stripe_unit = layout->stripe_unit;
u32 group_width = layout->group_width;
@@ -443,6 +440,7 @@ static void ore_calc_stripe_info(struct ore_layout *layout, u64 file_offset,
si->group_length = T - H;
si->M = M;
}
+EXPORT_SYMBOL(ore_calc_stripe_info);
static int _add_stripe_unit(struct ore_io_state *ios, unsigned *cur_pg,
unsigned pgbase, struct ore_per_dev_state *per_dev,
diff --git a/include/scsi/osd_ore.h b/include/scsi/osd_ore.h
index af2231a..a8e39d1 100644
--- a/include/scsi/osd_ore.h
+++ b/include/scsi/osd_ore.h
@@ -146,6 +146,9 @@ static inline unsigned ore_io_state_size(unsigned numdevs)
/* ore.c */
int ore_verify_layout(unsigned total_comps, struct ore_layout *layout);
+void ore_calc_stripe_info(struct ore_layout *layout, u64 file_offset,
+ struct ore_striping_info *si);
+
int ore_get_rw_state(struct ore_layout *layout, struct ore_components *comps,
bool is_reading, u64 offset, u64 length,
struct ore_io_state **ios);
--
1.7.2.3
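The mapping comment kept in the hunk above,
O = L % stripe_unit + N * stripe_unit + M * group_depth * stripe_unit,
is easy to sanity-check in user space. A sketch for the degenerate
single-group, no-mirrors case, where M == 0 and N is simply the stripe
number (all constants are hypothetical):
---
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint64_t stripe_unit = 0x10000;	/* 64K, hypothetical */
	const unsigned group_width = 4;		/* hypothetical */
	uint64_t L;

	for (L = 0; L < 6 * stripe_unit; L += stripe_unit + 0x2000) {
		uint64_t unit_no  = L / stripe_unit;	   /* global unit # */
		uint64_t N        = unit_no / group_width; /* stripe # */
		unsigned dev      = unit_no % group_width; /* device in group */
		uint64_t unit_off = L % stripe_unit;
		uint64_t obj_off  = unit_off + N * stripe_unit; /* M == 0 */

		printf("L=0x%llx -> dev=%u obj_offset=0x%llx\n",
		       (unsigned long long)L, dev,
		       (unsigned long long)obj_off);
	}
	return 0;
}
---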
The ORE needs to be supplied an r4w_get_page/r4w_put_page API
by the filesystem, so it can get cache pages to read into when
writing partial stripes.
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/nfs/objlayout/objio_osd.c | 38 ++++++++++++++++++++++++++++++++++++++
1 files changed, 38 insertions(+), 0 deletions(-)
diff --git a/fs/nfs/objlayout/objio_osd.c b/fs/nfs/objlayout/objio_osd.c
index 3161da6..c807ab9 100644
--- a/fs/nfs/objlayout/objio_osd.c
+++ b/fs/nfs/objlayout/objio_osd.c
@@ -459,6 +459,43 @@ static void _write_done(struct ore_io_state *ios, void *private)
objlayout_write_done(&objios->oir, status, objios->sync);
}
+static struct page *__r4w_get_page(void *priv, u64 offset, bool *uptodate)
+{
+ struct objio_state *objios = priv;
+ struct nfs_write_data *wdata = objios->oir.rpcdata;
+ pgoff_t index = offset / PAGE_SIZE;
+ struct page *page = find_get_page(wdata->inode->i_mapping, index);
+
+ if (!page) {
+ page = find_or_create_page(wdata->inode->i_mapping,
+ index, GFP_NOFS);
+ if (unlikely(!page)) {
+ dprintk("%s: grab_cache_page Failed index=0x%lx\n",
+ __func__, index);
+ return NULL;
+ }
+ unlock_page(page);
+ }
+ if (PageDirty(page) || PageWriteback(page))
+ *uptodate = true;
+ else
+ *uptodate = PageUptodate(page);
+ dprintk("%s: index=0x%lx uptodate=%d\n", __func__, index, *uptodate);
+ return page;
+}
+
+static void __r4w_put_page(void *priv, struct page *page)
+{
+ dprintk("%s: index=0x%lx\n", __func__, page->index);
+ page_cache_release(page);
+ return;
+}
+
+static const struct _ore_r4w_op _r4w_op = {
+ .get_page = &__r4w_get_page,
+ .put_page = &__r4w_put_page,
+};
+
int objio_write_pagelist(struct nfs_write_data *wdata, int how)
{
struct objio_state *objios;
@@ -472,6 +509,7 @@ int objio_write_pagelist(struct nfs_write_data *wdata, int how)
return ret;
objios->sync = 0 != (how & FLUSH_SYNC);
+ objios->ios->r4w = &_r4w_op;
if (!objios->sync)
objios->ios->done = _write_done;
--
1.7.2.3
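To show where these hooks plug in, here is a sketch of how the ORE side can
consume an _ore_r4w_op vector when it hits a hole in a partial stripe. This
is an illustration only, not the actual read-4-write engine; the helper name
is mine, and I assume ios->private carries the cookie the filesystem expects
back in its ops:
---
#include <scsi/osd_ore.h>

/* Fetch one missing page of a partial stripe through the FS-supplied
 * r4w ops. Sets *need_read when the page content must first be read
 * from the OSDs before it can take part in the XOR. */
static struct page *r4w_fetch_for_xor(struct ore_io_state *ios,
				      u64 offset, bool *need_read)
{
	bool uptodate;
	struct page *page = ios->r4w->get_page(ios->private, offset,
					       &uptodate);

	if (unlikely(!page))
		return NULL;	/* caller bails out with -ENOMEM */

	/* An uptodate cache page (also dirty or under-writeback ones,
	 * see __r4w_get_page above) is XORed as is; the rest go on the
	 * slave ore_io_state and are read in one scatter-gather IO. */
	*need_read = !uptodate;
	return page;
}
---
Pages obtained this way would later be returned with
ios->r4w->put_page(ios->private, page) once the stripe IO completes.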