2012-06-08 10:06:52

by Boaz Harrosh

[permalink] [raw]
Subject: [RFC 0/5] CRASHFIX: pnfs-obj: NONE-RPC LD must not call rpc_restart_call_prepare()


Non-RPC layout drivers call nfs_writeback_done() as part
of their IO completion (through pnfs_ld_write_done()).

Inside nfs_writeback_done() there is code that does:

	else if (resp->count < argp->count) {
		...

		/* This a short write! */
		nfs_inc_stats(inode, NFSIOS_SHORTWRITE);

		... /* Prepare the remainder */

		rpc_restart_call_prepare(task);
	}

But for non-RPC LDs (objects, blocks) there is no task->tk_ops
and this code will crash.

The same code exists in read.c::nfs_readpage_result_common(), with
the same crash.

Therefore: non-RPC LDs *must not* call pnfs_ld_read/write_done with
a partial/short IO. Either complete everything or fail everything,
and the IO will be resubmitted through the MDS.

Below is a set of patches that fixes this problem for the objects
layout; the blocks maintainer should check that the blocks code does
not allow a short IO either.

The fix is much easier at the LD level than in NFS-core. Though it
could be fixed there, I do not think it is worth it. I will submit a
comment that warns about this. Certainly the NFS-core changes could
not be as simple as the ones below, so they would not be candidates
for @Stable.

These were lightly tested and are submitted for review. I will conduct
heavy testing and will put them in linux-next only next week.

Here is the list of patches:

[PATCH 1/6] ore: Fix NFS crash by supporting any unaligned RAID IO
[PATCH 2/6] ore: Remove support of partial IO request (NFS crash)
[PATCH 3/6] pnfs-obj: Fix __r4w_get_page in the case of offset beyond i_size

	These 3 completely plug the crash above, and are destined
	for Stable@

	Trond, please ACK the 3rd patch, since I want to push this
	through my tree as these are mostly for ORE.
	(Please also review all of them; watch my back)

[PATCH 4/6] pnfs-obj: don't leak objio_state if ore_write/read fails.

	A memory leak, found along the way, in the error path.
	Should it go to Stable@?

[PATCH 5/6] NFS41: add pg_layout_private to nfs_pageio_descriptor

	This is the patch by Peng Tao <[email protected]> which is
	needed for the fix below, minus the has_moreio flag.

[PATCH 6/6] pnfs-obj: Better IO pattern in case of unaligned offset

	Though this patch is an optimization, it fixes a very bad
	IO pattern. Please consider it for the 3.5-rcX kernel and
	don't postpone it to the 3.6 merge window, if possible.

Thanks for any review
Boaz


2012-06-08 12:06:20

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 1/6] ore: Fix NFS crash by supporting any unaligned RAID IO


In RAID1/4/5/6 we used to not permit an IO whose end
byte is not stripe_size aligned and that spans more than one stripe;
i.e. the caller had to check whether fewer bytes were actually
transferred after submission, and if so resubmit an IO with the
remainder.

Exofs supports this, and NFS was supposed to support it
as well with its short-write mechanism. But recent testing has
exposed a CRASH when this is used with non-RPC layout drivers.

The change in NFS is deep and risky; in its place, the fix
in ORE that lifts the limitation is actually clean and simple.
So here it is below.

The principle here is that in the case of IO unaligned on
both ends, beginning and end, we send two read requests:
one, as in the old code, before the calculation of the first stripe,
and a new one before the calculation of the last stripe.
If either boundary is aligned, or the complete IO is within a single
stripe, we do a single read as before.

The code is clean and simple by splitting the old _read_4_write
into 3 even parts:
1. _read_4_write_first_stripe
2. _read_4_write_last_stripe
3. _read_4_write_execute

We call 1+3 at the same place as before, 2+3 before the last
stripe, and in the case of everything within a single stripe,
1+2+3 are performed together.

Why did I not think of it before? Call it a stroke of
genius: I have stared at this code for two years and did not
find this simple solution until today. Not that I did not try.

This solution is much better for NFS than the previously proposed
one, because there the short write was dealt with out-of-band after
IO_done, which would cause a seeky IO pattern, whereas here we
execute in order. In both solutions we do two separate reads; only
here we do them within a single IO request. (And we actually combine
the two writes into a single submission.)

NFS/exofs code need not change, since the ORE API communicates any
shorter length on return; what will happen is that this case simply
no longer occurs.

hurray!!

[Stable: this is an NFS bug since the 3.2 kernel; should apply cleanly]
CC: Stable Tree <[email protected]>
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/ore_raid.c | 61 ++++++++++++++++++++++++++---------------------------
1 file changed, 30 insertions(+), 31 deletions(-)

diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index d222c77..80ae08e 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -461,16 +461,12 @@ static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
* ios->sp2d[p][*], xor is calculated the same way. These pages are
* allocated/freed and don't go through cache
*/
-static int _read_4_write(struct ore_io_state *ios)
+static int _read_4_write_first_stripe(struct ore_io_state *ios)
{
- struct ore_io_state *ios_read;
struct ore_striping_info read_si;
struct __stripe_pages_2d *sp2d = ios->sp2d;
u64 offset = ios->si.first_stripe_start;
- u64 last_stripe_end;
- unsigned bytes_in_stripe = ios->si.bytes_in_stripe;
- unsigned i, c, p, min_p = sp2d->pages_in_unit, max_p = -1;
- int ret;
+ unsigned c, p, min_p = sp2d->pages_in_unit, max_p = -1;

if (offset == ios->offset) /* Go to start collect $200 */
goto read_last_stripe;
@@ -512,6 +508,18 @@ static int _read_4_write(struct ore_io_state *ios)
}

read_last_stripe:
+ return 0;
+}
+
+static int _read_4_write_last_stripe(struct ore_io_state *ios)
+{
+ struct ore_striping_info read_si;
+ struct __stripe_pages_2d *sp2d = ios->sp2d;
+ u64 offset;
+ u64 last_stripe_end;
+ unsigned bytes_in_stripe = ios->si.bytes_in_stripe;
+ unsigned c, p, min_p = sp2d->pages_in_unit, max_p = -1;
+
offset = ios->offset + ios->length;
if (offset % PAGE_SIZE)
_add_to_r4w_last_page(ios, &offset);
@@ -527,9 +535,6 @@ static int _read_4_write(struct ore_io_state *ios)
c = _dev_order(ios->layout->group_width * ios->layout->mirrors_p1,
ios->layout->mirrors_p1, read_si.par_dev, read_si.dev);

- BUG_ON(ios->si.first_stripe_start + bytes_in_stripe != last_stripe_end);
- /* unaligned IO must be within a single stripe */
-
if (min_p == sp2d->pages_in_unit) {
/* Didn't do it yet */
min_p = _sp2d_min_pg(sp2d);
@@ -568,6 +573,15 @@ static int _read_4_write(struct ore_io_state *ios)
}

read_it:
+ return 0;
+}
+
+static int _read_4_write_execute(struct ore_io_state *ios)
+{
+ struct ore_io_state *ios_read;
+ unsigned i;
+ int ret;
+
ios_read = ios->ios_read_4_write;
if (!ios_read)
return 0;
@@ -591,6 +605,8 @@ static int _read_4_write(struct ore_io_state *ios)
}

_mark_read4write_pages_uptodate(ios_read, ret);
+ ore_put_io_state(ios_read);
+ ios->ios_read_4_write = NULL; /* Might need a reuse at last stripe */
return 0;
}

@@ -626,8 +642,11 @@ int _ore_add_parity_unit(struct ore_io_state *ios,
/* If first stripe, Read in all read4write pages
* (if needed) before we calculate the first parity.
*/
- _read_4_write(ios);
+ _read_4_write_first_stripe(ios);
}
+ if (!cur_len) /* If last stripe r4w pages of last stripe */
+ _read_4_write_last_stripe(ios);
+ _read_4_write_execute(ios);

for (i = 0; i < num_pages; i++) {
pages[i] = _raid_page_alloc();
@@ -654,34 +673,14 @@ int _ore_add_parity_unit(struct ore_io_state *ios,

int _ore_post_alloc_raid_stuff(struct ore_io_state *ios)
{
- struct ore_layout *layout = ios->layout;
-
if (ios->parity_pages) {
+ struct ore_layout *layout = ios->layout;
unsigned pages_in_unit = layout->stripe_unit / PAGE_SIZE;
- unsigned stripe_size = ios->si.bytes_in_stripe;
- u64 last_stripe, first_stripe;

if (_sp2d_alloc(pages_in_unit, layout->group_width,
layout->parity, &ios->sp2d)) {
return -ENOMEM;
}
-
- /* Round io down to last full strip */
- first_stripe = div_u64(ios->offset, stripe_size);
- last_stripe = div_u64(ios->offset + ios->length, stripe_size);
-
- /* If an IO spans more then a single stripe it must end at
- * a stripe boundary. The reminder at the end is pushed into the
- * next IO.
- */
- if (last_stripe != first_stripe) {
- ios->length = last_stripe * stripe_size - ios->offset;
-
- BUG_ON(!ios->length);
- ios->nr_pages = (ios->length + PAGE_SIZE - 1) /
- PAGE_SIZE;
- ios->si.length = ios->length; /*make it consistent */
- }
}
return 0;
}
--
1.7.10.2.677.gb6bc67f



2012-06-08 12:13:22

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 3/6] pnfs-obj: Fix __r4w_get_page when offset is beyond i_size


It is very common for the end of the file to be unaligned with the
stripe size. But since we know it is beyond the file's end, the
XOR should be performed with all zeros.

Old code used to just read zeros out of the OSD devices, which is a great
waste. But what scares me more about this situation is that we now have
pages attached to the file's mapping that are beyond i_size. I don't
like the kind of bugs this calls for.

Kill both birds by returning a global zero_page if the offset is beyond
i_size.

TODO:
	Change the API to ->__r4w_get_page() so a NULL can be
	returned without being considered an error, since the XOR API
	treats NULL entries as zero_pages.

[Bug since 3.2. Should apply the same way to all kernels since then]
CC: Stable Tree <[email protected]>
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/nfs/objlayout/objio_osd.c | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/objlayout/objio_osd.c b/fs/nfs/objlayout/objio_osd.c
index b47277b..0e7c833 100644
--- a/fs/nfs/objlayout/objio_osd.c
+++ b/fs/nfs/objlayout/objio_osd.c
@@ -480,14 +480,28 @@ static void _write_done(struct ore_io_state *ios, void *private)
objlayout_write_done(&objios->oir, status, objios->sync);
}

+static struct page *g_zero_page;
+
static struct page *__r4w_get_page(void *priv, u64 offset, bool *uptodate)
{
struct objio_state *objios = priv;
struct nfs_write_data *wdata = objios->oir.rpcdata;
struct address_space *mapping = wdata->header->inode->i_mapping;
pgoff_t index = offset / PAGE_SIZE;
- struct page *page = find_get_page(mapping, index);
+ struct page *page;
+ loff_t i_size = i_size_read(wdata->header->inode);
+
+ if (offset >= i_size) {
+ if (!g_zero_page) {
+ g_zero_page = alloc_page(GFP_NOFS);
+ clear_highpage(g_zero_page);
+ }
+ *uptodate = true;
+ dprintk("%s: g_zero_page index=0x%lx\n", __func__, index);
+ return g_zero_page;
+ }

+ page = find_get_page(mapping, index);
if (!page) {
page = find_or_create_page(mapping, index, GFP_NOFS);
if (unlikely(!page)) {
@@ -507,8 +521,10 @@ static struct page *__r4w_get_page(void *priv, u64 offset, bool *uptodate)

static void __r4w_put_page(void *priv, struct page *page)
{
- dprintk("%s: index=0x%lx\n", __func__, page->index);
- page_cache_release(page);
+ dprintk("%s: index=0x%lx\n", __func__,
+ (page == g_zero_page) ? -1UL : page->index);
+ if (g_zero_page != page)
+ page_cache_release(page);
return;
}

@@ -615,6 +631,8 @@ static bool objio_pg_test(struct nfs_pageio_descriptor *pgio,
static void __exit
objlayout_exit(void)
{
+ if (g_zero_page)
+ __free_page(g_zero_page);
pnfs_unregister_layoutdriver(&objlayout_type);
printk(KERN_INFO "NFS: %s: Unregistered OSD pNFS Layout Driver\n",
__func__);
--
1.7.10.2.677.gb6bc67f



2012-06-08 12:14:42

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 4/6] pnfs-obj: don't leak objio_state if ore_write/read fails


[Bug since 3.2 Kernel]
CC: Stable Tree <[email protected]>
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/nfs/objlayout/objio_osd.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/objlayout/objio_osd.c b/fs/nfs/objlayout/objio_osd.c
index 0e7c833..422bdef 100644
--- a/fs/nfs/objlayout/objio_osd.c
+++ b/fs/nfs/objlayout/objio_osd.c
@@ -454,7 +454,10 @@ int objio_read_pagelist(struct nfs_read_data *rdata)
objios->ios->done = _read_done;
dprintk("%s: offset=0x%llx length=0x%x\n", __func__,
rdata->args.offset, rdata->args.count);
- return ore_read(objios->ios);
+ ret = ore_read(objios->ios);
+ if (unlikely(ret))
+ objio_free_result(&objios->oir);
+ return ret;
}

/*
@@ -555,8 +558,10 @@ int objio_write_pagelist(struct nfs_write_data *wdata, int how)
dprintk("%s: offset=0x%llx length=0x%x\n", __func__,
wdata->args.offset, wdata->args.count);
ret = ore_write(objios->ios);
- if (unlikely(ret))
+ if (unlikely(ret)) {
+ objio_free_result(&objios->oir);
return ret;
+ }

if (objios->sync)
_write_done(objios->ios, objios);
--
1.7.10.2.677.gb6bc67f



2012-06-08 12:09:58

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 2/6] ore: Remove support of partial IO request (NFS crash)


Due to OOM situations the ORE might fail to allocate all the resources
needed for IO of the full request. If some progress was possible
it would proceed with a partial/short request, for the sake of
forward progress.

Since this crashes NFS-core, and exofs is just fine without it, just
remove this contraption and fail outright.

TODO:
	Support real forward progress with some reserved allocations
	of resources, such as mem pools and/or bio_sets

[Bug since 3.2 Kernel]
CC: Stable Tree <[email protected]>
CC: Benny Halevy <[email protected]>
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/ore.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index 49cf230..24a49d4 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -735,13 +735,7 @@ static int _prepare_for_striping(struct ore_io_state *ios)
out:
ios->numdevs = devs_in_group;
ios->pages_consumed = cur_pg;
- if (unlikely(ret)) {
- if (length == ios->length)
- return ret;
- else
- ios->length -= length;
- }
- return 0;
+ return ret;
}

int ore_create(struct ore_io_state *ios)
--
1.7.10.2.677.gb6bc67f



2012-06-08 12:21:27

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 6/6] pnfs-obj: Better IO pattern in case of unaligned offset


Depending on layout and ARCH, ORE has some limits on max IO sizes,
which are communicated in (what else) ore_layout->max_io_length,
which is always stripe-aligned.
This was used as the pg_test boundary for splitting and starting
a new IO.

But in the case of a long IO whose start offset is not aligned,
both the end of IO[N] and the start of IO[N+1] would be unaligned,
causing the parity unit at each IO boundary to be calculated and
written twice.

So what we do in this patch is split off the very start of an unaligned
IO, up to a stripe boundary, so that subsequent IOs can continue fully
aligned until the end.

We might be sacrificing the case where a full unaligned IO would
fit within a single max_io_length, but the sacrifice is well worth
eliminating the double calculation and the extra parity-unit IO.
Actually the sacrifice is marginal and almost unmeasurable.

TODO:
	If we know, at pg_init, the total expected linear segment that
	will be received, we could use that information in many places:
	1. blocks-layout get_layout write segment size
	2. Better mds-threshold
	3. In the above situation, for a cleaner split

	I will do this in a future submission.

Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/nfs/objlayout/objio_osd.c | 55 +++++++++++++++++++++++++++++++++++++++++---
1 file changed, 52 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/objlayout/objio_osd.c b/fs/nfs/objlayout/objio_osd.c
index 422bdef..5f8c4f3 100644
--- a/fs/nfs/objlayout/objio_osd.c
+++ b/fs/nfs/objlayout/objio_osd.c
@@ -576,17 +576,66 @@ static bool objio_pg_test(struct nfs_pageio_descriptor *pgio,
return false;

return pgio->pg_count + req->wb_bytes <=
- OBJIO_LSEG(pgio->pg_lseg)->layout.max_io_length;
+ (unsigned long)pgio->pg_layout_private;
+}
+
+void objio_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
+{
+ pnfs_generic_pg_init_read(pgio, req);
+ if (unlikely(pgio->pg_lseg == NULL))
+ return; /* Not pNFS */
+
+ pgio->pg_layout_private = (void *)
+ OBJIO_LSEG(pgio->pg_lseg)->layout.max_io_length;
+}
+
+static bool aligned_on_raid_stripe(u64 offset, struct ore_layout *layout,
+ unsigned long *stripe_end)
+{
+ u32 stripe_off;
+ unsigned stripe_size;
+
+ if (layout->raid_algorithm == PNFS_OSD_RAID_0)
+ return true;
+
+ stripe_size = layout->stripe_unit *
+ (layout->group_width - layout->parity);
+
+ div_u64_rem(offset, stripe_size, &stripe_off);
+ if (!stripe_off)
+ return true;
+
+ *stripe_end = stripe_size - stripe_off;
+ return false;
+}
+
+void objio_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
+{
+ unsigned long stripe_end = 0;
+
+ pnfs_generic_pg_init_write(pgio, req);
+ if (unlikely(pgio->pg_lseg == NULL))
+ return; /* Not pNFS */
+
+ if (req->wb_offset ||
+ !aligned_on_raid_stripe(req->wb_index * PAGE_SIZE,
+ &OBJIO_LSEG(pgio->pg_lseg)->layout,
+ &stripe_end)) {
+ pgio->pg_layout_private = (void *)stripe_end;
+ } else {
+ pgio->pg_layout_private = (void *)
+ OBJIO_LSEG(pgio->pg_lseg)->layout.max_io_length;
+ }
}

static const struct nfs_pageio_ops objio_pg_read_ops = {
- .pg_init = pnfs_generic_pg_init_read,
+ .pg_init = objio_init_read,
.pg_test = objio_pg_test,
.pg_doio = pnfs_generic_pg_readpages,
};

static const struct nfs_pageio_ops objio_pg_write_ops = {
- .pg_init = pnfs_generic_pg_init_write,
+ .pg_init = objio_init_write,
.pg_test = objio_pg_test,
.pg_doio = pnfs_generic_pg_writepages,
};
--
1.7.10.2.677.gb6bc67f



2012-06-13 13:58:10

by Peng Tao

[permalink] [raw]
Subject: Re: [RFC 0/5] CRASHFIX: pnfs-obj: NONE-RPC LD must not call rpc_restart_call_prepare()

On Fri, Jun 8, 2012 at 6:06 PM, Boaz Harrosh <[email protected]> wrote:
>
> Non-RPC layout drivers call nfs_writeback_done() as part
> of their IO completion (through pnfs_ld_write_done()).
>
> Inside nfs_writeback_done() there is code that does:
>
>        else if (resp->count < argp->count) {
>                ...
>
>                /* This a short write! */
>                nfs_inc_stats(inode, NFSIOS_SHORTWRITE);
>
>                ... /* Prepare the remainder */
>
>                rpc_restart_call_prepare(task);
>        }
>
> But for non-RPC LDs (objects, blocks) there is no task->tk_ops
> and this code will crash.
>
> The same code exists in read.c::nfs_readpage_result_common(), with
> the same crash.
>
> Therefore: non-RPC LDs *must not* call pnfs_ld_read/write_done with
> a partial/short IO. Either complete everything or fail everything,
> and the IO will be resubmitted through the MDS.
>
> Below is a set of patches that fixes this problem for the objects
> layout; the blocks maintainer should check that the blocks code does
> not allow a short IO either.
>
> The fix is much easier at the LD level than in NFS-core. Though it
> could be fixed there, I do not think it is worth it. I will submit a
> comment that warns about this. Certainly the NFS-core changes could
> not be as simple as the ones below, so they would not be candidates
> for @Stable.
>
Thanks for reporting this. I didn't realize it in the first place either.

However, blocks read/write is implemented to fail any short IO and set
pnfs_error to resend it through the MDS, except for the case of reading
beyond EOF. So I don't think blocks will crash in the same case. I will
double-check with more tests though.

Cheers,
Tao

> These were lightly tested and are submitted for review. I will conduct
> heavy testing and will put them in linux-next only next week.
>
> Here is the list of patches:
>
> [PATCH 1/6] ore: Fix NFS crash by supporting any unaligned RAID IO
> [PATCH 2/6] ore: Remove support of partial IO request (NFS crash)
> [PATCH 3/6] pnfs-obj: Fix __r4w_get_page in the case of offset beyond i_size
>
>        These 3 completely plug the crash above, and are destined
>        for Stable@
>
>        Trond, please ACK the 3rd patch, since I want to push this
>        through my tree as these are mostly for ORE.
>        (Please also review all of them; watch my back)
>
> [PATCH 4/6] pnfs-obj: don't leak objio_state if ore_write/read fails.
>
>        A memory leak, found along the way, in the error path.
>        Should it go to Stable@?
>
> [PATCH 5/6] NFS41: add pg_layout_private to nfs_pageio_descriptor
>
>        This is the patch by Peng Tao <[email protected]> which is
>        needed for the fix below, minus the has_moreio flag.
>
> [PATCH 6/6] pnfs-obj: Better IO pattern in case of unaligned offset
>
>        Though this patch is an optimization, it fixes a very bad
>        IO pattern. Please consider it for the 3.5-rcX kernel and
>        don't postpone it to the 3.6 merge window, if possible.
>
> Thanks for any review
> Boaz



--
Thanks,
Tao

2012-06-08 12:16:12

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 5/6] NFS41: add pg_layout_private to nfs_pageio_descriptor

From: Peng Tao <[email protected]>

To allow a layout driver to pass private information around
pg_init/pg_test/pg_doio.

Signed-off-by: Peng Tao <[email protected]>
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/nfs/pagelist.c | 2 ++
include/linux/nfs_page.h | 1 +
include/linux/nfs_xdr.h | 1 +
3 files changed, 4 insertions(+)

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index aed913c..342ca5e 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -49,6 +49,7 @@ void nfs_pgheader_init(struct nfs_pageio_descriptor *desc,
hdr->io_start = req_offset(hdr->req);
hdr->good_bytes = desc->pg_count;
hdr->dreq = desc->pg_dreq;
+ hdr->layout_private = desc->pg_layout_private;
hdr->release = release;
hdr->completion_ops = desc->pg_completion_ops;
if (hdr->completion_ops->init_hdr)
@@ -267,6 +268,7 @@ void nfs_pageio_init(struct nfs_pageio_descriptor *desc,
desc->pg_error = 0;
desc->pg_lseg = NULL;
desc->pg_dreq = NULL;
+ desc->pg_layout_private = NULL;
}

/**
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index 88d166b..63093b1 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -69,6 +69,7 @@ struct nfs_pageio_descriptor {
const struct nfs_pgio_completion_ops *pg_completion_ops;
struct pnfs_layout_segment *pg_lseg;
struct nfs_direct_req *pg_dreq;
+ void *pg_layout_private;
};

#define NFS_WBACK_BUSY(req) (test_bit(PG_BUSY,&(req)->wb_flags))
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index ce010f6..fbbc6ca 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1246,6 +1246,7 @@ struct nfs_pgio_header {
void (*release) (struct nfs_pgio_header *hdr);
const struct nfs_pgio_completion_ops *completion_ops;
struct nfs_direct_req *dreq;
+ void *layout_private;
spinlock_t lock;
/* fields protected by lock */
int pnfs_error;
--
1.7.10.2.677.gb6bc67f



2012-06-13 13:54:20

by Peng Tao

[permalink] [raw]
Subject: Re: [PATCH 3/6] pnfs-obj: Fix __r4w_get_page when offset is beyond i_size

On Fri, Jun 8, 2012 at 8:13 PM, Boaz Harrosh <[email protected]> wrote:
>
> It is very common for the end of the file to be unaligned with the
> stripe size. But since we know it is beyond the file's end, the
> XOR should be performed with all zeros.
>
> Old code used to just read zeros out of the OSD devices, which is a great
> waste. But what scares me more about this situation is that we now have
> pages attached to the file's mapping that are beyond i_size. I don't
> like the kind of bugs this calls for.
>
> Kill both birds by returning a global zero_page if the offset is beyond
> i_size.
>
> TODO:
>        Change the API to ->__r4w_get_page() so a NULL can be
>        returned without being considered an error, since the XOR API
>        treats NULL entries as zero_pages.
>
> [Bug since 3.2. Should apply the same way to all kernels since then]
> CC: Stable Tree <[email protected]>
> Signed-off-by: Boaz Harrosh <[email protected]>
> ---
>  fs/nfs/objlayout/objio_osd.c | 24 +++++++++++++++++++++---
>  1 file changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/fs/nfs/objlayout/objio_osd.c b/fs/nfs/objlayout/objio_osd.c
> index b47277b..0e7c833 100644
> --- a/fs/nfs/objlayout/objio_osd.c
> +++ b/fs/nfs/objlayout/objio_osd.c
> @@ -480,14 +480,28 @@ static void _write_done(struct ore_io_state *ios, void *private)
>        objlayout_write_done(&objios->oir, status, objios->sync);
>  }
>
> +static struct page *g_zero_page;
> +
Since you don't attach g_zero_page to any file's address space, I
think that you can replace it with ZERO_PAGE(0), no?

Cheers,
Tao

>  static struct page *__r4w_get_page(void *priv, u64 offset, bool *uptodate)
>  {
>        struct objio_state *objios = priv;
>        struct nfs_write_data *wdata = objios->oir.rpcdata;
>        struct address_space *mapping = wdata->header->inode->i_mapping;
>        pgoff_t index = offset / PAGE_SIZE;
> -       struct page *page = find_get_page(mapping, index);
> +       struct page *page;
> +       loff_t i_size = i_size_read(wdata->header->inode);
> +
> +       if (offset >= i_size) {
> +               if (!g_zero_page) {
> +                       g_zero_page = alloc_page(GFP_NOFS);
> +                       clear_highpage(g_zero_page);
> +               }
> +               *uptodate = true;
> +               dprintk("%s: g_zero_page index=0x%lx\n", __func__, index);
> +               return g_zero_page;
> +       }
>
> +       page = find_get_page(mapping, index);
>        if (!page) {
>                page = find_or_create_page(mapping, index, GFP_NOFS);
>                if (unlikely(!page)) {
> @@ -507,8 +521,10 @@ static struct page *__r4w_get_page(void *priv, u64 offset, bool *uptodate)
>
>  static void __r4w_put_page(void *priv, struct page *page)
>  {
> -       dprintk("%s: index=0x%lx\n", __func__, page->index);
> -       page_cache_release(page);
> +       dprintk("%s: index=0x%lx\n", __func__,
> +               (page == g_zero_page) ? -1UL : page->index);
> +       if (g_zero_page != page)
> +               page_cache_release(page);
>        return;
>  }
>
> @@ -615,6 +631,8 @@ static bool objio_pg_test(struct nfs_pageio_descriptor *pgio,
>  static void __exit
>  objlayout_exit(void)
>  {
> +       if (g_zero_page)
> +               __free_page(g_zero_page);
>        pnfs_unregister_layoutdriver(&objlayout_type);
>        printk(KERN_INFO "NFS: %s: Unregistered OSD pNFS Layout Driver\n",
>               __func__);
> --
> 1.7.10.2.677.gb6bc67f
>
>

2012-07-19 09:49:35

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [RFC 0/5] CRASHFIX: pnfs-obj: NONE-RPC LD must not call rpc_restart_call_prepare()

On 07/19/2012 02:07 AM, Myklebust, Trond wrote:

>> NFS41: add pg_layout_private to nfs_pageio_descriptor
>
> This one is the only patch that really affects the core NFS code, but
> I'm having trouble seeing what is defining hdr->layout_private in this
> patch series, and where it is being used.
>
> Am I missing some dependencies here?
>


No! It's all here:
[patch 5/6] NFS41: add pg_layout_private to nfs_pageio_descriptor

Just adds a single hdr->layout_private member
and:
[PATCH 6/6] pnfs-obj: Better IO pattern in case of unaligned offset

Uses the above member.

In any case, I have new versions of these for the merge window; I'll
send them tonight.

Thanks
Boaz

2012-07-12 14:11:52

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [RFC 0/5] CRASHFIX: pnfs-obj: NONE-RPC LD must not call rpc_restart_call_prepare()

On 06/08/2012 01:06 PM, Boaz Harrosh wrote:

>
> Non-RPC layout drivers call nfs_writeback_done() as part
> of their IO completion (through pnfs_ld_write_done()).
>
> Inside nfs_writeback_done() there is code that does:
>
> 	else if (resp->count < argp->count) {
> 		...
>
> 		/* This a short write! */
> 		nfs_inc_stats(inode, NFSIOS_SHORTWRITE);
>
> 		... /* Prepare the remainder */
>
> 		rpc_restart_call_prepare(task);
> 	}
>
> But for non-RPC LDs (objects, blocks) there is no task->tk_ops
> and this code will crash.
>

<snip>

Trond hi

Sorry for the late response; I was sick (in the hospital and away).

I must push these fixes to Linus ASAP; I want to push them tomorrow.
They are for 3.5-rc7 and CCed to stable@.

I would also love to push the objlayout patches in the same push.
Please give me your blessing and ACK so I can do this.

I have done a small rebase over 3.5-rc5 and a few cleanups, mainly
addressing Peng's comment about ZERO_PAGE.

You can see the pending push request here:
http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/for-linus

This is the list of patches:

ore: Fix NFS crash by supporting any unaligned RAID IO
ore: Remove support of partial IO request (NFS crash)
ore: Unlock r4w pages in exact reverse order of locking

These above are ORE patches that actually fix the NFS CRASH

pnfs-obj: don't leak objio_state if ore_write/read ...

This above fixes an important mem leak in the error case

pnfs-obj: Fix __r4w_get_page when offset is beyond ...
NFS41: add pg_layout_private to nfs_pageio_descriptor
pnfs-obj: Better IO pattern in case of unaligned offset for-linus

I would also love to push these 3; please advise.

Please look into this ASAP, as we are so late already because of my
absence. If you inspect them carefully you will see that other than
the fixes they are low risk. And I have tested them extensively.

(I pushed it to linux-next and will let it cook for 48 hours)

Thanks for your help
Boaz

2012-07-18 23:07:34

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [RFC 0/5] CRASHFIX: pnfs-obj: NONE-RPC LD must not call rpc_restart_call_prepare()

On Thu, 2012-07-12 at 17:11 +0300, Boaz Harrosh wrote:
> On 06/08/2012 01:06 PM, Boaz Harrosh wrote:
>
> >
> > Non-RPC layout drivers call nfs_writeback_done() as part
> > of their IO completion (through pnfs_ld_write_done()).
> >
> > Inside nfs_writeback_done() there is code that does:
> >
> > 	else if (resp->count < argp->count) {
> > 		...
> >
> > 		/* This a short write! */
> > 		nfs_inc_stats(inode, NFSIOS_SHORTWRITE);
> >
> > 		... /* Prepare the remainder */
> >
> > 		rpc_restart_call_prepare(task);
> > 	}
> >
> > But for non-RPC LDs (objects, blocks) there is no task->tk_ops
> > and this code will crash.
> >
>
> <snip>
>
> Trond hi
>
> Sorry for the late response; I was sick (in the hospital and away).

I'm very sorry to hear that. I really hope you're feeling better now...

> I must push these fixes to Linus ASAP; I want to push them tomorrow.
> They are for 3.5-rc7 and CCed to stable@.
>
> I would also love to push the objlayout patches in the same push.
> Please give me your blessing and ACK so I can do this.
>
> I have done a small rebase over 3.5-rc5 and a few cleanups, mainly
> addressing Peng's comment about ZERO_PAGE.
>
> You can see the pending push request here:
> http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/for-linus
>
> This is the list of patches:
>
> ore: Fix NFS crash by supporting any unaligned RAID IO
> ore: Remove support of partial IO request (NFS crash)
> ore: Unlock r4w pages in exact reverse order of locking
>
> 	These above are ORE patches that actually fix the NFS CRASH
>
> pnfs-obj: don't leak objio_state if ore_write/read ...
>
> 	This above fixes an important mem leak in the error case
>
> pnfs-obj: Fix __r4w_get_page when offset is beyond ...
> NFS41: add pg_layout_private to nfs_pageio_descriptor

This one is the only patch that really affects the core NFS code, but
I'm having trouble seeing what is defining hdr->layout_private in this
patch series, and where it is being used.

Am I missing some dependencies here?

> pnfs-obj: Better IO pattern in case of unaligned offset  for-linus
>
> 	I would also love to push these 3; please advise.
>
> Please look into this ASAP, as we are so late already because of my
> absence. If you inspect them carefully you will see that other than
> the fixes they are low risk. And I have tested them extensively.
>
> (I pushed it to linux-next and will let it cook for 48 hours)
>
> Thanks for your help
> Boaz

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
www.netapp.com