2011-11-09 15:16:55

by Peng Tao

[permalink] [raw]
Subject: [PATCH 0/8] pnfsblock cleanup and fixes

Hi, Trond and Benny,

Below are some pnfsblock patches that I have tested for days. Most of them
are cleanup and small improvements. Only
[PATCH 2/8] pnfsblock: acquire im_lock in _preload_range
needs to go into stable.

Thanks,
Tao

Peng Tao (8):
pnfsblock: cleanup bl_mark_sectors_init
pnfsblock: acquire im_lock in _preload_range
pnfsblock: move find lock page logic out of bl_write_pagelist
pnfsblock: set read/write tk_status to pnfs_error
pnfsblock: remove rpc_call_ops from struct parallel_io
pnfsblock: clean up _add_entry
pnfsblock: add im_extents to pnfs_inval_markings
pnfsblock: alloc short extent before submit bio

fs/nfs/blocklayout/blocklayout.c | 176 +++++++++++++++++++++++++-------------
fs/nfs/blocklayout/blocklayout.h | 11 ++-
fs/nfs/blocklayout/extents.c | 146 ++++++++++++--------------------
3 files changed, 180 insertions(+), 153 deletions(-)



2011-11-10 09:08:32

by Peng, Tao

[permalink] [raw]
Subject: RE: [PATCH 7/8] pnfsblock: add im_extents to pnfs_inval_markings


> -----Original Message-----
> From: Benny Halevy [mailto:[email protected]]
> Sent: Thursday, November 10, 2011 4:54 PM
> To: Peng Tao
> Cc: [email protected]; [email protected]; Peng, Tao
> Subject: Re: [PATCH 7/8] pnfsblock: add im_extents to pnfs_inval_markings
>
> On 2011-11-09 17:16, Peng Tao wrote:
> > It stores a list of short extents for INVAL->RW conversion.
> > Also add two functions to manipulate them, in preparation to
> > move malloc logic out of end_io.
> >
> > Signed-off-by: Peng Tao <[email protected]>
> > ---
> > fs/nfs/blocklayout/blocklayout.c | 6 ++++++
> > fs/nfs/blocklayout/blocklayout.h | 5 +++++
> > fs/nfs/blocklayout/extents.c | 37
> +++++++++++++++++++++++++++++++++++++
> > 3 files changed, 48 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> > index 815c0c3..cb4ff0f 100644
> > --- a/fs/nfs/blocklayout/blocklayout.c
> > +++ b/fs/nfs/blocklayout/blocklayout.c
> > @@ -706,11 +706,17 @@ static void
> > release_inval_marks(struct pnfs_inval_markings *marks)
> > {
> > struct pnfs_inval_tracking *pos, *temp;
> > + struct pnfs_block_short_extent *se, *stemp;
> >
> > list_for_each_entry_safe(pos, temp, &marks->im_tree.mtt_stub, it_link) {
> > list_del(&pos->it_link);
> > kfree(pos);
> > }
> > +
> > + list_for_each_entry_safe(se, stemp, &marks->im_extents, bse_node) {
> > + list_del(&se->bse_node);
> > + kfree(se);
> > + }
> > return;
> > }
> >
> > diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
> > index 60728ac..df0e0fb 100644
> > --- a/fs/nfs/blocklayout/blocklayout.h
> > +++ b/fs/nfs/blocklayout/blocklayout.h
> > @@ -70,6 +70,7 @@ struct pnfs_inval_markings {
> > spinlock_t im_lock;
> > struct my_tree im_tree; /* Sectors that need LAYOUTCOMMIT */
> > sector_t im_block_size; /* Server blocksize in sectors */
> > + struct list_head im_extents; /* List of short extents for INVAL->RW conversion
> */
> > };
> >
> > struct pnfs_inval_tracking {
> > @@ -105,6 +106,7 @@ BL_INIT_INVAL_MARKS(struct pnfs_inval_markings
> *marks, sector_t blocksize)
> > {
> > spin_lock_init(&marks->im_lock);
> > INIT_LIST_HEAD(&marks->im_tree.mtt_stub);
> > + INIT_LIST_HEAD(&marks->im_extents);
> > marks->im_block_size = blocksize;
> > marks->im_tree.mtt_step_size = min((sector_t)PAGE_CACHE_SECTORS,
> > blocksize);
> > @@ -200,5 +202,8 @@ int bl_add_merge_extent(struct pnfs_block_layout *bl,
> > struct pnfs_block_extent *new);
> > int bl_mark_for_commit(struct pnfs_block_extent *be,
> > sector_t offset, sector_t length);
> > +int bl_push_one_short_extent(struct pnfs_inval_markings *marks);
> > +struct pnfs_block_short_extent*
> > +bl_pop_short_extent(struct pnfs_inval_markings *marks, int num_to_pop);
> >
> > #endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
> > diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
> > index 952ea8a..72c7fa1 100644
> > --- a/fs/nfs/blocklayout/extents.c
> > +++ b/fs/nfs/blocklayout/extents.c
> > @@ -863,3 +863,40 @@ clean_pnfs_block_layoutupdate(struct
> pnfs_block_layout *bl,
> > }
> > }
> > }
> > +
> > +int
> > +bl_push_one_short_extent(struct pnfs_inval_markings *marks) {
> > + struct pnfs_block_short_extent *new;
> > +
> > + new = kmalloc(sizeof(*new), GFP_NOFS);
> > + if (unlikely(!new))
> > + return -ENOMEM;
> > +
> > + spin_lock(&marks->im_lock);
> > + list_add(&new->bse_node, &marks->im_extents);
> > + spin_unlock(&marks->im_lock);
> > +
> > + return 0;
> > +}
> > +
> > +struct pnfs_block_short_extent*
> > +bl_pop_short_extent(struct pnfs_inval_markings *marks, int num_to_pop) {
> > + struct pnfs_block_short_extent *rv = NULL;
> > +
> > + if (unlikely(num_to_pop <= 0))
> > + return rv;
>
> How unlikely is it?
> Is doing the extra compare really worth saving the spin_lock?
Never... I should really replace it with a BUG_ON.

>
> > +
> > + spin_lock(&marks->im_lock);
> > + while (!list_empty(&marks->im_extents) && num_to_pop-- > 0) {
> > + rv = list_entry((&marks->im_extents)->next,
> > + struct pnfs_block_short_extent, bse_node);
> > + list_del_init(&rv->bse_node);
> > + if (num_to_pop)
> > + kfree(rv);
>
> Please correct me if I'm wrong, you don't want to free the last element
> you pop since you want to return it. This is worth a comment...
Yes, you are right. Will add some comments above the function.

>
> I'd consider moving the decrement expression down here or
> changing the loop to be a for loop to improve its readability.
> In the latter case this will say if (num_to_pop > 1) kfree(rv)
> which is more straightforward IMHO.
How about the following?

BUG_ON(num_to_pop <= 0);

list_for_each_entry_safe() {
	list_del_init(&rv->bse_node);
	if (num_to_pop-- > 1)
		kfree(rv);
}

Thanks,
Tao

>
> Benny
>
> > + }
> > + spin_unlock(&marks->im_lock);
> > +
> > + BUG_ON(num_to_pop > 0);
> > +
> > + return rv;
> > +}


2011-11-10 08:51:26

by Peng, Tao

[permalink] [raw]
Subject: RE: [PATCH 0/8] pnfsblock cleanup and fixes

> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Benny Halevy
> Sent: Thursday, November 10, 2011 4:38 PM
> To: Peng Tao
> Cc: [email protected]; [email protected]
> Subject: Re: [PATCH 0/8] pnfsblock cleanup and fixes
>
> On 2011-11-09 17:15, Peng Tao wrote:
> > Hi, Trond and Benny,
> >
> > Below are some pnfsblock patches that I have tested for days. Most of them
> > are cleanup and small improvements. Only
> > [PATCH 2/8] pnfsblock: acquire im_lock in _preload_range
> > needs to go into stable.
> >
> > Thanks,
> > Tao
> >
> > Peng Tao (8):
> > pnfsblock: cleanup bl_mark_sectors_init
> > pnfsblock: acquire im_lock in _preload_range
>
> I can't find this patch in my mailbox.
> It looks like you did not send it...
> http://thread.gmane.org/gmane.linux.nfs/44761
I have it in both of my mailbox. Not sure why linux-nfs drops it... Anyway, I will resend it.

Thanks,
Tao
>
> Benny
>
> > pnfsblock: move find lock page logic out of bl_write_pagelist
> > pnfsblock: set read/write tk_status to pnfs_error
> > pnfsblock: remove rpc_call_ops from struct parallel_io
> > pnfsblock: clean up _add_entry
> > pnfsblock: add im_extents to pnfs_inval_markings
> > pnfsblock: alloc short extent before submit bio
> >
> > fs/nfs/blocklayout/blocklayout.c | 176
> +++++++++++++++++++++++++-------------
> > fs/nfs/blocklayout/blocklayout.h | 11 ++-
> > fs/nfs/blocklayout/extents.c | 146 ++++++++++++--------------------
> > 3 files changed, 180 insertions(+), 153 deletions(-)
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2011-11-09 15:17:22

by Peng Tao

[permalink] [raw]
Subject: [PATCH 1/8] pnfsblock: cleanup bl_mark_sectors_init

It does not need to manipulate partially initialized blocks.
The writeback code takes care of them.

Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 6 +--
fs/nfs/blocklayout/blocklayout.h | 3 +-
fs/nfs/blocklayout/extents.c | 76 ++-----------------------------------
3 files changed, 8 insertions(+), 77 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 281ae95..4ced0b0 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -571,8 +571,7 @@ fill_invalid_ext:
unlock_page(page);

ret = bl_mark_sectors_init(be->be_inval, isect,
- PAGE_CACHE_SECTORS,
- NULL);
+ PAGE_CACHE_SECTORS);
if (unlikely(ret)) {
dprintk("%s bl_mark_sectors_init fail %d\n",
__func__, ret);
@@ -621,8 +620,7 @@ next_page:
}
if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
ret = bl_mark_sectors_init(be->be_inval, isect,
- PAGE_CACHE_SECTORS,
- NULL);
+ PAGE_CACHE_SECTORS);
if (unlikely(ret)) {
dprintk("%s bl_mark_sectors_init fail %d\n",
__func__, ret);
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 42acf7e..60728ac 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -186,8 +186,7 @@ struct pnfs_block_extent *
bl_find_get_extent(struct pnfs_block_layout *bl, sector_t isect,
struct pnfs_block_extent **cow_read);
int bl_mark_sectors_init(struct pnfs_inval_markings *marks,
- sector_t offset, sector_t length,
- sector_t **pages);
+ sector_t offset, sector_t length);
void bl_put_extent(struct pnfs_block_extent *be);
struct pnfs_block_extent *bl_alloc_extent(void);
int bl_is_sector_init(struct pnfs_inval_markings *marks, sector_t isect);
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 19fa7b0..0fc1321 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -179,33 +179,6 @@ static int _preload_range(struct my_tree *tree, u64 offset, u64 length)
return status;
}

-static void set_needs_init(sector_t *array, sector_t offset)
-{
- sector_t *p = array;
-
- dprintk("%s enter\n", __func__);
- if (!p)
- return;
- while (*p < offset)
- p++;
- if (*p == offset)
- return;
- else if (*p == ~0) {
- *p++ = offset;
- *p = ~0;
- return;
- } else {
- sector_t *save = p;
- dprintk("%s Adding %llu\n", __func__, (u64)offset);
- while (*p != ~0)
- p++;
- p++;
- memmove(save + 1, save, (char *)p - (char *)save);
- *save = offset;
- return;
- }
-}
-
/* We are relying on page lock to serialize this */
int bl_is_sector_init(struct pnfs_inval_markings *marks, sector_t isect)
{
@@ -261,28 +234,15 @@ static int is_range_written(struct pnfs_inval_markings *marks,

/* Marks sectors in [offest, offset_length) as having been initialized.
* All lengths are step-aligned, where step is min(pagesize, blocksize).
- * Notes where partial block is initialized, and helps prepare it for
- * complete initialization later.
+ * Currently assumes offset is page-aligned
*/
-/* Currently assumes offset is page-aligned */
int bl_mark_sectors_init(struct pnfs_inval_markings *marks,
- sector_t offset, sector_t length,
- sector_t **pages)
+ sector_t offset, sector_t length)
{
- sector_t s, start, end;
- sector_t *array = NULL; /* Pages to mark */
+ sector_t start, end;

dprintk("%s(offset=%llu,len=%llu) enter\n",
__func__, (u64)offset, (u64)length);
- s = max((sector_t) 3,
- 2 * (marks->im_block_size / (PAGE_CACHE_SECTORS)));
- dprintk("%s set max=%llu\n", __func__, (u64)s);
- if (pages) {
- array = kmalloc(s * sizeof(sector_t), GFP_NOFS);
- if (!array)
- goto outerr;
- array[0] = ~0;
- }

start = normalize(offset, marks->im_block_size);
end = normalize_up(offset + length, marks->im_block_size);
@@ -290,41 +250,15 @@ int bl_mark_sectors_init(struct pnfs_inval_markings *marks,
goto outerr;

spin_lock(&marks->im_lock);
-
- for (s = normalize_up(start, PAGE_CACHE_SECTORS);
- s < offset; s += PAGE_CACHE_SECTORS) {
- dprintk("%s pre-area pages\n", __func__);
- /* Portion of used block is not initialized */
- if (!_has_tag(&marks->im_tree, s, EXTENT_INITIALIZED))
- set_needs_init(array, s);
- }
if (_set_range(&marks->im_tree, EXTENT_INITIALIZED, offset, length))
goto out_unlock;
- for (s = normalize_up(offset + length, PAGE_CACHE_SECTORS);
- s < end; s += PAGE_CACHE_SECTORS) {
- dprintk("%s post-area pages\n", __func__);
- if (!_has_tag(&marks->im_tree, s, EXTENT_INITIALIZED))
- set_needs_init(array, s);
- }
-
spin_unlock(&marks->im_lock);

- if (pages) {
- if (array[0] == ~0) {
- kfree(array);
- *pages = NULL;
- } else
- *pages = array;
- }
return 0;

- out_unlock:
+out_unlock:
spin_unlock(&marks->im_lock);
- outerr:
- if (pages) {
- kfree(array);
- *pages = NULL;
- }
+outerr:
return -ENOMEM;
}

--
1.7.1.262.g5ef3d


2011-11-10 08:50:03

by Peng, Tao

[permalink] [raw]
Subject: RE: [PATCH 3/8] pnfsblock: move find lock page logic out of bl_write_pagelist


> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Benny Halevy
> Sent: Thursday, November 10, 2011 4:32 PM
> To: Peng Tao
> Cc: [email protected]; [email protected]; Peng, Tao
> Subject: Re: [PATCH 3/8] pnfsblock: move find lock page logic out of
> bl_write_pagelist
>
> On 2011-11-09 17:16, Peng Tao wrote:
> > Also avoid unnecessary lock_page if page is handled by others.
> >
> > Signed-off-by: Peng Tao <[email protected]>
> > ---
> > fs/nfs/blocklayout/blocklayout.c | 78
> ++++++++++++++++++++++++++------------
> > 1 files changed, 54 insertions(+), 24 deletions(-)
> >
> > diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> > index 4ced0b0..e8e13f3 100644
> > --- a/fs/nfs/blocklayout/blocklayout.c
> > +++ b/fs/nfs/blocklayout/blocklayout.c
> > @@ -484,6 +484,56 @@ cleanup:
> > return ret;
> > }
> >
> > +/* Find or create a zeroing page marked being writeback.
> > + * Return ERR_PTR on error, NULL to indicate skip this page and page itself
> > + * to indicate write out.
> > + */
> > +static struct page*
>
> coding style nit: "page *" rather than "page*"
>
> > +bl_find_get_zeroing_page(struct inode *inode, pgoff_t index,
> > + struct pnfs_block_extent *cow_read)
> > +{
> > + struct page* page;
>
> checkpatch nit:
> ERROR: "foo* bar" should be "foo *bar"
Thank you! Will change both.


>
> > + int locked = 0;
> > + page = find_get_page(inode->i_mapping, index);
> > + if (page)
> > + goto check_page;
> > +
> > + page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
> > + if (unlikely(!page)) {
> > + dprintk("%s oom\n", __func__);
> > + return ERR_PTR(-ENOMEM);
> > + }
> > + locked = 1;
>
> Why not just call find_or_create_page in the first place?
> dealing with both the locked and unlocked cases here just
> seems to add unneeded complexity.
It is to avoid locking the page when possible. find_or_create_page() calls find_lock_page(), but there are cases where the page is already locked by someone else (e.g., in nfs_writepage_locked()) and the current process has to wait. Pressure tests show a lot of such contention, so I want to optimize it out.

Thanks,
Tao
>
> Benny
>
> > +
> > +check_page:
> > + /* PageDirty: Other will write this out
> > + * PageWriteback: Other is writing this out
> > + * PageUptodate: It was read before
> > + */
> > + if (PageDirty(page) || PageWriteback(page)) {
> > + print_page(page);
> > + if (locked)
> > + unlock_page(page);
> > + page_cache_release(page);
> > + return NULL;
> > + }
> > +
> > + if (!locked) {
> > + lock_page(page);
> > + if (PageDirty(page) || PageWriteback(page))
> > + unlock_page(page);
> > + return NULL;
> > + }
> > + if (!PageUptodate(page)) {
> > + /* New page, readin or zero it */
> > + init_page_for_write(page, cow_read);
> > + }
> > + set_page_writeback(page);
> > + unlock_page(page);
> > +
> > + return page;
> > +}
> > +
> > static enum pnfs_try_status
> > bl_write_pagelist(struct nfs_write_data *wdata, int sync)
> > {
> > @@ -543,32 +593,12 @@ fill_invalid_ext:
> > dprintk("%s zero %dth page: index %lu isect %llu\n",
> > __func__, npg_zero, index,
> > (unsigned long long)isect);
> > - page =
> > - find_or_create_page(wdata->inode->i_mapping, index,
> > - GFP_NOFS);
> > - if (!page) {
> > - dprintk("%s oom\n", __func__);
> > - wdata->pnfs_error = -ENOMEM;
> > + page = bl_find_get_zeroing_page(wdata->inode, index, cow_read);
> > + if (unlikely(IS_ERR(page))) {
> > + wdata->pnfs_error = PTR_ERR(page);
> > goto out;
> > - }
> > -
> > - /* PageDirty: Other will write this out
> > - * PageWriteback: Other is writing this out
> > - * PageUptodate: It was read before
> > - * sector_initialized: already written out
> > - */
> > - if (PageDirty(page) || PageWriteback(page)) {
> > - print_page(page);
> > - unlock_page(page);
> > - page_cache_release(page);
> > + } else if (page == NULL)
> > goto next_page;
> > - }
> > - if (!PageUptodate(page)) {
> > - /* New page, readin or zero it */
> > - init_page_for_write(page, cow_read);
> > - }
> > - set_page_writeback(page);
> > - unlock_page(page);
> >
> > ret = bl_mark_sectors_init(be->be_inval, isect,
> > PAGE_CACHE_SECTORS);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2011-11-09 15:17:50

by Peng Tao

[permalink] [raw]
Subject: [PATCH 5/8] pnfsblock: remove rpc_call_ops from struct parallel_io

The block layout can just make use of the generic read/write_done.

Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 13 -------------
1 files changed, 0 insertions(+), 13 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index cc13cb0..815c0c3 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -90,7 +90,6 @@ static int is_writable(struct pnfs_block_extent *be, sector_t isect)
*/
struct parallel_io {
struct kref refcnt;
- struct rpc_call_ops call_ops;
void (*pnfs_callback) (void *data);
void *data;
};
@@ -221,14 +220,6 @@ bl_end_par_io_read(void *data)
schedule_work(&rdata->task.u.tk_work);
}

-/* We don't want normal .rpc_call_done callback used, so we replace it
- * with this stub.
- */
-static void bl_rpc_do_nothing(struct rpc_task *task, void *calldata)
-{
- return;
-}
-
static enum pnfs_try_status
bl_read_pagelist(struct nfs_read_data *rdata)
{
@@ -248,8 +239,6 @@ bl_read_pagelist(struct nfs_read_data *rdata)
par = alloc_parallel(rdata);
if (!par)
goto use_mds;
- par->call_ops = *rdata->mds_ops;
- par->call_ops.rpc_call_done = bl_rpc_do_nothing;
par->pnfs_callback = bl_end_par_io_read;
/* At this point, we can no longer jump to use_mds */

@@ -560,8 +549,6 @@ bl_write_pagelist(struct nfs_write_data *wdata, int sync)
par = alloc_parallel(wdata);
if (!par)
return PNFS_NOT_ATTEMPTED;
- par->call_ops = *wdata->mds_ops;
- par->call_ops.rpc_call_done = bl_rpc_do_nothing;
par->pnfs_callback = bl_end_par_io_write;
/* At this point, have to be more careful with error handling */

--
1.7.1.262.g5ef3d


2011-11-10 08:32:25

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 3/8] pnfsblock: move find lock page logic out of bl_write_pagelist

On 2011-11-09 17:16, Peng Tao wrote:
> Also avoid unnecessary lock_page if page is handled by others.
>
> Signed-off-by: Peng Tao <[email protected]>
> ---
> fs/nfs/blocklayout/blocklayout.c | 78 ++++++++++++++++++++++++++------------
> 1 files changed, 54 insertions(+), 24 deletions(-)
>
> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> index 4ced0b0..e8e13f3 100644
> --- a/fs/nfs/blocklayout/blocklayout.c
> +++ b/fs/nfs/blocklayout/blocklayout.c
> @@ -484,6 +484,56 @@ cleanup:
> return ret;
> }
>
> +/* Find or create a zeroing page marked being writeback.
> + * Return ERR_PTR on error, NULL to indicate skip this page and page itself
> + * to indicate write out.
> + */
> +static struct page*

coding style nit: "page *" rather than "page*"

> +bl_find_get_zeroing_page(struct inode *inode, pgoff_t index,
> + struct pnfs_block_extent *cow_read)
> +{
> + struct page* page;

checkpatch nit:
ERROR: "foo* bar" should be "foo *bar"

> + int locked = 0;
> + page = find_get_page(inode->i_mapping, index);
> + if (page)
> + goto check_page;
> +
> + page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
> + if (unlikely(!page)) {
> + dprintk("%s oom\n", __func__);
> + return ERR_PTR(-ENOMEM);
> + }
> + locked = 1;

Why not just call find_or_create_page in the first place?
dealing with both the locked and unlocked cases here just
seems to add unneeded complexity.

Benny

> +
> +check_page:
> + /* PageDirty: Other will write this out
> + * PageWriteback: Other is writing this out
> + * PageUptodate: It was read before
> + */
> + if (PageDirty(page) || PageWriteback(page)) {
> + print_page(page);
> + if (locked)
> + unlock_page(page);
> + page_cache_release(page);
> + return NULL;
> + }
> +
> + if (!locked) {
> + lock_page(page);
> + if (PageDirty(page) || PageWriteback(page))
> + unlock_page(page);
> + return NULL;
> + }
> + if (!PageUptodate(page)) {
> + /* New page, readin or zero it */
> + init_page_for_write(page, cow_read);
> + }
> + set_page_writeback(page);
> + unlock_page(page);
> +
> + return page;
> +}
> +
> static enum pnfs_try_status
> bl_write_pagelist(struct nfs_write_data *wdata, int sync)
> {
> @@ -543,32 +593,12 @@ fill_invalid_ext:
> dprintk("%s zero %dth page: index %lu isect %llu\n",
> __func__, npg_zero, index,
> (unsigned long long)isect);
> - page =
> - find_or_create_page(wdata->inode->i_mapping, index,
> - GFP_NOFS);
> - if (!page) {
> - dprintk("%s oom\n", __func__);
> - wdata->pnfs_error = -ENOMEM;
> + page = bl_find_get_zeroing_page(wdata->inode, index, cow_read);
> + if (unlikely(IS_ERR(page))) {
> + wdata->pnfs_error = PTR_ERR(page);
> goto out;
> - }
> -
> - /* PageDirty: Other will write this out
> - * PageWriteback: Other is writing this out
> - * PageUptodate: It was read before
> - * sector_initialized: already written out
> - */
> - if (PageDirty(page) || PageWriteback(page)) {
> - print_page(page);
> - unlock_page(page);
> - page_cache_release(page);
> + } else if (page == NULL)
> goto next_page;
> - }
> - if (!PageUptodate(page)) {
> - /* New page, readin or zero it */
> - init_page_for_write(page, cow_read);
> - }
> - set_page_writeback(page);
> - unlock_page(page);
>
> ret = bl_mark_sectors_init(be->be_inval, isect,
> PAGE_CACHE_SECTORS);

2011-11-10 09:31:40

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 3/8] pnfsblock: move find lock page logic out of bl_write_pagelist

On 2011-11-10 10:49, [email protected] wrote:
>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]]
>> On Behalf Of Benny Halevy
>> Sent: Thursday, November 10, 2011 4:32 PM
>> To: Peng Tao
>> Cc: [email protected]; [email protected]; Peng, Tao
>> Subject: Re: [PATCH 3/8] pnfsblock: move find lock page logic out of
>> bl_write_pagelist
>>
>> On 2011-11-09 17:16, Peng Tao wrote:
>>> Also avoid unnecessary lock_page if page is handled by others.
>>>
>>> Signed-off-by: Peng Tao <[email protected]>
>>> ---
>>> fs/nfs/blocklayout/blocklayout.c | 78
>> ++++++++++++++++++++++++++------------
>>> 1 files changed, 54 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
>>> index 4ced0b0..e8e13f3 100644
>>> --- a/fs/nfs/blocklayout/blocklayout.c
>>> +++ b/fs/nfs/blocklayout/blocklayout.c
>>> @@ -484,6 +484,56 @@ cleanup:
>>> return ret;
>>> }
>>>
>>> +/* Find or create a zeroing page marked being writeback.
>>> + * Return ERR_PTR on error, NULL to indicate skip this page and page itself
>>> + * to indicate write out.
>>> + */
>>> +static struct page*
>>
>> coding style nit: "page *" rather than "page*"
>>
>>> +bl_find_get_zeroing_page(struct inode *inode, pgoff_t index,
>>> + struct pnfs_block_extent *cow_read)
>>> +{
>>> + struct page* page;
>>
>> checkpatch nit:
>> ERROR: "foo* bar" should be "foo *bar"
> Thank you! Will change both.
>
>
>>
>>> + int locked = 0;
>>> + page = find_get_page(inode->i_mapping, index);
>>> + if (page)
>>> + goto check_page;
>>> +
>>> + page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
>>> + if (unlikely(!page)) {
>>> + dprintk("%s oom\n", __func__);
>>> + return ERR_PTR(-ENOMEM);
>>> + }
>>> + locked = 1;
>>
>> Why not just call find_or_create_page in the first place?
>> dealing with both the locked and unlocked cases here just
>> seems to add unneeded complexity.
> It is to avoid locking the page when possible. find_or_create_page() calls find_lock_page(), but there are cases where the page is already locked by someone else (e.g., in nfs_writepage_locked()) and the current process has to wait. Pressure tests show a lot of such contention, so I want to optimize it out.

I see. So isn't there a missing page_cache_release
in the late locking case? (see below...)

>
> Thanks,
> Tao
>>
>> Benny
>>
>>> +
>>> +check_page:
>>> + /* PageDirty: Other will write this out
>>> + * PageWriteback: Other is writing this out
>>> + * PageUptodate: It was read before
>>> + */
>>> + if (PageDirty(page) || PageWriteback(page)) {
>>> + print_page(page);
>>> + if (locked)
>>> + unlock_page(page);
>>> + page_cache_release(page);
>>> + return NULL;
>>> + }
>>> +
>>> + if (!locked) {
>>> + lock_page(page);
>>> + if (PageDirty(page) || PageWriteback(page))
>>> + unlock_page(page);

I suspect page_cache_release is required here too.
Actually, doing the following instead makes sense to me:

if (!locked) {
	lock_page(page);
	locked = true; // yeah, it should be bool, not int
	goto check_page;
}

Benny
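
[For reference, a minimal sketch of how bl_find_get_zeroing_page() might look with both points above folded in -- the extra page_cache_release() on the skip path and the "goto check_page" retry after locking late. print_page() and init_page_for_write() come from the patch; the rest is an editorial assumption, not code from the posted series.]

static struct page *
bl_find_get_zeroing_page(struct inode *inode, pgoff_t index,
			 struct pnfs_block_extent *cow_read)
{
	struct page *page;
	int locked = 0;

	page = find_get_page(inode->i_mapping, index);
	if (page)
		goto check_page;

	page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
	if (unlikely(!page)) {
		dprintk("%s oom\n", __func__);
		return ERR_PTR(-ENOMEM);
	}
	locked = 1;

check_page:
	/* PageDirty: Other will write this out
	 * PageWriteback: Other is writing this out
	 * PageUptodate: It was read before
	 */
	if (PageDirty(page) || PageWriteback(page)) {
		print_page(page);
		if (locked)
			unlock_page(page);
		page_cache_release(page);	/* drop the reference on the skip path too */
		return NULL;
	}

	if (!locked) {
		/* Lock late, then redo the checks above under the page lock. */
		lock_page(page);
		locked = 1;
		goto check_page;
	}
	if (!PageUptodate(page)) {
		/* New page, readin or zero it */
		init_page_for_write(page, cow_read);
	}
	set_page_writeback(page);
	unlock_page(page);

	return page;
}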

>>> + return NULL;
>>> + }
>>> + if (!PageUptodate(page)) {
>>> + /* New page, readin or zero it */
>>> + init_page_for_write(page, cow_read);
>>> + }
>>> + set_page_writeback(page);
>>> + unlock_page(page);
>>> +
>>> + return page;
>>> +}
>>> +
>>> static enum pnfs_try_status
>>> bl_write_pagelist(struct nfs_write_data *wdata, int sync)
>>> {
>>> @@ -543,32 +593,12 @@ fill_invalid_ext:
>>> dprintk("%s zero %dth page: index %lu isect %llu\n",
>>> __func__, npg_zero, index,
>>> (unsigned long long)isect);
>>> - page =
>>> - find_or_create_page(wdata->inode->i_mapping, index,
>>> - GFP_NOFS);
>>> - if (!page) {
>>> - dprintk("%s oom\n", __func__);
>>> - wdata->pnfs_error = -ENOMEM;
>>> + page = bl_find_get_zeroing_page(wdata->inode, index, cow_read);
>>> + if (unlikely(IS_ERR(page))) {
>>> + wdata->pnfs_error = PTR_ERR(page);
>>> goto out;
>>> - }
>>> -
>>> - /* PageDirty: Other will write this out
>>> - * PageWriteback: Other is writing this out
>>> - * PageUptodate: It was read before
>>> - * sector_initialized: already written out
>>> - */
>>> - if (PageDirty(page) || PageWriteback(page)) {
>>> - print_page(page);
>>> - unlock_page(page);
>>> - page_cache_release(page);
>>> + } else if (page == NULL)
>>> goto next_page;
>>> - }
>>> - if (!PageUptodate(page)) {
>>> - /* New page, readin or zero it */
>>> - init_page_for_write(page, cow_read);
>>> - }
>>> - set_page_writeback(page);
>>> - unlock_page(page);
>>>
>>> ret = bl_mark_sectors_init(be->be_inval, isect,
>>> PAGE_CACHE_SECTORS);
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2011-11-10 07:06:48

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 0/8] pnfsblock cleanup and fixes

On 2011-11-10 04:09, [email protected] wrote:
> Benny, hi,
>
> These patches fail to apply to the pnfs-latest branch because of conflicts with the pnfs private workqueue patches. Since Trond has decided not to take the pnfs private workqueue patches, would you please drop them so that the new patches can apply? I can send you revert patches if you want.

No problem. I'll take them out.

Benny

>
> Thanks,
> Tao
>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]]
>> On Behalf Of Peng Tao
>> Sent: Wednesday, November 09, 2011 11:16 PM
>> To: [email protected]
>> Cc: [email protected]; [email protected]
>> Subject: [PATCH 0/8] pnfsblock cleanup and fixes
>>
>> Hi, Trond and Benny,
>>
>> Below are some pnfsblock patches that I have tested for days. Most of them
>> are cleanup and small improvements. Only
>> [PATCH 2/8] pnfsblock: acquire im_lock in _preload_range
>> needs to go into stable.
>>
>> Thanks,
>> Tao
>>
>> Peng Tao (8):
>> pnfsblock: cleanup bl_mark_sectors_init
>> pnfsblock: acquire im_lock in _preload_range
>> pnfsblock: move find lock page logic out of bl_write_pagelist
>> pnfsblock: set read/write tk_status to pnfs_error
>> pnfsblock: remove rpc_call_ops from struct parallel_io
>> pnfsblock: clean up _add_entry
>> pnfsblock: add im_extents to pnfs_inval_markings
>> pnfsblock: alloc short extent before submit bio
>>
>> fs/nfs/blocklayout/blocklayout.c | 176 +++++++++++++++++++++++++-------------
>> fs/nfs/blocklayout/blocklayout.h | 11 ++-
>> fs/nfs/blocklayout/extents.c | 146 ++++++++++++--------------------
>> 3 files changed, 180 insertions(+), 153 deletions(-)
>>

2011-11-10 08:54:19

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 7/8] pnfsblock: add im_extents to pnfs_inval_markings

On 2011-11-09 17:16, Peng Tao wrote:
> It stores a list of short extents for INVAL->RW conversion.
> Also add two functions to manipulate them, in preparation to
> move malloc logic out of end_io.
>
> Signed-off-by: Peng Tao <[email protected]>
> ---
> fs/nfs/blocklayout/blocklayout.c | 6 ++++++
> fs/nfs/blocklayout/blocklayout.h | 5 +++++
> fs/nfs/blocklayout/extents.c | 37 +++++++++++++++++++++++++++++++++++++
> 3 files changed, 48 insertions(+), 0 deletions(-)
>
> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> index 815c0c3..cb4ff0f 100644
> --- a/fs/nfs/blocklayout/blocklayout.c
> +++ b/fs/nfs/blocklayout/blocklayout.c
> @@ -706,11 +706,17 @@ static void
> release_inval_marks(struct pnfs_inval_markings *marks)
> {
> struct pnfs_inval_tracking *pos, *temp;
> + struct pnfs_block_short_extent *se, *stemp;
>
> list_for_each_entry_safe(pos, temp, &marks->im_tree.mtt_stub, it_link) {
> list_del(&pos->it_link);
> kfree(pos);
> }
> +
> + list_for_each_entry_safe(se, stemp, &marks->im_extents, bse_node) {
> + list_del(&se->bse_node);
> + kfree(se);
> + }
> return;
> }
>
> diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
> index 60728ac..df0e0fb 100644
> --- a/fs/nfs/blocklayout/blocklayout.h
> +++ b/fs/nfs/blocklayout/blocklayout.h
> @@ -70,6 +70,7 @@ struct pnfs_inval_markings {
> spinlock_t im_lock;
> struct my_tree im_tree; /* Sectors that need LAYOUTCOMMIT */
> sector_t im_block_size; /* Server blocksize in sectors */
> + struct list_head im_extents; /* List of short extents for INVAL->RW conversion */
> };
>
> struct pnfs_inval_tracking {
> @@ -105,6 +106,7 @@ BL_INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
> {
> spin_lock_init(&marks->im_lock);
> INIT_LIST_HEAD(&marks->im_tree.mtt_stub);
> + INIT_LIST_HEAD(&marks->im_extents);
> marks->im_block_size = blocksize;
> marks->im_tree.mtt_step_size = min((sector_t)PAGE_CACHE_SECTORS,
> blocksize);
> @@ -200,5 +202,8 @@ int bl_add_merge_extent(struct pnfs_block_layout *bl,
> struct pnfs_block_extent *new);
> int bl_mark_for_commit(struct pnfs_block_extent *be,
> sector_t offset, sector_t length);
> +int bl_push_one_short_extent(struct pnfs_inval_markings *marks);
> +struct pnfs_block_short_extent*
> +bl_pop_short_extent(struct pnfs_inval_markings *marks, int num_to_pop);
>
> #endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
> diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
> index 952ea8a..72c7fa1 100644
> --- a/fs/nfs/blocklayout/extents.c
> +++ b/fs/nfs/blocklayout/extents.c
> @@ -863,3 +863,40 @@ clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
> }
> }
> }
> +
> +int
> +bl_push_one_short_extent(struct pnfs_inval_markings *marks) {
> + struct pnfs_block_short_extent *new;
> +
> + new = kmalloc(sizeof(*new), GFP_NOFS);
> + if (unlikely(!new))
> + return -ENOMEM;
> +
> + spin_lock(&marks->im_lock);
> + list_add(&new->bse_node, &marks->im_extents);
> + spin_unlock(&marks->im_lock);
> +
> + return 0;
> +}
> +
> +struct pnfs_block_short_extent*
> +bl_pop_short_extent(struct pnfs_inval_markings *marks, int num_to_pop) {
> + struct pnfs_block_short_extent *rv = NULL;
> +
> + if (unlikely(num_to_pop <= 0))
> + return rv;

How unlikely is it?
Is doing the extra compare really worth saving the spin_lock?

> +
> + spin_lock(&marks->im_lock);
> + while (!list_empty(&marks->im_extents) && num_to_pop-- > 0) {
> + rv = list_entry((&marks->im_extents)->next,
> + struct pnfs_block_short_extent, bse_node);
> + list_del_init(&rv->bse_node);
> + if (num_to_pop)
> + kfree(rv);

Please correct me if I'm wrong, you don't want to free the last element
you pop since you want to return it. This is worth a comment...

I'd consider moving the decrement expression down here or
changing the loop to be a for loop to improve its readability.
In the latter case this will say if (num_to_pop > 1) kfree(rv)
which is more straightforward IMHO.

Benny

> + }
> + spin_unlock(&marks->im_lock);
> +
> + BUG_ON(num_to_pop > 0);
> +
> + return rv;
> +}

2011-11-09 15:17:44

by Peng Tao

[permalink] [raw]
Subject: [PATCH 4/8] pnfsblock: set read/write tk_status to pnfs_error

To pass the IO status to the upper layer.

Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index e8e13f3..cc13cb0 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -216,6 +216,7 @@ bl_end_par_io_read(void *data)
{
struct nfs_read_data *rdata = data;

+ rdata->task.tk_status = rdata->pnfs_error;
INIT_WORK(&rdata->task.u.tk_work, bl_read_cleanup);
schedule_work(&rdata->task.u.tk_work);
}
@@ -405,7 +406,7 @@ static void bl_end_par_io_write(void *data)
{
struct nfs_write_data *wdata = data;

- wdata->task.tk_status = 0;
+ wdata->task.tk_status = wdata->pnfs_error;
wdata->verf.committed = NFS_FILE_SYNC;
INIT_WORK(&wdata->task.u.tk_work, bl_write_cleanup);
schedule_work(&wdata->task.u.tk_work);
--
1.7.1.262.g5ef3d


2011-11-10 08:37:39

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 0/8] pnfsblock cleanup and fixes

On 2011-11-09 17:15, Peng Tao wrote:
> Hi, Trond and Benny,
>
> Below are some pnfsblock patches that I have tested for days. Most of them
> are cleanup and small improvements. Only
> [PATCH 2/8] pnfsblock: acquire im_lock in _preload_range
> needs to go into stable.
>
> Thanks,
> Tao
>
> Peng Tao (8):
> pnfsblock: cleanup bl_mark_sectors_init
> pnfsblock: acquire im_lock in _preload_range

I can't find this patch in my mailbox.
It looks like you did not send it...
http://thread.gmane.org/gmane.linux.nfs/44761

Benny

> pnfsblock: move find lock page logic out of bl_write_pagelist
> pnfsblock: set read/write tk_status to pnfs_error
> pnfsblock: remove rpc_call_ops from struct parallel_io
> pnfsblock: clean up _add_entry
> pnfsblock: add im_extents to pnfs_inval_markings
> pnfsblock: alloc short extent before submit bio
>
> fs/nfs/blocklayout/blocklayout.c | 176 +++++++++++++++++++++++++-------------
> fs/nfs/blocklayout/blocklayout.h | 11 ++-
> fs/nfs/blocklayout/extents.c | 146 ++++++++++++--------------------
> 3 files changed, 180 insertions(+), 153 deletions(-)
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2011-11-10 09:38:04

by Peng, Tao

[permalink] [raw]
Subject: RE: [PATCH 7/8] pnfsblock: add im_extents to pnfs_inval_markings

> -----Original Message-----
> From: Benny Halevy [mailto:[email protected]]
> Sent: Thursday, November 10, 2011 5:21 PM
> To: Peng Tao
> Cc: [email protected]; [email protected]; Peng, Tao
> Subject: Re: [PATCH 7/8] pnfsblock: add im_extents to pnfs_inval_markings
>
> On 2011-11-10 10:54, Benny Halevy wrote:
> > On 2011-11-09 17:16, Peng Tao wrote:
> >> It stores a list of short extents for INVAL->RW conversion.
> >> Also add two functions to manipulate them, in preparation to
> >> move malloc logic out of end_io.
> >>
> >> Signed-off-by: Peng Tao <[email protected]>
> >> ---
> >> fs/nfs/blocklayout/blocklayout.c | 6 ++++++
> >> fs/nfs/blocklayout/blocklayout.h | 5 +++++
> >> fs/nfs/blocklayout/extents.c | 37
> +++++++++++++++++++++++++++++++++++++
> >> 3 files changed, 48 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> >> index 815c0c3..cb4ff0f 100644
> >> --- a/fs/nfs/blocklayout/blocklayout.c
> >> +++ b/fs/nfs/blocklayout/blocklayout.c
> >> @@ -706,11 +706,17 @@ static void
> >> release_inval_marks(struct pnfs_inval_markings *marks)
> >> {
> >> struct pnfs_inval_tracking *pos, *temp;
> >> + struct pnfs_block_short_extent *se, *stemp;
> >>
> >> list_for_each_entry_safe(pos, temp, &marks->im_tree.mtt_stub, it_link) {
> >> list_del(&pos->it_link);
> >> kfree(pos);
> >> }
> >> +
> >> + list_for_each_entry_safe(se, stemp, &marks->im_extents, bse_node) {
> >> + list_del(&se->bse_node);
> >> + kfree(se);
> >> + }
> >> return;
> >> }
> >>
> >> diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
> >> index 60728ac..df0e0fb 100644
> >> --- a/fs/nfs/blocklayout/blocklayout.h
> >> +++ b/fs/nfs/blocklayout/blocklayout.h
> >> @@ -70,6 +70,7 @@ struct pnfs_inval_markings {
> >> spinlock_t im_lock;
> >> struct my_tree im_tree; /* Sectors that need LAYOUTCOMMIT */
> >> sector_t im_block_size; /* Server blocksize in sectors */
> >> + struct list_head im_extents; /* List of short extents for INVAL->RW conversion
> */
> >> };
> >>
> >> struct pnfs_inval_tracking {
> >> @@ -105,6 +106,7 @@ BL_INIT_INVAL_MARKS(struct pnfs_inval_markings
> *marks, sector_t blocksize)
> >> {
> >> spin_lock_init(&marks->im_lock);
> >> INIT_LIST_HEAD(&marks->im_tree.mtt_stub);
> >> + INIT_LIST_HEAD(&marks->im_extents);
> >> marks->im_block_size = blocksize;
> >> marks->im_tree.mtt_step_size = min((sector_t)PAGE_CACHE_SECTORS,
> >> blocksize);
> >> @@ -200,5 +202,8 @@ int bl_add_merge_extent(struct pnfs_block_layout *bl,
> >> struct pnfs_block_extent *new);
> >> int bl_mark_for_commit(struct pnfs_block_extent *be,
> >> sector_t offset, sector_t length);
> >> +int bl_push_one_short_extent(struct pnfs_inval_markings *marks);
> >> +struct pnfs_block_short_extent*
> >> +bl_pop_short_extent(struct pnfs_inval_markings *marks, int num_to_pop);
> >>
> >> #endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
> >> diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
> >> index 952ea8a..72c7fa1 100644
> >> --- a/fs/nfs/blocklayout/extents.c
> >> +++ b/fs/nfs/blocklayout/extents.c
> >> @@ -863,3 +863,40 @@ clean_pnfs_block_layoutupdate(struct
> pnfs_block_layout *bl,
> >> }
> >> }
> >> }
> >> +
> >> +int
> >> +bl_push_one_short_extent(struct pnfs_inval_markings *marks) {
> >> + struct pnfs_block_short_extent *new;
> >> +
> >> + new = kmalloc(sizeof(*new), GFP_NOFS);
> >> + if (unlikely(!new))
> >> + return -ENOMEM;
> >> +
> >> + spin_lock(&marks->im_lock);
> >> + list_add(&new->bse_node, &marks->im_extents);
> >> + spin_unlock(&marks->im_lock);
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +struct pnfs_block_short_extent*
> >> +bl_pop_short_extent(struct pnfs_inval_markings *marks, int num_to_pop) {
> >> + struct pnfs_block_short_extent *rv = NULL;
>
> The use case for bl_pop_short_extent comes only in the next patch
> so it should be introduced there in principle, along with its use.
I was hoping not to create too large a single patch... But since you asked, I will merge them into one.

>
> With that in mind, there is one call site with num_to_pop == 1
> in which you don't free any entry, and the other with num_to_pop == num_se
> which seems to free all the members of the list, but the last one,
> which is then freed by the caller.
>
> This tells me you crammed two functions into one and you better
> have one that pops a single element and another to destroy the list.
Good point. I thought there might be more users of the function when I first wrote it but apparently I was wrong...
I will split it into bl_pop_one_short_extent() and bl_free_short_extents() as you suggested.
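
[Illustrative sketch, not the code that was actually posted later: one way the split into bl_pop_one_short_extent() and bl_free_short_extents() might look, reusing only the im_extents/im_lock/bse_node fields added by this patch. The function bodies below are assumptions.]

struct pnfs_block_short_extent *
bl_pop_one_short_extent(struct pnfs_inval_markings *marks)
{
	struct pnfs_block_short_extent *rv = NULL;

	spin_lock(&marks->im_lock);
	if (!list_empty(&marks->im_extents)) {
		rv = list_first_entry(&marks->im_extents,
				      struct pnfs_block_short_extent, bse_node);
		list_del_init(&rv->bse_node);
	}
	spin_unlock(&marks->im_lock);

	return rv;
}

void bl_free_short_extents(struct pnfs_inval_markings *marks, int num_to_free)
{
	struct pnfs_block_short_extent *se;

	spin_lock(&marks->im_lock);
	while (num_to_free-- > 0 && !list_empty(&marks->im_extents)) {
		se = list_first_entry(&marks->im_extents,
				      struct pnfs_block_short_extent, bse_node);
		list_del(&se->bse_node);
		/* kfree() does not sleep, so freeing under im_lock is fine */
		kfree(se);
	}
	spin_unlock(&marks->im_lock);
}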

Thank you very much!
Tao

>
> Benny
>
> >> +
> >> + if (unlikely(num_to_pop <= 0))
> >> + return rv;
> >
> > How unlikely is it?
> > Is doing the extra compare really worth saving the spin_lock?
> >
> >> +
> >> + spin_lock(&marks->im_lock);
> >> + while (!list_empty(&marks->im_extents) && num_to_pop-- > 0) {
> >> + rv = list_entry((&marks->im_extents)->next,
> >> + struct pnfs_block_short_extent, bse_node);
> >> + list_del_init(&rv->bse_node);
> >> + if (num_to_pop)
> >> + kfree(rv);
> >
> > Please correct me if I'm wrong, you don't want to free the last element
> > you pop since you want to return it. This is worth a comment...
> >
> > I'd consider moving the decrement expression down here or
> > changing the loop to be a for loop to improve its readability.
> > In the latter case this will say if (num_to_pop > 1) kfree(rv)
> > which is more straightforward IMHO.
> >
> > Benny
> >
> >> + }
> >> + spin_unlock(&marks->im_lock);
> >> +
> >> + BUG_ON(num_to_pop > 0);
> >> +
> >> + return rv;
> >> +}


2011-11-09 15:17:55

by Peng Tao

[permalink] [raw]
Subject: [PATCH 6/8] pnfsblock: clean up _add_entry

It is wrong to kmalloc in _add_entry() as it is called inside a
spinlock. Memory should already be allocated before _add_entry() is called.
So BUG instead if we fail to find a match and no memory was preallocated.

Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/blocklayout/extents.c | 9 ++-------
1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index f383524..952ea8a 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -110,13 +110,8 @@ static int _add_entry(struct my_tree *tree, u64 s, int32_t tag,
return 0;
} else {
struct pnfs_inval_tracking *new;
- if (storage)
- new = storage;
- else {
- new = kmalloc(sizeof(*new), GFP_NOFS);
- if (!new)
- return -ENOMEM;
- }
+ BUG_ON(!storage);
+ new = storage;
new->it_sector = s;
new->it_tags = (1 << tag);
list_add(&new->it_link, &pos->it_link);
--
1.7.1.262.g5ef3d
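
[Illustrative sketch of the allocate-before-lock pattern this patch depends on: the caller obtains the tracking node with GFP_NOFS while it may still sleep, then calls _add_entry() with im_lock held. The wrapper name bl_mark_one_sector() is hypothetical; only _add_entry(), EXTENT_INITIALIZED and the im_lock/im_tree fields come from the series.]

static int bl_mark_one_sector(struct pnfs_inval_markings *marks, sector_t s)
{
	struct pnfs_inval_tracking *storage;
	int ret;

	/* Allocate while sleeping is still allowed ... */
	storage = kmalloc(sizeof(*storage), GFP_NOFS);
	if (!storage)
		return -ENOMEM;

	/* ... then do the tree manipulation under the spinlock. */
	spin_lock(&marks->im_lock);
	ret = _add_entry(&marks->im_tree, s, EXTENT_INITIALIZED, storage);
	spin_unlock(&marks->im_lock);

	/*
	 * A real caller also has to free 'storage' whenever _add_entry()
	 * did not link it into the tree (e.g. the sector was already
	 * tracked); that bookkeeping is omitted here.
	 */
	return ret;
}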


2011-11-10 09:37:28

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 7/8] pnfsblock: add im_extents to pnfs_inval_markings

On 2011-11-10 11:08, [email protected] wrote:
>
>> -----Original Message-----
>> From: Benny Halevy [mailto:[email protected]]
>> Sent: Thursday, November 10, 2011 4:54 PM
>> To: Peng Tao
>> Cc: [email protected]; [email protected]; Peng, Tao
>> Subject: Re: [PATCH 7/8] pnfsblock: add im_extents to pnfs_inval_markings
>>
>> On 2011-11-09 17:16, Peng Tao wrote:
>>> It stores a list of short extents for INVAL->RW conversion.
>>> Also add two functions to manipulate them, in preparation to
>>> move malloc logic out of end_io.
>>>
>>> Signed-off-by: Peng Tao <[email protected]>
>>> ---
>>> fs/nfs/blocklayout/blocklayout.c | 6 ++++++
>>> fs/nfs/blocklayout/blocklayout.h | 5 +++++
>>> fs/nfs/blocklayout/extents.c | 37
>> +++++++++++++++++++++++++++++++++++++
>>> 3 files changed, 48 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
>>> index 815c0c3..cb4ff0f 100644
>>> --- a/fs/nfs/blocklayout/blocklayout.c
>>> +++ b/fs/nfs/blocklayout/blocklayout.c
>>> @@ -706,11 +706,17 @@ static void
>>> release_inval_marks(struct pnfs_inval_markings *marks)
>>> {
>>> struct pnfs_inval_tracking *pos, *temp;
>>> + struct pnfs_block_short_extent *se, *stemp;
>>>
>>> list_for_each_entry_safe(pos, temp, &marks->im_tree.mtt_stub, it_link) {
>>> list_del(&pos->it_link);
>>> kfree(pos);
>>> }
>>> +
>>> + list_for_each_entry_safe(se, stemp, &marks->im_extents, bse_node) {
>>> + list_del(&se->bse_node);
>>> + kfree(se);
>>> + }
>>> return;
>>> }
>>>
>>> diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
>>> index 60728ac..df0e0fb 100644
>>> --- a/fs/nfs/blocklayout/blocklayout.h
>>> +++ b/fs/nfs/blocklayout/blocklayout.h
>>> @@ -70,6 +70,7 @@ struct pnfs_inval_markings {
>>> spinlock_t im_lock;
>>> struct my_tree im_tree; /* Sectors that need LAYOUTCOMMIT */
>>> sector_t im_block_size; /* Server blocksize in sectors */
>>> + struct list_head im_extents; /* List of short extents for INVAL->RW conversion
>> */
>>> };
>>>
>>> struct pnfs_inval_tracking {
>>> @@ -105,6 +106,7 @@ BL_INIT_INVAL_MARKS(struct pnfs_inval_markings
>> *marks, sector_t blocksize)
>>> {
>>> spin_lock_init(&marks->im_lock);
>>> INIT_LIST_HEAD(&marks->im_tree.mtt_stub);
>>> + INIT_LIST_HEAD(&marks->im_extents);
>>> marks->im_block_size = blocksize;
>>> marks->im_tree.mtt_step_size = min((sector_t)PAGE_CACHE_SECTORS,
>>> blocksize);
>>> @@ -200,5 +202,8 @@ int bl_add_merge_extent(struct pnfs_block_layout *bl,
>>> struct pnfs_block_extent *new);
>>> int bl_mark_for_commit(struct pnfs_block_extent *be,
>>> sector_t offset, sector_t length);
>>> +int bl_push_one_short_extent(struct pnfs_inval_markings *marks);
>>> +struct pnfs_block_short_extent*
>>> +bl_pop_short_extent(struct pnfs_inval_markings *marks, int num_to_pop);
>>>
>>> #endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
>>> diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
>>> index 952ea8a..72c7fa1 100644
>>> --- a/fs/nfs/blocklayout/extents.c
>>> +++ b/fs/nfs/blocklayout/extents.c
>>> @@ -863,3 +863,40 @@ clean_pnfs_block_layoutupdate(struct
>> pnfs_block_layout *bl,
>>> }
>>> }
>>> }
>>> +
>>> +int
>>> +bl_push_one_short_extent(struct pnfs_inval_markings *marks) {
>>> + struct pnfs_block_short_extent *new;
>>> +
>>> + new = kmalloc(sizeof(*new), GFP_NOFS);
>>> + if (unlikely(!new))
>>> + return -ENOMEM;
>>> +
>>> + spin_lock(&marks->im_lock);
>>> + list_add(&new->bse_node, &marks->im_extents);
>>> + spin_unlock(&marks->im_lock);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +struct pnfs_block_short_extent*
>>> +bl_pop_short_extent(struct pnfs_inval_markings *marks, int num_to_pop) {
>>> + struct pnfs_block_short_extent *rv = NULL;
>>> +
>>> + if (unlikely(num_to_pop <= 0))
>>> + return rv;
>>
>> How unlikely is it?
>> Is doing the extra compare really worth saving the spin_lock?
> Never... I should really replace it with a BUG_ON.
>
>>
>>> +
>>> + spin_lock(&marks->im_lock);
>>> + while (!list_empty(&marks->im_extents) && num_to_pop-- > 0) {
>>> + rv = list_entry((&marks->im_extents)->next,
>>> + struct pnfs_block_short_extent, bse_node);
>>> + list_del_init(&rv->bse_node);
>>> + if (num_to_pop)
>>> + kfree(rv);
>>
>> Please correct me if I'm wrong, you don't want to free the last element
>> you pop since you want to return it. This is worth a comment...
> Yes, you are right. Will add some comments above the function.
>
>>
>> I'd consider moving the decrement expression down here or
>> changing the loop to be a for loop to improve its readability.
>> In the latter case this will say if (num_to_pop > 1) kfree(rv)
>> which is more straightforward IMHO.
> How about the following?
>
> BUG_ON(num_to_pop <= 0);
>
> list_for_each_entry_safe() {
> 	list_del_init(&rv->bse_node);
> 	if (num_to_pop-- > 1)
> 		kfree(rv);
> }
>

This looks better, but please read my next comment on this patch.
I believe you better split this function into two:
one that pops a single element and returns it (or NULL if the list is empty)
and another that just destroys the list, freeing all listed elements.

Benny

> Thanks,
> Tao
>
>>
>> Benny
>>
>>> + }
>>> + spin_unlock(&marks->im_lock);
>>> +
>>> + BUG_ON(num_to_pop > 0);
>>> +
>>> + return rv;
>>> +}
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2011-11-10 09:12:18

by Peng, Tao

[permalink] [raw]
Subject: RE: [PATCH 6/8] pnfsblock: clean up _add_entry

> -----Original Message-----
> From: Benny Halevy [mailto:[email protected]]
> Sent: Thursday, November 10, 2011 4:44 PM
> To: Peng Tao
> Cc: [email protected]; [email protected]; Peng, Tao
> Subject: Re: [PATCH 6/8] pnfsblock: clean up _add_entry
>
> On 2011-11-09 17:16, Peng Tao wrote:
> > It is wrong to kmalloc in _add_entry() as it is called inside a
> > spinlock. Memory should already be allocated before _add_entry() is called.
>
> What about the call from _set_range, which passes NULL?
>
> > So BUG instead if we fail to find a match and no memory was preallocated.
> >
> > Signed-off-by: Peng Tao <[email protected]>
> > ---
> > fs/nfs/blocklayout/extents.c | 9 ++-------
> > 1 files changed, 2 insertions(+), 7 deletions(-)
> >
> > diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
> > index f383524..952ea8a 100644
> > --- a/fs/nfs/blocklayout/extents.c
> > +++ b/fs/nfs/blocklayout/extents.c
> > @@ -110,13 +110,8 @@ static int _add_entry(struct my_tree *tree, u64 s, int32_t
> tag,
> > return 0;
> > } else {
> > struct pnfs_inval_tracking *new;
> > - if (storage)
> > - new = storage;
> > - else {
> > - new = kmalloc(sizeof(*new), GFP_NOFS);
> > - if (!new)
> > - return -ENOMEM;
> > - }
> > + BUG_ON(!storage);
>
> The BUG isn't needed as you dereference the pointer right away.
Yes. Will remove it.

Thanks a lot for reviewing!
Tao

>
> Benny
>
> > + new = storage;
> > new->it_sector = s;
> > new->it_tags = (1 << tag);
> > list_add(&new->it_link, &pos->it_link);


2011-11-10 09:21:01

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 7/8] pnfsblock: add im_extents to pnfs_inval_markings

On 2011-11-10 10:54, Benny Halevy wrote:
> On 2011-11-09 17:16, Peng Tao wrote:
>> It stores a list of short extents for INVAL->RW conversion.
>> Also add two functions to manipulate them, in preparation to
>> move malloc logic out of end_io.
>>
>> Signed-off-by: Peng Tao <[email protected]>
>> ---
>> fs/nfs/blocklayout/blocklayout.c | 6 ++++++
>> fs/nfs/blocklayout/blocklayout.h | 5 +++++
>> fs/nfs/blocklayout/extents.c | 37 +++++++++++++++++++++++++++++++++++++
>> 3 files changed, 48 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
>> index 815c0c3..cb4ff0f 100644
>> --- a/fs/nfs/blocklayout/blocklayout.c
>> +++ b/fs/nfs/blocklayout/blocklayout.c
>> @@ -706,11 +706,17 @@ static void
>> release_inval_marks(struct pnfs_inval_markings *marks)
>> {
>> struct pnfs_inval_tracking *pos, *temp;
>> + struct pnfs_block_short_extent *se, *stemp;
>>
>> list_for_each_entry_safe(pos, temp, &marks->im_tree.mtt_stub, it_link) {
>> list_del(&pos->it_link);
>> kfree(pos);
>> }
>> +
>> + list_for_each_entry_safe(se, stemp, &marks->im_extents, bse_node) {
>> + list_del(&se->bse_node);
>> + kfree(se);
>> + }
>> return;
>> }
>>
>> diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
>> index 60728ac..df0e0fb 100644
>> --- a/fs/nfs/blocklayout/blocklayout.h
>> +++ b/fs/nfs/blocklayout/blocklayout.h
>> @@ -70,6 +70,7 @@ struct pnfs_inval_markings {
>> spinlock_t im_lock;
>> struct my_tree im_tree; /* Sectors that need LAYOUTCOMMIT */
>> sector_t im_block_size; /* Server blocksize in sectors */
>> + struct list_head im_extents; /* List of short extents for INVAL->RW conversion */
>> };
>>
>> struct pnfs_inval_tracking {
>> @@ -105,6 +106,7 @@ BL_INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
>> {
>> spin_lock_init(&marks->im_lock);
>> INIT_LIST_HEAD(&marks->im_tree.mtt_stub);
>> + INIT_LIST_HEAD(&marks->im_extents);
>> marks->im_block_size = blocksize;
>> marks->im_tree.mtt_step_size = min((sector_t)PAGE_CACHE_SECTORS,
>> blocksize);
>> @@ -200,5 +202,8 @@ int bl_add_merge_extent(struct pnfs_block_layout *bl,
>> struct pnfs_block_extent *new);
>> int bl_mark_for_commit(struct pnfs_block_extent *be,
>> sector_t offset, sector_t length);
>> +int bl_push_one_short_extent(struct pnfs_inval_markings *marks);
>> +struct pnfs_block_short_extent*
>> +bl_pop_short_extent(struct pnfs_inval_markings *marks, int num_to_pop);
>>
>> #endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
>> diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
>> index 952ea8a..72c7fa1 100644
>> --- a/fs/nfs/blocklayout/extents.c
>> +++ b/fs/nfs/blocklayout/extents.c
>> @@ -863,3 +863,40 @@ clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
>> }
>> }
>> }
>> +
>> +int
>> +bl_push_one_short_extent(struct pnfs_inval_markings *marks) {
>> + struct pnfs_block_short_extent *new;
>> +
>> + new = kmalloc(sizeof(*new), GFP_NOFS);
>> + if (unlikely(!new))
>> + return -ENOMEM;
>> +
>> + spin_lock(&marks->im_lock);
>> + list_add(&new->bse_node, &marks->im_extents);
>> + spin_unlock(&marks->im_lock);
>> +
>> + return 0;
>> +}
>> +
>> +struct pnfs_block_short_extent*
>> +bl_pop_short_extent(struct pnfs_inval_markings *marks, int num_to_pop) {
>> + struct pnfs_block_short_extent *rv = NULL;

The use case for bl_pop_short_extent comes only in the next patch
so it should be introduced there in principle, along with its use.

With that in mind, there is one call site with num_to_pop == 1
in which you don't free any entry, and the other with num_to_pop == num_se
which seems to free all the members of the list, but the last one,
which is then freed by the caller.

This tells me you crammed two functions into one and you better
have one that pops a single element and another to destroy the list.

Benny

>> +
>> + if (unlikely(num_to_pop <= 0))
>> + return rv;
>
> How unlikely is it?
> Is doing the extra compare really worth saving the spin_lock?
>
>> +
>> + spin_lock(&marks->im_lock);
>> + while (!list_empty(&marks->im_extents) && num_to_pop-- > 0) {
>> + rv = list_entry((&marks->im_extents)->next,
>> + struct pnfs_block_short_extent, bse_node);
>> + list_del_init(&rv->bse_node);
>> + if (num_to_pop)
>> + kfree(rv);
>
> Please correct me if I'm wrong, you don't want to free the last element
> you pop since you want to return it. This is worth a comment...
>
> I'd consider moving the decrement expression down here or
> changing the loop to be a for loop to improve its readability.
> In the latter case this will say if (num_to_pop > 1) kfree(rv)
> which is more straightforward IMHO.
>
> Benny
>
>> + }
>> + spin_unlock(&marks->im_lock);
>> +
>> + BUG_ON(num_to_pop > 0);
>> +
>> + return rv;
>> +}

2011-11-10 09:39:59

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 7/8] pnfsblock: add im_extents to pnfs_inval_markings

On 2011-11-10 11:37, [email protected] wrote:
>> -----Original Message-----
>> From: Benny Halevy [mailto:[email protected]]
>> Sent: Thursday, November 10, 2011 5:21 PM
>> To: Peng Tao
>> Cc: [email protected]; [email protected]; Peng, Tao
>> Subject: Re: [PATCH 7/8] pnfsblock: add im_extents to pnfs_inval_markings
>>
>> On 2011-11-10 10:54, Benny Halevy wrote:
>>> On 2011-11-09 17:16, Peng Tao wrote:
>>>> It stores a list of short extents for INVAL->RW conversion.
>>>> Also add two functions to manipulate them, in preparation to
>>>> move malloc logic out of end_io.
>>>>
>>>> Signed-off-by: Peng Tao <[email protected]>
>>>> ---
>>>> fs/nfs/blocklayout/blocklayout.c | 6 ++++++
>>>> fs/nfs/blocklayout/blocklayout.h | 5 +++++
>>>> fs/nfs/blocklayout/extents.c | 37
>> +++++++++++++++++++++++++++++++++++++
>>>> 3 files changed, 48 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
>>>> index 815c0c3..cb4ff0f 100644
>>>> --- a/fs/nfs/blocklayout/blocklayout.c
>>>> +++ b/fs/nfs/blocklayout/blocklayout.c
>>>> @@ -706,11 +706,17 @@ static void
>>>> release_inval_marks(struct pnfs_inval_markings *marks)
>>>> {
>>>> struct pnfs_inval_tracking *pos, *temp;
>>>> + struct pnfs_block_short_extent *se, *stemp;
>>>>
>>>> list_for_each_entry_safe(pos, temp, &marks->im_tree.mtt_stub, it_link) {
>>>> list_del(&pos->it_link);
>>>> kfree(pos);
>>>> }
>>>> +
>>>> + list_for_each_entry_safe(se, stemp, &marks->im_extents, bse_node) {
>>>> + list_del(&se->bse_node);
>>>> + kfree(se);
>>>> + }
>>>> return;
>>>> }
>>>>
>>>> diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
>>>> index 60728ac..df0e0fb 100644
>>>> --- a/fs/nfs/blocklayout/blocklayout.h
>>>> +++ b/fs/nfs/blocklayout/blocklayout.h
>>>> @@ -70,6 +70,7 @@ struct pnfs_inval_markings {
>>>> spinlock_t im_lock;
>>>> struct my_tree im_tree; /* Sectors that need LAYOUTCOMMIT */
>>>> sector_t im_block_size; /* Server blocksize in sectors */
>>>> + struct list_head im_extents; /* List of short extents for INVAL->RW conversion
>> */
>>>> };
>>>>
>>>> struct pnfs_inval_tracking {
>>>> @@ -105,6 +106,7 @@ BL_INIT_INVAL_MARKS(struct pnfs_inval_markings
>> *marks, sector_t blocksize)
>>>> {
>>>> spin_lock_init(&marks->im_lock);
>>>> INIT_LIST_HEAD(&marks->im_tree.mtt_stub);
>>>> + INIT_LIST_HEAD(&marks->im_extents);
>>>> marks->im_block_size = blocksize;
>>>> marks->im_tree.mtt_step_size = min((sector_t)PAGE_CACHE_SECTORS,
>>>> blocksize);
>>>> @@ -200,5 +202,8 @@ int bl_add_merge_extent(struct pnfs_block_layout *bl,
>>>> struct pnfs_block_extent *new);
>>>> int bl_mark_for_commit(struct pnfs_block_extent *be,
>>>> sector_t offset, sector_t length);
>>>> +int bl_push_one_short_extent(struct pnfs_inval_markings *marks);
>>>> +struct pnfs_block_short_extent*
>>>> +bl_pop_short_extent(struct pnfs_inval_markings *marks, int num_to_pop);
>>>>
>>>> #endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
>>>> diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
>>>> index 952ea8a..72c7fa1 100644
>>>> --- a/fs/nfs/blocklayout/extents.c
>>>> +++ b/fs/nfs/blocklayout/extents.c
>>>> @@ -863,3 +863,40 @@ clean_pnfs_block_layoutupdate(struct
>> pnfs_block_layout *bl,
>>>> }
>>>> }
>>>> }
>>>> +
>>>> +int
>>>> +bl_push_one_short_extent(struct pnfs_inval_markings *marks) {
>>>> + struct pnfs_block_short_extent *new;
>>>> +
>>>> + new = kmalloc(sizeof(*new), GFP_NOFS);
>>>> + if (unlikely(!new))
>>>> + return -ENOMEM;
>>>> +
>>>> + spin_lock(&marks->im_lock);
>>>> + list_add(&new->bse_node, &marks->im_extents);
>>>> + spin_unlock(&marks->im_lock);
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +struct pnfs_block_short_extent*
>>>> +bl_pop_short_extent(struct pnfs_inval_markings *marks, int num_to_pop) {
>>>> + struct pnfs_block_short_extent *rv = NULL;
>>
>> The use case for bl_pop_short_extent comes only in the next patch
>> so it should be introduced there in principle, along with its use.
> I was hoping not to create too large a single patch... But since you asked, I will merge them into one.
>
>>
>> With that in mind, there is one call site with num_to_pop == 1
>> in which you don't free any entry, and the other with num_to_pop == num_se
>> which seems to free all the members of the list, but the last one,
>> which is then freed by the caller.
>>
>> This tells me you crammed two functions into one and you better
>> have one that pops a single element and another to destroy the list.
> Good point. I thought there might be more users of the function when I first wrote it, but apparently I was wrong...
> I will split it into bl_pop_one_short_extent() and bl_free_short_extents() as you suggested.

Terrific.

>
> Thank you very much!

Thank you for doing all the hard work!

Benny

> Tao
>
>>
>> Benny
>>
>>>> +
>>>> + if (unlikely(num_to_pop <= 0))
>>>> + return rv;
>>>
>>> How unlikely is it?
>>> Is doing the extra compare really worth saving the spin_lock?
>>>
>>>> +
>>>> + spin_lock(&marks->im_lock);
>>>> + while (!list_empty(&marks->im_extents) && num_to_pop-- > 0) {
>>>> + rv = list_entry((&marks->im_extents)->next,
>>>> + struct pnfs_block_short_extent, bse_node);
>>>> + list_del_init(&rv->bse_node);
>>>> + if (num_to_pop)
>>>> + kfree(rv);
>>>
>>> Please correct me if I'm wrong, you don't want to free the last element
>>> you pop since you want to return it. This is worth a comment...
>>>
>>> I'd consider moving the decrement expression down here or
>>> changing the loop to be a for loop to improve its readability.
>>> In the latter case this will say if (num_to_pop > 1) kfree(rv)
>>> which is more straightforward IMHO.
>>>
>>> Benny
>>>
>>>> + }
>>>> + spin_unlock(&marks->im_lock);
>>>> +
>>>> + BUG_ON(num_to_pop > 0);
>>>> +
>>>> + return rv;
>>>> +}
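
For reference, a minimal sketch of the proposed split (one helper that pops a single
preallocated entry, another that destroys the list), using the existing im_lock and
im_extents fields; this is only an illustration of the agreed direction, not the
actual follow-up patch:

struct pnfs_block_short_extent *
bl_pop_one_short_extent(struct pnfs_inval_markings *marks)
{
	struct pnfs_block_short_extent *rv = NULL;

	spin_lock(&marks->im_lock);
	if (!list_empty(&marks->im_extents)) {
		rv = list_first_entry(&marks->im_extents,
				      struct pnfs_block_short_extent, bse_node);
		list_del_init(&rv->bse_node);
	}
	spin_unlock(&marks->im_lock);

	return rv;
}

void
bl_free_short_extents(struct pnfs_inval_markings *marks, int num_to_free)
{
	struct pnfs_block_short_extent *se;

	spin_lock(&marks->im_lock);
	while (num_to_free-- > 0 && !list_empty(&marks->im_extents)) {
		se = list_first_entry(&marks->im_extents,
				      struct pnfs_block_short_extent, bse_node);
		list_del(&se->bse_node);
		kfree(se);
	}
	spin_unlock(&marks->im_lock);
}

The single-pop variant would then serve the num_to_pop == 1 call site and the free
variant the num_to_pop == num_se one discussed above.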
>

2011-11-09 15:18:07

by Peng Tao

[permalink] [raw]
Subject: [PATCH 8/8] pnfsblock: alloc short extent before submit bio

As discussed earlier, it is better for block client to allocate memoroy for
tracking extents state before submitting bio. So the patch does it by allocating
a short_extent for every INVALID extent touched by write pagelist and for
every zeroing page we created, saving them in layout header.
Then in end_io we can just use them to create commit list items and avoid
memory allocation there.

Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 70 +++++++++++++++++++++++++++++---------
fs/nfs/blocklayout/blocklayout.h | 3 +-
fs/nfs/blocklayout/extents.c | 13 ++-----
3 files changed, 60 insertions(+), 26 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index cb4ff0f..53cd332 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -90,8 +90,9 @@ static int is_writable(struct pnfs_block_extent *be, sector_t isect)
*/
struct parallel_io {
struct kref refcnt;
- void (*pnfs_callback) (void *data);
+ void (*pnfs_callback) (void *data, int num_se);
void *data;
+ int bse_count;
};

static inline struct parallel_io *alloc_parallel(void *data)
@@ -102,6 +103,7 @@ static inline struct parallel_io *alloc_parallel(void *data)
if (rv) {
rv->data = data;
kref_init(&rv->refcnt);
+ rv->bse_count = 0;
}
return rv;
}
@@ -116,7 +118,7 @@ static void destroy_parallel(struct kref *kref)
struct parallel_io *p = container_of(kref, struct parallel_io, refcnt);

dprintk("%s enter\n", __func__);
- p->pnfs_callback(p->data);
+ p->pnfs_callback(p->data, p->bse_count);
kfree(p);
}

@@ -211,7 +213,7 @@ static void bl_read_cleanup(struct work_struct *work)
}

static void
-bl_end_par_io_read(void *data)
+bl_end_par_io_read(void *data, int unused)
{
struct nfs_read_data *rdata = data;

@@ -312,6 +314,7 @@ static void mark_extents_written(struct pnfs_block_layout *bl,
{
sector_t isect, end;
struct pnfs_block_extent *be;
+ struct pnfs_block_short_extent *se;

dprintk("%s(%llu, %u)\n", __func__, offset, count);
if (count == 0)
@@ -324,8 +327,11 @@ static void mark_extents_written(struct pnfs_block_layout *bl,
be = bl_find_get_extent(bl, isect, NULL);
BUG_ON(!be); /* FIXME */
len = min(end, be->be_f_offset + be->be_length) - isect;
- if (be->be_state == PNFS_BLOCK_INVALID_DATA)
- bl_mark_for_commit(be, isect, len); /* What if fails? */
+ if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
+ se = bl_pop_short_extent(be->be_inval, 1);
+ BUG_ON(!se);
+ bl_mark_for_commit(be, isect, len, se);
+ }
isect += len;
bl_put_extent(be);
}
@@ -347,7 +353,8 @@ static void bl_end_io_write_zero(struct bio *bio, int err)
end_page_writeback(page);
page_cache_release(page);
} while (bvec >= bio->bi_io_vec);
- if (!uptodate) {
+
+ if (unlikely(!uptodate)) {
if (!wdata->pnfs_error)
wdata->pnfs_error = -EIO;
pnfs_set_lo_fail(wdata->lseg);
@@ -356,7 +363,6 @@ static void bl_end_io_write_zero(struct bio *bio, int err)
put_parallel(par);
}

-/* This is basically copied from mpage_end_io_read */
static void bl_end_io_write(struct bio *bio, int err)
{
struct parallel_io *par = bio->bi_private;
@@ -382,7 +388,7 @@ static void bl_write_cleanup(struct work_struct *work)
dprintk("%s enter\n", __func__);
task = container_of(work, struct rpc_task, u.tk_work);
wdata = container_of(task, struct nfs_write_data, task);
- if (!wdata->pnfs_error) {
+ if (likely(!wdata->pnfs_error)) {
/* Marks for LAYOUTCOMMIT */
mark_extents_written(BLK_LSEG2EXT(wdata->lseg),
wdata->args.offset, wdata->args.count);
@@ -391,9 +397,16 @@ static void bl_write_cleanup(struct work_struct *work)
}

/* Called when last of bios associated with a bl_write_pagelist call finishes */
-static void bl_end_par_io_write(void *data)
+static void bl_end_par_io_write(void *data, int num_se)
{
struct nfs_write_data *wdata = data;
+ struct pnfs_block_short_extent *se;
+
+ if (unlikely(wdata->pnfs_error)) {
+ se = bl_pop_short_extent(&BLK_LSEG2EXT(wdata->lseg)->bl_inval,
+ num_se);
+ kfree(se);
+ }

wdata->task.tk_status = wdata->pnfs_error;
wdata->verf.committed = NFS_FILE_SYNC;
@@ -548,7 +561,7 @@ bl_write_pagelist(struct nfs_write_data *wdata, int sync)
*/
par = alloc_parallel(wdata);
if (!par)
- return PNFS_NOT_ATTEMPTED;
+ goto out_mds;
par->pnfs_callback = bl_end_par_io_write;
/* At this point, have to be more careful with error handling */

@@ -556,12 +569,15 @@ bl_write_pagelist(struct nfs_write_data *wdata, int sync)
be = bl_find_get_extent(BLK_LSEG2EXT(wdata->lseg), isect, &cow_read);
if (!be || !is_writable(be, isect)) {
dprintk("%s no matching extents!\n", __func__);
- wdata->pnfs_error = -EINVAL;
- goto out;
+ goto out_mds;
}

/* First page inside INVALID extent */
if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
+ if (likely(!bl_push_one_short_extent(be->be_inval)))
+ par->bse_count++;
+ else
+ goto out_mds;
temp = offset >> PAGE_CACHE_SHIFT;
npg_zero = do_div(temp, npg_per_block);
isect = (sector_t) (((offset - npg_zero * PAGE_CACHE_SIZE) &
@@ -598,6 +614,19 @@ fill_invalid_ext:
wdata->pnfs_error = ret;
goto out;
}
+ if (likely(!bl_push_one_short_extent(be->be_inval)))
+ par->bse_count++;
+ else {
+ end_page_writeback(page);
+ page_cache_release(page);
+ wdata->pnfs_error = -ENOMEM;
+ goto out;
+ }
+ /* FIXME: This should be done in bi_end_io */
+ mark_extents_written(BLK_LSEG2EXT(wdata->lseg),
+ page->index << PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE);
+
bio = bl_add_page_to_bio(bio, npg_zero, WRITE,
isect, page, be,
bl_end_io_write_zero, par);
@@ -606,10 +635,6 @@ fill_invalid_ext:
bio = NULL;
goto out;
}
- /* FIXME: This should be done in bi_end_io */
- mark_extents_written(BLK_LSEG2EXT(wdata->lseg),
- page->index << PAGE_CACHE_SHIFT,
- PAGE_CACHE_SIZE);
next_page:
isect += PAGE_CACHE_SECTORS;
extent_length -= PAGE_CACHE_SECTORS;
@@ -633,6 +658,14 @@ next_page:
wdata->pnfs_error = -EINVAL;
goto out;
}
+ if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
+ if (likely(!bl_push_one_short_extent(be->be_inval)))
+ par->bse_count++;
+ else {
+ wdata->pnfs_error = -ENOMEM;
+ goto out;
+ }
+ }
extent_length = be->be_length -
(isect - be->be_f_offset);
}
@@ -680,6 +713,11 @@ out:
bl_submit_bio(WRITE, bio);
put_parallel(par);
return PNFS_ATTEMPTED;
+out_mds:
+ bl_put_extent(be);
+ if (par)
+ kfree(par);
+ return PNFS_NOT_ATTEMPTED;
}

/* FIXME - range ignored */
diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index df0e0fb..4986c23 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -201,7 +201,8 @@ void clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
int bl_add_merge_extent(struct pnfs_block_layout *bl,
struct pnfs_block_extent *new);
int bl_mark_for_commit(struct pnfs_block_extent *be,
- sector_t offset, sector_t length);
+ sector_t offset, sector_t length,
+ struct pnfs_block_short_extent *new);
int bl_push_one_short_extent(struct pnfs_inval_markings *marks);
struct pnfs_block_short_extent*
bl_pop_short_extent(struct pnfs_inval_markings *marks, int num_to_pop);
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 72c7fa1..21da3a3 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -370,20 +370,18 @@ static void add_to_commitlist(struct pnfs_block_layout *bl,

/* Note the range described by offset, length is guaranteed to be contained
* within be.
+ * new will be freed, either by this function or add_to_commitlist if they
+ * decide not to use it, or after LAYOUTCOMMIT uses it in the commitlist.
*/
int bl_mark_for_commit(struct pnfs_block_extent *be,
- sector_t offset, sector_t length)
+ sector_t offset, sector_t length,
+ struct pnfs_block_short_extent *new)
{
sector_t new_end, end = offset + length;
- struct pnfs_block_short_extent *new;
struct pnfs_block_layout *bl = container_of(be->be_inval,
struct pnfs_block_layout,
bl_inval);

- new = kmalloc(sizeof(*new), GFP_NOFS);
- if (!new)
- return -ENOMEM;
-
mark_written_sectors(be->be_inval, offset, length);
/* We want to add the range to commit list, but it must be
* block-normalized, and verified that the normalized range has
@@ -413,9 +411,6 @@ int bl_mark_for_commit(struct pnfs_block_extent *be,
new->bse_mdev = be->be_mdev;

spin_lock(&bl->bl_ext_lock);
- /* new will be freed, either by add_to_commitlist if it decides not
- * to use it, or after LAYOUTCOMMIT uses it in the commitlist.
- */
add_to_commitlist(bl, new);
spin_unlock(&bl->bl_ext_lock);
return 0;
--
1.7.1.262.g5ef3d
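
In outline, the preallocations introduced here pair up with the completion paths as
follows (a summary of the diff above, written out as a comment for clarity; nothing
beyond what the patch itself adds):

/*
 * bl_write_pagelist():
 *	for every INVALID extent touched and every zeroing page created:
 *		bl_push_one_short_extent(be->be_inval);	GFP_NOFS allocation
 *		par->bse_count++;			remembered for the error path
 *
 * on success (bl_write_cleanup -> mark_extents_written), per INVALID range:
 *		se = bl_pop_short_extent(be->be_inval, 1);
 *		bl_mark_for_commit(be, isect, len, se);	consumes or frees se
 *
 * on error (bl_end_par_io_write):
 *		se = bl_pop_short_extent(..., num_se);	frees all but the last entry
 *		kfree(se);				the last one is freed here
 */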


2011-11-10 08:56:57

by Peng, Tao

[permalink] [raw]
Subject: RE: [PATCH 6/8] pnfsblock: clean up _add_entry


> -----Original Message-----
> From: Benny Halevy [mailto:[email protected]]
> Sent: Thursday, November 10, 2011 4:44 PM
> To: Peng Tao
> Cc: [email protected]; [email protected]; Peng, Tao
> Subject: Re: [PATCH 6/8] pnfsblock: clean up _add_entry
>
> On 2011-11-09 17:16, Peng Tao wrote:
> > It is wrong to kmalloc in _add_entry() as it is inside a
> > spinlock. Memory should already be allocated when _add_entry() is called.
>
> What about the call from _set_range, it is passing NULL?
When _set_range() is called with NULL, the caller is mark_written_sectors(). For each sector, we must first mark it initialized in
bl_mark_sectors_init() before marking it written, and memory for the entry is allocated in bl_mark_sectors_init(). So if
we somehow don't find the entry in mark_written_sectors(), there must be a bug somewhere. That is what the BUG_ON is there for.
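In other words, the ordering the BUG_ON relies on is (a simplified outline using the
existing function names):

/*
 *	bl_mark_sectors_init()		allocates the tracking entry and
 *					inserts it into the tree
 *
 *	mark_written_sectors()
 *	    -> _set_range(..., storage == NULL)
 *	        -> _add_entry()		must find the entry inserted by
 *					bl_mark_sectors_init(); reaching the
 *					!storage branch means the init step
 *					was skipped, hence the BUG_ON
 */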

Thanks,
Tao

>
> > So BUG instead if we fail to find a match and no memory preallocated.
> >
> > Signed-off-by: Peng Tao <[email protected]>
> > ---
> > fs/nfs/blocklayout/extents.c | 9 ++-------
> > 1 files changed, 2 insertions(+), 7 deletions(-)
> >
> > diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
> > index f383524..952ea8a 100644
> > --- a/fs/nfs/blocklayout/extents.c
> > +++ b/fs/nfs/blocklayout/extents.c
> > @@ -110,13 +110,8 @@ static int _add_entry(struct my_tree *tree, u64 s, int32_t
> tag,
> > return 0;
> > } else {
> > struct pnfs_inval_tracking *new;
> > - if (storage)
> > - new = storage;
> > - else {
> > - new = kmalloc(sizeof(*new), GFP_NOFS);
> > - if (!new)
> > - return -ENOMEM;
> > - }
> > + BUG_ON(!storage);
>
> The BUG isn't needed as you dereference the pointer right away.
>
> Benny
>
> > + new = storage;
> > new->it_sector = s;
> > new->it_tags = (1 << tag);
> > list_add(&new->it_link, &pos->it_link);


2011-11-10 09:40:38

by Peng, Tao

[permalink] [raw]
Subject: RE: [PATCH 3/8] pnfsblock: move find lock page logic out of bl_write_pagelist

> -----Original Message-----
> From: Benny Halevy [mailto:[email protected]]
> Sent: Thursday, November 10, 2011 5:32 PM
> To: Peng, Tao
> Cc: [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH 3/8] pnfsblock: move find lock page logic out of
> bl_write_pagelist
>
> On 2011-11-10 10:49, [email protected] wrote:
> >
> >> -----Original Message-----
> >> From: [email protected]
> [mailto:[email protected]]
> >> On Behalf Of Benny Halevy
> >> Sent: Thursday, November 10, 2011 4:32 PM
> >> To: Peng Tao
> >> Cc: [email protected]; [email protected]; Peng, Tao
> >> Subject: Re: [PATCH 3/8] pnfsblock: move find lock page logic out of
> >> bl_write_pagelist
> >>
> >> On 2011-11-09 17:16, Peng Tao wrote:
> >>> Also avoid unnecessary lock_page if page is handled by others.
> >>>
> >>> Signed-off-by: Peng Tao <[email protected]>
> >>> ---
> >>> fs/nfs/blocklayout/blocklayout.c | 78
> >> ++++++++++++++++++++++++++------------
> >>> 1 files changed, 54 insertions(+), 24 deletions(-)
> >>>
> >>> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> >>> index 4ced0b0..e8e13f3 100644
> >>> --- a/fs/nfs/blocklayout/blocklayout.c
> >>> +++ b/fs/nfs/blocklayout/blocklayout.c
> >>> @@ -484,6 +484,56 @@ cleanup:
> >>> return ret;
> >>> }
> >>>
> >>> +/* Find or create a zeroing page marked being writeback.
> >>> + * Return ERR_PTR on error, NULL to indicate skip this page and page itself
> >>> + * to indicate write out.
> >>> + */
> >>> +static struct page*
> >>
> >> coding style nit: "page *" rather than "page*"
> >>
> >>> +bl_find_get_zeroing_page(struct inode *inode, pgoff_t index,
> >>> + struct pnfs_block_extent *cow_read)
> >>> +{
> >>> + struct page* page;
> >>
> >> checkpatch nit:
> >> ERROR: "foo* bar" should be "foo *bar"
> > Thank you! Will change both.
> >
> >
> >>
> >>> + int locked = 0;
> >>> + page = find_get_page(inode->i_mapping, index);
> >>> + if (page)
> >>> + goto check_page;
> >>> +
> >>> + page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
> >>> + if (unlikely(!page)) {
> >>> + dprintk("%s oom\n", __func__);
> >>> + return ERR_PTR(-ENOMEM);
> >>> + }
> >>> + locked = 1;
> >>
> >> Why not just call find_or_create_page in the first place?
> >> dealing with both the locked and unlocked cases here just
> >> seems to add unneeded complexity.
> > It is to avoid locking the page when possible. find_or_create_page() calls
> find_lock_page(), but there are cases where the page is already locked by
> someone else (e.g., in nfs_writepage_locked()) and the current process has to wait. Pressure
> tests show a lot of such contention, so I want to optimize it out.
>
> I see. So isn't there a missing page_cache_release
> in the late locking case (see below...)?
>
> >
> > Thanks,
> > Tao
> >>
> >> Benny
> >>
> >>> +
> >>> +check_page:
> >>> + /* PageDirty: Other will write this out
> >>> + * PageWriteback: Other is writing this out
> >>> + * PageUptodate: It was read before
> >>> + */
> >>> + if (PageDirty(page) || PageWriteback(page)) {
> >>> + print_page(page);
> >>> + if (locked)
> >>> + unlock_page(page);
> >>> + page_cache_release(page);
> >>> + return NULL;
> >>> + }
> >>> +
> >>> + if (!locked) {
> >>> + lock_page(page);
> >>> + if (PageDirty(page) || PageWriteback(page))
> >>> + unlock_page(page);
>
> I suspect page_cache_release is required here too.
> Actually, doing the following instead makes sense to me:
>
> if (!locked) {
> lock_page(page);
> locked = true; // yeah, it should really be bool, not int
> goto check_page;
> }
Thanks, will make the change.

Best,
Tao
>
> Benny
>
> >>> + return NULL;
> >>> + }
> >>> + if (!PageUptodate(page)) {
> >>> + /* New page, readin or zero it */
> >>> + init_page_for_write(page, cow_read);
> >>> + }
> >>> + set_page_writeback(page);
> >>> + unlock_page(page);
> >>> +
> >>> + return page;
> >>> +}
> >>> +
> >>> static enum pnfs_try_status
> >>> bl_write_pagelist(struct nfs_write_data *wdata, int sync)
> >>> {
> >>> @@ -543,32 +593,12 @@ fill_invalid_ext:
> >>> dprintk("%s zero %dth page: index %lu isect %llu\n",
> >>> __func__, npg_zero, index,
> >>> (unsigned long long)isect);
> >>> - page =
> >>> - find_or_create_page(wdata->inode->i_mapping, index,
> >>> - GFP_NOFS);
> >>> - if (!page) {
> >>> - dprintk("%s oom\n", __func__);
> >>> - wdata->pnfs_error = -ENOMEM;
> >>> + page = bl_find_get_zeroing_page(wdata->inode, index,
> cow_read);
> >>> + if (unlikely(IS_ERR(page))) {
> >>> + wdata->pnfs_error = PTR_ERR(page);
> >>> goto out;
> >>> - }
> >>> -
> >>> - /* PageDirty: Other will write this out
> >>> - * PageWriteback: Other is writing this out
> >>> - * PageUptodate: It was read before
> >>> - * sector_initialized: already written out
> >>> - */
> >>> - if (PageDirty(page) || PageWriteback(page)) {
> >>> - print_page(page);
> >>> - unlock_page(page);
> >>> - page_cache_release(page);
> >>> + } else if (page == NULL)
> >>> goto next_page;
> >>> - }
> >>> - if (!PageUptodate(page)) {
> >>> - /* New page, readin or zero it */
> >>> - init_page_for_write(page, cow_read);
> >>> - }
> >>> - set_page_writeback(page);
> >>> - unlock_page(page);
> >>>
> >>> ret = bl_mark_sectors_init(be->be_inval, isect,
> >>> PAGE_CACHE_SECTORS);
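
Folding both points together (the missing page_cache_release() on the late-lock
bail-out, plus re-running the checks via goto check_page with a bool flag), the
helper might end up looking roughly like this; a sketch of the discussed changes
only, not the actual v2 patch:

static struct page *
bl_find_get_zeroing_page(struct inode *inode, pgoff_t index,
			 struct pnfs_block_extent *cow_read)
{
	struct page *page;
	bool locked = false;

	page = find_get_page(inode->i_mapping, index);
	if (page)
		goto check_page;

	page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
	if (unlikely(!page)) {
		dprintk("%s oom\n", __func__);
		return ERR_PTR(-ENOMEM);
	}
	locked = true;

check_page:
	/* PageDirty: Other will write this out
	 * PageWriteback: Other is writing this out
	 * PageUptodate: It was read before
	 */
	if (PageDirty(page) || PageWriteback(page)) {
		print_page(page);
		if (locked)
			unlock_page(page);
		page_cache_release(page);
		return NULL;
	}

	if (!locked) {
		lock_page(page);
		locked = true;
		goto check_page;	/* recheck using the existing release path */
	}
	if (!PageUptodate(page)) {
		/* New page, readin or zero it */
		init_page_for_write(page, cow_read);
	}
	set_page_writeback(page);
	unlock_page(page);

	return page;
}

This way the Dirty/Writeback test is not duplicated: after taking the lock we jump
back to check_page, and the existing branch there already does the unlock and
page_cache_release.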


2011-11-10 08:44:18

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 6/8] pnfsblock: clean up _add_entry

On 2011-11-09 17:16, Peng Tao wrote:
> It is wrong to kmalloc in _add_entry() as it is inside a
> spinlock. Memory should already be allocated when _add_entry() is called.

What about the call from _set_range, it is passing NULL?

> So BUG instead if we fail to find a match and no memory preallocated.
>
> Signed-off-by: Peng Tao <[email protected]>
> ---
> fs/nfs/blocklayout/extents.c | 9 ++-------
> 1 files changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
> index f383524..952ea8a 100644
> --- a/fs/nfs/blocklayout/extents.c
> +++ b/fs/nfs/blocklayout/extents.c
> @@ -110,13 +110,8 @@ static int _add_entry(struct my_tree *tree, u64 s, int32_t tag,
> return 0;
> } else {
> struct pnfs_inval_tracking *new;
> - if (storage)
> - new = storage;
> - else {
> - new = kmalloc(sizeof(*new), GFP_NOFS);
> - if (!new)
> - return -ENOMEM;
> - }
> + BUG_ON(!storage);

The BUG isn't needed as you dereference the pointer right away.

Benny

> + new = storage;
> new->it_sector = s;
> new->it_tags = (1 << tag);
> list_add(&new->it_link, &pos->it_link);

2011-11-09 15:18:01

by Peng Tao

[permalink] [raw]
Subject: [PATCH 7/8] pnfsblock: add im_extents to pnfs_inval_markings

It stores a list of short extents for INVAL->RW conversion.
Also add two functions to manipulate them, in preparation to
move malloc logic out of end_io.

Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 6 ++++++
fs/nfs/blocklayout/blocklayout.h | 5 +++++
fs/nfs/blocklayout/extents.c | 37 +++++++++++++++++++++++++++++++++++++
3 files changed, 48 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 815c0c3..cb4ff0f 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -706,11 +706,17 @@ static void
release_inval_marks(struct pnfs_inval_markings *marks)
{
struct pnfs_inval_tracking *pos, *temp;
+ struct pnfs_block_short_extent *se, *stemp;

list_for_each_entry_safe(pos, temp, &marks->im_tree.mtt_stub, it_link) {
list_del(&pos->it_link);
kfree(pos);
}
+
+ list_for_each_entry_safe(se, stemp, &marks->im_extents, bse_node) {
+ list_del(&se->bse_node);
+ kfree(se);
+ }
return;
}

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 60728ac..df0e0fb 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -70,6 +70,7 @@ struct pnfs_inval_markings {
spinlock_t im_lock;
struct my_tree im_tree; /* Sectors that need LAYOUTCOMMIT */
sector_t im_block_size; /* Server blocksize in sectors */
+ struct list_head im_extents; /* List of short extents for INVAL->RW conversion */
};

struct pnfs_inval_tracking {
@@ -105,6 +106,7 @@ BL_INIT_INVAL_MARKS(struct pnfs_inval_markings *marks, sector_t blocksize)
{
spin_lock_init(&marks->im_lock);
INIT_LIST_HEAD(&marks->im_tree.mtt_stub);
+ INIT_LIST_HEAD(&marks->im_extents);
marks->im_block_size = blocksize;
marks->im_tree.mtt_step_size = min((sector_t)PAGE_CACHE_SECTORS,
blocksize);
@@ -200,5 +202,8 @@ int bl_add_merge_extent(struct pnfs_block_layout *bl,
struct pnfs_block_extent *new);
int bl_mark_for_commit(struct pnfs_block_extent *be,
sector_t offset, sector_t length);
+int bl_push_one_short_extent(struct pnfs_inval_markings *marks);
+struct pnfs_block_short_extent*
+bl_pop_short_extent(struct pnfs_inval_markings *marks, int num_to_pop);

#endif /* FS_NFS_NFS4BLOCKLAYOUT_H */
diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
index 952ea8a..72c7fa1 100644
--- a/fs/nfs/blocklayout/extents.c
+++ b/fs/nfs/blocklayout/extents.c
@@ -863,3 +863,40 @@ clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
}
}
}
+
+int
+bl_push_one_short_extent(struct pnfs_inval_markings *marks) {
+ struct pnfs_block_short_extent *new;
+
+ new = kmalloc(sizeof(*new), GFP_NOFS);
+ if (unlikely(!new))
+ return -ENOMEM;
+
+ spin_lock(&marks->im_lock);
+ list_add(&new->bse_node, &marks->im_extents);
+ spin_unlock(&marks->im_lock);
+
+ return 0;
+}
+
+struct pnfs_block_short_extent*
+bl_pop_short_extent(struct pnfs_inval_markings *marks, int num_to_pop) {
+ struct pnfs_block_short_extent *rv = NULL;
+
+ if (unlikely(num_to_pop <= 0))
+ return rv;
+
+ spin_lock(&marks->im_lock);
+ while (!list_empty(&marks->im_extents) && num_to_pop-- > 0) {
+ rv = list_entry((&marks->im_extents)->next,
+ struct pnfs_block_short_extent, bse_node);
+ list_del_init(&rv->bse_node);
+ if (num_to_pop)
+ kfree(rv);
+ }
+ spin_unlock(&marks->im_lock);
+
+ BUG_ON(num_to_pop > 0);
+
+ return rv;
+}
--
1.7.1.262.g5ef3d


2011-11-09 15:17:37

by Peng Tao

[permalink] [raw]
Subject: [PATCH 3/8] pnfsblock: move find lock page logic out of bl_write_pagelist

Also avoid unnecessary lock_page if page is handled by others.

Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 78 ++++++++++++++++++++++++++------------
1 files changed, 54 insertions(+), 24 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 4ced0b0..e8e13f3 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -484,6 +484,56 @@ cleanup:
return ret;
}

+/* Find or create a zeroing page marked being writeback.
+ * Return ERR_PTR on error, NULL to indicate skip this page and page itself
+ * to indicate write out.
+ */
+static struct page*
+bl_find_get_zeroing_page(struct inode *inode, pgoff_t index,
+ struct pnfs_block_extent *cow_read)
+{
+ struct page* page;
+ int locked = 0;
+ page = find_get_page(inode->i_mapping, index);
+ if (page)
+ goto check_page;
+
+ page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+ if (unlikely(!page)) {
+ dprintk("%s oom\n", __func__);
+ return ERR_PTR(-ENOMEM);
+ }
+ locked = 1;
+
+check_page:
+ /* PageDirty: Other will write this out
+ * PageWriteback: Other is writing this out
+ * PageUptodate: It was read before
+ */
+ if (PageDirty(page) || PageWriteback(page)) {
+ print_page(page);
+ if (locked)
+ unlock_page(page);
+ page_cache_release(page);
+ return NULL;
+ }
+
+ if (!locked) {
+ lock_page(page);
+ if (PageDirty(page) || PageWriteback(page))
+ unlock_page(page);
+ return NULL;
+ }
+ if (!PageUptodate(page)) {
+ /* New page, readin or zero it */
+ init_page_for_write(page, cow_read);
+ }
+ set_page_writeback(page);
+ unlock_page(page);
+
+ return page;
+}
+
static enum pnfs_try_status
bl_write_pagelist(struct nfs_write_data *wdata, int sync)
{
@@ -543,32 +593,12 @@ fill_invalid_ext:
dprintk("%s zero %dth page: index %lu isect %llu\n",
__func__, npg_zero, index,
(unsigned long long)isect);
- page =
- find_or_create_page(wdata->inode->i_mapping, index,
- GFP_NOFS);
- if (!page) {
- dprintk("%s oom\n", __func__);
- wdata->pnfs_error = -ENOMEM;
+ page = bl_find_get_zeroing_page(wdata->inode, index, cow_read);
+ if (unlikely(IS_ERR(page))) {
+ wdata->pnfs_error = PTR_ERR(page);
goto out;
- }
-
- /* PageDirty: Other will write this out
- * PageWriteback: Other is writing this out
- * PageUptodate: It was read before
- * sector_initialized: already written out
- */
- if (PageDirty(page) || PageWriteback(page)) {
- print_page(page);
- unlock_page(page);
- page_cache_release(page);
+ } else if (page == NULL)
goto next_page;
- }
- if (!PageUptodate(page)) {
- /* New page, readin or zero it */
- init_page_for_write(page, cow_read);
- }
- set_page_writeback(page);
- unlock_page(page);

ret = bl_mark_sectors_init(be->be_inval, isect,
PAGE_CACHE_SECTORS);
--
1.7.1.262.g5ef3d


2011-11-10 02:09:38

by Peng, Tao

[permalink] [raw]
Subject: RE: [PATCH 0/8] pnfsblock cleanup and fixes

Benny, hi,

These patches fail to apply to the pnfs-latest branch because of conflicts with the pnfs private workqueue patches. Since Trond has decided not to take the pnfs private workqueue patches, would you please drop them so that the new patches can apply? I can send you revert patches if you want.

Thanks,
Tao

> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Peng Tao
> Sent: Wednesday, November 09, 2011 11:16 PM
> To: [email protected]
> Cc: [email protected]; [email protected]
> Subject: [PATCH 0/8] pnfsblock cleanup and fixes
>
> Hi, Trond and Benny,
>
> Bellow are some pnfsblock patches that I have tested for days. Most of them
> are cleanup and small improvements. Only
> [PATCH 2/8] pnfsblock: acquire im_lock in _preload_range
> needs to go into stable.
>
> Thanks,
> Tao
>
> Peng Tao (8):
> pnfsblock: cleanup bl_mark_sectors_init
> pnfsblock: acquire im_lock in _preload_range
> pnfsblock: move find lock page logic out of bl_write_pagelist
> pnfsblock: set read/write tk_status to pnfs_error
> pnfsblock: remove rpc_call_ops from struct parallel_io
> pnfsblock: clean up _add_entry
> pnfsblock: add im_extents to pnfs_inval_markings
> pnfsblock: alloc short extent before submit bio
>
> fs/nfs/blocklayout/blocklayout.c | 176 +++++++++++++++++++++++++-------------
> fs/nfs/blocklayout/blocklayout.h | 11 ++-
> fs/nfs/blocklayout/extents.c | 146 ++++++++++++--------------------
> 3 files changed, 180 insertions(+), 153 deletions(-)
>


2011-11-10 09:23:01

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 8/8] pnfsblock: alloc short extent before submit bio

Hi Tao,

I like the direction in this patch; however,
there are a couple of nits below.

Thanks,

Benny

On 2011-11-09 17:16, Peng Tao wrote:
> As discussed earlier, it is better for block client to allocate memoroy for

nit: "memory"

> tracking extents state before submitting bio. So the patch does it by allocating
> a short_extent for every INVALID extent touched by write pagelist and for
> every zeroing page we created, saving them in layout header.
> Then in end_io we can just use them to create commit list items and avoid
> memory allocation there.
>
> Signed-off-by: Peng Tao <[email protected]>
> ---
> fs/nfs/blocklayout/blocklayout.c | 70 +++++++++++++++++++++++++++++---------
> fs/nfs/blocklayout/blocklayout.h | 3 +-
> fs/nfs/blocklayout/extents.c | 13 ++-----
> 3 files changed, 60 insertions(+), 26 deletions(-)
>
> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> index cb4ff0f..53cd332 100644
> --- a/fs/nfs/blocklayout/blocklayout.c
> +++ b/fs/nfs/blocklayout/blocklayout.c
> @@ -90,8 +90,9 @@ static int is_writable(struct pnfs_block_extent *be, sector_t isect)
> */
> struct parallel_io {
> struct kref refcnt;
> - void (*pnfs_callback) (void *data);
> + void (*pnfs_callback) (void *data, int num_se);
> void *data;
> + int bse_count;
> };
>
> static inline struct parallel_io *alloc_parallel(void *data)
> @@ -102,6 +103,7 @@ static inline struct parallel_io *alloc_parallel(void *data)
> if (rv) {
> rv->data = data;
> kref_init(&rv->refcnt);
> + rv->bse_count = 0;
> }
> return rv;
> }
> @@ -116,7 +118,7 @@ static void destroy_parallel(struct kref *kref)
> struct parallel_io *p = container_of(kref, struct parallel_io, refcnt);
>
> dprintk("%s enter\n", __func__);
> - p->pnfs_callback(p->data);
> + p->pnfs_callback(p->data, p->bse_count);
> kfree(p);
> }
>
> @@ -211,7 +213,7 @@ static void bl_read_cleanup(struct work_struct *work)
> }
>
> static void
> -bl_end_par_io_read(void *data)
> +bl_end_par_io_read(void *data, int unused)
> {
> struct nfs_read_data *rdata = data;
>
> @@ -312,6 +314,7 @@ static void mark_extents_written(struct pnfs_block_layout *bl,
> {
> sector_t isect, end;
> struct pnfs_block_extent *be;
> + struct pnfs_block_short_extent *se;
>
> dprintk("%s(%llu, %u)\n", __func__, offset, count);
> if (count == 0)
> @@ -324,8 +327,11 @@ static void mark_extents_written(struct pnfs_block_layout *bl,
> be = bl_find_get_extent(bl, isect, NULL);
> BUG_ON(!be); /* FIXME */
> len = min(end, be->be_f_offset + be->be_length) - isect;
> - if (be->be_state == PNFS_BLOCK_INVALID_DATA)
> - bl_mark_for_commit(be, isect, len); /* What if fails? */
> + if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
> + se = bl_pop_short_extent(be->be_inval, 1);
> + BUG_ON(!se);
> + bl_mark_for_commit(be, isect, len, se);
> + }
> isect += len;
> bl_put_extent(be);
> }
> @@ -347,7 +353,8 @@ static void bl_end_io_write_zero(struct bio *bio, int err)
> end_page_writeback(page);
> page_cache_release(page);
> } while (bvec >= bio->bi_io_vec);
> - if (!uptodate) {
> +
> + if (unlikely(!uptodate)) {
> if (!wdata->pnfs_error)
> wdata->pnfs_error = -EIO;
> pnfs_set_lo_fail(wdata->lseg);
> @@ -356,7 +363,6 @@ static void bl_end_io_write_zero(struct bio *bio, int err)
> put_parallel(par);
> }
>
> -/* This is basically copied from mpage_end_io_read */
> static void bl_end_io_write(struct bio *bio, int err)
> {
> struct parallel_io *par = bio->bi_private;
> @@ -382,7 +388,7 @@ static void bl_write_cleanup(struct work_struct *work)
> dprintk("%s enter\n", __func__);
> task = container_of(work, struct rpc_task, u.tk_work);
> wdata = container_of(task, struct nfs_write_data, task);
> - if (!wdata->pnfs_error) {
> + if (likely(!wdata->pnfs_error)) {
> /* Marks for LAYOUTCOMMIT */
> mark_extents_written(BLK_LSEG2EXT(wdata->lseg),
> wdata->args.offset, wdata->args.count);
> @@ -391,9 +397,16 @@ static void bl_write_cleanup(struct work_struct *work)
> }
>
> /* Called when last of bios associated with a bl_write_pagelist call finishes */
> -static void bl_end_par_io_write(void *data)
> +static void bl_end_par_io_write(void *data, int num_se)
> {
> struct nfs_write_data *wdata = data;
> + struct pnfs_block_short_extent *se;
> +
> + if (unlikely(wdata->pnfs_error)) {
> + se = bl_pop_short_extent(&BLK_LSEG2EXT(wdata->lseg)->bl_inval,
> + num_se);
> + kfree(se);
> + }
>
> wdata->task.tk_status = wdata->pnfs_error;
> wdata->verf.committed = NFS_FILE_SYNC;
> @@ -548,7 +561,7 @@ bl_write_pagelist(struct nfs_write_data *wdata, int sync)
> */
> par = alloc_parallel(wdata);
> if (!par)
> - return PNFS_NOT_ATTEMPTED;
> + goto out_mds;
> par->pnfs_callback = bl_end_par_io_write;
> /* At this point, have to be more careful with error handling */
>
> @@ -556,12 +569,15 @@ bl_write_pagelist(struct nfs_write_data *wdata, int sync)
> be = bl_find_get_extent(BLK_LSEG2EXT(wdata->lseg), isect, &cow_read);
> if (!be || !is_writable(be, isect)) {
> dprintk("%s no matching extents!\n", __func__);
> - wdata->pnfs_error = -EINVAL;
> - goto out;
> + goto out_mds;
> }
>
> /* First page inside INVALID extent */
> if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
> + if (likely(!bl_push_one_short_extent(be->be_inval)))
> + par->bse_count++;
> + else
> + goto out_mds;
> temp = offset >> PAGE_CACHE_SHIFT;
> npg_zero = do_div(temp, npg_per_block);
> isect = (sector_t) (((offset - npg_zero * PAGE_CACHE_SIZE) &
> @@ -598,6 +614,19 @@ fill_invalid_ext:
> wdata->pnfs_error = ret;
> goto out;
> }
> + if (likely(!bl_push_one_short_extent(be->be_inval)))
> + par->bse_count++;
> + else {
> + end_page_writeback(page);
> + page_cache_release(page);
> + wdata->pnfs_error = -ENOMEM;
> + goto out;
> + }
> + /* FIXME: This should be done in bi_end_io */
> + mark_extents_written(BLK_LSEG2EXT(wdata->lseg),
> + page->index << PAGE_CACHE_SHIFT,
> + PAGE_CACHE_SIZE);
> +
> bio = bl_add_page_to_bio(bio, npg_zero, WRITE,
> isect, page, be,
> bl_end_io_write_zero, par);
> @@ -606,10 +635,6 @@ fill_invalid_ext:
> bio = NULL;
> goto out;
> }
> - /* FIXME: This should be done in bi_end_io */
> - mark_extents_written(BLK_LSEG2EXT(wdata->lseg),
> - page->index << PAGE_CACHE_SHIFT,
> - PAGE_CACHE_SIZE);
> next_page:
> isect += PAGE_CACHE_SECTORS;
> extent_length -= PAGE_CACHE_SECTORS;
> @@ -633,6 +658,14 @@ next_page:
> wdata->pnfs_error = -EINVAL;
> goto out;
> }
> + if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
> + if (likely(!bl_push_one_short_extent(be->be_inval)))

checkpatch nit: please use tabs, not spaces, for indent...

Benny

> + par->bse_count++;
> + else {
> + wdata->pnfs_error = -ENOMEM;
> + goto out;
> + }
> + }
> extent_length = be->be_length -
> (isect - be->be_f_offset);
> }
> @@ -680,6 +713,11 @@ out:
> bl_submit_bio(WRITE, bio);
> put_parallel(par);
> return PNFS_ATTEMPTED;
> +out_mds:
> + bl_put_extent(be);
> + if (par)
> + kfree(par);
> + return PNFS_NOT_ATTEMPTED;
> }
>
> /* FIXME - range ignored */
> diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
> index df0e0fb..4986c23 100644
> --- a/fs/nfs/blocklayout/blocklayout.h
> +++ b/fs/nfs/blocklayout/blocklayout.h
> @@ -201,7 +201,8 @@ void clean_pnfs_block_layoutupdate(struct pnfs_block_layout *bl,
> int bl_add_merge_extent(struct pnfs_block_layout *bl,
> struct pnfs_block_extent *new);
> int bl_mark_for_commit(struct pnfs_block_extent *be,
> - sector_t offset, sector_t length);
> + sector_t offset, sector_t length,
> + struct pnfs_block_short_extent *new);
> int bl_push_one_short_extent(struct pnfs_inval_markings *marks);
> struct pnfs_block_short_extent*
> bl_pop_short_extent(struct pnfs_inval_markings *marks, int num_to_pop);
> diff --git a/fs/nfs/blocklayout/extents.c b/fs/nfs/blocklayout/extents.c
> index 72c7fa1..21da3a3 100644
> --- a/fs/nfs/blocklayout/extents.c
> +++ b/fs/nfs/blocklayout/extents.c
> @@ -370,20 +370,18 @@ static void add_to_commitlist(struct pnfs_block_layout *bl,
>
> /* Note the range described by offset, length is guaranteed to be contained
> * within be.
> + * new will be freed, either by this function or add_to_commitlist if they
> + * decide not to use it, or after LAYOUTCOMMIT uses it in the commitlist.
> */
> int bl_mark_for_commit(struct pnfs_block_extent *be,
> - sector_t offset, sector_t length)
> + sector_t offset, sector_t length,
> + struct pnfs_block_short_extent *new)
> {
> sector_t new_end, end = offset + length;
> - struct pnfs_block_short_extent *new;
> struct pnfs_block_layout *bl = container_of(be->be_inval,
> struct pnfs_block_layout,
> bl_inval);
>
> - new = kmalloc(sizeof(*new), GFP_NOFS);
> - if (!new)
> - return -ENOMEM;
> -
> mark_written_sectors(be->be_inval, offset, length);
> /* We want to add the range to commit list, but it must be
> * block-normalized, and verified that the normalized range has
> @@ -413,9 +411,6 @@ int bl_mark_for_commit(struct pnfs_block_extent *be,
> new->bse_mdev = be->be_mdev;
>
> spin_lock(&bl->bl_ext_lock);
> - /* new will be freed, either by add_to_commitlist if it decides not
> - * to use it, or after LAYOUTCOMMIT uses it in the commitlist.
> - */
> add_to_commitlist(bl, new);
> spin_unlock(&bl->bl_ext_lock);
> return 0;