LinuxLists.cc - Delayed allocation and page_lock vs transaction start ordering

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Tue, 2008-04-15 at 18:14 +0200, Jan Kara wrote:
> Hi,
>
> I've ported my patch inversing locking ordering of page_lock and
> transaction start to ext4 (on top of ext4 patch queue). Everything except
> delayed allocation is converted (the patch is below for interested
> readers). The question is how to proceed with delayed allocation. Its
> current implementation in VFS is designed to work well with the old
> ordering (page lock first, then start a transaction). We could bend it to
> work with the new locking ordering but I really see no point since ext4 is
> the only user.

I think the plan is port the changes to ext2/3/JFS and support delayed
allocation on those filesystems.

> Also XFS has AFAIK ordering first start transaction, then
> lock pages so if we should ever merge delayed alloc implementations the new
> ordering would make it easier.
> So what do people think here? Do you agree with reimplementing current
> mpage_da_... functions?

It worth a try, but I could not see how to bend delayed allocation to
work the new ordering:( With delayed allocation Ext4 gets into
writepage() directly with page locked, but we need to start transaction
to do block allocation...:(

I guess this reserve locking ordering allows support writepages() for
ext3/4? What other the benefits?

Regards,
Mingming

2008-04-15 23:28:31

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Tue, 2008-04-15 at 11:08 -0700, Mingming Cao wrote:
> On Tue, 2008-04-15 at 18:14 +0200, Jan Kara wrote:
> > Hi,
> >
> > I've ported my patch inversing locking ordering of page_lock and
> > transaction start to ext4 (on top of ext4 patch queue). Everything except
> > delayed allocation is converted (the patch is below for interested
> > readers). The question is how to proceed with delayed allocation. Its
> > current implementation in VFS is designed to work well with the old
> > ordering (page lock first, then start a transaction). We could bend it to
> > work with the new locking ordering but I really see no point since ext4 is
> > the only user.
>
> I think the plan is port the changes to ext2/3/JFS and support delayed
> allocation on those filesystems.
>
> > Also XFS has AFAIK ordering first start transaction, then
> > lock pages so if we should ever merge delayed alloc implementations the new
> > ordering would make it easier.
> > So what do people think here? Do you agree with reimplementing current
> > mpage_da_... functions?
>
> It worth a try, but I could not see how to bend delayed allocation to
> work the new ordering:( With delayed allocation Ext4 gets into
> writepage() directly with page locked, but we need to start transaction
> to do block allocation...:(

Looked again it seems possible to reservse the order with delayed
allocation. with ext3_da_writepgaes() we could start the journal before
calling mpage_da_writepages()(which will lock the pages), instead of
start the journal inside ext4_da_get_block_write(). So that we could get
the locking order right. Just need to taking care of the estimated
credits right.

How about this? (untested, just throw out for comment)

---
fs/ext4/inode.c | 53 ++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 40 insertions(+), 13 deletions(-)

Index: linux-2.6.25-rc9/fs/ext4/inode.c
===================================================================
--- linux-2.6.25-rc9.orig/fs/ext4/inode.c 2008-04-15 15:40:33.000000000 -0700
+++ linux-2.6.25-rc9/fs/ext4/inode.c 2008-04-15 16:12:27.000000000 -0700
@@ -1437,18 +1437,12 @@ static int ext4_da_get_block_prep(struct
static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
- int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
+ int ret;
unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
loff_t disksize = EXT4_I(inode)->i_disksize;
handle_t *handle = NULL;

- if (create) {
- handle = ext4_journal_start(inode, needed_blocks);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out;
- }
- }
+ handle = ext4_journal_current_handle();

ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
bh_result, create, 0);
@@ -1483,17 +1477,50 @@ static int ext4_da_get_block_write(struc
ret = 0;
}

-out:
- if (handle && !IS_ERR(handle))
- ext4_journal_stop(handle);
-
return ret;
}

+/*
+ * For now just follow the DIO way to estimate the max credits
+ * needed to write out EXT4_MAX_BUF_BLOCKS pages.
+ * todo: need to calculate the max credits need for
+ * extent based files, currently the DIO credits is based on
+ * indirect-blocks mapping way.
+ *
+ * Probably should have a generic way to calculate credits
+ * for DIO, writepages, and truncate
+ */
+#define EXT4_MAX_BUF_BLOCKS DIO_MAX_BLOCKS
+#define EXT4_MAX_BUF_CREDITS DIO_CREDITS
+
static int ext4_da_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- return mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
+ handle_t *handle = NULL;
+ int needed_blocks;
+ int ret;
+
+ /*
+ * Estimate the worse case needed credits to write out
+ * EXT4_MAX_BUF_BLOCKS pages
+ */
+ needed_blocks = ext4_writepages_trans_blocks(inode);
+
+ /* start the transaction with credits*/
+ handle = ext4_journal_start(inode, needed_blocks);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ return ret;
+ }
+
+ /* set the max pages could be write-out at a time */
+ wbc->range_end = (wbc->range_start >> PAGE_CACHE_SHIFT
+ + EXT4_MAX_BUF_BLOCKS) << PAGE_CACHE_SHIFT;
+
+ ret = mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
+ ext4_journal_stop(handle);
+
+ return ret;
}

static int ext4_da_write_begin(struct file *file, struct address_space *mapping,

2008-04-15 23:33:20

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Tue, 2008-04-15 at 16:28 -0700, Mingming Cao wrote:
> On Tue, 2008-04-15 at 11:08 -0700, Mingming Cao wrote:
> > On Tue, 2008-04-15 at 18:14 +0200, Jan Kara wrote:
> > > Hi,
> > >
> > > I've ported my patch inversing locking ordering of page_lock and
> > > transaction start to ext4 (on top of ext4 patch queue). Everything except
> > > delayed allocation is converted (the patch is below for interested
> > > readers). The question is how to proceed with delayed allocation. Its
> > > current implementation in VFS is designed to work well with the old
> > > ordering (page lock first, then start a transaction). We could bend it to
> > > work with the new locking ordering but I really see no point since ext4 is
> > > the only user.
> >
> > I think the plan is port the changes to ext2/3/JFS and support delayed
> > allocation on those filesystems.
> >
> > > Also XFS has AFAIK ordering first start transaction, then
> > > lock pages so if we should ever merge delayed alloc implementations the new
> > > ordering would make it easier.
> > > So what do people think here? Do you agree with reimplementing current
> > > mpage_da_... functions?
> >
> > It worth a try, but I could not see how to bend delayed allocation to
> > work the new ordering:( With delayed allocation Ext4 gets into
> > writepage() directly with page locked, but we need to start transaction
> > to do block allocation...:(
>
> Looked again it seems possible to reservse the order with delayed
> allocation. with ext3_da_writepgaes() we could start the journal before
> calling mpage_da_writepages()(which will lock the pages), instead of
> start the journal inside ext4_da_get_block_write(). So that we could get
> the locking order right. Just need to taking care of the estimated
> credits right.
>
> How about this? (untested, just throw out for comment)

Seems sent out an old version, this version compiles

---
fs/ext4/inode.c | 53 ++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 40 insertions(+), 13 deletions(-)

Index: linux-2.6.25-rc9/fs/ext4/inode.c
===================================================================
--- linux-2.6.25-rc9.orig/fs/ext4/inode.c 2008-04-15 15:40:33.000000000 -0700
+++ linux-2.6.25-rc9/fs/ext4/inode.c 2008-04-15 16:32:10.000000000 -0700
@@ -1437,18 +1437,12 @@ static int ext4_da_get_block_prep(struct
static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
- int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
+ int ret;
unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
loff_t disksize = EXT4_I(inode)->i_disksize;
handle_t *handle = NULL;

- if (create) {
- handle = ext4_journal_start(inode, needed_blocks);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out;
- }
- }
+ handle = ext4_journal_current_handle();

ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
bh_result, create, 0);
@@ -1483,17 +1477,51 @@ static int ext4_da_get_block_write(struc
ret = 0;
}

-out:
- if (handle && !IS_ERR(handle))
- ext4_journal_stop(handle);
-
return ret;
}

+/*
+ * For now just follow the DIO way to estimate the max credits
+ * needed to write out EXT4_MAX_BUF_BLOCKS pages.
+ * todo: need to calculate the max credits need for
+ * extent based files, currently the DIO credits is based on
+ * indirect-blocks mapping way.
+ *
+ * Probably should have a generic way to calculate credits
+ * for DIO, writepages, and truncate
+ */
+#define EXT4_MAX_BUF_BLOCKS DIO_MAX_BLOCKS
+#define EXT4_MAX_BUF_CREDITS DIO_CREDITS
+
static int ext4_da_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- return mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
+ struct inode *inode = mapping->host;
+ handle_t *handle = NULL;
+ int needed_blocks;
+ int ret;
+
+ /*
+ * Estimate the worse case needed credits to write out
+ * EXT4_MAX_BUF_BLOCKS pages
+ */
+ needed_blocks = EXT4_MAX_BUF_CREDITS;
+
+ /* start the transaction with credits*/
+ handle = ext4_journal_start(inode, needed_blocks);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ return ret;
+ }
+
+ /* set the max pages could be write-out at a time */
+ wbc->range_end = wbc->range_start +
+ EXT4_MAX_BUF_BLOCKS << PAGE_CACHE_SHIFT - 1;
+
+ ret = mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
+ ext4_journal_stop(handle);
+
+ return ret;
}

static int ext4_da_write_begin(struct file *file, struct address_space *mapping,

2008-04-16 09:26:16

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Tue 15-04-08 10:58:25, Badari Pulavarty wrote:
>
> On Tue, 2008-04-15 at 18:14 +0200, Jan Kara wrote:
> > Hi,
> >
> > I've ported my patch inversing locking ordering of page_lock and
> > transaction start to ext4 (on top of ext4 patch queue). Everything except
> > delayed allocation is converted (the patch is below for interested
> > readers).
>
> I am curious on the following code:
>
> ext4_ordered_writepage()
>
> + * Note that we don't need to start a transaction unless we're journaling
> + * data because we should have holes filled from ext4_page_mkwrite(). If
> + * we are journaling data, we cannot start transaction directly because
> + * transaction start ranks above page lock so we have to do some magic...
Hmm, actually this comment becomes completely correct only after JBD
ordered mode rewrite patch ;). But the locking is correct even now - note
that we call block_write_full_page() before we start a transaction and
block_write_full_page() unlocks the page...

> Currently you are starting a transaction when the page is locked.
> What do you plan to do here ?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-04-16 09:38:04

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Tue 15-04-08 11:08:52, Mingming Cao wrote:
> On Tue, 2008-04-15 at 18:14 +0200, Jan Kara wrote:
> > Hi,
> >
> > I've ported my patch inversing locking ordering of page_lock and
> > transaction start to ext4 (on top of ext4 patch queue). Everything except
> > delayed allocation is converted (the patch is below for interested
> > readers). The question is how to proceed with delayed allocation. Its
> > current implementation in VFS is designed to work well with the old
> > ordering (page lock first, then start a transaction). We could bend it to
> > work with the new locking ordering but I really see no point since ext4 is
> > the only user.
>
> I think the plan is port the changes to ext2/3/JFS and support delayed
> allocation on those filesystems.
I see. But ext2 doesn't care because is has no transactions, ext3 will
have exactly the same problems as ext4. I don't know about JFS but I guess
it is worth making life more complicated for JFS when it would be simpler
for ext3, ext4 and we could merge the code with XFS...

> > Also XFS has AFAIK ordering first start transaction, then
> > lock pages so if we should ever merge delayed alloc implementations the new
> > ordering would make it easier.
> > So what do people think here? Do you agree with reimplementing current
> > mpage_da_... functions?
>
> It worth a try, but I could not see how to bend delayed allocation to
> work the new ordering:( With delayed allocation Ext4 gets into
> writepage() directly with page locked, but we need to start transaction
> to do block allocation...:(
I see you've already resolved these ;).

> I guess this reserve locking ordering allows support writepages() for
> ext3/4? What other the benefits?
Yes, that is one advantage. The other one (which I care about the most)
is that transaction commit code can take page_lock in the new locking order
which is necessary for the new ordered mode rewrite.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-04-16 10:35:32

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Tue 15-04-08 16:33:17, Mingming Cao wrote:
> On Tue, 2008-04-15 at 16:28 -0700, Mingming Cao wrote:
> > On Tue, 2008-04-15 at 11:08 -0700, Mingming Cao wrote:
> > > On Tue, 2008-04-15 at 18:14 +0200, Jan Kara wrote:
> > > > Hi,
> > > >
> > > > I've ported my patch inversing locking ordering of page_lock and
> > > > transaction start to ext4 (on top of ext4 patch queue). Everything except
> > > > delayed allocation is converted (the patch is below for interested
> > > > readers). The question is how to proceed with delayed allocation. Its
> > > > current implementation in VFS is designed to work well with the old
> > > > ordering (page lock first, then start a transaction). We could bend it to
> > > > work with the new locking ordering but I really see no point since ext4 is
> > > > the only user.
> > >
> > > I think the plan is port the changes to ext2/3/JFS and support delayed
> > > allocation on those filesystems.
> > >
> > > > Also XFS has AFAIK ordering first start transaction, then
> > > > lock pages so if we should ever merge delayed alloc implementations the new
> > > > ordering would make it easier.
> > > > So what do people think here? Do you agree with reimplementing current
> > > > mpage_da_... functions?
> > >
> > > It worth a try, but I could not see how to bend delayed allocation to
> > > work the new ordering:( With delayed allocation Ext4 gets into
> > > writepage() directly with page locked, but we need to start transaction
> > > to do block allocation...:(
> >
> > Looked again it seems possible to reservse the order with delayed
> > allocation. with ext3_da_writepgaes() we could start the journal before
> > calling mpage_da_writepages()(which will lock the pages), instead of
> > start the journal inside ext4_da_get_block_write(). So that we could get
> > the locking order right. Just need to taking care of the estimated
> > credits right.
> >
> > How about this? (untested, just throw out for comment)
>
> Seems sent out an old version, this version compiles
Thanks for the patch. Some comments are below.

> ---
> fs/ext4/inode.c | 53 ++++++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 40 insertions(+), 13 deletions(-)
>
> Index: linux-2.6.25-rc9/fs/ext4/inode.c
> ===================================================================
> --- linux-2.6.25-rc9.orig/fs/ext4/inode.c 2008-04-15 15:40:33.000000000 -0700
> +++ linux-2.6.25-rc9/fs/ext4/inode.c 2008-04-15 16:32:10.000000000 -0700
> @@ -1437,18 +1437,12 @@ static int ext4_da_get_block_prep(struct
> static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
> struct buffer_head *bh_result, int create)
> {
> - int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
> + int ret;
> unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> loff_t disksize = EXT4_I(inode)->i_disksize;
> handle_t *handle = NULL;
>
> - if (create) {
> - handle = ext4_journal_start(inode, needed_blocks);
> - if (IS_ERR(handle)) {
> - ret = PTR_ERR(handle);
> - goto out;
> - }
> - }
> + handle = ext4_journal_current_handle();
Maybe we could assert that handle != NULL? When using delayed allocation,
a transaction should always be started.

> ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
> bh_result, create, 0);
> @@ -1483,17 +1477,51 @@ static int ext4_da_get_block_write(struc
> ret = 0;
> }
>
> -out:
> - if (handle && !IS_ERR(handle))
> - ext4_journal_stop(handle);
> -
> return ret;
> }
>
> +/*
> + * For now just follow the DIO way to estimate the max credits
> + * needed to write out EXT4_MAX_BUF_BLOCKS pages.
> + * todo: need to calculate the max credits need for
> + * extent based files, currently the DIO credits is based on
> + * indirect-blocks mapping way.
> + *
> + * Probably should have a generic way to calculate credits
> + * for DIO, writepages, and truncate
> + */
> +#define EXT4_MAX_BUF_BLOCKS DIO_MAX_BLOCKS
> +#define EXT4_MAX_BUF_CREDITS DIO_CREDITS
> +
> static int ext4_da_writepages(struct address_space *mapping,
> struct writeback_control *wbc)
> {
> - return mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
> + struct inode *inode = mapping->host;
> + handle_t *handle = NULL;
> + int needed_blocks;
> + int ret;
> +
> + /*
> + * Estimate the worse case needed credits to write out
> + * EXT4_MAX_BUF_BLOCKS pages
> + */
> + needed_blocks = EXT4_MAX_BUF_CREDITS;
> +
> + /* start the transaction with credits*/
> + handle = ext4_journal_start(inode, needed_blocks);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + return ret;
> + }
> +
> + /* set the max pages could be write-out at a time */
> + wbc->range_end = wbc->range_start +
> + EXT4_MAX_BUF_BLOCKS << PAGE_CACHE_SHIFT - 1;
I think limiting mpage_da_writepages through nr_to_write is better than
through range_end. That way you don't count clean pages...

> +
> + ret = mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
> + ext4_journal_stop(handle);
But here we can't just stop. We have to write everything original caller
has asked about (at least in WB_SYNC_ALL mode). But the question is where
to resume because scanning the whole range again is kind-of excessive and
prone do livelock with other process dirtying the file via mmap. Maybe if
we slightly modified write_cache_pages() to always store in writeback_index
where they finished, we could use this value.

> +
> + return ret;
> }
>
> static int ext4_da_write_begin(struct file *file, struct address_space *mapping,

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-04-16 18:24:31

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Wed, 2008-04-16 at 12:35 +0200, Jan Kara wrote:
> On Tue 15-04-08 16:33:17, Mingming Cao wrote:
> > On Tue, 2008-04-15 at 16:28 -0700, Mingming Cao wrote:
> > > On Tue, 2008-04-15 at 11:08 -0700, Mingming Cao wrote:
> > > > On Tue, 2008-04-15 at 18:14 +0200, Jan Kara wrote:
> > > > > Hi,
> > > > >
> > > > > I've ported my patch inversing locking ordering of page_lock and
> > > > > transaction start to ext4 (on top of ext4 patch queue). Everything except
> > > > > delayed allocation is converted (the patch is below for interested
> > > > > readers). The question is how to proceed with delayed allocation. Its
> > > > > current implementation in VFS is designed to work well with the old
> > > > > ordering (page lock first, then start a transaction). We could bend it to
> > > > > work with the new locking ordering but I really see no point since ext4 is
> > > > > the only user.
> > > >
> > > > I think the plan is port the changes to ext2/3/JFS and support delayed
> > > > allocation on those filesystems.
> > > >
> > > > > Also XFS has AFAIK ordering first start transaction, then
> > > > > lock pages so if we should ever merge delayed alloc implementations the new
> > > > > ordering would make it easier.
> > > > > So what do people think here? Do you agree with reimplementing current
> > > > > mpage_da_... functions?
> > > >
> > > > It worth a try, but I could not see how to bend delayed allocation to
> > > > work the new ordering:( With delayed allocation Ext4 gets into
> > > > writepage() directly with page locked, but we need to start transaction
> > > > to do block allocation...:(
> > >
> > > Looked again it seems possible to reservse the order with delayed
> > > allocation. with ext3_da_writepgaes() we could start the journal before
> > > calling mpage_da_writepages()(which will lock the pages), instead of
> > > start the journal inside ext4_da_get_block_write(). So that we could get
> > > the locking order right. Just need to taking care of the estimated
> > > credits right.
> > >
> > > How about this? (untested, just throw out for comment)
> >
> > Seems sent out an old version, this version compiles
> Thanks for the patch. Some comments are below.
>
> > ---
> > fs/ext4/inode.c | 53 ++++++++++++++++++++++++++++++++++++++++-------------
> > 1 file changed, 40 insertions(+), 13 deletions(-)
> >
> > Index: linux-2.6.25-rc9/fs/ext4/inode.c
> > ===================================================================
> > --- linux-2.6.25-rc9.orig/fs/ext4/inode.c 2008-04-15 15:40:33.000000000 -0700
> > +++ linux-2.6.25-rc9/fs/ext4/inode.c 2008-04-15 16:32:10.000000000 -0700
> > @@ -1437,18 +1437,12 @@ static int ext4_da_get_block_prep(struct
> > static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
> > struct buffer_head *bh_result, int create)
> > {
> > - int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
> > + int ret;
> > unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> > loff_t disksize = EXT4_I(inode)->i_disksize;
> > handle_t *handle = NULL;
> >
> > - if (create) {
> > - handle = ext4_journal_start(inode, needed_blocks);
> > - if (IS_ERR(handle)) {
> > - ret = PTR_ERR(handle);
> > - goto out;
> > - }
> > - }
> > + handle = ext4_journal_current_handle();
> Maybe we could assert that handle != NULL? When using delayed allocation,
> a transaction should always be started.
>
Agreed.

> > ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
> > bh_result, create, 0);
> > @@ -1483,17 +1477,51 @@ static int ext4_da_get_block_write(struc
> > ret = 0;
> > }
> >
> > -out:
> > - if (handle && !IS_ERR(handle))
> > - ext4_journal_stop(handle);
> > -
> > return ret;
> > }
> >
> > +/*
> > + * For now just follow the DIO way to estimate the max credits
> > + * needed to write out EXT4_MAX_BUF_BLOCKS pages.
> > + * todo: need to calculate the max credits need for
> > + * extent based files, currently the DIO credits is based on
> > + * indirect-blocks mapping way.
> > + *
> > + * Probably should have a generic way to calculate credits
> > + * for DIO, writepages, and truncate
> > + */
> > +#define EXT4_MAX_BUF_BLOCKS DIO_MAX_BLOCKS
> > +#define EXT4_MAX_BUF_CREDITS DIO_CREDITS
> > +
> > static int ext4_da_writepages(struct address_space *mapping,
> > struct writeback_control *wbc)
> > {
> > - return mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
> > + struct inode *inode = mapping->host;
> > + handle_t *handle = NULL;
> > + int needed_blocks;
> > + int ret;
> > +
> > + /*
> > + * Estimate the worse case needed credits to write out
> > + * EXT4_MAX_BUF_BLOCKS pages
> > + */
> > + needed_blocks = EXT4_MAX_BUF_CREDITS;
> > +
> > + /* start the transaction with credits*/
> > + handle = ext4_journal_start(inode, needed_blocks);
> > + if (IS_ERR(handle)) {
> > + ret = PTR_ERR(handle);
> > + return ret;
> > + }
> > +
> > + /* set the max pages could be write-out at a time */
> > + wbc->range_end = wbc->range_start +
> > + EXT4_MAX_BUF_BLOCKS << PAGE_CACHE_SHIFT - 1;
> I think limiting mpage_da_writepages through nr_to_write is better than
> through range_end. That way you don't count clean pages...
>

You are right.

> > +
> > + ret = mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
> > + ext4_journal_stop(handle);
> But here we can't just stop. We have to write everything original caller
> has asked about (at least in WB_SYNC_ALL mode). But the question is where
> to resume because scanning the whole range again is kind-of excessive and
> prone do livelock with other process dirtying the file via mmap. Maybe if
> we slightly modified write_cache_pages() to always store in writeback_index
> where they finished, we could use this value.

Thanks for pointing this out.
How about this?
---
fs/ext4/inode.c | 70 ++++++++++++++++++++++++++++++++++++++++++----------
mm/page-writeback.c | 2 -
2 files changed, 58 insertions(+), 14 deletions(-)

Index: linux-2.6.25-rc9/fs/ext4/inode.c
===================================================================
--- linux-2.6.25-rc9.orig/fs/ext4/inode.c 2008-04-16 09:59:00.000000000 -0700
+++ linux-2.6.25-rc9/fs/ext4/inode.c 2008-04-16 11:23:12.000000000 -0700
@@ -1437,18 +1437,13 @@ static int ext4_da_get_block_prep(struct
static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
- int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
+ int ret;
unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
loff_t disksize = EXT4_I(inode)->i_disksize;
handle_t *handle = NULL;

- if (create) {
- handle = ext4_journal_start(inode, needed_blocks);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out;
- }
- }
+ J_ASSERT(handle != NULL || create == 0);
+ handle = ext4_journal_current_handle();

ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
bh_result, create, 0);
@@ -1483,17 +1478,66 @@ static int ext4_da_get_block_write(struc
ret = 0;
}

-out:
- if (handle && !IS_ERR(handle))
- ext4_journal_stop(handle);
-
return ret;
}

+/*
+ * For now just follow the DIO way to estimate the max credits
+ * needed to write out EXT4_MAX_WRITEBACK_PAGES.
+ * todo: need to calculate the max credits need for
+ * extent based files, currently the DIO credits is based on
+ * indirect-blocks mapping way.
+ *
+ * Probably should have a generic way to calculate credits
+ * for DIO, writepages, and truncate
+ */
+#define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
+#define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS
+
static int ext4_da_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- return mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
+ struct inode *inode = mapping->host;
+ handle_t *handle = NULL;
+ int needed_blocks;
+ int ret = 0;
+ unsigned range_cyclic;
+ long to_write;
+
+ /*
+ * Estimate the worse case needed credits to write out
+ * EXT4_MAX_BUF_BLOCKS pages
+ */
+ needed_blocks = EXT4_MAX_WRITEBACK_CREDITS;
+
+ to_write = wbc->nr_to_write;
+ range_cyclic = wbc->range_cyclic;
+ wbc->range_cyclic = 1;
+
+ while (!ret && to_write) {
+ /* start a new transaction*/
+ handle = ext4_journal_start(inode, needed_blocks);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out_writepages;
+ }
+ /*
+ * set the max dirty pages could be write at a time
+ * to fit into the reserved transaction credits
+ */
+ if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
+ wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
+ to_write -= wbc->nr_to_write;
+
+ ret = mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
+ ext4_journal_stop(handle);
+ to_write +=wbc->nr_to_write;
+ }
+
+out_writepages:
+ wbc->nr_to_write = to_write;
+ wbc->range_cyclic = range_cyclic;
+ return ret;
}

static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
Index: linux-2.6.25-rc9/mm/page-writeback.c
===================================================================
--- linux-2.6.25-rc9.orig/mm/page-writeback.c 2008-04-16 11:00:20.000000000 -0700
+++ linux-2.6.25-rc9/mm/page-writeback.c 2008-04-16 11:07:59.000000000 -0700
@@ -816,7 +816,7 @@ int write_cache_pages(struct address_spa
pagevec_init(&pvec, 0);
if (wbc->range_cyclic) {
index = mapping->writeback_index; /* Start from prev offset */
- end = -1;
+ end = wbc->range_end >> PAGE_CACHE_SHIFT;
} else {
index = wbc->range_start >> PAGE_CACHE_SHIFT;
end = wbc->range_end >> PAGE_CACHE_SHIFT;

2008-04-16 19:55:19

by Badari Pulavarty

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Wed, 2008-04-16 at 11:24 -0700, Mingming Cao wrote:
> On Wed, 2008-04-16 at 12:35 +0200, Jan Kara wrote:
> > On Tue 15-04-08 16:33:17, Mingming Cao wrote:
> > > On Tue, 2008-04-15 at 16:28 -0700, Mingming Cao wrote:
> > > > On Tue, 2008-04-15 at 11:08 -0700, Mingming Cao wrote:
> > > > > On Tue, 2008-04-15 at 18:14 +0200, Jan Kara wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I've ported my patch inversing locking ordering of page_lock and
> > > > > > transaction start to ext4 (on top of ext4 patch queue). Everything except
> > > > > > delayed allocation is converted (the patch is below for interested
> > > > > > readers). The question is how to proceed with delayed allocation. Its
> > > > > > current implementation in VFS is designed to work well with the old
> > > > > > ordering (page lock first, then start a transaction). We could bend it to
> > > > > > work with the new locking ordering but I really see no point since ext4 is
> > > > > > the only user.
> > > > >
> > > > > I think the plan is port the changes to ext2/3/JFS and support delayed
> > > > > allocation on those filesystems.
> > > > >
> > > > > > Also XFS has AFAIK ordering first start transaction, then
> > > > > > lock pages so if we should ever merge delayed alloc implementations the new
> > > > > > ordering would make it easier.
> > > > > > So what do people think here? Do you agree with reimplementing current
> > > > > > mpage_da_... functions?
> > > > >
> > > > > It worth a try, but I could not see how to bend delayed allocation to
> > > > > work the new ordering:( With delayed allocation Ext4 gets into
> > > > > writepage() directly with page locked, but we need to start transaction
> > > > > to do block allocation...:(
> > > >
> > > > Looked again it seems possible to reservse the order with delayed
> > > > allocation. with ext3_da_writepgaes() we could start the journal before
> > > > calling mpage_da_writepages()(which will lock the pages), instead of
> > > > start the journal inside ext4_da_get_block_write(). So that we could get
> > > > the locking order right. Just need to taking care of the estimated
> > > > credits right.
> > > >
> > > > How about this? (untested, just throw out for comment)
> > >
> > > Seems sent out an old version, this version compiles
> > Thanks for the patch. Some comments are below.
> >
> > > ---
> > > fs/ext4/inode.c | 53 ++++++++++++++++++++++++++++++++++++++++-------------
> > > 1 file changed, 40 insertions(+), 13 deletions(-)
> > >
> > > Index: linux-2.6.25-rc9/fs/ext4/inode.c
> > > ===================================================================
> > > --- linux-2.6.25-rc9.orig/fs/ext4/inode.c 2008-04-15 15:40:33.000000000 -0700
> > > +++ linux-2.6.25-rc9/fs/ext4/inode.c 2008-04-15 16:32:10.000000000 -0700
> > > @@ -1437,18 +1437,12 @@ static int ext4_da_get_block_prep(struct
> > > static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
> > > struct buffer_head *bh_result, int create)
> > > {
> > > - int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
> > > + int ret;
> > > unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> > > loff_t disksize = EXT4_I(inode)->i_disksize;
> > > handle_t *handle = NULL;
> > >
> > > - if (create) {
> > > - handle = ext4_journal_start(inode, needed_blocks);
> > > - if (IS_ERR(handle)) {
> > > - ret = PTR_ERR(handle);
> > > - goto out;
> > > - }
> > > - }
> > > + handle = ext4_journal_current_handle();
> > Maybe we could assert that handle != NULL? When using delayed allocation,
> > a transaction should always be started.
> >
> Agreed.
>
> > > ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
> > > bh_result, create, 0);
> > > @@ -1483,17 +1477,51 @@ static int ext4_da_get_block_write(struc
> > > ret = 0;
> > > }
> > >
> > > -out:
> > > - if (handle && !IS_ERR(handle))
> > > - ext4_journal_stop(handle);
> > > -
> > > return ret;
> > > }
> > >
> > > +/*
> > > + * For now just follow the DIO way to estimate the max credits
> > > + * needed to write out EXT4_MAX_BUF_BLOCKS pages.
> > > + * todo: need to calculate the max credits need for
> > > + * extent based files, currently the DIO credits is based on
> > > + * indirect-blocks mapping way.
> > > + *
> > > + * Probably should have a generic way to calculate credits
> > > + * for DIO, writepages, and truncate
> > > + */
> > > +#define EXT4_MAX_BUF_BLOCKS DIO_MAX_BLOCKS
> > > +#define EXT4_MAX_BUF_CREDITS DIO_CREDITS
> > > +
> > > static int ext4_da_writepages(struct address_space *mapping,
> > > struct writeback_control *wbc)
> > > {
> > > - return mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
> > > + struct inode *inode = mapping->host;
> > > + handle_t *handle = NULL;
> > > + int needed_blocks;
> > > + int ret;
> > > +
> > > + /*
> > > + * Estimate the worse case needed credits to write out
> > > + * EXT4_MAX_BUF_BLOCKS pages
> > > + */
> > > + needed_blocks = EXT4_MAX_BUF_CREDITS;
> > > +
> > > + /* start the transaction with credits*/
> > > + handle = ext4_journal_start(inode, needed_blocks);
> > > + if (IS_ERR(handle)) {
> > > + ret = PTR_ERR(handle);
> > > + return ret;
> > > + }
> > > +
> > > + /* set the max pages could be write-out at a time */
> > > + wbc->range_end = wbc->range_start +
> > > + EXT4_MAX_BUF_BLOCKS << PAGE_CACHE_SHIFT - 1;
> > I think limiting mpage_da_writepages through nr_to_write is better than
> > through range_end. That way you don't count clean pages...
> >
>
> You are right.
>
> > > +
> > > + ret = mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
> > > + ext4_journal_stop(handle);
> > But here we can't just stop. We have to write everything original caller
> > has asked about (at least in WB_SYNC_ALL mode). But the question is where
> > to resume because scanning the whole range again is kind-of excessive and
> > prone do livelock with other process dirtying the file via mmap. Maybe if
> > we slightly modified write_cache_pages() to always store in writeback_index
> > where they finished, we could use this value.
>
> Thanks for pointing this out.
> How about this?
> ---
> fs/ext4/inode.c | 70 ++++++++++++++++++++++++++++++++++++++++++----------
> mm/page-writeback.c | 2 -
> 2 files changed, 58 insertions(+), 14 deletions(-)
>
> Index: linux-2.6.25-rc9/fs/ext4/inode.c
> ===================================================================
> --- linux-2.6.25-rc9.orig/fs/ext4/inode.c 2008-04-16 09:59:00.000000000 -0700
> +++ linux-2.6.25-rc9/fs/ext4/inode.c 2008-04-16 11:23:12.000000000 -0700
> @@ -1437,18 +1437,13 @@ static int ext4_da_get_block_prep(struct
> static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
> struct buffer_head *bh_result, int create)
> {
> - int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
> + int ret;
> unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> loff_t disksize = EXT4_I(inode)->i_disksize;
> handle_t *handle = NULL;
>
> - if (create) {
> - handle = ext4_journal_start(inode, needed_blocks);
> - if (IS_ERR(handle)) {
> - ret = PTR_ERR(handle);
> - goto out;
> - }
> - }
> + J_ASSERT(handle != NULL || create == 0);
> + handle = ext4_journal_current_handle();
>
> ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
> bh_result, create, 0);
> @@ -1483,17 +1478,66 @@ static int ext4_da_get_block_write(struc
> ret = 0;
> }
>
> -out:
> - if (handle && !IS_ERR(handle))
> - ext4_journal_stop(handle);
> -
> return ret;
> }
>
> +/*
> + * For now just follow the DIO way to estimate the max credits
> + * needed to write out EXT4_MAX_WRITEBACK_PAGES.
> + * todo: need to calculate the max credits need for
> + * extent based files, currently the DIO credits is based on
> + * indirect-blocks mapping way.
> + *
> + * Probably should have a generic way to calculate credits
> + * for DIO, writepages, and truncate
> + */
> +#define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
> +#define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS
> +
> static int ext4_da_writepages(struct address_space *mapping,
> struct writeback_control *wbc)
> {
> - return mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
> + struct inode *inode = mapping->host;
> + handle_t *handle = NULL;
> + int needed_blocks;
> + int ret = 0;
> + unsigned range_cyclic;
> + long to_write;
> +
> + /*
> + * Estimate the worse case needed credits to write out
> + * EXT4_MAX_BUF_BLOCKS pages
> + */
> + needed_blocks = EXT4_MAX_WRITEBACK_CREDITS;
> +
> + to_write = wbc->nr_to_write;
> + range_cyclic = wbc->range_cyclic;
> + wbc->range_cyclic = 1;
> +
> + while (!ret && to_write) {
> + /* start a new transaction*/
> + handle = ext4_journal_start(inode, needed_blocks);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + goto out_writepages;
> + }
> + /*
> + * set the max dirty pages could be write at a time
> + * to fit into the reserved transaction credits
> + */
> + if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
> + wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
> + to_write -= wbc->nr_to_write;
> +
> + ret = mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
> + ext4_journal_stop(handle);
> + to_write +=wbc->nr_to_write;
> + }

You need to set wbc->nr_to_write in the loop before calling
mpage_da_write_page() (for the next iteration).

> +
> +out_writepages:
> + wbc->nr_to_write = to_write;
> + wbc->range_cyclic = range_cyclic;
> + return ret;
> }
>
> static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
> Index: linux-2.6.25-rc9/mm/page-writeback.c
> ===================================================================
> --- linux-2.6.25-rc9.orig/mm/page-writeback.c 2008-04-16 11:00:20.000000000 -0700
> +++ linux-2.6.25-rc9/mm/page-writeback.c 2008-04-16 11:07:59.000000000 -0700
> @@ -816,7 +816,7 @@ int write_cache_pages(struct address_spa
> pagevec_init(&pvec, 0);
> if (wbc->range_cyclic) {
> index = mapping->writeback_index; /* Start from prev offset */
> - end = -1;
> + end = wbc->range_end >> PAGE_CACHE_SHIFT;

Hmm. There are other callers to write_cache_pages() using
"range_cyclic" . Did you check them to make sure, they set range_end
correctly ?

Thanks,
Badari

2008-04-18 19:38:31

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Fri, 2008-04-18 at 12:54 -0600, Andreas Dilger wrote:
> On Apr 16, 2008 11:38 +0200, Jan Kara wrote:
> > On Tue 15-04-08 11:08:52, Mingming Cao wrote:
> > > I guess this reserve locking ordering allows support writepages() for
> > > ext3/4? What other the benefits?
> >
> > Yes, that is one advantage. The other one (which I care about the most)
> > is that transaction commit code can take page_lock in the new locking order
> > which is necessary for the new ordered mode rewrite.
>
> My understanding is that the main reason for the ordered mode rewrite is
> specifically to allow delalloc to still support ordered mode semantics.
> If the lock ordering is changed, and the jbd ordered mode is changed, but
> we don't support that with delalloc then we will have made a lot of changes
> (and likely introduced some bugs) with little benefit.
>
> My apologies in advance if I misunderstand, and delalloc will be supported
> with these changes.
>
I agrees with you that if we rewrite a new ordered mode(separate from
this one), we should make it possible to work with delalloc.

Just want to clarify that the inversing locking patch proposed here
could work delalloc(just the lock ordering. I have updated delalloc to
work for the inversed locking. Just FYI I have merged the inverse
locking patches for ext4 to the unstable ext4 patche queue to see how it
goes.

Mingming

2008-04-18 18:54:58

by Andreas Dilger

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Apr 16, 2008 11:38 +0200, Jan Kara wrote:
> On Tue 15-04-08 11:08:52, Mingming Cao wrote:
> > I guess this reserve locking ordering allows support writepages() for
> > ext3/4? What other the benefits?
>
> Yes, that is one advantage. The other one (which I care about the most)
> is that transaction commit code can take page_lock in the new locking order
> which is necessary for the new ordered mode rewrite.

My understanding is that the main reason for the ordered mode rewrite is
specifically to allow delalloc to still support ordered mode semantics.
If the lock ordering is changed, and the jbd ordered mode is changed, but
we don't support that with delalloc then we will have made a lot of changes
(and likely introduced some bugs) with little benefit.

My apologies in advance if I misunderstand, and delalloc will be supported
with these changes.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2008-04-21 17:13:14

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Fri 18-04-08 12:54:47, Andreas Dilger wrote:
> On Apr 16, 2008 11:38 +0200, Jan Kara wrote:
> > On Tue 15-04-08 11:08:52, Mingming Cao wrote:
> > > I guess this reserve locking ordering allows support writepages() for
> > > ext3/4? What other the benefits?
> >
> > Yes, that is one advantage. The other one (which I care about the most)
> > is that transaction commit code can take page_lock in the new locking order
> > which is necessary for the new ordered mode rewrite.
>
> My understanding is that the main reason for the ordered mode rewrite is
> specifically to allow delalloc to still support ordered mode semantics.
> If the lock ordering is changed, and the jbd ordered mode is changed, but
> we don't support that with delalloc then we will have made a lot of changes
> (and likely introduced some bugs) with little benefit.
Yes, with ordered mode rewrite, handling of data=ordered,delalloc is
going to be much simpler. But not that it would be my main motivation...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-05-21 08:21:37

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Tue, Apr 15, 2008 at 06:14:30PM +0200, Jan Kara wrote:
> Hi,
>
> I've ported my patch inversing locking ordering of page_lock and
> transaction start to ext4 (on top of ext4 patch queue). Everything except
> delayed allocation is converted (the patch is below for interested
> readers). The question is how to proceed with delayed allocation. Its
> current implementation in VFS is designed to work well with the old
> ordering (page lock first, then start a transaction). We could bend it to
> work with the new locking ordering but I really see no point since ext4 is
> the only user. Also XFS has AFAIK ordering first start transaction, then
> lock pages so if we should ever merge delayed alloc implementations the new
> ordering would make it easier.
> So what do people think here? Do you agree with reimplementing current
> mpage_da_... functions? Eric, I guess you have the best clue how XFS does
> this, do you have some advices? Also maybe pointers into XFS code would be
> useful if it is reasonably readable :). Thanks.
>
> Honza

[....snip....]

> */
> -static int ext4_ordered_writepage(struct page *page,
> +static int __ext4_ordered_writepage(struct page *page,
> struct writeback_control *wbc)
> {
> struct inode *inode = page->mapping->host;
> @@ -1723,22 +1694,6 @@ static int ext4_ordered_writepage(struct page *page,
> int ret = 0;
> int err;
>
> - J_ASSERT(PageLocked(page));
> -
> - /*
> - * We give up here if we're reentered, because it might be for a
> - * different filesystem.
> - */
> - if (ext4_journal_current_handle())
> - goto out_fail;
> -
> - handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
> -
> - if (IS_ERR(handle)) {
> - ret = PTR_ERR(handle);
> - goto out_fail;
> - }
> -
> if (!page_has_buffers(page)) {
> create_empty_buffers(page, inode->i_sb->s_blocksize,
> (1 << BH_Dirty)|(1 << BH_Uptodate));
> @@ -1762,114 +1717,139 @@ static int ext4_ordered_writepage(struct page *page,
> * and generally junk.
> */
> if (ret == 0) {
> - err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
> + handle = ext4_journal_start(inode,
> + ext4_writepage_trans_blocks(inode));
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + goto out_put;
> + }
> +
> + ret = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
> NULL, jbd2_journal_dirty_data_fn);
> + err = ext4_journal_stop(handle);
> if (!ret)
> ret = err;
> }
> - walk_page_buffers(handle, page_bufs, 0,
> - PAGE_CACHE_SIZE, NULL, bput_one);
> - err = ext4_journal_stop(handle);
> - if (!ret)
> - ret = err;
> +out_put:
> + walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE, NULL,
> + bput_one);
> return ret;
> +}
> +
> +static int ext4_ordered_writepage(struct page *page,
> + struct writeback_control *wbc)
> +{
> + J_ASSERT(PageLocked(page));
> +
> + /*
> + * We give up here if we're reentered, because it might be for a
> + * different filesystem.
> + */
> + if (!ext4_journal_current_handle())
> + return __ext4_ordered_writepage(page, wbc);
>
> -out_fail:
> redirty_page_for_writepage(wbc, page);
> unlock_page(page);
> - return ret;
> + return 0;
> }

How about change below to make sure we don't have a deadlock.

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9d1d07b..85de163 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1718,6 +1718,10 @@ static int jbd2_journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
return 0;
}

+static int ext4_bh_unmapped(handle_t *handle, struct buffer_head *bh)
+{
+ return !buffer_mapped(bh);
+}
/*
* Note that we don't need to start a transaction unless we're journaling
* data because we should have holes filled from ext4_page_mkwrite(). If
@@ -1767,20 +1771,33 @@ static int jbd2_journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
* us.
*
*/
-static int __ext4_ordered_writepage(struct page *page,
- struct writeback_control *wbc)
+static int __ext4_ordered_alloc_and_writepage(struct page *page,
+ struct writeback_control *wbc, int alloc)
{
- struct inode *inode = page->mapping->host;
- struct buffer_head *page_bufs;
+ int ret = 0, err;
+ unsigned long len;
handle_t *handle = NULL;
- int ret = 0;
- int err;
+ struct buffer_head *page_bufs;
+ struct inode *inode = page->mapping->host;
+ loff_t size = i_size_read(inode);

if (!page_has_buffers(page)) {
create_empty_buffers(page, inode->i_sb->s_blocksize,
(1 << BH_Dirty)|(1 << BH_Uptodate));
}
page_bufs = page_buffers(page);
+
+ if (page->index == size >> PAGE_CACHE_SHIFT)
+ len = size & ~PAGE_CACHE_MASK;
+ else
+ len = PAGE_CACHE_SIZE;
+
+ if (!alloc && walk_page_buffers(NULL, page_bufs, 0,
+ len, NULL, ext4_bh_unmapped)) {
+ printk(KERN_CRIT "%s called with unmapped buffer\n",
+ __func__);
+ BUG();
+ }
walk_page_buffers(handle, page_bufs, 0,
PAGE_CACHE_SIZE, NULL, bget_one);

@@ -1828,7 +1845,7 @@ static int ext4_ordered_writepage(struct page *page,
* different filesystem.
*/
if (!ext4_journal_current_handle())
- return __ext4_ordered_writepage(page, wbc);
+ return __ext4_ordered_alloc_and_writepage(page, wbc, 0);

redirty_page_for_writepage(wbc, page);
unlock_page(page);
@@ -3777,10 +3794,6 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
return err;
}

-static int ext4_bh_unmapped(handle_t *handle, struct buffer_head *bh)
-{
- return !buffer_mapped(bh);
-}

int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page)
{
@@ -3837,7 +3850,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page)
if (ext4_should_writeback_data(inode))
ret = __ext4_writeback_writepage(page, &wbc);
else if (ext4_should_order_data(inode))
- ret = __ext4_ordered_writepage(page, &wbc);
+ ret = __ext4_ordered_alloc_and_writepage(page, &wbc, 1);
else
ret = __ext4_journalled_writepage(page, &wbc);
/* Page got unlocked in writepage */

ie we call __ext4_ordered_alloc_and_writepage with alloc = 1 only in
case of page_mkwrite. All the other case we should have all the buffer
heads mapped. Otherwise we will try to allocate new blocks which starts
a new transaction holding page lock.

>
> -static int ext4_writeback_writepage(struct page *page,
> +static int __ext4_writeback_writepage(struct page *page,
> struct writeback_control *wbc)
> {
> struct inode *inode = page->mapping->host;
> +
> + if (test_opt(inode->i_sb, NOBH))
> + return nobh_writepage(page, ext4_get_block, wbc);
> + else
> + return block_write_full_page(page, ext4_get_block, wbc);
> +}
> +
> +
> +static int ext4_writeback_writepage(struct page *page,
> + struct writeback_control *wbc)
> +{
> + if (!ext4_journal_current_handle())
> + return __ext4_writeback_writepage(page, wbc);
> +
> + redirty_page_for_writepage(wbc, page);
> + unlock_page(page);
> + return 0;
> +}
> +
> +static int __ext4_journalled_writepage(struct page *page,
> + struct writeback_control *wbc)
> +{
> + struct address_space *mapping = page->mapping;
> + struct inode *inode = mapping->host;
> + struct buffer_head *page_bufs;
> handle_t *handle = NULL;
> int ret = 0;
> int err;
>
> - if (ext4_journal_current_handle())
> - goto out_fail;
> + ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE, ext4_get_block);
> + if (ret != 0)
> + goto out_unlock;
> +
> + page_bufs = page_buffers(page);
> + walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE, NULL,
> + bget_one);
> + /* As soon as we unlock the page, it can go away, but we have
> + * references to buffers so we are safe */
> + unlock_page(page);
>
> handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
> if (IS_ERR(handle)) {
> ret = PTR_ERR(handle);
> - goto out_fail;
> + goto out;
> }
>
> - if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))
> - ret = nobh_writepage(page, ext4_get_block, wbc);
> - else
> - ret = block_write_full_page(page, ext4_get_block, wbc);
> + ret = walk_page_buffers(handle, page_bufs, 0,
> + PAGE_CACHE_SIZE, NULL, do_journal_get_write_access);
>
> + err = walk_page_buffers(handle, page_bufs, 0,
> + PAGE_CACHE_SIZE, NULL, write_end_fn);
> + if (ret == 0)
> + ret = err;
> err = ext4_journal_stop(handle);
> if (!ret)
> ret = err;
> - return ret;
>
> -out_fail:
> - redirty_page_for_writepage(wbc, page);
> + walk_page_buffers(handle, page_bufs, 0,
> + PAGE_CACHE_SIZE, NULL, bput_one);
> + EXT4_I(inode)->i_state |= EXT4_STATE_JDATA;
> + goto out;
> +
> +out_unlock:
> unlock_page(page);
> +out:
> return ret;
> }
>
> static int ext4_journalled_writepage(struct page *page,
> struct writeback_control *wbc)
> {
> - struct inode *inode = page->mapping->host;
> - handle_t *handle = NULL;
> - int ret = 0;
> - int err;
> -
> if (ext4_journal_current_handle())
> goto no_write;
>
> - handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
> - if (IS_ERR(handle)) {
> - ret = PTR_ERR(handle);
> - goto no_write;
> - }
> -
> if (!page_has_buffers(page) || PageChecked(page)) {

This will never happen with writepage right ? And we don't call
ext4_journalled_writepage from page_mkwrite. So is this needed ?
If not __ext4_journalled_writepage can handle everything in a single
transaction right and assume that it is called within a transaction.

> /*
> * It's mmapped pagecache. Add buffers and journal it. There
> * doesn't seem much point in redirtying the page here.
> */
> ClearPageChecked(page);
> - ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE,
> - ext4_get_block);
> - if (ret != 0) {
> - ext4_journal_stop(handle);
> - goto out_unlock;
> - }
> - ret = walk_page_buffers(handle, page_buffers(page), 0,
> - PAGE_CACHE_SIZE, NULL, do_journal_get_write_access);
> -
> - err = walk_page_buffers(handle, page_buffers(page), 0,
> - PAGE_CACHE_SIZE, NULL, write_end_fn);
> - if (ret == 0)
> - ret = err;
> - EXT4_I(inode)->i_state |= EXT4_STATE_JDATA;
> - unlock_page(page);
> + return __ext4_journalled_writepage(page, wbc);
> } else {
> /*
> * It may be a page full of checkpoint-mode buffers. We don't
> * really know unless we go poke around in the buffer_heads.
> * But block_write_full_page will do the right thing.
> */
> - ret = block_write_full_page(page, ext4_get_block, wbc);
> + return block_write_full_page(page, ext4_get_block, wbc);
> }
> - err = ext4_journal_stop(handle);
> - if (!ret)
> - ret = err;
> -out:
> - return ret;
> -
> no_write:
> redirty_page_for_writepage(wbc, page);
> -out_unlock:
> unlock_page(page);
> - goto out;
> + return 0;
> }
>

-aneesh

2008-05-26 17:21:26

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Wed 21-05-08 13:51:09, Aneesh Kumar K.V wrote:
> On Tue, Apr 15, 2008 at 06:14:30PM +0200, Jan Kara wrote:
> > Hi,
> >
> > I've ported my patch inversing locking ordering of page_lock and
> > transaction start to ext4 (on top of ext4 patch queue). Everything except
> > delayed allocation is converted (the patch is below for interested
> > readers). The question is how to proceed with delayed allocation. Its
> > current implementation in VFS is designed to work well with the old
> > ordering (page lock first, then start a transaction). We could bend it to
> > work with the new locking ordering but I really see no point since ext4 is
> > the only user. Also XFS has AFAIK ordering first start transaction, then
> > lock pages so if we should ever merge delayed alloc implementations the new
> > ordering would make it easier.
> > So what do people think here? Do you agree with reimplementing current
> > mpage_da_... functions? Eric, I guess you have the best clue how XFS does
> > this, do you have some advices? Also maybe pointers into XFS code would be
> > useful if it is reasonably readable :). Thanks.
> >
> > Honza
>
>
> [....snip....]
>
> > */
> > -static int ext4_ordered_writepage(struct page *page,
> > +static int __ext4_ordered_writepage(struct page *page,
> > struct writeback_control *wbc)
> > {
> > struct inode *inode = page->mapping->host;
> > @@ -1723,22 +1694,6 @@ static int ext4_ordered_writepage(struct page *page,
> > int ret = 0;
> > int err;
> >
> > - J_ASSERT(PageLocked(page));
> > -
> > - /*
> > - * We give up here if we're reentered, because it might be for a
> > - * different filesystem.
> > - */
> > - if (ext4_journal_current_handle())
> > - goto out_fail;
> > -
> > - handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
> > -
> > - if (IS_ERR(handle)) {
> > - ret = PTR_ERR(handle);
> > - goto out_fail;
> > - }
> > -
> > if (!page_has_buffers(page)) {
> > create_empty_buffers(page, inode->i_sb->s_blocksize,
> > (1 << BH_Dirty)|(1 << BH_Uptodate));
> > @@ -1762,114 +1717,139 @@ static int ext4_ordered_writepage(struct page *page,
> > * and generally junk.
> > */
> > if (ret == 0) {
> > - err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
> > + handle = ext4_journal_start(inode,
> > + ext4_writepage_trans_blocks(inode));
> > + if (IS_ERR(handle)) {
> > + ret = PTR_ERR(handle);
> > + goto out_put;
> > + }
> > +
> > + ret = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
> > NULL, jbd2_journal_dirty_data_fn);
> > + err = ext4_journal_stop(handle);
> > if (!ret)
> > ret = err;
> > }
> > - walk_page_buffers(handle, page_bufs, 0,
> > - PAGE_CACHE_SIZE, NULL, bput_one);
> > - err = ext4_journal_stop(handle);
> > - if (!ret)
> > - ret = err;
> > +out_put:
> > + walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE, NULL,
> > + bput_one);
> > return ret;
> > +}
> > +
> > +static int ext4_ordered_writepage(struct page *page,
> > + struct writeback_control *wbc)
> > +{
> > + J_ASSERT(PageLocked(page));
> > +
> > + /*
> > + * We give up here if we're reentered, because it might be for a
> > + * different filesystem.
> > + */
> > + if (!ext4_journal_current_handle())
> > + return __ext4_ordered_writepage(page, wbc);
> >
> > -out_fail:
> > redirty_page_for_writepage(wbc, page);
> > unlock_page(page);
> > - return ret;
> > + return 0;
> > }
>
>
> How about change below to make sure we don't have a deadlock.
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 9d1d07b..85de163 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1718,6 +1718,10 @@ static int jbd2_journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> return 0;
> }
>
> +static int ext4_bh_unmapped(handle_t *handle, struct buffer_head *bh)
> +{
> + return !buffer_mapped(bh);
> +}
> /*
> * Note that we don't need to start a transaction unless we're journaling
> * data because we should have holes filled from ext4_page_mkwrite(). If
> @@ -1767,20 +1771,33 @@ static int jbd2_journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> * us.
> *
> */
> -static int __ext4_ordered_writepage(struct page *page,
> - struct writeback_control *wbc)
> +static int __ext4_ordered_alloc_and_writepage(struct page *page,
> + struct writeback_control *wbc, int alloc)
> {
> - struct inode *inode = page->mapping->host;
> - struct buffer_head *page_bufs;
> + int ret = 0, err;
> + unsigned long len;
> handle_t *handle = NULL;
> - int ret = 0;
> - int err;
> + struct buffer_head *page_bufs;
> + struct inode *inode = page->mapping->host;
> + loff_t size = i_size_read(inode);
>
> if (!page_has_buffers(page)) {
> create_empty_buffers(page, inode->i_sb->s_blocksize,
> (1 << BH_Dirty)|(1 << BH_Uptodate));
> }
> page_bufs = page_buffers(page);
> +
> + if (page->index == size >> PAGE_CACHE_SHIFT)
> + len = size & ~PAGE_CACHE_MASK;
> + else
> + len = PAGE_CACHE_SIZE;
> +
> + if (!alloc && walk_page_buffers(NULL, page_bufs, 0,
> + len, NULL, ext4_bh_unmapped)) {
> + printk(KERN_CRIT "%s called with unmapped buffer\n",
> + __func__);
> + BUG();
> + }
> walk_page_buffers(handle, page_bufs, 0,
> PAGE_CACHE_SIZE, NULL, bget_one);
>
> @@ -1828,7 +1845,7 @@ static int ext4_ordered_writepage(struct page *page,
> * different filesystem.
> */
> if (!ext4_journal_current_handle())
> - return __ext4_ordered_writepage(page, wbc);
> + return __ext4_ordered_alloc_and_writepage(page, wbc, 0);
>
> redirty_page_for_writepage(wbc, page);
> unlock_page(page);
> @@ -3777,10 +3794,6 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
> return err;
> }
>
> -static int ext4_bh_unmapped(handle_t *handle, struct buffer_head *bh)
> -{
> - return !buffer_mapped(bh);
> -}
>
> int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page)
> {
> @@ -3837,7 +3850,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page)
> if (ext4_should_writeback_data(inode))
> ret = __ext4_writeback_writepage(page, &wbc);
> else if (ext4_should_order_data(inode))
> - ret = __ext4_ordered_writepage(page, &wbc);
> + ret = __ext4_ordered_alloc_and_writepage(page, &wbc, 1);
> else
> ret = __ext4_journalled_writepage(page, &wbc);
> /* Page got unlocked in writepage */
>
>
>
> ie we call __ext4_ordered_alloc_and_writepage with alloc = 1 only in
> case of page_mkwrite. All the other case we should have all the buffer
> heads mapped. Otherwise we will try to allocate new blocks which starts
> a new transaction holding page lock.
When do we try to allocate new blocks in writepage now? ext4_page_mkwrite()
should have done the allocation before writepage() was called so there
should be no need to allocate anything... But maybe I miss something.

> > -static int ext4_writeback_writepage(struct page *page,
> > +static int __ext4_writeback_writepage(struct page *page,
> > struct writeback_control *wbc)
> > {
> > struct inode *inode = page->mapping->host;
> > +
> > + if (test_opt(inode->i_sb, NOBH))
> > + return nobh_writepage(page, ext4_get_block, wbc);
> > + else
> > + return block_write_full_page(page, ext4_get_block, wbc);
> > +}
> > +
> > +
> > +static int ext4_writeback_writepage(struct page *page,
> > + struct writeback_control *wbc)
> > +{
> > + if (!ext4_journal_current_handle())
> > + return __ext4_writeback_writepage(page, wbc);
> > +
> > + redirty_page_for_writepage(wbc, page);
> > + unlock_page(page);
> > + return 0;
> > +}
> > +
> > +static int __ext4_journalled_writepage(struct page *page,
> > + struct writeback_control *wbc)
> > +{
> > + struct address_space *mapping = page->mapping;
> > + struct inode *inode = mapping->host;
> > + struct buffer_head *page_bufs;
> > handle_t *handle = NULL;
> > int ret = 0;
> > int err;
> >
> > - if (ext4_journal_current_handle())
> > - goto out_fail;
> > + ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE, ext4_get_block);
> > + if (ret != 0)
> > + goto out_unlock;
> > +
> > + page_bufs = page_buffers(page);
> > + walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE, NULL,
> > + bget_one);
> > + /* As soon as we unlock the page, it can go away, but we have
> > + * references to buffers so we are safe */
> > + unlock_page(page);
> >
> > handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
> > if (IS_ERR(handle)) {
> > ret = PTR_ERR(handle);
> > - goto out_fail;
> > + goto out;
> > }
> >
> > - if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))
> > - ret = nobh_writepage(page, ext4_get_block, wbc);
> > - else
> > - ret = block_write_full_page(page, ext4_get_block, wbc);
> > + ret = walk_page_buffers(handle, page_bufs, 0,
> > + PAGE_CACHE_SIZE, NULL, do_journal_get_write_access);
> >
> > + err = walk_page_buffers(handle, page_bufs, 0,
> > + PAGE_CACHE_SIZE, NULL, write_end_fn);
> > + if (ret == 0)
> > + ret = err;
> > err = ext4_journal_stop(handle);
> > if (!ret)
> > ret = err;
> > - return ret;
> >
> > -out_fail:
> > - redirty_page_for_writepage(wbc, page);
> > + walk_page_buffers(handle, page_bufs, 0,
> > + PAGE_CACHE_SIZE, NULL, bput_one);
> > + EXT4_I(inode)->i_state |= EXT4_STATE_JDATA;
> > + goto out;
> > +
> > +out_unlock:
> > unlock_page(page);
> > +out:
> > return ret;
> > }
> >
> > static int ext4_journalled_writepage(struct page *page,
> > struct writeback_control *wbc)
> > {
> > - struct inode *inode = page->mapping->host;
> > - handle_t *handle = NULL;
> > - int ret = 0;
> > - int err;
> > -
> > if (ext4_journal_current_handle())
> > goto no_write;
> >
> > - handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
> > - if (IS_ERR(handle)) {
> > - ret = PTR_ERR(handle);
> > - goto no_write;
> > - }
> > -
> > if (!page_has_buffers(page) || PageChecked(page)) {
>
>
> This will never happen with writepage right ? And we don't call
> ext4_journalled_writepage from page_mkwrite. So is this needed ?
> If not __ext4_journalled_writepage can handle everything in a single
> transaction right and assume that it is called within a transaction.
I'm not sure I understand you. PageChecked() can happen from writepage
call path. We set PageChecked() when we do set_page_dirty() as far as I
remember... Basically, we use this flag to decide whether writepage should
do checkpointing or write into the journal.

> > /*
> > * It's mmapped pagecache. Add buffers and journal it. There
> > * doesn't seem much point in redirtying the page here.
> > */
> > ClearPageChecked(page);
> > - ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE,
> > - ext4_get_block);
> > - if (ret != 0) {
> > - ext4_journal_stop(handle);
> > - goto out_unlock;
> > - }
> > - ret = walk_page_buffers(handle, page_buffers(page), 0,
> > - PAGE_CACHE_SIZE, NULL, do_journal_get_write_access);
> > -
> > - err = walk_page_buffers(handle, page_buffers(page), 0,
> > - PAGE_CACHE_SIZE, NULL, write_end_fn);
> > - if (ret == 0)
> > - ret = err;
> > - EXT4_I(inode)->i_state |= EXT4_STATE_JDATA;
> > - unlock_page(page);
> > + return __ext4_journalled_writepage(page, wbc);
> > } else {
> > /*
> > * It may be a page full of checkpoint-mode buffers. We don't
> > * really know unless we go poke around in the buffer_heads.
> > * But block_write_full_page will do the right thing.
> > */
> > - ret = block_write_full_page(page, ext4_get_block, wbc);
> > + return block_write_full_page(page, ext4_get_block, wbc);
> > }
> > - err = ext4_journal_stop(handle);
> > - if (!ret)
> > - ret = err;
> > -out:
> > - return ret;
> > -
> > no_write:
> > redirty_page_for_writepage(wbc, page);
> > -out_unlock:
> > unlock_page(page);
> > - goto out;
> > + return 0;
> > }

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-05-26 18:00:52

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Mon, May 26, 2008 at 07:21:24PM +0200, Jan Kara wrote:
> On Wed 21-05-08 13:51:09, Aneesh Kumar K.V wrote:

[....snip.....]

> > {
> > @@ -3837,7 +3850,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page)
> > if (ext4_should_writeback_data(inode))
> > ret = __ext4_writeback_writepage(page, &wbc);
> > else if (ext4_should_order_data(inode))
> > - ret = __ext4_ordered_writepage(page, &wbc);
> > + ret = __ext4_ordered_alloc_and_writepage(page, &wbc, 1);
> > else
> > ret = __ext4_journalled_writepage(page, &wbc);
> > /* Page got unlocked in writepage */
> >
> >
> >
> > ie we call __ext4_ordered_alloc_and_writepage with alloc = 1 only in
> > case of page_mkwrite. All the other case we should have all the buffer
> > heads mapped. Otherwise we will try to allocate new blocks which starts
> > a new transaction holding page lock.
> When do we try to allocate new blocks in writepage now? ext4_page_mkwrite()
> should have done the allocation before writepage() was called so there
> should be no need to allocate anything... But maybe I miss something.

That's what i also meant by the above changes. The block are allocated
only in ext4_page_mkwrite and not during writepage. So calling
ext4_*_writepage during mkwrite confuse quiet a lot. Instead i was
trying to make it explicit by making page_mkwrite call
ext4_ordered_alloc_and_writepage and by adding BUG() in writepage
callback if it ever get called by an unmapped buffer.

I have got another question now related to page_mkwrite. AFAIU writepage
writeout dirty buffer_heads. It also looks at whether the pages are
dirty or not. In the page_mkwrite callback both are not true. ie we call
set_page_dirty from do_wp_page after calling page_mkwrite. I haven't
verified whether the above is correct or not. Just thinking reading the
code.

>
> > > -static int ext4_writeback_writepage(struct page *page,
> > > +static int __ext4_writeback_writepage(struct page *page,
> > > struct writeback_control *wbc)
> > > {
> > > struct inode *inode = page->mapping->host;
> > > +
> > > + if (test_opt(inode->i_sb, NOBH))
> > > + return nobh_writepage(page, ext4_get_block, wbc);
> > > + else
> > > + return block_write_full_page(page, ext4_get_block, wbc);
> > > +}
> > > +
> > > +
> > > +static int ext4_writeback_writepage(struct page *page,
> > > + struct writeback_control *wbc)
> > > +{
> > > + if (!ext4_journal_current_handle())
> > > + return __ext4_writeback_writepage(page, wbc);
> > > +
> > > + redirty_page_for_writepage(wbc, page);
> > > + unlock_page(page);
> > > + return 0;
> > > +}
> > > +
> > > +static int __ext4_journalled_writepage(struct page *page,
> > > + struct writeback_control *wbc)
> > > +{
> > > + struct address_space *mapping = page->mapping;
> > > + struct inode *inode = mapping->host;
> > > + struct buffer_head *page_bufs;
> > > handle_t *handle = NULL;
> > > int ret = 0;
> > > int err;
> > >
> > > - if (ext4_journal_current_handle())
> > > - goto out_fail;
> > > + ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE, ext4_get_block);
> > > + if (ret != 0)
> > > + goto out_unlock;
> > > +
> > > + page_bufs = page_buffers(page);
> > > + walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE, NULL,
> > > + bget_one);
> > > + /* As soon as we unlock the page, it can go away, but we have
> > > + * references to buffers so we are safe */
> > > + unlock_page(page);
> > >
> > > handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
> > > if (IS_ERR(handle)) {
> > > ret = PTR_ERR(handle);
> > > - goto out_fail;
> > > + goto out;
> > > }
> > >
> > > - if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))
> > > - ret = nobh_writepage(page, ext4_get_block, wbc);
> > > - else
> > > - ret = block_write_full_page(page, ext4_get_block, wbc);
> > > + ret = walk_page_buffers(handle, page_bufs, 0,
> > > + PAGE_CACHE_SIZE, NULL, do_journal_get_write_access);
> > >
> > > + err = walk_page_buffers(handle, page_bufs, 0,
> > > + PAGE_CACHE_SIZE, NULL, write_end_fn);
> > > + if (ret == 0)
> > > + ret = err;
> > > err = ext4_journal_stop(handle);
> > > if (!ret)
> > > ret = err;
> > > - return ret;
> > >
> > > -out_fail:
> > > - redirty_page_for_writepage(wbc, page);
> > > + walk_page_buffers(handle, page_bufs, 0,
> > > + PAGE_CACHE_SIZE, NULL, bput_one);
> > > + EXT4_I(inode)->i_state |= EXT4_STATE_JDATA;
> > > + goto out;
> > > +
> > > +out_unlock:
> > > unlock_page(page);
> > > +out:
> > > return ret;
> > > }
> > >
> > > static int ext4_journalled_writepage(struct page *page,
> > > struct writeback_control *wbc)
> > > {
> > > - struct inode *inode = page->mapping->host;
> > > - handle_t *handle = NULL;
> > > - int ret = 0;
> > > - int err;
> > > -
> > > if (ext4_journal_current_handle())
> > > goto no_write;
> > >
> > > - handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
> > > - if (IS_ERR(handle)) {
> > > - ret = PTR_ERR(handle);
> > > - goto no_write;
> > > - }
> > > -
> > > if (!page_has_buffers(page) || PageChecked(page)) {
> >
> >
> > This will never happen with writepage right ? And we don't call
> > ext4_journalled_writepage from page_mkwrite. So is this needed ?
> > If not __ext4_journalled_writepage can handle everything in a single
> > transaction right and assume that it is called within a transaction.
> I'm not sure I understand you. PageChecked() can happen from writepage
> call path. We set PageChecked() when we do set_page_dirty() as far as I
> remember... Basically, we use this flag to decide whether writepage should
> do checkpointing or write into the journal.

What i meant by the above question was can ext4_journalled_writepage get
called with page_buffers == NULL

So the check if (!page_has_buffers(page)) can go away right ?

I have posted some changes after this at
http://article.gmane.org/gmane.comp.file-systems.ext4/6768
Message-Id: <1211391859-17399-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

-aneesh

2008-05-27 12:43:14

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Mon 26-05-08 23:30:43, Aneesh Kumar K.V wrote:
> On Mon, May 26, 2008 at 07:21:24PM +0200, Jan Kara wrote:
> > On Wed 21-05-08 13:51:09, Aneesh Kumar K.V wrote:
> > > {
> > > @@ -3837,7 +3850,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page)
> > > if (ext4_should_writeback_data(inode))
> > > ret = __ext4_writeback_writepage(page, &wbc);
> > > else if (ext4_should_order_data(inode))
> > > - ret = __ext4_ordered_writepage(page, &wbc);
> > > + ret = __ext4_ordered_alloc_and_writepage(page, &wbc, 1);
> > > else
> > > ret = __ext4_journalled_writepage(page, &wbc);
> > > /* Page got unlocked in writepage */
> > >
> > >
> > >
> > > ie we call __ext4_ordered_alloc_and_writepage with alloc = 1 only in
> > > case of page_mkwrite. All the other case we should have all the buffer
> > > heads mapped. Otherwise we will try to allocate new blocks which starts
> > > a new transaction holding page lock.
> > When do we try to allocate new blocks in writepage now? ext4_page_mkwrite()
> > should have done the allocation before writepage() was called so there
> > should be no need to allocate anything... But maybe I miss something.
>
> That's what i also meant by the above changes. The block are allocated
Ah, Ok. Sorry, I misunderstood your previous email.

> only in ext4_page_mkwrite and not during writepage. So calling
> ext4_*_writepage during mkwrite confuse quiet a lot. Instead i was
> trying to make it explicit by making page_mkwrite call
> ext4_ordered_alloc_and_writepage and by adding BUG() in writepage
> callback if it ever get called by an unmapped buffer.
I see. I agree with the change in principle, but I'd just add this
check directly into ext4_ordered_writepage() (and similarly to
ext4_writeback_writepage() and ext4_journaled_writepage()) - there it makes
more sence and you don't have to pass the 'alloc' argument you proposed.
I can do this when we agree on how to resolve problems of delalloc mode
with this patch.

> I have got another question now related to page_mkwrite. AFAIU writepage
> writeout dirty buffer_heads. It also looks at whether the pages are
> dirty or not. In the page_mkwrite callback both are not true. ie we call
> set_page_dirty from do_wp_page after calling page_mkwrite. I haven't
> verified whether the above is correct or not. Just thinking reading the
> code.
Writepage call itself doesn't look at whether the page is dirty or not -
that flag is already cleared when writepage is called. You are right that
the page is marked dirty only after page_mkwrite is called - the meaning of
page_mkwrite() call is roughly "someone wants to do the first write to this
page via mmap, prepare filesystem for that". But we don't really care
whether the page is dirty or not - we know it carries correct data (it is
uptodate) and so we can write it if we want (and need).

> > > > -static int ext4_writeback_writepage(struct page *page,
> > > > +static int __ext4_writeback_writepage(struct page *page,
> > > > struct writeback_control *wbc)
> > > > {
> > > > struct inode *inode = page->mapping->host;
> > > > +
> > > > + if (test_opt(inode->i_sb, NOBH))
> > > > + return nobh_writepage(page, ext4_get_block, wbc);
> > > > + else
> > > > + return block_write_full_page(page, ext4_get_block, wbc);
> > > > +}
> > > > +
> > > > +
> > > > +static int ext4_writeback_writepage(struct page *page,
> > > > + struct writeback_control *wbc)
> > > > +{
> > > > + if (!ext4_journal_current_handle())
> > > > + return __ext4_writeback_writepage(page, wbc);
> > > > +
> > > > + redirty_page_for_writepage(wbc, page);
> > > > + unlock_page(page);
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static int __ext4_journalled_writepage(struct page *page,
> > > > + struct writeback_control *wbc)
> > > > +{
> > > > + struct address_space *mapping = page->mapping;
> > > > + struct inode *inode = mapping->host;
> > > > + struct buffer_head *page_bufs;
> > > > handle_t *handle = NULL;
> > > > int ret = 0;
> > > > int err;
> > > >
> > > > - if (ext4_journal_current_handle())
> > > > - goto out_fail;
> > > > + ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE, ext4_get_block);
> > > > + if (ret != 0)
> > > > + goto out_unlock;
> > > > +
> > > > + page_bufs = page_buffers(page);
> > > > + walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE, NULL,
> > > > + bget_one);
> > > > + /* As soon as we unlock the page, it can go away, but we have
> > > > + * references to buffers so we are safe */
> > > > + unlock_page(page);
> > > >
> > > > handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
> > > > if (IS_ERR(handle)) {
> > > > ret = PTR_ERR(handle);
> > > > - goto out_fail;
> > > > + goto out;
> > > > }
> > > >
> > > > - if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))
> > > > - ret = nobh_writepage(page, ext4_get_block, wbc);
> > > > - else
> > > > - ret = block_write_full_page(page, ext4_get_block, wbc);
> > > > + ret = walk_page_buffers(handle, page_bufs, 0,
> > > > + PAGE_CACHE_SIZE, NULL, do_journal_get_write_access);
> > > >
> > > > + err = walk_page_buffers(handle, page_bufs, 0,
> > > > + PAGE_CACHE_SIZE, NULL, write_end_fn);
> > > > + if (ret == 0)
> > > > + ret = err;
> > > > err = ext4_journal_stop(handle);
> > > > if (!ret)
> > > > ret = err;
> > > > - return ret;
> > > >
> > > > -out_fail:
> > > > - redirty_page_for_writepage(wbc, page);
> > > > + walk_page_buffers(handle, page_bufs, 0,
> > > > + PAGE_CACHE_SIZE, NULL, bput_one);
> > > > + EXT4_I(inode)->i_state |= EXT4_STATE_JDATA;
> > > > + goto out;
> > > > +
> > > > +out_unlock:
> > > > unlock_page(page);
> > > > +out:
> > > > return ret;
> > > > }
> > > >
> > > > static int ext4_journalled_writepage(struct page *page,
> > > > struct writeback_control *wbc)
> > > > {
> > > > - struct inode *inode = page->mapping->host;
> > > > - handle_t *handle = NULL;
> > > > - int ret = 0;
> > > > - int err;
> > > > -
> > > > if (ext4_journal_current_handle())
> > > > goto no_write;
> > > >
> > > > - handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
> > > > - if (IS_ERR(handle)) {
> > > > - ret = PTR_ERR(handle);
> > > > - goto no_write;
> > > > - }
> > > > -
> > > > if (!page_has_buffers(page) || PageChecked(page)) {
> > >
> > >
> > > This will never happen with writepage right ? And we don't call
> > > ext4_journalled_writepage from page_mkwrite. So is this needed ?
> > > If not __ext4_journalled_writepage can handle everything in a single
> > > transaction right and assume that it is called within a transaction.
> > I'm not sure I understand you. PageChecked() can happen from writepage
> > call path. We set PageChecked() when we do set_page_dirty() as far as I
> > remember... Basically, we use this flag to decide whether writepage should
> > do checkpointing or write into the journal.
>
> What i meant by the above question was can ext4_journalled_writepage get
> called with page_buffers == NULL
>
> So the check if (!page_has_buffers(page)) can go away right ?
I see. Well, you may be right but I'd rather leave the check there.
In page_mkwrite() we don't necessarily attach buffers to the page (if it
has PageMappedToDisk set) - that should not currently happen in
data=journal mode but in principle in future it could and there's no reason
to really forbid that.

> I have posted some changes after this at
> http://article.gmane.org/gmane.comp.file-systems.ext4/6768
> Message-Id: <1211391859-17399-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
Thanks for the pointer. I was at a conference last week and so I didn't
catch up with the mailing lists yet. Sorry for the confusion. Your post also
explains some questions I had in my previous emails about how do we proceed
with combination of lock-inversion and delalloc. So you don't have to
answer ;).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-05-27 15:12:02

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Tue, May 27, 2008 at 02:43:12PM +0200, Jan Kara wrote:
> On Mon 26-05-08 23:30:43, Aneesh Kumar K.V wrote:
>
> > I have got another question now related to page_mkwrite. AFAIU writepage
> > writeout dirty buffer_heads. It also looks at whether the pages are
> > dirty or not. In the page_mkwrite callback both are not true. ie we call
> > set_page_dirty from do_wp_page after calling page_mkwrite. I haven't
> > verified whether the above is correct or not. Just thinking reading the
> > code.
> Writepage call itself doesn't look at whether the page is dirty or not -
> that flag is already cleared when writepage is called. You are right that
> the page is marked dirty only after page_mkwrite is called - the meaning of
> page_mkwrite() call is roughly "someone wants to do the first write to this
> page via mmap, prepare filesystem for that". But we don't really care
> whether the page is dirty or not - we know it carries correct data (it is
> uptodate) and so we can write it if we want (and need).
>

I am looking at __block_write_full_page and we have

if (!buffer_mapped(bh) && buffer_dirty(bh)) {
WARN_ON(bh->b_size != blocksize);
err = get_block(inode, block, bh, 1);
if (err)

ie, we do get_block only if the buffer_head is dirty. So I am bit
doubtful whether we are actually allocating blocks via page_mkwrite.

-aneesh

2008-05-28 09:33:25

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Tue 27-05-08 20:41:28, Aneesh Kumar K.V wrote:
> On Tue, May 27, 2008 at 02:43:12PM +0200, Jan Kara wrote:
> > On Mon 26-05-08 23:30:43, Aneesh Kumar K.V wrote:
> >
> > > I have got another question now related to page_mkwrite. AFAIU writepage
> > > writeout dirty buffer_heads. It also looks at whether the pages are
> > > dirty or not. In the page_mkwrite callback both are not true. ie we call
> > > set_page_dirty from do_wp_page after calling page_mkwrite. I haven't
> > > verified whether the above is correct or not. Just thinking reading the
> > > code.
> > Writepage call itself doesn't look at whether the page is dirty or not -
> > that flag is already cleared when writepage is called. You are right that
> > the page is marked dirty only after page_mkwrite is called - the meaning of
> > page_mkwrite() call is roughly "someone wants to do the first write to this
> > page via mmap, prepare filesystem for that". But we don't really care
> > whether the page is dirty or not - we know it carries correct data (it is
> > uptodate) and so we can write it if we want (and need).
> >
>
> I am looking at __block_write_full_page and we have
>
> if (!buffer_mapped(bh) && buffer_dirty(bh)) {
> WARN_ON(bh->b_size != blocksize);
> err = get_block(inode, block, bh, 1);
> if (err)
>
> ie, we do get_block only if the buffer_head is dirty. So I am bit
> doubtful whether we are actually allocating blocks via page_mkwrite.
Good catch, we should mark unmapped buffers dirty before calling writepage.
Actually, if the page didn't have any buffers, block_write_full_page() will
create them all dirty so that's probably why I didn't hit it in my testing
but it's definitely safer to mark them dirty explicitely. Thanks.
It is enough to change ext4_bh_mapped() to something like:
static int ext4_bh_prepare_fill(handle_t *handle, struct buffer_head *bh)
{
if (!buffer_mapped(bh)) {
/*
* Mark buffer as dirty so that block_write_full_page()
* writes it
*/
set_buffer_dirty(bh);
return 1;
}
return 0;
}

Should I send you an updated patch with this change and the changes we spoke
about yesterday, or just an incremental changes which you will fold yourself
into the big one?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-05-28 09:44:35

[permalink] [raw]

Subject: Re: Delayed allocation and page_lock vs transaction start ordering

On Wed, May 28, 2008 at 11:33:24AM +0200, Jan Kara wrote:
> On Tue 27-05-08 20:41:28, Aneesh Kumar K.V wrote:
> > On Tue, May 27, 2008 at 02:43:12PM +0200, Jan Kara wrote:
> > > On Mon 26-05-08 23:30:43, Aneesh Kumar K.V wrote:
> > >
> > > > I have got another question now related to page_mkwrite. AFAIU writepage
> > > > writeout dirty buffer_heads. It also looks at whether the pages are
> > > > dirty or not. In the page_mkwrite callback both are not true. ie we call
> > > > set_page_dirty from do_wp_page after calling page_mkwrite. I haven't
> > > > verified whether the above is correct or not. Just thinking reading the
> > > > code.
> > > Writepage call itself doesn't look at whether the page is dirty or not -
> > > that flag is already cleared when writepage is called. You are right that
> > > the page is marked dirty only after page_mkwrite is called - the meaning of
> > > page_mkwrite() call is roughly "someone wants to do the first write to this
> > > page via mmap, prepare filesystem for that". But we don't really care
> > > whether the page is dirty or not - we know it carries correct data (it is
> > > uptodate) and so we can write it if we want (and need).
> > >
> >
> > I am looking at __block_write_full_page and we have
> >
> > if (!buffer_mapped(bh) && buffer_dirty(bh)) {
> > WARN_ON(bh->b_size != blocksize);
> > err = get_block(inode, block, bh, 1);
> > if (err)
> >
> > ie, we do get_block only if the buffer_head is dirty. So I am bit
> > doubtful whether we are actually allocating blocks via page_mkwrite.
> Good catch, we should mark unmapped buffers dirty before calling writepage.
> Actually, if the page didn't have any buffers, block_write_full_page() will
> create them all dirty so that's probably why I didn't hit it in my testing
> but it's definitely safer to mark them dirty explicitely. Thanks.

looking at create_empty_buffers we do that only if page is marked as
dirty. In the case of page_mkwrite the page is also not marked dirty
when we call the call back right ?

> It is enough to change ext4_bh_mapped() to something like:
> static int ext4_bh_prepare_fill(handle_t *handle, struct buffer_head *bh)
> {
> if (!buffer_mapped(bh)) {
> /*
> * Mark buffer as dirty so that block_write_full_page()
> * writes it
> */
> set_buffer_dirty(bh);
> return 1;
> }
> return 0;
> }
>
> Should I send you an updated patch with this change and the changes we spoke
> about yesterday, or just an incremental changes which you will fold yourself
> into the big one?
>

This will mark only the first unmapped buffer_head as dirty. What about
the rest of the buffer_heads in the page that are unmapped ?

I am looking at pushing the ext4_page_mkwrite before rest of the
changes. That is needed to handle ENOSPC when mmap write to files with
holes.

-aneesh

2008-05-28 10:33:50