>> Only whole file layout support means that there is only one IOMODE_RW layout
>> segment.
>>
>> Signed-off-by: Andy Adamson <[email protected]>
>> Signed-off-by: Alexandros Batsakis <[email protected]>
>> Signed-off-by: Boaz Harrosh <[email protected]>
>> Signed-off-by: Dean Hildebrand <[email protected]>
>> Signed-off-by: Fred Isaman <[email protected]>
>> Signed-off-by: Mingyang Guo <[email protected]>
>> Signed-off-by: Tao Guo <[email protected]>
>> Signed-off-by: Zhang Jingwang <[email protected]>
>> Tested-by: Boaz Harrosh <[email protected]>
>> Signed-off-by: Benny Halevy <[email protected]>
>
>The code in this patch is new and different enough from the one I/we
>signed-off originally that they don't make sense here.
Hi Benny
OK with me
>>
>>+ /* references matched in nfs4_layoutcommit_release */
>> + wdata->lseg->pls_lc_cred =
>> + get_rpccred(wdata->args.context->state->owner->so_cred);
>> + mark_inode_dirty_sync(wdata->inode);
>> + dprintk("%s: Set layoutcommit for inode %lu ",
>> + __func__, wdata->inode->i_ino);
>> + }
>> + if (end_pos > wdata->lseg->pls_end_pos)
>> + wdata->lseg->pls_end_pos = end_pos;
>
> The end_pos is essentially per inode, why maintain it per lseg?
> How do you see this working with multiple lsegs in mind?
The end-pos is per lseg, not per inode - each layoutcommit applies to
a range of WRITES for a layoutsegment over the LAYOUTCOMMIT range.
>From Section 18.42.3
. The byte-range being committed is
specified through the byte-range (loca_offset and loca_length). This
byte-range MUST overlap with one or more existing layouts previously
granted via LAYOUTGET
Also, loca_last_write_offset MUST overlap the range
described by loca_offset and loca_length.
For the multiple lseg case: if the lsegs are merged, bookeeping
end_pos per lseg just works. If a layoutdriver does not use merged
lsegs, then there is a bit of work to do to walk the list of lsegs and
determine the final end_pos for a given LAYOUTCOMMIT. If there are
multiple non-contiguous lsegs, each used for WRITEs then multiple
LAYOUTCOMMITs will need to be sent, otherwise the LAYOUTCOMMIT
byte-range will not overlap as required.
>> +pnfs_layoutcommit_inode(struct inode *inode, int sync)
>
> "bool sync" makes more sense
>> +{
>> + struct nfs4_layoutcommit_data *data;
>> + struct nfs_inode *nfsi = NFS_I(inode);
>> + struct pnfs_layout_segment *lseg;
>> + struct rpc_cred *cred;
>> + loff_t end_pos;
>> + int status = 0;
>> +
>> + dprintk("--> %s inode %lu\n", __func__, inode->i_ino);
>> +
>> + /* Note kzalloc ensures data->res.seq_res.sr_slot == NULL */
>> + data = kzalloc(sizeof(*data), GFP_NOFS);
>> + spin_lock(&inode->i_lock);
>> +
> >+ if (!test_and_clear_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
>
> previously (i.e. in the linux-pnfs tree :) this function is called only
> if layoutcommit_needed(), now I worry may waste a kzalloc too frequently.
> I suggest testing (and not clearing) NFS_INO_LAYOUTCOMMIT before doing
> the allocation to prevent that.
Agreed.
>> + end_pos = lseg->pls_end_pos;
>> + cred = lseg->pls_lc_cred;
>> + lseg->pls_end_pos = 0;
>> + lseg->pls_lc_cred = NULL;
>> +
>> + if (!data) {
>
> eh?
> why not test this before test_and_clear_bit(NFS_INO_LAYOUTCOMMIT ?
Because we should clear the LAYOUTCOMMIT needed information from the inode.
The LAYOUTCOMMIT for the file layout is an optimization. If the client
can't alloc the required buffer, the compound just won't be sent.
-->Andy
On 2011-03-24 18:48, Trond Myklebust wrote:
> On Thu, 2011-03-24 at 18:37 +0200, Benny Halevy wrote:
>> On 2011-03-24 15:57, William A. (Andy) Adamson wrote:
>>>>> Only whole file layout support means that there is only one IOMODE_RW layout
>>>>> segment.
>>>>>
>>>>> Signed-off-by: Andy Adamson <[email protected]>
>>>>> Signed-off-by: Alexandros Batsakis <[email protected]>
>>>>> Signed-off-by: Boaz Harrosh <[email protected]>
>>>>> Signed-off-by: Dean Hildebrand <[email protected]>
>>>>> Signed-off-by: Fred Isaman <[email protected]>
>>>>> Signed-off-by: Mingyang Guo <[email protected]>
>>>>> Signed-off-by: Tao Guo <[email protected]>
>>>>> Signed-off-by: Zhang Jingwang <[email protected]>
>>>>> Tested-by: Boaz Harrosh <[email protected]>
>>>>> Signed-off-by: Benny Halevy <[email protected]>
>>>>
>>>> The code in this patch is new and different enough from the one I/we
>>>> signed-off originally that they don't make sense here.
>>>
>>> Hi Benny
>>>
>>> OK with me
>>>
>>>>>
>>>>> + /* references matched in nfs4_layoutcommit_release */
>>>>> + wdata->lseg->pls_lc_cred =
>>>>> + get_rpccred(wdata->args.context->state->owner->so_cred);
>>>>> + mark_inode_dirty_sync(wdata->inode);
>>>>> + dprintk("%s: Set layoutcommit for inode %lu ",
>>>>> + __func__, wdata->inode->i_ino);
>>>>> + }
>>>>> + if (end_pos > wdata->lseg->pls_end_pos)
>>>>> + wdata->lseg->pls_end_pos = end_pos;
>>>>
>>>> The end_pos is essentially per inode, why maintain it per lseg?
>>>> How do you see this working with multiple lsegs in mind?
>>>
>>> The end-pos is per lseg, not per inode - each layoutcommit applies to
>>> a range of WRITES for a layoutsegment over the LAYOUTCOMMIT range.
>>>
>>> From Section 18.42.3
>>> . The byte-range being committed is
>>> specified through the byte-range (loca_offset and loca_length). This
>>> byte-range MUST overlap with one or more existing layouts previously
>>> granted via LAYOUTGET
>>>
>>>
>>> Also, loca_last_write_offset MUST overlap the range
>>> described by loca_offset and loca_length.
>>>
>>> For the multiple lseg case: if the lsegs are merged, bookeeping
>>> end_pos per lseg just works. If a layoutdriver does not use merged
>>> lsegs, then there is a bit of work to do to walk the list of lsegs and
>>> determine the final end_pos for a given LAYOUTCOMMIT. If there are
>>> multiple non-contiguous lsegs, each used for WRITEs then multiple
>>> LAYOUTCOMMITs will need to be sent, otherwise the LAYOUTCOMMIT
>>> byte-range will not overlap as required.
>>>
>>
>> For the current layout types I believe that the LAYOUTCOMMIT can "merge"
>> multiple layout segments into a single LAYOUTCOMMIT, with a byte range
>> covering all segments and a last_byte_written offset which is just the maximum.
>> Future layout types may need this method though...
>
> Is that safe?
>
> What if I'm doing blocks and have written layout segment 1 & 3, but not
> layout segment 2? I don't want to have the MDS commit layout segment 2,
> and make the (lack of) data there visible to future readers.
>
I'm not the real expert on pnfs-blocks but my interpretation of rfc5663 is that the
list of extents in pnfs_block_layoutupdate4 may be sparse (or holey if you'd like).
Note that the client may have written just parts of the layout it got in one layout segment.
In this case too, you don't want to send multiple LAYOUTCOMMITs for each contiguous
area...
Benny
On Thu, Mar 24, 2011 at 12:48 PM, Trond Myklebust
<[email protected]> wrote:
> On Thu, 2011-03-24 at 18:37 +0200, Benny Halevy wrote:
>> On 2011-03-24 15:57, William A. (Andy) Adamson wrote:
>> >>> Only whole file layout support means that there is only one IOMODE_RW layout
>> >>> segment.
>> >>>
>> >>> Signed-off-by: Andy Adamson <[email protected]>
>> >>> Signed-off-by: Alexandros Batsakis <[email protected]>
>> >>> Signed-off-by: Boaz Harrosh <[email protected]>
>> >>> Signed-off-by: Dean Hildebrand <[email protected]>
>> >>> Signed-off-by: Fred Isaman <[email protected]>
>> >>> Signed-off-by: Mingyang Guo <[email protected]>
>> >>> Signed-off-by: Tao Guo <[email protected]>
>> >>> Signed-off-by: Zhang Jingwang <[email protected]>
>> >>> Tested-by: Boaz Harrosh <[email protected]>
>> >>> Signed-off-by: Benny Halevy <[email protected]>
>> >>
>> >> The code in this patch is new and different enough from the one I/we
>> >> signed-off originally that they don't make sense here.
>> >
>> > Hi Benny
>> >
>> > OK with me
>> >
>> >>>
>> >>> + ? ? ? ? ? ? /* references matched in nfs4_layoutcommit_release */
>> >>> + ? ? ? ? ? ? wdata->lseg->pls_lc_cred =
>> >>> + ? ? ? ? ? ? ? ? ? ? get_rpccred(wdata->args.context->state->owner->so_cred);
>> >>> + ? ? ? ? ? ? mark_inode_dirty_sync(wdata->inode);
>> >>> + ? ? ? ? ? ? dprintk("%s: Set layoutcommit for inode %lu ",
>> >>> + ? ? ? ? ? ? ? ? ? ? __func__, wdata->inode->i_ino);
>> >>> + ? ? }
>> >>> + ? ? if (end_pos > wdata->lseg->pls_end_pos)
>> >>> + ? ? ? ? ? ? wdata->lseg->pls_end_pos = end_pos;
>> >>
>> >> The end_pos is essentially per inode, why maintain it per lseg?
>> >> How do you see this working with multiple lsegs in mind?
>> >
>> > The end-pos is per lseg, not per inode - each layoutcommit applies to
>> > a range of WRITES for a layoutsegment over the LAYOUTCOMMIT range.
>> >
>> > From Section 18.42.3
>> > . ?The byte-range being committed is
>> > ? ?specified through the byte-range (loca_offset and loca_length). ?This
>> > ? ?byte-range MUST overlap with one or more existing layouts previously
>> > ? ?granted via LAYOUTGET
>> >
>> >
>> > ? ?Also, loca_last_write_offset MUST overlap the range
>> > ? ?described by loca_offset and loca_length.
>> >
>> > For the multiple lseg case: if the lsegs are merged, bookeeping
>> > end_pos per lseg just works. If a layoutdriver does not use merged
>> > lsegs, then there is a bit of work to do to walk the list of lsegs and
>> > determine the final end_pos for a given LAYOUTCOMMIT. ?If there are
>> > multiple non-contiguous lsegs, each used for WRITEs then multiple
>> > LAYOUTCOMMITs will need to be sent, otherwise the LAYOUTCOMMIT
>> > byte-range will not overlap as required.
>> >
>>
>> For the current layout types I believe that the LAYOUTCOMMIT can "merge"
>> multiple layout segments into a single LAYOUTCOMMIT, with a byte range
>> covering all segments and a last_byte_written offset which is just the maximum.
>> Future layout types may need this method though...
>
> Is that safe?
>
> What if I'm doing blocks and have written layout segment 1 & 3, but not
> layout segment 2? I don't want to have the MDS commit layout segment 2,
> and make the (lack of) data there visible to future readers.
>
No, it is not safe. Avoiding this problem is one of the major reasons
for putting the bookkeeping in the lseg.
Fred
On 2011-03-24 15:57, William A. (Andy) Adamson wrote:
>>> Only whole file layout support means that there is only one IOMODE_RW layout
>>> segment.
>>>
>>> Signed-off-by: Andy Adamson <[email protected]>
>>> Signed-off-by: Alexandros Batsakis <[email protected]>
>>> Signed-off-by: Boaz Harrosh <[email protected]>
>>> Signed-off-by: Dean Hildebrand <[email protected]>
>>> Signed-off-by: Fred Isaman <[email protected]>
>>> Signed-off-by: Mingyang Guo <[email protected]>
>>> Signed-off-by: Tao Guo <[email protected]>
>>> Signed-off-by: Zhang Jingwang <[email protected]>
>>> Tested-by: Boaz Harrosh <[email protected]>
>>> Signed-off-by: Benny Halevy <[email protected]>
>>
>> The code in this patch is new and different enough from the one I/we
>> signed-off originally that they don't make sense here.
>
> Hi Benny
>
> OK with me
>
>>>
>>> + /* references matched in nfs4_layoutcommit_release */
>>> + wdata->lseg->pls_lc_cred =
>>> + get_rpccred(wdata->args.context->state->owner->so_cred);
>>> + mark_inode_dirty_sync(wdata->inode);
>>> + dprintk("%s: Set layoutcommit for inode %lu ",
>>> + __func__, wdata->inode->i_ino);
>>> + }
>>> + if (end_pos > wdata->lseg->pls_end_pos)
>>> + wdata->lseg->pls_end_pos = end_pos;
>>
>> The end_pos is essentially per inode, why maintain it per lseg?
>> How do you see this working with multiple lsegs in mind?
>
> The end-pos is per lseg, not per inode - each layoutcommit applies to
> a range of WRITES for a layoutsegment over the LAYOUTCOMMIT range.
>
> From Section 18.42.3
> . The byte-range being committed is
> specified through the byte-range (loca_offset and loca_length). This
> byte-range MUST overlap with one or more existing layouts previously
> granted via LAYOUTGET
>
>
> Also, loca_last_write_offset MUST overlap the range
> described by loca_offset and loca_length.
>
> For the multiple lseg case: if the lsegs are merged, bookeeping
> end_pos per lseg just works. If a layoutdriver does not use merged
> lsegs, then there is a bit of work to do to walk the list of lsegs and
> determine the final end_pos for a given LAYOUTCOMMIT. If there are
> multiple non-contiguous lsegs, each used for WRITEs then multiple
> LAYOUTCOMMITs will need to be sent, otherwise the LAYOUTCOMMIT
> byte-range will not overlap as required.
>
For the current layout types I believe that the LAYOUTCOMMIT can "merge"
multiple layout segments into a single LAYOUTCOMMIT, with a byte range
covering all segments and a last_byte_written offset which is just the maximum.
Future layout types may need this method though...
Benny
>>> +pnfs_layoutcommit_inode(struct inode *inode, int sync)
>>
>> "bool sync" makes more sense
>
>>> +{
>>> + struct nfs4_layoutcommit_data *data;
>>> + struct nfs_inode *nfsi = NFS_I(inode);
>>> + struct pnfs_layout_segment *lseg;
>>> + struct rpc_cred *cred;
>>> + loff_t end_pos;
>>> + int status = 0;
>>> +
>>> + dprintk("--> %s inode %lu\n", __func__, inode->i_ino);
>>> +
>>> + /* Note kzalloc ensures data->res.seq_res.sr_slot == NULL */
>>> + data = kzalloc(sizeof(*data), GFP_NOFS);
>>> + spin_lock(&inode->i_lock);
>>> +
>>> + if (!test_and_clear_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
>>
>> previously (i.e. in the linux-pnfs tree :) this function is called only
>> if layoutcommit_needed(), now I worry may waste a kzalloc too frequently.
>> I suggest testing (and not clearing) NFS_INO_LAYOUTCOMMIT before doing
>> the allocation to prevent that.
>
> Agreed.
>
>>> + end_pos = lseg->pls_end_pos;
>>> + cred = lseg->pls_lc_cred;
>>> + lseg->pls_end_pos = 0;
>>> + lseg->pls_lc_cred = NULL;
>>> +
>>> + if (!data) {
>>
>> eh?
>> why not test this before test_and_clear_bit(NFS_INO_LAYOUTCOMMIT ?
>
> Because we should clear the LAYOUTCOMMIT needed information from the inode.
> The LAYOUTCOMMIT for the file layout is an optimization. If the client
> can't alloc the required buffer, the compound just won't be sent.
>
> -->Andy
On Thu, 2011-03-24 at 18:37 +0200, Benny Halevy wrote:
> On 2011-03-24 15:57, William A. (Andy) Adamson wrote:
> >>> Only whole file layout support means that there is only one IOMODE_RW layout
> >>> segment.
> >>>
> >>> Signed-off-by: Andy Adamson <[email protected]>
> >>> Signed-off-by: Alexandros Batsakis <[email protected]>
> >>> Signed-off-by: Boaz Harrosh <[email protected]>
> >>> Signed-off-by: Dean Hildebrand <[email protected]>
> >>> Signed-off-by: Fred Isaman <[email protected]>
> >>> Signed-off-by: Mingyang Guo <[email protected]>
> >>> Signed-off-by: Tao Guo <[email protected]>
> >>> Signed-off-by: Zhang Jingwang <[email protected]>
> >>> Tested-by: Boaz Harrosh <[email protected]>
> >>> Signed-off-by: Benny Halevy <[email protected]>
> >>
> >> The code in this patch is new and different enough from the one I/we
> >> signed-off originally that they don't make sense here.
> >
> > Hi Benny
> >
> > OK with me
> >
> >>>
> >>> + /* references matched in nfs4_layoutcommit_release */
> >>> + wdata->lseg->pls_lc_cred =
> >>> + get_rpccred(wdata->args.context->state->owner->so_cred);
> >>> + mark_inode_dirty_sync(wdata->inode);
> >>> + dprintk("%s: Set layoutcommit for inode %lu ",
> >>> + __func__, wdata->inode->i_ino);
> >>> + }
> >>> + if (end_pos > wdata->lseg->pls_end_pos)
> >>> + wdata->lseg->pls_end_pos = end_pos;
> >>
> >> The end_pos is essentially per inode, why maintain it per lseg?
> >> How do you see this working with multiple lsegs in mind?
> >
> > The end-pos is per lseg, not per inode - each layoutcommit applies to
> > a range of WRITES for a layoutsegment over the LAYOUTCOMMIT range.
> >
> > From Section 18.42.3
> > . The byte-range being committed is
> > specified through the byte-range (loca_offset and loca_length). This
> > byte-range MUST overlap with one or more existing layouts previously
> > granted via LAYOUTGET
> >
> >
> > Also, loca_last_write_offset MUST overlap the range
> > described by loca_offset and loca_length.
> >
> > For the multiple lseg case: if the lsegs are merged, bookeeping
> > end_pos per lseg just works. If a layoutdriver does not use merged
> > lsegs, then there is a bit of work to do to walk the list of lsegs and
> > determine the final end_pos for a given LAYOUTCOMMIT. If there are
> > multiple non-contiguous lsegs, each used for WRITEs then multiple
> > LAYOUTCOMMITs will need to be sent, otherwise the LAYOUTCOMMIT
> > byte-range will not overlap as required.
> >
>
> For the current layout types I believe that the LAYOUTCOMMIT can "merge"
> multiple layout segments into a single LAYOUTCOMMIT, with a byte range
> covering all segments and a last_byte_written offset which is just the maximum.
> Future layout types may need this method though...
Is that safe?
What if I'm doing blocks and have written layout segment 1 & 3, but not
layout segment 2? I don't want to have the MDS commit layout segment 2,
and make the (lack of) data there visible to future readers.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Thu, 2011-03-24 at 18:58 +0200, Benny Halevy wrote:
> On 2011-03-24 18:48, Trond Myklebust wrote:
> > On Thu, 2011-03-24 at 18:37 +0200, Benny Halevy wrote:
> >> On 2011-03-24 15:57, William A. (Andy) Adamson wrote:
> >>>>> Only whole file layout support means that there is only one IOMODE_RW layout
> >>>>> segment.
> >>>>>
> >>>>> Signed-off-by: Andy Adamson <[email protected]>
> >>>>> Signed-off-by: Alexandros Batsakis <[email protected]>
> >>>>> Signed-off-by: Boaz Harrosh <[email protected]>
> >>>>> Signed-off-by: Dean Hildebrand <[email protected]>
> >>>>> Signed-off-by: Fred Isaman <[email protected]>
> >>>>> Signed-off-by: Mingyang Guo <[email protected]>
> >>>>> Signed-off-by: Tao Guo <[email protected]>
> >>>>> Signed-off-by: Zhang Jingwang <[email protected]>
> >>>>> Tested-by: Boaz Harrosh <[email protected]>
> >>>>> Signed-off-by: Benny Halevy <[email protected]>
> >>>>
> >>>> The code in this patch is new and different enough from the one I/we
> >>>> signed-off originally that they don't make sense here.
> >>>
> >>> Hi Benny
> >>>
> >>> OK with me
> >>>
> >>>>>
> >>>>> + /* references matched in nfs4_layoutcommit_release */
> >>>>> + wdata->lseg->pls_lc_cred =
> >>>>> + get_rpccred(wdata->args.context->state->owner->so_cred);
> >>>>> + mark_inode_dirty_sync(wdata->inode);
> >>>>> + dprintk("%s: Set layoutcommit for inode %lu ",
> >>>>> + __func__, wdata->inode->i_ino);
> >>>>> + }
> >>>>> + if (end_pos > wdata->lseg->pls_end_pos)
> >>>>> + wdata->lseg->pls_end_pos = end_pos;
> >>>>
> >>>> The end_pos is essentially per inode, why maintain it per lseg?
> >>>> How do you see this working with multiple lsegs in mind?
> >>>
> >>> The end-pos is per lseg, not per inode - each layoutcommit applies to
> >>> a range of WRITES for a layoutsegment over the LAYOUTCOMMIT range.
> >>>
> >>> From Section 18.42.3
> >>> . The byte-range being committed is
> >>> specified through the byte-range (loca_offset and loca_length). This
> >>> byte-range MUST overlap with one or more existing layouts previously
> >>> granted via LAYOUTGET
> >>>
> >>>
> >>> Also, loca_last_write_offset MUST overlap the range
> >>> described by loca_offset and loca_length.
> >>>
> >>> For the multiple lseg case: if the lsegs are merged, bookeeping
> >>> end_pos per lseg just works. If a layoutdriver does not use merged
> >>> lsegs, then there is a bit of work to do to walk the list of lsegs and
> >>> determine the final end_pos for a given LAYOUTCOMMIT. If there are
> >>> multiple non-contiguous lsegs, each used for WRITEs then multiple
> >>> LAYOUTCOMMITs will need to be sent, otherwise the LAYOUTCOMMIT
> >>> byte-range will not overlap as required.
> >>>
> >>
> >> For the current layout types I believe that the LAYOUTCOMMIT can "merge"
> >> multiple layout segments into a single LAYOUTCOMMIT, with a byte range
> >> covering all segments and a last_byte_written offset which is just the maximum.
> >> Future layout types may need this method though...
> >
> > Is that safe?
> >
> > What if I'm doing blocks and have written layout segment 1 & 3, but not
> > layout segment 2? I don't want to have the MDS commit layout segment 2,
> > and make the (lack of) data there visible to future readers.
> >
>
> I'm not the real expert on pnfs-blocks but my interpretation of rfc5663 is that the
> list of extents in pnfs_block_layoutupdate4 may be sparse (or holey if you'd like).
> Note that the client may have written just parts of the layout it got in one layout segment.
> In this case too, you don't want to send multiple LAYOUTCOMMITs for each contiguous
> area...
Sure, but my understanding was that RFC5663 supports copy on write files
and that the actual copying of the block may need to be done by the
client (see section 2.3.4).
If the new block is still uninitialised when you call LAYOUTCOMMIT, then
that will corrupt the file if the client then dies before it finishes
processing segment 2, since the previously valid data blocks are being
replaced by uninitialised ones (which I presume will convert that
section into a pre-allocated hole???).
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On 2011-03-24 19:15, Trond Myklebust wrote:
> On Thu, 2011-03-24 at 18:58 +0200, Benny Halevy wrote:
>> On 2011-03-24 18:48, Trond Myklebust wrote:
>>> On Thu, 2011-03-24 at 18:37 +0200, Benny Halevy wrote:
>>>> On 2011-03-24 15:57, William A. (Andy) Adamson wrote:
>>>>>>> Only whole file layout support means that there is only one IOMODE_RW layout
>>>>>>> segment.
>>>>>>>
>>>>>>> Signed-off-by: Andy Adamson <[email protected]>
>>>>>>> Signed-off-by: Alexandros Batsakis <[email protected]>
>>>>>>> Signed-off-by: Boaz Harrosh <[email protected]>
>>>>>>> Signed-off-by: Dean Hildebrand <[email protected]>
>>>>>>> Signed-off-by: Fred Isaman <[email protected]>
>>>>>>> Signed-off-by: Mingyang Guo <[email protected]>
>>>>>>> Signed-off-by: Tao Guo <[email protected]>
>>>>>>> Signed-off-by: Zhang Jingwang <[email protected]>
>>>>>>> Tested-by: Boaz Harrosh <[email protected]>
>>>>>>> Signed-off-by: Benny Halevy <[email protected]>
>>>>>> The code in this patch is new and different enough from the one I/we
>>>>>> signed-off originally that they don't make sense here.
>>>>> Hi Benny
>>>>>
>>>>> OK with me
>>>>>
>>>>>>> + /* references matched in nfs4_layoutcommit_release */
>>>>>>> + wdata->lseg->pls_lc_cred =
>>>>>>> + get_rpccred(wdata->args.context->state->owner->so_cred);
>>>>>>> + mark_inode_dirty_sync(wdata->inode);
>>>>>>> + dprintk("%s: Set layoutcommit for inode %lu ",
>>>>>>> + __func__, wdata->inode->i_ino);
>>>>>>> + }
>>>>>>> + if (end_pos > wdata->lseg->pls_end_pos)
>>>>>>> + wdata->lseg->pls_end_pos = end_pos;
>>>>>> The end_pos is essentially per inode, why maintain it per lseg?
>>>>>> How do you see this working with multiple lsegs in mind?
>>>>> The end-pos is per lseg, not per inode - each layoutcommit applies to
>>>>> a range of WRITES for a layoutsegment over the LAYOUTCOMMIT range.
>>>>>
>>>>> From Section 18.42.3
>>>>> . The byte-range being committed is
>>>>> specified through the byte-range (loca_offset and loca_length). This
>>>>> byte-range MUST overlap with one or more existing layouts previously
>>>>> granted via LAYOUTGET
>>>>>
>>>>>
>>>>> Also, loca_last_write_offset MUST overlap the range
>>>>> described by loca_offset and loca_length.
>>>>>
>>>>> For the multiple lseg case: if the lsegs are merged, bookeeping
>>>>> end_pos per lseg just works. If a layoutdriver does not use merged
>>>>> lsegs, then there is a bit of work to do to walk the list of lsegs and
>>>>> determine the final end_pos for a given LAYOUTCOMMIT. If there are
>>>>> multiple non-contiguous lsegs, each used for WRITEs then multiple
>>>>> LAYOUTCOMMITs will need to be sent, otherwise the LAYOUTCOMMIT
>>>>> byte-range will not overlap as required.
>>>>>
>>>> For the current layout types I believe that the LAYOUTCOMMIT can "merge"
>>>> multiple layout segments into a single LAYOUTCOMMIT, with a byte range
>>>> covering all segments and a last_byte_written offset which is just the maximum.
>>>> Future layout types may need this method though...
>>> Is that safe?
>>>
>>> What if I'm doing blocks and have written layout segment 1 & 3, but not
>>> layout segment 2? I don't want to have the MDS commit layout segment 2,
>>> and make the (lack of) data there visible to future readers.
>>>
>> I'm not the real expert on pnfs-blocks but my interpretation of rfc5663 is that the
>> list of extents in pnfs_block_layoutupdate4 may be sparse (or holey if you'd like).
>> Note that the client may have written just parts of the layout it got in one layout segment.
>> In this case too, you don't want to send multiple LAYOUTCOMMITs for each contiguous
>> area...
> Sure, but my understanding was that RFC5663 supports copy on write files
> and that the actual copying of the block may need to be done by the
> client (see section 2.3.4).
>
> If the new block is still uninitialised when you call LAYOUTCOMMIT, then
> that will corrupt the file if the client then dies before it finishes
> processing segment 2, since the previously valid data blocks are being
> replaced by uninitialised ones (which I presume will convert that
> section into a pre-allocated hole???).
>
The extents that LAYOUTCOMMITted MUST NOT be PNFS_BLOCK_INVALID_DATA,
as noted in 2.3.2:
The bex_state
field of each extent in the blu_commit_list MUST be set to
PNFS_BLOCK_READ_WRITE_DATA.
Benny