On Mon, 2013-03-18 at 16:38 +0200, Benny Halevy wrote:
> We're seeing roughly 20% of the I/Os going to the MDS
> when installing a VM over KVM in "none" caching mode (O_DIRECT).
> Instrumenting the client revealed that this is caused by buffer
> alignment vs. file offset alignment.
> Besides being a performance problem, when the MDS caches data
> this is also manifested as data corruption when data is written
> first via the MDS, then via the DS, eventually the stale data is
> read back from the MDS.
That's why we should return the layout.
> Note that this check exists also for the file layout specific
> pg_init_* functions. The objects (ORE) and block
> (bl_{read,write}_pagelist) layouts seem to deal correctly with
> splitting IOs in the case where req->wb_offset != req->wb_pgbase
> though this hasn't been tested when submitting this patch.
>
> Signed-off-by: Benny Halevy <[email protected]>
> Cc: [email protected] [>= 3.5]
> Cc: Boaz Harrosh <[email protected]>
> Cc: Peng Tao <[email protected]>
> ---
> fs/nfs/pnfs.c | 18 ++++++++++--------
> 1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index 483bd94..f12e456 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -1322,10 +1322,11 @@ struct pnfs_layout_segment *
>
> WARN_ON_ONCE(pgio->pg_lseg != NULL);
>
> - if (req->wb_offset != req->wb_pgbase) {
> - nfs_pageio_reset_read_mds(pgio);
> - return;
> - }
> + if (req->wb_offset != req->wb_pgbase)
> + dprintk("%s: inode=%ld: offset=%llu wb_bytes=%u wb_offset=%u wb_pgbase=%u\n",
> + __func__, pgio->pg_inode->i_ino,
> + (((unsigned long long)req->wb_index) << PAGE_CACHE_SHIFT) + req->wb_offset,
> + req->wb_bytes, req->wb_offset, req->wb_pgbase);
>
> if (pgio->pg_dreq == NULL)
> rd_size = i_size_read(pgio->pg_inode) - req_offset(req);
> @@ -1351,10 +1352,11 @@ struct pnfs_layout_segment *
> {
> WARN_ON_ONCE(pgio->pg_lseg != NULL);
>
> - if (req->wb_offset != req->wb_pgbase) {
> - nfs_pageio_reset_write_mds(pgio);
> - return;
> - }
> + if (req->wb_offset != req->wb_pgbase)
> + dprintk("%s: inode=%ld: offset=%llu wb_bytes=%u wb_offset=%u wb_pgbase=%u\n",
> + __func__, pgio->pg_inode->i_ino,
> + (((unsigned long long)req->wb_index) << PAGE_CACHE_SHIFT) + req->wb_offset,
> + req->wb_bytes, req->wb_offset, req->wb_pgbase);
>
> pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
> req->wb_context,
NACK. I see no evidence that we've addressed the issues that were raised
by Fred in commit 1825a0d08f22463e5a8f4b1636473efd057a3479 (NFS: prepare
coalesce testing for directio).
If you think that his concerns about the coalescing assumptions are no
longer true, then please point to why this is the case. AFAICR that
patch was added to fix corruption issues.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
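[Illustration: a minimal user-space sketch of the alignment mismatch described in the commit message above. The wb_offset/wb_pgbase names are borrowed from struct nfs_page, but the numbers and the simplified model are made up; this is not the kernel's direct-I/O code.]

/* wb_offset models the data's offset within its page of the file,
 * wb_pgbase the data's offset within the user page holding it. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

int main(void)
{
	uint64_t  file_off = 512;              /* I/O starts 512 bytes into the file */
	uintptr_t buf      = 0x7f0000001000UL; /* page-aligned user buffer */

	unsigned int wb_offset = file_off % PAGE_SIZE;  /* offset within the file page */
	unsigned int wb_pgbase = buf % PAGE_SIZE;       /* offset within the user page */

	printf("wb_offset=%u wb_pgbase=%u -> %s\n", wb_offset, wb_pgbase,
	       wb_offset == wb_pgbase ?
	       "aligned, request can stay on the DS path" :
	       "misaligned, pre-patch code falls back to the MDS");
	return 0;
}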
On 2013-03-18 17:55, Myklebust, Trond wrote:
> On Mon, 2013-03-18 at 16:38 +0200, Benny Halevy wrote:
>> We're seeing roughly 20% of the I/Os going to the MDS
>> when installing a VM over KVM in "none" caching mode (O_DIRECT).
>> Instrumenting the client revealed that this is caused by buffer
>> alignment vs. file offset alignment.
>> Besides being a performance problem, when the MDS caches data
>> this is also manifested as data corruption when data is written
>> first via the MDS, then via the DS, eventually the stale data is
>> read back from the MDS.
>
> That's why we should return the layout.
>
We are not in this case.
>> Note that this check exists also for the file layout specific
>> pg_init_* functions. The objects (ORE) and block
>> (bl_{read,write}_pagelist) layouts seem to deal correctly with
>> splitting IOs in the case where req->wb_offset != req->wb_pgbase
>> though this hasn't been tested when submitting this patch.
>>
>> Signed-off-by: Benny Halevy <[email protected]>
>> Cc: [email protected] [>= 3.5]
>> Cc: Boaz Harrosh <[email protected]>
>> Cc: Peng Tao <[email protected]>
>> ---
>> fs/nfs/pnfs.c | 18 ++++++++++--------
>> 1 file changed, 10 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> index 483bd94..f12e456 100644
>> --- a/fs/nfs/pnfs.c
>> +++ b/fs/nfs/pnfs.c
>> @@ -1322,10 +1322,11 @@ struct pnfs_layout_segment *
>>
>> WARN_ON_ONCE(pgio->pg_lseg != NULL);
>>
>> - if (req->wb_offset != req->wb_pgbase) {
>> - nfs_pageio_reset_read_mds(pgio);
>> - return;
>> - }
>> + if (req->wb_offset != req->wb_pgbase)
>> + dprintk("%s: inode=%ld: offset=%llu wb_bytes=%u wb_offset=%u wb_pgbase=%u\n",
>> + __func__, pgio->pg_inode->i_ino,
>> + (((unsigned long long)req->wb_index) << PAGE_CACHE_SHIFT) + req->wb_offset,
>> + req->wb_bytes, req->wb_offset, req->wb_pgbase);
>>
>> if (pgio->pg_dreq == NULL)
>> rd_size = i_size_read(pgio->pg_inode) - req_offset(req);
>> @@ -1351,10 +1352,11 @@ struct pnfs_layout_segment *
>> {
>> WARN_ON_ONCE(pgio->pg_lseg != NULL);
>>
>> - if (req->wb_offset != req->wb_pgbase) {
>> - nfs_pageio_reset_write_mds(pgio);
>> - return;
>> - }
>> + if (req->wb_offset != req->wb_pgbase)
>> + dprintk("%s: inode=%ld: offset=%llu wb_bytes=%u wb_offset=%u wb_pgbase=%u\n",
>> + __func__, pgio->pg_inode->i_ino,
>> + (((unsigned long long)req->wb_index) << PAGE_CACHE_SHIFT) + req->wb_offset,
>> + req->wb_bytes, req->wb_offset, req->wb_pgbase);
>>
>> pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
>> req->wb_context,
>
> NACK. I see no evidence that we've addressed the issues that were raised
> by Fred in commit 1825a0d08f22463e5a8f4b1636473efd057a3479 (NFS: prepare
> coalesce testing for directio).
> If you think that his concerns about the coalescing assumptions are no
> longer true, then please point to why this is the case. AFAICR that
> patch was added to fix corruption issues.
>
We see no problems with this patch with the workloads we're testing.
Do you have a test that reproduces the original problem that we can try running?
Benny
On 2013-03-19 22:28, Fred Isaman wrote:
> On Tue, Mar 19, 2013 at 3:35 PM, Benny Halevy <[email protected]> wrote:
>> On 2013-03-18 19:04, Fred Isaman wrote:
>>> A problem case is if the layout ends within the page request, which
>>> can happen due to the misalignment.
>>
>> First, the file layout currently supports only whole file layouts so I don't see
>> how that's a problem.
>> Regardless, I'd rather trim the layout segment to PAGE_SIZE boundaries
>> and get a new layout from the page aligned offset and on with minlength = PAGE_SIZE
>> (as we do today).
>>
>> Benny
>>
>
> Yes, but consider....
>
> let P = PAGE_SIZE
>
> assume you have a layout from 0 to 4P, and you try to do a WRITE from
> 2P to 6P. Now if the buffer is not page aligned, one of the nfs_page
> requests will contain the file's logical 4P offset at a random
> location within the nfs_page. What happens then?
Fred, can you please provide a test that reproduces this concern
so we can verify the system behavior with and without my patch?
Benny
>
> Fred
On Tue, Mar 19, 2013 at 3:35 PM, Benny Halevy <[email protected]> wrote:
> On 2013-03-18 19:04, Fred Isaman wrote:
>> A problem case is if the layout ends within the page request, which
>> can happen due to the misalignment.
>
> First, the file layout currently supports only whole file layouts so I don't see
> how that's a problem.
> Regardless, I'd rather trim the layout segment to PAGE_SIZE boundaries
> and get a new layout from the page aligned offset and on with minlength = PAGE_SIZE
> (as we do today).
>
> Benny
>
Yes, but consider....
let P = PAGE_SIZE
assume you have a layout from 0 to 4P, and you try to do a WRITE from
2P to 6P. Now if the buffer is not page aligned, one of the nfs_page
requests will contain the file's logical 4P offset at a random
location within the nfs_page. What happens then?
Fred
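[Illustration: a toy user-space model of the scenario Fred describes, with P = PAGE_SIZE, a layout covering [0, 4P), and a WRITE of [2P, 6P) from a buffer that starts 512 bytes into its page. The per-page splitting below only approximates what the direct-I/O path does; the numbers are invented.]

#include <stdio.h>

#define P 4096UL

int main(void)
{
	unsigned long pgbase     = 512;      /* buffer misalignment within its first page */
	unsigned long pos        = 2 * P;    /* WRITE covers [2P, 6P) */
	unsigned long end        = 6 * P;
	unsigned long layout_end = 4 * P;    /* layout only covers [0, 4P) */

	while (pos < end) {
		/* the first request holds P - pgbase bytes of the first user
		 * page, later requests a full user page each */
		unsigned long len = P - pgbase;

		if (len > end - pos)
			len = end - pos;
		if (pos < layout_end && layout_end < pos + len)
			printf("request [%lu, %lu) straddles the layout end at %lu\n",
			       pos, pos + len, layout_end);
		pos += len;
		pgbase = 0;	/* subsequent user pages start at offset 0 */
	}
	return 0;
}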
On Mon, 2013-03-18 at 18:22 +0200, Benny Halevy wrote:
> On 2013-03-18 17:55, Myklebust, Trond wrote:
> > On Mon, 2013-03-18 at 16:38 +0200, Benny Halevy wrote:
> >> We're seeing roughly 20% of the I/Os going to the MDS
> >> when installing a VM over KVM in "none" caching mode (O_DIRECT).
> >> Instrumenting the client revealed that this is caused by buffer
> >> alignment vs. file offset alignment.
> >> Besides being a performance problem, when the MDS caches data
> >> this is also manifested as data corruption when data is written
> >> first via the MDS, then via the DS, eventually the stale data is
> >> read back from the MDS.
> >
> > That's why we should return the layout.
>
> We are not in this case.
Doh! I was thinking it was a case where we need to fence...
Actually, it shouldn't be needed: we will always do a _stable_ write of
the data before we try to read it back in from the server, so MDS
caching shouldn't be a problem.
> >> Note that this check exists also for the file layout specific
> >> pg_init_* functions. The objects (ORE) and block
> >> (bl_{read,write}_pagelist) layouts seem to deal correctly with
> >> splitting IOs in the case where req->wb_offset != req->wb_pgbase
> >> though this hasn't been tested when submitting this patch.
> >>
> > NACK. I see no evidence that we've addressed the issues that were raised
> > by Fred in commit 1825a0d08f22463e5a8f4b1636473efd057a3479 (NFS: prepare
> > coalesce testing for directio).
> > If you think that his concerns about the coalescing assumptions are no
> > longer true, then please point to why this is the case. AFAICR that
> > patch was added to fix corruption issues.
> >
>
> We see no problems with this patch with the workloads we're testing.
> Do you have a test that reproduces the original problem that we can try running?
I suspect it was one of the nfstests. (see
git://git.linux-nfs.org/projects/mora/nfstest.git ) since Fred was
working with Jorge to do the O_DIRECT testing.
Fred, Jorge?
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Tue, Mar 19, 2013 at 10:28 PM, Fred Isaman <[email protected]> wrote:
> On Tue, Mar 19, 2013 at 3:35 PM, Benny Halevy <[email protected]> wrote:
>> On 2013-03-18 19:04, Fred Isaman wrote:
>>> A problem case is if the layout ends within the page request, which
>>> can happen due to the misalignment.
>>
>> First, the file layout currently supports only whole file layouts so I don't see
>> how that's a problem.
>> Regardless, I'd rather trim the layout segment to PAGE_SIZE boundaries
>> and get a new layout from the page aligned offset and on with minlength = PAGE_SIZE
>> (as we do today).
>>
>> Benny
>>
>
> Yes, but consider....
>
> let P = PAGE_SIZE
>
> assume you have a layout from 0 to 4P, and you try to do a WRITE from
> 2P to 6P. Now if the buffer is not page aligned, one of the nfs_page
> requests will contain the file's logical 4P offset at a random
> location within the nfs_page. What happens then?
Since we're using one layout segment at a time, this should be handled
as a short I/O, doing the remainder with another lseg.
>
> Fred
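[Illustration: a rough user-space sketch of the "short I/O" handling Benny describes above, i.e. submit only the part of a request covered by the current lseg and leave the remainder for another lseg. The helper name and the numbers are invented; this is not the pNFS generic layer.]

#include <stdio.h>

struct io { unsigned long off, len; };

/* trim @req to the lseg range [lseg_off, lseg_off + lseg_len) and "submit"
 * it; return the leftover range that needs another lseg (len == 0 if none) */
static struct io submit_short(struct io req, unsigned long lseg_off,
			      unsigned long lseg_len)
{
	unsigned long lseg_end = lseg_off + lseg_len;
	struct io rest = { 0, 0 };

	if (req.off + req.len > lseg_end) {
		rest.off = lseg_end;
		rest.len = req.off + req.len - lseg_end;
		req.len  = lseg_end - req.off;
	}
	printf("submit [%lu, %lu) with the current lseg\n",
	       req.off, req.off + req.len);
	return rest;
}

int main(void)
{
	/* Fred's example: lseg ends at 4P, one request crosses that boundary */
	struct io rest = submit_short((struct io){ 15872, 4096 }, 0, 4 * 4096UL);

	if (rest.len)
		printf("remainder [%lu, %lu) is retried with another lseg\n",
		       rest.off, rest.off + rest.len);
	return 0;
}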
On Mon, Mar 18, 2013 at 12:22 PM, Benny Halevy <[email protected]> wrote:
>> NACK. I see no evidence that we've addressed the issues that were raised
>> by Fred in commit 1825a0d08f22463e5a8f4b1636473efd057a3479 (NFS: prepare
>> coalesce testing for directio).
>> If you think that his concerns about the coalescing assumptions are no
>> longer true, then please point to why this is the case. AFAICR that
>> patch was added to fix corruption issues.
>>
>
> We see no problems with this patch with the workloads we're testing.
> Do you have a test that reproduces the original problem that we can try running?
>
> Benny
A problem case is if the layout ends within the page request, which
can happen due to the misalignment.
Fred
On Mon, 2013-03-18 at 18:45 +0200, Benny Halevy wrote:
> On 2013-03-18 18:39, Myklebust, Trond wrote:
> > On Mon, 2013-03-18 at 18:22 +0200, Benny Halevy wrote:
> >> On 2013-03-18 17:55, Myklebust, Trond wrote:
> >>> On Mon, 2013-03-18 at 16:38 +0200, Benny Halevy wrote:
> >>>> We're seeing roughly 20% of the I/Os going to the MDS
> >>>> when installing a VM over KVM in "none" caching mode (O_DIRECT).
> >>>> Instrumenting the client revealed that this is caused by buffer
> >>>> alignment vs. file offset alignment.
> >>>> Besides being a performance problem, when the MDS caches data
> >>>> this is also manifested as data corruption when data is written
> >>>> first via the MDS, then via the DS, eventually the stale data is
> >>>> read back from the MDS.
> >>>
> >>> That's why we should return the layout.
> >>
> >> We are not in this case.
> >
> > Doh! I was thinking it was a case where we need to fence...
> >
> > Actually, it shouldn't be needed: we will always do a _stable_ write of
> > the data before we try to read it back in from the server, so MDS
> > caching shouldn't be a problem.
> >
>
> Writing stable to the MDS does not solve all cases.
> The corruption we've seen happens like this:
>
> write(A) to MDS
> write(B) to DS
> read(A) from MDS - since the MDS is caching the last data written to it.
That looks like a server bug to me. If I write the data to stable
storage in both the A and B cases above, then I expect READs from the MDS and
the DS to return the same data.
That's particularly true in the case of O_DIRECT reads and writes; the
server can't make assumptions as to whether or not the next client to
read the data will use the DS or the MDS.
Note that I'm happy to accept that our client may not be meeting the
requirements of "write to stable storage" here if, say, we're failing to
issue a LAYOUTCOMMIT after the WRITE(B). If that's the case, then we
need to fix that.
My beef is rather with the notion that _if_ the client meets the stable
storage criterion, then the MDS can somehow still lie to us.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On 2013-03-18 19:04, Fred Isaman wrote:
> On Mon, Mar 18, 2013 at 12:22 PM, Benny Halevy <[email protected]> wrote:
>
>>> NACK. I see no evidence that we've addressed the issues that were raised
>>> by Fred in commit 1825a0d08f22463e5a8f4b1636473efd057a3479 (NFS: prepare
>>> coalesce testing for directio).
>>> If you think that his concerns about the coalescing assumptions are no
>>> longer true, then please point to why this is the case. AFAICR that
>>> patch was added to fix corruption issues.
>>>
>>
>> We see no problems with this patch with the workloads we're testing.
>> Do you have a test that reproduces the original problem that we can try running?
>>
>> Benny
>
> A problem case is if the layout ends within the page request, which
> can happen due to the misalignment.
First, the file layout currently supports only whole file layouts so I don't see
how that's a problem.
Regardless, I'd rather trim the layout segment to PAGE_SIZE boundaries
and get a new layout from the page aligned offset and on with minlength = PAGE_SIZE
(as we do today).
Benny
>
> Fred
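[Illustration: a sketch of the trimming arithmetic Benny suggests above: round the end of the current lseg down to a page boundary and ask for the next layout from that page-aligned offset with minlength = PAGE_SIZE. The lseg range and the argument names are invented; this is not pnfs_update_layout.]

#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
	/* pretend the server granted an lseg that ends mid-page */
	unsigned long lseg_end = 4 * PAGE_SIZE + 1000;

	/* use the lseg only up to the last page boundary it fully covers */
	unsigned long trimmed_end = lseg_end & ~(PAGE_SIZE - 1);

	/* arguments for the LAYOUTGET covering the rest of the I/O */
	unsigned long lg_offset    = trimmed_end;
	unsigned long lg_minlength = PAGE_SIZE;

	printf("trim lseg to end at %lu, then LAYOUTGET offset=%lu minlength=%lu\n",
	       trimmed_end, lg_offset, lg_minlength);
	return 0;
}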
On 2013-03-18 18:39, Myklebust, Trond wrote:
> On Mon, 2013-03-18 at 18:22 +0200, Benny Halevy wrote:
>> On 2013-03-18 17:55, Myklebust, Trond wrote:
>>> On Mon, 2013-03-18 at 16:38 +0200, Benny Halevy wrote:
>>>> We're seeing roughly 20% of the I/Os going to the MDS
>>>> when installing a VM over KVM in "none" caching mode (O_DIRECT).
>>>> Instrumenting the client revealed that this is caused by buffer
>>>> alignment vs. file offset alignment.
>>>> Besides being a performance problem, when the MDS caches data
>>>> this is also manifested as data corruption when data is written
>>>> first via the MDS, then via the DS, eventually the stale data is
>>>> read back from the MDS.
>>>
>>> That's why we should return the layout.
>>
>> We are not in this case.
>
> Doh! I was thinking it was a case where we need to fence...
>
> Actually, it shouldn't be needed: we will always do a _stable_ write of
> the data before we try to read it back in from the server, so MDS
> caching shouldn't be a problem.
>
Writing stable to the MDS does not solve all cases.
The corruption we've seen happens like this:
write(A) to MDS
write(B) to DS
read(A) from MDS - since the MDS is caching the last data written to it.
>>>> Note that this check exists also for the file layout specific
>>>> pg_init_* functions. The objects (ORE) and block
>>>> (bl_{read,write}_pagelist) layouts seem to deal correctly with
>>>> splitting IOs in the case where req->wb_offset != req->wb_pgbase
>>>> though this hasn't been tested when submitting this patch.
>>>>
>>> NACK. I see no evidence that we've addressed the issues that were raised
>>> by Fred in commit 1825a0d08f22463e5a8f4b1636473efd057a3479 (NFS: prepare
>>> coalesce testing for directio).
>>> If you think that his concerns about the coalescing assumptions are no
>>> longer true, then please point to why this is the case. AFAICR that
>>> patch was added to fix corruption issues.
>>>
>>
>> We see no problems with this patch with the workloads we're testing.
>> Do you have a test that reproduces the original problem that we can try running?
>
> I suspect it was one of the nfstests. (see
> git://git.linux-nfs.org/projects/mora/nfstest.git ) since Fred was
> working with Jorge to do the O_DIRECT testing.
>
> Fred, Jorge?
>
>
On 2013-03-27 14:19, Benny Halevy wrote:
> On 2013-03-19 22:28, Fred Isaman wrote:
>> On Tue, Mar 19, 2013 at 3:35 PM, Benny Halevy <[email protected]> wrote:
>>> On 2013-03-18 19:04, Fred Isaman wrote:
>>>> A problem case is if the layout ends within the page request, which
>>>> can happen due to the misalignment.
>>>
>>> First, the file layout currently supports only whole file layouts so I don't see
>>> how that's a problem.
>>> Regardless, I'd rather trim the layout segment to PAGE_SIZE boundaries
>>> and get a new layout from the page aligned offset and on with minlength = PAGE_SIZE
>>> (as we do today).
>>>
>>> Benny
>>>
>>
>> Yes, but consider....
>>
>> let P = PAGE_SIZE
>>
>> assume you have a layout from 0 to 4P, and you try to do a WRITE from
>> 2P to 6P. Now if the buffer is not page aligned, one of the nfs_page
>> requests will contain the file's logical 4P offset at a random
>> location within the nfs_page. What happens then?
>
> Fred, can you please provide a test that reproduces this concern
> so we can verify the system behavior with and without my patch?
Ping...
>
> Benny
>
>>
>> Fred