2006-03-30 15:17:16

by Olaf Kirch

[permalink] [raw]
Subject: NFS directio

In recent kernels, NFS direct IO limits the size of a request to
4096 pages, ie 16M on most platforms. This causes growisofs from the
dvd+rw-tools package (http://fy.chalmers.se/~appro/linux/DVD+RW/) to fail.
Its builtin "dd" style function uses 32M buffers by default.

It puzzled me a lot that it is returning -EFBIG, which means "file too
big". At least that should be EINVAL, I think. EFBIG should be reserved
for problems related to largefile support, IMO.

We can easily work around the problem by restricting the buffer size
growisofs uses on NFS, but that feels strange. I think the right way
would be to increase MAX_DIRECTIO_SIZE, and if someone submits a bigger
request, to loop and do that in MAX_DIRECTIO_SIZE chunks.

Comments?

Thanks,
Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2006-03-30 16:04:05

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS directio

On Thu, 2006-03-30 at 17:15 +0200, Olaf Kirch wrote:
> In recent kernels, NFS direct IO limits the size of a request to
> 4096 pages, ie 16M on most platforms. This causes growisofs from the
> dvd+rw-tools package (http://fy.chalmers.se/~appro/linux/DVD+RW/) to fail.
> Its builtin "dd" style function uses 32M buffers by default.
>
> It puzzled me a lot that it is returning -EFBIG, which means "file too
> big". At least that should be EINVAL, I think. EFBIG should be reserved
> for problems related to largefile support, IMO.
>
> We can easily work around the problem by restricting the buffer size
> growisofs uses on NFS, but that feels strange. I think the right way
> would be to increase MAX_DIRECTIO_SIZE, and if someone submits a bigger
> request, to loop and do that in MAX_DIRECTIO_SIZE chunks.
>
> Comments?

MAX_DIRECTIO_SIZE is gone from Linus' tree. I don't think we have any
limits on the size of the buffer that users can feed to us apart from
those set by mm/filemap.c:generic_write_checks().

Cheers,
Trond



-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-03-30 17:27:36

by Chuck Lever

[permalink] [raw]
Subject: Re: NFS directio

Trond Myklebust wrote:
> On Thu, 2006-03-30 at 17:15 +0200, Olaf Kirch wrote:
>> In recent kernels, NFS direct IO limits the size of a request to
>> 4096 pages, ie 16M on most platforms. This causes growisofs from the
>> dvd+rw-tools package (http://fy.chalmers.se/~appro/linux/DVD+RW/) to fail.
>> Its builtin "dd" style function uses 32M buffers by default.
>>
>> It puzzled me a lot that it is returning -EFBIG, which means "file too
>> big". At least that should be EINVAL, I think. EFBIG should be reserved
>> for problems related to largefile support, IMO.
>>
>> We can easily work around the problem by restricting the buffer size
>> growisofs uses on NFS, but that feels strange. I think the right way
>> would be to increase MAX_DIRECTIO_SIZE, and if someone submits a bigger
>> request, to loop and do that in MAX_DIRECTIO_SIZE chunks.
>>
>> Comments?
>
> MAX_DIRECTIO_SIZE is gone from Linus' tree. I don't think we have any
> limits on the size of the buffer that users can feed to us apart from
> those set by mm/filemap.c:generic_write_checks().

that's correct. this was just an arbitrary limit to prevent type
overflows (namely the atomic_t's were only 24 bits on some platforms).

--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-03-31 07:49:07

by Olaf Kirch

[permalink] [raw]
Subject: Re: NFS directio

On Thu, Mar 30, 2006 at 11:03:32AM -0500, Trond Myklebust wrote:
> MAX_DIRECTIO_SIZE is gone from Linus' tree. I don't think we have any
> limits on the size of the buffer that users can feed to us apart from
> those set by mm/filemap.c:generic_write_checks().

So you're saying we can safely remove this check in 2.6.16?

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-03-31 14:35:54

by Chuck Lever

[permalink] [raw]
Subject: Re: NFS directio


On Mar 31, 2006, at 2:49 AM, Olaf Kirch wrote:

> On Thu, Mar 30, 2006 at 11:03:32AM -0500, Trond Myklebust wrote:
>> MAX_DIRECTIO_SIZE is gone from Linus' tree. I don't think we have any
>> limits on the size of the buffer that users can feed to us apart from
>> those set by mm/filemap.c:generic_write_checks().
>
> So you're saying we can safely remove this check in 2.6.16?

the check isn't in 2.6.16. it was removed sometime after 2.6.5.

--
corporate: < cel at netapp dot com >
personal: < chucklever at bigfoot dot com >





-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-03-31 14:58:54

by Olaf Kirch

[permalink] [raw]
Subject: Re: NFS directio

On Fri, Mar 31, 2006 at 09:35:34AM -0500, Chuck Lever wrote:
> the check isn't in 2.6.16. it was removed sometime after 2.6.5.

It is still in the 2.6.16 tree I'm looking at; else I wouldn't ask :)

linux/fs/nfs/direct.c has this:

#define MAX_DIRECTIO_SIZE (4096UL << PAGE_SHIFT)
...
static inline int
nfs_get_user_pages(int rw, unsigned long user_addr, size_t size,
struct page ***pages)
{
int result = -ENOMEM;
unsigned long page_count;
size_t array_size;

/* set an arbitrary limit to prevent type overflow */
/* XXX: this can probably be as large as INT_MAX */
if (size > MAX_DIRECTIO_SIZE) {
*pages = NULL;
return -EFBIG;
}




Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-03-31 15:50:40

by Chuck Lever

[permalink] [raw]
Subject: Re: NFS directio

Olaf Kirch wrote:
> On Fri, Mar 31, 2006 at 09:35:34AM -0500, Chuck Lever wrote:
>> the check isn't in 2.6.16. it was removed sometime after 2.6.5.
>
> It is still in the 2.6.16 tree I'm looking at; else I wouldn't ask :)

it's been in my trees since 2.6.13 or even earlier, my mistake.

that change is part of the aio+dio patches that were just included in
2.6.17-rc1. instead of creating a single patch for this change, you
should consider taking those patches, since they were tested as a unit.

if you can guarantee that atomic_t is 32-bits on every platform you
support, then it should be save to change that #define to 2^31.
otherwise, the work to eliminate the limit entirely has already been
done by the above-mentioned patches.

--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-04-09 12:40:48

by NeilBrown

[permalink] [raw]
Subject: Re: NFS directio

On Friday March 31, [email protected] wrote:
> Olaf Kirch wrote:
> > On Fri, Mar 31, 2006 at 09:35:34AM -0500, Chuck Lever wrote:
> >> the check isn't in 2.6.16. it was removed sometime after 2.6.5.
> >
> > It is still in the 2.6.16 tree I'm looking at; else I wouldn't ask :)
>
> it's been in my trees since 2.6.13 or even earlier, my mistake.
>
> that change is part of the aio+dio patches that were just included in
> 2.6.17-rc1. instead of creating a single patch for this change, you
> should consider taking those patches, since they were tested as a unit.
>
> if you can guarantee that atomic_t is 32-bits on every platform you
> support, then it should be save to change that #define to 2^31.
> otherwise, the work to eliminate the limit entirely has already been
> done by the above-mentioned patches.

(Coming into the conversation a bit late....)

What about the kmalloc in nfs_get_user_pages:

array_size = (page_count * sizeof(struct page *));
*pages = kmalloc(array_size, GFP_KERNEL);

With a page_count of 1024, this allocates one page (on 32bit) which is
easy.
With a page_count of 4096 (the previous MAX_DIRECTIO_SIZE)), this
allocates 4 consecutive pages, which won't always succeed.

If you want to go higher than that (which was the point of the start
of this thread) then you need a large-order allocation which doesn't
(in my understanding) have a good chance of success due to
fragmentation.

So I guess my question is: how hard would it be to use a more scalable
data structure so that very large IO sizes would be reliably
practical?

NeilBrown



-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-04-09 22:09:28

by Chuck Lever

[permalink] [raw]
Subject: Re: NFS directio

Neil Brown wrote:
> On Friday March 31, [email protected] wrote:
>> Olaf Kirch wrote:
>>> On Fri, Mar 31, 2006 at 09:35:34AM -0500, Chuck Lever wrote:
>>>> the check isn't in 2.6.16. it was removed sometime after 2.6.5.
>>> It is still in the 2.6.16 tree I'm looking at; else I wouldn't ask :)
>> it's been in my trees since 2.6.13 or even earlier, my mistake.
>>
>> that change is part of the aio+dio patches that were just included in
>> 2.6.17-rc1. instead of creating a single patch for this change, you
>> should consider taking those patches, since they were tested as a unit.
>>
>> if you can guarantee that atomic_t is 32-bits on every platform you
>> support, then it should be save to change that #define to 2^31.
>> otherwise, the work to eliminate the limit entirely has already been
>> done by the above-mentioned patches.
>
> (Coming into the conversation a bit late....)
>
> What about the kmalloc in nfs_get_user_pages:
>
> array_size = (page_count * sizeof(struct page *));
> *pages = kmalloc(array_size, GFP_KERNEL);
>
> With a page_count of 1024, this allocates one page (on 32bit) which is
> easy.
> With a page_count of 4096 (the previous MAX_DIRECTIO_SIZE)), this
> allocates 4 consecutive pages, which won't always succeed.
>
> If you want to go higher than that (which was the point of the start
> of this thread) then you need a large-order allocation which doesn't
> (in my understanding) have a good chance of success due to
> fragmentation.
>
> So I guess my question is: how hard would it be to use a more scalable
> data structure so that very large IO sizes would be reliably
> practical?

howdy neil-

usually I/O is broken up into smaller chunks by the time it gets down to
this level, so it's never been much of an issue. it's pretty
challenging to generate a test case for extremely large I/O sizes (for
example, the size of the entire process address space).

and until now, there really hasn't been much call for doing NFS O_DIRECT
with very large requests. it's been a matter of meeting the
requirements of database I/O, which is generally 4KB to 16KB for data
files, and about a megabyte for log writes.

at this point we don't really have a test case and a use case that
reliably breaks this, so it hasn't been a priority to address this.

the structure of this code was adapted (ie stolen) from other parts of
the kernel that also employ get_user_pages. you can probably take a
look at other places that employ get_user_pages(), and see how they've
since tackled the issue.

--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-04-10 04:20:56

by NeilBrown

[permalink] [raw]
Subject: Re: NFS directio

On Sunday April 9, [email protected] wrote:
>
> howdy neil-

G'Day :-)

>
> usually I/O is broken up into smaller chunks by the time it gets down to
> this level, so it's never been much of an issue. it's pretty
> challenging to generate a test case for extremely large I/O sizes (for
> example, the size of the entire process address space).

I don't think direct_io requests get broken up at all...

I think the case in point was a DVD burner trying to burn an image
that lived on a NFS filesystem. It tried to direct-read a reasonable
fraction of the whole dvd (100Meg?) and had problems.

>
> and until now, there really hasn't been much call for doing NFS O_DIRECT
> with very large requests. it's been a matter of meeting the
> requirements of database I/O, which is generally 4KB to 16KB for data
> files, and about a megabyte for log writes.
>
> at this point we don't really have a test case and a use case that
> reliably breaks this, so it hasn't been a priority to address this.
>
> the structure of this code was adapted (ie stolen) from other parts of
> the kernel that also employ get_user_pages. you can probably take a
> look at other places that employ get_user_pages(), and see how they've
> since tackled the issue.

Looking at fs/direct-io.c, it called get_user_pages is blocks of
at-most 64 pages. I suspect a similar thing would be possible for
nfs. I might have a look one day....

Thanks,
NeilBrown


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-04-10 10:55:11

by Olaf Kirch

[permalink] [raw]
Subject: Re: NFS directio

On Mon, Apr 10, 2006 at 02:20:26PM +1000, Neil Brown wrote:
> I think the case in point was a DVD burner trying to burn an image
> that lived on a NFS filesystem. It tried to direct-read a reasonable
> fraction of the whole dvd (100Meg?) and had problems.

Actually they use chunks of 32Meg, but otherwise accurate.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-04-10 17:36:11

by Chuck Lever

[permalink] [raw]
Subject: Re: NFS directio

Neil Brown wrote:
> I think the case in point was a DVD burner trying to burn an image
> that lived on a NFS filesystem. It tried to direct-read a reasonable
> fraction of the whole dvd (100Meg?) and had problems.

wow. first i've heard of this. was it ever reported to [email protected]?

if you are sure that the problems were directly related to the pagevec
size logic, then i will get this fixed asap in mainline.

>> the structure of this code was adapted (ie stolen) from other parts of
>> the kernel that also employ get_user_pages. you can probably take a
>> look at other places that employ get_user_pages(), and see how they've
>> since tackled the issue.
>
> Looking at fs/direct-io.c, it called get_user_pages is blocks of
> at-most 64 pages. I suspect a similar thing would be possible for
> nfs.

yep.

we get only a single iocb from the caller. that may complicate things a
little, especially in the aio case.

--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-04-11 00:12:44

by NeilBrown

[permalink] [raw]
Subject: Re: NFS directio

On Monday April 10, [email protected] wrote:
> Neil Brown wrote:
> > I think the case in point was a DVD burner trying to burn an image
> > that lived on a NFS filesystem. It tried to direct-read a reasonable
> > fraction of the whole dvd (100Meg?) and had problems.
>
> wow. first i've heard of this. was it ever reported to [email protected]?

No, it was an internal suse thing, and as Olaf said, it was only 32Meg.

>
> if you are sure that the problems were directly related to the pagevec
> size logic, then i will get this fixed asap in mainline.

The old 16Meg limit (now gone in mainline, or -mm at least) was the
problem. I don't know that kmalloc failure has actually been a
problem, but when I see a high-order kmalloc, I get all worried.

So I think it is a "should be fixed sometime", but not necessarily a
"must be fixed now" issue.

Thanks,
NeilBrown



-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-04-11 00:47:45

by Chuck Lever

[permalink] [raw]
Subject: Re: NFS directio

Neil Brown wrote:
> On Monday April 10, [email protected] wrote:
>> Neil Brown wrote:
>>> I think the case in point was a DVD burner trying to burn an image
>>> that lived on a NFS filesystem. It tried to direct-read a reasonable
>>> fraction of the whole dvd (100Meg?) and had problems.
>> wow. first i've heard of this. was it ever reported to [email protected]?
>
> No, it was an internal suse thing, and as Olaf said, it was only 32Meg.
>
>> if you are sure that the problems were directly related to the pagevec
>> size logic, then i will get this fixed asap in mainline.
>
> The old 16Meg limit (now gone in mainline, or -mm at least) was the
> problem. I don't know that kmalloc failure has actually been a
> problem, but when I see a high-order kmalloc, I get all worried.
>
> So I think it is a "should be fixed sometime", but not necessarily a
> "must be fixed now" issue.

Thanks for clarifying.

This is something I will plan to include in the vectored I/O support for
NFS O_DIRECT in the next couple of months.

In the meantime, we have some good tests for NFS O_DIRECT and aio+dio
that involve smaller I/O sizes, but nothing for stretching it with
larger I/O request sizes. Until now I had thought we were covering the
important use cases. If you have any other tests that will exercise
O_DIRECT in this way, I'm happy to include them in my development testing.

--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2006-04-11 09:16:04

by Olaf Kirch

[permalink] [raw]
Subject: Re: NFS directio

On Tue, Apr 11, 2006 at 10:12:11AM +1000, Neil Brown wrote:
> > wow. first i've heard of this. was it ever reported to [email protected]?
>
> No, it was an internal suse thing

No, it's from http://fy.chalmers.se/~appro/linux/DVD+RW/
The tool is growisofs, and has a builtin dd replacement
that wants to do this.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs