2023-07-26 12:20:54

by Jaco Kroon

[permalink] [raw]
Subject: [PATCH] fuse: enable larger read buffers for readdir.

Signed-off-by: Jaco Kroon <[email protected]>
---
fs/fuse/Kconfig | 16 ++++++++++++++++
fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------
2 files changed, 40 insertions(+), 18 deletions(-)

diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index 038ed0b9aaa5..0783f9ee5cd3 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -18,6 +18,22 @@ config FUSE_FS
If you want to develop a userspace FS, or if you want to use
a filesystem based on FUSE, answer Y or M.

+config FUSE_READDIR_ORDER
+ int
+ range 0 5
+ default 5
+ help
+ readdir performance varies greatly depending on the size of the read.
+ Larger buffers results in larger reads, thus fewer reads and higher
+ performance in return.
+
+ You may want to reduce this value on seriously constrained memory
+ systems where 128KiB (assuming 4KiB pages) cache pages is not ideal.
+
+ This value reprents the order of the number of pages to allocate (ie,
+ the shift value). A value of 0 is thus 1 page (4KiB) where 5 is 32
+ pages (128KiB).
+
config CUSE
tristate "Character device in Userspace support"
depends on FUSE_FS
diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index dc603479b30e..98c62b623240 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -13,6 +13,12 @@
#include <linux/pagemap.h>
#include <linux/highmem.h>

+#define READDIR_PAGES_ORDER CONFIG_FUSE_READDIR_ORDER
+#define READDIR_PAGES (1 << READDIR_PAGES_ORDER)
+#define READDIR_PAGES_SIZE (PAGE_SIZE << READDIR_PAGES_ORDER)
+#define READDIR_PAGES_MASK (READDIR_PAGES_SIZE - 1)
+#define READDIR_PAGES_SHIFT (PAGE_SHIFT + READDIR_PAGES_ORDER)
+
static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
{
struct fuse_conn *fc = get_fuse_conn(dir);
@@ -52,10 +58,10 @@ static void fuse_add_dirent_to_cache(struct file *file,
}
version = fi->rdc.version;
size = fi->rdc.size;
- offset = size & ~PAGE_MASK;
- index = size >> PAGE_SHIFT;
+ offset = size & ~READDIR_PAGES_MASK;
+ index = size >> READDIR_PAGES_SHIFT;
/* Dirent doesn't fit in current page? Jump to next page. */
- if (offset + reclen > PAGE_SIZE) {
+ if (offset + reclen > READDIR_PAGES_SIZE) {
index++;
offset = 0;
}
@@ -83,7 +89,7 @@ static void fuse_add_dirent_to_cache(struct file *file,
}
memcpy(addr + offset, dirent, reclen);
kunmap_local(addr);
- fi->rdc.size = (index << PAGE_SHIFT) + offset + reclen;
+ fi->rdc.size = (index << READDIR_PAGES_SHIFT) + offset + reclen;
fi->rdc.pos = dirent->off;
unlock:
spin_unlock(&fi->rdc.lock);
@@ -104,7 +110,7 @@ static void fuse_readdir_cache_end(struct file *file, loff_t pos)
}

fi->rdc.cached = true;
- end = ALIGN(fi->rdc.size, PAGE_SIZE);
+ end = ALIGN(fi->rdc.size, READDIR_PAGES_SIZE);
spin_unlock(&fi->rdc.lock);

/* truncate unused tail of cache */
@@ -328,25 +334,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
struct fuse_mount *fm = get_fuse_mount(inode);
struct fuse_io_args ia = {};
struct fuse_args_pages *ap = &ia.ap;
- struct fuse_page_desc desc = { .length = PAGE_SIZE };
+ struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE };
u64 attr_version = 0;
bool locked;

- page = alloc_page(GFP_KERNEL);
+ page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);
if (!page)
return -ENOMEM;

plus = fuse_use_readdirplus(inode, ctx);
ap->args.out_pages = true;
- ap->num_pages = 1;
+ ap->num_pages = READDIR_PAGES;
ap->pages = &page;
ap->descs = &desc;
if (plus) {
attr_version = fuse_get_attr_version(fm->fc);
- fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
+ fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE,
FUSE_READDIRPLUS);
} else {
- fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
+ fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE,
FUSE_READDIR);
}
locked = fuse_lock_inode(inode);
@@ -383,7 +389,7 @@ static enum fuse_parse_result fuse_parse_cache(struct fuse_file *ff,
void *addr, unsigned int size,
struct dir_context *ctx)
{
- unsigned int offset = ff->readdir.cache_off & ~PAGE_MASK;
+ unsigned int offset = ff->readdir.cache_off & ~READDIR_PAGES_MASK;
enum fuse_parse_result res = FOUND_NONE;

WARN_ON(offset >= size);
@@ -504,16 +510,16 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx)

WARN_ON(fi->rdc.size < ff->readdir.cache_off);

- index = ff->readdir.cache_off >> PAGE_SHIFT;
+ index = ff->readdir.cache_off >> READDIR_PAGES_SHIFT;

- if (index == (fi->rdc.size >> PAGE_SHIFT))
- size = fi->rdc.size & ~PAGE_MASK;
+ if (index == (fi->rdc.size >> READDIR_PAGES_SHIFT))
+ size = fi->rdc.size & ~READDIR_PAGES_MASK;
else
- size = PAGE_SIZE;
+ size = READDIR_PAGES_SIZE;
spin_unlock(&fi->rdc.lock);

/* EOF? */
- if ((ff->readdir.cache_off & ~PAGE_MASK) == size)
+ if ((ff->readdir.cache_off & ~READDIR_PAGES_MASK) == size)
return 0;

page = find_get_page_flags(file->f_mapping, index,
@@ -559,9 +565,9 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx)
if (res == FOUND_ALL)
return 0;

- if (size == PAGE_SIZE) {
+ if (size == READDIR_PAGES_SIZE) {
/* We hit end of page: skip to next page. */
- ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, PAGE_SIZE);
+ ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, READDIR_PAGES_SIZE);
goto retry;
}

--
2.41.0



2023-07-26 12:56:31

by Jaco Kroon

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir.

Hi,

Just to give some context, we've got maildir folders running on top of
glusterfs.  Without this a "medium" sized folder can take several
minutes to go through readdir (getdents64) and each system call will
return 14-18 entries at a time.  These calls (as measures using strace
-T -p) takes anywhere from around 150 micro-seconds to several
milli-seconds.  Inter-server round-trip time (as measured with ping) is
generally 100-120 micro-seconds so 150 µs is probably a good lower limit.

I've tested the below patch from a more remote location (7-8ms ping
round-trip) and in spite of this massively increased latency a readdir
iteration over a small folder (a few hundred entries) was still on par
with the local case, usually even marginally better (10%), where on
larger folders the difference became much more pronounced, with
performance for the remote location at times being drastically better
than for the "close" location.

This has a few benefits overall that I see:

1.  This enables glusterfs to internally read and process larger
chunks.  Whether this is happening I don't know (I have a separate
ongoing discussion with the developers on that side to see what can be
done inside of glusterfs itself to improve the mechanisms on the other
side of fuse).

2.  Fewer overall system calls (by an order of up to 32 fewer
getdents64()) calls for userspace to get a full directory listing, in
cases where there's 100k+ files in a folder this makes a HUGE difference
(14-18 entries vs ~500 entries per call, so >5000 calls vs 200).

3.  Fewer fuse messages being passed over the fuse link.

One concerns I have is that I don't understand all of the caching and
stuff going on, and I'm typically as per strace seeing less than 64KB of
data being returned to userspace, so I'm not sure there is a direct
correlation between the FUSE readdir size and that from glibc to kernel,
and I'm slightly worried that the caching may now be broken due to
smaller than READDIR_PAGES_SIZE records being cached, with that said,
successive readdir() iterations from userspace provided identical
results so I *think* this should be OK.  Someone with a better
understanding should please confirm.

I made the option configurable (but max it out by default) in case
memory constrained systems may need to drop the 128KiB value.

I suspect the discrepancy may also relate to the way in which glusterfs
merges directory listings from the underlying bricks where data is
actually stored.

Without this patch the remote system is orders of magnitude slower that
the close together systems.

Kind regards,
Jaco


On 2023/07/26 12:59, Jaco Kroon wrote:
> Signed-off-by: Jaco Kroon <[email protected]>
> ---
> fs/fuse/Kconfig | 16 ++++++++++++++++
> fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------
> 2 files changed, 40 insertions(+), 18 deletions(-)
>
> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> index 038ed0b9aaa5..0783f9ee5cd3 100644
> --- a/fs/fuse/Kconfig
> +++ b/fs/fuse/Kconfig
> @@ -18,6 +18,22 @@ config FUSE_FS
> If you want to develop a userspace FS, or if you want to use
> a filesystem based on FUSE, answer Y or M.
>
> +config FUSE_READDIR_ORDER
> + int
> + range 0 5
> + default 5
> + help
> + readdir performance varies greatly depending on the size of the read.
> + Larger buffers results in larger reads, thus fewer reads and higher
> + performance in return.
> +
> + You may want to reduce this value on seriously constrained memory
> + systems where 128KiB (assuming 4KiB pages) cache pages is not ideal.
> +
> + This value reprents the order of the number of pages to allocate (ie,
> + the shift value). A value of 0 is thus 1 page (4KiB) where 5 is 32
> + pages (128KiB).
> +
> config CUSE
> tristate "Character device in Userspace support"
> depends on FUSE_FS
> diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
> index dc603479b30e..98c62b623240 100644
> --- a/fs/fuse/readdir.c
> +++ b/fs/fuse/readdir.c
> @@ -13,6 +13,12 @@
> #include <linux/pagemap.h>
> #include <linux/highmem.h>
>
> +#define READDIR_PAGES_ORDER CONFIG_FUSE_READDIR_ORDER
> +#define READDIR_PAGES (1 << READDIR_PAGES_ORDER)
> +#define READDIR_PAGES_SIZE (PAGE_SIZE << READDIR_PAGES_ORDER)
> +#define READDIR_PAGES_MASK (READDIR_PAGES_SIZE - 1)
> +#define READDIR_PAGES_SHIFT (PAGE_SHIFT + READDIR_PAGES_ORDER)
> +
> static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
> {
> struct fuse_conn *fc = get_fuse_conn(dir);
> @@ -52,10 +58,10 @@ static void fuse_add_dirent_to_cache(struct file *file,
> }
> version = fi->rdc.version;
> size = fi->rdc.size;
> - offset = size & ~PAGE_MASK;
> - index = size >> PAGE_SHIFT;
> + offset = size & ~READDIR_PAGES_MASK;
> + index = size >> READDIR_PAGES_SHIFT;
> /* Dirent doesn't fit in current page? Jump to next page. */
> - if (offset + reclen > PAGE_SIZE) {
> + if (offset + reclen > READDIR_PAGES_SIZE) {
> index++;
> offset = 0;
> }
> @@ -83,7 +89,7 @@ static void fuse_add_dirent_to_cache(struct file *file,
> }
> memcpy(addr + offset, dirent, reclen);
> kunmap_local(addr);
> - fi->rdc.size = (index << PAGE_SHIFT) + offset + reclen;
> + fi->rdc.size = (index << READDIR_PAGES_SHIFT) + offset + reclen;
> fi->rdc.pos = dirent->off;
> unlock:
> spin_unlock(&fi->rdc.lock);
> @@ -104,7 +110,7 @@ static void fuse_readdir_cache_end(struct file *file, loff_t pos)
> }
>
> fi->rdc.cached = true;
> - end = ALIGN(fi->rdc.size, PAGE_SIZE);
> + end = ALIGN(fi->rdc.size, READDIR_PAGES_SIZE);
> spin_unlock(&fi->rdc.lock);
>
> /* truncate unused tail of cache */
> @@ -328,25 +334,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
> struct fuse_mount *fm = get_fuse_mount(inode);
> struct fuse_io_args ia = {};
> struct fuse_args_pages *ap = &ia.ap;
> - struct fuse_page_desc desc = { .length = PAGE_SIZE };
> + struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE };
> u64 attr_version = 0;
> bool locked;
>
> - page = alloc_page(GFP_KERNEL);
> + page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);
> if (!page)
> return -ENOMEM;
>
> plus = fuse_use_readdirplus(inode, ctx);
> ap->args.out_pages = true;
> - ap->num_pages = 1;
> + ap->num_pages = READDIR_PAGES;
> ap->pages = &page;
> ap->descs = &desc;
> if (plus) {
> attr_version = fuse_get_attr_version(fm->fc);
> - fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
> + fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE,
> FUSE_READDIRPLUS);
> } else {
> - fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
> + fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE,
> FUSE_READDIR);
> }
> locked = fuse_lock_inode(inode);
> @@ -383,7 +389,7 @@ static enum fuse_parse_result fuse_parse_cache(struct fuse_file *ff,
> void *addr, unsigned int size,
> struct dir_context *ctx)
> {
> - unsigned int offset = ff->readdir.cache_off & ~PAGE_MASK;
> + unsigned int offset = ff->readdir.cache_off & ~READDIR_PAGES_MASK;
> enum fuse_parse_result res = FOUND_NONE;
>
> WARN_ON(offset >= size);
> @@ -504,16 +510,16 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx)
>
> WARN_ON(fi->rdc.size < ff->readdir.cache_off);
>
> - index = ff->readdir.cache_off >> PAGE_SHIFT;
> + index = ff->readdir.cache_off >> READDIR_PAGES_SHIFT;
>
> - if (index == (fi->rdc.size >> PAGE_SHIFT))
> - size = fi->rdc.size & ~PAGE_MASK;
> + if (index == (fi->rdc.size >> READDIR_PAGES_SHIFT))
> + size = fi->rdc.size & ~READDIR_PAGES_MASK;
> else
> - size = PAGE_SIZE;
> + size = READDIR_PAGES_SIZE;
> spin_unlock(&fi->rdc.lock);
>
> /* EOF? */
> - if ((ff->readdir.cache_off & ~PAGE_MASK) == size)
> + if ((ff->readdir.cache_off & ~READDIR_PAGES_MASK) == size)
> return 0;
>
> page = find_get_page_flags(file->f_mapping, index,
> @@ -559,9 +565,9 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx)
> if (res == FOUND_ALL)
> return 0;
>
> - if (size == PAGE_SIZE) {
> + if (size == READDIR_PAGES_SIZE) {
> /* We hit end of page: skip to next page. */
> - ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, PAGE_SIZE);
> + ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, READDIR_PAGES_SIZE);
> goto retry;
> }
>

2023-07-26 15:56:10

by Jaco Kroon

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir.

Hi,

On 2023/07/26 15:53, Bernd Schubert wrote:
>
>
> On 7/26/23 12:59, Jaco Kroon wrote:
>> Signed-off-by: Jaco Kroon <[email protected]>
>> ---
>>   fs/fuse/Kconfig   | 16 ++++++++++++++++
>>   fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------
>>   2 files changed, 40 insertions(+), 18 deletions(-)
>>
>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
>> index 038ed0b9aaa5..0783f9ee5cd3 100644
>> --- a/fs/fuse/Kconfig
>> +++ b/fs/fuse/Kconfig
>> @@ -18,6 +18,22 @@ config FUSE_FS
>>         If you want to develop a userspace FS, or if you want to use
>>         a filesystem based on FUSE, answer Y or M.
>>   +config FUSE_READDIR_ORDER
>> +    int
>> +    range 0 5
>> +    default 5
>> +    help
>> +        readdir performance varies greatly depending on the size of
>> the read.
>> +        Larger buffers results in larger reads, thus fewer reads and
>> higher
>> +        performance in return.
>> +
>> +        You may want to reduce this value on seriously constrained
>> memory
>> +        systems where 128KiB (assuming 4KiB pages) cache pages is
>> not ideal.
>> +
>> +        This value reprents the order of the number of pages to
>> allocate (ie,
>> +        the shift value).  A value of 0 is thus 1 page (4KiB) where
>> 5 is 32
>> +        pages (128KiB).
>> +
>
> I like the idea of a larger readdir size, but shouldn't that be a
> server/daemon/library decision which size to use, instead of kernel
> compile time? So should be part of FUSE_INIT negotiation?

Yes sure, but there still needs to be a default.  And one page at a time
doesn't cut it.

-- snip --

>>   -    page = alloc_page(GFP_KERNEL);
>> +    page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);
>
> I guess that should become folio alloc(), one way or the other. Now I
> think order 0 was chosen before to avoid risk of allocation failure. I
> guess it might work to try a large size and to fall back to 0 when
> that failed. Or fail back to the slower vmalloc.

If this varies then a bunch of other code will become somewhat more
complex, especially if one alloc succeeds, and then a follow-up succeeds.

I'm not familiar with the differences between the different mechanisms
available for allocation.

-- snip --

> Thanks,
My pleasure,
Jaco

2023-07-26 15:58:17

by Bernd Schubert

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir.



On 7/26/23 17:26, Jaco Kroon wrote:
> Hi,
>
> On 2023/07/26 15:53, Bernd Schubert wrote:
>>
>>
>> On 7/26/23 12:59, Jaco Kroon wrote:
>>> Signed-off-by: Jaco Kroon <[email protected]>
>>> ---
>>>   fs/fuse/Kconfig   | 16 ++++++++++++++++
>>>   fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------
>>>   2 files changed, 40 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
>>> index 038ed0b9aaa5..0783f9ee5cd3 100644
>>> --- a/fs/fuse/Kconfig
>>> +++ b/fs/fuse/Kconfig
>>> @@ -18,6 +18,22 @@ config FUSE_FS
>>>         If you want to develop a userspace FS, or if you want to use
>>>         a filesystem based on FUSE, answer Y or M.
>>>   +config FUSE_READDIR_ORDER
>>> +    int
>>> +    range 0 5
>>> +    default 5
>>> +    help
>>> +        readdir performance varies greatly depending on the size of
>>> the read.
>>> +        Larger buffers results in larger reads, thus fewer reads and
>>> higher
>>> +        performance in return.
>>> +
>>> +        You may want to reduce this value on seriously constrained
>>> memory
>>> +        systems where 128KiB (assuming 4KiB pages) cache pages is
>>> not ideal.
>>> +
>>> +        This value reprents the order of the number of pages to
>>> allocate (ie,
>>> +        the shift value).  A value of 0 is thus 1 page (4KiB) where
>>> 5 is 32
>>> +        pages (128KiB).
>>> +
>>
>> I like the idea of a larger readdir size, but shouldn't that be a
>> server/daemon/library decision which size to use, instead of kernel
>> compile time? So should be part of FUSE_INIT negotiation?
>
> Yes sure, but there still needs to be a default.  And one page at a time
> doesn't cut it.

With FUSE_INIT userspace would make that decision, based on what kernel
fuse suggests? process_init_reply() already handles other limits - I
don't see why readdir max has to be compile time option. Maybe a module
option to set the limit?

Thanks,
Bernd

2023-07-26 15:58:26

by Bernd Schubert

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir.



On 7/26/23 17:26, Jaco Kroon wrote:
> Hi,
>
> On 2023/07/26 15:53, Bernd Schubert wrote:
>>
>>
>> On 7/26/23 12:59, Jaco Kroon wrote:
>>> Signed-off-by: Jaco Kroon <[email protected]>
>>> ---
>>>   fs/fuse/Kconfig   | 16 ++++++++++++++++
>>>   fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------
>>>   2 files changed, 40 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
>>> index 038ed0b9aaa5..0783f9ee5cd3 100644
>>> --- a/fs/fuse/Kconfig
>>> +++ b/fs/fuse/Kconfig
>>> @@ -18,6 +18,22 @@ config FUSE_FS
>>>         If you want to develop a userspace FS, or if you want to use
>>>         a filesystem based on FUSE, answer Y or M.
>>>   +config FUSE_READDIR_ORDER
>>> +    int
>>> +    range 0 5
>>> +    default 5
>>> +    help
>>> +        readdir performance varies greatly depending on the size of
>>> the read.
>>> +        Larger buffers results in larger reads, thus fewer reads and
>>> higher
>>> +        performance in return.
>>> +
>>> +        You may want to reduce this value on seriously constrained
>>> memory
>>> +        systems where 128KiB (assuming 4KiB pages) cache pages is
>>> not ideal.
>>> +
>>> +        This value reprents the order of the number of pages to
>>> allocate (ie,
>>> +        the shift value).  A value of 0 is thus 1 page (4KiB) where
>>> 5 is 32
>>> +        pages (128KiB).
>>> +
>>
>> I like the idea of a larger readdir size, but shouldn't that be a
>> server/daemon/library decision which size to use, instead of kernel
>> compile time? So should be part of FUSE_INIT negotiation?
>
> Yes sure, but there still needs to be a default.  And one page at a time
> doesn't cut it.
>
> -- snip --
>
>>>   -    page = alloc_page(GFP_KERNEL);
>>> +    page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);
>>
>> I guess that should become folio alloc(), one way or the other. Now I
>> think order 0 was chosen before to avoid risk of allocation failure. I
>> guess it might work to try a large size and to fall back to 0 when
>> that failed. Or fail back to the slower vmalloc.
>
> If this varies then a bunch of other code will become somewhat more
> complex, especially if one alloc succeeds, and then a follow-up succeeds.

Yeah, the better choice is kvmalloc/kvfree which handles it internally.

>
> I'm not familiar with the differences between the different mechanisms
> available for allocation.
>
> -- snip --
>
>> Thanks,
> My pleasure,
> Jaco

2023-07-26 16:12:40

by Bernd Schubert

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir.



On 7/26/23 12:59, Jaco Kroon wrote:
> Signed-off-by: Jaco Kroon <[email protected]>
> ---
> fs/fuse/Kconfig | 16 ++++++++++++++++
> fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------
> 2 files changed, 40 insertions(+), 18 deletions(-)
>
> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> index 038ed0b9aaa5..0783f9ee5cd3 100644
> --- a/fs/fuse/Kconfig
> +++ b/fs/fuse/Kconfig
> @@ -18,6 +18,22 @@ config FUSE_FS
> If you want to develop a userspace FS, or if you want to use
> a filesystem based on FUSE, answer Y or M.
>
> +config FUSE_READDIR_ORDER
> + int
> + range 0 5
> + default 5
> + help
> + readdir performance varies greatly depending on the size of the read.
> + Larger buffers results in larger reads, thus fewer reads and higher
> + performance in return.
> +
> + You may want to reduce this value on seriously constrained memory
> + systems where 128KiB (assuming 4KiB pages) cache pages is not ideal.
> +
> + This value reprents the order of the number of pages to allocate (ie,
> + the shift value). A value of 0 is thus 1 page (4KiB) where 5 is 32
> + pages (128KiB).
> +

I like the idea of a larger readdir size, but shouldn't that be a
server/daemon/library decision which size to use, instead of kernel
compile time? So should be part of FUSE_INIT negotiation?

> config CUSE
> tristate "Character device in Userspace support"
> depends on FUSE_FS
> diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
> index dc603479b30e..98c62b623240 100644
> --- a/fs/fuse/readdir.c
> +++ b/fs/fuse/readdir.c
> @@ -13,6 +13,12 @@
> #include <linux/pagemap.h>
> #include <linux/highmem.h>
>
> +#define READDIR_PAGES_ORDER CONFIG_FUSE_READDIR_ORDER
> +#define READDIR_PAGES (1 << READDIR_PAGES_ORDER)
> +#define READDIR_PAGES_SIZE (PAGE_SIZE << READDIR_PAGES_ORDER)
> +#define READDIR_PAGES_MASK (READDIR_PAGES_SIZE - 1)
> +#define READDIR_PAGES_SHIFT (PAGE_SHIFT + READDIR_PAGES_ORDER)
> +
> static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
> {
> struct fuse_conn *fc = get_fuse_conn(dir);
> @@ -52,10 +58,10 @@ static void fuse_add_dirent_to_cache(struct file *file,
> }
> version = fi->rdc.version;
> size = fi->rdc.size;
> - offset = size & ~PAGE_MASK;
> - index = size >> PAGE_SHIFT;
> + offset = size & ~READDIR_PAGES_MASK;
> + index = size >> READDIR_PAGES_SHIFT;
> /* Dirent doesn't fit in current page? Jump to next page. */
> - if (offset + reclen > PAGE_SIZE) {
> + if (offset + reclen > READDIR_PAGES_SIZE) {
> index++;
> offset = 0;
> }
> @@ -83,7 +89,7 @@ static void fuse_add_dirent_to_cache(struct file *file,
> }
> memcpy(addr + offset, dirent, reclen);
> kunmap_local(addr);
> - fi->rdc.size = (index << PAGE_SHIFT) + offset + reclen;
> + fi->rdc.size = (index << READDIR_PAGES_SHIFT) + offset + reclen;
> fi->rdc.pos = dirent->off;
> unlock:
> spin_unlock(&fi->rdc.lock);
> @@ -104,7 +110,7 @@ static void fuse_readdir_cache_end(struct file *file, loff_t pos)
> }
>
> fi->rdc.cached = true;
> - end = ALIGN(fi->rdc.size, PAGE_SIZE);
> + end = ALIGN(fi->rdc.size, READDIR_PAGES_SIZE);
> spin_unlock(&fi->rdc.lock);
>
> /* truncate unused tail of cache */
> @@ -328,25 +334,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
> struct fuse_mount *fm = get_fuse_mount(inode);
> struct fuse_io_args ia = {};
> struct fuse_args_pages *ap = &ia.ap;
> - struct fuse_page_desc desc = { .length = PAGE_SIZE };
> + struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE };
> u64 attr_version = 0;
> bool locked;
>
> - page = alloc_page(GFP_KERNEL);
> + page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);

I guess that should become folio alloc(), one way or the other. Now I
think order 0 was chosen before to avoid risk of allocation failure. I
guess it might work to try a large size and to fall back to 0 when that
failed. Or fail back to the slower vmalloc.

> if (!page)
> return -ENOMEM;
>
> plus = fuse_use_readdirplus(inode, ctx);
> ap->args.out_pages = true;
> - ap->num_pages = 1;
> + ap->num_pages = READDIR_PAGES;
> ap->pages = &page;
> ap->descs = &desc;
> if (plus) {
> attr_version = fuse_get_attr_version(fm->fc);
> - fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
> + fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE,
> FUSE_READDIRPLUS);
> } else {
> - fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
> + fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE,
> FUSE_READDIR);
> }
> locked = fuse_lock_inode(inode);
> @@ -383,7 +389,7 @@ static enum fuse_parse_result fuse_parse_cache(struct fuse_file *ff,
> void *addr, unsigned int size,
> struct dir_context *ctx)
> {
> - unsigned int offset = ff->readdir.cache_off & ~PAGE_MASK;
> + unsigned int offset = ff->readdir.cache_off & ~READDIR_PAGES_MASK;
> enum fuse_parse_result res = FOUND_NONE;
>
> WARN_ON(offset >= size);
> @@ -504,16 +510,16 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx)
>
> WARN_ON(fi->rdc.size < ff->readdir.cache_off);
>
> - index = ff->readdir.cache_off >> PAGE_SHIFT;
> + index = ff->readdir.cache_off >> READDIR_PAGES_SHIFT;
>
> - if (index == (fi->rdc.size >> PAGE_SHIFT))
> - size = fi->rdc.size & ~PAGE_MASK;
> + if (index == (fi->rdc.size >> READDIR_PAGES_SHIFT))
> + size = fi->rdc.size & ~READDIR_PAGES_MASK;
> else
> - size = PAGE_SIZE;
> + size = READDIR_PAGES_SIZE;
> spin_unlock(&fi->rdc.lock);
>
> /* EOF? */
> - if ((ff->readdir.cache_off & ~PAGE_MASK) == size)
> + if ((ff->readdir.cache_off & ~READDIR_PAGES_MASK) == size)
> return 0;
>
> page = find_get_page_flags(file->f_mapping, index,
> @@ -559,9 +565,9 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx)
> if (res == FOUND_ALL)
> return 0;
>
> - if (size == PAGE_SIZE) {
> + if (size == READDIR_PAGES_SIZE) {
> /* We hit end of page: skip to next page. */
> - ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, PAGE_SIZE);
> + ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, READDIR_PAGES_SIZE);
> goto retry;
> }
>

Thanks,
Bernd

2023-07-26 16:30:49

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir.



On 7/26/23 03:59, Jaco Kroon wrote:
> +config FUSE_READDIR_ORDER
> + int
> + range 0 5
> + default 5
> + help
> + readdir performance varies greatly depending on the size of the read.
> + Larger buffers results in larger reads, thus fewer reads and higher
> + performance in return.
> +
> + You may want to reduce this value on seriously constrained memory
> + systems where 128KiB (assuming 4KiB pages) cache pages is not ideal.
> +
> + This value reprents the order of the number of pages to allocate (ie,

represents (i.e.,

> + the shift value). A value of 0 is thus 1 page (4KiB) where 5 is 32
> + pages (128KiB).

--
~Randy

2023-07-26 18:43:32

by Antonio SJ Musumeci

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir.

On 7/26/23 10:45, Bernd Schubert wrote:
>
> On 7/26/23 17:26, Jaco Kroon wrote:
>> Hi,
>>
>> On 2023/07/26 15:53, Bernd Schubert wrote:
>>>
>>> On 7/26/23 12:59, Jaco Kroon wrote:
>>>> Signed-off-by: Jaco Kroon <[email protected]>
>>>> ---
>>>>   fs/fuse/Kconfig   | 16 ++++++++++++++++
>>>>   fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------
>>>>   2 files changed, 40 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
>>>> index 038ed0b9aaa5..0783f9ee5cd3 100644
>>>> --- a/fs/fuse/Kconfig
>>>> +++ b/fs/fuse/Kconfig
>>>> @@ -18,6 +18,22 @@ config FUSE_FS
>>>>         If you want to develop a userspace FS, or if you want to use
>>>>         a filesystem based on FUSE, answer Y or M.
>>>>   +config FUSE_READDIR_ORDER
>>>> +    int
>>>> +    range 0 5
>>>> +    default 5
>>>> +    help
>>>> +        readdir performance varies greatly depending on the size of
>>>> the read.
>>>> +        Larger buffers results in larger reads, thus fewer reads and
>>>> higher
>>>> +        performance in return.
>>>> +
>>>> +        You may want to reduce this value on seriously constrained
>>>> memory
>>>> +        systems where 128KiB (assuming 4KiB pages) cache pages is
>>>> not ideal.
>>>> +
>>>> +        This value reprents the order of the number of pages to
>>>> allocate (ie,
>>>> +        the shift value).  A value of 0 is thus 1 page (4KiB) where
>>>> 5 is 32
>>>> +        pages (128KiB).
>>>> +
>>> I like the idea of a larger readdir size, but shouldn't that be a
>>> server/daemon/library decision which size to use, instead of kernel
>>> compile time? So should be part of FUSE_INIT negotiation?
>> Yes sure, but there still needs to be a default.  And one page at a time
>> doesn't cut it.
> With FUSE_INIT userspace would make that decision, based on what kernel
> fuse suggests? process_init_reply() already handles other limits - I
> don't see why readdir max has to be compile time option. Maybe a module
> option to set the limit?
>
> Thanks,
> Bernd

I had similar question / comment. This seems to me to be more
appropriately handed by the server via FUSE_INIT.

And wouldn't "max" more easily be FUSE_MAX_MAX_PAGES? Is there a reason
not to allow upwards of 256 pages sized readdir buffer?



2023-07-26 19:21:05

by Jaco Kroon

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir.

Hi,

On 2023/07/26 19:23, Antonio SJ Musumeci wrote:
> On 7/26/23 10:45, Bernd Schubert wrote:
>> On 7/26/23 17:26, Jaco Kroon wrote:
>>> Hi,
>>>
>>> On 2023/07/26 15:53, Bernd Schubert wrote:
>>>> On 7/26/23 12:59, Jaco Kroon wrote:
>>>>> Signed-off-by: Jaco Kroon <[email protected]>
>>>>> ---
>>>>>   fs/fuse/Kconfig   | 16 ++++++++++++++++
>>>>>   fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------
>>>>>   2 files changed, 40 insertions(+), 18 deletions(-)
>>>>>
>>>>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
>>>>> index 038ed0b9aaa5..0783f9ee5cd3 100644
>>>>> --- a/fs/fuse/Kconfig
>>>>> +++ b/fs/fuse/Kconfig
>>>>> @@ -18,6 +18,22 @@ config FUSE_FS
>>>>>         If you want to develop a userspace FS, or if you want to use
>>>>>         a filesystem based on FUSE, answer Y or M.
>>>>>   +config FUSE_READDIR_ORDER
>>>>> +    int
>>>>> +    range 0 5
>>>>> +    default 5
>>>>> +    help
>>>>> +        readdir performance varies greatly depending on the size of
>>>>> the read.
>>>>> +        Larger buffers results in larger reads, thus fewer reads and
>>>>> higher
>>>>> +        performance in return.
>>>>> +
>>>>> +        You may want to reduce this value on seriously constrained
>>>>> memory
>>>>> +        systems where 128KiB (assuming 4KiB pages) cache pages is
>>>>> not ideal.
>>>>> +
>>>>> +        This value reprents the order of the number of pages to
>>>>> allocate (ie,
>>>>> +        the shift value).  A value of 0 is thus 1 page (4KiB) where
>>>>> 5 is 32
>>>>> +        pages (128KiB).
>>>>> +
>>>> I like the idea of a larger readdir size, but shouldn't that be a
>>>> server/daemon/library decision which size to use, instead of kernel
>>>> compile time? So should be part of FUSE_INIT negotiation?
>>> Yes sure, but there still needs to be a default.  And one page at a time
>>> doesn't cut it.
>> With FUSE_INIT userspace would make that decision, based on what kernel
>> fuse suggests? process_init_reply() already handles other limits - I
>> don't see why readdir max has to be compile time option. Maybe a module
>> option to set the limit?
>>
>> Thanks,
>> Bernd
> I had similar question / comment. This seems to me to be more
> appropriately handed by the server via FUSE_INIT.
>
> And wouldn't "max" more easily be FUSE_MAX_MAX_PAGES? Is there a reason
> not to allow upwards of 256 pages sized readdir buffer?

Will look into FUSE_INIT.  The FUSE_INIT as I understand from what I've
read has some expansion constraints or the structure is somehow
negotiated.  Older clients in other words that's not aware of the option
will follow some default.  For backwards compatibility that default
should probably be 1 page.  For performance reasons it makes sense that
this limit be larger.

glibc uses a 128KiB buffer for getdents64, so I'm not sure >128KiB here
makes sense.  Or if these two buffers are even directly related.

Default to fc->max_pages (which defaults to 32 or 128KiB) if the
user-space side doesn't understand the max_readdir_pages limit aspect of
things?  Assuming these limits should be set separately.  I'm thinking
piggy backing on fc->max_pages is just fine to be honest.

For the sake of avoiding division and modulo operations in the cache,
I'm thinking round-down max_pages to the closest power of two for the
sake of sticking to binary operators rather than divisions and mods?

Current patch introduces a definite memory leak either way.  Tore
through about 12GB of RAM in a matter of 20 minutes or so.  Just going
to patch it that way first, and then based on responses above will look
into an alternative patch that does not depend on a compile-time
option.  Guessing __free_page should be a multi-page variant now.

Thanks for all the feedback so far.

Kind regards,
Jaco


2023-07-27 09:12:09

by Jaco Kroon

[permalink] [raw]
Subject: [PATCH] fuse: enable larger read buffers for readdir [v2].

This patch does not mess with the caching infrastructure like the
previous one, which we believe caused excessive CPU and broke directory
listings in some cases.

This version only affects the uncached read, which then during parse adds an
entry at a time to the cached structures by way of copying, and as such,
we believe this should be sufficient.

We're still seeing cases where getdents64 takes ~10s (this was the case
in any case without this patch, the difference now that we get ~500
entries for that time rather than the 14-18 previously). We believe
that that latency is introduced on glusterfs side and is under separate
discussion with the glusterfs developers.

This is still a compile-time option, but a working one compared to
previous patch. For now this works, but it's not recommended for merge
(as per email discussion).

This still uses alloc_pages rather than kvmalloc/kvfree.

Signed-off-by: Jaco Kroon <[email protected]>
---
fs/fuse/Kconfig | 16 ++++++++++++++++
fs/fuse/readdir.c | 18 ++++++++++++------
2 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index 038ed0b9aaa5..0783f9ee5cd3 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -18,6 +18,22 @@ config FUSE_FS
If you want to develop a userspace FS, or if you want to use
a filesystem based on FUSE, answer Y or M.

+config FUSE_READDIR_ORDER
+ int
+ range 0 5
+ default 5
+ help
+ readdir performance varies greatly depending on the size of the read.
+ Larger buffers results in larger reads, thus fewer reads and higher
+ performance in return.
+
+ You may want to reduce this value on seriously constrained memory
+ systems where 128KiB (assuming 4KiB pages) cache pages is not ideal.
+
+ This value reprents the order of the number of pages to allocate (ie,
+ the shift value). A value of 0 is thus 1 page (4KiB) where 5 is 32
+ pages (128KiB).
+
config CUSE
tristate "Character device in Userspace support"
depends on FUSE_FS
diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index dc603479b30e..47cea4d91228 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -13,6 +13,12 @@
#include <linux/pagemap.h>
#include <linux/highmem.h>

+#define READDIR_PAGES_ORDER CONFIG_FUSE_READDIR_ORDER
+#define READDIR_PAGES (1 << READDIR_PAGES_ORDER)
+#define READDIR_PAGES_SIZE (PAGE_SIZE << READDIR_PAGES_ORDER)
+#define READDIR_PAGES_MASK (READDIR_PAGES_SIZE - 1)
+#define READDIR_PAGES_SHIFT (PAGE_SHIFT + READDIR_PAGES_ORDER)
+
static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
{
struct fuse_conn *fc = get_fuse_conn(dir);
@@ -328,25 +334,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
struct fuse_mount *fm = get_fuse_mount(inode);
struct fuse_io_args ia = {};
struct fuse_args_pages *ap = &ia.ap;
- struct fuse_page_desc desc = { .length = PAGE_SIZE };
+ struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE };
u64 attr_version = 0;
bool locked;

- page = alloc_page(GFP_KERNEL);
+ page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);
if (!page)
return -ENOMEM;

plus = fuse_use_readdirplus(inode, ctx);
ap->args.out_pages = true;
- ap->num_pages = 1;
+ ap->num_pages = READDIR_PAGES;
ap->pages = &page;
ap->descs = &desc;
if (plus) {
attr_version = fuse_get_attr_version(fm->fc);
- fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
+ fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE,
FUSE_READDIRPLUS);
} else {
- fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
+ fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE,
FUSE_READDIR);
}
locked = fuse_lock_inode(inode);
@@ -367,7 +373,7 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
}
}

- __free_page(page);
+ __free_pages(page, READDIR_PAGES_ORDER);
fuse_invalidate_atime(inode);
return res;
}
--
2.41.0


2023-07-27 16:10:02

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir [v2].

On Thu, 27 Jul 2023 at 10:13, Jaco Kroon <[email protected]> wrote:
>
> This patch does not mess with the caching infrastructure like the
> previous one, which we believe caused excessive CPU and broke directory
> listings in some cases.
>
> This version only affects the uncached read, which then during parse adds an
> entry at a time to the cached structures by way of copying, and as such,
> we believe this should be sufficient.
>
> We're still seeing cases where getdents64 takes ~10s (this was the case
> in any case without this patch, the difference now that we get ~500
> entries for that time rather than the 14-18 previously). We believe
> that that latency is introduced on glusterfs side and is under separate
> discussion with the glusterfs developers.
>
> This is still a compile-time option, but a working one compared to
> previous patch. For now this works, but it's not recommended for merge
> (as per email discussion).
>
> This still uses alloc_pages rather than kvmalloc/kvfree.
>
> Signed-off-by: Jaco Kroon <[email protected]>
> ---
> fs/fuse/Kconfig | 16 ++++++++++++++++
> fs/fuse/readdir.c | 18 ++++++++++++------
> 2 files changed, 28 insertions(+), 6 deletions(-)
>
> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> index 038ed0b9aaa5..0783f9ee5cd3 100644
> --- a/fs/fuse/Kconfig
> +++ b/fs/fuse/Kconfig
> @@ -18,6 +18,22 @@ config FUSE_FS
> If you want to develop a userspace FS, or if you want to use
> a filesystem based on FUSE, answer Y or M.
>
> +config FUSE_READDIR_ORDER
> + int
> + range 0 5
> + default 5
> + help
> + readdir performance varies greatly depending on the size of the read.
> + Larger buffers results in larger reads, thus fewer reads and higher
> + performance in return.
> +
> + You may want to reduce this value on seriously constrained memory
> + systems where 128KiB (assuming 4KiB pages) cache pages is not ideal.
> +
> + This value reprents the order of the number of pages to allocate (ie,
> + the shift value). A value of 0 is thus 1 page (4KiB) where 5 is 32
> + pages (128KiB).
> +
> config CUSE
> tristate "Character device in Userspace support"
> depends on FUSE_FS
> diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
> index dc603479b30e..47cea4d91228 100644
> --- a/fs/fuse/readdir.c
> +++ b/fs/fuse/readdir.c
> @@ -13,6 +13,12 @@
> #include <linux/pagemap.h>
> #include <linux/highmem.h>
>
> +#define READDIR_PAGES_ORDER CONFIG_FUSE_READDIR_ORDER
> +#define READDIR_PAGES (1 << READDIR_PAGES_ORDER)
> +#define READDIR_PAGES_SIZE (PAGE_SIZE << READDIR_PAGES_ORDER)
> +#define READDIR_PAGES_MASK (READDIR_PAGES_SIZE - 1)
> +#define READDIR_PAGES_SHIFT (PAGE_SHIFT + READDIR_PAGES_ORDER)
> +
> static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
> {
> struct fuse_conn *fc = get_fuse_conn(dir);
> @@ -328,25 +334,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
> struct fuse_mount *fm = get_fuse_mount(inode);
> struct fuse_io_args ia = {};
> struct fuse_args_pages *ap = &ia.ap;
> - struct fuse_page_desc desc = { .length = PAGE_SIZE };
> + struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE };

Does this really work? I would've thought we are relying on single
page lengths somewhere.

> u64 attr_version = 0;
> bool locked;
>
> - page = alloc_page(GFP_KERNEL);
> + page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);
> if (!page)
> return -ENOMEM;
>
> plus = fuse_use_readdirplus(inode, ctx);
> ap->args.out_pages = true;
> - ap->num_pages = 1;
> + ap->num_pages = READDIR_PAGES;

No. This is the array lenght, which is 1. This is the hack I guess,
which makes the above trick work.

Better use kvmalloc, which might have a slightly worse performance
than a large page, but definitely not worse than the current single
page.

If we want to optimize the overhead of kvmalloc (and it's a big if)
then the parse_dir*file() functions would need to be converted to
using a page array instead of a plain kernel pointer, which would add
some complexity for sure.

Thanks,
Miklos

2023-07-27 16:12:50

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir.

On Wed, Jul 26, 2023 at 08:25:48PM +0200, Jaco Kroon wrote:

> glibc uses a 128KiB buffer for getdents64, so I'm not sure >128KiB here
> makes sense. Or if these two buffers are even directly related.

Definitely related. See how the dir_emit() works: if the entry doesn't fit in
the userspace buffer, then ->readdir() returns. If entries remain in the fuse
buffer at that time, they are simply discarded. Simply put, making the
buffer too large is going to be a waste of resources and at some point will be
slower than the single page buffer.

Note: discarding received entries is the only possible action if caching is
disabled. With caching enabled it would make sense to pre-fill the cache with
the overflowed entries and use them in the next iteration. However the caching
code is not prepared for using partial caches at the moment.

The best strategy would be to find the optimal buffer size based on the size of
the userspace buffer. Putting that info into struct dir_context doesn't sound
too complicated...

Here's a patch. It doesn't touch readdir. Simply setting the fuse buffer size
to the userspace buffer size should work, the record sizes are similar (fuse's
is slightly larger than libc's, so no overflow should ever happen).

Thanks,
Miklos


---
fs/exportfs/expfs.c | 1 +
fs/overlayfs/readdir.c | 12 +++++++++++-
fs/readdir.c | 29 ++++++++++++++---------------
include/linux/fs.h | 7 +++++++
4 files changed, 33 insertions(+), 16 deletions(-)

--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -286,6 +286,7 @@ static int get_name(const struct path *p
};
struct getdents_callback buffer = {
.ctx.actor = filldir_one,
+ .ctx.count = INT_MAX,
.name = name,
};

--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -347,6 +347,7 @@ static int ovl_dir_read_merged(struct de
struct path realpath;
struct ovl_readdir_data rdd = {
.ctx.actor = ovl_fill_merge,
+ .ctx.count = INT_MAX,
.dentry = dentry,
.list = list,
.root = root,
@@ -557,6 +558,7 @@ static int ovl_dir_read_impure(const str
struct ovl_cache_entry *p, *n;
struct ovl_readdir_data rdd = {
.ctx.actor = ovl_fill_plain,
+ .ctx.count = INT_MAX,
.list = list,
.root = root,
};
@@ -658,6 +660,7 @@ static bool ovl_fill_real(struct dir_con
struct ovl_readdir_translate *rdt =
container_of(ctx, struct ovl_readdir_translate, ctx);
struct dir_context *orig_ctx = rdt->orig_ctx;
+ bool res;

if (rdt->parent_ino && strcmp(name, "..") == 0) {
ino = rdt->parent_ino;
@@ -672,7 +675,10 @@ static bool ovl_fill_real(struct dir_con
name, namelen, rdt->xinowarn);
}

- return orig_ctx->actor(orig_ctx, name, namelen, offset, ino, d_type);
+ res = orig_ctx->actor(orig_ctx, name, namelen, offset, ino, d_type);
+ ctx->count = orig_ctx->count;
+
+ return res;
}

static bool ovl_is_impure_dir(struct file *file)
@@ -699,6 +705,7 @@ static int ovl_iterate_real(struct file
const struct ovl_layer *lower_layer = ovl_layer_lower(dir);
struct ovl_readdir_translate rdt = {
.ctx.actor = ovl_fill_real,
+ .ctx.count = ctx->count,
.orig_ctx = ctx,
.xinobits = ovl_xino_bits(ofs),
.xinowarn = ovl_xino_warn(ofs),
@@ -1058,6 +1065,7 @@ int ovl_check_d_type_supported(const str
int err;
struct ovl_readdir_data rdd = {
.ctx.actor = ovl_check_d_type,
+ .ctx.count = INT_MAX,
.d_type_supported = false,
};

@@ -1079,6 +1087,7 @@ static int ovl_workdir_cleanup_recurse(s
struct ovl_cache_entry *p;
struct ovl_readdir_data rdd = {
.ctx.actor = ovl_fill_plain,
+ .ctx.count = INT_MAX,
.list = &list,
};
bool incompat = false;
@@ -1163,6 +1172,7 @@ int ovl_indexdir_cleanup(struct ovl_fs *
struct ovl_cache_entry *p;
struct ovl_readdir_data rdd = {
.ctx.actor = ovl_fill_plain,
+ .ctx.count = INT_MAX,
.list = &list,
};

--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -184,6 +184,7 @@ SYSCALL_DEFINE3(old_readdir, unsigned in
struct fd f = fdget_pos(fd);
struct readdir_callback buf = {
.ctx.actor = fillonedir,
+ .ctx.count = 1, /* Hint to fs: just one entry. */
.dirent = dirent
};

@@ -215,7 +216,6 @@ struct getdents_callback {
struct dir_context ctx;
struct linux_dirent __user * current_dir;
int prev_reclen;
- int count;
int error;
};

@@ -234,7 +234,7 @@ static bool filldir(struct dir_context *
if (unlikely(buf->error))
return false;
buf->error = -EINVAL; /* only used if we fail.. */
- if (reclen > buf->count)
+ if (reclen > ctx->count)
return false;
d_ino = ino;
if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
@@ -259,7 +259,7 @@ static bool filldir(struct dir_context *

buf->current_dir = (void __user *)dirent + reclen;
buf->prev_reclen = reclen;
- buf->count -= reclen;
+ ctx->count -= reclen;
return true;
efault_end:
user_write_access_end();
@@ -274,7 +274,7 @@ SYSCALL_DEFINE3(getdents, unsigned int,
struct fd f;
struct getdents_callback buf = {
.ctx.actor = filldir,
- .count = count,
+ .ctx.count = count,
.current_dir = dirent
};
int error;
@@ -293,7 +293,7 @@ SYSCALL_DEFINE3(getdents, unsigned int,
if (put_user(buf.ctx.pos, &lastdirent->d_off))
error = -EFAULT;
else
- error = count - buf.count;
+ error = count - buf.ctx.count;
}
fdput_pos(f);
return error;
@@ -303,7 +303,6 @@ struct getdents_callback64 {
struct dir_context ctx;
struct linux_dirent64 __user * current_dir;
int prev_reclen;
- int count;
int error;
};

@@ -321,7 +320,7 @@ static bool filldir64(struct dir_context
if (unlikely(buf->error))
return false;
buf->error = -EINVAL; /* only used if we fail.. */
- if (reclen > buf->count)
+ if (reclen > ctx->count)
return false;
prev_reclen = buf->prev_reclen;
if (prev_reclen && signal_pending(current))
@@ -341,7 +340,7 @@ static bool filldir64(struct dir_context

buf->prev_reclen = reclen;
buf->current_dir = (void __user *)dirent + reclen;
- buf->count -= reclen;
+ ctx->count -= reclen;
return true;

efault_end:
@@ -357,7 +356,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int
struct fd f;
struct getdents_callback64 buf = {
.ctx.actor = filldir64,
- .count = count,
+ .ctx.count = count,
.current_dir = dirent
};
int error;
@@ -377,7 +376,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int
if (put_user(d_off, &lastdirent->d_off))
error = -EFAULT;
else
- error = count - buf.count;
+ error = count - buf.ctx.count;
}
fdput_pos(f);
return error;
@@ -442,6 +441,7 @@ COMPAT_SYSCALL_DEFINE3(old_readdir, unsi
struct fd f = fdget_pos(fd);
struct compat_readdir_callback buf = {
.ctx.actor = compat_fillonedir,
+ .ctx.count = 1, /* Hint to fs: just one entry. */
.dirent = dirent
};

@@ -467,7 +467,6 @@ struct compat_getdents_callback {
struct dir_context ctx;
struct compat_linux_dirent __user *current_dir;
int prev_reclen;
- int count;
int error;
};

@@ -486,7 +485,7 @@ static bool compat_filldir(struct dir_co
if (unlikely(buf->error))
return false;
buf->error = -EINVAL; /* only used if we fail.. */
- if (reclen > buf->count)
+ if (reclen > ctx->count)
return false;
d_ino = ino;
if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
@@ -510,7 +509,7 @@ static bool compat_filldir(struct dir_co

buf->prev_reclen = reclen;
buf->current_dir = (void __user *)dirent + reclen;
- buf->count -= reclen;
+ ctx->count -= reclen;
return true;
efault_end:
user_write_access_end();
@@ -525,8 +524,8 @@ COMPAT_SYSCALL_DEFINE3(getdents, unsigne
struct fd f;
struct compat_getdents_callback buf = {
.ctx.actor = compat_filldir,
+ .ctx.count = count,
.current_dir = dirent,
- .count = count
};
int error;

@@ -544,7 +543,7 @@ COMPAT_SYSCALL_DEFINE3(getdents, unsigne
if (put_user(buf.ctx.pos, &lastdirent->d_off))
error = -EFAULT;
else
- error = count - buf.count;
+ error = count - buf.ctx.count;
}
fdput_pos(f);
return error;
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1719,6 +1719,13 @@ typedef bool (*filldir_t)(struct dir_con
struct dir_context {
filldir_t actor;
loff_t pos;
+ /*
+ * Filesystems MUST NOT MODIFY count, but may use as a hint:
+ * 0 unknown
+ * > 0 space in buffer (assume at least one entry)
+ * INT_MAX unlimited
+ */
+ int count;
};

/*

2023-07-27 17:07:32

by Jaco Kroon

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir [v2].

Hi,

On 2023/07/27 17:35, Miklos Szeredi wrote:
> On Thu, 27 Jul 2023 at 10:13, Jaco Kroon <[email protected]> wrote:
>> This patch does not mess with the caching infrastructure like the
>> previous one, which we believe caused excessive CPU and broke directory
>> listings in some cases.
>>
>> This version only affects the uncached read, which then during parse adds an
>> entry at a time to the cached structures by way of copying, and as such,
>> we believe this should be sufficient.
>>
>> We're still seeing cases where getdents64 takes ~10s (this was the case
>> in any case without this patch, the difference now that we get ~500
>> entries for that time rather than the 14-18 previously). We believe
>> that that latency is introduced on glusterfs side and is under separate
>> discussion with the glusterfs developers.
>>
>> This is still a compile-time option, but a working one compared to
>> previous patch. For now this works, but it's not recommended for merge
>> (as per email discussion).
>>
>> This still uses alloc_pages rather than kvmalloc/kvfree.
>>
>> Signed-off-by: Jaco Kroon <[email protected]>
>> ---
>> fs/fuse/Kconfig | 16 ++++++++++++++++
>> fs/fuse/readdir.c | 18 ++++++++++++------
>> 2 files changed, 28 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
>> index 038ed0b9aaa5..0783f9ee5cd3 100644
>> --- a/fs/fuse/Kconfig
>> +++ b/fs/fuse/Kconfig
>> @@ -18,6 +18,22 @@ config FUSE_FS
>> If you want to develop a userspace FS, or if you want to use
>> a filesystem based on FUSE, answer Y or M.
>>
>> +config FUSE_READDIR_ORDER
>> + int
>> + range 0 5
>> + default 5
>> + help
>> + readdir performance varies greatly depending on the size of the read.
>> + Larger buffers results in larger reads, thus fewer reads and higher
>> + performance in return.
>> +
>> + You may want to reduce this value on seriously constrained memory
>> + systems where 128KiB (assuming 4KiB pages) cache pages is not ideal.
>> +
>> + This value reprents the order of the number of pages to allocate (ie,
>> + the shift value). A value of 0 is thus 1 page (4KiB) where 5 is 32
>> + pages (128KiB).
>> +
>> config CUSE
>> tristate "Character device in Userspace support"
>> depends on FUSE_FS
>> diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
>> index dc603479b30e..47cea4d91228 100644
>> --- a/fs/fuse/readdir.c
>> +++ b/fs/fuse/readdir.c
>> @@ -13,6 +13,12 @@
>> #include <linux/pagemap.h>
>> #include <linux/highmem.h>
>>
>> +#define READDIR_PAGES_ORDER CONFIG_FUSE_READDIR_ORDER
>> +#define READDIR_PAGES (1 << READDIR_PAGES_ORDER)
>> +#define READDIR_PAGES_SIZE (PAGE_SIZE << READDIR_PAGES_ORDER)
>> +#define READDIR_PAGES_MASK (READDIR_PAGES_SIZE - 1)
>> +#define READDIR_PAGES_SHIFT (PAGE_SHIFT + READDIR_PAGES_ORDER)
>> +
>> static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
>> {
>> struct fuse_conn *fc = get_fuse_conn(dir);
>> @@ -328,25 +334,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
>> struct fuse_mount *fm = get_fuse_mount(inode);
>> struct fuse_io_args ia = {};
>> struct fuse_args_pages *ap = &ia.ap;
>> - struct fuse_page_desc desc = { .length = PAGE_SIZE };
>> + struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE };
> Does this really work? I would've thought we are relying on single
> page lengths somewhere.
Based on my testing yes.  Getting just under 500 entries per
getdents64() call from userspace vs 14-18 before I guess the answer is yes.
>
>> u64 attr_version = 0;
>> bool locked;
>>
>> - page = alloc_page(GFP_KERNEL);
>> + page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);
>> if (!page)
>> return -ENOMEM;
>>
>> plus = fuse_use_readdirplus(inode, ctx);
>> ap->args.out_pages = true;
>> - ap->num_pages = 1;
>> + ap->num_pages = READDIR_PAGES;
> No. This is the array lenght, which is 1. This is the hack I guess,
> which makes the above trick work.

Oh?  So the page referenced above isn't an array of pages?  It's
actually a single piece of contiguous memory?

> Better use kvmalloc, which might have a slightly worse performance
> than a large page, but definitely not worse than the current single
> page.

Which returns a void*, not struct page* - guess this can be converted
using __virt_to_page (iirc)?

> If we want to optimize the overhead of kvmalloc (and it's a big if)
> then the parse_dir*file() functions would need to be converted to
> using a page array instead of a plain kernel pointer, which would add
> some complexity for sure.

Sorry, I read the above as "I'm surprised this works at all and you're
not kernel panicking all over the show", this is probably the most
ambitious kernel patch I've attempted to date.

Kind regards,
Jaco


2023-07-27 19:50:57

by Bernd Schubert

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir.



On 7/27/23 21:21, Miklos Szeredi wrote:
> On Wed, 26 Jul 2023 at 20:40, Jaco Kroon <[email protected]> wrote:
>
>> Will look into FUSE_INIT. The FUSE_INIT as I understand from what I've
>> read has some expansion constraints or the structure is somehow
>> negotiated. Older clients in other words that's not aware of the option
>> will follow some default. For backwards compatibility that default
>> should probably be 1 page. For performance reasons it makes sense that
>> this limit be larger.
>
> Yes, might need this for backward compatibility. But perhaps a
> feature flag is enough and the readdir buf can be limited to
> fc->max_read.

fc->max_read is set by default to ~0 and only set to something else when
the max_read mount option is given? So typically that is a large value
(UINT_MAX)?


Thanks,
Bernd

2023-07-27 20:23:31

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir.

On Wed, 26 Jul 2023 at 20:40, Jaco Kroon <[email protected]> wrote:

> Will look into FUSE_INIT. The FUSE_INIT as I understand from what I've
> read has some expansion constraints or the structure is somehow
> negotiated. Older clients in other words that's not aware of the option
> will follow some default. For backwards compatibility that default
> should probably be 1 page. For performance reasons it makes sense that
> this limit be larger.

Yes, might need this for backward compatibility. But perhaps a
feature flag is enough and the readdir buf can be limited to
fc->max_read.

Thanks,
Miklos

2023-07-27 20:29:30

by Bernd Schubert

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir [v2].



On 7/27/23 17:35, Miklos Szeredi wrote:
> On Thu, 27 Jul 2023 at 10:13, Jaco Kroon <[email protected]> wrote:
>>
>> This patch does not mess with the caching infrastructure like the
>> previous one, which we believe caused excessive CPU and broke directory
>> listings in some cases.
>>
>> This version only affects the uncached read, which then during parse adds an
>> entry at a time to the cached structures by way of copying, and as such,
>> we believe this should be sufficient.
>>
>> We're still seeing cases where getdents64 takes ~10s (this was the case
>> in any case without this patch, the difference now that we get ~500
>> entries for that time rather than the 14-18 previously). We believe
>> that that latency is introduced on glusterfs side and is under separate
>> discussion with the glusterfs developers.
>>
>> This is still a compile-time option, but a working one compared to
>> previous patch. For now this works, but it's not recommended for merge
>> (as per email discussion).
>>
>> This still uses alloc_pages rather than kvmalloc/kvfree.
>>
>> Signed-off-by: Jaco Kroon <[email protected]>
>> ---
>> fs/fuse/Kconfig | 16 ++++++++++++++++
>> fs/fuse/readdir.c | 18 ++++++++++++------
>> 2 files changed, 28 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
>> index 038ed0b9aaa5..0783f9ee5cd3 100644
>> --- a/fs/fuse/Kconfig
>> +++ b/fs/fuse/Kconfig
>> @@ -18,6 +18,22 @@ config FUSE_FS
>> If you want to develop a userspace FS, or if you want to use
>> a filesystem based on FUSE, answer Y or M.
>>
>> +config FUSE_READDIR_ORDER
>> + int
>> + range 0 5
>> + default 5
>> + help
>> + readdir performance varies greatly depending on the size of the read.
>> + Larger buffers results in larger reads, thus fewer reads and higher
>> + performance in return.
>> +
>> + You may want to reduce this value on seriously constrained memory
>> + systems where 128KiB (assuming 4KiB pages) cache pages is not ideal.
>> +
>> + This value reprents the order of the number of pages to allocate (ie,
>> + the shift value). A value of 0 is thus 1 page (4KiB) where 5 is 32
>> + pages (128KiB).
>> +
>> config CUSE
>> tristate "Character device in Userspace support"
>> depends on FUSE_FS
>> diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
>> index dc603479b30e..47cea4d91228 100644
>> --- a/fs/fuse/readdir.c
>> +++ b/fs/fuse/readdir.c
>> @@ -13,6 +13,12 @@
>> #include <linux/pagemap.h>
>> #include <linux/highmem.h>
>>
>> +#define READDIR_PAGES_ORDER CONFIG_FUSE_READDIR_ORDER
>> +#define READDIR_PAGES (1 << READDIR_PAGES_ORDER)
>> +#define READDIR_PAGES_SIZE (PAGE_SIZE << READDIR_PAGES_ORDER)
>> +#define READDIR_PAGES_MASK (READDIR_PAGES_SIZE - 1)
>> +#define READDIR_PAGES_SHIFT (PAGE_SHIFT + READDIR_PAGES_ORDER)
>> +
>> static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
>> {
>> struct fuse_conn *fc = get_fuse_conn(dir);
>> @@ -328,25 +334,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
>> struct fuse_mount *fm = get_fuse_mount(inode);
>> struct fuse_io_args ia = {};
>> struct fuse_args_pages *ap = &ia.ap;
>> - struct fuse_page_desc desc = { .length = PAGE_SIZE };
>> + struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE };
>
> Does this really work? I would've thought we are relying on single
> page lengths somewhere.
>
>> u64 attr_version = 0;
>> bool locked;
>>
>> - page = alloc_page(GFP_KERNEL);
>> + page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);
>> if (!page)
>> return -ENOMEM;
>>
>> plus = fuse_use_readdirplus(inode, ctx);
>> ap->args.out_pages = true;
>> - ap->num_pages = 1;
>> + ap->num_pages = READDIR_PAGES;
>
> No. This is the array lenght, which is 1. This is the hack I guess,
> which makes the above trick work.
>
> Better use kvmalloc, which might have a slightly worse performance
> than a large page, but definitely not worse than the current single
> page.
>
> If we want to optimize the overhead of kvmalloc (and it's a big if)
> then the parse_dir*file() functions would need to be converted to
> using a page array instead of a plain kernel pointer, which would add
> some complexity for sure.

One simple possibility might be to do pos=0 with a small buffer size
single page and only if pos is set we switch to a larger buffer - that
way small directories don't get the overhead of the large allocation.
Although following your idea to to the getdents buffer size - this is
something libc could already start with.


Cheers,
Bernd

2023-07-27 20:58:43

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir [v2].

On Thu, 27 Jul 2023 at 18:58, Jaco Kroon <[email protected]> wrote:
>
> Hi,
>
> On 2023/07/27 17:35, Miklos Szeredi wrote:
> > On Thu, 27 Jul 2023 at 10:13, Jaco Kroon <[email protected]> wrote:
> >> This patch does not mess with the caching infrastructure like the
> >> previous one, which we believe caused excessive CPU and broke directory
> >> listings in some cases.
> >>
> >> This version only affects the uncached read, which then during parse adds an
> >> entry at a time to the cached structures by way of copying, and as such,
> >> we believe this should be sufficient.
> >>
> >> We're still seeing cases where getdents64 takes ~10s (this was the case
> >> in any case without this patch, the difference now that we get ~500
> >> entries for that time rather than the 14-18 previously). We believe
> >> that that latency is introduced on glusterfs side and is under separate
> >> discussion with the glusterfs developers.
> >>
> >> This is still a compile-time option, but a working one compared to
> >> previous patch. For now this works, but it's not recommended for merge
> >> (as per email discussion).
> >>
> >> This still uses alloc_pages rather than kvmalloc/kvfree.
> >>
> >> Signed-off-by: Jaco Kroon <[email protected]>
> >> ---
> >> fs/fuse/Kconfig | 16 ++++++++++++++++
> >> fs/fuse/readdir.c | 18 ++++++++++++------
> >> 2 files changed, 28 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> >> index 038ed0b9aaa5..0783f9ee5cd3 100644
> >> --- a/fs/fuse/Kconfig
> >> +++ b/fs/fuse/Kconfig
> >> @@ -18,6 +18,22 @@ config FUSE_FS
> >> If you want to develop a userspace FS, or if you want to use
> >> a filesystem based on FUSE, answer Y or M.
> >>
> >> +config FUSE_READDIR_ORDER
> >> + int
> >> + range 0 5
> >> + default 5
> >> + help
> >> + readdir performance varies greatly depending on the size of the read.
> >> + Larger buffers results in larger reads, thus fewer reads and higher
> >> + performance in return.
> >> +
> >> + You may want to reduce this value on seriously constrained memory
> >> + systems where 128KiB (assuming 4KiB pages) cache pages is not ideal.
> >> +
> >> + This value reprents the order of the number of pages to allocate (ie,
> >> + the shift value). A value of 0 is thus 1 page (4KiB) where 5 is 32
> >> + pages (128KiB).
> >> +
> >> config CUSE
> >> tristate "Character device in Userspace support"
> >> depends on FUSE_FS
> >> diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
> >> index dc603479b30e..47cea4d91228 100644
> >> --- a/fs/fuse/readdir.c
> >> +++ b/fs/fuse/readdir.c
> >> @@ -13,6 +13,12 @@
> >> #include <linux/pagemap.h>
> >> #include <linux/highmem.h>
> >>
> >> +#define READDIR_PAGES_ORDER CONFIG_FUSE_READDIR_ORDER
> >> +#define READDIR_PAGES (1 << READDIR_PAGES_ORDER)
> >> +#define READDIR_PAGES_SIZE (PAGE_SIZE << READDIR_PAGES_ORDER)
> >> +#define READDIR_PAGES_MASK (READDIR_PAGES_SIZE - 1)
> >> +#define READDIR_PAGES_SHIFT (PAGE_SHIFT + READDIR_PAGES_ORDER)
> >> +
> >> static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
> >> {
> >> struct fuse_conn *fc = get_fuse_conn(dir);
> >> @@ -328,25 +334,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
> >> struct fuse_mount *fm = get_fuse_mount(inode);
> >> struct fuse_io_args ia = {};
> >> struct fuse_args_pages *ap = &ia.ap;
> >> - struct fuse_page_desc desc = { .length = PAGE_SIZE };
> >> + struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE };
> > Does this really work? I would've thought we are relying on single
> > page lengths somewhere.
> Based on my testing yes. Getting just under 500 entries per
> getdents64() call from userspace vs 14-18 before I guess the answer is yes.
> >
> >> u64 attr_version = 0;
> >> bool locked;
> >>
> >> - page = alloc_page(GFP_KERNEL);
> >> + page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);
> >> if (!page)
> >> return -ENOMEM;
> >>
> >> plus = fuse_use_readdirplus(inode, ctx);
> >> ap->args.out_pages = true;
> >> - ap->num_pages = 1;
> >> + ap->num_pages = READDIR_PAGES;
> > No. This is the array lenght, which is 1. This is the hack I guess,
> > which makes the above trick work.
>
> Oh? So the page referenced above isn't an array of pages? It's
> actually a single piece of contiguous memory?

Yes.

>
> > Better use kvmalloc, which might have a slightly worse performance
> > than a large page, but definitely not worse than the current single
> > page.
>
> Which returns a void*, not struct page* - guess this can be converted
> using __virt_to_page (iirc)?

No, it cannot be converted to a page or a page array, use it as just a
piece of kernel memory.

Which means:

- don't set ->args.out_pages and ->num_pages
- instead set ->args.out_args[0].value to the allocated pointer

and that should be it (fingers crossed).

>
> > If we want to optimize the overhead of kvmalloc (and it's a big if)
> > then the parse_dir*file() functions would need to be converted to
> > using a page array instead of a plain kernel pointer, which would add
> > some complexity for sure.
>
> Sorry, I read the above as "I'm surprised this works at all and you're
> not kernel panicking all over the show", this is probably the most
> ambitious kernel patch I've attempted to date.

Good start, you'll learn.

Thanks,
Miklos

2023-07-27 21:15:20

by Bernd Schubert

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir [v2].



On 7/27/23 17:35, Miklos Szeredi wrote:
> On Thu, 27 Jul 2023 at 10:13, Jaco Kroon <[email protected]> wrote:
>>
>> This patch does not mess with the caching infrastructure like the
>> previous one, which we believe caused excessive CPU and broke directory
>> listings in some cases.
>>
>> This version only affects the uncached read, which then during parse adds an
>> entry at a time to the cached structures by way of copying, and as such,
>> we believe this should be sufficient.
>>
>> We're still seeing cases where getdents64 takes ~10s (this was the case
>> in any case without this patch, the difference now that we get ~500
>> entries for that time rather than the 14-18 previously). We believe
>> that that latency is introduced on glusterfs side and is under separate
>> discussion with the glusterfs developers.
>>
>> This is still a compile-time option, but a working one compared to
>> previous patch. For now this works, but it's not recommended for merge
>> (as per email discussion).
>>
>> This still uses alloc_pages rather than kvmalloc/kvfree.
>>
>> Signed-off-by: Jaco Kroon <[email protected]>
>> ---
>> fs/fuse/Kconfig | 16 ++++++++++++++++
>> fs/fuse/readdir.c | 18 ++++++++++++------
>> 2 files changed, 28 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
>> index 038ed0b9aaa5..0783f9ee5cd3 100644
>> --- a/fs/fuse/Kconfig
>> +++ b/fs/fuse/Kconfig
>> @@ -18,6 +18,22 @@ config FUSE_FS
>> If you want to develop a userspace FS, or if you want to use
>> a filesystem based on FUSE, answer Y or M.
>>
>> +config FUSE_READDIR_ORDER
>> + int
>> + range 0 5
>> + default 5
>> + help
>> + readdir performance varies greatly depending on the size of the read.
>> + Larger buffers results in larger reads, thus fewer reads and higher
>> + performance in return.
>> +
>> + You may want to reduce this value on seriously constrained memory
>> + systems where 128KiB (assuming 4KiB pages) cache pages is not ideal.
>> +
>> + This value reprents the order of the number of pages to allocate (ie,
>> + the shift value). A value of 0 is thus 1 page (4KiB) where 5 is 32
>> + pages (128KiB).
>> +
>> config CUSE
>> tristate "Character device in Userspace support"
>> depends on FUSE_FS
>> diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
>> index dc603479b30e..47cea4d91228 100644
>> --- a/fs/fuse/readdir.c
>> +++ b/fs/fuse/readdir.c
>> @@ -13,6 +13,12 @@
>> #include <linux/pagemap.h>
>> #include <linux/highmem.h>
>>
>> +#define READDIR_PAGES_ORDER CONFIG_FUSE_READDIR_ORDER
>> +#define READDIR_PAGES (1 << READDIR_PAGES_ORDER)
>> +#define READDIR_PAGES_SIZE (PAGE_SIZE << READDIR_PAGES_ORDER)
>> +#define READDIR_PAGES_MASK (READDIR_PAGES_SIZE - 1)
>> +#define READDIR_PAGES_SHIFT (PAGE_SHIFT + READDIR_PAGES_ORDER)
>> +
>> static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
>> {
>> struct fuse_conn *fc = get_fuse_conn(dir);
>> @@ -328,25 +334,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
>> struct fuse_mount *fm = get_fuse_mount(inode);
>> struct fuse_io_args ia = {};
>> struct fuse_args_pages *ap = &ia.ap;
>> - struct fuse_page_desc desc = { .length = PAGE_SIZE };
>> + struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE };
>
> Does this really work? I would've thought we are relying on single
> page lengths somewhere.
>
>> u64 attr_version = 0;
>> bool locked;
>>
>> - page = alloc_page(GFP_KERNEL);
>> + page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);
>> if (!page)
>> return -ENOMEM;
>>
>> plus = fuse_use_readdirplus(inode, ctx);
>> ap->args.out_pages = true;
>> - ap->num_pages = 1;
>> + ap->num_pages = READDIR_PAGES;
>
> No. This is the array lenght, which is 1. This is the hack I guess,
> which makes the above trick work.

Hmm, ap->num_pages / ap->pages[] is used in fuse_copy_pages, but so is
ap->descs[] - shouldn't the patch caused an out-of-bound access?
Out of interest, would you mind to explain how the hack worked?


Thanks,
Bernd

2023-07-28 05:27:27

by Jaco Kroon

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir [v2].

Hi,
>>>          plus = fuse_use_readdirplus(inode, ctx);
>>>          ap->args.out_pages = true;
>>> -       ap->num_pages = 1;
>>> +       ap->num_pages = READDIR_PAGES;
>>
>> No.  This is the array lenght, which is 1.  This is the hack I guess,
>> which makes the above trick work.
>
> Hmm, ap->num_pages / ap->pages[] is used in fuse_copy_pages, but so is
> ap->descs[] - shouldn't the patch caused an out-of-bound access?
> Out of interest, would you mind to explain how the hack worked?

Apparently it shouldn't ... my understanding of how pages* worked was
all wrong.

I'm guessing since all the data fits in the first page (ap->pages[0] in
other words, of length/size desc.length) that the other pages are never
accessed.  Looking at fuse_copy_pages this does indeed seem to be the
case.  So I ended up just being really, really lucky here.

Kind regards,
Jaco


2023-07-28 10:19:23

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] fuse: enable larger read buffers for readdir.

On Thu, 27 Jul 2023 at 21:43, Bernd Schubert <[email protected]> wrote:
>
>
>
> On 7/27/23 21:21, Miklos Szeredi wrote:
> > On Wed, 26 Jul 2023 at 20:40, Jaco Kroon <[email protected]> wrote:
> >
> >> Will look into FUSE_INIT. The FUSE_INIT as I understand from what I've
> >> read has some expansion constraints or the structure is somehow
> >> negotiated. Older clients in other words that's not aware of the option
> >> will follow some default. For backwards compatibility that default
> >> should probably be 1 page. For performance reasons it makes sense that
> >> this limit be larger.
> >
> > Yes, might need this for backward compatibility. But perhaps a
> > feature flag is enough and the readdir buf can be limited to
> > fc->max_read.
>
> fc->max_read is set by default to ~0 and only set to something else when
> the max_read mount option is given? So typically that is a large value
> (UINT_MAX)?

That's fine. It probably still makes sense to limit it to 128k, but
with the ctx->count patch it would be naturally limited by the size of
the userspace buffer. There's really no downside to enabling a large
buffer, other than an unlikely regression in userspace. If server
wants to return less entries, it still can. Unlike plain reads,
filling the buffer to the fullest extent isn't required for readdir.

So the buffer size calculation can be somthing like:

init:
#define FUSE_READDIR_MAX (128*1024)
fc->max_readdir = PAGE_SIZE;
if (flags & FUSE_LARGE_READDIR)
fc->max_readdir = min(fc->max_read, FUSE_READDIR_MAX);

[...]
readdir:
bufsize = min(fc->max_readdir, max(ctx->count, PAGE_SIZE));

Thanks,
Miklos