2016-10-17 12:18:28

by Kirill A. Shutemov

Subject: [PATCH] shmem: avoid huge pages for small files

Huge pages are detrimental for small files: they cause noticeable
overhead in both allocation performance and memory footprint.

This patch aims to address the issue by avoiding huge pages until the
file has grown to a specified size. This covers most of the cases where
huge pages cause performance regressions.

By default, the minimal file size required to allocate huge pages is
equal to the huge page size.

We add two handles to specify the minimal file size for huge pages:

- mount option 'huge_min_size';

- sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
in-kernel tmpfs mountpoint;

A few notes:

- if shmem_enabled is set to 'force', the limit is ignored. We still
want to generate as many huge pages as possible for functional testing.

- the limit doesn't affect khugepaged behaviour: it can still collapse
pages based on its settings;

- a remount of the filesystem doesn't affect previously allocated pages,
but the limit is applied to new allocations;

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/transhuge.txt | 6 +++++
include/linux/huge_mm.h | 1 +
include/linux/shmem_fs.h | 1 +
mm/huge_memory.c | 1 +
mm/shmem.c | 56 ++++++++++++++++++++++++++++++++++++++----
5 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 2ec6adb5a4ce..40006d193687 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -238,6 +238,12 @@ values:
- "force":
Force the huge option on for all - very useful for testing;

+Tehre's limit on minimal file size before kenrel starts allocate huge
+pages for it. By default it's size of huge page.
+
+You can adjust the limit using "huge_min_size=" mount option or
+/sys/kernel/mm/transparent_hugepage/shmem_min_size for in-kernel mount.
+
== Need of application restart ==

The transparent_hugepage/enabled values and tmpfs mount option only affect
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9b9f65d99873..515b96a5a592 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -52,6 +52,7 @@ extern ssize_t single_hugepage_flag_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf,
enum transparent_hugepage_flag flag);
extern struct kobj_attribute shmem_enabled_attr;
+extern struct kobj_attribute shmem_min_size_attr;

#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index ff078e7043b6..e7c3bddc6335 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -31,6 +31,7 @@ struct shmem_sb_info {
spinlock_t stat_lock; /* Serialize shmem_sb_info changes */
umode_t mode; /* Mount mode for root directory */
unsigned char huge; /* Whether to try for hugepages */
+ loff_t huge_min_size; /* No hugepages if i_size less than this */
kuid_t uid; /* Mount uid for root directory */
kgid_t gid; /* Mount gid for root directory */
struct mempolicy *mpol; /* default memory policy for mappings */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cdcd25cb30fe..fa133eb5bf62 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -309,6 +309,7 @@ static struct attribute *hugepage_attr[] = {
&use_zero_page_attr.attr,
#if defined(CONFIG_SHMEM) && defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE)
&shmem_enabled_attr.attr,
+ &shmem_min_size_attr.attr,
#endif
#ifdef CONFIG_DEBUG_VM
&debug_cow_attr.attr,
diff --git a/mm/shmem.c b/mm/shmem.c
index ad7813d73ea7..c69047386e2f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -369,6 +369,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
/* ifdef here to avoid bloating shmem.o when not necessary */

int shmem_huge __read_mostly;
+unsigned long long shmem_huge_min_size = HPAGE_PMD_SIZE __read_mostly;

static int shmem_parse_huge(const char *str)
{
@@ -1668,6 +1669,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
swap_free(swap);

} else {
+ loff_t i_size;
+
/* shmem_symlink() */
if (mapping->a_ops != &shmem_aops)
goto alloc_nohuge;
@@ -1675,14 +1678,17 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
goto alloc_nohuge;
if (shmem_huge == SHMEM_HUGE_FORCE)
goto alloc_huge;
+ i_size = i_size_read(inode);
+ if (i_size < sbinfo->huge_min_size &&
+ index < (sbinfo->huge_min_size >> PAGE_SHIFT))
+ goto alloc_nohuge;
switch (sbinfo->huge) {
- loff_t i_size;
pgoff_t off;
case SHMEM_HUGE_NEVER:
goto alloc_nohuge;
case SHMEM_HUGE_WITHIN_SIZE:
off = round_up(index, HPAGE_PMD_NR);
- i_size = round_up(i_size_read(inode), PAGE_SIZE);
+ i_size = round_up(i_size, PAGE_SIZE);
if (i_size >= HPAGE_PMD_SIZE &&
i_size >> PAGE_SHIFT >= off)
goto alloc_huge;
@@ -3349,6 +3355,10 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
huge != SHMEM_HUGE_NEVER)
goto bad_val;
sbinfo->huge = huge;
+ } else if (!strcmp(this_char, "huge_min_size")) {
+ sbinfo->huge_min_size = memparse(value, &rest);
+ if (*rest)
+ goto bad_val;
#endif
#ifdef CONFIG_NUMA
} else if (!strcmp(this_char,"mpol")) {
@@ -3382,6 +3392,8 @@ static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)
int error = -EINVAL;

config.mpol = NULL;
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
+ config.huge_min_size = HPAGE_PMD_SIZE;
if (shmem_parse_options(data, &config, true))
return error;

@@ -3403,6 +3415,7 @@ static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)

error = 0;
sbinfo->huge = config.huge;
+ sbinfo->huge_min_size = config.huge_min_size;
sbinfo->max_blocks = config.max_blocks;
sbinfo->max_inodes = config.max_inodes;
sbinfo->free_inodes = config.max_inodes - inodes;
@@ -3438,8 +3451,10 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
from_kgid_munged(&init_user_ns, sbinfo->gid));
#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
/* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
- if (sbinfo->huge)
+ if (sbinfo->huge) {
seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
+ seq_printf(seq, ",huge_min_size=%llu", sbinfo->huge_min_size);
+ }
#endif
shmem_show_mpol(seq, sbinfo->mpol);
return 0;
@@ -3542,6 +3557,8 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
sbinfo->mode = S_IRWXUGO | S_ISVTX;
sbinfo->uid = current_fsuid();
sbinfo->gid = current_fsgid();
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
+ sbinfo->huge_min_size = HPAGE_PMD_SIZE;
sb->s_fs_info = sbinfo;

#ifdef CONFIG_TMPFS
@@ -3780,9 +3797,10 @@ int __init shmem_init(void)
}

#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
- if (has_transparent_hugepage() && shmem_huge < SHMEM_HUGE_DENY)
+ if (has_transparent_hugepage() && shmem_huge < SHMEM_HUGE_DENY) {
SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
- else
+ SHMEM_SB(shm_mnt->mnt_sb)->huge_min_size = shmem_huge_min_size;
+ } else
shmem_huge = 0; /* just in case it was patched */
#endif
return 0;
@@ -3848,6 +3866,34 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,

struct kobj_attribute shmem_enabled_attr =
__ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store);
+
+static ssize_t shmem_min_size_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%llu\n", shmem_huge_min_size);
+}
+
+
+static ssize_t shmem_min_size_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t count)
+{
+ unsigned long long size;
+ char *end;
+
+ size = memparse(buf, &end);
+ if (end == buf)
+ return -EINVAL;
+ if (*end == '\n')
+ end++;
+ if (*end != '\0')
+ return -EINVAL;
+ shmem_huge_min_size = size;
+ SHMEM_SB(shm_mnt->mnt_sb)->huge_min_size = size;
+ return end - buf;
+}
+
+struct kobj_attribute shmem_min_size_attr =
+ __ATTR(shmem_min_size, 0644, shmem_min_size_show, shmem_min_size_store);
#endif /* CONFIG_TRANSPARENT_HUGE_PAGECACHE && CONFIG_SYSFS */

#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
--
2.9.3


2016-10-17 12:30:35

by Kirill A. Shutemov

Subject: Re: [PATCH] shmem: avoid huge pages for small files

On Mon, Oct 17, 2016 at 03:18:09PM +0300, Kirill A. Shutemov wrote:
> diff --git a/mm/shmem.c b/mm/shmem.c
> index ad7813d73ea7..c69047386e2f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -369,6 +369,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
> /* ifdef here to avoid bloating shmem.o when not necessary */
>
> int shmem_huge __read_mostly;
> +unsigned long long shmem_huge_min_size = HPAGE_PMD_SIZE __read_mostly;

Arghh.. Last second changes...

This should be

unsigned long long shmem_huge_min_size __read_mostly = HPAGE_PMD_SIZE;


From fd0b01b9797ddf2bef308c506c42d3dd50f11793 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Mon, 17 Oct 2016 14:44:47 +0300
Subject: [PATCH] shmem: avoid huge pages for small files

Huge pages are detrimental for small files: they cause noticeable
overhead in both allocation performance and memory footprint.

This patch aims to address the issue by avoiding huge pages until the
file has grown to a specified size. This covers most of the cases where
huge pages cause performance regressions.

By default, the minimal file size required to allocate huge pages is
equal to the huge page size.

We add two handles to specify the minimal file size for huge pages:

- mount option 'huge_min_size';

- sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
in-kernel tmpfs mountpoint;

A few notes:

- if shmem_enabled is set to 'force', the limit is ignored. We still
want to generate as many huge pages as possible for functional testing.

- the limit doesn't affect khugepaged behaviour: it can still collapse
pages based on its settings;

- a remount of the filesystem doesn't affect previously allocated pages,
but the limit is applied to new allocations;

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/transhuge.txt | 6 +++++
include/linux/huge_mm.h | 1 +
include/linux/shmem_fs.h | 1 +
mm/huge_memory.c | 1 +
mm/shmem.c | 56 ++++++++++++++++++++++++++++++++++++++----
5 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 2ec6adb5a4ce..40006d193687 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -238,6 +238,12 @@ values:
- "force":
Force the huge option on for all - very useful for testing;

+Tehre's limit on minimal file size before kenrel starts allocate huge
+pages for it. By default it's size of huge page.
+
+You can adjust the limit using "huge_min_size=" mount option or
+/sys/kernel/mm/transparent_hugepage/shmem_min_size for in-kernel mount.
+
== Need of application restart ==

The transparent_hugepage/enabled values and tmpfs mount option only affect
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9b9f65d99873..515b96a5a592 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -52,6 +52,7 @@ extern ssize_t single_hugepage_flag_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf,
enum transparent_hugepage_flag flag);
extern struct kobj_attribute shmem_enabled_attr;
+extern struct kobj_attribute shmem_min_size_attr;

#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index ff078e7043b6..e7c3bddc6335 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -31,6 +31,7 @@ struct shmem_sb_info {
spinlock_t stat_lock; /* Serialize shmem_sb_info changes */
umode_t mode; /* Mount mode for root directory */
unsigned char huge; /* Whether to try for hugepages */
+ loff_t huge_min_size; /* No hugepages if i_size less than this */
kuid_t uid; /* Mount uid for root directory */
kgid_t gid; /* Mount gid for root directory */
struct mempolicy *mpol; /* default memory policy for mappings */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cdcd25cb30fe..fa133eb5bf62 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -309,6 +309,7 @@ static struct attribute *hugepage_attr[] = {
&use_zero_page_attr.attr,
#if defined(CONFIG_SHMEM) && defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE)
&shmem_enabled_attr.attr,
+ &shmem_min_size_attr.attr,
#endif
#ifdef CONFIG_DEBUG_VM
&debug_cow_attr.attr,
diff --git a/mm/shmem.c b/mm/shmem.c
index ad7813d73ea7..c15eee0eb885 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -369,6 +369,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
/* ifdef here to avoid bloating shmem.o when not necessary */

int shmem_huge __read_mostly;
+unsigned long long shmem_huge_min_size __read_mostly = HPAGE_PMD_SIZE;

static int shmem_parse_huge(const char *str)
{
@@ -1668,6 +1669,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
swap_free(swap);

} else {
+ loff_t i_size;
+
/* shmem_symlink() */
if (mapping->a_ops != &shmem_aops)
goto alloc_nohuge;
@@ -1675,14 +1678,17 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
goto alloc_nohuge;
if (shmem_huge == SHMEM_HUGE_FORCE)
goto alloc_huge;
+ i_size = i_size_read(inode);
+ if (i_size < sbinfo->huge_min_size &&
+ index < (sbinfo->huge_min_size >> PAGE_SHIFT))
+ goto alloc_nohuge;
switch (sbinfo->huge) {
- loff_t i_size;
pgoff_t off;
case SHMEM_HUGE_NEVER:
goto alloc_nohuge;
case SHMEM_HUGE_WITHIN_SIZE:
off = round_up(index, HPAGE_PMD_NR);
- i_size = round_up(i_size_read(inode), PAGE_SIZE);
+ i_size = round_up(i_size, PAGE_SIZE);
if (i_size >= HPAGE_PMD_SIZE &&
i_size >> PAGE_SHIFT >= off)
goto alloc_huge;
@@ -3349,6 +3355,10 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
huge != SHMEM_HUGE_NEVER)
goto bad_val;
sbinfo->huge = huge;
+ } else if (!strcmp(this_char, "huge_min_size")) {
+ sbinfo->huge_min_size = memparse(value, &rest);
+ if (*rest)
+ goto bad_val;
#endif
#ifdef CONFIG_NUMA
} else if (!strcmp(this_char,"mpol")) {
@@ -3382,6 +3392,8 @@ static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)
int error = -EINVAL;

config.mpol = NULL;
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
+ config.huge_min_size = HPAGE_PMD_SIZE;
if (shmem_parse_options(data, &config, true))
return error;

@@ -3403,6 +3415,7 @@ static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)

error = 0;
sbinfo->huge = config.huge;
+ sbinfo->huge_min_size = config.huge_min_size;
sbinfo->max_blocks = config.max_blocks;
sbinfo->max_inodes = config.max_inodes;
sbinfo->free_inodes = config.max_inodes - inodes;
@@ -3438,8 +3451,10 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
from_kgid_munged(&init_user_ns, sbinfo->gid));
#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
/* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
- if (sbinfo->huge)
+ if (sbinfo->huge) {
seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
+ seq_printf(seq, ",huge_min_size=%llu", sbinfo->huge_min_size);
+ }
#endif
shmem_show_mpol(seq, sbinfo->mpol);
return 0;
@@ -3542,6 +3557,8 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
sbinfo->mode = S_IRWXUGO | S_ISVTX;
sbinfo->uid = current_fsuid();
sbinfo->gid = current_fsgid();
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
+ sbinfo->huge_min_size = HPAGE_PMD_SIZE;
sb->s_fs_info = sbinfo;

#ifdef CONFIG_TMPFS
@@ -3780,9 +3797,10 @@ int __init shmem_init(void)
}

#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
- if (has_transparent_hugepage() && shmem_huge < SHMEM_HUGE_DENY)
+ if (has_transparent_hugepage() && shmem_huge < SHMEM_HUGE_DENY) {
SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
- else
+ SHMEM_SB(shm_mnt->mnt_sb)->huge_min_size = shmem_huge_min_size;
+ } else
shmem_huge = 0; /* just in case it was patched */
#endif
return 0;
@@ -3848,6 +3866,34 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,

struct kobj_attribute shmem_enabled_attr =
__ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store);
+
+static ssize_t shmem_min_size_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%llu\n", shmem_huge_min_size);
+}
+
+
+static ssize_t shmem_min_size_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t count)
+{
+ unsigned long long size;
+ char *end;
+
+ size = memparse(buf, &end);
+ if (end == buf)
+ return -EINVAL;
+ if (*end == '\n')
+ end++;
+ if (*end != '\0')
+ return -EINVAL;
+ shmem_huge_min_size = size;
+ SHMEM_SB(shm_mnt->mnt_sb)->huge_min_size = size;
+ return end - buf;
+}
+
+struct kobj_attribute shmem_min_size_attr =
+ __ATTR(shmem_min_size, 0644, shmem_min_size_show, shmem_min_size_store);
#endif /* CONFIG_TRANSPARENT_HUGE_PAGECACHE && CONFIG_SYSFS */

#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
--
2.9.3

--
Kirill A. Shutemov

2016-10-17 14:12:56

by Michal Hocko

Subject: Re: [PATCH] shmem: avoid huge pages for small files

On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote:
[...]
> >From fd0b01b9797ddf2bef308c506c42d3dd50f11793 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <[email protected]>
> Date: Mon, 17 Oct 2016 14:44:47 +0300
> Subject: [PATCH] shmem: avoid huge pages for small files
>
> Huge pages are detrimental for small file: they causes noticible
> overhead on both allocation performance and memory footprint.
>
> This patch aimed to address this issue by avoiding huge pages until file
> grown to specified size. This would cover most of the cases where huge
> pages causes regressions in performance.
>
> By default the minimal file size to allocate huge pages is equal to size
> of huge page.

ok

> We add two handle to specify minimal file size for huge pages:
>
> - mount option 'huge_min_size';
>
> - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
> in-kernel tmpfs mountpoint;

Could you explain who might like to change the minimum value (other than
to disable the feature for the mount point) and for what reason?

[...]

> @@ -238,6 +238,12 @@ values:
> - "force":
> Force the huge option on for all - very useful for testing;
>
> +Tehre's limit on minimal file size before kenrel starts allocate huge
> +pages for it. By default it's size of huge page.

Smoe tyopse
--
Michal Hocko
SUSE Labs

2016-10-17 14:56:00

by Kirill A. Shutemov

Subject: Re: [PATCH] shmem: avoid huge pages for small files

On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote:
> On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote:
> [...]
> > >From fd0b01b9797ddf2bef308c506c42d3dd50f11793 Mon Sep 17 00:00:00 2001
> > From: "Kirill A. Shutemov" <[email protected]>
> > Date: Mon, 17 Oct 2016 14:44:47 +0300
> > Subject: [PATCH] shmem: avoid huge pages for small files
> >
> > Huge pages are detrimental for small file: they causes noticible
> > overhead on both allocation performance and memory footprint.
> >
> > This patch aimed to address this issue by avoiding huge pages until file
> > grown to specified size. This would cover most of the cases where huge
> > pages causes regressions in performance.
> >
> > By default the minimal file size to allocate huge pages is equal to size
> > of huge page.
>
> ok
>
> > We add two handle to specify minimal file size for huge pages:
> >
> > - mount option 'huge_min_size';
> >
> > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
> > in-kernel tmpfs mountpoint;
>
> Could you explain who might like to change the minimum value (other than
> disable the feautre for the mount point) and for what reason?

Depending on how well CPU microarchitecture deals with huge pages, you
might need to set it higher in order to balance out overhead with benefit
of huge pages.

Conversely, if it's known in advance that a specific mount will be
populated with large files, you might want to set it to zero to get huge
pages allocated from the beginning.
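
For illustration, a minimal sketch (not part of the patch) of driving both
knobs from C; the mount point and the 4M value are made up, and
'huge_min_size'/'shmem_min_size' are the interfaces proposed in this patch:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* tmpfs dedicated to large files: allocate huge pages from byte 1 */
        if (mount("tmpfs", "/mnt/bigfiles", "tmpfs", 0,
                  "huge=always,huge_min_size=0"))
            perror("mount");

        /* raise the limit for the in-kernel tmpfs mount via sysfs */
        FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/shmem_min_size", "w");
        if (f) {
            fputs("4M\n", f);   /* memparse() format, as in the store handler */
            fclose(f);
        } else {
            perror("shmem_min_size");
        }
        return 0;
    }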

> > @@ -238,6 +238,12 @@ values:
> > - "force":
> > Force the huge option on for all - very useful for testing;
> >
> > +Tehre's limit on minimal file size before kenrel starts allocate huge
> > +pages for it. By default it's size of huge page.
>
> Smoe tyopse

Wlil fxi!

--
Kirill A. Shutemov

2016-10-18 14:20:17

by Michal Hocko

Subject: Re: [PATCH] shmem: avoid huge pages for small files

On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote:
> On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote:
> > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote:
[...]
> > > We add two handle to specify minimal file size for huge pages:
> > >
> > > - mount option 'huge_min_size';
> > >
> > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
> > > in-kernel tmpfs mountpoint;
> >
> > Could you explain who might like to change the minimum value (other than
> > disable the feautre for the mount point) and for what reason?
>
> Depending on how well CPU microarchitecture deals with huge pages, you
> might need to set it higher in order to balance out overhead with benefit
> of huge pages.

I am not sure this is a good argument. How does a user know, and what
will help them make that decision? Why can we not autotune that? In other
words, adding new knobs just in case has turned out to be a bad idea in
the past.

> In other case, if it's known in advance that specific mount would be
> populated with large files, you might want to set it to zero to get huge
> pages allocated from the beginning.

Cannot we use [mf]advise for that purpose?
--
Michal Hocko
SUSE Labs

2016-10-18 14:32:27

by Kirill A. Shutemov

Subject: Re: [PATCH] shmem: avoid huge pages for small files

On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote:
> On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote:
> > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote:
> > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote:
> [...]
> > > > We add two handle to specify minimal file size for huge pages:
> > > >
> > > > - mount option 'huge_min_size';
> > > >
> > > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
> > > > in-kernel tmpfs mountpoint;
> > >
> > > Could you explain who might like to change the minimum value (other than
> > > disable the feautre for the mount point) and for what reason?
> >
> > Depending on how well CPU microarchitecture deals with huge pages, you
> > might need to set it higher in order to balance out overhead with benefit
> > of huge pages.
>
> I am not sure this is a good argument. How do a user know and what will
> help to make that decision? Why we cannot autotune that? In other words,
> adding new knobs just in case turned out to be a bad idea in the past.

Well, I don't see a reasonable way to autotune it. We can just let
arch-specific code redefine it, but the argument below still stands.

> > In other case, if it's known in advance that specific mount would be
> > populated with large files, you might want to set it to zero to get huge
> > pages allocated from the beginning.
>
> Cannot we use [mf]advise for that purpose?

There's no fadvise for this at the moment. We can use madvise, except that
the patch makes it lower priority than the limit :P. I'll fix that.

But in general, it would require a change to the program, which is not
always desirable or even possible.
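
For example, the per-mapping change would look roughly like this (a sketch
only; the file path is made up, and MADV_HUGEPAGE takes effect only where
THP is allowed for the mapping, e.g. huge=advise on the mount):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 4UL << 20;     /* 4MB window */
        int fd = open("/dev/shm/data", O_RDWR | O_CREAT, 0600);

        if (fd < 0 || ftruncate(fd, len))
            return 1;

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        /* opt this mapping in to huge pages, regardless of file size */
        madvise(p, len, MADV_HUGEPAGE);

        /* ... touch the mapping ... */
        munmap(p, len);
        close(fd);
        return 0;
    }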

--
Kirill A. Shutemov

2016-10-18 18:30:36

by Michal Hocko

Subject: Re: [PATCH] shmem: avoid huge pages for small files

On Tue 18-10-16 17:32:07, Kirill A. Shutemov wrote:
> On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote:
> > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote:
> > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote:
> > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote:
> > [...]
> > > > > We add two handle to specify minimal file size for huge pages:
> > > > >
> > > > > - mount option 'huge_min_size';
> > > > >
> > > > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
> > > > > in-kernel tmpfs mountpoint;
> > > >
> > > > Could you explain who might like to change the minimum value (other than
> > > > disable the feautre for the mount point) and for what reason?
> > >
> > > Depending on how well CPU microarchitecture deals with huge pages, you
> > > might need to set it higher in order to balance out overhead with benefit
> > > of huge pages.
> >
> > I am not sure this is a good argument. How do a user know and what will
> > help to make that decision? Why we cannot autotune that? In other words,
> > adding new knobs just in case turned out to be a bad idea in the past.
>
> Well, I don't see a reasonable way to autotune it. We can just let
> arch-specific code to redefine it, but the argument below still stands.
>
> > > In other case, if it's known in advance that specific mount would be
> > > populated with large files, you might want to set it to zero to get huge
> > > pages allocated from the beginning.

Do you think this is a sufficient reason to provide a tunable with such
precision? In other words, why can't we simply start by using an
internal-only limit at the huge page size for the initial transition
(with a way to disable THP altogether for a mount point), and only add
more fine-grained tuning if there ever is a real need for it, with a use
case description? Can we be less optimistic about tunables than we used
to be in the past, when we often found out that those were mistakes much
later?
--
Michal Hocko
SUSE Labs

2016-10-19 18:14:05

by Hugh Dickins

Subject: Re: [PATCH] shmem: avoid huge pages for small files

On Tue, 18 Oct 2016, Michal Hocko wrote:
> On Tue 18-10-16 17:32:07, Kirill A. Shutemov wrote:
> > On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote:
> > > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote:
> > > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote:
> > > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote:
> > > [...]
> > > > > > We add two handle to specify minimal file size for huge pages:
> > > > > >
> > > > > > - mount option 'huge_min_size';
> > > > > >
> > > > > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
> > > > > > in-kernel tmpfs mountpoint;
> > > > >
> > > > > Could you explain who might like to change the minimum value (other than
> > > > > disable the feautre for the mount point) and for what reason?
> > > >
> > > > Depending on how well CPU microarchitecture deals with huge pages, you
> > > > might need to set it higher in order to balance out overhead with benefit
> > > > of huge pages.
> > >
> > > I am not sure this is a good argument. How do a user know and what will
> > > help to make that decision? Why we cannot autotune that? In other words,
> > > adding new knobs just in case turned out to be a bad idea in the past.
> >
> > Well, I don't see a reasonable way to autotune it. We can just let
> > arch-specific code to redefine it, but the argument below still stands.
> >
> > > > In other case, if it's known in advance that specific mount would be
> > > > populated with large files, you might want to set it to zero to get huge
> > > > pages allocated from the beginning.
>
> Do you think this is a sufficient reason to provide a tunable with such a
> precision? In other words why cannot we simply start by using an
> internal only limit at the huge page size for the initial transition
> (with a way to disable THP altogether for a mount point) and only add a
> more fine grained tunning if there ever is a real need for it with a use
> case description. In other words can we be less optimistic about
> tunables than we used to be in the past and often found out that those
> were mistakes much later?

I'm not sure whether I'm arguing in the same or the opposite direction
as you, Michal, but what makes me unhappy is not so much the tunable,
as the proliferation of mount options.

Kirill, this issue is (not exactly but close enough) what the mount
option "huge=within_size" was supposed to be about: not wasting huge
pages on small files. I'd be much happier if you made huge_min_size
into a /sys/kernel/mm/transparent_hugepage/shmem_within_size tunable,
and used it to govern "huge=within_size" mounts only.

Hugh

2016-10-20 10:39:52

by Kirill A. Shutemov

Subject: Re: [PATCH] shmem: avoid huge pages for small files

On Wed, Oct 19, 2016 at 11:13:54AM -0700, Hugh Dickins wrote:
> On Tue, 18 Oct 2016, Michal Hocko wrote:
> > On Tue 18-10-16 17:32:07, Kirill A. Shutemov wrote:
> > > On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote:
> > > > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote:
> > > > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote:
> > > > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote:
> > > > [...]
> > > > > > > We add two handle to specify minimal file size for huge pages:
> > > > > > >
> > > > > > > - mount option 'huge_min_size';
> > > > > > >
> > > > > > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
> > > > > > > in-kernel tmpfs mountpoint;
> > > > > >
> > > > > > Could you explain who might like to change the minimum value (other than
> > > > > > disable the feautre for the mount point) and for what reason?
> > > > >
> > > > > Depending on how well CPU microarchitecture deals with huge pages, you
> > > > > might need to set it higher in order to balance out overhead with benefit
> > > > > of huge pages.
> > > >
> > > > I am not sure this is a good argument. How do a user know and what will
> > > > help to make that decision? Why we cannot autotune that? In other words,
> > > > adding new knobs just in case turned out to be a bad idea in the past.
> > >
> > > Well, I don't see a reasonable way to autotune it. We can just let
> > > arch-specific code to redefine it, but the argument below still stands.
> > >
> > > > > In other case, if it's known in advance that specific mount would be
> > > > > populated with large files, you might want to set it to zero to get huge
> > > > > pages allocated from the beginning.
> >
> > Do you think this is a sufficient reason to provide a tunable with such a
> > precision? In other words why cannot we simply start by using an
> > internal only limit at the huge page size for the initial transition
> > (with a way to disable THP altogether for a mount point) and only add a
> > more fine grained tunning if there ever is a real need for it with a use
> > case description. In other words can we be less optimistic about
> > tunables than we used to be in the past and often found out that those
> > were mistakes much later?
>
> I'm not sure whether I'm arguing in the same or the opposite direction
> as you, Michal, but what makes me unhappy is not so much the tunable,
> as the proliferation of mount options.
>
> Kirill, this issue is (not exactly but close enough) what the mount
> option "huge=within_size" was supposed to be about: not wasting huge
> pages on small files. I'd be much happier if you made huge_min_size
> into a /sys/kernel/mm/transparent_hugepage/shmem_within_size tunable,
> and used it to govern "huge=within_size" mounts only.

Well, you're right that I originally tried to address the issue with
huge=within_size, but this option makes much more sense for filesystems
with persistent storage. For ext4, it would be a pretty usable option.

What you propose would change the semantics of the option, and it would
diverge from how it works on ext4.

I guess it may make sense, taking into account that shmem/tmpfs is
special in the sense that we always start with an empty filesystem.

If everybody agrees, I'll respin the patch with a single tunable that
manages all huge=within_size mounts.

--
Kirill A. Shutemov

2016-10-20 22:46:37

by Dave Chinner

Subject: Re: [PATCH] shmem: avoid huge pages for small files

On Thu, Oct 20, 2016 at 01:39:46PM +0300, Kirill A. Shutemov wrote:
> On Wed, Oct 19, 2016 at 11:13:54AM -0700, Hugh Dickins wrote:
> > On Tue, 18 Oct 2016, Michal Hocko wrote:
> > > On Tue 18-10-16 17:32:07, Kirill A. Shutemov wrote:
> > > > On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote:
> > > > > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote:
> > > > > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote:
> > > > > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote:
> > > > > [...]
> > > > > > > > We add two handle to specify minimal file size for huge pages:
> > > > > > > >
> > > > > > > > - mount option 'huge_min_size';
> > > > > > > >
> > > > > > > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
> > > > > > > > in-kernel tmpfs mountpoint;
> > > > > > >
> > > > > > > Could you explain who might like to change the minimum value (other than
> > > > > > > disable the feautre for the mount point) and for what reason?
> > > > > >
> > > > > > Depending on how well CPU microarchitecture deals with huge pages, you
> > > > > > might need to set it higher in order to balance out overhead with benefit
> > > > > > of huge pages.
> > > > >
> > > > > I am not sure this is a good argument. How do a user know and what will
> > > > > help to make that decision? Why we cannot autotune that? In other words,
> > > > > adding new knobs just in case turned out to be a bad idea in the past.
> > > >
> > > > Well, I don't see a reasonable way to autotune it. We can just let
> > > > arch-specific code to redefine it, but the argument below still stands.
> > > >
> > > > > > In other case, if it's known in advance that specific mount would be
> > > > > > populated with large files, you might want to set it to zero to get huge
> > > > > > pages allocated from the beginning.
> > >
> > > Do you think this is a sufficient reason to provide a tunable with such a
> > > precision? In other words why cannot we simply start by using an
> > > internal only limit at the huge page size for the initial transition
> > > (with a way to disable THP altogether for a mount point) and only add a
> > > more fine grained tunning if there ever is a real need for it with a use
> > > case description. In other words can we be less optimistic about
> > > tunables than we used to be in the past and often found out that those
> > > were mistakes much later?
> >
> > I'm not sure whether I'm arguing in the same or the opposite direction
> > as you, Michal, but what makes me unhappy is not so much the tunable,
> > as the proliferation of mount options.
> >
> > Kirill, this issue is (not exactly but close enough) what the mount
> > option "huge=within_size" was supposed to be about: not wasting huge
> > pages on small files. I'd be much happier if you made huge_min_size
> > into a /sys/kernel/mm/transparent_hugepage/shmem_within_size tunable,
> > and used it to govern "huge=within_size" mounts only.
>
> Well, you're right that I tried originally address the issue with
> huge=within_size, but this option makes much more sense for filesystem
> with persistent storage. For ext4, it would be pretty usable option.

Ugh, no, please don't use mount options for file specific behaviours
in filesystems like ext4 and XFS. This is exactly the sort of
behaviour that should either just work automatically (i.e. be
completely controlled by the filesystem) or only be applied to files
specifically configured with persistent hints to reliably allocate
extents in a way that can be easily mapped to huge pages.

e.g. on XFS you will need to apply extent size hints to get large
page sized/aligned extent allocation to occur, and so this
persistent extent size hint should trigger the filesystem to use
large pages if supported, the hint is correctly sized and aligned,
and there are large pages available for allocation.
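
For reference, a persistent hint like that can be set today through the
FS_IOC_FSSETXATTR ioctl (what xfs_io's "extsize" command uses underneath);
a rough sketch, assuming Linux 4.5+ headers and a made-up path, applied
before the file has any data:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <unistd.h>

    int main(void)
    {
        struct fsxattr fsx;
        int fd = open("/mnt/xfs/file", O_RDWR);

        if (fd < 0 || ioctl(fd, FS_IOC_FSGETXATTR, &fsx)) {
            perror("FS_IOC_FSGETXATTR");
            return 1;
        }

        /* 2MB extent size hint: allocations for this file come in
         * huge-page-sized chunks the page cache could map directly */
        fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
        fsx.fsx_extsize = 2 * 1024 * 1024;

        if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx)) {
            perror("FS_IOC_FSSETXATTR");
            return 1;
        }
        close(fd);
        return 0;
    }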

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-10-21 02:01:19

by Andi Kleen

Subject: Re: [PATCH] shmem: avoid huge pages for small files

> Ugh, no, please don't use mount options for file specific behaviours
> in filesystems like ext4 and XFS. This is exactly the sort of
> behaviour that should either just work automatically (i.e. be
> completely controlled by the filesystem) or only be applied to files

Can you explain what you mean? How would the file system control it?

> specifically configured with persistent hints to reliably allocate
> extents in a way that can be easily mapped to huge pages.

> e.g. on XFS you will need to apply extent size hints to get large
> page sized/aligned extent allocation to occur, and so this

It sounds like you're confusing alignment in memory with alignment
on disk here? I don't see why on-disk alignment would be needed
at all, unless we're talking about DAX here (which is out of
scope currently). Kirill's changes are all about making the memory
access for cached data more efficient; it's not about disk layout
optimizations.

> persistent extent size hint should trigger the filesystem to use
> large pages if supported, the hint is correctly sized and aligned,
> and there are large pages available for allocation.

That would be ioctls and similar?

That would imply that every application wanting to use large pages
would need to be specially enabled. That would seem awfully limiting
to me and needlessly deny benefits to most existing code.

-Andi

2016-10-21 05:01:24

by Dave Chinner

Subject: Re: [PATCH] shmem: avoid huge pages for small files

On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote:
> > Ugh, no, please don't use mount options for file specific behaviours
> > in filesystems like ext4 and XFS. This is exactly the sort of
> > behaviour that should either just work automatically (i.e. be
> > completely controlled by the filesystem) or only be applied to files
>
> Can you explain what you mean? How would the file system control it?

There's no point in asking for huge pages when populating the page
cache if the file is:

- significantly smaller than the huge page size
- largely sparse
- being randomly accessed in small chunks
- badly fragmented and so takes hundreds of IO to read/write
a huge page
- able to optimise delayed allocation to match huge page
sizes and alignments

These are all constraints the filesystem knows about, but the
application and user don't. None of these aspects can be optimised
sanely by a single threshold, especially when considering the
combination of access patterns vs file layout.

Further, we are moving the IO path to a model where we use extents
for mapping, not blocks. We're optimising for the fact that modern
filesystems use extents and so massively reduce the number of block
mapping lookup calls we need to do for a given IO.

i.e. instead of doing "get page, map block to page" over and over
again until we've walked over the entire IO range, we're doing
"map extent for entire IO range" once, then iterating "get page"
until we've mapped the entire range.

Hence if we have a 2MB IO come in from userspace, and the iomap
returned covers that entire range, it's a no-brainer to ask the
page cache for a huge page instead of iterating 512 times to map all
the 4k pages needed.
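
As a toy illustration of the difference (a standalone userspace sketch, not
kernel code; map_block()/map_extent() are hypothetical stand-ins that only
count how many mapping lookups each model needs for one 2MB IO):

    #include <stdio.h>

    #define PAGE_SZ 4096UL
    #define IO_LEN  (2UL * 1024 * 1024)

    static unsigned long lookups;

    /* stand-in for a per-block ->get_block() style lookup */
    static void map_block(unsigned long off) { (void)off; lookups++; }

    /* stand-in for an ->iomap_begin() style extent lookup */
    static void map_extent(unsigned long off, unsigned long len)
    { (void)off; (void)len; lookups++; }

    int main(void)
    {
        unsigned long off;

        /* old model: "get page, map block to page" per 4k page */
        lookups = 0;
        for (off = 0; off < IO_LEN; off += PAGE_SZ)
            map_block(off);
        printf("block-at-a-time: %lu lookups\n", lookups);    /* 512 */

        /* iomap model: map the extent once, then only "get page" */
        lookups = 0;
        map_extent(0, IO_LEN);
        printf("extent-at-a-time: %lu lookup\n", lookups);    /* 1 */

        return 0;
    }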

> > specifically configured with persistent hints to reliably allocate
> > extents in a way that can be easily mapped to huge pages.
>
> > e.g. on XFS you will need to apply extent size hints to get large
> > page sized/aligned extent allocation to occur, and so this
>
> It sounds like you're confusing alignment in memory with alignment
> on disk here? I don't see why on disk alignment would be needed
> at all, unless we're talking about DAX here (which is out of
> scope currently) Kirill's changes are all about making the memory
> access for cached data more efficient, it's not about disk layout
> optimizations.

No, I'm not confusing this with DAX. However, this automatic use
model for huge pages fits straight into DAX as well. Same
mechanisms, same behaviours, slightly stricter alignment
characteristics. All stuff the filesystem already knows about.

Mount options are, quite frankly, a terrible mechanism for
specifying filesystem policy. Setting up DAX this way was a mistake,
and it's a mount option I plan to remove from XFS once we get nearer
to having DAX feature complete and stabilised. We've already got
on-disk "use DAX for this file" flags in XFS, so we can easily and
cleanly support different methods of accessing PMEM from the same
filesystem.

As such, there is no way we should be considering different
interfaces and methods for configuring the /same functionality/ just
because DAX is enabled or not. It's the /same decision/ that needs
to be made, and the filesystem knows an awful lot more about whether
huge pages can be used efficiently at the time of access than just
about any other actor you can name....

> > persistent extent size hint should trigger the filesystem to use
> > large pages if supported, the hint is correctly sized and aligned,
> > and there are large pages available for allocation.
>
> That would be ioctls and similar?

You can, but existing filesystem admin tools can already set up
allocation policies without the apps being aware that they even
exist. If you want to use huge page mappings with DAX you'll already
need to do this because of the physical alignment requirements of
DAX.

Further, such techniques are already used by many admins for things
like limiting fragmentation of sparse vm image files. So while you
may not know it, extent size hints and per-file inheritable
attributes are quite widely used already to manage filesystem
behaviour without users or applications even being aware that the
filesystem policies have been modified by the admin...

> That would imply that every application wanting to use large pages
> would need to be especially enabled. That would seem awfully limiting
> to me and needlessly deny benefits to most existing code.

No change to applications will be necessary (see above), though
there's no reason why couldn't directly use the VFS interfaces to
explicitly ask for such behaviour themselves....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-10-21 15:00:14

by Kirill A. Shutemov

Subject: Re: [PATCH] shmem: avoid huge pages for small files

On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
> On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote:
> > > Ugh, no, please don't use mount options for file specific behaviours
> > > in filesystems like ext4 and XFS. This is exactly the sort of
> > > behaviour that should either just work automatically (i.e. be
> > > completely controlled by the filesystem) or only be applied to files
> >
> > Can you explain what you mean? How would the file system control it?
>
> There's no point in asking for huge pages when populating the page
> cache if the file is:
>
> - significantly smaller than the huge page size
> - largely sparse
> - being randomly accessed in small chunks
> - badly fragmented and so takes hundreds of IO to read/write
> a huge page
> - able to optimise delayed allocation to match huge page
> sizes and alignments
>
> These are all constraints the filesystem knows about, but the
> application and user don't.

Really?

To me, most of the things you're talking about are highly dependent on
the access pattern generated by userspace:

- we may want to allocate huge pages from byte 1 if we know that file
will grow;
- the same for sparse file that will be filled;
- it will be beneficial to allocate huge page even for fragmented files,
if it's read-mostly;

> None of these aspects can be optimised sanely by a single threshold,
> especially when considering the combination of access patterns vs file
> layout.

I agree.

Here I tried to address the particular performance regression I see with
huge pages enabled on tmpfs. It's not meant to fix all possible issues.

> Further, we are moving the IO path to a model where we use extents
> for mapping, not blocks. We're optimising for the fact that modern
> filesystems use extents and so massively reduce the number of block
> mapping lookup calls we need to do for a given IO.
>
> i.e. instead of doing "get page, map block to page" over and over
> again until we've alked over the entire IO range, we're doing
> "map extent for entire IO range" once, then iterating "get page"
> until we've mapped the entire range.

That's great, but it's not how the IO path works *now*. And it will take
a long time (if ever) to flip it over to what you've described.

> Hence if we have a 2MB IO come in from userspace, and the iomap
> returned is a covers that entire range, it's a no-brainer to ask the
> page cache for a huge page instead of iterating 512 times to map all
> the 4k pages needed.

Yeah, it's a no-brainer.

But do we want to limit huge page allocation only to such best-possible
cases? I have hardly ever seen 2MB IOs in the real world...

And this approach will put too much decision power on the first access to
the file range. It may or may not represent the future access pattern.

> > > specifically configured with persistent hints to reliably allocate
> > > extents in a way that can be easily mapped to huge pages.
> >
> > > e.g. on XFS you will need to apply extent size hints to get large
> > > page sized/aligned extent allocation to occur, and so this
> >
> > It sounds like you're confusing alignment in memory with alignment
> > on disk here? I don't see why on disk alignment would be needed
> > at all, unless we're talking about DAX here (which is out of
> > scope currently) Kirill's changes are all about making the memory
> > access for cached data more efficient, it's not about disk layout
> > optimizations.
>
> No, I'm not confusing this with DAX. However, this automatic use
> model for huge pages fits straight into DAX as well. Same
> mechanisms, same behaviours, slightly stricter alignment
> characteristics. All stuff the filesystem already knows about.
>
> Mount options are, quite frankly, a terrible mechanism for
> specifying filesystem policy. Setting up DAX this way was a mistake,
> and it's a mount option I plan to remove from XFS once we get nearer
> to having DAX feature complete and stablised. We've already got
> on-disk "use DAX for this file" flags in XFS, so we can easier and
> cleanly support different methods of accessing PMEM from the same
> filesystem.
>
> As such, there is no way we should be considering different
> interfaces and methods for configuring the /same functionality/ just
> because DAX is enabled or not. It's the /same decision/ that needs
> to be made, and the filesystem knows an awful lot more about whether
> huge pages can be used efficiently at the time of access than just
> about any other actor you can name....

I'm not convinced that the filesystem is in a better position to see
access patterns than the mm is for the page cache. It's not all about
on-disk layout.

--
Kirill A. Shutemov

2016-10-21 15:12:46

by Michal Hocko

Subject: Re: [PATCH] shmem: avoid huge pages for small files

On Fri 21-10-16 18:00:07, Kirill A. Shutemov wrote:
> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
[...]
> > None of these aspects can be optimised sanely by a single threshold,
> > especially when considering the combination of access patterns vs file
> > layout.
>
> I agree.
>
> Here I tried to address the particular performance regression I see with
> huge pages enabled on tmpfs. It doesn't mean to fix all possible issues.

So can we start simple and use huge pages on shmem mappings only when
they are larger than the huge page? Without any tunable which might turn
out to be misleading/wrong later on. If I understand Dave's comments it
is really not all that clear that a mount option makes sense. I cannot
comment on those but they clearly show that there are multiple points of
view here.
--
Michal Hocko
SUSE Labs

2016-10-21 22:50:19

by Dave Chinner

Subject: Re: [PATCH] shmem: avoid huge pages for small files

On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote:
> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
> > On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote:
> > > > Ugh, no, please don't use mount options for file specific behaviours
> > > > in filesystems like ext4 and XFS. This is exactly the sort of
> > > > behaviour that should either just work automatically (i.e. be
> > > > completely controlled by the filesystem) or only be applied to files
> > >
> > > Can you explain what you mean? How would the file system control it?
> >
> > There's no point in asking for huge pages when populating the page
> > cache if the file is:
> >
> > - significantly smaller than the huge page size
> > - largely sparse
> > - being randomly accessed in small chunks
> > - badly fragmented and so takes hundreds of IO to read/write
> > a huge page
> > - able to optimise delayed allocation to match huge page
> > sizes and alignments
> >
> > These are all constraints the filesystem knows about, but the
> > application and user don't.
>
> Really?
>
> To me, most of things you're talking about is highly dependent on access
> pattern generated by userspace:
>
> - we may want to allocate huge pages from byte 1 if we know that file
> will grow;

delayed allocation takes care of that. We use a growing speculative
delalloc size that kicks in at specific sizes and can be used
directly to determine if a large page should be allocated. This code
is aware of sparse files, sparse writes, etc.

> - the same for sparse file that will be filled;

See above.

> - it will be beneficial to allocate huge page even for fragmented files,
> if it's read-mostly;

No, no it won't. The IO latency impact here can be massive.
Read-ahead of single 4k pages hides most of this latency from the
application, but with a 2MB page, we can't use readahead to hide this
IO latency because the first access could stall for hundreds of
small random read IOs to be completed instead of just 1.


> > Further, we are moving the IO path to a model where we use extents
> > for mapping, not blocks. We're optimising for the fact that modern
> > filesystems use extents and so massively reduce the number of block
> > mapping lookup calls we need to do for a given IO.
> >
> > i.e. instead of doing "get page, map block to page" over and over
> > again until we've alked over the entire IO range, we're doing
> > "map extent for entire IO range" once, then iterating "get page"
> > until we've mapped the entire range.
>
> That's great, but it's not how IO path works *now*. And will takes a long
> time (if ever) to flip it over to what you've described.

Wrong. fs/iomap.c. XFS already uses it, ext4 is being converted
right now, GFS2 will use parts of it in the next release, DAX
already uses it and PMD support in DAX is being built on top of it.

> > As such, there is no way we should be considering different
> > interfaces and methods for configuring the /same functionality/ just
> > because DAX is enabled or not. It's the /same decision/ that needs
> > to be made, and the filesystem knows an awful lot more about whether
> > huge pages can be used efficiently at the time of access than just
> > about any other actor you can name....
>
> I'm not convinced that filesystem is in better position to see access
> patterns than mm for page cache. It's not all about on-disk layout.

Spoken like a true mm developer. IO performance is all about IO
patterns, and the primary contributor to bad IO patterns is bad
filesystem allocation patterns.... :P

We're rapidly moving away from the world where a page cache is
needed to give applications decent performance. DAX doesn't have a
page cache; applications wanting to use high IOPS (hundreds of
thousands to millions) storage are using direct IO, because the page
cache just introduces latency, memory usage issues and
non-deterministic IO behaviour.

If we try to make the page cache the "one true IO optimisation source"
then we're screwing ourselves because the incoming IO technologies
simply don't require it anymore. We need to be ahead of that curve,
not playing catchup, and that's why this sort of "what should the
page cache do" decisions really need to come from the IO path where
we see /all/ the IO, not just buffered IO....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-10-21 23:32:50

by Kirill A. Shutemov

Subject: Re: [PATCH] shmem: avoid huge pages for small files

On Sat, Oct 22, 2016 at 09:50:13AM +1100, Dave Chinner wrote:
> On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
> > > On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote:
> > > > > Ugh, no, please don't use mount options for file specific behaviours
> > > > > in filesystems like ext4 and XFS. This is exactly the sort of
> > > > > behaviour that should either just work automatically (i.e. be
> > > > > completely controlled by the filesystem) or only be applied to files
> > > >
> > > > Can you explain what you mean? How would the file system control it?
> > >
> > > There's no point in asking for huge pages when populating the page
> > > cache if the file is:
> > >
> > > - significantly smaller than the huge page size
> > > - largely sparse
> > > - being randomly accessed in small chunks
> > > - badly fragmented and so takes hundreds of IO to read/write
> > > a huge page
> > > - able to optimise delayed allocation to match huge page
> > > sizes and alignments
> > >
> > > These are all constraints the filesystem knows about, but the
> > > application and user don't.
> >
> > Really?
> >
> > To me, most of things you're talking about is highly dependent on access
> > pattern generated by userspace:
> >
> > - we may want to allocate huge pages from byte 1 if we know that file
> > will grow;
>
> delayed allocation takes care of that. We use a growing speculative
> delalloc size that kicks in at specific sizes and can be used
> directly to determine if a large page shoul dbe allocated. This code
> is aware of sparse files, sparse writes, etc.

I'm confused here. How can we delay allocation of page cache?

Delalloc is helpful for getting a reasonable on-disk layout, but my
understanding is that it uses the page cache as a buffer to postpone
block allocation. Later, on writeback, we see the access pattern using
pages from the page cache.

I'm likely missing something important here. Hm?

> > - it will be beneficial to allocate huge page even for fragmented files,
> > if it's read-mostly;
>
> No, no it won't. The IO latency impact here can be massive.
> read-ahead of single 4k pages hides most of this latency from the
> application, but with a 2MB page, we can't use readhead to hide this
> IO latency because the first access could stall for hundreds of
> small random read IOs to be completed instead of just 1.

I agree that it will lead to an initial latency spike. But don't we have
workloads which would tolerate it to get faster hot-cache behaviour?

> > > Further, we are moving the IO path to a model where we use extents
> > > for mapping, not blocks. We're optimising for the fact that modern
> > > filesystems use extents and so massively reduce the number of block
> > > mapping lookup calls we need to do for a given IO.
> > >
> > > i.e. instead of doing "get page, map block to page" over and over
> > > again until we've alked over the entire IO range, we're doing
> > > "map extent for entire IO range" once, then iterating "get page"
> > > until we've mapped the entire range.
> >
> > That's great, but it's not how IO path works *now*. And will takes a long
> > time (if ever) to flip it over to what you've described.
>
> Wrong. fs/iomap.c. XFS already uses it, ext4 is being converted
> right now, GFS2 will use parts of it in the next release, DAX
> already uses it and PMD support in DAX is being built on top of it.

That's interesting. I've managed to miss the whole fs/iomap.c thing...

> > > As such, there is no way we should be considering different
> > > interfaces and methods for configuring the /same functionality/ just
> > > because DAX is enabled or not. It's the /same decision/ that needs
> > > to be made, and the filesystem knows an awful lot more about whether
> > > huge pages can be used efficiently at the time of access than just
> > > about any other actor you can name....
> >
> > I'm not convinced that filesystem is in better position to see access
> > patterns than mm for page cache. It's not all about on-disk layout.
>
> Spoken like a true mm developer.

Guilty.

--
Kirill A. Shutemov

2016-10-24 20:34:55

by Dave Hansen

Subject: Re: [PATCH] shmem: avoid huge pages for small files

On 10/21/2016 03:50 PM, Dave Chinner wrote:
> On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote:
>> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
>> To me, most of the things you're talking about are highly dependent on the
>> access pattern generated by userspace:
>>
>> - we may want to allocate huge pages from byte 1 if we know that the file
>> will grow;
>
> delayed allocation takes care of that. We use a growing speculative
> delalloc size that kicks in at specific sizes and can be used
> directly to determine if a large page should be allocated. This code
> is aware of sparse files, sparse writes, etc.

OK, so somebody does a write() of 1 byte. We can delay the underlying
block allocation for a long time, but we can *not* delay the memory
allocation. We've got to decide before the write() returns.

How does delayed allocation help with that decision?

I guess we could (always?) allocate small pages up front, and then only
bother promoting them once the FS delayed-allocation code kicks in and
is *also* giving us underlying large allocations. That punts the logic
to the filesystem, which is a bit counterintuitive, but it seems
relatively sane.
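
To make that concrete, here is a rough, compilable user-space model of
such a promotion rule. Everything in it (the struct, the helper, the 2MB
threshold) is made up for illustration; it is not an existing kernel
interface, just the shape of the decision:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define HPAGE_SIZE (2ULL << 20)    /* 2MB huge page */

/* Hypothetical summary of the delalloc reservation the filesystem
 * already holds for the range being written. */
struct delalloc_extent {
    uint64_t start;    /* byte offset where the reservation starts */
    uint64_t len;      /* bytes reserved, may extend past EOF */
};

/* Promote to a huge page only once the filesystem's own reservation
 * covers the whole aligned 2MB unit containing the write offset. */
static bool should_promote(const struct delalloc_extent *ext, uint64_t off)
{
    uint64_t hpage_start = off & ~(HPAGE_SIZE - 1);

    return ext->start <= hpage_start &&
           ext->start + ext->len >= hpage_start + HPAGE_SIZE;
}

int main(void)
{
    struct delalloc_extent tiny = { .start = 0, .len = 4096 };
    struct delalloc_extent big  = { .start = 0, .len = 8ULL << 20 };

    printf("1-byte write, 4k reservation:   promote=%d\n",
           should_promote(&tiny, 1));
    printf("append at 3MB, 8MB reservation: promote=%d\n",
           should_promote(&big, 3ULL << 20));
    return 0;
}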

>>> As such, there is no way we should be considering different
>>> interfaces and methods for configuring the /same functionality/ just
>>> because DAX is enabled or not. It's the /same decision/ that needs
>>> to be made, and the filesystem knows an awful lot more about whether
>>> huge pages can be used efficiently at the time of access than just
>>> about any other actor you can name....
>>
>> I'm not convinced that the filesystem is in a better position to see access
>> patterns than mm is for the page cache. It's not all about on-disk layout.
>
> Spoken like a true mm developer. IO performance is all about IO
> patterns, and the primary contributor to bad IO patterns is bad
> filesystem allocation patterns.... :P

For writes, I think you have a good point. Managing a horribly
fragmented file with larger pages and eating the associated write
amplification that comes along with it seems like a recipe for disaster.

But isn't some level of disconnection between the page cache and the
underlying IO patterns a *good* thing? Once we've gone to the trouble
of bringing some (potentially very fragmented) data into the page cache,
why _not_ manage it in a lower-overhead way if we can? For read-only
data it seems like a no-brainer that we'd want things in as large a
management unit as we can get.

IOW, why let the underlying block allocation layout hamstring how the
memory is managed?

2016-10-25 05:29:07

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] shmem: avoid huge pages for small files

On Mon, Oct 24, 2016 at 01:34:53PM -0700, Dave Hansen wrote:
> On 10/21/2016 03:50 PM, Dave Chinner wrote:
> > On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote:
> >> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
> >> To me, most of the things you're talking about are highly dependent on the
> >> access pattern generated by userspace:
> >>
> >> - we may want to allocate huge pages from byte 1 if we know that the file
> >> will grow;
> >
> > delayed allocation takes care of that. We use a growing speculative
> > delalloc size that kicks in at specific sizes and can be used
> > directly to determine if a large page should be allocated. This code
> > is aware of sparse files, sparse writes, etc.
>
> OK, so somebody does a write() of 1 byte. We can delay the underlying
> block allocation for a long time, but we can *not* delay the memory
> allocation. We've got to decide before the write() returns.

> How does delayed allocation help with that decision?

You (and Kirill) have likely misunderstood what I'm saying, because
you are thinking I'm talking about delayed allocation of
page cache pages.

I'm not. The current code does this for a sequential write:

write(off, len)
    for each PAGE_SIZE chunk
        grab page cache page
            alloc + insert if not found
        map block to page
            get_block(off, PAGE_SIZE);
                filesystem does allocation    <<<<<<<<<<<<
            update bufferhead attached to page
        write data into page


Essentially, delayed block allocation occurs inside the get_block()
call, completely hidden from the page cache and IO layers.

In XFS, we do special stuff based on the offset being written to, the
size of the existing extent we are adding, etc., to speculatively
/reserve/ more blocks than the write actually needs and keep them in
a delalloc extent that extends beyond EOF.

The next page is grabbed, get_block is called again, and we find
we've already got a delalloc reservation for that file offset, so
we return immediately. And we repeat that until the delalloc extent
runs out.

When it runs out, we allocate a bigger delalloc extent beyond EOF
so that as the file grows we make fewer, ever-larger delayed
allocation reservations. These grow out to /gigabytes/ if there is
that much data to write.

i.e. the filesystem clearly knows when using large pages would be
appropriate, but because this knowledge only surfaces inside get_block(),
after the page cache page has already been allocated, it can't influence
that allocation at all.
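
As a toy model of that growing speculative reservation (purely
illustrative user-space code; the doubling rule and the 1GB cap are
invented for the example, and the real XFS heuristics also look at free
space, sparseness of the file, etc.):

#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE   4096ULL
#define MAX_PREALLOC (1ULL << 30)    /* cap the speculative reservation at 1GB */

/* Each time the previous reservation beyond EOF is used up, hand out a
 * bigger one, so a steadily growing file sees fewer, larger delayed
 * allocation reservations over time. */
static uint64_t next_prealloc(uint64_t prev)
{
    uint64_t next = prev ? prev * 2 : BLOCK_SIZE;

    return next > MAX_PREALLOC ? MAX_PREALLOC : next;
}

int main(void)
{
    uint64_t prealloc = 0;

    /* Simulate a file growing by sequential appends. */
    for (int i = 0; i < 10; i++) {
        prealloc = next_prealloc(prealloc);
        printf("reservation %d: %llu bytes beyond EOF\n",
               i, (unsigned long long)prealloc);
    }
    return 0;
}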

Here's what the new fs/iomap.c code does for the same sequential
write:

write(off, len)
    iomap_begin(off, len)
        filesystem does delayed allocation of at least len bytes    <<<<<<
        returns an iomap with single mapping
    iomap_apply()
        for each PAGE_SIZE chunk
            grab page cache page
                alloc + insert if not found
            map iomap to page
            write data into page
    iomap_end()
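
For contrast with the old loop, a tiny self-contained model of that shape
(not the kernel API; the struct and function names here are invented, and
the point is only that the filesystem is asked once per write, not once
per page):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE_4K 4096ULL

/* Stand-in for the single mapping the filesystem returns for the whole
 * write range. */
struct mapping {
    uint64_t offset;    /* file offset the mapping starts at */
    uint64_t length;    /* bytes covered by the mapping */
};

/* Stand-in for the "iomap_begin" step: map/reserve at least 'len' bytes
 * at 'pos' in one call.  Here we simply pretend that succeeded. */
static struct mapping mapping_begin(uint64_t pos, uint64_t len)
{
    struct mapping map = { .offset = pos, .length = len };

    printf("filesystem maps [%llu, %llu) in one call\n",
           (unsigned long long)pos, (unsigned long long)(pos + len));
    return map;
}

/* The write path then iterates pages inside that single mapping instead
 * of calling back into the filesystem once per page. */
static void buffered_write(uint64_t pos, uint64_t len)
{
    struct mapping map = mapping_begin(pos, len);

    for (uint64_t off = map.offset; off < pos + len; off += PAGE_SIZE_4K)
        printf("  grab page cache page at %llu, copy data\n",
               (unsigned long long)off);
    /* an "iomap_end" step would run here to trim unused reservation */
}

int main(void)
{
    buffered_write(0, 16384);    /* 16k write: one mapping call, four pages */
    return 0;
}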

Hence if the write was for 1 byte into an empty file, we'd get a
single block extent back, which would match the single PAGE_SIZE
page cache allocation required. If the app is doing sequential 1
byte IO, and we're at offset 2MB and the filesystem returns a 2MB
delalloc extent (i.e. it extends 2MB beyond EOF), we know we're
getting sequential write IO and we could use a 2MB page in the page
cache for this.

Similarly, if the app is doing large IO - say 16MB at a time, we'll
get at least a 16MB delalloc extent returned from the filesystem,
and we know we could map that quickly and easily to 8 x 2MB huge
pages in the page cache.

But if we get random 4k writes into a sparse file, the filesystem
will be allocating single blocks, so the iomaps being returned would
be for a single block, and we know that PAGE_SIZE pages would be
best to allocate.

Or we could have a 2MB extent size hint set, so every iomap
returned from the filesystem is going to be 2MB aligned and sized,
in which case we could always map the returned iomap to a huge
page rather than worry about IO sizes and incoming IO patterns.
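
To make the page-size decision concrete, here's a small self-contained
sketch of how the extent handed back by the filesystem could be turned
into a page cache allocation unit. The struct, the helper and the 2MB
rule are illustrative assumptions, not existing kernel code:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE_4K  4096ULL
#define HPAGE_SIZE_2M (2ULL << 20)

/* Hypothetical summary of what the mapping call hands back for a write:
 * one extent covering (at least) the byte range being written. */
struct extent {
    uint64_t offset;    /* byte offset in the file */
    uint64_t length;    /* bytes covered by this mapping */
};

/* Use a 2MB huge page only when the returned extent fully covers the
 * aligned 2MB unit containing 'pos'; otherwise fall back to 4k pages. */
static uint64_t page_unit(const struct extent *ext, uint64_t pos)
{
    uint64_t huge_start = pos & ~(HPAGE_SIZE_2M - 1);

    if (ext->offset <= huge_start &&
        ext->offset + ext->length >= huge_start + HPAGE_SIZE_2M)
        return HPAGE_SIZE_2M;
    return PAGE_SIZE_4K;
}

int main(void)
{
    struct extent one_block  = { .offset = 0, .length = 4096 };               /* 1-byte write, empty file */
    struct extent delalloc2m = { .offset = 2ULL << 20, .length = 2ULL << 20 };/* sequential, 2MB beyond EOF */
    struct extent big_io     = { .offset = 0, .length = 16ULL << 20 };        /* 16MB write */

    printf("1-byte write:      %llu\n", (unsigned long long)page_unit(&one_block, 0));
    printf("sequential at 2MB: %llu\n", (unsigned long long)page_unit(&delalloc2m, 2ULL << 20));
    printf("16MB write:        %llu\n", (unsigned long long)page_unit(&big_io, 0));
    return 0;
}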

Now do you see the difference? The filesystem now has the ability to
optimise allocation based on user application IO patterns rather
than trying to guess from single-page-size block mapping requests.
And because this happens before we look at the page cache, we can
use that information to influence what we do with the page cache.

> >> I'm not convinced that the filesystem is in a better position to see access
> >> patterns than mm is for the page cache. It's not all about on-disk layout.
> >
> > Spoken like a true mm developer. IO performance is all about IO
> > patterns, and the primary contributor to bad IO patterns is bad
> > filesystem allocation patterns.... :P
>
> For writes, I think you have a good point. Managing a horribly
> fragmented file with larger pages and eating the associated write
> amplification that comes along with it seems like a recipe for disaster.
>
> But isn't some level of disconnection between the page cache and the
> underlying IO patterns a *good* thing?

Up to a point. Buffered IO only hides underlying IO patterns whilst
readahead and writebehind are effective. They rely on the filesystem
laying out files sequentially for good read rates, and on writes hitting
the disk sequentially for good write rates. The moment either of
these two things breaks down, the latency induced by the underlying
IO patterns will immediately dominate buffered IO performance
characteristics.

The filesystem allocation and mapping routines will do a far better
job if we can feed them extents rather than 4k pages at a time.
Kirill is struggling to work out how to get huge pages into the page
cache readahead model, and this is something that we'll solve by
adding a readahead implementation to the iomap code. i.e. iomap the
extent, decide on how much to read ahead and the page size to use,
populate the page cache, build the bios, off we go....
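
A sketch of what such an extent-based readahead pass could look like
(purely illustrative user-space code; the function names and the
page-size rule are assumptions, since the real implementation doesn't
exist yet):

#include <stdint.h>
#include <stdio.h>

#define PAGE_4K  4096ULL
#define HPAGE_2M (2ULL << 20)

struct extent {    /* what the mapping step would return */
    uint64_t offset;
    uint64_t length;
};

/* Stand-in for the filesystem mapping callback: here it just pretends
 * the whole readahead window is one contiguous extent. */
static struct extent map_extent(uint64_t pos, uint64_t len)
{
    struct extent ext = { .offset = pos, .length = len };
    return ext;
}

/* Map the readahead window once, then walk it in the largest page unit
 * the extent supports, issuing one (pretend) bio per page instead of one
 * mapping call per 4k page. */
static void extent_readahead(uint64_t pos, uint64_t window)
{
    struct extent ext = map_extent(pos, window);
    uint64_t unit = (ext.length >= HPAGE_2M &&
                     !(ext.offset & (HPAGE_2M - 1))) ? HPAGE_2M : PAGE_4K;

    for (uint64_t off = ext.offset; off < ext.offset + ext.length; off += unit)
        printf("populate page cache at %llu, size %llu, queue bio\n",
               (unsigned long long)off, (unsigned long long)unit);
}

int main(void)
{
    extent_readahead(0, 4ULL << 20);    /* 4MB readahead window */
    return 0;
}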

The problem with the existing code is that the page cache breaks IO
up into discrete PAGE_SIZE mapping requests. For all the fs knows it
could be 1 byte IOs or it could be 1GB IOs being done. We can do
(and already have done) an awful lot of optimisation if we know
exactly what IO is being thrown at the fs.

Effectively we are moving the IO path mapping methods from a
single-page-and-block iteration model to a single-extent-multiple-page
iteration model. It's a different mechanism, but it's a lot more
efficient for moving data around than what we have now...

> Once we've gone to the trouble
> of bringing some (potentially very fragmented) data into the page cache,
> why _not_ manage it in a lower-overhead way if we can? For read-only
> data it seems like a no-brainer that we'd want things in as large a
> management unit as we can get.
>
> IOW, why let the underlying block allocation layout hamstring how the
> memory is managed?

We're not changing how memory is managed at all. All we are doing is
changing how we populate the page cache....

Cheers,

Dave.

--
Dave Chinner
[email protected]