When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using pmd page
size. This feature relies on both mmap virtual address and FS
block (i.e. physical address) to be aligned by the pmd page size.
Users can use mkfs options to specify FS to align block allocations.
However, aligning mmap address requires code changes to existing
applications for providing a pmd-aligned address to mmap().
For instance, fio with "ioengine=mmap" performs I/Os with mmap() [1].
It calls mmap() with a NULL address, which needs to be changed to
provide a pmd-aligned address for testing with DAX pmd mappings.
Changing all applications that call mmap() with NULL is undesirable.
This patch-set extends filesystems to align an mmap address for
a DAX file so that unmodified applications can use DAX pmd mappings.
[1]: https://github.com/axboe/fio/blob/master/engines/mmap.c
v2:
- Change filesystems to provide their get_unmapped_area().
(Matthew Wilcox)
- Add more description about the benefit. (Matthew Wilcox)
---
Toshi Kani (5):
1/5 dax: add dax_get_unmapped_area for pmd mappings
2/5 ext4: call dax_get_unmapped_area() for DAX pmd mappings
3/5 xfs: call dax_get_unmapped_area() for DAX pmd mappings
4/5 ext2: call dax_get_unmapped_area() for DAX pmd mappings
5/5 block: call dax_get_unmapped_area() for DAX pmd mappings
---
fs/block_dev.c | 3 +++
fs/dax.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
fs/ext2/file.c | 3 +++
fs/ext4/file.c | 3 +++
fs/xfs/xfs_file.c | 3 +++
include/linux/dax.h | 2 ++
6 files changed, 59 insertions(+)
When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using pmd page
size. This feature relies on both mmap virtual address and FS
block (i.e. physical address) to be aligned by the pmd page size.
Users can use mkfs options to specify FS to align block allocations.
However, aligning mmap address requires code changes to existing
applications for providing a pmd-aligned address to mmap().
For instance, fio with "ioengine=mmap" performs I/Os with mmap() [1].
It calls mmap() with a NULL address, which needs to be changed to
provide a pmd-aligned address for testing with DAX pmd mappings.
Changing all applications that call mmap() with NULL is undesirable.
Add dax_get_unmapped_area(), which can be called by filesystem's
get_unmapped_area to align an mmap address by the pmd size for
a DAX file. It calls the default handler, mm->get_unmapped_area(),
to find a range and then aligns it for a DAX file.
[1]: https://github.com/axboe/fio/blob/master/engines/mmap.c
Signed-off-by: Toshi Kani <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Andreas Dilger <[email protected]>
---
fs/dax.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/dax.h | 2 ++
2 files changed, 47 insertions(+)
diff --git a/fs/dax.c b/fs/dax.c
index 75ba46d..b54fdb9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1158,3 +1158,48 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
return dax_zero_page_range(inode, from, length, get_block);
}
EXPORT_SYMBOL_GPL(dax_truncate_page);
+
+/**
+ * dax_get_unmapped_area - handle get_unmapped_area for a DAX file
+ * @filp: The file being mmap'd, if not NULL
+ * @addr: The mmap address. If NULL, the kernel assigns the address
+ * @len: The mmap size in bytes
+ * @pgoff: The page offset in the file where the mapping starts from.
+ * @flags: The mmap flags
+ *
+ * This function can be called by a filesystem for get_unmapped_area().
+ * When a target file is a DAX file, it aligns the mmap address at the
+ * beginning of the file by the pmd size.
+ *
+ * When the target file is not a DAX file, @addr is specified, the
+ * request is not suitable for pmd mappings, or mm->get_unmapped_area()
+ * failed with extended @len, it simply calls the default handler,
+ * mm->get_unmapped_area(), with the original arguments.
+ */
+unsigned long dax_get_unmapped_area(struct file *filp, unsigned long addr,
+ unsigned long len, unsigned long pgoff, unsigned long flags)
+{
+ unsigned long off, off_end, off_pmd, len_pmd, addr_pmd;
+
+ if (IS_ENABLED(CONFIG_FS_DAX_PMD) &&
+ filp && !addr && IS_DAX(filp->f_mapping->host)) {
+ off = pgoff << PAGE_SHIFT;
+ off_end = off + len;
+ off_pmd = round_up(off, PMD_SIZE);
+
+ if ((off_end > off_pmd) && ((off_end - off_pmd) >= PMD_SIZE)) {
+ len_pmd = len + PMD_SIZE;
+
+ addr_pmd = current->mm->get_unmapped_area(
+ filp, addr, len_pmd, pgoff, flags);
+
+ if (!IS_ERR_VALUE(addr_pmd)) {
+ addr_pmd += (off - addr_pmd) & (PMD_SIZE - 1);
+ return addr_pmd;
+ }
+ }
+ }
+
+ return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
+}
+EXPORT_SYMBOL_GPL(dax_get_unmapped_area);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 636dd59..9d52de0 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -10,6 +10,8 @@ ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
+unsigned long dax_get_unmapped_area(struct file *filp, unsigned long addr,
+ unsigned long len, unsigned long pgoff, unsigned long flags);
int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
dax_iodone_t);
int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
On Tue, Apr 12, 2016 at 02:39:15PM -0600, Toshi Kani wrote:
> + * When the target file is not a DAX file, @addr is specified, the
> + * request is not suitable for pmd mappings, or mm->get_unmapped_area()
> + * failed with extended @len, it simply calls the default handler,
> + * mm->get_unmapped_area(), with the original arguments.
I think you can lose this paragraph. It describes what the function
does, not why it does it ... and we can see what the function does from
reading the code.
> +unsigned long dax_get_unmapped_area(struct file *filp, unsigned long addr,
> + unsigned long len, unsigned long pgoff, unsigned long flags)
> +{
> + unsigned long off, off_end, off_pmd, len_pmd, addr_pmd;
> +
> + if (IS_ENABLED(CONFIG_FS_DAX_PMD) &&
> + filp && !addr && IS_DAX(filp->f_mapping->host)) {
> + off = pgoff << PAGE_SHIFT;
> + off_end = off + len;
> + off_pmd = round_up(off, PMD_SIZE);
> +
> + if ((off_end > off_pmd) && ((off_end - off_pmd) >= PMD_SIZE)) {
> + len_pmd = len + PMD_SIZE;
> +
> + addr_pmd = current->mm->get_unmapped_area(
> + filp, addr, len_pmd, pgoff, flags);
> +
> + if (!IS_ERR_VALUE(addr_pmd)) {
> + addr_pmd += (off - addr_pmd) & (PMD_SIZE - 1);
> + return addr_pmd;
> + }
> + }
> + }
> +
> + return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
> +}
I think this is one of those functions which is actually improved with
gotos, purely to reduce the indentation level.
unsigned long dax_get_unmapped_area(struct file *filp, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags)
{
unsigned long off, off_end, off_pmd, len_pmd, addr_pmd;
if (!IS_ENABLED(CONFIG_FS_DAX_PMD) ||
!filp || addr || !IS_DAX(filp->f_mapping->host))
goto out;
off = pgoff << PAGE_SHIFT;
off_end = off + len;
off_pmd = round_up(off, PMD_SIZE);
if ((off_end <= off_pmd) || ((off_end - off_pmd) < PMD_SIZE))
goto out;
len_pmd = len + PMD_SIZE;
addr_pmd = current->mm->get_unmapped_area(filp, addr, len_pmd,
pgoff, flags);
if (!IS_ERR_VALUE(addr_pmd)) {
addr_pmd += (off - addr_pmd) & (PMD_SIZE - 1);
return addr_pmd;
}
out:
return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
}
Now ... back to the original version, some questions:
> +unsigned long dax_get_unmapped_area(struct file *filp, unsigned long addr,
> + unsigned long len, unsigned long pgoff, unsigned long flags)
> +{
> + unsigned long off, off_end, off_pmd, len_pmd, addr_pmd;
> +
> + if (IS_ENABLED(CONFIG_FS_DAX_PMD) &&
> + filp && !addr && IS_DAX(filp->f_mapping->host)) {
> + off = pgoff << PAGE_SHIFT;
> + off_end = off + len;
Can off + len wrap here, or did that get checked earlier?
> + off_pmd = round_up(off, PMD_SIZE);
> +
> + if ((off_end > off_pmd) && ((off_end - off_pmd) >= PMD_SIZE)) {
We're only going to look for a PMD-aligned mapping if the mapping is at
least double PMD_SIZE? I don't understand that decision. Seems to me
that if I ask to map offset 4MB, length 2MB, I should get a PMD-aligned
mapping.
Speaking of offset, we don't have any checks that offset is a multiple
of PMD_SIZE. I know that theoretically we could map offset 1.5MB, length
3MB and see the first 0.5MB filled with small pages, then the next 2MB
filled with one large page, and the tail filled with small pages, but I
think we're better off only looking for PMD-alignment if the user asked
for an aligned offset in the file.
> + len_pmd = len + PMD_SIZE;
> +
> + addr_pmd = current->mm->get_unmapped_area(
> + filp, addr, len_pmd, pgoff, flags);
> +
> + if (!IS_ERR_VALUE(addr_pmd)) {
> + addr_pmd += (off - addr_pmd) & (PMD_SIZE - 1);
... then you wouldn't need this calculation ;-)
> + return addr_pmd;
> + }
On Tue, 2016-04-12 at 23:18 -0400, Matthew Wilcox wrote:
> On Tue, Apr 12, 2016 at 02:39:15PM -0600, Toshi Kani wrote:
> >
> > + * When the target file is not a DAX file, @addr is specified, the
> > + * request is not suitable for pmd mappings, or mm-
> > >get_unmapped_area()
> > + * failed with extended @len, it simply calls the default handler,
> > + * mm->get_unmapped_area(), with the original arguments.
>
> I think you can lose this paragraph. It describes what the function
> does, not why it does it ... and we can see what the function does from
> reading the code.
Agreed. I will remove this paragraph.
> I think this is one of those functions which is actually improved with
> gotos, purely to reduce the indentation level.
>
> unsigned long dax_get_unmapped_area(struct file *filp, unsigned long
> addr,
> unsigned long len, unsigned long pgoff, unsigned long
> flags)
> {
> unsigned long off, off_end, off_pmd, len_pmd, addr_pmd;
>
> if (!IS_ENABLED(CONFIG_FS_DAX_PMD) ||
> !filp || addr || !IS_DAX(filp->f_mapping->host))
> goto out;
>
> off = pgoff << PAGE_SHIFT;
> off_end = off + len;
> off_pmd = round_up(off, PMD_SIZE);
> if ((off_end <= off_pmd) || ((off_end - off_pmd) < PMD_SIZE))
> goto out;
>
> len_pmd = len + PMD_SIZE;
> addr_pmd = current->mm->get_unmapped_area(filp, addr, len_pmd,
> pgoff, flags);
>
> if (!IS_ERR_VALUE(addr_pmd)) {
> addr_pmd += (off - addr_pmd) & (PMD_SIZE - 1);
> return addr_pmd;
> }
>
> out:
> return current->mm->get_unmapped_area(filp, addr, len, pgoff,
> flags);
> }
Sounds good.
> Now ... back to the original version, some questions:
>
> >
> > +unsigned long dax_get_unmapped_area(struct file *filp, unsigned long
> > addr,
> > + unsigned long len, unsigned long pgoff, unsigned long
> > flags)
> > +{
> > + unsigned long off, off_end, off_pmd, len_pmd, addr_pmd;
> > +
> > + if (IS_ENABLED(CONFIG_FS_DAX_PMD) &&
> > + filp && !addr && IS_DAX(filp->f_mapping->host)) {
> > + off = pgoff << PAGE_SHIFT;
> > + off_end = off + len;
>
> Can off + len wrap here, or did that get checked earlier?
Yes, do_mmap() has checked this condition earlier. But, I think I need to
check it for (off + len_pmd).
> >
> > + off_pmd = round_up(off, PMD_SIZE);
> > +
> > + if ((off_end > off_pmd) && ((off_end - off_pmd) >=
> > PMD_SIZE)) {
>
> We're only going to look for a PMD-aligned mapping if the mapping is at
> least double PMD_SIZE? I don't understand that decision. Seems to me
> that if I ask to map offset 4MB, length 2MB, I should get a PMD-aligned
> mapping.
It checks if this request can be covered by a PMD page. 'off_pmd' is the
first PMD-aligned offset. There needs to be at least 2MB from this offset
to the end, 'off_end', in order to cover with a PMD page.
In your example, 'off_pmd' is still 4MB, which has 2MB to the end. So, so
this request can be covered by a PMD page.
Another example, say, offset 4KB and length 2MB. 'off_pmd' is 2MB, which
has only 4KB to the end. So, this request cannot be covered by a PMD page.
> Speaking of offset, we don't have any checks that offset is a multiple
> of PMD_SIZE. I know that theoretically we could map offset 1.5MB, length
> 3MB and see the first 0.5MB filled with small pages, then the next 2MB
> filled with one large page, and the tail filled with small pages, but I
> think we're better off only looking for PMD-alignment if the user asked
> for an aligned offset in the file.
Yes, that's what it checks. In this case, 'off_pmd' is set to 2MB, which
has 2.5MB to the end. So, it can be covered by a PMD page. The offset
itself does not have to be aligned by 2MB.
> > + len_pmd = len + PMD_SIZE;
> > +
> > + addr_pmd = current->mm->get_unmapped_area(
> > + filp, addr, len_pmd, pgoff,
> > flags);
> > +
> > + if (!IS_ERR_VALUE(addr_pmd)) {
> > + addr_pmd += (off - addr_pmd) &
> > (PMD_SIZE - 1);
>
> ... then you wouldn't need this calculation ;-)
Applications should not be required to provide a 2MB-aligned offset.
Thanks,
-Toshi