2021-01-07 13:20:49

by Mikulas Patocka

Subject: [RFC v2] nvfs: a filesystem for persistent memory

Hi

I announce a new version of NVFS - a filesystem for persistent memory.
http://people.redhat.com/~mpatocka/nvfs/
git://leontynka.twibright.com/nvfs.git

Changes since the last release:

* I added a microjournal to the filesystem; it can hold up to 16 entries.
Each CPU has its own journal, so that there is no lock contention. The
journal is used to provide atomicity of rename() and extended attribute
replace.
(note that file creation and deletion don't use the journal, because
these operations can be deterministically cleaned up by fsck)
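
For illustration, a sketch of the per-CPU journal layout (the names and
fields here are illustrative, not the actual NVFS on-disk format):

#include <linux/types.h>

#define NVJ_ENTRIES	16	/* journal holds up to 16 entries */

struct nvj_entry {
        u64 addr;               /* pmem location the update targets */
        u64 data;               /* value to re-apply during recovery */
};

/* One journal per CPU, so commits never contend on a lock.  A
   transaction becomes valid only when the sequence word is bumped
   with a single ordered store after the entries are persistent. */
struct nvj_percpu {
        u64 seq;
        u32 n_entries;
        struct nvj_entry e[NVJ_ENTRIES];
};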

* I created a framework that can be used to verify the filesystem driver.
It logs all writes and memory barriers to a file; the entries in the
file are randomly reordered (to simulate reordering in the CPU
write-combining buffers), the sequence is cut at a random point (to
simulate a system crash), and the result is replayed on a filesystem
image.
With this framework, we can for example check that if a crash happens
during rename(), either old file or new file will be present in a
directory.
This framework helped to find a few bugs in sequencing the writes.
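
The core of the replay step looks roughly like this (a simplified,
userspace-style sketch with made-up types; the real framework operates
on the driver's logged write stream):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct log_entry {
        uint64_t offset;        /* byte offset in the fs image */
        uint64_t value;         /* the 8 bytes being written */
        int barrier;            /* nonzero = memory barrier marker */
};

static void shuffle(struct log_entry *e, size_t n)
{
        for (size_t i = n; i > 1; i--) {        /* Fisher-Yates */
                size_t j = (size_t)rand() % i;
                struct log_entry tmp = e[i - 1];
                e[i - 1] = e[j];
                e[j] = tmp;
        }
}

/* Reorder writes within barrier-delimited groups, cut the log at a
   random point ("crash"), replay the surviving prefix onto the image. */
static void replay_crash(uint8_t *image, struct log_entry *log, size_t n)
{
        size_t group = 0;

        for (size_t i = 0; i <= n; i++) {
                if (i == n || log[i].barrier) {
                        shuffle(log + group, i - group);
                        group = i + 1;
                }
        }
        n = (size_t)rand() % (n + 1);           /* simulated crash point */
        for (size_t i = 0; i < n; i++)
                if (!log[i].barrier)
                        memcpy(image + log[i].offset, &log[i].value, 8);
}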

* If we map an executable image, we turn off the DAX flag on the inode
(because executables run about 4% slower from persistent memory). There
is also a switch that can force DAX always off or always on.
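
Presumably hooked at mmap time; a sketch with hypothetical helper and
vm_ops names (only file_inode(), IS_DAX(), VM_EXEC and
generic_file_mmap() are the real kernel interfaces):

static int nvfs_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct inode *inode = file_inode(file);

        /* executables measured ~4% slower when run from pmem */
        if ((vma->vm_flags & VM_EXEC) && IS_DAX(inode))
                nvfs_clear_inode_dax(inode);    /* hypothetical helper */

        if (!IS_DAX(inode))
                return generic_file_mmap(file, vma);

        vma->vm_ops = &nvfs_dax_vm_ops;         /* hypothetical */
        return 0;
}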




I'd like to ask about this piece of code in __kernel_read:

        if (unlikely(!file->f_op->read_iter || file->f_op->read))
                return warn_unsupported...

and __kernel_write:

        if (unlikely(!file->f_op->write_iter || file->f_op->write))
                return warn_unsupported...

- These checks fail with an error if both read_iter and read (or both
write_iter and write) are present.

I found out that on NVFS, reading a file with the read method has 10%
better performance than the read_iter method. The benchmark just reads the
same 4k page over and over again - and the cost of creating and parsing
the kiocb and iov_iter structures is just that high.

So, I'd like to have both the read and read_iter methods. Could the above
conditions be changed so that they don't fail with an error if the "read"
or "write" method is present?
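
A sketch of the relaxed form I have in mind for __kernel_read (keeping
the same truncation as the fragment above; __kernel_write analogous):

        if (unlikely(!file->f_op->read_iter))
                return warn_unsupported...

i.e. only require ->read_iter and tolerate a coexisting ->read.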

Mikulas


2021-01-08 20:38:59

by Matthew Wilcox

Subject: Expense of read_iter

On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> I'd like to ask about this piece of code in __kernel_read:
>         if (unlikely(!file->f_op->read_iter || file->f_op->read))
>                 return warn_unsupported...
> and __kernel_write:
>         if (unlikely(!file->f_op->write_iter || file->f_op->write))
>                 return warn_unsupported...
>
> - It exits with an error if both read_iter and read or write_iter and
> write are present.
>
> I found out that on NVFS, reading a file with the read method has 10%
> better performance than the read_iter method. The benchmark just reads the
> same 4k page over and over again - and the cost of creating and parsing
> the kiocb and iov_iter structures is just that high.

Which part of it is so expensive? Is it worth, e.g., adding an iov_iter
type that points to a single buffer instead of a single-member iov?

+++ b/include/linux/uio.h
@@ -19,6 +19,7 @@ struct kvec {
 
 enum iter_type {
         /* iter types */
+        ITER_UBUF = 2,
         ITER_IOVEC = 4,
         ITER_KVEC = 8,
         ITER_BVEC = 16,
@@ -36,6 +36,7 @@ struct iov_iter {
         size_t iov_offset;
         size_t count;
         union {
+                void __user *buf;
                 const struct iovec *iov;
                 const struct kvec *kvec;
                 const struct bio_vec *bvec;

and then doing all the appropriate changes to make that work.
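
A hypothetical initializer for such an iterator, modelled on
iov_iter_kvec() and using the union member from the hunk above (a sketch
only; the iterate_and_advance() machinery would need matching ITER_UBUF
cases):

static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
                                 void __user *buf, size_t count)
{
        i->type = ITER_UBUF | direction;
        i->buf = buf;
        i->iov_offset = 0;
        i->count = count;
}

copy_to_iter() could then special-case ITER_UBUF and go straight to
raw_copy_to_user() without any segment walking.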

2021-01-10 16:23:12

by Al Viro

Subject: Re: [RFC v2] nvfs: a filesystem for persistent memory

On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> Hi
>
> I announce a new version of NVFS - a filesystem for persistent memory.
> http://people.redhat.com/~mpatocka/nvfs/
Utilities, AFAICS

> git://leontynka.twibright.com/nvfs.git
Seems to hang on git pull at the moment... Do you have it anywhere else?

> I found out that on NVFS, reading a file with the read method has 10%
> better performance than the read_iter method. The benchmark just reads the
> same 4k page over and over again - and the cost of creating and parsing
> the kiocb and iov_iter structures is just that high.

Apples and oranges... What happens if you take

ssize_t read_iter_locked(struct file *file, struct iov_iter *to, loff_t *ppos)
{
        struct inode *inode = file_inode(file);
        struct nvfs_memory_inode *nmi = i_to_nmi(inode);
        struct nvfs_superblock *nvs = inode->i_sb->s_fs_info;
        ssize_t total = 0;
        loff_t pos = *ppos;
        int r;
        int shift = nvs->log2_page_size;
        size_t i_size;

        i_size = inode->i_size;
        if (pos >= i_size)
                return 0;
        iov_iter_truncate(to, i_size - pos);

        while (iov_iter_count(to)) {
                void *blk, *ptr;
                size_t page_mask = (1UL << shift) - 1;
                unsigned page_offset = pos & page_mask;
                unsigned prealloc = (iov_iter_count(to) + page_mask) >> shift;
                unsigned size;

                blk = nvfs_bmap(nmi, pos >> shift, &prealloc, NULL, NULL, NULL);
                if (unlikely(IS_ERR(blk))) {
                        r = PTR_ERR(blk);
                        goto ret_r;
                }
                size = ((size_t)prealloc << shift) - page_offset;
                ptr = blk + page_offset;
                if (unlikely(!blk)) {
                        size = min(size, (unsigned)PAGE_SIZE);
                        ptr = empty_zero_page;
                }
                size = copy_to_iter(to, ptr, size);
                if (unlikely(!size)) {
                        r = -EFAULT;
                        goto ret_r;
                }

                pos += size;
                total += size;
        } while (iov_iter_count(to));

        r = 0;

ret_r:
        *ppos = pos;

        if (file)
                file_accessed(file);

        return total ? total : r;
}

and use that instead of your nvfs_rw_iter_locked() in your
->read_iter() for DAX read case? Then the same with
s/copy_to_iter/_copy_to_iter/, to see how much of that is
"hardening" overhead.

Incidentally, what's the point of sharing nvfs_rw_iter() for
read and write cases? They have practically no overlap -
count the lines common for wr and !wr cases. And if you
do the same in nvfs_rw_iter_locked(), you'll see that the
shared parts _there_ are bloody pointless on the read side.
Not that it had been more useful on the write side, really,
but that's another story (nvfs_write_pages() handling of
copyin is... interesting). Let's figure out what's going
on with the read overhead first...

lib/iov_iter.c primitives certainly could use massage for
better code generation, but let's find out how much of the
PITA is due to those and how much comes from you fighting
the damn thing instead of using it sanely...

2021-01-10 16:54:45

by Al Viro

Subject: Re: [RFC v2] nvfs: a filesystem for persistent memory

On Sun, Jan 10, 2021 at 04:20:08PM +0000, Al Viro wrote:
> On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > Hi
> >
> > I announce a new version of NVFS - a filesystem for persistent memory.
> > http://people.redhat.com/~mpatocka/nvfs/
> Utilities, AFAICS
>
> > git://leontynka.twibright.com/nvfs.git
> Seems to hang on git pull at the moment... Do you have it anywhere else?

D'oh... In case it's not obvious from the rest of the reply, I have managed
to grab it - and forgot to remove the question before sending the comments.
My apologies for the confusion; I plead the lack of coffee...

2021-01-10 21:19:45

by Mikulas Patocka

Subject: Re: [RFC v2] nvfs: a filesystem for persistent memory



On Sun, 10 Jan 2021, Al Viro wrote:

> On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > Hi
> >
> > I announce a new version of NVFS - a filesystem for persistent memory.
> > http://people.redhat.com/~mpatocka/nvfs/
> Utilities, AFAICS
>
> > git://leontynka.twibright.com/nvfs.git
> Seems to hang on git pull at the moment... Do you have it anywhere else?

I saw some errors 'git-daemon: fatal: the remote end hung up unexpectedly'
in syslog. I don't know what's causing them.

> > I found out that on NVFS, reading a file with the read method has 10%
> > better performance than the read_iter method. The benchmark just reads the
> > same 4k page over and over again - and the cost of creating and parsing
> > the kiocb and iov_iter structures is just that high.
>
> Apples and oranges... What happens if you take
>
> ssize_t read_iter_locked(struct file *file, struct iov_iter *to, loff_t *ppos)
> {
>         struct inode *inode = file_inode(file);
>         struct nvfs_memory_inode *nmi = i_to_nmi(inode);
>         struct nvfs_superblock *nvs = inode->i_sb->s_fs_info;
>         ssize_t total = 0;
>         loff_t pos = *ppos;
>         int r;
>         int shift = nvs->log2_page_size;
>         size_t i_size;
>
>         i_size = inode->i_size;
>         if (pos >= i_size)
>                 return 0;
>         iov_iter_truncate(to, i_size - pos);
>
>         while (iov_iter_count(to)) {
>                 void *blk, *ptr;
>                 size_t page_mask = (1UL << shift) - 1;
>                 unsigned page_offset = pos & page_mask;
>                 unsigned prealloc = (iov_iter_count(to) + page_mask) >> shift;
>                 unsigned size;
>
>                 blk = nvfs_bmap(nmi, pos >> shift, &prealloc, NULL, NULL, NULL);
>                 if (unlikely(IS_ERR(blk))) {
>                         r = PTR_ERR(blk);
>                         goto ret_r;
>                 }
>                 size = ((size_t)prealloc << shift) - page_offset;
>                 ptr = blk + page_offset;
>                 if (unlikely(!blk)) {
>                         size = min(size, (unsigned)PAGE_SIZE);
>                         ptr = empty_zero_page;
>                 }
>                 size = copy_to_iter(to, ptr, size);
>                 if (unlikely(!size)) {
>                         r = -EFAULT;
>                         goto ret_r;
>                 }
>
>                 pos += size;
>                 total += size;
>         } while (iov_iter_count(to));
>
>         r = 0;
>
> ret_r:
>         *ppos = pos;
>
>         if (file)
>                 file_accessed(file);
>
>         return total ? total : r;
> }
>
> and use that instead of your nvfs_rw_iter_locked() in your
> ->read_iter() for DAX read case? Then the same with
> s/copy_to_iter/_copy_to_iter/, to see how much of that is
> "hardening" overhead.
>
> Incidentally, what's the point of sharing nvfs_rw_iter() for
> read and write cases? They have practically no overlap -
> count the lines common for wr and !wr cases. And if you
> do the same in nvfs_rw_iter_locked(), you'll see that the
> shared parts _there_ are bloody pointless on the read side.

That's a good point. I split nvfs_rw_iter into separate functions
nvfs_read_iter and nvfs_write_iter - and inlined nvfs_rw_iter_locked into
both of them. It improved performance by 1.3%.

> Not that it had been more useful on the write side, really,
> but that's another story (nvfs_write_pages() handling of
> copyin is... interesting). Let's figure out what's going
> on with the read overhead first...
>
> lib/iov_iter.c primitives certainly could use massage for
> better code generation, but let's find out how much of the
> PITA is due to those and how much comes from you fighting
> the damn thing instead of using it sanely...

The results are:

read: 6.744s
read_iter: 7.417s
read_iter - separate read and write path: 7.321s
Al's read_iter: 7.182s
Al's read_iter with _copy_to_iter: 7.181s

Mikulas

2021-01-10 23:43:15

by Al Viro

Subject: Re: [RFC v2] nvfs: a filesystem for persistent memory

On Sun, Jan 10, 2021 at 04:14:55PM -0500, Mikulas Patocka wrote:

> That's a good point. I split nvfs_rw_iter into separate functions
> nvfs_read_iter and nvfs_write_iter - and inlined nvfs_rw_iter_locked into
> both of them. It improved performance by 1.3%.
>
> > Not that it had been more useful on the write side, really,
> > but that's another story (nvfs_write_pages() handling of
> > copyin is... interesting). Let's figure out what's going
> > on with the read overhead first...
> >
> > lib/iov_iter.c primitives certainly could use massage for
> > better code generation, but let's find out how much of the
> > PITA is due to those and how much comes from you fighting
> > the damn thing instead of using it sanely...
>
> The results are:
>
> read: 6.744s
> read_iter: 7.417s
> read_iter - separate read and write path: 7.321s
> Al's read_iter: 7.182s
> Al's read_iter with _copy_to_iter: 7.181s

So
* overhead of hardening stuff is noise here
* switching to more straightforward ->read_iter() cuts
the overhead by about 1/3.

Interesting... I wonder how much of that is spent in
iterate_and_advance() glue inside copy_to_iter() here. There's
certainly quite a bit of optimizations possible in those
primitives and your usecase makes a decent test for that...

Could you profile that and see where is it spending
the time, on instruction level?

2021-01-11 11:46:17

by Mikulas Patocka

Subject: Re: [RFC v2] nvfs: a filesystem for persistent memory



On Sun, 10 Jan 2021, Al Viro wrote:

> On Sun, Jan 10, 2021 at 04:14:55PM -0500, Mikulas Patocka wrote:
>
> > That's a good point. I split nvfs_rw_iter into separate functions
> > nvfs_read_iter and nvfs_write_iter - and inlined nvfs_rw_iter_locked into
> > both of them. It improved performance by 1.3%.
> >
> > > Not that it had been more useful on the write side, really,
> > > but that's another story (nvfs_write_pages() handling of
> > > copyin is... interesting). Let's figure out what's going
> > > on with the read overhead first...
> > >
> > > lib/iov_iter.c primitives certainly could use massage for
> > > better code generation, but let's find out how much of the
> > > PITA is due to those and how much comes from you fighting
> > > the damn thing instead of using it sanely...
> >
> > The results are:
> >
> > read: 6.744s
> > read_iter: 7.417s
> > read_iter - separate read and write path: 7.321s
> > Al's read_iter: 7.182s
> > Al's read_iter with _copy_to_iter: 7.181s
>
> So
> * overhead of hardening stuff is noise here
> * switching to more straightforward ->read_iter() cuts
> the overhead by about 1/3.
>
> Interesting... I wonder how much of that is spent in
> iterate_and_advance() glue inside copy_to_iter() here. There's
> certainly quite a bit of optimizations possible in those
> primitives and your usecase makes a decent test for that...
>
> Could you profile that and see where is it spending
> the time, on instruction level?

This is the read method profile:

time 9.056s
52.69% pread [kernel.vmlinux] [k] copy_user_generic_string
6.24% pread [kernel.vmlinux] [k] current_time
6.22% pread [kernel.vmlinux] [k] entry_SYSCALL_64
4.88% pread libc-2.31.so [.] __libc_pread
3.75% pread [kernel.vmlinux] [k] syscall_return_via_sysret
3.63% pread [nvfs] [k] nvfs_read
2.83% pread [nvfs] [k] nvfs_bmap
2.81% pread [kernel.vmlinux] [k] vfs_read
2.63% pread [kernel.vmlinux] [k] __x64_sys_pread64
2.27% pread [kernel.vmlinux] [k] __fsnotify_parent
2.19% pread [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe
1.55% pread [kernel.vmlinux] [k] atime_needs_update
1.17% pread [kernel.vmlinux] [k] syscall_enter_from_user_mode
1.15% pread [kernel.vmlinux] [k] touch_atime
0.84% pread [kernel.vmlinux] [k] down_read
0.82% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode
0.71% pread [kernel.vmlinux] [k] do_syscall_64
0.68% pread [kernel.vmlinux] [k] ktime_get_coarse_real_ts64
0.66% pread [kernel.vmlinux] [k] __fget_light
0.53% pread [kernel.vmlinux] [k] exit_to_user_mode_prepare
0.45% pread [kernel.vmlinux] [k] up_read
0.44% pread pread [.] main
0.44% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode_prepare
0.26% pread [kernel.vmlinux] [k] entry_SYSCALL_64_safe_stack
0.12% pread pread [.] pread@plt
0.07% pread [kernel.vmlinux] [k] __fdget
0.00% perf [kernel.vmlinux] [k] x86_pmu_enable_all


This is profile of "read_iter - separate read and write path":

time 10.058s
53.05% pread [kernel.vmlinux] [k] copy_user_generic_string
6.82% pread [kernel.vmlinux] [k] current_time
6.27% pread [nvfs] [k] nvfs_read_iter
4.70% pread [kernel.vmlinux] [k] entry_SYSCALL_64
3.20% pread libc-2.31.so [.] __libc_pread
2.77% pread [kernel.vmlinux] [k] syscall_return_via_sysret
2.31% pread [kernel.vmlinux] [k] vfs_read
2.15% pread [kernel.vmlinux] [k] new_sync_read
2.06% pread [kernel.vmlinux] [k] __fsnotify_parent
2.02% pread [nvfs] [k] nvfs_bmap
1.87% pread [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe
1.86% pread [kernel.vmlinux] [k] iov_iter_advance
1.62% pread [kernel.vmlinux] [k] __x64_sys_pread64
1.40% pread [kernel.vmlinux] [k] atime_needs_update
0.99% pread [kernel.vmlinux] [k] syscall_enter_from_user_mode
0.85% pread [kernel.vmlinux] [k] touch_atime
0.85% pread [kernel.vmlinux] [k] down_read
0.84% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode
0.78% pread [kernel.vmlinux] [k] ktime_get_coarse_real_ts64
0.65% pread [kernel.vmlinux] [k] __fget_light
0.57% pread [kernel.vmlinux] [k] exit_to_user_mode_prepare
0.53% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode_prepare
0.45% pread pread [.] main
0.43% pread [kernel.vmlinux] [k] up_read
0.43% pread [kernel.vmlinux] [k] do_syscall_64
0.28% pread [kernel.vmlinux] [k] iov_iter_init
0.16% pread [kernel.vmlinux] [k] entry_SYSCALL_64_safe_stack
0.09% pread pread [.] pread@plt
0.03% pread [kernel.vmlinux] [k] __fdget
0.00% pread [kernel.vmlinux] [k] update_rt_rq_load_avg
0.00% perf [kernel.vmlinux] [k] x86_pmu_enable_all


This is your read_iter_locked profile (read_iter_locked is inlined into
nvfs_read_iter):

time 10.056s
50.71% pread [kernel.vmlinux] [k] copy_user_generic_string
6.95% pread [kernel.vmlinux] [k] current_time
5.22% pread [kernel.vmlinux] [k] entry_SYSCALL_64
4.29% pread libc-2.31.so [.] __libc_pread
4.17% pread [nvfs] [k] nvfs_read_iter
3.20% pread [kernel.vmlinux] [k] syscall_return_via_sysret
2.66% pread [kernel.vmlinux] [k] _copy_to_iter
2.44% pread [kernel.vmlinux] [k] __x64_sys_pread64
2.38% pread [kernel.vmlinux] [k] new_sync_read
2.37% pread [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe
2.26% pread [kernel.vmlinux] [k] vfs_read
2.02% pread [nvfs] [k] nvfs_bmap
1.88% pread [kernel.vmlinux] [k] __fsnotify_parent
1.46% pread [kernel.vmlinux] [k] atime_needs_update
1.08% pread [kernel.vmlinux] [k] touch_atime
0.83% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode
0.82% pread [kernel.vmlinux] [k] syscall_enter_from_user_mode
0.75% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode_prepare
0.73% pread [kernel.vmlinux] [k] __fget_light
0.65% pread [kernel.vmlinux] [k] down_read
0.58% pread pread [.] main
0.58% pread [kernel.vmlinux] [k] exit_to_user_mode_prepare
0.52% pread [kernel.vmlinux] [k] ktime_get_coarse_real_ts64
0.48% pread [kernel.vmlinux] [k] up_read
0.42% pread [kernel.vmlinux] [k] do_syscall_64
0.28% pread [kernel.vmlinux] [k] iov_iter_init
0.13% pread [kernel.vmlinux] [k] __fdget
0.12% pread [kernel.vmlinux] [k] entry_SYSCALL_64_safe_stack
0.03% pread pread [.] pread@plt
0.00% perf [kernel.vmlinux] [k] x86_pmu_enable_all

Mikulas

2021-01-11 11:48:55

by Mikulas Patocka

Subject: RE: [RFC v2] nvfs: a filesystem for persistent memory



On Mon, 11 Jan 2021, David Laight wrote:

> From: Al Viro <[email protected]> On Behalf Of Al Viro
> > Sent: 10 January 2021 16:20
> >
> > On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > > Hi
> > >
> > > I announce a new version of NVFS - a filesystem for persistent memory.
> > > http://people.redhat.com/~mpatocka/nvfs/
> > Utilities, AFAICS
> >
> > > git://leontynka.twibright.com/nvfs.git
> > Seems to hang on git pull at the moment... Do you have it anywhere else?
> >
> > > I found out that on NVFS, reading a file with the read method has 10%
> > > better performance than the read_iter method. The benchmark just reads the
> > > same 4k page over and over again - and the cost of creating and parsing
> > > the kiocb and iov_iter structures is just that high.
> >
> > Apples and oranges... What happens if you take
> >
> > ssize_t read_iter_locked(struct file *file, struct iov_iter *to, loff_t *ppos)
> > {
> >         struct inode *inode = file_inode(file);
> >         struct nvfs_memory_inode *nmi = i_to_nmi(inode);
> >         struct nvfs_superblock *nvs = inode->i_sb->s_fs_info;
> >         ssize_t total = 0;
> >         loff_t pos = *ppos;
> >         int r;
> >         int shift = nvs->log2_page_size;
> >         size_t i_size;
> >
> >         i_size = inode->i_size;
> >         if (pos >= i_size)
> >                 return 0;
> >         iov_iter_truncate(to, i_size - pos);
> >
> >         while (iov_iter_count(to)) {
> >                 void *blk, *ptr;
> >                 size_t page_mask = (1UL << shift) - 1;
> >                 unsigned page_offset = pos & page_mask;
> >                 unsigned prealloc = (iov_iter_count(to) + page_mask) >> shift;
> >                 unsigned size;
> >
> >                 blk = nvfs_bmap(nmi, pos >> shift, &prealloc, NULL, NULL, NULL);
> >                 if (unlikely(IS_ERR(blk))) {
> >                         r = PTR_ERR(blk);
> >                         goto ret_r;
> >                 }
> >                 size = ((size_t)prealloc << shift) - page_offset;
> >                 ptr = blk + page_offset;
> >                 if (unlikely(!blk)) {
> >                         size = min(size, (unsigned)PAGE_SIZE);
> >                         ptr = empty_zero_page;
> >                 }
> >                 size = copy_to_iter(to, ptr, size);
> >                 if (unlikely(!size)) {
> >                         r = -EFAULT;
> >                         goto ret_r;
> >                 }
> >
> >                 pos += size;
> >                 total += size;
> >         } while (iov_iter_count(to));
>
> That isn't the best formed loop!
>
> David

I removed the second "while" statement and fixed the arguments to
copy_to_iter - other than that, Al's function works.
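
For reference, the fixed tail of the loop (copy_to_iter() takes the
source address and length before the iterator):

                size = copy_to_iter(ptr, size, to);
                if (unlikely(!size)) {
                        r = -EFAULT;
                        goto ret_r;
                }

                pos += size;
                total += size;
        }       /* plain while loop now - no trailing "while (...)" */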

Mikulas

2021-01-11 12:01:56

by David Laight

Subject: RE: [RFC v2] nvfs: a filesystem for persistent memory

From: Mikulas Patocka
> Sent: 11 January 2021 11:44
> On Mon, 11 Jan 2021, David Laight wrote:
>
> > From: Al Viro <[email protected]> On Behalf Of Al Viro
> > > Sent: 10 January 2021 16:20
...
...
> > > while (iov_iter_count(to)) {
...
> > > size = copy_to_iter(to, ptr, size);
> > > if (unlikely(!size)) {
> > > r = -EFAULT;
> > > goto ret_r;
> > > }
> > >
> > > pos += size;
> > > total += size;
> > > } while (iov_iter_count(to));
> >
> > That isn't the best formed loop!
> >
> > David
>
> I removed the second "while" statement and fixed the arguments to
> copy_to_iter - other than that, Al's function works.

The extra while is easy to write and can be difficult to spot.
I've found them by looking at the object code before now!

Oh - the error return for copy_to_iter() is wrong.
It should (probably) return 'total' if it is nonzero.

David


2021-01-11 13:06:40

by David Laight

Subject: RE: [RFC v2] nvfs: a filesystem for persistent memory

From: Al Viro <[email protected]> On Behalf Of Al Viro
> Sent: 10 January 2021 16:20
>
> On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > Hi
> >
> > I announce a new version of NVFS - a filesystem for persistent memory.
> > http://people.redhat.com/~mpatocka/nvfs/
> Utilities, AFAICS
>
> > git://leontynka.twibright.com/nvfs.git
> Seems to hang on git pull at the moment... Do you have it anywhere else?
>
> > I found out that on NVFS, reading a file with the read method has 10%
> > better performance than the read_iter method. The benchmark just reads the
> > same 4k page over and over again - and the cost of creating and parsing
> > the kiocb and iov_iter structures is just that high.
>
> Apples and oranges... What happens if you take
>
> ssize_t read_iter_locked(struct file *file, struct iov_iter *to, loff_t *ppos)
> {
>         struct inode *inode = file_inode(file);
>         struct nvfs_memory_inode *nmi = i_to_nmi(inode);
>         struct nvfs_superblock *nvs = inode->i_sb->s_fs_info;
>         ssize_t total = 0;
>         loff_t pos = *ppos;
>         int r;
>         int shift = nvs->log2_page_size;
>         size_t i_size;
>
>         i_size = inode->i_size;
>         if (pos >= i_size)
>                 return 0;
>         iov_iter_truncate(to, i_size - pos);
>
>         while (iov_iter_count(to)) {
>                 void *blk, *ptr;
>                 size_t page_mask = (1UL << shift) - 1;
>                 unsigned page_offset = pos & page_mask;
>                 unsigned prealloc = (iov_iter_count(to) + page_mask) >> shift;
>                 unsigned size;
>
>                 blk = nvfs_bmap(nmi, pos >> shift, &prealloc, NULL, NULL, NULL);
>                 if (unlikely(IS_ERR(blk))) {
>                         r = PTR_ERR(blk);
>                         goto ret_r;
>                 }
>                 size = ((size_t)prealloc << shift) - page_offset;
>                 ptr = blk + page_offset;
>                 if (unlikely(!blk)) {
>                         size = min(size, (unsigned)PAGE_SIZE);
>                         ptr = empty_zero_page;
>                 }
>                 size = copy_to_iter(to, ptr, size);
>                 if (unlikely(!size)) {
>                         r = -EFAULT;
>                         goto ret_r;
>                 }
>
>                 pos += size;
>                 total += size;
>         } while (iov_iter_count(to));

That isn't the best formed loop!

David


2021-01-11 14:48:37

by Al Viro

Subject: Re: [RFC v2] nvfs: a filesystem for persistent memory

On Mon, Jan 11, 2021 at 11:57:08AM +0000, David Laight wrote:
> > > > size = copy_to_iter(to, ptr, size);
> > > > if (unlikely(!size)) {
> > > > r = -EFAULT;
> > > > goto ret_r;
> > > > }
> > > >
> > > > pos += size;
> > > > total += size;
> > > > } while (iov_iter_count(to));
> > >
> > > That isn't the best formed loop!
> > >
> > > David
> >
> > I removed the second "while" statement and fixed the arguments to
> > copy_to_iter - other than that, Al's function works.
>
> The extra while is easy to write and can be difficult to spot.
> I've found them looking as the object code before now!

That extra while comes from editing a cut'n'pasted piece of
source while writing a reply. Hint: the original had been a do-while.

> Oh - the error return for copy_to_iter() is wrong.
> It should (probably) return 'total' if it is nonzero.

The copy_to_iter() call there has an obvious problem
(arguments in the wrong order), but the return value is handled
correctly. It does not do a blind return -EFAULT. RTFS...
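
The tail in question is:

        return total ? total : r;

so a partial copy returns the number of bytes transferred, and -EFAULT
is produced only when nothing could be copied at all.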

2021-01-11 14:58:29

by David Laight

Subject: RE: [RFC v2] nvfs: a filesystem for persistent memory

From: Al Viro <[email protected]> On Behalf Of Al Viro
> Sent: 11 January 2021 14:44
> On Mon, Jan 11, 2021 at 11:57:08AM +0000, David Laight wrote:
> > > > > size = copy_to_iter(to, ptr, size);
> > > > > if (unlikely(!size)) {
> > > > > r = -EFAULT;
> > > > > goto ret_r;
> > > > > }
> > > > >
> > > > > pos += size;
> > > > > total += size;
> > > > > }
> > > >
> > > > David
> > >
> > > I fixed the arguments to
> > > copy_to_iter - other than that, Al's function works.
> >
>
> > Oh - the error return for copy_to_iter() is wrong.
> > It should (probably) return 'total' if it is nonzero.
>
> copy_to_iter() call there has an obvious problem
> (arguments in the wrong order), but return value is handled
> correctly. It does not do a blind return -EFAULT. RTFS...

Ah, I was looking at the version I'd cut the tail off...

David


2021-01-21 16:13:58

by Mikulas Patocka

Subject: Re: Expense of read_iter



On Thu, 21 Jan 2021, Matthew Wilcox wrote:

> On Wed, Jan 20, 2021 at 10:12:01AM -0500, Mikulas Patocka wrote:
> > Do you have some idea how to optimize the generic code that calls
> > ->read_iter?
>
> Yes.
>
> > It might be better to maintain an f_iocb_flags in the
> > struct file and just copy that unconditionally. We'd need to remember
> > to update it in fcntl(F_SETFL), but I think that's the only place.
>
> Want to give that a try?

Yes - send me the patch and I'll benchmark it.
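
For the record, a minimal sketch of the idea (f_iocb_flags is a proposed
new field, not in the tree; iocb_flags() and init_sync_kiocb() are the
existing helpers in include/linux/fs.h):

/* Refresh the cached value whenever fcntl(F_SETFL) changes f_flags. */
static inline void file_refresh_iocb_flags(struct file *filp)
{
        filp->f_iocb_flags = iocb_flags(filp);  /* proposed field */
}

/* The sync read path then loads one word instead of recomputing the
   flags per call; a real patch would fold this into init_sync_kiocb()
   itself rather than overriding its result as done here. */
static void init_kiocb_cached(struct kiocb *kiocb, struct file *filp)
{
        init_sync_kiocb(kiocb, filp);
        kiocb->ki_flags = filp->f_iocb_flags;
}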

Mikulas