2015-04-01 07:26:16

by Christoph Hellwig

Subject: Re: another pmem variant V2

On Tue, Mar 31, 2015 at 10:11:29PM +0000, Elliott, Robert (Server Storage) wrote:
> I used fio to test 4 KiB random read and write IOPS
> on a 2-socket x86 DDR4 system. With various cache attributes:
>
> attr  read   write  notes
> ----  -----  -----  -----
> UC    37 K   21 K   ioremap_nocache
> WB    3.6 M  2.5 M  ioremap
> WC    764 K  3.7 M  ioremap_wc
> WT    <not tested yet>  ioremap_wt
>
> So, although UC and WT are the only modes certain to be safe,
> the V1 default of UC provides abysmal performance - worse than
> a consumer-class SATA SSD.

It doesn't look quite as bad on my setup, but performance is fairly
bad here as well.

> A solution for x86 is to use the MOVNTI instruction in WB
> mode. This non-temporal hint uses a buffer like the write
> combining buffer, not filling the cache and not stopping
> everything in the CPU. The kernel function __copy_from_user()
> uses that instruction (with SFENCE at the end) - see
> arch/x86/lib/copy_user_nocache_64.S.
>
> If I made the change from memcpy() to __copy_from_user()
> correctly, that results in:
>
> attr      read   write  notes
> --------  -----  -----  -----
> WB w/NTI  2.4 M  2.6 M  __copy_from_user()
> WC w/NTI  3.2 M  2.1 M  __copy_from_user()

That looks a lot better. It doesn't help us with a pmem device
mapped directly into userspace using mmap with the DAX infrastructure,
though.

Note when we want to move to non-temporal copies we'll need to add
a new prototype, as __copy_from_user isn't guaranteed to use these,
and it is defined to only work on user addresses. That doesn't matter
on x86 but would blow up on say sparc or s390.
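
For readers who want to see the MOVNTI plus SFENCE pattern from the quoted
text in isolation, here is a minimal userspace sketch. It is not the kernel's
copy_user_nocache_64.S code and not a proposed prototype: the name
nt_copy_movnti is hypothetical, and the sketch assumes x86-64 with a compiler
that provides <emmintrin.h>.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si64 (MOVNTI), _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copy len bytes using 8-byte non-temporal stores.
 * dst is expected to be 8-byte aligned; a tail shorter than 8 bytes is
 * copied with an ordinary memcpy().
 */
static void nt_copy_movnti(void *dst, const void *src, size_t len)
{
    long long *d = dst;
    const unsigned char *s = src;
    size_t words = len / 8;

    assert(((uintptr_t)dst & 7) == 0);

    for (size_t i = 0; i < words; i++) {
        long long v;

        memcpy(&v, s + i * 8, sizeof(v));  /* load that tolerates unaligned src */
        _mm_stream_si64(&d[i], v);         /* MOVNTI: store bypassing the cache */
    }
    memcpy((unsigned char *)dst + words * 8, s + words * 8, len % 8);

    _mm_sfence();  /* order the non-temporal stores before any later stores */
}
```

The trailing SFENCE mirrors the "with SFENCE at the end" remark above; the
sketch makes no claim about when the data becomes durable on a real pmem
device.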


2015-04-02 15:11:36

by Elliott, Robert (Server Storage)

Subject: RE: another pmem variant V2



> On Tue, Mar 31, 2015 at 10:11:29PM +0000, Elliott, Robert (Server Storage)
> wrote:
> > I used fio to test 4 KiB random read and write IOPS
> > on a 2-socket x86 DDR4 system. With various cache attributes:
> >
> > attr  read   write  notes
> > ----  -----  -----  -----
> > UC    37 K   21 K   ioremap_nocache
> > WB    3.6 M  2.5 M  ioremap
> > WC    764 K  3.7 M  ioremap_wc
> > WT    <not tested yet>  ioremap_wt
> >
> > So, although UC and WT are the only modes certain to be safe,
> > the V1 default of UC provides abysmal performance - worse than
> > a consumer-class SATA SSD.
>
> It doesn't look quite as bad on my setup, but performance is fairly
> bad here as well.
>
> > A solution for x86 is to use the MOVNTI instruction in WB
> > mode. This non-temporal hint uses a buffer like the write
> > combining buffer, not filling the cache and not stopping
> > everything in the CPU. The kernel function __copy_from_user()
> > uses that instruction (with SFENCE at the end) - see
> > arch/x86/lib/copy_user_nocache_64.S.
> >
> > If I made the change from memcpy() to __copy_from_user()
> > correctly, that results in:
> >
> > attr      read   write  notes
> > --------  -----  -----  -----
> > WB w/NTI  2.4 M  2.6 M  __copy_from_user()
> > WC w/NTI  3.2 M  2.1 M  __copy_from_user()
>
> That looks a lot better. It doesn't help us with a pmem device
> mapped directly into userspace using mmap with the DAX infrastructure,
> though.
>
> Note when we want to move to non-temporal copies we'll need to add
> a new prototype, as __copy_from_user isn't guaranteed to use these,
> and it is defined to only work on user addresses. That doesn't matter
> on x86 but would blow up on say sparc or s390.

Here are some updated numbers including:
* WT (writethrough) cache attribute
* memcpy that uses non-temporal stores (MOVNTDQ) to the
persistent memory for block writes (rather than MOVNTI)
* memcpy that uses non-temporal loads (MOVNTDQA) from the
persistent memory for block reads

Attr  Copy        Read IOPS  Write IOPS
====  ==========  =========  ==========
UC    memcpy      36 K       22 K
UC    NT rd,wr    513 K      326 K

WB    memcpy      3.4 M      2.5 M
WB    NT rd,wr    3.3 M      3.5 M

WC    memcpy      776 K      3.5 M
WC    NT rd,wr    3.0 M      3.9 M

WT    memcpy      2.1 M      22 K
WT    NT rd,wr    3.3 M      2.1 M

a few other variations yielded the peak numbers:
WC    NT rd only  3.2 M      4.1 M
WC    NT wr only  712 K      4.6 M
WT    NT wr only  2.6 M      4.0 M
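
For a sense of scale, the IOPS figures can be turned into bandwidth, since
each I/O in these fio runs is 4 KiB. The helper below is a small sketch of
that arithmetic; the name iops_to_gib_per_s is made up for illustration, and
it assumes "K" and "M" in the table mean 10^3 and 10^6.

```c
#include <assert.h>
#include <stddef.h>

/* Convert an IOPS figure into GiB/s for a fixed block size in bytes. */
static double iops_to_gib_per_s(double iops, size_t block_bytes)
{
    assert(iops >= 0.0);
    return iops * (double)block_bytes / (1024.0 * 1024.0 * 1024.0);
}
```

Under those assumptions, iops_to_gib_per_s(4.6e6, 4096) comes to roughly
17.5 GiB/s for the peak write figure, while iops_to_gib_per_s(22e3, 4096) is
roughly 0.08 GiB/s for the UC memcpy write case.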

There are lots of tuning considerations for those memcpy
functions - how far to unroll the loop, whether to
include PREFETCHNTA instructions, etc.
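
To make those tuning knobs concrete, here is a minimal userspace sketch of a
write-side copy built on MOVNTDQ. It is not the code that produced the numbers
above: the name nt_memcpy_movntdq, the 4x unroll and the 256-byte prefetch
distance are all assumptions picked for illustration and are untuned. A
read-side variant would use _mm_stream_load_si128 (MOVNTDQA, SSE4.1) for the
loads instead; that is not shown here.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128 (MOVNTDQ), _mm_loadu_si128 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NT_PREFETCH_AHEAD 256  /* bytes; arbitrary, untuned guess */

/*
 * Hypothetical helper: copy len bytes using 16-byte non-temporal stores,
 * unrolled 4x so each iteration moves one 64-byte cache line.
 * dst must be 16-byte aligned; src may be unaligned.
 */
static void nt_memcpy_movntdq(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(((uintptr_t)dst & 15) == 0);

    while (len >= 64) {
        /* Optional PREFETCHNTA hint, only while the target is inside src. */
        if (len > NT_PREFETCH_AHEAD)
            _mm_prefetch((const char *)(s + NT_PREFETCH_AHEAD), _MM_HINT_NTA);

        __m128i a = _mm_loadu_si128((const __m128i *)(s + 0));
        __m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i c = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i e = _mm_loadu_si128((const __m128i *)(s + 48));

        _mm_stream_si128((__m128i *)(d + 0), a);   /* MOVNTDQ */
        _mm_stream_si128((__m128i *)(d + 16), b);
        _mm_stream_si128((__m128i *)(d + 32), c);
        _mm_stream_si128((__m128i *)(d + 48), e);

        s += 64;
        d += 64;
        len -= 64;
    }
    memcpy(d, s, len);  /* tail shorter than one cache line */

    _mm_sfence();  /* order the non-temporal stores before any later stores */
}
```

Changing the unroll factor or dropping the prefetch is a matter of editing
the loop body, which is where the tuning mentioned above would happen.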

2015-04-02 16:41:44

by Christoph Hellwig

Subject: Re: another pmem variant V2

On Thu, Apr 02, 2015 at 03:11:36PM +0000, Elliott, Robert (Server Storage) wrote:
> Attr  Copy        Read IOPS  Write IOPS
> ====  ==========  =========  ==========
> UC    memcpy      36 K       22 K
> UC    NT rd,wr    513 K      326 K
>
> WB    memcpy      3.4 M      2.5 M
> WB    NT rd,wr    3.3 M      3.5 M
>
> WC    memcpy      776 K      3.5 M
> WC    NT rd,wr    3.0 M      3.9 M
>
> WT    memcpy      2.1 M      22 K
> WT    NT rd,wr    3.3 M      2.1 M
>
> a few other variations yielded the peak numbers:
> WC    NT rd only  3.2 M      4.1 M
> WC    NT wr only  712 K      4.6 M
> WT    NT wr only  2.6 M      4.0 M
>
> There are lots of tuning considerations for those memcpy
> functions - how far to unroll the loop, whether to
> include PREFETCHNTA instructions, etc.

Looks like WC + NT would be a good start.

Can you prepare a patch to add your NT memcpy variants and
a second one to use them in the pmem driver?

2015-04-02 18:03:37

by Ingo Molnar

Subject: Re: another pmem variant V2


* Christoph Hellwig <[email protected]> wrote:

> On Thu, Apr 02, 2015 at 03:11:36PM +0000, Elliott, Robert (Server Storage) wrote:
> > Attr  Copy        Read IOPS  Write IOPS
> > ====  ==========  =========  ==========
> > UC    memcpy      36 K       22 K
> > UC    NT rd,wr    513 K      326 K
> >
> > WB    memcpy      3.4 M      2.5 M
> > WB    NT rd,wr    3.3 M      3.5 M
> >
> > WC    memcpy      776 K      3.5 M
> > WC    NT rd,wr    3.0 M      3.9 M
> >
> > WT    memcpy      2.1 M      22 K
> > WT    NT rd,wr    3.3 M      2.1 M
> >
> > a few other variations yielded the peak numbers:
> > WC    NT rd only  3.2 M      4.1 M
> > WC    NT wr only  712 K      4.6 M
> > WT    NT wr only  2.6 M      4.0 M
> >
> > There are lots of tuning considerations for those memcpy
> > functions - how far to unroll the loop, whether to
> > include PREFETCHNTA instructions, etc.
>
> Looks like WC + NT would be a good start.
>
> Can you prepare a patch to add your NT memcpy variants and
> a second one to use them in the pmem driver?

So we already have NT memcpy variants, see the copy_*_user_*_nocache()
primitives in arch/x86/. They could be used almost straight away for
kernel memory as well, as kernel buffers will not fault.

Thanks,

Ingo