I'd like to propose a discussion of how to take advantage of
persistent memory in network-attached storage scenarios.
RDMA runs on high speed network fabrics and offloads data
transfer from host CPUs. Thus it is a good match to the
performance characteristics of persistent memory.
Today Linux supports iSER, SRP, and NFS/RDMA on RDMA
fabrics. What kind of changes are needed in the Linux I/O
stack (in particular, storage targets) and in these storage
protocols to get the most benefit from ultra-low latency
storage?
There have been recent proposals about how storage protocols
and implementations might need to change (eg. Tom Talpey's
SNIA proposals for changing to a push data transfer model,
Sagi's proposal to utilize DAX under the NFS/RDMA server,
and my proposal for a new pNFS layout to drive RDMA data
transfer directly).
The outcome of the discussion would be to understand what
people are working on now and what is the desired
architectural approach in order to determine where storage
developers should be focused.
This could be either a BoF or a session during the main
tracks. There is sure to be a narrow segment of each
track's attendees that would have interest in this topic.
--
Chuck Lever
Hello,
On Mon 25-01-16 16:19:24, Chuck Lever wrote:
> I'd like to propose a discussion of how to take advantage of
> persistent memory in network-attached storage scenarios.
>
> RDMA runs on high speed network fabrics and offloads data
> transfer from host CPUs. Thus it is a good match to the
> performance characteristics of persistent memory.
>
> Today Linux supports iSER, SRP, and NFS/RDMA on RDMA
> fabrics. What kind of changes are needed in the Linux I/O
> stack (in particular, storage targets) and in these storage
> protocols to get the most benefit from ultra-low latency
> storage?
>
> There have been recent proposals about how storage protocols
> and implementations might need to change (eg. Tom Talpey's
> SNIA proposals for changing to a push data transfer model,
> Sagi's proposal to utilize DAX under the NFS/RDMA server,
> and my proposal for a new pNFS layout to drive RDMA data
> transfer directly).
>
> The outcome of the discussion would be to understand what
> people are working on now and what is the desired
> architectural approach in order to determine where storage
> developers should be focused.
>
> This could be either a BoF or a session during the main
> tracks. There is sure to be a narrow segment of each
> track's attendees that would have interest in this topic.
So hashing out details of pNFS layout isn't interesting to many people.
But if you want a broader architectural discussion about how to overcome
issues (and what those issues actually are) with the use of persistent
memory for NAS, then that may be interesting. So what do you actually want?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
> On Jan 26, 2016, at 10:25 AM, Atchley, Scott <[email protected]> wrote:
>
>> On Jan 25, 2016, at 4:19 PM, Chuck Lever <[email protected]> wrote:
>>
>> I'd like to propose a discussion of how to take advantage of
>> persistent memory in network-attached storage scenarios.
>>
>> RDMA runs on high speed network fabrics and offloads data
>> transfer from host CPUs. Thus it is a good match to the
>> performance characteristics of persistent memory.
>>
>> Today Linux supports iSER, SRP, and NFS/RDMA on RDMA
>> fabrics. What kind of changes are needed in the Linux I/O
>> stack (in particular, storage targets) and in these storage
>> protocols to get the most benefit from ultra-low latency
>> storage?
>>
>> There have been recent proposals about how storage protocols
>> and implementations might need to change (eg. Tom Talpey's
>> SNIA proposals for changing to a push data transfer model,
>> Sagi's proposal to utilize DAX under the NFS/RDMA server,
>> and my proposal for a new pNFS layout to drive RDMA data
>> transfer directly).
>>
>> The outcome of the discussion would be to understand what
>> people are working on now and what is the desired
>> architectural approach in order to determine where storage
>> developers should be focused.
>>
>> This could be either a BoF or a session during the main
>> tracks. There is sure to be a narrow segment of each
>> track's attendees that would have interest in this topic.
>>
>> --
>> Chuck Lever
>
> Chuck,
>
> One difference on targets is that some NVM/persistent memory may be byte-addressable while other NVM is only block addressable.
>
> Another difference is that NVMe-over-Fabrics will allow remote access of the target’s NVMe devices using the NVMe API.
As I understand it, NVMf devices look like local devices.
NVMf devices need globally unique naming to enable safe use
with pNFS and other remote storage access protocols.
--
Chuck Lever
> On Jan 26, 2016, at 3:25 AM, Jan Kara <[email protected]> wrote:
>
> Hello,
>
> On Mon 25-01-16 16:19:24, Chuck Lever wrote:
>> I'd like to propose a discussion of how to take advantage of
>> persistent memory in network-attached storage scenarios.
>>
>> RDMA runs on high speed network fabrics and offloads data
>> transfer from host CPUs. Thus it is a good match to the
>> performance characteristics of persistent memory.
>>
>> Today Linux supports iSER, SRP, and NFS/RDMA on RDMA
>> fabrics. What kind of changes are needed in the Linux I/O
>> stack (in particular, storage targets) and in these storage
>> protocols to get the most benefit from ultra-low latency
>> storage?
>>
>> There have been recent proposals about how storage protocols
>> and implementations might need to change (eg. Tom Talpey's
>> SNIA proposals for changing to a push data transfer model,
>> Sagi's proposal to utilize DAX under the NFS/RDMA server,
>> and my proposal for a new pNFS layout to drive RDMA data
>> transfer directly).
>>
>> The outcome of the discussion would be to understand what
>> people are working on now and what is the desired
>> architectural approach in order to determine where storage
>> developers should be focused.
>>
>> This could be either a BoF or a session during the main
>> tracks. There is sure to be a narrow segment of each
>> track's attendees that would have interest in this topic.
>
> So hashing out details of pNFS layout isn't interesting to many people.
> But if you want a broader architectural discussion about how to overcome
> issues (and what those issues actually are) with the use of persistent
> memory for NAS, then that may be interesting. So what do you actually want?
I mentioned pNFS briefly only as an example. There have
been a variety of proposals and approaches so far, and
it's time, I believe, to start focusing our efforts.
Thus I'm requesting a "broader architectural discussion
about how to overcome issues with the use of persistent
memory for NAS," in particular how we'd like to do this
with the Linux implementations of the iSER, SRP, and
NFS/RDMA protocols using DAX/pmem or NVM[ef].
It is not going to be like the well-worn paradigm that
involves a page cache on the storage target backed by
slow I/O operations. The protocol layers on storage
targets need a way to discover memory addresses of
persistent memory that will be used as source/sink
buffers for RDMA operations.
And making data durable after a write is going to need
some thought. So I believe some new plumbing will be
necessary.
I know this is not everyone's cup of tea. A BoF would
be fine, if the PC believes that is a better venue (and
I'm kind of leaning that way myself).
--
Chuck Lever
On Tue, Jan 26, 2016 at 10:29:35AM -0500, Chuck Lever wrote:
> As I understand it, NVMf devices look like local devices.
> NVMf devices need globally unique naming to enable safe use
> with pNFS and other remote storage access protocols.
NVMe provides globally unique identifiers similar to SCSI, and in fact
there is even a standardised mapping to SCSI. The current SCSI layout
draft will work fine with both multi-ported PCIe NVMe devices and
future fabrics devices.
On Tue, Jan 26, 2016 at 10:58:44AM -0500, Chuck Lever wrote:
> It is not going to be like the well-worn paradigm that
> involves a page cache on the storage target backed by
> slow I/O operations. The protocol layers on storage
> targets need a way to discover memory addresses of
> persistent memory that will be used as source/sink
> buffers for RDMA operations.
>
> And making data durable after a write is going to need
> some thought. So I believe some new plumbing will be
> necessary.
Haven't we already solved this for the pNFS file driver that XFS
implements? i.e. these export operations:
	int (*get_uuid)(struct super_block *sb, u8 *buf, u32 *len, u64 *offset);
	int (*map_blocks)(struct inode *inode, loff_t offset,
			  u64 len, struct iomap *iomap,
			  bool write, u32 *device_generation);
	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
			     int nr_iomaps, struct iattr *iattr);
That gives us mapping/allocation of file offsets to sector mappings, which can
then trivially be used to grab the memory address through the bdev
->direct_access method, yes?
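To make that concrete, here is a rough, untested sketch of how a target could
turn a file range into a pmem address. Only ->map_blocks comes from the export
operations above; pmem_addr_for_range() and target_direct_access() are made-up
names standing in for the ->direct_access step:

	/*
	 * Sketch only: resolve a file range to a persistent memory address
	 * on the storage target.
	 */
	static int pmem_addr_for_range(struct inode *inode, loff_t offset,
				       u64 len, void **kaddr, u32 *devgen)
	{
		const struct export_operations *exp = inode->i_sb->s_export_op;
		struct iomap iomap;
		int ret;

		/* Allocate/map file blocks; returns a device + sector extent. */
		ret = exp->map_blocks(inode, offset, len, &iomap, true, devgen);
		if (ret)
			return ret;

		/*
		 * Translate that extent into a kernel virtual address of the
		 * pmem backing it, via the bdev ->direct_access method.  The
		 * helper below is assumed; the real call depends on the
		 * kernel version.
		 */
		*kaddr = target_direct_access(&iomap, len);
		return *kaddr ? 0 : -ENXIO;
	}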
Cheers,
Dave.
--
Dave Chinner
[email protected]
>> So hashing out details of pNFS layout isn't interesting to many people.
>> But if you want a broader architectural discussion about how to overcome
>> issues (and what those issues actually are) with the use of persistent
>> memory for NAS, then that may be interesting. So what do you actually want?
>
> I mentioned pNFS briefly only as an example. There have
> been a variety of proposals and approaches so far, and
> it's time, I believe, to start focusing our efforts.
>
> Thus I'm requesting a "broader architectural discussion
> about how to overcome issues with the use of persistent
> memory for NAS," in particular how we'd like to do this
> with the Linux implementations of the iSER, SRP, and
> NFS/RDMA protocols using DAX/pmem or NVM[ef].
I agree.
I anticipate that we'll gradually see more and more implementations
optimizing remote storage access in the presence of pmem devices (maybe
not only over RDMA?). The straightforward approach would be for each
implementation to have its own logic for accessing remote pmem
devices, but I think we have a chance to consolidate that into a single
API for everyone.
I think the most natural place to start is NFS/RDMA (SCSI would be a bit
more challenging...)
> It is not going to be like the well-worn paradigm that
> involves a page cache on the storage target backed by
> slow I/O operations. The protocol layers on storage
> targets need a way to discover memory addresses of
> persistent memory that will be used as source/sink
> buffers for RDMA operations.
> And making data durable after a write is going to need
> some thought. So I believe some new plumbing will be
> necessary.
The challenge here is the persistence semantics that are missing in
today's HCAs, so I think we should aim for a SW solution for
remote persistence semantics, with sufficient hooks for possible
HW that might be able to provide it in the future...
> I know this is not everyone's cup of tea. A BoF would
> be fine, if the PC believes that is a better venue (and
> I'm kind of leaning that way myself).
I'd be happy to join such a discussion.
> On Jan 26, 2016, at 7:04 PM, Dave Chinner <[email protected]> wrote:
>
> On Tue, Jan 26, 2016 at 10:58:44AM -0500, Chuck Lever wrote:
>> It is not going to be like the well-worn paradigm that
>> involves a page cache on the storage target backed by
>> slow I/O operations. The protocol layers on storage
>> targets need a way to discover memory addresses of
>> persistent memory that will be used as source/sink
>> buffers for RDMA operations.
>>
>> And making data durable after a write is going to need
>> some thought. So I believe some new plumbing will be
>> necessary.
>
> Haven't we already solve this for the pNFS file driver that XFS
> implements? i.e. these export operations:
>
> int (*get_uuid)(struct super_block *sb, u8 *buf, u32 *len, u64 *offset);
> int (*map_blocks)(struct inode *inode, loff_t offset,
>                   u64 len, struct iomap *iomap,
>                   bool write, u32 *device_generation);
> int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
>                      int nr_iomaps, struct iattr *iattr);
>
> so mapping/allocation of file offset to sector mappings, which can
> then trivially be used to grab the memory address through the bdev
> ->direct_access method, yes?
Thanks, that makes sense. How would such addresses be
utilized?
I'll speak about the NFS/RDMA server for this example, as
I am more familiar with that than with block targets. When
I say "NFS server" here I mean the software service on the
storage target that speaks the NFS protocol.
In today's RDMA-enabled storage protocols, an initiator
exposes its memory (in small segments) to storage targets,
sends a request, and the target's network transport performs
RDMA Read and Write operations to move the payload data in
that request.
Assuming the NFS server is somehow aware that what it is
getting from ->direct_access is a persistent memory address
and not an LBA, it would then have to pass it down to the
transport layer (svcrdma) so that the address can be used
as a source or sink buffer for RDMA operations.
For an NFS READ, this should be straightforward. An RPC
request comes in, the NFS server identifies the memory that
is to source the READ reply and passes the address of that
memory to the transport, which then pushes the data in
that memory via an RDMA Write to the client.
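In verbs terms, that last push is a single RDMA Write work request. A rough
sketch follows, shown with the userspace verbs API purely for brevity (svcrdma
uses the kernel verbs API instead); the qp, the local MR, and the
client-provided rkey/remote address are assumed to be set up already:

	#include <infiniband/verbs.h>
	#include <stdint.h>

	static int rdma_write_payload(struct ibv_qp *qp, struct ibv_mr *mr,
				      void *payload, uint32_t len,
				      uint64_t peer_addr, uint32_t peer_rkey)
	{
		struct ibv_sge sge = {
			.addr   = (uintptr_t)payload,	/* e.g. a pmem address */
			.length = len,
			.lkey   = mr->lkey,
		};
		struct ibv_send_wr wr = {
			.opcode     = IBV_WR_RDMA_WRITE,
			.send_flags = IBV_SEND_SIGNALED,
			.sg_list    = &sge,
			.num_sge    = 1,
		};
		struct ibv_send_wr *bad_wr;

		wr.wr.rdma.remote_addr = peer_addr;	/* from the client's chunk list */
		wr.wr.rdma.rkey        = peer_rkey;

		return ibv_post_send(qp, &wr, &bad_wr);
	}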
NFS WRITES are more difficult. An RPC request comes in,
and today the transport layer gathers incoming payload data
in anonymous pages before the NFS server even knows there
is an incoming RPC. We'd have to add some kind of hook to
enable the NFS server and the underlying filesystem to
provide appropriate sink buffers to the transport.
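Something along these lines, perhaps (entirely hypothetical; this is not an
existing svcrdma interface, and all names here are made up):

	/*
	 * Hypothetical hook: before the transport posts RDMA Reads for an
	 * incoming NFS WRITE, the upper layer supplies pre-mapped (e.g. pmem)
	 * sink buffers instead of anonymous pages.
	 */
	struct svc_rdma_sink {
		void	*addr;		/* pmem address provided by the filesystem */
		u32	lkey;		/* local key covering that memory */
		u32	length;
	};

	/* Returns the number of sink segments filled in, or a negative errno. */
	int (*rdma_reserve_sinks)(struct svc_rqst *rqstp, loff_t offset,
				  size_t len, struct svc_rdma_sink *sinks,
				  int max_sinks);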
After the NFS WRITE request has been wholly received, the
NFS server today uses vfs_writev to put that data into the
target file. We'd probably want something more efficient
for pmem-backed filesystems. We want something more
efficient for traditional page cache-based filesystems
anyway.
Every NFS WRITE larger than a page would be essentially
CoW, since the filesystem would need to provide "anonymous"
blocks to sink incoming WRITE data and then transition
those blocks into the target file? Not sure how this works
for pNFS with block devices.
Finally a client needs to perform an NFS COMMIT to ensure
that the written data is at rest on durable storage. We
could insist that all NFS WRITE operations to pmem will
be DATA_SYNC or better (in other words, abandon UNSTABLE
mode). If not, then a separate NFS COMMIT/LAYOUTCOMMIT
is necessary to flush memory caches and ensure data
durability. An extra RPC round trip is likely not a good
idea when the cost structure of NFS WRITE is so much
different than it is for traditional block devices.
NFS WRITE is really the thing we want to go as fast as
possible, btw. NFS READ on RDMA is already faster, for
reasons I won't go into here. Aside from that, NFS READ
results are frequently cached on clients, and some of the
cost of NFS READ is already hidden by read-ahead. Because
read(2) is often satisfied from a local cache, application
progress is more frequently blocked by pending write(2)
calls than by reads.
A fully generic solution would have to provide NFS service
for transports that do not enable direct data placement
(eg TCP), and for filesystems that are legacy page
cache-based (anything residing on a traditional block
device).
I imagine that the issues are similar for block targets, if
they assume block devices are fronted by a memory cache.
--
Chuck Lever
On 01/25/2016 11:19 PM, Chuck Lever wrote:
> I'd like to propose a discussion of how to take advantage of
> persistent memory in network-attached storage scenarios.
>
> RDMA runs on high speed network fabrics and offloads data
> transfer from host CPUs. Thus it is a good match to the
> performance characteristics of persistent memory.
>
> Today Linux supports iSER, SRP, and NFS/RDMA on RDMA
> fabrics. What kind of changes are needed in the Linux I/O
> stack (in particular, storage targets) and in these storage
> protocols to get the most benefit from ultra-low latency
> storage?
>
> There have been recent proposals about how storage protocols
> and implementations might need to change (eg. Tom Talpey's
> SNIA proposals for changing to a push data transfer model,
> Sagi's proposal to utilize DAX under the NFS/RDMA server,
> and my proposal for a new pNFS layout to drive RDMA data
> transfer directly).
>
> The outcome of the discussion would be to understand what
> people are working on now and what is the desired
> architectural approach in order to determine where storage
> developers should be focused.
>
> This could be either a BoF or a session during the main
> tracks. There is sure to be a narrow segment of each
> track's attendees that would have interest in this topic.
>
I would like to attend this talk, and also talk about
a target we have been developing / utilizing that we would like
to propose as a Linux standard driver.
(It would be very important for me to also attend the other
pmem talks in LSF, as well as some of the MM and FS talks
proposed so far)
RDMA passive target
~~~~~~~~~~~~~~~~~~~
The idea is to have a storage brick that exports a very
low-level, pure RDMA API to access its memory-based storage.
The brick might be battery-backed volatile memory, or
pmem based. In any case the brick might offer a much higher
capacity than memory by "tiering" to slower media,
which is enabled by the API.
The API is simple (a rough C sketch follows the list below):
1. Alloc_2M_block_at_virtual_address (ADDR_64_BIT)
ADDR_64_BIT is any virtual address and defines the logical ID of the block.
If the ID is already allocated an error is returned.
If storage is exhausted return => ENOSPC
2. Free_2M_block_at_virtual_address (ADDR_64_BIT)
Space for logical ID is returned to free store and the ID becomes free for
a new allocation.
3. map_virtual_address(ADDR_64_BIT, flags) => RDMA handle
previously allocated virtual address is locked in memory and an RDMA handle
is returned.
Flags: read-only, read-write, shared and so on...
4. unmap__virtual_address(ADDR_64_BIT)
At this point the brick can write data to slower storage if memory space
is needed. The RDMA handle from [3] is revoked.
5. List_mapped_IDs
An extent based list of all allocated ranges. (This is usually used on
mount or after a crash)
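Written out as hypothetical C prototypes, purely to make the list above easier
to discuss (every name and type here is illustrative only):

	#include <stdint.h>

	typedef uint64_t addr64_t;		/* logical ID of a 2M block */

	struct brick_rdma_handle {
		uint32_t rkey;			/* remote key for RDMA access */
		uint64_t remote_addr;		/* target-side virtual address */
	};

	/* 1. Allocate a 2M block at a caller-chosen logical ID (error if taken, ENOSPC if full). */
	int brick_alloc_2m(addr64_t id);

	/* 2. Return the block to the free store; the ID becomes free for reuse. */
	int brick_free_2m(addr64_t id);

	/* 3. Pin the block in memory and return an RDMA handle for it. */
	int brick_map(addr64_t id, unsigned int flags, struct brick_rdma_handle *out);

	/* 4. Unpin; the brick may tier the data to slower media, and the handle is revoked. */
	int brick_unmap(addr64_t id);

	/* 5. Extent-based list of allocated ranges (used at mount or after a crash). */
	int brick_list_mapped(addr64_t *starts, uint64_t *lengths, int max_extents);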
The dumb brick is not the network allocator / storage manager at all, and it
is not a smart target / server like an iSER target or a pNFS DS. A SW-defined
application can do that, on top of the dumb brick. The motivation is a low-level,
very low latency API + library, which can be built upon for higher protocols or
used directly for a very low latency cluster.
It does, however, manage a virtual allocation map of logical-to-physical mappings
of the 2M blocks.
Currently both drivers, initiator and target, are in kernel, but with the
latest advancements by Dan Williams they can be implemented in user mode as well.
Almost.
The almost is because:
1. If the target is over a /dev/pmemX then all is fine: we have 2M contiguous
memory blocks.
2. If the target is over an FS, we have a proposal pending for an falloc_2M_flag
to ask the FS for contiguous 2M allocations only. If any of the 2M allocations
fails then falloc returns ENOSPC. This way we guarantee that each 2M block can be
mapped by a single RDMA handle.
An FS for this purpose is nice for over-allocated / dynamic space usage by
a target and other resources in the server.
RDMA Initiator
~~~~~~~~~~~~~~~~~~~
The initiator is just a simple library. Both usermode and Kernel side should
be available, for direct access to the RDMA-passive-brick.
Thanks.
Boaz
> --
> Chuck Lever
>
On Wed, 2016-01-27 at 18:54 +0200, Boaz Harrosh wrote:
> On 01/25/2016 11:19 PM, Chuck Lever wrote:
> > I'd like to propose a discussion of how to take advantage of
> > persistent memory in network-attached storage scenarios.
> >
> > RDMA runs on high speed network fabrics and offloads data
> > transfer from host CPUs. Thus it is a good match to the
> > performance characteristics of persistent memory.
> >
> > Today Linux supports iSER, SRP, and NFS/RDMA on RDMA
> > fabrics. What kind of changes are needed in the Linux I/O
> > stack (in particular, storage targets) and in these storage
> > protocols to get the most benefit from ultra-low latency
> > storage?
> >
> > There have been recent proposals about how storage protocols
> > and implementations might need to change (eg. Tom Talpey's
> > SNIA proposals for changing to a push data transfer model,
> > Sagi's proposal to utilize DAX under the NFS/RDMA server,
> > and my proposal for a new pNFS layout to drive RDMA data
> > transfer directly).
> >
> > The outcome of the discussion would be to understand what
> > people are working on now and what is the desired
> > architectural approach in order to determine where storage
> > developers should be focused.
> >
> > This could be either a BoF or a session during the main
> > tracks. There is sure to be a narrow segment of each
> > track's attendees that would have interest in this topic.
> >
>
> I would like to attend this talk, and also talk about
> a target we have been developing / utilizing that we would like
> to propose as a Linux standard driver.
For everyone who hasn't sent an attend request in, this is a good
example of how not to get an invitation. When collecting the requests
to attend, the admins tend to fold to the top of the thread, so if you
send a request to attend as a reply to somebody else, it won't be seen
by that process.
You don't need to resend this one, I noticed it, but just in case next
time ...
James
Hey Boaz,
> RDMA passive target
> ~~~~~~~~~~~~~~~~~~~
>
> The idea is to have a storage brick that exports a very
> low level pure RDMA API to access its memory based storage.
> The brick might be battery backed volatile based memory, or
> pmem based. In any case the brick might utilize a much higher
> capacity then memory by utilizing a "tiering" to slower media,
> which is enabled by the API.
>
> The API is simple:
>
> 1. Alloc_2M_block_at_virtual_address (ADDR_64_BIT)
> ADDR_64_BIT is any virtual address and defines the logical ID of the block.
> If the ID is already allocated an error is returned.
> If storage is exhausted return => ENOSPC
> 2. Free_2M_block_at_virtual_address (ADDR_64_BIT)
> Space for logical ID is returned to free store and the ID becomes free for
> a new allocation.
> 3. map_virtual_address(ADDR_64_BIT, flags) => RDMA handle
> previously allocated virtual address is locked in memory and an RDMA handle
> is returned.
> Flags: read-only, read-write, shared and so on...
> 4. unmap__virtual_address(ADDR_64_BIT)
> At this point the brick can write data to slower storage if memory space
> is needed. The RDMA handle from [3] is revoked.
> 5. List_mapped_IDs
> An extent based list of all allocated ranges. (This is usually used on
> mount or after a crash)
My understanding is that you're describing a wire protocol, correct?
> The dumb brick is not the Network allocator / storage manager at all. and it
> is not a smart target / server. like an iser-target or pnfs-DS. A SW defined
> application can do that, on top of the Dumb-brick. The motivation is a low level
> very low latency API+library, which can be built upon for higher protocols or
> used directly for very low latency cluster.
> It does however mange a virtual allocation map of logical to physical mapping
> of the 2M blocks.
The challenge in my mind would be to have persistence semantics in
place.
>
> Currently both drivers initiator and target are in Kernel, but with
> latest advancement by Dan Williams it can be implemented in user-mode as well,
> Almost.
>
> The almost is because:
> 1. If the target is over a /dev/pmemX then all is fine we have 2M contiguous
> memory blocks.
> 2. If the target is over an FS, we have a proposal pending for an falloc_2M_flag
> to ask the FS for a contiguous 2M allocations only. If any of the 2M allocations
> fail then return ENOSPC from falloc. This way we guaranty that each 2M block can be
> mapped by a single RDAM handle.
Umm, you don't need the 2M to be contiguous in order to represent them
as a single RDMA handle. If that was true iSER would have never worked.
Or I misunderstood what you meant...
On Wed, Jan 27, 2016 at 10:55:36AM -0500, Chuck Lever wrote:
>
> > On Jan 26, 2016, at 7:04 PM, Dave Chinner <[email protected]> wrote:
> >
> > On Tue, Jan 26, 2016 at 10:58:44AM -0500, Chuck Lever wrote:
> >> It is not going to be like the well-worn paradigm that
> >> involves a page cache on the storage target backed by
> >> slow I/O operations. The protocol layers on storage
> >> targets need a way to discover memory addresses of
> >> persistent memory that will be used as source/sink
> >> buffers for RDMA operations.
> >>
> >> And making data durable after a write is going to need
> >> some thought. So I believe some new plumbing will be
> >> necessary.
> >
> > Haven't we already solve this for the pNFS file driver that XFS
> > implements? i.e. these export operations:
> >
> > int (*get_uuid)(struct super_block *sb, u8 *buf, u32 *len, u64 *offset);
> > int (*map_blocks)(struct inode *inode, loff_t offset,
> >                   u64 len, struct iomap *iomap,
> >                   bool write, u32 *device_generation);
> > int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
> >                      int nr_iomaps, struct iattr *iattr);
> >
> > so mapping/allocation of file offset to sector mappings, which can
> > then trivially be used to grab the memory address through the bdev
> > ->direct_access method, yes?
>
> Thanks, that makes sense. How would such addresses be
> utilized?
That's a different problem, and you need to talk to the IO guys
about that.
> I'll speak about the NFS/RDMA server for this example, as
> I am more familiar with that than with block targets. When
> I say "NFS server" here I mean the software service on the
> storage target that speaks the NFS protocol.
>
> In today's RDMA-enabled storage protocols, an initiator
> exposes its memory (in small segments) to storage targets,
> sends a request, and the target's network transport performs
> RDMA Read and Write operations to move the payload data in
> that request.
>
> Assuming the NFS server is somehow aware that what it is
> getting from ->direct_access is a persistent memory address
> and not an LBA, it would then have to pass it down to the
> transport layer (svcrdma) so that the address can be used
> as a source or sink buffer for RDMA operations.
>
> For an NFS READ, this should be straightforward. An RPC
> request comes in, the NFS server identifies the memory that
> is to source the READ reply and passes the address of that
> memory to the transport, which then pushes the data in
> that memory via an RDMA Write to the client.
Right, it's no different from using the page cache, except for
how the memory address is then mapped by the IO subsystem for the
DMA transfer...
> NFS WRITES are more difficult. An RPC request comes in,
> and today the transport layer gathers incoming payload data
> in anonymous pages before the NFS server even knows there
> is an incoming RPC. We'd have to add some kind of hook to
> enable the NFS server and the underlying filesystem to
> provide appropriate sink buffers to the transport.
->map_blocks needs to be called to allocate/map the file offset and
return a memory address before the data is sent from the client.
> After the NFS WRITE request has been wholly received, the
> NFS server today uses vfs_writev to put that data into the
> target file. We'd probably want something more efficient
> for pmem-backed filesystems. We want something more
> efficient for traditional page cache-based filesystems
> anyway.
Yup. see above.
> Every NFS WRITE larger than a page would be essentially
> CoW, since the filesystem would need to provide "anonymous"
> blocks to sink incoming WRITE data and then transition
> those blocks into the target file? Not sure how this works
> for pNFS with block devices.
No, ->map_blocks can return blocks that are already allocated to
the file at the given offset, hence overwrite in place works just
fine.
> Finally a client needs to perform an NFS COMMIT to ensure
> that the written data is at rest on durable storage. We
> could insist that all NFS WRITE operations to pmem will
> be DATA_SYNC or better (in other words, abandon UNSTABLE
> mode).
You could, but you'd still need the two map/commit calls into the
filesystem to get the memory and mark the write done...
> If not, then a separate NFS COMMIT/LAYOUTCOMMIT
> is necessary to flush memory caches and ensure data
> durability. An extra RPC round trip is likely not a good
> idea when the cost structure of NFS WRITE is so much
> different than it is for traditional block devices.
IIRC, ->commit_blocks is called from the LAYOUTCOMMIT operation.
You'll need to call this to pair the ->map_blocks call above that
provided the memory as the data sink for the write. This is because
->map_blocks allocates unwritten extents so that stale data will not
be exposed before the write is complete and ->commit_blocks is called
to remove the unwritten extent flag.
> I imagine that the issues are similar for block targets, if
> they assume block devices are fronted by a memory cache.
Yup, hence the "three phase" write operation - map blocks, write
data, commit blocks.
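As an untested sketch, the server-side sequence might look like this; only
->map_blocks and ->commit_blocks come from the export operations quoted above,
and receive_rdma_write_into() is an assumed placeholder for the RDMA data
placement step:

	static int pmem_target_write(struct inode *inode, loff_t off, u64 len)
	{
		const struct export_operations *exp = inode->i_sb->s_export_op;
		struct iomap iomap;
		struct iattr iattr = { .ia_valid = 0 };	/* size/time updates elided */
		u32 devgen;
		int ret;

		/* Phase 1: allocate/map file blocks (returned as unwritten extents). */
		ret = exp->map_blocks(inode, off, len, &iomap, true, &devgen);
		if (ret)
			return ret;

		/* Phase 2: land the client's data directly in the mapped pmem
		 * (RDMA Read into the address obtained via ->direct_access). */
		ret = receive_rdma_write_into(&iomap, len);	/* assumed placeholder */
		if (ret)
			return ret;

		/* Phase 3: clear the unwritten flag so the data becomes visible. */
		return exp->commit_blocks(inode, &iomap, 1, &iattr);
	}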
Cheers,
Dave.
--
Dave Chinner
[email protected]
On 01/27/2016 07:27 PM, Sagi Grimberg wrote:
> Hey Boaz,
>
>> RDMA passive target
>> ~~~~~~~~~~~~~~~~~~~
>>
>> The idea is to have a storage brick that exports a very
>> low level pure RDMA API to access its memory based storage.
>> The brick might be battery backed volatile based memory, or
>> pmem based. In any case the brick might utilize a much higher
>> capacity then memory by utilizing a "tiering" to slower media,
>> which is enabled by the API.
>>
>> The API is simple:
>>
>> 1. Alloc_2M_block_at_virtual_address (ADDR_64_BIT)
>> ADDR_64_BIT is any virtual address and defines the logical ID of the block.
>> If the ID is already allocated an error is returned.
>> If storage is exhausted return => ENOSPC
>> 2. Free_2M_block_at_virtual_address (ADDR_64_BIT)
>> Space for logical ID is returned to free store and the ID becomes free for
>> a new allocation.
>> 3. map_virtual_address(ADDR_64_BIT, flags) => RDMA handle
>> previously allocated virtual address is locked in memory and an RDMA handle
>> is returned.
>> Flags: read-only, read-write, shared and so on...
>> 4. unmap__virtual_address(ADDR_64_BIT)
>> At this point the brick can write data to slower storage if memory space
>> is needed. The RDMA handle from [3] is revoked.
>> 5. List_mapped_IDs
>> An extent based list of all allocated ranges. (This is usually used on
>> mount or after a crash)
>
> My understanding is that you're describing a wire protocol correct?
>
Almost. Not yet a wire protocol, just a high-level functionality description
for now. But yes, a wire protocol in the sense that I want an open source library
that will be good for kernel and usermode. Any non-Linux platform should be
able to port the code base and use it.
That said, at some early point we should lock the wire protocol for inter-version
compatibility, or at least have a feature negotiation as things evolve.
>> The dumb brick is not the Network allocator / storage manager at all. and it
>> is not a smart target / server. like an iser-target or pnfs-DS. A SW defined
>> application can do that, on top of the Dumb-brick. The motivation is a low level
>> very low latency API+library, which can be built upon for higher protocols or
>> used directly for very low latency cluster.
>> It does however mange a virtual allocation map of logical to physical mapping
>> of the 2M blocks.
>
> The challenge in my mind would be to have persistence semantics in
> place.
>
OK, thanks for bringing this up.
So there are two separate issues here, which are actually not related to the
above API. It is more an initiator issue, since once the server has done the above
map_virtual_address() and returned a key to the client machine it is out of the way.
On the initiator what we do is: all RDMA sends are asynchronous. Once the user does
an fsync we do a synchronous read of one byte, to guarantee that both the initiator's
and the server's NICs flush all write buffers to the server's PCIe controller.
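In verbs terms the flush is roughly the following (a sketch using the userspace
API for brevity; cq/qp/mr setup is assumed and error handling is minimal):

	#include <infiniband/verbs.h>
	#include <stdint.h>

	/*
	 * After a stream of RDMA Writes, post a tiny signaled RDMA Read and
	 * wait for its completion.  Its completion cannot arrive before the
	 * preceding writes have been pushed through both NICs, so it acts as
	 * the "flush to the server's PCIe controller" described above
	 * (subject to the DDIO caveats below).
	 */
	static int rdma_flush_read(struct ibv_qp *qp, struct ibv_cq *cq,
				   struct ibv_mr *mr, void *scratch,
				   uint64_t remote_addr, uint32_t rkey)
	{
		struct ibv_sge sge = {
			.addr = (uintptr_t)scratch, .length = 1, .lkey = mr->lkey,
		};
		struct ibv_send_wr wr = {
			.opcode     = IBV_WR_RDMA_READ,
			.send_flags = IBV_SEND_SIGNALED,
			.sg_list    = &sge,
			.num_sge    = 1,
		};
		struct ibv_send_wr *bad_wr;
		struct ibv_wc wc;
		int n;

		wr.wr.rdma.remote_addr = remote_addr;
		wr.wr.rdma.rkey        = rkey;

		if (ibv_post_send(qp, &wr, &bad_wr))
			return -1;

		do {			/* busy-poll for the read completion */
			n = ibv_poll_cq(cq, 1, &wc);
		} while (n == 0);

		return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
	}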
But here lies the problem: in modern servers the PCIe/memory controller chooses
to write incoming PCIe data (actually any PCI data) directly to the L3 cache, on the
principle that the receiving application will access that memory very soon.
This is what is called DDIO.
Now here there is big uncertainty, and we are still investigating. The only working
ADR machine we have has an old NVDIMM-type-12 legacy BIOS. (All the newer type-6,
non-NFIT BIOS systems never worked and had various problems with persistence.)
That only working system, though advertised as a DDIO machine, does not exhibit
the above problem.
On a test of RDMA-SEND x X; RDMA-READ(0,1); POWER-OFF,
we are always fine and never get a compare error between the machines.
[I guess it depends on the specific system and the depth of the ADR flushing
on power-off; there are 15 milliseconds of power to work with.]
But the Intel documentation says differently: it says that in a DDIO system
persistence is not guaranteed.
There are a few ways to solve this:
1. Put a remote procedure on the passive machine that will do a CLFLUSH of
all written regions. We hate that in our system and will not want to do
so; this is CPU intensive and will kill our latencies.
So NO!
2. Disable DDIO for the NIC we use for storage.
[In our setup we can do this because there is a 10G management NIC for
regular traffic, and a 40/100G Mellanox card dedicated to storage, so for
the storage NIC DDIO may be disabled. (Though again it makes no difference
for us, because in our lab it works the same with or without it.)
]
3. There is a future option that we asked Intel for, which we should talk about
here: set a per-packet header flag that says DDIO off/on, and a way for the
PCIe card to enforce it. The Intel guys were positive about this initiative and said
they will support it in the next chipsets.
But I do not have any specifics on this option.
For us, only option two is viable right now.
In any case, to answer your question: at the initiator we assume that after a
synchronous read of a single byte over the RDMA channel, all previous writes are
persistent. [With DDIO set to off, once option 3 is available.]
But this is only the very little information I was able to gather and the
little experimentation we did here in the lab. A real working NVDIMM ADR
system is very scarce so far, and all vendors came up short for us with
real off-the-shelf systems.
I was hoping you might have more information for me.
>>
>> Currently both drivers initiator and target are in Kernel, but with
>> latest advancement by Dan Williams it can be implemented in user-mode as well,
>> Almost.
>>
>> The almost is because:
>> 1. If the target is over a /dev/pmemX then all is fine we have 2M contiguous
>> memory blocks.
>> 2. If the target is over an FS, we have a proposal pending for an falloc_2M_flag
>> to ask the FS for a contiguous 2M allocations only. If any of the 2M allocations
>> fail then return ENOSPC from falloc. This way we guaranty that each 2M block can be
>> mapped by a single RDAM handle.
>
> Umm, you don't need the 2M to be contiguous in order to represent them
> as a single RDMA handle. If that was true iSER would have never worked.
> Or I misunderstood what you meant...
>
OK, I will let our RDMA guy Yigal Korman answer that; I guess you might be right.
But regardless of this little detail we would like to keep everything 2M, yes,
virtually on the wire protocol. But even in the server's internal configuration
we would like to see a single 2M TLB mapping of all of the target's pmem. Also on the
PCIe side it is nice to have a scatter-list with a single 2M entry instead of 4k entries.
And I think it is nice for DAX systems to fallocate and guarantee 2M contiguous
allocations of heavily accessed / mmapped files.
Thank you for your interest.
Boaz
On Sun, Jan 31, 2016 at 4:20 PM, Boaz Harrosh <[email protected]> wrote:
> >>
> >> Currently both drivers initiator and target are in Kernel, but with
> >> latest advancement by Dan Williams it can be implemented in user-mode as well,
> >> Almost.
> >>
> >> The almost is because:
> >> 1. If the target is over a /dev/pmemX then all is fine we have 2M contiguous
> >> memory blocks.
> >> 2. If the target is over an FS, we have a proposal pending for an falloc_2M_flag
> >> to ask the FS for a contiguous 2M allocations only. If any of the 2M allocations
> >> fail then return ENOSPC from falloc. This way we guaranty that each 2M block can be
> >> mapped by a single RDAM handle.
> >
> > Umm, you don't need the 2M to be contiguous in order to represent them
> > as a single RDMA handle. If that was true iSER would have never worked.
> > Or I misunderstood what you meant...
> >
>
> OK I will let our RDMA guy Yigal Korman answer that, I guess you might be right.
When Boaz says 'RDMA handle', he means the pair [rkey, remote_addr].
AFAIK the remote_addr describes a contiguous memory space on the target,
so if you want to write to this 'handle' it must be contiguous.
Please correct me if I'm wrong.
Regards,
Yigal
>>>> The almost is because:
>>>> 1. If the target is over a /dev/pmemX then all is fine we have 2M contiguous
>>>> memory blocks.
>>>> 2. If the target is over an FS, we have a proposal pending for an falloc_2M_flag
>>>> to ask the FS for a contiguous 2M allocations only. If any of the 2M allocations
>>>> fail then return ENOSPC from falloc. This way we guaranty that each 2M block can be
>>>> mapped by a single RDAM handle.
>>>
>>> Umm, you don't need the 2M to be contiguous in order to represent them
>>> as a single RDMA handle. If that was true iSER would have never worked.
>>> Or I misunderstood what you meant...
>>>
>>
>> OK I will let our RDMA guy Yigal Korman answer that, I guess you might be right.
>
> When Boaz says 'RDMA handle', he means the pair [rkey, remote_addr].
> AFAIK the remote_addr describes a contiguous memory space on the target,
> so if you want to write to this 'handle' it must be contiguous.
> Please correct me if I'm wrong.
OK, this is definitely wrong. But let's defer this discussion to another
thread as it's not relevant to lsf folks...
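For anyone following along, the 'RDMA handle' in this sub-thread is just the
(remote_addr, rkey) pair produced by memory registration; registration requires
only a virtually contiguous buffer, and the HCA maps it onto whatever physical
pages back it. A minimal userspace illustration (protection domain setup and
error paths elided):

	#include <infiniband/verbs.h>
	#include <stdint.h>
	#include <stdlib.h>

	struct remote_handle {
		uint64_t remote_addr;
		uint32_t rkey;
	};

	/* Register a buffer and produce the handle a peer needs for RDMA. */
	static struct ibv_mr *export_buffer(struct ibv_pd *pd, size_t len,
					    struct remote_handle *out)
	{
		void *buf = malloc(len);	/* could equally be an mmap'd pmem region */
		struct ibv_mr *mr;

		if (!buf)
			return NULL;

		mr = ibv_reg_mr(pd, buf, len,
				IBV_ACCESS_LOCAL_WRITE |
				IBV_ACCESS_REMOTE_READ |
				IBV_ACCESS_REMOTE_WRITE);
		if (!mr) {
			free(buf);
			return NULL;
		}

		out->remote_addr = (uint64_t)(uintptr_t)buf;
		out->rkey        = mr->rkey;
		return mr;
	}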