2012-05-15 13:33:58

by Matthew Wilcox

[permalink] [raw]
Subject: NVM Mapping API


There are a number of interesting non-volatile memory (NVM) technologies
being developed. Some of them promise DRAM-comparable latencies and
bandwidths. At Intel, we've been thinking about various ways to present
those to software. This is a first draft of an API that supports the
operations we see as necessary. Patches can follow easily enough once
we've settled on an API.

We think the appropriate way to present directly addressable NVM to
in-kernel users is through a filesystem. Different technologies may want
to use different filesystems, or maybe some forms of directly addressable
NVM will want to use the same filesystem as each other.

For mapping regions of NVM into the kernel address space, we think we need
map, unmap, protect and sync operations; see kerneldoc for them below.
We also think we need read and write operations (to copy to/from DRAM).
The kernel_read() function already exists, and I don't think it would
be unreasonable to add its kernel_write() counterpart.
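
For reference, the obvious sketch of such a kernel_write() would just
mirror the get_fs()/set_fs() dance that kernel_read() does today (exact
prototype open for discussion; this is illustrative only):

static ssize_t kernel_write(struct file *file, loff_t offset,
			    const char *addr, size_t count)
{
	mm_segment_t old_fs = get_fs();
	loff_t pos = offset;
	ssize_t result;

	set_fs(KERNEL_DS);
	/* The cast to a user pointer is valid because of the set_fs() */
	result = vfs_write(file, (__force const char __user *)addr,
			   count, &pos);
	set_fs(old_fs);
	return result;
}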

We aren't yet proposing a mechanism for carving up the NVM into regions.
vfs_truncate() seems like a reasonable API for resizing an NVM region.
filp_open() also seems reasonable for turning a name into a file pointer.

What we'd really like is for people to think about how they might use
fast NVM inside the kernel. There's likely to be a lot of it (at least in
servers); all the technologies are promising cheaper per-bit prices than
DRAM, so it's likely to be sold in larger capacities than DRAM is today.

Caching is one obvious use (be it FS-Cache, Bcache, Flashcache or
something else), but I bet there are more radical things we can do
with it. What if we stored the inode cache in it? Would booting with
a hot inode cache improve boot times? How about storing the tree of
'struct devices' in it so we don't have to rescan the busses at startup?


/**
* @nvm_filp: The NVM file pointer
* @start: The starting offset within the NVM region to be mapped
* @length: The number of bytes to map
* @protection: Protection bits
* @return Pointer to virtual mapping or PTR_ERR on failure
*
* This call maps a file to a virtual memory address. The start and length
* should be page aligned.
*
* Errors:
* EINVAL if start and length are not page aligned.
* ENODEV if the file pointer does not point to a mappable file
*/
void *nvm_map(struct file *nvm_filp, off_t start, size_t length,
pgprot_t protection);

/**
* @addr: The address returned by nvm_map()
*
* Unmaps a region previously mapped by nvm_map.
*/
void nvm_unmap(const void *addr);

/**
* @addr: The first byte to affect
* @length: The number of bytes to affect
* @protection: The new protection to use
*
* Updates the protection bits for the corresponding pages.
* The start and length must be page aligned, but need not be the entirety
* of the mapping.
*/
void nvm_protect(const void *addr, size_t length, pgprot_t protection);

/**
* @nvm_filp: The kernel file pointer
* @addr: The first byte to sync
* @length: The number of bytes to sync
* @returns Zero on success, -errno on failure
*
* Flushes changes made to the in-core copy of a mapped file back to NVM.
*/
int nvm_sync(struct file *nvm_filp, void *addr, size_t length);


2012-05-15 17:46:44

by Greg KH

[permalink] [raw]
Subject: Re: NVM Mapping API

On Tue, May 15, 2012 at 09:34:51AM -0400, Matthew Wilcox wrote:
>
> There are a number of interesting non-volatile memory (NVM) technologies
> being developed. Some of them promise DRAM-comparable latencies and
> bandwidths. At Intel, we've been thinking about various ways to present
> those to software. This is a first draft of an API that supports the
> operations we see as necessary. Patches can follow easily enough once
> we've settled on an API.
>
> We think the appropriate way to present directly addressable NVM to
> in-kernel users is through a filesystem. Different technologies may want
> to use different filesystems, or maybe some forms of directly addressable
> NVM will want to use the same filesystem as each other.
>
> For mapping regions of NVM into the kernel address space, we think we need
> map, unmap, protect and sync operations; see kerneldoc for them below.
> We also think we need read and write operations (to copy to/from DRAM).
> The kernel_read() function already exists, and I don't think it would
> be unreasonable to add its kernel_write() counterpart.
>
> We aren't yet proposing a mechanism for carving up the NVM into regions.
> vfs_truncate() seems like a reasonable API for resizing an NVM region.
> filp_open() also seems reasonable for turning a name into a file pointer.
>
> What we'd really like is for people to think about how they might use
> fast NVM inside the kernel. There's likely to be a lot of it (at least in
> servers); all the technologies are promising cheaper per-bit prices than
> DRAM, so it's likely to be sold in larger capacities than DRAM is today.
>
> Caching is one obvious use (be it FS-Cache, Bcache, Flashcache or
> something else), but I bet there are more radical things we can do
> with it. What if we stored the inode cache in it? Would booting with
> a hot inode cache improve boot times? How about storing the tree of
> 'struct devices' in it so we don't have to rescan the busses at startup?

Rescanning the busses at startup are required anyway, as devices can be
added and removed when the power is off, and I would be amazed if that
is actually taking any measurable time. Do you have any numbers for
this for different busses?

What about pramfs for the nvram? I have a recent copy of the patches,
and I think they are clean enough for acceptance, there was no
complaints the last time it was suggested. Can you use that for this
type of hardware?

thanks,

greg k-h

2012-05-15 23:02:13

by Andy Lutomirski

[permalink] [raw]
Subject: Re: NVM Mapping API

On 05/15/2012 06:34 AM, Matthew Wilcox wrote:
>
> There are a number of interesting non-volatile memory (NVM) technologies
> being developed. Some of them promise DRAM-comparable latencies and
> bandwidths. At Intel, we've been thinking about various ways to present
> those to software. This is a first draft of an API that supports the
> operations we see as necessary. Patches can follow easily enough once
> we've settled on an API.
>
> We think the appropriate way to present directly addressable NVM to
> in-kernel users is through a filesystem. Different technologies may want
> to use different filesystems, or maybe some forms of directly addressable
> NVM will want to use the same filesystem as each other.

> What we'd really like is for people to think about how they might use
> fast NVM inside the kernel. There's likely to be a lot of it (at least in
> servers); all the technologies are promising cheaper per-bit prices than
> DRAM, so it's likely to be sold in larger capacities than DRAM is today.
>
> Caching is one obvious use (be it FS-Cache, Bcache, Flashcache or
> something else), but I bet there are more radical things we can do
> with it. What if we stored the inode cache in it? Would booting with
> a hot inode cache improve boot times? How about storing the tree of
> 'struct devices' in it so we don't have to rescan the busses at startup?
>

I would love to use this from userspace. If I could carve out a little
piece of NVM as a file (or whatever) and mmap it, I could do all kinds
of fun things with that. It would be nice if it had well-defined, or at
least configurable or discoverable, caching properties (e.g. WB, WT, WC,
UC, etc.).
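
Roughly this sort of thing (the mount point is invented and error
checking is omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical file on an NVM filesystem */
	int fd = open("/mnt/nvm/scratch", O_CREAT | O_RDWR, 0600);
	size_t len = 1 << 20;
	void *p;

	ftruncate(fd, len);		/* carve out 1MB of NVM */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* ... use p like ordinary memory, but it survives reboots ... */

	msync(p, len, MS_SYNC);		/* make sure it reached the NVM */
	munmap(p, len);
	close(fd);
	return 0;
}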

(Even better would be a way to make a clone of an fd that only allows
mmap, but that's a mostly unrelated issue.)

--Andy

2012-05-16 06:24:19

by Viacheslav Dubeyko

[permalink] [raw]
Subject: Re: NVM Mapping API

Hi,

On Tue, 2012-05-15 at 09:34 -0400, Matthew Wilcox wrote:
> There are a number of interesting non-volatile memory (NVM) technologies
> being developed. Some of them promise DRAM-comparable latencies and
> bandwidths. At Intel, we've been thinking about various ways to present
> those to software.

Could you please share your vision of these NVM technologies in more
detail? What capacity in bytes of one NVM unit can we expect? What about
bad blocks and any other reliability issues of such NVM technologies?

I think a deeper understanding of this would make it possible to imagine
the niche such NVM units could occupy in a future memory subsystem
architecture.

With the best regards,
Vyacheslav Dubeyko.

2012-05-16 09:52:06

by James Bottomley

[permalink] [raw]
Subject: Re: NVM Mapping API

On Tue, 2012-05-15 at 09:34 -0400, Matthew Wilcox wrote:
> There are a number of interesting non-volatile memory (NVM) technologies
> being developed. Some of them promise DRAM-comparable latencies and
> bandwidths. At Intel, we've been thinking about various ways to present
> those to software. This is a first draft of an API that supports the
> operations we see as necessary. Patches can follow easily enough once
> we've settled on an API.

If we start from first principles, does this mean it's usable as DRAM?
Meaning do we even need a non-memory API for it? The only difference
would be that some pieces of our RAM become non-volatile.

Or is there some impediment (like durability, or degradation on rewrite)
which makes this unsuitable as a complete DRAM replacement?

> We think the appropriate way to present directly addressable NVM to
> in-kernel users is through a filesystem. Different technologies may want
> to use different filesystems, or maybe some forms of directly addressable
> NVM will want to use the same filesystem as each other.

If it's actually DRAM, I'd present it as DRAM and figure out how to
label the non volatile property instead.

Alternatively, if it's not really DRAM, I think the UNIX file
abstraction makes sense (it's a piece of memory presented as something
like a filehandle with open, close, seek, read, write and mmap), but
it's less clear that it should be an actual file system. The reason is
that to present a VFS interface, you have to already have fixed the
format of the actual filesystem on the memory because we can't nest
filesystems (well, not without doing artificial loopbacks). Again, this
might make sense if there's some architectural reason why the flash
region has to have a specific layout, but your post doesn't shed any
light on this.

James

2012-05-16 13:04:20

by Boaz Harrosh

[permalink] [raw]
Subject: Re: NVM Mapping API

On 05/15/2012 04:34 PM, Matthew Wilcox wrote:

>
> There are a number of interesting non-volatile memory (NVM) technologies
> being developed. Some of them promise DRAM-comparable latencies and
> bandwidths. At Intel, we've been thinking about various ways to present
> those to software. This is a first draft of an API that supports the
> operations we see as necessary. Patches can follow easily enough once
> we've settled on an API.
>
> We think the appropriate way to present directly addressable NVM to
> in-kernel users is through a filesystem. Different technologies may want
> to use different filesystems, or maybe some forms of directly addressable
> NVM will want to use the same filesystem as each other.
>
> For mapping regions of NVM into the kernel address space, we think we need
> map, unmap, protect and sync operations; see kerneldoc for them below.
> We also think we need read and write operations (to copy to/from DRAM).
> The kernel_read() function already exists, and I don't think it would
> be unreasonable to add its kernel_write() counterpart.
>
> We aren't yet proposing a mechanism for carving up the NVM into regions.
> vfs_truncate() seems like a reasonable API for resizing an NVM region.
> filp_open() also seems reasonable for turning a name into a file pointer.
>
> What we'd really like is for people to think about how they might use
> fast NVM inside the kernel. There's likely to be a lot of it (at least in
> servers); all the technologies are promising cheaper per-bit prices than
> DRAM, so it's likely to be sold in larger capacities than DRAM is today.
>
> Caching is one obvious use (be it FS-Cache, Bcache, Flashcache or
> something else), but I bet there are more radical things we can do
> with it.



> What if we stored the inode cache in it? Would booting with
> a hot inode cache improve boot times? How about storing the tree of
> 'struct devices' in it so we don't have to rescan the busses at startup?
>


No, for fast boots, just use it as a hibernation space. The rest is
already implemented. If you also want protection from crashes, HW
failures, or power failure with no UPS, you can have a system checkpoint
every once in a while that saves a hibernation image and continues. If you
always want a very fast boot to a clean system, checkpoint at entry state
and always resume from that hibernation.

Other uses:

* Journals, journals, journals of other FSs. So one file system has
its journal as a file in the NVMFS proposed above.
Create an easy API for kernel subsystems to allocate them.

* Execute in place.
Perhaps the elf loader can sense that the executable is on an NVMFS
and execute it in place instead of copying it to DRAM. Or that happens
automatically with your nvm_map() below.

>
> /**
> * @nvm_filp: The NVM file pointer
> * @start: The starting offset within the NVM region to be mapped
> * @length: The number of bytes to map
> * @protection: Protection bits
> * @return Pointer to virtual mapping or PTR_ERR on failure
> *
> * This call maps a file to a virtual memory address. The start and length
> * should be page aligned.
> *
> * Errors:
> * EINVAL if start and length are not page aligned.
> * ENODEV if the file pointer does not point to a mappable file
> */
> void *nvm_map(struct file *nvm_filp, off_t start, size_t length,
> pgprot_t protection);
>


Is the returned void * here a cooked-up TLB mapping that points
to real memory bus cycles in HW? So is there a real physical
memory region this sits in? What is the difference from,
say, a PCIe DRAM card with a battery?

Could I just use some kind of RAM-FS with this?

> /**
> * @addr: The address returned by nvm_map()
> *
> * Unmaps a region previously mapped by nvm_map.
> */
> void nvm_unmap(const void *addr);
>
> /**
> * @addr: The first byte to affect
> * @length: The number of bytes to affect
> * @protection: The new protection to use
> *
> * Updates the protection bits for the corresponding pages.
> * The start and length must be page aligned, but need not be the entirety
> * of the mapping.
> */
> void nvm_protect(const void *addr, size_t length, pgprot_t protection);
>
> /**
> * @nvm_filp: The kernel file pointer
> * @addr: The first byte to sync
> * @length: The number of bytes to sync
> * @returns Zero on success, -errno on failure
> *
> * Flushes changes made to the in-core copy of a mapped file back to NVM.
> */
> int nvm_sync(struct file *nvm_filp, void *addr, size_t length);


This I do not understand. Is that an on-card memory cache flush, or is it
system memory DMAed to NVM?

Thanks
Boaz

2012-05-16 15:56:35

by Matthew Wilcox

[permalink] [raw]
Subject: Re: NVM Mapping API

On Tue, May 15, 2012 at 10:46:39AM -0700, Greg KH wrote:
> On Tue, May 15, 2012 at 09:34:51AM -0400, Matthew Wilcox wrote:
> > What we'd really like is for people to think about how they might use
> > fast NVM inside the kernel. There's likely to be a lot of it (at least in
> > servers); all the technologies are promising cheaper per-bit prices than
> > DRAM, so it's likely to be sold in larger capacities than DRAM is today.
> >
> > Caching is one obvious use (be it FS-Cache, Bcache, Flashcache or
> > something else), but I bet there are more radical things we can do
> > with it. What if we stored the inode cache in it? Would booting with
> > a hot inode cache improve boot times? How about storing the tree of
> > 'struct devices' in it so we don't have to rescan the busses at startup?
>
> Rescanning the busses at startup are required anyway, as devices can be
> added and removed when the power is off, and I would be amazed if that
> is actually taking any measurable time. Do you have any numbers for
> this for different busses?

Hi Greg,

I wasn't particularly serious about this example ... I did once time
the scan of a PCIe bus and it took a noticeable number of milliseconds
(which is why we now only scan the first device for the downstream "bus"
of root ports and downstream ports).

I'm just trying to stimulate a bit of discussion of possible usages for
persistent memory.

> What about pramfs for the nvram? I have a recent copy of the patches,
> and I think they are clean enough for acceptance, there was no
> complaints the last time it was suggested. Can you use that for this
> type of hardware?

pramfs is definitely one filesystem that's under investigation. I know
there will be types of NVM for which it won't be suitable, so rather
than people calling pramfs-specific functions, the notion is to get a
core API in the VFS that can call into the various different filesystems
that can handle the vagaries of different types of NVM.

Thanks.

2012-05-16 16:01:49

by Matthew Wilcox

[permalink] [raw]
Subject: Re: NVM Mapping API

On Tue, May 15, 2012 at 04:02:01PM -0700, Andy Lutomirski wrote:
> I would love to use this from userspace. If I could carve out a little
> piece of NVM as a file (or whatever) and mmap it, I could do all kinds
> of fun things with that. It would be nice if it had well-defined, or at
> least configurable or discoverable, caching properties (e.g. WB, WT, WC,
> UC, etc.).

Yes, usage from userspace is definitely planned; again through a
filesystem interface. Treating it like a regular file will work as
expected; the question is how to expose the interesting properties
(eg is there a lighter weight mechanism than calling msync()).

My hope was that by having a discussion of how to use this stuff within
the kernel, we might come up with some usage models that would inform
how we design a user space library.

> (Even better would be a way to make a clone of an fd that only allows
> mmap, but that's a mostly unrelated issue.)

O_MMAP_ONLY? And I'm not sure why you'd want to forbid reads and writes.

2012-05-16 16:09:27

by Matthew Wilcox

[permalink] [raw]
Subject: Re: NVM Mapping API

On Wed, May 16, 2012 at 10:24:13AM +0400, Vyacheslav Dubeyko wrote:
> On Tue, 2012-05-15 at 09:34 -0400, Matthew Wilcox wrote:
> > There are a number of interesting non-volatile memory (NVM) technologies
> > being developed. Some of them promise DRAM-comparable latencies and
> > bandwidths. At Intel, we've been thinking about various ways to present
> > those to software.
>
> Could you please share your vision of these NVM technologies in more
> detail? What capacity in bytes of one NVM unit can we expect? What about
> bad blocks and any other reliability issues of such NVM technologies?

No, I can't comment on any of that. This isn't about any particular piece
of technology; it's an observation that there are a lot of technologies
that seem to fit in this niche; some of them are even available to
buy today.

No statement of mine should be taken as an indication of any future
Intel product plans :-)

2012-05-16 17:34:30

by Matthew Wilcox

[permalink] [raw]
Subject: Re: NVM Mapping API

On Wed, May 16, 2012 at 10:52:00AM +0100, James Bottomley wrote:
> On Tue, 2012-05-15 at 09:34 -0400, Matthew Wilcox wrote:
> > There are a number of interesting non-volatile memory (NVM) technologies
> > being developed. Some of them promise DRAM-comparable latencies and
> > bandwidths. At Intel, we've been thinking about various ways to present
> > those to software. This is a first draft of an API that supports the
> > operations we see as necessary. Patches can follow easily enough once
> > we've settled on an API.
>
> If we start from first principles, does this mean it's usable as DRAM?
> Meaning do we even need a non-memory API for it? The only difference
> would be that some pieces of our RAM become non-volatile.

I'm not talking about a specific piece of technology, I'm assuming that
one of the competing storage technologies will eventually make it to
widespread production usage. Let's assume what we have is DRAM with a
giant battery on it.

So, while we can use it just as DRAM, we're not taking advantage of the
persistent aspect of it if we don't have an API that lets us find the
data we wrote before the last reboot. And that sounds like a filesystem
to me.

> Or is there some impediment (like durability, or degradation on rewrite)
> which makes this unsuitable as a complete DRAM replacement?

The idea behind using a different filesystem for different NVM types is
that we can hide those kinds of impediments in the filesystem. By the
way, did you know DRAM degrades on every write? I think it's on the
order of 10^20 writes (and CPU caches hide many writes to heavily-used
cache lines), so it's a long way away from MLC or even SLC rates, but
it does exist.

> Alternatively, if it's not really DRAM, I think the UNIX file
> abstraction makes sense (it's a piece of memory presented as something
> like a filehandle with open, close, seek, read, write and mmap), but
> it's less clear that it should be an actual file system. The reason is
> that to present a VFS interface, you have to already have fixed the
> format of the actual filesystem on the memory because we can't nest
> filesystems (well, not without doing artificial loopbacks). Again, this
> might make sense if there's some architectural reason why the flash
> region has to have a specific layout, but your post doesn't shed any
> light on this.

We can certainly present a block interface to allow using unmodified
standard filesystems on top of chunks of this NVM. That's probably not
the optimum way for a filesystem to use it though; there's really no
point in constructing a bio to carry data down to a layer that's simply
going to do a memcpy().
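
(Sketch of what I mean; once the NVM is mapped into the kernel, the whole
data path of such a block driver reduces to something like this, with the
names and sector size purely illustrative:)

/* Illustrative only: the data-movement core of a block driver sitting on
 * memory-mapped NVM.  Everything the bio machinery sets up ends up here. */
static void nvm_block_transfer(void *nvm_base, sector_t sector, void *buf,
			       unsigned int len, int is_write)
{
	void *nvm = (char *)nvm_base + ((size_t)sector << 9);

	if (is_write)
		memcpy(nvm, buf, len);		/* write: DRAM -> NVM */
	else
		memcpy(buf, nvm, len);		/* read:  NVM -> DRAM */
}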

2012-05-16 18:32:11

by Matthew Wilcox

[permalink] [raw]
Subject: Re: NVM Mapping API

On Wed, May 16, 2012 at 04:04:05PM +0300, Boaz Harrosh wrote:
> No, for fast boots, just use it as a hibernation space. The rest is
> already implemented. If you also want protection from crashes, HW
> failures, or power failure with no UPS, you can have a system checkpoint
> every once in a while that saves a hibernation image and continues. If you
> always want a very fast boot to a clean system, checkpoint at entry state
> and always resume from that hibernation.

Yes, checkpointing to it is definitely a good idea. I was thinking
more along the lines of suspend rather than hibernate. We trash a lot
of clean pages as part of the hibernation process, when it'd be better
to copy them to NVM and restore them.

> Other uses:
>
> * Journals, journals, journals of other FSs. So one file system has
> its journal as a file in the NVMFS proposed above.
> Create an easy API for kernel subsystems to allocate them.

That's a great idea. I could see us having a specific journal API.

> * Execute in place.
> Perhaps the elf loader can sense that the executable is on an NVMFS
> and execute it in place instead of copying it to DRAM. Or that happens
> automatically with your nvm_map() below.

If there's an executable on the NVMFS, it's going to get mapped into
userspace, so as long as the NVMFS implements the ->mmap method, that will
get called. It'll be up to the individual NVMFS whether it uses the page
cache to buffer a read-only mmap or whether it points directly to the NVM.

> > void *nvm_map(struct file *nvm_filp, off_t start, size_t length,
> > pgprot_t protection);
>
> Is the returned void * here a cooked-up TLB mapping that points
> to real memory bus cycles in HW? So is there a real physical
> memory region this sits in? What is the difference from,
> say, a PCIe DRAM card with a battery?

The concept we're currently playing with would have the NVM appear as
part of the CPU address space, yes.

> Could I just use some kind of RAM-FS with this?

For prototyping, sure.

> > /**
> > * @nvm_filp: The kernel file pointer
> > * @addr: The first byte to sync
> > * @length: The number of bytes to sync
> > * @returns Zero on success, -errno on failure
> > *
> > * Flushes changes made to the in-core copy of a mapped file back to NVM.
> > */
> > int nvm_sync(struct file *nvm_filp, void *addr, size_t length);
>
> This I do not understand. Is that an on-card memory cache flush, or is it
> system memory DMAed to NVM?

Up to the implementation; if it works out best to have a CPU with
write-through caches pointing directly to the address space of the NVM,
then it can be a no-op. If the CPU is using a writeback cache for the
NVM, then it'll flush the CPU cache. If the nvmfs has staged the writes
in DRAM, this will copy from DRAM to NVM. If the NVM card needs some
magic to flush an internal buffer, that will happen here.

Just as with mmaping a file in userspace today, there's no guarantee that
a store gets to stable storage until after a sync.
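
For instance, the writeback-cache case on x86 could be little more than
this sketch (essentially what the existing clflush_cache_range() already
does; function name here is invented):

/* Sketch: nvm_sync() guts for the CPU-writeback-cache case on x86 */
static void nvm_flush_cache_range(void *addr, size_t length)
{
	unsigned int cls = boot_cpu_data.x86_clflush_size;
	unsigned long p = (unsigned long)addr & ~((unsigned long)cls - 1);
	unsigned long end = (unsigned long)addr + length;

	mb();
	for (; p < end; p += cls)
		clflush((void *)p);	/* write back each dirty cache line */
	mb();
}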

2012-05-16 19:55:30

by Christian Stroetmann

[permalink] [raw]
Subject: Re: NVM Mapping API

Hello Hardcore Coders,

I wanted to step into the discussion already yesterday, but ... I was
afraid to be rude in doing so.

On Wed, May 16, 2012 at 19:35, Matthew Wilcox wrote:
> On Wed, May 16, 2012 at 10:52:00AM +0100, James Bottomley wrote:
>> On Tue, 2012-05-15 at 09:34 -0400, Matthew Wilcox wrote:
>>> There are a number of interesting non-volatile memory (NVM) technologies
>>> being developed. Some of them promise DRAM-comparable latencies and
>>> bandwidths. At Intel, we've been thinking about various ways to present
>>> those to software. This is a first draft of an API that supports the
>>> operations we see as necessary. Patches can follow easily enough once
>>> we've settled on an API.
>> If we start from first principles, does this mean it's usable as DRAM?
>> Meaning do we even need a non-memory API for it? The only difference
>> would be that some pieces of our RAM become non-volatile.
> I'm not talking about a specific piece of technology, I'm assuming that
> one of the competing storage technologies will eventually make it to
> widespread production usage. Let's assume what we have is DRAM with a
> giant battery on it.
Our ST-RAM (see [1] for the original source of its description) is a
concept based on the combination of a writable volatile Random-Access
Memory (RAM) chip and a capacitor. Either an adapter, which has a
capacitor, is placed between the motherboard and a memory module, the
memory chip is simply connected with a capacitor, or a RAM chip is
directly integrated with a chip capacitor. Also, the capacitor could be
an element that is integrated directly with the rest of a RAM chip.
While a computer system is running, the capacitor is charged with
electric power, so that after the computing system is switched off the
memory module is still supplied with the needed power from the
capacitor, and in this way the content of the memory is not lost. In this
way a computing system does not have to be booted in most of the normal
use cases after it is switched on again.

Boaz asked: "What is the difference from, say, a PCIe DRAM card with a battery?" It sits in the RAM slot.


>
> So, while we can use it just as DRAM, we're not taking advantage of the
> persistent aspect of it if we don't have an API that lets us find the
> data we wrote before the last reboot. And that sounds like a filesystem
> to me.

No and yes.
1. In the first place it is just normal DRAM.
2. But due to its nature it also has many aspects of a flash memory.
So the use case is for point
1. as a normal RAM module,
and for point
2. as a file system,
which again can be used
2.1 directly by the kernel as a normal file system,
2.2 directly by the kernel via PRAMFS,
2.3 by the proposed NVMFS, maybe as a shortcut for optimization,
and
2.4 from userspace, most likely by using the standard VFS.
Maybe this version 2.4 is the same as point 2.2.

>> Or is there some impediment (like durability, or degradation on rewrite)
>> which makes this unsuitable as a complete DRAM replacement?
> The idea behind using a different filesystem for different NVM types is
> that we can hide those kinds of impediments in the filesystem. By the
> way, did you know DRAM degrades on every write? I think it's on the
> order of 10^20 writes (and CPU caches hide many writes to heavily-used
> cache lines), so it's a long way away from MLC or even SLC rates, but
> it does exist.

As I said before, a filesystem for the different NVM types would not be
enough. These things are more complex due to the possibility that they can
be used very flexibly.

>
>> Alternatively, if it's not really DRAM, I think the UNIX file
>> abstraction makes sense (it's a piece of memory presented as something
>> like a filehandle with open, close, seek, read, write and mmap), but
>> it's less clear that it should be an actual file system. The reason is
>> that to present a VFS interface, you have to already have fixed the
>> format of the actual filesystem on the memory because we can't nest
>> filesystems (well, not without doing artificial loopbacks). Again, this
>> might make sense if there's some architectural reason why the flash
>> region has to have a specific layout, but your post doesn't shed any
>> light on this.
> We can certainly present a block interface to allow using unmodified
> standard filesystems on top of chunks of this NVM. That's probably not
> the optimum way for a filesystem to use it though; there's really no
> point in constructing a bio to carry data down to a layer that's simply
> going to do a memcpy().
> --

I also saw the use cases given by Boaz, which are:
journals of other FSs, which could be done on top of the NVMFS for
example, but that is not really what I have in mind, and
execute in place, for which an ELF loader feature is needed. Obviously,
this use case was envisioned by me as well.

For direct rebooting, checkpointing of standard RAM is also a needed
function. The decision about what is trashed and what is marked as
persistent RAM content has to be made by the RAM experts among the Linux
developers or by the user. I even think that this is a special use case
on its own with many options.



With all the best
C. Stroetmann

[1] ST-RAM http://www.ontonics.com/innovation/pipeline.htm#st-ram

2012-05-16 21:58:52

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: NVM Mapping API

On Wed, May 16, 2012 at 10:24:13AM +0400, Vyacheslav Dubeyko wrote:
> Could you please share your vision of these NVM technologies in more
> detail? What capacity in bytes of one NVM unit can we expect? What about
> bad blocks and any other reliability issues of such NVM technologies?
>
> I think a deeper understanding of this would make it possible to imagine
> the niche such NVM units could occupy in a future memory subsystem
> architecture.

Try having a look at the various articles on ReRAM, PRAM, FeRAM, MRAM...
There are a number of technologies being actively developed. For some
quick info, Samsung has presented data on an 8Gbit 20nm device (see
http://www.eetimes.com/electronics-news/4230958/ISSCC--Samsung-preps-8-Gbit-phase-change-memory ).
It's hard to predict who will be first to market with a real production
volume product, though.

The big question I have is what the actual interface for these types of
memory will be. If they're like actual RAM and can be mmap()ed into user
space, it will be preferable to avoid as much of the overhead of the existing
block infrastructure that most current day filesystems are built on top of.
If the devices have only modest endurance limits, we may need to stick the
kernel in the middle to prevent malicious code from wearing out a user's
memory cells.

-ben

2012-05-17 09:07:08

by Viacheslav Dubeyko

[permalink] [raw]
Subject: Re: NVM Mapping API

Hi,

> No, I can't comment on any of that. This isn't about any particular piece
> of technology; it's an observation that there are a lot of technologies
> that seem to fit in this niche; some of them are even available to
> buy today.
>
> No statement of mine should be taken as an indication of any future
> Intel product plans :-)
>

Ok. I understand. :-)

> > > There are a number of interesting non-volatile memory (NVM) technologies
> > > being developed. Some of them promise DRAM-comparable latencies and
> > > bandwidths. At Intel, we've been thinking about various ways to present
> > > those to software.
> >

We can be more and more radical in the case of new NVM technologies, I
think. Non-volatile random access memory with DRAM-comparable read and
write latencies could change the computing world dramatically.
Just imagine a computer system with only an NVM memory subsystem (for
example, it could be a very promising mobile solution). It means that we
could forget about separate RAM and persistent storage solutions. We could
keep run-time and persistent information in one place and operate on it on
the fly. Moreover, it means that we could keep any internal OS state
persistent without any special effort. I think that it can open up very
interesting new opportunities.

The initial purpose of a filesystem is to distinguish run-time from
persistent information. Usually, we have a slow persistent memory
subsystem (HDD) and a fast run-time memory subsystem (DRAM). A filesystem
is a technique for synchronizing a slow persistent memory subsystem
with a fast run-time memory subsystem. But if we have a fast memory
that can keep both run-time and persistent information, then that means a
revolutionary new approach to memory architecture. It means that two
different entities (run-time and persistent) can become one union. But for
such a joined information entity, traditional filesystem and OS
internal techniques are not adequate approaches. We need revolutionary
new approaches. From the NVM technology point of view, we could do
without a filesystem completely, but, from the usual user's point of view,
a modern computer system can't be imagined without a filesystem.

We need a filesystem as a catalogue of our persistent information. But
an OS can be seen as a catalogue of run-time information. Then, with
NVM technologies, the OS and the filesystem can be a unified entity that
keeps both persistent and run-time information in one catalogue structure.
But such a representation needs dramatic reworking of OS internal
techniques. It means that the traditional hierarchy of folders and files
is obsolete. We need new approaches to structuring information.
Theoretically, it is possible to reinterpret all information as
run-time and to use the OS's internal object-structure techniques. But
that is an impossible situation from the end user's point of view. So we
need a filesystem layer anyway, as a layer which represents user
information and its structure.

If we can operate on and keep the internal OS representation of
information, then it means that we can reject the file abstraction. We can
operate on the information itself and keep it without using different
file formats. But it is known that everything in Linux is a file. Then,
in fact, we are talking about a completely new OS.

Actually, NVM technologies can make it possible not to boot the OS at
all. Why boot at all if it is possible to keep any OS state in memory
persistently? I think that OS booting can become an obsolete thing.

Moreover, it is possible to do without swapping completely because all
our memory can be persistent. And for a system with NVM memory only, the
request queue and I/O scheduler can become obsolete things. I think that
the kernel's memory page approach can be redesigned significantly, too.
Such things as shared libraries can become useless because all code can
be completely in memory.

So, I think that all I have said may sound like pure fantasy. But maybe
we need to discuss a new OS instead of a new filesystem. :-)

With the best regards,
Vyacheslav Dubeyko.


2012-05-17 09:54:43

by James Bottomley

[permalink] [raw]
Subject: Re: NVM Mapping API

On Wed, 2012-05-16 at 13:35 -0400, Matthew Wilcox wrote:
> On Wed, May 16, 2012 at 10:52:00AM +0100, James Bottomley wrote:
> > On Tue, 2012-05-15 at 09:34 -0400, Matthew Wilcox wrote:
> > > There are a number of interesting non-volatile memory (NVM) technologies
> > > being developed. Some of them promise DRAM-comparable latencies and
> > > bandwidths. At Intel, we've been thinking about various ways to present
> > > those to software. This is a first draft of an API that supports the
> > > operations we see as necessary. Patches can follow easily enough once
> > > we've settled on an API.
> >
> > If we start from first principles, does this mean it's usable as DRAM?
> > Meaning do we even need a non-memory API for it? The only difference
> > would be that some pieces of our RAM become non-volatile.
>
> I'm not talking about a specific piece of technology, I'm assuming that
> one of the competing storage technologies will eventually make it to
> widespread production usage. Let's assume what we have is DRAM with a
> giant battery on it.
>
> So, while we can use it just as DRAM, we're not taking advantage of the
> persistent aspect of it if we don't have an API that lets us find the
> data we wrote before the last reboot. And that sounds like a filesystem
> to me.

Well, it sounds like a unix file to me rather than a filesystem (it's a
flat region with a beginning and end and no structure in between).
However, I'm not precluding doing this, I'm merely asking that if it
looks and smells like DRAM with the only additional property being
persistency, shouldn't we begin with the memory APIs and see if we can
add persistency to them? Imposing a VFS API looks slightly wrong to me
because it's effectively a flat region, not a hierarchical tree
structure, like a FS. If all the use cases are hierarchical trees, that
might be appropriate, but there hasn't really been any discussion of use
cases.

> > Or is there some impediment (like durability, or degradation on rewrite)
> > which makes this unsuitable as a complete DRAM replacement?
>
> The idea behind using a different filesystem for different NVM types is
> that we can hide those kinds of impediments in the filesystem. By the
> way, did you know DRAM degrades on every write? I think it's on the
> order of 10^20 writes (and CPU caches hide many writes to heavily-used
> cache lines), so it's a long way away from MLC or even SLC rates, but
> it does exist.

So are you saying does or doesn't have an impediment to being used like
DRAM?

> > Alternatively, if it's not really DRAM, I think the UNIX file
> > abstraction makes sense (it's a piece of memory presented as something
> > like a filehandle with open, close, seek, read, write and mmap), but
> > it's less clear that it should be an actual file system. The reason is
> > that to present a VFS interface, you have to already have fixed the
> > format of the actual filesystem on the memory because we can't nest
> > filesystems (well, not without doing artificial loopbacks). Again, this
> > might make sense if there's some architectural reason why the flash
> > region has to have a specific layout, but your post doesn't shed any
> > light on this.
>
> We can certainly present a block interface to allow using unmodified
> standard filesystems on top of chunks of this NVM. That's probably not
> the optimum way for a filesystem to use it though; there's really no
> point in constructing a bio to carry data down to a layer that's simply
> going to do a memcpy().

I think we might be talking at cross purposes. If you use the memory
APIs, this looks something like an anonymous region of memory with a get
and put API; something like SYSV shm if you like except that it's
persistent. No filesystem semantics at all. Only if you want FS
semantics (or want to impose some order on the region for unplugging and
replugging), do you put an FS on the memory region using loopback
techniques.

Again, this depends on use case. The SYSV shm API has a global flat
keyspace. Perhaps your envisaged use requires a hierarchical key space
and therefore a FS interface looks more natural with the leaves being
divided memory regions?

James

2012-05-17 18:58:51

by Matthew Wilcox

[permalink] [raw]
Subject: Re: NVM Mapping API

On Thu, May 17, 2012 at 10:54:38AM +0100, James Bottomley wrote:
> On Wed, 2012-05-16 at 13:35 -0400, Matthew Wilcox wrote:
> > I'm not talking about a specific piece of technology, I'm assuming that
> > one of the competing storage technologies will eventually make it to
> > widespread production usage. Let's assume what we have is DRAM with a
> > giant battery on it.
> >
> > So, while we can use it just as DRAM, we're not taking advantage of the
> > persistent aspect of it if we don't have an API that lets us find the
> > data we wrote before the last reboot. And that sounds like a filesystem
> > to me.
>
> Well, it sounds like a unix file to me rather than a filesystem (it's a
> flat region with a beginning and end and no structure in between).

That's true, but I think we want to put a structure on top of it.
Presumably there will be multiple independent users, and each will want
only a fraction of it.

> However, I'm not precluding doing this, I'm merely asking that if it
> looks and smells like DRAM with the only additional property being
> persistency, shouldn't we begin with the memory APIs and see if we can
> add persistency to them?

I don't think so. It feels harder to add useful persistent
properties to the memory APIs than it does to add memory-like
properties to our file APIs, at least partially because for
userspace we already have memory properties for our file APIs (ie
mmap/msync/munmap/mprotect/mincore/mlock/munlock/mremap).

> Imposing a VFS API looks slightly wrong to me
> because it's effectively a flat region, not a hierarchical tree
> structure, like a FS. If all the use cases are hierarchical trees, that
> might be appropriate, but there hasn't really been any discussion of use
> cases.

Discussion of use cases is exactly what I want! I think that a
non-hierarchical attempt at naming chunks of memory quickly expands
into cases where we learn we really do want a hierarchy after all.

> > > Or is there some impediment (like durability, or degradation on rewrite)
> > > which makes this unsuitable as a complete DRAM replacement?
> >
> > The idea behind using a different filesystem for different NVM types is
> > that we can hide those kinds of impediments in the filesystem. By the
> > way, did you know DRAM degrades on every write? I think it's on the
> > order of 10^20 writes (and CPU caches hide many writes to heavily-used
> > cache lines), so it's a long way away from MLC or even SLC rates, but
> > it does exist.
>
> So are you saying does or doesn't have an impediment to being used like
> DRAM?

From the consumer's point of view, it doesn't. If the underlying physical
technology does (some of the ones we've looked at have worse problems
than others), then it's up to the driver to disguise that.

> > > Alternatively, if it's not really DRAM, I think the UNIX file
> > > abstraction makes sense (it's a piece of memory presented as something
> > > like a filehandle with open, close, seek, read, write and mmap), but
> > > it's less clear that it should be an actual file system. The reason is
> > > that to present a VFS interface, you have to already have fixed the
> > > format of the actual filesystem on the memory because we can't nest
> > > filesystems (well, not without doing artificial loopbacks). Again, this
> > > might make sense if there's some architectural reason why the flash
> > > region has to have a specific layout, but your post doesn't shed any
> > > light on this.
> >
> > We can certainly present a block interface to allow using unmodified
> > standard filesystems on top of chunks of this NVM. That's probably not
> > the optimum way for a filesystem to use it though; there's really no
> > point in constructing a bio to carry data down to a layer that's simply
> > going to do a memcpy().
>
> I think we might be talking at cross purposes. If you use the memory
> APIs, this looks something like an anonymous region of memory with a get
> and put API; something like SYSV shm if you like except that it's
> persistent. No filesystem semantics at all. Only if you want FS
> semantics (or want to impose some order on the region for unplugging and
> replugging), do you put an FS on the memory region using loopback
> techniques.
>
> Again, this depends on use case. The SYSV shm API has a global flat
> keyspace. Perhaps your envisaged use requires a hierarchical key space
> and therefore a FS interface looks more natural with the leaves being
> divided memory regions?

I've really never heard anybody hold up the SYSV shm API as something
to be desired before. Indeed, POSIX shared memory is much closer to
the filesystem API; the only difference being use of shm_open() and
shm_unlink() instead of open() and unlink() [see shm_overview(7)].
And I don't really see the point in creating specialised nvm_open()
and nvm_unlink() functions ...

2012-05-17 19:05:24

by Matthew Wilcox

[permalink] [raw]
Subject: Re: NVM Mapping API

On Wed, May 16, 2012 at 05:58:49PM -0400, Benjamin LaHaise wrote:
> The big question I have is what the actual interface for these types of
> memory will be. If they're like actual RAM and can be mmap()ed into user
> space, it will be preferable to avoid as much of the overhead of the existing
> block infrastructure that most current day filesystems are built on top of.

Yes. I'm hoping that filesystem developers will indicate enthusiasm
for moving to new APIs. If not the ones I've proposed, then at least
ones which can be implemented more efficiently with a device that looks
like DRAM.

> If the devices have only modest endurance limits, we may need to stick the
> kernel in the middle to prevent malicious code from wearing out a user's
> memory cells.

Yes, or if the device has long write latencies or poor write bandwidth,
we'll also want to buffer writes in DRAM. My theory is that this is
doable transparently to the user; we can map it read-only, and handle
the fault by copying from NVM to DRAM, then changing the mapping and
restarting the instruction. The page would be written back to NVM on
a sync call, or when memory pressure or elapsed time dictates.
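
The copy itself is trivial; the sketch below is the boring half of that
fault path (names invented, and the interesting parts - the page table
switch and the writeback policy - are omitted):

/* Sketch: first write to a read-only NVM mapping faults; stage the page
 * in DRAM, then the fs installs this page and lets the instruction
 * restart.  nvm_sync() (or memory pressure) copies it back to NVM later. */
static struct page *nvm_stage_in_dram(void *nvm_page)
{
	struct page *page = alloc_page(GFP_KERNEL);

	if (!page)
		return NULL;
	memcpy(page_address(page), nvm_page, PAGE_SIZE);
	return page;
}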

2012-05-18 09:04:04

by James Bottomley

[permalink] [raw]
Subject: Re: NVM Mapping API

On Thu, 2012-05-17 at 14:59 -0400, Matthew Wilcox wrote:
> On Thu, May 17, 2012 at 10:54:38AM +0100, James Bottomley wrote:
> > On Wed, 2012-05-16 at 13:35 -0400, Matthew Wilcox wrote:
> > > I'm not talking about a specific piece of technology, I'm assuming that
> > > one of the competing storage technologies will eventually make it to
> > > widespread production usage. Let's assume what we have is DRAM with a
> > > giant battery on it.
> > >
> > > So, while we can use it just as DRAM, we're not taking advantage of the
> > > persistent aspect of it if we don't have an API that lets us find the
> > > data we wrote before the last reboot. And that sounds like a filesystem
> > > to me.
> >
> > Well, it sounds like a unix file to me rather than a filesystem (it's a
> > flat region with a beginning and end and no structure in between).
>
> That's true, but I think we want to put a structure on top of it.
> Presumably there will be multiple independent users, and each will want
> only a fraction of it.
>
> > However, I'm not precluding doing this, I'm merely asking that if it
> > looks and smells like DRAM with the only additional property being
> > persistency, shouldn't we begin with the memory APIs and see if we can
> > add persistency to them?
>
> I don't think so. It feels harder to add useful persistent
> properties to the memory APIs than it does to add memory-like
> properties to our file APIs, at least partially because for
> userspace we already have memory properties for our file APIs (ie
> mmap/msync/munmap/mprotect/mincore/mlock/munlock/mremap).

This is what I don't quite get. At the OS level, it's all memory; we
just have to flag one region as persistent. This is easy; I'd do it in
the physical memory map. Once this is done, we need to tell the
allocators to use only volatile, only persistent, or don't care (I
presume the latter would only be if you needed the extra RAM).

The missing thing is persistent key management of the memory space (so
if a user or kernel wants 10Mb of persistent space, they get the same
10Mb back again across boots).

The reason a memory API looks better to me is because a memory API can
be used within the kernel. For instance, I want a persistent /var/tmp
on tmpfs, I just tell tmpfs to allocate it in persistent memory and it
survives reboots. Likewise, if I want an area to dump panics, I just
use it ... in fact, I'd probably always place the dmesg buffer in
persistent memory.

If you start off with a vfs API, it becomes far harder to use it easily
from within the kernel.

The question, really, is all about space management: how many persistent
spaces would there be? I think, given the use cases above, it would be a
small number (it's basically one for every kernel use and one for every
user use ... a filesystem mount counting as one use), so a flat key to
space management mapping (probably using u32 keys) makes sense, and
that's similar to our current shared memory API.

> > Imposing a VFS API looks slightly wrong to me
> > because it's effectively a flat region, not a hierarchical tree
> > structure, like a FS. If all the use cases are hierarchical trees, that
> > might be appropriate, but there hasn't really been any discussion of use
> > cases.
>
> Discussion of use cases is exactly what I want! I think that a
> non-hierarchical attempt at naming chunks of memory quickly expands
> into cases where we learn we really do want a hierarchy after all.

OK, so enumerate the uses. I can be persuaded the namespace has to be
hierarchical if there are orders of magnitude more users than I think
there will be.

> > > > Or is there some impediment (like durability, or degradation on rewrite)
> > > > which makes this unsuitable as a complete DRAM replacement?
> > >
> > > The idea behind using a different filesystem for different NVM types is
> > > that we can hide those kinds of impediments in the filesystem. By the
> > > way, did you know DRAM degrades on every write? I think it's on the
> > > order of 10^20 writes (and CPU caches hide many writes to heavily-used
> > > cache lines), so it's a long way away from MLC or even SLC rates, but
> > > it does exist.
> >
> > So are you saying does or doesn't have an impediment to being used like
> > DRAM?
>
> From the consumer's point of view, it doesn't. If the underlying physical
> technology does (some of the ones we've looked at have worse problems
> than others), then it's up to the driver to disguise that.

OK, so in a pinch it can be used as normal DRAM, that's great.

> > > > Alternatively, if it's not really DRAM, I think the UNIX file
> > > > abstraction makes sense (it's a piece of memory presented as something
> > > > like a filehandle with open, close, seek, read, write and mmap), but
> > > > it's less clear that it should be an actual file system. The reason is
> > > > that to present a VFS interface, you have to already have fixed the
> > > > format of the actual filesystem on the memory because we can't nest
> > > > filesystems (well, not without doing artificial loopbacks). Again, this
> > > > might make sense if there's some architectural reason why the flash
> > > > region has to have a specific layout, but your post doesn't shed any
> > > > light on this.
> > >
> > > We can certainly present a block interface to allow using unmodified
> > > standard filesystems on top of chunks of this NVM. That's probably not
> > > the optimum way for a filesystem to use it though; there's really no
> > > point in constructing a bio to carry data down to a layer that's simply
> > > going to do a memcpy().
> >
> > I think we might be talking at cross purposes. If you use the memory
> > APIs, this looks something like an anonymous region of memory with a get
> > and put API; something like SYSV shm if you like except that it's
> > persistent. No filesystem semantics at all. Only if you want FS
> > semantics (or want to impose some order on the region for unplugging and
> > replugging), do you put an FS on the memory region using loopback
> > techniques.
> >
> > Again, this depends on use case. The SYSV shm API has a global flat
> > keyspace. Perhaps your envisaged use requires a hierarchical key space
> > and therefore a FS interface looks more natural with the leaves being
> > divided memory regions?
>
> I've really never heard anybody hold up the SYSV shm API as something
> to be desired before. Indeed, POSIX shared memory is much closer to
> the filesystem API;

I'm not really ... I was just thinking this needs key -> region mapping
and SYSV shm does that. The POSIX anonymous memory API needs you to
map /dev/zero and then pass file descriptors around for sharing. It's
not clear how you manage a persistent key space with that.

> the only difference being use of shm_open() and
> shm_unlink() instead of open() and unlink() [see shm_overview(7)].
> And I don't really see the point in creating specialised nvm_open()
> and nvm_unlink() functions ...

The internal kernel API addition is simply a key -> region mapping.
Once that's done, you need an allocation API for userspace and you're
done. I bet most userspace uses will be either give me xGB and put a
tmpfs on it or give me xGB and put a something filesystem on it, but if
the user wants an xGB mmap'd region, you can give them that as well.
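
i.e. something of roughly this shape in the kernel (names invented purely
for illustration):

/* Hypothetical kernel-internal key -> persistent-region API
 * (needs linux/types.h for u32/size_t) */
void *persistent_region_get(u32 key, size_t size);	/* map, create if absent */
void persistent_region_put(void *addr);			/* drop this mapping */
int persistent_region_destroy(u32 key);			/* give the space back */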

For a vfs interface, you have to do all of this as well, but in a much
more complex way because the file name becomes the key and the metadata
becomes the mapping.

James

2012-05-18 09:33:57

by Arnd Bergmann

[permalink] [raw]
Subject: Re: NVM Mapping API

On Tuesday 15 May 2012, Matthew Wilcox wrote:
>
> There are a number of interesting non-volatile memory (NVM) technologies
> being developed. Some of them promise DRAM-comparable latencies and
> bandwidths. At Intel, we've been thinking about various ways to present
> those to software. This is a first draft of an API that supports the
> operations we see as necessary. Patches can follow easily enough once
> we've settled on an API.
>
> We think the appropriate way to present directly addressable NVM to
> in-kernel users is through a filesystem. Different technologies may want
> to use different filesystems, or maybe some forms of directly addressable
> NVM will want to use the same filesystem as each other.

ext2 actually supports some of this already with mm/filemap_xip.c; Carsten
Otte introduced it initially to support drivers/s390/block/dcssblk.c with
execute-in-place, so you don't have to copy the data around when your
block device is already mapped into the physical address space.

I guess this could be implemented in modern file systems (ext4, btrfs)
as well, or you could have a new simple fs on top of the same base API.
(ext2+xip was originally a new file system but then merged into ext2).
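
For reference, the two hooks involved are roughly these (quoting from
memory, so check fs.h and blkdev.h for the exact prototypes):

/* block device side, in struct block_device_operations
 * (implemented by dcssblk): */
int (*direct_access)(struct block_device *bdev, sector_t sector,
		     void **kaddr, unsigned long *pfn);

/* filesystem side, in struct address_space_operations
 * (used by mm/filemap_xip.c): */
int (*get_xip_mem)(struct address_space *mapping, pgoff_t pgoff, int create,
		   void **kmem, unsigned long *pfn);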

Also note that you could easily implement non-volatile memory in other
virtual machines by doing the same thing that dcssblk does: e.g. in KVM
you would only need to map a host file into the guest address space
and let the guest take advantage of a similar feature set to what you
get from the new memory technologies in real hardware.

Arnd

2012-05-18 10:13:24

by Boaz Harrosh

[permalink] [raw]
Subject: Re: NVM Mapping API

On 05/18/2012 12:03 PM, James Bottomley wrote:

> On Thu, 2012-05-17 at 14:59 -0400, Matthew Wilcox wrote:
>> On Thu, May 17, 2012 at 10:54:38AM +0100, James Bottomley wrote:
>>> On Wed, 2012-05-16 at 13:35 -0400, Matthew Wilcox wrote:
>>>> I'm not talking about a specific piece of technology, I'm assuming that
>>>> one of the competing storage technologies will eventually make it to
>>>> widespread production usage. Let's assume what we have is DRAM with a
>>>> giant battery on it.
>>>>
>>>> So, while we can use it just as DRAM, we're not taking advantage of the
>>>> persistent aspect of it if we don't have an API that lets us find the
>>>> data we wrote before the last reboot. And that sounds like a filesystem
>>>> to me.
>>>
>>> Well, it sounds like a unix file to me rather than a filesystem (it's a
>>> flat region with a beginning and end and no structure in between).
>>
>> That's true, but I think we want to put a structure on top of it.
>> Presumably there will be multiple independent users, and each will want
>> only a fraction of it.
>>
>>> However, I'm not precluding doing this, I'm merely asking that if it
>>> looks and smells like DRAM with the only additional property being
>>> persistency, shouldn't we begin with the memory APIs and see if we can
>>> add persistency to them?
>>
>> I don't think so. It feels harder to add useful persistent
>> properties to the memory APIs than it does to add memory-like
>> properties to our file APIs, at least partially because for
>> userspace we already have memory properties for our file APIs (ie
>> mmap/msync/munmap/mprotect/mincore/mlock/munlock/mremap).
>
> This is what I don't quite get. At the OS level, it's all memory; we
> just have to flag one region as persistent. This is easy, I'd do it in
> the physical memory map. once this is done, we need either to tell the
> allocators only use volatile, only use persistent, or don't care (I
> presume the latter would only be if you needed the extra ram).
>
> The missing thing is persistent key management of the memory space (so
> if a user or kernel wants 10Mb of persistent space, they get the same
> 10Mb back again across boots).
>
> The reason a memory API looks better to me is because a memory API can
> be used within the kernel. For instance, I want a persistent /var/tmp
> on tmpfs, I just tell tmpfs to allocate it in persistent memory and it
> survives reboots. Likewise, if I want an area to dump panics, I just
> use it ... in fact, I'd probably always place the dmesg buffer in
> persistent memory.
>
> If you start off with a vfs API, it becomes far harder to use it easily
> from within the kernel.
>
> The question, really is all about space management: how many persistent
> spaces would there be. I think, given the use cases above it would be a
> small number (it's basically one for every kernel use and one for ever
> user use ... a filesystem mount counting as one use), so a flat key to
> space management mapping (probably using u32 keys) makes sense, and
> that's similar to our current shared memory API.
>
>>> Imposing a VFS API looks slightly wrong to me
>>> because it's effectively a flat region, not a hierarchical tree
>>> structure, like a FS. If all the use cases are hierarchical trees, that
>>> might be appropriate, but there hasn't really been any discussion of use
>>> cases.
>>
>> Discussion of use cases is exactly what I want! I think that a
>> non-hierarchical attempt at naming chunks of memory quickly expands
>> into cases where we learn we really do want a hierarchy after all.
>
> OK, so enumerate the uses. I can be persuaded the namespace has to be
> hierarchical if there are orders of magnitude more users than I think
> there will be.
>
>>>>> Or is there some impediment (like durability, or degradation on rewrite)
>>>>> which makes this unsuitable as a complete DRAM replacement?
>>>>
>>>> The idea behind using a different filesystem for different NVM types is
>>>> that we can hide those kinds of impediments in the filesystem. By the
>>>> way, did you know DRAM degrades on every write? I think it's on the
>>>> order of 10^20 writes (and CPU caches hide many writes to heavily-used
>>>> cache lines), so it's a long way away from MLC or even SLC rates, but
>>>> it does exist.
>>>
>>> So are you saying does or doesn't have an impediment to being used like
>>> DRAM?
>>
>> From the consumers point of view, it doesn't. If the underlying physical
>> technology does (some of the ones we've looked at have worse problems
>> than others), then it's up to the driver to disguise that.
>
> OK, so in a pinch it can be used as normal DRAM, that's great.
>
>>>>> Alternatively, if it's not really DRAM, I think the UNIX file
>>>>> abstraction makes sense (it's a piece of memory presented as something
>>>>> like a filehandle with open, close, seek, read, write and mmap), but
>>>>> it's less clear that it should be an actual file system. The reason is
>>>>> that to present a VFS interface, you have to already have fixed the
>>>>> format of the actual filesystem on the memory because we can't nest
>>>>> filesystems (well, not without doing artificial loopbacks). Again, this
>>>>> might make sense if there's some architectural reason why the flash
>>>>> region has to have a specific layout, but your post doesn't shed any
>>>>> light on this.
>>>>
>>>> We can certainly present a block interface to allow using unmodified
>>>> standard filesystems on top of chunks of this NVM. That's probably not
>>>> the optimum way for a filesystem to use it though; there's really no
>>>> point in constructing a bio to carry data down to a layer that's simply
>>>> going to do a memcpy().
>>>
>>> I think we might be talking at cross purposes. If you use the memory
>>> APIs, this looks something like an anonymous region of memory with a get
>>> and put API; something like SYSV shm if you like except that it's
>>> persistent. No filesystem semantics at all. Only if you want FS
>>> semantics (or want to impose some order on the region for unplugging and
>>> replugging), do you put an FS on the memory region using loopback
>>> techniques.
>>>
>>> Again, this depends on use case. The SYSV shm API has a global flat
>>> keyspace. Perhaps your envisaged use requires a hierarchical key space
>>> and therefore a FS interface looks more natural with the leaves being
>>> divided memory regions?
>>
>> I've really never heard anybody hold up the SYSV shm API as something
>> to be desired before. Indeed, POSIX shared memory is much closer to
>> the filesystem API;
>
> I'm not really ... I was just thinking this needs key -> region mapping
> and SYSV shm does that. The POSIX anonymous memory API needs you to
> map /dev/zero and then pass file descriptors around for sharing. It's
> not clear how you manage a persistent key space with that.
>
>> the only difference being use of shm_open() and
>> shm_unlink() instead of open() and unlink() [see shm_overview(7)].
>> And I don't really see the point in creating specialised nvm_open()
>> and nvm_unlink() functions ...
>
> The internal kernel API addition is simply a key -> region mapping.
> Once that's done, you need an allocation API for userspace and you're
> done. I bet most userspace uses will be either give me xGB and put a
> tmpfs on it or give me xGB and put a something filesystem on it, but if
> the user wants an xGB mmap'd region, you can give them that as well.
>
> For a vfs interface, you have to do all of this as well, but in a much
> more complex way because the file name becomes the key and the metadata
> becomes the mapping.
>


Matthew is making very good points, and so is James. For one, a very
strong point is "why not use NVM in an OOM situation, as a slower NUMA
node?"

I think the best approach is both, and layered.

0. An NVM Driver

1. Well define, and marry, the notion of "persistent memory" into
the memory model. Layers, speeds, and everything. Now you have one
or more flat regions of NVM.

So this is just one or more NVM memory zones, persistent being
a property of a zone.

2. Define a new NvmFS, which is like the RamFS we have today,
that uses page_cache semantics and is in bed with the page allocators.
This layer gives you the key-to-buffer management as well as a
transparent POSIX API for existing applications.

Layers 1, 2 can be generic, if Layer 0 is well parametrized.

There might be a layer 2.5 where, similar to a partition, you
have a flat UUIDed sub-region for the likes of kernel subsystems.
The NvmFS layer is mounted on an allocated UUIDed region, but so
could a swap space, a journal, or whatever hybrid idea anyone has.

> James
>


Because, you see, I like and completely agree with what Matthew
said, and I want it.

But I also want all of what James said.
void *nvm_kalloc(struct uuid *uuid, size_t size, gfp_t gfp);
(A new uuid creates the region, but an existing one returns
it. And we might want to open exclusive/shared and
stuff.)
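
A minimal usage sketch of that call (all names here are hypothetical, and
the assumption is that the same UUID hands back the same region after
every reboot):

	/* hypothetical UUID-keyed persistent allocator, for illustration only */
	static struct uuid my_region_uuid;	/* a fixed, well-known UUID */

	static int __init my_driver_init(void)
	{
		void *scratch = nvm_kalloc(&my_region_uuid, SZ_1M, GFP_KERNEL);

		if (scratch)
			restore_state_from(scratch);	/* hypothetical consumer */
		return scratch ? 0 : -ENOMEM;
	}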

Just my $0.017
Boaz

2012-05-18 12:07:26

by Marco Stornelli

[permalink] [raw]
Subject: Re: NVM Mapping API

2012/5/16 Matthew Wilcox <[email protected]>:
> On Tue, May 15, 2012 at 10:46:39AM -0700, Greg KH wrote:
>> On Tue, May 15, 2012 at 09:34:51AM -0400, Matthew Wilcox wrote:
>> > What we'd really like is for people to think about how they might use
>> > fast NVM inside the kernel. There's likely to be a lot of it (at least in
>> > servers); all the technologies are promising cheaper per-bit prices than
>> > DRAM, so it's likely to be sold in larger capacities than DRAM is today.
>> >
>> > Caching is one obvious use (be it FS-Cache, Bcache, Flashcache or
>> > something else), but I bet there are more radical things we can do
>> > with it. What if we stored the inode cache in it? Would booting with
>> > a hot inode cache improve boot times? How about storing the tree of
>> > 'struct devices' in it so we don't have to rescan the busses at startup?
>>
>> Rescanning the busses at startup are required anyway, as devices can be
>> added and removed when the power is off, and I would be amazed if that
>> is actually taking any measurable time. Do you have any numbers for
>> this for different busses?
>
> Hi Greg,
>
> I wasn't particularly serious about this example ... I did once time
> the scan of a PCIe bus and it took a noticeable number of milliseconds
> (which is why we now only scan the first device for the downstream "bus"
> of root ports and downstream ports).
>
> I'm just trying to stimulate a bit of discussion of possible usages for
> persistent memory.
>
>> What about pramfs for the nvram? I have a recent copy of the patches,
>> and I think they are clean enough for acceptance, there was no
>> complaints the last time it was suggested. Can you use that for this
>> type of hardware?
>
> pramfs is definitely one filesystem that's under investigation. I know
> there will be types of NVM for which it won't be suitable, so rather

For example?

> than people calling pramfs-specific functions, the notion is to get a
> core API in the VFS that can call into the various different filesystems
> that can handle the vagaries of different types of NVM.
>

The idea could be good, but I have doubts about it. Any fs is designed
for a specific environment; providing a VFS API to manage NVM is not
enough. I mean, a fs designed to reduce seek time on a hard disk adds
unneeded complexity in this kind of environment. Maybe the goal could
be only "specific" support, for the journal for example.

Marco

2012-05-18 14:48:43

by Matthew Wilcox

[permalink] [raw]
Subject: Re: NVM Mapping API

On Fri, May 18, 2012 at 10:03:53AM +0100, James Bottomley wrote:
> On Thu, 2012-05-17 at 14:59 -0400, Matthew Wilcox wrote:
> > On Thu, May 17, 2012 at 10:54:38AM +0100, James Bottomley wrote:
> > > On Wed, 2012-05-16 at 13:35 -0400, Matthew Wilcox wrote:
> > > > I'm not talking about a specific piece of technology, I'm assuming that
> > > > one of the competing storage technologies will eventually make it to
> > > > widespread production usage. Let's assume what we have is DRAM with a
> > > > giant battery on it.
> > > >
> > > > So, while we can use it just as DRAM, we're not taking advantage of the
> > > > persistent aspect of it if we don't have an API that lets us find the
> > > > data we wrote before the last reboot. And that sounds like a filesystem
> > > > to me.
> > >
> > > Well, it sounds like a unix file to me rather than a filesystem (it's a
> > > flat region with a beginning and end and no structure in between).
> >
> > That's true, but I think we want to put a structure on top of it.
> > Presumably there will be multiple independent users, and each will want
> > only a fraction of it.
> >
> > > However, I'm not precluding doing this, I'm merely asking that if it
> > > looks and smells like DRAM with the only additional property being
> > > persistency, shouldn't we begin with the memory APIs and see if we can
> > > add persistency to them?
> >
> > I don't think so. It feels harder to add useful persistent
> > properties to the memory APIs than it does to add memory-like
> > properties to our file APIs, at least partially because for
> > userspace we already have memory properties for our file APIs (ie
> > mmap/msync/munmap/mprotect/mincore/mlock/munlock/mremap).
>
> This is what I don't quite get. At the OS level, it's all memory; we
> just have to flag one region as persistent. This is easy, I'd do it in
> the physical memory map. once this is done, we need either to tell the
> allocators only use volatile, only use persistent, or don't care (I
> presume the latter would only be if you needed the extra ram).
>
> The missing thing is persistent key management of the memory space (so
> if a user or kernel wants 10Mb of persistent space, they get the same
> 10Mb back again across boots).
>
> The reason a memory API looks better to me is because a memory API can
> be used within the kernel. For instance, I want a persistent /var/tmp
> on tmpfs, I just tell tmpfs to allocate it in persistent memory and it
> survives reboots. Likewise, if I want an area to dump panics, I just
> use it ... in fact, I'd probably always place the dmesg buffer in
> persistent memory.
>
> If you start off with a vfs API, it becomes far harder to use it easily
> from within the kernel.
>
> The question, really is all about space management: how many persistent
> spaces would there be. I think, given the use cases above it would be a
> small number (it's basically one for every kernel use and one for ever
> user use ... a filesystem mount counting as one use), so a flat key to
> space management mapping (probably using u32 keys) makes sense, and
> that's similar to our current shared memory API.

So who manages the key space? If we do it based on names, it's easy; all
kernel uses are ".kernel/..." and we manage our own sub-hierarchy within
the namespace. If there's only a u32, somebody has to lay down the rules
about which numbers are used for what things. This isn't quite as ugly
as the initial proposal somebody made to me "We just use the physical
address as the key", and I told them all about how a.out libraries worked.

Nevertheless, I'm not interested in being the Mitch DSouza of NVM.

> > Discussion of use cases is exactly what I want! I think that a
> > non-hierarchical attempt at naming chunks of memory quickly expands
> > into cases where we learn we really do want a hierarchy after all.
>
> OK, so enumerate the uses. I can be persuaded the namespace has to be
> hierarchical if there are orders of magnitude more users than I think
> there will be.

I don't know what the potential use cases might be. I just don't think
the use cases are all that bounded.

> > > Again, this depends on use case. The SYSV shm API has a global flat
> > > keyspace. Perhaps your envisaged use requires a hierarchical key space
> > > and therefore a FS interface looks more natural with the leaves being
> > > divided memory regions?
> >
> > I've really never heard anybody hold up the SYSV shm API as something
> > to be desired before. Indeed, POSIX shared memory is much closer to
> > the filesystem API;
>
> I'm not really ... I was just thinking this needs key -> region mapping
> and SYSV shm does that. The POSIX anonymous memory API needs you to
> map /dev/zero and then pass file descriptors around for sharing. It's
> not clear how you manage a persistent key space with that.

I didn't say "POSIX anonymous memory". I said "POSIX shared memory".
I even pointed you at the right manpage to read if you haven't heard
of it before. The POSIX committee took a look at SYSV shm and said
"This is too ugly". So they invented their own API.

> > the only difference being use of shm_open() and
> > shm_unlink() instead of open() and unlink() [see shm_overview(7)].
>
> The internal kernel API addition is simply a key -> region mapping.
> Once that's done, you need an allocation API for userspace and you're
> done. I bet most userspace uses will be either give me xGB and put a
> tmpfs on it or give me xGB and put a something filesystem on it, but if
> the user wants an xGB mmap'd region, you can give them that as well.
>
> For a vfs interface, you have to do all of this as well, but in a much
> more complex way because the file name becomes the key and the metadata
> becomes the mapping.

You're downplaying the complexity of your own solution while overstating
the complexity of mine. Let's compare, using your suggestion of the
dmesg buffer.

Mine:

struct file *filp = filp_open(".kernel/dmesg", O_RDWR, 0);
if (!IS_ERR(filp))
	log_buf = nvm_map(filp, 0, __LOG_BUF_LEN, PAGE_KERNEL);

Yours:

log_buf = nvm_attach(492, NULL, 0); /* Hope nobody else used 492! */

Hm. Doesn't look all that different, does it? I've modelled nvm_attach()
after shmat(). Of course, this ignores the need to be able to sync,
which may vary between different NVM technologies, and the (desired
by some users) ability to change portions of the mapped NVM between
read-only and read-write.

If the extra parameters and extra lines of code hinder adoption, I have
no problems with adding a helper for the simple use cases:

void *nvm_attach(const char *name, int perms)
{
	void *mem;
	struct file *filp = filp_open(name, perms, 0);
	if (IS_ERR(filp))
		return NULL;
	mem = nvm_map(filp, 0, filp->f_dentry->d_inode->i_size, PAGE_KERNEL);
	fput(filp);
	return mem;
}
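
With that helper, the dmesg example above becomes a one-liner (still just
a sketch; ".kernel/dmesg" is the hypothetical name used earlier):

	log_buf = nvm_attach(".kernel/dmesg", O_RDWR);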

I do think that using numbers to refer to regions of NVM is a complete
non-starter. This was one of the big mistakes of SYSV; one so big that
even POSIX couldn't stomach it.

2012-05-18 15:05:59

by Alan

[permalink] [raw]
Subject: Re: NVM Mapping API

> I do think that using numbers to refer to regions of NVM is a complete
> non-starter. This was one of the big mistakes of SYSV; one so big that
> even POSIX couldn't stomach it.

That basically degenerates to using UUIDs. Even then it's not a useful
solution, because you need to be able to list the UUIDs in use and their
sizes, which turns into a file system.

I would prefer we use names.

Alan

2012-05-18 15:31:24

by James Bottomley

[permalink] [raw]
Subject: Re: NVM Mapping API

On Fri, 2012-05-18 at 10:49 -0400, Matthew Wilcox wrote:
> You're downplaying the complexity of your own solution while overstating
> the complexity of mine. Let's compare, using your suggestion of the
> dmesg buffer.

I'll give you that one when you tell me how you use your vfs interface
simply from within the kernel. Both are always about the same
complexity in user space ...

To be honest, I'm not hugely concerned whether the key management API is
u32 or a string. What bothers me the most is that there will be
in-kernel users for whom trying to mmap a file through the vfs will be
hugely more complex than a simple "give me a pointer to this persistent
region".

What all this tells me is that the key lookup API has to be exposed both
to the kernel and userspace. VFS may make the best sense for user
space, but the infrastructure needs to be non-VFS for the in kernel
users.

So what you want is a base region manager with allocation and key
lookup, which you expose to the kernel and on which you can build a
filesystem for userspace. Is everyone happy now?
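
Sketching that split with hypothetical names, the base layer might be
nothing more than:

	/* hypothetical core region manager, callable directly from kernel code */
	struct nvm_region *nvm_region_lookup(const char *key);
	struct nvm_region *nvm_region_create(const char *key, size_t size);
	void *nvm_region_map(struct nvm_region *region, pgprot_t prot);
	void nvm_region_unmap(struct nvm_region *region);

with the userspace-facing filesystem implemented as just another consumer
of that API.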

James

2012-05-18 17:18:35

by Matthew Wilcox

[permalink] [raw]
Subject: Re: NVM Mapping API

On Fri, May 18, 2012 at 04:31:08PM +0100, James Bottomley wrote:
> On Fri, 2012-05-18 at 10:49 -0400, Matthew Wilcox wrote:
> > You're downplaying the complexity of your own solution while overstating
> > the complexity of mine. Let's compare, using your suggestion of the
> > dmesg buffer.
>
> I'll give you that one when you tell me how you use your vfs interface
> simply from within the kernel. Both are always about the same
> complexity in user space ...
>
> To be honest, I'm not hugely concerned whether the key management API is
> u32 or a string. What bothers me the most is that there will be
> in-kernel users for whom trying to mmap a file through the vfs will be
> hugely more complex than a simple give me a pointer to this persistent
> region.

Huh? You snipped the example where I showed exactly that. The user
calls nvm_map() and gets back a pointer to a kernel mapping for the
persistent region.

2012-05-19 22:16:31

by Christian Stroetmann

[permalink] [raw]
Subject: Re: NVM Mapping API

On We, May 16, 2012 at 21:58, Christian Stroetmann wrote:
> On We, May 16, 2012 at 19:35, Matthew Wilcox wrote:
>> On Wed, May 16, 2012 at 10:52:00AM +0100, James Bottomley wrote:
>>> On Tue, 2012-05-15 at 09:34 -0400, Matthew Wilcox wrote:
>>>> There are a number of interesting non-volatile memory (NVM)
>>>> technologies
>>>> being developed. Some of them promise DRAM-comparable latencies and
>>>> bandwidths. At Intel, we've been thinking about various ways to
>>>> present
>>>> those to software. This is a first draft of an API that supports the
>>>> operations we see as necessary. Patches can follow easily enough once
>>>> we've settled on an API.
>>> If we start from first principles, does this mean it's usable as DRAM?
>>> Meaning do we even need a non-memory API for it? The only difference
>>> would be that some pieces of our RAM become non-volatile.
>> I'm not talking about a specific piece of technology, I'm assuming that
>> one of the competing storage technologies will eventually make it to
>> widespread production usage. Let's assume what we have is DRAM with a
>> giant battery on it.
> Our ST-RAM (see [1] for the original source of its description) is a
> concept based on the combination of a writable volatile Random-Access
> Memory (RAM) chip and a capacitor.
[...]
> Boaz asked: "What is the difference from say a PCIE DRAM card with
> battery"? It sits in the RAM slot.
>
>
>>
>> So, while we can use it just as DRAM, we're not taking advantage of the
>> persistent aspect of it if we don't have an API that lets us find the
>> data we wrote before the last reboot. And that sounds like a filesystem
>> to me.
>
> No and yes.
> 1. In the first place it is just a normal DRAM.
> 2. But due to its nature it has also many aspects of a flash memory.
> So the use case is for point
> 1. as a normal RAM module,
> and for point
> 2. as a file system,
> which again can be used
> 2.1 directly by the kernel as a normal file system,
> 2.2 directly by the kernel by the PRAMFS
> 2.3 by the proposed NVMFS, maybe as a shortcut for optimization,
> and
> 2.4 from the userspace, most potentially by using the standard VFS.
> Maybe this version 2.4 is the same as point 2.2.
>
>>> Or is there some impediment (like durability, or degradation on
>>> rewrite)
>>> which makes this unsuitable as a complete DRAM replacement?
>> The idea behind using a different filesystem for different NVM types is
>> that we can hide those kinds of impediments in the filesystem. By the
>> way, did you know DRAM degrades on every write? I think it's on the
>> order of 10^20 writes (and CPU caches hide many writes to heavily-used
>> cache lines), so it's a long way away from MLC or even SLC rates, but
>> it does exist.
>
> As I said before, a filesystem for the different NVM types would not
> be enough. These things are more complex due the possibility that they
> can be used very flexbily.
>
>>
>>> Alternatively, if it's not really DRAM, I think the UNIX file
>>> abstraction makes sense (it's a piece of memory presented as something
>>> like a filehandle with open, close, seek, read, write and mmap), but
>>> it's less clear that it should be an actual file system. The reason is
>>> that to present a VFS interface, you have to already have fixed the
>>> format of the actual filesystem on the memory because we can't nest
>>> filesystems (well, not without doing artificial loopbacks). Again,
>>> this
>>> might make sense if there's some architectural reason why the flash
>>> region has to have a specific layout, but your post doesn't shed any
>>> light on this.
>> We can certainly present a block interface to allow using unmodified
>> standard filesystems on top of chunks of this NVM. That's probably not
>> the optimum way for a filesystem to use it though; there's really no
>> point in constructing a bio to carry data down to a layer that's simply
>> going to do a memcpy().
>> --
>
> I also saw the use cases by Boaz that are
> Journals of other FS, which could be done on top of the NVMFS for
> example, but is not really what I have in mind, and
> Execute in place, for which an Elf loader feature is needed.
> Obviously, this use case was envisioned by me as well.
>
> For direct rebooting the checkpointing of standard RAM is also a
> needed function. The decision what is trashed and what is marked as
> persistent RAM content has to be made by the RAM experts of the Linux
> developers or the user. I even think that this is a special use case
> on its own with many options.
>
Because it is now about a year ago that I played around with the
conceptual hardware aspects of an Uninterruptible Power RAM (UPRAM) like
the ST-RAM, I looked in more detail at the software side yesterday and
today. So let me please add the first use case that I had in mind last
year and have now coined:
Hybrid Hibernation (HyHi), or alternatively Suspend-to-NVM,
which is similar to hybrid sleep and hibernation, but also differs a
little bit due to the uninterruptible power feature.

But as can easily be seen here again, even this one use case has two
paths for handling the NVM:
1. as RAM and
2. as an FS,
so that it leads once more to the discussion of whether hibernation should
be a kernel or a user space function (see [1] and [2] for more
information on the discussion about uswsusp (userspace software
suspend) and suspend2, and [3] for uswsusp and [4] for TuxOnIce).

Eventually, there may be an interest in reusing some functions or code.



Have fun in the sun
C. Stroetmann
> [1] ST-RAM http://www.ontonics.com/innovation/pipeline.htm#st-ram
>
[1] LKML: Pavel Machek: RE: suspend2 merge lkml.org/lkml/2007/4/24/405
[2] KernelTrap: Linux: Reviewing Suspend2 kerneltrap.org/node/6766
[3] suspend.sourceforge.net
[4] tuxonice.net

2012-05-31 17:53:29

by Andy Lutomirski

[permalink] [raw]
Subject: Re: NVM Mapping API

On Wed, May 16, 2012 at 9:02 AM, Matthew Wilcox <[email protected]> wrote:
> On Tue, May 15, 2012 at 04:02:01PM -0700, Andy Lutomirski wrote:
>> I would love to use this from userspace. If I could carve out a little
>> piece of NVM as a file (or whatever) and mmap it, I could do all kinds
>> of fun things with that. It would be nice if it had well-defined, or at
>> least configurable or discoverable, caching properties (e.g. WB, WT, WC,
>> UC, etc.).
>
> Yes, usage from userspace is definitely planned; again through a
> filesystem interface. Treating it like a regular file will work as
> expected; the question is how to expose the interesting properties
> (eg is there a lighter weight mechanism than calling msync()).

clflush? vdso system call?

If there's a proliferation of different technologies like this, we
could have an opaque struct nvm_mapping and a vdso call like

void __vdso_nvm_flush_writes(struct nvm_mapping *mapping,
			     void *address, size_t len);

that would read the struct nvm_mapping to figure out whether it should
do a clflush, sfence, mfence, posting read, or whatever else the
particular device needs. (This would also give a much better chance
of portability to architectures other than x86.)
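
A sketch of what that dispatch might look like (everything below is
hypothetical: neither struct nvm_mapping nor these flush methods exist,
and the clflush loop assumes 64-byte cache lines):

	enum nvm_flush_method { NVM_FLUSH_NONE, NVM_FLUSH_CLFLUSH };

	struct nvm_mapping {
		enum nvm_flush_method flush_method;	/* chosen per device by the kernel */
	};

	static void clflush_range(void *addr, size_t len)
	{
		unsigned long p = (unsigned long)addr & ~63UL;
		unsigned long end = (unsigned long)addr + len;

		/* flush every cache line overlapping [addr, addr + len) */
		for (; p < end; p += 64)
			asm volatile("clflush %0" : "+m" (*(volatile char *)p));
		asm volatile("sfence" ::: "memory");
	}

	void __vdso_nvm_flush_writes(struct nvm_mapping *mapping,
				     void *address, size_t len)
	{
		switch (mapping->flush_method) {
		case NVM_FLUSH_CLFLUSH:
			clflush_range(address, len);
			break;
		case NVM_FLUSH_NONE:		/* e.g. a write-through mapping */
			break;
		}
	}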

>
> My hope was that by having a discussion of how to use this stuff within
> the kernel, we might come up with some usage models that would inform
> how we design a user space library.
>
>> (Even better would be a way to make a clone of an fd that only allows
>> mmap, but that's a mostly unrelated issue.)
>
> O_MMAP_ONLY? And I'm not sure why you'd want to forbid reads and writes.

I don't want to forbid reads and writes; I want to forbid ftruncate.
That way I don't need to worry about malicious / obnoxious programs
sharing the fd causing SIGBUS.

--Andy