2013-09-04 21:55:01

by Rob Gittins

Subject: RFC Block Layer Extensions to Support NV-DIMMs


Non-volatile DIMMs have started to become available. An NVDIMM is a
DIMM that does not lose data across power interruptions. Some
NVDIMMs act like memory, while others are more like a block device
on the memory bus. Application uses vary from caching critical data
to serving as a boot device.

There are two access classes of NVDIMMs: block mode and
“load/store” mode DIMMs, which are referred to as Direct Memory
Mappable.

The block mode is where the DIMM provides IO ports for read or write
of data. These DIMMs reside on the memory bus but do not appear in the
application address space. Block mode DIMMs do not require any changes
to the current infrastructure, since they provide an IO type of interface.

Direct Memory Mappable DIMMs (DMMD) appear in the system address space
and are accessed via load and store instructions. These NVDIMMs
are part of the system physical address space (SPA) as memory with
the attribute that data survives a power interruption. As such, this
memory is managed by the kernel, which can assign virtual addresses and
map it into an application’s address space, as well as access it
directly. The area mapped into the system address space is
referred to as persistent memory (PMEM).

PMEM introduces the need for new operations in the
block_device_operations to support the specific characteristics of
the media.

First, data may not propagate all the way through the memory pipeline
when store instructions are executed. Data may stay in the CPU cache
or in other buffers in the processor and memory complex. In order to
ensure the durability of data there needs to be a driver entry point
to force a byte range out to media. The methods of doing this are
specific to the PMEM technology and need to be handled by the driver
that supports the DMMDs. To provide a way to ensure that data is
durable, we propose adding a commit function to the
block_device_operations vector.

void (*commitpmem)(struct block_device *bdev, void *addr);
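
As a purely illustrative sketch (not part of the proposed patch), an
x86 driver for a direct-mapped NVDIMM might implement this op roughly
as follows, assuming the platform's persistence domain begins at the
memory controller. The flush granule and all names other than the op
itself are made up for the example, and the prototype as proposed does
not yet carry a length (a point raised later in the thread):

#include <linux/blkdev.h>
#include <asm/barrier.h>
#include <asm/cacheflush.h>

#define EXAMPLE_COMMIT_GRANULE 64	/* illustrative: one cacheline */

/* Hypothetical driver-side commitpmem for a direct-mapped NVDIMM. */
static void example_commitpmem(struct block_device *bdev, void *addr)
{
	/* Write the dirtied cacheline(s) back toward the NVDIMM... */
	clflush_cache_range(addr, EXAMPLE_COMMIT_GRANULE);
	/* ...and use a full barrier so the flush completes before returning. */
	mb();
}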

Another area requiring extension is the need to be able to clear PMEM
errors. When data is fetched from errored PMEM it is marked with the
poison attribute, and if the CPU attempts to access the data it causes a
machine check. How errors are cleared is hardware dependent and needs
to be handled by the specific device driver. The following function
in the block_device_operations vector would clear the given range
of PMEM and put new data there. If the data argument is null or the
size is zero, the driver is free to put any data it wishes in PMEM.

void (*clearerrorpmem)(struct block_device *bdev, void *addr,
                       size_t len, void *data);
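
As a hypothetical illustration of the calling convention (using the
clear_pmem_error() wrapper added in the patch below; the buffer and
length names are made up), a PMEM-aware consumer could either restore
known-good contents while clearing the poison, or clear it and let the
driver choose the fill:

	/* Clear the poison and restore saved contents in one call. */
	clear_pmem_error(bdev, addr, len, backup_buf);

	/* Clear the poison only; the driver may fill the range as it wishes. */
	clear_pmem_error(bdev, addr, len, NULL);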

Different applications, filesystems and drivers may wish to share
ranges of PMEM. This is analogous to partitioning a disk that is
used by multiple different filesystems. Since PMEM is addressed
on a byte basis rather than a block basis, the existing partitioning
model does not fit well. As a result there needs to be a way to
describe PMEM ranges.

struct pmem_layout *(*getpmem)(struct block_device *bdev);
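
For illustration only (the function name below is invented;
get_pmemgeo() and the structures come from the patch that follows), a
PMEM-aware consumer might enumerate the returned layout roughly like
this:

#include <linux/blkdev.h>
#include <linux/err.h>
#include <linux/printk.h>
#include <linux/pmem.h>

/* Hypothetical example: walk the extents of a PMEM-capable bdev. */
static int example_enumerate_pmem(struct block_device *bdev)
{
	struct pmem_layout *pml = get_pmemgeo(bdev);
	u32 i;

	if (IS_ERR(pml))
		return PTR_ERR(pml);	/* no CPU-addressable PMEM behind bdev */

	for (i = 0; i < pml->pml_extent_count; i++) {
		struct persistent_memory_extent *pme = &pml->pml_extents[i];

		/* Each extent describes one physical range of PMEM. */
		pr_info("pmem extent %u: spa %pa len %llu node %d\n",
			i, &pme->pme_spa,
			(unsigned long long)pme->pme_len,
			pme->pme_numa_node);
	}
	return 0;
}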


Proposed patch.

---
Documentation/filesystems/Locking |  6 ++++
fs/block_dev.c                    | 42 +++++++++++++++++++++++++++++++++
include/linux/blkdev.h            |  4 +++
include/linux/pmem.h              | 47 +++++++++++++++++++++++++++++++++++++
4 files changed, 99 insertions(+), 0 deletions(-)
create mode 100644 include/linux/pmem.h

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 0706d32..78910f4 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -386,6 +386,9 @@ prototypes:
int (*revalidate_disk) (struct gendisk *);
int (*getgeo)(struct block_device *, struct hd_geometry *);
void (*swap_slot_free_notify) (struct block_device *, unsigned long);
+ struct pmem_layout *(*getpmem)(struct block_device *);
+ void (*commitpmem)(struct block_device *, void *);
+ void (*clearerrorpmem)(struct block_device *, void *, size_t, void *);
locking rules:
bd_mutex
@@ -399,6 +402,9 @@ unlock_native_capacity: no
revalidate_disk: no
getgeo: no
swap_slot_free_notify: no (see below)
+getpmem: no
+commitpmem: no
+clearerrorpmem: no
media_changed, unlock_native_capacity and revalidate_disk are called only from
check_disk_change().
diff --git a/fs/block_dev.c b/fs/block_dev.c
index aae187a..a57863c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -27,6 +27,7 @@
#include <linux/namei.h>
#include <linux/log2.h>
#include <linux/cleancache.h>
+#include <linux/pmem.h>
#include <asm/uaccess.h>
#include "internal.h"
@@ -1716,3 +1717,44 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
spin_unlock(&inode_sb_list_lock);
iput(old_inode);
}
+
+/**
+ * get_pmemgeo() - Return persistent memory geometry information
+ * @bdev: device to interrogate
+ *
+ * Provides the memory layout for a persistent memory volume which
+ * is made up of CPU-addressable persistent memory. If the interrogated
+ * device does not support CPU-addressable persistent memory then -ENOTTY
+ * is returned.
+ *
+ * Return: a pointer to a pmem_layout structure or ERR_PTR
+ */
+struct pmem_layout *get_pmemgeo(struct block_device *bdev)
+{
+ struct gendisk *bd_disk = bdev->bd_disk;
+
+ if (!bd_disk || !bd_disk->fops->getpmem)
+ return ERR_PTR(-ENOTTY);
+ return bd_disk->fops->getpmem(bdev);
+}
+EXPORT_SYMBOL(get_pmemgeo);
+
+void commit_pmem(struct block_device *bdev, void *addr)
+{
+ struct gendisk *bd_disk = bdev->bd_disk;
+
+ if (bd_disk && bd_disk->fops->commitpmem)
+ bd_disk->fops->commitpmem(bdev, addr);
+}
+EXPORT_SYMBOL(commit_pmem);
+
+void clear_pmem_error(struct block_device *bdev, void *addr, size_t len,
+		       void *data)
+{
+ struct gendisk *bd_disk = bdev->bd_disk;
+
+ if (bd_disk && bd_disk->fops->clearerrorpmem)
+ bd_disk->fops->clearerrorpmem(bdev, addr, len, data);
+}
+EXPORT_SYMBOL(clear_pmem_error);
+
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 78feda9..ba2c1f5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1498,6 +1498,10 @@ struct block_device_operations {
int (*getgeo)(struct block_device *, struct hd_geometry *);
/* this callback is with swap_lock and sometimes page table lock held */
void (*swap_slot_free_notify) (struct block_device *, unsigned long);
+ /* persistent memory operations */
+ struct pmem_layout * (*getpmem)(struct block_device *);
+ void (*commitpmem)(struct block_device *, void *);
+ void (*clearerrorpmem)(struct block_device *, void *, size_t, void *);
struct module *owner;
};
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
new file mode 100644
index 0000000..f907307
--- /dev/null
+++ b/include/linux/pmem.h
@@ -0,0 +1,47 @@
+/*
+ * Definitions for the Persistent Memory interface
+ * Copyright (c) 2013, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#ifndef _LINUX_PMEM_H
+#define _LINUX_PMEM_H
+
+#include <linux/types.h>
+
+struct persistent_memory_extent {
+ phys_addr_t pme_spa;
+ u64 pme_len;
+ int pme_numa_node;
+};
+
+struct pmem_layout {
+ u64 pml_flags;
+ u64 pml_total_size;
+ u32 pml_extent_count;
+ u32 pml_interleave; /* interleave bytes */
+ struct persistent_memory_extent pml_extents[];
+};
+
+/*
+ * Flags values
+ */
+#define PMEM_ENABLED 0x0000000000000001 /* can be used for Persistent Mem */
+#define PMEM_ERRORED 0x0000000000000002 /* in an error state */
+#define PMEM_COMMIT 0x0000000000000004 /* commit function available */
+#define PMEM_CLEAR_ERROR 0x0000000000000008 /* clear error function provided */
+
+#endif
+
--
1.7.1




2013-09-05 12:12:38

by Jeff Moyer

Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs

Rob Gittins <[email protected]> writes:

> Direct Memory Mappable DIMMs (DMMD) appear in the system address space
> and are accessed via load and store instructions. These NVDIMMs
> are part of the system physical address space (SPA) as memory with
> the attribute that data survives a power interruption. As such this
> memory is managed by the kernel which can assign virtual addresses and
> mapped into application’s address space as well as being accessible
> by the kernel. The area mapped into the system address space is
> being referred to as persistent memory (PMEM).
>
> PMEM introduces the need for new operations in the
> block_device_operations to support the specific characteristics of
> the media.
>
> First data may not propagate all the way through the memory pipeline
> when store instructions are executed. Data may stay in the CPU cache
> or in other buffers in the processor and memory complex. In order to
> ensure the durability of data there needs to be a driver entry point
> to force a byte range out to media. The methods of doing this are
> specific to the PMEM technology and need to be handled by the driver
> that is supporting the DMMDs. To provide a way to ensure that data is
> durable adding a commit function to the block_device_operations vector.

If the memory is available to be mapped into the address space of the
kernel or a user process, then I don't see why we should have a block
device at all. I think it would make more sense to have a different
driver class for these persistent memory devices.

> void (*commitpmem)(struct block_device *bdev, void *addr);

Seems like this would benefit from a length argument as well, no?

> Another area requiring extension is the need to be able to clear PMEM
> errors. When data is fetched from errored PMEM it is marked with the
> poison attribute. If the CPU attempts to access the data it causes a
> machine check. How errors are cleared is hardware dependent and needs
> to be handled by the specific device driver. The following function
> in the block_device_operations vector would clear the correct range
> of PMEM and put new data there. If the argument data is null or the
> size is zero the driver is free to put any data in PMEM it wishes.
>
> void (*clearerrorpmem)(struct block_device *bdev, void *addr,
> size_t len, void *data);

What is the size of data?

> Different applications, filesystem and drivers may wish to share
> ranges of PMEM. This is analogous to partitioning a disk that is
> using multiple and different filesystems. Since PMEM is addressed
> on a byte basis rather than a block basis the existing partitioning
> model does not fit well. As a result there needs to be a way to
> describe PMEM ranges.
>
> struct pmem_layout *(*getpmem)(struct block_device *bdev);

If existing partitioning doesn't work well, then it sounds like a block
device isn't the right fit (again). Ignoring that detail, what about
requesting and releasing ranges of persistent memory, as in your
"partitioning" example? Would that not also be a function of the
driver?

Cheers,
Jeff


2013-09-05 15:35:03

by Matthew Wilcox

Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs

On Thu, Sep 05, 2013 at 08:12:05AM -0400, Jeff Moyer wrote:
> If the memory is available to be mapped into the address space of the
> kernel or a user process, then I don't see why we should have a block
> device at all. I think it would make more sense to have a different
> driver class for these persistent memory devices.

We already have at least two block devices in the tree that provide
this kind of functionality (arch/powerpc/sysdev/axonram.c and
drivers/s390/block/dcssblk.c). Looking at how they're written, it
seems like implementing either of them as a block device on top of a
character device that extended their functionality in the direction we
want would be a pretty major bloating factor for no real benefit (not
even a particularly cleaner architecture).

> > Different applications, filesystem and drivers may wish to share
> > ranges of PMEM. This is analogous to partitioning a disk that is
> > using multiple and different filesystems. Since PMEM is addressed
> > on a byte basis rather than a block basis the existing partitioning
> > model does not fit well. As a result there needs to be a way to
> > describe PMEM ranges.
> >
> > struct pmem_layout *(*getpmem)(struct block_device *bdev);
>
> If existing partitioning doesn't work well, then it sounds like a block
> device isn't the right fit (again). Ignoring that detail, what about
> requesting and releasing ranges of persistent memory, as in your
> "partitioning" example? Would that not also be a function of the
> driver?

"existing partitioning" doesn't even work well for existing drives!
Nobody actually builds a drive with fixed C/H/S any more.

2013-09-05 16:34:20

by Rob Gittins

Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs

Hi Jeff,

Thanks for taking the time to look at this.

On Thu, 2013-09-05 at 08:12 -0400, Jeff Moyer wrote:
> Rob Gittins <[email protected]> writes:
>
> > Direct Memory Mappable DIMMs (DMMD) appear in the system address space
> > and are accessed via load and store instructions. These NVDIMMs
> > are part of the system physical address space (SPA) as memory with
> > the attribute that data survives a power interruption. As such this
> > memory is managed by the kernel which can assign virtual addresses and
> > mapped into application’s address space as well as being accessible
> > by the kernel. The area mapped into the system address space is
> > being referred to as persistent memory (PMEM).
> >
> > PMEM introduces the need for new operations in the
> > block_device_operations to support the specific characteristics of
> > the media.
> >
> > First data may not propagate all the way through the memory pipeline
> > when store instructions are executed. Data may stay in the CPU cache
> > or in other buffers in the processor and memory complex. In order to
> > ensure the durability of data there needs to be a driver entry point
> > to force a byte range out to media. The methods of doing this are
> > specific to the PMEM technology and need to be handled by the driver
> > that is supporting the DMMDs. To provide a way to ensure that data is
> > durable adding a commit function to the block_device_operations vector.
>
> If the memory is available to be mapped into the address space of the
> kernel or a user process, then I don't see why we should have a block
> device at all. I think it would make more sense to have a different
> driver class for these persistent memory devices.

The reason to include a block device is to allow existing filesystems
to be used with NV-DIMMs. Assuming that NV-DIMMs are approximately
the same speed as DRAM, a block IO would complete in
approximately 1us. This would make for a really fast existing
filesystem.

>
> > void (*commitpmem)(struct block_device *bdev, void *addr);
>
> Seems like this would benefit from a length argument as well, no?

Yes. Great point. I will add that in.

>
> > Another area requiring extension is the need to be able to clear PMEM
> > errors. When data is fetched from errored PMEM it is marked with the
> > poison attribute. If the CPU attempts to access the data it causes a
> > machine check. How errors are cleared is hardware dependent and needs
> > to be handled by the specific device driver. The following function
> > in the block_device_operations vector would clear the correct range
> > of PMEM and put new data there. If the argument data is null or the
> > size is zero the driver is free to put any data in PMEM it wishes.
> >
> > void (*clearerrorpmem)(struct block_device *bdev, void *addr,
> > size_t len, void *data);

> What is the size of data?

clearerrorpmem, as part of the process of clearing an error,
can effectively write a buffer of data as part of the
clear process. If the len is zero or the data pointer is null then
only an error clear happens.


> > Different applications, filesystem and drivers may wish to share
> > ranges of PMEM. This is analogous to partitioning a disk that is
> > using multiple and different filesystems. Since PMEM is addressed
> > on a byte basis rather than a block basis the existing partitioning
> > model does not fit well. As a result there needs to be a way to
> > describe PMEM ranges.
> >
> > struct pmem_layout *(*getpmem)(struct block_device *bdev);
>
> If existing partitioning doesn't work well, then it sounds like a block
> device isn't the right fit (again). Ignoring that detail, what about
> requesting and releasing ranges of persistent memory, as in your
> "partitioning" example? Would that not also be a function of the
> driver?

The existing partitioning mechanism was intended for small drives
and works best for a single fs/device. We are approaching NV-DIMMs
as if they were more like LUNs in storage arrays. Each range is
treated as a device. A range of an NV-DIMM could be partitioned if
someone wanted to do such a thing.

Thanks,
Rob


>
> Cheers,
> Jeff
>
> >
> >
> > Proposed patch.
> >
> > ---
> > Documentation/filesystems/Locking | 6 ++++
> > fs/block_dev.c | 42
> > +++++++++++++++++++++++++++++++++
> > include/linux/blkdev.h | 4 +++
> > include/linux/pmem.h | 47
> > +++++++++++++++++++++++++++++++++++++
> > 4 files changed, 99 insertions(+), 0 deletions(-)
> > create mode 100644 include/linux/pmem.h
> >
> > diff --git a/Documentation/filesystems/Locking
> > b/Documentation/filesystems/Locking
> > index 0706d32..78910f4 100644
> > --- a/Documentation/filesystems/Locking
> > +++ b/Documentation/filesystems/Locking
> > @@ -386,6 +386,9 @@ prototypes:
> > int (*revalidate_disk) (struct gendisk *);
> > int (*getgeo)(struct block_device *, struct hd_geometry *);
> > void (*swap_slot_free_notify) (struct block_device *, unsigned long);
> > + struct pmem_layout *(*getpmem)(struct block_device *);
> > + void (*commitpmem)(struct block_device *, void *);
> > + void (*clearerrorpmem)(struct block_device *, void *, size_t, void *);
> > locking rules:
> > bd_mutex
> > @@ -399,6 +402,9 @@ unlock_native_capacity: no
> > revalidate_disk: no
> > getgeo: no
> > swap_slot_free_notify: no (see below)
> > +getpmem: no
> > +commitpmem: no
> > +clearerrorpmem: no
> > media_changed, unlock_native_capacity and revalidate_disk are called
> > only
> > from
> > check_disk_change().
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index aae187a..a57863c 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -27,6 +27,7 @@
> > #include <linux/namei.h>
> > #include <linux/log2.h>
> > #include <linux/cleancache.h>
> > +#include <linux/pmem.h>
> > #include <asm/uaccess.h>
> > #include "internal.h"
> > @@ -1716,3 +1717,44 @@ void iterate_bdevs(void (*func)(struct
> > block_device
> > *, void *), void *arg)
> > spin_unlock(&inode_sb_list_lock);
> > iput(old_inode);
> > }
> > +
> > +/**
> > + * get_pmemgeo() - Return persistent memory geometry information
> > + * @bdev: device to interrogate
> > + *
> > + * Provides the memory layout for a persistent memory volume which
> > + * is made up of CPU-addressable persistent memory. If the
> > interrogated
> > + * device does not support CPU-addressable persistent memory then
> > -ENOTTY
> > + * is returned.
> > + *
> > + * Return: a pointer to a pmem_layout structure or ERR_PTR
> > + */
> > +struct pmem_layout *get_pmemgeo(struct block_device *bdev)
> > +{
> > + struct gendisk *bd_disk = bdev->bd_disk;
> > +
> > + if (!bd_disk || !bd_disk->fops->getpmem)
> > + return ERR_PTR(-ENOTTY);
> > + return bd_disk->fops->getpmem(bdev);
> > +}
> > +EXPORT_SYMBOL(get_pmemgeo);
> > +
> > +void commit_pmem(struct block_device *bdev, void *addr)
> > +{
> > + struct gendisk *bd_disk = bdev->bd_disk;
> > +
> > + if (bd_disk && bd_disk->fops->commitpmem)
> > + bd_disk->fops->commitpmem(bdev, addr);
> > +}
> > +EXPORT_SYMBOL(commit_pmem);
> > +
> > +void clear_pmem_error(struct block_device *bdev, void *addr, size_t
> > len,
> > + void *data)
> > +{
> > + struct gendisk *bd_disk = bdev->bd_disk;
> > +
> > + if (bd_disk && bd_disk->fops->clearerrorpmem)
> > + bd_disk->fops->clearerrorpmem(bdev, addr, len, data);
> > +}
> > +EXPORT_SYMBOL(clear_pmem_error);
> > +
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 78feda9..ba2c1f5 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1498,6 +1498,10 @@ struct block_device_operations {
> > int (*getgeo)(struct block_device *, struct hd_geometry *);
> > /* this callback is with swap_lock and sometimes page table lock held
> > */
> > void (*swap_slot_free_notify) (struct block_device *, unsigned long);
> > + /* persistent memory operations */
> > + struct pmem_layout * (*getpmem)(struct block_device *);
> > + void (*commitpmem)(struct block_device *, void *);
> > + void (*clearerrorpmem)(struct block_device *, void *, size_t, void *);
> > struct module *owner;
> > };
> > diff --git a/include/linux/pmem.h b/include/linux/pmem.h
> > new file mode 100644
> > index 0000000..f907307
> > --- /dev/null
> > +++ b/include/linux/pmem.h
> > @@ -0,0 +1,47 @@
> > +/*
> > + * Definitions for the Persistent Memory interface
> > + * Copyright (c) 2013, Intel Corporation.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > it
> > + * under the terms and conditions of the GNU General Public License,
> > + * version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope it will be useful, but
> > WITHOUT
> > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
> > or
> > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public
> > License
> > for
> > + * more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > along with
> > + * this program; if not, write to the Free Software Foundation, Inc.,
> > + * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
> > + */
> > +
> > +#ifndef _LINUX_PMEM_H
> > +#define _LINUX_PMEM_H
> > +
> > +#include <linux/types.h>
> > +
> > +struct persistent_memory_extent {
> > + phys_addr_t pme_spa;
> > + u64 pme_len;
> > + int pme_numa_node;
> > +};
> > +
> > +struct pmem_layout {
> > + u64 pml_flags;
> > + u64 pml_total_size;
> > + u32 pml_extent_count;
> > + u32 pml_interleave; /* interleave bytes */
> > + struct persistent_memory_extent pml_extents[];
> > +};
> > +
> > +/*
> > + * Flags values
> > + */
> > +#define PMEM_ENABLED 0x0000000000000001 /* can be used for Persistent
> > Mem
> > */
> > +#define PMEM_ERRORED 0x0000000000000002 /* in an error state */
> > +#define PMEM_COMMIT 0x0000000000000004 /* commit function available */
> > +#define PMEM_CLEAR_ERROR 0x0000000000000008 /* clear error function
> > provided */
> > +
> > +#endif
> > +
> > --
> > 1.7.1
> >
> >
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/

2013-09-05 17:16:16

by Jeff Moyer

Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs

Matthew Wilcox <[email protected]> writes:

> On Thu, Sep 05, 2013 at 08:12:05AM -0400, Jeff Moyer wrote:
>> If the memory is available to be mapped into the address space of the
>> kernel or a user process, then I don't see why we should have a block
>> device at all. I think it would make more sense to have a different
>> driver class for these persistent memory devices.
>
> We already have at least two block devices in the tree that provide
> this kind of functionality (arch/powerpc/sysdev/axonram.c and
> drivers/s390/block/dcssblk.c). Looking at how they're written, it
> seems like implementing either of them as a block device on top of a
> character device that extended their functionality in the direction we
> want would be a pretty major bloating factor for no real benefit (not
> even a particularly cleaner architecture).

Fun examples to read, thanks for the pointers. I'll note that neither
required extensions to the block device operations. ;-) I do agree with
you that neither would benefit from changing.

There are a couple of things in this proposal that cause me grief,
centered around the commitpmem call:

>> void (*commitpmem)(struct block_device *bdev, void *addr);

For block devices, when you want to flush something out, you submit a
bio with REQ_FLUSH set. Or, you could have submitted one or more I/Os
with REQ_FUA. Here, you want to add another method to accomplish the
same thing, but outside of the data path. So, who would the caller of
this commitpmem function be? Let's assume that we have a file system
layered on top of this block device. Will the file system need to call
commitpmem in addition to sending down the appropriate flags with the
I/Os?

This brings me to the other thing. If the caller of commitpmem is a
persistent memory-aware file system, then it seems awkward to call into
a block driver at all. You are basically turning the block device into
a sort of hybrid thing, where you can access stuff behind it in myriad
ways. That's the part that doesn't make sense to me.

So, that's why I suggested that maybe pmem is different from a block
device, but a block device could certainly be layered on top of it.

Hopefully that clears up my concerns with the approach.

Cheers,
Jeff

2013-09-05 17:19:29

by Jeff Moyer

Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs

Rob Gittins <[email protected]> writes:

>> > void (*commitpmem)(struct block_device *bdev, void *addr);
>>
>> Seems like this would benefit from a length argument as well, no?
>
> Yes. Great point. I will add that in.

Rob, taking it a step further, maybe a vectored interface would be more
flexible. Something to consider, anyway.

>> > Another area requiring extension is the need to be able to clear PMEM
>> > errors. When data is fetched from errored PMEM it is marked with the
>> > poison attribute. If the CPU attempts to access the data it causes a
>> > machine check. How errors are cleared is hardware dependent and needs
>> > to be handled by the specific device driver. The following function
>> > in the block_device_operations vector would clear the correct range
>> > of PMEM and put new data there. If the argument data is null or the
>> > size is zero the driver is free to put any data in PMEM it wishes.
>> >
>> > void (*clearerrorpmem)(struct block_device *bdev, void *addr,
>> > size_t len, void *data);
>
>> What is the size of data?
>
> clearerrorpmem as part of the process of clearing an error
> can effectively write a buffer of data as part of the
> clear process. If the len is zero or the data pointer is null then
> only a error clear happens.

So data would be of 'len' size, then? In other words, this would be a
way to restore data that may have been there?

> The existing partitioning mechanism was intended for small drives
> and works best for a single fs/device. We are approaching NV-DIMMs
> as if they were more like LUNs in storage arrays. Each range is
> treated as a device. A range of an NV-DIMM could be partitioned if
> someone wanted to do such a thing.

OK, that clears things up, thanks.

-Jeff

2013-09-05 17:57:23

by Matthew Wilcox

Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs

On Thu, Sep 05, 2013 at 01:15:40PM -0400, Jeff Moyer wrote:
> Matthew Wilcox <[email protected]> writes:
>
> > On Thu, Sep 05, 2013 at 08:12:05AM -0400, Jeff Moyer wrote:
> >> If the memory is available to be mapped into the address space of the
> >> kernel or a user process, then I don't see why we should have a block
> >> device at all. I think it would make more sense to have a different
> >> driver class for these persistent memory devices.
> >
> > We already have at least two block devices in the tree that provide
> > this kind of functionality (arch/powerpc/sysdev/axonram.c and
> > drivers/s390/block/dcssblk.c). Looking at how they're written, it
> > seems like implementing either of them as a block device on top of a
> > character device that extended their functionality in the direction we
> > want would be a pretty major bloating factor for no real benefit (not
> > even a particularly cleaner architecture).
>
> Fun examples to read, thanks for the pointers. I'll note that neither
> required extensions to the block device operations. ;-) I do agree with
> you that neither would benefit from changing.

Ah, but they did require extensions to the block device operations!
See commit 420edbcc09008342c7b2665453f6b370739aadb0. The next obvious
question is, "Why isn't this API suitable for us". And the answer is
that this API suits an existing filesystem design like ext2 (with the
xip mount option) that wants to stick very closely to the idiom of
reading one block at a time. It's not really suited to wanting to
"read" multiple vectors at once. Sure, we could ask for something
like preadv/pwritev in the block device operations, but we thought that
allowing the filesystem to ask for the geometry and then map the pieces
it needed itself was more elegant.
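
(For reference, the block_device_operations hook being referred to is
the xip direct_access op, which at the time looked roughly like the
following; it hands back a kernel virtual address and a pfn for one
block at a time:)

	int (*direct_access)(struct block_device *, sector_t,
			     void **, unsigned long *);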

> There are a couple of things in this proposal that cause me grief,
> centered around the commitpmem call:
>
> >> void (*commitpmem)(struct block_device *bdev, void *addr);
>
> For block devices, when you want to flush something out, you submit a
> bio with REQ_FLUSH set. Or, you could have submitted one or more I/Os
> with REQ_FUA. Here, you want to add another method to accomplish the
> same thing, but outside of the data path. So, who would the caller of
> this commitpmem function be? Let's assume that we have a file system
> layered on top of this block device. Will the file system need to call
> commitpmem in addition to sending down the appropriate flags with the
> I/Os?

Existing filesystems work as-is, without using commitpmem. commitpmem
is only for a filesystem that isn't using bios. We could use a bio to
commit writes ... but we'd like to be able to commit writes that are
as small as a cacheline, and I don't think Linux supports block sizes
smaller than 512 bytes.

> This brings me to the other thing. If the caller of commitpmem is a
> persistent memory-aware file system, then it seems awkward to call into
> a block driver at all. You are basically turning the block device into
> a sort of hybrid thing, where you can access stuff behind it in myriad
> ways. That's the part that doesn't make sense to me.

If we get the API right, the filesystem shouldn't care whether these
things are done with bios or with direct calls to the block driver.
Look at how we did sb_issue_discard(). That could be switched to a
direct call, but we choose to use bios because we want the discard bios
queued along with regular reads and writes.
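
(For comparison, sb_issue_discard() has roughly this shape, and under
the covers it builds and submits discard bios through the block layer:)

	int sb_issue_discard(struct super_block *sb, sector_t block,
			     sector_t nr_blocks, gfp_t gfp_mask,
			     unsigned long flags);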

> So, that's why I suggested that maybe pmem is different from a block
> device, but a block device could certainly be layered on top of it.

Architecturally, it absolutely could be done, but I think in terms of
writing the driver it is more complex, and in terms of the filesystem,
it makes no difference (assuming we get the APIs right). I think we
should proceed as-is, and we can look at migrating to a block-layer-free
version later, if that ever makes sense.

2013-09-05 19:49:43

by Zuckerman, Boris

Subject: RE: RFC Block Layer Extensions to Support NV-DIMMs

Hi,

It's a great topic! I am glad to see this conversation happening...

Let me try to open another can of worms...

Persistent memory updates are more like DB transactions and less like flushing IO ranges.

If someone offers commitpmem() functionality, someone has to ensure that all updates before that call can be discarded on failure or on request. Also, the scope of the updates may not be easily describable by a single range.

Forcing users to solve that (especially failure atomicity) on their own by journaling, logging or other mechanisms is optimistic, and it cannot be done efficiently.

So, where should we expect to have this functionality implemented? FS drivers, block drivers, controllers?

Regards, Boris


2013-09-05 20:43:52

by Gittins, Rob

Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs


Hi Boris,

The purpose of commitpmem is to notify the hardware that data is
ready to be made persistent. This would mean flushing any internal
buffers and doing whatever is needed in the hardware to ensure durable
data.

I was trying to keep the API simple to allow the application to build
its own transaction mechanisms that fit the specific app's needs.

commitpmem is a device driver op since it may vary from one hardware
and media technology to another. Perhaps the name could be clearer.

Rob





2013-09-05 21:03:29

by Zuckerman, Boris

Subject: RE: RFC Block Layer Extensions to Support NV-DIMMs

Thanks!

I understand that...

However, unless transactional services are provided, a lot of performance would be lost due to excessive commits of journals. This is specific to PM...

Regards, Boris



2013-09-07 05:12:49

by Vladislav Bolkhovitin

Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs


Rob Gittins, on 09/04/2013 02:54 PM wrote:
> Non-volatile DIMMs have started to become available. A NVDIMMs is a
> DIMM that does not lose data across power interruptions. Some of the
> NVDIMMs act like memory, while others are more like a block device
> on the memory bus. Application uses vary from being used to cache
> critical data, to being a boot device.
>
> There are two access classes of NVDIMMs, block mode and
> “load/store” mode DIMMs which are referred to as Direct Memory
> Mappable.
>
> The block mode is where the DIMM provides IO ports for read or write
> of data. These DIMMs reside on the memory bus but do not appear in the
> application address space. Block mode DIMMs do not require any changes
> to the current infrastructure, since they provide IO type of interface.
>
> Direct Memory Mappable DIMMs (DMMD) appear in the system address space
> and are accessed via load and store instructions. These NVDIMMs
> are part of the system physical address space (SPA) as memory with
> the attribute that data survives a power interruption. As such this
> memory is managed by the kernel which can assign virtual addresses and
> mapped into application’s address space as well as being accessible
> by the kernel. The area mapped into the system address space is
> being referred to as persistent memory (PMEM).
>
> PMEM introduces the need for new operations in the
> block_device_operations to support the specific characteristics of
> the media.
>
> First data may not propagate all the way through the memory pipeline
> when store instructions are executed. Data may stay in the CPU cache
> or in other buffers in the processor and memory complex. In order to
> ensure the durability of data there needs to be a driver entry point
> to force a byte range out to media. The methods of doing this are
> specific to the PMEM technology and need to be handled by the driver
> that is supporting the DMMDs. To provide a way to ensure that data is
> durable adding a commit function to the block_device_operations vector.
>
> void (*commitpmem)(struct block_device *bdev, void *addr);

Why glue to the block concept for an apparently non-block class of devices? By pushing
NVDIMMs into the block model you are both limiting them to block device capabilities and
having to extend block devices with properties that are alien to them.

NVDIMMs are, apparently, a new class of devices, so it is better to have a new class of
kernel devices for them. If you then need to put file systems on top of them, just
write a one-fits-all blk_nvmem driver, which can create a block device for all types of
NVDIMM devices and drivers.

This way you will clearly and gracefully get the best from NVDIMM devices as well as
not soil block devices.

Vlad

2013-09-23 22:51:34

by Rob Gittins

Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs

On Fri, 2013-09-06 at 22:12 -0700, Vladislav Bolkhovitin wrote:
>
> Why to glue to the block concept for apparently not block class of devices? By pushing
> NVDIMMs into the block model you both limiting them to block devices capabilities as
> well as have to expand block devices by alien to them properties
Hi Vlad,

We chose to extend the block operations for a couple of reasons. The
majority of NVDIMM usage is by emulating block mode. We figure that
over time usages will appear that use them directly, and then we can
design interfaces to enable direct use.

Since a range of NVDIMM needs a name, security and other attributes,
mmap is a really good model to build on. This quickly takes us into the
realm of filesystems, which are easiest to build on the existing
block infrastructure.

Another reason to extend block is that all of the existing
administrative interfaces and tools such as mkfs still work, and we are
not adding new management tools and requirements that might inhibit
the adoption of the technology. Basically, if it works today for block
devices, the same CLI commands will work for NVDIMMs.

The extensions are so minimal that they don't negatively impact the
existing interfaces.
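
Purely to illustrate the driver side of this, here is a rough sketch assuming the
proposed commitpmem hook from this RFC is in place; the driver name, its private
structure and the flush helper are all made up:

#include <linux/module.h>
#include <linux/blkdev.h>

/* Invented per-device structure for this example. */
struct mydimm_dev {
	void *base;	/* kernel mapping of the DIMM's PMEM range */
};

/* Technology-specific: push the data behind @addr out to the media. */
static void mydimm_flush_range(struct mydimm_dev *dev, void *addr)
{
	/* e.g. cache-line flush plus fence, or a device doorbell write */
}

static void mydimm_commitpmem(struct block_device *bdev, void *addr)
{
	struct mydimm_dev *dev = bdev->bd_disk->private_data;

	mydimm_flush_range(dev, addr);
}

static const struct block_device_operations mydimm_fops = {
	.owner		= THIS_MODULE,
	.commitpmem	= mydimm_commitpmem,	/* new op from this RFC */
	/* .open, .release, .ioctl, ... as in any other block driver */
};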

Thanks,
Rob



> .
>
> NVDIMMs are, apparently, a new class of devices, so better to have a new class of
> kernel devices for them. If you then need to put file systems on top of them, just
> write one-fit-all blk_nvmem driver, which can create a block device for all types of
> NVDIMM devices and drivers.
>
> This way you will clearly and gracefully get the best from NVDIMM devices as well as
> won't soil block devices.
>
> Vlad

2013-09-26 06:59:31

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs

Hi Rob,

Rob Gittins, on 09/23/2013 03:51 PM wrote:
> On Fri, 2013-09-06 at 22:12 -0700, Vladislav Bolkhovitin wrote:
>> Rob Gittins, on 09/04/2013 02:54 PM wrote:
>>> Non-volatile DIMMs have started to become available. A NVDIMMs is a
>>> DIMM that does not lose data across power interruptions. Some of the
>>> NVDIMMs act like memory, while others are more like a block device
>>> on the memory bus. Application uses vary from being used to cache
>>> critical data, to being a boot device.
>>>
>>> There are two access classes of NVDIMMs, block mode and
>>> “load/store” mode DIMMs which are referred to as Direct Memory
>>> Mappable.
>>>
>>> The block mode is where the DIMM provides IO ports for read or write
>>> of data. These DIMMs reside on the memory bus but do not appear in the
>>> application address space. Block mode DIMMs do not require any changes
>>> to the current infrastructure, since they provide IO type of interface.
>>>
>>> Direct Memory Mappable DIMMs (DMMD) appear in the system address space
>>> and are accessed via load and store instructions. These NVDIMMs
>>> are part of the system physical address space (SPA) as memory with
>>> the attribute that data survives a power interruption. As such this
>>> memory is managed by the kernel which can assign virtual addresses and
>>> mapped into application’s address space as well as being accessible
>>> by the kernel. The area mapped into the system address space is
>>> being referred to as persistent memory (PMEM).
>>>
>>> PMEM introduces the need for new operations in the
>>> block_device_operations to support the specific characteristics of
>>> the media.
>>>
>>> First data may not propagate all the way through the memory pipeline
>>> when store instructions are executed. Data may stay in the CPU cache
>>> or in other buffers in the processor and memory complex. In order to
>>> ensure the durability of data there needs to be a driver entry point
>>> to force a byte range out to media. The methods of doing this are
>>> specific to the PMEM technology and need to be handled by the driver
>>> that is supporting the DMMDs. To provide a way to ensure that data is
>>> durable adding a commit function to the block_device_operations vector.
>>>
>>> void (*commitpmem)(struct block_device *bdev, void *addr);
>>
>> Why to glue to the block concept for apparently not block class of devices? By pushing
>> NVDIMMs into the block model you both limiting them to block devices capabilities as
>> well as have to expand block devices by alien to them properties
> Hi Vlad,
>
> We chose to extent the block operations for a couple of reasons. The
> majority of NVDIMM usage is by emulating block mode. We figure that
> over time usages will appear that use them directly and then we can
> design interfaces to enable direct use.
>
> Since a range of NVDIMM needs a name, security and other attributes mmap
> is a really good model to build on. This quickly takes us into the
> realm of a file systems, which are easiest to build on the existing
> block infrastructure.
>
> Another reason to extend block is that all of the existing
> administrative interfaces and tools such as mkfs still work and we have
> not added some new management tools and requirements that may inhibit
> the adoption of the technology. Basically if it works today for block
> the same cli commands will work for NVDIMMs.
>
> The extensions are so minimal that they don't negatively impact the
> existing interfaces.

Well, they will negatively impact them, because those NVDIMM additions are conceptually
alien to the block device concept.

You didn't answer: why not create a new class of devices for NVDIMMs and implement a
one-fits-all block driver for them? That is a simple, clean and elegant solution, which
would fit your need to get a block device out of an NVDIMM device pretty well, with
minimal effort.
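
To make it concrete, here is a rough sketch of the split I mean; none of this exists
today and every name below is invented:

#include <linux/types.h>

struct nvdimm_device;

/* Per-hardware operations each NVDIMM driver would implement. */
struct nvdimm_ops {
	void *(*map)(struct nvdimm_device *nvd, loff_t off, size_t len);
	void  (*commit)(struct nvdimm_device *nvd, void *addr, size_t len);
	int   (*clear_error)(struct nvdimm_device *nvd, void *addr,
			     size_t len, void *data);
};

struct nvdimm_device {
	const struct nvdimm_ops	*ops;
	size_t			size;
	/* name, security attributes, ... */
};

/* Vendor drivers register their DIMMs with the new class... */
int nvdimm_register(struct nvdimm_device *nvd);

/*
 * ...and a single blk_nvmem driver sits on top of the class, exposing any
 * registered NVDIMM as an ordinary block device for mkfs and friends,
 * without adding anything to struct block_device_operations.
 */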

Vlad

2013-09-26 14:57:04

by Zuckerman, Boris

[permalink] [raw]
Subject: RE: RFC Block Layer Extensions to Support NV-DIMMs

In support of what Vlad said:

To work with persistent memory as efficiently as we can work with RAM, we need a bit more than "commit". It's reasonable to expect that we will get additional support from CPUs that goes beyond mfence and mflush. That may include discovery, transactional support, etc. Encapsulating that in a special class sooner rather than later seems the right thing to do...

Boris

> -----Original Message-----
> From: Linux-pmfs [mailto:[email protected]] On Behalf Of
> Vladislav Bolkhovitin
> Sent: Thursday, September 26, 2013 2:59 AM
> To: [email protected]
> Cc: [email protected]; [email protected]; linux-
> [email protected]
> Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs
>
> Hi Rob,
>
> Rob Gittins, on 09/23/2013 03:51 PM wrote:
> > On Fri, 2013-09-06 at 22:12 -0700, Vladislav Bolkhovitin wrote:
> >> Rob Gittins, on 09/04/2013 02:54 PM wrote:
> >>> Non-volatile DIMMs have started to become available. A NVDIMMs is a
> >>> DIMM that does not lose data across power interruptions. Some of
> >>> the NVDIMMs act like memory, while others are more like a block
> >>> device on the memory bus. Application uses vary from being used to
> >>> cache critical data, to being a boot device.
> >>>
> >>> There are two access classes of NVDIMMs, block mode and
> >>> “load/store” mode DIMMs which are referred to as Direct Memory
> >>> Mappable.
> >>>
> >>> The block mode is where the DIMM provides IO ports for read or write
> >>> of data. These DIMMs reside on the memory bus but do not appear in
> >>> the application address space. Block mode DIMMs do not require any
> >>> changes to the current infrastructure, since they provide IO type of interface.
> >>>
> >>> Direct Memory Mappable DIMMs (DMMD) appear in the system address
> >>> space and are accessed via load and store instructions. These
> >>> NVDIMMs are part of the system physical address space (SPA) as
> >>> memory with the attribute that data survives a power interruption.
> >>> As such this memory is managed by the kernel which can assign
> >>> virtual addresses and mapped into application’s address space as
> >>> well as being accessible by the kernel. The area mapped into the
> >>> system address space is being referred to as persistent memory (PMEM).
> >>>
> >>> PMEM introduces the need for new operations in the
> >>> block_device_operations to support the specific characteristics of
> >>> the media.
> >>>
> >>> First data may not propagate all the way through the memory pipeline
> >>> when store instructions are executed. Data may stay in the CPU
> >>> cache or in other buffers in the processor and memory complex. In
> >>> order to ensure the durability of data there needs to be a driver
> >>> entry point to force a byte range out to media. The methods of
> >>> doing this are specific to the PMEM technology and need to be
> >>> handled by the driver that is supporting the DMMDs. To provide a
> >>> way to ensure that data is durable adding a commit function to the
> block_device_operations vector.
> >>>
> >>> void (*commitpmem)(struct block_device *bdev, void *addr);
> >>
> >> Why to glue to the block concept for apparently not block class of
> >> devices? By pushing NVDIMMs into the block model you both limiting
> >> them to block devices capabilities as well as have to expand block
> >> devices by alien to them properties
> > Hi Vlad,
> >
> > We chose to extent the block operations for a couple of reasons. The
> > majority of NVDIMM usage is by emulating block mode. We figure that
> > over time usages will appear that use them directly and then we can
> > design interfaces to enable direct use.
> >
> > Since a range of NVDIMM needs a name, security and other attributes
> > mmap is a really good model to build on. This quickly takes us into
> > the realm of a file systems, which are easiest to build on the
> > existing block infrastructure.
> >
> > Another reason to extend block is that all of the existing
> > administrative interfaces and tools such as mkfs still work and we
> > have not added some new management tools and requirements that may
> > inhibit the adoption of the technology. Basically if it works today
> > for block the same cli commands will work for NVDIMMs.
> >
> > The extensions are so minimal that they don't negatively impact the
> > existing interfaces.
>
> Well, they will negatively impact them, because those NVDIMM additions are
> conceptually alien for the block devices concept.
>
> You didn't answer, why not create a new class of devices for NVDIMM devices, and
> implement one-fit-all block driver for them? Simple, clean and elegant solution, which
> will fit your need to have block device from NVDIMM device pretty well with minimal
> effort.
>
> Vlad
>

2013-09-26 17:56:33

by Matthew Wilcox

[permalink] [raw]
Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs

On Thu, Sep 26, 2013 at 02:56:17PM +0000, Zuckerman, Boris wrote:
> To work with persistent memory as efficiently as we can work with RAM we need a bit more than "commit". It's reasonable to expect that we get some additional support from CPUs that goes beyond mfence and mflush. That may include discovery, transactional support, etc. Encapsulating that in a special class sooner than later seams a right thing to do...

If it's something CPU-specific, then we wouldn't handle it as part of
the "class", we'd handle it as an architecture abstraction. It's only
operations which are device-specific which would need to be exposed
through an operations vector. For example, suppose you buy one device
from IBM and another device from HP, and plug them both into your SPARC
system. The code you compile needs to run on SPARC, doing whatever
CPU operations are supported, but if HP and IBM have different ways of
handling a "commit" operation, we need that operation to be part of an
operations vector.
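
In other words, something along these lines; arch_pmem_flush_range() is an invented
name standing in for the per-architecture primitive, and commitpmem is the
device-specific op proposed in this RFC:

#include <linux/blkdev.h>

/*
 * Arch abstraction: one call everywhere, implemented per architecture
 * (built on clflush/sfence on x86, on whatever other CPUs provide).
 */
void arch_pmem_flush_range(void *addr, size_t len);

/* Device-specific step: reached through that device's operations vector. */
static void pmem_make_durable(struct block_device *bdev, void *addr,
			      size_t len)
{
	arch_pmem_flush_range(addr, len);	/* CPU caches and buffers */

	if (bdev->bd_disk->fops->commitpmem)	/* the HP way or the IBM way */
		bdev->bd_disk->fops->commitpmem(bdev, addr);
}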

2013-09-26 19:37:41

by Zuckerman, Boris

[permalink] [raw]
Subject: RE: RFC Block Layer Extensions to Support NV-DIMMs

I assume that we may have all of these: CPUs with the ability to support multiple transactions, CPUs that support only one, CPUs that support none (as today), as well as different devices, transaction-capable and not.
So it seems there is room both for compilers to do their work and for class drivers to do theirs, right?
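
For example (every name here is invented), the class driver could advertise what the
CPU-plus-device combination can actually do, and callers could pick their code paths
from that:

struct nvdimm_device;

#define PMEM_CAP_COMMIT		(1 << 0)	/* explicit commit op present */
#define PMEM_CAP_DEV_TX		(1 << 1)	/* device-side transactions */
#define PMEM_CAP_CPU_TX		(1 << 2)	/* CPU transactional support */

unsigned long pmem_capabilities(const struct nvdimm_device *nvd);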

boris

> -----Original Message-----
> From: Matthew Wilcox [mailto:[email protected]]
> Sent: Thursday, September 26, 2013 1:56 PM
> To: Zuckerman, Boris
> Cc: Vladislav Bolkhovitin; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs
>
> On Thu, Sep 26, 2013 at 02:56:17PM +0000, Zuckerman, Boris wrote:
> > To work with persistent memory as efficiently as we can work with RAM we need a
> bit more than "commit". It's reasonable to expect that we get some additional
> support from CPUs that goes beyond mfence and mflush. That may include discovery,
> transactional support, etc. Encapsulating that in a special class sooner than later
> seams a right thing to do...
>
> If it's something CPU-specific, then we wouldn't handle it as part of the "class", we'd
> handle it as an architecture abstraction. It's only operations which are device-specific
> which would need to be exposed through an operations vector. For example, suppose
> you buy one device from IBM and another device from HP, and plug them both into
> your SPARC system. The code you compile needs to run on SPARC, doing whatever
> CPU operations are supported, but if HP and IBM have different ways of handling a
> "commit" operation, we need that operation to be part of an operations vector.

2013-09-28 07:44:50

by Vladislav Bolkhovitin

[permalink] [raw]
Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs


Zuckerman, Boris, on 09/26/2013 12:36 PM wrote:
> I assume that we may have both: CPUs that may have ability to support multiple transactions, CPUs that support only one, CPUs that support none (as today), as well as different devices - transaction capable and not.
> So, it seems there is a room for compilers to do their work and for class drivers to do their, right?

Yes, correct.

Conceptually, NVDIMMs are not block devices. They may be used as block devices, but they
may just as well not be. So nailing them into the block abstraction with a big hammer is
simply bad design.

Vlad

> boris
>
>> -----Original Message-----
>> From: Matthew Wilcox [mailto:[email protected]]
>> Sent: Thursday, September 26, 2013 1:56 PM
>> To: Zuckerman, Boris
>> Cc: Vladislav Bolkhovitin; [email protected]; [email protected];
>> [email protected]; [email protected]
>> Subject: Re: RFC Block Layer Extensions to Support NV-DIMMs
>>
>> On Thu, Sep 26, 2013 at 02:56:17PM +0000, Zuckerman, Boris wrote:
>>> To work with persistent memory as efficiently as we can work with RAM we need a
>> bit more than "commit". It's reasonable to expect that we get some additional
>> support from CPUs that goes beyond mfence and mflush. That may include discovery,
>> transactional support, etc. Encapsulating that in a special class sooner than later
>> seams a right thing to do...
>>
>> If it's something CPU-specific, then we wouldn't handle it as part of the "class", we'd
>> handle it as an architecture abstraction. It's only operations which are device-specific
>> which would need to be exposed through an operations vector. For example, suppose
>> you buy one device from IBM and another device from HP, and plug them both into
>> your SPARC system. The code you compile needs to run on SPARC, doing whatever
>> CPU operations are supported, but if HP and IBM have different ways of handling a
>> "commit" operation, we need that operation to be part of an operations vector.