2008-06-07 04:58:58

by Martin K. Petersen

Subject: [PATCH 0 of 7] Block/SCSI Data Integrity Support


Another post of my block I/O data integrity patches. This kit goes on
top of the scsi_data_buffer and sd.h cleanups I posted earlier today.

There have been no changes to the block layer code since my last
submission.

Within SCSI, the changes are cleanups based on comments from Christoph
as well as working support for DIF Type 3 and 4KB sectors.


What's This All About?
----------------------

These patches allow data integrity information (checksum and more) to
be attached to I/Os at the block/filesystem layers and transferred
through the entire I/O stack all the way to the physical storage
device.

The integrity metadata can be generated in close proximity to the
original data. Capable host adapters, RAID arrays and physical disks
can verify the data integrity and abort I/Os in case of a mismatch.

Right now this is SCSI disk only, but similar efforts are in progress
for SATA and SCSI tape. Apart from a few minor differences due to
protocol limitations, the proposed SATA format is identical to the
SCSI format for easy interoperability.


T10 DIF
-------

SCSI drives can usually be reformatted to 520-byte sectors, yielding 8
extra bytes per sector. These 8 bytes have traditionally been used by
RAID controllers to store internal protection information.

DIF (Data Integrity Field) is an extension to the SCSI Block Commands
that standardizes the format of the 8 extra bytes and defines ways to
interact with the contents at the protocol level.

Each 8-byte DIF tuple is split into three chunks:

- a 16-bit guard tag containing a CRC of the data portion of
  the sector.

- a 16-bit application tag which is up for grabs.

- a 32-bit reference tag which contains an incrementing
  counter for each sector. For DIF Type 1 it also needs to
  match the physical LBA on the drive.
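
As a rough illustration of the three chunks listed above, the tuple can
be modelled in plain C. The struct, field and function names below are
invented for this sketch and are not taken from the patches. The fields
travel most significant byte first, so the sketch packs and unpacks the
bytes explicitly instead of relying on host endianness.

```c
#include <stdint.h>

/* Hypothetical in-memory view of one 8-byte DIF tuple. */
struct dif_tuple_example {
	uint16_t guard_tag;	/* CRC of the sector's data portion */
	uint16_t app_tag;	/* application tag, free for the owner to use */
	uint32_t ref_tag;	/* incrementing counter, one per sector */
};

/* Serialize a tuple into 8 bytes, most significant byte first. */
static void dif_tuple_pack(const struct dif_tuple_example *t, uint8_t out[8])
{
	out[0] = (uint8_t)(t->guard_tag >> 8);
	out[1] = (uint8_t)(t->guard_tag & 0xff);
	out[2] = (uint8_t)(t->app_tag >> 8);
	out[3] = (uint8_t)(t->app_tag & 0xff);
	out[4] = (uint8_t)(t->ref_tag >> 24);
	out[5] = (uint8_t)(t->ref_tag >> 16);
	out[6] = (uint8_t)(t->ref_tag >> 8);
	out[7] = (uint8_t)(t->ref_tag & 0xff);
}

/* Rebuild a tuple from its 8-byte on-the-wire form. */
static void dif_tuple_unpack(const uint8_t in[8], struct dif_tuple_example *t)
{
	t->guard_tag = (uint16_t)((in[0] << 8) | in[1]);
	t->app_tag = (uint16_t)((in[2] << 8) | in[3]);
	t->ref_tag = ((uint32_t)in[4] << 24) | ((uint32_t)in[5] << 16) |
		     ((uint32_t)in[6] << 8) | (uint32_t)in[7];
}
```

Packing and then unpacking a tuple should give back the original three
values, which is a cheap sanity check for code that handles raw tuples.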

There are three types of DIF defined: Type 1, Type 2, and Type 3.
These patches support Type 1 and Type 3. Type 2 depends on 32-byte
CDBs and is a work in progress.

Since the DIF tuple format is standardized, both initiators and
targets (as well as potentially transport switches in-between) are
able to verify the integrity of the data going over the bus.

When writing, the HBA will DMA 512-byte sectors from host memory,
generate the matching integrity metadata and send out 520-byte sectors
on the wire. The disk will verify the integrity of the data before
committing it to stable storage.

When reading, the drive will send 520-byte sectors to the HBA. The
HBA will verify the data integrity and DMA 512-byte sectors to host
memory.

In other words, DIF provides a means of adding integrity protection
between the HBA and the disk.


Data Integrity Extensions
-------------------------

In order to provide true end-to-end data integrity we need to be able
to get access to the integrity metadata from the OS. Dealing with
520-byte sectors is quite inconvenient so we have worked with HBA
manufacturers to separate the data buffer scatter-gather from the
integrity metadata scatter-gather.

Also, the CRC16 is somewhat expensive to calculate in software. So we
have also allowed alternate checksums to be used. Currently we
support the IP checksum which is fast and cheap to calculate.
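
To show why the IP checksum is cheap, here is a minimal sketch of a
16-bit one's-complement sum in the style of RFC 1071. The function name
is made up for this example, and the code is not the kernel's
implementation. It only needs additions and a final fold, with no
table lookups.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustration only: ip_checksum_example() is a hypothetical name. */
static uint16_t ip_checksum_example(const uint8_t *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	/* Add the buffer as big-endian 16-bit words. */
	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];

	/* A trailing odd byte is treated as if padded with a zero byte. */
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

For the eight bytes 00 01 f2 03 f4 f5 f6 f7 used as the worked example
in RFC 1071, the folded sum is 0xddf2 and the function returns its
complement, 0x220d.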

When writing, the HBA will DMA two scatterlists from host memory: One
containing the data as usual, and one containing the integrity
metadata. The HBA will verify that the two are in agreement and
interleave them before sending them out on the wire as 520-byte
sectors.

When reading, the disk will return 520-byte sectors, the HBA will
verify the integrity, split the integrity metadata from the data, and
DMA to the two separate scatterlists in host memory.
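
The interleave and split steps described above can be sketched in
userspace C. This is only a model of what the HBA does in hardware: the
names and flat buffers are assumptions made for the example, the
verification step is left out, and real adapters work on scatterlists
rather than contiguous memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DATA_SZ  512			/* data bytes per sector */
#define TUPLE_SZ 8			/* integrity bytes per sector */
#define WIRE_SZ  (DATA_SZ + TUPLE_SZ)	/* 520-byte sector on the wire */

/* Write path: merge separate data and integrity buffers. */
static void dix_interleave(const uint8_t *data, const uint8_t *prot,
			   uint8_t *wire, size_t nr_sectors)
{
	size_t i;

	for (i = 0; i < nr_sectors; i++) {
		memcpy(wire + i * WIRE_SZ, data + i * DATA_SZ, DATA_SZ);
		memcpy(wire + i * WIRE_SZ + DATA_SZ,
		       prot + i * TUPLE_SZ, TUPLE_SZ);
	}
}

/* Read path: split 520-byte sectors back into the two buffers. */
static void dix_split(const uint8_t *wire, uint8_t *data, uint8_t *prot,
		      size_t nr_sectors)
{
	size_t i;

	for (i = 0; i < nr_sectors; i++) {
		memcpy(data + i * DATA_SZ, wire + i * WIRE_SZ, DATA_SZ);
		memcpy(prot + i * TUPLE_SZ,
		       wire + i * WIRE_SZ + DATA_SZ, TUPLE_SZ);
	}
}
```

Keeping the two buffers separate in host memory is what lets the OS
keep using ordinary 512-byte-aligned data pages.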


SCSI Layer Changes
------------------

At the SCSI level, there are a few changes required to support this:

- an extra scatterlist for the integrity metadata

- tweaks to sd.c to detect and handle disks formatted with DIF

- sd.c must issue the right READ/WRITE commands when a disk is
  formatted with DIF

- extra fields in scsi_host to signal the HBA driver's DIF
  capabilities


Block Layer Changes
-------------------

The main idea of DIF/DIX is to allow integrity metadata to be
generated as close to the original data as possible. So in the long
run we'd like this to happen in userland. Given mmap(), direct I/O,
etc. this obviously poses some challenges. *cough*

For now the integrity metadata is generated at the block layer when an
I/O is submitted by the filesystem. There are also functions that
allow filesystems to generate the integrity metadata earlier, and to
use the application tag to mark sectors for recovery purposes.

struct bio has been extended with a pointer to a struct bip which in
turn contains the integrity metadata. The bip is essentially a
trimmed down bio with a bio_vec and some housekeeping.

There are a few hooks inserted in fs/bio.c and block/blk-* to allow
integrity metadata to be handled correctly when splitting, cloning and
merging. Aside from that, the integrity stuff is completely opaque.

Because we don't want the block layer, filesystems, etc. to know about
DIF and tuple formats, all the functions that interact with the
integrity metadata reside in the SCSI layer and are registered via a
callback handler template. The block layer changes have been made so
that the upcoming standards for data integrity on SATA (T13 External
Path Protection) and SCSI tape will fit right in and can register
their own handlers.
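
The callback idea can be sketched in a few lines of standalone C. The
member names generate_fn, verify_fn, tuple_size, sector_size, data_buf,
prot_buf and data_size echo the ones used in patch 4, but the two
structures here are simplified stand-ins, and the one-byte XOR
"protocol" and the registration helper are invented for illustration.
The point is that the generic code only ever calls through function
pointers and never looks inside the tuples.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical exchange structure. */
struct integrity_exchange_example {
	const uint8_t *data_buf;
	uint8_t *prot_buf;
	size_t data_size;
	unsigned int sector_size;
};

/* Simplified, hypothetical handler template. */
struct integrity_profile_example {
	const char *name;
	unsigned int tuple_size;
	void (*generate_fn)(struct integrity_exchange_example *);
	int (*verify_fn)(struct integrity_exchange_example *);
};

/* Toy protocol: one XOR byte per sector. Not DIF. */
static uint8_t toy_xor(const uint8_t *p, size_t n)
{
	uint8_t x = 0;

	while (n--)
		x ^= *p++;
	return x;
}

static void toy_generate(struct integrity_exchange_example *x)
{
	size_t i, n = x->data_size / x->sector_size;

	for (i = 0; i < n; i++)
		x->prot_buf[i] = toy_xor(x->data_buf + i * x->sector_size,
					 x->sector_size);
}

static int toy_verify(struct integrity_exchange_example *x)
{
	size_t i, n = x->data_size / x->sector_size;

	for (i = 0; i < n; i++)
		if (x->prot_buf[i] != toy_xor(x->data_buf + i * x->sector_size,
					      x->sector_size))
			return -1;
	return 0;
}

static const struct integrity_profile_example toy_profile = {
	"TOY-XOR", 1, toy_generate, toy_verify,
};

/* "Generic layer": knows nothing about the tuple format. */
static const struct integrity_profile_example *registered_profile;

static void integrity_register_example(const struct integrity_profile_example *p)
{
	registered_profile = p;
}

static void generic_generate(struct integrity_exchange_example *x)
{
	if (registered_profile)
		registered_profile->generate_fn(x);
}

static int generic_verify(struct integrity_exchange_example *x)
{
	return registered_profile ? registered_profile->verify_fn(x) : 0;
}
```

A second protocol would only need to supply its own profile with a
different tuple size and pair of functions; the generic helpers stay
unchanged.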

I have included a more in-depth description of the block layer changes
in Documentation/block/data-integrity.txt.


Comments and suggestions welcome.

--
Martin K. Petersen Oracle Linux Engineering



2008-06-07 04:57:55

by Martin K. Petersen

Subject: [PATCH 3 of 7] block: Find bio sector offset given idx and offset

2 files changed, 26 insertions(+)
fs/bio.c | 24 ++++++++++++++++++++++++
include/linux/bio.h | 2 ++


Helper function to find the sector offset in a bio given bvec index
and page offset.
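
A userspace model of the calculation, for readers who want to check the
arithmetic. The segment structure and function name are hypothetical
and stand in for the bio_vec array; assert() takes the place of
BUG_ON(). Whole segments before the requested index contribute
len / sector_sz sectors each, and the requested segment contributes the
distance from its starting offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio_vec: length and start offset in page. */
struct seg_example {
	unsigned int len;
	unsigned int offset;
};

static unsigned long sector_offset_example(const struct seg_example *segs,
					   size_t nr_segs, size_t index,
					   unsigned int offset,
					   unsigned int sector_sz)
{
	unsigned long sectors = 0;
	size_t i;

	assert(index < nr_segs);

	/* Count the sectors held by all segments before the target. */
	for (i = 0; i < index; i++)
		sectors += segs[i].len / sector_sz;

	/* Add the sectors between the segment start and the offset. */
	if (offset > segs[index].offset)
		sectors += (offset - segs[index].offset) / sector_sz;

	return sectors;
}
```

With 512-byte sectors and segments of 4096, 2048 and 1024 bytes, where
the second segment starts at page offset 512, index 1 with page offset
1536 maps to sector 8 + (1536 - 512) / 512 = 10.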

Signed-off-by: Martin K. Petersen <[email protected]>

---

diff -r 550d61001baa -r 318fa71e735d fs/bio.c
--- a/fs/bio.c Sat Jun 07 00:45:14 2008 -0400
+++ b/fs/bio.c Sat Jun 07 00:45:14 2008 -0400
@@ -1232,6 +1232,30 @@
return bp;
}

+sector_t bio_sector_offset(struct bio *bio, unsigned short index, unsigned int offset)
+{
+ struct bio_vec *bv;
+ unsigned int sector_sz = bio->bi_bdev->bd_disk->queue->hardsect_size;
+ sector_t sectors;
+ int i;
+
+ sectors = 0;
+
+ BUG_ON(index >= bio->bi_vcnt);
+
+ bio_for_each_segment(bv, bio, i) {
+ if (i == index) {
+ if (offset > bv->bv_offset)
+ sectors += (offset - bv->bv_offset) / sector_sz;
+ return sectors;
+ }
+
+ sectors += bv->bv_len / sector_sz;
+ }
+
+ BUG();
+}
+EXPORT_SYMBOL(bio_sector_offset);

/*
* create memory pools for biovec's in a bio_set.
diff -r 550d61001baa -r 318fa71e735d include/linux/bio.h
--- a/include/linux/bio.h Sat Jun 07 00:45:14 2008 -0400
+++ b/include/linux/bio.h Sat Jun 07 00:45:14 2008 -0400
@@ -315,6 +315,8 @@
extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
unsigned int, unsigned int);
extern int bio_get_nr_vecs(struct block_device *);
+extern sector_t bio_sector_offset(struct bio *, unsigned short, unsigned int);
+
extern struct bio *bio_map_user(struct request_queue *, struct block_device *,
unsigned long, unsigned int, int);
struct sg_iovec;

2008-06-07 04:59:22

by Martin K. Petersen

Subject: [PATCH 2 of 7] block: Globalize bio_set and bio_vec_slab

2 files changed, 38 insertions(+), 28 deletions(-)
fs/bio.c | 36 ++++++++----------------------------
include/linux/bio.h | 30 ++++++++++++++++++++++++++++++


Move struct bio_set and biovec_slab definitions to bio.h so they can
be used outside of bio.c.

Signed-off-by: Martin K. Petersen <[email protected]>

---

diff -r fb6f53e13ff3 -r 550d61001baa fs/bio.c
--- a/fs/bio.c Sat Jun 07 00:45:14 2008 -0400
+++ b/fs/bio.c Sat Jun 07 00:45:14 2008 -0400
@@ -28,24 +28,9 @@
#include <linux/blktrace_api.h>
#include <scsi/sg.h> /* for struct sg_iovec */

-#define BIO_POOL_SIZE 2
-
static struct kmem_cache *bio_slab __read_mostly;

-#define BIOVEC_NR_POOLS 6
-
-/*
- * a small number of entries is fine, not going to be performance critical.
- * basically we just need to survive
- */
-#define BIO_SPLIT_ENTRIES 2
mempool_t *bio_split_pool __read_mostly;
-
-struct biovec_slab {
- int nr_vecs;
- char *name;
- struct kmem_cache *slab;
-};

/*
* if you change this list, also change bvec_alloc or things will
@@ -60,23 +45,18 @@
#undef BV

/*
- * bio_set is used to allow other portions of the IO system to
- * allocate their own private memory pools for bio and iovec structures.
- * These memory pools in turn all allocate from the bio_slab
- * and the bvec_slabs[].
- */
-struct bio_set {
- mempool_t *bio_pool;
- mempool_t *bvec_pools[BIOVEC_NR_POOLS];
-};
-
-/*
* fs_bio_set is the bio_set containing bio and iovec memory pools used by
* IO code that does not need private memory pools.
*/
-static struct bio_set *fs_bio_set;
+struct bio_set *fs_bio_set;

-static inline struct bio_vec *bvec_alloc_bs(gfp_t gfp_mask, int nr, unsigned long *idx, struct bio_set *bs)
+inline int bvec_nr_vecs(int idx)
+{
+ return bvec_slabs[idx].nr_vecs;
+}
+EXPORT_SYMBOL(bvec_nr_vecs);
+
+struct bio_vec *bvec_alloc_bs(gfp_t gfp_mask, int nr, unsigned long *idx, struct bio_set *bs)
{
struct bio_vec *bvl;

diff -r fb6f53e13ff3 -r 550d61001baa include/linux/bio.h
--- a/include/linux/bio.h Sat Jun 07 00:45:14 2008 -0400
+++ b/include/linux/bio.h Sat Jun 07 00:45:14 2008 -0400
@@ -333,6 +333,36 @@
int, int);
extern int bio_uncopy_user(struct bio *);
void zero_fill_bio(struct bio *bio);
+extern struct bio_vec *bvec_alloc_bs(gfp_t, int, unsigned long *, struct bio_set *);
+extern inline int bvec_nr_vecs(int idx);
+
+/*
+ * bio_set is used to allow other portions of the IO system to
+ * allocate their own private memory pools for bio and iovec structures.
+ * These memory pools in turn all allocate from the bio_slab
+ * and the bvec_slabs[].
+ */
+#define BIO_POOL_SIZE 2
+#define BIOVEC_NR_POOLS 6
+
+struct bio_set {
+ mempool_t *bio_pool;
+ mempool_t *bvec_pools[BIOVEC_NR_POOLS];
+};
+
+struct biovec_slab {
+ int nr_vecs;
+ char *name;
+ struct kmem_cache *slab;
+};
+
+extern struct bio_set *fs_bio_set;
+
+/*
+ * a small number of entries is fine, not going to be performance critical.
+ * basically we just need to survive
+ */
+#define BIO_SPLIT_ENTRIES 2

#ifdef CONFIG_HIGHMEM
/*

2008-06-07 04:59:38

by Martin K. Petersen

Subject: [PATCH 1 of 7] lib: Add support for the T10 Data Integrity Field CRC

4 files changed, 83 insertions(+)
include/linux/crc-t10dif.h | 8 +++++
lib/Kconfig | 7 ++++
lib/Makefile | 1
lib/crc-t10dif.c | 67 ++++++++++++++++++++++++++++++++++++++++++++


Signed-off-by: Martin K. Petersen <[email protected]>

---

diff -r b76a6f493f0c -r fb6f53e13ff3 include/linux/crc-t10dif.h
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/include/linux/crc-t10dif.h Sat Jun 07 00:45:14 2008 -0400
@@ -0,0 +1,8 @@
+#ifndef _LINUX_CRC_T10DIF_H
+#define _LINUX_CRC_T10DIF_H
+
+#include <linux/types.h>
+
+__u16 crc_t10dif(unsigned char const *, size_t);
+
+#endif
diff -r b76a6f493f0c -r fb6f53e13ff3 lib/Kconfig
--- a/lib/Kconfig Sat Jun 07 00:45:14 2008 -0400
+++ b/lib/Kconfig Sat Jun 07 00:45:14 2008 -0400
@@ -28,6 +28,13 @@
modules require CRC16 functions, but a module built outside
the kernel tree does. Such modules that use library CRC16
functions require M here.
+
+config CRC_T10DIF
+ tristate "CRC calculation for the T10 Data Integrity Field"
+ help
+ This option is only needed if a module that's not in the
+ kernel tree needs to calculate CRC checks for use with the
+ SCSI data integrity subsystem.

config CRC_ITU_T
tristate "CRC ITU-T V.41 functions"
diff -r b76a6f493f0c -r fb6f53e13ff3 lib/Makefile
--- a/lib/Makefile Sat Jun 07 00:45:14 2008 -0400
+++ b/lib/Makefile Sat Jun 07 00:45:14 2008 -0400
@@ -45,6 +45,7 @@
obj-$(CONFIG_BITREVERSE) += bitrev.o
obj-$(CONFIG_CRC_CCITT) += crc-ccitt.o
obj-$(CONFIG_CRC16) += crc16.o
+obj-$(CONFIG_CRC_T10DIF)+= crc-t10dif.o
obj-$(CONFIG_CRC_ITU_T) += crc-itu-t.o
obj-$(CONFIG_CRC32) += crc32.o
obj-$(CONFIG_CRC7) += crc7.o
diff -r b76a6f493f0c -r fb6f53e13ff3 lib/crc-t10dif.c
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/lib/crc-t10dif.c Sat Jun 07 00:45:14 2008 -0400
@@ -0,0 +1,67 @@
+/*
+ * T10 Data Integrity Field CRC16 calculation
+ *
+ * Copyright (c) 2007 Oracle Corporation. All rights reserved.
+ * Written by Martin K. Petersen <[email protected]>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/module.h>
+#include <linux/crc-t10dif.h>
+
+/* Table generated using the following polynomial:
+ * x^16 + x^15 + x^11 + x^9 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
+ * gt: 0x8bb7
+ */
+static const __u16 t10_dif_crc_table[256] = {
+ 0x0000, 0x8BB7, 0x9CD9, 0x176E, 0xB205, 0x39B2, 0x2EDC, 0xA56B,
+ 0xEFBD, 0x640A, 0x7364, 0xF8D3, 0x5DB8, 0xD60F, 0xC161, 0x4AD6,
+ 0x54CD, 0xDF7A, 0xC814, 0x43A3, 0xE6C8, 0x6D7F, 0x7A11, 0xF1A6,
+ 0xBB70, 0x30C7, 0x27A9, 0xAC1E, 0x0975, 0x82C2, 0x95AC, 0x1E1B,
+ 0xA99A, 0x222D, 0x3543, 0xBEF4, 0x1B9F, 0x9028, 0x8746, 0x0CF1,
+ 0x4627, 0xCD90, 0xDAFE, 0x5149, 0xF422, 0x7F95, 0x68FB, 0xE34C,
+ 0xFD57, 0x76E0, 0x618E, 0xEA39, 0x4F52, 0xC4E5, 0xD38B, 0x583C,
+ 0x12EA, 0x995D, 0x8E33, 0x0584, 0xA0EF, 0x2B58, 0x3C36, 0xB781,
+ 0xD883, 0x5334, 0x445A, 0xCFED, 0x6A86, 0xE131, 0xF65F, 0x7DE8,
+ 0x373E, 0xBC89, 0xABE7, 0x2050, 0x853B, 0x0E8C, 0x19E2, 0x9255,
+ 0x8C4E, 0x07F9, 0x1097, 0x9B20, 0x3E4B, 0xB5FC, 0xA292, 0x2925,
+ 0x63F3, 0xE844, 0xFF2A, 0x749D, 0xD1F6, 0x5A41, 0x4D2F, 0xC698,
+ 0x7119, 0xFAAE, 0xEDC0, 0x6677, 0xC31C, 0x48AB, 0x5FC5, 0xD472,
+ 0x9EA4, 0x1513, 0x027D, 0x89CA, 0x2CA1, 0xA716, 0xB078, 0x3BCF,
+ 0x25D4, 0xAE63, 0xB90D, 0x32BA, 0x97D1, 0x1C66, 0x0B08, 0x80BF,
+ 0xCA69, 0x41DE, 0x56B0, 0xDD07, 0x786C, 0xF3DB, 0xE4B5, 0x6F02,
+ 0x3AB1, 0xB106, 0xA668, 0x2DDF, 0x88B4, 0x0303, 0x146D, 0x9FDA,
+ 0xD50C, 0x5EBB, 0x49D5, 0xC262, 0x6709, 0xECBE, 0xFBD0, 0x7067,
+ 0x6E7C, 0xE5CB, 0xF2A5, 0x7912, 0xDC79, 0x57CE, 0x40A0, 0xCB17,
+ 0x81C1, 0x0A76, 0x1D18, 0x96AF, 0x33C4, 0xB873, 0xAF1D, 0x24AA,
+ 0x932B, 0x189C, 0x0FF2, 0x8445, 0x212E, 0xAA99, 0xBDF7, 0x3640,
+ 0x7C96, 0xF721, 0xE04F, 0x6BF8, 0xCE93, 0x4524, 0x524A, 0xD9FD,
+ 0xC7E6, 0x4C51, 0x5B3F, 0xD088, 0x75E3, 0xFE54, 0xE93A, 0x628D,
+ 0x285B, 0xA3EC, 0xB482, 0x3F35, 0x9A5E, 0x11E9, 0x0687, 0x8D30,
+ 0xE232, 0x6985, 0x7EEB, 0xF55C, 0x5037, 0xDB80, 0xCCEE, 0x4759,
+ 0x0D8F, 0x8638, 0x9156, 0x1AE1, 0xBF8A, 0x343D, 0x2353, 0xA8E4,
+ 0xB6FF, 0x3D48, 0x2A26, 0xA191, 0x04FA, 0x8F4D, 0x9823, 0x1394,
+ 0x5942, 0xD2F5, 0xC59B, 0x4E2C, 0xEB47, 0x60F0, 0x779E, 0xFC29,
+ 0x4BA8, 0xC01F, 0xD771, 0x5CC6, 0xF9AD, 0x721A, 0x6574, 0xEEC3,
+ 0xA415, 0x2FA2, 0x38CC, 0xB37B, 0x1610, 0x9DA7, 0x8AC9, 0x017E,
+ 0x1F65, 0x94D2, 0x83BC, 0x080B, 0xAD60, 0x26D7, 0x31B9, 0xBA0E,
+ 0xF0D8, 0x7B6F, 0x6C01, 0xE7B6, 0x42DD, 0xC96A, 0xDE04, 0x55B3
+};
+
+__u16 crc_t10dif(const unsigned char *buffer, size_t len)
+{
+ __u16 crc = 0;
+ unsigned int i;
+
+ for (i = 0 ; i < len ; i++)
+ crc = (crc << 8) ^ t10_dif_crc_table[((crc >> 8) ^ buffer[i]) & 0xff];
+
+ return crc;
+}
+EXPORT_SYMBOL(crc_t10dif);
+
+MODULE_DESCRIPTION("T10 DIF CRC calculation");
+MODULE_LICENSE("GPL");
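
For cross-checking the lookup table, a bit-at-a-time version of the
same CRC is handy. This is a sketch written for this note, not part of
the patch; the function name is made up. It uses the 0x8bb7 polynomial
from the comment above, a zero initial value, no bit reflection and no
final XOR, which matches what crc_t10dif() does with its table.

```c
#include <stddef.h>
#include <stdint.h>

#define T10_DIF_POLY_EXAMPLE 0x8bb7

/* Hypothetical bitwise reference; slow, but needs no table. */
static uint16_t crc_t10dif_bitwise(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		/* Bring the next byte into the top of the register. */
		crc ^= (uint16_t)(buf[i] << 8);

		/* Shift out eight bits, reducing by the polynomial. */
		for (bit = 0; bit < 8; bit++) {
			if (crc & 0x8000)
				crc = (uint16_t)((crc << 1) ^ T10_DIF_POLY_EXAMPLE);
			else
				crc = (uint16_t)(crc << 1);
		}
	}

	return crc;
}
```

Feeding it the single bytes 0x01, 0x02 and 0x03 gives 0x8bb7, 0x9cd9
and 0x176e, the second, third and fourth table entries. The nine ASCII
bytes "123456789" give 0xd0db, the commonly published check value for
this parameter set.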

2008-06-07 04:59:53

by Martin K. Petersen

Subject: [PATCH 6 of 7] scsi: Support devices with protection information (DIF)

5 files changed, 90 insertions(+)
drivers/scsi/Kconfig | 15 +++++++++++++++
drivers/scsi/scsi_lib.c | 42 ++++++++++++++++++++++++++++++++++++++++++
drivers/scsi/scsi_scan.c | 3 +++
include/scsi/scsi_cmnd.h | 29 +++++++++++++++++++++++++++++
include/scsi/scsi_device.h | 1 +


- Add support for an extra scatter-gather list containing protection
information.

- Remember devices with protection information turned on in INQUIRY.
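
The INQUIRY check in scsi_scan.c boils down to testing bit 0 of byte 5
of the standard INQUIRY data. A standalone sketch of that test follows;
the helper and macro names are invented for this example, and the
length check is an addition that the patch itself does not make.

```c
#include <stddef.h>
#include <stdint.h>

#define INQ_PROTECT_BYTE_EXAMPLE 5	/* byte offset in INQUIRY data */
#define INQ_PROTECT_MASK_EXAMPLE 0x01	/* bit 0 of that byte */

/* Returns 1 if the device reports protection information support. */
static int inquiry_has_protection(const uint8_t *inq, size_t len)
{
	/* Too short to contain the byte: treat as unsupported. */
	if (len <= INQ_PROTECT_BYTE_EXAMPLE)
		return 0;

	return (inq[INQ_PROTECT_BYTE_EXAMPLE] & INQ_PROTECT_MASK_EXAMPLE) != 0;
}
```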

Signed-off-by: Martin K. Petersen <[email protected]>

---

diff -r 5be7c534c954 -r ad65bfde4e05 drivers/scsi/Kconfig
--- a/drivers/scsi/Kconfig Sat Jun 07 00:45:15 2008 -0400
+++ b/drivers/scsi/Kconfig Sat Jun 07 00:45:15 2008 -0400
@@ -260,6 +260,21 @@
default m
depends on SCSI
depends on MODULES
+
+config SCSI_PROTECTION
+ bool "SCSI Data Integrity Protection"
+ depends on SCSI
+ depends on BLK_DEV_INTEGRITY
+ help
+ Some SCSI devices support data protection features above and
+ beyond those implemented in the transport. Select this
+ option to enable protection information to be transferred to
+ and from a device. Specifically, this option will enable DIF
+ (Data Integrity Field) for SCSI disks.
+
+ The SCSI protection features depend on the block layer data
+ integrity infrastructure so the latter must be enabled for
+ this option to work.

menu "SCSI Transports"
depends on SCSI
diff -r 5be7c534c954 -r ad65bfde4e05 drivers/scsi/scsi_lib.c
--- a/drivers/scsi/scsi_lib.c Sat Jun 07 00:45:15 2008 -0400
+++ b/drivers/scsi/scsi_lib.c Sat Jun 07 00:45:15 2008 -0400
@@ -778,6 +778,13 @@
kmem_cache_free(scsi_sdb_cache, bidi_sdb);
cmd->request->next_rq->special = NULL;
}
+
+#if defined(CONFIG_SCSI_PROTECTION)
+ if (scsi_prot_sg_count(cmd)) {
+ scsi_free_sgtable(cmd->prot_sdb);
+ kmem_cache_free(scsi_sdb_cache, cmd->prot_sdb);
+ }
+#endif
}
EXPORT_SYMBOL(scsi_release_buffers);

@@ -1031,6 +1038,32 @@
return BLKPREP_OK;
}

+#if defined(CONFIG_SCSI_PROTECTION)
+static int scsi_protect_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
+{
+ struct request *req = cmd->request;
+ struct scsi_data_buffer *pdb;
+ int ivecs, count;
+
+ pdb = kmem_cache_zalloc(scsi_sdb_cache, gfp_mask);
+ if (unlikely(pdb == NULL))
+ return BLKPREP_DEFER;
+
+ ivecs = blk_rq_count_integrity_sg(req);
+
+ if (unlikely(scsi_alloc_sgtable(pdb, ivecs, gfp_mask)))
+ return BLKPREP_DEFER;
+
+ count = blk_rq_map_integrity_sg(req, pdb->table.sgl);
+ BUG_ON(unlikely(count > ivecs));
+
+ cmd->prot_sdb = pdb;
+ cmd->prot_sdb->table.nents = count;
+
+ return BLKPREP_OK;
+}
+#endif
+
/*
* Function: scsi_init_io()
*
@@ -1062,6 +1095,15 @@
if (error)
goto err_exit;
}
+
+#if defined(CONFIG_SCSI_PROTECTION)
+ if (blk_integrity_rq(cmd->request)) {
+ error = scsi_protect_io(cmd, gfp_mask);
+
+ if (error != BLKPREP_OK)
+ goto err_exit;
+ }
+#endif

return BLKPREP_OK ;

diff -r 5be7c534c954 -r ad65bfde4e05 drivers/scsi/scsi_scan.c
--- a/drivers/scsi/scsi_scan.c Sat Jun 07 00:45:15 2008 -0400
+++ b/drivers/scsi/scsi_scan.c Sat Jun 07 00:45:15 2008 -0400
@@ -882,6 +882,9 @@

if (*bflags & BLIST_USE_10_BYTE_MS)
sdev->use_10_for_ms = 1;
+
+ if (inq_result[5] & 0x1)
+ sdev->protection = 1;

/* set the device running here so that slave configure
* may do I/O */
diff -r 5be7c534c954 -r ad65bfde4e05 include/scsi/scsi_cmnd.h
--- a/include/scsi/scsi_cmnd.h Sat Jun 07 00:45:15 2008 -0400
+++ b/include/scsi/scsi_cmnd.h Sat Jun 07 00:45:15 2008 -0400
@@ -88,6 +88,9 @@

/* These elements define the operation we ultimately want to perform */
struct scsi_data_buffer sdb;
+#if defined(CONFIG_SCSI_PROTECTION)
+ struct scsi_data_buffer *prot_sdb;
+#endif
unsigned underflow; /* Return error if less than
this amount is transferred */

@@ -209,4 +212,30 @@
buf, buflen);
}

+#if defined(CONFIG_SCSI_PROTECTION)
+
+static inline unsigned scsi_prot_sg_count(struct scsi_cmnd *cmd)
+{
+ return cmd->prot_sdb ? cmd->prot_sdb->table.nents : 0;
+}
+
+static inline struct scatterlist *scsi_prot_sglist(struct scsi_cmnd *cmd)
+{
+ return cmd->prot_sdb ? cmd->prot_sdb->table.sgl : NULL;
+}
+
+static inline struct scsi_data_buffer *scsi_prot(struct scsi_cmnd *cmd)
+{
+ return cmd->prot_sdb;
+}
+
+#define scsi_for_each_prot_sg(cmd, sg, nseg, __i) \
+ for_each_sg(scsi_prot_sglist(cmd), sg, nseg, __i)
+
+#else /* CONFIG_SCSI_PROTECTION */
+
+#define scsi_prot_sg_count(a) (0)
+
+#endif /* CONFIG_SCSI_PROTECTION */
+
#endif /* _SCSI_SCSI_CMND_H */
diff -r 5be7c534c954 -r ad65bfde4e05 include/scsi/scsi_device.h
--- a/include/scsi/scsi_device.h Sat Jun 07 00:45:15 2008 -0400
+++ b/include/scsi/scsi_device.h Sat Jun 07 00:45:15 2008 -0400
@@ -140,6 +140,7 @@
unsigned guess_capacity:1; /* READ_CAPACITY might be too high by 1 */
unsigned retry_hwerror:1; /* Retry HARDWARE_ERROR */
unsigned last_sector_bug:1; /* Always read last sector in a 1 sector read */
+ unsigned protection:1; /* Data Integrity Field */

DECLARE_BITMAP(supported_events, SDEV_EVT_MAXBITS); /* supported events */
struct list_head event_list; /* asserted events */

2008-06-07 05:00:24

by Martin K. Petersen

Subject: [PATCH 4 of 7] block: bio data integrity support

4 files changed, 825 insertions(+), 3 deletions(-)
fs/Makefile | 1
fs/bio-integrity.c | 715 +++++++++++++++++++++++++++++++++++++++++++++++++++
fs/bio.c | 27 +
include/linux/bio.h | 85 ++++++


Allows integrity metadata to be attached to a bio.
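
The buffer sizing done in bio_integrity_prep() can be followed with a
small standalone model. The helper names and the fixed 4096-byte page
size are assumptions made for this sketch. The first helper mirrors the
"sectors * tuple size" calculation, including the shift by 3 for
4096-byte sectors; the second mirrors the start/end page arithmetic
used to decide how many integrity vectors to allocate.

```c
#include <stddef.h>

#define PAGE_SIZE_EXAMPLE 4096UL	/* assumed page size */

/* Integrity bytes needed for a bio of nr_512 512-byte sectors. */
static size_t integrity_buf_len(unsigned int nr_512, unsigned int sector_size,
				unsigned int tuple_size)
{
	unsigned int sectors = nr_512;

	/* One tuple per hardware sector: eight 512-byte units per 4KB. */
	if (sector_size == 4096)
		sectors >>= 3;

	return (size_t)sectors * tuple_size;
}

/* Number of pages touched by len bytes starting at address addr. */
static unsigned long pages_spanned(unsigned long addr, size_t len)
{
	unsigned long end = (addr + len + PAGE_SIZE_EXAMPLE - 1) / PAGE_SIZE_EXAMPLE;
	unsigned long start = addr / PAGE_SIZE_EXAMPLE;

	return end - start;
}
```

A 128KB bio is 256 units of 512 bytes. With 8-byte tuples that needs
2048 bytes of integrity metadata on a 512-byte-sector disk and 256
bytes on a 4KB-sector disk. A 2048-byte buffer that starts 256 bytes
before a page boundary spans two pages, so it needs two vectors.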

Signed-off-by: Martin K. Petersen <[email protected]>

---

diff -r 318fa71e735d -r f2ae9d5bce4c fs/Makefile
--- a/fs/Makefile Sat Jun 07 00:45:14 2008 -0400
+++ b/fs/Makefile Sat Jun 07 00:45:15 2008 -0400
@@ -19,6 +19,7 @@
obj-y += no-block.o
endif

+obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o
obj-$(CONFIG_INOTIFY) += inotify.o
obj-$(CONFIG_INOTIFY_USER) += inotify_user.o
obj-$(CONFIG_EPOLL) += eventpoll.o
diff -r 318fa71e735d -r f2ae9d5bce4c fs/bio-integrity.c
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/fs/bio-integrity.c Sat Jun 07 00:45:15 2008 -0400
@@ -0,0 +1,715 @@
+/*
+ * bio-integrity.c - bio data integrity extensions
+ *
+ * Copyright (C) 2007, 2008 Oracle Corporation
+ * Written by: Martin K. Petersen <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING. If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139,
+ * USA.
+ *
+ */
+
+#include <linux/blkdev.h>
+#include <linux/mempool.h>
+#include <linux/bio.h>
+#include <linux/workqueue.h>
+
+static struct kmem_cache *bio_integrity_slab __read_mostly;
+static struct workqueue_struct *kintegrityd_wq;
+
+/**
+ * bio_integrity_alloc_bioset - Allocate integrity payload and attach it to bio
+ * @bio: bio to attach integrity metadata to
+ * @gfp_mask: Memory allocation mask
+ * @nr_vecs: Number of integrity metadata scatter-gather elements
+ * @bs: bio_set to allocate from
+ *
+ * Description: This function prepares a bio for attaching integrity
+ * metadata. nr_vecs specifies the maximum number of pages containing
+ * integrity metadata that can be attached.
+ */
+struct bip *bio_integrity_alloc_bioset(struct bio *bio, gfp_t gfp_mask, unsigned int nr_vecs, struct bio_set *bs)
+{
+ struct bip *bip;
+ struct bio_vec *bv;
+ unsigned long idx;
+
+ BUG_ON(bio == NULL);
+
+ bip = mempool_alloc(bs->bio_integrity_pool, gfp_mask);
+ if (unlikely(bip == NULL)) {
+ printk(KERN_ERR "%s: could not alloc bip\n", __func__);
+ return NULL;
+ }
+
+ memset(bip, 0, sizeof(*bip));
+ idx = 0;
+
+ bv = bvec_alloc_bs(gfp_mask, nr_vecs, &idx, bs);
+ if (unlikely(bv == NULL)) {
+ printk(KERN_ERR "%s: could not alloc bip_vec\n", __func__);
+ mempool_free(bip, bs->bio_integrity_pool);
+ return NULL;
+ }
+
+ bip->bip_pool = idx;
+ bip->bip_vec = bv;
+ bip->bip_bio = bio;
+ bio->bi_integrity = bip;
+
+ return bip;
+}
+EXPORT_SYMBOL(bio_integrity_alloc_bioset);
+
+/**
+ * bio_integrity_alloc - Allocate integrity payload and attach it to bio
+ * @bio: bio to attach integrity metadata to
+ * @gfp_mask: Memory allocation mask
+ * @nr_vecs: Number of integrity metadata scatter-gather elements
+ *
+ * Description: This function prepares a bio for attaching integrity
+ * metadata. nr_vecs specifies the maximum number of pages containing
+ * integrity metadata that can be attached.
+ */
+struct bip *bio_integrity_alloc(struct bio *bio, gfp_t gfp_mask,
+ unsigned int nr_vecs)
+{
+ return bio_integrity_alloc_bioset(bio, gfp_mask, nr_vecs, fs_bio_set);
+}
+EXPORT_SYMBOL(bio_integrity_alloc);
+
+/**
+ * bio_integrity_free - Free bio integrity payload
+ * @bio: bio containing bip to be freed
+ * @bs: bio_set this bio was allocated from
+ *
+ * Description: Used to free the integrity portion of a bio. Usually
+ * called from bio_free().
+ */
+void bio_integrity_free(struct bio *bio, struct bio_set *bs)
+{
+ struct bip *bip = bio->bi_integrity;
+
+ BUG_ON(bip == NULL);
+
+ /* A cloned bio doesn't own the integrity metadata */
+ if (!bio_flagged(bio, BIO_CLONED) && bip->bip_buf != NULL)
+ kfree(bip->bip_buf);
+
+ mempool_free(bip->bip_vec, bs->bvec_pools[bip->bip_pool]);
+ mempool_free(bip, bs->bio_integrity_pool);
+
+ bio->bi_integrity = NULL;
+}
+EXPORT_SYMBOL(bio_integrity_free);
+
+/**
+ * bio_integrity_add_page - Attach integrity metadata
+ * @bio: bio to update
+ * @page: page containing integrity metadata
+ * @len: number of bytes of integrity metadata in page
+ * @offset: start offset within page
+ *
+ * Description: Attach a page containing integrity metadata to bio.
+ */
+int bio_integrity_add_page(struct bio *bio, struct page *page,
+ unsigned int len, unsigned int offset)
+{
+ struct bip *bip;
+ struct bio_vec *iv;
+
+ bip = bio->bi_integrity;
+
+ if (bip->bip_vcnt >= bvec_nr_vecs(bip->bip_pool)) {
+ printk(KERN_ERR "%s: bip_vec full\n", __func__);
+ return 0;
+ }
+
+ iv = bip_vec_idx(bip, bip->bip_vcnt);
+ BUG_ON(iv == NULL);
+ BUG_ON(iv->bv_page != NULL);
+
+ iv->bv_page = page;
+ iv->bv_len = len;
+ iv->bv_offset = offset;
+ bip->bip_vcnt++;
+
+ return len;
+}
+EXPORT_SYMBOL(bio_integrity_add_page);
+
+/**
+ * bio_integrity_enabled - Check whether integrity can be passed
+ * @bio: bio to check
+ *
+ * Description: Determines whether bio_integrity_prep() can be called
+ * on this bio or not. bio data direction and target device must be
+ * set prior to calling. The function honors the write_generate and
+ * read_verify flags in sysfs.
+ */
+inline int bio_integrity_enabled(struct bio *bio)
+{
+ /* Already protected? */
+ if (bio_integrity(bio))
+ return 0;
+
+ return bdev_integrity_enabled(bio->bi_bdev, bio_data_dir(bio));
+}
+EXPORT_SYMBOL(bio_integrity_enabled);
+
+/**
+ * bio_integrity_tag_size - Retrieve integrity tag space
+ * @bio: bio to inspect
+ *
+ * Description: Returns the maximum number of tag bytes that can be
+ * attached to this bio. Filesystems can use this to determine how
+ * much metadata to attach to an I/O.
+ */
+unsigned int bio_integrity_tag_size(struct bio *bio)
+{
+ struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
+
+ BUG_ON(bio->bi_size == 0);
+
+ return bi->tag_size * (bio->bi_size / bi->sector_size);
+}
+EXPORT_SYMBOL(bio_integrity_tag_size);
+
+/**
+ * bio_integrity_set_tag - Attach a tag buffer to a bio
+ * @bio: bio to attach buffer to
+ * @tag_buf: Pointer to a buffer containing tag data
+ * @len: Length of the included buffer
+ *
+ * Description: Use this function to tag a bio by leveraging the extra
+ * space provided by devices formatted with integrity protection. The
+ * size of the integrity buffer must be <= the size reported by
+ * bio_integrity_tag_size().
+ */
+int bio_integrity_set_tag(struct bio *bio, void *tag_buf, unsigned int len)
+{
+ struct bip *bip = bio->bi_integrity;
+ struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
+ unsigned int nr_sectors;
+
+ BUG_ON(bip->bip_buf == NULL);
+ BUG_ON(bio_data_dir(bio) != WRITE);
+
+ if (bi->tag_size == 0)
+ return -1;
+
+ nr_sectors = len / bi->tag_size;
+
+ if (len % bi->tag_size)
+ nr_sectors++;
+
+ if (bi->sector_size == 4096)
+ nr_sectors >>= 3;
+
+ if (nr_sectors * bi->tuple_size > bip->bip_size) {
+ printk(KERN_ERR "%s: tag too big for bio: %u > %u\n",
+ __func__, nr_sectors * bi->tuple_size, bip->bip_size);
+ return -1;
+ }
+
+ bi->set_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
+
+ return 0;
+}
+EXPORT_SYMBOL(bio_integrity_set_tag);
+
+/**
+ * bio_integrity_get_tag - Retrieve a tag buffer from a bio
+ * @bio: bio to retrieve buffer from
+ * @tag_buf: Pointer to a buffer for the tag data
+ * @len: Length of the target buffer
+ *
+ * Description: Use this function to retrieve the tag buffer from a
+ * completed I/O. The size of the integrity buffer must be <= the
+ * size reported by bio_integrity_tag_size().
+ */
+int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
+{
+ struct bip *bip = bio->bi_integrity;
+ struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
+ unsigned int nr_sectors;
+
+ BUG_ON(bip->bip_buf == NULL);
+ BUG_ON(bio_data_dir(bio) != READ);
+
+ if (bi->tag_size == 0)
+ return -1;
+
+ nr_sectors = len / bi->tag_size;
+
+ if (len % bi->tag_size)
+ nr_sectors++;
+
+ if (bi->sector_size == 4096)
+ nr_sectors >>= 3;
+
+ if (nr_sectors * bi->tuple_size > bip->bip_size) {
+ printk(KERN_ERR "%s: tag too big for bio: %u > %u\n",
+ __func__, nr_sectors * bi->tuple_size, bip->bip_size);
+ return -1;
+ }
+
+ bi->get_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
+
+ return 0;
+}
+EXPORT_SYMBOL(bio_integrity_get_tag);
+
+/**
+ * bio_integrity_generate - Generate integrity metadata for a bio
+ * @bio: bio to generate integrity metadata for
+ *
+ * Description: Generates integrity metadata for a bio by calling the
+ * block device's generation callback function. The bio must have a
+ * bip attached with enough room to accommodate the generated integrity
+ * metadata.
+ */
+static void bio_integrity_generate(struct bio *bio)
+{
+ struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
+ struct blk_integrity_exchg bix;
+ struct bio_vec *bv;
+ sector_t sector = bio->bi_sector;
+ unsigned int i, sectors, total;
+ void *prot_buf = bio->bi_integrity->bip_buf;
+
+ total = 0;
+ bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
+ bix.sector_size = bi->sector_size;
+
+ bio_for_each_segment(bv, bio, i) {
+ bix.data_buf = kmap_atomic(bv->bv_page, KM_USER0)
+ + bv->bv_offset;
+ bix.data_size = bv->bv_len;
+ bix.prot_buf = prot_buf;
+ bix.sector = sector;
+
+ bi->generate_fn(&bix);
+
+ sectors = bv->bv_len / bi->sector_size;
+ sector += sectors;
+ prot_buf += sectors * bi->tuple_size;
+ total += sectors * bi->tuple_size;
+ BUG_ON(total > bio->bi_integrity->bip_size);
+
+ kunmap_atomic(bv->bv_page, KM_USER0);
+ }
+}
+
+/**
+ * bio_integrity_prep - Prepare bio for integrity I/O
+ * @bio: bio to prepare
+ *
+ * Description: Allocates a buffer for integrity metadata, maps the
+ * pages and attaches them to a bio. The bio must have data
+ * direction, target device and start sector set prior to calling. In
+ * the WRITE case, integrity metadata will be generated using the
+ * block device's integrity function. In the READ case, the buffer
+ * will be prepared for DMA and a suitable end_io handler set up.
+ */
+int bio_integrity_prep(struct bio *bio)
+{
+ struct bip *bip;
+ struct blk_integrity *bi;
+ struct request_queue *q;
+ void *buf;
+ unsigned long start, end;
+ unsigned int len, nr_pages;
+ unsigned int bytes, offset, i;
+ unsigned int sectors = bio_sectors(bio);
+
+ bi = bdev_get_integrity(bio->bi_bdev);
+ q = bdev_get_queue(bio->bi_bdev);
+ BUG_ON(bi == NULL);
+ BUG_ON(bio_integrity(bio));
+
+ if (bi->sector_size == 4096)
+ sectors >>= 3;
+
+ /* Allocate kernel buffer for protection data */
+ len = sectors * blk_integrity_tuple_size(bi);
+ buf = kzalloc(len, GFP_NOIO | q->bounce_gfp);
+ if (unlikely(buf == NULL)) {
+ printk(KERN_ERR "could not allocate integrity buffer\n");
+ return -EIO;
+ }
+
+ end = (((unsigned long) buf) + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ start = ((unsigned long) buf) >> PAGE_SHIFT;
+ nr_pages = end - start;
+
+ /* Allocate bio integrity payload and integrity vectors */
+ bip = bio_integrity_alloc(bio, GFP_NOIO, nr_pages);
+ if (unlikely(bip == NULL)) {
+ printk(KERN_ERR "could not allocate data integrity bioset\n");
+ kfree(buf);
+ return -EIO;
+ }
+
+ bip->bip_buf = buf;
+ bip->bip_size = len;
+ bip->bip_sector = bio->bi_sector;
+
+ /* Map it */
+ offset = offset_in_page(buf);
+ for (i = 0 ; i < nr_pages ; i++) {
+ int ret;
+ bytes = PAGE_SIZE - offset;
+
+ if (len <= 0)
+ break;
+
+ if (bytes > len)
+ bytes = len;
+
+ ret = bio_integrity_add_page(bio, virt_to_page(buf),
+ bytes, offset);
+
+ if (ret == 0)
+ return 0;
+
+ if (ret < bytes)
+ break;
+
+ buf += bytes;
+ len -= bytes;
+ offset = 0;
+ }
+
+ /* Install custom I/O completion handler if read verify is enabled */
+ if (bio_data_dir(bio) == READ) {
+ bip->bip_end_io = bio->bi_end_io;
+ bio->bi_end_io = bio_integrity_endio;
+ }
+
+ /* Auto-generate integrity metadata if this is a write */
+ if (bio_data_dir(bio) == WRITE)
+ bio_integrity_generate(bio);
+
+ return 0;
+}
+EXPORT_SYMBOL(bio_integrity_prep);
+
+/**
+ * bio_integrity_verify - Verify integrity metadata for a bio
+ * @bio: bio to verify
+ *
+ * Description: This function is called to verify the integrity of a
+ * bio. The data in the bio io_vec is compared to the integrity
+ * metadata returned by the HBA.
+ */
+static int bio_integrity_verify(struct bio *bio)
+{
+ struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
+ struct blk_integrity_exchg bix;
+ struct bio_vec *bv;
+ sector_t sector = bio->bi_integrity->bip_sector;
+ unsigned int i, sectors, total, ret;
+ void *prot_buf = bio->bi_integrity->bip_buf;
+
+ total = 0;
+ bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
+ bix.sector_size = bi->sector_size;
+
+ bio_for_each_segment(bv, bio, i) {
+ void *kaddr = kmap_atomic(bv->bv_page, KM_USER0);
+ bix.data_buf = kaddr + bv->bv_offset;
+ bix.data_size = bv->bv_len;
+ bix.prot_buf = prot_buf;
+ bix.sector = sector;
+
+ ret = bi->verify_fn(&bix);
+
+ if (ret) {
+ kunmap_atomic(kaddr, KM_USER0);
+ return ret;
+ }
+
+ sectors = bv->bv_len / bi->sector_size;
+ sector += sectors;
+ prot_buf += sectors * bi->tuple_size;
+ total += sectors * bi->tuple_size;
+ BUG_ON(total > bio->bi_integrity->bip_size);
+
+ kunmap_atomic(kaddr, KM_USER0);
+ }
+
+ return 0;
+}
+
+/**
+ * bio_integrity_verify_fn - Integrity I/O completion worker
+ * @work: Work struct stored in bio to be verified
+ *
+ * Description: This workqueue function is called to complete a READ
+ * request. The function verifies the transferred integrity metadata
+ * and then calls the original bio end_io function.
+ */
+static void bio_integrity_verify_fn(struct work_struct *work)
+{
+ struct bip *bip = container_of(work, struct bip, bip_work);
+ struct bio *bio = bip->bip_bio;
+ int error = bip->bip_error;
+
+ if (bio_integrity_verify(bio)) {
+ clear_bit(BIO_UPTODATE, &bio->bi_flags);
+ error = -EIO;
+ }
+
+ /* Restore original bio completion handler */
+ bio->bi_end_io = bip->bip_end_io;
+
+ if (bio->bi_end_io)
+ bio->bi_end_io(bio, error);
+}
+
+/**
+ * bio_integrity_endio - Integrity I/O completion function
+ * @bio: Protected bio
+ * @error: errno-style I/O completion status
+ *
+ * Description: Completion for integrity I/O
+ *
+ * Normally I/O completion is done in interrupt context. However,
+ * verifying I/O integrity is a time-consuming task which must be run
+ * in process context. This function postpones completion
+ * accordingly.
+ */
+void bio_integrity_endio(struct bio *bio, int error)
+{
+ struct bip *bip = bio->bi_integrity;
+
+ BUG_ON(bip->bip_bio != bio);
+
+ bip->bip_error = error;
+ INIT_WORK(&bip->bip_work, bio_integrity_verify_fn);
+ queue_work(kintegrityd_wq, &bip->bip_work);
+}
+EXPORT_SYMBOL(bio_integrity_endio);
+
+/**
+ * bio_integrity_advance - Advance integrity vector
+ * @bio: bio whose integrity vector to update
+ * @bytes_done: number of data bytes that have been completed
+ *
+ * Description: This function calculates how many integrity bytes the
+ * number of completed data bytes correspond to and advances the
+ * integrity vector accordingly.
+ */
+void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
+{
+ struct bip *bip = bio->bi_integrity;
+ struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
+ struct bio_vec *iv;
+ unsigned int i, skip, nr_sectors;
+
+ BUG_ON(bip == NULL);
+ BUG_ON(bi == NULL);
+
+ nr_sectors = bytes_done >> 9;
+
+ if (bi->sector_size == 4096)
+ nr_sectors >>= 3;
+
+ skip = nr_sectors * bi->tuple_size;
+
+ bip_for_each_vec(iv, bip, i) {
+ if (skip == 0) {
+ bip->bip_idx = i;
+ return;
+ } else if (skip >= iv->bv_len) {
+ skip -= iv->bv_len;
+ } else { /* skip < iv->bv_len */
+ iv->bv_offset += skip;
+ iv->bv_len -= skip;
+ bip->bip_idx = i;
+ return;
+ }
+ }
+}
+EXPORT_SYMBOL(bio_integrity_advance);
+
+/**
+ * bio_integrity_trim - Trim integrity vector
+ * @bio: bio whose integrity vector to update
+ * @offset: offset to first data sector
+ * @sectors: number of data sectors
+ *
+ * Description: Used to trim the integrity vector in a cloned bio.
+ * The ivec will be advanced corresponding to 'offset' data sectors
+ * and the length will be truncated corresponding to 'sectors' data
+ * sectors.
+ */
+void bio_integrity_trim(struct bio *bio, unsigned int offset, unsigned int sectors)
+{
+ struct bip *bip = bio->bi_integrity;
+ struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
+ struct bio_vec *iv;
+ unsigned int i, skip, nr_bytes;
+
+ BUG_ON(bip == NULL);
+ BUG_ON(bi == NULL);
+ BUG_ON(!bio_flagged(bio, BIO_CLONED));
+
+ if (bi->sector_size == 4096)
+ sectors >>= 3;
+
+ bip->bip_sector = bip->bip_sector + offset;
+ skip = offset * bi->tuple_size;
+ nr_bytes = sectors * bi->tuple_size;
+
+ /* Mark head */
+ bip_for_each_vec(iv, bip, i) {
+ if (skip == 0) {
+ bip->bip_idx = i;
+ break;
+ } else if (skip >= iv->bv_len) {
+ skip -= iv->bv_len;
+ } else { /* skip < iv->bv_len */
+ iv->bv_offset += skip;
+ iv->bv_len -= skip;
+ bip->bip_idx = i;
+ break;
+ }
+ }
+
+ /* Mark tail */
+ bip_for_each_vec(iv, bip, i) {
+ if (nr_bytes == 0) {
+ bip->bip_vcnt = i;
+ break;
+ } else if (nr_bytes >= iv->bv_len) {
+ nr_bytes -= iv->bv_len;
+ } else { /* nr_bytes < iv->bv_len */
+ iv->bv_len = nr_bytes;
+ nr_bytes = 0;
+ }
+ }
+}
+EXPORT_SYMBOL(bio_integrity_trim);
+
+/**
+ * bio_integrity_split - Split integrity metadata
+ * @bio: Protected bio
+ * @bp: Resulting bio_pair
+ * @sectors: Offset
+ *
+ * Description: Splits an integrity page into a bio_pair.
+ */
+void bio_integrity_split(struct bio *bio, struct bio_pair *bp, int sectors)
+{
+ struct blk_integrity *bi;
+ struct bip *bip = bio->bi_integrity;
+
+ if (bio_integrity(bio) == 0)
+ return;
+
+ bi = bdev_get_integrity(bio->bi_bdev);
+ BUG_ON(bi == NULL);
+ BUG_ON(bip->bip_vcnt != 1);
+
+ if (bi->sector_size == 4096)
+ sectors >>= 3;
+
+ bp->bio1.bi_integrity = &bp->bip1;
+ bp->bio2.bi_integrity = &bp->bip2;
+
+ bp->iv1 = bip->bip_vec[0];
+ bp->iv2 = bip->bip_vec[0];
+
+ bp->bip1.bip_vec = &bp->iv1;
+ bp->bip2.bip_vec = &bp->iv2;
+
+ bp->iv1.bv_len = sectors * bi->tuple_size;
+ bp->iv2.bv_offset += sectors * bi->tuple_size;
+ bp->iv2.bv_len -= sectors * bi->tuple_size;
+
+ bp->bip1.bip_sector = bio->bi_integrity->bip_sector;
+ bp->bip2.bip_sector = bio->bi_integrity->bip_sector + sectors;
+
+ bp->bip1.bip_vcnt = bp->bip2.bip_vcnt = 1;
+ bp->bip1.bip_idx = bp->bip2.bip_idx = 0;
+}
+EXPORT_SYMBOL(bio_integrity_split);
+
+/**
+ * bio_integrity_clone - Callback for cloning bios with integrity metadata
+ * @bio: New bio
+ * @bio_src: Original bio
+ * @bs: bio_set to allocate bip from
+ *
+ * Description: Called to allocate a bip when cloning a bio
+ */
+int bio_integrity_clone(struct bio *bio, struct bio *bio_src, struct bio_set *bs)
+{
+ struct bip *bip_src = bio_src->bi_integrity;
+ struct bip *bip;
+
+ BUG_ON(bip_src == NULL);
+
+ bip = bio_integrity_alloc_bioset(bio, GFP_NOIO, bip_src->bip_vcnt, bs);
+
+ if (bip == NULL)
+ return -EIO;
+
+ memcpy(bip->bip_vec, bip_src->bip_vec,
+ bip_src->bip_vcnt * sizeof(struct bio_vec));
+
+ bip->bip_sector = bip_src->bip_sector;
+ bip->bip_vcnt = bip_src->bip_vcnt;
+ bip->bip_idx = bip_src->bip_idx;
+
+ return 0;
+}
+EXPORT_SYMBOL(bio_integrity_clone);
+
+int bioset_integrity_create(struct bio_set *bs, int pool_size)
+{
+ bs->bio_integrity_pool = mempool_create_slab_pool(pool_size,
+ bio_integrity_slab);
+ if (!bs->bio_integrity_pool)
+ return -1;
+
+ return 0;
+}
+EXPORT_SYMBOL(bioset_integrity_create);
+
+void bioset_integrity_free(struct bio_set *bs)
+{
+ if (bs->bio_integrity_pool)
+ mempool_destroy(bs->bio_integrity_pool);
+}
+EXPORT_SYMBOL(bioset_integrity_free);
+
+void __init bio_integrity_init_slab(void)
+{
+ bio_integrity_slab = KMEM_CACHE(bip, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
+}
+EXPORT_SYMBOL(bio_integrity_init_slab);
+
+static int __init integrity_init(void)
+{
+ kintegrityd_wq = create_workqueue("kintegrityd");
+
+ if (!kintegrityd_wq)
+ panic("Failed to create kintegrityd\n");
+
+ return 0;
+}
+subsys_initcall(integrity_init);
diff -r 318fa71e735d -r f2ae9d5bce4c fs/bio.c
--- a/fs/bio.c Sat Jun 07 00:45:14 2008 -0400
+++ b/fs/bio.c Sat Jun 07 00:45:15 2008 -0400
@@ -96,6 +96,9 @@

mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
}
+
+ if (bio_integrity(bio))
+ bio_integrity_free(bio, bio_set);

mempool_free(bio, bio_set->bio_pool);
}
@@ -255,9 +258,19 @@
{
struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, fs_bio_set);

- if (b) {
- b->bi_destructor = bio_fs_destructor;
- __bio_clone(b, bio);
+ if (!b)
+ return NULL;
+
+ b->bi_destructor = bio_fs_destructor;
+ __bio_clone(b, bio);
+
+ if (bio_integrity(bio)) {
+ int ret;
+
+ ret = bio_integrity_clone(b, bio, fs_bio_set);
+
+ if (ret < 0)
+ return NULL;
}

return b;
@@ -1229,6 +1242,9 @@
bp->bio1.bi_private = bi;
bp->bio2.bi_private = pool;

+ if (bio_integrity(bi))
+ bio_integrity_split(bi, bp, first_sectors);
+
return bp;
}

@@ -1294,6 +1310,7 @@
if (bs->bio_pool)
mempool_destroy(bs->bio_pool);

+ bioset_integrity_free(bs);
biovec_free_pools(bs);

kfree(bs);
@@ -1308,6 +1325,9 @@

bs->bio_pool = mempool_create_slab_pool(bio_pool_size, bio_slab);
if (!bs->bio_pool)
+ goto bad;
+
+ if (bioset_integrity_create(bs, bio_pool_size))
goto bad;

if (!biovec_create_pools(bs, bvec_pool_size))
@@ -1336,6 +1356,7 @@
{
bio_slab = KMEM_CACHE(bio, SLAB_HWCACHE_ALIGN|SLAB_PANIC);

+ bio_integrity_init_slab();
biovec_init_slabs();

fs_bio_set = bioset_create(BIO_POOL_SIZE, 2);
diff -r 318fa71e735d -r f2ae9d5bce4c include/linux/bio.h
--- a/include/linux/bio.h Sat Jun 07 00:45:14 2008 -0400
+++ b/include/linux/bio.h Sat Jun 07 00:45:15 2008 -0400
@@ -64,6 +64,7 @@

struct bio_set;
struct bio;
+struct bip;
typedef void (bio_end_io_t) (struct bio *, int);
typedef void (bio_destructor_t) (struct bio *);

@@ -112,6 +113,9 @@
atomic_t bi_cnt; /* pin count */

void *bi_private;
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+ struct bip *bi_integrity; /* data integrity */
+#endif

bio_destructor_t *bi_destructor; /* destructor */
};
@@ -271,6 +275,29 @@
*/
#define bio_get(bio) atomic_inc(&(bio)->bi_cnt)

+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+/*
+ * bio integrity payload
+ */
+struct bip {
+ struct bio *bip_bio; /* parent bio */
+ struct bio_vec *bip_vec; /* integrity data vector */
+
+ sector_t bip_sector; /* virtual start sector */
+
+ void *bip_buf; /* generated integrity data */
+ bio_end_io_t *bip_end_io; /* saved I/O completion fn */
+
+ int bip_error; /* saved I/O error */
+ unsigned int bip_size;
+
+ unsigned short bip_pool; /* pool the ivec came from */
+ unsigned short bip_vcnt; /* # of integrity bio_vecs */
+ unsigned short bip_idx; /* current bip_vec index */
+
+ struct work_struct bip_work; /* I/O completion */
+};
+#endif /* CONFIG_BLK_DEV_INTEGRITY */

/*
* A bio_pair is used when we need to split a bio.
@@ -285,6 +312,10 @@
struct bio_pair {
struct bio bio1, bio2;
struct bio_vec bv1, bv2;
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+ struct bip bip1, bip2;
+ struct bio_vec iv1, iv2;
+#endif
atomic_t cnt;
int error;
};
@@ -349,6 +380,9 @@

struct bio_set {
mempool_t *bio_pool;
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+ mempool_t *bio_integrity_pool;
+#endif
mempool_t *bvec_pools[BIOVEC_NR_POOLS];
};

@@ -413,5 +447,56 @@
__bio_kmap_irq((bio), (bio)->bi_idx, (flags))
#define bio_kunmap_irq(buf,flags) __bio_kunmap_irq(buf, flags)

+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+
+#define bip_vec_idx(bip, idx) (&(bip->bip_vec[(idx)]))
+#define bip_vec(bip) bip_vec_idx(bip, 0)
+
+#define __bip_for_each_vec(bvl, bip, i, start_idx) \
+ for (bvl = bip_vec_idx((bip), (start_idx)), i = (start_idx); \
+ i < (bip)->bip_vcnt; \
+ bvl++, i++)
+
+#define bip_for_each_vec(bvl, bip, i) \
+ __bip_for_each_vec(bvl, bip, i, (bip)->bip_idx)
+
+#define bio_integrity(bio) ((bio)->bi_integrity ? 1 : 0)
+
+extern struct bip *bio_integrity_alloc_bioset(struct bio *, gfp_t, unsigned int, struct bio_set *);
+extern struct bip *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
+extern void bio_integrity_free(struct bio *, struct bio_set *);
+extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
+extern inline int bio_integrity_enabled(struct bio *bio);
+extern int bio_integrity_set_tag(struct bio *, void *, unsigned int);
+extern int bio_integrity_get_tag(struct bio *, void *, unsigned int);
+extern int bio_integrity_prep(struct bio *);
+extern void bio_integrity_endio(struct bio *, int);
+extern void bio_integrity_advance(struct bio *, unsigned int);
+extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
+extern void bio_integrity_split(struct bio *, struct bio_pair *, int);
+extern int bio_integrity_clone(struct bio *, struct bio *, struct bio_set *);
+extern int bioset_integrity_create(struct bio_set *, int);
+extern void bioset_integrity_free(struct bio_set *);
+extern void bio_integrity_init_slab(void);
+
+#else /* CONFIG_BLK_DEV_INTEGRITY */
+
+#define bio_integrity(a) (0)
+#define bioset_integrity_create(a, b) (0)
+#define bio_integrity_prep(a) (0)
+#define bio_integrity_enabled(a) (0)
+#define bio_integrity_clone(a, b, c) (0)
+#define bioset_integrity_free(a) do { } while (0)
+#define bio_integrity_free(a, b) do { } while (0)
+#define bio_integrity_endio(a, b) do { } while (0)
+#define bio_integrity_advance(a, b) do { } while (0)
+#define bio_integrity_trim(a, b, c) do { } while (0)
+#define bio_integrity_split(a, b, c) do { } while (0)
+#define bio_integrity_set_tag(a, b, c) do { } while (0)
+#define bio_integrity_get_tag(a, b, c) do { } while (0)
+#define bio_integrity_init_slab(a) do { } while (0)
+
+#endif /* CONFIG_BLK_DEV_INTEGRITY */
+
#endif /* CONFIG_BLOCK */
#endif /* __LINUX_BIO_H */
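
Not part of the patch: the vector walk in bio_integrity_advance() can be tried outside the kernel with a small user-space model. The sketch below assumes simplified stand-ins (struct ivec for struct bio_vec, struct payload for struct bip) and hypothetical helper names (integrity_bytes, payload_advance); only the arithmetic and the three-way branch are taken from the function above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct bio_vec (only the fields the walk uses). */
struct ivec {
	unsigned int offset;
	unsigned int len;
};

/* Simplified stand-in for struct bip. */
struct payload {
	struct ivec *vec;
	unsigned short vcnt;	/* number of integrity vectors */
	unsigned short idx;	/* current vector index */
};

/* Completed data bytes -> number of integrity bytes to skip. */
static unsigned int integrity_bytes(unsigned int bytes_done,
				    unsigned int sector_size,
				    unsigned int tuple_size)
{
	unsigned int nr_sectors = bytes_done >> 9;	/* 512-byte units */

	if (sector_size == 4096)
		nr_sectors >>= 3;	/* eight 512-byte units per sector */

	return nr_sectors * tuple_size;
}

/*
 * Walk the vectors starting at the current index. As in the patch, idx
 * is left unchanged if the skip consumes every remaining vector.
 */
static void payload_advance(struct payload *p, unsigned int skip)
{
	unsigned short i;

	for (i = p->idx; i < p->vcnt; i++) {
		struct ivec *iv = &p->vec[i];

		if (skip == 0) {
			p->idx = i;
			return;
		} else if (skip >= iv->len) {
			skip -= iv->len;	/* vector fully consumed */
		} else {
			iv->offset += skip;	/* vector partly consumed */
			iv->len -= skip;
			p->idx = i;
			return;
		}
	}
}
```

With two 64-byte vectors, completing 5120 data bytes on a 512-byte-sector device with 8-byte tuples gives a skip of 80 bytes: the first vector is consumed whole and the second is left with offset 16 and length 48.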

2008-06-07 05:00:47

by Martin K. Petersen

Subject: [PATCH 5 of 7] block: Block/request layer data integrity support

10 files changed, 849 insertions(+)
Documentation/block/data-integrity.txt | 327 +++++++++++++++++++++++++++
block/Kconfig | 12
block/Makefile | 1
block/blk-core.c | 7
block/blk-integrity.c | 385 ++++++++++++++++++++++++++++++++
block/blk-merge.c | 3
block/blk.h | 8
block/elevator.c | 6
include/linux/blkdev.h | 97 ++++++++
include/linux/genhd.h | 3


Support for merging and mapping bio integrity metadata.

Block device integrity type registration.

Signed-off-by: Martin K. Petersen <[email protected]>

---
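
Not part of the patch: a user-space sketch of what a generate_fn/verify_fn pair for a Type 1 style profile might look like, to make the callback contract in the documentation below concrete. Assumptions, none of which come from this series: struct exchg is a hypothetical mirror of the blk_integrity_exchg fields used by the bio code; the guard tag is a bitwise CRC-16 with the T10 DIF polynomial 0x8BB7; the tuple is laid out big-endian as guard, application tag, reference tag; the reference tag is the low 32 bits of the sector number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TUPLE_SIZE 8u	/* bytes of protection data per sector */

/* Hypothetical user-space mirror of the blk_integrity_exchg fields. */
struct exchg {
	const uint8_t *data_buf;
	uint8_t *prot_buf;
	size_t data_size;
	uint64_t sector;
	unsigned int sector_size;
};

/* Bitwise CRC-16, polynomial 0x8BB7, initial value 0, MSB first. */
static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= (uint16_t)(buf[i] << 8);
		for (bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
					     : (uint16_t)(crc << 1);
	}
	return crc;
}

/* Fill one tuple per sector: guard, application tag (0), reference tag. */
static void sketch_generate(struct exchg *x)
{
	size_t i, n = x->data_size / x->sector_size;

	for (i = 0; i < n; i++) {
		uint8_t *t = x->prot_buf + i * TUPLE_SIZE;
		uint16_t guard = crc16_t10dif(x->data_buf + i * x->sector_size,
					      x->sector_size);
		uint32_t ref = (uint32_t)(x->sector + i);

		t[0] = (uint8_t)(guard >> 8);
		t[1] = (uint8_t)(guard & 0xff);
		t[2] = 0;
		t[3] = 0;
		t[4] = (uint8_t)(ref >> 24);
		t[5] = (uint8_t)((ref >> 16) & 0xff);
		t[6] = (uint8_t)((ref >> 8) & 0xff);
		t[7] = (uint8_t)(ref & 0xff);
	}
}

/* Return 0 if every tuple matches its sector, -1 on the first mismatch. */
static int sketch_verify(const struct exchg *x)
{
	size_t i, n = x->data_size / x->sector_size;

	for (i = 0; i < n; i++) {
		const uint8_t *t = x->prot_buf + i * TUPLE_SIZE;
		uint16_t guard = crc16_t10dif(x->data_buf + i * x->sector_size,
					      x->sector_size);
		uint32_t ref = (uint32_t)(x->sector + i);
		uint16_t got_guard = (uint16_t)((t[0] << 8) | t[1]);
		uint32_t got_ref = ((uint32_t)t[4] << 24) |
				   ((uint32_t)t[5] << 16) |
				   ((uint32_t)t[6] << 8) | t[7];

		if (got_guard != guard || got_ref != ref)
			return -1;
	}
	return 0;
}
```

The verify side fails both when the data changes under an unchanged tuple and when the same buffers are presented for a different start sector, which is the misdirected-write case the reference tag is meant to catch.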

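Also not part of the patch: one plausible way to interleave an opaque tag buffer across the per-sector application tags, as described for bio_integrity_set_tag() and bio_integrity_get_tag() in the documentation below. The layout (2 tag bytes at offset 2 of each 8-byte tuple) and the function names are assumptions for illustration; the set_tag_fn/get_tag_fn implementations of a real profile may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TUPLE_SIZE	8u	/* bytes of protection data per sector */
#define TAG_OFFSET	2u	/* application tag follows the 16-bit guard */
#define TAG_SIZE	2u	/* tag bytes available per sector */

/* Tag bytes available for an I/O covering nr_sectors hardware sectors. */
static size_t tag_capacity(size_t nr_sectors)
{
	return nr_sectors * TAG_SIZE;
}

/* Spread len tag bytes over consecutive tuples; -1 if they do not fit. */
static int sketch_set_tag(uint8_t *prot_buf, size_t nr_sectors,
			  const uint8_t *tag, size_t len)
{
	size_t i;

	if (len > tag_capacity(nr_sectors))
		return -1;

	for (i = 0; i < len; i++)
		prot_buf[(i / TAG_SIZE) * TUPLE_SIZE + TAG_OFFSET +
			 i % TAG_SIZE] = tag[i];

	return 0;
}

/* Gather len tag bytes back out of consecutive tuples. */
static int sketch_get_tag(const uint8_t *prot_buf, size_t nr_sectors,
			  uint8_t *tag, size_t len)
{
	size_t i;

	if (len > tag_capacity(nr_sectors))
		return -1;

	for (i = 0; i < len; i++)
		tag[i] = prot_buf[(i / TAG_SIZE) * TUPLE_SIZE + TAG_OFFSET +
				  i % TAG_SIZE];

	return 0;
}
```

For a 4KB filesystem block on 512-byte sectors this gives 8 tuples and therefore 16 tag bytes, matching the 8*16 bits mentioned in section 3 of the documentation.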
diff -r f2ae9d5bce4c -r 5be7c534c954 Documentation/block/data-integrity.txt
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/Documentation/block/data-integrity.txt Sat Jun 07 00:45:15 2008 -0400
@@ -0,0 +1,327 @@
+----------------------------------------------------------------------
+1. INTRODUCTION
+
+Modern filesystems feature checksumming of data and metadata to
+protect against data corruption. However, the detection of the
+corruption is done at read time which could potentially be months
+after the data was written. At that point the original data that the
+application tried to write is most likely lost.
+
+The solution is to ensure that the disk is actually storing what the
+application meant it to. Recent additions to both the SCSI family
+protocols (SBC Data Integrity Field, SCC protection proposal) as well
+as SATA/T13 (External Path Protection) try to remedy this by adding
+support for appending integrity metadata to an I/O. The integrity
+metadata (or protection information in SCSI terminology) includes a
+checksum for each sector as well as an incrementing counter that
+ensures the individual sectors are written in the right order. Some
+protection schemes also ensure that the I/O is written to the right
+place on disk.
+
+Current storage controllers and devices implement various protective
+measures, for instance checksumming and scrubbing. But these
+technologies are working in their own isolated domains or at best
+between adjacent nodes in the I/O path. The interesting thing about
+DIF and the other integrity extensions is that the protection format
+is well defined and every node in the I/O path can verify the
+integrity of the I/O and reject it if corruption is detected. This
+allows not only corruption prevention but also isolation of the point
+of failure.
+
+----------------------------------------------------------------------
+2. THE DATA INTEGRITY EXTENSIONS
+
+As written, the protocol extensions only protect the path between
+controller and storage device. However, many controllers actually
+allow the operating system to interact with the integrity metadata
+(IMD). We have been working with several FC/SAS HBA vendors to enable
+the protection information to be transferred to and from their
+controllers.
+
+The SCSI Data Integrity Field works by appending 8 bytes of protection
+information to each sector. The data + integrity metadata is stored
+in 520 byte sectors on disk. Data + IMD are interleaved when
+transferred between the controller and target. The T13 proposal is
+similar.
+
+Because it is highly inconvenient for operating systems to deal with
+520 (and 4104) byte sectors, we approached several HBA vendors and
+encouraged them to allow separation of the data and integrity metadata
+scatter-gather lists.
+
+The controller will interleave the buffers on write and split them on
+read. This means that Linux can DMA the data buffers to and from
+host memory without changes to the page cache.
+
+Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
+is somewhat heavy to compute in software. Benchmarks found that
+calculating this checksum had a significant impact on system
+performance for a number of workloads. Some controllers allow a
+lighter-weight checksum to be used when interfacing with the operating
+system. Emulex, for instance, supports the TCP/IP checksum instead.
+The IP checksum received from the OS is converted to the 16-bit CRC
+when writing and vice versa. This allows the integrity metadata to be
+generated by Linux or the application at very low cost (comparable to
+software RAID5).
+
+The IP checksum is weaker than the CRC in terms of detecting bit
+errors. However, the strength is really in the separation of the data
+buffers and the integrity metadata. These two distinct buffers must
+match up for an I/O to complete.
+
+The separation of the data and integrity metadata buffers as well as
+the choice in checksums is referred to as the Data Integrity
+Extensions. As these extensions are outside the scope of the protocol
+bodies (T10, T13), Oracle and its partners are trying to standardize
+them within the Storage Networking Industry Association.
+
+----------------------------------------------------------------------
+3. KERNEL CHANGES
+
+The data integrity framework in Linux enables protection information
+to be pinned to I/Os and sent to/received from controllers that
+support it.
+
+The advantage to the integrity extensions in SCSI and SATA is that
+they enable us to protect the entire path from application to storage
+device. However, at the same time this is also the biggest
+disadvantage. It means that the protection information must be in a
+format that can be understood by the disk.
+
+Generally Linux/POSIX applications are agnostic to the intricacies of
+the storage devices they are accessing. The virtual filesystem switch
+and the block layer make things like hardware sector size and
+transport protocols completely transparent to the application.
+
+However, this level of detail is required when preparing the
+protection information to send to a disk. Consequently, the very
+concept of an end-to-end protection scheme is a layering violation.
+It is completely unreasonable for an application to be aware whether
+it is accessing a SCSI or SATA disk.
+
+The data integrity support implemented in Linux attempts to hide this
+from the application. As far as the application (and to some extent
+the kernel) is concerned, the integrity metadata is opaque information
+that's attached to the I/O.
+
+The current implementation allows the block layer to automatically
+generate the protection information for any I/O. Eventually the
+intent is to move the integrity metadata calculation to userspace for
+user data. Metadata and other I/O that originates within the kernel
+will still use the automatic generation interface.
+
+Some storage devices allow each hardware sector to be tagged with a
+16-bit value. The owner of this tag space is the owner of the block
+device, i.e. the filesystem in most cases. The filesystem can use
+this extra space to tag sectors as it sees fit. Because the tag
+space is limited, the block interface allows tagging bigger chunks by
+way of interleaving. This way, 8*16 bits of information can be
+attached to a typical 4KB filesystem block.
+
+This also means that applications such as fsck and mkfs will need
+access to manipulate the tags from user space. A passthrough
+interface for this is being worked on.
+
+
+----------------------------------------------------------------------
+4. BLOCK LAYER IMPLEMENTATION DETAILS
+
+4.1 BIO
+
+The data integrity patches add a new field to struct bio when
+CONFIG_BLK_DEV_INTEGRITY is enabled. bio->bi_integrity is a pointer
+to a struct bip which contains the bio integrity payload. Essentially
+a bip is a trimmed down struct bio which holds a bio_vec containing
+the integrity metadata and the required housekeeping information (bvec
+pool, vector count, etc.)
+
+A kernel subsystem can enable data integrity protection on a bio by
+calling bio_integrity_alloc(bio). This will allocate and attach the
+bip to the bio.
+
+Individual pages containing integrity metadata can subsequently be
+attached using bio_integrity_add_page().
+
+bio_free() will automatically free the bip.
+
+
+4.2 BLOCK DEVICE
+
+Because the format of the protection data is tied to the physical
+disk, each block device has been extended with a block integrity
+profile (struct blk_integrity). This optional profile is registered
+with the block layer using blk_integrity_register().
+
+The profile contains callback functions for generating and verifying
+the protection data, as well as getting and setting application tags.
+The profile also contains a few constants to aid in completing,
+merging and splitting the integrity metadata.
+
+Layered block devices will need to pick a profile that's appropriate
+for all subdevices. blk_integrity_compare() can help with that. DM
+and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6
+will require extra work due to the application tag.
+
+
+----------------------------------------------------------------------
+5.0 BLOCK LAYER INTEGRITY API
+
+5.1 NORMAL FILESYSTEM
+
+ The normal filesystem is unaware that the underlying block device
+ is capable of sending/receiving integrity metadata. The IMD will
+ be automatically generated by the block layer at submit_bio() time
+ in case of a WRITE. A READ request will cause the I/O integrity
+ to be verified upon completion.
+
+ IMD generation and verification can be toggled using the
+
+ /sys/class/block/<bdev>/integrity/write_generate
+
+ and
+
+ /sys/class/block/<bdev>/integrity/read_verify
+
+ flags.
+
+
+5.2 INTEGRITY-AWARE FILESYSTEM
+
+ A filesystem that is integrity-aware can prepare I/Os with IMD
+ attached. It can also use the application tag space if this is
+ supported by the block device.
+
+
+ int bdev_integrity_enabled(block_device, int rw);
+
+ bdev_integrity_enabled() will return 1 if the block device
+ supports integrity metadata transfer for the data direction
+ specified in 'rw'.
+
+ bdev_integrity_enabled() honors the write_generate and
+ read_verify flags in sysfs and will respond accordingly.
+
+
+ int bio_integrity_prep(bio);
+
+ To generate IMD for WRITE and to set up buffers for READ, the
+ filesystem must call bio_integrity_prep(bio).
+
+ Prior to calling this function, the bio data direction and start
+ sector must be set, and the bio should have all data pages
+ added. It is up to the caller to ensure that the bio does not
+ change while I/O is in progress.
+
+ bio_integrity_prep() should only be called if
+ bio_integrity_enabled() returned 1.
+
+
+ int bio_integrity_tag_size(bio);
+
+ If the filesystem wants to use the application tag space it will
+ first have to find out how much storage space is available.
+ Because tag space is generally limited (usually 2 bytes per
+ sector regardless of sector size), the integrity framework
+ supports interleaving the information between the sectors in an
+ I/O.
+
+ Filesystems can call bio_integrity_tag_size(bio) to find out how
+ many bytes of storage are available for that particular bio.
+
+ Another option is bdev_get_tag_size(block_device) which will
+ return the number of available bytes per hardware sector.
+
+
+ int bio_integrity_set_tag(bio, void *tag_buf, len);
+
+ After a successful return from bio_integrity_prep(),
+ bio_integrity_set_tag() can be used to attach an opaque tag
+ buffer to a bio. Obviously this only makes sense if the I/O is
+ a WRITE.
+
+
+ int bio_integrity_get_tag(bio, void *tag_buf, len);
+
+ Similarly, at READ I/O completion time the filesystem can
+ retrieve the tag buffer using bio_integrity_get_tag().
+
+
+5.3 PASSING EXISTING INTEGRITY METADATA
+
+ Filesystems that either generate their own integrity metadata or
+ are capable of transferring IMD from user space can use the
+ following calls:
+
+
+ struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);
+
+ Allocates the bio integrity payload and hangs it off of the bio.
+ nr_pages indicates how many pages of protection data need to be
+ stored in the integrity bio_vec list (similar to bio_alloc()).
+
+ The integrity payload will be freed at bio_free() time.
+
+
+ int bio_integrity_add_page(bio, page, len, offset);
+
+ Attaches a page containing integrity metadata to an existing
+ bio. The bio must have an existing bip,
+ i.e. bio_integrity_alloc() must have been called. For a WRITE,
+ the integrity metadata in the pages must be in a format
+ understood by the target device with the notable exception that
+ the sector numbers will be remapped as the request traverses the
+ I/O stack. This implies that the pages added using this call
+ will be modified during I/O! The first reference tag in the
+ integrity metadata must have a value of bip->bip_sector.
+
+ Pages can be added using bio_integrity_add_page() as long as
+ there is room in the bip bio_vec array (nr_pages).
+
+ Upon completion of a READ operation, the attached pages will
+ contain the integrity metadata received from the storage device.
+ It is up to the receiver to process them and verify data
+ integrity upon completion.
+
+
+5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
+ METADATA
+
+ To enable integrity exchange on a block device the gendisk must be
+ registered as capable:
+
+ int blk_integrity_register(gendisk, blk_integrity);
+
+ The blk_integrity struct is a template and should contain the
+ following:
+
+ static struct blk_integrity my_profile = {
+ .name = "STANDARDSBODY-TYPE-VARIANT-CSUM",
+ .generate_fn = my_generate_fn,
+ .verify_fn = my_verify_fn,
+ .get_tag_fn = my_get_tag_fn,
+ .set_tag_fn = my_set_tag_fn,
+ .tuple_size = sizeof(struct my_tuple_size),
+ .tag_size = <tag bytes per hw sector>,
+ };
+
+ 'name' is a text string which will be visible in sysfs. This is
+ part of the userland API so choose it carefully and never change
+ it. The format is standards body-type-variant.
+ E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.
+
+ 'generate_fn' generates appropriate integrity metadata (for WRITE).
+
+ 'verify_fn' verifies that the data buffer matches the integrity
+ metadata.
+
+ 'tuple_size' must be set to match the size of the integrity
+ metadata per sector. I.e. 8 for DIF and EPP.
+
+ 'tag_size' must be set to identify how many bytes of tag space
+ are available per hardware sector. For DIF this is either 2 or
+ 0 depending on the value of the Control Mode Page ATO bit.
+
+ See 5.2 for a description of get_tag_fn and set_tag_fn.
+
+----------------------------------------------------------------------
+2007-12-24 Martin K. Petersen <[email protected]>
diff -r f2ae9d5bce4c -r 5be7c534c954 block/Kconfig
--- a/block/Kconfig Sat Jun 07 00:45:15 2008 -0400
+++ b/block/Kconfig Sat Jun 07 00:45:15 2008 -0400
@@ -81,6 +81,18 @@

If unsure, say N.

+config BLK_DEV_INTEGRITY
+ bool "Block layer data integrity support"
+ ---help---
+ Some storage devices allow extra information to be
+ stored/retrieved to help protect the data. The block layer
+ data integrity option provides hooks which can be used by
+ filesystems to ensure better data integrity.
+
+ Say yes here if you have a storage device that provides the
+ T10/SCSI Data Integrity Field or the T13/ATA External Path
+ Protection.
+
endif # BLOCK

config BLOCK_COMPAT
diff -r f2ae9d5bce4c -r 5be7c534c954 block/Makefile
--- a/block/Makefile Sat Jun 07 00:45:15 2008 -0400
+++ b/block/Makefile Sat Jun 07 00:45:15 2008 -0400
@@ -14,3 +14,4 @@

obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
+obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
diff -r f2ae9d5bce4c -r 5be7c534c954 block/blk-core.c
--- a/block/blk-core.c Sat Jun 07 00:45:15 2008 -0400
+++ b/block/blk-core.c Sat Jun 07 00:45:15 2008 -0400
@@ -143,6 +143,10 @@

bio->bi_size -= nbytes;
bio->bi_sector += (nbytes >> 9);
+
+ if (bio_integrity(bio))
+ bio_integrity_advance(bio, nbytes);
+
if (bio->bi_size == 0)
bio_endio(bio, error);
} else {
@@ -1381,6 +1385,9 @@
*/
blk_partition_remap(bio);

+ if (bio_integrity_enabled(bio) && bio_integrity_prep(bio))
+ goto end_io;
+
if (old_sector != -1)
blk_add_trace_remap(q, bio, old_dev, bio->bi_sector,
old_sector);
diff -r f2ae9d5bce4c -r 5be7c534c954 block/blk-integrity.c
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/block/blk-integrity.c Sat Jun 07 00:45:15 2008 -0400
@@ -0,0 +1,385 @@
+/*
+ * blk-integrity.c - Block layer data integrity extensions
+ *
+ * Copyright (C) 2007, 2008 Oracle Corporation
+ * Written by: Martin K. Petersen <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING. If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139,
+ * USA.
+ *
+ */
+
+#include <linux/blkdev.h>
+#include <linux/mempool.h>
+#include <linux/bio.h>
+#include <linux/scatterlist.h>
+
+#include "blk.h"
+
+static struct kmem_cache *integrity_cachep;
+
+/**
+ * blk_rq_count_integrity_sg - Count number of integrity scatterlist elements
+ * @rq: request with integrity metadata attached
+ *
+ * Description: Returns the number of elements required in a
+ * scatterlist corresponding to the integrity metadata in a request.
+ */
+int blk_rq_count_integrity_sg(struct request *rq)
+{
+ struct bio_vec *iv, *ivprv;
+ struct req_iterator iter;
+ unsigned int segments;
+
+ ivprv = NULL;
+ segments = 0;
+
+ rq_for_each_integrity_segment(iv, rq, iter) {
+
+ if (ivprv && BIOVEC_PHYS_MERGEABLE(ivprv, iv))
+ ;
+ else
+ segments++;
+
+ ivprv = iv;
+ }
+
+ return segments;
+}
+EXPORT_SYMBOL(blk_rq_count_integrity_sg);
+
+/**
+ * blk_rq_map_integrity_sg - Map integrity metadata into a scatterlist
+ * @rq: request with integrity metadata attached
+ * @sglist: target scatterlist
+ *
+ * Description: Map the integrity vectors in request into a
+ * scatterlist. The scatterlist must be big enough to hold all
+ * elements. I.e. sized using blk_rq_count_integrity_sg().
+ */
+int blk_rq_map_integrity_sg(struct request *rq, struct scatterlist *sglist)
+{
+ struct bio_vec *iv, *ivprv;
+ struct req_iterator iter;
+ struct scatterlist *sg;
+ unsigned int segments;
+
+ ivprv = NULL;
+ sg = NULL;
+ segments = 0;
+
+ rq_for_each_integrity_segment(iv, rq, iter) {
+
+ if (ivprv) {
+ if (!BIOVEC_PHYS_MERGEABLE(ivprv, iv))
+ goto new_segment;
+
+ sg->length += iv->bv_len;
+ } else {
+new_segment:
+ if (!sg)
+ sg = sglist;
+ else {
+ sg->page_link &= ~0x02;
+ sg = sg_next(sg);
+ }
+
+ sg_set_page(sg, iv->bv_page, iv->bv_len, iv->bv_offset);
+ segments++;
+ }
+
+ ivprv = iv;
+ }
+
+ if (sg)
+ sg_mark_end(sg);
+
+ return segments;
+}
+EXPORT_SYMBOL(blk_rq_map_integrity_sg);
+
+/**
+ * blk_integrity_compare - Compare integrity profile of two block devices
+ * @bd1: Device to compare
+ * @bd2: Device to compare
+ *
+ * Description: Meta-devices like DM and MD need to verify that all
+ * sub-devices use the same integrity format before advertising to
+ * upper layers that they can send/receive integrity metadata. This
+ * function can be used to check whether two block devices have
+ * compatible integrity formats.
+ */
+int blk_integrity_compare(struct block_device *bd1, struct block_device *bd2)
+{
+ struct blk_integrity *b1, *b2;
+
+ BUG_ON(bd1->bd_disk == NULL || bd2->bd_disk == NULL);
+ b1 = bd1->bd_disk->integrity;
+ b2 = bd2->bd_disk->integrity;
+
+ if (!b1 || !b2)
+ return 0;
+
+ if (b1->sector_size != b2->sector_size) {
+ printk(KERN_ERR "%s: %s/%s sector sz %u != %u\n", __func__,
+ bd1->bd_disk->disk_name, bd2->bd_disk->disk_name,
+ b1->sector_size, b2->sector_size);
+ return -1;
+ }
+
+ if (b1->tuple_size != b2->tuple_size) {
+ printk(KERN_ERR "%s: %s/%s tuple sz %u != %u\n", __func__,
+ bd1->bd_disk->disk_name, bd2->bd_disk->disk_name,
+ b1->tuple_size, b2->tuple_size);
+ return -1;
+ }
+
+ if (b1->tag_size && b2->tag_size && (b1->tag_size != b2->tag_size)) {
+ printk(KERN_ERR "%s: %s/%s tag sz %u != %u\n", __func__,
+ bd1->bd_disk->disk_name, bd2->bd_disk->disk_name,
+ b1->tag_size, b2->tag_size);
+ return -1;
+ }
+
+ if (strcmp(b1->name, b2->name)) {
+ printk(KERN_ERR "%s: %s/%s type %s != %s\n", __func__,
+ bd1->bd_disk->disk_name, bd2->bd_disk->disk_name,
+ b1->name, b2->name);
+ return -1;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(blk_integrity_compare);
+
+struct integrity_sysfs_entry {
+ struct attribute attr;
+ ssize_t (*show)(struct blk_integrity *, char *);
+ ssize_t (*store)(struct blk_integrity *, const char *, size_t);
+};
+
+static ssize_t integrity_attr_show(struct kobject *kobj, struct attribute *attr,
+ char *page)
+{
+ struct blk_integrity *bi =
+ container_of(kobj, struct blk_integrity, kobj);
+ struct integrity_sysfs_entry *entry =
+ container_of(attr, struct integrity_sysfs_entry, attr);
+ ssize_t ret = -EIO;
+
+ if (entry->show)
+ ret = entry->show(bi, page);
+
+ return ret;
+}
+
+static ssize_t integrity_attr_store(struct kobject *kobj, struct attribute *attr,
+ const char *page, size_t count)
+{
+ struct blk_integrity *bi =
+ container_of(kobj, struct blk_integrity, kobj);
+ struct integrity_sysfs_entry *entry =
+ container_of(attr, struct integrity_sysfs_entry, attr);
+ ssize_t ret = 0;
+
+ if (entry->store)
+ ret = entry->store(bi, page, count);
+
+ return ret;
+}
+
+static ssize_t integrity_format_show(struct blk_integrity *bi, char *page)
+{
+ if (bi != NULL && bi->name != NULL)
+ return sprintf(page, "%s\n", bi->name);
+ else
+ return sprintf(page, "none\n");
+}
+
+static ssize_t integrity_tag_size_show(struct blk_integrity *bi, char *page)
+{
+ if (bi != NULL)
+ return sprintf(page, "%u\n", bi->tag_size);
+ else
+ return sprintf(page, "0\n");
+}
+
+static ssize_t integrity_read_store(struct blk_integrity *bi,
+ const char *page, size_t count)
+{
+ char *p = (char *) page;
+ unsigned long val = simple_strtoul(p, &p, 10);
+
+ if (val == 1)
+ set_bit(INTEGRITY_FLAG_READ, &bi->flags);
+ else
+ clear_bit(INTEGRITY_FLAG_READ, &bi->flags);
+
+ return count;
+}
+
+static ssize_t integrity_read_show(struct blk_integrity *bi, char *page)
+{
+ return sprintf(page, "%d\n",
+ test_bit(INTEGRITY_FLAG_READ, &bi->flags) ? 1 : 0);
+}
+
+static ssize_t integrity_write_store(struct blk_integrity *bi,
+ const char *page, size_t count)
+{
+ char *p = (char *) page;
+ unsigned long val = simple_strtoul(p, &p, 10);
+
+ if (val == 1)
+ set_bit(INTEGRITY_FLAG_WRITE, &bi->flags);
+ else
+ clear_bit(INTEGRITY_FLAG_WRITE, &bi->flags);
+
+ return count;
+}
+
+static ssize_t integrity_write_show(struct blk_integrity *bi, char *page)
+{
+ return sprintf(page, "%d\n",
+ test_bit(INTEGRITY_FLAG_WRITE, &bi->flags) ? 1 : 0);
+}
+
+static struct integrity_sysfs_entry integrity_format_entry = {
+ .attr = { .name = "format", .mode = S_IRUGO },
+ .show = integrity_format_show,
+};
+
+static struct integrity_sysfs_entry integrity_tag_size_entry = {
+ .attr = { .name = "tag_size", .mode = S_IRUGO },
+ .show = integrity_tag_size_show,
+};
+
+static struct integrity_sysfs_entry integrity_read_entry = {
+ .attr = { .name = "read_verify", .mode = S_IRUGO | S_IWUSR },
+ .show = integrity_read_show,
+ .store = integrity_read_store,
+};
+
+static struct integrity_sysfs_entry integrity_write_entry = {
+ .attr = { .name = "write_generate", .mode = S_IRUGO | S_IWUSR },
+ .show = integrity_write_show,
+ .store = integrity_write_store,
+};
+
+static struct attribute *integrity_attrs[] = {
+ &integrity_format_entry.attr,
+ &integrity_tag_size_entry.attr,
+ &integrity_read_entry.attr,
+ &integrity_write_entry.attr,
+ NULL,
+};
+
+static struct sysfs_ops integrity_ops = {
+ .show = &integrity_attr_show,
+ .store = &integrity_attr_store,
+};
+
+static int __init blk_dev_integrity_init(void)
+{
+ integrity_cachep = kmem_cache_create("blkdev_integrity",
+ sizeof(struct blk_integrity),
+ 0, SLAB_PANIC, NULL);
+ return 0;
+}
+subsys_initcall(blk_dev_integrity_init);
+
+static void blk_integrity_release(struct kobject *kobj)
+{
+ struct blk_integrity *bi =
+ container_of(kobj, struct blk_integrity, kobj);
+
+ kmem_cache_free(integrity_cachep, bi);
+}
+
+static struct kobj_type integrity_ktype = {
+ .default_attrs = integrity_attrs,
+ .sysfs_ops = &integrity_ops,
+ .release = blk_integrity_release,
+};
+
+/**
+ * blk_integrity_register - Register a gendisk as being integrity-capable
+ * @disk: struct gendisk pointer to make integrity-aware
+ * @template: integrity profile
+ *
+ * Description: When a device needs to advertise itself as being able
+ * to send/receive integrity metadata it must use this function to
+ * register the capability with the block layer. The template is a
+ * blk_integrity struct with values appropriate for the underlying
+ * hardware. See Documentation/block/data-integrity.txt.
+ */
+int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
+{
+ struct blk_integrity *bi;
+
+ BUG_ON(disk == NULL);
+ BUG_ON(template == NULL);
+
+ if (disk->integrity == NULL) {
+ bi = kmem_cache_alloc(integrity_cachep, GFP_KERNEL | __GFP_ZERO);
+ if (!bi)
+ return -1;
+
+ if (kobject_init_and_add(&bi->kobj, &integrity_ktype,
+ &disk->dev.kobj, "%s", "integrity"))
+ return -1;
+
+ kobject_uevent(&bi->kobj, KOBJ_ADD);
+
+ set_bit(INTEGRITY_FLAG_READ, &bi->flags);
+ set_bit(INTEGRITY_FLAG_WRITE, &bi->flags);
+ bi->sector_size = disk->queue->hardsect_size;
+ disk->integrity = bi;
+ } else
+ bi = disk->integrity;
+
+ /* Use the provided profile as template */
+ bi->name = template->name;
+ bi->generate_fn = template->generate_fn;
+ bi->verify_fn = template->verify_fn;
+ bi->tuple_size = template->tuple_size;
+ bi->set_tag_fn = template->set_tag_fn;
+ bi->get_tag_fn = template->get_tag_fn;
+ bi->tag_size = template->tag_size;
+
+ return 0;
+}
+EXPORT_SYMBOL(blk_integrity_register);
+
+/**
+ * blk_integrity_unregister - Remove block integrity profile
+ * @disk: disk whose integrity profile to deallocate
+ *
+ * Description: This function frees all memory used by the block
+ * integrity profile. To be called at device teardown.
+ */
+void blk_integrity_unregister(struct gendisk *disk)
+{
+ struct blk_integrity *bi;
+
+ if (!disk || !disk->integrity)
+ return;
+
+ bi = disk->integrity;
+
+ kobject_uevent(&bi->kobj, KOBJ_REMOVE);
+ kobject_del(&bi->kobj);
+ kobject_put(&disk->dev.kobj);
+}
+EXPORT_SYMBOL(blk_integrity_unregister);
diff -r f2ae9d5bce4c -r 5be7c534c954 block/blk-merge.c
--- a/block/blk-merge.c Sat Jun 07 00:45:15 2008 -0400
+++ b/block/blk-merge.c Sat Jun 07 00:45:15 2008 -0400
@@ -441,6 +441,9 @@
|| next->special)
return 0;

+ if (blk_integrity_rq(req) != blk_integrity_rq(next))
+ return 0;
+
/*
* If we are allowed to merge, then append bio list
* from next to rq and release next. merge_requests_fn
diff -r f2ae9d5bce4c -r 5be7c534c954 block/blk.h
--- a/block/blk.h Sat Jun 07 00:45:15 2008 -0400
+++ b/block/blk.h Sat Jun 07 00:45:15 2008 -0400
@@ -51,4 +51,12 @@
return q->nr_congestion_off;
}

+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+
+#define rq_for_each_integrity_segment(bvl, _rq, _iter) \
+ __rq_for_each_bio(_iter.bio, _rq) \
+ bip_for_each_vec(bvl, _iter.bio->bi_integrity, _iter.i)
+
+#endif /* CONFIG_BLK_DEV_INTEGRITY */
+
#endif
diff -r f2ae9d5bce4c -r 5be7c534c954 block/elevator.c
--- a/block/elevator.c Sat Jun 07 00:45:15 2008 -0400
+++ b/block/elevator.c Sat Jun 07 00:45:15 2008 -0400
@@ -84,6 +84,12 @@
* must be same device and not a special request
*/
if (rq->rq_disk != bio->bi_bdev->bd_disk || rq->special)
+ return 0;
+
+ /*
+ * only merge integrity protected bio into ditto rq
+ */
+ if (bio_integrity(bio) != blk_integrity_rq(rq))
return 0;

if (!elv_iosched_allow_merge(rq, bio))
diff -r f2ae9d5bce4c -r 5be7c534c954 include/linux/blkdev.h
--- a/include/linux/blkdev.h Sat Jun 07 00:45:15 2008 -0400
+++ b/include/linux/blkdev.h Sat Jun 07 00:45:15 2008 -0400
@@ -113,6 +113,7 @@
__REQ_ALLOCED, /* request came from our alloc pool */
__REQ_RW_META, /* metadata io request */
__REQ_COPY_USER, /* contains copies of user pages */
+ __REQ_INTEGRITY, /* integrity metadata has been remapped */
__REQ_NR_BITS, /* stops here */
};

@@ -135,6 +136,7 @@
#define REQ_ALLOCED (1 << __REQ_ALLOCED)
#define REQ_RW_META (1 << __REQ_RW_META)
#define REQ_COPY_USER (1 << __REQ_COPY_USER)
+#define REQ_INTEGRITY (1 << __REQ_INTEGRITY)

#define BLK_MAX_CDB 16

@@ -866,6 +868,101 @@
MODULE_ALIAS("block-major-" __stringify(major) "-*")


+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+
+#define INTEGRITY_FLAG_READ 1 /* verify data integrity on read */
+#define INTEGRITY_FLAG_WRITE 2 /* generate data integrity on write */
+
+struct blk_integrity_exchg {
+ void *prot_buf;
+ void *data_buf;
+ sector_t sector;
+ unsigned int data_size;
+ unsigned short sector_size;
+ const char *disk_name;
+};
+
+typedef void (integrity_gen_fn) (struct blk_integrity_exchg *);
+typedef int (integrity_vrfy_fn) (struct blk_integrity_exchg *);
+typedef void (integrity_set_tag_fn) (void *, void *, unsigned int);
+typedef void (integrity_get_tag_fn) (void *, void *, unsigned int);
+
+struct blk_integrity {
+ integrity_gen_fn *generate_fn;
+ integrity_vrfy_fn *verify_fn;
+ integrity_set_tag_fn *set_tag_fn;
+ integrity_get_tag_fn *get_tag_fn;
+
+ unsigned short flags;
+ unsigned short tuple_size;
+ unsigned short sector_size;
+ unsigned short tag_size;
+
+ const char *name;
+
+ struct kobject kobj;
+};
+
+extern int blk_integrity_register(struct gendisk *, struct blk_integrity *);
+extern void blk_integrity_unregister(struct gendisk *);
+extern int blk_integrity_compare(struct block_device *, struct block_device *);
+extern int blk_rq_map_integrity_sg(struct request *, struct scatterlist *);
+extern int blk_rq_count_integrity_sg(struct request *);
+
+static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
+{
+ return (bi == NULL) ? 0 : bi->tuple_size;
+}
+
+static inline struct blk_integrity *bdev_get_integrity(struct block_device *bdev)
+{
+ return bdev->bd_disk->integrity;
+}
+
+static inline unsigned int bdev_get_tag_size(struct block_device *bdev)
+{
+ struct blk_integrity *bi = bdev_get_integrity(bdev);
+
+ return (bi == NULL) ? 0 : bi->tag_size;
+}
+
+static inline int bdev_integrity_enabled(struct block_device *bdev, int rw)
+{
+ struct blk_integrity *bi = bdev_get_integrity(bdev);
+
+ if (bi == NULL)
+ return 0;
+
+ if (rw == READ && bi->verify_fn != NULL &&
+ test_bit(INTEGRITY_FLAG_READ, &bi->flags))
+ return 1;
+
+ if (rw == WRITE && bi->generate_fn != NULL &&
+ test_bit(INTEGRITY_FLAG_WRITE, &bi->flags))
+ return 1;
+
+ return 0;
+}
+
+static inline int blk_integrity_rq(struct request *rq)
+{
+ BUG_ON(rq->bio == NULL);
+
+ return bio_integrity(rq->bio);
+}
+
+#else /* CONFIG_BLK_DEV_INTEGRITY */
+
+#define blk_integrity_rq(rq) (0)
+#define bdev_get_integrity(a) (0)
+#define bdev_get_tag_size(a) (0)
+#define blk_integrity_compare(a, b) (0)
+#define blk_integrity_register(a, b) (0)
+#define blk_integrity_unregister(a) do { } while (0)
+
+#endif /* CONFIG_BLK_DEV_INTEGRITY */
+
+
#else /* CONFIG_BLOCK */
/*
* stubs for when the block layer is configured out
diff -r f2ae9d5bce4c -r 5be7c534c954 include/linux/genhd.h
--- a/include/linux/genhd.h Sat Jun 07 00:45:15 2008 -0400
+++ b/include/linux/genhd.h Sat Jun 07 00:45:15 2008 -0400
@@ -141,6 +141,9 @@
struct disk_stats dkstats;
#endif
struct work_struct async_notify;
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+ struct blk_integrity *integrity;
+#endif
};

/*
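
The segment counting done by blk_rq_count_integrity_sg() in the patch
above can be modeled in user space. The sketch below is not kernel
code: struct seg, seg_mergeable() and count_integrity_segments() are
hypothetical names, and the merge test is simplified to plain address
contiguity, which is the assumption made here about what
BIOVEC_PHYS_MERGEABLE checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a bio_vec: physical address plus length. */
struct seg {
	uint64_t phys;
	uint32_t len;
};

/* Assumption: two vectors merge when the second starts where the first ends. */
static int seg_mergeable(const struct seg *prev, const struct seg *cur)
{
	return prev->phys + prev->len == cur->phys;
}

/*
 * Count the scatterlist elements needed for n integrity vectors.
 * A new element is needed for the first vector and for every vector
 * that is not contiguous with its predecessor.
 */
static unsigned int count_integrity_segments(const struct seg *v, size_t n)
{
	unsigned int segments = 0;
	size_t i;

	assert(v != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (i == 0 || !seg_mergeable(&v[i - 1], &v[i]))
			segments++;
	}

	return segments;
}
```

A caller would size its scatterlist with the returned count before
mapping, in the same way blk_rq_map_integrity_sg() expects a list
sized using blk_rq_count_integrity_sg().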

2008-06-07 05:01:12

by Martin K. Petersen

Subject: [PATCH 7 of 7] Support for SCSI disk (SBC) Data Integrity Field

12 files changed, 935 insertions(+), 7 deletions(-)
Documentation/scsi/data-integrity.txt | 57 ++
drivers/scsi/Kconfig | 1
drivers/scsi/Makefile | 2
drivers/scsi/scsi_error.c | 3
drivers/scsi/scsi_lib.c | 4
drivers/scsi/scsi_sysfs.c | 4
drivers/scsi/sd.c | 58 ++
drivers/scsi/sd.h | 22 +
drivers/scsi/sd_dif.c | 644 +++++++++++++++++++++++++++++++++
include/scsi/scsi_cmnd.h | 3
include/scsi/scsi_dif.h | 140 +++++++
include/scsi/scsi_host.h | 4


Configure DMA of protection information and issue READ/WRITE commands
with RDPROTECT/WRPROTECT set accordingly.

Force READ CAPACITY(16) if the target has the PROTECT bit set and grab
an extra byte of response (P_TYPE and PROT_EN are in byte 12).

Signed-off-by: Martin K. Petersen <[email protected]>

---

diff -r ad65bfde4e05 -r 8bc1728dc75a Documentation/scsi/data-integrity.txt
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/Documentation/scsi/data-integrity.txt Sat Jun 07 00:45:15 2008 -0400
@@ -0,0 +1,57 @@
+----------------------------------------------------------------------
+1.0 INTRODUCTION
+
+For a general overview of the data integrity framework please consult
+Documentation/block/data-integrity.txt.
+
+----------------------------------------------------------------------
+2.0 SCSI LAYER IMPLEMENTATION DETAILS
+
+The scsi_cmnd struct has been extended with a scatterlist for the
+integrity metadata. Note that all SCSI mid layer changes refer to
+this using the term "protection information" which is what it is
+called in the T10 spec.
+
+The term DIF (Data Integrity Field) is specific to SCSI disks (SBC).
+The SCSI midlayer doesn't know, or care, about the contents of the
+protection scatterlist, except it calls blk_rq_map_integrity_sg()
+during command initialization.
+
+
+2.1 SCSI DEVICE SCANNING
+
+A SCSI device has the PROTECT bit set in the standard INQUIRY page if
+it supports protection information. The state of this bit is saved in
+the scsi_device struct.
+
+
+2.2 SCSI DISK SETUP
+
+In the case of a SCSI disk the actual DIF protection format is
+contained in the result of READ CAPACITY(16). Consequently we have to
+use the 16-byte READ CAPACITY variant if the device is
+protection-capable.
+
+If the device is DIF-enabled, we'll negotiate capabilities with the
+HBA. If the HBA is capable of protection DMA, the blk_integrity
+profile will be registered.
+
+Currently we only support Type 1 and Type 3. Type 2 is only defined
+for 32-byte CDBs and is awaiting varlen CDB support.
+
+The controller may support checksum conversion as an optimization.
+Initial benchmarks showed that calculating a 16-bit CRC for each 512
+bytes of an I/O was expensive. Emulex' hardware had the capability to
+convert an IP checksum to the T10 CRC on the wire. So as part of the
+negotiation process the checksum algorithm will be selected and the
+blk_integrity profile set accordingly.
+
+----------------------------------------------------------------------
+3.0 HBA INTERFACE
+
+See the following doc:
+
+http://oss.oracle.com/projects/data-integrity/dist/documentation/linux-hba.pdf
+
+----------------------------------------------------------------------
+2007-12-24 Martin K. Petersen <[email protected]>
diff -r ad65bfde4e05 -r 8bc1728dc75a drivers/scsi/Kconfig
--- a/drivers/scsi/Kconfig Sat Jun 07 00:45:15 2008 -0400
+++ b/drivers/scsi/Kconfig Sat Jun 07 00:45:15 2008 -0400
@@ -265,6 +265,7 @@
bool "SCSI Data Integrity Protection"
depends on SCSI
depends on BLK_DEV_INTEGRITY
+ select CRC_T10DIF
help
Some SCSI devices support data protection features above and
beyond those implemented in the transport. Select this
diff -r ad65bfde4e05 -r 8bc1728dc75a drivers/scsi/Makefile
--- a/drivers/scsi/Makefile Sat Jun 07 00:45:15 2008 -0400
+++ b/drivers/scsi/Makefile Sat Jun 07 00:45:15 2008 -0400
@@ -149,6 +149,8 @@
scsi_tgt-y += scsi_tgt_lib.o scsi_tgt_if.o

sd_mod-objs := sd.o
+sd_mod-$(CONFIG_SCSI_PROTECTION) += sd_dif.o
+
sr_mod-objs := sr.o sr_ioctl.o sr_vendor.o
ncr53c8xx-flags-$(CONFIG_SCSI_ZALON) \
:= -DCONFIG_NCR53C8XX_PREFETCH -DSCSI_NCR_BIG_ENDIAN \
diff -r ad65bfde4e05 -r 8bc1728dc75a drivers/scsi/scsi_error.c
--- a/drivers/scsi/scsi_error.c Sat Jun 07 00:45:15 2008 -0400
+++ b/drivers/scsi/scsi_error.c Sat Jun 07 00:45:15 2008 -0400
@@ -333,6 +333,9 @@
return /* soft_error */ SUCCESS;

case ABORTED_COMMAND:
+ if (sshdr.asc == 0x10) /* DIF */
+ return SUCCESS;
+
return NEEDS_RETRY;
case NOT_READY:
case UNIT_ATTENTION:
diff -r ad65bfde4e05 -r 8bc1728dc75a drivers/scsi/scsi_lib.c
--- a/drivers/scsi/scsi_lib.c Sat Jun 07 00:45:15 2008 -0400
+++ b/drivers/scsi/scsi_lib.c Sat Jun 07 00:45:15 2008 -0400
@@ -947,6 +947,10 @@
scsi_requeue_command(q, cmd);
return;
} else {
+ if (sshdr.asc == 0x10) { /* DIF */
+ scsi_print_result(cmd);
+ scsi_print_sense("", cmd);
+ }
scsi_end_request(cmd, -EIO, this_count, 1);
return;
}
diff -r ad65bfde4e05 -r 8bc1728dc75a drivers/scsi/scsi_sysfs.c
--- a/drivers/scsi/scsi_sysfs.c Sat Jun 07 00:45:15 2008 -0400
+++ b/drivers/scsi/scsi_sysfs.c Sat Jun 07 00:45:15 2008 -0400
@@ -249,6 +249,8 @@
shost_rd_attr(can_queue, "%hd\n");
shost_rd_attr(sg_tablesize, "%hu\n");
shost_rd_attr(unchecked_isa_dma, "%d\n");
+shost_rd_attr(dif_capabilities, "%hd\n");
+shost_rd_attr(dif_guard_type, "%hd\n");
shost_rd_attr2(proc_name, hostt->proc_name, "%s\n");

static struct attribute *scsi_sysfs_shost_attrs[] = {
@@ -263,6 +265,8 @@
&dev_attr_hstate.attr,
&dev_attr_supported_mode.attr,
&dev_attr_active_mode.attr,
+ &dev_attr_dif_capabilities.attr,
+ &dev_attr_dif_guard_type.attr,
NULL
};

diff -r ad65bfde4e05 -r 8bc1728dc75a drivers/scsi/sd.c
--- a/drivers/scsi/sd.c Sat Jun 07 00:45:15 2008 -0400
+++ b/drivers/scsi/sd.c Sat Jun 07 00:45:15 2008 -0400
@@ -58,6 +58,7 @@
#include <scsi/scsi_host.h>
#include <scsi/scsi_ioctl.h>
#include <scsi/scsicam.h>
+#include <scsi/scsi_dif.h>

#include "sd.h"
#include "scsi_logging.h"
@@ -233,6 +234,24 @@
return snprintf(buf, 40, "%d\n", sdkp->device->allow_restart);
}

+static ssize_t
+sd_show_protection_type(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct scsi_disk *sdkp = to_scsi_disk(dev);
+
+ return snprintf(buf, 20, "%u\n", sdkp->protection_type);
+}
+
+static ssize_t
+sd_show_app_tag_own(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct scsi_disk *sdkp = to_scsi_disk(dev);
+
+ return snprintf(buf, 20, "%u\n", sdkp->ATO);
+}
+
static struct device_attribute sd_disk_attrs[] = {
__ATTR(cache_type, S_IRUGO|S_IWUSR, sd_show_cache_type,
sd_store_cache_type),
@@ -241,6 +260,8 @@
sd_store_allow_restart),
__ATTR(manage_start_stop, S_IRUGO|S_IWUSR, sd_show_manage_start_stop,
sd_store_manage_start_stop),
+ __ATTR(protection_type, S_IRUGO, sd_show_protection_type, NULL),
+ __ATTR(app_tag_own, S_IRUGO, sd_show_app_tag_own, NULL),
__ATTR_NULL,
};

@@ -353,6 +374,7 @@
struct scsi_cmnd *SCpnt;
struct scsi_device *sdp = q->queuedata;
struct gendisk *disk = rq->rq_disk;
+ struct scsi_disk *sdkp;
sector_t block = rq->sector;
unsigned int this_count = rq->nr_sectors;
unsigned int timeout = sdp->timeout;
@@ -369,6 +391,7 @@
if (ret != BLKPREP_OK)
goto out;
SCpnt = rq->special;
+ sdkp = scsi_disk(disk);

/* from here on until we're complete, any goto out
* is used for a killable error condition */
@@ -458,6 +481,11 @@
}
SCpnt->cmnd[0] = WRITE_6;
SCpnt->sc_data_direction = DMA_TO_DEVICE;
+
+ if (blk_integrity_rq(rq) &&
+ sd_dif_prepare(rq, block, sdp->sector_size) == -EIO)
+ goto out;
+
} else if (rq_data_dir(rq) == READ) {
SCpnt->cmnd[0] = READ_6;
SCpnt->sc_data_direction = DMA_FROM_DEVICE;
@@ -472,8 +500,13 @@
"writing" : "reading", this_count,
rq->nr_sectors));

- SCpnt->cmnd[1] = 0;
-
+ sd_dif_op(SCpnt);
+
+ if (scsi_host_dif_type(sdp->host, sdkp->protection_type))
+ SCpnt->cmnd[1] = 1 << 5;
+ else
+ SCpnt->cmnd[1] = 0;
+
if (block > 0xffffffff) {
SCpnt->cmnd[0] += READ_16 - READ_6;
SCpnt->cmnd[1] |= blk_fua_rq(rq) ? 0x8 : 0;
@@ -491,6 +524,7 @@
SCpnt->cmnd[13] = (unsigned char) this_count & 0xff;
SCpnt->cmnd[14] = SCpnt->cmnd[15] = 0;
} else if ((this_count > 0xff) || (block > 0x1fffff) ||
+ SCpnt->device->protection ||
SCpnt->device->use_10_for_rw) {
if (this_count > 0xffff)
this_count = 0xffff;
@@ -1004,7 +1038,8 @@
good_bytes = xfer_size;
break;
case ILLEGAL_REQUEST:
- if (SCpnt->device->use_10_for_rw &&
+ if (SCpnt->device->protection == 0 &&
+ SCpnt->device->use_10_for_rw &&
(SCpnt->cmnd[0] == READ_10 ||
SCpnt->cmnd[0] == WRITE_10))
SCpnt->device->use_10_for_rw = 0;
@@ -1017,6 +1052,9 @@
break;
}
out:
+ if (rq_data_dir(SCpnt->request) == READ && scsi_prot_sg_count(SCpnt))
+ sd_dif_complete(SCpnt, good_bytes);
+
return good_bytes;
}

@@ -1171,7 +1209,8 @@
unsigned char cmd[16];
int the_result, retries;
int sector_size = 0;
- int longrc = 0;
+ /* Force READ CAPACITY(16) when PROTECT=1 */
+ int longrc = sdkp->device->protection ? 1 : 0;
struct scsi_sense_hdr sshdr;
int sense_valid = 0;
struct scsi_device *sdp = sdkp->device;
@@ -1183,8 +1222,8 @@
memset((void *) cmd, 0, 16);
cmd[0] = SERVICE_ACTION_IN;
cmd[1] = SAI_READ_CAPACITY_16;
- cmd[13] = 12;
- memset((void *) buffer, 0, 12);
+ cmd[13] = 13;
+ memset((void *) buffer, 0, 13);
} else {
cmd[0] = READ_CAPACITY;
memset((void *) &cmd[1], 0, 9);
@@ -1192,7 +1231,7 @@
}

the_result = scsi_execute_req(sdp, cmd, DMA_FROM_DEVICE,
- buffer, longrc ? 12 : 8, &sshdr,
+ buffer, longrc ? 13 : 8, &sshdr,
SD_TIMEOUT, SD_MAX_RETRIES);

if (media_not_present(sdkp, &sshdr))
@@ -1267,6 +1306,8 @@

sector_size = (buffer[8] << 24) |
(buffer[9] << 16) | (buffer[10] << 8) | buffer[11];
+
+ sd_dif_config_disk(sdkp, buffer);
}

/* Some devices return the total number of sectors, not the
@@ -1564,6 +1605,7 @@
sdkp->write_prot = 0;
sdkp->WCE = 0;
sdkp->RCD = 0;
+ sdkp->ATO = 0;

sd_spinup_disk(sdkp);

@@ -1575,6 +1617,7 @@
sd_read_capacity(sdkp, buffer);
sd_read_write_protect_flag(sdkp, buffer);
sd_read_cache_type(sdkp, buffer);
+ sd_dif_app_tag_own(sdkp, buffer);
}

/*
@@ -1708,6 +1751,7 @@

dev_set_drvdata(dev, sdkp);
add_disk(gd);
+ sd_dif_config_host(sdkp);

sd_printk(KERN_NOTICE, sdkp, "Attached SCSI %sdisk\n",
sdp->removable ? "removable " : "");
diff -r ad65bfde4e05 -r 8bc1728dc75a drivers/scsi/sd.h
--- a/drivers/scsi/sd.h Sat Jun 07 00:45:15 2008 -0400
+++ b/drivers/scsi/sd.h Sat Jun 07 00:45:15 2008 -0400
@@ -41,7 +41,9 @@
u32 index;
u8 media_present;
u8 write_prot;
+ u8 protection_type;/* Data Integrity Field */
unsigned previous_state : 1;
+ unsigned ATO : 1; /* state of disk ATO bit */
unsigned WCE : 1; /* state of disk WCE bit */
unsigned RCD : 1; /* state of disk RCD bit, unused */
unsigned DPOFUA : 1; /* state of disk DPOFUA bit */
@@ -61,4 +63,24 @@
(sdsk)->disk->disk_name, ##a) : \
sdev_printk(prefix, (sdsk)->device, fmt, ##a)

+#if defined(CONFIG_SCSI_PROTECTION)
+
+extern unsigned char sd_dif_op(struct scsi_cmnd *);
+extern void sd_dif_app_tag_own(struct scsi_disk *, unsigned char *);
+extern void sd_dif_config_disk(struct scsi_disk *, unsigned char *);
+extern void sd_dif_config_host(struct scsi_disk *);
+extern int sd_dif_prepare(struct request *rq, sector_t, unsigned int);
+extern void sd_dif_complete(struct scsi_cmnd *, unsigned int);
+
+#else /* CONFIG_SCSI_PROTECTION */
+
+#define sd_dif_op(a) (0)
+#define sd_dif_app_tag_own(a, b) do { } while (0)
+#define sd_dif_config_disk(a, b) do { } while (0)
+#define sd_dif_config_host(a) do { } while (0)
+#define sd_dif_prepare(a, b, c) (0)
+#define sd_dif_complete(a, b) (0)
+
+#endif /* CONFIG_SCSI_PROTECTION */
+
#endif /* _SCSI_DISK_H */
diff -r ad65bfde4e05 -r 8bc1728dc75a drivers/scsi/sd_dif.c
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/drivers/scsi/sd_dif.c Sat Jun 07 00:45:15 2008 -0400
@@ -0,0 +1,644 @@
+/*
+ * sd_dif.c - SCSI Data Integrity Field
+ *
+ * Copyright (C) 2007, 2008 Oracle Corporation
+ * Written by: Martin K. Petersen <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING. If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139,
+ * USA.
+ *
+ */
+
+#include <linux/blkdev.h>
+#include <linux/crc-t10dif.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_dbg.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_driver.h>
+#include <scsi/scsi_eh.h>
+#include <scsi/scsi_ioctl.h>
+#include <scsi/scsicam.h>
+#include <scsi/scsi_dif.h>
+
+#include <net/checksum.h>
+
+#include "sd.h"
+
+typedef __u16 (csum_fn) (void *, unsigned int);
+
+static __u16 sd_dif_crc_fn(void *data, unsigned int len)
+{
+ return cpu_to_be16(crc_t10dif(data, len));
+}
+
+static __u16 sd_dif_ip_fn(void *data, unsigned int len)
+{
+ return ip_compute_csum(data, len);
+}
+
+/*
+ * Type 1 and Type 2 protection use the same format: 16 bit guard tag,
+ * 16 bit app tag, 32 bit reference tag.
+ */
+static void sd_dif_type1_generate(struct blk_integrity_exchg *bix, csum_fn *fn)
+{
+ void *buf = bix->data_buf;
+ struct sd_dif_tuple *sdt = bix->prot_buf;
+ sector_t sector = bix->sector;
+ unsigned int i;
+
+ for (i = 0 ; i < bix->data_size ; i += bix->sector_size, sdt++) {
+ sdt->guard_tag = fn(buf, bix->sector_size);
+ sdt->ref_tag = cpu_to_be32(sector & 0xffffffff);
+ sdt->app_tag = 0;
+
+ buf += bix->sector_size;
+ sector++;
+ }
+}
+
+static void sd_dif_type1_generate_crc(struct blk_integrity_exchg *bix)
+{
+ sd_dif_type1_generate(bix, sd_dif_crc_fn);
+}
+
+static void sd_dif_type1_generate_ip(struct blk_integrity_exchg *bix)
+{
+ sd_dif_type1_generate(bix, sd_dif_ip_fn);
+}
+
+static int sd_dif_type1_verify(struct blk_integrity_exchg *bix, csum_fn *fn)
+{
+ void *buf = bix->data_buf;
+ struct sd_dif_tuple *sdt = bix->prot_buf;
+ sector_t sector = bix->sector;
+ unsigned int i;
+ __u16 csum;
+
+ for (i = 0 ; i < bix->data_size ; i += bix->sector_size, sdt++) {
+ /* Unwritten sectors */
+ if (sdt->app_tag == 0xffff)
+ return 0;
+
+ /* Bad ref tag received from disk */
+ if (sdt->ref_tag == 0xffffffff) {
+ printk(KERN_ERR
+ "%s: bad phys ref tag on sector %lu\n",
+ bix->disk_name, sector);
+ return -EIO;
+ }
+
+ if (be32_to_cpu(sdt->ref_tag) != (sector & 0xffffffff)) {
+ printk(KERN_ERR
+ "%s: ref tag error on sector %lu (rcvd %u)\n",
+ bix->disk_name, sector,
+ be32_to_cpu(sdt->ref_tag));
+ return -EIO;
+ }
+
+ csum = fn(buf, bix->sector_size);
+
+ if (sdt->guard_tag != csum) {
+ printk(KERN_ERR "%s: guard tag error on sector %lu " \
+ "(rcvd %04x, data %04x)\n", bix->disk_name,
+ sector, be16_to_cpu(sdt->guard_tag),
+ be16_to_cpu(csum));
+ return -EIO;
+ }
+
+ buf += bix->sector_size;
+ sector++;
+ }
+
+ return 0;
+}
+
+static int sd_dif_type1_verify_crc(struct blk_integrity_exchg *bix)
+{
+ return sd_dif_type1_verify(bix, sd_dif_crc_fn);
+}
+
+static int sd_dif_type1_verify_ip(struct blk_integrity_exchg *bix)
+{
+ return sd_dif_type1_verify(bix, sd_dif_ip_fn);
+}
+
+/*
+ * Functions for interleaving and deinterleaving application tags
+ */
+static void sd_dif_type1_set_tag(void *prot, void *tag_buf, unsigned int sectors)
+{
+ struct sd_dif_tuple *sdt = prot;
+ char *tag = tag_buf;
+ unsigned int i, j;
+
+ for (i = 0, j = 0 ; i < sectors ; i++, j += 2, sdt++) {
+ sdt->app_tag = tag[j] << 8 | tag[j+1];
+ BUG_ON(sdt->app_tag == 0xffff);
+ }
+}
+
+static void sd_dif_type1_get_tag(void *prot, void *tag_buf, unsigned int sectors)
+{
+ struct sd_dif_tuple *sdt = prot;
+ char *tag = tag_buf;
+ unsigned int i, j;
+
+ for (i = 0, j = 0 ; i < sectors ; i++, j += 2, sdt++) {
+ tag[j] = (sdt->app_tag & 0xff00) >> 8;
+ tag[j+1] = sdt->app_tag & 0xff;
+ }
+}
+
+static struct blk_integrity dif_type1_integrity_crc = {
+ .name = "T10-DIF-TYPE1-CRC",
+ .generate_fn = sd_dif_type1_generate_crc,
+ .verify_fn = sd_dif_type1_verify_crc,
+ .get_tag_fn = sd_dif_type1_get_tag,
+ .set_tag_fn = sd_dif_type1_set_tag,
+ .tuple_size = sizeof(struct sd_dif_tuple),
+ .tag_size = 0,
+};
+
+static struct blk_integrity dif_type1_integrity_ip = {
+ .name = "T10-DIF-TYPE1-IP",
+ .generate_fn = sd_dif_type1_generate_ip,
+ .verify_fn = sd_dif_type1_verify_ip,
+ .get_tag_fn = sd_dif_type1_get_tag,
+ .set_tag_fn = sd_dif_type1_set_tag,
+ .tuple_size = sizeof(struct sd_dif_tuple),
+ .tag_size = 0,
+};
+
+
+/*
+ * Type 3 protection has a 16-bit guard tag and 16 + 32 bits of opaque tag space.
+ */
+static void sd_dif_type3_generate(struct blk_integrity_exchg *bix, csum_fn *fn)
+{
+ void *buf = bix->data_buf;
+ struct sd_dif_tuple *sdt = bix->prot_buf;
+ unsigned int i;
+
+ for (i = 0 ; i < bix->data_size ; i += bix->sector_size, sdt++) {
+ sdt->guard_tag = fn(buf, bix->sector_size);
+ sdt->ref_tag = 0;
+ sdt->app_tag = 0;
+
+ buf += bix->sector_size;
+ }
+}
+
+static void sd_dif_type3_generate_crc(struct blk_integrity_exchg *bix)
+{
+ sd_dif_type3_generate(bix, sd_dif_crc_fn);
+}
+
+static void sd_dif_type3_generate_ip(struct blk_integrity_exchg *bix)
+{
+ sd_dif_type3_generate(bix, sd_dif_ip_fn);
+}
+
+static int sd_dif_type3_verify(struct blk_integrity_exchg *bix, csum_fn *fn)
+{
+ void *buf = bix->data_buf;
+ struct sd_dif_tuple *sdt = bix->prot_buf;
+ sector_t sector = bix->sector;
+ unsigned int i;
+ __u16 csum;
+
+ for (i = 0 ; i < bix->data_size ; i += bix->sector_size, sdt++) {
+ /* Unwritten sectors */
+ if (sdt->app_tag == 0xffff && sdt->ref_tag == 0xffffffff)
+ return 0;
+
+ csum = fn(buf, bix->sector_size);
+
+ if (sdt->guard_tag != csum) {
+ printk(KERN_ERR "%s: guard tag error on sector %lu " \
+ "(rcvd %04x, data %04x)\n", bix->disk_name,
+ sector, be16_to_cpu(sdt->guard_tag),
+ be16_to_cpu(csum));
+ return -EIO;
+ }
+
+ buf += bix->sector_size;
+ sector++;
+ }
+
+ return 0;
+}
+
+static int sd_dif_type3_verify_crc(struct blk_integrity_exchg *bix)
+{
+ return sd_dif_type3_verify(bix, sd_dif_crc_fn);
+}
+
+static int sd_dif_type3_verify_ip(struct blk_integrity_exchg *bix)
+{
+ return sd_dif_type3_verify(bix, sd_dif_ip_fn);
+}
+
+static void sd_dif_type3_set_tag(void *prot, void *tag_buf, unsigned int sectors)
+{
+ struct sd_dif_tuple *sdt = prot;
+ char *tag = tag_buf;
+ unsigned int i, j;
+
+ for (i = 0, j = 0 ; i < sectors ; i++, j += 6, sdt++) {
+ sdt->app_tag = tag[j] << 8 | tag[j+1];
+ sdt->ref_tag = tag[j+2] << 24 | tag[j+3] << 16 |
+ tag[j+4] << 8 | tag[j+5];
+ }
+}
+
+static void sd_dif_type3_get_tag(void *prot, void *tag_buf, unsigned int sectors)
+{
+ struct sd_dif_tuple *sdt = prot;
+ char *tag = tag_buf;
+ unsigned int i, j;
+
+ for (i = 0, j = 0 ; i < sectors ; i++, j += 6, sdt++) {
+ tag[j] = (sdt->app_tag & 0xff00) >> 8;
+ tag[j+1] = sdt->app_tag & 0xff;
+ tag[j+2] = (sdt->ref_tag & 0xff000000) >> 24;
+ tag[j+3] = (sdt->ref_tag & 0xff0000) >> 16;
+ tag[j+4] = (sdt->ref_tag & 0xff00) >> 8;
+ tag[j+5] = sdt->ref_tag & 0xff;
+ BUG_ON(sdt->app_tag == 0xffff || sdt->ref_tag == 0xffffffff);
+ }
+}
+
+static struct blk_integrity dif_type3_integrity_crc = {
+ .name = "T10-DIF-TYPE3-CRC",
+ .generate_fn = sd_dif_type3_generate_crc,
+ .verify_fn = sd_dif_type3_verify_crc,
+ .get_tag_fn = sd_dif_type3_get_tag,
+ .set_tag_fn = sd_dif_type3_set_tag,
+ .tuple_size = sizeof(struct sd_dif_tuple),
+ .tag_size = 0,
+};
+
+static struct blk_integrity dif_type3_integrity_ip = {
+ .name = "T10-DIF-TYPE3-IP",
+ .generate_fn = sd_dif_type3_generate_ip,
+ .verify_fn = sd_dif_type3_verify_ip,
+ .get_tag_fn = sd_dif_type3_get_tag,
+ .set_tag_fn = sd_dif_type3_set_tag,
+ .tuple_size = sizeof(struct sd_dif_tuple),
+ .tag_size = 0,
+};
+
+
+/*
+ * The ATO bit indicates whether the application tag is available for
+ * use by the operating system.
+ */
+void sd_dif_app_tag_own(struct scsi_disk *sdkp, unsigned char *buffer)
+{
+ int res, offset;
+ struct scsi_device *sdp = sdkp->device;
+ struct scsi_mode_data data;
+ struct scsi_sense_hdr sshdr;
+
+ if (sdp->type != TYPE_DISK)
+ return;
+
+ if (sdkp->protection_type == 0)
+ return;
+
+ res = scsi_mode_sense(sdp, 1, 0x0a, buffer, 36, SD_TIMEOUT,
+ SD_MAX_RETRIES, &data, &sshdr);
+
+ if (!scsi_status_is_good(res) || !data.header_length ||
+ data.length < 6) {
+ sd_printk(KERN_WARNING, sdkp,
+ "getting Control mode page failed, assume no ATO\n");
+
+ if (scsi_sense_valid(&sshdr))
+ sd_print_sense_hdr(sdkp, &sshdr);
+
+ goto no_ato;
+ }
+
+ offset = data.header_length + data.block_descriptor_length;
+
+ if ((buffer[offset] & 0x3f) != 0x0a) {
+ sd_printk(KERN_ERR, sdkp, "ATO Got wrong page\n");
+ goto no_ato;
+ }
+
+ if ((buffer[offset + 5] & 0x80) == 0)
+ goto no_ato;
+
+ sdkp->ATO = 1;
+ sd_printk(KERN_NOTICE, sdkp, "ATO Enabled\n");
+
+ return;
+
+no_ato:
+ sd_printk(KERN_NOTICE, sdkp, "ATO Disabled\n");
+}
+
+/*
+ * Determine whether disk supports Data Integrity Field.
+ */
+void sd_dif_config_disk(struct scsi_disk *sdkp, unsigned char *buffer)
+{
+ struct scsi_device *sdp = sdkp->device;
+ u8 type;
+
+ if (sdp->protection == 0 || (buffer[12] & 1) == 0)
+ type = 0;
+ else
+ type = ((buffer[12] >> 1) & 7) + 1; /* P_TYPE 0 = Type 1 */
+
+ switch (type) {
+ case SCSI_DIF_TYPE0_PROTECTION:
+ sd_printk(KERN_NOTICE, sdkp, "formatted without data " \
+ "integrity protection\n");
+ sdkp->protection_type = 0;
+ break;
+
+ case SCSI_DIF_TYPE1_PROTECTION:
+ case SCSI_DIF_TYPE3_PROTECTION:
+ sd_printk(KERN_NOTICE, sdkp, "formatted with DIF Type %d " \
+ "protection\n", type);
+ sdkp->protection_type = type;
+ break;
+
+ case SCSI_DIF_TYPE2_PROTECTION:
+ sd_printk(KERN_ERR, sdkp, "formatted with DIF Type 2 " \
+ "protection which is currently unsupported. " \
+ "Disabling disk!\n");
+ goto disable;
+
+ default:
+ sd_printk(KERN_ERR, sdkp, "formatted with unknown " \
+ "protection type %d. Disabling disk!\n", type);
+ goto disable;
+ }
+
+ return;
+
+disable:
+ sdkp->protection_type = 0;
+ sdkp->capacity = 0;
+}
+
+/*
+ * Configure exchange of protection information between OS and HBA.
+ */
+void sd_dif_config_host(struct scsi_disk *sdkp)
+{
+ struct scsi_device *sdp = sdkp->device;
+ struct gendisk *disk = sdkp->disk;
+ u8 type = sdkp->protection_type;
+
+ /* Does HBA support protection DMA? */
+ if (scsi_host_dif_dma(sdp->host) == 0) {
+
+ if (type) {
+ sd_printk(KERN_NOTICE, sdkp, "Type %d protection " \
+ "unsupported by HBA. No protection DMA!\n",
+ type);
+ sdkp->protection_type = 0;
+ }
+
+ return;
+ }
+
+ /* Does HBA support this type? */
+ if (scsi_host_dif_type(sdp->host, type) == 0) {
+ sd_printk(KERN_NOTICE, sdkp, "Type %d protection " \
+ "unsupported by HBA. Disabling DIF!\n", type);
+ sdkp->protection_type = 0;
+ return;
+ }
+
+ if (scsi_host_guard_type(sdkp->device->host) & SCSI_DIF_GUARD_IP)
+ if (type == SCSI_DIF_TYPE3_PROTECTION)
+ blk_integrity_register(disk, &dif_type3_integrity_ip);
+ else
+ blk_integrity_register(disk, &dif_type1_integrity_ip);
+ else
+ if (type == SCSI_DIF_TYPE3_PROTECTION)
+ blk_integrity_register(disk, &dif_type3_integrity_crc);
+ else
+ blk_integrity_register(disk, &dif_type1_integrity_crc);
+
+ sd_printk(KERN_INFO, sdkp,
+ "Enabling %s integrity protection between OS and HBA\n",
+ disk->integrity->name);
+
+ /* Signal to block layer that we support sector tagging */
+ if (type && sdkp->ATO) {
+ if (type == SCSI_DIF_TYPE3_PROTECTION)
+ disk->integrity->tag_size = sizeof(u16) + sizeof(u32);
+ else
+ disk->integrity->tag_size = sizeof(u16);
+
+ sd_printk(KERN_INFO, sdkp, "DIF application tag size %u\n",
+ disk->integrity->tag_size);
+ }
+}
+
+/*
+ * DIF DMA operation magic decoder ring. DIF-capable HBA drivers
+ * should call this function in their queuecommand to determine how to
+ * handle the I/O.
+ */
+unsigned char sd_dif_op(struct scsi_cmnd *scmd)
+{
+ struct request *rq = scmd->request;
+ struct scsi_disk *sdkp;
+ int hba_to_disk, os_to_hba, csum_convert;
+
+ if (rq->cmd_type != REQ_TYPE_FS)
+ return SCSI_DIF_NORMAL;
+
+ /* Protection information between HBA and storage device */
+ sdkp = scsi_disk(rq->rq_disk);
+ hba_to_disk = sdkp->protection_type;
+
+ /* Protection information passed between OS and HBA */
+ os_to_hba = scsi_prot_sg_count(scmd);
+
+ /* Convert checksum? */
+ if (scsi_host_guard_type(scmd->device->host) == SCSI_DIF_GUARD_IP)
+ csum_convert = 1;
+ else
+ csum_convert = 0;
+
+ switch (scmd->cmnd[0]) {
+ case READ_10:
+ case READ_12:
+ case READ_16:
+ if (hba_to_disk && os_to_hba)
+ return csum_convert ?
+ SCSI_DIF_READ_CONVERT :
+ SCSI_DIF_READ_PASS;
+
+ else if (hba_to_disk && !os_to_hba)
+ return SCSI_DIF_READ_STRIP;
+
+ else if (!hba_to_disk && os_to_hba)
+ return SCSI_DIF_READ_INSERT;
+
+ break;
+
+ case WRITE_10:
+ case WRITE_12:
+ case WRITE_16:
+ if (hba_to_disk && os_to_hba)
+ return csum_convert ?
+ SCSI_DIF_WRITE_CONVERT :
+ SCSI_DIF_WRITE_PASS;
+
+ else if (hba_to_disk && !os_to_hba)
+ return SCSI_DIF_WRITE_INSERT;
+
+ else if (!hba_to_disk && os_to_hba)
+ return SCSI_DIF_WRITE_STRIP;
+
+ break;
+ }
+
+ return SCSI_DIF_NORMAL;
+}
+
+/*
+ * The virtual start sector is the one that was originally submitted
+ * by the block layer. Due to partitioning, MD/DM cloning, etc. the
+ * actual physical start sector is likely to be different. Remap
+ * protection information to match the physical LBA.
+ *
+ * From a protocol perspective there's a slight difference between
+ * Type 1 and 2. The latter uses 32-byte CDBs exclusively, and the
+ * reference tag is seeded in the CDB. This gives us the potential to
+ * avoid virt->phys remapping during write. However, at read time we
+ * don't know whether the virt sector is the same as when we wrote it
+ * (we could be reading from real disk as opposed to MD/DM device). So
+ * we always remap Type 2 making it identical to Type 1.
+ *
+ * Type 3 leaves the reference tag opaque so no remapping is required.
+ */
+int sd_dif_prepare(struct request *rq, sector_t hw_sector, unsigned int sector_sz)
+{
+ const int tuple_sz = sizeof(struct sd_dif_tuple);
+ struct bio *bio;
+ struct scsi_disk *sdkp;
+ struct sd_dif_tuple *sdt;
+ unsigned int i, j;
+ u32 phys, virt;
+
+ /* Already remapped? */
+ if (rq->cmd_flags & REQ_INTEGRITY)
+ return 0;
+
+ sdkp = rq->bio->bi_bdev->bd_disk->private_data;
+
+ if (sdkp->protection_type == SCSI_DIF_TYPE3_PROTECTION)
+ return 0;
+
+ rq->cmd_flags |= REQ_INTEGRITY;
+ phys = hw_sector & 0xffffffff;
+
+ __rq_for_each_bio(bio, rq) {
+ struct bio_vec *iv;
+
+ virt = bio->bi_integrity->bip_sector & 0xffffffff;
+
+ bip_for_each_vec(iv, bio->bi_integrity, i) {
+ sdt = kmap_atomic(iv->bv_page, KM_USER0) + iv->bv_offset;
+
+ for (j = 0 ; j < iv->bv_len ; j += tuple_sz, sdt++) {
+
+ if (be32_to_cpu(sdt->ref_tag) != virt)
+ goto error;
+
+ sdt->ref_tag = cpu_to_be32(phys);
+ virt++;
+ phys++;
+ }
+
+ kunmap_atomic(iv->bv_page, KM_USER0);
+ }
+ }
+
+ return 0;
+
+error:
+ sd_printk(KERN_ERR, sdkp, "%s: virt %u, phys %u, ref %u\n",
+ __func__, virt, phys, be32_to_cpu(sdt->ref_tag));
+
+ return -EIO;
+}
+
+/*
+ * Remap physical sector values in the reference tag to the virtual
+ * values expected by the block layer.
+ */
+void sd_dif_complete(struct scsi_cmnd *scmd, unsigned int good_bytes)
+{
+ const int tuple_sz = sizeof(struct sd_dif_tuple);
+ struct scsi_disk *sdkp;
+ struct bio *bio;
+ struct sd_dif_tuple *sdt;
+ unsigned int i, j, sectors, sector_sz;
+ u32 phys, virt;
+
+ sdkp = scsi_disk(scmd->request->rq_disk);
+
+ if (sdkp->protection_type == SCSI_DIF_TYPE3_PROTECTION)
+ return;
+
+ sector_sz = scmd->device->sector_size;
+ sectors = good_bytes / sector_sz;
+
+ phys = scmd->request->sector & 0xffffffff;
+ if (sector_sz == 4096)
+ phys >>= 3;
+
+ __rq_for_each_bio(bio, scmd->request) {
+ struct bio_vec *iv;
+
+ virt = bio->bi_integrity->bip_sector & 0xffffffff;
+
+ bip_for_each_vec(iv, bio->bi_integrity, i) {
+ sdt = kmap_atomic(iv->bv_page, KM_USER0) + iv->bv_offset;
+
+ for (j = 0 ; j < iv->bv_len ; j += tuple_sz, sdt++) {
+
+ if (sectors == 0)
+ return;
+
+ if (be32_to_cpu(sdt->ref_tag) != phys &&
+ sdt->app_tag != 0xffff)
+ sdt->ref_tag = 0xffffffff; /* Bad ref */
+ else
+ sdt->ref_tag = cpu_to_be32(virt);
+
+ virt++;
+ phys++;
+ sectors--;
+ }
+
+ kunmap_atomic(iv->bv_page, KM_USER0);
+ }
+ }
+}
+
diff -r ad65bfde4e05 -r 8bc1728dc75a include/scsi/scsi_cmnd.h
--- a/include/scsi/scsi_cmnd.h Sat Jun 07 00:45:15 2008 -0400
+++ b/include/scsi/scsi_cmnd.h Sat Jun 07 00:45:15 2008 -0400
@@ -78,6 +78,9 @@
int allowed;
int timeout_per_command;

+#if defined(CONFIG_SCSI_PROTECTION)
+ char prot_op;
+#endif
unsigned short cmd_len;
enum dma_data_direction sc_data_direction;

diff -r ad65bfde4e05 -r 8bc1728dc75a include/scsi/scsi_dif.h
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/include/scsi/scsi_dif.h Sat Jun 07 00:45:15 2008 -0400
@@ -0,0 +1,140 @@
+/*
+ * scsi_dif.h - SCSI Data Integrity Field
+ *
+ * Copyright (C) 2007, 2008 Oracle Corporation
+ * Written by: Martin K. Petersen <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING. If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139,
+ * USA.
+ *
+ */
+
+#ifndef _SCSI_SCSI_DIF_H
+#define _SCSI_SCSI_DIF_H
+
+#include <scsi/scsi_host.h>
+
+/*
+ * Type 1 through 3 indicate the DIF format. Type H is for protection
+ * between OS and HBA only. The DMA flag indicates that the initiator
+ * is capable of transferring protection data to and from host memory.
+ */
+
+enum scsi_host_dif_capabilities {
+ SHOST_DIF_TYPE1_PROTECTION = 1 << 0,
+ SHOST_DIF_TYPE2_PROTECTION = 1 << 1,
+ SHOST_DIF_TYPE3_PROTECTION = 1 << 2,
+ SHOST_DIF_TYPEH_PROTECTION = 1 << 6,
+ SHOST_DIF_PROTECTION_DMA = 1 << 7,
+};
+
+static inline void scsi_host_set_dif_caps(struct Scsi_Host *shost, unsigned char mask)
+{
+ shost->dif_capabilities = mask;
+}
+
+static inline unsigned char scsi_host_dif_dma(struct Scsi_Host *shost)
+{
+ return shost->dif_capabilities & SHOST_DIF_PROTECTION_DMA;
+}
+
+static inline unsigned char scsi_host_dif_type(struct Scsi_Host *shost, unsigned int target_type)
+{
+ if (target_type == 0)
+ return shost->dif_capabilities & SHOST_DIF_TYPEH_PROTECTION;
+
+ return shost->dif_capabilities & (1 << (target_type - 1));
+}
+
+/*
+ * All DIF-capable initiators must support the T10-mandated CRC
+ * checksum. Controllers can optionally implement the IP checksum
+ * scheme which has much lower impact on system performance. Note
+ * that the main rationale for the checksum is to match integrity
+ * metadata with data. Detecting bit errors is a job for ECC memory
+ * and buses.
+ */
+
+enum scsi_host_guard_types {
+ SCSI_DIF_GUARD_CRC = 1 << 0,
+ SCSI_DIF_GUARD_IP = 1 << 1,
+};
+
+static inline void scsi_host_set_guard_type(struct Scsi_Host *shost, unsigned char type)
+{
+ shost->dif_guard_type = type;
+}
+
+static inline unsigned char scsi_host_guard_type(struct Scsi_Host *shost)
+{
+ return shost->dif_guard_type;
+}
+
+/*
+ * Depending on the protection scheme implemented by initiator and
+ * target device, the request needs to be routed accordingly. The
+ * host operations below are hints that tell the controller driver how
+ * to handle the I/O.
+ */
+
+enum scsi_host_dif_operations {
+ /* Normal I/O */
+ SCSI_DIF_NORMAL = 0,
+
+ /* OS-HBA: Protected, HBA-Target: Unprotected */
+ SCSI_DIF_READ_INSERT,
+ SCSI_DIF_WRITE_STRIP,
+
+ /* OS-HBA: Unprotected, HBA-Target: Protected */
+ SCSI_DIF_READ_STRIP,
+ SCSI_DIF_WRITE_INSERT,
+
+ /* OS-HBA: Protected, HBA-Target: Protected */
+ SCSI_DIF_READ_PASS,
+ SCSI_DIF_WRITE_PASS,
+
+ /* OS-HBA: Protected, HBA-Target: Protected, checksum conversion */
+ SCSI_DIF_READ_CONVERT,
+ SCSI_DIF_WRITE_CONVERT,
+};
+
+
+/* A DIF-capable target device can be formatted with different
+ * protection schemes. Currently 0 through 3 are defined:
+ *
+ * Type 0 is regular (unprotected I/O)
+ *
+ * Type 1 defines the contents of the guard and reference tags
+ *
+ * Type 2 defines the contents of the guard and reference tags and
+ * uses 32-byte commands to seed the latter
+ *
+ * Type 3 defines the contents of the guard tag only
+ */
+
+enum sd_dif_target_protection_types {
+ SCSI_DIF_TYPE0_PROTECTION = 0x0,
+ SCSI_DIF_TYPE1_PROTECTION = 0x1,
+ SCSI_DIF_TYPE2_PROTECTION = 0x2,
+ SCSI_DIF_TYPE3_PROTECTION = 0x3,
+};
+
+/* DIF contents are considered data and consequently host-endian */
+struct sd_dif_tuple {
+ __u16 guard_tag;
+ __u16 app_tag;
+ __u32 ref_tag;
+};
+
+#endif /* _SCSI_SCSI_DIF_H */
diff -r ad65bfde4e05 -r 8bc1728dc75a include/scsi/scsi_host.h
--- a/include/scsi/scsi_host.h Sat Jun 07 00:45:15 2008 -0400
+++ b/include/scsi/scsi_host.h Sat Jun 07 00:45:15 2008 -0400
@@ -636,6 +636,10 @@
*/
unsigned int max_host_blocked;

+ /* Data Integrity Field */
+ unsigned char dif_capabilities;
+ unsigned char dif_guard_type;
+
/*
* q used for scsi_tgt msgs, async events or any other requests that
* need to be processed in userspace
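
Illustration (not part of the posted patch): a simplified userspace sketch
of the reference tag remapping that sd_dif_prepare() and sd_dif_complete()
perform for Type 1. It rests on several assumptions: struct tuple is only a
stand-in for struct sd_dif_tuple, ref tags are kept host-endian here (the
patch converts with cpu_to_be32()/be32_to_cpu()), and the bio and page
mapping is replaced by a flat array. The names remap_prepare() and
remap_complete() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct sd_dif_tuple (host-endian fields). */
struct tuple {
    uint16_t guard_tag;
    uint16_t app_tag;
    uint32_t ref_tag;
};

/*
 * Write path: each tuple must carry the virtual sector number the block
 * layer submitted; rewrite it to the physical LBA. Returns 0 on success
 * and -1 on the first mismatch (tuples before it stay rewritten, as in
 * the patch).
 */
static int remap_prepare(struct tuple *t, size_t n, uint32_t virt, uint32_t phys)
{
    size_t i;

    assert(t != NULL);

    for (i = 0; i < n; i++, virt++, phys++) {
        if (t[i].ref_tag != virt)
            return -1;
        t[i].ref_tag = phys;
    }

    return 0;
}

/*
 * Read completion: map physical ref tags back to virtual ones. A tuple
 * whose ref tag does not match the physical LBA is marked bad unless its
 * app tag is 0xffff.
 */
static void remap_complete(struct tuple *t, size_t n, uint32_t virt, uint32_t phys)
{
    size_t i;

    assert(t != NULL);

    for (i = 0; i < n; i++, virt++, phys++) {
        if (t[i].ref_tag != phys && t[i].app_tag != 0xffff)
            t[i].ref_tag = 0xffffffff;  /* Bad ref */
        else
            t[i].ref_tag = virt;
    }
}
```

A write followed by a read of the same range leaves the tuples with the
virtual sector numbers they started with, which is what the block layer
expects to see when it verifies the data.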

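Illustration (not part of the posted patch): the decision table in
sd_dif_op() can be restated as a small pure function. This is a sketch
only. The enum and pick_dif_op() are illustrative names, and the four
inputs are assumed to have been derived the way sd_dif_op() derives them
(command direction, protection scatterlist present, disk formatted with
protection, host uses the IP checksum).

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative mirror of enum scsi_host_dif_operations. */
enum dif_op {
    DIF_NORMAL = 0,
    DIF_READ_INSERT,
    DIF_WRITE_STRIP,
    DIF_READ_STRIP,
    DIF_WRITE_INSERT,
    DIF_READ_PASS,
    DIF_WRITE_PASS,
    DIF_READ_CONVERT,
    DIF_WRITE_CONVERT,
};

static enum dif_op pick_dif_op(bool is_write, bool os_to_hba,
                               bool hba_to_disk, bool csum_convert)
{
    /* Both legs protected: pass through, converting the checksum if needed. */
    if (os_to_hba && hba_to_disk) {
        if (csum_convert)
            return is_write ? DIF_WRITE_CONVERT : DIF_READ_CONVERT;
        return is_write ? DIF_WRITE_PASS : DIF_READ_PASS;
    }

    /* Only the disk leg is protected: HBA adds on write, removes on read. */
    if (hba_to_disk)
        return is_write ? DIF_WRITE_INSERT : DIF_READ_STRIP;

    /* Only the OS leg is protected: HBA removes on write, adds on read. */
    if (os_to_hba)
        return is_write ? DIF_WRITE_STRIP : DIF_READ_INSERT;

    return DIF_NORMAL;
}
```

Note that the checksum conversion flag only matters when both legs carry
protection information; in every other case it is ignored.
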
2008-06-07 14:55:38

by Dmitry Monakhov

Subject: Re: [PATCH 4 of 7] block: bio data integrity support

"Martin K. Petersen" <[email protected]> writes:

> 4 files changed, 825 insertions(+), 3 deletions(-)
> fs/Makefile | 1
> fs/bio-integrity.c | 715 +++++++++++++++++++++++++++++++++++++++++++++++++++
> fs/bio.c | 27 +
> include/linux/bio.h | 85 ++++++
>
>
> Allows integrity metadata to be attached to a bio.
>
> Signed-off-by: Martin K. Petersen <[email protected]>
>
> ---
>
> diff -r 318fa71e735d -r f2ae9d5bce4c fs/Makefile
> --- a/fs/Makefile Sat Jun 07 00:45:14 2008 -0400
> +++ b/fs/Makefile Sat Jun 07 00:45:15 2008 -0400
> @@ -19,6 +19,7 @@
> obj-y += no-block.o
> endif
>
> +obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o
> obj-$(CONFIG_INOTIFY) += inotify.o
> obj-$(CONFIG_INOTIFY_USER) += inotify_user.o
> obj-$(CONFIG_EPOLL) += eventpoll.o
> diff -r 318fa71e735d -r f2ae9d5bce4c fs/bio-integrity.c
> --- /dev/null Thu Jan 01 00:00:00 1970 +0000
> +++ b/fs/bio-integrity.c Sat Jun 07 00:45:15 2008 -0400
> @@ -0,0 +1,715 @@
> +/*
> + * bio-integrity.c - bio data integrity extensions
> + *
> + * Copyright (C) 2007, 2008 Oracle Corporation
> + * Written by: Martin K. Petersen <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; see the file COPYING. If not, write to
> + * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139,
> + * USA.
> + *
> + */
> +
> +#include <linux/blkdev.h>
> +#include <linux/mempool.h>
> +#include <linux/bio.h>
> +#include <linux/workqueue.h>
> +
> +static struct kmem_cache *bio_integrity_slab __read_mostly;
> +static struct workqueue_struct *kintegrityd_wq;
> +
> +/**
> + * bio_integrity_alloc_bioset - Allocate integrity payload and attach it to bio
> + * @bio: bio to attach integrity metadata to
> + * @gfp_mask: Memory allocation mask
> + * @nr_vecs: Number of integrity metadata scatter-gather elements
> + * @bs: bio_set to allocate from
> + *
> + * Description: This function prepares a bio for attaching integrity
> + * metadata. nr_vecs specifies the maximum number of pages containing
> + * integrity metadata that can be attached.
> + */
> +struct bip *bio_integrity_alloc_bioset(struct bio *bio, gfp_t gfp_mask, unsigned int nr_vecs, struct bio_set *bs)
> +{
> + struct bip *bip;
> + struct bio_vec *bv;
> + unsigned long idx;
> +
> + BUG_ON(bio == NULL);
> +
> + bip = mempool_alloc(bs->bio_integrity_pool, gfp_mask);
> + if (unlikely(bip == NULL)) {
> + printk(KERN_ERR "%s: could not alloc bip\n", __func__);
> + return NULL;
> + }
> +
> + memset(bip, 0, sizeof(*bip));
> + idx = 0;
> +
> + bv = bvec_alloc_bs(gfp_mask, nr_vecs, &idx, bs);
> + if (unlikely(bv == NULL)) {
> + printk(KERN_ERR "%s: could not alloc bip_vec\n", __func__);
> + mempool_free(bip, bs->bio_integrity_pool);
> + return NULL;
> + }
> +
> + bip->bip_pool = idx;
> + bip->bip_vec = bv;
> + bip->bip_bio = bio;
> + bio->bi_integrity = bip;
> +
> + return bip;
> +}
> +EXPORT_SYMBOL(bio_integrity_alloc_bioset);
> +
> +/**
> + * bio_integrity_alloc - Allocate integrity payload and attach it to bio
> + * @bio: bio to attach integrity metadata to
> + * @gfp_mask: Memory allocation mask
> + * @nr_vecs: Number of integrity metadata scatter-gather elements
> + *
> + * Description: This function prepares a bio for attaching integrity
> + * metadata. nr_vecs specifies the maximum number of pages containing
> + * integrity metadata that can be attached.
> + */
> +struct bip *bio_integrity_alloc(struct bio *bio, gfp_t gfp_mask,
> + unsigned int nr_vecs)
> +{
> + return bio_integrity_alloc_bioset(bio, gfp_mask, nr_vecs, fs_bio_set);
> +}
> +EXPORT_SYMBOL(bio_integrity_alloc);
> +
> +/**
> + * bio_integrity_free - Free bio integrity payload
> + * @bio: bio containing bip to be freed
> + * @bs: bio_set this bio was allocated from
> + *
> + * Description: Used to free the integrity portion of a bio. Usually
> + * called from bio_free().
> + */
> +void bio_integrity_free(struct bio *bio, struct bio_set *bs)
> +{
> + struct bip *bip = bio->bi_integrity;
> +
> + BUG_ON(bip == NULL);
> +
> + /* A cloned bio doesn't own the integrity metadata */
> + if (!bio_flagged(bio, BIO_CLONED) && bip->bip_buf != NULL)
> + kfree(bip->bip_buf);
> +
> + mempool_free(bip->bip_vec, bs->bvec_pools[bip->bip_pool]);
> + mempool_free(bip, bs->bio_integrity_pool);
> +
> + bio->bi_integrity = NULL;
> +}
> +EXPORT_SYMBOL(bio_integrity_free);
> +
> +/**
> + * bio_integrity_add_page - Attach integrity metadata
> + * @bio: bio to update
> + * @page: page containing integrity metadata
> + * @len: number of bytes of integrity metadata in page
> + * @offset: start offset within page
> + *
> + * Description: Attach a page containing integrity metadata to bio.
> + */
> +int bio_integrity_add_page(struct bio *bio, struct page *page,
> + unsigned int len, unsigned int offset)
> +{
> + struct bip *bip;
> + struct bio_vec *iv;
> +
> + bip = bio->bi_integrity;
> +
> + if (bip->bip_vcnt >= bvec_nr_vecs(bip->bip_pool)) {
> + printk(KERN_ERR "%s: bip_vec full\n", __func__);
> + return 0;
> + }
> +
> + iv = bip_vec_idx(bip, bip->bip_vcnt);
> + BUG_ON(iv == NULL);
> + BUG_ON(iv->bv_page != NULL);
> +
> + iv->bv_page = page;
> + iv->bv_len = len;
> + iv->bv_offset = offset;
> + bip->bip_vcnt++;
> +
> + return len;
> +}
> +EXPORT_SYMBOL(bio_integrity_add_page);
> +
> +/**
> + * bio_integrity_enabled - Check whether integrity can be passed
> + * @bio: bio to check
> + *
> + * Description: Determines whether bio_integrity_prep() can be called
> + * on this bio or not. bio data direction and target device must be
> + * set prior to calling. The function honors the write_generate and
> + * read_verify flags in sysfs.
> + */
> +inline int bio_integrity_enabled(struct bio *bio)
> +{
> + /* Already protected? */
> + if (bio_integrity(bio))
> + return 0;
> +
> + return bdev_integrity_enabled(bio->bi_bdev, bio_data_dir(bio));
> +}
> +EXPORT_SYMBOL(bio_integrity_enabled);
> +
> +/**
> + * bio_integrity_tag_size - Retrieve integrity tag space
> + * @bio: bio to inspect
> + *
> + * Description: Returns the maximum number of tag bytes that can be
> + * attached to this bio. Filesystems can use this to determine how
> + * much metadata to attach to an I/O.
> + */
> +unsigned int bio_integrity_tag_size(struct bio *bio)
> +{
> + struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
> +
> + BUG_ON(bio->bi_size == 0);
> +
> + return bi->tag_size * (bio->bi_size / bi->sector_size);
> +}
> +EXPORT_SYMBOL(bio_integrity_tag_size);
> +
> +/**
> + * bio_integrity_set_tag - Attach a tag buffer to a bio
> + * @bio: bio to attach buffer to
> + * @tag_buf: Pointer to a buffer containing tag data
> + * @len: Length of the included buffer
> + *
> + * Description: Use this function to tag a bio by leveraging the extra
> + * space provided by devices formatted with integrity protection. The
> + * size of the integrity buffer must be <= the size reported by
> + * bio_integrity_tag_size().
> + */
> +int bio_integrity_set_tag(struct bio *bio, void *tag_buf, unsigned int len)
> +{
> + struct bip *bip = bio->bi_integrity;
> + struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
> + unsigned int nr_sectors;
> +
> + BUG_ON(bip->bip_buf == NULL);
> + BUG_ON(bio_data_dir(bio) != WRITE);
> +
> + if (bi->tag_size == 0)
> + return -1;
> +
> + nr_sectors = len / bi->tag_size;
> +
> + if (len % 2)
> + nr_sectors++;
It seems I'm missing something. What is the purpose of this black magic?
Do you just want to express the following?
nr_sectors = (len + bi->tag_size - 1) / bi->tag_size;
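
For illustration, a minimal userspace sketch of the difference (the helper
names are made up for this example and are not part of the patch). As far
as I can tell, both forms agree for a 2-byte tag, but with a 6-byte Type 3
tag the len % 2 test no longer rounds up correctly:

```c
#include <assert.h>

/* Sector count as computed in the patch: truncating division, plus one
 * if the byte length is odd. */
static unsigned int nr_sectors_patch(unsigned int len, unsigned int tag_size)
{
    unsigned int nr_sectors;

    assert(tag_size != 0);

    nr_sectors = len / tag_size;
    if (len % 2)
        nr_sectors++;

    return nr_sectors;
}

/* Suggested form: round up to a whole number of tags. */
static unsigned int nr_sectors_roundup(unsigned int len, unsigned int tag_size)
{
    assert(tag_size != 0);

    return (len + tag_size - 1) / tag_size;
}
```

With tag_size = 6 and len = 8 the first version yields 1 sector while the
round-up yields 2.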
> +
> + if (bi->sector_size == 4096)
> + nr_sectors >>= 3;
Why is sector_size == 4096 so special, here and later? What about 1k and
2k sector sizes? Do you just want to convert the value from 512-byte units
to bi->sector_size units?
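
Something along these lines would cover every sector size, not only 4096.
This is only a sketch: to_hw_sectors() is a hypothetical helper, and it
assumes the count is in 512-byte units and that sector_size is a non-zero
multiple of 512.

```c
#include <assert.h>

/* Hypothetical helper: convert a count of 512-byte units into device
 * sectors of sector_size bytes. */
static unsigned int to_hw_sectors(unsigned int sectors_512,
                                  unsigned int sector_size)
{
    assert(sector_size >= 512 && sector_size % 512 == 0);

    return sectors_512 / (sector_size >> 9);
}
```

For 4096-byte sectors this reduces to the >>= 3 used in the patch, and for
512-byte sectors it leaves the value unchanged.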
> +
> + if (nr_sectors * bi->tuple_size > bip->bip_size) {
> + printk(KERN_ERR "%s: tag too big for bio: %u > %u\n",
> + __func__, nr_sectors * bi->tuple_size, bip->bip_size);
> + return -1;
> + }
> +
> + bi->set_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(bio_integrity_set_tag);
> +
> +/**
> + * bio_integrity_get_tag - Retrieve a tag buffer from a bio
> + * @bio: bio to retrieve buffer from
> + * @tag_buf: Pointer to a buffer for the tag data
> + * @len: Length of the target buffer
> + *
> + * Description: Use this function to retrieve the tag buffer from a
> + * completed I/O. The size of the integrity buffer must be <= the
> + * size reported by bio_integrity_tag_size().
> + */
> +int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
> +{
> + struct bip *bip = bio->bi_integrity;
> + struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
> + unsigned int nr_sectors;
> +
> + BUG_ON(bip->bip_buf == NULL);
> + BUG_ON(bio_data_dir(bio) != READ);
> +
> + if (bi->tag_size == 0)
> + return -1;
> +
> + nr_sectors = len / bi->tag_size;
> +
> + if (len % 2)
> + nr_sectors++;
> +
> + if (bi->sector_size == 4096)
> + nr_sectors >>= 3;
> +
> + if (nr_sectors * bi->tuple_size > bip->bip_size) {
> + printk(KERN_ERR "%s: tag too big for bio: %u > %u\n",
> + __func__, nr_sectors * bi->tuple_size, bip->bip_size);
> + return -1;
> + }
> +
> + bi->get_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(bio_integrity_get_tag);
> +
> +/**
> + * bio_integrity_generate - Generate integrity metadata for a bio
> + * @bio: bio to generate integrity metadata for
> + *
> + * Description: Generates integrity metadata for a bio by calling the
> + * block device's generation callback function. The bio must have a
> + * bip attached with enough room to accommodate the generated integrity
> + * metadata.
> + */
> +static void bio_integrity_generate(struct bio *bio)
> +{
> + struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
> + struct blk_integrity_exchg bix;
> + struct bio_vec *bv;
> + sector_t sector = bio->bi_sector;
> + unsigned int i, sectors, total;
> + void *prot_buf = bio->bi_integrity->bip_buf;
> +
> + total = 0;
> + bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
> + bix.sector_size = bi->sector_size;
> +
> + bio_for_each_segment(bv, bio, i) {
> + bix.data_buf = kmap_atomic(bv->bv_page, KM_USER0)
> + + bv->bv_offset;
> + bix.data_size = bv->bv_len;
> + bix.prot_buf = prot_buf;
> + bix.sector = sector;
> +
> + bi->generate_fn(&bix);
> +
> + sectors = bv->bv_len / bi->sector_size;
> + sector += sectors;
> + prot_buf += sectors * bi->tuple_size;
> + total += sectors * bi->tuple_size;
> + BUG_ON(total > bio->bi_integrity->bip_size);
> +
> + kunmap_atomic(bv->bv_page, KM_USER0);
> + }
> +}
> +
> +/**
> + * bio_integrity_prep - Prepare bio for integrity I/O
> + * @bio: bio to prepare
> + *
> + * Description: Allocates a buffer for integrity metadata, maps the
> + * pages and attaches them to a bio. The bio must have data
> + * direction, target device and start sector set prior to calling. In
> + * the WRITE case, integrity metadata will be generated using the
> + * block device's integrity function. In the READ case, the buffer
> + * will be prepared for DMA and a suitable end_io handler set up.
> + */
> +int bio_integrity_prep(struct bio *bio)
> +{
> + struct bip *bip;
> + struct blk_integrity *bi;
> + struct request_queue *q;
> + void *buf;
> + unsigned long start, end;
> + unsigned int len, nr_pages;
> + unsigned int bytes, offset, i;
> + unsigned int sectors = bio_sectors(bio);
> +
> + bi = bdev_get_integrity(bio->bi_bdev);
> + q = bdev_get_queue(bio->bi_bdev);
> + BUG_ON(bi == NULL);
> + BUG_ON(bio_integrity(bio));
> +
> + if (bi->sector_size == 4096)
> + sectors >>= 3;
> +
> + /* Allocate kernel buffer for protection data */
> + len = sectors * blk_integrity_tuple_size(bi);
> + buf = kzalloc(len, GFP_NOIO | q->bounce_gfp);
> + if (unlikely(buf == NULL)) {
> + printk(KERN_ERR "could not allocate integrity buffer\n");
> + return -EIO;
> + }
> +
> + end = (((unsigned long) buf) + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + start = ((unsigned long) buf) >> PAGE_SHIFT;
> + nr_pages = end - start;
> +
> + /* Allocate bio integrity payload and integrity vectors */
> + bip = bio_integrity_alloc(bio, GFP_NOIO, nr_pages);
> + if (unlikely(bip == NULL)) {
> + printk(KERN_ERR "could not allocate data integrity bioset\n");
> + kfree(buf);
> + return -EIO;
> + }
> +
> + bip->bip_buf = buf;
> + bip->bip_size = len;
> + bip->bip_sector = bio->bi_sector;
> +
> + /* Map it */
> + offset = offset_in_page(buf);
> + for (i = 0 ; i < nr_pages ; i++) {
> + int ret;
> + bytes = PAGE_SIZE - offset;
> +
> + if (len <= 0)
> + break;
> +
> + if (bytes > len)
> + bytes = len;
> +
> + ret = bio_integrity_add_page(bio, virt_to_page(buf),
> + bytes, offset);
> +
> + if (ret == 0)
> + return 0;
> +
> + if (ret < bytes)
> + break;
> +
> + buf += bytes;
> + len -= bytes;
> + offset = 0;
> + }
> +
> + /* Install custom I/O completion handler if read verify is enabled */
> + if (bio_data_dir(bio) == READ) {
> + bip->bip_end_io = bio->bi_end_io;
> + bio->bi_end_io = bio_integrity_endio;
> + }
> +
> + /* Auto-generate integrity metadata if this is a write */
> + if (bio_data_dir(bio) == WRITE)
> + bio_integrity_generate(bio);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(bio_integrity_prep);
> +
> +/**
> + * bio_integrity_verify - Verify integrity metadata for a bio
> + * @bio: bio to verify
> + *
> + * Description: This function is called to verify the integrity of a
> + * bio. The data in the bio io_vec is compared to the integrity
> + * metadata returned by the HBA.
> + */
> +static int bio_integrity_verify(struct bio *bio)
> +{
> + struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
> + struct blk_integrity_exchg bix;
> + struct bio_vec *bv;
> + sector_t sector = bio->bi_integrity->bip_sector;
> + unsigned int i, sectors, total, ret;
> + void *prot_buf = bio->bi_integrity->bip_buf;
> +
> + total = 0;
> + bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
> + bix.sector_size = bi->sector_size;
> +
> + bio_for_each_segment(bv, bio, i) {
> + bix.data_buf = kmap_atomic(bv->bv_page, KM_USER0)
> + + bv->bv_offset;
> + bix.data_size = bv->bv_len;
> + bix.prot_buf = prot_buf;
> + bix.sector = sector;
> +
> + ret = bi->verify_fn(&bix);
> +
> + if (ret) {
> + kunmap_atomic(bv->bv_page, KM_USER0);
> + return ret;
> + }
> +
> + sectors = bv->bv_len / bi->sector_size;
> + sector += sectors;
> + prot_buf += sectors * bi->tuple_size;
> + total += sectors * bi->tuple_size;
> + BUG_ON(total > bio->bi_integrity->bip_size);
> +
> + kunmap_atomic(bv->bv_page, KM_USER0);
> + }
> +
> + return 0;
> +}
> +
> +/**
> + * bio_integrity_verify_fn - Integrity I/O completion worker
> + * @work: Work struct stored in bio to be verified
> + *
> + * Description: This workqueue function is called to complete a READ
> + * request. The function verifies the transferred integrity metadata
> + * and then calls the original bio end_io function.
> + */
> +static void bio_integrity_verify_fn(struct work_struct *work)
> +{
> + struct bip *bip = container_of(work, struct bip, bip_work);
> + struct bio *bio = bip->bip_bio;
> + int error = bip->bip_error;
> +
> + if (bio_integrity_verify(bio)) {
> + clear_bit(BIO_UPTODATE, &bio->bi_flags);
> + error = -EIO;
> + }
> +
> + /* Restore original bio completion handler */
> + bio->bi_end_io = bip->bip_end_io;
> +
> + if (bio->bi_end_io)
> + bio->bi_end_io(bio, error);
> +}
> +
> +/**
> + * bio_integrity_endio - Integrity I/O completion function
> + * @bio: Protected bio
> + * @error: Error code
> + *
> + * Description: Completion for integrity I/O
> + *
> + * Normally I/O completion is done in interrupt context. However,
> + * verifying I/O integrity is a time-consuming task which must be run
> + * in process context. This function postpones completion
> + * accordingly.
> + */
> +void bio_integrity_endio(struct bio *bio, int error)
> +{
> + struct bip *bip = bio->bi_integrity;
> +
> + BUG_ON(bip->bip_bio != bio);
> +
> + bip->bip_error = error;
> + INIT_WORK(&bip->bip_work, bio_integrity_verify_fn);
> + queue_work(kintegrityd_wq, &bip->bip_work);
> +}
> +EXPORT_SYMBOL(bio_integrity_endio);
> +
> +/**
> + * bio_integrity_advance - Advance integrity vector
> + * @bio: bio whose integrity vector to update
> + * @bytes_done: number of data bytes that have been completed
> + *
> + * Description: This function calculates how many integrity bytes the
> + * number of completed data bytes correspond to and advances the
> + * integrity vector accordingly.
> + */
> +void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
> +{
> + struct bip *bip = bio->bi_integrity;
> + struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
> + struct bio_vec *iv;
> + unsigned int i, skip, nr_sectors;
> +
> + BUG_ON(bip == NULL);
> + BUG_ON(bi == NULL);
> +
> + nr_sectors = bytes_done >> 9;
> +
> + if (bi->sector_size == 4096)
> + nr_sectors >>= 3;
> +
> + skip = nr_sectors * bi->tuple_size;
> +
> + bip_for_each_vec(iv, bip, i) {
> + if (skip == 0) {
> + bip->bip_idx = i;
> + return;
> + } else if (skip >= iv->bv_len) {
> + skip -= iv->bv_len;
> + } else { /* skip < iv->bv_len) */
> + iv->bv_offset += skip;
> + iv->bv_len -= skip;
> + bip->bip_idx = i;
> + return;
> + }
> + }
> +}
> +EXPORT_SYMBOL(bio_integrity_advance);
> +
> +/**
> + * bio_integrity_trim - Trim integrity vector
> + * @bio: bio whose integrity vector to update
> + * @offset: offset to first data sector
> + * @sectors: number of data sectors
> + *
> + * Description: Used to trim the integrity vector in a cloned bio.
> + * The ivec will be advanced corresponding to 'offset' data sectors
> + * and the length will be truncated corresponding to 'sectors' data
> + * sectors.
> + */
> +void bio_integrity_trim(struct bio *bio, unsigned int offset, unsigned int sectors)
> +{
> + struct bip *bip = bio->bi_integrity;
> + struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
> + struct bio_vec *iv;
> + unsigned int i, skip, nr_bytes;
> +
> + BUG_ON(bip == NULL);
> + BUG_ON(bi == NULL);
> + BUG_ON(!bio_flagged(bio, BIO_CLONED));
> +
> + if (bi->sector_size == 4096)
> + sectors >>= 3;
> +
> + bip->bip_sector = bip->bip_sector + offset;
> + skip = offset * bi->tuple_size;
> + nr_bytes = sectors * bi->tuple_size;
> +
> + /* Mark head */
> + bip_for_each_vec(iv, bip, i) {
> + if (skip == 0) {
> + bip->bip_idx = i;
> + break;
> + } else if (skip >= iv->bv_len) {
> + skip -= iv->bv_len;
> + } else { /* skip < iv->bv_len) */
> + iv->bv_offset += skip;
> + iv->bv_len -= skip;
> + bip->bip_idx = i;
> + break;
> + }
> + }
> +
> + /* Mark tail */
> + bip_for_each_vec(iv, bip, i) {
> + if (nr_bytes == 0) {
> + bip->bip_vcnt = i;
> + break;
> + } else if (nr_bytes >= iv->bv_len) {
> + nr_bytes -= iv->bv_len;
> + } else { /* nr_bytes < iv->bv_len) */
> + iv->bv_len = nr_bytes;
> + nr_bytes = 0;
> + }
> + }
> +}
> +EXPORT_SYMBOL(bio_integrity_trim);
> +
> +/**
> + * bio_integrity_split - Split integrity metadata
> + * @bio: Protected bio
> + * @bp: Resulting bio_pair
> + * @sectors: Offset
> + *
> + * Description: Splits an integrity page into a bio_pair.
> + */
> +void bio_integrity_split(struct bio *bio, struct bio_pair *bp, int sectors)
> +{
> + struct blk_integrity *bi;
> + struct bip *bip = bio->bi_integrity;
> +
> + if (bio_integrity(bio) == 0)
> + return;
> +
> + bi = bdev_get_integrity(bio->bi_bdev);
> + BUG_ON(bi == NULL);
> + BUG_ON(bip->bip_vcnt != 1);
> +
> + if (bi->sector_size == 4096)
> + sectors >>= 3;
> +
> + bp->bio1.bi_integrity = &bp->bip1;
> + bp->bio2.bi_integrity = &bp->bip2;
> +
> + bp->iv1 = bip->bip_vec[0];
> + bp->iv2 = bip->bip_vec[0];
> +
> + bp->bip1.bip_vec = &bp->iv1;
> + bp->bip2.bip_vec = &bp->iv2;
> +
> + bp->iv1.bv_len = sectors * bi->tuple_size;
> + bp->iv2.bv_offset += sectors * bi->tuple_size;
> + bp->iv2.bv_len -= sectors * bi->tuple_size;
> +
> + bp->bip1.bip_sector = bio->bi_integrity->bip_sector;
> + bp->bip2.bip_sector = bio->bi_integrity->bip_sector + sectors;
> +
> + bp->bip1.bip_vcnt = bp->bip2.bip_vcnt = 1;
> + bp->bip1.bip_idx = bp->bip2.bip_idx = 0;
> +}
> +EXPORT_SYMBOL(bio_integrity_split);
> +
> +/**
> + * bio_integrity_clone - Callback for cloning bios with integrity metadata
> + * @bio: New bio
> + * @bio_src: Original bio
> + * @bs: bio_set to allocate bip from
> + *
> + * Description: Called to allocate a bip when cloning a bio
> + */
> +int bio_integrity_clone(struct bio *bio, struct bio *bio_src, struct bio_set *bs)
> +{
> + struct bip *bip_src = bio_src->bi_integrity;
> + struct bip *bip;
> +
> + BUG_ON(bip_src == NULL);
> +
> + bip = bio_integrity_alloc_bioset(bio, GFP_NOIO, bip_src->bip_vcnt, bs);
> +
> + if (bip == NULL)
> + return -EIO;
> +
> + memcpy(bip->bip_vec, bip_src->bip_vec,
> + bip_src->bip_vcnt * sizeof(struct bio_vec));
> +
> + bip->bip_sector = bip_src->bip_sector;
> + bip->bip_vcnt = bip_src->bip_vcnt;
> + bip->bip_idx = bip_src->bip_idx;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(bio_integrity_clone);
> +
> +int bioset_integrity_create(struct bio_set *bs, int pool_size)
> +{
> + bs->bio_integrity_pool = mempool_create_slab_pool(pool_size,
> + bio_integrity_slab);
> + if (!bs->bio_integrity_pool)
> + return -1;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(bioset_integrity_create);
> +
> +void bioset_integrity_free(struct bio_set *bs)
> +{
> + if (bs->bio_integrity_pool)
> + mempool_destroy(bs->bio_integrity_pool);
> +}
> +EXPORT_SYMBOL(bioset_integrity_free);
> +
> +void __init bio_integrity_init_slab(void)
> +{
> + bio_integrity_slab = KMEM_CACHE(bip, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
> +}
> +EXPORT_SYMBOL(bio_integrity_init_slab);
> +
> +static int __init integrity_init(void)
> +{
> + kintegrityd_wq = create_workqueue("kintegrityd");
> +
> + if (!kintegrityd_wq)
> + panic("Failed to create kintegrityd\n");
> +
> + return 0;
> +}
> +subsys_initcall(integrity_init);
> diff -r 318fa71e735d -r f2ae9d5bce4c fs/bio.c
> --- a/fs/bio.c Sat Jun 07 00:45:14 2008 -0400
> +++ b/fs/bio.c Sat Jun 07 00:45:15 2008 -0400
> @@ -96,6 +96,9 @@
>
> mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
> }
> +
> + if (bio_integrity(bio))
> + bio_integrity_free(bio, bio_set);
>
> mempool_free(bio, bio_set->bio_pool);
> }
> @@ -255,9 +258,19 @@
> {
> struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, fs_bio_set);
>
> - if (b) {
> - b->bi_destructor = bio_fs_destructor;
> - __bio_clone(b, bio);
> + if (!b)
> + return NULL;
> +
> + b->bi_destructor = bio_fs_destructor;
> + __bio_clone(b, bio);
> +
> + if (bio_integrity(bio)) {
> + int ret;
> +
> + ret = bio_integrity_clone(b, bio, fs_bio_set);
> +
> + if (ret < 0)
> + return NULL;
> }
>
> return b;
> @@ -1229,6 +1242,9 @@
> bp->bio1.bi_private = bi;
> bp->bio2.bi_private = pool;
>
> + if (bio_integrity(bi))
> + bio_integrity_split(bi, bp, first_sectors);
> +
> return bp;
> }
>
> @@ -1294,6 +1310,7 @@
> if (bs->bio_pool)
> mempool_destroy(bs->bio_pool);
>
> + bioset_integrity_free(bs);
> biovec_free_pools(bs);
>
> kfree(bs);
> @@ -1308,6 +1325,9 @@
>
> bs->bio_pool = mempool_create_slab_pool(bio_pool_size, bio_slab);
> if (!bs->bio_pool)
> + goto bad;
> +
> + if (bioset_integrity_create(bs, bio_pool_size))
> goto bad;
>
> if (!biovec_create_pools(bs, bvec_pool_size))
> @@ -1336,6 +1356,7 @@
> {
> bio_slab = KMEM_CACHE(bio, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
>
> + bio_integrity_init_slab();
> biovec_init_slabs();
>
> fs_bio_set = bioset_create(BIO_POOL_SIZE, 2);
> diff -r 318fa71e735d -r f2ae9d5bce4c include/linux/bio.h
> --- a/include/linux/bio.h Sat Jun 07 00:45:14 2008 -0400
> +++ b/include/linux/bio.h Sat Jun 07 00:45:15 2008 -0400
> @@ -64,6 +64,7 @@
>
> struct bio_set;
> struct bio;
> +struct bip;
> typedef void (bio_end_io_t) (struct bio *, int);
> typedef void (bio_destructor_t) (struct bio *);
>
> @@ -112,6 +113,9 @@
> atomic_t bi_cnt; /* pin count */
>
> void *bi_private;
> +#if defined(CONFIG_BLK_DEV_INTEGRITY)
> + struct bip *bi_integrity; /* data integrity */
> +#endif
>
> bio_destructor_t *bi_destructor; /* destructor */
> };
> @@ -271,6 +275,29 @@
> */
> #define bio_get(bio) atomic_inc(&(bio)->bi_cnt)
>
> +#if defined(CONFIG_BLK_DEV_INTEGRITY)
> +/*
> + * bio integrity payload
> + */
> +struct bip {
> + struct bio *bip_bio; /* parent bio */
> + struct bio_vec *bip_vec; /* integrity data vector */
> +
> + sector_t bip_sector; /* virtual start sector */
> +
> + void *bip_buf; /* generated integrity data */
> + bio_end_io_t *bip_end_io; /* saved I/O completion fn */
> +
> + int bip_error; /* saved I/O error */
> + unsigned int bip_size;
> +
> + unsigned short bip_pool; /* pool the ivec came from */
> + unsigned short bip_vcnt; /* # of integrity bio_vecs */
> + unsigned short bip_idx; /* current bip_vec index */
> +
> + struct work_struct bip_work; /* I/O completion */
> +};
> +#endif /* CONFIG_BLK_DEV_INTEGRITY */
>
> /*
> * A bio_pair is used when we need to split a bio.
> @@ -285,6 +312,10 @@
> struct bio_pair {
> struct bio bio1, bio2;
> struct bio_vec bv1, bv2;
> +#if defined(CONFIG_BLK_DEV_INTEGRITY)
> + struct bip bip1, bip2;
> + struct bio_vec iv1, iv2;
> +#endif
> atomic_t cnt;
> int error;
> };
> @@ -349,6 +380,9 @@
>
> struct bio_set {
> mempool_t *bio_pool;
> +#if defined(CONFIG_BLK_DEV_INTEGRITY)
> + mempool_t *bio_integrity_pool;
> +#endif
> mempool_t *bvec_pools[BIOVEC_NR_POOLS];
> };
>
> @@ -413,5 +447,56 @@
> __bio_kmap_irq((bio), (bio)->bi_idx, (flags))
> #define bio_kunmap_irq(buf,flags) __bio_kunmap_irq(buf, flags)
>
> +#if defined(CONFIG_BLK_DEV_INTEGRITY)
> +
> +#define bip_vec_idx(bip, idx) (&(bip->bip_vec[(idx)]))
> +#define bip_vec(bip) bip_vec_idx(bip, 0)
> +
> +#define __bip_for_each_vec(bvl, bip, i, start_idx) \
> + for (bvl = bip_vec_idx((bip), (start_idx)), i = (start_idx); \
> + i < (bip)->bip_vcnt; \
> + bvl++, i++)
> +
> +#define bip_for_each_vec(bvl, bip, i) \
> + __bip_for_each_vec(bvl, bip, i, (bip)->bip_idx)
> +
> +#define bio_integrity(bio) ((bio)->bi_integrity ? 1 : 0)
> +
> +extern struct bip *bio_integrity_alloc_bioset(struct bio *, gfp_t, unsigned int, struct bio_set *);
> +extern struct bip *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
> +extern void bio_integrity_free(struct bio *, struct bio_set *);
> +extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
> +extern inline int bio_integrity_enabled(struct bio *bio);
> +extern int bio_integrity_set_tag(struct bio *, void *, unsigned int);
> +extern int bio_integrity_get_tag(struct bio *, void *, unsigned int);
> +extern int bio_integrity_prep(struct bio *);
> +extern void bio_integrity_endio(struct bio *, int);
> +extern void bio_integrity_advance(struct bio *, unsigned int);
> +extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
> +extern void bio_integrity_split(struct bio *, struct bio_pair *, int);
> +extern int bio_integrity_clone(struct bio *, struct bio *, struct bio_set *);
> +extern int bioset_integrity_create(struct bio_set *, int);
> +extern void bioset_integrity_free(struct bio_set *);
> +extern void bio_integrity_init_slab(void);
> +
> +#else /* CONFIG_BLK_DEV_INTEGRITY */
> +
> +#define bio_integrity(a) (0)
> +#define bioset_integrity_create(a, b) (0)
> +#define bio_integrity_prep(a) (0)
> +#define bio_integrity_enabled(a) (0)
> +#define bio_integrity_clone(a, b, c) (0)
> +#define bioset_integrity_free(a) do { } while (0)
> +#define bio_integrity_free(a, b) do { } while (0)
> +#define bio_integrity_endio(a, b) do { } while (0)
> +#define bio_integrity_advance(a, b) do { } while (0)
> +#define bio_integrity_trim(a, b, c) do { } while (0)
> +#define bio_integrity_split(a, b, c) do { } while (0)
> +#define bio_integrity_set_tag(a, b, c) do { } while (0)
> +#define bio_integrity_get_tag(a, b, c) do { } while (0)
> +#define bio_integrity_init_slab(a) do { } while (0)
> +
> +#endif /* CONFIG_BLK_DEV_INTEGRITY */
> +
> #endif /* CONFIG_BLOCK */
> #endif /* __LINUX_BIO_H */
>
>

2008-06-08 04:31:26

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 5 of 7] block: Block/request layer data integrity support

On Sat, Jun 07, 2008 at 12:55:38AM -0400, Martin K. Petersen wrote:
> +5.1 NORMAL FILESYSTEM
> +
> + The normal filesystem is unaware that the underlying block device
> + is capable of sending/receiving integrity metadata. The IMD will
> + be automatically generated by the block layer at submit_bio() time
> + in case of a WRITE. A READ request will cause the I/O integrity
> + to be verified upon completion.
> +
> + IMD generation and verification can be toggled using the
> +
> + /sys/class/block/<bdev>/integrity/write_generate
> +
> + and
> +
> + /sys/class/block/<bdev>/integrity/read_verify
> +
> + flags.
> +

All of the sysfs interfaces should be documented in Documentation/ABI/
not here, otherwise they will be tough to find and figure out how to
use :)

thanks,

greg k-h

2008-06-09 15:05:54

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH 4 of 7] block: bio data integrity support

>>>>> "DM" == Monakhov Dmitri <[email protected]> writes:

>> + nr_sectors = len / bi->tag_size;
>> +
>> + if (len % 2)
>> + nr_sectors++;
DM> Seems i've missing something. What is purpose of this black magic?
DM> do you want just express following? nr_sectors = (len +
DM> bi->tag_size - 1) / bi->tag_size;

Yep. In the original DIF spec, only 2-byte tags were supported so the
check for an odd length was a fast and elegant solution. But now that
I implemented Type 3 which has 6 bytes of tag space that's an invalid
assumption. Fixed.
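
The rounding issue can be shown with a small userspace sketch. The function names below are hypothetical and exist only for this illustration; DIV_ROUND_UP is written out with the same shape as the kernel macro of that name. With 2-byte tags the parity check and a true round-up agree, but with 6-byte tags (Type 3) an even length that is not a multiple of 6 is undercounted.

```c
#include <assert.h>

/* Same shape as the kernel's DIV_ROUND_UP() macro. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Original approach: truncate, then add one if the length is odd.
 * Only correct when tag_size is 2. (Hypothetical helper name.) */
static unsigned int tag_sectors_parity(unsigned int len, unsigned int tag_size)
{
	unsigned int nr_sectors;

	assert(tag_size != 0);
	nr_sectors = len / tag_size;
	if (len % 2)
		nr_sectors++;
	return nr_sectors;
}

/* Fixed approach: round up for any tag size. (Hypothetical helper name.) */
static unsigned int tag_sectors_round_up(unsigned int len, unsigned int tag_size)
{
	assert(tag_size != 0);
	return DIV_ROUND_UP(len, tag_size);
}
```

For example, an 8-byte tag buffer on a device with 6 bytes of tag space per sector needs two sectors; the parity version reports one, the round-up version reports two.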


>> +
>> + if (bi->sector_size == 4096)
>> + nr_sectors >>= 3;

DM> Why here and later sector_size == 4096 is so special, what about
DM> 1k and 2k sect_sz? Do you want just transform value from 512 to
DM> bi->sectors_size?

Well, so only 512-byte DIF storage devices are currently available.
The whole industry is in the process of transitioning to 4KB sectors.
There will be no DIF devices with 1KB or 2KB sectors.

And even as it stands it's unclear that 4KB sectors are going to look
like they do in the current version of the spec. It's going to be an
interoperability nightmare as it is now as the tag is attached to the
hardware sector size. This means that it's still only 8 bytes of DIF
for a device with 4KB sectors (IOW, 4104 bytes and not 4160). That
means that *two* protection buffers would have to be generated for -
say - a mirror with heterogeneous sector sizes. And tagging won't
work as there's not the same space available for both drives of the
mirror.
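
The arithmetic behind the 4104 vs. 4160 figures can be sketched as follows. This is a userspace illustration with hypothetical helper names; the only input taken from the thread is that one 8-byte tuple is attached per hardware sector.

```c
#include <assert.h>

#define DIF_TUPLE_SIZE 8u	/* protection bytes per hardware sector */

/* Protection bytes needed to cover data_bytes on a device with the
 * given hardware sector size. (Hypothetical helper name.) */
static unsigned int prot_bytes(unsigned int data_bytes, unsigned int sector_size)
{
	assert(sector_size != 0 && data_bytes % sector_size == 0);
	return (data_bytes / sector_size) * DIF_TUPLE_SIZE;
}

/* On-media sector size once the tuple is appended. (Hypothetical helper name.) */
static unsigned int formatted_sector_size(unsigned int sector_size)
{
	return sector_size + DIF_TUPLE_SIZE;
}
```

The same 4096 bytes of data carry 64 bytes of protection information on a 512-byte-sector device (eight tuples) but only 8 bytes on a 4KB-sector device (one tuple). That factor of eight is what the `nr_sectors >>= 3` adjustment accounts for, and it is why a mirror built from both kinds of drive would need two different protection buffers.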

The tag space problem also causes issues with RAID arrays exporting
512 byte sectors to the host but using drives with 4KB sectors in the
back. Where is the array going to store the tags for the last 7 512
byte sectors?

So 4KB vs. DIF is up in the air at this point. The current checks are
there because I've been messing with 4KB sector devices for other
reasons. And technically they are in accordance with the current
spec.

--
Martin K. Petersen Oracle Linux Engineering

2008-06-09 15:07:33

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH 5 of 7] block: Block/request layer data integrity support

>>>>> "Greg" == Greg KH <[email protected]> writes:

>> + /sys/class/block/<bdev>/integrity/write_generate
>> + /sys/class/block/<bdev>/integrity/read_verify

Greg> All of the sysfs interfaces should be documented in
Greg> Documentation/ABI/ not here, otherwise they will be tough to
Greg> find and figure out how to use :)

Will do.

--
Martin K. Petersen Oracle Linux Engineering

2008-06-09 16:09:21

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 3 of 7] block: Find bio sector offset given idx and offset

"Martin K. Petersen" <[email protected]> writes:

> 2 files changed, 26 insertions(+)
> fs/bio.c | 24 ++++++++++++++++++++++++
> include/linux/bio.h | 2 ++
>
>
> Helper function to find the sector offset in a bio given bvec index
> and page offset.
>

Unless I've missed something, this helper isn't used anywhere in your
patchset.

Cheers,

Jeff

2008-06-09 16:17:12

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH 3 of 7] block: Find bio sector offset given idx and offset

>>>>> "Jeff" == Jeff Moyer <[email protected]> writes:

Jeff> Unless I've missed something, this helper isn't used anywhere in
Jeff> your patchset.

You are right.

bio_sector_offset() is used in the patch that adds integrity support
to DM. I didn't post the MD/DM patches this time around in an attempt
to limit the number of patches and focus on the core code.

You can see the DM patch here:

http://oss.oracle.com/mercurial/mkp/linux-2.6-di/rev/6b79173c501a

--
Martin K. Petersen Oracle Linux Engineering

2008-06-10 14:43:35

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 0 of 7] Block/SCSI Data Integrity Support

"Martin K. Petersen" <[email protected]> writes:

> Another post of my block I/O data integrity patches. This kit goes on
> top of the scsi_data_buffer and sd.h cleanups I posted earlier today.

Pointers to archives would have been appreciated. I can't, for the life
of me, find these.

> There has been no changes to the block layer code since my last
> submission.
>
> Within SCSI, the changes are cleanups based on comments from Christoph
> as well as working support for Type 3 and 4KB sectors.

Thanks for all of the great documentation. It would be good to include
some instructions on how one would test this, and what testing you
performed. Also, please use the '-p' switch to diff, as it makes
reviewing patches much easier.

I set out to try your changes, but ran into some problems. First, this
patch set didn't apply cleanly to a git checkout. So, I grabbed your
mercurial repository, but got a build failure:

block/blk-core.c: In function 'generic_make_request':
include/linux/bio.h:469: sorry, unimplemented: inlining failed in call to 'bio_integrity_enabled': function body not available
block/blk-core.c:1388: sorry, unimplemented: called from here
make[1]: *** [block/blk-core.o] Error 1
make: *** [block] Error 2

Cheers,

Jeff

2008-06-10 15:29:05

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH 0 of 7] Block/SCSI Data Integrity Support

>>>>> "Jeff" == Jeff Moyer <[email protected]> writes:

Jeff> "Martin K. Petersen" <[email protected]> writes:
>> Another post of my block I/O data integrity patches. This kit goes
>> on top of the scsi_data_buffer and sd.h cleanups I posted earlier
>> today.

Jeff> Pointers to archives would have been appreciated. I can't, for
Jeff> the life of me, find these.

http://marc.info/?l=linux-scsi&m=121272302931588&w=2
http://marc.info/?l=linux-scsi&m=121278031605941&w=2
http://marc.info/?l=linux-scsi&m=121302438515260&w=2
http://marc.info/?l=linux-scsi&m=121278067906564&w=2


Jeff> Thanks for all of the great documentation. It would be good to
Jeff> include some instructions on how one would test this, and what
Jeff> testing you performed.

modprobe scsi_debug dix=199 dif=1 guard=1 dev_size_mb=1024 num_parts=1

I'm testing with XFS and btrfs. Generally doing kernel builds, etc.
ext2/3 are still problematic because they modify pages in flight.
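
The message does not spell out the failure mode. One plausible reading is that the guard tag is computed when the bio is submitted, so any change to the page before the device checks it makes the stored CRC stale. The userspace sketch below illustrates that under this assumption; the function names are hypothetical, and the CRC is a plain bitwise implementation using the T10 DIF polynomial 0x8BB7 rather than the kernel's own routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC16 with the T10 DIF polynomial 0x8BB7 and initial value 0. */
static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= (uint16_t)(buf[i] << 8);
		for (bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
					     : (uint16_t)(crc << 1);
	}
	return crc;
}

/* Returns 1 if the guard computed at submit time still matches the data
 * the device sees. (Hypothetical helper name.) */
static int guard_matches(const uint8_t *sector, size_t len, uint16_t guard)
{
	return crc16_t10dif(sector, len) == guard;
}
```

If a filesystem flips even one bit of the sector after the guard was generated, guard_matches() returns 0 and a DIF-capable device would be expected to reject the write.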


Jeff> I set out to try your changes, but ran into some problems.
Jeff> First, this patch set didn't apply cleanly to a git checkout.

I generally track Linus closely so it must be because of the patches
you were missing.

You can grab my patch stack here. It's always in sync with the hg
repo:

http://oss.oracle.com/~mkp/patches/


Jeff> block/blk-core.c: In function 'generic_make_request':
Jeff> include/linux/bio.h:469: sorry, unimplemented: inlining failed
Jeff> in call to 'bio_integrity_enabled': function body not available
Jeff> block/blk-core.c:1388: sorry, unimplemented: called from here
Jeff> make[1]: *** [block/blk-core.o] Error 1
Jeff> make: *** [block] Error 2

Odd. Which compiler are you using? Compiles just fine for me on both
EL5 and FC9.

Judging from the error I'm guessing it's objecting to the inlining.
Tried to work around it. Please pull, update and let me know whether
that did the trick.

--
Martin K. Petersen Oracle Linux Engineering

2008-06-10 18:51:28

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 0 of 7] Block/SCSI Data Integrity Support

"Martin K. Petersen" <[email protected]> writes:

>>>>>> "Jeff" == Jeff Moyer <[email protected]> writes:
> Jeff> Thanks for all of the great documentation. It would be good to
> Jeff> include some instructions on how one would test this, and what
> Jeff> testing you performed.
>
> modprobe scsi_debug dix=199 dif=1 guard=1 dev_size_mb=1024 num_parts=1
>
> I'm testing with XFS and btrfs. Generally doing kernel builds, etc.
> ext2/3 are still problematic because they modify pages in flight.

So, is it safe to say that the library routines for integrity-aware file
systems have not been tested at all? Specifically, I'm talking about:
bio_integrity_tag_size
bio_integrity_set_tag
bio_integrity_get_tag

> Jeff> block/blk-core.c: In function 'generic_make_request':
> Jeff> include/linux/bio.h:469: sorry, unimplemented: inlining failed
> Jeff> in call to 'bio_integrity_enabled': function body not available
> Jeff> block/blk-core.c:1388: sorry, unimplemented: called from here
> Jeff> make[1]: *** [block/blk-core.o] Error 1
> Jeff> make: *** [block] Error 2
>
> Odd. Which compiler are you using? Compiles just fine for me on both
> EL5 and FC9.

gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-41)

> Judging from the error I'm guessing it's objecting to the inlining.
> Tried to work around it. Please pull, update and let me know whether
> that did the trick.

I did a new clone (just to be sure I got your change) and I get the same
problem. I also can't see the changeset in the log, so are you sure you
pushed it?

I got rid of the inline in the definition in bio.h. The .c file didn't
define the function as inline, so I didn't have to change it. It seems
to be building now.

Cheers,

Jeff

2008-06-10 18:56:18

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 2 of 7] block: Globalize bio_set and bio_vec_slab

"Martin K. Petersen" <[email protected]> writes:

> 2 files changed, 38 insertions(+), 28 deletions(-)
> fs/bio.c | 36 ++++++++----------------------------
> include/linux/bio.h | 30 ++++++++++++++++++++++++++++++
>
>
> Move struct bio_set and biovec_slab definitions to bio.h so they can
> be used outside of bio.c.

Don't sell yourself short; you also implemented bvec_nr_vecs.

Looks okay to me.

Reviewed-by: Jeff Moyer <[email protected]>

2008-06-10 19:03:24

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 3 of 7] block: Find bio sector offset given idx and offset

"Martin K. Petersen" <[email protected]> writes:

> 2 files changed, 26 insertions(+)
> fs/bio.c | 24 ++++++++++++++++++++++++
> include/linux/bio.h | 2 ++
>
>
> Helper function to find the sector offset in a bio given bvec index
> and page offset.
>
> Signed-off-by: Martin K. Petersen <[email protected]>

Reviewed-by: Jeff Moyer <[email protected]>

2008-06-10 20:48:38

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH 0 of 7] Block/SCSI Data Integrity Support

>>>>> "Jeff" == Jeff Moyer <[email protected]> writes:

Jeff> So, is it safe to say that the library routines for
Jeff> integrity-aware file systems have not been tested at all?
Jeff> Specifically, I'm talking about: bio_integrity_tag_size,
Jeff> bio_integrity_set_tag, bio_integrity_get_tag

I have not tried using them from within a filesystem, if that's what
you mean. But I have attached random strings to bios and read them
back later.


Jeff> gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-41)

Ok, I'm trying to chase down a 5.2 box to figure out what the problem
is. Maybe I'll just move that function to the header file.


Jeff> I did a new clone (just to be sure I got your change) and I get
Jeff> the same problem. I also can't see the changeset in the log, so
Jeff> are you sure you pushed it?

Yup, it's there.


Jeff> I got rid of the inline in the definition in bio.h. The .c file
Jeff> didn't define the function as inline, so I didn't have to change
Jeff> it. It seems to be building now.

The problem is that your gcc is unhappy about the fact that the
inlined function is defined elsewhere. The gcc info page said to
declare it inline only in the header and not in the definition. The
change I pushed removed inline from the .c file, but that didn't help.
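
A minimal sketch of the "move that function to the header file" option follows. The struct and function here are hypothetical stand-ins, not the real bio or bio_integrity_enabled(), and the body is placeholder logic. The point is only the shape: when the header carries a static inline definition, every caller sees the body, so the compiler never has to inline a function whose body lives in another translation unit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real structure. */
struct demo_bio {
	void *integrity;	/* non-NULL when a payload is already attached */
	int device_capable;	/* non-zero when the device has an integrity profile */
};

/*
 * As it would appear in a header: a static inline definition rather than
 * an "extern inline" declaration with the body in a .c file.
 * The checks below are placeholders for illustration only.
 */
static inline int demo_integrity_enabled(const struct demo_bio *bio)
{
	/* A bio that already carries a payload needs no further preparation. */
	if (bio->integrity != NULL)
		return 0;

	return bio->device_capable != 0;
}
```

The alternative Jeff used, dropping inline from the header declaration and keeping an ordinary out-of-line function in the .c file, avoids the error as well at the cost of a function call.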

--
Martin K. Petersen Oracle Linux Engineering

2008-06-10 20:55:10

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 4 of 7] block: bio data integrity support

"Martin K. Petersen" <[email protected]> writes:

Comments inlined below.

> +struct bip *bio_integrity_alloc_bioset(struct bio *bio, gfp_t gfp_mask, unsigned int nr_vecs, struct bio_set *bs)
> +{
> + struct bip *bip;
> + struct bio_vec *bv;
> + unsigned long idx;
> +
> + BUG_ON(bio == NULL);
> +
> + bip = mempool_alloc(bs->bio_integrity_pool, gfp_mask);
> + if (unlikely(bip == NULL)) {
> + printk(KERN_ERR "%s: could not alloc bip\n", __func__);
> + return NULL;
> + }
> +
> + memset(bip, 0, sizeof(*bip));
> + idx = 0;

That assignment isn't necessary.

> +int bio_integrity_set_tag(struct bio *bio, void *tag_buf, unsigned int len)
> +{
> + struct bip *bip = bio->bi_integrity;
> + struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
> + unsigned int nr_sectors;
> +
> + BUG_ON(bip->bip_buf == NULL);
> + BUG_ON(bio_data_dir(bio) != WRITE);
> +
> + if (bi->tag_size == 0)
> + return -1;
> +
> + nr_sectors = len / bi->tag_size;
> +
> + if (len % 2)
> + nr_sectors++;

I see you've changed this to:

nr_sectors = (len + bi->tag_size - 1) / bi->tag_size;

why not simply use DIV_ROUND_UP?

> +
> + if (bi->sector_size == 4096)
> + nr_sectors >>= 3;
> +
> + if (nr_sectors * bi->tuple_size > bip->bip_size) {
> + printk(KERN_ERR "%s: tag too big for bio: %u > %u\n",
> + __func__, nr_sectors * bi->tuple_size, bip->bip_size);
> + return -1;
> + }
> +
> + bi->set_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(bio_integrity_set_tag);
> +
> +/**
> + * bio_integrity_get_tag - Retrieve a tag buffer from a bio
> + * @bio: bio to retrieve buffer from
> + * @tag_buf: Pointer to a buffer for the tag data
> + * @len: Length of the target buffer
> + *
> + * Description: Use this function to retrieve the tag buffer from a
> + * completed I/O. The size of the integrity buffer must be <= to the
> + * size reported by bio_integrity_tag_size().
> + */
> +int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
> +{
> + struct bip *bip = bio->bi_integrity;
> + struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
> + unsigned int nr_sectors;
> +
> + BUG_ON(bip->bip_buf == NULL);
> + BUG_ON(bio_data_dir(bio) != READ);
> +
> + if (bi->tag_size == 0)
> + return -1;
> +
> + nr_sectors = len / bi->tag_size;
> +
> + if (len % 2)
> + nr_sectors++;
> +
> + if (bi->sector_size == 4096)
> + nr_sectors >>= 3;
> +
> + if (nr_sectors * bi->tuple_size > bip->bip_size) {
> + printk(KERN_ERR "%s: tag too big for bio: %u > %u\n",
> + __func__, nr_sectors * bi->tuple_size, bip->bip_size);
> + return -1;
> + }
> +
> + bi->get_tag_fn(bip->bip_buf, tag_buf, nr_sectors);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(bio_integrity_get_tag);

set_tag and get_tag are almost identical. Any chance you want to factor
out that code?

> +/**
> + * bio_integrity_generate - Generate integrity metadata for a bio
> + * @bio: bio to generate integrity metadata for
> + *
> + * Description: Generates integrity metadata for a bio by calling the
> + * block device's generation callback function. The bio must have a
> + * bip attached with enough room to accomodate the generated integrity
^^^^^^^^^^
accommodate

> + * metadata.
> + */
> +static void bio_integrity_generate(struct bio *bio)
> +{
> + struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);

Hmm, up until this point you use bi to mean bio_integrity, but now
it means blk_integrity. Confusion will ensue. ;)

> + struct blk_integrity_exchg bix;

struct blk_integrity_exchg is not yet defined in your patch set, so this
will likely break git bisect.

> +int bio_integrity_prep(struct bio *bio)
> +{
...
> + buf = kzalloc(len, GFP_NOIO | q->bounce_gfp);

Does this actually need to be zeroed?

> +void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
...
> + bip_for_each_vec(iv, bip, i) {
> + if (skip == 0) {
> + bip->bip_idx = i;
> + return;
> + } else if (skip >= iv->bv_len) {
> + skip -= iv->bv_len;
> + } else { /* skip < iv->bv_len) */
> + iv->bv_offset += skip;
> + iv->bv_len -= skip;
> + bip->bip_idx = i;
> + return;
> + }
> + }
> +}

> +void bio_integrity_trim(struct bio *bio, unsigned int offset, unsigned int sectors)
> +{
...
> + /* Mark head */
> + bip_for_each_vec(iv, bip, i) {
> + if (skip == 0) {
> + bip->bip_idx = i;
> + break;
> + } else if (skip >= iv->bv_len) {
> + skip -= iv->bv_len;
> + } else { /* skip < iv->bv_len) */
> + iv->bv_offset += skip;
> + iv->bv_len -= skip;
> + bip->bip_idx = i;
> + break;
> + }
> + }

The above two loops look pretty much the same to me. Can you factor
that out to a helper?

Cheers,

Jeff

2008-06-10 20:55:35

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 0 of 7] Block/SCSI Data Integrity Support

"Martin K. Petersen" <[email protected]> writes:

>>>>>> "Jeff" == Jeff Moyer <[email protected]> writes:
>
> Jeff> So, is it safe to say that the library routines for
> Jeff> integrity-aware file systems have not been tested at all?
> Jeff> Specifically, I'm talking about: bio_integrity_tag_size,
> Jeff> bio_integrity_set_tag, bio_integrity_get_tag
>
> I have not tried using them from within a filesystem, if that's what
> you mean. But I have attached random strings to bios and read them
> back later.

I was checking to see if you had exercised the code at all, and you
have. Great!

Cheers,

Jeff

2008-06-11 04:06:22

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH 4 of 7] block: bio data integrity support

>>>>> "Jeff" == Jeff Moyer <[email protected]> writes:

>> + memset(bip, 0, sizeof(*bip));
>> + idx = 0;

Jeff> That assignment isn't necessary.

Zap!


Jeff> nr_sectors = (len + bi->tag_size - 1) / bi->tag_size;

Jeff> why not simply use DIV_ROUND_UP?

Fixed.


Jeff> set_tag and get_tag are almost identical. Any chance you want
Jeff> to factor out that code?

Done.


Jeff> Hmm, up until this point you use bi to mean bio_integrity, but
Jeff> now it means blk_integrity. Confusion will ensue. ;)

Err, uhm. There is no bio_integrity. There's the bio integrity
payload which I always refer to as struct bip *bip. And struct
blk_integrity which is always bi. I'm also anal about using bv for
the data bio_vec and iv for the integrity bio_vec. I can't see any
place where I'm inconsistent.


Jeff> struct blk_integrity_exchg is not yet defined in your patch set,
Jeff> so this will likely break git bisect.

bio-integrity.patch and blk-integrity.patch are artificially split up
to ease the review process. They are not meant to be separate
changesets.


>> + buf = kzalloc(len, GFP_NOIO | q->bounce_gfp);

Jeff> Does this actually need to be zeroed?

Nope.


>> +void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
>> +void bio_integrity_trim(struct bio *bio, unsigned int offset, unsigned int sectors)

Jeff> The above two loops look pretty much the same to me. Can you
Jeff> factor that out to a helper?

I've created helpers for marking head and tail of the ivec.
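
The helpers themselves are not shown in this message. The following userspace sketch shows one way the shared loops from bio_integrity_advance() and bio_integrity_trim() could be factored out; the demo_ types and function names are hypothetical, and the loop bodies mirror the posted patch.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for struct bio_vec and struct bip. */
struct demo_vec {
	unsigned int offset;
	unsigned int len;
};

struct demo_bip {
	struct demo_vec *vec;
	unsigned short vcnt;	/* number of vectors */
	unsigned short idx;	/* current vector index */
};

/* Advance the start of the integrity vector by 'skip' bytes. */
static void demo_mark_head(struct demo_bip *bip, unsigned int skip)
{
	unsigned short i;

	for (i = bip->idx; i < bip->vcnt; i++) {
		struct demo_vec *iv = &bip->vec[i];

		if (skip == 0) {
			bip->idx = i;
			return;
		} else if (skip >= iv->len) {
			skip -= iv->len;	/* whole vector consumed */
		} else {
			iv->offset += skip;	/* partial vector consumed */
			iv->len -= skip;
			bip->idx = i;
			return;
		}
	}
}

/* Truncate the integrity vector to 'nr_bytes' bytes from the current index. */
static void demo_mark_tail(struct demo_bip *bip, unsigned int nr_bytes)
{
	unsigned short i;

	for (i = bip->idx; i < bip->vcnt; i++) {
		struct demo_vec *iv = &bip->vec[i];

		if (nr_bytes == 0) {
			bip->vcnt = i;
			return;
		} else if (nr_bytes >= iv->len) {
			nr_bytes -= iv->len;
		} else {
			iv->len = nr_bytes;
			nr_bytes = 0;
		}
	}
}
```

With helpers like these, advancing reduces to a single demo_mark_head() call, and trimming to demo_mark_head() followed by demo_mark_tail().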

--
Martin K. Petersen Oracle Linux Engineering

2008-06-11 17:41:53

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 4 of 7] block: bio data integrity support

"Martin K. Petersen" <[email protected]> writes:

>>>>>> "Jeff" == Jeff Moyer <[email protected]> writes:
> Jeff> Hmm, up until this point you use bi to mean bio_integrity, but
> Jeff> now it means blk_integrity. Confusion will ensue. ;)
>
> Err, uhm. There is no bio_integrity. There's the bio integrity
> payload which I always refer to as struct bip *bip. And struct
> blk_integrity which is always bi. I'm also anal about using bv for
> the data bio_vec and iv for the integrity bio_vec. I can't see any
> place where I'm inconsistent.

Wow, I have no idea where I got that impression. Sorry!

> Jeff> struct blk_integrity_exchg is not yet defined in your patch set,
> Jeff> so this will likely break git bisect.
>
> bio-integrity.patch and blk-integrity.patch are artificially split up
> to ease the review process. They are not meant to be separate
> changesets.

OK, just wanted to make sure you were aware of it.

Cheers,

Jeff

2008-07-17 13:56:24

by Mike Snitzer

Subject: Re: [PATCH 0 of 7] Block/SCSI Data Integrity Support

On Tue, Jun 10, 2008 at 11:28 AM, Martin K. Petersen
<[email protected]> wrote:
>>>>>> "Jeff" == Jeff Moyer <[email protected]> writes:
>
> Jeff> Thanks for all of the great documentation. It would be good to
> Jeff> include some instructions on how one would test this, and what
> Jeff> testing you performed.
>
> modprobe scsi_debug dix=199 dif=1 guard=1 dev_size_mb=1024 num_parts=1
>
> I'm testing with XFS and btrfs. Generally doing kernel builds, etc.
> ext2/3 are still problematic because they modify pages in flight.

Have you made the ext2/3/4 developers aware of this?

Could you elaborate on the interaction between the data integrity
support in the block layer and a given filesystem? Shouldn't _any_
filesystem "just work" given that the block layer is what is
generating the checksums and then verifying them on read?

regards,
Mike

2008-07-17 15:36:43

by Martin K. Petersen

Subject: Re: [PATCH 0 of 7] Block/SCSI Data Integrity Support

>>>>> "Mike" == Mike Snitzer <[email protected]> writes:

>> I'm testing with XFS and btrfs. Generally doing kernel builds,
>> etc. ext2/3 are still problematic because they modify pages in
>> flight.

Mike> Have you made the ext2/3/4 developers aware of this?

Yep.


Mike> Shouldn't _any_ filesystem "just work" given that the block
Mike> layer is what is generating the checksums and then verifying
Mike> them on read?

Yep.

There are a couple of issues. One problem is that pages are no longer
locked down during I/O. Instead the writeback bit is being set to
indicate that I/O is in progress. Not all corners of ext* have been
adapted to that properly. Especially ext2 suffers and often modifies
pages containing metadata while they are in flight. If I remember
correctly, ext2/dir.c hasn't been made aware of writeback at all and
assumes the page lock still works like it used to.

That is normally not a huge problem because the page is being
scheduled for write again shortly thereafter. So the inconsistent
block on disk gets overwritten pretty much instantly. But that kind
of sloppy behavior is a no-go with integrity checking turned on.
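
To illustrate the failure mode, here is a rough userspace sketch, not the
kernel code. It assumes a CRC guard tag using the T10 DIF polynomial
0x8BB7 over a 512-byte sector; the names crc_t10dif_sketch() and
guard_survives() are made up for this example. The guard is computed when
the I/O is submitted, so any later change to the buffer makes the stored
guard stale.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Bitwise CRC-16, polynomial 0x8BB7, initial value 0, no bit reflection. */
static uint16_t crc_t10dif_sketch(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;

	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)(buf[i] << 8);
		for (int bit = 0; bit < 8; bit++) {
			if (crc & 0x8000)
				crc = (uint16_t)((crc << 1) ^ 0x8BB7);
			else
				crc = (uint16_t)(crc << 1);
		}
	}
	return crc;
}

/*
 * Hypothetical model of one write: returns 1 if the guard generated at
 * submission time still matches the data when it is verified, else 0.
 */
static int guard_survives(int modify_in_flight)
{
	uint8_t sector[SECTOR_SIZE];
	uint16_t guard;

	memset(sector, 0xA5, sizeof(sector));

	/* Submission: integrity metadata is generated from the data. */
	guard = crc_t10dif_sketch(sector, sizeof(sector));

	/* The filesystem touches the page while the I/O is in flight. */
	if (modify_in_flight)
		sector[100] ^= 0x01;

	/* Verification further down the stack recomputes and compares. */
	return crc_t10dif_sketch(sector, sizeof(sector)) == guard;
}
```

guard_survives(0) passes verification, while guard_survives(1) fails it
even though only a single bit changed, which is why the I/O gets
rejected rather than quietly overwritten later.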

There also appear to be some quirks in the page cache in general.
There's something not quite right in clear_page_dirty() /
page_mkwrite() territory. If I sync excessively I can make any fs
keel over. peterz said that an mmapped page is supposed to be
read-only during writeback but that appears to be racy when a forced
sync is involved.

That's my recollection, anyway. I've been busy with the innards of
the integrity code stuff for a couple of months and haven't poked at
the fs/vm issues for a while.

--
Martin K. Petersen Oracle Linux Engineering