Hi-
I know there is some work on ext4 regarding metadata corruption detection; btrfs also has some corruption detection facilities. The IETF NFS working group is considering the addition of corruption detection to the next NFSv4 minor version. T10 has introduced DIF/DIX.
I'm probably ignorant of the current state of implementation in Linux, but I'm interested in understanding common ground among local file systems, block storage, and network file systems. Example questions include: Do we need standardized APIs for block device corruption detection? How much of T10 DIF/DIX should NFS support? What are the drivers for this feature (broad use cases)?
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
>>>>> "Bernd" == Bernd Schubert <[email protected]> writes:
Bernd> Hmm, direct IO would mean we could not use the page cache. As we
Bernd> are using it, that would not really suit us. libaio might be
Bernd> another option then.
Bernd> What kind of help do you exactly need?
As far as libaio is concerned I had a PoC working a few years ago. I'll
be happy to revive it if people are actually interested. So a real world
use case would be a great help...
But James is right that buffered I/O is much more challenging than
direct I/O. And all the use cases we have had have involved databases
and business apps that were doing direct I/O anyway.
--
Martin K. Petersen Oracle Linux Engineering
On Thu, 2012-01-26 at 17:27 +0100, Bernd Schubert wrote:
> On 01/26/2012 03:53 PM, Martin K. Petersen wrote:
> >>>>>> "Bernd" == Bernd Schubert<[email protected]> writes:
> >
> > Bernd> We from the Fraunhofer FhGFS team would like to also see the T10
> > Bernd> DIF/DIX API exposed to user space, so that we could make use of
> > Bernd> it for our FhGFS file system. And I think this feature is not
> > Bernd> only useful for file systems, but in general, scientific
> > Bernd> applications, databases, etc also would benefit from assurance of
> > Bernd> data integrity.
> >
> > I'm attending a SNIA meeting today to discuss a (cross-OS) data
> > integrity aware API. We'll see what comes out of that.
> >
> > With the Linux hat on I'm still mainly interested in pursuing the
> > sys_dio interface Joel and I proposed last year. We have good experience
> > with that I/O model and it suits applications that want to interact with
> > the protection information well. libaio is also on my list.
> >
> > But obviously any help and input is appreciated...
> >
>
> I guess you are referring to the interface described here
>
> http://www.spinics.net/lists/linux-mm/msg14512.html
>
> Hmm, direct IO would mean we could not use the page cache. As we are
> using it, that would not really suit us. libaio might be another
> option then.
Are you really sure you want protection information and the page cache?
The reason for using DIO is that no-one could really think of a valid
page cache based use case. What most applications using protection
information want is to say: This is my data and this is the integrity
verification, send it down and assure me you wrote it correctly. If you
go via the page cache, we have all sorts of problems, like our
granularity is a page (not a block) so you'd have to guarantee to write
a page at a time (a mechanism for combining subpage units of protection
information sounds like a nightmare). The write becomes "mark the page
dirty and wait for the system to flush it", and we can update the page
in the meantime. How do we update the page and its protection
information atomically? What happens if the page gets updated but no
protection information is supplied, and so on ... The can of worms just
gets more squirmy. Doing DIO-only avoids all of this.
James
On 01/31/2012 03:10 AM, Martin K. Petersen wrote:
>>>>>> "Bernd" == Bernd Schubert<[email protected]> writes:
>
> Bernd> Hmm, direct IO would mean we could not use the page cache. As we
> Bernd> are using it, that would not really suit us. libaio might be
> Bernd> another option then.
>
> Bernd> What kind of help do you exactly need?
>
> As far as libaio is concerned I had a PoC working a few years ago. I'll
> be happy to revive it if people are actually interested. So a real world
> use case would be a great help...
I guess it would be useful for us, although right now data integrity is
not on our todo list for the next couple of months. Unless other people
are interested in it right now, can we postpone it for a while?
>
> But James is right that buffered I/O is much more challenging than
> direct I/O. And all the use cases we have had have involved databases
> and business apps that were doing direct I/O anyway.
>
I guess we should talk to developers of other parallel file systems and
see what they think about it. I think cephfs already uses data integrity
provided by btrfs, although I'm not entirely sure and need to check the
code. As I said before, Lustre does network checksums already and
*might* be interested.
Cheers,
Bernd
On 01/17/2012 09:15 PM, Chuck Lever wrote:
> Hi-
>
> I know there is some work on ext4 regarding metadata corruption
> detection; btrfs also has some corruption detection facilities. The
> IETF NFS working group is considering the addition of corruption
> detection to the next NFSv4 minor version. T10 has introduced
> DIF/DIX.
>
> I'm probably ignorant of the current state of implementation in
> Linux, but I'm interested in understanding common ground among local
> file systems, block storage, and network file systems. Example
> questions include: Do we need standardized APIs for block device
> corruption detection? How much of T10 DIF/DIX should NFS support?
> What are the drivers for this feature (broad use cases)?
>
Other network file systems such as Lustre already use their own network
data checksums. As far as I know Lustre plans (planned?) to also use the
underlying ZFS checksums for network transfers, giving real
client-to-disk (end-to-end) checksums. Using T10 DIF/DIX might be on
their todo list.
We from the Fraunhofer FhGFS team would like to also see the T10 DIF/DIX
API exposed to user space, so that we could make use of it for our FhGFS
file system.
And I think this feature is not only useful for file systems; in
general, scientific applications, databases, etc. would also benefit
from assurance of data integrity.
Cheers,
Bernd
--
Bernd Schubert
Fraunhofer ITWM
On Jan 31, 2012, at 2:16 PM, Bernd Schubert wrote:
> On 01/27/2012 12:21 AM, James Bottomley wrote:
>> On Thu, 2012-01-26 at 17:27 +0100, Bernd Schubert wrote:
>>> On 01/26/2012 03:53 PM, Martin K. Petersen wrote:
>>>>>>>>> "Bernd" == Bernd Schubert<[email protected]> writes:
>>>>
>>>> Bernd> We from the Fraunhofer FhGFS team would like to also see the T10
>>>> Bernd> DIF/DIX API exposed to user space, so that we could make use of
>>>> Bernd> it for our FhGFS file system. And I think this feature is not
>>>> Bernd> only useful for file systems, but in general, scientific
>>>> Bernd> applications, databases, etc also would benefit from assurance of
>>>> Bernd> data integrity.
>>>>
>>>> I'm attending a SNIA meeting today to discuss a (cross-OS) data
>>>> integrity aware API. We'll see what comes out of that.
>>>>
>>>> With the Linux hat on I'm still mainly interested in pursuing the
>>>> sys_dio interface Joel and I proposed last year. We have good experience
>>>> with that I/O model and it suits applications that want to interact with
>>>> the protection information well. libaio is also on my list.
>>>>
>>>> But obviously any help and input is appreciated...
>>>>
>>>
>>> I guess you are referring to the interface described here
>>>
>>> http://www.spinics.net/lists/linux-mm/msg14512.html
>>>
>>> Hmm, direct IO would mean we could not use the page cache. As we are
>>> using it, that would not really suit us. libaio might be another
>>> option then.
>>
>> Are you really sure you want protection information and the page cache?
>> The reason for using DIO is that no-one could really think of a valid
>> page cache based use case. What most applications using protection
>> information want is to say: This is my data and this is the integrity
>> verification, send it down and assure me you wrote it correctly. If you
>> go via the page cache, we have all sorts of problems, like our
>> granularity is a page (not a block) so you'd have to guarantee to write
>> a page at a time (a mechanism for combining subpage units of protection
>> information sounds like a nightmare). The write becomes "mark the page
>> dirty and wait for the system to flush it", and we can update the page
>> in the meantime. How do we update the page and its protection
>> information atomically? What happens if the page gets updated but no
>> protection information is supplied, and so on ... The can of worms just
>> gets more squirmy. Doing DIO-only avoids all of this.
>
> Well, entirely direct-IO will not work anyway: FhGFS is a parallel network file system, so data are sent from clients to servers and are not entirely direct anymore.
> The problem with server-side storage direct-IO is that it is too slow for several workloads. I guess the write performance could mostly be solved somehow, but even then the read cache would be entirely missing. From Lustre history I know that server-side read cache improved performance of applications at several sites. So I really wouldn't like to disable it for FhGFS...
> I guess if we couldn't use the page cache, we probably wouldn't attempt to use the DIF/DIX interface, but will calculate our own checksums once we start working on the data integrity feature on our side.
This is interesting. I imagine the Linux kernel NFS server will have the same issue: it depends on the page cache for good performance, and does not, itself, use direct I/O.
Thus it wouldn't be able to use a direct I/O-only DIF/DIX implementation, and we can't use DIF/DIX for end-to-end corruption detection for a Linux client - Linux server configuration.
If high-performance applications such as databases demand corruption detection, it will need to work without introducing significant performance overhead.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On 01/27/2012 12:21 AM, James Bottomley wrote:
> On Thu, 2012-01-26 at 17:27 +0100, Bernd Schubert wrote:
>> On 01/26/2012 03:53 PM, Martin K. Petersen wrote:
>>>>>>>> "Bernd" == Bernd Schubert<[email protected]> writes:
>>>
>>> Bernd> We from the Fraunhofer FhGFS team would like to also see the T10
>>> Bernd> DIF/DIX API exposed to user space, so that we could make use of
>>> Bernd> it for our FhGFS file system. And I think this feature is not
>>> Bernd> only useful for file systems, but in general, scientific
>>> Bernd> applications, databases, etc also would benefit from assurance of
>>> Bernd> data integrity.
>>>
>>> I'm attending a SNIA meeting today to discuss a (cross-OS) data
>>> integrity aware API. We'll see what comes out of that.
>>>
>>> With the Linux hat on I'm still mainly interested in pursuing the
>>> sys_dio interface Joel and I proposed last year. We have good experience
>>> with that I/O model and it suits applications that want to interact with
>>> the protection information well. libaio is also on my list.
>>>
>>> But obviously any help and input is appreciated...
>>>
>>
>> I guess you are referring to the interface described here
>>
>> http://www.spinics.net/lists/linux-mm/msg14512.html
>>
>> Hmm, direct IO would mean we could not use the page cache. As we are
>> using it, that would not really suit us. libaio might be another
>> option then.
>
> Are you really sure you want protection information and the page cache?
> The reason for using DIO is that no-one could really think of a valid
> page cache based use case. What most applications using protection
> information want is to say: This is my data and this is the integrity
> verification, send it down and assure me you wrote it correctly. If you
> go via the page cache, we have all sorts of problems, like our
> granularity is a page (not a block) so you'd have to guarantee to write
> a page at a time (a mechanism for combining subpage units of protection
> information sounds like a nightmare). The write becomes "mark the page
> dirty and wait for the system to flush it", and we can update the page
> in the meantime. How do we update the page and its protection
> information atomically? What happens if the page gets updated but no
> protection information is supplied, and so on ... The can of worms just
> gets more squirmy. Doing DIO-only avoids all of this.
Well, entirely direct-IO will not work anyway: FhGFS is a parallel
network file system, so data are sent from clients to servers and are
not entirely direct anymore.
The problem with server-side storage direct-IO is that it is too slow
for several workloads. I guess the write performance could mostly be
solved somehow, but even then the read cache would be entirely missing.
From Lustre history I know that server-side read cache improved
performance of applications at several sites. So I really wouldn't like
to disable it for FhGFS...
I guess if we couldn't use the page cache, we probably wouldn't attempt
to use the DIF/DIX interface, but will calculate our own checksums once
we start working on the data integrity feature on our side.
Cheers,
Bernd
On Tue, Jan 31, 2012 at 11:22 AM, Bernd Schubert
<[email protected]> wrote:
> I guess we should talk to developers of other parallel file systems and see
> what they think about it. I think cephfs already uses data integrity
> provided by btrfs, although I'm not entirely sure and need to check the
> code. As I said before, Lustre does network checksums already and *might* be
> interested.
Actually, right now Ceph doesn't check btrfs' data integrity
information, but since Ceph doesn't have any data-at-rest integrity
verification it relies on btrfs if you want that. Integrating
integrity verification throughout the system is on our long-term to-do
list.
We too will be sad if using a kernel-level integrity system requires
using DIO, although we could probably work out a way to do
"translation" between our own integrity checksums and the
btrfs-generated ones if we have to (thanks to replication).
-Greg
>>>>> "Chuck" == Chuck Lever <[email protected]> writes:
Chuck> I'm probably ignorant of the current state of implementation in
Chuck> Linux, but I'm interested in understanding common ground among
Chuck> local file systems, block storage, and network file systems.
Chuck> Example questions include: Do we need standardized APIs for block
Chuck> device corruption detection?
The block layer integrity stuff aims to be format agnostic. It was
designed to accommodate different types of protection information (back
then the ATA proposal was still on the table).
Chuck> How much of T10 DIF/DIX should NFS support?
You can either support the T10 PI format and act as a conduit. Or you
can invent your own format and potentially force a conversion. I'd
prefer the former (despite the limitations of T10 PI).
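For reference, the T10 PI format in question is an 8-byte tuple carried
with each 512-byte sector. A purely illustrative C view of the layout
(not a definition lifted from the kernel or the spec text):

  #include <stdint.h>

  /* 8 bytes of protection information per 512-byte sector;
   * all fields are big-endian on the wire. */
  struct t10_pi_tuple_sketch {
          uint16_t guard_tag;   /* CRC16 over the 512 data bytes         */
          uint16_t app_tag;     /* application/owner tag                 */
          uint32_t ref_tag;     /* Type 1: low 32 bits of the target LBA */
  };

Acting as a conduit then means carrying these tuples end-to-end
unchanged rather than converting to a different checksum format.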
Chuck> What are the drivers for this feature (broad use cases)?
Two things:
1. Continuity. Downtime can be very costly, and many applications must
be taken offline to recover after a corruption error.
2. Archival. You want to make sure you write a good copy to backup.
With huge amounts of data it is often unfeasible to scrub and verify
the data on the backup media.
--
Martin K. Petersen Oracle Linux Engineering
>>>>> "Bernd" == Bernd Schubert <[email protected]> writes:
Bernd> We from the Fraunhofer FhGFS team would like to also see the T10
Bernd> DIF/DIX API exposed to user space, so that we could make use of
Bernd> it for our FhGFS file system. And I think this feature is not
Bernd> only useful for file systems, but in general, scientific
Bernd> applications, databases, etc also would benefit from assurance of
Bernd> data integrity.
I'm attending a SNIA meeting today to discuss a (cross-OS) data
integrity aware API. We'll see what comes out of that.
With the Linux hat on I'm still mainly interested in pursuing the
sys_dio interface Joel and I proposed last year. We have good experience
with that I/O model and it suits applications that want to interact with
the protection information well. libaio is also on my list.
But obviously any help and input is appreciated...
--
Martin K. Petersen Oracle Linux Engineering
On 01/26/2012 03:53 PM, Martin K. Petersen wrote:
>>>>>> "Bernd" == Bernd Schubert<[email protected]> writes:
>
> Bernd> We from the Fraunhofer FhGFS team would like to also see the T10
> Bernd> DIF/DIX API exposed to user space, so that we could make use of
> Bernd> it for our FhGFS file system. And I think this feature is not
> Bernd> only useful for file systems, but in general, scientific
> Bernd> applications, databases, etc also would benefit from assurance of
> Bernd> data integrity.
>
> I'm attending a SNIA meeting today to discuss a (cross-OS) data
> integrity aware API. We'll see what comes out of that.
>
> With the Linux hat on I'm still mainly interested in pursuing the
> sys_dio interface Joel and I proposed last year. We have good experience
> with that I/O model and it suits applications that want to interact with
> the protection information well. libaio is also on my list.
>
> But obviously any help and input is appreciated...
>
I guess you are referring to the interface described here
http://www.spinics.net/lists/linux-mm/msg14512.html
Hmm, direct IO would mean we could not use the page cache. As we are
using it, that would not really suit us. libaio might be another
option then.
What kind of help do you exactly need?
Thanks,
Bernd
>>>>> "Andreas" == Andreas Dilger <[email protected]> writes:
Andreas> Is there a description of sys_dio() somewhere?
This was the original draft:
http://www.spinics.net/lists/linux-mm/msg14512.html
Andreas> In particular, I'm interested to know whether it allows full
Andreas> scatter-gather IO submission, unlike pwritev() which only allows
Andreas> multiple input buffers, and not multiple file offsets.
Each request descriptor contains buffer, target file, and offset. So
it's a single entry per descriptor. But many descriptors can be
submitted (and reaped) in a single syscall. So you don't have the single
file offset limitation of pwritev().
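To illustrate the I/O model, a rough sketch (the names below are made up
for the example, not taken from the actual sys_dio proposal linked
above):

  #include <stdint.h>

  /* one descriptor = one buffer + one file + one offset */
  struct dio_req_sketch {
          int       fd;        /* target file                           */
          uint64_t  offset;    /* file offset for this request          */
          void     *buf;       /* data buffer                           */
          uint64_t  len;       /* length in bytes                       */
          void     *pi_buf;    /* protection information, if requested  */
          int32_t   result;    /* completion status, filled in by the
                                * kernel when the descriptor is reaped  */
  };

  /* hypothetical batched submit/reap entry point, loosely analogous
   * to io_submit()/io_getevents() in libaio */
  long sys_dio_sketch(struct dio_req_sketch *reqs, unsigned int nr);

The per-descriptor PI buffer is what makes this model attractive for
applications that want to pass protection information along with each
write.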
--
Martin K. Petersen Oracle Linux Engineering
>>>>> "Chuck" == Chuck Lever <[email protected]> writes:
>> I guess if we couldn't use the page cache, we probably wouldn't
>> attempt to use DIF/DIX interface, but will calculate our own
>> checksums once we are going to work on the data integrity feature on
>> our side.
Chuck> This is interesting. I imagine the Linux kernel NFS server will
Chuck> have the same issue: it depends on the page cache for good
Chuck> performance, and does not, itself, use direct I/O.
Just so we're perfectly clear here: There's nothing that prevents you
from hanging protection information off of your page private pointer so
it can be submitted along with the data.
The concern is purely that the filesystem owning the pages needs to
handle access conflicts (racing DI and non-DI updates) and potentially
sub page modifications for small filesystem block sizes.
The other problem with buffered I/O is that the notion of "all these
pages are belong to us^wone write request" goes out the window. You
really need something like aio to be able to get completion
status and re-drive the I/O if there's an integrity error. Otherwise you
might as well just let the kernel autoprotect the pages from the block
layer down (which is what we do now).
--
Martin K. Petersen Oracle Linux Engineering
On 02/01/2012 08:15 PM, Martin K. Petersen wrote:
>>>>>> "James" == James Bottomley <[email protected]> writes:
>
> IOW, the filesystem should only ever act as a conduit. The only real
> challenge as far as I can tell is how to handle concurrent protected and
> unprotected updates to a page. If a non-PI-aware app updates a cached
> page which is subsequently read by an app requesting PI that means we
> may have to force a write-out followed by a read to get valid PI. We
> could synthesize it to avoid the I/O but I think that would be violating
> the premise of protected transfer. Another option is to have an
> exclusive write access mechanism that only permits either protected or
> unprotected access to a page.
Yes, a protected write implies byte-range locking on the file
(and can be implemented with one).
Also, the open() call demands an O_PROTECT option. Protection is
then a file attribute as well.
>
Boaz
On 02/01/2012 07:30 PM, Andrea Arcangeli wrote:
> On Wed, Feb 01, 2012 at 12:16:05PM -0600, James Bottomley wrote:
>> supplying protection information to user space isn't about the
>> application checking what's on disk .. there's automatic verification in
>> the chain to do that (both the HBA and the disk will check the
>> protection information on entry/exit and transfer). Supplying
>> protection information to userspace is about checking nothing went wrong
>> in the handoff between the end of the DIF stack and the application.
>
> Not sure if I got this right, but keeping protection information for
> in-ram pagecache and exposing it to userland somehow, to me sounds a
> bit of overkill as a concept. Then you should want that for anonymous
> memory too. If you copy the pagecache to a malloc()ed buffer and
verify the pagecache was consistent, but then the buffer is corrupted by
a hardware bitflip or software bug, what's the point? Besides, if
> this is getting exposed to userland and it's not hidden in the kernel
> (FS/Storage layers), userland could code its own verification logic
> without much added complexity. With CRC in hardware on the CPU it
> doesn't sound like a big cost to do it fully in userland and then you
> could run it on anonymous memory too if you need and not be dependent
on hardware or filesystem details (well, other than a cpuid check at
> startup).
I think the point for network file systems is that they can reuse the
disk-checksum for network verification. So instead of calculating a
checksum for network and disk, just use one for both. The checksum also
is supposed to be cached in memory, as that avoids re-calculation for
other clients.
1)
client-1: sends data and checksum
server: Receives those data and verifies the checksum -> network
transfer was ok, sends data and checksum to disk
2)
client-2 ... client-N: Ask for those data
server: send cached data and cached checksum
client-2 ... client-N: Receive data and verify checksum
So the whole point of caching checksums is to avoid the server needing
to recalculate them for dozens of clients. Recalculating checksums
simply does not scale with an increasing number of clients that want to
read data processed by another client.
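A minimal user-space sketch of that scheme (the struct and the crc32()
stand-in checksum are illustrative, not FhGFS code):

  #include <stdint.h>
  #include <string.h>
  #include <zlib.h>   /* crc32() as a stand-in for the real checksum */

  struct cached_block {
          unsigned char data[4096];
          uint32_t      csum;      /* checksum received from the writer */
  };

  /* write path: verify once, then cache data and checksum together
   * (the same checksum is also what gets sent down to disk) */
  int server_store(struct cached_block *blk,
                   const unsigned char *data, uint32_t client_csum)
  {
          if (crc32(0L, data, sizeof(blk->data)) != client_csum)
                  return -1;               /* network transfer was bad */
          memcpy(blk->data, data, sizeof(blk->data));
          blk->csum = client_csum;
          return 0;
  }

  /* read path for client-2 .. client-N: no recomputation at all, just
   * hand back the cached data together with the cached checksum */
  void server_read(const struct cached_block *blk,
                   unsigned char *data_out, uint32_t *csum_out)
  {
          memcpy(data_out, blk->data, sizeof(blk->data));
          *csum_out = blk->csum;           /* client verifies on receipt */
  }

The server only ever verifies on ingest; every subsequent reader gets
the cached checksum for free.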
Cheers,
Bernd
On Tue, Jan 31, 2012 at 11:28:26AM -0800, Gregory Farnum wrote:
> On Tue, Jan 31, 2012 at 11:22 AM, Bernd Schubert
> <[email protected]> wrote:
> > I guess we should talk to developers of other parallel file systems and see
> > what they think about it. I think cephfs already uses data integrity
> > provided by btrfs, although I'm not entirely sure and need to check the
> > code. As I said before, Lustre does network checksums already and *might* be
> > interested.
>
> Actually, right now Ceph doesn't check btrfs' data integrity
> information, but since Ceph doesn't have any data-at-rest integrity
> verification it relies on btrfs if you want that. Integrating
> integrity verification throughout the system is on our long-term to-do
> list.
> We too will be sad if using a kernel-level integrity system requires
> using DIO, although we could probably work out a way to do
> "translation" between our own integrity checksums and the
> btrfs-generated ones if we have to (thanks to replication).
DIO isn't really required, but doing this without synchronous writes
will get painful in a hurry. There's nothing wrong with letting the
data sit in the page cache after the IO is done though.
-chris
>>>>> "James" == James Bottomley <[email protected]> writes:
James> I broadly agree with this, but even if you do sync writes and
James> cache read only copies, we still have the problem of how we do
James> the read side verification of DIX.
Whoever requested the protected information will know how to verify
it.
Right now, if the Oracle DB enables a protected transfer, it'll do a
verification pass once a read I/O completes.
Similarly, the block layer will verify data+PI if the auto-protection
feature has been turned on.
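For the DIF/DIX case that verification pass boils down to recomputing
the guard tag per sector and comparing it against what came back in the
PI buffer. A simple sketch (bit-wise CRC16 with the T10 polynomial
0x8BB7, written for clarity rather than speed; real code would use a
table-driven or hardware-assisted implementation):

  #include <stddef.h>
  #include <stdint.h>

  static uint16_t t10_crc16(const unsigned char *buf, size_t len)
  {
          uint16_t crc = 0;

          while (len--) {
                  crc ^= (uint16_t)(*buf++) << 8;
                  for (int i = 0; i < 8; i++)
                          crc = (crc & 0x8000) ? (crc << 1) ^ 0x8BB7
                                               : crc << 1;
          }
          return crc;
  }

  /* pi[] holds one 8-byte tuple per 512-byte sector; the guard tag is
   * the first two bytes, big-endian */
  static int verify_guard_tags(const unsigned char *data,
                               const unsigned char *pi, size_t sectors)
  {
          for (size_t s = 0; s < sectors; s++) {
                  uint16_t expect = (pi[s * 8] << 8) | pi[s * 8 + 1];

                  if (t10_crc16(data + s * 512, 512) != expect)
                          return -1;  /* integrity error: re-drive the I/O */
          }
          return 0;
  }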
James> In theory, when you read, you could either get the cached copy or
James> an actual read (which will supply protection information), so for
James> the cached copy we need to return cached protection information
James> implying that we need some way of actually caching it.
Let's assume we add a PI buffer to kaio. If an application wants to send
or receive PI it needs to sit on top of a filesystem that can act as a
conduit for PI. That filesystem will need to store the PI for each page
somewhere hanging off of its page private pointer.
When submitting a write the filesystem must iterate over these PI
buffers and generate a bio integrity payload that it can attach to the
data bio. This works exactly the same way as iterating over the data
pages to build the data portion of the bio.
When an application is requesting PI, the filesystem must allocate the
relevant memory and update its private data to reflect the PI buffers.
These buffers are then attached the same way as on a write. And when the
I/O completes, the PI buffers contain the relevant PI from storage. Then
the application gets completion and can proceed to verify that data and
PI match.
IOW, the filesystem should only ever act as a conduit. The only real
challenge as far as I can tell is how to handle concurrent protected and
unprotected updates to a page. If a non-PI-aware app updates a cached
page which is subsequently read by an app requesting PI that means we
may have to force a write-out followed by a read to get valid PI. We
could synthesize it to avoid the I/O but I think that would be violating
the premise of protected transfer. Another option is to have an
exclusive write access mechanism that only permits either protected or
unprotected access to a page.
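To make the conduit model a bit more concrete, a rough kernel-side
sketch of the write path described above. The per-page private struct
and the helper are hypothetical; bio_integrity_alloc() and
bio_integrity_add_page() are the existing block layer integrity calls
(error handling simplified for the sketch):

  #include <linux/bio.h>
  #include <linux/mm.h>

  /* hypothetical data a PI-aware filesystem hangs off page->private */
  struct fs_page_pi {
          struct page  *pi_page;    /* page holding the PI tuples          */
          unsigned int  pi_len;     /* bytes of PI for this data page      */
          unsigned int  pi_offset;  /* offset of the tuples within pi_page */
  };

  static int fs_attach_pi(struct bio *bio, struct page **pages, int nr)
  {
          struct bio_integrity_payload *bip;
          int i;

          /* one integrity vec per data page in this sketch */
          bip = bio_integrity_alloc(bio, GFP_NOIO, nr);
          if (!bip)
                  return -ENOMEM;

          for (i = 0; i < nr; i++) {
                  struct fs_page_pi *pi =
                          (struct fs_page_pi *)page_private(pages[i]);

                  if (!bio_integrity_add_page(bio, pi->pi_page,
                                              pi->pi_len, pi->pi_offset))
                          return -ENOMEM;
          }
          return 0;
  }

The read side is the same loop with freshly allocated PI pages; on
completion the filesystem hands those pages back to the application for
verification.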
--
Martin K. Petersen Oracle Linux Engineering
On Thu, Feb 02, 2012 at 10:04:59AM +0100, Bernd Schubert wrote:
> I think the point for network file systems is that they can reuse the
> disk-checksum for network verification. So instead of calculating a
> checksum for network and disk, just use one for both. The checksum also
> is supposed to be cached in memory, as that avoids re-calculation for
> other clients.
>
> 1)
> client-1: sends data and checksum
>
> server: Receives those data and verifies the checksum -> network
> transfer was ok, sends data and checksum to disk
>
> 2)
> client-2 ... client-N: Ask for those data
>
> server: send cached data and cached checksum
>
> client-2 ... client-N: Receive data and verify checksum
>
>
> So the whole point of caching checksums is to avoid the server needing
> to recalculate them for dozens of clients. Recalculating checksums
> simply does not scale with an increasing number of clients that want to
> read data processed by another client.
This makes sense indeed. My argument was only about the exposure of
the storage hw format cksum to userland (through some new ioctl for
further userland verification of the pagecache data in the client
pagecache, done by whatever program is reading from the cache). The
network fs client lives in kernel, the network fs server lives in
kernel, so no need to expose the cksum to userland to do what you
described above.
I meant if we can't trust the pagecache to be correct (after the
network fs client code already checked the cksum cached by the server
and sent to the client along the server cached data), I don't see much
value added through a further verification by the userland program
running on the client and accessing pagecache in the client. If we
can't trust client pagecache to be safe against memory bitflips or
software bugs, we can hardly trust the anonymous memory too.
On Wed, 2012-02-01 at 18:59 +0100, Bernd Schubert wrote:
> On 02/01/2012 06:41 PM, Chris Mason wrote:
> > On Wed, Feb 01, 2012 at 10:52:55AM -0600, James Bottomley wrote:
> >> On Wed, 2012-02-01 at 11:45 -0500, Chris Mason wrote:
[...]
> >>> DIO isn't really required, but doing this without synchronous writes
> >>> will get painful in a hurry. There's nothing wrong with letting the
> >>> data sit in the page cache after the IO is done though.
> >>
> >> I broadly agree with this, but even if you do sync writes and cache read
> >> only copies, we still have the problem of how we do the read side
> >> verification of DIX. In theory, when you read, you could either get the
> >> cached copy or an actual read (which will supply protection
> >> information), so for the cached copy we need to return cached protection
> >> information implying that we need some way of actually caching it.
> >
> > Good point, reading from the cached copy is a lower level of protection
> > because in theory bugs in your scsi drivers could corrupt the pages
> > later on.
>
> But that only matters if the application is going to verify whether data
> are really on disk. For example (client-server scenario)
Um, well, then why do you want DIX? If you don't care about having the
client verify the data, that means you trust the integrity of the page
cache and then you just use the automated DIF within the driver layer
and SCSI will verify the data all the way up until the block layer
places it in the page cache.
The whole point of supplying protection information to user space is
that the application can verify the data didn't get corrupted after it
left the DIF protected block stack.
> 1) client-A writes a page
> 2) client-B reads this page
>
> client-B is simply not interested here where it gets the page from, as
> long as it gets correct data.
How does it know it got correct data if it doesn't verify? Something
might have corrupted the page between the time the block layer placed
the DIF-verified data there and the time the client reads it.
> The network file system in between will
> also be happy to use the existing in-cache crcs for network verification.
> Only if the page is later on dropped from the cache and read again,
> on-disk crcs matter. If those are bad, one of the layers is going to
> complain or correct those data.
>
> If the application wants to check data on disk it can either use DIO or
> alternatively something like fadvise(DONTNEED_LOCAL_AND_REMOTE)
> (something I wanted to propose for some time already, at least I'm not
> happy that posix_fadvise(POSIX_FADV_DONTNEED) is not passed to the file
> system at all).
supplying protection information to user space isn't about the
application checking what's on disk .. there's automatic verification in
the chain to do that (both the HBA and the disk will check the
protection information on entry/exit and transfer). Supplying
protection information to userspace is about checking nothing went wrong
in the handoff between the end of the DIF stack and the application.
James
On 02/01/2012 06:41 PM, Chris Mason wrote:
> On Wed, Feb 01, 2012 at 10:52:55AM -0600, James Bottomley wrote:
>> On Wed, 2012-02-01 at 11:45 -0500, Chris Mason wrote:
>>> On Tue, Jan 31, 2012 at 11:28:26AM -0800, Gregory Farnum wrote:
>>>> On Tue, Jan 31, 2012 at 11:22 AM, Bernd Schubert
>>>> <[email protected]> wrote:
>>>>> I guess we should talk to developers of other parallel file systems and see
>>>>> what they think about it. I think cephfs already uses data integrity
>>>>> provided by btrfs, although I'm not entirely sure and need to check the
>>>>> code. As I said before, Lustre does network checksums already and *might* be
>>>>> interested.
>>>>
>>>> Actually, right now Ceph doesn't check btrfs' data integrity
>>>> information, but since Ceph doesn't have any data-at-rest integrity
>>>> verification it relies on btrfs if you want that. Integrating
>>>> integrity verification throughout the system is on our long-term to-do
>>>> list.
>>>> We too will be sad if using a kernel-level integrity system requires
>>>> using DIO, although we could probably work out a way to do
>>>> "translation" between our own integrity checksums and the
>>>> btrfs-generated ones if we have to (thanks to replication).
>>>
>>> DIO isn't really required, but doing this without synchronous writes
>>> will get painful in a hurry. There's nothing wrong with letting the
>>> data sit in the page cache after the IO is done though.
>>
>> I broadly agree with this, but even if you do sync writes and cache read
>> only copies, we still have the problem of how we do the read side
>> verification of DIX. In theory, when you read, you could either get the
>> cached copy or an actual read (which will supply protection
>> information), so for the cached copy we need to return cached protection
>> information implying that we need some way of actually caching it.
>
> Good point, reading from the cached copy is a lower level of protection
> because in theory bugs in your scsi drivers could corrupt the pages
> later on.
But that only matters if the application is going to verify whether data
are really on disk. For example (client-server scenario):
1) client-A writes a page
2) client-B reads this page
client-B is simply not interested in where it gets the page from, as
long as it gets correct data. The network file system in between will
also be happy to use the existing in-cache crcs for network verification.
On-disk crcs only matter if the page is later dropped from the cache and
read again. If those are bad, one of the layers is going to complain or
correct the data.
If the application wants to check the data on disk it can either use DIO
or alternatively something like fadvise(DONTNEED_LOCAL_AND_REMOTE)
(something I have wanted to propose for some time already; at least I'm
not happy that posix_fadvise(POSIX_FADV_DONTNEED) is not passed to the
file system at all).
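For a local file system the existing interface already gets you most of
the way there; a rough sketch of the "force a real read from disk" idea
(the DONTNEED_LOCAL_AND_REMOTE variant does not exist, so on a network
file system only the local cache would be dropped here):

  #include <fcntl.h>
  #include <unistd.h>

  static ssize_t reread_from_storage(int fd, void *buf, size_t len, off_t off)
  {
          /* flush dirty pages, then drop the clean local cache */
          fdatasync(fd);
          posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);

          /* this read now has to come back through the storage path */
          return pread(fd, buf, len, off);
  }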
Cheers,
Bernd
On 2012-02-02, at 12:26, Andrea Arcangeli <[email protected]> wrote:
> On Thu, Feb 02, 2012 at 10:04:59AM +0100, Bernd Schubert wrote:
>> I think the point for network file systems is that they can reuse the
>> disk-checksum for network verification. So instead of calculating a
>> checksum for network and disk, just use one for both. The checksum also
>> is supposed to be cached in memory, as that avoids re-calculation for
>> other clients.
>>
>> 1)
>> client-1: sends data and checksum
>>
>> server: Receives those data and verifies the checksum -> network
>> transfer was ok, sends data and checksum to disk
>>
>> 2)
>> client-2 ... client-N: Ask for those data
>>
>> server: send cached data and cached checksum
>>
>> client-2 ... client-N: Receive data and verify checksum
>>
>>
>> So the whole point of caching checksums is to avoid the server needing
>> to recalculate them for dozens of clients. Recalculating checksums
>> simply does not scale with an increasing number of clients that want to
>> read data processed by another client.
>
> This makes sense indeed. My argument was only about the exposure of
> the storage hw format cksum to userland (through some new ioctl for
> further userland verification of the pagecache data in the client
> pagecache, done by whatever program is reading from the cache). The
> network fs client lives in kernel, the network fs server lives in
> kernel, so no need to expose the cksum to userland to do what you
> described above.
>
> I meant if we can't trust the pagecache to be correct (after the
> network fs client code already checked the cksum cached by the server
> and sent to the client along the server cached data), I don't see much
> value added through a further verification by the userland program
> running on the client and accessing pagecache in the client. If we
> can't trust client pagecache to be safe against memory bitflips or
> software bugs, we can hardly trust the anonymous memory too.
For servers, and clients to a lesser extent, the data may reside in cache for a long time. I agree that in many cases the data will be used immediately after the kernel verifies the data checksum from disk, but for long-lived data the chance of accidental corruption (bit flip, bad pointer, other software bug) increases.
For our own checksum implementation in Lustre, we are planning to keep the checksum attached to the pages in cache on both the client and server, along with a "last checked" time, and periodically revalidate the in-memory checksum.
As Bernd states, this dramatically reduces the checksum overhead on the server, and avoids duplicate checksum calculations for the disk and network transfers if the same algorithms can be used for both.
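A small sketch of that idea (struct layout, the crc32() stand-in and the
revalidation interval are illustrative, not Lustre's implementation):

  #include <stdint.h>
  #include <time.h>
  #include <zlib.h>    /* crc32() as a stand-in checksum */

  struct cached_page {
          unsigned char data[4096];
          uint32_t      csum;          /* computed when the page was filled */
          time_t        last_checked;  /* last time csum was re-verified    */
  };

  #define RECHECK_INTERVAL 600         /* seconds; arbitrary for the sketch */

  static int page_still_valid(struct cached_page *p)
  {
          time_t now = time(NULL);

          if (now - p->last_checked < RECHECK_INTERVAL)
                  return 1;            /* recently verified, trust it   */

          if (crc32(0L, p->data, sizeof(p->data)) != p->csum)
                  return 0;            /* in-memory corruption detected */

          p->last_checked = now;
          return 1;
  }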
Cheers, Andreas
On Wed, Feb 01, 2012 at 12:16:05PM -0600, James Bottomley wrote:
> supplying protection information to user space isn't about the
> application checking what's on disk .. there's automatic verification in
> the chain to do that (both the HBA and the disk will check the
> protection information on entry/exit and transfer). Supplying
> protection information to userspace is about checking nothing went wrong
> in the handoff between the end of the DIF stack and the application.
Not sure if I got this right, but keeping protection information for
in-ram pagecache and exposing it to userland somehow, to me sounds a
bit of overkill as a concept. Then you should want that for anonymous
memory too. If you copy the pagecache to a malloc()ed buffer and
verify the pagecache was consistent, but then the buffer is corrupted by
a hardware bitflip or software bug, what's the point? Besides, if
this is getting exposed to userland and it's not hidden in the kernel
(FS/Storage layers), userland could code its own verification logic
without much added complexity. With CRC in hardware on the CPU it
doesn't sound like a big cost to do it fully in userland and then you
could run it on anonymous memory too if you need and not be dependent
on hardware or filesystem details (well, other than a cpuid check at
startup).
On Wed, 2012-02-01 at 11:45 -0500, Chris Mason wrote:
> On Tue, Jan 31, 2012 at 11:28:26AM -0800, Gregory Farnum wrote:
> > On Tue, Jan 31, 2012 at 11:22 AM, Bernd Schubert
> > <[email protected]> wrote:
> > > I guess we should talk to developers of other parallel file systems and see
> > > what they think about it. I think cephfs already uses data integrity
> > > provided by btrfs, although I'm not entirely sure and need to check the
> > > code. As I said before, Lustre does network checksums already and *might* be
> > > interested.
> >
> > Actually, right now Ceph doesn't check btrfs' data integrity
> > information, but since Ceph doesn't have any data-at-rest integrity
> > verification it relies on btrfs if you want that. Integrating
> > integrity verification throughout the system is on our long-term to-do
> > list.
> > We too will be sad if using a kernel-level integrity system requires
> > using DIO, although we could probably work out a way to do
> > "translation" between our own integrity checksums and the
> > btrfs-generated ones if we have to (thanks to replication).
>
> DIO isn't really required, but doing this without synchronous writes
> will get painful in a hurry. There's nothing wrong with letting the
> data sit in the page cache after the IO is done though.
I broadly agree with this, but even if you do sync writes and cache read
only copies, we still have the problem of how we do the read side
verification of DIX. In theory, when you read, you could either get the
cached copy or an actual read (which will supply protection
information), so for the cached copy we need to return cached protection
information implying that we need some way of actually caching it.
James
On Wed, Feb 01, 2012 at 10:52:55AM -0600, James Bottomley wrote:
> On Wed, 2012-02-01 at 11:45 -0500, Chris Mason wrote:
> > On Tue, Jan 31, 2012 at 11:28:26AM -0800, Gregory Farnum wrote:
> > > On Tue, Jan 31, 2012 at 11:22 AM, Bernd Schubert
> > > <[email protected]> wrote:
> > > > I guess we should talk to developers of other parallel file systems and see
> > > > what they think about it. I think cephfs already uses data integrity
> > > > provided by btrfs, although I'm not entirely sure and need to check the
> > > > code. As I said before, Lustre does network checksums already and *might* be
> > > > interested.
> > >
> > > Actually, right now Ceph doesn't check btrfs' data integrity
> > > information, but since Ceph doesn't have any data-at-rest integrity
> > > verification it relies on btrfs if you want that. Integrating
> > > integrity verification throughout the system is on our long-term to-do
> > > list.
> > > We too will be sad if using a kernel-level integrity system requires
> > > using DIO, although we could probably work out a way to do
> > > "translation" between our own integrity checksums and the
> > > btrfs-generated ones if we have to (thanks to replication).
> >
> > DIO isn't really required, but doing this without synchronous writes
> > will get painful in a hurry. There's nothing wrong with letting the
> > data sit in the page cache after the IO is done though.
>
> I broadly agree with this, but even if you do sync writes and cache read
> only copies, we still have the problem of how we do the read side
> verification of DIX. In theory, when you read, you could either get the
> cached copy or an actual read (which will supply protection
> information), so for the cached copy we need to return cached protection
> information implying that we need some way of actually caching it.
Good point, reading from the cached copy is a lower level of protection
because in theory bugs in your scsi drivers could corrupt the pages
later on.
But I think even without keeping the crcs attached to the page, there is
value in keeping the cached copy in lots of workloads. The database is
going to O_DIRECT read (with crcs checked) and then stuff it into a
database buffer cache for long term use. Stuffing it into a page cache
on the kernel side is about the same.
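For what it's worth, the access pattern Chris describes looks roughly
like this from the application side (4096-byte alignment assumed; offset
and length must satisfy the usual O_DIRECT constraints):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* O_DIRECT read: the I/O goes through the protected path and is
   * verified on completion; the caller then keeps the buffer in its
   * own long-lived cache. */
  static void *read_block_direct(const char *path, off_t off, size_t len)
  {
          void *buf = NULL;
          int fd = open(path, O_RDONLY | O_DIRECT);

          if (fd < 0)
                  return NULL;
          if (posix_memalign(&buf, 4096, len) != 0) {
                  close(fd);
                  return NULL;
          }
          if (pread(fd, buf, len, off) != (ssize_t)len) {
                  free(buf);
                  buf = NULL;
          }
          close(fd);
          return buf;    /* stashed in the database's buffer cache */
  }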
-chris
On 02/02/2012 08:26 PM, Andrea Arcangeli wrote:
> On Thu, Feb 02, 2012 at 10:04:59AM +0100, Bernd Schubert wrote:
>> I think the point for network file systems is that they can reuse the
>> disk-checksum for network verification. So instead of calculating a
>> checksum for network and disk, just use one for both. The checksum also
>> is supposed to be cached in memory, as that avoids re-calculation for
>> other clients.
>>
>> 1)
>> client-1: sends data and checksum
>>
>> server: Receives those data and verifies the checksum -> network
>> transfer was ok, sends data and checksum to disk
>>
>> 2)
>> client-2 ... client-N: Ask for those data
>>
>> server: send cached data and cached checksum
>>
>> client-2 ... client-N: Receive data and verify checksum
>>
>>
>> So the whole point of caching checksums is to avoid the server needing
>> to recalculate them for dozens of clients. Recalculating checksums
>> simply does not scale with an increasing number of clients that want to
>> read data processed by another client.
>
> This makes sense indeed. My argument was only about the exposure of
> the storage hw format cksum to userland (through some new ioctl for
> further userland verification of the pagecache data in the client
> pagecache, done by whatever program is reading from the cache). The
> network fs client lives in kernel, the network fs server lives in
> kernel, so no need to expose the cksum to userland to do what you
> described above.
>
> I meant if we can't trust the pagecache to be correct (after the
> network fs client code already checked the cksum cached by the server
> and sent to the client along the server cached data), I don't see much
> value added through a further verification by the userland program
> running on the client and accessing pagecache in the client. If we
> can't trust client pagecache to be safe against memory bitflips or
> software bugs, we can hardly trust the anonymous memory too.
Well, now it gets a bit troublesome - not all file systems are in kernel
space. FhGFS uses kernel clients, but has user space daemons. I think
Ceph does it similarly. And while I'm not sure about the roadmap of
Gluster and whether data verification is planned at all, if it wanted to
do that, even the clients would need to get access to the checksums in
user space.
Now let's ignore user space clients for a moment: what about using the
splice interface to also send checksums? As a basic concept, file system
servers are not interested in the real data at all, but only do the
management between disk and network. So a possible way to avoid exposing
checksums to user space daemons is to simply not expose the data to the
servers at all. However, in that case the server-side kernel would need
to do the checksum verification, even for user space daemons.
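For the data side, the zero-copy path already exists today: the server
can splice file data into a socket without ever touching it in user
space (the checksum transport would be the new part and is not shown):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* file -> pipe -> socket, no user space copy of the data */
  static int send_range(int file_fd, loff_t off, size_t len, int sock_fd)
  {
          int p[2];

          if (pipe(p) < 0)
                  return -1;

          while (len > 0) {
                  ssize_t in = splice(file_fd, &off, p[1], NULL, len,
                                      SPLICE_F_MOVE);
                  if (in <= 0)
                          break;

                  /* drain what we just queued into the pipe */
                  while (in > 0) {
                          ssize_t out = splice(p[0], NULL, sock_fd, NULL,
                                               in, SPLICE_F_MOVE);
                          if (out <= 0)
                                  goto done;
                          in  -= out;
                          len -= out;
                  }
          }
  done:
          close(p[0]);
          close(p[1]);
          return len == 0 ? 0 : -1;
  }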
A remaining issue with splice is that it does not work with InfiniBand
ibverbs due to the missing socket fd.
Another solution that might work is to expose checksums read-only to
user space.
Cheers,
Bernd