Hello,
I'm wondering whether there is any way in Linux to do a proper fsync(),
one that makes sure the data is actually written to the disk.
Currently, on IDE devices, fsync() only flushes data to the drive
cache, which is not enough for the ACID guarantees a database server
must give.
One solution is simply to disable the drive write cache, but that seems
to slow down performance far too much.
I would also be happy enough with some global kernel option which would
enable a drive cache flush on fsync :)
Mac OS X also has this "optimization", but at least it provides an
alternative flush method for database servers:
fcntl(fd, F_FULLFSYNC, NULL)
can be used instead of fsync() to get true fsync() behavior.
--
Peter Zaitsev, Senior Support Engineer
MySQL AB, http://www.mysql.com
Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
http://www.mysql.com/uc2004/
On Wed, Mar 17 2004, Peter Zaitsev wrote:
> Hello,
>
> I'm wondering whether there is any way in Linux to do a proper fsync(),
> one that makes sure the data is actually written to the disk.
>
> Currently, on IDE devices, fsync() only flushes data to the drive
> cache, which is not enough for the ACID guarantees a database server
> must give.
>
> One solution is simply to disable the drive write cache, but that
> seems to slow down performance far too much.
Chris and I have working real fsync() with the barrier patches. I'll
clean it up and post a patch for vanilla 2.6.5-rc today.
--
Jens Axboe
On Thu, 18 Mar 2004, Jens Axboe wrote:
> Chris and I have working real fsync() with the barrier patches. I'll
> clean it up and post a patch for vanilla 2.6.5-rc today.
This is good news.
The barrier stuff is long overdue - I'm looking forward to this.
I'm using the term "TCQ" liberally although it may be inexact for older
(parallel) ATA generations:
All these ATA fsync() vs. write cache issues have been open for much too
long - no reproaches, but it's a pity we haven't been able to have data
consistency for databases and fast bulk writes (which need the write
cache without TCQ) in the same drive for so long. I have seen Linux
introduce TCQ for PATA early in 2.5, then drop it again. Similarly,
FreeBSD ventured into TCQ for ATA but appears to have dropped it again
as well.
May I ask that information about whether a particular driver (file
system, hardware) supports write barriers be exposed in a standard way,
for instance in the Kconfig help lines?
If I recall correctly from earlier patches, the barrier stuff is (1)
command-model (ATA vs. SCSI) specific, (2) driver and hardware
specific, and (3) dependent on the file system knowing how to use it
properly.
Given that file systems have certain write ordering requirements if they
are to be recoverable after a crash, I suspect Linux has _not_ been able
to guarantee on-disk consistency for years, which means that a crash at
the wrong moment can kill the file system itself if the drive has
reordered writes - only ext3 without write cache seems to behave better
in this respect (data=ordered).
I would like to have a document that shows which file systems, which
chipset drivers for PATA, which chipset drivers for ATA, and which
low-level SCSI host adapter drivers support write barriers. We will
probably also need to check whether intermediate layers such as md and
dm-mod propagate such information.
Given the necessary information, I can hack together an HTML document to
provide it; this offer has, however, not seen any response in the past.
I am not acquainted with the drivers myself and need information from
the kernel hackers. Without such support, a documentation effort like
this is doomed.
BTW, I should very much like to be able to trace the low-level write
information that goes out to the device, possibly including the payload
- something like tcpdump for the ATA or SCSI commands that are sent to
the driver. Is such a facility available?
--
Matthias Andree
Encrypt your mail: my GnuPG key ID is 0x052E7D95
(BTW - maybe you don't like to be cc'ed on kernel posts, but I do. It's
lkml etiquette to do so, and it makes sure that I see your mail;
otherwise I might not, which is especially true for bigger threads. So
please cc people. Thanks.)
On Thu, Mar 18 2004, Matthias Andree wrote:
> On Thu, 18 Mar 2004, Jens Axboe wrote:
>
> > Chris and I have working real fsync() with the barrier patches. I'll
> > clean it up and post a patch for vanilla 2.6.5-rc today.
>
> This is good news.
>
> The barrier stuff is long overdue - I'm looking forward to this.
>
> I'm using the term "TCQ" liberally although it may be inexact for older
> (parallel) ATA generations:
>
> All these ATA fsync() vs. write cache issues have been open for much too
> long - no reproaches, but it's a pity we haven't been able to have data
> consistency for data bases and fast bulk writes (that need the write
> cache without TCQ) in the same drive for so long. I have seen Linux
> introduce TCQ for PATA early in 2.5, then drop it again. Similarly,
> FreeBSD ventured into TCQ for ATA but appears to have dropped it again
> as well.
That's because PATA TCQ sucks :-)
> May I ask that the information whether a particular driver (file system,
> hardware) supports write barriers be exposed in a standard way, for
> instance in the Kconfig help lines?
Since reiser is the first implementation of it, it gets to choose how
this works. Currently that's done by giving -o barrier=flush (=ordered
used to exist as well, and it will probably return - right now we've
just played with IDE).
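For example (the device and mount point here are made up for illustration):

```shell
# mount a reiserfs volume with cache-flush barriers enabled
mount -t reiserfs -o barrier=flush /dev/hda2 /mnt/test
```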
> If I recall correctly from earlier patches, the barrier stuff is 1.
> command model (ATA vs. SCSI) specific and 2. driver and hardware
> specific and 3. requires that the file system knows how to use this
> properly.
Yes.
> Given that file systems have certain write ordering requirements if they
> are to be recoverable after a crash, I suspect Linux has _not_ been able
> to guarantee on-disk consistency for any time for years, which means
> that a crash in the wrong moment can kill the file system itself if the
> drive has reordered writes - only ext3 without write cache seems to
> behave better in this respect (data=ordered).
>
> I would like to have a document that shows which file system, which
> chipset driver for PATA, which chipset driver for ATA, which low-level
> SCSI host adaptor driver, which file system support write barrier. We
> will probably also need to check if intermediate layers such as md and
> dm-mod propagate such information.
Only the PATA core needs to support it, not the chipset drivers. md and
dm aren't difficult to implement now that unplug/congestion already
iterates the device list and I've added a blkdev_issue_flush() command.
> Given the necessary information, I can hack together a HTML document to
> provide this information; this offer has however not seen any response
> in the past. I am however not acquainted with the drivers and need
> information from the kernel hackers. Without such support, such a
> documentation effort is doomed.
Usual approach - just start writing; it's a lot easier to get
corrections (people seem to be several times more willing to point out
your errors than to give you recommendations for something you haven't
started yet).
> BTW, I should very much like to be able to trace the low-level write
> information that goes out to the device, possibly including the payload
> - something like tcpdump for the ATA or SCSI commands that are sent to
> the driver. Is such a facility available?
No.
--
Jens Axboe
Jens Axboe schrieb am 2004-03-18:
> > All these ATA fsync() vs. write cache issues have been open for much too
> > long - no reproaches, but it's a pity we haven't been able to have data
> > consistency for data bases and fast bulk writes (that need the write
> > cache without TCQ) in the same drive for so long. I have seen Linux
> > introduce TCQ for PATA early in 2.5, then drop it again. Similarly,
> > FreeBSD ventured into TCQ for ATA but appears to have dropped it again
> > as well.
>
> That's because PATA TCQ sucks :-)
True. Few drives support it, and many of these you would not want to run
in production...
> > May I ask that the information whether a particular driver (file system,
> > hardware) supports write barriers be exposed in a standard way, for
> > instance in the Kconfig help lines?
>
> Since reiser is the first implementation of it, it gets to chose how
> this works. Currently that's done by giving -o barrier=flush (=ordered
> used to exist as well, it will probably return - right now we just
> played with IDE).
This looks as though it is not the default and requires the user to
know what he's doing. Would it be possible to choose a sane default
(like flush for ATA, or ordered for SCSI when the underlying driver
supports ordered tags) and just leave the user the chance to override
it?
> Only PATA core needs to support it, not the chipset drivers. md and dm
Hmm, I know the older Promise chips were blacklisted for PATA TCQ in
FreeBSD. Might "ordered" cause similar situations on Linux? How about
SCSI/libata - is the situation the same there?
> aren't a difficult to implement now that unplug/congestion already
> iterates the device list and I added a blkdev_issue_flush() command.
So this would - for SCSI - be an sd issue rather than a driver issue as
well?
--
Matthias Andree
Encrypt your mail: my GnuPG key ID is 0x052E7D95
On Thu, Mar 18 2004, Matthias Andree wrote:
> > > All these ATA fsync() vs. write cache issues have been open for much too
> > > long - no reproaches, but it's a pity we haven't been able to have data
> > > consistency for data bases and fast bulk writes (that need the write
> > > cache without TCQ) in the same drive for so long. I have seen Linux
> > > introduce TCQ for PATA early in 2.5, then drop it again. Similarly,
> > > FreeBSD ventured into TCQ for ATA but appears to have dropped it again
> > > as well.
> >
> > That's because PATA TCQ sucks :-)
>
> True. Few drives support it, and many of these you would not want to run
> in production...
Plus, the spec is broken.
> > > May I ask that the information whether a particular driver (file system,
> > > hardware) supports write barriers be exposed in a standard way, for
> > > instance in the Kconfig help lines?
> >
> > Since reiser is the first implementation of it, it gets to chose how
> > this works. Currently that's done by giving -o barrier=flush (=ordered
> > used to exist as well, it will probably return - right now we just
> > played with IDE).
>
> This looks as though this was not the default and required the user to
> know what he's doing. Would it be possible to choose a sane default
> (like flush for ATA or ordered for SCSI when the underlying driver
> supports ordered tags) and leave the user just the chance to override
> this?
When things have matured, might not be a bad idea to default to using
barriers.
> > Only PATA core needs to support it, not the chipset drivers. md and dm
>
> Hum, I know the older Promise chips were blacklisted for PATA TCQ in
> FreeBSD. Might "ordered" cause situations where similar things happen to
> Linux? How about SCSI/libata? Is the situation the same there?
Don't confuse TCQ and barriers - they have nothing to do with each
other for IDE. I can't imagine any chipset having problems with a
synchronize cache command.
> > aren't a difficult to implement now that unplug/congestion already
> > iterates the device list and I added a blkdev_issue_flush() command.
>
> So this would - for SCSI - be an sd issue rather than a driver issue as
> well?
No, for scsi it's a low level driver issue. IDE chipset 'drivers' really
aren't anything but setup stuff, and maybe a few hooks to deal with dma.
All the action is in the ide core.
--
Jens Axboe
On Wed, 2004-03-17 at 22:47, Jens Axboe wrote:
> > One solution is simply to disable the drive write cache, but that
> > seems to slow down performance far too much.
>
> Chris and I have working real fsync() with the barrier patches. I'll
> clean it up and post a patch for vanilla 2.6.5-rc today.
Good to hear. How is it going to work from the user's point of view?
Will fsync just work again, or will there be some special handling?
Also, what about fsync() in 2.6 nowadays?
I've done some tests on a 3ware RAID array, and it looks like it
behaves differently from the 2.4 kernels I've been testing previously.
I have a simple test which does single-page writes to a file, each
followed by an fsync(). The first run gives you the case where the file
grows with each write, the second the case where you're writing to
existing file space.
The results I get on 2.4 are something like 40 sec per 1000 fsyncs for
a new file, and 0.6 sec for an existing file.
With 2.6.3, both the existing-file and new-file cases complete in less
than 1 second.
--
Peter Zaitsev, Senior Support Engineer
MySQL AB, http://www.mysql.com
Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
http://www.mysql.com/uc2004/
On Thu, Mar 18 2004, Peter Zaitsev wrote:
> On Wed, 2004-03-17 at 22:47, Jens Axboe wrote:
>
> > > One solution is simply to disable the drive write cache, but that
> > > seems to slow down performance far too much.
> >
> > Chris and I have working real fsync() with the barrier patches. I'll
> > clean it up and post a patch for vanilla 2.6.5-rc today.
>
> Good to hear. How is it going to work from user point of view ?
> Just fsync working back again or there would be some special handling.
It's just going to work :)
> Also. What is about fsync() in 2.6 nowadays ?
>
> I've done some tests on 3WARE RAID array and it looks like it is
> different compared to 2.4 I've been testing previously.
>
> I have the simple test which has single page writes to the file followed
> by fsync(). First run give you the case when file grows with each
> write, second when you're writing to existing file space.
>
> The results I have on 2.4 is something like 40 sec per 1000 fsyncs for
> new file, and 0.6 sec for existing file.
>
> With 2.6.3 I have both existing file and new file to complete in less
> than 1 second.
I believe some missed set_page_writeback() calls caused fsync() to never
really wait on anything, pretty broken... IIRC, it's fixed in latest
-mm, or maybe it's just pending for next release.
--
Jens Axboe
On Thu, 2004-03-18 at 14:47, Jens Axboe wrote:
> > With 2.6.3 I have both existing file and new file to complete in less
> > than 1 second.
>
> I believe some missed set_page_writeback() calls caused fsync() to never
> really wait on anything, pretty broken... IIRC, it's fixed in latest
> -mm, or maybe it's just pending for next release.
This should have only been broken in -mm. Which kernels exactly are you
comparing? Maybe the 3ware array defaults to different writecache
settings under 2.6?
-chris
On Thu, 2004-03-18 at 12:11, Chris Mason wrote:
> > I believe some missed set_page_writeback() calls caused fsync() to never
> > really wait on anything, pretty broken... IIRC, it's fixed in latest
> > -mm, or maybe it's just pending for next release.
>
> This should have only been broken in -mm. Which kernels exactly are you
> comparing? Maybe the 3ware array defaults to different writecache
> settings under 2.6?
I'm trying the RH AS 3.0 kernel; however, I have the same behavior on
my SuSE 8.2 workstation.
I use the 2.6.3 kernel for tests now (it is not the latest, I know),
with the ext3 file system.
3ware has the writeback cache setting enabled in both cases.
Here is the test program I was using:
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <errno.h>

/* one page, aligned so O_DIRECT-style variants would also work */
char buffer[4096] __attribute__((__aligned__(4096)));

int main(void)
{
	int fd, rc;
	int i;

	buffer[0] = (char)getpid();
	fd = open("write", O_RDWR | O_CREAT, 0666);
	if (fd == -1) {
		printf("Error at open: %d\n", errno);
		return 1;
	}
	for (i = 0; i < 1000; i++) {
		rc = write(fd, buffer, 4096);
		printf(".");
		fflush(stdout);
		if (rc < 0) {
			printf("Error code: %d\n", errno);
			return 1;
		}
		fsync(fd);
	}
	close(fd);
	return 0;
}
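I build and time it with something like this (the source file name is assumed):

```shell
gcc -O2 -o fsync_test fsync_test.c
time ./fsync_test    # wall time for the 1000 write+fsync loop
```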
--
Peter Zaitsev, Senior Support Engineer
MySQL AB, http://www.mysql.com
Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
http://www.mysql.com/uc2004/
On Thu, 2004-03-18 at 15:17, Peter Zaitsev wrote:
> On Thu, 2004-03-18 at 12:11, Chris Mason wrote:
>
> > > I believe some missed set_page_writeback() calls caused fsync() to never
> > > really wait on anything, pretty broken... IIRC, it's fixed in latest
> > > -mm, or maybe it's just pending for next release.
> >
> > This should have only been broken in -mm. Which kernels exactly are you
> > comparing? Maybe the 3ware array defaults to different writecache
> > settings under 2.6?
>
> I'm trying RH AS 3.0 kernel, however I have the same behavior on my
> SuSE 8.2 workstation.
>
Some SuSE 8.2 kernels had write barriers for IDE, some did not. If
you're running any kind of recent SuSE kernel, you're doing cache
flushes on fsync with ext3.
Not sure if RH has ever carried the patches. It's easy enough to test
for on SuSE: just look for blk_queue_ordered in the System.map.
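That is, something along these lines (the System.map path varies between installs):

```shell
grep blk_queue_ordered /boot/System.map-$(uname -r) \
    && echo "barrier patches present"
```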
> I use 2.6.3 kernel for tests now (It is not the latest I know)
> EXT3 file system.
>
> 3WARE has writeback cache setting in both cases.
Then it sounds like your 2.4 is doing flushes. I'd expect this test to
run very quickly without them.
-chris
On Thu, 2004-03-18 at 12:33, Chris Mason wrote:
> Some suse 8.2 kernels had write barriers for IDE, some did not. If
> you're running any kind of recent suse kernel, you're doing cache
> flushes on fsync with ext3.
I have this kernel:
Linux abyss 2.4.20-4GB #1 Sat Feb 7 02:07:16 UTC 2004 i686 unknown
unknown GNU/Linux
I believe it is a reasonably recent one from Hubert's kernels.
The thing is, the performance differs depending on whether the file
grows or not. If it does, we get some 25 fsyncs/sec; if we're writing
to an existing one, we get some 1600 fsyncs/sec.
In the latter case the cache is surely not flushed.
> > I use 2.6.3 kernel for tests now (It is not the latest I know)
> > EXT3 file system.
> >
> > 3WARE has writeback cache setting in both cases.
>
> Then it sounds like your 2.4 is doing flushes. I'd expect this test to
> run very quickly without them.
2.4 does flush in one case but not in the other. 2.6 does not do it in
either case.
I was also surprised to see that this simple test case performs so
differently with the default and "deadline" IO schedulers - 1.6 vs 0.5
sec per 1000 fsyncs.
--
Peter Zaitsev, Senior Support Engineer
MySQL AB, http://www.mysql.com
Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
http://www.mysql.com/uc2004/
On Thu, 2004-03-18 at 15:46, Peter Zaitsev wrote:
> On Thu, 2004-03-18 at 12:33, Chris Mason wrote:
>
> > Some suse 8.2 kernels had write barriers for IDE, some did not. If
> > you're running any kind of recent suse kernel, you're doing cache
> > flushes on fsync with ext3.
>
> I have this kernel:
>
>
> Linux abyss 2.4.20-4GB #1 Sat Feb 7 02:07:16 UTC 2004 i686 unknown
> unknown GNU/Linux
>
> I believe it is reasonably recent one from Hubert's kernels.
>
> The thing is the performance is different if file grows or it does not.
> If it does - we have some 25 fsync/sec. IF we're writing to existing
> one, we have some 1600 fsync/sec
>
> In the former case cache is surely not flushed.
>
Hmmm, is it reiser? For both 2.4 reiserfs and ext3, the flush happens
when you commit. ext3 always commits on fsync and reiser only commits
when you've changed metadata.
Thanks to Jens, the 2.6 barrier patch has a nice clean way to allow
barriers on fsync, O_SYNC, O_DIRECT, etc, so we can make IDE drives much
safer than the 2.4 code did.
I had a patch to make fsync always generate the barriers in 2.4, but it
was tricky since it had to figure out the last buffer it was going to
write before it wrote it. The 2.6 code is much better.
> 2.4 does flush in one case but not in other. 2.6 does not do it in ether
> case.
>
> I was also surprised to see this simple test case has so different
> performance with default and "deadline" IO scheduler - 1.6 vs 0.5 sec
> per 1000 fsync's.
Not sure on that one, both cases are generating tons of unplugs, the
drive is just responding insanely fast.
-chris
On Thu, 2004-03-18 at 13:02, Chris Mason wrote:
> > In the former case cache is surely not flushed.
> >
> Hmmm, is it reiser? For both 2.4 reiserfs and ext3, the flush happens
> when you commit. ext3 always commits on fsync and reiser only commits
> when you've changed metadata.
Oh. Yes. This is Reiser; I did not think it was an FS issue.
I'll know to stay away from ReiserFS now.
>
> Thanks to Jens, the 2.6 barrier patch has a nice clean way to allow
> barriers on fsync, O_SYNC, O_DIRECT, etc, so we can make IDE drives much
> safer than the 2.4 code did.
Great.
> > I was also surprised to see this simple test case has so different
> > performance with default and "deadline" IO scheduler - 1.6 vs 0.5 sec
> > per 1000 fsync's.
>
> Not sure on that one, both cases are generating tons of unplugs, the
> drive is just responding insanely fast.
Well, why would it be slow if it has the write cache off?
--
Peter Zaitsev, Senior Support Engineer
MySQL AB, http://www.mysql.com
Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
http://www.mysql.com/uc2004/
On Thu, 2004-03-18 at 16:09, Peter Zaitsev wrote:
> On Thu, 2004-03-18 at 13:02, Chris Mason wrote:
>
> > > In the former case cache is surely not flushed.
> > >
> > Hmmm, is it reiser? For both 2.4 reiserfs and ext3, the flush happens
> > when you commit. ext3 always commits on fsync and reiser only commits
> > when you've changed metadata.
>
> Oh. Yes. This is Reiser, I did not think it is FS issue.
> I'll know to stay away from ReiserFS now.
For reiserfs data=ordered should be enough to trigger the needed
commits. If not, data=journal. Note that neither fs does barriers for
O_SYNC, so we're just not perfect in 2.4.
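Presumably via a mount option like this (the mount point is made up; availability depends on the kernel carrying the patches):

```shell
# switch an already-mounted reiserfs volume to ordered data mode
mount -o remount,data=ordered /mnt/test
```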
-chris
Chris Mason wrote:
>On Thu, 2004-03-18 at 16:09, Peter Zaitsev wrote:
>>On Thu, 2004-03-18 at 13:02, Chris Mason wrote:
>>>>In the former case cache is surely not flushed.
>>>Hmmm, is it reiser? For both 2.4 reiserfs and ext3, the flush happens
>>>when you commit. ext3 always commits on fsync and reiser only commits
>>>when you've changed metadata.
>>Oh. Yes. This is Reiser, I did not think it is FS issue.
>>I'll know to stay away from ReiserFS now.
>
>For reiserfs data=ordered should be enough to trigger the needed
>commits. If not, data=journal. Note that neither fs does barriers for
>O_SYNC, so we're just not perfect in 2.4.
>
>-chris
You are not listening to Peter. As I understand it from what Peter says
and your words, your implementation is wrong, and makes fsync
meaningless. If so, then you need to fix it. fsync should not be
meaningless even for metadata only journaling. This is a serious bug
that needs immediate correction, if Peter and I understand it correctly
from your words.
--
Hans
On Fri, 2004-03-19 at 03:05, Hans Reiser wrote:
> Chris Mason wrote:
> >On Thu, 2004-03-18 at 16:09, Peter Zaitsev wrote:
> >>On Thu, 2004-03-18 at 13:02, Chris Mason wrote:
> >>>>In the former case cache is surely not flushed.
> >>>Hmmm, is it reiser? For both 2.4 reiserfs and ext3, the flush happens
> >>>when you commit. ext3 always commits on fsync and reiser only commits
> >>>when you've changed metadata.
> >>Oh. Yes. This is Reiser, I did not think it is FS issue.
> >>I'll know to stay away from ReiserFS now.
> >
> >For reiserfs data=ordered should be enough to trigger the needed
> >commits. If not, data=journal. Note that neither fs does barriers for
> >O_SYNC, so we're just not perfect in 2.4.
> >
> >-chris
> >
> You are not listening to Peter. As I understand it from what Peter says
> and your words, your implementation is wrong, and makes fsync
> meaningless. If so, then you need to fix it. fsync should not be
> meaningless even for metadata only journaling. This is a serious bug
> that needs immediate correction, if Peter and I understand it correctly
> from your words.
I am listening to Peter. Jens and I have spent a significant amount of
time on this code. We can go back and spend many more hours testing and
debugging the 2.4 changes, or we can go forward with a very nice
solution in 2.6.
I'm planning on going forward with 2.6.
-chris
On Fri, 2004-03-19 at 05:52, Chris Mason wrote:
> I am listening to Peter, Jens and I have spent a significant amount of
> time on this code. We can go back and spend many more hours testing and
> debugging the 2.4 changes, or we can go forward with a very nice
> solution in 2.6.
>
> I'm planning on going forward with 2.6
Chris, Hans
It is great to hear this is going to be fixed in 2.6; however, it is
quite a pity we have such a mess with this in the 2.4 series.
Summing up what I've heard so far, it looks like the behavior depends on:
- whether it is fsync()/O_SYNC or O_DIRECT (which users would expect to
have the same effect in this respect);
- the kernel version - some vendors carry fixes, while others do not;
- the hardware - whether the write cache is on or off;
- the type of write (whether it changes metadata or not);
- finally, the file system and even the journal mount options.
Just curious: does at least asynchronous IO have the same behavior as
standard IO?
All of this makes it extremely hard to explain what users need to do in
order to get durability for their changes while preserving performance.
Furthermore, as it has been broken for years, I expect we'll see people
who developed things with the fast fsync() in mind start screaming once
we have a real fsync().
(See my mail about Apple actually disabling the cache flush on fsync()
for this very reason.)
--
Peter Zaitsev, Senior Support Engineer
MySQL AB, http://www.mysql.com
Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
http://www.mysql.com/uc2004/
Chris Mason wrote:
>On Fri, 2004-03-19 at 03:05, Hans Reiser wrote:
>>Chris Mason wrote:
>>>On Thu, 2004-03-18 at 16:09, Peter Zaitsev wrote:
>>>>On Thu, 2004-03-18 at 13:02, Chris Mason wrote:
>>>>>>In the former case cache is surely not flushed.
>>>>>Hmmm, is it reiser? For both 2.4 reiserfs and ext3, the flush happens
>>>>>when you commit. ext3 always commits on fsync and reiser only commits
>>>>>when you've changed metadata.
>>>>Oh. Yes. This is Reiser, I did not think it is FS issue.
>>>>I'll know to stay away from ReiserFS now.
>>>For reiserfs data=ordered should be enough to trigger the needed
>>>commits. If not, data=journal. Note that neither fs does barriers for
>>>O_SYNC, so we're just not perfect in 2.4.
>>>
>>>-chris
>>You are not listening to Peter. As I understand it from what Peter says
>>and your words, your implementation is wrong, and makes fsync
>>meaningless. If so, then you need to fix it. fsync should not be
>>meaningless even for metadata only journaling. This is a serious bug
>>that needs immediate correction, if Peter and I understand it correctly
>>from your words.
>
>I am listening to Peter, Jens and I have spent a significant amount of
>time on this code.
>
but you need to get it right.
>We can go back and spend many more hours testing and
>debugging the 2.4 changes, or we can go forward with a very nice
>solution in 2.6.
>
>I'm planning on going forward with 2.6
This is a very important patch that you have created, but you haven't
articulated what happens in the following scenario (Peter, I am making
up something without knowing your internals, please feel encouraged to
help me on this):
mysql fsync()'s a file, which it thinks guarantees that all of a mysql
transaction has reached disk. The disk write-caches it. You let fsync
return. It is not on disk. mysql performs its commit and writes a
commit record, which reaches disk, but not all of the transaction is on
disk. The system crashes. mysql replays the log. mysql has internal
corruption. The user calls Peter. Peter asks, what do you expect when
you use a piece of shit like reiserfs? The user doesn't care about our
internal squabbling and goes back to using Windows, which does proper
commits.
Or, a random application fsyncs, expects that this means the data has
reached disk, and tells the user to perform real-world actions
dependent on the data being on disk - but it is not.
I hope I am totally off-base and not understanding you.... Please help
me here.
>-chris
--
Hans
On Fri, 2004-03-19 at 14:36, Hans Reiser wrote:
> I hope I am totally off-base and not understanding you.... Please help
> me here.
Let's look at the actual scope of the problem:
filesystem metadata
filesystem data (fsync, O_SYNC, O_DIRECT)
block device data (fsync, O_SYNC, O_DIRECT)
Multiply the cases above times each filesystem and also times md and
device mapper, since the barriers need to aggregate down to all the
drives.
In other words, just fixing fsync in 2.4 is not enough, and there is
still considerable development needed in 2.6. Maybe after all the 2.6
changes are done and accepted we can consider backporting parts of it to
2.4.
-chris
Chris Mason wrote:
>On Fri, 2004-03-19 at 14:36, Hans Reiser wrote:
>>I hope I am totally off-base and not understanding you.... Please help
>>me here.
>
>Lets look at actual scope of the problem:
>
>filesystem metadata
>filesystem data (fsync, O_SYNC, O_DIRECT)
>block device data (fsync, O_SYNC, O_DIRECT)
>
>Multiply the cases above times each filesystem and also times md and
>device mapper, since the barriers need to aggregate down to all the
>drives.
>
>In other words, just fixing fsync in 2.4 is not enough, and there is
>still considerable development needed in 2.6. Maybe after all the 2.6
>changes are done and accepted we can consider backporting parts of it to
>2.4.
>
>-chris
In 2.6 does fsync always insert a write barrier when the metadata
journaling option is set for reiserfs?
--
Hans
On Fri, 2004-03-19 at 11:36, Hans Reiser wrote:
> mysql fsync()'s a file, which it thinks guarantees that all of a mysql
> transaction has reached disk. The disk write caches it. You let fsync
> return. It is not on disk. mysql performs its mysql commit, and writes
> a mysql commit record which reaches disk, but not all of the transaction
> is on disk. The system crashes. mysql plays the log. mysql has
> internal corruption. User calls Peter. Peter asks, what do you expect
> when you use a piece of shit like reiserfs? User doesn't care about our
> internal squabbling and goes back to using windows which does proper
> commits.
This is right,
We have had some unexplained data corruptions in InnoDB which can be
explained by a broken fsync(), but in most cases the scenario is less
gloomy: users just do not see some of the last committed transactions
if they test durability by shutting off the power - which is, however,
already not good enough for critical applications.
However this is due to external pre-caution Innodb does. It uses
"double write buffer", which basically means each page is first written
to some small page based log file, and only afterwards written to the
proper place on the disk. We have to do it even with proper fsync()
implementation as there is still possibility to crash in the middle of
fsync (or synchronous write) which will result in partial page write.
Think for example about the case when page crosses stripe boundary on
RAID.
If the file system guaranteed atomicity of write() calls (synchronous
ones would be enough), we could disable it and get good extra
performance.
--
Peter Zaitsev, Senior Support Engineer
MySQL AB, http://www.mysql.com
Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
http://www.mysql.com/uc2004/
On Fri, 2004-03-19 at 15:04, Hans Reiser wrote:
> Chris Mason wrote:
> >Lets look at actual scope of the problem:
> >
> >filesystem metadata
> >filesystem data (fsync, O_SYNC, O_DIRECT)
> >block device data (fsync, O_SYNC, O_DIRECT)
> >
> >Multiply the cases above times each filesystem and also times md and
> >device mapper, since the barriers need to aggregate down to all the
> >drives.
> >
> >In other words, just fixing fsync in 2.4 is not enough, and there is
> >still considerable development needed in 2.6. Maybe after all the 2.6
> >changes are done and accepted we can consider backporting parts of it to
> >2.4.
> >
> In 2.6 does fsync always insert a write barrier when the metadata
> journaling option is set for reiserfs?
Yes, fsync is done in the 2.6 patches. O_SYNC, O_DIRECT and others are
not yet. The important part right now is to get the IDE core bits
reviewed and all the FS guys to agree on how we want to use them.
It's much cleaner in 2.6; the filesystem can just request a flush after
the last data buffer goes down the pipe.
-chris
On Fri, 2004-03-19 at 14:26, Peter Zaitsev wrote:
> On Fri, 2004-03-19 at 05:52, Chris Mason wrote:
>
>
> > I am listening to Peter, Jens and I have spent a significant amount of
> > time on this code. We can go back and spend many more hours testing and
> > debugging the 2.4 changes, or we can go forward with a very nice
> > solution in 2.6.
> >
> > I'm planning on going forward with 2.6
>
> Chris, Hans
>
> It is great to hear this is going to be fixed in 2.6, however it is
> quite a pity we have a real mess with this in 2.4 series.
>
It is indeed.
> Resuming what I've heard so far it looks like it depends on:
>
> - If it is fsync/O_SYNC or O_DIRECT (which the user would expect to
> have the same effect in this respect).
> - It depends on kernel version. Some vendors have some fixes, while
> others do not have them.
> - It depends on hardware - if it has write cache on or off
> - It depends on type of write (if it changes metadata or not)
> - Finally it depends on file system and even journal mount options
>
All of the above is correct.
> Just curious: does at least asynchronous IO have the same behavior as
> standard IO?
>
For the suse patch, yes. If it triggers a commit, you get a cache
flush.
>
> All of this makes it extremely hard to explain what users need in
> order to get durability for their changes while preserving performance.
>
> Furthermore, as it was broken for years, I expect we'll have people who
> developed things with fast fsync() in mind, and who will start screaming
> once we have a real fsync()
>
> (see my mail about Apple actually disabling cache flush on fsync() due
> to this reason)
These are all difficult issues. I wish I had easier answers for you,
hopefully we can get it all nailed down in 2.6 for starters.
-chris
Chris Mason wrote:
>
>>- It depends on type of write (if it changes metadata or not)
>>- Finally it depends on file system and even journal mount options
>>
>>
>>
>All of the above is correct.
>
>
Doesn't the above statement contradict the following?:
>In 2.6 does fsync always insert a write barrier when the metadata
>journaling option is set for reiserfs?
>
>
Yes, fsync is done in the 2.6 patches.
and I was imprecise, I should have asked, does fsync flush the disk
cache regardless of what mount options are set or data/metadata touched
in the 2.6 patches?
--
Hans
On Fri, 2004-03-19 at 15:31, Hans Reiser wrote:
> Chris Mason wrote:
>
> >
> >>- It depends on type of write (if it changes metadata or not)
> >>- Finally it depends on file system and even journal mount options
> >>
> >>
> >>
> >All of the above is correct.
> >
> >
> Doesn't the above statement contradict the following?:
>
Sorry for the confusion, I thought he was asking about 2.4.x.
> >In 2.6 does fsync always insert a write barrier when the metadata
> >journaling option is set for reiserfs?
> >
> >
>
> Yes, fsync is done in the 2.6 patches.
>
> and I was imprecise, I should have asked, does fsync flush the disk
> cache regardless of what mount options are set or data/metadata touched
> in the 2.6 patches?
>
-chris
Chris Mason wrote:
>
>>
>>and I was imprecise, I should have asked, does fsync flush the disk
>>cache regardless of what mount options are set or data/metadata touched
>>in the 2.6 patches?
>>
>>
>>
Forgive my relentlessness, is the answer to the above yes?
--
Hans
On Fri, 2004-03-19 at 15:48, Hans Reiser wrote:
> Chris Mason wrote:
>
> >
> >>
> >>and I was imprecise, I should have asked, does fsync flush the disk
> >>cache regardless of what mount options are set or data/metadata touched
> >>in the 2.6 patches?
> Forgive my relentlessness, is the answer to the above yes?
Yes ;-) One goal of the 2.6 patches is to make it possible for higher
levels to easily insert flushes when needed. The reiserfs fsync and
ext3 fsync code will both do this.
I realized I wasn't clear about device mapper and md earlier, both need
extra work to aggregate the flushes down to the drives. They don't yet
support flushing.
-chris
On Fri, 19 Mar 2004, Peter Zaitsev wrote:
> If the file system guaranteed atomicity of write() calls (synchronous
> would be enough), we could disable it and get good extra performance.
Berkeley DB 4.2.52, for instance, documents that page writes (of
database pages) must be atomic; hence, if the database page size is
larger than what the FS can write atomically, a crash may leave the
database in a non-recoverable (catastrophic) state. (This assumes using
the write-ahead logging "Berkeley DB Transactional Data Store" mode of
operation; the other modes aren't recoverable after a crash anyway.)
--
Matthias Andree
Encrypt your mail: my GnuPG key ID is 0x052E7D95
Peter Zaitsev wrote:
> If the file system guaranteed atomicity of write() calls (synchronous
> would be enough), we could disable it and get good extra performance.
Store an MD5 or SHA digest of the page in the page itself, or elsewhere.
(Obviously the digest doesn't include the bytes used to store it).
Then partial write errors are always detectable, even if there's a
hardware failure, so journal writes are effectively atomic.
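The idea can be sketched in a few lines of C. This is only an
illustration: the additive checksum (instead of MD5/SHA), the 4-byte
trailer, and the 16 KB page size are all assumptions, not any real
database's on-disk format.

```c
/* Detecting a torn page write with an embedded checksum: the checksum
 * covers every byte of the page except the field that stores it.  A
 * partially written page will (with high probability) fail the check.
 * The 4-byte trailer and simple multiplicative sum are illustrative. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 16384
#define SUM_OFF   (PAGE_SIZE - 4)   /* last 4 bytes hold the checksum */

static uint32_t page_sum(const unsigned char *page)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < SUM_OFF; i++)   /* skip the stored field */
        sum = sum * 31 + page[i];
    return sum;
}

/* Compute and store the checksum before the page is written out. */
void page_seal(unsigned char *page)
{
    uint32_t sum = page_sum(page);
    memcpy(page + SUM_OFF, &sum, 4);
}

/* On recovery, verify the stored checksum before trusting the page. */
int page_intact(const unsigned char *page)
{
    uint32_t stored;
    memcpy(&stored, page + SUM_OFF, 4);
    return stored == page_sum(page);
}
```

After a crash, recovery would call page_intact() on each page before
trusting it; as Peter notes below, this detects a torn write but does
not by itself repair it.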
-- Jamie
Chris Mason wrote:
>On Fri, 2004-03-19 at 15:48, Hans Reiser wrote:
>
>
>>Chris Mason wrote:
>>
>>
>>
>>>>and I was imprecise, I should have asked, does fsync flush the disk
>>>>cache regardless of what mount options are set or data/metadata touched
>>>>in the 2.6 patches?
>>>>
>>>>
>
>
>
>>Forgive my relentlessness, is the answer to the above yes?
>>
>>
>
>Yes ;-) One goal of the 2.6 patches is to make it possible for higher
>levels to easily insert flushes when needed. The reiserfs fsync and
>ext3 fsync code will both do this.
>
>
will do it, or do do it?
>I realized I wasn't clear about device mapper and md earlier, both need
>extra work to aggregate the flushes down to the drives. They don't yet
>support flushing.
>
>-chris
Sounds like things are very much unfinished and you guys need more time
before these patches are ready for inclusion. We need to get reiser4 to
use this stuff. Zam, put that in your todo list.
--
Hans
On Sat, 2004-03-20 at 02:20, Jamie Lokier wrote:
> Peter Zaitsev wrote:
> > If the file system guaranteed atomicity of write() calls (synchronous
> > would be enough), we could disable it and get good extra performance.
>
> Store an MD5 or SHA digest of the page in the page itself, or elsewhere.
> (Obviously the digest doesn't include the bytes used to store it).
>
> Then partial write errors are always detectable, even if there's a
> hardware failure, so journal writes are effectively atomic.
Jamie,
The problem is not detecting the partial page writes, but dealing with
them. Obviously there is a checksum on the page (though not MD5/SHA,
which are designed for cryptographic needs), so page corruption is
detected if it happens, for whatever reason.
The problem is that you can't do anything with the page if only an
unknown portion of it was modified.
InnoDB uses a sort of "logical" logging, which just says something like
"delete row #2 from page #123", so if a page is badly corrupted the log
will not help to recover it.
Of course you can log full pages, but this increases overhead
significantly, especially for small row sizes.
This is why the solution now is to use a long-term "logical" log plus a
short-term "physical" log, the latter used by the background page
writer before writing pages to their original locations.
--
Peter Zaitsev, Senior Support Engineer
MySQL AB, http://www.mysql.com
Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
http://www.mysql.com/uc2004/
Hi!
I have written the InnoDB backend to MySQL. Some notes on the fsync()
processing problem:
1. It is dangerous for a database if fsync'ed files are physically written
to the disk in an order different from the order in which the fsync's were
called on them. In a power outage this can cause database corruption.
For example, a database must make sure that the log file is written to the
disk at least up to the 'log sequence number' of any data page written to
disk. Thus, we must first write to the log file and call fsync() on it, and
only after that are allowed to write the data page to a data file and call
fsync() on the data file.
2. An 'atomic' file write in the OS does not solve the problem of partially
written database pages in a power outage if the disk drive is not guaranteed
to stay operational long enough to be able to write the whole page
physically to disk. An InnoDB data page is 16 kB, and probably not
guaranteed to be any 'atomic' unit of physical disk writes. However, in
practice, half-written pages (either because of the OS or the disk) seem to
be very rare.
3. Jeffrey Siegal wrote to me that he checked a few disk drives for
whether they support a cache flush. Some of them did, others did not. If
the disk drive does not support a cache flush, then the only way to do a
proper fsync is to configure it not to cache writes at all. In some
drives, though, even the non-caching configuration option may be
missing.
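Point 1 above boils down to a strict ordering of durable writes: the log
record must be on disk before the data page it describes. A minimal
sketch in C (the record format, descriptors, and function name are
illustrative assumptions, not InnoDB's actual code):

```c
/* Write-ahead logging order: the log record describing a change must be
 * durable (fsync'ed) before the changed data page may reach the data
 * file.  Otherwise a crash could leave a data page whose log sequence
 * number points past the end of the surviving log. */
#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int commit_change(int log_fd, int data_fd,
                  const char *log_rec, size_t log_len,
                  const char *page, size_t page_len, off_t page_off)
{
    /* 1. Append the log record and make it durable first. */
    if (write(log_fd, log_rec, log_len) != (ssize_t)log_len)
        return -1;
    if (fsync(log_fd) != 0)
        return -1;

    /* 2. Only now is the data page allowed to hit the data file. */
    if (pwrite(data_fd, page, page_len, page_off) != (ssize_t)page_len)
        return -1;
    return fsync(data_fd);
}
```

As the whole thread stresses, this ordering only holds if fsync()
actually flushes the drive's write cache; with a lying fsync() the two
steps can still reach the platter in either order.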
Best regards,
Heikki Tuuri
Innobase Oy
http://www.innodb.com
On Mon, Mar 22 2004, Heikki Tuuri wrote:
> 2. An 'atomic' file write in the OS does not solve the problem of partially
> written database pages in a power outage if the disk drive is not guaranteed
> to stay operational long enough to be able to write the whole page
> physically to disk. An InnoDB data page is 16 kB, and probably not
> guaranteed to be any 'atomic' unit of physical disk writes. However, in
> practice, half-written pages (either because of the OS or the disk) seem to
> be very rare.
There's no such thing as atomic writes bigger than a sector really, we
just pretend there is. Timing usually makes this true.
For bigger atomic writes, 2.4 SUSE kernel had some nasty hack (called
blk-atomic) to prevent reordering by the io scheduler to avoid partial
blocks from databases. For 2.6 it's much easier since you can just send
a 16kb write out as one unit. So you send out the db data page as 1
hardware request, which is pretty much as atomic as you can do it from
the OS point of view. As long as you avoid a big window between parts of
this data page, you've narrowed the window of breakage from many seconds
to milliseconds (maybe even 0, if drive platter inertia guarantees you
that the write will ensure the data is on platter).
--
Jens Axboe
Jens Axboe wrote on 2004-03-22:
> There's no such thing as atomic writes bigger than a sector really, we
> just pretend there is. Timing usually makes this true.
If there is no such atomicity (except maybe in ext3fs data=journal or
the upcoming reiserfs4 - isn't there?), then nobody should claim so. If
the kernel cannot 100.00000000% guarantee the write is atomic, claiming
otherwise is plain fraud and nothing else.
Some people bet their whole business/company and hence a fair deal of
their belongings on a single data base, and making them believe facts
that simply aren't reality is dangerous. These people will have very
little understanding for sloppiness here. Linux has no obligation to be
fast or reliable, but it MUST PROPERLY AND TRUTHFULLY state what it can
guarantee and what it cannot guarantee.
> For bigger atomic writes, 2.4 SUSE kernel had some nasty hack (called
> blk-atomic) to prevent reordering by the io scheduler to avoid partial
> blocks from databases.
That does not make a write atomic if the scheduled blocks are still
written one at a time (and I believe tagged command queueing won't help
to unroll partial writes either).
If the hardware support is missing, it is prudent to say just that and
not make any bogus promises about platter inertia and "it usually
works". (who says that the filter curves adjust to the decreasing
platter speed and the electronics are sustained for long enough? how
about write verify and remapping broken blocks?)
So we only write one hardware block size atomically, usually 512 bytes
on ATA and SCSI disk drives (MO might do 2048 at a time, but why
introduce complexity). That's a data point in this whole fsync()
discussion.
--
Matthias Andree
Encrypt your mail: my GnuPG key ID is 0x052E7D95
On Mon, Mar 22, 2004 at 04:17:12PM +0100, Matthias Andree wrote:
> If there is no such atomicity (except maybe in ext3fs data=journal or
> the upcoming reiserfs4 - isn't there?), then nobody should claim so. If
> the kernel cannot 100.00000000% guarantee the write is atomic, claiming
> otherwise is plain fraud and nothing else.
Who claims writes are atomic?
Matthias Andree wrote:
> Jens Axboe wrote on 2004-03-22:
>
>
>>There's no such thing as atomic writes bigger than a sector really, we
>>just pretend there is. Timing usually makes this true.
;)
>
> If there is no such atomicity (except maybe in ext3fs data=journal or
> the upcoming reiserfs4 - isn't there?), then nobody should claim so. If
> the kernel cannot 100.00000000% guarantee the write is atomic, claiming
> otherwise is plain fraud and nothing else.
>
> Some people bet their whole business/company and hence a fair deal of
> their belongings on a single data base, and making them believe facts
> that simply aren't reality is dangerous. These people will have very
> little understanding for sloppiness here. Linux has no obligation to be
> fast or reliable, but it MUST PROPERLY AND TRUTHFULLY state what it can
> guarantee and what it cannot guarantee.
Some databases (e.g. Oracle) can write a checksum for each database page
to overcome this problem, as this is not just "a Linux problem".
--
Christoffer
Topper Harley: Interesting perfume.
Ramada Thompson: It's Vicks. I have a cold.
Matthias Andree wrote:
>Jens Axboe wrote on 2004-03-22:
>
>
>
>>There's no such thing as atomic writes bigger than a sector really, we
>>just pretend there is. Timing usually makes this true.
>>
>>
Can you explain the timing?
>
>If there is no such atomicity (except maybe in ext3fs data=journal or
>the upcoming reiserfs4 - isn't there?), then nobody should claim so.
>
Well, nobody is going to use anything except reiser4 are they?;-).....
I think that we are able to guarantee that the write is fully atomic
regardless of what the block layer does, so long as the block layer
respects our ordering and does not cache it where it should not.
zam, you are watching this thread about flushing the ide cache I hope....
> If
>the kernel cannot 100.00000000% guarantee the write is atomic, claiming
>otherwise is plain fraud and nothing else.
>
>Some people bet their whole business/company and hence a fair deal of
>their belongings on a single data base, and making them believe facts
>that simply aren't reality is dangerous. These people will have very
>little understanding for sloppiness here. Linux has no obligation to be
>fast or reliable, but it MUST PROPERLY AND TRUTHFULLY state what it can
>guarantee and what it cannot guarantee.
>
>
>
>>For bigger atomic writes, 2.4 SUSE kernel had some nasty hack (called
>>blk-atomic) to prevent reordering by the io scheduler to avoid partial
>>blocks from databases.
>>
>>
>
>That does not make a write atomic if the scheduled blocks are still
>written one at a time (and I believe tagged command queueing won't help
>to unroll partial writes either).
>
>If the hardware support is missing, it is prudent to say just that and
>not make any bogus promises about platter inertia and "it usually
>works". (who says that the filter curves adjust to the decreasing
>platter speed and the electronics are sustained for long enough? how
>about write verify and remapping broken blocks?)
>
>So we only write one hardware block size atomically, usually 512 bytes
>on ATA and SCSI disk drives (MO might do 2048 at a time, but why
>introduce complexity). That's a data point in this whole fsync()
>discussion.
--
Hans
On Mon, 22 Mar 2004, Christoffer Hall-Frederiksen wrote:
> >If there is no such atomicity (except maybe in ext3fs data=journal or
> >the upcoming reiserfs4 - isn't there?), then nobody should claim so. If
> >the kernel cannot 100.00000000% guarantee the write is atomic, claiming
> >otherwise is plain fraud and nothing else.
> >
> >Some people bet their whole business/company and hence a fair deal of
> >their belongings on a single data base, and making them believe facts
> >that simply aren't reality is dangerous. These people will have very
> >little understanding for sloppiness here. Linux has no obligation to be
> >fast or reliable, but it MUST PROPERLY AND TRUTHFULLY state what it can
> >guarantee and what it cannot guarantee.
>
> Some databases (e.g. Oracle) can write a checksum for each database page
> to overcome this problem, as this is not just "a Linux problem".
I am aware that some databases support checksumming (Berkeley DB also
does, since v4.1 (*), and probably many more, so that they know where
log recovery starts), but does that make it sensible to claim that
timing (a stochastic factor) usually gives "guarantees" about the
atomicity of an individual page write when the hardware guarantees
nothing beyond 512 bytes at a time? I think it does not.
I don't mean to beat up on anyone; I'd just like to have the guarantees
documented without thin-ice promises of "usually you'll get more".
It's good to get more than what was asked for, but the application
designer cannot take that into account because he gets no guarantees. So
why bother wasting space and time writing and reading such lines? Or
even discussing them?
Maybe some interface so an application can query the maximum size of an
atomic write for any given file system (a stat[v]fs extension, maybe)
would be useful though, so applications can be optimized for
data-journaling file systems, should these prove capable of providing
"large atomic write" guarantees.
(*) http://cvs.sourceforge.net/viewcvs.py/bogofilter/bogofilter/src/datastore_db.c?only_with_tag=branch-db-txn#rev1.93.2.5
--
Matthias Andree
Encrypt your mail: my GnuPG key ID is 0x052E7D95