*** Is that aim being achieved by the current policy? ***
As I understand it, the old (<=2.6.11) sync model kept the data in sync
without updating the FAT until later. This runs the risk of partial
corruption of one or more files on pullout.
The new model attempts to be more rigorous by updating the FAT every time
a block of data is written; hence the "hammering" of the physical memory
hosting the FAT record.
In view of the nature of flash memory, this may actually drastically
increase the chance that the whole FAT gets erased.
Part IV (end of a saga)
If a pullout occurs during a write, there is now a near 50% chance that
this takes out the entire FAT.
It would seem that the main advantage of this scheme is that it is so slow
that it encourages users to turn it off. Presumably, in the process of
coming to that conclusion, they will become aware of the need to run umount
or the sync command before removing the device.
= Danger of destroying hardware =
It seems that there are well-documented cases of this abusive rewriting of
the FAT causing rapid and total premature failure of what Alan Cox refers
to as "ultra-crap devices".
There may be valid reasons of cost or miniaturisation that preclude the
additional hardware found in more complex devices.
Even if better-quality devices have some sort of paging mechanism that
makes them more resistant to this sort of abuse, it does not seem good
engineering practice to dismiss those that fail as "shite".
There is nothing in the spec of vfat that suggests the FAT will be written
10,000 times during the writing of one large file. Indeed, it is hard to
imagine that any other implementation on any other OS, or any previous
Linux kernel, behaves like that.
So should the hardware manufacturers have anticipated this particular
driver implementation, or should the kernel be more aware of the existing
hardware that it purports to support?
= The way forward =
It would seem that the first step could be to revert to the 2.6.11
behaviour, which was more appropriate and probably safer even from the
data point of view.
I lack the knowledge and experience to produce reliable kernel code, so I
won't try. However, I have already seen a number of suggestions for how
the old model could be improved. This post could be the starting point for
a discussion of more robust techniques. In any case, the coding is unlikely
to be very complex given the existing, tested code base that is in place
in 2.6.11.
Any new technique should probably aim to be applicable to larger devices
as well. The 2G limit is artificial and is a tacit recognition of the
precariousness of the current code. USB hard disks are just as prone to
accidental cable pullout. Some periodic or per-file sync should probably
be envisaged for the VFAT sync mount option.
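[Purely as an illustration of the periodic-sync idea, not a proposal for the
kernel mechanism itself, here is a trivial userspace sketch; the five-second
interval is an arbitrary placeholder, and sync() of course flushes every
mounted filesystem, not just the removable one.]

#include <unistd.h>

/* Crude stand-in for a periodic-sync policy: push dirty data toward
 * the media every few seconds instead of after every single write().
 * A real in-kernel version would target only the removable device. */
int main(void)
{
        for (;;) {
                sleep(5);       /* arbitrary interval */
                sync();         /* schedule writeback for all filesystems */
        }
}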
PS: if anyone can tell me why I had to post this ten times and chop it into
little bits, it would be appreciated, so that I can avoid messing up the
list in the future.
I spent an hour reading the FAQ and I don't see anything taboo here.
[email protected] wrote:
>
> There is nothing in the spec of vfat that suggests the FAT will be written
> 10,000 times during the writing of one large file. Indeed it is hard to imagine
> that any other implementation on any other OS or any previous linux kernel
> behaves like that.
We sync the file metadata once per write() syscall. If your app writes a
large file in lots of little bits, it'll do a lot of syncs. Other
implementations of fatfs will (must) do the same thing.
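[To put a number on that, here is a hypothetical userspace copy loop; the
4 KB buffer size is my own example, not anything the kernel imposes. Mounted
with `-o sync', each write() below ends up syncing the metadata, so a 50 MB
file written this way costs on the order of ten thousand FAT updates, while
a single large write() would cost one.]

#include <unistd.h>

/* Hypothetical copy loop: 50 MB written in 4 KB chunks issues
 * ~12800 write() syscalls.  Mounted with "-o sync", every one of
 * them syncs the file metadata, so the FAT sector is rewritten
 * that many times. */
static void copy_in_small_bits(int src, int dst)
{
        char buf[4096];
        ssize_t n;

        while ((n = read(src, buf, sizeof(buf))) > 0)
                write(dst, buf, n);     /* each call -> one metadata sync */
}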
> It would seem that the first step could be to revert to the 2.6.11
> behaviour which was more appropriate and probably safer even from the data
> point of view.
fatfs used to be buggy - it didn't implement `-o sync'. Now it does, and
what we're seeing is the fallout from the late fixing of that bug.
You're right - people need to understand what they're doing, make their own
decision, then remove the `-o sync' option. There aren't any easy
solutions.
Thanks for the reply.
On Mon, 27 Feb 2006 01:51:14 +0100, Andrew Morton <[email protected]> wrote:
> [email protected] wrote:
>>
>> There is nothing in the spec of vfat that suggests the FAT will be
>> written
>> 10,000 times during the writing of one large file. Indeed it is hard to
>> imagine
>> that any other implementation on any other OS or any previous linux
>> kernel
>> behaves like that.
>
> We sync the file metadata once per write() syscall. If your app writes a
> large file in lots of little bits, it'll do a lot of syncs. Other
> implementations of fatfs will (must) do the same thing.
That would not seem to be the case, at least on MS systems. I had a friend
do some timings copying a large group of files to a 128M USB flash device.
There was an arbitrary mix of files including many small files and some
larger files, one in excess of 50MB.
suse10 default    4m10s
win2k             2m30s
suse w/o sync     0m30s
The SUSE test was drag-and-drop in Konqueror, the other drag-and-drop in
Windows Explorer.
>
>> It would seem that the first step could be to revert to the 2.6.11
>> behaviour which was more appropriate and probably safer even from the
>> data
>> point of view.
>
> fatfs used to be buggy - it didn't implement `-o sync'. Now it does, and
> what we're seeing is the fallout from the late fixing of that bug.
>
I just tested on my 2.6.11 kernel, which predates the change, and there
is a clear difference between mounting my USB device with and without the
sync option.
ls -ail /tmpd/mail*
239151 -rw-r--r-- 1 root root 8169540 2006-02-27 19:04
/tmpd/mail-bak.2006-02-28.bz2
bash-3.1#time cp !$ /mnt/usb
time cp /tmpd/mail* /mnt/usb
real 0m0.227s
user 0m0.001s
sys 0m0.070s
It returns immediately with no disk activity. About 30s later there was
disk activity. Presumably some periodic flushing of IO buffers.
bash-3.1#umount /mnt/usb
bash-3.1#mount -o sync !$
bash-3.1#time cp /tmpd/mail* /mnt/usb
real 0m5.440s
user 0m0.000s
sys 0m0.143s
So the older model did seem to have some sync functionality, though
presumably not the aggressive one-for-one sync that is now being used.
Please correct me if my interpretation is flawed here:
flash has to be cleared before being written. If metadata is written with
every block output by write(), the risk of erasing the FAT is now many
times higher than with the old sync policy.
So the newer sync policy drastically _reduces_ data security in the case
of untimely disconnection, despite the speed penalty and possible hardware
damage it incurs.
A less rigorous sync policy may in fact be more appropriate than the
current model.
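[A back-of-the-envelope sketch of that arithmetic, using made-up but
plausible numbers (4 KB clusters, a FAT small enough to sit in one erase
block); only the ratio between the two policies matters here.]

#include <stdio.h>

/* Illustrative only: erase cycles landing on the erase block that
 * holds the FAT while writing one 50 MB file. */
int main(void)
{
        long file_bytes   = 50L * 1024 * 1024;
        long cluster_size = 4096;               /* hypothetical cluster size */
        long fat_updates  = file_bytes / cluster_size;

        printf("sync per write(): ~%ld FAT rewrites (one per cluster)\n",
               fat_updates);
        printf("sync on close   : ~1 FAT rewrite\n");
        return 0;
}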
Thanks again.
[Note: I am not subscribed to LKML; if you wish me to receive any follow-ups,
please BCC: col-pepper at piments point com. thx]
[email protected] wrote:
>
> That would not seem to be the case, at least on MS systems. I had a friend
> do some timings copying a large group of files to a 128M USB flash device.
> There was an arbitrary mix of files including many small files and some
> larger files, one in excess of 50MB.
>
> suse10 default 4m10
> win2k 2m30
> suse w/o sync 30s
>
> The suse test was drag and drop in konqueror , the other dnd in windows
> explorer.
We don't know that the same number of same-sized write()s were happening in
each case.
There's been some talk about implementing fsync()-on-file-close for this
problem, and some protopatches. But nothing final yet.
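[In the meantime, an application can approximate fsync()-on-file-close from
userspace. A minimal sketch, assuming the caller just wants the data and the
FAT pushed out once per file rather than once per write():]

#include <unistd.h>

/* Userspace approximation of "fsync on close": write the file,
 * then force its data and metadata out before close().  After
 * close() returns, a pullout loses nothing from this file, and
 * the FAT was only rewritten once. */
static int write_then_sync(int fd, const void *buf, size_t len)
{
        if (write(fd, buf, len) != (ssize_t)len)
                return -1;
        if (fsync(fd))                  /* flush data and metadata */
                return -1;
        return close(fd);
}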
As Andrew suggested, OGAWA Hirofumi, the FAT maintainer, posted the
following patch. It may help you. Please check it.
Message-Id: <[email protected]>
Subject: Re: [EXPERIMENT] Add new "flush" option
This adds a new "flush" option for hotplug devices.
The current implementation of the "flush" option does the following:
- synchronizes data pages at ->release() (last close(2))
- if the user's work seems to be done (the fs is not active), all
metadata is synced by pdflush()
This option would provide a kind of sane progress, and dirty buffers are
flushed more frequently (if the fs is not active). This option doesn't
provide any robustness (robustness is provided by other options), but
the option is probably appropriate for hotplug devices.
(Please don't assume that dirty buffers are synchronized at any particular
point. This implementation may easily change.)
[email protected] wrote:
> Thanks for the reply.
>
>
> On Mon, 27 Feb 2006 01:51:14 +0100, Andrew Morton <[email protected]> wrote:
>
>> [email protected] wrote:
>>
>>>
>>> There is nothing in the spec of vfat that suggests the FAT will be
>>> written
>>> 10,000 times during the writing of one large file. Indeed it is hard to
>>> imagine
>>> that any other implementation on any other OS or any previous linux
>>> kernel
>>> behaves like that.
>>
>>
>> We sync the file metadata once per write() syscall. If your app writes a
>> large file in lots of little bits, it'll do a lot of syncs. Other
>> implementations of fatfs will (must) do the same thing.
>
>
> That would not seem to be the case, at least on MS systems. I had a
> friend do some timings copying a large group of files to a 128M USB
> flash device.
> There was an arbitrary mix of files including many small files and some
> larger files, one in excess of 50MB.
>
> suse10 default 4m10
> win2k 2m30
> suse w/o sync 30s
>
> The suse test was drag and drop in konqueror , the other dnd in windows
> explorer.
>
>>
>>> It would seem that the first step could be to revert to the 2.6.11
>>> behaviour which was more appropriate and probably safer even from
>>> the data
>>> point of view.
>>
>>
>> fatfs used to be buggy - it didn't implement `-o sync'. Now it does, and
>> what we're seeing is the fallout from the late fixing of that bug.
>>
>
> I just tested on my 2.6.11 kernel which would predate the change and
> there is a clear difference between mounting my usb device with and
> without sync option.
>
> ls -ail /tmpd/mail*
> 239151 -rw-r--r-- 1 root root 8169540 2006-02-27 19:04
> /tmpd/mail-bak.2006-02-28.bz2
> bash-3.1#time cp !$ /mnt/usb
> time cp /tmpd/mail* /mnt/usb
>
> real 0m0.227s
> user 0m0.001s
> sys 0m0.070s
>
> It returns immediately with no disk activity. About 30s later there was
> disk activity. Presumably some periodic flushing of IO buffers.
>
> bash-3.1#umount /mnt/usb
> bash-3.1#mount -o sync !$
>
> bash-3.1#time cp /tmpd/mail* /mnt/usb
>
> real 0m5.440s
> user 0m0.000s
> sys 0m0.143s
>
> So the older model did seem to have some sync functionality, though
> presumably not the aggressive one-for-one sync that is now being used.
>
> Please correct me if my interpretation is flawed here:
>
> flash has to be cleared before being written. If metadata is written
> with every block output with write(), the risk of erasing the FAT is
> now many times higher than with the old sync policy.
>
> So the newer sync policy drastically _reduces_ the data security in the
> case of untimely disconnection despite the speed penalty and possible
> hardware damage it incurs.
>
> A less rigorous sync policy may in fact be more appropriate than the
> current model.
>
> Thanks again.
>
>
> [Note: I am not subscribed to LKML, if you wish me to receive any follow
> ups please BCC: col-pepper at piments point com . thx]
--
Hiroyuki Machida [email protected]
SSW Dept. HENC, Sony Corp.
"linux-os \(Dick Johnson\)" <[email protected]> writes:
> On Mon, 27 Feb 2006 [email protected] wrote:
>
>> On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson)
>> <[email protected]> wrote:
>>
[...]
> It takes about a second to erase a 64k physical sector. This is
> a required operation before it is written.
> Since the projected life of these new devices is about 5 to 10 million
> such cycles, (older NAND flash used in modems was only 100-200k) the
> writer would have to be running that "brand new device" for at least 5
> million seconds. Let's see:
What FLASH are you talking about? I work with NAND FLASH chips directly
in embedded projects, and for both Toshiba and Samsung NAND FLASH the
erase time of a 128KB (64K-word) block is typically 2 milliseconds. Page
program time is typically 0.3 milliseconds, so, with 64 pages per block,
the total erase-write cycle for a block is about 22ms.
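[For reference, the arithmetic behind that figure, taking the quoted
datasheet numbers at face value; the throughput line is my own rough upper
bound, ignoring interface and controller overhead.]

#include <stdio.h>

/* 2 ms block erase + 64 pages x 0.3 ms page program, per 128 KB block */
int main(void)
{
        double cycle_ms = 2.0 + 64 * 0.3;               /* ~21 ms per block */
        double mb_per_s = (128.0 / 1024) / (cycle_ms / 1000.0);

        printf("erase+program cycle: %.1f ms per 128 KB block\n", cycle_ms);
        printf("implied upper bound: ~%.1f MB/s sustained write\n", mb_per_s);
        return 0;
}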
Those chips indeed support about 100K program/erase cycles. Maybe there
are some newer NAND FLASH chips that support more program/erase cycles (I
just checked Samsung but found none), but I doubt they are 1000 times
slower for block erase anyway.
-- Sergei.
"linux-os \(Dick Johnson\)" <[email protected]> writes:
> On Tue, 28 Feb 2006, Lennart Sorensen wrote:
>
>> On Tue, Feb 28, 2006 at 08:10:44AM -0500, linux-os (Dick Johnson) wrote:
>>>
>>> On Mon, 27 Feb 2006 [email protected] wrote:
>>>
>>>> On Mon, 27 Feb 2006 22:32:07 +0100, linux-os (Dick Johnson)
>>>> <[email protected]> wrote:
>>>>
>>>>> Flash does not get zeroed to be written! It gets erased, which sets all
>>>>> the bits to '1', i.e., all bytes to 0xff.
>>>>
>>>> Thanks for the correction, but that does not change the discussion.
>>>>
>>>>> Further, the designers of
>>>>> flash disks are not stupid as you assume. The direct access occurs
>>>>> to static RAM (read/write stuff).
>>>>
>>>> I'm not assuming anything. Some hardware has been killed by this issue.
>>>> http://lkml.org/lkml/2005/5/13/144
>>>
>>> No. That hardware was not killed by that issue. The writer, or another
>>> who had encountered the same issue, eventually repartitioned and
>>> reformatted the device. The partition table had gotten corrupted by
>>> some experiments and the writer assumed that the device was broken.
>>> It wasn't.
>>>
>>> Also, if you read other elements in this thread, you would have
>>> learned about something that has become somewhat of a red herring.
>>>
>>> It takes about a second to erase a 64k physical sector. This is
>>> a required operation before it is written. Since the projected
>>> life of these new devices is about 5 to 10 million such cycles,
>>> (older NAND flash used in modems was only 100-200k) the writer
>>> would have to be running that "brand new device" for at least
>>> 5 million seconds. Let's see:
>>
>> How come I can write to my compact flash at about 2M/s if you claim it
>> takes 1s to erase a 64k sector? Somehow I think your number is much too
>> high. Or it can do multiple erases at the same time.
>>
>> Also the 5 to 10 million is a lot higher than the numbers the makers of
>> the compact flash cards I use claim.
>>
>
> Here is an instrumented erase function in a driver that rewrites
> the first sector of a BIOS ROM. Unlike the flash disks, the
> BIOS ROM has no buffering in static RAM, so you can guesstimate
> the actual time to erase...
A BIOS ROM is never NAND FLASH; it's most probably NOR FLASH, and flash
disks are most probably NAND FLASH. NOR and NAND are very different
technologies. You are comparing apples and oranges; static RAM has nothing
to do with it.
-- Sergei.
On Monday 27 February 2006 18:12, Andrew Morton wrote:
> We don't know that the same number of same-sized write()s were happening in
> each case.
>
> There's been some talk about implementing fsync()-on-file-close for this
> problem, and some protopatches. But nothing final yet.
Here's the patch I'm using in -suse right now. What I want to do is make a
much more generic -o flush, but it'll still need a few bits in individual
filesystems to kick off metadata writes quickly.
The basic goal behind the code is to trigger writes without waiting for both
data and metadata. If the user is watching the memory stick, when the
little light stops flashing all the data and metadata will be on disk.
It also generally throttles userland a little during file release. This
could be changed to throttle for each page dirtied, but most users I
asked liked the current setup better.
-chris
From: Chris Mason <[email protected]>
Subject: add -o flush for fat
Fat is commonly used on removable media, mounting with -o flush tells the
FS to write things to disk as quickly as possible. It is like -o sync, but
much faster (and not as safe).
diff -r a06cef570da0 fs/fat/file.c
--- a/fs/fat/file.c Sun Jan 15 11:59:32 2006 -0500
+++ b/fs/fat/file.c Sun Jan 15 13:00:35 2006 -0500
@@ -13,6 +13,7 @@
#include <linux/smp_lock.h>
#include <linux/buffer_head.h>
#include <linux/writeback.h>
+#include <linux/blkdev.h>
int fat_generic_ioctl(struct inode *inode, struct file *filp,
unsigned int cmd, unsigned long arg)
@@ -112,6 +113,19 @@ int fat_generic_ioctl(struct inode *inod
}
}
+static int
+fat_file_release(struct inode *inode, struct file *filp)
+{
+
+ if ((filp->f_mode & FMODE_WRITE) &&
+ MSDOS_SB(inode->i_sb)->options.flush) {
+ writeback_inode(inode);
+ writeback_bdev(inode->i_sb);
+ blk_congestion_wait(WRITE, HZ/10);
+ }
+ return 0;
+}
+
struct file_operations fat_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
@@ -121,6 +135,7 @@ struct file_operations fat_file_operatio
.aio_read = generic_file_aio_read,
.aio_write = generic_file_aio_write,
.mmap = generic_file_mmap,
+ .release = fat_file_release,
.ioctl = fat_generic_ioctl,
.fsync = file_fsync,
.sendfile = generic_file_sendfile,
@@ -293,6 +308,10 @@ void fat_truncate(struct inode *inode)
lock_kernel();
fat_free(inode, nr_clusters);
unlock_kernel();
+ if (MSDOS_SB(inode->i_sb)->options.flush) {
+ writeback_inode(inode);
+ writeback_bdev(inode->i_sb);
+ }
}
struct inode_operations fat_file_inode_operations = {
diff -r a06cef570da0 fs/fat/inode.c
--- a/fs/fat/inode.c Sun Jan 15 11:59:32 2006 -0500
+++ b/fs/fat/inode.c Sun Jan 15 13:00:35 2006 -0500
@@ -24,6 +24,7 @@
#include <linux/vfs.h>
#include <linux/parser.h>
#include <linux/uio.h>
+#include <linux/writeback.h>
#include <asm/unaligned.h>
#ifndef CONFIG_FAT_DEFAULT_IOCHARSET
@@ -860,7 +861,7 @@ enum {
Opt_charset, Opt_shortname_lower, Opt_shortname_win95,
Opt_shortname_winnt, Opt_shortname_mixed, Opt_utf8_no, Opt_utf8_yes,
Opt_uni_xl_no, Opt_uni_xl_yes, Opt_nonumtail_no, Opt_nonumtail_yes,
- Opt_obsolate, Opt_err,
+ Opt_obsolate, Opt_flush, Opt_err,
};
static match_table_t fat_tokens = {
@@ -892,7 +893,8 @@ static match_table_t fat_tokens = {
{Opt_obsolate, "cvf_format=%20s"},
{Opt_obsolate, "cvf_options=%100s"},
{Opt_obsolate, "posix"},
- {Opt_err, NULL}
+ {Opt_flush, "flush"},
+ {Opt_err, NULL},
};
static match_table_t msdos_tokens = {
{Opt_nodots, "nodots"},
@@ -1033,6 +1035,9 @@ static int parse_options(char *options,
return 0;
opts->codepage = option;
break;
+ case Opt_flush:
+ opts->flush = 1;
+ break;
/* msdos specific */
case Opt_dots:
diff -r a06cef570da0 fs/fs-writeback.c
--- a/fs/fs-writeback.c Sun Jan 15 11:59:32 2006 -0500
+++ b/fs/fs-writeback.c Sun Jan 15 13:00:35 2006 -0500
@@ -390,6 +390,29 @@ sync_sb_inodes(struct super_block *sb, s
return; /* Leave any unwritten inodes on s_io */
}
+void
+writeback_bdev(struct super_block *sb)
+{
+ struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
+ filemap_flush(mapping);
+ blk_run_address_space(mapping);
+}
+EXPORT_SYMBOL_GPL(writeback_bdev);
+
+void
+writeback_inode(struct inode *inode)
+{
+
+ struct address_space *mapping = inode->i_mapping;
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ .nr_to_write = 0,
+ };
+ sync_inode(inode, &wbc);
+ filemap_fdatawrite(mapping);
+}
+EXPORT_SYMBOL_GPL(writeback_inode);
+
/*
* Start writeback of dirty pagecache data against all unlocked inodes.
*
diff -r a06cef570da0 fs/msdos/namei.c
--- a/fs/msdos/namei.c Sun Jan 15 11:59:32 2006 -0500
+++ b/fs/msdos/namei.c Sun Jan 15 13:00:35 2006 -0500
@@ -11,6 +11,7 @@
#include <linux/buffer_head.h>
#include <linux/msdos_fs.h>
#include <linux/smp_lock.h>
+#include <linux/writeback.h>
/* MS-DOS "device special files" */
static const unsigned char *reserved_names[] = {
@@ -293,7 +294,7 @@ static int msdos_create(struct inode *di
struct nameidata *nd)
{
struct super_block *sb = dir->i_sb;
- struct inode *inode;
+ struct inode *inode = NULL;
struct fat_slot_info sinfo;
struct timespec ts;
unsigned char msdos_name[MSDOS_NAME];
@@ -329,6 +330,11 @@ static int msdos_create(struct inode *di
d_instantiate(dentry, inode);
out:
unlock_kernel();
+ if (!err && MSDOS_SB(sb)->options.flush) {
+ writeback_inode(dir);
+ writeback_inode(inode);
+ writeback_bdev(sb);
+ }
return err;
}
@@ -361,6 +367,11 @@ static int msdos_rmdir(struct inode *dir
fat_detach(inode);
out:
unlock_kernel();
+ if (!err && MSDOS_SB(inode->i_sb)->options.flush) {
+ writeback_inode(dir);
+ writeback_inode(inode);
+ writeback_bdev(inode->i_sb);
+ }
return err;
}
@@ -414,6 +425,11 @@ static int msdos_mkdir(struct inode *dir
d_instantiate(dentry, inode);
unlock_kernel();
+ if (MSDOS_SB(sb)->options.flush) {
+ writeback_inode(dir);
+ writeback_inode(inode);
+ writeback_bdev(sb);
+ }
return 0;
out_free:
@@ -443,6 +459,11 @@ static int msdos_unlink(struct inode *di
fat_detach(inode);
out:
unlock_kernel();
+ if (!err && MSDOS_SB(inode->i_sb)->options.flush) {
+ writeback_inode(dir);
+ writeback_inode(inode);
+ writeback_bdev(inode->i_sb);
+ }
return err;
}
@@ -648,6 +669,11 @@ static int msdos_rename(struct inode *ol
new_dir, new_msdos_name, new_dentry, is_hid);
out:
unlock_kernel();
+ if (!err && MSDOS_SB(old_dir->i_sb)->options.flush) {
+ writeback_inode(old_dir);
+ writeback_inode(new_dir);
+ writeback_bdev(old_dir->i_sb);
+ }
return err;
}
diff -r a06cef570da0 include/linux/msdos_fs.h
--- a/include/linux/msdos_fs.h Sun Jan 15 11:59:32 2006 -0500
+++ b/include/linux/msdos_fs.h Sun Jan 15 13:00:35 2006 -0500
@@ -203,6 +203,7 @@ struct fat_mount_options {
unicode_xlate:1, /* create escape sequences for unhandled Unicode */
numtail:1, /* Does first alias have a numeric '~1' type tail? */
atari:1, /* Use Atari GEMDOS variation of MS-DOS fs */
+ flush:1, /* write things quickly */
nocase:1; /* Does this need case conversion? 0=need case conversion*/
};
diff -r a06cef570da0 include/linux/writeback.h
--- a/include/linux/writeback.h Sun Jan 15 11:59:32 2006 -0500
+++ b/include/linux/writeback.h Sun Jan 15 13:00:35 2006 -0500
@@ -68,6 +68,8 @@ int inode_wait(void *);
int inode_wait(void *);
void sync_inodes_sb(struct super_block *, int wait);
void sync_inodes(int wait);
+void writeback_bdev(struct super_block *);
+void writeback_inode(struct inode *);
/* writeback.h requires fs.h; it, too, is not included from here. */
static inline void wait_on_inode(struct inode *inode)
Chris Mason <[email protected]> wrote:
>
> On Monday 27 February 2006 18:12, Andrew Morton wrote:
>
> > We don't know that the same number of same-sized write()s were happening in
> > each case.
> >
> > There's been some talk about implementing fsync()-on-file-close for this
> > problem, and some protopatches. But nothing final yet.
>
> Here's the patch I'm using in -suse right now. What I want to do is make a
> much more generic -o flush, but it'll still need a few bits in individual
> filesystem to kick off metadata writes quickly.
>
> The basic goal behind the code is to trigger writes without waiting for both
> data and metadata. If the user is watching the memory stick, when the
> little light stops flashing all the data and metadata will be on disk.
>
> It also generally throttles userland a little during file release. This
> could be changed to throttle for each page dirtied, but most users I
> asked liked the current setup better.
>
> ...
>
> +static int
> +fat_file_release(struct inode *inode, struct file *filp)
On a single line, please.
> + if (MSDOS_SB(inode->i_sb)->options.flush) {
Did you consider making `-o flush' a generic mount option rather than
msdos-only?
I guess there isn't a lot of demand for this for other filesystems, and
having an ignored option like this is a bit misleading...
> +void
> +writeback_inode(struct inode *inode)
> +{
> +
> + struct address_space *mapping = inode->i_mapping;
> + struct writeback_control wbc = {
> + .sync_mode = WB_SYNC_NONE,
> + .nr_to_write = 0,
> + };
> + sync_inode(inode, &wbc);
> + filemap_fdatawrite(mapping);
I think that filemap_fdatawrite() will be a no-op?
On Tuesday 28 February 2006 14:10, Andrew Morton wrote:
> On a single line, please.
>
Ack.
> > + if (MSDOS_SB(inode->i_sb)->options.flush) {
>
> Did you consider making `-o flush' a generic mount option rather than
> msdos-only?
Yes, long term I think the generic option is better. I have three or so ideas
for a generic patch:
1) When the block device leaves congestion, it asks for more io
2) pdflush operation that tries to constantly keep a given block device
congested
3) my current patch extended to the other filesystems that people want -o
flush on.
I've made a few stabs at #1, but didn't like the end result. #2 seems like
the best choice so far. If I got it working nicely I would add the generic
option, otherwise with option #3 it's probably best to keep it per FS.
The main goal for my current patch was to find out if this functionality will
actually make people happy (so far the beta testers like it). If the
complaints are low, it's worth the time to add something generic.
>
> I guess there isn't a lot of demand for this for other filesystems, and
> having an ignored option like this is a bit misleading...
>
> > +void
> > +writeback_inode(struct inode *inode)
> > +{
> > +
> > + struct address_space *mapping = inode->i_mapping;
> > + struct writeback_control wbc = {
> > + .sync_mode = WB_SYNC_NONE,
> > + .nr_to_write = 0,
> > + };
> > + sync_inode(inode, &wbc);
> > + filemap_fdatawrite(mapping);
>
> I think that filemap_fdatawrite() will be a no-op?
This part is nasty: I want to write all of the file data pages and write
the inode without waiting on it. The nr_to_write = 0 will make sure that
sync_inode only writes the inode, and WB_SYNC_NONE makes sure it does not
wait for that I/O to finish.
What I really want is WB_SYNC_NONE in mpage_writepages, but I don't want to
trigger this code:
if (wbc->sync_mode == WB_SYNC_NONE) {
index = mapping->writeback_index; /* Start from prev offset */
So, I use filemap_fdatawrite to make sure all of the data pages get written.
It's not perfect, but I was going for minimal changes outside of fat.
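[For readers following along, here is the same writeback_inode() from the
patch with comments added, spelling out the combination Chris describes;
this is annotation only, not a change to the patch.]

#include <linux/fs.h>
#include <linux/writeback.h>

void writeback_inode(struct inode *inode)
{
        struct address_space *mapping = inode->i_mapping;
        struct writeback_control wbc = {
                .sync_mode   = WB_SYNC_NONE,    /* start I/O, don't wait on it */
                .nr_to_write = 0,               /* the inode itself only, no data pages */
        };

        /* push the dirty inode (directory entry / FAT state) ... */
        sync_inode(inode, &wbc);
        /* ... then start writeback of all dirty data pages.  This path
         * uses WB_SYNC_ALL internally, so pages already under writeback
         * are waited on before being resubmitted, but the caller does
         * not wait for the newly submitted I/O to complete. */
        filemap_fdatawrite(mapping);
}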
-chris
On Wednesday 01 March 2006 10:00, OGAWA Hirofumi wrote:
> Chris Mason <[email protected]> writes:
> > @@ -329,6 +330,11 @@ static int msdos_create(struct inode *di
> > d_instantiate(dentry, inode);
> > out:
> > unlock_kernel();
> > + if (!err && MSDOS_SB(sb)->options.flush) {
> > + writeback_inode(dir);
> > + writeback_inode(inode);
> > + writeback_bdev(sb);
> > + }
> > return err;
> > }
>
> If the buffers are already queued for I/O, and you don't wait for anything,
> the buffers won't be (re-)submitted, so those buffers will end up being
> flushed by the normal periodic wb_kupdate() after all.
Just to make sure we're using the same terms, do you mean the pages are marked
dirty and on the SB's dirty list, or do you mean the page has been through
writepage and is currently on its way to the disk?
>
> Do you have any plan to address it? Or am I just missing something?
If you mean the page is just dirty, it will get written by the
filemap_fdatawrite calls. If you mean the page is PG_writeback, it is
already on the way to the disk, so it passes the 'blinking light on the
memory stick' rule.
-chris
On Wednesday 01 March 2006 20:15, OGAWA Hirofumi wrote:
> > Just to make sure we're using the same terms, do you mean the pages are
> > marked dirty and on the SB's dirty list, or do you mean the page has been
> > through writepage and is currently on its way to the disk?
>
> The page is already on the device's request queue, and the page is already
> marked PG_writeback. And that page has not been processed by the device yet.
>
> Then, when you call the next filemap_fdatawrite(), it just re-dirties the
> page and queues it to sb->s_dirty, because the page's buffer_heads are still
> locked. So is the re-dirtied page re-submitted to the device by the
> periodic wb_kupdate()?
filemap_fdatawrite() won't redirty the page. It will wait on the pending
writeback.
-chris
Chris Mason <[email protected]> writes:
> filemap_fdatawrite() won't redirty the page. It will wait on the pending
> writeback.
Umm... I'm looking at the following code.
+ if (MSDOS_SB(sb)->options.flush) {
+ writeback_inode(dir);
+ writeback_inode(inode);
+ writeback_bdev(sb);
+ }
+void
+writeback_bdev(struct super_block *sb)
+{
+ struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
+ filemap_flush(mapping);
+ blk_run_address_space(mapping);
+}
+EXPORT_SYMBOL_GPL(writeback_bdev);
filemap_flush() is using WB_SYNC_NONE.
in mpage_writepages()
if (wbc->sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);
if (PageWriteback(page) ||
!clear_page_dirty_for_io(page)) {
unlock_page(page);
continue;
}
Where does it wait?
--
OGAWA Hirofumi <[email protected]>
On Thursday 02 March 2006 09:07, OGAWA Hirofumi wrote:
> Chris Mason <[email protected]> writes:
> > filemap_fdatawrite() won't redirty the page. It will wait on the pending
> > writeback.
>
> Umm... I'm looking the following code.
>
> +void
> +writeback_bdev(struct super_block *sb)
> +{
> + struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
> + filemap_flush(mapping);
> + blk_run_address_space(mapping);
> +}
> +EXPORT_SYMBOL_GPL(writeback_bdev);
>
> filemap_flush() is using WB_SYNC_NONE.
>
Ok, I thought you were asking about the code that called filemap_fdatawrite,
which does wait. filemap_flush is used on the underlying block device. In
the case of a page that is already under IO, the io is not cancelled but
allowed to continue.
This is the desired result. When you're doing a number of operations in
sequence, each operation will start io on the block device. If they used
filemap_fdatawrite instead of filemap_flush, they would end up being
synchronous.
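[And the block-device side of the same distinction: the patch's
writeback_bdev() with my comments, as I read the generic writeback code of
that era; again annotation only, not a change to the patch.]

#include <linux/fs.h>
#include <linux/blkdev.h>

void writeback_bdev(struct super_block *sb)
{
        struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;

        /* WB_SYNC_NONE under the hood: start writeback of the dirty FAT
         * and directory buffers, skipping anything already on its way to
         * the device instead of waiting for it, so back-to-back metadata
         * operations keep queuing I/O rather than serializing. */
        filemap_flush(mapping);
        /* kick the request queue so the I/O starts now */
        blk_run_address_space(mapping);
}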
-chris
Chris Mason <[email protected]> writes:
> Ok, I thought you were asking about the code that called filemap_fdatawrite,
> which does wait. filemap_flush is used on the underlying block device. In
> the case of a page that is already under IO, the io is not cancelled but
> allowed to continue.
>
> This is the desired result. When you're doing a number of operations in
> sequence, each operation will start io on the block device. If they used
> filemap_fdatawrite instead of filemap_flush, they would end up being
> synchronous.
Of course, I know. Let's return to the beginning of this thread: do you
have any plan to address it?
--
OGAWA Hirofumi <[email protected]>
Hi all,
Chris Mason wrote:
> On Monday 27 February 2006 18:12, Andrew Morton wrote:
>
>> We don't know that the same number of same-sized write()s were happening in
>> each case.
>>
>> There's been some talk about implementing fsync()-on-file-close for this
>> problem, and some protopatches. But nothing final yet.
>>
>
> Here's the patch I'm using in -suse right now. What I want to do is make a
> much more generic -o flush, but it'll still need a few bits in individual
> filesystem to kick off metadata writes quickly.
>
> The basic goal behind the code is to trigger writes without waiting for both
> data and metadata. If the user is watching the memory stick, when the
> little light stops flashing all the data and metadata will be on disk.
>
> It also generally throttles userland a little during file release. This
> could be changed to throttle for each page dirtied, but most users I
> asked liked the current setup better.
>
I like the idea and would like to see something like this in mainline.
Here is a non-scientific benchmark done with 2.6.16, comparing a default
mount and a flush mount of a USB2 stick:
/////////////////////////////////////////////////////////////////////
Single File "Test": 43MB
$ time cp Test /media/usbdisk/test/ && time umount /media/usbdisk/
/////////////////////////////////////////////////////////////////////
VANILLA:
real 0m3.770s
user 0m0.004s
sys 0m0.308s
real 0m9.439s
user 0m0.000s
sys 0m0.040s
FLUSH:
real 0m6.000s
user 0m0.012s
sys 0m0.400s
real 0m3.668s
user 0m0.000s
sys 0m0.028s
REAL TIME RATIO (FLUSH/VANILLA):
9.6 / 13.1 = 0.73
/////////////////////////////////////////////////////////////////////
Directory Tree "flushtest": 44MB (8866 files, 1820 dirs)
$ time cp -R flushtest/ /media/usbdisk/ && time umount /media/usbdisk/
/////////////////////////////////////////////////////////////////////
VANILLA:
real 0m0.966s
user 0m0.024s
sys 0m0.860s
real 1m11.962s
user 0m0.004s
sys 0m0.160s
FLUSH:
real 1m41.645s
user 0m0.032s
sys 0m1.112s
real 0m4.660s
user 0m0.004s
sys 0m0.068s
REAL TIME RATIO (FLUSH/VANILLA):
106.3 / 77.9 = 1.36
On Wed, 29 Mar 2006 04:13:03 +0200, Mathis Ahrens <[email protected]>
wrote:
> Hi all,
>
> Chris Mason wrote:
>> On Monday 27 February 2006 18:12, Andrew Morton wrote:
>>
>>> We don't know that the same number of same-sized write()s were
>>> happening in
>>> each case.
>>>
>>> There's been some talk about implementing fsync()-on-file-close for
>>> this
>>> problem, and some protopatches. But nothing final yet.
>>>
>>
>> Here's the patch I'm using in -suse right now. What I want to do is
>> make a much more generic -o flush, but it'll still need a few bits in
>> individual filesystem to kick off metadata writes quickly.
>>
>> The basic goal behind the code is to trigger writes without waiting for
>> both
>> data and metadata. If the user is watching the memory stick, when the
>> little light stops flashing all the data and metadata will be on disk.
>>
>> It also generally throttles userland a little during file release.
>> This could be changed to throttle for each page dirtied, but most users
>> I asked liked the current setup better.
>>
>
> I like the idea and would like to see something like this in mainline.
>
> Here is some non-scientific benchmark done with 2.6.16, comparing
> default mount and flush mount of a USB2 stick:
>
> /////////////////////////////////////////////////////////////////////
> Single File "Test": 43MB
> $ time cp Test /media/usbdisk/test/ && time umount /media/usbdisk/
> /////////////////////////////////////////////////////////////////////
>
> VANILLA:
>
> real 0m3.770s
> user 0m0.004s
> sys 0m0.308s
>
> real 0m9.439s
> user 0m0.000s
> sys 0m0.040s
>
> FLUSH:
>
> real 0m6.000s
> user 0m0.012s
> sys 0m0.400s
>
> real 0m3.668s
> user 0m0.000s
> sys 0m0.028s
>
> REAL TIME RATIO (FLUSH/VANILLA):
> 9.6 / 13.1 = 0.73
>
> /////////////////////////////////////////////////////////////////////
> Directory Tree "flushtest": 44MB (8866 files, 1820 dirs)
> $ time cp -R flushtest/ /media/usbdisk/ && time umount /media/usbdisk/
> /////////////////////////////////////////////////////////////////////
>
> VANILLA:
>
> real 0m0.966s
> user 0m0.024s
> sys 0m0.860s
>
> real 1m11.962s
> user 0m0.004s
> sys 0m0.160s
>
> FLUSH:
>
> real 1m41.645s
> user 0m0.032s
> sys 0m1.112s
>
> real 0m4.660s
> user 0m0.004s
> sys 0m0.068s
>
> REAL TIME RATIO (FLUSH/VANILLA):
> 106.3 / 77.9 = 1.36
>
>
That's interesting; albeit non-scientific, I think it is quite informative.
There are two basic problems with the current code: speed is down by
around an order of magnitude compared to a non-synced write, and the code
is hammering the FAT. The two are obviously related.
Viewing the system globally rather than considering the details of the
techniques used, it would seem that any algorithm that does not
drastically reduce write times, at least on the one-large-file test, is
missing the mark and presumably repeating the problem in a slightly
different way.
I'm not knocking the effort Chris has put in; it's great to see this is
getting some attention. But I think viewing overall performance times as
shown above gives a touchstone as to whether any particular prototype is
effective.
The fact that flush can be almost 40% slower in some cases is worrying.
Thanks for the info.