2001-11-12 14:54:51

by Ben Israel

[permalink] [raw]
Subject: File System Performance

My System:
Pentium III 800MHz
128M SDRAM
Aug 2001 MSC Linux
XFS File System

My Results:
hdparm /dev/hda

/dev/hda:
multcount = 0 (off)
I/O support = 0 (default 16-bit)
unmaskirq = 0 (off)
using_dma = 1 (on)
keepsettings = 0 (off)
nowerr = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 3739/255/63, sectors = 60074784, start = 0

hdparm -t /dev/hda

/dev/hda:
Timing buffered disk reads: 64 MB in 2.71 seconds = 23.62 MB/sec

time cp -r /usr/src/linux-2.4.6 tst

real 0m47.376s
user 0m0.180s
sys 0m2.710s

du -bs tst
144187392 tst

Actual Performance
2*144MB/48s=6MB/sec

Notes:
1) for consistent results, data size should exceed the file cache.
2) cp both reads and writes each file, so 2*3MB/sec = 6MB/sec of file I/O on a
24MB/sec drive (see the sketch below).
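
For reference, here is that arithmetic spelled out as a tiny C program (just a
sketch; the 144187392-byte tree size and 47.376 s wall time are the
measurements shown above):

#include <stdio.h>

int main(void)
{
        const double tree_bytes = 144187392.0; /* from "du -bs tst" */
        const double wall_secs  = 47.376;      /* "real" time of the cp */

        /* cp moves each byte twice: one read plus one write */
        printf("effective file I/O: %.1f MB/sec\n",
               2.0 * tree_bytes / (1024.0 * 1024.0) / wall_secs);
        return 0;
}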

My Questions:
Is there a way of identifying what the file system is doing here?
Is there a way to improve it?
What tools are there to identify and time the raw disk I/O done by the file
system?






2001-11-12 16:34:22

by Andrew Morton

[permalink] [raw]
Subject: Re: File System Performance

Ben Israel wrote:
>
> ...
> 128M SDRAM
> ...
> time cp -r /usr/src/linux-2.4.6 tst
> ...
> 2*144MB/48s=6MB/sec
>

There was some discussion about this last week. It appears to
be due to ext2's directory placement policy. Al Viro has a
patch which implements the "Orlov allocator" which FreeBSD are
using.

It works, and it'll get you close to disk bandwidth with this test.
But the effects of this change on other workloads (the so-called
"slow growth" scenario) still need to be understood and tested.

-

2001-11-12 17:30:38

by Andrew Morton

[permalink] [raw]
Subject: Re: File System Performance

Ben Israel wrote:
>
> XFS File System
>

err... I was talking about ext2/3. Ignore me.

-

2001-11-12 19:18:44

by Ben Israel

[permalink] [raw]
Subject: Re: File System Performance

Andrew

Thank you very much for your response. I would like to know whatever I can
about file systems that achieve near raw-disk transfer speeds on large file
system modifications. What does the "Orlov allocator" do differently? What is
a "slow growth" workload? All file systems I've used have this problem. XFS is
just supposedly high performance. It offers some improvement, but is still
off by a factor of 4.

Andrew Morton wrote:

> Ben Israel wrote:
> >
> > ...
> > 128M SDRAM
> > ...
> > time cp -r /usr/src/linux-2.4.6 tst
> > ...
> > 2*144MB/48s=6MB/sec
> >
>
> There was some discussion about this last week. It appears to
> be due to ext2's directory placement policy. Al Viro has a
> patch which implements the "Orlov allocator" which FreeBSD are
> using.
>
> It works, and it'll get you close to disk bandwidth with this test.
> But the effects of this change on other workloads (the so-called
> "slow growth" scenario) still needs to be understood and tested.
>
> -


2001-11-12 19:47:04

by Andrew Morton

[permalink] [raw]
Subject: Re: File System Performance

Ben Israel wrote:
>
> Andrew
>
> Thank you very much for your response. I would like to know what ever I can
> about File Systems that achieve near Raw Disk Transfer Speeds on large file
> system modifications.

This is very filesystem-dependent. Probably the XFS mailing list
[email protected] is the place to ask for XFS.

> What does the "Orlov allocator" do differently?

The problem which ext2 experiences with your particular workload
is that it attempts to balance the creation of new directories
across the filesystem's various "block groups". There are tens
or hundreds of these. ext2 also tries to place files in the same
block group as their parent directory.

Consequently, when you untar (or just copy) a directory tree,
it gets placed all over the partition. So both reading it and
writing it requires a lot of seeking.

That's a fast-growth workload. A slow-growth workload is
the normal sort of day-to-day activity where you regularly
create and delete files. The occupancy of the fs grows slowly.

We are seeing some degradation in performance in the slow-growth
case with the Orlov algorithm. But fast-growth performance is increased a lot.

The Orlov algorithm (or at least Al's version of it) tries to
spread top-level directories evenly across the partition, under
the assumption that these contain unrelated information. But for
directories at a lower level, it is more aggressive about placing
directories in the same block group as their parent. It _will_
move into a different block group, but only when it sees that the
current one is filling up.
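
In rough pseudo-C, that policy amounts to something like this - a deliberately
simplified sketch of the behaviour described above, with made-up helper names;
Al's actual allocator (appended below) adds the debt accounting and the
fallback scans:

#include <stdlib.h>

/* Illustrative only - not the real patch. */
int pick_dir_group(int parent_group, int ngroups, int parent_is_root,
                   int (*group_is_filling_up)(int group))
{
        int i;

        if (parent_is_root) {
                /* top-level directories: spread them over the whole disk,
                 * starting from a random block group */
                int start = rand() % ngroups;

                for (i = 0; i < ngroups; i++) {
                        int g = (start + i) % ngroups;
                        if (!group_is_filling_up(g))
                                return g;
                }
                return start;
        }

        /* deeper directories: stay with the parent unless its group is
         * running out of free inodes/blocks */
        if (!group_is_filling_up(parent_group))
                return parent_group;
        return (parent_group + 1) % ngroups;    /* spill to a neighbour */
}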

> All File Systems I've used have this problem. XFS is
> just supposedly high performance. It offers some improvement, but is still
> off by a factor of 4.

Yup. Which places all of our little two-instructions-here and
one-cacheline-there optimisations into an ironic context, no?

Here's Al's current allocator. It's against 2.4.15-pre3 or -pre4.
The comments are informative. Please see if it helps with your
workloads.



diff -urN S15-pre3/fs/ext2/ialloc.c S15-pre3-ext2/fs/ext2/ialloc.c
--- S15-pre3/fs/ext2/ialloc.c Sun Nov 11 15:24:12 2001
+++ S15-pre3-ext2/fs/ext2/ialloc.c Mon Nov 12 02:22:20 2001
@@ -17,6 +17,7 @@
#include <linux/ext2_fs.h>
#include <linux/locks.h>
#include <linux/quotaops.h>
+#include <linux/random.h>


/*
@@ -230,7 +231,7 @@
* group to find a free inode.
*/

-static int find_group_dir(struct super_block *sb, int parent_group)
+static int find_group_dir(struct super_block *sb, struct inode *parent)
{
struct ext2_super_block * es = sb->u.ext2_sb.s_es;
int ngroups = sb->u.ext2_sb.s_groups_count;
@@ -263,8 +264,113 @@
return best_group;
}

-static int find_group_other(struct super_block *sb, int parent_group)
+#define INODE_COST 64
+#define BLOCK_COST 256
+
+static int find_group_orlov(struct super_block *sb, const struct inode *parent)
+{
+ int parent_group = parent->u.ext2_i.i_block_group;
+ struct ext2_super_block * es = sb->u.ext2_sb.s_es;
+ int ngroups = sb->u.ext2_sb.s_groups_count;
+ int inodes_per_group = EXT2_INODES_PER_GROUP(sb);
+ int avefreei = le32_to_cpu(es->s_free_inodes_count) / ngroups;
+ int avefreeb = le32_to_cpu(es->s_free_blocks_count) / ngroups;
+ int blocks_per_dir;
+ int ndirs = sb->u.ext2_sb.s_dir_count;
+ int max_debt, max_dirs, min_blocks, min_inodes;
+ int group = -1, i;
+ struct ext2_group_desc *desc;
+ struct buffer_head *bh;
+
+ if (parent == sb->s_root->d_inode) {
+ struct ext2_group_desc *best_desc = NULL;
+ struct buffer_head *best_bh = NULL;
+ int best_ndir = inodes_per_group;
+ int best_group = -1;
+
+ get_random_bytes(&group, sizeof(group));
+ parent_group = (unsigned)group % ngroups;
+ for (i = 0; i < ngroups; i++) {
+ group = (parent_group + i) % ngroups;
+ desc = ext2_get_group_desc (sb, group, &bh);
+ if (!desc || !desc->bg_free_inodes_count)
+ continue;
+ if (le16_to_cpu(desc->bg_used_dirs_count) >= best_ndir)
+ continue;
+ if (le16_to_cpu(desc->bg_free_inodes_count) < avefreei)
+ continue;
+ if (le16_to_cpu(desc->bg_free_blocks_count) < avefreeb)
+ continue;
+ best_group = group;
+ best_ndir = le16_to_cpu(desc->bg_used_dirs_count);
+ best_desc = desc;
+ best_bh = bh;
+ }
+ if (best_group >= 0) {
+ desc = best_desc;
+ bh = best_bh;
+ group = best_group;
+ goto found;
+ }
+ goto fallback;
+ }
+
+ blocks_per_dir = (le32_to_cpu(es->s_blocks_count) -
+ le32_to_cpu(es->s_free_blocks_count)) / ndirs;
+
+ max_dirs = ndirs / ngroups + inodes_per_group / 16;
+ min_inodes = avefreei - inodes_per_group / 4;
+ min_blocks = avefreeb - EXT2_BLOCKS_PER_GROUP(sb) / 4;
+
+ max_debt = EXT2_BLOCKS_PER_GROUP(sb) / max(blocks_per_dir, BLOCK_COST);
+ if (max_debt * INODE_COST > inodes_per_group)
+ max_debt = inodes_per_group / INODE_COST;
+ if (max_debt > 255)
+ max_debt = 255;
+ if (max_debt == 0)
+ max_debt = 1;
+
+ for (i = 0; i < ngroups; i++) {
+ group = (parent_group + i) % ngroups;
+ desc = ext2_get_group_desc (sb, group, &bh);
+ if (!desc || !desc->bg_free_inodes_count)
+ continue;
+ if (sb->u.ext2_sb.debts[group] >= max_debt)
+ continue;
+ if (le16_to_cpu(desc->bg_used_dirs_count) >= max_dirs)
+ continue;
+ if (le16_to_cpu(desc->bg_free_inodes_count) < min_inodes)
+ continue;
+ if (le16_to_cpu(desc->bg_free_blocks_count) < min_blocks)
+ continue;
+ goto found;
+ }
+
+fallback:
+ for (i = 0; i < ngroups; i++) {
+ group = (parent_group + i) % ngroups;
+ desc = ext2_get_group_desc (sb, group, &bh);
+ if (!desc || !desc->bg_free_inodes_count)
+ continue;
+ if (le16_to_cpu(desc->bg_free_inodes_count) >= avefreei)
+ goto found;
+ }
+
+ return -1;
+
+found:
+ desc->bg_free_inodes_count =
+ cpu_to_le16(le16_to_cpu(desc->bg_free_inodes_count) - 1);
+ desc->bg_used_dirs_count =
+ cpu_to_le16(le16_to_cpu(desc->bg_used_dirs_count) + 1);
+ sb->u.ext2_sb.s_dir_count++;
+ mark_buffer_dirty(bh);
+ return group;
+}
+
+static int find_group_other(struct super_block *sb, struct inode *parent)
{
+ int parent_group = parent->u.ext2_i.i_block_group;
int ngroups = sb->u.ext2_sb.s_groups_count;
struct ext2_group_desc *desc;
struct buffer_head *bh;
@@ -333,9 +439,9 @@
es = sb->u.ext2_sb.s_es;
repeat:
if (S_ISDIR(mode))
- group = find_group_dir(sb, dir->u.ext2_i.i_block_group);
+ group = find_group_orlov(sb, dir);
else
- group = find_group_other(sb, dir->u.ext2_i.i_block_group);
+ group = find_group_other(sb, dir);

err = -ENOSPC;
if (group == -1)
@@ -369,6 +475,15 @@

es->s_free_inodes_count =
cpu_to_le32(le32_to_cpu(es->s_free_inodes_count) - 1);
+
+ if (S_ISDIR(mode)) {
+ if (sb->u.ext2_sb.debts[i] < 255)
+ sb->u.ext2_sb.debts[i]++;
+ } else {
+ if (sb->u.ext2_sb.debts[i])
+ sb->u.ext2_sb.debts[i]--;
+ }
+
mark_buffer_dirty(sb->u.ext2_sb.s_sbh);
sb->s_dirt = 1;
inode->i_uid = current->fsuid;
@@ -470,6 +585,21 @@
#else
return le32_to_cpu(sb->u.ext2_sb.s_es->s_free_inodes_count);
#endif
+}
+
+/* Called at mount-time, super-block is locked */
+unsigned long ext2_count_dirs (struct super_block * sb)
+{
+ unsigned long count = 0;
+ int i;
+
+ for (i = 0; i < sb->u.ext2_sb.s_groups_count; i++) {
+ struct ext2_group_desc *gdp = ext2_get_group_desc (sb, i, NULL);
+ if (!gdp)
+ continue;
+ count += le16_to_cpu(gdp->bg_used_dirs_count);
+ }
+ return count;
}

#ifdef CONFIG_EXT2_CHECK
diff -urN S15-pre3/fs/ext2/super.c S15-pre3-ext2/fs/ext2/super.c
--- S15-pre3/fs/ext2/super.c Tue Oct 9 21:47:26 2001
+++ S15-pre3-ext2/fs/ext2/super.c Mon Nov 12 02:03:02 2001
@@ -605,6 +605,12 @@
printk ("EXT2-fs: not enough memory\n");
goto failed_mount;
}
+ sb->u.ext2_sb.debts = kmalloc(sb->u.ext2_sb.s_groups_count, GFP_KERNEL);
+ if (!sb->u.ext2_sb.debts) {
+ printk ("EXT2-fs: not enough memory\n");
+ goto failed_mount;
+ }
+ memset(sb->u.ext2_sb.debts, 0, sb->u.ext2_sb.s_groups_count);
for (i = 0; i < db_count; i++) {
sb->u.ext2_sb.s_group_desc[i] = bread (dev, logic_sb_block + i + 1,
sb->s_blocksize);
@@ -621,6 +627,7 @@
db_count = i;
goto failed_mount2;
}
+ sb->u.ext2_sb.s_dir_count = ext2_count_dirs(sb);
for (i = 0; i < EXT2_MAX_GROUP_LOADED; i++) {
sb->u.ext2_sb.s_inode_bitmap_number[i] = 0;
sb->u.ext2_sb.s_inode_bitmap[i] = NULL;
diff -urN S15-pre3/include/linux/ext2_fs.h S15-pre3-ext2/include/linux/ext2_fs.h
--- S15-pre3/include/linux/ext2_fs.h Sun Nov 11 00:46:41 2001
+++ S15-pre3-ext2/include/linux/ext2_fs.h Mon Nov 12 02:03:02 2001
@@ -578,6 +578,7 @@
extern unsigned long ext2_count_free_inodes (struct super_block *);
extern void ext2_check_inodes_bitmap (struct super_block *);
extern unsigned long ext2_count_free (struct buffer_head *, unsigned);
+extern unsigned long ext2_count_dirs (struct super_block *);

/* inode.c */
extern void ext2_read_inode (struct inode *);
diff -urN S15-pre3/include/linux/ext2_fs_sb.h S15-pre3-ext2/include/linux/ext2_fs_sb.h
--- S15-pre3/include/linux/ext2_fs_sb.h Fri Feb 16 21:29:52 2001
+++ S15-pre3-ext2/include/linux/ext2_fs_sb.h Mon Nov 12 02:03:02 2001
@@ -56,6 +56,8 @@
int s_desc_per_block_bits;
int s_inode_size;
int s_first_ino;
+ unsigned long s_dir_count;
+ u8 *debts;
};

#endif /* _LINUX_EXT2_FS_SB */

2001-11-12 20:00:26

by Richard Gooch

[permalink] [raw]
Subject: Re: File System Performance

Andrew Morton writes:
> Ben Israel wrote:
> >
> > Andrew
> >
> > Thank you very much for your response. I would like to know what ever I can
> > about File Systems that achieve near Raw Disk Transfer Speeds on large file
> > system modifications.
>
> This is very filesystem-dependent. Probably the XFS mailing list
> [email protected] is the place to ask for XFS.
>
> > What does the "Orlov allocator" do differently?
>
> The problem which ext2 experiences with your particular workload
> is that it attempts to balance the creation of new directories
> across the filesystem's various "block groups". There are tens
> or hundreds of these. ext2 also tries to place files in the same
> block group as their parent directory.
>
> Consequently, when you untar (or just copy) a directory tree,
> it gets placed all over the partition. So both reading it and
> writing it requires a lot of seeking.
>
> That's a fast-growth workload. A slow-growth workload is
> the normal sort of day-to-day activity where you reqularly
> create and delete files. The occupancy of the fs grows slowly.
>
> We are seeing some degradation in performance in the slow-growth
> case with the Orlov algorithm. But fast-growth is increased a lot.
>
> The Orlov algorithm (or at least Al's version of it) tries to
> spread top-level directories evenly across the partition, under
> the assumption that these contain unrelated information. But for
> directories at a lower level, it is more aggressive about placing
> directories in the same block group as their parent. It _will_
> move into a different block group, but only when it sees that the
> current one is filling up.

Here's an idea: add a "--compact" option to tar, so that it creates
*all* inodes (files and directories alike) in the base directory, and
then renames newly created entries to shuffle them into their correct
positions. That should limit the number of block groups that are used,
right?

It would probably also be a good idea to do that for cp as well, so
that when I do a "cp -al" of a virgin kernel tree, I can keep all the
directory inodes together. It will make a cold diff even faster.
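
Roughly, per directory the scheme would do something like this (an
illustrative sketch, not a real tar patch; the helper name and the
temporary-name convention are made up):

#include <stdio.h>
#include <sys/stat.h>

/* Create the directory's inode under the tree's base directory, then
 * rename() it into its final position, so all the directory inodes are
 * allocated relative to the same parent. */
int compact_mkdir(const char *base, const char *final_path, int seq)
{
        char tmp[4096];

        snprintf(tmp, sizeof(tmp), "%s/.compact.%d", base, seq);
        if (mkdir(tmp, 0777) < 0)
                return -1;
        return rename(tmp, final_path);
}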

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-11-12 20:11:36

by Steve Lord

[permalink] [raw]
Subject: Re: File System Performance

On Mon, 2001-11-12 at 13:46, Andrew Morton wrote:
> Ben Israel wrote:
> >
> > Andrew
> >
> > Thank you very much for your response. I would like to know what ever I can
> > about File Systems that achieve near Raw Disk Transfer Speeds on large file
> > system modifications.
>
> This is very filesystem-dependent. Probably the XFS mailing list
> [email protected] is the place to ask for XFS.
>
> > What does the "Orlov allocator" do differently?
>
>
> The Orlov algorithm (or at least Al's version of it) tries to
> spread top-level directories evenly across the partition, under
> the assumption that these contain unrelated information. But for
> directories at a lower level, it is more aggressive about placing
> directories in the same block group as their parent. It _will_
> move into a different block group, but only when it sees that the
> current one is filling up.

XFS does something similar to this - the filesystem is split into
'allocation groups': independent chunks of up to 4 Gbytes of disk
space. By default a directory is placed in a different allocation
group than its parent, file inodes are placed in the same allocation
group - preferably 'close' to the directory inode - and file data
also prefers the same allocation group as the inode.

This all works OK when the filesystem is not bursting at the seams,
and when there is some correspondence between logical block numbers
in the filesystem and the physical drives underneath. I believe that once
you get into complex filesystems using RAID devices and smart caching
drives it gets very hard for the filesystem to predict anything
about the latency of access for two blocks which it thinks are next
to each other in the filesystem.
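
Schematically, the default placement just described works out to something
like this (an illustrative sketch only, not XFS source; treating "a different
allocation group" as simply the next one is my simplification):

/* Illustrative only - not XFS code. */
enum itype { INODE_DIR, INODE_FILE };

int pick_alloc_group(enum itype type, int parent_ag, int num_ags)
{
        if (type == INODE_DIR)
                return (parent_ag + 1) % num_ags; /* new AG for each subdir */
        return parent_ag;  /* files (and their data) stay near the directory */
}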

>
> > All File Systems I've used have this problem. XFS is
> > just supposedly high performance. It offers some improvement, but is still
> > off by a factor of 4.
>

XFS on Irix can reach disk speeds, but for a limited set of workloads,
i.e. direct I/O into preallocated space.

Looking at your original benchmark:

hdparm -t /dev/hda

/dev/hda:
Timing buffered disk reads: 64 MB in 2.71 seconds = 23.62 MB/sec

time cp -r /usr/src/linux-2.4.6 tst

real 0m47.376s
user 0m0.180s
sys 0m2.710s

du -bs tst
144187392 tst

Actual Performance
2*144MB/48s=6MB/sec

There is a lot of difference between doing a raw read of data from
the device, and copying a large directory tree. The raw read is
purely sequential, the tree copy is basically random access since
the kernel is being asked to read and then write a lot of small
chunks of disk space.

Steve

--

Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: [email protected]

2001-11-12 20:43:09

by Andrew Morton

[permalink] [raw]
Subject: Re: File System Performance

Steve Lord wrote:
>
> ...
> This all works OK when the filesystem is not bursting at the seams,
> and when there is some correspondence between logical block numbers
> in the filesystem and physical drives underneath. I believe that once
> you get into complex filesystems using raid devices and smart caching
> drives it gets very hard for the filesystem to predict anything
> about latency of access for two blocks which it things are next
> to each other in the filesystem.
>

Well, file systems have for all time made efforts to write things
out contiguously. If an IO system were to come along and deoptimise
that case it'd be pretty broken?

> ...
>
> There is a lot of difference between doing a raw read of data from
> the device, and copying a large directory tree. The raw read is
> purely sequential, the tree copy is basically random access since
> the kernel is being asked to read and then write a lot of small
> chunks of disk space.

hum. Yes. Copying a tree from and to the same disk won't
achieve disk bandwidth. If they're different disks then
ext2+Orlov will actually get there.

We in fact do extremely well on IDE with Orlov for this workload,
even though we're not doing inter-inode readahead. This will be because
the disk is doing the readahead for us. It's going to be highly dependent
upon the disk cache size, readahead cache partitioning algorithm, etc.

BTW, I've been trying to hunt down a suitable file system aging tool.
We're not very happy with Keith Smith's workload because the directory
information was lost (he was purely studying FFS algorithmic differences
- the load isn't 100% suitable for testing other filesystems / algorithms).
Constantin Loizides' tools are proving to be rather complex to compile,
drive and understand. Does the XFS team have anything like this in the
kitbag?

-

2001-11-12 21:32:21

by Steve Lord

[permalink] [raw]
Subject: Re: File System Performance

On Mon, 2001-11-12 at 14:41, Andrew Morton wrote:
> Steve Lord wrote:
> >
> > ...
> > This all works OK when the filesystem is not bursting at the seams,
> > and when there is some correspondence between logical block numbers
> > in the filesystem and physical drives underneath. I believe that once
> > you get into complex filesystems using raid devices and smart caching
> > drives it gets very hard for the filesystem to predict anything
> > about latency of access for two blocks which it things are next
> > to each other in the filesystem.
> >
>
> Well, file systems have for all time made efforts to write things
> out contiguously. If an IO system were to come along and deoptimise
> that case it'd be pretty broken?

Yes, you are right; I was just pointing out that there is so much
underneath us now that things get a little out of our control.

>
> > ...
> >
> > There is a lot of difference between doing a raw read of data from
> > the device, and copying a large directory tree. The raw read is
> > purely sequential, the tree copy is basically random access since
> > the kernel is being asked to read and then write a lot of small
> > chunks of disk space.
>
> hum. Yes. Copying a tree from and to the same disk won't
> achieve disk bandwidth. If they're different disks then
> ext2+Orlov will actually get there.
>
> We in fact do extremely well on IDE with Orlov for this workload,
> even though we're not doing inter-inode readahead. This will be because
> the disk is doing the readhead for us. It's going to be highly dependent
> upon the disk cache size, readahead cache partitioning algorithm, etc.

I tried an experiment which puzzled me somewhat:

> mount /xfs
> cd /xfs/lord/xfs-linux
> time tar cf /dev/null linux

real 0m7.743s
user 0m0.510s
sys 0m1.380s

> hdparm -t /dev/sda5

/dev/sda5:
Timing buffered disk reads: 64 MB in 3.76 seconds = 17.02 MB/sec

> du -sk linux
173028 linux

The tar got ~21 Mbytes/sec.




>
> BTW, I've been trying to hunt down a suitable file system aging tool.
> We're not very happy with Keith Smith's workload because the directory
> infomation was lost (he was purely studying FFS algorithmic differences
> - the load isn't 100% suitable for testing other filesystems / algorithms). Constantin Loizides' tools are proving to be rather complex to compile,
> drive and understand. Does the XFS team have anything like this in the
> kitbag?

No, not really. There is one test we ship which basically does random
calls and creates a fairly deep directory tree, but I would not say it
bears much relationship to real-life usage. It would also have to have
large chunks commented out for non-XFS filesystems as it does extended
attributes etc. You can however vary the mix of calls it does as a
percentage of the whole.

In our cvs tree cmd/xfsprogs/tests/src/fsstress.c

Steve

>
> -
--

Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: [email protected]

2001-11-12 21:45:23

by Andrew Morton

[permalink] [raw]
Subject: Re: File System Performance

Steve Lord wrote:
>
> ...
>
> I tried an experiment which puzzled me somwhat:
>
> > mount /xfs
> > cd /xfs/lord/xfs-linux
> > time tar cf /dev/null linux
>
> real 0m7.743s
> user 0m0.510s
> sys 0m1.380s
>
> > hdparm -t /dev/sda5
>
> /dev/sda5:
> Timing buffered disk reads: 64 MB in 3.76 seconds = 17.02 MB/sec
>
> > du -sk linux
> 173028 linux
>
> The tar got ~21 Mbytes/sec.
>

It's tar. It cheats. It somehow detects that the
output is /dev/null, and so it doesn't read the input files.
I think.

akpm-1:/opt> l
total 562884
-rw-r--r-- 1 akpm akpm 575785370 Apr 8 2001 backup

akpm-1:/opt> time tar cf /dev/null backup
tar cf /dev/null backup 0.02s user 0.00s system 4% cpu 0.422 total

That's 570 megs in 0.4 seconds. Impressive.

So I just use
time (find dir -type f | xargs cat > /dev/null)

> In our cvs tree cmd/xfsprogs/tests/src/fsstress.c

OK, thanks.

-

2001-11-12 21:50:13

by Steve Lord

[permalink] [raw]
Subject: Re: File System Performance

On Mon, 2001-11-12 at 15:43, Andrew Morton wrote:

>
> It's tar. It cheats. It somehow detects that the
> output is /dev/null, and so it doesn't read the input files.
> I think.
>

Yep it cheats all right:

execve("/bin/tar", ["tar", "cf", "/dev/null", "linux"], [/* 30 vars */])
= 0
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
59.70 1.114097 740 1505 getdents64
37.22 0.694516 55 12650 lstat64
1.62 0.030264 39 775 3 open
0.56 0.010518 14 777 close
0.40 0.007395 10 770 fstat64
0.24 0.004553 6 749 fcntl64
0.11 0.001972 986 2 poll
0.02 0.000420 210 2 2 connect
0.02 0.000418 32 13 read
0.02 0.000345 20 17 old_mmap
0.02 0.000286 36 8 munmap
0.02 0.000281 18 16 mmap2
0.01 0.000189 47 4 socket
0.01 0.000161 16 10 brk
0.01 0.000101 51 2 sendto
0.00 0.000076 15 5 mprotect
0.00 0.000076 38 2 recvfrom
0.00 0.000075 25 3 uname
0.00 0.000059 30 2 readv
0.00 0.000057 29 2 bind
0.00 0.000056 28 2 setsockopt
0.00 0.000039 39 1 readlink
0.00 0.000025 13 2 getpid
0.00 0.000023 23 1 rt_sigaction
0.00 0.000021 21 1 time
0.00 0.000021 21 1 gettimeofday
0.00 0.000014 7 2 ioctl
------ ----------- ----------- --------- --------- ----------------
100.00 1.866058 17324 5 total

Not a lot of reads in there.... I should have known it could
not go that fast.

Steve

--

Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: [email protected]

2001-11-12 21:53:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: File System Performance

In article <[email protected]>,
Andrew Morton <[email protected]> wrote:
>
>It's tar. It cheats. It somehow detects that the
>output is /dev/null, and so it doesn't read the input files.

Probably the kernel.

If you do a mmap()+write(), the write() to /dev/null won't even read the
mmap contents, which in turn will cause the pages to never be brought
in.

Anything which uses mmap+write will show this.
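
A minimal user-space sketch of that path (illustrative only, with the error
handling trimmed down):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write a file to /dev/null via mmap().  As described above, /dev/null's
 * write() never touches the buffer, so the mapped pages are never faulted
 * in and the file data is never read from disk. */
int copy_to_null(const char *path)
{
        struct stat st;
        int in = open(path, O_RDONLY);
        int out = open("/dev/null", O_WRONLY);

        if (in < 0 || out < 0 || fstat(in, &st) < 0 || st.st_size == 0)
                return -1;

        void *buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, in, 0);
        if (buf == MAP_FAILED)
                return -1;

        write(out, buf, st.st_size);
        munmap(buf, st.st_size);
        close(in);
        close(out);
        return 0;
}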

Linus

2001-11-12 21:53:53

by Lionel Bouton

[permalink] [raw]
Subject: Re: File System Performance

>
>
>
>I tried an experiment which puzzled me somwhat:
>
>> mount /xfs
>> cd /xfs/lord/xfs-linux
>> time tar cf /dev/null linux
>>
>
>real 0m7.743s
>user 0m0.510s
>sys 0m1.380s
>
>>hdparm -t /dev/sda5
>>
>
>/dev/sda5:
> Timing buffered disk reads: 64 MB in 3.76 seconds = 17.02 MB/sec
>
>>du -sk linux
>>
>173028 linux
>
>The tar got ~21 Mbytes/sec.
>
Things I'll check:

0/ rerun this test !!!
1/ is there cache on the SCSI controller?
2/ is xfs data cached at mount? (I don't believe so)
3/ "hdparm -t" is on crack.
4/ du reports a disk usage way ahead of the files' total size (I don't know
xfs well enough to estimate this probability) and tar won't read the whole
"du -sk" data.

And how to check each of those:
2/ "time mount /xfs" could help (if mount + tar times are below what's
expected, this case can be eliminated).
3/ ask hdparm's maintainer.
4/ tar: check the tar archive size.

2001-11-12 22:12:13

by Lionel Bouton

[permalink] [raw]
Subject: Re: File System Performance

Linus Torvalds wrote:

>In article <[email protected]>,
>Andrew Morton <[email protected]> wrote:
>
>>It's tar. It cheats. It somehow detects that the
>>output is /dev/null, and so it doesn't read the input files.
>>
>
>Probably the kernel.
>
Seems not to be the case with gnu tar: write isn't even called once on the fd
returned by open("/dev/null",...). In fact a "grep write" on the strace
output is empty in the "tar cf /dev/null" case. Every file in the tar-ed
tree is stat-ed but none is read.

--
Lionel Bouton

-
"I wanted to be free, so I opensourced my whole DNA code" Gyver, 1999.



2001-11-12 22:17:13

by Andrew Morton

[permalink] [raw]
Subject: Re: File System Performance

Linus Torvalds wrote:
>
> In article <[email protected]>,
> Andrew Morton <[email protected]> wrote:
> >
> >It's tar. It cheats. It somehow detects that the
> >output is /dev/null, and so it doesn't read the input files.
>
> Probably the kernel.
>
> If you do a mmap()+write(), the write() to /dev/null won't even read the
> mmap contents, which in turn will cause the pages to never be brought
> in.
>
> Anything which uses mmap+write will show this.

Actually, tar _is_ doing funnies with /dev/null. Changelog says:

1995-12-21 François Pinard <[email protected]>

* buffer.c: Rename a few err variables to status.
* extract.c: Rename a few check variables to status.

Corrections to speed-up the sizeing pass in Amanda:
* tar.h: Declare dev_null_output.
* buffer.c (open_archive): Detect when archive is /dev/null.
(flush_write): Avoid writing to /dev/null.
* create.c (dump_file): Do not open file if archive is being
written to /dev/null, nor read file nor restore times.
Reported by Greg Maples and Tor Lillqvist.

One wonders why.

-

2001-11-12 22:19:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: File System Performance


On Mon, 12 Nov 2001, Lionel Bouton wrote:
>
> Seems not the case with gnu tar : write isn't even called once on the fd
> returned by open("/dev/null",...). In fact a "grep write" on the strace
> output is empty in the "tar cf /dev/null" case. Every file in the tar-ed
> tree is stat-ed but no-one is read-ed.

Wow. What a sleazy optimization - it can't be anything but a special case.

How do they do it anyway? By matching on the name? Or by knowing what the
minor/major numbers of /dev/null are supposed to be on that particular
operating system?

And what's the _point_ of the optimization? I've never heard of a "tar
benchmark"..

Linus

2001-11-12 22:26:53

by Gérard Roudier

[permalink] [raw]
Subject: Re: File System Performance



On Mon, 12 Nov 2001, Lionel Bouton wrote:

> Linus Torvalds wrote:
>
> >In article <[email protected]>,
> >Andrew Morton <[email protected]> wrote:
> >
> >>It's tar. It cheats. It somehow detects that the
> >>output is /dev/null, and so it doesn't read the input files.
> >>
> >
> >Probably the kernel.
> >
> Seems not the case with gnu tar : write isn't even called once on the fd
> returned by open("/dev/null",...). In fact a "grep write" on the strace
> output is empty in the "tar cf /dev/null" case. Every file in the tar-ed
> tree is stat-ed but no-one is read-ed.

This kind of guessing from gnu tar seems utterly stupid to me. Developers
who spend time on this kind of fake optimization either have time to
waste or a brain to be fixed. Just my opinion.

Gérard.

2001-11-12 22:31:33

by Steve Lord

[permalink] [raw]
Subject: Re: File System Performance

On Mon, 2001-11-12 at 16:16, Andrew Morton wrote:

>
> Actually, tar _is_ doing funnies with /dev/null. Changelog says:
>
> 1995-12-21 François Pinard <[email protected]>
>
> * buffer.c: Rename a few err variables to status.
> * extract.c: Rename a few check variables to status.
>
> Corrections to speed-up the sizeing pass in Amanda:
> * tar.h: Declare dev_null_output.
> * buffer.c (open_archive): Detect when archive is /dev/null.
> (flush_write): Avoid writing to /dev/null.
> * create.c (dump_file): Do not open file if archive is being
> written to /dev/null, nor read file nor restore times.
> Reported by Greg Maples and Tor Lillqvist.
>
> One wonders why.

For almost 6 years too - I suspect they optimized tar instead of
fixing the way amanda works.

Steve

2001-11-12 22:31:53

by Ragnar Kjørstad

[permalink] [raw]
Subject: Re: File System Performance

On Mon, Nov 12, 2001 at 02:14:59PM -0800, Linus Torvalds wrote:
> > Seems not the case with gnu tar : write isn't even called once on the fd
> > returned by open("/dev/null",...). In fact a "grep write" on the strace
> > output is empty in the "tar cf /dev/null" case. Every file in the tar-ed
> > tree is stat-ed but no-one is read-ed.
>
> And what's the _point_ of the optimization? I've never heard of a "tar
> benchmark"..

Sure - it's called "amanda" :-)


I believe amanda runs the backup once to /dev/null first to estimate the
size of the dataset, so it can make better use of the available tapes.


--
Ragnar Kjørstad
Big Storage

2001-11-12 22:33:04

by Alan

[permalink] [raw]
Subject: Re: File System Performance

> Corrections to speed-up the sizeing pass in Amanda:
> * tar.h: Declare dev_null_output.
> * buffer.c (open_archive): Detect when archive is /dev/null.
> (flush_write): Avoid writing to /dev/null.
> * create.c (dump_file): Do not open file if archive is being
> written to /dev/null, nor read file nor restore times.
> Reported by Greg Maples and Tor Lillqvist.
>
> One wonders why.

So when you archive the file set twice (once to compute its size), it's
faster. Seems sane enough.

2001-11-12 22:33:13

by Lionel Bouton

[permalink] [raw]
Subject: Re: File System Performance

Andrew Morton wrote:

>
>[...]
> Corrections to speed-up the sizeing pass in Amanda:
>
An ugly hack in tar justified by a lack of functionality in Amanda?!

Would be typical behavior in a closed-source environment and makes me
ROTFL :-/

2001-11-12 22:36:53

by Grant Erickson

[permalink] [raw]
Subject: Re: File System Performance

On Mon, 12 Nov 2001, Linus Torvalds wrote:
> On Mon, 12 Nov 2001, Lionel Bouton wrote:
> >
> > Seems not the case with gnu tar : write isn't even called once on the fd
> > returned by open("/dev/null",...). In fact a "grep write" on the strace
> > output is empty in the "tar cf /dev/null" case. Every file in the tar-ed
> > tree is stat-ed but no-one is read-ed.
>
> Wow. What a sleazy optimization - it can't be anything but a special case.
>
> How do they do it anyway? By matching on the name Or by knowing what the
> minor/major numbers of /dev/null are supposed to be on that particular
> operating system?

From src/buffer.c in tar-1.13.19 in open_archive():

[ ... ]

#if !MSDOS

/* Detect if outputting to "/dev/null". */
{
    static char const dev_null[] = "/dev/null";
    struct stat dev_null_stat;

    dev_null_output =
        (strcmp (archive_name_array[0], dev_null) == 0
         || (! _isrmt (archive)
             && S_ISCHR (archive_stat.st_mode)
             && stat (dev_null, &dev_null_stat) == 0
             && archive_stat.st_dev == dev_null_stat.st_dev
             && archive_stat.st_ino == dev_null_stat.st_ino));
}

[ ... ]

Regards,

Grant


--
Grant Erickson University of Minnesota Alumni
o mail:[email protected] 1996 BSEE
o http://www.umn.edu/~erick205 1998 MSEE

2001-11-12 22:37:03

by Andrew Morton

[permalink] [raw]
Subject: Re: File System Performance

Linus Torvalds wrote:
>
> On Mon, 12 Nov 2001, Lionel Bouton wrote:
> >
> > Seems not the case with gnu tar : write isn't even called once on the fd
> > returned by open("/dev/null",...). In fact a "grep write" on the strace
> > output is empty in the "tar cf /dev/null" case. Every file in the tar-ed
> > tree is stat-ed but no-one is read-ed.
>
> Wow. What a sleazy optimization - it can't be anything but a special case.
>
> How do they do it anyway? By matching on the name Or by knowing what the
> minor/major numbers of /dev/null are supposed to be on that particular
> operating system?

/* Detect if outputting to "/dev/null". */
{
    static char const dev_null[] = "/dev/null";
    struct stat dev_null_stat;

    dev_null_output =
        (strcmp (archive_name_array[0], dev_null) == 0
         || (! _isrmt (archive)
             && stat (dev_null, &dev_null_stat) == 0
             && S_ISCHR (archive_stat.st_mode)
             && archive_stat.st_rdev == dev_null_stat.st_rdev));
}

> And what's the _point_ of the optimization? I've never heard of a "tar
> benchmark"..

Ask Al. He understands those GNU folks.

It's actually a bug. I may _want_ an ISREG /dev/null...

-

2001-11-12 22:38:54

by Alan

[permalink] [raw]
Subject: Re: File System Performance

> An ugly hack in tar justified by a lack of functionnality in Amanda ?!
> Would be typical behavior in a closed-source environnement and make me
> ROTFL :-/

Amanda uses tar to size archives. It's actually a tweak to tar that probably
should have been "tar --size-only" instead.

2001-11-12 22:47:23

by Mike Castle

[permalink] [raw]
Subject: Re: File System Performance

On Mon, Nov 12, 2001 at 02:16:23PM -0800, Andrew Morton wrote:
> One wonders why.

No, one doesn't:

> Corrections to speed-up the sizeing pass in Amanda:

Granted, star's nullout option may have been smarter, and more in the realm
of ``least surprise,'' but one can understand why gtar does this.

Try tar cf /dev/zero instead.

mrc
--
Mike Castle [email protected] http://www.netcom.com/~dalgoda/
We are all of us living in the shadow of Manhattan. -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc

2001-11-12 22:46:53

by Xavier Bestel

[permalink] [raw]
Subject: Re: File System Performance

On Mon 12-11-2001 at 23:39, Alan Cox wrote:
> > Corrections to speed-up the sizeing pass in Amanda:
> > * tar.h: Declare dev_null_output.
> > * buffer.c (open_archive): Detect when archive is /dev/null.
> > (flush_write): Avoid writing to /dev/null.
> > * create.c (dump_file): Do not open file if archive is being
> > written to /dev/null, nor read file nor restore times.
> > Reported by Greg Maples and Tor Lillqvist.
> >
> > One wonders why.
>
> So when you archive the file set twice (once to compute its size) its
> faster. Seems sane enough.

What would have been sane is a tar --just-compute-the-size option, not
hardcoding /dev/null.



2001-11-12 23:04:48

by Mike Castle

[permalink] [raw]
Subject: Re: File System Performance

On Mon, Nov 12, 2001 at 02:36:08PM -0800, Andrew Morton wrote:
> /* Detect if outputting to "/dev/null". */
> {
> static char const dev_null[] = "/dev/null";
> struct stat dev_null_stat;
>
> dev_null_output =
> (strcmp (archive_name_array[0], dev_null) == 0
> || (! _isrmt (archive)
> && stat (dev_null, &dev_null_stat) == 0
> && S_ISCHR (archive_stat.st_mode)
> && archive_stat.st_rdev == dev_null_stat.st_rdev));
> }
>
> It's actually a bug. I may _want_ an ISREG /dev/null...

Doesn't the S_ISCHR() handle that case?

mrc
--
Mike Castle [email protected] http://www.netcom.com/~dalgoda/
We are all of us living in the shadow of Manhattan. -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc

2001-11-12 23:08:07

by Mike Fedyk

[permalink] [raw]
Subject: Re: File System Performance

On Mon, Nov 12, 2001 at 12:59:54PM -0700, Richard Gooch wrote:
> Here's an idea: add a "--compact" option to tar, so that it creates
> *all* inodes (files and directories alike) in the base directory, and
> then renames newly created entries to shuffle them into their correct
> positions. That should limit the number of block groups that are used,
> right?
>
> It would probably also be a good idea to do that for cp as well, so
> that when I do a "cp -al" of a virgin kernel tree, I can keep all the
> directory inodes together. It will make a cold diff even faster.
>

I don't think that would help at all... With the current file/dir allocator
it will choose a new block group for each directory no matter what the
parent is...

The new allocator would spread them across the different block groups only
if they are created *directly* off of the root of the FS. Your idea would
make things worse for the new allocator *if* the tree is uncompressed into
the root of the FS, and *no* difference if not...

Mike

2001-11-13 00:05:18

by Richard Gooch

[permalink] [raw]
Subject: Re: File System Performance

Mike Fedyk writes:
> On Mon, Nov 12, 2001 at 12:59:54PM -0700, Richard Gooch wrote:
> > Here's an idea: add a "--compact" option to tar, so that it creates
> > *all* inodes (files and directories alike) in the base directory, and
> > then renames newly created entries to shuffle them into their correct
> > positions. That should limit the number of block groups that are used,
> > right?
> >
> > It would probably also be a good idea to do that for cp as well, so
> > that when I do a "cp -al" of a virgin kernel tree, I can keep all the
> > directory inodes together. It will make a cold diff even faster.
>
> I don't think that would help at all... With the current file/dir
> allocator it will choose a new block group for each directory no
> matter what the parent is...

I thought the current implementation was that when creating a
directory, ext2fs searches forward from the block group the parent
directory is in, looking for a "relatively free" block group. So, a
number of successive calls to mkdir(2) with the same parent directory
will result in the child directories being in the same block group.

So, creating the directory tree by creating directories in the base
directory and then shuffling should result in the directories being
spread out over a modest number of block groups, rather than a large
number.

Addendum to my scheme: leaf nodes should be created in their
directories, not in the base directory. IOW, it's only directories
that should use this trick.

Am I wrong in my understanding of the current algorithm?

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-11-13 00:08:49

by Mike Fedyk

[permalink] [raw]
Subject: Re: File System Performance

On Mon, Nov 12, 2001 at 05:04:57PM -0700, Richard Gooch wrote:
> Mike Fedyk writes:
> > On Mon, Nov 12, 2001 at 12:59:54PM -0700, Richard Gooch wrote:
> > > Here's an idea: add a "--compact" option to tar, so that it creates
> > > *all* inodes (files and directories alike) in the base directory, and
> > > then renames newly created entries to shuffle them into their correct
> > > positions. That should limit the number of block groups that are used,
> > > right?
> > >
> > > It would probably also be a good idea to do that for cp as well, so
> > > that when I do a "cp -al" of a virgin kernel tree, I can keep all the
> > > directory inodes together. It will make a cold diff even faster.
> >
> > I don't think that would help at all... With the current file/dir
> > allocator it will choose a new block group for each directory no
> > matter what the parent is...
>
> I thought the current implementation was that when creating a
> directory, ext2fs searches forward from the block group the parent
> directory is in, looking for a "relatively free" block group. So, a
> number of successive calls to mkdir(2) with the same parent directory
> will result in the child directories being in the same block group.
>
> So, creating the directory tree by creating directories in the base
> directory and then shuffling should result in the directories be
> spread out over a modest number of block groups, rather than a large
> number.
>
> Addendum to my scheme: leaf nodes should be created in their
> directories, not in the base directory. IOW, it's only directories
> that should use this trick.
>
> Am I wrong in my understanding of the current algorithm?
>

You are almost describing the new algo to a "T"...

It deals very well with fast growth, but not so well with slow growth, as
mentioned in previous posts in this thread...

There is a lengthy thread in ext2-devel right now; if you read it, it'll
answer many of your questions.

Mike

2001-11-13 00:26:58

by Richard Gooch

[permalink] [raw]
Subject: Re: File System Performance

Mike Fedyk writes:
> On Mon, Nov 12, 2001 at 05:04:57PM -0700, Richard Gooch wrote:
> > Mike Fedyk writes:
> > > On Mon, Nov 12, 2001 at 12:59:54PM -0700, Richard Gooch wrote:
> > > > Here's an idea: add a "--compact" option to tar, so that it creates
> > > > *all* inodes (files and directories alike) in the base directory, and
> > > > then renames newly created entries to shuffle them into their correct
> > > > positions. That should limit the number of block groups that are used,
> > > > right?
> > > >
> > > > It would probably also be a good idea to do that for cp as well, so
> > > > that when I do a "cp -al" of a virgin kernel tree, I can keep all the
> > > > directory inodes together. It will make a cold diff even faster.
> > >
> > > I don't think that would help at all... With the current file/dir
> > > allocator it will choose a new block group for each directory no
> > > matter what the parent is...
> >
> > I thought the current implementation was that when creating a
> > directory, ext2fs searches forward from the block group the parent
> > directory is in, looking for a "relatively free" block group. So, a
> > number of successive calls to mkdir(2) with the same parent directory
> > will result in the child directories being in the same block group.
> >
> > So, creating the directory tree by creating directories in the base
> > directory and then shuffling should result in the directories be
> > spread out over a modest number of block groups, rather than a large
> > number.
> >
> > Addendum to my scheme: leaf nodes should be created in their
> > directories, not in the base directory. IOW, it's only directories
> > that should use this trick.
> >
> > Am I wrong in my understanding of the current algorithm?
>
> You are almost describing the new algo to a "T"...

I assume you mean my scheme for tar. Which is an adaptation for
user-space of a scheme that's been proposed for in-kernel ext2fs.

> It deals very well with fast growth, but not so well with slow
> growth, as mentioned in previous posts in this thread...

Yes, yes. I know that.

> There is a lengthy thread in ext2-devel right now, if you read it
> it'll answer many of your questions.

Is this different from the long thread that's been on linux-kernel?

Erm, I'm not really asking a bunch of questions. The only question I
asked was whether I mis-read the current code, and that in turn is a
response to your assertion that my scheme would not help, as part of
an explanation of why it should work. Which you haven't responded to.
If you claim my tar scheme wouldn't help, then you're also saying that
the new algorithm for ext2fs won't help. Is that what you meant to
say?

In any case, my point (I think you missed it, although I guess I
didn't make it explicit) was that, rather than tuning the in-kernel
algorithm for this fast-growth scenario, we may be better off adding
an option to tar so we can make the choice in user-space. From the
posts that I've seen, it's not clear that we have an obvious choice
for a scheme that works well for both slow and fast growth cases.

Having an option for tar would allow the user to make the choice.
Ultimately, the user knows best (or at least can, if they care enough
to sit down and think about it) what the access patterns will be.

However, I see that people are banging away at figuring out a generic
in-kernel mechanism that will work with both slow and fast growth
cases. We may see something good come out of that.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-11-13 00:19:58

by Andreas Dilger

[permalink] [raw]
Subject: Re: File System Performance

On Nov 12, 2001 12:41 -0800, Andrew Morton wrote:
> BTW, I've been trying to hunt down a suitable file system aging tool.
> We're not very happy with Keith Smith's workload because the directory
> infomation was lost (he was purely studying FFS algorithmic differences
> - the load isn't 100% suitable for testing other filesystems / algorithms).
> Constantin Loizides' tools are proving to be rather complex to compile,
> drive and understand.

What _may_ be a very interesting tool for doing "real-world" I/O generation
is to use the InterMezzo KML (kernel modification log), which is basically
a 100% record of every filesystem operation done (e.g. create, write,
delete, mkdir, rmdir, etc).

Peter, do you have any very large KML files which would simulate the usage
of a filesystem over a long period of time, or would Tacitus have something
like that?

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-13 00:42:59

by Peter J. Braam

[permalink] [raw]
Subject: Re: File System Performance

The KML in fact doesn't record the writes. I don't have a large KML,
but it is easy to set one up. Let me know if you need a hand.

- Peter -

On Mon, Nov 12, 2001 at 05:17:05PM -0700, Andreas Dilger wrote:
> On Nov 12, 2001 12:41 -0800, Andrew Morton wrote:
> > BTW, I've been trying to hunt down a suitable file system aging tool.
> > We're not very happy with Keith Smith's workload because the directory
> > infomation was lost (he was purely studying FFS algorithmic differences
> > - the load isn't 100% suitable for testing other filesystems / algorithms).
> > Constantin Loizides' tools are proving to be rather complex to compile,
> > drive and understand.
>
> What _may_ be a very interesting tool for doing "real-world" I/O generation
> is to use the InterMezzo KML (kernel modification log), which is basically
> a 100% record of every filesystem operation done (e.g. create, write,
> delete, mkdir, rmdir, etc).
>
> Peter, do you have any very large KML files which would simulate the usage
> of a filesystem over a long period of time, or would Tacitus have something
> like that?
>
> Cheers, Andreas
> --
> Andreas Dilger
> http://sourceforge.net/projects/ext2resize/
> http://www-mddsp.enel.ucalgary.ca/People/adilger/
>

--

2001-11-13 00:47:59

by Mike Castle

[permalink] [raw]
Subject: Re: File System Performance

On Mon, Nov 12, 2001 at 05:26:31PM -0700, Richard Gooch wrote:
> Having an option for tar would allow the user to make the choice.
> Ultimately, the user knows best (or at least can, if they care enough
> to sit down and think about it) what the access patterns will be.

It also has the benefit of possibly being useful on other systems besides
Linux.

How many other systems suffers from the same/similar issue?

mrc
--
Mike Castle [email protected] http://www.netcom.com/~dalgoda/
We are all of us living in the shadow of Manhattan. -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc

2001-11-13 01:29:02

by Mike Fedyk

[permalink] [raw]
Subject: Re: File System Performance

On Mon, Nov 12, 2001 at 05:26:31PM -0700, Richard Gooch wrote:
> Mike Fedyk writes:
> > On Mon, Nov 12, 2001 at 05:04:57PM -0700, Richard Gooch wrote:
> > > I thought the current implementation was that when creating a
> > > directory, ext2fs searches forward from the block group the parent
> > > directory is in, looking for a "relatively free" block group. So, a
> > > number of successive calls to mkdir(2) with the same parent directory
> > > will result in the child directories being in the same block group.
> > >
> > > So, creating the directory tree by creating directories in the base
> > > directory and then shuffling should result in the directories be
> > > spread out over a modest number of block groups, rather than a large
> > > number.
> > >
> > > Addendum to my scheme: leaf nodes should be created in their
> > > directories, not in the base directory. IOW, it's only directories
> > > that should use this trick.
> > >
> > > Am I wrong in my understanding of the current algorithm?
> >
> > You are almost describing the new algo to a "T"...
>

This is talking about the previous post only... The two messages above are
comment and reply, and are meant for each other. The part about tar moved
below...

> I assume you mean my scheme for tar. Which is an adaptation for
> user-space of a scheme that's been proposed for in-kernel ext2fs.
>

Nope, not one bit...

> > It deals very well with fast growth, but not so well with slow
> > growth, as mentioned in previous posts in this thread...
>
> Yes, yes. I know that.
>

Ok... We're getting somewhere.

> > There is a lengthy thread in ext2-devel right now, if you read it
> > it'll answer many of your questions.
>
> Is this different from the long thread that's been on linux-kernel?
>

Yes. The thread on ext2-devel will give you *much* more detail on the patch
you are trying to understand.

> Erm, I'm not really asking a bunch of questions. The only question I
> asked was whether I mis-read the current code, and that in turn is a
> response to your assertion that my scheme would not help, as part of
> an explanation of why it should work. Which you haven't responded to.

Hmm, I don't see the part I didn't reply to...

> If you claim my tar scheme wouldn't help, then you're also saying that
> the new algorithm for ext2fs won't help. Is that what you meant to
> say?
>

No, I'm not saying that...

> In any case, my point (I think you missed it, although I guess I
> didn't make it explicit) was that, rather than tuning the in-kernel
> algorithm for this fast-growth scenario, we may be better off adding
> an option to tar so we can make the choice in user-space. From the
> posts that I've seen, it's not clear that we have an obvious choice
> for a scheme that works well for both slow and fast growth cases.
>

Maybe not obvious, but with a little work, both can probably be made *better*.

> Having an option for tar would allow the user to make the choice.
> Ultimately, the user knows best (or at least can, if they care enough
> to sit down and think about it) what the access patterns will be.
>

Fixing tar won't help anyone except for tar users. What about the other
programs that create activity much like tar? This isn't a user space issue.

> However, I see that people are banging away at figuring out a generic
> in-kernel mechanism that will work with both slow and fast growth
> cases. We may see something good come out of that.
>

Yep

===================

Let me dissect our previous conversation...

> > > > On Mon, Nov 12, 2001 at 12:59:54PM -0700, Richard Gooch wrote:
> > > > > Here's an idea: add a "--compact" option to tar, so that it creates
> > > > > *all* inodes (files and directories alike) in the base directory, and
> > > > > then renames newly created entries to shuffle them into their correct
> > > > > positions. That should limit the number of block groups that are used,
> > > > > right?
> > > > >

Currently, without any patching, any new directory will be put in a
different block group from its parent.

So, if you create the dirs in the same dir and then shuffle them around, you
gain nothing.

> > > > > It would probably also be a good idea to do that for cp as well, so
> > > > > that when I do a "cp -al" of a virgin kernel tree, I can keep all the
> > > > > directory inodes together. It will make a cold diff even faster.
> > > >

This doesn't fix all fast growth type apps, only tar and cp...

> > > Mike Fedyk writes:
> > > > I don't think that would help at all... With the current file/dir
> > > > allocator it will choose a new block group for each directory no
> > > > matter what the parent is...
> > >

Now does this make sense?

> > On Mon, Nov 12, 2001 at 05:04:57PM -0700, Richard Gooch wrote:
> > > I thought the current implementation was that when creating a
> > > directory, ext2fs searches forward from the block group the parent
> > > directory is in, looking for a "relatively free" block group. So, a
> > > number of successive calls to mkdir(2) with the same parent directory
> > > will result in the child directories being in the same block group.
> > >

Not currently, but the patch that is out will do this.

> > > So, creating the directory tree by creating directories in the base
> > > directory and then shuffling should result in the directories be
> > > spread out over a modest number of block groups, rather than a large
> > > number.
> > >
> > > Addendum to my scheme: leaf nodes should be created in their
> > > directories, not in the base directory. IOW, it's only directories
> > > that should use this trick.
> > >

If the kernel is patched...

> > > Am I wrong in my understanding of the current algorithm?
> >

Yes.

> Mike Fedyk writes:
> > You are almost describing the new algo to a "T"...
>

The above is a little more verbose, does it help?

Now, if I am not stating the facts as they currently are, or if the state of
the available patches is not what I am describing, someone please let me know.
To my understanding, though, there is no error.

Mike

2001-11-13 06:36:05

by Richard Gooch

[permalink] [raw]
Subject: Re: File System Performance

[In the interests of brevity, I'm going to trim this to respond to one
point only, which seems to be the cause of the confusion]

Mike Fedyk writes:
> On Mon, Nov 12, 2001 at 05:26:31PM -0700, Richard Gooch wrote:
> > Mike Fedyk writes:
> > > On Mon, Nov 12, 2001 at 05:04:57PM -0700, Richard Gooch wrote:
> > > > > > Here's an idea: add a "--compact" option to tar, so that it creates
> > > > > > *all* inodes (files and directories alike) in the base directory, and
> > > > > > then renames newly created entries to shuffle them into their correct
> > > > > > positions. That should limit the number of block groups that are used,
> > > > > > right?
> > > > > >
>
> Currently, without any patching, any new directory will be put in a
> different block group from its parent.

Your statement seems inconsistent with the comment above
fs/ext2/ialloc.c:ext2_new_inode():
/*
* There are two policies for allocating an inode. If the new inode is
* a directory, then a forward search is made for a block group with both
* free space and a low directory-to-inode ratio; if that fails, then of
* the groups with above-average free space, that group with the fewest
* directories already is chosen.
*
* For other inodes, search forward from the parent directory\'s block
* group to find a free inode.
*/

So N successive calls to mkdir(2), with the same parent, should result
in those directories being stored in the same block group as each
other. And, furthermore, if the parent directory block group is mostly
empty, then the child directories are placed adjacent to the parent's
block group.

> So, if you create the dirs in the same dir and then shuffle them
> around, you gain nothing.

If my reading of the comment above is correct, then the shuffling will
provide a gain.

> > > > > > It would probably also be a good idea to do that for cp as well, so
> > > > > > that when I do a "cp -al" of a virgin kernel tree, I can keep all the
> > > > > > directory inodes together. It will make a cold diff even faster.
> > > > >
>
> This doesn't fix all fast growth type apps, only tar and cp...

Sure. But it would provide me with a way of getting a better layout
for my kernel trees, and it might prove useful for other applications
and kernels. And it will provide a good fallback if a good in-kernel
heuristic isn't forthcoming.

> > > > Mike Fedyk writes:
> > > > > I don't think that would help at all... With the current file/dir
> > > > > allocator it will choose a new block group for each directory no
> > > > > matter what the parent is...
> > > >
>
> Now does this make sense?

It would, if you were correct about the current implementation. Which
I think isn't the case.

> > > On Mon, Nov 12, 2001 at 05:04:57PM -0700, Richard Gooch wrote:
> > > > I thought the current implementation was that when creating a
> > > > directory, ext2fs searches forward from the block group the parent
> > > > directory is in, looking for a "relatively free" block group. So, a
> > > > number of successive calls to mkdir(2) with the same parent directory
> > > > will result in the child directories being in the same block group.
> > > >
>
> Not currently, but the patch that is out will do this.

I think it currently *does*. Check the comment. Straight from the
2.4.14 sources.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

Subject: Re: File System Performance

Linus Torvalds <[email protected]> writes:

>And what's the _point_ of the optimization? I've never heard of a "tar
>benchmark"..

http://www.amanda.org

Regards
Henning

--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH [email protected]

Am Schwabachgrund 22 Fon.: 09131 / 50654-0 [email protected]
D-91054 Buckenhof Fax.: 09131 / 50654-20

2001-11-13 09:58:39

by Peter Wächtler

[permalink] [raw]
Subject: Re: File System Performance

Mike Castle wrote:
>
> On Mon, Nov 12, 2001 at 02:36:08PM -0800, Andrew Morton wrote:
> > /* Detect if outputting to "/dev/null". */
> > {
> > static char const dev_null[] = "/dev/null";
> > struct stat dev_null_stat;
> >
> > dev_null_output =
> > (strcmp (archive_name_array[0], dev_null) == 0
> > || (! _isrmt (archive)
> > && stat (dev_null, &dev_null_stat) == 0
> > && S_ISCHR (archive_stat.st_mode)
> > && archive_stat.st_rdev == dev_null_stat.st_rdev));
> > }
> >
> > It's actually a bug. I may _want_ an ISREG /dev/null...
>
> Doesn't the S_ISCHR() handle that case?
>

No. The test has the form "if (A || B)": when A (the strcmp() against
"/dev/null") is true, B is never evaluated, so the S_ISCHR() check gets
no chance to reject a regular file that merely has that name.
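
A stripped-down illustration (just the shape of the test, not the actual
tar source) of why the device check never runs once the name matches:

#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Returns non-zero if "name" should be treated as /dev/null.  Because
 * || short-circuits, a regular file literally named "/dev/null" is
 * accepted by the strcmp() alone; the S_ISCHR()/st_rdev comparison on
 * the right-hand side is never evaluated for it.
 */
static int looks_like_dev_null(const char *name)
{
        struct stat st, null_st;

        return strcmp(name, "/dev/null") == 0
               || (stat(name, &st) == 0
                   && stat("/dev/null", &null_st) == 0
                   && S_ISCHR(st.st_mode)
                   && st.st_rdev == null_st.st_rdev);
}

int main(void)
{
        printf("%d\n", looks_like_dev_null("/dev/null"));
        return 0;
}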

2001-11-13 15:48:33

by Ben Israel

[permalink] [raw]
Subject: Re: File System Performance

Andrew Morton wrote:
>
> It works, and it'll get you close to disk bandwidth with this test.
> But the effects of this change on other workloads (the so-called
> "slow growth" scenario) still needs to be understood and tested.
>

Does compiling the kernel constitute a reasonable example of a slow
growth workload? If not, what is a slow growth workload where file
system performance is important?






2001-11-13 16:36:03

by Andreas Dilger

[permalink] [raw]
Subject: Re: File System Performance

On Nov 12, 2001 17:04 -0700, Richard Gooch wrote:
> Mike Fedyk writes:
> > I don't think that would help at all... With the current file/dir
> > allocator it will choose a new block group for each directory no
> > matter what the parent is...
>
> I thought the current implementation was that when creating a
> directory, ext2fs searches forward from the block group the parent
> directory is in, looking for a "relatively free" block group. So, a
> number of successive calls to mkdir(2) with the same parent directory
> will result in the child directories being in the same block group.
>
> So, creating the directory tree by creating directories in the base
> directory and then shuffling should result in the directories being
> spread out over a modest number of block groups, rather than a large
> number.

I think that it doesn't matter what the starting directory is for the
creation of directories - we always scan all groups and will always
pick the group with the most free blocks, as long as it has at least
the average number of free inodes (i.e. isn't stuffed with inodes
like /dev ;-).

> Addendum to my scheme: leaf nodes should be created in their
> directories, not in the base directory. IOW, it's only directories
> that should use this trick.

That is the current algorithm anyways - create leaf nodes in the same
group as the parent directories.



Now, if there _were_ a way to influence the creation of directories in
the same group, then this would do what Andrew wants - put all of the
directories, inodes, and blocks into the same group, for fast access.

Maybe if there were some hysteresis in group selection it would be
enough - pick the best group, and then continue to create subdirs
(and files, implicitly) there until we pass some threshold(s). Some
thresholds are (a rough sketch of (1) and (2) follows the list):

1) Block count too high. We don't want to be totally wimpy here (e.g.
   bailing out as soon as we hit the "average"), but we (probably, maybe)
   also don't want to untar a kernel into a single group, fill it up, and
   then have to put all of the object files in another group. Maybe fill
   until we hit (1 + average)/2, or (3 + average)/4?
2) Inode count/directory count too high. See #1.
3) Time (yucky, but reasonable if it is long enough, i.e. ten(s) of
   seconds), so we reset the metrics, possibly in conjunction with...
4) process ID (yucky also, but possible, IF we want "tar" to put all
things in a single group), but not if the process is long-lived
(e.g. nfs, so we need (3) also)
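
Here is the rough sketch promised above - purely illustrative C with
made-up names and thresholds, not code from any existing patch - of what
"sticky" group selection using (1) and (2) might look like:

/* Stay in the last chosen group until its free blocks or free inodes
 * drop below a threshold derived from the filesystem-wide average,
 * then fall back to an exhaustive search (simplified here to "most
 * free blocks").
 */
struct group_hint {
        int group;              /* last group we created a directory in */
        int valid;
};

static int pick_dir_group(struct group_hint *hint,
                          const unsigned int *free_blocks,
                          const unsigned int *free_inodes,
                          int ngroups)
{
        unsigned long avg_blocks = 0, avg_inodes = 0;
        int i, best = 0;

        for (i = 0; i < ngroups; i++) {
                avg_blocks += free_blocks[i];
                avg_inodes += free_inodes[i];
        }
        avg_blocks /= ngroups;
        avg_inodes /= ngroups;

        /* Hysteresis: keep using the old group while it still has at
         * least (1 + average)/2 free blocks and free inodes. */
        if (hint->valid &&
            free_blocks[hint->group] >= (1 + avg_blocks) / 2 &&
            free_inodes[hint->group] >= (1 + avg_inodes) / 2)
                return hint->group;

        /* Otherwise pick a fresh "best" group and remember it. */
        for (i = 1; i < ngroups; i++)
                if (free_blocks[i] > free_blocks[best])
                        best = i;

        hint->group = best;
        hint->valid = 1;
        return best;
}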

Using (3) and (4) is ugly, because it is somewhat non-deterministic,
but (4) at least is repeatable in that a single process will continue to
write into a single group, while a multiple-writer scenario will spread
the files around (though a stray write to a log file might screw it up).
Also, if we keep "state" when selecting a group to create files in, we
want to periodically pick a new group which is "best" for creating new
directories/files.

From what I can understand of the code, the Orlov allocator does something
like #1 and/or #2. Andrew showed that this is an improvement for writing
a tarball, but with the "aging" tool he used, it didn't work as well
as the current algorithm. Unfortunately, the weights/parameters of the
Orlov allocator are hard to deduce from the code, so I don't know how it
would be tuned. Also, it is not certain that the aging tool is valid for
our needs.

So, the difference between the current allocator and the Orlov allocator
is that the current allocator always tries to find the "best" group for
each new directory, while the Orlov allocator will pick the parent's group
unless it is "bad" (i.e. too few free inodes/blocks, too many directories).
It may be that "too few" and "too many" are not tuned correctly yet. Given
that tarballs and other archives are normally created with depth-first
traversal, we may not need to use the number of directories as a selection
criterion, and clustering many small directories is not in itself a bad
thing. I'm still not sure of the "debt" thing going on there...
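
To make the contrast concrete, here is a very rough sketch in C - invented
names and thresholds, not the real Orlov code or its tuning - of the two
decisions:

/* Current 2.4-style policy: ignore the parent, always search for the
 * globally "best" group (simplified here to "most free blocks").
 */
static int current_style_group(const unsigned int *free_blocks, int ngroups)
{
        int i, best = 0;

        for (i = 1; i < ngroups; i++)
                if (free_blocks[i] > free_blocks[best])
                        best = i;
        return best;
}

/* Orlov-style policy: keep the parent's group unless it looks "bad",
 * i.e. too few free inodes/blocks or too many directories already.
 * The thresholds are exactly the part that still needs tuning.
 */
static int orlov_style_group(int parent, const unsigned int *free_blocks,
                             const unsigned int *free_inodes,
                             const unsigned int *dir_count,
                             unsigned int min_blocks, unsigned int min_inodes,
                             unsigned int max_dirs, int ngroups)
{
        if (free_inodes[parent] >= min_inodes &&
            free_blocks[parent] >= min_blocks &&
            dir_count[parent] <= max_dirs)
                return parent;

        return current_style_group(free_blocks, ngroups);
}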

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-13 20:48:15

by Andreas Dilger

[permalink] [raw]
Subject: Re: File System Performance

On Nov 12, 2001 17:40 -0700, Peter J. Braam wrote:
> The KML in fact doesn't record the writes. I don't have a large KML,
> but it is easy to set one up. Let me know if you need a hand.

We don't actually need to have the data contents, just the file sizes,
which I think the CLOSE records have, don't they? The one thing I'm
unsure of is whether you zero the KML as it is "used", or does it keep
the data from past transactions? At one time we were thinking about
using "punch" to reduce the actual file size, but I doubt that is in
place yet.

This is purely to measure the effects of repeated file creation, deletion,
updates in a real setting over a very long period (e.g. many months/years),
which is why setting something like this up today won't get us anywhere
(any large amount of activity would just be synthetic).

Do you think Ron Minnich or the folks at Tacitus would have a KML which
has been generated on a large server over a long period of time and not
erased?

> On Mon, Nov 12, 2001 at 05:17:05PM -0700, Andreas Dilger wrote:
> > On Nov 12, 2001 12:41 -0800, Andrew Morton wrote:
> > > BTW, I've been trying to hunt down a suitable file system aging tool.
> > > We're not very happy with Keith Smith's workload because the directory
> > > information was lost (he was purely studying FFS algorithmic differences
> > > - the load isn't 100% suitable for testing other filesystems / algorithms).
> > > Constantin Loizides' tools are proving to be rather complex to compile,
> > > drive and understand.
> >
> > What _may_ be a very interesting tool for doing "real-world" I/O generation
> > is to use the InterMezzo KML (kernel modification log), which is basically
> > a 100% record of every filesystem operation done (e.g. create, write,
> > delete, mkdir, rmdir, etc).
> >
> > Peter, do you have any very large KML files which would simulate the usage
> > of a filesystem over a long period of time, or would Tacitus have something
> > like that?

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-13 20:58:55

by Andreas Dilger

[permalink] [raw]
Subject: Re: File System Performance

On Nov 12, 2001 23:34 -0700, Richard Gooch wrote:
> > Currently, without any patching, any new directory will be put in a
> > different block group from its parent.
>
> Your statement seems inconsistent with the comment above
> fs/ext2/ialloc.c:ext2_new_inode():
> /*
> * There are two policies for allocating an inode. If the new inode is
> * a directory, then a forward search is made for a block group with both
> * free space and a low directory-to-inode ratio; if that fails, then of
> * the groups with above-average free space, that group with the fewest
> * directories already is chosen.
> *
> * For other inodes, search forward from the parent directory's block
> * group to find a free inode.
> */

This is only a comment, and not actual code. The code looks like:

if (S_ISDIR(mode)) {
        int avefreei = le32_to_cpu(es->s_free_inodes_count) /
                sb->u.ext2_sb.s_groups_count;

        /* Place directory in a group with free inodes and blocks. */
        for (j = 0; j < sb->u.ext2_sb.s_groups_count; j++) {
                tmp = ext2_get_group_desc (sb, j, &bh2);
                if (tmp && le16_to_cpu(tmp->bg_free_inodes_count) &&
                    le16_to_cpu(tmp->bg_free_inodes_count) >= avefreei) {
                        if (!gdp ||
                            (le16_to_cpu(tmp->bg_free_blocks_count) >
                             le16_to_cpu(gdp->bg_free_blocks_count))) {
                                group = j;
                                gdp = tmp;
                        }
                }
        }
} else {

So, as you can see, it searches all of the block groups for one which:
a) has any free inodes (redundant with (b), actually)
b) has at least the average number of free inodes
c) has the most free blocks

> So N successive calls to mkdir(2), with the same parent, should result
> in those directories being stored in the same block group as each
> other. And, furthermore, if the parent directory block group is mostly
> empty, then the child directories are placed adjacent to the parent's
> block group.

So, as we see above, the starting directory is irrelevant. We pick the
best group with an exhaustive search of all block groups. Note that
you are correct in that IF the parent's block group is mostly empty, then
subdirectories would also be placed there, but the nature of the algorithm
is that EVERYTHING would go there until it is not "better" than any other
group, at which point we have approximately round-robin group allocation.

Maybe a better heuristic is to add a "bonus" to the parent directory's
group, so that, other things being equal, we stay in the same group. This
gives us some of the advantage of clustering, but still allows a globally
optimal choice to be made when we do have to look for a new group.
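
Something like the following, purely as illustration - PARENT_BONUS and
the "most free blocks" score are invented here, not taken from any patch:

#define PARENT_BONUS    64      /* free blocks credited to the parent's group */

/* Pick the group with the most free blocks, but give the parent's group
 * a head start so that ties (and near-ties) keep us in the same group.
 */
static int pick_group_with_bonus(int parent,
                                 const unsigned int *free_blocks,
                                 int ngroups)
{
        unsigned long best_score = free_blocks[parent] + (unsigned long) PARENT_BONUS;
        int i, best = parent;

        for (i = 0; i < ngroups; i++) {
                if (free_blocks[i] > best_score) {
                        best = i;
                        best_score = free_blocks[i];
                }
        }
        return best;
}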

> > Now does this make sense?
>
> It would, if you were correct about the current implementation. Which
> I think isn't the case.
>
> > Not currently, but the patch that is out will do this.
>
> I think it currently *does*. Check the comment. Straight from the
> 2.4.14 sources.

You shouldn't base arguments on a comment, because comments tend not to
be updated along with changes to the code.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-16 22:10:53

by Peter J. Braam

[permalink] [raw]
Subject: Re: File System Performance

I don't think anybody has a large KML to share. Unfortunately, I
suspect in many environments people don't want to give such
information.

We should start logging somewhere.

- peter -

On Tue, Nov 13, 2001 at 01:46:53PM -0700, Andreas Dilger wrote:
> On Nov 12, 2001 17:40 -0700, Peter J. Braam wrote:
> > The KML in fact doesn't record the writes. I don't have a large KML,
> > but it is easy to set one up. Let me know if you need a hand.
>
> We don't actually need to have the data contents, just the file sizes,
> which I think the CLOSE records have, don't they? The one thing I'm
> unsure of is whether you zero the KML as it is "used", or does it keep
> the data from past transactions? At one time we were thinking about
> using "punch" to reduce the actual file size, but I doubt that is in
> place yet.
>
> This is purely to measure the effects of repeated file creation, deletion,
> updates in a real setting over a very long period (e.g. many months/years),
> which is why setting something like this up today won't get us anywhere
> (any large amount of activity would just be synthetic).
>
> Do you think Ron Minnich or the folks at Tacitus would have a KML which
> has been generated on a large server over a long period of time and not
> erased?
>
> > On Mon, Nov 12, 2001 at 05:17:05PM -0700, Andreas Dilger wrote:
> > > On Nov 12, 2001 12:41 -0800, Andrew Morton wrote:
> > > > BTW, I've been trying to hunt down a suitable file system aging tool.
> > > > We're not very happy with Keith Smith's workload because the directory
> > > > information was lost (he was purely studying FFS algorithmic differences
> > > > - the load isn't 100% suitable for testing other filesystems / algorithms).
> > > > Constantin Loizides' tools are proving to be rather complex to compile,
> > > > drive and understand.
> > >
> > > What _may_ be a very interesting tool for doing "real-world" I/O generation
> > > is to use the InterMezzo KML (kernel modification log), which is basically
> > > a 100% record of every filesystem operation done (e.g. create, write,
> > > delete, mkdir, rmdir, etc).
> > >
> > > Peter, do you have any very large KML files which would simulate the usage
> > > of a filesystem over a long period of time, or would Tacitus have something
> > > like that?
>
> Cheers, Andreas
> --
> Andreas Dilger
> http://sourceforge.net/projects/ext2resize/
> http://www-mddsp.enel.ucalgary.ca/People/adilger/
>

--

2001-11-16 23:14:45

by Mike Fedyk

[permalink] [raw]
Subject: Re: File System Performance

On Fri, Nov 16, 2001 at 03:07:54PM -0700, Peter J. Braam wrote:
> I don't think anybody has a large KML to share. Unfortunately, I
> suspect in many environments people don't want to give such
> information.

I'm sure the data can be sanitized. File sizes and information that doesn't
include file names shouldn't bother people...