LinuxLists.cc - test13-pre6

[permalink] [raw]

Subject: Re: test13-pre6

On Fri, 29 Dec 2000, Linus Torvalds wrote:

> Al: this changes "mapping->host" to be truly defined as a pointer to the
> inode that owns the mapping. That's how every user actually _used_ the
> host pointer, so this cleans those up to not need any casts. The main
> reason, however, is that we needed to have some FS-level anchor for dirty
> pages in order to get the correct sync() semantics. If you think it's
> worth it to have a notion of an anonymous host we need to add something
> else.

Two examples: devices and bitmaps-in-pagecache trick. But both belong to
2.5, so...

BTW, nice timing ;-) -pre6 appeared 5 minutes after I've started testing
sane-s_lock patch (SMP-safe lock_super() and friends, refcount on superblocks,
death of mount_sem, beginning of SMP-safe super.c). Oh, well...

Oblock_super(): what the hell is wait_on_super() doing in fsync_file()?
It gives absolutely no warranties - ->write_super() can easily block, so
it looks very odd.

BTW, while we are dropping the junk from vm_operations_struct, could we lose
->protect() and ->wppage()? Both are never used and never defined. I can
send a diff, indeed, but
ed include/linux/mm.h <<EOF
g/\*protect/d
g/wppage/d
w
q
EOF
should take care of them.
Cheers,
Al
It's nice to be back...

2000-12-30 01:51:02

[permalink] [raw]

Subject: Re: test13-pre6

On Fri, 29 Dec 2000, Alexander Viro wrote:
>
> Two examples: devices and bitmaps-in-pagecache trick. But both belong to
> 2.5, so...

Also, they can easily be done with a private inode, if required. So even
in 2.5.x this may not be a major problem.

> BTW, nice timing ;-) -pre6 appeared 5 minutes after I've started testing
> sane-s_lock patch (SMP-safe lock_super() and friends, refcount on superblocks,
> death of mount_sem, beginning of SMP-safe super.c). Oh, well...
>
> Oblock_super(): what the hell is wait_on_super() doing in fsync_file()?
> It gives absolutely no warranties - ->write_super() can easily block, so
> it looks very odd.

A lot of the superblock locking has been odd. It should probably be a
lock_super() + unlock_super(). At least that's what sync_supers() does.

> BTW, while we are dropping the junk from vm_operations_struct, could we lose
> ->protect() and ->wppage()?

Sure. I think sync() and unmap() fall under that heading too - it used to
do a msync(), but that was before we handled dirty pages directly, so...

Linus

2000-12-30 02:59:06

[permalink] [raw]

Subject: Re: test13-pre6

Linus Torvalds wrote:
>
> Ok, there's a test13-pre6 out there now, which does a partial sync with
> Alan, in addition to hopefully fixing the innd shared mapping writeback
> problem for good. Thanks to Marcelo Tosatti and others..

After the page_cache_release at line 574 of vmscan.c the page is
unlocked and only owned by the page cache - anything could happen. How
do you know the set_page_dirty at line 581 is still hitting a valid
page?

--
Daniel

2000-12-30 03:39:38

by Byron Stanoszek

[permalink] [raw]

Subject: Re: test13-pre6 (Fork Bug with Athlons? Temporary Fix)

On Fri, 29 Dec 2000, Linus Torvalds wrote:

>
> Ok, there's a test13-pre6 out there now, which does a partial sync with
> Alan, in addition to hopefully fixing the innd shared mapping writeback
> problem for good. Thanks to Marcelo Tosatti and others..

I've been noticing a problem with the memory context switching conflicting with
fork() on my Athlon. The problem began in the test13-pre2 patch, and because
nobody else has seen this problem (or otherwise reported it) since then, I
felt I should look into it a little further.

I narrowed the problem down to a subset of patches from the MM set in
test13-pre2. Reversing the attached 'context.patch' fixes the problem (only for
i386), but I'm not yet sure why. test13-pre2 and up work without any problems
on an Intel cpu (Pentium 180 & P3 800 tested).

Anyways, I can't seem to find out what really changes with the patch except for
the obvious 'void *segment' changing into a typedef-struct. The only thing I
can think of is that the compiler decodes it differently, but I think I can
safely rule that out. I tried both 2.91.66 and 2.95.2, using both different
types of parameters for P5 & K7 (-march=i586 & -march=i686 -malign-functions=4)
and it still gives the problem on the Athlon. Maybe there's something I've
overlooked in that attached patch. Request for an extra pair of eyes please. :)

Here are the casual symptoms. The parent seems to die as soon as a forked child
exits, which seems to me that a new LDT isn't being initialized correctly:

root:~> ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 1.1 0.4 1228 532 ? S 21:42 0:05 init [3]
root 2 0.0 0.0 0 0 ? SW 21:42 0:00 [keventd]
root 3 0.0 0.0 0 0 ? SW 21:42 0:00 [kswapd]
root 4 0.0 0.0 0 0 ? SW 21:42 0:00 [kreclaimd]
root 5 0.0 0.0 0 0 ? SW 21:42 0:00 [bdflush]
root 6 0.0 0.0 0 0 ? SW 21:42 0:00 [kupdate]
root 289 0.0 0.4 1284 604 ? S 21:42 0:00 syslogd -m 0
root 299 0.0 0.8 1912 1104 ? S 21:42 0:00 klogd
root 351 0.0 1.2 9292 1576 ? S 21:42 0:00 named
root 361 0.0 0.0 0 0 ? Z 21:42 0:00 [named <defunct>]
root 363 0.0 1.2 9292 1576 ? S 21:42 0:00 named
root 364 0.0 1.2 9292 1576 ? S 21:42 0:00 named
root 365 0.0 0.7 2064 936 ? S 21:42 0:00 /usr/sbin/sshd
..etc
(Note PID 361)

root:~> strace nslookup sunsite.unc.edu
:
:
rt_sigaction(SIGINT, {0x4003ce78, ~[], 0x4000000}, NULL, 8) = 0
rt_sigaction(SIGTERM, {0x4003ce78, ~[], 0x4000000}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGHUP, {SIG_DFL}, NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [HUP INT TERM], NULL, 8) = 0
getpid() = 2615
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
close(3) = 0
socket(PF_INET6, SOCK_STREAM, 0) = -1 ENOSYS (Function not implemented)
socket(PF_INET6, SOCK_STREAM, 0) = -1 ENOSYS (Function not implemented)
socket(PF_INET6, SOCK_STREAM, 0) = -1 EAFNOSUPPORT (Address family not supported by protocol)--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++

---Example parent/child process:

root:~> tar -xzvvf ../pkgs/zgv-5.2.tar.gz
:
:
-rw------- rus/users 1356 2000-06-01 11:46:57 zgv-5.2/INSTALL
-rw------- rus/users 17976 1994-08-23 16:09:05 zgv-5.2/COPYING
-rw------- rus/users 1077 1998-08-26 09:24:31 zgv-5.2/README.fonts
-rw------- rus/users 120 2000-04-22 22:46:49 zgv-5.2/AUTHORS
-rw------- rus/users 3714 2000-01-23 16:29:40 zgv-5.2/SECURITY
Segmentation fault (core dumped)

root:~> strace tar -xzvvf ../pkgs/zgv-5.2.tar.gz
:
:
open("zgv-5.2/COPYING", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 4
write(4, "\t\t GNU GENERAL PUBLIC LICENSE"..., 9728) = 9728
read(3, "ccept this License. Therefore, "..., 10240) = 10240
write(4, "ccept this License. Therefore, "..., 8248) = 8248
close(4) = 0
utime("zgv-5.2/COPYING", [2000/12/29-20:21:16, 1994/08/23-16:09:05]) = 0
chown32("zgv-5.2/COPYING", 500, 100) = 0
write(1, "-rw------- rus/users 1077 1"..., 72-rw------- rus/users 1077 1998-08-26 09:24:31 zgv-5.2/README.fonts
) = 72
open("zgv-5.2/README.fonts", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 4
write(4, "The copyright for *.bdf (taken f"..., 1024) = 1024
read(3, "\"as\nis\" without express or impli"..., 10240) = 8192
--- SIGCHLD (Child exited) ---
--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++

Ideas, anyone?

-Byron

--
Byron Stanoszek Ph: (330) 644-3059
Systems Programmer Fax: (330) 644-8110
Commercial Timesharing Inc. Email: [email protected]

Attachments:

context.patch (8.34 kB)

2000-12-30 03:47:52

[permalink] [raw]

Subject: Re: test13-pre6

On Sat, 30 Dec 2000, Daniel Phillips wrote:

> Linus Torvalds wrote:
> >
> > Ok, there's a test13-pre6 out there now, which does a partial sync with
> > Alan, in addition to hopefully fixing the innd shared mapping writeback
> > problem for good. Thanks to Marcelo Tosatti and others..
>
> After the page_cache_release at line 574 of vmscan.c the page is
> unlocked and only owned by the page cache - anything could happen. How
> do you know the set_page_dirty at line 581 is still hitting a valid
> page?

Good question.

It should be safe because of the magic return value from "writepage()".

If "writepage()" returns 1, then that implies that the page is locked down
somehow. But you're right, this is ugly, if not outright buggy (maybe the
locked down state could change after the writepage, who knows?).

Moving the test is probably a good idea.

Linus

2000-12-30 04:10:52

[permalink] [raw]

Subject: Re: test13-pre6 (Fork Bug with Athlons? Temporary Fix)

On Fri, 29 Dec 2000, Byron Stanoszek wrote:
>
> I narrowed the problem down to a subset of patches from the MM set in
> test13-pre2. Reversing the attached 'context.patch' fixes the problem (only for
> i386), but I'm not yet sure why. test13-pre2 and up work without any problems
> on an Intel cpu (Pentium 180 & P3 800 tested).

Cool.

Maybe your libc is different on the different machines? Normal programs
shouldn't use segments at all, so I really do not see how this patch could
matter in the least, even if it was completely and utterly buggy (which is
not obvious at first glance).

I wonder why you seem to have an LDT at all..

> Anyways, I can't seem to find out what really changes with the patch except for
> the obvious 'void *segment' changing into a typedef-struct.

Would you mind trying to hunt this down a bit more? In particular, it
would be good to see if the behaviour is the same if you do the typedef
change but leave the other logic alone. That would also cut down on the
purely syntactic changes of the patch.

I'll take a look at the code here.

Thanks,

Linus

2000-12-30 04:52:18

by Dan Aloni

[permalink] [raw]

Subject: Re: test13-pre6

On Fri, 29 Dec 2000, Linus Torvalds wrote:

> Marco d'Itri and everybody else who has seen innd problems (or other
> shared map problems): can you verify that test13-pre6 works for you?

The ->mapping problem seems to be gone in test13-pre5, I'm running this
kernel for over 30 hours now with no glitch, gonna check if it's the same
for test13-pre6.

--
Dan Aloni
[email protected]

2000-12-30 05:45:57

[permalink] [raw]

Subject: Re: test13-pre6 (Fork Bug with Athlons? Temporary Fix)

Ok, I don't think this is an athlon bug, and I think I've figured out what
the problem is. For now, you rtemporary fix is probably fine, I'll clean
stuff up a bit and make a nicer patch available tomorrow.

Linus

2000-12-30 06:26:26

by Andi Kleen

[permalink] [raw]

Subject: Re: test13-pre6 (Fork Bug with Athlons? Temporary Fix)

On Fri, Dec 29, 2000 at 07:36:21PM -0800, Linus Torvalds wrote:
> Maybe your libc is different on the different machines? Normal programs
> shouldn't use segments at all, so I really do not see how this patch could
> matter in the least, even if it was completely and utterly buggy (which is
> not obvious at first glance).
>
> I wonder why you seem to have an LDT at all..

glibc 2.2 linuxthreads sets up an LDT to use %gs as thread local data base
pointer.

-Andi

2000-12-30 08:44:37

by Graham Murray

[permalink] [raw]

Subject: Re: test13-pre6 (Fork Bug with Athlons? Temporary Fix)

Byron Stanoszek <[email protected]> writes:

> I narrowed the problem down to a subset of patches from the MM set in
> test13-pre2. Reversing the attached 'context.patch' fixes the problem (only for
> i386), but I'm not yet sure why. test13-pre2 and up work without any problems
> on an Intel cpu (Pentium 180 & P3 800 tested).
[Snip]
> root 351 0.0 1.2 9292 1576 ? S 21:42 0:00 named
> root 361 0.0 0.0 0 0 ? Z 21:42 0:00 [named <defunct>]
> root 363 0.0 1.2 9292 1576 ? S 21:42 0:00 named
> root 364 0.0 1.2 9292 1576 ? S 21:42 0:00 named
> root 365 0.0 0.7 2064 936 ? S 21:42 0:00 /usr/sbin/sshd
> ..etc
> (Note PID 361)

I am seeing the same thing with the [named <defunct>] on a PIII 600,
so it is not Athlon specific. I haven't yet tried test13-pre6 but it
happens with pre3,4,5. So I am still running on test12.

I will try running pre6 then, if it still fails will try with your
context.patch and see if that fixes it.

2000-12-30 18:40:46

[permalink] [raw]

Subject: Re: test13-pre6

On Fri, 29 Dec 2000, Linus Torvalds wrote:

> > Oblock_super(): what the hell is wait_on_super() doing in fsync_file()?
> > It gives absolutely no warranties - ->write_super() can easily block, so
> > it looks very odd.
>
> A lot of the superblock locking has been odd. It should probably be a
> lock_super() + unlock_super(). At least that's what sync_supers() does.

Yeah, that's what I've done, but I'm less than sure that we need anything
of that kind in the first place...

Anyway, under the "it works here" category: current state of sane-s_lock
patch. It will need further cleanup in fs/super.c, but IMO the current
form is already better than code in tree. Comments are more than welcome.

One note on the patch: sb->s_count is amount of extra references to superblock
(== not coming from vfsmounts) + 1 if there is a vfsmount over that superblock.
I.e. same scheme as we use for mm_struct. get_super() bumps the refcount,
put_super() decrements it and frees the superblock if counter hits zero.
add_vfsmnt() bumps refcount if we are adding the first vfsmount.
remove_vfsmnt() bumps refcount _unless_ we are removing the last vfsmount.
I.e. remove_vfsmnt() + put_super() always does the right thing.

List of superblocks contains only "active" ones. kill_super() takes the
superblock out of the list. get_empty_super() became simpler - it doesn't
bother searching the list for "reusable" ones. List is still BKL-protected;
I'll fix that.

We got _two_ sempaphores - one for get_super()/umount() synchronization
(->s_umount) and another for lock_super()/unlock_super() (->s_lock).
I would expect the latter to go into the per-fs part - all synchronization
that is interesting from VFS point of view is done by ->s_umount. The
reason why I've kept s_lock at all is simple - I don't want to mix
changes in fs-private synchronization with the fs-independent stuff.

Patch is against -pre6 and it seems to working here. Comments (and help
in testing) are welcome.

Peter, if you could remind me the uses of get_empty_super() and friends
in your code I would be very grateful - I seriously suspect that we
could use better interface in that area.
Cheers,
Al
Patch follows:

diff -urN rc13-pre6/arch/parisc/hpux/sys_hpux.c rc13-pre6-s_lock/arch/parisc/hpux/sys_hpux.c
--- rc13-pre6/arch/parisc/hpux/sys_hpux.c Tue Dec 12 02:10:00 2000
+++ rc13-pre6-s_lock/arch/parisc/hpux/sys_hpux.c Sat Dec 30 09:17:25 2000
@@ -109,9 +109,11 @@

lock_kernel();
s = get_super(to_kdev_t(dev));
+ unlock_kernel();
if (s == NULL)
goto out;
err = vfs_statfs(s, &sbuf);
+ put_super(s);
if (err)
goto out;

@@ -124,7 +126,6 @@
/* Changed to hpux_ustat: */
err = copy_to_user(ubuf,&tmp,sizeof(struct hpux_ustat)) ? -EFAULT : 0;
out:
- unlock_kernel();
return err;
}

diff -urN rc13-pre6/arch/sparc/kernel/sys_sunos.c rc13-pre6-s_lock/arch/sparc/kernel/sys_sunos.c
--- rc13-pre6/arch/sparc/kernel/sys_sunos.c Sun Aug 13 16:56:28 2000
+++ rc13-pre6-s_lock/arch/sparc/kernel/sys_sunos.c Sat Dec 30 09:17:25 2000
@@ -620,8 +620,6 @@
};

-extern dev_t get_unnamed_dev(void);
-extern void put_unnamed_dev(dev_t);
extern asmlinkage int sys_connect(int fd, struct sockaddr *uservaddr, int addrlen);
extern asmlinkage int sys_socket(int family, int type, int protocol);
extern asmlinkage int sys_bind(int fd, struct sockaddr *umyaddr, int addrlen);
diff -urN rc13-pre6/arch/sparc64/kernel/sys_sunos32.c rc13-pre6-s_lock/arch/sparc64/kernel/sys_sunos32.c
--- rc13-pre6/arch/sparc64/kernel/sys_sunos32.c Tue Dec 12 02:10:04 2000
+++ rc13-pre6-s_lock/arch/sparc64/kernel/sys_sunos32.c Sat Dec 30 09:17:25 2000
@@ -580,8 +580,6 @@
char *netname; /* server's netname */
};

-extern dev_t get_unnamed_dev(void);
-extern void put_unnamed_dev(dev_t);
extern asmlinkage int sys_mount(char *, char *, char *, unsigned long, void *);
extern asmlinkage int sys_connect(int fd, struct sockaddr *uservaddr, int addrlen);
extern asmlinkage int sys_socket(int family, int type, int protocol);
diff -urN rc13-pre6/drivers/acorn/block/mfmhd.c rc13-pre6-s_lock/drivers/acorn/block/mfmhd.c
--- rc13-pre6/drivers/acorn/block/mfmhd.c Wed Nov 29 19:28:26 2000
+++ rc13-pre6-s_lock/drivers/acorn/block/mfmhd.c Sat Dec 30 09:17:26 2000
@@ -1486,13 +1486,8 @@
for (i = maxp - 1; i >= 0; i--) {
int minor = start + i;
kdev_t devi = MKDEV(MAJOR_NR, minor);
- struct super_block *sb = get_super(devi);
-
- sync_dev (devi);
- if (sb)
- invalidate_inodes (sb);
- invalidate_buffers (devi);

+ invalidate_dev(devi, 1);
mfm_gendisk.part[minor].start_sect = 0;
mfm_gendisk.part[minor].nr_sects = 0;
}
diff -urN rc13-pre6/drivers/block/DAC960.c rc13-pre6-s_lock/drivers/block/DAC960.c
--- rc13-pre6/drivers/block/DAC960.c Tue Dec 12 02:10:05 2000
+++ rc13-pre6-s_lock/drivers/block/DAC960.c Sat Dec 30 09:17:26 2000
@@ -4990,16 +4990,12 @@
PartitionNumber);
int MinorNumber = DAC960_MinorNumber(LogicalDriveNumber,
PartitionNumber);
- SuperBlock_T *SuperBlock = get_super(Device);
if (Controller->GenericDiskInfo.part[MinorNumber].nr_sects == 0)
continue;
/*
Flush all changes and invalidate buffered state.
*/
- sync_dev(Device);
- if (SuperBlock != NULL)
- invalidate_inodes(SuperBlock);
- invalidate_buffers(Device);
+ invalidate_dev(Device, 1);
/*
Clear existing partition sizes.
*/
diff -urN rc13-pre6/drivers/block/acsi.c rc13-pre6-s_lock/drivers/block/acsi.c
--- rc13-pre6/drivers/block/acsi.c Tue Dec 12 02:10:05 2000
+++ rc13-pre6-s_lock/drivers/block/acsi.c Sat Dec 30 09:17:26 2000
@@ -1887,12 +1887,7 @@
for( i = max_p - 1; i >= 0 ; i-- ) {
if (gdev->part[start + i].nr_sects != 0) {
kdev_t devp = MKDEV(MAJOR_NR, start + i);
- struct super_block *sb = get_super(devp);
-
- fsync_dev(devp);
- if (sb)
- invalidate_inodes(sb);
- invalidate_buffers(devp);
+ invalidate_dev(devp, 2);
gdev->part[start + i].nr_sects = 0;
}
gdev->part[start+i].start_sect = 0;
diff -urN rc13-pre6/drivers/block/amiflop.c rc13-pre6-s_lock/drivers/block/amiflop.c
--- rc13-pre6/drivers/block/amiflop.c Wed Nov 29 19:28:35 2000
+++ rc13-pre6-s_lock/drivers/block/amiflop.c Sat Dec 30 09:17:26 2000
@@ -1490,7 +1490,6 @@
{
int drive = inode->i_rdev & 3;
static struct floppy_struct getprm;
- struct super_block * sb;

switch(cmd){
case HDIO_GETGEO:
@@ -1540,10 +1539,7 @@
break;
case FDFMTEND:
floppy_off(drive);
- sb = get_super(inode->i_rdev);
- if (sb)
- invalidate_inodes(sb);
- invalidate_buffers(inode->i_rdev);
+ invalidate_dev(inode->i_rdev, 0);
break;
case FDGETPRM:
memset((void *)&getprm, 0, sizeof (getprm));
@@ -1677,9 +1673,6 @@

static int floppy_release(struct inode * inode, struct file * filp)
{
-#ifdef DEBUG
- struct super_block * sb;
-#endif
int drive = MINOR(inode->i_rdev) & 3;

if (unit[drive].dirty == 1) {
diff -urN rc13-pre6/drivers/block/blkpg.c rc13-pre6-s_lock/drivers/block/blkpg.c
--- rc13-pre6/drivers/block/blkpg.c Mon Mar 20 01:36:47 2000
+++ rc13-pre6-s_lock/drivers/block/blkpg.c Sat Dec 30 09:17:26 2000
@@ -157,8 +157,7 @@

/* partition in use? Incomplete check for now. */
devp = MKDEV(MAJOR(dev), minor);
- if (get_super(devp) || /* mounted? */
- is_swap_partition(devp))
+ if (is_mounted(devp) || is_swap_partition(devp))
return -EBUSY;

/* all seems OK */
diff -urN rc13-pre6/drivers/block/cciss.c rc13-pre6-s_lock/drivers/block/cciss.c
--- rc13-pre6/drivers/block/cciss.c Wed Nov 29 21:36:25 2000
+++ rc13-pre6-s_lock/drivers/block/cciss.c Sat Dec 30 09:17:26 2000
@@ -707,10 +707,7 @@
for(i=max_p; i>=0; i--) {
int minor = start+i;
kdev_t devi = MKDEV(MAJOR_NR + ctlr, minor);
- struct super_block *sb = get_super(devi);
- sync_dev(devi);
- if (sb) invalidate_inodes(sb);
- invalidate_buffers(devi);
+ invalidate_dev(devi, 1);
gdev->part[minor].start_sect = 0;
gdev->part[minor].nr_sects = 0;

diff -urN rc13-pre6/drivers/block/cpqarray.c rc13-pre6-s_lock/drivers/block/cpqarray.c
--- rc13-pre6/drivers/block/cpqarray.c Wed Nov 29 21:42:57 2000
+++ rc13-pre6-s_lock/drivers/block/cpqarray.c Sat Dec 30 09:17:26 2000
@@ -1571,10 +1571,7 @@
for(i=max_p; i>=0; i--) {
int minor = start+i;
kdev_t devi = MKDEV(MAJOR_NR + ctlr, minor);
- struct super_block *sb = get_super(devi);
- sync_dev(devi);
- if (sb) invalidate_inodes(sb);
- invalidate_buffers(devi);
+ invalidate_dev(devi, 1);
gdev->part[minor].start_sect = 0;
gdev->part[minor].nr_sects = 0;

diff -urN rc13-pre6/drivers/block/paride/pd.c rc13-pre6-s_lock/drivers/block/paride/pd.c
--- rc13-pre6/drivers/block/paride/pd.c Tue Jul 25 09:23:16 2000
+++ rc13-pre6-s_lock/drivers/block/paride/pd.c Sat Dec 30 09:17:26 2000
@@ -593,8 +593,6 @@
long flags;
kdev_t devp;

- struct super_block *sb;
-
unit = DEVICE_NR(dev);
if ((unit >= PD_UNITS) || (!PD.present)) return -ENODEV;

@@ -610,12 +608,7 @@
for (p=(PD_PARTNS-1);p>=0;p--) {
minor = p + unit*PD_PARTNS;
devp = MKDEV(MAJOR_NR, minor);
- fsync_dev(devp);
-
- sb = get_super(devp);
- if (sb) invalidate_inodes(sb);
-
- invalidate_buffers(devp);
+ invalidate_dev(devp, 2);
pd_hd[minor].start_sect = 0;
pd_hd[minor].nr_sects = 0;
}
diff -urN rc13-pre6/drivers/block/ps2esdi.c rc13-pre6-s_lock/drivers/block/ps2esdi.c
--- rc13-pre6/drivers/block/ps2esdi.c Mon Jul 31 16:33:53 2000
+++ rc13-pre6-s_lock/drivers/block/ps2esdi.c Sat Dec 30 09:17:26 2000
@@ -1180,12 +1180,7 @@
partition >= 0; partition--) {
int minor = (start | partition);
kdev_t devp = MKDEV(MAJOR_NR, minor);
- struct super_block * sb = get_super(devp);
-
- sync_dev(devp);
- if (sb)
- invalidate_inodes(sb);
- invalidate_buffers(devp);
+ invalidate_dev(devp, 1);
ps2esdi_gendisk.part[start + partition].start_sect = 0;
ps2esdi_gendisk.part[start + partition].nr_sects = 0;
}
diff -urN rc13-pre6/drivers/block/xd.c rc13-pre6-s_lock/drivers/block/xd.c
--- rc13-pre6/drivers/block/xd.c Wed Nov 29 21:36:26 2000
+++ rc13-pre6-s_lock/drivers/block/xd.c Sat Dec 30 09:17:26 2000
@@ -394,12 +394,7 @@
for (partition = xd_gendisk.max_p - 1; partition >= 0; partition--) {
int minor = (start | partition);
kdev_t devp = MKDEV(MAJOR_NR, minor);
- struct super_block * sb = get_super(devp);
-
- sync_dev(devp);
- if (sb)
- invalidate_inodes(sb);
- invalidate_buffers(devp);
+ invalidate_dev(devp, 1);
xd_gendisk.part[minor].start_sect = 0;
xd_gendisk.part[minor].nr_sects = 0;
};
diff -urN rc13-pre6/drivers/char/raw.c rc13-pre6-s_lock/drivers/char/raw.c
--- rc13-pre6/drivers/char/raw.c Wed Nov 29 19:28:41 2000
+++ rc13-pre6-s_lock/drivers/char/raw.c Sat Dec 30 09:17:26 2000
@@ -103,7 +103,7 @@
*/

sector_size = 512;
- if (get_super(rdev) != NULL) {
+ if (is_mounted(rdev)) {
if (blksize_size[MAJOR(rdev)])
sector_size = blksize_size[MAJOR(rdev)][MINOR(rdev)];
} else {
diff -urN rc13-pre6/drivers/i2o/i2o_block.c rc13-pre6-s_lock/drivers/i2o/i2o_block.c
--- rc13-pre6/drivers/i2o/i2o_block.c Wed Nov 29 21:43:05 2000
+++ rc13-pre6-s_lock/drivers/i2o/i2o_block.c Sat Dec 30 09:17:27 2000
@@ -900,12 +900,7 @@
{
int m = minor+i;
kdev_t d = MKDEV(MAJOR_NR, m);
- struct super_block *sb = get_super(d);
-
- sync_dev(d);
- if(sb)
- invalidate_inodes(sb);
- invalidate_buffers(d);
+ invalidate_dev(d, 1);
i2ob_gendisk.part[m].start_sect = 0;
i2ob_gendisk.part[m].nr_sects = 0;
}
diff -urN rc13-pre6/drivers/ide/hd.c rc13-pre6-s_lock/drivers/ide/hd.c
--- rc13-pre6/drivers/ide/hd.c Wed Nov 29 21:43:06 2000
+++ rc13-pre6-s_lock/drivers/ide/hd.c Sat Dec 30 09:17:27 2000
@@ -881,12 +881,7 @@
for (i=max_p - 1; i >=0 ; i--) {
int minor = start + i;
kdev_t devi = MKDEV(MAJOR_NR, minor);
- struct super_block *sb = get_super(devi);
-
- sync_dev(devi);
- if (sb)
- invalidate_inodes(sb);
- invalidate_buffers(devi);
+ invalidate_dev(devi, 1);
gdev->part[minor].start_sect = 0;
gdev->part[minor].nr_sects = 0;
}
diff -urN rc13-pre6/drivers/ide/ide.c rc13-pre6-s_lock/drivers/ide/ide.c
--- rc13-pre6/drivers/ide/ide.c Tue Dec 12 02:10:10 2000
+++ rc13-pre6-s_lock/drivers/ide/ide.c Sat Dec 30 09:17:27 2000
@@ -1761,11 +1761,7 @@
for (p = 0; p < (1<<PARTN_BITS); ++p) {
if (drive->part[p].nr_sects > 0) {
kdev_t devp = MKDEV(major, minor+p);
- struct super_block * sb = get_super(devp);
- fsync_dev (devp);
- if (sb)
- invalidate_inodes(sb);
- invalidate_buffers (devp);
+ invalidate_dev(devp, 2);
set_blocksize(devp, 1024);
}
drive->part[p].start_sect = 0;
@@ -1982,9 +1978,7 @@
for (p = 0; p < (1<<PARTN_BITS); ++p) {
if (drive->part[p].nr_sects > 0) {
kdev_t devp = MKDEV(hwif->major, minor+p);
- struct super_block * sb = get_super(devp);
- if (sb) invalidate_inodes(sb);
- invalidate_buffers (devp);
+ invalidate_dev(devp, 0);
}
}
#ifdef CONFIG_PROC_FS
diff -urN rc13-pre6/drivers/md/md.c rc13-pre6-s_lock/drivers/md/md.c
--- rc13-pre6/drivers/md/md.c Tue Dec 12 02:10:15 2000
+++ rc13-pre6-s_lock/drivers/md/md.c Sat Dec 30 09:17:27 2000
@@ -1088,7 +1088,7 @@
}
memset(rdev, 0, sizeof(*rdev));

- if (get_super(newdev)) {
+ if (is_mounted(newdev)) {
printk("md: can not import %s, has active inodes!\n",
partition_name(newdev));
err = -EBUSY;
@@ -1726,7 +1726,7 @@
}

/* this shouldn't be needed as above would have fired */
- if (!ro && get_super(dev)) {
+ if (!ro && is_mounted(dev)) {
printk (STILL_MOUNTED, mdidx(mddev));
OUT(-EBUSY);
}
diff -urN rc13-pre6/drivers/mtd/ftl.c rc13-pre6-s_lock/drivers/mtd/ftl.c
--- rc13-pre6/drivers/mtd/ftl.c Tue Dec 12 02:10:19 2000
+++ rc13-pre6-s_lock/drivers/mtd/ftl.c Sat Dec 30 09:17:27 2000
@@ -915,9 +915,6 @@

static release_t ftl_close(struct inode *inode, struct file *file)
{
-#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
- struct super_block *sb = get_super(inode->i_rdev);
-#endif
int minor = MINOR(inode->i_rdev);
partition_t *part = myparts[minor >> 4];
int i;
@@ -925,11 +922,7 @@
DEBUG(0, "ftl_cs: ftl_close(%d)\n", minor);

/* Flush all writes */
- fsync_dev(inode->i_rdev);
-#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
- if (sb) invalidate_inodes(sb);
-#endif
- invalidate_buffers(inode->i_rdev);
+ invalidate_dev(inode->i_rdev, 2);

/* Wait for any pending erase operations to complete */
if (part->mtd->sync)
diff -urN rc13-pre6/drivers/mtd/mtdblock.c rc13-pre6-s_lock/drivers/mtd/mtdblock.c
--- rc13-pre6/drivers/mtd/mtdblock.c Sat Dec 30 07:37:35 2000
+++ rc13-pre6-s_lock/drivers/mtd/mtdblock.c Sat Dec 30 09:17:27 2000
@@ -355,19 +355,12 @@
{
int dev;
struct mtdblk_dev *mtdblk;
-#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
- struct super_block * sb = get_super(inode->i_rdev);
-#endif
DEBUG(MTD_DEBUG_LEVEL1, "mtdblock_release\n");

if (inode == NULL)
release_return(-ENODEV);

- fsync_dev(inode->i_rdev);
-#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
- if (sb) invalidate_inodes(sb);
-#endif
- invalidate_buffers(inode->i_rdev);
+ invalidate_dev(inode->i_rdev, 2);

dev = MINOR(inode->i_rdev);
mtdblk = mtdblks[dev];
diff -urN rc13-pre6/drivers/mtd/nftl.c rc13-pre6-s_lock/drivers/mtd/nftl.c
--- rc13-pre6/drivers/mtd/nftl.c Tue Dec 12 02:10:19 2000
+++ rc13-pre6-s_lock/drivers/mtd/nftl.c Sat Dec 30 09:17:28 2000
@@ -997,17 +997,13 @@

static int nftl_release(struct inode *inode, struct file *fp)
{
- struct super_block *sb = get_super(inode->i_rdev);
struct NFTLrecord *thisNFTL;

thisNFTL = NFTLs[MINOR(inode->i_rdev) / 16];

DEBUG(MTD_DEBUG_LEVEL2, "NFTL_release\n");

- fsync_dev(inode->i_rdev);
- if (sb)
- invalidate_inodes(sb);
- invalidate_buffers(inode->i_rdev);
+ invalidate_dev(inode->i_rdev, 2);

if (thisNFTL->mtd->sync)
thisNFTL->mtd->sync(thisNFTL->mtd);
diff -urN rc13-pre6/drivers/scsi/sd.c rc13-pre6-s_lock/drivers/scsi/sd.c
--- rc13-pre6/drivers/scsi/sd.c Wed Nov 29 19:29:16 2000
+++ rc13-pre6-s_lock/drivers/scsi/sd.c Sat Dec 30 09:17:28 2000
@@ -1262,11 +1262,7 @@
for (i = max_p - 1; i >= 0; i--) {
int index = start + i;
kdev_t devi = MKDEV_SD_PARTITION(index);
- struct super_block *sb = get_super(devi);
- sync_dev(devi);
- if (sb)
- invalidate_inodes(sb);
- invalidate_buffers(devi);
+ invalidate_dev(devi, 1);
sd_gendisks->part[index].start_sect = 0;
sd_gendisks->part[index].nr_sects = 0;
/*
@@ -1315,11 +1311,7 @@
for (j = max_p - 1; j >= 0; j--) {
int index = start + j;
kdev_t devi = MKDEV_SD_PARTITION(index);
- struct super_block *sb = get_super(devi);
- sync_dev(devi);
- if (sb)
- invalidate_inodes(sb);
- invalidate_buffers(devi);
+ invalidate_dev(devi, 1);
sd_gendisks->part[index].start_sect = 0;
sd_gendisks->part[index].nr_sects = 0;
sd_sizes[index] = 0;
diff -urN rc13-pre6/drivers/scsi/sr.c rc13-pre6-s_lock/drivers/scsi/sr.c
--- rc13-pre6/drivers/scsi/sr.c Sat Dec 30 07:37:41 2000
+++ rc13-pre6-s_lock/drivers/scsi/sr.c Sat Dec 30 09:17:28 2000
@@ -874,15 +874,11 @@
for (cpnt = scsi_CDs, i = 0; i < sr_template.dev_max; i++, cpnt++)
if (cpnt->device == SDp) {
kdev_t devi = MKDEV(MAJOR_NR, i);
- struct super_block *sb = get_super(devi);
-
/*
* Since the cdrom is read-only, no need to sync the device.
* We should be kind to our buffer cache, however.
*/
- if (sb)
- invalidate_inodes(sb);
- invalidate_buffers(devi);
+ invalidate_dev(devi, 0);

/*
* Reset things back to a sane state so that one can re-load a new
diff -urN rc13-pre6/drivers/sound/via82cxxx_audio.c rc13-pre6-s_lock/drivers/sound/via82cxxx_audio.c
--- rc13-pre6/drivers/sound/via82cxxx_audio.c Wed Nov 29 21:37:13 2000
+++ rc13-pre6-s_lock/drivers/sound/via82cxxx_audio.c Sat Dec 30 09:17:28 2000
@@ -1727,20 +1727,8 @@
}

-#ifndef VM_RESERVE
-static int via_mm_swapout (struct page *page, struct file *filp)
-{
- return 0;
-}
-#endif /* VM_RESERVE */
-
-
struct vm_operations_struct via_mm_ops = {
nopage: via_mm_nopage,
-
-#ifndef VM_RESERVE
- swapout: via_mm_swapout,
-#endif
};

@@ -1789,9 +1777,7 @@
vma->vm_ops = &via_mm_ops;
vma->vm_private_data = card;

-#ifdef VM_RESERVE
- vma->vm_flags |= VM_RESERVE;
-#endif
+ vma->vm_flags |= VM_RESERVED;

if (rd)
card->ch_in.is_mapped = 1;
diff -urN rc13-pre6/fs/block_dev.c rc13-pre6-s_lock/fs/block_dev.c
--- rc13-pre6/fs/block_dev.c Wed Nov 29 19:29:31 2000
+++ rc13-pre6-s_lock/fs/block_dev.c Sat Dec 30 09:17:28 2000
@@ -576,8 +576,11 @@
bdevname(dev));

sb = get_super(dev);
- if (sb && invalidate_inodes(sb))
- printk("VFS: busy inodes on changed media.\n");
+ if (sb) {
+ if (invalidate_inodes(sb))
+ printk("VFS: busy inodes on changed media.\n");
+ put_super(sb);
+ }

destroy_buffers(dev);

diff -urN rc13-pre6/fs/buffer.c rc13-pre6-s_lock/fs/buffer.c
--- rc13-pre6/fs/buffer.c Sat Dec 30 07:37:45 2000
+++ rc13-pre6-s_lock/fs/buffer.c Sat Dec 30 09:17:28 2000
@@ -337,9 +337,10 @@

/* sync the superblock to buffers */
sb = inode->i_sb;
- wait_on_super(sb);
+ lock_super(sb);
if (sb->s_op && sb->s_op->write_super)
sb->s_op->write_super(sb);
+ unlock_super(sb);

/* .. finally sync the buffers to disk */
dev = inode->i_dev;
@@ -677,6 +678,19 @@
spin_unlock(&lru_list_lock);
if (slept)
goto retry;
+}
+
+void invalidate_dev(kdev_t dev, int sync_flag)
+{
+ struct super_block *sb = get_super(dev);
+ if (sync_flag == 1)
+ sync_dev(dev);
+ else if (sync_flag == 2)
+ fsync_dev(dev);
+ if (sb)
+ invalidate_inodes(sb);
+ put_super(sb);
+ invalidate_buffers(dev);
}

void set_blocksize(kdev_t dev, int size)
diff -urN rc13-pre6/fs/coda/dir.c rc13-pre6-s_lock/fs/coda/dir.c
--- rc13-pre6/fs/coda/dir.c Sat Dec 30 07:37:45 2000
+++ rc13-pre6-s_lock/fs/coda/dir.c Sat Dec 30 09:17:28 2000
@@ -557,8 +557,8 @@
return -ENXIO;
}

- *ind = NULL;
*ind = iget(sbptr, ino);
+ put_super(sbptr);

if ( *ind == NULL ) {
printk("coda_inode_grab: iget(dev: %d, ino: %ld) "
diff -urN rc13-pre6/fs/coda/inode.c rc13-pre6-s_lock/fs/coda/inode.c
--- rc13-pre6/fs/coda/inode.c Sat Dec 30 07:37:45 2000
+++ rc13-pre6-s_lock/fs/coda/inode.c Sat Dec 30 09:17:28 2000
@@ -97,7 +97,6 @@
struct coda_sb_info *sbi = NULL;
struct venus_comm *vc = NULL;
ViceFid fid;
- kdev_t dev = sb->s_dev;
int error;
int idx;
ENTRY;
@@ -139,7 +138,6 @@
sb->s_blocksize = 1024; /* XXXXX what do we put here?? */
sb->s_blocksize_bits = 10;
sb->s_magic = CODA_SUPER_MAGIC;
- sb->s_dev = dev;
sb->s_op = &coda_super_operations;

/* get root fid from Venus: this needs the root inode */
diff -urN rc13-pre6/fs/devfs/base.c rc13-pre6-s_lock/fs/devfs/base.c
--- rc13-pre6/fs/devfs/base.c Wed Nov 29 21:44:16 2000
+++ rc13-pre6-s_lock/fs/devfs/base.c Sat Dec 30 09:17:28 2000
@@ -2166,8 +2166,11 @@
printk ( KERN_DEBUG "VFS: Disk change detected on device %s\n",
kdevname (dev) );
sb = get_super (dev);
- if ( sb && invalidate_inodes (sb) )
- printk("VFS: busy inodes on changed media..\n");
+ if (sb) {
+ if (invalidate_inodes (sb))
+ printk("VFS: busy inodes on changed media..\n");
+ put_super(sb);
+ }
invalidate_buffers (dev);
/* Ugly hack to disable messages about unable to read partition table */
tmp = warn_no_part;
diff -urN rc13-pre6/fs/dquot.c rc13-pre6-s_lock/fs/dquot.c
--- rc13-pre6/fs/dquot.c Tue Dec 12 02:10:42 2000
+++ rc13-pre6-s_lock/fs/dquot.c Sat Dec 30 09:17:29 2000
@@ -326,7 +326,7 @@
memset(&dquot->dq_dqb, 0, sizeof(struct dqblk));
}

-void invalidate_dquots(kdev_t dev, short type)
+static void invalidate_dquots(struct super_block *sb, short type)
{
struct dquot *dquot, *next;
int need_restart;
@@ -336,12 +336,10 @@
need_restart = 0;
while ((dquot = next) != NULL) {
next = dquot->dq_next;
- if (dquot->dq_dev != dev)
+ if (dquot->dq_sb != sb)
continue;
if (dquot->dq_type != type)
continue;
- if (!dquot->dq_sb) /* Already invalidated entry? */
- continue;
if (dquot->dq_flags & DQ_LOCKED) {
__wait_on_dquot(dquot);

@@ -350,12 +348,10 @@
/*
* Make sure it's still the same dquot.
*/
- if (dquot->dq_dev != dev)
+ if (dquot->dq_sb != sb)
continue;
if (dquot->dq_type != type)
continue;
- if (!dquot->dq_sb)
- continue;
}
/*
* Because inodes needn't to be the only holders of dquot
@@ -1390,7 +1386,7 @@
}

/* Function in inode.c - remove pointers to dquots in icache */
-extern void remove_dquot_ref(kdev_t, short);
+extern void remove_dquot_ref(struct super_block *, short);

/*
* Turn quota off on a device. type == -1 ==> quotaoff for all types (umount)
@@ -1415,8 +1411,8 @@
reset_enable_flags(dqopt, cnt);

/* Note: these are blocking operations */
- remove_dquot_ref(sb->s_dev, cnt);
- invalidate_dquots(sb->s_dev, cnt);
+ remove_dquot_ref(sb, cnt);
+ invalidate_dquots(sb, cnt);

/* Wait for any pending IO - remove me as soon as invalidate is more polite */
down(&dqopt->dqio_sem);
@@ -1604,6 +1600,8 @@
if (sb && sb_has_quota_enabled(sb, type))
ret = set_dqblk(sb, id, type, flags, (struct dqblk *) addr);
out:
+ if (sb)
+ put_super(sb);
unlock_kernel();
return ret;
}
diff -urN rc13-pre6/fs/ext2/super.c rc13-pre6-s_lock/fs/ext2/super.c
--- rc13-pre6/fs/ext2/super.c Sat Dec 30 07:37:46 2000
+++ rc13-pre6-s_lock/fs/ext2/super.c Sat Dec 30 09:17:29 2000
@@ -75,9 +75,6 @@
va_start (args, fmt);
vsprintf (error_buf, fmt, args);
va_end (args);
- /* this is to prevent panic from syncing this filesystem */
- if (sb->s_lock)
- sb->s_lock=0;
sb->s_flags |= MS_RDONLY;
panic ("EXT2-fs panic (device %s): %s: %s\n",
bdevname(sb->s_dev), function, error_buf);
diff -urN rc13-pre6/fs/inode.c rc13-pre6-s_lock/fs/inode.c
--- rc13-pre6/fs/inode.c Sat Dec 30 07:37:47 2000
+++ rc13-pre6-s_lock/fs/inode.c Sat Dec 30 09:17:29 2000
@@ -1002,14 +1002,13 @@
void put_dquot_list(struct list_head *);
int remove_inode_dquot_ref(struct inode *, short, struct list_head *);

-void remove_dquot_ref(kdev_t dev, short type)
+void remove_dquot_ref(struct super_block *sb, short type)
{
- struct super_block *sb = get_super(dev);
struct inode *inode;
struct list_head *act_head;
LIST_HEAD(tofree_head);

- if (!sb || !sb->dq_op)
+ if (!sb->dq_op)
return; /* nothing to do */

/* We have to be protected against other CPUs */
diff -urN rc13-pre6/fs/jffs/inode-v23.c rc13-pre6-s_lock/fs/jffs/inode-v23.c
--- rc13-pre6/fs/jffs/inode-v23.c Sat Dec 30 07:37:48 2000
+++ rc13-pre6-s_lock/fs/jffs/inode-v23.c Sat Dec 30 09:17:29 2000
@@ -159,7 +159,6 @@

D1(printk (KERN_NOTICE "jffs_put_super(): Successfully waited on thread.\n"));

- sb->s_dev = 0;
jffs_cleanup_control((struct jffs_control *)sb->u.generic_sbp);
D1(printk(KERN_NOTICE "JFFS: Successfully unmounted device %s.\n",
kdevname(dev)));
diff -urN rc13-pre6/fs/super.c rc13-pre6-s_lock/fs/super.c
--- rc13-pre6/fs/super.c Tue Dec 12 02:10:46 2000
+++ rc13-pre6-s_lock/fs/super.c Sat Dec 30 12:03:30 2000
@@ -41,14 +41,6 @@
#define __NO_VERSION__
#include <linux/module.h>

-/*
- * We use a semaphore to synchronize all mount/umount
- * activity - imagine the mess if we have a race between
- * unmounting a filesystem and re-mounting it (or something
- * else).
- */
-static DECLARE_MUTEX(mount_sem);
-
extern void wait_for_keypress(void);

extern int root_mountflags;
@@ -346,6 +338,8 @@
INIT_LIST_HEAD(&mnt->mnt_clash);
}
INIT_LIST_HEAD(&mnt->mnt_mounts);
+ if (list_empty(&sb->s_mounts))
+ atomic_inc(&sb->s_count);
list_add(&mnt->mnt_instances, &sb->s_mounts);
list_add(&mnt->mnt_list, vfsmntlist.prev);
spin_unlock(&dcache_lock);
@@ -411,6 +405,8 @@
static void remove_vfsmnt(struct vfsmount *mnt)
{
/* First of all, remove it from all lists */
+ if (mnt->mnt_instances.next != mnt->mnt_instances.prev)
+ atomic_inc(&mnt->mnt_sb->s_count);
list_del(&mnt->mnt_instances);
list_del(&mnt->mnt_clash);
list_del(&mnt->mnt_list);
@@ -579,27 +575,11 @@
#undef FREEROOM
}

-/**
- * __wait_on_super - wait on a superblock
- * @sb: superblock to wait on
- *
- * Waits for a superblock to become unlocked and then returns. It does
- * not take the lock. This is an internal function. See wait_on_super().
- */
-
-void __wait_on_super(struct super_block * sb)
+void put_super(struct super_block *sb)
{
- DECLARE_WAITQUEUE(wait, current);
-
- add_wait_queue(&sb->s_wait, &wait);
-repeat:
- set_current_state(TASK_UNINTERRUPTIBLE);
- if (sb->s_lock) {
- schedule();
- goto repeat;
- }
- remove_wait_queue(&sb->s_wait, &wait);
- current->state = TASK_RUNNING;
+ up(&sb->s_umount);
+ if (atomic_dec_and_test(&sb->s_count))
+ kfree(sb);
}

/*
@@ -611,20 +591,23 @@
{
struct super_block * sb;

+restart:
for (sb = sb_entry(super_blocks.next);
sb != sb_entry(&super_blocks);
sb = sb_entry(sb->s_list.next)) {
- if (!sb->s_dev)
- continue;
if (dev && sb->s_dev != dev)
continue;
if (!sb->s_dirt)
continue;
+ atomic_inc(&sb->s_count);
+ down(&sb->s_umount);
lock_super(sb);
- if (sb->s_dev && sb->s_dirt && (!dev || dev == sb->s_dev))
+ if (sb->s_root && sb->s_dirt)
if (sb->s_op && sb->s_op->write_super)
sb->s_op->write_super(sb);
unlock_super(sb);
+ put_super(sb);
+ goto restart;
}
}

@@ -646,9 +629,11 @@
s = sb_entry(super_blocks.next);
while (s != sb_entry(&super_blocks))
if (s->s_dev == dev) {
- wait_on_super(s);
- if (s->s_dev == dev)
+ atomic_inc(&s->s_count);
+ down(&s->s_umount);
+ if (s->s_root)
return s;
+ put_super(s);
goto restart;
} else
s = sb_entry(s->s_list.next);
@@ -668,6 +653,7 @@
if (s == NULL)
goto out;
err = vfs_statfs(s, &sbuf);
+ put_super(s);
if (err)
goto out;

@@ -691,42 +677,61 @@

struct super_block *get_empty_super(void)
{
- struct super_block *s;
-
- for (s = sb_entry(super_blocks.next);
- s != sb_entry(&super_blocks);
- s = sb_entry(s->s_list.next)) {
- if (s->s_dev)
- continue;
- if (!s->s_lock)
- return s;
- printk("VFS: empty superblock %p locked!\n", s);
- }
- /* Need a new one... */
- if (nr_super_blocks >= max_super_blocks)
- return NULL;
- s = kmalloc(sizeof(struct super_block), GFP_USER);
+ struct super_block *s = kmalloc(sizeof(struct super_block), GFP_USER);
if (s) {
nr_super_blocks++;
memset(s, 0, sizeof(struct super_block));
INIT_LIST_HEAD(&s->s_dirty);
- list_add (&s->s_list, super_blocks.prev);
- init_waitqueue_head(&s->s_wait);
INIT_LIST_HEAD(&s->s_files);
INIT_LIST_HEAD(&s->s_mounts);
+ sema_init(&s->s_umount, 0);
+ sema_init(&s->s_lock, 1);
+ atomic_set(&s->s_count, 1);
+ list_add (&s->s_list, super_blocks.prev);
}
return s;
}

+/*
+ * Unnamed block devices are dummy devices used by virtual
+ * filesystems which don't use real block-devices. -- jrs
+ */
+
+static unsigned int unnamed_dev_in_use[256/(8*sizeof(unsigned int))];
+
+static kdev_t get_unnamed_dev(void)
+{
+ int i;
+
+ for (i = 1; i < 256; i++) {
+ if (!test_and_set_bit(i,unnamed_dev_in_use))
+ return MKDEV(UNNAMED_MAJOR, i);
+ }
+ return 0;
+}
+
+static void put_unnamed_dev(kdev_t dev)
+{
+ if (!dev || MAJOR(dev) != UNNAMED_MAJOR)
+ return;
+ if (test_and_clear_bit(MINOR(dev), unnamed_dev_in_use))
+ return;
+ printk("VFS: put_unnamed_dev: freeing unused device %s\n",
+ kdevname(dev));
+}
+
static struct super_block * read_super(kdev_t dev, struct block_device *bdev,
struct file_system_type *type, int flags,
void *data, int silent)
{
struct super_block * s;
+ kdev_t rdev = dev ? dev : get_unnamed_dev();
+ if (!rdev)
+ goto fail;
s = get_empty_super();
if (!s)
- goto out;
- s->s_dev = dev;
+ goto fail;
+ s->s_dev = rdev;
s->s_bdev = bdev;
s->s_flags = flags;
s->s_dirt = 0;
@@ -743,48 +748,24 @@
/* tell bdcache that we are going to keep this one */
if (bdev)
atomic_inc(&bdev->bd_count);
-out:
return s;

out_fail:
s->s_dev = 0;
- s->s_bdev = 0;
+ if (!dev)
+ put_unnamed_dev(rdev);
+ s->s_bdev = NULL;
s->s_type = NULL;
unlock_super(s);
+ put_super(s);
+fail:
return NULL;
}

-/*
- * Unnamed block devices are dummy devices used by virtual
- * filesystems which don't use real block-devices. -- jrs
- */
-
-static unsigned int unnamed_dev_in_use[256/(8*sizeof(unsigned int))];
-
-kdev_t get_unnamed_dev(void)
-{
- int i;
-
- for (i = 1; i < 256; i++) {
- if (!test_and_set_bit(i,unnamed_dev_in_use))
- return MKDEV(UNNAMED_MAJOR, i);
- }
- return 0;
-}
-
-void put_unnamed_dev(kdev_t dev)
-{
- if (!dev || MAJOR(dev) != UNNAMED_MAJOR)
- return;
- if (test_and_clear_bit(MINOR(dev), unnamed_dev_in_use))
- return;
- printk("VFS: put_unnamed_dev: freeing unused device %s\n",
- kdevname(dev));
-}
-
static struct super_block *get_sb_bdev(struct file_system_type *fs_type,
char *dev_name, int flags, void * data)
{
+ DECLARE_MUTEX(mount_sem);
struct inode *inode;
struct block_device *bdev;
struct block_device_operations *bdops;
@@ -809,16 +790,18 @@
bdev = inode->i_bdev;
bdops = devfs_get_ops ( devfs_get_handle_from_inode (inode) );
if (bdops) bdev->bd_op = bdops;
+ dev = to_kdev_t(bdev->bd_dev);
/* Done with lookups, semaphore down */
down(&mount_sem);
- dev = to_kdev_t(bdev->bd_dev);
sb = get_super(dev);
if (sb) {
if (fs_type == sb->s_type &&
((flags ^ sb->s_flags) & MS_RDONLY) == 0) {
+ up(&mount_sem);
path_release(&nd);
return sb;
}
+ put_super(sb);
} else {
mode_t mode = FMODE_READ; /* we always need it ;-) */
if (!(flags & MS_RDONLY))
@@ -833,6 +816,7 @@
error = -EINVAL;
sb = read_super(dev, bdev, fs_type, flags, data, 0);
if (sb) {
+ up(&mount_sem);
get_filesystem(fs_type);
path_release(&nd);
return sb;
@@ -841,29 +825,21 @@
blkdev_put(bdev, BDEV_FS);
}
out:
- path_release(&nd);
up(&mount_sem);
+ path_release(&nd);
return ERR_PTR(error);
}

static struct super_block *get_sb_nodev(struct file_system_type *fs_type,
int flags, void * data)
{
- kdev_t dev;
- int error = -EMFILE;
- down(&mount_sem);
- dev = get_unnamed_dev();
- if (dev) {
- struct super_block * sb;
- error = -EINVAL;
- sb = read_super(dev, NULL, fs_type, flags, data, 0);
- if (sb) {
- get_filesystem(fs_type);
- return sb;
- }
- put_unnamed_dev(dev);
+ struct super_block * sb;
+ int error = -EINVAL;
+ sb = read_super(0, NULL, fs_type, flags, data, 0);
+ if (sb) {
+ get_filesystem(fs_type);
+ return sb;
}
- up(&mount_sem);
return ERR_PTR(error);
}

@@ -875,10 +851,11 @@
* Get the superblock of kernel-wide instance, but
* keep the reference to fs_type.
*/
- down(&mount_sem);
sb = fs_type->kern_mnt->mnt_sb;
if (!sb)
BUG();
+ atomic_inc(&sb->s_count);
+ down(&sb->s_umount);
get_filesystem(fs_type);
do_remount_sb(sb, flags, data);
return sb;
@@ -913,6 +890,7 @@
"Self-destruct in 5 seconds. Have a nice day...\n");
}

+ list_del(&sb->s_list);
dev = sb->s_dev;
sb->s_dev = 0; /* Free the superblock */
bdev = sb->s_bdev;
@@ -969,21 +947,18 @@

struct vfsmount *kern_mount(struct file_system_type *type)
{
- kdev_t dev = get_unnamed_dev();
struct super_block *sb;
struct vfsmount *mnt;
- if (!dev)
- return ERR_PTR(-EMFILE);
- sb = read_super(dev, NULL, type, 0, NULL, 0);
- if (!sb) {
- put_unnamed_dev(dev);
+ sb = read_super(0, NULL, type, 0, NULL, 0);
+ if (!sb)
return ERR_PTR(-EINVAL);
- }
mnt = add_vfsmnt(NULL, sb->s_root, NULL);
if (!mnt) {
kill_super(sb, 0);
+ put_super(sb);
return ERR_PTR(-ENOMEM);
}
+ put_super(sb);
type->kern_mnt = mnt;
return mnt;
}
@@ -993,9 +968,11 @@
void kern_umount(struct vfsmount *mnt)
{
struct super_block *sb = mnt->mnt_sb;
+ down(&sb->s_umount);
spin_lock(&dcache_lock);
remove_vfsmnt(mnt);
kill_super(sb, 0);
+ put_super(sb);
}

/*
@@ -1014,6 +991,8 @@
{
struct super_block * sb = mnt->mnt_sb;

+ down(&sb->s_umount);
+
/*
* No sense to grab the lock for this test, but test itself looks
* somewhat bogus. Suggestions for better replacement?
@@ -1033,6 +1012,7 @@
mntput(mnt);
if (!(sb->s_flags & MS_RDONLY))
retval = do_remount_sb(sb, MS_RDONLY, 0);
+ up(&sb->s_umount);
return retval;
}

@@ -1041,14 +1021,14 @@
if (mnt->mnt_instances.next != mnt->mnt_instances.prev) {
if (atomic_read(&mnt->mnt_count) > 2) {
spin_unlock(&dcache_lock);
- mntput(mnt);
- return -EBUSY;
+ goto busy;
}
if (sb->s_type->fs_flags & FS_SINGLE)
put_filesystem(sb->s_type);
/* We hold two references, so mntput() is safe */
mntput(mnt);
remove_vfsmnt(mnt);
+ put_super(sb);
return 0;
}
spin_unlock(&dcache_lock);
@@ -1084,18 +1064,15 @@
shrink_dcache_sb(sb);
fsync_dev(sb->s_dev);

- if (sb->s_root->d_inode->i_state) {
- mntput(mnt);
- return -EBUSY;
- }
+ if (sb->s_root->d_inode->i_state)
+ goto busy;

/* Something might grab it again - redo checks */

spin_lock(&dcache_lock);
if (atomic_read(&mnt->mnt_count) > 2) {
spin_unlock(&dcache_lock);
- mntput(mnt);
- return -EBUSY;
+ goto busy;
}

/* OK, that's the point of no return */
@@ -1103,7 +1080,12 @@
remove_vfsmnt(mnt);

kill_super(sb, umount_root);
+ put_super(sb);
return 0;
+busy:
+ mntput(mnt);
+ up(&sb->s_umount);
+ return -EBUSY;
}

/*
@@ -1141,9 +1123,7 @@

dput(nd.dentry);
/* puts nd.mnt */
- down(&mount_sem);
retval = do_umount(nd.mnt, 0, flags);
- up(&mount_sem);
goto out;
dput_and_out:
path_release(&nd);
@@ -1208,7 +1188,6 @@
if (old_nd.mnt->mnt_sb->s_type->fs_flags & FS_SINGLE)
get_filesystem(old_nd.mnt->mnt_sb->s_type);

- down(&mount_sem);
/* there we go */
down(&new_nd.dentry->d_inode->i_zombie);
if (IS_DEADDIR(new_nd.dentry->d_inode))
@@ -1216,7 +1195,6 @@
else if (add_vfsmnt(&new_nd, old_nd.dentry, old_nd.mnt->mnt_devname))
err = 0;
up(&new_nd.dentry->d_inode->i_zombie);
- up(&mount_sem);
if (err && old_nd.mnt->mnt_sb->s_type->fs_flags & FS_SINGLE)
put_filesystem(old_nd.mnt->mnt_sb->s_type);
out2:
@@ -1346,12 +1324,6 @@
if (!type_page || !memchr(type_page, 0, PAGE_SIZE))
return -EINVAL;

-#if 0 /* Can be deleted again. Introduced in patch-2.3.99-pre6 */
- /* loopback mount? This is special - requires fewer capabilities */
- if (strcmp(type_page, "bind")==0)
- return do_loopback(dev_name, dir_name);
-#endif
-
/* for the rest we _really_ need capabilities... */
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
@@ -1368,7 +1340,7 @@
if (retval)
goto fs_out;

- /* get superblock, locks mount_sem on success */
+ /* get superblock, locks ->s_umount on success */
if (fstype->fs_flags & FS_NOMOUNT)
sb = ERR_PTR(-EINVAL);
else if (fstype->fs_flags & FS_REQUIRES_DEV)
@@ -1405,7 +1377,7 @@
goto fail;
retval = 0;
unlock_out:
- up(&mount_sem);
+ put_super(sb);
dput_out:
path_release(&nd);
fs_out:
@@ -1478,26 +1450,19 @@
fs_type = get_fs_type("nfs");
if (!fs_type)
goto no_nfs;
- ROOT_DEV = get_unnamed_dev();
- if (!ROOT_DEV)
- /*
- * Your /linuxrc sucks worse than MSExchange - that's the
- * only way you could run out of anon devices at that point.
- */
- goto no_anon;
data = nfs_root_data();
if (!data)
- goto no_server;
- sb = read_super(ROOT_DEV, NULL, fs_type, root_mountflags, data, 1);
- if (sb)
+ goto no_anon;
+ sb = read_super(0, NULL, fs_type, root_mountflags, data, 1);
+ if (sb) {
/*
* We _can_ fail there, but if that will happen we have no
* chance anyway (no memory for vfsmnt and we _will_ need it,
* no matter which fs we try to mount).
*/
+ ROOT_DEV = sb->s_dev;
goto mount_it;
-no_server:
- put_unnamed_dev(ROOT_DEV);
+ }
no_anon:
put_filesystem(fs_type);
no_nfs:
@@ -1607,12 +1572,12 @@
}
else
vfsmnt = add_vfsmnt(NULL, sb->s_root, "/dev/root");
- /* FIXME: if something will try to umount us right now... */
if (vfsmnt) {
set_fs_root(current->fs, vfsmnt, sb->s_root);
set_fs_pwd(current->fs, vfsmnt, sb->s_root);
if (bdev)
bdput(bdev); /* sb holds a reference */
+ put_super(sb);
return;
}
panic("VFS: add_vfsmnt failed for root fs");
@@ -1697,7 +1662,6 @@
root_mnt = mntget(current->fs->rootmnt);
root = dget(current->fs->root);
read_unlock(&current->fs->lock);
- down(&mount_sem);
down(&old_nd.dentry->d_inode->i_zombie);
error = -ENOENT;
if (IS_DEADDIR(new_nd.dentry->d_inode))
@@ -1732,7 +1696,6 @@
error = 0;
out2:
up(&old_nd.dentry->d_inode->i_zombie);
- up(&mount_sem);
dput(root);
mntput(root_mnt);
path_release(&old_nd);
@@ -1765,10 +1728,8 @@
if (devfs_nd.mnt->mnt_sb->s_magic == DEVFS_SUPER_MAGIC &&
devfs_nd.dentry == devfs_nd.mnt->mnt_root) {
dput(devfs_nd.dentry);
- down(&mount_sem);
/* puts devfs_nd.mnt */
do_umount(devfs_nd.mnt, 0, 0);
- up(&mount_sem);
} else
path_release(&devfs_nd);
}
diff -urN rc13-pre6/fs/udf/inode.c rc13-pre6-s_lock/fs/udf/inode.c
--- rc13-pre6/fs/udf/inode.c Tue Dec 12 02:10:46 2000
+++ rc13-pre6-s_lock/fs/udf/inode.c Sat Dec 30 09:17:29 2000
@@ -203,7 +203,6 @@
udf_release_data(bh);

inode->i_data.a_ops->writepage(page);
- UnlockPage(page);
page_cache_release(page);

mark_inode_dirty(inode);
diff -urN rc13-pre6/fs/ufs/super.c rc13-pre6-s_lock/fs/ufs/super.c
--- rc13-pre6/fs/ufs/super.c Sun Sep 17 12:15:20 2000
+++ rc13-pre6-s_lock/fs/ufs/super.c Sat Dec 30 09:17:29 2000
@@ -230,9 +230,6 @@
va_start (args, fmt);
vsprintf (error_buf, fmt, args);
va_end (args);
- /* this is to prevent panic from syncing this filesystem */
- if (sb->s_lock)
- sb->s_lock = 0;
sb->s_flags |= MS_RDONLY;
printk (KERN_CRIT "UFS-fs panic (device %s): %s: %s\n",
kdevname(sb->s_dev), function, error_buf);
diff -urN rc13-pre6/include/linux/fs.h rc13-pre6-s_lock/include/linux/fs.h
--- rc13-pre6/include/linux/fs.h Sat Dec 30 07:37:54 2000
+++ rc13-pre6-s_lock/include/linux/fs.h Sat Dec 30 09:44:34 2000
@@ -667,7 +667,6 @@
kdev_t s_dev;
unsigned long s_blocksize;
unsigned char s_blocksize_bits;
- unsigned char s_lock;
unsigned char s_dirt;
struct file_system_type *s_type;
struct super_operations *s_op;
@@ -675,7 +674,9 @@
unsigned long s_flags;
unsigned long s_magic;
struct dentry *s_root;
- wait_queue_head_t s_wait;
+ struct semaphore s_umount;
+ struct semaphore s_lock; /* probably goner */
+ atomic_t s_count;

struct list_head s_dirty; /* dirty inodes */
struct list_head s_files;
@@ -1070,6 +1071,7 @@
extern void write_inode_now(struct inode *, int);
extern void sync_dev(kdev_t);
extern int fsync_dev(kdev_t);
+extern void invalidate_dev(kdev_t, int);
extern int fsync_inode_buffers(struct inode *);
extern int osync_inode_buffers(struct inode *);
extern int inode_has_buffers(struct inode *);
@@ -1265,7 +1267,16 @@
extern struct file_system_type *get_fs_type(const char *name);
extern struct super_block *get_super(kdev_t);
struct super_block *get_empty_super(void);
-extern void put_super(kdev_t);
+extern void put_super(struct super_block *sb);
+static inline int is_mounted(kdev_t dev)
+{
+ struct super_block *sb = get_super(dev);
+ if (sb) {
+ put_super(sb);
+ return 1;
+ }
+ return 0;
+}
unsigned long generate_cluster(kdev_t, int b[], int);
unsigned long generate_cluster_swab32(kdev_t, int b[], int);
extern kdev_t ROOT_DEV;
diff -urN rc13-pre6/include/linux/locks.h rc13-pre6-s_lock/include/linux/locks.h
--- rc13-pre6/include/linux/locks.h Sat Dec 16 12:30:30 2000
+++ rc13-pre6-s_lock/include/linux/locks.h Sat Dec 30 09:44:35 2000
@@ -39,30 +39,15 @@
* a super-block (although even this isn't done right now.
* nfs may need it).
*/
-extern void __wait_on_super(struct super_block *);
-
-extern inline void wait_on_super(struct super_block * sb)
-{
- if (sb->s_lock)
- __wait_on_super(sb);
-}

extern inline void lock_super(struct super_block * sb)
{
- if (sb->s_lock)
- __wait_on_super(sb);
- sb->s_lock = 1;
+ down(&sb->s_lock);
}

extern inline void unlock_super(struct super_block * sb)
{
- sb->s_lock = 0;
- /*
- * No need of any barrier, we're protected by
- * the big kernel lock here... unfortunately :)
- */
- if (waitqueue_active(&sb->s_wait))
- wake_up(&sb->s_wait);
+ up(&sb->s_lock);
}

#endif /* _LINUX_LOCKS_H */
diff -urN rc13-pre6/include/linux/quotaops.h rc13-pre6-s_lock/include/linux/quotaops.h
--- rc13-pre6/include/linux/quotaops.h Sat Dec 16 12:52:02 2000
+++ rc13-pre6-s_lock/include/linux/quotaops.h Sat Dec 30 09:58:07 2000
@@ -21,7 +21,6 @@
*/
extern void dquot_initialize(struct inode *inode, short type);
extern void dquot_drop(struct inode *inode);
-extern void invalidate_dquots(kdev_t dev, short type);
extern int quota_off(struct super_block *sb, short type);
extern int sync_dquots(kdev_t dev, short type);

diff -urN rc13-pre6/kernel/ksyms.c rc13-pre6-s_lock/kernel/ksyms.c
--- rc13-pre6/kernel/ksyms.c Sat Dec 30 07:37:55 2000
+++ rc13-pre6-s_lock/kernel/ksyms.c Sat Dec 30 09:17:30 2000
@@ -127,6 +127,8 @@
EXPORT_SYMBOL(update_atime);
EXPORT_SYMBOL(get_fs_type);
EXPORT_SYMBOL(get_super);
+EXPORT_SYMBOL(put_super);
+EXPORT_SYMBOL(invalidate_dev);
EXPORT_SYMBOL(get_empty_super);
EXPORT_SYMBOL(getname);
EXPORT_SYMBOL(names_cachep);
@@ -472,7 +474,6 @@

/* Added to make file system as module */
EXPORT_SYMBOL(sys_tz);
-EXPORT_SYMBOL(__wait_on_super);
EXPORT_SYMBOL(file_fsync);
EXPORT_SYMBOL(fsync_inode_buffers);
EXPORT_SYMBOL(clear_inode);

2000-12-30 20:18:32

[permalink] [raw]

Subject: [RFC] Generic deferred file writing

When I saw you put in the if (PageDirty) -->writepage and related code
over the last couple of weeks I was wondering if you realize how close
we are to having generic deferred file writing in the VFS. I took some
time today to code this little hack and it comes awfully close to doing
the job. However, *** Warning, do not run this on a machine you care
about, it will mess it up ***.

The advantages of deferred file writing are pretty obvious. Right now
we are deferring just the writeout of data to the disk, but we can also
defer the disk mapping, so that metadata blocks don't have to stay
around in cache waiting for data blocks to get mapped into them one at
a time - a whole group can be done in one flush. There is more scope
for the filesystem to optimize allocation - we just need some kind of
->get_blocks call that maps n file blocks in one operation. Sometimes
we'll be able to write files, read them back and erase them without
ever hitting the disk, or even the filesystem. Sequences of short
writes will cost a lot less cpu and hit less cache. There are probably
other advantages as well.

For this to work properly there has to be an analog of kflushd for
pages. We don't have that yet, and it's some distance away - so I'm
not even beginning to suggest this is a 2.4.0 thing.

In the patch below I didn't try to write the error handling exactly
right and the indenting is wrong. It's just meant to show the idea
clearly. Instead of mapping each file page to storage as we normally do
in generic_file_write we just mark it dirty and let page_launder or
Marcelo's new flushing code call the filesystem later.

This code worked well enough to copy some files and directories in uml,
and still have them there after a reboot. Beyond that - patching it
into test13-pre5 on a real machine resulted in a toasted passwd file,
so there's some work to do yet.

Here it is...

--- 2.4.0-test12.clean/mm/filemap.c Sat Dec 9 09:13:23 2000
+++ 2.4.0-test12/mm/filemap.c Sat Dec 30 19:56:24 2000
@@ -2449,6 +2449,16 @@
PAGE_BUG(page);
}

+ if (!offset && bytes == PAGE_SIZE)
+ {
+ kaddr = page_address(page);
+ status = copy_from_user(kaddr+offset, buf, bytes);
+ flush_dcache_page(page);
+ SetPageUptodate(page);
+ SetPageDirty(page); /* set_page_dirty in test13 */
+ }
+ else
+ {
status = mapping->a_ops->prepare_write(file, page, offset, offset+bytes);
if (status)
goto unlock;
@@ -2458,6 +2468,7 @@
if (status)
goto fail_write;
status = mapping->a_ops->commit_write(file, page, offset, offset+bytes);
+ }
if (!status)
status = bytes;

2000-12-30 20:36:36

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sat, 30 Dec 2000, Daniel Phillips wrote:
>
> When I saw you put in the if (PageDirty) -->writepage and related code
> over the last couple of weeks I was wondering if you realize how close
> we are to having generic deferred file writing in the VFS.

I'm very aware of it indeed.

However, it does break various common assumptions, one of them being
proper error handling. Things like proper detection of quota overflows and
even simple "out of disk space" issues.

One of the main advantages of deferred writing would be that we could do
temp-files without ever actually doing most of the low-level filesystem
block allocation, but in order to get that advantage we really need to
handle the out-of-disk case gracefully.

I considered doing something like this as a mount option, so that people
could decide on their own whether they want a safe filesystem, or whether
it's ok to do deferred writes. People might find it worth it for /tmp, but
might be unwilling to use it for /var/spool/mail, for example.

(Hmm.. It might be perfectly fine for /vsr/spool/mail - mail delivery
tends to be really careful about doing "fsync()" etc and actually pick up
the errors that way. HOWEVER, before doing that you should expand the
writepage logic to set the page "error" bit for when it fails to write out
a full page - right now we just lose the error completely).

Linus

2000-12-30 20:37:57

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sat, 30 Dec 2000, Daniel Phillips wrote:

> When I saw you put in the if (PageDirty) -->writepage and related code
> over the last couple of weeks I was wondering if you realize how close
> we are to having generic deferred file writing in the VFS. I took some
> time today to code this little hack and it comes awfully close to doing
> the job. However, *** Warning, do not run this on a machine you care
> about, it will mess it up ***.
>
> The advantages of deferred file writing are pretty obvious. Right now
> we are deferring just the writeout of data to the disk, but we can also
> defer the disk mapping, so that metadata blocks don't have to stay
> around in cache waiting for data blocks to get mapped into them one at
> a time - a whole group can be done in one flush.

Except that we've got file-expanding writes outside of ->i_sem. Thanks, but
no thanks.

2000-12-30 20:53:12

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sat, 30 Dec 2000, Alexander Viro wrote:
>
> Except that we've got file-expanding writes outside of ->i_sem. Thanks, but
> no thanks.

No, Al, the file size is still updated inside i_sem.

Yes, it will do actual block allocation outside i_sem, but that is already
true of any mmap'ed writes, and has been true for a long long time. So if
we have a bug here (and I don't think we have one), it's not something
new. But the inode semaphore doesn't protect the balloc() data structures
anyway, as they are filesystem-global.

If you're nervous about the effects of "truncate()", then that should be
handled properly by truncate_inode_pages().

In short, I don't see _those_ kinds of issues. I do see error reporting as
a major issue, though. If we need to do proper low-level block allocation
in order to get correct ENOSPC handling, then the win from doing deferred
writes is not very big.

Linus

2000-12-30 21:41:49

by Andreas Dilger

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

Linus writes:
> In short, I don't see _those_ kinds of issues. I do see error reporting as
> a major issue, though. If we need to do proper low-level block allocation
> in order to get correct ENOSPC handling, then the win from doing deferred
> writes is not very big.

It should be relatively light-weight to call into the filesystem simply
to allocate a "count" of blocks needed for the current file. It may
even be possible to do delayed inode allocation. This would defer
the inode/block bitmap searching/allocation on ext2 until the file
was actually written - only the free_blocks/free_inodes count in the
superblock would be decremented, and we would get ENOSPC immediately
if we don't have enough space for the file. On fsck, these values are
recalculated from the group descriptors on ext2, so it wouldn't be a
problem if the system crashed with pre-allocated blocks.

It would definitely be a win on ext2 and XFS, and if it isn't possible
on other filesystems, it should at least not be a loss.

We would need to ensure we also keep enough space for indirect blocks
and such, so we need to pass more information than just the number of
blocks added (i.e. how big the file already is).

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2000-12-30 22:17:36

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sat, 30 Dec 2000, Linus Torvalds wrote:

>
>
> On Sat, 30 Dec 2000, Alexander Viro wrote:
> >
> > Except that we've got file-expanding writes outside of ->i_sem. Thanks, but
> > no thanks.
>
> No, Al, the file size is still updated inside i_sem.

Then we are screwed. Look: we call write(). Twice. The second call happens
to overflow the quota. Getting the second chunk of data written and the
first one ending up as a hole is the last thing you would expect, isn't it?

> In short, I don't see _those_ kinds of issues. I do see error reporting as
> a major issue, though. If we need to do proper low-level block allocation
> in order to get correct ENOSPC handling, then the win from doing deferred
> writes is not very big.

Well, see above. I'm pretty nervous about breaking the ordering of metadata
allocation. For pageout() we don't have such ordering. For write() we
certainly do. Notice that reserving disk space upon write() and eating it
later is _very_ messy job - you'll have to take care of situations when
we reserve the space upon write() and get pageout do the real allocation.
Not nice, since pageout has no way in hell to tell whether it is eating
from a reserved area or just flushing the mmaped one. We could keep the
per-bh "reserved" flag to fold that information into the pagecache, but
IMO it's simply not worth the trouble. If some filesystems wants that -
hey, it can do that right now. Just make ->prepare_write() do reservations
and let ->commit_write() mark the page dirty. Then ->writepage() will
eventually flush it.

Again, if one is willing to implement reservation on block level - fine,
there is no need to change anything in VFS or VM. I certainly don't want
to mess with that, but hey, if somebody is into masochism - let them.

2000-12-30 22:43:14

by ebiederman

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

Linus Torvalds <[email protected]> writes:

> In short, I don't see _those_ kinds of issues. I do see error reporting as
> a major issue, though. If we need to do proper low-level block allocation
> in order to get correct ENOSPC handling, then the win from doing deferred
> writes is not very big.

To get ENOSPC handling 99% correct all we need to do is decrement a counter,
that remembers how many disks blocks are free. If we need a better
estimate than just the data blocks it should not be hard to add an
extra callback to the filesystem.

There look to be some interesting cases to handle when we fill up a
filesystem. Before actually failing and returning ENOSPC the
filesystem might want to fsync itself. And see how correct it's
estimates were. But that is the rare case and shouldn't affect
performance.

<rant>
In the long term VFS support for deferred writes looks like a major
win. Knowing how large a file is before we write it to disk allows
very efficient disk organization, and fast file access (esp combined
with an extent based fs). Support for compressing files in real time
falls out naturally. Support for filesystems maintain coherency by
never writing the same block back to the same disk location also
appears.
</rant>

One other thing to think about for the VFS/MM layer is limiting the
total number of dirty pages in the system (to what disk pressure shows
the disk can handle), to keep system performance smooth when swapping.
All cases except mmaped files are easy, and they can be handled by a
modified page fault handler that directly puts the dirty bit on the
struct page. (Except that is buggy with respect to clearing the dirty
bit on the struct page.) In reality we would have to create a queue
of pointers to dirty pte's from the page fault handler and depending
on a timer or memory pressure move the dirty bits to the actual page.

Combined with the code to make sync and fsync to work on the page
cache we msync would be obsolete?

Of course the most important part is that when all of that is
working, the VFS/MM layer it would be perfect. World domination
would be achieved. For who would be caught using an OS with an
imperfect VFS layer :)

Eric

2000-12-30 23:16:19

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On 30 Dec 2000, Eric W. Biederman wrote:
>
> One other thing to think about for the VFS/MM layer is limiting the
> total number of dirty pages in the system (to what disk pressure shows
> the disk can handle), to keep system performance smooth when swapping.

This is a separate issue, and I think that it is most closely tied in to
the "RSS limit" kind of patches because of the memory mapping issues. If
you've seen the RSS rlimit patch (it's been posted a few times this week),
then you could think of that modified by a "Resident writable pages Set
Size" approach. Not just for shared mappings - this is also an issue with
limiting swapout.

(I actually don't think that RSS is all that interesting, it's really the
"potentially dirty RSS" that counts for VM behaviour - everything else can
be dropped easily enough)

Linus

2000-12-31 00:02:11

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sat, 30 Dec 2000, Alexander Viro wrote:
> Well, see above. I'm pretty nervous about breaking the ordering of metadata
> allocation. For pageout() we don't have such ordering. For write() we
> certainly do. Notice that reserving disk space upon write() and eating it
> later is _very_ messy job - you'll have to take care of situations when
> we reserve the space upon write() and get pageout do the real allocation.
> Not nice, since pageout has no way in hell to tell whether it is eating
> from a reserved area or just flushing the mmaped one. We could keep the
> per-bh "reserved" flag to fold that information into the pagecache, but
> IMO it's simply not worth the trouble. If some filesystems wants that -
> hey, it can do that right now. Just make ->prepare_write() do reservations
> and let ->commit_write() mark the page dirty. Then ->writepage() will
> eventually flush it.

This is a refinement of the idea and some abstraction like that is
clearly needed, and maybe that is exactly the right one. For now I'm
interested in putting this on the table so that we can check the
stability and performance, maybe uncover come more bugs, then start
going after some of the things that need to be done to turn it into a
useful option.

P.S., I humbly apologize for writing (!offset && bytes == PAGE_SIZE)
when I could have just written (bytes == PAGE_SIZE).

--
Daniel

2000-12-31 01:33:32

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sat, Dec 30, 2000 at 03:00:43PM -0700, Eric W. Biederman wrote:
> To get ENOSPC handling 99% correct all we need to do is decrement a counter,
> that remembers how many disks blocks are free. If we need a better

Yes, we need to add one field to the in-core superblock to do this accounting.

> estimate than just the data blocks it should not be hard to add an
> extra callback to the filesystem.

Yes, I was thinking at this callback too. Such a callback is nearly the only
support we need from the filesystem to provide allocate on flush. Allocate on
flush is a pagecache issue, not really a filesystem issue. When a filesystem
doesn't implement such callback we can simply get_block(create) at pagecache
creation time as usual.

Andrea

2000-12-31 01:44:25

by Chris Wedgwood

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sun, Dec 31, 2000 at 02:02:34AM +0100, Andrea Arcangeli wrote:

Yes, we need to add one field to the in-core superblock to do
this accounting.

How does this work for filesystems like reisefs which do tail merging
and other filesystems which might do sub-clock allocation?

We either need more than one field (or a byte counter) + lock or
perhaps a generic fields + callback to the fs itself.

> estimate than just the data blocks it should not be hard to add
> an extra callback to the filesystem.

Yes, I was thinking at this callback too. Such a callback is
nearly the only support we need from the filesystem to provide
allocate on flush.

I think a callback is the way to go.

--cw

2000-12-31 01:48:55

by ebiederman

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

Linus Torvalds <[email protected]> writes:

> On 30 Dec 2000, Eric W. Biederman wrote:
> >
> > One other thing to think about for the VFS/MM layer is limiting the
> > total number of dirty pages in the system (to what disk pressure shows
> > the disk can handle), to keep system performance smooth when swapping.
>
> This is a separate issue, and I think that it is most closely tied in to
> the "RSS limit" kind of patches because of the memory mapping issues. If
> you've seen the RSS rlimit patch (it's been posted a few times this week),
> then you could think of that modified by a "Resident writable pages Set
> Size" approach.

Building on the RSS limit approach sounds much simpler then they way
I was thinking.

> Not just for shared mappings - this is also an issue with
> limiting swapout.
>
> (I actually don't think that RSS is all that interesting, it's really the
> "potentially dirty RSS" that counts for VM behaviour - everything else can
> be dropped easily enough)

Definitely.

Now the only tricky bit is how do we sense when we are overloading
the swap disks. Well that is the next step. I'll take a look
and see what it takes to keep statistics on dirty mapped pages.

Eric

2000-12-31 02:21:53

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sun, 31 Dec 2000, Andrea Arcangeli wrote:

> On Sat, Dec 30, 2000 at 03:00:43PM -0700, Eric W. Biederman wrote:
> > To get ENOSPC handling 99% correct all we need to do is decrement a counter,
> > that remembers how many disks blocks are free. If we need a better
>
> Yes, we need to add one field to the in-core superblock to do this accounting.

And its meaning for 2/3 of filesystems would be?

> > estimate than just the data blocks it should not be hard to add an
> > extra callback to the filesystem.
>
> Yes, I was thinking at this callback too. Such a callback is nearly the only
> support we need from the filesystem to provide allocate on flush. Allocate on
> flush is a pagecache issue, not really a filesystem issue. When a filesystem

I _doubt_ it. If it is a pagecache issue it should apply to NFS. It should
apply to ramfs. It should apply to helluva lot of filesystems that are not
block-based. Pagecache doesn't (and shouldn't) know about blocks.

> doesn't implement such callback we can simply get_block(create) at pagecache
> creation time as usual.

Umm... You do realize that get_block is _not_ visible on pagecache level?
Sure thing, filesystem should be free to use whatever functions it wants
for address_space methods. No arguments here. It should be able whatever
callbacks these functions expect. If filesystem doesn't implement reservation
it should use functions that do not expect such argument. That's it. No
need to invent new methods or shoehorn all block filesystems into the same
scheme.

2000-12-31 02:45:45

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

Hi,

On Sun, 31 Dec 2000, Andrea Arcangeli wrote:

> > estimate than just the data blocks it should not be hard to add an
> > extra callback to the filesystem.
>
> Yes, I was thinking at this callback too. Such a callback is nearly the only
> support we need from the filesystem to provide allocate on flush.

Actually the getblock function could be split into 3 functions:
- alloc_block: mostly just decrementing a counter (and quota)
- get_block: allocating a block from the bitmap
- commit_block: inserting the new block into the inode

This would be really useful for streaming, one could get as fast as
possible the block number and the data could be very quickly written,
while keeping the cache usage low. Or streaming directly from a device
to disk also wants to get rid of the data as fast as possible.

bye, Roman

2000-12-31 02:59:52

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sun, 31 Dec 2000, Roman Zippel wrote:
>
> On Sun, 31 Dec 2000, Andrea Arcangeli wrote:
>
> > > estimate than just the data blocks it should not be hard to add an
> > > extra callback to the filesystem.
> >
> > Yes, I was thinking at this callback too. Such a callback is nearly the only
> > support we need from the filesystem to provide allocate on flush.
>
> Actually the getblock function could be split into 3 functions:
> - alloc_block: mostly just decrementing a counter (and quota)
> - get_block: allocating a block from the bitmap
> - commit_block: inserting the new block into the inode
>
> This would be really useful for streaming, one could get as fast as
> possible the block number and the data could be very quickly written,
> while keeping the cache usage low. Or streaming directly from a device
> to disk also wants to get rid of the data as fast as possible.

Now, to insert a small note of sanity here: I think people are starting to
overdesign stuff.

The fact is that currently the "get_block()" interface that the page cache
helper functions use does NOT have to be very expensive at all.

In fact, in a properly designed filesystem just a bit of low-level caching
would easily make the average "get_block()" be very fast indeed. The fact
that right now ext2 has not been optimized for this is _not_ a reason to
design the VFS layer around a slow get_block() operation.

If you look at the generic block-based writing routines, they are actually
not all that expensive. Any kind of complication is only going to make
those functions more complex, and any performance gained could easily be
lost in extra complexity.

There are only two real advantages to deferred writing:

- not having to do get_block() at all for temp-files, as we never have to
do the allocation if we end up removing the file.

NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota errors
right is quite possibly big enough that this advantage is possibly very
questionable. It's very possible that people could speed things up
using this approach, but I also suspect that it is equally (if not
more) possible to speed things up by just making sure that the
low-level FS has a fast get_block().

- Using "global" access patterns to do a better job of "get_block()", ie
taking advantage of issues with journalling etc and deferring the write
in order to get a bigger journal.

The second point is completely different, and THIS is where I think there
are potentially real advantages. However, I also think that this is not
actually about deferred writes at all: it's really a question of the
filesystem having access to the page when the physical write is actually
started so that the filesystem might choose to _change_ the allocation it
did - it might have allocated a backing store block at "get_block()" time,
but by the time it actually writes the stuff out to disk it may have
allocated a bigger contiguous area somewhere else for the data..

I really think that the #2 thing is the more interesting one, and that
anybody looking at ext2 should look at just improving the locking and
making the block allocation functions run faster. Which should not be all
that difficult - the last time I looked at the thing it was quite
horrible.

Linus

2000-12-31 03:05:12

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sat, Dec 30, 2000 at 08:50:52PM -0500, Alexander Viro wrote:
> And its meaning for 2/3 of filesystems would be?

It should stay in the private part of the in-core superblock of course.

> I _doubt_ it. If it is a pagecache issue it should apply to NFS. It should
> apply to ramfs. It should apply to helluva lot of filesystems that are not
> block-based. Pagecache doesn't (and shouldn't) know about blocks.

With pagecache I meant the library of pagecache methods in buffer.c. Even
if they are recalled by the lowlevel filesystem code and they can be
overridden by lowlevel filesystem code, they aren't lowlevel filesystem code
but they're infact common code. We can implement another version of them that
instead of knowing about get_block, also know about another filesystem
callback and when possible it only reserve the space for a delayed allocation
later triggered (in parallel) by future kupdate. They will know about this new
callback in the same way the current standard pagecache library methods knows
about get_block_t. Filesystems implementing this callback will be able to use
those new pagecache library methods.

> it should use functions that do not expect such argument. That's it. No
> need to invent new methods or shoehorn all block filesystems into the same
> scheme.

Of course.

Andrea

2000-12-31 13:36:46

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

Hi,

On Sat, 30 Dec 2000, Linus Torvalds wrote:

> In fact, in a properly designed filesystem just a bit of low-level caching
> would easily make the average "get_block()" be very fast indeed. The fact
> that right now ext2 has not been optimized for this is _not_ a reason to
> design the VFS layer around a slow get_block() operation.
> [..]
> The second point is completely different, and THIS is where I think there
> are potentially real advantages. However, I also think that this is not
> actually about deferred writes at all: it's really a question of the
> filesystem having access to the page when the physical write is actually
> started so that the filesystem might choose to _change_ the allocation it
> did - it might have allocated a backing store block at "get_block()" time,
> but by the time it actually writes the stuff out to disk it may have
> allocated a bigger contiguous area somewhere else for the data..
>
> I really think that the #2 thing is the more interesting one, and that
> anybody looking at ext2 should look at just improving the locking and
> making the block allocation functions run faster. Which should not be all
> that difficult - the last time I looked at the thing it was quite
> horrible.

What makes get_block business complicated now, is that can be called
recursively: get_block needs to allocate something, what might start new
i/o which calls again get_block.
Writing dirty pages should be a real asynchronous process, but it isn't
right now, as get_block is synchronous. Making get_block asynchronous is
almost impossible, so one usually does it in a separate thread.
So IMO something like this should happen: dirty pages should be put on a
separate list and a thread takes these pages and allocates the buffers for
them and starts the i/o. This had another advantage: get_block wouldn't
really need to do preallocation anymore, the get_block function could work
instead on a number of pages (preallocation would instead happen in the
page cache).
This could make the get_block function and the needed locking very simple,
e.g. one could use a simple semaphore instead of kernel_lock to protect
getting of multiple blocks instead of only one. Also splitting it into
several tasks can make it faster, so in one step we just do the resource
allocation to guarantee the write, in a separate step we do the real
allocation. If this is done for several pages at once, it can be very
fast and simple.

bye, Roman

2000-12-31 15:11:57

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sat, Dec 30, 2000 at 06:28:39PM -0800, Linus Torvalds wrote:
> There are only two real advantages to deferred writing:
>
> - not having to do get_block() at all for temp-files, as we never have to
> do the allocation if we end up removing the file.
>
> NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota errors
> right is quite possibly big enough that this advantage is possibly very
> questionable. It's very possible that people could speed things up
> using this approach, but I also suspect that it is equally (if not
> more) possible to speed things up by just making sure that the
> low-level FS has a fast get_block().

get_block for large files can be improved using extents, but how can we
implement a fast get_block without restructuring the on-disk format of the
filesystem? (in turn using another filesystem instead of ext2?)

get_block needs to walk all level of inode metadata indirection if they
exists. It has to map the logical page from its (inode) address space to the
physical blocks. If those indirection blocks aren't in cache it has to block
to read them. It doesn't matter how it is actually implemented in core. And
then later as you say those allocated blocks could never get written because
the file may be deleted in the meantime.

With allocate on flush we can run the slow get_block in parallel
asynchronously using a kernel daemon after the page flushtime timeout
triggers. It looks nicer to me. The in-core overhead of the reserved
blocks for delayed allocation should be not relevant at all (and it also
should not need the big kernel lock making the whole write path big lock
free).

Andrea

2000-12-31 17:04:29

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sun, 31 Dec 2000, Andrea Arcangeli wrote:
>
> get_block for large files can be improved using extents, but how can we
> implement a fast get_block without restructuring the on-disk format of the
> filesystem? (in turn using another filesystem instead of ext2?)

By doing a better job of caching stuff.

There are multiple levels of caching here. One issue is the question of
"allocate a new block". Go and look how the ext2 block pre-allocation
works, and cry. It should _not_ be a loop that sets one bit at a time, it
should be something that notices when the (u32 *) is zero and grabs the
whole 32 blocks in one go. Instead of defaulting to a pre-allocation of 8
blocks, doing that same expensive thing much too often.

The other thing is that one of the common cases for writing is consecutive
writing to the end of the file. Now, you figure it out: if get_block()
really is a bottle-neck, why not cache the last tree lookup? You'd get a
99% hitrate for that common case.

You should realize that all the block allocation etc code was written for
a very different VFS layer. "get_block()" didn't even exist. We didn't
have SMP issues. We had very different access patterns (the virtual caches
in the page cache makes the accesses to "get_block()" very different, as
the VFS layer keeps track of man mappings on its own.

And get_block() was basically tacked on top of the old code. Al Viro did a
good job of cleaning up some of the issues, but go and look at get_block()
and follow it all the way down into "ext2_new_block()" and back, and I bet
you'll ask yourself why it's so complex. And wonder if it couldn't just be
cleaned up and sped up a lot that way.

Linus

2000-12-31 17:21:05

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sun, Dec 31, 2000 at 08:33:01AM -0800, Linus Torvalds wrote:
> By doing a better job of caching stuff.

Caching can happen after we are been slow and we waited for I/O synchronously
the first time (bread).

How can we optimize the first time (when the indirect blocks are out of buffer
cache) without changing the on-disk format? We can't as far I can see.

It's of course fine to optimize the address_space->physical_block resolver
algorithm, because it has to run anyways if we want to write such data to disk
eventually (despite it's asynchronous with allocate on flush, or synchronous
like now). Probably it's a more sensible optimization than the allocate
on flush thing. But still being able to run the resolver in an asynchronous
manner, in parallel, only at the time we need to flush the page to disk, looks
nicer behaviour to me.

Andrea

2000-12-31 17:22:45

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sun, 31 Dec 2000, Linus Torvalds wrote:

> The other thing is that one of the common cases for writing is consecutive
> writing to the end of the file. Now, you figure it out: if get_block()
> really is a bottle-neck, why not cache the last tree lookup? You'd get a
> 99% hitrate for that common case.

Because it is not a bottleneck. The _real_ bottleneck is in ext2_new_block().
Try to profile it and you'll see.

We could diddle with ext2_get_block(). No arguments. But the real source of
PITA is balloc.c, not inode.c. Look at the group descriptor cache code and
weep. That, and bitmaps handling.

2000-12-31 17:43:18

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sun, 31 Dec 2000, Alexander Viro wrote:
>
> On Sun, 31 Dec 2000, Linus Torvalds wrote:
>
> > The other thing is that one of the common cases for writing is consecutive
> > writing to the end of the file. Now, you figure it out: if get_block()
> > really is a bottle-neck, why not cache the last tree lookup? You'd get a
> > 99% hitrate for that common case.
>
> Because it is not a bottleneck. The _real_ bottleneck is in ext2_new_block().
> Try to profile it and you'll see.
>
> We could diddle with ext2_get_block(). No arguments. But the real source of
> PITA is balloc.c, not inode.c. Look at the group descriptor cache code and
> weep. That, and bitmaps handling.

I'm not surprised. Just doign pre-allocation 32 blocks at a time would
probably help. But that code really should be re-written, I think.

Linus

2000-12-31 19:03:43

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

Linus Torvalds wrote:
> There are only two real advantages to deferred writing:
>
> - not having to do get_block() at all for temp-files, as we never have to
> do the allocation if we end up removing the file.
>
> NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota errors
> right is quite possibly big enough that this advantage is possibly very
> questionable. It's very possible that people could speed things up
> using this approach, but I also suspect that it is equally (if not
> more) possible to speed things up by just making sure that the
> low-level FS has a fast get_block().

It's not that hard or inefficient to return the ENOSPC from the usual
point. For example, make a gross overestimate of the space needed for
the write, compare to a cached filesystem free space value less the
amount deferred so far, and fail to take the optimization if it looks
even close. Also, it's not necessarily bad to defer the ENOSPC to file
close time. The same thing can happen with failed disk io, and that's
just a fact of life with asynchronous io. A reliable program needs to
be able to deal with it. (I think there was a long thread on this not
long ago, regarding filesystem errors returned after a program exits.)

> - Using "global" access patterns to do a better job of "get_block()", ie
> taking advantage of issues with journalling etc and deferring the write
> in order to get a bigger journal.

Another nicety is not having to bother the filesystem at all about
sequences of short writes.

> The second point is completely different, and THIS is where I think there
> are potentially real advantages. However, I also think that this is not
> actually about deferred writes at all: it's really a question of the
> filesystem having access to the page when the physical write is actually
> started so that the filesystem might choose to _change_ the allocation it
> did - it might have allocated a backing store block at "get_block()" time,
> but by the time it actually writes the stuff out to disk it may have
> allocated a bigger contiguous area somewhere else for the data..

You'd most likely be able to measure the overhead under load of doing
the allocation twice, due to the occasional need to read back a
swapped-out metadata block.

XFS does deferred allocation and the Reiser guys are talking about it -
it seems to be something worth doing. For me the question is, if the
VFS can already do the deferring, is there a good reason for a
filesystem to duplicate that functionality?

--
Daniel

2000-12-31 19:15:57

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sun, 31 Dec 2000, Daniel Phillips wrote:
>
> It's not that hard or inefficient to return the ENOSPC from the usual
> point. For example, make a gross overestimate of the space needed for
> the write, compare to a cached filesystem free space value less the
> amount deferred so far, and fail to take the optimization if it looks
> even close.

Let me repeat myself one more time:

I do not believe that "get_block()" is as big of a problem as people make
it out to be.

And more importantly:

I strongly believe that trying to be clever is detrimental to your
health.

The "clever" approach is to add tons of complexity, have various
heuristics to try to not overflow, and then try to debug it considering
that the ENOSPC case is actually rather rare.

The "intelligent" approach is just to say that if get_block() shows up on
the performance profiles, then it should be optimized.

I'd rather be intelligent than clever. Optimize get_block(), which in the
case of ext2 seems to be mostly ext2_new_block() and the balloc.c mess.

The argument that Andrea had is bogus: the common case for writes (and
writes is the only part that deferred writing would touch) is re-writing
the whole file, and the IO to look up the metadata is never an issue for
that case. Everything is basically cached and created on-the-fly. IO is
not the issue, being good about new block allocation _is_ the issue.

Don't get me wrong: I like the notion of deferred writes. But I'm also
very pragmatic: I have not heard of a really good argument that makes it
obvious that deferred writes is a major win performance-wise that would
make it worth the complexity.

One form of deferred writes I _do_ like is the mount-time-option form.
Because that one doesn't add complexity. Kind of like the "noatime" mount
option - it can be worth it under some circumstances, and sometimes it's
acceptable to not get 100% unix semantics - at which point deferred writes
have none of the disadvantages of trying to be clever.

Linus

2000-12-31 19:43:47

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

Linus Torvalds wrote:
> I do not believe that "get_block()" is as big of a problem as people make
> it out to be.

I didn't mention get_block - disk accesses obviously far outweigh
filesystem cpu/cache usage in overall impact. The question is, what
happens to disk access patterns when we do the deferred allocation.

> One form of deferred writes I _do_ like is the mount-time-option form.
> Because that one doesn't add complexity. Kind of like the "noatime" mount
> option - it can be worth it under some circumstances, and sometimes it's
> acceptable to not get 100% unix semantics - at which point deferred writes
> have none of the disadvantages of trying to be clever.

And the added attraction of requiring almost no effort.

--
Daniel

2000-12-31 20:02:22

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sun, 31 Dec 2000, Daniel Phillips wrote:

> Linus Torvalds wrote:
> > I do not believe that "get_block()" is as big of a problem as people make
> > it out to be.
>
> I didn't mention get_block - disk accesses obviously far outweigh
> filesystem cpu/cache usage in overall impact. The question is, what
> happens to disk access patterns when we do the deferred allocation.

Note that the deferred allocation is only possible with a full page write.

Go and do statistics on a system of how often this happens, and what the
circumstances are. Just for fun.

I will bet you $5 USD that 99.9% of all such writes are to new files, at
the end of the file. I'm sure you can come up with other usage patterns,
but they'll be special (big databases etc, and I bet that they'll want to
have stuff cached all the time anyway for other reasons).

So I seriously doubt that you'll have much of an IO component to the
writing anyway - except for the "normal" deferred write of actually
writing the stuff out at all.

Now, this is where I agree with you, but I disagree with where most of the
discussion has been going: I _do_ believe that we may want to change block
allocation discussions at write-out-time. That makes sense to me. But that
doesn't really impact "ENOSPC" - the write would not be really "deferred"
by the VM layer, and the filesystem would always be aware of the writer
synchronously.

> > One form of deferred writes I _do_ like is the mount-time-option form.
> > Because that one doesn't add complexity. Kind of like the "noatime" mount
> > option - it can be worth it under some circumstances, and sometimes it's
> > acceptable to not get 100% unix semantics - at which point deferred writes
> > have none of the disadvantages of trying to be clever.
>
> And the added attraction of requiring almost no effort.

Did I mention my belief in the true meaning of "intelligence"?

"Intelligence is the ability to avoid doing work, yet get the work done".

Lazy programmers are the best programmers. Think Tom Sawyer painting the
fence. That's intelligence.

Requireing almost no effort is a big plus in my book.

It's the "clever" programmer I'm afraid of. The one who isn't afraid of
generating complexity, because he has a Plan (capital "P"), and he knows
he can work out the details later.

Linus

2000-12-31 21:39:46

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

Hi,

On Sun, 31 Dec 2000, Linus Torvalds wrote:

> Let me repeat myself one more time:
>
> I do not believe that "get_block()" is as big of a problem as people make
> it out to be.

The real problem is that get_block() doesn't scale and it's very hard to
do. A recursive per inode-semaphore might help, but it's still a pain to
get it right.

bye, Roman

2000-12-31 22:03:25

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

On Sun, 31 Dec 2000, Roman Zippel wrote:
>
> On Sun, 31 Dec 2000, Linus Torvalds wrote:
>
> > Let me repeat myself one more time:
> >
> > I do not believe that "get_block()" is as big of a problem as people make
> > it out to be.
>
> The real problem is that get_block() doesn't scale and it's very hard to
> do. A recursive per inode-semaphore might help, but it's still a pain to
> get it right.

Not true.

There's nothing unscalable in get_block() per se. The only lock we hold is
the per-page lock, which we must hold anyway. get_block() itself does not
need any real locking: you can do it with a simple per-inode spinlock if
you want to (and release the spinlock and try again if you need to fetch
non-cached data blocks).

Sure, the current ext2 _implementation_ sucks. Nobody has ever contested
that.

Stop re-designing something just because you want to.

Linus

2001-01-01 03:29:34

[permalink] [raw]

Subject: Re: [RFC] Generic deferred file writing

Hi,

On Sun, 31 Dec 2000, Linus Torvalds wrote:

> cached_allocation = NULL;
>
> repeat:
> spin_lock();
> result = try_to_find_existing();
> if (!result) {
> if (!cached_allocation) {
> spin_unlock();
> cached_allocation = allocate_block();
> goto repeat;
> }
> result = cached_allocation;
> add_to_datastructures(result);
> }
> spin_unlock();
> return result;
>
> This is quite standard, and Linux does it in many places. It doesn't have
> to be complex or ugly.

No problem with that.

> Also, I don't see why you claim the current get_block() is recursive and
> hard to use: it obviously isn't. If you look at the current ext2
> get_block(), the way it protects most of its data structures is by the
> super-block-global lock. That wouldn't work if your claims of recursive
> invocation were true.

I just rechecked that, but I don't see no superblock lock here, it uses
the kernel_lock instead. Although Al could give the definitive answer for
this, he wrote it. :)

> The way the Linux MM works, if the lower levels need to do buffer
> allocations, they will use GFP_BUFFER (which "bread()" does internally),
> which will mean that the VM layer will _not_ call down recursively to
> write stuff out while it's already trying to write something else. This is
> exactly so that filesystems don't have to release and re-try if they don't
> want to.
>
> In short, I don't see any of your arguments.

Then I must have misunderstood Al. Al?
If you were right here, I would see absolutely no reason for the current
complexity. (Me is a bit confused here.)

bye, Roman

2001-04-21 20:07:05

[permalink] [raw]

Subject: Races in affs_unlink(), affs_rmdir() and affs_rename()

mkdir /A
mkdir /B
mkdir /C
touch /A/a
ln /A/a /B/b
ln /A/a /C/c
rm /A/a &
rm /B/b

can corrupt filesystem. Scenario:

unlink("/B/b") locks /B, removes "b" and unlocks /B. Then it calls
affs_remove_link(), which blocks.

unlink("/A/a") locks /A, removes "a" and unlocks /A. Then it calls
affs_remove_link(). Which locks /B, renames removed entry into "b",
removes old "b" and inserts renamed "a" into /B.

The rest is irrelevant - we're already in it.

Similar race exists between unlink() and rename();

mkdir /A
mkdir /B
mkdir /C
touch /A/a
touch /B/a
ln /A/a /B/b
ln /A/a /C/c
rm /A/a &
mv /B/a /B/b
- similar scenario, different source of affs_remove_header().

Another one: unlink() and rmdir():
mkdir /A
mkdir /B
touch /A/a
ln /A/a /B/a
rm /A/a &
rmdir /B

Since you don't lock /B for affs_empty_dir(), you can hit the
window between removing old /B/a and inserting renamed /A/a into /B.
Notice that VFS _does_ lock /B (->i_zombie), but affs_remove_link()
for /A/a doesn't even look at it.

Same thing for rename()/rmdir() (rmdir victim contains a link to rename
target, apply the previous scenario).
Al

2001-04-21 22:16:38

[permalink] [raw]

Subject: Re: Races in affs_unlink(), affs_rmdir() and affs_rename()

Hi,

Alexander Viro wrote:

> unlink("/B/b") locks /B, removes "b" and unlocks /B. Then it calls
> affs_remove_link(), which blocks.
>
> unlink("/A/a") locks /A, removes "a" and unlocks /A. Then it calls
> affs_remove_link(). Which locks /B, renames removed entry into "b",
> removes old "b" and inserts renamed "a" into /B.
>
> The rest is irrelevant - we're already in it.

Thanks for finding that one, but it should be easy to fix. I can remove
the parent pointer in aff_remove_hash and check for that before I try to
rename that entry.

> Since you don't lock /B for affs_empty_dir(), you can hit the
> window between removing old /B/a and inserting renamed /A/a into /B.
> Notice that VFS _does_ lock /B (->i_zombie), but affs_remove_link()
> for /A/a doesn't even look at it.

I thought about that one and I know it should be locked. The reason I
don't do right now is, that affs supports hardlinks to dirs. The problem
are especially recursive links, e.g.:

mkdir A
ln A A/B
rm A/B

This is possible with affs, but will already deadlock in vfs.

mkdir A
mkdir A/B
ln A A/B/C
rm A/B/C/A &
rm A/B/C &
rm A/B

Every rm already takes the hash lock of the parent and then I can't
simply also take the hash lock of the dir itself. What I actually want
to do is to insert a reverse is_subdir() check before taking the lock.
On the other hand I was thinking whether I should allow links to dirs at
all and just show them as empty/readonly dirs. For 2.4 that's probably
safer, as it would require a lot of locking changes in vfs and the other
fs to support this properly, particularly moving most of the locking
from vfs into the fs.

bye, Roman

2001-04-22 05:54:12

[permalink] [raw]

Subject: Re: Races in affs_unlink(), affs_rmdir() and affs_rename()

On Sun, 22 Apr 2001, Roman Zippel wrote:

> This is possible with affs, but will already deadlock in vfs.
>
> mkdir A
> mkdir A/B
> ln A A/B/C
> rm A/B/C/A &
> rm A/B/C &
> rm A/B
>
> Every rm already takes the hash lock of the parent and then I can't
> simply also take the hash lock of the dir itself. What I actually want
> to do is to insert a reverse is_subdir() check before taking the lock.
> On the other hand I was thinking whether I should allow links to dirs at
> all and just show them as empty/readonly dirs. For 2.4 that's probably
> safer, as it would require a lot of locking changes in vfs and the other
> fs to support this properly, particularly moving most of the locking
> from vfs into the fs.

And all that to support a misfeature present in one legacy filesystem?
Don't forget that find et.al. are going to die on loops in directory
tree, so you'd also need to change large chunk of userland code.

By the way, how would you detect the attempts to detach a subtree by
rmdir()/rename() with the multiple links on directories? Again, forget about
the VFS side of that business, the question is how to check that
required change doesn't make on-disk data structures inconsistent.

As far I know even native (AmigaOS) implementation doesn't handle
directory links correctly. So I don't see much point in allowing them
even for compatibility purposes.

Seriously, screw the directory links - getting the thing work correctly
without them has much higher priority.

2001-04-22 12:58:07

[permalink] [raw]

Subject: Re: Races in affs_unlink(), affs_rmdir() and affs_rename()

Hi,

Alexander Viro wrote:

> > all and just show them as empty/readonly dirs. For 2.4 that's probably
> > safer, as it would require a lot of locking changes in vfs and the other
> > fs to support this properly, particularly moving most of the locking
> > from vfs into the fs.
>
> And all that to support a misfeature present in one legacy filesystem?
> Don't forget that find et.al. are going to die on loops in directory
> tree, so you'd also need to change large chunk of userland code.

It's not the only reason, but right now I can't even offer it as a mount
option. On the other hand I could also export them as symlinks.
Anyway, the major reason I'd like to see it is, that it IMO could
simplify locking. All possible locks don't need to be taken in advance,
instead they can be taken when needed. Also unlink/rename of files/dirs
are no specially cases anymore (at least locking wise).
VFS would just operate on dentries and the fs works with the inodes.
With affs I tried to show how it could look on the fs side.

> By the way, how would you detect the attempts to detach a subtree by
> rmdir()/rename() with the multiple links on directories? Again, forget about
> the VFS side of that business, the question is how to check that
> required change doesn't make on-disk data structures inconsistent.

Do you have an example? At the affs side there is no big difference
between link to files or dirs.

bye, Roman

2001-04-22 13:15:40